Data Mining - Mehmed Kantardzic [195]
Figure 12.15. Traditional statistical analyses of time series. (a) Trend; (b) cycle; (c) seasonal; (d) outliers.
12.2.1 Temporal Data Representation
The representation of temporal data is especially important when dealing with large volumes of data since direct manipulation of continuous, high-dimensional data in an efficient way is extremely difficult. There are a few ways this problem can be addressed.
Original Data or with Minimal Preprocessing
Use data as they are without or with minimal preprocessing. We preserve the characteristics of each data point when a model is built. The main disadvantage of this process is that it is extremely inefficient to build data-mining models with millions of records of temporal data, all of them with different values.
Windowing and Piecewise Approximations
There is well-known psychological evidence that the human eye segments smooth curves into piecewise straight lines. Based on this theory, there are a number of algorithms for segmenting a curve that represents a time series. Figure 12.16 shows a simple example where it replaces the original nonlinear function with several piecewise linear functions. As shown in the figure, the initial real-valued elements (time series) are partitioned into several segments. Finding the required number of segments that best represent the original sequence is not trivial. An easy approach is to predefine the number of segments. A more realistic approach may be to define them when a change point is detected in the original sequence. Another technique based on the same idea segments a sequence by iteratively merging two similar segments. Segments to be merged are selected based on the squared-error minimization criteria. Even though these methods have the advantage of ability to reduce the impact of noise in the original sequence, when it comes to real-world applications (e.g., sequence matching) differences in amplitudes (scaling) and time-axis distortions are not addressed easily.
Figure 12.16. Simplified representation of temporal data. (a) Piecewise linear approximation; (b) piecewise aggregate approximation.
To overcome these drawbacks, Piecewise Aggregate Approximation (PAA) technique was introduced. It approximates the original data by segmenting the sequences into same-length sections and recording the mean value of these sections. A time-series C of length n is represented as C = {c1, c2, … , cn}. The task is to represent C as C′ in a w-dimensional space (w < n) by mean values of cis in w equal-sized segments. The ith element of C′ is calculated as a mean of all values in the segment:
For example, if the original sequence is C = {−2, −4, −3, −1, 0, 1, 2, 1, 1, 0}, where n = |C| = 10, and we decide to represent C in two sections of the same length, then
Usually, PAA is visualized as a linear combination of box bases functions as illustrated in Figure 12.16b, where a continuous function is replaced with 10 discrete averages for each interval.
A modified PAA algorithm, which is called Symbolic Aggregate Approximation (SAX), is proposed with the assumption that the original normalized time series, C, has a Gaussian distribution of PAA values. SAX defines “break points” in the Gaussian curve that will produce equal-sized areas below the curve. Formally, break points are a sorted list of numbers such that the areas under the Gaussian curve from to are equal to 1/α, and they are constant. α is a parameter of the methodology representing the number of intervals. These break points can be determined in a statistical table. For example, Figure 12.17 gives the break points for