Data Mining - Mehmed Kantardzic [29]
and if the window for analysis of the time-series is five, then it is possible to reorganize the input data into a tabular form with six samples, which is more convenient (standardized) for the application of data-mining techniques. Transformed data are given in Table 2.1.
TABLE 2.1. Transformation of Time Series to Standard Tabular Form (Window = 5)
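The windowing transformation can be sketched in a few lines of Python. This is a minimal illustration, not the book's own procedure: the function name `to_tabular` and the 11-value stand-in series are assumptions, chosen so that a window of five yields six samples as described in the text.

```python
def to_tabular(series, window):
    """Reorganize a 1-D time series into (inputs, output) samples.

    Each sample holds `window` consecutive values as inputs and the
    value that immediately follows them as the output to predict.
    """
    rows = []
    for start in range(len(series) - window):
        inputs = series[start:start + window]
        target = series[start + window]
        rows.append((inputs, target))
    return rows


# Hypothetical stand-in values for t(0), ..., t(10); not data from the text.
series = list(range(11))
rows = to_tabular(series, window=5)

for inputs, target in rows:
    print(inputs, "->", target)
```

With 11 values and a window of five, the first sample uses t(0) through t(4) to predict t(5), and the last uses t(5) through t(9) to predict t(10).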
The best time lag must be determined by the usual evaluation techniques for a varying complexity measure using independent test data. Instead of preparing the data once and turning them over to the data-mining programs for prediction, additional iterations of data preparation have to be performed. Although the typical goal is to predict the next value in time, in some applications, the goal can be modified to predict values in the future, several time units in advance. More formally, given the time-dependent values t(n − i), … , t(n), it is necessary to predict the value t(n + j). In the previous example, taking j = 3, the new samples are given in Table 2.2.
TABLE 2.2. Time-Series Samples in Standard Tabular Form (Window = 5) with Postponed Predictions (j = 3)
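A postponed prediction changes only which value is taken as the output. The sketch below assumes the same kind of hypothetical 11-value series; the function name `to_tabular_with_horizon` is illustrative and not from the text.

```python
def to_tabular_with_horizon(series, window, j):
    """Build samples whose output lies j time units after the last input.

    Inputs are `window` consecutive values ending at t(n);
    the output is t(n + j). With j = 1 this is the ordinary next-value case.
    """
    rows = []
    # Number of window positions that still leave room for t(n + j).
    usable = len(series) - window - j + 1
    for start in range(max(usable, 0)):
        n = start + window - 1          # index of the most recent input value
        rows.append((series[start:n + 1], series[n + j]))
    return rows


# Hypothetical stand-in values for t(0), ..., t(10); not data from the text.
series = list(range(11))
rows_j3 = to_tabular_with_horizon(series, window=5, j=3)

for inputs, target in rows_j3:
    print(inputs, "->", target)
```

Note that a larger j leaves fewer usable samples from the same series, because the last j − 1 values can no longer serve as the end of an input window.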
In general, the further into the future a prediction reaches, the more difficult and less reliable the forecast becomes. The goal for a time series can easily be changed from predicting the next value in the time series to classification into one of several predefined categories. From a data-preparation perspective, there are no significant changes. For example, instead of the predicted output value t(i + 1), the new classified output will be binary: T for t(i + 1) ≥ threshold value and F for t(i + 1) < threshold value.
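Switching from prediction to classification only relabels the output column. A minimal sketch, assuming samples are stored as (inputs, output) pairs; the sample values, the threshold, and the name `to_binary_targets` are hypothetical.

```python
def to_binary_targets(rows, threshold):
    """Replace each numeric output with 'T' (>= threshold) or 'F' (< threshold)."""
    return [
        (inputs, "T" if target >= threshold else "F")
        for inputs, target in rows
    ]


# Hypothetical (inputs, output) samples; not data from the text.
samples = [
    ([1, 2, 3], 4),
    ([2, 3, 4], 9),
    ([3, 4, 9], 6),
]
labeled = to_binary_targets(samples, threshold=6)
print(labeled)
```

The input columns are left untouched, which reflects the point that data preparation itself does not change.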
The time units can be relatively small, enlarging the number of artificial features in a tabular representation of time series for the same time period. The resulting problem of high dimensionality is the price paid for precision in the standard representation of the time-series data.
In practice, many older values of a feature may be historical relics that are no longer relevant and should not be used for analysis. In many business and social applications in particular, new trends can make old data less reliable and less useful. This leads to a greater emphasis on recent data, possibly discarding the oldest portions of the time series. We are now talking not only of a fixed window for the presentation of a time series but also of a fixed size for the data set. Only the n most recent cases are used for analysis, and, even then, they may not be given equal weight. These decisions must be given careful attention and are somewhat dependent on knowledge of the application and past experience. For example, using 20-year-old data about cancer patients will not give a correct picture of the chances of survival today.
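One possible way to keep only the n most recent cases and weight them unequally is sketched below. The text does not prescribe a weighting scheme; the exponential decay used here, the `decay` parameter, and the name `recent_weighted` are assumptions made purely for illustration.

```python
def recent_weighted(cases, n, decay=0.9):
    """Keep the n most recent cases and attach a weight to each.

    The newest case gets weight 1.0; each step back in time multiplies
    the weight by `decay` (an assumed scheme, not one given in the text).
    """
    recent = cases[-n:] if n > 0 else []
    k = len(recent)
    return [(case, decay ** (k - 1 - idx)) for idx, case in enumerate(recent)]


# Hypothetical cases ordered from oldest to newest.
cases = list(range(10))
weighted = recent_weighted(cases, n=3, decay=0.5)
print(weighted)  # [(7, 0.25), (8, 0.5), (9, 1.0)]
```

Both n and the decay rate would have to be chosen from knowledge of the application, as the paragraph above notes.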
Besides the standard tabular representation of time series, it is sometimes necessary to preprocess the raw data further and summarize their characteristics before applying data-mining techniques. It is often better to predict the difference t(n + 1) − t(n) instead of the absolute value t(n + 1) as the output. Also, using a ratio, t(n + 1)/t(n), which indicates the percentage change, can sometimes give better prediction results. These transformations of the predicted values of the output are particularly useful for logic-based data-mining methods such as decision trees or rules. When differences or ratios are used to specify the goal, features measuring differences or ratios for input features may also be advantageous.
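The difference and ratio transformations can be sketched as follows. The function names and the sample values are hypothetical, and the ratio version assumes that no value in the series is zero.

```python
def difference_targets(series):
    """Return t(n + 1) - t(n) for every consecutive pair of values."""
    return [series[n + 1] - series[n] for n in range(len(series) - 1)]


def ratio_targets(series):
    """Return t(n + 1) / t(n) for every consecutive pair of values.

    Assumes no value in the series is zero; otherwise the division fails.
    """
    return [series[n + 1] / series[n] for n in range(len(series) - 1)]


# Hypothetical values; not data from the text.
values = [2.0, 4.0, 3.0]
diffs = difference_targets(values)   # [2.0, -1.0]
ratios = ratio_targets(values)       # [2.0, 0.75]
print(diffs, ratios)
```

The same two functions could be applied to input features when the goal itself is expressed as a difference or ratio.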
Time-dependent cases are specified in terms of a goal and a time lag or a window of size m. One way of summarizing features in the data set is to average them, producing a moving average (MA). A single average summarizes the most recent m feature values for each case, and for each increment in time i, its value is
MA(i, m) = (1/m) · Σ_{j = i − m + 1}^{i} t(j)
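Read as code, the average of the most recent m values at each time step could be computed as in the sketch below. The name `moving_average` and the sample series are assumptions; the first m − 1 time steps are skipped because they do not yet have m values behind them.

```python
def moving_average(series, m):
    """Return the mean of the most recent m values at each time step i.

    Output element k corresponds to time index i = k + m - 1, i.e. the
    first index for which m values are available.
    """
    if m <= 0:
        raise ValueError("window size m must be positive")
    return [
        sum(series[i - m + 1:i + 1]) / m
        for i in range(m - 1, len(series))
    ]


# Hypothetical values; not data from the text.
values = [1, 2, 3, 4, 5]
ma = moving_average(values, m=3)
print(ma)  # [2.0, 3.0, 4.0]
```

A larger m gives a smoother summary but reacts more slowly to recent changes, which is one reason the choice of m depends on the application.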
Knowledge of the application can aid in specifying reasonable sizes for m. Error estimation should