Data Mining_ Concepts and Techniques - Jiawei Han [352]
Biological sequences include DNA and protein sequences. Such sequences are typically very long, and carry important, complicated, but hidden semantic meaning. Here, gaps are usually important.
Let's look into data mining for each of these sequence data types.
Similarity Search in Time-Series Data
A time-series data set consists of sequences of numeric values obtained from repeated measurements over time. The values are typically measured at equal time intervals (e.g., every minute, hour, or day). Time-series databases are popular in many applications such as stock market analysis, economic and sales forecasting, budgetary analysis, utility studies, inventory studies, yield projections, workload projections, and process and quality control. They are also useful for studying natural phenomena (e.g., atmosphere, temperature, wind, earthquake), scientific and engineering experiments, and medical treatments.
Unlike normal database queries, which find data that match a given query exactly, a similarity search finds data sequences that differ only slightly from the given query sequence. Many time-series similarity queries require subsequence matching, that is, finding a set of sequences that contain subsequences that are similar to a given query sequence.
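To make subsequence matching concrete, here is a minimal brute-force sketch. It assumes Euclidean distance as the similarity measure and a user-chosen threshold `epsilon`; the function names and the sample data are hypothetical and chosen only for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def subsequence_matches(series, query, epsilon):
    """Return (offset, distance) for every window of `series`
    whose distance to `query` is at most `epsilon`."""
    m = len(query)
    matches = []
    for start in range(len(series) - m + 1):
        d = euclidean(series[start:start + m], query)
        if d <= epsilon:
            matches.append((start, d))
    return matches

# Illustrative data: the query shape occurs exactly at offset 1
# and approximately at offset 5.
series = [1.0, 2.0, 3.0, 2.0, 1.0, 2.1, 3.1, 1.9, 1.0]
query = [2.0, 3.0, 2.0]
matches = subsequence_matches(series, query, epsilon=0.5)
offsets = [start for start, _ in matches]
print(offsets)  # [1, 5]
```

This scan compares the query against every window, which is why practical systems usually reduce dimensionality and build an index first.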
For similarity search, it is often necessary to first perform data or dimensionality reduction and transformation of time-series data. Typical dimensionality reduction techniques include (1) the discrete Fourier transform (DFT), (2) discrete wavelet transforms (DWT), and (3) singular value decomposition (SVD) based on principal components analysis (PCA). Because we touched on these concepts in Chapter 3 and because a thorough explanation is beyond the scope of this book, we will not go into great detail here. With such techniques, the data or signal is mapped to a signal in a transformed space. A small subset of the “strongest” transformed coefficients is saved as features.
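As an illustration of the DFT option, the sketch below maps a series to a handful of Fourier coefficients and compares series in that reduced space. It makes two assumptions not stated in the text: the DFT is scaled by 1/sqrt(n) so that Euclidean distance is preserved, and the "strongest" coefficients are taken to be the first `k` (lowest-frequency) ones. All names and data are hypothetical.

```python
import cmath
import math

def dft(series):
    """Orthonormal DFT (scaled by 1/sqrt(n)), which preserves Euclidean distance."""
    n = len(series)
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(x * cmath.exp(-2j * math.pi * f * t / n)
                    for t, x in enumerate(series))
        for f in range(n)
    ]

def dft_features(series, k):
    """Keep only the first k coefficients as the feature vector (an assumption)."""
    return dft(series)[:k]

def distance(u, v):
    """Euclidean distance; works for real or complex vectors."""
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(u, v)))

a = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 2.0]
b = [1.0, 2.0, 4.0, 4.0, 3.0, 1.0, 1.0, 2.0]

true_dist = distance(a, b)
feature_dist = distance(dft_features(a, 2), dft_features(b, 2))
print(feature_dist <= true_dist)  # True
```

With this scaling, the distance computed on the truncated features never exceeds the true distance, so filtering on features cannot discard a true match. Candidates that pass the filter still have to be verified against the original series.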
These features form a feature space, which is a projection of the transformed space. Indices can be constructed on the original or transformed time-series data to speed up a search. For a query-based similarity search, techniques include normalization transformation, atomic matching (i.e., finding pairs of gap-free windows of a small length that are similar), window stitching (i.e., stitching similar windows to form pairs of large similar subsequences, allowing gaps between atomic matches), and subsequence ordering (i.e., linearly ordering the subsequence matches to determine whether enough similar pieces exist). Numerous software packages exist for a similarity search in time-series data.
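The normalization and atomic-matching steps can be sketched as follows. This is a simplified illustration, not a full implementation: it assumes each window is rescaled to the range [-1, 1] and that two windows match when every pair of corresponding points differs by at most `epsilon`. Window stitching and subsequence ordering are omitted, and all names are hypothetical.

```python
def normalize_window(window):
    """Rescale a window to [-1, 1] to remove offset and amplitude differences."""
    lo, hi = min(window), max(window)
    if hi == lo:
        return [0.0] * len(window)  # flat window: no shape to compare
    mid = (hi + lo) / 2
    half = (hi - lo) / 2
    return [(x - mid) / half for x in window]

def atomic_matches(s, t, w, epsilon):
    """Return (i, j) pairs where the gap-free windows s[i:i+w] and t[j:j+w]
    are similar after normalization."""
    s_windows = [normalize_window(s[i:i + w]) for i in range(len(s) - w + 1)]
    t_windows = [normalize_window(t[j:j + w]) for j in range(len(t) - w + 1)]
    pairs = []
    for i, a in enumerate(s_windows):
        for j, b in enumerate(t_windows):
            if max(abs(x - y) for x, y in zip(a, b)) <= epsilon:
                pairs.append((i, j))
    return pairs

# The two series have the same shape at different scales.
s = [10, 20, 30, 20, 10]
t = [1, 2, 3, 2, 1, 5]
pairs = atomic_matches(s, t, w=3, epsilon=0.05)
print(pairs)  # [(0, 0), (1, 1), (2, 2)]
```

A stitching step would then chain neighboring pairs such as (0, 0), (1, 1), and (2, 2) into one longer matching subsequence, tolerating gaps between them.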
Recently, researchers have proposed transforming time-series data into piecewise aggregate approximations so that the data can be viewed as a sequence of symbolic representations. The problem of similarity search is then transformed into one of matching subsequences in symbolic sequence data. We can identify motifs (i.e., frequently occurring sequential patterns) and build index or hashing mechanisms for an efficient search based on such motifs. Experiments show this approach is fast and simple, and has comparable search quality to that of DFT, DWT, and other dimensionality reduction methods.
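A minimal sketch of this symbolic transformation follows. The text does not specify the discretization, so the sketch assumes a common choice: z-normalize the series, average it over equal-width segments, and map each average to a letter using breakpoints that split the standard normal distribution into equiprobable regions. The function names, alphabet, and data are hypothetical, and the series length is assumed to be divisible by the number of segments.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev

def z_normalize(series):
    """Shift to zero mean and scale to unit standard deviation."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0] * len(series)
    return [(x - mu) / sigma for x in series]

def paa(series, segments):
    """Piecewise aggregate approximation: the mean of each equal-width segment."""
    n = len(series)
    if n % segments != 0:
        raise ValueError("series length must be divisible by segments")
    width = n // segments
    return [mean(series[i:i + width]) for i in range(0, n, width)]

def symbolize(values, alphabet="abcd"):
    """Map each value to a letter using equiprobable normal breakpoints."""
    size = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / size) for i in range(1, size)]
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in values)

series = [1, 1, 2, 2, 8, 8, 9, 9]
word = symbolize(paa(z_normalize(series), segments=4))
print(word)  # aadd
```

Once series are reduced to short words like this, similarity search and motif discovery can reuse string techniques such as hashing or indexing on substrings.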
Regression and Trend Analysis in Time-Series Data
Regression analysis of time-series data has been studied substantially in the fields of statistics and signal analysis. However, one may often need to go beyond pure regression analysis and perform trend analysis for many practical applications. Trend analysis builds an integrated model using the following four major components or movements to characterize time-series data:
1. Trend or long-term movements: These indicate the general direction in which a time-series graph is moving over time, for example, using weighted moving average and the least squares methods to find trend