Online Book Reader

Home Category

Choose a category
All
Classic-Fiction

Data Mining - Mehmed Kantardzic [197]

By Root 739 0

his or her own language with complex patterns in terms of primitives. More interestingly, it performs well on approximate matching, where the user cares only about the overall shape of the sequence but not about specific details. The first step in the representation process is defining the alphabet of symbols and then translating the initial sequence to a sequence of symbols. The translation is done by considering sample-to-sample transitions, and then assigning a symbol of the described alphabet to each transition.

A significantly different approach is to convert a sequence into discrete representation by using clustering. A sliding window of width w is used to generate subsequences from the original sequence. These subsequences are then clustered, considering the pattern similarity between subsequences, using a suitable clustering method, for example, the k-nearest neighbor method. A different symbol is then assigned to each cluster. The discrete version of the time series is obtained by using cluster identities corresponding to the subsequence. For example, the original time sequence is defined with integer values given in time: (1, 2, 3, 2, 3, 4, 3, 4, 3, 4, 5, 4, 5) as represented in Figure 12.19a. The Window width is defined by three consecutive values, and samples of primitives are collected through the time series. After simplified clustering, the final set of three “frequent” primitive shapes, representing cluster centroids, is given in Figure 12.19b. Assigning symbolic representation for these shapes, a1, a2, and a3, the final symbolic representation of the series will be (a3, a2, a1, a1, a3, a2).

Figure 12.19. Clustering approach in primitive shapes discovery. (a) Time series; (b) primitive shapes after clustering.

12.2.2 Similarity Measures between Sequences

The individual elements of the sequences may be vectors of real numbers (e.g., in applications involving speech or audio signals) or they may be symbolic data (e.g., in applications involving gene sequences). After representing each sequence in a suitable form, it is necessary to define a similarity measure between sequences, in order to determine if they match. Given two sequences T1 and T2, we need to define an appropriate similarity function Sim, which calculates the closeness of the two sequences, denoted by Sim(T1, T2). Usually, the similarity measure is expressed in terms of inverse distance measure, and for various types of sequences and applications, we have numerous distance measures. An important issue in measuring similarity between two sequences is the ability to deal with outlying points, noise in the data, amplitude differences causing scaling problems, and the existence of gaps and other time-distortion problems. The most straightforward distance measure for time series is the Euclidean distance and its variants, based on the common Lp-norms. It is used in time-domain continuous representations by viewing each sub-sequence with n discrete values as a point in Rn. Besides being relatively straightforward and intuitive, Euclidean distance and its variants have several other advantages. The complexity of evaluating these measures is linear; they are easy to implement, indexable with any access method, and, in addition, are parameter-free. Furthermore, the Euclidean distance is surprisingly competitive with other more complex approaches, especially if the size of the training set/database is relatively large. However, since the mapping between the points of two time series is fixed, these distance measures are very sensitive to noise and misalignments in time, and are unable to handle local-time shifting, that is, similar segments that are out of phase.

When a sequence is represented as a sequence of discrete symbols of an alphabet, the similarity between two sequences is achieved most of the time by comparing each element of one sequence with the corresponding one in the other sequence. The best known such distance is the longest common subsequence (LCS) similarity, utilizing the search for the LCS in both sequences we are comparing, and

Online Book Reader

Data Mining - Mehmed Kantardzic [197]

®Online Book Reader