Online Book Reader

Data Mining - Mehmed Kantardzic [175]

The SOM technique is chosen as the most appropriate for the problem of Web-page organization because of its strength not only in grouping data points into clusters, but also in graphically representing the relationships among clusters. The system starts with a Web-log file indicating the date, time, and address of the requested Web pages as well as the IP address of the user’s machine. The data are grouped into meaningful transactions or sessions, where a transaction is defined by a set of user-requested Web pages. We assume that there is a finite set of n unique URLs:

$$U = \{\mathrm{url}_1, \mathrm{url}_2, \ldots, \mathrm{url}_n\}$$

and a finite set of m user transactions:

$$T = \{t_1, t_2, \ldots, t_m\}$$

Each transaction is represented as a vector with binary values $u_i$:

$$t = [u_1, u_2, \ldots, u_n]$$

where

$$u_i = \begin{cases} 1 & \text{if } \mathrm{url}_i \in t \\ 0 & \text{otherwise} \end{cases}$$

Preprocessed log files can be represented as a binary matrix. One example is given in Table 11.1.

TABLE 11.1. Transactions Described by a Set of URLs
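As a minimal sketch of how such a binary matrix could be produced, the Python fragment below turns a list of transactions into one row per transaction and one column per URL. The sample URLs and the helper names `build_url_index` and `to_binary_vector` are hypothetical and chosen only for illustration; the step that groups raw log records into transactions is assumed to have been done already.

```python
def build_url_index(transactions):
    """Collect the unique URLs in first-seen order (url_1 ... url_n)."""
    urls = []
    seen = set()
    for transaction in transactions:
        for url in transaction:
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


def to_binary_vector(transaction, urls):
    """Return [u_1, ..., u_n], where u_i is 1 if url_i occurs in the transaction."""
    requested = set(transaction)
    return [1 if url in requested else 0 for url in urls]


# Hypothetical transactions: each one is the set of pages requested in a session.
transactions = [
    ["/index.html", "/products.html"],
    ["/index.html", "/contact.html"],
    ["/products.html", "/cart.html", "/index.html"],
]

urls = build_url_index(transactions)
matrix = [to_binary_vector(t, urls) for t in transactions]

print(urls)
for row in matrix:
    print(row)
```

Each printed row corresponds to one transaction, so the matrix has m rows and n columns; transposing it gives the URL-by-transaction view.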

Since the dimensions of a table (n × m) for real-world applications would be very large, especially as input data to SOM, a reduction is necessary. By using the K-means clustering algorithm, it is possible to cluster transactions into a prespecified number k (k ≪ m) of transaction groups. An example of a transformed table with a new, reduced data set is represented in Table 11.2, where the elements in the rows represent the total number of times a group accessed a particular URL (the form of the table and values are only one illustration, and they are not directly connected with the values in Table 11.1).

TABLE 11.2. Representing URLs as Vectors of Transaction Group Activity
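The reduction step can be sketched as follows. This is a plain Lloyd-style K-means written with the standard library only, followed by an aggregation that counts, for every URL, how many transactions in each group requested it. The random initialization, the fixed iteration count, the sample rows, and the function names are all assumptions made for this illustration and are not taken from the text.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iterations=20, seed=0):
    """Naive K-means; returns one group label (0..k-1) per transaction row."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: each row joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # an empty group keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def group_activity(rows, labels, k):
    """Return an n x k table: entry [i][g] counts accesses of URL i by group g."""
    n = len(rows[0])
    table = [[0] * k for _ in range(n)]
    for row, g in zip(rows, labels):
        for i, u in enumerate(row):
            table[i][g] += u
    return table


# Hypothetical binary transaction matrix (m = 5 transactions, n = 4 URLs).
rows = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 1],
]
k = 2
labels = kmeans(rows, k)
table = group_activity(rows, labels, k)

for i, url_row in enumerate(table):
    print(f"url_{i + 1}: {url_row}")
```

Each row of `table` is the vector that describes one URL by transaction-group activity, which is the shape of input the SOM stage expects. A production system would more likely rely on a tested clustering library and a better initialization than this sketch uses.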

The new, reduced table is the input for SOM processing. Details about the application of SOM as a clustering technique and the settings of its parameters are given in the previous chapter. We will explain only the final results and their interpretation in terms of Web-page analysis. Each URL will be mapped onto a SOM based on its similarity with other URLs in terms of user usage or, more precisely, according to users’ navigation patterns (transaction group “weights” in Table 11.2). Suppose that the SOM is a 2-D map with p × p nodes, where p × p ≥ n; a typical result of SOM processing is then given in Table 11.3. The dimensions and values in the table are not the results of any computation with values in Tables 11.1 and 11.2, but a typical illustration of the SOM’s final presentation.

TABLE 11.3. A Typical SOM Generated by the Description of URLs

The SOM organizes Web pages into similar classes based on users’ navigation patterns. The blank nodes in the table show that there are no corresponding URLs, while the numbered nodes indicate the number of URLs contained within each node (or within each class). The distance on the map indicates the similarity of the Web pages measured by the user-navigation patterns. For example, the number 54 in the last row shows that 54 Web pages are grouped in the same class because they have been accessed by similar types of people, as indicated by their transaction patterns. Similarity here is measured not by similarity of content but by similarity of usage. Therefore, the organization of the Web documents in this graphical representation is based solely on the users’ navigation behavior.
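To make the map of counts concrete, the sketch below trains a very small SOM on URL vectors and then counts how many URLs land on each node. It is an illustration under stated assumptions, not the LOGSOM implementation: the linear decay of the learning rate, the Gaussian neighbourhood, the number of epochs, the scaling of inputs to the range 0 to 1, and the sample activity values are all choices made here for the example.

```python
import math
import random


def best_matching_node(weights, vector):
    """Return (row, col) of the node whose weight vector is closest to `vector`."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((a - b) ** 2 for a, b in zip(w, vector))
            if d < best_dist:
                best, best_dist = (r, c), d
    return best


def train_som(vectors, p, epochs=50, seed=0):
    """Train a p x p SOM; returns the grid of node weight vectors."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(p)]
               for _ in range(p)]
    for epoch in range(epochs):
        frac = epoch / epochs
        rate = 0.5 * (1.0 - frac)                  # learning rate decays linearly
        radius = max(p / 2.0 * (1.0 - frac), 0.5)  # neighbourhood shrinks over time
        for v in vectors:
            br, bc = best_matching_node(weights, v)
            for r in range(p):
                for c in range(p):
                    grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
                    influence = math.exp(-grid_dist2 / (2.0 * radius ** 2))
                    w = weights[r][c]
                    for i in range(dim):
                        w[i] += rate * influence * (v[i] - w[i])
    return weights


def count_urls_per_node(weights, vectors):
    """Count how many URL vectors map to each node of the trained map."""
    p = len(weights)
    counts = [[0] * p for _ in range(p)]
    for v in vectors:
        r, c = best_matching_node(weights, v)
        counts[r][c] += 1
    return counts


# Hypothetical group-activity vectors: 6 URLs described by 3 transaction groups.
raw = [[15, 0, 2], [14, 1, 3], [0, 12, 9], [1, 11, 10], [5, 5, 5], [15, 0, 2]]
scale = max(max(v) for v in raw)
vectors = [[x / scale for x in v] for v in raw]  # scale inputs to [0, 1]

p = 3  # p x p = 9 nodes, which satisfies p x p >= n for n = 6 URLs
weights = train_som(vectors, p)
counts = count_urls_per_node(weights, vectors)

for row in counts:
    print(" ".join(f"{n:2d}" if n else " ." for n in row))
```

In the printed grid a dot stands for a blank node and a number is the count of URLs assigned to that node. URLs with identical usage vectors always fall on the same node, and URLs with similar vectors tend to fall on the same or neighbouring nodes.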

What are the possible applications of the LOGSOM methodology? The ability to identify which Web pages are being accessed by a company’s potential customers gives the company information to make improved decisions. If one Web page within a node successfully refers clients to the desired information or desired page, the other pages in the same node are likely to be successful as well. Instead of subjectively deciding where to place an Internet advertisement, the company can now decide objectively, supported directly by the user-navigation patterns.

11.4 MINING PATH–TRAVERSAL PATTERNS


Before improving a company’s Web site, we need a way of evaluating its current usage. Ideally, we would like to evaluate a site based on the data automatically recorded on it. Each site is electronically administered by a Web server, which logs all activities that take place in it in a file called a Web-server
