Data Mining - Mehmed Kantardzic [269]
A.4.2 Clustering
Bag of Words Data Set.
http://archive.ics.uci.edu/ml/datasets/Bag+of+Words
Word counts have been extracted from five document sources: Enron Emails, NIPS full papers, KOS blog entries, NYTimes news articles, and PubMed abstracts. The task is to cluster the documents in this data set based on their word counts. The resulting clusters can then be compared with the source from which each document came.
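One simple way to compare output clusters with the document sources is cluster purity: the fraction of documents that share the majority source of their cluster. The following is a minimal sketch under stated assumptions; the function name, the cluster assignments, and the toy labels are illustrative and are not part of the data set.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, source_labels):
    """Return the fraction of documents that match their cluster's majority source."""
    if len(cluster_ids) != len(source_labels):
        raise ValueError("cluster_ids and source_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one document is required")

    # Count how many documents from each source fall into each cluster.
    by_cluster = defaultdict(Counter)
    for cluster, source in zip(cluster_ids, source_labels):
        by_cluster[cluster][source] += 1

    # For each cluster, only documents from its most common source count as "pure".
    majority_total = sum(
        counts.most_common(1)[0][1] for counts in by_cluster.values()
    )
    return majority_total / len(cluster_ids)


# Toy assignments (hypothetical, for illustration only).
clusters = [0, 0, 0, 1, 1, 2]
sources = ["enron", "enron", "nips", "kos", "kos", "nytimes"]
purity = cluster_purity(clusters, sources)
print(f"purity = {purity:.3f}")
```

Purity rewards many small clusters, so it is best read alongside the number of clusters produced.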
US Census Data (1990) Data Set.
http://archive.ics.uci.edu/ml/datasets/US+Census+Data+%281990%29
This data set is a one percent sample from the 1990 Public Use Microdata Samples (PUMS). It contains 2,458,285 records and 68 attributes.
A.4.3 Regression
Auto MPG Data Set.
http://archive.ics.uci.edu/ml/datasets/Auto+MPG
This data set provides a number of attributes of cars that can be used to attempt to predict the “city-cycle fuel consumption in miles per gallon.” There are 398 data points and eight attributes.
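As a starting point for this kind of prediction task, one could fit a one-variable least-squares line. The sketch below assumes a single numeric predictor (vehicle weight is used as an example); the numbers are made up for illustration and are not taken from the Auto MPG data set.

```python
def fit_simple_linear_regression(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical (weight, mpg) pairs, NOT real records from the data set.
weights = [2000.0, 2500.0, 3000.0, 3500.0, 4000.0]
mpg = [34.0, 29.0, 24.0, 19.0, 14.0]

slope, intercept = fit_simple_linear_regression(weights, mpg)
predicted = slope * 3200.0 + intercept  # prediction for an unseen weight
print(f"slope={slope:.4f}, intercept={intercept:.2f}, predicted={predicted:.2f}")
```

With the real data, a multivariate model over several attributes would be the natural next step.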
Computer Hardware Data Set.
http://archive.ics.uci.edu/ml/datasets/Computer+Hardware
This data set provides a number of CPU attributes that can be used to predict relative CPU performance. It contains 209 data points and 10 attributes.
A.4.4 Web Mining
Anonymous Microsoft Web Data.
http://archive.ics.uci.edu/ml/datasets/Anonymous+Microsoft+Web+Data
This data set contains page visits for a number of anonymous users who visited www.microsoft.com. The task is to predict future categories of pages a user will visit based on the Web pages previously visited.
KDD Cup 2000.
http://www.sigkdd.org
This Web site contains five tasks used in KDD Cup, a data-mining competition that is run yearly. KDD Cup 2000 uses clickstream and purchase data obtained from Gazelle.com. Gazelle.com sold legwear and legcare products and closed its online store that same year. The Web site links to papers and posters by the winners of the various tasks and outlines their effective methods. The task descriptions also provide great insight into original approaches to using data mining with clickstream data.
A.4.5 Text Mining
Reuters-21578 Text Categorization Collection.
http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
This is a collection of news articles that appeared on the Reuters newswire in 1987. All of the news articles have been categorized. The categorization provides opportunities to test text classification or clustering methodologies.
20 Newsgroups.
http://people.csail.mit.edu/jrennie/20Newsgroups/
The 20 Newsgroups data set contains 20,000 newsgroup documents. These documents are divided nearly evenly among 20 different newsgroups. Similar to the Reuters collection, this data set provides opportunities for text classification and clustering.
A.4.6 Time Series
Dodgers Loop Sensor Data Set.
http://archive.ics.uci.edu/ml/datasets/Dodgers+Loop+Sensor
This data set provides the number of cars counted by a sensor every 5 min over 25 weeks. The sensor was located at the Glendale on-ramp for the 101 North Freeway in Los Angeles. The goal of this data set is to “predict the presence of a baseball game at Dodgers stadium.”
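Counts taken every 5 min are often aggregated to a coarser resolution before modeling. The sketch below sums twelve consecutive readings into hourly totals and computes how many readings a complete 25-week series would contain; the function name and the sample counts are hypothetical.

```python
READINGS_PER_HOUR = 60 // 5  # one reading every 5 minutes


def hourly_totals(five_minute_counts):
    """Sum consecutive 5-minute counts into hourly totals.

    A trailing partial hour is kept and summed over the readings available.
    """
    return [
        sum(five_minute_counts[start:start + READINGS_PER_HOUR])
        for start in range(0, len(five_minute_counts), READINGS_PER_HOUR)
    ]


# Number of readings in a complete series: 25 weeks x 7 days x 24 hours x 12.
expected_readings = 25 * 7 * 24 * READINGS_PER_HOUR

# Made-up counts: two full hours followed by half an hour.
counts = [10] * 12 + [20] * 12 + [5] * 6
totals = hourly_totals(counts)
print(expected_readings, totals)
```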
Australia Gun Deaths.
http://robjhyndman.com/TSDL/crime.html
These data give the yearly death rates in Australia for gun-related and non-gun-related homicides and suicides for the years 1915–2004.
A.4.7 Data for Association Rule Mining
BMS-POS.
http://www.sigkdd.org/kddcup
This data set gives the category for each product purchased from a large electronics retailer. It covers several years' worth of point-of-sale data. This data set contains 515,597 transactions and 1,657 distinct items.
BMS-WebView1.
http://www.sigkdd.org/kddcup
This data set contains several months of clickstream sessions for Gazelle.com. A transaction is defined in this data set as the detail pages viewed per session. This data set contains 59,602 transactions and 497 distinct items.
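The transaction definition above (the set of detail pages viewed in one session) can be sketched as follows. The input layout of (session ID, page ID) pairs and all identifiers are assumptions made for illustration; they do not describe the actual file format of the data set.

```python
from collections import Counter, defaultdict


def sessions_to_transactions(clicks):
    """Group (session_id, page_id) click records into one item set per session."""
    pages_by_session = defaultdict(set)
    for session_id, page_id in clicks:
        pages_by_session[session_id].add(page_id)  # repeated views count once
    return dict(pages_by_session)


def item_support(transactions):
    """Return the fraction of transactions that contain each item."""
    counts = Counter()
    for items in transactions.values():
        counts.update(items)
    total = len(transactions)
    return {item: count / total for item, count in counts.items()}


# Hypothetical click log: session s1 views page p10 twice.
clicks = [
    ("s1", "p10"), ("s1", "p20"), ("s1", "p10"),
    ("s2", "p10"),
    ("s3", "p30"), ("s3", "p20"),
]

transactions = sessions_to_transactions(clicks)
support = item_support(transactions)
print(transactions)
print(support)
```

Single-item support is the first quantity an association rule algorithm such as Apriori computes before it considers larger item sets.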
A.5 COMMERCIALLY AND PUBLICLY