Data Mining - Mehmed Kantardzic [269]

taken over a “digitized image of a fine needle aspirate (FNA) of a breast mass.” There are 569 samples. The task is to classify each data point as benign or malignant.

A.4.2 Clustering

Bag of Words Data Set.

http://archive.ics.uci.edu/ml/datasets/Bag+of+Words

Word counts have been extracted from five document sources: Enron Emails, NIPS full papers, KOS blog entries, NYTimes news articles and Pubmed abstracts. The task is to cluster the documents used in this data set based on the word counts found. One may compare the output clusters with the sources from which each document came.
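One simple way to compare output clusters with the document sources is cluster purity: each cluster is credited with its most frequent source, and the credited documents are divided by the total. The sketch below is only an illustration of that comparison, not a method prescribed by the data set; the function name `cluster_purity`, the toy cluster assignments, and the source labels are all hypothetical.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, source_labels):
    """Return the fraction of documents that belong to the majority
    source of the cluster they were assigned to."""
    if len(cluster_ids) != len(source_labels):
        raise ValueError("inputs must have the same length")
    if not cluster_ids:
        raise ValueError("inputs must be non-empty")

    # Count how many documents from each source fall into each cluster.
    by_cluster = defaultdict(Counter)
    for cluster_id, source in zip(cluster_ids, source_labels):
        by_cluster[cluster_id][source] += 1

    # Credit each cluster with the size of its largest source group.
    majority_total = sum(
        counts.most_common(1)[0][1] for counts in by_cluster.values()
    )
    return majority_total / len(cluster_ids)


# Hypothetical toy data: eight documents, three output clusters.
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
sources = ["enron", "enron", "nips", "kos", "kos",
           "nytimes", "pubmed", "nytimes"]
purity = cluster_purity(clusters, sources)
print(purity)
```

Purity rewards solutions with many small clusters, so it is best read alongside the number of clusters rather than on its own.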

US Census Data (1990) Data Set.

http://archive.ics.uci.edu/ml/datasets/US+Census+Data+%281990%29

This data set is a one percent sample from the 1990 Public Use Microdata Samples (PUMS). It contains 2,458,285 records and 68 attributes.

A.4.3 Regression

Auto MPG Data Set.

http://archive.ics.uci.edu/ml/datasets/Auto+MPG

This data set provides a number of attributes of cars that can be used to attempt to predict the “city-cycle fuel consumption in miles per gallon.” There are 398 data points and eight attributes.

Computer Hardware Data Set.

http://archive.ics.uci.edu/ml/datasets/Computer+Hardware

This data set provides a number of CPU attributes that can be used to predict relative CPU performance. It contains 209 data points and 10 attributes.

A.4.4 Web Mining

Anonymous Microsoft Web Data.

http://archive.ics.uci.edu/ml/datasets/Anonymous+Microsoft+Web+Data

This data set contains page visits for a number of anonymous users who visited www.microsoft.com. The task is to predict future categories of pages a user will visit based on the Web pages previously visited.

KDD Cup 2000.

http://www.sigkdd.org

This Web site contains five tasks used in KDD Cup, a data-mining competition run yearly. KDD Cup 2000 uses clickstream and purchase data obtained from Gazelle.com, which sold legwear and legcare products and closed its online store that same year. The Web site provides links to papers and posters by the winners of the various tasks and outlines their effective methods. Additionally, the task descriptions provide insight into original approaches to mining clickstream data.

A.4.5 Text Mining

Reuters-21578 Text Categorization Collection.

http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html

This is a collection of news articles that appeared on the Reuters newswire in 1987. All of the articles have been categorized, which provides opportunities to test text classification or clustering methodologies.

20 Newsgroups.

http://people.csail.mit.edu/jrennie/20Newsgroups/

The 20 Newsgroups data set contains 20,000 newsgroup documents. These documents are divided nearly evenly among 20 different newsgroups. Similar to the Reuters collection, this data set provides opportunities for text classification and clustering.

A.4.6 Time Series

Dodgers Loop Sensor Data Set.

http://archive.ics.uci.edu/ml/datasets/Dodgers+Loop+Sensor

This data set provides the number of cars counted by a sensor every 5 min over 25 weeks. The sensor was for the Glendale on-ramp for the 101 North Freeway in Los Angeles. The goal of this data set is to “predict the presence of a baseball game at Dodgers stadium.”

Australia Gun Deaths.

http://robjhyndman.com/TSDL/crime.html

These data give the yearly death rates in Australia for gun-related and non-gun-related homicides and suicides for the years 1915–2004.

A.4.7 Data for Association Rule Mining

BMS-POS.

http://www.sigkdd.org/kddcup

This data set gives the category for each product purchased from a large electronics retailer. It covers several years' worth of point-of-sale data. This data set contains 515,597 transactions and 1,657 distinct items.

BMS-WebView1.

http://www.sigkdd.org/kddcup

This data set contains several months of clickstream sessions for Gazelle.com. A transaction is defined in this data set as the detail pages viewed per session. This data set contains 59,602 transactions and 497 distinct items.
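The transaction definition above (the detail pages viewed per session) can be sketched as a grouping step followed by a support count, the basic quantity used in association rule mining. This is a minimal illustration under stated assumptions: the `(session_id, page_id)` record layout, the function names, and the toy page views are hypothetical and do not reflect the actual BMS-WebView1 file format, and repeated views of the same page within a session are collapsed into a single item.

```python
from collections import Counter, defaultdict


def sessions_to_transactions(page_views):
    """Group (session_id, page_id) records into one transaction per session.

    Assumption: a transaction is the set of distinct pages in a session,
    so repeated views of the same page count once.
    """
    pages_by_session = defaultdict(set)
    for session_id, page_id in page_views:
        pages_by_session[session_id].add(page_id)
    return [frozenset(pages) for pages in pages_by_session.values()]


def item_support(transactions):
    """Return the fraction of transactions that contain each item."""
    if not transactions:
        return {}
    counts = Counter(item for t in transactions for item in t)
    total = len(transactions)
    return {item: count / total for item, count in counts.items()}


# Hypothetical toy clickstream: four sessions, three detail pages.
views = [
    ("s1", "p10"), ("s1", "p12"), ("s1", "p10"),
    ("s2", "p10"),
    ("s3", "p12"), ("s3", "p7"),
    ("s4", "p10"),
]
transactions = sessions_to_transactions(views)
support = item_support(transactions)
print(support)
```

Items whose support falls below a chosen threshold would normally be discarded before candidate itemsets are generated.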

A.5 COMMERCIALLY AND PUBLICLY
