datasets
datasets
data to be used for machine learning exercises.
-
admit_data.txt,
the English SAT score, the Math SAT score, and admit/not admit (1/0),
for 100 students applying to a particular college,
100x3 items
-
aircon_data.txt,
humidity, temperature and comfort (1/0),
44x3 items
-
album_data.txt,
sales of music albums, column 1 is year (2007-2017)
and column 2 is the number of sales (LP's, cassettes, CD's
and downloads),
11x2 items
-
ambulance_data.txt,
125x2 items
-
anscombe1_data.txt,
11x2 items
-
anscombe2_data.txt,
11x2 items
-
anscombe3_data.txt,
11x2 items
-
anscombe4_data.txt,
11x2 items
-
apples_and_oranges.csv,
There are three fields, weight (gm), size (cm), class (apple/orange).
header line plus
40x3 items
-
bankloan1_data.csv,
bank loan data #1, from Kelleher, MacNamee, D'Arcy.
There is an initial header line.
There are 10 records;
There are 5 fields: ID, Occupation, Age(years), Ratio, Outcome;
header line, plus
10x5 items
-
bankloan2_data.csv,
bank loan data #2, from Kelleher, MacNamee, D'Arcy.
There is an initial header line.
There are 25 records;
There are 9 fields: ID, Amount($), Salary($), Ratio, Age, Occupation,
Property, Type, Outcome;
header line, plus
25x9 items
-
basketball_data.csv,
basketball player data, from Kelleher, MacNamee, D'Arcy.
There are 30 records;
There are 5 fields: ID, Height(centimeters), Weight(pounds),
Sponsorship($), Age(years);
There is an initial header line, plus
30x5 items
-
basketball_data.txt,
basketball player data, from Kelleher, MacNamee, D'Arcy.
There are 30 records;
There are 5 fields: ID, Height(centimeters), Weight(pounds),
Sponsorship($), Age(years);
30x5 items
-
birthweight.csv,
id, headcircumference, length, Birthweight, gestation,
smoker, motherage, mother cigarettes, mother height, mother ppwt,
fage, fedyrs, father cigarettes, father height, lowbwt,
mage35, LowBirthWeight,
one header line, plus
42x17 items
-
blobs_centers.txt,
5x2 items, cluster centers
-
blobs_clusters.txt,
2020x1 items, assign each blob data item to a blob cluster center.
-
blobs_data.txt,
2020x2 items, data that forms 5 clusters
-
blobs_std.txt,
5x1 items, cluster standard deviations
-
bulgaria_population_data.txt,
column 1 is the year (1997-2018) and column 2 is the population,
29x2 items.
-
caesarian_data.txt,
80x6 items
-
card_data.csv,
Collection of 50 playing cards;
There is an initial header line.
There are 50 records;
There are 4 fields: Index, Rank(1-13), Suit(1-4), Order(1-52);
1 header line, plus
50x4 items
-
china_data.txt,
year and average income for China, 1952-2007.
12x2 items
-
climate_data.xls,
the data naturally forms two clusters.
1570x2 items.
-
corvette_data.txt,
the resale price for Corvettes by
model year. Each record lists the model year and asking price.
72x2 items
-
crash_data.csv,
computer crash reports.
There is an initial header line.
There are 20 records.
There are 5 fields: index(1-20), OS(1=Linux,2=OSX,3=Windows),
LANG (1=C,2=Python), Browser(1=Chrome,2=Explorer,3=Firefox,4=Safari),
Crash (0=No,1=Yes).
1 header line, plus
20x5 items
-
diabetes_data.csv,
1 header line, plus
768x9 items
-
draft_data.csv,
1 header line, plus
20x4 items
-
faith_data.txt,
eruption time (minutes), pause (minutes), height of geyser (feet),
50x3 items
-
faithful_data.csv,
Old Faithful geyser.
There are 272 records;
There are 3 fields: index, time between eruptions, and length of
eruption.
1 header line, plus
272x3 items
-
faithful_data.txt,
records 272 observations of eruptions of the Old
Faithful geyser, giving eruption length and eruption wait in minutes.
272x2 items
-
filip_data.txt,
is a dataset from the National Institute of Standards and
Technology (NIST), supplied by Albert Filippelli. 82 pairs of (x,y)
values are recorded. A 10th-degree polynomial fit y=p(x) is desired.
82x2 items
-
ford_data.csv,
the model year, mileage (miles), and selling price ($)
for 23 Ford Escorts. A linear fit is desired, to predict selling
price from mileage.
1 header line:
"Year","Mileage(thousands)","Price",
23x3 items
-
ford_data.txt,
the model year, mileage (miles), and selling price ($)
for 23 Ford Escorts. A linear fit is desired, to predict selling
price from mileage.
23x3 items
-
generator_data.txt,
56 records: index, rpm, vibration, fail(0)/active(1)
56x4 items
-
geyser_data.txt,
272x2 items
-
gold_data.txt,
observations of gold coins, some of which are counterfeit,
weight (grams), genuine (1 True, 0 False),
20x2 items
-
gopher_data.txt,
measurements of gophers from two species:
skull width (cm), skull length (cm), species (-1 or +1).
50x3 items
-
homes_data.txt,
records, for 49 houses that were sold recently, the selling price $,
asking price $, living area in square feet, # rooms, # bedrooms,
# bathrooms, age in years, lot size in acres, taxes $.
49x9 items
-
homes_test.txt,
3x9 items
-
hw_data.txt,
25000x3 items
-
hyperlink_data.csv,
Hyperlink directed adjacency matrix for 16 web pages.
There are 16 records:
There are 16 fields: 0 or 1 if page i links to page j.
16x16 items
-
hyperlink_map.png,
a map of the hyperlinks.
-
insurance_data.csv,
Medical insurance costs;
There is an initial header line.
There are 1338 records:
There are 7 fields: age, sex, bmi, children, smoker, region, charges
1 header line, plus
1338x7 items
-
insurance_data.txt,
Medical insurance costs;
There is an initial header line.
There are 1338 records:
There are 7 fields: age, sex, bmi, children, smoker, region, charges
1338x7 items
-
insurance_test.txt,
61x7 items
-
insurance_train.txt,
1000x7 items
-
iris_data.csv,
150x5 items
-
iris_description.txt
-
jet_data.txt,
56 records: index, rpm, vibration, fail(-1)/working(1)
56x4 items
-
ladybug.png
-
loan_data.csv,
Data for a fraud-detection system.
There are 20 records.
There are 5 fields: ID(1-20), Credit History ("none", "paid", "current",
"arrears"), Guarantor ("none", "guarantor", "coapplicant" ),
Accommodation ("own", "rent", "free"), Fraud ("true", "false");
1 header line, plus
20x5 items
-
lump_data.txt,
11x8 items
-
medicine_hat_tigers_2007.txt,
1 header line,
25x8 items
-
mexico_population_data.txt,
column 1 is the year (1865-2018) and column 2 is the population,
13x2 items
-
mlb_data.txt,
records the 2018 winning percentage up to July 1, and
the team payroll, for the 15 American League teams:
Boston Red Sox, Los Angeles Angels, New York Yankees, Toronto Blue Jays,
Houston Astros, Seattle Mariners, Texas Rangers, Baltimore Orioles,
Detroit Tigers, Cleveland Indians, Kansas City Royals, Minnesota Twins,
Tampa Bay Rays, Oakland Athletics, Chicago White Sox.
15x2 items
-
playfair_data.txt,
Column 1 is the year, column 2 the price of a measure of wheat in
shillings, and column 3 is the average weekly earnings of a
mechanic in shillings.
The interesting item is the ratio of wheat price to earnings.
50x3 items
-
pollution_data.csv,
Various measurements related to air pollution in US cities.
There is an initial header line.
There are 41 records.
There are 8 fields: "City name", "SO2 mg/cm", "Average Temperature F",
"Manufacturing Plants", "1970 Population", "Average Wind Speed mph",
"Average Precipitation inches", "Annual Precipitation days"
1 header line, plus
41x8 items
-
price_data.csv,
is a table of average monthly prices for 11 consumer products,
between February 2008 and February 2018. There are 241 records. Each record
contains 13 items:
the month,
the year,
bananas (lb),
oranges (lb),
bread (lb),
tomatoes (lb),
chicken (lb),
electricity (kwh),
egss (dozen),
gasoline (gallon),
ground chuck (lb),
heating gas (therm),
milk (gallon).
1 header line, plus
241x13 items
-
price_data.txt,
is a table of average monthly prices for 11 consumer products,
between February 2008 and February 2018. There are 241 records. Each record
contains 13 items:
the month,
the year,
bananas (lb),
oranges (lb),
bread (lb),
tomatoes (lb),
chicken (lb),
electricity (kwh),
egss (dozen),
gasoline (gallon),
ground chuck (lb),
heating gas (therm),
milk (gallon).
241x13 items.
-
random_data.txt,
the (x,y) coordinates of 100 random points.
100x2 items.
-
rising_data.txt,
records 25 measurements of a physical experiment at
intervals of 1 second. A model of the form y(t) = c1 + c2 * t + c3 * sin(t)
is suggested.
This example is taken from "Numerical Methods and Software"
by Kahaner, Moler and Nash.
25x2 items.
-
risk_data.csv,
Regional adjacency matrix for the game of Risk.
There are 42 records:
There are 42 fields of 0 or 1: adjacency to region 1,
region 2, ..., 42.
42x42 items.
-
risk_map.png,
A map that displays the names and numbers of the 42 Risk regions.
-
risk_names.csv,
Region names for the game of Risk.
Column 1 is the index, and column 2 is the region name.
There is one initial record.
42x2 items.
-
ruspini_data.txt,
a set of (x,y) coordinates which naturally form 4 clusters.
75x2 items.
-
schoolyear_data.csv,
the number of days in a school year, by country.
There is an initial header line.
There are 28 records.
There are 2 fields: Country name (a string), days in school year.
1 header line, plus
28x2 items
-
sex_age_height_weight_data.txt,
records sex (0=female,1=male), age (months),
height (inches), and weight (pounds) for 237 school children.
From Lewis and Taylor, 1967.
237x4 items
-
sine_test.txt,
100 pairs of (x,y) data in [0,1], column 1 is x, column 2 is sine(x).
This data should be used to test the model generated by
the sine_train.txt data.
100x2 items
-
sine_train.txt,
10 pairs of (x,y) data, column 1 is x, column 2 is sine(x),
to be used to construct a model function.
10x2 items
-
snowfall_data.txt,
a table of M = 132 rows and N = 10 columns.
Column 1 is the winter year
identifier, ranging from 1890-1891 to 2021-2022. Columns 2 through 9
are the snowfall in inches for October, November, December, January,
February, March, April and May. Column 10 is the total snowfall.
These measurements were taken near Michigan Tech.
132x10 items
-
spring_data.txt,
records the results of Hooke's law experiments, in which
a spring is loaded with a given weight M, and the resulting
deflection D is measured. A linear relationship D=c*M is expected.
There are 20 records, with two fields, mass in kilograms, and
deflection in meters. From Guttag, 2016.
20x2 items.
-
strain_data.txt,
10x2 items
-
titanic.pdf
-
titanic_test.csv,
1 header line, plus
1308x11 items.
-
titanic_train.csv,
1 header line, plus
891x12 items
-
titanium_data.txt,
is a dataset from Carl DeBoor's "A Practical Guide to Splines".
49 pairs of (x,y) values are recorded, which represent the variation
of some property of titanium.
49x2 items
-
turtle_data.csv,
contains 54 records of turtle measurements.
Each record lists the index, sex (-1=M,+1=F), length of carapace,
width of carapace, and height (the measurement of
the carapace plus the plastron).
1 header line, plus
54x6 items
-
two_temperatures_data.txt,
Fahrenheit and Celsius temperatures of freezing and boiling,
2x2 items
-
us_population_data.txt,
column 1 is the year (1900-2020) and column 2 is the population,
121x2 items
-
volcano_data.txt,
is based on a rectangular grid of 87 X coordinates
and 61 Y coordinates. At each grid point (X(I),Y(J)), the value
Z(I,J) stores the height of a volcano. There are 87 records in the
file, one for each X value. Each record contains 61 values of Z,
one for each Y value.
87x61 items
-
weather_data.txt,
temperature, pressure, humidity and wind speed, for
June 26, 27, 28, and 29,
5x4 items
-
weight_data.txt,
sex (0/1), September weight (kg), April weight (kg),
September BMI, April BMI.
67x5 items
-
wine_data.csv,
no header line,
178x14 items
-
wine_header.txt,
the header information for the wine data, as a separate file,
14x1 items
Last revised on 13 February 2022.