STATS
Statistical Datasets


STATS is a dataset directory which contains example datasets used for statistical analysis.

Licensing:

The computer code and data files described and made available on this web page are distributed under the GNU LGPL license.

Related Data and Programs:

CENSUS, a dataset directory which contains US census data;

DRAFT_LOTTERY, a dataset directory which contains the numbers assigned to each birthday, for the Selective Service System lotteries for 1970 through 1976.

HARTIGAN, a dataset directory which contains datasets for testing clustering algorithms;

ISWR, a dataset directory which contains datasets used for statistical analysis.

MARTINEZ, a dataset directory which contains datasets for computational statistics, including cluster analysis;

MDS, a dataset directory which contains datasets for multidimensional scaling;

PCL, a dataset directory which contains datasets from a gene expression experiment on Arabidopsis, which are candidates for data cluster analysis;

REGRESSION, a dataset directory which contains datasets for testing linear regression;

SGB, a dataset directory which contains files used as input data for demonstrations and tests of Donald Knuth's Stanford Graph Base.

SOKAL_ROHLF, a dataset directory which contains biological datasets considered by Sokal and Rohlf.

SPAETH, a dataset directory which contains datasets for cluster analysis;

SPAETH2, a dataset directory which contains datasets for cluster analysis;

TIME_SERIES, a dataset directory which contains examples of time series, that is, records of the values of some quantity at a sequence of times.

TRIOLA, a dataset directory which contains datasets used for statistical analysis.

WORDS, a dataset directory which contains lists of words;

References:

  1. Francis Anscombe,
    Graphs in Statistical Analysis,
    The American Statistician,
    Volume 27, Number 1, February 1973, pages 17-21.
  2. Andrew Frank, Arthur Asuncion,
    UCI Machine Learning Repository,
    http://archive.ics.uci.edu/ml,
    School of Information and Computer Science,
    University of California, Irvine, California.
  3. Philipp Janert,
    Gnuplot in Action: Understanding Data with Graphs,
    Manning, 2010,
    ISBN13: 978-1-933988-39-8,
    LC: QA276.4.J37.
  4. R J Kuczmarski, C L Ogden, S S Guo,
    2000 CDC Growth Charts for the United States: Methods and Development,
    National Center for Health Statistics,
    Vital and Health Statistics, Series 11, Number 246, May 2002,
    ISBN: 0-8406-0575-7,
    LC: GN63.A225.
  5. Wendy Martinez, Angel Martinez,
    Computational Statistics Handbook with MATLAB,
    Chapman and Hall / CRC, 2002,
    ISBN: 1-58488-229-8,
    LC: QA276.4.M272.
  6. Paul Sommers,
    Is Presidential Greatness Related to Height?,
    The College Mathematics Journal,
    Volume 33, Number 1, pages 14-16, January 2002.

Datasets:

Files with an extension of .txt are "text" files; the columns of data are separated by spaces.

Files with an extension of .csv are "comma separated value" files; the columns of data are separated by commas. These files often have an initial "header" line containing a label for each column. Each label is generally delimited by quotation marks.
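The two file formats described above can be read with a few lines of Python. The following is a minimal sketch, not the method used to produce these files: the helper names read_txt_table and read_csv_table are hypothetical, the sample contents are made up for illustration, and the handling of "#" comment lines is an assumption that may not apply to every file in this directory.

```python
import csv
import io


def read_txt_table(stream):
    """Read space-separated columns, one record per line.

    Blank lines are skipped. Skipping lines that start with '#'
    is an assumption, not something the file format guarantees.
    """
    rows = []
    for line in stream:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rows.append(line.split())
    return rows


def read_csv_table(stream, has_header=True):
    """Read comma-separated columns and return (header, rows).

    The csv module removes the quotation marks around header labels.
    """
    reader = csv.reader(stream, skipinitialspace=True)
    rows = [row for row in reader if row]
    if has_header and rows:
        return rows[0], rows[1:]
    return None, rows


# Made-up sample contents, for illustration only.
txt_sample = io.StringIO("# index value\n1 2.5\n2  3.75\n")
csv_sample = io.StringIO('"Index", "Value"\n1, 2.5\n2, 3.75\n')

txt_rows = read_txt_table(txt_sample)
header, csv_rows = read_csv_table(csv_sample)

print(txt_rows)
print(header, csv_rows)
```

All fields are returned as strings; convert them with int() or float() as needed. To read an actual file, pass an open file object in place of the io.StringIO samples.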

ANSCOMBE contains four sets of 11 pairs of (x,y) data. Sets (x1,y1), (x2,y2), (x3,y3) and (x4,y4) all have nearly the same average x, average y, regression line (y = 3 + 0.5 * x), variance, correlation coefficient, and value of r^2. Yet when plotted, the four datasets look quite different.
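The agreement of the ANSCOMBE summary statistics can be checked directly. The sketch below assumes the commonly published values of Anscombe's quartet, typed in by hand; the layout of the file in this directory may differ. The helper name summarize is hypothetical.

```python
from math import sqrt


def summarize(x, y):
    """Return mean, sample variance, least-squares line, and correlation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (n - 1),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
        "r": sxy / sqrt(sxx * syy),
    }


# Commonly published quartet values (assumed to match the ANSCOMBE file).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

results = [summarize(x, y) for x, y in quartet]
for k, s in enumerate(results, start=1):
    print(
        f"set {k}: mean_x={s['mean_x']:.2f} mean_y={s['mean_y']:.2f} "
        f"y = {s['intercept']:.2f} + {s['slope']:.3f} * x  r={s['r']:.3f}"
    )
```

The printed statistics agree to about two decimal places across the four sets, which is why a scatter plot of each set is needed to see how they differ.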

ALLIGATORS contains weight (in pounds) and length (in inches) of 26 alligators sampled in central Florida.

ASTEROIDS contains records about a number of asteroids, including the index, name, mass in kg, density in g/cm^3, and three radial dimensions. This information is extracted from data presented as part of Jim Baer's program CODES (Comet/asteroid Orbit Determination and Ephemeris Software).

AUTOMOBILE contains 205 records, with 26 attributes, describing properties of cars available in 1985, taken from the UCI Machine Learning Repository. Some data values are missing, and are indicated by '?'. The data is comma separated, and includes text, integers, and real values. Our interest is to make a scatter plot of certain pairs of real attributes.
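Because AUTOMOBILE mixes text, integers, real values, and '?' markers, a scatter plot of two attributes first needs the records in which both values are present. The following sketch shows one way to do that. The three sample records and their column positions are made up for illustration and are not taken from the file; the helper names parse_field and complete_pairs are hypothetical.

```python
import csv
import io


def parse_field(text):
    """Convert one field: '?' becomes None, numbers become int or float."""
    text = text.strip()
    if text == "?":
        return None
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            pass
    return text  # leave non-numeric text unchanged


def complete_pairs(rows, i, j):
    """Return (x, y) pairs from columns i and j, skipping missing or text values."""
    pairs = []
    for row in rows:
        a, b = row[i], row[j]
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            pairs.append((a, b))
    return pairs


# Made-up records, for illustration only (not actual AUTOMOBILE data).
sample = io.StringIO(
    "alpha,88.6,?,13495\n"
    "beta,?,2548,16500\n"
    "gamma,99.8,2337,13950\n"
)

rows = [[parse_field(field) for field in record] for record in csv.reader(sample)]
pairs = complete_pairs(rows, 1, 3)
print(pairs)
```

The resulting list of pairs can be passed to any plotting tool; records with a missing value in either column are dropped rather than filled in.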

BEN_STILLER records the title, year, and US Gross of 53 movies in which Ben Stiller had a role.

BRAD_PITT records the title, year and US Gross of 40 movies in which Brad Pitt had a role.

GEYSER contains the waiting time in minutes between successive eruptions of the Old Faithful geyser. 299 values are recorded. The data ranges from 43 to 108. The data comes from Martinez and Martinez.

HEIGHT_FEMALE_BABY records the height in inches of female babies, measured every three months from 0 to 36 months, for the 3rd, 5th, 10th, 25th, 50th, 75th, 90th, 95th and 97th percentiles, taken from the 2000 CDC Growth Charts.

HEIGHT_FEMALE_YOUNG records the height in inches of young females, measured every year from 0 to 20 years, for the 3rd, 5th, 10th, 25th, 50th, 75th, 90th, 95th and 97th percentiles, taken from the 2000 CDC Growth Charts.

HEIGHT_MALE_BABY records the height in inches of male babies, measured every three months from 0 to 36 months, for the 3rd, 5th, 10th, 25th, 50th, 75th, 90th, 95th and 97th percentiles, taken from the 2000 CDC Growth Charts.

HEIGHT_MALE_YOUNG records the height in inches of young males, measured every year from 0 to 20 years, for the 3rd, 5th, 10th, 25th, 50th, 75th, 90th, 95th and 97th percentiles, taken from the 2000 CDC Growth Charts.

LEAD_SHOT records measurements for 25 grades of lead shot, including the grade, weight in ounces and grams, diameter in inches and millimeters, and the rough number of pellets per ounce.

LYNX records the yearly lynx harvest from 1821 to 1934.

MEASLES_NYC is a table of two columns. Column 1 lists months from January 1, 1928 to November 1, 1963, as decimal values. Column 2 lists the number of measles cases reported during that month.
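To work with MEASLES_NYC by calendar month, the decimal values in column 1 must be converted back to a year and a month. The sketch below assumes the encoding value = year + (month - 1) / 12, so that January 1928 is 1928.0; this is an assumption about the file, not something stated on this page, and should be checked against the first few records. The helper name decimal_to_year_month is hypothetical.

```python
def decimal_to_year_month(t):
    """Convert a decimal date to (year, month).

    Assumes t = year + (month - 1) / 12, with month in 1..12.
    Rounding absorbs small errors from truncated decimals.
    """
    year = int(t)
    month = int(round((t - year) * 12)) + 1
    if month == 13:  # a fraction just below 1 rounds into the next year
        year += 1
        month = 1
    return year, month


first = decimal_to_year_month(1928.0)
middle = decimal_to_year_month(1928.5)
last = decimal_to_year_month(1963.8333)
print(first, middle, last)
```

Under this assumption, the first and last values in the file would map to January 1928 and November 1963, matching the range described above.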

MGP_STUDENTS records the frequency with which a person listed in the Math Genealogy Project web site has the given number of students. The data was extracted on 15 January 2012, and the number of students ranges from 0 to 110.

MIXING records the concentration of dye in a stream of water leaving a container, over time. It is expected that this concentration will decrease approximately linearly over time. This example is taken from "Numerical Methods and Software" by Kahaner, Moler and Nash.

MOVIE_BUDGETS records the budgets for 3,546 movies made in recent years. Information includes rank, title, release date, distributor, budget, US gross sales, and worldwide gross sales.

MOVIES records the money made by movies during a given year, including movies released in previous years. Information includes rank, title, release date, distributor, genre, MPAA rating, gross sales, tickets sold, gross adjusted for inflation. The information was obtained from "THE NUMBERS" website: http://www.the-numbers.com.

PRESIDENTIAL_HEIGHTS contains the name of each president, his height measured in inches, and a rating of greatness (0 = failure, 1 = below average, 2 = average, 3 = above average, 4 = near great, 5 = great, NA = not available). The data is from the article by Paul Sommers.

REPUBLICANS_2012 records, for each of five Republican candidates for president, the percentage of responses that were favorable, unfavorable, no opinion, or "Who?", for Republican, Democratic, Independent and Total voters.

SAT_BY_STATE records average SAT (Scholastic Aptitude Test) scores per state, including population, average verbal and math scores, percentage of eligible students taking the exam, percentage of adult population without a high school education, and annual teacher pay in thousands of dollars.

TALLY_CAB records the distance in miles (from Google Maps) and a taxi cab fare in dollars, from the Tallahassee Airport to various destinations, including Saint George Island.

TOURISTS contains the number of tourists to Apple beach each month. The file contains 12 records, with each record listing the index (1-12) of the month, the number of tourists, and a 3-letter month abbreviation.

TURTLES contains measurements for 54 turtles, taken from two collections: SREL (Savannah River Ecology Lab) and CMNH (Carnegie Museum of Natural History). The sex of the turtle is described as "M" or "F", the length and width of the carapace are given, followed by the 'height', defined as the measurement of the carapace plus the plastron.

WEIGHT_FEMALE_BABY records the weight in pounds of female babies, measured every three months from 0 to 36 months, for the 3rd, 5th, 10th, 25th, 50th, 75th, 90th, 95th and 97th percentiles, taken from the 2000 CDC Growth Charts.

WEIGHT_FEMALE_YOUNG records the weight in pounds of young females, measured every year from 0 to 20 years, for the 3rd, 5th, 10th, 25th, 50th, 75th, 90th, 95th and 97th percentiles, taken from the 2000 CDC Growth Charts.

WEIGHT_MALE_BABY records the weight in pounds of male babies, measured every three months from 0 to 36 months, for the 3rd, 5th, 10th, 25th, 50th, 75th, 90th, 95th and 97th percentiles, taken from the 2000 CDC Growth Charts.

WEIGHT_MALE_YOUNG records the weight in pounds of young males, measured every year from 0 to 20 years, for the 3rd, 5th, 10th, 25th, 50th, 75th, 90th, 95th and 97th percentiles, taken from the 2000 CDC Growth Charts.


Last revised on 15 January 2012.