MARTINEZ
Computational Statistics Datasets


MARTINEZ is a dataset directory which contains data associated with a book on computational statistics and MATLAB.

The original data files are available as MATLAB M files, and as text files. The original text files were broken up so that each variable is now in its own file, with no extraneous text or blank lines. This may facilitate the use of the data by a variety of programs.
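Because each variable is stored in its own text file, with no extraneous text or blank lines, reading the data takes very little code. The following Python sketch shows one way this might be done. It assumes every value is numeric and separated by whitespace or line breaks. The function names are illustrative, and any file names passed in (such as "x.txt" and "y.txt") are hypothetical placeholders, not actual file names from this directory.

```python
from pathlib import Path


def read_variable(path):
    """Read one variable from a text file of whitespace-separated numbers."""
    text = Path(path).read_text()
    # split() tolerates trailing newlines and stray whitespace.
    return [float(token) for token in text.split()]


def read_observations(paths):
    """Combine several single-variable files into a list of observation rows."""
    columns = [read_variable(p) for p in paths]
    lengths = {len(c) for c in columns}
    if len(lengths) > 1:
        raise ValueError("variable files have different lengths")
    # zip(*columns) turns per-variable columns into per-observation rows.
    return [list(row) for row in zip(*columns)]
```

For example, read_observations(["x.txt", "y.txt"]) would return a list of [x, y] rows, one per observation, provided both hypothetical files hold the same number of values.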

Licensing:

The computer code and data files described and made available on this web page are distributed under the GNU LGPL license.

Related Data and Programs:

CENSUS, a dataset directory which contains US census data;

DRAFT_LOTTERY, a dataset directory which contains the numbers assigned to each birthday, for the Selective Service System lotteries for 1970 through 1976.

HARTIGAN, a dataset directory which contains datasets for testing clustering algorithms;

ISWR, a dataset directory which contains datasets for computational statistics.

MDS, a dataset directory which contains datasets for M-dimensional scaling;

PCL, a dataset directory which contains datasets from a gene expression experiment on Arabidopsis, which are candidates for data cluster analysis;

REGRESSION, a dataset directory which contains datasets for testing linear regression;

SAMMON, a dataset directory which contains six sets of M-dimensional data for cluster analysis.

SGB, a dataset directory which contains files used as input data for demonstrations and tests of Donald Knuth's Stanford Graph Base.

SOKAL_ROHLF, a dataset directory which contains biological datasets considered by Sokal and Rohlf.

SPAETH, a dataset directory which contains datasets for cluster analysis;

SPAETH2, a dataset directory which contains datasets for cluster analysis;

STATS, a dataset directory which contains datasets for computational statistics;

TIME_SERIES, a dataset directory which contains examples of time series, which are simply records of the values of some quantity at a sequence of times.

TRIOLA, a dataset directory which contains datasets used for statistical analysis.

WORDS, a dataset directory which contains lists of words;

Reference:

  1. Wendy Martinez, Angel Martinez,
    Computational Statistics Handbook with MATLAB,
    Chapman and Hall / CRC, 2002,
    ISBN: 1-58488-229-8,
    LC: QA276.4.M272.
  2. http://lib.stat.cmu.edu,
    the STATLIB web site.
  3. http://www.infinityassociates.com

Datasets:

The ABRASION data set has 30 observations (rows).

The ANAEROB data set has 53 observations (rows) of oxygen uptake and expired ventilation.

The ANSCOMBE data set has 11 observations (rows) of simulated data used to illustrate the ideas of exploratory data analysis.

The BANK data set has two sets of 100 observations (rows) of six properties (columns) of banknotes: one set for 100 forged banknotes and one for 100 genuine banknotes. This data can be used to test clustering techniques.

The BIOLOGY data set records the number of papers published for 1534 biologists. The number of papers ranges from 1 to 11.

The BODMIN data set records the location of 35 granite tors on Bodmin Moor.

The BOSTON data set contains 14 measures (columns) of housing data for 506 census tracts (rows) in Boston, taken in 1970.

The BROWNLEE data set has 21 observations of 4 variables for a plant which oxidizes ammonia. There are three predictor or "X" variables, and one response or "Y" variable. The variables are X1 = "air flow", X2 = "cooling temperature", X3 = "acid percentage", and Y = "stack loss".

The CARDIFF data set records the location of the homes of 168 juvenile offenders in Cardiff, Wales.

The CEREAL data set contains 11 ratings (columns) of 8 brands (rows) of cereal.

The COAL data set counts the number of coal mine disasters per year over 112 years.

The CLUSTER data set is an artificial and simple example of 5 points in 2D, which can be grouped into two clusters. This data can be used to test clustering techniques.

The COUNTING data set counts the number of scintillations in 72 second intervals arising from the decay of radioactive polonium.

The ELDERLY data set contains the height measurements in centimeters of 351 elderly women.

The ENVIRON data set contains 111 daily readings of ozone level and wind speed in New York City between May and September 1973.

The FILIP data set contains 82 pairs of (x,y) data, used as a standard test for least squares calculations.

The FLEA data set contains measurements (rows) of 2 quantities (columns) for each of 3 species of flea. This data can be used to test clustering techniques.

The FOREARM data set contains measurements of the length in inches of the forearms of 140 adult males.

The GEYSER data set contains the waiting time in minutes between successive eruptions of the Old Faithful geyser. 299 values are recorded.

The HELMETS data set has 133 observations of the acceleration of a head after an accident.

The HOUSEHOLD data set contains observations of 4 expenditures (columns) for households of single men and single women. This data can be used to test clustering techniques.

The HUMAN data set records measurements of the percentage of fat (column 1) and age (column 2). This data can be used to test clustering techniques.

The INSECT data set contains 10 measurements (rows) of 3 quantities (columns) for each of 3 species of insect. This data can be used to test clustering techniques.

The INSULATE data set contains measurements (rows) of 2 quantities (columns): the average outside temperature in degrees Celsius, and the weekly gas consumption in thousands of cubic feet. One set of data was taken before insulation, and the other after insulation.

The IRIS data set contains 50 measurements (rows) of 4 quantities (columns) for each of 3 species of iris. This data can be used to test clustering techniques.

The LAW data set is a random sample of the LAWPOP data set. It contains the LSAT scores and GPAs for 15 randomly chosen records.

The LAWPOP data set contains the average LSAT scores and GPAs for freshman students at 82 law schools.
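The relationship between LAW and LAWPOP, a random sample of 15 records drawn from a population of 82, can be imitated with a few lines of Python. This is only a sketch: the population list below is a synthetic stand-in, not the actual LAWPOP values, and the function name draw_sample and the seed are assumptions made for illustration. It samples without replacement, which may or may not match how the LAW records were originally chosen.

```python
import random


def draw_sample(records, k, seed=None):
    """Return k records chosen at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(records, k)


# Synthetic stand-in for 82 (LSAT, GPA) records; NOT the real LAWPOP data.
population = [(600 + i, 3.0 + 0.01 * i) for i in range(82)]

sample = draw_sample(population, 15, seed=0)
print(len(sample))  # 15
```

Fixing the seed makes the draw repeatable, which is convenient when comparing a statistic computed on the sample against the same statistic computed on the full population.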

The LONGLEY data set contains 16 observations (rows) of 7 predictor variables X (one of which is always 1), and a response variable Y.

The MEASURE data set contains 20 measurements (rows) of 3 quantities (columns), chest, waist and hips. 10 of the measurements are for men, 10 for women. This data can be used to test clustering techniques.

The MOTHS data set contains the number of moths caught in a trap over 24 consecutive nights.

The NFL data set contains measurements of the game time until the first score made by kicking the ball between the end posts (X1) and the game time until the first score made by moving the ball into the end zone (X2). 42 observations were made.

The PEANUTS data set contains measurements of the average level of aflatoxin of a batch of peanuts, and the percentage of non-contaminated peanuts in the batch. 34 observations were made.

The POSSE data set contains 6 sets of data generated for simulation studies. Each data set has 400 observations (rows) in 8 dimensions (columns).

The QUAKES data set records the time in days between successive earthquakes. 62 intervals are recorded.
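Data sets such as QUAKES store the time between successive events rather than the event times themselves. As a small illustration of how such intervals relate to raw event times, the following Python sketch computes the gaps from a sorted list of times. The function name intervals and the sample values are invented for illustration and are not taken from the QUAKES data.

```python
def intervals(event_days):
    """Return the gaps between successive event times (assumed sorted)."""
    return [later - earlier for earlier, later in zip(event_days, event_days[1:])]


# Hypothetical event days, invented for illustration only.
days = [3, 10, 12, 30]
print(intervals(days))  # [7, 2, 18]
```

Note that n event times yield n - 1 intervals.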

The REMISS data set contains the remission times for 42 leukemia patients. Some of the patients were treated with the drug 6-mercaptopurine, and the rest were part of the control group.

The SNOWFALL data set records the annual snowfall, in inches, in Buffalo, New York, for the 63 years from 1910 to 1972.

The SPATIAL data set records the scores of 26 neurologically impaired children on a test of spatial perception.

The STEAM data set records the average atmospheric temperature X, and the corresponding amount of steam used per month, Y. 25 observations were made.

The THROMBOS data set has measurements of urinary-thromboglobulin excretion in 12 normal and 12 diabetic patients. This data can be used to test clustering techniques.

The TIBETAN data set contains 32 observations (rows) of 5 measurements (columns) of skull height. 17 of the skulls came from one area, and 15 from another. This data can be used to test clustering techniques.

The UGANDA data set records the location of 120 volcano crater centers in west Uganda.

The WHISKY data set records the price in dollars of a fifth of whisky in 16 states with state-owned liquor stores and 26 states with private liquor stores.


Last revised on 16 October 2011.