mlds_2019
Labs for MATH 728 D
Machine Learning and Data Science
Spring 2019
mlds_2019 is
the home page for a set of MATLAB laboratory exercises
associated with the class MATH 728 D: "Machine Learning and Data Science",
taught in Spring 2019 by Professor Wolfgang Dahmen of the Mathematics
Department at the University of South Carolina. The labs were
written and presented by John Burkardt.
The class met for lectures on Tuesdays and Thursdays,
from 1:15-2:30 pm in LeConte 121.
John Burkardt was available in LeConte 401, on Mondays, from
1:10-2:00 pm, for students who wished to work on any of the lab
exercises, or who had questions about the assigned projects.
Note that Professor Dahmen sponsored a "Spring School" workshop,
March 17th - 20th. The MATH 728 D class for March 18th was
cancelled. Students were urged to attend any of the Spring School
lectures that attracted them, as described on the web site:
http://people.math.sc.edu/imi/dasiv/SpringSchool/.
The class included both regular lectures and, as a follow-up,
classroom lab exercises in which students attempted to carry out
tasks related to the lecture material, using MATLAB or Python.
The following lab exercises were originally developed at
the University of South Carolina, under the guidance of Professor
Wolfgang Dahmen, in the spring of 2019.
This material was later extensively revised and reworked for an
undergraduate class in machine learning, planned and developed at
the University of Pittsburgh, in consultation with Professor
Michael Schneier, in the fall of 2019.
Some of this material has since been further developed and modified
for a class at the Missouri University of Science and Technology,
in consultation with Professor Yanzhi Zhang, in the fall of 2020.
Class information and lecture notes:
Homework exercises:
(Homework is intended as exercises for you to familiarize yourself with the course material.
It will not be collected or graded.
If you have questions about the exercises, these can be answered through email or at office hours.)
Project #1: Regression with Linear Least Squares (Due March 19)
Project #2: High Dimensional Sampling and Ranking (Due April 11)
Project #3: Perceptron, SVM, Multilinear Regression, Clustering (Due April 25)
-
project3.pdf;
-
generators.txt,
56 rows, 4 columns: index, rpm, vib, grade(0,1);
-
jet_engines.txt,
56 rows, 4 columns: index, rpm, vib, grade(-1,+1);
-
insurance_train.txt,
1000 rows, 7 columns: age, sex, bmi, children, smoker, region, charges;
-
insurance_test.txt,
338 rows, 7 columns: age, sex, bmi, children, smoker, region, charges;
-
faithful.txt,
272 rows, 2 columns: eruption length, pause length;
LAB #1: MATLAB
-
lab01.pdf;
-
scrambled.txt,
a file to be loaded, describing a matrix whose rows and columns
have been scrambled.
-
correct.txt,
a file describing the unscrambled matrix.
-
geyser_data.csv,
a file of observations of the Old Faithful geyser,
including the lengths of eruptions, and quiet times.
Several data values have been corrupted, and should be
removed before plotting the rest.
LAB #2: Linear Algebra
LAB #3: Plotting
-
lab03.pdf;
-
bulgaria_data.txt,
a file listing census years and population counts for Bulgaria.
-
faithful_data.csv,
a file of observations of the Old Faithful geyser,
including the lengths of eruptions, and quiet times.
-
price_data.txt,
the month, the year, and average monthly prices for 11 consumer
products, between February 2008 and February 2018.
-
schoolyear_data.m,
a MATLAB M file that lists countries and school year lengths in days.
-
snowfall_data.txt,
snowfall measurements at Michigan Tech, from 1890 to 2017.
-
volcano_data.csv,
data defining the local elevation near a volcano.
LAB #4: Probability
LAB #5: Optimization
LAB #6: Linear Regression
LAB #7: Multilinear Regression
-
lab07.pdf;
-
insurance_data.csv,
a header line, then 1338 records of
(age, sex, BMI, children, smoker, region, charges);
sex, smoker, and region are "text" values;
-
insurance_data.txt,
a header line, then 1338 records of
(age, sex, BMI, children, smoker, region, charges);
sex, smoker, and region are numeric values.
LAB #8: Logistic Regression
LAB #9: Clustering
-
lab09.pdf;
-
faithful_data.csv,
a file of observations of the Old Faithful geyser,
including the lengths of eruptions, and quiet times.
-
pollution_data.csv,
Various measurements related to air pollution in US cities.
There is an initial header line.
There are 41 records.
There are 8 fields: "City name", "SO2 mg/cm", "Average Temperature F",
"Manufacturing Plants", "1970 Population", "Average Wind Speed mph",
"Average Precipitation inches", "Annual Precipitation days";
-
scale_01.m,
a function which shifts and rescales each column of an array to have
minimum 0 and maximum 1;
-
swim.jpg,
an image using 256 colors;
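The column rescaling that scale_01.m is described as performing can be sketched in Python as follows. This is a minimal illustration, not a translation of the MATLAB file: the handling of a constant column (mapped to 0) and the small sample table are assumptions made here.

```python
def scale_01(rows):
    """Shift and rescale each column of a row-major table to span [0, 1]."""
    if not rows:
        return []
    columns = list(zip(*rows))           # transpose: one tuple per column
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        new_row = []
        for value, lo, hi in zip(row, lows, highs):
            span = hi - lo
            # Assumption: a constant column is mapped to 0.0 to avoid 0/0.
            new_row.append((value - lo) / span if span != 0 else 0.0)
        scaled.append(new_row)
    return scaled


# Hypothetical 3-by-2 table, used only to exercise the function.
data = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
scaled = scale_01(data)
for row in scaled:
    print(row)
```

Rescaling of this kind is commonly applied before clustering so that no single field dominates the distance computation because of its units.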
LAB #10: Gaussian Mixture Models
LAB #11: Principal Component Analysis
-
lab11.pdf;
-
glass_data.csv,
Data about chemical composition of samples of glass.
There is an initial header line.
214 records are stored;
Each record includes 11 fields: "Index","Refractive Index","Na","Mg",
"Al","Si","K","Ca","Ba","Fe","Class";
Class: 1=building float glass, 2=building nonfloat glass,
3=vehicle float glass, 4=vehicle nonfloat glass, 5=containers,
6=tableware, 7=headlamps.
-
casablanca.png,
a gray-scale image from the movie "Casablanca",
460 pixels wide and 360 pixels high.
LAB #12: Naive Bayes Classification
-
lab12.pdf;
-
card_data.csv,
Collection of 50 playing cards;
There is an initial header line.
There are 50 records;
There are 4 fields: Index, Rank(1-13), Suit(1-4), Order(1-52);
-
crash_data.csv,
computer crash reports.
There is an initial header line.
There are 20 records.
There are 5 fields: index(1-20), OS(1=Linux,2=OSX,3=Windows),
LANG (1=C,2=Python), Browser(1=Chrome,2=Explorer,3=Firefox,4=Safari),
Crash (0=No,1=Yes).
-
loan_data.csv,
Data for a fraud-detection system.
There are 20 records.
There are 5 fields: ID(1-20), Credit History ("none", "paid", "current",
"arrears"), Guarantor ("none", "guarantor", "coapplicant" ),
Accommodation ("own", "rent", "free"), Fraud ("true", "false");
-
tax_data.csv,
Tax information.
There is an initial header line.
There are 15 records.
There are 5 fields: Index(1-15), Refund("true", "false"),
Status ("single", "married", "divorced"), Income (in thousands),
Cheating ("true", "false").
LAB #13: Markov Methods
-
lab13.pdf;
-
hyperlink_data.csv,
Hyperlink directed adjacency matrix for 16 web pages.
There are 16 records.
There are 16 fields: field j of record i is 1 if page i links to page j, and 0 otherwise.
-
risk_data.csv,
Regional adjacency matrix for the game of Risk.
There are 42 records.
There are 42 fields of 0 or 1: adjacency to region 1, region 2, ..., 42.
-
risk_map.png,
A map that displays the names and numbers of the 42 Risk regions.
-
risk_names.csv,
Region names for the game of Risk.
There is one initial record.
There are 42 records.
There are 2 fields: index, region name.
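One plausible way to use a 0/1 adjacency matrix like hyperlink_data.csv in a Markov setting is to turn it into a row-stochastic transition matrix and then repeatedly advance a probability distribution. The sketch below is an assumption about how such data might be used, not a summary of lab13.pdf. The 3-page matrix `links` is hypothetical, and sending a page with no outgoing links to every page with equal probability is one common convention among several.

```python
def transition_matrix(adjacency):
    """Convert a 0/1 adjacency matrix (row i links to column j)
    into a row-stochastic transition matrix."""
    n = len(adjacency)
    P = []
    for row in adjacency:
        out_links = sum(row)
        if out_links == 0:
            # Assumption: a page with no links jumps uniformly to any page.
            P.append([1.0 / n] * n)
        else:
            P.append([a / out_links for a in row])
    return P


def step(distribution, P):
    """Advance a probability (row) vector by one step of the chain."""
    n = len(P)
    return [sum(distribution[i] * P[i][j] for i in range(n)) for j in range(n)]


# Hypothetical 3-page web, NOT taken from hyperlink_data.csv.
links = [[0, 1, 1],
         [1, 0, 0],
         [0, 1, 0]]

P = transition_matrix(links)
dist = [1.0 / 3.0] * 3          # start from the uniform distribution
for _ in range(100):
    dist = step(dist, P)
print(dist)
```

For a chain that is irreducible and aperiodic, as this small example is, the repeated steps settle to a stationary distribution that ranks the pages by long-run visit frequency.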
LAB #14: Facial Recognition
-
lab14.pdf;
-
angela.zip,
12 images of Angela Merkel, provided as an example of a starting
collection for the exercise;
LAB #15: Vector and Matrix Norms
LAB #16: Curve Fitting
-
lab16.pdf;
-
sine_test.txt,
100 samples, at uniformly spaced points in [0,1], of the
function y=sin(2*pi*x)+0.1*randn(), used for testing.
-
sine_train.txt,
10 samples, at uniformly spaced points in [0,1], of the
function y=sin(2*pi*x)+0.1*randn(), used for training.
-
spring.txt,
records the results of Hooke's law experiments, in which
a spring is loaded with a given weight M, and the resulting deflection D
is measured. A linear relationship D=c*M is expected.
There are 20 records, with two fields, mass in kilograms, and
deflection in meters. From Guttag, 2016.
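Data sets like sine_train.txt and sine_test.txt can be produced from the stated formula y=sin(2*pi*x)+0.1*randn(). The sketch below is a Python illustration rather than the script that generated the files. The function name `make_sine_samples`, the seeds, and the choice to include both endpoints 0 and 1 among the uniformly spaced points are assumptions, so the values will not match the files.

```python
import math
import random


def make_sine_samples(n, noise=0.1, seed=None):
    """Return n points (x, y) with x uniformly spaced in [0, 1] and
    y = sin(2*pi*x) + noise * (standard normal sample)."""
    rng = random.Random(seed)
    # Assumption: both endpoints are included, which requires n >= 2.
    xs = [i / (n - 1) for i in range(n)]
    ys = [math.sin(2.0 * math.pi * x) + noise * rng.gauss(0.0, 1.0)
          for x in xs]
    return xs, ys


# Sizes follow the file descriptions: 10 training and 100 test samples.
train_x, train_y = make_sine_samples(10, seed=0)
test_x, test_y = make_sine_samples(100, seed=1)

for x, y in zip(train_x, train_y):
    print(f"{x:.4f}  {y:.4f}")
```

Keeping the training set small and the test set large makes overfitting easy to observe when a high-degree polynomial is fitted to the 10 training points.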
LAB #17: Projection
LAB #18: Expected Values
Last revised on 10 October 2020.