LINEAR_LAB


http://people.sc.fsu.edu/~jburkardt/public_html/r_src/linear_lab/linear_lab.html

LINEAR_LAB is an R lab project which searches for linear relationships between pairs of data.

Licensing:

The computer code and data files on this web page are distributed under the MIT license.

1: The Taxicab Fare Data:

Copy the data file tally_cab.csv, which is a "CSV" (comma separated values) file, with a header, containing a table of several taxicab rides, including the miles traveled and the fare charged. In part 1 of this lab, you will read this data into R, plot it, and then try to establish a relationship between the miles traveled and the fare paid.

1.1) Use the read.csv() command to read the taxicab data from the file 'tally_cab.csv', storing it in a variable. We suggest using the name taxi for this variable.

1.2) Use the print() statement to print a listing of taxi, which will include a header line with labels distance and fare, followed by 8 numbered lines of data. Each column of the data can be identified by its header. For instance, the first column has the name taxi$distance. Use the print() command again, but this time only print taxi$distance.
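
As a rough sketch of steps 1.1 and 1.2, the commands below read a small table and print it. The three rides are made up for illustration and are not the contents of tally_cab.csv. The names csv_text and taxi_demo are hypothetical, and read.csv(text = ...) is used only so that the example runs without the data file.

```r
# Hypothetical stand-in for tally_cab.csv: these three rides are made up.
csv_text <- "distance,fare
1.0,4.50
2.0,7.00
4.0,12.00"

# With the real file, this would be: taxi <- read.csv("tally_cab.csv")
taxi_demo <- read.csv(text = csv_text)

print(taxi_demo)            # whole table, with header and row numbers
print(taxi_demo$distance)   # just the distance column
print(names(taxi_demo))     # the column headers
```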

1.3) Use the plot() command to display the taxi data. This should produce a graphical image in which each data value appears as a small circle.

1.4) We expect that there is a simple relationship between the distance traveled and the fare charged. For instance, this relationship might be simply that there is a 3 dollar charge per mile. Before we explore the relationship, make a copy of the distance data with the shortened name of x:

        x <- taxi$distance
      
If the cab driver charged 3 dollars per mile, then we could compute the fare for each of these trips by the following mathematical formula:
        y1 = 3 * x
      
To actually evaluate this formula on our data, the corresponding R command is simply
        y1 <- 3 * x
      
Now add a line to your current plot, displaying this guess for the relationship between the distance and fare, using the command:
        lines ( x, y1 )
      
The plot suggests that the formula is not very accurate, especially for the longest trip.
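
Step 1.4 can be sketched from start to finish as follows. The rides here are invented (x_demo and fare_demo are hypothetical names and values), so the picture will differ from the one you get with the real taxi data.

```r
# Made-up rides for illustration only (not the tally_cab.csv values).
x_demo    <- c(1, 2, 4, 8)
fare_demo <- c(4.5, 7, 12, 22)

y1_demo <- 3 * x_demo          # guessed fare at 3 dollars per mile

plot(x_demo, fare_demo)        # data shown as small circles
lines(x_demo, y1_demo)         # the guess, drawn on the same plot

# Gap between the actual and the guessed fare for each ride.
print(fare_demo - y1_demo)
```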

1.5) R includes a function called lm() which can search for a linear relationship between two data vectors. If the data items are called x and y, then the relationship is approximately explained by the following mathematical formula:

        y = slope * x + intercept
      
Given sets of data x and y, the R command lm(y~x) will search for values of slope and intercept that most closely match the data, to show y as a linear function of x. To find this relationship for our data, use the lm() command to ask for a linear relationship that represents taxi$fare as a function of taxi$distance.
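
To see what lm() reports, it can help to try it first on data whose answer is known. The sketch below builds made-up rides that follow fare = 2 + 2.5 * distance exactly, so lm() should return those two numbers. The names taxi_demo and fit_demo are hypothetical.

```r
# Made-up data that follows fare = 2 + 2.5 * distance exactly.
taxi_demo <- data.frame(distance = c(1, 2, 4, 8))
taxi_demo$fare <- 2 + 2.5 * taxi_demo$distance

# Fare as a linear function of distance.
fit_demo <- lm(taxi_demo$fare ~ taxi_demo$distance)
print(fit_demo)
```

The printed "Coefficients" list the intercept first and then the slope, which is labeled with the name of the variable on the right of the ~ sign.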

1.6) The output of the lm() command should include an intercept and a slope (which R will call the "coefficient of taxi$distance"). Earlier, you made a copy of taxi$distance and called it x. Using the values of intercept and slope reported by lm(), compute the estimated taxi fares:

        y2 <- slope * x + intercept
      
and add this second line to your current plot using the command
        lines ( x, y2 )
      
Your plot should now contain points for the actual data, and two lines that represent estimates for the relationship between the data values.
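
Instead of retyping the numbers printed by lm(), the values can be pulled out of the result with coef(). This is one possible approach, shown on made-up rides; x_demo, fare_demo, fit_demo and y2_demo are hypothetical names.

```r
# Made-up, slightly noisy rides for illustration only.
x_demo    <- c(1, 2, 4, 8)
fare_demo <- c(4.4, 7.2, 11.9, 22.1)

fit_demo  <- lm(fare_demo ~ x_demo)

# coef() returns a named vector; pick out each value by its name.
intercept <- coef(fit_demo)[["(Intercept)"]]
slope     <- coef(fit_demo)[["x_demo"]]

# Estimated fares from the fitted line.
y2_demo <- slope * x_demo + intercept
print(y2_demo)
```

After a plot() of the data, lines ( x_demo, y2_demo ) would add this fitted line in the same way as before.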

(Note that the actual taxicab rates in Tallahassee specify an initial charge of 2 dollars, and 25 cents for each 1/10 of a mile traveled.)
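
The rates quoted in this note can be written as a formula to compare with your lm() result. The sketch assumes the distance charge is prorated smoothly, with no rounding to the nearest tenth of a mile; the note does not say how partial tenths are billed. The function name official_fare is hypothetical.

```r
# Assumption: 25 cents per 1/10 mile, prorated with no rounding.
official_fare <- function(miles) {
  tenths <- miles * 10          # number of 1/10-mile units traveled
  2.00 + 0.25 * tenths          # initial charge plus distance charge
}

# Equivalent to intercept = 2 and slope = 2.5 dollars per mile.
print(official_fare(c(0, 1, 4)))
```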

2: The Size and Weight of Lead Shot:

Copy the data file lead_shot.csv, which is a "CSV" (comma separated values) file, with a header, containing a table of the type, size, and weight of various kinds of lead shot. In part 2 of this lab, you will read this data into R, plot it, and then try to establish a relationship between the measured diameter d and weight w. We will find that this relationship is not linear, but that if we modify the data in the right way, R will be able to find a simple relationship for us.

2.1) Use the read.csv() command to read the data from the file 'lead_shot.csv', storing it in a variable. We suggest using the name shot for this variable.

2.2) Use the print() statement to print a listing of the shot data.

The columns of interest to us are the weight in ounces (column 2, "ounce"), the diameter in inches (column 4, "inch"), and a measurement called "PPO" (column 6, "ppo"). To refer to any column of the data, you can use the name of the data, followed by a dollar sign and the column heading. Thus, the weight data can be referred to by the name shot$ounce. For convenience, make temporary copies of these columns of the data, called d for "diameter", w for "weight", and ppo for "ppo". For instance, your first command might be:

        d <- shot$inch
      

2.3) Use the plot() command to display the d data on the x axis, versus the w data on the y axis. By looking at the graph, you may conclude that this is probably not a linear relationship.

2.4) The plot command allows us to request plots in which formulas or expressions are involved. In particular, we can make a plot of the square of d versus w simply by using the command:

        plot ( d^2, w )
      
Make this plot of d^2 versus w. Then plot d^3 versus w. In the second plot, the data seems to form a straight line. This suggests there is a relationship that can be expressed in the mathematical formula:
        w = slope * d^3 + intercept
      

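2.5) Just like the plot() command, we can ask the lm() command to find a relationship in which we have squared or cubed one of the variables. Inside an lm() formula, however, the ^ symbol has a special meaning, so the power must be wrapped in I(), as in lm(w~I(d^3)). Use the lm() command to find a relationship for w in terms of d^3. What are the values of slope and intercept?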
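
One detail matters when a power appears in an lm() formula: R treats ^ there as a formula operator, not as arithmetic, so the cube has to be protected with I(). The sketch below shows the difference on made-up data that follows w = 4 * d^3 exactly; the values and the names d_demo, w_demo, fit_cubic and fit_plain are hypothetical.

```r
# Made-up data following w = 4 * d^3 exactly; not the lead_shot.csv values.
d_demo <- c(0.1, 0.2, 0.3, 0.4)
w_demo <- 4 * d_demo^3

# I() makes ^ mean the arithmetic power inside the formula.
fit_cubic <- lm(w_demo ~ I(d_demo^3))
print(coef(fit_cubic))

# Without I(), the formula operator ^ is applied and the model is just w ~ d.
fit_plain <- lm(w_demo ~ d_demo^3)
print(names(coef(fit_plain)))
```

Another option is to store the cube in its own variable first, for example d3 <- d^3, and then fit w against d3.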

2.6) Use the plot() command to plot w versus ppo. The data seems to lie on a sharply curved figure, not a straight line. In fact, as the size of w increases, ppo decreases. This is sometimes called an "inverse" relationship. Use the plot() command again, but this time plot w versus 1/ppo, and notice that now there seems to be a simple relationship between the two quantities.
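
The same idea of transforming a variable before fitting applies to an inverse relationship. In this sketch the data are made up so that w = 3 / ppo, where the constant 3 is arbitrary and is not taken from lead_shot.csv. The names ppo_demo, w_demo, inv_ppo and fit_inv are hypothetical.

```r
# Made-up data with w = 3 / ppo; the constant 3 is arbitrary.
ppo_demo <- c(10, 20, 40, 80)
w_demo   <- 3 / ppo_demo

# Transform first, then look for a straight line in the new variable.
inv_ppo <- 1 / ppo_demo
fit_inv <- lm(w_demo ~ inv_ppo)
print(coef(fit_inv))
```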

(The relationship between the diameter and weight of a typical lead shot pellet is not linear. However, we found that there was a linear relationship between the weight and a power of the diameter. The R command lm() can usually work out the details of such a linear relationship, if we can find a new way of looking at the data that makes the linear relationship evident.)

3: Predicting a Giant

Copy the data file height_male_young.csv, which is a "CSV" (comma separated values) file with a header, containing a table of heights, in inches, for young males between the ages of 0 and 20. In fact, the table has 9 columns of height data. The column labeled "p50" lists the median height, that is, the height that 50 percent of males are above and 50 percent are below. Similarly, the "p25" column lists a height which 25 percent of males are below.

We will concentrate on the "p50" column. By plotting this data, we will notice a linear relationship that seems to hold over most of the range of ages. We will try to determine that linear relationship, and then use it to predict the height of a male at age 40.

3.1) Use the read.csv() command to read the young male height data from the file 'height_male_young.csv', storing it in a variable. This discussion will assume you've named the variable hm.

3.2) Use the print() statement to print a listing of hm. Notice the column containing the age. Notice that heights increase whether you read down a column or across a row. Notice that the column labeled "p50" contains the median height, which we will want to concentrate on. Extract a copy of the "age" and "p50" columns of data. We will assume you call the first variable age and the second variable height.

3.3) Use the plot() statement to display age versus height. You should notice that the plot data seems to form three segments. There is a short period of rapid growth, then an extensive period of moderate growth, followed by a sudden change to almost no growth.

3.4) We will focus on the period of moderate growth. Estimate the first and last years during which the moderate growth rate holds. Make new variables age2 and height2 which copy only the data over this interval. For instance, to copy just the data between ages 5 to 10, the commands would be:

        age2 <- age[6:11]
        height2 <- height[6:11]
      
(The data for ages 5 to 10 is stored in entries 6 through 11, because we included data for age 0! Also, please note that ages 5 to 10 is not a good response for this question!)
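
To avoid the off-by-one bookkeeping between ages and entry numbers, the subset can also be chosen by a condition on the age itself. The sketch uses made-up heights and reuses the ages 5 to 10 only because the text above does. The names age_demo, height_demo and keep are hypothetical.

```r
# Made-up ages and heights, for illustration only.
age_demo    <- 0:20
height_demo <- 20 + 2.5 * age_demo

# Select by entry number, as in the text ...
age2_by_index <- age_demo[6:11]

# ... or select by a condition on the age itself.
keep <- age_demo >= 5 & age_demo <= 10
age2_by_value    <- age_demo[keep]
height2_by_value <- height_demo[keep]

print(age2_by_value)
print(height2_by_value)
```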

3.5) Use the plot() command on your subset of data to see that you've extracted data that seems to lie along a line. Now use the lm() command to ask for a linear relationship that relates height2 to age2. The values of slope and intercept that are computed indicate that, based on the observed data, the relationship can be explained by the following mathematical formula:

        height = slope * age + intercept
      

3.6) Now we will test this formula, by having it predict the height of males between ages 0 and 40. To do this, set up a new age vector by the command

        age3 <- 0 : 40
      
Now compute the corresponding predicted heights by plugging in the values of slope and intercept:
        height3 <- slope * age3 + intercept
      

3.7) What does your formula predict as the height of the average 40 year old male? Use the following form of the plot() command to display the prediction line with the data it was based on:

        plot ( age3, height3, type = "l" )
        points ( age, height )
      

The linear formula that the lm() command found was good enough to describe the data we observed. Since the formula allows us to plug in any age, it will produce predictions for ages far outside the range of data we observed, even for negative ages. We know people don't continue to grow throughout adulthood; similarly, it can be dangerous to "discover" a relationship based on a set of data, and then use that relationship to predict behavior for data that is far beyond the observed range.
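
The danger described above can be sketched on made-up data. The heights below are invented: steady growth of 2.5 inches per year to age 15, then almost none. They are not the values in height_male_young.csv. The example also uses predict() as an alternative to typing slope and intercept by hand; hm_demo, growth and fit_demo are hypothetical names.

```r
# Made-up median heights (inches): steady growth to age 15, then nearly flat.
age_demo    <- 0:20
height_demo <- c(30 + 2.5 * (0:15), 67.5 + 0.4 * (1:5))
hm_demo     <- data.frame(age = age_demo, height = height_demo)

# Fit only the period of steady growth (here assumed to be ages 3 to 14).
growth   <- hm_demo[hm_demo$age >= 3 & hm_demo$age <= 14, ]
fit_demo <- lm(height ~ age, data = growth)

# Predict for ages 0 to 40; newdata must use the same column name, age.
age3    <- 0:40
height3 <- predict(fit_demo, newdata = data.frame(age = age3))

print(height3[age3 == 40])     # extrapolated far beyond the data
print(max(hm_demo$height))     # tallest value actually in the table
```

The prediction at age 40 is far larger than any height in the table, even though the line fits the fitted range perfectly.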


Last modified on 15 September 2011.