SELECT_LAB
Combine, Select or Merge Data


http://people.sc.fsu.edu/~jburkardt/public_html/r_src/select_lab/select_lab.html

select lab, an R lab project which investigates ways of combining, searching, and merging data.

Licensing:

The computer code and data files on this web page are distributed under the MIT license

It is often the case that the data you want to analyze must be prepared in some way before it is ready. The problem may simply be that parts of your data are stored in different datasets, in which case you need to extract and combine them in some way. Instead, you may find that all your data is available in a single file, but that there are many other records in that file that you want to exclude. In this case, your task is to select the data of interest to you. Another common situation occurs when a dataset contains multiple items of related information, which we want to merge before analyzing. This is something like starting with a deck of 52 cards and grouping them into four separate suits.

This lab will look at procedures for combining, selecting, or merging data so that we are able to carry out the analysis we are interested in.

1: Creating a Data File, Dealing with Missing Data

When a large set of statistical measurements are compiled to form a single dataset, there may be cases where certain measurements are missing or unknown. For instance a database of movies in which Brad Pitt had a role might want to list the title, year, and the amount of money made by ticket sales in the US market. However, for any early movie like "Cutting Class", the sales information might not have been recorded, and if the database is being compiled in 2011, the ticket sales for the movie "World War Z" will not even have started yet. Nonetheless, it may be important to include both these movies in the database, even though their information is incomplete. What, then, is to be done about the missing item of information?

Consider the task of entering data into R about some movies in which Brad Pitt had a role. We will start by supposing we have information about the title, release date, budget for several movies, as follows:
TitleReleaseBudget
BabelOctober 29, 2006$20,000,000
KaliforniaSeptember 3, 1993?
MegamindNovember 5, 2010$130,000,000
Mr and Mrs SmithJune 10, 2005$110,000,000
SnatchJanuary 19, 2000$10,000,000
TroyMay 14, 2004$150,000,000

We can store this information in a datafile so that it can be processed by R. Before doing so, we must note several issues:

1.1) Use a text editor to create a csv file, called flick.csv, containing the information from the above table.

1.2) Start the R program, and use the read.csv() command to read your data file, and store the result as a data frame called "flick1". Use the print() command to verify that your data has been read correctly, and to see how the missing budget item is displayed.

1.3) Now that we've entered the data, we decide that we want to include the distributor of each movie. We only know the distributors for four of the six movies; for the other two movies, we will list the distributor as "NA". Since we are adding a distributor for each movie, this is like adding a column to the table.

Title......Distributor
Babel......?
Kalifornia......MGM
Megamind......Dreamworks
Mr and Mrs Smith......?
Snatch......Sony
Troy......Warner

To add this data to the data frame, we first use the "combine" function c() to create a list of the new data.

        Distributor <- c ( "NA", "MGM", "Dreamworks", "NA", "Sony", "Warner" )
Now we create a new dataframe using the cbind() function to append the Distributor column to the columns of the old data frame:
        flick2 <- cbind ( flick1, Distributor )
Use the print() command to verify that "flick" now includes the Distributor information.

1.4) We now wish to add information for two more movies, which is essentially adding two new rows to our dataframe. The information we want to add is in the following table:

TitleReleaseBudgetDistributor
Ocean's ElevenDecember 7, 2001$85,000,000Warner
Thelma & LouiseMay 24, 1991$16,500,000?

Adding new rows to a dataframe is more awkward than adding columns, partly because the data along a row may be a mixture of character and numeric values. This means that, before we can combine the data, we have to put it into a separate data frame that uses the same column headings as "flick2".

Use the following R commands to create a data frame containing the information about the two new movies.

        Title <- c ( "Ocean's Eleven", "Thelma & Louise" )
        Release <- c ( "12/7/2001", "5/24/1991" )
        Budget <- c ( 85000000, 16500000 )
        Distributor <- c ( "Warner", "NA" )
        flick3 <- data.frame ( Title, Release, Budget, Distributor )

Now we can combine the two data frames, which have the same titles and the same number of columns, using the rbind() command:

        flick4 <- rbind ( flick2, flick3 );
      
Use the print() command to verify that the data frame "flick4" contains all the information.

1.5) Save a copy of the updated flick4 data frame as a CSV file, by using the write.csv() command:

        write.csv ( flick4, file = "flick4.csv" )
      

2: Selecting Data

Suppose we have a large set of data about movies, including such information as titles, distributor, budget, and gross sales. Since several hundred movies are released each year, it's useful to be able to select a particular subset of this information for analysis or graphical display.

2.1) This part of the lab requires that you start by copying a small CSV file called brad_pitt.csv. Now start R and use the read.csv() command to read the file. This file includes header information, so be sure to read that correctly. Store the result as the data frame pitt, and use the print() command to verify that the data was read correctly.

2.2) The data frame pitt contains all the information from the file. If we simply type pitt, R will print out this information. We will now look at ways of having R display only a portion of this information, using bracket notation:

         optional name <- data frame [ rows, columns ] 
Here rows and columns are information that selects the rows and columns we want.

Thus, we can select data if we know the numeric index of a row or column. To see row 1 (and all its columns):

        pitt [ 1, ] 
To select all rows, but just column 2:
        pitt [ , 2 ]
and to select row 3, column 4:
        pitt [ 3, 4 ]
To "select" the entire set of data, we can type
         pitt [ , ] 
although, of course, in this case we could have simply said pitt!

Rather than selecting a single row or column, we can use R's colon notation. Remember that 3:7 means 3, 4, 5, 6, 7, so we can ask for rows 3 to 7 by

         pitt [ 3:7, ] 
If the indices we want are not sequential, we can use the c() command to form a list. Thus, if we're interested in rows 2, 3, 7 and 11, the expression c(2,3,7,11) makes this a single list, and we can display exactly those rows, and just the title (in column 1) by
         pitt [ c(2,3,7,11), 1 ] 
Similarly, if we wanted to print just the gross receipts (column 3) followed by the title (column 1) for movies 9, 10, and 11, we could use the command
         pitt [ 9:11, c(3,1) ] 
Notice that the column data appears in exactly the order we requested. What do you think would happen if we specified c(3,1,3) for the columns?

However, we often don't want to select a row or column by its index, but rather because of some property that the item has. Perhaps we want to list the titles of all movies made in 1997. R allows us to do this using the bracket notation, but now we have to specify the rows we want. The column containing the movie year is called Year. You can print all the year information in the pitt dataframe by typing

         pitt$Year
You can ask which movies came out in 1997 by writing
         pitt$Year == 1997
Note that we need two equal signs when we are not assigning a value, but looking for occurrences of a value! Notice that this command produces a list of TRUE and FALSE values. If we use this command as a row selector, R will display exactly those movies that came out in 1997!
         pitt [ pitt$Year == 1997, 1 ]

Here are some examples of selections we could make, assuming that we are dealing with a numeric variable:

2.3) We would like to select the movies that Brad Pitt made between 1995 and 2000. The data frame stores the year as the variable pitt$Year. A movie was made on or after 1995 if it is true that:

        1995 <= pitt$Year
A movie was made between 1995 and 2000 if it is true that:
        1995 <= pitt$Year & pitt$Year <= 2000
We can use this logical expression to select the rows we want; for now, we will suppose that all the column information is interesting to us. The format of the command we want to use is
        new dataframe <- old dataframe [ rows, columns ]
so, to select from pitt the rows (movies) in the given year range, we type:
        pitt2 <- pitt [ 1995 <= pitt$Year & pitt$Year <= 2000,  ]
Notice the comma in the above command; it separates the row selection (range of years) from the column selection (which we did not use).

2.4) Now we want to consider movies made after 2007, and we only want to list the title (column 1) and gross receipts (column 3). What form of the bracket command will display title and gross for just these 9 movies?

        pitt [ ?, ? ]

3: Dealing with Missing Data

We'd like to print and plot the data in the gross receipts column of the data file. However, note that this column has many NA entries, either because the data has been lost over time, or because the movie hadn't actually been released when the data file was created. The NA values work fine when we print information. But how is R going to handle them when we try to select or plot information?

3.1) To try to get an idea of whether there will be a problem, let's use the bracket notation, and ask for a list of all movies for which the gross receipts were more than $150,000,000. If you check the data first, by using the print() command, you will see that there are just two such movies, "Mr and Mrs Smith" and "Ocean's Eleven". Let's try asking R this question:
        pitt [ 150000000 <= pitt$Gross, ]
Unfortunately, we see the two movies we want, but also every movie for which the gross receipt information has the NA value.

If we are interested in gross receipts, then the records for which no such information is available are of no interest to us. And it looks as though our selection commands won't work properly when NA data is involved. What can we do?

3.2) The right approach is to make a copy of the data frame that drops those records that don't have gross receipts information. In other words, we want to make a new data frame by selecting those records for which the gross receipts are not "NA". To ask if a number is an NA, we write

        is.na ( number )
To ask if a number is not an NA, we write
        !is.na ( number )
To ask if the number pitt$Gross is not an NA, we write
        !is.na ( pitt$Gross )

Make a new data frame, called pitt2, containing the records for which gross receipt information is available, by typing

        pitt2 <- pitt [ !is.na ( pitt$Gross), ]
 
The print() command can be used to verify that your command selected the correct data.

3.3) Now repeat your question, asking which movies had gross receipts of $150,000,000 or more, but ask it about the pitt2 data frame! If you ask this question correctly, you see exactly two movies listed.

3.4) Use the barplot() command on the pitt2 gross receipts data to illustrate the hits and flops.

4: Grouping Data

4.1) This part of the lab requires that you start by copying a large CSV file called movies_2010.csv. Now start R and use the read.csv() command to read this file into the data frame "movie1". This file includes header information, so be sure to read that correctly. Since the file is large, don't use the print() command! Instead, try the summary() command. The first column is labeled "Rank", and you should see that it runs from 1 to 670; indeed, there are 670 movies in this file.

4.2) Suppose we want to look at the gross receipts for movies distributed by IFC Films. That means we want to look at movies based on the "Distributor" field. However, as the summary() command indicates, there are 39 movies for which the "Distributor" is listed as "NA". R has trouble working with NA data, so our first task is to make a new dataframe containing only the movies for which "Distributor" is not "NA".

The movies (rows) we want to select are those for which the distributor is not an NA value. So we are looking at the movie1 dataframe, at column Distributor, and we want to select those rows for which this value is not NA. R uses the function is.na() to report that a value is NA; an exclamation mark reverses this check, so that we are picking up the non-NA values:

        movie2 <- movie1 [ !is.na ( movie1$Distibutor ), ]
      
Use the summary() command on the movie2 data frame. Under the Distributor heading, there should be no NA values. If you look closely at the Title heading, you should be able to see that there are now 631 movies in the new data frame.

4.3) Now that the Distributor column is cleaned up, we can make a new data frame that selects movies based on the Distibutor value. In particular, we could pick out movies distributed by IFC Films:

        movie3 <- movie2 [ movie2$Distributor == "IFC Films", ]
      
Here, the summary() command should show 33 films in the new data frame; the highest ranking movie came in at 182.

4.4) Since there are just a few movies in this data frame, use the print() command to report the information for them.

4.5) Use the barplot() command to plot the gross receipts for each of the movies distributed by IFC Films. This plot should suggest that IFC Films had just two moderately successful films (gross more than $1,000,000) while the remaining films typically had gross receipts of around $500,000 or less.


Last modified on 24 November 2011.