Common problems:

  • Data comes in lots of files with same internal structure
  • File names contain information needed for the analysis but not contained in variables in the file.
  • Files are in some directory on the web

There a number of pretend classlists in http://blackwell.math.yorku.ca/MATH4939/data/clist_exercise

A number of files are classlists from a number of sections of STAT1000 at WOSU (West Overshoe State University) in the 2016-17 academic session that includes the summer 2017 term and sections of STAT2000 in the summer of 2017 and in the fall and winter terms of 2017-18.

Conveniently, WOSU has the same set of programs and faculty names as York University so they will be familiar.

The head of the Statistics Program has some questions about retention:

  • what proportion of students in STAT1000 go on to take STAT2000
  • is the proportion similar from different sections of STAT1000, perhaps adjusting for the programs of students in STAT1000
  • describe the profiles of statistics majors in STAT2000? How many are in a Statistics program and likely to go on? Had most students already decided on a statistics major in STAT1000 or did many change their major between the two courses? If so, from which majors?

Information about faculty and programs is contained in a character field that is coded in a way that makes extraction challenging.

Your challenge is to do everything in R code, i.e. don’t touch the data files themselves:

  1. Find the data set names.
  2. Read the data into data frames incorporating information from the data set names into the data sets.
  3. Create variables for the infomation you need.
  4. Merge data sets as is appropriate to get information on each student combined from the various courses each student took.
  5. Analyze the data to answer the questions above.

Finding data set names

If the data sets were in a local directory you could get their names with list.files. Unfortunately, list.files doesn’t work with a URL instead of a local path.

url <- 'http://blackwell.math.yorku.ca/MATH4939/data/clist_exercise/'
library(XML)
z <- htmlParse(url) # extracts HTML version of directory listing
z <- xpathSApply(z, '//a/@href') # extracts content of href tags
z
##                             href                             href 
##                       "?C=N;O=D"                       "?C=M;O=A" 
##                             href                             href 
##                       "?C=S;O=A"                       "?C=D;O=A" 
##                             href                             href 
##                "/MATH4939/data/"   "STAT1000_2016_Fall_Sec_A.csv" 
##                             href                             href 
##   "STAT1000_2016_Fall_Sec_B.csv" "STAT1000_2017_Summer_Sec_A.csv" 
##                             href                             href 
## "STAT1000_2017_Winter_Sec_A.csv" "STAT2000_2016_Summer_Sec_A.csv" 
##                             href                             href 
## "STAT2000_2017_Winter_Sec_A.csv" "STAT2000_2018_Winter_Sec_A.csv"

We only want the ones that start with ‘STAT’

library(spida2)
filenames <- grepv('^STAT', z)
filenames
##                             href                             href 
##   "STAT1000_2016_Fall_Sec_A.csv"   "STAT1000_2016_Fall_Sec_B.csv" 
##                             href                             href 
## "STAT1000_2017_Summer_Sec_A.csv" "STAT1000_2017_Winter_Sec_A.csv" 
##                             href                             href 
## "STAT2000_2016_Summer_Sec_A.csv" "STAT2000_2017_Winter_Sec_A.csv" 
##                             href 
## "STAT2000_2018_Winter_Sec_A.csv"
  • Read the files and save the data frames in a list
  • Add the filename as a variable in the data frame
cls <- lapply(filenames, function(nn) {
  dd <- read.csv(paste0(url,nn))
  dd$course <- nn
  dd
})

Check this worked:

lapply(cls, head, 2)
## $href
##   X   number                program year                       course
## 1 1 54296119 LE BSC H SP LE/CSEC NO    1 STAT1000_2016_Fall_Sec_A.csv
## 2 2 53784527 SC BSC H SP SC/MAED NO    1 STAT1000_2016_Fall_Sec_A.csv
## 
## $href
##   X   number                 program year                       course
## 1 1 51526050 LE BSC H HO LE/COSC PRO   03 STAT1000_2016_Fall_Sec_B.csv
## 2 2 51648341   AP BA H HO AP/HESO NO   04 STAT1000_2016_Fall_Sec_B.csv
## 
## $href
##   X   number                program year                         course
## 1 1 54182515 AP BA H HO AP/ITEC PRO   02 STAT1000_2017_Summer_Sec_A.csv
## 2 2 54717026 AP BA H HO AP/ITEC PRO   02 STAT1000_2017_Summer_Sec_A.csv
## 
## $href
##   X   number                      program year
## 1 1 54572616        SC BA O OR SC/MATH NO   01
## 2 2 53526303 LE BSC H DM LE/COSC APMA PRO   03
##                           course
## 1 STAT1000_2017_Winter_Sec_A.csv
## 2 STAT1000_2017_Winter_Sec_A.csv
## 
## $href
##   X   number                                    program year
## 1 1 53526303               LE BSC H DM LE/COSC APMA PRO   04
## 2 2 55492938 SC NO                                  NA   N/A
##                           course
## 1 STAT2000_2016_Summer_Sec_A.csv
## 2 STAT2000_2016_Summer_Sec_A.csv
## 
## $href
##   X   number                        program year
## 1 1 53436098       SC BA H HO SC/MATC>AC NO   02
## 2 2 53732643 SC BSC H DM SC/MATH PHAS>PH NO   02
##                           course
## 1 STAT2000_2017_Winter_Sec_A.csv
## 2 STAT2000_2017_Winter_Sec_A.csv
## 
## $href
##   X   number                   program year                         course
## 1 1 53901761 LE BSC H MM COSC MATH PRO   03 STAT2000_2018_Winter_Sec_A.csv
## 2 2 53700981  SC BA H HO SC/MATC>AC NO   02 STAT2000_2018_Winter_Sec_A.csv

Name the elements of the list

names(cls) <- filenames
lapply(cls, head, 2)
## $STAT1000_2016_Fall_Sec_A.csv
##   X   number                program year                       course
## 1 1 54296119 LE BSC H SP LE/CSEC NO    1 STAT1000_2016_Fall_Sec_A.csv
## 2 2 53784527 SC BSC H SP SC/MAED NO    1 STAT1000_2016_Fall_Sec_A.csv
## 
## $STAT1000_2016_Fall_Sec_B.csv
##   X   number                 program year                       course
## 1 1 51526050 LE BSC H HO LE/COSC PRO   03 STAT1000_2016_Fall_Sec_B.csv
## 2 2 51648341   AP BA H HO AP/HESO NO   04 STAT1000_2016_Fall_Sec_B.csv
## 
## $STAT1000_2017_Summer_Sec_A.csv
##   X   number                program year                         course
## 1 1 54182515 AP BA H HO AP/ITEC PRO   02 STAT1000_2017_Summer_Sec_A.csv
## 2 2 54717026 AP BA H HO AP/ITEC PRO   02 STAT1000_2017_Summer_Sec_A.csv
## 
## $STAT1000_2017_Winter_Sec_A.csv
##   X   number                      program year
## 1 1 54572616        SC BA O OR SC/MATH NO   01
## 2 2 53526303 LE BSC H DM LE/COSC APMA PRO   03
##                           course
## 1 STAT1000_2017_Winter_Sec_A.csv
## 2 STAT1000_2017_Winter_Sec_A.csv
## 
## $STAT2000_2016_Summer_Sec_A.csv
##   X   number                                    program year
## 1 1 53526303               LE BSC H DM LE/COSC APMA PRO   04
## 2 2 55492938 SC NO                                  NA   N/A
##                           course
## 1 STAT2000_2016_Summer_Sec_A.csv
## 2 STAT2000_2016_Summer_Sec_A.csv
## 
## $STAT2000_2017_Winter_Sec_A.csv
##   X   number                        program year
## 1 1 53436098       SC BA H HO SC/MATC>AC NO   02
## 2 2 53732643 SC BSC H DM SC/MATH PHAS>PH NO   02
##                           course
## 1 STAT2000_2017_Winter_Sec_A.csv
## 2 STAT2000_2017_Winter_Sec_A.csv
## 
## $STAT2000_2018_Winter_Sec_A.csv
##   X   number                   program year                         course
## 1 1 53901761 LE BSC H MM COSC MATH PRO   03 STAT2000_2018_Winter_Sec_A.csv
## 2 2 53700981  SC BA H HO SC/MATC>AC NO   02 STAT2000_2018_Winter_Sec_A.csv

Next steps:

  • Extract any variables your need from ‘program’ string using combinations of strsplit, sub, gsub, grep, grepv
    • LE BSC H MM COSC MATH PRO:
      • LE: Lassonde Engineering
      • BSC: registered in a BSc program
      • H: Honours (could be O for ordinary)
      • HO, OR, DM, MM, SP, …: Honours, Ordinary, Double Major, Major Minor, Specialist, …
      • COSC, SC/MATC>AC, …: Computer Science, Science Math for Commerce Actuarial Science Stream, …
      • PRO, BL, NO: Professional, Bilingual, Nothing special!
      • Some program strings don’t follow this pattern, in particular those for special students not getting credits towards a degree.
  • Do the same from the ‘course’ string.
  • You might need to form sublists of class lists, e.g. a list of classlists for the 1000-level course, etc.
  • Combine files as needed to get information to answer questions. Consider using merge, rbind, up, towide, tolong, …
    Note that if many data frames have the exactly the same variable names in the same position, you can stack them with:
    bigdf <- rbind(df1, df2, df3)
    but to do that you need to know the number and names of the data frames.
    If the data frames are in a list, you should use:
    bigdf <- do.call(rbind, dflist)
    You can only merge two data frames at a time. If you need to iteratively merge a list of data frames, consider using Reduce:
    z <- Reduce(function(x, y) merge(x, y, all = T), cls)