Common problems:
There are a number of pretend classlists in http://blackwell.math.yorku.ca/MATH4939/data/clist_exercise
The files are classlists from several sections of STAT1000 at WOSU (West Overshoe State University) in the 2016-17 academic session, including the summer 2017 term, and from sections of STAT2000 in the summer of 2017 and in the fall and winter terms of 2017-18.
Conveniently, WOSU has the same set of programs and faculty names as York University so they will be familiar.
The head of the Statistics Program has some questions about retention:
Information about faculty and programs is contained in a character field that is coded in a way that makes extraction challenging.
Your challenge is to do everything in R code, i.e. don’t touch the data files themselves:
If the data sets were in a local directory you could get their names with list.files. Unfortunately, list.files doesn’t work with a URL instead of a local path.
url <- 'http://blackwell.math.yorku.ca/MATH4939/data/clist_exercise/'
library(XML)
z <- htmlParse(url) # extracts HTML version of directory listing
z <- xpathSApply(z, '//a/@href') # extracts content of href tags
z
## href href
## "?C=N;O=D" "?C=M;O=A"
## href href
## "?C=S;O=A" "?C=D;O=A"
## href href
## "/MATH4939/data/" "STAT1000_2016_Fall_Sec_A.csv"
## href href
## "STAT1000_2016_Fall_Sec_B.csv" "STAT1000_2017_Summer_Sec_A.csv"
## href href
## "STAT1000_2017_Winter_Sec_A.csv" "STAT2000_2016_Summer_Sec_A.csv"
## href href
## "STAT2000_2017_Winter_Sec_A.csv" "STAT2000_2018_Winter_Sec_A.csv"
We only want the ones that start with ‘STAT’:
library(spida2)
filenames <- grepv('^STAT', z)
filenames
## href href
## "STAT1000_2016_Fall_Sec_A.csv" "STAT1000_2016_Fall_Sec_B.csv"
## href href
## "STAT1000_2017_Summer_Sec_A.csv" "STAT1000_2017_Winter_Sec_A.csv"
## href href
## "STAT2000_2016_Summer_Sec_A.csv" "STAT2000_2017_Winter_Sec_A.csv"
## href
## "STAT2000_2018_Winter_Sec_A.csv"
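If spida2 is not installed, the same filtering can be sketched in base R with grep(..., value = TRUE). The vector below is a shortened copy of the hrefs printed above, and filenames_base is an illustrative name that does not appear in the original.

```r
# A few directory-listing hrefs, copied from the output above
z <- c("?C=N;O=D", "/MATH4939/data/",
       "STAT1000_2016_Fall_Sec_A.csv", "STAT2000_2018_Winter_Sec_A.csv")
names(z) <- rep("href", length(z))  # mimic the names returned by xpathSApply

# value = TRUE makes grep return the matching elements rather than their positions
filenames_base <- grep("^STAT", z, value = TRUE)

# grep keeps the element names, so drop the repeated 'href' labels
filenames_base <- unname(filenames_base)
print(filenames_base)
```

Dropping the names is optional, but it keeps every element of a list built with lapply from being labelled 'href'.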
cls <- lapply(filenames, function(nn) {
  dd <- read.csv(paste0(url, nn)) # read each file directly from the URL
  dd$course <- nn # record which file each row came from
  dd
})
Check this worked:
lapply(cls, head, 2)
## $href
## X number program year course
## 1 1 54296119 LE BSC H SP LE/CSEC NO 1 STAT1000_2016_Fall_Sec_A.csv
## 2 2 53784527 SC BSC H SP SC/MAED NO 1 STAT1000_2016_Fall_Sec_A.csv
##
## $href
## X number program year course
## 1 1 51526050 LE BSC H HO LE/COSC PRO 03 STAT1000_2016_Fall_Sec_B.csv
## 2 2 51648341 AP BA H HO AP/HESO NO 04 STAT1000_2016_Fall_Sec_B.csv
##
## $href
## X number program year course
## 1 1 54182515 AP BA H HO AP/ITEC PRO 02 STAT1000_2017_Summer_Sec_A.csv
## 2 2 54717026 AP BA H HO AP/ITEC PRO 02 STAT1000_2017_Summer_Sec_A.csv
##
## $href
## X number program year
## 1 1 54572616 SC BA O OR SC/MATH NO 01
## 2 2 53526303 LE BSC H DM LE/COSC APMA PRO 03
## course
## 1 STAT1000_2017_Winter_Sec_A.csv
## 2 STAT1000_2017_Winter_Sec_A.csv
##
## $href
## X number program year
## 1 1 53526303 LE BSC H DM LE/COSC APMA PRO 04
## 2 2 55492938 SC NO NA N/A
## course
## 1 STAT2000_2016_Summer_Sec_A.csv
## 2 STAT2000_2016_Summer_Sec_A.csv
##
## $href
## X number program year
## 1 1 53436098 SC BA H HO SC/MATC>AC NO 02
## 2 2 53732643 SC BSC H DM SC/MATH PHAS>PH NO 02
## course
## 1 STAT2000_2017_Winter_Sec_A.csv
## 2 STAT2000_2017_Winter_Sec_A.csv
##
## $href
## X number program year course
## 1 1 53901761 LE BSC H MM COSC MATH PRO 03 STAT2000_2018_Winter_Sec_A.csv
## 2 2 53700981 SC BA H HO SC/MATC>AC NO 02 STAT2000_2018_Winter_Sec_A.csv
Name the elements of the list (they all inherited the name ‘href’ from filenames):
names(cls) <- filenames
lapply(cls, head, 2)
## $STAT1000_2016_Fall_Sec_A.csv
## X number program year course
## 1 1 54296119 LE BSC H SP LE/CSEC NO 1 STAT1000_2016_Fall_Sec_A.csv
## 2 2 53784527 SC BSC H SP SC/MAED NO 1 STAT1000_2016_Fall_Sec_A.csv
##
## $STAT1000_2016_Fall_Sec_B.csv
## X number program year course
## 1 1 51526050 LE BSC H HO LE/COSC PRO 03 STAT1000_2016_Fall_Sec_B.csv
## 2 2 51648341 AP BA H HO AP/HESO NO 04 STAT1000_2016_Fall_Sec_B.csv
##
## $STAT1000_2017_Summer_Sec_A.csv
## X number program year course
## 1 1 54182515 AP BA H HO AP/ITEC PRO 02 STAT1000_2017_Summer_Sec_A.csv
## 2 2 54717026 AP BA H HO AP/ITEC PRO 02 STAT1000_2017_Summer_Sec_A.csv
##
## $STAT1000_2017_Winter_Sec_A.csv
## X number program year
## 1 1 54572616 SC BA O OR SC/MATH NO 01
## 2 2 53526303 LE BSC H DM LE/COSC APMA PRO 03
## course
## 1 STAT1000_2017_Winter_Sec_A.csv
## 2 STAT1000_2017_Winter_Sec_A.csv
##
## $STAT2000_2016_Summer_Sec_A.csv
## X number program year
## 1 1 53526303 LE BSC H DM LE/COSC APMA PRO 04
## 2 2 55492938 SC NO NA N/A
## course
## 1 STAT2000_2016_Summer_Sec_A.csv
## 2 STAT2000_2016_Summer_Sec_A.csv
##
## $STAT2000_2017_Winter_Sec_A.csv
## X number program year
## 1 1 53436098 SC BA H HO SC/MATC>AC NO 02
## 2 2 53732643 SC BSC H DM SC/MATH PHAS>PH NO 02
## course
## 1 STAT2000_2017_Winter_Sec_A.csv
## 2 STAT2000_2017_Winter_Sec_A.csv
##
## $STAT2000_2018_Winter_Sec_A.csv
## X number program year course
## 1 1 53901761 LE BSC H MM COSC MATH PRO 03 STAT2000_2018_Winter_Sec_A.csv
## 2 2 53700981 SC BA H HO SC/MATC>AC NO 02 STAT2000_2018_Winter_Sec_A.csv
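The file names carry the course, year, term and section, so one possible follow-up is to split them into separate variables. This is only a sketch: it assumes every name follows the pattern COURSE_YEAR_TERM_Sec_X.csv seen above, and the data frame info and its column names are illustrative, not part of the original exercise.

```r
# Two of the file names used for names(cls) above
filenames <- c("STAT1000_2016_Fall_Sec_A.csv", "STAT2000_2018_Winter_Sec_A.csv")

# Drop the extension, then split each name on underscores
parts <- strsplit(sub("\\.csv$", "", filenames), "_")

# One row per file; pieces are picked by position (assumed fixed layout)
info <- data.frame(
  file        = filenames,
  course_code = sapply(parts, `[`, 1),
  year        = as.integer(sapply(parts, `[`, 2)),
  term        = sapply(parts, `[`, 3),
  section     = sapply(parts, `[`, 5),
  stringsAsFactors = FALSE
)
print(info)
```

A table like this could later be merged with the stacked classlists on the file name stored in the course column.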
Next steps:
strsplit, sub, gsub, grep, grepv
LE BSC H MM COSC MATH PRO:
merge, rbind, up, towide, tolong, …
bigdf <- rbind(df1, df2, df3)
bigdf <- do.call(rbind, dflist)
Reduce:
z <- Reduce(function(x, y) merge(x, y, all = TRUE), cls)
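The hints above can be tried out on toy data before touching the real classlists. The sketch below stacks two small data frames with do.call(rbind, ...) and takes a first pass at the program field with sub and strsplit. The rows are copied from the output printed earlier. Treating the first blank-delimited token as the faculty code is an assumption suggested by those rows, not something the source states, and the column name faculty is illustrative.

```r
# Toy stand-ins for two elements of cls; values copied from the rows printed above
dflist <- list(
  data.frame(number  = c(54296119, 53784527),
             program = c("LE BSC H SP LE/CSEC NO", "SC BSC H SP SC/MAED NO"),
             course  = "STAT1000_2016_Fall_Sec_A.csv",
             stringsAsFactors = FALSE),
  data.frame(number  = 53901761,
             program = "LE BSC H MM COSC MATH PRO",
             course  = "STAT2000_2018_Winter_Sec_A.csv",
             stringsAsFactors = FALSE)
)

# Stack the data frames; rbind needs the same column names in each one
bigdf <- do.call(rbind, dflist)

# Assumption: the first blank-delimited token of program is the faculty code
bigdf$faculty <- sub(" .*$", "", bigdf$program)

# Split the whole field into tokens for closer inspection
tokens <- strsplit(bigdf$program, " +")

print(bigdf[, c("number", "faculty", "course")])
print(tokens[[3]])
```

The number of tokens differs from row to row, so the remaining pieces probably need pattern-based extraction with sub or gsub rather than fixed positions.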