Last updated on: January 04, 2018 at 13:22
After installing R and RStudio using these instructions, start RStudio in the project you created for this course.
Use the menus to create a new R script with File > New File > R Script.
Copy this file into the R Script and save it with Control+S (Command+S in macOS) with the name ‘survey.R’
Execute this file manually, line by line, using Control+Enter in Windows or Command+Enter on macOS. As you go through the file, make appropriate changes: e.g. fill in your name on line 3 above.
This will install a number of packages and download a data file.
After executing the file line by line you should be able to ‘render’ it to create a stand-alone HTML file by pressing Control+Shift+K (Command+Shift+K on macOS).
To complete Assignment 1, upload this file through Moodle before 9am on Monday, September 11, 2017.
This is an example of output from a “.R script with R Markdown”. If you run this script in R, all lines that begin with ‘#’ are treated as comments. If you run it with ‘knitr’ – which just involves typing Control-Shift-K simultaneously when the file is the active file in R Studio – all the text in lines that begin with “#’ ” (hashtag apostrophe space) are processed as R Markdown code that can be used to produce a polished document in a number of standard formats: pdf, HTML, RTF, beamer presentations, etc.
Note that online help for R Markdown describes syntax for “.Rmd” files. They are the same as “.R scripts with R Markdown” files minus the “#’ ” and plus code chunk delimiters. In .R scripts, the code chunk delimiters are not necessary, which makes .R scripts easier to use. There are legitimate reasons to prefer “.Rmd” files but for uniformity when collaborating we will use “.R scripts with R Markdown” in this course.
Almost all the work in this course will be done and submitted this way.
One of the great strengths of Markdown is that it allows you to do reproducible analyses and reports. Another researcher can use your R script on the same data and get the same results. They can verify the exact steps you took and can easily test how results would be affected by modifying the analysis.
You can collaborate with others much more easily knowing that if you send a script and data files to a collaborator they will get the same output (provided they have installed the same packages).
An important practice for reproducible research is that the analyst should never modify raw data. For example, if you find some errors in a spreadsheet sent by a client or downloaded from the internet, it is very tempting to manually correct the errors in the spreadsheet.
However, when you receive or download a new version of the spreadsheet you would need to apply the corrections the same way. If you did them manually you will not be able to guarantee consistency in your corrections.
To ensure reproducibility, the corrections should be done with code in the R script. This will usually involve the use of a very powerful tool: regular expressions, the subject of this xkcd cartoon:
With regular expressions, data corrections can usually be made in a way that does not introduce errors when applied to a future corrected version of the spreadsheet.
As illustrated in this script, you can combine:
in a document.
LaTeX is the most widely used language and environment for technical publishing. In R Markdown, we have access to its language for mathematical formulas. For example, we can write:
\[y_i= \beta_0 + \beta_1 x_i + \epsilon_i \quad i = 1, ... , n, \; \epsilon \sim \mathrm{N}(0,\sigma^2)\]
where \(\beta_0 = 2\), \(\beta_1 = 0.5\), \(\sigma = 2\).
All you need in order to use R Markdown to produce HTML output is automatically installed when you install R Studio.
If you are running this script, you have probably already installed R and RStudio.
To install R visit this ‘mirror’ of the The Comprehensive R Archive Network and follow the instructions for your operating system: Mac OS X, Windows or Linux.
Read the information carefully. If you use Windows, install ‘Rtools’ as suggested. If you use Mac OS X, consider whether you need to install XQuartz.
After installing R, next install R Studio.
Once R Studio is installed, open it with the icon on your desktop. We will use the console in R Studio to install some packages.
There are three main sources of packages for R:
The first package we install resides on CRAN and it is needed to install packages from GitHub.
Note that “#+ ” denotes a “chunk option” that modifies the behaviour of R until the end of the chunk. A chunk is ended by any line that starts with “#’ ” or with “#+ ”. The option “’”eval=FALSE" prevents R from running (evaluating) the chunk when you use Crtl-Shift-K to produce an output file. We assume that all packages have already been installed interactively before using Crtl-Shift-K.
install.packages('devtools')
While we are at it we can install a few more packages from CRAN:
install.packages(c("car", "effects", "ggplot2", "Hmisc"))
install.packages(c("knitr", "magrittr", "rgl", "rio", "rmarkdown"))
install.packages(c("latticeExtra"))
# Having installed 'devtools' we can install packages from github:
devtoools::install_github('gmonette/spida2')
devtoools::install_github('gmonette/p3d')
Installing packages only needs to be done once every time you install a new version of R. You can update them occasionally, with:
update.packages()
Github packages need to be reinstalled periodically to get the latest updates:
devtools::install_github('gmonette/spida2')
devtools::install_github('gmonette/p3d')
I expect these packages to be updated frequently during the course and you need to reinstall them to have access to the latest verions.
Each time you use R, you need to load the packages you need for that session with the ‘library’ command:
library(car)
library(spida2)
library(lattice)
library(latticeExtra)
Loading required package: RColorBrewer
The easiest formats to read are CSV files (comma-separated-value files which are easily created from Excel by saving a spreadsheet as a CSV file), and tab-delimited text files, usually with a ‘.txt’ extension.
You need to know the location of the file relative to the script file.
Be sure to set the working directory, which you can find with
getwd()
[1] "C:/Users/georg/Dropbox/home/Courses/MATH4939/www_MATH4939/files"
to the directory of the script file. In RStudio, use the menus:
Session > Set Working Directory > To Source File Location
Then you can refer to the file using a relative path. This is much better than an absolute path since you can send the script and the data set to a collaborator and they will be able to run the script after saving the script and the data set in a directory on their own computer.
To illustrate, we will download a data file and use for a brief exploration,
# You only need to download once, so this is in a code chunk that
# will not be run when you use Ctrl-Shift-K to render the file.
download.file("http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-3E/datasets/Titanic.txt",
"Titanic.txt")
This copies the file ‘Titanic.txt’ into the working directory. This file is a list of passengers on the Titanic and does not include the crew.
Note that all the data sets from the textbook can be downloaded this way. However, it is better to create a ‘data’ subdirectory and download them there.
list.files()
[1] "22_statements.html"
[2] "22_statements.R"
[3] "activity_2"
[4] "ARAGLM3E_TOC.txt"
[5] "CMU_Structure_of_a_data_analysis_report.pdf"
[6] "description.html"
[7] "description.pdf"
[8] "description.R"
[9] "ICDA2E_TOC.txt"
[10] "m4330_sample_exam_questions.pdf"
[11] "MATH_4330_mid_term_2017.pdf"
[12] "MATH_4330_Peer_Evaluation.docx"
[13] "MATH_4330_Peer_Evaluation.pdf"
[14] "MATH4330_Fall2017_description.html"
[15] "Statistics_Canada_Information_Session.rtf"
[16] "statscan.jpg"
[17] "survey_and_markdown_sample.html"
[18] "survey_and_markdown_sample.R"
[19] "survey_and_markdown_sample.spin.R"
[20] "survey_and_markdown_sample.spin.Rmd"
[21] "syllabus.html"
[22] "syllabus.R"
[23] "tex2pdf.15536"
[24] "Titanic.txt"
[25] "unknown_princeton_glm_theory.pdf"
[26] "welcome_and_software_installation.html"
[27] "welcome_and_software_installation.R"
You can also read a text file from the internet by using the URL but then you need to be connected to the internet whenever you read the file.
titanic <- read.table("Titanic.txt", header = T)
library(spida2)
library(lattice)
head(titanic) # first 6 lines
survived age
Allen, Miss Elisabeth Walton yes 29.0000
Allison, Miss Helen Loraine no 2.0000
Allison, Mr Hudson Joshua Creighton no 30.0000
Allison, Mrs Hudson J.C. (Bessie Waldo Daniels) no 25.0000
Allison, Master Hudson Trevor yes 0.9167
Anderson, Mr Harry yes 47.0000
passengerClass sex
Allen, Miss Elisabeth Walton 1st female
Allison, Miss Helen Loraine 1st female
Allison, Mr Hudson Joshua Creighton 1st male
Allison, Mrs Hudson J.C. (Bessie Waldo Daniels) 1st female
Allison, Master Hudson Trevor 1st male
Anderson, Mr Harry 1st male
tail(titanic) # last 6 lines
survived age passengerClass sex
Zabour, Miss Tamini no NA 3rd female
Zakarian, Mr Artun no NA 3rd male
Zakarian, Mr Maprieder no NA 3rd male
Zenn, Mr Philip no NA 3rd male
Zievens, Rene no NA 3rd female
Zimmerman, Leo no NA 3rd male
dim(titanic) # rows and columns
[1] 1313 4
xqplot(titanic)
#
# frequency table
#
tab(titanic, ~ survived + sex + passengerClass)
, , passengerClass = 1st
sex
survived female male Total
no 9 120 129
yes 134 59 193
Total 143 179 322
, , passengerClass = 2nd
sex
survived female male Total
no 13 148 161
yes 94 25 119
Total 107 173 280
, , passengerClass = 3rd
sex
survived female male Total
no 134 440 574
yes 79 58 137
Total 213 498 711
, , passengerClass = Total
sex
survived female male Total
no 156 708 864
yes 307 142 449
Total 463 850 1313
#
# percentage within each 'sex by passengerClass' grouping
#
tab(titanic, ~ survived + sex + passengerClass, pct = c(2,3))
, , passengerClass = 1st
sex
survived female male All
no 6.293706 67.039106 40.062112
yes 93.706294 32.960894 59.937888
Total 100.000000 100.000000 100.000000
, , passengerClass = 2nd
sex
survived female male All
no 12.149533 85.549133 57.500000
yes 87.850467 14.450867 42.500000
Total 100.000000 100.000000 100.000000
, , passengerClass = 3rd
sex
survived female male All
no 62.910798 88.353414 80.731364
yes 37.089202 11.646586 19.268636
Total 100.000000 100.000000 100.000000
, , passengerClass = All
sex
survived female male All
no 33.693305 83.294118 65.803503
yes 66.306695 16.705882 34.196497
Total 100.000000 100.000000 100.000000
# nicer:
tab(titanic, ~ survived + sex + passengerClass, pct = c(2,3)) %>%
round(1)
, , passengerClass = 1st
sex
survived female male All
no 6.3 67.0 40.1
yes 93.7 33.0 59.9
Total 100.0 100.0 100.0
, , passengerClass = 2nd
sex
survived female male All
no 12.1 85.5 57.5
yes 87.9 14.5 42.5
Total 100.0 100.0 100.0
, , passengerClass = 3rd
sex
survived female male All
no 62.9 88.4 80.7
yes 37.1 11.6 19.3
Total 100.0 100.0 100.0
, , passengerClass = All
sex
survived female male All
no 33.7 83.3 65.8
yes 66.3 16.7 34.2
Total 100.0 100.0 100.0
Note that the ‘%>%’ operator ‘pipes’ the output of the left-hand side (lhs) as the first argument of the function on the right.
In RStudio, you can type ‘%>%’ by pressing Control-Shift-M.
#
# barchart of frequencies
#
tab(titanic, ~ survived + sex + passengerClass) %>%
barchart(auto.key=T)
# get rid of 'Total'
tab_(titanic, ~ survived + sex + passengerClass) %>%
barchart(auto.key=T)
# not really informative, ... so
tab_(titanic, ~ sex + passengerClass + survived) %>%
barchart(xlab = 'number of passengers',
xlim = c(0,550),
auto.key=list(space='right',title='survived'))
# but we really want proportions
tab(titanic, ~ sex + passengerClass + survived, pct = c(1,2)) %>%
barchart(xlab = 'percentage of passengers',
xlim = c(0,550),
auto.key=list(space='right',title='survived'))
# we need to get rid of 'All' and 'Total'
tab__(titanic, ~ sex + passengerClass + survived, pct = c(1,2)) %>%
barchart(xlab = 'percentage of passengers',
auto.key=list(space='right',title='survived'))
# experimenting:
tab__(titanic, ~ sex + passengerClass + survived, pct = c(1,2)) %>%
barchart(ylab = 'percentage of passengers',
horizontal = FALSE,
ylim = c(0,100),
auto.key=list(space='right',title='survived'))
tab__(titanic, ~ passengerClass + sex + survived) %>%
barchart(ylab = 'percentage of passengers',
horizontal = FALSE,
auto.key=list(space='right',title='survived'))
What are the pros and cons of these various graphs?
Please answer the following questions. Use as much space as you need below each question. Leave at least one blank line, i.e. a line starting with “#’”, before each new question.
What is the name you use with classmates:
What is your family name in York’s student records:
What is your given name in York’s student records:
What are your preferred pronouns:
Your my.yorku.ca e-mail address:
The name of your github account (if any):
When did you complete MATH 3131 (e.g. Fall 2016), who was your instructor and what was the author and title of the main textbook?:
When did you complete MATH 4330 (e.g. Fall 2016), who was your instructor and what was the author and title of the main textbook?:
How do you rate your knowledge of R: (0=none, 1=basic, 2=can do regression, 3=can write functions, 4=can write and compile a package)
How do you rate your knowledge of SAS: (0=none, 1=basic, 2=can do regression, 3=can do DATA step programming, 4=expert)
Describe your knowledge of C, C++, or Fortran:
Describe your knowledge of languages like Perl, Python, Awk, Ruby, etc.
What experience do you have with platforms for collaborative programming such as Git, Github, SVN?
What operating system do you use?
Have you ever worked with statistics or done statistical consulting? Please describe briefly.
Please describe some other areas of expertise or interest.
Potential time for optional tutorial:
Do you have a conflicting class at 11:30 am? Would you be able to attend a tutorial on Fridays at 11:30 am to 12:30 am?
Do you have a conflicting class at 9:30 am? Would you be able to attend a tutorial on Fridays from 9:30 am to 10:30 am?