Last updated on: January 04, 2018 at 13:22

Instructions

After installing R and RStudio using these instructions, start RStudio in the project you created for this course.

Use the menus to create a new R script with File > New File > R Script.

Copy this file into the R script and save it with Control+S (Command+S on macOS) under the name ‘survey.R’.

Execute this file manually, line by line, using Control+Enter in Windows or Command+Enter on macOS. As you go through the file, make appropriate changes: e.g. fill in your name on line 3 above.

This will install a number of packages and download a data file.

After executing the file line by line you should be able to ‘render’ it to create a stand-alone HTML file by pressing Control+Shift+K (Command+Shift+K on macOS).

To complete Assignment 1, upload this file through Moodle before 9am on Monday, September 11, 2017.

R Markdown and Reproducible Research

This is an example of output from a “.R script with R Markdown”. If you run this script in R, all lines that begin with ‘#’ are treated as comments. If you run it with ‘knitr’, which just involves pressing Control+Shift+K when the file is the active file in RStudio, all the text in lines that begin with “#’ ” (hash, apostrophe, space) is processed as R Markdown that can be used to produce a polished document in a number of standard formats: PDF, HTML, RTF, beamer presentations, etc.

Note that online help for R Markdown describes syntax for “.Rmd” files. They are the same as “.R scripts with R Markdown” files minus the “#’ ” and plus code chunk delimiters. In .R scripts, the code chunk delimiters are not necessary, which makes .R scripts easier to use. There are legitimate reasons to prefer “.Rmd” files but for uniformity when collaborating we will use “.R scripts with R Markdown” in this course.
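As a rough illustration of the format, a minimal “.R script with R Markdown” might look like the sketch below. The title, the text and the variable `x` are invented for this example; only the “#’ ” prefix convention comes from the description above.

```r
#' ---
#' title: "A minimal example"
#' ---
#'
#' Lines that begin with "#' " are rendered as R Markdown text.
#' When the script is run in plain R they are ordinary comments.

# Ordinary R code needs no chunk delimiters in a .R script.
x <- c(2, 4, 6)   # hypothetical data
mean(x)

#' Text can resume after the code, with *emphasis* and other markup.
```

Running this file line by line behaves like any other R script; rendering it should interleave the text, the code and the printed result.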

Almost all the work in this course will be done and submitted this way.

Advantages of R scripts: Reproducible Analyses

One of the great strengths of R Markdown is that it allows you to do reproducible analyses and reports. Another researcher can use your R script on the same data and get the same results. They can verify the exact steps you took and can easily test how results would be affected by modifying the analysis.

You can collaborate with others much more easily knowing that if you send a script and data files to a collaborator they will get the same output (provided they have installed the same packages).

An important practice for reproducible research is that the analyst should never modify raw data. For example, if you find some errors in a spreadsheet sent by a client or downloaded from the internet, it is very tempting to manually correct the errors in the spreadsheet.

However, when you receive or download a new version of the spreadsheet, you would need to apply the same corrections again. If you made them manually, you cannot guarantee that they are applied consistently.

To ensure reproducibility, the corrections should be done with code in the R script. This will usually involve the use of a very powerful tool: regular expressions, the subject of an xkcd cartoon.

With regular expressions, data corrections can usually be made in a way that does not introduce errors when applied to a future corrected version of the spreadsheet.
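A minimal sketch of a scripted correction, assuming a hypothetical column of inconsistently typed values. The data frame `raw` and its contents are invented for illustration and are not part of the course data.

```r
# Hypothetical raw data as it might arrive in a spreadsheet.
raw <- data.frame(
  id  = 1:4,
  sex = c("Male", " male", "F", "female "),
  stringsAsFactors = FALSE
)

# Work on a copy so that the raw data is never modified.
clean <- raw
clean$sex <- tolower(clean$sex)
clean$sex <- gsub("^\\s+|\\s+$", "", clean$sex)       # strip leading/trailing spaces
clean$sex <- sub("^m(ale)?$", "male", clean$sex)      # "m" or "male"   -> "male"
clean$sex <- sub("^f(emale)?$", "female", clean$sex)  # "f" or "female" -> "female"

table(clean$sex)
```

Because the patterns are anchored with `^` and `$`, values that are already correct pass through unchanged, so rerunning the script on a corrected version of the spreadsheet should not introduce new errors.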

Advantages of R Markdown: Simple markup and LaTeX formulas

As illustrated in this script, you can combine:

  1. R code
  2. R output
  3. R graphics
  4. text
  5. hyperlinks
  6. embedded graphics
  7. headings, table of contents and other markup
  8. mathematical formulas in LaTeX
  9. and, with a bit of effort, interactive graphics and other widgets

in a document.

LaTeX is the most widely used language and environment for technical publishing. In R Markdown, we have access to its language for mathematical formulas. For example, we can write:

\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, \ldots, n, \quad \epsilon_i \sim \mathrm{N}(0,\sigma^2)\]

where \(\beta_0 = 2\), \(\beta_1 = 0.5\), \(\sigma = 2\).
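To connect the formula to R, here is a small simulation sketch of this model with the parameter values given above. The sample size, the seed and the uniform distribution of `x` are assumptions made only for this illustration.

```r
set.seed(4939)            # arbitrary seed, so the simulation is reproducible
n     <- 100              # assumed sample size
beta0 <- 2
beta1 <- 0.5
sigma <- 2

x   <- runif(n, min = 0, max = 10)       # assumed distribution of the predictor
eps <- rnorm(n, mean = 0, sd = sigma)    # errors: N(0, sigma^2)
y   <- beta0 + beta1 * x + eps

fit <- lm(y ~ x)          # least-squares estimates of beta0 and beta1
coef(fit)
```

The estimated coefficients should be close to, but not exactly equal to, the true values of 2 and 0.5.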

All you need in order to use R Markdown to produce HTML output is automatically installed when you install RStudio.

Installing R and R Studio

If you are running this script, you have probably already installed R and RStudio.

To install R, visit this ‘mirror’ of the Comprehensive R Archive Network and follow the instructions for your operating system: Mac OS X, Windows or Linux.

Read the information carefully. If you use Windows, install ‘Rtools’ as suggested. If you use Mac OS X, consider whether you need to install XQuartz.

After installing R, install RStudio.

Once RStudio is installed, open it with the icon on your desktop. We will use the console in RStudio to install some packages.

Installing and loading packages

There are three main sources of packages for R:

  1. Many packages come with R when you install it initially.
  2. Most additional packages you might consider installing reside on ‘CRAN’, the Comprehensive R Archive Network, which provides relatively stable versions of packages that have passed some automated tests.
  3. ‘github.com’, where most developers provide access to the latest beta versions of their packages or to packages that have not been submitted to CRAN.

The first package we install resides on CRAN and it is needed to install packages from GitHub.

Note that “#+ ” introduces “chunk options” that modify the behaviour of R until the end of the chunk. A chunk is ended by any line that starts with “#’ ” or with “#+ ”. The option “eval=FALSE” prevents R from running (evaluating) the chunk when you use Ctrl-Shift-K to produce an output file. We assume that all packages have already been installed interactively before using Ctrl-Shift-K.
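A sketch of how chunk options might appear in a script. The chunk labels, the message and the variable `x` are hypothetical; `eval=FALSE` is the option described above.

```r
#' This text line ends any previous chunk.

#+ setup-chunk, eval=FALSE
# When rendering, this chunk is shown but not evaluated.
# Run interactively, the line below executes like any other R code.
message("placeholder for a one-time setup step")

#+ analysis-chunk
# A new "#+ " line ends the previous chunk, so this code is evaluated
# both interactively and when rendering.
x <- 1:10
sum(x)
```

In a real script the first chunk would typically hold one-time steps such as package installation or a file download.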

install.packages('devtools')

While we are at it we can install a few more packages from CRAN:

install.packages(c("car", "effects", "ggplot2", "Hmisc"))
install.packages(c("knitr", "magrittr", "rgl", "rio", "rmarkdown"))
install.packages(c("latticeExtra"))
# Having installed 'devtools' we can install packages from github:
devtools::install_github('gmonette/spida2')
devtools::install_github('gmonette/p3d')

Packages only need to be installed once for each new version of R that you install. You can update them occasionally with:

update.packages()

GitHub packages need to be reinstalled periodically to get the latest updates:

devtools::install_github('gmonette/spida2')
devtools::install_github('gmonette/p3d')

I expect these packages to be updated frequently during the course and you need to reinstall them to have access to the latest versions.
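One possible way to avoid reinstalling CRAN packages that are already present is to check first. The helper `missing_packages()` below is a hypothetical convenience function written for this illustration, not part of any package used in the course.

```r
# Return the names of packages in 'pkgs' that are not installed.
# (Hypothetical helper, defined here for illustration only.)
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("car", "ggplot2", "knitr", "rmarkdown"))
to_install

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

This only checks whether a package is installed, not whether it is up to date, so it does not replace `update.packages()` or reinstalling the GitHub packages.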

Each time you use R, you need to load the packages you need for that session with the ‘library’ command:

library(car)
library(spida2)
library(lattice)
library(latticeExtra)
Loading required package: RColorBrewer

Reading data in a text file

The easiest formats to read are CSV files (comma-separated-value files which are easily created from Excel by saving a spreadsheet as a CSV file), and tab-delimited text files, usually with a ‘.txt’ extension.
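As a small self-contained sketch, the two formats can be read with `read.csv()` and `read.delim()`. The toy data frame and the temporary files are created only for this example.

```r
# Hypothetical toy data written to temporary files in both formats.
toy <- data.frame(name = c("a", "b"), age = c(29, 2))

csv_file <- tempfile(fileext = ".csv")
txt_file <- tempfile(fileext = ".txt")
write.csv(toy, csv_file, row.names = FALSE)
write.table(toy, txt_file, sep = "\t", row.names = FALSE, quote = FALSE)

from_csv <- read.csv(csv_file)     # comma-separated values
from_txt <- read.delim(txt_file)   # tab-delimited text

all.equal(from_csv, from_txt)
```

Both functions are wrappers around `read.table()` with defaults suited to the respective format.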

You need to know the location of the file relative to the script file.

Be sure to set the working directory, which you can find with

getwd()
[1] "C:/Users/georg/Dropbox/home/Courses/MATH4939/www_MATH4939/files"

to the directory of the script file. In RStudio, use the menus:

Session > Set Working Directory > To Source File Location

Then you can refer to the file using a relative path. This is much better than an absolute path since you can send the script and the data set to a collaborator and they will be able to run the script after saving the script and the data set in a directory on their own computer.

To illustrate, we will download a data file and use it for a brief exploration.

# You only need to download once, so this is in a code chunk that
# will not be run when you use Ctrl-Shift-K to render the file.
download.file("http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-3E/datasets/Titanic.txt",
                 "Titanic.txt")

This copies the file ‘Titanic.txt’ into the working directory. This file is a list of passengers on the Titanic and does not include the crew.

Note that all the data sets from the textbook can be downloaded this way. However, it is better to create a ‘data’ subdirectory and download them there.
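A sketch of one way to do this, assuming a subdirectory called ‘data’ relative to the working directory. The function `fetch_once()` is a hypothetical helper written for this illustration.

```r
# Download 'url' into 'dir' unless a file of the same name is already there.
# Returns the relative path of the local copy.
# (Hypothetical helper, defined here for illustration only.)
fetch_once <- function(url, dir = "data") {
  if (!dir.exists(dir)) dir.create(dir, recursive = TRUE)
  dest <- file.path(dir, basename(url))
  if (!file.exists(dest)) download.file(url, dest)
  dest
}

# Example use (requires an internet connection the first time):
# titanic_path <- fetch_once("http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-3E/datasets/Titanic.txt")
# titanic <- read.table(titanic_path, header = TRUE)
```

Because the returned path is relative, the script should still work for a collaborator who copies the script and the ‘data’ subdirectory to their own computer.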

list.files()
 [1] "22_statements.html"                         
 [2] "22_statements.R"                            
 [3] "activity_2"                                 
 [4] "ARAGLM3E_TOC.txt"                           
 [5] "CMU_Structure_of_a_data_analysis_report.pdf"
 [6] "description.html"                           
 [7] "description.pdf"                            
 [8] "description.R"                              
 [9] "ICDA2E_TOC.txt"                             
[10] "m4330_sample_exam_questions.pdf"            
[11] "MATH_4330_mid_term_2017.pdf"                
[12] "MATH_4330_Peer_Evaluation.docx"             
[13] "MATH_4330_Peer_Evaluation.pdf"              
[14] "MATH4330_Fall2017_description.html"         
[15] "Statistics_Canada_Information_Session.rtf"  
[16] "statscan.jpg"                               
[17] "survey_and_markdown_sample.html"            
[18] "survey_and_markdown_sample.R"               
[19] "survey_and_markdown_sample.spin.R"          
[20] "survey_and_markdown_sample.spin.Rmd"        
[21] "syllabus.html"                              
[22] "syllabus.R"                                 
[23] "tex2pdf.15536"                              
[24] "Titanic.txt"                                
[25] "unknown_princeton_glm_theory.pdf"           
[26] "welcome_and_software_installation.html"     
[27] "welcome_and_software_installation.R"        

You can also read a text file from the internet by using the URL but then you need to be connected to the internet whenever you read the file.

titanic <- read.table("Titanic.txt", header = TRUE)
library(spida2)
library(lattice)
head(titanic)  # first 6 lines
                                                survived     age
Allen, Miss Elisabeth Walton                         yes 29.0000
Allison, Miss Helen Loraine                           no  2.0000
Allison, Mr Hudson Joshua Creighton                   no 30.0000
Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)       no 25.0000
Allison, Master Hudson Trevor                        yes  0.9167
Anderson, Mr Harry                                   yes 47.0000
                                                passengerClass    sex
Allen, Miss Elisabeth Walton                               1st female
Allison, Miss Helen Loraine                                1st female
Allison, Mr Hudson Joshua Creighton                        1st   male
Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)            1st female
Allison, Master Hudson Trevor                              1st   male
Anderson, Mr Harry                                         1st   male
tail(titanic)  # last 6 lines
                       survived age passengerClass    sex
Zabour, Miss Tamini          no  NA            3rd female
Zakarian, Mr Artun           no  NA            3rd   male
Zakarian, Mr Maprieder       no  NA            3rd   male
Zenn, Mr Philip              no  NA            3rd   male
Zievens, Rene                no  NA            3rd female
Zimmerman, Leo               no  NA            3rd   male
dim(titanic)   # rows and columns
[1] 1313    4
xqplot(titanic)

#
# frequency table
#
tab(titanic, ~ survived + sex + passengerClass)
, , passengerClass = 1st

        sex
survived female male Total
   no         9  120   129
   yes      134   59   193
   Total    143  179   322

, , passengerClass = 2nd

        sex
survived female male Total
   no        13  148   161
   yes       94   25   119
   Total    107  173   280

, , passengerClass = 3rd

        sex
survived female male Total
   no       134  440   574
   yes       79   58   137
   Total    213  498   711

, , passengerClass = Total

        sex
survived female male Total
   no       156  708   864
   yes      307  142   449
   Total    463  850  1313
#
# percentage within each 'sex by passengerClass' grouping
#
tab(titanic, ~ survived + sex + passengerClass, pct = c(2,3))
, , passengerClass = 1st

        sex
survived     female       male        All
   no      6.293706  67.039106  40.062112
   yes    93.706294  32.960894  59.937888
   Total 100.000000 100.000000 100.000000

, , passengerClass = 2nd

        sex
survived     female       male        All
   no     12.149533  85.549133  57.500000
   yes    87.850467  14.450867  42.500000
   Total 100.000000 100.000000 100.000000

, , passengerClass = 3rd

        sex
survived     female       male        All
   no     62.910798  88.353414  80.731364
   yes    37.089202  11.646586  19.268636
   Total 100.000000 100.000000 100.000000

, , passengerClass = All

        sex
survived     female       male        All
   no     33.693305  83.294118  65.803503
   yes    66.306695  16.705882  34.196497
   Total 100.000000 100.000000 100.000000
# nicer:
tab(titanic, ~ survived + sex + passengerClass, pct = c(2,3))  %>%
  round(1)
, , passengerClass = 1st

        sex
survived female  male   All
   no       6.3  67.0  40.1
   yes     93.7  33.0  59.9
   Total  100.0 100.0 100.0

, , passengerClass = 2nd

        sex
survived female  male   All
   no      12.1  85.5  57.5
   yes     87.9  14.5  42.5
   Total  100.0 100.0 100.0

, , passengerClass = 3rd

        sex
survived female  male   All
   no      62.9  88.4  80.7
   yes     37.1  11.6  19.3
   Total  100.0 100.0 100.0

, , passengerClass = All

        sex
survived female  male   All
   no      33.7  83.3  65.8
   yes     66.3  16.7  34.2
   Total  100.0 100.0 100.0

Note that the ‘%>%’ operator ‘pipes’ the output of the left-hand side (lhs) as the first argument of the function on the right.

In RStudio, you can type ‘%>%’ by pressing Control-Shift-M.
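To make the equivalence concrete, the sketch below writes the same computation as a nested call and as a pipeline. It assumes the ‘magrittr’ package (installed earlier) is available, and it uses a small invented data frame and base R functions rather than the Titanic data and `tab()`.

```r
library(magrittr)   # provides the %>% operator

# Hypothetical toy data, not the Titanic file.
toy <- data.frame(
  survived = c("yes", "no", "yes", "no", "yes", "yes"),
  sex      = c("female", "female", "male", "male", "female", "male")
)

# Nested call: read from the inside out.
nested <- round(prop.table(table(toy$survived, toy$sex), margin = 2), 2)

# Piped call: read from left to right; each result becomes the
# first argument of the next function.
piped <- table(toy$survived, toy$sex) %>%
  prop.table(margin = 2) %>%
  round(2)

identical(nested, piped)
```

The piped version tends to be easier to read when several steps are chained, since the steps appear in the order in which they are carried out.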

Visualizing frequencies

#
# barchart of frequencies
#
tab(titanic, ~ survived + sex + passengerClass)  %>%
  barchart(auto.key = TRUE)

# get rid of 'Total'
tab_(titanic, ~ survived + sex + passengerClass)  %>%
  barchart(auto.key = TRUE)

# not really informative, ... so
tab_(titanic, ~  sex + passengerClass + survived)  %>%
  barchart(xlab = 'number of passengers',
           xlim = c(0,550),
           auto.key=list(space='right',title='survived'))

# but we really want proportions
tab(titanic, ~  sex + passengerClass + survived, pct = c(1,2))  %>%
  barchart(xlab = 'percentage of passengers',
           xlim = c(0,550),
           auto.key=list(space='right',title='survived'))

# we need to get rid of 'All' and 'Total'
tab__(titanic, ~  sex + passengerClass + survived, pct = c(1,2)) %>%
  barchart(xlab = 'percentage of passengers',
           auto.key=list(space='right',title='survived'))

# experimenting:
tab__(titanic, ~  sex + passengerClass + survived, pct = c(1,2)) %>%
  barchart(ylab = 'percentage of passengers',
           horizontal = FALSE,
           ylim = c(0,100),
           auto.key=list(space='right',title='survived'))

tab__(titanic, ~  passengerClass + sex + survived) %>%
  barchart(ylab = 'number of passengers',
           horizontal = FALSE,
           auto.key=list(space='right',title='survived'))

What are the pros and cons of these various graphs?

Survey

Please answer the following questions. Use as much space as you need below each question. Leave at least one blank line, i.e. a line starting with “#’”, before each new question.

What is the name you use with classmates:

What is your family name in York’s student records:

What is your given name in York’s student records:

What are your preferred pronouns:

Your my.yorku.ca e-mail address:

The name of your github account (if any):

When did you complete MATH 3131 (e.g. Fall 2016), who was your instructor and what was the author and title of the main textbook?:

When did you complete MATH 4330 (e.g. Fall 2016), who was your instructor and what was the author and title of the main textbook?:

How do you rate your knowledge of R: (0=none, 1=basic, 2=can do regression, 3=can write functions, 4=can write and compile a package)

How do you rate your knowledge of SAS: (0=none, 1=basic, 2=can do regression, 3=can do DATA step programming, 4=expert)

Describe your knowledge of C, C++, or Fortran:

Describe your knowledge of languages like Perl, Python, Awk, Ruby, etc.

What experience do you have with platforms for collaborative programming such as Git, Github, SVN?

What operating system do you use?

Have you ever worked with statistics or done statistical consulting? Please describe briefly.

Please describe some other areas of expertise or interest.

Potential time for optional tutorial:
Do you have a conflicting class at 11:30 am? Would you be able to attend a tutorial on Fridays from 11:30 am to 12:30 pm?

Do you have a conflicting class at 9:30 am? Would you be able to attend a tutorial on Fridays from 9:30 am to 10:30 am?