This version: April 03 2024 10:13


Announcements:

Sunday: February 25, 2024 at 4:30 pm: Update on the (apparently probable) strike.

At this stage a strike seems so likely that I am planning on the assumption that it will take place.

Therefore, any forthcoming quizzes, if scheduled while the strike is still unresolved, are cancelled and will be rescheduled.

Our class will meet over Zoom at the usual time at https://yorku.zoom.us/j/93916385398

Where there is no uncertainty, there cannot be truth — Richard Feynman

To teach how to live without certainty, and yet without being paralysed by hesitation, is perhaps the chief thing that philosophy, in our age, can still do for those who study it. — Bertrand Russell

How harmful is smoking?: Cigarette consumption and life expectancy in 189 countries in 2004: Correlation is not causation. This is an example of the ecological fallacy, which is itself an example of a deeper phenomenon: Simpson’s Paradox.
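A small simulated illustration of Simpson's Paradox (a toy setup, not the smoking data): within each group the relationship between y and x is negative, yet pooling the groups reverses the sign.

```r
# Two groups at different levels; within each group the slope is -0.5,
# but the between-group difference dominates the pooled regression
set.seed(1)
g <- rep(c(0, 10), each = 50)            # group level
x <- g + rnorm(100)
y <- g - 0.5 * (x - g) + rnorm(100, sd = 0.3)
coef(lm(y ~ x))[["x"]]                   # pooled slope: positive
coef(lm(y ~ x, subset = g == 0))[["x"]]  # within-group slope: negative
```

Aggregating over countries (the ecological analysis) corresponds to the pooled fit, which can point in the opposite direction from the within-group relationship.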

Calendar

Classes meet on Mondays, Wednesdays, and Fridays from 9:30 to 10:20, in ACW 305 on Mondays and Wednesdays and in ACW 204 on Fridays.
Tutorial: On Zoom (click here), every Tuesday from 1:30 to 2:30 pm. Instructor: Georges Monette Teaching Assistant: Chenyi Yu
Email: Messages about course content should be posted publicly to Piazza. You may post messages and questions as personal messages to the instructors. If they are of general interest and don’t contain personal information, they will usually be made public for the benefit of the entire class unless you specifically request that the message remain private.

Day 1: Monday, January 8

  • Course description
  • This evening, I will give access to Piazza to all students currently registered in the course.
    • I will use your York e-mail address. If you don’t read email at your York email address, please make sure that it’s forwarded to an email address that you do read regularly.
    • Please do not change your e-mail address on Piazza because your York email address is used to identify you so you get credited for your contributions.
  • Topic 1: How harmful is smoking?

Day 2: Wednesday, January 10

Assignment 1 (individual): Connecting with your team.

Due Thursday, January 11, 9 pm

  • Join Piazza using the invitation sent to your yorku.ca email address.
  • Get to know your team members. Post a message, private to your team, introducing yourself: What statistics courses have you taken? What programming languages do you know? Are you interested in particular applications of statistics? Use the folder ‘assn1’.
  • Reply to your team members’ postings, perhaps asking them further questions about their background and interests.

Assignment 2 (individual): Setting things up

Due Sunday, January 14, 9 pm

  • Summary:
    1. Install (or update) R and RStudio
    2. Get a free Github account
    3. Install git on your computer
    4. Post publicly on Piazza if you run into problems. Help others if you can. Before the deadline on Sunday, post at least one public message commenting on your experiences installing software. Use the folder ‘assn2’.
  • 1. Install R and RStudio following these instructions. If you have already installed R and RStudio, update them to the latest versions.
  • 2. Get a free Github account: If you don’t have one, first consider choosing a name. Here’s an excellent source of advice from Jenny Bryan.
    • CAUTION: Avoid installing ‘Github for Windows’ from the Github site. It is not the same as ‘Git for Windows’.
  • 3. Install git on your computer using the instructions on Jenny Bryan’s webpage.
    • If you are curious about how git is used have a look at this tutorial!
    • As a final step: In the RStudio menu, click on Tools | Global Options ... | Terminal. Then, in the Shell section, click on the box to the right of New terminals open with:
      • On a PC, select Git Bash
      • On a Mac, select Bash
    • You don’t need to do anything else at this time. We will see how to set up SSH keys to connect to Github through the RStudio terminal in a few lectures.
  • 4. Post questions on Piazza and, if everything goes well, post that on Piazza too. Use the folder ‘assn2’.

Day 3: Friday, January 12

Day 4: Monday, January 15

Complete Doodle Poll for a tutorial hour by this evening:

  • Ivy prepared this Doodle poll to find an hour for our tutorial over Zoom. We’ll start this week at the time that will be selected this evening. Please complete the poll by 9pm so your preferences will be taken into account.
  • Consider what is the best strategy in completing Doodle polls. Should you fill in your one preferred time or should you include times that are less preferable but nevertheless possible? How do you maximize the value of your choice(s)?

Quiz on Wednesday:

Day 5: Wednesday, January 17

Announcement: Tutorial on Zoom (click here), every Tuesday from 1:30 to 2:30 pm.

Quiz Today

Topic 2 (continuing): Regression Review: Regression in R
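As a warm-up for the regression review, here is a minimal regression fit in R using the built-in cars data set (an illustrative sketch, not part of the course materials):

```r
# Stopping distance (ft) as a function of speed (mph)
fit <- lm(dist ~ speed, data = cars)
coef(summary(fit))   # estimates, standard errors, t- and p-values
confint(fit)         # 95% confidence intervals for the coefficients
```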

Day 6: Friday, January 19

Topic 2 (continuing): Regression Review

Day 7: Monday, January 22

Topic 2 (continuing): Regression Review

Day 8: Wednesday, January 24

Topic 2 (continuing): Regression Review

Learning R:

Why R? What about SAS, SPSS, Python, among others?

SAS is a very high-quality, intensely engineered environment for statistical analysis. It is widely used by large corporations. New procedures in SAS are developed and thoroughly tested by a team of 1,000 or more SAS engineers before being released. It currently has more than 300 procedures.

R is an environment for statistical programming and development that has accumulated many somewhat inconsistent layers developed over time by people of varying abilities, many of whom work largely independently of each other. There is no centralized quality testing except to check whether code and documentation run before a new package is added to R’s main repository, CRAN. When this page was last updated, CRAN had 20,640 packages.

In addition, a large number of packages under development are available through other repositories, such as github.

The development of SAS began in 1966 and that of R (in the form of its predecessor, S, at Bell Labs) in 1976.

The ‘design’ of R (using ‘R’ to refer to both R and to S) owes a lot to the design of Unix. The idea is to create a toolbox of simple tools that you link together yourself to perform an analysis. Unix (now encountered mainly as Linux¹) commands were simple tools linked together by pipes, so the output of one command is the input of the next. To do anything you need to put the commands together yourself.

The same is true of R. It’s extremely flexible but at the cost of requiring you to know what you want to do and to be able to use its many tools in combination with each other to achieve your goal. Many decisions in R’s design were intended to make it easy to use interactively. Often the result is a language that is very quirky for programming.
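A minimal sketch of this toolbox style, assuming R 4.1 or later for the native pipe |>:

```r
# Each function is a small tool; the pipe chains them, Unix-style
x <- c(5.1, 3.2, NA, 7.8, 2.4)
x |> na.omit() |> sort() |> head(3)
#> [1] 2.4 3.2 5.1
```

As with Unix pipes, each step does one simple job, and the analysis emerges from how you compose them.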

SAS, in contrast, requires you to select options to run large procedures that purport to do the entire job.

This is an old joke: If someone publishes a journal article about a new statistical method, it might be added to SAS in 5 to 10 years. It won’t be added to SPSS until 5 years after there’s a textbook written about it, maybe another 10 to 15 years after its appearance in SAS.

It was added to R two years ago because the new method was developed as a package in R long before being published.

So why become a statistician? So you can have the breadth and depth of understanding needed to apply the latest statistical ideas with the intelligence and discernment to use them effectively.

So expect to have a symbiotic relationship with R. You need R to have access to the tools that implement the latest ideas in statistics. R needs you because it takes people like you to use R effectively.

The role of R in this course is to help us

  • have access to tools that expand our ability to explore and analyze data,
  • learn how to develop and implement new statistical methods, i.e. learn how to build new tools, and
  • deepen our understanding of the use of statistics for scientific discovery as well as for business applications.

It’s very challenging to find a good way to ‘learn R’. It depends on where you are and where you want to go. Now, there’s a plethora of online courses. See the blog post: The 5 Most Effective Ways to Learn R

In my opinion, ultimately, the best way is to

  • play your way through the ‘official’ manuals on CRAN starting with ‘An Introduction to R’ along with ‘R Data Import/Export’. Note however that these materials were developed before the current mounting concern with reproducible research and some of the advice should be deprecated, e.g. using ‘attach’ and ‘detach’ with data.frames.
  • read the CRAN task views in areas that interest you.
  • have a look at the 1/2 million questions tagged ‘r’ on stackoverflow.
  • at every opportunity, use R Markdown documents (like the sample script you ran when you installed R) to work on assignments, projects, etc.
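For instance, where the older manuals suggest attach()/detach(), a safer idiom for reproducible scripts is scoped evaluation with with() (toy data, purely illustrative):

```r
# Instead of attach(df) ... detach(df), evaluate inside the data frame:
df <- data.frame(height = c(1.6, 1.8), weight = c(55, 80))
with(df, weight / height^2)   # BMI, computed without attaching df
```

with() avoids the masking surprises that attach() can create when several data frames share variable names.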

Using R is like playing the piano. You can read and learn all the theory you want, but ultimately you learn by playing.

Copy the following scripts as files in RStudio:

Play with them line by line.

Post questions arising from these scripts to the ‘question’ folder on Piazza. We will take up some questions in class and others in tutorials scheduled to deal with questions on R.

Assignment 3 (teams)

Exercises:

  • From 4939 questions
    • 5.1, 5.2, 5.3, 5.4,
    • 5.6.23, 5.6.24, 5.6.25, 5.6.26,
    • 6.1, 6.2, 6.3, 6.4,
    • 7.4, 7.5, 7.6, 7.7,
    • 8.1, 8.2, 8.3, 8.4,
    • 8.6, 8.7, 8.8, 8.9,
    • 8.18.a, 8.18.b, 8.18.c, 8.18.d,
    • 8.36.a, 8.36.b, 8.36.c, 8.36.d,
    • 8.51.a, 8.51.b, 8.51.c, (write functions that would work on matrices of any size), 8.61.a,
    • 12.1, 12.3, 12.5, 12.7,
  • Do the exercises above. There are at most 4 members in each team. Randomly assign the numbers 1 to 4 to members of your team (without replacement). Member number 1 does the first question in each row, Member number 2 does the second question in each row, etc.
  • Deadlines: See the course description for the meaning of these deadlines.
    1. Friday, February 2 at noon
    2. Sunday, February 4 at noon
    3. Monday, February 5 at 9 pm
  • IMPORTANT:
    • Upload the answer to each question in a single Piazza post (post it as a Piazza ‘Note’, not as a Piazza ‘Question’) with the title: “A3 5.1” for the first question, etc.
    • You can answer the question directly in the posting or by uploading a pdf file and the R script or Rmarkdown script that generated it.
    • When providing help or comments, do so as “followup discussion”.

Day 12: Friday, February 2

Day 13: Monday, February 5

Assignment 4 (teams)

Let \(N\) be your team member number from the last assignment. Comment on the statements whose number \(Q\) satisfies \(Q \equiv N \pmod{4}\) in this file of statements related to statistics. Be warned that most of these statements are fallacious to some degree. To keep things interesting, there might be one or more correct statements.

Elliptical thinking may help you get insights into the correctness of some statements.

  • Deadlines: See the course description for the meaning of these deadlines.
    1. Monday, February 12 at noon
    2. Wednesday, February 14 at 9 pm
    3. Friday, February 16 at 9 pm
  • IMPORTANT:
    • Upload the answer to each question in a single Piazza post (post it as a Piazza ‘Note’, not as a Piazza ‘Question’) with the title: “A4 Statement 1” for the first statement, etc.
    • You can answer the question directly in the posting or by uploading a pdf file and the R script or Rmarkdown script that generated it.
    • When providing help or comments, do so as “followup discussion”.
    • Do not do a separate post for your final answer, just keep editing the original post.

Assignment 5 (individual)

  • Due: Monday, February 12

  • Do questions 6 and 8 in the R script to play with Multilevel Models: Lab 1.R

  • Upload your work on each question in a separate Piazza post (post it as a Piazza ‘Note’, not as a Piazza ‘Question’) with the title: “A5 6” for the first question and “A5 8” for the second.

  • Do your work in Rmarkdown scripts (either .R or .Rmd files, it’s up to you) and post your script, not the pdf output, to Piazza. The script should work when someone else runs it in the current version of R.

  • You can get help from, and give help to, anyone in the class, but please do so on Piazza so that you will get credit for it.

  • Even if you get help, your code should be your own. Don’t copy code from each other.

Day 18: Friday, February 16

‘Voluntary assignment 6’:

  • Work your way through Lab 2.R
    • Post questions, problems, discuss answers to questions on Piazza.
  • Use the folder assn_6 for discussions.

Day 19: Monday, February 26

Class links:

Assignment 6 (individual)

Due: Tuesday, March 6

The purpose of this assignment is to give everyone a chance to work individually with the data for the project in preparation for collaboration with your team.

We will study methods to work with this kind of data over the next few weeks.

The data are contained in an Excel file at http://blackwell.math.yorku.ca/private/MATH4939_data/ . Use the userid and password provided in class. Please do not post the userid and password in any public posting, e.g. on Piazza. You may post them in ‘Private’ posts to your team. The data should be treated as confidential and may be used only for the purposes of this course.

The data consist of longitudinal volume measurements for a number of patients being treated after traumatic brain injuries (TBI), e.g. from car accidents or falls, and similar data from a number of control subjects without brain injuries.

The goal of the study is to identify whether some brain structures are subject to a rate of shrinkage after TBI that is greater than the rate normally associated with aging. It is thought that some structures, particularly portions of the hippocampus, may be especially affected after TBI.

You will be able to work with this data set as we explore ways of working with multilevel and longitudinal data. It’s a real data set with all the flaws that are typical of real data sets.

A first task will be to turn the data from a ‘wide’ file with one row per subject to a ‘long’ file with one row per ‘occasion’. Note that variables with names that end in ’_1’, ’_2’, etc. are longitudinal variables, i.e. variables measured on different occasions at different points in time.

A good way to transform variable names if they aren’t in the right form is to use substitutions with regular expressions.
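For instance, here is a hypothetical renaming sketch with gsub() and backreferences; the names below are stand-ins, and the real variable names may already be in the right form:

```r
# Add the missing underscore before a trailing occasion number
nms <- c("HPC_L_TOT1", "CC_TOT2", "id")
gsub("([A-Za-z])([0-9]+)$", "\\1_\\2", nms)
# -> "HPC_L_TOT_1" "CC_TOT_2" "id"  ('id' has no trailing digits, so it is unchanged)
```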

A good way to transform the data set from ‘wide’ form to ‘long’ form is to use the ‘tolong’ function in the ‘spida2’ package but you are welcome to use other methods that can be implemented through a script in R, i.e. not manipulating the data itself. You might find section 9, particularly section 9.4, of the following notes helpful:
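If you prefer a base-R route instead of ‘tolong’, here is a sketch using stats::reshape() on toy data; the variable names are hypothetical stand-ins for those in the actual file:

```r
# Toy 'wide' data: one row per subject, one column per occasion
wide <- data.frame(
  id          = c("01", "c01"),
  HPC_L_TOT_1 = c(3.1, 3.5),
  HPC_L_TOT_2 = c(3.0, 3.4)
)
long <- reshape(wide, direction = "long",
                varying = c("HPC_L_TOT_1", "HPC_L_TOT_2"),
                v.names = "HPC_L_TOT",
                timevar = "occasion", idvar = "id")
long   # one row per subject-occasion (4 rows)
```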

Most of the variables are measures of the volume of components of the brain:

  • ‘HC’ refers to the hippocampus that has several parts,
  • ‘CC’ to the corpus callosum,
  • ‘GM’ is grey matter,
  • ‘WM’ is white matter,
  • ‘VBR’ is the ventricle-to-brain ratio. Ventricles are ‘holes’ in your brain that are filled with cerebrospinal fluid, so VBR measures how big the holes in your brain are compared with the ‘solid’ matter. If brain volume shrinks, the total volume of the cranium remains the same, so VBR goes up.

I hope that you will be curious to know something about these various parts of the brain and that you will exploit the internet to get some information.

The ids are numeric for patients with brain injuries and have the form ‘c01’, ‘c02’, etc., for control subjects. The variable ‘date_1’ contains the date of injury and ‘date_2’, ‘date_3’, etc., the dates on which the corresponding brain scans were performed.

You might like to have a look at fixing dates in Wrangling Messy Spreadsheets into Useable Data for some ideas on using dates to extract, for example, the elapsed time between two dates as a numerical variable.
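A sketch of one way to compute elapsed time in R, assuming the dates arrive as ‘YYYY-MM-DD’ strings (the actual format in the Excel file may differ and may need cleaning first):

```r
date_1 <- as.Date("2015-03-02")   # e.g. date of injury
date_2 <- as.Date("2016-01-15")   # e.g. date of a later scan
elapsed <- as.numeric(difftime(date_2, date_1, units = "days"))
elapsed   # 319 days between the two dates
```

Converting the difference to numeric gives you an elapsed-time variable you can use directly as a predictor in longitudinal models.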

  • Plot some key variables (VBR, CC_TOT, HPC_L_TOT, HPC_R_TOT) against elapsed time since injury, using appropriate plots to convey some idea of the general patterns in the data. Remember that any changes to the data must be done in R; do not edit the Excel file.
  • Comment on what you see.
  • Make a table (using a command in R, of course) showing how many observations are available from each subject.
  • Create a posting entitled ‘Assignment 6’ on Piazza in which you upload your Rmarkdown .R file and the html file it produces. Make it private to the instructor until the deadline.
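One way to produce such a table of observations per subject, sketched on toy data with assumed names from the reshaping step:

```r
# 'long' and its columns are hypothetical stand-ins for the real long file
long <- data.frame(id        = c("01", "01", "01", "c01", "c01"),
                   HPC_L_TOT = c(3.1, 3.0, 2.9, 3.5, 3.4))
table(long$id)   # number of rows (occasions) available per subject
```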

Day 27: Friday, March 8

Project:

  • Review the description of the project in the course description.
  • Meet with your team this weekend to choose one outcome variable you would like to focus on among VBR, CC_TOT, HPC_L_TOT, HPC_R_TOT, and discuss what approach you would like to use to analyze factors that are related to recovery. Prepare a short summary of your plans.
  • Schedule a meeting of your team with the instructor by posting a message with your preferred 30-minute slot on Wednesday, March 13, between 1 pm and 7 pm. Use the folder project and post the message to the entire class so teams will know which 30-minute slot other teams have already selected.

Day 28: Monday, March 11

Continuation of day 19

Day 29: Wednesday, March 13

Class links:

Asking meaningful questions + dealing with heteroskedasticity: pdf / R scripts

Day 30: Friday, March 15

Class links:

Day 31: Monday, March 18

Class links:

Sample questions on causality:

  1. Consider the linear DAG above and the following models:
    1. Y ~ X
    2. Y ~ X + Z6
    3. Y ~ X + Z1
    4. Y ~ X + Z1 + Z4
    5. Y ~ X + Z1 + Z3
    6. Y ~ X + Z3 + Z6
    7. Y ~ X + Z1 + Z5
    8. Y ~ X
    a. For each of these models, discuss briefly whether fitting the model would produce an unbiased estimate of the causal effect of X.
    b. Among the models that provide an unbiased estimate of the causal effect of X, order them, to the extent possible from the information in the DAG, according to the expected standard deviation of \(\hat{\beta}_X\). Briefly state the basis for your ordering.
    c. Are there reasons why you might prefer to use a model that the DAG would identify as having a larger standard deviation of \(\hat{\beta}_X\)?
  2. Consider a multiple regression of the form \(Y = X_1 \beta_1 + X_2 \beta_2 + \varepsilon\), where \(\varepsilon \sim N(0, \sigma^2 I)\) and \(X_1\), \(X_2\) represent blocks of variables such that the matrix \([X_1 \; X_2]\) is of full column rank.
    Prove that the Added Variable Plot for the regression of \(Y\) on \(X_1\) has the same vector of least-squares coefficients as the least-squares coefficients for \(X_1\) in the multiple regression.
  3. Consider the following statement:
    “In a multiple regression, if you add a predictor whose effect is not significant, the coefficients of the other predictors should not have changed very much, nor should the p-values associated with them.”
    Is this a valid statement? If so, discuss why, illustrating your answer with appropriate figures.
  4. Are there any situations in which it would be important to drop a term in a model although its coefficient is highly statistically significant? Discuss the circumstances, if any, in which this would be true, and the consequences of including or excluding the variable in question.
  5. Are there any situations in which it would be important to include a term in a model although its coefficient is not statistically significant? Discuss the circumstances, if any, in which this would be true, and the consequences of including or excluding the variable in question. Note that this issue can arise in a number of contexts: satisfying the principle of marginality, and including a variable to ensure that the estimate of another variable is unbiased, notably in the context of causal estimation.

Day 32: Wednesday, March 20

Class links:

Day 33: Friday, March 22

Class links:

Day 34: Monday, March 25

Class links:

Day 35: Wednesday, March 27

Class links:

Day 36: Wednesday, April 3

Day 37: Friday, April 5

  • Presentations: 10 minutes

Day 38: Monday, April 8

Tutorial: tentative


  1. R is to S as Linux is to Unix as GNU C is to C as GNU C++ is to C …. S, Unix, C and C++ were created at Bell Labs in the late 60s and the 70s. R, Linux, GNU C and GNU C++ are public license re-engineerings of the proprietary S, Unix, C and C++ respectively.↩︎