This version: February 14 2025 00:50
Doubt is not a pleasant condition, but certainty is absurd. – Voltaire
Where there is no uncertainty, there cannot be truth – Richard Feynman
How harmful is smoking?: Cigarette consumption and life expectancy in 189 countries in 2004: Correlation is not causation. This is an example of the ecological fallacy, which is itself an example of a deeper phenomenon: Simpson’s Paradox.
Classes meet on Mondays, Wednesdays and Fridays from 9:30 to 10:20 in SC 222.
Tutorial: On Zoom (click here), weekly: Tuesdays at 9am
Instructor: Georges Monette
Office hours: On Zoom (click here), weekly: Tuesdays at 10am
Email: Messages about course content or course
organization should be posted publicly on Piazza so everyone can benefit
from the questions and discussions.
This evening, I will resend invitations to join Piazza to all students currently registered in the course.
Topic 1: AMSTAT News Survey of Master’s Graduates (2023 US Statistics Graduates surveyed in mid-2024)
Topic 2: The meaning of p-values Consider the simplest of statistical tests. Suppose you plan to sample 100 observations from a population that is known to have a normal distribution with variance equal to 1 but unknown non-negative mean \(\mu\).
You plan to test \(H_0: \mu = 0\) versus the one-sided alternative \(H_A: \mu > 0\).
Consider what happens if you use a 5% test, i.e. reject the null hypothesis if \(p < 0.05\).
Suppose you reject \(H_0\). What does this tell you about \(H_0\)?
What if the \(p\)-value is 0.049? Or what if the \(p\)-value is 0.0003?
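As a concrete sketch of this setup (the specific sample means below are only illustrative), the one-sided z-test with \(n = 100\) and \(\sigma = 1\) works out as:

```r
# One-sided z-test of H0: mu = 0 vs HA: mu > 0.
# sigma = 1 is known and n = 100, so xbar ~ N(mu, 1/100).
n <- 100
xbar_crit <- qnorm(0.95) / sqrt(n)   # reject H0 at 5% when xbar exceeds ~0.1645

p_value <- function(xbar) pnorm(sqrt(n) * xbar, lower.tail = FALSE)

p_value(0.165)   # just under 0.05
p_value(0.343)   # roughly 0.0003
```

Playing with `p_value` makes it clear that the \(p\)-value is a statement about the data given \(H_0\), not a probability of \(H_0\) itself.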
Due Thursday, January 9, 9 pm
Due: Monday, January 13 at 11:59 pm: Submit your answers on Piazza, preferably as a PDF file generated with R Markdown, but any other format is fine. Use the folder assn2 on Piazza for any posts submitted for this assignment. Do as much of the assignment as you can. You will be graded on evidence of ‘honest effort’.
Discussion (The questions in this part are rhetorical, i.e. you don’t need to answer them)
The purpose of this assignment is to reawaken the skills you learned in MATH 2131 and MATH 3131, and to see whether some important statistical concepts really mean what most people who use them think they mean.
Consider the simplest of statistical tests. Suppose you plan to sample 100 observations from a population that is known to have a normal distribution with variance equal to 1.0 but unknown non-negative mean \(\mu\).
You plan to test \(H_0: \mu = 0\) versus one-sided alternative \(H_A: \mu > 0\).
Consider what happens if you use a 5% test, i.e. reject the null hypothesis if \(p < 0.05\).
Suppose you reject \(H_0\). What does this tell you about \(H_0\)?
We know that the p-value tells you something about the probability of an event related to the data given \(H_0\). So, although we know that the \(p\)-value isn’t the actual probability of \(H_0\) given the data, it’s natural to feel that they are somehow related. After all, why would we use a \(p\)-value unless it tells us something about \(H_0\)? We aren’t interested in the data. What we are really interested in is \(H_0\).
So it’s useful to actually explore the connection between \(p\)-values and the probability of \(H_0\). But we can’t get an actual probability for \(H_0\), without also hypothesizing a prior distribution on the hypotheses: we need a hypothetical probability for \(H_0\) and a hypothetical distribution over the possible values in \(H_A\).
In this exercise, you will consider the posterior probability of \(H_0\) given different possibilities for prior probabilities to get a sense of the strength of the evidence against \(H_0\) when you have a \(p\)-value of, for example, 0.049, or of 0.001.
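One minimal sketch of such a posterior calculation follows. The prior \(P(H_0) = 0.5\) and the point alternative \(\mu_1 = 0.3\) are hypothetical choices made only for illustration; the assignment asks you to explore different possibilities.

```r
# Posterior probability of H0: mu = 0 given xbar, under a hypothetical prior:
# mass prior0 on H0 and the remaining mass on a single point mu1 > 0.
# With sigma = 1 and n = 100, xbar ~ N(mu, 1/n).
n <- 100
post_H0 <- function(xbar, mu1 = 0.3, prior0 = 0.5) {
  se <- 1 / sqrt(n)
  L0 <- dnorm(xbar, 0,   se)   # likelihood of xbar under H0
  L1 <- dnorm(xbar, mu1, se)   # likelihood of xbar under the point alternative
  prior0 * L0 / (prior0 * L0 + (1 - prior0) * L1)
}

# xbar corresponding to p = 0.049 (z about 1.65):
post_H0(qnorm(1 - 0.049) / sqrt(n))   # roughly 0.4 -- far larger than 0.049
```

The point of the sketch: a \(p\)-value just under 0.05 can coexist with a posterior probability of \(H_0\) of around 40% under quite ordinary priors.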
Assignment:
As you work on this assignment, refer back occasionally to the questions in Quiz 1. How is your thinking about these questions evolving? If the same questions were to appear in Quiz 3, how would you answer them?
Due Sunday, January 12, 9 pm
Go to Tools | Global Options... | Terminal, then click on the box in the Shell paragraph to the right of “New terminals open with:”.
Announcements:
Tutorial today at 4:30 on Zoom. See link under Quick Links
Tutorials on Tuesday mornings at 9am on Zoom. See link under Quick Links
See modified course description
Topic 1: The Trapezoid of Means: Simpson’s Paradox with Paik-Agresti and Liu-Meng Diagrams / R script
Topic 2: How to estimate everything
Topic 3: Regression Review: Regression in R
Quiz on Wednesday:
Sample question:
Given a table of conditional mean survival rates in percent (Y) by Gender (G: F or M) for two treatments (X: A or B) for a disease D:

| | X = A | X = B |
|---|---|---|
| G = F | 90.0 | 70.0 |
| G = M | 40.0 | 30.0 |

The numbers of individuals in each group are:

| | X = A | X = B |
|---|---|---|
| G = F | 100 | 300 |
| G = M | 400 | 100 |
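As a worked illustration of what these two tables yield together (this is the arithmetic, not the quiz answer):

```r
# Conditional survival rates (%) and group sizes from the tables above
rate <- matrix(c(90, 70,
                 40, 30), nrow = 2, byrow = TRUE,
               dimnames = list(G = c("F", "M"), X = c("A", "B")))
size <- matrix(c(100, 300,
                 400, 100), nrow = 2, byrow = TRUE,
               dimnames = list(G = c("F", "M"), X = c("A", "B")))

# Marginal (overall) survival rate for each treatment:
marginal <- colSums(rate * size) / colSums(size)
marginal   # A: 50, B: 60 -- B looks better overall,
           # although A is better within each gender: Simpson's Paradox
```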
Due Sunday, January 19, 9 pm
Work your way through Regression Review: Regression in R / R script and post notes/questions/etc. using the folder assn4.
Formulate questions and post them.
Find errors (e.g. code that no longer works because things change) and flag them.
Raise topics for discussion.
Reply to other students’ postings.
Topic 1: How to estimate everything
Topic 2: Some real data on Smoking
Announcement: Tutorial on Zoom (click here), every Tuesday from 1:30 to 2:30 pm.
Quiz Today:
Topic 1 (continuing): Regression Review: Regression in R / R script
Topic 2: Some real data on Smoking
Concepts and Theory
Using R:
Learning R:
Why R? What about SAS, SPSS, Python, among others?
SAS is a very high quality, intensely engineered, environment for statistical analysis. It is widely used by large corporations. New procedures in SAS are developed and thoroughly tested by a team of 1,000 or more SAS engineers before being released. It currently has more than 300 procedures.
R is an environment for statistical programming and development that has accumulated many somewhat inconsistent layers developed over time by people of varying abilities, many of whom work largely independently of each other. There is no centralized quality testing except to check whether code and documentation run before a new package is added to R’s main repository, CRAN. When this page was last updated, CRAN had 22,037 packages.
In addition, a large number of packages under development are available through other repositories, such as github.
The development of SAS began in 1966 and that of R (in the form of its predecessor, S, at Bell Labs) in 1976.
The ‘design’ of R (using ‘R’ to refer to both R and to S) owes a lot to the design of Unix. The idea is to create a toolbox of simple tools that you link together yourself to perform an analysis. In Unix, which survives mainly as Linux,2 commands were simple tools linked together by pipes so that the output of one command is the input of the next. To do anything you need to put the commands together yourself.
The same is true of R. It’s extremely flexible but at the cost of requiring you to know what you want to do and to be able to use its many tools in combination with each other to achieve your goal. Many decisions in R’s design were intended to make it easy to use interactively. Often the result is a language that is very quirky for programming.
SAS, in contrast, requires you to select options to run large procedures that purport to do the entire job.
This is an old joke: If someone publishes a journal article about a new statistical method, it might be added to SAS in 5 to 10 years. It won’t be added to SPSS until 5 years after there’s a textbook written about it, maybe another 10 to 15 years after its appearance in SAS.
It was already in R two years before publication, because the new method was developed as an R package long before being published.
So why become a statistician? So you can have the breadth and depth of understanding that someone needs to apply the latest statistical ideas with the intelligence and discernment to use them effectively.
So expect to have a symbiotic relationship with R. You need R to have access to the tools that implement the latest ideas in statistics. R needs you because it takes people like you to use R effectively.
The role of R in this course is to help us
It’s very challenging to find a good way to ‘learn R’. It depends on where you are and where you want to go. There is now a plethora of online courses. See the blog post: The 5 Most Effective Ways to Learn R
In my opinion, ultimately, the best way is to
Using R is like playing the piano: you can read and learn all the theory you want, but ultimately you learn by playing.
Copy the following scripts as files in RStudio:
Play with them line by line.
Post questions arising from these scripts to the ‘question’ folder on Piazza. We will take up some questions in class and others in tutorials scheduled to deal with questions on R.
Continuation
Concepts and Theory
Using R:
Topic 2 (continuing): Regression Review
Sample question for Quiz 3 on Wednesday, January 29:
Topic 1:
Added-Variable-Plot (AVP):
Plot \(Y - \hat{Y}|_{other\ X's}\) against \(X - \hat{X}|_{other\ X's}\)
Shows the first-order leverage and influence of each observation on \(\hat{\beta}_i\)
Note that the definition of the AVP yields the simple scatterplot centered at the origin if the only ‘other variable’ is the intercept.
In R use:
library(car)
fit <- lm(income ~ education + prestige, data = Prestige)
avPlots(fit, ellipse = TRUE)
?avPlots # for more options
When you look at this plot, think of how you would interpret it if it
were a simple scatterplot.
You can actually construct a 95% confidence interval for the
multiple regression coefficient, \(\hat{\beta}_i\) for the effect of \(X_i\) keeping other \(X\)’s constant, using this plot the same
way you used a simple scatterplot. Of course, the 95% confidence
interval also gives you the result of a 5% test of the hypothesis that
\(\beta_i = 0\) in the multiple
regression.
Note: According to Wikipedia, the person who first discovered this fact is a statistician, Udny Yule, who published it in 1911. Nevertheless the theorem is known as the Frisch-Waugh-Lovell Theorem. The eponymous economist, Ragnar Frisch, won the first Nobel Prize in Economics in 1969. Udny Yule is also considered the ‘real’ discoverer of Simpson’s Paradox, many years before Simpson published his paper on it. Thank goodness theorems and paradoxes are rarely named after their original discoverers. E.g. we might have no idea what someone is talking about when they refer to ‘Yule’s Theorem’. I hope he considers that a consolation.
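You can check the Frisch-Waugh-Lovell result numerically: the simple-regression slope in the added-variable plot equals the corresponding multiple-regression coefficient. A sketch using the Prestige example above:

```r
library(car)   # also provides the Prestige data (from carData)

fit <- lm(income ~ education + prestige, data = Prestige)

# Residualize both income and education on the 'other X' (prestige):
e_y <- resid(lm(income    ~ prestige, data = Prestige))
e_x <- resid(lm(education ~ prestige, data = Prestige))

# The slope of the AVP ...
coef(lm(e_y ~ e_x))["e_x"]
# ... equals the multiple-regression coefficient:
coef(fit)["education"]
```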
Residual versus Leverage plots:
Plot the standardized residual \(\frac{e_i}{\hat{\sigma} \sqrt{1 - h_{ii}}}\) against \(h_{ii}\).
Shows the first-order leverage and influence of each observation on
\(\hat{Y}_i\).
In R use:
plot(rstudent(fit) ~ hatvalues(fit))
or look at the fourth plot when you just use
plot(fit)
This plot will also show you selected isoquants of Cook’s Distance, one of perhaps over 50 diagnostics generated by two groups in the late 70s: R. Dennis Cook and Sanford Weisberg in one group, and David Belsley, Edwin Kuh and Roy Welsch in the other. Some analysts look at tables of these diagnostics and apply conventional cutoffs to flag problematic points. Many diagnostics are functions of each other, and I think we get much better insights by looking at them graphically, allowing us to detect isolated outliers or clusters of outliers and to see patterns involving more than one numeric diagnostic simultaneously.
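Cook’s Distance is itself a function of the two quantities in this plot, the (internally) standardized residual and the leverage, which is why its isoquants can be drawn on it. A sketch verifying this relationship on simulated data:

```r
# Cook's Distance as a function of standardized residual and leverage:
#   D_i = (r_i^2 / p) * h_ii / (1 - h_ii),  p = number of coefficients
set.seed(42)
d <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
d$y <- 1 + d$x1 - d$x2 + rnorm(30)
fit <- lm(y ~ x1 + x2, data = d)

h <- hatvalues(fit)
r <- rstandard(fit)          # internally standardized residuals
p <- length(coef(fit))
D <- r^2 / p * h / (1 - h)

all.equal(unname(D), unname(cooks.distance(fit)))   # TRUE
```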
Note that \[h_{ii} = \frac{1}{n}\left( 1 + \Delta_i^2 \right)\] where \(\Delta_i\) is the Mahalanobis distance of the i-th point from the mean of the predictor values in the sample (computed with the divisor-\(n\) variance). In simple regression, this is just the ‘z-score’ for \(X\) for the i-th observation.
\(h_{ii}\) obeys the following inequalities if the model has an intercept: \[\frac{1}{n} \le h_{ii} \le 1\] and \(h_{ii} = 1\) iff all the points omitting the i-th point are perfectly collinear. What inequalities does this imply for \(\Delta_i\)? What is the implication if \(h_{ii} = 1/n\)?
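The identity above is easy to check numerically in simple regression (here \(\Delta_i\) is the z-score of \(x_i\) computed with the divisor-\(n\) variance):

```r
# Check h_ii = (1/n) * (1 + Delta_i^2) in simple regression
set.seed(1)
n <- 20
x <- rnorm(n)
y <- rnorm(n)
h <- hatvalues(lm(y ~ x))

Delta2 <- (x - mean(x))^2 / mean((x - mean(x))^2)   # squared z-scores (divisor n)
all.equal(unname(h), (1 + Delta2) / n)              # TRUE
```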
Quiz question: Where would you expect to find outliers of ‘Type 1’, ‘Type 2’, and ‘Type 3’ in this plot?
Topic 2 (continuing): Regression Review
The class today was on the blackboard. See the recording for its content.
Summary on causality using linear causal DAGs

To estimate the causal effect of X on Y (see also the last page of the link above), the regression coefficient of X is an estimate of a causal effect of X on Y if:

- the model includes X and Y and ensures that all backdoor paths between X and Y are blocked. A backdoor path is blocked if either/or:
  - it contains a non-collider that is conditioned on, or
  - it contains a collider that is neither conditioned on nor has a conditioned descendant;
- the model does not condition on any descendant of X. That includes, of course, descendants of Y.

So, step-by-step: draw the DAG, block every backdoor path between X and Y, and avoid conditioning on descendants of X.
.Note that average causal effects can only be generalized to members of the population that were included in the observed sample. For example, if the sample was selected through a collider variable and the resulting backdoor path re-blocked by the inclusion of another variable along the path, then the observed average causal effect can only be generalized to members of the population that were selected through the collider.
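The backdoor reasoning above can be checked mechanically with the `dagitty` package. A sketch using the Coffee / Stress / Heart Damage example from class, assuming (as one of the two possible DAGs) that the arrow runs from Stress to Coffee, so that Stress is a confounder:

```r
library(dagitty)

g <- dagitty("dag {
  Stress -> Coffee
  Stress -> HeartDamage
  Coffee -> HeartDamage
}")

# Which sets of variables block all backdoor paths from Coffee to HeartDamage?
adjustmentSets(g, exposure = "Coffee", outcome = "HeartDamage")
# -> { Stress }
```

Reversing the Stress-Coffee arrow and rerunning `adjustmentSets` shows how the required adjustment set changes with the causal assumptions.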
Recommended Piazza weekly feedbacks:
To further explore DAGs and related concepts:

- Install the `dagitty` package in R and play with it.
- Visit the home page for the dagitty project. A DAG implies certain relationships among its variables. Some of these relationships are purely causal and can’t be checked with a statistical test on the data. For example, with the simple DAG we used for Coffee, Stress and Heart Damage, the correct causal analysis depended on the direction of the arrow between Coffee and Stress. If the arrow points from Coffee to Stress, then Stress is a mediating variable and should be excluded from the regression. If the arrow points from Stress to Coffee, then Stress is a confounding factor and should be included in the regression.
- Other relationships implied by a DAG are correlational, and you can use the `dagitty` package to test whether these correlational assumptions are satisfied; if they’re not, you can explore further to refine your DAG.
- Exploring DAGs with the `dagitty` package should be both fun and very instructive in deepening your understanding of causality.

Recommended books to go deeper into causal models:
Review
Midterm Test
Mixed Models:
Note that legal definitions of the “presumption of innocence” and of “proof beyond a reasonable doubt” are never formulated, to my knowledge, in such specific probabilistic terms. Nevertheless, this would seem to me to be a minimal interpretation of these concepts. For some purposes, in the United States, a \(p\)-value less than 0.05 is considered sufficient for some forms of evidence.↩︎
R is to S as Linux is to Unix as GNU C is to C as GNU C++ is to C …. S, Unix, C and C++ were created at Bell Labs in the late 60s and the 70s. R, Linux, GNU C and GNU C++ are public license re-engineerings of the proprietary S, Unix, C and C++ respectively.↩︎