Under Construction

Where there is no uncertainty, there cannot be truth – Richard Feynman

This version: April 23 2025 13:44

If you rotate this figure around a vertical axis (click on it), what plot do you see twice in each full rotation?

Calendar (tentative)

  • Classes Tuesdays and Thursdays, May 6 to June 12, 2025 10 am to 1 pm in PSE (Petrie) 321
    • except Thursdays, May 8, May 22, June 12: 11:30 am to 2:30 pm
  • Office hours: TBA
  • Communications: Messages and questions about the course should be posted publicly on Piazza. You may use private messages on Piazza for confidential messages in which the response is of no benefit to the rest of the class.

Day 1: May 6

Assignment 1: (individual) due Thursday, May 2 at noon

  • Summary:
    1. Install R and RStudio
    2. Get a free Github account
    3. Install git on your computer
    4. Connect with Piazza, create a LOG post and introduce yourself
  • 1. Install R and RStudio following these instructions. If you already have R and RStudio, update them to the latest versions.
  • 2. Get a free Github account: If you don’t have one, follow this excellent source of advice from Jenny Bryan on using git and Github with R and RStudio (if you have trouble accessing this link, try this).
    • Be sure to read Jenny Bryan’s advice before choosing a name. The name you choose will also be used to create a Unix account for you on blackwell.math.yorku.ca
    • CAUTION: Avoid installing ‘Github for Windows’ from the Github site at this stage. If you are tempted to do this, read this first.
  • 3. Install git on your computer using these instructions.
    • If you are curious about how git is used have a look at this tutorial!
    • As a final step: In the RStudio menu click on Tools | Global Options ... | Terminal. Then click on the box in Shell paragraph to the right of New terminals open with:
      • On a PC, select Git Bash
      • On a Mac, select Bash
    • You don’t need to do anything else at this time. We will see how to set up SSH keys to connect to Github and to blackwell through the RStudio terminal in a few lectures.
  • 4. Connect with Piazza and post about your experiences installing the above:
    • Join the MATH 6642 Piazza site by going to this URL: piazza.com/yorku.ca/spring2025/math6642
      • Use the access code ‘blackwell’ when prompted.
      • Create a post with the title LOG followed by the name you use socially, e.g. ‘LOG Jon Smith’. During the course you will edit this post to add links to your exercises and other contributions. For now, complete the post as follows:
        • The first line contains your formal name in York records, e.g. ‘Smith Jonathan’. This must match the name in York’s records so you can be correctly credited for your work.
        • The second line contains: ‘Github: jsmith’ where ‘jsmith’ is the name of your Github account.
        • Before saving the post:
          • In the Post to line select the Individual Student(s)/Instructor(s) button and type Instructors in the text box that appears.
          • Click on the log button in the list of folders.
          • Finally, click on the Post My Note to MATH 6642! button.
        • Here’s what a LOG file will look like as the course progresses.
      • Create another post on Piazza in which you introduce yourself to your colleagues:
        • The title should read: Introduction: followed by the name you use socially, e.g. Jon Smith
        • The first line should show the name of your Github account, e.g. Github: jsmith
        • Follow this with a discussion of your goals in taking this course.
        • Then describe which computing languages you are familiar with. Which ones are you proficient in?
        • Specifically with respect to R and RStudio:
          • Where and how did you learn R?
          • Do you use Rmarkdown?
          • Do you use the ‘hadleyverse’? e.g. tidyr, ggplot2
          • Do you often write functions to solve data analysis problems?
          • Have you written any packages, private or public?
        • Share some interests: hobbies, favourite musicians, movies, restaurants, etc.
        • Click on the introductions and on the assn1 folder when you submit your post.
      • Create yet another post entitled ‘Getting started’.
        • Describe problems you encountered installing R, RStudio, and git
        • Describe problems registering with Github
        • How could the instructions be improved to make the process easier?
        • If you couldn’t complete the installation, describe the problem or error message(s) you encountered.
        • Click on the r_rstudio_git folder and on the assn1 folder before submitting your post.
      • Add links to your posts in your LOG file:
        • Find your previous two posts in the list in the left-hand pane of Piazza. Hover over the listing and then hover over the downward-pointing arrow to get the links for your posts, e.g. @43 and @47.
        • Click on the listing for your LOG post and edit it by adding the following line:
          Assignment 1: @43 @47
          Then click on submit
        • As you complete assignments in the course, you will update your LOG file with links to your contributions.

Day 2: May 2

  • Continue Day 1
  • Outline of Linear Algebra for Regression
  • Three Basic Theorems for Regression
  • Exercises:
    1. Let \(Y\) and \(X\) be numerical variables and let \(G\) be a factor. Consider the following models. All but one of these models will produce the same regression coefficient for X (or Xr), but they will produce different standard errors. Identify the model that produces a different coefficient. Rank the others, where you can, according to the standard error of the estimated coefficient for X (or Xr), stating which, if any, would be equal (assume a very large \(n\) and ignore the effect of slight differences in degrees of freedom for the error term). Explain your reasoning briefly.
      1. Y ~ X + G
      2. Y ~ X
      3. Yr ~ Xr where Yr is the residual of Y regressed on G and similarly for Xr
      4. Y ~ Xr
      5. Y ~ X + Xh where Xh is the least-squares predictor of X based on G
      6. Y ~ X + Xh + Zg where Zg is a G-level numerical variable, i.e. it has the same value for all observations with a common value of G.
    2. Let \(Y\) and \(X\) be numerical variables and let \(Z_1, Z_2, ..., Z_k\) be numerical or factor variables. Consider the following models. All but one of these models will produce the same regression coefficient for X (or Xr), but they will produce different standard errors. Identify the model that produces a different coefficient. Rank the others, where you can, according to the standard error of the estimated coefficient for X (or Xr), stating which, if any, would be equal (assume a very large \(n\) and ignore the effect of slight differences in degrees of freedom for the error term). Explain your reasoning briefly.
      1. Y ~ X + Z1 + Z2 + ... +Zk
      2. Y ~ X
      3. Yr ~ Xr where Yr is the residual of Y regressed on Z1, Z2, ..., Zk and similarly for Xr
      4. Y ~ Xr
      5. Y ~ X + Xh where Xh is the least-squares predictor of X based on Z1, Z2, ..., Zk
      6. Y ~ X + Xh + Zg where Zg is a linear combination of Z1, Z2, ..., Zk.
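
The equivalences in these exercises can be checked numerically before (or after) proving them with the three basic theorems. Below is a minimal base-R sketch for Exercise 1; the simulated data and the particular choice of Zg are hypothetical, chosen only to satisfy the definitions in the exercise:

```r
set.seed(1)
n <- 200
G <- factor(sample(LETTERS[1:4], n, replace = TRUE))
X <- rnorm(n) + 0.8 * as.numeric(G)        # X is related to G
Y <- 1 + 2 * X + as.numeric(G) + rnorm(n)  # Y depends on X and on G

Xr <- residuals(lm(X ~ G))   # X residualized on G
Yr <- residuals(lm(Y ~ G))   # Y residualized on G
Xh <- fitted(lm(X ~ G))      # least-squares predictor of X based on G
Zg <- as.numeric(G)^2        # an arbitrary G-level numerical variable

b1 <- coef(lm(Y  ~ X + G))["X"]
b2 <- coef(lm(Y  ~ X))["X"]
b3 <- coef(lm(Yr ~ Xr))["Xr"]
b4 <- coef(lm(Y  ~ Xr))["Xr"]
b5 <- coef(lm(Y  ~ X + Xh))["X"]
b6 <- coef(lm(Y  ~ X + Xh + Zg))["X"]

# Compare the six estimates: five agree to machine precision.
# Which one differs, and why? Compare the standard errors as well.
round(c(b1, b2, b3, b4, b5, b6), 6)
```

The same script, with G replaced by a set of predictors Z1, ..., Zk, serves as a check for Exercise 2.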

Day 3: May 7

Assignment 2: (teams) (see due dates below)

  • Do the exercises below following these directions:
    • Each member of the team does every fourth question (note that the largest team now has 4 members) by the first deadline. Post each solution as a separate post on Piazza and make it private to your team so members of other teams can work independently. Use the folder ‘assn1’ when posting.
    • Don’t forget to repeat the question so your post can be read and understood without having to use an external reference.
    • Between the first and second deadline, all members of the team help each other out to correct or improve their solutions.
    • Between the second and third deadline, the original author combines contributions by other members to create a polished answer.
    • Only after the third deadline, edit each solution to make it public to the whole class.
    • Have a look at the course description for more explanation.
    • Important note: It’s excellent to use online materials provided you include links to them and assess the quality of the information. Searching online is a very important part of research, but a lot of material is posted online by people who have a limited or superficial understanding of the topic they post about. It’s important for you to develop the critical ability to distinguish well-informed material from superficial or erroneous material. It’s also important to assess the context in which material is posted.
  • The two exercises posted on the last day. These make use of the ‘three basic theorems of regression’. They show how many different models can yield the same estimate of the ‘effect of X’. However, perhaps surprisingly, one model does not. Use the titles ‘Exercise: AVP 1’ and ‘Exercise: AVP 2’ for these exercises.
  • Exercises corresponding to numbers circled in green in Outline of Linear Algebra for Regression. Use the titles ‘Exercise: Linear Algebra X’ where X is the number of the exercise.
  • Deadlines: See the course description for the meaning of these deadlines.
    1. Friday, May 10 at noon
    2. Sunday, May 13 at noon
    3. Monday, May 14 at noon
  • Random sequence: 1 4 2 3

Day 5: May 14

  • continuation of previous day

Assignment 3: (individual) Due Tuesday, May 21 at noon

  • Do one of the first two exercises labelled ‘Exercises’ on line 1776 of Lab_1.R
  • Prepare your analysis in an .R or a .Rmd file that can be rendered in RStudio with ‘Control-K’
  • Post the R script to Piazza as a Private file until noon on May 21.

Assignment 4: (teams) (see due dates below)

  • UPDATED MAY 14 FOR 3 TEAMS OF 3
  • Follow the same directions as those for assignment 2.
  • There are 6 blocks of exercises labelled ‘EXERCISES’ in capital letters in Lab_1.R. Consider each block to be an exercise. Label them ‘Lab 1: Exercise X’ where X runs from 1 to 6.
  • Consider statements 2 to 7, 10, 14 and 18 in 21+ statements about statistics. Use each of these 9 statements as a question and discuss whether it is valid. If a statement is not always valid, discuss when and why it might not be.
  • Deadlines: See the course description for the meaning of these deadlines.
    1. Friday, May 17 at noon
    2. Sunday, May 19 at noon
    3. Monday, May 20 at noon
  • Random sequence: 2 3 1

Day 7: May 21

Assignment 5: (individual)

  • Propose at least one name for the paradox illustrated by the Height-Weight-Health example, namely that neither Weight nor Height individually is significantly related to Health unconditionally but, together, each is significantly related. Post your suggestion(s) in a private post to the instructor before noon on Tuesday, May 28, so people come up with independent ideas. We’ll have a survey to choose the best name. Also post what you think would be a good prize for the winner.

Assignment 6: (teams) (see due dates below)

  • Follow the same directions as those for assignment 2.
  • There are 6 blocks of exercises labelled ‘EXERCISES’ in capital letters in Lab_2.R. Consider each block to be an exercise. Label them ‘Lab 2: Exercise X’ where X runs from 1 to 6.
  • Deadlines: See the course description for the meaning of these deadlines.
    1. Saturday, May 25 at noon
    2. Monday, May 27 at noon
    3. Tuesday, May 28 at noon
  • Random sequence: 3 2 1

Day 9: May 28

Assignment 7: (teams) (see due dates below)

  • Follow the same directions as those for assignment 2.
  • There are 17 blocks of questions labelled ‘QUESTION’ (in capitals) in Lab 5 GLMM with traditional methods.R. Treat each block as a separate exercise and label your answers ‘Lab 5: Question X’.
  • Deadlines: See the course description for the meaning of these deadlines.
    1. Saturday, June 1 at noon
    2. Monday, June 3 at noon
    3. Tuesday, June 4 at noon
  • Random sequence: 2 3 1

Day 10: May 30

  • see video

Day 11: June 4

  • no class

Day 12: June 6

Figure: Divergent chains in Stan

1. Ideas in Regression: Why Models Matter

Predictive versus Causal Inference

Regression: Correlation, Data and Beta Ellipses

2. Linear Models for Nested Data with Normal Response

Hierarchical and Mixed Models for Clustered Data

  • Slides: Hierarchical Models and Mixed Models / annotated

    • equivalent models with complementary insights
  • R script: Lab 1 - Mixed Models / html / pdf

    • Exploring data
    • Data: Levels and structure
    • Selecting a random subset of clusters
    • First look at variables
    • Looking at Level 2 variables (invariant within schools)
    • Creating additional Level 2 (and Level 1) variables with ‘capply’
    • Transformations of Level 1 variables within groups
    • Looking at data in 3D
    • Looking at Level 1 and Level 2 data using Lattice graphics
    • Visualizing fitted lines in beta space
    • Looking at between group effect
    • Fitting a mixed model
    • Convergence problems
    • Handling NAs
    • Hausman Test: Is the between effect different from the within effect?
    • Fitting a model with a contextual mean
    • Role of contextual variable for ses
    • Interpreting the model with contextual effect
    • Estimating the compositional effect (between effect)
    • Visualizing the fitted model
    • Plotting error bars
    • Using CWG instead of raw SES
    • CWG vs raw in RE model
    • Notes on testing: ML vs REML
    • Diagnostics
      • Diagnostics with Level 1 residuals
      • Scale - location plot:
      • Diagnostics with Level 2 residuals
      • Influence diagnostics – drop one row or one cluster at a time
    • Looking at the model
    • Predicted response
    • Plotting effects with confidence bounds
    • Building and testing the RE model
    • Using simulation to calibrate p-values
    • Simplifying the FE model
    • Some Dos, don’ts and whys
    • Wald or Likelihood Ratio Test
    • Some Level 2 diagnostics
    • Visualizing the model
    • More effect plots
    • Refining the model
    • Multilevel R squared
    • Visualizing the model and asking sharper questions
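
Several of the steps above (contextual mean, CWG, within and between effects) can be sketched outside the lab. The following is an illustration on simulated data with the lme4 package, not the lab’s own code; the lab uses ‘capply’, and base R’s ave plays the same role here. All variable names are hypothetical:

```r
library(lme4)

# Simulated school data: ses at Level 1, mathach as the response
set.seed(42)
J <- 30; nj <- 20
school <- factor(rep(1:J, each = nj))
u   <- rnorm(J, sd = 2)                          # school effects
ses <- rnorm(J * nj) + rep(rnorm(J), each = nj)  # ses varies between schools
mathach <- 12 + 2 * ses + u[school] + rnorm(J * nj, sd = 3)
hs <- data.frame(mathach, ses, school)

# Contextual (school-mean) ses -- the role played by 'capply' in the lab:
hs$ses.mean <- ave(hs$ses, hs$school)       # Level 2 variable
hs$ses.cwg  <- hs$ses - hs$ses.mean         # centered within group (CWG)

# Raw ses plus contextual mean: the coefficient of ses is the within
# effect; the coefficient of ses.mean is the contextual effect
# (between minus within)
fit.raw <- lmer(mathach ~ ses + ses.mean + (1 | school), data = hs)

# CWG parametrization: the coefficient of ses.mean is now the between effect
fit.cwg <- lmer(mathach ~ ses.cwg + ses.mean + (1 | school), data = hs)

fixef(fit.raw)
fixef(fit.cwg)
```

The two parametrizations fit the same model; only the interpretation of the ses.mean coefficient changes, which is why the lab can move between ‘CWG vs raw’ without changing the fit.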

Hierarchical and Mixed Models for Longitudinal Data

  • Slides: Longitudinal Models
  • R script: Lab 2 - Longitudinal Models / html / pdf
    • Detailed example showing how different longitudinal assumptions affect major estimates.
    • LME model
    • Hausman test:
    • Adjusting for time
    • Diagnostics: Level 1
      • Diagnostics for heteroskedasticity
      • Diagnostics for autocorrelation
    • Diagnostics: Level 2
    • Dropping observations
    • Modeling autocorrelation
    • Modeling heteroskedasticity
    • Interpreting different kinds of residual plots
    • Visualizing the impact of model selection
    • Displaying data and fitted values together
  • References:
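
The ‘modeling autocorrelation’ step above can be sketched with nlme. This is an illustration on simulated data, not the lab’s own script, and all names are hypothetical:

```r
library(nlme)

# Simulated longitudinal data: 40 subjects, 6 occasions, AR(1) errors
set.seed(7)
n.sub <- 40; n.occ <- 6
id   <- factor(rep(1:n.sub, each = n.occ))
time <- rep(0:(n.occ - 1), n.sub)
b0   <- rnorm(n.sub, sd = 2)                       # random intercepts
e    <- as.vector(replicate(n.sub, arima.sim(list(ar = 0.6), n.occ)))
dd   <- data.frame(y = 10 + 1.5 * time + b0[id] + e, time, id)

# Random-intercept model with independent Level 1 errors
fit0 <- lme(y ~ time, random = ~ 1 | id, data = dd)

# Same fixed effects, AR(1) serial correlation within subjects
fit1 <- update(fit0, correlation = corAR1(form = ~ time | id))

# Same fixed part, so the two REML fits are comparable:
anova(fit0, fit1)               # does AR(1) improve the fit?
fit1$modelStruct$corStruct      # estimated autocorrelation (Phi)
```

Ignoring serial correlation typically distorts the standard error of the time effect, which is the point of the heteroskedasticity and autocorrelation diagnostics above.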

3. Introduction to Bayesian Ideas and Modern Bayesian Methods

4. Introduction to Stan

Diverging chains

5. Non-Linear Models for Normal Responses: Asymptotic Functions of Time

  • Slides: Non-Linear Mixed Models
  • R script: Recovery after TBI
    • Asymptotic recovery curves with Stan
      • there is no intrinsic difference between linear and non-linear models
    • Problems with convergence – possible remedies: reparametrization, more informative prior, sensitivity analysis
    • Multivariate response model
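
As a sketch of an asymptotic recovery curve fitted with nlme rather than Stan (simulated data; all names are hypothetical): SSasymp parametrizes the curve as Asym + (R0 - Asym) * exp(-exp(lrc) * t), and the lrc = log(rate) parametrization is itself an instance of the reparametrization remedy mentioned above.

```r
library(nlme)

# Simulated recovery scores rising to a subject-specific asymptote
set.seed(11)
n.sub <- 25
times <- c(0, 1, 2, 4, 8, 16)                    # months post-injury
id    <- factor(rep(1:n.sub, each = length(times)))
t.m   <- rep(times, n.sub)
Asym.i <- 100 + rnorm(n.sub, sd = 5)             # subject asymptotes
score  <- Asym.i[id] - 40 * exp(-0.4 * t.m) + rnorm(length(t.m), sd = 3)
rec    <- data.frame(score, t.m, id)

# Non-linear mixed model: fixed Asym, R0, lrc; random asymptote by subject
fit <- nlme(score ~ SSasymp(t.m, Asym, R0, lrc),
            data   = rec,
            fixed  = Asym + R0 + lrc ~ 1,
            random = Asym ~ 1 | id,
            start  = c(Asym = 95, R0 = 55, lrc = -1))
fixef(fit)   # simulation truth: Asym = 100, R0 = 60, lrc = log(0.4)
```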

6. Models for Non-Normal Responses: GLMMs

7. Missing Data with Bayesian Methods

  • Synopsis:
    • If the missingness is MAR and ignorable in the longitudinal ‘Model of Analysis’ (MA), then you can analyse the observations you have.
    • Sometimes missingness is not MAR in the MA but would be MAR if you could include other ‘auxiliary variables’, e.g. mediators, that you need to exclude from the MA.
    • Multiple Imputation provides a solution and Bayesian Modeling provides another.
  • Slides: Missing Data
  • R script: Lab 4 a: Missing Data with Multiple Imputation
  • Bayesian approach: Multiple imputation involves performing a small number of analyses, each with a different set of sampled values for the missing data, then combining these analyses by averaging them and taking into account the variability between them.
  • With a Bayesian analysis we can do the equivalent of an infinite number of analyses, each with a different set of sampled values for the missing data. That is, we use the whole distribution of possible values for the missing data and then marginalize over the missing data. This happens to be easy with MCMC because we marginalize simply by ignoring the random draws generated for the missing values.
  • With a distribution represented as a density, marginalizing is extremely difficult and conditioning is relatively easy. With a distribution represented by a sample, it’s the opposite: marginalizing is trivial, conditioning is challenging.
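
The last point can be seen in a few lines of base R: draw jointly, then marginalize by simply ignoring the draws for the missing value. This is a toy illustration, not the lab’s model:

```r
# Toy joint distribution of (y.mis, theta): y.mis plays the role of a
# missing value and theta depends on it. Sampling the two jointly and
# then ignoring y.mis yields draws from the marginal posterior of theta.
set.seed(3)
S     <- 10000
y.mis <- rnorm(S, mean = 5, sd = 2)              # draws of the missing value
theta <- rnorm(S, mean = 0.5 * y.mis, sd = 1)    # theta given y.mis

# Marginal of theta: keep the theta draws, drop the y.mis draws
c(mean = mean(theta), sd = sd(theta))
# Analytically: theta is N(2.5, sqrt(1 + 0.25 * 4)) = N(2.5, sqrt(2))
```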

The following lab uses Bayesian imputation for the data with missingness determined by the mediator Weness:

8. Shorter topics

8a Functions of time: Splines

8b Shortitudinal Data

Adjusting for measurement error in computed contextual variables.

8c Things That Can Go Wrong with Bayes

Other references

Postscript

John Fox and Tanya Murphy sent some very interesting references:

Issues and Errata

Acknowledgments

A great many collaborators, students and friends have contributed to many of the ideas in this course. I’d like to acknowledge a few with deep apologies to the many I’m missing.

  • Heather Krause
  • John Fox, Michael Friendly, Hugh McCague, Mirka Ondrack, Bryn Greer-Wootten, Jolynn Peck, Robert Cribbie, David Flora, Alyssa Counsell, Jessica Flake
  • Phil Chalmers, Carrie Smith, Ernest Kwan, Andrew Hunter
  • Jane Heffernan, Angie Raad
  • Yifaht Korman, Tammy Kostecki-Dillon, Pauline Wong, E. Manolo Romero Escobar
  • Andy Koh, Jordan Collins

References

Arnqvist, Göran. 2020. “Mixed Models Offer No Freedom from Degrees of Freedom.” Trends in Ecology & Evolution 35 (4): 329–35. https://doi.org/10.1016/j.tree.2019.12.004.
Bergland, Christopher. 2019. “Rethinking P-Values: Is ‘Statistical Significance’ Useless?” Psychology Today. https://www.psychologytoday.com/blog/the-athletes-way/201903/rethinking-p-values-is-statistical-significance-useless.
Best, Nicky, and Alexina Mason. 2012. “Bayesian Approaches to Handling Missing Data.”
Blackwell, Matthew. 2013. “Observational Studies and Confounding,” 6.
Buuren, Stef van, and Karin Groothuis-Oudshoorn. 2011. “Mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45 (3). https://doi.org/10.18637/jss.v045.i03.
Christensen, Rune Haubo B. n.d. “A Tutorial on Fitting Cumulative Link Mixed Models with Clmm2 from the Ordinal Package,” 10.
Daniels, Michael J., and Joseph W. Hogan. 2008. Missing Data in Longitudinal Studies: Strategies for Bayesian Modeling and Sensitivity Analysis. Chapman and Hall/CRC.
Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2025. “Bayesian Data Analysis, 3rd Ed. Home Page.” https://sites.stat.columbia.edu/gelman/book/.
Gillespie, Colin, and Robin Lovelace. n.d. Efficient R Programming. Accessed April 22, 2025. https://csgillespie.github.io/efficientR/.
Hernán, Miguel A, and James M Robins. 2020. Causal Inference: What If. Boca Raton: Chapman & Hall/CRC. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/.
Johnson, Paul, and John Gruber. n.d. “R Markdown Basics,” 19.
Kass, Robert E., Brian S. Caffo, Marie Davidian, Xiao-Li Meng, Bin Yu, and Nancy Reid. 2016. “Ten Simple Rules for Effective Statistical Practice.” PLOS Computational Biology 12 (6): e1004961. https://doi.org/10.1371/journal.pcbi.1004961.
Oliver, John. 2016. “Last Week Tonight with John Oliver: Scientific Studies.” HBO. https://www.youtube.com/watch?v=0Rnq1NpHdmw.
Pearl, Judea, and Dana Mackenzie. 2018. The Book of Why: The New Science of Cause and Effect. Basic Books.
“R Interface to CmdStan.” n.d. Accessed March 25, 2025. https://mc-stan.org/cmdstanr/.
Schervish, Mark J. 1996. “P Values: What They Are and What They Are Not.” The American Statistician 50 (3): 203–6.
“Stan.” n.d. Stan. Accessed April 22, 2025. https://mc-stan.org/.
Taylor, Jonathan, and Robert J. Tibshirani. 2015. “Statistical Learning and Selective Inference.” Proceedings of the National Academy of Sciences 112 (25): 7629–34. https://doi.org/10.1073/pnas.1507583112.