Where there is no uncertainty, there cannot be truth – Richard Feynman
Classes meet on Tuesdays and Thursdays from 2:30 to
4:00 in ACW 104.
Instructor: Georges Monette
Important dates: 2024-2025 Academic
Calendar
Textbook:
References:
Email: Messages about course content should be posted publicly to Piazza so the discussion can engage and benefit everyone.
Files used today: Regression Review
Due Friday, September 7, 5 pm
Introduction:
followed by the
name you prefer to use in class, e.g. Jon Smith
Github: jsmith
. If you don’t have one already, leave
this blank for now. You can come back and edit it later.assn0
folders when you submit your
post.Due Wednesday, September 11, 9 pm
Tools | Global Options ... | Terminal
. Then click on the
box in Shell paragraph to the right of New
terminals open with:
assn1
folder before submitting your
post. - If you are hoping that someone will provide answers to questions
in your post select Question
as the Post Type
.
Otherwise, select Note
.This exercise will be done in teams of 4 or 5 members which will be posted on Piazza. All members of a team work on a similar problem but with different parameters. This way, you can help each other with the concepts but each team member must, ultimately, do their own work.
The purpose of the assignment is to explore the meaning of p-values. Before starting, stop and reflect on what it means for an experiment to ‘achieve’ a p-value of 0.049. What meaning can we give to the quantity ‘0.049’? How is it related to the probability that the null hypothesis is correct?
To keep things very simple suppose you want to test \(H_0: \mu =0\) versus \(H_1: \mu > 0\) (note: the problem is also easy but takes a bit more coding if you assume a two-sided alternative: \(H_0: \mu =0\) versus \(H_1: \mu \neq 0\) – to help focus on the essential concepts, we’ll use a one-sided alternative) and you are designing an experiment in which you plan to take a sample of independent random variables, \(X_1, X_2, ... , X_n\) which are iid \(\textrm{N}(\mu,1)\), i.e. the variance is known to be equal to 1. You plan to use the usual test based on \(\bar{X}_n\) rejecting \(H_0\) for values of \(\bar{X}_n\) that are far from 0.
An applied example would be testing for a change in value of a response when all subjects are submitted to the same conditions and the measurement error of the response is known. In that example \(X_i\) would be the ‘gain score’, i.e. post-test response minus the pre-test response exhibited by the \(i\)th subject.
Note: Do the assignment using R and using a Markdown file. You can use the Markdown sample file in Assignment 1 as a template. Upload both a pdf file and your R script to Piazza.
Let the selected probability of Type I error be \(\alpha = 0.05\). Depending on your team, you have a budget that allows you to collect a sample of \(n\) observations as follows:
Team | \(n\) |
---|---|
A: Akaike (Hirotsugu Akaike) | 5 |
B: Blackwell (David Blackwell) | 8 |
C: Cox (Gertrude Mary Cox) | 10 |
D: David (Florence Nightingale David) | 15 |
Depending on your position on the team, you are interested in answering the questions below using the following values of \(\mu_1\):
Position in Team | \(\mu_1\) | Cohen’s terminology for effect size: \(\mu_1/\sigma\) |
---|---|---|
1 | all | combine ROC curves for results below |
2 | 0.2 | small effect size |
3 | 0.5 | medium effect size |
4 | 0.8 | large effect size |
5 | 1 | very large effect size |
To delve more deeply into these issues you can read Wasserstein and Lazar (2016) and Wasserstein, Schirm, and Lazar (2019). Concerns about \(p\)-values have been around for a long time, see Schervish (1996). For a short overview see Bergland (2019). There are two interesting and influential papers by John Ioannidis (2005), (2019).
For an entertaining take on related issues see Last Week Tonight: Scientific Studies by John Oliver (2016) (warning: contains strong language and political irony that many could consider offensive – watch at your own risk!).
Preparing for the next class: Play your way through these scripts:
Step your way through the scripts and experiment as you go along. If you have questions or comments, post them on Piazza.
These scripts give you a tour of various basic features of R. They are very basic and meant to provide reminders of things you can do when a need arises. If you are new to R you probably know some other computing language well and these scripts can introduce you to R syntax.
What’s the best way to learn R? That depends on your learning style, your previous experience and how deep you want to go.
See the discussion in ‘Day 2’. Start or join a discussion on Piazza on ‘best ways to learn R’.
The ‘spida2’ package should be reinstalled regularly since it’s likely to be updated frequently during the course.
Tutorials: Start this week on Friday, September 13 at 9 am.
Why R? What about SAS, SPSS, Python, Julia, among others?
SAS is a very high quality, intensely engineered, environment for statistical analysis. It is widely used by large corporations. New procedures in SAS are developed and thoroughly tested by a team of 1,000 or more SAS engineers before being released. It currently has more than 300 procedures.
R is an environment for statistical programming and development that has accumulated many somewhat inconsistent layers developed over time by people of varying abilities, many of whom work largely independently of each other. There is no centralized quality testing except to check whether code and documentation run before a new package is added to R’s main repository, CRAN.
When this page was last updated (at 23:38 on November 27 2024), CRAN had 21,708 packages.
In addition, a large number of packages under development are available through other repositories, such as Github.
The development of SAS began in 1966 and that of R (in the form of its predecessor, S, at Bell Labs) in 1976.
The ‘design’ of R (using ‘R’ to refer to both R and to S) owes a lot
to the design of Unix. The idea is to create a toolbox of simple tools
that you link together yourself to perform an analysis.
Unix, now
mainly as Linux1, commands were simple tools linked together
by pipes so the output of one command is the input of the next.
To do anything you need to put the commands together yourself.
The same is true of R. It’s extremely flexible but at the cost of requiring you to know what you want to do and to be able to use its many tools in combination with each other to achieve your goal. Many decisions in R’s design were intended to make it easy to use interactively. Often the result is a language that is very quirky for programming.
SAS, in contrast, requires you to select options to run large procedures that purport to do the entire job.
This is an old joke: If someone publishes a journal article about a new statistical method, it might be added to SAS in 5 to 10 years. It won’t be added to SPSS until 5 years after there’s a textbook written about it, maybe another 10 to 15 years after its appearance in SAS.
It was added to R two years ago because the new method was developed as a package in R long before being published.
So why become a statistician? So you can have the breadth and depth of understanding that someone needs to apply the latest statistical ideas with the intelligence and discernment to use them effectively.
So expect to have a symbiotic relationship with R. You need R to have access to the tools that implement the latest ideas in statistics. R needs you because it takes people like you to use R effectively.
The role of R in this course is to help us
It’s very challenging to find a good way to ‘learn R’. It depend on where you are and where you want to go. Now, there’s a plethora of on-line courses. See the blog post: The 5 Most Effective Ways to Learn R
In my opinion, ultimately, the best way is to
Using R is like playing the piano. You can read and learn all the theory you want, ultimately you learn by playing.
We will discuss the scripts CAR_1.R and CAR_2.R in the tutorial this week. Copy them to a file in RStudio and play with them line by line. You can also run an entire script with Control-Shift-K to see how scripts can be used for reproducible research.
If you want a deeper understanding of R, consider reading Hadley Wickham, Advanced R
Continuation of Day 2
The Problem of Causality
A useful oversimplification:
The fundamental \(2 \times 2\) fundamental table of Statistics
Predictive Inference | Causal Inference | |
---|---|---|
Observational Data | OK: what you learn in Regression | Modern causal inference |
Experimental Data | Hard but rarely an issue | OK: what you learn in Experimental Design |
Files to play with:
We will work through some of the following files in the near future but I encourage you to play your way through them on your own: try different models, try different displays of data, diagnostics, etc. I’ll be working on updating them so they will change. Of course, I welcome your suggestions.
Files we’ll start using today:
Comments on Leverage and Mahalanobis Distance:
Some formulas are worth remembering because they work for multiple regression as well as simple regression. We’ve seen: \[ SE(\hat{\beta}_i) = \frac{1}{\sqrt{n}}\frac{s_e}{s_{x_i|others}} \] Here’s another one: You know that the diagonal elements, \(h_{ii}\) of the hat matrix \[ H = X(X'X)^{-1}X' \] are the leverages showing how much a change in one particular \(y_i\) would change the fitted value \(\hat{y}_i\). i.e. \[ h_{ii} = \frac{\partial \hat{y}_i}{\partial y_i} \] It turns out that there’s nothing mysterious about where the \(h_{ii}\)s come from. They have a simple relationship with the squared Mahalanobis distance in the predictor space: \[ h_{ii} = \frac{1}{n}(1 + Z_i^2) \] where, in simple regression, \(Z_i\) is the Z-score (standardized with the MLE estimate of standard deviation, dividing by \(n\) instead of \(n-1\)). \(|Z_i|\) also happens to be Mahalanobis distance in one dimension.
The formula generalizes to multiple regression if you take \(Z_i\) to be Mahalanobis distance, so Mahalanobis distance is just a generalization of the Z-score to higher dimensions.
See the links above on Leverage and Mahalanobis Distance.
Comments:
Comment on AVPs: The fact that a simple regression applied to the AVP leads to an estimate of the slope that
provided one calculates these quantities using maximum likelihood estimates (it’s all asymptotically true if one uses the degree-of-freedom adjusted estimates), is the content of the Frisch-Waugh-Lovell Theorem that was ‘discovered’ in steps over a period of approximately 30 years. The first version applied to lagged time series and was discovered by Ragnar Frisch. He got the first Nobel Prize in Economics in 1931, for his general contributions, not just the start of this theorem. However, like other statistical ‘objects’ that have names associated with them, the actual original theorem was published by Udny Yule in 1911 (I am shamelessly using Wikipedia, and its gifted contributors, as a source). The theorem was refined over time by Frederick Waugh and Michael Lovell who each added portions of it. Lovell is known for an elegant proof.
The theorem shows how the AVP is the window into the way a multiple regression determines an individual coefficient.
Possible quiz questions:
Look at the link to possible questions for the quiz on Thursday under “Quick Links” above.
Simpsons’ Paradox:
Continuation of previous day.
Links:
Links:
Links:
Background material:
Y ~ Z + Col
. This shows
how models matter when you want to give a coefficient a
causal interpretation.Continuation of previous
Two key ideas:
Y ~ X + Zs
where
Zs
can represent one or more than one variable.X
in the multiple regression is the same as the
regression coefficient in the simple regression of \(Y - \hat{Y}\) against \(X - \hat{X}\), where \(\hat{Y}\) and \(\hat{X}\) are obtained by regression on
Zs
. Moreover \(SE(\hat{\beta}_X)\) is the same (if you
ignore differences in denominator degrees of freedom, e.g. if you use
the MLE estimators for the residual variance).Zs
. The \(SE(\hat{\beta}_X)\), however, may be
larger. But the model may be simpler and you may have more confidence in
its validity. This is the trade-off in using propensity scores for
causal inference. Note that you can add ‘covariates’ to a propensity
score model.Sample questions for the quiz on Tuesday:
Y ~ X
Y ~ X + Z6
Y ~ X + Z1
Y ~ X + Z1 + Z4
Y ~ X + Z1 + Z3
Y ~ X + Z3 + Z6
Y ~ X + Z1 + Z5
Y ~ X + Z2 + Z6 + Z7
Y ~ X + Z1 + Z4
Y ~ X + Z2 + Z8
X
.X
. Order them, if possible from the
information in the DAG, according to the expected standard deviation of
\(\hat{\beta}_X\). Briefly state the
basis for your ordering.Links:
Background material:
Y ~ Z + Col
. This shows
how models matter when you want to give a coefficient a
causal interpretation.Midterm Test
Project
Class
Links:
Some sample questions for quiz on November 7:
Note: You should know the definitions of the Bernoulli, Binominal, Multinomial, Normal, Multivariate Normal, Poisson, Exponential, Gamma, and Uniform distributions. If others are needed, their definitions will given in the quiz. You should also know the definition of moment-generating functions, the characterization of exponential families, and the definition and derivation of the score and Fisher information identities.
Continuation
Announcement:
Links:
Sample quiz question for November 19:
Consider estimating the average lifetime of some electronic component, a ‘gizmo’, whose lifetime has an Exponential(\(\psi\)) distribution: \[f(x|\psi) = \psi \exp\{-\psi x\}\quad x > 0, \psi > 0\] using the ‘rate’ parametrization for the Exponential distribution.
Suppose N gizmos are observed for a year and n (1 < n < N ) burn out during the year, with lifetimes recorded as: \(x_1 , x_2 , ..., x_n\), where \(x\) is measured in years (fractional, of course). The remaining \(N − n\) are still functioning at the end of the year.
Here are some sample final exam questions for the course. I apologize if there are still some redundancies and errors.
Select the 20 hardest question for your team and let each team member work on five of these questions. Your best strategy is to try to choose the questions you find the hardest.
Post, compare and discuss your answers within your team.
Make your answers and discussion public to the class by December 2.
Links:
R is to S as Linux is to Unix as GNU C is to C as GNU C++ is to C …. S, Unix, C and C++ were created at Bell Labs in the late 60s and the 70s. R, Linux, GNU C and GNU C++ are public license re-engineerings of the proprietary S, Unix, C and C++ respectively.↩︎