This version: February 14 2025 00:50
Doubt is not a pleasant condition, but certainty is absurd. – Voltaire
Where there is no uncertainty, there cannot be truth – Richard Feynman
How harmful is smoking?: Cigarette consumption and life expectancy in 189 countries in 2004: Correlation is not causation. This is an example of the ecological fallacy, which is itself an example of a deeper phenomenon: Simpson’s Paradox.
Classes meet on Mondays, Wednesdays and Fridays from 9:30 to 10:20 in SC 222.
Tutorial: On Zoom (click here), weekly: Tuesdays at 9am
Instructor: Georges Monette
Office hours: On Zoom (click here), weekly: Tuesdays at 10am
Email: Messages about course content or course
organization should be posted publicly on Piazza so everyone can benefit
from the questions and discussions.
This evening, I will resend invitations to join Piazza to all students currently registered in the course.
Topic 1: AMSTAT News Survey of Master’s Graduates (2023 US Statistics Graduates surveyed in mid-2024)
Topic 2: The meaning of p-values Consider the simplest of statistical tests. Suppose you plan to sample 100 observations from a population that is known to have a normal distribution with variance equal to 1 but unknown non-negative mean \(\mu\).
You plan to test \(H_0: \mu = 0\) versus the one-sided alternative \(H_A: \mu > 0\).
Consider what happens if you use a 5% test, i.e. reject the null hypothesis if \(p < 0.05\).
Suppose you reject \(H_0\). What does this tell you about \(H_0\)?
What if the \(p\)-value is 0.049? Or what if the \(p\)-value is 0.0003?
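As a concrete sketch of this setup (the specific sample means below are only illustrative), the one-sided z-test with \(n = 100\) and \(\sigma = 1\) works out as:

```r
# One-sided z-test of H0: mu = 0 vs HA: mu > 0.
# sigma = 1 is known and n = 100, so xbar ~ N(mu, 1/100).
n <- 100
xbar_crit <- qnorm(0.95) / sqrt(n)   # reject H0 at 5% when xbar exceeds ~0.1645

p_value <- function(xbar) pnorm(sqrt(n) * xbar, lower.tail = FALSE)

p_value(0.165)   # just under 0.05
p_value(0.343)   # roughly 0.0003
```

Playing with `p_value` makes it clear that the \(p\)-value is a statement about the data given \(H_0\), not a probability of \(H_0\) itself.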
Due Thursday, January 9, 9 pm
Due: Monday, January 13 at 11:59 pm: Submit your answers on Piazza, preferably as a PDF file generated with R Markdown, but any other format is fine. Use the folder assn2 on Piazza for any posts submitted for this assignment. Do as much of the assignment as you can. You will be graded on evidence of ‘honest effort’.
Discussion (The questions in this part are rhetorical, i.e. you don’t need to answer them)
The purpose of this assignment is to reawaken the skills you learned in MATH 2131 and MATH 3131, and to see whether some important statistical concepts really mean what most people who use them think they mean.
Consider the simplest of statistical tests. Suppose you plan to sample 100 observations from a population that is known to have a normal distribution with variance equal to 1.0 but unknown non-negative mean \(\mu\).
You plan to test \(H_0: \mu = 0\) versus one-sided alternative \(H_A: \mu > 0\).
Consider what happens if you use a 5% test, i.e. reject the null hypothesis if \(p < 0.05\).
Suppose you reject \(H_0\). What does this tell you about \(H_0\)?
We know that the p-value tells you something about the probability of an event related to the data given \(H_0\). So, although we know that the \(p\)-value isn’t the actual probability of \(H_0\) given the data, it’s natural to feel that they are somehow related. After all, why would we use a \(p\)-value unless it tells us something about \(H_0\)? We aren’t interested in the data. What we are really interested in is \(H_0\).
So it’s useful to actually explore the connection between \(p\)-values and the probability of \(H_0\). But we can’t get an actual probability for \(H_0\), without also hypothesizing a prior distribution on the hypotheses: we need a hypothetical probability for \(H_0\) and a hypothetical distribution over the possible values in \(H_A\).
In this exercise, you will consider the posterior probability of \(H_0\) given different possibilities for prior probabilities to get a sense of the strength of the evidence against \(H_0\) when you have a \(p\)-value of, for example, 0.049, or of 0.001.
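One minimal sketch of such a posterior calculation follows. The prior \(P(H_0) = 0.5\) and the point alternative \(\mu_1 = 0.3\) are hypothetical choices made only for illustration; the assignment asks you to explore different possibilities.

```r
# Posterior probability of H0: mu = 0 given xbar, under a hypothetical prior:
# mass prior0 on H0 and the remaining mass on a single point mu1 > 0.
# With sigma = 1 and n = 100, xbar ~ N(mu, 1/n).
n <- 100
post_H0 <- function(xbar, mu1 = 0.3, prior0 = 0.5) {
  se <- 1 / sqrt(n)
  L0 <- dnorm(xbar, 0,   se)   # likelihood of xbar under H0
  L1 <- dnorm(xbar, mu1, se)   # likelihood of xbar under the point alternative
  prior0 * L0 / (prior0 * L0 + (1 - prior0) * L1)
}

# xbar corresponding to p = 0.049 (z about 1.65):
post_H0(qnorm(1 - 0.049) / sqrt(n))   # roughly 0.4 -- far larger than 0.049
```

The point of the sketch: a \(p\)-value just under 0.05 can coexist with a posterior probability of \(H_0\) of around 40% under quite ordinary priors.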
Assignment:
As you work on this assignment, refer back occasionally to the questions in Quiz 1. How is your thinking about these questions evolving? If the same questions were to appear in Quiz 3, how would you answer them?
Due Sunday, January 12, 9 pm
Go to Tools | Global Options... | Terminal, then click on the box in the Shell paragraph to the right of “New terminals open with:”.
Announcements:
Tutorial today at 4:30 on Zoom. See link under Quick Links
Tutorials on Tuesday mornings at 9am on Zoom. See link under Quick Links
See modified course description
Topic 1: The Trapezoid of Means: Simpson’s Paradox with Paik-Agresti and Liu-Meng Diagrams / R script
Topic 2: How to estimate everything
Topic 3: Regression Review: Regression in R
Quiz on Wednesday:
Sample question:
Given a table of conditional mean survival rates in percent (Y) by Gender (G: F or M) for two treatments (X: A or B) for a disease D:

| | X = A | X = B |
|---|---|---|
| G = F | 90.0 | 70.0 |
| G = M | 40.0 | 30.0 |

The numbers of individuals in each group are:

| | X = A | X = B |
|---|---|---|
| G = F | 100 | 300 |
| G = M | 400 | 100 |
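As a worked illustration of what these two tables yield together (this is the arithmetic, not the quiz answer):

```r
# Conditional survival rates (%) and group sizes from the tables above
rate <- matrix(c(90, 70,
                 40, 30), nrow = 2, byrow = TRUE,
               dimnames = list(G = c("F", "M"), X = c("A", "B")))
size <- matrix(c(100, 300,
                 400, 100), nrow = 2, byrow = TRUE,
               dimnames = list(G = c("F", "M"), X = c("A", "B")))

# Marginal (overall) survival rate for each treatment:
marginal <- colSums(rate * size) / colSums(size)
marginal   # A: 50, B: 60 -- B looks better overall,
           # although A is better within each gender: Simpson's Paradox
```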
Due Sunday, January 19, 9 pm
Work your way through Regression Review: Regression in R / R script and post notes/questions/etc. using the folder assn4.
Formulate questions and post them.
Find errors (e.g. code that no longer works because things change) and flag them.
Raise topics for discussion.
Reply to other students’ postings.
Topic 1: How to estimate everything
Topic 2: Some real data on Smoking
Announcement: Tutorial on Zoom (click here), every Tuesday from 1:30 to 2:30 pm.
Quiz Today:
Topic 1 (continuing): Regression Review: Regression in R / R script
Topic 2: Some real data on Smoking
Concepts and Theory
Using R:
Learning R:
Why R? What about SAS, SPSS, Python, among others?
SAS is a very high quality, intensely engineered, environment for statistical analysis. It is widely used by large corporations. New procedures in SAS are developed and thoroughly tested by a team of 1,000 or more SAS engineers before being released. It currently has more than 300 procedures.
R is an environment for statistical programming and development that has accumulated many somewhat inconsistent layers developed over time by people of varying abilities, many of whom work largely independently of each other. There is no centralized quality testing except to check whether code and documentation run before a new package is added to R’s main repository, CRAN. When this page was last updated, CRAN had 22,037 packages.
In addition, a large number of packages under development are available through other repositories, such as github.
The development of SAS began in 1966 and that of R (in the form of its predecessor, S, at Bell Labs) in 1976.
The ‘design’ of R (using ‘R’ to refer to both R and to S) owes a lot to the design of Unix. The idea is to create a toolbox of simple tools that you link together yourself to perform an analysis. In Unix, which survives mainly as Linux,2 commands were simple tools linked together by pipes so that the output of one command is the input of the next. To do anything you need to put the commands together yourself.
The same is true of R. It’s extremely flexible but at the cost of requiring you to know what you want to do and to be able to use its many tools in combination with each other to achieve your goal. Many decisions in R’s design were intended to make it easy to use interactively. Often the result is a language that is very quirky for programming.
SAS, in contrast, requires you to select options to run large procedures that purport to do the entire job.
This is an old joke: If someone publishes a journal article about a new statistical method, it might be added to SAS in 5 to 10 years. It won’t be added to SPSS until 5 years after there’s a textbook written about it, maybe another 10 to 15 years after its appearance in SAS.
It was already in R two years before publication, because the new method was developed as an R package long before being published.
So why become a statistician? So you can have the breadth and depth of understanding that someone needs to apply the latest statistical ideas with the intelligence and discernment to use them effectively.
So expect to have a symbiotic relationship with R. You need R to have access to the tools that implement the latest ideas in statistics. R needs you because it takes people like you to use R effectively.
The role of R in this course is to help us
It’s very challenging to find a good way to ‘learn R’. It depends on where you are and where you want to go. There is now a plethora of online courses. See the blog post: The 5 Most Effective Ways to Learn R
In my opinion, ultimately, the best way is to
Using R is like playing the piano: you can read and learn all the theory you want, but ultimately you learn by playing.
Copy the following scripts as files in RStudio:
Play with them line by line.
Post questions arising from these scripts to the ‘question’ folder on Piazza. We will take up some questions in class and others in tutorials scheduled to deal with questions on R.
Continuation
Concepts and Theory
Using R:
Topic 2 (continuing): Regression Review
Sample question for Quiz 3 on Wednesday, January 29:
Topic 1:
Added-Variable-Plot (AVP):
Plot \(Y - \hat{Y}|_{other\ X's}\) against \(X - \hat{X}|_{other\ X's}\)
Shows the first-order leverage and influence of each observation on \(\hat{\beta}_i\)
Note that the definition of the AVP yields the simple scatterplot centered at the origin if the only ‘other variable’ is the intercept.
In R use:
library(car)
fit <- lm(income ~ education + prestige, data = Prestige)
avPlots(fit, ellipse = TRUE)
?avPlots # for more options
When you look at this plot, think of how you would interpret it if it
were a simple scatterplot.
You can actually construct a 95% confidence interval for the
multiple regression coefficient, \(\hat{\beta}_i\) for the effect of \(X_i\) keeping other \(X\)’s constant, using this plot the same
way you used a simple scatterplot. Of course, the 95% confidence
interval also gives you the result of a 5% test of the hypothesis that
\(\beta_i = 0\) in the multiple
regression.
Note: According to Wikipedia, the person who first discovered this fact is a statistician, Udny Yule, who published it in 1911. Nevertheless the theorem is known as the Frisch-Waugh-Lovell Theorem. The eponymous economist, Ragnar Frisch, won the first Nobel Prize in Economics in 1969. Udny Yule is also considered the ‘real’ discoverer of Simpson’s Paradox, many years before Simpson published his paper on it. Thank goodness theorems and paradoxes are rarely named after their original discoverers. E.g. we might have no idea what someone is talking about when they refer to ‘Yule’s Theorem’. I hope he considers that a consolation.
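You can check the Frisch-Waugh-Lovell result numerically: the simple-regression slope in the added-variable plot equals the corresponding multiple-regression coefficient. A sketch using the Prestige example above:

```r
library(car)   # also provides the Prestige data (from carData)

fit <- lm(income ~ education + prestige, data = Prestige)

# Residualize both income and education on the 'other X' (prestige):
e_y <- resid(lm(income    ~ prestige, data = Prestige))
e_x <- resid(lm(education ~ prestige, data = Prestige))

# The slope of the AVP ...
coef(lm(e_y ~ e_x))["e_x"]
# ... equals the multiple-regression coefficient:
coef(fit)["education"]
```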
Residual versus Leverage plots:
Plot the standardized residual \(\frac{e_i}{\hat{\sigma} \sqrt{1 - h_{ii}}}\) against \(h_{ii}\).
Shows the first-order leverage and influence of each observation on
\(\hat{Y}_i\).
In R use:
plot(rstudent(fit) ~ hatvalues(fit))
or look at the fourth plot when you just use
plot(fit)
This plot will also show you selected isoquants of Cook’s Distance, one of perhaps over 50 diagnostics generated by two groups in the late 70s: R. Dennis Cook and Sanford Weisberg in one group, and David Belsley, Edwin Kuh and Roy Welsch in the other. Some analysts look at tables of these diagnostics and apply conventional cutoffs to flag problematic points. Many diagnostics are functions of each other, and I think we get much better insights by looking at them graphically, allowing us to detect isolated outliers or clusters of outliers and to see patterns involving more than one numeric diagnostic simultaneously.
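Cook’s Distance is itself a function of the two quantities in this plot, the (internally) standardized residual and the leverage, which is why its isoquants can be drawn on it. A sketch verifying this relationship on simulated data:

```r
# Cook's Distance as a function of standardized residual and leverage:
#   D_i = (r_i^2 / p) * h_ii / (1 - h_ii),  p = number of coefficients
set.seed(42)
d <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
d$y <- 1 + d$x1 - d$x2 + rnorm(30)
fit <- lm(y ~ x1 + x2, data = d)

h <- hatvalues(fit)
r <- rstandard(fit)          # internally standardized residuals
p <- length(coef(fit))
D <- r^2 / p * h / (1 - h)

all.equal(unname(D), unname(cooks.distance(fit)))   # TRUE
```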
Note that \[h_{ii} = \frac{1}{n}\left( 1 + \Delta_i^2 \right)\] where \(\Delta_i\) is the Mahalanobis distance of the i-th point from the mean of the predictor values in the sample (computed with the divisor-\(n\) variance). In simple regression, this is just the ‘z-score’ for \(X\) for the i-th observation.
\(h_{ii}\) obeys the following inequalities if the model has an intercept: \[\frac{1}{n} \le h_{ii} \le 1\] and \(h_{ii} = 1\) iff all the points omitting the i-th point are perfectly collinear. What inequalities does this imply for \(\Delta_i\)? What is the implication if \(h_{ii} = 1/n\)?
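The identity above is easy to check numerically in simple regression (here \(\Delta_i\) is the z-score of \(x_i\) computed with the divisor-\(n\) variance):

```r
# Check h_ii = (1/n) * (1 + Delta_i^2) in simple regression
set.seed(1)
n <- 20
x <- rnorm(n)
y <- rnorm(n)
h <- hatvalues(lm(y ~ x))

Delta2 <- (x - mean(x))^2 / mean((x - mean(x))^2)   # squared z-scores (divisor n)
all.equal(unname(h), (1 + Delta2) / n)              # TRUE
```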
Quiz question: Where would you expect to find outliers of ‘Type 1’, ‘Type 2’, and ‘Type 3’ in this plot?
Topic 2 (continuing): Regression Review
The class today was on the blackboard. See the recording for its content.
Summary on causality using linear causal DAGs

To estimate the causal effect of X on Y (see also the last page of the link above), the regression coefficient of X is an estimate of a causal effect of X on Y if:

- the model includes X and Y and ensures that all backdoor paths between X and Y are blocked. A backdoor path is blocked if either/or:
  - it contains a non-collider that is conditioned on, or
  - it contains a collider that is neither conditioned on nor has a conditioned descendant;
- the model does not condition on any descendant of X. That includes, of course, descendants of Y.

So, step-by-step: draw the DAG, block every backdoor path between X and Y, and avoid conditioning on descendants of X.
.Note that average causal effects can only be generalized to members of the population that were included in the observed sample. For example, if the sample was selected through a collider variable and the resulting backdoor path re-blocked by the inclusion of another variable along the path, then the observed average causal effect can only be generalized to members of the population that were selected through the collider.
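The backdoor reasoning above can be checked mechanically with the `dagitty` package. A sketch using the Coffee / Stress / Heart Damage example from class, assuming (as one of the two possible DAGs) that the arrow runs from Stress to Coffee, so that Stress is a confounder:

```r
library(dagitty)

g <- dagitty("dag {
  Stress -> Coffee
  Stress -> HeartDamage
  Coffee -> HeartDamage
}")

# Which sets of variables block all backdoor paths from Coffee to HeartDamage?
adjustmentSets(g, exposure = "Coffee", outcome = "HeartDamage")
# -> { Stress }
```

Reversing the Stress-Coffee arrow and rerunning `adjustmentSets` shows how the required adjustment set changes with the causal assumptions.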
Recommended Piazza weekly feedbacks:
To further explore DAGs and related concepts:

- Install the `dagitty` package in R and play with it.
- Visit the home page for the dagitty project. A DAG implies certain relationships among its variables. Some of these relationships are purely causal and can’t be checked with a statistical test on the data. For example, with the simple DAG we used for Coffee, Stress and Heart Damage, the correct causal analysis depended on the direction of the arrow between Coffee and Stress. If the arrow points from Coffee to Stress, then Stress is a mediating variable and should be excluded from the regression. If the arrow points from Stress to Coffee, then Stress is a confounding factor and should be included in the regression.
- Other relationships implied by a DAG are correlational, and you can use the `dagitty` package to test whether these correlational assumptions are satisfied; if they’re not, you can explore further to refine your DAG.
- Exploring DAGs with the `dagitty` package should be both fun and very instructive in deepening your understanding of causality.

Recommended books to go deeper into causal models:
Review
Midterm Test
Mixed Models:
Note that legal definitions of the “presumption of innocence” and of “proof beyond a reasonable doubt” are never formulated, to my knowledge, in such specific probabilistic terms. Nevertheless, this would seem to me to be a minimal interpretation of these concepts. For some purposes, in the United States, a \(p\)-value less than 0.05 is considered sufficient for some forms of evidence.↩︎
R is to S as Linux is to Unix as GNU C is to C as GNU C++ is to C …. S, Unix, C and C++ were created at Bell Labs in the late 60s and the 70s. R, Linux, GNU C and GNU C++ are public license re-engineerings of the proprietary S, Unix, C and C++ respectively.↩︎