This version: January 19 2025 23:32
Doubt is not a pleasant condition, but certainty is absurd. – Voltaire
Where there is no uncertainty, there cannot be truth – Richard Feynman
How harmful is smoking? Cigarette consumption and life expectancy in 189 countries in 2004. Correlation is not causation: this is an example of the ecological fallacy, which is itself an example of a deeper phenomenon, Simpson’s Paradox.
Classes meet on Mondays, Wednesdays and Fridays from 9:30 to 10:20 in SC 222.
Tutorial: On Zoom (click here), weekly: Tuesdays at 9am
Instructor: Georges Monette
Office hours: On Zoom (click here), weekly: Tuesdays at 10am
Email: Messages about course content or course
organization should be posted publicly on Piazza so everyone can benefit
from the questions and discussions.
This evening, I will resend invitations to join Piazza to all students currently registered in the course.
Topic 1: AMSTAT News Survey of Master’s Graduates (2023 US Statistics Graduates surveyed in mid-2024)
Topic 2: The meaning of p-values
Consider the simplest of statistical tests. Suppose you plan to sample 100 observations from a population that is known to have a normal distribution with variance equal to 1 but unknown non-negative mean \(\mu\).
You plan to test \(H_0: \mu = 0\) versus the one-sided alternative \(H_A: \mu > 0\).
Consider what happens if you use a 5% test, i.e. reject the null hypothesis if \(p < 0.05\).
Suppose you reject \(H_0\). What does this tell you about \(H_0\)?
What if the \(p\)-value is 0.049? Or what if the \(p\)-value is 0.0003?
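The arithmetic behind this test can be sketched in R. This is only a sketch, assuming the test statistic is the mean of the 100 observations; the names `se`, `p_value`, `xbar_crit`, `xbar_049` and `xbar_0003` are illustrative and do not come from the course materials.

```r
# One-sided z-test of H0: mu = 0 versus HA: mu > 0
# with n = 100 observations and known variance 1.
n <- 100
sigma <- 1
se <- sigma / sqrt(n)   # standard error of the sample mean: 0.1

# p-value for an observed sample mean xbar (upper-tail probability under H0)
p_value <- function(xbar) pnorm(xbar / se, lower.tail = FALSE)

# Sample mean at which the p-value equals exactly 0.05
xbar_crit <- qnorm(0.05, lower.tail = FALSE) * se

# Sample means that give the two p-values mentioned above
xbar_049  <- qnorm(0.049,  lower.tail = FALSE) * se
xbar_0003 <- qnorm(0.0003, lower.tail = FALSE) * se

round(c(critical = xbar_crit, p_0.049 = xbar_049, p_0.0003 = xbar_0003), 3)
```

Both p-values lead to rejection at the 5% level, but they correspond to quite different sample means: one barely above the critical value and one roughly twice as large.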
Due Thursday, January 9, 9 pm
Due: Monday, January 13 at 11:59pm: Submit your answers on Piazza, preferably as a PDF file generated with R Markdown, but any other format is OK. Use the folder assn2 on Piazza for any posts submitted for this assignment. Do as much of the assignment as you are capable of doing. You will be graded on evidence of ‘honest effort’.
Discussion (The questions in this part are rhetorical, i.e. you don’t need to answer them)
The purpose of this assignment is to reawaken the skills you learned in MATH 2131 and MATH 3131, and to see whether some important statistical concepts really mean what most people who use them think they mean.
Consider the simplest of statistical tests. Suppose you plan to sample 100 observations from a population that is known to have a normal distribution with variance equal to 1.0 but unknown non-negative mean \(\mu\).
You plan to test \(H_0: \mu = 0\) versus the one-sided alternative \(H_A: \mu > 0\).
Consider what happens if you use a 5% test, i.e. reject the null hypothesis if \(p < 0.05\).
Suppose you reject \(H_0\). What does this tell you about \(H_0\)?
We know that the \(p\)-value tells you something about the probability of an event related to the data given \(H_0\). So, although we know that the \(p\)-value isn’t the actual probability of \(H_0\) given the data, it’s natural to feel that they are somehow related. After all, why would we use a \(p\)-value unless it tells us something about \(H_0\)? We aren’t interested in the data. What we are really interested in is \(H_0\).
So it’s useful to actually explore the connection between \(p\)-values and the probability of \(H_0\). But we can’t get an actual probability for \(H_0\) without also hypothesizing a prior distribution on the hypotheses: we need a hypothetical probability for \(H_0\) and a hypothetical distribution over the possible values in \(H_A\).
In this exercise, you will consider the posterior probability of \(H_0\) given different possibilities for prior probabilities to get a sense of the strength of the evidence against \(H_0\) when you have a \(p\)-value of, for example, 0.049, or of 0.001.
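One way to start exploring this is sketched below in R. It is an illustration under strong simplifying assumptions, not the assignment's prescribed method: the prior probability `prior_h0`, the single alternative value `mu1` (a point-mass prior on \(H_A\)), and the function name `posterior_h0` are all hypothetical choices made here. The sketch also conditions on the exact observed sample mean rather than on the event \(p < 0.05\).

```r
# Posterior probability of H0: mu = 0 given the sample mean xbar,
# ASSUMING a prior probability prior_h0 for H0 and, purely for illustration,
# an alternative concentrated on a single value mu1 > 0.
posterior_h0 <- function(xbar, prior_h0 = 0.5, mu1 = 0.2, n = 100, sigma = 1) {
  se <- sigma / sqrt(n)
  like_h0 <- dnorm(xbar, mean = 0,   sd = se)  # density of xbar under H0
  like_ha <- dnorm(xbar, mean = mu1, sd = se)  # density of xbar under mu = mu1
  # Bayes' rule
  prior_h0 * like_h0 / (prior_h0 * like_h0 + (1 - prior_h0) * like_ha)
}

se <- 1 / sqrt(100)
xbar_049 <- qnorm(0.049, lower.tail = FALSE) * se  # sample mean giving p = 0.049
xbar_001 <- qnorm(0.001, lower.tail = FALSE) * se  # sample mean giving p = 0.001

posterior_h0(xbar_049)                  # equal prior odds
posterior_h0(xbar_001)                  # stronger evidence against H0
posterior_h0(xbar_049, prior_h0 = 0.9)  # a prior that favours H0
```

Varying `prior_h0` and `mu1`, or replacing the point mass with a distribution over \(\mu > 0\), shows how strongly the posterior depends on the hypothesized prior.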
Assignment:
As you work on this assignment, refer back occasionally to the questions in Quiz 1. How is your thinking about these questions evolving? If the same questions were to appear in Quiz 3, how would you answer them?
Due Sunday, January 12, 9 pm
Tools | Global Options ... | Terminal. Then click on the box in the Shell paragraph to the right of New terminals open with:
Announcements:
Tutorial today at 4:30 on Zoom. See link under Quick Links
Tutorials on Tuesday mornings at 9am on Zoom. See link under Quick Links
See modified course description
Topic 1: The Trapezoid of Means: Simpson’s Paradox with Paik-Agresti and Liu-Meng Diagrams / R script
Topic 2: How to estimate everything
Topic 3: Regression Review: Regression in R
Quiz on Wednesday:
Sample question:
Given a table of conditional mean survival rates in percent (Y) by Gender (G: F or M) for two treatments (X: A or B) for a disease D:
| | X = A | X = B |
|---|---|---|
| G = F | 90.0 | 70.0 |
| G = M | 40.0 | 30.0 |
The following are the numbers of individuals in each group:
| | X = A | X = B |
|---|---|---|
| G = F | 100 | 300 |
| G = M | 400 | 100 |
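A short R sketch shows how the two tables combine. It assumes that the marginal survival rate for each treatment is the average of the conditional rates weighted by group size; the object names `rate`, `n` and `marginal` are illustrative.

```r
# Conditional mean survival rates (percent) and group sizes from the tables above
rate <- matrix(c(90, 70,
                 40, 30),
               nrow = 2, byrow = TRUE,
               dimnames = list(G = c("F", "M"), X = c("A", "B")))
n <- matrix(c(100, 300,
              400, 100),
            nrow = 2, byrow = TRUE,
            dimnames = dimnames(rate))

# Marginal (pooled over gender) survival rate for each treatment:
# a weighted average of the conditional rates, weighted by group size
marginal <- colSums(rate * n) / colSums(n)
marginal

# Within each gender, compare A with B ...
rate[, "A"] > rate[, "B"]
# ... and then compare the pooled rates
marginal["A"] > marginal["B"]
```

Within each gender treatment A has the higher survival rate, yet the pooled rates are 50% for A and 60% for B, because most of the patients receiving A are in the group with lower survival.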
Due Sunday, January 19, 9 pm
Work your way through Regression Review: Regression in R / R script and post notes/questions/etc. using the folder assn4.
Formulate questions and post them.
Find errors (e.g. code that no longer works because things change) and flag them.
Raise topics for discussion.
Reply to other students’ postings.
Topic 1: How to estimate everything
Topic 2: Some real data on Smoking
Announcement: Tutorial on Zoom (click here), every Tuesday from 1:30 to 2:30 pm.
Quiz Today:
Topic 1 (continuing): Regression Review: Regression in R / R script
Topic 2: Some real data on Smoking
Concepts and Theory
Using R:
Learning R:
Why R? What about SAS, SPSS, Python, among others?
SAS is a very high quality, intensely engineered environment for statistical analysis. It is widely used by large corporations. New procedures in SAS are developed and thoroughly tested by a team of 1,000 or more SAS engineers before being released. It currently has more than 300 procedures.
R is an environment for statistical programming and development that has accumulated many somewhat inconsistent layers developed over time by people of varying abilities, many of whom work largely independently of each other. There is no centralized quality testing except to check whether code and documentation run before a new package is added to R’s main repository, CRAN. When this page was last updated, CRAN had 21,911 packages.
In addition, a large number of packages under development are available through other repositories, such as GitHub.
The development of SAS began in 1966 and that of R (in the form of its predecessor, S, at Bell Labs) in 1976.
The ‘design’ of R (using ‘R’ to refer to both R and to S) owes a lot to the design of Unix. The idea is to create a toolbox of simple tools that you link together yourself to perform an analysis. In Unix (now mainly in the form of Linux²), commands are simple tools linked together by pipes, so that the output of one command is the input of the next. To do anything, you need to put the commands together yourself.
The same is true of R. It’s extremely flexible but at the cost of requiring you to know what you want to do and to be able to use its many tools in combination with each other to achieve your goal. Many decisions in R’s design were intended to make it easy to use interactively. Often the result is a language that is very quirky for programming.
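As a small illustration of linking simple tools, here is a sketch that uses R's native pipe operator `|>` (an assumption about your setup: it requires R 4.1 or later) with the built-in `mtcars` data set. The choice of data, model and the name `fit_coef` are purely illustrative.

```r
# Each step is a small tool; the pipe passes the output of one step
# to the next as its first unnamed argument.
fit_coef <- mtcars |>
  subset(cyl == 4) |>          # keep the 4-cylinder cars
  lm(formula = mpg ~ wt) |>    # regress fuel economy on weight (piped value becomes 'data')
  coef()                       # extract the fitted coefficients

fit_coef
```

The same result can be obtained with nested calls, `coef(lm(mpg ~ wt, data = subset(mtcars, cyl == 4)))`; the piped form simply reads in the order in which the steps are carried out.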
SAS, in contrast, requires you to select options to run large procedures that purport to do the entire job.
This is an old joke: If someone publishes a journal article about a new statistical method, it might be added to SAS in 5 to 10 years. It won’t be added to SPSS until 5 years after there’s a textbook written about it, maybe another 10 to 15 years after its appearance in SAS.
It was added to R two years ago because the new method was developed as a package in R long before being published.
So why become a statistician? So you can have the breadth and depth of understanding that someone needs to apply the latest statistical ideas with the intelligence and discernment to use them effectively.
So expect to have a symbiotic relationship with R. You need R to have access to the tools that implement the latest ideas in statistics. R needs you because it takes people like you to use R effectively.
The role of R in this course is to help us
It’s very challenging to find a good way to ‘learn R’. It depends on where you are and where you want to go. There is now a plethora of online courses. See the blog post: The 5 Most Effective Ways to Learn R
In my opinion, ultimately, the best way is to
Using R is like playing the piano. You can read and learn all the theory you want, but ultimately you learn by playing.
Copy the following scripts as files in RStudio:
Play with them line by line.
Post questions arising from these scripts to the ‘question’ folder on Piazza. We will take up some questions in class and others in tutorials scheduled to deal with questions on R.
Continuation
Concepts and Theory
Using R:
Topic 2 (continuing): Regression Review
Note that legal definitions of the “presumption of innocence” and of “proof beyond a reasonable doubt” are never formulated, to my knowledge, in such specific probabilistic terms. Nevertheless, this would seem to me to be a minimal interpretation of these concepts. For some purposes, in the United States, a \(p\)-value less than 0.05 is considered sufficient for some forms of evidence.↩︎
R is to S as Linux is to Unix as GNU C is to C as GNU C++ is to C++ …. S, Unix, C and C++ were created at Bell Labs in the late 60s and the 70s. R, Linux, GNU C and GNU C++ are public license re-engineerings of the proprietary S, Unix, C and C++ respectively.↩︎