Where there is no uncertainty, there cannot be truth – Richard Feynman
Negative of Himmelblau's function: \(f(x,y) = -(x^2 + y - 11)^2 - (x + y^2 - 7)^2\); see p. 57, problem 2.7 of the textbook.
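If you want to play with this function, here is a minimal R sketch (not part of the textbook problem) that defines it and finds one of its maxima with optim, which minimizes by default:

f <- function(p) {
  x <- p[1]; y <- p[2]
  -(x^2 + y - 11)^2 - (x + y^2 - 7)^2   # negative of Himmelblau's function
}
optim(c(0, 0), function(p) -f(p))$par   # maximize f by minimizing -f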
September 18:
September 14:
Classes meet on Mondays and Wednesdays from 1:00 to
2:30. The location on Mondays is Ross South 123 and, on Wednesdays, Life
Sciences Building (LSB) 107.
Optional Tutorial: Fridays from 3:30 pm to 4:30 pm on Zoom at Tutorial Zoom link
Informal chat: Tentatively Fridays, 4:30 pm to 5:30 pm at Office hour Zoom link
Instructor: Georges Monette
Important dates: 2023-2024 Academic Calendar
Textbook: Givens, G.H., and Hoeting, J.A. (2013) Computational Statistics, 2nd ed., Wiley. Some copies are available in the bookstore. Errata for the second edition
Email: Messages about course content should be posted
publicly to Piazza so the discussion can engage and benefit everyone.
You may post messages and questions as personal messages to the
instructor. If they are of general interest and don’t contain personal
information, they will usually be made public for the benefit of the
entire class unless you specifically request that the message remain
private.
Files used today: Regression Review
Due Wednesday, September 13, 9 pm
- In RStudio, go to Tools | Global Options ... | Terminal, then click on the box in the Shell paragraph to the right of New terminals open with:.
- Begin your post with Introduction: followed by the name you prefer to use in class, e.g. Jon Smith.
- Add Github: followed by your Github user name, e.g. jsmith. If you don't have one already, leave this blank for now. You can come back and edit it later.
- Select the assn1 folder when you submit your post.
- Select Question as the Post Type.
This exercise will be done in teams of 4 or 5 members, which will be posted on Piazza. All members of a team work on a similar problem but with different parameters. This way, you can help each other with the concepts, but each team member must, ultimately, do their own work.
The purpose of the assignment is to explore the meaning of p-values. Before starting, stop and reflect on what it means for an experiment to ‘achieve’ a p-value of 0.049. What meaning can we give to the quantity ‘0.049’? How is it related to the probability that the null hypothesis is correct?
To keep things very simple suppose you want to test \(H_0: \mu =0\) versus \(H_1: \mu > 0\) (note: the problem is also easy but takes a bit more coding if you assume a two-sided alternative: \(H_0: \mu =0\) versus \(H_1: \mu \neq 0\) – to help focus on the essential concepts, we’ll use a one-sided alternative) and you are designing an experiment in which you plan to take a sample of independent random variables, \(X_1, X_2, ... , X_n\) which are iid \(\textrm{N}(\mu,1)\), i.e. the variance is known to be equal to 1. You plan to use the usual test based on \(\bar{X}_n\) rejecting \(H_0\) for values of \(\bar{X}_n\) that are far from 0.
An applied example would be testing for a change in value of a response when all subjects are submitted to the same conditions and the measurement error of the response is known. In that example \(X_i\) would be the ‘gain score’, i.e. post-test response minus the pre-test response exhibited by the \(i\)th subject.
Note: Do the assignment using R and using a Markdown file. You can use the Markdown sample file in Assignment 1 as a template. Upload both a pdf file and your R script to Piazza.
Let the selected probability of Type I error be \(\alpha = 0.05\). Depending on your team, you have a budget that allows you to collect a sample of \(n\) observations as follows:
Team | \(n\) |
---|---|
A: Akaike (Hirotsugu Akaike) | 2 |
B: Blackwell (David Blackwell) | 3 |
C: Cox (Gertrude Mary Cox) | 5 |
D: David (Florence Nightingale David) | 10 |
E: Efron (Bradley Efron) | 15 |
F: Freedman (David Freedman) | 20 |
Depending on your position on the team, you are interested in answering the questions below using the following values of \(\mu_1\):
Position in Team | \(\mu_1\) | Cohen’s terminology for effect size: \(\mu_1/\sigma\) |
---|---|---|
1 | all | combine ROC curves for results below |
2 | 0.2 | small effect size |
3 | 0.5 | medium effect size |
4 | 0.8 | large effect size |
5 | 1 | very large effect size |
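To get started, here is a minimal R sketch, not a full solution, that simulates one-sided p-values under \(H_0\) and \(H_1\). The values of n and mu1 are placeholders to be replaced by your team's parameters:

set.seed(2023)
nsim <- 10000
n <- 5       # e.g. Team C
mu1 <- 0.5   # e.g. Position 3
# p-value for one simulated experiment: P(Z > sqrt(n) * xbar) under H0
pval <- function(mu) {
  xbar <- mean(rnorm(n, mean = mu, sd = 1))
  pnorm(sqrt(n) * xbar, lower.tail = FALSE)
}
p0 <- replicate(nsim, pval(0))    # p-values under H0
p1 <- replicate(nsim, pval(mu1))  # p-values under H1
mean(p0 < 0.05)  # should be close to alpha = 0.05
mean(p1 < 0.05)  # estimated power at mu1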
To delve more deeply into these issues you can read Wasserstein and Lazar (2016) and Wasserstein, Schirm, and Lazar (2019). Concerns about \(p\)-values have been around for a long time, see Schervish (1996). For a short overview see Bergland (2019). There are two interesting and influential papers by John Ioannidis (2005), (2019).
For an entertaining take on related issues see Last Week Tonight: Scientific Studies by John Oliver (2016) (warning: contains strong language and political irony that many could consider offensive – watch at your own risk!).
Preparing for the next class: Play your way through these scripts:
Step your way through the scripts and experiment as you go along. If you have questions or comments, post them on Piazza.
These scripts give you a tour of various features of R. They are meant to provide reminders of things you can do when a need arises.
What’s the best way to learn R? That depends on your learning style, your previous experience and how deep you want to go.
See the discussion in ‘Day 2’. Start or join a discussion on Piazza on ‘best ways to learn R’.
Correction: There was a problem installing the ‘spida2’ package in R. The workaround is to install the ‘gplots’ package before installing the ‘spida2’ package:
install.packages('gplots')
devtools::install_github('gmonette/spida2')
The 'spida2' package should be reinstalled regularly since it's likely to be updated frequently during the course. Once 'gplots' has been installed, it does not need to be reinstalled each time you reinstall 'spida2'.
Tutorials: Start this week on Friday, September 15 from 3:30 pm to 4:30 pm.
Student hours: Start this week on Friday, September 15, from 4:30 pm to 5:30 pm.
Teams: I will assign teams this evening for those who have already registered with Piazza so you can start working on assignments. So register soon if you want to have a team and, as importantly, your position on your team.
Why R? What about SAS, SPSS, Python, Julia, among others?
SAS is a very high quality, intensely engineered, environment for statistical analysis. It is widely used by large corporations. New procedures in SAS are developed and thoroughly tested by a team of 1,000 or more SAS engineers before being released. It currently has more than 300 procedures.
R is an environment for statistical programming and development that
has accumulated many somewhat inconsistent layers developed over time by
people of varying abilities, many of whom work largely independently of
each other. There is no centralized quality testing except to check
whether code and documentation run before a new package is added to R’s
main repository, CRAN. When this page was last updated, CRAN had 20,119 packages.
In addition, a large number of packages under development are available through other repositories, such as Github.
The development of SAS began in 1966 and that of R (in the form of its predecessor, S, at Bell Labs) in 1976.
The 'design' of R (using 'R' to refer to both R and to S) owes a lot to the design of Unix. The idea is to create a toolbox of simple tools that you link together yourself to perform an analysis. In Unix, now mainly as Linux,[3] commands are simple tools linked together by pipes so that the output of one command is the input of the next. To do anything, you need to put the commands together yourself.
The same is true of R. It's extremely flexible, but at the cost of requiring you to know what you want to do and to be able to use its many tools in combination with each other to achieve your goal. Many decisions in R's design were intended to make it easy to use interactively. The result is often a language that is quirky for programming.
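To make the 'toolbox' idea concrete, here is a small illustrative sketch (my example, using the built-in mtcars data) with R's native pipe, available in R 4.1 and later: each step is a simple tool whose output becomes the input of the next, much like a Unix pipeline.

mtcars |>
  subset(cyl == 4) |>        # filter rows
  lm(formula = mpg ~ wt) |>  # fit a simple regression
  summary() |>               # summarize the fit
  coef()                     # extract the coefficient table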
SAS, in contrast, requires you to select options to run large procedures that purport to do the entire job.
This is an old joke: If someone publishes a journal article about a new statistical method, it might be added to SAS in 5 to 10 years. It won’t be added to SPSS until 5 years after there’s a textbook written about it, maybe another 10 to 15 years after its appearance in SAS.
It was added to R two years before the article appeared, because the new method was developed as a package in R long before being published.
So why become a statistician? So you can have the breadth and depth of understanding that someone needs to apply the latest statistical ideas with the intelligence and discernment to use them effectively.
So expect to have a symbiotic relationship with R. You need R to have access to the tools that implement the latest ideas in statistics. R needs you because it takes people like you to use R effectively.
The role of R in this course is to help us
It's very challenging to find a good way to 'learn R'. It depends on where you are and where you want to go. There is now a plethora of online courses. See the blog post: The 5 Most Effective Ways to Learn R
In my opinion, ultimately, the best way is to use it.
Using R is like playing the piano. You can read and learn all the theory you want, but ultimately you learn by playing.
We will discuss the scripts CAR_1.R and CAR_2.R in the tutorial this week. Copy them to a file in RStudio and play with them line by line. You can also run an entire script with Control-Shift-K to see how scripts can be used for reproducible research.
If you want a deeper understanding of R, consider reading Hadley Wickham's Advanced R.
Continuation of Day 2
Files used today:
Comments on Leverage and Mahalanobis Distance:
Some formulas are worth remembering because they work for multiple regression as well as simple regression. We've seen: \[ SE(\hat{\beta}_i) = \frac{1}{\sqrt{n}}\frac{s_e}{s_{x_i|others}} \] Here's another one: You know that the diagonal elements \(h_{ii}\) of the hat matrix \[ H = X(X'X)^{-1}X' \] are the leverages showing how much a change in one particular \(y_i\) would change the fitted value \(\hat{y}_i\), i.e. \[ h_{ii} = \frac{\partial \hat{y}_i}{\partial y_i} \] It turns out that there's nothing mysterious about where the \(h_{ii}\)s come from. They have a simple relationship with the squared Mahalanobis distance in the predictor space: \[ h_{ii} = \frac{1}{n}(1 + Z_i^2) \] where, in simple regression, \(Z_i\) is the Z-score (standardized with the MLE estimate of standard deviation, dividing by \(n\) instead of \(n-1\)). \(|Z_i|\) also happens to be the Mahalanobis distance in one dimension.
The formula generalizes to multiple regression if you take \(Z_i\) to be Mahalanobis distance, so Mahalanobis distance is just a generalization of the Z-score to higher dimensions.
See the links above on Leverage and Mahalanobis Distance.
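You can verify the identity numerically in R; here is a sketch with simulated data:

set.seed(1)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
h <- hatvalues(lm(y ~ x))                         # leverages h_ii
z <- (x - mean(x)) / sqrt(mean((x - mean(x))^2))  # Z-scores using the MLE sd
max(abs(h - (1 + z^2) / n))                       # ~ 0 up to rounding error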
Comments:
Comment on AVPs: The fact that a simple regression applied to the AVP yields the same estimate of the slope as the corresponding coefficient in the full multiple regression, provided one calculates these quantities using maximum likelihood estimates (it's all asymptotically true if one uses the degree-of-freedom adjusted estimates), is the content of the Frisch-Waugh-Lovell Theorem, which was 'discovered' in steps over a period of approximately 30 years. The first version applied to lagged time series and was discovered by Ragnar Frisch. He shared the first Nobel Memorial Prize in Economics in 1969, for his general contributions, not just the start of this theorem. However, like other statistical 'objects' that have names associated with them, the actual original theorem was published by Udny Yule in 1911 (I am shamelessly using Wikipedia, and its gifted contributors, as a source). The theorem was refined over time by Frederick Waugh and Michael Lovell, who each added portions of it. Lovell is known for an elegant proof.
The theorem shows how the AVP is a window into the way a multiple regression determines an individual coefficient.
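Here is a short R illustration of the theorem (a sketch using the built-in mtcars data, not an example from the course files): the slope in the AVP for wt, with hp partialled out of both variables, reproduces the multiple-regression coefficient.

fit <- lm(mpg ~ wt + hp, data = mtcars)
ry <- resid(lm(mpg ~ hp, data = mtcars))  # response with hp partialled out
rx <- resid(lm(wt ~ hp, data = mtcars))   # wt with hp partialled out
coef(lm(ry ~ rx))[["rx"]]                 # AVP slope
coef(fit)[["wt"]]                         # same value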
Possible quiz questions:
Look at the link to possible questions for the quiz on Wednesday under “Quick Links” above.
Simpson's Paradox:
Links:
Links:
Announcement:
Links:
Background material: Y ~ Z + Col. This shows how models matter when you want to give a coefficient a causal interpretation.
Announcement:
Sample questions for the quiz on Wednesday:
Links:
Continuation of previous
Review
Midterm test
Links:
Links:
Links:
Some sample questions for quiz on November 1:
Note: You should know the definitions of the Bernoulli, Binomial, Multinomial, Normal, Multivariate Normal, Poisson, Exponential, Gamma, and Uniform distributions. If others are needed, their definitions will be given in the quiz. You should also know the definition of moment-generating functions, the characterization of exponential families, and the definition and derivation of the score and Fisher information identities.
Links:
Sample Quiz Questions for Quiz 4:
Let \(f\) be a twice continuously differentiable real function on \(\mathbb{R}^p\), with a negative definite Hessian. Derive the form of the Newton update to find a maximum of \(f\), given that a Newton update is the maximum of a local quadratic approximation of \(f\).
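For reference (the derivation itself is the exercise), maximizing the local quadratic approximation \(f(x_k) + f'(x_k)^T(x - x_k) + \tfrac{1}{2}(x - x_k)^T f''(x_k)(x - x_k)\) yields the update \[ x_{k+1} = x_k - \left[ f''(x_k) \right]^{-1} f'(x_k) \] where \(f''(x_k)\) is the Hessian; negative definiteness guarantees the quadratic approximation has a unique maximum.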
Consider the problem of finding the root of a real function \(f\) that is twice continuously differentiable on a closed interval \([a,b] \subset \mathbb{R}\) with \(a < b\). Suppose \(f\) has a single root at \(x^* \in (a,b)\). Let \(G(x) = f(x) + x\). Show that \(x^*\) is a fixed point of \(G\) but fixed-point iteration is not a good strategy for finding \(x^*\) if \(|G'(x^*)| > 1\).
Discuss why Fisher Scoring might sometimes be a better strategy for maximizing a log-likelihood than Newton's method.
Let \(f(x|\theta)\) be a multivariate density for \(x \in \mathbb{R}^p\), with respect to a supporting measure, with the form: \[ f(x|\theta) = \exp \left\{ \sum_{i=1}^p \theta_i x_i - A(\theta) + c(x) \right\} \] with \(\theta \in \Omega \subset \mathbb{R}^p\), where \(\Omega\) is an open set in which the density integrates to 1.
Prove that \(\operatorname{E}(X|\theta) = A'(\theta)\).
Prove that \(\operatorname{Var}(X|\theta) = A''(\theta)\).
Let \(Y\) have a distribution belonging to a scaled exponential family with density of the form: \[ f(y|\theta) = \exp \left\{ \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\} \] with respect to a supporting measure. Let \(\mu(\theta) = \operatorname{E}(Y|\theta)\) and let \(\theta\) be modelled as a linear function of two parameters, \(\beta_1\) and \(\beta_2\), with: \[ \theta = z_1 \beta_1 + z_2 \beta_2 \] Show that the gradient of the log-likelihood as a function of \(\beta_1, \beta_2\) has the form \[ \ell'(\beta_1,\beta_2) = \frac{1}{a(\phi)} \begin{pmatrix} z_1 \\ z_2 \end{pmatrix} (y - \mu) \]
Links:
Sample Quiz Questions for Monday, November 27
Suppose \((x_i,y_i)\) are pairs of random observations where \(x_i \sim \operatorname{Poisson}(\lambda)\) and \(y_i \sim \operatorname{Poisson}(\alpha\lambda)\), with \(\lambda\) and \(\alpha\) unknown parameters. You have obtained data on 3 pairs, \(i = 1, 2, 3\) but \(y_3\) has gone missing.
Derive the E step and the M step of the EM algorithm to estimate \(\lambda\) and \(\alpha\) with maximum likelihood.
Suppose \((x_i,y_i), i = 1,...,n\) are pairs of random observations where \(x_i \sim \operatorname{Poisson}(\lambda)\) and \(y_i \sim \operatorname{Poisson}(\alpha\lambda)\), with \(\lambda\) and \(\alpha\) unknown parameters. Because of the way the data are obtained, \(y_i\) is missing whenever \(x_i > 3\).
Derive the E step and the M step of the EM algorithm to estimate \(\lambda\) and \(\alpha\) with maximum likelihood.
Do the two problems above when the observations are from the exponential distribution instead of Poisson.
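If you want to check your derivation for the first problem numerically, here is a sketch (assuming \(x_i\) and \(y_i\) are independent given the parameters, with made-up data; the E and M steps shown are one possible derivation, and the derivation itself is the exercise):

em_poisson <- function(x, y12, iter = 100) {
  lambda <- mean(x)  # complete-data MLE of lambda doesn't involve y here
  alpha <- 1         # starting value
  for (k in seq_len(iter)) {
    y3 <- alpha * lambda                     # E step: impute missing y_3
    alpha <- (sum(y12) + y3) / (3 * lambda)  # M step: update alpha
  }
  c(lambda = lambda, alpha = alpha)
}
em_poisson(x = c(4, 6, 5), y12 = c(9, 11))  # fixed point: alpha = sum(y12)/(2*lambda)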
Sample Questions (largely from the 2nd half of the course):
Consider the following sample test questions in addition to quiz questions, midterm test questions, and sample questions for the midterm and the quizzes.
Review DAGs and the use of added-variable plots with the Frisch-Waugh-Lovell Theorem to consider the relative values of \(SE(\hat{\beta}_X)\) with different models.
The final exam will cover material from the entire course.
More theoretical:
Suppose you have access to a uniform random number generator. Describe how you could use the Metropolis-Hastings algorithm to generate a Markov chain with a Normal distribution with mean 0 and unit standard deviation as a stationary distribution. Write R code for a function that generates the next observation from this chain.
Let \(h\) on \(\mathbb{R}\) be proportional to a probability density function \(f\), i.e. \(h(x) = c f(x)\) with \(c\) unknown. Let \(g\) be a probability density on \(\mathbb{R}\) that is symmetric around 0, from which you can generate random observations. Describe how you would use random observations with pdf \(g\) to generate a Markov chain with \(f\) as a stationary distribution.
Prove that the process in the previous question has the distribution given by \(f\) as stationary distribution.
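As a sketch of the kind of function the first question above asks for (assuming a symmetric uniform random-walk proposal, so the Hastings ratio reduces to a ratio of target densities):

next_state <- function(x, delta = 1) {
  y <- x + runif(1, -delta, delta)  # symmetric proposal from a uniform
  log_accept <- (x^2 - y^2) / 2     # log of the N(0,1) density ratio
  if (log(runif(1)) < log_accept) y else x
}
chain <- numeric(1000)
for (i in 2:1000) chain[i] <- next_state(chain[i - 1])  # builds the Markov chain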
Consider estimating the average lifetime of some electronic component, a 'gizmo', whose lifetime has an exponential(\(\theta\)) distribution: \[f(x|\theta) = e^{-x/\theta}/\theta, \quad x > 0,\; \theta > 0\] Suppose \(N\) gizmos are observed for a year and \(n\) (\(1 < n < N\)) burn out during the year, with lifetimes recorded as \(x_1, x_2, ..., x_n\), where \(x\) is measured in years (fractional, of course). The remaining \(N - n\) are still functioning at the end of the year. Derive the E step and the M step of the EM algorithm to estimate \(\theta\) with maximum likelihood.
Prove that each step in the EM algorithm cannot reduce the log-likelihood. Assume that the densities in the model, \(f(x|\theta)\), are positive with the same support. Assume known as a general fact that the Kullback-Leibler divergence from the distribution \(f(\cdot|\psi_1)\) to \(f(\cdot|\psi_2)\) is non-negative, i.e. \[\operatorname{E}\{\log f(X|\psi_1) - \log f(X|\psi_2) \mid \psi_1\} \ge 0\]
Describe the Newton-Raphson method for finding the MLE of a statistical model. Give a sketch of a proof, using Taylor series expansions, that it converges to the maximum if the starting value is sufficiently close to the maximum.
Describe the Newton-Raphson and the Fisher Scoring method for finding MLEs and compare some advantages and disadvantages of each method.
Describe the Nelder-Mead algorithm in detail. (You’ll forget the details after the exam but it’s worth having to learn it at some point in your career.) Be ready to apply it for one or two steps in a simple example.
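A minimal R sketch of the Newton-Raphson iteration described above (the generic score and Hessian are supplied as functions; the one-parameter normal example is contrived for illustration):

newton <- function(theta, score, hess, tol = 1e-8, maxit = 100) {
  for (i in seq_len(maxit)) {
    step <- solve(hess(theta), score(theta))  # solve H %*% step = score
    theta <- theta - step                     # Newton update
    if (sum(abs(step)) < tol) break
  }
  theta
}
x <- c(1.2, 0.7, 2.1)
newton(0, score = function(m) sum(x - m), hess = function(m) -length(x))  # = mean(x)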
Applied:
The following 4 questions are based on the ‘Vocab’ data set used in
Assignment 3.
Recall that it records a vocabulary score for over 30,000 subjects
tested over the years between 1974 and 2016. The questions below refer
to the following output in R. Note that the variable ‘Year’ is measured
in years since 2000.
library(car)
summary(Vocab)
year sex education vocabulary
Min. :1974 Female:17148 Min. : 0.00 Min. : 0.000
1st Qu.:1987 Male :13203 1st Qu.:12.00 1st Qu.: 5.000
Median :1994 Median :12.00 Median : 6.000
Mean :1995 Mean :13.03 Mean : 6.004
3rd Qu.:2006 3rd Qu.:15.00 3rd Qu.: 7.000
Max. :2016 Max. :20.00 Max. :10.000
data <- within(Vocab,
{
Year <- year - 2000
Year2 <- Year^2
Year3 <- Year^3
Yearf <- factor(Year)
}
)
fit1 <- lm(vocabulary ~ factor(Year) * sex, data = data)
fit2 <- lm(vocabulary ~ (Year + Year2 + Year3) * sex, data = data)
fit3 <- lm(vocabulary ~ Year * sex, data = data)
summary(fit2)
Call:
lm(formula = vocabulary ~ (Year + Year2 + Year3) * sex, data = data)
Residuals:
Min 1Q Median 3Q Max
-6.1730 -1.0826 0.0162 1.0994 4.1273
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.141e+00 2.790e-02 220.123 < 2e-16 ***
Year 1.215e-02 2.793e-03 4.351 1.36e-05 ***
Year2 -8.358e-04 1.880e-04 -4.445 8.83e-06 ***
Year3 -4.831e-05 1.017e-05 -4.749 2.06e-06 ***
sexMale -1.203e-01 4.226e-02 -2.847 0.00441 **
Year:sexMale -4.446e-03 4.230e-03 -1.051 0.29331
Year2:sexMale 3.865e-04 2.842e-04 1.360 0.17376
Year3:sexMale 2.488e-05 1.533e-05 1.623 0.10461
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.114 on 30343 degrees of freedom
Multiple R-squared: 0.001568, Adjusted R-squared: 0.001337
F-statistic: 6.806 on 7 and 30343 DF, p-value: 4.252e-08
summary(fit3)
Call:
lm(formula = vocabulary ~ Year * sex, data = data)
Residuals:
Min 1Q Median 3Q Max
-6.0848 -1.0590 -0.0145 1.0799 4.1114
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.0504270 0.0172503 350.742 < 2e-16 ***
Year 0.0021493 0.0012801 1.679 0.09315 .
sexMale -0.0799758 0.0260553 -3.069 0.00215 **
Year:sexMale 0.0009994 0.0019247 0.519 0.60359
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.115 on 30347 degrees of freedom
Multiple R-squared: 0.0006394, Adjusted R-squared: 0.0005406
F-statistic: 6.472 on 3 and 30347 DF, p-value: 0.000225
anova(fit3, fit2, fit1)
Analysis of Variance Table
Model 1: vocabulary ~ Year * sex
Model 2: vocabulary ~ (Year + Year2 + Year3) * sex
Model 3: vocabulary ~ factor(Year) * sex
Res.Df RSS Df Sum of Sq F Pr(>F)
1 30347 135692
2 30343 135566 4 126.03 7.0624 1.116e-05 ***
3 30307 135208 36 358.24 2.2305 3.234e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pred <- within(
with(data, expand.grid(Year = unique(Year), sex = unique(sex))),
{
Year2 <- Year ^ 2
Year3 <- Year ^ 3
}
)
pred$fit1 <- predict(fit1, newdata = pred)
pred$fit2 <- predict(fit2, newdata = pred)
pred$fit3 <- predict(fit3, newdata = pred)
library(latticeExtra)
Loading required package: lattice
xyplot(fit1+fit2+fit3 ~ Year, pred, groups = sex, type = 'b')
Is the test whose p-value appears in the third line of the output of the anova function a legitimate hypothesis test? Why or why not? Describe in words what the p-value is indicating, if anything.
Using the model fit2, estimate the rate at which the gender gap is changing in the year 2010. How would you go about obtaining a confidence interval for this estimate?
Answer the previous question using the model fit3.
Which model do you think provides a better answer to this question? Why?
Describe the interpretation of the estimated coefficients for sexMale and for Year:sexMale in the summary output for fit3.
Describe how you would simulate an observation from a distribution with density \(f(x) = e^{-x/2}/2, \; x > 0\) using the inverse CDF. Write the code for a function that would generate \(n\) observations from this distribution using runif as a uniform random number generator.
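A sketch of the kind of function the last question asks for: since \(F(x) = 1 - e^{-x/2}\), the inverse CDF is \(F^{-1}(u) = -2\log(1-u)\).

rmyexp <- function(n) {
  u <- runif(n)
  -2 * log(1 - u)  # inverse CDF; -2 * log(u) also works since 1 - U ~ U
}
rmyexp(5)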
[3] R is to S as Linux is to Unix as GNU C is to C as GNU C++ is to C++. S, Unix, C and C++ were created at Bell Labs in the late 60s and the 70s. R, Linux, GNU C and GNU C++ are public-license re-engineerings of the proprietary S, Unix, C and C++, respectively.