(Updated: March 01 2020 16:42)
Here are 23 statements or questions about statistics, mainly about regression, for you to ponder and comment on.
Is each statement true, false, or does its truth depend on unstated conditions? In the last case, on what conditions does it depend, and how?
Warning: Almost every statement (but not all) expresses a widely held fallacy about statistics. Many professional statisticians are fooled by these statements.
Suppose you are studying how some measure of health is related to weight. You are looking at a multiple regression of health on height and weight, but you realize that what you are really interested in is the relationship between health and excess weight relative to height. What you should do is compute the residuals of a regression of weight on height and replace weight in the model with this new variable. The resulting coefficient of ‘excess weight’ will give a better estimate of the effect of excess weight.
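If you want to experiment before committing to an answer, here is a minimal simulation sketch. The data-generating process and every number in it are assumptions of mine, not part of the problem:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
height = rng.normal(170, 10, n)                    # hypothetical heights (cm)
weight = -80 + 0.9 * height + rng.normal(0, 8, n)  # weight correlated with height
health = 50 + 0.2 * height - 0.3 * weight + rng.normal(0, 5, n)

def ols(y, X):
    """Least-squares coefficients, intercept first."""
    X1 = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

# Coefficient of weight in the full regression of health on height and weight
b_full = ols(health, np.column_stack([height, weight]))

# 'Excess weight' = residuals of weight regressed on height
excess = weight - np.column_stack([np.ones(n), height]) @ ols(weight, height)
b_excess = ols(health, excess)

print("weight coefficient, full model:", b_full[2])
print("excess-weight coefficient:     ", b_excess[1])
```

Compare the two printed coefficients before deciding whether the residual approach gives a ‘better’ estimate.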
Suppose you are studying observational data on the relationship between Health and Coffee (measured in grams of caffeine consumed per day). Suppose you want to control for a possible confounding factor, ‘Stress’. In this kind of study it is more important to measure coffee consumption accurately than to measure ‘stress’ accurately.
A survey of students at York reveals that the average class size of the classes they attend is 130. A survey of faculty shows an average class size of 30. The students must be exaggerating their class sizes or the faculty must be under-reporting theirs.
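A small sketch you can play with: the class sizes below are invented, but they let you compute the average as the faculty see it and as the students see it:

```python
import numpy as np

# Hypothetical class sizes: many small seminars, a few huge lectures
sizes = np.array([10] * 80 + [30] * 15 + [500] * 5)

faculty_view = sizes.mean()                          # each class counted once
student_view = (sizes * sizes).sum() / sizes.sum()   # each student counted once

print("average over classes (faculty):", faculty_view)
print("average over students:         ", student_view)
```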
A survey of Canadian families yielded average ‘equity’ (i.e. total owned in real estate, bonds, stocks, etc. minus total owed) of $48,000. Aggregate government data on total equity in the Canadian population show that the true figure must be much larger, in fact more than twice as large. This shows that respondents must tend to dramatically under-report their equity.
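One thing worth exploring before accepting the conclusion: what repeated honest surveys look like when the quantity surveyed is heavily skewed. The Pareto population below is entirely my own assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical heavy-tailed 'equity' population (Pareto tail)
population = (rng.pareto(1.2, 2_000_000) + 1) * 10_000
true_mean = population.mean()

# Many honest random surveys of 3,000 families each
sample_means = np.array([rng.choice(population, 3_000).mean()
                         for _ in range(1_000)])
print("population mean:", round(true_mean))
print("share of surveys whose mean falls below it:",
      (sample_means < true_mean).mean())
```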
In a multiple regression of Y on three predictors, X1, X2 and X3, suppose the coefficients of X2 and X3 are each non-significant. It is safe to drop these two variables and perform a regression on X1 alone: dropping the variables with non-significant coefficients results in a model that fits almost as well as the original model.
If smoking really is bad for your health, you expect that a comparison of a group of people who have quit smoking with a group that has continued to smoke will reveal that the quitters are, on average, healthier than those who continued.
In a multiple regression, if you drop a predictor whose effect is not significant, the coefficients of the other predictors should not change very much, nor should the p-values associated with them.
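A sketch for experimenting with this claim; the correlation between the predictors and the effect sizes are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)  # corr(x1, x2) ~ 0.9
y = 1.0 * x1 + 0.5 * x2 + rng.normal(0, 3, size=n)        # x2 effect real but noisy

def ols(y, X):
    X1 = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

print("x1 coefficient with x2 in the model:", ols(y, np.column_stack([x1, x2]))[1])
print("x1 coefficient after dropping x2:   ", ols(y, x1)[1])
```

Try varying the correlation between the predictors and see how much the surviving coefficient moves.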
We use maximum likelihood to estimate parameters because the parameter value with the highest likelihood is the value that has the highest probability of being correct. ‘Likelihood’ is just a different word for ‘probability’.
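One concrete check you can run: treat the binomial likelihood after 7 heads in 10 tosses as if it were a probability density for p and see whether it integrates to 1. The experiment's numbers are arbitrary choices of mine:

```python
import numpy as np
from math import comb

# Likelihood of p after observing k = 7 heads in n = 10 tosses
n, k = 10, 7
p = np.linspace(0, 1, 100_001)
likelihood = comb(n, k) * p**k * (1 - p)**(n - k)

# A probability density for p would integrate to 1 over [0, 1];
# the mean over a uniform grid on [0, 1] approximates the integral
print("integral of likelihood over p:", likelihood.mean())  # ~ 1/(n+1)
```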
If you want to reduce the number of predictor variables in a model, forward stepwise regression will do a good job of identifying which variables you should keep. What about backward stepwise regression?
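If you want to see what the first step of forward selection does when there is nothing to find, here is a sketch (it assumes NumPy and SciPy; the sample size and number of noise predictors are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 100, 50
X = rng.normal(size=(n, p))   # 50 pure-noise predictors
y = rng.normal(size=n)        # response unrelated to all of them

# First step of forward selection: pick the single best predictor
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
best = np.argmax(np.abs(r))
t = r[best] * np.sqrt((n - 2) / (1 - r[best]**2))
pval = 2 * stats.t.sf(abs(t), df=n - 2)
print(f"best of {p} noise predictors: r = {r[best]:.3f}, p = {pval:.4f}")
```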
In a regression model with two predictors X1 and X2, and an interaction term between the two predictors, it is dangerous to interpret the ‘main’ effects of X1 and X2 without further qualification. However, it is okay to do so if the interaction term is not significant.
In a model to assess the effect of a number of treatments on some outcome, we can estimate the difference between the best treatment and the worst treatment by the difference between their mean outcomes.
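A simulation worth running before you answer: k identical treatments, so every true difference is zero. All the numbers below are my own choices:

```python
import numpy as np

rng = np.random.default_rng(4)
k, n = 10, 20     # 10 treatments, 20 subjects each, all truly identical

gaps = []
for _ in range(5_000):
    group_means = rng.normal(0, 1, size=(k, n)).mean(axis=1)
    gaps.append(group_means.max() - group_means.min())

# Every true treatment difference is 0; look at what the observed gap does
print("average observed best-minus-worst gap:", np.mean(gaps))
```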
In general we don’t need to worry about interactions between variables unless there is a correlation between them.
In general, a variable cannot be a ‘confounding factor’ for the effect of another variable unless they are associated with each other.
If two variables have a strong interaction, this implies a strong correlation.
You need to impute a mid-term grade for a student who missed the mid-term with a valid excuse. You plan to somehow use the grade on the final exam to impute a mid-term grade. Discuss the relative consequences of using
If you had to choose one of these four, which would you choose and why?
If you have a better solution for the previous problem, what is it and why?
If all scientists used a p-value of 0.05 to decide which results to publish, that would ensure that at most 5% of published results would be incorrect.
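You can explore this with a simulation, though note that the answer depends heavily on two quantities I have had to assume: the share of tested hypotheses that are real, and the size of a real effect:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
studies, n = 20_000, 50
is_real = rng.random(studies) < 0.10   # assume only 10% of tested effects exist
effect = 0.3                           # assumed size of a real effect

# One small study per hypothesis, 'published' if p < 0.05
x = rng.normal(is_real[:, None] * effect, 1, size=(studies, n))
t = x.mean(axis=1) / (x.std(axis=1, ddof=1) / np.sqrt(n))
published = 2 * stats.t.sf(np.abs(t), df=n - 1) < 0.05

print("share of 'published' results that are false:",
      (~is_real[published]).mean())
```

Try other values for the prior share of real effects and the effect size; the answer changes with both.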
If a variable X1 is not significant in a regression of Y on X1 then it will be even less significant in a regression of Y on both X1 and X2 where X2 is another variable. This follows since there is less variability left to explain in a model that already includes X2 than in a model that does not.
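A sketch for experimentation; the construction below (a so-called suppressor structure, with all numbers chosen by me) is one scenario to test the claim against:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 300
x2 = rng.normal(size=n)
x1 = 0.9 * x2 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)
y = x1 - x2 + 0.5 * rng.normal(size=n)   # x2 'suppresses' the x1 effect

def ols(y, X):
    X1 = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

print("x1 slope, alone:  ", ols(y, x1)[1])
print("x1 slope, with x2:", ols(y, np.column_stack([x1, x2]))[1])
```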
Consider this frequently used Venn diagram representing sums of squares in Analysis of Variance:
The diagram shows how two predictor variables X1 and X2 predict a response Y by displaying variances and shared variances. In the decompositions:
\[
\begin{aligned}
SSTO & = SS(X1, X2) + SSE \\
& = SS(X1) + SS(X2|X1) + SSE \\
& = SS(X2) + SS(X1|X2) + SSE
\end{aligned}
\]
the last two lines correspond to Type I sequential sums of squares: adding X1 then X2 in the second line, and adding X2 then X1 in the third.
You can use the Venn diagram to prove that \(SS(X1)\) must be greater than \(SS(X1|X2)\), i.e. that a predictor explains more variance when added alone than when added after another predictor variable.
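Before trusting the diagram, it may help to compute both quantities directly. The sketch below reuses a suppressor-style construction; the data-generating process is my own assumption:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
x2 = rng.normal(size=n)
x1 = 0.9 * x2 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)
y = x1 - x2 + 0.5 * rng.normal(size=n)

def sse(y, X):
    """Residual sum of squares of y regressed on X (with intercept)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    b = np.linalg.lstsq(X1, y, rcond=None)[0]
    return ((y - X1 @ b) ** 2).sum()

ssto = ((y - y.mean()) ** 2).sum()
print("SS(X1)    =", ssto - sse(y, x1))
print("SS(X1|X2) =", sse(y, x2) - sse(y, np.column_stack([x1, x2])))
```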
The best way to deal with high collinearity between predictors is to drop predictors that are not significant.
AIC is useful for identifying the best model among a set of models that you have selected after exploring your data, provided the models are not nested within each other.
A recent study showed that people who sleep more than 9 hours per night on average have a higher chance of premature death than those who sleep fewer than 9 hours. This does not necessarily mean that sleeping more than 9 hours on average is bad for your health because the sample might not have been representative.
Suppose a screening test for steroid drug use has a specificity of 95% and a sensitivity of 95%. This means that the test is incorrect 5% of the time. Therefore, if John takes the test and the result is ‘positive’ (i.e. the test indicates that John takes steroid drugs) the probability that he does not take steroid drugs is only 5%.
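The arithmetic is easy to check directly. The base rates below are hypothetical, since the problem does not state one:

```python
# Positive predictive value under different assumed base rates of steroid use
sensitivity, specificity = 0.95, 0.95
for base_rate in (0.50, 0.05, 0.01):
    p_positive = sensitivity * base_rate + (1 - specificity) * (1 - base_rate)
    ppv = sensitivity * base_rate / p_positive
    print(f"base rate {base_rate:.2f}: P(user | positive) = {ppv:.2f}")
```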
In a multiple regression, the predictor that is most important is the one with the smallest p-value.