These questions have been inspired by many sources. An important reference is Wickham (2019).

1 The meaning of \(p\)-values

The purpose of the assignment is to explore the meaning of p-values. Before starting, stop and reflect on what it means for an experiment to ‘achieve’ a p-value of 0.049. What meaning can we give to the quantity ‘0.049’? How is it related to the probability that the null hypothesis is correct?

To keep things very simple, suppose you want to test \(H_0: \mu =0\) versus \(H_1: \mu \neq 0\) and you are designing an experiment in which you plan to take a sample \(X_1, X_2, \ldots, X_n\) of iid \(\textrm{N}(\mu,1)\) random variables, i.e. the variance is known to be equal to 1. You plan to use the usual test based on \(\bar{X}_n\), rejecting \(H_0\) for values of \(\bar{X}_n\) that are far from 0.

An applied example would be testing for a change in value of a response when all subjects are submitted to the same conditions and the measurement error of the response is known. In that example \(X_i\) would be the ‘gain score’, i.e. post-test response minus the pre-test response exhibited by the \(i\)th subject.

Let the selected probability of Type I error be \(\alpha = 0.05\). Consider collecting samples of size \(n\) where \(n\) equals one of the following:

| \(i\) | \(n\) |
|---|---|
| 1 | 2 |
| 2 | 5 |
| 3 | 10 |
| 4 | 15 |
| 5 | 100 |
| 6 | 1,000 |

Consider using the following values of \(\mu_j\):

| \(j\) | \(\mu_j\) | Cohen’s terminology for effect size \(\mu_j/\sigma\) |
|---|---|---|
| 1 | 0.2 | small effect size |
| 2 | 0.5 | medium effect size |
| 3 | 0.8 | large effect size |
| 4 | 1 | very large effect size |
| 5 | 5 | huge effect size |

  1. What is the probability that \(p \le 0.05\) if \(H_0: \mu = 0\) is true?
  2. What is the probability that \(p \le 0.05\) if \(\mu = \mu_j\)?
  3. What is the power of this test if \(\mu = \mu_j\)?
  4. Suppose that you collect the data and that the observed \(p\)-value is 0.049. What can you say about the probability that \(H_0\) is true?
  5. Suppose that, before running the experiment, you were willing to give \(H_0\) and \(H_1: \mu = \mu_j\) equal probability. What is the probability that \(H_0\) is true given that you have performed the experiment and obtained \(p = 0.049\)?
  6. Hypothesis testing is often presented as a process that parallels that of determining guilt in a criminal process. We start with a presumption of innocence, i.e. that \(H_0\) is true. We then hear evidence and consider whether it contradicts the presumption of innocence ‘beyond a reasonable doubt.’
    Suppose we quantify the presumption of innocence to mean that \(P(H_0) \ge 0.95\). How small an observed \(p\)-value do you need to obtain in order to ‘flip’ the presumption of innocence to ‘guilt beyond a reasonable doubt’ if that is defined as \(P(H_0 | \mathrm{data}) \le 0.05\)?
  7. What \(p\)-value would we need if the presumption of innocence and guilt beyond a reasonable doubt correspond to \(P(H_0) \ge 0.999\) and \(P(H_0 | \mathrm{data}) \le 0.001\)?
  8. Courts have often adopted a criterion of \(p < 0.05\) in imitation of the common practice among many researchers. Comment on the possible consequences.
  9. Have a look at this xkcd cartoon. How does the Bayesian statistician derive a probability to make a decision in this example? Show the details.
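
For questions 2 and 3, the probability that \(p \le \alpha\) under a given \(\mu\) is the power of the test, and for the two-sided z-test with known variance 1 it has a closed form: \(\Phi(-z_{1-\alpha/2} + \sqrt{n}\mu) + \Phi(-z_{1-\alpha/2} - \sqrt{n}\mu)\). The following is a minimal sketch of how the grid of values could be tabulated; the names `power_z`, `n_vals`, `mu_vals` and `power_table` are illustrative and not part of the assignment.

```r
# Power of the two-sided z-test of H0: mu = 0 when X_i ~ N(mu, 1).
# power_z is a hypothetical helper name used only for this sketch.
power_z <- function(mu, n, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)        # critical value for |Z|
  shift <- sqrt(n) * mu            # mean of the test statistic under mu
  pnorm(-z + shift) + pnorm(-z - shift)
}

n_vals  <- c(2, 5, 10, 15, 100, 1000)
mu_vals <- c(0.2, 0.5, 0.8, 1, 5)

# Rows index mu_j, columns index n
power_table <- outer(mu_vals, n_vals, power_z)
dimnames(power_table) <- list(mu = as.character(mu_vals),
                              n  = as.character(n_vals))
round(power_table, 3)
```

Setting `mu = 0` in `power_z` returns \(\alpha\), which is one way to check the answer to question 1.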

To delve more deeply into these issues you can read Wasserstein and Lazar (2016) and Wasserstein, Schirm, and Lazar (2019). Concerns about \(p\)-values have been around for a long time; see Schervish (1996). For a short overview see Bergland (2019). Two influential and, at the time, controversial papers are by John Ioannidis: Ioannidis (2005) and Ioannidis (2019).

For an entertaining take on related issues see Oliver (2016) (warning: contains strong language and political irony that many could consider offensive – watch at your own risk!).

2 Merge: relational data base operations 1

Let
d1 <- data.frame(id = c('a','a','b','c'), grade = c(1,2,1,3))
d2 <- data.frame(id = c('a','c','c','d'), year = c(3,1,3,4)) 

Describe the differences between the outputs of the following commands:

  1. merge(d1,d2)
  2. merge(d1,d2, all.x = TRUE)
  3. merge(d1,d2, all.y = TRUE)
  4. merge(d1,d2, all = TRUE)
  5. Compare this base R function merge with the functions inner_join, left_join, right_join and full_join in the ‘dplyr’ package.
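
As a starting point for exploring these commands, here is a small sketch on a different pair of toy data frames (`left` and `right` are hypothetical names and data, chosen so as not to give away the answer for `d1` and `d2`). It only shows how the `all.x`, `all.y` and `all` arguments change which keys survive.

```r
# Hypothetical toy data: 'y' appears twice on the left, 'z' only on the right
left  <- data.frame(key = c("x", "y", "y"), a = 1:3, stringsAsFactors = FALSE)
right <- data.frame(key = c("y", "z"), b = c(10, 20), stringsAsFactors = FALSE)

both      <- merge(left, right)                # keys present in both
keep_left <- merge(left, right, all.x = TRUE)  # every row of 'left' kept
keep_right <- merge(left, right, all.y = TRUE) # every row of 'right' kept
keep_all  <- merge(left, right, all = TRUE)    # every key from either side

# Number of rows produced by each variant
sapply(list(both = both, keep_left = keep_left,
            keep_right = keep_right, keep_all = keep_all), nrow)
```

Unmatched rows are filled with `NA` in the columns that come from the other data frame.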

3 Merge: Concatenating rows of slightly different data frames

Two research assistants have collected data for a study. The two RAs worked with different subjects and each gathered their data into a spreadsheet. All of the important variables have the same name and the same definitions but each RA has also collected data on a few unique variables for that RA. Also, the order of the columns is different in the two studies.

  1. Create two small data frames to illustrate this kind of situation.
  2. How could you use ‘merge’ to easily concatenate the two data frames by rows, keeping each distinct row in the original data frames and keeping the unique variable names, with values filled with NAs for subjects for whom the value was not present in the data?
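
One possible approach, sketched under the assumption that no subject appears in both spreadsheets: merging with `all = TRUE` on the shared columns keeps every row and pads the RA-specific columns with `NA`. The data frames `ra1` and `ra2` and their variables are invented for illustration.

```r
# Hypothetical spreadsheets: shared variables 'id' and 'age',
# plus one RA-specific variable each, with columns in a different order
ra1 <- data.frame(id = c("s1", "s2"), age = c(31, 45), height = c(170, 182),
                  stringsAsFactors = FALSE)
ra2 <- data.frame(weight = c(70, 64), age = c(28, 52), id = c("s3", "s4"),
                  stringsAsFactors = FALSE)

# By default merge matches on all common column names ('id' and 'age');
# since no rows match, all = TRUE simply stacks the rows
stacked <- merge(ra1, ra2, all = TRUE)
stacked
```

If two rows happened to agree on every shared variable they would be matched into a single row, so this shortcut is only safe when the shared columns identify rows uniquely.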

4 Merge: relational data base operations 2

Let
d1 <- data.frame(id = c('a','a','b','c'), grade = c(1,2,1,3))
d2 <- data.frame(id = c('a','c','c','d'), year = c(3,1,3,4)) 

Find out what the following terms mean and show how to achieve each operation using ‘merge’. Hint: the last two operations are much easier if you first create a new variable in each data frame to serve as a ‘key’ (i.e. the argument of ‘by’ in the call to ‘merge’). (Note: the ‘key’ is the set of variables used to match the rows of the two data frames. These are either the variable names provided as arguments to the ‘by’ parameter or, by default, the intersection of the vectors of variable names in each data frame.)

  1. inner join
  2. outer join
  3. left join
  4. right join
  5. cross join
  6. concatenation of rows
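
To illustrate the hint about creating a key, here is a hedged sketch of a cross join on two invented data frames (`sizes` and `colours` are hypothetical). It compares a constant key column with `by = NULL`, which the ‘merge’ documentation describes as producing the Cartesian product.

```r
# Hypothetical data frames with no columns in common
sizes   <- data.frame(size = c("S", "M", "L"), stringsAsFactors = FALSE)
colours <- data.frame(colour = c("red", "blue"), stringsAsFactors = FALSE)

# Approach 1: add a constant key 'k' to both sides so every row matches every row
via_key <- merge(transform(sizes, k = 1), transform(colours, k = 1), by = "k")
via_key <- via_key[, c("size", "colour")]

# Approach 2: a zero-length 'by' asks merge for the Cartesian product directly
via_null <- merge(sizes, colours, by = NULL)

nrow(via_key)   # 3 sizes x 2 colours
nrow(via_null)
```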

5 Answering questions with data

Use the ‘Vocab’ data set in the ‘car’ package. It records a vocabulary score for over 30,000 subjects tested over the years between 1974 and 2016.

Explore the following questions. Use appropriate tables and graphs to explain your findings.

  1. Consider the distribution of education. Are there any salient features for this distribution?
  2. What can you say about any trends in vocabulary scores over time? Do the trends, if any, appear to differ by gender?
  3. What can you say about any trends in vocabulary scores over time when you adjust for education? What difference does it make whether you adjust for education or you don’t? What is the difference in the meaning of changes in vocabulary score whether you adjust for education or not? Do the trends, if any, appear to differ by gender?
  4. What can you say about any trends in education levels over time? Do the trends, if any, appear to differ by gender?
  5. What can you say about male/female differences in vocabulary scores when you adjust for education? Is the relationship constant or changing over time? If it is changing, how can you describe the nature of the change?
  6. Studying gaps and trends: For each of the following questions, fit a suitable model and explore the question using a suitable Wald test.
    1. In the last ten years of the study, is there evidence that vocabulary scores are increasing among men?
    2. In the last ten years of the study, is there evidence that vocabulary scores are increasing among women?
    3. In the last ten years of the study, is there evidence that vocabulary scores are increasing at a different rate among men than among women?
    4. Repeat 1 for level of education.
    5. Repeat 2 for level of education.
    6. Repeat 3 for level of education.
    7. Repeat 1 for level of vocabulary adjusted for education.
    8. Repeat 2 for level of vocabulary adjusted for education.
    9. Repeat 3 for level of vocabulary adjusted for education.
    10. Repeat 1 for the first ten years.
    11. Repeat 2 for the first ten years.
    12. Repeat 3 for the first ten years.
    13. Repeat 4 for the first ten years.
    14. Repeat 5 for the first ten years.
    15. Repeat 6 for the first ten years.
    16. Repeat 7 for the first ten years.
    17. Repeat 8 for the first ten years.
    18. Repeat 9 for the first ten years.
    19. Repeat 1 comparing the last 10 years with the first 10 years.
    20. Repeat 2 comparing the last 10 years with the first 10 years.
    21. Repeat 3 comparing the last 10 years with the first 10 years.
    22. Repeat 4 comparing the last 10 years with the first 10 years.
    23. Repeat 5 comparing the last 10 years with the first 10 years.
    24. Repeat 6 comparing the last 10 years with the first 10 years.
    25. Repeat 7 comparing the last 10 years with the first 10 years.
    26. Repeat 8 comparing the last 10 years with the first 10 years.
    27. Repeat 9 comparing the last 10 years with the first 10 years.
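
For question 6, the coefficient table of a linear model already reports single-coefficient Wald tests. The sketch below uses simulated data rather than the ‘Vocab’ data set; the data frame `sim` and the variables `yr0`, `sex` and `vocab` are hypothetical, and the model is only one of several that could be considered suitable.

```r
# Simulated stand-in for the last ten years of a study (not the Vocab data)
set.seed(1)
sim <- data.frame(
  year = rep(2007:2016, each = 40),
  sex  = rep(c("Female", "Male"), times = 200),
  stringsAsFactors = FALSE
)
sim$yr0   <- sim$year - 2007                      # years since start of window
sim$vocab <- 6 + 0.02 * sim$yr0 + rnorm(nrow(sim))

# Separate intercepts and slopes for each sex
fit  <- lm(vocab ~ yr0 * sex, data = sim)
wald <- coef(summary(fit))   # estimate, std. error, t value, p-value

wald["yr0", ]           # trend in the reference group (Female)
wald["yr0:sexMale", ]   # difference in trend, Male minus Female
```

The trend among men is the sum of the two slopes above, so testing it requires a Wald test of a linear combination of coefficients rather than a single row of the table.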

6 Questions on factors in R

Some of these questions illustrate important potential pitfalls in using factor variables. As a result, some developers eschew them and prefer to work with character variables as much as possible. Factors, however, are invaluable for many statistical applications since they allow the creation of different orderings of the values in a character vector, which is often important for statistical modeling and for graphics.

  1. Describe the difference, if any, and if so why, between the following (note that the behaviour of the ‘factor’ function has changed over time):

    1. factor(c(1, 2, 10))
    2. factor(as.character(c(1, 2, 10)))
  2. Suppose x <- factor(c(1, 2, 10)). Write a function that would allow you to recover the original values, 1, 2 and 10, from a factor like x. Why does as.numeric(x) not work?

  3. Let f1 <- factor(c('a','b','c')) and f2 <- factor(c('A','B','C')). What’s the matter with the result of the expression

    ifelse(f1 == 'a', f1, f2)
    

    Explain why it fails to produce characters as a result. Fix it so it does.

  4. Indexing with factors: Consider

    df <- data.frame(c = 1:3, a = 11:13, b = 21:23)
    fac <- factor(c('a','b','c'))
    df[[fac[1]]]
    df[[as.character(fac[1])]]
    

    Explain why the last two lines of the code above produce different results.

  5. What happens to a factor when you modify its levels with the levels<- replacement function? Give examples to illustrate your answer.
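
As background for question 2, the sketch below contrasts the integer codes stored in a factor with the values encoded in its levels, using an invented factor `f` rather than the `x` in the question. The idiom `as.numeric(levels(f))[f]` is the one suggested in the help page for `factor`.

```r
# Hypothetical factor built from numbers that are not in sorted order
f <- factor(c(10, 2, 10, 1))

levels(f)                      # levels are character strings, sorted numerically here
codes <- as.numeric(f)         # the internal integer codes, not the original values

# Convert the levels to numbers, then index them by the factor's codes
values <- as.numeric(levels(f))[f]

codes
values
```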

7 Questions on the R language

  1. Describe the main differences between the four taxonomies of objects in R: typeof, mode, storage.mode and class.

  2. Let x <- letters. What is the class and mode of x? Let y <- as.factor(x). What is the class and mode of y? Why does this make sense … or not?

  3. A factor is a kind of object used to represent character variables for statistical analysis. Add a factor to the list used to display the classifications of atomic objects above. Play with some factors and use str to explain the curious values returned by typeof, mode, storage.mode and class for a factor.

  4. What makes is.vector() and is.numeric() fundamentally different to is.list() and is.character()? From Wickham: Advanced R

  5. Why is 1 == "1" true? Why is -1 < FALSE true? Why is "one" < 2 false? From Wickham: Advanced R

  6. Why is the default missing value, NA, a logical vector? What’s special about logical vectors? (Hint: think about c(FALSE, NA_character_).) From Wickham: Advanced R

  7. Does -1:2 produce the same result as 0-1:2? Why or why not?

  8. Which of the following assignments use valid names?

     
    a_very_long_name <- 0
    _tmp <- 2
    .tmp <- 2
    ..val <- 3
    .2regression <- TRUE
    ._2_val <- 'a'
    
  9. Write a Rmarkdown script that illustrates the use of at least 5 functions from the subgroup ‘Ordering and tabulating’ of the group ‘Statistics’ at http://adv-r.had.co.nz/Vocabulary.html

  10. Write a Rmarkdown script that illustrates the use of at least 5 functions from the subgroup ‘Linear models’ of the group ‘Statistics’ at http://adv-r.had.co.nz/Vocabulary.html

  11. Write a Rmarkdown script that illustrates the use of at least 5 functions from the subgroup ‘Miscellaneous tests’ of the group ‘Statistics’ at http://adv-r.had.co.nz/Vocabulary.html

  12. Write a Rmarkdown script that illustrates the use of at least 5 functions from the subgroup ‘Random variables’ of the group ‘Statistics’ at http://adv-r.had.co.nz/Vocabulary.html. Include interesting graphs.

  13. Write a Rmarkdown script that illustrates the use of at least 5 functions from the subgroup ‘Matrix algebra’ of the group ‘Statistics’ at http://adv-r.had.co.nz/Vocabulary.html
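
Returning to questions 1 to 3 on the taxonomies of objects, one way to start is to tabulate what the four functions report for a handful of objects. This is only an exploratory sketch; the helper `taxonomy` and the list `objs` are hypothetical names.

```r
# Report the four classifications for one object
# (class() can return several values; only the first is kept here)
taxonomy <- function(x) {
  c(typeof = typeof(x), mode = mode(x),
    storage.mode = storage.mode(x), class = class(x)[1])
}

objs <- list(
  integer   = 1:3,
  double    = c(1.5, 2),
  character = letters[1:3],
  logical   = c(TRUE, NA),
  factor    = factor(letters[1:3])
)

# One row per object, one column per classification
tax_table <- t(sapply(objs, taxonomy))
tax_table
```

The row for the factor is the interesting one: comparing it with `str(objs$factor)` shows where each reported value comes from.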

8 Questions on programming in R

Many of these questions ask you to write a function to accomplish some goal, instead of just requiring an expression. The advantage of writing a function is that you can easily test your code by trying your function on extreme or impossible values.

  1. What output will the following R script produce? Explain briefly why.
    x <- c(TRUE, FALSE, 0L)
    typeof(x)
  2. What output will the following R script produce? Explain briefly why.
    TRUE | NA
  3. Let x be defined as:
    x <- c('0','10','5','20','15','10','0','5')
    Write an R function that would turn x into a factor whose ordering corresponds to the numerical ordering of x.
  4. In R, let x <- 1:5. What output would x[NA] produce? What output would x[NA_real_] produce? Describe the reason for the difference, if any.
  5. In R, describe the result of subsetting a vector with positive integers, with negative integers, with a logical vector, or with a character vector?
  6. In R, what’s the difference between [, [[, and $ when applied to a list?
  7. In R, when subsetting with [, when should you use drop = FALSE? Include arrays and factors in your discussion.
  8. In R, if x is a matrix, what does x[] <- 0 do? How is it different from x <- 0?
  9. In R, how can you use a named vector to relabel a categorical variable?
  10. In R, if mtcars is a data frame, why does mtcars[1:20] return an error? How does it differ from the similar mtcars[1:20, ]?
  11. Fix each of the following common data frame subsetting errors in R:
    mtcars[mtcars$cyl = 4, ]
    mtcars[-1:4, ]
    mtcars[mtcars$cyl <= 5]
    mtcars[mtcars$cyl == 4 | 6, ]
  12. In R, if df is a data frame, what does df[is.na(df)] <- 0 do? How does it work?
  13. Create the vector (20,19,…,2,1) in R.
  14. Create the vector (1,2,3,…,19,20,19,18,…,2,1) in R.
  15. Create the vector (4,4,…,4,6,6,…,6,3,3,…,3) in R, where there are 10 occurrences of 4, 20 of 6 and 30 of 3.
  16. Write a function in R to calculate \(\sum_{i=1}^{n}(i^3+4i^2)\). Test it including ‘incorrect’ input.
  17. Generate in R a vector of 30 labels: ‘label 1’, ‘label 2’, … ‘label 30’
  18. Let y <- sample(1000, 30, replace = TRUE). Write functions in R to do the following. Test each function.
    1. Determine how many elements of y are multiples of 2.
    2. Determine how many elements of y are equal to 7 mod 13.
    3. Determine how many elements of y are within 200 of the maximum value.
    4. Determine how many elements of y are less than the previous element.
    5. Determine how many elements of y are an exact square.
    6. Determine how many elements of y are prime.
  19. Suppose data for a variable in R representing dollars has been entered in a variety of formats: ‘$1,000.00’,‘1000.00’,‘$1’. Write a function in R that transforms the variable to a numeric variable in dollars to the nearest cent.
  20. Write a function in R that takes a character vector and collapses multiple adjoining blanks in each element to a single blank.
  21. Write a function in R that accepts a data frame as input and returns a data frame in which every variable whose name starts with the letter ‘X’ and ends in a number has been removed.
  22. Create a 6 by 10 matrix of random integers in R as follows:
    set.seed(75)
    m <- matrix(sample(10, 60, replace = TRUE), nrow = 6)
  23. Write a function to find the number of entries in each row of a matrix that are greater than 4.
  24. (continued from the previous question) Write a function to find how many rows have exactly two instances of the number 7.
  25. Describe the difference in R between paste(x, y, sep = ':') and paste(x, y, collapse = ':'). Illustrate.
  26. Using the hs data set in the spida2 package, create a plot with two panels showing histograms displaying the distribution of school sizes in the Public and in the Catholic sectors. Use the functions capply and up in the spida2 package. You may also use any other approach to compare with the use of capply and up.
  27. Using the hs data set in the spida2 package, create a plot with two panels showing histograms displaying the distribution of sample sizes in each school in the Public and in the Catholic sectors. Use the functions capply and up in the spida2 package. You may also use any other approach to compare with the use of capply and up.
  28. Using the hs data set in the spida2 package, create a plot with two panels showing scatterplots displaying the relationship between mean mathach and mean ses in each school in the Public and in the Catholic sectors. Explore reasonable transformations and regression lines: linear and non-parametric in the plots. Use the functions capply and up in the spida2 package. You may also use any other approach to compare with the use of capply and up.
  29. Describe the difference in R between a generic function and a method.
  30. [Warwick] Create the vectors:
    1. (1,2,3,…,19,20)
    2. (20,19,…,2,1)
    3. (1,2,3,…,19,20,19,18,…,2,1)
    4. (4,6,3) and assign it to the name ‘tmp’
    5. (4,6,3,4,6,3,…,4,6,3) where there are 10 occurrences of 4 (Hint: ?rep)
    6. (4,6,3,4,6,3,…,4,6,3,4) where there are 11 occurrences of 4 and 10 of 6 and 3
    7. (4,4,…,4,6,6,…,6,3,3,…,3) where there are 10 occurrences of 4, 20 of 6 and 30 of 3.
  31. [Warwick] Create the vector of the values of \(e^x \cos(x)\) at \(x=3, 3.1, 3.2, ..., 6\).
  32. [Warwick] Create the following vectors:
    1. \((0.1^3 0.2^1, 0.1^6 0.2^4, ... , 0.1^{36} 0.2^{34} )\)
    2. \(\left({2,\frac{2^2}{2},\frac{2^3}{3},...,\frac{2^{25}}{25}}\right)\)
  33. [Warwick] Calculate the following:
    1. \(\sum_{i=10}^{100} (i^3 + 4i^2)\)
    2. \(\sum_{i=1}^{25} \left({\frac{2^i}{i} + \frac{3^i}{i^2}}\right)\)
  34. [Warwick] Use the function ‘paste’ to create the following character vectors of length 30:
    1. (“label 1”, “label 2”, … , “label 30”). Note that there is a single space between label and the number following.
    2. (“fn1”, “fn2”, …, “fn30”). In this case there is no space.
  35. [Warwick] Execute the following lines which create two vectors of random integers which are chosen with replacement from the integers 0, 1, …, 999. Both vectors have length 250.
    set.seed(50)
    xVec <- sample(0:999, 250, replace = TRUE)
    yVec <- sample(0:999, 250, replace = TRUE)
    Suppose \(\mathbf{x} = (x_1, x_2, ..., x_n)\) denotes the vector xVec and similarly for \(\mathbf{y}\).
    1. Write a function that returns the vector \((y_2 - x_1, ..., y_n - x_{n-1})\)
    2. Write a function that returns the vector \(\left({\frac{\sin(y_1)}{\cos(x_2)},\frac{\sin(y_2)}{\cos(x_3)},...,\frac{\sin(y_{n-1})}{\cos(x_n)} }\right)\)
    3. Write a function that returns the vector \((x_1 + 2x_2 - x_3, x_2 + 2 x_3 - x_4, ..., x_{n-2} + 2x_{n-1} - x_n)\)
    4. Write a function that calculates \(\sum_{i=1}^{n-1}\frac{e^{-x_{i+1}}}{x_i + 10}\)
  36. [Warwick] This question uses the vectors xVec and yVec created in the previous question and the functions sort, order, mean, sqrt, sum and abs.
    1. Write a function that returns the values in yVec which are > 100.
    2. Write a function that returns the index positions in yVec of the values which are > 600.
    3. Write a function that returns the values in xVec which correspond to the values in yVec which are > 600.
    4. Create the vector \(\left( \left|x_1-\bar{\mathbf{x}}\right|^{1/2}, \left|x_2-\bar{\mathbf{x}}\right|^{1/2},..., \left|x_n-\bar{\mathbf{x}}\right|^{1/2}\right)\)
    5. Write a function that returns how many values in ‘yVec’ are within 200 of the maximum value of the terms in ‘yVec’.
    6. Write a function that sorts the numbers in the vector ‘xVec’ in the order of increasing values in ‘yVec’.
    7. Write a function that returns how many numbers in ‘xVec’ are divisible by 2.
    8. Write a function that returns the elements in ‘yVec’ at index positions 1,4,7,10,13,…
  37. [Warwick] By using the function cumprod or otherwise, write a function that calculates \[ 1 + \frac{2}{3} +\frac{2}{3}\frac{4}{5} + \frac{2}{3}\frac{4}{5}\frac{6}{7}+...+\frac{2}{3}\frac{4}{5}...\frac{38}{39}\]
  38. [Regular expressions] Suppose money data for a variable has been entered in a variety of formats, e.g.
    "$1,000.00", "1000.00", "$123.2"
    Write an R function using ‘gsub’ and ‘as.numeric’ to turn these various entries into a numeric variable. Experiment with your function to make sure it works.
  39. [Regular expressions] Write a function that takes a character vector and collapses multiple adjoining blanks into a single blank.
  40. [Regular expressions] Use the file SampleClassFile.csv. One of its variables is a string that contains information about a student’s faculty and programme: whether they are in an ordinary or an honours programme, and the departments of their major and minor. Write a function that uses regular expressions to create four new variables: the faculty in which a student is enrolled, whether they are in an ordinary or an honours programme, their major programme and their minor programme, if any.
  41. [Regular expressions] Suppose you have a vector of names, such as:
        Mary Jones
        Tarik Mohammed
        Smith, Jim
        Tom O'Brian
        Victor Lindquist
        Chow, Vincent
        Wong, Mary
    
    Some names are in the format ‘First Last’ and others ‘Last, First’. Write a function to extract the full names, in the format ‘Last, First’, of all the individuals whose first name is ‘Mary’.
  42. [Merging and reshaping] Use the site Gapminder.org to download at least three longitudinal variables into separate data sets. Merge the data sets into one for which each row represents one country and year and contains the values of each of the three variables you downloaded. Display how these variables change over time.
  43. [Regular expressions] Write a function that removes every variable whose name starts with the letter ‘X’ and ends in a number from a data frame.
  44. [Data] Write a function that takes a data frame and returns it with variable names in alphabetical order.
  45. [Warwick] Suppose \[\mathbf{A}= \begin{bmatrix} 1 & 1 & 1 \\ 5 & 2 & 6 \\ -1 & -1 & -3\end{bmatrix}\]
    1. Check that \(\mathbf{A}^3 = \mathbf{0}\) where \(\mathbf{0}\) is a \(3 \times 3\) matrix with every entry equal to 0.
    2. Replace the third column of \(\mathbf{A}\) by the sum of the second and third columns.
  46. [Warwick] Create the following matrix \(\mathbf{B}\) with 15 rows: \[\mathbf{B}= \begin{bmatrix} 10 & -10 & 10 \\ 10 & -10 & 10 \\ \vdots & \vdots & \vdots \\10 & -10 & 10\end{bmatrix}\] Calculate the \(3 \times 3\) matrix \(\mathbf{B}^T\mathbf{B}\). Consider: ?crossprod
  47. [Warwick] Create a \(6 \times 6\) matrix ‘matE’ with every entry equal to 0. Check what the functions ‘row’ and ‘col’ return when applied to ‘matE’. Hence create the \(6 \times 6\) matrix: \[\begin{bmatrix} 0 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 & 1 & 0 \end{bmatrix}\]
  48. [Warwick] Look at ?outer. Hence create the following patterned matrix: \[\begin{bmatrix} 0 & 1 & 2 & 3 & 4 & 5 \\ 1 & 2 & 3 & 4 & 5 & 6 \\ 2 & 3 & 4 & 5 & 6 & 7 \\ 3 & 4 & 5 & 6 & 7 & 8 \\ 4 & 5 & 6 & 7 & 8 & 9 \\ 5 & 6 & 7 & 8 & 9 & 10 \end{bmatrix}\]
  49. [Warwick] Create the following patterned matrices. In each case, your solution should make use of the special form of the matrix – this means that the solution should easily generalize to creating a larger matrix with the same structure and should not involve typing in all the entries in the matrix.
    1. \(\begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 \\ 1 & 2 & 3 & 4 & 5 & 0 \\ 2 & 3 & 4 & 5 & 0 & 1 \\ 3 & 4 & 5 & 0 & 1 & 2 \\ 4 & 5 & 0 & 1 & 2 & 3 \\ 5 & 0 & 1 & 2 & 3 & 4 \end{pmatrix}\)
    2. \(\begin{pmatrix} 0 & 5 & 4 & 3 & 2 & 1 \\ 1 & 0 & 5 & 4 & 3 & 2 \\ 2 & 1 & 0 & 5 & 4 & 3 \\ 3 & 2 & 1 & 0 & 5 & 4 \\ 4 & 3 & 2 & 1 & 0 & 5 \\ 5 & 4 & 3 & 2 & 1 & 0 \end{pmatrix}\)
  50. [Warwick] Solve the following system of linear equations in five unknowns \[\begin{aligned} x_1 + 2x_2 + 3x_3 + 4x_4 +5 x_5 &= 7 \\ 2x_1 + x_2 + 2x_3 + 3x_4 +4 x_5 &= -1 \\ 3x_1 + 2x_2 + x_3 + 2x_4 +3 x_5 &= -3 \\ 4x_1 + 3x_2 + 2x_3 + x_4 +2 x_5 &= 5 \\ 5x_1 + 4x_2 + 3x_3 + 2x_4 +x_5 &= 17 \end{aligned}\] by considering an appropriate matrix equation \(\mathbf{A}\mathbf{x}=\mathbf{y}\).
    Make use of the special form of the matrix \(\mathbf{A}\). The method used for the solution should easily generalize to a larger set of equations where the matrix \(\mathbf{A}\) has the same structure.
  51. [Warwick] Create a \(6 \times 10\) matrix of random integers chosen from \(1,2,\ldots,10\) by executing the following two lines of code:
    
    set.seed(75)
    aMat <- matrix(sample(10, size = 60, replace = TRUE), nrow = 6)
    
    1. Write a function to find the number of entries in each row which are greater than 4.
    2. Write a function to find which rows contain exactly two occurrences of the number seven.
    3. Find those pairs of columns whose total (over both columns) is greater than 75. The answer should be a matrix with two columns; so, for example, the row (1,2) in the output matrix means that the sum of columns 1 and 2 in the original matrix is greater than 75. Repeating a column is permitted; so, for example, the final output matrix could contain the rows (1,2),(2,1) and (2,2).
      What if repetitions are not permitted? Then, only (1,2) from (1,2), (2,1) and (2,2) would be permitted.
  52. [Warwick] Calculate:
    1. \(\sum_{i=1}^{20} \sum_{j=1}^{5} \frac{i^4}{(3+j)}\)
    2. (Hard) \(\sum_{i=1}^{20} \sum_{j=1}^{5} \frac{i^4}{(3+ij)}\)
    3. (Even harder!) \(\sum_{i=1}^{10} \sum_{j=1}^{i} \frac{i^4}{(3+ij)}\)
  53. [Warwick]
    1. Write functions ‘tmpFn1’ and ‘tmpFn2’ such that if ‘xVec’ is the vector \((x_1, x_2, ..., x_n)\), then ‘tmpFn1(xVec)’ returns the vector \((x_1,x_2^2,...,x_n^n)\) and ‘tmpFn2(xVec)’ returns the vector \(\left({x_1,\frac{x_2^2}{2},...,\frac{x_n^n}{n}}\right)\)
    2. Now write a function ‘tmpFn3’ which takes two arguments \(x\) and \(n\) where \(x\) is a single number and \(n\) is a strictly positive integer. The function should return the value of \[1 + \frac{x}{1} + \frac{x^2}{2} + \frac{x^3}{3} + ... + \frac{x^n}{n}\]
  54. [Warwick] Write a function ‘tmpFn(xVec)’ such that if ‘xVec’ is the vector \(\mathbf{x}=(x_1,...,x_n)\) then ‘tmpFn(xVec)’ returns the vector of moving averages: \[\frac{x_1 + x_2 + x_3}{3}, \frac{x_2 + x_3 + x_4}{3}, ... ,\frac{x_{n-2} + x_{n-1} + x_n}{3}\] Try out your function; for example, try ‘tmpFn(c(1:5,6:1))’.
  55. [Warwick] Consider the continuous function: \[f(x) = \begin{cases} x^2 + 2x + 3 & \quad \text{if } x < 0 \\ x+3 & \quad \text{if } 0 \le x < 2 \\ x^2 + 4x - 7 & \quad \text{if } 2 \le x \\ \end{cases}\] Write a function tmpFn which takes a single argument ‘xVec’. The function should return the vector of values of the function \(f(x)\) evaluated at the values in ‘xVec’.
    Hence plot the function \(f(x)\) for \(-3 < x < 3\).
  56. [Warwick] Write a function which takes a single argument which is a matrix. The function should return a matrix which is the same as the function argument but every odd number is doubled.
  57. [Warwick] Write a function which takes two arguments ‘n’ and ‘k’ which are positive integers. It should return the \(n \times n\) matrix: \[\begin{bmatrix} k & 1 & 0 & 0 & \cdots & 0 & 0 \\ 1 & k & 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & k & 1 & \cdots & 0 & 0 \\ 0 & 0 & 1 & k & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & k & 1 \\ 0 & 0 & 0 & 0 & \cdots & 1 & k \\ \end{bmatrix}\]
  58. [Warwick] Suppose an angle \(\alpha\) is given as a positive real number of degrees counting counter-clockwise from the positive horizontal axis. Write a function quadrant(alpha) which returns the quadrant, 1, 2, 3 or 4, corresponding to ‘alpha’.
  59. [Warwick]
    1. Zeller’s congruence is the formula: \[f = ([2.6m-0.2] + k + y + [y/4] + [c/4] - 2c) \mod 7\] where \([x]\) denotes the integer part of \(x\); for example \([7.5]=7\).
      Zeller’s congruence returns the day of the week \(f\) given:
      \(k =\) the day of the month,
      \(y =\) the year in the century,
      \(c =\) the first 2 digits of the year (the century number)
      \(m =\) the month number (where January is month 11 of the preceding year, February is month 12 of the preceding year, March is month 1, etc.). For example, the date July 21, 1963 has \(m=5, k = 21, c=19, y = 63\); while the date February 21, 1963 has \(m=12, k=21, c=19\) and \(y=62\).
      Write a function ‘weekday(day, month, year)’ which returns the day of the week when given the numerical inputs of the day, month and year.
      Note that the value 1 for \(f\) denotes Sunday, 2 denotes Monday, etc.
    2. Does your function work if the input parameters ‘day’, ‘month’ and ‘year’ are vectors with the same length and with valid entries?
  60. [Warwick]
    1. Suppose \(x_0=1\) and \(x_1=2\) and \[x_j = x_{j-1}+\frac{2}{x_{j-1}} \qquad \text{for }j= 1,2,...\] Write a function ‘testLoop’ which takes a single argument \(n\) and returns the first \(n-1\) values of the sequence \(\{x_j\}_{j \ge 0}\), that is, the values of \(x_0, x_1, x_2, ... , x_{n-2}\).
    2. Now write a function ‘testLoop2’ which takes a single argument ‘yVec’ which is a vector. The function should return \[\sum_{j=1}^{n} e^j\] where \(n\) is the length of ‘yVec’.
  61. [Warwick] Solution of the difference equation \(x_n = r x_{n-1}(1 - x_{n-1})\) with starting values \(x_1\).
    1. Write a function ‘quadmap(start, rho, niter)’ which returns the vector \((x_1, \ldots, x_n)\) where \(x_k=r x_{k-1}(1 - x_{k-1})\) and
      ‘niter’ denotes \(n\),
      ‘start’ denotes \(x_1\), and
      ‘rho’ denotes \(r\).
      Try out the function you have written:
      • for \(r=2\) and \(0 < x_1 < 1\) you should get \(x_n \rightarrow 0.5\) as \(n \rightarrow \infty\).
      • try ‘tmp <- quadmap(start=0.95, rho=2.99, niter=500)’. Now type:
        plot(tmp, type = 'l')
        Also try:
        plot(tmp[300:500], type = 'l')
    2. Now write a function which determines the number of iterations needed to get \(| x_n - x_{n-1}| < 0.02\). This function has only 2 arguments: ‘start’ and ‘rho’. (For ‘start = 0.95’ and ‘rho=2.99’, the answer is 84.)
  62. [Warwick] Given a vector \((x_1, ... ,x_n)\), the sample autocorrelation of lag \(k\) is defined to be \[r_k = \frac{\sum_{i=k+1}^{n}(x_i-\bar{x})(x_{i-k}-\bar{x})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\]
    1. Write a function ‘autocor(xVec)’ which takes a single argument ‘xVec’ which is a vector and returns a list of two values: \(r_1\) and \(r_2\).
      In particular, find \(r_1\) and \(r_2\) for the vector \((2, 5, 8, ..., 53, 56)\).
    2. (Harder) Generalize the function so that it takes two arguments: the vector ‘xVec’ and an integer ‘k’ which lies between 1 and \(n-1\) where \(n\) is the length of ‘xVec’. The function should return a vector of the values \((r_0 = 1, r_1, ..., r_k)\).
      If you used a loop to answer part 2, then you need to be aware that much, much better solutions are possible. Hint: ‘sapply’.
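
The ‘sapply’ hint in the last question can be illustrated as follows. This is one possible sketch of the generalized function, not a model answer; it takes the definition of \(r_k\) given above literally and can be compared with the built-in `acf`, which uses the same estimator.

```r
# Sample autocorrelations r_0, ..., r_k following the definition of r_k above.
# Assumes 1 <= k <= length(xVec) - 1; no input checking is done in this sketch.
autocor <- function(xVec, k) {
  n     <- length(xVec)
  dev   <- xVec - mean(xVec)       # deviations from the sample mean
  denom <- sum(dev^2)
  r <- sapply(seq_len(k), function(lag) {
    sum(dev[(lag + 1):n] * dev[1:(n - lag)]) / denom
  })
  c(1, r)                          # r_0 is 1 by definition
}

x <- seq(2, 56, by = 3)            # the vector (2, 5, 8, ..., 53, 56)
autocor(x, 2)

# Cross-check against stats::acf
as.numeric(acf(x, lag.max = 2, plot = FALSE)$acf)
```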

9 Questions on selection in R

  1. Suppose
    x <- 1:5
    What is the difference between x[NA] and x[NA_integer_]? Why? Hint: It may have something to do with the recycling rule.

  2. Write a function whose input is a numerical matrix and that checks whether the matrix is a lower triangular matrix, i.e. all elements above the diagonal are 0. Hint: consider using the row and col functions.

  3. (Longitudinal data) A longitudinal data set with one row per occasion has a varying number of observations for each subject. Suppose that a variable named ‘id’ has a unique identifier for each subject. Some subjects have been measured on only one occasion and you would like to perform an analysis that excludes those subjects. Suppose that the original data frame is called ‘dd’. Write R code to create a data frame ‘ds’ that excludes the subjects that were measured only once.
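As a starting point for questions 1 and 3 of this section, here is a short sketch that can be run and then explained. The data frame ‘dd’ below is a small made-up stand-in for the real longitudinal data, and the variable ‘y’ and the helper name ‘nObs’ are assumptions of this example. Counting occasions with ‘table’ is only one of several reasonable approaches.

```r
# Question 1: compare the two results, then explain the difference.
x <- 1:5
print(x[NA])            # NA NA NA NA NA
print(x[NA_integer_])   # NA
print(typeof(NA))       # "logical"

# Question 3: keep only subjects measured on more than one occasion.
# 'dd' is a hypothetical data frame with one row per occasion.
dd <- data.frame(id = c("s1", "s1", "s2", "s3", "s3", "s3"),
                 y  = c(10, 12, 9, 7, 8, 11))
nObs <- table(dd$id)                              # number of occasions per subject
ds <- dd[dd$id %in% names(nObs)[nObs > 1], ]      # drop single-occasion subjects
print(ds)
```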

10 Questions on data frames and data manipulation in R

  1. Let
    d1 <- data.frame(id = c('a','a','b','c'), grade = c(1,2,1,3))
    d2 <- data.frame(id = c('a','c','c','d'), year = c(3,1,3,4))
    Describe the differences between the outputs of the following commands:
    merge(d1,d2)

    merge(d1,d2, all.x = TRUE)

    merge(d1,d2, all.y = TRUE)

    merge(d1,d2, all = TRUE)

  2. What attributes does a data frame possess? From Wickham: Advanced R

  3. What does as.matrix() do when applied to a data frame with columns of different types? From Wickham: Advanced R

  4. Can you have a data frame with 0 rows? What about 0 columns? From Wickham: Advanced R

  5. This question illustrates how simple data manipulation can be used to answer basic queries about data. Consider classlists for four sections of a first year statistics course STA1000 (at http://blackwell.math.yorku.ca/MATH4939/data/clist_exercise/) and two classlists for a second year statistics course STA2000, taken the following year. Without any direct editing of the classlists do the following:

    1. Write a function that transforms each input classlist into a data frame with useful variables on the program of each student. Note that information on program is encoded in a single column that can contain information on a number of distinct variables. You need to use string manipulation functions, e.g. sub, gsub, strsplit, to turn this column into useful variables. Note that a space is usually a delimiter between subfields but sometimes not. You might need to preprocess the strings before splitting them into subfields.
    2. Is there evidence that a different proportion of students in the 2nd year course go on to study statistics in the 3rd year course?
    3. Is there evidence that this remains true when adjusting for the program of students in the 2nd year course?
    4. Are there conversions, i.e. students who change their majors to statistics? Do they come disproportionately from some sections instead of others?
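For question 1 of this section, it can help to run the four ‘merge’ calls side by side and compare the row counts before describing the differences. The sketch below uses the data frames from the question; the object names ‘mInner’, ‘mLeft’, ‘mRight’ and ‘mFull’ are chosen only for this illustration.

```r
d1 <- data.frame(id = c('a', 'a', 'b', 'c'), grade = c(1, 2, 1, 3))
d2 <- data.frame(id = c('a', 'c', 'c', 'd'), year = c(3, 1, 3, 4))

mInner <- merge(d1, d2)                 # only ids present in both
mLeft  <- merge(d1, d2, all.x = TRUE)   # keep every row of d1
mRight <- merge(d1, d2, all.y = TRUE)   # keep every row of d2
mFull  <- merge(d1, d2, all = TRUE)     # keep every row of both

# Row counts summarize how unmatched ids are handled
print(c(inner = nrow(mInner), left = nrow(mLeft),
        right = nrow(mRight), full = nrow(mFull)))
print(mFull)                            # unmatched ids are filled with NA
```

Note how an id that appears more than once in either data frame produces one output row for each matching pair.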

11 Questions on graphics in R

12 Questions on functions in R

  1. Write a function that will take a vector as input and return the vector with NAs changed to a null string ("") if the vector is a character vector or to 0 if it is a numeric vector.
    Test your function on extreme examples.

  2. Extend the previous function so it returns a factor if the input is a factor and changes NAs in factors to a factor level that is a null string.
    Test your function on extreme examples.

  3. Extend the previous function so the value to which NAs are changed can be supplied as a parameter with default value the same as in the previous question. Make the value to which NAs are changed potentially depend on the type of variable.
    Test your function on extreme examples.

  4. (Major question) Referring to question 1 on \(p\)-value,

    1. write a function that returns the posterior probability that \(H_0\) is true given the variables in the question: the alternative value of \(\mu\), \(\alpha\), \(n\).
    2. Generate a data frame with values of these variables (hint: consider using expand.grid) and evaluate the function on each row in the data frame.
    3. Graph the results in some interesting and revealing way. You might want to change the values of the variables that you used in creating the data frame in order to produce a more interesting display.
    4. Discuss what your graphs reveal.
  5. Write a function that returns a lower triangular matrix whose (i,j)th entry, for i > j, is i + j.

  6. Write a function that finds the index of the first occurrence of x in a vector y.

  7. Write a function that turns a vector of non-negative integers into a factor with values ‘0’, ‘1’, ‘2 or more’. Make sure that the factor has the right ordering of levels.

  8. Write a function that identifies whether an integer is a prime number.

  9. Write a function whose input is a data.frame and whose output is the same data frame except that all factor variables have been changed to character variables but the numeric variables are unchanged.
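For the major question (question 4 of this section), one possible starting point is sketched below. It rests on an assumption that should be checked against your own reading of the question: it conditions on the event ‘the test is significant at level \(\alpha\)’, i.e. \(p \le \alpha\), not on observing a specific value such as \(p = 0.049\). Under that assumption Bayes' rule gives \(P(H_0 \mid p \le \alpha) = \alpha \pi_0 / (\alpha \pi_0 + \textrm{power} \times (1 - \pi_0))\), where \(\pi_0\) is the prior probability of \(H_0\). The function names ‘powerZ’ and ‘postH0’ and the argument ‘prior0’ are made up for this sketch.

```r
# Power of the two-sided z-test of H0: mu = 0 when X_i ~ N(mu, 1)
powerZ <- function(mu, n, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)             # critical value for |sqrt(n) * xbar|
  shift <- mu * sqrt(n)                 # mean of sqrt(n) * xbar under H1
  pnorm(-z - shift) + pnorm(z - shift, lower.tail = FALSE)
}

# P(H0 | p <= alpha) for prior probability prior0 on H0.
# Assumption: we condition on significance at level alpha,
# not on an exact observed p-value.
postH0 <- function(mu, n, alpha = 0.05, prior0 = 0.5) {
  num <- alpha * prior0
  num / (num + powerZ(mu, n, alpha) * (1 - prior0))
}

# Evaluate on every combination of the values listed in question 1
grid <- expand.grid(mu = c(0.2, 0.5, 0.8, 1, 5),
                    n = c(2, 5, 10, 15, 100, 1000),
                    alpha = 0.05)
grid$postH0 <- with(grid, postH0(mu, n, alpha))
print(head(grid))
```

Both functions are vectorized, so they can be applied to whole columns of the grid without a loop. Conditioning on the exact observed p-value would instead require a ratio of densities and generally gives a different answer.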

13 Questions on OOP in R

14 Questions on Regression

  1. Explore the pros and cons of Wald tests versus Likelihood Ratio Tests. Construct an example where they give entirely different results.
  2. Suppose the following is a model to estimate the pay gap between men and women in a large organization. \(Y\) is annualized salary, \(G\) is a dummy variable equal to 0 for men and 1 for women, \(X_1\) and \(X_2\) are other factors (e.g. experience and education). Suppose the model has the form: \[ E(Y) = \beta_0 + \beta_1 G + \beta_2 X_1 + \beta_3 X_2 + \beta_4 G X_2 +\beta_5 X_1 X_2 \] and a least-squares regression produces the output below.
    1. What is the gap (the difference of women’s salaries minus)
Coefficient  Estimate  S.E.
\(\beta_0\)  20        5.0
\(\beta_1\)  1         2.0
\(\beta_2\)  2         0.5
\(\beta_3\)  3         1.2
\(\beta_4\)  4         2.1
\(\beta_5\)  5         4.0
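Because \(G\) interacts with \(X_2\) in this model, the estimated gap is not a single number: subtracting the expected salary for men (\(G = 0\)) from that for women (\(G = 1\)) at the same \(X_1\) and \(X_2\) leaves \(\beta_1 + \beta_4 X_2\). The sketch below checks this numerically with the estimates in the table. The function names and the values of \(X_1\) and \(X_2\) are hypothetical, chosen only for illustration.

```r
# Estimates copied from the table above
est <- c(b0 = 20, b1 = 1, b2 = 2, b3 = 3, b4 = 4, b5 = 5)

# Expected salary under the stated model
expectedY <- function(G, X1, X2) {
  unname(est["b0"] + est["b1"] * G + est["b2"] * X1 + est["b3"] * X2 +
           est["b4"] * G * X2 + est["b5"] * X1 * X2)
}

# Gap = women minus men at the same X1 and X2; algebraically b1 + b4 * X2
gap <- function(X1, X2) expectedY(1, X1, X2) - expectedY(0, X1, X2)

print(gap(X1 = 0, X2 = 0))   # 1
print(gap(X1 = 3, X2 = 2))   # 9, the same for any X1
```

A standard error for the gap at a given \(X_2\) cannot be obtained from the table alone, since it also depends on the covariance between the estimates of \(\beta_1\) and \(\beta_4\).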

15 Questions on Causality

  1. One of your professors uploads videos of the course to a website. One day, he analyzes results from last year’s class and discovers that students’ performance in the course is related to how often they viewed the videos. Students who viewed the videos frequently tended to perform less well on the final exam than students who viewed the videos relatively rarely. Upon discovering this, your professor announces that he will stop recording lectures because, he says, the videos have been shown to cause students to perform more poorly on the course. In answering the following questions, use clear and simple language even a professor might be able to understand.

    1. Do you think the data used in this study constitute experimental data (in the sense used in this course) or observational data?
    2. Explain why the number of lectures attended could be a potential confounding factor when considering the relationship between the frequency of viewing videos and performance on the course.
    3. Can you think of potential mediating factors?
    4. Draw causal graphs for a confounding factor and for a mediating factor.
    5. Using a potential confounding factor, draw a hypothetical MC diagram (alias Paik-Agresti diagram, alias ‘marginal-conditional plot’) to show how students who view the videos frequently might do more poorly than those who view them rarely, even though viewing videos may make a positive contribution for individual students in the course.
  2. In 1964, the Public Health Service of the United States studied the effects of smoking on health in a sample of 42,000 households. For men and for women in each age group, they found that those who had never smoked were on average somewhat healthier than the current smokers, but the current smokers were on average much healthier than the former smokers.

    1. Would this data be considered observational or experimental to examine the possible effects of smoking? Why?
    2. Why did they study men and women and the different age groups separately?
    3. The lesson seems to be that you shouldn’t start smoking, but once you’ve started, don’t stop. Find some plausible explanations for this surprising relationship between quitting smoking and health. Find at least one plausible confounding factor and one plausible mediating factor that might account for part of the relationship.
    4. Conditioning on a plausible confounding factor, draw a hypothetical MC diagram showing a conditional and an unconditional relationship between quitting and health that is consistent with the findings of the study.
  3. A study investigated whether there was a higher risk of complications when women gave birth at home with the assistance of a midwife instead of giving birth in a maternity ward in a hospital. 400 women who chose to give birth at home and 2,000 women who gave birth in a hospital were studied. The table below summarizes the number of complications in each group.

    1. Find the rate of complication in each group: the home birth group and the hospital birth group.
    2. Do you think that this is an observational study or an experimental study? Why?
    3. The data suggest that it is safer (in the sense of a lower rate of complications) to give birth at home than to give birth in the hospital. Discuss whether this implies that a woman should consider giving birth at home in order to reduce her risk of complications. Identify at least one plausible confounding factor and one plausible mediating factor that could partly explain the results of the study.
    4. Choose a possible confounding factor and use a MC diagram to show how controlling for this confounding factor could reverse the direction of association between the rate of complications and the location of birth: home or hospital.
                 Complications  No Complications  Total
Home Births      20             380               400
Hospital Births  200            1800              2000
Total            220            2180              2400
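As a starting point for part 1 of the home-birth question, the counts in the table can be entered as a matrix and the complication rates computed row by row. This is only a sketch; the object names ‘births’ and ‘rates’ are made up for this example.

```r
# Counts from the table: rows are birth locations, columns are outcomes
births <- matrix(c(20, 380,
                   200, 1800),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(group = c("Home", "Hospital"),
                                 outcome = c("Complications", "No complications")))

# Complication rate = complications / total births in each group
rates <- births[, "Complications"] / rowSums(births)
print(rates)   # Home 0.05, Hospital 0.10
```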

  4. Mary (a woman) is choosing between restaurants A and B to take her friend, John (a man), out for dinner.
    Restaurant A has an average rating of 4.1 and restaurant B of 4.3. But looking at ratings by gender, among men, restaurant A has a rating of 4.0 and restaurant B of 3.8. Among women, restaurant A has a rating of 4.6 and restaurant B of 4.4. It seems that men and women separately prefer restaurant A but together they prefer restaurant B!

    1. Draw a MC diagram conditioning on the gender with restaurant on the horizontal axis and average ratings on the vertical axis to explain how this apparent contradiction could arise.
    2. Draw a causal graph describing the relationships among the three variables: restaurant, gender and ratings.
    3. Assuming that there are no other significant factors related to restaurant ratings, what kind of variable is gender in this context?
    4. Which restaurant should Mary choose? Why?
    5. Add an appropriate ‘do-line’ to the MC diagram.
  5. Fedor is choosing between restaurants A and B to take his friend, Jaspreet, out for dinner. Restaurant A has an average rating of 4.1 and restaurant B of 4.3. Each restaurant has six waiters. Restaurant A has one good waiter and 5 bad ones. Restaurant B has 5 good waiters and 1 bad one. Customers can’t choose the waiter they get.

    Ratings among customers getting a good waiter are 4.6 at restaurant A and 4.4 at restaurant B. Ratings among customers getting a bad waiter are 4.0 at restaurant A and 3.8 at restaurant B. So, overall, customers prefer restaurant B but among those getting a bad waiter, they prefer restaurant A and among those getting a good waiter they also prefer restaurant A! In other words, the bad waiters at restaurant A are better than the bad waiter at restaurant B and the good waiter at restaurant A is better than the good waiters at restaurant B. Suppose that you can’t choose your waiter and that your chances of getting each type of waiter at either restaurant are similar to the proportions in the ratings.

    1. Draw a MC diagram conditioning on the quality of the waiter with restaurant on the horizontal axis and average ratings on the vertical axis to explain how this apparent contradiction could arise.
    2. Draw a causal graph describing the relationships among the three variables: restaurant, quality of waiter and ratings.
    3. Assuming that there are no other significant factors related to restaurant ratings, what kind of variable is the quality of the waiter in this context?
    4. Which restaurant should Fedor choose? Why?
    5. Add an appropriate ‘do-line’ to the MC diagram.
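The arithmetic behind the two restaurant questions can be checked with a weighted average. The sketch below uses the waiter version and assumes that every waiter serves the same number of customers, so that the weights are the numbers of good and bad waiters; the object names are made up for this example. The same numbers reproduce the gender version if one in six of restaurant A's raters and five in six of restaurant B's raters are women.

```r
# Average ratings within each waiter type (from the question)
rating <- matrix(c(4.6, 4.0,
                   4.4, 3.8),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(restaurant = c("A", "B"),
                                 waiter = c("good", "bad")))

# Number of waiters of each type; assumed proportional to customers served
nWaiters <- matrix(c(1, 5,
                     5, 1),
                   nrow = 2, byrow = TRUE,
                   dimnames = dimnames(rating))

# Overall rating = average of the conditional ratings weighted by the mix
overall <- rowSums(rating * nWaiters) / rowSums(nWaiters)
print(overall)   # A 4.1, B 4.3
```

Restaurant A is rated higher within each waiter type, yet lower overall, because its mix is dominated by bad waiters.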

16 Paradoxes, Fallacies and One Correct Statement

Here are 23 statements or questions about statistics, mainly about regression, for you to ponder and comment on.

Is each statement true, false or does its truth depend on unstated conditions? In the last case, on what conditions does it depend, and how?

Warning: Every statement but one expresses a widely held fallacy or half truth about statistics. Even professional statisticians are fooled by many of these statements. Many of the ideas expressed in these statements may be reasonable when applied to certain problems but can lead to serious modelling errors when applied in the wrong context. It is very important to understand the importance of the context in which statistical modelling takes place. At the very least you must always consider:

  • the nature of the questions you want to address: Are they causal, predictive or descriptive?
  • the nature of the data: Was there random assignment, or random selection? Can the data be considered to be representative of some population or process?
  • the consequences of different types of errors.

16.1 Health and Weight

Suppose you are studying how some measure of health is related to weight. You are looking at a multiple regression of health on height and weight but you observe that what you are really interested in is the relationship between health and weight relative to height. For simplicity suppose that the residual of the regression of weight on height is a meaningful measure of relative weight and that health is related linearly to this variable. What you should do is to compute the residuals of weight on height and replace weight in the model with this new variable. The resulting coefficient of ‘excess weight’ will give a better estimate of the effect of excess weight. True? False? It depends?

16.2 Measurement error: confounding factor or target variable

Suppose you are studying observational data on the relationship between Health and Coffee (measured in grams of caffeine consumed per day). Suppose you want to control for a possible confounding factor ‘Stress’. In this kind of study it is more important to make sure that you measure coffee consumption accurately than it is to make sure that you measure ‘stress’ accurately. True? False? It depends?

16.3 Biases in class size surveys?

A survey of students at York reveals that the average class size of the classes they attend is 130. A survey of faculty shows an average class size of 30. The students must be exaggerating their class sizes or the faculty under-reporting. True? False? It depends?
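To explore this statement, it helps to compute both averages from a single list of class sizes. The sizes below are entirely hypothetical and are not meant to reproduce the figures 130 and 30. The sketch assumes that each class is reported once by its instructor and once by every student enrolled in it.

```r
# Hypothetical class sizes: 95 small classes and 5 large lectures
sizes <- c(rep(20, 95), rep(220, 5))

# Faculty view: every class is counted once
facultyAvg <- mean(sizes)

# Student view: a class of size s is reported by s students,
# so each class is weighted by its own size
studentAvg <- sum(sizes^2) / sum(sizes)

print(c(faculty = facultyAvg, students = studentAvg))
```

Both groups report truthfully in this example, yet the two averages differ by a factor of about three.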

16.4 Biases in wealth surveys?

A survey of Canadian families yielded average ‘equity’ (i.e. total owned in real estate, bonds, stocks, etc. minus total owed) of $48,000. Aggregate government data of the total equity in the Canadian population shows that this figure must be much larger, in fact more than twice as large. This shows that respondents must tend to dramatically underreport their equity. True? False? It depends?

16.5 Dropping variables to simplify a model

In a multiple regression of Y on three predictors, X1, X2 and X3, suppose the coefficients of each of X2 and X3, are not significant. It is safe to drop these two variables and perform a regression on X1 alone. Dropping a number of variables with non-significant coefficients results in a model that fits almost as well as the original model. True? False? It depends?

16.6 Comparing two groups

If smoking really is bad for your health, you expect that a comparison of a group of people who have quit smoking with a group that has continued to smoke will reveal that the group quitting is, on average, healthier than the group that continued. True? False? It depends?

16.7 Dropping a non-significant predictor

In a multiple regression, if you drop a predictor whose effect is not significant, the coefficients of the other predictors should not change very much, nor should the p-values associated with them. True? False? It depends?

16.8 Interpretation of MLE

We use maximum likelihood to estimate parameters because the parameter value with the highest likelihood is the value that has the highest probability of being correct. ‘Likelihood’ is just a different word for ‘probability’. True? False? It depends?

16.9 Forward stepwise or backward stepwise regression?

If you want to reduce the number of predictor variables in a model, forward stepwise regression will do a good job of identifying which variables you should keep. What about backward stepwise regression? True? False? It depends?

16.10 Non-significant interaction

In a regression model with two predictors X1 and X2, and an interaction term between the two predictors, it is dangerous to interpret the ‘main’ effects of X1 and X2 without further qualification. However, it is okay to do so if the interaction term is not significant. True? False? It depends?

16.11 Comparing best and worst outcomes

In a model to assess the effect of a number of treatments on some outcome, we can estimate the difference between the best treatment and the worst treatment by using the difference in the mean outcomes. True? False? It depends?

16.12 Interaction and collinearity

In general we don’t need to worry about interactions between variables unless there is a correlation between them. True? False? It depends?

16.13 Confounding factor and association with predictor

In general, a variable cannot be a ‘confounding factor’ for the effect of another variable unless they are associated with each other. True? False? It depends?

16.14 Interaction implies correlation

If two variables have a strong interaction, this implies a strong correlation. True? False? It depends?

16.15 Imputing a missing grade

You need to impute a mid-term grade for a student who missed the mid-term test with a valid excuse. The best way to predict the missing mid-term grade is to perform a regression of mid-term grades on final grades using the data from students who wrote both and use the predicted mid-term grade based on the final exam grade of the student who missed the mid-term test. True? False? It depends?

Discuss the relative consequences of using

  1. the predicted mid-term grade based on the regression equation of the mid-term grades on the final grades,
  2. the student’s raw grade on the final,
  3. the mid-term score that has the same z-score as the student’s z-score on the final, and
  4. the mid-term grade that, according to the regression equation of the final on the mid-term, would have predicted the student’s actual final grade.

If you had to choose one of these four, which would you choose and why?

If you have a better solution for the previous problem, what is it and why?

16.16 p-values and error rate in publications

If all scientists used a p-value of 0.05 to decide which results to publish, that would ensure that at most 5% of published results would be incorrect. True? False? It depends?

16.17 Significance with added variables

If a variable X1 is not significant in a regression of Y on X1 then it will be even less significant in a regression of Y on both X1 and X2 where X2 is another variable. This follows since there is less variability left to explain in a model that already includes X2 than in a model that does not. True? False? It depends?

16.18 Extra sums of squares

Consider this frequently used Venn diagram representing sums of squares in Analysis of Variance:

The diagram shows how two predictor variables X1 and X2 predict a response Y by displaying variances and shared variances. In the decompositions: \[ \begin{aligned} SSTO & = SS(X1,X2) + SSE \\ & = SS(X1) + SS(X2|X1) + SSE \\ & = SS(X2) + SS(X1|X2) + SSE \end{aligned} \] the last two lines correspond to Type I sequential sums of squares adding X1 then X2 in the second line, and adding X2 then X1 in the third and last line.

  • \(SSTO\) is represented by the area of the Y circle,
  • \(SS(X1)\) by the area of regions A and C
  • \(SS(X2|X1)\) by region B
  • \(SS(X2)\) by regions B and C
  • \(SS(X1|X2)\) by region A
  • \(SS(X1, X2)\) by regions A, B and C
  • \(SSE\) by the portion of the Y circle outside regions A, B and C

You can use the Venn diagram to prove that \(SS(X1)\) must be greater than \(SS(X1|X2)\), i.e. that a predictor added alone explains more variance than it does when it is added after having added another predictor variable. True? False? It depends?
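One way to test this claim is to compute both sequential decompositions on a small data set. The example below is artificial: the vectors ‘u’, ‘v’ and ‘w’ are hypothetical, mutually orthogonal, mean-zero vectors chosen so that the sums of squares come out as round numbers. It is a sketch for experimentation, not a general proof.

```r
# Three mutually orthogonal, mean-zero building blocks (hypothetical data)
u <- rep(c(1, 1, -1, -1), 2)
v <- rep(c(1, -1, 1, -1), 2)
w <- c(1, -1, -1, 1, -1, 1, 1, -1)

x2 <- u               # X2 is unrelated to Y on its own
x1 <- 3 * u + v       # X1 is highly correlated with X2
y  <- v + 0.5 * w     # Y depends on the part of X1 not shared with X2

# Type I (sequential) sums of squares in the two orders
ssX1first <- anova(lm(y ~ x1 + x2))[["Sum Sq"]]   # SS(X1), SS(X2|X1), SSE
ssX2first <- anova(lm(y ~ x2 + x1))[["Sum Sq"]]   # SS(X2), SS(X1|X2), SSE

print(round(ssX1first, 3))   # 0.8 7.2 2.0
print(round(ssX2first, 3))   # 0.0 8.0 2.0
```

Here \(SS(X1) = 0.8\) while \(SS(X1|X2) = 8\), which no arrangement of non-negative areas in the Venn diagram can represent.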

16.19 Dropping redundant variables

The best way to deal with high collinearity between predictors is to drop predictors that are not significant. True? False? It depends?

16.20 AIC

AIC is useful to identify the best model among a set of models that you have selected after exploring your data if the models are not nested within each other. True? False? It depends?

16.21 Comparing groups

A recent study showed that people who sleep more than 9 hours per night on average have a higher chance of premature death than those who sleep fewer than 9 hours. This does not necessarily mean that sleeping more than 9 hours on average is bad for your health because the sample might not have been representative. True? False? It depends?

16.22 Error rate and posterior probability

Suppose a screening test for steroid drug use has a specificity of 95% and a sensitivity of 95%. This means that the test is incorrect 5% of the time. Therefore, if John takes the test and the result is ‘positive’ (i.e. the test indicates that John takes steroid drugs) the probability that he does not take steroid drugs is only 5%. True? False? It depends?
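The probability asked about here follows from Bayes' rule once a prevalence of steroid use is assumed. The statement gives no prevalence, so the values used below are hypothetical, as is the function name.

```r
# P(does not use steroids | positive test), by Bayes' rule
probNonUserGivenPositive <- function(prevalence, sens = 0.95, spec = 0.95) {
  truePos  <- sens * prevalence               # users who test positive
  falsePos <- (1 - spec) * (1 - prevalence)   # non-users who test positive
  falsePos / (truePos + falsePos)
}

# Hypothetical prevalences of steroid use among those tested
prevalence <- c(0.01, 0.10, 0.50)
print(round(probNonUserGivenPositive(prevalence), 3))   # 0.839 0.321 0.050
```

With these sensitivity and specificity values, the 5% figure in the statement is recovered only when half of the people tested are users.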

16.23 Importance of predictors

In a multiple regression, the predictor that is most important is the one with the smallest p-value. True? False? It depends?

Questions to be classified

References

Bergland, Christopher. 2019. “Rethinking P-Values: Is "Statistical Significance" Useless?” Psychology Today. March 22, 2019. https://www.psychologytoday.com/blog/the-athletes-way/201903/rethinking-p-values-is-statistical-significance-useless.
Ioannidis, John P A. 2005. “Why Most Published Research Findings Are False.” PLoS Medicine 2 (8): 6. https://journals.plos.org/plosmedicine/article/file?id=10.1371%2Fjournal.pmed.0020124&type=printable.
Ioannidis, John P. A. 2019. “What Have We (Not) Learnt from Millions of Scientific Papers with P Values?” The American Statistician 73 (March): 20–25. https://doi.org/10.1080/00031305.2018.1447512.
Oliver, John, dir. 2016. Last Week Tonight with John Oliver: Scientific Studies. HBO. https://www.youtube.com/watch?v=0Rnq1NpHdmw.
Schervish, Mark J. 1996. “P Values: What They Are and What They Are Not.” The American Statistician 50 (3): 203–6.
Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA Statement on p -Values: Context, Process, and Purpose.” The American Statistician 70 (2): 129–33. https://doi.org/10.1080/00031305.2016.1154108.
Wasserstein, Ronald L., Allen L. Schirm, and Nicole A. Lazar. 2019. “Moving to a World Beyond ‘p < 0.05’.” The American Statistician 73 (March): 1–19. https://doi.org/10.1080/00031305.2019.1583913.
Wickham, Hadley. 2019. Advanced R. 2nd ed. CRC Press. https://adv-r.hadley.nz/.