These questions have been inspired by many sources.

1 The meaning of \(p\)-values

The purpose of this assignment is to explore the meaning of \(p\)-values. Before starting, stop and reflect on what it means for an experiment to ‘achieve’ a \(p\)-value of 0.049. What meaning can we give to the quantity ‘0.049’? How is it related to the probability that the null hypothesis is correct?

To keep things very simple, suppose you want to test \(H_0: \mu =0\) versus \(H_1: \mu \neq 0\) and you are designing an experiment in which you plan to take a sample \(X_1, X_2, ... , X_n\) of iid \(\textrm{N}(\mu,1)\) random variables, i.e. the variance is known to be equal to 1. You plan to use the usual test based on \(\bar{X}_n\), rejecting \(H_0\) for values of \(\bar{X}_n\) that are far from 0.

An applied example would be testing for a change in value of a response when all subjects are submitted to the same conditions and the measurement error of the response is known. In that example \(X_i\) would be the ‘gain score’, i.e. post-test response minus the pre-test response exhibited by the \(i\)th subject.
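The test described above can be sketched in R as follows. This is only a minimal illustration under the stated assumptions (known variance 1, two-sided alternative); the helper name `z_test_p`, the seed, and the simulated sample are hypothetical and are not part of the assignment.

```r
# Two-sided p-value for H0: mu = 0 when the X_i are iid N(mu, 1).
# `z_test_p` is a hypothetical helper name used only for illustration.
z_test_p <- function(x) {
  n <- length(x)
  z <- sqrt(n) * mean(x)   # under H0, z ~ N(0, 1) because sigma = 1
  2 * pnorm(-abs(z))       # probability of a sample mean at least this far from 0
}

set.seed(1)                          # arbitrary seed, for reproducibility only
x <- rnorm(20, mean = 0.5, sd = 1)   # simulated sample with an assumed mu of 0.5
z_test_p(x)
```

Wrapping the calculation in a function makes it easy to repeat over many simulated samples, which is one way to approach the questions below.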

Let the selected probability of Type I error be \(\alpha = 0.05\). Consider collecting samples of size \(n\) where \(n\) equals one of the following:

\(i\) \(n\)
1 10
2 20
3 100
4 200
5 1,000
6 10,000

Consider using the following values of \(\mu_j\):

\(j\) \(\mu_j\) Cohen’s terminology for effect size: \(\mu_j/\sigma\)
1 0.2 small effect size
2 0.5 medium effect size
3 0.8 large effect size
4 1 very large effect size
5 5 huge effect size
  1. What is the probability that \(p \le 0.05\) if \(H_0: \mu = 0\) is true?
  2. What is the probability that \(p \le 0.05\) if \(\mu = \mu_j\)?
  3. What is the power of this test if \(\mu = \mu_j\)?
  4. Suppose that you collect the data and that the observed \(p\)-value is 0.049. What can you say about the probability that \(H_0\) is true?
  5. Suppose that, before running the experiment, you were willing to give \(H_0\) and \(H_1: \mu = \mu_j\) equal probability. What is the probability that \(H_0\) is true given that you have performed the experiment and obtained \(p = 0.049\)?
  6. Hypothesis testing is often presented as a process that parallels that of determining guilt in a criminal trial. We start with a presumption of innocence, i.e. that \(H_0\) is true. We then hear evidence and consider whether it contradicts the presumption of innocence ‘beyond a reasonable doubt.’
    Suppose we quantify the presumption of innocence to mean that \(P(H_0) \ge .95\). How small an observed \(p\)-value do you need to obtain in order to ‘flip’ the presumption of innocence to ‘guilt beyond a reasonable doubt’ if that is defined as \(P(H_0 | \mathrm{data}) \le .05\)?
  7. What \(p\)-value would we need if the presumption of innocence and guilt beyond a reasonable doubt correspond to \(P(H_0) \ge 0.999\) and \(P(H_0 | \mathrm{data}) \le 0.001\)?
  8. Courts have often adopted a criterion of \(p < 0.05\) in imitation of the common practice among many researchers. Comment on the possible consequences.
  9. Have a look at this xkcd cartoon. How does the Bayesian statistician derive a probability to make a decision in this example? Show the details.

To delve more deeply into these issues you can read (wassersteinASAStatementValues2016?) and (wassersteinMovingWorld052019?). Concerns about \(p\)-values have been around for a long time, see (schervishValuesWhatThey1996?). For a short overview see (berglandRethinkingPValuesStatistical2019?). Two key influential and erstwhile controversial papers are by John Ioannidis: (ioannidisWhyMostPublished2005?), (ioannidisWhatHaveWe2019?).

For an entertaining take on related issues see (oliverLastWeekTonight2016?) (warning: contains strong language and political irony that many could consider offensive – watch at your own risk!).

2 Merge: relational data base operations 1

Let
d1 <- data.frame(id = c('a','a','b','c'), grade = c(1,2,1,3))
d2 <- data.frame(id = c('a','c','c','d'), year = c(3,1,3,4)) 

Describe the differences between the outputs of the following commands:

  1. merge(d1,d2)
  2. merge(d1,d2, all.x = TRUE)
  3. merge(d1,d2, all.y = TRUE)
  4. merge(d1,d2, all = TRUE)

3 Merge: Concatenating rows of slightly different data frames

Two research assistants have collected data for a study. The two RAs worked with different subjects, and each gathered their data into a spreadsheet. All of the important variables have the same names and the same definitions, but each RA has also collected data on a few variables unique to that RA. Also, the order of the columns is different in the two spreadsheets.

  1. Create two small data frames to illustrate this kind of situation.
  2. How could you use ‘merge’ to easily concatenate the two data frames by rows, keeping each distinct row in the original data frames and keeping the unique variable names, with values filled with NAs for subjects for whom the value was not present in the data?

4 Merge: relational data base operations 2

Let
d1 <- data.frame(id = c('a','a','b','c'), grade = c(1,2,1,3))
d2 <- data.frame(id = c('a','c','c','d'), year = c(3,1,3,4)) 

Find out what the following terms mean and show how to achieve each operation using ‘merge’. Hint: the last two operations are much easier if you first create a new variable in each data frame to serve as a ‘key’ (i.e. the argument of ‘by’ in the call to ‘merge’). (Note: the ‘key’ is the set of variables used to match the rows of the two data frames. These are either the variable names provided as arguments to the ‘by’ parameter or, by default, the intersection of the vectors of variable names in each data frame.)

  1. inner join
  2. outer join
  3. left join
  4. right join
  5. cross join
  6. concatenation of rows
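As a small sketch of the note on keys above: in base R's ‘merge’ the key is passed through the ‘by’ argument, whose default is the set of names common to both data frames. The object names `default_key`, `m_default` and `m_explicit` are introduced here only for illustration.

```r
d1 <- data.frame(id = c('a','a','b','c'), grade = c(1,2,1,3))
d2 <- data.frame(id = c('a','c','c','d'), year = c(3,1,3,4))

# Default key: the variable names that appear in both data frames
default_key <- intersect(names(d1), names(d2))
default_key

# Supplying the key explicitly should match the default behaviour
m_default  <- merge(d1, d2)
m_explicit <- merge(d1, d2, by = "id")
identical(m_default, m_explicit)
```

Being explicit about ‘by’ is a reasonable habit when the data frames share variable names that are not meant to be part of the key.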

5 Answering questions with data

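Use the ‘Vocab’ data set, available once the ‘car’ package is loaded (in recent versions of ‘car’ the data are supplied by the companion ‘carData’ package). It records a vocabulary score for over 30,000 subjects tested over the years between 1974 and 2016.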

Explore the following questions. Use appropriate tables and graphs to explain your findings.

  1. Consider the distribution of education. Are there any salient features for this distribution?
  2. What can you say about any trends in vocabulary scores over time? Do the trends, if any, appear to differ by gender?
  3. What can you say about any trends in vocabulary scores over time when you adjust for education? What difference does it make whether you adjust for education or you don’t? What is the difference in the meaning of changes in vocabulary score whether you adjust for education or not? Do the trends, if any, appear to differ by gender?
  4. What can you say about any trends in education levels over time? Do the trends, if any, appear to differ by gender?
  5. What can you say about male/female differences in vocabulary scores when you adjust for education? Is the relationship constant or changing over time? If it is changing, how can you describe the nature of the change?
  6. Studying gaps and trends: For each of the following questions, fit a suitable model and explore the question using a suitable Wald test.
    1. In the last ten years of the study, is there evidence that vocabulary scores are increasing among men?
    2. In the last ten years of the study, is there evidence that vocabulary scores are increasing among women?
    3. In the last ten years of the study, is there evidence that vocabulary scores are increasing at a different rate among men than among women?
    4. Repeat 1 for level of education.
    5. Repeat 2 for level of education.
    6. Repeat 3 for level of education.
    7. Repeat 1 for level of vocabulary adjusted for education.
    8. Repeat 2 for level of vocabulary adjusted for education.
    9. Repeat 3 for level of vocabulary adjusted for education.
    10. Repeat 1 for the first ten years.
    11. Repeat 2 for the first ten years.
    12. Repeat 3 for the first ten years.
    13. Repeat 4 for the first ten years.
    14. Repeat 5 for the first ten years.
    15. Repeat 6 for the first ten years.
    16. Repeat 7 for the first ten years.
    17. Repeat 8 for the first ten years.
    18. Repeat 9 for the first ten years.
    19. Repeat 1 comparing the last 10 years with the first 10 years.
    20. Repeat 2 comparing the last 10 years with the first 10 years.
    21. Repeat 3 comparing the last 10 years with the first 10 years.
    22. Repeat 4 comparing the last 10 years with the first 10 years.
    23. Repeat 5 comparing the last 10 years with the first 10 years.
    24. Repeat 6 comparing the last 10 years with the first 10 years.
    25. Repeat 7 comparing the last 10 years with the first 10 years.
    26. Repeat 8 comparing the last 10 years with the first 10 years.
    27. Repeat 9 comparing the last 10 years with the first 10 years.

6 Questions on factors in R

Some of these questions illustrate important potential pitfalls in using factor variables. As a result, some developers eschew them and prefer to work with character variables as much as possible. Factors, however, are invaluable for many statistical applications since they allow the creation of different orderings of the values in a character vector, which is often important for statistical modeling and for graphics.
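A minimal sketch of the point about orderings, using made-up data (the vector `size` and the chosen level order are hypothetical): a factor lets you fix the order of the categories independently of their alphabetical order.

```r
# Hypothetical example data, not taken from the assignment
size <- c("medium", "small", "large", "small")

f_default <- factor(size)   # levels are sorted alphabetically by default
f_ordered <- factor(size, levels = c("small", "medium", "large"))

levels(f_default)
levels(f_ordered)
table(f_ordered)            # counts appear in the order of the chosen levels
```

Tables, plots and model coefficients generally follow the order of the levels, which is why controlling that order matters.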

  1. Describe the difference, if any, between the following, and explain why it arises (note that the behaviour of the ‘factor’ function has changed over time):

    1. factor(c(1, 2, 10))
    2. factor(as.character(c(1, 2, 10)))
  2. Suppose x <- factor(c(1, 2, 10)). Write a function that would allow you to recover the original values, 1, 2 and 10, from a factor like x. Why does as.numeric(x) not work?

  3. Let f1 <- factor(c('a','b','c')) and f2 <- factor(c('A','B','C')). What’s the matter with the result of the expression

    ifelse(f1 == 'a', f1, f2)
    

    Explain why it fails to produce characters as a result. Fix it so it does.

  4. Indexing with factors: Consider

    df <- data.frame(c = 1:3, a = 11:13, b = 21:23)
    fac <- factor(c('a','b','c'))
    df[[fac[1]]]
    df[[as.character(fac[1])]]
    

    Explain why the last two lines of the code above produce different results.

  5. What happens to a factor when you modify its levels with the levels<- replacement function? Give examples to illustrate your answer.

7 Questions on the R language

  1. Describe the main differences between the four taxonomies of objects in R: typeof, mode, storage.mode and class.

  2. Let x <- letters. What are the class and mode of x? Let y <- as.factor(x). What are the class and mode of y? Why does this make sense … or not?

  3. A factor is a kind of object used to represent character variables for statistical analysis. Add a factor to the list used to display the classifications of atomic objects above. Play with some factors and use str to explain the curious values returned by typeof, mode, storage.mode and class for a factor.

  4. What makes is.vector() and is.numeric() fundamentally different to is.list() and is.character()? From Wickham: Advanced R

  5. Why is 1 == "1" true? Why is -1 < FALSE true? Why is "one" < 2 false? From Wickham: Advanced R

  6. Why is the default missing value, NA, a logical vector? What’s special about logical vectors? (Hint: think about c(FALSE, NA_character_).) From Wickham: Advanced R

  7. Does -1:2 produce the same result as 0-1:2? Why or why not?

  8. Which of the following assignments use valid names?

     
    a_very_long_name <- 0
    _tmp <- 2
    .tmp <- 2
    ..val <- 3
    .2regression <- TRUE
    ._2_val <- 'a'
    
  9. Write an Rmarkdown script that illustrates the use of at least 5 functions from the subgroup ‘Ordering and tabulating’ of the group ‘Statistics’ at http://adv-r.had.co.nz/Vocabulary.html

  10. Write an Rmarkdown script that illustrates the use of at least 5 functions from the subgroup ‘Linear models’ of the group ‘Statistics’ at http://adv-r.had.co.nz/Vocabulary.html

  11. Write an Rmarkdown script that illustrates the use of at least 5 functions from the subgroup ‘Miscellaneous tests’ of the group ‘Statistics’ at http://adv-r.had.co.nz/Vocabulary.html

  12. Write an Rmarkdown script that illustrates the use of at least 5 functions from the subgroup ‘Random variables’ of the group ‘Statistics’ at http://adv-r.had.co.nz/Vocabulary.html. Include interesting graphs.

  13. Write an Rmarkdown script that illustrates the use of at least 5 functions from the subgroup ‘Matrix algebra’ of the group ‘Statistics’ at http://adv-r.had.co.nz/Vocabulary.html

8 Questions on programming in R

Many of these questions ask you to write a function to accomplish some goal, instead of just requiring an expression. The advantage of writing a function is that you can easily test your code by trying your function on extreme or impossible values.
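As a rough sketch of what testing on extreme or impossible values can look like, consider the following. The function `range_width` is hypothetical and is not one of the exercises; it is used only to show the pattern of trying ordinary, empty, missing and wrongly typed input.

```r
# `range_width` is a hypothetical function used only for illustration
range_width <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")
  if (length(x) == 0 || all(is.na(x))) return(NA_real_)
  max(x, na.rm = TRUE) - min(x, na.rm = TRUE)
}

range_width(c(3, 9, 4))    # ordinary input
range_width(numeric(0))    # empty input
range_width(c(NA, 2, 5))   # input with a missing value
range_width(7)             # a single value
try(range_width("a"))      # wrong type: the function signals an error
```

Deciding in advance what the function should do in each of these cases (return NA, return 0, or stop with an error) is part of the design, and writing the calls down makes that decision explicit.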

  1. What output will the following R script produce? Explain briefly why.
    x <- c(TRUE, FALSE, 0L)
    typeof(x)
  2. What output will the following R script produce? Explain briefly why.
    TRUE | NA
  3. Let x be defined as:
    x <- c('0','10','5','20','15','10','0','5')
    Write an R function that would turn x into a factor whose ordering corresponds to the numerical ordering of x.
  4. In R, let x <- 1:5. What output would x[NA] produce? What output would x[NA_real_] produce? Describe the reason for the difference, if any.
  5. In R, describe the result of subsetting a vector with positive integers, with negative integers, with a logical vector, or with a character vector.
  6. In R, what’s the difference between [, [[, and $ when applied to a list?
  7. In R, when subsetting with [, when should you use drop = FALSE? Include arrays and factors in your discussion.
  8. In R, if x is a matrix, what does x[] <- 0 do? How is it different from x <- 0?
  9. In R, how can you use a named vector to relabel a categorical variable?
  10. In R, if mtcars is a data frame, why does mtcars[1:20] return an error? How does it differ from the similar mtcars[1:20, ]?
  11. Fix each of the following common data frame subsetting errors in R:
    mtcars[mtcars$cyl = 4, ]
    mtcars[-1:4, ]
    mtcars[mtcars$cyl <= 5]
    mtcars[mtcars$cyl == 4 | 6, ]
  12. In R, if df is a data frame, what does df[is.na(df)] <- 0 do? How does it work?
  13. Create the vector (20,19,…,2,1) in R.
  14. Create the vector (1,2,3,…,19,20,19,18,…,2,1) in R.
  15. Create the vector (4,4,…,4,6,6,…,6,3,3,…,3) in R, where there are 10 occurrences of 4, 20 of 6 and 30 of 3.
  16. Write a function in R to calculate \(\sum_{i=1}^{n}(i^3+4i^2)\). Test it, including with ‘incorrect’ input.
  17. Generate in R a vector of 30 labels: ‘label 1’, ‘label 2’, … ‘label 30’
  18. Let y <- sample(1000, 30, replace = TRUE). Write functions in R to do the following. Test each function.
    1. Determine how many elements of y are multiples of 2.
    2. Determine how many elements of y are equal to 7 mod 13.
    3. Determine how many elements of y are within 200 of the maximum value.
    4. Determine how many elements of y are less than the previous element.
    5. Determine how many elements of y are an exact square.
    6. Determine how many elements of y are prime.
  19. Suppose data for a variable in R representing dollars has been entered in a variety of formats: ‘$1,000.00’,‘1000.00’,‘$1’. Write a function in R that transforms the variable to a numeric variable in dollars to the nearest cent.
  20. Write a function in R that takes a character vector and collapses multiple adjoining blanks in each element to a single blank.
  21. Write a function in R that accepts a data frame as input and returns a data frame in which every variable whose name starts with the letter ‘X’ and ends in a number has been removed.
  22. Create a 6 by 10 matrix of random integers in R as follows:
    set.seed(75)
    m <- matrix(sample(10, 60, replace = TRUE), nrow = 6)
  23. Write a function to find the number of entries in each row of a matrix that are greater than 4.
  24. (continued from the previous question) Write a function to find how many rows have exactly two instances of the number 7.
  25. Describe the difference in R between paste(x, y, sep = ':') and paste(x, y, collapse = ':'). Illustrate.
  26. Using the hs data set in the spida2 package, create a plot with two panels showing histograms displaying the distribution of school sizes in the Public and in the Catholic sectors. Use the functions capply and up in the spida2 package. You may also use any other approach to compare with the use of capply and up.
  27. Using the hs data set in the spida2 package, create a plot with two panels showing histograms displaying the distribution of sample sizes in each school in the Public and in the Catholic sectors. Use the functions capply and up in the spida2 package. You may also use any other approach to compare with the use of capply and up.
  28. Using the hs data set in the spida2 package, create a plot with two panels showing scatterplots displaying the relationship between mean mathach and mean ses in each school in the Public and in the Catholic sectors. Explore reasonable transformations and regression lines: linear and non-parametric in the plots. Use the functions capply and up in the spida2 package. You may also use any other approach to compare with the use of capply and up.
  29. Describe the difference in R between a generic function and a method.
  30. [Warwick] Create the vectors:
    1. (1,2,3,…,19,20)
    2. (20,19,…,2,1)
    3. (1,2,3,…,19,20,19,18,…,2,1)
    4. (4,6,3) and assign it to the name ‘tmp’
    5. (4,6,3,4,6,3,…,4,6,3) where there are 10 occurrences of 4 (Hint: ?rep)
    6. (4,6,3,4,6,3,…,4,6,3,4) where there are 11 occurrences of 4 and 10 of 6 and 3
    7. (4,4,…,4,6,6,…,6,3,3,…,3) where there are 10 occurrences of 4, 20 of 6 and 30 of 3.
  31. [Warwick] Create the vector of the values of \(e^x \cos(x)\) at \(x=3, 3.1, 3.2, ..., 6\).
  32. [Warwick] Create the following vectors:
    1. \((0.1^3 0.2^1, 0.1^6 0.2^4, ... , 0.1^{36} 0.2^{34} )\)
    2. \(\left({2,\frac{2^2}{2},\frac{2^3}{3},...,\frac{2^{25}}{25}}\right)\)
  33. [Warwick] Calculate the following:
    1. \(\sum_{i=10}^{100} (i^3 + 4i^2)\)
    2. \(\sum_{i=1}^{25} \left({\frac{2^i}{i} + \frac{3^i}{i^2}}\right)\)
  34. [Warwick] Use the function ‘paste’ to create the following character vectors of length 30:
    1. (“label 1”, “label 2”, … , “label 30”). Note that there is a single space between label and the number following.
    2. (“fn1”, “fn2”, …, “fn30”). In this case there is no space.
  35. [Warwick] Execute the following lines which create two vectors of random integers which are chosen with replacement from the integers 0, 1, …, 999. Both vectors have length 250.
    set.seed(50)
    xVec <- sample(0:999, 250, replace = TRUE)
    yVec <- sample(0:999, 250, replace = TRUE)
    Suppose \(\mathbf{x} = (x_1, x_2, ..., x_n)\) denotes the vector xVec and similarly for \(\mathbf{y}\).
    1. Write a function that returns the vector \((y_2 - x_1, ..., y_n - x_{n-1})\)
    2. Write a function that returns the vector \(\left({\frac{\sin(y_1)}{\cos(x_2)},\frac{\sin(y_2)}{\cos(x_3)},...,\frac{\sin(y_{n-1})}{\cos(x_n)} }\right)\)
    3. Write a function that returns the vector \((x_1 + 2x_2 - x_3, x_2 + 2 x_3 - x_4, ..., x_{n-2} + 2x_{n-1} - x_n)\)
    4. Write a function that calculates \(\sum_{i=1}^{n-1}\left.\frac{e^{-x_{i+1}}}{x_i + 10}\right.\)
  36. [Warwick] This question uses the vectors xVec and yVec created in the previous question and the functions sort, order, mean, sqrt, sum and abs.
    1. Write a function that returns the values in yVec which are > 100.
    2. Write a function that returns the index positions in yVec of the values which are > 600.
    3. Write a function that returns the values in xVec which correspond to the values in yVec which are > 600.
    4. Create the vector \(\left( \left|x_1-\bar{\mathbf{x}}\right|^{1/2}, \left|x_2-\bar{\mathbf{x}}\right|^{1/2},..., \left|x_n-\bar{\mathbf{x}}\right|^{1/2}\right)\)
    5. Write a function that returns how many values in ‘yVec’ are within 200 of the maximum value of the terms in ‘yVec’.
    6. Write a function that sorts the numbers in the vector ‘xVec’ in the order of increasing values in ‘yVec’.
    7. Write a function that returns how many numbers in ‘xVec’ are divisible by 2.
    8. Write a function that returns the elements in ‘yVec’ at index positions 1,4,7,10,13,…
  37. [Warwick] By using the function cumprod or otherwise, write a function that calculates \[ 1 + \frac{2}{3} +\frac{2}{3}\frac{4}{5} + \frac{2}{3}\frac{4}{5}\frac{6}{7}+...+\frac{2}{3}\frac{4}{5}...\frac{38}{39}\]
  38. [Regular expressions] Suppose money data for a variable has been entered in a variety of formats, e.g.
    "$1,000.00", "1000.00", "123.2$"
    Write an R function using ‘gsub’ and ‘as.numeric’ to turn these various entries into a numeric variable. Experiment with your function to make sure it works.
  39. [Regular expressions] Write a function that takes a character vector and collapses multiple adjoining blanks into a single blank.
  40. [Regular expressions] Use the file SampleClassFile.csv. One of its variables is a string that contains information about a student’s faculty and programme: whether they are in an ordinary programme or in an honours programme, and the departments of their major and minor. Write a function that uses regular expressions to create four new variables: the faculty in which a student is enrolled, whether they are in an ordinary or in an honours programme, their major programme and their minor programme if any.
  41. [Regular expressions] Suppose you have a vector of names, such as:
        Mary Jones
        Tarik Mohammed
        Smith, Jim
        Tom O'Brian
        Victor Lindquist
        Chow, Vincent
        Wong, Mary
    
    Some names are in the format ‘First Last’ and others ‘Last, First’. Write a function to extract the full names, in the format ‘Last, First’, of all the individuals whose first name is ‘Mary’.
  42. [Merging and reshaping] Use the site Gapminder.org to download at least three longitudinal variables into separate data sets. Merge the data sets into one for which each row represents one country and year and contains the values of each of the three variables you downloaded. Display how these variables change over time.
  43. [Regular expressions] Write a function that removes every variable whose name starts with the letter ‘X’ and ends in a number from a data frame.
  44. [Data] Write a function that takes a data frame and returns it with variable names in alphabetical order.
  45. [Warwick] Suppose \[\mathbf{A}= \begin{bmatrix} 1 & 1 & 1 \\ 5 & 2 & 6 \\ -1 & -1 & -3\end{bmatrix}\]
    1. Check that \(\mathbf{A}^3 = \mathbf{0}\) where \(\mathbf{0}\) is a \(3 \times 3\) matrix with every entry equal to 0.
    2. Replace the third column of \(\mathbf{A}\) by the sum of the second and third columns.
  46. [Warwick] Create the following matrix \(\mathbf{B}\) with 15 rows: \[\mathbf{B}= \begin{bmatrix} 10 & -10 & 10 \\ 10 & -10 & 10 \\ \vdots & \vdots & \vdots \\10 & -10 & 10\end{bmatrix}\] Calculate the \(3 \times 3\) matrix \(\mathbf{B}^T\mathbf{B}\). Consider: ?crossprod
  47. [Warwick] Create a \(6 \times 6\) matrix ‘matE’ with every entry equal to 0. Check what the functions ‘row’ and ‘col’ return when applied to ‘matE’. Hence create the \(6 \times 6\) matrix: \[\begin{bmatrix} 0 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 & 1 & 0 \end{bmatrix}\]
  48. [Warwick] Look at ?outer. Hence create the following patterned matrix: \[\begin{bmatrix} 0 & 1 & 2 & 3 & 4 & 5 \\ 1 & 2 & 3 & 4 & 5 & 6 \\ 2 & 3 & 4 & 5 & 6 & 7 \\ 3 & 4 & 5 & 6 & 7 & 8 \\ 4 & 5 & 6 & 7 & 8 & 9 \\ 5 & 6 & 7 & 8 & 9 & 10 \end{bmatrix}\]
  49. [Warwick] Create the following patterned matrices. In each case, your solution should make use of the special form of the matrix – this means that the solution should easily generalize to creating a larger matrix with the same structure and should not involve typing in all the entries in the matrix.
    1. \(\begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 \\ 1 & 2 & 3 & 4 & 5 & 0 \\ 2 & 3 & 4 & 5 & 0 & 1 \\ 3 & 4 & 5 & 0 & 1 & 2 \\ 4 & 5 & 0 & 1 & 2 & 3 \\ 5 & 0 & 1 & 2 & 3 & 4 \end{pmatrix}\)
    2. \(\begin{pmatrix} 0 & 5 & 4 & 3 & 2 & 1 \\ 1 & 0 & 5 & 4 & 3 & 2 \\ 2 & 1 & 0 & 5 & 4 & 3 \\ 3 & 2 & 1 & 0 & 5 & 4 \\ 4 & 3 & 2 & 1 & 0 & 5 \\ 5 & 4 & 3 & 2 & 1 & 0 \end{pmatrix}\)
  50. [Warwick] Solve the following system of linear equations in five unknowns \[\begin{eqnarray} x_1 + 2x_2 + 3x_3 + 4x_4 +5 x_5 &=& 7 \\ 2x_1 + x_2 + 2x_3 + 3x_4 +4 x_5 &=& -1 \\ 3x_1 + 2x_2 + x_3 + 2x_4 +3 x_5 &=& -3 \\ 4x_1 + 3x_2 + 2x_3 + x_4 +2 x_5 &=& 5 \\ 5x_1 + 4x_2 + 3x_3 + 2x_4 +x_5 &=& 17 \end{eqnarray}\] by considering an appropriate matrix equation \(\mathbf{A}\mathbf{x}=\mathbf{y}\).
    Make use of the special form of the matrix \(\mathbf{A}\). The method used for the solution should easily generalize to a larger set of equations where the matrix \(\mathbf{A}\) has the same structure.
  51. [Warwick] Create a \(6 \times 10\) matrix of random integers chosen from \(1,2,...,10\) by executing the following two lines of code:
    
    set.seed(75)
    aMat <- matrix(sample(10, size = 60, replace = TRUE), nrow = 6)
    
    1. Write a function to find the number of entries in each row which are greater than 4.
    2. Write a function to find which rows contain exactly two occurrences of the number seven.
    3. Find those pairs of columns whose total (over both columns) is greater than 75. The answer should be a matrix with two columns; so, for example, the row (1,2) in the output matrix means that the sum of columns 1 and 2 in the original matrix is greater than 75. Repeating a column is permitted; so, for example, the final output matrix could contain the rows (1,2),(2,1) and (2,2).
      What if repetitions are not permitted? Then, only (1,2) from (1,2), (2,1) and (2,2) would be permitted.
  52. [Warwick] Calculate:
    1. \(\sum_{i=1}^{20} \sum_{j=1}^{5} \frac{i^4}{(3+j)}\)
    2. (Hard) \(\sum_{i=1}^{20} \sum_{j=1}^{5} \frac{i^4}{(3+ij)}\)
    3. (Even harder!) \(\sum_{i=1}^{10} \sum_{j=1}^{i} \frac{i^4}{(3+ij)}\)
  53. [Warwick]
    1. Write functions ‘tmpFn1’ and ‘tmpFn2’ such that if ‘xVec’ is the vector \((x_1, x_2, ..., x_n)\), then ‘tmpFn1(xVec)’ returns the vector \((x_1,x_2^2,...,x_n^n)\) and ‘tmpFn2(xVec)’ returns the vector \(\left({x_1,\frac{x_2^2}{2},...,\frac{x_n^n}{n}}\right)\)
    2. Now write a function ‘tmpFn3’ which takes two arguments \(x\) and \(n\) where \(x\) is a single number and \(n\) is a strictly positive integer. The function should return the value of \[1 + \frac{x}{1} + \frac{x^2}{2} + \frac{x^3}{3} + ... + \frac{x^n}{n}\]
  54. [Warwick] Write a function ‘tmpFn(xVec)’ such that if ‘xVec’ is the vector \(\mathbf{x}=(x_1,...,x_n)\) then ‘tmpFn(xVec)’ returns the vector of moving averages: \[\frac{x_1 + x_2 + x_3}{3}, \frac{x_2 + x_3 + x_4}{3}, ... ,\frac{x_{n-2} + x_{n-1} + x_n}{3}\] Try out your function; for example, try ‘tmpFn( c(1:5,6:1))’.
  55. [Warwick] Consider the continuous function: \[f(x) = \begin{cases} x^2 + 2x + 3 & \quad \text{if } x < 0 \\ x+3 & \quad \text{if } 0 \le x \lt 2 \\ x^2 + 4x - 7 & \quad \text{if } 2 \le x \\ \end{cases}\] Write a function tmpFn which takes a single argument ‘xVec’. The function should return the vector of values of the function \(f(x)\) evaluated at the values in ‘xVec’.
    Hence plot the function \(f(x)\) for \(-3 \lt x \lt 3\).
  56. [Warwick] Write a function which takes a single argument which is a matrix. The function should return a matrix which is the same as the function argument but every odd number is doubled.
  57. [Warwick] Write a function which takes two arguments ‘n’ and ‘k’ which are positive integers. It should return the \(n \times n\) matrix: \[\begin{bmatrix} k & 1 & 0 & 0 & \cdots & 0 & 0 \\ 1 & k & 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & k & 1 & \cdots & 0 & 0 \\ 0 & 0 & 1 & k & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & k & 1 \\ 0 & 0 & 0 & 0 & \cdots & 1 & k \\ \end{bmatrix}\]
  58. [Warwick] Suppose an angle \(\alpha\) is given as a positive real number of degrees counting counter-clockwise from the positive horizontal axis. Write a function quadrant(alpha) which returns the quadrant, 1, 2, 3 or 4, corresponding to ‘alpha’.
  59. [Warwick]
    1. Zeller’s congruence is the formula: \[f = ([2.6m-0.2] + k + y + [y/4] + [c/4] - 2c) \mod 7\] where \([x]\) denotes the integer part of \(x\); for example \([7.5]=7\).
      Zeller’s congruence returns the day of the week \(f\) given:
      \(k =\) the day of the month,
      \(y =\) the year in the century,
      \(c =\) the first 2 digits of the year (the century number)
      \(m =\) the month number (where January is month 11 of the preceding year, February is month 12 of the preceding year, March is month 1, etc). For example, the date July 21, 1963 has \(m=5, k = 21, c=19, y = 63\); while the date February 21, 1963 has \(m=12, k=21, c=19\) and \(y=62\).
      Write a function ‘weekday(day, month, year)’ which returns the day of the week when given the numerical inputs of the day, month and year.
      Note that, with the formula as written, the value 0 for \(f\) denotes Sunday, 1 denotes Monday, etc. (July 21, 1963, a Sunday, gives \(f = 77 \mod 7 = 0\)).
    2. Does your function work if the input parameters ‘day’, ‘month’ and ‘year’ are vectors with the same length and with valid entries?
  60. [Warwick]
    1. Suppose \(x_0=1\) and \(x_1=2\) and \[x_j = x_{j-1}+\frac{2}{x_{j-1}} \qquad \text{for }j= 1,2,...\] Write a function ‘testLoop’ which takes a single argument \(n\) and returns the first \(n-1\) values of the sequence \(\{x_j\}_{j \ge 0}\), that is, the values of \(x_0, x_1, x_2, ... , x_{n-2}\).
    2. Now write a function ‘testLoop2’ which takes a single argument ‘yVec’ which is a vector. The function should return \[\sum_{j=1}^{n} e^j\] where \(n\) is the length of ‘yVec’.
  61. [Warwick] Solution of the difference equation \(x_n = r x_{n-1}(1 - x_{n-1})\) with starting values \(x_1\).
    1. Write a function ‘quadmap(start, rho, niter)’ which returns the vector \((x_1, ..., x_n)\) where \(x_k=r x_{k-1}(1 - x_{k-1})\) and
      \(\quad\)‘niter’ denotes \(n\),
      \(\quad\)‘start’ denotes \(x_1\), and
      \(\quad\)‘rho’ denotes \(r\).
      Try out the function you have written:
      • for \(r=2\) and \(0 < x_1 < 1\) you should get \(x_n \rightarrow 0.5\) as \(n \rightarrow \infty\).
      • try ‘tmp <- quadmap(start=0.95, rho=2.99, niter=500)’. Now type:
        plot(tmp, type = 'l')
        Also try ‘plot(tmp[300:500], type = 'l')’.
    2. Now write a function which determines the number of iterations needed to get \(| x_n - x_{n-1}| < 0.02\). This function has only 2 arguments: ‘start’ and ‘rho’. (For ‘start = 0.95’ and ‘rho=2.99’, the answer is 84.)
  62. [Warwick] Given a vector \((x_1, ... ,x_n)\), the sample autocorrelation of lag \(k\) is defined to be \[r_k = \frac{\sum_{i=k+1}^{n}(x_i-\bar{x})(x_{i-k}-\bar{x})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\]
    1. Write a function ‘autocor(xVec)’ which takes a single argument ‘xVec’ which is a vector and returns a list of two values: \(r_1\) and \(r_2\).
      In particular, find \(r_1\) and \(r_2\) for the vector \((2, 5, 8, ..., 53, 56)\).
    2. (Harder) Generalize the function so that it takes two arguments: the vector ‘xVec’ and an integer ‘k’ which lies between 1 and \(n-1\) where \(n\) is the length of ‘xVec’. The function should return a vector of the values \((r_0 = 1, r_1, ..., r_k)\).
      If you used a loop to answer part 2, then you need to be aware that much, much better solutions are possible. Hint: ‘sapply’.

9 Questions on selection in R

  1. Suppose
    x <- 1:5
    What is the difference between x[NA] and x[NA_integer_]? Why? Hint: It may have something to do with the recycling rule.

  2. Write a function whose input is a numerical matrix and that checks whether the matrix is a lower triangular matrix, i.e. all elements above the diagonal are 0. Hint: consider using the row and col functions.

  3. (Longitudinal data) A longitudinal data set with one row per occasion has a varying number of observations for each subject. Suppose that a variable named ‘id’ has a unique identifier for each subject. Some subjects have been measured on only one occasion and you would like to perform an analysis that excludes those subjects. Suppose that the original data frame is called ‘dd’. Write R code to create a data frame ‘ds’ that excludes the subjects that were measured only once.
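For the longitudinal-data question, one approach is to count the rows per subject and keep only the ids that occur more than once. The sketch below uses a small made-up data frame; the variable `y` and the helper name `n_obs` are hypothetical, and only `dd`, `ds` and `id` come from the question.

```r
# Hypothetical example data: subject "b" is measured only once.
dd <- data.frame(id = c("a", "a", "b", "c", "c", "c"),
                 y  = 1:6)

n_obs <- table(dd$id)                           # number of occasions per subject
ds <- dd[dd$id %in% names(n_obs)[n_obs > 1], ]  # keep subjects with 2+ occasions
ds
```

An alternative that avoids the lookup by name is `dd[ave(seq_along(dd$id), dd$id, FUN = length) > 1, ]`, which attaches the per-subject count to every row before filtering.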

10 Questions on data frames and data manipulation in R

  1. Let
    d1 <- data.frame(id = c('a','a','b','c'), grade = c(1,2,1,3))
    d2 <- data.frame(id = c('a','c','c','d'), year = c(3,1,3,4))
    Describe the differences between the outputs of the following commands:
    merge(d1,d2)

    merge(d1,d2, all.x = TRUE)

    merge(d1,d2, all.y = TRUE)

    merge(d1,d2, all = TRUE)

  2. What attributes does a data frame possess? From Wickham: Advanced R

  3. What does as.matrix() do when applied to a data frame with columns of different types? From Wickham: Advanced R

  4. Can you have a data frame with 0 rows? What about 0 columns? From Wickham: Advanced R

  5. This question illustrates how simple data manipulation can be used to answer basic queries about data. Consider classlists for four sections of a first year statistics course STA1000 (at http://blackwell.math.yorku.ca/MATH4939/data/clist_exercise/) and two classlists for a second year statistics course STA2000, taken the following year. Without any direct editing of the classlists do the following:

    1. Write a function that transforms each input classlist into a data frame with useful variables on the program of each student. Note that information on program is encoded in a single column that can contain information on a number of distinct variables. You need to use string manipulation functions, e.g. sub, gsub, strsplit, to turn this column into useful variables. Note that a space is usually a delimiter between subfields but sometimes not. You might need to preprocess the strings before splitting them into subfields.
    2. Is there evidence that a different proportion of students in the 2nd year course go on to study statistics in the 3rd year course?
    3. Is there evidence that this remains true when adjusting for the program of students in the 2nd year course?
    4. Are there conversions, i.e. students who change their majors to statistics? Do they come disproportionately from some sections instead of others?
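For question 1 of this section, a convenient way to compare the four `merge` calls is to collect the results in a list and look at the row counts before inspecting the rows themselves. This is only a sketch; the list name `joins` and its element names are labels chosen here, not part of the question.

```r
d1 <- data.frame(id = c('a','a','b','c'), grade = c(1,2,1,3))
d2 <- data.frame(id = c('a','c','c','d'), year = c(3,1,3,4))

# The four variants from the question, labelled by the kind of join they perform.
joins <- list(
  inner = merge(d1, d2),
  left  = merge(d1, d2, all.x = TRUE),
  right = merge(d1, d2, all.y = TRUE),
  full  = merge(d1, d2, all = TRUE)
)

sapply(joins, nrow)   # number of rows produced by each variant
joins$full            # NA marks an id that is missing from one of the inputs
```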

11 Questions on graphics in R

12 Questions on functions in R

  1. Write a function that will take a vector as input and return the vector with NAs changed to a null string ("") if the vector is a character vector or to 0 if it is a numeric vector.
    Test your function on extreme examples.

  2. Extend the previous function so it returns a factor if the input is a factor and changes NAs in factors to a factor level that is a null string.
    Test your function on extreme examples.

  3. Extend the previous function so that the value to which NAs are changed can be supplied as a parameter whose default value is the same as in the previous question. Make the value to which NAs are changed potentially depend on the type of variable.
    Test your function on extreme examples.

  4. (Major question) Referring to question 1 on \(p\)-value,

    1. write a function that returns the posterior probability that \(H_0\) is true given the variables in the question: the alternative value of \(\mu\), \(\alpha\), \(n\).
    2. Generate a data frame with values of these variables (hint: consider using expand.grid) and evaluate the function on each row in the data frame.
    3. Graph the results in some interesting and revealing way. You might want to change the values of the variables that you used in creating the data frame in order to produce a more interesting display.
    4. Discuss what your graphs reveal.
  5. Write a function that returns a lower triangular matrix whose (i,j)th entry, for i > j, is i + j.

  6. Write a function that finds the index of the first occurrence of x in a vector y.

  7. Write a function that turns a vector of non-negative integers into a factor with values ‘0’, ‘1’, ‘2 or more’. Make sure that the factor has the right ordering of levels.

  8. Write a function that identifies whether an integer is a prime number.

  9. Write a function whose input is a data.frame and whose output is the same data frame except that all factor variables have been changed to character variables but the numeric variables are unchanged.
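For the major question (question 4 of this section), one way to set up the calculation is sketched below. It rests on several assumptions that are not in the question: the posterior is conditioned on the exact observed \(p\)-value (for example 0.049) rather than on the event \(p \le \alpha\), the alternative is the single point \(\mu = \mu_j\), and the names `post_h0`, `p` and `prior_h0` are hypothetical. With \(Z = \sqrt{n}\,\bar{X}_n\), observing a two-sided \(p\)-value of \(p\) is the same as observing \(|Z| = z_p = \Phi^{-1}(1 - p/2)\). The density of \(|Z|\) at \(z_p\) is \(2\phi(z_p)\) under \(H_0\) and \(\phi(z_p - \sqrt{n}\mu) + \phi(z_p + \sqrt{n}\mu)\) under the alternative.

```r
# Posterior probability of H0: mu = 0 against the point alternative mu,
# given an observed two-sided p-value p. A sketch under the assumptions above.
post_h0 <- function(mu, n, p = 0.049, prior_h0 = 0.5) {
  z <- qnorm(1 - p / 2)                        # |Z| corresponding to p
  shift <- sqrt(n) * mu                        # mean of Z under the alternative
  f0 <- 2 * dnorm(z)                           # density of |Z| at z under H0
  f1 <- dnorm(z - shift) + dnorm(z + shift)    # density of |Z| at z under H1
  prior_h0 * f0 / (prior_h0 * f0 + (1 - prior_h0) * f1)
}

# Evaluate on the grid of effect sizes and sample sizes from question 1.
grid <- expand.grid(mu = c(0.2, 0.5, 0.8, 1, 5),
                    n  = c(10, 20, 100, 200, 1000, 10000))
grid$post <- post_h0(grid$mu, grid$n)
head(grid)
```

Because `post_h0` is vectorized, it can be applied directly to the columns of the grid. A version that conditions on \(p \le \alpha\) instead would replace the two densities with the Type I error rate and the power.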

13 Questions on OOP in R

14 Questions on Regression

  1. Explore the pros and cons of Wald tests versus Likelihood Ratio Tests. Construct an example where they give entirely different results.

15 Questions on Causality

  1. One of your professors uploads videos of the course to a website. One day, he analyzes results from last year’s class and discovers that students’ performance in the course is related to how often they viewed the videos. Students who viewed the videos frequently tended to perform less well on the final exam than students who viewed the videos relatively rarely. Upon discovering this, your professor announces that he will stop recording lectures because, he says, the videos have been shown to cause students to perform more poorly on the course. In answering the following questions, use clear and simple language even a professor might be able to understand.

    1. Do you think the data used in this study constitute experimental data (in the sense used in this course) or observational data?
    2. Explain why the number of lectures attended could be a potential confounding factor when considering the relationship between the frequency of viewing videos and performance on the course.
    3. Can you think of potential mediating factors?
    4. Draw causal graphs for a confounding factor and for a mediating factor.
    5. Using a potential confounding factor, draw a hypothetical MC diagram (alias Paik-Agresti diagram, alias ‘marginal-conditional plot’) to show how students who view the videos frequently might do more poorly than those who view them rarely, even though viewing videos may make a positive contribution for individual students in the course.
  2. In 1964, the Public Health Service of the United States studied the effects of smoking on health in a sample of 42,000 households. For men and for women in each age group, they found that those who had never smoked were on average somewhat healthier than the current smokers, but the current smokers were on average much healthier than the former smokers.

    1. Would this data be considered observational or experimental to examine the possible effects of smoking? Why?
    2. Why did they study men and women and the different age groups separately?
    3. The lesson seems to be that you shouldn’t start smoking, but once you’ve started, don’t stop. Find some plausible explanations for this surprising relationship between quitting smoking and health. Find at least one plausible confounding factor and one plausible mediating factor that might account for part of the relationship.
    4. Conditioning on a plausible confounding factor, draw a hypothetical MC diagram showing a conditional and an unconditional relationship between quitting and health that is consistent with the findings of the study.
  3. A study investigated whether there was a higher risk of complications when women gave birth at home with the assistance of a midwife instead of giving birth in a maternity ward in a hospital. 400 women who chose to give birth at home and 2,000 women who gave birth in a hospital were studied. The table below summarizes the number of complications in each group.

    1. Find the rate of complication in each group: the home birth group and the hospital birth group.
    2. Do you think that this is an observational study or an experimental study? Why?
    3. The data suggest that it is safer (in the sense of a lower rate of complications) to give birth at home than to give birth in the hospital. Discuss whether this implies that a woman should consider giving birth at home in order to reduce her risk of complications. Identify at least one plausible confounding factor and one plausible mediating factor that could partly explain the results of the study.
    4. Choose a possible confounding factor and use an MC diagram to show how controlling for this confounding factor could reverse the direction of association between the rate of complications and the location of birth: home or hospital.
                       Complications   No Complications   Total
    Home Births                   20                380     400
    Hospital Births              200               1800    2000
    Total                        220               2180    2400

  4. Mary (a woman) is choosing between restaurants A and B to take her friend, John (a man), out for dinner.
    Restaurant A has an average rating of 4.1 and restaurant B of 4.3. But looking at ratings by gender, among men, restaurant A has a rating of 4.0 and restaurant B of 3.8. Among women, restaurant A has a rating of 4.6 and restaurant B of 4.4. It seems that men and women separately prefer restaurant A but together they prefer restaurant B!

    1. Draw an MC diagram conditioning on gender with restaurant on the horizontal axis and average ratings on the vertical axis to explain how this apparent contradiction could arise.
    2. Draw a causal graph describing the relationships among the three variables: restaurant, gender and ratings.
    3. Assuming that there are no other significant factors related to restaurant ratings, what kind of variable is gender in this context?
    4. Which restaurant should Mary choose? Why?
    5. Add an appropriate ‘do-line’ to the MC diagram.
  5. Fedor is choosing between restaurants A and B to take his friend, Jaspreet, out for dinner. Restaurant A has an average rating of 4.1 and restaurant B of 4.3. Each restaurant has six waiters. Restaurant A has one good waiter and 5 bad ones. Restaurant B has 5 good waiters and 1 bad one. Customers can’t choose the waiter they get.

    Ratings among customers getting a good waiter are 4.6 at restaurant A and 4.4 at restaurant B. Ratings among customers getting a bad waiter are 4.0 at restaurant A and 3.8 at restaurant B. So, overall, customers prefer restaurant B but among those getting a bad waiter, they prefer restaurant A and among those getting a good waiter they also prefer restaurant A! In other words, the bad waiters at restaurant A are better than the bad waiter at restaurant B and the good waiter at restaurant A is better than the good waiters at restaurant B. Suppose that you can’t choose your waiter and that your chances of getting each type of waiter at either restaurant are similar to the proportions in the ratings.

    1. Draw an MC diagram conditioning on the quality of the waiter with restaurant on the horizontal axis and average ratings on the vertical axis to explain how this apparent contradiction could arise.
    2. Draw a causal graph describing the relationships among the three variables: restaurant, quality of waiter and ratings.
    3. Assuming that there are no other significant factors related to restaurant ratings, what kind of variable is the quality of the waiter in this context?
    4. Which restaurant should Fedor choose? Why?
    5. Add an appropriate ‘do-line’ to the MC diagram.
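The figures in the waiter question can be checked with a short calculation. The sketch below assumes that the share of customers served by each type of waiter equals the share of good and bad waiters on staff (1/6 and 5/6 at restaurant A, 5/6 and 1/6 at restaurant B); the object names `share`, `rating` and `overall` are chosen here for illustration.

```r
# Share of customers served by each type of waiter (assumed equal to the
# staff proportions given in the question).
share <- rbind(A = c(good = 1, bad = 5) / 6,
               B = c(good = 5, bad = 1) / 6)

# Average rating by restaurant and type of waiter, as stated in the question.
rating <- rbind(A = c(good = 4.6, bad = 4.0),
                B = c(good = 4.4, bad = 3.8))

# Overall rating = weighted average of the conditional ratings.
overall <- rowSums(share * rating)
round(overall, 2)   # A: 4.1, B: 4.3
```

The overall ratings of 4.1 and 4.3 are therefore consistent with the conditional ratings: restaurant B scores higher overall only because most of its customers get a good waiter.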

16 Questions to be classified

17 References