These questions have been inspired by many sources. An important
reference is Wickham (2019).
The meaning of \(p\)-values
The purpose of the assignment is to explore the meaning of p-values.
Before starting, stop and reflect on what it means for an experiment to
‘achieve’ a p-value of 0.049. What meaning can we give to the quantity
‘0.049’? How is it related to the probability that the null hypothesis
is correct?
To keep things very simple, suppose you want to test \(H_0: \mu = 0\) versus \(H_1: \mu \neq 0\) and you are designing an
experiment in which you plan to take a sample of independent random
variables, \(X_1, X_2, \ldots, X_n\),
which are iid \(\textrm{N}(\mu,1)\),
i.e. the variance is known to be equal to 1. You plan to use the usual
test based on \(\bar{X}_n\), rejecting
\(H_0\) for values of \(\bar{X}_n\) that are far from 0.
An applied example would be testing for a change in value of a
response when all subjects are submitted to the same conditions and the
measurement error of the response is known. In that example \(X_i\) would be the ‘gain score’,
i.e. post-test response minus the pre-test response exhibited by the
\(i\)th subject.
Let the selected probability of Type I error be \(\alpha = 0.05\). Consider collecting
samples of size \(n\) where \(n\) equals one of the following:
Index | \(n\)
1 | 2
2 | 5
3 | 10
4 | 15
5 | 100
6 | 1,000
Consider using the following values of \(\mu_j\):
\(j\) | \(\mu_j\) | Description
1 | 0.2 | small effect size
2 | 0.5 | medium effect size
3 | 0.8 | large effect size
4 | 1 | very large effect size
5 | 5 | huge effect size
- What is the probability that \(p \le
0.05\) if \(H_0: \mu = 0\) is
true?
- What is the probability that \(p \le
0.05\) if \(\mu = \mu_j\)?
- What is the power of this test if \(\mu =
\mu_j\)?
- Suppose that you collect the data and that the observed \(p\)-value is 0.049. What can you say about
the probability that \(H_0\) is
true?
- Suppose that, before running the experiment, you were willing to
give \(H_0\) and \(H_1: \mu = \mu_j\) equal probability. What
is the probability that \(H_0\) is true
given that you have performed the experiment and obtained \(p = 0.049\)?
- Hypothesis testing is often presented as a process that parallels
that of determining guilt in a criminal process. We start with a
presumption of innocence, i.e. that \(H_0\) is true. We then hear evidence and
consider whether it contradicts the presumption of innocence ‘beyond a
reasonable doubt.’
Suppose we quantify the presumption of innocence to mean that \(P(H_0) \ge 0.95\). How small an observed
\(p\)-value do you need to obtain in
order to ‘flip’ the presumption of innocence to ‘guilt beyond a
reasonable doubt’ if that is defined as \(P(H_0 | \mathrm{data}) \le 0.05\)?
- What \(p\)-value would we need if
the presumption of innocence and guilt beyond a reasonable doubt
correspond to \(P(H_0) \ge 0.999\) and
\(P(H_0 | \mathrm{data}) \le 0.001\)?
- Courts have often adopted a criterion of \(p < 0.05\) in imitation of the common
practice among many researchers. Comment on the possible
consequences.
- Have a look at this xkcd
cartoon. How does the Bayesian statistician derive a probability to
make a decision in this example? Show the details.
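As a starting point for the power and posterior-probability questions above, here is a minimal sketch in R. The function names `power_z` and `post_h0` are hypothetical. The posterior calculation assumes a point alternative \(\mu = \mu_j\), an observed \(\bar{X}_n > 0\), and conditioning on the exact value of the test statistic that yields the observed \(p\)-value. Other ways of conditioning on the data would give different answers.

```r
# Power of the two-sided z-test of H0: mu = 0 when the variance is known to be 1.
power_z <- function(mu, n, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)                 # critical value for |sqrt(n) * Xbar|
  # sqrt(n) * Xbar ~ N(sqrt(n) * mu, 1), so add the two rejection tails
  pnorm(-z + sqrt(n) * mu) + pnorm(-z - sqrt(n) * mu)
}

# P(H0 | data) against the point alternative mu = mu1, given an observed
# two-sided p-value p (assumption: Xbar > 0, so the z statistic is positive).
post_h0 <- function(p, mu1, n, prior = 0.5) {
  z  <- qnorm(1 - p / 2)                    # observed z statistic
  f0 <- dnorm(z)                            # density of z under H0
  f1 <- dnorm(z - sqrt(n) * mu1)            # density of z under mu = mu1
  prior * f0 / (prior * f0 + (1 - prior) * f1)
}

n_vals  <- c(2, 5, 10, 15, 100, 1000)
mu_vals <- c(0.2, 0.5, 0.8, 1, 5)

# Rows index mu_j, columns index n
power_tab <- round(outer(mu_vals, n_vals, power_z), 3)
post_tab  <- round(outer(mu_vals, n_vals,
                         function(m, n) post_h0(0.049, m, n)), 3)
```

Changing the `prior` argument (for example to 0.95 or 0.999) gives one way to explore the 'presumption of innocence' questions.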
To delve more deeply into these issues you can read Wasserstein and Lazar (2016) and Wasserstein, Schirm, and Lazar (2019). Concerns about \(p\)-values have been around for a long
time; see Schervish (1996). For a short overview see
Bergland (2019). Two influential and
once-controversial papers are by John Ioannidis: John P. A. Ioannidis (2005) and John
P. A. Ioannidis (2019).
For an entertaining take on related issues see Oliver (2016)
(warning: contains strong language and political irony that many could
consider offensive – watch at your own risk!).
Merge: relational data base operations 1
Let
d1 <- data.frame(id = c('a','a','b','c'), grade = c(1,2,1,3))
d2 <- data.frame(id = c('a','c','c','d'), year = c(3,1,3,4))
Describe the differences between the outputs of the following
commands:
merge(d1,d2)
merge(d1,d2, all.x = TRUE)
merge(d1,d2, all.y = TRUE)
merge(d1,d2, all = TRUE)
- Compare this base R function merge with the functions inner_join, left_join, right_join and full_join in the ‘dplyr’ package. (dplyr has no outer_join function; full_join performs the full outer join.)
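To see how the four `merge` calls differ, it can help to collect the results and compare row counts. The sketch below uses only base R; the list name `joins` is an illustrative choice. The labels show which dplyr join each call roughly corresponds to.

```r
d1 <- data.frame(id = c('a','a','b','c'), grade = c(1,2,1,3))
d2 <- data.frame(id = c('a','c','c','d'), year = c(3,1,3,4))

joins <- list(
  inner = merge(d1, d2),                 # only ids present in both
  left  = merge(d1, d2, all.x = TRUE),   # every row of d1, NA year if unmatched
  right = merge(d1, d2, all.y = TRUE),   # every row of d2, NA grade if unmatched
  full  = merge(d1, d2, all = TRUE)      # every row of both
)

# Number of rows produced by each kind of join
n_rows <- sapply(joins, nrow)
n_rows
```

Note that `merge` sorts the result by the key by default, whereas the dplyr joins keep the row order of the first data frame.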
Merge: Concatenating rows of slightly different data frames
Two research assistants have collected data for a study. The two RAs
worked with different subjects, and each gathered their data into a
spreadsheet. All of the important variables have the same names and the
same definitions, but each RA has also collected data on a few variables
unique to that RA. Also, the order of the columns is different in
the two spreadsheets.
- Create two small data frames to illustrate this kind of
situation.
- How could you use ‘merge’ to easily concatenate the two data frames
by rows, keeping each distinct row of the original data frames and
keeping the unique variables, with NAs filled in for
subjects for whom a value was not recorded?
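One possible illustration, with made-up data frames `ra1` and `ra2` (all names and values here are hypothetical): with `all = TRUE` and the default key (all shared column names), rows that differ on any shared variable cannot match, so they are stacked and the unshared variables are filled with NA.

```r
# Shared variables: id, score. Unique variables: age (RA 1), height (RA 2).
ra1 <- data.frame(id = c(1, 2), score = c(10, 12), age = c(30, 41))
ra2 <- data.frame(score = c(9, 14), id = c(3, 4), height = c(170, 165))

# The key defaults to intersect(names(ra1), names(ra2)), i.e. id and score
both <- merge(ra1, ra2, all = TRUE)
both
```

A caveat with this approach: if a row of `ra1` agreed with a row of `ra2` on every shared variable, `merge` would combine them into a single row instead of keeping two.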
Merge: relational data base operations 2
Let
d1 <- data.frame(id = c('a','a','b','c'), grade = c(1,2,1,3))
d2 <- data.frame(id = c('a','c','c','d'), year = c(3,1,3,4))
Find out what the following terms mean and show how to achieve each
operation using ‘merge’. Hint: the last two operations are much easier
if you first create a new variable in each data frame to serve as a
‘key’ (i.e. the argument of ‘by’ in the call to ‘merge’). (Note: the ‘key’
is the set of variables used to match the rows
of the two data frames. These are either the variable names provided as
arguments to the ‘by’ parameter or, by default, the intersection of the
vectors of variable names in each data frame.)
- inner join
- outer join
- left join
- right join
- cross join
- concatenation of rows
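As a sketch of the hint for the cross join: adding a constant key to both data frames makes every row of one match every row of the other. The column name `key` and the objects `d1k`, `d2k` and `cross` are illustrative choices.

```r
d1 <- data.frame(id = c('a','a','b','c'), grade = c(1,2,1,3))
d2 <- data.frame(id = c('a','c','c','d'), year = c(3,1,3,4))

# A constant key: every row of d1k matches every row of d2k
d1k <- transform(d1, key = 1)
d2k <- transform(d2, key = 1)

# Merging on the constant key gives the Cartesian product (4 x 4 rows);
# the shared non-key column 'id' is disambiguated as id.x and id.y
cross <- merge(d1k, d2k, by = "key")
dim(cross)
```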
Answering questions with data
Use the ‘Vocab’ data set in the ‘car’ package. It records a
vocabulary score for over 30,000 subjects tested over the years between
1974 and 2016.
Explore the following questions. Use appropriate tables and graphs to
explain your findings.
- Consider the distribution of education. Are there any salient
features for this distribution?
- What can you say about any trends in vocabulary scores over time? Do
the trends, if any, appear to differ by gender?
- What can you say about any trends in vocabulary scores over time
when you adjust for education? What difference does it make whether you
adjust for education or you don’t? What is the difference in the meaning
of changes in vocabulary score whether you adjust for education or not?
Do the trends, if any, appear to differ by gender?
- What can you say about any trends in education levels over time? Do
the trends, if any, appear to differ by gender?
- What can you say about male/female differences in vocabulary scores
when you adjust for education? Is the relationship constant or changing
over time? If it is changing, how can you describe the nature of the
change?
- Studying gaps and trends: For each of the following questions, fit a
suitable model and explore the question using a suitable Wald test.
- In the last ten years of the study, is there evidence that
vocabulary scores are increasing among men?
- In the last ten years of the study, is there evidence that
vocabulary scores are increasing among women?
- In the last ten years of the study, is there evidence that
vocabulary scores are increasing at a different rate among men than
among women?
- Repeat 1 for level of education.
- Repeat 2 for level of education.
- Repeat 3 for level of education.
- Repeat 1 for level of vocabulary adjusted for education.
- Repeat 2 for level of vocabulary adjusted for education.
- Repeat 3 for level of vocabulary adjusted for education.
- Repeat 1 for the first ten years.
- Repeat 2 for the first ten years.
- Repeat 3 for the first ten years.
- Repeat 4 for the first ten years.
- Repeat 5 for the first ten years.
- Repeat 6 for the first ten years.
- Repeat 7 for the first ten years.
- Repeat 8 for the first ten years.
- Repeat 9 for the first ten years.
- Repeat 1 comparing the last 10 years with the first 10 years.
- Repeat 2 comparing the last 10 years with the first 10 years.
- Repeat 3 comparing the last 10 years with the first 10 years.
- Repeat 4 comparing the last 10 years with the first 10 years.
- Repeat 5 comparing the last 10 years with the first 10 years.
- Repeat 6 comparing the last 10 years with the first 10 years.
- Repeat 7 comparing the last 10 years with the first 10 years.
- Repeat 8 comparing the last 10 years with the first 10 years.
- Repeat 9 comparing the last 10 years with the first 10 years.
Questions on factors in R
Some of these questions illustrate important potential pitfalls in
using factor variables. As a result, some developers eschew them and
prefer to work with character variables as much as possible. Factors,
however, are invaluable for many statistical applications since they
allow the creation of different orderings of the values in a character
vector, which is often important for statistical modeling and for
graphics.
Describe the difference, if any, and if so why, between the
following (note that the behaviour of the ‘factor’ function has changed
over time):
factor(c(1, 2, 10))
factor(as.character(c(1, 2, 10)))
Suppose x <- factor(c(1, 2, 10)). Write a
function that would allow you to recover the original values, 1, 2 and
10, from a factor like x. Why does as.numeric(x) not work?
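A short illustration of the issue: a factor stores integer codes that index its levels, and `as.numeric` returns those codes, not the labels. The helper name `fac2num` is hypothetical.

```r
x <- factor(c(1, 2, 10))

codes <- as.numeric(x)       # the internal integer codes, not the labels

# Convert the labels to numbers, then index them by the codes
fac2num <- function(f) as.numeric(levels(f))[as.integer(f)]

recovered <- fac2num(x)
```

An equivalent but slightly less efficient form is `as.numeric(as.character(x))`.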
Let f1 <- factor(c('a','b','c')) and f2 <- factor(c('A','B','C')). What’s the matter with the
result of the expression
ifelse(f1 == 'a', f1, f2)
Explain why it fails to produce characters as a result. Fix it so it
does.
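One way to explore this is sketched below. `ifelse` builds its result from the logical test vector and drops the factor attributes of `yes` and `no`, so only the underlying codes survive. Converting to character first is one possible fix; the names `bad` and `good` are illustrative.

```r
f1 <- factor(c('a','b','c'))
f2 <- factor(c('A','B','C'))

# The factor attributes are lost, so the result is not a character vector
bad <- ifelse(f1 == 'a', f1, f2)

# Converting both branches to character keeps the labels
good <- ifelse(f1 == 'a', as.character(f1), as.character(f2))
```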
Indexing with factors: Consider
df <- data.frame(c = 1:3, a = 11:13, b = 21:23)
fac <- factor(c('a','b','c'))
df[[fac[1]]]
df[[as.character(fac[1])]]
Explain why the last two lines of the code above produce different
results.
What happens to a factor when you modify its levels with the
levels<- replacement function? Give examples to
illustrate your answer.
Questions on the R language
Describe the main differences between the four taxonomies of
objects in R: typeof, mode, storage.mode and class.
Let x <- letters. What is the class and mode of x? Let y <- as.factor(x). What is the class
and mode of y? Why does this make sense … or not?
A factor is a kind of object used to represent character
variables for statistical analysis. Add a factor to the list used to
display the classifications of atomic objects above. Play with some
factors and use str to explain the curious values returned
by typeof, mode, storage.mode and class for a factor.
What makes is.vector() and is.numeric() fundamentally different
to is.list() and is.character()? From Wickham: Advanced R
Why is 1 == "1" true? Why is -1 < FALSE true? Why is "one"
< 2 false? From Wickham: Advanced R
Why is the default missing value, NA, a logical vector? What’s
special about logical vectors? (Hint: think about c(FALSE,
NA_character_).) From Wickham: Advanced R
Does -1:2 produce the same result as 0-1:2? Why or why not?
Which of the following assignments use valid names?
a_very_long_name <- 0
_tmp <- 2
.tmp <- 2
..val <- 3
.2regression <- TRUE
._2_val <- 'a'
Write an R Markdown script that illustrates the use of at least 5
functions from the subgroup ‘Ordering and tabulating’ of the group
‘Statistics’ at http://adv-r.had.co.nz/Vocabulary.html
Write an R Markdown script that illustrates the use of at least 5
functions from the subgroup ‘Linear models’ of the group ‘Statistics’ at
http://adv-r.had.co.nz/Vocabulary.html
Write an R Markdown script that illustrates the use of at least 5
functions from the subgroup ‘Miscellaneous tests’ of the group
‘Statistics’ at http://adv-r.had.co.nz/Vocabulary.html
Write an R Markdown script that illustrates the use of at least 5
functions from the subgroup ‘Random variables’ of the group ‘Statistics’
at http://adv-r.had.co.nz/Vocabulary.html. Include
interesting graphs.
Write an R Markdown script that illustrates the use of at least 5
functions from the subgroup ‘Matrix algebra’ of the group ‘Statistics’
at http://adv-r.had.co.nz/Vocabulary.html
Questions on programming in R
Many of these questions ask you to write a function to accomplish
some goal, instead of just requiring an expression. The advantage of
writing a function is you can easily test your code by trying your
function on extreme or impossible values.
- What output will the following R script produce? Explain briefly
why.
x <- c(TRUE, FALSE, 0L)
typeof(x)
- What output will the following R script produce? Explain briefly
why.
TRUE | NA
- Let x be defined as:
x <- c('0','10','5','20','15','10','0','5')
Write an
R function that would turn x into a factor whose ordering corresponds to
the numerical ordering of x.
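A minimal sketch of one approach: take the distinct labels, order them by their numeric value, and pass them as the `levels` argument. The function name `as_numeric_factor` is hypothetical.

```r
as_numeric_factor <- function(x) {
  u <- unique(x)
  # Order the distinct labels by numeric value, keeping them as strings
  factor(x, levels = u[order(as.numeric(u))])
}

x <- c('0','10','5','20','15','10','0','5')
f <- as_numeric_factor(x)
levels(f)
```

Ordering the original strings (rather than converting to numbers and back) avoids mismatches for labels such as '1.50', whose numeric form prints as '1.5'.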
- In R, let x <- 1:5. What output would x[NA] produce? What output would x[NA_real_]
produce? Describe the reason for the difference, if any.
- In R, describe the result of subsetting a vector with positive
integers, with negative integers, with a logical vector, or with a
character vector.
- In R, what’s the difference between [, [[, and $ when applied to a list?
- In R, when subsetting with [, when should you use drop = FALSE? Include arrays and factors in your
discussion.
- In R, if x is a matrix, what does x[] <- 0 do? How is it different from x <- 0?
- In R, how can you use a named vector to relabel a categorical
variable?
- In R, if mtcars is a data frame, why does mtcars[1:20] return an error? How does it differ from the
similar mtcars[1:20, ]?
- Fix each of the following common data frame subsetting errors in
R:
mtcars[mtcars$cyl = 4, ]
mtcars[-1:4, ]
mtcars[mtcars$cyl <= 5]
mtcars[mtcars$cyl == 4 | 6, ]
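For reference, one plausible set of corrections is sketched below, assuming the intent of each line is what it appears to be. The object names `four_cyl`, `drop_first4`, `small_cyl` and `four_or_six` are illustrative.

```r
# `=` is assignment inside a call; `==` tests equality
four_cyl <- mtcars[mtcars$cyl == 4, ]

# -1:4 is the sequence -1, 0, ..., 4, which mixes negative and positive
# indices; parenthesise to negate the whole range
drop_first4 <- mtcars[-(1:4), ]

# Without the comma the logical vector selects columns, not rows
small_cyl <- mtcars[mtcars$cyl <= 5, ]

# `4 | 6` is always TRUE; compare against each value instead
four_or_six <- mtcars[mtcars$cyl %in% c(4, 6), ]
```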
- In R, if df is a data frame, what does df[is.na(df)] <- 0 do? How does it work?
- Create the vector (20,19,…,2,1) in R.
- Create the vector (1,2,3,…,19,20,19,18,…,2,1) in R.
- Create the vector (4,4,…,4,6,6,…,6,3,3,…,3) in R, where there are 10
occurrences of 4, 20 of 6 and 30 of 3.
- Write a function in R to calculate \(\sum_{i=1}^{n}(i^3+4i^2)\). Test it,
including on ‘incorrect’ input.
- Generate in R a vector of 30 labels: ‘label 1’, ‘label 2’, … ‘label
30’
- Let y <- sample(1000, 30, replace = TRUE). Write
functions in R to do the following. Test each function.
- Determine how many elements of y are multiples of 2.
- Determine how many elements of y are equal to 7 mod 13.
- Determine how many elements of y are within 200 of the maximum
value.
- Determine how many elements of y are less than the previous
element.
- Determine how many elements of y are an exact square.
- Determine how many elements of y are prime.
- Suppose data for a variable in R representing dollars has been
entered in a variety of formats: ‘$1,000.00’,‘1000.00’,‘$1’. Write a
function in R that transforms the variable to a numeric variable in
dollars to the nearest cent.
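A minimal sketch, assuming the only non-numeric characters are dollar signs and thousands separators (the function name `parse_dollars` is hypothetical):

```r
parse_dollars <- function(x) {
  # Inside a character class, `$` is a literal dollar sign
  cleaned <- gsub("[$,]", "", x)
  round(as.numeric(cleaned), 2)
}

amounts <- parse_dollars(c('$1,000.00', '1000.00', '$1'))
```

Entries in other formats (for example with a currency code or surrounding text) would become NA with a warning, which is one useful way to test the function on 'impossible' values.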
- Write a function in R that takes a character vector and collapses
multiple adjoining blanks in each element to a single blank.
- Write a function in R that accepts a data frame as input and returns
a data frame in which every variable whose name starts with the letter
‘X’ and ends in a number has been removed.
- Create a 6 by 10 matrix of random integers in R as follows:
set.seed(75)
m <- matrix(sample(10, 60, replace = TRUE), nrow = 6)
- Write a function to find the number of entries in each row of a
matrix that are greater than 4.
- (continued from the previous question) Write a function to find how
many rows have exactly two instances of the number 7.
- Describe the difference in R between
paste(x, y, sep = ':')
and
paste(x, y, collapse = ':')
. Illustrate.
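A small illustration with two made-up vectors: `sep` goes between the corresponding elements of the arguments, while `collapse` joins the resulting strings into a single string.

```r
x <- c("a", "b")
y <- c("1", "2")

by_sep      <- paste(x, y, sep = ':')       # element-wise, one string per pair
by_collapse <- paste(x, y, collapse = ':')  # default sep = " ", then joined by ':'
both        <- paste(x, y, sep = '-', collapse = ':')
```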
- Using the hs data set in the spida2
package, create a plot with two panels showing histograms displaying the
distribution of school sizes in the Public and in the Catholic sectors.
Use the functions capply and up in the spida2 package. You may also use any other approach to
compare with the use of capply and up.
- Using the hs data set in the spida2
package, create a plot with two panels showing histograms displaying the
distribution of sample sizes in each school in the Public and in the
Catholic sectors. Use the functions capply and up in the spida2 package. You may also use any
other approach to compare with the use of capply and up.
- Using the hs data set in the spida2
package, create a plot with two panels showing scatterplots displaying
the relationship between mean mathach and mean ses in each school in the Public and in the Catholic
sectors. Explore reasonable transformations and regression lines, linear
and non-parametric, in the plots. Use the functions capply and up in the spida2 package. You may also use
any other approach to compare with the use of capply and up.
- Describe the difference in R between a generic function and a
method.
- [Warwick] Create the vectors:
- (1,2,3,…,19,20)
- (20,19,…,2,1)
- (1,2,3,…,19,20,19,18,…,2,1)
- (4,6,3) and assign it to the name ‘tmp’
- (4,6,3,4,6,3,…,4,6,3) where there are 10 occurrences of 4 (Hint:
?rep)
- (4,6,3,4,6,3,…,4,6,3,4) where there are 11 occurrences of 4 and 10
of 6 and 3
- (4,4,…,4,6,6,…,6,3,3,…,3) where there are 10 occurrences of 4, 20 of
6 and 30 of 3.
- [Warwick] Create the vector of the values of \(e^x \cos(x)\) at \(x=3, 3.1, 3.2, ..., 6\).
- [Warwick] Create the following vectors:
- \((0.1^3 0.2^1, 0.1^6 0.2^4, ... ,
0.1^{36} 0.2^{34} )\)
- \(\left({2,\frac{2^2}{2},\frac{2^3}{3},...,\frac{2^{25}}{25}}\right)\)
- [Warwick] Calculate the following:
- \(\sum_{i=10}^{100} (i^3 +
4i^2)\)
- \(\sum_{i=1}^{25} \left({\frac{2^i}{i} +
\frac{3^i}{i^2}}\right)\)
- [Warwick] Use the function ‘paste’ to create the following character
vectors of length 30:
- (“label 1”, “label 2”, … , “label 30”). Note that there is
a single space between label and the number following.
- (“fn1”, “fn2”, …, “fn30”). In this case there is no
space.
- [Warwick] Execute the following lines which create two vectors of
random integers which are chosen with replacement from the integers 0,
1, …, 999. Both vectors have length 250.
set.seed(50)
xVec <- sample(0:999, 250, replace = TRUE)
yVec <- sample(0:999, 250, replace = TRUE)
Suppose \(\mathbf{x} = (x_1, x_2, ...,
x_n)\) denotes the vector xVec and similarly for \(\mathbf{y}\).
- Write a function that returns the vector \((y_2 - x_1, ..., y_n - x_{n-1})\)
- Write a function that returns the vector \(\left({\frac{\sin(y_1)}{\cos(x_2)},\frac{\sin(y_2)}{\cos(x_3)},...,\frac{\sin(y_{n-1})}{\cos(x_n)}
}\right)\)
- Write a function that returns the vector \((x_1 + 2x_2 - x_3, x_2 + 2 x_3 - x_4, \ldots, x_{n-2}
+ 2x_{n-1} - x_n)\)
- Write a function that calculates \(\sum_{i=1}^{n-1}\frac{e^{-x_{i+1}}}{x_i +
10}\)
- [Warwick] This question uses the vectors xVec and
yVec created in the previous question and the functions
sort, order, mean, sqrt,
sum and abs.
- Write a function that returns the values in yVec which are
> 100.
- Write a function that returns the index positions in yVec
of the values which are > 600.
- Write a function that returns the values in xVec which
correspond to the values in yVec which are > 600.
- Create the vector \(\left(
\left|x_1-\bar{\mathbf{x}}\right|^{1/2},
\left|x_2-\bar{\mathbf{x}}\right|^{1/2},...,
\left|x_n-\bar{\mathbf{x}}\right|^{1/2}\right)\)
- Write a function that returns how many values in ‘yVec’ are
within 200 of the maximum value of the terms in ‘yVec’.
- Write a function that sorts the numbers in the vector ‘xVec’
in the order of increasing values in ‘yVec’.
- Write a function that returns how many numbers in ‘xVec’
are divisible by 2.
- Write a function that returns the elements in ‘yVec’ at
index positions 1,4,7,10,13,…
- [Warwick] By using the function cumprod or otherwise, write
a function that calculates \[ 1 + \frac{2}{3}
+\frac{2}{3}\frac{4}{5} +
\frac{2}{3}\frac{4}{5}\frac{6}{7}+...+\frac{2}{3}\frac{4}{5}...\frac{38}{39}\]
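The cumulative-product idea can be sketched as follows: the \(k\)th term after the leading 1 is the product of the first \(k\) ratios \(2/3, 4/5, \ldots\). The function name `cumprod_sum` and its argument `last` (the final numerator, 38 in the exercise) are illustrative.

```r
cumprod_sum <- function(last = 38) {
  num <- seq(2, last, by = 2)          # numerators 2, 4, ..., last
  # cumprod gives 2/3, (2/3)(4/5), ...; add the leading 1
  1 + sum(cumprod(num / (num + 1)))
}

total <- cumprod_sum(38)
```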
- [Regular expressions] Suppose money data for a variable has been
entered in a variety of formats, e.g.
“$1,000.00”, “1000.00”, “$123.2”.
Write an R function using ‘gsub’ and
‘as.numeric’ to turn these various entries into a numeric variable.
Experiment with your function to make sure it works.
- [Regular expressions] Write a function that takes a character vector
and collapses multiple adjoining blanks into a single blank.
- [Regular expressions] Use the file SampleClassFile.csv. One of its variables
is a string that contains information about a student’s faculty and
programme: whether they are in an ordinary programme or in an honours programme,
and the departments of their major and minor. Write a function that uses
regular expressions to create four new variables: the faculty in which a
student is enrolled, whether they are in an ordinary or in an honours
programme, their major programme and their minor programme, if any.
- [Regular expressions] Suppose you have a vector of names, such as:
Mary Jones
Tarik Mohammed
Smith, Jim
Tom O'Brian
Victor Lindquist
Chow, Vincent
Wong, Mary
Some names are in the format ‘First Last’ and others ‘Last, First’.
Write a function to extract the full names, in the format ‘Last, First’,
of all the individuals whose first name is ‘Mary’.
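A sketch of one approach, assuming each 'First Last' entry has a single-word first name with no comma (the function name `mary_names` is hypothetical): first rewrite every name as 'Last, First', then keep those whose first name is exactly 'Mary'.

```r
mary_names <- function(x) {
  # Entries that already contain a comma are taken to be 'Last, First'
  std <- ifelse(grepl(",", x), x,
                sub("^([^ ]+) +(.+)$", "\\2, \\1", x))
  # Keep names whose part after the comma is exactly 'Mary'
  std[grepl(", *Mary$", std)]
}

people <- c("Mary Jones", "Tarik Mohammed", "Smith, Jim", "Tom O'Brian",
            "Victor Lindquist", "Chow, Vincent", "Wong, Mary")
marys <- mary_names(people)
```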
- [Merging and reshaping] Use the site Gapminder.org to download at
least three longitudinal variables into separate data sets. Merge the
data sets into one for which each row represents one country and year
and contains the values of each of the three variables you downloaded.
Display how these variables change over time.
- [Regular expressions] Write a function that removes every variable
whose name starts with the letter ‘X’ and ends in a number from a data
frame.
- [Data] Write a function that takes a data frame and returns it with
variable names in alphabetical order.
- [Warwick] Suppose \[\mathbf{A}=
\begin{bmatrix} 1 & 1 & 3 \\ 5 & 2 & 6 \\ -2 & -1
& -3\end{bmatrix}\]
- Check that \(\mathbf{A}^3 =
\mathbf{0}\) where \(\mathbf{0}\) is a \(3 \times 3\) matrix with every entry equal
to 0.
- Replace the third column of \(\mathbf{A}\) by the sum of the second and
third columns.
- [Warwick] Create the following matrix \(\mathbf{B}\) with 15 rows: \[\mathbf{B}= \begin{bmatrix} 10 & -10 & 10
\\ 10 & -10 & 10 \\ \vdots & \vdots & \vdots \\10 &
-10 & 10\end{bmatrix}\] Calculate the \(3 \times 3\) matrix \(\mathbf{B}^T\mathbf{B}\). Consider:
?crossprod
- [Warwick] Create a \(6 \times 6\)
matrix ‘matE’ with every entry equal to 0. Check what the
functions ‘row’ and ‘col’ return when applied to
‘matE’. Hence create the \(6 \times
6\) matrix: \[\begin{bmatrix}
0 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0
& 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0
& 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0
& 1 & 0 \end{bmatrix}\]
- [Warwick] Look at ?outer. Hence create the following
patterned matrix: \[\begin{bmatrix}
0 & 1 & 2 & 3 & 4 & 5 \\ 1 & 2 & 3 & 4
& 5 & 6 \\
2 & 3 & 4 & 5 & 6 & 7 \\ 3 & 4 & 5 & 6
& 7 & 8 \\
4 & 5 & 6 & 7 & 8 & 9 \\ 5 & 6 & 7 & 8
& 9 & 10 \end{bmatrix}\]
- [Warwick] Create the following patterned matrices. In each case,
your solution should make use of the special form of the matrix – this
means that the solution should easily generalize to creating a larger
matrix with the same structure and should not involve typing in all the
entries in the matrix.
- \(\begin{pmatrix}
0 & 1 & 2 & 3 & 4 & 5 \\ 1 & 2 & 3 & 4
& 5 & 0 \\
2 & 3 & 4 & 5 & 0 & 1 \\ 3 & 4 & 5 & 0
& 1 & 2 \\
4 & 5 & 0 & 1 & 2 & 3 \\ 5 & 0 & 1 & 2
& 3 & 4 \end{pmatrix}\)
- \(\begin{pmatrix}
0 & 5 & 4 & 3 & 2 & 1 \\ 1 & 0 & 5 & 4
& 3 & 2 \\
2 & 1 & 0 & 5 & 4 & 3 \\ 3 & 2 & 1 & 0
& 5 & 4 \\
4 & 3 & 2 & 1 & 0 & 5 \\ 5 & 4 & 3 & 2
& 1 & 0 \end{pmatrix}\)
- [Warwick] Solve the following system of linear equations in five
unknowns \[\begin{eqnarray}
x_1 + 2x_2 + 3x_3 + 4x_4 +5 x_5 &=& 7 \\
2x_1 + x_2 + 2x_3 + 3x_4 +4 x_5 &=& -1 \\
3x_1 + 2x_2 + x_3 + 2x_4 +3 x_5 &=& -3 \\
4x_1 + 3x_2 + 2x_3 + x_4 +2 x_5 &=& 5 \\
5x_1 + 4x_2 + 3x_3 + 2x_4 +x_5 &=& 17
\end{eqnarray}\] by considering an appropriate matrix equation
\(\mathbf{A}\mathbf{x}=\mathbf{y}\).
Make use of the special form of the matrix \(\mathbf{A}\). The method used for the
solution should easily generalize to a larger set of equations where the
matrix \(\mathbf{A}\) has the same
structure.
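One way to exploit the structure is sketched here, as an illustration only: the coefficient in row \(i\) and column \(j\) of the system above is \(|i - j| + 1\), so the matrix can be generated for any size with `outer`.

```r
n <- 5
# Entry (i, j) is |i - j| + 1
A <- abs(outer(1:n, 1:n, "-")) + 1
y <- c(7, -1, -3, 5, 17)

# Solve A x = y directly, without forming the inverse of A
x <- solve(A, y)
```

Changing `n` (and supplying a right-hand side of matching length) generalizes the construction to larger systems with the same pattern.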
- [Warwick] Create a \(6 \times 10\)
matrix of random integers chosen from \(1,2,\ldots,10\) by executing the following two
lines of code:
set.seed(75)
aMat <- matrix(sample(10, size = 60, replace = TRUE), nrow = 6)
- Write a function to find the number of entries in each row which are
greater than 4.
- Write a function to find which rows contain exactly two occurrences
of the number seven.
- Find those pairs of columns whose total (over both columns) is
greater than 75. The answer should be a matrix with two columns; so, for
example, the row (1,2) in the output matrix means that the sum of
columns 1 and 2 in the original matrix is greater than 75. Repeating a
column is permitted; so, for example, the final output matrix could
contain the rows (1,2),(2,1) and (2,2).
What if repetitions are not
permitted? Then, only (1,2) from (1,2), (2,1) and (2,2) would be
permitted.
- [Warwick] Calculate:
- \(\sum_{i=1}^{20} \sum_{j=1}^{5} \frac{i^4}{(3+j)}\)
- (Hard) \(\sum_{i=1}^{20} \sum_{j=1}^{5} \frac{i^4}{(3+ij)}\)
- (Even harder!) \(\sum_{i=1}^{10} \sum_{j=1}^{i} \frac{i^4}{(3+ij)}\)
- [Warwick]
- Write functions ‘tmpFn1’ and ‘tmpFn2’ such that if
‘xVec’ is the vector \((x_1, x_2,
..., x_n)\), then ‘tmpFn1(xVec)’ returns the vector
\((x_1,x_2^2,...,x_n^n)\) and
‘tmpFn2(xVec)’ returns the vector \(\left({x_1,\frac{x_2^2}{2},...,\frac{x_n^n}{n}}\right)\)
- Now write a function ‘tmpFn3’ which takes two arguments
\(x\) and \(n\) where \(x\) is a single number and \(n\) is a strictly positive integer. The
function should return the value of \[1 +
\frac{x}{1} + \frac{x^2}{2} + \frac{x^3}{3} + ... +
\frac{x^n}{n}\]
- [Warwick] Write a function ‘tmpFn(xVec)’ such that if
‘xVec’ is the vector \(\mathbf{x}=(x_1,...,x_n)\) then
‘tmpFn(xVec)’ returns the vector of moving averages: \[\frac{x_1 + x_2 + x_3}{3}, \frac{x_2 + x_3 +
x_4}{3}, \ldots ,\frac{x_{n-2} + x_{n-1} + x_n}{3}\] Try out your function;
for example, try ‘tmpFn( c(1:5,6:1))’
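A vectorized sketch of the moving average of three consecutive values: it adds three shifted copies of the vector, so no loop is needed. The length check is an added assumption about how short inputs should be handled.

```r
tmpFn <- function(xVec) {
  n <- length(xVec)
  stopifnot(n >= 3)                      # need at least one full window
  # Windows (x_i, x_{i+1}, x_{i+2}) for i = 1, ..., n - 2
  (xVec[1:(n - 2)] + xVec[2:(n - 1)] + xVec[3:n]) / 3
}

ma <- tmpFn(c(1:5, 6:1))
```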
- [Warwick] Consider the continuous function: \[f(x) =
\begin{cases}
x^2 + 2x + 3 & \quad \text{if } x < 0 \\
x+3 & \quad \text{if } 0 \le x \lt 2 \\
x^2 + 4x - 7 & \quad \text{if } 2 \le x \\
\end{cases}\] Write a function tmpFn which takes a
single argument ‘xVec’. The function should return the vector
of values of the function \(f(x)\)
evaluated at the values in ‘xVec’.
Hence plot the function
\(f(x)\) for \(-3 \lt x \lt 3\).
- [Warwick] Write a function which takes a single argument which is a
matrix. The function should return a matrix which is the same as the
function argument but every odd number is doubled.
- [Warwick] Write a function which takes two arguments ‘n’
and ‘k’ which are positive integers. It should return the \(n \times n\) matrix: \[\begin{bmatrix}
k & 1 & 0 & 0 & \cdots & 0 & 0 \\
1 & k & 1 & 0 & \cdots & 0 & 0 \\
0 & 1 & k & 1 & \cdots & 0 & 0 \\
0 & 0 & 1 & k & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots
& \vdots \\
0 & 0 & 0 & 0 & \cdots & k & 1 \\
0 & 0 & 0 & 0 & \cdots & 1 & k \\
\end{bmatrix}\]
- [Warwick] Suppose an angle \(\alpha\) is given as a positive real number
of degrees counting counter-clockwise from the positive horizontal axis.
Write a function quadrant(alpha) which returns the quadrant, 1,
2, 3 or 4, corresponding to ‘alpha’.
- [Warwick]
- Zeller’s congruence is the formula: \[f =
([2.6m-0.2] + k + y + [y/4] + [c/4] - 2c) \mod 7\] where \([x]\) denotes the integer part of \(x\); for example \([7.5]=7\).
Zeller’s congruence returns
the day of the week \(f\) given:
\(k =\) the day of the month,
\(y =\) the year in the century,
\(c =\) the first 2 digits of the year (the
century number),
\(m =\) the month
number (where January is month 11 of the preceding year, February is
month 12 of the preceding year, March is month 1, etc.). For example, the
date July 21, 1963 has \(m=5, k = 21, c=19, y
= 63\); while the date February 21, 1963 has \(m=12, k=21, c=19\) and \(y=62\).
Write a function
‘weekday(day, month, year)’ which returns the day of the week
when given the numerical inputs of the day, month and year.
Note
that the value 0 for \(f\) denotes
Sunday, 1 denotes Monday, etc.; for example, July 21, 1963 was a Sunday and gives \(f = 0\).
- Does your function work if the input parameters ‘day’,
‘month’ and ‘year’ are vectors with the same length
and with valid entries?
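A possible vectorized sketch of `weekday` follows. Evaluating the formula for July 21, 1963 (a Sunday) gives \(f = 0\), so this sketch maps \(f = 0\) to Sunday. As a deliberate deviation to avoid floating-point rounding, \([2.6m - 0.2]\) is computed as the integer quotient \((13m - 1) \div 5\), which is the same quantity.

```r
weekday <- function(day, month, year) {
  m  <- (month - 3) %% 12 + 1              # March -> 1, ..., Jan -> 11, Feb -> 12
  yr <- ifelse(month <= 2, year - 1, year) # Jan and Feb count in the previous year
  cc <- yr %/% 100                         # century number (first two digits)
  y  <- yr %% 100                          # year within the century
  # (13 * m - 1) %/% 5 equals the integer part of 2.6 * m - 0.2
  f <- ((13 * m - 1) %/% 5 + day + y + y %/% 4 + cc %/% 4 - 2 * cc) %% 7
  c("Sunday", "Monday", "Tuesday", "Wednesday",
    "Thursday", "Friday", "Saturday")[f + 1]
}

days <- weekday(c(21, 22), c(7, 11), c(1963, 1963))
```

Because every step uses vectorized arithmetic and `ifelse`, the function accepts vectors of equal length for `day`, `month` and `year`.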
- [Warwick]
- Suppose \(x_0=1\) and \(x_1=2\) and \[x_j = x_{j-1}+\frac{2}{x_{j-1}} \qquad \text{for
}j= 2,3,\ldots\] Write a function ‘testLoop’ which takes a
single argument \(n\) and returns the
first \(n-1\) values of the sequence
\(\{x_j\}_{j \ge 0}\), that is, the
values of \(x_0, x_1, x_2, \ldots ,
x_{n-2}\).
- Now write a function ‘testLoop2’ which takes a single
argument ‘yVec’ which is a vector. The function should return
\[\sum_{j=1}^{n} e^j\] where \(n\) is the length of ‘yVec’
- [Warwick] Solution of the difference equation \(x_n = r x_{n-1}(1 - x_{n-1})\) with
starting value \(x_1\).
- Write a function ‘quadmap(start, rho, niter)’ which returns
the vector \((x_1, \ldots, x_n)\) where
\(x_k=r x_{k-1}(1 - x_{k-1})\) and
\(\quad\) ‘niter’ denotes \(n\),
\(\quad\) ‘start’ denotes \(x_1\), and
\(\quad\) ‘rho’ denotes \(r\).
Try out the function you have written:
- for \(r=2\) and \(0 < x_1 < 1\) you should get \(x_n \rightarrow 0.5\) as \(n \rightarrow \infty\).
- try ‘tmp <- quadmap(start=0.95, rho=2.99, niter=500)’
Now type:
plot(tmp, type = 'l')
Also try ‘plot(tmp[300:500], type = 'l')’
- Now write a function which determines the number of iterations
needed to get \(| x_n - x_{n-1}| <
0.02\). This function has only 2 arguments: ‘start’ and
‘rho’. (For ‘start = 0.95’ and ‘rho=2.99’,
the answer is 84.)
- [Warwick] Given a vector \((x_1, ...
,x_n)\), the sample autocorrelation of lag \(k\) is defined to be \[r_k =
\frac{\sum_{i=k+1}^{n}(x_i-\bar{x})(x_{i-k}-\bar{x})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\]
- Write a function ‘autocor(xVec)’ which takes a single
argument ‘xVec’ which is a vector and returns a list of two
values: \(r_1\) and \(r_2\).
In particular, find \(r_1\) and \(r_2\) for the vector \((2, 5, 8, ..., 53, 56)\).
- (Harder) Generalize the function so that it takes two arguments: the
vector ‘xVec’ and an integer ‘k’ which lies between 1
and \(n-1\) where \(n\) is the length of ‘xVec’. The
function should return a vector of the values \((r_0 = 1, r_1, ..., r_k)\).
If you
used a loop to answer part (b), then you need to be aware that much,
much better solutions are possible. Hint: ‘sapply’.
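A sketch of the generalized function using `sapply` instead of an explicit loop is given below; it returns a plain vector \((r_0, r_1, \ldots, r_k)\). The comparison with `acf` from the stats package is only a sanity check, relying on `acf` using the same definition of the sample autocorrelation.

```r
autocor <- function(xVec, k) {
  n <- length(xVec)
  stopifnot(k >= 1, k <= n - 1)
  d <- xVec - mean(xVec)                 # deviations from the mean
  denom <- sum(d^2)
  # Lag-l autocorrelation: products of deviations l positions apart
  r <- sapply(1:k, function(lag) sum(d[(lag + 1):n] * d[1:(n - lag)]) / denom)
  c(1, r)                                # r_0 = 1 by definition
}

x <- seq(2, 56, by = 3)
r012 <- autocor(x, 2)
r_check <- as.vector(acf(x, lag.max = 2, plot = FALSE)$acf)
```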
Questions on selection in R
Suppose x <- 1:5. What is the difference
between x[NA] and x[NA_integer_]? Why? Hint:
It may have something to do with the recycling rule.
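The difference is easy to see by comparing lengths. In this small sketch, `NA` on its own is a logical value and so is recycled like any logical index, whereas `NA_integer_` and `NA_real_` are single numeric positions.

```r
x <- 1:5

by_logical <- x[NA]           # logical index, recycled to the length of x
by_integer <- x[NA_integer_]  # one (missing) position
by_real    <- x[NA_real_]     # also one (missing) position

lengths_seen <- c(length(by_logical), length(by_integer), length(by_real))
```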
Write a function whose input is a numerical matrix and that
checks whether the matrix is a lower triangular matrix, i.e. all elements
above the diagonal are 0. Hint: consider using the row and col functions.
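Following the hint, a minimal sketch (the function name `is_lower_tri` is hypothetical): `row(m) < col(m)` is TRUE exactly for the entries above the diagonal, so those entries can be selected and compared with 0.

```r
is_lower_tri <- function(m) {
  stopifnot(is.matrix(m), is.numeric(m))
  # Entries above the diagonal have row index < column index
  all(m[row(m) < col(m)] == 0)
}

lower <- matrix(c(1, 2, 0, 3), nrow = 2)   # filled by column: rows (1, 0) and (2, 3)
upper <- t(lower)
```

This sketch does not handle missing values; a matrix with an NA above the diagonal would return NA instead of TRUE or FALSE.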
(Longitudinal data) A longitudinal data set with one row per
occasion has a varying number of observations for each subject. Suppose
that a variable named ‘id’ has a unique identifier for each subject.
Some subjects have been measured on only one occasion and you would like
to perform an analysis that excludes those subjects. Suppose that the
original data frame is called ‘dd’. Write R code to create a data frame
‘ds’ that excludes the subjects that were measured only once.
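One base R possibility is sketched here with a made-up data frame `dd`: an id that occurs more than once appears among the duplicated ids, so keeping the rows whose id is in that set drops the subjects measured only once.

```r
dd <- data.frame(id = c(1, 1, 2, 3, 3, 3), y = 1:6)   # illustrative data

# ids that occur at least twice
repeated_ids <- unique(dd$id[duplicated(dd$id)])

# Keep every row belonging to a subject with more than one occasion
ds <- dd[dd$id %in% repeated_ids, , drop = FALSE]
```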
Questions on data frames and data manipulation in R
Let
d1 <- data.frame(id = c('a','a','b','c'), grade = c(1,2,1,3))
d2 <- data.frame(id = c('a','c','c','d'), year = c(3,1,3,4))
Describe the differences between the outputs of the following
commands:
merge(d1,d2)
merge(d1,d2, all.x = TRUE)
merge(d1,d2, all.y = TRUE)
merge(d1,d2, all = TRUE)
What attributes does a data frame possess? From Wickham: Advanced R
What does as.matrix() do when applied to a data frame with
columns of different types? From Wickham: Advanced R
Can you have a data frame with 0 rows? What about 0 columns? From
Wickham: Advanced R
This question illustrates how simple data manipulation can be
used to answer basic queries about data. Consider classlists for four
sections of a first year statistics course STA1000 (at http://blackwell.math.yorku.ca/MATH4939/data/clist_exercise/)
and two classlists for a second year statistics course STA2000, taken
the following year. Without any direct editing of the classlists do the
following:
- Write a function that transforms each input classlist into a data
frame with useful variables on the program of each student. Note that
information on program is encoded in a single column that can contain
information on a number of distinct variables. You need to use string
manipulation functions, e.g. sub, gsub, strsplit, to turn this column
into useful variables. Note that a space is usually a delimiter between
subfields but sometimes not. You might need to preprocess the strings
before splitting them into subfields.
- Is there evidence that a different proportion of students in the 2nd
year course go on to study statistics in the 3rd year course?
- Is there evidence that this remains true when adjusting for the
program of students in the 2nd year course?
- Are there conversions, i.e. students who change their majors to
statistics? Do they come disproportionately from some sections instead
of others?
Questions on graphics in R
Questions on functions in R
Write a function that will take a vector as input and return the
vector with NAs changed to a null string ("") if the vector is a
character vector or to 0 if it is a numeric vector.
Test your
function on extreme examples.
Extend the previous function so it returns a factor if the input
is a factor and changes NAs in factors to a factor level that is a null
string.
Test your function on extreme examples.
Extend the previous function so that the value to which NAs are
changed can be supplied as a parameter whose default value is the same as in
the previous question. Make the value to which NAs are changed
potentially depend on the type of variable.
Test your function on
extreme examples.
(Major question) Referring to question 1 on \(p\)-values,
- write a function that returns the posterior probability that \(H_0\) is true given the variables in the
question: the alternative value of \(\mu\), \(\alpha\), \(n\).
- Generate a data frame with values of these variables (hint: consider
using expand.grid) and evaluate the function on each row in
the data frame.
- Graph the results in some interesting and revealing way. You might
want to change the values of the variables that you used in creating the
data frame in order to produce a more interesting display.
- Discuss what your graphs reveal.
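As a possible starting point, the sketch below builds the grid with expand.grid from the sample sizes and effect sizes listed in question 1 and evaluates a vectorized function on every row. It stops at the power of the two-sided z-test, which is only one ingredient of the answer. How power enters the posterior probability depends on how the question is interpreted and is left to the exercise. The names grid and power_z are illustrative.

```r
# Sample sizes and alternative means listed in question 1
grid <- expand.grid(
  n     = c(2, 5, 10, 15, 100, 1000),
  mu    = c(0.2, 0.5, 0.8, 1, 5),
  alpha = 0.05
)

# Power of the two-sided z-test of H0: mu = 0 when the variance is known to be 1.
# Under the alternative, sqrt(n) * mean(X) is N(mu * sqrt(n), 1).
power_z <- function(mu, n, alpha) {
  z <- qnorm(1 - alpha / 2)
  pnorm(-z + mu * sqrt(n)) + pnorm(-z - mu * sqrt(n))
}

# power_z is vectorized, so it can be applied to whole columns at once
grid$power <- with(grid, power_z(mu, n, alpha))
head(grid)
```

Because the function is vectorized, no explicit loop over rows is needed. For a function that is not vectorized, mapply over the columns of the grid would be one alternative.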
Write a function that returns a lower triangular matrix whose
(i,j)th entry, for i > j, is i + j.
Write a function that finds the index of the first occurrence of
x in a vector y.
Write a function that turns a vector of non-negative integers
into a factor with values ‘0’, ‘1’, ‘2 or more’. Make sure that
the factor has the right ordering of levels.
Write a function that identifies whether an integer is a prime
number.
Write a function whose input is a data.frame and whose output is
the same data frame except that all factor variables have been changed
to character variables but the numeric variables are unchanged.
Questions on OOP in R
Questions on Regression
- Explore the pros and cons of Wald tests versus Likelihood Ratio
Tests. Construct an example where they give entirely different
results.
- Suppose the following is a model to estimate the pay gap between men
and women in a large organization. \(Y\) is annualized salary, \(G\) is a dummy variable equal to 0 for men
and 1 for women, \(X_1\) and \(X_2\) are other factors (e.g. experience
and education). Suppose the model has the form: \[
E(Y) = \beta_0 + \beta_1 G + \beta_2 X_1 + \beta_3 X_2 + \beta_4 G X_2
+\beta_5 X_1 X_2
\] and a least-squares regression produces the output below.
- What is the gap (the difference of women’s salaries minus)
| | Estimate | S.E. |
|---|---|---|
| \(\beta_0\) | 20 | 5.0 |
| \(\beta_1\) | 1 | 2.0 |
| \(\beta_2\) | 2 | 0.5 |
| \(\beta_3\) | 3 | 1.2 |
| \(\beta_4\) | 4 | 2.1 |
| \(\beta_5\) | 5 | 4.0 |
Questions on Causality
One of your professors uploads videos of the course to a website.
One day, he analyzes results from last year’s class and discovers that
students’ performance in the course is related to how often they viewed
the videos. Students who viewed the videos frequently tended to perform
less well on the final exam than students who viewed the videos
relatively rarely. Upon discovering this, your professor announces that
he will stop recording lectures because, he says, the videos have been
shown to cause students to perform more poorly on the course. In
answering the following questions, use clear and simple language even a
professor might be able to understand.
- Do you think the data used in this study constitute experimental
data (in the sense used in this course) or observational data?
- Explain why the number of lectures attended could be a potential
confounding factor when considering the relationship between the
frequency of viewing videos and performance on the course.
- Can you think of potential mediating factors?
- Draw causal graphs for a confounding factor and for a mediating
factor.
- Using a potential confounding factor, draw a hypothetical MC diagram
(alias Paik-Agresti diagram, alias ‘marginal-conditional plot’) to show
how students who view the videos frequently might do more poorly than
those who view them rarely, even though viewing videos may make a
positive contribution for individual students in the course.
In 1964, the Public Health Service of the United States studied
the effects of smoking on health in a sample of 42,000 households. For
men and for women in each age group, they found that those who had never
smoked were on average somewhat healthier than the current smokers, but
the current smokers were on average much healthier than the former
smokers.
- Would this data be considered observational or experimental to
examine the possible effects of smoking? Why?
- Why did they study men and women and the different age groups
separately?
- The lesson seems to be that you shouldn’t start smoking, but once
you’ve started, don’t stop. Find some plausible explanations for this
surprising relationship between quitting smoking and health. Find at
least one plausible confounding factor and one plausible mediating
factor that might account for part of the relationship.
- Conditioning on a plausible confounding factor, draw a hypothetical
MC diagram showing a conditional and an unconditional relationship
between quitting and health that is consistent with the findings of the
study.
A study investigated whether there was a higher risk of
complications when women gave birth at home with the assistance of a
midwife instead of giving birth in a maternity ward in a hospital. 400
women who chose to give birth at home and 2,000 women who gave birth in
a hospital were studied. The table below summarizes the number of
complications in each group.
- Find the rate of complication in each group: the home birth group
and the hospital birth group.
- Do you think that this is an observational study or an experimental
study? Why?
- The data suggest that it is safer (in the sense of a lower rate of
complications) to give birth at home than to give birth in the hospital.
Discuss whether this implies that a woman should consider giving birth
at home in order to reduce her risk of complications. Identify at least
one plausible confounding factor and one plausible mediating factor that
could partly explain the results of the study.
- Choose a possible confounding factor and use an MC diagram to show
how controlling for this confounding factor could reverse the direction
of association between the rate of complications and the location of
birth: home or hospital.
| | Complications | No Complications | Total |
|---|---|---|---|
| Home Births | 20 | 380 | 400 |
| Hospital Births | 200 | 1800 | 2000 |
| Total | 220 | 2180 | 2400 |
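The rates asked for in the first part can be read off the table with a row-wise proportion. The following is a minimal sketch in which the object names births and rates are illustrative; it reproduces only the arithmetic, not the interpretation asked for in the later parts.

```r
# Counts from the table above (rows: place of birth, columns: outcome)
births <- matrix(
  c(20, 380,
    200, 1800),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Home Births", "Hospital Births"),
                  c("Complications", "No Complications"))
)

# Row-wise proportions: each row sums to 1
rates <- prop.table(births, margin = 1)

# Complication rate in each group
rates[, "Complications"]
```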
Mary (a woman) is choosing between restaurants A and B to take
her friend, John (a man), out for dinner.
Restaurant A has an average rating of 4.1 and restaurant B of 4.3. But
looking at ratings by gender, among men, restaurant A has a rating of
4.0 and restaurant B of 3.8. Among women, restaurant A has a rating of
4.6 and restaurant B of 4.4. It seems that men and women separately
prefer restaurant A but together they prefer restaurant B!
- Draw an MC diagram conditioning on the gender with restaurant on the
horizontal axis and average ratings on the vertical axis to explain how
this apparent contradiction could arise.
- Draw a causal graph describing the relationships among the three
variables: restaurant, gender and ratings.
- Assuming that there are no other significant factors related to
restaurant ratings, what kind of variable is gender in this
context?
- Which restaurant should Mary choose? Why?
- Add an appropriate ‘do-line’ to the MC diagram.
Fedor is choosing between restaurants A and B to take his friend,
Jaspreet, out for dinner. Restaurant A has an average rating of 4.1 and
restaurant B of 4.3. Each restaurant has six waiters. Restaurant A has
one good waiter and 5 bad ones. Restaurant B has 5 good waiters and 1
bad one. Customers can’t choose the waiter they get.
Ratings among customers getting a good waiter are 4.6 at restaurant A
and 4.4 at restaurant B. Ratings among customers getting a bad waiter
are 4.0 at restaurant A and 3.8 at restaurant B. So, overall, customers
prefer restaurant B but among those getting a bad waiter, they prefer
restaurant A and among those getting a good waiter they also prefer
restaurant A! In other words, the bad waiters at restaurant A are better
than the bad waiter at restaurant B and the good waiter at restaurant A
is better than the good waiters at restaurant B. Suppose that you can’t
choose your waiter and that your chances of getting each type of waiter
at either restaurant are similar to the proportions in the ratings.
- Draw an MC diagram conditioning on the quality of the waiter with
restaurant on the horizontal axis and average ratings on the vertical
axis to explain how this apparent contradiction could arise.
- Draw a causal graph describing the relationships among the three
variables: restaurant, quality of waiter and ratings.
- Assuming that there are no other significant factors related to
restaurant ratings, what kind of variable is the quality of the waiter
in this context?
- Which restaurant should Fedor choose? Why?
- Add an appropriate ‘do-line’ to the MC diagram.
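Before drawing the diagram it can help to check that the numbers in the question are mutually consistent. The sketch below assumes, as an illustration, that customers are spread evenly over the six waiters at each restaurant, so that the share of customers served by a good waiter is 1/6 at A and 5/6 at B. The names rating, share and overall are illustrative.

```r
# Conditional average ratings given in the question
rating <- rbind(A = c(good = 4.6, bad = 4.0),
                B = c(good = 4.4, bad = 3.8))

# Assumed share of customers served by each type of waiter
# (taken to equal the share of waiters of that type)
share <- rbind(A = c(good = 1/6, bad = 5/6),
               B = c(good = 5/6, bad = 1/6))

# Overall rating = average of conditional ratings weighted by the shares
overall <- rowSums(rating * share)
round(overall, 1)
```

Under this assumption the weighted averages reproduce the overall ratings of 4.1 and 4.3 stated in the question.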
Paradoxes, Fallacies and One Correct Statement
Here are 23 statements or questions about statistics, mainly about
regression, for you to ponder and comment on.
Is each statement true, false or does its truth depend on unstated
conditions? In the last case, on what conditions does it depend, and
how?
Warning: Every statement but one expresses a widely
held fallacy or half truth about statistics. Even professional
statisticians are fooled by many of these statements. Many of the ideas
expressed in these statements may be reasonable when applied to certain
problems but can lead to serious modelling errors when applied in the
wrong context. It is very important to understand the importance of the
context in which statistical modelling takes place. At the very least
you must always consider:
- the nature of the questions you want to address: Are they causal,
predictive or descriptive?
- the nature of the data: Was there random assignment, or random
selection? Can the data be considered to be representative of some
population or process?
- the consequences of different types of errors.
Health and Weight
Suppose you are studying how some measure of health is related to
weight. You are looking at a multiple regression of health on height and
weight but you observe that what you are really interested in is the
relationship between health and weight relative to height. For
simplicity suppose that the residual of the regression of weight on
height is a meaningful measure of relative weight and that health is
related linearly to this variable. What you should do is to compute the
residuals of weight on height and replace weight in the model with this
new variable. The resulting coefficient of ‘excess weight’ will give a
better estimate of the effect of excess weight. True? False? It
depends?
Measurement error: confounding factor or target variable
Suppose you are studying observational data on the relationship
between Health and Coffee (measured in grams of caffeine consumed per
day). Suppose you want to control for a possible confounding factor
‘Stress’. In this kind of study it is more important to make sure that
you measure coffee consumption accurately than it is to make sure that
you measure ‘stress’ accurately. True? False? It depends?
Biases in class size surveys?
A survey of students at York reveals that the average class size of
the classes they attend is 130. A survey of faculty shows an average
class size of 30. The students must be exaggerating their class sizes or
the faculty under-reporting. True? False? It depends?
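One scenario worth considering when commenting on this statement is that the two surveys average over different units. The sketch below uses entirely hypothetical class sizes (not data from any survey) and assumes one instructor per class and one class per student. The names sizes, per_class and per_student are illustrative.

```r
# Hypothetical sizes of five classes
sizes <- c(10, 10, 10, 10, 200)

# Average over classes: each class counts once
per_class <- mean(sizes)

# Average over students: each class counts once for every student in it
per_student <- weighted.mean(sizes, w = sizes)

c(per_class = per_class, per_student = per_student)
```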
Biases in wealth surveys?
A survey of Canadian families yielded average ‘equity’ (i.e. total
owned in real estate, bonds, stocks, etc. minus total owed) of $48,000.
Aggregate government data of the total equity in the Canadian population
shows that this figure must be much larger, in fact more than twice as
large. This shows that respondents must tend to dramatically underreport
their equity. True? False? It depends?
Dropping variables to simplify a model
In a multiple regression of Y on three predictors, X1, X2 and X3,
suppose the coefficients of each of X2 and X3, are not significant. It
is safe to drop these two variables and perform a regression on X1
alone. Dropping a number of variables with non-significant coefficients
results in a model that fits almost as well as the original model.
True? False? It depends?
Comparing two groups
If smoking really is bad for your health, you expect that a
comparison of a group of people who have quit smoking with a group that
has continued to smoke will reveal that the group quitting is, on
average, healthier than the group that continued. True? False? It
depends?
Dropping a non-significant predictor
In a multiple regression, if you drop a predictor whose effect is not
significant, the coefficients of the other predictors should not change
very much, nor should the p-values associated with them. True?
False? It depends?
Interpretation of MLE
We use maximum likelihood to estimate parameters because the
parameter value with the highest likelihood is the value that has the
highest probability of being correct. ‘Likelihood’ is just a different
word for ‘probability’. True? False? It depends?
Forward stepwise or backward stepwise regression?
If you want to reduce the number of predictor variables in a model,
forward stepwise regression will do a good job of identifying which
variables you should keep. What about backward stepwise regression?
True? False? It depends?
Non-significant interaction
In a regression model with two predictors X1 and X2, and an
interaction term between the two predictors, it is dangerous to
interpret the ‘main’ effects of X1 and X2 without further qualification.
However, it is okay to do so if the interaction term is not significant.
True? False? It depends?
Comparing best and worst outcomes
In a model to assess the effect of a number of treatments on some
outcome, we can estimate the difference between the best treatment and
the worst treatment by using the difference in the mean outcomes.
True? False? It depends?
Interaction and collinearity
In general we don’t need to worry about interactions between
variables unless there is a correlation between them. True? False?
It depends?
Confounding factor and association with predictor
In general, a variable cannot be a ‘confounding factor’ for the
effect of another variable unless they are associated with each other.
True? False? It depends?
Interaction implies correlation
If two variables have a strong interaction, this implies a strong
correlation. True? False? It depends?
Imputing a missing grade
You need to impute a mid-term grade for a student who missed the
mid-term test with a valid excuse. The best way to predict the missing
mid-term grade is to perform a regression of mid-term grades on final
grades using the data from students who wrote both and use the predicted
mid-term grade based on the final exam grade of the student who missed
the mid-term test. True? False? It depends?
Discuss the relative consequences of using
- the predicted mid-term grade based on the regression equation of the
mid-term grades on the final grades,
- the student’s raw grade on the final,
- the mid-term score that has the same z-score as the student’s z-score
on the final, and
- the mid-term grade that would have predicted the student’s actual
final grade according to the regression equation of the final on the
mid-term.
If you had to choose one of these four, which would you choose and
why?
If you have a better solution for the previous problem, what is it
and why?
p-values and error rate in publications
If all scientists used a p-value of 0.05 to decide which results to
publish, that would ensure that at most 5% of published results would be
incorrect. True? False? It depends?
Significance with added variables
If a variable X1 is not significant in a regression of Y on X1 then
it will be even less significant in a regression of Y on both X1 and X2
where X2 is another variable. This follows since there is less
variability left to explain in a model that already includes X2 than in
a model that does not. True? False? It depends?
Dropping redundant variables
The best way to deal with high collinearity between predictors is to
drop predictors that are not significant. True? False? It
depends?
AIC
AIC is useful to identify the best model among a set of models that
you have selected after exploring your data if the models are not nested
within each other. True? False? It depends?
Comparing groups
A recent study showed that people who sleep more than 9 hours per
night on average have a higher chance of premature death than those who
sleep fewer than 9 hours. This does not necessarily mean that sleeping
more than 9 hours on average is bad for your health because the sample
might not have been representative. True? False? It
depends?
Error rate and posterior probability
Suppose a screening test for steroid drug use has a specificity of
95% and a sensitivity of 95%. This means that the test is incorrect 5%
of the time. Therefore, if John takes the test and the result is
‘positive’ (i.e. the test indicates that John takes steroid drugs) the
probability that he does not take steroid drugs is only 5%. True?
False? It depends?
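To see what the statement leaves unstated, it can help to compute the probability in question by Bayes' rule for a few values of the prevalence of steroid use among those tested. The prevalences below are hypothetical and chosen only for illustration; the function name prob_nonuser_given_positive is likewise illustrative.

```r
# P(does not take steroids | positive test) by Bayes' rule
prob_nonuser_given_positive <- function(prevalence,
                                        sensitivity = 0.95,
                                        specificity = 0.95) {
  true_pos  <- sensitivity * prevalence              # users who test positive
  false_pos <- (1 - specificity) * (1 - prevalence)  # non-users who test positive
  false_pos / (true_pos + false_pos)
}

# Hypothetical prevalences of steroid use among those tested
prevalence <- c(0.01, 0.10, 0.50)
round(prob_nonuser_given_positive(prevalence), 3)
```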
Importance of predictors
In a multiple regression, the predictor that is most important is the
one with the smallest p-value. True? False? It depends?
Questions to be classified
References
Bergland, Christopher. 2019. “Rethinking P-Values: Is "Statistical Significance" Useless?” Psychology Today. March 22, 2019. https://www.psychologytoday.com/blog/the-athletes-way/201903/rethinking-p-values-is-statistical-significance-useless.
Ioannidis, John P. A. 2005. “Why Most Published Research Findings Are False.” PLoS Medicine 2 (8): 6. https://journals.plos.org/plosmedicine/article/file?id=10.1371%2Fjournal.pmed.0020124&type=printable.
Ioannidis, John P. A. 2019. “What Have We (Not) Learnt from Millions of Scientific Papers with P Values?” The American Statistician 73 (March): 20–25. https://doi.org/10.1080/00031305.2018.1447512.
Oliver, John, dir. 2016. Last Week Tonight with John Oliver: Scientific Studies. HBO. https://www.youtube.com/watch?v=0Rnq1NpHdmw.
Schervish, Mark J. 1996. “P Values: What They Are and What They Are Not.” The American Statistician 50 (3): 203–6.
Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA Statement on p-Values: Context, Process, and Purpose.” The American Statistician 70 (2): 129–33. https://doi.org/10.1080/00031305.2016.1154108.
Wasserstein, Ronald L., Allen L. Schirm, and Nicole A. Lazar. 2019. “Moving to a World Beyond ‘p \(<\) 0.05’.” The American Statistician 73 (March): 1–19. https://doi.org/10.1080/00031305.2019.1583913.
Wickham, Hadley. 2019. Advanced R. 2nd ed. CRC Press. https://adv-r.hadley.nz/.