These questions have been inspired by many sources.
The purpose of the assignment is to explore the meaning of \(p\)-values. Before starting, stop and reflect on what it means for an experiment to ‘achieve’ a \(p\)-value of 0.049. What meaning can we give to the quantity ‘0.049’? How is it related to the probability that the null hypothesis is correct?
To keep things very simple, suppose you want to test \(H_0: \mu = 0\) versus \(H_1: \mu \neq 0\) and you are designing an experiment in which you plan to take a sample \(X_1, X_2, \ldots, X_n\) of iid \(\textrm{N}(\mu,1)\) random variables, i.e. the variance is known to be equal to 1. You plan to use the usual test based on \(\bar{X}_n\), rejecting \(H_0\) for values of \(\bar{X}_n\) that are far from 0.
An applied example would be testing for a change in the value of a response when all subjects are subjected to the same conditions and the measurement error of the response is known. In that example \(X_i\) would be the ‘gain score’, i.e. the post-test response minus the pre-test response exhibited by the \(i\)th subject.
Let the selected probability of Type I error be \(\alpha = 0.05\). Consider collecting samples of size \(n\) where \(n\) equals one of the following:
\(i\) | \(n\) |
---|---|
1 | 10 |
2 | 20 |
3 | 100 |
4 | 200 |
5 | 1,000 |
6 | 10,000 |
Consider using the following values of \(\mu_j\):
\(j\) | \(\mu_j\) | Cohen’s terminology for effect size: \(\mu_j/\sigma\) |
---|---|---|
1 | 0.2 | small effect size |
2 | 0.5 | medium effect size |
3 | 0.8 | large effect size |
4 | 1 | very large effect size |
5 | 5 | huge effect size |
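As a starting point for thinking about this design, the two-sided test described above can be sketched in R. The helper names `z_p_value` and `z_power` are hypothetical (they do not come from the assignment), and the sketch assumes the known-variance \(\textrm{N}(\mu,1)\) model stated above, so that \(\sqrt{n}\,\bar{X}_n\) is standard normal under \(H_0\).

```r
# Two-sided z-test of H0: mu = 0 when X_1, ..., X_n are iid N(mu, 1).
# z_p_value() and z_power() are illustrative names, not part of the assignment.

# p-value for an observed sample mean `xbar` from a sample of size `n`
z_p_value <- function(xbar, n) {
  z <- sqrt(n) * xbar              # under H0, z ~ N(0, 1)
  2 * pnorm(-abs(z))
}

# Probability of rejecting H0 at level `alpha` when the true mean is `mu`
z_power <- function(mu, n, alpha = 0.05) {
  crit  <- qnorm(1 - alpha / 2)    # two-sided critical value
  shift <- sqrt(n) * mu            # mean of the z statistic when the mean is mu
  pnorm(-crit - shift) + pnorm(-crit + shift)
}

# Power for every combination of n and mu in the two tables above
grid <- expand.grid(n  = c(10, 20, 100, 200, 1000, 10000),
                    mu = c(0.2, 0.5, 0.8, 1, 5))
grid$power <- z_power(grid$mu, grid$n)
head(grid)
```

Setting `mu = 0` in `z_power` returns the Type I error rate, which is a quick sanity check on the formula.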
To delve more deeply into these issues, you can read Wasserstein and Lazar (2016) and Wasserstein, Schirm, and Lazar (2019). Concerns about \(p\)-values have been around for a long time; see Schervish (1996). For a short overview, see Bergland (2019). Two influential and erstwhile controversial papers are John P. A. Ioannidis (2005) and John P. A. Ioannidis (2019).
For an entertaining take on related issues see Oliver (2016) (warning: contains strong language and political irony that many could consider offensive – watch at your own risk!).
d1 <- data.frame(id = c('a','a','b','c'), grade = c(1,2,1,3))
d2 <- data.frame(id = c('a','c','c','d'), year = c(3,1,3,4))
Describe the differences between the outputs of the following
commands:
merge(d1,d2)
merge(d1,d2, all.x = TRUE)
merge(d1,d2, all.y = TRUE)
merge(d1,d2, all = TRUE)
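One way to begin comparing the four calls is to collect their results in a list and look at the row counts and the `id` values that survive each merge. This is only a sketch of how to inspect the outputs; the name `results` is an illustrative choice, and explaining why the counts differ is left to the exercise.

```r
# Same toy data frames as in the question
d1 <- data.frame(id = c('a','a','b','c'), grade = c(1,2,1,3))
d2 <- data.frame(id = c('a','c','c','d'), year = c(3,1,3,4))

# Run the four variants side by side (list element names are arbitrary labels)
results <- list(
  default = merge(d1, d2),
  all.x   = merge(d1, d2, all.x = TRUE),
  all.y   = merge(d1, d2, all.y = TRUE),
  all     = merge(d1, d2, all = TRUE)
)

# Number of rows in each result
sapply(results, nrow)

# Which ids appear in each result
lapply(results, function(r) sort(unique(as.character(r$id))))
```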
Two research assistants have collected data for a study. The two RAs worked with different subjects, and each gathered their data into a spreadsheet. All of the important variables have the same names and the same definitions, but each RA has also collected data on a few variables unique to that RA. Also, the order of the columns is different in the two spreadsheets.
d1 <- data.frame(id = c('a','a','b','c'), grade = c(1,2,1,3))
d2 <- data.frame(id = c('a','c','c','d'), year = c(3,1,3,4))
Find out what the following terms mean and show how to achieve each operation using ‘merge’. Hint: the last two operations are much easier if you first create a new variable in each data frame to serve as a ‘key’ (i.e. the variable named in the ‘by’ argument of the call to ‘merge’). (Note: the ‘key’ is the set of variables used to match the rows of the two data frames. These are either the variable names supplied to the ‘by’ argument or, by default, the intersection of the vectors of variable names in each data frame.)
Use the ‘Vocab’ data set in the ‘car’ package. It records a vocabulary score for over 30,000 subjects tested over the years between 1974 and 2016.
Explore the following questions. Use appropriate tables and graphs to explain your findings.
Some of these questions illustrate important potential pitfalls in using factor variables. As a result, some developers eschew them and prefer to work with character variables as much as possible. Factors, however, are invaluable for many statistical applications since they allow the creation of different orderings of the values in a character vector, which is often important for statistical modeling and for graphics.
Describe the difference, if any, between the following, and explain why it arises (note that the behaviour of the ‘factor’ function has changed over time):
factor(c(1, 2, 10))
factor(as.character(c(1, 2, 10)))
Suppose `x <- factor(c(1, 2, 10))`. Write a function that would allow you to recover the original values, 1, 2 and 10, from a factor like `x`. Why does `as.numeric(x)` not work?
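A minimal sketch of one possible approach is to go through the level labels rather than the internal codes. The function name `factor_to_numeric` is hypothetical, and the sketch assumes every level label can be parsed as a number.

```r
# Recover numeric values from a factor whose labels are numbers.
# factor_to_numeric() is an illustrative name, not from the assignment.
factor_to_numeric <- function(f) {
  stopifnot(is.factor(f))
  as.numeric(as.character(f))   # convert the labels, not the integer codes
}

x <- factor(c(1, 2, 10))
factor_to_numeric(x)   # the original values
as.numeric(x)          # the internal integer codes instead
```

Comparing the two printed results side by side is a useful first step toward answering the second part of the question.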
Let `f1 <- factor(c('a','b','c'))` and `f2 <- factor(c('A','B','C'))`. What’s the matter with the result of the expression `ifelse(f1 == 'a', f1, f2)`? Explain why it fails to produce characters as a result. Fix it so it does.
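One possible fix, sketched here under the assumption that a character result is what is wanted, is to convert both factors to character before they reach `ifelse`. The variable names `broken` and `fixed` are illustrative.

```r
f1 <- factor(c('a','b','c'))
f2 <- factor(c('A','B','C'))

# The original expression: inspect what comes back
broken <- ifelse(f1 == 'a', f1, f2)
str(broken)

# Converting to character first keeps the labels
fixed <- ifelse(f1 == 'a', as.character(f1), as.character(f2))
str(fixed)
```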
Indexing with factors: Consider
df <- data.frame(c = 1:3, a = 11:13, b = 21:23)
fac <- factor(c('a','b','c'))
df[[fac[1]]]
df[[as.character(fac[1])]]
Explain why the last two lines of the code above produce different results.
What happens to a factor when you modify its levels with the `levels<-` replacement function? Give examples to illustrate your answer.
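Two small experiments can help here. The sketch below uses made-up labels (`'low'`, `'mid'`, `'high'`) and is meant only as a template for your own examples: the first assignment supplies a new label for each level, the second repeats a label.

```r
f <- factor(c('low', 'high', 'low', 'mid'))
levels(f)                                # default ordering of the labels

# Experiment 1: a new label for each existing level, position by position
g <- f
levels(g) <- c('H', 'L', 'M')
as.character(g)

# Experiment 2: give two levels the same label
h <- f
levels(h) <- c('high', 'low', 'low')
levels(h)
as.character(h)
```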
Describe the main differences between the four taxonomies of objects in R: typeof, mode, storage.mode and class.
Let `x <- letters`. What is the class and mode of `x`? Let `y <- as.factor(x)`. What is the class and mode of `y`? Why does this make sense … or not?
A factor is a kind of object used to represent character variables for statistical analysis. Add a factor to the list used to display the classifications of atomic objects above. Play with some factors and use `str` to explain the curious values returned by `typeof`, `mode`, `storage.mode` and `class` for a factor.
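A small helper makes it easy to line up the four classifications for any object. The name `describe_object` is hypothetical; the sketch simply calls the four functions named in the question and collects their results.

```r
# Collect the four classifications of an object in one named vector.
# describe_object() is an illustrative helper, not from the assignment.
describe_object <- function(obj) {
  c(typeof       = typeof(obj),
    mode         = mode(obj),
    storage.mode = storage.mode(obj),
    class        = paste(class(obj), collapse = "/"))
}

x <- letters
y <- as.factor(x)

describe_object(x)
describe_object(y)
str(y)              # shows the levels and the underlying codes
str(unclass(y))     # the same object with the class attribute removed
```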
What makes is.vector() and is.numeric() fundamentally different to is.list() and is.character()? From Wickham: Advanced R
Why is `1 == "1"` true? Why is `-1 < FALSE` true? Why is `"one" < 2` false? From Wickham: Advanced R
Why is the default missing value, NA, a logical vector? What’s special about logical vectors? (Hint: think about c(FALSE, NA_character_).) From Wickham: Advanced R
Does `-1:2` produce the same result as `0-1:2`? Why or why not?
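Evaluating both expressions, and adding explicit parentheses to each plausible reading, is a quick way to see how R parses them. The variable names below are illustrative.

```r
a <- -1:2
b <- 0-1:2

a
b

# Explicit parentheses for the two candidate groupings
(-1):2
-(1:2)
0 - (1:2)
```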
Which of the following assignments use valid names?
a_very_long_name <- 0
_tmp <- 2
.tmp <- 2
..val <- 3
.2regression <- TRUE
._2_val <- 'a'
Write an R Markdown script that illustrates the use of at least 5 functions from the subgroup ‘Ordering and tabulating’ of the group ‘Statistics’ at http://adv-r.had.co.nz/Vocabulary.html
Write an R Markdown script that illustrates the use of at least 5 functions from the subgroup ‘Linear models’ of the group ‘Statistics’ at http://adv-r.had.co.nz/Vocabulary.html
Write an R Markdown script that illustrates the use of at least 5 functions from the subgroup ‘Miscellaneous tests’ of the group ‘Statistics’ at http://adv-r.had.co.nz/Vocabulary.html
Write an R Markdown script that illustrates the use of at least 5 functions from the subgroup ‘Random variables’ of the group ‘Statistics’ at http://adv-r.had.co.nz/Vocabulary.html. Include interesting graphs.
Write an R Markdown script that illustrates the use of at least 5 functions from the subgroup ‘Matrix algebra’ of the group ‘Statistics’ at http://adv-r.had.co.nz/Vocabulary.html
7. Questions on programming in R
Many of these questions ask you to write a function to accomplish some goal, instead of just requiring an expression. The advantage of writing a function is that you can easily test your code by trying your function on extreme or impossible values.
x <- c(TRUE, FALSE, 0L)
typeof(x)
TRUE | NA
x <- c('0','10','5','20','15','10','0','5')
Let `x <- 1:5`. What output would `x[NA]` produce? What output would `x[NA_real_]` produce? Describe the reason for the difference, if any.
What is the difference between `[`, `[[`, and `$` when applied to a list?
When subsetting with `[`, when should you use `drop = FALSE`? Include arrays and factors in your discussion.
If `x` is a matrix, what does `x[] <- 0` do? How is it different from `x <- 0`?
If `mtcars` is a data frame, why does `mtcars[1:20]` return an error? How does it differ from the similar `mtcars[1:20, ]`?
mtcars[mtcars$cyl = 4, ]
mtcars[-1:4, ]
mtcars[mtcars$cyl <= 5]
mtcars[mtcars$cyl == 4 | 6, ]
If `df` is a data frame, what does `df[is.na(df)] <- 0` do? How does it work?
Let `y <- sample(1000, 30, replace = TRUE)`. Write functions in R to do the following. Test each function.
set.seed(75)
m <- matrix(sample(10, 60, replace = T), nrow = 6)
What is the difference between `paste(x, y, sep = ':')` and `paste(x, y, collapse = ':')`? Illustrate.
Using the `hs` data set in the `spida2` package, create a plot with two panels showing histograms displaying the distribution of school sizes in the Public and in the Catholic sectors. Use the functions `capply` and `up` in the `spida2` package. You may also use any other approach to compare with the use of `capply` and `up`.
Using the `hs` data set in the `spida2` package, create a plot with two panels showing histograms displaying the distribution of sample sizes in each school in the Public and in the Catholic sectors. Use the functions `capply` and `up` in the `spida2` package. You may also use any other approach to compare with the use of `capply` and `up`.
Using the `hs` data set in the `spida2` package, create a plot with two panels showing scatterplots displaying the relationship between mean `mathach` and mean `ses` in each school in the Public and in the Catholic sectors. Explore reasonable transformations and regression lines, both linear and non-parametric, in the plots. Use the functions `capply` and `up` in the `spida2` package. You may also use any other approach to compare with the use of `capply` and `up`.
set.seed(50)
xVec <- sample(0:999, 250, replace = T)
yVec <- sample(0:999, 250, replace = T)
Suppose \(\mathbf{x} = (x_1, x_2, \ldots, x_n)\) denotes the vector `xVec` and similarly for \(\mathbf{y}\).
Mary Jones
Tarik Mohammed
Smith, Jim
Tom O'Brian
Victor Lindquist
Chow, Vincent
Wong, Mary
Some names are in the format ‘First Last’ and others ‘Last, First’. Write a function to extract the full names, in the format ‘Last, First’, of all the individuals whose first name is ‘Mary’.
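A rough sketch of one approach is to normalise every name to ‘Last, First’ and then filter on the first name. The function name `find_first_name` and the vector `people` are hypothetical, and the sketch assumes that in the ‘First Last’ format the first name is a single word with no embedded spaces.

```r
# The names from the question, one string per person
people <- c("Mary Jones", "Tarik Mohammed", "Smith, Jim", "Tom O'Brian",
            "Victor Lindquist", "Chow, Vincent", "Wong, Mary")

# find_first_name() is an illustrative name, not from the assignment.
find_first_name <- function(full_names, first = "Mary") {
  full_names <- trimws(full_names)
  has_comma  <- grepl(",", full_names, fixed = TRUE)

  # Rewrite 'First Last' as 'Last, First'; leave 'Last, First' untouched
  last_first <- full_names
  last_first[!has_comma] <- sub("^([^ ]+) +(.*)$", "\\2, \\1",
                                full_names[!has_comma])

  # Everything after the comma is taken to be the first name
  first_names <- sub("^[^,]*, *", "", last_first)
  last_first[first_names == first]
}

find_first_name(people)
```

Names with middle names or multi-word first names would need a more careful rule, which is a good source of extreme test cases.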
set.seed(75)
aMat <- matrix( sample(10, size = 60, replace = T), nr = 6)
plot(tmp, type = 'l')
Also try `plot(tmp[300:500], type = 'l')`.
Suppose `x <- 1:5`. What is the difference between `x[NA]` and `x[NA_integer_]`? Why? Hint: It may have something to do with the recycling rule.
Write a function whose input is a numerical matrix and that checks whether the matrix is a lower diagonal matrix, i.e. all elements above the diagonal are 0. Hint: consider using the `row` and `col` functions.
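Following the hint, one possible sketch compares `row(m)` with `col(m)` to pick out the entries above the diagonal. The name `is_lower_diagonal` is hypothetical, and the sketch does not try to handle missing values.

```r
# TRUE if every element above the main diagonal is 0.
# is_lower_diagonal() is an illustrative name, not from the assignment.
is_lower_diagonal <- function(m) {
  stopifnot(is.matrix(m), is.numeric(m))
  above <- row(m) < col(m)      # logical mask of entries above the diagonal
  all(m[above] == 0)
}

is_lower_diagonal(diag(3))
is_lower_diagonal(matrix(1, nrow = 2, ncol = 2))
is_lower_diagonal(matrix(c(1, 2, 0, 3), nrow = 2))
```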
(Longitudinal data) A longitudinal data set with one row per occasion has a varying number of observations for each subject. Suppose that a variable named ‘id’ has a unique identifier for each subject. Some subjects have been measured on only one occasion and you would like to perform an analysis that excludes those subjects. Suppose that the original data frame is called ‘dd’. Write R code to create a data frame ‘ds’ that excludes the subjects that were measured only once.
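One way to approach this, sketched on a made-up data frame, is to attach the number of rows per subject to every row with `ave` and then subset. The toy `dd` and the helper variable `n_obs` are assumptions for illustration only; the real `dd` would have more variables.

```r
# Toy longitudinal data: subject 2 was measured only once
dd <- data.frame(id = c(1, 1, 2, 3, 3, 3),
                 y  = c(5, 6, 7, 8, 9, 10))

# Number of occasions for each subject, repeated on each of that subject's rows
n_obs <- ave(seq_along(dd$id), dd$id, FUN = length)

# Keep only subjects observed more than once
ds <- dd[n_obs > 1, ]
ds
```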
Let
d1 <- data.frame(id = c('a','a','b','c'), grade = c(1,2,1,3))
d2 <- data.frame(id = c('a','c','c','d'), year = c(3,1,3,4))
Describe the differences between the outputs of the following
commands:
merge(d1,d2)
merge(d1,d2, all.x = TRUE)
merge(d1,d2, all.y = TRUE)
merge(d1,d2, all = TRUE)
What attributes does a data frame possess? From Wickham: Advanced R
What does as.matrix() do when applied to a data frame with columns of different types? From Wickham: Advanced R
Can you have a data frame with 0 rows? What about 0 columns? From Wickham: Advanced R
This question illustrates how simple data manipulation can be used to answer basic queries about data. Consider classlists for four sections of a first year statistics course STA1000 (at http://blackwell.math.yorku.ca/MATH4939/data/clist_exercise/) and two classlists for a second year statistics course STA2000, taken the following year. Without any direct editing of the classlists do the following:
Write a function that will take a vector as input and return the vector with NAs changed to a null string (`""`) if the vector is a character vector or to 0 if it is a numeric vector. Test your function on extreme examples.
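A minimal sketch of the first version might look as follows. The name `na_to_default` is hypothetical, and leaving vectors of other types unchanged is a design choice of this sketch rather than something the question specifies.

```r
# Replace NAs by "" in character vectors and by 0 in numeric vectors.
# na_to_default() is an illustrative name; other types are returned unchanged.
na_to_default <- function(v) {
  if (is.character(v)) {
    v[is.na(v)] <- ""
  } else if (is.numeric(v)) {
    v[is.na(v)] <- 0
  }
  v
}

na_to_default(c("a", NA, "c"))
na_to_default(c(1.5, NA, 3))
na_to_default(character(0))      # extreme case: empty input
na_to_default(c(NA_real_, NA))   # extreme case: all missing
```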
Extend the previous function so it returns a factor if the input is a factor and changes NAs in factors to a factor level that is a null string. Test your function on extreme examples.
Extend the previous function so that the value to which NAs are changed can be supplied as a parameter whose default value is the same as in the previous question. Make the value to which NAs are changed potentially depend on the type of variable. Test your function on extreme examples.
(Major question) Referring to question 1 on \(p\)-values,
`expand.grid`) and evaluate the function on each row in the data frame.
Write a function that returns a lower diagonal matrix whose \((i,j)\)th entry, for \(i > j\), is \(i + j\).
Write a function that finds the index of the first occurrence of x in a vector y.
Write a function that turns a vector of non-negative integers into a factor with values ‘0’, ‘1’, ‘2 or more’. Make sure that the factor has the right ordering of levels.
Write a function that identifies whether an integer is a prime number.
Write a function whose input is a data.frame and whose output is the same data frame except that all factor variables have been changed to character variables but the numeric variables are unchanged.
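One possible sketch for the data frame conversion applies a per-column rule with `lapply` and assigns back with `df[] <-` so that the result stays a data frame. The name `factors_to_character` and the toy data are illustrative assumptions.

```r
# Convert factor columns to character; leave all other columns as they are.
# factors_to_character() is an illustrative name, not from the assignment.
factors_to_character <- function(df) {
  stopifnot(is.data.frame(df))
  df[] <- lapply(df, function(col) {
    if (is.factor(col)) as.character(col) else col
  })
  df
}

toy <- data.frame(g = factor(c('a', 'b')), v = c(1.5, 2))
out <- factors_to_character(toy)
sapply(out, class)
```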
One of your professors uploads videos of the course to a website. One day, he analyzes results from last year’s class and discovers that students’ performance in the course is related to how often they viewed the videos. Students who viewed the videos frequently tended to perform less well on the final exam than students who viewed the videos relatively rarely. Upon discovering this, your professor announces that he will stop recording lectures because, he says, the videos have been shown to cause students to perform more poorly on the course. In answering the following questions, use clear and simple language even a professor might be able to understand.
In 1964, the Public Health Service of the United States studied the effects of smoking on health in a sample of 42,000 households. For men and for women in each age group, they found that those who had never smoked were on average somewhat healthier than the current smokers, but the current smokers were on average much healthier than the former smokers.
A study investigated whether there was a higher risk of complications when women gave birth at home with the assistance of a midwife instead of giving birth in a maternity ward in a hospital. 400 women who chose to give birth at home and 2,000 women who gave birth in a hospital were studied. The table below summarizes the number of complications in each group.
Birth setting | Complications | No Complications | Total |
---|---|---|---|
Home Births | 20 | 380 | 400 |
Hospital Births | 200 | 1800 | 2000 |
Total | 220 | 2180 | 2400 |
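The raw complication rates in the two groups follow directly from the table. The sketch below only reproduces that arithmetic; the object name `births` is illustrative, and the computation says nothing about why the rates differ.

```r
# Counts taken from the table above
births <- data.frame(
  setting       = c("Home", "Hospital"),
  complications = c(20, 200),
  total         = c(400, 2000)
)

# Observed proportion of births with complications in each setting
births$rate <- births$complications / births$total
births
```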
Mary (a woman) is choosing between restaurants A and B to take
her friend, John (a man), out for dinner.
Restaurant A has an average rating of 4.1 and restaurant B of 4.3. But
looking at ratings by gender, among men, restaurant A has a rating of
4.0 and restaurant B of 3.8. Among women, restaurant A has a rating of
4.6 and restaurant B of 4.4. It seems that men and women separately
prefer restaurant A but together they prefer restaurant B!
Fedor is choosing between restaurants A and B to take his friend, Jaspreet, out for dinner. Restaurant A has an average rating of 4.1 and restaurant B of 4.3. Each restaurant has six waiters. Restaurant A has one good waiter and 5 bad ones. Restaurant B has 5 good waiters and 1 bad one. Customers can’t choose the waiter they get.
Ratings among customers getting a good waiter are 4.6 at restaurant A and 4.4 at restaurant B. Ratings among customers getting a bad waiter are 4.0 at restaurant A and 3.8 at restaurant B. So, overall, customers prefer restaurant B but among those getting a bad waiter, they prefer restaurant A and among those getting a good waiter they also prefer restaurant A! In other words, the bad waiters at restaurant A are better than the bad waiter at restaurant B and the good waiter at restaurant A is better than the good waiters at restaurant B. Suppose that you can’t choose your waiter and that your chances of getting each type of waiter at either restaurant are similar to the proportions in the ratings.
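The overall ratings quoted above can be checked as weighted averages of the within-group ratings. The sketch assumes that customers are spread evenly across the six waiters at each restaurant, so that the weights are the numbers of good and bad waiters; the object names are illustrative.

```r
# Ratings by waiter type (good, bad) and number of waiters of each type
rating_A  <- c(good = 4.6, bad = 4.0)
rating_B  <- c(good = 4.4, bad = 3.8)
waiters_A <- c(good = 1, bad = 5)
waiters_B <- c(good = 5, bad = 1)

# Overall rating = average of the group ratings weighted by group size
overall_A <- weighted.mean(rating_A, waiters_A)
overall_B <- weighted.mean(rating_B, waiters_B)
c(A = overall_A, B = overall_B)
```

Under this assumption the within-group ratings and the group sizes reproduce the overall ratings of 4.1 and 4.3 given in the question.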