1 Creating a data set from a given population

Some personal habits (yours should differ)

  • I like to keep my examples in a data frame
  • I like to use the pipe operator ‘%>%’ – an old habit from my Unix days
    • the pipe operator comes from the ‘magrittr’ package, but loading ‘spida2’ also makes it available, so I only need to load ‘spida2’
    • A keyboard shortcut in RStudio to get ‘%>%’ is Ctrl-Shift-M
    • A pipe allows you to feed the result of a command as the first argument of a new function.
  • I like to use the ‘within’ function to create new variables within a data frame. ‘within’ is in base R; the similar function in the tidyverse is ‘mutate’ (see the short sketch after this list).
  • I had to do graphics for multilevel data long before ‘ggplot2’ and the ‘tidyverse’ came along, so most of my examples are based on the ‘original’ ‘lattice’ package and its extensions in the ‘latticeExtra’ package.
  • There are a number of functions that I use a lot from the ‘spida2’ and ‘car’ packages. You might like to try to find alternatives in the ‘tidyverse’ and ‘ggplot2’.
    • spida2::xqplot: quick way to summarize variables in a data set. Shows univariate outliers, odd distributions, proportion of missing values, etc.
    • spida2::waldx: Wald tests.
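For anyone more at home in the tidyverse, here is a minimal sketch of the ‘within’/‘mutate’ parallel mentioned above (assuming ‘dplyr’ is installed; the little data frame ‘df’ is just for illustration):

df <- data.frame(x = 1:3)
within(df, y <- x^2)          # base R: create y inside the data frame
dplyr::mutate(df, y = x^2)    # tidyverse equivalent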

A silly example of pipes:

Generate a normal sample, exponentiate, and take the resulting mean three ways:

# # 1:
sam <- rnorm(1000)   # 1,000 standard normal random variates
esam <- exp(sam)
mean(esam)
## [1] 1.606366
mean(esam, trim = 0.1) # trim 10% from each end
## [1] 1.188872
?mean

Bonus Question: Why did ‘trim’ make a big difference?
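A nudge toward the bonus question (not the answer): look at the shape of an exponentiated normal sample.

hist(exp(rnorm(10000)), breaks = 100)   # what do the tails look like?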

# # 2:

mean(exp(rnorm(1000)))
## [1] 1.657761
mean(exp(rnorm(1000)), trim = 0.1)
## [1] 1.250043
# # 3:

library(spida2)  # or just: library(magrittr)
rnorm(1000) %>% exp %>% mean
## [1] 1.646945
rnorm(1000) %>% exp %>% mean(trim = 0.1)  # you can add arguments  
## [1] 1.194628
# Could also do:

1000 %>% rnorm %>% exp %>% mean  # but that's getting silly
## [1] 1.685241
1000 %>% rnorm %>% exp %>% mean(trim = 0.1) 
## [1] 1.25778

Question: How would you make this ‘reproducible’ so you get the same result each time?
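One common way, in case you want to check your answer: fix the random number generator’s seed before generating (the seed value below is arbitrary).

set.seed(2023)   # any fixed seed gives the same 'random' numbers on every run
rnorm(1000) %>% exp %>% mean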

1.1 Generate 3,000 observations from a bivariate normal distribution

Suppose we want the population to have:

  • means of 68 for each variable,
  • standard deviation of 3, and
  • correlation of 0.6

It’s tempting to generate the two variables separately, but then they would be uncorrelated. How do we build in the correlation?

We could use the ‘mvtnorm’ package, but we’ll do it from scratch.
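For reference, here is a minimal sketch of the ‘mvtnorm’ route (assuming the package is installed); we won’t use it below:

library(mvtnorm)
Sigma <- matrix(c(9, 5.4, 5.4, 9), 2, 2)     # variances of 9, covariance 0.6 * 3 * 3 = 5.4
dmv <- rmvnorm(3000, mean = c(68, 68), sigma = Sigma)
cor(dmv)                                     # off-diagonal should be close to 0.6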

  • Start with the X variable
N <- 3000
dd <- data.frame(x = rnorm(N) * 3 + 68)
head(dd)
##          x
## 1 69.54928
## 2 69.27920
## 3 68.03191
## 4 69.95720
## 5 67.15944
## 6 67.96604

We can generate y by working out the equation of the population regression line and the distribution of the error term.

We know:

\[ \begin{aligned} \operatorname{Cov}(X,Y) & = \rho_{XY} \sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)} \\ & = 0.6 \times \sqrt{9 \times 9} \\ & = 5.4 \\ \beta_{Y|X} & = \operatorname{Cov}(X,Y) / \operatorname{Var}(X) \\ & = 5.4 / 9 \\ & = 0.6 \\ \operatorname{Var}(Y|X) & = \operatorname{Var}(Y) - \frac{\operatorname{Cov}(X,Y)^2}{\operatorname{Var}(X)} \\ & = 9 - \frac{5.4^2}{9} \\ & = 5.76 \end{aligned} \]

The ‘error terms’ are independent of \(X\) so we can generate them with

dd %>% within(
  { 
    e <- rnorm(N) * sqrt(5.76)   # conditional SD = sqrt(5.76) = 2.4
    y <- 68 + 0.6 * (x - 68) + e
    e <- NULL   # we never know e in real data
  }
) -> dd       
head(dd)
##          x        y
## 1 69.54928 72.54836
## 2 69.27920 72.59332
## 3 68.03191 75.43821
## 4 69.95720 75.71101
## 5 67.15944 76.08974
## 6 67.96604 72.39187
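A quick sanity check that the simulated data are close to the population targets (means of 68, standard deviations of 3, correlation of 0.6):

c(mean(dd$x), mean(dd$y), sd(dd$x), sd(dd$y), cor(dd$x, dd$y))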
xqplot(dd)    # uniform quantile plot

xqplot(dd, ptype = 'n')  # normal quantile plot

library(lattice)
library(latticeExtra)

xyplot(y ~ x, dd)  # plots immediately

pl <- xyplot(y ~ x, dd, asp = 1)  # assigning the result doesn't plot it

pl  # printing the object plots it

fit <- lm(y ~ x, dd)
summary(fit)
## 
## Call:
## lm(formula = y ~ x, data = dd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7965 -1.8189 -0.0617  1.8553 10.0598 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.67938    1.15854   29.07   <2e-16 ***
## x            0.59297    0.01702   34.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.771 on 2998 degrees of freedom
## Multiple R-squared:  0.2883, Adjusted R-squared:  0.288 
## F-statistic:  1214 on 1 and 2998 DF,  p-value: < 2.2e-16
# How do these estimates compare with the population regression line?
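# The population line derived above has slope 0.6; a confidence interval for the
# estimated slope should usually cover it:
confint(fit)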

dd$fit <- predict(fit)

pl + xyplot(fit ~ x, dd, type = 'l', col = 'red') -> pl   # creates new plotting object
pl    # plots it

pl + layer(panel.dell(..., lwd = 5, radius = c(1,2,3), col = 'purple')) # play with this

# then assign it to pl if you plan to add more

pl + layer(panel.dell(..., lwd = 1, radius = c(1,2,3), col = 'red')) -> pl
pl

# 
# Add the SD (standard deviation line) that goes through the point of means
# and has slope equal to SD(Y)/SD(X)
# 
dd %>% within(
  sdline <- local(
    {
      xm <- mean(x)
      ym <- mean(y)
      xs <- sd(x)
      ys <- sd(y)
      sd.slope <- ys/xs
      ym + sd.slope * (x - xm)  # the value of the last expression is what local() returns
    }
  )
) -> dd
head(dd)
##          x        y      fit   sdline
## 1 69.54928 72.54836 74.91999 75.70201
## 2 69.27920 72.59332 74.75985 75.40375
## 3 68.03191 75.43821 74.02024 74.02625
## 4 69.95720 75.71101 75.16188 76.15252
## 5 67.15944 76.08974 73.50289 73.06270
## 6 67.96604 72.39187 73.98118 73.95351
pl + xyplot(sdline ~ x, dd, type = 'l', lwd = 2, col = 'green') # play with this to get it the way you want

pl + xyplot(sdline ~ x, dd, type = 'l', lwd = 2, lty = 2, col = 'green') -> pl 

update(pl, key = list(
  space = 'right', 
  lines = list(col = c('green','red'), lty= c(2,1), lwd = c(2,1)),
  text = list(c('SD line', 'Regression line'))))

2 Eyeballing a scatterplot

library(car)  # 'car' stands for 'Companion to Applied Regression'
## Loading required package: carData
help(package = carData) # some interesting data sets you can play with

xqplot(Prestige)                # uniform quantile plot (classroom photo)

                                # a straight diagonal indicates a uniform distribution
                                # a normal variable shows an S (or reverse-S) shape
xqplot(Prestige, ptype = 'n')   # normal quantile plots

                                # a straight diagonal indicates a normal distribution
                                # a uniform variable shows an S (or reverse-S) shape

Question: How would you describe the skewness and/or kurtosis of these variables?

dd <- Prestige  # dd is easier to type

dd$linc <- log(dd$income)

dd$lwomen <- qlogis(dd$women/100)  # turn proportions into log odds
xqplot(dd, ptype = 'n')
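One thing to watch for (it comes back to bite us below): if any occupation has ‘women’ equal to 0 or 100, the log odds are infinite.

range(dd$women)              # are 0 or 100 present?
sum(!is.finite(dd$lwomen))   # how many infinite log-odds values did we create?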

graphics::pairs(dd) # scatterplot matrix (more than one package has a function named 'pairs')

library(corrgram)
## 
## Attaching package: 'corrgram'
## The following object is masked from 'package:latticeExtra':
## 
##     panel.ellipse
## The following object is masked from 'package:lattice':
## 
##     panel.fill
graphics::pairs(dd, diag.panel = panel.density, lower.panel = panel.ellipse) 

car::scatterplotMatrix(dd) %>% try    # oops! it seems to conk out after the 1st column; 'try' prevents the error from stopping execution
## Warning in smoother(x[subs], y[subs], col = smoother.args$col[i], log.x =
## FALSE, : could not fit smooth

## Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
##   NA/NaN/Inf in 'y'
# Example of a very important tool: lapply applies a function to each element of a list
# and returns the list with each element transformed.
# 
# Assigning back to 'dd[]' instead of 'dd' preserves dd's structure as
# a data frame and avoids turning it into a simple list.
# 

dd[] <- lapply(dd, function(x) {     # set infinite values to NAs
  x[!is.finite(x)] <- NA
  x
})

scatterplotMatrix(dd)  # now it works! NAs were okay but not Infs

scatterplotMatrix(~ education + income + women | type, dd)

Explore the R Graph Gallery

3 Added-variable plots and Component-plus-residual plots

The AVP (added-variable plot, known in SAS PROC REG as the ‘Partial Regression Leverage Plot’) and the ‘Partial Residual Plot’, also known as the ‘component-plus-residual plot’, are often confused. They look similar, but they often convey very different information.

I’ll show how to produce them with the ‘car’ package using the ‘smoking’ data as an example.

library(p3d)
## Loading required package: rgl
## 
## Attaching package: 'p3d'
## The following object is masked from 'package:car':
## 
##     Identify3d
## The following objects are masked from 'package:spida2':
## 
##     cell, center, ConjComp, dell, disp, ell, ell.conj, ellbox, ellplus,
##     ellpt, ellptc, ellpts, ellptsc, elltan, elltanc, na.include, uv
dd <- Smoking2   # I'm a bad typist so short names with repeated letters work well for me
head(dd)
##           X   Country Cig.pAdult.pYear GDPpCapita Year HE.pCap HE.pcGDP LE.all
## Albania   1   Albania             1201       4030 2011     569      6.8  77.41
## Algeria   2   Algeria              577       5244 2011     437      5.4  74.02
## Angola    3    Angola              397       5148 2011     183      3.3  38.20
## Argentina 4 Argentina             1014      10941 2011    1062      7.4  76.95
## Armenia   5   Armenia             2083       3305 2011     224      3.8  72.68
## Australia 6 Australia             1130      60642 2011    3365      8.5  81.81
##           LE.male LE.female LE.all2 LE.male2 LE.female2 Continent
## Albania     74.82     80.30    76.4     73.4       79.7    Europe
## Algeria     72.35     75.77    72.3     70.9       73.7    Africa
## Angola      37.24     39.22      NA       NA         NA    Africa
## Argentina   73.71     80.36    75.3     71.6       79.1  Americas
## Armenia     69.06     76.81    72.0     68.4       75.1      Asia
## Australia   79.40     84.35    81.2     78.9       83.6   Oceania
##                              Region    LE   HE Smoking   GDP
## Albania             Southern Europe 77.41  569    1201  4030
## Algeria             Northern Africa 74.02  437     577  5244
## Angola                Middle Africa 38.20  183     397  5148
## Argentina             South America 76.95 1062    1014 10941
## Armenia                Western Asia 72.68  224    2083  3305
## Australia Australia and New Zealand 81.81 3365    1130 60642
xqplot(dd)  # uniform quantile plot (classroom picture plot)

dd$Cigarettes <- dd$Smoking/365.25   # cigarettes/day is a better unit than cigarettes/year
dd$Life <- dd$LE
dd$Health_Exp <- dd$HE

Init3d(cex=1)
Plot3d( Life ~ Cigarettes + Health_Exp | Continent, dd)
##   Continent     col
## 1    Africa    blue
## 2  Americas   green
## 3      Asia  orange
## 4    Europe magenta
## 5   Oceania    cyan
## Use left mouse to rotate, middle mouse (or scroll) to zoom, right mouse to change perspective
fg() # brings 3d window to the foreground in Windows. Don't know what it does for Macs.
Axes3d()
spinto()
spinto(phi=15, theta = 10)
rglwidget()     # creates a plot that can be manipulated in html output

AVPs and CPRs provide 2D views of higher-dimensional data

fit <- lm(Life ~ Cigarettes + Health_Exp, dd)
summary(fit)
## 
## Call:
## lm(formula = Life ~ Cigarettes + Health_Exp, data = dd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.895  -3.099   1.483   5.226  11.113 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.386e+01  1.169e+00   54.62  < 2e-16 ***
## Cigarettes  1.546e+00  3.753e-01    4.12 7.17e-05 ***
## Health_Exp  3.009e-03  5.179e-04    5.81 5.66e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.221 on 115 degrees of freedom
##   (111 observations deleted due to missingness)
## Multiple R-squared:  0.4096, Adjusted R-squared:  0.3993 
## F-statistic: 39.89 on 2 and 115 DF,  p-value: 6.937e-14

Interpretation? One extra cigarette a day adds 1.546 years to your life!

Regression diagnostics:

Four important plots: residuals vs. fitted values, a normal Q-Q plot of the residuals, a scale-location plot, and residuals vs. leverage

plot(fit, ask = F)

Added variable plots

  • the vertical axis is Y - Yhat, where Yhat comes from regressing Y on the other predictors
  • the horizontal axis is X - Xhat, where Xhat comes from regressing X on the other predictors

Good for:

  • spotting cases with high leverage on coefficient
  • some forms of lack of linearity
library(car)  
avPlots(fit, ellipse = T) # default uses 'robust' ellipses so centre is not necessarily mean

avPlots(fit, ellipse = list(robust = F))
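To see what avPlots is computing, here is a minimal hand-rolled added-variable plot for Cigarettes (a sketch reusing ‘fit’ and its model frame; the object names are just for illustration):

mf <- model.frame(fit)                             # the complete cases used in 'fit'
res_y <- resid(lm(Life ~ Health_Exp, mf))          # Y with the other predictor removed
res_x <- resid(lm(Cigarettes ~ Health_Exp, mf))    # Cigarettes with the other predictor removed
xyplot(res_y ~ res_x, type = c('p', 'r'),
       xlab = 'Cigarettes | others', ylab = 'Life | others')
# the slope of the line in this plot equals the Cigarettes coefficient in 'fit'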

CR plots

  • Same slope and residuals above/below line as AVP
  • But X is actual X (or X - mean(X))
  • Might not show leverage points
  • Often better than AVP at showing non-linearity
crPlots(fit, ellipse = T) # note: crPlots doesn't take an 'ellipse' argument, so it's passed on to the plotting functions and produces the warnings below
## Warning in plot.window(...): "ellipse" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "ellipse" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "ellipse" is not a
## graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "ellipse" is not a
## graphical parameter
## Warning in box(...): "ellipse" is not a graphical parameter
## Warning in title(...): "ellipse" is not a graphical parameter
## Warning in plot.window(...): "ellipse" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "ellipse" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "ellipse" is not a
## graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "ellipse" is not a
## graphical parameter
## Warning in box(...): "ellipse" is not a graphical parameter
## Warning in title(...): "ellipse" is not a graphical parameter
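And a minimal hand-rolled component-plus-residual plot for Cigarettes (a sketch: the partial residual is the model residual plus the Cigarettes component of the fit):

b <- coef(fit)['Cigarettes']              # coefficient of Cigarettes in 'fit'
mf <- model.frame(fit)                    # the complete cases used in 'fit'
pres <- resid(fit) + b * mf$Cigarettes    # component + residual
xyplot(pres ~ mf$Cigarettes, type = c('p', 'r'),
       xlab = 'Cigarettes', ylab = 'Component + residual')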

Contrast with AVPs

avPlots(fit, ellipse = list(robust = F))

And ‘marginal’ plots

xyplot(Life ~ Cigarettes, dd) 

xyplot(Life ~ Health_Exp, dd) 

Fancier: add fitted lines from linear and quadratic models in Cigarettes

fit1 <- lm(Life ~ Cigarettes, dd)
fit2 <- lm(Life ~ Cigarettes + I(Cigarettes^2), dd)
dd$fit1 <- predict(fit1, newdata = dd)
dd$fit2 <- predict(fit2, newdata = dd)

dd <- sortdf(dd, ~ Cigarettes)
td(pch = 16)
pl <- xyplot(Life ~ Cigarettes, dd, groups = Continent, auto.key = T) +
  xyplot(fit1 ~ Cigarettes, dd, type = 'l', lwd = 2, col = 'blue') +
  xyplot(fit2 ~ Cigarettes, dd, type = 'l', lwd = 2, col = 'red') 
pl  

update(pl, key = 
         list(
           title = 'Model',
           space = 'left',
           lines = list(
             col=c('red','blue'),
             lwd = 2), 
           text = list(
             c('quadratic','linear')
             )
           )
       )  

xyplot(Life ~ Health_Exp, dd, groups = Continent, auto.key = T) 

4 Easter Egg: ROC curves

The ROC curve shows how a classifier performs over the range of possible cutoffs.

An example of a ‘classifier’ is the process in which

  • a null hypothesis and an alternative are formulated
  • a sample size is selected
  • a statistic is computed from the sample
  • the observed value of the statistic is compared with a pre-selected critical point to ‘decide’ to ‘Reject H0’ or to ‘Fail to Reject H0’
  • for each choice of a critical value, we calculate:
    • alpha: the probability of Type I error: the probability of rejecting H0 when it’s true, and
    • beta: here, the power of the test, i.e., the probability of rejecting H0 when the alternative is true (conventionally ‘beta’ denotes the probability of Type II error, so power = 1 - beta)
  • The ROC curve plots beta on the vertical axis and alpha on the horizontal axis. Some versions reverse this or use 1-beta instead of beta.

Let’s draw the ROC curve for a test of H0: \(\mu = 0, \sigma = 1\), versus HA: \(\mu = 1, \sigma =1\), using a sample size of 4.

The test is based on the sample mean, which has standard deviation \(1/\sqrt{4} = 1/2\) and therefore a normal(0, \(\sigma = 1/2\)) distribution under the null hypothesis and a normal(1, \(\sigma = 1/2\)) distribution under the alternative.

Let’s take the critical values as ranging from -5 to 5:

data.frame(cval = seq(-5, 5, .01)) %>%
  within(
    {
      alpha <- pnorm(cval, mean = 0, sd = 1/2, lower.tail = FALSE)
      beta <- pnorm(cval, mean = 1, sd = 1/2, lower.tail = FALSE)
    }
) -> dd

xyplot(beta~alpha, dd, type = 'l', xlim = c(0,1), ylim=c(0,1), asp = 1)

# focusing on the left edge near alpha = .05

xyplot(beta~alpha, dd, type = 'l', xlim = c(0,.1)) +
  layer(panel.abline(v= .05)) +
  layer(panel.abline(h=.64))   # the power is approx. .64 (I got this by trial and error)
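Instead of trial and error, the power at alpha = .05 can be computed directly from the two sampling distributions:

cv <- qnorm(.95, mean = 0, sd = 1/2)                 # critical value giving alpha = .05
pnorm(cv, mean = 1, sd = 1/2, lower.tail = FALSE)    # power, approximately 0.64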