To really understand regression, you need to be able to approach a problem from many different angles. I can think of at least 8 representations that complement each other. To master regression you need to know how to go from one representation to another and you need to know how to work within the right representation to think about your problem and to solve it.
Here are eight ways of thinking about regression. Some are very powerful for developing the mathematical theory of regression, other are best suited to visualize the interpretation of coefficients for a particular application.
\[ Y = X \beta + \epsilon,\: \epsilon \sim N(0,\sigma^2) \] and all the theory that follows, e.g. \[ Var(\hat{\beta}) = \sigma^2 \left( X'X \right)^{-1} \] \[ \hat{Y} = X(X'X)^{-1}X' Y = P_X Y \] where \(P_X\) is the matrix of the orthogonal projection of \(\mathbb{R}^n\) onto \(\textrm{span}(X)\).
\[ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 z_i + \beta_4 x_i z_i \epsilon_i \\ \frac{\partial E(y)}{\partial x} = \beta_1 + 2\,\beta_2 x + \beta_4 z \]
|   
|   Call:
|   lm(formula = income ~ education * type, data = Prestige)
|   
|   Residuals:
|       Min      1Q  Median      3Q     Max 
|   -6330.8 -1769.2  -356.8  1166.5 17326.2 
|   
|   Coefficients:
|                      Estimate Std. Error t value Pr(>|t|)  
|   (Intercept)         -1865.0     3682.3  -0.506   0.6137  
|   education             866.0      436.4   1.984   0.0502 .
|   typeprof            -3068.4     7191.8  -0.427   0.6706  
|   typewc               3646.5     9274.0   0.393   0.6951  
|   education:typeprof    234.0      617.3   0.379   0.7055  
|   education:typewc     -569.2      884.8  -0.643   0.5216  
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 3333 on 92 degrees of freedom
|     (4 observations deleted due to missingness)
|   Multiple R-squared:  0.4105,    Adjusted R-squared:  0.3785 
|   F-statistic: 12.81 on 5 and 92 DF,  p-value: 1.856e-09library(car)
library(lattice)
xyplot(income ~ education | type, Prestige, 
       type = c('p','r','smooth'))library(spida2)
library(latticeExtra)
fit <- lm(income ~  education + women, data = Prestige)
summary(fit)|   
|   Call:
|   lm(formula = income ~ education + women, data = Prestige)
|   
|   Residuals:
|       Min      1Q  Median      3Q     Max 
|   -7257.6 -1160.1  -238.6   681.1 16044.3 
|   
|   Coefficients:
|                Estimate Std. Error t value Pr(>|t|)    
|   (Intercept) -1491.998   1162.299  -1.284    0.202    
|   education     944.881    103.731   9.109 9.60e-15 ***
|   women         -64.056      8.921  -7.180 1.31e-10 ***
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 2839 on 99 degrees of freedom
|   Multiple R-squared:  0.5618,    Adjusted R-squared:  0.5529 
|   F-statistic: 63.46 on 2 and 99 DF,  p-value: < 2.2e-16plot(rbind(cell(fit),0),type= 'n',
  xlab = expression(beta[education]),
  ylab = expression(beta[women]))
lines(cell(fit,dfn=2), type = 'l', col = 'blue') 
lines(cell(fit,dfn=1), type = 'l', col = 'red') 
abline(h=0)
abline(v=0)
points(c(0,0), pch = 18)Figure: Confidence ellipse for two parameters jointly. The blue ellipse has 95% coverage in 2 dimensions and its perpendicular shadows onto the vertical and horizontal axes form Scheffe 95% confidence intervals for testing in a space of dimension 2. The similar shadows of the red ellipse provide ordinary 95% confidence intervals.
Figure: Hypothetical causal graph of relationships among variables
Figure: Variable Space: Hilbert space representation of variables in which each point represents a variable
The most important representation is the interpretation of the model in the real world. Real world factors, such as the design, the nature of random assignment, the nature of random selection are fundamental in determining the interpretation of the model and on the strategy for model development, selection and interpretation.
This is where you determine the nature of the data: observational or experimental, and the nature of the questions: predictive or causal or descriptive.
\[ Y = X \beta + \varepsilon \] where
\(Y\) is a vector of length \(n\) representing \(n\) observations on a ‘response’ or ‘dependent’ variable,
\(X\) is a \(n \times p\) matrix representing \(n\) observations on each of \(p\) ‘predictor’ or ‘independent’ variables. The first column frequently consists of 1’s.
\(\beta\) is a vector of \(p\) parameters whose values are unknown and some aspect of which we wish to estimate. If the first column of \(X\) consists of 1’s it is customary to number the elements of \(\beta\) starting from 0: \(\beta = (\beta_0, \beta_1, ... , \beta_{p-1})'\).
\(\varepsilon\) is a vector of length \(n\) representing ‘errors’ or ‘residuals’ that are not directly observed.
If \(X\) is of full column rank (i.e. \(\mathrm{rank}(X) = p\)) and if we assume that \(\varepsilon \sim N_n( 0, \sigma^2 I)\) where \(I\) is the \(n \times n\) identity matrix, and if \(rank(X) = p\), then the UMVUE (Uniformly minimum variance unbiased estimator) of \(\beta\) is \[ \hat{\beta} = (X'X)^{-1}X'Y \] with \(E(\hat{\beta}) = \beta\) and \(Var(\hat{\beta}) = \sigma^2 (X'X)^{-1}\)
We can estimate or test hypotheses concerning one or more linear combinations of the \(\beta\)s by forming a \(h \times p\) hypothesis matrix \(L\) and estimating the function of parameters: \[ \eta = L \beta \]
For a model with three parameters: \(\beta = (\beta_0, \beta_1, \beta_2)'\) we can simultaneously estimate the sum and difference of \(\beta_1\) and \(\beta_2\) as follows.
Letting \[ L = \left[ \begin{array}{r} 0 & 1 & 1 \\ 0 & 1 & -1 \end{array} \right] \] we get \[ \begin{align} \eta = \left[\begin{array}{c} \eta_1 \\ \eta_ 2 \end{array} \right] &= L \beta \\ &= \left[ \begin{array}{r} 0 & 1 & 1 \\ 0 & 1 & -1 \end{array}\right] \left[ \begin{array}{r} \beta_0 \\ \beta_1 \\ \beta_2 \end{array} \right] \\ & = \left[ \begin{array}{c} \beta_1+\beta_2 \\ \beta_1 - \beta_2\end{array} \right] \end{align} \]
Letting \[ \hat{\eta} = L \hat{\beta} \] we have \[ E(\hat{\eta}) = E(L \hat{\beta})= L E(\hat{\beta})= L \beta \] and \[ Var(\hat{\eta}) = \sigma^2 L (X'X)^{-1} L' \] If \(L\) is of full row rank \(h\) and \(X\) is of full column rank we can test the hypothesis \[ H_0: \eta = L \beta = 0 \] against the alternative that \(\eta \neq 0\) (i.e. that \(\eta_i \neq 0\) for at least one \(i\)) by using the null distribution: \[ \begin{align} \hat{\eta}' \left( \widehat{Var} (\hat{\eta}) \right)^{-1} \hat{\eta} &= \frac{\hat{\beta}'L' \left( L (X'X)^{-1} L' \right)^{-1} L \hat{\beta}}{s_e^2}\\ &\sim h \times F_{h,\nu} \end{align} \] where \(s_e\) is the ‘residual standard error’: \[ s_e^2 = \frac{|| Y - X \hat{\beta} ||^2}{\nu} \] with \(\nu = n - p\) the degrees of freedom for the estimate \(s_e^2\) of \(\sigma^2\) and \(F_{h,\nu}\) is the \(F\) distribution with \(h\) and \(\nu\) degrees of freedom.
With complex models:
Why do we pay attention to regression output? Because it may make some sense for very simple additive models – but even then it is fraught with subtle traps most analysts do not understand.
We will illustrate these with the Smoking and Life Expectancy example using country-level data in 2004.
dall <- read.csv(paste0("http://",server,"/data/Smoking3.csv"))
dd <- subset( dall, sex == 'BTSX') # subset of a data frame
dd$LifeExp <- dd$lifeexp.Birth # Life expectancy at birth
dd$LE <- dd$LifeExp
dd$smoke <- dd$consumption.cigPC # cigarette consumption 
                                 # per adult per year  
dd$HE <- dd$HealthExpPC.Tot.ppp  # health expenditures per capita 
                                 # in US$ PPP
dd$hiv <- dd$hiv_prev15_49  # prevalence of HIV in 
                            # population 15 to 49
dd$special <- ifelse(
        dd$country %in% c('Angola','Sierra Leone',
                          'Equatorial Guinea'),
        1,
        0)  # indicator variable for 3 outlying countries
head(dd)|                   country iso3 region HealthExpPC.Govt.exch HealthExpPC.Tot.ppp
|   1           Afghanistan  AFG    EMR                  8.72               50.47
|   5                Angola  AGO    AFR                114.61              214.58
|   7               Albania  ALB    EUR                114.21              565.20
|   10              Andorra  AND    EUR               2246.75             3073.26
|   15 United Arab Emirates  ARE    EMR               1219.89             1732.13
|   17            Argentina  ARG    AMR                540.82             1433.70
|      HealthExpPC.Govt.ppp HealthExpPC.Tot.exch   total    govt private  sex
|   1                  7.87                55.93   50.47    7.87   42.60 BTSX
|   5                132.04               186.26  214.58  132.04   82.54 BTSX
|   7                253.49               254.64  565.20  253.49  311.71 BTSX
|   10              2257.23              3058.98 3073.26 2257.23  816.03 BTSX
|   15              1288.52              1639.87 1732.13 1288.52  443.61 BTSX
|   17               869.44               891.80 1433.70  869.44  564.26 BTSX
|      lifeexp.Birth lifeexp.At60 smoking.tobacco.current smoking.tobacco.daily
|   1             60           16                      NA                    NA
|   5             51           16                      NA                    NA
|   7             74           19                      40                    36
|   10            82           25                      35                    31
|   15            76           19                      10                     7
|   17            76           21                      27                    22
|      smoking.cig.current smoking.cig.daily Pop.Total Pop.MedAge Pop.pCntUnder15
|   1                   NA                NA     29825      16.20           47.42
|   5                   NA                NA     20821      16.18           47.58
|   7                   40                36      3162      32.56           21.33
|   10                  35                31        78         NA           15.20
|   15                   8                 5      9206      29.37           14.41
|   17                  26                20     41087      30.83           24.42
|      Pop.pCntOver60 Pop.pCntAnnGrowth consumption.cigPC hiv_prev15_49 LifeExp LE
|   1            3.82              -2.4                61           0.0      60 60
|   5            3.84              -3.1               414           2.1      51 51
|   7           14.93              -0.3              1116            NA      74 74
|   10          22.86               0.0               784            NA      82 82
|   15           0.81              -3.1               583            NA      76 76
|   17          14.97              -0.9              1042           0.4      76 76
|      smoke      HE hiv special
|   1     61   50.47 0.0       0
|   5    414  214.58 2.1       1
|   7   1116  565.20  NA       0
|   10   784 3073.26  NA       0
|   15   583 1732.13  NA       0
|   17  1042 1433.70 0.4       0fit.hiv2 <- lm( LifeExp ~ 
                  log(HE) * (smoke + I(smoke^2)) + hiv+special, 
                dd, 
                na.action = na.exclude) 
summary(fit.hiv2)|   
|   Call:
|   lm(formula = LifeExp ~ log(HE) * (smoke + I(smoke^2)) + hiv + 
|       special, data = dd, na.action = na.exclude)
|   
|   Residuals:
|       Min      1Q  Median      3Q     Max 
|   -9.0373 -2.3005  0.2043  2.0760  9.7344 
|   
|   Coefficients:
|                        Estimate Std. Error t value Pr(>|t|)    
|   (Intercept)         3.283e+01  2.674e+00  12.280  < 2e-16 ***
|   log(HE)             6.091e+00  5.024e-01  12.124  < 2e-16 ***
|   smoke               3.642e-02  7.520e-03   4.844 3.31e-06 ***
|   I(smoke^2)         -1.518e-05  3.946e-06  -3.846 0.000181 ***
|   hiv                -7.351e-01  7.593e-02  -9.681  < 2e-16 ***
|   special            -1.822e+01  2.137e+00  -8.526 2.11e-14 ***
|   log(HE):smoke      -4.878e-03  1.155e-03  -4.223 4.30e-05 ***
|   log(HE):I(smoke^2)  2.007e-06  5.726e-07   3.504 0.000614 ***
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 3.63 on 141 degrees of freedom
|     (45 observations deleted due to missingness)
|   Multiple R-squared:  0.8613,    Adjusted R-squared:  0.8544 
|   F-statistic:   125 on 7 and 141 DF,  p-value: < 2.2e-16Do we need curvature in ‘smoke’?
|    numDF denDF  F.value p.value
|        8   141 6968.992 <.00001
|                      Estimate   Std.Error DF  t-value   p-value Lower 0.95
|   (Intercept)         32.831342 2.673604  141 12.279805 <.00001  27.545809
|   log(HE)              6.091328 0.502427  141 12.123808 <.00001   5.098064
|   smoke                0.036424 0.007520  141  4.843564 <.00001   0.021557
|   I(smoke^2)          -0.000015 0.000004  141 -3.846440 0.00018  -0.000023
|   hiv                 -0.735098 0.075932  141 -9.681048 <.00001  -0.885210
|   special            -18.223307 2.137312  141 -8.526276 <.00001 -22.448626
|   log(HE):smoke       -0.004878 0.001155  141 -4.223209 0.00004  -0.007162
|   log(HE):I(smoke^2)   0.000002 0.000001  141  3.504458 0.00061   0.000001
|                      Upper 0.95
|   (Intercept)         38.116875
|   log(HE)              7.084592
|   smoke                0.051290
|   I(smoke^2)          -0.000007
|   hiv                 -0.584986
|   special            -13.997989
|   log(HE):smoke       -0.002595
|   log(HE):I(smoke^2)   0.000003|     numDF denDF  F.value p.value
|   2     2   141 8.835643 0.00024
|                      Estimate Std.Error DF  t-value   p-value Lower 0.95
|   I(smoke^2)         -1.5e-05 4e-06     141 -3.846440 0.00018 -2.3e-05  
|   log(HE):I(smoke^2)  2.0e-06 1e-06     141  3.504458 0.00061  1.0e-06  
|                      Upper 0.95
|   I(smoke^2)         -7e-06    
|   log(HE):I(smoke^2)  3e-06How about interaction?
|     numDF denDF F.value p.value
|   :     2   141  9.2372 0.00017
|                      Estimate  Std.Error DF  t-value   p-value Lower 0.95
|   log(HE):smoke      -0.004878 0.001155  141 -4.223209 0.00004 -0.007162 
|   log(HE):I(smoke^2)  0.000002 0.000001  141  3.504458 0.00061  0.000001 
|                      Upper 0.95
|   log(HE):smoke      -0.002595 
|   log(HE):I(smoke^2)  0.000003Create a prediction data frame over which to estimate fitted values. We will look at countries with low hiv and exclude outliers.
pred <- expand.grid(
  HE = c(50,150,500, 1000, 1500, 5000), 
  smoke = seq(10,2000,20), 
  hiv = 0, 
  special = 0)Finding \(\hat{Y}\) over a grid of values:
pred$yhat <- predict(fit.hiv2, newdata = pred)
gd(lwd = 2) # no groups
gd(lwd = c(2,2)) # groups
xyplot(yhat ~ smoke | factor(HE), pred, type = 'l',
       layout = c(6,1))It’s a good idea to make the order in the legend match the physical location in the graph, as much as feasible.
xyplot(yhat ~ smoke , groups = factor(HE), pred, type = 'l',
       auto.key=list(space='right', lines = T, points = F, 
                     reverse.rows = TRUE))Suppose we want to estimate the slope of these fitted curves: i.e. the ’effect’of an additional cigarette as a function of health expenditures and amount smoked.
We start with the mathematical formula for the model:
Letting \(\eta = E(y|HE,Smoke,HIV,Special)\) \[ \begin{aligned} \eta = &\beta_0 +\beta_1 \times ln(HE) \\ & +\beta_2 \times Smoke \\ & +\beta_3 \times Smoke^2 \\ & +\beta_4 \times HIV \\ & +\beta_5 \times Special \\ & +\beta_6 \times ln(HE) \,Smoke \\ & +\beta_7 \times ln(HE) \,Smoke^2 \end{aligned} \]
To understand the interpretation of the coefficients \(\beta_i\), we differentiate \(\eta\) with respect to each of the independent variables:
\[ \begin{aligned} \frac{\partial \eta}{\partial HE} & = \beta_1 \frac{1}{HE}+\beta_6 \frac{Smoke}{HE} + \beta_7 \frac{Smoke^2}{HE} \\ \frac{\partial \eta}{\partial Smoke} & = \beta_2 + 2 \beta_3 Smoke +\beta_6 ln(HE) + 2 \beta_7 Smoke\, ln(HE) \\ \frac{\partial \eta}{\partial HIV} & = \beta_4 \\ \frac{\partial \eta}{\partial Special} & = \beta_5 \\ \frac{\partial^2 \eta}{\partial HE^2} & = \beta_1 \frac{-1}{HE^2}+\beta_6 \frac{-Smoke}{HE^2} + \beta_7 \frac{-Smoke^2}{HE^2} \\ \frac{\partial^2 \eta}{\partial Smoke^2} & = 2 \beta_3 \\ \frac{\partial^2 \eta}{\partial HE \, \partial Smoke} & = \beta_6 \frac{1}{HE} + 2 \beta_7 \frac{Smoke}{HE} \\ \end{aligned} \]
Thus \(\beta_2\) is the partial derivative of \(\eta\) with respect to \(Smoke\) when \(ln(HE) = Smoke = 0\).
When \(ln(HE) = 5\) and \(Smoke = 4\), the partial derivative of \(\eta\) with respect to \(Smoke\) is
\[ \frac{\partial \eta}{\partial Smoke} = \beta_2 + 8 \beta_3 +5 \beta_6 + 40 \beta_7 \] whose estimator is \[ \hat{\beta_2} + 8 \hat{\beta_3} +5 \hat{\beta_6} + 40 \hat{\beta_7} \] which we can express as a linear transformation of the \(\hat{\beta}\) vector. Letting \[ L = \left[ {\begin{array}{*{20}{c}}0&0&1&8&0&0&5&{40} \end{array}} \right] \] we have: \[\hat{\phi} = L \hat{\beta} = \left[ {\begin{array}{*{20}{c}}0&0&1&8&0&0&5&{40}\end{array}} \right]\widehat {\left[ {\begin{array}{*{20}{c}}{{\beta _0}} \\{{\beta _1}}\\{{\beta _2}}\\{{\beta _3}}\\{{\beta _4}} \\{{\beta _5}}\\{{\beta _6}}\\{{\beta _7}}\end{array}} \right]}\] If we wish to simultaneously estimate the ‘effect’ of Smoke and the ‘effect’ of HE given values of HE and Smoke, we can form the \(L\) matrix: \[ L = \left[ {\begin{array}{*{20}{c}} 0 & 0 & 1 & 2\, Smoke & 0 & 0 & ln(HE) & 2\, Smoke \, ln(HE) \\ 0 & 1 & 0 & 0 & 0 & 0 & \frac{Smoke}{HE} &\frac{Smoke^2}{HE} \end{array}} \right] \] and \[ \hat{\phi} = L \hat{\beta} \] In this case \(\hat{\phi}\) is a column vector of length 2.
In both cases, inference about \(\phi\) uses the fact that \[ Var(\hat{\phi}) = L Var(\hat{\beta}) L' \] and \[ Var(\hat{\beta}) = \sigma^2 (X'X)^{-1} \] With a normal linear model in which \[ Y = X \beta + \epsilon, \:\epsilon \sim N(0,\sigma^2 \,I) \] we have that \[ \left( {\hat{\phi} - \phi} \right)'\left( {s^2 L(X'X)^{-1} L'} \right)^{-1} \left( {\hat{\phi} - \phi} \right) \sim h \times F_{h,\nu} \] where \(h\) is the number of rows of \(L\) (assuming that \(L\) is of full row rank) and \(\nu = n - p\) where \(n\) and \(p\) are the number of rows and columns of \(X\) respectively, again assuming that \(X\) is of full column rank.
We can compute these quantities in R from a fitted model. Note how the ‘evalq’ function evaluates an expression at the values given in the list provided as the ‘envir’ argument.
L <- evalq( rbind( 
c( 0,0,1, 2*smoke, 0 ,0 , log(HE), 2*smoke*log(HE)),
c( 0,1,0, 0      , 0,       0, smoke / HE, smoke^2/HE)), 
    envir = list( smoke = 4, HE = exp(5)))
L|        [,1] [,2] [,3] [,4] [,5] [,6]       [,7]       [,8]
|   [1,]    0    0    1    8    0    0 5.00000000 40.0000000
|   [2,]    0    1    0    0    0    0 0.02695179  0.1078072\(\hat{\beta}\):
|          (Intercept)            log(HE)              smoke         I(smoke^2) 
|         3.283134e+01       6.091328e+00       3.642387e-02      -1.517663e-05 
|                  hiv            special      log(HE):smoke log(HE):I(smoke^2) 
|        -7.350980e-01      -1.822331e+01      -4.878054e-03       2.006775e-06\(\hat{\phi} = L \hat{\beta}\):
|              [,1]
|   [1,] 0.01199246
|   [2,] 6.09119673\(s^2 (X'X)^{-1}\):
|                        (Intercept)       log(HE)         smoke    I(smoke^2)
|   (Intercept)         7.148161e+00 -1.298157e+00 -1.437171e-02  5.526193e-06
|   log(HE)            -1.298157e+00  2.524328e-01  2.379512e-03 -8.798735e-07
|   smoke              -1.437171e-02  2.379512e-03  5.655121e-05 -2.706223e-08
|   I(smoke^2)          5.526193e-06 -8.798735e-07 -2.706223e-08  1.556800e-11
|   hiv                -9.426054e-03 -2.319091e-03  1.603252e-05 -6.423962e-09
|   special             4.335683e-01 -1.122056e-01 -1.123946e-03  4.625775e-07
|   log(HE):smoke       2.438561e-03 -4.363636e-04 -8.441015e-06  3.949943e-09
|   log(HE):I(smoke^2) -9.000438e-07  1.540858e-07  3.944264e-09 -2.226900e-12
|                                hiv       special log(HE):smoke log(HE):I(smoke^2)
|   (Intercept)        -9.426054e-03  4.335683e-01  2.438561e-03      -9.000438e-07
|   log(HE)            -2.319091e-03 -1.122056e-01 -4.363636e-04       1.540858e-07
|   smoke               1.603252e-05 -1.123946e-03 -8.441015e-06       3.944264e-09
|   I(smoke^2)         -6.423962e-09  4.625775e-07  3.949943e-09      -2.226900e-12
|   hiv                 5.765616e-03 -1.218220e-03  2.000199e-06      -2.425728e-10
|   special            -1.218220e-03  4.568101e+00  2.166940e-04      -8.002814e-08
|   log(HE):smoke       2.000199e-06  2.166940e-04  1.334160e-06      -6.010943e-10
|   log(HE):I(smoke^2) -2.425728e-10 -8.002814e-08 -6.010943e-10       3.279108e-13\(\hat{Var}(\hat{\phi}) = L (s^2 (X'X)^{-1}) L'\):
|                [,1]         [,2]
|   [1,] 5.453266e-06 0.0001967712
|   [2,] 1.967712e-04 0.2524093523To test the hypothesis that \(\phi = 0\), we have
\(F = \hat{\phi}'\left({ \hat{Var}(\hat{\phi})}\right) ^{-1} \hat{\phi}/h\)
|            [,1]
|   [1,] 78.44759|        [,1]
|   [1,]    0|                [,1]
|   [1,] 1.254508e-23Note how rounding error is reduced by using the ‘lower.tail’ parameter.
The functions ‘lht’ in the ‘car’ package and ‘wald’ in the ‘spida2’ package can be used to test General Linear Hypotheses.
|   Linear hypothesis test
|   
|   Hypothesis:
|   smoke  + 8 I(smoke^2)  + 5 log(HE):smoke  + 40 log(HE):I(smoke^2) = 0
|   log(HE)  + 0.0269517879963419 log(HE):smoke  + 0.107807151985367 log(HE):I(smoke^2) = 0
|   
|   Model 1: restricted model
|   Model 2: LifeExp ~ log(HE) * (smoke + I(smoke^2)) + hiv + special
|   
|     Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
|   1    143 3925.4                                  
|   2    141 1858.0  2    2067.4 78.448 < 2.2e-16 ***
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1|     numDF denDF  F.value p.value
|   1     2   141 78.44759 <.00001
|        Estimate Std.Error DF  t-value   p-value Lower 0.95 Upper 0.95
|   [1,] 0.011992 0.002335  141  5.135465 <.00001 0.007376   0.016609  
|   [2,] 6.091197 0.502404  141 12.124111 <.00001 5.097979   7.084414The ‘lht’ function can take a right-hand side to test hypotheses of the form \(H_0: \phi = \phi_0\). ‘wald’ can only test \(H_0 : \phi = 0\). The ‘wald’ function can handle \(L\) matrices whose rows are not linearly independent. For example, in some uses of ‘wald’, the ‘L’ matrix is the whole design matrix.
The second argument of the ‘wald’ matrix can be a regular expression that is matched against the names of terms in the model. All terms matched by the regular expression are simultaneously tested. Thus one can test the ‘overall’ significance of an independent variable by testing whether all terms containing that variable are equal to 0. One can also use this approach to test higher-order interactions.
|         numDF denDF  F.value p.value
|   smoke     4   141 8.024538   1e-05
|                      Estimate  Std.Error DF  t-value   p-value Lower 0.95
|   smoke               0.036424 0.007520  141  4.843564 <.00001  0.021557 
|   I(smoke^2)         -0.000015 0.000004  141 -3.846440 0.00018 -0.000023 
|   log(HE):smoke      -0.004878 0.001155  141 -4.223209 0.00004 -0.007162 
|   log(HE):I(smoke^2)  0.000002 0.000001  141  3.504458 0.00061  0.000001 
|                      Upper 0.95
|   smoke               0.051290 
|   I(smoke^2)         -0.000007 
|   log(HE):smoke      -0.002595 
|   log(HE):I(smoke^2)  0.000003|      numDF denDF  F.value p.value
|   HE     3   141 87.10337 <.00001
|                      Estimate  Std.Error DF  t-value   p-value Lower 0.95
|   log(HE)             6.091328 0.502427  141 12.123808 <.00001  5.098064 
|   log(HE):smoke      -0.004878 0.001155  141 -4.223209 0.00004 -0.007162 
|   log(HE):I(smoke^2)  0.000002 0.000001  141  3.504458 0.00061  0.000001 
|                      Upper 0.95
|   log(HE)             7.084592 
|   log(HE):smoke      -0.002595 
|   log(HE):I(smoke^2)  0.000003|     numDF denDF F.value p.value
|   :     2   141  9.2372 0.00017
|                      Estimate  Std.Error DF  t-value   p-value Lower 0.95
|   log(HE):smoke      -0.004878 0.001155  141 -4.223209 0.00004 -0.007162 
|   log(HE):I(smoke^2)  0.000002 0.000001  141  3.504458 0.00061  0.000001 
|                      Upper 0.95
|   log(HE):smoke      -0.002595 
|   log(HE):I(smoke^2)  0.000003|      numDF denDF  F.value p.value
|   2)     2   141 8.835643 0.00024
|                      Estimate Std.Error DF  t-value   p-value Lower 0.95
|   I(smoke^2)         -1.5e-05 4e-06     141 -3.846440 0.00018 -2.3e-05  
|   log(HE):I(smoke^2)  2.0e-06 1e-06     141  3.504458 0.00061  1.0e-06  
|                      Upper 0.95
|   I(smoke^2)         -7e-06    
|   log(HE):I(smoke^2)  3e-06There are many strategies for potentially simplifying large models. The resulting model will depend on the strategy.
One is to attack higher-order interactions and simplify the model by dropping groups of interactions that are not significant but, initially, leaving main effects and lower-order interactions. In many situations there are obvious moderator variables whose interactions should not be dropped as aggressively as those of other variables for which oversimplification to an additive model may be more innocuous. Remember the consequences of dropping an interaction. Main effects become weighted averages of conditional effects, weighted by inverse variance.
Another approach is to drop all terms for selected independent variables if they are not sufficiently significant in an overall test.
The two approaches can be combined depending on the role of variables and the goals of the analysis.
The choice of approach should be guided by many factors: which null hypotheses are likely to be reasonable, the interpretive value of having a simple additive model versus the added validity of estimating conditional effects that are not averaged over levels of variables that may be important, etc. There’s a good discussion of these problems in (???).
In a model with interactions and non-linear functions of some independent variables, it is often interesting to characterize how effects (partial derivatives) and inferences about effects vary over a range of predictors.
The ‘effects’ package (Fox and Hong (2009)) allows easy visualization of predicted values as a variables changes keeping other variables constant. For other variables that interact with the variable whose effect is visualized, the interacting variables are kept constant at a number of selected values. The package is very effective for the easy visualization of moderately complex models with interactions.
The same can be achieved but much more laboriously with the ‘Lfx’ function in the ‘spida2’ package (Monette et al. (2018)). With the ‘Lfx’ function it is possible to estimate derivatives of all orders and features of general parametric splines with the ‘gsp’ function.
The ‘Lfx’ function generates an expression which can then be edited to generate large \(L\) matrices. The result of the wald test applied to this \(L\) matrix can be transformed into a data frame for plotting with error bands.
The following illustrates the use of the ‘Lfx’ function.
|   list( 1,
|   1 * M(log(HE)),
|   1 * smoke,
|   1 * M(I(smoke^2)),
|   1 * hiv,
|   1 * special,
|   1 * M(log(HE)) * smoke,
|   1 * M(log(HE)) * M(I(smoke^2)) 
|   )The expression generated by ‘Lfx’ can be edited to generate desired effects. The result is then fed back to ‘Lfx’ along with a data frame on which to evaluate the edited expression. Note that the ‘M’ functions preserved the shape of multi-term blocks in the design matrix so that multiplying them by 0 is a way of generating a block of 0s of the right dimension. In the following, we edit the expression to estimate the effect of smoking by differentiating with respect to ‘smoke’:
pred <- expand.grid(
  HE = c(50,150,500, 1000, 1500, 5000), 
  smoke = seq(10,2000,20), 
  hiv = 0, 
  special = 0)
head(pred)  # first 6 lines of 'pred'|       HE smoke hiv special
|   1   50    10   0       0
|   2  150    10   0       0
|   3  500    10   0       0
|   4 1000    10   0       0
|   5 1500    10   0       0
|   6 5000    10   0       0Predicted values as a function of HE and smoke
Use the list created above and edit it by differentiating each term with respect to ‘smoke’ to get the marginal ‘effect’ of an extra unit of ‘smoke’:
|   list( 1,
|   1 * M(log(HE)),
|   1 * smoke,
|   1 * M(I(smoke^2)),
|   1 * hiv,
|   1 * special,
|   1 * M(log(HE)) * smoke,
|   1 * M(log(HE)) * M(I(smoke^2)) 
|   )Differentiated:
L <- Lfx(fit.hiv2,
     list( 0,  
           0 * M(log(HE)),
           1 ,
           1 * M(I(2*smoke)),
           0 * hiv,
           0 * special,
           1 * M(log(HE)) * 1,
           1 * M(log(HE)) * M(I(2*smoke)) 
     ), pred)
dim(L)|   [1] 600   8|     (Intercept) log(HE) smoke I(smoke^2) hiv special log(HE):smoke
|   1           0       0     1         20   0       0      3.912023
|   2           0       0     1         20   0       0      5.010635
|   3           0       0     1         20   0       0      6.214608
|   4           0       0     1         20   0       0      6.907755
|   5           0       0     1         20   0       0      7.313220
|   6           0       0     1         20   0       0      8.517193
|     log(HE):I(smoke^2)
|   1           78.24046
|   2          100.21271
|   3          124.29216
|   4          138.15511
|   5          146.26441
|   6          170.34386|   Warning in wald(fit.hiv2, L): Poorly conditioned L matrix, calculated numDF may
|   be incorrect|              coef          se          U2           L2   HE smoke hiv special
|   1  0.0171942856 0.003272967 0.023740220  0.010648352   50    10   0       0
|   2  0.0118792893 0.002313918 0.016507124  0.007251454  150    10   0       0
|   3  0.0060545675 0.001764820 0.009584207  0.002524927  500    10   0       0
|   4  0.0027011781 0.001883655 0.006468487 -0.001066131 1000    10   0       0
|   5  0.0007395711 0.002094140 0.004927851 -0.003448709 1500    10   0       0
|   6 -0.0050851507 0.003067648 0.001050145 -0.011220447 5000    10   0       0More informative labels:
Doing this preserved the order of factor levels
|                         HE
|   HEfac                   50 150 500 1000 1500 5000 Total
|     Health Exp/Cap: 50   100   0   0    0    0    0   100
|     Health Exp/Cap: 150    0 100   0    0    0    0   100
|     Health Exp/Cap: 500    0   0 100    0    0    0   100
|     Health Exp/Cap: 1000   0   0   0  100    0    0   100
|     Health Exp/Cap: 1500   0   0   0    0  100    0   100
|     Health Exp/Cap: 5000   0   0   0    0    0  100   100
|     Total                100 100 100  100  100  100   600Levels are in numerical order when creating a factor from a numeric object! This was not always so! We are taking the derivative with respect to ‘smoke’ of these functions:
xyplot( coef ~ smoke, ww, groups = HE, 
        auto.key = list(columns = 3, lines = T, 
                        points = F),type = 'l')Figure 1: Change in Life Expectancy associated with an increase in cigarette consumption of 1 cigarette per day per capita for different levels of health expenditures per capita per year (US$).
With labels that are more informative:
Confidence bands:
xyplot( I(365*coef)  ~ I(smoke/365) | HEfac,
       data = ww, type = 'n',
       xlim = c(0,5.5),
       ylab = "Change in predicted LE per additional cigarette",
       xlab = "cigarettes per capita per day",
       fit = 365*(ww$coef),
       lower = 365*(ww$coef - ww$se),
       upper = 365*(ww$coef + ww$se),
       subscripts = T,
       as.table = T) +
  layer(panel.fit(...,alpha = .3)) + 
  layer( panel.abline( h = 0, lwd = 1))Transforming to ‘meaningful’ coefficients: cigarettes per day
## Double checking our previous calculation:
ptest <- expand.grid(smoke = 4, HE = exp(5), hiv = 0, special = 0)
(L2 <- Lfx( fit.hiv2,
      list( 0,  
            0 * M(log(HE)),
            1 ,
            1 * M(I(2*smoke)),
            0 * hiv,
            0 * special,
            1 * M(log(HE)) * 1,
            1 * M(log(HE)) * M(I(2*smoke)) 
      ), ptest)) |     (Intercept) log(HE) smoke I(smoke^2) hiv special log(HE):smoke
|   1           0       0     1          8   0       0             5
|     log(HE):I(smoke^2)
|   1                 40
|   attr(,"data")
|     smoke       HE hiv special
|   1     4 148.4132   0       0|                            numDF denDF F.value p.value
|   At smoke = 4, HE = 148.4     1   141  26.373 <.00001
|     Estimate Std.Error DF  t-value  p-value Lower 0.95 Upper 0.95
|   1 0.011992 0.002335  141 5.135465 <.00001 0.007376   0.016609Let’s consider a test for the need for a quadratic term in ‘smoke’. There are two terms in the model that contain the quadratic term and a test to remove it involves more than one parameter. We need a test of \[ H_0: \beta_3 = \beta_7 = 0 \] We cannot simply test each hypothesis \(H_0: \beta_3 = 0\) and \(H_0: \beta_7 = 0\) separately. We will see many examples where individual hypotheses are not significant, yet a joint hypothesis is highly significant. This is not a example of this phenomenon since the p-value for each hypothesis is small. Nevertheless, a test of a joint hypothesis needs to be carried out correctly. We consider two ways: a Wald test and a Likelihood Ratio Test executed with the ‘anova’ function, a clear misnomer.
Wald test using indices of coefficients
|     numDF denDF  F.value p.value
|   1     2   141 8.835643 0.00024
|                      Estimate Std.Error DF  t-value   p-value Lower 0.95
|   I(smoke^2)         -1.5e-05 4e-06     141 -3.846440 0.00018 -2.3e-05  
|   log(HE):I(smoke^2)  2.0e-06 1e-06     141  3.504458 0.00061  1.0e-06  
|                      Upper 0.95
|   I(smoke^2)         -7e-06    
|   log(HE):I(smoke^2)  3e-06|                      numDF denDF  F.value p.value
|   Quadratic in smoke     2   141 8.835643 0.00024
|                      Estimate Std.Error DF  t-value   p-value Lower 0.95
|   I(smoke^2)         -1.5e-05 4e-06     141 -3.846440 0.00018 -2.3e-05  
|   log(HE):I(smoke^2)  2.0e-06 1e-06     141  3.504458 0.00061  1.0e-06  
|                      Upper 0.95
|   I(smoke^2)         -7e-06    
|   log(HE):I(smoke^2)  3e-06Wald test using regular expression
|     numDF denDF  F.value p.value
|   2     2   141 8.835643 0.00024
|                      Estimate Std.Error DF  t-value   p-value Lower 0.95
|   I(smoke^2)         -1.5e-05 4e-06     141 -3.846440 0.00018 -2.3e-05  
|   log(HE):I(smoke^2)  2.0e-06 1e-06     141  3.504458 0.00061  1.0e-06  
|                      Upper 0.95
|   I(smoke^2)         -7e-06    
|   log(HE):I(smoke^2)  3e-06Likelihood ratio test
We need to fit the ‘null’ model
|   
|   Call:
|   lm(formula = LifeExp ~ log(HE) + smoke + hiv + special + log(HE):smoke, 
|       data = dd, na.action = na.exclude)
|   
|   Residuals:
|       Min      1Q  Median      3Q     Max 
|   -9.5894 -2.2654  0.0006  2.4404  9.6020 
|   
|   Coefficients:
|                   Estimate Std. Error t value Pr(>|t|)    
|   (Intercept)    3.630e+01  2.082e+00  17.438  < 2e-16 ***
|   log(HE)        5.728e+00  3.613e-01  15.854  < 2e-16 ***
|   smoke          1.132e-02  3.149e-03   3.596 0.000444 ***
|   hiv           -7.617e-01  7.896e-02  -9.647  < 2e-16 ***
|   special       -1.802e+01  2.243e+00  -8.032 3.23e-13 ***
|   log(HE):smoke -1.660e-03  4.628e-04  -3.586 0.000459 ***
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 3.824 on 143 degrees of freedom
|     (45 observations deleted due to missingness)
|   Multiple R-squared:  0.8439,    Adjusted R-squared:  0.8384 
|   F-statistic: 154.6 on 5 and 143 DF,  p-value: < 2.2e-16Then compare the null model with the ‘full’ model:
By default, ‘anova’ uses an F distribution for the LRT taking advantage of the linear Gaussian model.
|   Analysis of Variance Table
|   
|   Model 1: LifeExp ~ log(HE) + smoke + hiv + special + log(HE):smoke
|   Model 2: LifeExp ~ log(HE) * (smoke + I(smoke^2)) + hiv + special
|     Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
|   1    143 2090.8                                  
|   2    141 1858.0  2    232.86 8.8356 0.0002426 ***
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Using the general asymptotic distribution, chi-square, for the LRT gives a slightly different but very close result:
|   Analysis of Variance Table
|   
|   Model 1: LifeExp ~ log(HE) + smoke + hiv + special + log(HE):smoke
|   Model 2: LifeExp ~ log(HE) * (smoke + I(smoke^2)) + hiv + special
|     Res.Df    RSS Df Sum of Sq  Pr(>Chi)    
|   1    143 2090.8                           
|   2    141 1858.0  2    232.86 0.0001455 ***
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1The Wald test and the LRT using the F statistics give identical results. This is the luxury of working with a Gaussian homoskedastic independent linear model.
Type 1: sequential tests
|   Analysis of Variance Table
|   
|   Response: LifeExp
|                       Df Sum Sq Mean Sq  F value    Pr(>F)    
|   log(HE)              1 8631.5  8631.5 655.0400 < 2.2e-16 ***
|   smoke                1  152.3   152.3  11.5572 0.0008779 ***
|   I(smoke^2)           1  399.7   399.7  30.3307 1.675e-07 ***
|   hiv                  1 1228.0  1228.0  93.1880 < 2.2e-16 ***
|   special              1  878.4   878.4  66.6576 1.637e-13 ***
|   log(HE):smoke        1   81.6    81.6   6.1932 0.0139881 *  
|   log(HE):I(smoke^2)   1  161.8   161.8  12.2812 0.0006136 ***
|   Residuals          141 1858.0    13.2                       
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Type 2: Each term added last except for higher-order interactions
|   Anova Table (Type II tests)
|   
|   Response: LifeExp
|                      Sum Sq  Df  F value    Pr(>F)    
|   log(HE)            3199.9   1 242.8357 < 2.2e-16 ***
|   smoke               129.5   1   9.8297 0.0020900 ** 
|   I(smoke^2)           71.0   1   5.3901 0.0216845 *  
|   hiv                1235.0   1  93.7227 < 2.2e-16 ***
|   special             957.9   1  72.6974 2.109e-14 ***
|   log(HE):smoke       235.0   1  17.8355 4.297e-05 ***
|   log(HE):I(smoke^2)  161.8   1  12.2812 0.0006136 ***
|   Residuals          1858.0 141                       
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Type 3: Each term added last
Note that ‘Type 1’ and ‘Type 2’ sums of squares are interpreted consistently (as far as I’ve seen) in different packages. ‘Type 3’, however, has quite different interpretations. Each variable, or group of variables in the case of a factor with 3 or more levels, is added last but different packages set the other variables at different values. Some set them at their 0 values and others set them at their mean values. What does car::Anova do?
|   Anova Table (Type III tests)
|   
|   Response: LifeExp
|                       Sum Sq  Df F value    Pr(>F)    
|   (Intercept)        1987.03   1 150.794 < 2.2e-16 ***
|   log(HE)            1936.86   1 146.987 < 2.2e-16 ***
|   smoke               309.14   1  23.460 3.315e-06 ***
|   I(smoke^2)          194.96   1  14.795 0.0001809 ***
|   hiv                1235.00   1  93.723 < 2.2e-16 ***
|   special             957.94   1  72.697 2.109e-14 ***
|   log(HE):smoke       235.02   1  17.835 4.297e-05 ***
|   log(HE):I(smoke^2)  161.83   1  12.281 0.0006136 ***
|   Residuals          1857.97 141                      
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Type 3 is identical to regression output except that it uses equivalent F tests and a single test for terms with multiple degrees of freedom
|   
|   Call:
|   lm(formula = LifeExp ~ log(HE) * (smoke + I(smoke^2)) + hiv + 
|       special, data = dd, na.action = na.exclude)
|   
|   Residuals:
|       Min      1Q  Median      3Q     Max 
|   -9.0373 -2.3005  0.2043  2.0760  9.7344 
|   
|   Coefficients:
|                        Estimate Std. Error t value Pr(>|t|)    
|   (Intercept)         3.283e+01  2.674e+00  12.280  < 2e-16 ***
|   log(HE)             6.091e+00  5.024e-01  12.124  < 2e-16 ***
|   smoke               3.642e-02  7.520e-03   4.844 3.31e-06 ***
|   I(smoke^2)         -1.518e-05  3.946e-06  -3.846 0.000181 ***
|   hiv                -7.351e-01  7.593e-02  -9.681  < 2e-16 ***
|   special            -1.822e+01  2.137e+00  -8.526 2.11e-14 ***
|   log(HE):smoke      -4.878e-03  1.155e-03  -4.223 4.30e-05 ***
|   log(HE):I(smoke^2)  2.007e-06  5.726e-07   3.504 0.000614 ***
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 3.63 on 141 degrees of freedom
|     (45 observations deleted due to missingness)
|   Multiple R-squared:  0.8613,    Adjusted R-squared:  0.8544 
|   F-statistic:   125 on 7 and 141 DF,  p-value: < 2.2e-16Controlling for WHO regions provides a non-trivial example of the use of factors in regression.
When you create a data frame in R, non-numeric variables are automatically turned into factors. Factors are both a strength of R and a frequent source of annoyance and confusion. See traps and pitfalls with factors.
Let’s create a small data frame to illustrate how factors work:
|       x g
|   1   1 a
|   2   2 a
|   3   3 b
|   4   4 b
|   5   5 b
|   6   6 b
|   7   7 b
|   8   6 c
|   9   7 c
|   10  8 c
|   11  9 c
|   12 10 c|       x g         y
|   1   1 a  2.120108
|   2   2 a  2.853852
|   3   3 b  2.722852
|   4   4 b  4.825593
|   5   5 b  4.623128
|   6   6 b  5.732132
|   7   7 b  7.221543
|   8   6 c  7.060536
|   9   7 c  9.465820
|   10  8 c  9.748093
|   11  9 c 11.580972
|   12 10 c 11.661232|    [1] a a b b b b b c c c c c
|   Levels: a b c|    [1] 1 1 2 2 2 2 2 3 3 3 3 3
|   attr(,"levels")
|   [1] "a" "b" "c"g is actually a numeric variable consisting of indices into a vector of ‘levels’.
|   
|   Call:
|   lm(formula = y ~ x + g, data = sdf, na.action = na.exclude)
|   
|   Residuals:
|       Min      1Q  Median      3Q     Max 
|   -0.7367 -0.3465 -0.1574  0.2736  0.8536 
|   
|   Coefficients:
|               Estimate Std. Error t value Pr(>|t|)    
|   (Intercept)   0.9074     0.4421   2.052   0.0742 .  
|   x             1.0530     0.1250   8.421 3.01e-05 ***
|   gb           -1.1476     0.6449  -1.779   0.1131    
|   gc            0.5716     0.9408   0.608   0.5603    
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 0.5662 on 8 degrees of freedom
|   Multiple R-squared:  0.9797,    Adjusted R-squared:  0.972 
|   F-statistic: 128.4 on 3 and 8 DF,  p-value: 4.178e-07Note:
To work out what the coefficients for ‘gb’ and ‘gc’ mean, you need to look at the X matrix:
|      (Intercept)  x gb gc
|   1            1  1  0  0
|   2            1  2  0  0
|   3            1  3  1  0
|   4            1  4  1  0
|   5            1  5  1  0
|   6            1  6  1  0
|   7            1  7  1  0
|   8            1  6  0  1
|   9            1  7  0  1
|   10           1  8  0  1
|   11           1  9  0  1
|   12           1 10  0  1
|   attr(,"assign")
|   [1] 0 1 2 2
|   attr(,"contrasts")
|   attr(,"contrasts")$g
|   [1] "contr.treatment"|      (Intercept)  x gb gc
|   1            1  1  0  0
|   2            1  2  0  0
|   3            1  3  1  0
|   4            1  4  1  0
|   5            1  5  1  0
|   6            1  6  1  0
|   7            1  7  1  0
|   8            1  6  0  1
|   9            1  7  0  1
|   10           1  8  0  1
|   11           1  9  0  1
|   12           1 10  0  1
|   attr(,"assign")
|   [1] 0 1 2 2
|   attr(,"contrasts")
|   attr(,"contrasts")$g
|   [1] "contr.treatment"|      (Intercept) gb gc
|   1            1  0  0
|   2            1  0  0
|   3            1  1  0
|   4            1  1  0
|   5            1  1  0
|   6            1  1  0
|   7            1  1  0
|   8            1  0  1
|   9            1  0  1
|   10           1  0  1
|   11           1  0  1
|   12           1  0  1
|   attr(,"assign")
|   [1] 0 1 1
|   attr(,"contrasts")
|   attr(,"contrasts")$g
|   [1] "contr.treatment"If you work through the model: \[ E(y|x,g) = \beta_0 + \beta_x x + \beta_{gb} gb + \beta_{gc} gc \] where \(gb = 1\) if \(g=b\) and 0 otherwise, and \(gc = 1\) if \(g=c\) and 0 otherwise, you see that
Plotting fits within groups and panels using latticeExtra:
See more elegant but perhaps less flexible approaches in the ‘car’ and in the ‘effects’ package by John Fox
The values over which we want to see predicted lines every combination of x and g
pred  <- merge( sdf, pred, all = T) # merge with data
pred$y1 <- predict(sfit, newdata = pred) # the predicted value
pred <- sortdf(pred, ~ x) # order so lines won't be interrupted
require(latticeExtra)
require(spida2)
gd(cex=2, lty=1:3,  # this gives control over line styles, 
                    # colour, etc.
   pch = 16:18, 
   col = c('red','blue','magenta'), 
   lwd =2)  # from spida2
xyplot( y ~ x , pred, groups = g , 
        auto.key = list(columns = 3,lines = T),
        y1 = pred$y1, subscripts = T,
        sub = 
  "compare adjusted and unadjusted differences between groups") +
  glayer( panel.lines( x, y1[subscripts],...,type = 'l')) +
  layer( panel.abline( v = c(0,3), col = 'grey')) +
  glayer( panel.abline( h = mean(y,na.rm=T),...))|   
|   Call:
|   lm(formula = y ~ x + g, data = sdf, na.action = na.exclude)
|   
|   Residuals:
|       Min      1Q  Median      3Q     Max 
|   -0.7367 -0.3465 -0.1574  0.2736  0.8536 
|   
|   Coefficients:
|               Estimate Std. Error t value Pr(>|t|)    
|   (Intercept)   0.9074     0.4421   2.052   0.0742 .  
|   x             1.0530     0.1250   8.421 3.01e-05 ***
|   gb           -1.1476     0.6449  -1.779   0.1131    
|   gc            0.5716     0.9408   0.608   0.5603    
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 0.5662 on 8 degrees of freedom
|   Multiple R-squared:  0.9797,    Adjusted R-squared:  0.972 
|   F-statistic: 128.4 on 3 and 8 DF,  p-value: 4.178e-07But
|     numDF denDF  F.value p.value
|   g     2     8 9.711573 0.00724
|      Estimate  Std.Error DF t-value   p-value Lower 0.95 Upper 0.95
|   gb -1.147573 0.644944  8  -1.779337 0.11306 -2.634817  0.339671  
|   gc  0.571585 0.940783  8   0.607563 0.56032 -1.597865  2.741035Using the GLH (General Linear Hypothesis)
Lmu.6 <-list( "at x = 6" = rbind( 
 'g = a' = c(1,6,0,0),
 'g = b' = c(1,6,1,0),
 'g = c' = c(1,6,0,1)))
Lmu.6|   $`at x = 6`
|         [,1] [,2] [,3] [,4]
|   g = a    1    6    0    0
|   g = b    1    6    1    0
|   g = c    1    6    0    1|            numDF denDF  F.value p.value
|   at x = 6     3     8 613.0644 <.00001
|         Estimate Std.Error DF t-value  p-value Lower 0.95 Upper 0.95
|   g = a 7.225664 0.690607  8  10.46277 1e-05   5.633120   8.818207  
|   g = b 6.078091 0.282401  8  21.52290 <.00001 5.426872   6.729309  
|   g = c 7.797249 0.355897  8  21.90875 <.00001 6.976550   8.617948|         [,1] [,2] [,3] [,4]
|   b - a    0    0    1    0
|   c - a    0    0    0    1
|   c - b    0    0   -1    1|     numDF denDF  F.value p.value
|   1     2     8 9.711573 0.00724
|         Estimate  Std.Error DF t-value   p-value Lower 0.95 Upper 0.95
|   b - a -1.147573 0.644944  8  -1.779337 0.11306 -2.634817  0.339671  
|   c - a  0.571585 0.940783  8   0.607563 0.56032 -1.597865  2.741035  
|   c - b  1.719159 0.518616  8   3.314900 0.01062  0.523229  2.915088|     numDF denDF  F.value p.value
|   g     2     8 9.711573 0.00724
|      Estimate  Std.Error DF t-value   p-value Lower 0.95 Upper 0.95
|   gb -1.147573 0.644944  8  -1.779337 0.11306 -2.634817  0.339671  
|   gc  0.571585 0.940783  8   0.607563 0.56032 -1.597865  2.741035This illustrated the crucial point that separate tests of \[ H_0: \beta_1 = 0 \] and \[ H_0: \beta_2 = 0 \] can yield very different ‘conclusions’ that a test of the joint hypothesis: \[ H_0: \beta_1 = \beta_2 = 0 \]
Later, we will see how the relationship between confidence ellipses(oids) and tests makes this clear.
sdf$g2 <- relevel(sdf$g, 'b')  # makes 'b' the reference level
fitr <- lm( y ~ x + g2, sdf)
summary(fitr)|   
|   Call:
|   lm(formula = y ~ x + g2, data = sdf)
|   
|   Residuals:
|       Min      1Q  Median      3Q     Max 
|   -0.7367 -0.3465 -0.1574  0.2736  0.8536 
|   
|   Coefficients:
|               Estimate Std. Error t value Pr(>|t|)    
|   (Intercept)  -0.2402     0.6746  -0.356   0.7310    
|   x             1.0530     0.1250   8.421 3.01e-05 ***
|   g2a           1.1476     0.6449   1.779   0.1131    
|   g2c           1.7192     0.5186   3.315   0.0106 *  
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 0.5662 on 8 degrees of freedom
|   Multiple R-squared:  0.9797,    Adjusted R-squared:  0.972 
|   F-statistic: 128.4 on 3 and 8 DF,  p-value: 4.178e-07|   
|   Call:
|   lm(formula = y ~ x + g2 - 1, data = sdf)
|   
|   Residuals:
|       Min      1Q  Median      3Q     Max 
|   -0.7367 -0.3465 -0.1574  0.2736  0.8536 
|   
|   Coefficients:
|       Estimate Std. Error t value Pr(>|t|)    
|   x     1.0530     0.1250   8.421 3.01e-05 ***
|   g2b  -0.2402     0.6746  -0.356   0.7310    
|   g2a   0.9074     0.4421   2.052   0.0742 .  
|   g2c   1.4790     1.0319   1.433   0.1897    
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 0.5662 on 8 degrees of freedom
|   Multiple R-squared:  0.9961,    Adjusted R-squared:  0.9941 
|   F-statistic: 508.3 on 4 and 8 DF,  p-value: 1.176e-09|   
|   Call:
|   lm(formula = y ~ I(x - 6) + g2 - 1, data = sdf)
|   
|   Residuals:
|       Min      1Q  Median      3Q     Max 
|   -0.7367 -0.3465 -0.1574  0.2736  0.8536 
|   
|   Coefficients:
|            Estimate Std. Error t value Pr(>|t|)    
|   I(x - 6)   1.0530     0.1250   8.421 3.01e-05 ***
|   g2b        6.0781     0.2824  21.523 2.29e-08 ***
|   g2a        7.2257     0.6906  10.463 6.05e-06 ***
|   g2c        7.7972     0.3559  21.909 1.99e-08 ***
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 0.5662 on 8 degrees of freedom
|   Multiple R-squared:  0.9961,    Adjusted R-squared:  0.9941 
|   F-statistic: 508.3 on 4 and 8 DF,  p-value: 1.176e-09|            numDF denDF  F.value p.value
|   at x = 6     3     8 613.0644 <.00001
|         Estimate Std.Error DF t-value  p-value Lower 0.95 Upper 0.95
|   g = a 7.225664 0.690607  8  10.46277 1e-05   5.633120   8.818207  
|   g = b 6.078091 0.282401  8  21.52290 <.00001 5.426872   6.729309  
|   g = c 7.797249 0.355897  8  21.90875 <.00001 6.976550   8.617948What makes the last three models equivalent?
|   Warning in summary.lm(object, ...): essentially perfect fit: summary may be
|   unreliable
|   Warning in summary.lm(object, ...): essentially perfect fit: summary may be
|   unreliable
|   Warning in summary.lm(object, ...): essentially perfect fit: summary may be
|   unreliable
|   Warning in summary.lm(object, ...): essentially perfect fit: summary may be
|   unreliable|   Response (Intercept) :
|   
|   Call:
|   lm(formula = `(Intercept)` ~ model.matrix(sfit) - 1)
|   
|   Residuals:
|          Min         1Q     Median         3Q        Max 
|   -7.504e-16 -4.690e-17  0.000e+00  4.690e-17  7.504e-16 
|   
|   Coefficients:
|                                   Estimate Std. Error    t value Pr(>|t|)    
|   model.matrix(sfit)(Intercept)  1.000e+00  2.966e-16  3.371e+15   <2e-16 ***
|   model.matrix(sfit)x           -3.752e-17  8.390e-17 -4.470e-01    0.667    
|   model.matrix(sfit)gb          -6.379e-16  4.327e-16 -1.474e+00    0.179    
|   model.matrix(sfit)gc          -5.253e-16  6.312e-16 -8.320e-01    0.429    
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 3.799e-16 on 8 degrees of freedom
|   Multiple R-squared:      1, Adjusted R-squared:      1 
|   F-statistic: 2.079e+31 on 4 and 8 DF,  p-value: < 2.2e-16
|   
|   
|   Response x :
|   
|   Call:
|   lm(formula = x ~ model.matrix(sfit) - 1)
|   
|   Residuals:
|          Min         1Q     Median         3Q        Max 
|   -8.742e-16 -8.917e-17 -2.706e-17  1.417e-16  8.742e-16 
|   
|   Coefficients:
|                                   Estimate Std. Error    t value Pr(>|t|)    
|   model.matrix(sfit)(Intercept) -1.026e-15  3.623e-16 -2.831e+00   0.0221 *  
|   model.matrix(sfit)x            1.000e+00  1.025e-16  9.759e+15   <2e-16 ***
|   model.matrix(sfit)gb          -8.052e-16  5.285e-16 -1.523e+00   0.1661    
|   model.matrix(sfit)gc          -9.918e-16  7.709e-16 -1.287e+00   0.2342    
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 4.64e-16 on 8 degrees of freedom
|   Multiple R-squared:      1, Adjusted R-squared:      1 
|   F-statistic: 5.459e+32 on 4 and 8 DF,  p-value: < 2.2e-16
|   
|   
|   Response g2a :
|   
|   Call:
|   lm(formula = g2a ~ model.matrix(sfit) - 1)
|   
|   Residuals:
|          Min         1Q     Median         3Q        Max 
|   -5.639e-17 -2.701e-17 -8.327e-18  3.385e-17  6.024e-17 
|   
|   Coefficients:
|                                   Estimate Std. Error    t value Pr(>|t|)    
|   model.matrix(sfit)(Intercept)  1.000e+00  3.675e-17  2.721e+16  < 2e-16 ***
|   model.matrix(sfit)x           -4.223e-17  1.039e-17 -4.063e+00  0.00362 ** 
|   model.matrix(sfit)gb          -1.000e+00  5.361e-17 -1.865e+16  < 2e-16 ***
|   model.matrix(sfit)gc          -1.000e+00  7.820e-17 -1.279e+16  < 2e-16 ***
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 4.706e-17 on 8 degrees of freedom
|   Multiple R-squared:      1, Adjusted R-squared:      1 
|   F-statistic: 2.258e+32 on 4 and 8 DF,  p-value: < 2.2e-16
|   
|   
|   Response g2c :
|   
|   Call:
|   lm(formula = g2c ~ model.matrix(sfit) - 1)
|   
|   Residuals:
|          Min         1Q     Median         3Q        Max 
|   -1.054e-17 -6.564e-18 -8.950e-20  4.320e-18  1.981e-17 
|   
|   Coefficients:
|                                  Estimate Std. Error  t value Pr(>|t|)    
|   model.matrix(sfit)(Intercept) 0.000e+00  7.899e-18 0.00e+00        1    
|   model.matrix(sfit)x           0.000e+00  2.234e-18 0.00e+00        1    
|   model.matrix(sfit)gb          0.000e+00  1.152e-17 0.00e+00        1    
|   model.matrix(sfit)gc          1.000e+00  1.681e-17 5.95e+16   <2e-16 ***
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 1.012e-17 on 8 degrees of freedom
|   Multiple R-squared:      1, Adjusted R-squared:      1 
|   F-statistic: 1.222e+34 on 4 and 8 DF,  p-value: < 2.2e-16Each model matrix spans exactly the same linear space. Thus their columns are just different bases for the same space and the \(\beta\)s for one model are just a linear transformation of the \(\betas\)s for the other model.
What do the coefficients estimate in each of the following models?. Indicate how each coefficient is related to the first graph shown below.
Factor alone: g - 1
|   
|   Call:
|   lm(formula = y ~ g - 1, data = sdf)
|   
|   Residuals:
|       Min      1Q  Median      3Q     Max 
|   -2.8428 -0.4108 -0.1774  0.9497  2.1965 
|   
|   Coefficients:
|      Estimate Std. Error t value Pr(>|t|)    
|   ga   2.4870     1.1855   2.098   0.0653 .  
|   gb   5.0250     0.7498   6.702 8.83e-05 ***
|   gc   9.9033     0.7498  13.209 3.39e-07 ***
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 1.677 on 9 degrees of freedom
|   Multiple R-squared:  0.9613,    Adjusted R-squared:  0.9485 
|   F-statistic: 74.59 on 3 and 9 DF,  p-value: 1.118e-06|      ga gb gc
|   1   1  0  0
|   2   1  0  0
|   3   0  1  0
|   4   0  1  0
|   5   0  1  0
|   6   0  1  0
|   7   0  1  0
|   8   0  0  1
|   9   0  0  1
|   10  0  0  1
|   11  0  0  1
|   12  0  0  1
|   attr(,"assign")
|   [1] 1 1 1
|   attr(,"contrasts")
|   attr(,"contrasts")$g
|   [1] "contr.treatment"|   
|   Call:
|   lm(formula = y ~ x + g - 1, data = sdf)
|   
|   Residuals:
|       Min      1Q  Median      3Q     Max 
|   -0.7367 -0.3465 -0.1574  0.2736  0.8536 
|   
|   Coefficients:
|      Estimate Std. Error t value Pr(>|t|)    
|   x    1.0530     0.1250   8.421 3.01e-05 ***
|   ga   0.9074     0.4421   2.052   0.0742 .  
|   gb  -0.2402     0.6746  -0.356   0.7310    
|   gc   1.4790     1.0319   1.433   0.1897    
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 0.5662 on 8 degrees of freedom
|   Multiple R-squared:  0.9961,    Adjusted R-squared:  0.9941 
|   F-statistic: 508.3 on 4 and 8 DF,  p-value: 1.176e-09|       x ga gb gc
|   1   1  1  0  0
|   2   2  1  0  0
|   3   3  0  1  0
|   4   4  0  1  0
|   5   5  0  1  0
|   6   6  0  1  0
|   7   7  0  1  0
|   8   6  0  0  1
|   9   7  0  0  1
|   10  8  0  0  1
|   11  9  0  0  1
|   12 10  0  0  1
|   attr(,"assign")
|   [1] 1 2 2 2
|   attr(,"contrasts")
|   attr(,"contrasts")$g
|   [1] "contr.treatment"Factor with interaction: g * x
|   
|   Call:
|   lm(formula = y ~ g * x, data = sdf)
|   
|   Residuals:
|        Min       1Q   Median       3Q      Max 
|   -0.57949 -0.34154 -0.07762  0.29828  0.79094 
|   
|   Coefficients:
|               Estimate Std. Error t value Pr(>|t|)
|   (Intercept)   1.3864     1.4179   0.978    0.366
|   gb           -1.3133     1.7596  -0.746    0.484
|   gc           -0.5363     2.1597  -0.248    0.812
|   x             0.7337     0.8968   0.818    0.445
|   gb:x          0.2566     0.9189   0.279    0.789
|   gc:x          0.3979     0.9189   0.433    0.680
|   
|   Residual standard error: 0.6341 on 6 degrees of freedom
|   Multiple R-squared:  0.9809,    Adjusted R-squared:  0.9649 
|   F-statistic: 61.51 on 5 and 6 DF,  p-value: 4.499e-05From which you might conclude that ‘nothing is significant’!
This illustrates that it is often wrong wrong wrong to form conclusions on the basis of scanning p-values in regression output.
Factor nesting continuous variable: g / x - 1
|   
|   Call:
|   lm(formula = y ~ g/x - 1, data = sdf)
|   
|   Residuals:
|        Min       1Q   Median       3Q      Max 
|   -0.57949 -0.34154 -0.07762  0.29828  0.79094 
|   
|   Coefficients:
|        Estimate Std. Error t value Pr(>|t|)   
|   ga    1.38636    1.41789   0.978  0.36595   
|   gb    0.07309    1.04193   0.070  0.94636   
|   gc    0.85010    1.62903   0.522  0.62048   
|   ga:x  0.73374    0.89675   0.818  0.44450   
|   gb:x  0.99039    0.20052   4.939  0.00261 **
|   gc:x  1.13165    0.20052   5.644  0.00133 **
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 0.6341 on 6 degrees of freedom
|   Multiple R-squared:  0.9963,    Adjusted R-squared:  0.9926 
|   F-statistic: 270.2 on 6 and 6 DF,  p-value: 4.985e-07Plotting the fitted model:
# or
xyplot( y ~ x, pred, groups = g, 
  subscripts = T, y2 = pred$y2, y1 = pred$y1) +
glayer( panel.xyplot( x, y2[subscripts], ..., type = 'l'))# or
xyplot( y ~ x, pred, groups = g, 
  ylim = c(0,13), auto.key = T,
  subscripts = T, y2 = pred$y2, y1 = pred$y1) +
glayer( panel.xyplot( x, y2[subscripts], ..., type = 'l')) +
layer( panel.abline( v = c(0,6)))The model is: \[ \begin{aligned} E(y|x,g) = & \beta_0 + \beta_x x + \beta_{gb} gb + \beta_{gc} gc\\ & + \beta_{x:gb} x \times gb + \beta_{x:gc} x \times gc \\ \end{aligned} \] Taking partial derivatives, we see that \(\beta_{gb}\) is the difference between between group ‘b’ minus group ‘a’ when \(x = 0\). There might not be strong evidence of differences between groups outside the range of the data.
What would happen if we were to explore the difference between group ‘b’ and group ‘c’ when \(x = 6\):
|     numDF denDF   F.value p.value
|   1     1     6 0.4570639 0.52419
|             Estimate Std.Error DF t-value  p-value Lower 0.95 Upper 0.95
|   c - b|x=6 2.117587 3.132223  6  0.676065 0.52419 -5.546688  9.781861We could also do this by reparametrizing:
|   
|   Call:
|   lm(formula = y ~ I(x - 6) * relevel(g, "b"), data = sdf)
|   
|   Residuals:
|        Min       1Q   Median       3Q      Max 
|   -0.57949 -0.34154 -0.07762  0.29828  0.79094 
|   
|   Coefficients:
|                             Estimate Std. Error t value Pr(>|t|)    
|   (Intercept)                 6.0154     0.3473  17.320 2.37e-06 ***
|   I(x - 6)                    0.9904     0.2005   4.939  0.00261 ** 
|   relevel(g, "b")a           -0.2266     4.0750  -0.056  0.95746    
|   relevel(g, "b")c            1.6246     0.6016   2.701  0.03555 *  
|   I(x - 6):relevel(g, "b")a  -0.2566     0.9189  -0.279  0.78939    
|   I(x - 6):relevel(g, "b")c   0.1413     0.2836   0.498  0.63611    
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 0.6341 on 6 degrees of freedom
|   Multiple R-squared:  0.9809,    Adjusted R-squared:  0.9649 
|   F-statistic: 61.51 on 5 and 6 DF,  p-value: 4.499e-05Here are some other ways of exploring the model:
|     numDF denDF   F.value p.value
|   :     2     6 0.1890466 0.83249
|        Estimate Std.Error DF t-value p-value Lower 0.95 Upper 0.95
|   gb:x 0.256648 0.918898  6  0.27930 0.78939 -1.991815  2.505111  
|   gc:x 0.397910 0.918898  6  0.43303 0.68013 -1.850553  2.646374|     numDF denDF  F.value p.value
|   g     4     6 3.965856 0.06566
|        Estimate  Std.Error DF t-value   p-value Lower 0.95 Upper 0.95
|   gb   -1.313275 1.759556  6  -0.746367 0.48365 -5.618754  2.992205  
|   gc   -0.536269 2.159667  6  -0.248311 0.81217 -5.820784  4.748246  
|   gb:x  0.256648 0.918898  6   0.279300 0.78939 -1.991815  2.505111  
|   gc:x  0.397910 0.918898  6   0.433030 0.68013 -1.850553  2.646374|     numDF denDF  F.value p.value
|   x     3     6 18.97152 0.00183
|        Estimate Std.Error DF t-value  p-value Lower 0.95 Upper 0.95
|   x    0.733744 0.896753  6  0.818223 0.44450 -1.460531  2.928020  
|   gb:x 0.256648 0.918898  6  0.279300 0.78939 -1.991815  2.505111  
|   gc:x 0.397910 0.918898  6  0.433030 0.68013 -1.850553  2.646374Using type 2 Anova gives you tests for ‘g’ and ‘x’ that assume that higher-order interactions involving ‘g’ and ‘x’ are all 0. Note that the error term used is the error term for the full model including interactions. This can lead to inconsistencies with tests based on a model in which interactions has been dropped.
|   Anova Table (Type II tests)
|   
|   Response: y
|              Sum Sq Df F value    Pr(>F)    
|   g          6.2264  2  7.7427 0.0217785 *  
|   x         22.7323  1 56.5365 0.0002865 ***
|   g:x        0.1520  2  0.1890 0.8324941    
|   Residuals  2.4125  6                      
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1|   Anova Table (Type II tests)
|   
|   Response: y
|              Sum Sq Df F value    Pr(>F)    
|   x         22.7323  1 70.9133 3.013e-05 ***
|   g          6.2264  2  9.7116  0.007243 ** 
|   Residuals  2.5645  8                      
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1The ‘M’ function associated with ‘Lfx’ can generate code to test for differences between factor levels
|   list( 1,
|   1 * M(g),
|   1 * x,
|   1 * M(g) * x 
|   )The idea is to use the ‘Lfx’ expression to difference and then to apply to a data frame.
|   [1] 117   3|        x g g0
|   4    3 a  a
|   5    4 a  a
|   7    6 a  a
|   9    8 a  a
|   30   3 c  a
|   48   8 a  b
|   70   4 c  b
|   91  12 a  c
|   92   0 b  c
|   100  8 b  cwe don’t need to compare g with g0 with the same levels and we only need comparisons in one direction (perhaps not)
|   [1] 39  3|       x g g0
|   14  0 b  a
|   28  1 c  a
|   31  4 c  a
|   34  7 c  a
|   36  9 c  a
|   38 11 c  a
|   69  3 c  b
|   71  5 c  b
|   72  6 c  b
|   75  9 c  b|   list( 1,
|   1 * M(g),
|   1 * x,
|   1 * M(g) * x 
|   )‘difference’ g - g0, just like differentiating wrt to g except that ‘M’ function generates differences
|      (Intercept) gb gc x gb:x gc:x
|   14           0  0  1 0    0    0
|   15           0  0  1 0    1    0
|   16           0  0  1 0    2    0
|   17           0  0  1 0    3    0
|   18           0  0  1 0    4    0
|   19           0  0  1 0    5    0
|   20           0  0  1 0    6    0
|   21           0  0  1 0    7    0
|   22           0  0  1 0    8    0
|   23           0  0  1 0    9    0
|   24           0  0  1 0   10    0
|   25           0  0  1 0   11    0
|   26           0  0  1 0   12    0
|   27           0  0  0 1    0    0
|   28           0  0  0 1    0    1
|   29           0  0  0 1    0    2
|   30           0  0  0 1    0    3
|   31           0  0  0 1    0    4
|   32           0  0  0 1    0    5
|   33           0  0  0 1    0    6
|   34           0  0  0 1    0    7
|   35           0  0  0 1    0    8
|   36           0  0  0 1    0    9
|   37           0  0  0 1    0   10
|   38           0  0  0 1    0   11
|   39           0  0  0 1    0   12
|   66           0  0 -1 1    0    0
|   67           0  0 -1 1   -1    1
|   68           0  0 -1 1   -2    2
|   69           0  0 -1 1   -3    3
|   70           0  0 -1 1   -4    4
|   71           0  0 -1 1   -5    5
|   72           0  0 -1 1   -6    6
|   73           0  0 -1 1   -7    7
|   74           0  0 -1 1   -8    8
|   75           0  0 -1 1   -9    9
|   76           0  0 -1 1  -10   10
|   77           0  0 -1 1  -11   11
|   78           0  0 -1 1  -12   12
|   attr(,"data")
|       x g g0
|   14  0 b  a
|   15  1 b  a
|   16  2 b  a
|   17  3 b  a
|   18  4 b  a
|   19  5 b  a
|   20  6 b  a
|   21  7 b  a
|   22  8 b  a
|   23  9 b  a
|   24 10 b  a
|   25 11 b  a
|   26 12 b  a
|   27  0 c  a
|   28  1 c  a
|   29  2 c  a
|   30  3 c  a
|   31  4 c  a
|   32  5 c  a
|   33  6 c  a
|   34  7 c  a
|   35  8 c  a
|   36  9 c  a
|   37 10 c  a
|   38 11 c  a
|   39 12 c  a
|   66  0 c  b
|   67  1 c  b
|   68  2 c  b
|   69  3 c  b
|   70  4 c  b
|   71  5 c  b
|   72  6 c  b
|   73  7 c  b
|   74  8 c  b
|   75  9 c  b
|   76 10 c  b
|   77 11 c  b
|   78 12 c  b|     numDF denDF F.value p.value
|   1     4     6 63.0834   5e-05
|      Estimate  Std.Error DF t-value   p-value Lower 0.95 Upper 0.95
|   14 -0.536269  2.159667 6  -0.248311 0.81217  -5.820784  4.748246 
|   15 -0.279621  1.759556 6  -0.158915 0.87895  -4.585100  4.025859 
|   16 -0.022973  1.793506 6  -0.012809 0.99020  -4.411523  4.365578 
|   17  0.233675  2.241882 6   0.104232 0.92038  -5.252013  5.719364 
|   18  0.490323  2.919616 6   0.167941 0.87215  -6.653720  7.634366 
|   19  0.746971  3.702841 6   0.201729 0.84679  -8.313553  9.807496 
|   20  1.003619  4.537251 6   0.221195 0.83228 -10.098634 12.105872 
|   21  1.260267  5.399168 6   0.233419 0.82320 -11.951020 14.471555 
|   22  1.516915  6.277271 6   0.241652 0.81710 -13.843013 16.876843 
|   23  1.773563  7.165612 6   0.247510 0.81277 -15.760057 19.307183 
|   24  2.030211  8.060806 6   0.251862 0.80955 -17.693872 21.754294 
|   25  2.286859  8.960801 6   0.255207 0.80709 -19.639432 24.213150 
|   26  2.543507  9.864282 6   0.257850 0.80514 -21.593523 26.680537 
|   27  0.733744  0.896753 6   0.818223 0.44450  -1.460531  2.928020 
|   28  1.131654  0.200520 6   5.643597 0.00133   0.641000  1.622309 
|   29  1.529565  0.982344 6   1.557057 0.17046  -0.874144  3.933273 
|   30  1.927475  1.891702 6   1.018910 0.34756  -2.701354  6.556304 
|   31  2.325385  2.807281 6   0.828341 0.43918  -4.543783  9.194554 
|   32  2.723296  3.724495 6   0.731185 0.49222  -6.390215 11.836806 
|   33  3.121206  4.642375 6   0.672330 0.52640  -8.238276 14.480688 
|   34  3.519116  5.560591 6   0.632867 0.55016 -10.087161 17.125393 
|   35  3.917027  6.479001 6   0.604573 0.56761 -11.936518 19.770572 
|   36  4.314937  7.397532 6   0.583294 0.58094 -13.786173 22.416047 
|   37  4.712847  8.316145 6   0.566711 0.59146 -15.636026 25.061720 
|   38  5.110757  9.234814 6   0.553423 0.59997 -17.486018 27.707533 
|   39  5.508668 10.153525 6   0.542537 0.60700 -19.336112 30.353448 
|   66  1.270013  1.748093 6   0.726513 0.49488  -3.007418  5.547443 
|   67  1.411275  1.944114 6   0.725922 0.49522  -3.345801  6.168351 
|   68  1.552537  2.159667 6   0.718878 0.49924  -3.731978  6.837052 
|   69  1.693800  2.389472 6   0.708859 0.50501  -4.153028  7.540628 
|   70  1.835062  2.629796 6   0.697796 0.51143  -4.599817  8.269941 
|   71  1.976324  2.878004 6   0.686700 0.51792  -5.065898  9.018547 
|   72  2.117587  3.132223 6   0.676065 0.52419  -5.546688  9.781861 
|   73  2.258849  3.391102 6   0.666111 0.53010  -6.038878 10.556576 
|   74  2.400111  3.653649 6   0.656908 0.53561  -6.540046 11.340269 
|   75  2.541374  3.919128 6   0.648454 0.54070  -7.048388 12.131136 
|   76  2.682636  4.186982 6   0.640709 0.54539  -7.562539 12.927811 
|   77  2.823898  4.456781 6   0.633618 0.54971  -8.081452 13.729249 
|   78  2.965161  4.728193 6   0.627123 0.55368  -8.604311 14.534633|            coef       se       U2        L2 x g g0
|   14 -0.5362685 2.159667 3.783066 -4.855603 0 b  a
|   15 -0.2796206 1.759556 3.239492 -3.798733 1 b  a
|   16 -0.0229726 1.793506 3.564039 -3.609984 2 b  a
|   17  0.2336754 2.241882 4.717440 -4.250089 3 b  a
|   18  0.4903233 2.919616 6.329555 -5.348909 4 b  a
|   19  0.7469713 3.702841 8.152652 -6.658710 5 b  atd( col = c('black', 'blue','blue'), lty = 1, lwd = 2)
xyplot( coef +I(coef+2*se) + I(coef-2*se) ~ x | gap, 
        ww, type = 'l') +
layer_( panel.abline( h = 0))or
xyplot( coef +I(coef+2*se) + I(coef-2*se) ~ x | gap, 
        ww, type = 'l',
        ylab = "Estimated difference plus or minus 2 SEs",
        xlim = c(0,12)) +
  layer_( panel.abline( h = 0))or
gd(col = 'blue')
xyplot( coef  ~ x | gap, ww, type = 'l',
        ylab = list("differences between response levels", cex = 1.3, font = 2),
        xlim = c(0,12),
        ylim = c(-20,20),
        upper = ww$coef + 2* ww$se,
        lower = ww$coef - 2* ww$se,
        subscripts = T) +
  layer(panel.fit(...)) +
  layer_( panel.abline( h = 0))Note that if the significant gap between ‘c’ and ‘b’ around \(x=6\) is a question inspired by the data and not a ‘prior’ hypothesis, then some adjustment should be made for multiplicity.
reset the random seed:
|   
|   Call:
|   lm(formula = LifeExp ~ (smoke + I(smoke^2)) * region + hiv + 
|       special, data = dd)
|   
|   Residuals:
|        Min       1Q   Median       3Q      Max 
|   -10.3209  -3.1528   0.1424   3.0536  10.7776 
|   
|   Coefficients:
|                           Estimate Std. Error t value Pr(>|t|)    
|   (Intercept)            5.645e+01  1.512e+00  37.324  < 2e-16 ***
|   smoke                  9.687e-03  1.343e-02   0.721 0.472152    
|   I(smoke^2)             1.503e-05  1.811e-05   0.830 0.408190    
|   regionAMR              1.079e+01  2.846e+00   3.792 0.000226 ***
|   regionEMR             -1.088e+00  3.106e+00  -0.350 0.726834    
|   regionEUR              2.211e+01  3.982e+00   5.551  1.5e-07 ***
|   regionSEAR             1.146e+01  3.977e+00   2.881 0.004634 ** 
|   regionWPR              8.297e+00  7.772e+00   1.068 0.287671    
|   hiv                   -2.405e-01  1.112e-01  -2.162 0.032401 *  
|   special               -1.006e+01  2.958e+00  -3.402 0.000885 ***
|   smoke:regionAMR        1.876e-02  1.693e-02   1.108 0.269968    
|   smoke:regionEMR        1.857e-02  1.532e-02   1.212 0.227560    
|   smoke:regionEUR       -9.910e-03  1.430e-02  -0.693 0.489629    
|   smoke:regionSEAR       2.485e-03  2.294e-02   0.108 0.913918    
|   smoke:regionWPR        4.084e-03  2.122e-02   0.192 0.847700    
|   I(smoke^2):regionAMR  -3.295e-05  1.988e-05  -1.658 0.099715 .  
|   I(smoke^2):regionEMR  -2.432e-05  1.844e-05  -1.319 0.189551    
|   I(smoke^2):regionEUR  -1.559e-05  1.817e-05  -0.858 0.392392    
|   I(smoke^2):regionSEAR -2.559e-05  2.434e-05  -1.051 0.295049    
|   I(smoke^2):regionWPR  -1.788e-05  1.938e-05  -0.923 0.357846    
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 4.556 on 132 degrees of freedom
|     (42 observations deleted due to missingness)
|   Multiple R-squared:  0.8045,    Adjusted R-squared:  0.7764 
|   F-statistic:  28.6 on 19 and 132 DF,  p-value: < 2.2e-16|     numDF denDF  F.value p.value
|   :    10   132 3.198898 0.00104
|                         Estimate  Std.Error DF  t-value   p-value Lower 0.95
|   smoke:regionAMR        0.018757 0.016932  132  1.107787 0.26997 -0.014736 
|   smoke:regionEMR        0.018572 0.015320  132  1.212307 0.22756 -0.011732 
|   smoke:regionEUR       -0.009910 0.014304  132 -0.692837 0.48963 -0.038205 
|   smoke:regionSEAR       0.002485 0.022944  132  0.108305 0.91392 -0.042901 
|   smoke:regionWPR        0.004084 0.021225  132  0.192431 0.84770 -0.037900 
|   I(smoke^2):regionAMR  -0.000033 0.000020  132 -1.657885 0.09972 -0.000072 
|   I(smoke^2):regionEMR  -0.000024 0.000018  132 -1.318706 0.18955 -0.000061 
|   I(smoke^2):regionEUR  -0.000016 0.000018  132 -0.858101 0.39239 -0.000052 
|   I(smoke^2):regionSEAR -0.000026 0.000024  132 -1.051283 0.29505 -0.000074 
|   I(smoke^2):regionWPR  -0.000018 0.000019  132 -0.922703 0.35785 -0.000056 
|                         Upper 0.95
|   smoke:regionAMR       0.052251  
|   smoke:regionEMR       0.048876  
|   smoke:regionEUR       0.018384  
|   smoke:regionSEAR      0.047871  
|   smoke:regionWPR       0.046069  
|   I(smoke^2):regionAMR  0.000006  
|   I(smoke^2):regionEMR  0.000012  
|   I(smoke^2):regionEUR  0.000020  
|   I(smoke^2):regionSEAR 0.000023  
|   I(smoke^2):regionWPR  0.000020|       numDF denDF  F.value p.value
|   2):     5   132 2.093536 0.07012
|                         Estimate Std.Error DF  t-value   p-value Lower 0.95
|   I(smoke^2):regionAMR  -3.3e-05 2.0e-05   132 -1.657885 0.09972 -7.2e-05  
|   I(smoke^2):regionEMR  -2.4e-05 1.8e-05   132 -1.318706 0.18955 -6.1e-05  
|   I(smoke^2):regionEUR  -1.6e-05 1.8e-05   132 -0.858101 0.39239 -5.2e-05  
|   I(smoke^2):regionSEAR -2.6e-05 2.4e-05   132 -1.051283 0.29505 -7.4e-05  
|   I(smoke^2):regionWPR  -1.8e-05 1.9e-05   132 -0.922703 0.35785 -5.6e-05  
|                         Upper 0.95
|   I(smoke^2):regionAMR  6.0e-06   
|   I(smoke^2):regionEMR  1.2e-05   
|   I(smoke^2):regionEUR  2.0e-05   
|   I(smoke^2):regionSEAR 2.3e-05   
|   I(smoke^2):regionWPR  2.0e-05Using data values for prediction instead of creating a separate prediction data frame: This can work with curvilinear models if the data is sufficiently dense.
# reorder for nice lines:
dd <- dd[order(dd$region,dd$smoke),]
xyplot( LifeExp ~ smoke | region, 
        dd, subscripts = T, yq = dd$yq)  + 
  layer( panel.xyplot( x, yq[subscripts],..., 
                       type = 'b', col = 'red'))presents a problem because of missing hiv values.
Try again keeping non-missing data together to avoid interrupting lines
dd <- dd[order(is.na(dd$yq),dd$region,dd$smoke),]
xyplot( LifeExp ~ smoke | region, 
        dd, subscripts = T, yq = dd$yq)  + 
  layer( panel.xyplot( x, yq[subscripts],..., 
                       type = 'b', col = 'red'))Including Health Expenditure:
fitrhe <- lm( 
  LifeExp ~ 
    (smoke + I(smoke^2)) * region * log(HE) + hiv + special, 
  dd)
summary(fitrhe)|   
|   Call:
|   lm(formula = LifeExp ~ (smoke + I(smoke^2)) * region * log(HE) + 
|       hiv + special, data = dd)
|   
|   Residuals:
|       Min      1Q  Median      3Q     Max 
|   -6.9890 -1.4067  0.1123  1.2544  9.7261 
|   
|   Coefficients:
|                                   Estimate Std. Error t value Pr(>|t|)    
|   (Intercept)                    5.088e+01  7.646e+00   6.654 1.12e-09 ***
|   smoke                         -1.832e-02  5.188e-02  -0.353 0.724587    
|   I(smoke^2)                     4.173e-05  7.885e-05   0.529 0.597687    
|   regionAMR                     -1.375e+01  1.544e+01  -0.890 0.375292    
|   regionEMR                      6.606e+00  1.782e+01   0.371 0.711655    
|   regionEUR                     -8.512e+00  1.948e+01  -0.437 0.662904    
|   regionSEAR                     1.698e+02  8.098e+01   2.097 0.038272 *  
|   regionWPR                     -6.176e+00  2.585e+01  -0.239 0.811624    
|   log(HE)                        1.527e+00  1.679e+00   0.910 0.364981    
|   hiv                           -3.930e-01  1.003e-01  -3.917 0.000155 ***
|   special                       -1.296e+01  2.343e+00  -5.532 2.13e-07 ***
|   smoke:regionAMR                6.425e-02  8.185e-02   0.785 0.434105    
|   smoke:regionEMR                2.158e-02  6.486e-02   0.333 0.740033    
|   smoke:regionEUR                2.975e-02  5.868e-02   0.507 0.613198    
|   smoke:regionSEAR              -1.262e+00  6.146e-01  -2.053 0.042407 *  
|   smoke:regionWPR                1.699e-02  7.631e-02   0.223 0.824199    
|   I(smoke^2):regionAMR          -5.165e-05  9.443e-05  -0.547 0.585509    
|   I(smoke^2):regionEMR          -4.252e-05  8.567e-05  -0.496 0.620633    
|   I(smoke^2):regionEUR          -4.762e-05  7.938e-05  -0.600 0.549800    
|   I(smoke^2):regionSEAR          1.839e-03  8.954e-04   2.054 0.042279 *  
|   I(smoke^2):regionWPR          -3.620e-05  8.313e-05  -0.435 0.664067    
|   smoke:log(HE)                  5.178e-03  9.762e-03   0.530 0.596867    
|   I(smoke^2):log(HE)            -6.228e-06  1.370e-05  -0.454 0.650385    
|   regionAMR:log(HE)              3.852e+00  2.780e+00   1.386 0.168672    
|   regionEMR:log(HE)             -1.179e+00  3.844e+00  -0.307 0.759613    
|   regionEUR:log(HE)              3.195e+00  2.932e+00   1.090 0.278285    
|   regionSEAR:log(HE)            -3.376e+01  1.688e+01  -2.000 0.047953 *  
|   regionWPR:log(HE)              2.831e+00  4.291e+00   0.660 0.510830    
|   smoke:regionAMR:log(HE)       -1.044e-02  1.392e-02  -0.750 0.454840    
|   smoke:regionEMR:log(HE)       -2.242e-03  1.204e-02  -0.186 0.852601    
|   smoke:regionEUR:log(HE)       -6.670e-03  1.044e-02  -0.639 0.524198    
|   smoke:regionSEAR:log(HE)       2.685e-01  1.292e-01   2.077 0.040092 *  
|   smoke:regionWPR:log(HE)       -4.551e-03  1.300e-02  -0.350 0.726878    
|   I(smoke^2):regionAMR:log(HE)   6.664e-06  1.592e-05   0.419 0.676298    
|   I(smoke^2):regionEMR:log(HE)   5.200e-06  1.448e-05   0.359 0.720256    
|   I(smoke^2):regionEUR:log(HE)   6.979e-06  1.376e-05   0.507 0.613012    
|   I(smoke^2):regionSEAR:log(HE) -3.900e-04  1.869e-04  -2.087 0.039160 *  
|   I(smoke^2):regionWPR:log(HE)   5.384e-06  1.424e-05   0.378 0.705969    
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 3.194 on 111 degrees of freedom
|     (45 observations deleted due to missingness)
|   Multiple R-squared:  0.9154,    Adjusted R-squared:  0.8873 
|   F-statistic: 32.48 on 37 and 111 DF,  p-value: < 2.2e-16|   [1] 38Should have > 380 observations using Harrell’s rules of thumb for valid regression
|     numDF denDF F.value p.value
|   :    27   111 1.37843 0.12542
|                                 Estimate   Std.Error DF  t-value   p-value
|   smoke:regionAMR                 0.064254  0.081849 111  0.785035 0.43411
|   smoke:regionEMR                 0.021575  0.064860 111  0.332641 0.74003
|   smoke:regionEUR                 0.029747  0.058678 111  0.506947 0.61320
|   smoke:regionSEAR               -1.261784  0.614556 111 -2.053164 0.04241
|   smoke:regionWPR                 0.016993  0.076313 111  0.222673 0.82420
|   I(smoke^2):regionAMR           -0.000052  0.000094 111 -0.546953 0.58551
|   I(smoke^2):regionEMR           -0.000043  0.000086 111 -0.496344 0.62063
|   I(smoke^2):regionEUR           -0.000048  0.000079 111 -0.599894 0.54980
|   I(smoke^2):regionSEAR           0.001839  0.000895 111  2.054470 0.04228
|   I(smoke^2):regionWPR           -0.000036  0.000083 111 -0.435470 0.66407
|   smoke:log(HE)                   0.005178  0.009762 111  0.530439 0.59687
|   I(smoke^2):log(HE)             -0.000006  0.000014 111 -0.454460 0.65039
|   regionAMR:log(HE)               3.851782  2.780032 111  1.385517 0.16867
|   regionEMR:log(HE)              -1.179181  3.844186 111 -0.306744 0.75961
|   regionEUR:log(HE)               3.194526  2.932055 111  1.089518 0.27829
|   regionSEAR:log(HE)            -33.758269 16.879974 111 -1.999901 0.04795
|   regionWPR:log(HE)               2.830507  4.290773 111  0.659673 0.51083
|   smoke:regionAMR:log(HE)        -0.010442  0.013922 111 -0.750002 0.45484
|   smoke:regionEMR:log(HE)        -0.002242  0.012041 111 -0.186234 0.85260
|   smoke:regionEUR:log(HE)        -0.006670  0.010440 111 -0.638910 0.52420
|   smoke:regionSEAR:log(HE)        0.268460  0.129242 111  2.077190 0.04009
|   smoke:regionWPR:log(HE)        -0.004551  0.012996 111 -0.350166 0.72688
|   I(smoke^2):regionAMR:log(HE)    0.000007  0.000016 111  0.418627 0.67630
|   I(smoke^2):regionEMR:log(HE)    0.000005  0.000014 111  0.359028 0.72026
|   I(smoke^2):regionEUR:log(HE)    0.000007  0.000014 111  0.507212 0.61301
|   I(smoke^2):regionSEAR:log(HE)  -0.000390  0.000187 111 -2.087196 0.03916
|   I(smoke^2):regionWPR:log(HE)    0.000005  0.000014 111  0.378247 0.70597
|                                 Lower 0.95 Upper 0.95
|   smoke:regionAMR                -0.097934  0.226443 
|   smoke:regionEMR                -0.106950  0.150100 
|   smoke:regionEUR                -0.086528  0.146022 
|   smoke:regionSEAR               -2.479567 -0.044001 
|   smoke:regionWPR                -0.134226  0.168211 
|   I(smoke^2):regionAMR           -0.000239  0.000135 
|   I(smoke^2):regionEMR           -0.000212  0.000127 
|   I(smoke^2):regionEUR           -0.000205  0.000110 
|   I(smoke^2):regionSEAR           0.000065  0.003614 
|   I(smoke^2):regionWPR           -0.000201  0.000129 
|   smoke:log(HE)                  -0.014165  0.024521 
|   I(smoke^2):log(HE)             -0.000033  0.000021 
|   regionAMR:log(HE)              -1.657038  9.360602 
|   regionEMR:log(HE)              -8.796691  6.438330 
|   regionEUR:log(HE)              -2.615536  9.004588 
|   regionSEAR:log(HE)            -67.207064 -0.309474 
|   regionWPR:log(HE)              -5.671945 11.332960 
|   smoke:regionAMR:log(HE)        -0.038029  0.017146 
|   smoke:regionEMR:log(HE)        -0.026102  0.021617 
|   smoke:regionEUR:log(HE)        -0.027359  0.014018 
|   smoke:regionSEAR:log(HE)        0.012358  0.524561 
|   smoke:regionWPR:log(HE)        -0.030303  0.021202 
|   I(smoke^2):regionAMR:log(HE)   -0.000025  0.000038 
|   I(smoke^2):regionEMR:log(HE)   -0.000024  0.000034 
|   I(smoke^2):regionEUR:log(HE)   -0.000020  0.000034 
|   I(smoke^2):regionSEAR:log(HE)  -0.000760 -0.000020 
|   I(smoke^2):regionWPR:log(HE)   -0.000023  0.000034|     numDF denDF   F.value p.value
|   2    12   111 0.7935878 0.65615
|                                 Estimate  Std.Error DF  t-value   p-value
|   I(smoke^2)                     0.000042 0.000079  111  0.529253 0.59769
|   I(smoke^2):regionAMR          -0.000052 0.000094  111 -0.546953 0.58551
|   I(smoke^2):regionEMR          -0.000043 0.000086  111 -0.496344 0.62063
|   I(smoke^2):regionEUR          -0.000048 0.000079  111 -0.599894 0.54980
|   I(smoke^2):regionSEAR          0.001839 0.000895  111  2.054470 0.04228
|   I(smoke^2):regionWPR          -0.000036 0.000083  111 -0.435470 0.66407
|   I(smoke^2):log(HE)            -0.000006 0.000014  111 -0.454460 0.65039
|   I(smoke^2):regionAMR:log(HE)   0.000007 0.000016  111  0.418627 0.67630
|   I(smoke^2):regionEMR:log(HE)   0.000005 0.000014  111  0.359028 0.72026
|   I(smoke^2):regionEUR:log(HE)   0.000007 0.000014  111  0.507212 0.61301
|   I(smoke^2):regionSEAR:log(HE) -0.000390 0.000187  111 -2.087196 0.03916
|   I(smoke^2):regionWPR:log(HE)   0.000005 0.000014  111  0.378247 0.70597
|                                 Lower 0.95 Upper 0.95
|   I(smoke^2)                    -0.000115   0.000198 
|   I(smoke^2):regionAMR          -0.000239   0.000135 
|   I(smoke^2):regionEMR          -0.000212   0.000127 
|   I(smoke^2):regionEUR          -0.000205   0.000110 
|   I(smoke^2):regionSEAR          0.000065   0.003614 
|   I(smoke^2):regionWPR          -0.000201   0.000129 
|   I(smoke^2):log(HE)            -0.000033   0.000021 
|   I(smoke^2):regionAMR:log(HE)  -0.000025   0.000038 
|   I(smoke^2):regionEMR:log(HE)  -0.000024   0.000034 
|   I(smoke^2):regionEUR:log(HE)  -0.000020   0.000034 
|   I(smoke^2):regionSEAR:log(HE) -0.000760  -0.000020 
|   I(smoke^2):regionWPR:log(HE)  -0.000023   0.000034|          numDF denDF  F.value p.value
|   2|:.*:    17   111 1.286303 0.21443
|                                 Estimate  Std.Error DF  t-value   p-value
|   I(smoke^2)                     0.000042 0.000079  111  0.529253 0.59769
|   I(smoke^2):regionAMR          -0.000052 0.000094  111 -0.546953 0.58551
|   I(smoke^2):regionEMR          -0.000043 0.000086  111 -0.496344 0.62063
|   I(smoke^2):regionEUR          -0.000048 0.000079  111 -0.599894 0.54980
|   I(smoke^2):regionSEAR          0.001839 0.000895  111  2.054470 0.04228
|   I(smoke^2):regionWPR          -0.000036 0.000083  111 -0.435470 0.66407
|   I(smoke^2):log(HE)            -0.000006 0.000014  111 -0.454460 0.65039
|   smoke:regionAMR:log(HE)       -0.010442 0.013922  111 -0.750002 0.45484
|   smoke:regionEMR:log(HE)       -0.002242 0.012041  111 -0.186234 0.85260
|   smoke:regionEUR:log(HE)       -0.006670 0.010440  111 -0.638910 0.52420
|   smoke:regionSEAR:log(HE)       0.268460 0.129242  111  2.077190 0.04009
|   smoke:regionWPR:log(HE)       -0.004551 0.012996  111 -0.350166 0.72688
|   I(smoke^2):regionAMR:log(HE)   0.000007 0.000016  111  0.418627 0.67630
|   I(smoke^2):regionEMR:log(HE)   0.000005 0.000014  111  0.359028 0.72026
|   I(smoke^2):regionEUR:log(HE)   0.000007 0.000014  111  0.507212 0.61301
|   I(smoke^2):regionSEAR:log(HE) -0.000390 0.000187  111 -2.087196 0.03916
|   I(smoke^2):regionWPR:log(HE)   0.000005 0.000014  111  0.378247 0.70597
|                                 Lower 0.95 Upper 0.95
|   I(smoke^2)                    -0.000115   0.000198 
|   I(smoke^2):regionAMR          -0.000239   0.000135 
|   I(smoke^2):regionEMR          -0.000212   0.000127 
|   I(smoke^2):regionEUR          -0.000205   0.000110 
|   I(smoke^2):regionSEAR          0.000065   0.003614 
|   I(smoke^2):regionWPR          -0.000201   0.000129 
|   I(smoke^2):log(HE)            -0.000033   0.000021 
|   smoke:regionAMR:log(HE)       -0.038029   0.017146 
|   smoke:regionEMR:log(HE)       -0.026102   0.021617 
|   smoke:regionEUR:log(HE)       -0.027359   0.014018 
|   smoke:regionSEAR:log(HE)       0.012358   0.524561 
|   smoke:regionWPR:log(HE)       -0.030303   0.021202 
|   I(smoke^2):regionAMR:log(HE)  -0.000025   0.000038 
|   I(smoke^2):regionEMR:log(HE)  -0.000024   0.000034 
|   I(smoke^2):regionEUR:log(HE)  -0.000020   0.000034 
|   I(smoke^2):regionSEAR:log(HE) -0.000760  -0.000020 
|   I(smoke^2):regionWPR:log(HE)  -0.000023   0.000034                       # higher way interaction
fitr2 <- lm( LifeExp ~ 
               (smoke + log(HE) + region)^2 + hiv + special, dd)
summary(fitr2)|   
|   Call:
|   lm(formula = LifeExp ~ (smoke + log(HE) + region)^2 + hiv + special, 
|       data = dd)
|   
|   Residuals:
|       Min      1Q  Median      3Q     Max 
|   -6.9873 -1.8511  0.1576  1.7496  8.8413 
|   
|   Coefficients:
|                        Estimate Std. Error t value Pr(>|t|)    
|   (Intercept)         4.534e+01  3.590e+00  12.628  < 2e-16 ***
|   smoke               1.726e-02  4.899e-03   3.522 0.000594 ***
|   log(HE)             2.623e+00  8.903e-01   2.947 0.003816 ** 
|   regionAMR           6.004e+00  6.061e+00   0.991 0.323748    
|   regionEMR           3.025e+00  8.003e+00   0.378 0.706085    
|   regionEUR          -4.070e+00  7.195e+00  -0.566 0.572584    
|   regionSEAR          7.659e+00  6.956e+00   1.101 0.272966    
|   regionWPR          -2.520e+00  5.886e+00  -0.428 0.669332    
|   hiv                -4.235e-01  9.424e-02  -4.494 1.55e-05 ***
|   special            -1.352e+01  2.183e+00  -6.195 7.35e-09 ***
|   smoke:log(HE)      -6.207e-04  6.668e-04  -0.931 0.353725    
|   smoke:regionAMR    -1.142e-02  4.176e-03  -2.735 0.007128 ** 
|   smoke:regionEMR    -8.776e-03  4.180e-03  -2.099 0.037745 *  
|   smoke:regionEUR    -1.372e-02  3.758e-03  -3.651 0.000379 ***
|   smoke:regionSEAR   -1.321e-02  4.980e-03  -2.652 0.009019 ** 
|   smoke:regionWPR    -1.100e-02  4.111e-03  -2.675 0.008450 ** 
|   log(HE):regionAMR   8.114e-01  1.180e+00   0.687 0.493061    
|   log(HE):regionEMR   3.281e-01  1.700e+00   0.193 0.847248    
|   log(HE):regionEUR   2.389e+00  1.156e+00   2.067 0.040730 *  
|   log(HE):regionSEAR  7.592e-01  1.455e+00   0.522 0.602617    
|   log(HE):regionWPR   2.073e+00  1.159e+00   1.789 0.075940 .  
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 3.254 on 128 degrees of freedom
|     (45 observations deleted due to missingness)
|   Multiple R-squared:  0.8988,    Adjusted R-squared:  0.883 
|   F-statistic: 56.83 on 20 and 128 DF,  p-value: < 2.2e-16|     numDF denDF  F.value p.value
|   :    11   128 2.478905 0.00752
|                      Estimate  Std.Error DF  t-value   p-value Lower 0.95
|   smoke:log(HE)      -0.000621 0.000667  128 -0.930769 0.35373 -0.001940 
|   smoke:regionAMR    -0.011419 0.004176  128 -2.734757 0.00713 -0.019681 
|   smoke:regionEMR    -0.008776 0.004180  128 -2.099411 0.03774 -0.017047 
|   smoke:regionEUR    -0.013722 0.003758  128 -3.651365 0.00038 -0.021158 
|   smoke:regionSEAR   -0.013207 0.004980  128 -2.651798 0.00902 -0.023061 
|   smoke:regionWPR    -0.010997 0.004111  128 -2.674936 0.00845 -0.019132 
|   log(HE):regionAMR   0.811401 1.180355  128  0.687421 0.49306 -1.524133 
|   log(HE):regionEMR   0.328123 1.699935  128  0.193021 0.84725 -3.035488 
|   log(HE):regionEUR   2.389164 1.155748  128  2.067201 0.04073  0.102319 
|   log(HE):regionSEAR  0.759205 1.454596  128  0.521935 0.60262 -2.118962 
|   log(HE):regionWPR   2.073361 1.158788  128  1.789249 0.07594 -0.219500 
|                      Upper 0.95
|   smoke:log(HE)       0.000699 
|   smoke:regionAMR    -0.003157 
|   smoke:regionEMR    -0.000505 
|   smoke:regionEUR    -0.006286 
|   smoke:regionSEAR   -0.003352 
|   smoke:regionWPR    -0.002862 
|   log(HE):regionAMR   3.146934 
|   log(HE):regionEMR   3.691734 
|   log(HE):regionEUR   4.676008 
|   log(HE):regionSEAR  3.637371 
|   log(HE):regionWPR   4.366221|             numDF denDF   F.value p.value
|   HE):|:log     6   128 0.9991023 0.42891
|                      Estimate  Std.Error DF  t-value   p-value Lower 0.95
|   smoke:log(HE)      -0.000621 0.000667  128 -0.930769 0.35373 -0.001940 
|   log(HE):regionAMR   0.811401 1.180355  128  0.687421 0.49306 -1.524133 
|   log(HE):regionEMR   0.328123 1.699935  128  0.193021 0.84725 -3.035488 
|   log(HE):regionEUR   2.389164 1.155748  128  2.067201 0.04073  0.102319 
|   log(HE):regionSEAR  0.759205 1.454596  128  0.521935 0.60262 -2.118962 
|   log(HE):regionWPR   2.073361 1.158788  128  1.789249 0.07594 -0.219500 
|                      Upper 0.95
|   smoke:log(HE)      0.000699  
|   log(HE):regionAMR  3.146934  
|   log(HE):regionEMR  3.691734  
|   log(HE):regionEUR  4.676008  
|   log(HE):regionSEAR 3.637371  
|   log(HE):regionWPR  4.366221|   
|   Call:
|   lm(formula = LifeExp ~ region * smoke + log(HE) + hiv + special, 
|       data = dd)
|   
|   Residuals:
|      Min     1Q Median     3Q    Max 
|   -7.016 -1.863  0.165  1.627  8.700 
|   
|   Coefficients:
|                      Estimate Std. Error t value Pr(>|t|)    
|   (Intercept)       41.364150   1.472822  28.085  < 2e-16 ***
|   regionAMR          8.712382   1.438295   6.057 1.31e-08 ***
|   regionEMR          3.799335   1.759763   2.159 0.032631 *  
|   regionEUR          9.842506   1.887832   5.214 6.84e-07 ***
|   regionSEAR        10.122815   1.839269   5.504 1.83e-07 ***
|   regionWPR          7.460923   1.969657   3.788 0.000229 ***
|   smoke              0.010666   0.002712   3.933 0.000134 ***
|   log(HE)            3.688028   0.314332  11.733  < 2e-16 ***
|   hiv               -0.501073   0.079030  -6.340 3.25e-09 ***
|   special          -14.766685   2.013362  -7.334 1.91e-11 ***
|   regionAMR:smoke   -0.009904   0.003175  -3.120 0.002216 ** 
|   regionEMR:smoke   -0.007235   0.003068  -2.358 0.019822 *  
|   regionEUR:smoke   -0.011760   0.002835  -4.148 5.93e-05 ***
|   regionSEAR:smoke  -0.009663   0.004494  -2.150 0.033335 *  
|   regionWPR:smoke   -0.008164   0.003110  -2.625 0.009666 ** 
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 3.254 on 134 degrees of freedom
|     (45 observations deleted due to missingness)
|   Multiple R-squared:  0.894, Adjusted R-squared:  0.883 
|   F-statistic: 80.76 on 14 and 134 DF,  p-value: < 2.2e-16|     numDF denDF F.value p.value
|   :     5   134 4.25484 0.00127
|                    Estimate  Std.Error DF  t-value   p-value Lower 0.95
|   regionAMR:smoke  -0.009904 0.003175  134 -3.119787 0.00222 -0.016183 
|   regionEMR:smoke  -0.007235 0.003068  134 -2.357952 0.01982 -0.013303 
|   regionEUR:smoke  -0.011760 0.002835  134 -4.147578 0.00006 -0.017368 
|   regionSEAR:smoke -0.009663 0.004494  134 -2.150191 0.03334 -0.018551 
|   regionWPR:smoke  -0.008164 0.003110  134 -2.625207 0.00967 -0.014314 
|                    Upper 0.95
|   regionAMR:smoke  -0.003625 
|   regionEMR:smoke  -0.001166 
|   regionEUR:smoke  -0.006152 
|   regionSEAR:smoke -0.000775 
|   regionWPR:smoke  -0.002013|          numDF denDF  F.value p.value
|   region    10   134 8.121058 <.00001
|                    Estimate  Std.Error DF  t-value   p-value Lower 0.95
|   regionAMR         8.712382 1.438295  134  6.057439 <.00001  5.867686 
|   regionEMR         3.799335 1.759763  134  2.159003 0.03263  0.318829 
|   regionEUR         9.842506 1.887832  134  5.213655 <.00001  6.108703 
|   regionSEAR       10.122815 1.839269  134  5.503717 <.00001  6.485062 
|   regionWPR         7.460923 1.969657  134  3.787931 0.00023  3.565285 
|   regionAMR:smoke  -0.009904 0.003175  134 -3.119787 0.00222 -0.016183 
|   regionEMR:smoke  -0.007235 0.003068  134 -2.357952 0.01982 -0.013303 
|   regionEUR:smoke  -0.011760 0.002835  134 -4.147578 0.00006 -0.017368 
|   regionSEAR:smoke -0.009663 0.004494  134 -2.150191 0.03334 -0.018551 
|   regionWPR:smoke  -0.008164 0.003110  134 -2.625207 0.00967 -0.014314 
|                    Upper 0.95
|   regionAMR        11.557078 
|   regionEMR         7.279840 
|   regionEUR        13.576310 
|   regionSEAR       13.760568 
|   regionWPR        11.356561 
|   regionAMR:smoke  -0.003625 
|   regionEMR:smoke  -0.001166 
|   regionEUR:smoke  -0.006152 
|   regionSEAR:smoke -0.000775 
|   regionWPR:smoke  -0.002013|      numDF denDF  F.value p.value
|   HE     1   134 137.6607 <.00001
|           Estimate Std.Error DF  t-value  p-value Lower 0.95 Upper 0.95
|   log(HE) 3.688028 0.314332  134 11.73289 <.00001 3.066333   4.309723|   Anova Table (Type II tests)
|   
|   Response: LifeExp
|                 Sum Sq  Df  F value    Pr(>F)    
|   region        634.67   5  11.9873 1.339e-09 ***
|   smoke           6.95   1   0.6561   0.41936    
|   log(HE)      1457.70   1 137.6607 < 2.2e-16 ***
|   hiv           425.67   1  40.1992 3.250e-09 ***
|   special       569.62   1  53.7926 1.910e-11 ***
|   region:smoke  225.27   5   4.2548   0.00127 ** 
|   Residuals    1418.94 134                       
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1|   Analysis of Variance Table
|   
|   Response: LifeExp
|                 Df Sum Sq Mean Sq  F value    Pr(>F)    
|   region         5 8942.0 1788.40 168.8905 < 2.2e-16 ***
|   smoke          1  141.2  141.20  13.3348 0.0003725 ***
|   log(HE)        1 1788.2 1788.22 168.8740 < 2.2e-16 ***
|   hiv            1  337.6  337.64  31.8856 9.375e-08 ***
|   special        1  537.9  537.94  50.8014 5.711e-11 ***
|   region:smoke   5  225.3   45.05   4.2548 0.0012701 ** 
|   Residuals    134 1418.9   10.59                       
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1|         numDF denDF  F.value p.value
|   smoke     6   134 3.655058 0.00216
|                    Estimate  Std.Error DF  t-value   p-value Lower 0.95
|   smoke             0.010666 0.002712  134  3.933292 0.00013  0.005303 
|   regionAMR:smoke  -0.009904 0.003175  134 -3.119787 0.00222 -0.016183 
|   regionEMR:smoke  -0.007235 0.003068  134 -2.357952 0.01982 -0.013303 
|   regionEUR:smoke  -0.011760 0.002835  134 -4.147578 0.00006 -0.017368 
|   regionSEAR:smoke -0.009663 0.004494  134 -2.150191 0.03334 -0.018551 
|   regionWPR:smoke  -0.008164 0.003110  134 -2.625207 0.00967 -0.014314 
|                    Upper 0.95
|   smoke             0.016029 
|   regionAMR:smoke  -0.003625 
|   regionEMR:smoke  -0.001166 
|   regionEUR:smoke  -0.006152 
|   regionSEAR:smoke -0.000775 
|   regionWPR:smoke  -0.002013|     region     col
|   1    AFR    blue
|   2    AMR   green
|   3    EMR  orange
|   4    EUR magenta
|   5   SEAR    cyan
|   6    WPR     red
|   
|   Use left mouse to rotate, middle mouse (or scroll) to zoom, right mouse to change perspective|   Warning in log(HE): NaNs produced3d
The following is an example of exploring data using regression in R. But it’s very unrealistic.
Real data analysis does not start with a neat rectangular data set.
Hadley Wickham: “Data analysis is the process by which data becomes understanding, knowledge and insight”
The process involves much more than running regressions:
Init3d(cex=1)
ds <- dd
ds$Life <- ds$LE 
ds$Cigarettes <- ds$smoke
ds$Health <- ds$HE
ds$area <-ds$region
ds$area <- tr(ds$region, 
              c("AFR", "AMR", "EMR", "EUR", "SEAR", "WPR"),
              c("Africa","South Asia","Other")[c(1,3,3,3,2,3)])
Plot3d( Life ~ Cigarettes + Health |region,ds) |     region     col
|   1    AFR    blue
|   2    AMR   green
|   3    EMR  orange
|   4    EUR magenta
|   5   SEAR    cyan
|   6    WPR     red
|   
|   Use left mouse to rotate, middle mouse (or scroll) to zoom, right mouse to change perspective|   
|   Call:
|   lm(formula = Life ~ Cigarettes, data = ds)
|   
|   Residuals:
|        Min       1Q   Median       3Q      Max 
|   -19.2997  -5.7258   0.6864   6.5300  14.9811 
|   
|   Coefficients:
|                Estimate Std. Error t value Pr(>|t|)    
|   (Intercept) 6.508e+01  8.560e-01   76.03  < 2e-16 ***
|   Cigarettes  6.915e-03  8.547e-04    8.09 7.99e-14 ***
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 7.969 on 183 degrees of freedom
|     (9 observations deleted due to missingness)
|   Multiple R-squared:  0.2635,    Adjusted R-squared:  0.2594 
|   F-statistic: 65.46 on 1 and 183 DF,  p-value: 7.987e-14|    numDF denDF  F.value p.value
|        2   183 7194.457 <.00001
|               Estimate  Std.Error DF  t-value   p-value Lower 0.95 Upper 0.95
|   (Intercept) 65.075840 0.855974  183 76.025515 <.00001 63.386994  66.764687 
|   Cigarettes   0.006915 0.000855  183  8.090493 <.00001  0.005228   0.008601Fit3d(fit, lwd = 3)
fitsq <- lm( Life ~ Cigarettes+I(Cigarettes^2), ds)
Fit3d(fitsq, lwd = 3, col = 'red')
spin(0,0,0)
#  Id3d(pad=1)
# Id3d("Canada")
# Id3d("United States")
par3d(windowRect=c(10,10,700,700))
rgl.snapshot('quadsmoke.png')  
Pop3d(2)Fig-quadsmoke
|   
|   Call:
|   lm(formula = Life ~ Cigarettes + Health, data = ds)
|   
|   Residuals:
|       Min      1Q  Median      3Q     Max 
|   -17.960  -4.309   1.161   5.304  11.772 
|   
|   Coefficients:
|                Estimate Std. Error t value Pr(>|t|)    
|   (Intercept) 6.368e+01  7.464e-01  85.313  < 2e-16 ***
|   Cigarettes  4.312e-03  7.658e-04   5.631 6.84e-08 ***
|   Health      3.125e-03  3.585e-04   8.717 1.90e-15 ***
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 6.633 on 179 degrees of freedom
|     (12 observations deleted due to missingness)
|   Multiple R-squared:  0.4786,    Adjusted R-squared:  0.4728 
|   F-statistic: 82.16 on 2 and 179 DF,  p-value: < 2.2e-16fith <- lm( Life ~ Cigarettes + Health + log( Health),ds)
Fit3d(fitlin, col = 'pink')
Fit3d(fith, col = 'red')|   Warning in log(Health): NaNs producedds$propGovt <- with(ds, govt/total) # proportion of health exp. 
                                    # from Govt
Plot3d( Life ~ Health + propGovt |area, ds)|           area    col
|   1     Africa   blue
|   2      Other  green
|   3 South Asia orange
|   
|   Use left mouse to rotate, middle mouse (or scroll) to zoom, right mouse to change perspectiveFig-health-pgovt
Try something that looks sensible:
The relationship between Life Expectance and Health Expentitures per capita and the proportion of health expenditures funelled through the government
ds$propGovt <- with(ds, govt/total)   
                    # proportion of health exp. from Govt
 
fit <- lm( Life ~ (Health + log(Health) + propGovt)  * area , ds,
          na.action = na.exclude)   
summary(fit)|   
|   Call:
|   lm(formula = Life ~ (Health + log(Health) + propGovt) * area, 
|       data = ds, na.action = na.exclude)
|   
|   Residuals:
|        Min       1Q   Median       3Q      Max 
|   -13.4282  -1.9480   0.1869   2.1919  14.7809 
|   
|   Coefficients:
|                                Estimate Std. Error t value Pr(>|t|)    
|   (Intercept)                 39.041364   5.426332   7.195 1.68e-11 ***
|   Health                      -0.002672   0.003847  -0.695 0.488221    
|   log(Health)                  2.368302   1.239444   1.911 0.057640 .  
|   propGovt                    16.083451   3.854716   4.172 4.71e-05 ***
|   areaOther                    4.316456   6.528366   0.661 0.509349    
|   areaSouth Asia              15.629164  15.611788   1.001 0.318131    
|   Health:areaOther             0.002218   0.003872   0.573 0.567364    
|   Health:areaSouth Asia        0.006672   0.015873   0.420 0.674730    
|   log(Health):areaOther        2.454985   1.393231   1.762 0.079772 .  
|   log(Health):areaSouth Asia   1.029188   4.155759   0.248 0.804688    
|   propGovt:areaOther         -17.211452   4.343678  -3.962 0.000107 ***
|   propGovt:areaSouth Asia    -21.993937   8.781094  -2.505 0.013154 *  
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 4.007 on 178 degrees of freedom
|     (4 observations deleted due to missingness)
|   Multiple R-squared:  0.817, Adjusted R-squared:  0.8057 
|   F-statistic: 72.24 on 11 and 178 DF,  p-value: < 2.2e-16|           area    col
|   1     Africa   blue
|   2      Other  green
|   3 South Asia orange
|   
|   Use left mouse to rotate, middle mouse (or scroll) to zoom, right mouse to change perspective|   Warning in log(Health): NaNs producedhpfit
Question: Is this model too big for the data?
Should we drop Health expenditures?
None of the coefficients relating to Health are significant.
Can we conclude that “Health” does not add to the predictive power of this model?
Type II sums of squares are slightly less prone to massive misinterpretation since each test %>% satistifies the POM.
|   Anova Table (Type II tests)
|   
|   Response: Life
|                     Sum Sq  Df F value    Pr(>F)    
|   Health             19.55   1  1.2175 0.2713384    
|   log(Health)       942.29   1 58.6882 1.142e-12 ***
|   propGovt           24.05   1  1.4979 0.2226053    
|   area             2187.86   2 68.1331 < 2.2e-16 ***
|   Health:area         6.63   2  0.2065 0.8136300    
|   log(Health):area   50.69   2  1.5785 0.2091650    
|   propGovt:area     269.57   2  8.3948 0.0003281 ***
|   Residuals        2857.93 178                      
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1|          numDF denDF F.value p.value
|   Health     6   178 29.4052 <.00001
|                              Estimate  Std.Error DF  t-value   p-value Lower 0.95
|   Health                     -0.002672 0.003847  178 -0.694585 0.48822 -0.010263 
|   log(Health)                 2.368302 1.239444  178  1.910778 0.05764 -0.077593 
|   Health:areaOther            0.002218 0.003872  178  0.573005 0.56736 -0.005422 
|   Health:areaSouth Asia       0.006672 0.015873  178  0.420359 0.67473 -0.024651 
|   log(Health):areaOther       2.454985 1.393231  178  1.762080 0.07977 -0.294391 
|   log(Health):areaSouth Asia  1.029188 4.155759  178  0.247653 0.80469 -7.171707 
|                              Upper 0.95
|   Health                     0.004919  
|   log(Health)                4.814197  
|   Health:areaOther           0.009858  
|   Health:areaSouth Asia      0.037995  
|   log(Health):areaOther      5.204360  
|   log(Health):areaSouth Asia 9.230082Note:
In the above, note that the overall evidence is VERY strong although individual p-values not even significant
I cannot sufficiently stress the importance of the principle this illustrates.
100% of beginning graduates students in statistics programs will fall into the trap of mis-interpreting p-values in regression output. It’s as bad a professional error as a doctor amputating the wrong leg.
The Likelihood Ratio Test and the Wald test:
Null model:
OR you can use ‘update’:
|   
|   Call:
|   lm(formula = Life ~ propGovt + area + propGovt:area, data = ds, 
|       na.action = na.exclude)
|   
|   Residuals:
|        Min       1Q   Median       3Q      Max 
|   -16.8782  -3.3650   0.6224   3.8491  17.9132 
|   
|   Coefficients:
|                           Estimate Std. Error t value Pr(>|t|)    
|   (Intercept)               48.387      2.668  18.133  < 2e-16 ***
|   propGovt                  19.124      5.023   3.807 0.000191 ***
|   areaOther                 20.277      3.153   6.431 1.06e-09 ***
|   areaSouth Asia            19.891      5.055   3.935 0.000118 ***
|   propGovt:areaOther        -9.998      5.631  -1.776 0.077451 .  
|   propGovt:areaSouth Asia  -16.757      9.671  -1.733 0.084828 .  
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 5.561 on 184 degrees of freedom
|     (4 observations deleted due to missingness)
|   Multiple R-squared:  0.6356,    Adjusted R-squared:  0.6257 
|   F-statistic: 64.19 on 5 and 184 DF,  p-value: < 2.2e-161) LRT:
|   Analysis of Variance Table
|   
|   Model 1: Life ~ (Health + log(Health) + propGovt) * area
|   Model 2: Life ~ propGovt + area + propGovt:area
|     Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
|   1    178 2857.9                                  
|   2    184 5690.7 -6   -2832.7 29.405 < 2.2e-16 ***
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 12) Wald test:
|          numDF denDF F.value p.value
|   Health     6   178 29.4052 <.00001
|                              Estimate  Std.Error DF  t-value   p-value Lower 0.95
|   Health                     -0.002672 0.003847  178 -0.694585 0.48822 -0.010263 
|   log(Health)                 2.368302 1.239444  178  1.910778 0.05764 -0.077593 
|   Health:areaOther            0.002218 0.003872  178  0.573005 0.56736 -0.005422 
|   Health:areaSouth Asia       0.006672 0.015873  178  0.420359 0.67473 -0.024651 
|   log(Health):areaOther       2.454985 1.393231  178  1.762080 0.07977 -0.294391 
|   log(Health):areaSouth Asia  1.029188 4.155759  178  0.247653 0.80469 -7.171707 
|                              Upper 0.95
|   Health                     0.004919  
|   log(Health)                4.814197  
|   Health:areaOther           0.009858  
|   Health:areaSouth Asia      0.037995  
|   log(Health):areaOther      5.204360  
|   log(Health):areaSouth Asia 9.230082Note that the F-values are identical – which works in the case of OLS regression with normal error but rarely otherwise.
There many approaches to simplifying a model. The most widespread is to trim down non-significant interactions.
If you do this it is vital to never drop a group of terms unless you either:
We could refit and use LRTs or we can use Wald tests:
|     numDF denDF  F.value p.value
|   :     6   178 4.077738 0.00074
|                              Estimate   Std.Error DF  t-value   p-value
|   Health:areaOther             0.002218 0.003872  178  0.573005 0.56736
|   Health:areaSouth Asia        0.006672 0.015873  178  0.420359 0.67473
|   log(Health):areaOther        2.454985 1.393231  178  1.762080 0.07977
|   log(Health):areaSouth Asia   1.029188 4.155759  178  0.247653 0.80469
|   propGovt:areaOther         -17.211452 4.343678  178 -3.962414 0.00011
|   propGovt:areaSouth Asia    -21.993937 8.781094  178 -2.504692 0.01315
|                              Lower 0.95 Upper 0.95
|   Health:areaOther            -0.005422  0.009858 
|   Health:areaSouth Asia       -0.024651  0.037995 
|   log(Health):areaOther       -0.294391  5.204360 
|   log(Health):areaSouth Asia  -7.171707  9.230082 
|   propGovt:areaOther         -25.783184 -8.639720 
|   propGovt:areaSouth Asia    -39.322380 -4.665493Explore how regular expressions work:
Testing all interactions that involve ‘Health’
|             numDF denDF  F.value p.value
|   Health.*:     4   178 3.611567  0.0074
|                              Estimate Std.Error DF  t-value  p-value Lower 0.95
|   Health:areaOther           0.002218 0.003872  178 0.573005 0.56736 -0.005422 
|   Health:areaSouth Asia      0.006672 0.015873  178 0.420359 0.67473 -0.024651 
|   log(Health):areaOther      2.454985 1.393231  178 1.762080 0.07977 -0.294391 
|   log(Health):areaSouth Asia 1.029188 4.155759  178 0.247653 0.80469 -7.171707 
|                              Upper 0.95
|   Health:areaOther           0.009858  
|   Health:areaSouth Asia      0.037995  
|   log(Health):areaOther      5.204360  
|   log(Health):areaSouth Asia 9.230082Testing all interactions that involve ‘Govt’ (always to check to make sure that you captured exactly the right terms)
|         numDF denDF  F.value p.value
|   Govt:     2   178 8.394828 0.00033
|                           Estimate  Std.Error DF  t-value   p-value Lower 0.95
|   propGovt:areaOther      -17.21145 4.343678  178 -3.962414 0.00011 -25.78318 
|   propGovt:areaSouth Asia -21.99394 8.781094  178 -2.504692 0.01315 -39.32238 
|                           Upper 0.95
|   propGovt:areaOther      -8.639720 
|   propGovt:areaSouth Asia -4.665493|   
|   Call:
|   lm(formula = Life ~ (Health + log(Health) + propGovt) * area, 
|       data = ds, na.action = na.exclude)
|   
|   Residuals:
|        Min       1Q   Median       3Q      Max 
|   -13.4282  -1.9480   0.1869   2.1919  14.7809 
|   
|   Coefficients:
|                                Estimate Std. Error t value Pr(>|t|)    
|   (Intercept)                 39.041364   5.426332   7.195 1.68e-11 ***
|   Health                      -0.002672   0.003847  -0.695 0.488221    
|   log(Health)                  2.368302   1.239444   1.911 0.057640 .  
|   propGovt                    16.083451   3.854716   4.172 4.71e-05 ***
|   areaOther                    4.316456   6.528366   0.661 0.509349    
|   areaSouth Asia              15.629164  15.611788   1.001 0.318131    
|   Health:areaOther             0.002218   0.003872   0.573 0.567364    
|   Health:areaSouth Asia        0.006672   0.015873   0.420 0.674730    
|   log(Health):areaOther        2.454985   1.393231   1.762 0.079772 .  
|   log(Health):areaSouth Asia   1.029188   4.155759   0.248 0.804688    
|   propGovt:areaOther         -17.211452   4.343678  -3.962 0.000107 ***
|   propGovt:areaSouth Asia    -21.993937   8.781094  -2.505 0.013154 *  
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 4.007 on 178 degrees of freedom
|     (4 observations deleted due to missingness)
|   Multiple R-squared:  0.817, Adjusted R-squared:  0.8057 
|   F-statistic: 72.24 on 11 and 178 DF,  p-value: < 2.2e-16Problems and limitations:
x if x^2 is also in the model.|   Analysis of Variance Table
|   
|   Response: Life
|                     Df Sum Sq Mean Sq  F value    Pr(>F)    
|   Health             1 6071.7  6071.7 378.1620 < 2.2e-16 ***
|   log(Health)        1 4091.9  4091.9 254.8571 < 2.2e-16 ***
|   propGovt           1   14.8    14.8   0.9238 0.3377716    
|   area               2 2187.9  1093.9  68.1331 < 2.2e-16 ***
|   Health:area        2   79.0    39.5   2.4611 0.0882394 .  
|   log(Health):area   2   44.2    22.1   1.3773 0.2549413    
|   propGovt:area      2  269.6   134.8   8.3948 0.0003281 ***
|   Residuals        178 2857.9    16.1                       
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1# The results depend on the order of the terms.
anova(update(fit, . ~ area * (propGovt + log(Health) + Health)))|   Analysis of Variance Table
|   
|   Response: Life
|                     Df Sum Sq Mean Sq  F value    Pr(>F)    
|   area               2 9077.6  4538.8 282.6878 < 2.2e-16 ***
|   propGovt           1  718.7   718.7  44.7647 2.783e-10 ***
|   log(Health)        1 2566.2  2566.2 159.8331 < 2.2e-16 ***
|   Health             1    3.8     3.8   0.2359  0.627813    
|   area:propGovt      2  160.9    80.4   5.0101  0.007642 ** 
|   area:log(Health)   2  225.3   112.7   7.0166  0.001166 ** 
|   area:Health        2    6.6     3.3   0.2065  0.813630    
|   Residuals        178 2857.9    16.1                       
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Notes:
|   Anova Table (Type II tests)
|   
|   Response: Life
|                     Sum Sq  Df F value    Pr(>F)    
|   Health             19.55   1  1.2175 0.2713384    
|   log(Health)       942.29   1 58.6882 1.142e-12 ***
|   propGovt           24.05   1  1.4979 0.2226053    
|   area             2187.86   2 68.1331 < 2.2e-16 ***
|   Health:area         6.63   2  0.2065 0.8136300    
|   log(Health):area   50.69   2  1.5785 0.2091650    
|   propGovt:area     269.57   2  8.3948 0.0003281 ***
|   Residuals        2857.93 178                      
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Notes:
|   Anova Table (Type III tests)
|   
|   Response: Life
|                     Sum Sq  Df F value    Pr(>F)    
|   (Intercept)       831.13   1 51.7651 1.678e-11 ***
|   Health              7.75   1  0.4824 0.4882213    
|   log(Health)        58.62   1  3.6511 0.0576404 .  
|   propGovt          279.52   1 17.4090 4.706e-05 ***
|   area               18.52   2  0.5766 0.5628569    
|   Health:area         6.63   2  0.2065 0.8136300    
|   log(Health):area   50.69   2  1.5785 0.2091650    
|   propGovt:area     269.57   2  8.3948 0.0003281 ***
|   Residuals        2857.93 178                      
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Answer specific questions.
e.g. None of the above are equivalent to Wald test for ‘OVERALL SIGNIFICANCE’:
|          numDF denDF F.value p.value
|   Health     6   178 29.4052 <.00001
|                              Estimate  Std.Error DF  t-value   p-value Lower 0.95
|   Health                     -0.002672 0.003847  178 -0.694585 0.48822 -0.010263 
|   log(Health)                 2.368302 1.239444  178  1.910778 0.05764 -0.077593 
|   Health:areaOther            0.002218 0.003872  178  0.573005 0.56736 -0.005422 
|   Health:areaSouth Asia       0.006672 0.015873  178  0.420359 0.67473 -0.024651 
|   log(Health):areaOther       2.454985 1.393231  178  1.762080 0.07977 -0.294391 
|   log(Health):areaSouth Asia  1.029188 4.155759  178  0.247653 0.80469 -7.171707 
|                              Upper 0.95
|   Health                     0.004919  
|   log(Health)                4.814197  
|   Health:areaOther           0.009858  
|   Health:areaSouth Asia      0.037995  
|   log(Health):areaOther      5.204360  
|   log(Health):areaSouth Asia 9.230082|        numDF denDF  F.value p.value
|   area     8   178 20.09158 <.00001
|                              Estimate   Std.Error DF  t-value   p-value
|   areaOther                    4.316456  6.528366 178  0.661185 0.50935
|   areaSouth Asia              15.629164 15.611788 178  1.001113 0.31813
|   Health:areaOther             0.002218  0.003872 178  0.573005 0.56736
|   Health:areaSouth Asia        0.006672  0.015873 178  0.420359 0.67473
|   log(Health):areaOther        2.454985  1.393231 178  1.762080 0.07977
|   log(Health):areaSouth Asia   1.029188  4.155759 178  0.247653 0.80469
|   propGovt:areaOther         -17.211452  4.343678 178 -3.962414 0.00011
|   propGovt:areaSouth Asia    -21.993937  8.781094 178 -2.504692 0.01315
|                              Lower 0.95 Upper 0.95
|   areaOther                   -8.566498 17.199409 
|   areaSouth Asia             -15.178841 46.437168 
|   Health:areaOther            -0.005422  0.009858 
|   Health:areaSouth Asia       -0.024651  0.037995 
|   log(Health):areaOther       -0.294391  5.204360 
|   log(Health):areaSouth Asia  -7.171707  9.230082 
|   propGovt:areaOther         -25.783184 -8.639720 
|   propGovt:areaSouth Asia    -39.322380 -4.665493|            numDF denDF  F.value p.value
|   propGovt     3   178 6.095867 0.00057
|                           Estimate  Std.Error DF  t-value   p-value Lower 0.95
|   propGovt                 16.08345 3.854716  178  4.172409 0.00005   8.476629
|   propGovt:areaOther      -17.21145 4.343678  178 -3.962414 0.00011 -25.783184
|   propGovt:areaSouth Asia -21.99394 8.781094  178 -2.504692 0.01315 -39.322380
|                           Upper 0.95
|   propGovt                23.690273 
|   propGovt:areaOther      -8.639720 
|   propGovt:areaSouth Asia -4.665493produces 4 plots
points with high Cook’s distance might have strong influence on fitted vals
Note: added-variable plots = partial residual leverage plots
Interactively, you can use:
avPlot(fit, 'propGovt', id.method = "identify")
In 3D
|           area    col
|   1     Africa   blue
|   2      Other  green
|   3 South Asia orange
|   
|   Use left mouse to rotate, middle mouse (or scroll) to zoom, right mouse to change perspective|   Warning in log(Health): NaNs produced2D
|                   country         iso3      region   HealthExpPC.Govt.exch
|    Afghanistan        :  1   AFG    :  1   AFR :46   Min.   :   2.92      
|    Albania            :  1   AGO    :  1   AMR :35   1st Qu.:  42.30      
|    Algeria            :  1   ALB    :  1   EMR :22   Median : 183.83      
|    Andorra            :  1   AND    :  1   EUR :53   Mean   : 782.39      
|    Angola             :  1   ARE    :  1   SEAR:11   3rd Qu.: 650.11      
|    Antigua and Barbuda:  1   ARG    :  1   WPR :27   Max.   :7696.93      
|    (Other)            :188   (Other):188             NA's   :3            
|    HealthExpPC.Tot.ppp HealthExpPC.Govt.ppp HealthExpPC.Tot.exch
|    Min.   :  16.99     Min.   :   3.61      Min.   :  13.9      
|    1st Qu.: 166.09     1st Qu.:  79.92      1st Qu.:  90.7      
|    Median : 518.83     Median : 286.72      Median : 333.8      
|    Mean   :1114.08     Mean   : 770.12      Mean   :1094.2      
|    3rd Qu.:1333.26     3rd Qu.: 926.80      3rd Qu.: 971.4      
|    Max.   :8607.88     Max.   :5794.45      Max.   :9120.8      
|    NA's   :4           NA's   :4            NA's   :3           
|        total              govt            private          sex     
|    Min.   :  16.99   Min.   :   3.61   Min.   :   0.51   BTSX:194  
|    1st Qu.: 166.09   1st Qu.:  79.92   1st Qu.:  57.10   FMLE:  0  
|    Median : 518.83   Median : 286.72   Median : 187.50   MLE :  0  
|    Mean   :1114.08   Mean   : 770.12   Mean   : 343.96             
|    3rd Qu.:1333.26   3rd Qu.: 926.80   3rd Qu.: 475.42             
|    Max.   :8607.88   Max.   :5794.45   Max.   :4653.69             
|    NA's   :4         NA's   :4         NA's   :4                   
|    lifeexp.Birth    lifeexp.At60   smoking.tobacco.current smoking.tobacco.daily
|    Min.   :47.00   Min.   :11.00   Min.   : 4.00           Min.   : 3.00        
|    1st Qu.:64.00   1st Qu.:17.00   1st Qu.:14.00           1st Qu.:11.00        
|    Median :72.50   Median :19.00   Median :23.00           Median :19.00        
|    Mean   :70.01   Mean   :19.36   Mean   :22.63           Mean   :18.65        
|    3rd Qu.:76.00   3rd Qu.:22.00   3rd Qu.:29.00           3rd Qu.:25.00        
|    Max.   :83.00   Max.   :26.00   Max.   :57.00           Max.   :55.00        
|                                    NA's   :47              NA's   :50           
|    smoking.cig.current smoking.cig.daily   Pop.Total         Pop.MedAge   
|    Min.   : 4.0        Min.   : 2.00     Min.   :      1   Min.   :15.04  
|    1st Qu.:12.0        1st Qu.: 8.25     1st Qu.:   1696   1st Qu.:20.51  
|    Median :21.0        Median :16.00     Median :   7790   Median :26.64  
|    Mean   :21.2        Mean   :17.22     Mean   :  36360   Mean   :27.98  
|    3rd Qu.:29.0        3rd Qu.:23.75     3rd Qu.:  24535   3rd Qu.:36.02  
|    Max.   :57.0        Max.   :55.00     Max.   :1390000   Max.   :45.53  
|    NA's   :46          NA's   :48                          NA's   :11     
|    Pop.pCntUnder15 Pop.pCntOver60  Pop.pCntAnnGrowth consumption.cigPC
|    Min.   :13.12   Min.   : 0.81   Min.   :-9.100    Min.   :   9.0   
|    1st Qu.:18.72   1st Qu.: 5.20   1st Qu.:-2.300    1st Qu.: 179.0   
|    Median :28.65   Median : 8.53   Median :-1.300    Median : 529.0   
|    Mean   :28.73   Mean   :11.16   Mean   :-1.453    Mean   : 730.1   
|    3rd Qu.:37.75   3rd Qu.:16.69   3rd Qu.:-0.500    3rd Qu.:1039.0   
|    Max.   :49.99   Max.   :31.92   Max.   : 0.800    Max.   :2861.0   
|                                                      NA's   :9        
|    hiv_prev15_49       LifeExp            LE            smoke       
|    Min.   : 0.000   Min.   :47.00   Min.   :47.00   Min.   :   9.0  
|    1st Qu.: 0.200   1st Qu.:64.00   1st Qu.:64.00   1st Qu.: 179.0  
|    Median : 0.400   Median :72.50   Median :72.50   Median : 529.0  
|    Mean   : 1.817   Mean   :70.01   Mean   :70.01   Mean   : 730.1  
|    3rd Qu.: 1.200   3rd Qu.:76.00   3rd Qu.:76.00   3rd Qu.:1039.0  
|    Max.   :26.000   Max.   :83.00   Max.   :83.00   Max.   :2861.0  
|    NA's   :40                                       NA's   :9       
|          HE               hiv            special              yq       
|    Min.   :  16.99   Min.   : 0.000   Min.   :0.00000   Min.   :48.19  
|    1st Qu.: 166.09   1st Qu.: 0.200   1st Qu.:0.00000   1st Qu.:60.00  
|    Median : 518.83   Median : 0.400   Median :0.00000   Median :72.69  
|    Mean   :1114.08   Mean   : 1.817   Mean   :0.01546   Mean   :69.43  
|    3rd Qu.:1333.26   3rd Qu.: 1.200   3rd Qu.:0.00000   3rd Qu.:76.77  
|    Max.   :8607.88   Max.   :26.000   Max.   :1.00000   Max.   :80.79  
|    NA's   :4         NA's   :40                         NA's   :42     
|         Life         Cigarettes         Health                area    
|    Min.   :47.00   Min.   :   9.0   Min.   :  16.99   Africa    : 46  
|    1st Qu.:64.00   1st Qu.: 179.0   1st Qu.: 166.09   Other     :137  
|    Median :72.50   Median : 529.0   Median : 518.83   South Asia: 11  
|    Mean   :70.01   Mean   : 730.1   Mean   :1114.08                   
|    3rd Qu.:76.00   3rd Qu.:1039.0   3rd Qu.:1333.26                   
|    Max.   :83.00   Max.   :2861.0   Max.   :8607.88                   
|                    NA's   :9        NA's   :4                         
|       propGovt     
|    Min.   :0.1296  
|    1st Qu.:0.4561  
|    Median :0.6070  
|    Mean   :0.5940  
|    3rd Qu.:0.7478  
|    Max.   :0.9989  
|    NA's   :4Create a prediction data frame with values for which you want to predict model with observed values for Health and area but controlling for propGovt
ds <- sortdf( ds, ~ Health)
pred1 <- rbind(ds,NA,ds,NA, ds,NA, ds,NA, ds, NA)  
    # to set five values for predicted propGovt
pred1$propGovt <- rep(seq(.1,.9,by=.2), each = nrow(ds)+1)
pred1$Life.fit <- predict(fit, newdata = pred1)In ‘panels’:
xyplot(Life ~ Health | area, ds, layout = c(1,3)) +
  xyplot( Life.fit ~ Health | area, pred1, groups = propGovt,
          type = 'l')gd(lwd = 2)
(p <- xyplot(Life.fit ~ Health | area, 
             pred1, groups = propGovt,
             type = 'l', 
             auto.key = list(space='right',
                             lines = T, points = F,
                             title = 'Prop. Govt', 
                             cex.title = 1)))Some colour palettes you can choose from. display.brewer.all() Note that the first group consists of ‘progressive’ palettes, the second group of categorical palettes (one is ‘paired’) and the third group of ‘bipolar’ palettes. Note that yellow often doesn’t work for lines that blend into the background so you might have to avoid palettes that include yellow for some purposes.
Probably better with log(Health)
(p <- xyplot(Life.fit ~ log(Health) | area, 
             pred1, groups = propGovt,
             type = 'l', 
             auto.key = list(space = 'right',lines = T, 
                             points = F, title = 'Prop. Govt', 
                             cex.title = 1)))BUT NEVER NEVER USE axes that are not meaningful
Your work will be written off as incomprehensible!
Also: use meaningful labels
update(p, 
       xlab = "Health expenditures per capita in $US (log scale)",
       ylab = "Life expectancy (both sexes)",
       layout = c(1,3),
       scales = list( x = list(
         at = log(c(30,100,300,1000,3000,8000)),
         labels = c(30,100,300,1000,3000,8000)))) +
  xyplot(Life ~ log(Health) | area, ds)Dropping some observations: BEWARE
Two ways:
Suppose we want to drop “Equatorial Guinea”
CAUTION: this needs good reflection BUT we often should approach this like a sensitivity analysis: e.g. “would it make a big difference if I dropped this point?”
ds$EqG <- 1*(ds$country == "Equatorial Guinea")
fit2 <- lm( Life ~ 
              (Health + log(Health) + propGovt) * area + EqG, ds,
            na.action = na.exclude)OR
|   
|   Call:
|   lm(formula = Life ~ Health + log(Health) + propGovt + area + 
|       EqG + Health:area + log(Health):area + propGovt:area, data = ds, 
|       na.action = na.exclude)
|   
|   Residuals:
|        Min       1Q   Median       3Q      Max 
|   -13.3962  -1.8221   0.1869   2.1919  10.5391 
|   
|   Coefficients:
|                                Estimate Std. Error t value Pr(>|t|)    
|   (Intercept)                 50.031832   6.036711   8.288 2.78e-14 ***
|   Health                       0.012027   0.005459   2.203 0.028882 *  
|   log(Health)                 -0.386600   1.412966  -0.274 0.784704    
|   propGovt                    15.122468   3.735031   4.049 7.69e-05 ***
|   areaOther                   -6.674012   6.982150  -0.956 0.340443    
|   areaSouth Asia               4.638695  15.383108   0.302 0.763353    
|   EqG                        -22.943658   6.239158  -3.677 0.000312 ***
|   Health:areaOther            -0.012481   0.005475  -2.279 0.023840 *  
|   Health:areaSouth Asia       -0.008027   0.015854  -0.506 0.613295    
|   log(Health):areaOther        5.209887   1.541017   3.381 0.000889 ***
|   log(Health):areaSouth Asia   3.784089   4.086121   0.926 0.355664    
|   propGovt:areaOther         -16.250469   4.206623  -3.863 0.000157 ***
|   propGovt:areaSouth Asia    -21.032953   8.491622  -2.477 0.014191 *  
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 3.873 on 177 degrees of freedom
|     (4 observations deleted due to missingness)
|   Multiple R-squared:   0.83, Adjusted R-squared:  0.8185 
|   F-statistic: 72.01 on 12 and 177 DF,  p-value: < 2.2e-16|   Warning in log(Health): NaNs produceddropEqG
Figure X: Note the change in the fitted surface for Africa when Equatorial Guinea is dropped from the model.
Often this process focuses on the initial regression parameter and asks which ones ‘can we drop’? Often starting with highest order interactions and working in.
BEWARE THE PRINCIPLE OF MARGINALITY
In general (i.e. 99.9% of the time) DO NOT eliminate a term without also eliminating all higher-order terms to which the term is marginal. e.g. ‘Health’ is marginal to ‘Health:area’, and ‘Health:area’ would be marginal to ‘Health:propGovt:area’. Otherwise the resulting model loses invariance with respect to changes of origins of interacting variables.
Here’s a model that seems to tell a different story but it’s perfectly equivalent. The individual terms estimate different aspects of the model, but the models as a whole are equivalent and produce the same fitted values.
fit2.eq <- lm( Life ~ 
                 area/(Health + log(Health) + propGovt)  + EqG -1,
               ds, na.action = na.exclude)
anova(fit2, fit2.eq)|   Analysis of Variance Table
|   
|   Model 1: Life ~ Health + log(Health) + propGovt + area + EqG + Health:area + 
|       log(Health):area + propGovt:area
|   Model 2: Life ~ area/(Health + log(Health) + propGovt) + EqG - 1
|     Res.Df    RSS Df   Sum of Sq F Pr(>F)
|   1    177 2655.1                        
|   2    177 2655.1  0 -5.2751e-11|           df      AIC
|   fit2    14 1068.266
|   fit2.eq 14 1068.266|   
|   Call:
|   lm(formula = Life ~ Health + log(Health) + propGovt + area + 
|       EqG + Health:area + log(Health):area + propGovt:area, data = ds, 
|       na.action = na.exclude)
|   
|   Residuals:
|        Min       1Q   Median       3Q      Max 
|   -13.3962  -1.8221   0.1869   2.1919  10.5391 
|   
|   Coefficients:
|                                Estimate Std. Error t value Pr(>|t|)    
|   (Intercept)                 50.031832   6.036711   8.288 2.78e-14 ***
|   Health                       0.012027   0.005459   2.203 0.028882 *  
|   log(Health)                 -0.386600   1.412966  -0.274 0.784704    
|   propGovt                    15.122468   3.735031   4.049 7.69e-05 ***
|   areaOther                   -6.674012   6.982150  -0.956 0.340443    
|   areaSouth Asia               4.638695  15.383108   0.302 0.763353    
|   EqG                        -22.943658   6.239158  -3.677 0.000312 ***
|   Health:areaOther            -0.012481   0.005475  -2.279 0.023840 *  
|   Health:areaSouth Asia       -0.008027   0.015854  -0.506 0.613295    
|   log(Health):areaOther        5.209887   1.541017   3.381 0.000889 ***
|   log(Health):areaSouth Asia   3.784089   4.086121   0.926 0.355664    
|   propGovt:areaOther         -16.250469   4.206623  -3.863 0.000157 ***
|   propGovt:areaSouth Asia    -21.032953   8.491622  -2.477 0.014191 *  
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 3.873 on 177 degrees of freedom
|     (4 observations deleted due to missingness)
|   Multiple R-squared:   0.83, Adjusted R-squared:  0.8185 
|   F-statistic: 72.01 on 12 and 177 DF,  p-value: < 2.2e-16|   
|   Call:
|   lm(formula = Life ~ area/(Health + log(Health) + propGovt) + 
|       EqG - 1, data = ds, na.action = na.exclude)
|   
|   Residuals:
|        Min       1Q   Median       3Q      Max 
|   -13.3962  -1.8221   0.1869   2.1919  10.5391 
|   
|   Coefficients:
|                                Estimate Std. Error t value Pr(>|t|)    
|   areaAfrica                  5.003e+01  6.037e+00   8.288 2.78e-14 ***
|   areaOther                   4.336e+01  3.508e+00  12.358  < 2e-16 ***
|   areaSouth Asia              5.467e+01  1.415e+01   3.864 0.000156 ***
|   EqG                        -2.294e+01  6.239e+00  -3.677 0.000312 ***
|   areaAfrica:Health           1.203e-02  5.459e-03   2.203 0.028882 *  
|   areaOther:Health           -4.536e-04  4.217e-04  -1.075 0.283638    
|   areaSouth Asia:Health       4.000e-03  1.488e-02   0.269 0.788436    
|   areaAfrica:log(Health)     -3.866e-01  1.413e+00  -0.274 0.784704    
|   areaOther:log(Health)       4.823e+00  6.150e-01   7.842 4.00e-13 ***
|   areaSouth Asia:log(Health)  3.397e+00  3.834e+00   0.886 0.376746    
|   areaAfrica:propGovt         1.512e+01  3.735e+00   4.049 7.69e-05 ***
|   areaOther:propGovt         -1.128e+00  1.935e+00  -0.583 0.560723    
|   areaSouth Asia:propGovt    -5.910e+00  7.626e+00  -0.775 0.439353    
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
|   
|   Residual standard error: 3.873 on 177 degrees of freedom
|     (4 observations deleted due to missingness)
|   Multiple R-squared:  0.9972,    Adjusted R-squared:  0.997 
|   F-statistic:  4880 on 13 and 177 DF,  p-value: < 2.2e-16|   Anova Table (Type II tests)
|   
|   Response: Life
|                     Sum Sq  Df F value    Pr(>F)    
|   Health             12.01   1  0.8003 0.3722085    
|   log(Health)       763.67   1 50.9096  2.39e-11 ***
|   propGovt           19.56   1  1.3039 0.2550477    
|   area             1845.55   2 61.5164 < 2.2e-16 ***
|   EqG               202.85   1 13.5230 0.0003124 ***
|   Health:area        79.23   2  2.6410 0.0741004 .  
|   log(Health):area  171.81   2  5.7268 0.0038909 ** 
|   propGovt:area     240.45   2  8.0147 0.0004655 ***
|   Residuals        2655.08 177                      
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1|   Anova Table (Type II tests)
|   
|   Response: Life
|                    Sum Sq  Df    F value    Pr(>F)    
|   area             944894   3 20997.0169 < 2.2e-16 ***
|   EqG                 203   1    13.5230 0.0003124 ***
|   area:Health          91   3     2.0274 0.1117604    
|   area:log(Health)    935   3    20.7877 1.385e-11 ***
|   area:propGovt       260   3     5.7778 0.0008611 ***
|   Residuals          2655 177                         
|   ---
|   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1|            numDF denDF  F.value p.value
|   th:|th):     4   177 3.143431 0.01582
|                              Estimate  Std.Error DF  t-value   p-value Lower 0.95
|   Health:areaOther           -0.012481 0.005475  177 -2.279349 0.02384 -0.023286 
|   Health:areaSouth Asia      -0.008027 0.015854  177 -0.506274 0.61329 -0.039315 
|   log(Health):areaOther       5.209887 1.541017  177  3.380811 0.00089  2.168756 
|   log(Health):areaSouth Asia  3.784089 4.086121  177  0.926083 0.35566 -4.279696 
|                              Upper 0.95
|   Health:areaOther           -0.001675 
|   Health:areaSouth Asia       0.023261 
|   log(Health):areaOther       8.251017 
|   log(Health):areaSouth Asia 11.847874Question: Under what conditions would two seemingly different models produce exactly the same fit (i.e. predicted values of Y)?
What question does each coefficient answer and how can we get answers to the questions we want?
PRINCIPLE
Except with very simple models, raw regression output generally answers few meaningful questions AND most important questions are rarely answered by raw regression output
Interpreting \(\beta\)s:
Each term involving ‘area’ is a comparison with the REFERENCE LEVEL when ALL VARIABLES IN HIGHER ORDER INTERACTING TERMS are set to 0.
Each term involving ‘area’ is a comparison with the REFERENCE LEVEL (Africa because it’s the level that isn’t showing) when ALL VARIABLES IN HIGHER ORDER INTERACTING TERMS are set to 0.
The model is:
\[ \begin{aligned} Y &= \beta_0 + \beta_1 Health + \beta_2 ln(Health) + \beta_3 propGovt\\ & + \beta_4 area_{Other} + \beta_5 area_{South Asia}\\ & + \beta_6 EqG \\ & + \beta_7 Health \times area_{Other} + \beta_8 Health \times area_{South Asia}\\ & + \beta_9 ln(Health)\times area_{Other} + \beta_{10} ln(Health)\times area_{South Asia} \\ & + \beta_{11} propGovt \times area_{Other} + \beta_{12} propGovt \times area_{South Asia} \\ & + \varepsilon \end{aligned} \]
where \(\varepsilon \sim N(0,\sigma^2)\) independently of predictors.
Note: Here we encounter the vital difference between assumptions that can be checked and assumptions that can’t be checked. We are assuming that
The first two assumptions can be checked with diagnostics, the third, in general, cannot. In econometrics it’s recognized as a key assumption related to the ‘exogeneity’ of the predictors and the causal interpretation of the model. For a further treatment of this question see Murnane and Willett (2010) and Pearl and Mackenzie (2018).
Q: What does \(\beta_1\) mean?
A: It’s the expected change in \(Y\) when you change \(Health\) by one unit keeping all other terms constant — BUT THAT’S IMPOSSIBLE
We’ll have more luck with \(\beta_3\): It’s the expected change in \(Y\) when you change \(progGovt\) by one unit keeping all other terms constant, i.e. when all variables that interact with \(propGovt\) are equal to 0.
i.e. when \(area_{Other} = area_{South Asia} = 0\)
i.e. in Africa
So we have a clear interpretation: it’s the expected change in Life Expectancy using our model to compare a hypothetical country where health expenditures are entirely supported by the government with a hypothetical country in which they are entirely private in Africa.
This is not obvious to a casual user of regression and most surely not to most clients and readers of academic journals.
You’re in luck. Calculus comes in handy.
What’s the ‘effect’ (a very misused word and I’m still looking for a better one) of increasing Health expenditures by 1 dollar?
\[ \begin{aligned} \frac{\partial{E(Y)}}{\partial{Health}} &= 0 \times \beta_0 + \beta_1 + \beta_2 \frac{1}{Health} + 0 \beta_3 \\ & + 0 \times \beta_4 + 0 \times \beta_5 \\ & + 0 \times \beta_6 \\ & + \beta_7 area_{Other} + \beta_8 area_{South Asia}\\ & + \beta_9 \frac{area_{Other}}{Health} + \beta_{10} \frac{area_{South Asia}}{Health} \\ & + 0 \times \beta_{11} + 0 \times \beta_{12} \\ &= L \beta \end{aligned} \] where \[ L= \left[ \begin{array}{c} 0 & 1 & \frac{1}{Health} & 0 & 0 & 0 & 0 & area_{Other} & area_{South Asia} & \frac{ area_{Other}}{Health} & \frac{area_{South Asia}}{Health} & 0 & 0 \end{array} \right] \] and \[ \beta = \left[ \begin{array}{c} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \\ \beta_5 \\ \beta_6 \\ \beta_7 \\ \beta_8 \\ \beta_9 \\ \beta_{10} \\ \beta_{11} \\ \beta_{12} \\ \end{array} \right] \] which we can estimate with \[ \hat{\eta} = L \hat{\beta} \]
NOTE: This does not seem to depend on \(propGovt\)! Is this reasonable? What should we do about that?
Exercise: Explore what could be done with \(propGovt\).
\(\hat{\beta}\) is obtained with:
|                  (Intercept)                     Health 
|                 50.031832427                0.012026938 
|                  log(Health)                   propGovt 
|                 -0.386599991               15.122467573 
|                    areaOther             areaSouth Asia 
|                 -6.674012321                4.638695485 
|                          EqG           Health:areaOther 
|                -22.943657897               -0.012480505 
|        Health:areaSouth Asia      log(Health):areaOther 
|                 -0.008026658                5.209886574 
|   log(Health):areaSouth Asia         propGovt:areaOther 
|                  3.784089239              -16.250468580 
|      propGovt:areaSouth Asia 
|                -21.032953424To estimate the marginal effect of Health Expenditures in ‘Other’ when Health expenditures = 100:
|        [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
|   [1,]    0    1 0.01    0    0    0    0    1    0  0.01     0     0     0|             [,1]
|   [1,] 0.0477793We could write matrix expression to get the variance, F-test, p-values etc., but it’s already been done with the ‘wald’ function in ‘spida2’. Other packages also have functions that do this, e.g. ‘lht’ in the ‘car’ package.
|     numDF denDF  F.value p.value
|   1     1   177 67.95056 <.00001
|        Estimate Std.Error DF  t-value  p-value Lower 0.95 Upper 0.95
|   [1,] 0.047779 0.005796  177 8.243213 <.00001 0.036341   0.059218How could we mass produce this?
ex <- expression( cbind( 0,1,1/Health,0,0,0,0,
                         area == "Other", area == "South Asia",
                         (area == "Other")/Health, 
                         (area == "South Asia")/Health,
                         0,0))
ex|   expression(cbind(0, 1, 1/Health, 0, 0, 0, 0, area == "Other", 
|       area == "South Asia", (area == "Other")/Health, (area == 
|           "South Asia")/Health, 0, 0))|        [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
|   [1,]    0    1 0.01    0    0    0    0    0    1     0  0.01     0     0|     Health   area
|   1     30 Africa
|   2     40 Africa
|   3     50 Africa
|   4     60 Africa
|   5     70 Africa
|   6     80 Africa|        Health       area
|   1189   3950 South Asia
|   1190   3960 South Asia
|   1191   3970 South Asia
|   1192   3980 South Asia
|   1193   3990 South Asia
|   1194   4000 South Asia|   [1] 1194    2|        [,1] [,2]       [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
|   [1,]    0    1 0.03333333    0    0    0    0    0    0     0     0     0     0
|   [2,]    0    1 0.02500000    0    0    0    0    0    0     0     0     0     0
|   [3,]    0    1 0.02000000    0    0    0    0    0    0     0     0     0     0
|   [4,]    0    1 0.01666667    0    0    0    0    0    0     0     0     0     0
|   [5,]    0    1 0.01428571    0    0    0    0    0    0     0     0     0     0
|   [6,]    0    1 0.01250000    0    0    0    0    0    0     0     0     0     0|   List of 1
|    $ :List of 7
|     ..$ anova   :List of 4
|     .. ..$ numDF  : int 6
|     .. ..$ denDF  : int 177
|     .. ..$ F-value: num [1, 1] 33.2
|     .. ..$ p-value: num [1, 1] 1.3e-26
|     ..$ estimate:Classes 'data.frame.lab' and 'data.frame':   1194 obs. of  7 variables:
|     .. ..$ Estimate  : num [1:1194] -0.00086 0.00236 0.00429 0.00558 0.0065 ...
|     .. ..$ Std.Error : num [1:1194] 0.0424 0.0306 0.0236 0.0189 0.0156 ...
|     .. ..$ DF        : num [1:1194] 177 177 177 177 177 177 177 177 177 177 ...
|     .. ..$ t-value   : num [1:1194] -0.0203 0.0772 0.1822 0.2954 0.4176 ...
|     .. ..$ p-value   : num [1:1194] 0.984 0.939 0.856 0.768 0.677 ...
|     .. ..$ Lower 0.95: num [1:1194] -0.0844 -0.058 -0.0422 -0.0317 -0.0242 ...
|     .. ..$ Upper 0.95: num [1:1194] 0.0827 0.0628 0.0508 0.0429 0.0372 ...
|     .. ..- attr(*, "labs")= chr [1:2] "" ""
|     ..$ coef    : num [1:1194] -0.00086 0.00236 0.00429 0.00558 0.0065 ...
|     ..$ L       : num [1:1194, 1:13] 0 0 0 0 0 0 0 0 0 0 ...
|     ..$ se      : num [1:1194] 0.0424 0.0306 0.0236 0.0189 0.0156 ...
|     ..$ L.full  : num [1:6, 1:13] 0 0 0 0 0 ...
|     ..$ L.rank  : int 6
|    - attr(*, "class")= chr "wald"|              coef         se         U2          L2
|   1 -0.0008597286 0.04235567 0.08385161 -0.08557107
|   2  0.0023619380 0.03061080 0.06358353 -0.05885966
|   3  0.0042949379 0.02357817 0.05145127 -0.04286140
|   4  0.0055836046 0.01890309 0.04338979 -0.03222258
|   5  0.0065040807 0.01557660 0.03765728 -0.02464912
|   6  0.0071944378 0.01309439 0.03338321 -0.01899433|     Health   area          coef         se         U2          L2
|   1     30 Africa -0.0008597286 0.04235567 0.08385161 -0.08557107
|   2     40 Africa  0.0023619380 0.03061080 0.06358353 -0.05885966
|   3     50 Africa  0.0042949379 0.02357817 0.05145127 -0.04286140
|   4     60 Africa  0.0055836046 0.01890309 0.04338979 -0.03222258
|   5     70 Africa  0.0065040807 0.01557660 0.03765728 -0.02464912
|   6     80 Africa  0.0071944378 0.01309439 0.03338321 -0.01899433xyplot(coef ~ Health|area, pred, groups = area, type = 'l',
       auto.key = list(space='right'),
       subscripts = T,
       lower = pred$coef - 2* pred$se,
       upper = pred$coef + 2* pred$se,
       layout = c(1,3)) +
  glayer( gpanel.fit(...))Make labels and axes interprettable for presentation:
xyplot(I(100*coef) ~ log(Health), pred, groups = area,type= 'l',
       auto.key=list(space='right', lines = T, points = F),
       ylab = 
"Estimated change in LE associated with a $100 increase in Health Exp.",
       xlab = "Health expenditures per capita in $US (log scale)",
       scales = list( x = list(
         at = log(c(30,100,300,1000,3000,8000)),
         labels = c(30,100,300,1000,3000,8000))))Limit plot to ranges in each Area:
|               sex       area     max   min
|   Africa     BTSX     Africa 1642.71 16.99
|   Other      BTSX      Other 8607.88 50.47
|   South Asia BTSX South Asia  759.66 27.86|       area Health          coef         se         U2          L2     max   min
|   1 Africa     30 -0.0008597286 0.04235567 0.08385161 -0.08557107 1642.71 16.99
|   2 Africa     40  0.0023619380 0.03061080 0.06358353 -0.05885966 1642.71 16.99
|   3 Africa     50  0.0042949379 0.02357817 0.05145127 -0.04286140 1642.71 16.99
|   4 Africa     60  0.0055836046 0.01890309 0.04338979 -0.03222258 1642.71 16.99
|   5 Africa     70  0.0065040807 0.01557660 0.03765728 -0.02464912 1642.71 16.99
|   6 Africa     80  0.0071944378 0.01309439 0.03338321 -0.01899433 1642.71 16.99If you define arguments fit, lower and upper in ‘xyplot’, they will be available to ‘gpanel.fit’ to draw the fitted line and confidence or predictions bands.
xyplot(I(100*coef) ~ log(Health), predr, groups = area,type= 'l',
       auto.key=list(space='right', lines = T, points = F),
       ylab = 
"Estimated change in LE associated with a $100 increase in Health Exp.",
       lower = 100*(predr$coef - 2* predr$se),
       upper = 100*(predr$coef + 2* predr$se),
       sub= "Estimated change in LE with standard error bands",
       xlab = "Health expenditures per capita in $US (log scale)",
       scales = list( x = list(
         at = log(c(30,100,300,1000,3000,8000)),
         labels = c(30,100,300,1000,3000,8000)))) +
    glayer(gpanel.fit(...))xyplot(I(100*coef) ~ log(Health)|area, predr, groups = area,
       type = 'l',
       auto.key=list(space='right', lines = T, points = F),
       ylab = "Additional years of Life Expectancy",
       xlim = c(-5,15),
       layout = c(1,3),
       lower = 100*(predr$coef - predr$se),
       upper = 100*(predr$coef + predr$se),
       xlab = "Health expenditures per capita in $US (log scale)",
       scales = list( x = list(
         at = log(c(30,100,300,1000,3000,8000)),
         labels = c(30,100,300,1000,3000,8000)))) +
  glayer(gpanel.fit(...))Figure: Additional years of life expectancy associated with a $100 (U.S.) increase in health expenditures per capita per year, in three world regions as a function of the current level of health expenditures. The bands show the standard error of estimation.
We will explore the ‘Lfx’ function in ‘spida2’ and the ‘sc’ function for generalized splines generated by the ‘gsp’ function.
The principle of marginality (POM) came up in class recently and it occurred to me that there might be confusion arising from the distinction between the requirements for:
A model to satisfy the POM, and
A null hypothesis specified by setting a set of terms to zero to satisfy the POM.
In a linear model with main effects and interactions of various orders, a model satisfies the POM if, for any interaction in the model, all included lower-order interactions and main effects are also in the model. In other words, the model is closed under taking margins where we think of A:B as a margin of A:B:C, for example. Note that the intercept is considered a 0-th order effect and must be included if anything else is included.
A hypothesis that sets some parameters to 0 in a model that satisfies the POM, itself satisfies the POM provided the resulting \(H_0\) model also satisfies the POM.
That’s why the requirements for the set of terms that are set to 0 seems to be the reverse of the requirements for a model. For any given term set to zero in the hypothesis, all higher-order terms in the model that include the given term must also be set to zero. Otherwise, the null model would include those higher-order terms without including the given term which would result in a null hypothesis that violates the POM.
Thus the requirement for the set of terms set to zero is the ‘reverse’ of the requirement for models. The set of terms set to zero must be closed under taking interactions that are in the full model.
Note that the main significance of the POM is that predicted values (\(\hat{Y}\)) for a model that satisfies the POM are invariant under location-scale transformations of numerical variables and non-singular recodings of categorical variables.
Hypotheses that satisfy the POM result in a test statistic, hence p-value, that is invariant under those same transformations and recodings.
So, if you stick to models and hypotheses that satisfy the POM, you don’t need to worry that your conclusions would have been different if you had measured temperature in degrees Celsius instead of Farenheit or whether you had used a different category as a reference level for a factor.
Wald tests performed by specifying a regular expression to be matched in a model’s terms will usually satisfy the POM because, if a regular expression matches a term, it will also match higher-order terms that contain that term.
This is true for tests of interactions with e.g. wald(fit, ":") for all interactions or wald(fit, ':.*:") for two-way and higher-order interactions, and for test of any particular effect, e.g. wald(fit, 'X') provided one checks that the regular expression does not match unintended terms.
Of course, you will be interested in estimating many parameters whose values do depend on the units used and the reference level. But those will be parameters addressing specific questions. For example in a model Y ~ X*G where X is continuous and G has three levels: A, B and C, you may want to estimate the difference in the rate of change with respect to X (which does depend on the units of X) comparing levels C and B. The exact hypothesis to test whether this is a particular value will not satisfy the POM because the estimate will depend on the units of X and the coding of the factor G. This is okay because it obeys the principle of “you know what you are doing”. The general rule is:
Never violate the POM unless you know what you are doing!
Fox, John, and Jangman Hong. 2009. “Effect Displays in R for Multinomial and Proportional-Odds Logit Models: Extensions to the effects Package.” Journal of Statistical Software 32 (1): 1–24. http://www.jstatsoft.org/v32/i01/.
Monette, Georges, John Fox, Michael Friendly, and Heather Krause. 2018. Spida2: Collection of Tools Developed for the Summer Programme in Data Analysis 2000-2012.
Murnane, Richard J, and John B Willett. 2010. Methods Matter: Improving Causal Inference in Educational and Social Science Research. Oxford University Press.
Pearl, Judea, and Dana Mackenzie. 2018. The Book of Why: The New Science of Cause and Effect. Basic Books.