Make sure the following packages are installed:

You can use the following code to perform the installation:

install.packages("GGally") # ggpairs()
install.packages("ggResidpanel") # resid_panel() and resid_xpanel()

Now load packages

library("tidyverse"); theme_set(theme_bw())
## ── Attaching core tidyverse packages ────────────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4.9000     ✔ readr     2.1.5     
## ✔ forcats   1.0.0          ✔ stringr   1.5.1     
## ✔ ggplot2   3.5.1          ✔ tibble    3.2.1     
## ✔ lubridate 1.9.3          ✔ tidyr     1.3.1     
## ✔ purrr     1.0.2          
## ── Conflicts ──────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("Sleuth3")
library("GGally")
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library("ggResidpanel")

Before we begin, let’s create a function to look at diagnostic plots from the ggResidpanel package. Here is my prefered set of plots.

# My diagnostic plots
my_diag <- function(m) {
  resid_panel(m, 
              plots = c("qq",     # qqplot for normality 
                        "resid",  # resid v fitted for linearity and constant variance
                        "cookd",  # Cook's Distance for influential observations
                        "index"), # resid v row number for indepence
              smoother = TRUE,    # add smoothing lines to resid and index plots
              qqbands = TRUE)     # add bands for qq plot
}

We will only use these plots when we want to.

Higher order terms

Consider the case1001 data set.

ggplot(case1001, aes(Height, Distance)) +
  geom_point()

There are a number of ways to run the cubic model.

# First create the variables yourself
case1001_tmp <- case1001 |>
  mutate(Height2 = Height * Height,
         Height3 = Height * Height2)

# Then run the regression
m1 <- lm(Distance ~ Height + Height2 + Height3, data = case1001_tmp)

I don’t prefer this method because

new <- data.frame(Height = 500) |>
  mutate(Height2 = Height * Height,  # Need to manually create the
         Height3 = Height * Height2) # quadratic and cubic terms

predict(m1, new)
##        1 
## 470.6527

An alternative approach is to use a formula to construct the higher order terms. e.g.

m2 <- lm(Distance ~ Height + I(Height^2) + I(Height^3), data = case1001) # or
m3 <- lm(Distance ~ poly(Height, 3, raw = TRUE),        data = case1001)

Now, your original data.frame is unchanged and your new data.frame is simpler.

new <- data.frame(Height = 500)
predict(m2, newdata = new) # or
##        1 
## 470.6527
predict(m3, newdata = new)
##        1 
## 470.6527

Additional explanatory variables

Run the following code to construct the longnosedace data

# From http://www.biostathandbook.com/multipleregression.html
# Modified from http://rcompanion.org/rcompanion/e_05.html
Input = ("
stream                   count acreage  dO2   maxdepth  no3   so4     temp
BASIN_RUN                  13         2528    9.6  80        2.28  16.75   15.3
BEAR_BR                    12         3333    8.5  83        5.34   7.74   19.4
BEAR_CR                    54        19611    8.3  96        0.99  10.92   19.5
BEAVER_DAM_CR              19         3570    9.2  56        5.44  16.53   17
BEAVER_RUN                 37         1722    8.1  43        5.66   5.91   19.3
BENNETT_CR                  2          583    9.2  51        2.26   8.81   12.9
BIG_BR                     72         4790    9.4  91        4.1    5.65   16.7
BIG_ELK_CR                164        35971   10.2  81        3.2   17.53   13.8
BIG_PIPE_CR                18        25440    7.5  120       3.53   8.2    13.7
BLUE_LICK_RUN               1         2217    8.5  46        1.2   10.85   14.3
BROAD_RUN                  53         1971   11.9  56        3.25  11.12   22.2
BUFFALO_RUN                16        12620    8.3  37        0.61  18.87   16.8
BUSH_CR                    32        19046    8.3  120       2.93  11.31   18
CABIN_JOHN_CR              21         8612    8.2  103       1.57  16.09   15
CARROLL_BR                 23         3896   10.4  105       2.77  12.79   18.4
COLLIER_RUN                18         6298    8.6  42        0.26  17.63   18.2
CONOWINGO_CR              112        27350    8.5  65        6.95  14.94   24.1
DEAD_RUN                   25         4145    8.7  51        0.34  44.93   23
DEEP_RUN                    5         1175    7.7  57        1.3   21.68   21.8
DEER_CR                    26         8297    9.9  60        5.26  6.36    19.1
DORSEY_RUN                  8         7814    6.8  160       0.44  20.24   22.6
FALLS_RUN                  15         1745    9.4  48        2.19  10.27   14.3
FISHING_CR                 11         5046    7.6  109       0.73   7.1    19
FLINTSTONE_CR              11        18943    9.2  50        0.25  14.21   18.5
GREAT_SENECA_CR            87         8624    8.6  78        3.37   7.51   21.3
GREENE_BR                  33         2225    9.1  41        2.3    9.72   20.5
GUNPOWDER_FALLS            22        12659    9.7  65        3.3    5.98   18
HAINES_BR                  98         1967    8.6  50        7.71  26.44   16.8
HAWLINGS_R                  1         1172    8.3  73        2.62   4.64   20.5
HAY_MEADOW_BR               5          639    9.5  26        3.53   4.46   20.1
HERRINGTON_RUN              1         7056    6.4  60        0.25   9.82   24.5
HOLLANDS_BR                38         1934   10.5  85        2.34  11.44   12
ISRAEL_CR                  30         6260    9.5  133       2.41  13.77   21
LIBERTY_RES                12          424    8.3  62        3.49   5.82   20.2
LITTLE_ANTIETAM_CR         24         3488    9.3  44        2.11  13.37   24
LITTLE_BEAR_CR              6         3330    9.1  67        0.81   8.16   14.9
LITTLE_CONOCOCHEAGUE_CR    15         2227    6.8  54        0.33   7.6    24
LITTLE_DEER_CR             38         8115    9.6  110       3.4    9.22   20.5
LITTLE_FALLS               84         1600   10.2  56        3.54   5.69   19.5
LITTLE_GUNPOWDER_R          3        15305    9.7  85        2.6    6.96   17.5
LITTLE_HUNTING_CR          18         7121    9.5  58        0.51   7.41   16
LITTLE_PAINT_BR            63         5794    9.4  34        1.19  12.27   17.5
MAINSTEM_PATUXENT_R       239         8636    8.4  150       3.31   5.95   18.1
MEADOW_BR                 234         4803    8.5  93        5.01  10.98   24.3
MILL_CR                     6         1097    8.3  53        1.71  15.77   13.1
MORGAN_RUN                 76         9765    9.3  130       4.38   5.74   16.9
MUDDY_BR                   25         4266    8.9  68        2.05  12.77   17
MUDLICK_RUN                 8         1507    7.4  51        0.84  16.3    21
NORTH_BR                   23         3836    8.3  121       1.32   7.36   18.5
NORTH_BR_CASSELMAN_R       16        17419    7.4  48        0.29   2.5    18
NORTHWEST_BR                6         8735    8.2  63        1.56  13.22   20.8
NORTHWEST_BR_ANACOSTIA_R  100        22550    8.4  107       1.41  14.45   23
OWENS_CR                   80         9961    8.6  79        1.02   9.07   21.8
PATAPSCO_R                 28         4706    8.9  61        4.06   9.9    19.7
PINEY_BR                   48         4011    8.3  52        4.7    5.38   18.9
PINEY_CR                   18         6949    9.3  100       4.57  17.84   18.6
PINEY_RUN                  36        11405    9.2  70        2.17  10.17   23.6
PRETTYBOY_BR               19          904    9.8  39        6.81   9.2    19.2
RED_RUN                    32         3332    8.4  73        2.09   5.5    17.7
ROCK_CR                     3          575    6.8  33        2.47   7.61   18
SAVAGE_R                  106        29708    7.7  73        0.63  12.28   21.4
SECOND_MINE_BR             62         2511   10.2  60        4.17  10.75   17.7
SENECA_CR                  23        18422    9.9  45        1.58   8.37   20.1
SOUTH_BR_CASSELMAN_R        2         6311    7.6  46        0.64  21.16   18.5
SOUTH_BR_PATAPSCO          26         1450    7.9  60        2.96   8.84   18.6
SOUTH_FORK_LINGANORE_CR    20         4106   10.0  96        2.62   5.45   15.4
TUSCARORA_CR               38        10274    9.3  90        5.45  24.76   15
WATTS_BR                   19          510    6.7  82        5.25  14.19   26.5
")

longnosedace = read.table(textConnection(Input),header=TRUE)

Take a quick look at the data

head(longnosedace)
##          stream count acreage dO2 maxdepth  no3   so4 temp
## 1     BASIN_RUN    13    2528 9.6       80 2.28 16.75 15.3
## 2       BEAR_BR    12    3333 8.5       83 5.34  7.74 19.4
## 3       BEAR_CR    54   19611 8.3       96 0.99 10.92 19.5
## 4 BEAVER_DAM_CR    19    3570 9.2       56 5.44 16.53 17.0
## 5    BEAVER_RUN    37    1722 8.1       43 5.66  5.91 19.3
## 6    BENNETT_CR     2     583 9.2       51 2.26  8.81 12.9
summary(longnosedace)
##     stream              count           acreage           dO2        
##  Length:68          Min.   :  1.00   Min.   :  424   Min.   : 6.400  
##  Class :character   1st Qu.: 12.00   1st Qu.: 2156   1st Qu.: 8.300  
##  Mode  :character   Median : 23.00   Median : 4748   Median : 8.600  
##                     Mean   : 38.81   Mean   : 7565   Mean   : 8.762  
##                     3rd Qu.: 40.50   3rd Qu.: 8992   3rd Qu.: 9.425  
##                     Max.   :239.00   Max.   :35971   Max.   :11.900  
##     maxdepth           no3             so4              temp      
##  Min.   : 26.00   Min.   :0.250   Min.   : 2.500   Min.   :12.00  
##  1st Qu.: 51.00   1st Qu.:1.198   1st Qu.: 7.397   1st Qu.:16.98  
##  Median : 64.00   Median :2.375   Median :10.220   Median :18.60  
##  Mean   : 72.56   Mean   :2.672   Mean   :11.650   Mean   :18.87  
##  3rd Qu.: 90.25   3rd Qu.:3.533   3rd Qu.:14.270   3rd Qu.:20.85  
##  Max.   :160.00   Max.   :7.710   Max.   :44.930   Max.   :26.50
ggpairs(longnosedace |> select(-stream),
        progress = interactive()) 

Adding additional explanatory variables is accomplished by separating the explanatory variables by +.

# Although I am demonstrating sqrt() here, I almost always use log() instead
m <- lm(sqrt(count) ~ no3 + maxdepth, data = longnosedace)
summary(m)
## 
## Call:
## lm(formula = sqrt(count) ~ no3 + maxdepth, data = longnosedace)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3896 -1.9651 -0.5409  1.3018  7.8564 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   1.5683     1.0285   1.525  0.13217   
## no3           0.6005     0.1882   3.191  0.00218 **
## maxdepth      0.0308     0.0117   2.633  0.01057 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.804 on 65 degrees of freedom
## Multiple R-squared:  0.2145, Adjusted R-squared:  0.1903 
## F-statistic: 8.874 on 2 and 65 DF,  p-value: 0.0003914

You can run a regression with all variables using .

m <- lm(sqrt(count) ~ ., data = longnosedace)
summary(m)
## 
## Call:
## lm(formula = sqrt(count) ~ ., data = longnosedace)
## 
## Residuals:
## ALL 68 residuals are 0: no residual degrees of freedom!
## 
## Coefficients: (6 not defined because of singularities)
##                                Estimate Std. Error t value Pr(>|t|)
## (Intercept)                      3.6056        NaN     NaN      NaN
## streamBEAR_BR                   -0.1414        NaN     NaN      NaN
## streamBEAR_CR                    3.7429        NaN     NaN      NaN
## streamBEAVER_DAM_CR              0.7533        NaN     NaN      NaN
## streamBEAVER_RUN                 2.4772        NaN     NaN      NaN
## streamBENNETT_CR                -2.1913        NaN     NaN      NaN
## streamBIG_BR                     4.8797        NaN     NaN      NaN
## streamBIG_ELK_CR                 9.2007        NaN     NaN      NaN
## streamBIG_PIPE_CR                0.6371        NaN     NaN      NaN
## streamBLUE_LICK_RUN             -2.6056        NaN     NaN      NaN
## streamBROAD_RUN                  3.6746        NaN     NaN      NaN
## streamBUFFALO_RUN                0.3944        NaN     NaN      NaN
## streamBUSH_CR                    2.0513        NaN     NaN      NaN
## streamCABIN_JOHN_CR              0.9770        NaN     NaN      NaN
## streamCARROLL_BR                 1.1903        NaN     NaN      NaN
## streamCOLLIER_RUN                0.6371        NaN     NaN      NaN
## streamCONOWINGO_CR               6.9775        NaN     NaN      NaN
## streamDEAD_RUN                   1.3944        NaN     NaN      NaN
## streamDEEP_RUN                  -1.3695        NaN     NaN      NaN
## streamDEER_CR                    1.4935        NaN     NaN      NaN
## streamDORSEY_RUN                -0.7771        NaN     NaN      NaN
## streamFALLS_RUN                  0.2674        NaN     NaN      NaN
## streamFISHING_CR                -0.2889        NaN     NaN      NaN
## streamFLINTSTONE_CR             -0.2889        NaN     NaN      NaN
## streamGREAT_SENECA_CR            5.7218        NaN     NaN      NaN
## streamGREENE_BR                  2.1390        NaN     NaN      NaN
## streamGUNPOWDER_FALLS            1.0849        NaN     NaN      NaN
## streamHAINES_BR                  6.2939        NaN     NaN      NaN
## streamHAWLINGS_R                -2.6056        NaN     NaN      NaN
## streamHAY_MEADOW_BR             -1.3695        NaN     NaN      NaN
## streamHERRINGTON_RUN            -2.6056        NaN     NaN      NaN
## streamHOLLANDS_BR                2.5589        NaN     NaN      NaN
## streamISRAEL_CR                  1.8717        NaN     NaN      NaN
## streamLIBERTY_RES               -0.1414        NaN     NaN      NaN
## streamLITTLE_ANTIETAM_CR         1.2934        NaN     NaN      NaN
## streamLITTLE_BEAR_CR            -1.1561        NaN     NaN      NaN
## streamLITTLE_CONOCOCHEAGUE_CR    0.2674        NaN     NaN      NaN
## streamLITTLE_DEER_CR             2.5589        NaN     NaN      NaN
## streamLITTLE_FALLS               5.5596        NaN     NaN      NaN
## streamLITTLE_GUNPOWDER_R        -1.8735        NaN     NaN      NaN
## streamLITTLE_HUNTING_CR          0.6371        NaN     NaN      NaN
## streamLITTLE_PAINT_BR            4.3317        NaN     NaN      NaN
## streamMAINSTEM_PATUXENT_R       11.8541        NaN     NaN      NaN
## streamMEADOW_BR                 11.6915        NaN     NaN      NaN
## streamMILL_CR                   -1.1561        NaN     NaN      NaN
## streamMORGAN_RUN                 5.1122        NaN     NaN      NaN
## streamMUDDY_BR                   1.3944        NaN     NaN      NaN
## streamMUDLICK_RUN               -0.7771        NaN     NaN      NaN
## streamNORTH_BR                   1.1903        NaN     NaN      NaN
## streamNORTH_BR_CASSELMAN_R       0.3944        NaN     NaN      NaN
## streamNORTHWEST_BR              -1.1561        NaN     NaN      NaN
## streamNORTHWEST_BR_ANACOSTIA_R   6.3944        NaN     NaN      NaN
## streamOWENS_CR                   5.3387        NaN     NaN      NaN
## streamPATAPSCO_R                 1.6860        NaN     NaN      NaN
## streamPINEY_BR                   3.3227        NaN     NaN      NaN
## streamPINEY_CR                   0.6371        NaN     NaN      NaN
## streamPINEY_RUN                  2.3944        NaN     NaN      NaN
## streamPRETTYBOY_BR               0.7533        NaN     NaN      NaN
## streamRED_RUN                    2.0513        NaN     NaN      NaN
## streamROCK_CR                   -1.8735        NaN     NaN      NaN
## streamSAVAGE_R                   6.6901        NaN     NaN      NaN
## streamSECOND_MINE_BR             4.2685        NaN     NaN      NaN
## streamSENECA_CR                  1.1903        NaN     NaN      NaN
## streamSOUTH_BR_CASSELMAN_R      -2.1913        NaN     NaN      NaN
## streamSOUTH_BR_PATAPSCO          1.4935        NaN     NaN      NaN
## streamSOUTH_FORK_LINGANORE_CR    0.8666        NaN     NaN      NaN
## streamTUSCARORA_CR               2.5589        NaN     NaN      NaN
## streamWATTS_BR                   0.7533        NaN     NaN      NaN
## acreage                              NA         NA      NA       NA
## dO2                                  NA         NA      NA       NA
## maxdepth                             NA         NA      NA       NA
## no3                                  NA         NA      NA       NA
## so4                                  NA         NA      NA       NA
## temp                                 NA         NA      NA       NA
## 
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:    NaN 
## F-statistic:   NaN on 67 and 0 DF,  p-value: NA

But you need to make sure you want to include all the variables. In this case, we probably wanted to exclude the stream variable

# Remove stream via the data argument
m <- lm(sqrt(count) ~ ., data = longnosedace |> select(-stream))
summary(m)
## 
## Call:
## lm(formula = sqrt(count) ~ ., data = select(longnosedace, -stream))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3635 -1.9071 -0.3538  1.3078  8.2489 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -9.259e+00  4.084e+00  -2.267  0.02694 * 
## acreage      1.428e-04  4.153e-05   3.440  0.00106 **
## dO2          7.157e-01  3.311e-01   2.162  0.03456 * 
## maxdepth     2.277e-02  1.097e-02   2.075  0.04223 * 
## no3          5.580e-01  1.787e-01   3.123  0.00274 **
## so4          7.571e-03  4.756e-02   0.159  0.87406   
## temp         2.165e-01  1.042e-01   2.077  0.04198 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.524 on 61 degrees of freedom
## Multiple R-squared:  0.4026, Adjusted R-squared:  0.3438 
## F-statistic: 6.852 on 6 and 61 DF,  p-value: 1.367e-05

or

# Remove stream in the formula
m <- lm(sqrt(count) ~ .-stream, data = longnosedace)
summary(m)
## 
## Call:
## lm(formula = sqrt(count) ~ . - stream, data = longnosedace)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3635 -1.9071 -0.3538  1.3078  8.2489 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -9.259e+00  4.084e+00  -2.267  0.02694 * 
## acreage      1.428e-04  4.153e-05   3.440  0.00106 **
## dO2          7.157e-01  3.311e-01   2.162  0.03456 * 
## maxdepth     2.277e-02  1.097e-02   2.075  0.04223 * 
## no3          5.580e-01  1.787e-01   3.123  0.00274 **
## so4          7.571e-03  4.756e-02   0.159  0.87406   
## temp         2.165e-01  1.042e-01   2.077  0.04198 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.524 on 61 degrees of freedom
## Multiple R-squared:  0.4026, Adjusted R-squared:  0.3438 
## F-statistic: 6.852 on 6 and 61 DF,  p-value: 1.367e-05

You can also use this technique to eliminate the intercept and perform “regression through the origin”.

# Sometimes this seems appealing, but I've never used it in practice
m <- lm(sqrt(count) ~ no3 + acreage - 1, data = longnosedace)
summary(m)
## 
## Call:
## lm(formula = sqrt(count) ~ no3 + acreage - 1, data = longnosedace)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9745 -1.3463  0.5319  2.3545  9.5786 
## 
## Coefficients:
##          Estimate Std. Error t value Pr(>|t|)    
## no3     1.143e+00  1.273e-01   8.976 4.80e-13 ***
## acreage 2.431e-04  3.806e-05   6.386 1.97e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.849 on 66 degrees of freedom
## Multiple R-squared:  0.797,  Adjusted R-squared:  0.7909 
## F-statistic: 129.6 on 2 and 66 DF,  p-value: < 2.2e-16

This is not a common model but there may be a time when you want to force the intercept to be 0.

Interactions

Interactions can be added to a model by using a colon (:) between two explanatory variables, e.g.

# Add an interaction explicitly
m1 <- lm(sqrt(count) ~ no3 + maxdepth + no3:maxdepth, data = longnosedace)
summary(m1)
## 
## Call:
## lm(formula = sqrt(count) ~ no3 + maxdepth + no3:maxdepth, data = longnosedace)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.369 -1.848 -0.567  1.251  7.213 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   3.017978   1.532259   1.970   0.0532 .
## no3          -0.006920   0.513095  -0.013   0.9893  
## maxdepth      0.007891   0.021448   0.368   0.7142  
## no3:maxdepth  0.009373   0.007372   1.272   0.2081  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.791 on 64 degrees of freedom
## Multiple R-squared:  0.2338, Adjusted R-squared:  0.1979 
## F-statistic: 6.511 on 3 and 64 DF,  p-value: 0.0006502

But because we have a convention of including main effects (lower order terms) whenever an interaction (higher order term) is included in the model, R provides a convenient way of including both the main effects as well as the interaction by using an asterisk (*).

# Add an interaction implicitly
m2 <- lm(sqrt(count) ~ no3 * maxdepth, data = longnosedace)
summary(m2)
## 
## Call:
## lm(formula = sqrt(count) ~ no3 * maxdepth, data = longnosedace)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.369 -1.848 -0.567  1.251  7.213 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   3.017978   1.532259   1.970   0.0532 .
## no3          -0.006920   0.513095  -0.013   0.9893  
## maxdepth      0.007891   0.021448   0.368   0.7142  
## no3:maxdepth  0.009373   0.007372   1.272   0.2081  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.791 on 64 degrees of freedom
## Multiple R-squared:  0.2338, Adjusted R-squared:  0.1979 
## F-statistic: 6.511 on 3 and 64 DF,  p-value: 0.0006502

Alternatively, you can use the caret symbol (^) and specifying what order interactions you want. For example ^2 will include all two-way interactions.

# Add all 2 way interactions (there is only 1 here)
m3 <- lm(sqrt(count) ~ (no3 + maxdepth)^2, data = longnosedace)
summary(m3)
## 
## Call:
## lm(formula = sqrt(count) ~ (no3 + maxdepth)^2, data = longnosedace)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.369 -1.848 -0.567  1.251  7.213 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   3.017978   1.532259   1.970   0.0532 .
## no3          -0.006920   0.513095  -0.013   0.9893  
## maxdepth      0.007891   0.021448   0.368   0.7142  
## no3:maxdepth  0.009373   0.007372   1.272   0.2081  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.791 on 64 degrees of freedom
## Multiple R-squared:  0.2338, Adjusted R-squared:  0.1979 
## F-statistic: 6.511 on 3 and 64 DF,  p-value: 0.0006502

If you want to include more terms, you have a couple of options, e.g. 

# main effects and two-way interactions
# using ()^2
m4 <- lm(sqrt(count) ~ (no3 + maxdepth + acreage)^2, data = longnosedace) 
summary(m4)
## 
## Call:
## lm(formula = sqrt(count) ~ (no3 + maxdepth + acreage)^2, data = longnosedace)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9715 -1.7287 -0.6365  1.3585  7.4738 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)  
## (Intercept)       9.771e-01  1.891e+00   0.517   0.6072  
## no3               3.596e-01  4.977e-01   0.723   0.4727  
## maxdepth          1.277e-02  2.575e-02   0.496   0.6218  
## acreage           2.893e-04  1.585e-04   1.825   0.0729 .
## no3:maxdepth      7.511e-03  7.100e-03   1.058   0.2943  
## no3:acreage      -2.060e-05  2.156e-05  -0.956   0.3431  
## maxdepth:acreage -1.200e-06  1.859e-06  -0.646   0.5210  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.613 on 61 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.2968 
## F-statistic: 5.714 on 6 and 61 DF,  p-value: 9.178e-05
# main effects, two-way interactions, and three-way interaction
# using ()^3
m5 <- lm(sqrt(count) ~ (no3 + maxdepth + acreage)^3, data = longnosedace) 
summary(m5)
## 
## Call:
## lm(formula = sqrt(count) ~ (no3 + maxdepth + acreage)^3, data = longnosedace)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1123 -1.5007 -0.3734  1.1348  7.5601 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)  
## (Intercept)           4.540e+00  2.492e+00   1.822   0.0734 .
## no3                  -9.005e-01  7.668e-01  -1.174   0.2449  
## maxdepth             -4.593e-02  3.735e-02  -1.230   0.2236  
## acreage              -2.026e-04  2.787e-04  -0.727   0.4700  
## no3:maxdepth          2.856e-02  1.210e-02   2.361   0.0215 *
## no3:acreage           1.679e-04  9.141e-05   1.837   0.0712 .
## maxdepth:acreage      6.495e-06  4.057e-06   1.601   0.1147  
## no3:maxdepth:acreage -2.895e-06  1.366e-06  -2.119   0.0383 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.541 on 60 degrees of freedom
## Multiple R-squared:  0.4044, Adjusted R-squared:  0.3349 
## F-statistic: 5.819 on 7 and 60 DF,  p-value: 3.555e-05
# main effects, two-way interactions, and three-way interaction
# using * 
m5 <- lm(sqrt(count) ~ no3 * maxdepth * acreage, data = longnosedace) 
summary(m5)
## 
## Call:
## lm(formula = sqrt(count) ~ no3 * maxdepth * acreage, data = longnosedace)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1123 -1.5007 -0.3734  1.1348  7.5601 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)  
## (Intercept)           4.540e+00  2.492e+00   1.822   0.0734 .
## no3                  -9.005e-01  7.668e-01  -1.174   0.2449  
## maxdepth             -4.593e-02  3.735e-02  -1.230   0.2236  
## acreage              -2.026e-04  2.787e-04  -0.727   0.4700  
## no3:maxdepth          2.856e-02  1.210e-02   2.361   0.0215 *
## no3:acreage           1.679e-04  9.141e-05   1.837   0.0712 .
## maxdepth:acreage      6.495e-06  4.057e-06   1.601   0.1147  
## no3:maxdepth:acreage -2.895e-06  1.366e-06  -2.119   0.0383 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.541 on 60 degrees of freedom
## Multiple R-squared:  0.4044, Adjusted R-squared:  0.3349 
## F-statistic: 5.819 on 7 and 60 DF,  p-value: 3.555e-05

We can combine the . with this notation for higher order terms.

# remove stream and then include all 2-way interactions
m <- lm(sqrt(count) ~ .^2, data = longnosedace |> select(-stream))
summary(m)
## 
## Call:
## lm(formula = sqrt(count) ~ .^2, data = select(longnosedace, -stream))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6901 -1.5280 -0.2438  1.4632  6.0108 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)   
## (Intercept)      -1.920e+01  2.715e+01  -0.707  0.48300   
## acreage          -5.081e-05  8.671e-04  -0.059  0.95352   
## dO2               2.193e+00  2.572e+00   0.853  0.39825   
## maxdepth          7.802e-02  1.856e-01   0.420  0.67626   
## no3               7.466e-01  4.068e+00   0.184  0.85520   
## so4              -2.789e-01  9.418e-01  -0.296  0.76846   
## temp              4.744e-01  1.062e+00   0.447  0.65715   
## acreage:dO2      -2.020e-05  6.624e-05  -0.305  0.76172   
## acreage:maxdepth  1.063e-06  2.399e-06   0.443  0.65993   
## acreage:no3      -1.860e-05  3.016e-05  -0.617  0.54057   
## acreage:so4       1.568e-05  1.315e-05   1.192  0.23931   
## acreage:temp      7.263e-06  1.697e-05   0.428  0.67071   
## dO2:maxdepth     -1.478e-02  1.478e-02  -1.000  0.32248   
## dO2:no3          -5.095e-02  2.867e-01  -0.178  0.85974   
## dO2:so4           1.021e-01  8.758e-02   1.166  0.24966   
## dO2:temp         -6.416e-02  1.034e-01  -0.621  0.53785   
## maxdepth:no3      6.118e-03  9.920e-03   0.617  0.54046   
## maxdepth:so4     -7.029e-03  2.448e-03  -2.872  0.00615 **
## maxdepth:temp     6.953e-03  6.417e-03   1.084  0.28422   
## no3:so4           6.959e-03  3.333e-02   0.209  0.83553   
## no3:temp         -7.861e-03  9.412e-02  -0.084  0.93380   
## so4:temp         -1.306e-02  2.748e-02  -0.475  0.63673   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.533 on 46 degrees of freedom
## Multiple R-squared:  0.5462, Adjusted R-squared:  0.3391 
## F-statistic: 2.637 on 21 and 46 DF,  p-value: 0.003049

Adding or removing variables

Another approach that might be useful is to construct a model and then add or remove variables from that model.

# Simply linear regression model
m <- lm(sqrt(count) ~ no3, data = longnosedace)
summary(m)
## 
## Call:
## lm(formula = sqrt(count) ~ no3, data = longnosedace)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3751 -1.9991 -0.1519  0.9806  9.6578 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.7547     0.6335   5.927 1.24e-07 ***
## no3           0.6185     0.1963   3.150  0.00245 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.927 on 66 degrees of freedom
## Multiple R-squared:  0.1307, Adjusted R-squared:  0.1175 
## F-statistic: 9.924 on 1 and 66 DF,  p-value: 0.002452
# Add acreage
mA <- update(m, ~ . + acreage)
summary(mA)
## 
## Call:
## lm(formula = sqrt(count) ~ no3 + acreage, data = longnosedace)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9154 -1.8807 -0.3148  1.3880  9.4343 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.3053896  0.6785796   3.397 0.001166 ** 
## no3         0.6890353  0.1782946   3.865 0.000259 ***
## acreage     0.0001667  0.0000419   3.978 0.000178 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.645 on 65 degrees of freedom
## Multiple R-squared:  0.3009, Adjusted R-squared:  0.2794 
## F-statistic: 13.99 on 2 and 65 DF,  p-value: 8.872e-06
# Remove no3
mR <- update(mA, ~ . - no3)
summary(mR)
## 
## Call:
## lm(formula = sqrt(count) ~ acreage, data = longnosedace)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8406 -2.0205 -0.2371  1.5548 10.3053 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.269e+00  4.951e-01   8.622 2.04e-12 ***
## acreage     1.505e-04  4.588e-05   3.281  0.00165 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.911 on 66 degrees of freedom
## Multiple R-squared:  0.1402, Adjusted R-squared:  0.1272 
## F-statistic: 10.77 on 1 and 66 DF,  p-value: 0.001654

You can find out more information by looking at

?formula

Activities

Practice these skills by taking a look at the data sets in the Sleuth3 package.

help(package="Sleuth3")

The data sets are organized by

  1. textbook usage ([case] study or [ex]ample)
  2. chapter, e.g. case0701 is chapter 7
  3. number, e.g. case0701 is number 1.

The Statistical Sleuth 3rd ed by Ramsey and Schafer begins discussing regression in Chapter 7 (although ANOVA is discussed in Chapter 5). Thus, data sets in Chapters 5 through 14 are relevant for exploration. When you get a new data set ask yourself these questions in this order

  1. What is the response variable?
  2. How many explanatory variables are there?
  3. Is the explanatory variable continuous or categorical?
  4. How many levels does each categorical explanatory variable have?