8.1-8.2

Chapter 8 - Interactions and Nonlinear Effects

8.1 Introduction

So far, we have discussed linear regression at its core: how one or more predictors can explain variations in a response variable.

For example, using the penguins dataset from palmerpenguins library, we might be interested in how penguin’s bill length affects penguin’s body mass, put differently, how bill length can be used to predict body mass. (which explanation is more accurate?)

(credit: https://allisonhorst.github.io/palmerpenguins/)

Similarly, in a wine quality dataset, we might examine the relationship between the percentage of alcohol (\(X\)) and the wine quality rating (\(Y\)).

(credit: https://www.kaggle.com/datasets/yasserh/wine-quality-dataset)

However, the relationship is not always this simple. Predictors can depend on one another other, influencing each others relationship with a response variable. This idea is called an interaction.

For example, there is an interaction between penguin’s bill length and the bill depth. These two factors, together with its interaction, explains more variations in penguin’s body mass than the single predictos alone (although, it would be silly to think this is a causal relationship).

Equation for linear regression without an interaction:

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon \]

Equation for linear regression with an interaction:

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1*X_2) + \epsilon \]

In Chapter 8, Winter discusses interactions between categorical x continuous variables, categorical x categorical variables, and continuous x continuous variables.

We will begin with the interaction between a categorical and a continuous variable.

8.2 Categorical x Continuous interactions

Bringing back an example dataset from Qi Yi’s week: iconicity (a resemblance between form and meaning)

library(tidyverse)
library(ggplot2)
icon <- read.csv("winter_2017_iconicity.csv")

head(icon)
      Word         POS  SER CorteseImag Conc Syst    Freq  Iconicity
1        a Grammatical   NA          NA 1.46   NA 1041179  0.4615385
2    abide        Verb   NA          NA 1.68   NA     138  0.2500000
3     able   Adjective 1.73          NA 2.38   NA    8155  0.4666667
4    about Grammatical 1.20          NA 1.77   NA  185206 -0.1000000
5    above Grammatical 2.91          NA 3.33   NA    2493  1.0625000
6 abrasive   Adjective   NA          NA 3.03   NA      23  1.3125000

Variables of interest

  • Iconicity (Y - continuous) - rated from -5 to 5

  • Part-of-speech or POS (X - categorical) - 7 levels (Name, Interjection, Adverb, Grammatical, Adjective, Verb, Noun)

For today’s example, we will focus only on Noun and Verb.

  • Sensory experience or SER (X - continuous) - derived from an existing dataset from a previous study that rated how strongly a word evokes a sensory experience (based on the five common sense), on a scale 1 to 7.

Now, we will filter only the data we are interested in.

icon_sel <- icon %>% 
  select(1:3, 8) %>% 
  filter(POS %in% c("Noun", "Verb"))

head(icon_sel)
       Word  POS  SER  Iconicity
1     abide Verb   NA  0.2500000
2   academy Noun   NA  0.6923077
3  accident Noun   NA  1.3636364
4 accordion Noun   NA -0.4545455
5   account Noun 2.30 -0.4285714
6      ache Noun 4.82  1.1538462

There are many NA values in the SER column, as Qi Yi pointed out. However, we will (pretend) to ignore that issue.

Let’s briefly have a look at the data..

summary(icon_sel)
     Word               POS                 SER          Iconicity       
 Length:2261        Length:2261        Min.   :1.000   Min.   :-2.80000  
 Class :character   Class :character   1st Qu.:2.550   1st Qu.: 0.09091  
 Mode  :character   Mode  :character   Median :3.270   Median : 0.70000  
                                       Mean   :3.303   Mean   : 0.85924  
                                       3rd Qu.:4.000   3rd Qu.: 1.50000  
                                       Max.   :6.560   Max.   : 4.46667  
                                       NA's   :782                       
icon_sel %>% count(POS)
   POS    n
1 Noun 1704
2 Verb  557

If we are simply interested in the relationship between \(X\)s and \(Y\) without an interaction, we fit this model 1. These are referred to as main effects if they are not fit as an interaction.

model1 <- lm(Iconicity ~ SER + POS, data = icon_sel)

summary(model1)

Call:
lm(formula = Iconicity ~ SER + POS, data = icon_sel)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0956 -0.7210 -0.1233  0.6146  3.6902 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.11935    0.10040  -1.189    0.235    
SER          0.23319    0.02789   8.362   <2e-16 ***
POSVerb      0.60159    0.06388   9.418   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.045 on 1476 degrees of freedom
  (782 observations deleted due to missingness)
Multiple R-squared:  0.08226,   Adjusted R-squared:  0.08102 
F-statistic: 66.15 on 2 and 1476 DF,  p-value: < 2.2e-16

“As always, you should spend considerable time interpreting the coefficients” – I’ll leave this to you :D

ggplot(icon_sel, aes(x = SER, y = Iconicity)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = lm) +
  labs(x = "sensory experience",
       y = "iconicity rating score",
       title = "Iconicity vs Sensory experience")+
  theme_bw()

With this Model 1, the coefficient of each predictor is interpreted independently. For example, the SER coefficient indicates that for a 1-unit increase in SER, holding POS constant, the iconicity rating is predicted to increase by 0.23.

What does ‘holding POS constant’ mean?

It means that the effect of SER is estimated at the baseline. As long as POS is held as the baseline, the predicted change of 0.23 applies.

Now let’s look at this plot.

ggplot(icon_sel, aes(x = SER, y = Iconicity, color = POS)) +
  geom_point(size = 1.5, alpha = 0.5) +                                   
  geom_smooth(method = "lm") +
  labs(x = "sensory experience",
       y = "iconicity rating score",
       title = "Iconicity vs Sensory experience across two POS",
       colour = "Part of speech")+
  theme_bw()

Question: What do these regression lines for Noun and Verb tell us?

The effect of SER on iconicity differs between nouns and verbs. Likewise, the degree to which nouns and verbs differ from each other in terms of iconicity also depends on what SER value one considers. This means that SER and POS are not independent in their effects. So, we should be careful about interpreting these predictors independently.

This brings us to… Multiple regression with an interaction model

model2 <- lm(Iconicity ~ SER * POS, data = icon_sel)

summary(model2)

Call:
lm(formula = Iconicity ~ SER * POS, data = icon_sel)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8448 -0.7043 -0.1257  0.5864  3.5845 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.27394    0.11065   2.476  0.01341 *  
SER          0.11817    0.03108   3.801  0.00015 ***
POSVerb     -0.95542    0.20971  -4.556 5.65e-06 ***
SER:POSVerb  0.50838    0.06535   7.780 1.36e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.025 on 1475 degrees of freedom
  (782 observations deleted due to missingness)
Multiple R-squared:  0.1184,    Adjusted R-squared:  0.1166 
F-statistic: 66.05 on 3 and 1475 DF,  p-value: < 2.2e-16

Interpreting coefficients

  • Intercept: When the SER = 0 and POS is held at nouns, the iconicity rating score is predicted to be 0.27.

  • SER coefficient: VERY IMPORTANT NOTE This is the coefficient only for nouns It means that, for nouns, a 1-unit increase in SER is predicted to increase iconicity by 0.12.

  • POSVerb coefficient: This coefficient is the difference between the predicted iconicity for noun and verb, when the word has SER = 0.

  • SER:POSVerb coefficient: This acts as a slope adjustment (think: what is being adjusted?). For verbs, the effect of SER on iconicity is adjusted by +0.51, meaning that the SER slope for verbs is higher than that of nouns by 0.51.

emmeans::joint_tests(model2)
 model term df1  df2 F.ratio p.value
 SER          1 1475 129.872  <.0001
 POS          1 1475 125.650  <.0001
 SER:POS      1 1475  60.523  <.0001

Centering continuous variable

In Model 2, we interpret the intercept and the POSVerb coefficient at SER = 0. However, in the data, SER was rated from 1 to 7. So, actually SER = 0 is arbitrary (and does not match reality).

ggplot(icon_sel, aes(x = SER, y = Iconicity, color = POS)) +
  geom_point(size = 1.5, alpha = 0.5) +                                   
  geom_smooth(method = "lm", fullrange = TRUE) +
  labs(x = "sensory experience",
       y = "iconicity rating score",
       title = "Iconicity vs Sensory experience across two POS (Model 2)",
       colour = "Part of speech")+
  geom_vline(xintercept = 0, linetype = "dashed") +
  theme_bw()

To facilitate interpretation, we should centre the continuous variable. “The interpretation for interaction for continuous variable is more comprehensible with the continuous variables are centered”.

icon_sel2 <- icon_sel %>% mutate(SER_c = scale(SER)) %>% 
  relocate(SER_c, .after = SER)
summary(icon_sel2)
     Word               POS                 SER       
 Length:2261        Length:2261        Min.   :1.000  
 Class :character   Class :character   1st Qu.:2.550  
 Mode  :character   Mode  :character   Median :3.270  
                                       Mean   :3.303  
                                       3rd Qu.:4.000  
                                       Max.   :6.560  
                                       NA's   :782    
          SER_c.V1            Iconicity       
 Min.   :-2.3144393573700   Min.   :-2.80000  
 1st Qu.:-0.7566730383830   1st Qu.: 0.09091  
 Median :-0.0330654579519   Median : 0.70000  
 Mean   : 0.0000000000000   Mean   : 0.85924  
 3rd Qu.: 0.7005922277630   3rd Qu.: 1.50000  
 Max.   : 3.2734191804100   Max.   : 4.46667  
 NA's   :782                                  

Fit regression model again

model3 <- lm(Iconicity ~ SER_c * POS, data = icon_sel2)
summary(model3)

Call:
lm(formula = Iconicity ~ SER_c * POS, data = icon_sel2)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8448 -0.7043 -0.1257  0.5864  3.5845 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    0.66423    0.03102  21.414  < 2e-16 ***
SER_c          0.11758    0.03093   3.801  0.00015 ***
POSVerb        0.72371    0.06456  11.209  < 2e-16 ***
SER_c:POSVerb  0.50585    0.06502   7.780 1.36e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.025 on 1475 degrees of freedom
  (782 observations deleted due to missingness)
Multiple R-squared:  0.1184,    Adjusted R-squared:  0.1166 
F-statistic: 66.05 on 3 and 1475 DF,  p-value: < 2.2e-16

Now, how do you interpret the coefficients?

In particular, look at POSVerb coefficient!

ggplot(icon_sel2, aes(x = SER_c, y = Iconicity, color = POS)) +
  geom_point(size = 1.5, alpha = 0.5) +                                   
  geom_smooth(method = "lm") +
  labs(x = "sensory experience (centred)",
             y = "iconicity rating score",
             title = "Iconicity vs Sensory experience across two POS (centred)",
        colour = "Part of speech")+
  geom_vline(xintercept = 0, linetype = "dashed")+
  theme_bw()

Download

Download Notebook & Data