library(tidyverse)
library(palmerpenguins)
data(penguins)
Significance in multiple regression
What does it mean for a single predictor to be “significant” in the context of a multiple regression? Well, it means that the model has taken in information from more than one independent variable. Sometimes this is described as “controlling” for other variables. But what is more interesting to us here is that the significance values will change in the presence of other variables. We can use penguins to demonstrate.
# load in data and z-score some variables
penguins_lm <- penguins %>%
  drop_na() %>%
  mutate(bl_c = scale(bill_length_mm),
         bd_c = scale(bill_depth_mm),
         flipper_c = scale(flipper_length_mm))
As always, we are interested in predicting the body mass of penguins.
If we fit a model with all 3 variables, we get a significant effect of flipper length, whereas bill length and bill depth are not significant.
m1 <- lm(body_mass_g ~ bl_c + bd_c + flipper_c, data = penguins_lm)
summary(m1)
Call:
lm(formula = body_mass_g ~ bl_c + bd_c + flipper_c, data = penguins_lm)
Residuals:
Min 1Q Median 3Q Max
-1051.37 -284.50 -20.37 241.03 1283.51
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4207.06 21.54 195.345 <2e-16 ***
bl_c 18.01 29.34 0.614 0.540
bd_c 35.12 27.23 1.290 0.198
flipper_c 711.47 35.00 20.327 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 393 on 329 degrees of freedom
Multiple R-squared: 0.7639, Adjusted R-squared: 0.7618
F-statistic: 354.9 on 3 and 329 DF, p-value: < 2.2e-16
Caution
So, what does it mean for flipper length to be significant in this context?
It means that not only is flipper length significant, it remains significant even when the model has information about other predictors. When taking into account the effects of bill length and bill depth, variation in flipper length is still an important variable. And, conversely, bill length and bill depth are not significant in light of information about flipper length.
Does this mean that bill length and bill depth are not significant? Well, it depends. Look at the following 3 models:
m2 <- lm(body_mass_g ~ bl_c, data = penguins_lm)
summary(m2)
Call:
lm(formula = body_mass_g ~ bl_c, data = penguins_lm)
Residuals:
Min 1Q Median 3Q Max
-1759.38 -468.82 27.79 464.20 1641.00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4207.06 35.70 117.85 <2e-16 ***
bl_c 474.64 35.75 13.28 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 651.4 on 331 degrees of freedom
Multiple R-squared: 0.3475, Adjusted R-squared: 0.3455
F-statistic: 176.2 on 1 and 331 DF, p-value: < 2.2e-16
m3 <- lm(body_mass_g ~ bd_c, data = penguins_lm)
summary(m3)
Call:
lm(formula = body_mass_g ~ bd_c, data = penguins_lm)
Residuals:
Min 1Q Median 3Q Max
-1616.08 -510.38 -68.67 486.51 1811.12
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4207.06 38.96 107.986 <2e-16 ***
bd_c -380.07 39.02 -9.741 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 710.9 on 331 degrees of freedom
Multiple R-squared: 0.2228, Adjusted R-squared: 0.2205
F-statistic: 94.89 on 1 and 331 DF, p-value: < 2.2e-16
m4 <- lm(body_mass_g ~ bl_c + bd_c, data = penguins_lm)
summary(m4)
Call:
lm(formula = body_mass_g ~ bl_c + bd_c, data = penguins_lm)
Residuals:
Min 1Q Median 3Q Max
-1806.74 -456.82 15.33 471.07 1541.34
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4207.06 32.30 130.257 < 2e-16 ***
bl_c 409.13 33.23 12.313 < 2e-16 ***
bd_c -286.54 33.23 -8.624 2.76e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 589.4 on 330 degrees of freedom
Multiple R-squared: 0.4675, Adjusted R-squared: 0.4642
F-statistic: 144.8 on 2 and 330 DF, p-value: < 2.2e-16
Each model shows significant effects for bill length and bill depth, even when both are combined into a single model. So it is not that they have no significant relationship with body mass; rather, once flipper length is in the model, the information they carry becomes largely redundant.
Multicollinearity?
This probably means that the variables are all inherently capturing some similar property of the penguins (i.e., body size), and are likely correlated.
We see that they are correlated, with relatively strong correlations between flipper length and both bill measures. They do not perfectly covary, so each captures some variation of its own as well as the shared overlap, but in this case it seems flipper length might make information about the length and depth of bills redundant.
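As a quick check, we can compute the pairwise correlations directly. This is a sketch that assumes the `penguins_lm` data frame created above; `scale()` leaves one-column matrix objects behind, so we coerce to plain numeric first:

```r
# pairwise correlations among the three standardized predictors
# (assumes penguins_lm from the earlier chunk)
preds <- data.frame(bl_c = as.numeric(penguins_lm$bl_c),
                    bd_c = as.numeric(penguins_lm$bd_c),
                    flipper_c = as.numeric(penguins_lm$flipper_c))
round(cor(preds), 2)
```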
Using VIF to check multicollinearity
Another (and probably better) way to check for redundancy and overlap among your predictors is to use variance inflation factor (VIF), which tests for overlap in the context of the fit model. Unlike a correlation matrix, which only captures pairwise relationships between predictors, VIF accounts for the combined overlap of all predictors simultaneously — making it better at detecting multicollinearity that only emerges when multiple variables are considered together.
We can use the check_collinearity function from the wonderful performance package to test a fit model. Here we see that actually, our variables are not very strongly correlated and that the full model does not really suffer from multicollinearity.
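The call itself is one line. As a sketch (assuming the performance package is installed, and that `m1` is the full three-predictor model fit above), we can also reproduce a VIF by hand to see why it captures combined rather than merely pairwise overlap: each predictor's VIF is 1 / (1 − R²) from regressing that predictor on all of the others.

```r
library(performance)

# check_collinearity() reports a VIF for each predictor in the fitted model;
# VIFs under roughly 5 are usually read as low
check_collinearity(m1)

# the same quantity by hand, for flipper_c: regress it on the other
# predictors and invert (1 - R^2); an R^2 near 1 would mean flipper_c
# is almost fully predictable from the bill measures
r2 <- summary(lm(flipper_c ~ bl_c + bd_c, data = penguins_lm))$r.squared
1 / (1 - r2)
```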
We may as well get the whole model check while we are at it. Note that VIF is one of the pieces of information provided.
check_model(m1)
Some of the variables were in matrix-format - probably you used
`scale()` on your data?
If so, and you get an error, please try `datawizard::standardize()` to
standardize your data.
So what?
We need to understand that putting variables into a multiple regression will impact the other variables, even when they are not in an interaction.
This can lead to several problems: too many similar variables might make something look “not significant”, and not including the right variables may inflate the apparent importance of the ones you did include.
The idea of “controlling” for variables simply means including them in the model. It is a conceptual difference whether a variable is a “control” variable or whether it is an independent variable related to your hypotheses.
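In code, “controlling for” bill depth is literally just another term in the formula. A minimal sketch (assuming `penguins_lm` from above), comparing the flipper estimate with and without the control:

```r
# the flipper coefficient changes once bd_c is "controlled for",
# i.e., simply included in the model formula
coef(lm(body_mass_g ~ flipper_c, data = penguins_lm))["flipper_c"]
coef(lm(body_mass_g ~ flipper_c + bd_c, data = penguins_lm))["flipper_c"]
```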
Confidence intervals
p-values should not be the only thing we care about - we want measures of effect sizes plus precision of our estimates.
The standard error from the model output gives us precision as it relates to the point estimate:
summary(m1)
Call:
lm(formula = body_mass_g ~ bl_c + bd_c + flipper_c, data = penguins_lm)
Residuals:
Min 1Q Median 3Q Max
-1051.37 -284.50 -20.37 241.03 1283.51
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4207.06 21.54 195.345 <2e-16 ***
bl_c 18.01 29.34 0.614 0.540
bd_c 35.12 27.23 1.290 0.198
flipper_c 711.47 35.00 20.327 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 393 on 329 degrees of freedom
Multiple R-squared: 0.7639, Adjusted R-squared: 0.7618
F-statistic: 354.9 on 3 and 329 DF, p-value: < 2.2e-16
# tidy() is from the broom package
m1_plot_ci <- cbind(tidy(m1), as.data.frame(confint(m1)))[2:4, ]

ggplot(m1_plot_ci, aes(x = estimate, y = term)) +
  geom_point() +
  geom_errorbar(aes(xmin = `2.5 %`, xmax = `97.5 %`), width = .25) +
  labs(title = 'confidence interval around estimates') +
  geom_vline(xintercept = 0, color = 'lightcoral', lty = 'dashed')
Important
Standard error: the precision of the estimate, a function of sample size and standard deviation.
Confidence interval: over the long run, 95% of your confidence intervals will contain the true population parameter. (Yes, this is annoyingly precise and somewhat counterintuitive.)
So we can read this to say that, more often than not, we capture the true parameter. Just like with p-values, more often than not we are not making a Type I error.
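The “over the long run” reading can be made concrete with a small simulation (a sketch, not part of the original analysis): simulate many samples from a model with a known slope and count how often the 95% CI captures it.

```r
set.seed(123)
true_slope <- 2
covered <- replicate(1000, {
  x <- rnorm(50)
  y <- true_slope * x + rnorm(50)
  ci <- confint(lm(y ~ x))["x", ]   # 95% CI for the slope
  ci[1] <= true_slope && true_slope <= ci[2]
})
mean(covered)  # should land near 0.95
```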
Quickly look at an interaction
Let’s complicate this with a continuous by categorical interaction.
What is significant here?
m5 <- lm(body_mass_g ~ species * flipper_c, data = penguins_lm)
summary(m5)
Call:
lm(formula = body_mass_g ~ species * flipper_c, data = penguins_lm)
Residuals:
Min 1Q Median 3Q Max
-900.90 -247.01 -25.53 197.19 1143.33
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4061.30 59.40 68.377 < 2e-16 ***
speciesChinstrap -150.39 81.11 -1.854 0.06460 .
speciesGentoo 149.95 108.12 1.387 0.16641
flipper_c 458.16 65.76 6.967 1.78e-11 ***
speciesChinstrap:flipper_c 26.41 110.22 0.240 0.81077
speciesGentoo:flipper_c 301.01 97.65 3.083 0.00223 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 368.4 on 327 degrees of freedom
Multiple R-squared: 0.7938, Adjusted R-squared: 0.7906
F-statistic: 251.7 on 5 and 327 DF, p-value: < 2.2e-16
Packages like emmeans make this far easier to check:
library(emmeans)
What do you notice about this output? What value is flipper length set to?
emmeans(m5, pairwise ~ species | flipper_c)
$emmeans
flipper_c = 3.5e-16:
species emmean SE df lower.CL upper.CL
Adelie 4061 59.4 327 3944 4178
Chinstrap 3911 55.2 327 3802 4020
Gentoo 4211 90.3 327 4034 4389
Confidence level used: 0.95
$contrasts
flipper_c = 3.5e-16:
contrast estimate SE df t.ratio p.value
Adelie - Chinstrap 150 81.1 327 1.854 0.1539
Adelie - Gentoo -150 108.0 327 -1.387 0.3489
Chinstrap - Gentoo -300 106.0 327 -2.836 0.0134
P value adjustment: tukey method for comparing a family of 3 estimates
Pick a point, or, regions of significance
Examine the species pairwise comparisons at different levels of flipper length. Since I have z-scored the variable, I can easily set the range to ±2 standard deviations, as well as the mean:
# call just the contrasts, for space
emmeans(m5, pairwise ~ species | flipper_c,
        at = list(flipper_c = c(-2, -1, 0, 1, 2)))$contrasts
What did you notice about the pairwise comparisons? What is going on here?
We can sort of see this with a plot using emmip from emmeans:
emmip(m5, ~ species | flipper_c, var = 'flipper_c',
      at = list(flipper_c = c(-2, -1, 0, 1, 2)))
Let’s build a more robust plot. First save the emmip to an object, ask for CIs, and ask NOT to plot it yet.
m5_plot <- emmip(m5, ~ species | flipper_c, var = 'flipper_c',
                 at = list(flipper_c = c(-2, -1, 0, 1, 2)),
                 plotit = FALSE, CIs = TRUE)
Here is a plot that shows the species comparisons at different values of flipper length, with the confidence intervals included. It is not until we hit the mean value of flipper length that the pairwise comparisons between species start to become significantly different (marked with dashed red lines).
ggplot(m5_plot, aes(x = flipper_c, y = yvar, fill = species, lty = species)) +
  geom_ribbon(aes(ymin = LCL, ymax = UCL), alpha = .25) +
  geom_line(alpha = .5) +
  labs(y = "Body Mass", x = "Flipper Length\n(z-scored)",
       title = "Regions of Significance",
       caption = 'pairwise differences occur at dashed red line') +
  theme(legend.title = element_blank()) +
  geom_vline(xintercept = 0, colour = 'red', lty = 'dashed') +
  geom_vline(xintercept = 1, colour = 'red', lty = 'dashed') +
  geom_vline(xintercept = 2, colour = 'red', lty = 'dashed') +
  geom_vline(xintercept = -2, colour = 'lightcoral') +
  geom_vline(xintercept = -1, colour = 'lightcoral')
This illustrates an equally important issue with interactions involving continuous predictors: significance of pairwise comparisons depends on where you examine them along the continuous variable. In other words, our main effects interpretations are completely influenced by the interaction! We could be cheeky and only report where flipper length is significant as a single test, for example. Or, conversely, if we did NOT center / z-score our variable, we might come to a different and incorrect conclusion about differences between species. Our default output here for example shows no difference between Chinstrap and Adelie Penguins, but if we pushed the values to higher flipper length, we get significant differences.
Compare this to the same model fit on the raw, uncentered scale: the interaction terms are identical, but now speciesChinstrap looks completely non-significant because the species differences are being evaluated at a flipper length of 0mm.
summary(lm(body_mass_g ~ species*flipper_length_mm, data = penguins_lm))
Call:
lm(formula = body_mass_g ~ species * flipper_length_mm, data = penguins_lm)
Residuals:
Min 1Q Median 3Q Max
-900.90 -247.01 -25.53 197.19 1143.33
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2508.088 892.413 -2.810 0.00525 **
speciesChinstrap -529.108 1525.110 -0.347 0.72887
speciesGentoo -4166.117 1431.581 -2.910 0.00386 **
flipper_length_mm 32.689 4.692 6.967 1.78e-11 ***
speciesChinstrap:flipper_length_mm 1.884 7.864 0.240 0.81077
speciesGentoo:flipper_length_mm 21.477 6.967 3.083 0.00223 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 368.4 on 327 degrees of freedom
Multiple R-squared: 0.7938, Adjusted R-squared: 0.7906
F-statistic: 251.7 on 5 and 327 DF, p-value: < 2.2e-16
A researcher could, in principle, selectively report only the values of flipper length where species differences are significant — or only where they are not — and tell very different stories from the same model. This is another reason why plotting the full interaction and being transparent about your chosen comparison points matters.
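One transparent alternative to cherry-picking a single point is to scan the contrast over a fine grid and report the whole curve. A sketch (assuming `m5` and the emmeans package loaded above; the grid values are illustrative):

```r
# p-values for one species contrast across a grid of flipper lengths
grid <- seq(-2, 2, by = 0.5)
ctr <- summary(emmeans(m5, pairwise ~ species | flipper_c,
                       at = list(flipper_c = grid))$contrasts)
subset(ctr, contrast == "Adelie - Chinstrap",
       select = c(flipper_c, estimate, p.value))
```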
So, what does it mean for a predictor to be significant in the context of a multiple regression?
The other variables in a model, even when not part of an interaction, influence the strength and significance of each predictor
When in an interaction, pairwise tests for categorical variables are done at a specific level of a continuous interacting variable, leading to varying “regions of significance” - this can be exploited or this can blindside you.
p-values do not exist in a vacuum — they are a function of what else is in your model, how your variables are scaled, and where you choose to look.