Section 7.6+, sum coding

Author

Stephen

Published

October 30, 2025

Note

Check out a similar explanation using the palmerpenguins data set here.

Sum Coding for Categorical Variables

  • what is it?
  • how do we interpret the regression output?
  • why would we want to do it?

Introduction

We learned about dummy coding / treatment coding before, wherein the levels of a categorical predictor are represented as a series of 0s and 1s. The baseline level is coded 0 and each other level is coded 1 in its own column, which allows us to use the familiar “one unit increase” when interpreting the output of a regression model.

In this notebook we talk about sum coding, which is a different contrast coding scheme we can apply to the levels of a categorical variable. Compared with dummy coding, sum coding is roughly analogous to centering a continuous variable: it sets the intercept to a value “between” the levels of the categorical variable.
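As a quick sketch of the difference, the two schemes can be printed side by side for a two-level factor. This uses only base R and no data; the object names `treatment` and `sum_coded` are made up for illustration.

```r
# Contrast matrices for a factor with two levels (rows = levels).
treatment <- contr.treatment(2)  # dummy/treatment coding: codes 0 and 1
sum_coded <- contr.sum(2)        # sum coding: codes 1 and -1

treatment
sum_coded

# The sum-coded column adds up to zero, so a code of 0 sits midway
# between the two levels. That is where the intercept ends up.
colSums(sum_coded)
```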

Load in data

I’m going to use the same data as used in the prior notebook, the smell/taste data used by BW.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
dat <- read_csv('winter_2016_senses_valence.csv')
Rows: 405 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Word, DominantModality
dbl (4): Val, AbsVal, Sent, AbsSent

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Filter out just the rows where the rating was for Taste or Smell words

dat2 <- dat %>%
  filter(DominantModality %in% c("Taste", "Smell"))

Calculate the mean and standard deviation of Val (the dependent variable)

descriptives <- dat2 %>%
  group_by(DominantModality) %>%
  summarise(meanValu = mean(Val), sdVal = sd(Val))

descriptives
# A tibble: 2 × 3
  DominantModality meanValu sdVal
  <chr>               <dbl> <dbl>
1 Smell                5.47 0.336
2 Taste                5.81 0.303

Compare contrasts

Let’s first turn the modality into a factor

dat2$DominantModality <- factor(dat2$DominantModality)
dat2$DominantModality <- droplevels(dat2$DominantModality)
summary(dat2$DominantModality)
Smell Taste 
   25    47 

A reminder of the default coding scheme (dummy/treatment)

contrasts(dat2$DominantModality)
      Taste
Smell     0
Taste     1

We recall that the levels are replaced with 0s and 1s. It’s relatively easy to apply a new coding scheme to a factor in R. We use the contr.sum() function, and provide the number of levels that are in the factor/categorical variable (in this case, two).

Here is the syntax on how to do this:

contrasts(dat2$DominantModality) <- contr.sum(2)

Check the new coding scheme: the values are now 1 and -1.

Note
  • What value will be used to set the intercept?
  • Where is that value in relation to 1 and -1?
contrasts(dat2$DominantModality)
      [,1]
Smell    1
Taste   -1
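As a sketch of an alternative to modifying the factor in place, `lm()` also accepts a `contrasts` argument that names the coding scheme per variable. The toy data frame and the names `toy`, `fit_assigned` and `fit_argument` are hypothetical and are not part of the smell/taste data.

```r
# Hypothetical toy data: two groups of unequal size.
toy <- data.frame(
  group = factor(c("a", "a", "b", "b", "b")),
  y     = c(1, 3, 6, 7, 8)
)

# Option 1: assign the contrasts to the factor itself.
toy_sum <- toy
contrasts(toy_sum$group) <- contr.sum(2)
fit_assigned <- lm(y ~ group, data = toy_sum)

# Option 2: leave the factor alone and tell lm() which scheme to use.
fit_argument <- lm(y ~ group, data = toy,
                   contrasts = list(group = "contr.sum"))

# Both routes should give the same coefficients.
coef(fit_assigned)
coef(fit_argument)
```

The second form leaves the data frame untouched, which can be convenient when the same factor is used with different coding schemes in different models.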

Fit linear model

m1 <- lm(Val ~ DominantModality, data = dat2)

How do we interpret this output? We now have the coefficient DominantModality1, which tells us the predictor has been sum coded. The intercept also shows us a value of 5.63957 - what does that represent?

summary(m1)

Call:
lm(formula = Val ~ DominantModality, data = dat2)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.99315 -0.20870  0.04343  0.19115  0.62788 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        5.63957    0.03897 144.729  < 2e-16 ***
DominantModality1 -0.16856    0.03897  -4.326 4.95e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3148 on 70 degrees of freedom
Multiple R-squared:  0.2109,    Adjusted R-squared:  0.1997 
F-statistic: 18.71 on 1 and 70 DF,  p-value: 4.951e-05

A reminder of the means of each level - neither is set to the intercept:

descriptives
# A tibble: 2 × 3
  DominantModality meanValu sdVal
  <chr>               <dbl> <dbl>
1 Smell                5.47 0.336
2 Taste                5.81 0.303

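But what if we take the average of those means? The mean of means! Compare it with the overall mean of Val, which differs because the two groups are unequal in size (25 vs. 47).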

mean(descriptives$meanValu)
[1] 5.639568
mean(dat2$Val)
[1] 5.691071
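The same pattern can be checked on a small made-up example. The data frame `toy` and its values are hypothetical; the point is only that, with unequal group sizes, the sum-coded intercept follows the unweighted mean of the group means rather than the overall mean.

```r
# Hypothetical toy data: group "a" has 2 rows, group "b" has 4.
toy <- data.frame(
  group = factor(c("a", "a", "b", "b", "b", "b")),
  y     = c(1, 3, 6, 7, 8, 7)
)
contrasts(toy$group) <- contr.sum(2)
fit <- lm(y ~ group, data = toy)

group_means <- tapply(toy$y, toy$group, mean)

mean(group_means)            # unweighted mean of the two group means
mean(toy$y)                  # overall mean, pulled toward the larger group
coef(fit)[["(Intercept)"]]   # should match the mean of means
```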

What then does the coefficient represent?

The coefficient is half the difference between the two group means (Smell minus Taste). Confusing, right?

This is because moving from one group to the other (-1 to +1) represents a 2-unit change in the predictor variable. Since regression coefficients always represent the change in Y per one-unit change in \(X\), the coefficient shows half the total group difference.

m1[["coefficients"]][["DominantModality1"]]
[1] -0.1685562
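One way to see where the “half” comes from is to write out the two group equations and solve them. The symbols \(b_0\) (the intercept), \(b_1\) (the DominantModality1 estimate) and \(\mu\) (a group mean) are notation introduced here for illustration, with Smell coded 1 and Taste coded -1:

\[ \mu_{smell} = b_0 + b_1, \qquad \mu_{taste} = b_0 - b_1 \]

Adding and subtracting the two equations gives

\[ b_0 = \frac{\mu_{smell} + \mu_{taste}}{2}, \qquad b_1 = \frac{\mu_{smell} - \mu_{taste}}{2} \]

With the group means to six decimals (5.471014 and 5.808126), \( b_1 = (5.471014 - 5.808126)/2 = -0.168556 \), which matches the estimate above.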

Using the regression formula to find the actual values

To understand how this works with the contrast codes we chose, we multiply the estimate by the numerical code for each level (smell = 1, taste = -1) and add the result to the intercept.

\[ \text{Val}_{smell} = intercept + (estimate)*(1) \]

\[ \text{Val}_{taste} = intercept + (estimate)*(-1) \]

These result in values which reflect the means of each group.

\[ \text{Val}_{smell} = 5.63957 + (-0.1685562)*(1) = 5.471014 \]

 5.63957 + (-0.1685562)*(1)
[1] 5.471014

We see the mean value is the same

descriptives[1,]
# A tibble: 1 × 3
  DominantModality meanValu sdVal
  <chr>               <dbl> <dbl>
1 Smell                5.47 0.336

\[ \text{Val}_{taste} = 5.63957 + (-0.1685562)*(-1) = 5.808126 \]

 5.63957 + (-0.1685562)*(-1)
[1] 5.808126
descriptives[2,]
# A tibble: 1 × 3
  DominantModality meanValu sdVal
  <chr>               <dbl> <dbl>
1 Taste                5.81 0.303
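The hand calculation above can also be left to `predict()`, which applies the contrast codes stored in the fitted model. This is a minimal sketch on hypothetical toy data; `toy`, `new_levels` and `predicted` are names made up for the example.

```r
# Hypothetical toy data with a sum-coded two-level factor.
toy <- data.frame(
  group = factor(c("a", "a", "b", "b", "b")),
  y     = c(1, 3, 6, 7, 8)
)
contrasts(toy$group) <- contr.sum(2)
fit <- lm(y ~ group, data = toy)

# One row per level; predict() looks up the codes (1 and -1) for us.
new_levels <- data.frame(
  group = factor(levels(toy$group), levels = levels(toy$group))
)
predicted <- predict(fit, newdata = new_levels)

predicted                        # one predicted value per level
tapply(toy$y, toy$group, mean)   # observed group means, for comparison
```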

Sum coding with more than two levels

Let’s look at how this works when the categorical variable has more than two levels.

Apply contrast coding to the DominantModality variable:

dat$DominantModality <- factor(dat$DominantModality)

Remind ourselves how the treatment coding works

contrasts(dat$DominantModality)
      Smell Sound Taste Touch
Sight     0     0     0     0
Smell     1     0     0     0
Sound     0     1     0     0
Taste     0     0     1     0
Touch     0     0     0     1
contrasts(dat$DominantModality) <- contr.sum(5)

What the F is going on here?

contrasts(dat$DominantModality)
      [,1] [,2] [,3] [,4]
Sight    1    0    0    0
Smell    0    1    0    0
Sound    0    0    1    0
Taste    0    0    0    1
Touch   -1   -1   -1   -1
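A short sketch for reading this matrix: each of the first four levels gets its own column coded 1, the last level is coded -1 in every column, and so each column sums to zero. The example also labels the columns, which is purely cosmetic (it changes only the coefficient names in the model output). The objects `senses` and `codes` are hypothetical stand-ins, not the notebook's data.

```r
# A stand-alone factor with the same five level names as the data.
senses <- factor(c("Sight", "Smell", "Sound", "Taste", "Touch"))

codes <- contr.sum(5)
colSums(codes)   # every column sums to zero

# Label each column with the level it codes as 1, so a model would
# report coefficient names ending in the level name rather than 1 to 4.
colnames(codes) <- levels(senses)[1:4]
contrasts(senses) <- codes
contrasts(senses)
```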
m2 <- lm(Val ~ DominantModality, data = dat)

summary(m2)

Call:
lm(formula = Val ~ DominantModality, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.99315 -0.16482 -0.02158  0.15920  1.15734 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        5.55969    0.01647 337.531  < 2e-16 ***
DominantModality1  0.01998    0.02203   0.907   0.3651    
DominantModality2 -0.08867    0.04436  -1.999   0.0463 *  
DominantModality3 -0.15449    0.03007  -5.137 4.36e-07 ***
DominantModality4  0.24844    0.03426   7.252 2.15e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2659 on 400 degrees of freedom
Multiple R-squared:  0.1455,    Adjusted R-squared:  0.137 
F-statistic: 17.03 on 4 and 400 DF,  p-value: 6.616e-13

To calculate the intercept we can again look at the mean of means

descriptives_full <- dat %>%
  group_by(DominantModality) %>%
  summarise(meanVal = mean(Val))
descriptives_full
# A tibble: 5 × 2
  DominantModality meanVal
  <fct>              <dbl>
1 Sight               5.58
2 Smell               5.47
3 Sound               5.41
4 Taste               5.81
5 Touch               5.53

And we see it is the same value as the intercept

mean(descriptives_full$meanVal)
[1] 5.559685

Each coefficient can be interpreted as that level’s deviation from the mean of the level means (the intercept), and we can use the exact same regression formula logic to get the predicted mean for each level:

\[ \text{Val}_{sight} = 5.55969 + (0.01998 )*(1) = 5.57967 \]

5.55969 + (0.01998 )*(1)
[1] 5.57967
descriptives_full[1, ]
# A tibble: 1 × 2
  DominantModality meanVal
  <fct>              <dbl>
1 Sight               5.58

How do we get the level represented with all -1s?

We subtract the sum of the four DominantModality coefficients from the intercept, because that level is coded -1 in every column.

m2[["coefficients"]]
      (Intercept) DominantModality1 DominantModality2 DominantModality3 
       5.55968524        0.01997783       -0.08867365       -0.15449254 
DominantModality4 
       0.24843866 

Showing you how to slice the coefficient vector so that the intercept is left out:

m2[["coefficients"]][2:5]
DominantModality1 DominantModality2 DominantModality3 DominantModality4 
       0.01997783       -0.08867365       -0.15449254        0.24843866 
sum(m2[["coefficients"]][2:5])
[1] 0.02525029
m2[["coefficients"]][["(Intercept)"]] - sum(m2[["coefficients"]][2:5]) 
[1] 5.534435

voila:

descriptives_full[5, ]
# A tibble: 1 × 2
  DominantModality meanVal
  <fct>              <dbl>
1 Touch               5.53
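The same bookkeeping can be done for all levels at once by multiplying the contrast matrix by the slopes and adding the intercept. This is a sketch on hypothetical three-level toy data; `toy`, `b` and `level_means` are names made up for the example.

```r
# Hypothetical toy data: three groups with 2, 3 and 4 rows.
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), times = c(2, 3, 4))),
  y     = c(1, 3, 5, 6, 7, 10, 11, 12, 11)
)
contrasts(toy$group) <- contr.sum(3)
fit <- lm(y ~ group, data = toy)

b <- coef(fit)

# intercept + (contrast codes %*% slopes): one predicted mean per level
level_means <- b[1] + contrasts(toy$group) %*% b[-1]
level_means[, 1]

# The last level on its own: intercept minus the sum of the slopes
b[1] - sum(b[-1])
```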
Discussion
  • What does it mean when a treatment-coded coefficient is significant?
  • What does it mean when a sum-coded coefficient is significant?
  • Are these the same or are they different?
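  • treatment coded: that level differs from the baseline (reference) level
  • sum coded: that level differs from the mean of the level means (the intercept)
  • they are different questions, so the same level can be significant under one scheme and not under the other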
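To make the contrast between the two schemes concrete, here is a minimal sketch that fits the same hypothetical toy data both ways. The names `toy`, `fit_treatment` and `fit_sum` are made up for illustration.

```r
# Hypothetical toy data: three groups with 2, 3 and 4 rows.
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), times = c(2, 3, 4))),
  y     = c(1, 3, 5, 6, 7, 10, 11, 12, 11)
)

fit_treatment <- lm(y ~ group, data = toy,
                    contrasts = list(group = "contr.treatment"))
fit_sum       <- lm(y ~ group, data = toy,
                    contrasts = list(group = "contr.sum"))

coef(fit_treatment)  # intercept = mean of "a"; slopes = differences from "a"
coef(fit_sum)        # intercept = mean of level means; slopes = deviations from it

# The fitted values are the same: only the parameterisation changes.
all.equal(fitted(fit_treatment), fitted(fit_sum))
```

The model is the same in both cases; what changes is the question each coefficient (and its significance test) answers.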
