Ch. 13 Poisson regression

Author

Stephen

Published

April 22, 2026

What is the Poisson distribution?

  • “Poisson distribution is the canonical distribution for count processes” (p. 218)

    • A salient example being word frequency (e.g., how many times does a target word occur in a target corpus?)
  • “The Poisson distribution only has one parameter, \(λ\) ‘lambda’, which specifies the rate of a count process.”

    • higher lambda = higher rate of occurrence (more counts expected)

    • lower lambda = lower rate of occurrence (see the quick dpois() check after this list)

  • counts are bounded below at zero and discrete (non-negative whole numbers)
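For a quick feel of what λ does (a minimal sketch; the λ values here are arbitrary), dpois() gives the probability of a count of exactly zero, so its complement is the probability of seeing at least one occurrence:

# P(count >= 1) = 1 - P(count == 0) for a few illustrative lambdas
1 - dpois(0, lambda = c(0.5, 2.5, 10))
[1] 0.3934693 0.9179150 0.9999546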

What does the distribution look like?

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# simulate data
set.seed(1337)

lambda_plot <- tibble(
  high = rpois(10000, 30),
  mid = rpois(10000, 10),
  low = rpois(10000, 2.5)
) %>%
  pivot_longer(cols = c(high, mid, low), values_to = 'lambda', names_to = 'type') %>%
  mutate(type = factor(type, levels = c('low', 'mid', 'high')))

# plot them
ggplot(lambda_plot, aes(x = lambda, fill = type)) + 
  geom_histogram(alpha = .75, bins = 100) +
  labs(caption = 'high λ = 30; mid λ = 10; low λ = 2.5', title = 'different λ distributions', x = '') +
  brms::theme_default() + 
  theme(legend.position = 'bottom', legend.title = element_blank())

Poisson Regression

  • Need to keep the rate \(\lambda\) positive (counts can't be negative), which is accomplished through the exponential function

  • (recall that logistic regression did the same job with the logistic function, keeping probabilities between 0 and 1)

We can write the model two equivalent ways. Wrapping the right-hand side of the linear regression formula in the exponential function keeps the rate positive:

\[ \lambda_i = \exp(\beta_0 + \beta_1 \cdot x_i) \]

Equivalently, we can wrap the left-hand side in the logarithmic transformation (the inverse of exponentiation). This puts the outcome on the log-lambda scale (yay?), which is the scale the coefficients are estimated on - the log of the rate:

\[ \log(\lambda_i) = \beta_0 + \beta_1 \cdot x_i \]
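To make the link concrete, here is a tiny sketch with made-up numbers (b0, b1, and x are arbitrary, purely for illustration): the linear predictor lives on the log scale, and exponentiating it recovers λ.

# arbitrary illustrative coefficients and predictor value
b0 <- 0.5
b1 <- 0.3
x  <- 2

log_lambda <- b0 + b1 * x  # the linear predictor, on the log scale
exp(log_lambda)            # back-transform to the rate (always positive)
[1] 3.004166
log(exp(log_lambda))       # taking the log undoes the exponentiation
[1] 1.1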

Running a Poisson regression

First we need some count data, or data that can be coerced into counts. Let’s convert the primed production data into counts by tallying how many times in total each person produced a stranded preposition.

create tallies

Load in the data and convert to tallies

dat <- read_csv('sp-priming.csv') %>%
  group_by(subject, modality, trial_type, prod_pre, rec_pre) %>%
  summarise(produced = sum(score))
Rows: 4608 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): verb, modality, test, session, trial_type
dbl (7): subject, score, trial_order, cloze, wmc, prod_pre, rec_pre

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
`summarise()` has grouped output by 'subject', 'modality', 'trial_type', 'prod_pre'. You can override using the `.groups` argument.
glimpse(dat)
Rows: 128
Columns: 6
Groups: subject, modality, trial_type, prod_pre [128]
$ subject    <dbl> 1, 1, 2, 2, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11,…
$ modality   <chr> "FTF", "FTF", "FTF", "FTF", "FTF", "FTF", "FTF", "FTF", "FT…
$ trial_type <chr> "non-prime", "prime", "non-prime", "prime", "non-prime", "p…
$ prod_pre   <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 3, 3, 2, 2,…
$ rec_pre    <dbl> 6, 6, 5, 5, 5, 5, 6, 6, 14, 14, 7, 7, 4, 4, 0, 0, 6, 6, 6, …
$ produced   <dbl> 0, 2, 4, 19, 2, 2, 1, 11, 5, 18, 2, 10, 5, 5, 0, 0, 3, 6, 0…

visualise the raw data as counts

Here is our count distribution in both conditions by modality. Which one will have the highest lambda?

ggplot(dat, aes(x = produced, fill = trial_type)) + 
  geom_histogram(position = 'dodge', bins = 50) + 
  facet_wrap(. ~ modality) + 
  brms::theme_default() + 
  scale_fill_manual(values = c("#66C2A5", 
                               "#E78AC3"
                               # other colours to choose from :) 
                               #"#FC8D62",
                               #"#A6D854"
               )) +
  theme(legend.title = element_blank(), legend.position = c(.8, .8))

descriptive statistics

The lambda is the mean, but the Poisson distribution assumes the variance is approximately equal to the mean. Here the variance is clearly much larger than the mean, which means the data are overdispersed (BW talks about this too).

Let’s ignore this for now and carry on with a Poisson regression.

lambda_descriptives <- dat %>%
  group_by(modality, trial_type) %>%
  summarise(lambda = mean(produced), var = var(produced))
`summarise()` has grouped output by 'modality'. You can override using the
`.groups` argument.
lambda_descriptives
# A tibble: 4 × 4
# Groups:   modality [2]
  modality trial_type lambda    var
  <chr>    <chr>       <dbl>  <dbl>
1 FTF      non-prime    1.77   2.83
2 FTF      prime       10.2   59.8 
3 SCMC     non-prime    2.31   4.22
4 SCMC     prime       18.1  111.  

fit the Poisson model:

The same syntax as for a logistic regression, but here we fit a different family.

Learning moment! glm and glmer do not mean logistic regression - they can fit many different families.

m1 <- glm(produced ~ trial_type * modality, data = dat, family = 'poisson')
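By the way, family = 'poisson' is shorthand for the Poisson family with its default log link, so this is an equivalent way to write the same model (m1_explicit is just a name used for this sketch):

# identical model, with the default log link written out explicitly
m1_explicit <- glm(produced ~ trial_type * modality, data = dat,
                   family = poisson(link = 'log'))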

The coefficients are on the log scale: they are log(λ) values and differences in log(λ).

  • Intercept: log(λ) for FTF, non-prime

  • trial_typeprime: difference in log(λ) between FTF prime and FTF non-prime

  • modalitySCMC: difference in log(λ) between SCMC non-prime and FTF non-prime

  • trial_typeprime:modalitySCMC: difference in the trial_type effect (prime vs non-prime) between SCMC and FTF (i.e., how much the log(λ) priming difference changes when moving from FTF to SCMC)

summary(m1)

Call:
glm(formula = produced ~ trial_type * modality, family = "poisson", 
    data = dat)

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)    
(Intercept)                    0.5718     0.1270   4.502 6.72e-06 ***
trial_typeprime                1.7534     0.1376  12.747  < 2e-16 ***
modalitySCMC                   0.2656     0.1762   1.507    0.132    
trial_typeprime:modalitySCMC   0.3053     0.1891   1.615    0.106    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1249.07  on 127  degrees of freedom
Residual deviance:  545.14  on 124  degrees of freedom
AIC: 940.01

Number of Fisher Scoring iterations: 5

ugh, log - and, uh, multiplicative?!

Remember we can translate the log back to lambda using exp(). Here is the lambda for the intercept, which is the FTF non-prime mean:

exp(0.5718)
[1] 1.771453

Which matches the mean from the raw data…

# repeat the descriptives
dat %>% group_by(modality, trial_type) %>% summarise(lambda = mean(produced))
`summarise()` has grouped output by 'modality'. You can override using the
`.groups` argument.
# A tibble: 4 × 3
# Groups:   modality [2]
  modality trial_type lambda
  <chr>    <chr>       <dbl>
1 FTF      non-prime    1.77
2 FTF      prime       10.2 
3 SCMC     non-prime    2.31
4 SCMC     prime       18.1 

But all the other coefficients will be multiplicative changes:

exp(coef(m1))
                 (Intercept)              trial_typeprime 
                    1.771429                     5.774194 
                modalitySCMC trial_typeprime:modalitySCMC 
                    1.304227                     1.357042 

To get the FTF prime rate, we multiply the intercept by the trial_typeprime coefficient.

1.771429 * 5.774194
[1] 10.22857

This should reinforce that the coefficients are on a log scale, which means the exponentiated versions act multiplicatively!
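One small check of that point, using the coefficients already printed above: adding coefficients on the log scale is the same as multiplying their exponentiated versions.

# additive on the log scale...
round(exp(coef(m1)[['(Intercept)']] + coef(m1)[['trial_typeprime']]), 2)
[1] 10.23
# ...is multiplicative on the response scale
round(exp(coef(m1)[['(Intercept)']]) * exp(coef(m1)[['trial_typeprime']]), 2)
[1] 10.23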

emmeans analysis

Who cares about all that (j/k)? We’re just gonna use emmeans anyhow :)

We can obtain the emmeans on the model (log) scale:

library(emmeans)
Welcome to emmeans.
Caution: You lose important information if you filter this package's results.
See '? untidy'
emmeans(m1, ~ trial_type * modality)
 trial_type modality emmean     SE  df asymp.LCL asymp.UCL
 non-prime  FTF       0.572 0.1270 Inf     0.323     0.821
 prime      FTF       2.325 0.0529 Inf     2.222     2.429
 non-prime  SCMC      0.837 0.1220 Inf     0.598     1.077
 prime      SCMC      2.896 0.0436 Inf     2.811     2.982

Results are given on the log (not the response) scale. 
Confidence level used: 0.95 

And just like with logistic regression, we can use type = 'response' to get the response scale, which gives us our rate (i.e., our lambda). Unlike the model summary, these give us the estimated means for each condition (i.e., not the differences between conditions).

emmeans(m1, ~ trial_type * modality, type = 'response')
 trial_type modality  rate    SE  df asymp.LCL asymp.UCL
 non-prime  FTF       1.77 0.225 Inf      1.38      2.27
 prime      FTF      10.23 0.541 Inf      9.22     11.34
 non-prime  SCMC      2.31 0.282 Inf      1.82      2.94
 prime      SCMC     18.10 0.790 Inf     16.62     19.72

Confidence level used: 0.95 
Intervals are back-transformed from the log scale 

Are you convinced yet that emmeans is better than trying to divine model summaries??

pairwise contrasts

Let’s obtain the pairwise contrasts of the trial type effect

On the response scale the contrasts are expressed as ratios. So we can say, for example, that for the FTF group the expected count in the prime condition is 5.77 times the expected count in the non-prime condition.

m1_em <- emmeans(m1,  ~ trial_type | modality, type = 'response') 

pairs(m1_em, reverse = T)
modality = FTF:
 contrast            ratio    SE  df null z.ratio p.value
 prime / (non-prime)  5.77 0.794 Inf    1  12.747  <.0001

modality = SCMC:
 contrast            ratio    SE  df null z.ratio p.value
 prime / (non-prime)  7.84 1.020 Inf    1  15.869  <.0001

Tests are performed on the log scale 
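These ratios are just exponentiated coefficients from summary(m1) (a small check): the FTF ratio is the exponentiated trial_typeprime estimate, and the SCMC ratio adds the interaction term before exponentiating.

# FTF: prime / non-prime
round(exp(coef(m1)[['trial_typeprime']]), 2)
[1] 5.77
# SCMC: prime / non-prime (priming effect plus the interaction adjustment)
round(exp(coef(m1)[['trial_typeprime']] + coef(m1)[['trial_typeprime:modalitySCMC']]), 2)
[1] 7.84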

Then compare the priming effect across the two modalities (the interaction contrast), which replicates the interaction term from the regression output:

contrast(emmeans(m1, ~ modality*trial_type), interaction = 'pairwise')
 modality_pairwise trial_type_pairwise estimate    SE  df z.ratio p.value
 FTF - SCMC        (non-prime) - prime    0.305 0.189 Inf   1.615  0.1064

Results are given on the log (not the response) scale. 

And this is the same effect on the ratio (response) scale (not the log scale):

contrast(emmeans(m1, ~ modality*trial_type, type = 'response'), interaction = 'pairwise')
 modality_pairwise trial_type_pairwise ratio    SE  df null z.ratio p.value
 FTF / SCMC        (non-prime) / prime  1.36 0.257 Inf    1   1.615  0.1064

Tests are performed on the log scale 

So the interaction adjustment is 0.305 on the log(λ) scale, or equivalently a priming ratio that is 1.36 times larger in SCMC than in FTF.
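A quick check that those two numbers are the same quantity on two scales, using the interaction coefficient from the model:

# exponentiating the interaction turns the log-scale adjustment into a ratio of ratios
round(exp(coef(m1)[['trial_typeprime:modalitySCMC']]), 2)
[1] 1.36
# equivalently, the SCMC priming ratio divided by the FTF priming ratio
round(7.84 / 5.77, 2)
[1] 1.36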

visualise the effects

Let’s plot the contrast:

m1_em_plot <- as.data.frame(m1_em)

ggplot(m1_em_plot, aes(y = rate, x = trial_type, color = modality, group = modality)) + 
  geom_point() + 
  geom_errorbar(aes(ymin = asymp.LCL, ymax = asymp.UCL), width = .05) + 
  geom_line() +
  labs(title = "Look familiar?") + 
  theme_bw()

It looks identical to the logistic model ;)

adding a continuous predictor

Two measures of proficiency are available, based on pre-tests of either productive or receptive English knowledge.

summarise(ungroup(dat), 
          mean_rec_pre = mean(rec_pre), 
          sd_rec_pre = sd(rec_pre),
          mean_prod_pre = mean(prod_pre),
          sd_prod_pre = sd(prod_pre))
# A tibble: 1 × 4
  mean_rec_pre sd_rec_pre mean_prod_pre sd_prod_pre
         <dbl>      <dbl>         <dbl>       <dbl>
1         6.53       4.45          1.12        1.69

Fit a model with proficiency as a main effect

m2 <- glm(produced ~ trial_type * modality + rec_pre, data = dat, family = 'poisson')
summary(m2)

Call:
glm(formula = produced ~ trial_type * modality + rec_pre, family = "poisson", 
    data = dat)

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)    
(Intercept)                  0.255759   0.134329   1.904   0.0569 .  
trial_typeprime              1.753399   0.137558  12.747  < 2e-16 ***
modalitySCMC                 0.203912   0.176430   1.156   0.2478    
rec_pre                      0.048827   0.006267   7.791 6.67e-15 ***
trial_typeprime:modalitySCMC 0.305307   0.189083   1.615   0.1064    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1249.07  on 127  degrees of freedom
Residual deviance:  488.81  on 123  degrees of freedom
AIC: 885.68

Number of Fisher Scoring iterations: 5

emmeans is less useful here (rec_pre enters as a continuous main-effect term); emtrends() gives us the same slope as summary():

emtrends(m2, ~ rec_pre, var = 'rec_pre', type = 'response', infer = T)
 rec_pre rec_pre.trend      SE  df asymp.LCL asymp.UCL z.ratio p.value
    6.53        0.0488 0.00627 Inf    0.0365    0.0611   7.791  <.0001

Results are averaged over the levels of: trial_type, modality 
Confidence level used: 0.95 
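The slope is easier to read after exponentiating (a small sketch using the estimates above; the SD of rec_pre, 4.45, comes from the descriptives earlier): each extra point on the receptive pre-test multiplies the expected count by about 1.05, i.e., roughly a 5% increase.

# multiplicative change in the expected count per 1-point increase in rec_pre
round(exp(coef(m2)[['rec_pre']]), 2)
[1] 1.05
# and per 1 SD (about 4.45 points) of rec_pre
round(exp(coef(m2)[['rec_pre']] * 4.45), 2)
[1] 1.24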

Generate an effects plot:

We can plot the effects at different levels of rec_pre. Do you think there might be an interaction?

# create the data frame first
receptive_emm <- as.data.frame(emmeans(m2, ~ trial_type|c(modality, rec_pre), at = list(rec_pre = seq(min(dat$rec_pre), max(dat$rec_pre)))), type = 'response')

# plot the curve
ggplot(receptive_emm, aes(y = rate, x = rec_pre, fill = trial_type)) + 
  facet_wrap(. ~ modality) + 
  geom_ribbon(aes(ymin = asymp.LCL, ymax = asymp.UCL), alpha = .2) + 
  geom_line() +
  theme_bw() + 
  theme(legend.title = element_blank(), legend.position = c(.25, .8))

next time…

  • exposure variables (controlling for unevenness in how much opportunity each observation had)

  • negative binomial regression (a variant that addresses overdispersion - variance well above the mean, which these data have; see the sketch below)
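As a preview of those two points (a minimal sketch, not run in these notes): the residual deviance of m1 is far larger than its residual degrees of freedom, which is one rough symptom of overdispersion, and MASS::glm.nb() fits the same formula as a negative binomial model. The n_opportunities column in the offset example is hypothetical.

# rough dispersion check: should be near 1 for a well-behaved Poisson model
# (here it is roughly 545 / 124, i.e. well above 1)
deviance(m1) / df.residual(m1)

# a negative binomial version of the same model
m1_nb <- MASS::glm.nb(produced ~ trial_type * modality, data = dat)

# an exposure variable would enter as an offset, e.g.
# glm(produced ~ trial_type * modality + offset(log(n_opportunities)), ...)
# where n_opportunities is a hypothetical count of priming opportunities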