Ch. 13 Poisson regression - continued

Author

Stephen

Published

April 22, 2026

Continuing Poisson Regression

  • In the previous notebook we looked at the basics of the Poisson distribution and how to interpret a Poisson regression.

  • We know that a Poisson distribution/regression is used for count data (e.g., word frequency, number of errors).

  • The major parameter is lambda - the rate of counts (higher = higher counts).

  • One important assumption of Poisson regression is that the variance of the counts should be similar to the mean (lambda) of the counts.

  • In this notebook we look at what to do if the variance is not equal to the mean, which results in overdispersion.
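A quick simulation (not part of the stranded-preposition data) illustrates the equidispersion assumption and what its violation looks like:

```r
# For a Poisson, variance ~= mean (lambda)
set.seed(42)
x <- rpois(1e5, lambda = 4)
mean(x)  # close to 4
var(x)   # also close to 4

# Overdispersed counts (negative binomial): variance > mean
y <- rnbinom(1e5, mu = 4, size = 2)  # Var = mu + mu^2/size = 4 + 8 = 12
mean(y)  # still close to 4
var(y)   # close to 12, well above the mean
```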

What is overdispersion?

First load in our stranded preposition data:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Convert the number of produced stranded prepositions into tallies

dat <- read_csv('sp-priming.csv') %>%
  group_by(subject, modality, trial_type, prod_pre, rec_pre, wmc) %>%
  summarise(produced = sum(score))
Rows: 4608 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): verb, modality, test, session, trial_type
dbl (7): subject, score, trial_order, cloze, wmc, prod_pre, rec_pre

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
`summarise()` has grouped output by 'subject', 'modality', 'trial_type', 'prod_pre', 'rec_pre'. You can override using the `.groups` argument.
glimpse(dat)
Rows: 128
Columns: 7
Groups: subject, modality, trial_type, prod_pre, rec_pre [128]
$ subject    <dbl> 1, 1, 2, 2, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11,…
$ modality   <chr> "FTF", "FTF", "FTF", "FTF", "FTF", "FTF", "FTF", "FTF", "FT…
$ trial_type <chr> "non-prime", "prime", "non-prime", "prime", "non-prime", "p…
$ prod_pre   <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 3, 3, 2, 2,…
$ rec_pre    <dbl> 6, 6, 5, 5, 5, 5, 6, 6, 14, 14, 7, 7, 4, 4, 0, 0, 6, 6, 6, …
$ wmc        <dbl> 74, 74, 71, 71, 87, 87, 46, 46, 68, 68, 55, 55, 92, 92, 81,…
$ produced   <dbl> 0, 2, 4, 19, 2, 2, 1, 11, 5, 18, 2, 10, 5, 5, 0, 0, 3, 6, 0…
Warning

Note that technically a binomial/logistic regression is better for these data, as there was a limited/bounded range of counts that could be produced. Each participant had a restricted set of trials and answers, and thus could produce 0-max(trials) stranded prepositions. This is different from true count data, where the count is not bounded by context - see here

However, the point here is pedagogical – this is the closest “real” data I have to count data.

Descriptives showing overdispersion

The descriptives show that the variance is much higher than lambda (the mean):

descriptives <- dat %>%
  group_by(modality, trial_type) %>%
  summarise(lambda = mean(produced) |> round(2), 
            var = var(produced) |> round(2))
`summarise()` has grouped output by 'modality'. You can override using the
`.groups` argument.
knitr::kable(descriptives)
modality   trial_type   lambda      var
FTF        non-prime      1.77     2.83
FTF        prime         10.23    59.77
SCMC       non-prime      2.31     4.22
SCMC       prime         18.10   110.52
ggplot(dat, aes(x = produced, fill = modality)) + 
  facet_wrap(modality ~ trial_type) + 
  geom_histogram(position = 'dodge')
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
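Beyond eyeballing means and variances, a common heuristic for spotting overdispersion in a fitted Poisson model is the Pearson dispersion statistic - the sum of squared Pearson residuals over the residual degrees of freedom - which should be near 1 when the Poisson assumption holds. Here is a self-contained sketch on simulated overdispersed counts (hypothetical data, not the priming set):

```r
set.seed(1)
# Simulate overdispersed counts and (wrongly) fit a Poisson model to them
sim <- data.frame(g = rep(c('a', 'b'), each = 200))
sim$y <- rnbinom(400, mu = ifelse(sim$g == 'a', 2, 10), size = 2)
m <- glm(y ~ g, data = sim, family = 'poisson')

# Pearson dispersion statistic: ~1 if equidispersed, well above 1 here
sum(residuals(m, type = 'pearson')^2) / df.residual(m)
```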

Negative binomial regression

We turn to negative binomial regression, which relaxes the mean == variance assumption for lambda. According to this page, this is done by “re-expressing the count data as coming from a mixture where each count is from its own Poisson distribution with its own λ parameter.” Cool!
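That mixture description can be checked by simulation: drawing each observation's λ from a gamma distribution, then a Poisson count given that λ, reproduces a negative binomial. This is the standard gamma-Poisson mixture (an illustration, not code from the original notebook):

```r
set.seed(7)
n <- 1e5; mu <- 5; theta <- 2   # theta is the NB dispersion parameter

# Each count gets its own lambda, drawn from a gamma with mean mu
lambda_i <- rgamma(n, shape = theta, rate = theta / mu)
mixed <- rpois(n, lambda_i)

# The mixture matches the NB mean and variance
c(mean(mixed), var(mixed))   # mean ~ mu, var ~ mu + mu^2/theta
c(mu, mu + mu^2 / theta)     # theoretical: 5 and 17.5
```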

To fit a negative binomial, the glm.nb function from the MASS package can be used:

library(MASS)

Attaching package: 'MASS'
The following object is masked from 'package:dplyr':

    select

Fit the model just the same; you do not need to specify the family.

nbm <- glm.nb(produced ~ trial_type * modality, data = dat)

Summary output is the same in terms of log lambda coefficients:

summary(nbm)

Call:
glm.nb(formula = produced ~ trial_type * modality, data = dat, 
    init.theta = 1.960499725, link = log)

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)    
(Intercept)                    0.5718     0.1752   3.263   0.0011 ** 
trial_typeprime                1.7534     0.2192   7.997 1.27e-15 ***
modalitySCMC                   0.2656     0.2514   1.056   0.2908    
trial_typeprime:modalitySCMC   0.3053     0.3163   0.965   0.3345    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1.9605) family taken to be 1)

    Null deviance: 305.17  on 127  degrees of freedom
Residual deviance: 146.13  on 124  degrees of freedom
AIC: 704.79

Number of Fisher Scoring iterations: 1

              Theta:  1.960 
          Std. Err.:  0.365 

 2 x log-likelihood:  -694.787 
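The estimated theta ties the summary back to the descriptives: under the negative binomial, Var = μ + μ²/θ. Plugging the model's θ ≈ 1.96 and, say, the FTF prime mean of 10.23 into that formula gives an implied variance in the ballpark of the observed 59.77:

```r
# Negative binomial variance implied by the fitted theta
theta <- 1.9605
mu    <- 10.23        # FTF / prime mean from the descriptives
mu + mu^2 / theta     # ~63.6, close to the observed 59.77
```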

Model posthocs

library(emmeans)
Welcome to emmeans.
Caution: You lose important information if you filter this package's results.
See '? untidy'

Estimated marginal means

emmeans(nbm, ~ trial_type|modality, type = 'response')
modality = FTF:
 trial_type response    SE  df asymp.LCL asymp.UCL
 non-prime      1.77 0.310 Inf      1.26      2.50
 prime         10.23 1.350 Inf      7.90     13.24

modality = SCMC:
 trial_type response    SE  df asymp.LCL asymp.UCL
 non-prime      2.31 0.417 Inf      1.62      3.29
 prime         18.10 2.530 Inf     13.77     23.80

Confidence level used: 0.95 
Intervals are back-transformed from the log scale 

Within-group pairwise comparisons between trial type

pairs(emmeans(nbm, ~ trial_type|modality, type = 'response'), reverse = T)
modality = FTF:
 contrast            ratio   SE  df null z.ratio p.value
 prime / (non-prime)  5.77 1.27 Inf    1   7.997  <.0001

modality = SCMC:
 contrast            ratio   SE  df null z.ratio p.value
 prime / (non-prime)  7.84 1.79 Inf    1   9.027  <.0001

Tests are performed on the log scale 
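The emmeans output can be reproduced by exponentiating sums of the coefficients from the summary above:

```r
# Rates are exp() of the linear predictor for each cell
exp(0.5718)                              # FTF non-prime:  ~1.77
exp(0.5718 + 1.7534)                     # FTF prime:      ~10.23
exp(0.5718 + 0.2656)                     # SCMC non-prime: ~2.31
exp(0.5718 + 1.7534 + 0.2656 + 0.3053)   # SCMC prime:     ~18.1

# The pairwise ratios are exp() of the relevant slopes
exp(1.7534)                              # FTF prime / non-prime:  ~5.77
exp(1.7534 + 0.3053)                     # SCMC prime / non-prime: ~7.84
```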

Comparing to the Poisson model

Compare the output to an equivalent Poisson model:

library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
✔ broom        1.0.10     ✔ rsample      1.3.1 
✔ dials        1.4.2      ✔ tailor       0.1.0 
✔ infer        1.0.9      ✔ tune         2.0.1 
✔ modeldata    1.5.1      ✔ workflows    1.3.0 
✔ parsnip      1.4.0      ✔ workflowsets 1.1.1 
✔ recipes      1.3.1      ✔ yardstick    1.3.2 
Warning: package 'parsnip' was built under R version 4.5.2
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ MASS::select()    masks dplyr::select()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()

Quickly re-fit our Poisson regression from last time:

pm1 <- glm(produced ~ trial_type * modality, data = dat, family = 'poisson')

What do you notice about the output? What is the same, and what is different?

knitr::kable(tidy(nbm), digits = 2, caption = 'negative binomial')
negative binomial

term                          estimate  std.error  statistic  p.value
(Intercept)                       0.57       0.18       3.26     0.00
trial_typeprime                   1.75       0.22       8.00     0.00
modalitySCMC                      0.27       0.25       1.06     0.29
trial_typeprime:modalitySCMC      0.31       0.32       0.97     0.33
knitr::kable(tidy(pm1), digits = 2, caption = 'poisson')
poisson

term                          estimate  std.error  statistic  p.value
(Intercept)                       0.57       0.13       4.50     0.00
trial_typeprime                   1.75       0.14      12.75     0.00
modalitySCMC                      0.27       0.18       1.51     0.13
trial_typeprime:modalitySCMC      0.31       0.19       1.61     0.11

Compare the estimates with a plot

First create plot data and join

p_plot <- as.data.frame(emmeans(pm1, ~ trial_type|modality, type= 'response')) %>%
  mutate(model = 'poisson')

nb_plot <- as.data.frame(emmeans(nbm, ~ trial_type | modality, type = 'response')) %>%
  mutate(model = 'negative binomial') %>%
  rename(rate = response)

comb_plot <- p_plot %>%
  rbind(nb_plot)

Make the plot.

What do you see?

  • The estimates are the same
  • The standard errors/confidence intervals are not
  • What is the practical explanation of the effects of the negative binomial on the output/interpretation?
ggplot(comb_plot, aes(y = rate, x = trial_type, color = modality, lty = modality, group = modality)) + 
  facet_wrap(. ~ model) + 
  geom_point() + 
  geom_line() + 
  geom_errorbar(aes(ymin = asymp.LCL, ymax = asymp.UCL), width = .1) +
  theme_bw() +
  labs(x = '') + 
  theme(legend.title = element_blank(), legend.position = 'top')

Comparing the models

Bodo Winter introduces the odTest function from the pscl package:

library(pscl)
Classes and Methods for R originally developed in the
Political Science Computational Laboratory
Department of Political Science
Stanford University (2002-2015),
by and under the direction of Simon Jackman.
hurdle and zeroinfl functions by Achim Zeileis.
odTest(nbm)
Likelihood ratio test of H0: Poisson, as restricted NB model:
n.b., the distribution of the test-statistic under H0 is non-standard
e.g., see help(odTest) for details/references

Critical value of test statistic at the alpha= 0.05 level: 2.7055 
Chi-Square Test Statistic =  237.2202 p-value = < 2.2e-16 

Looking at the odTest function, we can see that it fits a separate Poisson model, calculates the logLik of each model, doubles their difference, and then tests the result against a chi-square distribution (halving the p-value, since theta is on the boundary of its parameter space under H0):

odTest
function (glmobj, alpha = 0.05, digits = max(3, getOption("digits") - 
    3)) 
{
    if (!inherits(glmobj, "negbin")) 
        stop("this function only works for objects of class negbin\n")
    if (alpha > 1 | alpha < 0) 
        stop("invalid value for alpha\n")
    poissonGLM <- glm(formula = eval(glmobj$call$formula), data = eval(glmobj$call$data), 
        family = "poisson")
    llhPoisson <- logLik(poissonGLM)
    llhNB <- logLik(glmobj)
    d <- 2 * (llhNB - llhPoisson)
    critval <- qchisq(1 - (2 * alpha), df = 1)
    pval <- pchisq(d, df = 1, lower.tail = FALSE)/2
    cat("Likelihood ratio test of H0: Poisson, as restricted NB model:\n")
    cat("n.b., the distribution of the test-statistic under H0 is non-standard\n")
    cat("e.g., see help(odTest) for details/references\n\n")
    cat(paste("Critical value of test statistic at the alpha=", 
        round(alpha, digits), "level:", round(critval, digits), 
        "\n"))
    cat(paste("Chi-Square Test Statistic = ", round(d, digits), 
        "p-value =", format.pval(pval, digits = digits), "\n"))
    invisible(NULL)
}
<bytecode: 0x13155adf0>
<environment: namespace:pscl>

The doubled log-likelihood difference is this:

2 * (logLik(nbm) - logLik(pm1))
'log Lik.' 237.2202 (df=5)

The larger the number, the more the models differ. Here we see that the negative binomial model is “better” than the Poisson (which makes sense in light of the overdispersion!)
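The odTest logic can be replicated by hand on simulated data (hypothetical, not the priming set): fit both models, double the log-likelihood difference, and halve the chi-square tail probability because theta sits on the boundary of its parameter space:

```r
library(MASS)
set.seed(3)
# Overdispersed counts, so the NB should win clearly
d <- data.frame(y = rnbinom(300, mu = 6, size = 1.5))

pois <- glm(y ~ 1, data = d, family = 'poisson')
nb   <- glm.nb(y ~ 1, data = d)

lrt  <- as.numeric(2 * (logLik(nb) - logLik(pois)))
pval <- pchisq(lrt, df = 1, lower.tail = FALSE) / 2  # boundary correction
c(lrt, pval)
```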

Exposure variables

In a Poisson/negative binomial regression, one can enter “exposure” variables into the model. This accounts for uneven sample spaces (e.g., different sizes of countries, different lengths of utterances).

Let’s introduce an artificial exposure variable to our data. The variable wmc is the working memory capacity of each participant. This is clearly not the same as time/duration, but might represent a different sort of sample space each participant could draw from. This plot shows the distribution of log(wmc). I log transformed wmc because the model failed otherwise ;)

ggplot(dat, aes(x = log(wmc), fill = modality)) + 
  geom_histogram(position = 'dodge', alpha = .8)
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

The exposure variable divides lambda by the exposure, creating “tau”.

So we end up with two quantities: lambda = the mean count, and tau = the mean / exposure. The exposure variable is entered into the formula as an offset - a normalisation/control term whose coefficient is fixed at 1.
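Because offset(log(exposure)) fixes the coefficient of log(exposure) at 1, the model becomes log(μ) = Xβ + log(exposure), i.e. μ = exposure × exp(Xβ): the coefficients describe the rate per unit of exposure. A small simulation (hypothetical, not the priming data) shows an offset model recovering a known rate:

```r
set.seed(5)
# Counts generated at a true rate of 0.5 events per unit of exposure
exposure <- runif(500, 10, 100)
y <- rpois(500, lambda = 0.5 * exposure)

m <- glm(y ~ 1 + offset(log(exposure)), family = 'poisson')
exp(coef(m))   # ~0.5, the rate per unit of exposure
```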

Fit the model like this:

pm2 <- glm(produced ~ trial_type * modality  + offset(log(wmc)), data = dat, family = 'poisson')  

Model summary:

summary(pm2)

Call:
glm(formula = produced ~ trial_type * modality + offset(log(wmc)), 
    family = "poisson", data = dat)

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   -3.7559     0.1270 -29.574   <2e-16 ***
trial_typeprime                1.7534     0.1376  12.747   <2e-16 ***
modalitySCMC                   0.3282     0.1762   1.863   0.0625 .  
trial_typeprime:modalitySCMC   0.3053     0.1891   1.615   0.1064    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1270.40  on 127  degrees of freedom
Residual deviance:  548.79  on 124  degrees of freedom
AIC: 943.65

Number of Fisher Scoring iterations: 5
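With the offset in place, the intercept is now on the per-unit-of-wmc scale. For a hypothetical participant with wmc = 75 (a value chosen here purely for illustration), the expected FTF non-prime count works back out to roughly the rate from the un-offset model:

```r
# Intercept from pm2 is log(rate per unit of wmc)
exp(-3.7559)          # ~0.023 stranded prepositions per wmc unit
exp(-3.7559) * 75     # ~1.75 expected count at wmc = 75 (hypothetical value)
```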

We can plot this model against the others. Not a huge change, probably because I am not using a “real” exposure variable :)

p2_plot <- as.data.frame(emmeans(pm2, ~ trial_type|modality, type= 'response')) %>%
  mutate(model = 'exposure')


comb_plot <- comb_plot %>%
  rbind(p2_plot) %>%
  mutate(model = factor(model, levels = c('poisson', 'exposure', 'negative binomial')))

ggplot(comb_plot, aes(y = rate, x = trial_type, color = modality, lty = modality, group = modality)) + 
  facet_wrap(. ~ model) + 
  geom_point() + 
  geom_line() + 
  geom_errorbar(aes(ymin = asymp.LCL, ymax = asymp.UCL), width = .1) +
  theme_bw() +
  labs(x = '') + 
  theme(legend.title = element_blank(), legend.position = 'top')

Conclusion

  • What else do you want to know?
  • Do you have any “real” count data you would like to model?