Logistic regression example

Author

Stephen Skalicky

Published

April 22, 2026

This notebook uses data from Kim et al. (2019)

Kim, Y., Jung, Y., & Skalicky, S. (2019). Linguistic alignment, learner characteristics, and the production of stranded prepositions in relative clauses. Studies in Second Language Acquisition, 41(5), 937–969.

This research compared the degree of primed production of English stranded prepositions among Korean learners of English. Learners completed an alignment session, where half of their input trials included a stranded preposition, and half did not. After each trial, we assessed whether the participants produced (or did not produce) a stranded preposition. If a participant produced a stranded preposition after an input trial containing a stranded preposition, this was taken as evidence of alignment. We also measured the degree of learning of stranded prepositions from the alignment sessions using a pre/immediate/delayed posttest design. These two questions are further nested within a comparison of modality: half of the participants completed the alignment session in a face-to-face (FTF) context, whereas the other half completed the session in a synchronous computer-mediated context (SCMC). A separate control group completed only the pre/immediate/delayed posttests and did not participate in the alignment sessions.

As such there are two main analyses:

1. What is the degree of linguistic alignment, and are there differences between FTF/SCMC modality?
2. Does the alignment session lead to learning of stranded prepositions, and are there differences between FTF/SCMC modality?

We answered both questions using logistic regression, which models the probability of a binary outcome (yes/no, accurate/inaccurate, etc.) as a function of independent variables (modality, pre/post, etc.).
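As a sketch of what that means in equation form for a single 0/1 predictor: the symbols $p$ (probability of the outcome), $x$ (predictor indicator), $\beta_0$ (intercept), and $\beta_1$ (slope) are notation introduced here for illustration, not taken from the paper.

$$
\log\frac{p}{1-p} = \beta_0 + \beta_1 x,
\qquad
p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}
$$

The left-hand equation is the model on the logit (log-odds) scale, and the right-hand one is the same relationship solved for the probability.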

Primed production data

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(emmeans) # model posthoc

Welcome to emmeans.
Caution: You lose important information if you filter this package's results.
See '? untidy'

dat <- read_csv('sp-priming.csv')

Rows: 4608 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): verb, modality, test, session, trial_type
dbl (7): subject, score, trial_order, cloze, wmc, prod_pre, rec_pre

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The data has quite a few columns. Our concern here is score, which tallies production of stranded prepositions (coded as 0, not produced, or 1, produced), and trial_type, which is coded as prime or non-prime.

Our research question is whether there is a significant difference in the production of stranded prepositions between the two trial types.

glimpse(dat)

Rows: 4,608
Columns: 12
$ subject     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ verb        <chr> "pet", "organize", "burn", "stack", "toast", "need", "cut"…
$ score       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ modality    <chr> "FTF", "FTF", "FTF", "FTF", "FTF", "FTF", "FTF", "FTF", "F…
$ test        <chr> "priming1", "priming1", "priming1", "priming1", "priming1"…
$ trial_order <dbl> 12, 9, 10, 11, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,…
$ session     <chr> "session1", "session1", "session1", "session1", "session1"…
$ trial_type  <chr> "non-prime", "prime", "non-prime", "prime", "prime", "non-…
$ cloze       <dbl> 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40…
$ wmc         <dbl> 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74…
$ prod_pre    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ rec_pre     <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6…

Let’s take a look at the raw data with a plot.

# wrapped factor() around score so ggplot does not apply a continuous fill
ggplot(dat, aes(fill = factor(score), x = trial_type)) +
  geom_bar(position = 'dodge') +
  # aesthetics and styling
  theme_classic() +
  scale_fill_grey(start = .4, end = .75) +
  labs(
    fill = 'production:\n 0 = no, 1 = yes',
    title = 'frequency of producing a stranded preposition',
    subtitle = 'prime = previous sentence had a stranded preposition',
    x = "Trial Type"
  )

Fit a glm model

A one-predictor glm

m1 <- glm(score ~ trial_type, data = dat, family = binomial(link = 'logit'))

# can you interpret this output? 
summary(m1)


Call:
glm(formula = score ~ trial_type, family = binomial(link = "logit"), 
    data = dat)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -2.82497    0.09062  -31.18   <2e-16 ***
trial_typeprime  2.34918    0.10024   23.44   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4851.5  on 4607  degrees of freedom
Residual deviance: 4061.6  on 4606  degrees of freedom
AIC: 4065.6

Number of Fisher Scoring iterations: 5

Let’s manually convert the coefficients.

# save coefficients to variables
b0 <- -2.82497
b1 <-  2.34918

# exponentiate: exp(b0) is the odds at the reference level, exp(b1) is the odds ratio
or_intercept <- exp(b0)
or_prime <- exp(b1)


or_intercept

[1] 0.05931044

or_prime

[1] 10.47698

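The exponentiated intercept is 0.06. This is the odds (not an odds ratio) of producing a stranded preposition at the reference level, which is the non-prime condition. The exponentiated coefficient is 10.48. This is the odds ratio for the prime condition compared to the reference level.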

This shows us much higher odds of producing a stranded preposition in the prime versus the non-prime condition.
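To make the link between odds and the odds ratio concrete, here is a minimal sketch using the two coefficients from the model summary. The variable names (odds_nonprime, odds_prime, odds_ratio) are introduced here for illustration only.

```r
# coefficients copied from the summary(m1) output
b0 <- -2.82497  # log-odds in the non-prime (reference) condition
b1 <-  2.34918  # change in log-odds for prime vs. non-prime

odds_nonprime <- exp(b0)       # odds in the non-prime condition
odds_prime    <- exp(b0 + b1)  # odds in the prime condition

# the ratio of the two odds is the odds ratio, which equals exp(b1)
odds_ratio <- odds_prime / odds_nonprime
all.equal(odds_ratio, exp(b1))
```

Because exp(b0 + b1) / exp(b0) simplifies to exp(b1), exponentiating the slope gives the odds ratio directly.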

Now manually convert to probability:

# probabilities (logit -> probability)
prob.nonPrime <- exp(b0) / (1 + exp(b0))
prob.prime   <- exp(b0 + b1) / (1 + exp(b0 + b1))

# convert to %
percent.nonPrime <- prob.nonPrime * 100
percent.prime   <- prob.prime * 100


percent.nonPrime

[1] 5.598966

percent.prime

[1] 38.32467

The probability of producing a stranded preposition in the non-prime condition is 5.60%. The probability of producing a stranded preposition in the prime condition is 38.32%.
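The same conversion can be sketched with base R's built-in logistic functions, which avoids typing the formula by hand. The names prob_nonprime and prob_prime are illustrative.

```r
# coefficients copied from the summary(m1) output
b0 <- -2.82497
b1 <-  2.34918

# plogis() is the logistic function exp(x) / (1 + exp(x)) from the stats package
prob_nonprime <- plogis(b0)
prob_prime    <- plogis(b0 + b1)

# qlogis() goes the other way (probability -> logit), so it recovers the intercept
qlogis(prob_nonprime)

# percentages, rounded to two decimals
round(c(non_prime = prob_nonprime, prime = prob_prime) * 100, 2)
```

These should agree with the hand-computed percentages, since plogis() implements the same formula.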

Posthocs via emmeans

We can get this information much more easily using emmeans.

# model logits
emmeans(m1, ~ trial_type)

 trial_type emmean     SE  df asymp.LCL asymp.UCL
 non-prime  -2.825 0.0906 Inf     -3.00    -2.647
 prime      -0.476 0.0429 Inf     -0.56    -0.392

Results are given on the logit (not the response) scale. 
Confidence level used: 0.95

# put back into probability
emmeans(m1, ~trial_type, type = 'response')

 trial_type  prob      SE  df asymp.LCL asymp.UCL
 non-prime  0.056 0.00479 Inf    0.0473    0.0662
 prime      0.383 0.01010 Inf    0.3636    0.4033

Confidence level used: 0.95 
Intervals are back-transformed from the logit scale
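The note about back-transformed intervals can be illustrated with a short sketch: build a Wald interval on the logit scale, then convert each endpoint to a probability. This assumes a normal-based 95% interval, which seems consistent with the df = Inf shown in the output. The inputs are the rounded values printed above, so small rounding differences are expected.

```r
# logit-scale estimates and standard errors copied from the emmeans output
est <- c(non_prime = -2.825, prime = -0.476)
se  <- c(non_prime = 0.0906, prime = 0.0429)

z <- qnorm(0.975)  # about 1.96 for a 95% interval

# build the interval on the logit scale, then back-transform each endpoint
lower <- plogis(est - z * se)
upper <- plogis(est + z * se)

round(cbind(prob = plogis(est), lower, upper), 4)
```

Back-transforming the endpoints (rather than adding and subtracting on the probability scale) keeps the interval inside 0 and 1, and it is why the intervals are not symmetric around the estimated probability.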

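Note that because I asked emmeans for the logit of each condition, this output differs from the model summary, which shows the logit for the reference level (the intercept) and the difference between conditions (the trial_typeprime coefficient):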

summary(m1)


Call:
glm(formula = score ~ trial_type, family = binomial(link = "logit"), 
    data = dat)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -2.82497    0.09062  -31.18   <2e-16 ***
trial_typeprime  2.34918    0.10024   23.44   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4851.5  on 4607  degrees of freedom
Residual deviance: 4061.6  on 4606  degrees of freedom
AIC: 4065.6

Number of Fisher Scoring iterations: 5

We see what emmeans does here:

b0

[1] -2.82497

b1

[1] 2.34918

# the logit emmeans reports for the prime condition is the intercept plus the prime coefficient
b0 + b1

[1] -0.47579
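The same idea can be checked without emmeans. The sketch below uses simulated toy data (not the study data; the sample size and probabilities are made up) and compares the hand-computed logits with what predict() returns on the link scale. The names toy, toy_m, by_hand, and by_predict are hypothetical.

```r
set.seed(1)  # toy data only; NOT the study data

# hypothetical data: 200 trials per condition with made-up success probabilities
toy <- data.frame(
  trial_type = rep(c("non-prime", "prime"), each = 200)
)
toy$score <- rbinom(nrow(toy), size = 1,
                    prob = ifelse(toy$trial_type == "prime", 0.4, 0.1))

toy_m <- glm(score ~ trial_type, data = toy, family = binomial(link = "logit"))

# per-condition logits by hand: intercept, and intercept + slope
b <- coef(toy_m)
by_hand <- c(b[["(Intercept)"]], b[["(Intercept)"]] + b[["trial_typeprime"]])

# predict() on the link scale returns the same two logits
new_rows <- data.frame(trial_type = c("non-prime", "prime"))
by_predict <- predict(toy_m, newdata = new_rows, type = "link")

all.equal(by_hand, unname(by_predict))
```

With a single categorical predictor, the back-transformed logits also match the raw proportion of 1s in each condition, which is a useful sanity check on any model of this form.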

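Asking emmeans for the pairwise contrast reproduces what the model summary shows. The estimate of -2.35 has the same magnitude as the trial_typeprime logit in the model summary; the sign is flipped because the contrast is computed as non-prime minus prime.

Using revpairwise reverses the direction of the contrast, and asking for type = 'response' gives us the odds ratio between conditions, showing that the odds of producing a stranded preposition are 10.5 times greater in the prime condition than in the non-prime condition.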

# using $contrasts to avoid showing the emmeans again
emmeans(m1, pairwise ~ trial_type)$contrasts

 contrast            estimate  SE  df z.ratio p.value
 (non-prime) - prime    -2.35 0.1 Inf -23.436  <.0001

Results are given on the log odds ratio (not the response) scale.

emmeans(m1, revpairwise ~ trial_type, type = 'response')$contrasts

 contrast            odds.ratio   SE  df null z.ratio p.value
 prime / (non-prime)       10.5 1.05 Inf    1  23.436  <.0001

Tests are performed on the log odds ratio scale
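As a rough check on where the numbers in the odds ratio table come from, here is a sketch that starts from the slope and standard error in the model summary. It assumes the standard error on the odds ratio scale comes from a delta-method approximation. The variable names are illustrative.

```r
# slope and its standard error copied from the summary(m1) output
b1    <- 2.34918
se_b1 <- 0.10024

odds_ratio <- exp(b1)

# delta-method approximation: SE of exp(b1) is exp(b1) * SE(b1)
se_or <- odds_ratio * se_b1

# the test itself stays on the log odds ratio scale
z_ratio <- b1 / se_b1
p_value <- 2 * pnorm(-abs(z_ratio))

round(c(odds_ratio = odds_ratio, SE = se_or, z = z_ratio), 3)
```

The z ratio is identical in both contrast tables (apart from sign) because the test is carried out on the log odds ratio scale either way; only the reported estimate and its standard error change.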

What’s next?

There are actually two groups in the data!
We need to fit this as a multilevel model.
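One possible shape for that next step is sketched below with lme4::glmer on simulated toy data. Everything here is an assumption for illustration: the lme4 package is not used elsewhere in this notebook, the toy data are made up, and the model formula (trial type by modality with a random intercept for subject) is a guess at a reasonable starting point, not the model reported in the paper.

```r
library(lme4)  # assumed to be installed; not loaded elsewhere in this notebook

set.seed(42)  # toy data only; NOT the study data

n_subj   <- 30
n_trials <- 40

# one row per subject per trial
toy <- expand.grid(trial = seq_len(n_trials), subject = seq_len(n_subj))
toy$trial_type <- ifelse(toy$trial %% 2 == 0, "prime", "non-prime")
# modality varies between subjects in this toy example
toy$modality <- ifelse(toy$subject <= n_subj / 2, "FTF", "SCMC")

# hypothetical subject-level intercepts and made-up effect sizes
subj_int <- rnorm(n_subj, mean = 0, sd = 1)
logit <- -2 + 2 * (toy$trial_type == "prime") + subj_int[toy$subject]
toy$score <- rbinom(nrow(toy), size = 1, prob = plogis(logit))

# random intercept for subject; fixed effects for trial type, modality, and their interaction
toy_mm <- glmer(score ~ trial_type * modality + (1 | subject),
                data = toy, family = binomial(link = "logit"))

fixef(toy_mm)
```

The random intercept lets each subject have their own baseline tendency to produce stranded prepositions, which the single-level glm above ignores by treating all 4,608 trials as independent.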