binomial test

Author

Jiahan Xu & Stephen Skalicky

Published

May 7, 2025

The Binomial Test

An example using cookies

Imagine you made cookies of two flavors, caramel and chocolate, and ask 100 people to choose which flavor they prefer. You might expect to get results like 50 people liking caramel and 50 chocolate because you expect them to choose evenly. Let’s consider different scenarios:

If you get a result of 49 vs. 51, you’d still think that people evenly pick caramel and chocolate. In other words, your caramel and chocolate cookies are equally tasty and people have no preference. If you get a result of 99 vs. 1, you’d likely conclude that people strongly prefer caramel and the chocolate flavor isn’t very good.

But what about a contrast of 60 vs. 40? Are you going to say people evenly like your cookies or are you going to say that people prefer caramel? How about 76 vs. 24?

In other words, at which division can you draw a conclusion that people prefer one of the flavors? When does the difference become meaningful?

How can we determine this using statistics? We can use what is called a Binomial test.

What is the Binomial Test?

The binomial test checks whether the frequency distribution of a variable with two values/categories in the sample corresponds to a hypothesized distribution in the population. In this case, we might assume that if there is no preference, we would obtain a distribution evenly split between two categories. This means a 50% chance for each option.

In our cookie example: - We have a binary outcome: preference for caramel or chocolate

We expect equal preference (possibility = 0.5)
We want to know if the observed preferences significantly differ from this expected equal distribution

In a binomial test,

Null hypothesis (H₀): The proportion of people choosing caramel equals to 0.5
Alternative hypothesis (H₁): The proportion of people choosing caramel is different from 0.5
p-value: Probability of observing the given result, or a more extreme one, if the null hypothesis is true
Significance level: Typically set at 0.05, meaning we reject the null hypothesis if p < 0.05

The binomial test helps us determine whether the observed difference between the proportions(caramel or chocolate) is statistically significant or if it might have happened by chance.

How to Do a Binomial Test in R

The basic function for a binomial test in R is binom.test(x, n, p) where:

x: Number of successes (e.g., people choosing caramel)
n: Total number of trials (e.g., total people surveyed)
p: Expected probability of success under the null hypothesis (default is 0.5 for equal preference)

Remember the situation of 60 people choosing caramel and 40 choosing chocolate? We can formally test this using the binomial test:

binom.test(x = 60, n = 100, p = 0.5)


    Exact binomial test

data:  60 and 100
number of successes = 60, number of trials = 100, p-value = 0.05689
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.4972092 0.6967052
sample estimates:
probability of success 
                   0.6

In this situation, the p-value is 0.0579, which is slightly above the typically threshold for statistical significance (0.05). Following a frequentist approach, this means we cannot reject the null hypothesis that the preferences are equal (i.e., that the total population distribution is split 50/50, or said differently, that each option has a 50% probability of being chosen).

Even though more people chose caramel (60 > 40), statistically, this difference could reasonably occur by chance based on our sampling error, and we cannot conclude that the population has any preference for either flavor. In other words, based on the data you collected, your cookies with different flavors are equally tasty.

What about the situation where there were 76 vs. 24 preferences?

binom.test(x = 76, n = 100, p = 0.5)


    Exact binomial test

data:  76 and 100
number of successes = 76, number of trials = 100, p-value = 1.81e-07
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.6642645 0.8397754
sample estimates:
probability of success 
                  0.76

With 76 people preferring caramel, the p-value drops to 1.81e-07 (i.e., .000000181), which is below the threshold of 0.05 and can be considered statistically significant. Now we can reject the null hypothesis and conclude that there is a statistically significant preference for caramel. Yes, your caramel cookies are tastier!

Used for our linguistic research: “in” or “on”?

In a linguistic study, I asked people to describe a picture by filling in a blank word in a sentence. Here is an example.

Most people would choose from preposition “in” or “on”. I wanted to know if the population shows any preference for one of these prepositions to describe this picture.

For question Q1, my results showed that of the 31 people who completed this question:

20 used in
11 used on

Is this difference significant to show a preference for one preposition over the other?

binom.test(x = 20, n = 31, p = 0.5)


    Exact binomial test

data:  20 and 31
number of successes = 20, number of trials = 31, p-value = 0.1496
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.4536956 0.8077326
sample estimates:
probability of success 
             0.6451613

The binomial test resulted in a p-value of 0.1496, which is greater than 0.05, so it is not statistically significant. This means that the difference between 20 vs. 11 is not large enough to conclude that people have a preference for one preposition over the other.

Let’s analyze another example, Question 24.

For question Q24, of the 29 people giving their answers:

21 used “on”
8 used a different preposition

The binomial test gives a p-value of 0.0241, which is less than 0.05. Therefore, we can conclude that people have a statistically significant preference for using “on” to describe this picture.

binom.test(x = 21, n = 29, p = 0.5)


    Exact binomial test

data:  21 and 29
number of successes = 21, number of trials = 29, p-value = 0.02412
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.5276155 0.8726599
sample estimates:
probability of success 
             0.7241379

Running Binomial Tests in Bulk

This is fine, but actually, there are 77 questions of my interest. Do we really want to run binomial tests one at a time? Is there a better way?

Yes! We can use a mapping function from the purrr package. This means that we “map” or apply the same function iteratively to a larger vector of data (e.g., two columns in a dataframe). In this case, we will use the map2_dbl() function.

This function

How does map2_dlb() work?

let’s create a random vectors of probabilities and a random vector of sample sizes using the sample() function

set.seed(42)
v1 <- sample(x = seq(0, 25,), 10)
v2 <- rep(50, 10)

Check what each looks like:

v1

 [1] 16  4  0  9  3 17 25 14  6 21

v2

 [1] 50 50 50 50 50 50 50 50 50 50

What happens if we run binom.test() on these data?

binom.test(x = v1, n = v2, p = .5)

Error in binom.test(x = v1, n = v2, p = 0.5): incorrect length of 'x'

We get an error because the function wants just one value for x and one value for n. This is frustrating since we know R is actually pretty good at vectorised operations:

v1 > v2

 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

v1 + v2

 [1] 66 54 50 59 53 67 75 64 56 71

So the purrr package allows us to “map” functions in a way where it can loop or repeat over input. It requires some special syntax to work, but this line will still return an error:

purrr::map2_dbl(v1, v2, ~ binom.test(.x, .y, p = .5))

Error in `purrr::map2_dbl()`:
ℹ In index: 1.
Caused by error:
! Result must be length 1, not 9.

This is because the output of binom.test returns a list object, whereas map2_dbl is looking for a single value to be returned.

binomtest_output <- binom.test(x = 10, n = 100, p = .5)

The solution is to ask for what we want from the list. In this case, we can ask for just the p-value, since that is our main metric of interest in these comparisons (although please don’t think that p-values are all you ever need!).

purrr::map2_dbl(v1, v2, ~ binom.test(.x, .y, p = .5)$p.value)

 [1] 1.534668e-02 4.461782e-10 1.776357e-15 5.614100e-06 3.708323e-11
 [6] 3.283914e-02 1.000000e+00 2.602171e-03 3.243741e-08 3.222363e-01

purrr::map2_dbl(v1, v2, ~ binom.test(.x, .y, p = .5)$estimate)

 [1] 0.32 0.08 0.00 0.18 0.06 0.34 0.50 0.28 0.12 0.42

applying `map2_dbl()` to a dataframe

Let’s load a dataframe.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

in_or_on <- read_csv("https://raw.githubusercontent.com/scskalicky/scskalicky.github.io/refs/heads/main/sample_dat/in-on.csv")

Rows: 77 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): questions
dbl (4): in, on, total, majority

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Look at the structure of this file

It has four columns:

questions: the ID of the question/picture
in: the number of people choosing “in”
on: the number of people choosing “on”
total: the total population (i.e., frequency of in + frequency of on)
majority: the number of people who chose the higher option (between “in” and “on”)

glimpse(in_or_on)

Rows: 77
Columns: 5
$ questions <chr> "Q1", "Q10", "Q102", "Q12", "Q16", "Q17", "Q18", "Q19", "Q2"…
$ `in`      <dbl> 8, 1, 31, 3, 1, 2, 23, 20, 1, 20, 28, 29, 23, 5, 1, 12, 19, …
$ on        <dbl> 21, 26, 0, 28, 28, 27, 6, 7, 31, 11, 2, 1, 5, 24, 29, 20, 12…
$ total     <dbl> 29, 27, 31, 31, 29, 29, 29, 27, 32, 31, 30, 30, 28, 29, 30, …
$ majority  <dbl> 21, 26, 31, 28, 28, 27, 23, 20, 31, 20, 28, 29, 23, 24, 29, …

Notice here the majority column does not care which preposition has the preference. Instead, we are first interested in whether the question attracts a preference at all. This is measured by whether the distribution is far enough away from 50% of either category.

Will the following code work?

library(purrr)
map2_dbl(in_or_on$majority, in_or_on$total, ~binom.test(.x, .y, p=0.5))

Error in `map2_dbl()`:
ℹ In index: 1.
Caused by error:
! Result must be length 1, not 9.

How about now?

map2_dbl(in_or_on$majority, in_or_on$total, ~binom.test(.x, .y, p = 0.5)$p.value)

 [1] 2.411954e-02 4.172325e-07 9.313226e-10 4.649162e-06 1.117587e-07
 [6] 1.624227e-06 2.315700e-03 1.915729e-02 1.536682e-08 1.496128e-01
[11] 8.679926e-07 5.774200e-08 9.122342e-04 5.461127e-04 5.774200e-08
[16] 2.153271e-01 2.810415e-01 4.656613e-10 4.172325e-07 2.315700e-03
[21] 8.679926e-07 2.160668e-07 9.313226e-10 8.715855e-02 8.430332e-06
[26] 1.067384e-02 8.679926e-07 3.725290e-09 4.656613e-10 7.450581e-09
[31] 4.656613e-10 9.122342e-04 4.656613e-10 5.774200e-08 4.277395e-02
[36] 1.523644e-05 1.430906e-03 4.656613e-10 1.000000e+00 1.490116e-08
[41] 4.656613e-10 3.249142e-04 1.862645e-09 4.656613e-10 5.774200e-08
[46] 5.947612e-05 1.612480e-02 1.117587e-07 5.846647e-01 5.350527e-04
[51] 1.067384e-02 1.624227e-06 5.647540e-06 3.032386e-06 5.774200e-08
[56] 4.172325e-07 1.862645e-09 3.725290e-09 4.656613e-10 1.862645e-09
[61] 1.862645e-09 4.656613e-10 3.725290e-09 7.450581e-09 1.862645e-09
[66] 4.656613e-10 1.862645e-09 3.725290e-09 3.725290e-09 3.725290e-09
[71] 4.656613e-10 4.656613e-10 3.725290e-09 1.862645e-09 4.656613e-10
[76] 7.450581e-09 4.656613e-10

We can then easily select those questions with p-value lower than 0.05, and go from there to do our actual anaylses.

bonus. turning scientific notation off:

options(scipen = 999)

round(map2_dbl(in_or_on$majority, in_or_on$total, ~binom.test(.x, .y, p = 0.5)$p.value), 3)

 [1] 0.024 0.000 0.000 0.000 0.000 0.000 0.002 0.019 0.000 0.150 0.000 0.000
[13] 0.001 0.001 0.000 0.215 0.281 0.000 0.000 0.002 0.000 0.000 0.000 0.087
[25] 0.000 0.011 0.000 0.000 0.000 0.000 0.000 0.001 0.000 0.000 0.043 0.000
[37] 0.001 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.016 0.000
[49] 0.585 0.001 0.011 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
[61] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
[73] 0.000 0.000 0.000 0.000 0.000

bonus. add the results as a column:

in_or_on$pvalues <- round(map2_dbl(in_or_on$majority, in_or_on$total, ~binom.test(.x, .y, p = 0.5)$p.value), 3)

Do you have a use for such a test? Are there other distribution beyond 50/50 you might be interested in?