across

Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Using across() to get summary statistics of multiple columns

We know how to use summarise() to obtain summary statistics of our data, such as the mean and standard deviation. Let’s think about a dataset that has many variables we want to gather such statistics for.

Let’s use a dataset from one of my studies of satirical discourse. It has zero citations so needs some love! You can read the data in from a GitHub page:

Code
dat <- read_csv('https://raw.githubusercontent.com/scskalicky/scskalicky.github.io/refs/heads/main/sample_dat/exp2_model_data.csv')
Rows: 171369 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): subject, story_condition, stim_id, spr_version, sex, word, region,...
dbl (10): trial_index, rt_order, rt, age, news_familiar, satire_familiar, nf...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

In this study people read short texts, one word at a time, and then answered questions about those texts. I also measured a number of individual differences, including:

  • age, sex, familiarity with regular news, familiarity with satirical news, and need for cognition.

What if we want to get the summary of these variables?

  1. First reduce the data so that there is only one row per subject. To do so, use select() to keep subject plus the above columns except for sex. Then use unique() to remove duplicates.

Question: why do we have to select() the columns before using unique()?

You should have a dataframe like this:

Code
demo_dat <- dat %>%
  select(subject, age, news_familiar, satire_familiar, nfc) %>%
  unique() %>%
  glimpse()
Rows: 133
Columns: 5
$ subject         <chr> "615a11c296155382cc1e0ef3", "615a109024f5dc82af2a5d41"…
$ age             <dbl> 35, 37, 41, 31, 24, 22, 51, 38, 30, 23, 40, 31, 41, 35…
$ news_familiar   <dbl> 5, 4, 4, 3, 5, 4, 5, 5, 4, 4, 4, 4, 4, 3, 5, 4, 5, 5, …
$ satire_familiar <dbl> 4, 4, 3, 2, 1, 3, 2, 4, 4, 4, 2, 4, 3, 3, 3, 4, 4, 3, …
$ nfc             <dbl> 3, -12, -11, 13, 3, 20, 20, 29, 33, 28, 22, 5, 9, 22, …
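As a minimal sketch of why the order of select() and unique() matters, here is a made-up toy tibble (the names toy, rt, and the values are hypothetical, not from the study). Because a trial-level column such as rt differs on every row, unique() on the full data has nothing to collapse; once only subject-level columns remain, the duplicates appear and can be dropped. distinct() is dplyr's own equivalent of unique() for data frames.

```r
library(dplyr)

# Hypothetical toy data: two subjects, three trials each.
# `rt` changes on every trial; `age` is constant within a subject.
toy <- tibble(
  subject = c("s1", "s1", "s1", "s2", "s2", "s2"),
  rt      = c(310, 295, 402, 288, 350, 331),
  age     = c(35, 35, 35, 41, 41, 41)
)

# Every full row is already unique because rt differs, so nothing is dropped.
all_rows <- toy %>% unique()
nrow(all_rows)

# Keeping only the subject-level columns first leaves one row per subject.
per_subject <- toy %>%
  select(subject, age) %>%
  distinct()
nrow(per_subject)
```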

Now, if we want to create a summary of each column, such as the mean and standard deviation, what can we do? Well, we can use summarise(), like this for the means:

Code
demo_dat_summary <- demo_dat %>%
  summarise(mean_age = mean(age),
            mean_news_familiar = mean(news_familiar),
            mean_satire_familiar = mean(satire_familiar),
            mean_nfc = mean(nfc))

demo_dat_summary
# A tibble: 1 × 4
  mean_age mean_news_familiar mean_satire_familiar mean_nfc
     <dbl>              <dbl>                <dbl>    <dbl>
1     33.8               4.03                 2.87     8.30

Works totally fine! But what if we don’t want to retype the same thing over and over again? We can use across() (there are other options too) to first select a set of columns and then apply one or more functions to each of them. To do so, we call summarise() as normal, but put across() inside it. The across() function takes several arguments, including .cols, .fns, and .names.

For now, let’s focus on choosing two columns which have the function mean applied to them:

Code
demo_dat %>%
  summarise(across(.cols = c(news_familiar, satire_familiar), mean))
# A tibble: 1 × 2
  news_familiar satire_familiar
          <dbl>           <dbl>
1          4.03            2.87

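Do you see how that works? First we chose the columns, then the function to apply. It’s kind of weird that we provide the function without the brackets, but that’s because we are handing across() the function itself rather than calling it; across() then calls it once for each selected column.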
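To make the "function without brackets" idea concrete, here is a small sketch on made-up data (the tibble toy and its columns a and b are hypothetical). It compares passing mean by name with passing an anonymous function that calls mean(); the two should give the same result.

```r
library(dplyr)

# Hypothetical toy data.
toy <- tibble(a = c(1, 2, 3), b = c(10, 20, 30))

# Passing the function object `mean` itself (no brackets) ...
by_name <- toy %>%
  summarise(across(c(a, b), mean))

# ... behaves like passing an anonymous function that calls mean().
by_lambda <- toy %>%
  summarise(across(c(a, b), function(x) mean(x)))

# Compare the two one-row summaries.
all.equal(by_name, by_lambda)
```

Writing mean() with brackets would instead try to call the function immediately, with no data, before across() ever sees it.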

Using a named list

What if we want to apply more than one function? We can expand the function call, but now need to provide a named list of functions to the .fns argument. Here is an example, still with one function. The named list means I first provide my own name for the function, then the function itself:

Question: what is different about the output here compared to above?

Code
demo_dat %>%
  summarise(across(.cols = c(news_familiar, satire_familiar), 
                   .fns = list(mean = mean)))
# A tibble: 1 × 2
  news_familiar_mean satire_familiar_mean
               <dbl>                <dbl>
1               4.03                 2.87

Maybe this helps showcase the difference? That’s right, the name you provide in the named list is appended to the name of each output column!

Code
demo_dat %>%
  summarise(across(.cols = c(news_familiar, satire_familiar), 
                   .fns = list('#mean#' = mean)))
# A tibble: 1 × 2
  `news_familiar_#mean#` `satire_familiar_#mean#`
                   <dbl>                    <dbl>
1                   4.03                     2.87

Using formula notation

What if we want to do something fancy, such as use built-in arguments to functions, like na.rm=T or scale=T? We can also use a formula notation for our functions:

Note that we have to include the .x in the formula where we would normally put the variable itself. This is because across() applies the formula to each selected column in turn, and .x stands in for whichever column is currently being processed.

Code
demo_dat %>%
  summarise(across(.cols = c(news_familiar, satire_familiar), 
                   .fns = list(mean_rm = ~ mean(.x, na.rm = T))))
# A tibble: 1 × 2
  news_familiar_mean_rm satire_familiar_mean_rm
                  <dbl>                   <dbl>
1                  4.03                    2.87
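The study data gives the same answer with or without na.rm, so the effect is easier to see on a sketch with missing values. The tibble toy below is made up for illustration. It also shows that the formula shorthand is interchangeable with an ordinary anonymous function, where x plays the role of .x.

```r
library(dplyr)

# Hypothetical toy data with missing values.
toy <- tibble(a = c(1, NA, 3), b = c(10, 20, NA))

# Without na.rm, a single NA makes the whole mean NA.
plain <- toy %>%
  summarise(across(c(a, b), mean))

# Formula shorthand: .x stands for whichever column is being processed.
with_formula <- toy %>%
  summarise(across(c(a, b), ~ mean(.x, na.rm = TRUE)))

# The same computation written as an anonymous function.
with_function <- toy %>%
  summarise(across(c(a, b), function(x) mean(x, na.rm = TRUE)))

plain
with_formula
```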

More than one function

Now we can easily scale up to multiple functions, such as asking for the standard deviation as well. Sweet…

Code
demo_dat %>%
  summarise(across(.cols = c(news_familiar, satire_familiar), 
                   .fns = list(mean = ~ mean(.x, na.rm = T), 
                               sd = ~ sd(.x, na.rm = T))))
# A tibble: 1 × 4
  news_familiar_mean news_familiar_sd satire_familiar_mean satire_familiar_sd
               <dbl>            <dbl>                <dbl>              <dbl>
1               4.03            0.816                 2.87               1.11

Custom names

Now let’s look at the .names argument, which lets us give custom names to the resulting columns. It uses glue syntax, similar to Python’s f-string formatting, where we combine placeholders and text in one string. Placeholders go inside curly brackets {} within the string, like this: '{col}_rest of string'. Here {col} stands for the column name and {fn} for the name given in the function list (the dplyr documentation writes these as {.col} and {.fn}). This allows us to make the labels however we want:

Let’s first start with mean again

Put the letter M after the name…

Code
demo_dat %>%
  summarise(across(.cols = c(news_familiar, satire_familiar), 
                   .fns = list(mean = ~ mean(.x, na.rm = T)),
                   .names = '{col}_M'))
# A tibble: 1 × 2
  news_familiar_M satire_familiar_M
            <dbl>             <dbl>
1            4.03              2.87

And before the name…

Code
demo_dat %>%
  summarise(across(.cols = c(news_familiar, satire_familiar), 
                   .fns = list(mean = ~ mean(.x, na.rm = T)),
                   .names = 'M_{col}'))
# A tibble: 1 × 2
  M_news_familiar M_satire_familiar
            <dbl>             <dbl>
1            4.03              2.87

And of course with any separator you like:

Code
demo_dat %>%
  summarise(across(.cols = c(news_familiar, satire_familiar), 
                   .fns = list(mean = ~ mean(.x, na.rm = T)),
                   .names = '(M){col}(M)'))
# A tibble: 1 × 2
  `(M)news_familiar(M)` `(M)satire_familiar(M)`
                  <dbl>                   <dbl>
1                  4.03                    2.87

Naturally, we can scale this up to include the standard deviation as well. For custom labels, the names in our list of functions supply the {fn} part, and the .names argument controls how the column and function names are arranged:

Code
demo_dat %>%
  summarise(across(.cols = c(news_familiar, satire_familiar), 
                   .fns = list(mean = ~ mean(.x, na.rm = T),
                               sd = ~ sd(.x, na.rm = T)),
                   .names = '{col}_{fn}'))
# A tibble: 1 × 4
  news_familiar_mean news_familiar_sd satire_familiar_mean satire_familiar_sd
               <dbl>            <dbl>                <dbl>              <dbl>
1               4.03            0.816                 2.87               1.11

Any separator works here too:

Code
demo_dat %>%
  summarise(across(.cols = c(news_familiar, satire_familiar), 
                   .fns = list(mean = ~ mean(.x, na.rm = T),
                               sd = ~ sd(.x, na.rm = T)),
                   .names = '{col}!{fn}'))
# A tibble: 1 × 4
  `news_familiar!mean` `news_familiar!sd` `satire_familiar!mean`
                 <dbl>              <dbl>                  <dbl>
1                 4.03              0.816                   2.87
# ℹ 1 more variable: `satire_familiar!sd` <dbl>
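As a quick check on how .names relates to the default behaviour, here is a sketch on made-up data (toy and fns are hypothetical names). It uses the {.col} and {.fn} spellings from the dplyr documentation, and compares the column names produced with no .names argument against an explicit pattern, then swaps the order of the two placeholders.

```r
library(dplyr)

# Hypothetical toy data and a named list of functions.
toy <- tibble(a = c(1, 2, 3), b = c(10, 20, 30))
fns <- list(mean = mean, sd = sd)

# No .names argument: with a list of functions, the names follow column_function.
default_names <- toy %>%
  summarise(across(c(a, b), fns)) %>%
  names()

# Spelling the same pattern out explicitly.
explicit_names <- toy %>%
  summarise(across(c(a, b), fns, .names = "{.col}_{.fn}")) %>%
  names()

# Putting the function name first, with a different separator.
swapped_names <- toy %>%
  summarise(across(c(a, b), fns, .names = "{.fn}.{.col}")) %>%
  names()

default_names
swapped_names
```

Note that changing .names only changes the labels; the columns still come out grouped by input column, with each function in list order.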

Those are the basics - with this information, can you calculate the mean and standard deviation of age and nfc? It should require almost no changes!

Code
demo_dat %>%
  summarise(across(.cols = c(age, nfc), 
                   .fns = list(mean = ~ mean(.x, na.rm = T),
                               sd = ~ sd(.x, na.rm = T)),
                   .names = '{col}_{fn}'))
# A tibble: 1 × 4
  age_mean age_sd nfc_mean nfc_sd
     <dbl>  <dbl>    <dbl>  <dbl>
1     33.8   8.65     8.30   20.2
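The examples so far list columns by name inside c(), but .cols also accepts tidyselect helpers, which is handy when there are many columns. The sketch below uses where(is.numeric) on a made-up tibble (toy and its columns are hypothetical) to summarise every numeric column without naming any of them; the same idea should carry over to demo_dat, where subject is the only non-numeric column.

```r
library(dplyr)

# Hypothetical toy data: one character column, two numeric columns.
toy <- tibble(
  id    = c("s1", "s2", "s3"),
  age   = c(20, 30, 40),
  score = c(1, 2, 3)
)

# where(is.numeric) selects every numeric column, skipping `id`.
numeric_summary <- toy %>%
  summarise(across(.cols = where(is.numeric),
                   .fns = list(mean = ~ mean(.x, na.rm = TRUE),
                               sd = ~ sd(.x, na.rm = TRUE)),
                   .names = "{.col}_{.fn}"))

numeric_summary
```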