── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Using across() to get summary statistics of multiple columns
We know how to use summarise() to obtain summary statistics of our data, such as the mean and standard deviation. Let’s think about a dataset that has many variables we want to gather such statistics for.
dat <- read_csv('https://raw.githubusercontent.com/scskalicky/scskalicky.github.io/refs/heads/main/sample_dat/exp2_model_data.csv')
Rows: 171369 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): subject, story_condition, stim_id, spr_version, sex, word, region,...
dbl (10): trial_index, rt_order, rt, age, news_familiar, satire_familiar, nf...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
In this study people read short texts, one word at a time, and then answered questions about those texts. I also measured a number of individual differences, including:
age, sex, familiarity with regular news, familiarity with satirical news, and need for cognition.
What if we want to get the summary of these variables?
First, reduce the data so that there is only one row per subject. To do so, use select() to keep subject plus the columns listed above (except sex). Then use unique() to remove duplicate rows.
Question: why do we have to use select() on the columns before using unique()?
You should have a dataframe like this:
demo_dat <- dat %>%
  select(subject, age, news_familiar, satire_familiar, nfc) %>%
  unique() %>%
  glimpse()
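For comparison, summarising each column by hand might look like the sketch below. The tibble demo_toy is a small made-up stand-in for demo_dat, and the mean_* column names are illustrative assumptions rather than anything from the study data.

```r
library(dplyr)

# Hypothetical toy data standing in for demo_dat (values are made up)
demo_toy <- tibble::tibble(
  subject = c("s1", "s2", "s3", "s4"),
  age = c(20, 30, 40, 50),
  news_familiar = c(1, 2, 3, 4),
  satire_familiar = c(2, 2, 4, 4),
  nfc = c(3, 4, 5, 6)
)

# One mean() call per column: every new column means another line to type
manual_summary <- demo_toy %>%
  summarise(
    mean_age = mean(age),
    mean_news_familiar = mean(news_familiar),
    mean_satire_familiar = mean(satire_familiar),
    mean_nfc = mean(nfc)
  )

manual_summary
```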
Works totally fine! But what if we don’t want to retype the same thing over and over again? We can use across() (or other options) to first select a range of columns and then apply specific functions to those columns. To do so, we call summarise() as usual, but place across() inside it. The across() function takes multiple arguments, including .cols, .fns, and .names.
For now, let’s focus on choosing two columns which have the function mean applied to them:
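A minimal sketch of such a call, assuming age and nfc are the two columns chosen and using a small made-up tibble (demo_toy, a hypothetical stand-in for demo_dat):

```r
library(dplyr)

# Hypothetical toy data standing in for demo_dat (values are made up)
demo_toy <- tibble::tibble(
  subject = c("s1", "s2", "s3", "s4"),
  age = c(20, 30, 40, 50),
  nfc = c(3, 4, 5, 6)
)

# .cols picks the columns, .fns is the function applied to each of them
across_means <- demo_toy %>%
  summarise(across(.cols = c(age, nfc), .fns = mean))

across_means
```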
Do you see how that works? First we chose the columns, then the function to apply. It’s kind of weird that we provide the function without the brackets (mean rather than mean()), but that’s because we are handing across() the function itself rather than calling it.
Using a named list
What if we want to apply more than one function? We can expand the function call, but now need to provide a named list of functions to the .fns argument. Here is an example, still with one function. The named list means I first provide my own name for the function, then the function itself:
What is different about the output here compared to above?
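A hedged sketch of the named-list form, again with a made-up tibble (demo_toy) and with the list name mean chosen purely for illustration:

```r
library(dplyr)

# Hypothetical toy data standing in for demo_dat (values are made up)
demo_toy <- tibble::tibble(
  subject = c("s1", "s2", "s3", "s4"),
  age = c(20, 30, 40, 50),
  nfc = c(3, 4, 5, 6)
)

# .fns is now a named list: the name on the left, the function on the right
named_list_means <- demo_toy %>%
  summarise(across(c(age, nfc), list(mean = mean)))

named_list_means
```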
What if we want to do something fancy, such as use built-in arguments to functions, like na.rm = TRUE or scale = TRUE? We can also use a formula notation for our functions:
Note that we have to include the .x in the formula where we would normally put the variable itself. The .x is a placeholder for whichever column across() is currently working on, because the same formula is applied to each selected column in turn.
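As an illustration only, the sketch below passes na.rm = TRUE through a formula. The tibble toy_na and its missing values are made up so that the argument has a visible effect.

```r
library(dplyr)

# Hypothetical toy data with missing values (made up for illustration)
toy_na <- tibble::tibble(
  age = c(20, NA, 40),
  nfc = c(3, 4, NA)
)

# The ~ starts a formula; .x stands for the column currently being summarised
formula_means <- toy_na %>%
  summarise(across(c(age, nfc), list(mean = ~ mean(.x, na.rm = TRUE))))

formula_means
```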
Now let’s look at the .names argument, which lets us give custom names to our resulting variables. To do so, we use syntax similar to Python’s f-string formatting, where we can combine variable names and text in one string. We place variables inside curly brackets {} within a string, like this: '{var}_rest of string'. In across(), the available placeholders are {.col} (the column name) and {.fn} (the name of the function from the list). This allows us to make the labels however we want:
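A small sketch of .names in use. The toy tibble and the "_average" suffix are assumptions chosen for illustration; {.col} is the placeholder across() fills with each column name.

```r
library(dplyr)

# Hypothetical toy data standing in for demo_dat (values are made up)
demo_toy <- tibble::tibble(
  subject = c("s1", "s2", "s3", "s4"),
  age = c(20, 30, 40, 50),
  nfc = c(3, 4, 5, 6)
)

# {.col} is replaced by each column name; the rest of the string is literal text
custom_names <- demo_toy %>%
  summarise(across(c(age, nfc), list(mean = mean), .names = "{.col}_average"))

custom_names
```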
Naturally, we can scale this up to also compute the standard deviation. For custom labels, the names in our list of functions supply the {.fn} part of each label, and the .names argument controls how {.col} and {.fn} are ordered and combined:
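One way this could look, sketched on a made-up tibble (demo_toy). The list names mean and sd and the "{.fn}_{.col}" pattern are illustrative choices, not necessarily the ones used in the original example.

```r
library(dplyr)

# Hypothetical toy data standing in for demo_dat (values are made up)
demo_toy <- tibble::tibble(
  subject = c("s1", "s2", "s3", "s4"),
  age = c(20, 30, 40, 50),
  nfc = c(3, 4, 5, 6)
)

# List names feed {.fn}; .names puts the function name before the column name
mean_sd_summary <- demo_toy %>%
  summarise(
    across(
      c(age, nfc),
      list(mean = mean, sd = sd),
      .names = "{.fn}_{.col}"
    )
  )

mean_sd_summary
```

Swapping the pattern to "{.col}_{.fn}" would put the column name first instead.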