What happens if we have a large number of columns in a dataframe that we want to apply similar functions to? We learned that across() is one way to map one or more functions to various columns. But we still need to provide a vector of columns to across() in order to use it - and if we have hundreds of columns, we would like to avoid typing in the name of each column.
Let’s explore some different ways to select columns based on a range of different criteria.
The first thing to get familiar with is a dataset that might be used for such purposes.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Let’s load in some real data I used for a paper during my PhD. You can get the data with this URL
dat <-read_csv('https://raw.githubusercontent.com/scskalicky/scskalicky.github.io/refs/heads/main/sample_dat/linguistic_features.csv')
Rows: 80 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): headline, filename, condition
dbl (16): conditionNum, MLC, numContenWords, numWords, numFunctionWords, MRC...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
What are the column names? We can use colnames(df) to see what we are working with:
So we have variables like headline and filename which are metadata about the stimuli used in the experiment, we also have condition and conditionNum which represent experimental conditions (no idea why I felt the need to have these as numbers and a factor, probably leftover SPSS in my blood).
The rest of the numbers are different linguistic features!
We have some variables that start with num, which represent counts of different things (there is also MLC - mean length of clause)
There are other variables that start with MRC, which represent scores from psycholinguistic lexicons
There are also variables that end in CS, which I cannot remember what CS means, but these are either total or average survey ratings for these variables
Finally there is an AOA, VADER, and COCA variable representing age of acquisition, VADER sentiment, and generalised word frequency
In other words, there are quite a lot of variables to choose from! How can we select just SOME of these variables?
select using ranges
We can select a subset of the variables using ranges with the : operator. If we want to select the 9th through 11th column, we can do this:
We can also exploit naming conventions in the columns to choose them. Let’s pretend we want to only have the columns that start with num. One way to do this is to use the starts_with function:
# selects all columns which names starting with `num`select(dat, starts_with('num'))
You might not know what these are yet, but we can also use regular expressions to really have control over the way we select different columns. Regular expressions are a way to search for abstract patterns ranging from strings to complex metacharacters.
To use regular expressions, we can use the matches() function:
Note here how I get the same results as the contains() function - both of them are by default not case sensitive.
We can also use regular expression characters to have more control. For example, the $ means ‘end of string’, so 'num$' would find any string which ends in 'num'
What if we wanted only the columns that are in ALL CAPS? This requires a bit of muckery…i define the start ^, a square brackets bucket of any capital letter [A-Z], an anchor saying ‘zero or more’ *, and the end of the string $.
So this is saying: select any column that, between the start and the end of the string, contains any number of capital letters (and only capital letters).
Remember how to use across()? We can know use our knowledge of how to select columns to rapidly obtain summary statistics of columns now without having to necessarily type all their names and/or remember their locations:
# obtain mean of all numeric columns except for conditionsummarise(dat, across(.cols =!condition &where(is.numeric), mean))