select

selecting many columns

What happens if we have a large number of columns in a dataframe that we want to apply similar functions to? We learned that across() is one way to map one or more functions to various columns. But we still need to provide a vector of columns to across() in order to use it - and if we have hundreds of columns, we would like to avoid typing in the name of each column.

Let’s explore some different ways to select columns based on a range of different criteria.

You can find a full list here: using select

The first thing to get familiar with is a dataset that might be used for such purposes.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Let’s load in some real data I used for a paper during my PhD. You can get the data with this URL

dat <- read_csv('https://raw.githubusercontent.com/scskalicky/scskalicky.github.io/refs/heads/main/sample_dat/linguistic_features.csv')
Rows: 80 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): headline, filename, condition
dbl (16): conditionNum, MLC, numContenWords, numWords, numFunctionWords, MRC...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

What are the column names? We can use colnames(df) to see what we are working with:

colnames(dat)
 [1] "headline"              "filename"              "conditionNum"         
 [4] "condition"             "MLC"                   "numContenWords"       
 [7] "numWords"              "numFunctionWords"      "MRC_Familiarity"      
[10] "MRC_Concreteness"      "MRC_Imagability"       "AOA"                  
[13] "VADER"                 "COCA_new_Frequency_CW" "familiarCS"           
[16] "understandingCS"       "positiveCS"            "sincereCS"            
[19] "funnyCS"              

So we have variables like headline and filename which are metadata about the stimuli used in the experiment, we also have condition and conditionNum which represent experimental conditions (no idea why I felt the need to have these as numbers and a factor, probably leftover SPSS in my blood).

The rest of the numbers are different linguistic features!

  • We have some variables that start with num, which represent counts of different things (there is also MLC - mean length of clause)
  • There are other variables that start with MRC, which represent scores from psycholinguistic lexicons
  • There are also variables that end in CS, which I cannot remember what CS means, but these are either total or average survey ratings for these variables
  • Finally there is an AOA, VADER, and COCA variable representing age of acquisition, VADER sentiment, and generalised word frequency

In other words, there are quite a lot of variables to choose from! How can we select just SOME of these variables?

select using ranges

We can select a subset of the variables using ranges with the : operator. If we want to select the 9th through 11th column, we can do this:

select(dat, 9:11)
# A tibble: 80 × 3
   MRC_Familiarity MRC_Concreteness MRC_Imagability
             <dbl>            <dbl>           <dbl>
 1            559.             429             482.
 2            560.             460.            500.
 3            541.             391.            444 
 4            533.             439.            488.
 5            549.             388.            460.
 6            553              479             516.
 7            542.             380.            421.
 8            559.             380             416.
 9            579              389.            436.
10            472.             618             456.
# ℹ 70 more rows

We can do effectively the same thing using the names of the variables themselves

select(dat, MRC_Familiarity:MRC_Imagability)
# A tibble: 80 × 3
   MRC_Familiarity MRC_Concreteness MRC_Imagability
             <dbl>            <dbl>           <dbl>
 1            559.             429             482.
 2            560.             460.            500.
 3            541.             391.            444 
 4            533.             439.            488.
 5            549.             388.            460.
 6            553              479             516.
 7            542.             380.            421.
 8            559.             380             416.
 9            579              389.            436.
10            472.             618             456.
# ℹ 70 more rows

using ! as a negator

The ! functions as a “not”. So if you wanted to select almost all of the columns, it might be faster to say which columns you do not want:

# removes headline and filename columns
select(dat, !headline, !filename)
# A tibble: 80 × 19
   filename                conditionNum condition    MLC numContenWords numWords
   <chr>                          <dbl> <chr>      <dbl>          <dbl>    <dbl>
 1 atten03_revision.txt               1 atten         13              9       13
 2 saturation03.txt                   5 saturation    10              8       10
 3 control39.txt                      2 control       13              9       13
 4 control03.txt                      2 control       10              7        9
 5 control10.txt                      2 control        9              7        9
 6 atten01_revision.txt               1 atten          9              7        9
 7 control05.txt                      2 control       12              8       12
 8 control06.txt                      2 control       15             11       15
 9 control02.txt                      2 control       11              9       11
10 metaphor01_revision.txt            3 metaphor       7              6        7
# ℹ 70 more rows
# ℹ 13 more variables: numFunctionWords <dbl>, MRC_Familiarity <dbl>,
#   MRC_Concreteness <dbl>, MRC_Imagability <dbl>, AOA <dbl>, VADER <dbl>,
#   COCA_new_Frequency_CW <dbl>, familiarCS <dbl>, understandingCS <dbl>,
#   positiveCS <dbl>, sincereCS <dbl>, funnyCS <dbl>, headline <chr>

Can do the same thing with -

# removes headline and filename columns
select(dat, -headline, -filename)
# A tibble: 80 × 17
   conditionNum condition    MLC numContenWords numWords numFunctionWords
          <dbl> <chr>      <dbl>          <dbl>    <dbl>            <dbl>
 1            1 atten         13              9       13                4
 2            5 saturation    10              8       10                2
 3            2 control       13              9       13                4
 4            2 control       10              7        9                2
 5            2 control        9              7        9                2
 6            1 atten          9              7        9                2
 7            2 control       12              8       12                4
 8            2 control       15             11       15                4
 9            2 control       11              9       11                2
10            3 metaphor       7              6        7                1
# ℹ 70 more rows
# ℹ 11 more variables: MRC_Familiarity <dbl>, MRC_Concreteness <dbl>,
#   MRC_Imagability <dbl>, AOA <dbl>, VADER <dbl>, COCA_new_Frequency_CW <dbl>,
#   familiarCS <dbl>, understandingCS <dbl>, positiveCS <dbl>, sincereCS <dbl>,
#   funnyCS <dbl>

We can extend the ! to ranges as well

# selects all columns but headline and three MRC variables
select(dat, !headline,  !MRC_Familiarity:MRC_Imagability)
# A tibble: 80 × 19
   filename                conditionNum condition    MLC numContenWords numWords
   <chr>                          <dbl> <chr>      <dbl>          <dbl>    <dbl>
 1 atten03_revision.txt               1 atten         13              9       13
 2 saturation03.txt                   5 saturation    10              8       10
 3 control39.txt                      2 control       13              9       13
 4 control03.txt                      2 control       10              7        9
 5 control10.txt                      2 control        9              7        9
 6 atten01_revision.txt               1 atten          9              7        9
 7 control05.txt                      2 control       12              8       12
 8 control06.txt                      2 control       15             11       15
 9 control02.txt                      2 control       11              9       11
10 metaphor01_revision.txt            3 metaphor       7              6        7
# ℹ 70 more rows
# ℹ 13 more variables: numFunctionWords <dbl>, MRC_Familiarity <dbl>,
#   MRC_Concreteness <dbl>, MRC_Imagability <dbl>, AOA <dbl>, VADER <dbl>,
#   COCA_new_Frequency_CW <dbl>, familiarCS <dbl>, understandingCS <dbl>,
#   positiveCS <dbl>, sincereCS <dbl>, funnyCS <dbl>, headline <chr>

selecting based on their names:

We can also exploit naming conventions in the columns to choose them. Let’s pretend we want to only have the columns that start with num. One way to do this is to use the starts_with function:

# selects all columns which names starting with `num`
select(dat, starts_with('num'))
# A tibble: 80 × 3
   numContenWords numWords numFunctionWords
            <dbl>    <dbl>            <dbl>
 1              9       13                4
 2              8       10                2
 3              9       13                4
 4              7        9                2
 5              7        9                2
 6              7        9                2
 7              8       12                4
 8             11       15                4
 9              9       11                2
10              6        7                1
# ℹ 70 more rows

The complement of starts_with is ends_with - we could use this to select all of the columns that end in CS:

# select all columns with names ending with "CS"
select(dat, ends_with("CS"))
# A tibble: 80 × 5
   familiarCS understandingCS positiveCS sincereCS funnyCS
        <dbl>           <dbl>      <dbl>     <dbl>   <dbl>
 1      26.5             58.8      17.6       26.5   17.6 
 2      25               91.7       0         66.7    4.17
 3      50               94.4      86.1       86.1    0   
 4      19.2             92.3       3.85      76.9    7.69
 5      29.4             79.4       2.94      76.5    2.94
 6       7.69            61.5      23.1       11.5   57.7 
 7      46.7             86.7       3.33      83.3    0   
 8      37.5             95.8       4.17      62.5    0   
 9      17.2             82.8      41.4       89.7    0   
10      17.1             65.7      17.1       11.4   74.3 
# ℹ 70 more rows

How else can we do this? We can try using the contains function:

But of course we get something a bit more than what we want…what is going on here?

# select all columns that have the string 'num' 
select(dat, contains('num'))
# A tibble: 80 × 4
   conditionNum numContenWords numWords numFunctionWords
          <dbl>          <dbl>    <dbl>            <dbl>
 1            1              9       13                4
 2            5              8       10                2
 3            2              9       13                4
 4            2              7        9                2
 5            2              7        9                2
 6            1              7        9                2
 7            2              8       12                4
 8            2             11       15                4
 9            2              9       11                2
10            3              6        7                1
# ℹ 70 more rows
select(dat, contains("CS"))
# A tibble: 80 × 5
   familiarCS understandingCS positiveCS sincereCS funnyCS
        <dbl>           <dbl>      <dbl>     <dbl>   <dbl>
 1      26.5             58.8      17.6       26.5   17.6 
 2      25               91.7       0         66.7    4.17
 3      50               94.4      86.1       86.1    0   
 4      19.2             92.3       3.85      76.9    7.69
 5      29.4             79.4       2.94      76.5    2.94
 6       7.69            61.5      23.1       11.5   57.7 
 7      46.7             86.7       3.33      83.3    0   
 8      37.5             95.8       4.17      62.5    0   
 9      17.2             82.8      41.4       89.7    0   
10      17.1             65.7      17.1       11.4   74.3 
# ℹ 70 more rows

using regular expressions

You might not know what these are yet, but we can also use regular expressions to really have control over the way we select different columns. Regular expressions are a way to search for abstract patterns ranging from strings to complex metacharacters.

To use regular expressions, we can use the matches() function:

Note here how I get the same results as the contains() function - both of them are by default not case sensitive.

select(dat, matches('num'))
# A tibble: 80 × 4
   conditionNum numContenWords numWords numFunctionWords
          <dbl>          <dbl>    <dbl>            <dbl>
 1            1              9       13                4
 2            5              8       10                2
 3            2              9       13                4
 4            2              7        9                2
 5            2              7        9                2
 6            1              7        9                2
 7            2              8       12                4
 8            2             11       15                4
 9            2              9       11                2
10            3              6        7                1
# ℹ 70 more rows

We can turn this off using the ignore.case argument, by setting it to FALSE:

select(dat, matches('num', ignore.case=F))
# A tibble: 80 × 3
   numContenWords numWords numFunctionWords
            <dbl>    <dbl>            <dbl>
 1              9       13                4
 2              8       10                2
 3              9       13                4
 4              7        9                2
 5              7        9                2
 6              7        9                2
 7              8       12                4
 8             11       15                4
 9              9       11                2
10              6        7                1
# ℹ 70 more rows

We can also use regular expression characters to have more control. For example, the $ means ‘end of string’, so 'num$' would find any string which ends in 'num'

select(dat, matches('num$'))
# A tibble: 80 × 1
   conditionNum
          <dbl>
 1            1
 2            5
 3            2
 4            2
 5            2
 6            1
 7            2
 8            2
 9            2
10            3
# ℹ 70 more rows

And the ^ means ‘start of string’, so '^num' means any string that starts with 'num'

select(dat, matches('^num'))
# A tibble: 80 × 3
   numContenWords numWords numFunctionWords
            <dbl>    <dbl>            <dbl>
 1              9       13                4
 2              8       10                2
 3              9       13                4
 4              7        9                2
 5              7        9                2
 6              7        9                2
 7              8       12                4
 8             11       15                4
 9              9       11                2
10              6        7                1
# ℹ 70 more rows

What if we wanted only the columns that are in ALL CAPS? This requires a bit of muckery…i define the start ^, a square brackets bucket of any capital letter [A-Z], an anchor saying ‘zero or more’ *, and the end of the string $.

So this is saying: select any column that, between the start and the end of the string, contains any number of capital letters (and only capital letters).

select(dat, matches('^[A-Z]*$', ignore.case = F))
# A tibble: 80 × 3
     MLC   AOA VADER
   <dbl> <dbl> <dbl>
 1    13  6.11 -0.1 
 2    10  6.22 -0.6 
 3    13  6.45  0   
 4    10  6.75  0   
 5     9  7.6  -0.03
 6     9  7.98  0.57
 7    12  7.43 -0.42
 8    15  6.87 -0.4 
 9    11  6.4  -0.42
10     7  9.44  0   
# ℹ 70 more rows

selecting with functions

We can use the where() function to apply a conditional test to the function using other function such as is.numeric()

This can be particularly handy when you know you want to summarise all the numerical data in your dataframe:

select(dat, where(is.numeric))
# A tibble: 80 × 16
   conditionNum   MLC numContenWords numWords numFunctionWords MRC_Familiarity
          <dbl> <dbl>          <dbl>    <dbl>            <dbl>           <dbl>
 1            1    13              9       13                4            559.
 2            5    10              8       10                2            560.
 3            2    13              9       13                4            541.
 4            2    10              7        9                2            533.
 5            2     9              7        9                2            549.
 6            1     9              7        9                2            553 
 7            2    12              8       12                4            542.
 8            2    15             11       15                4            559.
 9            2    11              9       11                2            579 
10            3     7              6        7                1            472.
# ℹ 70 more rows
# ℹ 10 more variables: MRC_Concreteness <dbl>, MRC_Imagability <dbl>,
#   AOA <dbl>, VADER <dbl>, COCA_new_Frequency_CW <dbl>, familiarCS <dbl>,
#   understandingCS <dbl>, positiveCS <dbl>, sincereCS <dbl>, funnyCS <dbl>

chaining conditions

Notice above we still selected the condition column, because it is a number. What if we don’t want it?

You are able to make a pretty complex set of conditions for your select using &

select(dat, !condition & where(is.numeric))
# A tibble: 80 × 16
   conditionNum   MLC numContenWords numWords numFunctionWords MRC_Familiarity
          <dbl> <dbl>          <dbl>    <dbl>            <dbl>           <dbl>
 1            1    13              9       13                4            559.
 2            5    10              8       10                2            560.
 3            2    13              9       13                4            541.
 4            2    10              7        9                2            533.
 5            2     9              7        9                2            549.
 6            1     9              7        9                2            553 
 7            2    12              8       12                4            542.
 8            2    15             11       15                4            559.
 9            2    11              9       11                2            579 
10            3     7              6        7                1            472.
# ℹ 70 more rows
# ℹ 10 more variables: MRC_Concreteness <dbl>, MRC_Imagability <dbl>,
#   AOA <dbl>, VADER <dbl>, COCA_new_Frequency_CW <dbl>, familiarCS <dbl>,
#   understandingCS <dbl>, positiveCS <dbl>, sincereCS <dbl>, funnyCS <dbl>

You can also use | for ‘or’

select(dat, starts_with('A', ignore.case = F) | starts_with('C', ignore.case = F))
# A tibble: 80 × 2
     AOA COCA_new_Frequency_CW
   <dbl>                 <dbl>
 1  6.11                 479. 
 2  6.22                 182. 
 3  6.45                1287. 
 4  6.75                 426. 
 5  7.6                   83.9
 6  7.98                 142. 
 7  7.43                  97.9
 8  6.87                 415. 
 9  6.4                  326. 
10  9.44                  94.5
# ℹ 70 more rows

now that we know this, apply to across()

Remember how to use across()? We can know use our knowledge of how to select columns to rapidly obtain summary statistics of columns now without having to necessarily type all their names and/or remember their locations:

# obtain mean of all numeric columns except for condition
summarise(dat, across(.cols = !condition & where(is.numeric), mean))
# A tibble: 1 × 16
  conditionNum   MLC numContenWords numWords numFunctionWords MRC_Familiarity
         <dbl> <dbl>          <dbl>    <dbl>            <dbl>           <dbl>
1         2.62  10.2           8.38     10.6             2.24            562.
# ℹ 10 more variables: MRC_Concreteness <dbl>, MRC_Imagability <dbl>,
#   AOA <dbl>, VADER <dbl>, COCA_new_Frequency_CW <dbl>, familiarCS <dbl>,
#   understandingCS <dbl>, positiveCS <dbl>, sincereCS <dbl>, funnyCS <dbl>

I can get the mean of each UPPER CASED column using matches()

# obtain mean of all numeric columns except for condition
summarise(dat, across(.cols = matches('^[A-Z]*$', ignore.case = F), mean))
# A tibble: 1 × 3
    MLC   AOA   VADER
  <dbl> <dbl>   <dbl>
1  10.2  6.90 -0.0888

Can you…

  • calculate the mean and standard deviation of all columns that start with MRC and end with CS?
Code
summarise(dat, across(.cols = starts_with('MRC') | ends_with("CS"), 
                      .fns = list(mean=mean, sd = sd)))
# A tibble: 1 × 16
  MRC_Familiarity_mean MRC_Familiarity_sd MRC_Concreteness_mean
                 <dbl>              <dbl>                 <dbl>
1                 562.               23.5                  426.
# ℹ 13 more variables: MRC_Concreteness_sd <dbl>, MRC_Imagability_mean <dbl>,
#   MRC_Imagability_sd <dbl>, familiarCS_mean <dbl>, familiarCS_sd <dbl>,
#   understandingCS_mean <dbl>, understandingCS_sd <dbl>,
#   positiveCS_mean <dbl>, positiveCS_sd <dbl>, sincereCS_mean <dbl>,
#   sincereCS_sd <dbl>, funnyCS_mean <dbl>, funnyCS_sd <dbl>
  • report the lowest value in any column that starts with an uppercase letter?
Code
summarise(dat, across(.cols = matches('^[A-Z]', ignore.case = F), min))
# A tibble: 1 × 7
    MLC MRC_Familiarity MRC_Concreteness MRC_Imagability   AOA  VADER
  <dbl>           <dbl>            <dbl>           <dbl> <dbl>  <dbl>
1   4.5            472.                0             246  5.04 -0.791
# ℹ 1 more variable: COCA_new_Frequency_CW <dbl>