library(tidyverse)

Let’s work more with mutate.

  1. Download the same metaphor_data.csv file from https://osf.io/qrc6b/

  2. Create a new object named met.data which is the result of calling read_csv() on the metaphor data.

met.data <- read_csv('https://www.stephenskalicky.com/r_data/metaphor_data.csv')
## Rows: 1304 Columns: 28
## ── Column specification ────────────────────────────────────────────
## Delimiter: ","
## chr  (6): metaphor_id, response, met_type, sex, hand, language_group
## dbl (22): subject, conceptual, nm, trial_order, met_stim, met_RT, age, colle...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  3. Using a single pipe, make a new object called met.data.2 from met.data. Add a select() call to your pipe so that met.data.2 only has the following columns: subject, age, englishAgeofOnset, and collegeYear. Finally, use the unique() function so that each subject only has one row.
met.data.2 <- met.data %>%
  dplyr::select(subject, age, collegeYear, englishAgeofOnset) %>%
  unique()
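
As a small sketch of why unique() is needed after select(), the made-up tibble here (not the real metaphor data) has several rows per subject. Dropping the trial-level column leaves duplicated rows, and unique() collapses them. The names toy, toy.unique, and toy.distinct are invented for illustration; distinct() is the dplyr verb that does a similar job.

```r
library(dplyr)

# Made-up long-format data: two trials per subject (NOT the real metaphor data)
toy <- tibble(
  subject = c(1, 1, 2, 2),
  trial   = c(1, 2, 1, 2),
  age     = c(20, 20, 35, 35)
)

# After dropping the trial column, each subject's rows are identical,
# so unique() keeps one row per subject
toy.unique <- toy %>%
  select(subject, age) %>%
  unique()

# distinct() keeps the unique combinations of the named columns
toy.distinct <- toy %>%
  distinct(subject, age)
```

Both objects should end up with one row per subject.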
  4. Let’s look at the variable englishAgeofOnset first. This is the age that participants began learning English. Look at a summary() of the variable - what do you notice?
# there are zeros - what could that mean?
summary(met.data.2$englishAgeofOnset)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   2.000   4.262   9.000  17.000
  5. To remind ourselves what mutate does, let’s use it to calculate a new variable. We want to get an idea of how long someone has been learning/using English, even if they are native speakers. Using the existing variables in met.data.2, how could we do that? (Create a new object named met.data.3 from met.data.2, then use mutate() to create a new variable in met.data.3 named totalEnglish which is a measure of the total number of years each participant has been learning/using English.)
met.data.3 <- met.data.2 %>%
  mutate(totalEnglish = age - englishAgeofOnset)
  6. Okay, not bad. Let’s now try to turn this number into a percentage. Create a new object named met.data.4 from met.data.3, and then use mutate() to create a new variable named englishPercent which is a percentage of one’s total life spent using/learning English. The resulting variable should be represented as percentages (i.e., numbers from 0.00 to 100.00), rather than decimals (e.g., 0.1, 0.5, 1.0, etc.).
met.data.4 <- met.data.3 %>%
  mutate(englishPercent = (totalEnglish/age)*100)
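
The two mutate() steps above can also be written as a single call, because a column created earlier in a mutate() is available to later expressions in the same call. This is a sketch on made-up values; the tibble toy and its numbers are invented for illustration.

```r
library(dplyr)

# Made-up participants (NOT the real metaphor data)
toy <- tibble(
  age               = c(20, 30, 40),
  englishAgeofOnset = c(0, 15, 10)
)

# totalEnglish is created first, then reused on the next line
toy.2 <- toy %>%
  mutate(totalEnglish   = age - englishAgeofOnset,
         englishPercent = (totalEnglish / age) * 100)
```

With these values, a participant aged 30 who started English at 15 has totalEnglish of 15 and englishPercent of 50.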
  7. Hopefully you have figured out what these values say about the participants now. Let’s go ahead and create a new variable which marks participants as either native or non-native English speakers. Create a new object named met.data.5 from met.data.4, and then use mutate() to create a new variable named ENG_Group. Use if_else() within your mutate call to assign participants to one of two groups: “NES” or “NNES”. You’ll need to choose which variable and condition you want to use in your if_else() function!
met.data.5 <- met.data.4 %>%
  mutate(ENG_Group = if_else(englishPercent == 100, 'NES', 'NNES'))
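
The task leaves the choice of variable and condition open, so here is one alternative sketch on made-up data. It assumes that an englishAgeofOnset of 0 marks a native speaker, which is an interpretation rather than something the data documents. The tibble toy is invented for illustration.

```r
library(dplyr)

# Made-up onset ages (NOT the real metaphor data)
toy <- tibble(englishAgeofOnset = c(0, 5, 0, 12))

# if_else(condition, value if TRUE, value if FALSE)
# Assumption: an onset age of 0 means the participant is a native speaker
toy.2 <- toy %>%
  mutate(ENG_Group = if_else(englishAgeofOnset == 0, 'NES', 'NNES'))
```

Under that assumption this gives the same grouping as testing englishPercent == 100, since the percentage is 100 exactly when the onset age is 0.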
  8. You should now be able to say something about the NNES group - what is their average percentage of English learning/using? Create a new object named nnes.summary from met.data.5. Then use summarise() to provide some descriptive statistics about the NNES. Get the mean and SD of relevant values, as well as the max and min.
nnes.summary <- met.data.5 %>%
  group_by(ENG_Group) %>%
  summarise(mean.english = mean(englishPercent), 
            sd.english = sd(englishPercent),
            min.english = min(englishPercent),
            max.english = max(englishPercent))
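
The solution above uses group_by(), so the summary has one row for each group. If only the NNES row is wanted, one option is to filter() first. The sketch uses made-up values, and the names toy and toy.summary are invented for illustration.

```r
library(dplyr)

# Made-up participants (NOT the real metaphor data)
toy <- tibble(
  ENG_Group      = c('NES', 'NNES', 'NNES', 'NNES'),
  englishPercent = c(100, 50, 60, 70)
)

# Keep only the NNES rows, then summarise them into a single row
toy.summary <- toy %>%
  filter(ENG_Group == 'NNES') %>%
  summarise(mean.english = mean(englishPercent),
            sd.english   = sd(englishPercent),
            min.english  = min(englishPercent),
            max.english  = max(englishPercent))
```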
  9. if_else() is really handy for things like this, but it only allows for two possibilities - whether something is true or false. What if we have a lot of different values we’d like to create that depend on multiple conditions? Among the many options, we can use case_when(). This function is similar to if_else() in that it evaluates whether a cell meets a certain condition and then acts, but differs in that it checks its conditions in order and uses the result of the first one that is true; rows that match no condition become NA. In this way, you can list a series of conditions inside a single case_when() call to make many different changes.

The syntax is also different. case_when() uses what is called formula notation, in the form case_when(condition ~ result). For example, if you wanted to turn all values of NA into 0, you could use case_when(is.na(variable) ~ 0, TRUE ~ variable). Missing values have to be tested with is.na(), because variable == NA never evaluates to TRUE, and the final TRUE ~ variable keeps every other value unchanged. You can put multiple conditions and results inside a single case_when() function: case_when(is.na(variable) ~ 0, variable == 1 ~ NA, etc...).

Your condition can also include more than one variable: case_when(variable1 == value & variable2 != value ~ result)
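
Here is a minimal sketch of this syntax on made-up data. The tibble toy and the columns score, group, score_clean, and label are invented for illustration. It shows a missing-value test with is.na() (a comparison such as score == NA returns NA rather than TRUE), a catch-all TRUE ~ branch, and a condition that combines two variables.

```r
library(dplyr)

# Made-up data (NOT the real metaphor data)
toy <- tibble(
  score = c(NA, 1, 5, 9),
  group = c('a', 'a', 'b', 'a')
)

toy.2 <- toy %>%
  mutate(
    # Replace missing scores with 0; TRUE ~ score keeps all other values
    score_clean = case_when(is.na(score) ~ 0,
                            TRUE ~ score),
    # Conditions can combine variables; rows matching no condition get NA
    label = case_when(group == 'a' & score_clean < 3 ~ 'a-low',
                      group == 'b' & score_clean >= 3 ~ 'b-high')
  )
```

The last row (group 'a', score 9) matches neither condition, so its label is NA.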

Create a new object named met.data.6 from met.data.5. Then, create a new variable named age_group and assign the following values using mutate() and case_when(): “lower” for ages under 21, “middle” for ages 21 to 40, and “higher” for ages over 40.

met.data.6 <- met.data.5 %>%
  mutate(age_group = case_when(age < 21 ~ 'lower',
                               age > 20 & age < 41 ~ 'middle',
                               age > 40 ~ 'higher'))
  10. How many people are in each age group? Hint: as.factor() is a quick way to check this.
summary(as.factor(met.data.6$age_group))
## higher  lower middle 
##      3     20     38
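
Another way to get the same tally is dplyr’s count(), which returns the counts as a tibble that can be piped further. This is a sketch on made-up values; the names toy and toy.counts are invented for illustration.

```r
library(dplyr)

# Made-up age groups (NOT the real metaphor data)
toy <- tibble(age_group = c('lower', 'middle', 'middle', 'higher', 'middle'))

# count() returns one row per value, with the tally in a column named n
toy.counts <- toy %>%
  count(age_group)
```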
  11. Okay, now that you have some practice, time to work on the final variable: collegeYear. The numbers in collegeYear correspond to answers on a demographic survey:

You can see why numbers are easier to write in the data! Well, let’s imagine we want to create some smaller categories. Create a new object named met.data.7 from met.data.6 and then use mutate() with case_when() to create a new variable named studentLevel. Group the subjects into three categories: “early UG”, “late UG”, and “PG” based on their college year.

met.data.7 <- met.data.6 %>%
  mutate(studentLevel = case_when(collegeYear < 3 ~ 'early UG',
                                  collegeYear > 2 & collegeYear < 6 ~ 'late UG',
                                  collegeYear > 5 ~ 'PG'))
  12. Finally, create a new object named met.data.8 from met.data.7. Then use a pipe to create a new variable named super.status. This variable will assign participants to a category based on two features:

There are four values for super.status: NNES-UG, NES-UG, NNES-PG, NES-PG

Use mutate(), case_when(), and your new variables.

met.data.8 <- met.data.7 %>%
  mutate(super.status = case_when(ENG_Group == 'NES' & studentLevel == 'early UG' ~ 'NES-UG',
                                  ENG_Group == 'NES' & studentLevel == 'late UG' ~ 'NES-UG',
                                  ENG_Group == 'NES' & studentLevel == 'PG' ~ 'NES-PG',
                                  ENG_Group == 'NNES' & studentLevel == 'early UG' ~ 'NNES-UG',
                                  ENG_Group == 'NNES' & studentLevel == 'late UG' ~ 'NNES-UG',
                                  ENG_Group == 'NNES' & studentLevel == 'PG' ~ 'NNES-PG'))

The way I did this was silly - using the collegeYear variable means you could do this in four lines instead of six - but it still works!
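
One possible reading of that four-line version is sketched here on made-up data. It assumes the same cut-off as the studentLevel step, where a collegeYear above 5 counts as PG. The tibble toy is invented for illustration.

```r
library(dplyr)

# Made-up participants (NOT the real metaphor data)
toy <- tibble(
  ENG_Group   = c('NES', 'NES', 'NNES', 'NNES'),
  collegeYear = c(1, 6, 4, 7)
)

# Assumption: collegeYear 1-5 is undergraduate, 6 and above is postgraduate
toy.2 <- toy %>%
  mutate(super.status = case_when(ENG_Group == 'NES' & collegeYear < 6 ~ 'NES-UG',
                                  ENG_Group == 'NES' & collegeYear > 5 ~ 'NES-PG',
                                  ENG_Group == 'NNES' & collegeYear < 6 ~ 'NNES-UG',
                                  ENG_Group == 'NNES' & collegeYear > 5 ~ 'NNES-PG'))
```

Testing collegeYear directly removes the need for separate “early UG” and “late UG” branches, since both map to the same UG label.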

  13. What else can we do with this data? Any other mutates or transformations you would like to try?