if_else & case_when()

library(tidyverse)
Let’s work more with mutate.
Download the same metaphor_data.csv file from https://osf.io/qrc6b/
Create a new object named met.data that holds the
result of calling read_csv() on the metaphor data.
met.data <- read_csv('https://www.stephenskalicky.com/r_data/metaphor_data.csv')
## Rows: 1304 Columns: 28
## ── Column specification ──────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): metaphor_id, response, met_type, sex, hand, language_group
## dbl (22): subject, conceptual, nm, trial_order, met_stim, met_RT, age, colle...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Create a new object named met.data.2 from met.data. Add a
select() call to your pipe so that met.data.2
only has the following columns: subject, age, englishAgeofOnset, and
collegeYear. Finally, use the unique() function so that
each subject only has one row.
met.data.2 <- met.data %>%
  dplyr::select(subject, age, collegeYear, englishAgeofOnset) %>%
  unique()
Let’s look at englishAgeofOnset first.
This is the age at which participants began learning English. Look at a
summary() of the variable. What do you notice?
# there are zeros - what could that mean?
summary(met.data.2$englishAgeofOnset)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 2.000 4.262 9.000 17.000
Now that you know what mutate does, let’s use it to
calculate a new variable. We want to get an idea of how long someone has
been learning/using English, even if they are native speakers. Using the
existing variables in met.data.2, how could we do that?
(Create a new object named met.data.3 from
met.data.2, then use mutate() to create a new
variable in met.data.3 named totalEnglish
which is a measure of the total number of years each participant has
been learning/using English.)
met.data.3 <- met.data.2 %>%
  mutate(totalEnglish = age - englishAgeofOnset)
Create a new object named met.data.4 from
met.data.3, and then use mutate() to create a
new variable named englishPercent which is the percentage of
one’s total life spent using/learning English. The resulting variable
should be represented as percentages (i.e., numbers from 0.00 to
100.00), rather than decimals (e.g., 0.1, 0.5, 1.0).
met.data.4 <- met.data.3 %>%
  mutate(englishPercent = (totalEnglish / age) * 100)
Create a new object named met.data.5 from met.data.4,
and then use mutate() to create a new variable named
ENG_Group. Use if_else() within your mutate
call to assign participants to one of two groups: “NES” or “NNES”.
You’ll need to choose which variable and condition you want to use in
your if_else() function!
met.data.5 <- met.data.4 %>%
  mutate(ENG_Group = if_else(englishPercent == 100, 'NES', 'NNES'))
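The condition inside if_else() could also be written against englishAgeofOnset directly. The sketch below uses a small made-up tibble (toy and its values are hypothetical, not the metaphor data) to suggest that both conditions produce the same groups, and that if_else() returns NA when the condition itself is NA:

```r
library(dplyr)

# Hypothetical toy data, not the real metaphor data
toy <- tibble(
  subject = 1:4,
  age = c(20, 30, 25, 40),
  englishAgeofOnset = c(0, 10, NA, 0)
)

toy.groups <- toy %>%
  mutate(
    englishPercent = (age - englishAgeofOnset) / age * 100,
    # condition on the derived percentage, as in the exercise
    group.percent = if_else(englishPercent == 100, 'NES', 'NNES'),
    # same idea, conditioning on the age of onset directly
    group.onset = if_else(englishAgeofOnset == 0, 'NES', 'NNES')
  )

toy.groups
```

Subject 3 has a missing age of onset, so both group columns are NA for that row rather than 'NES' or 'NNES'.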
Create a new object named nnes.summary from met.data.5.
Then use summarise() to provide some descriptive statistics
about the NNES. Get the mean and SD of relevant values, as well as the
max and min.
nnes.summary <- met.data.5 %>%
  group_by(ENG_Group) %>%
  summarise(mean.english = mean(englishPercent),
            sd.english = sd(englishPercent),
            min.english = min(englishPercent),
            max.english = max(englishPercent))
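The solution above summarises both groups side by side. If only the NNES rows are wanted, one option is to filter() before summarising. This is a sketch on made-up data (toy is a hypothetical stand-in for met.data.5), not the worksheet's own answer:

```r
library(dplyr)

# Hypothetical toy data standing in for met.data.5
toy <- tibble(
  ENG_Group = c('NES', 'NNES', 'NNES', 'NES'),
  englishPercent = c(100, 50, 70, 100)
)

toy.nnes.summary <- toy %>%
  filter(ENG_Group == 'NNES') %>%   # keep only the NNES rows
  summarise(mean.english = mean(englishPercent),
            sd.english = sd(englishPercent),
            min.english = min(englishPercent),
            max.english = max(englishPercent))

toy.nnes.summary
```

The result has a single row, because only one group remains after filtering.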
if_else() is really handy for things like this, but it
only allows for two possibilities: whether something is true or false.
What if we have a lot of different values we’d like to create that
depend on multiple conditions? Among the many options, we can use
case_when(). This function is similar to
if_else() in that it evaluates whether a cell meets a
certain condition and then acts, but differs in that, unlike
if_else(), case_when() only assigns a result when the
condition is true. In this way, you can chain a series of
conditions together inside case_when() to make many different
changes. A cell that meets none of the conditions becomes NA, unless you
end with a catch-all condition such as TRUE ~ variable.

The syntax is also different. case_when() uses what is called
formula notation, and is in the form of
case_when(condition ~ result). For example, if you wanted
to turn all values of NA into 0, you could use
case_when(is.na(variable) ~ 0, TRUE ~ variable). (Testing
variable == NA does not work, because any comparison with NA returns NA
rather than TRUE.) You can put multiple
conditions and results inside a single case_when()
function:
case_when(is.na(variable) ~ 0, variable == 1 ~ NA, etc...).
Your condition can also include more than one variable:
case_when(variable1 == value & variable2 != value ~ result)
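To make the behaviour of case_when() concrete, here is a minimal sketch on a made-up vector (x and the labels are hypothetical). It illustrates three things: comparing with == NA never yields TRUE, values that match no condition come back as NA, and a final TRUE condition acts as a catch-all:

```r
library(dplyr)

# Hypothetical vector, for illustration only
x <- c(NA, 1, 5, 12)

# x == NA is NA for every element, so it can never be TRUE
na.test <- x == NA

# Conditions are checked in order; 12 matches none of them
no.fallback <- case_when(
  is.na(x) ~ 'missing',
  x < 10   ~ 'small'
)

# A final TRUE condition catches everything left over
with.fallback <- case_when(
  is.na(x) ~ 'missing',
  x < 10   ~ 'small',
  TRUE     ~ 'large'
)

no.fallback
with.fallback
```

Without the catch-all, the last element is NA; with it, the last element is 'large'.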
Create a new object named met.data.6 from
met.data.5. Then, create a new variable named
age_group and assign the following values using
mutate() and case_when(): 'lower' for ages under 21, 'middle' for ages
21 to 40, and 'higher' for ages over 40.
met.data.6 <- met.data.5 %>%
  mutate(age_group = case_when(age < 21 ~ 'lower',
                               age > 20 & age < 41 ~ 'middle',
                               age > 40 ~ 'higher'))
Check that the groups were assigned as expected. Calling summary() on
as.factor() is a quick way to check this.
summary(as.factor(met.data.6$age_group))
## higher lower middle
## 3 20 38
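Another way to get the same kind of tally is dplyr's count(), which returns the counts as a tibble rather than a named vector. This is a sketch on made-up values (toy is hypothetical, not met.data.6):

```r
library(dplyr)

# Hypothetical toy data, not the real met.data.6
toy <- tibble(
  age_group = c('lower', 'middle', 'middle', 'higher', 'lower', 'middle')
)

# One row per group, with the number of rows in a column named n
toy.counts <- toy %>%
  count(age_group)

toy.counts
```

Because the result is a tibble, it can be piped straight into further dplyr steps.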
Now let’s look at collegeYear. The numbers in
collegeYear correspond to answers on a demographic
survey:
1: First-year undergraduate
2: Second-year undergraduate
3: Third-year undergraduate
4: Fourth-year undergraduate
5: Fifth-year undergraduate
6: MA Student
7: PhD Student
You can see why numbers are easier to write in the data! Well, let’s
imagine we want to create some smaller categories. Create a new object
named met.data.7 from met.data.6 and then use
mutate() with case_when() to create a new
variable named studentLevel. Group the subjects into three
categories: “early UG”, “late UG”, and “PG” based on their college
year.
met.data.7 <- met.data.6 %>%
  mutate(studentLevel = case_when(collegeYear < 3 ~ 'early UG',
                                  collegeYear > 2 & collegeYear < 6 ~ 'late UG',
                                  collegeYear > 5 ~ 'PG'))
Create a new object named met.data.8 from
met.data.7. Then use a pipe to create a new variable named
super.status. This variable will assign participants to a
category based on two features: their English group (NES or NNES) and
whether they are an undergraduate (UG) or postgraduate (PG) student.
There are four values for super.status:
NNES-UG, NES-UG, NNES-PG, NES-PG
Use mutate(), case_when(), and your new
variables.
met.data.8 <- met.data.7 %>%
  mutate(super.status = case_when(ENG_Group == 'NES' & studentLevel == 'early UG' ~ 'NES-UG',
                                  ENG_Group == 'NES' & studentLevel == 'late UG' ~ 'NES-UG',
                                  ENG_Group == 'NES' & studentLevel == 'PG' ~ 'NES-PG',
                                  ENG_Group == 'NNES' & studentLevel == 'early UG' ~ 'NNES-UG',
                                  ENG_Group == 'NNES' & studentLevel == 'late UG' ~ 'NNES-UG',
                                  ENG_Group == 'NNES' & studentLevel == 'PG' ~ 'NNES-PG'))
The way I did this was silly - using the collegeYear
variable means you could do this in four lines instead of six - but
it still works!
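One possible reading of that four-line version is sketched below. It is an assumption about what the shorter solution might look like, shown on made-up data (toy is a hypothetical stand-in for met.data.7), and it relies on the collegeYear coding given earlier, where 1 to 5 are undergraduate years and 6 and 7 are postgraduate:

```r
library(dplyr)

# Hypothetical toy data standing in for met.data.7
toy <- tibble(
  ENG_Group = c('NES', 'NES', 'NNES', 'NNES'),
  collegeYear = c(2, 6, 4, 7)
)

# Years 1-5 are undergraduate, 6-7 are postgraduate
toy.status <- toy %>%
  mutate(super.status = case_when(
    ENG_Group == 'NES'  & collegeYear < 6 ~ 'NES-UG',
    ENG_Group == 'NES'  & collegeYear > 5 ~ 'NES-PG',
    ENG_Group == 'NNES' & collegeYear < 6 ~ 'NNES-UG',
    ENG_Group == 'NNES' & collegeYear > 5 ~ 'NNES-PG'
  ))

toy.status
```

Collapsing 'early UG' and 'late UG' into a single collegeYear < 6 test is what removes the two extra conditions.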