Code
library(tidyverse)
tidyverse
functionsThe goal of this activity is to see how to convert data from wide and long format using tidyverse
pivot functions.
First let’s consider the difference between long data and wide data. Wide data is where each row contains all the observations for a particular subject. Long data is where each row is an individual observation per subject. What is the difference?
library(tidyverse)
Wide Data - each row contains three observations per subjects
# A tibble: 10 × 4
subject pre post1 post2
<int> <dbl> <dbl> <dbl>
1 1 23.9 49.7 40.6
2 2 24.6 49.7 41.3
3 3 25.3 53.5 40.0
4 4 24.2 49.8 40.1
5 5 23.6 54.2 37.9
6 6 24.5 50.3 40.0
7 7 24.5 51.8 40.2
8 8 24.3 48.5 39.5
9 9 24.5 48.3 40.4
10 10 24.6 47.7 39.9
Long data - each row containts one observation per subject (for the first three participants)
# A tibble: 9 × 3
subject test score
<dbl> <chr> <dbl>
1 1 pre 56.2
2 1 post1 81.6
3 1 post2 29.6
4 2 pre 59.1
5 2 post1 24.0
6 2 post2 43.4
7 3 pre 79.6
8 3 post1 28.0
9 3 post2 63.3
Long data is typically the format you will want, but not always. Fortunately there is a method to swap back and forth between the two formats.
pivot_longer()
and pivot_wider()
are two methods to swap between these formats. They turn wide data to long, or long data to wide.
Let’s try to turn the object wide
into long format using pivot_longer()
There are four crucial arguments:
data
(the data object you are changing)cols
(the columns you want to manipulate)names_to
(the new name for the single column created by joining the columns named by cols
)values_to
(the new name for the column which includes the values from the columns named in cols
)It is a bit confusing to wrap your head around at first. Basically, you are telling R to combine some number of columns into a single column.
Let’s practice.
emotions
with four columns and three rows.subject
with the values 1, 2, 3first
with the values “happy”, “sad” “happy”second
with the values “sad”, “happy”, “happy”third
with the values “sad” , “sad”, “sad”<- tibble(subject = c(1,2,3),
emotions first = c('happy', 'sad', 'happy'),
second = c('sad', 'happy', 'happy'),
third = c('sad', 'sad', 'sad'))
Your emotion
tibble should look like this:
# A tibble: 3 × 4
subject first second third
<dbl> <chr> <chr> <chr>
1 1 happy sad sad
2 2 sad happy sad
3 3 happy happy sad
emotions
to long format using pivot_longer()
subject
, sequence
, and emotions
sequence
will contain the values first
, second
, or third
emotions
will contain the values happy
or sad
How to do this?
emotions_long
from emotions
.pivot_longer
to combine the columns named first
, second
, and third
into a single column.cols
wrapped within c()
.
values_to
and names_to
<- emotions %>%
emotions_long pivot_longer(cols = c(first, second, third),
values_to = "emotion",
names_to = "sequence")
If successful, your data should look like this
# A tibble: 9 × 3
subject sequence emotion
<dbl> <chr> <chr>
1 1 first happy
2 1 second sad
3 1 third sad
4 2 first sad
5 2 second happy
6 2 third sad
7 3 first happy
8 3 second happy
9 3 third sad
pivot_wider()
This function is similar to pivot_longer
except now are going in the reverse direction - we want to take a single column and break it up into new columns. The crucial arguments are:
names_from
(where to get the names of the new columns from?)
values_from
(where to get the for the new columns from?)
Create a new tibble named emotions_wide
from emotions_long
Using a single pipe, transform the data into wide using pivot_wider
You only need to use the two arguments listed above. Which column should be passed to which argument to get the correct output?
<- emotions_long %>%
emotions_wide pivot_wider(names_from = sequence,
values_from = emotion)
You should see this:
# A tibble: 3 × 4
subject first second third
<dbl> <chr> <chr> <chr>
1 1 happy sad sad
2 2 sad happy sad
3 3 happy happy sad
Let’s say that you have some data in long format. These data are responses to a test, such as a c-test or test of vocabulary knowledge. While you need long data for your statistical analysis, you also want to run internal consistency using Chronbach’s alpha or some other similar measure. To do so, you need your data to be in wide format, where each row is a participant, and each column is a test item.
Let’s simulate some data for this, assuming 10 subjects each taking a test with 25 items:
# list of subjects
<- rep(paste0("S", 1:10), each = 25)
subjects subjects
[1] "S1" "S1" "S1" "S1" "S1" "S1" "S1" "S1" "S1" "S1" "S1" "S1"
[13] "S1" "S1" "S1" "S1" "S1" "S1" "S1" "S1" "S1" "S1" "S1" "S1"
[25] "S1" "S2" "S2" "S2" "S2" "S2" "S2" "S2" "S2" "S2" "S2" "S2"
[37] "S2" "S2" "S2" "S2" "S2" "S2" "S2" "S2" "S2" "S2" "S2" "S2"
[49] "S2" "S2" "S3" "S3" "S3" "S3" "S3" "S3" "S3" "S3" "S3" "S3"
[61] "S3" "S3" "S3" "S3" "S3" "S3" "S3" "S3" "S3" "S3" "S3" "S3"
[73] "S3" "S3" "S3" "S4" "S4" "S4" "S4" "S4" "S4" "S4" "S4" "S4"
[85] "S4" "S4" "S4" "S4" "S4" "S4" "S4" "S4" "S4" "S4" "S4" "S4"
[97] "S4" "S4" "S4" "S4" "S5" "S5" "S5" "S5" "S5" "S5" "S5" "S5"
[109] "S5" "S5" "S5" "S5" "S5" "S5" "S5" "S5" "S5" "S5" "S5" "S5"
[121] "S5" "S5" "S5" "S5" "S5" "S6" "S6" "S6" "S6" "S6" "S6" "S6"
[133] "S6" "S6" "S6" "S6" "S6" "S6" "S6" "S6" "S6" "S6" "S6" "S6"
[145] "S6" "S6" "S6" "S6" "S6" "S6" "S7" "S7" "S7" "S7" "S7" "S7"
[157] "S7" "S7" "S7" "S7" "S7" "S7" "S7" "S7" "S7" "S7" "S7" "S7"
[169] "S7" "S7" "S7" "S7" "S7" "S7" "S7" "S8" "S8" "S8" "S8" "S8"
[181] "S8" "S8" "S8" "S8" "S8" "S8" "S8" "S8" "S8" "S8" "S8" "S8"
[193] "S8" "S8" "S8" "S8" "S8" "S8" "S8" "S8" "S9" "S9" "S9" "S9"
[205] "S9" "S9" "S9" "S9" "S9" "S9" "S9" "S9" "S9" "S9" "S9" "S9"
[217] "S9" "S9" "S9" "S9" "S9" "S9" "S9" "S9" "S9" "S10" "S10" "S10"
[229] "S10" "S10" "S10" "S10" "S10" "S10" "S10" "S10" "S10" "S10" "S10" "S10"
[241] "S10" "S10" "S10" "S10" "S10" "S10" "S10" "S10" "S10" "S10"
# list of items
<- rep(c(1:25), 10)
items items
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[51] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[76] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[101] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[126] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[151] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[176] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[201] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[226] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
# list of test scores
<- as.vector(replicate(10, sample(c(0,1), size = 25, replace = T)))
test_scores test_scores
[1] 1 0 0 1 1 1 1 1 0 0 0 0 1 1 0 0 0 1 0 1 1 0 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1
[38] 1 0 1 0 0 1 1 0 0 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 0 1 0 0 1 0 0 1
[75] 1 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 1 1 1 0 0 0 0 1 1 1 1 0 1 1 0 1 0 1 0
[112] 1 0 1 1 1 1 0 1 0 1 0 0 1 1 0 0 0 0 0 0 1 1 1 1 0 1 1 1 0 0 1 0 0 1 0 0 1
[149] 0 0 1 0 1 0 1 1 0 0 0 1 0 0 1 1 1 1 0 0 0 1 0 1 1 1 0 1 1 0 0 0 0 1 0 1 0
[186] 1 1 1 1 1 0 0 0 1 0 1 0 0 1 1 0 0 0 0 1 1 1 0 1 1 0 1 0 0 1 0 1 1 0 0 1 1
[223] 1 0 0 0 1 0 1 1 1 0 0 1 1 0 1 1 1 0 1 0 0 1 1 1 1 0 0 0
# combine into a df
<- tibble(subjects, items, test_scores)
fake_data fake_data
# A tibble: 250 × 3
subjects items test_scores
<chr> <int> <dbl>
1 S1 1 1
2 S1 2 0
3 S1 3 0
4 S1 4 1
5 S1 5 1
6 S1 6 1
7 S1 7 1
8 S1 8 1
9 S1 9 0
10 S1 10 0
# ℹ 240 more rows
You need to pivot this so that each column is an item, and each row is a participant. Your new data frame should be thus ten rows long (10 participants) and 27 columns wide (one column per item, plus a column for subjects).
<- fake_data %>%
fake_data_wide pivot_wider(names_from = items,
values_from = test_scores)
It should look like this:
fake_data_wide
# A tibble: 10 × 26
subjects `1` `2` `3` `4` `5` `6` `7` `8` `9` `10` `11`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 S1 1 0 0 1 1 1 1 1 0 0 0
2 S2 1 1 1 0 1 1 1 0 0 1 1
3 S3 1 1 1 1 1 1 1 1 1 1 1
4 S4 0 1 1 0 0 1 0 1 1 1 1
5 S5 1 1 1 0 1 1 0 1 0 1 0
6 S6 0 0 0 0 0 0 1 1 1 1 0
7 S7 1 0 1 0 1 1 0 0 0 1 0
8 S8 1 1 0 0 0 0 1 0 1 0 1
9 S9 0 0 0 0 1 1 1 0 1 1 0
10 S10 0 1 0 1 1 1 0 0 1 1 0
# ℹ 14 more variables: `12` <dbl>, `13` <dbl>, `14` <dbl>, `15` <dbl>,
# `16` <dbl>, `17` <dbl>, `18` <dbl>, `19` <dbl>, `20` <dbl>, `21` <dbl>,
# `22` <dbl>, `23` <dbl>, `24` <dbl>, `25` <dbl>
Now you can run an internal consistency analysis using psych::alpha()
or some other calculation! The best part is, you didn’t have to change it all manually in excel :-)
::alpha(fake_data_wide[2:ncol(fake_data_wide)], check.keys = T)$total psych
raw_alpha std.alpha G6(smc) average_r S/N ase mean sd
0.769761 0.7670732 0.7363903 0.1163953 3.293195 0.1060218 0.512 0.1966836
median_r
0.1020621