Let’s simulate data similar to last session’s.
You should:
1. Open RStudio
2. Create a new project, give it a name, and save it somewhere easy to
find (File –> New Project)
3. Create a new R script (File –> New File –> New R Script)
4. Go through this handout and type your code in the prompted
areas.
Create a tibble named base with these columns:
subject, which is the numbers 1 through 10
age, which is 10 random values from rnorm() with a mean of 28 and sd of 5, using a set.seed() of 42
Use the tibble() function. (You also need to load tidyverse.)
library(tidyverse)
set.seed(42)
base <- tibble(subject = 1:10, age = rnorm(10, 28, 5))
Your data in base should look like this:
## # A tibble: 10 × 2
##    subject   age
##      <int> <dbl>
##  1       1  34.9
##  2       2  25.2
##  3       3  29.8
##  4       4  31.2
##  5       5  30.0
##  6       6  27.5
##  7       7  35.6
##  8       8  27.5
##  9       9  38.1
## 10      10  27.7
Create a new tibble named base02, which is base repeated twice (20 rows instead of 10).
Use a single pipe with rbind().
base02 <- base %>%
  rbind(base)
Run the str() function on base02 and you should see this:
str(base02)
## tibble [20 × 2] (S3: tbl_df/tbl/data.frame)
##  $ subject: int [1:20] 1 2 3 4 5 6 7 8 9 10 ...
##  $ age    : num [1:20] 34.9 25.2 29.8 31.2 30 ...
Create a tibble named test.data with these columns:
test, which is the word ‘pre’ ten times followed by the word ‘post’ ten times
score, which is 10 random values from rnorm() with a mean of 50 and an sd of 10, followed by 10 more random values from rnorm() with a mean of 70 and an sd of 5, using a set.seed() of 43
This takes two lines of code: set.seed() first, and then the second line making test.data.
Use c() to chain multiple calls to rep() and rnorm() in your tibble() function.
set.seed(43)
test.data <- tibble(test = c(rep('pre', 10), rep('post', 10)),
                    score = c(rnorm(10, 50, 10), rnorm(10, 70, 5)))
Your version of test.data should look like this using str():
str(test.data)
## tibble [20 × 2] (S3: tbl_df/tbl/data.frame)
##  $ test : chr [1:20] "pre" "pre" "pre" "pre" ...
##  $ score: num [1:20] 49.6 34.3 45.1 54.7 41 ...
Create a new tibble named data, which is the result of combining base02 and test.data. Use the cbind() function in a single pipe.
data <- base02 %>%
  cbind(test.data)
Your version of data should look like this using str():
str(data)
## 'data.frame':    20 obs. of  4 variables:
##  $ subject: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age    : num  34.9 25.2 29.8 31.2 30 ...
##  $ test   : chr  "pre" "pre" "pre" "pre" ...
##  $ score  : num  49.6 34.3 45.1 54.7 41 ...
Q: Why did data turn into a data.frame when it
was made from joining two tibbles?
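One way to explore this question is to compare the class of the result from base R's cbind() with the class returned by dplyr's bind_cols(). This is a minimal sketch using two small hypothetical tibbles, a and b, which are not part of the exercise.

```r
library(tidyverse)

# Hypothetical toy tibbles (not part of the handout data)
a <- tibble(x = 1:3)
b <- tibble(y = c("p", "q", "r"))

# Base R's cbind() builds a plain data.frame from its inputs
class(cbind(a, b))

# dplyr's bind_cols() returns a tibble
class(bind_cols(a, b))
```

If keeping the tibble class matters for later steps, bind_cols() is one option to consider in place of cbind().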
Create a summary of data using summary():
summary(data)
##     subject          age            test               score      
##  Min.   : 1.0   Min.   :25.18   Length:20          Min.   :30.94  
##  1st Qu.: 3.0   1st Qu.:27.53   Class :character   1st Qu.:46.70  
##  Median : 5.5   Median :29.92   Mode  :character   Median :58.20  
##  Mean   : 5.5   Mean   :30.74                      Mean   :58.34  
##  3rd Qu.: 8.0   3rd Qu.:34.85                      3rd Qu.:71.47  
##  Max.   :10.0   Max.   :38.09                      Max.   :80.32
Recall from the summary of our data above that we have to be careful about generating descriptive statistics: every subject appears twice in data, so each age value is doubled, which would affect the standard deviation.
filter() from tidyverse
The filter() function applies a conditional argument to a specified column in a data.frame/tibble. It performs this test for each row in the data.
Conditional arguments are written using symbols such as
==, <, >. See below for a
list:
| Symbol | Meaning |
| --- | --- |
| == | equal to |
| != | not equal to |
| < | less than |
| <= | less than or equal to |
| > | greater than |
| >= | greater than or equal to |
Q: Why do we need to use == for “equal to”?
Q: Which of these symbols can be used for numeric data? How
about text/string data?
Try comparing ‘a’ and ‘b’ using ==, <, and !=. Then compare 1 and 2 using ==, >, and !=.
'a' == 'b'
## [1] FALSE
'a' < 'b'
## [1] TRUE
'a' != 'b'
## [1] TRUE
1 == 2
## [1] FALSE
1 > 2
## [1] FALSE
1 != 2
## [1] TRUE
This helps us understand how the filter() command works: it will keep (or return) any row for which the condition that you specify results in TRUE. Let’s try it out.
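The idea that filter() keeps the rows where the condition is TRUE can be sketched on a small hypothetical data frame (toy and its values are made up for illustration). The condition produces one TRUE/FALSE per row, and the rows marked TRUE are the ones returned.

```r
library(tidyverse)

# Hypothetical toy data (not the handout's data object)
toy <- data.frame(id = 1:4, age = c(25, 31, 30, 40))

# The condition alone gives one TRUE/FALSE value per row
keep <- toy$age <= 30
keep

# filter() returns the rows where the condition is TRUE
filter(toy, age <= 30)

# Base R selects the same rows using the logical vector directly
toy[keep, ]
```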
Run filter() on data to remove anyone older than 30. The arguments for filter() include the data object you are filtering and the condition that you are specifying. (Do not save it to any object; just run the function.)
How many subjects are 30 or younger?
Your output should look like this:
filter(data, age <= 30)
It should be simple for us to use this method to create two versions
of the data: one for the pre-test and one for the post-test.
Create a tibble named pre.test and a tibble named
post.test. For both, create a pipe from data
and use the filter() function to extract only pre-test or
only post-test data. Your condition will be applied to the
test column in data.
pre.test <- data %>%
  filter(test == 'pre')
post.test <- data %>%
  filter(test == 'post')
Your pre.test should look like this:
pre.test
Your post.test should look like this:
post.test
Great - now we can correctly generate summary statistics of our data!
Create a tibble named data.summary with three columns:
test, which is the word “pre” followed by the word “post”
mean, which is the mean of score for pre and post
sd, which is the standard deviation of score for pre and post
You will need to use the mean() and sd() functions on score from the pre.test and post.test objects.
Try to do this without creating any other variables.
data.summary <- tibble(test = c('pre', 'post'),
                       mean = c(mean(pre.test$score), mean(post.test$score)),
                       sd = c(sd(pre.test$score), sd(post.test$score)))
Your data.summary object should look like this:
data.summary
We can also get a summary of our age variable using either the pre.test or post.test objects (because each of these objects has only one age value per subject).
What are the mean() and sd() of age in either pre.test or post.test compared to data? The mean is not affected, but the standard deviation is. This is because the formula for the standard deviation divides by the number of observations minus 1 (n - 1).
sd() treats data as having 20 observations because there are 20 rows. However, we only have 10 subjects.
This is a reminder to always be careful about calculating summary statistics of your data in all situations.
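The reasoning above can be checked by writing the standard deviation out by hand. This is a sketch: sd_by_hand and age_twice are hypothetical helper names, and the ages are regenerated with the same seed used for base.

```r
set.seed(42)
age <- rnorm(10, 28, 5)     # the 10 ages, generated as for base
age_twice <- c(age, age)    # every age repeated, as in data$age

# Sample standard deviation: sqrt(sum of squared deviations / (n - 1))
sd_by_hand <- function(x) sqrt(sum((x - mean(x))^2) / (length(x) - 1))

sd_by_hand(age)         # n - 1 is 9
sd_by_hand(age_twice)   # squared deviations double, but n - 1 is 19

# So the two results differ by a factor of sqrt(18 / 19)
all.equal(sd(age_twice), sd(age) * sqrt(18 / 19))
```

Doubling the rows doubles the sum of squared deviations but more than doubles the divisor (9 becomes 19), which is why the duplicated data gives a slightly smaller standard deviation.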
mean(pre.test$age)
## [1] 30.73648
sd(pre.test$age)
## [1] 4.177244
mean(data$age)
## [1] 30.73648
sd(data$age)
## [1] 4.065831
group_by() and summarise() to do the same thing
The above method works, but we can do better. Tidyverse includes the
group_by() function, which allows for any computations you
ask for to be conducted on a per-group basis. This is nice because we do
not need to create new objects using filter like we did above.
We can use group_by() on our test column to
ask R to compute the mean and sd of
score for each version of the test. To do so we will also
use the summarise() function. The summarise()
function is used to create new variables which are summaries of
larger data. It will return the results of a computation (such as
mean) and store them as a new variable. The power of this
method is that we can do this multiple times in one pipe to achieve a
data.frame/tibble which includes a summary of a larger data object, in a
similar manner to the data.summary object that we created
above.
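As a minimal sketch of the pattern described above, here is group_by() followed by summarise() on a small hypothetical tibble. The names toy, group, value, mValue, and sdValue are made up for illustration and are not the exercise data.

```r
library(tidyverse)

# Hypothetical toy data: two groups of three values each
toy <- tibble(group = c("a", "a", "a", "b", "b", "b"),
              value = c(1, 2, 3, 10, 20, 30))

# One row per group, with each summary stored as a new column
toy_summary <- toy %>%
  group_by(group) %>%
  summarise(n = n(), mValue = mean(value), sdValue = sd(value))

toy_summary
```

Each argument to summarise() becomes one column of the result, so adding another summary is a matter of adding another name = computation pair.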
As an example, look what happens when I run summarise()
to compute the mean and sd of age
in the post.test object.
Notice that I only ran the function without saving it to an object, so this data was not saved anywhere. However, within the summarise() call I made two new variables (columns), mAge and sdAge, which were the result of running mean() and sd() on age in post.test.
dplyr::summarise(post.test, mAge = mean(age), sdAge = sd(age))
Let’s use group_by() and summarise() in a pipe.
Create a new object named data02 from the
data object.
Then, with a pipe, use group_by() on
test
Then, with another pipe, use summarise() to create two new
variables: mScore and sdScore which are the
result of calling mean() and sd() on
score
data02 <- data %>%
  group_by(test) %>%
  summarise(mScore = mean(score), sdScore = sd(score))
Your data02 object should look like this, and the values here should be identical to those in data.summary:
print(data02)
## # A tibble: 2 × 3
##   test  mScore sdScore
##   <chr>  <dbl>   <dbl>
## 1 post    71.8    6.31
## 2 pre     44.9    7.82
Hmm, this still doesn’t solve our problem for age: we don’t want to group by age or by subject. We can use the unique() function to help with this. The unique() function returns only the unique rows of a data.frame/tibble. In other words, it removes duplicate rows (but only rows whose values match in every column).
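The row-wise behaviour of unique() can be seen on a small hypothetical data frame (toy and its values are made up for illustration). Here subjects 1 and 2 happen to share an age, and every subject appears twice.

```r
# Hypothetical toy data: each subject appears twice,
# and subjects 1 and 2 share the same age
toy <- data.frame(subject = c(1, 2, 3, 1, 2, 3),
                  age     = c(30, 30, 25, 30, 30, 25))

# Rows are compared across all columns, so each subject is kept once
unique(toy)

# On the age column alone, the shared age collapses into a single value
unique(toy$age)
```

The first call keeps three rows, one per subject. The second keeps only two ages, so one subject's age would be lost from any summary computed on it.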
Let’s try it out - create a new object named unique.age
from the data object
Then, make a pipe which uses the select() function to
select only subject and age
(select() is like filter(), but instead it
chooses entire columns and is not based on conditions but rather what
you ask for.)
Then, make a second pipe which calls the function unique()
with no arguments.
unique.age <- data %>%
  select(subject, age) %>%
  unique()
You should see a data.frame with two columns, subject and age, such as below.
Note that this only works because we filtered for unique values based on combinations of both subject and age.
We would never want to run unique() on just age because that would delete any duplicate ages among our subjects.
print(unique.age)
##    subject      age
## 1        1 34.85479
## 2        2 25.17651
## 3        3 29.81564
## 4        4 31.16431
## 5        5 30.02134
## 6        6 27.46938
## 7        7 35.55761
## 8        8 27.52670
## 9        9 38.09212
## 10      10 27.68643
Now, extend unique.age with a final pipe calling summarise() to generate the data we want (mean and SD).
Make an object named unique.age which is the same as the
above but includes a final pipe to summarise()
In the summarise() function, create mAge and
sdAge and apply the mean() and
sd() functions
unique.age <- data %>%
  select(subject, age) %>%
  unique() %>%
  summarise(mAge = mean(age), sdAge = sd(age))
You should now see this:
print(unique.age)
##       mAge    sdAge
## 1 30.73648 4.177244
If we wanted to, we could do all of this within one set of pipes.
Can you create a new object named final.data from
data which includes the mean and sd for score
and age grouped by test?
All you need to do is add more arguments to the summarise()
function we used when making data02 - you don’t need to do
anything with unique()
final.data <- data %>%
  group_by(test) %>%
  summarise(mScore = mean(score), sdScore = sd(score)) %>%
  ungroup() %>%
  cbind(unique.age)
You would see something like this:
print(final.data)
##   test   mScore  sdScore     mAge    sdAge
## 1 post 71.76039 6.308726 30.73648 4.177244
## 2  pre 44.91928 7.819863 30.73648 4.177244
Q: Is there anything odd about how age has been put into the data.frame? What caused this?
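The hint above mentions adding more arguments to summarise() instead of combining with unique.age. A minimal sketch of that route follows. It rebuilds data with the same seeds used earlier, and final.data02 is a hypothetical name chosen so it does not overwrite final.data.

```r
library(tidyverse)

# Rebuild data following the earlier steps of the handout
set.seed(42)
base <- tibble(subject = 1:10, age = rnorm(10, 28, 5))
set.seed(43)
test.data <- tibble(test = c(rep('pre', 10), rep('post', 10)),
                    score = c(rnorm(10, 50, 10), rnorm(10, 70, 5)))
data <- cbind(rbind(base, base), test.data)

# Hypothetical name final.data02: all four summaries in one summarise()
final.data02 <- data %>%
  group_by(test) %>%
  summarise(mScore = mean(score), sdScore = sd(score),
            mAge = mean(age), sdAge = sd(age))

final.data02
```

This works here because each test group contains every subject exactly once, so the age summaries within a group are computed on 10 values rather than 20.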
Finally, while we could copy and paste from R, we could also ask R to give us a spreadsheet with this information. The write_csv() function is very handy for this. The arguments for write_csv() include the object you want to write and the file name you want the output to have.
write_csv(final.data, 'final-data.csv')
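To confirm what was written, the file can be read back with read_csv(). The sketch below uses a hypothetical toy table and a temporary file so that nothing in your project folder is touched. In the call above, a plain file name such as 'final-data.csv' is saved relative to the current working directory.

```r
library(tidyverse)

# Hypothetical summary table standing in for final.data
toy <- tibble(test = c('post', 'pre'), mScore = c(71.5, 44.25))

# Write to a temporary file rather than the project folder
path <- tempfile(fileext = '.csv')
write_csv(toy, path)

# Read the file back in to check its contents
toy_back <- read_csv(path, show_col_types = FALSE)
toy_back
```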