Let’s simulate data similar to last session’s.
You should:
1. Open RStudio
2. Create a new project, give it a name, and save it somewhere easy to find (File –> New Project)
3. Create a new R script (File –> New File –> New R Script)
4. Go through this handout and type your code in the prompted areas.
Create a tibble named base with these columns:
subject, which is the numbers 1 through 10
age, which is 10 random values from rnorm() with a mean of 28 and sd of 5, using a set.seed() of 42
Use the tibble() function. (You also need to load tidyverse.)
library(tidyverse)
set.seed(42)
base <- tibble(subject = 1:10, age = rnorm(10, 28, 5))
Your data in base
should look like this
## # A tibble: 10 × 2
## subject age
## <int> <dbl>
## 1 1 34.9
## 2 2 25.2
## 3 3 29.8
## 4 4 31.2
## 5 5 30.0
## 6 6 27.5
## 7 7 35.6
## 8 8 27.5
## 9 9 38.1
## 10 10 27.7
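The reason your numbers can match the output above is the seed. As a small sketch of what set.seed() does (the object names first_draw, second_draw and third_draw are made up for this illustration), resetting the seed replays the same random draws:

```r
# Same seed, same draws: resetting the seed replays the random sequence.
set.seed(42)
first_draw <- rnorm(10, mean = 28, sd = 5)

set.seed(42)
second_draw <- rnorm(10, mean = 28, sd = 5)

identical(first_draw, second_draw)  # TRUE

# Without resetting the seed, the next call continues the sequence,
# so the values differ from the first draw.
third_draw <- rnorm(10, mean = 28, sd = 5)
identical(first_draw, third_draw)   # FALSE
```

This is why set.seed(42) has to be run immediately before the line that calls rnorm().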
Create a new tibble named base02, which is base repeated twice (20 rows instead of 10).
Use a single pipe with rbind().
base02 <- base %>%
rbind(base)
Run the str() function on base02 and you should see this:
str(base02)
## tibble [20 × 2] (S3: tbl_df/tbl/data.frame)
## $ subject: int [1:20] 1 2 3 4 5 6 7 8 9 10 ...
## $ age : num [1:20] 34.9 25.2 29.8 31.2 30 ...
Create a tibble named test.data with these columns:
test, which is the word ‘pre’ ten times followed by the word ‘post’ ten times
score, which is 10 random values from rnorm() with a mean of 50 and an sd of 10, followed by 10 more random values from rnorm() with a mean of 70 and an sd of 5, using a set.seed() of 43
Use two lines of code: the first calling set.seed() and the second making test.data.
Use c() to chain multiple calls to rep() and rnorm() in your tibble() function.
set.seed(43)
test.data <- tibble(test = c(rep('pre',10),rep('post', 10)), score = c(rnorm(10,50,10), rnorm(10,70,5)))
Your version of test.data should look like this using str():
str(test.data)
## tibble [20 × 2] (S3: tbl_df/tbl/data.frame)
## $ test : chr [1:20] "pre" "pre" "pre" "pre" ...
## $ score: num [1:20] 49.6 34.3 45.1 54.7 41 ...
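As a side note, chaining two rep() calls inside c() is not the only way to build the test labels. A minimal sketch of an equivalent alternative using the each argument of rep() (the names labels_chained and labels_each are made up for this illustration):

```r
# Two equivalent ways to build the 'pre'/'post' labels
labels_chained <- c(rep("pre", 10), rep("post", 10))  # approach used in this handout
labels_each <- rep(c("pre", "post"), each = 10)       # alternative using `each`

identical(labels_chained, labels_each)  # TRUE
length(labels_each)                     # 20
```

Either form works inside tibble(); the chained form mirrors how the two rnorm() calls are combined for score.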
Create a new tibble named data which is the result of combining base02 and test.data. Use the cbind() function in a single pipe.
data <- base02 %>%
cbind(test.data)
Your version of data should look like this using str():
str(data)
## 'data.frame': 20 obs. of 4 variables:
## $ subject: int 1 2 3 4 5 6 7 8 9 10 ...
## $ age : num 34.9 25.2 29.8 31.2 30 ...
## $ test : chr "pre" "pre" "pre" "pre" ...
## $ score : num 49.6 34.3 45.1 54.7 41 ...
Q: Why did data turn into a data.frame when it was made from joining two tibbles?
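One way to explore this question is to compare the class of the result from base R's cbind() with the result from dplyr's bind_cols(). This is only a sketch on small made-up tibbles (left, right and the values in them are hypothetical), not part of the exercise:

```r
library(dplyr)

# Hypothetical miniature versions of base02 and test.data
left <- tibble(subject = 1:3, age = c(34.9, 25.2, 29.8))
right <- tibble(test = c("pre", "pre", "pre"), score = c(49.6, 34.3, 45.1))

via_cbind <- left %>% cbind(right)          # base R function
via_bind_cols <- left %>% bind_cols(right)  # dplyr function

class(via_cbind)      # a plain data.frame
class(via_bind_cols)  # still a tibble (tbl_df)
```

cbind() comes from base R and does not know about tibbles, whereas bind_cols() is the tidyverse counterpart and keeps the tibble class.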
Create a summary of data using summary():
summary(data)
##     subject          age            test               score
##  Min.   : 1.0   Min.   :25.18   Length:20          Min.   :30.94
##  1st Qu.: 3.0   1st Qu.:27.53   Class :character   1st Qu.:46.70
##  Median : 5.5   Median :29.92   Mode  :character   Median :58.20
##  Mean   : 5.5   Mean   :30.74                      Mean   :58.34
##  3rd Qu.: 8.0   3rd Qu.:34.85                      3rd Qu.:71.47
##  Max.   :10.0   Max.   :38.09                      Max.   :80.32
Recall from the summary of our data above that we have to be careful about generating descriptive statistics: each age value is doubled (every subject appears twice in data), which would affect the standard deviation.
filter() from tidyverse
The filter() function applies a condition to a specified column in a data frame/tibble. It evaluates this test for each row in the data.
Conditional arguments are written using symbols such as ==, <, and >. See below for a list:
== | equal to
!= | not equal to
< | less than
<= | less than or equal to
> | greater than
>= | greater than or equal to
Q: Why do we need to use == for “equal to”?
Q: Which of these symbols can be used for numeric data? How
about text/string data?
Try comparing ‘a’ and ‘b’ using ==, <, and !=. Then do the same with 1 and 2.
'a' == 'b'
## [1] FALSE
'a' < 'b'
## [1] TRUE
'a' != 'b'
## [1] TRUE
1 == 2
## [1] FALSE
1 > 2
## [1] FALSE
1 != 2
## [1] TRUE
This helps us understand how the filter() command works - it will keep (or return) anything that results in TRUE from the condition that you specify. Let’s try it out.
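To make the "keeps anything that is TRUE" idea concrete, here is a minimal sketch on a made-up tibble (toy and its values are hypothetical). The condition on its own produces one TRUE/FALSE value per row, and filter() returns the rows where it is TRUE:

```r
library(dplyr)

# Hypothetical toy data, not the handout's data
toy <- tibble(subject = 1:4, age = c(25, 31, 30, 40))

# The condition alone is a logical vector, one value per row
toy$age <= 30
# [1]  TRUE FALSE  TRUE FALSE

# filter() keeps the rows where the condition is TRUE
kept <- filter(toy, age <= 30)
kept$subject
# [1] 1 3
```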
Run filter() on data to remove anyone older than 30. The arguments for filter() include the data object you are filtering and the condition that you are specifying. (Do not save it to any object - just run the function.)
How many subjects are 30 or younger?
Your output should look like this:
filter(data, age <= 30)
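Because every subject appears twice in data, counting rows and counting subjects give different answers. The sketch below rebuilds only the subject and age columns from the ages printed elsewhere in this handout (the object names ages, long and young are made up for this illustration) and counts both:

```r
library(dplyr)

# Ages copied from the printed output in this handout (set.seed(42))
ages <- c(34.85479, 25.17651, 29.81564, 31.16431, 30.02134,
          27.46938, 35.55761, 27.52670, 38.09212, 27.68643)

# Each subject appears twice, mimicking the pre/post layout of `data`
long <- tibble(subject = rep(1:10, 2), age = rep(ages, 2))

young <- filter(long, age <= 30)
nrow(young)                # 10 rows kept, each subject counted twice
n_distinct(young$subject)  # 5 distinct subjects
```

n_distinct() from dplyr counts unique values, which is the number to report when the question is about subjects rather than rows.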
It should be simple for us to use this method to create two versions
of the data: one for the pre-test and one for the post-test.
Create a tibble named pre.test and a tibble named post.test. For both, create a pipe from data and use the filter() function to extract only pre-test or only post-test data. Your condition will be applied to the test column in data.
pre.test <- data %>%
filter(test == 'pre')
post.test <- data %>%
filter(test == 'post')
Your pre.test should look like this:
pre.test
Your post.test
should look like this:
post.test
Great - now we can correctly generate summary statistics of our data!
Create a tibble named data.summary with three columns:
test, which is the word “pre” followed by the word “post”
mean, which is the mean of score for pre and post
sd, which is the standard deviation of score for pre and post
You will need to use the mean() and sd() functions on score from the pre.test and post.test objects.
Try to do this without creating any other variables.
data.summary <- tibble(test = c('Pre', 'Post'),
mean = c(mean(pre.test$score), mean(post.test$score)),
sd = c(sd(pre.test$score), sd(post.test$score)))
Your data.summary
object should look like this
data.summary
We can also get a summary of our age variable using either the pre.test or post.test objects (because each of these objects has only one age value per subject).
What are the mean() and sd() of age in either pre.test or post.test compared to data? The mean is not affected, but the standard deviation is. This is because the formula for the standard deviation divides by the number of observations minus 1 (n - 1).
sd() treats data as having 20 observations because there are 20 rows. However, we only have 10 subjects. This is a reminder to always be careful when calculating summary statistics of your data.
mean(pre.test$age)
## [1] 30.73648
sd(pre.test$age)
## [1] 4.177244
mean(data$age)
## [1] 30.73648
sd(data$age)
## [1] 4.065831
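The size of the discrepancy can be worked out directly. Duplicating every value doubles the sum of squared deviations while the denominator goes from n - 1 to 2n - 1, so the standard deviation shrinks by a factor of sqrt(2(n - 1) / (2n - 1)). A small sketch checking this, using the ages printed in this handout (ages, doubled and shrink are names made up for this illustration):

```r
# Ages copied from the printed output in this handout (set.seed(42))
ages <- c(34.85479, 25.17651, 29.81564, 31.16431, 30.02134,
          27.46938, 35.55761, 27.52670, 38.09212, 27.68643)
doubled <- rep(ages, 2)  # every value appears twice, as in `data`
n <- length(ages)

# The mean is unchanged by duplication
all.equal(mean(doubled), mean(ages))

# The sd shrinks by a known factor: sqrt(2 * (n - 1) / (2 * n - 1))
shrink <- sqrt(2 * (n - 1) / (2 * n - 1))
all.equal(sd(doubled), sd(ages) * shrink)
```

With n = 10 the factor is sqrt(18/19), roughly 0.973, which matches the gap between the two sd() results above.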
group_by() and summarise() to do the same thing
The above method works, but we can do better. Tidyverse includes the group_by() function, which allows any computations you ask for to be conducted on a per-group basis. This is nice because we do not need to create new objects using filter() like we did above.
We can use group_by() on our test column to ask R to compute the mean and sd of score for each version of the test. To do so we will also use the summarise() function. The summarise() function is used to create new variables which are summaries of larger data. It will return the results of a computation (such as mean) and store them as a new variable. The power of this method is that we can do this multiple times in one pipe to achieve a data.frame/tibble which includes a summary of a larger data object, in a similar manner to the data.summary object that we created above.
As an example, look what happens when I run summarise() to compute the mean and sd of age in the post.test object.
Notice that I only ran the function without saving it to an object, so this data was not saved anywhere. However, within the summarise() call I made two new variables (columns), mAge and sdAge, which were the result of running mean() and sd() on age in post.test.
dplyr::summarise(post.test, mAge = mean(age), sdAge = sd(age))
Let’s use group_by() and summarise() in a pipe.
Create a new object named data02 from the data object.
Then, with a pipe, use group_by() on test.
Then, with another pipe, use summarise() to create two new variables: mScore and sdScore, which are the result of calling mean() and sd() on score.
data02 <- data %>%
group_by(test) %>%
summarise(mScore = mean(score), sdScore = sd(score))
Your data02 object should look like this - and the values here should be identical to those in data.summary
print(data02)
## # A tibble: 2 × 3
## test mScore sdScore
## <chr> <dbl> <dbl>
## 1 post 71.8 6.31
## 2 pre 44.9 7.82
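Note that post is listed before pre in the output: summarise() returns one row per group, and the groups are sorted alphabetically. A minimal sketch on made-up data (toy and its values are hypothetical) that also adds n() to show how many rows went into each group:

```r
library(dplyr)

# Hypothetical toy data, not the handout's data
toy <- tibble(test = c("pre", "pre", "post", "post"),
              score = c(40, 50, 70, 80))

toy_summary <- toy %>%
  group_by(test) %>%
  summarise(n = n(),                # number of rows in each group
            mScore = mean(score),
            sdScore = sd(score))

toy_summary$test    # "post" "pre"  (groups sorted alphabetically)
toy_summary$mScore  # 75 45
```

Including n() in a summary is a cheap way to confirm each group contains the number of observations you expect.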
Hmm, this still doesn’t solve our problem for age - we don’t want to group by age or by subject. We can use the unique() function to help with this. The unique() function will return only unique rows from a data.frame/tibble - in other words, it removes any duplicate rows (but only if the value for each column is the same).
Let’s try it out - create a new object named unique.age from the data object.
Then, make a pipe which uses the select() function to select only subject and age. (select() is like filter(), but instead it chooses entire columns and is not based on conditions but rather what you ask for.)
Then, make a second pipe which calls the function unique() with no arguments.
unique.age <- data %>%
select(subject, age) %>%
unique()
You should see a data frame with two columns, subject and age, such as below.
Note that this only works because we filtered for unique values based on combinations of both subject and age.
We would never want to run unique() on just age because that would delete any duplicate ages among our subjects.
print(unique.age)
## subject age
## 1 1 34.85479
## 2 2 25.17651
## 3 3 29.81564
## 4 4 31.16431
## 5 5 30.02134
## 6 6 27.46938
## 7 7 35.55761
## 8 8 27.52670
## 9 9 38.09212
## 10 10 27.68643
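The warning about running unique() on age alone is easier to see with made-up data in which two subjects happen to share an age. This is a sketch only (toy, per_subject and age_only are hypothetical names); dplyr's distinct() is shown as an alternative that gives the same rows here:

```r
library(dplyr)

# Hypothetical data: subjects 1 and 3 are both 25, and every subject appears twice
toy <- tibble(subject = rep(1:3, 2), age = rep(c(25, 30, 25), 2))

# Unique combinations of subject and age: one row per subject
per_subject <- toy %>%
  select(subject, age) %>%
  unique()
nrow(per_subject)  # 3

# Unique ages only: subject 3's age is wrongly dropped as a duplicate
age_only <- toy %>%
  select(age) %>%
  unique()
nrow(age_only)     # 2

# distinct() is the dplyr counterpart of unique() for rows
nrow(distinct(toy, subject, age))  # 3
```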
Now, extend unique.age with a final pipe calling summarise() to generate the data we want (mean and SD).
Make an object named unique.age which is the same as the above but includes a final pipe to summarise().
In the summarise() function, create mAge and sdAge by applying the mean() and sd() functions to age.
unique.age <- data %>%
select(subject, age) %>%
unique() %>%
summarise(mAge = mean(age), sdAge = sd(age))
You should now see this:
print(unique.age)
## mAge sdAge
## 1 30.73648 4.177244
If we wanted to, we could do all of this within one set of pipes.
Can you create a new object named final.data from data which includes the mean and sd for score and age grouped by test?
All you need to do is add more arguments to the summarise() function we used when making data02 - you don’t need to do anything with unique().
final.data <- data %>%
group_by(test) %>%
summarise(mScore = mean(score), sdScore = sd(score)) %>%
ungroup() %>%
cbind(unique.age)
You would see something like this
print(final.data)
## test mScore sdScore mAge sdAge
## 1 post 71.76039 6.308726 30.73648 4.177244
## 2 pre 44.91928 7.819863 30.73648 4.177244
Q: Is there anything odd about how age has been put into the data.frame? What caused this?
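The prompt above suggests adding more arguments to summarise(), while the code shown for final.data attaches unique.age with cbind(). A minimal sketch of the summarise()-only variant, on made-up data (toy and its values are hypothetical), might look like this:

```r
library(dplyr)

# Hypothetical toy data: 3 subjects, each measured pre and post
toy <- tibble(
  subject = rep(1:3, 2),
  age     = rep(c(25, 30, 35), 2),
  test    = rep(c("pre", "post"), each = 3),
  score   = c(40, 45, 50, 70, 75, 80)
)

toy_final <- toy %>%
  group_by(test) %>%
  summarise(mScore = mean(score), sdScore = sd(score),
            mAge = mean(age), sdAge = sd(age))

toy_final$mAge   # 30 30
toy_final$sdAge  # 5 5
```

This works because, within each level of test, every subject appears exactly once, so the age statistics are computed on one value per subject. The age columns are identical in both rows because the same subjects took both tests.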
Finally, while we could copy and paste from R, we could also ask R to give us a spreadsheet with this information. The write_csv() function is very handy for this. The arguments for write_csv() include the name of the object you want to write and the file name you want the output to have.
write_csv(final.data, 'final-data.csv')
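To check that a file was written as expected, you can read it back in with read_csv(). The sketch below uses a made-up data frame and a temporary file (toy, out_path and read_back are hypothetical names) so that it does not touch your project folder:

```r
library(readr)

# Hypothetical small summary table
toy <- data.frame(test = c("post", "pre"), mScore = c(71.8, 44.9))

# Write to a temporary file rather than the project directory
out_path <- tempfile(fileext = ".csv")
write_csv(toy, out_path)

# Read the file back in to confirm its contents
read_back <- read_csv(out_path, show_col_types = FALSE)
read_back$test  # "post" "pre"
```

In your own project, a relative file name such as 'final-data.csv' is saved in the working directory, which is the project folder when you are working inside an RStudio project.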