More with descriptives

Goal: report common descriptive statistics, such as mean and standard deviation

Let’s simulate data similar to what we used last session.

You should:
1. Open RStudio
2. Create a new project, give it a name, and save it somewhere easy to find (File -> New Project)
3. Create a new R script (File -> New File -> New R Script)
4. Go through this handout and type in your code in the prompted areas.

Part 1.

Create a tibble named base with these columns:

subject which is the numbers 1 through 10
age which is 10 random values from rnorm() with a mean of 28 and sd of 5, using a set.seed() of 42

Use the tibble() function (you also need to load the tidyverse).

library(tidyverse)

set.seed(42)
base <- tibble(subject = 1:10, age = rnorm(10, 28, 5))

Your data in base should look like this

## # A tibble: 10 × 2
##    subject   age
##      <int> <dbl>
##  1       1  34.9
##  2       2  25.2
##  3       3  29.8
##  4       4  31.2
##  5       5  30.0
##  6       6  27.5
##  7       7  35.6
##  8       8  27.5
##  9       9  38.1
## 10      10  27.7
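To see what set.seed() is doing here, the following minimal sketch draws the same three values twice after resetting the seed, and then draws again without resetting it. The object names first, second, and third are made up for illustration and are not part of the handout.

```r
# Resetting the seed replays the same sequence of "random" numbers.
set.seed(42)
first <- rnorm(3, mean = 28, sd = 5)

set.seed(42)
second <- rnorm(3, mean = 28, sd = 5)

identical(first, second)  # same seed, so the draws match

# Without resetting the seed, the generator simply continues its sequence.
third <- rnorm(3, mean = 28, sd = 5)
identical(first, third)
```

This is why everyone who uses set.seed(42) before making base should see the same ages.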

Create a new tibble named base02 which is base repeated twice (20 rows instead of 10).
Use a single pipe with rbind().

base02 <- base %>%
  rbind(base)

Run the str() function on base02 and you should see this:

str(base02)

## tibble [20 × 2] (S3: tbl_df/tbl/data.frame)
##  $ subject: int [1:20] 1 2 3 4 5 6 7 8 9 10 ...
##  $ age    : num [1:20] 34.9 25.2 29.8 31.2 30 ...

Create a tibble named test.data with these columns:

test which is the word ‘pre’ ten times followed by the word ‘post’ ten times
score which is 10 random values from rnorm() with a mean of 50 and a sd of 10 followed by 10 more random values from rnorm() with a mean of 70 and an sd of 5 using a set.seed() of 43
Do this all in two lines: the first line for set.seed() and the second line for making test.data.
You can use c() to chain multiple calls to rep() and rnorm() in your tibble() function.

set.seed(43)
test.data <- tibble(test = c(rep('pre', 10), rep('post', 10)), score = c(rnorm(10, 50, 10), rnorm(10, 70, 5)))

Your version of test.data should look like this using str()

str(test.data)

## tibble [20 × 2] (S3: tbl_df/tbl/data.frame)
##  $ test : chr [1:20] "pre" "pre" "pre" "pre" ...
##  $ score: num [1:20] 49.6 34.3 45.1 54.7 41 ...
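As a side note on building the test column, rep() also has an each argument that may be a little more compact than chaining two calls inside c(). The sketch below (with the illustrative names stacked and with_each) checks that both approaches give the same vector.

```r
# Two calls to rep() chained with c(), as in the handout
stacked <- c(rep('pre', 10), rep('post', 10))

# One call to rep(): repeat each element of the vector 10 times
with_each <- rep(c('pre', 'post'), each = 10)

identical(stacked, with_each)  # both approaches build the same vector
table(with_each)               # count how many times each label appears
```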

Part 2.

Create a new tibble named data which is the result of combining base02 and test.data. Use the cbind() function in a single pipe.

data <- base02 %>%
  cbind(test.data)

Your version of data should look like this using str()

str(data)

## 'data.frame':    20 obs. of  4 variables:
##  $ subject: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age    : num  34.9 25.2 29.8 31.2 30 ...
##  $ test   : chr  "pre" "pre" "pre" "pre" ...
##  $ score  : num  49.6 34.3 45.1 54.7 41 ...

Q: Why did data turn into a data.frame when it was made from joining two tibbles?
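One way to explore this question is to compare base R's cbind() with bind_cols() from dplyr on two tiny tibbles. This is only a sketch. The objects left and right are hypothetical, and bind_cols() is offered as an alternative rather than what the handout asks you to use.

```r
library(tidyverse)

# Two small tibbles with the same number of rows (made up for illustration)
left  <- tibble(id = 1:3)
right <- tibble(score = c(10, 20, 30))

# cbind() is a base R function, so it falls back to the data.frame method
class(cbind(left, right))

# bind_cols() is the dplyr equivalent and keeps the tibble class
class(bind_cols(left, right))
```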

Create a summary of data using summary():

summary(data)

##     subject          age            test               score      
##  Min.   : 1.0   Min.   :25.18   Length:20          Min.   :30.94  
##  1st Qu.: 3.0   1st Qu.:27.53   Class :character   1st Qu.:46.70  
##  Median : 5.5   Median :29.92   Mode  :character   Median :58.20  
##  Mean   : 5.5   Mean   :30.74                      Mean   :58.34  
##  3rd Qu.: 8.0   3rd Qu.:34.85                      3rd Qu.:71.47  
##  Max.   :10.0   Max.   :38.09                      Max.   :80.32

Part 3.

Now that we have our data, we want to generate descriptive statistics (Mean, SD, etc) for different values.

Recall from the summary of our data above that we have to be careful when generating descriptive statistics:

Because our data is in long format, each subject’s age appears twice, which would affect the standard deviation.
We want the mean and sd for test scores, but split by test.

Using `filter()` from tidyverse

The filter() function will apply a conditional argument to a specified column in a dataframe/tibble. It will do this test for each row in the data.

Conditional arguments are written using symbols such as ==, <, >. See below for a list:

`==`	equal to
`!=`	does not equal
`<`	less than
`<=`	less than or equal to
`>`	greater than
`>=`	greater than or equal to

Q: Why do we need to use == for “equal to”?
Q: Which of these symbols can be used for numeric data? How about text/string data?

Try comparing ‘a’ and ‘b’ using ==, <, and !=. Then compare 1 and 2 using ==, >, and !=.

'a' == 'b'

## [1] FALSE

'a' < 'b'

## [1] TRUE

'a' != 'b'

## [1] TRUE

1 == 2

## [1] FALSE

1 > 2

## [1] FALSE

1 != 2

## [1] TRUE

This helps us understand how the filter() command works - it will keep (or return) anything that results in TRUE from the condition that you specify. Let’s try it out.
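A minimal sketch of that idea, using a small made-up tibble named ages (not part of the handout): the condition produces one TRUE or FALSE per row, and filter() keeps the rows where it is TRUE.

```r
library(tidyverse)

# Hypothetical data for illustration only
ages <- tibble(subject = 1:4, age = c(25, 31, 30, 40))

# The condition is evaluated once per row and gives a logical vector
keep <- ages$age <= 30
keep

# filter() returns only the rows where the condition is TRUE
young <- filter(ages, age <= 30)
young
```

Here two of the four rows satisfy the condition, so the filtered tibble has two rows.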

Run filter() on data to remove anyone older than 30. The arguments for filter() include the data object you are filtering and the condition that you are specifying (do not save it to any object - just run the function).

How many subjects are equal to or younger than 30?

Your output should look like this:

filter(data, age <= 30)

It should be simple for us to use this method to create two versions of the data: one for the pre-test and one for the post-test.

Create a tibble named pre.test and a tibble named post.test. For both, create a pipe from data and use the filter() function to extract only pre-test or only post-test data. Your condition will be applied to the test column in data.

pre.test <- data %>%
  filter(test == 'pre')

post.test <- data %>%
  filter(test == 'post')

Your pre.test should look like this:

pre.test

Your post.test should look like this:

post.test

Great - now we can correctly generate summary statistics of our data!

Create a tibble named data.summary with three columns:

test which is the word “pre” followed by the word “post”
mean which is the mean of score for pre and post
sd which is the standard deviation of score for pre and post

You will need to use the mean() and sd() functions on score from the pre.test and post.test objects.
Try to do this without creating any other variables.

data.summary <- tibble(test = c('pre', 'post'), 
                       mean = c(mean(pre.test$score), mean(post.test$score)), 
                       sd = c(sd(pre.test$score), sd(post.test$score)))

Your data.summary object should look like this

data.summary

We can also get a summary of our age variable using either the pre.test or post.test objects (because each of these objects has only one age value per subject).

What are the mean() and sd() of age in either pre.test or post.test compared to data? The mean is not affected, but the standard deviation is. This is because the formula for the sample standard deviation divides by the number of observations minus one (n - 1).

sd() treats data as having 20 observations because there are 20 rows, so it divides by 19. However, we only have 10 subjects, so the correct divisor is 9. Duplicating every row doubles the sum of squared deviations but more than doubles the divisor, so the result shrinks. This is a reminder to always be careful when calculating summary statistics of your data.

mean(pre.test$age)

## [1] 30.73648

sd(pre.test$age)

## [1] 4.177244

mean(data$age)

## [1] 30.73648

sd(data$age)

## [1] 4.065831
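The size of that difference can be worked out directly. The sketch below rebuilds the ages with the same seed used for base, duplicates them, and compares the two standard deviations. The names doubled and ratio are illustrative. With 10 values the divisor is 9, and with 20 values it is 19 while the sum of squared deviations only doubles, so the ratio of the two results should be sqrt(18 / 19).

```r
# Recreate the 10 ages from Part 1
set.seed(42)
age <- rnorm(10, 28, 5)

# Long format repeats every age once per test
doubled <- c(age, age)

sd(age)      # 10 observations, divides by 9
sd(doubled)  # 20 observations, divides by 19

# The means agree; the sds differ by a factor of sqrt(18 / 19)
all.equal(mean(age), mean(doubled))
ratio <- sd(doubled) / sd(age)
all.equal(ratio, sqrt(18 / 19))
```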

Using `group_by()` and `summarise()` to do the same thing

The above method works, but we can do better. Tidyverse includes the group_by() function, which allows for any computations you ask for to be conducted on a per-group basis. This is nice because we do not need to create new objects using filter like we did above.

We can use group_by() on our test column to ask R to compute the mean and sd of score for each version of the test. To do so we will also use the summarise() function. The summarise() function is used to create new variables which are summaries of larger data. It will return the results of a computation (such as mean) and store it as a new variable. The power of this method is that we can do this multiple times in one pipe to produce a data.frame/tibble which summarises a larger data object, in a similar manner to the data.summary object that we created above.

As an example, look what happens when I run summarise() to compute the mean and sd of age in the post.test object.

Notice that I only ran the function without saving it to an object, so this data was not saved anywhere. However, within the summarise() call I made two new variables (columns), mAge and sdAge, which were the result of running mean() and sd() on age in post.test.

dplyr::summarise(post.test, mAge = mean(age), sdAge = sd(age))

Let’s use group_by() and summarise() in a pipe.

Create a new object named data02 from the data object.
Then, with a pipe, use group_by() on test
Then, with another pipe, use summarise() to create two new variables: mScore and sdScore which are the result of calling mean() and sd() on score

data02 <- data %>%
  group_by(test) %>%
  summarise(mScore = mean(score), sdScore = sd(score))

Your data02 object should look like this - and the values here should be identical to those in data.summary

print(data02)

## # A tibble: 2 × 3
##   test  mScore sdScore
##   <chr>  <dbl>   <dbl>
## 1 post    71.8    6.31
## 2 pre     44.9    7.82
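The same pattern can carry more than two summaries. As a sketch on a small made-up tibble named scores (not the handout's data), the example below also adds n(), which counts the rows in each group and is a handy check that the grouping did what you expected. Notice that the groups come back in alphabetical order, which is why post appears before pre.

```r
library(tidyverse)

# Hypothetical scores for illustration only
scores <- tibble(test  = c('pre', 'pre', 'pre', 'post', 'post', 'post'),
                 score = c(40, 50, 60, 70, 75, 80))

by_test <- scores %>%
  group_by(test) %>%
  summarise(n = n(),                 # rows per group
            mScore = mean(score),    # mean score per group
            sdScore = sd(score))     # sd of score per group

by_test
```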

Hrmm, this still doesn’t solve our problem for age - we don’t want to group by age or by subject. We can use the unique() function to help with this. The unique() function will return only unique rows from a data.frame/tibble - in other words it removes any duplicate rows (but only if the value for each column is the same.)

Let’s try it out - create a new object named unique.age from the data object
Then, make a pipe which uses the select() function to select only subject and age (select() is like filter(), but instead it chooses entire columns and is not based on conditions but rather what you ask for.)
Then, make a second pipe which calls the function unique() with no arguments.

unique.age <- data %>%
  select(subject, age) %>%
  unique()

You should see a data frame with two columns, subject and age, as shown below.
Note that this only works because we filtered for unique combinations of both subject and age.
We would never want to run unique() on just age, because that would delete any duplicate ages among our subjects.

print(unique.age)

##    subject      age
## 1        1 34.85479
## 2        2 25.17651
## 3        3 29.81564
## 4        4 31.16431
## 5        5 30.02134
## 6        6 27.46938
## 7        7 35.55761
## 8        8 27.52670
## 9        9 38.09212
## 10      10 27.68643
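To make that warning concrete, here is a sketch with a made-up tibble named long in which two different subjects share the same age. It also shows distinct() from dplyr, which (as an alternative to the handout's approach) combines the select() and unique() steps into one call.

```r
library(tidyverse)

# Hypothetical long-format data: subjects 1 and 3 are both 30 years old
long <- tibble(subject = c(1, 2, 3, 1, 2, 3),
               age     = c(30, 25, 30, 30, 25, 30),
               test    = rep(c('pre', 'post'), each = 3))

# The approach from the handout: select columns, then drop duplicate rows
via_unique <- long %>%
  select(subject, age) %>%
  unique()

# distinct() keeps only the named columns and drops duplicate combinations
via_distinct <- long %>%
  distinct(subject, age)

nrow(via_unique)          # one row per subject
nrow(via_distinct)        # same result
length(unique(long$age))  # running unique() on age alone loses a subject
```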

Now, extend unique.age with a final pipe calling summarise() to generate the data we want (mean and SD)
Make an object named unique.age which is the same as the above but includes a final pipe to summarise()
In the summarise() function, create mAge and sdAge and apply the mean() and sd() functions

unique.age <- data %>%
  select(subject, age) %>%
  unique() %>%
  summarise(mAge = mean(age), sdAge = sd(age))

You should now see this:

print(unique.age)

##       mAge    sdAge
## 1 30.73648 4.177244

If we wanted to, we could do all of this within one set of pipes
Can you create a new object named final.data from data which includes the mean and sd for score and age grouped by test?
All you need to do is add more arguments to the summarise() function we used when making data02 - you don’t need to do anything with unique()

final.data <- data %>%
  group_by(test) %>%
  summarise(mScore = mean(score), sdScore = sd(score)) %>%
  ungroup() %>%
  cbind(unique.age)

You would see something like this

print(final.data)

##   test   mScore  sdScore     mAge    sdAge
## 1 post 71.76039 6.308726 30.73648 4.177244
## 2  pre 44.91928 7.819863 30.73648 4.177244

Q: Is there anything odd about how age has been put into the data.frame? What caused this?
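For comparison, the approach described in the instructions (adding more arguments to summarise() and not using unique() or cbind()) might look like the sketch below. It uses a small made-up tibble named long rather than the handout's data, and it relies on the assumption that each subject appears exactly once within each level of test, so the age summaries inside each group are based on one row per subject.

```r
library(tidyverse)

# Hypothetical long-format data: 3 subjects, each measured pre and post
long <- tibble(subject = rep(1:3, 2),
               age     = rep(c(20, 30, 40), 2),
               test    = rep(c('pre', 'post'), each = 3),
               score   = c(40, 50, 60, 70, 75, 80))

final <- long %>%
  group_by(test) %>%
  summarise(mScore = mean(score), sdScore = sd(score),
            mAge = mean(age), sdAge = sd(age))

final

# The ungrouped sd uses all 6 rows and is smaller than the per-group value
sd(long$age)
```

The age columns are identical across the two rows because the same subjects appear in both groups.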

Finally, while we could copy and paste from R, we could also ask R to give us a spreadsheet with this information. The write_csv() function is very handy for this. The arguments for write_csv() include the name of the object you want to write and the name you want the output to be.

Write our final data to a .csv file - note that you will OVERWRITE the file each time you run this, be careful :)
Also, be sure to add the file extension that you want, for example if you do not type the .csv the file may look weird on your computer (or it may not know how to open it properly)

write_csv(final.data, 'final-data.csv')
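If you want to check that the file was written as expected, one option is to read it straight back in with read_csv(). The sketch below uses a made-up tibble named out and writes to a temporary file via tempfile(), so it does not overwrite anything in your project folder. Both of those choices are assumptions for the example, not part of the handout.

```r
library(tidyverse)

# Hypothetical summary table for illustration only
out <- tibble(test = c('post', 'pre'), mScore = c(75, 50))

# Write to a temporary .csv file instead of the project folder
path <- tempfile(fileext = '.csv')
write_csv(out, path)

# Read the file back in and compare it with what we wrote
back <- read_csv(path, show_col_types = FALSE)
back
```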

Stephen Skalicky

31/03/2021
