Almost any quantitative report will need descriptive statistics. At a minimum, this means providing the mean and SD of your measures of interest, but many times you need to report more than that. Let’s focus on a way to report M and SD for now.
Need to load our favourite library (load tidyverse
)
library(tidyverse)
Simulating data is one strategy to gain experience.
Our goal is to create a tibble with 20 rows and 4 columns for 10
subjects.
We will have data for pre/post test for the subjects, along with their
age.
We will then create summary statistics and output the statistics as a
spreadsheet.
Let’s go!
You may want to use the rep()
function.
Create a variable named subjects
with 10 values. The
values are the number range 1:10.
Create a second variable named time
with 20 values. The
values should be ten repeats of “pre” and then ten repeats of “post”. It
is important they remain in this order. You can do this many different
ways, but one way is to use two rep()
functions inside a
single c()
function. We will worry about why
time
is twice as long later.
subjects <- rep(1:10, 1)
time <- c(rep("pre",10), rep("post", 10))
Your variables should look like this:
print(subjects)
## [1] 1 2 3 4 5 6 7 8 9 10
print(time)
## [1] "pre" "pre" "pre" "pre" "pre" "pre" "pre" "pre" "pre" "pre"
## [11] "post" "post" "post" "post" "post" "post" "post" "post" "post" "post"
Let’s simulate ages for the participants.
We will use the rnorm()
function, which randomly samples
from a defined normal distribution.
rnorm()
function takes three arguments:
n
, mean
, and sd
n = 10
.mean
of 38 and a standard deviation
(sd
) of 3Before we do so, we will set a random seed so that we all get the same data.
Please run set.seed(1337)
(use this exact number)
Always run the set.seed()
function before creating the
data (i.e., it is not enough to run set.seed()
once per
session. Ask if you’re confused.)
# always run the set.seed() BEFORE you create the age variable, each time.
set.seed(1337)
Now, create a new variable named “age” which is the result of calling
rnorm()
for ten people, with a mean of 38 and a standard
deviation of 2. After you create the age variable with the set.seed(),
check to see what happens when you call more data from
rnorm()
without setting that seed.
age <- rnorm(n = 10, mean = 38, sd = 2)
Your age variable should look like this:
print(age)
## [1] 38.38498 35.10660 37.35364 41.24459 36.62195 42.08424 39.88756 42.16385
## [9] 41.83423 37.17038
Run the rnorm()
function again without using
set.seed()
- you will get different values.
Create new versions of the subjects
and age
variables which are the results of duplicating each variable
- i.e., make subjects
and age
into new
variables which repeat themselves.
You could do with with the c()
function or with the
rep()
function.
Of course this is quite dangerous if you do this more than once, so
use new variable names: subjects2
and age2
# with rep()
#rep(subjects, 2)
#rep(age, 2)
# with c()
#c(subjects, subjects)
#c(age, age)
# make sure to create new names for the variables.
subjects2 <- rep(subjects,2)
age2 <- rep(age, 2)
Your variables should look this like
print(subjects2)
## [1] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
print(age2)
## [1] 38.38498 35.10660 37.35364 41.24459 36.62195 42.08424 39.88756 42.16385
## [9] 41.83423 37.17038 38.38498 35.10660 37.35364 41.24459 36.62195 42.08424
## [17] 39.88756 42.16385 41.83423 37.17038
Time to put our data into a better format: a tibble (data frame)
Create a tibble named data01
with three columns:
subjects2
, age2
, and time
. Create
a pipe and use rename()
to renamed subjects2
as subjects
, and age2
as age
(there are other ways you could do this.)
Look at data01
with glimpse()
data01 <- tibble(subjects2, age2, time) %>%
rename(subjects = subjects2, age = age2)
Your tibble should look like this (using glimpse()
)
glimpse(data01)
## Rows: 20
## Columns: 3
## $ subjects <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
## $ age <dbl> 38.38498, 35.10660, 37.35364, 41.24459, 36.62195, 42.08424, 3…
## $ time <chr> "pre", "pre", "pre", "pre", "pre", "pre", "pre", "pre", "pre"…
Let’s pretend our subjects, on average, did better on the post-test compared to the pre-test. How can we simulate this?
pre
post
set.seed()
)set.seed(9876)
pre <- rnorm(10, 65, 8)
set.seed(9876)
post <- rnorm(10, 85, 2)
Your pre
and post
variables should look
like this:
print(pre)
## [1] 73.17874 55.75703 63.47592 64.25773 65.15074 55.80743 69.76431 50.47631
## [9] 70.11285 64.80356
print(post)
## [1] 87.04468 82.68926 84.61898 84.81443 85.03769 82.70186 86.19108 81.36908
## [9] 86.27821 84.95089
Combine pre
and post
into a new variable
named score
(use c()
) Then, using a pipe,
create a new tibble named data02
from data01
.
Use the pipe to call cbind()
on score
score <- c(pre, post)
data02 <- data01 %>%
cbind(score)
Your data should look like this:
glimpse(data02)
## Rows: 20
## Columns: 4
## $ subjects <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
## $ age <dbl> 38.38498, 35.10660, 37.35364, 41.24459, 36.62195, 42.08424, 3…
## $ time <chr> "pre", "pre", "pre", "pre", "pre", "pre", "pre", "pre", "pre"…
## $ score <dbl> 73.17874, 55.75703, 63.47592, 64.25773, 65.15074, 55.80743, 6…
Compare data01
and data02
using
glimpse()
- what happened?
glimpse(data01)
## Rows: 20
## Columns: 3
## $ subjects <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
## $ age <dbl> 38.38498, 35.10660, 37.35364, 41.24459, 36.62195, 42.08424, 3…
## $ time <chr> "pre", "pre", "pre", "pre", "pre", "pre", "pre", "pre", "pre"…
glimpse(data02)
## Rows: 20
## Columns: 4
## $ subjects <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
## $ age <dbl> 38.38498, 35.10660, 37.35364, 41.24459, 36.62195, 42.08424, 3…
## $ time <chr> "pre", "pre", "pre", "pre", "pre", "pre", "pre", "pre", "pre"…
## $ score <dbl> 73.17874, 55.75703, 63.47592, 64.25773, 65.15074, 55.80743, 6…
Now that we have our data, we want to generate some summary statistics.
Let’s first check with summary()
run this function on
data02
. what do we get? what are we missing?
summary(data02)
## subjects age time score
## Min. : 1.0 Min. :35.11 Length:20 Min. :50.48
## 1st Qu.: 3.0 1st Qu.:37.17 Class :character 1st Qu.:64.67
## Median : 5.5 Median :39.14 Mode :character Median :77.27
## Mean : 5.5 Mean :39.19 Mean :73.92
## 3rd Qu.: 8.0 3rd Qu.:41.83 3rd Qu.:84.85
## Max. :10.0 Max. :42.16 Max. :87.04
There are a ton of ways to get summary statistics in R.
For instance, we can manually call functions on each variable of
interest. Run mean()
, sd()
, and
range()
on data02$score
and
data02$age
Your output should look like this:
mean(data02$score)
## [1] 73.92404
sd(data02$score)
## [1] 12.06526
range(data02$score)
## [1] 50.47631 87.04468
mean(data02$age)
## [1] 39.1852
sd(data02$age)
## [1] 2.519155
range(data02$age)
## [1] 35.10660 42.16385
What are some issues with this method? How would you get these answers into your paper/thesis/report? What might go wrong?
There are more sophisticated ways to get this information. For
example, the package psych
has the describe()
function. (If you do not have psych
, run
install.packages('psych')
)
library(psych)
psych::describe(data02)
## vars n mean sd median trimmed mad min max range skew
## subjects 1 20 5.50 2.95 5.50 5.50 3.71 1.00 10.00 9.00 0.00
## age 2 20 39.19 2.52 39.14 39.32 3.43 35.11 42.16 7.06 -0.13
## time* 3 20 1.50 0.51 1.50 1.50 0.74 1.00 2.00 1.00 0.00
## score 4 20 73.92 12.07 77.27 74.93 12.37 50.48 87.04 36.57 -0.44
## kurtosis se
## subjects -1.40 0.66
## age -1.60 0.56
## time* -2.10 0.11
## score -1.35 2.70
The package Hmisc
also has a describe function
(annoying!) - (If you do not have Hmisc
, run
install.packages('Hmisc')
) - but you probably should use
psych::describe()
anyhow.
library(Hmisc)
Hmisc::describe(data02)
## data02
##
## 4 Variables 20 Observations
## --------------------------------------------------------------------------------
## subjects
## n missing distinct Info Mean Gmd .05 .10
## 20 0 10 0.992 5.5 3.474 1.0 1.9
## .25 .50 .75 .90 .95
## 3.0 5.5 8.0 9.1 10.0
##
## lowest : 1 2 3 4 5, highest: 6 7 8 9 10
##
## Value 1 2 3 4 5 6 7 8 9 10
## Frequency 2 2 2 2 2 2 2 2 2 2
## Proportion 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
## --------------------------------------------------------------------------------
## age
## n missing distinct Info Mean Gmd .05 .10
## 20 0 10 0.992 39.19 2.91 35.11 36.47
## .25 .50 .75 .90 .95
## 37.17 39.14 41.83 42.09 42.16
##
## lowest : 35.10660 36.62195 37.17038 37.35364 38.38498
## highest: 39.88756 41.24459 41.83423 42.08424 42.16385
##
## Value 35.10660 36.62195 37.17038 37.35364 38.38498 39.88756 41.24459
## Frequency 2 2 2 2 2 2 2
## Proportion 0.1 0.1 0.1 0.1 0.1 0.1 0.1
##
## Value 41.83423 42.08424 42.16385
## Frequency 2 2 2
## Proportion 0.1 0.1 0.1
## --------------------------------------------------------------------------------
## time
## n missing distinct
## 20 0 2
##
## Value post pre
## Frequency 10 10
## Proportion 0.5 0.5
## --------------------------------------------------------------------------------
## score
## n missing distinct Info Mean Gmd .05 .10
## 20 0 20 1 73.92 13.71 55.49 55.80
## .25 .50 .75 .90 .95
## 64.67 77.27 84.85 86.20 86.32
##
## lowest : 50.47631 55.75703 55.80743 63.47592 64.25773
## highest: 84.95089 85.03769 86.19108 86.27821 87.04468
##
## 50.4763082827118 (1, 0.05), 55.7570280170324 (1, 0.05), 55.8074272726501 (1,
## 0.05), 63.4759184130138 (1, 0.05), 64.257728822454 (1, 0.05), 64.8035573081937
## (1, 0.05), 65.1507432628376 (1, 0.05), 69.7643060921729 (1, 0.05),
## 70.1128487352441 (1, 0.05), 73.1787395912597 (1, 0.05), 81.369077070678 (1,
## 0.05), 82.6892570042581 (1, 0.05), 82.7018568181625 (1, 0.05), 84.6189796032534
## (1, 0.05), 84.8144322056135 (1, 0.05), 84.9508893270484 (1, 0.05),
## 85.0376858157094 (1, 0.05), 86.1910765230432 (1, 0.05), 86.278212183811 (1,
## 0.05), 87.0446848978149 (1, 0.05)
## --------------------------------------------------------------------------------
describe
by
calling the specific package in the format of
package::function()
The above information is not sufficient for two reasons. First of
all, subjects are repeated twice and thus so is their age score - while
this does not affect calculations such as mean()
it does
affect calculations such as sd()
and more. Not good! We
also do not care about the M and SD for score
as a whole,
but rather for the two different time points (pre and post)
R has plenty of ways to do this. Base R has bracket notation and
conditional logic. Below I extract the mean for score
for
both the pre test and the post test.
# Do you understand what this code is doing?
mean(data02[data02$time == 'pre',]$score)
## [1] 63.27846
mean(data02[data02$time == 'post',]$score)
## [1] 84.56962
Of course, I would recommend using tidyverse
functions
for this.