Simulating and describing data

Almost any quantitative report will need descriptive statistics. At a minimum, this means providing the mean and SD of your measures of interest, but many times you need to report more than that. Let’s focus on a way to report M and SD for now.

Our task

Simulating data is one strategy to gain experience.
Our goal is to create a tibble with 20 rows and 4 columns for 10 subjects.
We will have data for pre/post test for the subjects, along with their age.
We will then create summary statistics and output the statistics as a spreadsheet.

Let’s go!

Step 1.

You may want to use the rep() function.

Create a variable named subjects with 10 values. The values are the number range 1:10.

Create a second variable named time with 20 values. The values should be ten repeats of “pre” and then ten repeats of “post”. It is important they remain in this order. You can do this many different ways, but one way is to use two rep() functions inside a single c() function. We will worry about why time is twice as long later.

subjects <- rep(1:10, 1)
time <- c(rep("pre",10), rep("post", 10))

Your variables should look like this:

print(subjects)

##  [1]  1  2  3  4  5  6  7  8  9 10

print(time)

##  [1] "pre"  "pre"  "pre"  "pre"  "pre"  "pre"  "pre"  "pre"  "pre"  "pre" 
## [11] "post" "post" "post" "post" "post" "post" "post" "post" "post" "post"

Step 2.

Let’s simulate ages for the participants.

We will use the rnorm() function, which randomly samples from a defined normal distribution.

The rnorm() function takes three arguments: n, mean, and sd
We have ten subjects, so set n = 10.
Let’s ask for a mean of 38 and a standard deviation (sd) of 3

Before we do so, we will set a random seed so that we all get the same data.

Please run set.seed(1337) (use this exact number)

Always run the set.seed() function before creating the data (i.e., it is not enough to run set.seed() once per session. Ask if you’re confused.)

# always run the set.seed() BEFORE you create the age variable, each time. 
set.seed(1337)

Now, create a new variable named “age” which is the result of calling rnorm() for ten people, with a mean of 38 and a standard deviation of 2. After you create the age variable with the set.seed(), check to see what happens when you call more data from rnorm() without setting that seed.

age <- rnorm(n = 10, mean = 38, sd =  2)

Your age variable should look like this:

print(age)

##  [1] 38.38498 35.10660 37.35364 41.24459 36.62195 42.08424 39.88756 42.16385
##  [9] 41.83423 37.17038

Run the rnorm() function again without using set.seed() - you will get different values.

Step 3.

Create new versions of the subjects and age variables which are the results of duplicating each variable
- i.e., make subjects and age into new variables which repeat themselves.

You could do with with the c() function or with the rep() function.

Of course this is quite dangerous if you do this more than once, so use new variable names: subjects2 and age2

# with rep()
#rep(subjects, 2)
#rep(age, 2)

# with c()
#c(subjects, subjects)
#c(age, age)

# make sure to create new names for the variables. 
subjects2 <- rep(subjects,2)
age2 <- rep(age, 2)

Your variables should look this like

print(subjects2)

##  [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10

print(age2)

##  [1] 38.38498 35.10660 37.35364 41.24459 36.62195 42.08424 39.88756 42.16385
##  [9] 41.83423 37.17038 38.38498 35.10660 37.35364 41.24459 36.62195 42.08424
## [17] 39.88756 42.16385 41.83423 37.17038

Step 4.

Time to put our data into a better format: a tibble (data frame)

Create a tibble named data01 with three columns: subjects2, age2, and time. Create a pipe and use rename() to renamed subjects2 as subjects, and age2 as age (there are other ways you could do this.)

Look at data01 with glimpse()

data01 <- tibble(subjects2, age2, time) %>%
  rename(subjects = subjects2, age = age2)

Your tibble should look like this (using glimpse())

glimpse(data01)

## Rows: 20
## Columns: 3
## $ subjects <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
## $ age      <dbl> 38.38498, 35.10660, 37.35364, 41.24459, 36.62195, 42.08424, 3…
## $ time     <chr> "pre", "pre", "pre", "pre", "pre", "pre", "pre", "pre", "pre"…

Step 5.

Let’s pretend our subjects, on average, did better on the post-test compared to the pre-test. How can we simulate this?

assume the tests have a range of 0-100
assume a mean of 65 on the pretest, with an SD of 8, save the variable as pre
assume a mean of 97 on the posttest, with an SD of 2, save the variable as post
set a random seed of 9876 before each variable (use set.seed())

set.seed(9876)
pre <- rnorm(10, 65, 8)
  

set.seed(9876)
post <- rnorm(10, 85, 2)

Your pre and post variables should look like this:

print(pre)

##  [1] 73.17874 55.75703 63.47592 64.25773 65.15074 55.80743 69.76431 50.47631
##  [9] 70.11285 64.80356

print(post)

##  [1] 87.04468 82.68926 84.61898 84.81443 85.03769 82.70186 86.19108 81.36908
##  [9] 86.27821 84.95089

Step 6.

Combine pre and post into a new variable named score (use c()) Then, using a pipe, create a new tibble named data02 from data01. Use the pipe to call cbind() on score

score <- c(pre, post)
data02 <- data01 %>%
  cbind(score)

Your data should look like this:

glimpse(data02)

## Rows: 20
## Columns: 4
## $ subjects <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
## $ age      <dbl> 38.38498, 35.10660, 37.35364, 41.24459, 36.62195, 42.08424, 3…
## $ time     <chr> "pre", "pre", "pre", "pre", "pre", "pre", "pre", "pre", "pre"…
## $ score    <dbl> 73.17874, 55.75703, 63.47592, 64.25773, 65.15074, 55.80743, 6…

Compare data01 and data02 using glimpse()- what happened?

glimpse(data01)

## Rows: 20
## Columns: 3
## $ subjects <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
## $ age      <dbl> 38.38498, 35.10660, 37.35364, 41.24459, 36.62195, 42.08424, 3…
## $ time     <chr> "pre", "pre", "pre", "pre", "pre", "pre", "pre", "pre", "pre"…

glimpse(data02)

## Rows: 20
## Columns: 4
## $ subjects <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
## $ age      <dbl> 38.38498, 35.10660, 37.35364, 41.24459, 36.62195, 42.08424, 3…
## $ time     <chr> "pre", "pre", "pre", "pre", "pre", "pre", "pre", "pre", "pre"…
## $ score    <dbl> 73.17874, 55.75703, 63.47592, 64.25773, 65.15074, 55.80743, 6…

Step 7.

Now that we have our data, we want to generate some summary statistics.

Let’s first check with summary() run this function on data02. what do we get? what are we missing?

summary(data02)

##     subjects         age            time               score      
##  Min.   : 1.0   Min.   :35.11   Length:20          Min.   :50.48  
##  1st Qu.: 3.0   1st Qu.:37.17   Class :character   1st Qu.:64.67  
##  Median : 5.5   Median :39.14   Mode  :character   Median :77.27  
##  Mean   : 5.5   Mean   :39.19                      Mean   :73.92  
##  3rd Qu.: 8.0   3rd Qu.:41.83                      3rd Qu.:84.85  
##  Max.   :10.0   Max.   :42.16                      Max.   :87.04

There are a ton of ways to get summary statistics in R.

For instance, we can manually call functions on each variable of interest. Run mean(), sd(), and range() on data02$score and data02$age

Your output should look like this:

mean(data02$score)

## [1] 73.92404

sd(data02$score)

## [1] 12.06526

range(data02$score)

## [1] 50.47631 87.04468

mean(data02$age)

## [1] 39.1852

sd(data02$age)

## [1] 2.519155

range(data02$age)

## [1] 35.10660 42.16385

What are some issues with this method? How would you get these answers into your paper/thesis/report? What might go wrong?

There are more sophisticated ways to get this information. For example, the package psych has the describe() function. (If you do not have psych, run install.packages('psych'))

library(psych)
psych::describe(data02)

##          vars  n  mean    sd median trimmed   mad   min   max range  skew
## subjects    1 20  5.50  2.95   5.50    5.50  3.71  1.00 10.00  9.00  0.00
## age         2 20 39.19  2.52  39.14   39.32  3.43 35.11 42.16  7.06 -0.13
## time*       3 20  1.50  0.51   1.50    1.50  0.74  1.00  2.00  1.00  0.00
## score       4 20 73.92 12.07  77.27   74.93 12.37 50.48 87.04 36.57 -0.44
##          kurtosis   se
## subjects    -1.40 0.66
## age         -1.60 0.56
## time*       -2.10 0.11
## score       -1.35 2.70

The package Hmisc also has a describe function (annoying!) - (If you do not have Hmisc, run install.packages('Hmisc')) - but you probably should use psych::describe() anyhow.

library(Hmisc)
Hmisc::describe(data02)

## data02 
## 
##  4  Variables      20  Observations
## --------------------------------------------------------------------------------
## subjects 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       20        0       10    0.992      5.5    3.474      1.0      1.9 
##      .25      .50      .75      .90      .95 
##      3.0      5.5      8.0      9.1     10.0 
## 
## lowest :  1  2  3  4  5, highest:  6  7  8  9 10
##                                                   
## Value        1   2   3   4   5   6   7   8   9  10
## Frequency    2   2   2   2   2   2   2   2   2   2
## Proportion 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
## --------------------------------------------------------------------------------
## age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       20        0       10    0.992    39.19     2.91    35.11    36.47 
##      .25      .50      .75      .90      .95 
##    37.17    39.14    41.83    42.09    42.16 
## 
## lowest : 35.10660 36.62195 37.17038 37.35364 38.38498
## highest: 39.88756 41.24459 41.83423 42.08424 42.16385
##                                                                          
## Value      35.10660 36.62195 37.17038 37.35364 38.38498 39.88756 41.24459
## Frequency         2        2        2        2        2        2        2
## Proportion      0.1      0.1      0.1      0.1      0.1      0.1      0.1
##                                      
## Value      41.83423 42.08424 42.16385
## Frequency         2        2        2
## Proportion      0.1      0.1      0.1
## --------------------------------------------------------------------------------
## time 
##        n  missing distinct 
##       20        0        2 
##                     
## Value      post  pre
## Frequency    10   10
## Proportion  0.5  0.5
## --------------------------------------------------------------------------------
## score 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       20        0       20        1    73.92    13.71    55.49    55.80 
##      .25      .50      .75      .90      .95 
##    64.67    77.27    84.85    86.20    86.32 
## 
## lowest : 50.47631 55.75703 55.80743 63.47592 64.25773
## highest: 84.95089 85.03769 86.19108 86.27821 87.04468
## 
## 50.4763082827118 (1, 0.05), 55.7570280170324 (1, 0.05), 55.8074272726501 (1,
## 0.05), 63.4759184130138 (1, 0.05), 64.257728822454 (1, 0.05), 64.8035573081937
## (1, 0.05), 65.1507432628376 (1, 0.05), 69.7643060921729 (1, 0.05),
## 70.1128487352441 (1, 0.05), 73.1787395912597 (1, 0.05), 81.369077070678 (1,
## 0.05), 82.6892570042581 (1, 0.05), 82.7018568181625 (1, 0.05), 84.6189796032534
## (1, 0.05), 84.8144322056135 (1, 0.05), 84.9508893270484 (1, 0.05),
## 85.0376858157094 (1, 0.05), 86.1910765230432 (1, 0.05), 86.278212183811 (1,
## 0.05), 87.0446848978149 (1, 0.05)
## --------------------------------------------------------------------------------

Note that we called the specific use of describe by calling the specific package in the format of package::function()

Step 8.

The above information is not sufficient for two reasons. First of all, subjects are repeated twice and thus so is their age score - while this does not affect calculations such as mean() it does affect calculations such as sd() and more. Not good! We also do not care about the M and SD for score as a whole, but rather for the two different time points (pre and post)

R has plenty of ways to do this. Base R has bracket notation and conditional logic. Below I extract the mean for score for both the pre test and the post test.

# Do you understand what this code is doing?

mean(data02[data02$time == 'pre',]$score)

## [1] 63.27846

mean(data02[data02$time == 'post',]$score)

## [1] 84.56962

Of course, I would recommend using tidyverse functions for this.