Centering, standardizing, and correlations

Author

Akihiro

Published

October 8, 2025

5. Introduction

  • What is a linear transformation?

    e.g., Addition: adding 1 to (2, 4, and 6) -> (3, 5, and 7)

  • Why is it useful?

    1. Interpretational advantages
    2. Making variables comparable
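As a minimal sketch of the idea, the snippet below applies two linear transformations (adding a constant and multiplying by a constant) to the small vector from the example above. The names `x_add` and `x_mul` are made up for illustration.

```r
x <- c(2, 4, 6)

x_add <- x + 1   # addition: shifts every value by the same amount
x_mul <- x * 10  # multiplication: stretches the scale by a constant factor

x_add  # 3 5 7
x_mul  # 20 40 60

# Both results are still in a perfect straight-line relationship with x,
# so the correlation with the original values is 1 in both cases.
cor(x, x_add)
cor(x, x_mul)
```

The values change, but the relationships among the data points do not, which is what makes these transformations safe to use for interpretation.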

5.1 Centering

‘Centering’ is a linear transformation often used with continuous predictor variables: the mean of a variable is subtracted from each of its values, so the data are expressed as deviations from the mean. Each data point then shows how far it lies above the mean (positive score) or below the mean (negative score), which makes the model easier to interpret. Let’s look at some examples:

  • Log word frequency as the predictor of response durations

  1. Uncentered:

    The intercept = predicted response time when log frequency = 0.

  2. Centered:

    0 = the mean log frequency.

    The intercept = predicted response time at the average frequency.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)

# Load the dataset
ELP <- read_csv("ELP_frequency.csv")
Rows: 12 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Word
dbl (2): Freq, RT

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Inspect structure
ELP
# A tibble: 12 × 3
   Word      Freq    RT
   <chr>    <dbl> <dbl>
 1 thing    55522  622.
 2 life     40629  520.
 3 door     14895  507.
 4 angel     3992  637.
 5 beer      3850  587.
 6 disgrace   409  705 
 7 kitten     241  611.
 8 bloke      238  794.
 9 mocha       66  725.
10 gnome       32  810.
11 nihilism     4  764.
12 puffball     4  878.
# Center log10 word frequency
ELP <- mutate(
  ELP,
  Log10Freq = log10(Freq),
  Log10Freq_c = Log10Freq - mean(Log10Freq, na.rm = TRUE)
)


# Check mean ~ 0
mean(ELP$Log10Freq_c)
[1] 0
# Uncentered and centered models
mdl_unc <- lm(RT ~ Log10Freq, data = ELP)
mdl_c <- lm(RT ~ Log10Freq_c, data = ELP)


# Compare coefficients
tidy(mdl_unc)
# A tibble: 2 × 5
  term        estimate std.error statistic       p.value
  <chr>          <dbl>     <dbl>     <dbl>         <dbl>
1 (Intercept)    871.       40.4     21.5  0.00000000103
2 Log10Freq      -70.3      13.3     -5.30 0.000348     
tidy(mdl_c)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    680.       18.3     37.2  4.71e-12
2 Log10Freq_c    -70.3      13.3     -5.30 3.48e- 4
# Plot with uncentered log frequency
ELP %>%
  ggplot(aes(x = Log10Freq, y = RT)) +
  geom_text(aes(label = Word)) +
  geom_smooth(method = "lm", color = "blue") +
  ggtitle("(a) Uncentered") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

# Plot with centered log frequency
ELP %>%
  ggplot(aes(x = Log10Freq_c, y = RT)) +
  geom_text(aes(label = Word)) +
  geom_smooth(method = "lm", color = "red") +
  ggtitle("(b) Centered") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

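  • Notice that the slope of the regression line (–70.3) does not change when moving from (a) to (b). Only the intercept changes: from 871 (the predicted RT at a log frequency of 0) to 680 (the predicted RT at the mean log frequency).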
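Why does only the intercept move? A minimal sketch, using hypothetical toy data rather than the ELP data: in a simple regression, the centered intercept equals the uncentered intercept plus the slope times the mean of the predictor, and it coincides with the mean of the response.

```r
# Hypothetical toy data (NOT the ELP data), for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(820, 760, 700, 640, 610)

fit_unc <- lm(y ~ x)        # uncentered predictor
x_c     <- x - mean(x)      # centered predictor
fit_c   <- lm(y ~ x_c)

b0 <- coef(fit_unc)[["(Intercept)"]]
b1 <- coef(fit_unc)[["x"]]

# Moving along the uncentered line from x = 0 to x = mean(x) ...
b0 + b1 * mean(x)
# ... lands exactly on the centered intercept,
coef(fit_c)[["(Intercept)"]]
# which, in simple regression, is also the mean of the response.
mean(y)

# The slope is the same in both models.
c(uncentered = b1, centered = coef(fit_c)[["x_c"]])
```

Centering therefore picks a different point on the same line to report as the intercept; the line itself stays where it is.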

Another example based on Winter (2019, p. 87):

“In some cases, uncentered intercepts outright make no sense. For example, when performance in a sports game is modeled as a function of height, the intercept is the predicted performance someone of 0 height. After centering, the intercept becomes the predicted performance for a person of average height, a much more meaningful quantity.”

Maybe something like high jump? ([Irasutoya: Free illustrations for personal and commercial use](https://www.irasutoya.com/2014/05/blog-post_9118.html))

  1. Uncentered (raw height):

    The intercept = predicted jump height for someone 0 cm tall. That is meaningless, because no athlete has a height of zero.

  2. Centered (mean height):

    0 now represents the average athlete’s height (say, 172 cm).

    The intercept = predicted jump height for an athlete of average height, which makes sense.

    The slope (how much jump height changes per cm of body height) stays the same before and after centering.

5.2 Standardizing (z-scoring)

‘Standardizing’ or ‘z-scoring’ means centering a variable (subtracting its mean) and then dividing by its standard deviation. Each data point is then expressed in terms of how many standard deviations it lies above (+) or below (–) the mean. Let’s see an example:

  1. Response durations from a psycholinguistic experiment
# Example response durations
RTs <- c(460, 480, 500, 520, 540)

# Mean and standard deviation
mean_RT <- mean(RTs)
sd_RT   <- sd(RTs)

mean_RT
[1] 500
sd_RT
[1] 31.62278
# Center the values (subtract the mean)
RTs_centered <- RTs - mean_RT
RTs_centered
[1] -40 -20   0  20  40
# Standardize (divide centered values by SD)
RTs_z <- RTs_centered / sd_RT
RTs_z
[1] -1.2649111 -0.6324555  0.0000000  0.6324555  1.2649111

Note that standardization changes the units of measurement to ‘standard units’ (often represented by the letter z), but not the relative relationships among data points (see Winter, 2019, p. 88).
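A quick way to convince yourself of this, sketched with the same five response durations: the z-scores have a mean of 0 and an SD of 1, they match what `scale()` returns, and they are still perfectly correlated with the raw values.

```r
RTs   <- c(460, 480, 500, 520, 540)
RTs_z <- (RTs - mean(RTs)) / sd(RTs)

mean(RTs_z)  # 0, up to floating-point error
sd(RTs_z)    # 1

# scale() does the same computation but returns a one-column matrix,
# so it is converted to a plain vector before comparing.
all.equal(as.numeric(scale(RTs)), RTs_z)

# The relative relationships are untouched: the correlation with the
# raw values is 1, and every gap shrinks by the same factor, 1 / sd(RTs).
cor(RTs, RTs_z)
diff(RTs_z) / diff(RTs)
```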

Winter (2019, p.88)
# Standardized predictor
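# scale() returns a one-column matrix; as.numeric() keeps the new column a plain vector
ELP <- mutate(ELP, Log10Freq_z = as.numeric(scale(Log10Freq)))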

# Standardized model
mdl_z <- lm(RT ~ Log10Freq_z, data = ELP)

# Compare to unstandardized model
tidy(mdl_unc)
# A tibble: 2 × 5
  term        estimate std.error statistic       p.value
  <chr>          <dbl>     <dbl>     <dbl>         <dbl>
1 (Intercept)    871.       40.4     21.5  0.00000000103
2 Log10Freq      -70.3      13.3     -5.30 0.000348     
tidy(mdl_z)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     680.      18.3     37.2  4.71e-12
2 Log10Freq_z    -101.      19.1     -5.30 3.48e- 4

The slope looks different (from –70.3 to –101) because standardizing changes the predictor’s units from log10 frequency to standard deviations. The model itself hasn’t changed: the test statistic and p-value for the slope are identical in both tables. The new slope is the predicted change in RT for a 1 standard deviation increase in the predictor.
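The link between the two slopes can be sketched with hypothetical toy data (not the ELP data): the standardized slope is the raw slope multiplied by the SD of the predictor. Read the other way around, the ratio of the two slopes reported above (101 / 70.3) suggests that the SD of Log10Freq is roughly 1.4, although that is only a back-of-the-envelope figure from rounded output.

```r
# Hypothetical toy data (NOT the ELP data), for illustration only
x <- c(0.6, 1.5, 2.4, 3.6, 4.7)
y <- c(830, 790, 700, 650, 560)

fit_raw <- lm(y ~ x)
x_z     <- (x - mean(x)) / sd(x)   # z-scored predictor
fit_z   <- lm(y ~ x_z)

slope_raw <- coef(fit_raw)[["x"]]
slope_z   <- coef(fit_z)[["x_z"]]

# One SD of x is sd(x) raw units, so the slope per SD is:
slope_raw * sd(x)
slope_z

# The t statistic for the slope is the same in both models.
t_raw <- summary(fit_raw)$coefficients["x", "t value"]
t_z   <- summary(fit_z)$coefficients["x_z", "t value"]
c(raw = t_raw, standardized = t_z)
```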

So what’s standardizing good for?

  • Removes the original metric of a variable

  • May help in making variables with different scales comparable

  • Useful for assessing the relative impact of multiple predictors

    -> see Chapter 6 and the example of penguins’ bill_length_mm and body_mass_g.

5.3 Correlation

What if you standardized the response variable as well?

If both predictor and response are standardized, the regression slope becomes Pearson’s r, showing how many standard units y changes per standard unit change in x.
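This can be checked directly. The sketch below uses hypothetical toy data and made-up object names (`x_z`, `y_z`, `fit_zz`): it z-scores both variables, fits a simple regression, and compares the slope with `cor()`.

```r
# Hypothetical toy data, for illustration only
x <- c(0.6, 1.5, 2.4, 3.6, 4.7)
y <- c(830, 790, 700, 650, 560)

# z-score both variables (as.numeric() drops the matrix returned by scale())
x_z <- as.numeric(scale(x))
y_z <- as.numeric(scale(y))

fit_zz <- lm(y_z ~ x_z)

slope_zz <- coef(fit_zz)[["x_z"]]
r_xy     <- cor(x, y)   # Pearson's r on the raw values

c(slope = slope_zz, r = r_xy)   # the two numbers agree

# The intercept is 0 (up to floating-point error), because both
# variables now have a mean of 0.
coef(fit_zz)[["(Intercept)"]]
```

Note that this equivalence holds for a simple regression with a single predictor.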

Pearson’s r is a standardized measure, so you can interpret correlation strength without knowing the data’s units. Can you draw a mental picture of what the correlation looks like?

Winter (2019, p.90)

Whether a given r is “high” or “low” depends on domain knowledge. For example, r = 0.8 is very high in psychology/linguistics, but may be considered low in quantum chemistry.

  • Have you already seen another standardized statistic?

$R^2$, the coefficient of determination, measures how much variance a model explains. In simple linear regression with one predictor, $R^2$ is just the square of Pearson’s r.
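A minimal sketch of that relationship, again with hypothetical toy data: the R-squared reported by `summary()` for a one-predictor model matches the squared Pearson correlation.

```r
# Hypothetical toy data, for illustration only
x <- c(0.6, 1.5, 2.4, 3.6, 4.7)
y <- c(830, 790, 700, 650, 560)

fit <- lm(y ~ x)

r_squared_model <- summary(fit)$r.squared  # coefficient of determination
r_squared_cor   <- cor(x, y)^2             # Pearson's r, squared

c(model = r_squared_model, cor_squared = r_squared_cor)
```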

Extra. A ‘Cookbook’ Approach?

The textbook says “This book is not a ‘cookbook’ that teaches you a whole range of different techniques”, meaning it focuses on modeling rather than on picking different statistical methods. But my dataset is different: it is ordinal, not continuous, and the book covers only Pearson’s r for correlation. Can you take a look and think about a good method for analyzing it?

library(dplyr)
library(ggplot2)

data <- read.csv("Correlation.csv")

data
    OPL_CEFR NICT_CEFR
1          1         1
2          1         1
3          1         2
4          1         2
5          1         3
6          1         1
7          1         1
8          1         2
9          1         1
10         1         2
11         1         1
12         1         1
13         1         1
14         1         2
15         1         2
16         1         2
17         1         3
18         1         2
19         1         2
20         1         2
21         1         2
22         1         1
23         1         1
24         1         2
25         1         1
26         1         1
27         1         1
28         1         2
29         1         1
30         1         3
31         1         1
32         1         1
33         1         2
34         1         1
35         1         1
36         1         2
37         1         1
38         1         1
39         1         1
40         1         1
41         1         1
42         1         2
43         1         1
44         1         1
45         1         2
46         1         2
47         1         1
48         1         1
49         1         1
50         1         1
51         1         2
52         1         1
53         1         1
54         1         1
55         1         1
56         1         1
57         1         1
58         1         1
59         1         1
60         1         1
61         1         3
62         1         1
63         1         1
64         1         1
65         1         1
66         1         1
67         1         1
68         1         1
69         1         1
70         1         1
71         1         1
72         1         1
73         1         2
74         1         1
75         1         1
76         1         1
77         1         1
78         1         1
79         1         1
80         1         1
81         1         1
82         1         1
83         1         1
84         1         1
85         1         1
86         1         1
87         1         2
88         1         1
89         1         2
90         1         1
91         1         1
92         1         1
93         1         1
94         1         1
95         1         1
96         1         2
97         1         1
98         1         1
99         1         1
100        1         3
101        1         1
102        1         1
103        1         1
104        1         1
105        1         1
106        1         1
107        1         2
108        1         1
109        1         1
110        1         1
111        1         1
112        1         1
113        1         1
114        1         1
115        1         5
116        1         1
117        1         1
118        1         1
119        1         1
120        1         1
121        1         1
122        1         1
123        1         1
124        1         1
125        1         1
126        1         1
127        1         1
128        1         1
129        1         1
130        1         1
131        1         1
132        1         2
133        1         1
134        1         2
135        1         1
136        1         1
137        1         1
138        1         1
139        1         2
140        1         2
141        1         1
142        1         1
143        1         2
144        1         1
145        1         2
146        1         2
147        1         1
148        1         1
149        1         1
150        1         1
151        1         1
152        1         1
153        1         2
154        1         1
155        1         1
156        1         1
157        1         1
158        1         1
159        1         1
160        1         1
161        1         3
162        1         1
163        1         1
164        1         1
165        1         2
166        1         1
167        1         1
168        1         1
169        1         1
170        1         1
171        1         1
172        1         1
173        1         2
174        1         1
175        1         1
176        1         1
177        1         1
178        1         1
179        1         1
180        1         1
181        1         2
182        1         1
183        1         1
184        1         1
185        1         1
186        1         1
187        1         1
188        1         1
189        1         1
190        1         1
191        2         2
192        2         1
193        2         2
194        2         2
195        2         1
196        2         1
197        2         3
198        2         2
199        2         1
200        2         3
201        2         2
202        2         2
203        2         5
204        2         1
205        2         2
206        2         1
207        2         2
208        2         4
209        2         1
210        2         2
211        2         2
212        2         2
213        2         2
214        2         2
215        2         1
216        2         1
217        2         3
218        2         2
219        2         2
220        2         2
221        2         5
222        2         1
223        2         1
224        2         1
225        2         2
226        2         2
227        2         1
228        2         2
229        2         1
230        2         3
231        2         5
232        2         3
233        2         2
234        2         1
235        2         2
236        2         1
237        2         2
238        2         1
239        2         5
240        2         2
241        2         1
242        2         2
243        2         2
244        2         1
245        2         2
246        2         2
247        2         1
248        2         1
249        2         2
250        2         1
251        2         1
252        2         2
253        2         2
254        2         2
255        2         1
256        2         2
257        2         2
258        2         3
259        2         1
260        2         2
261        2         1
262        2         3
263        2         1
264        2         2
265        2         1
266        2         2
267        2         2
268        2         2
269        2         1
270        2         1
271        2         3
272        2         3
273        2         2
274        2         2
275        2         2
276        2         1
277        2         3
278        2         2
279        2         1
280        2         3
281        2         2
282        2         1
283        2         2
284        2         2
285        2         5
286        2         3
287        2         2
288        2         2
289        2         5
290        2         1
291        2         2
292        2         2
293        2         2
294        2         2
295        2         2
296        2         1
297        2         1
298        2         2
299        2         1
300        2         1
301        2         2
302        2         1
303        2         2
304        2         2
305        2         2
306        2         1
307        2         2
308        2         1
309        2         2
310        2         2
311        2         2
312        2         4
313        2         1
314        2         4
315        2         1
316        2         1
317        2         1
318        2         2
319        2         2
320        2         1
321        2         2
322        2         1
323        2         1
324        2         1
325        2         2
326        2         1
327        2         2
328        2         2
329        2         2
330        2         3
331        2         2
332        2         2
333        2         2
334        2         1
335        2         5
336        2         3
337        2         2
338        2         1
339        2         1
340        2         2
341        2         2
342        3         4
343        3         1
344        3         2
345        3         2
346        3         3
347        3         2
348        3         1
349        3         1
350        3         2
351        3         1
352        3         2
353        3         1
354        3         1
355        3         3
356        3         1
357        3         1
358        3         1
359        3         1
360        3         5
361        3         1
362        3         1
363        3         2
364        3         2
365        3         3
366        3         3
367        3         2
368        3         2
369        3         3
370        3         1
371        3         1
372        3         3
373        3         2
374        3         2
375        3         2
376        3         3
377        3         2
378        3         2
379        3         1
380        3         2
381        3         2
382        3         2
383        3         2
384        3         3
385        3         1
386        3         1
387        3         3
388        3         2
389        3         2
390        3         2
391        3         2
392        3         2
393        3         2
394        3         2
395        3         1
396        3         2
397        3         3
398        3         2
399        3         2
400        3         2
401        3         1
402        3         3
403        3         2
404        3         2
405        3         1
406        3         1
407        3         1
408        3         3
409        3         2
410        3         2
411        3         2
412        3         2
413        3         3
414        3         2
415        3         2
416        3         2
417        3         3
418        3         1
419        3         2
420        3         2
421        3         2
422        3         2
423        3         5
424        3         3
425        3         1
426        3         1
427        3         2
428        3         2
429        3         1
430        3         2
431        3         2
432        3         5
433        3         2
434        3         3
435        3         1
436        3         5
437        3         2
438        3         2
439        3         1
440        3         1
441        3         1
442        3         1
443        3         3
444        3         1
445        3         2
446        3         2
447        3         3
448        3         3
449        3         1
450        3         2
451        3         2
452        3         2
453        3         2
454        3         2
455        3         2
456        3         2
457        3         2
458        3         2
459        3         3
460        3         3
461        3         1
462        3         1
463        3         2
464        3         2
465        3         2
466        3         1
467        3         2
468        3         2
469        3         1
470        3         3
471        3         4
472        3         2
473        3         1
474        3         1
475        3         4
476        3         2
477        3         1
478        3         2
479        3         2
480        3         2
481        4         2
482        4         2
483        4         2
484        4         3
485        4         3
486        4         4
487        4         3
488        4         1
489        4         3
490        4         5
491        4         1
492        4         2
493        4         3
494        4         3
495        4         4
496        4         2
497        4         2
498        4         2
499        4         2
500        4         2
501        4         2
502        4         2
503        4         2
504        4         5
505        4         3
506        4         5
507        4         3
508        4         3
509        4         5
510        4         2
511        4         5
512        4         4
513        4         2
514        4         1
515        4         2
516        4         2
517        4         1
518        4         1
519        4         3
520        4         3
521        4         2
522        4         2
523        4         5
524        4         2
525        4         2
526        4         1
527        4         4
528        4         2
529        4         2
530        4         2
531        4         5
532        4         5
533        4         2
534        4         3
535        4         1
536        4         2
537        4         4
538        4         5
539        4         2
540        4         2
541        4         3
542        4         3
543        4         5
544        4         3
545        4         5
546        4         2
547        4         1
548        4         2
549        4         2
550        4         3
551        4         3
552        4         5
553        4         3
554        4         2
555        4         2
556        4         2
557        4         2
558        5         5
559        5         4
560        5         5
561        5         3
562        5         3
563        5         3
564        5         2
565        5         5
566        5         5
567        5         3
568        5         3
569        5         2
570        5         3
571        5         2
572        5         5
573        5         5
574        5         2
575        5         2

The NICT_CEFR column is based on a corpus of 1,281 Japanese EFL learners and 20 native speakers, including speakers’ levels (A1–B2 and Native). I extracted 575 phrases from the corpus, based on the Oxford Phrase List (OPL), which assigns CEFR levels to each item (A1–C1). To compare learners’ usage with the Oxford levels (OPL_CEFR), I identified the earliest proficiency level at which each phrase was used in the corpus (NICT_CEFR). For example, the A1-level OPL phrase “a few” was first used by learners at the A1 level in the corpus. Both sets of CEFR levels were then converted to ordinal codes (A1 = 1, A2 = 2, B1 = 3, B2 = 4, C1/Native = 5).

Let’s see the distributions.

data$OPL_lab  <- factor(data$OPL_CEFR,  levels = 1:5)
data$NICT_lab <- factor(data$NICT_CEFR, levels = 1:5)

# Histogram for OPL CEFR
ggplot(data, aes(x = OPL_CEFR)) +
  geom_histogram(binwidth = 1, color = "black", fill = "skyblue") +
  labs(title = "Distribution of OPL CEFR Levels", x = "OPL CEFR Level", y = "Count") +
  theme_minimal()

# Histogram for NICT CEFR
ggplot(data, aes(x = NICT_CEFR)) +
  geom_histogram(binwidth = 1, color = "black", fill = "salmon") +
  labs(title = "Distribution of NICT CEFR Levels", x = "NICT CEFR Level", y = "Count") +
  theme_minimal()

According to Brezina (2018, Chapter 5, Register Variation – Correlation, Clusters and Factors), Spearman’s correlation is used for ordinal data or scale data that do not meet parametric assumptions. Rather than using raw values, it computes the association based on ranks, measuring the relationship through the differences between ranks instead of means and standard deviations.

Brezina (2018, p.146)

Let’s do a manual calculation and compare the result with that of the cor.test() function.

# Manual calculation
OPL_rank  <- rank(data$OPL_CEFR)
NICT_rank <- rank(data$NICT_CEFR)

# Differences of ranks
d <- OPL_rank - NICT_rank

# Square the differences
d2 <- d^2

# Sum of squared differences
sum_d2 <- sum(d2)

# Number of observations
n <- nrow(data)

# Using the formula
r_s_manual <- 1 - (6 * sum_d2) / (n * (n^2 - 1))

r_s_manual
[1] 0.5831335
# Built-in test: cor.test() with method = "spearman"
cor_test <- cor.test(data$OPL_CEFR, data$NICT_CEFR, method = "spearman", exact = FALSE)
cor_test

    Spearman's rank correlation rho

data:  data$OPL_CEFR and data$NICT_CEFR
S = 14730623, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.5350887 

Did you get the same results between the two methods? If not, why?🤔
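One likely explanation is ties. The shortcut formula based on squared rank differences is exact only when no two observations share a rank, whereas ordinal codes like these contain many ties. As far as I know, R computes Spearman’s rho as Pearson’s r on the (average) ranks, which handles ties correctly. The sketch below illustrates the gap with a small hypothetical data set; the vectors `a` and `b` are invented for illustration.

```r
# Hypothetical ordinal codes with many ties (NOT the CEFR data)
a <- c(1, 1, 1, 2, 2, 3, 3, 4, 5, 5)
b <- c(1, 2, 1, 2, 3, 2, 5, 3, 4, 5)

ra <- rank(a)  # tied values receive the average of the ranks they span
rb <- rank(b)
n  <- length(a)

# Shortcut formula: only exact when there are no ties
rho_shortcut <- 1 - (6 * sum((ra - rb)^2)) / (n * (n^2 - 1))

# Pearson's r computed on the ranks: valid with or without ties
rho_pearson_ranks <- cor(ra, rb)

# Built-in Spearman correlation
rho_builtin <- cor(a, b, method = "spearman")

c(shortcut = rho_shortcut,
  pearson_on_ranks = rho_pearson_ranks,
  builtin = rho_builtin)
```

Here the last two values agree with each other, while the shortcut value differs. For heavily tied data, `cor(rank(x), rank(y))` is therefore the safer manual calculation.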

5.1-5.3 Key Takeaways

Linear transformations (centering and standardizing) can help make variables more interpretable and comparable.

  1. Centering: Subtract the mean; shifts the reference point so the intercept represents the average value.

  2. Standardizing (z-scoring): Subtract the mean and divide by SD; expresses values in standard deviation units, making predictors on different scales comparable.

  3. Correlation: In a simple regression, standardizing both predictor and response turns the regression slope into Pearson’s r.

References

Brezina, V. (2018). Statistics in corpus linguistics: A practical guide. Cambridge University Press. https://doi.org/10.1017/9781316410899

Winter, B. (2019). Statistics for linguists: An introduction using R. Routledge.
