9 Measures of central tendency

You’ve covered a lot of ground this week reading about measures of central tendency and introducing yourselves to distributions. With this walkthrough I want to apply a few of the things from your readings and our discussion.

Before jumping in, let’s install and load the necessary packages in R. This week we will make use of psych, and several of the core packages of the tidyverse.

Using the cheat code from last week’s Level-up:

# 1. check to see if pacman is on your computer and if not, let's install it:
if (!require("pacman")) install.packages("pacman",repos="http://cran.us.r-project.org")

Loading required package: pacman

# 2. install all other packages that we will be using:
pacman::p_load(psych, tidyverse)

Throughout this walkthrough (and many others this semester) we’ll use data from datasets that I’ve found on the web or created. Here we will be using a dataset referred to in David Flora’s textbook Statistical Methods for the Social and Behavioural Sciences: A Model Based Approach (one of your recommended texts). You’ll be asked to read an excerpt from that text for the next module.

This data is related to the question of whether certain personality characteristics are predictors of aggression. Let’s start by reading this data into R.

library(readr)
aggression_data <- read_table("https://tehrandav.is/courses/statistics/practice_datasets/aggression.dat", col_names = FALSE)


── Column specification ────────────────────────────────────────────────────────
cols(
  X1 = col_double(),
  X2 = col_double(),
  X3 = col_double(),
  X4 = col_double(),
  X5 = col_double(),
  X6 = col_double(),
  X7 = col_double(),
  X8 = col_double()
)

Invoking class(aggression_data) tells us that aggression_data is a data frame (or table). Invoking names(aggression_data) provides us with the header names. In this case the file has no header row, and because we set col_names = FALSE, readr filled in the placeholder names X1 through X8.

class(aggression_data)

[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

names(aggression_data)

[1] "X1" "X2" "X3" "X4" "X5" "X6" "X7" "X8"

For the sake of your own sanity, you should always provide headers for your data frame if it does not come with them. Building on last week’s homework, we can fix this by invoking the following to add the appropriate headers:

names(aggression_data) <- c('age', 'BPAQ', 'AISS', 'alcohol', 'BIS', 'NEOc', 'gender', 'NEOo')

And now checking the modified data frame:

aggression_data

# A tibble: 275 × 8
     age  BPAQ  AISS alcohol   BIS  NEOc gender  NEOo
   <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>  <dbl> <dbl>
 1    18  2.62  2.65      28  2.15  2.83      1  2.92
 2    18  2.24  2.85      NA  3.08  2.5       1  4.17
 3    20  2.72  3.05      80  3     2.75      1  3.92
 4    17  1.93  2.65      28  1.85  3.42      1  4.17
 5    17  2.72  2.95      10  2.08  3.58      0  3.5 
 6    17  2.45  1.95      12  2.62  3.83      1  3.25
 7    17  1.90  2.55      21  2.19  3.67      1  4.25
 8    17  2.59  2.3        3  2.19  3.42      1  2.58
 9    17  2.48  2         21  2.35  3.08      1  3.33
10    17  1.97  2.15       0  2.15  3.42      1  3.08
# ℹ 265 more rows
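As an aside, headers can also be supplied at the moment the file is read, rather than afterwards. The sketch below uses base R's read.table() on a few lines of inline text so that it runs without a download; the three rows and the four-column subset are hypothetical stand-ins for the real file. readr's read_table() accepts a character vector for col_names in much the same way.

```r
# Hypothetical stand-in for the first few rows of a header-less data file
raw_text <- "18 2.62 2.65 28
18 2.24 2.85 NA
20 2.72 3.05 80"

# Supply the column names while reading instead of renaming afterwards
toy <- read.table(text = raw_text, header = FALSE,
                  col.names = c("age", "BPAQ", "AISS", "alcohol"))

names(toy)   # the names we supplied
toy          # NA in the file is read as a missing value
```

Either approach gets you to the same place; naming columns at read time just means the placeholder names never appear.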

Building from last week’s homework assignment, we can call individual columns within the data frame using the $ operator:

# column of subject age(s):
aggression_data$age

  [1] 18 18 20 17 17 17 17 17 17 17 17 17 17 18 18 18 18 18 18 18 18 18 18 18 18
 [26] 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18
 [51] 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18
 [76] 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18
[101] 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18
[126] 18 18 18 18 18 18 18 18 18 18 18 18 18 18 19 19 19 19 19 19 19 19 19 19 19
[151] 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19
[176] 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 20 20 20 20 20 20 20 20 20 20
[201] 20 20 20 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 22 22 22 22
[226] 22 22 22 22 22 22 22 22 22 23 23 23 24 24 24 24 24 25 26 27 27 29 29 29 29
[251] 30 31 31 33 33 36 37 38 40 45 47 50 50 19 19 19 25 19 23 19 21 22 18 19 37

# column of BPAQ scores: https://www.psytoolkit.org/survey-library/aggression-buss-perry.html
aggression_data$BPAQ

# column of reported gender
aggression_data$gender

  [1] 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 1 1 1 0 0 1 0 1 1 0 1 1 0 1 1 1 1 1 0 0 0 0
 [38] 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1
 [75] 0 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 0 1 1 1 1 1 0 0 1 1 1 0
[112] 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 0 0 0 1 1 0 0 1
[149] 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
[186] 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1
[223] 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 1 1 1 0 0 1 1 1 1 1 0 1 1
[260] 1 1 0 1 1 1 1 0 0 1 1 0 1 1 1 0

9.1 Getting measures of central tendency

We can use simple functions to get the mean, median, and standard deviation of an individual column of data. Remember from your reading that these measures make the most sense when the variable in question is numeric (continuous / ratio scale). For example, let’s take a look at BPAQ (https://www.psytoolkit.org/survey-library/aggression-buss-perry.html):

mean(aggression_data$BPAQ) #mean

[1] 2.612539

median(aggression_data$BPAQ) #median

[1] 2.62069

sd(aggression_data$BPAQ) #std. dev.

[1] 0.5240082
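One caution before moving on: these functions return NA if the column contains any missing values, which matters for a column like alcohol (note the NA in row 2 of the printout above). A minimal sketch, using only the first six alcohol values visible in that printout as a toy vector named drinks (the name is mine):

```r
# First six alcohol values from the printout above; NA marks a missing response
drinks <- c(28, NA, 80, 28, 10, 12)

mean(drinks)                   # NA: one missing value makes the result missing
mean(drinks, na.rm = TRUE)     # 31.6: drop the NA, then average the other five
median(drinks, na.rm = TRUE)   # 28
sd(drinks, na.rm = TRUE)

# how many values are missing?
sum(is.na(drinks))             # 1
```

Setting na.rm = TRUE is a choice, not a default, so it is worth checking how many values are being dropped before relying on the result.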

Most of the data that we deal with comes in the form of data frames like aggression_data. It might be cumbersome to get the mean, median and sd for each column separately. Instead, we can use psych::describe() to generate summary stats of all data in a data frame:

psych::describe(aggression_data)

        vars   n  mean    sd median trimmed   mad   min   max range  skew
age        1 275 20.21  4.96  18.00   19.01  1.48 17.00 50.00 33.00  3.70
BPAQ       2 275  2.61  0.52   2.62    2.61  0.56  1.34  4.03  2.69  0.01
AISS       3 275  2.56  0.37   2.55    2.56  0.37  1.45  3.70  2.25  0.01
alcohol    4 270 16.00 15.87  12.00   13.69 14.83  0.00 96.00 96.00  1.50
BIS        5 275  2.28  0.35   2.27    2.27  0.40  1.42  3.15  1.73  0.36
NEOc       6 275  3.55  0.59   3.58    3.55  0.62  1.83  4.92  3.08 -0.16
gender     7 275  0.79  0.41   1.00    0.86  0.00  0.00  1.00  1.00 -1.44
NEOo       8 275  3.36  0.52   3.42    3.37  0.49  1.67  4.67  3.00 -0.25
        kurtosis   se
age        15.43 0.30
BPAQ       -0.41 0.03
AISS        0.23 0.02
alcohol     3.09 0.97
BIS        -0.22 0.02
NEOc       -0.11 0.04
gender      0.06 0.02
NEOo       -0.09 0.03

This provides info related to the:

item name
item number
number of valid cases (identifies if data is missing)
mean
standard deviation
trimmed mean (with trim defaulting to .1)
median (standard or interpolated)
mad: median absolute deviation (see Leys et al, 2013)
minimum value
maximum value
skew
kurtosis
standard error
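Several of these measures can be reproduced with base R, which may help make the table less of a black box. The sketch below uses a hypothetical vector x with one extreme value (loosely modelled on the age column); it assumes describe()'s se is the usual standard error of the mean, sd / sqrt(n).

```r
# Hypothetical scores with one extreme value
x <- c(17, 18, 18, 18, 19, 19, 20, 21, 22, 50)
n <- length(x)

mean(x)               # 22.2: ordinary mean, pulled upward by the 50
mean(x, trim = 0.1)   # 19.375: drops the lowest and highest 10% of values first
median(x)             # 19
mad(x)                # 1.4826: median absolute deviation (scaled by 1.4826 by default)
sd(x) / sqrt(n)       # standard error of the mean
max(x) - min(x)       # 33: range
```

Notice how the trimmed mean and median sit close together while the ordinary mean is dragged toward the extreme value, the same pattern that shows up for age in the table above.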

9.2 Turning numeric data into categorical

Looking at our output from above, one thing to keep in mind is that some of the measures printed by psych::describe() may not make sense for specific columns of data. For example, the output table above has a mean value for gender (0.79). This is because in this data set men are designated the value 0 and women are designated 1. By default, when R encounters only numbers in a column it treats them as numeric. However, in this case, the values in gender are better understood as nominal (ignoring discussions about gender fluidity that really wouldn’t fit into this simplistic classification to begin with). How might we deal with this?

We need to tell R that these data are categories, or factors and not continuous numbers. To do so we can call on the factor() function as below.

# convert a numeric to a categorical (i.e., make a factor)
factor(aggression_data$gender)

  [1] 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 1 1 1 0 0 1 0 1 1 0 1 1 0 1 1 1 1 1 0 0 0 0
 [38] 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1
 [75] 0 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 0 1 1 1 1 1 0 0 1 1 1 0
[112] 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 0 0 0 1 1 0 0 1
[149] 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
[186] 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1
[223] 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 1 1 1 0 0 1 1 1 1 1 0 1 1
[260] 1 1 0 1 1 1 1 0 0 1 1 0 1 1 1 0
Levels: 0 1

When you run the line above you’ll note that the end of the output mentions Levels: 0 1. This is telling you that in this vector there are two levels, gender = 0 and gender = 1. To connect this factorized vector to your original data set, you can either create a new column in the data set or overwrite the original gender column that you are replacing. For beginners, I would recommend adding a new column. To do this, let’s take advantage of some of those fancy tidyverse / dplyr skills you picked up from DataCamp, in particular mutate():

# 1. create a new vector attached to the original data frame
#    using the dplyr::mutate function:

aggression_data <-
    aggression_data %>% # take the original aggression_data data frame, and then...
    mutate(fctr_Gender = factor(gender)) # add a column named fctr_Gender
aggression_data

# A tibble: 275 × 9
     age  BPAQ  AISS alcohol   BIS  NEOc gender  NEOo fctr_Gender
   <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>  <dbl> <dbl> <fct>      
 1    18  2.62  2.65      28  2.15  2.83      1  2.92 1          
 2    18  2.24  2.85      NA  3.08  2.5       1  4.17 1          
 3    20  2.72  3.05      80  3     2.75      1  3.92 1          
 4    17  1.93  2.65      28  1.85  3.42      1  4.17 1          
 5    17  2.72  2.95      10  2.08  3.58      0  3.5  0          
 6    17  2.45  1.95      12  2.62  3.83      1  3.25 1          
 7    17  1.90  2.55      21  2.19  3.67      1  4.25 1          
 8    17  2.59  2.3        3  2.19  3.42      1  2.58 1          
 9    17  2.48  2         21  2.35  3.08      1  3.33 1          
10    17  1.97  2.15       0  2.15  3.42      1  3.08 1          
# ℹ 265 more rows

Your new column is appended to the end of the aggression_data data frame. Now when we run:

psych::describe(aggression_data)

             vars   n  mean    sd median trimmed   mad   min   max range  skew
age             1 275 20.21  4.96  18.00   19.01  1.48 17.00 50.00 33.00  3.70
BPAQ            2 275  2.61  0.52   2.62    2.61  0.56  1.34  4.03  2.69  0.01
AISS            3 275  2.56  0.37   2.55    2.56  0.37  1.45  3.70  2.25  0.01
alcohol         4 270 16.00 15.87  12.00   13.69 14.83  0.00 96.00 96.00  1.50
BIS             5 275  2.28  0.35   2.27    2.27  0.40  1.42  3.15  1.73  0.36
NEOc            6 275  3.55  0.59   3.58    3.55  0.62  1.83  4.92  3.08 -0.16
gender          7 275  0.79  0.41   1.00    0.86  0.00  0.00  1.00  1.00 -1.44
NEOo            8 275  3.36  0.52   3.42    3.37  0.49  1.67  4.67  3.00 -0.25
fctr_Gender*    9 275  1.79  0.41   2.00    1.86  0.00  1.00  2.00  1.00 -1.44
             kurtosis   se
age             15.43 0.30
BPAQ            -0.41 0.03
AISS             0.23 0.02
alcohol          3.09 0.97
BIS             -0.22 0.02
NEOc            -0.11 0.04
gender           0.06 0.02
NEOo            -0.09 0.03
fctr_Gender*     0.06 0.02

We see our new variable fctr_Gender. The * next to it indicates that it is a categorical variable. This is psych::describe() warning us to proceed with caution when interpreting the descriptive stats related to fctr_Gender. But you’ll notice that it still gives us these measures. The lesson here is to be careful when interpreting these values.
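Why is the mean for fctr_Gender 1.79 when the mean for gender is 0.79? My understanding is that describe() converts a factor to its underlying integer codes before summarizing, and those codes start at 1 rather than 0. A minimal sketch with a hypothetical five-person vector shows the same shift:

```r
gender <- c(1, 1, 0, 1, 0)       # hypothetical 0/1 codes
fctr_gender <- factor(gender)

levels(fctr_gender)              # "0" "1": the labels, stored as text
as.integer(fctr_gender)          # 2 2 1 2 1: the internal codes, starting at 1

mean(gender)                     # 0.6
mean(as.integer(fctr_gender))    # 1.6: the same mean, shifted up by 1
```

So the numbers in the starred row describe arbitrary internal codes, not anything meaningful about the category itself.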

Ultimately, it makes the most sense to change the numeric levels in gender to Man and Woman, respectively. This can be accomplished using dplyr::recode_factor(). The structure of this function is:

dplyr::recode_factor(data_frame$column_name, "old_level_1" = "new_level_1", ...)

So if we wanted to recode our original gender column, we can:

dplyr::recode_factor(aggression_data$gender, "0" = "Man", "1" = "Woman")

  [1] Woman Woman Woman Woman Man   Woman Woman Woman Woman Woman Woman Woman
 [13] Man   Woman Man   Woman Woman Woman Man   Man   Woman Man   Woman Woman
 [25] Man   Woman Woman Man   Woman Woman Woman Woman Woman Man   Man   Man  
 [37] Man   Woman Woman Woman Woman Woman Woman Woman Woman Man   Woman Woman
 [49] Man   Woman Woman Woman Woman Woman Woman Woman Woman Woman Woman Woman
 [61] Man   Woman Woman Woman Woman Woman Man   Woman Woman Woman Woman Woman
 [73] Woman Woman Man   Man   Woman Woman Man   Woman Man   Woman Woman Woman
 [85] Woman Woman Woman Woman Woman Woman Woman Woman Woman Woman Man   Man  
 [97] Woman Woman Woman Man   Woman Woman Woman Woman Woman Man   Man   Woman
[109] Woman Woman Man   Man   Woman Woman Woman Woman Woman Woman Woman Woman
[121] Woman Woman Man   Woman Woman Woman Woman Man   Woman Woman Woman Man  
[133] Woman Woman Woman Man   Woman Woman Woman Man   Man   Man   Man   Woman
[145] Woman Man   Man   Woman Woman Woman Man   Woman Woman Woman Man   Woman
[157] Woman Woman Woman Woman Woman Woman Woman Woman Woman Woman Man   Woman
[169] Woman Woman Woman Woman Woman Woman Woman Woman Woman Woman Woman Woman
[181] Woman Woman Woman Woman Man   Woman Man   Woman Woman Woman Woman Man  
[193] Woman Woman Woman Woman Woman Woman Woman Woman Woman Man   Woman Woman
[205] Woman Woman Woman Woman Woman Woman Woman Woman Woman Woman Woman Woman
[217] Woman Man   Woman Woman Woman Woman Woman Man   Woman Woman Woman Woman
[229] Woman Woman Woman Woman Woman Woman Woman Woman Woman Woman Man   Woman
[241] Woman Woman Woman Woman Man   Man   Woman Woman Woman Man   Man   Woman
[253] Woman Woman Woman Woman Man   Woman Woman Woman Woman Man   Woman Woman
[265] Woman Woman Man   Man   Woman Woman Man   Woman Woman Woman Man  
Levels: Man Woman

Remember that the code above is not assigning the output to any variable, so it will not be saved. This can be accomplished in one fell swoop by including recode_factor() in the mutate() function:

aggression_data <- # save to "aggression_data" 
    # take the original aggression_data data frame, and then...
    aggression_data %>%
    # create a new column named fctr_Gender that recodes the original
    # gender column to Man and Woman
    mutate(fctr_Gender = recode_factor(gender, "0" = "Man", "1" = "Woman"))
    
print(aggression_data)

# A tibble: 275 × 9
     age  BPAQ  AISS alcohol   BIS  NEOc gender  NEOo fctr_Gender
   <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>  <dbl> <dbl> <fct>      
 1    18  2.62  2.65      28  2.15  2.83      1  2.92 Woman      
 2    18  2.24  2.85      NA  3.08  2.5       1  4.17 Woman      
 3    20  2.72  3.05      80  3     2.75      1  3.92 Woman      
 4    17  1.93  2.65      28  1.85  3.42      1  4.17 Woman      
 5    17  2.72  2.95      10  2.08  3.58      0  3.5  Man        
 6    17  2.45  1.95      12  2.62  3.83      1  3.25 Woman      
 7    17  1.90  2.55      21  2.19  3.67      1  4.25 Woman      
 8    17  2.59  2.3        3  2.19  3.42      1  2.58 Woman      
 9    17  2.48  2         21  2.35  3.08      1  3.33 Woman      
10    17  1.97  2.15       0  2.15  3.42      1  3.08 Woman      
# ℹ 265 more rows
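For comparison, the same recoding can be sketched in base R, since factor() accepts both the existing values (levels) and the labels to display for them. The five-value gender vector here is hypothetical.

```r
gender <- c(1, 1, 0, 1, 0)   # hypothetical 0/1 codes

# levels = the values as they appear in the data; labels = what to call them
fctr_gender <- factor(gender, levels = c(0, 1), labels = c("Man", "Woman"))

fctr_gender
levels(fctr_gender)          # "Man" "Woman"

# cross-tabulate old and new codes to confirm nothing was mislabeled
table(gender, fctr_gender)
```

The cross-tabulation at the end is a cheap habit worth keeping whenever you recode: every 0 should land under Man and every 1 under Woman.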

Now if we compare our original gender column:

mean(aggression_data$gender)

[1] 0.7927273

to our re-coded fctr_Gender column:

mean(aggression_data$fctr_Gender)

Warning in mean.default(aggression_data$fctr_Gender): argument is not numeric
or logical: returning NA

[1] NA

you’ll see that the levels have been transformed AND trying to get the mean of a categorical variable gives us a warning and returns NA (as it should!)
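For categorical data, the sensible summaries are counts, proportions, and the mode (the most frequent category). A minimal base R sketch, using a hypothetical five-person factor:

```r
fctr_gender <- factor(c("Woman", "Woman", "Man", "Woman", "Man"))

counts <- table(fctr_gender)   # frequency of each level
counts
prop.table(counts)             # the same counts as proportions

# the mode: the level with the largest count
names(counts)[which.max(counts)]
```

Note that which.max() returns only the first maximum, so this sketch would not flag a tie between categories.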

Before moving on, I want to stress that we can assign our summary table to an object:

aggression_data_summary <- psych::describe(aggression_data)

and now anytime we call aggression_data_summary we get this table

aggression_data_summary

             vars   n  mean    sd median trimmed   mad   min   max range  skew
age             1 275 20.21  4.96  18.00   19.01  1.48 17.00 50.00 33.00  3.70
BPAQ            2 275  2.61  0.52   2.62    2.61  0.56  1.34  4.03  2.69  0.01
AISS            3 275  2.56  0.37   2.55    2.56  0.37  1.45  3.70  2.25  0.01
alcohol         4 270 16.00 15.87  12.00   13.69 14.83  0.00 96.00 96.00  1.50
BIS             5 275  2.28  0.35   2.27    2.27  0.40  1.42  3.15  1.73  0.36
NEOc            6 275  3.55  0.59   3.58    3.55  0.62  1.83  4.92  3.08 -0.16
gender          7 275  0.79  0.41   1.00    0.86  0.00  0.00  1.00  1.00 -1.44
NEOo            8 275  3.36  0.52   3.42    3.37  0.49  1.67  4.67  3.00 -0.25
fctr_Gender*    9 275  1.79  0.41   2.00    1.86  0.00  1.00  2.00  1.00 -1.44
             kurtosis   se
age             15.43 0.30
BPAQ            -0.41 0.03
AISS             0.23 0.02
alcohol          3.09 0.97
BIS             -0.22 0.02
NEOc            -0.11 0.04
gender           0.06 0.02
NEOo            -0.09 0.03
fctr_Gender*     0.06 0.02

9.3 Getting summary stats by groups or conditions

You may also get summary statistics for individual groups. In this case the fctr_Gender column that we just created above has two groups, men and women. To take a look at the summary stats by fctr_Gender, we use the psych::describeBy() function and designate the variable that we want to split by:

psych::describeBy(aggression_data, group = aggression_data$fctr_Gender)


 Descriptive statistics by group 
group: Man
             vars  n  mean    sd median trimmed   mad   min   max range  skew
age             1 57 20.60  5.87  18.00   19.21  0.00 17.00 50.00 33.00  3.09
BPAQ            2 57  2.66  0.51   2.69    2.67  0.66  1.62  3.69  2.07  0.00
AISS            3 57  2.74  0.34   2.75    2.73  0.30  1.80  3.70  1.90  0.28
alcohol         4 55 17.95 18.22  12.00   15.42 17.79  0.00 73.00 73.00  1.06
BIS             5 57  2.23  0.33   2.19    2.21  0.29  1.69  3.12  1.42  0.64
NEOc            6 57  3.64  0.56   3.67    3.64  0.62  2.33  4.75  2.42 -0.13
gender          7 57  0.00  0.00   0.00    0.00  0.00  0.00  0.00  0.00   NaN
NEOo            8 57  3.27  0.59   3.33    3.30  0.62  1.67  4.33  2.67 -0.41
fctr_Gender*    9 57  1.00  0.00   1.00    1.00  0.00  1.00  1.00  0.00   NaN
             kurtosis   se
age             10.46 0.78
BPAQ            -0.96 0.07
AISS             0.95 0.05
alcohol          0.44 2.46
BIS              0.03 0.04
NEOc            -0.65 0.07
gender            NaN 0.00
NEOo            -0.10 0.08
fctr_Gender*      NaN 0.00
------------------------------------------------------------ 
group: Woman
             vars   n  mean    sd median trimmed   mad   min   max range  skew
age             1 218 20.11  4.70  19.00   19.03  1.48 17.00 50.00 33.00  3.86
BPAQ            2 218  2.60  0.53   2.62    2.60  0.56  1.34  4.03  2.69  0.02
AISS            3 218  2.51  0.37   2.52    2.51  0.37  1.45  3.70  2.25 -0.01
alcohol         4 215 15.50 15.22  12.00   13.36 13.34  0.00 96.00 96.00  1.63
BIS             5 218  2.30  0.36   2.27    2.28  0.40  1.42  3.15  1.73  0.29
NEOc            6 218  3.52  0.60   3.50    3.53  0.49  1.83  4.92  3.08 -0.15
gender          7 218  1.00  0.00   1.00    1.00  0.00  1.00  1.00  0.00   NaN
NEOo            8 218  3.38  0.50   3.42    3.39  0.49  1.83  4.67  2.83 -0.14
fctr_Gender*    9 218  2.00  0.00   2.00    2.00  0.00  2.00  2.00  0.00   NaN
             kurtosis   se
age             17.01 0.32
BPAQ            -0.31 0.04
AISS             0.01 0.02
alcohol          4.16 1.04
BIS             -0.27 0.02
NEOc            -0.04 0.04
gender            NaN 0.00
NEOo            -0.30 0.03
fctr_Gender*      NaN 0.00
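If you only need one statistic per group, base R can do the split for you. The sketch below is not a replacement for describeBy(); it uses a small hypothetical data frame named toy to show the underlying idea of computing a mean separately within each level of a factor.

```r
# Hypothetical scores for six participants
toy <- data.frame(
  BPAQ = c(2.6, 2.2, 2.7, 1.9, 2.7, 2.4),
  fctr_Gender = factor(c("Woman", "Woman", "Woman", "Woman", "Man", "Man"))
)

# tapply(values, groups, function): one mean per level of fctr_Gender
group_means <- tapply(toy$BPAQ, toy$fctr_Gender, mean)
group_means

# the same idea with a formula interface, returned as a data frame
by_group <- aggregate(BPAQ ~ fctr_Gender, data = toy, FUN = mean)
by_group
```

Swapping mean for median or sd in either call gives the corresponding group-wise statistic.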

Check out this video briefly describing one of the quirks of the psych::describe() output: https://youtu.be/ZFHTTY9886k