You’ve covered a lot of ground this week reading about measures of central tendency and introducing yourselves to distributions. With this walkthrough I want to apply a few of the things from your readings and our discussion.
Before jumping in, let’s install and load the necessary packages in R. This week we will make use of psych and several of the core tidyverse packages.
# 1. check to see if pacman is on your computer and if not, let's install it:
if (!require("pacman")) install.packages("pacman", repos = "http://cran.us.r-project.org")
Loading required package: pacman
# 2. install all other packages that we will be using:
pacman::p_load(psych, tidyverse)
Throughout this walkthrough (and many others through the semester) we’ll use data from datasets that I’ve found on the web or created. Here we will be using a dataset referred to in David Flora’s textbook Statistical Methods for the Social and Behavioural Sciences: A Model Based Approach (one of your recommended texts). You’ll be asked to read an excerpt from that text for the next module.
This data is related to the question of whether certain personality characteristics are predictors of aggression. Let’s start by reading this data into R.
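The read step itself is not reproduced here. As a minimal sketch of what reading a file without a header row looks like, the example below writes a tiny made-up file and reads it back with readr::read_csv(). The file, its contents, and the comma-separated format are all assumptions for illustration; they are not the course data set.

```r
# Toy illustration only: the file, its contents, and the comma-separated
# format are assumptions, not the real aggression data.
library(readr)

toy_path <- tempfile(fileext = ".csv")
writeLines(c("21,2.5,1", "19,3.1,0", "23,2.2,1"), toy_path)

# col_names = FALSE tells readr that the first row is data, not a header
toy_data <- read_csv(toy_path, col_names = FALSE)

class(toy_data) # a tibble, which is also a data.frame
names(toy_data) # readr fills in placeholder names X1, X2, X3
```

Because no header row is supplied, readr invents the placeholder names, which is why they carry no information about what each column measures.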
Invoking class(aggression_data) tells us that aggression_data is a data frame (or table). Invoking names(aggression_data) provides us with the header names. In this case aggression_data has no meaningful header names, only the default placeholders X1 through X8.
class(aggression_data)
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
names(aggression_data)
[1] "X1" "X2" "X3" "X4" "X5" "X6" "X7" "X8"
For the sake of your own sanity, you should always provide headers for your data frame if it does not come with them. Building on last week’s homework, we can fix this by invoking the following to add the appropriate headers:
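The general pattern can be sketched on a toy data frame using base R's names() assignment. The column names and their order below are hypothetical and chosen only for illustration; they are not the actual header list for aggression_data.

```r
# Toy data frame with placeholder names (values are made up)
toy_data <- data.frame(
  X1 = c(21, 19, 23),
  X2 = c(2.5, 3.1, 2.2),
  X3 = c(1, 0, 1)
)

# Assign one name per column, in column order.
# These names and their order are hypothetical.
names(toy_data) <- c("age", "BPAQ", "gender")

names(toy_data)
```

The vector on the right must have exactly one entry per column, in the same order as the columns appear in the data frame.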
We can use simple functions to get the mean, median, and standard deviation of an individual column of data. Remember from your reading that these measures make the most sense when the variable in question is numeric (continuous / ratio scale). For example, let’s take a look at BPAQ, the Buss-Perry Aggression Questionnaire (https://www.psytoolkit.org/survey-library/aggression-buss-perry.html):
mean(aggression_data$BPAQ) #mean
[1] 2.612539
median(aggression_data$BPAQ) #median
[1] 2.62069
sd(aggression_data$BPAQ) #std. dev.
[1] 0.5240082
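To connect these functions back to the definitions in your reading, here is a minimal sketch that computes the same three statistics by hand on a small made-up vector (the values are invented for illustration) and checks them against the built-in functions.

```r
x <- c(2.1, 2.6, 3.0, 2.4, 2.9) # made-up scores, not the real BPAQ column
n <- length(x)

# mean: the sum of the scores divided by how many there are
manual_mean <- sum(x) / n

# median: the middle score once sorted (n is odd here, so there is one middle)
manual_median <- sort(x)[(n + 1) / 2]

# sample standard deviation: note the n - 1 in the denominator
manual_sd <- sqrt(sum((x - manual_mean)^2) / (n - 1))

all.equal(manual_mean, mean(x))
all.equal(manual_median, median(x))
all.equal(manual_sd, sd(x))
```

Note that sd() uses the sample formula with n - 1 in the denominator rather than n.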
Most of the data that we deal with comes in the form of data frames like aggression_data. It would be cumbersome to get the mean, median, and sd for each column separately. Instead, we can use psych::describe() to generate summary stats for every column in a data frame at once. Calling psych::describe(aggression_data) reports the following for each column:
number of valid cases (identifies if data is missing)
mean
standard deviation
trimmed mean (with trim defaulting to .1)
median (standard or interpolated)
mad: median absolute deviation (see Leys et al, 2013)
minimum value
maximum value
skew
kurtosis
standard error
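A few of these statistics may be less familiar than the mean and median. As a rough sketch of what they measure, the base R code below computes a 10% trimmed mean, the median absolute deviation, and the standard error of the mean on a made-up vector with one extreme value. This uses base R functions only; it is meant to illustrate the ideas and is not a reproduction of how psych::describe() computes them internally.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100) # made-up values with one outlier

plain_mean   <- mean(x)             # pulled upward by the outlier
trimmed_mean <- mean(x, trim = 0.1) # drops the lowest and highest 10% first
mad_value    <- mad(x)              # median absolute deviation (scaled by 1.4826 by default)
std_error    <- sd(x) / sqrt(length(x)) # standard error of the mean

plain_mean
trimmed_mean
mad_value
std_error
```

With ten values, trimming 10% removes one value from each end, so the trimmed mean ignores the outlier while the plain mean does not.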
9.2 Turning numeric data into categorical
Looking at our output from above, one thing to keep in mind is that some of the measures printed by psych::describe() may not make sense for specific columns of data. For example, the output table above has a mean value for gender (0.79). This is because in this data set men are designated the value 0 and women are designated 1. By default, when R encounters only numbers in a column it treats them as numeric. However, in this case, the values in gender are better understood as nominal (ignoring discussions about gender fluidity that really wouldn’t fit into this simplistic classification to begin with). How might we deal with this?
We need to tell R that these data are categories, or factors and not continuous numbers. To do so we can call on the factor() function as below.
# convert a numeric to a categorical (i.e., make a factor)
factor(aggression_data$gender)
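To see the difference this makes, here is a small sketch on a made-up vector of 0/1 codes (the values are invented, not taken from aggression_data). As plain numbers R will average the codes; as a factor R treats them as categories that can be counted.

```r
toy_gender <- c(1, 1, 0, 1, 0) # made-up codes: 0 and 1

mean(toy_gender) # R averages the codes, which is rarely what we mean

toy_factor <- factor(toy_gender)

levels(toy_factor) # the distinct categories, stored as labels "0" and "1"
table(toy_factor)  # counts per category, a more sensible summary
```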
When you run the line above you’ll note at the end of the output it mentions Levels: 0 1. This is telling you that in this vector, there are two levels, gender = 0 and gender = 1. To connect this factorized vector to your original data set, you can either create a new column in the data set or overwrite the original gender column that you are replacing. For beginners, I would recommend adding a new column. To do this, let’s take advantage of some of those fancy tidyverse / dplyr skills you picked up from DataCamp, in particular mutate():
# 1. create a new vector attached to the original data frame
# using the dplyr::mutate function:
aggression_data <- aggression_data %>% # take the original aggression_data data frame, and then...
  mutate(fctr_Gender = factor(gender)) # add a column named fctr_Gender
aggression_data
We see our new variable fctr_Gender. The * next to it indicates that it is a categorical variable. R is now warning us to proceed with caution when interpreting the descriptive stats related to fctr_Gender. But you’ll notice that it still gives us these measures. The lesson here is to be careful when interpreting this data.
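Ultimately, it makes the most sense to change the numeric levels in gender to Man and Woman respectively. This can be accomplished using dplyr::recode_factor(). The structure of this function is recode_factor(column, "old value" = "new label", ...); applied to the gender column with "0" = "Man" and "1" = "Woman", it returns: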
[1] Woman Woman Woman Woman Man Woman Woman Woman Woman Woman Woman Woman
[13] Man Woman Man Woman Woman Woman Man Man Woman Man Woman Woman
[25] Man Woman Woman Man Woman Woman Woman Woman Woman Man Man Man
[37] Man Woman Woman Woman Woman Woman Woman Woman Woman Man Woman Woman
[49] Man Woman Woman Woman Woman Woman Woman Woman Woman Woman Woman Woman
[61] Man Woman Woman Woman Woman Woman Man Woman Woman Woman Woman Woman
[73] Woman Woman Man Man Woman Woman Man Woman Man Woman Woman Woman
[85] Woman Woman Woman Woman Woman Woman Woman Woman Woman Woman Man Man
[97] Woman Woman Woman Man Woman Woman Woman Woman Woman Man Man Woman
[109] Woman Woman Man Man Woman Woman Woman Woman Woman Woman Woman Woman
[121] Woman Woman Man Woman Woman Woman Woman Man Woman Woman Woman Man
[133] Woman Woman Woman Man Woman Woman Woman Man Man Man Man Woman
[145] Woman Man Man Woman Woman Woman Man Woman Woman Woman Man Woman
[157] Woman Woman Woman Woman Woman Woman Woman Woman Woman Woman Man Woman
[169] Woman Woman Woman Woman Woman Woman Woman Woman Woman Woman Woman Woman
[181] Woman Woman Woman Woman Man Woman Man Woman Woman Woman Woman Man
[193] Woman Woman Woman Woman Woman Woman Woman Woman Woman Man Woman Woman
[205] Woman Woman Woman Woman Woman Woman Woman Woman Woman Woman Woman Woman
[217] Woman Man Woman Woman Woman Woman Woman Man Woman Woman Woman Woman
[229] Woman Woman Woman Woman Woman Woman Woman Woman Woman Woman Man Woman
[241] Woman Woman Woman Woman Man Man Woman Woman Woman Man Man Woman
[253] Woman Woman Woman Woman Man Woman Woman Woman Woman Man Woman Woman
[265] Woman Woman Man Man Woman Woman Man Woman Woman Woman Man
Levels: Man Woman
Remember that the code above is not assigning the output to any variable, so it will not be saved. Recoding and saving can be accomplished in one fell swoop by including recode_factor() in the mutate() function:
aggression_data <- # save to "aggression_data"
  # take the original aggression_data data frame, and then...
  aggression_data %>%
  # create a new column named fctr_Gender that recodes the original
  # gender column to Man and Woman
  mutate(fctr_Gender = recode_factor(gender, "0" = "Man", "1" = "Woman"))
print(aggression_data)
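For comparison, the same relabelling can be sketched in base R by passing levels and labels directly to factor(). This is an alternative to recode_factor(), not the approach used above, and the vector below is made up for illustration.

```r
toy_gender <- c(1, 1, 0, 1, 0) # made-up codes: 0 = Man, 1 = Woman

# levels lists the codes that appear in the data;
# labels gives the name to use for each code, in the same order
toy_labelled <- factor(toy_gender, levels = c(0, 1), labels = c("Man", "Woman"))

toy_labelled
levels(toy_labelled)
```

Either route ends with a factor whose levels are Man and Woman; which one you use is mostly a matter of whether you are already working inside a dplyr pipeline.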
You may also get summary statistics split by individual groups. In this case the fctr_Gender column that we just created above has two groups, men and women. To look at the summary stats by gender (that is, by fctr_Gender), we use the psych::describeBy() function and designate the group that we want to split by:
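As a minimal sketch of the pattern, the example below runs describeBy() on a small made-up data frame (the values are invented and only the column names mirror the walkthrough), and then cross-checks the group means with base R's tapply().

```r
library(psych)

# Made-up data for illustration; not the real aggression data
toy_data <- data.frame(
  BPAQ = c(2.1, 2.6, 3.0, 2.4, 2.9, 3.3),
  fctr_Gender = factor(c("Man", "Woman", "Woman", "Man", "Woman", "Man"))
)

# One block of summary statistics per level of the grouping variable
by_gender <- describeBy(toy_data$BPAQ, group = toy_data$fctr_Gender)
by_gender

# Base R cross-check: the mean of BPAQ within each group
group_means <- tapply(toy_data$BPAQ, toy_data$fctr_Gender, mean)
group_means
```

The group argument takes the variable to split by, so each level of the factor gets its own table of descriptives.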