23  Regression, pt. 1: Running the model

Please note that this walkthrough assumes that you have the following packages installed and loaded in R:

pacman::p_load(car, # qqPlot function
               cowplot, # apa plotting
               tidyverse, # tidyverse goodness
               psych, # descriptive stats
               lm.beta, # getting a standard coefficient
               broom, # tidy lm output
               olsrr # testing heteroscedasticity
               ) 

23.1 Fitting the data to a model

As social scientists, we are not only concerned with observation, but also with explanation. Measures of correlation provide us with the former; we use regression methods to build models in service of the latter. Again, both of the assigned texts provide excellent overviews of regression, so I won’t repeat much of what they say here.

For the purposes of this vignette (and this course) we will limit ourselves to linear regression, that is, describing the line that best fits our data. By “fit” we mean the line that, when compared to our observed data, minimizes our squared residuals.
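
In symbols, ordinary least squares picks the intercept \(b_0\) and slope \(b_1\) that minimize the sum of squared residuals:

\[ \min_{b_0,\,b_1} \sum_{i=1}^{n} \left(y_i - (b_0 + b_1 x_i)\right)^2 \]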

23.2 Modeling the Relationship between Stress and Health

Let’s revisit our data from the first walk-through:

Wagner, Compas, and Howell (1988) investigated the relationship between stress and mental health in first-year college students. Using a scale they developed to measure the frequency, perceived importance, and desirability of recent life events, they created a measure of negative events weighted by the reported frequency and the respondent’s subjective estimate of the impact of each event. This served as their measure of the subject’s perceived social and environmental stress. They also asked students to complete the Hopkins Symptom Checklist, assessing the presence or absence of 57 psychological symptoms.

This data can be accessed directly from this companion website for Howell’s stats textbook.

stress_data <- read_table("https://www.uvm.edu/~statdhtx/methods8/DataFiles/Tab9-2.dat")

── Column specification ────────────────────────────────────────────────────────
cols(
  ID = col_double(),
  Stress = col_double(),
  Symptoms = col_double(),
  lnSymptoms = col_double()
)
psych::describe(stress_data)
           vars   n  mean    sd median trimmed   mad   min    max  range skew
ID            1 107 54.00 31.03  54.00   54.00 40.03  1.00 107.00 106.00 0.00
Stress        2 107 21.29 12.49  20.00   20.49 11.86  1.00  58.00  57.00 0.62
Symptoms      3 107 90.33 18.81  88.00   88.87 17.79 58.00 147.00  89.00 0.74
lnSymptoms    4 107  4.48  0.20   4.48    4.48  0.19  4.06   4.99   0.93 0.21
           kurtosis   se
ID            -1.23 3.00
Stress        -0.19 1.21
Symptoms       0.45 1.82
lnSymptoms    -0.28 0.02

To perform a linear regression we may call upon the lm() function. This function uses formula notation: outcome variable ~ predictor variable(s). A simple regression has a single predictor. More often in our analyses we are concerned with the relative effects of multiple predictors, but that is multiple regression and is saved for the Spring course. Here, we may be interested in the degree to which perceived Stress contributes to the number of lnSymptoms, or using formula terminology: “How do lnSymptoms (outcome) vary as a function of Stress (predictor)?” This is represented in R as stress_data$lnSymptoms ~ stress_data$Stress.

23.3 Linear models in R

Let’s run the model we introduced in the last section. Note that instead of specifying the data frame for each variable, I can place it in the data= argument.

lm(lnSymptoms ~ Stress, data = stress_data)

Call:
lm(formula = lnSymptoms ~ Stress, data = stress_data)

Coefficients:
(Intercept)       Stress  
   4.300537     0.008565  

This output gives us our intercept and slope coefficients. That said, there’s more lurking behind this output. The resulting output of the lm() function is an object of class lm, which simply means that R understands that this object is storing a linear model. Hiding behind this spartan output is a multitude of info that may be accessed via the object’s attributes or thrown into other functions for additional info and analysis. My tip… GET IN THE HABIT OF SAVING (ASSIGNING) YOUR MODELS. In this case, let’s assign the model we just ran to stress_symptoms_model:

stress_symptoms_model <- lm(lnSymptoms ~ Stress, data = stress_data)
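
For example, here are a few of the base-R extractor functions for pulling pieces out of the saved object (output omitted):

# the fitted intercept and slope as a named vector
coef(stress_symptoms_model)

# the residual and fitted (predicted) value for each observation
head(residuals(stress_symptoms_model))
head(fitted(stress_symptoms_model))

# everything the lm object stores internally
names(stress_symptoms_model)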

23.3.1 Testing the model

Two important assumption tests for our model are

  1. that residuals are normally distributed and
  2. that the variance across the model is homogeneous

We can test both of these using performance. First, visually, we can use performance::check_model(). As I mentioned in a previous walkthrough, this function returns quite a few plots that we may use to assess our model assumptions. For the sake of brevity, we’ll limit the output to just the plots that tell us about normality and heteroskedasticity:

performance::check_model(stress_symptoms_model, check = c("qq", "homogeneity"))

In this case the homogeneity outcome is less than ideal, but there is no noticeable trend, so let’s continue on.
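
As an aside, we loaded car at the top for its qqPlot() function, which offers a similar visual check of normality with a point-wise confidence envelope (plot not shown):

# car’s qqPlot() method for lm objects plots the studentized
# residuals against t-distribution quantiles
car::qqPlot(stress_symptoms_model)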

We should follow-up our visual inspection using our tests.

For normality, we can use the Shapiro–Wilk test via performance::check_normality():

performance::check_normality(stress_symptoms_model)
OK: residuals appear as normally distributed (p = 0.436).

For heteroskedasticity, since Stress is a continuous predictor, we use the Breusch–Pagan test via performance::check_heteroskedasticity():

performance::check_heteroskedasticity(stress_symptoms_model)

Taken together, our visuals and our quantitative tests suggest we are okay.
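
As another aside, we loaded olsrr at the top for exactly this purpose; if you want a second opinion on heteroskedasticity, its Breusch–Pagan test should tell the same story (output omitted):

# Breusch–Pagan test via olsrr; the null hypothesis is that
# the residual variance is constant (homogeneous)
olsrr::ols_test_breusch_pagan(stress_symptoms_model)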

23.3.2 Interpreting the output

By itself, stress_symptoms_model simply returns the coefficients for the model:

stress_symptoms_model

Call:
lm(formula = lnSymptoms ~ Stress, data = stress_data)

Coefficients:
(Intercept)       Stress  
   4.300537     0.008565  

To get a more comprehensive output of the model we need to submit it to the summary() function.

summary(stress_symptoms_model)

Call:
lm(formula = lnSymptoms ~ Stress, data = stress_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42889 -0.13568  0.00478  0.09672  0.40726 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.300537   0.033088 129.974  < 2e-16 ***
Stress      0.008565   0.001342   6.382 4.83e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1726 on 105 degrees of freedom
Multiple R-squared:  0.2795,    Adjusted R-squared:  0.2726 
F-statistic: 40.73 on 1 and 105 DF,  p-value: 4.827e-09

The output above provides us with info about the coefficients, their standard errors, \(t\)-scores, and corresponding p-values (Pr(>|t|)). It also provides us with our \(r^2\) and adjusted \(r^2\) (\(r_{adj}^2\)), which correspond to our \(r\) and \(r_{adj}\) from above.

In greater detail…

  • Call: tells us the formula we originally input for this model

  • Residuals: descriptive stats of the residuals

  • Coefficients: provides the estimated values, the standard errors of the estimates, and NHSTs of the intercept and beta coefficient. In both cases the null hypothesis is Estimate = 0. Again, for the intercept this may or may not be useful; for example, you might use it to test whether people who haven’t had a single drink of alcohol have no depression at all. The test of beta is against the null hypothesis of zero correlation (relationship). All tests are conducted using the \(t\)-value on \(n-k\) degrees of freedom (see next bullet point)

  • Residual standard error: This is calculated by taking the sum of the squared residuals, dividing that number by the degrees of freedom of the model, and then taking the square root. In this case the df equals the total number of observations (107) minus the number of parameters (\(k\)) that the model estimated in order to generate the predicted values (2). This may be interpreted as a measure of the goodness of fit of the model and, perhaps more importantly, its predictive value. This value is closely tied to Cohen Chapter 10, “The Variance of the Estimate,” but rather than the denominator being \(N\), it is the degrees of freedom for the model. In this way it speaks more directly to the model itself (see the check just after this list).

  • Multiple R-squared and Adjusted R-squared: The coefficient of determination calculated from \(r\) or \(r_{adj}\). \(r^2\) tells us the degree to which our model accounts for variance in the observed outcomes. One way to think about this is: to what degree does variation in our predictor explain the observed variation in our outcomes? More on this below.

  • Finally, this output gives us the resulting \(F\)-ratio and corresponding \(df\) for an \(F\)-test. This is a significance test for our \(r^2\): does our model explain a significant amount of the variance?
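
You can verify the residual standard error by hand, and broom (loaded at the top precisely for its tidy lm output) can extract the coefficient table and model-level statistics as data frames (output omitted):

# residual standard error by hand: the square root of
# (sum of squared residuals / residual degrees of freedom)
sqrt(sum(residuals(stress_symptoms_model)^2) / df.residual(stress_symptoms_model))
sigma(stress_symptoms_model) # extracts the same value summary() reports

# tidy() returns the coefficient table; glance() returns r-squared,
# the F-statistic, dfs, and more, one row per model
broom::tidy(stress_symptoms_model)
broom::glance(stress_symptoms_model)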

23.4 The coefficient of determination as an index of explanatory value

Here we may turn to a (little bit longer than expected) digression involving J.S. Mill and exact and inexact sciences. One philosophical conceit is that we can never really empirically observe causes, but only effects. That is, I can never truly observe that one billiard ball caused another to move; I can only say with a high degree of certainty that I observed the first billiard ball striking the second and subsequently the second moved at a particular velocity. In order to deal with this epistemological crisis (how can you really ever know anything!!!) we build explanatory models. The degree to which our model is able to account for the range of our previous observations tells us how useful our model parameters (predictors) are in explaining a phenomenon in question.

For example, to explain what I observe about the contact and subsequent motion of billiard balls, I might build a model that takes into account the mass and velocity (speed and direction) of the first ball, the mass and velocity of the second ball, and the point and angle of contact. Presuming I use these as predictors, I can account, with varying degrees of success, for the heading and direction of the balls, that is, the outcome. In fact, in this example, I’m likely to be very successful in describing most, but not all, observed outcomes. That is, there are some variations in the outcomes for which these parameters alone do not provide a successful account. I can refine my billiard contact model to include things like the friction caused by the felt, the force of gravity, the angle of the table top, etc., and it’s likely the case that with each new predictor my model becomes better. However, at some point I begin to reach diminishing returns (see over-fitting next semester).

This is why physics appears to be an exact science (well, mechanical physics at least). They’ve had centuries to develop and refine their models about the mechanical workings of our very local corner of the universe, to the point that these models are able to account for a very high degree of variation in observed phenomena. These models typically have a very high \(r^2\) and as such tend to have high predictive value, which is why, with some understanding of gravitational force, mass, and a few simple calculations, high school students all over the country are able to do this in their physics lab.

Indeed, within some disciplines, having an \(r^2\) < .9 means your model isn’t good enough. Typically this is the case when the phenomenon under observation is not complex (hope that doesn’t offend). As you move away from physics and chemistry (or at least the low-complexity phenomena within each discipline), the explanatory value of models tends to decrease. For us folks in the social sciences, we are moving into inexact-science-land. Weather patterns, economies, ecosystems, and human behaviors are highly complex; in many cases an inordinate number of contributing factors may be present in an observed outcome. As a result, the degree of variation that these models may account for is typically diminished, especially in the social sciences. For example, depending on the phenomenon and sub-discipline, an \(r^2\) of .3 might mean you’re cooking with gas! In fact, for some phenomena, a reported \(r^2\) that is too high might be a cause for suspicion (“they weren’t so much cooking with gas as they were cooking the books”).

As a final note, it’s typically poor practice to get a significant \(r^2\) and just stop there. Remember, this is just a claim that your model accounts for significantly more variation than 0%. If you were to have a model that accounted for 6% of the variance (\(r^2 = .06\)) but was significant (\(p<.05\)), you need to ask yourself whether that model is useful at all. It certainly has very little predictive value (I wouldn’t bet my life… or even a few bills on it predicting a future outcome), and its true explanatory value is quite questionable.

For what it’s worth, this isn’t just a joke. I can point to a paper from my field that was published in a very prominent journal (*cough* Psych Science) that reported just this… an \(r^2\) of about .08.

23.5 A note on degrees of freedom of our model

The general equation for calculating the requisite degrees of freedom of a model is \(n-k\) where \(n\) is the number of observations (or later conditions) and \(k\) is the number of parameters that are required to be known for the calculation.

Recall how we calculate variance / standard deviation. We take the sum of the squared deviations from the mean, \(\sum(X-\bar{X})^2\), and divide by \((n-1)\). Remember that we are subtracting that 1 because, in order to calculate the variance, we must keep one thing constant: the mean. In this case the mean is the parameter that must be known in order to calculate the variance.
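
As a quick sanity check in R, using the Stress scores:

x <- stress_data$Stress
sum((x - mean(x))^2) / (length(x) - 1) # variance computed by hand
var(x)                                 # should match exactly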

We can now ask the question, “What must be known in order to derive the line of best fit?”, or more simply, what must one know in order to create a particular line? We must know both the line’s slope and its intercept. Knowing the slope alone leads to the indeterminism presented in Fig A (identical slopes, infinitely many intercepts); knowing the intercept alone leads to the indeterminism presented in Fig B (identical intercepts, infinitely many slopes). We need both to define a particular line. Just like the mean above, these are the two parameters that must be locked in to derive our model (a rough sketch of both panels appears below).

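The original figures aren’t reproduced here, but here is a rough sketch of what panels like Fig A and Fig B might look like, using only packages loaded above (ggplot2 comes with the tidyverse; cowplot gives us plot_grid()):

# Fig A: one slope, infinitely many candidate intercepts
fig_a <- ggplot() +
  geom_abline(slope = 1, intercept = seq(-2, 2, by = 1)) +
  xlim(-3, 3) + ylim(-5, 5) +
  labs(title = "Identical slopes, many intercepts")

# Fig B: one intercept, infinitely many candidate slopes
fig_b <- ggplot() +
  geom_abline(slope = seq(-2, 2, by = 1), intercept = 0) +
  xlim(-3, 3) + ylim(-5, 5) +
  labs(title = "Identical intercept, many slopes")

# place the two panels side by side
cowplot::plot_grid(fig_a, fig_b, labels = c("A", "B"))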

A simple way of restating this is that your model degrees of freedom are:

\[ \text{number of observations} - \text{number of parameters} \]

where one of your parameters is the constant (the intercept) and the remainder provide your \(\beta\) coefficients.

For simple regression we have 1 predictor and 1 constant, so our degrees of freedom are \(n-k\), where \(k=2\). However, in your future there be dragons… models that may have multiple predictors and multiple intercepts (multiple regression, mixed-effects / multi-level models, etc.). Indeed, using df may not even make much sense in such cases, and there are arguments about how to even calculate them in the first place.
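
We can confirm the simple case directly on our saved model (output omitted):

# 107 observations minus 2 estimated parameters (intercept + slope) = 105
nrow(stress_data)
df.residual(stress_symptoms_model)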

23.6 Write up and presentation

Our obtained results can be reported in several different ways. However, it’s typically best practice to report both the beta (\(b\)) coefficient with its corresponding \(t\)-test, as well as the general statistics about the model, including the \(r^2\) and corresponding \(F\)-test. Prior to any reporting of the stats, you also need to articulate the structure of the model.

I think a good template to start from is:

“[Clearly restate the alternative hypothesis identifying operationalized variables]. To test this hypothesis the data were submitted to [what type of analysis?] with [what is your dv] as the outcome and [what is your predictor] as our predictor. The resulting model was significant, \(r^2\)=0.XX; \(F\)(df1,df2)=F-ratio, \(p\)=p-value, and revealed a significant relationship between the [covariables], \(b\)=XX; \(t\)(df)=XX, \(p\)=XX.”

Returning to our stress_symptoms_model:

summary(stress_symptoms_model)

Call:
lm(formula = lnSymptoms ~ Stress, data = stress_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42889 -0.13568  0.00478  0.09672  0.40726 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.300537   0.033088 129.974  < 2e-16 ***
Stress      0.008565   0.001342   6.382 4.83e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1726 on 105 degrees of freedom
Multiple R-squared:  0.2795,    Adjusted R-squared:  0.2726 
F-statistic: 40.73 on 1 and 105 DF,  p-value: 4.827e-09

For example, with this data you may report it as:

“We hypothesized that increases in self-reported measures of Stress would correspond to increases in the number of self-reported symptoms. Given that our Symptoms data showed a significant violation of the normality assumption, we transformed these scores using a log-transformation (lnSymptoms). To test this hypothesis the data were submitted to a simple linear regression with lnSymptoms as the outcome and Stress as our predictor. The resulting model was significant, \(r^2\)=.28; \(F\)(1,105)=40.73, \(p\)<.001, and revealed a significant positive relationship between Stress and Symptoms, \(b\)=.008; \(t\)(105)=6.38, \(p\)<.001.”

23.6.1 Reporting standardized \(\beta\)

That said, in many circles it’s typical practice to report the standardized beta coefficients (\(\beta\)). These are the coefficients we would have obtained if we had initially converted both our Stress and lnSymptoms values to \(z\)-scores and then run the model using those \(z\)-scores. For example, let’s add \(z\)-transformed values for both to our original dataframe:

stress_data <- stress_data %>% mutate(z_Stress = scale(Stress)[,1],
                                      z_lnSymptoms = scale(lnSymptoms)[,1])
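
Side note: scale() returns a one-column matrix, which is why we index with [,1] to pull the values back out as a plain numeric vector:

class(scale(stress_data$Stress))      # a matrix
class(scale(stress_data$Stress)[,1])  # a plain numeric vector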

And now run the model:

z_model <- lm(z_lnSymptoms ~ z_Stress, data = stress_data)
summary(z_model)

Call:
lm(formula = z_lnSymptoms ~ z_Stress, data = stress_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.11929 -0.67043  0.02364  0.47792  2.01238 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -7.071e-16  8.245e-02   0.000        1    
z_Stress     5.287e-01  8.284e-02   6.382 4.83e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8529 on 105 degrees of freedom
Multiple R-squared:  0.2795,    Adjusted R-squared:  0.2726 
F-statistic: 40.73 on 1 and 105 DF,  p-value: 4.827e-09

That said, you don’t NEED to do the z-transform by hand. We can retroactively obtain these coefficients from the raw values by using the lm.beta() function from the lm.beta package (duplicate names, confusing I know). We input our model into the lm.beta() function and then get the summary of those results. A new Standardized column is reported alongside the Estimate:

pacman::p_load(lm.beta)

# this is the raw data model you ran earlier
stress_symptoms_model <- lm(lnSymptoms ~ Stress, data = stress_data)
lm.beta(stress_symptoms_model) %>% summary()

Call:
lm(formula = lnSymptoms ~ Stress, data = stress_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42889 -0.13568  0.00478  0.09672  0.40726 

Coefficients:
            Estimate Standardized Std. Error t value Pr(>|t|)    
(Intercept) 4.300537           NA   0.033088 129.974  < 2e-16 ***
Stress      0.008565     0.528656   0.001342   6.382 4.83e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1726 on 105 degrees of freedom
Multiple R-squared:  0.2795,    Adjusted R-squared:  0.2726 
F-statistic: 40.73 on 1 and 105 DF,  p-value: 4.827e-09

So amending the example reporting paragraph above:

“…The resulting model was significant, \(r^2\)=.28; \(F\)(1,105)=40.73, \(p\)<.001, and revealed a significant positive relationship between Stress and Symptoms, \(\beta=.528\); \(t(105)=6.38\), \(p<.001\).”

Not coincidentally, you may have noticed that this \(\beta\) value is the correlation that we established in the previous walkthrough! Quickly:

cor(stress_data$Stress, stress_data$lnSymptoms)
[1] 0.5286565
lm.beta(stress_symptoms_model) %>% summary()

Call:
lm(formula = lnSymptoms ~ Stress, data = stress_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42889 -0.13568  0.00478  0.09672  0.40726 

Coefficients:
            Estimate Standardized Std. Error t value Pr(>|t|)    
(Intercept) 4.300537           NA   0.033088 129.974  < 2e-16 ***
Stress      0.008565     0.528656   0.001342   6.382 4.83e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1726 on 105 degrees of freedom
Multiple R-squared:  0.2795,    Adjusted R-squared:  0.2726 
F-statistic: 40.73 on 1 and 105 DF,  p-value: 4.827e-09

As explained by Barry Cohen (in his useful textbook that I didn’t have you buy…):

“If you have already tested the Pearson’s \(r\) against zero, there is no need to test the corresponding \(b\), as the \(t\) values will be exactly the same (except for rounding error) for both tests.” Moreover, a test on \(b\) is equivalent to a test on \(r\) in the one-predictor case. If it is true that \(X\) and \(Y\) are related, then it must also be true that \(Y\) varies with \(X\), that is, that the slope is nonzero. This suggests that a test on \(b\) (the beta coefficient) will produce the same answer as a test on \(r\), and we could dispense with a test for \(b\) altogether.

The reverse is also true. In the single-predictor case, having a test of \(b\) obviates the need to run an independent test of \(r\). The output of lm() provides us with a simple test of the significance of the beta coefficient, \(b\).
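
You can see this equivalence for yourself: cor.test() on the raw variables returns the identical \(t\) statistic and p-value (output omitted):

# the t statistic here matches the Stress row of
# summary(stress_symptoms_model): t(105) = 6.38, p < .001
cor.test(stress_data$Stress, stress_data$lnSymptoms)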