pacman::p_load(psych, tidyverse, radiant)
14 Distributions, populations, and samples
So this week is typically one of the more confusing weeks for those who do not have much background in stats and programming, because I’m asking you to do two conceptually demanding things.
From the stats side: I’m asking you to imagine that the sample data that you collect is not sui generis (I always wonder if I’m using that term correctly), but part of a larger theoretical distribution of possible (but materially non-existent) sample data sets. The number of these sets is indefinitely many… which is a way of saying really large without invoking notions of infinity.
On the programming side: I’m asking you to build simulations of these theoretical sampling distributions by using a for() loop; programming R to proceed in an iterative loop for a very large number of instances.
Amended: Actually, we **won’t** be using for loops. Instead we’ll be doing things the tidy way, which involves much simpler code. That said, for loops might be something to learn if you want to become more adept in your programming outside of this course.
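For those curious about for() loops, here is a minimal sketch of how one could build a sampling distribution of means iteratively. This is an illustration only, not part of the coursework; the names n_samples and sample_means are my own, and the numbers anticipate the height example used in this chapter:

```r
# a minimal for() loop version of a sampling distribution of means
# (illustration only -- we won't use this approach in the course)
n_samples <- 1000                   # number of samples to draw
sample_means <- numeric(n_samples)  # pre-allocate a vector for the means

for (i in 1:n_samples) {
  # draw 25 scores from a normal population, store that sample's mean
  sample_means[i] <- mean(rnorm(n = 25, mean = 177, sd = 7.49))
}

hist(sample_means, main = "Sampling distribution of means (for loop)")
```

Pre-allocating the vector before the loop, rather than growing it inside the loop, is the idiomatic way to keep for() loops fast in R.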
In the class lecture we laid out the logic of the sampling distribution (of means), the difference distribution (of means), and the null distribution (of means). In the workshop we made it as far as simulating the sampling distribution and using it to obtain important measures and to generate the distribution of means. Here I’m going to extend that work to show how to get both the difference distribution and the null distribution. This involves minor tweaks of what was done in class.
Please follow along by entering the commands into your own console. First let’s load in the appropriate packages for this week:
14.1 Populations, samples, and theoreticals
This week you’ve encountered the notion of the distribution, which is just a way of describing all of the scores in a given set. We typically describe distributions in terms of tallies. For example, if we were looking at the distribution of heights in a given group, the distribution would describe the number of individuals within that group that were a certain height (or, more likely, within a range of heights). I like to think of distributions as having three different classifications:
population: the distribution of all scores of EVERYONE in an entire population; you almost NEVER have access to this.
sample: the empirically observed distribution of scores from a selected group (sample) within a population; this is what we typically have access to as scientists.
theoretical: a distribution that is derived from certain principles or assumptions by logical and mathematical reasoning, as opposed to one derived from real-world data obtained by empirical research (see APA). For example, the normal distribution, the Poisson distribution, the F-distribution, etc.
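R ships with random-number generators for each of these theoretical distributions. A quick sketch (the parameter values here are arbitrary):

```r
# random draws from a few theoretical distributions (arbitrary parameters)
rnorm(n = 5, mean = 0, sd = 1)   # normal
rpois(n = 5, lambda = 4)         # Poisson
rf(n = 5, df1 = 3, df2 = 20)     # F
```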
14.2 Population distribution
We’re going to play around with some numbers from this website: https://ourworldindata.org/human-height. According to this website, the average height of men in North America (as of 1996) is about 177 cm, with a standard deviation of 7.49. Height, it turns out, is normally distributed.
Let’s just assume those numbers hold for the US. There are about 151 million men in the US. On my machine it takes R about 10 seconds to generate a distribution of 151 million heights with the above criteria, but doing so also eats up about 1.2 GB of RAM. Given that not everyone has 32 GB of RAM lying around on their laptop, let’s just create a population of 1 million men (depending on your machine this might take a few seconds):

population_heights <- rnorm(n = 1000000, mean = 177, sd = 7.49)
Congrats, in a matter of seconds you’ve created a population of 1 million heights! Next, let’s turn this into a data frame / tibble containing columns for each person and their corresponding height:

pacman::p_load(tidyverse)

person_id <- 1:1000000

pop_mens_heights <- tibble(person = person_id,
                           height = population_heights)
And now let’s inspect this dataset, by looking at the first few rows…
head(pop_mens_heights)
# A tibble: 6 × 2
person height
<int> <dbl>
1 1 179.
2 2 181.
3 3 176.
4 4 160.
5 5 181.
6 6 180.
and now using str() to get more info about the dataset’s structure:
str(pop_mens_heights)
tibble [1,000,000 × 2] (S3: tbl_df/tbl/data.frame)
$ person: int [1:1000000] 1 2 3 4 5 6 7 8 9 10 ...
$ height: num [1:1000000] 179 181 176 160 181 ...
Finally let’s take a look at this distribution.
pacman::p_load(cowplot)

ggplot(data = pop_mens_heights, mapping = aes(x = height)) +
  geom_histogram(bins = sqrt(1000000)) + # ideal # of bins = sqrt of N
  theme_cowplot()
And getting our summary stats for this population:

pacman::p_load(psych)

psych::describe(pop_mens_heights)
vars n mean sd median trimmed mad min
person 1 1e+06 500000.50 288675.28 500000.50 500000.50 370650.00 1.00
height 2 1e+06 176.99 7.48 176.99 176.99 7.49 138.11
max range skew kurtosis se
person 1000000.00 999999.00 0 -1.2 288.68
height 210.54 72.42 0 0.0 0.01
14.3 Sample
Of course we don’t have the time or money to go around getting the heights of 1 million men. Instead we get a sample of 25 from our population, pop_mens_heights. We can do this by:

# get a sample of heights
sample_mens_heights <- sample_n(pop_mens_heights, size = 25)
sample_mens_heights
# A tibble: 25 × 2
person height
<int> <dbl>
1 244330 175.
2 873826 170.
3 673148 182.
4 193898 187.
5 689082 163.
6 649774 175.
7 884501 169.
8 591346 178.
9 387740 178.
10 823819 178.
# ℹ 15 more rows
Now, let’s take a look at the distribution and the summary stats of our sample:
# histogram
ggplot(data = sample_mens_heights,mapping = aes(x=height)) +
geom_histogram(bins = sqrt(25)) + # ideal # of bins = sqrt of N
theme_cowplot()
# summary stats
psych::describe(sample_mens_heights)
vars n mean sd median trimmed mad min
person 1 25 562259.0 269444.16 610197.00 571548.71 316715.98 75837.00
height 2 25 175.8 7.65 175.94 175.67 7.19 161.71
max range skew kurtosis se
person 988805.00 912968.00 -0.33 -1.19 53888.83
height 192.58 30.88 0.16 -0.46 1.53
Two things:
First, related to R: your histogram and summary stats are likely different from mine. This is because sample_n() pulls a random sample; if you resample the data using sample_n() you’re going to get different numbers every time around. FWIW, this is also true when we are getting our samples in research. To ensure that our numbers are identical we need to set the seed, fixing the manner in which R “randomly” (nothing in a computer is truly random) selects its sample. For example, if we rerun our sample using this code, our numbers will match:
# setting the seed to 123
set.seed(123)

# get a sample of heights
sample_mens_heights <- sample_n(pop_mens_heights, size = 25)
# histogram
ggplot(data = sample_mens_heights,mapping = aes(x=height)) +
geom_histogram(bins = sqrt(25)) + # ideal # of bins = sqrt of N
theme_cowplot()
# summary stats
psych::describe(sample_mens_heights)
vars n mean sd median trimmed mad min
person 1 25 519031.72 270338.04 554826.00 512511.86 281126.16 124022.00
height 2 25 177.13 8.62 177.01 177.03 9.53 163.46
max range skew kurtosis se
person 985797.0 861775.00 0.18 -1.17 54067.61
height 193.6 30.14 0.06 -1.21 1.72
Second, tied to the conceptual material (and related to the discussion above), you’ll note that the distribution of heights in your sample is not identical to the population. This will be true for any sample that you take—the sample statistics are not identical to the population statistics. What we hope is that they are close enough that our sample approximates the population within a certain tolerance. More, no two samples are going to be exactly alike:
set.seed(2021)

sample1 <- sample_n(pop_mens_heights, size = 25)
sample2 <- sample_n(pop_mens_heights, size = 25)

psych::describe(sample1)
vars n mean sd median trimmed mad min
person 1 25 589528.52 282701.81 643236.0 602575.86 364602.47 66630.00
height 2 25 177.66 6.89 177.4 177.44 8.14 166.74
max range skew kurtosis se
person 976624.00 909994.00 -0.30 -1.23 56540.36
height 191.08 24.34 0.34 -1.11 1.38
psych::describe(sample2)
vars n mean sd median trimmed mad min
person 1 25 428057.16 308820.34 353637.00 415141.71 360989.4 14244.0
height 2 25 177.88 7.42 175.94 177.89 8.0 165.9
max range skew kurtosis se
person 967537.00 953293.00 0.38 -1.30 61764.07
height 191.25 25.35 0.06 -1.23 1.48
The fact that summary statistics vary from sample to sample leads us to our next kind of distribution… the sampling distribution.
14.4 Theoretical: the Sampling Distribution (of Means)
The sampling distribution is a distribution built by the sampling and resampling of a population. A sampling distribution of a statistic shows every possible result that statistic can take in every possible sample from a population, and how often each result happens. In practice, the generic characteristics of a sampling distribution are theoretical. That is, they are developed by applying a small amount of observation to establish fundamental, and ultimately abstract, principles. As an example, the normal distribution is a theoretical distribution that states that for any (approximately) randomly occurring measurement, such as height, a given percentage of scores should fall within specific ranges of the mean. Indeed, we’ve already generated a normal distribution in this walkthrough when we created our population. The function rnorm() generates a normal distribution of n scores with a prescribed mean and sd:

rnorm(n = 1000, mean = 100, sd = 15) %>% hist(., main = "Randomly generated normal distribution")
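As a quick check of that “specific ranges” claim: roughly 68% of normally distributed scores fall within 1 standard deviation of the mean, and roughly 95% within 2. A sketch using freshly simulated heights (simulated here so the snippet runs on its own):

```r
# simulate heights, then check the 68% / 95% coverage of a normal distribution
heights <- rnorm(n = 100000, mean = 177, sd = 7.49)

mean(abs(heights - 177) <= 1 * 7.49)  # roughly 0.68
mean(abs(heights - 177) <= 2 * 7.49)  # roughly 0.95
```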
Returning to the Sampling Distribution, one claim is that if we were to sample and resample a population to infinity, the Sampling Distribution of Means (i.e. the distribution of those sample means) would be normal. More, the mean of this Sampling Distribution of Means should be identical to the population mean (177.0).
Obviously we can’t get an infinite number of samples, so this assumption / claim is based on underlying mathematical principles and formalisms. BUT… one of the nice things about computers is that while we can’t go to infinity, we can perform a really large number of samples. Let’s generate a sampling distribution of means by taking ten thousand different samples from our population of men, where each sample is 25 men:
# imagine taking ten thousand different samples...
number_of_samples <- 10000

# where each sample is 25 men
N <- 25

sampling_dist_means <- tibble(num = 1:number_of_samples) %>%
  group_by(num) %>%
  # using sample() for a vector, rather than sample_n() for a data frame (tibble)
  mutate(sample_means = mean(sample(pop_mens_heights$height, size = N, replace = TRUE)))
Let’s look at our sampling distribution of means:
# histogram
ggplot(data = sampling_dist_means,mapping = aes(x=sample_means)) +
geom_histogram(bins = sqrt(10000)) + # ideal # of bins = sqrt of N
theme_cowplot()
And now let’s compare the mean of the sampling distribution of means to the mean of the entire population of 1 million men:

psych::describe(sampling_dist_means)
vars n mean sd median trimmed mad min max
num 1 10000 5000.50 2886.9 5000.5 5000.50 3706.50 1.00 10000.00
sample_means 2 10000 177.01 1.5 177.0 177.01 1.51 170.83 182.56
range skew kurtosis se
num 9999.00 0.00 -1.20 28.87
sample_means 11.73 -0.02 0.03 0.02
psych::describe(pop_mens_heights)
vars n mean sd median trimmed mad min
person 1 1e+06 500000.50 288675.28 500000.50 500000.50 370650.00 1.00
height 2 1e+06 176.99 7.48 176.99 176.99 7.49 138.11
max range skew kurtosis se
person 1000000.00 999999.00 0 -1.2 288.68
height 210.54 72.42 0 0.0 0.01
Look at that, both are 177.0!
One last bit with the Sampling Distribution of Means. The standard deviation of the Sampling Distribution of Means is known as the standard error. You have likely encountered this term before as a measure of confidence in the mean of your sample as it relates to the true mean of the population. When working with a given sample, the standard error is approximated as the standard deviation of the sample divided by the square root of the number of scores:

\[SE = \frac{s}{\sqrt{N}}\]

For example, the standard error for the single sample below is \(6.89/\sqrt{25}\), or 6.89/5, or about 1.38. Note that psych::describe() produces the se at the end of its table:

set.seed(2021)

sample_n(pop_mens_heights, size = 25) %>% psych::describe()
vars n mean sd median trimmed mad min
person 1 25 589528.52 282701.81 643236.0 602575.86 364602.47 66630.00
height 2 25 177.66 6.89 177.4 177.44 8.14 166.74
max range skew kurtosis se
person 976624.00 909994.00 -0.30 -1.23 56540.36
height 191.08 24.34 0.34 -1.11 1.38
In this case the standard error of the sample (1.38) is pretty close to the standard deviation of the sampling distribution (1.50). Given that we are dividing by \(\sqrt{N}\), it should be apparent that the standard error decreases as we increase our sample size. Thus, with increased sample size we are more confident that a given sample accurately represents the whole population.
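If you’d like to verify the formula by hand, here is a sketch computing the standard error directly from a sample, without psych::describe(). The heights sample here is a stand-in; swap in your own data:

```r
# standard error by hand: sample sd divided by the square root of sample size
heights <- rnorm(n = 25, mean = 177, sd = 7.49)  # stand-in sample

se <- sd(heights) / sqrt(length(heights))
se
```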
If you have time, try rerunning the code, but this time with sample sizes of 20, 50, 100, and 500. What happens to your distributions and your measures of central tendency?
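Here is one tidy way you might set up that comparison. This is a sketch, not the official solution: the names sampling_by_n and sample_size are my own, I use 1,000 samples per size to keep the run quick (bump num up to 10,000 to match the walkthrough), and I regenerate the population in case pop_mens_heights isn’t still in your workspace:

```r
library(tidyverse)

# stand-in population, in case pop_mens_heights isn't in your workspace
pop_mens_heights <- tibble(person = 1:1000000,
                           height = rnorm(n = 1000000, mean = 177, sd = 7.49))

# sampling distributions of means for several sample sizes at once
sampling_by_n <- crossing(sample_size = c(20, 50, 100, 500),
                          num = 1:1000) %>%
  group_by(sample_size, num) %>%
  mutate(sample_means = mean(sample(pop_mens_heights$height,
                                    size = sample_size, replace = TRUE))) %>%
  ungroup()

# the sd of each sampling distribution (the standard error) shrinks as N grows
sampling_by_n %>%
  group_by(sample_size) %>%
  summarise(mean_of_means = mean(sample_means),
            se_of_means = sd(sample_means))
```

In the summary, se_of_means should track \(7.49/\sqrt{N}\) for each sample size, which is exactly the shrinking standard error described above.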