pacman::p_load(psych, tidyverse, radiant)
14 Distributions, populations, and samples
So this week is typically one of the more confusing weeks for those who do not have much background in stats and programming, because I’m asking you to do two conceptually demanding things.
From the stats side: I’m asking you to imagine that the sample data that you collect is not sui generis (I always wonder if I’m using that term correctly), but part of a larger theoretical distribution of possible (but materially non-existent) sample data sets. The number of these sets is indefinitely many… which is a way of saying really large without invoking notions of infinity.
On the programming side: I’m asking you to build simulations of these theoretical sampling distributions by using a for() loop; programming R to proceed in an iterative loop for a very large number of instances.
Amended: Actually, we **won’t** be using for loops. Instead we’ll be doing things the tidy way, which involves much simpler code. That said, for loops might be something to learn if you want to become more adept in your programming outside of this course.
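For those curious about for() loops, here is a minimal sketch of how one could build a sampling distribution of means iteratively. This is an illustration only, not part of the coursework; the names n_samples and sample_means are my own, and the numbers anticipate the height example used in this chapter:

```r
# a minimal for() loop version of a sampling distribution of means
# (illustration only -- we won't use this approach in the course)
n_samples <- 1000                   # number of samples to draw
sample_means <- numeric(n_samples)  # pre-allocate a vector for the means

for (i in 1:n_samples) {
  # draw 25 scores from a normal population, store that sample's mean
  sample_means[i] <- mean(rnorm(n = 25, mean = 177, sd = 7.49))
}

hist(sample_means, main = "Sampling distribution of means (for loop)")
```

Pre-allocating the vector before the loop, rather than growing it inside the loop, is the idiomatic way to keep for() loops fast in R.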
In the class lecture we laid out the logic of the sampling distribution (of means), the difference distribution (of means), and the null distribution (of means). In the workshop we made it as far as simulating the sampling distribution and using it to obtain important measures and to generate the distribution of means. Here I’m going to extend that work to show how to get both the difference distribution and the null distribution. This involves minor tweaks of what was done in class.
Please follow along by entering the commands into your own console. First let’s load in the appropriate packages for this week:
14.1 Populations, samples, and theoreticals
This week you’ve encountered the notion of the distribution, which is just a way of describing all of the scores in a given set. We typically describe distributions in terms of tallies. For example, if we were looking at the distribution of heights in a given group, the distribution would describe the number of individuals within that group that were a certain height (or, more likely, within a range of heights). I like to think of distributions as having three different classifications:
population: the distribution of all scores of EVERYONE in an entire population; you almost NEVER have access to this.
sample: the empirically observed distribution of scores from a selected group (sample) within a population; this is what we typically have access to as scientists.
theoretical: a distribution that is derived from certain principles or assumptions by logical and mathematical reasoning, as opposed to one derived from real-world data obtained by empirical research (see APA). For example, the normal distribution, the Poisson distribution, the F-distribution, etc.
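R ships with random-number generators for each of these theoretical distributions. A quick sketch (the parameter values here are arbitrary):

```r
# random draws from a few theoretical distributions (arbitrary parameters)
rnorm(n = 5, mean = 0, sd = 1)   # normal
rpois(n = 5, lambda = 4)         # Poisson
rf(n = 5, df1 = 3, df2 = 20)     # F
```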
14.2 Population distribution
We’re going to play around with some numbers from this website: https://ourworldindata.org/human-height. According to this website, the average height of men in North America (as of 1996) is about 177 cm, with a standard deviation of 7.49. Height, it turns out, is normally distributed.
Let’s just assume those numbers hold for the US. There are about 151 million men in the US. On my machine it takes R about 10 seconds to generate a distribution of 151 million heights with the above criteria, but doing so also eats up about 1.2 GB of RAM. Given that not everyone has 32 GB of RAM lying around on their laptop, let’s just create a population of 1 million men (depending on your machine this might take a few seconds):

population_heights <- rnorm(n = 1000000, mean = 177, sd = 7.49)
Congrats, in a matter of seconds you’ve created a population of 1 million heights! Next, let’s turn this into a data frame / tibble containing columns for each person and their corresponding height:

pacman::p_load(tidyverse)

person_id <- 1:1000000

pop_mens_heights <- tibble(person = person_id,
                           height = population_heights)
And now let’s inspect this dataset, by looking at the first few rows…
head(pop_mens_heights)
# A tibble: 6 × 2
person height
<int> <dbl>
1 1 179.
2 2 181.
3 3 176.
4 4 160.
5 5 181.
6 6 180.
and now using str() to get more info about the dataset’s structure:
str(pop_mens_heights)
tibble [1,000,000 × 2] (S3: tbl_df/tbl/data.frame)
$ person: int [1:1000000] 1 2 3 4 5 6 7 8 9 10 ...
$ height: num [1:1000000] 179 181 176 160 181 ...
Finally let’s take a look at this distribution.
pacman::p_load(cowplot)

ggplot(data = pop_mens_heights, mapping = aes(x = height)) +
  geom_histogram(bins = sqrt(1000000)) + # ideal # of bins = sqrt of N
  theme_cowplot()
And getting our summary stats for this population:

pacman::p_load(psych)

psych::describe(pop_mens_heights)
vars n mean sd median trimmed mad min
person 1 1e+06 500000.50 288675.28 500000.50 500000.50 370650.00 1.00
height 2 1e+06 176.99 7.48 176.99 176.99 7.49 138.11
max range skew kurtosis se
person 1000000.00 999999.00 0 -1.2 288.68
height 210.54 72.42 0 0.0 0.01
14.3 Sample
Of course we don’t have the time or money to go around getting the heights of 1 million men. Instead we get a sample of 25 from our population, pop_mens_heights. We can do this by:

# get a sample of heights
sample_mens_heights <- sample_n(pop_mens_heights, size = 25)
sample_mens_heights
# A tibble: 25 × 2
person height
<int> <dbl>
1 244330 175.
2 873826 170.
3 673148 182.
4 193898 187.
5 689082 163.
6 649774 175.
7 884501 169.
8 591346 178.
9 387740 178.
10 823819 178.
# ℹ 15 more rows
Now, let’s take a look at the distribution and the summary stats of our sample:
# histogram
ggplot(data = sample_mens_heights,mapping = aes(x=height)) +
geom_histogram(bins = sqrt(25)) + # ideal # of bins = sqrt of N
theme_cowplot()
# summary stats
psych::describe(sample_mens_heights)
vars n mean sd median trimmed mad min
person 1 25 562259.0 269444.16 610197.00 571548.71 316715.98 75837.00
height 2 25 175.8 7.65 175.94 175.67 7.19 161.71
max range skew kurtosis se
person 988805.00 912968.00 -0.33 -1.19 53888.83
height 192.58 30.88 0.16 -0.46 1.53
Two things:
First, related to R: your histogram and summary stats are likely different from mine. This is because sample_n() pulls a random sample; if you resample the data using sample_n() you’re going to get different numbers every time around. FWIW, this is also true when we are getting our samples in research. To ensure that our numbers are identical we need to set the seed, fixing the manner in which R “randomly” (nothing in a computer is truly random) selects its sample. For example, if we rerun our sample using this code, our numbers will match:
# setting the seed to 123
set.seed(123)

# get a sample of heights
sample_mens_heights <- sample_n(pop_mens_heights, size = 25)
# histogram
ggplot(data = sample_mens_heights,mapping = aes(x=height)) +
geom_histogram(bins = sqrt(25)) + # ideal # of bins = sqrt of N
theme_cowplot()
# summary stats
psych::describe(sample_mens_heights)
vars n mean sd median trimmed mad min
person 1 25 519031.72 270338.04 554826.00 512511.86 281126.16 124022.00
height 2 25 177.13 8.62 177.01 177.03 9.53 163.46
max range skew kurtosis se
person 985797.0 861775.00 0.18 -1.17 54067.61
height 193.6 30.14 0.06 -1.21 1.72
Second, tied to the conceptual material (and related to the discussion above), you’ll note that the distribution of heights in your sample is not identical to the population. This will be true for any sample that you take—the sample statistics are not identical to the population statistics. What we hope is that they are close enough that our sample approximates the population within a certain tolerance. More, no two samples are going to be exactly alike:
set.seed(2021)

sample1 <- sample_n(pop_mens_heights, size = 25)
sample2 <- sample_n(pop_mens_heights, size = 25)

psych::describe(sample1)
vars n mean sd median trimmed mad min
person 1 25 589528.52 282701.81 643236.0 602575.86 364602.47 66630.00
height 2 25 177.66 6.89 177.4 177.44 8.14 166.74
max range skew kurtosis se
person 976624.00 909994.00 -0.30 -1.23 56540.36
height 191.08 24.34 0.34 -1.11 1.38
psych::describe(sample2)
vars n mean sd median trimmed mad min
person 1 25 428057.16 308820.34 353637.00 415141.71 360989.4 14244.0
height 2 25 177.88 7.42 175.94 177.89 8.0 165.9
max range skew kurtosis se
person 967537.00 953293.00 0.38 -1.30 61764.07
height 191.25 25.35 0.06 -1.23 1.48
The fact that summary statistics vary from sample to sample leads us to our next kind of distribution… the sampling distribution.
14.4 Theoretical: the Sampling Distribution (of Means)
The sampling distribution is a distribution built by the sampling and resampling of a population. A sampling distribution of a statistic shows every possible result that statistic can take in every possible sample from a population, and how often each result happens. In practice, the generic characteristics of a sampling distribution are theoretical. That is, they are developed by applying a small amount of observation to establish fundamental, and ultimately abstract, principles. As an example, the normal distribution is a theoretical distribution that states that for any (approximately) randomly occurring measurement, such as height, a given percentage of scores should fall within specific ranges of the mean. Indeed, we’ve already generated a normal distribution in this walkthrough when we created our population. The function rnorm() generates a normal distribution of n scores with a prescribed mean and sd:

rnorm(n = 1000, mean = 100, sd = 15) %>% hist(., main = "Randomly generated normal distribution")
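As a quick check of that “specific ranges” claim: roughly 68% of normally distributed scores fall within 1 standard deviation of the mean, and roughly 95% within 2. A sketch using freshly simulated heights (simulated here so the snippet runs on its own):

```r
# simulate heights, then check the 68% / 95% coverage of a normal distribution
heights <- rnorm(n = 100000, mean = 177, sd = 7.49)

mean(abs(heights - 177) <= 1 * 7.49)  # roughly 0.68
mean(abs(heights - 177) <= 2 * 7.49)  # roughly 0.95
```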
Returning to the Sampling Distribution, one claim is that if we were to sample and resample a population to infinity, the Sampling Distribution of Means (i.e. the distribution of those sample means) would be normal. More, the mean of this Sampling Distribution of Means should be identical to the population mean (177.0).
Obviously we can’t get an infinite number of samples, so this assumption / claim is based on underlying mathematical principles and formalisms. BUT… one of the nice things about computers is that while we can’t go to infinity, we can perform a really large number of samples. Let’s generate a sampling distribution of means by taking ten thousand different samples from our population of men, where each sample is 25 men:
# imagine taking ten thousand different samples...
number_of_samples <- 10000

# where each sample is 25 men
N <- 25

sampling_dist_means <- tibble(num = 1:number_of_samples) %>%
  group_by(num) %>%
  # using sample() for a vector, rather than sample_n() for a data frame (tibble)
  mutate(sample_means = mean(sample(pop_mens_heights$height, size = N, replace = TRUE)))
Let’s look at our sampling distribution of means:
# histogram
ggplot(data = sampling_dist_means,mapping = aes(x=sample_means)) +
geom_histogram(bins = sqrt(10000)) + # ideal # of bins = sqrt of N
theme_cowplot()
And now let’s compare the mean of the sampling distribution of means to the mean of the entire population of 1 million men:

psych::describe(sampling_dist_means)
vars n mean sd median trimmed mad min max
num 1 10000 5000.50 2886.9 5000.5 5000.50 3706.50 1.00 10000.00
sample_means 2 10000 177.01 1.5 177.0 177.01 1.51 170.83 182.56
range skew kurtosis se
num 9999.00 0.00 -1.20 28.87
sample_means 11.73 -0.02 0.03 0.02
psych::describe(pop_mens_heights)
vars n mean sd median trimmed mad min
person 1 1e+06 500000.50 288675.28 500000.50 500000.50 370650.00 1.00
height 2 1e+06 176.99 7.48 176.99 176.99 7.49 138.11
max range skew kurtosis se
person 1000000.00 999999.00 0 -1.2 288.68
height 210.54 72.42 0 0.0 0.01
Look at that, both are 177.0!
One last bit with the Sampling Distribution of Means. The standard deviation of the Sampling Distribution of Means is known as the standard error. You have likely encountered this term before as a measure of confidence in the mean of your sample as it relates to the true mean of the population. When working with a given sample, the standard error is approximated as the standard deviation of the sample divided by the square root of the number of scores:

\[SE = \frac{s}{\sqrt{N}}\]

For example, the standard error for the single sample below is \(6.89/\sqrt{25}\), or 6.89/5, or about 1.38. Note that psych::describe() produces the se at the end of its table:

set.seed(2021)

sample_n(pop_mens_heights, size = 25) %>% psych::describe()
vars n mean sd median trimmed mad min
person 1 25 589528.52 282701.81 643236.0 602575.86 364602.47 66630.00
height 2 25 177.66 6.89 177.4 177.44 8.14 166.74
max range skew kurtosis se
person 976624.00 909994.00 -0.30 -1.23 56540.36
height 191.08 24.34 0.34 -1.11 1.38
In this case the standard error of the sample (1.38) is pretty close to the standard deviation of the sampling distribution (1.50). Given that we are dividing by \(\sqrt{N}\), it should be apparent that the standard error decreases as we increase our sample size. Thus, with increased sample size we are more confident that a given sample accurately represents the whole population.
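If you’d like to verify the formula by hand, here is a sketch computing the standard error directly from a sample, without psych::describe(). The heights sample here is a stand-in; swap in your own data:

```r
# standard error by hand: sample sd divided by the square root of sample size
heights <- rnorm(n = 25, mean = 177, sd = 7.49)  # stand-in sample

se <- sd(heights) / sqrt(length(heights))
se
```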
If you have time, try rerunning the code, but this time with sample sizes of 20, 50, 100, and 500. What happens to your distributions and your measures of central tendency?
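Here is one tidy way you might set up that comparison. This is a sketch, not the official solution: the names sampling_by_n and sample_size are my own, I use 1,000 samples per size to keep the run quick (bump num up to 10,000 to match the walkthrough), and I regenerate the population in case pop_mens_heights isn’t still in your workspace:

```r
library(tidyverse)

# stand-in population, in case pop_mens_heights isn't in your workspace
pop_mens_heights <- tibble(person = 1:1000000,
                           height = rnorm(n = 1000000, mean = 177, sd = 7.49))

# sampling distributions of means for several sample sizes at once
sampling_by_n <- crossing(sample_size = c(20, 50, 100, 500),
                          num = 1:1000) %>%
  group_by(sample_size, num) %>%
  mutate(sample_means = mean(sample(pop_mens_heights$height,
                                    size = sample_size, replace = TRUE))) %>%
  ungroup()

# the sd of each sampling distribution (the standard error) shrinks as N grows
sampling_by_n %>%
  group_by(sample_size) %>%
  summarise(mean_of_means = mean(sample_means),
            se_of_means = sd(sample_means))
```

In the summary, se_of_means should track \(7.49/\sqrt{N}\) for each sample size, which is exactly the shrinking standard error described above.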