Sampling: Good and Bad
Day 35 - Lesson 3.2
Describe how convenience sampling can lead to bias.
Describe how voluntary response sampling can lead to bias.
Explain how random sampling can help to avoid bias.
Activity: What is the average word length in the Gettysburg Address?
Intro: Sharing Good News
Every Monday and Friday, we start class with Good News. It is one of the strategies utilized by the program Capturing Kids' Hearts to help teachers make connections with students. Students are given an opportunity to share something good in their lives and teachers get to learn more about each of their students.
Today we were very intentional about how we selected our sample of students to share out their Good News: “Let’s suppose that I wanted to find out about the Good News that is happening for all of the students in my 3rd hour. Ideally, I would ask each one of you to share your good news, but because we don’t have time for that I would like to take a sample of 5 students. If you choose to volunteer for good news, you must share your Good News and then give me a rating of how awesome your weekend was (1 is boring, 10 is most awesome).” Call on 5 students, let them share Good News and their weekend rating, and write the ratings on the board. Then find the average of their ratings (3rd hour had an 8.2).
Follow up questions:
What is the population? The sample? (this is a connection to Lesson 3.1).
Do you think that the average weekend rating is higher or lower than the true weekend rating for the whole class? Why?
Students realize that those students who choose to volunteer themselves to be part of the sample are usually the ones that feel strongly about having a good weekend, and that our estimate of the weekend rating is probably higher than the true weekend rating for the whole class.
We started by making clear our goal in this Activity: to try to find the average word length of all the words in the Gettysburg Address. Since we don’t have all day to be counting letters, we will take a sample of 5 words. Tell students to quickly circle 5 words. It is important that students do this quickly without thinking about it too much if we are to get the biased results that we are hoping for. Each student calculates the average word length of their sample and then brings the value to the front board to make a dotplot.
There are several important statistical ideas present in this dotplot. First, when you take a sample, there is a variety of words that could end up in it, so we see estimated means as low as 3.4 and as high as 8.8. This is a very concrete example of sampling variability (which is a learning target in Lesson 3.3). In 3rd hour, I pointed at the dot at 3.4 and asked which student got this estimate, and then had them share the 5 words that were in their sample. We did the same for the dot at 8.8. This way, students can hear the variety of word lengths that could end up in a sample.
Second, this dotplot is the first time that students are seeing a sampling distribution. Sampling distributions will become much more important when we get to Chapter 5. We make sure to ask students “What does each dot represent?” with an intended answer of “A different sample of 5 words, and an average calculated from that sample”. So each dot represents a different sample of 5 words.
Third, this dotplot shows very clearly that our sampling method is biased. This only becomes apparent after we reveal the true mean word length of 4.29. We can see that our sampling method consistently overestimates the true population value: 26 out of 27 students overestimated it. Students’ eyes are drawn to the larger words when they are quickly circling.
Finally, the sampling method that we used here is a convenience sample because we simply picked words in the easiest way (circling quickly).
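For teachers who want to see this effect outside the classroom, here is a small Python sketch of how an eye drawn to longer words pushes estimates upward. It is an illustration only, not part of the Activity. It rests on two assumptions that are not from the lesson: the population is just the opening sentence of the Address (so its true mean is not 4.29), and the chance of circling a word is taken to be proportional to its length.

```python
import random

# Stand-in population (assumption): only the opening sentence of the Address.
TEXT = ("Four score and seven years ago our fathers brought forth on this "
        "continent, a new nation, conceived in Liberty, and dedicated to the "
        "proposition that all men are created equal.")
words = [w.strip(",.") for w in TEXT.split()]
lengths = [len(w) for w in words]
true_mean = sum(lengths) / len(lengths)


def eye_catching_sample(rng, k=5):
    """Pick k distinct words, favoring long ones.

    Hypothetical model of quick circling: the chance of picking a word
    is proportional to its length.
    """
    pool = list(range(len(words)))
    chosen = []
    for _ in range(k):
        weights = [lengths[i] for i in pool]
        pick = rng.choices(pool, weights=weights, k=1)[0]
        pool.remove(pick)  # a word can only be circled once
        chosen.append(pick)
    return [lengths[i] for i in chosen]


rng = random.Random(0)  # fixed seed so the run is repeatable
estimates = []
for _ in range(1000):
    sample = eye_catching_sample(rng)
    estimates.append(sum(sample) / len(sample))

share_over = sum(e > true_mean for e in estimates) / len(estimates)
print(f"true mean of the stand-in population: {true_mean:.2f}")
print(f"share of simulated samples that overestimate: {share_over:.2f}")
```

Under this assumed model, most simulated samples land above the true mean, which is the same one-sided pattern the class dotplot shows.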
The next question, of course, is how could we improve our estimates? How can we be sure that our sample is actually random? After some discussion, students realize that a random number generator would likely be better than circling words ourselves. Students can use randInt on a graphing calculator, or even better use their iPhones by asking Siri to “Give me a random number between 1 and 268”. Each student takes a new random sample of 5 words, finds an average, and takes it to the dotplot at the front board.
We started by comparing the two distributions (remember to compare shape, center, variability, and outliers from Chapter 1). The big difference is in the center: the distribution from the convenience samples is centered higher than the distribution from the random samples.
In these samples, we have 12 students who have underestimated the true mean and 11 students who have overestimated the mean (low bias!!). Be sure to take a picture of this distribution as the students will need to use it again tomorrow.
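The random-number approach can also be sketched in Python. This is a minimal illustration under stated assumptions, not the classroom procedure itself: the population is only the opening sentence of the Address (30 words rather than 268), and a repeated random number is simply redrawn so that each sample has 5 distinct words.

```python
import random

# Stand-in population (assumption): only the opening sentence of the Address.
TEXT = ("Four score and seven years ago our fathers brought forth on this "
        "continent, a new nation, conceived in Liberty, and dedicated to the "
        "proposition that all men are created equal.")
words = [w.strip(",.") for w in TEXT.split()]
N = len(words)  # 30 here; the lesson numbers its words 1 to 268
true_mean = sum(len(w) for w in words) / N


def random_sample(rng, k=5):
    """Draw k distinct word positions with randint(1, N), redrawing repeats."""
    positions = set()
    while len(positions) < k:
        positions.add(rng.randint(1, N))  # like asking for a number from 1 to N
    return [words[p - 1] for p in sorted(positions)]


rng = random.Random(1)  # fixed seed so the run is repeatable
estimates = []
for _ in range(1000):
    sample = random_sample(rng)
    estimates.append(sum(len(w) for w in sample) / len(sample))

over = sum(e > true_mean for e in estimates)
under = sum(e < true_mean for e in estimates)
print(f"true mean of the stand-in population: {true_mean:.2f}")
print(f"overestimates: {over}, underestimates: {under}")
print(f"average of all estimates: {sum(estimates) / len(estimates):.2f}")
```

With random selection, the estimates still vary from sample to sample, but they scatter on both sides of the true mean instead of piling up above it.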
A sampling distribution is the distribution of all possible estimates coming from all possible samples of the same size from the population. Sampling distributions are fundamental for students’ understanding of confidence intervals and significance tests later in the course. We have found that you have to slowly grow this idea with students, and the best way is by having them take many, many samples. We hope to do many more Activities later where students are taking many, many samples and looking at the distribution of all of their estimates.
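For a small enough population, the definition above can be carried out literally by listing every possible sample. The Python sketch below does this for an assumed stand-in population, the 30 word lengths from the opening sentence of the Address; the full text would have far too many samples of 5 words to list this way.

```python
from itertools import combinations

# Word lengths of the opening sentence of the Address (stand-in population):
# Four score and seven years ago our fathers brought forth on this continent,
# a new nation, conceived in Liberty, and dedicated to the proposition that
# all men are created equal.
lengths = [4, 5, 3, 5, 5, 3, 3, 7, 7, 5, 2, 4, 9, 1, 3,
           6, 9, 2, 7, 3, 9, 2, 3, 11, 4, 3, 3, 3, 7, 5]
true_mean = sum(lengths) / len(lengths)

# Every possible sample of 5 distinct words, and the mean of each one.
sample_means = [sum(sample) / 5 for sample in combinations(lengths, 5)]

mean_of_means = sum(sample_means) / len(sample_means)
share_over = sum(m > true_mean for m in sample_means) / len(sample_means)

print(f"number of possible samples: {len(sample_means)}")
print(f"true mean: {true_mean:.4f}")
print(f"mean of all sample means: {mean_of_means:.4f}")
print(f"smallest and largest sample mean: {min(sample_means)}, {max(sample_means)}")
print(f"share of samples that overestimate: {share_over:.2f}")
```

The mean of all the sample means matches the true mean, which is what it means for random sampling to be unbiased, while the spread between the smallest and largest sample mean is the sampling variability students saw on the board.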