Introduction to Box Plots


Students compute five-number summaries from quantitative datasets, and then use those five-number summaries to create box plots.

Lesson Goals

Students will be able to…

Compute the values in the five-number summary for a dataset (minimum, Q1, median, Q3, and maximum).
Interpret the values in the five-number summary for a dataset (minimum, Q1, median, Q3, and maximum).
Create a box plot from a five-number summary.
Create a box plot in Pyret.
Interpret a box plot to answer questions about a dataset and its spread.

Student-facing Lesson Goals

Let’s compute and plot five-number summaries.

Materials

Supplemental Materials

Preparation

Decide whether or not you want to launch this lesson using the Live Pyret Survey and your class' own data. If so, follow the instructions on How to Use a Live Pyret Survey to:
- Make a copy of the Number of Homes we’ve lived in (Google Form) and share it with your class.
  - This makes a great "Do Now" as students enter the room, but could also be assigned for homework the night before.
- Link the Google Sheet of their responses to the Number of Homes we’ve lived in (Starter File).
- Click "Run" and then either:
  - Project the new Data Visualization they’ll be introduced to today.
  - Publish the starter file and share a link with your students.
If you are using our Google Slides, you will see the word "Optional" in the title of any slide that corresponds to an optional section of the lesson plan. Adjust the slides based on which portions of the lesson you will be doing with your students.

The Five-Number Summary

Overview

Students learn how to use range, quartiles, and interquartile range to talk about variability.

Launch

If you decided to launch today’s class using our Live Pyret Survey, you’ll probably want your students to complete the Google Form as they enter class. We won’t actually look at the results until the next section, though, so feel free to make another choice about how and when to have your students enter their data.

Make sure you’ve already…

Followed the Instructions to Set up and Link the Files
Shared the link you made to your class' copy of the Number of Homes we’ve lived in (Google Form)

Open the Google Form Survey link I shared and submit your response.

Each of the three dot plots shown below has a median of 60.

a dot plot showing 2 dots on 30, 5 dots on 40, 2 dots on 50, 4 dots on 60, 1 dot on 70, 3 dots on 80, 2 dots on 100, and 1 dot on 120

a dot plot with many tall columns of dots between 0 and 100 (some with over 20 dots) and then smaller (1 to 4 dot stacks) spread between 100 and 250

a dot plot with 2 dots on 40, 2 dots on 50, 1 dot on 60, 1 dot on 70, 2 dots on 80 and 1 dot on 90

We know that these dot plots represent datasets that have the same median, but the data definitely isn’t the same! In what ways are they different?
Each dot plot has a different spread and a different shape. There are different peaks, gaps, and outliers.
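
For teachers who want to check the claim numerically, here is a minimal sketch in Python (not the Pyret environment this lesson uses). It rebuilds the first and third dot plots from the counts described above and confirms that both have a median of 60 while their ranges differ. The variable names are illustrative only.

```python
from statistics import median

# Values read off the first and third dot plots described above
# (each value is repeated once per dot).
plot_1 = [30] * 2 + [40] * 5 + [50] * 2 + [60] * 4 + [70] + [80] * 3 + [100] * 2 + [120]
plot_3 = [40] * 2 + [50] * 2 + [60] + [70] + [80] * 2 + [90]

for name, data in [("plot 1", plot_1), ("plot 3", plot_3)]:
    spread = max(data) - min(data)  # range = maximum - minimum
    print(name, "median:", median(data), "range:", spread)
```

Both medians come out to 60, but the first plot spans 90 units while the third spans only 50.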

How do statisticians talk about what makes each sample unique?

One way to characterize the distribution of data is a five-number summary. To compute a five-number summary, we arrange the data values from least to greatest. Then, we decompose a dataset into four equal parts, separated by quartiles. This process can offer us a more nuanced idea of how the data is spread out.

Here are the specific ingredients for a 5-Number Summary:

Minimum: the smallest value in a dataset - it starts the first quarter
Q1 (lower quartile): the number that separates the first quarter of the data from the second quarter of the data
Q2 (Median): the middle value in a dataset
Q3 (upper quartile): the value that separates the third quarter of the data from the last
Maximum: the largest value in a dataset - it ends the fourth quarter of the data

Let’s try this out for a dataset with an even number of values

Consider the dataset: 1, 2, 3, 5, 6, 7, 8, 9 (4 is missing!)

What is the Minimum of this dataset?
1
What is the Maximum of this dataset?
9
How can we calculate the Median (Q2) of this dataset?
The median is always the "middle number". This dataset has 8 numbers, so there is no exact "middle". When this happens, we take the mean of the two middle numbers (5 and 6), which is 5.5
Which numbers are in the "lower half" of this dataset, and what is the median of that half (Q1)?
All the numbers less than 5.5: 1, 2, 3, 5
The median is the middle number, but once again we need to take the mean of two middles (2 and 3): 2.5
Which numbers are in the "upper half" of this dataset, and what is the median of that half (Q3)?
All the numbers more than 5.5: 6, 7, 8, 9
The median is the middle number, but once again we need to take the mean of two middles (7 and 8): 7.5
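
The steps above can be sketched in Python (an assumption for illustration; the lesson itself uses Pyret). The helper name `five_number_summary` is hypothetical. It follows the method taught here: Q1 and Q3 are the medians of the lower and upper halves. Library quartile functions may use other conventions and can return slightly different values.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), taking Q1 and Q3 as the
    medians of the lower and upper halves of the sorted data."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]              # values before the middle position
    upper = data[len(data) - half:]  # values after the middle position
    return (data[0], median(lower), median(data), median(upper), data[-1])

summary = five_number_summary([1, 2, 3, 5, 6, 7, 8, 9])
print(summary)  # (1, 2.5, 5.5, 7.5, 9)
```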

Let’s try this out for a dataset with an odd number of values

Consider the dataset: 1, 2, 3, 4, 5, 6, 7, 8, 9 (We let 4 back in!)

Are the Minimum and Maximum any different than they were without 4?
No!
How can we calculate the Median (Q2) of this dataset?
The median is always the "middle number". This dataset has 9 numbers, so we can grab the one in the middle: 5
Which numbers are in the "lower half" of this dataset, and what is the median of that half (Q1)?
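All the numbers less than 5: 1, 2, 3, 4
The median of that half is the mean of its two middle numbers (2 and 3): 2.5
Which numbers are in the "upper half" of this dataset, and what is the median of that half (Q3)?
All the numbers more than 5: 6, 7, 8, 9
The median of that half is the mean of its two middle numbers (7 and 8): 7.5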
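
As a rough illustration of how the odd-sized case works, the following Python sketch (again an assumption, not the lesson's Pyret code) splits the sorted data into halves and leaves the middle value out of both. The helper name `split_halves` is hypothetical.

```python
from statistics import median

def split_halves(values):
    """Split sorted values into lower and upper halves, leaving out
    the middle value when the count is odd."""
    data = sorted(values)
    half = len(data) // 2
    return data[:half], data[len(data) - half:]

lower, upper = split_halves([1, 2, 3, 4, 5, 6, 7, 8, 9])
q1, q3 = median(lower), median(upper)
print(lower, upper)  # [1, 2, 3, 4] [6, 7, 8, 9]
print(q1, q3)        # 2.5 7.5
```

The median (5) appears in neither half, so each half has four values and its own median is the mean of two middle numbers.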

Our quartiles allow us to calculate the Interquartile Range (IQR) - the distance spanned by the middle half of the data. The IQR is a more robust measure of variation than the range because it is less susceptible to outliers. Seeing the relative size of the middle quartiles can be more useful than looking at data "on the edge". Mathematically, IQR = Q3 - Q1.
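
To see the difference between the range and the IQR in action, here is a small Python sketch (an illustration only; the helper names and the modified dataset are hypothetical). It computes both measures for the even-sized dataset used earlier, then again after replacing the maximum with an extreme value.

```python
from statistics import median

def iqr(values):
    """IQR = Q3 - Q1, with quartiles taken as medians of the halves."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])
    q3 = median(data[len(data) - half:])
    return q3 - q1

def value_range(values):
    """Range = maximum - minimum."""
    return max(values) - min(values)

original = [1, 2, 3, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 5, 6, 7, 8, 90]  # hypothetical: 9 replaced by 90

print(value_range(original), iqr(original))          # 8 5.0
print(value_range(with_outlier), iqr(with_outlier))  # 89 5.0
```

The single extreme value stretches the range from 8 to 89, while the IQR stays at 5.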

Investigate

We are going to be looking at the data from 2 family gatherings.
- The average age at the Watson Family gathering was 70.4 years old.
- The average age at the Ledet Family gathering was 44.3 years old.

What images do these statistics conjure in your mind? What do you imagine to be true about the ages of the people in attendance at each of the gatherings?
Answers will vary.
Some students will likely imagine that all of the people at both of the gatherings are adults.
Some students will likely expect that all of the people at the Watson Family Gathering were much older.

We are going to find the 5-number summary, range and IQR for 2 datasets. Future reflection will rely upon students having worked through both datasets. If your students tend to need more support, you may want to work with the first dataset as a class and then have students work with the second dataset independently.

Let’s see what we can learn about how typical those averages were by looking at the datasets in the first section of Distribution of a Dataset.
Order the ages and compute the five-number summaries for both the Ledet Family Reunion and the Watson Family Gathering.

The partitioning of the data into four parts can be a challenge! Research by Bakker et al. (2005) suggests that students do not tend to conceive of a distribution in four parts, but in three. (Their brains naturally view: the majority in the middle; lower values on the left; and higher values on the right.) Source: Bakker, A., Biehler, R., & Konold, C. (2005). Should young students learn about box plots? In G. Burrill & M. Camden (Eds.), Curricular Development in Statistics Education: International Association for Statistical Education (IASE) Roundtable, Lund, Sweden, 28 June-3 July 2004.

Annotating the list of ordered values can help students visualize the four groups. Emphasize that the median does not get included in the lower or upper half of the data.

Ledet:

a dot plot

Watson:

a dot plot

What do you Notice and Wonder about these datasets and the summary values you’ve just computed?
Students may notice that the maximum values are pretty close to each other, but the minimum values are very different from each other!
Students may notice that Q3 for both datasets is 72.
Students may notice that the median value for the Watson family data is a number that isn’t in the dataset, whereas the median value for the Ledet family data is a number that’s in the dataset.
Students may have questions about how to calculate the median and/or quartiles.

Now that we know how to compute a five-number summary, let’s practice!

Practice computing five-number summaries from small datasets (either 7 or 8 values) visualized as dot plots on Matching Dot Plots and Five-Number Summaries.
Be prepared to describe your strategy for matching dot plots with five-number summaries.

What strategies did you use to match dot plots to five-number summaries?
Responses will vary. Students will likely identify the median first to narrow in on a smaller pool of possible five-number summaries, and then compute the quartiles.
Dot plots 7 and 8 included 8 points, rather than 7. Did you need to change your strategy to complete these problems? If so, how?
The median was no longer the 4th datapoint in sequence. Instead, the median was the average of the 4th and 5th datapoints.
Which five-number summary on Matching Dot Plots and Five-Number Summaries has the greatest IQR?
Option C, which corresponds with dot plot 1.
Which five-number summary on Matching Dot Plots and Five-Number Summaries has the smallest IQR?
Option E, which corresponds with dot plot 6.
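
The change in strategy between the 7-point and 8-point dot plots discussed above can be summarized with a short Python sketch (an illustration only; `median_positions` is a hypothetical helper). It reports which positions in the sorted list are used to find the median.

```python
def median_positions(n):
    """1-based positions of the value(s) averaged to get the median
    of n sorted values."""
    if n % 2 == 1:
        return [(n + 1) // 2]     # odd count: one middle value
    return [n // 2, n // 2 + 1]   # even count: two middle values

print(median_positions(7))  # [4]
print(median_positions(8))  # [4, 5]
```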

Synthesize

What is a quartile?
One of the three boundary points that split our dataset into four equal quarters.
A quartile is sometimes / always / never one of the values in the dataset.
Sometimes.
Why is the IQR a more robust measure of variability than the range?
Because it focuses on the middle half of the data, so it is less susceptible to outliers.
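
For the "sometimes" answer above, a brief Python sketch (an illustration only, with a hypothetical `quartiles` helper and a made-up seven-value dataset) shows one dataset whose quartiles are all data values and one whose quartiles are not.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3), with Q1 and Q3 as medians of the halves."""
    data = sorted(values)
    half = len(data) // 2
    return median(data[:half]), median(data), median(data[len(data) - half:])

seven_values = [1, 2, 3, 4, 5, 6, 7]     # hypothetical example dataset
eight_values = [1, 2, 3, 5, 6, 7, 8, 9]  # the even-sized dataset from earlier

for data in (seven_values, eight_values):
    for label, q in zip(("Q1", "Q2", "Q3"), quartiles(data)):
        status = "in the dataset" if q in data else "not in the dataset"
        print(len(data), "values:", label, "=", q, "is", status)
```

With seven values, every quartile (2, 4, 6) is a data value. With eight values, none of them (2.5, 5.5, 7.5) is.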