
Students compute five-number summaries from quantitative datasets, and then use those five-number summaries to create box plots.

Lesson Goals

Students will be able to…​

  • Compute the values in the five-number summary for a dataset (minimum, Q1, median, Q3, and maximum).

  • Interpret the values in the five-number summary for a dataset (minimum, Q1, median, Q3, and maximum).

  • Create a box plot from a five-number summary.

  • Create a box plot in Pyret.

  • Interpret a box plot to answer questions about a dataset and its spread.

Student-facing Lesson Goals

  • Let’s compute and plot five-number summaries.

Materials

Supplemental Materials

Preparation

  • Decide whether or not you want to launch this lesson using the Live Pyret Survey and your class' own data. If so, follow the instructions on How to Use a Live Pyret Survey to:

    • Make a copy of the Number of Homes we’ve lived in (Google Form) and share it with your class.

      • This makes a great "Do Now" as students enter the room, but could also be assigned for homework the night before.

    • Link the Google Sheet of their responses to the Number of Homes we’ve lived in (Starter File).

    • Click "Run" and then either:

      • Project the new Data Visualization they’ll be introduced to today.

      • Publish the starter file and share a link with your students.

  • If you are using our Google Slides, you will see the word "Optional" in the title of any slide that corresponds to an optional section of the lesson plan. Adjust the slides based on which portions of the lesson you will be doing with your students.

🔗The Five-Number Summary

Overview

Students learn how to use range, quartiles, and interquartile range to talk about variability.

Launch

If you decided to launch today’s class using our Live Pyret Survey, you’ll probably want your students to complete the Google Form as they enter class. We won’t actually look at the results until the next section, though, so feel free to make another choice about how and when to have your students enter their data.

Make sure you’ve already…​

  1. Followed the Instructions to Set up and Link the Files

  2. Shared the link you made to your class' copy of the Number of Homes we’ve lived in (Google Form)

  • Open the Google Form Survey link I shared and submit your response.

Each of the three dot plots shown below has a median of 60.

a dot plot showing 2 dots on 30, 5 dots on 40, 2 dots on 50, 4 dots on 60, 1 dot on 70, 3 dots on 80, 2 dots on 100, and 1 dot on 120

a dot plot with many tall columns of dots between 0 and 100 (some with over 20 dots) and then smaller (1 to 4 dot stacks) spread between 100 and 250

a dot plot with 2 dots on 40, 2 dots on 50, 1 dot on 60, 1 dot on 70, 2 dots on 80 and 1 dot on 90

  • We know that these dot plots represent datasets that have the same median, but the data definitely isn’t the same! In what ways are they different?

  • Each dot plot has a different spread and a different shape. There are different peaks, gaps, and outliers.

How do statisticians talk about what makes each sample unique?

One way to characterize the distribution of data is a five-number summary. To compute a five-number summary, we arrange the data values from least to greatest. Then, we decompose a dataset into four equal parts, separated by quartiles. This process can offer us a more nuanced idea of how the data is spread out.

Here are the specific ingredients for a 5-Number Summary:

  • Minimum: the smallest value in a dataset - it starts the first quarter

  • Q1 (lower quartile): the number that separates the first quarter of the data from the second quarter of the data

  • Q2 (Median): the middle value in a dataset

  • Q3 (upper quartile): the value that separates the third quarter of the data from the last

  • Maximum: the largest value in a dataset - it ends the fourth quarter of the data

Let’s try this out for a dataset with an even number of values

Consider the dataset: 1, 2, 3, 5, 6, 7, 8, 9 (4 is missing!)

  • What is the Minimum of this dataset?

  • 1

  • What is the Maximum of this dataset?

  • 9

  • How can we calculate the Median (Q2) of this dataset?

  • The median is always the "middle number". This dataset has 8 numbers, so there is no exact "middle". When this happens, we take the mean of the two middle numbers (5 and 6), which is 5.5

  • Which numbers are in the "lower half" of this dataset, and what is the median of that half (Q1)?

  • All the numbers less than 5.5: 1, 2, 3, 5

  • The median is the middle number, but once again we need to take the mean of two middles (2 and 3): 2.5

  • Which numbers are in the "upper half" of this dataset, and what is the median of that half (Q3)?

  • All the numbers more than 5.5: 6, 7, 8, 9

  • The median is the middle number, but once again we need to take the mean of two middles (7 and 8): 7.5

Let’s try this out for a dataset with an odd number of values

Consider the dataset: 1, 2, 3, 4, 5, 6, 7, 8, 9 (We let 4 back in!)

  • Are the Minimum and Maximum any different than they were without 4?

  • No!

  • How can we calculate the Median (Q2) of this dataset?

  • The median is always the "middle number". This dataset has 9 numbers, so we can grab the one in the middle: 5

  • Which numbers are in the "lower half" of this dataset, and what is the median of that half (Q1)?

  • All the numbers less than 5: 1, 2, 3, 4

  • This half has no single middle number, so we take the mean of the two middles (2 and 3): 2.5

  • Which numbers are in the "upper half" of this dataset, and what is the median of that half (Q3)?

  • All the numbers more than 5: 6, 7, 8, 9

  • This half has no single middle number, so we take the mean of the two middles (7 and 8): 7.5
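The halving procedure used for both datasets can be sketched in Pyret. This is a minimal illustration, not part of the Bootstrap library: the helper names median-of and five-num are hypothetical, and the sketch assumes the convention used in this lesson, where the median of an odd-sized dataset is left out of both halves.

```pyret
# median-of :: (List<Number>) -> Number
# Consumes a list that is already sorted from least to greatest
fun median-of(sorted :: List<Number>) -> Number:
  n = sorted.length()
  mid = num-floor(n / 2)
  if num-modulo(n, 2) == 1:
    # odd count: there is a single middle value
    sorted.get(mid)
  else:
    # even count: take the mean of the two middle values
    (sorted.get(mid - 1) + sorted.get(mid)) / 2
  end
end

# five-num :: (List<Number>) -> List<Number>
# Produces [list: minimum, Q1, median, Q3, maximum] for a non-empty list
fun five-num(data :: List<Number>) -> List<Number>:
  sorted = data.sort()
  n = sorted.length()
  half = num-floor(n / 2)
  lower = sorted.take(half)      # values below the median
  upper = sorted.drop(n - half)  # values above the median
  [list: sorted.first, median-of(lower), median-of(sorted), median-of(upper), sorted.last()]
end

check:
  # even number of values (4 is missing)
  five-num([list: 1, 2, 3, 5, 6, 7, 8, 9]) is [list: 1, 2.5, 5.5, 7.5, 9]
  # odd number of values (4 is back)
  five-num([list: 1, 2, 3, 4, 5, 6, 7, 8, 9]) is [list: 1, 2.5, 5, 7.5, 9]
end
```

Adding 4 back changes the median from 5.5 to 5, but leaves Q1 and Q3 where they were, because the lower and upper halves still have the same two middle values.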

Our quartiles allow us to calculate the Interquartile Range (IQR): the distance spanned by the middle half of the data. The IQR is a more robust measure of variation than the range because it is less susceptible to outliers. Seeing the relative size of the middle quartiles can be more useful than looking at data "on the edge". Mathematically, IQR = Q3 - Q1.
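To see why the IQR is less susceptible to outliers, it can help to compare two small cases side by side. The sketch below uses the hand-computed summary for 1, 2, 3, 5, 6, 7, 8, 9, plus a made-up variant in which 9 is replaced by 90. The variable names are illustrative only.

```pyret
# Summary values for 1, 2, 3, 5, 6, 7, 8, 9, as computed by hand
min-a = 1
q1-a = 2.5
q3-a = 7.5
max-a = 9

# Same dataset, but with 9 replaced by 90 (a made-up outlier).
# The upper half is now 6, 7, 8, 90, whose median is still 7.5.
min-b = 1
q1-b = 2.5
q3-b = 7.5
max-b = 90

check:
  (max-a - min-a) is 8    # range without the outlier
  (max-b - min-b) is 89   # range with the outlier
  (q3-a - q1-a) is 5      # IQR without the outlier
  (q3-b - q1-b) is 5      # IQR with the outlier: unchanged
end
```

A single extreme value stretches the range from 8 to 89, while the IQR stays at 5.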

Investigate

  • We are going to be looking at the data from 2 family gatherings.

    • The average age at the Watson Family gathering was 70.4 years old.

    • The average age at the Ledet Family gathering was 44.3 years old.

  • What images do these statistics conjure in your mind? What do you imagine to be true about the ages of the people in attendance at each of the gatherings?

  • Answers will vary.

  • Some students will likely imagine that all of the people at both of the gatherings are adults.

  • Some students will likely expect that all of the people at the Watson Family Gathering were much older.

We are going to find the 5-number summary, range and IQR for 2 datasets. Future reflection will rely upon students having worked through both datasets. If your students tend to need more support, you may want to work with the first dataset as a class and then have students work with the second dataset independently.

  • Let’s see what we can learn about how typical those averages were by looking at the datasets in the first section of Distribution of a Dataset.

  • Order the ages and compute the five-number summaries for both the Ledet Family Reunion and the Watson Family Gathering.

The partitioning of the data into four parts can be a challenge! Research by Bakker et al. (2005) suggests that students do not tend to conceive of a distribution in four parts, but three. (Their brains naturally view: the majority in the middle; lower values on the left; and higher values on the right.)

Reference: Bakker, A., Biehler, R., & Konold, C. (2005). Should young students learn about box plots? In G. Burrill & M. Camden (Eds.), Curricular Development in Statistics Education: International Association for Statistical Education (IASE) Roundtable, Lund, Sweden, 28 June-3 July 2004.

Annotating the list of ordered values can help students visualize the four groups. Emphasize that the median does not get included in the lower or upper half of the data.

Ledet:

a dot plot

Watson:

a dot plot

  • What do you Notice and Wonder about these datasets and the summary values you’ve just computed?

  • Students may notice that the maximum values are pretty close to each other, but the minimum values are very different from each other!

  • Students may notice that Q3 for both datasets is 72.

  • Students may notice that the median value for the Watson family data is a number that isn’t in the dataset, whereas the median value for the Ledet family data is a number that’s in the dataset.

  • Students may have questions about how to calculate the median and/or quartiles.

Now that we know how to compute a five-number summary, let’s practice!

  • Practice computing five-number summaries from small datasets (either 7 or 8 values) visualized as dot plots on Matching Dot Plots and Five-Number Summaries.

  • Be prepared to describe your strategy for matching dot plots with five-number summaries.

  • What strategies did you use to match dot plots to five-number summaries?

  • Responses will vary. Students will likely identify the median first to narrow in on a smaller pool of possible five-number summaries, and then compute the quartiles.

  • Dot plots 7 and 8 included 8 points, rather than 7. Did you need to change your strategy to complete these problems? If so, how?

  • The median was no longer the 4th datapoint in sequence. Instead, the median was the average of the 4th and 5th datapoints.

  • Which five-number summary on Matching Dot Plots and Five-Number Summaries has the greatest IQR?

  • Option C, which corresponds with dot plot 1.

  • Which five-number summary on Matching Dot Plots and Five-Number Summaries has the smallest IQR?

  • Option E, which corresponds with dot plot 6.

Synthesize

  • What is a quartile?

  • One of the three boundary points that splits our dataset into four equal quarters.

  • A quartile is sometimes / always / never one of the values in the dataset.

  • Sometimes.

  • Why is the IQR a more robust measure of variability than the range?

  • Because it focuses on the middle half of the data, so is less susceptible to outliers.

🔗Plotting our Five-Number Summary

Overview

Students plot five-number summaries as box plots before learning to make box plots in Pyret.

Launch

To visualize the 5-number summary, the Range, and the Interquartile Range we can plot the five numbers on a number line and connect them to make a box plot.

A sample box-and-whisker plot based on contrived data

If you decided to launch today’s class using our Live Pyret Survey, now is the time to display the results!

When you click "Run", the Number of Homes we’ve lived in (Starter File) builds a box plot.

Assuming you’ve already…​

  1. Followed the Instructions to Set up and Link the Files

  2. Shared the link you made to your class' copy of the Number of Homes we’ve lived in (Google Form)

The data visualizations will be generated using data from your students!
And they will continue to update in real time as more of your students complete the Google Form.

Project your screen and/or publish the starter file and share a link with your students.

Facilitate a discussion about this new-to-them Pyret Data Visualization!

  • Take a look at the results of our survey displayed in the new Data Visualization on the Board.

  • What do you Notice?

  • What do you Wonder?

To draw a box plot from a 5-number summary:

  • First, make a vertical line on the number line for each of the 5 values of the five-number summary.

  • Next, make a box connecting Q1 to Q3. This box contains the middle half of the data (IQR).

    • Make sure the line you drew for the median is tall enough to split the box into 2 parts (not necessarily equal!)

  • Finally, make a horizontal line (called a "whisker") connecting each end of the box to the minimum / maximum value. This helps us to visualize the full range of the data.

No matter what shape the box plot has, all four sections contain exactly the same number of points.

  • How do we know that the first quarter is the densest?

  • It is the narrowest, spanning just 2 units. And since all of the quarters contain the same number of data points, that tells us that these points are the most tightly packed.

  • We can see that the points on the dot plot are clustered more closely together in this section than they are in the others.

  • Which quarter of the data is the most dispersed? How do you know?

  • The last quarter; it spans 11 units, and includes the same number of data points as each of the other quarters.

  • We can see that there is lots of space between the points on the dot plot in this section.

  • What strategies did you use to match the dot plots to the box plots?

  • Answers will vary. Sample responses may include:

    • I looked for the maximum and minimum values.

    • I looked at the shape of the data, starting with whether or not it was symmetrical.

    • I looked for tall clusters of points on the dot plot and matching narrow quarters on the box plot.

Investigate

  • Let’s practice making box plots with the data from the family gatherings.

  • Complete the second and third sections of Distribution of a Dataset.

The box plots should look like this:
Ledet: a box plot of the Ledet family data distributed across the full length of the number line
Watson: a box plot of the Watson family data clustered tightly at the right end of the number line

  • The average age at the Watson Family gathering was 70.4 years old.

  • The average age at the Ledet Family gathering was 44.3 years old.

  • For which family was the average age more typical?

  • For the Watson family gathering because the data is more closely clustered, the Range and IQR are significantly smaller, and the mean and median are much more similar.

  • How did making the box plots help you to understand the data?

  • What else do you Notice and Wonder?

Synthesize

  • Box plots have four sections. What must be true about all of those sections?

  • They each contain exactly one quarter of the data, no matter how different the sections look on the number line.

  • Why isn’t the median always in the middle of the box?

  • Because the median has to split the data itself in half and the quarter of the data to the left of the median isn’t necessarily clustered as tightly as the quarter of the data to the right of the median.

  • What part of the box plot represents the Range?

  • The full width from the end of the left whisker to the end of the right whisker

🔗Making Box Plots in Pyret

Overview

Students create box plots and five-number summaries from the animals dataset in Pyret.

Launch

Let’s see what we can learn about the spread of the data in the pounds column by making a box-plot!

Below is the Contract for box-plot.
# box-plot :: (Table table-name, String column) -> Image

Students will type box-plot(animals-table, "pounds") into the Interactions Area. They will use the resulting box plot to fill in the five-number summary for the pounds column, and then sketch the box plot.

box plot of pounds with a 5-number summary of min: 0.1, Q1: 3.9, Q2: 11.3, Q3: 60.4, Max: 172

Investigate

  • What conclusions can you draw about the distribution of values in this column?

  • While the animals' weights range from 0.1 pounds to 172 pounds, 50% of the animals weigh 11.3 pounds or less. The animal that weighs 172 pounds may be an outlier.

box plot of pounds with a 5-number summary of min: 0.1, Q1: 3.9, Q2: 11.3, Q3: 60.4, Max: 172

  • Now that we’ve explored the spread of the dataset, do you think the mean is the best measure of center for the animals' weights?

  • No. Most of the animals weigh far less than the average weight (of nearly 40 pounds)!

  • If Q1 is the value for which 25% of the animals weighed that amount or less, what does Q3 represent?

  • The third quartile is the value for which 75% of the animals weighed that amount or less. Another way of saying that would be that it is the value for which 25% of the animals weigh that amount or more.

  • Why do you think this visualization is sometimes called a "box and whisker plot"?

  • The lines connecting the minimum to Q1 and Q3 to the maximum look like whiskers!

  • Could we make a box plot for every column in the dataset?

  • No. We can only make box plots for quantitative columns.

If students are struggling to write conclusions, go over the following five-number summary from the box plot they made.

  • Minimum (the left “whisker”) - the smallest value in the dataset. In our dataset, that’s just 0.1 pounds.

  • Q1 (the left edge of the box) - computed by taking the median of the lower half of the values. In the pounds column, that’s 3.9 pounds.

  • Q2 / Median (the line inside the box) - the middle value of the whole dataset. We already computed this to be 11.3 pounds.

  • Q3 (the right edge of the box) - computed by taking the median of the upper half of the values. That’s 60.4 pounds in our dataset.

  • Maximum (the right “whisker”) - the largest value in the dataset. In our dataset, that’s 172 pounds.

Choose another quantitative column to summarize and complete the second half of Summarizing Columns with Measures of Spread.

Other Box Plots

If you’re trying to compare two box plots, you might like them both to appear on number lines using the same scale. Pyret has a function for that:

# box-plot-scaled :: (Table table-name, String column, Number low-end, Number high-end) -> Image
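Following that contract, a call might look like the sketch below. It is not self-contained: it assumes the animals starter file has already been Run, so that animals-table and box-plot-scaled are defined. The limits 0 and 180 are illustrative choices that cover the pounds values (0.1 to 172), not values prescribed by the lesson.

```pyret
# Assumes the animals starter file has been Run, so animals-table
# and box-plot-scaled are already defined.
# 0 and 180 are illustrative low-end and high-end values for the number line.
box-plot-scaled(animals-table, "pounds", 0, 180)
```

Passing the same low-end and high-end values to every call is what puts the resulting box plots on a common scale.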

More Statistics-based or Math-oriented classes will also be familiar with modified box plots (video explanation), which remove outliers from the box-and-whisker and draw them as asterisks outside of the plot. In Pyret, we can make them using the following contracts:

# modified-box-plot :: (Table table-name, String column) -> Image
# modified-box-plot-scaled :: (Table table-name, String column, Number low-end, Number high-end) -> Image
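A common convention for deciding which points count as outliers is to flag any value more than 1.5 × IQR below Q1 or above Q3. Whether Pyret's modified box plots use exactly this rule is an assumption here, but the sketch below applies it to the pounds summary from this lesson to show how the 172-pound animal would be treated. The names upper-fence and lower-fence are illustrative.

```pyret
# Summary values for the pounds column, taken from the box plot above
q1 = 3.9
q3 = 60.4
heaviest = 172

iqr = q3 - q1

# Assumed convention: fences sit 1.5 * IQR beyond the quartiles
upper-fence = q3 + (1.5 * iqr)
lower-fence = q1 - (1.5 * iqr)

check:
  iqr is 56.5
  upper-fence is 145.15
  lower-fence is -80.85
  # 172 lies beyond the upper fence, so this rule flags it as an outlier
  (heaviest > upper-fence) is true
end
```

Under this rule the lower fence is negative, so no animal can be a low outlier, while the 172-pound animal falls outside the upper fence.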

Finally, if you’d prefer to use vertical box plots, Pyret has the following contracts:

# vert-box-plot :: (Table table-name, String column) -> Image
# modified-vert-box-plot :: (Table table-name, String column) -> Image
# modified-vert-box-plot-scaled :: (Table table-name, String column, Number low-end, Number high-end) -> Image

Common Misconceptions

It is extremely common for students to forget that the quartiles divide the data into quarters, each of which includes 25% of the dataset. This will need to be heavily reinforced.

Synthesize

  • Is it safe to assume that the average is typical?

  • No. It is sometimes typical. But sometimes there’s a lot of variation or skew in the data.

  • What percentage of points fall in the first quarter?

  • 25%

  • What percentage of points fall in the second quarter?

  • 25%

  • What percentage of points fall in the third quarter?

  • 25%

  • What percentage of points fall in the fourth quarter?

  • 25%

  • What percentage of points fall in the Interquartile Range (IQR)?

  • 50%

  • What percentage of points fall within the Range?

  • 100%

🔗Additional Resources

These materials were developed partly through support of the National Science Foundation (awards 1042210, 1535276, 1648684, 1738598, 2031479, and 1501927). Bootstrap by the Bootstrap Community is licensed under a Creative Commons 4.0 Unported License. This license does not grant permission to run training or professional development. Offering training or professional development with materials substantially derived from Bootstrap must be approved in writing by a Bootstrap Director. Permissions beyond the scope of this license, such as to run training, may be available by contacting contact@BootstrapWorld.org.