Standard Deviation

Standards in this Lesson

Common Core Math Standards

6.SP.A: Develop understanding of statistical variability.
6.SP.A.2: Understand that a set of data collected to answer a statistical question has a distribution which can be described by its center, spread, and overall shape.
6.SP.B.4: Display numerical data in plots on a number line, including dot plots, histograms, and box plots.
6.SP.B.5: Summarize numerical data sets in relation to their context.
6.SP.B.5.C: Summarize numerical data sets in relation to their context by giving quantitative measures of center (median and/or mean) and variability (interquartile range and/or mean absolute deviation), as well as describing any overall pattern and any striking deviations from the overall pattern with reference to the context in which the data were gathered.
HSS.ID.A.1: Represent data with plots on the real number line (dot plots, histograms, and box plots).
HSS.ID.A.2: Use statistics appropriate to the shape of the data distribution to compare center (median, mean) and spread (interquartile range, standard deviation) of two or more different data sets.
HSS.ID.A.3: Interpret differences in shape, center, and spread in the context of the data sets, accounting for possible effects of extreme data points (outliers).

CSTA Standards

1B-DA-06: Organize and present collected data visually to highlight relationships and support a claim.
3A-DA-11: Create interactive data visualizations using software tools to help others better understand real-world phenomena.

K-12CS Standards

6-8.Data and Analysis.Inference and Models: People transform, generalize, simplify, and present large data sets in different ways to influence how other people interpret and understand the underlying information. Examples include visualization, aggregation, rearrangement, and application of mathematical operations.
9-12.Data and Analysis.Visualization and Transformation: Data can be transformed to remove errors, highlight or expose relationships, and/or make it easier for computers to process.

Oklahoma Standards

OK.6.D.1.3: Create and analyze box and whisker plots observing how each segment contains one quarter of the data.
OK.6.DA.CVT.01: Collect data using computational tools and transform the data to make it more useful.
OK.7.D.1.2: Use reasoning with proportions to display and interpret data in circle graphs (pie charts) and histograms. Choose the appropriate data display and know how to create the display using a spreadsheet or other graphing technology.
OK.7.DA.CVT.01: Collect data using computational tools and transform the data to make it more useful and reliable.
OK.8.DA.S.01: Analyze multiple methods of representing data and choose the most appropriate method for representing data.
OK.L2.DA.CVT.02: Generate data sets that use a variety of data collection tools and analysis techniques to support a claim and/or communicate information.

Practices in this Lesson

K12CS

P5: Creating Computational Artifacts

Math

MP.4: Model with mathematics
MP.3: Construct viable arguments and critique the reasoning of others

Students learn how standard deviation serves as Data Scientists' most common measure of "spread": how far all the values in a dataset tend to be from their mean. When we looked at box plots, we visualized spread based on range and interquartile range. Now we’ll return to histograms and picture the spread in terms of standard deviation.

Lesson Goals

Students will be able to…

apply one approach to measuring and displaying spread of a dataset
compare and contrast information displayed in a box plot and a histogram

Student-facing Lesson Goals

Let’s compare different uses for box plots and histograms when talking about data.

Materials

Preparation

All students should log into code.pyret.org (CPO) and open their saved "Animals Starter File". If they don’t have the file, they can open a new one from Animals Starter File.

Glossary

histogram: a display of quantitative data that uses vertical bars positioned over bins (sub-intervals); each bar’s height reflects the count or percentage of data values in that bin.
mean: a representation of the center, or 'typical' value in a set of numbers, calculated as the sum of those numbers divided by the number of values.
outlier: observations whose values are very different from the other observations in the same dataset, perhaps due to experimental error. Outliers can also be indicative of data belonging to a different population from the rest of the established samples.
skew: lack of balance in a dataset’s shape, arising from more values that are unusually low or high. Such values tend to trail off, rather than be separated by a gap (as with outliers).
spread: the extent to which values in a dataset vary, either from one another or from the center
standard deviation: a number that measures spread of a dataset using the typical distance of values from their mean

Measuring "Deviance" 30 minutes

Overview

Students review the notion of spread itself, and build up to the formula by annotating histograms.

Launch

The Animal Shelter Bureau reports that the mean age of shelter cats is 3 years.

Look at the Animals Dataset in your student workbook, or the Animals Dataset spreadsheet (Google).
Does a mean age of 3 years translate to all of the cats being close to 3 years old? Why or why not?
- No, we cannot assume all cats are close to 3 years old. There are some outliers in the dataset.

In the activity that follows, students will look at ten cats from the shelter to consider the distribution of their ages.

Turn to Computing Standard Deviation, and complete numbers 1-3.
What did you get for the mean? Does it match what the Animal Shelter Bureau says?
- The mean is 3; yes, it matches what the Animal Shelter Bureau says.
Can you think of four ages, such that the mean age for all of them is 3?
- Some possibilities include: {3,3,3,3}, {1,2,4,5}, {1,1,4,6}… any four ages that add up to 12 will work, because 12 ÷ 4 = 3!
Can you think of a different spread of four ages that would have the same mean?
- See above.
How many different sets of four ages can you think of, which all have a mean of 3?

Without a measure of spread, just knowing the mean doesn’t tell us enough about the shape of the data. When summarizing a column, we’d like to use a measure that gathers data from every value. We already have one method of measuring spread: calculating the Five Number Summary and using it to generate a box plot. Unfortunately, that measure comes from only a small number of data points! If possible, we’d like to have a measure that summarizes the spread across all the points.

Standard deviation is the most common way to summarize the spread of a quantitative column.

Instead of focusing on the handful of datapoints used in our Five Number Summary, another way to measure spread is to focus on the "typical" distance from the mean. In other words, we want to know what kind of deviation is "standard" for all the points.

Investigate

A histogram showing the distribution of ages for 10 cats, between the ages of 1 and 8. A star labeling the mean is drawn on the x-axis at 3.

We could imagine a shelter where every cat is between 2 and 4, so each cat only deviates from the mean by 1 year! But we could also imagine a shelter with only kittens and very old cats, where cats deviate by as much as 10 years from the mean!

How far away is each data point from 3?

A histogram showing the distribution of ages for 10 cats, between the ages of 1 and 8. A star labeling the mean is drawn on the x-axis at 3, and arrows show the distance between the mean and each point in the first interval.

In this image, we’ve drawn an arrow for each of the 1-year-old cats. That means there are four arrows running from the mean at 3 to the interval at 1, and each arrow has the label 2.

Next, complete numbers 4 to 6 of Computing Standard Deviation.

Mean Absolute Deviation?

In this section of the worksheet, students will need to stretch their visual imaginations a bit! In problem number 6, they are asked to summarize all 10 distances from the mean into a single number. The goal here is for students to make an educated guess about standard deviation (SD) before learning the algorithm for computing it. Invite and encourage discussion about students' different approaches for guessing at the best summary number before sharing the key idea about standard deviation!

Students are likely to home in on the Mean Absolute Deviation, or MAD. Both SD and MAD measure variability or "spread" by computing individual deviations from the mean, but MAD averages the absolute values of these deviations, while SD squares them, averages the squares, and takes the square root.

To compute the standard deviation we square each distance and take the average, then take the square root of the average.
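Written as formulas, the two measures can be compared side by side. The notation here is an assumption made for illustration, not the worksheet's: the dataset has n values x_1 … x_n with mean x̄. The SD shown divides by n, matching the "take the average" step described above; some tools divide by n − 1 instead.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad
\text{MAD} = \frac{1}{n}\sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert
\qquad
\text{SD} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}

% Worked example with the four ages {1, 2, 4, 5}, whose mean is 3:
% distances from the mean are 2, 1, 1, 2
\text{MAD} = \frac{2 + 1 + 1 + 2}{4} = 1.5
\qquad
\text{SD} = \sqrt{\frac{4 + 1 + 1 + 4}{4}} = \sqrt{2.5} \approx 1.58
```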

The process of finding standard deviation manually is a bit laborious. Keeping organized is crucial; a partially-completed table is provided on the bottom half of the worksheet to support students in doing so.
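For teachers who want to check a hand computation, the same bookkeeping can be sketched in Pyret, with one named value per step. This is a minimal sketch under stated assumptions, not the worksheet's method or the code behind any library function: the name sd-by-hand and the sample list are made up for illustration, and the function divides by the number of values, as in the algorithm described above.

```pyret
import math as M

# Hypothetical helper, not part of the Bootstrap libraries.
# Computes standard deviation step by step: distance, square, average, square root.
fun sd-by-hand(nums :: List<Number>) -> Number:
  n = nums.length()
  center = M.sum(nums) / n                            # the mean
  distances = nums.map(lam(v): v - center end)        # distance of each value from the mean
  squares = distances.map(lam(d): num-sqr(d) end)     # square each distance
  num-sqrt(M.sum(squares) / n)                        # average the squares, then take the square root
where:
  # {1, 2, 4, 5} has mean 3 and squared distances 4, 1, 1, 4
  sd-by-hand([list: 1, 2, 4, 5]) is%(within-abs(0.01)) 1.58
end
```

Because the square root usually produces an approximate number, the example compares within a small tolerance instead of testing for exact equality.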

Complete numbers 7-10 of Computing Standard Deviation, where you will use the algorithm for computing standard deviation.

Now that you know how to compute standard deviation on your own, here is the contract for stdev, along with an example that will calculate the standard deviation for the pounds column in the animals-table:

# stdev :: (t :: Table, col :: String) -> Number
stdev(animals-table, "pounds")

What is the standard deviation for the weights of all the animals in our dataset?
- Approximately 48.5

Optional: For additional practice, have students complete Computing Standard Deviation (2).

Synthesize

Can you explain why two datasets can have the same mean, but different standard deviations?
- The mean only measures the center of the data, whereas standard deviation measures how far the values spread out from that center. Values can be arranged tightly or loosely around the same center.
What kind of dataset would have a standard deviation of zero?
- A standard deviation of zero means that every number in the sample is exactly the same.
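To put numbers on both answers, here is a small sketch using the sets of four ages suggested earlier in the lesson. It assumes Pyret's statistics library, imported under the arbitrary name S. Unlike the stdev used with tables in this lesson, S.mean and S.stdev take a plain list, and S.stdev is assumed to divide by the number of values, like the algorithm taught here. The list names are made up for illustration.

```pyret
import statistics as S

all-threes = [list: 3, 3, 3, 3]
clustered  = [list: 1, 2, 4, 5]
spread-out = [list: 1, 1, 4, 6]

# Compare within a small tolerance, since square roots are approximate.
close = within-abs(0.01)

check "same mean, different standard deviations":
  S.mean(all-threes) is%(close) 3
  S.mean(clustered) is%(close) 3
  S.mean(spread-out) is%(close) 3

  # identical values: every distance from the mean is 0
  S.stdev(all-threes) is%(close) 0
  S.stdev(clustered) is%(close) 1.58
  S.stdev(spread-out) is%(close) 2.12
end
```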

Comparing Standard Deviations 20 minutes

Overview

Students compare the centers and (more importantly) the spreads of two quantitative datasets by comparing their histograms. Both mean and standard deviation can be affected by outliers and/or skewness.

Launch

Invite students to take a look at the histogram below. It is the same histogram we saw in the previous section, but now with an 11th cat that is 16 years old. That’s quite an outlier!

A histogram showing the distribution of ages for 10 cats, between the ages of 1 and 8. An 11th cat has been added at 16 years.

What is the shape of this histogram?
- The histogram has a high outlier, so it is skewed right.
How does it differ from the one we just looked at?
- The previous histogram - with the 16-year-old cat omitted - was roughly symmetric.
Turn to The Effect of an Outlier to explore the extent to which the inclusion of an outlier will affect the center and spread of a quantitative dataset.
What did this outlier do to the mean? Refer back to Computing Standard Deviation to help you.
- Previously, the mean was 3; now it is approximately 4.33.
What did this outlier do to the standard deviation?
- The outlier caused the standard deviation to increase by about 1.33.
Optional: To see how changes in data values affect the mean and standard deviation, complete Matching Mean & Standard Deviation to Data.

Investigate

The mean and standard deviation tell us where the data is centered and how far the data strays from that center. For example, when writing about the ages of cats in our shelter, we might say "the mean age is 3 and the standard deviation is 2.4, so most cats are between the ages of 1 and 5 years old."
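The range quoted in that sentence appears to come from stepping one standard deviation to either side of the mean and rounding to whole years. A quick check of the arithmetic, under that assumption:

```latex
\text{mean} - \text{SD} = 3 - 2.4 = 0.6 \approx 1
\qquad
\text{mean} + \text{SD} = 3 + 2.4 = 5.4 \approx 5
```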

The mean time-to-adoption is 5.75 weeks. Does that mean most animals generally get adopted in 4-6 weeks? Solicit students' ideas, but do not reveal the answer.
Turn to Data Cycle: Standard Deviation in the Animals Dataset to get some practice using the Data Cycle to answer this question, then write your findings in the space at the bottom.

Mean Absolute v. Standard Deviation

MAD and SD are both measures of a certain kind of distance, literally asking "how far from the mean are all the points in the dataset?" With each point being independent of the others, we can imagine a dataset with two points as a right triangle with two legs: how far apart are these points?

Before learning the distance formula, students might guess at a number of ways to compute the hypotenuse. They can quickly rule out the sum of the legs, and the difference between them. At some point they might suggest averaging the lengths of the legs. Mean Absolute Deviation (MAD) does exactly that, by flattening each point’s deviation into a single "dimension".

Of course, these legs exist on separate axes - so we need a formula for distances in more than one dimension. Computing the SD involves the square root of a sum of squares. That should sound suspiciously like the distance formula! Indeed, computing the SD for a dataset with two points is basically finding the (normalized) length of the hypotenuse!

The Pythagorean distance works in three dimensions as well (right pyramids!) - or for any number of dimensions - as does the formula for standard deviation. By treating each point as a separate dimension, SD allows each deviation to be considered independently.
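Spelled out for a dataset with two points, the comparison looks like the following. The symbols d_1 and d_2 are introduced here for illustration and stand for the two deviations from the mean, which play the role of the legs of the triangle.

```latex
\text{hypotenuse} = \sqrt{d_1^2 + d_2^2}
\qquad
\text{SD} = \sqrt{\frac{d_1^2 + d_2^2}{2}} = \frac{\text{hypotenuse}}{\sqrt{2}}
\qquad
\text{MAD} = \frac{\lvert d_1 \rvert + \lvert d_2 \rvert}{2}

% The same pattern with n points: an n-dimensional distance, normalized by \sqrt{n}
\text{SD} = \sqrt{\frac{d_1^2 + d_2^2 + \cdots + d_n^2}{n}}
          = \frac{\sqrt{d_1^2 + d_2^2 + \cdots + d_n^2}}{\sqrt{n}}
```

Dividing by the square root of the number of points is the "normalized" part: it keeps the SD from growing just because a dataset has more values.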

Why use one measure of spread instead of the other? The answer is closely related to the difference between two measures of center! Mean incorporates data from every point, while median does not. However, mean is sensitive to the effect of extreme outliers or skew. In those cases, median is considered to be the better measure of center.

Treating each point independently allows each deviation to contribute to the measure of spread, just as every value contributes to the mean as a measure of center. This is why SD is used most often, but like the mean it is sensitive to extreme outliers or skew. In those cases, the MAD is considered a better measure of spread.
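A rough illustration of that sensitivity, offered as a sketch only. The helper name mad and the five ages are made up for this example (four ages from earlier in the lesson plus one 16-year-old), and S.stdev is assumed to be the stdev from Pyret's statistics library, which works on a list and divides by the number of values.

```pyret
import math as M
import statistics as S

# Hypothetical helper: the average of the absolute distances from the mean.
fun mad(nums :: List<Number>) -> Number:
  center = S.mean(nums)
  M.sum(nums.map(lam(v): num-abs(v - center) end)) / nums.length()
end

without-outlier = [list: 1, 2, 4, 5]
with-outlier    = [list: 1, 2, 4, 5, 16]

check "one extreme value inflates SD more than MAD":
  mad(without-outlier) is%(within-abs(0.01)) 1.5
  mad(with-outlier) is%(within-abs(0.01)) 4.16
  S.stdev(without-outlier) is%(within-abs(0.01)) 1.58
  S.stdev(with-outlier) is%(within-abs(0.01)) 5.39
end
```

In this made-up dataset, adding the 16-year-old multiplies the MAD by about 2.8 and the SD by about 3.4, because squaring gives the largest distance extra weight.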

Synthesize

How much did adding an outlier change the mean? The standard deviation?
Extreme values affect both the mean and standard deviation of a dataset.
Unusually low values decrease the mean, while unusually high values increase it. Unusually low or high values increase the standard deviation, because it summarizes distance from the mean in either direction.

Your Own Analysis flexible

Overview

Students apply what they’ve learned to their own dataset.

Launch

What is the standard deviation for quantitative columns in your dataset?

Investigate

Use what you’ve learned to find the standard deviation for the quantitative columns in your dataset. Complete Data Cycle: Standard Deviation in My Dataset, and add your findings to the "Measures of Center and Spread" section.
Do these measures bring up any interesting questions? If so, add them to the end of the document.

Synthesize

Share your findings!
Are some columns more spread out - with a larger standard deviation - than others?
What does that mean about your data?

These materials were developed partly through support of the National Science Foundation, (awards 1042210, 1535276, 1648684, and 1738598). Bootstrap by the Bootstrap Community is licensed under a Creative Commons 4.0 Unported License. This license does not grant permission to run training or professional development. Offering training or professional development with materials substantially derived from Bootstrap must be approved in writing by a Bootstrap Director. Permissions beyond the scope of this license, such as to run training, may be available by contacting contact@BootstrapWorld.org.