Simpson’s Paradox

Students discover why they investigated pandemic data using only one state at a time, learning about Simpson’s Paradox in the process!

Lesson Goals

Students will be able to…

Explain how trends in sub-groups within a dataset can be hidden or even reversed when those groups are aggregated.

Student-facing Lesson Goals

Let’s explore why it is sometimes misleading to consider subgroups together.

Materials

Preparation

Much of the exploration in this lesson hinges on the same custom-built interactive Desmos activity we introduced in Exploring the Spread of Disease.
- Set the pacing so that students are synced to you and only able to interact with the slide for the lesson section you are working on.
- Decide how you will share the link or code with students and, if you are using our Google Slides, add the appropriate link to the slide deck.
- If you don’t already have a code, or you want to share a new one, you will first want to:
  - Open Modeling Covid Spread (Desmos).
  - Make a link or code to share with your students.
If you’re a first-time Desmos user, fear not! Here’s what you need to do.

Why Just One State at a Time?

Overview

Students discuss an example of Simpson’s Paradox, which motivates splitting a dataset into grouped samples using filters. They then discover another motivation for filtering: a scatter plot of a dataset like our Covid dataset can show several distinct trends instead of just one. Finally, they learn how to filter a dataset and apply that knowledge to filtering the Covid dataset into samples grouped by state.

Launch

The first table and accompanying questions below can be printed here, if that would work better for your students.

A college is looking at enrollment and housing data for students who’ve decided what their major will be, vs. those who are undecided:

	# On Campus	# Off Campus	% On Campus
Undecided	120	80	120/200 = 60%
Decided	80	100	80/180 = 44%

According to the table, how many Undecided Majors live off-campus?
80
How many Decided Majors live on-campus?
80
Who is more likely to live on campus: Decided or Undecided Majors?
(Give students time to talk about this, and explain their thinking!)

It looks like the two variables are significantly related: undecided majors are more likely to live on campus than decided ones!

But there’s a third variable hiding in the background: college freshmen are far less likely to have picked a major than seniors, and they are much more likely to live on campus.

When we filter by this important third variable, it turns out that there is no correlation between deciding on a major and living on or off campus, for either Freshmen or Non-Freshmen.

Freshmen	# On Campus	# Off Campus	% On Campus
Undecided	100	20	100/120 = 83%
Decided	50	10	50/60 = 83%

Non-Freshmen	# On Campus	# Off Campus	% On Campus
Undecided	20	60	20/80 = 25%
Decided	30	90	30/120 = 25%

What looks like a correlation between having-a-major and living-on-campus is actually a correlation between class year (a rough proxy for age) and living-on-campus.
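The arithmetic behind the three tables can be checked with a short script. This is a minimal Python sketch, not part of the lesson's Pyret materials: the counts are copied from the tables above, while the names `pct_on_campus` and `combine` are made up for illustration.

```python
# Counts copied from the lesson's tables: (on campus, off campus).
freshmen = {"Undecided": (100, 20), "Decided": (50, 10)}
non_freshmen = {"Undecided": (20, 60), "Decided": (30, 90)}


def pct_on_campus(on, off):
    """Percentage of students living on campus, rounded to a whole number."""
    return round(100 * on / (on + off))


def combine(group_a, group_b):
    """Add the counts of two sub-groups, category by category."""
    return {
        major: (group_a[major][0] + group_b[major][0],
                group_a[major][1] + group_b[major][1])
        for major in group_a
    }


# Hypothetical helper names; the combined counts reproduce the first table.
everyone = combine(freshmen, non_freshmen)

for label, table in [("Freshmen", freshmen),
                     ("Non-Freshmen", non_freshmen),
                     ("Everyone", everyone)]:
    for major, (on, off) in table.items():
        print(f"{label:13} {major:10} {pct_on_campus(on, off)}% on campus")
```

Within each sub-group the two percentages match (83% and 83%, 25% and 25%), but the combined rows differ (60% vs. 44%) because the Undecided group is mostly freshmen and the Decided group is mostly non-freshmen.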

A scatter plot with multiple point clouds, each one showing a positive linear correlation. However, the clouds are staggered so that each one is lower on the y-axis than the last. Taken together, the entire set appears to have a negative correlation.

A third variable lurking in the data can play tricks by obscuring relationships between two other variables, or by creating the appearance of a relationship where none exists!

We often think that the more data we include in our sample, the more clearly we’ll see any potential relationships. But in certain circumstances, the correlations in our sub-groups cancel each other out, or even reverse, when we put the groups together. This is called Simpson’s Paradox.

Simpson’s Paradox: visible trends in sub-groups disappear or even reverse when the groups are combined.

Investigate

Sometimes filtering the data into subsets is the only way to see what’s really going on. In our Covid Spread Starter File, the subgroups had such strong relationships that the scatter plot for all our New England states doesn’t look much like a scatter plot at all — it looks like someone took a marker and drew in five different curvy lines!

A scatter plot showing the exponential growth of covid infections in New England

How is a grouped sample different from a random sample?
A grouped sample is a non-random subset chosen from a larger set. Grouped samples are non-random by design!
What variable(s) might be lurking in the background of the Covid Spread Data, that could be responsible for the distinct curves for each state?
Give students time to discuss!
Diseases spread more rapidly in densely-populated areas, since it’s easier for the infection to jump from one person to another. Unfortunately, we can’t see the density data in our table, so that dimension is missing from our dataset! This is exactly what happened in our college example: we couldn’t see the age of the students, which skewed our interpretation of the table.

Make sure you’ve advanced your teacher dashboard of Modeling Covid Spread (Desmos) to the fifth slide ("Exponential Model for VT") so that students will be looking at the correct screen when they are directed to return to Desmos partway through Models for Vermont.

Now that we’ve explored the Massachusetts data, we are ready to explore some of the other subsets.

Working in pairs or small groups, complete the first section of Models for Vermont using the Covid Spread Starter File.

What are is-MA and MA-table doing?
is-MA is a helper function that is used to check every Row of the Table, producing true if it’s from Massachusetts or false if it’s not from Massachusetts.
MA-table uses the filter function to make a new table, using all the Rows from the original table for which the helper function produced true.
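For teachers who think in other languages, the same two-step idea (a helper that answers true or false for one Row, then a filter that keeps the Rows answering true) can be sketched in Python. This is an analogy only, not the Pyret code from the Covid Spread Starter File: `is_ma` and `ma_table` merely mirror the names is-MA and MA-table, and the column names and values below are hypothetical.

```python
# Hypothetical rows standing in for the Covid table; the values are made up.
covid_table = [
    {"state": "MA", "day": 1, "positive": 10},
    {"state": "VT", "day": 1, "positive": 2},
    {"state": "MA", "day": 2, "positive": 14},
    {"state": "VT", "day": 2, "positive": 3},
]


def is_ma(row):
    """Helper: True if the row is from Massachusetts, False otherwise."""
    return row["state"] == "MA"


# Build a new list from the rows for which the helper produced True.
ma_table = [row for row in covid_table if is_ma(row)]

print(len(covid_table), "rows in the original table")
print(len(ma_table), "rows in the filtered table")
```

Swapping in a different helper (for example, one that checks for "VT") would produce a different grouped sample from the same original rows.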

While filtering is introduced in this lesson, the primary goal is for students to explore exponential functions. If your students want to know more about filtering — or wish to filter other datasets — we recommend checking out the Filtering and Building lesson.

Complete Models for Vermont.
You will need both the Covid Spread Starter File and Slide 5 of the Modeling Covid Spread Desmos file.

Common Misconceptions

It’s extremely common for students to believe that filtering a table changes the original table, but this is NOT how it works in Pyret! Instead, the filter function always produces a new table, containing only the Rows for which the supplied function evaluates to true.

Synthesize

How would you explain Simpson’s Paradox to someone who missed class today?
In what other situations would it be useful to filter a dataset?
Can you think of other examples where Simpson’s Paradox might arise?
When comparing schools in Country A to schools in Country B, a researcher finds that students living in poverty in A outperform impoverished students in Country B. They also find that the wealthy students in A outperform their wealthy peers in B. In fact, for every income level, Country A outperforms Country B! But if Country B has far less child poverty overall, its combined average can still be higher than A’s.
Another, thoroughly-explained example involving soft drinks can be found on this web page.
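The two-country example can be made concrete with a small Python sketch. Every number below is invented purely for illustration and does not come from any real study. The function name `overall_average` is also hypothetical. The point is only that an overall average weights each income group by its size.

```python
# Hypothetical (number of students, average score) per income group.
# These figures are made up to illustrate the paradox.
country_a = {"low income": (800, 60), "high income": (200, 80)}
country_b = {"low income": (200, 55), "high income": (800, 75)}


def overall_average(groups):
    """Average score across all students, weighting each group by its size."""
    total_students = sum(n for n, _ in groups.values())
    total_points = sum(n * score for n, score in groups.values())
    return total_points / total_students


# Group by group, Country A has the higher average score.
for group in country_a:
    print(group, "A:", country_a[group][1], "B:", country_b[group][1])

# Overall, Country B comes out ahead because most of its students
# are in the higher-scoring income group.
print("Overall A:", overall_average(country_a))
print("Overall B:", overall_average(country_b))
```

With these made-up numbers, A leads in both income groups (60 vs. 55 and 80 vs. 75), yet the overall averages are 64 for A and 71 for B.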

These materials were developed partly through support of the National Science Foundation (awards 1042210, 1535276, 1648684, 1738598, 2031479, and 1501927). Bootstrap by the Bootstrap Community is licensed under a Creative Commons 4.0 Unported License. This license does not grant permission to run training or professional development. Offering training or professional development with materials substantially derived from Bootstrap must be approved in writing by a Bootstrap Director. Permissions beyond the scope of this license, such as to run training, may be available by contacting contact@BootstrapWorld.org.