Correlations


Lesson Pathway, Standards and Practices

Standards

Common Core Math Standards

8.SP.A.1: Construct and interpret scatter plots for bivariate measurement data to investigate patterns of association between two quantities. Describe patterns such as clustering, outliers, positive or negative association, linear association, and nonlinear association.
8.SP.A.2: Know that straight lines are widely used to model relationships between two quantitative variables. For scatter plots that suggest a linear association, informally fit a straight line, and informally assess the model fit by judging the closeness of the data points to the line.
HSS.ID.B.6: Represent data on two quantitative variables on a scatter plot, and describe how the variables are related.
HSS.ID.C.8: Compute (using technology) and interpret the correlation coefficient of a linear fit.
HSS.ID.C.9: Distinguish between correlation and causation.

CSTA Standards

1B-DA-06: Organize and present collected data visually to highlight relationships and support a claim.
2-DA-09: Refine computational models based on the data they have generated.
3B-NI-05: Use data analysis tools and techniques to identify patterns in data representing complex systems.
3B-NI-07: Evaluate the ability of models and simulations to test and support the refinement of hypotheses.

K-12CS Standards

6-8.Data and Analysis.Visualization and Transformation: Computer models can be used to simulate events, examine theories and inferences, or make predictions with either few or millions of data points. Computer models are abstractions that represent phenomena and use data and algorithms to emphasize key features and relationships within a system. As more data is automatically collected, models can be refined.
P5: Creating Computational Artifacts

Oklahoma Standards

OK.L1.DA.IM.01: Show the relationships between collected data elements using computational models.
OK.PA.D.1.3: Collect, display and interpret data using scatterplots. Use the shape of the scatterplot to informally estimate a line of best fit, make statements about average rate of change, and make predictions about values not in the original data set. Use appropriate titles, labels and units.

Textbook Alignment

IM.Alg1.3.8
IM.Alg1.3.7
IM.Alg1.3.5
IM.8.6.5
IM.8.6.4
Thinking with Mathematical Models: Linear and Inverse Variations

Students deepen their understanding of scatter plots, learning to describe and interpret direction and strength of linear relationships.

Lesson Goals

Students will be able to…

Confirm if a scatter plot appears linear
Understand how correlation assesses direction in a linear relationship
Understand how correlation measures strength in a linear relationship

Student-facing Lesson Goals

Let’s explore scatter plots and what they can tell us about data relationships.

Materials

Preparation

Make sure all materials have been gathered.
Decide how students will be grouped in pairs.
Computer for each student (or pair), with access to the internet
Student workbook, and something to write with
All students should log into CPO and open the "Animals Starter File" they saved from the prior lesson. If they don’t have the file, they can open a new one.

Supplemental Resources

Spurious Correlations

Language Table

| Types | Functions | Values |
| --- | --- | --- |
| Number | +, -, *, /, mean, median, modes, num-sqrt | 4, -1.2, 2/3, pi |
| String | string-length, string-repeat, string-contains | "hello", "91" |
| Boolean | <, <>, <=, >=, >, == | true, false |
| Image | star, triangle, circle, square, rhombus, ellipse, regular-polygon, radial-star, bar-chart, pie-chart, box-plot, scatter-plot, bar-chart-summarized, pie-chart-summarized, modified-box-plot | 🔵🔺🔶 |
| Table | .row-n, .order-by, .filter, .build-column, random-rows | |

Glossary

form: for a relationship between two quantitative variables, whether the two variables vary together linearly or in some other way
r: a number between −1 and 1 that measures the direction and strength of a linear relationship between two quantitative variables (also known as the correlation value)

Correlations have Form (10 minutes)

Overview

Students identify and make use of patterns in scatter plots, learning to characterize them as being linear, curved, or showing no clear pattern. This builds intuition for determining if the form is linear, in which case we can proceed to correlation and linear regression.

Launch

By now we have learned ways to summarize a single quantitative variable, like the age of an animal in our dataset: report the center, spread, and shape of the distribution. Together, those summaries tell us what age is typical, how much the ages vary, and what kind of age values are usual or unusual. We could do the same for animals' weights (or any other quantitative column).

But those individual summaries tell us nothing about the relationship between animals' ages and weights. In order to understand such relationships, we have to expand our view from a single dimension (along one axis) to two dimensions. This goes hand in hand with expanding our display from a one-dimensional histogram to a two-dimensional scatter plot.

Rather than summarizing each distribution in one dimension, we can summarize a linear relationship between two quantitative variables. But this only makes sense if the scatter plot follows a straight-line pattern, as opposed to being curved. So the very first assessment we have to make is to identify the form of the relationship as being linear or not.

Form: whether a relationship is linear or not

Investigate

The relationship between two quantitative variables can take many forms. Some patterns are linear, and appear as a straight line sloping up or down. Some patterns are non-linear, and may look like a curve or an arc. And sometimes there is no pattern or relationship at all!

Have students turn to Identifying Form, Direction and Strength (Page 81) in their student workbooks. For each scatter plot, identify whether the relationship is linear, non-linear or if there’s no relationship at all.

Synthesize

Data Scientists use their eyes all the time! It doesn’t make sense to search for correlations when there’s no pattern at all, and summarizing a relationship with a correlation only makes sense when that relationship is linear.

Going Deeper

In an AP Statistics class or full-year Data Science class, it’s appropriate to discuss non-linear relationships here. In a dedicated computer science class, it may also be appropriate to talk about transforming the x- or y-axis (using .build-column!) via a quadratic, exponential, or logarithmic function and then looking for a linear pattern in the resulting scatter plot. All of these are extensions to the materials presented here.
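The axis-transformation idea can be illustrated outside the lesson's Pyret environment. The following is a minimal Python sketch, not part of the lesson materials: the data are made up to follow an exponential pattern, and the variable names are hypothetical. It checks linearity in the simplest possible way, by looking at how much y changes for each equal step in x. A linear pattern has constant steps; a curved one does not.

```python
import math

# Made-up data: y doubles each time x goes up by 1 (an exponential form).
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [3 * 2 ** x for x in xs]

# Change in y for each one-unit step in x, before transforming.
raw_steps = [later - earlier for earlier, later in zip(ys, ys[1:])]

# Transform the y-axis with a logarithm, then measure the steps again.
log_ys = [math.log(y) for y in ys]
log_steps = [later - earlier for earlier, later in zip(log_ys, log_ys[1:])]

print(raw_steps)                          # steps keep growing: the pattern is curved
print([round(s, 3) for s in log_steps])   # steps are all about the same: linear
```

After the transformation, a scatter plot of x against log(y) would show points falling along a straight line, so the usual questions about direction and strength apply to the transformed data.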

Correlations have Direction & Strength (20 minutes)

Overview

Once students have learned to identify a possible linear relationship, they can turn their attention to other qualities of that relationship: its direction and strength. Each of these is expressed in the $r$-value, which students learn to read.

Launch

Assuming a relationship is linear, data scientists calculate a single number called "correlation" (or $r$-value) that reports both the direction and strength.

Direction: whether a linear relationship is positive or negative.

A linear relationship between two quantitative variables is positive if, in general, the scatter plot points are sloping up: smaller x values tend to go with smaller y values, and larger x values tend to go with larger y values. The relationship is negative if points slope down: smaller x values tend to go with larger y values, and larger x values tend to go with smaller y values.

Positive relationships are by far the most common, because many variables naturally tend to increase in tandem. For example, “the older the animal, the more it tends to weigh”. This is usually true for human animals, too!
Negative relationships can also occur. For example, “the older a child gets, the fewer new words he or she learns each day.”

Strength: how closely the two variables are correlated.

How well does knowing the x-value allow us to predict what the y-value will be?

A relationship is strong if knowing the x-value of a data point gives us a very good idea of what its y-value will be (knowing a student’s age gives us a very good idea of what grade they’re in). A strong linear relationship means that the points in the scatter plot are all clustered tightly around an invisible line.
A relationship is weak if x tells us little about y (a student’s age doesn’t tell us much about their number of siblings). A weak linear relationship means that the cloud of points is scattered very loosely around the line.

Investigate

Have students turn to Identifying Form, Direction and Strength (Page 81) in their student workbooks. For each scatter plot, identify whether the relationship is positive or negative, and whether it is strong or weak.

The correlation is a number (falling anywhere from −1 to +1) that tells us the direction and strength of a linear relationship between two variables. $r$ is positive or negative depending on whether the relationship is positive or negative. The strength of a correlation is its distance from zero: an $r$-value of zero means there is no linear correlation at all, and stronger correlations will be closer to −1 or +1.
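For teachers who want to see where such a number comes from, here is a minimal Python sketch. It assumes that the correlation in this lesson is the standard Pearson correlation coefficient, which the lesson text does not state explicitly. The helper name `pearson_r` and all of the data values are made up for illustration; students are not expected to compute this by hand.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length lists of numbers.

    Hypothetical helper for illustration; assumes neither list is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how much each varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up values, chosen only to show the sign of r.
ages = [1, 2, 3, 4, 5]
weights = [4, 7, 9, 12, 13]    # tends to rise as age rises
new_words = [10, 8, 7, 4, 3]   # tends to fall as age rises

r_up = pearson_r(ages, weights)
r_down = pearson_r(ages, new_words)

print(round(r_up, 2))    # positive and close to +1
print(round(r_down, 2))  # negative and close to -1
```

The sign of the result carries the direction, and its distance from zero carries the strength, which is why the two ideas need to be read separately.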

An $r$-value with a magnitude of about 0.65 or 0.70 or more is typically considered a strong correlation, and a magnitude between about 0.35 and 0.65 is “moderately correlated”. A magnitude of less than about 0.25 or 0.35 may be considered weak. However, these cutoffs are not an exact science! In some contexts an $r$-value of ±0.50 might be considered impressively strong!
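The rule of thumb above can be written as a small Python sketch. The function name `describe_correlation` and the default cutoffs of 0.65 and 0.35 are assumptions chosen from the rough ranges in the text, not fixed standards; the cutoffs are parameters precisely because different contexts call for different thresholds.

```python
def describe_correlation(r, strong_cutoff=0.65, weak_cutoff=0.35):
    """Return (direction, strength) for an r-value between -1 and 1.

    Hypothetical helper: the cutoffs are rough conventions, not exact rules.
    """
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1")

    # Direction depends only on the sign of r.
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"

    # Strength depends only on the distance from zero.
    magnitude = abs(r)
    if magnitude >= strong_cutoff:
        strength = "strong"
    elif magnitude >= weak_cutoff:
        strength = "moderate"
    else:
        strength = "weak"

    return direction, strength

print(describe_correlation(0.44))   # ('positive', 'moderate')
print(describe_correlation(-0.80))  # ('negative', 'strong')
print(describe_correlation(0.10))   # ('positive', 'weak')
```

Note that the second example is both negative and strong: direction and strength are computed independently of each other.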

Calculating $r$ from a data set only tells us the direction and strength of the relationship in that particular sample. If the correlation between adoption time and age for a representative sample of about 30 shelter animals turns out to be +0.44, the correlation for the larger population of animals will probably be close to that, but almost certainly not exactly the same.

Have students turn to Identifying Form and r-Values (Page 82) in their student workbooks. For each scatter plot, identify whether the relationship is linear, and use $r$ to summarize direction and strength. You could also have them complete a card sort activity on identifying strength (Desmos) and a card sort activity on identifying direction (Desmos).

In the Interactions Area, create a scatter plot for the Animals Dataset, using "pounds" as the xs and "weeks" as the ys.
Form: Does the point cloud appear linear or non-linear?
Direction: If it’s linear, does it appear to go up or down as you move from left to right?
Strength: Is the point cloud tightly packed, or loosely dispersed?
Would you predict that the $r$-value is positive or negative? Will it be closer to zero, closer to ±1, or in between?
Have Pyret compute the $r$-value, by typing r-value(animals-table, "pounds", "weeks"). Does this match your prediction?
Repeat this process using "age" as the xs. Is this correlation stronger or weaker than the correlation for "pounds"? What does that mean?

(Note: An excellent resource to build intuition for r-values is Guess the Correlation!)

Common Misconceptions

Students often conflate strength and direction, thinking that a strong correlation must be positive and a weak one must be negative.
Students may also falsely believe that there is ALWAYS a correlation between any two variables in their dataset.
Students often believe that strength and sample size are interchangeable, leading to mistaken assumptions like "any correlation found in a million data points must be strong!"

Synthesize

It is useful to ask students probing questions, to help address the misconceptions listed above. Some examples:

What is the difference between a weak relationship and a negative relationship?
What is the difference between a strong relationship and a positive relationship?
If we find a strong relationship in a sample, can we always infer that relationship holds for the whole population?
Suppose we have two correlations, one drawn from 10 data points and one drawn from 50. If both correlations are identical in direction and strength, should we trust them equally when making an inference about the larger population?

Correlation does NOT imply causation.

It’s easy to be seduced by large $r$-values, and believe that we’re really onto something that will help us claim that one variable really impacts another! But Data Scientists know better than that…

Here are some possible correlations that have absolutely no causal relationship; they come about either by chance or because both of them are related to another variable that’s (often) lurking in the background.

For a certain psychology test, the amount of time a student studied was negatively correlated with their score! (Struggling students needed to study more; they would have done even worse if they’d studied less!)
Weekly data gathered in a city throughout the year showed a positive correlation between ice cream consumption and drowning deaths. (Warmer weather affects both; they have no effect on one another.)
A negative correlation was found between how much time students talked on the phone and how much they weighed. (Gender is a confounder: women tend to weigh less and talk more than men.)

Here are a few real correlations, drawn from the Spurious Correlations website. If time allows, have your students explore the site to see more!

“Number of people who drowned after falling out of a fishing boat” v. “Marriage rate in Kentucky” ($r$ = 0.98)
“Average per-person consumption of chicken” v. “U.S. crude oil imports” ($r$ = 0.95)
“Marriage rate in Wyoming” v. “Domestic production of cars” ($r$ = 0.99)
“Number of people who get tangled in their own bedsheets” v. “Amount of cheese consumed that year” ($r$ = 0.95)

Your Analysis (flexible)

Overview

Students repeat the previous activity, this time applying it to their own dataset and interpreting their own results. Note: this activity can be done as a homework assignment, but we recommend giving students an additional class period to work on this.

Launch

What correlations do you think there are in your dataset? Would you like to investigate a subset of your data to find those correlations?

Investigate

Brainstorm a few possible correlations that you might expect to find in your dataset, and make some scatter plots to investigate.
Turn to Correlations in My Dataset (Page 83), and list three correlations you’d like to search for.
Investigate these correlations. If you need blank Design Recipes, you can find them at the back of your workbook, just before the Contracts.

Synthesize

What correlations did you find? Did you need to filter out certain rows in order to get those correlations?

After looking at the scatter plot for our animal shelter, do you still agree with the claim on (Dis)Proving a Claim (Page 79)? (Perhaps they need more information, or to see the analysis broken down separately by animal!)

Additional Exercises:

Identifying Form, Direction and Strength (Matching)

These materials were developed partly through support of the National Science Foundation (awards 1042210, 1535276, 1648684, and 1738598). Bootstrap:Data Science by the Bootstrap Community is licensed under a Creative Commons 4.0 Unported License. This license does not grant permission to run training or professional development. Offering training or professional development with materials substantially derived from Bootstrap must be approved in writing by a Bootstrap Director. Permissions beyond the scope of this license, such as to run training, may be available by contacting contact@BootstrapWorld.org.