# 9.6 – Two-Variable Data

## Key Terms

• Correlation – A measure of the strength of the relationship between two variables.
• Correlation coefficient – A number that expresses the strength of the correlation between two variables.
• It also shows whether the correlation is positive or negative.
• The correlation coefficient is called r.
• Dependent Variable – The variable plotted on the y-axis. It is also called the response variable.
• The dependent variable responds to changes in the explanatory variable.
• Explanatory Variable – The variable plotted on the x-axis, also called the independent variable.
• In an experiment, the explanatory variable is the variable that is being studied.
• Independent Variable – The variable plotted on the x-axis, also called the explanatory variable.
• In an experiment, the independent variable is the variable that is being studied.
• Line of Best Fit – A line drawn as near as possible to all the points in a scatterplot.
• The line of best fit helps you see the relationship shown in the scatterplot.
• It is also called a least squares regression line (LSRL).
• Residual – The difference between an observed value and the value predicted by the least squares regression line.
• Response Variable – The variable plotted on the y-axis; also called the dependent variable.
• The response variable responds to changes in the explanatory variable.
• Two-Variable Data – Data in which two values are measured for each individual, so the data can be graphed as points on a Cartesian plane.

## Review

One-Variable Data
• Visual displays for one-variable data sets
• dot plot
• stem-and-leaf plot
• box-and-whisker plot
• histogram
• frequency table

## Notes

Correlations
• A correlation is the measure of the strength of the relationship between two variables.
• It can be described with a number — the correlation coefficient (r)
• Examples
• Height and weight of 15 men
• Population and gross domestic product (GDP) of 15 European Union (EU) countries
• Runs scored and number of wins for 15 national league (NL) baseball teams
Correlation Coefficient (r)
• The correlation coefficient is called r. It is sometimes called Pearson’s r because it was developed by the statistician Karl Pearson.
• Defined
• A number that describes the relationship between two variables.
• Measures the strength of the relationship.
• Tells whether the relationship is positive or negative.
• Properties
• It is always between -1 and 1.
• When r is near 0, it indicates very little correlation.
• When r is near 1, it indicates a strong positive correlation.
• When r is near -1, it indicates a strong negative correlation.
• r is strongly affected by outliers.
• r applies only to linear correlations.
• Measuring Strength
• Perfect – The data points fall exactly on a line.
• Strong – The data points form a tight cluster but do not quite fall into a line.
• Weak – The overall trend of the data is in one direction, but the points do not form a tight cluster.

• Estimating “r”
• r will be positive if there is a positive linear relationship (the values go up from left to right).
• r will be negative if there is a negative linear relationship (the values go down from left to right).
• r will be close to -1 or +1 when the points are all close to being on one line.
• r will be close to 0 when the points are not close to being on one line (there is no linear pattern).
• r will be a perfect +1.0 or -1.0 when one line contains all the points.
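The properties above can be checked with a short Python sketch. This is a minimal from-scratch calculation of Pearson’s r (the function name `pearson_r` is just an illustrative choice, not from the notes):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired (x, y) data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Points that fall exactly on a rising line give r = +1.0;
# points on a falling line give r = -1.0
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0
```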

Correlation Does Not Imply Causation
• Example: suppose a scatterplot shows that there is a strong positive correlation between the number of televisions owned and the number of well-fed people in a country.
• Does owning a TV cause a person to be well fed?
• What’s really happening is that in a country where everyone has a TV, they can also afford food.
• TVs alone don’t cause people to be well fed.
• Predictions based on correlations are not necessarily true.
• They are only likely to occur, based on observed trends in past and present data.
• This is the way most weather predictions are made.

Scatterplots
• The best way to display two-variable data.
• It plots the two variables as (x, y) pairs on the Cartesian plane.
• The suspected cause of that relationship is called the explanatory variable. It is the x-axis.
• The suspected effect is called the response variable. It is the y-axis.
• Look for patterns in a scatterplot by studying three features
• Shape
• Direction
• Strength

• Measuring Direction
• Positive correlation: Data appear to go up from left to right across the scatterplot.
• Negative correlation: Data appear to go down from left to right across the scatterplot.
• No correlation: Data are spread out across the scatterplot with no visible pattern.

• Examples
• The population/GDP scatterplot below shows a strong pattern — almost a perfect line! This means it is likely that GDP really does depend on population.
• The runs/wins scatterplot shows a weak pattern. This means it is unlikely that wins depend on runs.

Least Squares Regression Line (LSRL)
• A line drawn as near as possible to the points in a scatterplot
• Helps you see the linear relationship between the two variables on the scatterplot
• Also called the line of best fit
• Equation for Least Squares Regression Line
• $\widehat{y}=a+bx$
• $\widehat{y}$: response variable
• a: y-intercept of line
• b: slope of line
• x: explanatory variable
Slope of a Regression Line
• To find b, you use the following formula, where r is the correlation coefficient, $s_y$ is the standard deviation of the y-values, and $s_x$ is the standard deviation of the x-values: Formula: $b=r\;\bullet \; \frac{s_y}{s_x}$
• You need to find the standard deviation to find the slope
• To find the standard deviation
1. Find the mean
2. Find each value’s deviation from the mean
3. Square each deviation
4. Add the squared deviations
5. Divide the result by n – 1
6. Take the square root
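The six steps above can be sketched in Python, one line per step (the helper name `sample_std_dev` is an illustrative choice):

```python
import math

def sample_std_dev(values):
    """Sample standard deviation, following the steps above."""
    n = len(values)
    mean = sum(values) / n                     # 1. find the mean
    deviations = [v - mean for v in values]    # 2. each value's deviation
    squared = [d ** 2 for d in deviations]     # 3. square each deviation
    total = sum(squared)                       # 4. add the squared deviations
    variance = total / (n - 1)                 # 5. divide the result by n - 1
    return math.sqrt(variance)                 # 6. take the square root

# x-values from the worked example below
print(round(sample_std_dev([10, 20, 15, 5, 10, 25]), 2))  # 7.36
```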

• Example: r = -0.92 and the points plotted are: (10, 120), (20, 40), (15, 80), (5, 160), (10, 80), (25, 35)
• Mean of x and y
• Mean of x: $\overline{x}=\frac{10\; +\; 20\; +\; 15\; +\; 5\; +\; 10\; +\; 25}{6}=14.17$
• Mean of y: $\overline{y}=\frac{120\; +\; 40\; +\; 80\; +\; 160\; +\; 80\; +\; 35}{6}=85.83$
• Standard deviation of x and y
• Formula: $s_y=\sqrt{\frac{(y_1-\overline{y})^{2}+(y_2-\overline{y})^{2}+(y_3-\overline{y})^{2}+\cdots+(y_n-\overline{y})^{2}}{n-1}}$
• Standard Deviation of x: $s_x=\sqrt{\frac{(10-14.17)^{2}+(20-14.17)^{2}+(15-14.17)^{2}+(5-14.17)^{2}+(10-14.17)^{2}+(25-14.17)^{2}}{6-1}}$
• Simplify: $s_x=\sqrt{54.2}$
• Answer: $s_x=7.36$
• Standard Deviation of y: $s_y=\sqrt{\frac{(120-85.83)^{2}+(40-85.83)^{2}+(80-85.83)^{2}+(160-85.83)^{2}+(80-85.83)^{2}+(35-85.83)^{2}}{6-1}}$
• Simplify: $s_y=\sqrt{2284.2}$
• Answer: $s_y=47.79$
• Regression Line formula for Slope
• $b=r\;\bullet \; \frac{s_y}{s_x}=-0.92\;\bullet \; \frac{47.79}{7.36}=-6.0$
• So far, the equation will be: $\widehat{y}=a-6x$
• To find the y-intercept, use this formula: $a=\overline{y}-b \overline{x}$
• We know $\overline{x}=14.17$, $\overline{y}=85.83$, and $b=-6$
• So, $a=85.83-(-6\bullet 14.17)$
• Answer: $a=170.85$
• Answer: the regression line is: $\widehat{y}=170.85-6x$

Residuals
• Measure the vertical distance of a point from the line of best fit.
• If the residual is a negative value, the point lies below the line.
• If the residual is a positive value, the point lies above the line.
• Residual = difference between the actual y-value and the predicted y-value
• residual = actual y – predicted y
• residual = $y-\widehat{y}$
• Finding Residuals
1. Substitute the point’s x-value into the equation for the line to find the predicted value of y.
2. Subtract the predicted value of y from the point’s actual value of y.

• For our example, let’s use point (10, 120) to find the predicted value.
• $\widehat{y}=170.85-6(10)$
• $\widehat{y}=110.85$
• Residual: the actual value minus the predicted value: $120-110.85=9.15$
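The same residual calculation as a small Python sketch, using the regression line from the worked example (the name `predict` is an illustrative choice):

```python
def predict(x):
    """Regression line from the worked example: y-hat = 170.85 - 6x."""
    return 170.85 - 6 * x

# Residual for the point (10, 120): residual = actual y - predicted y
actual_y = 120
residual = actual_y - predict(10)
print(round(residual, 2))  # 9.15 -> positive, so the point lies above the line
```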

How to Analyze Two-Variable Data
1. Collect data
2. Display the data on a scatterplot
3. Identify the correlation
4. Consider factors of causation
5. Find the correlation coefficient
6. Write the equation of the line of best fit
7. Use the equation to make predictions
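Steps 5–7 can be combined into one sketch. This version computes r directly from the data instead of using the rounded −0.92, so the slope and intercept differ slightly from the hand calculation (the helper name `analyze` is an illustrative choice):

```python
import math

def analyze(points):
    """Sketch of steps 5-7: correlation coefficient and line of best fit."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    n = len(points)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)  # step 5: correlation coefficient
    b = sxy / sxx                   # slope (algebraically equal to r * s_y / s_x)
    a = mean_y - b * mean_x         # step 6: line of best fit y-hat = a + bx
    return r, a, b

points = [(10, 120), (20, 40), (15, 80), (5, 160), (10, 80), (25, 35)]
r, a, b = analyze(points)
y_hat = a + b * 12                  # step 7: predict y when x = 12
print(round(r, 2), round(b, 2), round(a, 2), round(y_hat, 1))
```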

All Three Formulas in One
• Regression line: $\widehat{y}=a+bx$
• Slope: $b=r\;\bullet \; \frac{s_y}{s_x}$
• y-intercept: $a=\overline{y}-b\overline{x}$