
Course: Statistics and Data Science II
Focus: Associations, regression, interpretation, and model intuition
I want you to think about the association between hours studied and exam results for SDS-I.
We can imagine two possibilities for each student:
\[ Y_i(1) = \text{Exam score if student } i \text{ studies more} \] \[ Y_i(0) = \text{Exam score if student } i \text{ studies less} \]
These are called potential outcomes.
For each student, we only ever observe one of these two outcomes, never both.
We can collect observed data:
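The idea that only one potential outcome per student ever reaches our dataset can be sketched in Python. The scores and study choices below are hypothetical illustration values:

```python
# Hypothetical potential outcomes for five students (made-up numbers).
# Y1 = exam score if the student studies more, Y0 = score if they study less.
students = [
    {"Y1": 85, "Y0": 70, "studies_more": True},
    {"Y1": 90, "Y0": 80, "studies_more": False},
    {"Y1": 78, "Y0": 60, "studies_more": True},
    {"Y1": 95, "Y0": 88, "studies_more": False},
    {"Y1": 70, "Y0": 65, "studies_more": True},
]

# We only observe the outcome matching what the student actually did;
# the counterfactual score stays hidden.
observed = [s["Y1"] if s["studies_more"] else s["Y0"] for s in students]
print(observed)  # [85, 80, 78, 88, 70]
```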
Earlier, you sketched your own potential outcomes as points on a scatterplot.
Put together, these plots form a collective hypothesis about the relationship between hours studied and exam scores.
Regression is a way to model the relationship between variables.
An outcome (\(Y\)), often called the dependent variable, and one or more predictors (\(X\)), often called our independent variable(s).
Given this data, what's our best prediction of a new \(y_i\)?
The best prediction for a new \(y\) is the mean: \(\bar y\). In other words, \(\widehat{y} = \bar{y}\).
So how “wrong” would we be if we had predicted \(\bar y\) for every \(y_i\)?
These “errors” are called residuals (\(residual_i = y_i - \widehat{y}_i\))
Given no other information, guessing \(\bar y\) minimizes the sum of squared residuals.
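We can verify this numerically: among all constant guesses, the mean gives the smallest sum of squared residuals. The scores below are made-up illustration data:

```python
# Numerical check: predicting the mean minimizes the sum of squared
# residuals among all constant guesses. Scores are made up.
y = [62, 75, 80, 68, 90]
ybar = sum(y) / len(y)  # 75.0

def sse(guess):
    """Sum of squared residuals if we predict `guess` for every y_i."""
    return sum((yi - guess) ** 2 for yi in y)

# Try a grid of alternative constant guesses from 0 to 200.
assert all(sse(ybar) <= sse(g / 10) for g in range(0, 2001))
print(sse(ybar))  # 468.0
```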
But, what if we have some other information?
Can we make a better prediction of \(y\) using \(x\)?
If we sort by that information…
Can we improve our best guess of \(y\)?
Can we draw a straight line with smaller residuals?
Can we draw a line with even smaller residuals?
Better! But which line has the smallest residuals?
This line minimizes the sum of squared residuals. It's the regression line.
We have fitted a model to our data.

\[ y = a + b \cdot x \] Where:
\(a\) is the intercept
\(b\) is the slope
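The least-squares slope and intercept have a closed form: \(b = \sum(x_i - \bar x)(y_i - \bar y) \,/\, \sum(x_i - \bar x)^2\) and \(a = \bar y - b \bar x\). A minimal sketch of that computation, with made-up data:

```python
# Closed-form least-squares fit for y = a + b*x (illustration data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.1, 5.9, 8.2, 9.7]

xbar = sum(x) / len(x)
ybar = sum(y) / len(y)

# b = sum((x_i - xbar)(y_i - ybar)) / sum((x_i - xbar)^2)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar  # the fitted line passes through (xbar, ybar)

print(round(a, 3), round(b, 3))  # 0.21 1.93
```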
\[ y = 0 + 1 \cdot x \]

\[ y = 21 + 2 \cdot x \]

\[ \widehat{y} = -39.7 + 0.62 \cdot x \]
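Once fitted, the equation turns any \(x\) into a prediction \(\widehat y\). A minimal sketch using the coefficients above with a hypothetical input value:

```python
# Prediction from the fitted line yhat = -39.7 + 0.62 * x.
def predict(x):
    return -39.7 + 0.62 * x

# Hypothetical input; a residual for an observed y would then be
# residual_i = y_i - yhat_i.
yhat = predict(160)
print(yhat)  # close to 59.5
```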

```
Call:
lm(formula = weight ~ height, data = women)

Residuals:
     Min       1Q   Median       3Q      Max
 -0.7862  -0.5141  -0.1739   0.3364   1.4137

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -39.69694    2.69296  -14.74 1.71e-09 ***
height        0.61610    0.01628   37.85 1.09e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6917 on 13 degrees of freedom
Multiple R-squared:  0.991,    Adjusted R-squared:  0.9903
F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14
```
With a regression, we can describe a pattern between variables in our data — like height and weight — using a mathematical model, like a straight line.
But the real world is messy:
We can write this as an assumed model:
\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \]
From our sample, we estimate:
\[ \widehat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i \]
Observed Data: \((x_i, y_i)\) — what we actually see
Assumed Model: \(Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\)
Fitted Model: \(\widehat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\)
So: the observed data are what we see, the assumed model is what we believe generates them, and the fitted model is our estimate of it from the sample.


Remember, this is how we defined residuals: \(r_i = y_i - \widehat{y}_i\)
However, when we compare models, we use squared residuals: \(r_i^2 = (y_i - \widehat{y}_i)^2\)
We add up (sum) the squared residuals from the baseline (mean) model: \(\text{TSS} = \sum (y_i - \bar{y})^2\)
Then we sum the squared residuals from our fitted model: \(\text{RSS} = \sum (y_i - \widehat{y}_i)^2\)


We define:
\[ R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} \]
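As a concrete check, here is \(R^2\) computed by hand on made-up observed and fitted values (any least-squares fit would supply the \(\widehat{y}_i\)):

```python
# R^2 = 1 - RSS/TSS, computed by hand. Data and fitted values are made up.
y    = [2.1, 4.1, 5.9, 8.2, 9.7]        # observed y_i
yhat = [2.14, 4.07, 6.00, 7.93, 9.86]   # fitted values from a straight line
ybar = sum(y) / len(y)

tss = sum((yi - ybar) ** 2 for yi in y)               # baseline (mean) model
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # fitted model
r2 = 1 - rss / tss
print(round(r2, 3))  # close to 1: the line explains most of the variation
```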

We’ve learned:
But regression only works well if certain conditions are met.
These are the assumptions of OLS.
✅ Valid: \(Y = \beta_0 + \beta_1 X + \beta_2 X^2\) (linear in the parameters)
❌ Invalid: \(Y = \beta_0 + \sin(\beta_1 X)\) (nonlinear in the parameter \(\beta_1\))
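The quadratic model counts as "linear" because \(X^2\) can be treated as just another column of the design matrix. A minimal sketch, assuming NumPy is available; the data are made up:

```python
import numpy as np

# Made-up, noise-free data from a quadratic trend.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x + 0.5 * x**2   # true coefficients: 1, 2, 0.5

# Linear in the parameters: build a design matrix [1, x, x^2]
# and solve ordinary least squares.
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # recovers approximately [1.0, 2.0, 0.5]
```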
OLS is robust to mild violations, especially in large samples, but you should know how to diagnose violations and how they affect interpretation.
We’ll revisit these in the lab later.