Over the past three weeks, we’ve built up regression piece by piece:
✅ Week 1 – Simple Regression
→ Modeling how one variable predicts another
→ Interpreting coefficients, intercept, and \(R^2\)
✅ Week 2 – Multiple Regression
→ Adding variables to separate overlapping influences
→ Understanding conditional relationships and confounding
✅ Week 3 – Interactions & Nonlinearity
→ Testing when effects depend on context
→ Using squared terms and transformations to capture curvature
🎉 Well done — you now have the core regression toolbox.
By now, you know how to run regression models. Today we shift to a harder question:
What should the model actually include?
A regression model is not just a calculation; it is a story we choose to tell about the world.
The group a student belongs to seems to affect:
How much students tend to study
How well they tend to perform
Group belonging is a confounder: a variable that influences both the predictor and the outcome.
flowchart LR H[Predictor] --> E[Outcome] F[Confounder] --> H F --> E classDef adj stroke-width:3,fill:#fff; classDef exp fill:#cfc; classDef con fill:#ddd,stroke:#333; classDef out fill:#ccf; class F con; class H exp; class E out;
Call:
lm(formula = exam_score ~ hours_studied, data = data)
Residuals:
Min 1Q Median 3Q Max
-47.977 -12.021 0.376 14.037 49.927
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 62.4589 2.9543 21.142 < 2e-16 ***
hours_studied -1.4457 0.5311 -2.722 0.00706 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18.78 on 198 degrees of freedom
Multiple R-squared: 0.03608, Adjusted R-squared: 0.03121
F-statistic: 7.411 on 1 and 198 DF, p-value: 0.007063
Call:
lm(formula = exam_score ~ hours_studied + group, data = data)
Residuals:
Min 1Q Median 3Q Max
-24.9261 -6.8383 -0.4089 6.6237 24.8530
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 59.1074 1.6688 35.418 < 2e-16 ***
hours_studied 3.0872 0.3701 8.341 1.27e-14 ***
groupB -38.3497 1.8507 -20.722 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.56 on 197 degrees of freedom
Multiple R-squared: 0.6968, Adjusted R-squared: 0.6938
F-statistic: 226.4 on 2 and 197 DF, p-value: < 2.2e-16
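The sign flip in the two R outputs above comes from the course data, but it is easy to reproduce. Here is a minimal simulation (Python/NumPy rather than R; the group sizes, effect sizes, and variable names are invented for illustration) in which the within-group effect of studying is +3, yet the pooled slope is negative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100  # students per group

# Group A studies little but scores high; group B studies a lot but scores low.
hours_a = rng.normal(2, 1, n)
hours_b = rng.normal(6, 1, n)
score_a = 60 + 3 * hours_a + rng.normal(0, 5, n)
score_b = 60 + 3 * hours_b - 38 + rng.normal(0, 5, n)

hours = np.concatenate([hours_a, hours_b])
score = np.concatenate([score_a, score_b])
group_b = np.concatenate([np.zeros(n), np.ones(n)])

def ols(predictors, y):
    """Least-squares coefficients, with an intercept column prepended."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    return np.linalg.lstsq(X, y, rcond=None)[0]

naive = ols([hours], score)              # score ~ hours
adjusted = ols([hours, group_b], score)  # score ~ hours + group

print(f"naive slope:    {naive[1]:.2f}")     # negative: biased by group
print(f"adjusted slope: {adjusted[1]:.2f}")  # close to the true +3
```

The pooled slope is negative only because group B both studies more and scores lower; conditioning on group recovers the within-group effect.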
In 1973, UC Berkeley’s grad admissions data seemed to show bias against women:
When admissions were broken down by department:
In most individual departments, women were admitted at similar or higher rates than men.
Here are some patterns we might observe in data:
✈️ People who fly more are more stressed
🍷 Moderate wine drinkers live longer
📱 More phone use is linked to worse sleep
🐱 Pet owners report better mental health
🎵 Kids who take music lessons score higher on IQ tests
🏡 Homeowners are more civically engaged
💊 Vitamin supplement users are healthier
📚 Kids who read early succeed more in school
🏙️ Urban dwellers tend to be more politically liberal
In pairs: Pick three of these associations. For each one:
Identify a potential source of confounding.
Describe how that confounder creates a spurious association.
Example:
🍷 People who drink wine tend to live longer.
However, higher-SES individuals are more likely to be moderate wine drinkers.
And, higher-SES individuals also tend to have better access to healthcare and lower mortality.
Therefore, SES is a common cause of both wine consumption and longevity,
creating a spurious association between the two.
When you attend research seminars, at IAS or anywhere else, you’ll notice that discussion often revolves around just this:
What else could be driving the observed pattern?
Learning to ask that question, and to think structurally about it, is one of the most valuable skills you can develop.
We’ve seen one way estimates go wrong:
But confounding is only one kind of biasing structure. Unfortunately, there are many more.
Some relationships are clearly causal:
- The sun warms the ground in the morning
- Pushing the gas pedal makes the car accelerate
But many associations are not causal:
- Ice cream sales and drownings are correlated
Most real-world relationships are a mix of genuine causation and confounding, so the observed association is partially spurious.
When we answer research questions, we want to avoid:
And we want to understand:
flowchart LR H[Exposure] --> E[Outcome] F[Confounder] --> H F --> E classDef adj stroke-width:3,fill:#fff; classDef exp fill:#cfc; classDef con fill:#ddd,stroke:#333; classDef out fill:#ccf; class F con; class H exp; class E out;
Not universal
Not everyone uses DAGs. They’re a framework — not a universal language in academia.
We draw an arrow only if we believe there is a direct causal effect.
But what is it good for?
flowchart LR Z[Z] --> H[Exposure] Z --> E[Outcome] H --> E classDef exp fill:#cfc; classDef con fill:#ddd,stroke:#333; classDef out fill:#ccf; class Z con; class H exp; class E out;
flowchart LR A[Z0] --> Z[Z1] --> H[Exposure] --> E[Outcome] Z --> E A --> E classDef exp fill:#cfc; classDef con fill:#ddd,stroke:#333; classDef out fill:#ccf; class A con; class Z con; class H exp; class E out;
flowchart LR A[Z1] --> Z[Z2] --> F[Z3] --> H[Exposure] --> E[Outcome] A --> E classDef adj stroke-width:3,fill:#fff; classDef exp fill:#cfc; classDef con fill:#ddd,stroke:#333; classDef out fill:#ccf; class Z con; class F con; class H exp; class E out;
A backdoor path is:
any path from Exposure to Outcome
that starts with an arrow into the Exposure (a “back door” into X)
and is open (i.e. not “blocked”)
These paths carry confounding bias.
An open backdoor path is the formal, graphical definition of confounding.
flowchart LR Z[L] --> F[Z] --> H[Exposure] --> E[Outcome] Z --> E classDef adj stroke-width:3,fill:#fff; classDef exp fill:#cfc; classDef con fill:#ddd,stroke:#333; classDef out fill:#ccf; class Z con; class F con; class H exp; class E out;
flowchart LR Z[L] --> F[Z] --> H[Exposure] --> E[Outcome] Z --> E classDef adj stroke-width:3,fill:#fff; classDef exp fill:#cfc; classDef con fill:#ddd,stroke:#333; classDef out fill:#ccf; class Z con; class F adj; class H exp; class E out;
How could we satisfy the backdoor criterion?
flowchart LR A[Z1] --> Z[Z2] --> F[Z3] --> Q[Z4] --> B[Z5] --> C[Z6] --> H[Exposure] --> E[Outcome] A --> E classDef adj stroke-width:3,fill:#fff; classDef exp fill:#cfc; classDef con fill:#ddd,stroke:#333; classDef out fill:#ccf; class Z con; class F con; class H exp; class E out;
flowchart LR Z[L] --> F[Z] --> H[Exposure] --> E[Outcome] Z --> E Z --> H classDef adj stroke-width:3,fill:#fff; classDef exp fill:#cfc; classDef con fill:#ddd,stroke:#333; classDef out fill:#ccf; class Z con; class F con; class H exp; class E out;
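The logic of blocking a backdoor path can be checked by simulation. In this sketch (Python; the coefficients are arbitrary assumptions, not from the course data), Z causes both Exposure and Outcome, the true effect of Exposure is 2, and adjusting for Z recovers it:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

z = rng.normal(size=n)                       # confounder on the backdoor path
x = 0.8 * z + rng.normal(size=n)             # exposure
y = 2.0 * x + 1.5 * z + rng.normal(size=n)   # true causal effect of x is 2.0

def slope_of_x(predictors, y):
    """Coefficient on the first predictor, after an intercept."""
    X = np.column_stack([np.ones(n), *predictors])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

naive = slope_of_x([x], y)        # biased upward: X <- Z -> Y is open
adjusted = slope_of_x([x, z], y)  # ~2.0: adjusting for Z blocks the path

print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}")
```

The same code works for longer backdoor paths: adjusting for any one variable that blocks every open backdoor path is enough.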
Let’s build the Study Hours → Exam Score DAG together.
flowchart LR %% Main causal path H[Hours studied] --> E[Exam score] %% Confounders SES[Parental SES] --> H SES --> E GPA[Prior GPA] --> H GPA --> E SES --> GPA MOT[Motivation] --> H MOT --> E SES --> MOT TIQ[IQ / Cognitive ability] --> E TIQ --> GPA SLP[Sleep quality] --> H SLP --> E PARTY[Hours of partying] --> H PARTY --> SLP F[Field of study] --> H TIQ --> F GPA --> F JOB[Part-time job] --> H SES --> JOB %% Mediator STRESS[Exam stress] --> E H --> STRESS %% Output styling classDef adj stroke-width:3,fill:#fff; classDef exp fill:#cfc; classDef con fill:#ddd,stroke:#333; classDef out fill:#ccf; class H exp; class E out; class SES,GPA,MOT,TIQ,SLP,PARTY,STRESS,F,JOB con;
As we’ve just seen, the causal structure of a social process can be overwhelming — even when we try to simplify.
There are a lot of potential confounders.
Can we ever get unbiased causal estimates?
This is how we will use them in this class: to clarify how we think, and how that thinking feeds into our models.
Still — DAGs give us a structured starting point. The alternative is not communicating our assumptions.
flowchart LR X[Education] --> M[Income] --> Y[Health] classDef adj stroke-width:3,fill:#fff; classDef exp fill:#cfc; classDef con fill:#ddd,stroke:#333; classDef out fill:#ccf; class X exp; class Y out;
flowchart LR H[Hours Studied in high school] --> G[GPA at graduation] --> A[Admission to uni.] H --> A classDef adj stroke-width:3,fill:#fff; classDef exp fill:#cfc; classDef con fill:#ddd,stroke:#333; classDef out fill:#ccf; class H exp; class A out;
Here are some observed associations:
🎓 More education → Higher income
💪 Regular exercise → Better mental health
👨👩👧 Parental SES → Child academic success
📖 Reading fiction → Greater empathy
🧳 Immigrant background → Lower political trust
⛪ Religious participation → Longer life
🌆 Growing up in cities → More liberal views
📱 More screen time → Lower school performance
👶 Having children → Shift in gender attitudes
🍻 More drinking → Worse career outcomes
Example:
💪 Those who exercise frequently tend to have better mental health.
Exercise improves sleep quality, and good sleep improves mental health.
Thus, sleep is a potential mediator/mechanism of the exercise–mental health effect.
If your question is:
“What is the total effect of Education on Health?”
❌ Do not adjust for income.
You would cut the pathway you are trying to estimate.
If your question is:
“What is Education’s effect holding income fixed?”
✔️ Then you can adjust for descriptive purposes.
Causal mediation
Adjusting for a mediator does not, by itself, identify the causal direct effect. That is a tricky problem, and it requires very strong assumptions to hold.
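To see why adjusting for a mediator changes the estimand, consider a simulated Education → Income → Health chain (Python; the effect sizes are invented for illustration). Conditioning on income removes the indirect pathway and leaves only the direct effect:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

edu = rng.normal(size=n)
income = 1.0 * edu + rng.normal(size=n)                  # mediator
health = 0.5 * edu + 0.7 * income + rng.normal(size=n)
# Total effect of edu = 0.5 (direct) + 1.0 * 0.7 (via income) = 1.2

def coef(predictors, y):
    """Coefficient on the first predictor, after an intercept."""
    X = np.column_stack([np.ones(n), *predictors])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

total = coef([edu], health)                # ~1.2: total effect
conditional = coef([edu, income], health)  # ~0.5: income held fixed

print(f"total: {total:.2f}, holding income fixed: {conditional:.2f}")
```

Neither number is wrong; they answer different questions, which is why your research question must come before your model specification.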
We’ve seen that controlling for a confounder can help reduce bias.
But not every variable behaves like a confounder.
Sometimes, controlling for a variable actually introduces a false association between two variables that are otherwise unrelated.
This happens when the variable you adjust for is a common effect (of two or more variables).
We call this kind of variable a collider.
flowchart LR A[Genetic Risk] --> H[Hospitalization] B[Bike Accident] --> H classDef coll fill:#fdd; class A con; class B con; class H coll;
A common effect
In the general population, genetic risk and accidents are unrelated.
But among hospital patients, they may appear negatively correlated.
Here’s why: If a patient is in the hospital because of a bike accident, they are less likely to also be there because of genetic illness, and vice versa.
By restricting the sample to hospitalized people, you are conditioning on a collider, which creates a spurious association between its causes.
flowchart LR A[Genetic Risk] --> H[Hospitalization] B[Bike Accident] --> H classDef coll fill:#fdd; class A con; class B con; class H coll;
flowchart LR A[Political Interest] --> T[Twitter Use] B[Extremism] --> T classDef var fill:#fff; classDef col fill:#fdd,stroke:#333; class A,B var; class T col;
By conditioning on Twitter use, a collider, we open a spurious association between political interest and extremism. This is collider bias.
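Collider bias is also easy to demonstrate by simulation. In this sketch (Python; the threshold and variable names are invented), the two causes are independent in the full population but become negatively correlated once we restrict the sample to their common effect:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

genetic_risk = rng.normal(size=n)
bike_accident = rng.normal(size=n)  # independent of genetic risk

# Hospitalization is a common effect: either cause can put you in hospital.
hospitalized = (genetic_risk + bike_accident + rng.normal(size=n)) > 1.5

full = np.corrcoef(genetic_risk, bike_accident)[0, 1]
selected = np.corrcoef(genetic_risk[hospitalized],
                       bike_accident[hospitalized])[0, 1]

print(f"full sample:    r = {full:.3f}")      # ~0: truly independent
print(f"among patients: r = {selected:.3f}")  # negative: collider bias
```

Nothing was "controlled for" in a regression here; simply selecting the sample on a common effect is enough to open the spurious path.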
flowchart LR H[Exposure] --> E[Outcome] Z[Z] --> L[L] A[A] --> L Z --> E A --> H classDef adj stroke-width:3,fill:#fff; classDef exp fill:#cfc; classDef con fill:#ddd,stroke:#333; classDef out fill:#ccf; class Z,L,A con; class H exp; class E out;
| Type | Where it sits | Adjust for it? | Why / Why Not? |
|---|---|---|---|
| Confounder | Common cause of X & Y | ✅ Yes | Blocks biasing backdoor paths |
| Mediator | Between X and Y | ❌ (usually) | Cuts off indirect effect; distorts total effect |
| Collider | Caused by X and Z | ❌ Never | Opens up biasing non-causal path |
Here’s a DAG for:
“Does exercise affect mental health?”
flowchart LR EX[Exercise] --> MH[Mental Health] SL[Sleep] --> MH EX --> SL ST[Stress] --> MH ST --> EX EX --> C[Gym Membership] IN[Income] --> C MH --> C classDef exp fill:#cfc; classDef out fill:#ccf; classDef con fill:#ddd; classDef med fill:#ffc; classDef coll fill:#fdd; class EX exp; class MH out; class SL,ST,IN,C con;
The word “selection” is used in multiple ways across disciplines:
These uses of “selection” are related — but not identical.
Regression doesn’t tell us what is true.
It tells us what is true under our assumptions.
So the real skill is not running the model; it is knowing what needs to be in the model, what must stay out, and why.
Common ways models go wrong:
Omitted confounders
→ creates spurious associations
Adjusting for mediators
→ risks underestimating the total effect along the causal pathway you want to study
Controlling for colliders
→ creates spurious associations
Adding variables without justification
→ inflates variance, adds noise, risks overfitting
Which parts of a typical quantitative empirical project have we covered in this course?
- Formulating a research question
- Reading literature
- Developing a hypothesis
✅ Drawing assumptions or a causal diagram (NEW for today!)
- Study design
✅ Collecting/Exploring data
✅ Specifying the models (Also today)
✅ Fitting statistical models
✅ Interpreting results
- Drawing conclusions
Confounding Variables (XKCD)
flowchart LR AB[Childhood abuse] --> SC[Substance use] AB --> MH[Teenage Mental health] AB --> VI[Violent crime score] EDU[Education] --> VI FI[Parents income] --> EDU FI --> AB MH --> SC MH --> VI SC --> VI VI --> IM[Imprisonment] PO[Prior Economic crime] --> IM FI --> PO classDef exp fill:#cfc; classDef out fill:#ccf; classDef con fill:#ddd,stroke:#333; classDef coll fill:#fdd; class MH exp; class VI out; class IM coll;