Week 5 Notes - Introduction to Linear Regression

Published

October 6, 2025

Key Concepts Learned

  • 1. The Statistical Learning Framework
    • Formalizing the relationship Y = f(X) + ε
      • Why estimate f? 1. Prediction 2. Inference
      • How do we estimate f? Parametric methods (we assume f is linear, then estimate β₀ and β₁); non-parametric methods (we let the data determine the shape of f)
    • Parametric Approach: Linear Regression
      • The assumption: the relationship between X and Y is linear. Y ≈ β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ (estimate the β coefficients)
      • The task: Estimate the β coefficients using our sample data
      • The method: Ordinary Least Squares (OLS)
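
A minimal sketch of the OLS step in R, using simulated data (the variable names are made up for illustration, not from class):

```r
# Simulate data where the true relationship is y = 2 + 3*x + noise
set.seed(42)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.5)

# lm() fits by Ordinary Least Squares: it chooses the beta estimates
# that minimize the sum of squared residuals
fit <- lm(y ~ x)
coef(fit)  # estimates land close to the true values 2 and 3
```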
  • 2. Two Different Goals
    • Inference
      • Focus on coefficients
      • Statistical significance matters
      • Understand mechanisms
    • Prediction
      • Focus on accuracy
      • Prediction intervals matter
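
The two goals correspond to two different R outputs; a sketch with simulated data (all names hypothetical):

```r
set.seed(1)
df <- data.frame(x = rnorm(50))
df$y <- 1 + 2 * df$x + rnorm(50)
fit <- lm(y ~ x, data = df)

# Inference goal: confidence intervals for the coefficients themselves
confint(fit)

# Prediction goal: an interval for a new observation. It is wider than a
# confidence interval because it also includes the irreducible error
predict(fit, newdata = data.frame(x = 0.5), interval = "prediction")
```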
  • 3. Building Your First Model
    • Fit the Model
      • R², p-value (< 0.001 → very unlikely to see this data if the true β₁ = 0, so we can reject the null hypothesis)
    • Interpreting Coefficients
    • The “Holy Grail” Concept (all estimation is based on the sample, not the entire population, so we will never know the true relationship) (no relationship = a flat line: the null hypothesis)
    • Statistical Significance
      • t-value (larger |t| = more confidence that the relationship is real)
      • p-value (the probability of seeing our estimate if H₀ is true; small p → reject H₀ and conclude a relationship exists)
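
The t- and p-values above can be read directly off the coefficient table that summary() returns; a sketch with simulated data (names hypothetical):

```r
set.seed(7)
x <- rnorm(80)
y <- 0.5 * x + rnorm(80)
fit <- lm(y ~ x)

# Coefficient table columns: Estimate, Std. Error, t value, Pr(>|t|)
ctab <- summary(fit)$coefficients
ctab["x", "t value"]   # t = estimate / standard error
ctab["x", "Pr(>|t|)"]  # chance of a |t| this large if the true slope were 0
```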
  • 4. Model Evaluation
    • How Good is This Model?
      • How well does it fit the data we used? (in-sample fit: R² = 0.208 means “about 21% of the variation in income is explained by population”; R² alone doesn’t tell us if the model is trustworthy) (overfitting: a high R² doesn’t mean good predictions!)
      • How well would it predict new data? (out-of-sample performance: 70% train / 30% test split)
        • Code: set.seed(123); n <- nrow(pa_data)
    • Evaluate Predictions
      • Code: # Calculate prediction error (RMSE): rmse_test <- sqrt(mean((test_data$median_income - test_predictions)^2)); rmse_train <- summary(model_train)$sigma (squaring magnifies large errors)
    • Cross-Validation (better approach: multiple train/test splits)
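
A base-R sketch of the k-fold idea (simulated data, so the pa_data/median_income objects from class are not used here):

```r
set.seed(123)
n <- 200
df <- data.frame(x = rnorm(n))
df$y <- 3 + 2 * df$x + rnorm(n)

k <- 5
folds <- sample(rep(1:k, length.out = n))  # assign each row to one of k folds
rmse <- numeric(k)
for (i in 1:k) {
  train <- df[folds != i, ]
  test  <- df[folds == i, ]
  m <- lm(y ~ x, data = train)
  pred <- predict(m, newdata = test)
  rmse[i] <- sqrt(mean((test$y - pred)^2))  # out-of-sample error for fold i
}
mean(rmse)  # average test RMSE across the k splits
```

Averaging over k splits gives a more stable estimate of out-of-sample error than a single 70/30 split.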
  • 5.Checking Assumptions
    • Assumption 1: Linearity
      • What we assume: Relationship is actually linear
      • How to check: residual plot (random scatter is what we want; a curved pattern suggests non-linearity, while a clear funnel pattern signals non-constant variance)
    • Assumption 2: Constant Variance
      • Heteroscedasticity: variance changes across X (when residuals grow, consider whether a new variable is needed to explain it → adding the right predictor can fix heteroscedasticity)
      • Impact: Standard errors are wrong → p-values misleading
    • Formal Test: Breusch-Pagan (a statistical hypothesis test for heteroscedasticity; its p-value is the test’s final output and decision tool)
      • Interpretation: p > 0.05: Constant variance assumption OK; p < 0.05: Evidence of heteroscedasticity
      • Solutions: 1. Transform Y (try log(income)) 2. Robust standard errors 3. Add missing variables 4. Accept it (point predictions are still OK for prediction goals)
    • Assumption 3: Normality of Residuals (Q-Q plot)
    • Assumption 4: No Multicollinearity
    • Assumption 5: No Influential Outliers
      • Not all outliers are problems - only those with high leverage AND large residuals
      • Identify influential points (Cook’s D > 4/n = potentially influential)
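
Applying the Cook's D rule of thumb in base R; a sketch with one planted high-leverage, large-residual point (data simulated for illustration):

```r
set.seed(9)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
x[50] <- 10; y[50] <- -20  # plant a high-leverage point with a large residual
fit <- lm(y ~ x)

n <- length(x)
d <- cooks.distance(fit)
which(d > 4 / n)  # flags potentially influential observations, including #50
```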
  • 6. Improving the Model
    • Adding More Predictors
    • Log Transformations
      • If relationship is curved, try transforming
      • Interpretation changes: Log models show percentage relationships
    • Categorical Variables (R encodes them as new dummy/indicator variables)
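
A sketch combining the last two ideas with hypothetical data: a log-transformed outcome and a categorical predictor that R expands into a dummy variable:

```r
set.seed(3)
n <- 150
region <- factor(sample(c("urban", "rural"), n, replace = TRUE))
x <- runif(n, 1, 10)
# log(income) is linear in x, so income itself is curved in x
income <- exp(1 + 0.1 * x + 0.3 * (region == "urban") + rnorm(n, sd = 0.1))

fit <- lm(log(income) ~ x + region)
coef(fit)
# Interpretation change with a log outcome: each one-unit increase in x is
# associated with roughly a 10% change in income; "regionurban" is the
# dummy variable R creates for urban vs. the rural reference level
```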

Coding Techniques

  • [New R functions or approaches]
  • [Quarto features learned]

Questions & Challenges

  • What I didn’t fully understand
    • t value & p value
  • Areas needing more practice
    • 1

Connections to Policy

  • [How this week’s content applies to real policy work]

Reflection

  • [What was most interesting]
  • [How I’ll apply this knowledge]