Week 5 Notes - Intro to Linear Regression

Published

October 6, 2025

Key Concepts Learned

  1. The Statistical Learning Framework: What are we actually doing?
  2. Two goals: Understanding relationships vs Making predictions
  3. Building your first model with PA census data
  4. Model evaluation: How do we know if it’s any good?
  5. Checking assumptions: When can we trust the model?
  6. Improving predictions: Transformations, multiple variables

Statistical Significance

t-statistic: How many standard errors away from 0?

  • Bigger |t| = more confidence the relationship is real

p-value: Probability of seeing an estimate at least as extreme as ours if H₀ is true

  • Small p → reject H₀, conclude relationship exists
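Both quantities can be read straight off an `lm()` summary. A minimal sketch using R's built-in `mtcars` data as a stand-in for the course data (the coefficient table has the same columns for any model):

```r
# Fit a simple model; mtcars stands in for pa_data here
fit <- lm(mpg ~ wt, data = mtcars)
coefs <- summary(fit)$coefficients

# Column 3 is the t value, column 4 the two-sided p-value
t_wt <- coefs["wt", "t value"]
p_wt <- coefs["wt", "Pr(>|t|)"]

# The t-statistic is literally estimate / standard error
all.equal(t_wt, coefs["wt", "Estimate"] / coefs["wt", "Std. Error"])
```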

Model Evaluation

  • How well does it fit the data we used? (in-sample fit using R²)

  • How well would it predict new data? (out-of-sample performance)

Checking Assumptions: Plots reveal what R² hides

  • Assumption 1: Linear Relationship
    • Check with Residual Plot
    • Residuals = observed − fitted. They’re your best proxy for the error term. Good models leave residuals that look like random noise.
  • Assumption 2: Constant Variance
    • Check for heteroscedasticity (i.e., variance changes across X)
    • Impact: standard errors are wrong → p-values are misleading
  • Assumption 3: Normality of Residuals
    • Check with Q-Q Plot (quantile-quantile plot) of residuals
    • Important for confidence & prediction intervals
    • Needed for valid hypothesis tests (t-test and F-test)
  • Assumption 4: No Multicollinearity
    • Coefficients become unstable and hard to interpret
  • Assumption 5: No Influential Outliers
    • Influential outliers: points with high leverage and large residuals
    • Visual diagnostic: Cook’s Distance (Cook’s D)
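The normality check (Assumption 3) is quick in base R. A minimal sketch, again with `mtcars` standing in for the fitted course model:

```r
# Stand-in model; replace with the model fit on pa_data
fit <- lm(mpg ~ wt, data = mtcars)
res <- residuals(fit)

# Points hugging the reference line suggest roughly normal residuals
qqnorm(res, main = "Q-Q Plot of Residuals")
qqline(res, col = "red")

# Optional formal test (very sensitive in large samples)
shapiro.test(res)
```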

Improving the Model

  • Adding more predictors
  • Log Transformations
  • Categorical Variables
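Two of these improvements can be sketched on the built-in `mtcars` data; the variables are stand-ins for the census ones:

```r
# Log-transforming a right-skewed outcome often stabilizes variance
fit_log <- lm(log(mpg) ~ wt, data = mtcars)

# On the log scale, exp(beta) is the multiplicative change in Y
# per one-unit increase in X
exp(coef(fit_log)["wt"])

# Categorical predictor: factor() makes R create dummy variables
fit_cat <- lm(mpg ~ wt + factor(cyl), data = mtcars)
summary(fit_cat)$coefficients
```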

Coding Techniques

  • Linear Regression Train/Test Split
Code
set.seed(123)
n <- nrow(pa_data)

# 70% training, 30% testing
train_indices <- sample(1:n, size = floor(0.7 * n))
train_data <- pa_data[train_indices, ]
test_data <- pa_data[-train_indices, ]

# Fit on training data only
model_train <- lm(median_incomeE ~ total_popE, data = train_data)

# Predict on test data
test_predictions <- predict(model_train, newdata = test_data)
  • Evaluate Predictions
Code
rmse_test <- sqrt(mean((test_data$median_incomeE - test_predictions)^2))
rmse_train <- summary(model_train)$sigma

cat("Training RMSE:", round(rmse_train, 0), "\n")
cat("Test RMSE:", round(rmse_test, 0), "\n")
  • 10-Fold Cross Validation
Code
library(caret)

train_control <- trainControl(method = "cv", number = 10)
cv_model <- train(median_incomeE ~ total_popE,
                  data = pa_data,
                  method = "lm",
                  trControl = train_control)
cv_model$results
  • Residual Plot
Code
library(ggplot2)

# model1 is the simple regression fit earlier on pa_data
pa_data$residuals <- residuals(model1)
pa_data$fitted <- fitted(model1)

ggplot(pa_data, aes(x = fitted, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Residual Plot", x = "Fitted Values", y = "Residuals") +
  theme_minimal()

Residuals vs Fitted

  • Goal: a random, horizontal band around 0. Any pattern in the residuals means the model is missing something systematic

  • Red flags:

    • Curvature → missing nonlinear term (try polynomials/splines or transform X).

    • Funnel/wedge shape → heteroskedasticity (use log/Box–Cox on Y, weighted least squares, or robust SEs).

    • Clusters/bands → omitted categorical/group effects or interaction terms.

Constant Variance and Heteroskedasticity

  • Heteroskedasticity is a condition in regression analysis where the variance of the error terms is not constant but changes as the value of one or more independent variables changes.

    • Model fits well for some values (e.g., small counties) but poorly for others (large counties)

    • May indicate missing variables that matter more at certain X values

    • Ask: “What’s different about observations with large residuals?”

Formal Test: Breusch-Pagan

Code
library(lmtest) 
bptest(model1)
  • Interpretation:

    • p > 0.05: Constant variance assumption OK

    • p < 0.05: Evidence of heteroscedasticity

If detected, solutions:

  1. Transform Y (try log(income))
  2. Robust standard errors
  3. Add missing variables
  4. Accept it (point predictions still OK for prediction goals)
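Solution 2 can be sketched by hand in base R. This mirrors what `sandwich::vcovHC(type = "HC1")` with `lmtest::coeftest()` would produce; `mtcars` is a stand-in dataset:

```r
fit <- lm(mpg ~ wt, data = mtcars)

X <- model.matrix(fit)
u <- residuals(fit)
n <- nrow(X); k <- ncol(X)

# Sandwich estimator: bread %*% meat %*% bread, with HC1 small-sample scaling
bread <- solve(crossprod(X))   # (X'X)^{-1}
meat  <- crossprod(X * u)      # X' diag(u^2) X
vcov_hc1 <- (n / (n - k)) * bread %*% meat %*% bread
robust_se <- sqrt(diag(vcov_hc1))
robust_se
```

Robust SEs change the standard errors (and thus t- and p-values) but leave the coefficient estimates unchanged.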

Multicollinearity & Variance Inflation Factor (VIF)

Code
library(car)
vif(model1)  # Variance Inflation Factor

# Rule of thumb: VIF > 10 suggests problems
# Not relevant with only 1 predictor!

Influential Outliers – Cook’s D > 4/n
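A minimal sketch of the Cook’s D screen with the 4/n rule of thumb (`mtcars` as a stand-in; the course model would replace `fit`):

```r
fit <- lm(mpg ~ wt, data = mtcars)

cd <- cooks.distance(fit)
n <- nrow(mtcars)

# Flag observations exceeding the 4/n rule of thumb
influential <- which(cd > 4 / n)
cd[influential]
```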

Questions & Challenges

  • I am interested in learning more about non-parametric approaches in future weeks

Connections to Policy

  • It is important to understand the assumptions behind a model. For example, heteroskedasticity is often an indicator of model misspecification – it indicates missing variables that matter more at certain X values. Population alone predicts income well in rural counties, but large urban counties need additional variables (education, industry) to predict accurately.

Reflection

  • Excited to put these skills to the test with the midterm home price prediction challenge!