Week 5 Notes - Intro to Linear Regression

Published

October 6, 2025

Key Concepts Learned

  1. The Statistical Learning Framework: What are we actually doing?
  2. Two goals: Understanding relationships vs Making predictions
  3. Building your first model with PA census data
  4. Model evaluation: How do we know if it’s any good?
  5. Checking assumptions: When can we trust the model?
  6. Improving predictions: Transformations, multiple variables

Statistical Significance

t-statistic: How many standard errors away from 0?

  • Bigger |t| = more confidence the relationship is real

p-value: Probability of seeing an estimate at least as extreme as ours if H₀ is true

  • Small p → reject H₀, conclude relationship exists
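Both quantities can be read straight off an `lm()` summary. A minimal sketch using R's built-in `mtcars` data as a stand-in for the course data (the coefficient table has the same columns for any model):

```r
# Fit a simple model; mtcars stands in for pa_data here
fit <- lm(mpg ~ wt, data = mtcars)
coefs <- summary(fit)$coefficients

# Column 3 is the t value, column 4 the two-sided p-value
t_wt <- coefs["wt", "t value"]
p_wt <- coefs["wt", "Pr(>|t|)"]

# The t-statistic is literally estimate / standard error
all.equal(t_wt, coefs["wt", "Estimate"] / coefs["wt", "Std. Error"])
```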

Model Evaluation

  • How well does it fit the data we used? (in-sample fit using R²)

  • How well would it predict new data? (out-of-sample performance)

Checking Assumptions: Plots reveal what R² hides

  • Assumption 1: Linear Relationship
    • Check with Residual Plot
    • Residuals = observed − fitted. They’re your best proxy for the error term. Good models leave residuals that look like random noise.
  • Assumption 2: Constant Variance
    • Check for heteroscedasticity (i.e., variance changes across X)
    • Impact: standard errors are wrong → p-values are misleading
  • Assumption 3: Normality of Residuals
    • Check with Q-Q Plot (quantile-quantile plot) of residuals
    • Important for confidence & prediction intervals
    • Needed for valid hypothesis tests (t-test and F-test)
  • Assumption 4: No Multicollinearity
    • Coefficients become unstable and hard to interpret
  • Assumption 5: No Influential Outliers
    • Influential outliers: points with high leverage and large residuals
    • Visual diagnostic: Cook’s Distance (Cook’s D)
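The normality check (Assumption 3) is quick in base R. A minimal sketch, again with `mtcars` standing in for the fitted course model:

```r
# Stand-in model; replace with the model fit on pa_data
fit <- lm(mpg ~ wt, data = mtcars)
res <- residuals(fit)

# Points hugging the reference line suggest roughly normal residuals
qqnorm(res, main = "Q-Q Plot of Residuals")
qqline(res, col = "red")

# Optional formal test (very sensitive in large samples)
shapiro.test(res)
```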

Improving the Model

  • Adding more predictors
  • Log Transformations
  • Categorical Variables
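Two of these improvements can be sketched on the built-in `mtcars` data; the variables are stand-ins for the census ones:

```r
# Log-transforming a right-skewed outcome often stabilizes variance
fit_log <- lm(log(mpg) ~ wt, data = mtcars)

# On the log scale, exp(beta) is the multiplicative change in Y
# per one-unit increase in X
exp(coef(fit_log)["wt"])

# Categorical predictor: factor() makes R create dummy variables
fit_cat <- lm(mpg ~ wt + factor(cyl), data = mtcars)
summary(fit_cat)$coefficients
```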

Coding Techniques

  • Linear Regression Train/Test Split
Code
set.seed(123)
n <- nrow(pa_data)

# 70% training, 30% testing
train_indices <- sample(1:n, size = floor(0.7 * n))
train_data <- pa_data[train_indices, ]
test_data <- pa_data[-train_indices, ]

# Fit on training data only
model_train <- lm(median_incomeE ~ total_popE, data = train_data)

# Predict on test data
test_predictions <- predict(model_train, newdata = test_data)
  • Evaluate Predictions
Code
rmse_test <- sqrt(mean((test_data$median_incomeE - test_predictions)^2))
rmse_train <- summary(model_train)$sigma

cat("Training RMSE:", round(rmse_train, 0), "\n")
cat("Test RMSE:", round(rmse_test, 0), "\n")
  • 10-Fold Cross Validation
Code
library(caret)

train_control <- trainControl(method = "cv", number = 10)
cv_model <- train(median_incomeE ~ total_popE,
                  data = pa_data,
                  method = "lm",
                  trControl = train_control)
cv_model$results
  • Residual Plot
Code
library(ggplot2)

# model1 is the simple regression fit earlier on pa_data
pa_data$residuals <- residuals(model1)
pa_data$fitted <- fitted(model1)

ggplot(pa_data, aes(x = fitted, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Residual Plot", x = "Fitted Values", y = "Residuals") +
  theme_minimal()

Residuals vs Fitted

  • Goal: a random, horizontal band around 0. Any pattern in the residuals means the model is missing something systematic

  • Red flags:

    • Curvature → missing nonlinear term (try polynomials/splines or transform X).

    • Funnel/wedge shape → heteroskedasticity (use log/Box–Cox on Y, weighted least squares, or robust SEs).

    • Clusters/bands → omitted categorical/group effects or interaction terms.

Constant Variance and Heteroskedasticity

  • Heteroskedasticity is a condition in regression analysis where the variance of the error terms is not constant but changes as the value of one or more independent variables changes.

    • Model fits well for some values (e.g., small counties) but poorly for others (large counties)

    • May indicate missing variables that matter more at certain X values

    • Ask: “What’s different about observations with large residuals?”

Formal Test: Breusch-Pagan

Code
library(lmtest) 
bptest(model1)
  • Interpretation:

    • p > 0.05: Constant variance assumption OK

    • p < 0.05: Evidence of heteroscedasticity

If detected, solutions:

  1. Transform Y (try log(income))
  2. Robust standard errors
  3. Add missing variables
  4. Accept it (point predictions still OK for prediction goals)
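Solution 2 can be sketched by hand in base R. This mirrors what `sandwich::vcovHC(type = "HC1")` with `lmtest::coeftest()` would produce; `mtcars` is a stand-in dataset:

```r
fit <- lm(mpg ~ wt, data = mtcars)

X <- model.matrix(fit)
u <- residuals(fit)
n <- nrow(X); k <- ncol(X)

# Sandwich estimator: bread %*% meat %*% bread, with HC1 small-sample scaling
bread <- solve(crossprod(X))   # (X'X)^{-1}
meat  <- crossprod(X * u)      # X' diag(u^2) X
vcov_hc1 <- (n / (n - k)) * bread %*% meat %*% bread
robust_se <- sqrt(diag(vcov_hc1))
robust_se
```

Robust SEs change the standard errors (and thus t- and p-values) but leave the coefficient estimates unchanged.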

Multicollinearity & Variance Inflation Factor (VIF)

Code
library(car)
vif(model1)  # Variance Inflation Factor

# Rule of thumb: VIF > 10 suggests problems
# Not relevant with only 1 predictor!

Influential Outliers – Cook’s D > 4/n
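A minimal sketch of the Cook’s D screen with the 4/n rule of thumb (`mtcars` as a stand-in; the course model would replace `fit`):

```r
fit <- lm(mpg ~ wt, data = mtcars)

cd <- cooks.distance(fit)
n <- nrow(mtcars)

# Flag observations exceeding the 4/n rule of thumb
influential <- which(cd > 4 / n)
cd[influential]
```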

Questions & Challenges

  • I am interested in learning more about non-parametric approaches in future weeks

Connections to Policy

  • It is important to understand the assumptions behind a model. For example, heteroskedasticity is often an indicator of model misspecification – it indicates missing variables that matter more at certain X values. Population alone predicts income well in rural counties, but large urban counties need additional variables (education, industry) to predict accurately.

Reflection

  • Excited to put these skills to the test with the midterm home price prediction challenge!