MUSA 5080 Notes #5
Week 5: Predictive Modeling & Cross-Validation
Date: 10/06/2025
Overview
This week we focused on building predictive models using linear regression with proper validation techniques. We learned about cross-validation, model comparison, diagnostics, and best practices for predictive modeling using Pennsylvania county home value data.
Key Learning Objectives
- Understand predictive modeling workflow
- Learn 10-fold cross-validation for model selection
- Master model diagnostics and validation techniques
- Handle multicollinearity and outliers
- Compare models using appropriate metrics
- Apply feature engineering and transformations
Part 1: Predictive Modeling Fundamentals
What is Predictive Modeling?
Goal: Build a model that can accurately predict outcomes for new, unseen data
Key Principles:
- Generalization: The model should work on new data, not just the training data
- Validation: Test model performance on held-out data
- Comparison: Use consistent metrics to compare different models
- Diagnostics: Check model assumptions and identify problems
The Modeling Workflow
- Data Collection → Gather relevant predictors and target variable
- Exploratory Data Analysis → Understand relationships and distributions
- Feature Engineering → Transform variables for better model performance
- Model Building → Fit multiple models with different predictors
- Model Validation → Use cross-validation to assess performance
- Model Selection → Choose best model based on validation metrics
- Diagnostics → Check assumptions and identify issues
- Final Model → Deploy best-performing model
Part 2: Data Collection and EDA
Pennsylvania County Home Value Challenge
Target Variable: Median home value (B25077_001)

Predictors:
- Total population (B01003_001)
- Median household income (B19013_001)
- Median age (B01002_001)
- Adults with a bachelor's degree (B15003_022, a count)
- Median rent (B25058_001)
- Population below the poverty line (B17001_002, a count)

Note: B15003_022 and B17001_002 are raw counts; to treat them as a true percent or rate, divide by the appropriate denominator (population 25+ and total population, respectively).
# Data collection using tidycensus
library(tidycensus)
library(tidyverse)
challenge_data <- get_acs(
  geography = "county",
  state = "PA",
  variables = c(
    home_value = "B25077_001",      # TARGET: Median home value
    total_pop = "B01003_001",       # Total population
    median_income = "B19013_001",   # Median household income
    median_age = "B01002_001",      # Median age
    percent_college = "B15003_022", # Adults 25+ with a bachelor's degree (count)
    median_rent = "B25058_001",     # Median contract rent
    poverty_rate = "B17001_002"     # Population below the poverty line (count)
  ),
  year = 2022,
  output = "wide"
)

Exploratory Data Analysis
Key EDA Steps:
1. Descriptive Statistics → Understand data distributions
2. Correlation Analysis → Identify relationships between variables
3. Visualization → Plot relationships between predictors and target
4. Distribution Checks → Identify need for transformations
# Correlation analysis
library(corrplot)
cor_data <- challenge_data %>%
  select(home_valueE, total_popE, median_incomeE, median_ageE,
         percent_collegeE, median_rentE, poverty_rateE)

cor_matrix <- cor(cor_data, use = "complete.obs")

corrplot(cor_matrix, method = "color", type = "upper",
         tl.col = "black", tl.srt = 45, addCoef.col = "black")

Key Findings from EDA:
- Strong positive correlation between income and home values
- High correlation between income and rent (potential multicollinearity)
- Right-skewed distributions suggest a need for log transformation
- Some counties may be outliers
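The skew is easy to check directly before committing to a transformation; a minimal sketch using ggplot2 (loaded with the tidyverse above) on the same challenge_data:

# Distribution of the target on the original scale: expect a long right tail
ggplot(challenge_data, aes(x = home_valueE)) +
  geom_histogram(bins = 30) +
  labs(x = "Median home value ($)", y = "Number of counties")

# Same variable on the log scale: should look much more symmetric
ggplot(challenge_data, aes(x = log(home_valueE))) +
  geom_histogram(bins = 30) +
  labs(x = "log(median home value)", y = "Number of counties")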
Part 3: Feature Engineering
Why Transform Variables?
Common Issues:
- Right-skewed distributions → Violate normality assumptions
- Non-linear relationships → Linear models may miss patterns
- Scale differences → Variables on very different scales can dominate
Log Transformation Strategy
# Create log-transformed dataset
data_log <- challenge_data %>%
  mutate(
    log_home_value = log(home_valueE + 1),
    log_total_pop = log(total_popE + 1),
    log_median_income = log(median_incomeE + 1),
    log_median_rent = log(median_rentE + 1),
    log_median_age = log(median_ageE + 1),
    log_percent_college = log(percent_collegeE + 1),
    log_poverty_rate = log(poverty_rateE + 1)
  )

Benefits of Log Transformation:
- Makes distributions more symmetric
- Reduces the impact of extreme values
- Often improves linear relationships
- Makes coefficients interpretable as percentage changes
Note: Adding 1 before log transformation handles zero values
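Base R has a dedicated pair of functions for this pattern; a small sketch (log1p() and expm1() are standard base R):

# log1p(x) computes log(x + 1) with better numerical accuracy near zero;
# expm1(x) is its inverse, exp(x) - 1, used when back-transforming predictions
x <- c(0, 500, 75000)
all.equal(log1p(x), log(x + 1))    # TRUE
all.equal(expm1(log1p(x)), x)      # TRUE: values round-trip back to the originals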
Part 4: Cross-Validation
Why Cross-Validation?
Problem with a Single Train/Test Split:
- Results depend on the random split
- May not be representative of the full dataset
- Limited data for training

Solution: k-Fold Cross-Validation
- Split the data into k folds (typically 5 or 10)
- Train on k-1 folds, test on the remaining fold
- Repeat k times and average the results
- Gives a more robust estimate of model performance
10-Fold Cross-Validation Setup
library(caret)
# Set up 10-fold cross-validation
set.seed(123)
train_control <- trainControl(method = "cv", number = 10)
# Fit model with cross-validation
model <- train(
  log_home_value ~ log_median_income,
  data = data_log,
  method = "lm",
  trControl = train_control
)

Key Parameters:
- method = "cv" → Use cross-validation
- number = 10 → 10 folds
- set.seed(123) → Reproducible results
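The fitted train object stores both the averaged CV metrics and the per-fold results; a quick sketch (these are standard components of a caret train object):

# Averaged cross-validation performance (RMSE, Rsquared, MAE on the log scale)
model$results

# Per-fold performance, useful for judging how variable the estimate is
model$resample

# The underlying lm fit on the full dataset (used later for diagnostics)
summary(model$finalModel)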
Part 5: Model Building Strategy
Progressive Model Building
Strategy: Start simple, add complexity gradually
# Model 1: Income only
model1 <- train(
  log_home_value ~ log_median_income,
  data = data_log,
  method = "lm",
  trControl = train_control
)

# Model 2: + Education
model2 <- train(
  log_home_value ~ log_median_income + log_percent_college,
  data = data_log,
  method = "lm",
  trControl = train_control
)

# Model 3: + Poverty
model3 <- train(
  log_home_value ~ log_median_income + log_percent_college + log_poverty_rate,
  data = data_log,
  method = "lm",
  trControl = train_control
)

# Continue adding variables...

Benefits of Progressive Building:
- Understand the contribution of each variable
- Identify when adding variables stops helping
- Detect multicollinearity issues
- Maintain interpretability
Model Comparison Metrics
Key Metrics:
- RMSE (Root Mean Square Error) → Lower is better
- R² (R-squared) → Higher is better (0-1 scale)
- MAE (Mean Absolute Error) → Lower is better
# Calculate RMSE on original scale
calc_original_rmse <- function(model, data) {
  pred_log <- predict(model, data)        # predictions on the log scale
  pred_original <- exp(pred_log) - 1      # back-transform (undo the log(x + 1))
  actual_original <- data$home_valueE
  sqrt(mean((pred_original - actual_original)^2, na.rm = TRUE))
}

Important: Convert predictions back to the original scale for meaningful interpretation
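caret can also compare the cross-validated metrics of several candidate models side by side; a minimal sketch with resamples() (a standard caret function), assuming model1–model3 from the progressive-building step:

# Collect the per-fold CV results from each candidate model
cv_results <- resamples(list(
  income_only    = model1,
  plus_education = model2,
  plus_poverty   = model3
))

# RMSE, Rsquared, and MAE across folds for each model (on the log scale)
summary(cv_results)

For a strictly paired comparison, the same fold assignments can be fixed up front via the index argument of trainControl().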
Part 6: Model Diagnostics
Why Model Diagnostics?
Linear Regression Assumptions:
1. Linearity → The relationship between predictors and outcome is linear
2. Independence → Observations are independent
3. Homoscedasticity → Constant variance of residuals
4. Normality → Residuals are normally distributed
5. No multicollinearity → Predictors are not highly correlated
Diagnostic Tools
1. Residual Plots
# Extract the underlying lm fit from the selected caret model
# (best_model is whichever candidate won the comparison above)
final_lm <- best_model$finalModel

# Standard residual plots
par(mfrow = c(2, 2))
plot(final_lm)
par(mfrow = c(1, 1))

What to Look For:
- Residuals vs Fitted: Random scatter around zero
- Normal Q-Q: Points follow the diagonal line
- Scale-Location: Horizontal line with random scatter
- Residuals vs Leverage: No extreme outliers
2. Cook’s Distance
# Identify influential observations
cooks_d <- cooks.distance(final_lm)
influential <- which(cooks_d > 4 / nrow(data_log))

if (length(influential) > 0) {
  cat("Influential observations:", length(influential), "\n")
  print(challenge_data[influential, c("NAME", "home_valueE")])
}

Rule of Thumb: Cook's Distance > 4/n indicates influential observations
3. Variance Inflation Factor (VIF)
library(car)
# Check for multicollinearity (VIF is only meaningful with two or more predictors)
if (ncol(model.matrix(final_lm)) > 2) {
  print(round(vif(final_lm), 2))
}

Interpretation:
- VIF < 5: No multicollinearity concern
- VIF 5-10: Moderate multicollinearity
- VIF > 10: High multicollinearity (consider removing variables)
4. Breusch-Pagan Test
library(lmtest)
# Test for heteroscedasticity
print(bptest(final_lm))

Interpretation:
- p-value > 0.05: No evidence of heteroscedasticity (good)
- p-value < 0.05: Heteroscedasticity present (may need a transformation)
Part 7: Model Refinement
Handling Outliers
When to Remove Outliers:
- High Cook's Distance (> 4/n)
- Clear data entry errors
- Extreme values that don't represent the population

Process:
1. Identify influential observations
2. Remove them from the dataset
3. Refit the model
4. Compare performance
if (length(influential) > 0) {
  data_no_outliers <- data_log[-influential, ]

  # Retrain the best model (best_model = the winner of the comparison step)
  model_no_outliers <- train(
    formula(best_model),
    data = data_no_outliers,
    method = "lm",
    trControl = train_control
  )

  # Compare performance on the original dollar scale
  rmse_after <- calc_original_rmse(model_no_outliers, data_no_outliers)
  cat("RMSE after removal:", scales::dollar(round(rmse_after, 0)), "\n")
}

Model Selection Criteria
Primary Criterion: Lowest RMSE on the original (dollar) scale

Secondary Criteria:
- Higher R²
- Simpler model (fewer variables)
- Better diagnostics
- More interpretable
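Pulling these criteria into one table makes the choice explicit; a sketch assuming model1–model3 and the calc_original_rmse() helper defined above:

# One row per candidate model: dollar-scale RMSE, CV R-squared, and model size
model_list <- list(income_only = model1, plus_education = model2, plus_poverty = model3)

comparison <- tibble::tibble(
  model        = names(model_list),
  rmse_dollars = sapply(model_list, calc_original_rmse, data = data_log),
  cv_rsquared  = sapply(model_list, function(m) m$results$Rsquared),
  n_predictors = sapply(model_list, function(m) length(coef(m$finalModel)) - 1)
)

dplyr::arrange(comparison, rmse_dollars)   # best (lowest RMSE) model first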
Part 8: Best Practices
Data Preparation
- Handle Missing Values → Use complete cases or imputation (see the sketch after this list)
- Check Data Quality → Look for obvious errors
- Transform Variables → Log transform skewed variables
- Standardize if Needed → For some algorithms
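For the complete-cases route, a minimal sketch using tidyr::drop_na() (tidyr loads with the tidyverse) on the log-transformed dataset from Part 3:

# Keep only counties with no missing values in the modeling columns;
# caret's train() stops with an error on NAs by default (na.action = na.fail)
data_complete <- data_log %>%
  drop_na(log_home_value, log_median_income, log_percent_college, log_poverty_rate)

nrow(data_log) - nrow(data_complete)   # number of counties dropped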
Model Building
- Start Simple → Begin with most important predictors
- Use Cross-Validation → Always validate model performance
- Compare Models → Use consistent metrics
- Check Assumptions → Run diagnostics
Validation
- Use Appropriate Metrics → RMSE for regression, accuracy for classification
- Convert to Original Scale → For interpretable results
- Report Confidence Intervals → Show uncertainty (see the sketch after this list)
- Test on New Data → Final validation
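One concrete way to show that uncertainty is a prediction interval from the underlying lm fit, back-transformed to dollars; a sketch assuming final_lm and the log(x + 1) transformation used earlier:

# 95% prediction intervals on the log scale (for the training counties here;
# pass newdata = ... for genuinely new observations)
pred_int <- predict(final_lm, interval = "prediction", level = 0.95)

# Back-transform the fit and both bounds to dollars (undo log(x + 1))
pred_dollars <- exp(pred_int) - 1
head(pred_dollars)   # columns: fit, lwr, upr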
Interpretation
- Coefficient Interpretation → Coefficients on log-transformed variables read as approximate percentage changes (see the worked example after this list)
- Model Limitations → Acknowledge assumptions and limitations
- Practical Significance → Consider if improvements are meaningful
- Generalizability → Discuss how well model applies to new data
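Worked example of the percentage-change reading, assuming the income-only log-log model from Part 5: if the coefficient on log_median_income is β, a 1% increase in median income is associated with roughly a β% increase in median home value.

# Extract the slope on log income from the fitted model
beta <- coef(model1$finalModel)["log_median_income"]

# Exact effect of a 10% increase in income: home values are multiplied by 1.10^beta
pct_change <- (1.10^beta - 1) * 100
pct_change   # e.g. a value of 12 would mean roughly 12% higher home values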
Key Takeaways
Predictive Modeling Skills
- Data Preparation: Clean, transform, and explore data thoroughly
- Cross-Validation: Always use proper validation techniques
- Model Comparison: Use consistent metrics and systematic approach
- Diagnostics: Check assumptions and identify problems
- Feature Engineering: Transform variables for better model performance
Common Pitfalls
- Overfitting: Model performs well on training data but poorly on new data
- Data Leakage: Letting information that would not be available at prediction time (e.g., from the test folds) influence model training
- Ignoring Assumptions: Not checking linear regression assumptions
- Inappropriate Metrics: Using wrong metrics for model comparison
- Ignoring Outliers: Not identifying and handling influential observations
Model Performance
- RMSE: Primary metric for regression problems
- R²: Proportion of variance explained
- Cross-Validation: Essential for unbiased performance estimates
- Original Scale: Always report results in interpretable units
Next Steps
- Learn about regularization techniques (Ridge, Lasso)
- Explore non-linear models (polynomial, splines)
- Understand ensemble methods
- Practice with different types of data and problems
Resources
- caret package: https://topepo.github.io/caret/
- tidycensus package: https://walker-data.com/tidycensus/
- Linear Regression Diagnostics: https://www.statmethods.net/stats/regression.html
- Cross-Validation: https://en.wikipedia.org/wiki/Cross-validation_(statistics)
- Model Selection: https://www.stat.cmu.edu/~cshalizi/350/lectures/26/lecture-26.pdf