Week 5 Notes - Introduction to Linear Regression
Key Concepts Learned
The Statistical Learning Framework
- Statistical learning = a set of approaches for estimating the relationship between predictors X and a response Y
- Y = f(X) + ε, where:
  - f = the systematic information X provides about Y
  - ε = random error (irreducible)
Why estimate f?
1. Prediction (accuracy of predictions)
2. Inference (interpreting the model)

How do we estimate f?
1. Parametric methods
   - Make an assumption about the functional form (e.g., linear)
   - Reduces the problem to estimating a few parameters
   - Easier to interpret
   - This is what we'll focus on
2. Non-parametric methods
   - Don't assume a specific form; let the data determine the shape of f
   - More flexible, but require more data
   - Harder to interpret

Note: machine learning methods can be either parametric (e.g., neural networks) or non-parametric (e.g., k-nearest neighbors).
Parametric Approach: Linear Regression
- Y ≈ β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ
- Task: estimate the β coefficients using our sample data
- Method: Ordinary Least Squares (OLS)
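Concretely, OLS chooses the β̂ estimates that minimize the residual sum of squares on the sample:

RSS = Σᵢ (yᵢ − β̂₀ − β̂₁xᵢ₁ − … − β̂ₚxᵢₚ)²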
Statistical Significance
The logic:
1. Null hypothesis (H₀): β₁ = 0 (no relationship)
2. Our estimate: β̂₁ = 0.02
3. Question: could we get 0.02 just by chance if H₀ is true?

- t-statistic: how many standard errors the estimate is from 0, i.e. t = β̂₁ / SE(β̂₁)
  - Bigger |t| = more confidence the relationship is real
- p-value: probability of seeing an estimate at least this extreme if H₀ is true
  - Small p → reject H₀, conclude a relationship exists
How Well Does the Model Perform?
1. How well does it fit the data we used? (in-sample fit)
   - R²
     - R² alone doesn't tell us if the model is trustworthy
   - Scenarios: underfitting (model too simple, high bias); good fit (captures patterns without noise); overfitting (memorizes training data, high variance)
   - An overfit model might not generalize well because it follows noise
2. How well would it predict new data? (out-of-sample performance)
Checking Assumptions
Assumption 1: Linearity
- What we assume: the relationship is actually linear
- How to check: residual plot (with ggplot); see the sketch below
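A minimal sketch of the check, assuming ggplot2 is installed and using the model1 object fit in the Coding Techniques section below:

library(ggplot2)

# Residuals vs. fitted values: a patternless horizontal band suggests
# linearity; systematic curvature suggests the relationship is not linear
ggplot(data.frame(fitted = fitted(model1), resid = resid(model1)),
       aes(x = fitted, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals")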
Assumption 2: Constant Variance
- Heteroscedasticity (unequal variance): the error variance changes across X
- Impact: standard errors are wrong → p-values misleading
- Formal test: Breusch-Pagan; p < 0.05 is evidence of heteroscedasticity
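A sketch of the formal test, assuming the lmtest package is installed:

library(lmtest)

# Breusch-Pagan test on the fitted model; a small p-value (< 0.05)
# is evidence against constant variance
bptest(model1)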
Assumption 3: Normality of Residuals
- What we assume: residuals are normally distributed
- Why it matters:
  - Less critical for point predictions (estimates are unbiased regardless)
  - Important for confidence intervals and prediction intervals
  - Needed for valid hypothesis tests (t-test, F-test)
- How to check: Q-Q plot
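A base-R sketch of the Q-Q check on model1:

# Q-Q plot: residual quantiles vs. normal quantiles; points hugging
# the reference line suggest approximately normal residuals
qqnorm(resid(model1))
qqline(resid(model1))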
Assumption 4: No Multicollinearity
- Predictors shouldn't be too correlated with each other
- Check: vif() # Variance Inflation Factor, from the car package
- Rule of thumb: VIF > 10 suggests problems
- Why it matters: coefficients become unstable and hard to interpret
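A sketch, assuming the car package; VIF is only defined when a model has at least two predictors, so median_ageE below is a hypothetical second variable used purely for illustration:

library(car)

# VIF needs two or more predictors; median_ageE is a hypothetical
# second predictor, not a variable from these notes
model2 <- lm(median_incomeE ~ total_popE + median_ageE, data = pa_data)
vif(model2)  # values > 10 suggest problematic collinearity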
Assumption 5: No Influential Outliers
- Not all outliers are problems - only those with high leverage AND large residuals
- Visual diagnostic: identify influential points
- High leverage + large residual = pulls the regression line
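A base-R sketch of the visual diagnostic:

# Residuals vs. leverage, with Cook's distance contours; points far to
# the right with large residuals are potentially influential
plot(model1, which = 5)

# Cook's distance for each observation; a common rule of thumb flags
# values above 4/n
which(cooks.distance(model1) > 4 / nrow(pa_data))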
Improving the Model
- Adding more predictors
- Log transformations
- Categorical variables (all three are combined in the sketch below)
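A hypothetical sketch combining all three ideas; median_ageE and county_type are assumed columns for illustration, not variables from these notes:

# Log-transform the skewed outcome, add a second numeric predictor,
# and include a categorical predictor (R expands factors to dummy variables)
model_improved <- lm(log(median_incomeE) ~ total_popE + median_ageE +
                       factor(county_type),
                     data = pa_data)
summary(model_improved)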
Coding Techniques
Building a First Model: the linear model function, lm()
# Fit the model
model1 <- lm(median_incomeE ~ total_popE, data = pa_data)
summary(model1)
Interpreting coefficients:
- p-value < 0.001 → very unlikely to see this estimate if the true β₁ = 0
- So we can reject the null hypothesis
Train/Test Split
- Solution: hold out some data to test predictions
# The 70% training sample is chosen randomly by the computer;
# set.seed(123) makes the split reproducible (same rows every run)
set.seed(123)
n <- nrow(pa_data)
# 70% training, 30% testing
train_indices <- sample(1:n, size = 0.7 * n)
train_data <- pa_data[train_indices, ]
test_data <- pa_data[-train_indices, ]
# Fit on training data only
model_train <- lm(median_incomeE ~ total_popE, data = train_data)
# Predict on test data
test_predictions <- predict(model_train, newdata = test_data)
Cross-Validation
- A better approach: average performance over multiple train/test splits

library(caret)
# 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)
cv_model <- train(median_incomeE ~ total_popE,
                  data = pa_data,
                  method = "lm",
                  trControl = train_control)
cv_model$results  # performance averaged across the 10 folds
Key metrics: RMSE, R², MAE
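For reference: RMSE = √(mean((y − ŷ)²)) penalizes large errors more heavily than MAE = mean(|y − ŷ|), and both are in the units of Y, which makes them easier to interpret than R².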
# Calculate prediction error (RMSE)
rmse_test <- sqrt(mean((test_data$median_incomeE - test_predictions)^2))
rmse_train <- summary(model_train)$sigma
cat("Training RMSE:", round(rmse_train, 0), "\n")
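For comparison with out-of-sample error, the same report for the held-out data:

cat("Test RMSE:", round(rmse_test, 0), "\n")
# A test RMSE noticeably above the training RMSE is a sign of overfitting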
Summary: The Regression Workflow
- Understand the framework: What’s f? What’s the goal?
- Visualize first: Does a linear model make sense?
- Fit the model: Estimate coefficients
- Evaluate performance: Train/test split, cross-validation
- Check assumptions: Residual plots, VIF, outliers
- Improve if needed: Transformations, more variables
- Consider ethics: Who could be harmed by this model?