
MUSA 5080 Notes #6

Week 6: Spatial Machine Learning & Advanced Regression

Author: Fan Yang
Published: October 13, 2025

Note

Week 6: Spatial Machine Learning & Advanced Regression
Date: 10/13/2025

Overview

This week we expanded our regression toolkit to include categorical variables, interactions, polynomial terms, and spatial features. We learned how to build comprehensive hedonic models that account for both structural characteristics and spatial relationships, using Boston housing data as our case study.

Key Learning Objectives

  • Understand how to incorporate categorical variables (dummy variables) in regression
  • Learn to model interaction effects when relationships vary by group
  • Master polynomial terms for non-linear relationships
  • Create spatial features using buffers, k-NN, and distance calculations
  • Apply fixed effects to control for unmeasured neighborhood characteristics
  • Handle categorical variables in cross-validation properly

Part 1: Expanding Your Regression Toolkit

Categorical Variables (Dummy Variables)

Understanding Dummy Variables

When you include a categorical variable in a model, R automatically creates binary indicators:

# Ensure name is a factor
boston.sf <- boston.sf %>%
  mutate(name = as.factor(name))

# Fit model with neighborhood fixed effects
model_neighborhoods <- lm(SalePrice ~ LivingArea + name, 
                          data = boston.sf)

How R handles this:

  • Creates binary indicators: nameBack_Bay, nameBeacon_Hill, etc.
  • Each is 1 if the sale is in that neighborhood, 0 otherwise
  • One neighborhood is automatically chosen as the reference category (omitted)

The (n-1) Rule

Important: R picks one category as the reference (usually first alphabetically)

  • All other coefficients are interpreted relative to the reference
  • Reference category gets the intercept
  • Other categories get intercept + their coefficient
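
If you want a different baseline, you can set the reference level yourself; a minimal sketch using relevel() (assuming the level is spelled "Back Bay"):

# Make Back Bay the reference instead of the alphabetical default
boston.sf <- boston.sf %>%
  mutate(name = relevel(name, ref = "Back Bay"))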

Interpreting Neighborhood Dummies

Example coefficients:

  • Intercept (Allston): Base price in the reference neighborhood
  • nameBack_Bay: +$X premium vs. Allston (all else equal)
  • nameRoxbury: -$Y discount vs. Allston (all else equal)

Key point: These capture ALL unmeasured neighborhood characteristics (schools, prestige, amenities, etc.)

Interaction Effects

When Relationships Depend on Other Variables

The Question: Does the effect of one variable depend on the level of another variable?

Example: Does square footage matter more in wealthy neighborhoods?

Mathematical Form:

SalePrice = β₀ + β₁(LivingArea) + β₂(WealthyNeighborhood) + 
            β₃(LivingArea × WealthyNeighborhood) + ε

Creating Interaction Terms

# Define wealthy neighborhoods
wealthy_hoods <- c("Back Bay", "Beacon Hill", "South End", "Bay Village")

# Create binary indicator
boston.sf <- boston.sf %>%
  mutate(
    wealthy_neighborhood = ifelse(name %in% wealthy_hoods, 
                                   "Wealthy", "Not Wealthy"),
    wealthy_neighborhood = as.factor(wealthy_neighborhood)
  )

# Model without interaction (parallel slopes)
model_no_interact <- lm(SalePrice ~ LivingArea + wealthy_neighborhood, 
                        data = boston.sf)

# Model with interaction (different slopes)
model_interact <- lm(SalePrice ~ LivingArea * wealthy_neighborhood, 
                     data = boston.sf)

What the interaction allows:

  • Living area can have different effects in different neighborhoods
  • Both intercept AND slope differ
  • Non-parallel lines on a plot (see the sketch below)
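
To see the non-parallel lines, a quick plotting sketch (assuming ggplot2 is installed; geom_smooth() fits one line per color group):

library(ggplot2)

ggplot(st_drop_geometry(boston.sf),
       aes(x = LivingArea, y = SalePrice, color = wealthy_neighborhood)) +
  geom_point(alpha = 0.3) +               # raw sales
  geom_smooth(method = "lm", se = FALSE)  # separate fitted line per group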

Interpreting Interaction Coefficients

Coefficient breakdown:

  • Intercept (Not Wealthy): Base price in non-wealthy areas
  • LivingArea (Not Wealthy): $/sq ft in non-wealthy areas
  • wealthy_neighborhoodWealthy: Premium intercept in wealthy areas
  • LivingArea:wealthy_neighborhoodWealthy: Extra $/sq ft in wealthy areas

Key insight: The interaction term tells us how much MORE each sq ft is worth in wealthy neighborhoods!
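
As a quick sketch, the group-specific slopes can be recovered straight from the fitted coefficients (names follow R's default factor coding, as listed above):

coefs <- coef(model_interact)

# $/sq ft in non-wealthy areas (baseline slope)
slope_not_wealthy <- coefs["LivingArea"]

# $/sq ft in wealthy areas = baseline slope + interaction term
slope_wealthy <- coefs["LivingArea"] +
  coefs["LivingArea:wealthy_neighborhoodWealthy"]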

When NOT to Use Interactions

  • Small samples: Need sufficient data in each group
  • Overfitting: Too many interactions make models unstable
  • No theoretical reason: Don’t add interactions just because you can

Polynomial Terms: Non-Linear Relationships

When Straight Lines Don’t Fit

Signs of non-linearity:

  • Curved residual plots
  • Diminishing returns
  • U-shaped or inverted-U patterns
  • Theoretical reasons for non-linear relationships

Example: House age, where value first depreciates and then gains a vintage premium

Creating Polynomial Terms

# Calculate age from year built
boston.sf <- boston.sf %>%
  mutate(Age = 2025 - YR_BUILT) %>%
  filter(Age < 2000)  # Remove outliers

# Linear model (baseline)
model_age_linear <- lm(SalePrice ~ Age + LivingArea, data = boston.sf)

# Quadratic model (polynomial)
model_age_quad <- lm(SalePrice ~ Age + I(Age^2) + LivingArea, 
                     data = boston.sf)

The I() Function:

  • Why I(Age^2) instead of just Age^2?
  • In R formulas, ^ has a special meaning: it expands terms, e.g. (a + b)^2 becomes a + b + a:b
  • I() tells R: “interpret this literally, compute Age²”

Interpreting Polynomial Coefficients

Model equation:

Price = β₀ + β₁×Age + β₂×Age² + β₃×LivingArea

⚠️ Can’t interpret coefficients directly!

With Age², the effect of age is no longer constant. You need to calculate the marginal effect:

Marginal effect of Age = β₁ + 2×β₂×Age

This means the effect changes at every age!
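
A small sketch that evaluates this, pulling β₁ and β₂ from the fitted model:

b <- coef(model_age_quad)

# Marginal effect of Age = β₁ + 2·β₂·Age, evaluated at 10, 50, and 100 years
ages <- c(10, 50, 100)
b["Age"] + 2 * b["I(Age^2)"] * ages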

Comparing Linear vs. Quadratic

Use F-test to compare:

anova(model_age_linear, model_age_quad)

If the quadratic is significant:

  • A non-linear relationship exists
  • The quadratic model fits better
  • The R² improvement shows how much better
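
An equivalent way to fit the quadratic is poly(); with raw = TRUE the coefficients match the I(Age^2) version:

# Same quadratic model written with poly()
model_age_quad2 <- lm(SalePrice ~ poly(Age, 2, raw = TRUE) + LivingArea,
                      data = boston.sf)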

Part 2: Why Space Matters

Tobler’s First Law of Geography

“Everything is related to everything else, but near things are more related than distant things” - Waldo Tobler, 1970

What This Means for House Prices

  • Crime nearby matters more than crime across the city
  • Parks within walking distance affect value
  • Your immediate neighborhood defines your market

The Challenge: How do we quantify “nearbyness” in a way our regression model can use?

Answer: Create spatial features that measure proximity to amenities/disamenities

Hedonic Model Framework

Hedonic models decompose the price of a good into its characteristics:

Price = f(structural features, location features, neighborhood features)

Components:

  • Structural: Size, age, condition, bedrooms, bathrooms
  • Location: Distance to downtown, transit, amenities
  • Neighborhood: Crime, schools, prestige, demographics

Part 3: Creating Spatial Features

Three Approaches to Spatial Features

1️⃣ Buffer Aggregation

Count or sum events within a defined distance

Example: Number of crimes within 500 feet

# Create buffer features
boston.sf <- boston.sf %>%
  mutate(
    crimes_500ft = lengths(st_intersects(
      st_buffer(geometry, 500),  # 500 feet buffer
      crimes.sf
    )),
    crimes_660ft = lengths(st_intersects(
      st_buffer(geometry, 660),  # 660 feet buffer
      crimes.sf
    ))
  )

Interpretation: crimes_500ft = 5 means 5 crimes occurred within 500 feet of this house

2️⃣ k-Nearest Neighbors (kNN)

Average distance to k closest events

Example: Average distance to 3 nearest violent crimes

# Calculate distance matrix
dist_matrix <- st_distance(boston.sf, crimes.sf)

# Function to get mean distance to k nearest neighbors
get_knn_distance <- function(dist_matrix, k) {
  apply(dist_matrix, 1, function(distances) {
    mean(as.numeric(sort(distances)[1:k]))
  })
}

# Create multiple kNN features
boston.sf <- boston.sf %>%
  mutate(
    crime_nn1 = get_knn_distance(dist_matrix, k = 1),
    crime_nn3 = get_knn_distance(dist_matrix, k = 3),
    crime_nn5 = get_knn_distance(dist_matrix, k = 5)
  )

Interpretation: crime_nn3 = 83.29 means the average distance to the 3 nearest crimes is 83.29 feet

Which k value to use? Check correlation with price to find the relevant “zone of influence”
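
One way to check, as a sketch (dropping the geometry so cor() sees plain numeric columns):

# Correlation of each kNN feature with sale price
boston.sf %>%
  st_drop_geometry() %>%
  summarise(across(c(crime_nn1, crime_nn3, crime_nn5),
                   ~ cor(.x, SalePrice, use = "complete.obs")))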

3️⃣ Distance to Specific Points

Straight-line distance to important locations

Example: Distance to downtown, nearest T station

# Define downtown Boston (Boston Common)
downtown <- st_sfc(st_point(c(-71.0656, 42.3551)), crs = "EPSG:4326") %>%
  st_transform('ESRI:102286')

# Calculate distance from each house to downtown
boston.sf <- boston.sf %>%
  mutate(
    dist_downtown_ft = as.numeric(st_distance(geometry, downtown)),
    dist_downtown_mi = dist_downtown_ft / 5280
  )

Interpretation: dist_downtown_mi = 2.5 means 2.5 miles from downtown
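
For the nearest T station, a hypothetical sketch; stations.sf is an assumed point layer (not shown above) already in the same projected CRS:

# Nearest-station distance = row-wise minimum of the distance matrix
boston.sf <- boston.sf %>%
  mutate(
    dist_station_ft = as.numeric(
      apply(st_distance(geometry, stations.sf), 1, min)
    )
  )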

Model Comparison: Adding Spatial Features

# Model 1: Structural only
model_structural <- lm(SalePrice ~ LivingArea + R_BDRMS + Age, 
                       data = boston.sf)

# Model 2: Add spatial features
model_spatial <- lm(SalePrice ~ LivingArea + R_BDRMS + Age +
                    crimes_500ft + crime_nn3 + dist_downtown_mi,
                    data = boston.sf)

# Compare R²
summary(model_structural)$r.squared
summary(model_spatial)$r.squared

Expected improvement: Spatial features typically add 5-15 percentage points to R²
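
Because the structural model is nested in the spatial one, an F-test gives another comparison (sketch):

# Do the spatial features jointly improve the fit?
anova(model_structural, model_spatial)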

Part 4: Fixed Effects

What Are Fixed Effects?

Fixed Effects = Categorical variables that capture all unmeasured characteristics of a group

In hedonic models:

  • Each neighborhood gets its own dummy variable
  • Captures everything unique about that neighborhood we didn’t explicitly measure
  • School quality, prestige, walkability, job access, cultural amenities

The Code

# Add neighborhood fixed effects
reg5 <- lm(
  SalePrice ~ LivingArea + Age + 
              crimes_500ft + 
              as.factor(name),  # Fixed Effects
  data = boston.sf
)

Behind the scenes, R creates dummies:

  • is_BackBay = 1 if Back Bay, 0 otherwise
  • is_Beacon = 1 if Beacon Hill, 0 otherwise
  • … (R drops one as the reference category)
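
You can peek at the dummy encoding directly, as a sketch:

# First rows of the design matrix R builds for the neighborhood dummies
head(model.matrix(~ as.factor(name), data = st_drop_geometry(boston.sf)))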

Why Use Fixed Effects?

Dramatically Improve Prediction

Model Comparison (R²):

  • Structural only: 0.58
  • + Spatial features: 0.67
  • + Fixed Effects: 0.81 ✓

Why such a big jump?

  • Neighborhoods bundle many unmeasured factors
  • School districts, job access, amenities, “cool factor”

Trade-off: Fixed effects are powerful, but they’re a black box. We don’t know WHY Back Bay commands a premium.

Part 5: Cross-Validation with Categorical Variables

⚠️ The Problem: Sparse Categories

When CV Fails with Categorical Variables

You might see this error:

Error in model.frame.default: 
  factor 'name' has new level 'West End'

What happened?

  1. The random split created 10 folds
  2. All “West End” sales ended up in ONE fold (the test fold)
  3. The training folds never saw “West End”
  4. The model can’t predict for a category it never learned

The Issue: When neighborhoods have very few sales (<10), random CV splits can put all instances in the same fold, breaking the model.

Check Your Data First!

ALWAYS run this before CV with categorical variables:

# Check category sizes
category_check <- boston.sf %>%
  st_drop_geometry() %>%
  count(name) %>%
  arrange(n)

print(category_check)

Rule of Thumb: Categories with n < 10 will likely cause CV problems
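
To list just the problem categories, for example:

# Neighborhoods below the rule-of-thumb threshold
category_check %>% filter(n < 10)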

Solution: Group Small Neighborhoods

Most practical approach:

# Step 1: Add count column
boston.sf <- boston.sf %>%
  add_count(name)

# Step 2: Group small neighborhoods
boston.sf <- boston.sf %>%
  mutate(
    name_cv = if_else(
      n < 10,                       # If fewer than 10 sales
      "Small_Neighborhoods",        # Group them
      as.character(name)            # Keep original
    ),
    name_cv = as.factor(name_cv)
  )

# Step 3: Use grouped version in CV
library(caret)
ctrl <- trainControl(method = "cv", number = 10)

cv_model_fe <- train(
  SalePrice ~ LivingArea + Age + crimes_500ft + 
              as.factor(name_cv),   # Use name_cv, not name!
  data = boston.sf,
  method = "lm",
  trControl = ctrl
)

Trade-off: Lose granularity for small neighborhoods, but avoid CV crashes

Full Model Comparison with CV

library(caret)

# Prep data
boston.sf <- boston.sf %>%
  add_count(name) %>%
  mutate(
    name_cv = if_else(n < 10, "Small_Neighborhoods", as.character(name)),
    name_cv = as.factor(name_cv)
  )

ctrl <- trainControl(method = "cv", number = 10)

# Compare models
cv_structural <- train(
  SalePrice ~ LivingArea + Age + R_FULL_BTH,
  data = boston.sf, method = "lm", trControl = ctrl
)

cv_spatial <- train(
  SalePrice ~ LivingArea + Age + R_FULL_BTH + crimes_500ft + crime_nn3,
  data = boston.sf, method = "lm", trControl = ctrl
)

cv_fixedeffects <- train(
  SalePrice ~ LivingArea + Age + R_FULL_BTH + crimes_500ft + crime_nn3 + 
              as.factor(name_cv),
  data = boston.sf, method = "lm", trControl = ctrl
)

# Results
data.frame(
  Model = c("Structural", "+ Spatial", "+ Fixed Effects"),
  RMSE = c(cv_structural$results$RMSE, 
           cv_spatial$results$RMSE, 
           cv_fixedeffects$results$RMSE)
)

Expected Results

Typical pattern:

Model              RMSE      Interpretation
Structural        $533,331   Baseline - just house characteristics
+ Spatial         $500,422   Adding location features helps!
+ Fixed Effects   $347,261   Neighborhoods capture a LOT

Key Insight: Each layer improves out-of-sample prediction, with fixed effects providing the biggest boost

Key Takeaways

Advanced Regression Skills

  1. Categorical Variables: Use as.factor() to create dummy variables automatically
  2. Interactions: Use * to allow relationships to vary by group
  3. Polynomial Terms: Use I(x^2) for non-linear relationships
  4. Spatial Features: Create proximity measures using buffers, k-NN, and distance
  5. Fixed Effects: Control for unmeasured group characteristics
  6. CV with Categories: Always check category sizes and group sparse ones

Model Building Strategy

Progressive approach:

  1. Start with structural variables (baseline)
  2. Add spatial features (location matters!)
  3. Add fixed effects (neighborhoods matter a lot!)
  4. Consider interactions and polynomials (if theoretically justified)
  5. Validate with cross-validation

Common Pitfalls

  • Sparse categories in CV: Always check and group small categories
  • Ignoring spatial relationships: Location is crucial for housing prices
  • Overfitting with interactions: Only add if you have theory and data
  • Wrong CRS for distance calculations: Always transform to projected CRS
  • Not handling outliers: Check distributions and consider transformations

Model Performance Insights

  • Structural variables: Explain ~50-60% of variation
  • Spatial features: Add 5-15 percentage points of R²
  • Fixed effects: Add 10-20 percentage points (biggest boost!)
  • Combined: Can reach 80%+ R² for housing price models

Next Steps

  • Practice building comprehensive hedonic models
  • Experiment with different spatial feature definitions
  • Learn about spatial autocorrelation and its implications
  • Explore regularization techniques (Ridge, Lasso) for feature selection
  • Apply these techniques to your own research questions

Resources

  • sf package: https://r-spatial.github.io/sf/
  • caret package: https://topepo.github.io/caret/
  • Hedonic Modeling: https://en.wikipedia.org/wiki/Hedonic_regression
  • Tobler’s First Law: https://en.wikipedia.org/wiki/Tobler%27s_first_law_of_geography
  • Cross-Validation with Categorical Variables: Best practices for handling sparse categories