Week 6 Notes - Spatial Machine Learning & Advanced Regression

Published

October 13, 2025

Converting to Spatial Data

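Step 1: Make your data spatial

Use st_as_sf() to turn a table with coordinate columns into an sf object, then st_transform() to reproject it into a projected CRS suited to distance calculations.

Step 2: Spatial Join with Neighborhoods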
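A minimal sketch of Steps 1 and 2 with the sf package. The sales table, its column names, the two rectangular "neighborhoods", and the choice of EPSG:2249 (Massachusetts Mainland State Plane, in feet) are all assumptions made up for illustration, not data from the course.

```r
library(sf)

# Hypothetical sales table; column names and values are invented.
sales <- data.frame(
  sale_id = 1:3,
  price   = c(450000, 620000, 380000),
  lon     = c(-71.06, -71.09, -71.12),
  lat     = c(42.36, 42.34, 42.35)
)

# Step 1: promote the table to an sf object (lon/lat are WGS84, EPSG:4326),
# then reproject to a projected CRS so distances come out in linear units.
sales_sf <- st_as_sf(sales, coords = c("lon", "lat"), crs = 4326)
sales_sf <- st_transform(sales_sf, crs = 2249)  # assumed target CRS

# Hypothetical neighborhood polygons: two adjacent rectangles.
make_box <- function(xmin, xmax, ymin, ymax) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmax, ymin), c(xmax, ymax), c(xmin, ymax), c(xmin, ymin)
  )))
}
neighborhoods <- st_sf(
  name     = c("West", "East"),
  geometry = st_sfc(
    make_box(-71.15, -71.10, 42.30, 42.40),
    make_box(-71.10, -71.00, 42.30, 42.40),
    crs = 4326
  )
)
# Both layers must share a CRS before joining.
neighborhoods <- st_transform(neighborhoods, st_crs(sales_sf))

# Step 2: attach the containing neighborhood's attributes to each sale.
sales_joined <- st_join(sales_sf, neighborhoods, join = st_within)
sales_joined$name
```

After the join, each sale row carries a `name` column that can go into a regression as a categorical variable.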

Expanding Your Regression Toolkit

Continuous Variables

  • Square footage
  • Age of house
  • Income levels
  • Distance to downtown

Categorical Variables

  • Neighborhood
  • School district
  • Building type
  • Has garage? (Yes/No)

Add Dummy (Categorical) Variables to the Model

Interpreting Neighborhood Dummy Variables
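As a rough illustration, the sketch below fits a model with a neighborhood factor on simulated data. The neighborhood names, price premiums, and sample size are invented. R turns the factor into dummy variables automatically and omits the first level, so each neighborhood coefficient reads as the average price difference from that reference neighborhood, holding square footage constant.

```r
set.seed(1)

# Simulated data; neighborhood names and price effects are invented.
n <- 300
homes <- data.frame(
  sqft         = round(runif(n, 800, 3000)),
  neighborhood = sample(c("Allston", "Back Bay", "Roxbury"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
premium <- c("Allston" = 0, "Back Bay" = 150000, "Roxbury" = -40000)
homes$price <- 100000 + 200 * homes$sqft +
  unname(premium[homes$neighborhood]) + rnorm(n, sd = 20000)

# Make the reference category explicit: the first level (Allston) is omitted.
homes$neighborhood <- factor(homes$neighborhood,
                             levels = c("Allston", "Back Bay", "Roxbury"))

fit <- lm(price ~ sqft + neighborhood, data = homes)

# "neighborhoodBack Bay" estimates the Back Bay premium relative to Allston.
round(coef(fit))
```

Changing the reference level (for example with `relevel()`) changes every dummy coefficient but not the model's predictions.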

Interaction Effect: When Relationships Depend

Does the effect of one variable depend on the level of another variable?

Example Scenarios

  • Housing: Does square footage matter more in wealthy neighborhoods?
  • Education: Do tutoring effects vary by initial skill level?
  • Public Health: Do pollution effects differ by age?

Create a Hypothesis

Create the Categories

  • Model 1: No Interaction (Parallel Slopes)
  • Model 2: With Interaction (Different Slopes)
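The two models can be sketched on simulated data as follows. The `wealthy` indicator, the dollar amounts, and the sample size are assumptions chosen so that square footage is worth more in wealthy neighborhoods; none of them come from the course data.

```r
set.seed(2)

# Simulated data: $150 per sq ft everywhere, plus an extra $100 per sq ft
# (and a $30,000 level shift) in wealthy neighborhoods. All values invented.
n <- 400
homes <- data.frame(
  sqft    = runif(n, 800, 3000),
  wealthy = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
homes$price <- 50000 + 150 * homes$sqft +
  ifelse(homes$wealthy == "Yes", 30000 + 100 * homes$sqft, 0) +
  rnorm(n, sd = 25000)

# Model 1: no interaction, so both groups share one sqft slope.
m1 <- lm(price ~ sqft + wealthy, data = homes)

# Model 2: sqft * wealthy expands to sqft + wealthy + sqft:wealthy.
m2 <- lm(price ~ sqft * wealthy, data = homes)

# Slope of sqft in each group, recovered from Model 2.
slope_no  <- coef(m2)[["sqft"]]
slope_yes <- coef(m2)[["sqft"]] + coef(m2)[["sqft:wealthyYes"]]

# Compare fit: adjusted R-squared and an F-test on the added term.
c(m1 = summary(m1)$adj.r.squared, m2 = summary(m2)$adj.r.squared)
anova(m1, m2)
```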

Interpreting the Interaction Coefficients

  • Breaking Down the Coefficient (Explain)
  • Visualizing the Interaction Effect
  • Compare Model Performance
  • Policy Implications
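One way to break down the interaction coefficient, assuming the housing example with a 0/1 indicator Wealthy (the symbols here are illustrative notation, not taken from the course slides):

```latex
\begin{aligned}
\text{Price}_i &= \beta_0 + \beta_1\,\text{SqFt}_i + \beta_2\,\text{Wealthy}_i
  + \beta_3\,(\text{SqFt}_i \times \text{Wealthy}_i) + \varepsilon_i \\
\text{Wealthy}_i = 0:\quad \text{Price}_i &= \beta_0 + \beta_1\,\text{SqFt}_i + \varepsilon_i \\
\text{Wealthy}_i = 1:\quad \text{Price}_i &= (\beta_0 + \beta_2) + (\beta_1 + \beta_3)\,\text{SqFt}_i + \varepsilon_i
\end{aligned}
```

Under this setup, β1 is the value of an extra square foot in non-wealthy neighborhoods, β3 is how much that value changes in wealthy ones, and β2 is the gap between the two lines at zero square feet, which is rarely meaningful on its own.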

When NOT to Use Interaction

  • Small samples: Need sufficient data in each group
  • Overfitting: Too many interactions make models unstable

Polynomial Terms: Non-Linear Relationships

Signs of Non-Linearity:

  • Curved residual plots
  • Diminishing returns
  • Accelerating effects
  • U-shaped or inverted-U patterns
  • Theoretical reasons

Examples:

  • House age: depreciation then vintage premium
  • Test scores: plateau after studying
  • Advertising: diminishing returns
  • Crime prevention: early gains, then plateaus

The U-Shaped Age Effect

  • Create Age Variable
  • First: Linear Model (Baseline)
  • Visualize
  • Add Polynomial Term: Age Squared
  • Interpreting Polynomial Coefficients
  • Compare Model Performance
  • Check Residual Plot
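The steps above can be sketched on simulated data. The reference year (2025), the coefficients, and the noise level are assumptions chosen to produce a U-shape with its minimum at age 60. With a quadratic term, the marginal effect of age is b1 + 2 * b2 * age, so the curve turns at age = -b1 / (2 * b2).

```r
set.seed(3)

# Create the age variable from a hypothetical year_built column.
n <- 500
homes <- data.frame(year_built = sample(1900:2020, n, replace = TRUE))
homes$age <- 2025 - homes$year_built  # reference year is an assumption

# Simulated U-shape: depreciation first, then a vintage premium.
homes$price <- 500000 - 3000 * homes$age + 25 * homes$age^2 +
  rnorm(n, sd = 15000)

# Baseline linear model, then the model with age squared.
# I() protects ^ so it means arithmetic squaring inside the formula.
linear <- lm(price ~ age, data = homes)
quad   <- lm(price ~ age + I(age^2), data = homes)

# Age at which the predicted price stops falling and starts rising.
b <- coef(quad)
turning_age <- -b[["age"]] / (2 * b[["I(age^2)"]])

# Compare model performance.
c(linear    = summary(linear)$adj.r.squared,
  quadratic = summary(quad)$adj.r.squared)

# Residual plot: curvature left in the linear model should be gone here.
if (interactive()) {
  plot(fitted(quad), resid(quad), xlab = "Fitted", ylab = "Residual")
  abline(h = 0, lty = 2)
}
```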

Creating Spatial Features

Three Approaches to Spatial Features

  1. Buffer Aggregation
     • Count or sum events within a defined distance
     • Ex: Number of crimes within 500 feet
  2. k-Nearest Neighbors (kNN)
     • Average distance to k closest events
     • Ex: Average distance to 3 nearest violent crimes
  3. Distance to Specific Points
     • Straight-line distance to important locations
     • Ex: Distance to downtown, nearest T station
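A minimal sketch of the three approaches using sf. The point coordinates are toy values, and EPSG:2249 (a projected CRS in feet) is an assumed choice so that a distance of 500 means 500 feet; the helper `pts()` is hypothetical and exists only to keep the example short.

```r
library(sf)

crs_ft <- 2249  # assumed projected CRS with units in feet

# Hypothetical helper: build an sf point layer from planar coordinates.
pts <- function(x, y) {
  st_as_sf(data.frame(x = x, y = y), coords = c("x", "y"), crs = crs_ft)
}

houses   <- pts(c(1000, 5000), c(1000, 5000))
crimes   <- pts(c(1100, 1000, 1400, 2000), c(1000, 1300, 1000, 1000))
downtown <- pts(1000, 4000)

# 1. Buffer aggregation: count crimes inside a 500 ft buffer of each house.
buffers <- st_buffer(houses, dist = 500)
houses$crimes_500ft <- lengths(st_intersects(buffers, crimes))

# 2. kNN: mean distance from each house to its k nearest crimes.
d <- st_distance(houses, crimes)                  # houses x crimes, with units
d <- matrix(as.numeric(d), nrow = nrow(houses))   # drop units, keep the shape
k <- 3
houses$crime_nn3 <- apply(d, 1, function(row) mean(sort(row)[1:k]))

# 3. Distance to a specific point: straight-line distance to downtown.
houses$dist_downtown <- as.numeric(st_distance(houses, downtown))

st_drop_geometry(houses)
```

The full distance matrix is fine for small layers; for large ones a dedicated nearest-neighbor search is usually a better fit than computing every pairwise distance.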

Fixed Effects

Fixed Effects = Categorical variables that capture all unmeasured characteristics of a group

In hedonic models:

  • Each neighborhood gets its own dummy variable
  • Captures everything unique about that neighborhood we didn’t explicitly measure
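Written out, a hedonic model with neighborhood fixed effects might look like the following. The notation is an assumption for illustration: X are the measured house characteristics, J is the number of neighborhoods, and neighborhood 1 is the omitted reference category.

```latex
\text{Price}_i = \beta_0 + \sum_{k} \beta_k X_{ik}
  + \sum_{j=2}^{J} \gamma_j \,\mathbb{1}[\text{Neighborhood}_i = j]
  + \varepsilon_i
```

Each γ_j is the average price difference between neighborhood j and the reference neighborhood after accounting for X, so it absorbs anything constant within a neighborhood, measured or not.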

Cross-Validation (with Categorical Variables)

Three common validation approaches:

  1. Train/Test Split
  2. k-Fold Cross-Validation
  3. LOOCV (leave-one-out cross-validation)
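A hand-rolled sketch of k-fold cross-validation in base R, on simulated data (all values invented). A train/test split is the same idea with a single held-out set, and LOOCV is the special case where k equals the number of observations. Packages such as caret wrap this loop, but writing it out shows what is being averaged.

```r
set.seed(4)

# Simulated data for illustration only.
n <- 200
homes <- data.frame(sqft = runif(n, 800, 3000))
homes$price <- 100000 + 200 * homes$sqft + rnorm(n, sd = 30000)

# Randomly assign each row to one of k folds of equal size.
k <- 5
fold <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other k - 1 folds, score on the held-out fold.
rmse_by_fold <- sapply(1:k, function(i) {
  train <- homes[fold != i, ]
  test  <- homes[fold == i, ]
  fit   <- lm(price ~ sqft, data = train)
  sqrt(mean((test$price - predict(fit, newdata = test))^2))
})

# The cross-validated error is the average across folds.
cv_rmse <- mean(rmse_by_fold)
cv_rmse
```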

The Problem: Sparse Categories

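When CV Fails with Categorical Variables:

  • A category with only a few observations gives the model too little data to estimate its effect well
  • If every observation of a category lands in the test fold, the model can't predict for a category it never learned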
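The second failure can be reproduced with a toy example (made-up numbers, prices in thousands). The training data contains neighborhoods A and B only, so `predict()` stops with an error when asked about neighborhood C.

```r
# Toy training fold: only neighborhoods A and B are present.
train <- data.frame(
  price        = c(300, 320, 410, 430, 305, 420),
  sqft         = c(1000, 1100, 1500, 1600, 1050, 1550),
  neighborhood = c("A", "A", "B", "B", "A", "B")
)

# Toy test fold: a neighborhood the model never saw.
test <- data.frame(sqft = 1200, neighborhood = "C")

fit <- lm(price ~ sqft + neighborhood, data = train)

# predict() raises an error for the unseen level; capture its message
# instead of letting it stop the script.
result <- tryCatch(
  predict(fit, newdata = test),
  error = function(e) conditionMessage(e)
)
result
```

In a k-fold loop this shows up as a fold that fails outright, which is why sparse categories need handling before cross-validating.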

Solution

  • Group Small Neighborhoods
  • Alternative: Drop Sparse Categories
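Both options can be sketched in a few lines of base R. The neighborhood counts, the threshold of 10 sales, and the label "Small Neighborhoods" are assumptions for illustration; the right cutoff depends on the data and the number of folds.

```r
# Toy data: two well-populated neighborhoods and two sparse ones.
homes <- data.frame(
  neighborhood = c(rep("Back Bay", 40), rep("Roxbury", 25),
                   rep("Bay Village", 3), rep("Leather District", 2)),
  stringsAsFactors = FALSE
)

min_n  <- 10  # assumed minimum number of sales per neighborhood
counts <- table(homes$neighborhood)
small  <- names(counts)[counts < min_n]

# Option 1: lump sparse neighborhoods into one shared category.
homes$neighborhood_grouped <- ifelse(
  homes$neighborhood %in% small, "Small Neighborhoods", homes$neighborhood
)
table(homes$neighborhood_grouped)

# Option 2: drop rows that belong to sparse neighborhoods.
homes_dropped <- homes[!homes$neighborhood %in% small, ]
nrow(homes_dropped)
```

Grouping keeps every sale in the model at the cost of a less specific fixed effect, while dropping keeps the categories clean but means the model makes no predictions for those neighborhoods.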