Week 6 Notes - Spatial Machine Learning & Advanced Regression

Published

October 13, 2025

Converting to Spatial Data

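Step 1: Make your data spatial

Use st_as_sf() to turn a table with coordinate columns into an sf object, then st_transform() to reproject it into a coordinate reference system suited to distance calculations.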

Step 2: Spatial Join with Neighborhoods
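The two steps can be sketched as follows. This is a minimal illustration, not the course's actual data: the sales table, the two rectangular "neighborhoods", and the choice of EPSG:2249 (a state plane CRS measured in feet) are all assumptions made for the example.

```r
library(sf)

# Hypothetical sales table with longitude/latitude columns
homes <- data.frame(
  id    = 1:3,
  price = c(450000, 610000, 380000),
  lon   = c(-71.060, -71.085, -71.040),
  lat   = c(42.355, 42.345, 42.365)
)

# Step 1: promote the table to an sf object (WGS84), then project it to
# a CRS measured in feet (EPSG:2249 is an assumption for this sketch)
homes_sf <- st_as_sf(homes, coords = c("lon", "lat"), crs = 4326)
homes_sf <- st_transform(homes_sf, 2249)

# Two made-up rectangular neighborhoods, built in lon/lat
make_box <- function(xmin, ymin, xmax, ymax) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmax, ymin), c(xmax, ymax), c(xmin, ymax), c(xmin, ymin)
  )))
}
nhoods <- st_sf(
  name     = c("West", "East"),
  geometry = st_sfc(
    make_box(-71.10, 42.33, -71.07, 42.38),
    make_box(-71.07, 42.33, -71.03, 42.38),
    crs = 4326
  )
)
# Both layers must share a CRS before joining
nhoods <- st_transform(nhoods, st_crs(homes_sf))

# Step 2: attach the neighborhood each home falls within
homes_joined <- st_join(homes_sf, nhoods, join = st_within)
homes_joined$name
```

After the join, each home row carries the attributes of the polygon that contains it, so `name` can be used directly as a categorical predictor.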

Expanding Your Regression Toolkit

Continuous Variables

Square footage
Age of house
Income levels
Distance to downtown

Categorical Variables

Neighborhood
School district
Building type
Has garage? (Yes/No)

Add Dummy (Categorical) Variables to the Model

Interpreting Neighborhood Dummy Variables
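A minimal sketch of how this looks in R, using simulated data (the neighborhood labels A, B, C and all dollar amounts are invented for illustration). When a factor is passed to lm(), R creates one dummy per level except the first, which becomes the reference category.

```r
set.seed(1)
n <- 300

# Simulated homes in three made-up neighborhoods
homes <- data.frame(
  sqft         = runif(n, 800, 3000),
  neighborhood = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)

# Assumed "true" premiums relative to neighborhood A
premium <- c(A = 0, B = 50000, C = -20000)
homes$price <- 100000 + 150 * homes$sqft +
  unname(premium[as.character(homes$neighborhood)]) +
  rnorm(n, sd = 20000)

# lm() expands the factor into dummies; A (first level) is the reference
fit <- lm(price ~ sqft + neighborhood, data = homes)
round(coef(fit))

# neighborhoodB: expected price difference between B and A,
#                holding square footage constant
# neighborhoodC: same comparison for C versus A
```

Each dummy coefficient is a difference from the reference neighborhood, not an absolute price level. Changing the reference with relevel() changes the coefficients but not the fitted values.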

Interaction Effect: When Relationships Depend

Does the effect of one variable depend on the level of another variable?

Example Scenarios

Housing: Does square footage matter more in wealthy neighborhoods?
Education: Do tutoring effects vary by initial skill level?
Public Health: Do pollution effects differ by age?

Create a Hypothesis

Create the Categories

Model 1: No Interaction (Parallel Slopes)
Model 2: With Interaction (Different Slopes)
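The two models can be sketched on simulated data. Everything here is an assumption for illustration: `wealthy` is a hypothetical 0/1 indicator for the housing scenario above, and the simulated slopes (100 and 180 dollars per square foot) are invented.

```r
set.seed(2)
n <- 400

homes <- data.frame(
  sqft    = runif(n, 800, 3000),
  wealthy = rbinom(n, 1, 0.5)   # hypothetical indicator: 1 = wealthy neighborhood
)
# Simulated truth: an extra square foot is worth more in wealthy areas
homes$price <- 100000 + 100 * homes$sqft + 20000 * homes$wealthy +
  80 * homes$sqft * homes$wealthy + rnorm(n, sd = 25000)

# Model 1: parallel slopes (same sqft effect in both groups)
m1 <- lm(price ~ sqft + wealthy, data = homes)

# Model 2: sqft * wealthy expands to sqft + wealthy + sqft:wealthy
m2 <- lm(price ~ sqft * wealthy, data = homes)

# Slope of sqft in each group under Model 2
slope_not_wealthy <- coef(m2)[["sqft"]]
slope_wealthy     <- coef(m2)[["sqft"]] + coef(m2)[["sqft:wealthy"]]

# Compare fit; the F-test asks whether the interaction term is needed
c(m1 = summary(m1)$adj.r.squared, m2 = summary(m2)$adj.r.squared)
anova(m1, m2)
```

In Model 1 the two fitted lines differ only in intercept. In Model 2 they can also differ in slope, which is what the hypothesis is about.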

Interpreting the Interaction Coefficients

Breaking Down the Coefficient (Explain)
Visualizing the Interaction Effect
Compare Model Performance
Policy Implications
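To break down the interaction coefficient, it can help to write the model out. The notation below is an assumption for the housing scenario, with wealthy as a hypothetical 0/1 indicator:

$$
\text{price} = \beta_0 + \beta_1\,\text{sqft} + \beta_2\,\text{wealthy} + \beta_3\,(\text{sqft} \times \text{wealthy}) + \varepsilon
$$

Setting wealthy to 0 and then to 1 gives one line per group:

$$
\begin{aligned}
\text{wealthy} = 0:&\quad \text{price} = \beta_0 + \beta_1\,\text{sqft} \\
\text{wealthy} = 1:&\quad \text{price} = (\beta_0 + \beta_2) + (\beta_1 + \beta_3)\,\text{sqft}
\end{aligned}
$$

So $\beta_1$ is the square-footage slope in the reference group, $\beta_2$ is the gap between the groups at zero square feet, and $\beta_3$ is the difference in slopes. A positive $\beta_3$ means each additional square foot adds more to price in wealthy neighborhoods.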

When NOT to Use Interactions

Small samples: Need sufficient data in each group
Overfitting: Too many interactions make models unstable

Polynomial Terms: Non-Linear Relationships

Signs of Non-Linearity:

Curved residual plots
Diminishing returns
Accelerating effects
U-shaped or inverted-U patterns
Theoretical reasons

Examples:

House age: depreciation then vintage premium
Test scores: plateau after studying
Advertising: diminishing returns
Crime prevention: early gains, then plateaus

The U-Shaped Age Effect

Create Age Variable
First: Linear Model (Baseline)
Visualize
Add Polynomial Term: Age Squared
Interpreting Polynomial Coefficients
Compare Model Performance
Check Residual Plot
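The steps above can be sketched on simulated data. The year-built values, the reference year 2025, and the simulated U-shape (turning point at age 60) are all assumptions made for the example.

```r
set.seed(3)
n <- 500

# Create age variable from a hypothetical year_built column
homes <- data.frame(year_built = sample(1900:2020, n, replace = TRUE))
homes$age <- 2025 - homes$year_built

# Simulated U-shape: value falls with age, then rises for vintage homes
homes$price <- 400000 - 3000 * homes$age + 25 * homes$age^2 +
  rnorm(n, sd = 20000)

# Baseline: linear in age
m_lin <- lm(price ~ age, data = homes)

# Polynomial: I() protects ^ so it means "square", not a formula operator
m_quad <- lm(price ~ age + I(age^2), data = homes)

# The effect of one more year is b1 + 2 * b2 * age, so it changes with age.
# It is zero at the turning point of the curve:
b <- coef(m_quad)
turning_age <- -b[["age"]] / (2 * b[["I(age^2)"]])

# Compare fit
c(linear = summary(m_lin)$adj.r.squared,
  quadratic = summary(m_quad)$adj.r.squared)

# Residual check: a curved pattern in the linear model's residuals
# should flatten out once the squared term is added
# plot(fitted(m_lin),  resid(m_lin))
# plot(fitted(m_quad), resid(m_quad))
```

Because the age coefficient alone is no longer "the effect of age", the two polynomial coefficients should be interpreted together, for example by reporting the turning point or the marginal effect at a few representative ages.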

Creating Spatial Features

Three Approaches to Spatial Features

Buffer Aggregation

Count or sum events within a defined distance
Ex: Number of crimes within 500 feet

k-Nearest Neighbors (kNN)

Average distance to k closest events
Ex: Average distance to 3 nearest violent crimes

Distance to Specific Points

Straight-line distance to important locations
Ex: Distance to downtown, nearest T station
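A minimal sketch of all three approaches with sf, using made-up point locations. The coordinates are already in a projected CRS measured in feet (EPSG:2249 is assumed here), so distances come back in feet; the `downtown` point and column names are hypothetical.

```r
library(sf)

# Two homes and five crime locations, in projected coordinates (feet)
homes <- st_as_sf(
  data.frame(id = 1:2, x = c(0, 2000), y = c(0, 0)),
  coords = c("x", "y"), crs = 2249
)
crimes <- st_as_sf(
  data.frame(x = c(300, 0, 300, 2100, 5000), y = c(0, 400, 300, 0, 0)),
  coords = c("x", "y"), crs = 2249
)
downtown <- st_sfc(st_point(c(0, 3000)), crs = 2249)

# 1. Buffer aggregation: count crimes inside a 500 ft buffer around each home
buffers <- st_buffer(homes, dist = 500)
homes$crimes_500ft <- lengths(st_intersects(buffers, crimes))

# 2. kNN: average distance to the 3 nearest crimes
d <- st_distance(homes, crimes)                  # homes x crimes matrix
d <- matrix(as.numeric(d), nrow = nrow(homes))   # drop the units class
homes$crime_nn3 <- apply(d, 1, function(row) mean(sort(row)[1:3]))

# 3. Distance to a specific point: straight-line distance to downtown
homes$dist_downtown <- as.numeric(st_distance(homes, downtown))

st_drop_geometry(homes)
```

The buffer count is sensitive to the chosen radius, while the kNN average adapts to local density, so the two can rank the same homes differently. The full distance matrix is fine for small data but grows with homes times events.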

Fixed Effects

Fixed Effects = Categorical variables that capture all unmeasured characteristics of a group

In hedonic models:

Each neighborhood gets its own dummy variable
Captures everything unique about that neighborhood we didn’t explicitly measure
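One way to see what "captures everything we didn't measure" means is a small simulation. This is an illustrative sketch, not real data: `quality` stands in for an unmeasured neighborhood characteristic that raises prices and is also correlated with house size.

```r
set.seed(4)
n_hood <- 20
n      <- 1000

hood    <- sample(n_hood, n, replace = TRUE)
quality <- rnorm(n_hood, sd = 50000)   # unmeasured, one value per neighborhood

# Bigger houses cluster in higher-quality neighborhoods
sqft  <- 1500 + 0.01 * quality[hood] + rnorm(n, sd = 300)
# Simulated truth: each square foot is worth 150
price <- 100000 + 150 * sqft + quality[hood] + rnorm(n, sd = 20000)

d <- data.frame(price = price, sqft = sqft, hood = factor(hood))

# Without fixed effects, sqft absorbs part of the neighborhood quality effect
naive <- coef(lm(price ~ sqft, data = d))[["sqft"]]

# With neighborhood fixed effects (one dummy per neighborhood)
fe <- coef(lm(price ~ sqft + hood, data = d))[["sqft"]]

round(c(naive = naive, fixed_effects = fe), 1)
```

In this setup the fixed-effects estimate should land near the simulated value of 150 while the naive estimate is pulled away from it, because the dummies soak up `quality` even though it never enters the model as a variable.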

Cross-Validation (with Categorical Variables)

Three common validation approaches:

1. Train/Test Split
2. k-Fold Cross-Validation
3. LOOCV
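As a concrete reference point, k-fold cross-validation can be written by hand in a few lines of base R. This is a minimal sketch on simulated data; the choice of k = 5 and RMSE as the error metric are assumptions for the example.

```r
set.seed(5)
n <- 200

homes <- data.frame(sqft = runif(n, 800, 3000))
homes$price <- 100000 + 150 * homes$sqft + rnorm(n, sd = 30000)

k    <- 5
fold <- sample(rep(1:k, length.out = n))   # random, equal-sized folds
rmse <- numeric(k)

for (i in 1:k) {
  train <- homes[fold != i, ]   # fit on k - 1 folds
  test  <- homes[fold == i, ]   # evaluate on the held-out fold
  fit   <- lm(price ~ sqft, data = train)
  pred  <- predict(fit, newdata = test)
  rmse[i] <- sqrt(mean((test$price - pred)^2))
}

cv_rmse <- mean(rmse)   # average out-of-sample error across folds
cv_rmse
```

A single train/test split corresponds to running one iteration of this loop, and LOOCV is the special case where k equals the number of rows.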

The Problem: Sparse Categories

When CV Fails with Categorical Variables

Sparse categories have too few observations for the model to estimate their effect reliably
If a category appears only in the test fold, the model can’t predict for a level it never learned

Solution

Group Small Neighborhoods
Alternative: Drop Sparse Categories
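Both the failure and the two fixes can be sketched in base R. The neighborhood labels, the toy prices, and the minimum-count threshold of 10 are assumptions chosen for illustration.

```r
# The failure: predicting for a level that was not in the training data
train <- data.frame(price = c(300, 320, 410, 430),
                    nhood = c("A", "A", "B", "B"))
test  <- data.frame(nhood = "C")
fit   <- lm(price ~ nhood, data = train)
result <- tryCatch(
  predict(fit, newdata = test),
  error = function(e) "error: unseen level"
)
result

# Made-up data with two large and two very small neighborhoods
homes <- data.frame(
  neighborhood = c(rep("A", 40), rep("B", 35), rep("C", 2), rep("D", 1)),
  stringsAsFactors = FALSE
)

# Find neighborhoods below an assumed minimum of 10 sales
counts <- table(homes$neighborhood)
small  <- names(counts)[counts < 10]

# Solution: group small neighborhoods into one shared category
homes$nhood_grouped <- ifelse(homes$neighborhood %in% small,
                              "Small Neighborhoods",
                              homes$neighborhood)
table(homes$nhood_grouped)

# Alternative: drop the sparse categories entirely
homes_dropped <- homes[!homes$neighborhood %in% small, ]
```

Grouping keeps every observation but assumes the small neighborhoods are similar enough to share one effect. Dropping avoids that assumption at the cost of making no predictions for those areas.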