Week 6 Notes - Spatial Machine Learning & Advanced Regression
Converting to Spatial Data
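Step 1: Make your data spatial
Use st_as_sf() to convert the table into an sf object, then st_transform() to reproject it into the CRS you will analyze in.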
Step 2: Spatial Join with Neighborhoods
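A minimal sketch of the two steps, assuming the sf package. The sales table, the lon/lat column names, the two rectangular "neighborhoods", and the choice of EPSG:2249 (Massachusetts Mainland, US feet) are all hypothetical stand-ins for your own data.

```r
library(sf)

# Hypothetical sales table with longitude/latitude columns (WGS84).
sales <- data.frame(
  sale_id = 1:3,
  price   = c(450000, 610000, 380000),
  lon     = c(-71.060, -71.085, -71.120),
  lat     = c(42.355, 42.345, 42.375)
)

# Step 1: turn the table into point geometries, then reproject.
# EPSG:2249 is an assumed target CRS; use whatever suits your study area.
sales_sf <- st_as_sf(sales, coords = c("lon", "lat"), crs = 4326)
sales_sf <- st_transform(sales_sf, 2249)

# Hypothetical neighborhood polygons: two rectangles drawn in lon/lat.
make_box <- function(xmin, ymin, xmax, ymax) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmax, ymin), c(xmax, ymax), c(xmin, ymax), c(xmin, ymin)
  )))
}
nhoods <- st_sf(
  name = c("East", "West"),
  geometry = st_sfc(
    make_box(-71.10, 42.30, -71.00, 42.40),
    make_box(-71.20, 42.30, -71.10, 42.40),
    crs = 4326
  )
)

# Both layers must share a CRS before joining.
nhoods <- st_transform(nhoods, st_crs(sales_sf))

# Step 2: attach the neighborhood name to each sale that falls inside it.
sales_nhood <- st_join(sales_sf, nhoods, join = st_within)
sales_nhood$name
```

st_join() keeps every sale (a left join by default), so points outside all polygons come back with NA in the joined columns.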
Expanding Your Regression Toolkit
Continuous Variables
- Square footage
- Age of house
- Income levels
- Distance to downtown
Categorical Variables
- Neighborhood
- School district
- Building type
- Has garage? (Yes/No)
Add Dummy (Categorical) Variables to the Model
Interpreting Neighborhood Dummy Variables
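A small sketch of adding a neighborhood dummy and reading its coefficients, using simulated data. The neighborhood names, premiums, and price formula are made up for illustration; only lm() and factor() are real.

```r
set.seed(1)
n <- 300

# Simulated homes in three hypothetical neighborhoods.
homes <- data.frame(
  sqft  = round(runif(n, 800, 3000)),
  nhood = factor(sample(c("Downtown", "Eastside", "Westside"), n, replace = TRUE))
)

# Assumed "true" neighborhood premiums relative to Downtown.
premium <- c(Downtown = 0, Eastside = -50000, Westside = 80000)
homes$price <- 100000 + 150 * homes$sqft +
  unname(premium[as.character(homes$nhood)]) +
  rnorm(n, sd = 20000)

# lm() expands the factor into dummies automatically.
# The first level (Downtown) is the reference category and gets no coefficient.
fit <- lm(price ~ sqft + nhood, data = homes)
round(coef(fit))

# Interpretation: nhoodWestside is the expected price difference between a
# Westside home and a Downtown home of the same square footage.
```

To compare against a different baseline, change the reference level, for example with relevel(homes$nhood, ref = "Westside"), and refit.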
Interaction Effect: When Relationships Depend
Does the effect of one variable depend on the level of another variable?
Example Scenarios
- Housing: Does square footage matter more in wealthy neighborhoods?
- Education: Do tutoring effects vary by initial skill level?
- Public Health: Do pollution effects differ by age?
Create a Hypothesis
Create the Categories
- Model 1: No Interaction (Parallel Slopes)
- Model 2: With Interaction (Different Slopes)
Interpreting the Interaction Coefficients
- Breaking Down the Coefficient (Explain)
- Visualizing the Interaction Effect
- Compare Model Performance
- Policy Implications
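The two models above can be sketched on simulated data as follows. The wealthy indicator, the slopes (150 and 250 dollars per square foot), and the noise level are assumptions chosen so the interaction is visible.

```r
set.seed(2)
n <- 400

# Simulated homes; "wealthy" is a hypothetical neighborhood category.
homes <- data.frame(
  sqft    = runif(n, 800, 3000),
  wealthy = factor(sample(c("No", "Yes"), n, replace = TRUE))
)

# Assumed truth: each square foot is worth more in wealthy neighborhoods.
slope <- ifelse(homes$wealthy == "Yes", 250, 150)
homes$price <- 50000 + slope * homes$sqft + rnorm(n, sd = 30000)

# Model 1: no interaction (parallel slopes).
m1 <- lm(price ~ sqft + wealthy, data = homes)

# Model 2: with interaction (different slopes).
# sqft * wealthy expands to sqft + wealthy + sqft:wealthy.
m2 <- lm(price ~ sqft * wealthy, data = homes)

# Breaking down the coefficients of Model 2:
b <- coef(m2)
slope_not_wealthy <- unname(b["sqft"])                        # baseline slope
slope_wealthy     <- unname(b["sqft"] + b["sqft:wealthyYes"]) # baseline + extra

# Compare model performance.
adj_r2 <- c(m1 = summary(m1)$adj.r.squared, m2 = summary(m2)$adj.r.squared)
adj_r2
anova(m1, m2)  # F-test for whether the interaction term improves the fit
```

The interaction coefficient sqft:wealthyYes is the additional price per square foot in wealthy neighborhoods, not the slope itself; the slope there is the sum of the two terms.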
When NOT to Use Interaction
- Small samples: Need sufficient data in each group
- Overfitting: Too many interactions make models unstable
Polynomial Terms: Non-Linear Relationships
Signs of Non-Linearity:
- Curved residual plots
- Diminishing returns
- Accelerating effects
- U-shaped or inverted-U patterns
- Theoretical reasons
Examples:
- House age: depreciation then vintage premium
- Test scores: plateau after studying
- Advertising: diminishing returns
- Crime prevention: early gains, then plateaus
The U-Shaped Age Effect
- Create Age Variable
- First: Linear Model (Baseline)
- Visualize
- Add Polynomial Term: Age Squared
- Interpreting Polynomial Coefficients
- Compare Model Performance
- Check Residual Plot
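The steps above can be sketched on simulated data. The reference year (2024), the year_built column, and the quadratic price formula are assumptions made up to produce a U-shape.

```r
set.seed(3)
n <- 500

# Create the age variable from a hypothetical year_built column.
homes <- data.frame(year_built = sample(1900:2020, n, replace = TRUE))
homes$age <- 2024 - homes$year_built  # 2024 is an assumed reference year

# Assumed truth: depreciation first, then a vintage premium (U-shape).
homes$price <- 500000 - 6000 * homes$age + 50 * homes$age^2 +
  rnorm(n, sd = 20000)

# First: linear model (baseline).
m_lin <- lm(price ~ age, data = homes)

# Add polynomial term: age squared. I() protects ^ inside a formula.
m_quad <- lm(price ~ age + I(age^2), data = homes)

# Interpreting polynomial coefficients: the marginal effect of age is
# b_age + 2 * b_age2 * age, so the curve turns where that equals zero.
b <- coef(m_quad)
turning_point <- unname(-b["age"] / (2 * b["I(age^2)"]))
turning_point

# Compare model performance.
adj_r2 <- c(linear = summary(m_lin)$adj.r.squared,
            quadratic = summary(m_quad)$adj.r.squared)
adj_r2

# Check residual plot (uncomment when working interactively):
# plot(homes$age, resid(m_lin))   # curved band suggests a missing squared term
# plot(homes$age, resid(m_quad))  # should look like a flat cloud
```

A negative coefficient on age with a positive coefficient on age squared gives the U-shape; the signs reversed would give an inverted U.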
Creating Spatial Features
Three Approaches to Spatial Features
- Buffer Aggregation
  - Count or sum events within a defined distance
  - Ex: Number of crimes within 500 feet
- k-Nearest Neighbors (kNN)
  - Average distance to k closest events
  - Ex: Average distance to 3 nearest violent crimes
- Distance to Specific Points
  - Straight-line distance to important locations
  - Ex: Distance to downtown, nearest T station
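A sketch of the three approaches with the sf package, using a tiny made-up layout so the results can be checked by hand. The coordinates, the downtown point, and the CRS (EPSG:2249, units in US feet) are assumptions.

```r
library(sf)

# Toy points in an assumed projected CRS whose units are feet.
homes <- st_as_sf(
  data.frame(id = 1:2, x = c(0, 5000), y = c(0, 0)),
  coords = c("x", "y"), crs = 2249
)
crimes <- st_as_sf(
  data.frame(x = c(100, 300, 400, 2000, 5200), y = 0),
  coords = c("x", "y"), crs = 2249
)
downtown <- st_sfc(st_point(c(0, 3000)), crs = 2249)  # hypothetical location

# 1. Buffer aggregation: number of crimes within 500 feet of each home.
buffers <- st_buffer(homes, dist = 500)
homes$crimes_500ft <- lengths(st_intersects(buffers, crimes))

# 2. kNN: average distance to the 3 nearest crimes.
# st_distance() returns a homes-by-crimes matrix; strip the units first.
d <- st_distance(homes, crimes)
d <- matrix(as.numeric(d), nrow = nrow(homes))
homes$crime_nn3 <- apply(d, 1, function(row) mean(sort(row)[1:3]))

# 3. Distance to a specific point: straight-line distance to downtown.
homes$dist_downtown <- as.numeric(st_distance(homes, downtown))

st_drop_geometry(homes)
```

The full distance matrix is fine for small data; with many homes and events it grows quickly, so a nearest-neighbor search may be the better tool there.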
Fixed Effects
Fixed Effects = Categorical variables that capture all unmeasured characteristics of a group
In hedonic models:
- Each neighborhood gets its own dummy variable
- Captures everything unique about that neighborhood we didn’t explicitly measure
Cross-Validation (with Categorical Variables)
Three common validation approaches:
1. Train/Test Split
2. k-Fold Cross-Validation
3. LOOCV
The Problem: Sparse Categories
When CV Fails with Categorical Variables
- A category has too few observations, so its coefficient is poorly estimated
- A category appears only in the test fold, so the model cannot predict for a level it never saw in training
Solution
- Group Small Neighborhoods
- Alternative: Drop Sparse Categories
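A sketch of the failure and of the grouping fix, on simulated data. The neighborhood labels, the threshold of 10 sales (min_n), and the "Other" label are assumptions; pick a threshold that fits your data.

```r
set.seed(4)

# Simulated sales: three large neighborhoods, two sparse ones (D and E).
nhood <- c(rep(c("A", "B", "C"), each = 60), rep("D", 6), "E")
homes <- data.frame(nhood = nhood, sqft = runif(length(nhood), 800, 3000))
homes$price <- 100000 + 150 * homes$sqft + rnorm(nrow(homes), sd = 20000)

# The problem: if E only shows up in the test set, predict() fails.
train <- homes[homes$nhood != "E", ]
test  <- homes[homes$nhood == "E", ]
fit   <- lm(price ~ sqft + nhood, data = train)
failure <- tryCatch(predict(fit, newdata = test),
                    error = function(e) conditionMessage(e))
failure  # error message about an unseen factor level

# The solution: group small neighborhoods into one category.
min_n  <- 10  # assumed cutoff
counts <- table(homes$nhood)
small  <- names(counts)[counts < min_n]
homes$nhood_grp <- factor(ifelse(homes$nhood %in% small, "Other", homes$nhood))

# k-fold cross-validation with the grouped variable.
k    <- 5
fold <- sample(rep(1:k, length.out = nrow(homes)))
rmse <- numeric(k)
for (i in 1:k) {
  fit_i   <- lm(price ~ sqft + nhood_grp, data = homes[fold != i, ])
  held    <- homes[fold == i, ]
  pred    <- predict(fit_i, newdata = held)
  rmse[i] <- sqrt(mean((held$price - pred)^2))
}
mean(rmse)

# Alternative: drop sparse categories instead of grouping them.
homes_dropped <- homes[!homes$nhood %in% small, ]
```

Grouping keeps every sale in the sample at the cost of a less specific neighborhood effect; dropping keeps the categories clean but means the model makes no claim about the removed neighborhoods.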