MUSA 5080 Notes #7
Week 7: Model Diagnostics & Spatial Autocorrelation
Week 7: Model Diagnostics & Spatial Autocorrelation
Date: 10/20/2025
Overview
This week we learned how to diagnose spatial patterns in model errors and understand spatial autocorrelation. We explored Moran’s I as a diagnostic tool to identify when our models are missing important spatial relationships, and learned best practices for handling progress bars in rendered documents.
Key Learning Objectives
- Understand spatial patterns in model errors
- Learn to calculate and interpret Moran’s I
- Visualize error patterns spatially
- Understand when spatial autocorrelation indicates model problems
- Apply spatial diagnostics to improve model specification
Part 1: Homework Feedback & Best Practices
Suppressing Progress Bars in Rendered Output
Problem: When using tigris or tidycensus functions, progress bars appear in rendered HTML documents, making them look unprofessional.
Solution 1: Add to each function call
# Add progress = FALSE to EVERY tigris/tidycensus call
tracts <- get_acs(
geography = "tract",
variables = "B01003_001",
state = "PA",
geometry = TRUE,
progress = FALSE # <-- Add this!
)
roads <- roads(state = "PA",
county = "Philadelphia",
progress = FALSE) # <-- Add this!Solution 2: Set globally (Recommended)
# Add this near the top of your .qmd after loading libraries
options(tigris_use_cache = TRUE)
options(tigris_progress = FALSE) # Suppress tigris progress barsAction Required: Always suppress progress bars in your rendered documents for professional output.
Part 2: Understanding Spatial Patterns in Errors
What Are Model Errors?
Prediction error for observation i:
\[e_i = \hat{y}_i - y_i\]
Where: - \(\hat{y}_i\) = predicted value - \(y_i\) = actual value
In R:
# Calculate errors
boston_test <- boston %>%
mutate(
predicted = predict(model, boston),
error = predicted - SalePrice,
abs_error = abs(error),
pct_error = abs(error) / SalePrice
)Good Errors vs. Bad Errors
Random errors (good): - No systematic pattern - Scattered across space - Prediction equally good everywhere - Model captures key relationships
Clustered errors (bad): - Spatial pattern visible - Under/over-predict in specific areas - Model misses something about location - Need more spatial features!
Tobler’s First Law (Revisited)
“Everything is related to everything else, but near things are more related than distant things.” - Waldo Tobler (1970)
Applied to model errors: - If nearby houses have similar errors… - …our model is missing a spatial pattern! - Need to add more spatial features or fixed effects
Visualizing Error Patterns
Map your errors to see patterns:
# Convert to spatial data
boston_test <- boston_test %>%
st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326) %>%
st_transform('ESRI:102286')
# Map errors
ggplot(boston_test) +
geom_sf(aes(fill = error),
shape = 21,
size = 1,
alpha = 0.6,
stroke = 0.2) +
scale_fill_gradient2(
low = "blue",
mid = "white",
high = "red",
midpoint = 0,
limits = c(-300000, 300000),
oob = scales::squish
) +
theme_void()What to look for: - Blue clusters (we under-predict) - Red clusters (we over-predict) - Random scatter (good!)
Scatter Plot: Spatial Lag of Errors
Create the spatial lag:
library(spdep)
# Define neighbors (5 nearest)
coords <- st_coordinates(boston_test)
neighbors <- knn2nb(knearneigh(coords, k=5))
weights <- nb2listw(neighbors, style="W")
# Calculate spatial lag of errors
boston_test$error_lag <- lag.listw(weights, boston_test$error)Then plot:
ggplot(boston_test, aes(x=error_lag, y=error)) +
geom_point(alpha=0.5) +
geom_smooth(method="lm", color="red") +
labs(title="Is my error correlated with neighbors' errors?",
x="Avg error of 5 nearest neighbors",
y="My error")Interpretation: - Positive slope = errors cluster (bad) - No slope = random errors (good)
Part 3: Moran’s I
What is Moran’s I?
Moran’s I measures spatial autocorrelation
Range: -1 to +1
- +1 = Perfect positive correlation (clustering)
- 0 = Random spatial pattern
- -1 = Perfect negative correlation (dispersion)
Formula:
\[I = \frac{n \sum_i \sum_j w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_i \sum_j w_{ij} \sum_i (x_i - \bar{x})^2}\]
Where \(w_{ij}\) = spatial weight between locations i and j
The Intuition Behind Moran’s I
The formula is really just asking:
“When I’m above/below average, are my neighbors also above/below average?”
Breaking it down:
- \((x_i - \bar{x})\) = How far is my house’s error from the mean?
- \((x_j - \bar{x})\) = How far is my neighbor’s error from the mean?
- Multiply them:
- If both positive or both negative → positive product (similar)
- If opposite signs → negative product (dissimilar)
- Sum across all neighbor pairs and normalize
Result: - Lots of positive products → High Moran’s I (clustering) - Products near zero → Low Moran’s I (random) - Negative products → Negative Moran’s I (rare with errors)
Defining “Neighbors”
Different ways to define spatial relationships:
Contiguity: - Polygons that share a border - Queen vs. Rook
Distance: - All within X meters - Fixed threshold
k-Nearest: - Closest k points - Adaptive distance
For point data (houses), use k-nearest neighbors:
# Create 5-nearest neighbors
coords <- st_coordinates(boston_test)
nb <- knn2nb(knearneigh(coords, k=5))
weights <- nb2listw(nb, style="W")Calculating Spatial Lag
Spatial lag = average value of neighbors
Example: 5 houses
| House | Sale Price | 2 Nearest | Spatial Lag |
|---|---|---|---|
| A | $200k | B, C | $275k |
| B | $250k | A, C | $250k |
| C | $300k | B, D | $275k |
| D | $350k | C, E | $350k |
| E | $400k | D | $350k |
In R:
boston$price_lag <- lag.listw(weights, boston$SalePrice)Computing Moran’s I
Calculate Moran’s I for your errors:
# Test for spatial autocorrelation in errors
moran_test <- moran.mc(
boston_test$error, # Your errors
weights, # Spatial weights matrix
nsim = 999 # Number of permutations
)
# View results
moran_test$statistic # Moran's I value
moran_test$p.value # Is it significant?Interpretation: - I > 0 and p < 0.05 → Significant clustering - I ≈ 0 → Random pattern (good!) - I < 0 → Dispersion (rare with errors)
Visualizing Significance
Compare observed I to random permutations:
library(ggplot2)
moran_df <- data.frame(sim_I = moran_test$res[1:999])
ggplot(moran_df, aes(x = sim_I)) +
geom_histogram(binwidth = 0.01, fill = "gray70", color = "white") +
geom_vline(aes(xintercept = moran_test$statistic),
color = "#FA7800", linewidth = 1.5) +
labs(
title = "Observed vs. Expected Moran's I",
subtitle = "Orange line = observed | Gray = random permutations",
x = "Moran's I",
y = "Count"
) +
theme_minimal()What Moran’s I Tells You
Decision Framework:
If Moran’s I is high (errors clustered):
- Add more spatial features
- Try different buffer sizes
- Include more amenities/disamenities
- Create neighborhood-specific variables
- Try spatial fixed effects
- Neighborhood dummies
- Grid cell dummies
- Consider spatial regression models
- Spatial lag model
- Spatial error model
- (Advanced topic, not covered today)
If Moran’s I ≈ 0 (random errors):
✅ Your model adequately captures spatial relationships!
Part 4: Spatial Lag Models vs. Spatial Features
Why Not Spatial Lag Models for Prediction?
The Problems with Spatial Lag for Prediction:
1. Simultaneity Problem - Including \(WY\) (neighbor prices) creates circular logic - My price affects neighbors → neighbors affect me - OLS estimates are biased and inconsistent
2. Prediction Paradox - Need neighbors’ prices to predict my price - But for new developments or future periods, those prices don’t exist yet - Can’t generalize to truly new areas
3. Data Leakage in CV - Geographic CV holds out spatial regions - Spatial lag “leaks” information from test set - Artificially good performance that won’t hold
Our Solution: Spatial Features of X (Not Y)
Instead of modeling dependence in Y (prices), model proximity in X (predictors)
| ❌ Spatial Lag | ✅ Our Approach |
|---|---|
| “Near expensive houses” | “Near low crime areas” |
| Uses neighbor prices | Uses neighbor characteristics |
| Circular logic | Causal mechanism |
| Can’t predict new areas | Generalizes well |
If Moran’s I shows clustered errors:
✅ Add more spatial features (different buffers, more amenities)
✅ Try neighborhood fixed effects
✅ Use spatial cross-validation
❌ Don’t add spatial lag of Y for prediction purposes
The Bottom Line: Both approaches are valid for different goals! Match method to purpose: inference → spatial lag/error models; prediction → spatial features.
Understanding Bias vs. Inconsistency
Biased Estimator: - Definition: Expected value ≠ true parameter - \(E[\hat{\beta}] \neq \beta\) - On average, across all possible samples, your estimate is systematically wrong - More data doesn’t fix it
Inconsistent Estimator: - Definition: Doesn’t converge to true value as n → ∞ - \(\hat{\beta} \not\to \beta \text{ as } n \to \infty\) - Even with infinite data, you won’t get the right answer - The problem doesn’t go away with bigger samples
The Regression Workflow (Updated)
Building the model:
- Visualize relationships
- Engineer features
- Fit the model
- Evaluate performance (RMSE, R²)
- Check assumptions
NEW: Spatial diagnostics:
- Are errors random or clustered?
- Do we predict better in some areas?
- Is there remaining spatial structure?
Why This Matters:
If errors cluster spatially, it suggests: - Missing spatial variables - Misspecified relationships - Non-stationarity (relationships vary across space)
Key Takeaways
Spatial Diagnostics Skills
- Error Visualization: Map residuals to identify spatial patterns
- Spatial Lag: Calculate average of neighbors’ values
- Moran’s I: Measure spatial autocorrelation in errors
- Interpretation: High I = clustering = need more spatial features
- Iterative Improvement: Diagnose → Engineer features → Re-test → Repeat
Model Improvement Strategy
If Moran’s I shows clustering:
- Add more spatial features
- Different buffer sizes
- More amenities/disamenities
- Neighborhood-specific variables
- Try spatial fixed effects
- Neighborhood dummies
- Grid cell dummies
- Document what you try!
If Moran’s I ≈ 0:
✅ Your model adequately captures spatial relationships!
Common Pitfalls
- Ignoring spatial patterns in errors: Always check for clustering
- Using spatial lag of Y for prediction: Creates circular logic and can’t generalize
- Not visualizing errors: Maps reveal patterns that statistics miss
- Forgetting to suppress progress bars: Makes rendered documents look unprofessional
Best Practices
- Always suppress progress bars in rendered documents
- Visualize errors spatially before calculating statistics
- Use Moran’s I as a diagnostic, not just a test
- Iterate: Diagnose → Improve → Re-test
- Document your process: Keep track of what features you try
Next Steps
- Practice calculating Moran’s I on your own models
- Experiment with different neighbor definitions (k-NN, distance, contiguity)
- Learn about local indicators of spatial association (LISA)
- Apply spatial diagnostics to your assignments
- Consider spatial cross-validation for geographic data
Resources
- spdep package: https://r-spatial.github.io/spdep/
- Spatial autocorrelation tutorial: https://mgimond.github.io/Spatial/spatial-autocorrelation.html
- Moran’s I explanation: Understanding the formula and interpretation
- Spatial weights: Different ways to define neighbors