Week 6 Notes - Spatial ML and Advanced Regression

Published

October 13, 2025

Key Concepts Learned

Types of Variables

Continuous: square feet, age of house, income levels, distance to downtown

Categorical: neighborhood, school district, building type, has garage (binary Y/N)

Limitations of Models using only Baseline Indicators

Fixed Effects

  • Definition: Categorical variables that capture all unmeasured characteristics of a group

  • Powerful but they’re a black box 

  • Neighborhoods bundle many unmeasured factors (e.g. school districts, job access, amenities, “cool factor”)

Three Approaches to Spatial Features

  1. Buffer Aggregation
  2. kNN
  3. Distance to Specific Points

Tips for Success


Coding Techniques

  • Make Data with X/Y Coordinates Spatial
Code
# Convert boston data to sf object
boston.sf <- boston |> 
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326) |> 
  st_transform('ESRI:102286')  # MA State Plane (feet)

# Spatial join: Assign each house to its neighborhood
boston.sf <- boston.sf |> 
  st_join(nhoods, join = st_intersects)

# Check results
boston.sf |> 
  st_drop_geometry() |> 
  count(name) |> 
  arrange(desc(n))
  • Add Dummy/Categorical Variables (e.g. Neighborhoods)

The n-1 rule: One neighborhood is automatically chosen as the reference category (omitted)! R picks the first alphabetically unless you specify otherwise.

Code
boston.sf <- boston.sf |> 
  mutate(name = as.factor(name))

# Fit model with neighborhood fixed effects
model_neighborhoods <- lm(SalePrice ~ LivingArea + name, data = boston.sf)
  • Interaction Effects

Code
# Define wealthy neighborhoods based on median prices
wealthy_hoods <- c("Back Bay", "Beacon Hill", "South End", "Bay Village")

# Create binary indicator
boston.sf <- boston.sf |> 
  mutate(
    wealthy_neighborhood = ifelse(name %in% wealthy_hoods, "Wealthy", "Not Wealthy"),
    wealthy_neighborhood = as.factor(wealthy_neighborhood)
  )

# Check the split
boston.sf |> 
  st_drop_geometry() |> 
  count(wealthy_neighborhood)
  • Polynomials – e.g. Square Age Variable because of U-Shaped Effect
    • New Houses (0-20 years): modern amenities, high value, steep depreciation
    • Middle-Aged (20-80 years): wear and tear, lowest value
    • Historic/Vintage (80+ years): historic districts, rising value, “vintage premium”

Code
# Calculate age from year built
boston.sf <- boston.sf |> 
  mutate(Age = 2025 - YR_BUILT) |>  filter(Age <2000)

# Check the distribution of age
summary(boston.sf$Age)

# Add Polynomial Term - use I(Age^2) for Quadratic Model
model_age_quad <- lm(SalePrice ~ Age + I(Age^2) + LivingArea, data = boston.sf)
summary(model_age_quad)$coef

# Compare Linear vs Quadratic Model in terms of R-squared
summary(model_age_quad)$r.squared

# Check Residual Plot
par(mfrow = c(1, 2))

# Linear model residuals
plot(fitted(model_age_linear), residuals(model_age_linear),
     main = "Linear Model Residuals",
     xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red", lty = 2)

# Quadratic model residuals  
plot(fitted(model_age_quad), residuals(model_age_quad),
     main = "Quadratic Model Residuals",
     xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red", lty = 2)
  • Buffer Aggregation
Code
boston.sf <- boston.sf |> 
  mutate(
    crimes.Buffer = lengths(st_intersects(
      st_buffer(geometry, 660),
      crimes.sf
    )),
    crimes_500ft = lengths(st_intersects(
      st_buffer(geometry, 500),
      crimes.sf
    ))
  )
  • kNN
    • The kNN feature with the strongest correlation tells us the relevant “zone of influence” for crime perception
Code
# Calculate distance matrix (houses to crimes)
dist_matrix <- st_distance(boston.sf, crimes.sf)

# Function to get mean distance to k nearest neighbors
get_knn_distance <- function(dist_matrix, k) {
  apply(dist_matrix, 1, function(distances) {
    # Sort and take first k, then average
    mean(as.numeric(sort(distances)[1:k]))
  })
}

# Create multiple kNN features
boston.sf <- boston.sf %>%
  mutate(
    crime_nn1 = get_knn_distance(dist_matrix, k = 1),
    crime_nn3 = get_knn_distance(dist_matrix, k = 3),
    crime_nn5 = get_knn_distance(dist_matrix, k = 5)
  )

# Check results
summary(boston.sf %>% st_drop_geometry() %>% select(starts_with("crime_nn")))

# Which k value correlates most with price?
boston.sf %>%
  st_drop_geometry() %>%
  select(SalePrice, crime_nn1, crime_nn3, crime_nn5) %>%
  cor(use = "complete.obs") %>%
  as.data.frame() %>%
  select(SalePrice)
  • Distance
Code
# Define downtown Boston (Boston Common: 42.3551° N, 71.0656° W)
downtown <- st_sfc(st_point(c(-71.0656, 42.3551)), crs = "EPSG:4326") %>%
  st_transform('ESRI:102286')

# Calculate distance from each house to downtown
boston.sf <- boston.sf |> 
  mutate(
    dist_downtown_ft = as.numeric(st_distance(geometry, downtown)),
    dist_downtown_mi = dist_downtown_ft / 5280
  )

# Summary
summary(boston.sf$dist_downtown_mi)
  • All Spatial Features Together
Code
spatial_summary <- boston.sf |> 
  st_drop_geometry() |> 
  select(crimes.Buffer, crimes_500ft, crime_nn3, dist_downtown_mi) |>
  summary()
spatial_summary

Neighborhood Fixed Effects

  • Fixed Effects = categorical variables that capture all unmeasured characteristics of a group
  • R creates a dummy for each neighborhood automatically
Code
# Add neighborhood fixed effects
reg5 <- lm(
  SalePrice ~ LivingArea + Age + 
              crimes_500ft + 
              parks_nn3 + 
              as.factor(name),  # FE
  data = boston.sf
)

Cross Validation Workflow

Code
category_check <- boston.sf |> 
  st_drop_geometry() |> 
  count(name) |> 
  arrange(n)

print(category_check)
  • Categories with n < 10 for categorical variables will likely cause CV problems

Code
# 1. Check category sizes
boston.sf %>%
  st_drop_geometry() %>%
  count(name) %>%
  arrange(n) %>%
  print()

# 2. Group small neighborhoods if needed
boston.sf <- boston.sf %>%
  add_count(name) %>%
  mutate(
    name_cv = if_else(n < 10, "Small_Neighborhoods", as.character(name)),
    name_cv = as.factor(name_cv)
  )

# 3. Set up CV
ctrl <- trainControl(method = "cv", number = 10)

# 4. Use grouped neighborhoods in ALL models with FE
model <- train(
  SalePrice ~ LivingArea + Age + crimes_500ft + as.factor(name_cv),
  data = boston.sf,
  method = "lm",
  trControl = ctrl
)

# 5. Report
cat("10-fold CV RMSE:", round(model$results$RMSE, 0), "\n")

Questions & Challenges

  • None for this week

Connections to Policy

  • Market Segmentation: property tax assessments should account for neighborhood-specific valuation. Housing assistance needs vary dramatically by area.

Reflection

  • Fixed Effects are powerful but they’re a black box. When I work on housing policy and housing markets, we often describe neighborhood fixed effects but I have never really quantified them.