Spatial Predictive Modeling – Sanitation 311 Requests

Author

Jingqi Lu

Published

December 8, 2025

Step 1: Choose sanitation complaints as the prediction data

In this step, I subset the raw 311 sanitation requests to records from 2017 with valid geographic coordinates and project them to ESRI:102271, creating a clean point dataset of 19,733 violations for spatial analysis.

Getting the Data

Code
# Packages used throughout
library(tidyverse)  # readr, dplyr, tidyr, ggplot2
library(lubridate)  # mdy(), year()
library(sf)         # spatial vector data

# Save locally under /data/Sanitation.csv
sanitation_raw <- read_csv("data/Sanitation.csv")

# 1.2 Clean and convert to spatial points (keep necessary columns)
# Adjust variable names based on actual CSV field names
sanitation_sf <- sanitation_raw %>%
  mutate(CreationDate = mdy(`Creation Date`),
         Year = year(CreationDate)) %>%
  filter(Year == 2017,
         !is.na(Latitude), !is.na(Longitude)) %>%
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326) %>%
  st_transform('ESRI:102271')

Step 2: Exploratory Spatial Analysis

To analyze spatial variation in complaint intensity, I construct a 500 m × 500 m fishnet grid over Chicago, aggregate sanitation violations to each cell, and map the resulting count distribution.

Part 1: Data Loading & Exploration

In this part, I load the raw sanitation 311 service request data and the Chicago city boundary, then filter the requests to 2017 incidents with valid latitude–longitude coordinates. Both layers are projected into a common planar CRS (ESRI:102271) so that distances are measured in meters and are suitable for subsequent spatial analysis. The exploratory point map shows that sanitation code violations are far from uniformly distributed: they cluster along central, south, and west corridors of the city, while the North Side exhibits lighter, more scattered activity. This initial pattern already suggests strong spatial concentration of complaints and motivates a grid-based analysis of variation in complaint intensity across the city.

Code
# Load Chicago boundary for context
chicagoBoundary <- suppressMessages(
  st_read(
    "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson",
    quiet = TRUE
  )
) |>
  st_transform("ESRI:102271")


# Quick spatial plot: where are the violations located?
ggplot() +
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_sf(data = sanitation_sf, color = "#d62828", size = 0.1, alpha = 0.4) +
  labs(
    title = "Spatial Distribution of Sanitation Code Violations in Chicago",
    subtitle = paste0("2017, n = ", nrow(sanitation_sf)),
    caption = "Source: Chicago 311 Service Requests (Open Data Portal)"
  ) +
  theme_void() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(color = "gray40")
  )

Code
# Verify result
cat("311 sanitation records for 2017:", nrow(sanitation_sf), "\n")
311 sanitation records for 2017: 19733 

Violations cluster densely along the city’s central and southern corridors, especially on the South and West Sides. The North Side shows lighter, more scattered activity. Overall, incidents follow residential density and older housing patterns, revealing clear spatial concentration rather than uniform distribution.

Part 2: Fishnet Grid Creation

Here I construct a 500 m × 500 m fishnet grid over Chicago, intersect it with the city boundary, and aggregate sanitation violations to each grid cell. This produces a count surface of sanitation complaints that standardizes the spatial unit of analysis across the entire city.

Code
fishnet <- st_make_grid(
  chicagoBoundary,
  cellsize = 500,   # 500 meters per cell
  square = TRUE
) %>%
  st_sf() %>%
  mutate(uniqueID = row_number()) %>%
  filter(lengths(st_intersects(., chicagoBoundary)) > 0)

cat("✓ Created fishnet grid with", nrow(fishnet), "cells\n")
✓ Created fishnet grid with 2458 cells
Code
sanitation_count <- st_join(sanitation_sf, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarise(countSanitation = n())

fishnet <- fishnet %>%
  left_join(sanitation_count, by = "uniqueID") %>%
  mutate(countSanitation = replace_na(countSanitation, 0))

summary(fishnet$countSanitation)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    1.00    5.00    8.02   12.00  189.00 
Code
ggplot() +
  geom_sf(data = fishnet, aes(fill = countSanitation), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 0.7) +
  scale_fill_viridis_c(
    name = "Violations",
    option = "plasma",
    trans = "sqrt",
    breaks = c(0, 1, 5, 10, 20, 40)
  ) +
  labs(
    title = "Sanitation Code Violations by 500m Grid Cell",
    subtitle = "Chicago, 2017",
    caption = "Source: Chicago 311 Service Requests (Open Data Portal)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(color = "gray40"),
    axis.text = element_blank(),
    axis.title = element_blank(),
    panel.grid = element_blank()
  )

Part 3: Spatial Features

To capture local spatial context, I derive two key features for each grid cell: the mean distance to the three nearest sanitation events (nn_meanDist) and the distance to the nearest High–High hotspot centroid (dist_to_hotspot). Smaller values of these features indicate tighter clustering of complaints around the cell or closer proximity to a major hotspot.

Code
library(FNN)  # fast k-nearest-neighbour search (get.knnx)

fishnet_centroids <- st_centroid(fishnet)
san_coords <- st_coordinates(sanitation_sf)
fish_coords <- st_coordinates(fishnet_centroids)

nn_result <- get.knnx(san_coords, fish_coords, k = 3)
fishnet$nn_meanDist <- rowMeans(nn_result$nn.dist)

summary(fishnet$nn_meanDist)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
   1.031  108.565  181.606  294.820  332.417 2213.320 
Code
cat("✓ Added 3-NN mean distance feature\n")
✓ Added 3-NN mean distance feature
Code
library(spdep)  # spatial weights and local Moran statistics

coords <- st_coordinates(fishnet_centroids)
nb <- knn2nb(knearneigh(coords, k = 5))
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)

localM <- localmoran(fishnet$countSanitation, lw)
mean_val <- mean(fishnet$countSanitation, na.rm = TRUE)

fishnet <- fishnet %>%
  mutate(
    localI = localM[, 1],
    p_value = localM[, 5],
    significant = p_value < 0.05,
    clusterType = case_when(
      !significant ~ "Not Significant",
      localI > 0 & countSanitation > mean_val ~ "High-High",
      localI > 0 & countSanitation <= mean_val ~ "Low-Low",
      localI < 0 & countSanitation > mean_val ~ "High-Low",
      localI < 0 & countSanitation <= mean_val ~ "Low-High",
      TRUE ~ "Not Significant"
    )
  )
Code
ggplot() +
  geom_sf(data = fishnet, aes(fill = clusterType), color = NA) +
  scale_fill_manual(
    values = c(
      "High-High" = "#d7191c",
      "High-Low"  = "#fdae61",
      "Low-High"  = "#abd9e9",
      "Low-Low"   = "#2c7bb6",
      "Not Significant" = "gray90"
    ),
    name = "Cluster Type"
  ) +
  labs(
    title = "Local Moran’s I: Sanitation Violation Clusters",
    subtitle = "High-High = Hot Spots  |  Low-Low = Cold Spots"
  ) +
  theme_void() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(color = "gray40")
  )

Hot spots (High-High clusters) of sanitation violations concentrate mainly in the central-west and south-side neighborhoods, with small pockets on the far north and southwest. Cold or low-activity areas dominate the rest of the city. The pattern suggests that sanitation issues are spatially clustered rather than random, aligning with older, denser residential zones and areas of higher population turnover.
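The claim that violations are clustered rather than random can also be checked globally. As a complement to the local Moran map above, here is a minimal, self-contained sketch of a global Moran's I test with spdep on simulated grid data (the coordinates, variable, and k = 5 neighbor choice are illustrative, not the fishnet itself):

```r
library(spdep)  # knearneigh, knn2nb, nb2listw, moran.test

# Simulated 20 x 20 grid with a smooth trend, so values are spatially
# autocorrelated by construction
set.seed(7)
g <- expand.grid(x = 1:20, y = 1:20)
g$z <- g$x + g$y + rnorm(nrow(g))

# Same k-nearest-neighbour weights style as the local Moran analysis above
nb <- knn2nb(knearneigh(as.matrix(g[, c("x", "y")]), k = 5))
lw <- nb2listw(nb, style = "W")

mt <- moran.test(g$z, lw)
cat("Global Moran's I:", round(mt$estimate[["Moran I statistic"]], 2),
    " p-value:", format.pval(mt$p.value), "\n")
```

A significantly positive statistic confirms city-wide clustering before the local statistics identify where the clusters sit.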

Code
hotspots <- fishnet %>%
  filter(clusterType == "High-High") %>%
  st_centroid()

fishnet$dist_to_hotspot <- as.numeric(
  st_distance(fishnet_centroids, hotspots %>% st_union())
)

summary(fishnet$dist_to_hotspot)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0    1000    2000    2479    3606   10000 
Code
cat("Calculated distance to nearest High-High cluster\n")
Calculated distance to nearest High-High cluster

Part 4: Count Regression Models

In this part, I fit Poisson and Negative Binomial count regression models to explain variation in sanitation complaint counts per grid cell using the two spatial features.

Code
fishnet_model <- fishnet %>%
  st_drop_geometry() %>%
  as.data.frame() %>%
  dplyr::select(uniqueID, countSanitation, nn_meanDist, dist_to_hotspot) %>%
  tidyr::drop_na()
cat("✓ Modeling dataset ready with", nrow(fishnet_model), "observations\n")
✓ Modeling dataset ready with 2458 observations
Code
model_pois <- glm(
  countSanitation ~ nn_meanDist + dist_to_hotspot,
  data = fishnet_model,
  family = "poisson"
)

summary(model_pois)

Call:
glm(formula = countSanitation ~ nn_meanDist + dist_to_hotspot, 
    family = "poisson", data = fishnet_model)

Coefficients:
                    Estimate   Std. Error z value            Pr(>|z|)    
(Intercept)      3.726835637  0.014118246  263.97 <0.0000000000000002 ***
nn_meanDist     -0.008449551  0.000112615  -75.03 <0.0000000000000002 ***
dist_to_hotspot -0.000183691  0.000005795  -31.70 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 25930.1  on 2457  degrees of freedom
Residual deviance:  7285.2  on 2455  degrees of freedom
AIC: 14422

Number of Fisher Scoring iterations: 5
Code
dispersion <- sum(residuals(model_pois, type = "pearson")^2) /
              model_pois$df.residual
cat("\nDispersion parameter:", round(dispersion, 2), "\n")

Dispersion parameter: 4.54 
Code
if (dispersion > 1.5) {
  cat("⚠ Overdispersion detected — Negative Binomial model recommended.\n")
} else {
  cat("✓ Dispersion acceptable for Poisson model.\n")
}
⚠ Overdispersion detected — Negative Binomial model recommended.
Code
library(MASS)  # glm.nb

model_nb <- glm.nb(
  countSanitation ~ nn_meanDist + dist_to_hotspot,
  data = fishnet_model
)

summary(model_nb)

Call:
glm.nb(formula = countSanitation ~ nn_meanDist + dist_to_hotspot, 
    data = fishnet_model, init.theta = 4.336138091, link = log)

Coefficients:
                    Estimate   Std. Error z value            Pr(>|z|)    
(Intercept)      3.835903696  0.028620797  134.03 <0.0000000000000002 ***
nn_meanDist     -0.009733402  0.000182836  -53.24 <0.0000000000000002 ***
dist_to_hotspot -0.000141165  0.000009441  -14.95 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(4.3361) family taken to be 1)

    Null deviance: 10094.5  on 2457  degrees of freedom
Residual deviance:  2054.6  on 2455  degrees of freedom
AIC: 11212

Number of Fisher Scoring iterations: 1

              Theta:  4.336 
          Std. Err.:  0.202 

 2 x log-likelihood:  -11204.329 
Code
cat("\nModel Fit Comparison:\n")

Model Fit Comparison:
Code
cat("Poisson AIC:", round(AIC(model_pois), 2), "\n")
Poisson AIC: 14422.11 
Code
cat("NegBin  AIC:", round(AIC(model_nb), 2), "\n")
NegBin  AIC: 11212.33 

The Poisson model showed clear overdispersion (dispersion = 4.54 > 1.5). The Negative Binomial model substantially improved fit (AIC = 11 212 vs 14 422). Both predictors were highly significant and negative: greater distance to nearby violations or to cluster centers corresponds to fewer sanitation code violations. This pattern indicates spatial clustering of violations within central and southern neighborhoods.
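The Poisson-versus-NB logic above can be reproduced end to end on synthetic data. This is an illustrative sketch (simulated counts, not the 311 data): counts are drawn from a negative binomial so the variance exceeds the mean, the Pearson dispersion statistic flags the overdispersion, and the NB refit recovers a lower AIC, mirroring the workflow in this part.

```r
library(MASS)  # glm.nb

# Simulate overdispersed counts: variance = mu + mu^2 / size > mu
set.seed(42)
n  <- 2000
x  <- runif(n)
mu <- exp(1 + 2 * x)
y  <- rnbinom(n, mu = mu, size = 2)

pois <- glm(y ~ x, family = "poisson")
nb   <- glm.nb(y ~ x)

# Pearson dispersion statistic, computed the same way as for the real model
dispersion <- sum(residuals(pois, type = "pearson")^2) / pois$df.residual
cat("Pearson dispersion:", round(dispersion, 2), "\n")
cat("Poisson AIC:", round(AIC(pois)), " NB AIC:", round(AIC(nb)), "\n")
```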

Part 5: Spatial Cross-Validation (2017)

To evaluate spatial generalization, I implement a Leave-One-District-Out (LOGO) cross-validation using Chicago’s 22 police districts as spatial folds.

Code
policeDistricts <- suppressMessages(
  st_read(
    "https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON",
    quiet = TRUE
  )
) |>
  st_transform("ESRI:102271") |>
  dplyr::select(District = dist_num)

fishnet <- st_join(fishnet, policeDistricts, join = st_within, left = TRUE) %>%
  filter(!is.na(District))

cat("✓ Joined police districts:", length(unique(fishnet$District)), "\n")
✓ Joined police districts: 22 
Code
fishnet_model <- fishnet %>%
  st_drop_geometry() %>%
  as.data.frame() %>%
  dplyr::select(uniqueID, District, countSanitation, nn_meanDist, dist_to_hotspot) %>%
  tidyr::drop_na()

cat("Data ready for LOGO CV with", nrow(fishnet_model), "observations\n")
Data ready for LOGO CV with 1708 observations
Code
districts <- unique(fishnet_model$District)
cv_results <- tibble()

cat("Running LOGO cross-validation across", length(districts), "districts...\n")
Running LOGO cross-validation across 22 districts...
Code
for (i in seq_along(districts)) {
  test_district <- districts[i]

  train_data <- fishnet_model %>% filter(District != test_district)
  test_data  <- fishnet_model %>% filter(District == test_district)

  model_cv <- glm.nb(countSanitation ~ nn_meanDist + dist_to_hotspot, data = train_data)

  test_data <- test_data %>%
    mutate(prediction = predict(model_cv, newdata = test_data, type = "response"))

  mae  <- mean(abs(test_data$countSanitation - test_data$prediction))
  rmse <- sqrt(mean((test_data$countSanitation - test_data$prediction)^2))

  cv_results <- bind_rows(cv_results, tibble(
    fold = i,
    test_district = test_district,
    n_test = nrow(test_data),
    MAE = mae,
    RMSE = rmse
  ))

  cat("  Fold", i, "/", length(districts), 
      "- District", test_district, 
      "- MAE:", round(mae, 2), 
      "- RMSE:", round(rmse, 2), "\n")
}
  Fold 1 / 22 - District 5 - MAE: 2.65 - RMSE: 4.38 
  Fold 2 / 22 - District 4 - MAE: 2.11 - RMSE: 11.91 
  Fold 3 / 22 - District 22 - MAE: 3.28 - RMSE: 5.82 
  Fold 4 / 22 - District 6 - MAE: 6.08 - RMSE: 8.25 
  Fold 5 / 22 - District 8 - MAE: 3.44 - RMSE: 8.04 
  Fold 6 / 22 - District 7 - MAE: 5.71 - RMSE: 7.96 
  Fold 7 / 22 - District 3 - MAE: 4.51 - RMSE: 6.55 
  Fold 8 / 22 - District 2 - MAE: 4.24 - RMSE: 6.27 
  Fold 9 / 22 - District 9 - MAE: 2.95 - RMSE: 4.79 
  Fold 10 / 22 - District 10 - MAE: 4.35 - RMSE: 6.02 
  Fold 11 / 22 - District 1 - MAE: 2.63 - RMSE: 4.74 
  Fold 12 / 22 - District 12 - MAE: 3.73 - RMSE: 5.1 
  Fold 13 / 22 - District 15 - MAE: 5 - RMSE: 7.3 
  Fold 14 / 22 - District 11 - MAE: 6.35 - RMSE: 8.56 
  Fold 15 / 22 - District 18 - MAE: 6.4 - RMSE: 10.05 
  Fold 16 / 22 - District 25 - MAE: 4.98 - RMSE: 7.07 
  Fold 17 / 22 - District 14 - MAE: 7.81 - RMSE: 10.46 
  Fold 18 / 22 - District 19 - MAE: 7.55 - RMSE: 10.51 
  Fold 19 / 22 - District 16 - MAE: 1.89 - RMSE: 2.87 
  Fold 20 / 22 - District 17 - MAE: 6.35 - RMSE: 16.21 
  Fold 21 / 22 - District 20 - MAE: 4.79 - RMSE: 7.62 
  Fold 22 / 22 - District 24 - MAE: 5.97 - RMSE: 9.85 
Code
cat("\nLOGO Cross-Validation complete\n")

LOGO Cross-Validation complete
Code
cat("Mean MAE :", round(mean(cv_results$MAE), 2), "\n")
Mean MAE : 4.67 
Code
cat("Mean RMSE:", round(mean(cv_results$RMSE), 2), "\n")
Mean RMSE: 7.74 
Code
library(knitr)       # kable
library(kableExtra)  # kable_styling

cv_results %>%
  arrange(desc(MAE)) %>%
  kable(
    digits = 2,
    caption = "Leave-One-District-Out Cross-Validation Results"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Leave-One-District-Out Cross-Validation Results
fold test_district n_test MAE RMSE
17 14 46 7.81 10.46
18 19 63 7.55 10.51
15 18 30 6.40 10.05
20 17 82 6.35 16.21
14 11 43 6.35 8.56
4 6 63 6.08 8.25
22 24 41 5.97 9.85
6 7 52 5.71 7.96
13 15 32 5.00 7.30
16 25 85 4.98 7.07
21 20 30 4.79 7.62
7 3 43 4.51 6.55
10 10 63 4.35 6.02
8 2 56 4.24 6.27
12 12 73 3.73 5.10
5 8 197 3.44 8.04
3 22 112 3.28 5.82
9 9 107 2.95 4.79
1 5 98 2.65 4.38
11 1 28 2.63 4.74
2 4 235 2.11 11.91
19 16 129 1.89 2.87

To evaluate the spatial generalization of the model, a Leave-One-District-Out (LOGO) cross-validation was implemented using Chicago’s 22 police districts as spatial groups. In each fold, one district was held out for testing while the Negative Binomial model was trained on the remaining districts.

Across all 22 folds, the model achieved an average Mean Absolute Error (MAE) of 4.67 and a Root Mean Squared Error (RMSE) of 7.74. These results indicate that the model performs reasonably well in predicting sanitation code violation counts for spatially unseen districts. Errors tend to be larger in central and high-density districts, reflecting the underlying spatial heterogeneity of urban sanitation violations.

Overall, the LOGO cross-validation confirms that the model captures meaningful spatial patterns and maintains moderate predictive accuracy across the city.

Part 6: Model Evaluation

Code
fishnet_model$pred_2017 <- predict(model_nb, newdata = fishnet_model, type = "response")

cat("Generated in-sample predictions for 2017\n")
Generated in-sample predictions for 2017
Code
sanitation_2018 <- read_csv("data/Sanitation.csv") %>%
  mutate(CreationDate = mdy(`Creation Date`),
         Year = year(CreationDate)) %>%
  filter(Year == 2018,
         !is.na(Latitude), !is.na(Longitude)) %>%
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326) %>%
  st_transform('ESRI:102271')

cat("Loaded 2018 sanitation records:", nrow(sanitation_2018), "\n")
Loaded 2018 sanitation records: 18023 
Code
# Aggregate 2018 points into the fishnet cells (joining points to cells, as in
# Part 2, so that cells containing no 2018 points are counted as 0 rather than
# picking up a spurious count of 1 from the join)
counts_2018 <- st_join(sanitation_2018, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  count(uniqueID, name = "count2018")

fishnet_2018 <- fishnet %>%
  filter(uniqueID %in% fishnet_model$uniqueID) %>%
  left_join(counts_2018, by = "uniqueID") %>%
  mutate(count2018 = replace_na(count2018, 0))

cat("Aggregated 2018 counts to fishnet grid\n")
Aggregated 2018 counts to fishnet grid
Code
fishnet_2018$pred_2018 <- predict(model_nb, newdata = fishnet_2018, type = "response")

fishnet_2018 <- fishnet_2018 %>%
  mutate(error = count2018 - pred_2018)

mae_2018 <- mean(abs(fishnet_2018$error))
rmse_2018 <- sqrt(mean((fishnet_2018$error)^2))

cat("2018 Temporal Validation complete\n")
2018 Temporal Validation complete
Code
cat("MAE (2018):", round(mae_2018, 2), "\n")
MAE (2018): 5.05 
Code
cat("RMSE (2018):", round(rmse_2018, 2), "\n")
RMSE (2018): 7.84 
Code
points2017 <- sanitation_sf %>%
  st_coordinates() %>%
  as.data.frame()

kde_2017 <- spatstat.geom::ppp(points2017$X, points2017$Y,
                               window = spatstat.geom::as.owin(st_union(chicagoBoundary)))

kde_density <- spatstat.explore::density.ppp(kde_2017, sigma = 1000) # bandwidth 1 km

# density.ppp returns intensity in points per square meter; multiply by the
# 500 m x 500 m cell area to put the KDE prediction on the count-per-cell scale
fishnet_2018$kde_pred <- raster::extract(
  raster::raster(kde_density),
  st_coordinates(st_centroid(fishnet_2018))
) * 500^2

kde_mae  <- mean(abs(fishnet_2018$count2018 - fishnet_2018$kde_pred), na.rm = TRUE)
kde_rmse <- sqrt(mean((fishnet_2018$count2018 - fishnet_2018$kde_pred)^2, na.rm = TRUE))

cat("KDE baseline complete\n")
KDE baseline complete
Code
cat("KDE MAE:", round(kde_mae, 2), "\n")
KDE MAE: 8.67 
Code
cat("KDE RMSE:", round(kde_rmse, 2), "\n")
KDE RMSE: 12.55 
Code
ggplot() +
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "white") +
  geom_sf(data = fishnet_2018, aes(fill = error), color = NA) +
  scale_fill_gradient2(
    low = "#2c7bb6",
    mid = "white",
    high = "#d7191c",
    midpoint = 0,
    name = "Prediction Error"
  ) +
  labs(
    title = "Prediction Errors for 2018 (Observed - Predicted)",
    subtitle = "Negative Binomial Model",
    caption = "Blue = Overprediction, Red = Underprediction"
  ) +
  theme_void() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(color = "gray40")
  )

The Negative Binomial model was trained on 2017 sanitation violations and then applied to predict 2018 counts at the 500 m grid level. Model predictions were compared against observed 2018 violations and against a kernel density estimation (KDE) baseline.

Predictive accuracy (temporal validation): The model achieved MAE = 5.05 and RMSE = 7.84, substantially lower than the KDE baseline (MAE = 8.67, RMSE = 12.55). This indicates the regression model generalizes well across time and captures spatial drivers of violation occurrence that simple hotspot persistence (KDE) cannot.
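One subtlety in the KDE comparison: spatstat's density.ppp reports intensity per unit area (here, points per square meter), so putting a KDE surface on the same count-per-cell scale as the regression predictions means multiplying by the cell area. A self-contained toy sketch (simulated points in a 5 km window, not the Chicago data):

```r
# Simulated uniform points in a 5 km x 5 km window
set.seed(1)
win <- spatstat.geom::owin(c(0, 5000), c(0, 5000))
pts <- spatstat.geom::ppp(runif(1000, 0, 5000), runif(1000, 0, 5000),
                          window = win)

dens <- spatstat.explore::density.ppp(pts, sigma = 1000)  # points per m^2

# Intensity times cell area gives the expected count per 500 m x 500 m cell;
# with 1000 points spread over 100 such cells this should land near 10
cell_area <- 500^2
expected_per_cell <- mean(dens) * cell_area
cat("Mean expected count per 500 m cell:", round(expected_per_cell, 1), "\n")
```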

Spatial pattern of errors (map): The map of 2018 prediction errors shows:

Blue cells (over-prediction): areas where the model predicted more violations than actually occurred—often along the southern and western edges of Chicago.

Red cells (under-prediction): locations where observed violations exceeded predictions, mainly clustered in older central neighborhoods. This spatial structure suggests that some unmodeled factors—such as local inspection policies or demographic changes—still influence violation frequency.

Overall assessment: The Negative Binomial model improves predictive performance over both Poisson and KDE approaches. It handles overdispersion and learns spatial correlates of sanitation activity, making it more robust for forecasting future 311 requests.

CHALLENGE TASK: Cross-Domain Validation (2018 Burglary)

Code
# NOTE: Socrata resource endpoints return only the first 1,000 rows unless a
# $limit parameter is supplied (e.g., append "?$limit=500000"), which explains
# the small incident count below
crime_2018 <- read_csv("https://data.cityofchicago.org/resource/3i3m-jwuy.csv")

crime_2018_burg <- crime_2018 %>%
  filter(primary_type == "BURGLARY",
         description == "FORCIBLE ENTRY",
         !is.na(longitude), !is.na(latitude)) %>%
  mutate(year = year(as_date(date)))

cat("2018 burglary (FORCIBLE ENTRY) incidents:", 
    nrow(crime_2018_burg), "\n")
2018 burglary (FORCIBLE ENTRY) incidents: 26 
Code
crime_2018_sf <- st_as_sf(
  crime_2018_burg,
  coords = c("longitude", "latitude"),
  crs = 4326
) %>%
  st_transform(st_crs(fishnet))

# Count burglaries per cell (join points into cells so that cells with no
# incidents are counted as 0 rather than 1)
counts_crime_2018 <- st_join(crime_2018_sf, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  count(uniqueID, name = "countCrime2018")

fishnet_crime_2018 <- fishnet %>%
  left_join(counts_crime_2018, by = "uniqueID") %>%
  mutate(countCrime2018 = replace_na(countCrime2018, 0))
Code
fishnet_crime_2018$predCrime <- predict(
  model_nb,
  newdata = fishnet_crime_2018 %>% st_drop_geometry(),
  type = "response"
)

fishnet_crime_2018 <- fishnet_crime_2018 %>%
  mutate(error = countCrime2018 - predCrime)

mae_crime <- mean(abs(fishnet_crime_2018$error))
rmse_crime <- sqrt(mean((fishnet_crime_2018$error)^2))

cat("2018 Burglary Temporal Validation → MAE:", round(mae_crime, 2),
    " RMSE:", round(rmse_crime, 2), "\n")
2018 Burglary Temporal Validation → MAE: 8.58  RMSE: 11.72 
Code
ggplot() +
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "white") +
  geom_sf(data = fishnet_crime_2018, aes(fill = error), color = NA) +
  scale_fill_gradient2(low = "#2c7bb6", mid = "white", high = "#d7191c",
                       midpoint = 0, name = "Error (Obs - Pred)") +
  labs(title = "2018 Burglary (FORCIBLE ENTRY) Prediction Errors",
       subtitle = "Predicted using 2017 Sanitation Model",
       caption = "Blue = Overprediction, Red = Underprediction") +
  theme_void() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(color = "gray40")
  )

The 2018 burglary prediction errors show systematic overprediction (predicted counts exceeding observed), indicating limited transferability of the sanitation-based model to crime patterns. As in the sanitation analysis, the Poisson specification was overdispersed (variance ≫ mean), justifying the Negative Binomial model, whose additional variance parameter better accommodates spatially heterogeneous count data.

Step 3: Analysis

Overview

This project builds a spatial predictive model of sanitation-related 311 service requests in Chicago for 2017–2018. Using a 500 m × 500 m fishnet grid as the spatial unit, the analysis combines descriptive mapping, spatial autocorrelation diagnostics, and count regression models to understand and predict the geographic distribution of sanitation code violations.

Spatial clustering and model diagnostics show that these urban service complaints are not randomly distributed. Instead, they exhibit strong spatial dependence—particularly along the South and West Sides—reflecting the persistence of neighborhood-level environmental and socioeconomic inequalities.

1. Spatial Clustering

Exploratory mapping revealed clear spatial concentrations of sanitation violations. Local Moran’s I confirmed statistically significant High-High clusters (hotspots) in central-west and south-side neighborhoods, while Low-Low clusters occurred in more peripheral or affluent areas.
This indicates that sanitation complaints follow patterns of older housing stock, population turnover, and varying municipal service responsiveness.

2. Count Regression Modeling

Both Poisson and Negative Binomial (NB) count models were fit to predict the number of violations per grid cell using two spatial covariates:

  • Mean distance to nearest 3 sanitation events (nn_meanDist)
  • Distance to the nearest high-high cluster centroid (dist_to_hotspot)

The Poisson model exhibited overdispersion (dispersion = 4.54 ≫ 1.5), meaning that the variance of counts greatly exceeded the mean.
The NB model corrected this issue and achieved a much better fit (AIC = 11 212 vs 14 422 for Poisson), validating the choice of a variance-adjusted count specification.
Both predictors were significantly negative, showing that proximity to existing violations and hotspots strongly increases expected complaint frequency.
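Because the NB model uses a log link, the coefficients translate into multiplicative effects on the expected count. A quick back-of-the-envelope check using the nn_meanDist estimate reported above (-0.009733):

```r
# Log link: a change of d meters multiplies the expected count by exp(beta * d)
beta_nn <- -0.009733402   # nn_meanDist estimate from the fitted NB model above

rate_ratio_100m <- exp(beta_nn * 100)
cat("Expected-count multiplier per +100 m of nn_meanDist:",
    round(rate_ratio_100m, 2), "\n")
# i.e., each additional 100 m of isolation cuts the expected count by roughly 62%
```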

3. Spatial Cross-Validation (2017)

A Leave-One-District-Out (LOGO) cross-validation using Chicago’s 22 police districts assessed spatial generalizability.
The model achieved Mean MAE = 4.67 and Mean RMSE = 7.74, confirming moderate predictive power in unseen areas.
Prediction errors were larger in dense downtown and inner-south districts, highlighting the spatial heterogeneity of urban sanitation behavior.

4. Temporal Validation (2018)

Applying the 2017 NB model to 2018 sanitation data yielded MAE = 5.05 and RMSE = 7.84, outperforming a non-parametric KDE baseline (MAE = 8.67, RMSE = 12.55).
The model successfully generalized across years, indicating that neighborhood-level structural factors—rather than year-specific policy noise—drive much of the spatial pattern in 311 sanitation requests.

The 2018 error map showed:

  • Blue areas (overprediction) around southern and western peripheries
  • Red areas (underprediction) concentrated in inner neighborhoods where new complaints emerged in 2018

These differences likely stem from unmodeled local dynamics such as inspection intensity, redevelopment, or seasonal population change.

5. Challenge Task: Cross-Domain Validation (2018 Burglary)

The 2017 sanitation model was tested on 2018 burglary (forcible entry) events to evaluate cross-domain transferability.
Results showed systematic overprediction across most grid cells, meaning that the model assumed crime would occur with similar spatial intensity as sanitation violations.
This demonstrates that while spatial autocorrelation structures are similar across urban phenomena, the underlying generative mechanisms (social, behavioral, or policing factors) differ substantially.

Such overprediction underscores the domain specificity of spatial count processes and the importance of incorporating domain-relevant variables when extending predictive models to new phenomena.

6. Interpretation

  1. Spatial dependence matters: Incorporating local spatial features (e.g., nearest-neighbor distances) substantially improves prediction over simple density persistence models.
  2. Overdispersion is the rule, not the exception: Urban event counts rarely satisfy the Poisson assumption that variance equals the mean; the Negative Binomial model offers a practical correction through its additional dispersion parameter.
  3. Temporal generalization is feasible: Structural neighborhood effects remain stable across years.
  4. Cross-domain generalization is limited: Models trained on sanitation data poorly capture crime patterns, revealing that social drivers differ even when spatial clustering appears similar.

Spatially explicit predictive modeling of 311 sanitation complaints can help city agencies prioritize inspection and maintenance resources more efficiently.
However, models must remain adaptive—incorporating updated data, temporal shifts, and behavioral heterogeneity—to avoid reinforcing historical inequalities in service delivery.
For broader urban analytics, this exercise shows both the potential and the limits of spatial transfer learning in complex social environments.