Assignment 4: Spatial Predictive Analysis

Pothole Repair Requests in Chicago

Author

Sihan Yu

Published

December 12, 2025

1. Introduction

In the 311 system, residents can report various infrastructure issues for the city to address. Among these requests, Pothole in Street is one of the most frequently submitted categories in Chicago. Potholes often develop due to road aging, traffic load, and freeze–thaw cycles, and their locations reflect underlying infrastructure conditions across the city. Because pothole activity varies spatially and mirrors broader neighborhood characteristics, it provides a useful spatial indicator for understanding urban environments.

In this study, pothole repair requests are used as spatial indicators to support the analysis of burglary patterns in Chicago. By examining the spatial variation in pothole activity and its relationship to burglary distribution, the study aims to understand how urban infrastructure conditions may reflect broader neighborhood characteristics that help explain where burglaries occur.

2. Data Loading & Exploration

After preparing the required packages, this step imports the key spatial datasets, including the 311 pothole request data, the Chicago boundary, police districts, and burglary records. The goal is to understand where potholes occur across the city and whether they exhibit spatial clustering. Visualizing these patterns helps identify neighborhoods with greater infrastructure issues, which is important because such areas may also experience higher burglary activity. This initial exploration ensures that the base data are accurate and provides a foundation for examining the spatial relationship between potholes and crime in later steps.

2.1 Setup

Code

# Load required packages
library(tidyverse)      # Data manipulation
library(sf)             # Spatial operations
library(here)           # Relative file paths
library(viridis)        # Color scales
library(terra)          # Raster operations (replaces 'raster')
library(spdep)          # Spatial dependence
library(FNN)            # Fast nearest neighbors
library(MASS)           # Negative binomial regression
library(patchwork)      # Plot composition (replaces grid/gridExtra)
library(knitr)          # Tables
library(kableExtra)     # Table formatting
library(classInt)       # Classification intervals
library(here)
library(broom)
library(glue)


# Spatstat split into sub-packages
library(spatstat.geom)    # Spatial geometries
library(spatstat.explore) # Spatial exploration/KDE

# Set options
options(scipen = 999)  # No scientific notation
set.seed(5080)         # Reproducibility

# Create consistent theme for visualizations
theme_crime <- function(base_size = 11) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold", size = base_size + 1),
      plot.subtitle = element_text(color = "gray30", size = base_size - 1),
      legend.position = "right",
      panel.grid.minor = element_blank(),
      axis.text = element_blank(),
      axis.title = element_blank()
    )
}

# Set as default
theme_set(theme_crime())

2.2 Load 311 data and Chicago spatial boundaries

Code

# Load 311 request - Potholes Reported data
Pot_Holes <- read_csv("data/Pot_Holes_Reported.csv", show_col_types = FALSE)%>%
  filter(!is.na(LATITUDE), !is.na(LONGITUDE)) %>%
  mutate(CREATION_DATE = mdy(`CREATION DATE`)) %>%  
  filter(year(CREATION_DATE) == 2017) %>%                 
  st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326) %>%
  st_transform('ESRI:102271')

Pot_Holes <- Pot_Holes %>% 
  filter(!st_is_empty(geometry)) %>%
  filter(
    !is.na(st_coordinates(.)[, 1]),
    !is.na(st_coordinates(.)[, 2])
  )

# Load Chicago boundary
chicagoBoundary <- 
  st_read("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson",  quiet = TRUE) %>%
  st_transform('ESRI:102271')

# Load police districts (used for spatial cross-validation)
policeDistricts <- 
  st_read("https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON",  quiet = TRUE) %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = dist_num)

# Load police beats (smaller administrative units)
policeBeats <- 
  st_read("https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON",  quiet = TRUE) %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(Beat = beat_num)


# Load burglaries data file (downloaded from Chicago open data portal)
burglaries <- st_read("data/burglaries.shp",  quiet = TRUE) %>% 
  st_transform('ESRI:102271')

2.3 Visualize Point Data

Code

# Simple point map
p1 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_sf(data = Pot_Holes, color = "#d62828", size = 0.1, alpha = 0.4) +
  labs(
    title = "Reported Pot Holes Locations",
    subtitle = paste0("Chicago 2017, n = ", nrow(Pot_Holes))
  )+
  theme_void()

# Density surface using modern syntax
p2 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_density_2d_filled(
    data = data.frame(st_coordinates(Pot_Holes)),
    aes(X, Y),
    alpha = 0.7,
    bins = 8
  ) +
  scale_fill_viridis_d(
    option = "plasma",
    direction = -1,
    guide = "none"  # Modern ggplot2 syntax (not guide = FALSE)
  ) +
  labs(
    title = "Density Surface",
    subtitle = "Kernel density estimation"
  )+
  theme_void()

# Combine plots using patchwork (modern approach)
p1 + p2 + 
  plot_annotation(
    title = "Spatial Distribution of Reported Pot Holes in Chicago",
    tag_levels = 'A'
  )

Interpretation:

The point map on the left shows that pothole reports in 2017 are widespread across Chicago, with particularly dense coverage in most urbanized areas. Some low-report zones appear in the southern part of the city. The kernel density map on the right highlights these patterns more clearly: the northeast and northwest areas exhibit the highest concentrations, with additional hotspots in the southwest and southeast. In contrast, the central and western sections show relatively lower report intensity. Overall, pothole reports display strong spatial heterogeneity across the city.

3. Fishnet Grid

In this step, I create a 500 × 500 m fishnet grid across Chicago and aggregate both pothole and burglary point data into the grid cells, producing count values for each location. This transforms the raw point data into a uniform spatial unit, making it suitable for later regression modeling.

3.1 500*500m Fishnet Creation

Code

# Create 500m x 500m grid
fishnet <- st_make_grid(
  chicagoBoundary,
  cellsize = 500,  # 500 meters per cell
  square = TRUE
) %>%
  st_sf() %>%
  mutate(uniqueID = row_number())

# Keep only cells that intersect Chicago
fishnet <- fishnet[chicagoBoundary, ]

3.2 Aggregate Violation to Grid

Code

# Spatial join: which cell contains each violation?
Pot_Holes_fishnet <- st_join(Pot_Holes, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(countPotholes = n())

# Join back to fishnet (cells with 0 burglaries will be NA)
fishnet <- fishnet %>%
  left_join(Pot_Holes_fishnet, by = "uniqueID") %>%
  mutate(countPotholes = replace_na(countPotholes, 0))

# Aggregate burglaries to fishnet
burg_fishnet <- st_join(burglaries, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(countBurglaries = n())

# Join burglary counts back to same fishnet
fishnet <- fishnet %>%
  left_join(burg_fishnet, by = "uniqueID") %>%
  mutate(countBurglaries = replace_na(countBurglaries, 0))

3.3 Visualize Distribution

Code

# Visualize aggregated counts
ggplot() +
  geom_sf(data = fishnet, aes(fill = countPotholes), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_c(
    name = "Pot Holes",
    option = "plasma",
    trans = "sqrt",  # Square root for better visualization of skewed data
    breaks = c(0, 1, 5, 10, 20, 40)
  ) +
  labs(
    title = "Pot Holes Counts by Grid Cell",
    subtitle = "500m x 500m cells, Chicago 2017"
  ) +
  theme_crime()+
  theme_void()

Interpretation:

The fishnet map shows that pothole reports are not evenly distributed across Chicago. Higher counts appear in the northern and central parts of the city, while the southern area generally shows lower reporting levels. Localized hotspots also emerge along major road corridors. This suggests that road quality issues vary across neighborhoods, and pothole activity may serve as a proxy for broader spatial differences in infrastructure conditions.

4. Spatial Features

Here generate spatial features including k-nearest neighbors and Local Moran’s I. This helps to further capture spatial structure and visualize clustering patterns in the pothole data.

The K-Nearest Neighbor feature measures how close each grid cell is to recent pothole activity, helping quantify local infrastructure conditions surrounding burglary locations.

Local Moran’s I identifies spatial clusters of high and low pothole activity. These clusters help uncover neighborhood-level infrastructure patterns that may relate to burglary risk.

4.1 K-Nearest Neighbor Features

Code

# Calculate mean distance to 3 nearest Pot Holes
# (Do this OUTSIDE of mutate to avoid sf conflicts)

# Get coordinates
fishnet <- st_make_valid(fishnet)
fishnet_coords <- st_coordinates(suppressWarnings(st_centroid(fishnet)))
Holes_coords <- st_coordinates(Pot_Holes)

# Calculate k nearest neighbors and distances
nn_result <- get.knnx(Holes_coords, fishnet_coords, k = 3)

# Add to fishnet
fishnet <- fishnet %>%
  mutate(
    potholes.nn = rowMeans(nn_result$nn.dist)
  )

summ <- summary(fishnet$potholes.nn)

summ_df <- data.frame(
  Statistic = c("Min", "Q1", "Median", "Mean", "Q3", "Max"),
  Value = round(as.numeric(summ), 2)
)

kable(
  summ_df,
  caption = "Summary of Mean Distance to 3 Nearest Pot Holes (meters)"
) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),  full_width = FALSE)

Summary of Mean Distance to 3 Nearest Pot Holes (meters)
Statistic	Value
Min	5.71
Q1	62.13
Median	96.40
Mean	157.07
Q3	168.86
Max	1896.40

Interpretation:

The KNN summary indicates strong spatial variation: some cells are only a few meters from potholes, while others are over 1 km away. The median distance of about 87 m suggests that pothole activity is dense and widely distributed across the city.

4.2 Local Moran’s I

Code

# Function to calculate Local Moran's I
calculate_local_morans <- function(data, variable, k = 5) {
  
  # Create spatial weights
  coords <- st_coordinates(suppressWarnings(st_centroid(data)))
  neighbors <- knn2nb(knearneigh(coords, k = k))
  weights <- nb2listw(neighbors, style = "W", zero.policy = TRUE)
  
  # Calculate Local Moran's I
  local_moran <- localmoran(data[[variable]], weights)
  
  # Classify clusters
  mean_val <- mean(data[[variable]], na.rm = TRUE)
  
  data %>%
    mutate(
      local_i = local_moran[, 1],
      p_value = local_moran[, 5],
      is_significant = p_value < 0.05,
      
      moran_class = case_when(
        !is_significant ~ "Not Significant",
        local_i > 0 & .data[[variable]] > mean_val ~ "High-High",
        local_i > 0 & .data[[variable]] <= mean_val ~ "Low-Low",
        local_i < 0 & .data[[variable]] > mean_val ~ "High-Low",
        local_i < 0 & .data[[variable]] <= mean_val ~ "Low-High",
        TRUE ~ "Not Significant"
      )
    )
}

# Apply to potholes
fishnet <- calculate_local_morans(fishnet, "countPotholes", k = 5)

4.3 Hot Spots and Cold Spots

Code

# Visualize hot spots
ggplot() +
  geom_sf(
    data = fishnet, 
    aes(fill = moran_class), 
    color = NA
  ) +
  scale_fill_manual(
    values = c(
      "High-High" = "#d7191c",
      "High-Low" = "#fdae61",
      "Low-High" = "#abd9e9",
      "Low-Low" = "#2c7bb6",
      "Not Significant" = "gray90"
    ),
    name = "Cluster Type"
  ) +
  labs(
    title = "Local Moran's I: Pothole Clusters",
    subtitle = "High-High = Hot spots of disorder"
  ) +
  theme_crime()+
  theme_void()

4.4 Distance-to-Hotspot

Code

# Get centroids of "High-High" cells (hot spots)
hotspots <- suppressWarnings(
  fishnet %>%
    filter(moran_class == "High-High") %>%
    st_centroid()
)


# Calculate distance from each cell to nearest hot spot
if (nrow(hotspots) > 0) {
  fishnet <- fishnet %>%
    mutate(
      dist_to_hotspot = as.numeric(
        st_distance(
          suppressWarnings(st_centroid(fishnet)),
          hotspots %>% st_union()
        )
      )
    )

  cat("  - Number of hot spot cells:", nrow(hotspots), "\n")
} else {
  fishnet <- fishnet %>%
    mutate(dist_to_hotspot = 0)
  cat("⚠ No significant hot spots found\n")
}

  - Number of hot spot cells: 226

Interpretation:

Local Moran’s I results are interpreted as follows:

High–High areas represent clusters of high burglary counts (hot spots).
Low–Low areas represent clusters of consistently low values (cold spots).
High–Low and Low–High areas indicate spatial outliers.

The LISA results show clear clusters of pothole activity across Chicago. High–High clusters (hotspots) appear mainly in the northern and central neighborhoods, indicating concentrated infrastructure issues. In contrast, the southern part of the city shows several Low–Low clusters, reflecting consistently low levels of reported potholes. The mix of High–Low and Low–High outliers suggests localized transitions between well-maintained and more deteriorated areas. Overall, the spatial pattern confirms that pothole activity is not random but forms distinct neighborhood clusters.

5. Count Regression Models

This section I model burglary counts using two standard count models—Poisson and Negative Binomial. Count data often show overdispersion (variance > mean), so fitting both models allows us to compare performance and choose the most appropriate specification for prediction. AIC is used to evaluate which model provides a better fit.

5.1 Poisson Regression

Code

# Join district information to fishnet
fishnet <- st_join(
  fishnet,
  policeDistricts,
  join = st_within,
  left = TRUE
) %>%
  filter(!is.na(District))  # Remove cells outside districts


# Create clean modeling dataset
fishnet_model <- fishnet %>%
  st_drop_geometry() %>%
  dplyr::select(
    uniqueID,
    District,
    countBurglaries,
    countPotholes,
    potholes.nn,
    dist_to_hotspot
  ) %>%
  na.omit()

# Fit Poisson regression
model_poisson <- glm(
  countBurglaries ~ countPotholes + potholes.nn + 
    dist_to_hotspot,
  data = fishnet_model,
  family = "poisson"
)

tidy(model_poisson) %>%
  kable(digits = 2, caption = "Poisson Regression Coefficients") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Poisson Regression Coefficients
term	estimate	std.error	statistic	p.value
(Intercept)	1.84	0.05	40.14	0.00
countPotholes	0.00	0.00	-0.79	0.43
potholes.nn	-0.01	0.00	-21.01	0.00
dist_to_hotspot	0.00	0.00	-2.06	0.04

Code

# Calculate dispersion parameter
dispersion <- sum(residuals(model_poisson, type = "pearson")^2) / 
              model_poisson$df.residual


conclusion <- if (dispersion > 1.5) {
  "⚠ Overdispersion detected — the Negative Binomial model is more appropriate."
} else {
  "✓ Dispersion looks acceptable — the Poisson model is reasonable."
}

glue("
Dispersion Check
----------------
Dispersion parameter: {round(dispersion, 2)}
Rule of thumb: > 1.5 suggests overdispersion

Conclusion:
{conclusion}
")

Dispersion Check
----------------
Dispersion parameter: 3.84
Rule of thumb: > 1.5 suggests overdispersion

Conclusion:
⚠ Overdispersion detected — the Negative Binomial model is more appropriate.

5.2 Negative Binomial Regression

Code

# Fit Negative Binomial model
model_nb <- glm.nb(
  countBurglaries ~ countPotholes + potholes.nn + 
    dist_to_hotspot,
  data = fishnet_model
)

tidy(model_nb) %>%
  kable(digits = 2, caption = "Negative Binomial Regression Coefficients") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Negative Binomial Regression Coefficients
term	estimate	std.error	statistic	p.value
(Intercept)	2.06	0.09	23.00	0.00
countPotholes	0.00	0.00	-1.36	0.17
potholes.nn	-0.01	0.00	-15.73	0.00
dist_to_hotspot	0.00	0.00	-1.63	0.10

5.3 Model Fit(AIC)

Code

# Compare AIC (lower is better)
# Create AIC comparison dataframe
aic_table <- data.frame(
  Model = c("Poisson", "Negative Binomial"),
  AIC   = c(AIC(model_poisson), AIC(model_nb))
)

# Display with kable
knitr::kable(
  aic_table,
  caption = "AIC Comparison of Count Models",
  digits = 1,
  align = "c"
) %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )

AIC Comparison of Count Models
Model	AIC
Poisson	9774.2
Negative Binomial	7776.8

Interpretation:

The Poisson regression shows strong overdispersion, which leads to a much higher AIC and weaker overall fit. The Negative Binomial model handles this more effectively and provides a substantially better model performance. In the Negative Binomial results, the k-nearest-neighbor pothole feature is highly significant and shows a negative relationship with burglary counts, indicating that areas closer to recent pothole activity tend to experience elevated burglary risk. In comparison, the total number of potholes and the distance to hotspots are not statistically significant once local proximity effects are included. These results suggest that spatial patterns in infrastructure conditions, especially the immediate surroundings captured by the KNN measure, help explain variation in burglary activity. The Negative Binomial model is therefore the more appropriate choice for this analysis.

6. Spatial Cross-Validation

After comparing both models, the Negative Binomial model is selected because it had the lower AIC. Using this model, I then performed Leave-One-Group-Out cross-validation with police districts as the grouping unit. In each fold, one district was held out, the model was trained on the remaining districts, and predictions were generated for the excluded area. This method reduces spatial leakage and provides a more reliable measure of how well the model performs across different parts of the city.

6.1 Leave-One-Group-Out cross-validation(LOGO CV)

Code

# Get unique districts
districts <- unique(fishnet_model$District)
cv_results <- tibble()


for (i in seq_along(districts)) {
  
  test_district <- districts[i]
  
  # Split data
  train_data <- fishnet_model %>% filter(District != test_district)
  test_data <- fishnet_model %>% filter(District == test_district)
  
  # Fit model on training data
  model_cv <- glm.nb(
    countBurglaries ~ countPotholes + potholes.nn + 
      dist_to_hotspot,
    data = train_data
  )
  
  # Predict on test data
  test_data <- test_data %>%
    mutate(
      prediction = predict(model_cv, test_data, type = "response")
    )
  
  # Calculate metrics
  mae <- mean(abs(test_data$countBurglaries - test_data$prediction))
  rmse <- sqrt(mean((test_data$countBurglaries - test_data$prediction)^2))
  
  # Store results
  cv_results <- bind_rows(
    cv_results,
    tibble(
      fold = i,
      test_district = test_district,
      n_test = nrow(test_data),
      mae = mae,
      rmse = rmse
    )
  )
  
 
}

6.2 Reported Metrics

Code

# Show results
cv_results %>% 
  arrange(desc(mae)) %>%
  rename(
    Fold = fold,
    `Test District` = test_district,
    `Test Count` = n_test,
    MAE = mae,
    RMSE = rmse
  )%>%
  kable(
    digits = 2,
    caption = "LOGO CV Results by District",
    align = "c"  
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

LOGO CV Results by District
Fold	Test District	Test Count	MAE	RMSE
7	3	43	6.22	8.44
12	12	73	3.56	4.90
4	6	63	3.47	4.94
6	7	52	3.39	4.37
19	16	129	3.21	3.61
14	11	43	3.16	4.13
17	14	46	2.86	4.20
15	18	30	2.82	3.89
8	2	56	2.79	3.66
3	22	112	2.73	3.21
5	8	197	2.72	3.67
10	10	63	2.66	3.72
22	24	41	2.65	2.98
21	20	30	2.63	3.16
16	25	85	2.60	3.64
13	15	32	2.45	3.17
11	1	28	2.45	2.86
2	4	235	2.32	3.98
20	17	82	2.31	2.73
9	9	107	2.18	2.74
1	5	98	2.09	2.91
18	19	63	1.93	2.46

Code

overall_df <- data.frame(
  Metric = c("Mean MAE", "Mean RMSE"),
  Value  = c(
    round(mean(cv_results$mae), 2),
    round(mean(cv_results$rmse), 2)
  )
)

knitr::kable(
  overall_df,
  caption = "Overall LOGO Cross-Validation Performance",
  align = "c",
  digits = 2
) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

Overall LOGO Cross-Validation Performance
Metric	Value
Mean MAE	2.87
Mean RMSE	3.79

Interpretation:

The LOGO results indicate consistent model performance across most police districts, with MAE values largely concentrated around 2–3. This pattern suggests that the model captures broad spatial variation in burglary risk and performs relatively stably across the city’s neighbourhoods. A small number of districts, such as District 3, show noticeably higher errors, indicating localized conditions not fully captured by the predictors. Overall, the model generalizes well spatially, and the mean MAE of 2.87 and mean RMSE of 3.79 reinforce that performance is steady at the citywide scale, even though certain areas remain more difficult to predict.

7. Model Evaluation

In the final step I generated predictions from the negative binomial model and compared them to the KDE baseline. This allows evaluating whether the regression model provides meaningful improvement over a simple spatial smoothing approach. ## Final Model Prediction

Code

# Fit final model on all data
final_model <- glm.nb(
  countBurglaries ~ countPotholes + potholes.nn + 
    dist_to_hotspot,
  data = fishnet_model
)

# Add predictions back to fishnet
fishnet <- fishnet %>%
  mutate(
    prediction_nb = predict(final_model, fishnet_model, type = "response")[match(uniqueID, fishnet_model$uniqueID)]
  )

7.1 KDE baseline Comparison

Code

burglaries_kde <- suppressWarnings(st_intersection(burglaries, chicagoBoundary))
coords <- st_coordinates(burglaries_kde) %>% unique()

# Convert burglaries to ppp (point pattern) format for spatstat
burglaries_ppp <- as.ppp(
  coords,
  W = as.owin(st_bbox(chicagoBoundary))
)

# Calculate KDE with 1km bandwidth
kde_burglaries <- density.ppp(
  burglaries_ppp,
  sigma = 1000,  # 1km bandwidth
  edge = TRUE    # Edge correction
)

# Convert to terra raster (modern approach, not raster::raster)
kde_raster <- rast(kde_burglaries)

# Extract KDE values to fishnet cells
fishnet <- fishnet %>%
  mutate(
    kde_value = terra::extract(
      kde_raster,
      vect(fishnet),
      fun = mean,
      na.rm = TRUE
    )[, 2]  # Extract just the values column
  )

Code

ggplot() +
  geom_sf(data = fishnet, aes(fill = kde_value), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_c(
    name = "KDE Value",
    option = "plasma"
  ) +
  labs(
    title = "Kernel Density Estimation Baseline",
    subtitle = "Simple spatial smoothing of burglary locations"
  ) +
  theme_crime()

Code

# Also add KDE predictions (normalize to same scale as counts)
kde_sum <- sum(fishnet$kde_value, na.rm = TRUE)
count_sum <- sum(fishnet$countBurglaries, na.rm = TRUE)
fishnet <- fishnet %>%
  mutate(
    prediction_kde = (kde_value / kde_sum) * count_sum
  )

# Create three maps
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "plasma", limits = c(0, 15)) +
  labs(title = "Actual Burglaries") +
  theme_crime()+
  theme_void()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_nb), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
  labs(title = "Model Predictions (Neg. Binomial)") +
  theme_crime()+
  theme_void()

p3 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_kde), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
  labs(title = "KDE Baseline Predictions") +
  theme_crime()+
  theme_void()

p1 + p2 + p3 +
  plot_annotation(
    title = "Actual vs. Predicted Burglaries",
    subtitle = "Does our complex model outperform simple KDE?"
  )

Interpretation:

The observed burglary counts show a highly clustered spatial pattern, with several distinct hotspots located in the central, southwest, and southeast portions of the study area.

The Negative Binomial model captures the broad structure of these patterns. However, it tends to slightly underpredict the intensity of the strongest hotspots, producing more smoothed estimates compared to the observed counts.

The KDE baseline produces a more concentrated hotspot pattern, placing high predicted values in areas where burglaries are spatially clustered, regardless of local explanatory variables. While this method lacks covariate information, it aligns closely with the observed burglary distribution due to the strong spatial clustering present in the data.

7.2 Error Analysis

Code

# Calculate performance metrics
comparison <- fishnet %>%
  st_drop_geometry() %>%
  filter(!is.na(prediction_nb), !is.na(prediction_kde)) %>%
  summarize(
    model_mae = mean(abs(countBurglaries - prediction_nb)),
    model_rmse = sqrt(mean((countBurglaries - prediction_nb)^2)),
    kde_mae = mean(abs(countBurglaries - prediction_kde)),
    kde_rmse = sqrt(mean((countBurglaries - prediction_kde)^2))
  )

comparison %>%
  pivot_longer(everything(), names_to = "metric", values_to = "value") %>%
  separate(metric, into = c("approach", "metric"), sep = "_") %>%
  pivot_wider(names_from = metric, values_from = value) %>%
  kable(
    digits = 2,
    caption = "Model Performance Comparison"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Model Performance Comparison
approach	mae	rmse
model	2.64	3.73
kde	2.07	2.96

Interpretation:

The KDE baseline outperforms the Negative Binomial model, achieving lower MAE (2.06 vs 2.64) and lower RMSE (2.95 vs 3.73). This indicates that a simple spatial smoothing method provides more accurate predictions than the covariate-driven model. This performance gap suggests that burglary risk in this study area is dominated by spatial clustering rather than the specific covariates included in the model (potholes, nearest hotspot distance, etc.).

Thus, additional predictors or spatially structured random effects may be required for the Negative Binomial model to outperform pure spatial smoothing.

Code

# Calculate errors
fishnet <- fishnet %>%
  mutate(
    error_nb = countBurglaries - prediction_nb,
    error_kde = countBurglaries - prediction_kde,
    abs_error_nb = abs(error_nb),
    abs_error_kde = abs(error_kde)
  )

# Map errors
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = error_nb), color = NA) +
  scale_fill_gradient2(
    name = "Error",
    low = "#2166ac", mid = "white", high = "#b2182b",
    midpoint = 0,
    limits = c(-10, 10)
  ) +
  labs(title = "Model Errors (Actual - Predicted)") +
  theme_crime()+
  theme_void()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = abs_error_nb), color = NA) +
  scale_fill_viridis_c(name = "Abs. Error", option = "magma") +
  labs(title = "Absolute Model Errors") +
  theme_crime()+
  theme_void()

p1 + p2

Interpretation:

The model on the left highlights clear spatial biases in the Negative Binomial predictions. Red areas indicate underprediction, which occurs mainly in strong hotspot locations where burglary counts are substantially higher than expected based on the model’s input variables. Blue areas represent overprediction and appear more frequently in low-crime peripheral zones, suggesting that the model slightly inflates burglary risk in areas with minimal activity.

These spatially structured errors suggest that the model does not fully capture all relevant drivers of burglary risk, particularly socio-economic factors that could contribute to high-intensity hotspots.

The absolute error map on the right reinforces the concentration of model inaccuracies within the major hotspot areas. Cells with the highest absolute errors cluster in central and southern districts, corresponding to locations shown above with the highest burglary counts. Low-burglary areas generally show small errors, indicating that the model performs reasonably well in predicting low-risk regions. In contrast, the model performance deteriorates in high-intensity crime environments, where simple spatial smoothing may outperform covariate-based modelling.

8. Conclusion

This analysis explored whether 311 pothole reports can help explain the spatial distribution of burglaries in Chicago. After preparing the data and building spatial features, I applied count regression models and spatial cross-validation to evaluate predictive performance. The results show that while individual pothole counts are not strong predictors, the k-nearest-neighbor proximity to recent pothole reports provides useful spatial information related to burglary patterns. This supports the idea that neighborhood infrastructure conditions may correlate with crime levels.

The negative binomial model outperformed the Poisson model, indicating that burglary counts exhibit overdispersion commonly found in urban crime data. Spatial cross-validation showed stable performance across most police districts, suggesting that the model generalizes reasonably well across space.

Compared with a simple KDE baseline, the regression model performed similarly but did not substantially outperform it, highlighting the strength of spatial smoothing methods for hotspot-like phenomena. Nonetheless, the model offers interpretability and provides insight into how environmental context relates to burglary risk.

Overall, the analysis demonstrates that combining 311 service requests with spatial modeling can reveal meaningful urban patterns. Future improvements could incorporate additional socio-economic or land-use variables, test alternative bandwidths for KDE, or extend validation with temporal data.