---
title: "Assignment 4: Spatial Predictive Analysis"
subtitle: "Pothole Repair Requests in Chicago"
author: "Sihan Yu"
date: today
format:
html:
code-fold: show
code-tools: true
toc: true
toc-depth: 3
toc-location: left
theme: cosmo
embed-resources: true
editor: visual
execute:
warning: false
message: false
---
# 1. Introduction
In the 311 system, residents can report various infrastructure issues for the city to address. Among these requests, **Pothole in Street** is one of the most frequently submitted categories in Chicago. Potholes often develop due to road aging, traffic load, and freeze–thaw cycles, and their locations reflect underlying infrastructure conditions across the city. Because pothole activity varies spatially and mirrors broader neighborhood characteristics, it provides a useful spatial indicator for understanding urban environments.
In this study, pothole repair requests are used as spatial indicators to support the analysis of burglary patterns in Chicago. By examining the spatial variation in pothole activity and its relationship to **burglary** **distribution**, the study aims to understand how urban infrastructure conditions may reflect broader neighborhood characteristics that help explain where burglaries occur.
# 2. Data Loading & Exploration
After preparing the required packages, this step imports the key spatial datasets, including the 311 pothole request data, the Chicago boundary, police districts, and burglary records. The goal is to understand where potholes occur across the city and whether they exhibit spatial clustering. Visualizing these patterns helps identify neighborhoods with greater infrastructure issues, which is important because such areas may also experience higher burglary activity. This initial exploration ensures that the base data are accurate and provides a foundation for examining the spatial relationship between potholes and crime in later steps.
## 2.1 Setup
```{r}
#| message: false
#| warning: false
# Load required packages
library(tidyverse) # Data manipulation
library(sf) # Spatial operations
library(here) # Relative file paths
library(viridis) # Color scales
library(terra) # Raster operations (replaces 'raster')
library(spdep) # Spatial dependence
library(FNN) # Fast nearest neighbors
library(MASS) # Negative binomial regression
library(patchwork) # Plot composition (replaces grid/gridExtra)
library(knitr) # Tables
library(kableExtra) # Table formatting
library(classInt) # Classification intervals
library(here)
library(broom)
library(glue)
# Spatstat split into sub-packages
library(spatstat.geom) # Spatial geometries
library(spatstat.explore) # Spatial exploration/KDE
# Set options
options(scipen = 999) # No scientific notation
set.seed(5080) # Reproducibility
# Create consistent theme for visualizations
theme_crime <- function(base_size = 11) {
theme_minimal(base_size = base_size) +
theme(
plot.title = element_text(face = "bold", size = base_size + 1),
plot.subtitle = element_text(color = "gray30", size = base_size - 1),
legend.position = "right",
panel.grid.minor = element_blank(),
axis.text = element_blank(),
axis.title = element_blank()
)
}
# Set as default
theme_set(theme_crime())
```
## 2.2 Load 311 data and Chicago spatial boundaries
```{r}
# Load 311 request - Potholes Reported data
Pot_Holes <- read_csv("data/Pot_Holes_Reported.csv", show_col_types = FALSE)%>%
filter(!is.na(LATITUDE), !is.na(LONGITUDE)) %>%
mutate(CREATION_DATE = mdy(`CREATION DATE`)) %>%
filter(year(CREATION_DATE) == 2017) %>%
st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326) %>%
st_transform('ESRI:102271')
Pot_Holes <- Pot_Holes %>%
filter(!st_is_empty(geometry)) %>%
filter(
!is.na(st_coordinates(.)[, 1]),
!is.na(st_coordinates(.)[, 2])
)
# Load Chicago boundary
chicagoBoundary <-
st_read("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson", quiet = TRUE) %>%
st_transform('ESRI:102271')
# Load police districts (used for spatial cross-validation)
policeDistricts <-
st_read("https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON", quiet = TRUE) %>%
st_transform('ESRI:102271') %>%
dplyr::select(District = dist_num)
# Load police beats (smaller administrative units)
policeBeats <-
st_read("https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON", quiet = TRUE) %>%
st_transform('ESRI:102271') %>%
dplyr::select(Beat = beat_num)
# Load burglaries data file (downloaded from Chicago open data portal)
burglaries <- st_read("data/burglaries.shp", quiet = TRUE) %>%
st_transform('ESRI:102271')
```
## 2.3 Visualize Point Data
```{r}
# Simple point map
p1 <- ggplot() +
geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
geom_sf(data = Pot_Holes, color = "#d62828", size = 0.1, alpha = 0.4) +
labs(
title = "Reported Pot Holes Locations",
subtitle = paste0("Chicago 2017, n = ", nrow(Pot_Holes))
)+
theme_void()
# Density surface using modern syntax
p2 <- ggplot() +
geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
geom_density_2d_filled(
data = data.frame(st_coordinates(Pot_Holes)),
aes(X, Y),
alpha = 0.7,
bins = 8
) +
scale_fill_viridis_d(
option = "plasma",
direction = -1,
guide = "none" # Modern ggplot2 syntax (not guide = FALSE)
) +
labs(
title = "Density Surface",
subtitle = "Kernel density estimation"
)+
theme_void()
# Combine plots using patchwork (modern approach)
p1 + p2 +
plot_annotation(
title = "Spatial Distribution of Reported Pot Holes in Chicago",
tag_levels = 'A'
)
```
**Interpretation:**
The point map on the left shows that pothole reports in 2017 are widespread across Chicago, with particularly dense coverage in most urbanized areas. Some low-report zones appear in the southern part of the city. The kernel density map on the right highlights these patterns more clearly: the northeast and northwest areas exhibit the highest concentrations, with additional hotspots in the southwest and southeast. In contrast, the central and western sections show relatively lower report intensity. Overall, pothole reports display strong spatial heterogeneity across the city.
# 3. Fishnet Grid
In this step, I create a 500 × 500 m fishnet grid across Chicago and aggregate both pothole and burglary point data into the grid cells, producing count values for each location. This transforms the raw point data into a uniform spatial unit, making it suitable for later regression modeling.
## 3.1 500\*500m Fishnet Creation
```{r}
# Create 500m x 500m grid
fishnet <- st_make_grid(
chicagoBoundary,
cellsize = 500, # 500 meters per cell
square = TRUE
) %>%
st_sf() %>%
mutate(uniqueID = row_number())
# Keep only cells that intersect Chicago
fishnet <- fishnet[chicagoBoundary, ]
```
## 3.2 Aggregate Violation to Grid
```{r}
# Spatial join: which cell contains each violation?
Pot_Holes_fishnet <- st_join(Pot_Holes, fishnet, join = st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID) %>%
summarize(countPotholes = n())
# Join back to fishnet (cells with 0 burglaries will be NA)
fishnet <- fishnet %>%
left_join(Pot_Holes_fishnet, by = "uniqueID") %>%
mutate(countPotholes = replace_na(countPotholes, 0))
# Aggregate burglaries to fishnet
burg_fishnet <- st_join(burglaries, fishnet, join = st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID) %>%
summarize(countBurglaries = n())
# Join burglary counts back to same fishnet
fishnet <- fishnet %>%
left_join(burg_fishnet, by = "uniqueID") %>%
mutate(countBurglaries = replace_na(countBurglaries, 0))
```
## 3.3 Visualize Distribution
```{r}
# Visualize aggregated counts
ggplot() +
geom_sf(data = fishnet, aes(fill = countPotholes), color = NA) +
geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
scale_fill_viridis_c(
name = "Pot Holes",
option = "plasma",
trans = "sqrt", # Square root for better visualization of skewed data
breaks = c(0, 1, 5, 10, 20, 40)
) +
labs(
title = "Pot Holes Counts by Grid Cell",
subtitle = "500m x 500m cells, Chicago 2017"
) +
theme_crime()+
theme_void()
```
**Interpretation:**
The fishnet map shows that pothole reports are not evenly distributed across Chicago. Higher counts appear in the northern and central parts of the city, while the southern area generally shows lower reporting levels. Localized hotspots also emerge along major road corridors. This suggests that road quality issues vary across neighborhoods, and pothole activity may serve as a proxy for broader spatial differences in infrastructure conditions.
# 4. Spatial Features
Here generate spatial features including k-nearest neighbors and Local Moran’s I. This helps to further capture spatial structure and visualize clustering patterns in the pothole data.
The **K-Nearest Neighbor** feature measures how close each grid cell is to recent pothole activity, helping quantify local infrastructure conditions surrounding burglary locations.
**Local Moran’s I** identifies spatial clusters of high and low pothole activity. These clusters help uncover neighborhood-level infrastructure patterns that may relate to burglary risk.
## 4.1 K-Nearest Neighbor Features
```{r}
# Calculate mean distance to 3 nearest Pot Holes
# (Do this OUTSIDE of mutate to avoid sf conflicts)
# Get coordinates
fishnet <- st_make_valid(fishnet)
fishnet_coords <- st_coordinates(suppressWarnings(st_centroid(fishnet)))
Holes_coords <- st_coordinates(Pot_Holes)
# Calculate k nearest neighbors and distances
nn_result <- get.knnx(Holes_coords, fishnet_coords, k = 3)
# Add to fishnet
fishnet <- fishnet %>%
mutate(
potholes.nn = rowMeans(nn_result$nn.dist)
)
summ <- summary(fishnet$potholes.nn)
summ_df <- data.frame(
Statistic = c("Min", "Q1", "Median", "Mean", "Q3", "Max"),
Value = round(as.numeric(summ), 2)
)
kable(
summ_df,
caption = "Summary of Mean Distance to 3 Nearest Pot Holes (meters)"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
```
**Interpretation:**
The KNN summary indicates strong spatial variation: some cells are only a few meters from potholes, while others are over 1 km away. The median distance of about 87 m suggests that pothole activity is dense and widely distributed across the city.
## 4.2 Local Moran’s I
```{r}
# Function to calculate Local Moran's I
calculate_local_morans <- function(data, variable, k = 5) {
# Create spatial weights
coords <- st_coordinates(suppressWarnings(st_centroid(data)))
neighbors <- knn2nb(knearneigh(coords, k = k))
weights <- nb2listw(neighbors, style = "W", zero.policy = TRUE)
# Calculate Local Moran's I
local_moran <- localmoran(data[[variable]], weights)
# Classify clusters
mean_val <- mean(data[[variable]], na.rm = TRUE)
data %>%
mutate(
local_i = local_moran[, 1],
p_value = local_moran[, 5],
is_significant = p_value < 0.05,
moran_class = case_when(
!is_significant ~ "Not Significant",
local_i > 0 & .data[[variable]] > mean_val ~ "High-High",
local_i > 0 & .data[[variable]] <= mean_val ~ "Low-Low",
local_i < 0 & .data[[variable]] > mean_val ~ "High-Low",
local_i < 0 & .data[[variable]] <= mean_val ~ "Low-High",
TRUE ~ "Not Significant"
)
)
}
# Apply to potholes
fishnet <- calculate_local_morans(fishnet, "countPotholes", k = 5)
```
## 4.3 Hot Spots and Cold Spots
```{r}
# Visualize hot spots
ggplot() +
geom_sf(
data = fishnet,
aes(fill = moran_class),
color = NA
) +
scale_fill_manual(
values = c(
"High-High" = "#d7191c",
"High-Low" = "#fdae61",
"Low-High" = "#abd9e9",
"Low-Low" = "#2c7bb6",
"Not Significant" = "gray90"
),
name = "Cluster Type"
) +
labs(
title = "Local Moran's I: Pothole Clusters",
subtitle = "High-High = Hot spots of disorder"
) +
theme_crime()+
theme_void()
```
## 4.4 Distance-to-Hotspot
```{r}
# Get centroids of "High-High" cells (hot spots)
hotspots <- suppressWarnings(
fishnet %>%
filter(moran_class == "High-High") %>%
st_centroid()
)
# Calculate distance from each cell to nearest hot spot
if (nrow(hotspots) > 0) {
fishnet <- fishnet %>%
mutate(
dist_to_hotspot = as.numeric(
st_distance(
suppressWarnings(st_centroid(fishnet)),
hotspots %>% st_union()
)
)
)
cat(" - Number of hot spot cells:", nrow(hotspots), "\n")
} else {
fishnet <- fishnet %>%
mutate(dist_to_hotspot = 0)
cat("⚠ No significant hot spots found\n")
}
```
**Interpretation:**
Local Moran’s I results are interpreted as follows:
- **High–High** areas represent clusters of high burglary counts (hot spots).
- **Low–Low** areas represent clusters of consistently low values (cold spots).
- **High–Low** and **Low–High** areas indicate spatial outliers.
The LISA results show clear clusters of pothole activity across Chicago. High–High clusters (hotspots) appear mainly in the northern and central neighborhoods, indicating concentrated infrastructure issues. In contrast, the southern part of the city shows several Low–Low clusters, reflecting consistently low levels of reported potholes. The mix of High–Low and Low–High outliers suggests localized transitions between well-maintained and more deteriorated areas. Overall, the spatial pattern confirms that pothole activity is not random but forms distinct neighborhood clusters.
# 5. Count Regression Models
This section I model burglary counts using two standard count models—Poisson and Negative Binomial. Count data often show overdispersion (variance \> mean), so fitting both models allows us to compare performance and choose the most appropriate specification for prediction. AIC is used to evaluate which model provides a better fit.
## 5.1 Poisson Regression
```{r}
# Join district information to fishnet
fishnet <- st_join(
fishnet,
policeDistricts,
join = st_within,
left = TRUE
) %>%
filter(!is.na(District)) # Remove cells outside districts
# Create clean modeling dataset
fishnet_model <- fishnet %>%
st_drop_geometry() %>%
dplyr::select(
uniqueID,
District,
countBurglaries,
countPotholes,
potholes.nn,
dist_to_hotspot
) %>%
na.omit()
# Fit Poisson regression
model_poisson <- glm(
countBurglaries ~ countPotholes + potholes.nn +
dist_to_hotspot,
data = fishnet_model,
family = "poisson"
)
tidy(model_poisson) %>%
kable(digits = 2, caption = "Poisson Regression Coefficients") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
```{r}
# Calculate dispersion parameter
dispersion <- sum(residuals(model_poisson, type = "pearson")^2) /
model_poisson$df.residual
conclusion <- if (dispersion > 1.5) {
"⚠ Overdispersion detected — the Negative Binomial model is more appropriate."
} else {
"✓ Dispersion looks acceptable — the Poisson model is reasonable."
}
glue("
Dispersion Check
----------------
Dispersion parameter: {round(dispersion, 2)}
Rule of thumb: > 1.5 suggests overdispersion
Conclusion:
{conclusion}
")
```
## 5.2 Negative Binomial Regression
```{r}
# Fit Negative Binomial model
model_nb <- glm.nb(
countBurglaries ~ countPotholes + potholes.nn +
dist_to_hotspot,
data = fishnet_model
)
tidy(model_nb) %>%
kable(digits = 2, caption = "Negative Binomial Regression Coefficients") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
## 5.3 Model Fit(AIC)
```{r}
# Compare AIC (lower is better)
# Create AIC comparison dataframe
aic_table <- data.frame(
Model = c("Poisson", "Negative Binomial"),
AIC = c(AIC(model_poisson), AIC(model_nb))
)
# Display with kable
knitr::kable(
aic_table,
caption = "AIC Comparison of Count Models",
digits = 1,
align = "c"
) %>%
kableExtra::kable_styling(
bootstrap_options = c("striped", "hover"),
full_width = FALSE
)
```
**Interpretation:**
The Poisson regression shows strong overdispersion, which leads to a much higher AIC and weaker overall fit. The Negative Binomial model handles this more effectively and provides a substantially better model performance. In the Negative Binomial results, the k-nearest-neighbor pothole feature is highly significant and shows a negative relationship with burglary counts, indicating that areas closer to recent pothole activity tend to experience elevated burglary risk. In comparison, the total number of potholes and the distance to hotspots are not statistically significant once local proximity effects are included. These results suggest that spatial patterns in infrastructure conditions, especially the immediate surroundings captured by the KNN measure, help explain variation in burglary activity. The Negative Binomial model is therefore the more appropriate choice for this analysis.
# 6. Spatial Cross-Validation
After comparing both models, the Negative Binomial model is selected because it had the lower AIC. Using this model, I then performed Leave-One-Group-Out cross-validation with police districts as the grouping unit. In each fold, one district was held out, the model was trained on the remaining districts, and predictions were generated for the excluded area. This method reduces spatial leakage and provides a more reliable measure of how well the model performs across different parts of the city.
## 6.1 Leave-One-Group-Out cross-validation(LOGO CV)
```{r}
# Get unique districts
districts <- unique(fishnet_model$District)
cv_results <- tibble()
for (i in seq_along(districts)) {
test_district <- districts[i]
# Split data
train_data <- fishnet_model %>% filter(District != test_district)
test_data <- fishnet_model %>% filter(District == test_district)
# Fit model on training data
model_cv <- glm.nb(
countBurglaries ~ countPotholes + potholes.nn +
dist_to_hotspot,
data = train_data
)
# Predict on test data
test_data <- test_data %>%
mutate(
prediction = predict(model_cv, test_data, type = "response")
)
# Calculate metrics
mae <- mean(abs(test_data$countBurglaries - test_data$prediction))
rmse <- sqrt(mean((test_data$countBurglaries - test_data$prediction)^2))
# Store results
cv_results <- bind_rows(
cv_results,
tibble(
fold = i,
test_district = test_district,
n_test = nrow(test_data),
mae = mae,
rmse = rmse
)
)
}
```
## 6.2 Reported Metrics
```{r}
# Show results
cv_results %>%
arrange(desc(mae)) %>%
rename(
Fold = fold,
`Test District` = test_district,
`Test Count` = n_test,
MAE = mae,
RMSE = rmse
)%>%
kable(
digits = 2,
caption = "LOGO CV Results by District",
align = "c"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
overall_df <- data.frame(
Metric = c("Mean MAE", "Mean RMSE"),
Value = c(
round(mean(cv_results$mae), 2),
round(mean(cv_results$rmse), 2)
)
)
knitr::kable(
overall_df,
caption = "Overall LOGO Cross-Validation Performance",
align = "c",
digits = 2
) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
```
**Interpretation:**
The LOGO results indicate consistent model performance across most police districts, with MAE values largely concentrated around 2–3. This pattern suggests that the model captures broad spatial variation in burglary risk and performs relatively stably across the city’s neighbourhoods. A small number of districts, such as District 3, show noticeably higher errors, indicating localized conditions not fully captured by the predictors. Overall, the model generalizes well spatially, and the mean MAE of 2.87 and mean RMSE of 3.79 reinforce that performance is steady at the citywide scale, even though certain areas remain more difficult to predict.
# 7. Model Evaluation
In the final step I generated predictions from the negative binomial model and compared them to the KDE baseline. This allows evaluating whether the regression model provides meaningful improvement over a simple spatial smoothing approach. \## Final Model Prediction
```{r}
# Fit final model on all data
final_model <- glm.nb(
countBurglaries ~ countPotholes + potholes.nn +
dist_to_hotspot,
data = fishnet_model
)
# Add predictions back to fishnet
fishnet <- fishnet %>%
mutate(
prediction_nb = predict(final_model, fishnet_model, type = "response")[match(uniqueID, fishnet_model$uniqueID)]
)
```
## 7.1 KDE baseline Comparison
```{r}
burglaries_kde <- suppressWarnings(st_intersection(burglaries, chicagoBoundary))
coords <- st_coordinates(burglaries_kde) %>% unique()
# Convert burglaries to ppp (point pattern) format for spatstat
burglaries_ppp <- as.ppp(
coords,
W = as.owin(st_bbox(chicagoBoundary))
)
# Calculate KDE with 1km bandwidth
kde_burglaries <- density.ppp(
burglaries_ppp,
sigma = 1000, # 1km bandwidth
edge = TRUE # Edge correction
)
# Convert to terra raster (modern approach, not raster::raster)
kde_raster <- rast(kde_burglaries)
# Extract KDE values to fishnet cells
fishnet <- fishnet %>%
mutate(
kde_value = terra::extract(
kde_raster,
vect(fishnet),
fun = mean,
na.rm = TRUE
)[, 2] # Extract just the values column
)
```
```{r}
ggplot() +
geom_sf(data = fishnet, aes(fill = kde_value), color = NA) +
geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
scale_fill_viridis_c(
name = "KDE Value",
option = "plasma"
) +
labs(
title = "Kernel Density Estimation Baseline",
subtitle = "Simple spatial smoothing of burglary locations"
) +
theme_crime()
```
```{r}
# Also add KDE predictions (normalize to same scale as counts)
kde_sum <- sum(fishnet$kde_value, na.rm = TRUE)
count_sum <- sum(fishnet$countBurglaries, na.rm = TRUE)
fishnet <- fishnet %>%
mutate(
prediction_kde = (kde_value / kde_sum) * count_sum
)
# Create three maps
p1 <- ggplot() +
geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
scale_fill_viridis_c(name = "Count", option = "plasma", limits = c(0, 15)) +
labs(title = "Actual Burglaries") +
theme_crime()+
theme_void()
p2 <- ggplot() +
geom_sf(data = fishnet, aes(fill = prediction_nb), color = NA) +
scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
labs(title = "Model Predictions (Neg. Binomial)") +
theme_crime()+
theme_void()
p3 <- ggplot() +
geom_sf(data = fishnet, aes(fill = prediction_kde), color = NA) +
scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
labs(title = "KDE Baseline Predictions") +
theme_crime()+
theme_void()
p1 + p2 + p3 +
plot_annotation(
title = "Actual vs. Predicted Burglaries",
subtitle = "Does our complex model outperform simple KDE?"
)
```
**Interpretation:**
The observed burglary counts show a highly clustered spatial pattern, with several distinct hotspots located in the central, southwest, and southeast portions of the study area.
The Negative Binomial model captures the broad structure of these patterns. However, it tends to slightly underpredict the intensity of the strongest hotspots, producing more smoothed estimates compared to the observed counts.
The KDE baseline produces a more concentrated hotspot pattern, placing high predicted values in areas where burglaries are spatially clustered, regardless of local explanatory variables. While this method lacks covariate information, it aligns closely with the observed burglary distribution due to the strong spatial clustering present in the data.
## 7.2 Error Analysis
```{r}
# Calculate performance metrics
comparison <- fishnet %>%
st_drop_geometry() %>%
filter(!is.na(prediction_nb), !is.na(prediction_kde)) %>%
summarize(
model_mae = mean(abs(countBurglaries - prediction_nb)),
model_rmse = sqrt(mean((countBurglaries - prediction_nb)^2)),
kde_mae = mean(abs(countBurglaries - prediction_kde)),
kde_rmse = sqrt(mean((countBurglaries - prediction_kde)^2))
)
comparison %>%
pivot_longer(everything(), names_to = "metric", values_to = "value") %>%
separate(metric, into = c("approach", "metric"), sep = "_") %>%
pivot_wider(names_from = metric, values_from = value) %>%
kable(
digits = 2,
caption = "Model Performance Comparison"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
**Interpretation:**
The KDE baseline outperforms the Negative Binomial model, achieving lower MAE (2.06 vs 2.64) and lower RMSE (2.95 vs 3.73). This indicates that a simple spatial smoothing method provides more accurate predictions than the covariate-driven model. This performance gap suggests that burglary risk in this study area is dominated by spatial clustering rather than the specific covariates included in the model (potholes, nearest hotspot distance, etc.).
Thus, additional predictors or spatially structured random effects may be required for the Negative Binomial model to outperform pure spatial smoothing.
```{r}
# Calculate errors
fishnet <- fishnet %>%
mutate(
error_nb = countBurglaries - prediction_nb,
error_kde = countBurglaries - prediction_kde,
abs_error_nb = abs(error_nb),
abs_error_kde = abs(error_kde)
)
# Map errors
p1 <- ggplot() +
geom_sf(data = fishnet, aes(fill = error_nb), color = NA) +
scale_fill_gradient2(
name = "Error",
low = "#2166ac", mid = "white", high = "#b2182b",
midpoint = 0,
limits = c(-10, 10)
) +
labs(title = "Model Errors (Actual - Predicted)") +
theme_crime()+
theme_void()
p2 <- ggplot() +
geom_sf(data = fishnet, aes(fill = abs_error_nb), color = NA) +
scale_fill_viridis_c(name = "Abs. Error", option = "magma") +
labs(title = "Absolute Model Errors") +
theme_crime()+
theme_void()
p1 + p2
```
**Interpretation:**
The model on the left highlights clear spatial biases in the Negative Binomial predictions. Red areas indicate underprediction, which occurs mainly in strong hotspot locations where burglary counts are substantially higher than expected based on the model’s input variables. Blue areas represent overprediction and appear more frequently in low-crime peripheral zones, suggesting that the model slightly inflates burglary risk in areas with minimal activity.
These spatially structured errors suggest that the model does not fully capture all relevant drivers of burglary risk, particularly socio-economic factors that could contribute to high-intensity hotspots.
The absolute error map on the right reinforces the concentration of model inaccuracies within the major hotspot areas. Cells with the highest absolute errors cluster in central and southern districts, corresponding to locations shown above with the highest burglary counts. Low-burglary areas generally show small errors, indicating that the model performs reasonably well in predicting low-risk regions. In contrast, the model performance deteriorates in high-intensity crime environments, where simple spatial smoothing may outperform covariate-based modelling.
# 8. Conclusion
This analysis explored whether 311 pothole reports can help explain the spatial distribution of burglaries in Chicago. After preparing the data and building spatial features, I applied count regression models and spatial cross-validation to evaluate predictive performance. The results show that while individual pothole counts are not strong predictors, the k-nearest-neighbor proximity to recent pothole reports provides useful spatial information related to burglary patterns. This supports the idea that neighborhood infrastructure conditions may correlate with crime levels.
The negative binomial model outperformed the Poisson model, indicating that burglary counts exhibit overdispersion commonly found in urban crime data. Spatial cross-validation showed stable performance across most police districts, suggesting that the model generalizes reasonably well across space.
Compared with a simple KDE baseline, the regression model performed similarly but did not substantially outperform it, highlighting the strength of spatial smoothing methods for hotspot-like phenomena. Nonetheless, the model offers interpretability and provides insight into how environmental context relates to burglary risk.
Overall, the analysis demonstrates that combining 311 service requests with spatial modeling can reveal meaningful urban patterns. Future improvements could incorporate additional socio-economic or land-use variables, test alternative bandwidths for KDE, or extend validation with temporal data.