---
title: "Spatial Prediction of Burglaries with 311 Graffiti Removal Requests"
subtitle: "MUSA 5080 - Fall 2025"
author: "Isabelle Li"
date: today
format:
html:
code-fold: show
code-tools: true
toc: true
toc-depth: 3
toc-location: left
theme: cosmo
embed-resources: true
editor: visual
execute:
warning: false
message: false
---
```{r setup}
#| echo: false
#| message: false
#| warning: false
# Load required packages
library(tidyverse)
library(sf)
library(here)
library(viridis)
library(terra)
library(spdep)
library(FNN)
library(MASS)
library(patchwork)
library(knitr)
library(kableExtra)
library(classInt)
library(here)
# Spatstat split into sub-packages
library(spatstat.geom)
library(spatstat.explore)
# Set options
options(scipen = 999)
set.seed(5080)
# Create consistent theme for visualizations
theme_crime <- function(base_size = 11) {
theme_minimal(base_size = base_size) +
theme(
plot.title = element_text(face = "bold", size = base_size + 1),
plot.subtitle = element_text(color = "gray30", size = base_size - 1),
legend.position = "right",
panel.grid.minor = element_blank(),
axis.text = element_blank(),
axis.title = element_blank()
)
}
# Set as default
theme_set(theme_crime())
```
# Part 1: Introduction and Data Exploration
In this project, I use **Graffiti Removal 311 requests** as the main predictor to explore whether neighborhood disorder can help explain the spatial distribution of burglaries in Chicago. I chose graffiti because it is a common and consistently reported 311 issue, it has clear spatial patterns, and it is often used as a proxy for visible disorder in urban areas.
While graffiti has no direct causal link to burglary, it may reflect broader neighborhood conditions such as maintenance levels, social stress, or public activity, which could indirectly relate to crime patterns. My goal is to test whether this type of environmental signal provides predictive value and to understand how well disorder-based features perform compared to a purely spatial baseline like kernel density estimation (KDE).
```{r load-boundaries}
#| echo: false
#| message: false
#| warning: false
# Load police districts
policeDistricts <- st_read(
"https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON",
quiet = TRUE
) %>%
st_transform('ESRI:102271') %>%
dplyr::select(District = dist_num)
# Load police beats
policeBeats <- st_read(
"https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON",
quiet = TRUE
) %>%
st_transform('ESRI:102271') %>%
dplyr::select(Beat = beat_num)
# Load Chicago boundary
chicagoBoundary <- st_read(
"https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson",
quiet = TRUE
) %>%
st_transform('ESRI:102271')
```
```{r load-burglaries}
#| echo: false
#| message: false
#| warning: false
# Load from provided data file (downloaded from Chicago open data portal)
burglaries <- st_read(
here("assignments", "assignment4", "data", "burglaries.shp"), quiet = TRUE
) %>%
st_transform('ESRI:102271')
```
There are **7,482 burglary incidents** in the dataset.\
These records come from the **Chicago Crimes 2017** data, representing all **reported burglary (forcible entry)** cases during that calendar year.
Datasets are projected into Illinois State Plane East (EPSG:102271) for accurate distance and area calculations.
## 1.2: Visualize Point Data
```{r visualize-points}
#| fig-width: 10
#| fig-height: 5
# Simple point map
p1 <- ggplot() +
geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
geom_sf(data = burglaries, color = "#d62828", size = 0.1, alpha = 0.4) +
labs(
title = "Burglary Locations",
subtitle = paste0("Chicago 2017, n = ", nrow(burglaries))
)
# Density surface using modern syntax
p2 <- ggplot() +
geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
geom_density_2d_filled(
data = data.frame(st_coordinates(burglaries)),
aes(X, Y),
alpha = 0.7,
bins = 8
) +
scale_fill_viridis_d(
option = "plasma",
direction = -1,
guide = "none" # Modern ggplot2 syntax (not guide = FALSE)
) +
labs(
title = "Density Surface",
subtitle = "Kernel density estimation"
)
# Combine plots using patchwork (modern approach)
p1 + p2 +
plot_annotation(
title = "Spatial Distribution of Burglaries in Chicago",
tag_levels = 'A'
)
```
The burglary incidents are **not evenly distributed** across Chicago. Instead, they exhibit a clear clustered pattern, with several well-defined hot spots.
The **highest concentrations** appear in:
- the **West Side** (especially near Humboldt Park and Garfield Park),
- parts of the **South Side**, and
- portions of the **Near North / Near West** areas.
In contrast, the Northwest and Southwest outer neighborhoods show very few incidents, indicating substantially lower burglary activity.
# Part 2: Create Fishnet Grid
## 2.1: Creating the Fishnet
A **fishnet grid** converts irregular point data into a regular grid of cells where we can:
- Aggregate counts
- Calculate spatial features
- Apply regression models
```{r create-fishnet}
# Create 500m x 500m grid
fishnet <- st_make_grid(
chicagoBoundary,
cellsize = 500, # 500 meters per cell
square = TRUE
) %>%
st_sf() %>%
mutate(uniqueID = row_number())
# Keep only cells that intersect Chicago
fishnet <- fishnet[chicagoBoundary, ]
```
We use a regular fishnet grid because it provides **equal-sized spatial units**, which makes burglary counts directly comparable across the entire study area. Neighborhoods or census tracts vary widely in size and shape, and those differences can distort the spatial patterns we observe or bias the regression results. A uniform grid avoids those issues and creates a consistent framework for aggregating incidents, calculating spatial features, and fitting predictive models.
## 2.2: Aggregate Burglaries to Grid
```{r Burglaries Aggregate}
burglaries_fishnet <- st_join(burglaries, fishnet, join = st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID) %>%
summarize(countBurglaries = n())
# Join back to fishnet (cells with 0 burglaries will be NA)
fishnet <- fishnet %>%
left_join(burglaries_fishnet, by = "uniqueID") %>%
mutate(countBurglaries = replace_na(countBurglaries, 0))
fishnet %>%
st_drop_geometry() %>%
summarize(
Min = min(countBurglaries),
Q1 = quantile(countBurglaries, 0.25),
Median = median(countBurglaries),
Mean = mean(countBurglaries),
Q3 = quantile(countBurglaries, 0.75),
Max = max(countBurglaries),
Percent_zero = mean(countBurglaries == 0) * 100
) %>%
mutate(across(everything(), ~round(., 2))) %>%
kable(
caption = "Summary of Burglary Counts per 500m Grid Cell"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
The summary table shows that most cells have relatively low counts, while a few cells reach very high values, which confirms that burglary is highly concentrated in a small number of hotspots.
```{r visualize-fishnet}
#| fig-width: 8
#| fig-height: 6
# Visualize aggregated counts
ggplot() +
geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
scale_fill_viridis_c(
name = "Burglaries",
option = "plasma",
trans = "sqrt", # Square root for better visualization of skewed data
breaks = c(0, 1, 5, 10, 20, 40)
) +
labs(
title = "Burglary Counts by Grid Cell",
subtitle = "500m x 500m cells, Chicago 2017"
) +
theme_crime()
```
This map shows the spatial distribution of burglary counts aggregated to 500m × 500m grid cells. Burglary is **not evenly distributed** across Chicago. Instead, it forms distinct clusters, with noticeably higher concentrations in the **West Side and the South Side**, while the northern and far-southern areas show very low activity. The pattern also reveals a strong **downtown-to-south corridor** where incidents are more frequent. This further confirms that burglary follows a highly spatially concentrated and uneven pattern, which is why later steps such as KDE, hotspot detection, and spatially aware cross-validation are necessary.
## 2.3 Create a Kernel Density Baseline
```{r kde-baseline}
#| message: false
#| warning: false
#| echo: false
# Convert burglaries to ppp (point pattern) format for spatstat
burglaries_ppp <- as.ppp(
st_coordinates(burglaries),
W = as.owin(st_bbox(chicagoBoundary))
)
# Calculate KDE with 1km bandwidth
kde_burglaries <- density.ppp(
burglaries_ppp,
sigma = 1000, # 1km bandwidth
edge = TRUE # Edge correction
)
# Convert to terra raster (modern approach, not raster::raster)
kde_raster <- rast(kde_burglaries)
# Extract KDE values to fishnet cells
fishnet <- fishnet %>%
mutate(
kde_value = terra::extract(
kde_raster,
vect(fishnet),
fun = mean,
na.rm = TRUE
)[, 2] # Extract just the values column
)
```
```{r visualize-kde}
#| fig-width: 8
#| fig-height: 6
ggplot() +
geom_sf(data = fishnet, aes(fill = kde_value), color = NA) +
geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
scale_fill_viridis_c(
name = "KDE Value",
option = "plasma"
) +
labs(
title = "Kernel Density Estimation Baseline",
subtitle = "Simple spatial smoothing of burglary locations"
) +
theme_crime()
```
The KDE map produces a much smoother representation of burglary intensity compared to the raw count map.
# Part 3: Create Spatial Predictor Variables
```{r load-abandoned-cars}
#| echo: false
#| message: false
#| warning: false
Graffiti_removal <- read_csv( here("assignments", "assignment4", "data", "Graffiti_Removal.csv"))%>%
filter(!is.na(Latitude), !is.na(Longitude)) %>%
st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326) %>%
st_transform('ESRI:102271')
```
## 3.1: Count of Graffiti Removal per Cell
```{r count-Graffiti_removal}
Removal_fishnet <- st_join(fishnet, Graffiti_removal, join = st_contains) %>%
st_drop_geometry() %>%
count(uniqueID, name = "Graffiti_removal")
fishnet$Graffiti_removal <- 0L
idx <- match(Removal_fishnet$uniqueID, fishnet$uniqueID)
fishnet$Graffiti_removal[idx] <- Removal_fishnet$Graffiti_removal
fishnet %>%
st_drop_geometry() %>%
summarize(
Min = min(Graffiti_removal),
Q1 = quantile(Graffiti_removal, 0.25),
Median = median(Graffiti_removal),
Mean = mean(Graffiti_removal),
Q3 = quantile(Graffiti_removal, 0.75),
Max = max(Graffiti_removal)
) %>%
mutate(across(everything(), ~round(., 2))) %>%
kable(
caption = "Summary of Graffiti Removal 311 Calls per 500m Grid Cell"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
```{r visualize-Graffiti-Removal }
#| fig-width: 10
#| fig-height: 4
p1 <- ggplot() +
geom_sf(data = fishnet, aes(fill = Graffiti_removal), color = NA) +
scale_fill_viridis_c(name = "Count", option = "magma") +
labs(title = "Graffiti Removal 311 Calls") +
theme_crime()
p2 <- ggplot() +
geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
scale_fill_viridis_c(name = "Count", option = "plasma") +
labs(title = "Burglaries") +
theme_crime()
p1 + p2 +
plot_annotation(title = "Are Graffiti Removal and burglaries correlated?")
```
Visually, the two patterns do not align very closely. Graffiti Removal calls are concentrated in a few very specific pockets, mainly in the Northwest and parts of the West and Southwest sides. Burglary, however, is more broadly distributed, with multiple hotspots across the West Side, South Side, and the central areas of the city.
There is some general overlap in higher-activity regions, but the intensity and location of clusters differ enough that graffiti does not appear to be a consistent indicator of burglary risk. Graffiti reflects surface-level disorder in public space, while burglary involves private property intrusion, which likely responds to different environmental and situational factors.
## 3.2: Nearest Neighbor Features
```{r nn-feature}
#| message: false
#| warning: false
# Calculate mean distance to 3 nearest Graffiti Removal
Graffiti_clean <- Graffiti_removal %>%
filter(!st_is_empty(geometry))
# Get coordinates
fishnet_coords <- st_coordinates(st_centroid(fishnet))
removal_coords <- st_coordinates(Graffiti_clean)
# Calculate k nearest neighbors and distances
nn_result <- get.knnx(removal_coords, fishnet_coords, k = 3)
# Add to fishnet
fishnet <- fishnet %>%
mutate(
Graffiti_clean.nn = rowMeans(nn_result$nn.dist)
)
fishnet %>%
st_drop_geometry() %>%
summarize(
Min = min(Graffiti_clean.nn),
Q1 = quantile(Graffiti_clean.nn, 0.25),
Median = median(Graffiti_clean.nn),
Mean = mean(Graffiti_clean.nn),
Q3 = quantile(Graffiti_clean.nn, 0.75),
Max = max(Graffiti_clean.nn)
) %>%
mutate(across(everything(), ~round(., 2))) %>%
kable(
caption = "Summary of Mean Distance to 3 Nearest Graffiti Removal Calls (meters)"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
The distance-to-graffiti feature shows substantial variation, with a median of \~60 m but a long tail up to 1.4 km. This indicates that some neighborhoods are very close to multiple graffiti complaints, while others are far removed. This spatial heterogeneity helps capture environmental disorder that may relate to burglary risk.
## 3.3: Distance to Hot Spots
```{r local-morans-abandoned}
#| echo: false
#| message: false
#| warning: false
# Function to calculate Local Moran's I
calculate_local_morans <- function(data, variable, k = 5) {
# Create spatial weights
coords <- st_coordinates(st_centroid(data))
neighbors <- knn2nb(knearneigh(coords, k = k))
weights <- nb2listw(neighbors, style = "W", zero.policy = TRUE)
# Calculate Local Moran's I
local_moran <- localmoran(data[[variable]], weights)
# Classify clusters
mean_val <- mean(data[[variable]], na.rm = TRUE)
data %>%
mutate(
local_i = local_moran[, 1],
p_value = local_moran[, 5],
is_significant = p_value < 0.05,
moran_class = case_when(
!is_significant ~ "Not Significant",
local_i > 0 & .data[[variable]] > mean_val ~ "High-High",
local_i > 0 & .data[[variable]] <= mean_val ~ "Low-Low",
local_i < 0 & .data[[variable]] > mean_val ~ "High-Low",
local_i < 0 & .data[[variable]] <= mean_val ~ "Low-High",
TRUE ~ "Not Significant"
)
)
}
# Apply to Graffiti removal
fishnet <- calculate_local_morans(fishnet, "Graffiti_removal", k = 5)
```
```{r visualize-morans}
#| fig-width: 8
#| fig-height: 6
# Visualize hot spots
ggplot() +
geom_sf(
data = fishnet,
aes(fill = moran_class),
color = NA
) +
scale_fill_manual(
values = c(
"High-High" = "#d7191c",
"High-Low" = "#fdae61",
"Low-High" = "#abd9e9",
"Low-Low" = "#2c7bb6",
"Not Significant" = "gray90"
),
name = "Cluster Type"
) +
labs(
title = "Local Moran's I: Graffiti Removal Clusters",
subtitle = "High-High = Hot spots of disorder"
) +
theme_crime()
```
```{r distance-to-hotspots}
#| echo: false
#| message: false
#| warning: false
# Identify hotspot cells
hotspots <- fishnet %>%
filter(moran_class == "High-High") %>%
st_centroid()
# Calculate distance from each cell to nearest hotspot
if (nrow(hotspots) > 0) {
fishnet <- fishnet %>%
mutate(
dist_to_hotspot = as.numeric(
st_distance(st_centroid(fishnet), hotspots %>% st_union())
)
)
} else {
fishnet <- fishnet %>%
mutate(dist_to_hotspot = 0)
}
```
Distance to a cluster of Graffiti Removal calls is more informative than distance to a single graffiti incident because it captures the broader neighborhood context of disorder. A single report might be noisy or idiosyncratic, but a High–High cluster from Local Moran’s I represents an area where both the cell and its neighbors have consistently high graffiti counts. Being close to such a cluster suggests that a grid cell lies within or near a larger zone of visible deterioration or neglect.
------------------------------------------------------------------------
```{r join-districts}
#| echo: false
#| message: false
#| warning: false
# Join district information to fishnet
fishnet <- st_join(
fishnet,
policeDistricts,
join = st_within,
left = TRUE
) %>%
filter(!is.na(District)) # Remove cells outside districts
```
# Part 4: Model Fitting
## 4.1: Poisson Regression
```{r prepare-data}
#| echo: false
#| message: false
#| warning: false
# Create clean modeling dataset
fishnet_model <- fishnet %>%
st_drop_geometry() %>%
dplyr::select(
uniqueID,
District,
countBurglaries,
Graffiti_removal,
Graffiti_clean.nn,
dist_to_hotspot
) %>%
na.omit() # Remove any remaining NAs
```
```{r fit-poisson}
# Fit Poisson regression
library(broom)
model_poisson <- glm(
countBurglaries ~ Graffiti_removal + Graffiti_clean.nn +
dist_to_hotspot,
data = fishnet_model,
family = "poisson"
)
poisson_tbl <- tidy(model_poisson) %>%
mutate(
estimate = round(estimate, 4),
std.error = round(std.error, 4),
statistic = round(statistic, 2),
p.value = round(p.value, 4)
)
kable(
poisson_tbl,
caption = "Poisson Regression Results"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
**All three predictors are statistically significant (p \< 0.001).**
1. **Graffiti_removal** has a *positive* association with burglary counts.
```
More graffiti complaints → slightly more burglaries.
```
2. **Graffiti_clean.nn** has a *strong negative* association.
Cells closer to graffiti activity → much higher burglary risk.
This is the strongest predictor in the model.
3. **dist_to_hotspot** has a very weak but statistically significant *positive* association.
Being farther from graffiti hotspots slightly increases predicted burglaries.
Effect size is tiny and may not be meaningful.
```{r check-overdispersion}
#| echo: false
#| message: false
#| warning: false
# Calculate dispersion parameter
dispersion <- sum(residuals(model_poisson, type = "pearson")^2) /
model_poisson$df.residual
```
## 4.2: Negative Binomial Regression
If overdispersed, use **Negative Binomial regression** (more flexible).
```{r fit-negbin}
# Fit Negative Binomial model
model_nb <- glm.nb(
countBurglaries ~ Graffiti_removal + Graffiti_clean.nn +
dist_to_hotspot,
data = fishnet_model
)
nb_tbl <- tidy(model_nb) %>%
mutate(
estimate = round(estimate, 4),
std.error = round(std.error, 4),
statistic = round(statistic, 2),
p.value = round(p.value, 4)
)
kable(
nb_tbl,
caption = "Negative Binomial Regression Results"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
Its AIC (7654) is substantially lower than the Poisson model's AIC (9456), indicating that the NB model provides a better balance between model complexity and goodness of fit.
The improvement confirms that our burglary count data are overdispersed (variance \> mean), which violates a key Poisson assumption. As a result, the Poisson model inflated the significance of several predictors, while the NB model provides more reliable coefficient estimates.
# Part 5: Spatial Cross-Validation
```{r spatial-cv}
#| echo: false
#| message: false
#| warning: false
#| results: hide
# Get unique districts
districts <- unique(fishnet_model$District)
cv_results <- tibble()
cat("Running LOGO Cross-Validation...\n")
for (i in seq_along(districts)) {
test_district <- districts[i]
# Split data
train_data <- fishnet_model %>% filter(District != test_district)
test_data <- fishnet_model %>% filter(District == test_district)
# Fit model on training data
model_cv <- glm.nb(
countBurglaries ~ Graffiti_removal + Graffiti_clean.nn +
dist_to_hotspot,
data = train_data
)
# Predict on test data
test_data <- test_data %>%
mutate(
prediction = predict(model_cv, test_data, type = "response")
)
# Calculate metrics
mae <- mean(abs(test_data$countBurglaries - test_data$prediction))
rmse <- sqrt(mean((test_data$countBurglaries - test_data$prediction)^2))
# Store results
cv_results <- bind_rows(
cv_results,
tibble(
fold = i,
test_district = test_district,
n_test = nrow(test_data),
mae = mae,
rmse = rmse
)
)
cat(" Fold", i, "/", length(districts), "- District", test_district,
"- MAE:", round(mae, 2), "\n")
}
```
```{r cv-results-table}
# Show results
cv_results %>%
arrange(desc(mae)) %>%
kable(
digits = 2,
caption = "LOGO CV Results by District"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
Spatial CV is more appropriate than random CV because crime data is highly spatially autocorrelated. Random CV would mix neighboring cells into both training and testing sets, causing spatial leakage and artificially inflating model performance. LOGO CV avoids this by holding out entire police districts, which tests the model’s ability to generalize to new spatial areas.
From the results, District 3 was the hardest to predict (MAE = 6.31), followed by District 6, suggesting that these areas have unique burglary patterns not well explained by graffiti-based predictors. In contrast, District 19 was one of the easiest districts to predict (MAE ≈ 1.9), indicating burglary patterns there are more stable and similar to the training data.
# Part 6: Model Evaluation
```{r final-predictions}
#| echo: false
#| message: false
#| warning: false
# Fit final model on all data
final_model <- glm.nb(
countBurglaries ~ Graffiti_removal + Graffiti_clean.nn +
dist_to_hotspot,
data = fishnet_model
)
# Add predictions back to fishnet
fishnet <- fishnet %>%
mutate(
prediction_nb = predict(final_model, fishnet_model, type = "response")[match(uniqueID, fishnet_model$uniqueID)]
)
# Also add KDE predictions (normalize to same scale as counts)
kde_sum <- sum(fishnet$kde_value, na.rm = TRUE)
count_sum <- sum(fishnet$countBurglaries, na.rm = TRUE)
fishnet <- fishnet %>%
mutate(
prediction_kde = (kde_value / kde_sum) * count_sum
)
```
## 6.1: Compare Model vs. KDE Baseline
```{r compare-models}
#| fig-width: 12
#| fig-height: 4
# Create three maps
p1 <- ggplot() +
geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
scale_fill_viridis_c(name = "Count", option = "plasma", limits = c(0, 15)) +
labs(title = "Actual Burglaries") +
theme_crime()
p2 <- ggplot() +
geom_sf(data = fishnet, aes(fill = prediction_nb), color = NA) +
scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
labs(title = "Model Predictions (Neg. Binomial)") +
theme_crime()
p3 <- ggplot() +
geom_sf(data = fishnet, aes(fill = prediction_kde), color = NA) +
scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
labs(title = "KDE Baseline Predictions") +
theme_crime()
p1 + p2 + p3 +
plot_annotation(
title = "Actual vs. Predicted Burglaries",
subtitle = "Does our complex model outperform simple KDE?"
)
```
```{r model-comparison-metrics}
# Calculate performance metrics
comparison <- fishnet %>%
st_drop_geometry() %>%
filter(!is.na(prediction_nb), !is.na(prediction_kde)) %>%
summarize(
model_mae = mean(abs(countBurglaries - prediction_nb)),
model_rmse = sqrt(mean((countBurglaries - prediction_nb)^2)),
kde_mae = mean(abs(countBurglaries - prediction_kde)),
kde_rmse = sqrt(mean((countBurglaries - prediction_kde)^2))
)
comparison %>%
pivot_longer(everything(), names_to = "metric", values_to = "value") %>%
separate(metric, into = c("approach", "metric"), sep = "_") %>%
pivot_wider(names_from = metric, values_from = value) %>%
kable(
digits = 2,
caption = "Model Performance Comparison"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
### **The KDE baseline performs better than the complex model**
The KDE has **lower MAE and RMSE**, which means:
- Smaller average prediction error
- More stable predictions
- Better overall fit to the spatial pattern of burglaries
This is a common result in spatial prediction tasks: KDE is extremely strong because crime tends to cluster, and KDE captures clustering very well without requiring any assumptions or regression structure.
## 6.2: Where Does the Model Work Well?
```{r prediction-errors}
#| fig-width: 10
#| fig-height: 5
# Calculate errors
fishnet <- fishnet %>%
mutate(
error_nb = countBurglaries - prediction_nb,
error_kde = countBurglaries - prediction_kde,
abs_error_nb = abs(error_nb),
abs_error_kde = abs(error_kde)
)
# Map errors
p1 <- ggplot() +
geom_sf(data = fishnet, aes(fill = error_nb), color = NA) +
scale_fill_gradient2(
name = "Error",
low = "#2166ac", mid = "white", high = "#b2182b",
midpoint = 0,
limits = c(-10, 10)
) +
labs(title = "Model Errors (Actual - Predicted)") +
theme_crime()
p2 <- ggplot() +
geom_sf(data = fishnet, aes(fill = abs_error_nb), color = NA) +
scale_fill_viridis_c(name = "Abs. Error", option = "magma") +
labs(title = "Absolute Model Errors") +
theme_crime()
p1 + p2
```
The error maps show that the Negative Binomial model systematically underpredicts burglaries in major hotspot areas especially in the South and West Sides, while slightly overpredicting in low-crime neighborhoods. This pattern suggests that the model cannot fully capture sharp, localized spikes in burglary activity, likely because graffiti-related predictors only partially reflect the underlying opportunity structures of crime. As a result, absolute errors are largest in high-crime districts and much smaller in low-crime ones, indicating that the model performs unevenly across space and is least accurate exactly where prediction is most important.
## 6.3: Model Summary Table
```{r model-summary-table}
# Create nice summary table
model_summary <- broom::tidy(final_model, exponentiate = TRUE) %>%
mutate(
across(where(is.numeric), ~round(., 3))
)
model_summary %>%
kable(
caption = "Final Negative Binomial Model Coefficients (Exponentiated)",
col.names = c("Variable", "Rate Ratio", "Std. Error", "Z", "P-Value")
) %>%
kable_styling(bootstrap_options = c("striped", "hover")) %>%
footnote(
general = "Rate ratios > 1 indicate positive association with burglary counts."
)
```
## 6.4 Key Findings Summary
Overall, the analysis shows that graffiti-related spatial features provide limited predictive power for burglary counts. Although the Negative Binomial model identifies some statistically significant relationships such as closer proximity to graffiti clusters being associated with slightly higher burglary counts, the effect sizes are small, and the model struggles to capture sharp crime hotspots.
The **KDE baseline** clearly outperforms the regression model in both MAE and RMSE, reinforcing that burglary exhibits strong spatial clustering that is better captured through spatial density patterns than through indirect disorder indicators.
Error maps also show that the model consistently underpredicts major hotspot areas and performs best in low-crime neighborhoods, suggesting that graffiti reflects general urban conditions but does not directly track burglary risk.
In short, disorder signals like graffiti may help describe neighborhood environments, but they are not sufficient for accurately predicting burglary patterns on their own. This might because Graffiti Removal 311 calls reflect residents’ reporting behavior more than actual vandalism. High-income neighborhoods tend to overreport, while disadvantaged neighborhoods underreport, producing a spatial pattern that diverges from the true distribution of disorder and therefore has limited predictive value for burglary.