---
title: "Assignment 4: Predictive Policing & Analysis"
author: "Jed Chew"
date: 11/11/2025
format:
html:
code-fold: show
code-tools: true
toc: true
toc-depth: 3
toc-location: left
theme: cosmo
embed-resources: true
editor: visual
execute:
warning: false
message: false
---
## About This Exercise
In this lab, you will apply the spatial predictive modeling techniques demonstrated in the class exercise to a 311 service request type of your choice. You will build a complete spatial predictive model, document your process, and interpret your results.
# Learning Objectives
1. Create a fishnet grid for aggregating point-level crime data
2. Calculate spatial features including k-nearest neighbors and distance measures
3. Diagnose spatial autocorrelation using Local Moran's I
4. Fit and interpret Poisson and Negative Binomial regression for count data
5. Implement spatial LOGO cross-validation
6. Compare model predictions to a Kernel Density Estimation baseline
7. Evaluate model performance using appropriate metrics
# Analysis Requirements
**For each major section, you must explain in your own words:**
- **What** you are doing in that step
- **Why** this step is important for the analysis
- **What** you found or learned from the results
# Setup
```{r setup}
#| message: false
#| warning: false
#| results: 'hide'
library(tidyverse) # Data manipulation
library(sf) # Spatial operations
library(here) # Relative file paths
library(viridis) # Color scales
library(terra) # Raster operations (replaces 'raster')
library(spdep) # Spatial dependence
library(FNN) # Fast nearest neighbors
library(MASS) # Negative binomial regression
library(patchwork) # Plot composition (replaces grid/gridExtra)
library(knitr) # Tables
library(kableExtra) # Table formatting
library(classInt) # Classification intervals
library(janitor) # Clean Names
library(here)
library(spatstat.geom) # Spatial geometries
library(spatstat.explore) # Spatial exploration/KDE
# Set options
options(scipen = 999) # No scientific notation
set.seed(5080) # Reproducibility
# Create consistent theme for visualizations
theme_crime <- function(base_size = 11) {
theme_minimal(base_size = base_size) +
theme(
plot.title = element_text(face = "bold", size = base_size + 1),
plot.subtitle = element_text(color = "gray30", size = base_size - 1),
legend.position = "right",
panel.grid.minor = element_blank(),
axis.text = element_blank(),
axis.title = element_blank()
)
}
# Set as default
theme_set(theme_crime())
# Set Working Directory
setwd("C:/Users/chewj/Documents/MUSA/Github/portfolio-setup-jedchewjm/labs/lab_4")
cat("✓ All packages loaded successfully!\n")
cat("✓ Working directory:", getwd(), "\n")
```
# Part 1: Load Chicago Spatial and Crime Data
### 1.1 \| Load Chicago Spatial Data
```{r load-boundaries}
#| message: false
#| warning: false
#| results: 'hide'
# Load police districts (used for spatial CV)
# Use ESRI:102271 (Illinois State Plane East, NAD83, US Feet) as CRS
policeDistricts <-
st_read("https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON") %>%
st_transform('ESRI:102271') %>%
dplyr::select(District = dist_num)
# Load police beats (smaller administrative units)
policeBeats <-
st_read("https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON") %>%
st_transform('ESRI:102271') %>%
dplyr::select(Beat = beat_num)
# Load Chicago boundary
chicagoBoundary <-
st_read("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson") %>%
st_transform('ESRI:102271')
cat("✓ Loaded spatial boundaries\n")
cat(" - Police districts:", nrow(policeDistricts), "\n")
cat(" - Police beats:", nrow(policeBeats), "\n")
```
### 1.2 \| Load Burglary Data
```{r load-burglaries}
#| message: false
#| warning: false
# Load from provided data file (downloaded from Chicago open data portal)
burglaries <- st_read("data/burglaries.shp") %>%
st_transform('ESRI:102271')
# Check the data
cat("\n✓ Loaded burglary data\n")
cat(" - Number of burglaries:", nrow(burglaries), "\n")
cat(" - CRS:", st_crs(burglaries)$input, "\n")
cat(" - Date range:", min(burglaries$date, na.rm = TRUE), "to",
max(burglaries$date, na.rm = TRUE), "\n")
```
### 1.3 \| Visualize Point Data & Describe Patterns
```{r visualize-points}
#| fig-width: 10
#| fig-height: 5
# Simple point map
p1 <- ggplot() +
geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
geom_sf(data = burglaries, color = "#d62828", size = 0.1, alpha = 0.4) +
labs(
title = "Burglary Locations",
subtitle = paste0("Chicago 2017, n = ", nrow(burglaries))
)
# Density surface
p2 <- ggplot() +
geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
geom_density_2d_filled(
data = data.frame(st_coordinates(burglaries)),
aes(X, Y),
alpha = 0.7,
bins = 8
) +
scale_fill_viridis_d(option = "plasma", direction = -1, guide = "none") +
labs(
title = "Density Surface",
subtitle = "Kernel density estimation"
)
# Combine plots using patchwork
p1 + p2 +
plot_annotation(
title = "Spatial Distribution of Burglaries in Chicago"
)
```
::: callout-note
## Observed Patterns
- Burglary locations are widely dispersed throughout Chicago but cluster more densely in certain areas such as the West Side and South Side neighborhoods
- This pattern is more obvious in the KDE map which smooths point data to highlight hotspots of burglary activity
- Notable coldspots include the far Northside and the Downtown Core (i.e. the Loop CBD)
:::
# Part 2: Create Fishnet Grid
- A fishnet grid converts irregular point data into a regular grid of cells where we can aggregate counts / calculate spatial features / apply regression models.
- A fishnet is more useful than administrative units such as census tracts because crime risk is a phenomenon that varies across the landscape rather than across administrative units.
### 2.1 \| Create 500 m x 500 m Fishnet Grid
```{r create-fishnet}
# Create 500m x 500m grid
fishnet <- st_make_grid(
chicagoBoundary,
cellsize = 500, # 500m per cell
square = TRUE
) %>%
st_sf() %>%
mutate(uniqueID = row_number())
# Keep only cells that intersect Chicago
fishnet <- fishnet[chicagoBoundary, ]
# View basic info
cat("✓ Created fishnet grid\n")
cat(" - Number of cells:", nrow(fishnet), "\n")
cat(" - Cell size:", 500, "x", 500, "meters\n")
cat(" - Cell area:", round(st_area(fishnet[1,])), "square meters\n")
```
### 2.2 \| Aggregate Burglaries to Grid
- `mutate` a ‘counter’ field, `countBurglaries`, for each burglary event,
- spatial join (`aggregate`) burglary points to the `fishnet`, taking the sum of `countBurglaries`
- A random `uniqueID` is generated for each grid cell
```{r aggregate-burglaries}
# Spatial join: which cell contains each burglary?
burglaries_fishnet <- st_join(burglaries, fishnet, join = st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID) %>%
summarize(countBurglaries = n())
# Join back to fishnet (cells with 0 burglaries will be NA)
fishnet <- fishnet %>%
left_join(burglaries_fishnet, by = "uniqueID") %>%
mutate(countBurglaries = replace_na(countBurglaries, 0))
# Summary statistics
cat("\nBurglary count distribution:\n")
summary(fishnet$countBurglaries)
cat("\nCells with zero burglaries:",
sum(fishnet$countBurglaries == 0),
"/", nrow(fishnet),
"(", round(100 * sum(fishnet$countBurglaries == 0) / nrow(fishnet), 1), "%)\n")
```
### 2.3 \| Visualize Aggregated Count Distribution
```{r visualize-fishnet}
#| fig-width: 8
#| fig-height: 6
# Visualize aggregated counts
ggplot() +
geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
scale_fill_viridis_c(
name = "Burglaries",
option = "plasma",
trans = "sqrt", # Square root for better visualization of skewed data
breaks = c(0, 1, 5, 10, 20, 40)
) +
labs(
title = "Burglary Counts by Grid Cell",
subtitle = "500m x 500m cells, Chicago 2017"
) +
theme_crime()
```
::: callout-note
## Observed Patterns
- This map reinforces the summary statistics of the aggregated burglary count which state that 31.8% of cells had zero burglaries in 2017 (these are shown in dark purple in the map)
- Burglary incidents are mostly concentrated in clusters across Chicago’s South and West Sides, where many grid cells show counts above 20 burglaries
- There are clear hotspots of elevated burglary risk separated by lower-activity zones – this will be useful in Part 4 when we calculate spatial measures such as kNN, LISA (local Moran's I), and the distance to hot spots
:::
# Part 3: Create a Kernel Density Estimate (KDE) Baseline
- KDE represents our *null hypothesis*: burglaries happen where they happened before, with no other information
- In this context, KDE involves centering a smooth kernel, or curve, atop each crime point such that the curve is at its highest directly over the point and the lowest at the range of a circular *search radius* parameter (commonly known as *kernel density* in ArcGIS)
```{r kde-baseline}
#| message: false
#| warning: false
# Convert burglaries to ppp (point pattern) format for spatstat
burglaries_ppp <- as.ppp(
st_coordinates(burglaries),
W = as.owin(st_bbox(chicagoBoundary))
)
# Calculate KDE with 1km bandwidth
kde_burglaries <- density.ppp(
burglaries_ppp,
sigma = 1000, # 1km bandwidth
edge = TRUE # Edge correction
)
# Convert to terra raster
kde_raster <- rast(kde_burglaries)
# Extract KDE values to fishnet cells
fishnet <- fishnet %>%
mutate(
kde_value = terra::extract(
kde_raster,
vect(fishnet),
fun = mean,
na.rm = TRUE
)[, 2] # Extract just the values column
)
cat("✓ Calculated KDE baseline\n")
```
```{r visualize-kde}
#| fig-width: 8
#| fig-height: 6
ggplot() +
geom_sf(data = fishnet, aes(fill = kde_value), color = NA) +
geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
scale_fill_viridis_c(
name = "KDE Value",
option = "plasma"
) +
labs(
title = "Kernel Density Estimation Baseline",
subtitle = "Simple spatial smoothing of burglary locations"
) +
theme_crime()
```
# Part 4: Create Spatial Predictor Variables
Now we'll create features that might help predict burglaries. We'll use "broken windows theory" logic: signs of disorder predict crime. In this lab report, I specifically focus on 311 calls to report *vacant and abandoned buildings*. The data source is the Chicago 311 Service Requests dataset which I accessed through <https://data.cityofchicago.org/stories/s/311-Dataset-Changes-12-11-2018/d7nq-5g7t>
In this section, I first load, filter and clean the "Vacant and Abandoned Buildings Reported" dataset. While this preliminary report does not look at specific columns, there are variables such as `vacant_due_to_fire` or `ppl_using_prop` that could potentially be applied to a more complex random forest model in order to better predict crime risk.
Next, I calculate different spatial measures that each capture an aspect of spatial proximity to our indicator variable of burglary risk.
- Count → How much disorder is HERE?
- k-NN Distance → How CLOSE are we to disorder?
- Hot Spots (LISA) → Where does disorder CLUSTER?
- Distance to Hot Spots → How close to concentrated disorder?
### 4.1 \| Load 311 Vacant and Abandoned Buildings Reported
```{r load-vacant-abandoned-buildings, progress = FALSE}
vacant_abandoned_bldg <- read_csv("data/vacant_abandoned_buildings.csv") |>
filter(!is.na(LATITUDE), !is.na(LONGITUDE))
```
```{r clean-vacant-abandoned-buildings, message = FALSE}
# clean and rename column names
vacant_abandoned_bldg_2017 <- vacant_abandoned_bldg |>
janitor::clean_names() |>
rename(
request_type = service_request_type,
request_num = service_request_number,
date_received = date_service_request_was_received,
location_on_lot =
location_of_building_on_the_lot_if_garage_change_type_code_to_bgd,
dangerous_hazardous = is_the_building_dangerous_or_hazardous,
open_or_boarded = is_building_open_or_boarded,
entry_point = if_the_building_is_open_where_is_the_entry_point,
vacant_or_occupied = is_the_building_currently_vacant_or_occupied,
vacant_due_to_fire = is_the_building_vacant_due_to_fire,
# note typo in original dataset - childen instead of children
ppl_using_prop = any_people_using_property_homeless_childen_gangs,
addr_st_num = address_street_number,
addr_st_direction = address_street_direction,
addr_st_name = address_street_name,
addr_st_suffix = address_street_suffix,
zip = zip_code,
x = x_coordinate,
y = y_coordinate,
ward = ward,
police_dist = police_district,
community_area = community_area,
lat = latitude,
lon = longitude,
location = location
) |>
# filter vacant and abandoned buildings reported in 2017
mutate(
date_received = as.Date(date_received, format = "%m/%d/%Y"),
month = as.integer(format(date_received, "%m")),
day = as.integer(format(date_received, "%d")),
year = as.integer(format(date_received, "%Y"))
) |>
filter(year == 2017)
```
```{r select-abandoned-buildings}
# filter NA values and select relevant columns
vacant_abandoned_bldg_2017 <- vacant_abandoned_bldg_2017 |>
dplyr::mutate(
addr_full = str_squish( # combine address variables
paste(addr_st_num, addr_st_direction, addr_st_name, addr_st_suffix))
)
keep_cols <- c(
"request_num","year","month", "open_or_boarded","vacant_or_occupied",
"vacant_due_to_fire","ppl_using_prop", "addr_full", "zip","x","y","ward",
"police_dist","community_area","lat","lon","location")
vacant_abandoned_bldg_2017 <- vacant_abandoned_bldg_2017 |>
dplyr::select(any_of(keep_cols)) |>
dplyr::filter(!is.na(vacant_or_occupied))
vacant_abandoned_bldg_2017
```
```{r set-crs-abandoned-buildings}
#| message: false
# Set CRS
abandoned_bldg <- vacant_abandoned_bldg_2017 |>
st_as_sf(coords = c("lon", "lat"), crs = 4326) |>
st_transform('ESRI:102271') # Set CRS
cat("✓ Loaded vacant/abandoned building calls\n")
cat(" - Number of calls:", nrow(vacant_abandoned_bldg_2017), "\n")
```
::: callout-note
## Note of Caution
- The 311 reporting system in Chicago may be biased because it relies on residents to voluntarily report issues, meaning that neighborhoods with higher civic engagement or stronger trust in city services are more likely to generate reports.
- As a result, 311 data in Chicago may underrepresent conditions in lower-income or marginalized communities, where residents may be less aware of or less inclined to call the hotline.
:::
### 4.2 \| Count of Reported Vacant/Abandoned Buildings per Cell
```{r}
# Gut Check Visualization of Vacant/Abandoned Buildings
ggplot() +
geom_sf(data = chicagoBoundary) +
geom_sf(data = abandoned_bldg, aes(color = open_or_boarded),
alpha = 0.7, size = 0.7) +
labs(title = "Location of Vacant/Abadoned Buildings 311 Calls") +
theme_void()
```
```{r count-vacant_abandoned_bldg}
# Aggregate abandoned building calls to fishnet
abandoned_fishnet <- st_join(abandoned_bldg, fishnet,
join = st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID) %>%
summarize(abandoned_bldg = n())
# Join to fishnet
fishnet <- fishnet %>%
left_join(abandoned_fishnet, by = "uniqueID") %>%
mutate(abandoned_bldg = replace_na(abandoned_bldg, 0))
cat("Vacant and Abandoned Building distribution:\n")
summary(fishnet$abandoned_bldg)
```
```{r visualize-vacant_abandoned_bldg}
p1 <- ggplot() +
geom_sf(data = fishnet, aes(fill = abandoned_bldg), color = NA) +
scale_fill_viridis_c(name = "Count", option = "magma") +
labs(title = "Vacant/Abandoned Building 311 Calls") +
theme_crime()
p2 <- ggplot() +
geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
scale_fill_viridis_c(name = "Count", option = "plasma") +
labs(title = "Burglaries") +
theme_crime()
p1 + p2 +
plot_annotation(
title = "Are vacant/abandoned buildings and burglaries correlated?")
```
### 4.3 \| k-Nearest Neighbors (kNN) Features
Count in a cell is one measure. Distance to the 5 nearest vacant and abandoed buildings that were reported through 311 calls captures additional local context that are not just limited to administrative boundaries such as census tracts.
```{r nn-feature}
#| message: false
#| warning: false
# Calculate mean distance to 5 nearest vacant and abandoned buildings
# (Do this OUTSIDE of mutate to avoid sf conflicts)
# Get coordinates
fishnet_coords <- st_coordinates(st_centroid(fishnet))
abandoned_coords <- st_coordinates(abandoned_bldg)
# Calculate k nearest neighbors and distances
nn_result <- get.knnx(abandoned_coords, fishnet_coords, k = 5)
# Add to fishnet
fishnet <- fishnet %>%
mutate(
abandoned_bldg.nn = rowMeans(nn_result$nn.dist)
)
cat("✓ Calculated nearest neighbor distances\n")
summary(fishnet$abandoned_bldg.nn)
```
### 4.4 \| Local Moran's I; Identify Hotspots; Distance to Hot Spots
Let's identify clusters of abandoned buildings using Local Moran's I, then calculate distance to these hot spots.
```{r local-morans-abandoned}
#| message: false
#| warning: false
# Function to calculate Local Moran's I
calculate_local_morans <- function(data, variable, k = 5) {
# Create spatial weights
coords <- st_coordinates(st_centroid(data))
neighbors <- knn2nb(knearneigh(coords, k = k))
weights <- nb2listw(neighbors, style = "W", zero.policy = TRUE)
# Calculate Local Moran's I
local_moran <- localmoran(data[[variable]], weights)
# Classify clusters
mean_val <- mean(data[[variable]], na.rm = TRUE)
data %>%
mutate(
local_i = local_moran[, 1],
p_value = local_moran[, 5],
is_significant = p_value < 0.05,
moran_class = case_when(
!is_significant ~ "Not Significant",
local_i > 0 & .data[[variable]] > mean_val ~ "High-High",
local_i > 0 & .data[[variable]] <= mean_val ~ "Low-Low",
local_i < 0 & .data[[variable]] > mean_val ~ "High-Low",
local_i < 0 & .data[[variable]] <= mean_val ~ "Low-High",
TRUE ~ "Not Significant"
)
)
}
# Apply to vacant and abandoned buildings
fishnet <- calculate_local_morans(fishnet, "abandoned_bldg", k = 5)
```
```{r visualize-morans}
#| fig-width: 8
#| fig-height: 6
# Visualize hot spots
ggplot() +
geom_sf(
data = fishnet,
aes(fill = moran_class),
color = NA
) +
scale_fill_manual(
values = c(
"High-High" = "#d7191c",
"High-Low" = "#fdae61",
"Low-High" = "#abd9e9",
"Low-Low" = "#2c7bb6",
"Not Significant" = "gray90"
),
name = "Cluster Type"
) +
labs(
title = "Local Moran's I: Vacant and Abandoned Building Clusters",
subtitle = "High-High = Hot spots of disorder calls in Chicago"
) +
theme_crime()
```
```{r distance-to-hotspots}
#| warning: false
# Get centroids of "High-High" cells (hot spots)
hotspots <- fishnet %>%
filter(moran_class == "High-High") %>%
st_centroid()
# Calculate distance from each cell to nearest hot spot
if (nrow(hotspots) > 0) {
fishnet <- fishnet %>%
mutate(
dist_to_hotspot = as.numeric(
st_distance(st_centroid(fishnet), hotspots %>% st_union())
)
)
cat("✓ Calculated distance to vacant and abandoned building hot spots\n")
cat(" - Number of hot spot cells:", nrow(hotspots), "\n")
} else {
fishnet <- fishnet %>%
mutate(dist_to_hotspot = 0)
cat("⚠ No significant hot spots found\n")
}
```
::: callout-note
## Observed Patterns
- **Distribution of Vacant/Abandoned Building 311 Calls:** The majority of reported properties are open rather than boarded, while boarded properties are relatively sparse and scattered
- **Count of Vacant/Abandoned Building 311 Calls:** The distribution is highly skewed, with most grid cells containing zero or one vacant property. Notably, the median value is 0 while the max value is 21. This indicates the presence of a few extreme hotspots of concentrated vacancy which is supported by the subsequent map visualizations.
- **Map of Vacant/Abandoned Building 311 Calls and Burglaries:** 311 calls for vacant buildings (left panel) are especially dense in a few concentrated areas, while burglary counts (right panel) are more spatially dispersed but still overlap with several high-vacancy zones
- **Local Moran's I:** High–High hotspot clusters (in red) indicate statistically significant concentrations of vacant or abandoned buildings, primarily located on the South and West Sides of Chicago
:::
------------------------------------------------------------------------
# Part 5: Join Police Districts for Cross-Validation
We'll use the Chicago police districts for our spatial cross-validation.
```{r join-districts}
# Join district information to fishnet
fishnet <- st_join(
fishnet,
policeDistricts,
join = st_within,
left = TRUE
) %>%
filter(!is.na(District)) # Remove cells outside districts
cat("✓ Joined police districts\n")
cat(" - Districts:", length(unique(fishnet$District)), "\n")
cat(" - Cells:", nrow(fishnet), "\n")
```
# Part 6: Count Regression Model Fitting
### 6.1 \| Poisson Regression
Poisson regression is used instead of OLS because crime counts are right-skewed with many zeroes (around 30% of fishnet cells have no burglaries in 2017) and also discrete (only integer values). This makes Poisson regression suitable for count data.
```{r prepare-data}
# Create clean modeling dataset
fishnet_model <- fishnet %>%
st_drop_geometry() %>%
dplyr::select(
uniqueID,
District,
countBurglaries,
abandoned_bldg,
abandoned_bldg.nn,
dist_to_hotspot
) %>%
na.omit() # Remove any remaining NAs
cat("✓ Prepared modeling data\n")
cat(" - Observations:", nrow(fishnet_model), "\n")
cat(" - Variables:", ncol(fishnet_model), "\n")
```
```{r fit-poisson}
# Fit Poisson regression
model_poisson <- glm(
countBurglaries ~ abandoned_bldg + abandoned_bldg.nn +
dist_to_hotspot,
data = fishnet_model,
family = "poisson"
)
# Summary
summary(model_poisson)
```
### 6.2 \| Check for Overdispersion
- Poisson regression assumes variance = mean
- But in reality, real crime count data often violates this assumption because variance is significantly greater than the mean (i.e. overdispersion)
```{r check-overdispersion}
# Calculate dispersion parameter
dispersion <- sum(residuals(model_poisson, type = "pearson")^2) /
model_poisson$df.residual
cat("Dispersion parameter:", round(dispersion, 2), "\n")
cat("Rule of thumb: >1.5 suggests overdispersion\n")
if (dispersion > 1.5) {
cat("⚠ Overdispersion detected! Consider Negative Binomial model.\n")
} else {
cat("✓ Dispersion looks okay for Poisson model.\n")
}
```
### 6.3 \| Negative Binomial Regression
Since overdispersion has been detected, we use Negative Binomial regression instead which is more flexible than Poisson Regression because it relaxes the variance = mean assumption
```{r fit-negbin}
# Fit Negative Binomial model
model_nb <- glm.nb(
countBurglaries ~ abandoned_bldg + abandoned_bldg.nn +
dist_to_hotspot,
data = fishnet_model
)
summary(model_nb)
```
### 6.4 \| Compare Model Fit (AIC)
- AIC stands for Akaike Information Criterion, and it’s a common metric used to compare model fit for non-linear regression models (i.e. AIC is used instead of R-Squared)
- For our purposes, AIC serves as a quick confirmation that the negative binomial regression model has a better fit than the poisson regression model because of its lower AIC (7788.5 vs 9357.9)
- AIC is relative, and not absolute — it’s only meaningful when comparing multiple models fit to the same dataset
- AIC is defined as $$AIC=−2×log-likelihood+2k$$ where log-likelihood measures how well the model fits the data (higher is better), and k is the number of estimated parameters in the model (including the intercept)
```{r}
# Compare AIC (lower is better)
cat("\nModel Comparison:\n")
cat("Poisson AIC:", round(AIC(model_poisson), 1), "\n")
cat("Negative Binomial AIC:", round(AIC(model_nb), 1), "\n")
```
::: callout-note
## Optimal Model Fit
As expected, due to overdispersion, negative binomial regression has a lower AIC than poisson regression and provides a more suitable model fit for the underlying crime dataset.
:::
# Part 7: Spatial LOGO-CV for 2017 Data
- Leave-One-Group-Out (LOGO) Cross-Validation trains on all districts except one, then tests on the held-out district.
- LOGO-CV is used here instead of k-fold CV or a validation set approach in order to prevent spatial leakage.
```{r spatial-cv}
# Get unique districts
districts <- unique(fishnet_model$District)
cv_results <- tibble()
cat("Running LOGO Cross-Validation...\n")
for (i in seq_along(districts)) {
test_district <- districts[i]
# Split data
train_data <- fishnet_model %>% filter(District != test_district)
test_data <- fishnet_model %>% filter(District == test_district)
# Fit model on training data
model_cv <- glm.nb(
countBurglaries ~ abandoned_bldg + abandoned_bldg.nn +
dist_to_hotspot,
data = train_data
)
# Predict on test data
test_data <- test_data %>%
mutate(
prediction = predict(model_cv, test_data, type = "response")
)
# Calculate metrics
mae <- mean(abs(test_data$countBurglaries - test_data$prediction))
rmse <- sqrt(mean((test_data$countBurglaries - test_data$prediction)^2))
# Store results
cv_results <- bind_rows(
cv_results,
tibble(
fold = i,
test_district = test_district,
n_test = nrow(test_data),
mae = mae,
rmse = rmse
)
)
cat(" Fold", i, "/", length(districts), "- District", test_district,
"- MAE:", round(mae, 2), "\n")
}
# Overall results
cat("\n✓ Cross-Validation Complete\n")
cat("Mean MAE:", round(mean(cv_results$mae), 2), "\n")
cat("Mean RMSE:", round(mean(cv_results$rmse), 2), "\n")
```
```{r cv-results-table}
# Show results
cv_results %>%
arrange(desc(mae)) %>%
kable(
digits = 2,
caption = "LOGO CV Results by District"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
# Part 8: Model Predictions and Comparison
### 8.1 \| Generate Final Predictions
```{r final-predictions}
# Fit final model on all data
final_model <- glm.nb(
countBurglaries ~ abandoned_bldg + abandoned_bldg.nn +
dist_to_hotspot,
data = fishnet_model
)
# Add predictions back to fishnet
fishnet <- fishnet %>%
mutate(
prediction_nb = predict(final_model, fishnet_model, type = "response")[match(uniqueID, fishnet_model$uniqueID)]
)
# Also add KDE predictions (normalize to same scale as counts)
kde_sum <- sum(fishnet$kde_value, na.rm = TRUE)
count_sum <- sum(fishnet$countBurglaries, na.rm = TRUE)
fishnet <- fishnet %>%
mutate(
prediction_kde = (kde_value / kde_sum) * count_sum
)
```
### 8.2 \| Compare Model vs. KDE Baseline
```{r compare-models}
#| fig-width: 12
#| fig-height: 4
# Create three maps for comparison of Models
p1 <- ggplot() +
geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
scale_fill_viridis_c(name = "Count", option = "plasma", limits = c(0, 15)) +
labs(title = "Actual Burglaries") +
theme_crime()
p2 <- ggplot() +
geom_sf(data = fishnet, aes(fill = prediction_nb), color = NA) +
scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
labs(title = "Model Predictions (Neg. Binomial)") +
theme_crime()
p3 <- ggplot() +
geom_sf(data = fishnet, aes(fill = prediction_kde), color = NA) +
scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
labs(title = "KDE Baseline Predictions") +
theme_crime()
p1 + p2 + p3 +
plot_annotation(
title = "Actual vs. Predicted Burglaries",
subtitle = "Does our complex model outperform simple KDE?"
)
```
```{r model-comparison-metrics}
# Calculate performance metrics
comparison <- fishnet %>%
st_drop_geometry() %>%
filter(!is.na(prediction_nb), !is.na(prediction_kde)) %>%
summarize(
model_mae = mean(abs(countBurglaries - prediction_nb)),
model_rmse = sqrt(mean((countBurglaries - prediction_nb)^2)),
kde_mae = mean(abs(countBurglaries - prediction_kde)),
kde_rmse = sqrt(mean((countBurglaries - prediction_kde)^2))
)
comparison %>%
pivot_longer(everything(), names_to = "metric", values_to = "value") %>%
separate(metric, into = c("approach", "metric"), sep = "_") %>%
pivot_wider(names_from = metric, values_from = value) %>%
kable(
digits = 2,
caption = "Model Performance Comparison"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
### 8.3 \| Where Does the Model Work Well?
```{r prediction-errors}
#| fig-width: 10
#| fig-height: 5
# Calculate errors
fishnet <- fishnet %>%
mutate(
error_nb = countBurglaries - prediction_nb,
error_kde = countBurglaries - prediction_kde,
abs_error_nb = abs(error_nb),
abs_error_kde = abs(error_kde)
)
# Map errors
p1 <- ggplot() +
geom_sf(data = fishnet, aes(fill = error_nb), color = NA) +
scale_fill_gradient2(
name = "Error",
low = "#2166ac", mid = "white", high = "#b2182b",
midpoint = 0,
limits = c(-10, 10)
) +
labs(title = "Model Errors (Actual - Predicted)") +
theme_crime()
p2 <- ggplot() +
geom_sf(data = fishnet, aes(fill = abs_error_nb), color = NA) +
scale_fill_viridis_c(name = "Abs. Error", option = "magma") +
labs(title = "Absolute Model Errors") +
theme_crime()
p1 + p2
```
::: callout-important
## Assessing Model Performance
- **Spatial pattern of residuals:** The left panel shows both overpredictions (in blue) and underpredictions (in red), suggesting that the model’s performance varies systematically across neighborhoods rather than being random.
- **Magnitude of errors:** The right panel shows absolute errors concentrated in similar regions, confirming that prediction accuracy is uneven across thecity. Some fishnet grid cells exhibiting substantially higher deviations between observed and predicted values.
- **Error clustering:** Clusters of higher errors appear on the South and West Sides, which is also where crime data for 2017 is clustered as shown in earlier maps. This is a key limitation of the model – possibly due to unobserved local factors or data quality issues in these areas.
:::
# Part 9: Summary Statistics
```{r model-summary-table}
# Create nice summary table
model_summary <- broom::tidy(final_model, exponentiate = TRUE) %>%
mutate(
across(where(is.numeric), ~round(., 3))
)
model_summary %>%
kable(
caption = "Final Negative Binomial Model Coefficients (Exponentiated)",
col.names = c("Variable", "Rate Ratio", "Std. Error", "Z", "P-Value")
) %>%
kable_styling(bootstrap_options = c("striped", "hover")) %>%
footnote(
general = "Rate ratios > 1 indicate positive association with burglary counts."
)
```
## Key Findings Checklist
**Technical Performance:**
- Cross-validation MAE: `r round(mean(cv_results$mae), 2)`
- Model vs. KDE: KDE performed better than the final negative binomial model as it produced a lower MAE and RMSE than the model based on abandoned and vacant buildings
**Spatial Patterns:**
- Burglaries are clustered as clearly visualized in the Kernel Density Estimation Baseline
- Hot spots are largely located in the West Side and South Side neighborhoods
**Model Limitations:**
- Overdispersion: the dispersion parameter of 3.35 exceeds the rule of thumb of 1.5. This suggests that overdispersion in the Poisson distribution model is present and hence a Negative Binomial model is used instead
- Cells with zero counts: 781 out of 2458 had zero burglaries (31.8 %)