---
title: "Assignment 4: Spatial Predictive Analysis"
author: "Lingxuan Gao"
date: Nov 13th
format:
html:
code-fold: true
code-tools: true
toc: true
toc-depth: 3
toc-location: left
theme: cosmo
embed-resources: true
editor: visual
execute:
warning: false
message: false
---
# Overview
In this lab, I will apply the spatial predictive modeling techniques to the sanitation code complaints in Chicago during 2017.
# Setup
```{r setup}
# Load required packages
library(tidyverse) # Data manipulation
library(sf) # Spatial operations
library(here) # Relative file paths
library(viridis) # Color scales
library(terra) # Raster operations (replaces 'raster')
library(spdep) # Spatial dependence
library(FNN) # Fast nearest neighbors
library(MASS) # Negative binomial regression
library(patchwork) # Plot composition (replaces grid/gridExtra)
library(knitr) # Tables
library(kableExtra) # Table formatting
library(classInt) # Classification intervals
library(here)
# Spatstat split into sub-packages
library(spatstat.geom) # Spatial geometries
library(spatstat.explore) # Spatial exploration/KDE
# Set options
options(scipen = 999) # No scientific notation
set.seed(5080) # Reproducibility
# Create consistent theme for visualizations
theme_crime <- function(base_size = 11) {
theme_minimal(base_size = base_size) +
theme(
plot.title = element_text(face = "bold", size = base_size + 1),
plot.subtitle = element_text(color = "gray30", size = base_size - 1),
legend.position = "right",
panel.grid.minor = element_blank(),
axis.text = element_blank(),
axis.title = element_blank()
)
}
# Set as default
theme_set(theme_crime())
```
# Part 1: Data Loading & Exploration
::: callout-note
Why I choose Sanitation Code Complaints as a Predictor?
1. **Visible Environmental Disorder**: Sanitation violations (illegal dumping, overflowing garbage, unsanitary conditions) are highly visible markers of neighborhood neglect that signal low social control and guardianship.
2. **Property Maintenance Connection**: Sanitation code violations often correlate with poorly maintained properties - the same conditions that make areas vulnerable to burglary.
3. **Temporal Stability**: Unlike some 311 calls that are seasonal, sanitation violations tend to persist, making them reliable indicators of chronic neighborhood conditions.
**Expected Relationship**: I hypothesize a positive correlation between sanitation complaints and burglaries, where areas with higher sanitation violations will experience elevated burglary rates.
:::
## 1.1 Load 311 data: Sanitation Code Complaints in Chicago during 2017
```{r load data}
#| label: Load 311 Data
#| message: false
#| warning: false
# Load sanitation code complaints
sanitation_raw <- read_csv(here("data", "311_Service_Requests_-_Sanitation_Code_Complaints.csv"))
# Process and filter for 2017 only
sanitation <- sanitation_raw %>%
# Parse the existing Creation Date column
mutate(creation_date = mdy(`Creation Date`)) %>%
# Filter for 2017 only
filter(
year(creation_date) == 2017,
!is.na(Latitude),
!is.na(Longitude)
) %>%
# Convert to spatial object
st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326) %>%
st_transform('ESRI:102271')
```
## 1.2 Load Burglary Data
```{r load-burglaries}
#| label: Load Burglary Data
#| message: false
#| warning: false
# Load from provided data file (downloaded from Chicago open data portal)
burglaries <- st_read(here("data", "burglaries.shp")) %>%
st_transform('ESRI:102271')
```
## 1.3 Load Chicago Spatial Data
```{r load-boundaries}
#| label: Load Boundary Data
#| message: false
#| warning: false
# Load police districts (used for spatial cross-validation)
policeDistricts <-
st_read("https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON") %>%
st_transform('ESRI:102271') %>%
dplyr::select(District = dist_num)
# Load police beats (smaller administrative units)
policeBeats <-
st_read("https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON") %>%
st_transform('ESRI:102271') %>%
dplyr::select(Beat = beat_num)
# Load Chicago boundary
chicagoBoundary <-
st_read("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson") %>%
st_transform('ESRI:102271')
```
## 1.4 The spatial distribution of Sanitation Code Complaints
```{r visualize-points1}
#| fig-width: 10
#| fig-height: 5
# Simple point map
p1 <- ggplot() +
geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
geom_sf(data = sanitation, color = "#d62828", size = 0.1, alpha = 0.4) +
labs(
title = "Sanitation Code Complaints",
subtitle = paste0("Chicago 2017, n = ", nrow(sanitation))
)
# Density surface using modern syntax
p2 <- ggplot() +
geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
geom_density_2d_filled(
data = data.frame(st_coordinates(sanitation)),
aes(X, Y),
alpha = 0.7,
bins = 8
) +
scale_fill_viridis_d(
option = "plasma",
direction = -1,
guide = "none" # Modern ggplot2 syntax (not guide = FALSE)
) +
labs(
title = "Density Surface",
subtitle = "Kernel density estimation"
)
# Combine plots
p1 + p2 +
plot_annotation(
title = "Sanitation Code Complaints",
tag_levels = 'A'
)
```
**Spatial Pattern of Sanitation Complaints:**
The point pattern map (Figure A) reveals **highly clustered** sanitation complaints concentrated in Chicago's central and south-central corridors.This is not random distribution—kernel density estimation (Figure B). It identifies three distinct hot spots with peak densities in the mid-north, central-south, and southern districts.
##1.5 The spatial distribution of Burglary
```{r visualize-points2}
#| fig-width: 10
#| fig-height: 5
# Simple point map
p1 <- ggplot() +
geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
geom_sf(data = burglaries, color = "#d62828", size = 0.1, alpha = 0.4) +
labs(
title = "Burglary Locations",
subtitle = paste0("Chicago 2017, n = ", nrow(burglaries))
)
# Density surface using modern syntax
p2 <- ggplot() +
geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
geom_density_2d_filled(
data = data.frame(st_coordinates(burglaries)),
aes(X, Y),
alpha = 0.7,
bins = 8
) +
scale_fill_viridis_d(
option = "plasma",
direction = -1,
guide = "none" # Modern ggplot2 syntax (not guide = FALSE)
) +
labs(
title = "Density Surface",
subtitle = "Kernel density estimation"
)
# Combine plots using patchwork (modern approach)
p1 + p2 +
plot_annotation(
title = "Spatial Distribution of Burglaries in Chicago",
tag_levels = 'A'
)
```
**Spatial Patterns of Burglaries**
Burglaries are **highly unevenly distributed** across Chicago. The point pattern map reveals dense concentrations forming continuous corridors through central and south-central districts, while northern and far southern edges show sparse activity. The kernel density surface identifies **three major hot spots**: a dominant mid-north concentration, a central-south band, and a southern cluster.
# Part 2: Fishnet Grid Creation
## 2.1 Fishnet Creation
```{r create-fishnet}
# Create 500m x 500m grid
fishnet <- st_make_grid(
chicagoBoundary,
cellsize = 500, # 500 meters per cell
square = TRUE
) %>%
st_sf() %>%
mutate(uniqueID = row_number())
# Keep only cells that intersect Chicago
fishnet <- fishnet[chicagoBoundary, ]
# View basic info
info_tbl <- tibble::tibble(
Metric = c("Number of cells", "Cell size", "Cell area (m²)"),
Value = c(
scales::comma(nrow(fishnet)),
"500 × 500 m",
scales::comma(round(as.numeric(units::drop_units(sf::st_area(fishnet[1, ])))))
)
)
knitr::kable(
info_tbl,
caption = "Fishnet Summary",
col.names = c("Metric", "Value"),
align = c("l","r"),
booktabs = TRUE
) |>
kableExtra::kable_styling(full_width = FALSE, position = "left",
bootstrap_options = c("striped","hover","condensed")) |>
kableExtra::column_spec(1, bold = TRUE)
```
## 2.2 Aggregate violations to grid cells
```{r aggregate-burglaries}
# Spatial join: which cell contains each burglary?
burglaries_fishnet <- st_join(burglaries, fishnet, join = st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID) %>%
summarize(countBurglaries = n())
# Join back to fishnet (cells with 0 burglaries will be NA)
fishnet <- fishnet %>%
left_join(burglaries_fishnet, by = "uniqueID") %>%
mutate(countBurglaries = replace_na(countBurglaries, 0))
# Compute summary statistics
burglaries_sum<- summary(fishnet$countBurglaries)
#Identify Zero-count cells
bur_zero_n <- sum(fishnet$countBurglaries == 0)
bur_zero_pct <- 100 * bur_zero_n / nrow(fishnet)
# Pull the six standard numeric summary fields in order
fields <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
vals <- round(as.numeric(burglaries_sum[fields]), 2)
# Convert to character with two decimal places
vals_fmt <- sprintf("%.2f", vals)
# Build final table
summary_tbl <- tibble::tibble(
Statistic = c(
"Min",
"1st Quartile",
"Median",
"Mean",
"3rd Quartile",
"Max",
"Cells with zero Burglaries"
),
Value = c(
vals_fmt,
sprintf(
"%s / %s (%.1f%%)",
scales::comma(bur_zero_n),
scales::comma(nrow(fishnet)),
bur_zero_pct
)
)
)
# Render table
knitr::kable(
summary_tbl,
caption = "Summary of Burglaries Counts per Grid Cell",
col.names = c("Statistic", "Value"),
align = c("l", "r"),
booktabs = TRUE
) %>%
kableExtra::kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE
)
```
```{r aggregate-sanitation}
# Spatial join: which cell contains each sanitation complaints?
sanitation_fishnet <- st_join(sanitation, fishnet, join = st_intersects) %>%
st_drop_geometry() %>%
group_by(uniqueID) %>%
summarize(countSanitation = n())
# Join back to fishnet (cells with 0 sanitation complaints)
fishnet <- fishnet %>%
left_join(sanitation_fishnet, by = "uniqueID") %>%
mutate(countSanitation = replace_na(countSanitation, 0))
# Compute summary statistics
sanitation_sum<- summary(fishnet$countSanitation)
#Identify Zero-count cells
san_zero_n <- sum(fishnet$countSanitation == 0)
san_zero_pct <- 100 * san_zero_n / nrow(fishnet)
# Pull the six standard numeric summary fields in order
fields <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
vals <- round(as.numeric(sanitation_sum[fields]), 2)
# Convert to character with two decimal places
vals_fmt <- sprintf("%.2f", vals)
# Build final table
summary_tbl <- tibble::tibble(
Statistic = c(
"Min",
"1st Quartile",
"Median",
"Mean",
"3rd Quartile",
"Max",
"Cells with zero sanitation complaints"
),
Value = c(
vals_fmt,
sprintf(
"%s / %s (%.1f%%)",
scales::comma(san_zero_n),
scales::comma(nrow(fishnet)),
san_zero_pct
)
)
)
# Render table
knitr::kable(
summary_tbl,
caption = "Summary of Sanitation Complaints Counts per Grid Cell",
col.names = c("Statistic", "Value"),
align = c("l", "r"),
booktabs = TRUE
) %>%
kableExtra::kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE
)
```
## 2.3 Count Distribution Visualiztion
```{r visualize-fishnet}
#| fig-width: 10
#| fig-height: 4
p1 <- ggplot() +
geom_sf(data = fishnet, aes(fill = countSanitation), color = NA) +
scale_fill_viridis_c(
name = "Count",
option = "plasma",
trans = "sqrt",
breaks = c(0, 10, 20, 40, 80, 160)) +
labs(title = "Sanitation Complaints") +
theme_crime()
p2 <- ggplot() +
geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
scale_fill_viridis_c(
name = "Count",
option = "plasma",
trans = "sqrt",
breaks = c(0, 1, 5, 10, 20, 40))+
labs(title = "Burglaries") +
theme_crime()
p1 + p2 +
plot_annotation(title = "Are sanitation complaints and burglaries correlated?")
```
**Summary:**: Sanitation complaints and burglaries show broadly similar spatial patterns, with higher concentrations in central and denser parts of the city.
# Part 3: Spatial Features
## 3.1 K-Nearest Neighbor Features Calculation
```{r nn-feature}
#| message: false
#| warning: false
# Calculate mean distance to 3 nearest sanitation complaints
# Extract centroids for fishnet cells
fishnet_coords <- st_coordinates(st_centroid(fishnet))
# Extract coordinates of sanitation complaints
sanitation_coords <- st_coordinates(sanitation)
# Compute 3 nearest graffiti neighbors for each grid cell
nn_result <- get.knnx(sanitation_coords, fishnet_coords, k = 3)
# Add feature to fishnet
fishnet <- fishnet %>%
mutate(
sanitation_nn = rowMeans(nn_result$nn.dist) # mean distance to 3 nearest sanitation complaints
)
summary(fishnet$sanitation_nn)
```
**Minimum: 1 meter**: Three different sanitation complaints requests occurred just one meter apart **1st Quartile: 108 meters**: 108 meters are about 2 minute walk. This means in 25% of Chicago’s grid cells, the nearest sanitation complaints are within 1 block. **Median: 181 meters**: Half of all grid cells have their three sanitation complaints within 2 blocks. **Mean: 295 meters**: The mean is much larger than the median, which reveals: Sanitation complaints is extremely skewed. **3rd Quartile: 332 meters**: Three-quarters of grid cells have their nearest neighbors within 332m (\~4 city blocks). **Maximum: 2213 meters**: These are the sparsest parts of the city.
##3.2 Perform Local Moran’s I analysis
```{r local-morans-graffiti}
# Function unchanged
calculate_local_morans <- function(data, variable, k = 5) {
# Create spatial weights
coords <- st_coordinates(st_centroid(data))
neighbors <- knn2nb(knearneigh(coords, k = k))
weights <- nb2listw(neighbors, style = "W", zero.policy = TRUE)
# Calculate Local Moran's I
local_moran <- localmoran(data[[variable]], weights)
# Classify clusters
mean_val <- mean(data[[variable]], na.rm = TRUE)
data %>%
mutate(
local_i = local_moran[, 1],
p_value = local_moran[, 5],
is_significant = p_value < 0.05,
moran_class = case_when(
!is_significant ~ "Not Significant",
local_i > 0 & .data[[variable]] > mean_val ~ "High-High",
local_i > 0 & .data[[variable]] <= mean_val ~ "Low-Low",
local_i < 0 & .data[[variable]] > mean_val ~ "High-Low",
local_i < 0 & .data[[variable]] <= mean_val ~ "Low-High",
TRUE ~ "Not Significant"
)
)
}
# Apply to Sanitation Complaints counts
fishnet <- calculate_local_morans(fishnet, "countSanitation", k = 5)
```
##3.3 Identify hot spots and cold spots
```{r visualize-morans-sanitation}
#| fig-width: 8
#| fig-height: 6
ggplot() +
geom_sf(
data = fishnet,
aes(fill = moran_class),
color = NA
) +
scale_fill_manual(
values = c(
"High-High" = "#d7191c", # Red = hot spot
"High-Low" = "#fdae61",
"Low-High" = "#abd9e9",
"Low-Low" = "#2c7bb6",
"Not Significant" = "gray90"
),
name = "Cluster Type"
) +
labs(
title = "Local Moran's I: Sanitation code complatins Clusters",
subtitle = "High-High = Hot spots of disorder"
) +
theme_crime()
```
This Local Moran’s I map shows that sanitation complaints are clustering.
##3.4 Create distance-to-hotspot measures
```{r distance-to-hotspots-graffiti}
# Get centroids of "High-High" cells
hotspots <- fishnet %>%
filter(moran_class == "High-High") %>%
st_centroid()
# Calculate distance from each grid cell to hot spot
if (nrow(hotspots) > 0) {
fishnet <- fishnet %>%
mutate(
dist_to_hotspot = as.numeric(
st_distance(st_centroid(fishnet), hotspots %>% st_union())
)
)
cat(" - Number of hot spot cells:", nrow(hotspots), "\n")
} else {
fishnet <- fishnet %>%
mutate(dist_to_hotspot = 0)
cat("⚠ No significant graffiti hot spots found\n")
}
```
# Part 4: Count Regression Models
I’ll use police districts for the spatial cross-validation.
## 4.1 Join district information to fishnet
```{r join-districts}
fishnet <- st_join(
fishnet,
policeDistricts,
join = st_within,
left = TRUE
) %>%
filter(!is.na(District)) # Remove cells outside districts
cat(" - Districts:", length(unique(fishnet$District)), "\n")
cat(" - Cells:", nrow(fishnet), "\n")
```
## 4.2 Fit Poisson regression
```{r prepare-data}
fishnet_model <- fishnet %>%
st_drop_geometry() %>%
dplyr::select(
uniqueID,
District,
countBurglaries,
countSanitation,
sanitation_nn,
dist_to_hotspot
) %>%
na.omit() # Remove any remaining NAs
cat(" - Observations:", nrow(fishnet_model), "\n")
cat(" - Variables:", ncol(fishnet_model), "\n")
```
```{r fit poisson}
model_poisson <- glm(
countBurglaries ~ countSanitation + sanitation_nn +
dist_to_hotspot,
data = fishnet_model,
family = "poisson"
)
# Summary
summary(model_poisson)
```
**Interpretation**: The model predicts burglary counts based on three sanitation-related features: countSanitation — how many sanitation complaints fall inside a grid cell sanitation_nn — average distance to the 3 nearest sanitation complaints dist_to_hotspot — distance to a sanitation hotspot (cluster of complaints)
The coefficient of countSanitation is positive. Each additional sanitation complaint inside a cell is linked to a 0.26% increase in expected burglaries.
The coefficient of sanitation_nn is Very strong negative. For each additional meter farther away from sanitation complaints, expected burglaries drop by about 0.45%.
The coefficient of dist_to_hotspot is slightly negative, which means being farther from a sanitation hotspot is linked to slightly fewer burglaries.
All three predictors are statistically significant.
In conclusion, Burglary risk rises in places where sanitation complaints are concentrated or close by.
##4.3 Check for Overdispersion
Poisson regression assumes mean = variance. Real count data often violates this (overdispersion).
```{r check-overdispersion}
# Calculate dispersion parameter
dispersion <- sum(residuals(model_poisson, type = "pearson")^2) /
model_poisson$df.residual
cat("Dispersion parameter:", round(dispersion, 2), "\n")
cat("Rule of thumb: >1.5 suggests overdispersion\n")
if (dispersion > 1.5) {
cat("⚠ Overdispersion detected! Consider Negative Binomial model.\n")
} else {
cat("Dispersion looks okay for Poisson model.\n")
}
```
##4.4 Negative Binomial Regression
If overdispersed, use **Negative Binomial regression** (more flexible).
```{r fit-negbin}
# Fit Negative Binomial model
model_nb <- glm.nb(
countBurglaries ~ countSanitation + sanitation_nn +
dist_to_hotspot,
data = fishnet_model
)
# Summary
summary(model_nb)
```
##4.5 Compare model fit (AIC)
```{r model comparison}
# Compare AIC (lower is better)
cat("\nModel Comparison:\n")
cat("Poisson AIC:", round(AIC(model_poisson), 1), "\n")
cat("Negative Binomial AIC:", round(AIC(model_nb), 1), "\n")
```
**Negative Binomial Model Interpretation**: The negative binomial model fits the data much better than the Poisson model (AIC drops from 8126 → 7151), which tells us burglary counts are overdispersed. Cells that are closer to sanitation complaints (sanitation_nn) show higher burglary risk. The coefficient is strongly negative and highly significant, meaning risk increases as distance shrinks. The same pattern appears with distance to sanitation hotspots: burglary rates are higher near clusters of sanitation complaints. The within-cell count of sanitation complaints (countSanitation) becomes non-significant once distance features are included, suggesting that proximity to disorder matters more than simple counts inside a grid cell.
# Part 5: Spatial Cross-Validation
Standard cross-validation randomly splits data. But with spatial data, this means training on cells right next to test cells (information leakage!). **Leave-One-Group-Out (LOGO) Cross-Validation** trains on all districts except one, then tests on the held-out district. So I’ll use Leave-One-Group-Out (LOGO) Cross-Validation to train the model.
```{r spatial-cv}
# Get unique districts
districts <- unique(fishnet_model$District)
cv_results <- tibble()
for (i in seq_along(districts)) {
test_district <- districts[i]
# Split data
train_data <- fishnet_model %>% filter(District != test_district)
test_data <- fishnet_model %>% filter(District == test_district)
# Fit model on training data
model_cv <- glm.nb(
countBurglaries ~ countSanitation + sanitation_nn +
dist_to_hotspot,
data = train_data
)
# Predict on test data
test_data <- test_data %>%
mutate(
prediction = predict(model_cv, test_data, type = "response")
)
# Calculate metrics
mae <- mean(abs(test_data$countBurglaries - test_data$prediction))
rmse <- sqrt(mean((test_data$countBurglaries - test_data$prediction)^2))
# Store results
cv_results <- bind_rows(
cv_results,
tibble(
fold = i,
test_district = test_district,
n_test = nrow(test_data),
mae = mae,
rmse = rmse
)
)
}
# Overall results
cat("Mean MAE:", round(mean(cv_results$mae), 2), "\n")
cat("Mean RMSE:", round(mean(cv_results$rmse), 2), "\n")
```
```{r cv-results-table}
# Show results
cv_results %>%
arrange(desc(mae)) %>%
kable(
digits = 2,
caption = "LOGO CV Results by District"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
**Interpretation of the LOGO Cross-Validation Results** The table shows how well the model predicts burglaries when each police district is held out as completely unseen test data.MAE is average error in burglary counts per grid cell and RMSE is penalizes larger errors.
Across districts, MAE falls roughly between 1.3 and 5 burglaries, which means the model usually misses the true count by only a couple of events per cell. Districts with a large number of test cells (for example District 4 or 8) tend to have lower errors, because the model has more stable patterns to learn from. Districts with small or unusual spatial patterns, like District 3 or 18, show higher error, suggesting their burglary patterns don't generalize well from the rest of the city.
Overall, the model generalizes reasonably well across space: performance varies by district, but prediction error stays within a narrow range
# Part6: Model Evaluation
##6.1 Create a Kernel Density Baseline This part builds a KDE hotspot baseline—my simplest prediction model, so I can later test whether complex regression models truly improve on “crime happens where it happened before.
**The KDE baseline asks:** “What if crime just happens where it happened before?” (simple spatial smoothing, no predictors)
```{r kde-baseline}
#| message: false
# Convert burglaries to ppp (point pattern) format for spatstat
burglaries_ppp <- as.ppp(
st_coordinates(burglaries),
W = as.owin(st_bbox(chicagoBoundary))
)
# Calculate KDE with 1km bandwidth
kde_burglaries <- density.ppp(
burglaries_ppp,
sigma = 1000, # 1km bandwidth
edge = TRUE # Edge correction
)
# Convert to terra raster (modern approach, not raster::raster)
kde_raster <- rast(kde_burglaries)
# Extract KDE values to fishnet cells
fishnet <- fishnet %>%
mutate(
kde_value = terra::extract(
kde_raster,
vect(fishnet),
fun = mean,
na.rm = TRUE
)[, 2] # Extract just the values column
)
```
```{r visualize-kde}
#| fig-width: 8
#| fig-height: 6
ggplot() +
geom_sf(data = fishnet, aes(fill = kde_value), color = NA) +
geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
scale_fill_viridis_c(
name = "KDE Value",
option = "plasma"
) +
labs(
title = "Kernel Density Estimation Baseline",
subtitle = "Simple spatial smoothing of burglary locations"
) +
theme_crime()
```
##6.2 Generate Final Predictions
```{r final-predictions}
# Fit final model on all data
final_model <- glm.nb(
countBurglaries ~ countSanitation + sanitation_nn +
dist_to_hotspot,
data = fishnet_model
)
# Add predictions back to fishnet
fishnet <- fishnet %>%
mutate(
prediction_nb = predict(final_model, fishnet_model, type = "response")[match(uniqueID, fishnet_model$uniqueID)]
)
# Also add KDE predictions (normalize to same scale as counts)
kde_sum <- sum(fishnet$kde_value, na.rm = TRUE)
count_sum <- sum(fishnet$countBurglaries, na.rm = TRUE)
fishnet <- fishnet %>%
mutate(
prediction_kde = (kde_value / kde_sum) * count_sum
)
```
##6.3 Compare Model vs. KDE Baseline
```{r compare-models}
#| fig-width: 12
#| fig-height: 4
# Create three maps
p1 <- ggplot() +
geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
scale_fill_viridis_c(name = "Count", option = "plasma", limits = c(0, 15)) +
labs(title = "Actual Burglaries") +
theme_crime()
p2 <- ggplot() +
geom_sf(data = fishnet, aes(fill = prediction_nb), color = NA) +
scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
labs(title = "Model Predictions (Neg. Binomial)") +
theme_crime()
p3 <- ggplot() +
geom_sf(data = fishnet, aes(fill = prediction_kde), color = NA) +
scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
labs(title = "KDE Baseline Predictions") +
theme_crime()
p1 + p2 + p3 +
plot_annotation(
title = "Actual vs. Predicted Burglaries",
subtitle = "Does our complex model outperform simple KDE?"
)
```
```{r model-comparison-metrics}
# Calculate performance metrics
comparison <- fishnet %>%
st_drop_geometry() %>%
filter(!is.na(prediction_nb), !is.na(prediction_kde)) %>%
summarize(
model_mae = mean(abs(countBurglaries - prediction_nb)),
model_rmse = sqrt(mean((countBurglaries - prediction_nb)^2)),
kde_mae = mean(abs(countBurglaries - prediction_kde)),
kde_rmse = sqrt(mean((countBurglaries - prediction_kde)^2))
)
comparison %>%
pivot_longer(everything(), names_to = "metric", values_to = "value") %>%
separate(metric, into = c("approach", "metric"), sep = "_") %>%
pivot_wider(names_from = metric, values_from = value) %>%
kable(
digits = 2,
caption = "Model Performance Comparison"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
**Interpretation**: The comparison shows that the Negative Binomial model does not outperform a simple KDE baseline.
Because burglary risk in Chicago is strongly path-dependent: the best predictor of where crime will occur is simply where it has occurred before. Adding sanitation-based disorder features does not increase predictive accuracy and may introduce noise.
Although KDE outperforms the regression model in predictive accuracy, the regression model provides interpretive insight into how sanitation-related disorder correlates with burglary risk. KDE captures spatial inertia, whereas the model reveals underlying mechanisms.
KDE is a better predictor, but the model is a better explanation.
##6.4 Where Does the Model Work Well?
```{r prediction-errors}
#| fig-width: 10
#| fig-height: 5
# Calculate errors
fishnet <- fishnet %>%
mutate(
error_nb = countBurglaries - prediction_nb,
error_kde = countBurglaries - prediction_kde,
abs_error_nb = abs(error_nb),
abs_error_kde = abs(error_kde)
)
# Map errors
p1 <- ggplot() +
geom_sf(data = fishnet, aes(fill = error_nb), color = NA) +
scale_fill_gradient2(
name = "Error",
low = "#2166ac", mid = "white", high = "#b2182b",
midpoint = 0,
limits = c(-10, 10)
) +
labs(title = "Model Errors (Actual - Predicted)") +
theme_crime()
p2 <- ggplot() +
geom_sf(data = fishnet, aes(fill = abs_error_nb), color = NA) +
scale_fill_viridis_c(name = "Abs. Error", option = "magma") +
labs(title = "Absolute Model Errors") +
theme_crime()
p1 + p2
```
**Interpretation**: On the left, red cells indicate under-prediction—places where actual burglaries were higher than the model expected, while blue cells show over-prediction. Both colors appear scattered rather than clustered, which suggests no strong systematic spatial bias: the model isn’t consistently wrong in one part of the city.
The right map shows the absolute size of errors, highlighting where the model misses by larger margins. The largest errors occur in a few high-activity pockets, mostly in the South Side and near the West Side burglary clusters. This implies that the hardest places to predict are exactly the areas with the most burglary volatility.
#Part7: Summary Statistics and Tables
##7.1 Model Summary Table
```{r model-summary-table}
# Create nice summary table
model_summary <- broom::tidy(final_model, exponentiate = TRUE) %>%
mutate(
across(where(is.numeric), ~round(., 3))
)
model_summary %>%
kable(
caption = "Final Negative Binomial Model Coefficients (Exponentiated)",
col.names = c("Variable", "Rate Ratio", "Std. Error", "Z", "P-Value")
) %>%
kable_styling(bootstrap_options = c("striped", "hover")) %>%
footnote(
general = "Rate ratios > 1 indicate positive association with burglary counts."
)
```