Fan Yang - MUSA 5080
  • Home
  • Weekly Notes
    • Week 1
    • Week 2
    • Week 3
    • Week 4
    • Week 5
    • Week 6
    • Week 7
    • Week 9
    • Week 10
    • Week 11
    • Week 12
  • Labs
    • Lab 1: Setup Instructions
    • Lab 2: Getting Started with dplyr
    • Lab 3: Data Visualization and EDA
    • Lab 4: Spatial Operations with Pennsylvania Data
  • Assignments
    • Assignment 1: Census Data Quality for Policy Decisions
    • Assignment 2: Spatial Analysis and Visualization
    • Assignment 4: Spatial Predictive Analysis
    • Assignment 5: Space-Time Prediction of Bike Share Demand
  • Final
    • Final Slides
    • Technical Appendix
    • README

On this page

  • Introduction
  • Part 1: Data Loading and Exploration
    • Load Spatial Boundaries
    • Load Burglary Data
    • Load Sanitation Violation Data
    • Visualize Spatial Distribution
  • Part 2: Fishnet Grid Creation
    • Create and Aggregate Grid
    • Aggregate Burglaries to Grid
    • Aggregate Sanitation Violations to Grid
    • Summary Statistics
    • Visualize Aggregated Counts
  • Part 3: Spatial Features
    • Calculate Nearest Neighbor Distance
    • Local Moran’s I Analysis
    • Distance to Hot Spots
    • Kernel Density Estimation Baseline
  • Part 4: Count Regression Models
    • Prepare Modeling Dataset
    • Fit Poisson Model
    • Fit Negative Binomial Model
    • Model Comparison
  • Part 5: Spatial Cross-Validation
    • Cross-Validation Results
  • Part 6: Model Evaluation
    • Generate Final Predictions
    • Visual Comparison
    • Performance Metrics
    • Error Analysis
  • Conclusion

Spatial Predictive Analysis: Predicting Burglaries with Sanitation Violations

  • Show All Code
  • Hide All Code

  • View Source
Author

Fan Yang

Published

November 15, 2025

Introduction

This analysis constructs a spatial prediction model for Chicago burglaries, with sanitation code violations as the primary predictor variable. It hypothesizes that sanitation violations may also correlate with burglary risk, drawing on the “broken windows” theory—where visible signs of disorder, such as uncollected trash or violations, indicate diminished social control and may trigger criminal activity. Based on this premise, the analysis explores how community disorder indicators relate to property crime patterns by testing whether sanitation complaints can predict burglary locations.

Code
library(tidyverse)
library(sf)
library(here)
library(viridis)
library(terra)
library(spdep)
library(FNN)
library(MASS)
library(patchwork)
library(knitr)
library(kableExtra)
library(classInt)
library(spatstat.geom)
library(spatstat.explore)

options(scipen = 999)
set.seed(1115)

theme_spatial <- function(base_size = 11) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold", size = base_size + 1),
      plot.subtitle = element_text(color = "gray30", size = base_size - 1),
      legend.position = "right",
      panel.grid.minor = element_blank(),
      axis.text = element_blank(),
      axis.title = element_blank()
    )
}

theme_set(theme_spatial())

Part 1: Data Loading and Exploration

Load Spatial Boundaries

Code
policeDistricts <- 
  st_read("https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = dist_num)

policeBeats <- 
  st_read("https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(Beat = beat_num)

chicagoBoundary <- 
  st_read("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson") %>%
  st_transform('ESRI:102271')

Load Burglary Data

Code
burglaries <- st_read(here("assignments", "assignment_4", "data", "burglaries.shp")) %>% 
  st_transform('ESRI:102271')

The dataset contains 7,482 burglary incidents from 2017.

Load Sanitation Violation Data

Code
violations_raw <- read_csv(
  here("assignments", "assignment_4", "data", "311_data.csv"),
  show_col_types = FALSE
)

violations <- violations_raw %>%
  filter(`Type of Service Request` == "Sanitation Code Violation") %>%
  filter(!is.na(Latitude) & !is.na(Longitude)) %>%
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326) %>%
  st_transform('ESRI:102271')

The sanitation violations dataset contains 152,636 complaints from 2017, including issues like improper garbage disposal, rodent infestations, and unsanitary property conditions.

Visualize Spatial Distribution

Code
p1 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_sf(data = burglaries, color = "#d62828", size = 0.1, alpha = 0.3) +
  labs(
    title = "Burglary Locations",
    subtitle = paste0("n = ", format(nrow(burglaries), big.mark=","))
  ) +
  theme_void()

p2 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_sf(data = violations, color = "#457b9d", size = 0.1, alpha = 0.3) +
  labs(
    title = "Sanitation Violations",
    subtitle = paste0("n = ", format(nrow(violations), big.mark=","))
  ) +
  theme_void()

p1 + p2 + 
  plot_annotation(title = "Spatial Distribution: Burglaries and Sanitation Violations")

Both datasets reveal a certain spatial clustering effect, with hot spots concentrated in similar areas (though sanitation code violations are more prevalent). Sanitation code violations appear to be a useful indicator for predicting the risk of residential burglary.

Part 2: Fishnet Grid Creation

Create and Aggregate Grid

500m x 500m fishnet grid.

Code
fishnet <- st_make_grid(
  chicagoBoundary,
  cellsize = 500,
  square = TRUE
) %>%
  st_sf() %>%
  mutate(uniqueID = row_number())

fishnet <- fishnet[chicagoBoundary, ]

Aggregate Burglaries to Grid

Code
burglaries_fishnet <- st_join(burglaries, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(countBurglaries = n())

fishnet <- fishnet %>%
  left_join(burglaries_fishnet, by = "uniqueID") %>%
  mutate(countBurglaries = replace_na(countBurglaries, 0))

Aggregate Sanitation Violations to Grid

Code
violations_fishnet <- st_join(violations, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(sanitation_violations = n())

fishnet <- fishnet %>%
  left_join(violations_fishnet, by = "uniqueID") %>%
  mutate(sanitation_violations = replace_na(sanitation_violations, 0))

Summary Statistics

Code
data.frame(
  Variable = c("Burglaries", "Sanitation Violations"),
  Min = c(min(fishnet$countBurglaries), min(fishnet$sanitation_violations)),
  Median = c(median(fishnet$countBurglaries), median(fishnet$sanitation_violations)),
  Mean = c(round(mean(fishnet$countBurglaries), 2), round(mean(fishnet$sanitation_violations), 2)),
  Max = c(max(fishnet$countBurglaries), max(fishnet$sanitation_violations)),
  `Cells with Zero` = c(
    paste0(sum(fishnet$countBurglaries == 0), " (", round(100 * sum(fishnet$countBurglaries == 0) / nrow(fishnet), 1), "%)"),
    paste0(sum(fishnet$sanitation_violations == 0), " (", round(100 * sum(fishnet$sanitation_violations == 0) / nrow(fishnet), 1), "%)")
  )
) %>%
  kable(caption = "Distribution Summary") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Distribution Summary
Variable Min Median Mean Max Cells.with.Zero
Burglaries 0 2 3.04 40 781 (31.8%)
Sanitation Violations 0 46 62.02 727 306 (12.4%)

Visualize Aggregated Counts

Code
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = sanitation_violations), color = NA) +
  scale_fill_viridis_c(
    name = "Count",
    option = "mako",
    trans = "log1p"
  ) +
  labs(title = "Sanitation Violations per Cell") +
  theme_void()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  scale_fill_viridis_c(
    name = "Count",
    option = "plasma",
    trans = "log1p"
  ) +
  labs(title = "Burglaries per Cell") +
  theme_void()

p1 + p2 +
  plot_annotation(
    title = "Spatial Relationship: Sanitation Violations and Burglaries",
    subtitle = "Do high-violation areas correspond to high-burglary areas?"
  )

These two maps reveal substantial spatial overlap between sanitation violation hot spots and residential burglary hot spots.

Part 3: Spatial Features

Calculate Nearest Neighbor Distance

It measures the average distance from each cell to the three nearest sanitation violations, capturing local exposure to disorder.

Code
fishnet_coords <- st_coordinates(st_centroid(fishnet))
violations_coords <- st_coordinates(violations)

nn_result <- get.knnx(violations_coords, fishnet_coords, k = 3)

fishnet <- fishnet %>%
  mutate(sanitation.nn = rowMeans(nn_result$nn.dist))
Code
ggplot() +
  geom_sf(data = fishnet, aes(fill = sanitation.nn), color = NA) +
  scale_fill_viridis_c(
    option = "viridis",
    direction = -1,
    name = "Distance (ft)",
    trans = "log1p"
  ) +
  labs(
    title = "Average Distance to 3 Nearest Sanitation Violations",
    subtitle = "Lower values indicate proximity to disorder hot spots"
  ) +
  theme_void()

Cells with low values are located in areas with concentrated sanitation problems. If the broken windows theory holds, it means that these areas may face elevated burglary risk.

Local Moran’s I Analysis

I identify statistically significant clusters of sanitation violations using LISA to pinpoint disorder hot spots.

Code
coords <- st_coordinates(st_centroid(fishnet))
fishnet_nb <- knn2nb(knearneigh(coords, k = 5))
fishnet_weights <- nb2listw(fishnet_nb, style = "W", zero.policy = TRUE)

local_moran <- localmoran(fishnet$sanitation_violations, fishnet_weights, zero.policy = TRUE)

fishnet <- fishnet %>%
  mutate(
    localMorans = local_moran[, 1],
    localMorans_pval = local_moran[, 5],
    sig_hotspot = localMorans > 0 & localMorans_pval <= 0.05
  )
Code
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = localMorans), color = NA) +
  scale_fill_gradient2(
    low = "#2166ac", mid = "white", high = "#b2182b",
    midpoint = 0,
    name = "Local\nMoran's I"
  ) +
  labs(title = "Local Moran's I Statistic") +
  theme_void()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = sig_hotspot), color = NA) +
  scale_fill_manual(
    values = c("gray90", "#d62828"),
    labels = c("Not Significant", "Hot Spot (p ≤ 0.05)"),
    name = ""
  ) +
  labs(title = "Significant Hot Spots") +
  theme_void()

p1 + p2

LISA reveals 540 cells that are statistically significant hot spots, representing areas where high sanitation violation counts cluster together. The central and south-central regions have a higher concentration.

Distance to Hot Spots

Code
hotspot_cells <- fishnet %>% filter(sig_hotspot == TRUE)

if(nrow(hotspot_cells) > 0) {
  hotspot_union <- st_union(hotspot_cells)
  
  fishnet <- fishnet %>%
    mutate(
      dist_to_hotspot = as.numeric(st_distance(st_centroid(.), hotspot_union))
    )
} else {
  fishnet <- fishnet %>% mutate(dist_to_hotspot = NA)
}
Code
ggplot() +
  geom_sf(data = fishnet, aes(fill = dist_to_hotspot), color = NA) +
  scale_fill_viridis_c(
    option = "plasma",
    direction = -1,
    name = "Distance (ft)",
    trans = "log1p",
    breaks = c(0, 500, 1000, 4000)
  ) +
  geom_sf(data = filter(fishnet, sig_hotspot == TRUE), 
          fill = NA, color = "red", size = 0.3) +
  labs(
    title = "Distance to Nearest Sanitation Hot Spot",
    subtitle = "Red outlines = significant clusters"
  ) +
  theme_void()

The map reveals multiple sanitation violation hotspots in Chicago, primarily concentrated in the city’s south central area, while the northern regions remain distant from these problem zones.

Kernel Density Estimation Baseline

Code
burglary_ppp <- as.ppp(
  st_coordinates(burglaries),
  W = as.owin(st_bbox(chicagoBoundary))
)

kde_result <- density.ppp(
  burglary_ppp,
  sigma = 1000,
  edge = TRUE
)

kde_raster <- rast(kde_result)
crs(kde_raster) <- "ESRI:102271"

fishnet_centroids <- st_centroid(fishnet)
fishnet$kde_value <- terra::extract(kde_raster, vect(fishnet_centroids))[, 2]
Code
ggplot() +
  geom_sf(data = fishnet, aes(fill = kde_value), color = NA) +
  scale_fill_viridis_c(
    option = "turbo",
    name = "Density"
  ) +
  labs(
    title = "Kernel Density Estimation of Burglaries",
    subtitle = "Baseline prediction using only spatial distribution"
  ) +
  theme_void()

KDE as a baseline—it predicts burglary risk solely based on past locations. An effective regression model should theoretically outperform this.

Part 4: Count Regression Models

Prepare Modeling Dataset

Code
fishnet_centroids_with_id <- st_centroid(fishnet) %>%
  mutate(uniqueID = fishnet$uniqueID)

fishnet_districts <- st_join(
  fishnet_centroids_with_id,
  policeDistricts
) %>%
  st_drop_geometry() %>%
  dplyr::select(uniqueID, District)

fishnet <- fishnet %>%
  left_join(fishnet_districts, by = "uniqueID")

fishnet_model <- fishnet %>%
  filter(!is.na(sanitation_violations) & !is.na(sanitation.nn) & 
         !is.na(dist_to_hotspot) & !is.na(District))

Fit Poisson Model

Code
model_poisson <- glm(
  countBurglaries ~ sanitation_violations + sanitation.nn + dist_to_hotspot,
  family = "poisson",
  data = fishnet_model
)

summary(model_poisson)

Call:
glm(formula = countBurglaries ~ sanitation_violations + sanitation.nn + 
    dist_to_hotspot, family = "poisson", data = fishnet_model)

Coefficients:
                         Estimate  Std. Error z value            Pr(>|z|)    
(Intercept)            1.72136139  0.03385365   50.85 <0.0000000000000002 ***
sanitation_violations  0.00286160  0.00015390   18.59 <0.0000000000000002 ***
sanitation.nn         -0.00817539  0.00028246  -28.94 <0.0000000000000002 ***
dist_to_hotspot       -0.00017981  0.00001681  -10.70 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 8956.5  on 2259  degrees of freedom
Residual deviance: 5009.9  on 2256  degrees of freedom
AIC: 10120

Number of Fisher Scoring iterations: 6

Fit Negative Binomial Model

Code
model_nb <- glm.nb(
  countBurglaries ~ sanitation_violations + sanitation.nn + dist_to_hotspot,
  data = fishnet_model
)

summary(model_nb)

Call:
glm.nb(formula = countBurglaries ~ sanitation_violations + sanitation.nn + 
    dist_to_hotspot, data = fishnet_model, init.theta = 2.496099382, 
    link = log)

Coefficients:
                         Estimate  Std. Error z value             Pr(>|z|)    
(Intercept)            1.50713148  0.06142604  24.536 < 0.0000000000000002 ***
sanitation_violations  0.00452346  0.00033341  13.567 < 0.0000000000000002 ***
sanitation.nn         -0.00793662  0.00038734 -20.490 < 0.0000000000000002 ***
dist_to_hotspot       -0.00012077  0.00002554  -4.728           0.00000227 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(2.4961) family taken to be 1)

    Null deviance: 4353.3  on 2259  degrees of freedom
Residual deviance: 2332.1  on 2256  degrees of freedom
AIC: 9020.4

Number of Fisher Scoring iterations: 1

              Theta:  2.496 
          Std. Err.:  0.148 

 2 x log-likelihood:  -9010.445 

Model Comparison

Code
data.frame(
  Model = c("Poisson", "Negative Binomial"),
  AIC = c(round(AIC(model_poisson), 2), round(AIC(model_nb), 2)),
  BIC = c(round(BIC(model_poisson), 2), round(BIC(model_nb), 2))
) %>%
  kable(caption = "Model Comparison") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Model Comparison
Model AIC BIC
Poisson 10120.27 10143.16
Negative Binomial 9020.45 9049.06

The Negative Binomial model shows a lower AIC, indicating better fit. This confirms the presence of overdispersion problem in the burglary count data, justifying the use of the more flexible Negative Binomial specification.

Part 5: Spatial Cross-Validation

Leave-One-Group-Out cross-validation by police district ensures spatial separation between training and test sets, preventing information leakage from nearby cells.

Code
districts <- unique(fishnet_model$District)
cv_results <- tibble()

for (i in seq_along(districts)) {
  
  test_district <- districts[i]
  
  train_data <- fishnet_model %>% filter(District != test_district)
  test_data <- fishnet_model %>% filter(District == test_district)
  
  model_cv <- glm.nb(
    countBurglaries ~ sanitation_violations + sanitation.nn + dist_to_hotspot,
    data = train_data
  )
  
  test_data <- test_data %>%
    mutate(
      prediction = predict(model_cv, test_data, type = "response")
    )
  
  mae <- mean(abs(test_data$countBurglaries - test_data$prediction))
  rmse <- sqrt(mean((test_data$countBurglaries - test_data$prediction)^2))
  
  cv_results <- bind_rows(
    cv_results,
    tibble(
      fold = i,
      test_district = test_district,
      n_test = nrow(test_data),
      mae = round(mae, 2),
      rmse = round(rmse, 2)
    )
  )
}

Cross-Validation Results

Code
data.frame(
  Metric = c("Mean MAE", "Mean RMSE", "SD of MAE"),
  Value = c(
    round(mean(cv_results$mae), 2),
    round(mean(cv_results$rmse), 2),
    round(sd(cv_results$mae), 2)
  )
) %>%
  kable(caption = "Overall Cross-Validation Performance") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Overall Cross-Validation Performance
Metric Value
Mean MAE 2.29
Mean RMSE 3.58
SD of MAE 0.84
Code
cv_results %>%
  arrange(desc(mae)) %>%
  kable(
    digits = 2,
    caption = "LOGO Cross-Validation Results by District"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
LOGO Cross-Validation Results by District
fold test_district n_test mae rmse
8 3 65 4.18 6.11
19 19 92 3.71 8.32
18 14 65 3.55 5.67
16 18 45 3.26 4.67
21 17 98 2.75 8.31
5 6 82 2.68 3.74
15 11 65 2.66 3.23
23 24 54 2.60 3.32
13 12 98 2.57 3.72
7 7 67 2.44 3.41
17 25 113 2.41 3.34
6 8 238 2.11 3.63
9 2 82 2.07 3.01
3 22 130 1.92 2.33
14 15 39 1.86 2.22
12 1 47 1.84 2.41
22 20 48 1.77 2.25
1 5 129 1.71 2.98
20 16 183 1.61 1.98
10 9 140 1.59 2.28
11 10 85 1.51 2.20
2 4 276 1.16 2.45
4 31 19 0.64 0.73

The cross-validation results show how well the model generalizes to unseen districts. Variation in MAE across districts reflects different relationships between sanitation violations and burglaries in different neighborhoods.

Part 6: Model Evaluation

Generate Final Predictions

Code
final_model <- glm.nb(
  countBurglaries ~ sanitation_violations + sanitation.nn + dist_to_hotspot,
  data = fishnet_model
)

fishnet <- fishnet %>%
  mutate(
    prediction_nb = predict(final_model, fishnet_model, type = "response")[match(uniqueID, fishnet_model$uniqueID)]
  )

kde_sum <- sum(fishnet$kde_value, na.rm = TRUE)
count_sum <- sum(fishnet$countBurglaries, na.rm = TRUE)
fishnet <- fishnet %>%
  mutate(
    prediction_kde = (kde_value / kde_sum) * count_sum
  )

Visual Comparison

Code
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "plasma", limits = c(0, 15), trans = "sqrt") +
  labs(title = "Actual Burglaries") +
  theme_void()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_nb), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15), trans = "sqrt") +
  labs(title = "Model Predictions") +
  theme_void()

p3 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_kde), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15), trans = "sqrt") +
  labs(title = "KDE Baseline") +
  theme_void()

p1 + p2 + p3 +
  plot_annotation(
    title = "Actual vs. Predicted Burglaries",
    subtitle = "Does the sanitation-based model outperform simple KDE?"
  )

The regression model based on sanitation violations (middle) effectively captures the spatial distribution patterns of actual burglaries (left), with hot spot areas largely aligning. In contrast, while the KDE baseline model (right) also identifies major hot spots (even outperforming in the metric comparisons below), it exhibits blurred spatial details and excessive smoothing. Thus the regression model incorporating sanitation features preserves spatial details better, though its overall predictive capability may be comparable to or like inferior to KDE.

Performance Metrics

Code
comparison <- fishnet %>%
  st_drop_geometry() %>%
  filter(!is.na(prediction_nb), !is.na(prediction_kde)) %>%
  summarize(
    model_mae = round(mean(abs(countBurglaries - prediction_nb)), 2),
    model_rmse = round(sqrt(mean((countBurglaries - prediction_nb)^2)), 2),
    kde_mae = round(mean(abs(countBurglaries - prediction_kde)), 2),
    kde_rmse = round(sqrt(mean((countBurglaries - prediction_kde)^2)), 2)
  )

comparison %>%
  pivot_longer(everything(), names_to = "metric", values_to = "value") %>%
  separate(metric, into = c("approach", "metric"), sep = "_") %>%
  pivot_wider(names_from = metric, values_from = value) %>%
  kable(caption = "Model vs. KDE Performance") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Model vs. KDE Performance
approach mae rmse
model 2.10 3.57
kde 2.01 2.90

KDE performs better in statistical metrics, but as shown in the previous image, KDE’s spatial representation is overly smooth and lacks detail. My predictive model, however, performs better at finer spatial granularity.

Error Analysis

Code
fishnet_errors <- fishnet %>%
  filter(!is.na(prediction_nb) & !is.na(countBurglaries)) %>%
  mutate(
    error_nb = countBurglaries - prediction_nb,
    abs_error_nb = abs(error_nb)
  )

p1 <- ggplot() +
  geom_sf(data = fishnet_errors, aes(fill = error_nb), color = NA) +
  scale_fill_gradient2(
    name = "Error",
    low = "#2166ac", mid = "white", high = "#b2182b",
    midpoint = 0,
    limits = c(-10, 10)
  ) +
  labs(title = "Prediction Errors (Actual - Predicted)") +
  theme_void()

p2 <- ggplot() +
  geom_sf(data = fishnet_errors, aes(fill = abs_error_nb), color = NA) +
  scale_fill_viridis_c(name = "Abs. Error", option = "magma") +
  labs(title = "Absolute Errors") +
  theme_void()

p1 + p2

The model performs well overall: The random distribution of red and blue errors in the left image indicates no systematic bias, while the large dark areas in the right image demonstrate that absolute errors are generally small. Only a few scattered points show larger errors, meaning the model predicts accurately in most regions, but specific areas may require additional features for improved prediction.

Conclusion

Performance: The negative binomial model achieved a cross-validation MAE of 2.29, performing roughly on par with the KDE baseline. All three spatial features demonstrated statistical significance. However, KDE exhibits overly smoothed predictions spatially, whereas my predictive model effectively captures fine-grained details consistent with actual values.

Spatial Patterns: Sanitation hotspots and residential burglary hotspots show significant overlap, particularly in central and southern Chicago. This aligns with the Broken Windows theory—areas exhibiting visible disorder tend to experience higher rates of property crime.

Limitations: Error analysis indicates that most predictions are accurate, but significant discrepancies exist in a few areas. These outliers may possess characteristics not captured by health data—potentially unusual building/land use types or specific economic and demographic conditions. Furthermore, relying solely on a single type of 311 call overlooks certain behavioral patterns.

Source Code
---
title: "Spatial Predictive Analysis: Predicting Burglaries with Sanitation Violations"
author: "Fan Yang"
date: "November 15, 2025"
format:
  html:
    code-fold: show
    code-tools: true
    toc: true
    toc-depth: 3
    toc-location: left
    theme: cosmo
    embed-resources: true
editor: visual
execute:
  warning: false
  message: false
---

## Introduction

This analysis constructs a spatial prediction model for Chicago burglaries, with sanitation code violations as the primary predictor variable. It hypothesizes that sanitation violations may also correlate with burglary risk, drawing on the “broken windows” theory—where visible signs of disorder, such as uncollected trash or violations, indicate diminished social control and may trigger criminal activity. Based on this premise, the analysis explores how community disorder indicators relate to property crime patterns by testing whether sanitation complaints can predict burglary locations.

```{r setup}
#| code-fold: true

library(tidyverse)
library(sf)
library(here)
library(viridis)
library(terra)
library(spdep)
library(FNN)
library(MASS)
library(patchwork)
library(knitr)
library(kableExtra)
library(classInt)
library(spatstat.geom)
library(spatstat.explore)

options(scipen = 999)
set.seed(1115)

theme_spatial <- function(base_size = 11) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold", size = base_size + 1),
      plot.subtitle = element_text(color = "gray30", size = base_size - 1),
      legend.position = "right",
      panel.grid.minor = element_blank(),
      axis.text = element_blank(),
      axis.title = element_blank()
    )
}

theme_set(theme_spatial())
```

## Part 1: Data Loading and Exploration

### Load Spatial Boundaries

```{r load-boundaries}
#| output: false

policeDistricts <- 
  st_read("https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = dist_num)

policeBeats <- 
  st_read("https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(Beat = beat_num)

chicagoBoundary <- 
  st_read("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson") %>%
  st_transform('ESRI:102271')
```

### Load Burglary Data

```{r load-burglaries}
#| output: false

burglaries <- st_read(here("assignments", "assignment_4", "data", "burglaries.shp")) %>% 
  st_transform('ESRI:102271')
```

The dataset contains 7,482 burglary incidents from 2017.

### Load Sanitation Violation Data

```{r load-violations}
#| cache: true

violations_raw <- read_csv(
  here("assignments", "assignment_4", "data", "311_data.csv"),
  show_col_types = FALSE
)

violations <- violations_raw %>%
  filter(`Type of Service Request` == "Sanitation Code Violation") %>%
  filter(!is.na(Latitude) & !is.na(Longitude)) %>%
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326) %>%
  st_transform('ESRI:102271')
```

The sanitation violations dataset contains 152,636 complaints from 2017, including issues like improper garbage disposal, rodent infestations, and unsanitary property conditions.

### Visualize Spatial Distribution

```{r visualize-data}
#| fig-width: 10
#| fig-height: 5

p1 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_sf(data = burglaries, color = "#d62828", size = 0.1, alpha = 0.3) +
  labs(
    title = "Burglary Locations",
    subtitle = paste0("n = ", format(nrow(burglaries), big.mark=","))
  ) +
  theme_void()

p2 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_sf(data = violations, color = "#457b9d", size = 0.1, alpha = 0.3) +
  labs(
    title = "Sanitation Violations",
    subtitle = paste0("n = ", format(nrow(violations), big.mark=","))
  ) +
  theme_void()

p1 + p2 + 
  plot_annotation(title = "Spatial Distribution: Burglaries and Sanitation Violations")
```

Both datasets reveal a certain spatial clustering effect, with hot spots concentrated in similar areas (though sanitation code violations are more prevalent). Sanitation code violations appear to be a useful indicator for predicting the risk of residential burglary.

## Part 2: Fishnet Grid Creation

### Create and Aggregate Grid

500m x 500m fishnet grid.

```{r create-fishnet}
fishnet <- st_make_grid(
  chicagoBoundary,
  cellsize = 500,
  square = TRUE
) %>%
  st_sf() %>%
  mutate(uniqueID = row_number())

fishnet <- fishnet[chicagoBoundary, ]
```

### Aggregate Burglaries to Grid

```{r aggregate-burglaries}
burglaries_fishnet <- st_join(burglaries, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(countBurglaries = n())

fishnet <- fishnet %>%
  left_join(burglaries_fishnet, by = "uniqueID") %>%
  mutate(countBurglaries = replace_na(countBurglaries, 0))
```

### Aggregate Sanitation Violations to Grid

```{r aggregate-violations}
violations_fishnet <- st_join(violations, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(sanitation_violations = n())

fishnet <- fishnet %>%
  left_join(violations_fishnet, by = "uniqueID") %>%
  mutate(sanitation_violations = replace_na(sanitation_violations, 0))
```

### Summary Statistics

```{r summary-stats}
data.frame(
  Variable = c("Burglaries", "Sanitation Violations"),
  Min = c(min(fishnet$countBurglaries), min(fishnet$sanitation_violations)),
  Median = c(median(fishnet$countBurglaries), median(fishnet$sanitation_violations)),
  Mean = c(round(mean(fishnet$countBurglaries), 2), round(mean(fishnet$sanitation_violations), 2)),
  Max = c(max(fishnet$countBurglaries), max(fishnet$sanitation_violations)),
  `Cells with Zero` = c(
    paste0(sum(fishnet$countBurglaries == 0), " (", round(100 * sum(fishnet$countBurglaries == 0) / nrow(fishnet), 1), "%)"),
    paste0(sum(fishnet$sanitation_violations == 0), " (", round(100 * sum(fishnet$sanitation_violations == 0) / nrow(fishnet), 1), "%)")
  )
) %>%
  kable(caption = "Distribution Summary") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
```

### Visualize Aggregated Counts

```{r visualize-fishnet}
#| fig-width: 10
#| fig-height: 4

p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = sanitation_violations), color = NA) +
  scale_fill_viridis_c(
    name = "Count",
    option = "mako",
    trans = "log1p"
  ) +
  labs(title = "Sanitation Violations per Cell") +
  theme_void()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  scale_fill_viridis_c(
    name = "Count",
    option = "plasma",
    trans = "log1p"
  ) +
  labs(title = "Burglaries per Cell") +
  theme_void()

p1 + p2 +
  plot_annotation(
    title = "Spatial Relationship: Sanitation Violations and Burglaries",
    subtitle = "Do high-violation areas correspond to high-burglary areas?"
  )
```

These two maps reveal substantial spatial overlap between sanitation violation hot spots and residential burglary hot spots.

## Part 3: Spatial Features

### Calculate Nearest Neighbor Distance

It measures the average distance from each cell to the three nearest sanitation violations, capturing local exposure to disorder.

```{r nn-feature}
fishnet_coords <- st_coordinates(st_centroid(fishnet))
violations_coords <- st_coordinates(violations)

nn_result <- get.knnx(violations_coords, fishnet_coords, k = 3)

fishnet <- fishnet %>%
  mutate(sanitation.nn = rowMeans(nn_result$nn.dist))
```

```{r plot-nn}
#| fig-width: 8
#| fig-height: 6

ggplot() +
  geom_sf(data = fishnet, aes(fill = sanitation.nn), color = NA) +
  scale_fill_viridis_c(
    option = "viridis",
    direction = -1,
    name = "Distance (ft)",
    trans = "log1p"
  ) +
  labs(
    title = "Average Distance to 3 Nearest Sanitation Violations",
    subtitle = "Lower values indicate proximity to disorder hot spots"
  ) +
  theme_void()
```

Cells with low values are located in areas with concentrated sanitation problems. If the broken windows theory holds, it means that these areas may face elevated burglary risk.

### Local Moran's I Analysis

I identify statistically significant clusters of sanitation violations using LISA to pinpoint disorder hot spots.

```{r local-morans}
coords <- st_coordinates(st_centroid(fishnet))
fishnet_nb <- knn2nb(knearneigh(coords, k = 5))
fishnet_weights <- nb2listw(fishnet_nb, style = "W", zero.policy = TRUE)

local_moran <- localmoran(fishnet$sanitation_violations, fishnet_weights, zero.policy = TRUE)

fishnet <- fishnet %>%
  mutate(
    localMorans = local_moran[, 1],
    localMorans_pval = local_moran[, 5],
    sig_hotspot = localMorans > 0 & localMorans_pval <= 0.05
  )
```

```{r plot-hotspots}
#| fig-width: 10
#| fig-height: 4

p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = localMorans), color = NA) +
  scale_fill_gradient2(
    low = "#2166ac", mid = "white", high = "#b2182b",
    midpoint = 0,
    name = "Local\nMoran's I"
  ) +
  labs(title = "Local Moran's I Statistic") +
  theme_void()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = sig_hotspot), color = NA) +
  scale_fill_manual(
    values = c("gray90", "#d62828"),
    labels = c("Not Significant", "Hot Spot (p ≤ 0.05)"),
    name = ""
  ) +
  labs(title = "Significant Hot Spots") +
  theme_void()

p1 + p2
```

LISA reveals 540 cells that are statistically significant hot spots, representing areas where high sanitation violation counts cluster together. The central and south-central regions have a higher concentration.

### Distance to Hot Spots

```{r distance-hotspot}
hotspot_cells <- fishnet %>% filter(sig_hotspot == TRUE)

if(nrow(hotspot_cells) > 0) {
  hotspot_union <- st_union(hotspot_cells)
  
  fishnet <- fishnet %>%
    mutate(
      dist_to_hotspot = as.numeric(st_distance(st_centroid(.), hotspot_union))
    )
} else {
  fishnet <- fishnet %>% mutate(dist_to_hotspot = NA)
}
```

```{r plot-distance}
#| fig-width: 8
#| fig-height: 6

ggplot() +
  geom_sf(data = fishnet, aes(fill = dist_to_hotspot), color = NA) +
  scale_fill_viridis_c(
    option = "plasma",
    direction = -1,
    name = "Distance (ft)",
    trans = "log1p",
    breaks = c(0, 500, 1000, 4000)
  ) +
  geom_sf(data = filter(fishnet, sig_hotspot == TRUE), 
          fill = NA, color = "red", size = 0.3) +
  labs(
    title = "Distance to Nearest Sanitation Hot Spot",
    subtitle = "Red outlines = significant clusters"
  ) +
  theme_void()
```

The map reveals multiple sanitation violation hotspots in Chicago, primarily concentrated in the city's south central area, while the northern regions remain distant from these problem zones.

### Kernel Density Estimation Baseline

```{r kde}
burglary_ppp <- as.ppp(
  st_coordinates(burglaries),
  W = as.owin(st_bbox(chicagoBoundary))
)

kde_result <- density.ppp(
  burglary_ppp,
  sigma = 1000,
  edge = TRUE
)

kde_raster <- rast(kde_result)
crs(kde_raster) <- "ESRI:102271"

fishnet_centroids <- st_centroid(fishnet)
fishnet$kde_value <- terra::extract(kde_raster, vect(fishnet_centroids))[, 2]
```

```{r plot-kde}
#| fig-width: 8
#| fig-height: 6

ggplot() +
  geom_sf(data = fishnet, aes(fill = kde_value), color = NA) +
  scale_fill_viridis_c(
    option = "turbo",
    name = "Density"
  ) +
  labs(
    title = "Kernel Density Estimation of Burglaries",
    subtitle = "Baseline prediction using only spatial distribution"
  ) +
  theme_void()
```

KDE as a baseline—it predicts burglary risk solely based on past locations. An effective regression model should theoretically outperform this.

## Part 4: Count Regression Models

### Prepare Modeling Dataset

```{r join-districts}
fishnet_centroids_with_id <- st_centroid(fishnet) %>%
  mutate(uniqueID = fishnet$uniqueID)

fishnet_districts <- st_join(
  fishnet_centroids_with_id,
  policeDistricts
) %>%
  st_drop_geometry() %>%
  dplyr::select(uniqueID, District)

fishnet <- fishnet %>%
  left_join(fishnet_districts, by = "uniqueID")

fishnet_model <- fishnet %>%
  filter(!is.na(sanitation_violations) & !is.na(sanitation.nn) & 
         !is.na(dist_to_hotspot) & !is.na(District))
```

### Fit Poisson Model

```{r poisson-model}
model_poisson <- glm(
  countBurglaries ~ sanitation_violations + sanitation.nn + dist_to_hotspot,
  family = "poisson",
  data = fishnet_model
)

summary(model_poisson)
```

### Fit Negative Binomial Model

```{r negbin-model}
model_nb <- glm.nb(
  countBurglaries ~ sanitation_violations + sanitation.nn + dist_to_hotspot,
  data = fishnet_model
)

summary(model_nb)
```

### Model Comparison

```{r model-comparison}
data.frame(
  Model = c("Poisson", "Negative Binomial"),
  AIC = c(round(AIC(model_poisson), 2), round(AIC(model_nb), 2)),
  BIC = c(round(BIC(model_poisson), 2), round(BIC(model_nb), 2))
) %>%
  kable(caption = "Model Comparison") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```

The Negative Binomial model shows a lower AIC, indicating better fit. This confirms the presence of overdispersion problem in the burglary count data, justifying the use of the more flexible Negative Binomial specification.

## Part 5: Spatial Cross-Validation

Leave-One-Group-Out cross-validation by police district ensures spatial separation between training and test sets, preventing information leakage from nearby cells.

```{r spatial-cv}
districts <- unique(fishnet_model$District)
cv_results <- tibble()

for (i in seq_along(districts)) {
  
  test_district <- districts[i]
  
  train_data <- fishnet_model %>% filter(District != test_district)
  test_data <- fishnet_model %>% filter(District == test_district)
  
  model_cv <- glm.nb(
    countBurglaries ~ sanitation_violations + sanitation.nn + dist_to_hotspot,
    data = train_data
  )
  
  test_data <- test_data %>%
    mutate(
      prediction = predict(model_cv, test_data, type = "response")
    )
  
  mae <- mean(abs(test_data$countBurglaries - test_data$prediction))
  rmse <- sqrt(mean((test_data$countBurglaries - test_data$prediction)^2))
  
  cv_results <- bind_rows(
    cv_results,
    tibble(
      fold = i,
      test_district = test_district,
      n_test = nrow(test_data),
      mae = round(mae, 2),
      rmse = round(rmse, 2)
    )
  )
}
```

### Cross-Validation Results

```{r cv-summary}
data.frame(
  Metric = c("Mean MAE", "Mean RMSE", "SD of MAE"),
  Value = c(
    round(mean(cv_results$mae), 2),
    round(mean(cv_results$rmse), 2),
    round(sd(cv_results$mae), 2)
  )
) %>%
  kable(caption = "Overall Cross-Validation Performance") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
```

```{r cv-table}
cv_results %>%
  arrange(desc(mae)) %>%
  kable(
    digits = 2,
    caption = "LOGO Cross-Validation Results by District"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```

The cross-validation results show how well the model generalizes to unseen districts. Variation in MAE across districts reflects different relationships between sanitation violations and burglaries in different neighborhoods.

## Part 6: Model Evaluation

### Generate Final Predictions

```{r final-predictions}
final_model <- glm.nb(
  countBurglaries ~ sanitation_violations + sanitation.nn + dist_to_hotspot,
  data = fishnet_model
)

fishnet <- fishnet %>%
  mutate(
    prediction_nb = predict(final_model, fishnet_model, type = "response")[match(uniqueID, fishnet_model$uniqueID)]
  )

kde_sum <- sum(fishnet$kde_value, na.rm = TRUE)
count_sum <- sum(fishnet$countBurglaries, na.rm = TRUE)
fishnet <- fishnet %>%
  mutate(
    prediction_kde = (kde_value / kde_sum) * count_sum
  )
```

### Visual Comparison

```{r compare-predictions}
#| fig-width: 12
#| fig-height: 4

p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "plasma", limits = c(0, 15), trans = "sqrt") +
  labs(title = "Actual Burglaries") +
  theme_void()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_nb), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15), trans = "sqrt") +
  labs(title = "Model Predictions") +
  theme_void()

p3 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_kde), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15), trans = "sqrt") +
  labs(title = "KDE Baseline") +
  theme_void()

p1 + p2 + p3 +
  plot_annotation(
    title = "Actual vs. Predicted Burglaries",
    subtitle = "Does the sanitation-based model outperform simple KDE?"
  )
```

The regression model based on sanitation violations (middle) effectively captures the spatial distribution patterns of actual burglaries (left), with hot spot areas largely aligning. In contrast, while the KDE baseline model (right) also identifies major hot spots (even outperforming in the metric comparisons below), it exhibits blurred spatial details and excessive smoothing. Thus the regression model incorporating sanitation features preserves spatial details better, though its overall predictive capability may be comparable to or like inferior to KDE.

### Performance Metrics

```{r performance-metrics}
comparison <- fishnet %>%
  st_drop_geometry() %>%
  filter(!is.na(prediction_nb), !is.na(prediction_kde)) %>%
  summarize(
    model_mae = round(mean(abs(countBurglaries - prediction_nb)), 2),
    model_rmse = round(sqrt(mean((countBurglaries - prediction_nb)^2)), 2),
    kde_mae = round(mean(abs(countBurglaries - prediction_kde)), 2),
    kde_rmse = round(sqrt(mean((countBurglaries - prediction_kde)^2)), 2)
  )

comparison %>%
  pivot_longer(everything(), names_to = "metric", values_to = "value") %>%
  separate(metric, into = c("approach", "metric"), sep = "_") %>%
  pivot_wider(names_from = metric, values_from = value) %>%
  kable(caption = "Model vs. KDE Performance") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```

KDE performs better in statistical metrics, but as shown in the previous image, KDE's spatial representation is overly smooth and lacks detail. My predictive model, however, performs better at finer spatial granularity.

### Error Analysis

```{r prediction-errors}
#| fig-width: 10
#| fig-height: 5

fishnet_errors <- fishnet %>%
  filter(!is.na(prediction_nb) & !is.na(countBurglaries)) %>%
  mutate(
    error_nb = countBurglaries - prediction_nb,
    abs_error_nb = abs(error_nb)
  )

p1 <- ggplot() +
  geom_sf(data = fishnet_errors, aes(fill = error_nb), color = NA) +
  scale_fill_gradient2(
    name = "Error",
    low = "#2166ac", mid = "white", high = "#b2182b",
    midpoint = 0,
    limits = c(-10, 10)
  ) +
  labs(title = "Prediction Errors (Actual - Predicted)") +
  theme_void()

p2 <- ggplot() +
  geom_sf(data = fishnet_errors, aes(fill = abs_error_nb), color = NA) +
  scale_fill_viridis_c(name = "Abs. Error", option = "magma") +
  labs(title = "Absolute Errors") +
  theme_void()

p1 + p2
```

The model performs well overall: The random distribution of red and blue errors in the left image indicates no systematic bias, while the large dark areas in the right image demonstrate that absolute errors are generally small. Only a few scattered points show larger errors, meaning the model predicts accurately in most regions, but specific areas may require additional features for improved prediction.

## Conclusion

**Performance:** The negative binomial model achieved a cross-validation MAE of 2.29, performing roughly on par with the KDE baseline. All three spatial features demonstrated statistical significance. However, KDE exhibits overly smoothed predictions spatially, whereas my predictive model effectively captures fine-grained details consistent with actual values.

**Spatial Patterns:** Sanitation hotspots and residential burglary hotspots show significant overlap, particularly in central and southern Chicago. This aligns with the Broken Windows theory—areas exhibiting visible disorder tend to experience higher rates of property crime.

**Limitations:** Error analysis indicates that most predictions are accurate, but significant discrepancies exist in a few areas. These outliers may possess characteristics not captured by health data—potentially unusual building/land use types or specific economic and demographic conditions. Furthermore, relying solely on a single type of 311 call overlooks certain behavioral patterns.