Assignment 4: Predictive Policing & Analysis

Author

Jed Chew

Published

November 11, 2025

About This Exercise

In this lab, you will apply the spatial predictive modeling techniques demonstrated in the class exercise to a 311 service request type of your choice. You will build a complete spatial predictive model, document your process, and interpret your results.

Learning Objectives

  1. Create a fishnet grid for aggregating point-level crime data
  2. Calculate spatial features including k-nearest neighbors and distance measures
  3. Diagnose spatial autocorrelation using Local Moran’s I
  4. Fit and interpret Poisson and Negative Binomial regression for count data
  5. Implement spatial LOGO cross-validation
  6. Compare model predictions to a Kernel Density Estimation baseline
  7. Evaluate model performance using appropriate metrics

Analysis Requirements

For each major section, you must explain in your own words:

  • What you are doing in that step
  • Why this step is important for the analysis
  • What you found or learned from the results

Setup

Code
library(tidyverse)      # Data manipulation
library(sf)             # Spatial operations
library(here)           # Relative file paths
library(viridis)        # Color scales
library(terra)          # Raster operations (replaces 'raster')
library(spdep)          # Spatial dependence
library(FNN)            # Fast nearest neighbors
library(MASS)           # Negative binomial regression
library(patchwork)      # Plot composition (replaces grid/gridExtra)
library(knitr)          # Tables
library(kableExtra)     # Table formatting
library(classInt)       # Classification intervals
library(janitor)        # Clean names
library(spatstat.geom)    # Spatial geometries
library(spatstat.explore) # Spatial exploration/KDE

# Set options
options(scipen = 999)  # No scientific notation
set.seed(5080)         # Reproducibility

# Create consistent theme for visualizations
theme_crime <- function(base_size = 11) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold", size = base_size + 1),
      plot.subtitle = element_text(color = "gray30", size = base_size - 1),
      legend.position = "right",
      panel.grid.minor = element_blank(),
      axis.text = element_blank(),
      axis.title = element_blank()
    )
}

# Set as default
theme_set(theme_crime())

# Set working directory (machine-specific absolute path; prefer here::here()
# for portability, since the here package is already loaded)
setwd("C:/Users/chewj/Documents/MUSA/Github/portfolio-setup-jedchewjm/labs/lab_4")

cat("✓ All packages loaded successfully!\n")
cat("✓ Working directory:", getwd(), "\n")

Part 1: Load Chicago Spatial and Crime Data

1.1 | Load Chicago Spatial Data

Code
# Load police districts (used for spatial CV)
# Use ESRI:102271 (NAD83(HARN) / Illinois StatePlane East, meters) as CRS
policeDistricts <- 
  st_read("https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = dist_num)

# Load police beats (smaller administrative units)
policeBeats <- 
  st_read("https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(Beat = beat_num)

# Load Chicago boundary
chicagoBoundary <- 
  st_read("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson") %>%
  st_transform('ESRI:102271')

cat("✓ Loaded spatial boundaries\n")
cat("  - Police districts:", nrow(policeDistricts), "\n")
cat("  - Police beats:", nrow(policeBeats), "\n")

1.2 | Load Burglary Data

Code
# Load from provided data file (downloaded from Chicago open data portal)
burglaries <- st_read("data/burglaries.shp") %>% 
  st_transform('ESRI:102271')
Reading layer `burglaries' from data source 
  `C:\Users\chewj\Documents\MUSA\Github\portfolio-setup-jedchewjm\labs\lab_4\data\burglaries.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 7482 features and 22 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 340492 ymin: 552959.6 xmax: 367153.5 ymax: 594815.1
Projected CRS: NAD83(HARN) / Illinois East
Code
# Check the data
cat("\n✓ Loaded burglary data\n")

✓ Loaded burglary data
Code
cat("  - Number of burglaries:", nrow(burglaries), "\n")
  - Number of burglaries: 7482 
Code
cat("  - CRS:", st_crs(burglaries)$input, "\n")
  - CRS: ESRI:102271 
Code
# The shapefile's `date` field did not come through as a usable Date
# (min()/max() with na.rm = TRUE returned Inf/-Inf), so coerce it first;
# adjust the format string if the raw field is stored differently
burglaries <- burglaries %>% mutate(date = as.Date(date))
cat("  - Date range:", format(min(burglaries$date, na.rm = TRUE)), "to",
    format(max(burglaries$date, na.rm = TRUE)), "\n")

1.3 | Visualize Point Data & Describe Patterns

Code
# Simple point map
p1 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_sf(data = burglaries, color = "#d62828", size = 0.1, alpha = 0.4) +
  labs(
    title = "Burglary Locations",
    subtitle = paste0("Chicago 2017, n = ", nrow(burglaries))
  )

# Density surface
p2 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_density_2d_filled(
    data = data.frame(st_coordinates(burglaries)),
    aes(X, Y),
    alpha = 0.7,
    bins = 8
  ) +
  scale_fill_viridis_d(option = "plasma", direction = -1, guide = "none") +
  labs(
    title = "Density Surface",
    subtitle = "Kernel density estimation"
  )

# Combine plots using patchwork
p1 + p2 + 
  plot_annotation(
    title = "Spatial Distribution of Burglaries in Chicago"
  )

Note: Observed Patterns
  • Burglary locations are widely dispersed throughout Chicago but cluster more densely in certain areas, particularly West Side and South Side neighborhoods
  • This pattern is more obvious in the KDE map, which smooths the point data to highlight hotspots of burglary activity
  • Notable coldspots include the far North Side and the downtown core (i.e., the Loop)

Part 2: Create Fishnet Grid

  • A fishnet grid converts irregular point data into a regular grid of cells in which we can aggregate counts, calculate spatial features, and apply regression models.

  • A fishnet is more useful than administrative units such as census tracts because crime risk varies continuously across the landscape rather than stopping at administrative boundaries.

2.1 | Create 500 m x 500 m Fishnet Grid

Code
# Create 500m x 500m grid
fishnet <- st_make_grid(
  chicagoBoundary,
  cellsize = 500,  # 500m per cell
  square = TRUE
) %>%
  st_sf() %>%
  mutate(uniqueID = row_number())

# Keep only cells that intersect Chicago
fishnet <- fishnet[chicagoBoundary, ]

# View basic info
cat("✓ Created fishnet grid\n")
✓ Created fishnet grid
Code
cat("  - Number of cells:", nrow(fishnet), "\n")
  - Number of cells: 2458 
Code
cat("  - Cell size:", 500, "x", 500, "meters\n")
  - Cell size: 500 x 500 meters
Code
cat("  - Cell area:", round(st_area(fishnet[1,])), "square meters\n")
  - Cell area: 250000 square meters

2.2 | Aggregate Burglaries to Grid

  • Spatially join each burglary point to the fishnet cell that contains it
  • Count burglaries per cell (countBurglaries), then join the counts back to the fishnet, filling cells with no burglaries with 0
  • The sequential uniqueID assigned to each grid cell in 2.1 serves as the join key
Code
# Spatial join: which cell contains each burglary?
burglaries_fishnet <- st_join(burglaries, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(countBurglaries = n())

# Join back to fishnet (cells with 0 burglaries will be NA)
fishnet <- fishnet %>%
  left_join(burglaries_fishnet, by = "uniqueID") %>%
  mutate(countBurglaries = replace_na(countBurglaries, 0))

# Summary statistics
cat("\nBurglary count distribution:\n")

Burglary count distribution:
Code
summary(fishnet$countBurglaries)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   2.000   3.042   5.000  40.000 
Code
cat("\nCells with zero burglaries:", 
    sum(fishnet$countBurglaries == 0), 
    "/", nrow(fishnet),
    "(", round(100 * sum(fishnet$countBurglaries == 0) / nrow(fishnet), 1), "%)\n")

Cells with zero burglaries: 781 / 2458 ( 31.8 %)

2.3 | Visualize Aggregated Count Distribution

Code
# Visualize aggregated counts
ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_c(
    name = "Burglaries",
    option = "plasma",
    trans = "sqrt",  # Square root for better visualization of skewed data
    breaks = c(0, 1, 5, 10, 20, 40)
  ) +
  labs(
    title = "Burglary Counts by Grid Cell",
    subtitle = "500m x 500m cells, Chicago 2017"
  ) +
  theme_crime()

Note: Observed Patterns
  • This map is consistent with the summary statistics above: 31.8% of cells recorded zero burglaries in 2017 (shown in dark purple)
  • Burglary incidents are concentrated in clusters across Chicago’s South and West Sides, where many grid cells show counts above 20 burglaries
  • There are clear hotspots of elevated burglary risk separated by lower-activity zones; this will be useful in Part 4 when we calculate spatial measures such as kNN distance, LISA (Local Moran’s I), and distance to hot spots

Part 3: Create a Kernel Density Estimate (KDE) Baseline

  • KDE represents our null hypothesis: burglaries happen where they happened before, with no other information
  • In this context, KDE centers a smooth kernel (a curve) atop each crime point, highest directly over the point and falling toward zero at the edge of a circular search radius, i.e. the bandwidth (the same operation known as kernel density in ArcGIS); the formula below makes this precise
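Concretely, the estimated intensity at any location \(s\) is the sum of kernel contributions from all \(n\) burglary points (density.ppp defaults to an isotropic Gaussian kernel, and the edge correction rescales the estimate near the boundary):

\[
\hat{\lambda}(s) = \sum_{i=1}^{n} \kappa_{\sigma}(s - s_i)
\]

where the \(s_i\) are the burglary locations and \(\sigma\) is the bandwidth (1,000 m below); a larger \(\sigma\) yields a smoother surface.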
Code
# Convert burglaries to ppp (point pattern) format for spatstat
burglaries_ppp <- as.ppp(
  st_coordinates(burglaries),
  W = as.owin(st_bbox(chicagoBoundary))
)

# Calculate KDE with 1km bandwidth
kde_burglaries <- density.ppp(
  burglaries_ppp,
  sigma = 1000,  # 1km bandwidth
  edge = TRUE    # Edge correction
)

# Convert to terra raster
kde_raster <- rast(kde_burglaries)

# Extract KDE values to fishnet cells
fishnet <- fishnet %>%
  mutate(
    kde_value = terra::extract(
      kde_raster,
      vect(fishnet),
      fun = mean,
      na.rm = TRUE
    )[, 2]  # Extract just the values column
  )

cat("✓ Calculated KDE baseline\n")
✓ Calculated KDE baseline
Code
ggplot() +
  geom_sf(data = fishnet, aes(fill = kde_value), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_c(
    name = "KDE Value",
    option = "plasma"
  ) +
  labs(
    title = "Kernel Density Estimation Baseline",
    subtitle = "Simple spatial smoothing of burglary locations"
  ) +
  theme_crime()

Part 4: Create Spatial Predictor Variables

Now we’ll create features that might help predict burglaries, following “broken windows theory” logic: visible signs of disorder predict crime. In this lab report, I focus on 311 calls reporting vacant and abandoned buildings. The data source is the Chicago 311 Service Requests dataset, which I accessed through https://data.cityofchicago.org/stories/s/311-Dataset-Changes-12-11-2018/d7nq-5g7t

In this section, I first load, filter, and clean the “Vacant and Abandoned Buildings Reported” dataset. While this preliminary report does not use individual columns, variables such as vacant_due_to_fire or ppl_using_prop could feed a more complex random forest model to better predict crime risk.

Next, I calculate several spatial measures, each capturing a different aspect of spatial proximity to this indicator of burglary risk.

  • Count → How much disorder is HERE?
  • k-NN Distance → How CLOSE are we to disorder?
  • Hot Spots (LISA) → Where does disorder CLUSTER?
  • Distance to Hot Spots → How close to concentrated disorder?

4.1 | Load 311 Vacant and Abandoned Buildings Reported

Code
vacant_abandoned_bldg <- read_csv("data/vacant_abandoned_buildings.csv") |> 
  filter(!is.na(LATITUDE), !is.na(LONGITUDE))
Code
# clean and rename column names
vacant_abandoned_bldg_2017 <- vacant_abandoned_bldg |>
  janitor::clean_names() |> 
  rename(
    request_type           = service_request_type,
    request_num            = service_request_number,
    date_received          = date_service_request_was_received,
    location_on_lot        = 
      location_of_building_on_the_lot_if_garage_change_type_code_to_bgd,
    dangerous_hazardous    = is_the_building_dangerous_or_hazardous,
    open_or_boarded        = is_building_open_or_boarded,
    entry_point            = if_the_building_is_open_where_is_the_entry_point,
    vacant_or_occupied     = is_the_building_currently_vacant_or_occupied,
    vacant_due_to_fire     = is_the_building_vacant_due_to_fire,
    # note typo in original dataset - childen instead of children
    ppl_using_prop         = any_people_using_property_homeless_childen_gangs,
    addr_st_num            = address_street_number,
    addr_st_direction      = address_street_direction,
    addr_st_name           = address_street_name,
    addr_st_suffix         = address_street_suffix,
    zip                    = zip_code,
    x                      = x_coordinate,
    y                      = y_coordinate,
    ward                   = ward,
    police_dist            = police_district,
    community_area         = community_area,
    lat                    = latitude,
    lon                    = longitude,
    location               = location
  ) |>
   # filter vacant and abandoned buildings reported in 2017
  mutate(
    date_received = as.Date(date_received, format = "%m/%d/%Y"),
    month = as.integer(format(date_received, "%m")),
    day   = as.integer(format(date_received, "%d")),
    year  = as.integer(format(date_received, "%Y"))
  ) |>
  filter(year == 2017)
Code
# filter NA values and select relevant columns
vacant_abandoned_bldg_2017 <- vacant_abandoned_bldg_2017 |> 
  dplyr::mutate(
    addr_full = str_squish( # combine address variables
      paste(addr_st_num, addr_st_direction, addr_st_name, addr_st_suffix))
  )

keep_cols <- c(
  "request_num","year","month", "open_or_boarded","vacant_or_occupied",
  "vacant_due_to_fire","ppl_using_prop", "addr_full", "zip","x","y","ward",
  "police_dist","community_area","lat","lon","location")

vacant_abandoned_bldg_2017 <- vacant_abandoned_bldg_2017 |> 
  dplyr::select(any_of(keep_cols)) |> 
  dplyr::filter(!is.na(vacant_or_occupied))

vacant_abandoned_bldg_2017
# A tibble: 3,147 × 17
   request_num  year month open_or_boarded vacant_or_occupied vacant_due_to_fire
   <chr>       <int> <int> <chr>           <chr>              <lgl>             
 1 17-08691267  2017    12 Open            Vacant             FALSE             
 2 17-08681436  2017    12 Open            Vacant             FALSE             
 3 17-08682637  2017    12 Boarded         Vacant             FALSE             
 4 17-08663198  2017    12 Open            Vacant             FALSE             
 5 17-08663448  2017    12 Open            Vacant             FALSE             
 6 17-08656967  2017    12 Boarded         Vacant             FALSE             
 7 17-08674767  2017    12 Open            Vacant             FALSE             
 8 17-08659803  2017    12 Open            Vacant             FALSE             
 9 17-08638267  2017    12 Open            Vacant             FALSE             
10 17-08642801  2017    12 Open            Vacant             FALSE             
# ℹ 3,137 more rows
# ℹ 11 more variables: ppl_using_prop <lgl>, addr_full <chr>, zip <dbl>,
#   x <dbl>, y <dbl>, ward <dbl>, police_dist <dbl>, community_area <dbl>,
#   lat <dbl>, lon <dbl>, location <chr>
Code
# Set CRS
abandoned_bldg <- vacant_abandoned_bldg_2017 |> 
  st_as_sf(coords = c("lon", "lat"), crs = 4326) |>  
  st_transform('ESRI:102271') # Set CRS

cat("✓ Loaded vacant/abandoned building calls\n")
✓ Loaded vacant/abandoned building calls
Code
cat("  - Number of calls:", nrow(vacant_abandoned_bldg_2017), "\n")
  - Number of calls: 3147 
Note: A Note of Caution
  • The 311 reporting system in Chicago may be biased because it relies on residents to voluntarily report issues, meaning that neighborhoods with higher civic engagement or stronger trust in city services are more likely to generate reports.
  • As a result, 311 data in Chicago may underrepresent conditions in lower-income or marginalized communities, where residents may be less aware of or less inclined to call the hotline.

4.2 | Count of Reported Vacant/Abandoned Buildings per Cell

Code
# Gut Check Visualization of Vacant/Abandoned Buildings
ggplot() +
  geom_sf(data = chicagoBoundary) +
  geom_sf(data = abandoned_bldg, aes(color = open_or_boarded), 
          alpha = 0.7, size = 0.7) +
  labs(title = "Location of Vacant/Abandoned Buildings 311 Calls") +
  theme_void()

Code
# Aggregate abandoned building calls to fishnet
abandoned_fishnet <- st_join(abandoned_bldg, fishnet, 
                             join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(abandoned_bldg = n())

# Join to fishnet
fishnet <- fishnet %>%
  left_join(abandoned_fishnet, by = "uniqueID") %>%
  mutate(abandoned_bldg = replace_na(abandoned_bldg, 0))

cat("Vacant and Abandoned Building distribution:\n")
Vacant and Abandoned Building distribution:
Code
summary(fishnet$abandoned_bldg)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.00    0.00    1.28    1.00   21.00 
Code
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = abandoned_bldg), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "magma") +
  labs(title = "Vacant/Abandoned Building 311 Calls") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "plasma") +
  labs(title = "Burglaries") +
  theme_crime()

p1 + p2 +
  plot_annotation(
    title = "Are vacant/abandoned buildings and burglaries correlated?")

4.3 | k-Nearest Neighbors (kNN) Features

Count in a cell is one measure. Distance to the 5 nearest vacant and abandoned buildings reported through 311 calls captures additional local context that is not limited by administrative boundaries such as census tracts, as formalized below.
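For each grid-cell centroid \(c_i\), the feature is the average distance to the \(k = 5\) nearest reported buildings:

\[
d_i^{(k)} = \frac{1}{k} \sum_{j=1}^{k} \operatorname{dist}(c_i, v_{(j)})
\]

where \(v_{(1)}, \dots, v_{(k)}\) are the \(k\) closest vacant-building reports to \(c_i\). Averaging over several neighbors makes the measure less sensitive to a single nearby report than a plain nearest-neighbor distance.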

Code
# Calculate mean distance to 5 nearest vacant and abandoned buildings
# (Do this OUTSIDE of mutate to avoid sf conflicts)

# Get coordinates
fishnet_coords <- st_coordinates(st_centroid(fishnet))
abandoned_coords <- st_coordinates(abandoned_bldg)

# Calculate k nearest neighbors and distances
nn_result <- get.knnx(abandoned_coords, fishnet_coords, k = 5)

# Add to fishnet
fishnet <- fishnet %>%
  mutate(
    abandoned_bldg.nn = rowMeans(nn_result$nn.dist)
  )

cat("✓ Calculated nearest neighbor distances\n")
✓ Calculated nearest neighbor distances
Code
summary(fishnet$abandoned_bldg.nn)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  29.95  386.48  848.67 1011.75 1431.03 5313.74 

4.4 | Local Moran’s I; Identify Hotspots; Distance to Hot Spots

Let’s identify clusters of abandoned buildings using Local Moran’s I, then calculate distance to these hot spots.
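As implemented here via spdep::localmoran(), one common formulation of the statistic for cell \(i\) is:

\[
I_i = \frac{x_i - \bar{x}}{m_2} \sum_{j} w_{ij}\,(x_j - \bar{x}), \qquad m_2 = \frac{1}{n}\sum_{k=1}^{n} (x_k - \bar{x})^2
\]

where the \(w_{ij}\) are row-standardized weights over each cell’s \(k = 5\) nearest neighbors (implementations differ slightly in whether \(m_2\) uses \(n\) or \(n - 1\)). A positive \(I_i\) means cell \(i\) resembles its neighbors (both high or both low); a negative \(I_i\) flags spatial outliers.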

Code
# Function to calculate Local Moran's I
calculate_local_morans <- function(data, variable, k = 5) {
  
  # Create spatial weights
  coords <- st_coordinates(st_centroid(data))
  neighbors <- knn2nb(knearneigh(coords, k = k))
  weights <- nb2listw(neighbors, style = "W", zero.policy = TRUE)
  
  # Calculate Local Moran's I
  local_moran <- localmoran(data[[variable]], weights)
  
  # Classify clusters
  mean_val <- mean(data[[variable]], na.rm = TRUE)
  
  data %>%
    mutate(
      local_i = local_moran[, 1],
      p_value = local_moran[, 5],
      is_significant = p_value < 0.05,
      
      moran_class = case_when(
        !is_significant ~ "Not Significant",
        local_i > 0 & .data[[variable]] > mean_val ~ "High-High",
        local_i > 0 & .data[[variable]] <= mean_val ~ "Low-Low",
        local_i < 0 & .data[[variable]] > mean_val ~ "High-Low",
        local_i < 0 & .data[[variable]] <= mean_val ~ "Low-High",
        TRUE ~ "Not Significant"
      )
    )
}

# Apply to vacant and abandoned buildings
fishnet <- calculate_local_morans(fishnet, "abandoned_bldg", k = 5)
Code
# Visualize hot spots
ggplot() +
  geom_sf(
    data = fishnet, 
    aes(fill = moran_class), 
    color = NA
  ) +
  scale_fill_manual(
    values = c(
      "High-High" = "#d7191c",
      "High-Low" = "#fdae61",
      "Low-High" = "#abd9e9",
      "Low-Low" = "#2c7bb6",
      "Not Significant" = "gray90"
    ),
    name = "Cluster Type"
  ) +
  labs(
    title = "Local Moran's I: Vacant and Abandoned Building Clusters",
    subtitle = "High-High = Hot spots of disorder calls in Chicago"
  ) +
  theme_crime()

Code
# Get centroids of "High-High" cells (hot spots)
hotspots <- fishnet %>%
  filter(moran_class == "High-High") %>%
  st_centroid()

# Calculate distance from each cell to nearest hot spot
if (nrow(hotspots) > 0) {
  fishnet <- fishnet %>%
    mutate(
      dist_to_hotspot = as.numeric(
        st_distance(st_centroid(fishnet), hotspots %>% st_union())
      )
    )
  
  cat("✓ Calculated distance to vacant and abandoned building hot spots\n")
  cat("  - Number of hot spot cells:", nrow(hotspots), "\n")
} else {
  fishnet <- fishnet %>%
    mutate(dist_to_hotspot = 0)
  cat("⚠ No significant hot spots found\n")
}
✓ Calculated distance to vacant and abandoned building hot spots
  - Number of hot spot cells: 272 
Note: Observed Patterns
  • Distribution of Vacant/Abandoned Building 311 Calls: The majority of reported properties are open rather than boarded, while boarded properties are relatively sparse and scattered
  • Count of Vacant/Abandoned Building 311 Calls: The distribution is highly skewed, with most grid cells containing zero or one vacant property. Notably, the median is 0 while the maximum is 21, indicating a few extreme hotspots of concentrated vacancy, which the subsequent maps support.
  • Map of Vacant/Abandoned Building 311 Calls and Burglaries: 311 calls for vacant buildings (left panel) are especially dense in a few concentrated areas, while burglary counts (right panel) are more spatially dispersed but still overlap with several high-vacancy zones
  • Local Moran’s I: High–High hotspot clusters (in red) indicate statistically significant concentrations of vacant or abandoned buildings, primarily located on the South and West Sides of Chicago

Part 5: Join Police Districts for Cross-Validation

We’ll use the Chicago police districts for our spatial cross-validation. Note that joining with st_within keeps only cells that fall entirely within a single district, so cells straddling district boundaries (or falling outside any district) are dropped, shrinking the fishnet from 2,458 to 1,708 cells.

Code
# Join district information to fishnet
fishnet <- st_join(
  fishnet,
  policeDistricts,
  join = st_within,
  left = TRUE
) %>%
  filter(!is.na(District))  # Remove cells outside districts

cat("✓ Joined police districts\n")
✓ Joined police districts
Code
cat("  - Districts:", length(unique(fishnet$District)), "\n")
  - Districts: 22 
Code
cat("  - Cells:", nrow(fishnet), "\n")
  - Cells: 1708 

Part 6: Count Regression Model Fitting

6.1 | Poisson Regression

Poisson regression is used instead of OLS because crime counts are discrete (non-negative integers) and right-skewed with many zeros (around 32% of fishnet cells recorded no burglaries in 2017), properties that OLS handles poorly but that Poisson regression is designed for.
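The model assumes each cell’s count follows a Poisson distribution whose log-mean is linear in the predictors:

\[
Y_i \sim \mathrm{Poisson}(\lambda_i), \qquad \log(\lambda_i) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}
\]

The log link guarantees non-negative predicted counts, and each \(e^{\beta_j}\) can be read as a multiplicative rate ratio. The catch is the built-in restriction \(\mathrm{Var}(Y_i) = \lambda_i\), which we test in 6.2.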

Code
# Create clean modeling dataset
fishnet_model <- fishnet %>%
  st_drop_geometry() %>%
  dplyr::select(
    uniqueID,
    District,
    countBurglaries,
    abandoned_bldg,
    abandoned_bldg.nn,
    dist_to_hotspot
  ) %>%
  na.omit()  # Remove any remaining NAs

cat("✓ Prepared modeling data\n")
✓ Prepared modeling data
Code
cat("  - Observations:", nrow(fishnet_model), "\n")
  - Observations: 1708 
Code
cat("  - Variables:", ncol(fishnet_model), "\n")
  - Variables: 6 
Code
# Fit Poisson regression
model_poisson <- glm(
  countBurglaries ~ abandoned_bldg + abandoned_bldg.nn + 
    dist_to_hotspot,
  data = fishnet_model,
  family = "poisson"
)

# Summary
summary(model_poisson)

Call:
glm(formula = countBurglaries ~ abandoned_bldg + abandoned_bldg.nn + 
    dist_to_hotspot, family = "poisson", data = fishnet_model)

Coefficients:
                     Estimate  Std. Error z value             Pr(>|z|)    
(Intercept)        1.68028245  0.03010148  55.821 < 0.0000000000000002 ***
abandoned_bldg     0.04345861  0.00397702  10.927 < 0.0000000000000002 ***
abandoned_bldg.nn -0.00081740  0.00004497 -18.177 < 0.0000000000000002 ***
dist_to_hotspot    0.00002829  0.00000713   3.968            0.0000726 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 6710.3  on 1707  degrees of freedom
Residual deviance: 5334.6  on 1704  degrees of freedom
AIC: 9403

Number of Fisher Scoring iterations: 6

6.2 | Check for Overdispersion

  • Poisson regression assumes the variance equals the mean
  • Real crime count data often violate this assumption because the variance is substantially greater than the mean (i.e. overdispersion); the Pearson dispersion statistic defined below quantifies this
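That dispersion statistic is

\[
\hat{\phi} = \frac{\sum_{i=1}^{n} r_{P,i}^{2}}{n - p}, \qquad r_{P,i} = \frac{y_i - \hat{\lambda}_i}{\sqrt{\hat{\lambda}_i}}
\]

where the \(r_{P,i}\) are Pearson residuals and \(n - p\) is the residual degrees of freedom, exactly what the code below computes. A well-specified Poisson model gives \(\hat{\phi} \approx 1\); values well above 1 indicate overdispersion.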
Code
# Calculate dispersion parameter
dispersion <- sum(residuals(model_poisson, type = "pearson")^2) / 
              model_poisson$df.residual

cat("Dispersion parameter:", round(dispersion, 2), "\n")
Dispersion parameter: 3.38 
Code
cat("Rule of thumb: >1.5 suggests overdispersion\n")
Rule of thumb: >1.5 suggests overdispersion
Code
if (dispersion > 1.5) {
  cat("⚠ Overdispersion detected! Consider Negative Binomial model.\n")
} else {
  cat("✓ Dispersion looks okay for Poisson model.\n")
}
⚠ Overdispersion detected! Consider Negative Binomial model.

6.3 | Negative Binomial Regression

Since overdispersion has been detected, we use Negative Binomial regression instead, which is more flexible than Poisson regression because it relaxes the variance = mean assumption.
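Specifically, glm.nb() fits the NB2 parameterization, which adds a dispersion parameter \(\theta\) so the variance can exceed the mean:

\[
\mathrm{Var}(Y_i) = \mu_i + \frac{\mu_i^{2}}{\theta}
\]

As \(\theta \to \infty\), the Poisson model is recovered; the small estimated \(\theta \approx 1.42\) reported below is itself evidence of substantial overdispersion.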

Code
# Fit Negative Binomial model
model_nb <- glm.nb(
  countBurglaries ~ abandoned_bldg + abandoned_bldg.nn + 
    dist_to_hotspot,
  data = fishnet_model
)
summary(model_nb)

Call:
glm.nb(formula = countBurglaries ~ abandoned_bldg + abandoned_bldg.nn + 
    dist_to_hotspot, data = fishnet_model, init.theta = 1.419813092, 
    link = log)

Coefficients:
                     Estimate  Std. Error z value             Pr(>|z|)    
(Intercept)        1.66189112  0.05722613  29.041 < 0.0000000000000002 ***
abandoned_bldg     0.04931960  0.00936161   5.268          0.000000138 ***
abandoned_bldg.nn -0.00088040  0.00007166 -12.285 < 0.0000000000000002 ***
dist_to_hotspot    0.00004581  0.00001194   3.837             0.000124 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1.4198) family taken to be 1)

    Null deviance: 2374.4  on 1707  degrees of freedom
Residual deviance: 1957.2  on 1704  degrees of freedom
AIC: 7803.4

Number of Fisher Scoring iterations: 1

              Theta:  1.4198 
          Std. Err.:  0.0799 

 2 x log-likelihood:  -7793.3700 

6.4 | Compare Model Fit (AIC)

  • AIC stands for Akaike Information Criterion, a common metric for comparing the fit of non-linear regression models (i.e. AIC is used where R-squared does not apply)
  • For our purposes, AIC serves as a quick confirmation that the Negative Binomial model fits better than the Poisson model because of its lower AIC (7803.4 vs. 9403)
  • AIC is relative, not absolute: it is only meaningful when comparing multiple models fit to the same dataset
  • AIC is defined as \[\mathrm{AIC} = -2\ln(\hat{L}) + 2k\] where \(\ln(\hat{L})\) is the maximized log-likelihood (higher is better) and \(k\) is the number of estimated parameters, including the intercept; the sketch below reproduces the reported values
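As a quick sanity check, the reported AIC values can be reproduced from each model’s log-likelihood. A minimal sketch, assuming model_poisson and model_nb from Part 6 are still in memory; note that for glm.nb the parameter count includes \(\theta\) (4 coefficients + \(\theta\) = 5).

Code
# Reconstruct AIC by hand: AIC = -2 * logLik + 2k
manual_aic <- function(model) {
  ll <- as.numeric(logLik(model))   # maximized log-likelihood
  k  <- attr(logLik(model), "df")   # number of estimated parameters
  -2 * ll + 2 * k
}

manual_aic(model_poisson)  # should match AIC(model_poisson), ~9403
manual_aic(model_nb)       # should match AIC(model_nb), ~7803.4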
Code
# Compare AIC (lower is better)
cat("\nModel Comparison:\n")

Model Comparison:
Code
cat("Poisson AIC:", round(AIC(model_poisson), 1), "\n")
Poisson AIC: 9403 
Code
cat("Negative Binomial AIC:", round(AIC(model_nb), 1), "\n")
Negative Binomial AIC: 7803.4 
Note: Optimal Model Fit

As expected given the overdispersion, Negative Binomial regression has a lower AIC than Poisson regression and provides a more suitable fit for the underlying crime data.

Part 7: Spatial LOGO-CV for 2017 Data

  • Leave-One-Group-Out (LOGO) Cross-Validation trains on all districts except one, then tests on the held-out district.
  • LOGO-CV is used here instead of k-fold CV or a validation-set approach to prevent spatial leakage: with randomly assigned folds, cells adjacent to a test cell could land in the training set and leak local information. The error metrics used to score each fold are defined below.
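\[
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert, \qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
\]

Both are computed over the \(n\) cells of the held-out district; RMSE penalizes large misses more heavily than MAE.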
Code
# Get unique districts
districts <- unique(fishnet_model$District)
cv_results <- tibble()

cat("Running LOGO Cross-Validation...\n")
Running LOGO Cross-Validation...
Code
for (i in seq_along(districts)) {
  test_district <- districts[i]
  
  # Split data
  train_data <- fishnet_model %>% filter(District != test_district)
  test_data <- fishnet_model %>% filter(District == test_district)
  
  # Fit model on training data
  model_cv <- glm.nb(
    countBurglaries ~ abandoned_bldg + abandoned_bldg.nn + 
      dist_to_hotspot,
    data = train_data
  )
  
  # Predict on test data
  test_data <- test_data %>%
    mutate(
      prediction = predict(model_cv, test_data, type = "response")
    )
  
  # Calculate metrics
  mae <- mean(abs(test_data$countBurglaries - test_data$prediction))
  rmse <- sqrt(mean((test_data$countBurglaries - test_data$prediction)^2))
  
  # Store results
  cv_results <- bind_rows(
    cv_results,
    tibble(
      fold = i,
      test_district = test_district,
      n_test = nrow(test_data),
      mae = mae,
      rmse = rmse
    )
  )
  
  cat("  Fold", i, "/", length(districts), "- District", test_district, 
      "- MAE:", round(mae, 2), "\n")
}
  Fold 1 / 22 - District 5 - MAE: 2.6 
  Fold 2 / 22 - District 4 - MAE: 2.87 
  Fold 3 / 22 - District 22 - MAE: 2.1 
  Fold 4 / 22 - District 6 - MAE: 2.99 
  Fold 5 / 22 - District 8 - MAE: 2.51 
  Fold 6 / 22 - District 7 - MAE: 3.57 
  Fold 7 / 22 - District 3 - MAE: 5.54 
  Fold 8 / 22 - District 2 - MAE: 2.77 
  Fold 9 / 22 - District 9 - MAE: 2.54 
  Fold 10 / 22 - District 10 - MAE: 2.14 
  Fold 11 / 22 - District 1 - MAE: 2.41 
  Fold 12 / 22 - District 12 - MAE: 3.63 
  Fold 13 / 22 - District 15 - MAE: 2.26 
  Fold 14 / 22 - District 11 - MAE: 3.03 
  Fold 15 / 22 - District 18 - MAE: 2.45 
  Fold 16 / 22 - District 25 - MAE: 2.63 
  Fold 17 / 22 - District 14 - MAE: 3.45 
  Fold 18 / 22 - District 19 - MAE: 2.36 
  Fold 19 / 22 - District 16 - MAE: 1.83 
  Fold 20 / 22 - District 17 - MAE: 1.75 
  Fold 21 / 22 - District 20 - MAE: 1.58 
  Fold 22 / 22 - District 24 - MAE: 2.13 
Code
# Overall results
cat("\n✓ Cross-Validation Complete\n")

✓ Cross-Validation Complete
Code
cat("Mean MAE:", round(mean(cv_results$mae), 2), "\n")
Mean MAE: 2.69 
Code
cat("Mean RMSE:", round(mean(cv_results$rmse), 2), "\n")
Mean RMSE: 3.52 
Code
# Show results
cv_results %>%
  arrange(desc(mae)) %>%
  kable(
    digits = 2,
    caption = "LOGO CV Results by District"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
LOGO CV Results by District
fold test_district n_test mae rmse
7 3 43 5.54 7.24
12 12 73 3.63 5.08
6 7 52 3.57 4.41
17 14 46 3.45 4.93
14 11 43 3.03 3.43
4 6 63 2.99 4.16
2 4 235 2.87 3.92
8 2 56 2.77 3.47
16 25 85 2.63 3.67
1 5 98 2.60 2.97
9 9 107 2.54 2.97
5 8 197 2.51 3.55
15 18 30 2.45 4.15
11 1 28 2.41 3.06
18 19 63 2.36 3.07
13 15 32 2.26 2.81
10 10 63 2.14 2.80
22 24 41 2.13 3.13
3 22 112 2.10 2.66
19 16 129 1.83 2.11
20 17 82 1.75 2.11
21 20 30 1.58 1.85

Part 8: Model Predictions and Comparison

8.1 | Generate Final Predictions

Code
# Fit final model on all data
final_model <- glm.nb(
  countBurglaries ~ abandoned_bldg + abandoned_bldg.nn + 
    dist_to_hotspot,
  data = fishnet_model
)

# Add predictions back to fishnet
fishnet <- fishnet %>%
  mutate(
    prediction_nb = predict(final_model, fishnet_model, type = "response")[match(uniqueID, fishnet_model$uniqueID)]
  )

# Also add KDE predictions (normalize to same scale as counts)
kde_sum <- sum(fishnet$kde_value, na.rm = TRUE)
count_sum <- sum(fishnet$countBurglaries, na.rm = TRUE)
fishnet <- fishnet %>%
  mutate(
    prediction_kde = (kde_value / kde_sum) * count_sum
  )
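The rescaling above reallocates the citywide burglary total in proportion to each cell’s KDE value, putting the baseline on the same count scale as the model’s predictions:

\[
\hat{y}_i^{\mathrm{KDE}} = \frac{k_i}{\sum_j k_j} \sum_j y_j
\]

where \(k_i\) is the mean KDE value in cell \(i\) and the \(y_j\) are the observed counts.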

8.2 | Compare Model vs. KDE Baseline

Code
# Create three maps for comparison of models
# (shared limits of 0-15 for comparability; cells above 15 render as NA/gray)
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "plasma", limits = c(0, 15)) +
  labs(title = "Actual Burglaries") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_nb), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
  labs(title = "Model Predictions (Neg. Binomial)") +
  theme_crime()

p3 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_kde), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
  labs(title = "KDE Baseline Predictions") +
  theme_crime()

p1 + p2 + p3 +
  plot_annotation(
    title = "Actual vs. Predicted Burglaries",
    subtitle = "Does our complex model outperform simple KDE?"
  )

Code
# Calculate performance metrics
comparison <- fishnet %>%
  st_drop_geometry() %>%
  filter(!is.na(prediction_nb), !is.na(prediction_kde)) %>%
  summarize(
    model_mae = mean(abs(countBurglaries - prediction_nb)),
    model_rmse = sqrt(mean((countBurglaries - prediction_nb)^2)),
    kde_mae = mean(abs(countBurglaries - prediction_kde)),
    kde_rmse = sqrt(mean((countBurglaries - prediction_kde)^2))
  )

comparison %>%
  pivot_longer(everything(), names_to = "metric", values_to = "value") %>%
  separate(metric, into = c("approach", "metric"), sep = "_") %>%
  pivot_wider(names_from = metric, values_from = value) %>%
  kable(
    digits = 2,
    caption = "Model Performance Comparison"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Model Performance Comparison
approach mae rmse
model 2.53 3.49
kde 2.06 2.95

8.3 | Where Does the Model Work Well?

Code
# Calculate errors
fishnet <- fishnet %>%
  mutate(
    error_nb = countBurglaries - prediction_nb,
    error_kde = countBurglaries - prediction_kde,
    abs_error_nb = abs(error_nb),
    abs_error_kde = abs(error_kde)
  )

# Map errors
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = error_nb), color = NA) +
  scale_fill_gradient2(
    name = "Error",
    low = "#2166ac", mid = "white", high = "#b2182b",
    midpoint = 0,
    limits = c(-10, 10)
  ) +
  labs(title = "Model Errors (Actual - Predicted)") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = abs_error_nb), color = NA) +
  scale_fill_viridis_c(name = "Abs. Error", option = "magma") +
  labs(title = "Absolute Model Errors") +
  theme_crime()

p1 + p2

Important: Assessing Model Performance
  • Spatial pattern of residuals: The left panel shows both overpredictions (in blue) and underpredictions (in red), suggesting that the model’s performance varies systematically across neighborhoods rather than being random.

  • Magnitude of errors: The right panel shows absolute errors concentrated in similar regions, confirming that prediction accuracy is uneven across the city, with some fishnet grid cells exhibiting substantially higher deviations between observed and predicted values.

  • Error clustering: Clusters of higher errors appear on the South and West Sides, which is also where 2017 crime counts cluster, as shown in earlier maps. This is a key limitation of the model, possibly due to unobserved local factors or data quality issues in these areas.

Part 9: Summary Statistics

Code
# Create nice summary table
model_summary <- broom::tidy(final_model, exponentiate = TRUE) %>%
  mutate(
    across(where(is.numeric), ~round(., 3))
  )

model_summary %>%
  kable(
    caption = "Final Negative Binomial Model Coefficients (Exponentiated)",
    col.names = c("Variable", "Rate Ratio", "Std. Error", "Z", "P-Value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  footnote(
    general = "Rate ratios > 1 indicate positive association with burglary counts."
  )
Final Negative Binomial Model Coefficients (Exponentiated)
Variable Rate Ratio Std. Error Z P-Value
(Intercept) 5.269 0.057 29.041 0
abandoned_bldg 1.051 0.009 5.268 0
abandoned_bldg.nn 0.999 0.000 -12.285 0
dist_to_hotspot 1.000 0.000 3.837 0
Note:
Rate ratios > 1 indicate positive association with burglary counts.
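To make the rate ratios concrete, a minimal check (assuming final_model from 8.1 is still in memory) exponentiates the raw log-scale coefficients:

Code
# Rate ratios by hand: exponentiate the log-scale coefficients
exp(coef(final_model))
# e.g. abandoned_bldg: exp(0.04932) ~ 1.051, so each additional
# vacant-building call in a cell is associated with roughly 5% more
# expected burglaries, holding the other predictors constant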

Key Findings Checklist

Technical Performance:

  • Cross-validation MAE: 2.69
  • Model vs. KDE: the KDE baseline outperformed the final Negative Binomial model, with lower MAE (2.06 vs. 2.53) and lower RMSE (2.95 vs. 3.49)

Spatial Patterns:

  • Burglaries are spatially clustered, as clearly visualized in the KDE baseline
  • Hot spots are largely located in the West Side and South Side neighborhoods

Model Limitations:

  • Overdispersion: the dispersion parameter of 3.38 exceeds the rule-of-thumb threshold of 1.5, indicating overdispersion under the Poisson model; a Negative Binomial model is therefore used instead
  • Cells with zero counts: 781 out of 2,458 cells had zero burglaries (31.8%)