Spatial Predictive Analysis

Building a spatial predictive model using a 311 service request type

Author

Christine

Published

December 10, 2025

About This Exercise

In this exercise, I will build a spatial predictive model for burglaries using count regression and spatial features.

Setup

Code

# Load required packages
library(tidyverse)      # Data manipulation
library(sf)             # Spatial operations
library(here)           # Relative file paths
library(viridis)        # Color scales
library(terra)          # Raster operations (replaces 'raster')
library(spdep)          # Spatial dependence
library(FNN)            # Fast nearest neighbors
library(MASS)           # Negative binomial regression
library(patchwork)      # Plot composition (replaces grid/gridExtra)
library(knitr)          # Tables
library(kableExtra)     # Table formatting
library(classInt)       # Classification intervals
library(here)
library(tidycensus)

# Spatstat split into sub-packages
library(spatstat.geom)    # Spatial geometries
library(spatstat.explore) # Spatial exploration/KDE

# Set options
options(scipen = 999)  # No scientific notation
set.seed(5080)         # Reproducibility

# Create consistent theme for visualizations
theme_crime <- function(base_size = 11) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold", size = base_size + 1),
      plot.subtitle = element_text(color = "gray30", size = base_size - 1),
      legend.position = "right",
      panel.grid.minor = element_blank(),
      axis.text = element_blank(),
      axis.title = element_blank()
    )
}

# Set as default
theme_set(theme_crime())

Part 1: Data Loading & Exploration

This initial phase establishes the foundational dataset for our spatial analysis of burglary patterns in Chicago. The process systematically assembles and prepares multiple data sources: administrative boundaries (police districts, beats, and city limits) provide the spatial framework; burglary incidents serve as the primary outcome variable; and street light outage reports function as a key environmental predictor. Each dataset undergoes rigorous processing including coordinate system standardization, temporal filtering to 2017, and geometric validation.

1.1: Load Chicago Spatial Data

Code

# Load police districts (used for spatial cross-validation)
policeDistricts <- 
  st_read("https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = dist_num)

Reading layer `OGRGeoJSON' from data source 
  `https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON' 
  using driver `GeoJSON'
Simple feature collection with 25 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
Geodetic CRS:  WGS 84

Code

# Load police beats (smaller administrative units)
policeBeats <- 
  st_read("https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(Beat = beat_num)

Reading layer `OGRGeoJSON' from data source 
  `https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON' 
  using driver `GeoJSON'
Simple feature collection with 277 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
Geodetic CRS:  WGS 84

Code

# Load Chicago boundary
chicagoBoundary <- 
  st_read("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson") %>%
  st_transform('ESRI:102271')

Reading layer `chicagoBoundary' from data source 
  `https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson' 
  using driver `GeoJSON'
Simple feature collection with 1 feature and 1 field
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -87.8367 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
Geodetic CRS:  WGS 84

1.2: Load Burglary Data

Code

# Load from provided data file (downloaded from Chicago open data portal)
burglaries <- st_read("D:/MUSA5080PPA/portfolio-setup-ChristineCui12/labs/lab_4/data/burglaries.shp") %>% 
  st_transform('ESRI:102271')

Reading layer `burglaries' from data source 
  `D:\MUSA5080PPA\portfolio-setup-ChristineCui12\labs\lab_4\data\burglaries.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 7482 features and 22 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 340492 ymin: 552959.6 xmax: 367153.5 ymax: 594815.1
Projected CRS: NAD83(HARN) / Illinois East

1.3: Load 311 Street Lights Out Calls

Code

lights_out <- read_csv("D:/MUSA5080PPA/portfolio-setup-ChristineCui12/labs/lab_4/data/Alley_Lights_Out_2025.csv")%>%
  mutate(date_received = mdy(`Creation Date`)) %>%
  filter(!is.na(Latitude), !is.na(Longitude), year(date_received) == 2017) %>%
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326) %>%
  st_transform('ESRI:102271')

The selection of “Alley Lights Out” service requests as our predictor variable is grounded in both theoretical foundations and practical analytical considerations. From an environmental criminology perspective, inadequate alley illumination directly impacts crime opportunities by reducing natural surveillance and creating concealment spaces, aligning with CPTED principles. This violation type differs significantly from the classroom example of abandoned vehicles, representing a distinct category of infrastructure issues with unique spatial characteristics. Furthermore, alley lighting problems typically exhibit temporal stability due to municipal maintenance cycles, providing reliable spatial patterns for predictive modeling.

For our analytical framework, this dataset offers several advantages. The persistent nature of lighting infrastructure ensures strong spatial autocorrelation, making it a robust predictor at the grid-cell level. The geographic specificity of alley locations enables precise spatial analysis, while the substantial data volume with quality coordinates supports reliable modeling. Importantly, findings from this analysis can directly inform evidence-based policies for infrastructure maintenance and targeted crime prevention, bridging academic research with practical urban governance applications.

1.4: Visualize the spatial distribution

Code

# Simple point map
p1 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_sf(data = lights_out, color = "#d62828", size = 0.1, alpha = 0.4) +
  labs(
    title = "Lights out Locations",
    subtitle = paste0("Chicago 2017, n = ", nrow(lights_out))
  )

# Density surface using modern syntax
p2 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_density_2d_filled(
    data = data.frame(st_coordinates(lights_out)),
    aes(X, Y),
    alpha = 0.7,
    bins = 8
  ) +
  scale_fill_viridis_d(
    option = "plasma",
    direction = -1,
    guide = "none"  # Modern ggplot2 syntax (not guide = FALSE)
  ) +
  labs(
    title = "Density Surface",
    subtitle = "Kernel density estimation"
  )

# Combine plots using patchwork (modern approach)
p1 + p2 + 
  plot_annotation(
    title = "Spatial Distribution of Lights out in Chicago",
    tag_levels = 'A'
  )

From the maps above, it can be found that the spatial distribution of alley light outages in Chicago demonstrates pronounced unevenness across the urban landscape. The visualization reveals distinct concentration patterns, with particularly dense clusters forming continuous corridors in western and southern sectors, while central and northern lakeshore areas remain notably sparse.

The kernel density analysis further validates this spatial disparity, highlighting persistent high-intensity zones in historically underserved neighborhoods. This geographical patterning reflects underlying infrastructure maintenance disparities and suggests potential correlations between illumination deficiencies and neighborhood socioeconomic characteristics. The stark contrast between high-frequency reporting areas and minimally affected regions underscores systemic inequities in public service distribution, while the concentrated nature of these outages may significantly influence local safety conditions and nocturnal activity patterns.

Part 2: Fishnet Grid Creation and Distribution

This section establishes the analytical foundation by transforming raw spatial data into structured, comparable units that enable systematic examination of crime patterns and environmental factors. The process begins with creating a uniform 500m x 500m grid system spanning Chicago’s urban landscape, which serves as our standardized spatial framework. This grid-based approach is crucial because it transcends traditional administrative boundaries (unequal sizes), allowing us to analyze urban phenomena in consistent, equally-sized, and no boundary bias units that capture local variations while maintaining city-wide comparability.

2.1: Fishnet Creation

Code

# Create 500m x 500m grid
fishnet <- st_make_grid(
  chicagoBoundary,
  cellsize = 500,  # 500 meters per cell
  square = TRUE
) %>%
  st_sf() %>%
  mutate(uniqueID = row_number())

# Keep only cells that intersect Chicago
fishnet <- fishnet[chicagoBoundary, ]

2.2: Aggregate Burglaries to Grid

Code

# Spatial join: which cell contains each burglary?
burglaries_fishnet <- st_join(burglaries, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(countBurglaries = n())

# Join back to fishnet (cells with 0 burglaries will be NA)
fishnet <- fishnet %>%
  left_join(burglaries_fishnet, by = "uniqueID") %>%
  mutate(countBurglaries = replace_na(countBurglaries, 0))

# Summary statistics
summary(fishnet$countBurglaries)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   2.000   3.042   5.000  40.000

Code

cat("\nCells with zero burglaries:", 
    sum(fishnet$countBurglaries == 0), 
    "/", nrow(fishnet),
    "(", round(100 * sum(fishnet$countBurglaries == 0) / nrow(fishnet), 1), "%)\n")


Cells with zero burglaries: 781 / 2458 ( 31.8 %)

Code

# Visualize aggregated counts
ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_c(
    name = "Burglaries",
    option = "plasma",
    trans = "sqrt",  # Square root for better visualization of skewed data
    breaks = c(0, 1, 5, 10, 20, 40)
  ) +
  labs(
    title = "Burglary Counts by Grid Cell",
    subtitle = "500m x 500m cells, Chicago 2017"
  ) +
  theme_crime()

The aggregation of burglary incidents to these grid cells reveals striking spatial disparities in crime distribution. Rather than being evenly dispersed across the city, burglaries concentrate in specific hotspots while large portions of the grid remain crime-free. This clustering pattern underscores the non-random nature of criminal activity and highlights the limitations of traditional Poisson regression for modeling such overdispersed count data, pointing toward the need for more flexible statistical approaches like negative binomial regression.

2.3: Count of Streets Lights out per Cell

Code

# Aggregate lights out calls to fishnet
lightsout_fishnet <- st_join(lights_out, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(lights_out = n())

# Join to fishnet
fishnet <- fishnet %>%
  left_join(lightsout_fishnet, by = "uniqueID") %>%
  mutate(lights_out = replace_na(lights_out, 0))

summary(fishnet$lights_out)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    1.00    7.00   11.35   17.00  217.00

Code

# Visualize aggregated counts
ggplot() +
  geom_sf(data = fishnet, aes(fill = lights_out), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_c(
    name = "Lights Out",
    option = "plasma",
    trans = "sqrt",  # Square root for better visualization of skewed data
    breaks = c(0, 1, 5, 10, 20, 40)
  ) +
  labs(
    title = "Alley Lights Out by Grid Cell",
    subtitle = "500m x 500m cells, Chicago 2017"
  ) +
  theme_crime()

When we map alley light outage reports onto the same grid system, we observe distinct spatial concentrations that visually correspond with burglary hotspots (to some extent), particularly in western and southern sectors of the city. This spatial correlation between infrastructure deficiencies and crime patterns provides preliminary support for environmental criminology theories, suggesting that inadequate lighting may create environmental conditions conducive to criminal activity by reducing natural surveillance and increasing perceived opportunities for offenders. The consistent spatial framework enables us to systematically explore these relationships in subsequent modeling phases.

Part 3: Create Spatial Features

This section develops advanced spatial predictors to characterize the environmental context of lighting infrastructure and its potential relationship to crime patterns. The analysis progresses from measuring local lighting conditions through nearest-neighbor distances to identifying statistically significant spatial clusters of illumination deficiencies. These clusters are then operationalized as distance-based features, capturing proximity to areas of concentrated infrastructure problems. Finally, the framework is enriched with administrative boundaries, vacant buildings and demographic characteristics, creating a comprehensive feature set for subsequent burglary prediction modeling.

3.1: K-Nearest Neighbor Features

This step calculates the average distance from each grid cell centroid to its three nearest alley light outage reports. We first extract the centroid coordinates of all fishnet grid cells and the geographic coordinates of all light outage incidents, then use k-nearest neighbor algorithm (k=3) to compute the mean distance for each cell.

Code

# Calculate mean distance to 3 nearest abandoned cars
# (Do this OUTSIDE of mutate to avoid sf conflicts)

# Get coordinates
fishnet_coords <- st_coordinates(st_centroid(fishnet))
lightsout_coords <- st_coordinates(lights_out)

# Calculate k nearest neighbors and distances
nn_result <- get.knnx(lightsout_coords, fishnet_coords, k = 3)

# Add to fishnet
fishnet <- fishnet %>%
  mutate(
    lights_out.nn = rowMeans(nn_result$nn.dist)
  )

summary(fishnet$lights_out.nn)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
   4.602  102.009  172.653  295.489  328.127 2313.532

This spatial feature captures the local environmental context of lighting infrastructure conditions beyond simple count-based measures. This is crucial because concentrated lighting deficiencies may create extended dark corridors that facilitate criminal activity, whereas isolated outages might have less impact on overall neighborhood safety perception.

The distance statistics reveal substantial spatial variation in lighting infrastructure problems across Chicago. With distances ranging from 4.6 meters to over 2.3 kilometers, and a median distance of 172.7 meters, we observe that some areas experience very dense concentrations of lighting issues (low distance values) while others have relatively sparse problems (high distance values). The right-skewed distribution (mean of 295.5 meters exceeding the median) suggests that while many cells have relatively accessible lighting infrastructure, there are significant areas where residents must travel considerable distances to encounter multiple lighting problems, potentially indicating systematic infrastructure neglect in certain neighborhoods.

3.2: Local Moran’s I analysis

This step implements spatial autocorrelation analysis using Local Moran’s I to identify statistically significant clustering patterns of alley light outages across the grid system. The method constructs a spatial weights matrix based on k-nearest neighbors (k=5) from grid centroids, computes local spatial autocorrelation statistics, and classifies each grid cell into distinct spatial patterns including hotspots (High-High), coldspots (Low-Low), and spatial outliers (High-Low, Low-High).

Code

# Function to calculate Local Moran's I
calculate_local_morans <- function(data, variable, k = 5) {
  
  # Create spatial weights
  coords <- st_coordinates(st_centroid(data))
  neighbors <- knn2nb(knearneigh(coords, k = k))
  weights <- nb2listw(neighbors, style = "W", zero.policy = TRUE)
  
  # Calculate Local Moran's I
  local_moran <- localmoran(data[[variable]], weights)
  
  # Classify clusters
  mean_val <- mean(data[[variable]], na.rm = TRUE)
  
  data %>%
    mutate(
      local_i = local_moran[, 1],
      p_value = local_moran[, 5],
      is_significant = p_value < 0.05,
      
      moran_class = case_when(
        !is_significant ~ "Not Significant",
        local_i > 0 & .data[[variable]] > mean_val ~ "High-High",
        local_i > 0 & .data[[variable]] <= mean_val ~ "Low-Low",
        local_i < 0 & .data[[variable]] > mean_val ~ "High-Low",
        local_i < 0 & .data[[variable]] <= mean_val ~ "Low-High",
        TRUE ~ "Not Significant"
      )
    )
}

# Apply to abandoned cars
fishnet <- calculate_local_morans(fishnet, "lights_out", k = 5)

3.3: Identify Hot Spots and Cold Spots

Code

# Visualize hot spots
ggplot() +
  geom_sf(
    data = fishnet, 
    aes(fill = moran_class), 
    color = NA
  ) +
  scale_fill_manual(
    values = c(
      "High-High" = "#d7191c",
      "High-Low" = "#fdae61",
      "Low-High" = "#abd9e9",
      "Low-Low" = "#2c7bb6",
      "Not Significant" = "gray90"
    ),
    name = "Cluster Type"
  ) +
  labs(
    title = "Local Moran's I: Lights out Clusters",
    subtitle = "High-High = Hot spots of disorder"
  ) +
  theme_crime()

The visualization reveals several key spatial patterns. High-High clusters (hotspots) form contiguous zones primarily in the west and south sides of Chicago, indicating neighborhoods where lighting problems are widespread and mutually reinforcing across multiple adjacent grid cells. These hotspots represent areas of systematic infrastructure neglect that likely require comprehensive intervention strategies. The scattered Low-High outliers suggest isolated pockets of lighting problems within otherwise well-maintained areas, possibly indicating specific properties or blocks needing targeted repairs. The extensive gray areas (non-significant) demonstrate that many parts of the city, particularly in the central and northern regions, do not exhibit statistically significant clustering of lighting issues. This clear spatial differentiation provides valuable evidence for resource allocation decisions and confirms that lighting infrastructure disparities follow broader urban inequality patterns.

3.4: Distance to Hot Spots

This step quantifies the spatial proximity of each grid cell to identified lighting infrastructure hotspots by calculating the minimum distance from each cell’s centroid to the nearest High-High cluster centroid. The process involves extracting the central points of all statistically significant hotspot cells, combining them into a unified spatial object, and computing the shortest distance from every grid cell to this collective hotspot geometry.

Code

# Get centroids of "High-High" cells (hot spots)
hotspots <- fishnet %>%
  filter(moran_class == "High-High") %>%
  st_centroid()

# Calculate distance from each cell to nearest hot spot
if (nrow(hotspots) > 0) {
  fishnet <- fishnet %>%
    mutate(
      dist_to_hotspot = as.numeric(
        st_distance(st_centroid(fishnet), hotspots %>% st_union())
      )
    )
  
  cat("  - Number of hot spot cells:", nrow(hotspots), "\n")
} else {
  fishnet <- fishnet %>%
    mutate(dist_to_hotspot = 0)
  cat("⚠ No significant hot spots found\n")
}

  - Number of hot spot cells: 193

The implementation successfully identified 193 hotspot cells and calculated distance values for all grid cells. The resulting distance metric creates a spatial gradient that decreases as we move toward the core of identified problem areas and increases in peripheral and well-maintained regions. This feature will enable us to test whether burglary risk exhibits distance-decay relationships relative to lighting infrastructure hotspots, potentially revealing how the spatial influence of problematic areas extends beyond their immediate boundaries.

3.5: Join Additional Contextual Data

The code below integrates police district information, which allows for analyzing jurisdictional variations in service delivery and law enforcement practices. Then, we obtain demographic and socioeconomic data from the American Community Survey (ACS), including total population, median income, and the population aged 65 and over, and use this to calculate the proportion of elderly population. Next, we load vacant building data and aggregate it into the grid. These steps aim to enrich each grid cell with comprehensive background variables, allowing for better control of additional influencing factors in subsequent modeling.

Join district information to fishnet

Code

# Join district information to fishnet
fishnet <- st_join(
  fishnet,
  policeDistricts,
  join = st_within,
  left = TRUE
) %>%
  filter(!is.na(District))  # Remove cells outside districts

cat("  - Districts:", length(unique(fishnet$District)), "\n")

  - Districts: 22

Code

cat("  - Cells:", nrow(fishnet), "\n")

  - Cells: 1708

Join demographic information to fishnet

Code

acs_vars <- c(
  total_pop = "B01003_001",
  med_income = "B19013_001",
  # Male 65 years and over (sum of 6 categories)
  male_65_66 = "B01001_020",
  male_67_69 = "B01001_021",
  male_70_74 = "B01001_022",
  male_75_79 = "B01001_023",
  male_80_84 = "B01001_024",
  male_85_up = "B01001_025",
  # Female 65 years and over (sum of 6 categories)
  female_65_66 = "B01001_044",
  female_67_69 = "B01001_045",
  female_70_74 = "B01001_046",
  female_75_79 = "B01001_047",
  female_80_84 = "B01001_048",
  female_85_up = "B01001_049"
)

Code

tracts_all <- get_acs(
  geography = "tract",
  state = "IL",
  county = "Cook",
  variables = acs_vars,
  output = "wide",
  year = 2017, 
  geometry = TRUE,
  progress_bar = FALSE
)

Code

tracts_chi <- tracts_all %>%
  st_transform(st_crs(chicagoBoundary)) %>%
  st_filter(chicagoBoundary, .predicate = st_intersects) %>%
  dplyr::mutate(
    pop_65_over = (male_65_66E + male_67_69E + male_70_74E + male_75_79E + male_80_84E + male_85_upE +
                  female_65_66E + female_67_69E + female_70_74E + female_75_79E + female_80_84E + female_85_upE),
    pct_65_over = 100 * pop_65_over / pmax(total_popE, 1),
    total_pop = total_popE,
    med_income = med_incomeE
  ) %>%
  # Select and rename final columns for clarity and consistency
  dplyr::select(GEOID,total_pop,med_income,pop_65_over,pct_65_over)
# Join to fishnet
fishnet <- fishnet %>%
  st_join(tracts_chi, join = st_intersects, left = TRUE)

Join vacant buildings information to fishnet

Code

# Load vacant buildings data
vacant <-
  st_read("data/Vacant_and_Abandoned_Buildings.geojson", quiet = TRUE) %>%
  st_transform("ESRI:102271")
# Spatial join.
vacant_fishnet <- st_join(vacant, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(count_vacant = n())
# Join to fishnet
fishnet <- fishnet %>%
  left_join(vacant_fishnet, by = "uniqueID") %>%
  mutate(count_vacant = replace_na(count_vacant, 0))

Part 4: Model Fitting

This section establishes the modeling framework by systematically evaluating alternative regression specifications for burglary count prediction. The process involves preparing the analytical dataset, implementing a baseline Poisson model, diagnosing its limitations through dispersion testing, and then proceeding to a more flexible Negative Binomial formulation to address identified statistical issues.

4.1: Poisson Regression

We begin with a Poisson regression, which serves as the baseline model for count data analysis. The model incorporates key predictors including alley light outage counts, nearest neighbor distances to light problems, distance to lighting hotspots, and control variables for income, elderly population proportion, and vacant buildings.

Code

# Create clean modeling dataset
fishnet_model <- fishnet %>%
  st_drop_geometry() %>%
  dplyr::select(
    uniqueID,
    District,
    countBurglaries,
    lights_out,
    lights_out.nn,
    dist_to_hotspot,
    med_income,
    pct_65_over,
    count_vacant
  ) %>%
  na.omit()  # Remove any remaining NAs

cat("  - Observations:", nrow(fishnet_model), "\n")

  - Observations: 4204

Code

cat("  - Variables:", ncol(fishnet_model), "\n")

  - Variables: 9

Code

# Fit Poisson regression
model_poisson <- glm(
  countBurglaries ~ lights_out + lights_out.nn + 
    dist_to_hotspot + med_income + pct_65_over + count_vacant,
  data = fishnet_model,
  family = "poisson"
)

# Summary
summary(model_poisson)


Call:
glm(formula = countBurglaries ~ lights_out + lights_out.nn + 
    dist_to_hotspot + med_income + pct_65_over + count_vacant, 
    family = "poisson", data = fishnet_model)

Coefficients:
                     Estimate    Std. Error z value             Pr(>|z|)    
(Intercept)      2.2935556927  0.0326873382  70.166 < 0.0000000000000002 ***
lights_out       0.0029259386  0.0004614293   6.341       0.000000000228 ***
lights_out.nn   -0.0029046996  0.0000914734 -31.755 < 0.0000000000000002 ***
dist_to_hotspot -0.0000613090  0.0000063612  -9.638 < 0.0000000000000002 ***
med_income      -0.0000036522  0.0000003212 -11.370 < 0.0000000000000002 ***
pct_65_over     -0.0200035311  0.0014396462 -13.895 < 0.0000000000000002 ***
count_vacant     0.0014450388  0.0002285951   6.321       0.000000000259 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 15489  on 4203  degrees of freedom
Residual deviance: 12169  on 4197  degrees of freedom
AIC: 23225

Number of Fisher Scoring iterations: 5

Based on the Poisson regression results, all predictor variables demonstrate statistically significant relationships with burglary counts (p < 0.001), though their directional influences vary notably. Contrary to theoretical expectations, increased street light outages (lights_out) show a slight negative association with burglaries, while greater distances from lighting problem clusters (dist_to_hotspot) and individual outages (lights_out.nn) both correlate with reduced crime rates. More consistent with conventional criminological theory, higher median income (med_income) and larger elderly population proportions (pct_65_over) both exhibit protective effects against burglaries. Conversely, the coefficient for count_vacant is 0.001445. The positive value indicates that a higher number of vacant buildings is associated with a higher expected log count of burglaries. The IRR (Incidence Rate Ratio) is exp(0.001445) ≈ 1.00145, meaning that for each additional vacant building, the expected number of burglaries is multiplied by 1.00145, representing a 0.145% increase. These findings suggest that while physical disorder indicators like vacancy align with theoretical predictions, the relationships between lighting infrastructure issues and crime patterns appear more complex than initially hypothesized, potentially reflecting spatial interdependencies or countervailing mechanisms in the urban environment.

4.2: Check for Overdispersion

Poisson regression assumes mean = variance. Real count data often violates this (overdispersion). Diagnostic checks reveal significant overdispersion in the Poisson model (dispersion parameter ≈ 3.12), violating the fundamental assumption of equal mean and variance. This leads us to employ a Negative Binomial regression, which introduces an additional parameter to accommodate extra-Poisson variation.

Code

# Calculate dispersion parameter
dispersion <- sum(residuals(model_poisson, type = "pearson")^2) / 
              model_poisson$df.residual

cat("Dispersion parameter:", round(dispersion, 2), "\n")

Dispersion parameter: 3.12

Code

cat("Rule of thumb: >1.5 suggests overdispersion\n")

Rule of thumb: >1.5 suggests overdispersion

Code

if (dispersion > 1.5) {
  cat("⚠ Overdispersion detected! Consider Negative Binomial model.\n")
} else {
  cat("✓ Dispersion looks okay for Poisson model.\n")
}

⚠ Overdispersion detected! Consider Negative Binomial model.

4.3: Negative Binomial Regression

Code

# Fit Negative Binomial model
model_nb <- glm.nb(
  countBurglaries ~ lights_out + lights_out.nn + 
    dist_to_hotspot + med_income + pct_65_over + count_vacant,
  data = fishnet_model
)

# Summary
summary(model_nb)


Call:
glm.nb(formula = countBurglaries ~ lights_out + lights_out.nn + 
    dist_to_hotspot + med_income + pct_65_over + count_vacant, 
    data = fishnet_model, init.theta = 1.972735443, link = log)

Coefficients:
                    Estimate   Std. Error z value             Pr(>|z|)    
(Intercept)      2.279286029  0.058481245  38.975 < 0.0000000000000002 ***
lights_out       0.003502754  0.000965502   3.628             0.000286 ***
lights_out.nn   -0.003074634  0.000137026 -22.438 < 0.0000000000000002 ***
dist_to_hotspot -0.000056394  0.000010841  -5.202          0.000000197 ***
med_income      -0.000002855  0.000000542  -5.268          0.000000138 ***
pct_65_over     -0.022029975  0.002449034  -8.995 < 0.0000000000000002 ***
count_vacant     0.002048773  0.000481459   4.255          0.000020873 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1.9727) family taken to be 1)

    Null deviance: 6016.3  on 4203  degrees of freedom
Residual deviance: 4699.3  on 4197  degrees of freedom
AIC: 19782

Number of Fisher Scoring iterations: 1

              Theta:  1.9727 
          Std. Err.:  0.0714 

 2 x log-likelihood:  -19766.1740

Code

# Compare AIC (lower is better)
cat("\nModel Comparison:\n")


Model Comparison:

Code

cat("Poisson AIC:", round(AIC(model_poisson), 1), "\n")

Poisson AIC: 23224.7

Code

cat("Negative Binomial AIC:", round(AIC(model_nb), 1), "\n")

Negative Binomial AIC: 19782.2

The comparison of AIC values clearly demonstrates the superiority of the Negative Binomial specification, with a substantial reduction in AIC from approximately 23,224.7 to a lower value 19,782.2. This improvement reflects better handling of overdispersion (θ = 1.9727). Notably, lights_out shows a positive coefficient (0.0035) in the NB model, aligning with theoretical expectations that lighting failures increase burglary risk. All predictors remain highly significant (p < 0.001), with consistent protective effects for income and elderly population, and risk-increasing effects for vacancy. The NB model provides more reliable estimates for spatial crime analysis.

Part 5: Spatial Cross-Validation

This section establishes a baseline for model evaluation by creating a kernel density estimate of burglary patterns, then implements leave-one-group-out cross-validation using police districts to assess the model’s spatial generalizability. The process compares our complex regression model against a simple spatial smoothing approach and quantifies prediction errors across different geographic areas.

5.1: Create a Kernel Density Baseline

This step creates a baseline prediction model using Kernel Density Estimation (KDE) to analyze the spatial concentration patterns of burglary incidents. The process involves converting burglary point locations into a continuous probability surface using a 1km bandwidth kernel, then extracting the resulting density values to each fishnet grid cell. This creates a smoothed representation of burglary risk based solely on historical crime locations.

Code

# Convert burglaries to ppp (point pattern) format for spatstat
burglaries_ppp <- as.ppp(
  st_coordinates(burglaries),
  W = as.owin(st_bbox(chicagoBoundary))
)

# Calculate KDE with 1km bandwidth
kde_burglaries <- density.ppp(
  burglaries_ppp,
  sigma = 1000,  # 1km bandwidth
  edge = TRUE    # Edge correction
)

# Convert to terra raster (modern approach, not raster::raster)
kde_raster <- rast(kde_burglaries)

# Extract KDE values to fishnet cells
fishnet <- fishnet %>%
  mutate(
    kde_value = terra::extract(
      kde_raster,
      vect(fishnet),
      fun = mean,
      na.rm = TRUE
    )[, 2]  # Extract just the values column
  )

Establishing a KDE baseline serves multiple critical purposes. First, it provides a simple but effective benchmark against which to compare our more complex regression models - if our sophisticated models cannot outperform this basic spatial smoothing approach, their added complexity isn’t justified. Second, KDE captures the fundamental spatial autocorrelation inherent in crime patterns, representing the “where crime has happened before” hypothesis. Third, it helps us understand the underlying spatial structure of burglary risk before introducing explanatory variables.

Code

ggplot() +
  geom_sf(data = fishnet, aes(fill = kde_value), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_c(
    name = "KDE Value",
    option = "plasma"
  ) +
  labs(
    title = "Kernel Density Estimation Baseline",
    subtitle = "Simple spatial smoothing of burglary locations"
  ) +
  theme_crime()

The KDE map reveals pronounced spatial heterogeneity in burglary risk, with dense hotspot emerging in southern and northern sectors while middle areas remain relatively safe. The smooth risk gradients demonstrate spatial autocorrelation and distance decay patterns, providing a crucial baseline for evaluating more complex predictive models.

5.2: Leave-One-Group-Out (LOGO) Cross-Validation

This step implements Leave-One-Group-Out cross-validation using police districts to assess the model’s spatial generalizability. By iteratively training the model on all but one district and testing on the excluded district, we evaluate how well the model transfers to unseen geographic areas. The process reveals whether the relationships learned from most districts can effectively predict burglary patterns in completely new locations.

Code

# Get unique districts
districts <- unique(fishnet_model$District)
cv_results <- tibble()

cat("Running LOGO Cross-Validation...\n")

Running LOGO Cross-Validation...

Code

for (i in seq_along(districts)) {
  
  test_district <- districts[i]
  
  # Split data
  train_data <- fishnet_model %>% filter(District != test_district)
  test_data <- fishnet_model %>% filter(District == test_district)
  
  # Fit model on training data
  model_cv <- glm.nb(
    countBurglaries ~ lights_out + lights_out.nn + 
      dist_to_hotspot + med_income + pct_65_over + count_vacant,
    data = train_data
  )
  
  # Predict on test data
  test_data <- test_data %>%
    mutate(
      prediction = predict(model_cv, test_data, type = "response")
    )
  
  # Calculate metrics
  mae <- mean(abs(test_data$countBurglaries - test_data$prediction))
  rmse <- sqrt(mean((test_data$countBurglaries - test_data$prediction)^2))
  
  # Store results
  cv_results <- bind_rows(
    cv_results,
    tibble(
      fold = i,
      test_district = test_district,
      n_test = nrow(test_data),
      mae = mae,
      rmse = rmse
    )
  )
  
  cat("  Fold", i, "/", length(districts), "- District", test_district, 
      "- MAE:", round(mae, 2), "\n")
}

  Fold 1 / 22 - District 5 - MAE: 2.13 
  Fold 2 / 22 - District 4 - MAE: 2.36 
  Fold 3 / 22 - District 22 - MAE: 2.25 
  Fold 4 / 22 - District 6 - MAE: 3.2 
  Fold 5 / 22 - District 8 - MAE: 3.33 
  Fold 6 / 22 - District 7 - MAE: 2.91 
  Fold 7 / 22 - District 3 - MAE: 6.21 
  Fold 8 / 22 - District 2 - MAE: 2.84 
  Fold 9 / 22 - District 9 - MAE: 3.29 
  Fold 10 / 22 - District 10 - MAE: 2.63 
  Fold 11 / 22 - District 1 - MAE: 2.34 
  Fold 12 / 22 - District 12 - MAE: 3.56 
  Fold 13 / 22 - District 15 - MAE: 1.98 
  Fold 14 / 22 - District 11 - MAE: 3.14 
  Fold 15 / 22 - District 18 - MAE: 2.21 
  Fold 16 / 22 - District 25 - MAE: 2.87 
  Fold 17 / 22 - District 14 - MAE: 3.14 
  Fold 18 / 22 - District 19 - MAE: 2.07 
  Fold 19 / 22 - District 16 - MAE: 2.25 
  Fold 20 / 22 - District 17 - MAE: 1.69 
  Fold 21 / 22 - District 20 - MAE: 1.85 
  Fold 22 / 22 - District 24 - MAE: 2.01

Code

# Overall results
cat("Mean MAE:", round(mean(cv_results$mae), 2), "\n")

Mean MAE: 2.74

Code

cat("Mean RMSE:", round(mean(cv_results$rmse), 2), "\n")

Mean RMSE: 3.67

The results show an average MAE of 2.74 burglaries per grid cell, indicating reasonable predictive accuracy. However, the variation across districts (MAE ranging from 1.69 to 6.21) suggests significant geographic heterogeneity in prediction performance. District 3’s high error (6.21) likely reflects unique local factors not captured by our predictors, highlighting areas where the model may require additional contextual variables or specialized approaches.

5.3: Error metrics calculation and report

This step systematically analyzes the spatial cross-validation results by organizing prediction errors across all police districts. The table ranks districts by MAE (Mean Absolute Error), providing a clear comparison of model performance across different geographic areas. This organization allows for identifying patterns in prediction accuracy and pinpointing where the model struggles most.

Code

# Show results
cv_results %>%
  arrange(desc(mae)) %>%
  kable(
    digits = 2,
    caption = "LOGO CV Results by District"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

LOGO CV Results by District
fold	test_district	n_test	mae	rmse
7	3	120	6.21	7.87
12	12	203	3.56	4.91
5	8	419	3.33	4.17
9	9	291	3.29	4.83
4	6	154	3.20	4.54
14	11	124	3.14	3.79
17	14	164	3.14	4.66
6	7	149	2.91	3.77
16	25	228	2.87	3.82
8	2	183	2.84	3.48
10	10	173	2.63	3.37
2	4	418	2.36	4.21
11	1	60	2.34	3.23
3	22	229	2.25	2.69
19	16	296	2.25	2.63
15	18	95	2.21	3.65
1	5	191	2.13	2.93
18	19	234	2.07	2.57
22	24	111	2.01	2.71
13	15	77	1.98	2.60
21	20	79	1.85	2.19
20	17	206	1.69	2.05

The results reveal substantial spatial variation in prediction quality. District 3 shows the highest error (MAE: 6.21), suggesting unique local dynamics not captured by our predictors, while District 17 demonstrates the best performance (MAE: 1.69).

Part 6: Model Evaluation

This section evaluates model performance by comparing our Negative Binomial regression predictions against the KDE baseline across two key aspects: quantitative error metrics calculation (MAE and RMSE) and spatial error distribution analysis. The evaluation aims to determine whether the complex model’s added predictive power justifies its computational complexity and to identify geographic patterns in prediction accuracy.

6.1: Generate Final Predictions

Code

# Fit final model on all data
final_model <- glm.nb(
  countBurglaries ~ lights_out + lights_out.nn + 
    dist_to_hotspot + med_income + pct_65_over + count_vacant,
  data = fishnet_model
)

# Add predictions back to fishnet
fishnet <- fishnet %>%
  mutate(
    prediction_nb = predict(final_model, fishnet_model, type = "response")[match(uniqueID, fishnet_model$uniqueID)]
  )

# Also add KDE predictions (normalize to same scale as counts)
kde_sum <- sum(fishnet$kde_value, na.rm = TRUE)
count_sum <- sum(fishnet$countBurglaries, na.rm = TRUE)
fishnet <- fishnet %>%
  mutate(
    prediction_kde = (kde_value / kde_sum) * count_sum
  )

6.2: Compare Model vs. KDE Baseline

This step visually compares the actual burglary distribution against predictions from both the complex model and the simple KDE baseline.

Code

# Create three maps
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "plasma", limits = c(0, 15)) +
  labs(title = "Actual Burglaries") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_nb), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
  labs(title = "Model Predictions (Neg. Binomial)") +
  theme_crime()

p3 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_kde), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
  labs(title = "KDE Baseline Predictions") +
  theme_crime()

p1 + p2 + p3 +
  plot_annotation(
    title = "Actual vs. Predicted Burglaries",
    subtitle = "Does our complex model outperform simple KDE?"
  )

The result differs from expectations… The negative binomial regression does not appear to resemble the actual burglary patterns as closely as KDE. However, the negative binomial model produces clearer hotspot boundaries and better distinguishes between high-risk and medium-risk areas, while the KDE baseline creates smoother, more diffused risk surfaces. The identification of complex model in hotspot areas shows some deviation from the actual situation.

Code

# Calculate performance metrics
comparison <- fishnet %>%
  st_drop_geometry() %>%
  filter(!is.na(prediction_nb), !is.na(prediction_kde)) %>%
  summarize(
    model_mae = mean(abs(countBurglaries - prediction_nb)),
    model_rmse = sqrt(mean((countBurglaries - prediction_nb)^2)),
    kde_mae = mean(abs(countBurglaries - prediction_kde)),
    kde_rmse = sqrt(mean((countBurglaries - prediction_kde)^2))
  )

comparison %>%
  pivot_longer(everything(), names_to = "metric", values_to = "value") %>%
  separate(metric, into = c("approach", "metric"), sep = "_") %>%
  pivot_wider(names_from = metric, values_from = value) %>%
  kable(
    digits = 2,
    caption = "Model Performance Comparison"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Model Performance Comparison
approach	mae	rmse
model	2.60	3.63
kde	2.17	3.06

The performance metrics reveal a nuanced comparison between the two modeling approaches. Our lighting-focused Negative Binomial regression (MAE: 2.60, RMSE: 3.63) is outperformed by the KDE baseline (MAE: 2.17, RMSE: 3.06) in terms of pure prediction accuracy for 2017 burglary patterns. This slightly higher MAE and RMSE for our model suggests that the complex relationships we modeled - incorporating alley lighting conditions, nearest-neighbor distances, hotspot proximity, and socioeconomic factors - do not provide additional predictive power beyond simple spatial smoothing of historical crime patterns for within-year prediction.

However, this numerical comparison tells only part of the story. While KDE excels at capturing the strong spatial autocorrelation inherent in burglary patterns, it offers limited explanatory value and cannot identify the underlying environmental factors driving crime risk. Our model, despite its slightly higher errors, provides crucial insights into how specific urban infrastructure conditions - particularly alley lighting deficiencies and their spatial distribution - contribute to burglary risk. The model’s structured approach, which quantifies the effects of lighting problems, income levels, elderly population concentrations, and vacant properties, offers valuable guidance for targeted interventions and policy decisions that go beyond mere crime prediction.

6.3: Prediction Errors

This step analyzes the spatial distribution of prediction errors by calculating both directional errors (actual minus predicted) and absolute errors for the Negative Binomial model.

Code

# Calculate errors
fishnet <- fishnet %>%
  mutate(
    error_nb = countBurglaries - prediction_nb,
    error_kde = countBurglaries - prediction_kde,
    abs_error_nb = abs(error_nb),
    abs_error_kde = abs(error_kde)
  )

# Map errors
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = error_nb), color = NA) +
  scale_fill_gradient2(
    name = "Error",
    low = "#2166ac", mid = "white", high = "#b2182b",
    midpoint = 0,
    limits = c(-10, 10)
  ) +
  labs(title = "Model Errors (Actual - Predicted)") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = abs_error_nb), color = NA) +
  scale_fill_viridis_c(name = "Abs. Error", option = "plasma") +
  labs(title = "Absolute Model Errors") +
  theme_crime()

p1 + p2

The error maps reveal distinct spatial patterns in model performance. The directional error map shows systematic overprediction (red areas) primarily in higher-risk neighborhoods, where the model likely overestimates burglary risk due to unaccounted protective factors. Meanwhile, underprediction (blue areas) concentrates in lower-crime western sectors, suggesting the model misses certain risk amplifiers in these regions. The absolute error map confirms that prediction inaccuracies are most pronounced in areas with extreme crime counts - both very high and very low - indicating the model struggles with accurately predicting both crime hotspots and exceptionally safe areas.

Part 7: Conclusion and Discussion

7.1: Model Summary Table

Code

# Create nice summary table
model_summary <- broom::tidy(final_model, exponentiate = TRUE) %>%
  mutate(
    across(where(is.numeric), ~round(., 3))
  )

model_summary %>%
  kable(
    caption = "Final Negative Binomial Model Coefficients (Exponentiated)",
    col.names = c("Variable", "Rate Ratio", "Std. Error", "Z", "P-Value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  footnote(
    general = "Rate ratios > 1 indicate positive association with burglary counts."
  )

Final Negative Binomial Model Coefficients (Exponentiated)
Variable	Rate Ratio	Std. Error	Z	P-Value
(Intercept)	9.770	0.058	38.975	0
lights_out	1.004	0.001	3.628	0
lights_out.nn	0.997	0.000	-22.438	0
dist_to_hotspot	1.000	0.000	-5.202	0
med_income	1.000	0.000	-5.268	0
pct_65_over	0.978	0.002	-8.995	0
count_vacant	1.002	0.000	4.255	0
Note:
Rate ratios > 1 indicate positive association with burglary counts.

7.2: Key Findings Summary

Overall Performance Assessment:
The spatial predictive analysis reveals a nuanced picture of model performance. While the Negative Binomial regression demonstrated strong explanatory power with all predictors showing statistical significance (p < 0.001), it was ultimately outperformed by the simpler Kernel Density Estimation baseline in predictive accuracy. The KDE achieved lower error metrics (MAE: 2.17 vs 2.60; RMSE: 3.06 vs 3.63), indicating that for within-year prediction of burglary patterns, historical crime locations alone provide marginally better forecasts than our complex model incorporating lighting infrastructure, socioeconomic factors, and vacancy measures.

Key Predictive Relationships:
The regression analysis yielded several theoretically meaningful findings. Vacant building counts showed a positive association with burglary risk, consistent with broken windows theory. Higher median income and larger elderly population proportions exhibited protective effects against burglaries. However, the relationships between lighting infrastructure and crime patterns proved more complex than anticipated, with both proximity to individual light outages and distance to lighting hotspots showing significant but counterintuitive associations that may reflect spatial interdependencies in the urban environment.

Spatial Generalizability:
Leave-one-group-out cross-validation revealed substantial geographic variation in model performance across police districts (MAE range: 1.69-6.21). This spatial heterogeneity suggests that the model’s predictive power is context-dependent, performing well in some areas while struggling in others, particularly District 3 which showed the highest prediction errors. This indicates the potential need for localized modeling approaches or additional place-specific variables.

Practical Implications:
The superior performance of the simple KDE model carries important policy implications: basic hotspot mapping of historical crime data remains the most cost-effective approach for short-term resource allocation. Despite its slightly inferior predictive accuracy compared to KDE, the regression model offers valuable policy insights by quantifying how specific environmental factors influence burglary risk. It provides valuable guidance for long-term preventive strategies. The identification of lighting infrastructure hotspots, combined with socioeconomic and vacancy measures, provides actionable intelligence for targeted interventions that go beyond simple crime prediction to inform infrastructure investment and community safety strategies.