Predictive Policing - Technical Implementation

MUSA 5080 - Fall 2025

Author

Ryan Drake

Published

November 17, 2025

About This Assignment

This assignment will build a spatial predictive model for Chicago tree debris 311 reports using count regression and spatial features:

Create a fishnet grid for aggregating point-level crime data
Calculate spatial features including k-nearest neighbors and distance measures
Diagnose spatial autocorrelation using Local Moran’s I
Fit and interpret Poisson and Negative Binomial regression for count data
Implement spatial cross-validation (Leave-One-Group-Out)
Compare model predictions to a Kernel Density Estimation baseline
Evaluate model performance using appropriate metrics

Setup

Code

# Load required packages
library(tidyverse)      # Data manipulation
library(sf)             # Spatial operations
library(here)           # Relative file paths
library(viridis)        # Color scales
library(terra)          # Raster operations (replaces 'raster')
library(spdep)          # Spatial dependence
library(FNN)            # Fast nearest neighbors
library(MASS)           # Negative binomial regression
library(patchwork)      # Plot composition (replaces grid/gridExtra)
library(knitr)          # Tables
library(kableExtra)     # Table formatting
library(classInt)       # Classification intervals
library(here)

# Spatstat split into sub-packages
library(spatstat.geom)    # Spatial geometries
library(spatstat.explore) # Spatial exploration/KDE

# Set options
options(scipen = 999)  # No scientific notation
set.seed(5080)         # Reproducibility

# Create consistent theme for visualizations
theme_crime <- function(base_size = 11) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold", size = base_size + 1),
      plot.subtitle = element_text(color = "gray30", size = base_size - 1),
      legend.position = "right",
      panel.grid.minor = element_blank(),
      axis.text = element_blank(),
      axis.title = element_blank()
    )
}

# Set as default
theme_set(theme_crime())

cat("✓ All packages loaded successfully!\n")

✓ All packages loaded successfully!

Code

cat("✓ Working directory:", getwd(), "\n")

✓ Working directory: C:/Users/rydra/OneDrive - PennO365/CPLN 5080 Public Policy/portfolio-setup-rdrake251/Assignments/assignment_04

Part 1: Load and Explore Data

Exercise 1.1: Load Chicago Spatial Data

Code

# Load police districts (used for spatial cross-validation)
policeDistricts <- 
  st_read("https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = dist_num)

Reading layer `OGRGeoJSON' from data source 
  `https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON' 
  using driver `GeoJSON'
Simple feature collection with 25 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
Geodetic CRS:  WGS 84

Code

# Load police beats (smaller administrative units)
policeBeats <- 
  st_read("https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(Beat = beat_num)

Reading layer `OGRGeoJSON' from data source 
  `https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON' 
  using driver `GeoJSON'
Simple feature collection with 277 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
Geodetic CRS:  WGS 84

Code

# Load Chicago boundary
chicagoBoundary <- 
  st_read("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson") %>%
  st_transform('ESRI:102271')

Reading layer `chicagoBoundary' from data source 
  `https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson' 
  using driver `GeoJSON'
Simple feature collection with 1 feature and 1 field
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -87.8367 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
Geodetic CRS:  WGS 84

Code

cat("✓ Loaded spatial boundaries\n")

✓ Loaded spatial boundaries

Code

cat("  - Police districts:", nrow(policeDistricts), "\n")

  - Police districts: 25

Code

cat("  - Police beats:", nrow(policeBeats), "\n")

  - Police beats: 277

Coordinate Reference System

We’re using ESRI:102271 (Illinois State Plane East, NAD83, US Feet). This is appropriate for Chicago because:

It minimizes distortion in this region
Uses feet (common in US planning)
Allows accurate distance calculations

Exercise 1.2: Load Tree Debris Data

Code

# Load from provided data file (downloaded from Chicago open data portal)
treedebris <- read_csv(here("Assignments/assignment_04/data",
                             "311_Service_Requests_Tree_Debris.csv"))

names(treedebris)

 [1] "Creation Date"                       
 [2] "Status"                              
 [3] "Completion Date"                     
 [4] "Service Request Number"              
 [5] "Type of Service Request"             
 [6] "If Yes, where is the debris located?"
 [7] "Current Activity"                    
 [8] "Most Recent Action"                  
 [9] "Street Address"                      
[10] "ZIP Code"                            
[11] "X Coordinate"                        
[12] "Y Coordinate"                        
[13] "Ward"                                
[14] "Police District"                     
[15] "Community Area"                      
[16] "Latitude"                            
[17] "Longitude"                           
[18] "Location"

Code

treedebris %>% 
  summarize(
    n_rows = n(),
    n_missing_lon = sum(is.na(Longitude)),
    n_missing_lat = sum(is.na(Latitude))
  )

# A tibble: 1 × 3
  n_rows n_missing_lon n_missing_lat
   <int>         <int>         <int>
1 163414            50            50

Code

treedebris_sf <- treedebris %>% 
  # keep only rows with both coords present
  filter(!is.na(Longitude), !is.na(Latitude)) %>% 
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326) %>% 
  st_transform("ESRI:102271")


# Check the data
cat("\n✓ Loaded treedebris data\n")


✓ Loaded treedebris data

Code

cat("  - Number of treedebris:", nrow(treedebris_sf), "\n")

  - Number of treedebris: 163364

Code

cat("  - CRS:", st_crs(treedebris_sf)$input, "\n")

  - CRS: ESRI:102271

Code

cat("  - Date range:", min(treedebris_sf$`Completion Date`, na.rm = TRUE), "to", 
    max(treedebris_sf$`Completion Date`, na.rm = TRUE), "\n")

  - Date range: 01/02/2013 to 12/31/2015

Question 1.1: How many tree debris reports are in the data set? What time period does this cover? Why does the coordinate reference system matter for our spatial analysis?

Your answer here: There are 163,364 reports of completed tree debris reports within the data range of 01/02/2013 to 12/31/2015. Having this file in the Illinois State Plane East coordinate reference system (102271) will help to minimize distortion, it uses feet, and will make our distance calculations more accurate.

Exercise 1.3: Visualize Point Data

Code

# Simple point map
p1 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_sf(data = treedebris_sf, color = "#d62828", size = 0.1, alpha = 0.4) +
  labs(
    title = "Tree Debris Locations",
    subtitle = paste0("Chicago 2017, n = ", nrow(treedebris_sf))
  )

# Density surface using modern syntax
p2 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_density_2d_filled(
    data = data.frame(st_coordinates(treedebris_sf)),
    aes(X, Y),
    alpha = 0.7,
    bins = 8
  ) +
  scale_fill_viridis_d(
    option = "plasma",
    direction = -1,
    guide = "none"  # Modern ggplot2 syntax (not guide = FALSE)
  ) +
  labs(
    title = "Density Surface",
    subtitle = "Kernel density estimation"
  )

# Combine plots using patchwork (modern approach)
p1 + p2 + 
  plot_annotation(
    title = "Spatial Distribution of Tree Debris in Chicago",
    tag_levels = 'A'
  )

Question 1.2: What spatial patterns do you observe? Are tree debris evenly distributed across Chicago? Where are the highest concentrations? What might explain these patterns?

Your answer here: Tree Debris reports seem to be evenly reported across Chicago. South Chicago sees less reports of tree debris and the density map shows this. Based on my own knowledge of the tree canopy of Chicago, South Chicago and the middle areas are less dense in canopy cover so there are less trees that would have tree debris. Even split would be expected nonetheless since tree debris would be evenly distributed based on weather and other outside factors.

Part 2: Create Fishnet Grid

Exercise 2.1: Understanding the Fishnet

Code

# Create 500m x 500m grid
fishnet <- st_make_grid(
  chicagoBoundary,
  cellsize = 500,  # 500 meters per cell
  square = TRUE
) %>%
  st_sf() %>%
  mutate(uniqueID = row_number())

# Keep only cells that intersect Chicago
fishnet <- fishnet[chicagoBoundary, ]

# View basic info
cat("✓ Created fishnet grid\n")

✓ Created fishnet grid

Code

cat("  - Number of cells:", nrow(fishnet), "\n")

  - Number of cells: 2458

Code

cat("  - Cell size:", 500, "x", 500, "meters\n")

  - Cell size: 500 x 500 meters

Code

cat("  - Cell area:", round(st_area(fishnet[1,])), "square meters\n")

  - Cell area: 250000 square meters

Question 2.1: Why do we use a regular grid instead of existing boundaries like neighborhoods or census tracts? What are the advantages and disadvantages of this approach?

Your answer here: The fishnet grid approach is a neutral way of forming a grid for spatial analysis. The equal sized cells make comparative analysis more consistent and can let you see fine detail of hot spots and other spatial analysis. However, since neighborhoods or census tracts are tied to administrative units, and they are thus tied to budgets, policies, etc., a regular grid is more difficult to apply to policy. There is also scale sensitivity of having too small or too large a grid which can impact interpretation.

Exercise 2.2: Aggregate Tree Debirs to Grid

Code

cat("\nTree Debris count distribution:\n")


Tree Debris count distribution:

Code

summary(fishnet$counttreedebris_sf)

Length  Class   Mode 
     0   NULL   NULL

Code

cat(
  "\nCells with zero tree debris:",
  sum(fishnet$counttreedebris_sf == 0), "/", nrow(fishnet),
  "(", round(100 * sum(fishnet$counttreedebris_sf == 0) / nrow(fishnet), 1), "%)\n"
)


Cells with zero tree debris: 0 / 2458 ( 0 %)

Code

# Spatial join: which cell contains each tree debris?
treedebris_fishnet <- st_join(treedebris_sf, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(counttreedebris_sf = n())

# Join back to fishnet (cells with 0 burglaries will be NA)
fishnet <- fishnet %>%
  left_join(treedebris_fishnet, by = "uniqueID") %>%
  mutate(counttreedebris_sf = replace_na(counttreedebris_sf, 0))

# Summary statistics
cat("\nTree Debris count distribution:\n")


Tree Debris count distribution:

Code

summary(fishnet$counttreedebris_sf)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   10.00   50.00   66.37   97.00  672.00

Code

cat("\nCells with zero tree debris:", 
    sum(fishnet$counttreedebris_sf == 0), 
    "/", nrow(fishnet),
    "(", round(100 * sum(fishnet$counttreedebris_sf == 0) / nrow(fishnet), 1), "%)\n")


Cells with zero tree debris: 317 / 2458 ( 12.9 %)

Code

# Visualize aggregated counts
ggplot() +
  geom_sf(data = fishnet, aes(fill = counttreedebris_sf), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_c(
    name = "Tree Debris",
    option = "plasma",
    trans = "sqrt",  # Square root for better visualization of skewed data
    breaks = c(0, 1, 5, 10, 20, 40)
  ) +
  labs(
    title = "Tree Debris Counts by Grid Cell",
    subtitle = "500m x 500m cells, Chicago 2013-2015"
  ) +
  theme_crime()

Question 2.2: What is the distribution of tree debris counts across cells? Why do so many cells have zero tree debris? Is this distribution suitable for count regression? (Hint: look up overdispersion)

Your answer here: The map shows largely blue pattern meaning most cells follow a zero distribution. The 311 reports are dependent on areas with trees, people reporting these activities, influenced by weather patterns, and can be in places there aren’t as many trees like parking lots, industrial areas, or commercial areas. There is 12.9% that have zero tree debris count. This all makes the pattern strongly over dispersed. Thus this isn’t suitable for count regression because there can’t be negative counts.

Part 3: Create a Kernel Density Baseline

Code

# Convert tree debris to ppp (point pattern) format for spatstat
treedebris_ppp <- as.ppp(
  st_coordinates(treedebris_sf),
  W = as.owin(st_bbox(chicagoBoundary))
)

# Calculate KDE with 1km bandwidth
kde_treedebris <- density.ppp(
  treedebris_ppp,
  sigma = 1000,  # 1km bandwidth
  edge = TRUE    # Edge correction
)

# Convert to terra raster (modern approach, not raster::raster)
kde_raster <- rast(kde_treedebris)

# Extract KDE values to fishnet cells
fishnet <- fishnet %>%
  mutate(
    kde_value = terra::extract(
      kde_raster,
      vect(fishnet),
      fun = mean,
      na.rm = TRUE
    )[, 2]  # Extract just the values column
  )

cat("✓ Calculated KDE baseline\n")

✓ Calculated KDE baseline

Code

ggplot() +
  geom_sf(data = fishnet, aes(fill = kde_value), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_c(
    name = "KDE Value",
    option = "plasma"
  ) +
  labs(
    title = "Kernel Density Estimation Baseline",
    subtitle = "Simple spatial smoothing of tree debris locations"
  ) +
  theme_crime()

Question 3.1: How does the KDE map compare to the count map? What does KDE capture well? What does it miss?

Your answer here: The KDE captures a much smoother version of tree debris activity and its clusters. This gradient and neighborhood scale visualization better connects to the tree canopy cover and residential and storm response patterns. This is also helpful as isn’t tied to the grid so it can be more interpret-able. The KDE is still not counts though, just intensity map. So it doesn’t show actual numbers of tree debris requests or even which cells actually contain the requests. It also depends on size and location such as similar count cells make look similar after this kind of smoothing or spreading influence to areas of industrial or commercial which have a lower tree canopy but smoothing spreads into these areas. KDE is simply just an overall pattern.

Part 4: Create Spatial Predictor Variables

Exercise 4.1: Load 311 Pothole Calls

Code

# Load from provided data file (downloaded from Chicago open data portal)
potholes <- read_csv(here("Assignments/assignment_04/data",
                             "311_Service_Requests_Pot_Holes.csv"))

names(potholes)

 [1] "CREATION DATE"                      "STATUS"                            
 [3] "COMPLETION DATE"                    "SERVICE REQUEST NUMBER"            
 [5] "TYPE OF SERVICE REQUEST"            "CURRENT ACTIVITY"                  
 [7] "MOST RECENT ACTION"                 "NUMBER OF POTHOLES FILLED ON BLOCK"
 [9] "STREET ADDRESS"                     "ZIP"                               
[11] "X COORDINATE"                       "Y COORDINATE"                      
[13] "Ward"                               "Police District"                   
[15] "Community Area"                     "SSA"                               
[17] "LATITUDE"                           "LONGITUDE"                         
[19] "LOCATION"

Code

potholes %>% 
  summarize(
    n_rows = n(),
    n_missing_lon = sum(is.na(LONGITUDE)),
    n_missing_lat = sum(is.na(LATITUDE))
  )

# A tibble: 1 × 3
  n_rows n_missing_lon n_missing_lat
   <int>         <int>         <int>
1 560478           876           876

Code

potholes_sf <- potholes %>% 
  # keep only rows with both coords present
  filter(!is.na(LONGITUDE), !is.na(LATITUDE)) %>% 
  st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326) %>% 
  st_transform("ESRI:102271")


# Check the data
cat("\n✓ Loaded potholes data\n")


✓ Loaded potholes data

Code

cat("  - Number of potholes:", nrow(potholes_sf), "\n")

  - Number of potholes: 559602

Code

cat("  - CRS:", st_crs(potholes_sf)$input, "\n")

  - CRS: ESRI:102271

Code

cat("  - Date range:", min(potholes_sf$`CREATION DATE`, na.rm = TRUE), "to", 
    max(potholes_sf$`CREATION DATE`, na.rm = TRUE), "\n")

  - Date range: 01/01/2011 to 12/31/2017

Exercise 4.2: Count of Pot Holes per Cell

Code

# Aggregate pothole calls to fishnet
potholes_fishnet <- st_join(potholes_sf, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(potholes_sf = n())

# Join to fishnet
fishnet <- fishnet %>%
  left_join(potholes_fishnet, by = "uniqueID") %>%
  mutate(potholes_sf = replace_na(potholes_sf, 0))

cat("Potholes distribution:\n")

Potholes distribution:

Code

summary(fishnet$potholes_sf)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   99.25  215.00  226.73  333.00 1049.00

Code

p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = counttreedebris_sf), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "magma") +
  labs(title = "Tree Debris 311 Calls") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = potholes_sf), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "plasma") +
  labs(title = "Potholes") +
  theme_crime()

p1 + p2 +
  plot_annotation(title = "Are tree debris and potholes correlated?")

Question 4.1: Do you see a visual relationship between tree debris and potholes? What does this suggest?

Your answer here: Firstly, potholes occur much more frequently so the map for that is more intense. There is some strong spatial overlap in South and Southwest Chicago and moderate overlap in Northern Chicago. It doesn’t seem like a strong relationship but this may show areas of aging residential streets or aging areas where things are reported. There is also the case of weathering of streets. Chicago is known as the “windy city” and weather can help create storm damage or freeze/thaw for potholes so these areas of Chicago may see similar patterns. This all suggests using potholes as a predictive variable.

Exercise 4.3: Nearest Neighbor Features

Code

# Calculate mean distance to 3 nearest potholes
# (Do this OUTSIDE of mutate to avoid sf conflicts)

# Get coordinates
fishnet_coords <- st_coordinates(st_centroid(fishnet))
potholes_coords <- st_coordinates(potholes_sf)

# Check error for k nearest neighbor because coords have NA
colSums(is.na(potholes_coords))

X Y 
6 6

Code

colSums(is.na(fishnet_coords))

X Y 
0 0

Code

#fix invalid rows for Potholes
valid_rows <- complete.cases(potholes_coords)
potholes_coords <- potholes_coords[valid_rows, , drop = FALSE]

# Calculate k nearest neighbors and distances
nn_result <- get.knnx(potholes_coords, fishnet_coords, k = 3)

# Add to fishnet
fishnet <- fishnet %>%
  mutate(
    potholes_sf.nn = rowMeans(nn_result$nn.dist)
  )

cat("✓ Calculated nearest neighbor distances\n")

✓ Calculated nearest neighbor distances

Code

summary(fishnet$potholes_sf.nn)

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.8875   25.0254   42.9326   90.6696   80.8198 1288.8894

Question 4.2: What does a low value of potholes_sf.nn mean? A high value? Why might this be informative?

Your answer here: A low pothole.nn means that there is a lot of activity of reports so may be an area with more potholes or more disorder. Potholes would cluster with a low value. The high value would be the opposite where there are limited potholes or lower disorder. This would help see pothole activity at the neighborhood level.

Exercise 4.4: Distance to Hot Spots using Local Moran’s I

Code

# Function to calculate Local Moran's I
calculate_local_morans <- function(data, variable, k = 5) {
  
  # Create spatial weights
  coords <- st_coordinates(st_centroid(data))
  neighbors <- knn2nb(knearneigh(coords, k = k))
  weights <- nb2listw(neighbors, style = "W", zero.policy = TRUE)
  
  # Calculate Local Moran's I
  local_moran <- localmoran(data[[variable]], weights)
  
  # Classify clusters
  mean_val <- mean(data[[variable]], na.rm = TRUE)
  
  data %>%
    mutate(
      local_i = local_moran[, 1],
      p_value = local_moran[, 5],
      is_significant = p_value < 0.05,
      
      moran_class = case_when(
        !is_significant ~ "Not Significant",
        local_i > 0 & .data[[variable]] > mean_val ~ "High-High",
        local_i > 0 & .data[[variable]] <= mean_val ~ "Low-Low",
        local_i < 0 & .data[[variable]] > mean_val ~ "High-Low",
        local_i < 0 & .data[[variable]] <= mean_val ~ "Low-High",
        TRUE ~ "Not Significant"
      )
    )
}

# Apply to Potholes
fishnet <- calculate_local_morans(fishnet, "potholes_sf", k = 5)

Code

# Visualize hot spots
ggplot() +
  geom_sf(
    data = fishnet, 
    aes(fill = moran_class), 
    color = NA
  ) +
  scale_fill_manual(
    values = c(
      "High-High" = "#d7191c",
      "High-Low" = "#fdae61",
      "Low-High" = "#abd9e9",
      "Low-Low" = "#2c7bb6",
      "Not Significant" = "gray90"
    ),
    name = "Cluster Type"
  ) +
  labs(
    title = "Local Moran's I: Potholes Clusters",
    subtitle = "High-High = Hot spots of disorder"
  ) +
  theme_crime()

Code

# Get centroids of "High-High" cells (hot spots)
hotspots <- fishnet %>%
  filter(moran_class == "High-High") %>%
  st_centroid()

# Calculate distance from each cell to nearest hot spot
if (nrow(hotspots) > 0) {
  fishnet <- fishnet %>%
    mutate(
      dist_to_hotspot = as.numeric(
        st_distance(st_centroid(fishnet), hotspots %>% st_union())
      )
    )
  
  cat("✓ Calculated distance to pothole hot spots\n")
  cat("  - Number of hot spot cells:", nrow(hotspots), "\n")
} else {
  fishnet <- fishnet %>%
    mutate(dist_to_hotspot = 0)
  cat("⚠ No significant hot spots found\n")
}

✓ Calculated distance to pothole hot spots
  - Number of hot spot cells: 278

Question 4.3: Why might distance to a cluster of potholes be more informative than distance to a single pothole? What does Local Moran’s I tell us?

Your answer here: Distance to a cluster of potholes gives us broader patterns of disorderly infrastructure rather than misleading us to isolated events or since one event doesn’t reflect neighborhood conditions. Clusters are real spatial patterns. The local moran’s I tells us the spatial autocorrelation at the local level. Meaning it tells us statistically signification areas where potholes concentrate and tells us when the disorder is clustered, dispersed, or random. It gives us better chronic disorder understanding at the neighborhood level.

Note

Local Moran’s I identifies:

High-High: Hot spots (high values surrounded by high values)
Low-Low: Cold spots (low values surrounded by low values)
High-Low / Low-High: Spatial outliers

This helps us understand spatial clustering patterns.

Part 5: Join Police Districts for Cross-Validation

Code

# Join district information to fishnet
fishnet <- st_join(
  fishnet,
  policeDistricts,
  join = st_within,
  left = TRUE
) %>%
  filter(!is.na(District))  # Remove cells outside districts

cat("✓ Joined police districts\n")

✓ Joined police districts

Code

cat("  - Districts:", length(unique(fishnet$District)), "\n")

  - Districts: 22

Code

cat("  - Cells:", nrow(fishnet), "\n")

  - Cells: 1708

Part 6: Model Fitting

Exercise 6.1: Poisson Regression

Code

# Create clean modeling dataset
fishnet_model <- fishnet %>%
  st_drop_geometry() %>%
  dplyr::select(
    uniqueID,
    District,
    counttreedebris_sf,
    potholes_sf,
    potholes_sf.nn,
    dist_to_hotspot
  ) %>%
  na.omit()  # Remove any remaining NAs

cat("✓ Prepared modeling data\n")

✓ Prepared modeling data

Code

cat("  - Observations:", nrow(fishnet_model), "\n")

  - Observations: 1708

Code

cat("  - Variables:", ncol(fishnet_model), "\n")

  - Variables: 6

Code

# Fit Poisson regression
model_poisson <- glm(
  counttreedebris_sf ~ potholes_sf + potholes_sf.nn + 
    dist_to_hotspot,
  data = fishnet_model,
  family = "poisson"
)

# Summary
summary(model_poisson)


Call:
glm(formula = counttreedebris_sf ~ potholes_sf + potholes_sf.nn + 
    dist_to_hotspot, family = "poisson", data = fishnet_model)

Coefficients:
                    Estimate   Std. Error  z value            Pr(>|z|)    
(Intercept)      4.578379031  0.010459312  437.732 <0.0000000000000002 ***
potholes_sf      0.000211475  0.000023392    9.041 <0.0000000000000002 ***
potholes_sf.nn  -0.010904545  0.000098723 -110.456 <0.0000000000000002 ***
dist_to_hotspot  0.000084366  0.000001253   67.312 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 110227  on 1707  degrees of freedom
Residual deviance:  80510  on 1704  degrees of freedom
AIC: 89583

Number of Fisher Scoring iterations: 6

Question 6.1: Interpret the coefficients. Which variables are significant? What do the signs (positive/negative) tell you?

Your answer here: All of the coefficients are highly significant. potholes_sf is showing that, because potholes counts are so large, that the broken windows theory we used is working that more potholes means more tree debris. The potholes_sf.nn is the strongest coefficient and is negative in contrast because it measures distance to a nearby pothole - as distance to be near a pothole increases the tree debris counts decrease thus the slope is negative. On the other hand, dist_to_hotspot is positive. Which is strange but it is because there are two distance measures that are related - potholes_sf.nn and dist_to_hotspot are collinear and thus dist_to_hotspot doesn’t mean much here.

Exercise 6.2: Check for Overdispersion

Code

# Calculate dispersion parameter
dispersion <- sum(residuals(model_poisson, type = "pearson")^2) / 
              model_poisson$df.residual

cat("Dispersion parameter:", round(dispersion, 2), "\n")

Dispersion parameter: 52.94

Code

cat("Rule of thumb: >1.5 suggests overdispersion\n")

Rule of thumb: >1.5 suggests overdispersion

Code

if (dispersion > 1.5) {
  cat("⚠ Overdispersion detected! Consider Negative Binomial model.\n")
} else {
  cat("✓ Dispersion looks okay for Poisson model.\n")
}

⚠ Overdispersion detected! Consider Negative Binomial model.

Exercise 6.3: Negative Binomial Regression

Code

# Fit Negative Binomial model
model_nb <- glm.nb(
  counttreedebris_sf ~ potholes_sf + potholes_sf.nn + 
    dist_to_hotspot,
  data = fishnet_model
)

# Summary
summary(model_nb)


Call:
glm.nb(formula = counttreedebris_sf ~ potholes_sf + potholes_sf.nn + 
    dist_to_hotspot, data = fishnet_model, init.theta = 1.157435882, 
    link = log)

Coefficients:
                   Estimate  Std. Error z value             Pr(>|z|)    
(Intercept)      4.93458514  0.08149791  60.549 < 0.0000000000000002 ***
potholes_sf     -0.00015778  0.00019496  -0.809                0.418    
potholes_sf.nn  -0.01587808  0.00049814 -31.875 < 0.0000000000000002 ***
dist_to_hotspot  0.00006968  0.00001112   6.265       0.000000000373 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1.1574) family taken to be 1)

    Null deviance: 3069.2  on 1707  degrees of freedom
Residual deviance: 1937.9  on 1704  degrees of freedom
AIC: 17257

Number of Fisher Scoring iterations: 1

              Theta:  1.1574 
          Std. Err.:  0.0397 

 2 x log-likelihood:  -17246.7980

Code

# Compare AIC (lower is better)
cat("\nModel Comparison:\n")


Model Comparison:

Code

cat("Poisson AIC:", round(AIC(model_poisson), 1), "\n")

Poisson AIC: 89582.9

Code

cat("Negative Binomial AIC:", round(AIC(model_nb), 1), "\n")

Negative Binomial AIC: 17256.8

Question 6.2: Which model fits better (lower AIC)? What does this tell you about the data?

Your answer here: The NB model has a way better AIC of 17256.8 than the Poisson of 89582.9. This shows the NB model, which dropped the potholes_sf count as non significant and reduced severity of dist_to_hotspot, is a better fit. Counts aren’t helpful once distance is put in. Hotspot distance is just redundant. The NB model is showing that the only important predictor is proximity to potholes - the potholes_sf.nn variable.

Part 7: Spatial Cross-Validation

Leave-One-Group-Out (LOGO) Cross-Validation

Code

# Get unique districts
districts <- unique(fishnet_model$District)
cv_results <- tibble()

cat("Running LOGO Cross-Validation...\n")

Running LOGO Cross-Validation...

Code

for (i in seq_along(districts)) {
  
  test_district <- districts[i]
  
  # Split data
  train_data <- fishnet_model %>% filter(District != test_district)
  test_data <- fishnet_model %>% filter(District == test_district)
  
  # Fit model on training data
  model_cv <- glm.nb(
    counttreedebris_sf ~ potholes_sf + potholes_sf.nn + 
      dist_to_hotspot,
    data = train_data
  )
  
  # Predict on test data
  test_data <- test_data %>%
    mutate(
      prediction = predict(model_cv, test_data, type = "response")
    )
  
  # Calculate metrics
  mae <- mean(abs(test_data$counttreedebris_sf - test_data$prediction))
  rmse <- sqrt(mean((test_data$counttreedebris_sf - test_data$prediction)^2))
  
  # Store results
  cv_results <- bind_rows(
    cv_results,
    tibble(
      fold = i,
      test_district = test_district,
      n_test = nrow(test_data),
      mae = mae,
      rmse = rmse
    )
  )
  
  cat("  Fold", i, "/", length(districts), "- District", test_district, 
      "- MAE:", round(mae, 2), "\n")
}

  Fold 1 / 22 - District 5 - MAE: 48.03 
  Fold 2 / 22 - District 4 - MAE: 56.36 
  Fold 3 / 22 - District 22 - MAE: 102.85 
  Fold 4 / 22 - District 6 - MAE: 63.45 
  Fold 5 / 22 - District 8 - MAE: 42.19 
  Fold 6 / 22 - District 7 - MAE: 53.22 
  Fold 7 / 22 - District 3 - MAE: 38.93 
  Fold 8 / 22 - District 2 - MAE: 55.22 
  Fold 9 / 22 - District 9 - MAE: 49.24 
  Fold 10 / 22 - District 10 - MAE: 41.11 
  Fold 11 / 22 - District 1 - MAE: 74.22 
  Fold 12 / 22 - District 12 - MAE: 67.45 
  Fold 13 / 22 - District 15 - MAE: 84.06 
  Fold 14 / 22 - District 11 - MAE: 39.88 
  Fold 15 / 22 - District 18 - MAE: 63.18 
  Fold 16 / 22 - District 25 - MAE: 39.19 
  Fold 17 / 22 - District 14 - MAE: 36.33 
  Fold 18 / 22 - District 19 - MAE: 38.26 
  Fold 19 / 22 - District 16 - MAE: 33.89 
  Fold 20 / 22 - District 17 - MAE: 31.69 
  Fold 21 / 22 - District 20 - MAE: 29.31 
  Fold 22 / 22 - District 24 - MAE: 48.67

Code

# Overall results
cat("\n✓ Cross-Validation Complete\n")


✓ Cross-Validation Complete

Code

cat("Mean MAE:", round(mean(cv_results$mae), 2), "\n")

Mean MAE: 51.67

Code

cat("Mean RMSE:", round(mean(cv_results$rmse), 2), "\n")

Mean RMSE: 64.9

Code

# Show results
cv_results %>%
  arrange(desc(mae)) %>%
  kable(
    digits = 2,
    caption = "LOGO CV Results by District"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

LOGO CV Results by District
fold	test_district	n_test	mae	rmse
3	22	112	102.85	136.53
13	15	32	84.06	110.00
11	1	28	74.22	80.05
12	12	73	67.45	74.32
4	6	63	63.45	82.83
15	18	30	63.18	68.05
2	4	235	56.36	89.92
8	2	56	55.22	65.05
6	7	52	53.22	67.41
9	9	107	49.24	59.75
22	24	41	48.67	67.11
1	5	98	48.03	68.96
5	8	197	42.19	51.87
10	10	63	41.11	48.68
14	11	43	39.88	49.88
16	25	85	39.19	46.02
7	3	43	38.93	50.44
18	19	63	38.26	43.08
17	14	46	36.33	46.28
19	16	129	33.89	44.09
20	17	82	31.69	42.70
21	20	30	29.31	34.82

Question 7.1: Why is spatial CV more appropriate than random CV for this problem? Which districts were hardest to predict?

Your answer here: The MAE values range from 102.85 to 29.31 and RSME values range from 136.53 to 34.82. These are large variations which is what spatial CV does better - it can show where the model generalizes well and where it doesn’t. The models performance is a spatial problem by district - the relationship between tree debris and potholes isn’t the same across the city. A random CV wouldn’t capture this. Districts 22, 15, 1, 12 etc. were the hardest to predict.

Connection to Week 5

Remember learning about train/test splits and cross-validation? This is a spatial version of that concept!

Why it matters: If we can only predict well in areas we’ve already heavily policed, what does that tell us about the model’s utility?

Part 8: Model Predictions and Comparison

Exercise 8.1: Generate Final Predictions

Code

# Fit final model on all data
final_model <- glm.nb(
  counttreedebris_sf ~ potholes_sf + potholes_sf.nn + 
    dist_to_hotspot,
  data = fishnet_model
)

# Add predictions back to fishnet
fishnet <- fishnet %>%
  mutate(
    prediction_nb = predict(final_model, fishnet_model, type = "response")[match(uniqueID, fishnet_model$uniqueID)]
  )

# Also add KDE predictions (normalize to same scale as counts)
kde_sum <- sum(fishnet$kde_value, na.rm = TRUE)
count_sum <- sum(fishnet$counttreedebris_sf, na.rm = TRUE)
fishnet <- fishnet %>%
  mutate(
    prediction_kde = (kde_value / kde_sum) * count_sum
  )

Exercise 8.2: Compare Model vs. KDE Baseline

Code

# Create three maps
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = counttreedebris_sf), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "plasma", limits = c(0, 15)) +
  labs(title = "Actual Tree Debris") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_nb), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
  labs(title = "Model Predictions (Neg. Binomial)") +
  theme_crime()

p3 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_kde), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
  labs(title = "KDE Baseline Predictions") +
  theme_crime()

p1 + p2 + p3 +
  plot_annotation(
    title = "Actual vs. Predicted Tree Debris",
    subtitle = "Does our complex model outperform simple KDE?"
  )

Code

# Calculate performance metrics
comparison <- fishnet %>%
  st_drop_geometry() %>%
  filter(!is.na(prediction_nb), !is.na(prediction_kde)) %>%
  summarize(
    model_mae = mean(abs(counttreedebris_sf - prediction_nb)),
    model_rmse = sqrt(mean((counttreedebris_sf - prediction_nb)^2)),
    kde_mae = mean(abs(counttreedebris_sf - prediction_kde)),
    kde_rmse = sqrt(mean((counttreedebris_sf - prediction_kde)^2))
  )

comparison %>%
  pivot_longer(everything(), names_to = "metric", values_to = "value") %>%
  separate(metric, into = c("approach", "metric"), sep = "_") %>%
  pivot_wider(names_from = metric, values_from = value) %>%
  kable(
    digits = 2,
    caption = "Model Performance Comparison"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Model Performance Comparison
approach	mae	rmse
model	47.81	67.20
kde	31.95	46.83

Question 8.1: Does the complex model outperform the simple KDE baseline? By how much? Is the added complexity worth it?

Your answer here: The simple KDE outperformed the complex model by 15.86 MAE and 20.37 RMSE. The added complexity did not beat spatial smoothing - just added more explanatory variation. Tree debris is a spatial problem. The KDE baseline already captures the spatial pattern, spatial clustering, and neighborhood structure.

Exercise 9.3: Where Does the Model Work Well?

Code

# Calculate errors
fishnet <- fishnet %>%
  mutate(
    error_nb = counttreedebris_sf - prediction_nb,
    error_kde = counttreedebris_sf - prediction_kde,
    abs_error_nb = abs(error_nb),
    abs_error_kde = abs(error_kde)
  )

# Map errors
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = error_nb), color = NA) +
  scale_fill_gradient2(
    name = "Error",
    low = "#2166ac", mid = "white", high = "#b2182b",
    midpoint = 0,
    limits = c(-10, 10)
  ) +
  labs(title = "Model Errors (Actual - Predicted)") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = abs_error_nb), color = NA) +
  scale_fill_viridis_c(name = "Abs. Error", option = "magma") +
  labs(title = "Absolute Model Errors") +
  theme_crime()

p1 + p2

Question 9.2: Where does the model make the biggest errors? Are there spatial patterns in the errors? What might this reveal?

Your answer here: Yes there are spatial patterns in the errors. The model makes the largest errors in the South and Southwest areas of Chicago. Also, a little bit in the Westside and Northside areas. The actual-predicted map shows underprediction in some of the districts and overprediciton in others, while the absolute error map shows neighborhoods where model performance is rather poor. There are other drivers that would happen with tree debris like I alluded to before such as tree canopy cover, land uses (industrial or commerical), storm intensity or exposure in certain areas over others, and just peoples reporting behaviors across neighborhoods that the model doesn’t incorporate.

Part 10: Summary Statistics and Tables

Exercise 10.1: Model Summary Table

Code

# Create nice summary table
model_summary <- broom::tidy(final_model, exponentiate = TRUE) %>%
  mutate(
    across(where(is.numeric), ~round(., 3))
  )

model_summary %>%
  kable(
    caption = "Final Negative Binomial Model Coefficients (Exponentiated)",
    col.names = c("Variable", "Rate Ratio", "Std. Error", "Z", "P-Value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  footnote(
    general = "Rate ratios > 1 indicate positive association with tree debris counts."
  )

Final Negative Binomial Model Coefficients (Exponentiated)
Variable	Rate Ratio	Std. Error	Z	P-Value
(Intercept)	139.015	0.081	60.549	0.000
potholes_sf	1.000	0.000	-0.809	0.418
potholes_sf.nn	0.984	0.000	-31.875	0.000
dist_to_hotspot	1.000	0.000	6.265	0.000
Note:
Rate ratios > 1 indicate positive association with tree debris counts.

Exercise 10.2: Key Findings Summary

Technical Performance:

Cross-validation MAE: 54.5
Model vs. KDE: KDE
Most predictive variable: potholes_sf.nn

Spatial Patterns:

Tree debris reports are clustered.
Hot spots are located in South and Southwest neighborhoods and a little bit in the Westsides and Northsides of Chicago.
Model errors show patterns of clustering of underprediciton and overprediction with district boundaries.

Model Limitations:

Overdispersion: Yes, by a lot. We had to use the NB model.
Spatial autocorrelation in residuals: Yes, the error maps show clustered residuals.
Cells with zero counts: 12.9%

Conclusion and Next Steps

This assignment has made a spatial prediction workflow for Chicago tree debris using 311 data and pothole as a predictor. While the Negative Binomial model identified strong relationships between disorder and tree debris, the spatial cross-validation and baseline comparison exercises demonstrated that the KDE actually explains most of the variation anyway. Tree debris is strongly spatially auto correlated, spatially clustered, and shaped by neighborhood(local)-level contextual factors that weren’t in the model (like land use, storm exposure and intensities, and canopy cover). Future work should focus on incorporating tree canopy, storm data, other land use data.