Lingxuan Gao - MUSA 5080 Portfolio

On this page

  • Overview
  • Setup
  • Part 1: Data Loading & Exploration
    • 1.1 Load 311 data: Sanitation Code Complaints in Chicago during 2017
    • 1.2 Load Burglary Data
    • 1.3 Load Chicago Spatial Data
    • 1.4 The spatial distribution of Sanitation Code Complaints
  • Part 2: Fishnet Grid Creation
    • 2.1 Fishnet Creation
    • 2.2 Aggregate violations to grid cells
    • 2.3 Count Distribution Visualization
  • Part 3: Spatial Features
    • 3.1 K-Nearest Neighbor Features Calculation
  • Part 4: Count Regression Models
    • 4.1 Join district information to fishnet
    • 4.2 Fit Poisson regression
  • Part 5: Spatial Cross-Validation
  • Part 6: Model Evaluation

Assignment 4: Spatial Predictive Analysis

Author

Lingxuan Gao


Overview

In this lab, I apply spatial predictive modeling techniques to sanitation code complaints in Chicago during 2017, using them as predictors of burglary risk.

Setup

Code
# Load required packages
library(tidyverse)      # Data manipulation
library(sf)             # Spatial operations
library(here)           # Relative file paths
library(viridis)        # Color scales
library(terra)          # Raster operations (replaces 'raster')
library(spdep)          # Spatial dependence
library(FNN)            # Fast nearest neighbors
library(MASS)           # Negative binomial regression
library(patchwork)      # Plot composition (replaces grid/gridExtra)
library(knitr)          # Tables
library(kableExtra)     # Table formatting
library(classInt)       # Classification intervals

# Spatstat split into sub-packages
library(spatstat.geom)    # Spatial geometries
library(spatstat.explore) # Spatial exploration/KDE

# Set options
options(scipen = 999)  # No scientific notation
set.seed(5080)         # Reproducibility

# Create consistent theme for visualizations
theme_crime <- function(base_size = 11) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold", size = base_size + 1),
      plot.subtitle = element_text(color = "gray30", size = base_size - 1),
      legend.position = "right",
      panel.grid.minor = element_blank(),
      axis.text = element_blank(),
      axis.title = element_blank()
    )
}

# Set as default
theme_set(theme_crime())

Part 1: Data Loading & Exploration

Note

Why choose Sanitation Code Complaints as a predictor?

  1. Visible Environmental Disorder: Sanitation violations (illegal dumping, overflowing garbage, unsanitary conditions) are highly visible markers of neighborhood neglect that signal low social control and guardianship.

  2. Property Maintenance Connection: Sanitation code violations often correlate with poorly maintained properties - the same conditions that make areas vulnerable to burglary.

  3. Temporal Stability: Unlike some 311 calls that are seasonal, sanitation violations tend to persist, making them reliable indicators of chronic neighborhood conditions.

Expected Relationship: I hypothesize a positive correlation between sanitation complaints and burglaries, where areas with higher sanitation violations will experience elevated burglary rates.

1.1 Load 311 data: Sanitation Code Complaints in Chicago during 2017

Code
# Load sanitation code complaints
sanitation_raw <- read_csv(here("data", "311_Service_Requests_-_Sanitation_Code_Complaints.csv"))

# Process and filter for 2017 only
sanitation <- sanitation_raw %>%
  # Parse the existing Creation Date column 
  mutate(creation_date = mdy(`Creation Date`)) %>%

# Filter for 2017 only
  filter(
    year(creation_date) == 2017,
    !is.na(Latitude), 
    !is.na(Longitude)
  ) %>%
  # Convert to spatial object
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326) %>%
  st_transform('ESRI:102271')

1.2 Load Burglary Data

Code
# Load from provided data file (downloaded from Chicago open data portal)
burglaries <- st_read(here("data", "burglaries.shp")) %>% 
  st_transform('ESRI:102271')
Reading layer `burglaries' from data source 
  `D:\MUSA\MUSA_5080\portfolio-setup-LingxuanGao\data\burglaries.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 7482 features and 22 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 340492 ymin: 552959.6 xmax: 367153.5 ymax: 594815.1
Projected CRS: NAD83(HARN) / Illinois East

1.3 Load Chicago Spatial Data

Code
# Load police districts (used for spatial cross-validation)
policeDistricts <- 
  st_read("https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = dist_num)
Reading layer `OGRGeoJSON' from data source 
  `https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON' 
  using driver `GeoJSON'
Simple feature collection with 25 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
Geodetic CRS:  WGS 84
Code
# Load police beats (smaller administrative units)
policeBeats <- 
  st_read("https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(Beat = beat_num)
Reading layer `OGRGeoJSON' from data source 
  `https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON' 
  using driver `GeoJSON'
Simple feature collection with 277 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
Geodetic CRS:  WGS 84
Code
# Load Chicago boundary
chicagoBoundary <- 
  st_read("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson") %>%
  st_transform('ESRI:102271')
Reading layer `chicagoBoundary' from data source 
  `https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson' 
  using driver `GeoJSON'
Simple feature collection with 1 feature and 1 field
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -87.8367 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
Geodetic CRS:  WGS 84

1.4 The spatial distribution of Sanitation Code Complaints

Code
# Simple point map
p1 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_sf(data = sanitation, color = "#d62828", size = 0.1, alpha = 0.4) +
  labs(
    title = "Sanitation Code Complaints",
    subtitle = paste0("Chicago 2017, n = ", nrow(sanitation))
  )

# Density surface using modern syntax
p2 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_density_2d_filled(
    data = data.frame(st_coordinates(sanitation)),
    aes(X, Y),
    alpha = 0.7,
    bins = 8
  ) +
  scale_fill_viridis_d(
    option = "plasma",
    direction = -1,
    guide = "none"  # Modern ggplot2 syntax (not guide = FALSE)
  ) +
  labs(
    title = "Density Surface",
    subtitle = "Kernel density estimation"
  )

# Combine plots
p1 + p2 + 
  plot_annotation(
    title = "Sanitation Code Complaints",
    tag_levels = 'A'
  )

Spatial Pattern of Sanitation Complaints:

The point pattern map (Figure A) reveals highly clustered sanitation complaints concentrated in Chicago’s central and south-central corridors. This is not a random distribution: the kernel density estimation (Figure B) identifies three distinct hot spots, with peak densities in the mid-north, central-south, and southern districts.

1.5 The spatial distribution of Burglaries

Code
# Simple point map
p1 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_sf(data = burglaries, color = "#d62828", size = 0.1, alpha = 0.4) +
  labs(
    title = "Burglary Locations",
    subtitle = paste0("Chicago 2017, n = ", nrow(burglaries))
  )

# Density surface using modern syntax
p2 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_density_2d_filled(
    data = data.frame(st_coordinates(burglaries)),
    aes(X, Y),
    alpha = 0.7,
    bins = 8
  ) +
  scale_fill_viridis_d(
    option = "plasma",
    direction = -1,
    guide = "none"  # Modern ggplot2 syntax (not guide = FALSE)
  ) +
  labs(
    title = "Density Surface",
    subtitle = "Kernel density estimation"
  )

# Combine plots using patchwork (modern approach)
p1 + p2 + 
  plot_annotation(
    title = "Spatial Distribution of Burglaries in Chicago",
    tag_levels = 'A'
  )

Spatial Patterns of Burglaries

Burglaries are highly unevenly distributed across Chicago. The point pattern map reveals dense concentrations forming continuous corridors through central and south-central districts, while northern and far southern edges show sparse activity. The kernel density surface identifies three major hot spots: a dominant mid-north concentration, a central-south band, and a southern cluster.

Part 2: Fishnet Grid Creation

2.1 Fishnet Creation

Code
# Create 500m x 500m grid
fishnet <- st_make_grid(
  chicagoBoundary,
  cellsize = 500,  # 500 meters per cell
  square = TRUE
) %>%
  st_sf() %>%
  mutate(uniqueID = row_number())

# Keep only cells that intersect Chicago
fishnet <- fishnet[chicagoBoundary, ]

# View basic info
info_tbl <- tibble::tibble(
  Metric = c("Number of cells", "Cell size", "Cell area (m²)"),
  Value  = c(
    scales::comma(nrow(fishnet)),
    "500 × 500 m",
    scales::comma(round(as.numeric(units::drop_units(sf::st_area(fishnet[1, ])))))
  )
)

knitr::kable(
  info_tbl,
  caption   = "Fishnet Summary",
  col.names = c("Metric", "Value"),
  align     = c("l","r"),
  booktabs  = TRUE
) |>
  kableExtra::kable_styling(full_width = FALSE, position = "left",
                            bootstrap_options = c("striped","hover","condensed")) |>
  kableExtra::column_spec(1, bold = TRUE)
Fishnet Summary
Metric Value
Number of cells 2,458
Cell size 500 × 500 m
Cell area (m²) 250,000

2.2 Aggregate violations to grid cells

Code
# Spatial join: which cell contains each burglary?
burglaries_fishnet <- st_join(burglaries, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(countBurglaries = n())

# Join back to fishnet (cells with 0 burglaries will be NA)
fishnet <- fishnet %>%
  left_join(burglaries_fishnet, by = "uniqueID") %>%
  mutate(countBurglaries = replace_na(countBurglaries, 0))

# Compute summary statistics
burglaries_sum<- summary(fishnet$countBurglaries)

#Identify Zero-count cells
bur_zero_n <- sum(fishnet$countBurglaries == 0)
bur_zero_pct   <- 100 * bur_zero_n / nrow(fishnet)

# Pull the six standard numeric summary fields in order
fields <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
vals <- round(as.numeric(burglaries_sum[fields]), 2)

# Convert to character with two decimal places
vals_fmt <- sprintf("%.2f", vals)

# Build final table
summary_tbl <- tibble::tibble(
  Statistic = c(
    "Min",
    "1st Quartile",
    "Median",
    "Mean",
    "3rd Quartile",
    "Max",
    "Cells with zero Burglaries"
  ),
  Value = c(
    vals_fmt,
    sprintf(
      "%s / %s (%.1f%%)",
      scales::comma(bur_zero_n),
      scales::comma(nrow(fishnet)),
      bur_zero_pct
    )
  )
)

# Render table
knitr::kable(
  summary_tbl,
  caption = "Summary of Burglary Counts per Grid Cell",
  col.names = c("Statistic", "Value"),
  align = c("l", "r"),
  booktabs = TRUE
) %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
Summary of Burglary Counts per Grid Cell
Statistic Value
Min 0.00
1st Quartile 0.00
Median 2.00
Mean 3.04
3rd Quartile 5.00
Max 40.00
Cells with zero Burglaries 781 / 2,458 (31.8%)
Code
# Spatial join: which cell contains each sanitation complaint?
sanitation_fishnet <- st_join(sanitation, fishnet, join = st_intersects) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(countSanitation = n())

# Join back to fishnet (cells with 0 sanitation complaints)
fishnet <- fishnet %>%
  left_join(sanitation_fishnet, by = "uniqueID") %>%
  mutate(countSanitation = replace_na(countSanitation, 0))

# Compute summary statistics
sanitation_sum<- summary(fishnet$countSanitation)

#Identify Zero-count cells
san_zero_n <- sum(fishnet$countSanitation == 0)
san_zero_pct   <- 100 * san_zero_n / nrow(fishnet)

# Pull the six standard numeric summary fields in order
fields <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
vals <- round(as.numeric(sanitation_sum[fields]), 2)

# Convert to character with two decimal places
vals_fmt <- sprintf("%.2f", vals)

# Build final table
summary_tbl <- tibble::tibble(
  Statistic = c(
    "Min",
    "1st Quartile",
    "Median",
    "Mean",
    "3rd Quartile",
    "Max",
    "Cells with zero sanitation complaints"
  ),
  Value = c(
    vals_fmt,
    sprintf(
      "%s / %s (%.1f%%)",
      scales::comma(san_zero_n),
      scales::comma(nrow(fishnet)),
      san_zero_pct
    )
  )
)

# Render table
knitr::kable(
  summary_tbl,
  caption = "Summary of Sanitation Complaint Counts per Grid Cell",
  col.names = c("Statistic", "Value"),
  align = c("l", "r"),
  booktabs = TRUE
) %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
Summary of Sanitation Complaint Counts per Grid Cell
Statistic Value
Min 0.00
1st Quartile 1.00
Median 5.00
Mean 8.02
3rd Quartile 12.00
Max 189.00
Cells with zero sanitation complaints 576 / 2,458 (23.4%)

2.3 Count Distribution Visualization

Code
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = countSanitation), color = NA) +
  scale_fill_viridis_c(
    name = "Count", 
    option = "plasma",
    trans = "sqrt", 
    breaks = c(0, 10, 20, 40, 80, 160)) +
  labs(title = "Sanitation Complaints") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  scale_fill_viridis_c(
    name = "Count",
    option = "plasma",
    trans = "sqrt", 
    breaks = c(0, 1, 5, 10, 20, 40))+      
  labs(title = "Burglaries") +
  theme_crime()

p1 + p2 +
  plot_annotation(title = "Are sanitation complaints and burglaries correlated?")

Summary: Sanitation complaints and burglaries show broadly similar spatial patterns, with higher concentrations in central and denser parts of the city.
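A quick way to quantify this visual similarity is the correlation between the two counts across grid cells. This is a sketch, assuming the fishnet object with countSanitation and countBurglaries built in 2.2:

```r
# Sketch: quantify the visual association between the two count surfaces.
# Assumes the fishnet built in 2.2, with countSanitation and countBurglaries.
cor_test <- cor.test(
  fishnet$countSanitation,
  fishnet$countBurglaries,
  method = "pearson"
)
cat("Pearson r:", round(unname(cor_test$estimate), 3),
    "p-value:", format.pval(cor_test$p.value), "\n")
```

A strong positive correlation here would support the hypothesized relationship stated in Part 1, though cell-level counts can also correlate simply because both variables track population density.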

Part 3: Spatial Features

3.1 K-Nearest Neighbor Features Calculation

Code
# Calculate mean distance to 3 nearest sanitation complaints

# Extract centroids for fishnet cells
fishnet_coords <- st_coordinates(st_centroid(fishnet))

# Extract coordinates of sanitation complaints
sanitation_coords <- st_coordinates(sanitation)

# Compute 3 nearest sanitation complaint neighbors for each grid cell
nn_result <- get.knnx(sanitation_coords, fishnet_coords, k = 3)

# Add feature to fishnet
fishnet <- fishnet %>%
  mutate(
    sanitation_nn = rowMeans(nn_result$nn.dist)   # mean distance to 3 nearest sanitation complaints
  )


summary(fishnet$sanitation_nn)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
   1.031  108.565  181.606  294.820  332.417 2213.320 

  • Minimum (1 m): the three nearest sanitation complaints sit essentially on top of the cell centroid, about one meter away.
  • 1st Quartile (109 m): roughly a two-minute walk; in 25% of Chicago’s grid cells, the three nearest complaints are within about one block.
  • Median (182 m): half of all grid cells have their three nearest complaints within about two blocks.
  • Mean (295 m): well above the median, revealing that the distribution of distances is strongly right-skewed.
  • 3rd Quartile (332 m): three-quarters of grid cells have their nearest neighbors within 332 m (~4 city blocks).
  • Maximum (2,213 m): the sparsest parts of the city, far from any complaint.

3.2 Perform Local Moran’s I Analysis

Code
# Function to calculate Local Moran's I and classify cluster types
calculate_local_morans <- function(data, variable, k = 5) {

# Create spatial weights  
  coords <- st_coordinates(st_centroid(data))
  neighbors <- knn2nb(knearneigh(coords, k = k))
  weights <- nb2listw(neighbors, style = "W", zero.policy = TRUE)

# Calculate Local Moran's I  
  local_moran <- localmoran(data[[variable]], weights)
  
# Classify clusters
  mean_val <- mean(data[[variable]], na.rm = TRUE)
  
  data %>%
    mutate(
      local_i = local_moran[, 1],
      p_value = local_moran[, 5],
      is_significant = p_value < 0.05,
      
      moran_class = case_when(
        !is_significant ~ "Not Significant",
        local_i > 0 & .data[[variable]] > mean_val ~ "High-High",
        local_i > 0 & .data[[variable]] <= mean_val ~ "Low-Low",
        local_i < 0 & .data[[variable]] > mean_val ~ "High-Low",
        local_i < 0 & .data[[variable]] <= mean_val ~ "Low-High",
        TRUE ~ "Not Significant"
      )
    )
}

# Apply to Sanitation Complaints counts
fishnet <- calculate_local_morans(fishnet, "countSanitation", k = 5)

3.3 Identify Hot Spots and Cold Spots

Code
ggplot() +
  geom_sf(
    data = fishnet,
    aes(fill = moran_class),
    color = NA
  ) +
  scale_fill_manual(
    values = c(
      "High-High" = "#d7191c",   # Red = hot spot
      "High-Low" = "#fdae61",
      "Low-High" = "#abd9e9",
      "Low-Low" = "#2c7bb6",
      "Not Significant" = "gray90"
    ),
    name = "Cluster Type"
  ) +
  labs(
    title = "Local Moran's I: Sanitation Code Complaint Clusters",
    subtitle = "High-High = Hot spots of disorder"
  ) +
  theme_crime()

The Local Moran’s I map confirms that sanitation complaints are spatially clustered: significant High-High cells mark hot spots of disorder, while Low-Low cells mark consistently quiet areas.

3.4 Create Distance-to-Hotspot Measures

Code
# Get centroids of "High-High" cells 
hotspots <- fishnet %>%
  filter(moran_class == "High-High") %>%
  st_centroid()

# Calculate distance from each grid cell to hot spot
if (nrow(hotspots) > 0) {
  
  fishnet <- fishnet %>%
    mutate(
      dist_to_hotspot = as.numeric(
        st_distance(st_centroid(fishnet), hotspots %>% st_union())
      )
    )

  cat("  - Number of hot spot cells:", nrow(hotspots), "\n")
  
} else {
  
  fishnet <- fishnet %>%
    mutate(dist_to_hotspot = 0)
  
  cat("⚠ No significant sanitation hot spots found\n")
}
  - Number of hot spot cells: 203 

Part 4: Count Regression Models

I’ll use police districts for the spatial cross-validation.

4.1 Join district information to fishnet

Code
fishnet <- st_join(
  fishnet,
  policeDistricts,
  join = st_within,
  left = TRUE
) %>%
  filter(!is.na(District))  # Remove cells outside districts

cat("  - Districts:", length(unique(fishnet$District)), "\n")
  - Districts: 22 
Code
cat("  - Cells:", nrow(fishnet), "\n")
  - Cells: 1708 

4.2 Fit Poisson regression

Code
fishnet_model <- fishnet %>%
  st_drop_geometry() %>%
  dplyr::select(
    uniqueID,
    District,
    countBurglaries,
    countSanitation,
    sanitation_nn,
    dist_to_hotspot
  ) %>%
  na.omit()  # Remove any remaining NAs

cat("  - Observations:", nrow(fishnet_model), "\n")
  - Observations: 1708 
Code
cat("  - Variables:", ncol(fishnet_model), "\n")
  - Variables: 6 
Code
model_poisson <- glm(
  countBurglaries ~ countSanitation + sanitation_nn + 
    dist_to_hotspot,
  data = fishnet_model,
  family = "poisson"
)

# Summary
summary(model_poisson)

Call:
glm(formula = countBurglaries ~ countSanitation + sanitation_nn + 
    dist_to_hotspot, family = "poisson", data = fishnet_model)

Coefficients:
                   Estimate  Std. Error z value            Pr(>|z|)    
(Intercept)      2.22791188  0.03593051  62.006 <0.0000000000000002 ***
countSanitation  0.00264490  0.00107409   2.462              0.0138 *  
sanitation_nn   -0.00449200  0.00017766 -25.284 <0.0000000000000002 ***
dist_to_hotspot -0.00014657  0.00001029 -14.243 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 6710.3  on 1707  degrees of freedom
Residual deviance: 4058.3  on 1704  degrees of freedom
AIC: 8126.6

Number of Fisher Scoring iterations: 5

Interpretation: The model predicts burglary counts from three sanitation-related features:

  • countSanitation — the number of sanitation complaints inside a grid cell
  • sanitation_nn — the average distance to the 3 nearest sanitation complaints
  • dist_to_hotspot — the distance to the nearest sanitation hotspot (cluster of complaints)

The coefficient on countSanitation is positive: each additional sanitation complaint inside a cell is associated with about a 0.26% increase in expected burglaries.

The coefficient on sanitation_nn is strongly negative: each additional meter of distance from the nearest sanitation complaints is associated with a roughly 0.45% decrease in expected burglaries.

The coefficient on dist_to_hotspot is slightly negative: cells farther from a sanitation hotspot see slightly fewer expected burglaries.

All three predictors are statistically significant.

In conclusion, burglary risk rises in places where sanitation complaints are concentrated or close by.
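The percent changes above come from exponentiating the log-scale Poisson coefficients into rate ratios. A minimal sketch, assuming the model_poisson object fitted in 4.2:

```r
# Sketch: exponentiate Poisson coefficients into rate ratios.
# exp(beta) is the multiplicative change in expected burglaries per
# one-unit increase in the predictor; 100 * (exp(beta) - 1) is the
# corresponding percent change.
rate_ratios <- exp(coef(model_poisson))
pct_change  <- 100 * (rate_ratios - 1)
round(cbind(rate_ratio = rate_ratios, pct_change = pct_change), 4)
```

For countSanitation, exp(0.00264) ≈ 1.0026, i.e., the ~0.26% increase quoted above.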

4.3 Check for Overdispersion

Poisson regression assumes the mean equals the variance. Real count data often violate this assumption (overdispersion).

Code
# Calculate dispersion parameter
dispersion <- sum(residuals(model_poisson, type = "pearson")^2) / 
              model_poisson$df.residual

cat("Dispersion parameter:", round(dispersion, 2), "\n")
Dispersion parameter: 2.55 
Code
cat("Rule of thumb: >1.5 suggests overdispersion\n")
Rule of thumb: >1.5 suggests overdispersion
Code
if (dispersion > 1.5) {
  cat("⚠ Overdispersion detected! Consider Negative Binomial model.\n")
} else {
  cat("Dispersion looks okay for Poisson model.\n")
}
⚠ Overdispersion detected! Consider Negative Binomial model.

4.4 Negative Binomial Regression

Because the data are overdispersed, I refit with Negative Binomial regression, which adds a dispersion parameter and relaxes the mean-equals-variance assumption.

Code
# Fit Negative Binomial model
model_nb <- glm.nb(
  countBurglaries ~ countSanitation + sanitation_nn + 
    dist_to_hotspot,
  data = fishnet_model
)

# Summary
summary(model_nb)

Call:
glm.nb(formula = countBurglaries ~ countSanitation + sanitation_nn + 
    dist_to_hotspot, data = fishnet_model, init.theta = 2.396013291, 
    link = log)

Coefficients:
                   Estimate  Std. Error z value            Pr(>|z|)    
(Intercept)      2.26736511  0.06167510  36.763 <0.0000000000000002 ***
countSanitation  0.00338981  0.00207473   1.634               0.102    
sanitation_nn   -0.00483437  0.00025537 -18.931 <0.0000000000000002 ***
dist_to_hotspot -0.00014469  0.00001548  -9.345 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(2.396) family taken to be 1)

    Null deviance: 3091.5  on 1707  degrees of freedom
Residual deviance: 1783.9  on 1704  degrees of freedom
AIC: 7151.4

Number of Fisher Scoring iterations: 1

              Theta:  2.396 
          Std. Err.:  0.154 

 2 x log-likelihood:  -7141.431 

4.5 Compare Model Fit (AIC)

Code
# Compare AIC (lower is better)
cat("\nModel Comparison:\n")

Model Comparison:
Code
cat("Poisson AIC:", round(AIC(model_poisson), 1), "\n")
Poisson AIC: 8126.6 
Code
cat("Negative Binomial AIC:", round(AIC(model_nb), 1), "\n")
Negative Binomial AIC: 7151.4 

Negative Binomial Model Interpretation: The negative binomial model fits the data much better than the Poisson model (AIC drops from 8126.6 to 7151.4), confirming that burglary counts are overdispersed. Cells closer to sanitation complaints (sanitation_nn) show higher burglary risk: the coefficient is strongly negative and highly significant, so risk increases as distance shrinks. The same pattern holds for distance to sanitation hotspots: burglary rates are higher near clusters of complaints. The within-cell count of sanitation complaints (countSanitation) becomes non-significant once the distance features are included, suggesting that proximity to disorder matters more than the raw count inside a grid cell.

Part 5: Spatial Cross-Validation

Standard cross-validation randomly splits the data, but with spatial data that means training on cells right next to test cells (information leakage!). Leave-One-Group-Out (LOGO) cross-validation instead trains on all districts except one and tests on the held-out district, so I use LOGO cross-validation here, with police districts as the groups.

Code
# Get unique districts
districts <- unique(fishnet_model$District)
cv_results <- tibble()

for (i in seq_along(districts)) {
  
  test_district <- districts[i]
  
  # Split data
  train_data <- fishnet_model %>% filter(District != test_district)
  test_data <- fishnet_model %>% filter(District == test_district)
  
  # Fit model on training data
  model_cv <- glm.nb(
    countBurglaries ~ countSanitation + sanitation_nn + 
      dist_to_hotspot,
    data = train_data
  )
  
  # Predict on test data
  test_data <- test_data %>%
    mutate(
      prediction = predict(model_cv, test_data, type = "response")
    )
  
  # Calculate metrics
  mae <- mean(abs(test_data$countBurglaries - test_data$prediction))
  rmse <- sqrt(mean((test_data$countBurglaries - test_data$prediction)^2))
  
  # Store results
  cv_results <- bind_rows(
    cv_results,
    tibble(
      fold = i,
      test_district = test_district,
      n_test = nrow(test_data),
      mae = mae,
      rmse = rmse
    )
  )
  
}

# Overall results
cat("Mean MAE:", round(mean(cv_results$mae), 2), "\n")
Mean MAE: 2.51 
Code
cat("Mean RMSE:", round(mean(cv_results$rmse), 2), "\n")
Mean RMSE: 3.35 
Code
# Show results
cv_results %>%
  arrange(desc(mae)) %>%
  kable(
    digits = 2,
    caption = "LOGO CV Results by District"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
LOGO CV Results by District
fold test_district n_test mae rmse
7 3 43 5.22 7.10
15 18 30 3.57 4.65
14 11 43 3.00 3.61
12 12 73 2.93 4.07
17 14 46 2.91 3.82
21 20 30 2.71 3.18
4 6 63 2.70 3.98
18 19 63 2.67 3.25
16 25 85 2.64 3.54
22 24 41 2.55 3.06
6 7 52 2.44 3.08
8 2 56 2.43 3.24
10 10 63 2.32 2.97
20 17 82 2.24 2.79
5 8 197 2.19 3.32
3 22 112 2.04 2.61
11 1 28 2.04 2.53
13 15 32 2.02 2.38
1 5 98 2.00 2.96
9 9 107 1.79 2.39
19 16 129 1.57 2.00
2 4 235 1.31 3.21

Interpretation of the LOGO Cross-Validation Results: The table shows how well the model predicts burglaries when each police district is held out as completely unseen test data. MAE is the average error in burglary counts per grid cell; RMSE penalizes larger errors more heavily.

Across districts, MAE falls roughly between 1.3 and 5 burglaries, which means the model usually misses the true count by only a couple of events per cell. Districts with a large number of test cells (for example District 4 or 8) tend to have lower errors, because the model has more stable patterns to learn from. Districts with small or unusual spatial patterns, like District 3 or 18, show higher error, suggesting their burglary patterns don’t generalize well from the rest of the city.

Overall, the model generalizes reasonably well across space: performance varies by district, but prediction error stays within a narrow range.
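The claim that districts with more test cells tend to have lower errors can be checked directly against cv_results; a sketch:

```r
# Sketch: rank correlation between district test-set size and MAE.
# Assumes the cv_results tibble produced by the LOGO loop above.
# A clearly negative value would support "more test cells, lower error".
cor_n_mae <- cor(cv_results$n_test, cv_results$mae, method = "spearman")
cat("Spearman correlation, n_test vs. MAE:", round(cor_n_mae, 2), "\n")
```

A rank correlation is used rather than Pearson because district sizes and errors are both skewed.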

Part 6: Model Evaluation

6.1 Create a Kernel Density Baseline

This part builds a KDE hotspot baseline, my simplest prediction model, so I can later test whether the more complex regression model truly improves on “crime happens where it happened before.”

The KDE baseline asks: “What if crime just happens where it happened before?” It is simple spatial smoothing with no predictors.

Code
# Convert burglaries to ppp (point pattern) format for spatstat
burglaries_ppp <- as.ppp(
  st_coordinates(burglaries),
  W = as.owin(st_bbox(chicagoBoundary))
)

# Calculate KDE with 1km bandwidth
kde_burglaries <- density.ppp(
  burglaries_ppp,
  sigma = 1000,  # 1km bandwidth
  edge = TRUE    # Edge correction
)

# Convert to terra raster (modern approach, not raster::raster)
kde_raster <- rast(kde_burglaries)

# Extract KDE values to fishnet cells
fishnet <- fishnet %>%
  mutate(
    kde_value = terra::extract(
      kde_raster,
      vect(fishnet),
      fun = mean,
      na.rm = TRUE
    )[, 2]  # Extract just the values column
  )
Code
ggplot() +
  geom_sf(data = fishnet, aes(fill = kde_value), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_c(
    name = "KDE Value",
    option = "plasma"
  ) +
  labs(
    title = "Kernel Density Estimation Baseline",
    subtitle = "Simple spatial smoothing of burglary locations"
  ) +
  theme_crime()

6.2 Generate Final Predictions

Code
# Fit final model on all data
final_model <- glm.nb(
  countBurglaries ~ countSanitation + sanitation_nn + 
    dist_to_hotspot,
  data = fishnet_model
)

# Add predictions back to fishnet
fishnet <- fishnet %>%
  mutate(
    prediction_nb = predict(final_model, fishnet_model, type = "response")[match(uniqueID, fishnet_model$uniqueID)]
  )

# Also add KDE predictions (normalize to same scale as counts)
kde_sum <- sum(fishnet$kde_value, na.rm = TRUE)
count_sum <- sum(fishnet$countBurglaries, na.rm = TRUE)
fishnet <- fishnet %>%
  mutate(
    prediction_kde = (kde_value / kde_sum) * count_sum
  )

6.3 Compare Model vs. KDE Baseline

Code
# Create three maps
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "plasma", limits = c(0, 15)) +
  labs(title = "Actual Burglaries") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_nb), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
  labs(title = "Model Predictions (Neg. Binomial)") +
  theme_crime()

p3 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_kde), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
  labs(title = "KDE Baseline Predictions") +
  theme_crime()

p1 + p2 + p3 +
  plot_annotation(
    title = "Actual vs. Predicted Burglaries",
    subtitle = "Does our complex model outperform simple KDE?"
  )

Code
# Calculate performance metrics
comparison <- fishnet %>%
  st_drop_geometry() %>%
  filter(!is.na(prediction_nb), !is.na(prediction_kde)) %>%
  summarize(
    model_mae = mean(abs(countBurglaries - prediction_nb)),
    model_rmse = sqrt(mean((countBurglaries - prediction_nb)^2)),
    kde_mae = mean(abs(countBurglaries - prediction_kde)),
    kde_rmse = sqrt(mean((countBurglaries - prediction_kde)^2))
  )

comparison %>%
  pivot_longer(everything(), names_to = "metric", values_to = "value") %>%
  separate(metric, into = c("approach", "metric"), sep = "_") %>%
  pivot_wider(names_from = metric, values_from = value) %>%
  kable(
    digits = 2,
    caption = "Model Performance Comparison"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Model Performance Comparison
approach mae rmse
model 2.21 3.27
kde 2.06 2.95

Interpretation: The comparison shows that the Negative Binomial model does not outperform a simple KDE baseline.

This is because burglary risk in Chicago is strongly path-dependent: the best predictor of where crime will occur is simply where it has occurred before. Adding sanitation-based disorder features does not increase predictive accuracy and may even introduce noise.

Although KDE outperforms the regression model in predictive accuracy, the regression model provides interpretive insight into how sanitation-related disorder correlates with burglary risk. KDE captures spatial inertia, whereas the model reveals underlying mechanisms.

KDE is a better predictor, but the model is a better explanation.

6.4 Where Does the Model Work Well?

Code
# Calculate errors
fishnet <- fishnet %>%
  mutate(
    error_nb = countBurglaries - prediction_nb,
    error_kde = countBurglaries - prediction_kde,
    abs_error_nb = abs(error_nb),
    abs_error_kde = abs(error_kde)
  )

# Map errors
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = error_nb), color = NA) +
  scale_fill_gradient2(
    name = "Error",
    low = "#2166ac", mid = "white", high = "#b2182b",
    midpoint = 0,
    limits = c(-10, 10)
  ) +
  labs(title = "Model Errors (Actual - Predicted)") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = abs_error_nb), color = NA) +
  scale_fill_viridis_c(name = "Abs. Error", option = "magma") +
  labs(title = "Absolute Model Errors") +
  theme_crime()

p1 + p2

Interpretation: On the left, red cells indicate under-prediction—places where actual burglaries were higher than the model expected, while blue cells show over-prediction. Both colors appear scattered rather than clustered, which suggests no strong systematic spatial bias: the model isn’t consistently wrong in one part of the city.

The right map shows the absolute size of errors, highlighting where the model misses by larger margins. The largest errors occur in a few high-activity pockets, mostly in the South Side and near the West Side burglary clusters. This implies that the hardest places to predict are exactly the areas with the most burglary volatility.

Part 7: Summary Statistics and Tables

7.1 Model Summary Table

Code
# Create nice summary table
model_summary <- broom::tidy(final_model, exponentiate = TRUE) %>%
  mutate(
    across(where(is.numeric), ~round(., 3))
  )

model_summary %>%
  kable(
    caption = "Final Negative Binomial Model Coefficients (Exponentiated)",
    col.names = c("Variable", "Rate Ratio", "Std. Error", "Z", "P-Value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  footnote(
    general = "Rate ratios > 1 indicate positive association with burglary counts."
  )
Final Negative Binomial Model Coefficients (Exponentiated)

| Variable        | Rate Ratio | Std. Error | Z       | P-Value |
|-----------------|------------|------------|---------|---------|
| (Intercept)     | 9.654      | 0.062      | 36.763  | 0.000   |
| countSanitation | 1.003      | 0.002      | 1.634   | 0.102   |
| sanitation_nn   | 0.995      | 0.000      | -18.931 | 0.000   |
| dist_to_hotspot | 1.000      | 0.000      | -9.345  | 0.000   |

Note: Rate ratios > 1 indicate positive association with burglary counts.
Source Code
---
title: "Assignment 4: Spatial Predictive Analysis"
author: "Lingxuan Gao"
date: last-modified
format:
  html:
    code-fold: true
    code-tools: true
    toc: true
    toc-depth: 3
    toc-location: left
    theme: cosmo
    embed-resources: true
editor: visual
execute:
  warning: false
  message: false
---

# Overview

In this lab, I apply spatial predictive modeling techniques to Chicago's 2017 burglary data, using 311 sanitation code complaints as a measure of neighborhood disorder.

# Setup

```{r setup}
# Load required packages
library(tidyverse)      # Data manipulation
library(sf)             # Spatial operations
library(here)           # Relative file paths
library(viridis)        # Color scales
library(terra)          # Raster operations (replaces 'raster')
library(spdep)          # Spatial dependence
library(FNN)            # Fast nearest neighbors
library(MASS)           # Negative binomial regression
library(patchwork)      # Plot composition (replaces grid/gridExtra)
library(knitr)          # Tables
library(kableExtra)     # Table formatting
library(classInt)       # Classification intervals

# Spatstat split into sub-packages
library(spatstat.geom)    # Spatial geometries
library(spatstat.explore) # Spatial exploration/KDE

# Set options
options(scipen = 999)  # No scientific notation
set.seed(5080)         # Reproducibility

# Create consistent theme for visualizations
theme_crime <- function(base_size = 11) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold", size = base_size + 1),
      plot.subtitle = element_text(color = "gray30", size = base_size - 1),
      legend.position = "right",
      panel.grid.minor = element_blank(),
      axis.text = element_blank(),
      axis.title = element_blank()
    )
}

# Set as default
theme_set(theme_crime())

```

# Part 1: Data Loading & Exploration

::: callout-note
Why choose Sanitation Code Complaints as a predictor?

1.  **Visible Environmental Disorder**: Sanitation violations (illegal dumping, overflowing garbage, unsanitary conditions) are highly visible markers of neighborhood neglect that signal low social control and guardianship.

2.  **Property Maintenance Connection**: Sanitation code violations often correlate with poorly maintained properties - the same conditions that make areas vulnerable to burglary.

3.  **Temporal Stability**: Unlike some 311 calls that are seasonal, sanitation violations tend to persist, making them reliable indicators of chronic neighborhood conditions.

**Expected Relationship**: I hypothesize a positive correlation between sanitation complaints and burglaries, where areas with higher sanitation violations will experience elevated burglary rates.
:::

## 1.1 Load 311 data: Sanitation Code Complaints in Chicago during 2017

```{r load data}
#| label: Load 311 Data
#| message: false
#| warning: false

# Load sanitation code complaints
sanitation_raw <- read_csv(here("data", "311_Service_Requests_-_Sanitation_Code_Complaints.csv"))

# Process and filter for 2017 only
sanitation <- sanitation_raw %>%
  # Parse the existing Creation Date column 
  mutate(creation_date = mdy(`Creation Date`)) %>%

# Filter for 2017 only
  filter(
    year(creation_date) == 2017,
    !is.na(Latitude), 
    !is.na(Longitude)
  ) %>%
  # Convert to spatial object
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326) %>%
  st_transform('ESRI:102271')
```

## 1.2 Load Burglary Data

```{r load-burglaries}
#| label: Load Burglary Data
#| message: false
#| warning: false
# Load from provided data file (downloaded from Chicago open data portal)
burglaries <- st_read(here("data", "burglaries.shp")) %>% 
  st_transform('ESRI:102271')
```

## 1.3 Load Chicago Spatial Data

```{r load-boundaries}
#| label: Load Boundary Data
#| message: false
#| warning: false

# Load police districts (used for spatial cross-validation)
policeDistricts <- 
  st_read("https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = dist_num)

# Load police beats (smaller administrative units)
policeBeats <- 
  st_read("https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(Beat = beat_num)

# Load Chicago boundary
chicagoBoundary <- 
  st_read("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson") %>%
  st_transform('ESRI:102271')

```

## 1.4 The spatial distribution of Sanitation Code Complaints

```{r visualize-points1}
#| fig-width: 10
#| fig-height: 5

# Simple point map
p1 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_sf(data = sanitation, color = "#d62828", size = 0.1, alpha = 0.4) +
  labs(
    title = "Sanitation Code Complaints",
    subtitle = paste0("Chicago 2017, n = ", nrow(sanitation))
  )

# Density surface using modern syntax
p2 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_density_2d_filled(
    data = data.frame(st_coordinates(sanitation)),
    aes(X, Y),
    alpha = 0.7,
    bins = 8
  ) +
  scale_fill_viridis_d(
    option = "plasma",
    direction = -1,
    guide = "none"  # Modern ggplot2 syntax (not guide = FALSE)
  ) +
  labs(
    title = "Density Surface",
    subtitle = "Kernel density estimation"
  )

# Combine plots
p1 + p2 + 
  plot_annotation(
    title = "Sanitation Code Complaints",
    tag_levels = 'A'
  )
```

**Spatial Pattern of Sanitation Complaints:**

The point pattern map (Figure A) reveals **highly clustered** sanitation complaints concentrated in Chicago's central and south-central corridors. The pattern is far from random: the kernel density surface (Figure B) identifies three distinct hot spots, with peak densities in the mid-north, central-south, and southern districts.

## 1.5 The spatial distribution of Burglaries

```{r visualize-points2}
#| fig-width: 10
#| fig-height: 5

# Simple point map
p1 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_sf(data = burglaries, color = "#d62828", size = 0.1, alpha = 0.4) +
  labs(
    title = "Burglary Locations",
    subtitle = paste0("Chicago 2017, n = ", nrow(burglaries))
  )

# Density surface using modern syntax
p2 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_density_2d_filled(
    data = data.frame(st_coordinates(burglaries)),
    aes(X, Y),
    alpha = 0.7,
    bins = 8
  ) +
  scale_fill_viridis_d(
    option = "plasma",
    direction = -1,
    guide = "none"  # Modern ggplot2 syntax (not guide = FALSE)
  ) +
  labs(
    title = "Density Surface",
    subtitle = "Kernel density estimation"
  )

# Combine plots using patchwork (modern approach)
p1 + p2 + 
  plot_annotation(
    title = "Spatial Distribution of Burglaries in Chicago",
    tag_levels = 'A'
  )
```

**Spatial Patterns of Burglaries**

Burglaries are **highly unevenly distributed** across Chicago. The point pattern map reveals dense concentrations forming continuous corridors through central and south-central districts, while northern and far southern edges show sparse activity. The kernel density surface identifies **three major hot spots**: a dominant mid-north concentration, a central-south band, and a southern cluster.

# Part 2: Fishnet Grid Creation

## 2.1 Fishnet Creation

```{r create-fishnet}
# Create 500m x 500m grid
fishnet <- st_make_grid(
  chicagoBoundary,
  cellsize = 500,  # 500 meters per cell
  square = TRUE
) %>%
  st_sf() %>%
  mutate(uniqueID = row_number())

# Keep only cells that intersect Chicago
fishnet <- fishnet[chicagoBoundary, ]

# View basic info
info_tbl <- tibble::tibble(
  Metric = c("Number of cells", "Cell size", "Cell area (m²)"),
  Value  = c(
    scales::comma(nrow(fishnet)),
    "500 × 500 m",
    scales::comma(round(as.numeric(units::drop_units(sf::st_area(fishnet[1, ])))))
  )
)

knitr::kable(
  info_tbl,
  caption   = "Fishnet Summary",
  col.names = c("Metric", "Value"),
  align     = c("l","r"),
  booktabs  = TRUE
) |>
  kableExtra::kable_styling(full_width = FALSE, position = "left",
                            bootstrap_options = c("striped","hover","condensed")) |>
  kableExtra::column_spec(1, bold = TRUE)
```

## 2.2 Aggregate violations to grid cells

```{r aggregate-burglaries}
# Spatial join: which cell contains each burglary?
burglaries_fishnet <- st_join(burglaries, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(countBurglaries = n())

# Join back to fishnet (cells with 0 burglaries will be NA)
fishnet <- fishnet %>%
  left_join(burglaries_fishnet, by = "uniqueID") %>%
  mutate(countBurglaries = replace_na(countBurglaries, 0))

# Compute summary statistics
burglaries_sum<- summary(fishnet$countBurglaries)

#Identify Zero-count cells
bur_zero_n <- sum(fishnet$countBurglaries == 0)
bur_zero_pct   <- 100 * bur_zero_n / nrow(fishnet)

# Pull the six standard numeric summary fields in order
fields <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
vals <- round(as.numeric(burglaries_sum[fields]), 2)

# Convert to character with two decimal places
vals_fmt <- sprintf("%.2f", vals)

# Build final table
summary_tbl <- tibble::tibble(
  Statistic = c(
    "Min",
    "1st Quartile",
    "Median",
    "Mean",
    "3rd Quartile",
    "Max",
    "Cells with zero Burglaries"
  ),
  Value = c(
    vals_fmt,
    sprintf(
      "%s / %s (%.1f%%)",
      scales::comma(bur_zero_n),
      scales::comma(nrow(fishnet)),
      bur_zero_pct
    )
  )
)

# Render table
knitr::kable(
  summary_tbl,
  caption = "Summary of Burglary Counts per Grid Cell",
  col.names = c("Statistic", "Value"),
  align = c("l", "r"),
  booktabs = TRUE
) %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
```

```{r aggregate-sanitation}

# Spatial join: which cell contains each sanitation complaints?
sanitation_fishnet <- st_join(sanitation, fishnet, join = st_intersects) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(countSanitation = n())

# Join back to fishnet (cells with 0 sanitation complaints)
fishnet <- fishnet %>%
  left_join(sanitation_fishnet, by = "uniqueID") %>%
  mutate(countSanitation = replace_na(countSanitation, 0))

# Compute summary statistics
sanitation_sum<- summary(fishnet$countSanitation)

#Identify Zero-count cells
san_zero_n <- sum(fishnet$countSanitation == 0)
san_zero_pct   <- 100 * san_zero_n / nrow(fishnet)

# Pull the six standard numeric summary fields in order
fields <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
vals <- round(as.numeric(sanitation_sum[fields]), 2)

# Convert to character with two decimal places
vals_fmt <- sprintf("%.2f", vals)

# Build final table
summary_tbl <- tibble::tibble(
  Statistic = c(
    "Min",
    "1st Quartile",
    "Median",
    "Mean",
    "3rd Quartile",
    "Max",
    "Cells with zero sanitation complaints"
  ),
  Value = c(
    vals_fmt,
    sprintf(
      "%s / %s (%.1f%%)",
      scales::comma(san_zero_n),
      scales::comma(nrow(fishnet)),
      san_zero_pct
    )
  )
)

# Render table
knitr::kable(
  summary_tbl,
  caption = "Summary of Sanitation Complaints Counts per Grid Cell",
  col.names = c("Statistic", "Value"),
  align = c("l", "r"),
  booktabs = TRUE
) %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
```

## 2.3 Count Distribution Visualization

```{r visualize-fishnet}
#| fig-width: 10
#| fig-height: 4

p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = countSanitation), color = NA) +
  scale_fill_viridis_c(
    name = "Count", 
    option = "plasma",
    trans = "sqrt", 
    breaks = c(0, 10, 20, 40, 80, 160)) +
  labs(title = "Sanitation Complaints") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  scale_fill_viridis_c(
    name = "Count",
    option = "plasma",
    trans = "sqrt", 
    breaks = c(0, 1, 5, 10, 20, 40))+      
  labs(title = "Burglaries") +
  theme_crime()

p1 + p2 +
  plot_annotation(title = "Are sanitation complaints and burglaries correlated?")

```

**Summary**: Sanitation complaints and burglaries show broadly similar spatial patterns, with higher concentrations in the central and denser parts of the city.
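As a quick numeric check on the side-by-side maps, the two grid-cell counts can be correlated directly. A minimal sketch, assuming `fishnet` already carries `countSanitation` and `countBurglaries` from the aggregation steps in Section 2.2:

```r
library(dplyr)
library(sf)

# Correlate the two per-cell counts; Spearman is included because both
# distributions are heavily zero-inflated and right-skewed
fishnet %>%
  st_drop_geometry() %>%
  summarize(
    pearson  = cor(countSanitation, countBurglaries),
    spearman = cor(countSanitation, countBurglaries, method = "spearman")
  )
```

A rank-based measure like Spearman is often the more robust of the two here, given the many zero-count cells.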

# Part 3: Spatial Features

## 3.1 K-Nearest Neighbor Features Calculation

```{r nn-feature}
#| message: false
#| warning: false

# Calculate mean distance to 3 nearest sanitation complaints

# Extract centroids for fishnet cells
fishnet_coords <- st_coordinates(st_centroid(fishnet))

# Extract coordinates of sanitation complaints
sanitation_coords <- st_coordinates(sanitation)

# Compute the 3 nearest sanitation-complaint neighbors for each grid cell
nn_result <- get.knnx(sanitation_coords, fishnet_coords, k = 3)

# Add feature to fishnet
fishnet <- fishnet %>%
  mutate(
    sanitation_nn = rowMeans(nn_result$nn.dist)   # mean distance to 3 nearest sanitation complaints
  )


summary(fishnet$sanitation_nn)
```

The distance summary breaks down as follows:

- **Minimum: 1 meter**: three sanitation complaint requests occurred just one meter away from a cell centroid.
- **1st Quartile: 108 meters**: about a two-minute walk, meaning that in 25% of Chicago's grid cells the nearest sanitation complaints are within one block.
- **Median: 181 meters**: half of all grid cells have their three nearest sanitation complaints within about two blocks.
- **Mean: 295 meters**: the mean far exceeds the median, revealing that the distribution of sanitation complaints is extremely skewed.
- **3rd Quartile: 332 meters**: three-quarters of grid cells have their nearest neighbors within 332 m (\~4 city blocks).
- **Maximum: 2213 meters**: these are the sparsest parts of the city.

## 3.2 Perform Local Moran's I analysis

```{r local-morans-sanitation}
# Function unchanged
calculate_local_morans <- function(data, variable, k = 5) {

# Create spatial weights  
  coords <- st_coordinates(st_centroid(data))
  neighbors <- knn2nb(knearneigh(coords, k = k))
  weights <- nb2listw(neighbors, style = "W", zero.policy = TRUE)

# Calculate Local Moran's I  
  local_moran <- localmoran(data[[variable]], weights)
  
# Classify clusters
  mean_val <- mean(data[[variable]], na.rm = TRUE)
  
  data %>%
    mutate(
      local_i = local_moran[, 1],
      p_value = local_moran[, 5],
      is_significant = p_value < 0.05,
      
      moran_class = case_when(
        !is_significant ~ "Not Significant",
        local_i > 0 & .data[[variable]] > mean_val ~ "High-High",
        local_i > 0 & .data[[variable]] <= mean_val ~ "Low-Low",
        local_i < 0 & .data[[variable]] > mean_val ~ "High-Low",
        local_i < 0 & .data[[variable]] <= mean_val ~ "Low-High",
        TRUE ~ "Not Significant"
      )
    )
}

# Apply to Sanitation Complaints counts
fishnet <- calculate_local_morans(fishnet, "countSanitation", k = 5)
```

## 3.3 Identify hot spots and cold spots

```{r visualize-morans-sanitation}
#| fig-width: 8
#| fig-height: 6

ggplot() +
  geom_sf(
    data = fishnet,
    aes(fill = moran_class),
    color = NA
  ) +
  scale_fill_manual(
    values = c(
      "High-High" = "#d7191c",   # Red = hot spot
      "High-Low" = "#fdae61",
      "Low-High" = "#abd9e9",
      "Low-Low" = "#2c7bb6",
      "Not Significant" = "gray90"
    ),
    name = "Cluster Type"
  ) +
  labs(
    title = "Local Moran's I: Sanitation Code Complaint Clusters",
    subtitle = "High-High = Hot spots of disorder"
  ) +
  theme_crime()
```

This Local Moran's I map confirms that sanitation complaints are spatially clustered: the significant High-High cells mark hot spots of disorder, which serve as the basis for the distance-to-hotspot feature in the next step.
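The map shows where clustering occurs; a global Moran's I test condenses it into one statistic. A sketch, rebuilding the same k = 5 nearest-neighbor weights used inside `calculate_local_morans()` (the function keeps its weights local, so they are reconstructed here):

```r
library(sf)
library(spdep)

# Rebuild k = 5 kNN weights on the fishnet centroids
coords    <- st_coordinates(st_centroid(fishnet))
neighbors <- knn2nb(knearneigh(coords, k = 5))
weights   <- nb2listw(neighbors, style = "W", zero.policy = TRUE)

# Global test: a positive, significant I confirms spatial clustering
moran.test(fishnet$countSanitation, weights, zero.policy = TRUE)
```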

## 3.4 Create distance-to-hotspot measures

```{r distance-to-hotspots-sanitation}
# Get centroids of "High-High" cells 
hotspots <- fishnet %>%
  filter(moran_class == "High-High") %>%
  st_centroid()

# Calculate distance from each grid cell to hot spot
if (nrow(hotspots) > 0) {
  
  fishnet <- fishnet %>%
    mutate(
      dist_to_hotspot = as.numeric(
        st_distance(st_centroid(fishnet), hotspots %>% st_union())
      )
    )

  cat("  - Number of hot spot cells:", nrow(hotspots), "\n")
  
} else {
  
  fishnet <- fishnet %>%
    mutate(dist_to_hotspot = 0)
  
  cat("⚠ No significant sanitation hot spots found\n")
}
```

# Part 4: Count Regression Models

I’ll use police districts for the spatial cross-validation.

## 4.1 Join district information to fishnet

```{r join-districts}
fishnet <- st_join(
  fishnet,
  policeDistricts,
  join = st_within,
  left = TRUE
) %>%
  filter(!is.na(District))  # Remove cells outside districts

cat("  - Districts:", length(unique(fishnet$District)), "\n")
cat("  - Cells:", nrow(fishnet), "\n")
```

## 4.2 Fit Poisson regression

```{r prepare-data}
fishnet_model <- fishnet %>%
  st_drop_geometry() %>%
  dplyr::select(
    uniqueID,
    District,
    countBurglaries,
    countSanitation,
    sanitation_nn,
    dist_to_hotspot
  ) %>%
  na.omit()  # Remove any remaining NAs

cat("  - Observations:", nrow(fishnet_model), "\n")
cat("  - Variables:", ncol(fishnet_model), "\n")
```

```{r fit poisson}

model_poisson <- glm(
  countBurglaries ~ countSanitation + sanitation_nn + 
    dist_to_hotspot,
  data = fishnet_model,
  family = "poisson"
)

# Summary
summary(model_poisson)

```

**Interpretation**: The model predicts burglary counts from three sanitation-related features:

- countSanitation: how many sanitation complaints fall inside a grid cell
- sanitation_nn: average distance to the 3 nearest sanitation complaints
- dist_to_hotspot: distance to the nearest sanitation hotspot (cluster of complaints)

The coefficient on countSanitation is positive: each additional sanitation complaint inside a cell is associated with a 0.26% increase in expected burglaries.

The coefficient on sanitation_nn is strongly negative: each additional meter of distance from sanitation complaints is associated with a roughly 0.45% drop in expected burglaries.

The coefficient on dist_to_hotspot is slightly negative: being farther from a sanitation hotspot is linked to slightly fewer burglaries.

All three predictors are statistically significant.

In conclusion, burglary risk rises where sanitation complaints are concentrated or close by.
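The percent changes quoted above come from exponentiating the log-scale Poisson coefficients. A sketch of that conversion, assuming `model_poisson` from the chunk above:

```r
# Percent change in expected burglaries per one-unit increase of each
# predictor: 100 * (exp(beta) - 1); e.g. a log coefficient of about
# 0.0026 corresponds to roughly a 0.26% increase
pct_change <- 100 * (exp(coef(model_poisson)) - 1)
round(pct_change, 2)
```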

## 4.3 Check for Overdispersion

Poisson regression assumes the mean equals the variance. Real count data often violate this assumption (overdispersion).

```{r check-overdispersion}
# Calculate dispersion parameter
dispersion <- sum(residuals(model_poisson, type = "pearson")^2) / 
              model_poisson$df.residual

cat("Dispersion parameter:", round(dispersion, 2), "\n")

cat("Rule of thumb: >1.5 suggests overdispersion\n")

if (dispersion > 1.5) {
  cat("⚠ Overdispersion detected! Consider Negative Binomial model.\n")
} else {
  cat("Dispersion looks okay for Poisson model.\n")
}
```

## 4.4 Negative Binomial Regression

If the counts are overdispersed, use **Negative Binomial regression**, which adds a dispersion parameter and relaxes the mean = variance assumption.

```{r fit-negbin}
# Fit Negative Binomial model
model_nb <- glm.nb(
  countBurglaries ~ countSanitation + sanitation_nn + 
    dist_to_hotspot,
  data = fishnet_model
)

# Summary
summary(model_nb)
```

## 4.5 Compare model fit (AIC)

```{r model comparison}
# Compare AIC (lower is better)
cat("\nModel Comparison:\n")
cat("Poisson AIC:", round(AIC(model_poisson), 1), "\n")
cat("Negative Binomial AIC:", round(AIC(model_nb), 1), "\n")
```

**Negative Binomial Model Interpretation**: The negative binomial model fits the data much better than the Poisson model (AIC drops from 8126 to 7151), confirming that burglary counts are overdispersed.

Cells closer to sanitation complaints (sanitation_nn) show higher burglary risk: the coefficient is strongly negative and highly significant, meaning risk increases as distance shrinks. The same pattern holds for distance to sanitation hotspots: burglary rates are higher near clusters of complaints.

The within-cell count of sanitation complaints (countSanitation) becomes non-significant once the distance features are included, suggesting that proximity to disorder matters more than the raw count inside a grid cell.
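Besides comparing AIC, the two models can be tested directly: Poisson is nested in the negative binomial (it is the limit as the dispersion parameter goes to infinity), so a likelihood-ratio test applies. A sketch, assuming `model_poisson` and `model_nb` from the chunks above; because the dispersion parameter sits on the boundary of its space, the usual chi-squared p-value is halved:

```r
# LR test: Poisson (restricted) vs. negative binomial (with dispersion)
lr_stat <- as.numeric(2 * (logLik(model_nb) - logLik(model_poisson)))
p_value <- pchisq(lr_stat, df = 1, lower.tail = FALSE) / 2  # boundary correction
c(LR_statistic = lr_stat, p_value = p_value)
```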

# Part 5: Spatial Cross-Validation

Standard cross-validation splits the data randomly, but with spatial data that means training on cells right next to test cells (information leakage!). **Leave-One-Group-Out (LOGO) cross-validation** avoids this by training on all districts except one, then testing on the held-out district. I therefore use LOGO cross-validation here.

```{r spatial-cv}
# Get unique districts
districts <- unique(fishnet_model$District)
cv_results <- tibble()

for (i in seq_along(districts)) {
  
  test_district <- districts[i]
  
  # Split data
  train_data <- fishnet_model %>% filter(District != test_district)
  test_data <- fishnet_model %>% filter(District == test_district)
  
  # Fit model on training data
  model_cv <- glm.nb(
    countBurglaries ~ countSanitation + sanitation_nn + 
      dist_to_hotspot,
    data = train_data
  )
  
  # Predict on test data
  test_data <- test_data %>%
    mutate(
      prediction = predict(model_cv, test_data, type = "response")
    )
  
  # Calculate metrics
  mae <- mean(abs(test_data$countBurglaries - test_data$prediction))
  rmse <- sqrt(mean((test_data$countBurglaries - test_data$prediction)^2))
  
  # Store results
  cv_results <- bind_rows(
    cv_results,
    tibble(
      fold = i,
      test_district = test_district,
      n_test = nrow(test_data),
      mae = mae,
      rmse = rmse
    )
  )
  
}

# Overall results
cat("Mean MAE:", round(mean(cv_results$mae), 2), "\n")
cat("Mean RMSE:", round(mean(cv_results$rmse), 2), "\n")
```

```{r cv-results-table}
# Show results
cv_results %>%
  arrange(desc(mae)) %>%
  kable(
    digits = 2,
    caption = "LOGO CV Results by District"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```

**Interpretation of the LOGO Cross-Validation Results**: The table shows how well the model predicts burglaries when each police district is held out as completely unseen test data. MAE is the average error in burglary counts per grid cell; RMSE penalizes larger errors more heavily.

Across districts, MAE falls roughly between 1.3 and 5 burglaries, meaning the model usually misses the true count by only a couple of events per cell. Districts with many test cells (for example, Districts 4 and 8) tend to have lower errors, because the model has more stable patterns to learn from. Districts with small or unusual spatial patterns, like Districts 3 and 18, show higher error, suggesting their burglary patterns don't generalize well from the rest of the city.

Overall, the model generalizes reasonably well across space: performance varies by district, but prediction error stays within a narrow range.
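One way to read the district-level variation is to chart MAE per held-out district. A minimal sketch using the `cv_results` tibble built above; `theme_minimal()` is added because the map-oriented default theme blanks axis text:

```r
library(ggplot2)

ggplot(cv_results, aes(x = reorder(test_district, mae), y = mae)) +
  geom_col(fill = "#2c7bb6") +
  coord_flip() +
  theme_minimal() +  # override theme_crime(), which hides axis text and titles
  labs(
    title = "LOGO CV: MAE by Held-Out District",
    x = "Held-out district",
    y = "Mean absolute error (burglaries per cell)"
  )
```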

# Part 6: Model Evaluation

## 6.1 Create a Kernel Density Baseline

This part builds a KDE hotspot baseline, my simplest prediction model, so I can later test whether the more complex regression model truly improves on "crime happens where it happened before."

**The KDE baseline asks:** "What if crime just happens where it happened before?" (simple spatial smoothing, no predictors)

```{r kde-baseline}
#| message: false

# Convert burglaries to ppp (point pattern) format for spatstat
burglaries_ppp <- as.ppp(
  st_coordinates(burglaries),
  W = as.owin(st_bbox(chicagoBoundary))
)

# Calculate KDE with 1km bandwidth
kde_burglaries <- density.ppp(
  burglaries_ppp,
  sigma = 1000,  # 1km bandwidth
  edge = TRUE    # Edge correction
)

# Convert to terra raster (modern approach, not raster::raster)
kde_raster <- rast(kde_burglaries)

# Extract KDE values to fishnet cells
fishnet <- fishnet %>%
  mutate(
    kde_value = terra::extract(
      kde_raster,
      vect(fishnet),
      fun = mean,
      na.rm = TRUE
    )[, 2]  # Extract just the values column
  )

```

```{r visualize-kde}
#| fig-width: 8
#| fig-height: 6

ggplot() +
  geom_sf(data = fishnet, aes(fill = kde_value), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_c(
    name = "KDE Value",
    option = "plasma"
  ) +
  labs(
    title = "Kernel Density Estimation Baseline",
    subtitle = "Simple spatial smoothing of burglary locations"
  ) +
  theme_crime()
```

## 6.2 Generate Final Predictions

```{r final-predictions}
# Fit final model on all data
final_model <- glm.nb(
  countBurglaries ~ countSanitation + sanitation_nn + 
    dist_to_hotspot,
  data = fishnet_model
)

# Add predictions back to fishnet
fishnet <- fishnet %>%
  mutate(
    prediction_nb = predict(final_model, fishnet_model, type = "response")[match(uniqueID, fishnet_model$uniqueID)]
  )

# Also add KDE predictions (normalize to same scale as counts)
kde_sum <- sum(fishnet$kde_value, na.rm = TRUE)
count_sum <- sum(fishnet$countBurglaries, na.rm = TRUE)
fishnet <- fishnet %>%
  mutate(
    prediction_kde = (kde_value / kde_sum) * count_sum
  )
```

## 6.3 Compare Model vs. KDE Baseline

```{r compare-models}
#| fig-width: 12
#| fig-height: 4

# Create three maps
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "plasma", limits = c(0, 15)) +
  labs(title = "Actual Burglaries") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_nb), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
  labs(title = "Model Predictions (Neg. Binomial)") +
  theme_crime()

p3 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_kde), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
  labs(title = "KDE Baseline Predictions") +
  theme_crime()

p1 + p2 + p3 +
  plot_annotation(
    title = "Actual vs. Predicted Burglaries",
    subtitle = "Does our complex model outperform simple KDE?"
  )
```

```{r model-comparison-metrics}
# Calculate performance metrics
comparison <- fishnet %>%
  st_drop_geometry() %>%
  filter(!is.na(prediction_nb), !is.na(prediction_kde)) %>%
  summarize(
    model_mae = mean(abs(countBurglaries - prediction_nb)),
    model_rmse = sqrt(mean((countBurglaries - prediction_nb)^2)),
    kde_mae = mean(abs(countBurglaries - prediction_kde)),
    kde_rmse = sqrt(mean((countBurglaries - prediction_kde)^2))
  )

comparison %>%
  pivot_longer(everything(), names_to = "metric", values_to = "value") %>%
  separate(metric, into = c("approach", "metric"), sep = "_") %>%
  pivot_wider(names_from = metric, values_from = value) %>%
  kable(
    digits = 2,
    caption = "Model Performance Comparison"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```

**Interpretation**: The comparison shows that the Negative Binomial model does not outperform the simple KDE baseline: KDE posts a lower MAE (2.06 vs. 2.21) and RMSE (2.95 vs. 3.27).

This is likely because burglary risk in Chicago is strongly path-dependent: the best predictor of where crime will occur is simply where it has occurred before. Adding sanitation-based disorder features does not increase predictive accuracy and may introduce noise.

Although KDE outperforms the regression model in predictive accuracy, the regression model offers interpretive insight into how sanitation-related disorder correlates with burglary risk. KDE captures spatial inertia, whereas the model points to underlying mechanisms.

In short, KDE is the better predictor, but the model is the better explanation.

## 6.4 Where Does the Model Work Well?

```{r prediction-errors}
#| fig-width: 10
#| fig-height: 5

# Calculate errors
fishnet <- fishnet %>%
  mutate(
    error_nb = countBurglaries - prediction_nb,
    error_kde = countBurglaries - prediction_kde,
    abs_error_nb = abs(error_nb),
    abs_error_kde = abs(error_kde)
  )

# Map errors
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = error_nb), color = NA) +
  scale_fill_gradient2(
    name = "Error",
    low = "#2166ac", mid = "white", high = "#b2182b",
    midpoint = 0,
    limits = c(-10, 10)
  ) +
  labs(title = "Model Errors (Actual - Predicted)") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = abs_error_nb), color = NA) +
  scale_fill_viridis_c(name = "Abs. Error", option = "magma") +
  labs(title = "Absolute Model Errors") +
  theme_crime()

p1 + p2
```

**Interpretation**: On the left, red cells indicate under-prediction—places where actual burglaries were higher than the model expected, while blue cells show over-prediction. Both colors appear scattered rather than clustered, which suggests no strong systematic spatial bias: the model isn’t consistently wrong in one part of the city.

The right map shows the absolute size of errors, highlighting where the model misses by larger margins. The largest errors occur in a few high-activity pockets, mostly in the South Side and near the West Side burglary clusters. This implies that the hardest places to predict are exactly the areas with the most burglary volatility.
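The visual impression of scattered errors can be tested formally with a Moran's I test on the residuals. A sketch, dropping cells without predictions and rebuilding k = 5 kNN weights on the remaining centroids:

```r
library(dplyr)
library(sf)
library(spdep)

# Keep only cells with a model prediction (and hence a defined error)
resid_net <- fishnet %>% filter(!is.na(error_nb))

coords    <- st_coordinates(st_centroid(resid_net))
neighbors <- knn2nb(knearneigh(coords, k = 5))
weights   <- nb2listw(neighbors, style = "W", zero.policy = TRUE)

# A non-significant I supports "no systematic spatial bias";
# significant positive I would point to a missing spatial predictor
moran.test(resid_net$error_nb, weights, zero.policy = TRUE)
```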

# Part 7: Summary Statistics and Tables

## 7.1 Model Summary Table

```{r model-summary-table}
# Create nice summary table
model_summary <- broom::tidy(final_model, exponentiate = TRUE) %>%
  mutate(
    across(where(is.numeric), ~round(., 3))
  )

model_summary %>%
  kable(
    caption = "Final Negative Binomial Model Coefficients (Exponentiated)",
    col.names = c("Variable", "Rate Ratio", "Std. Error", "Z", "P-Value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  footnote(
    general = "Rate ratios > 1 indicate positive association with burglary counts."
  )
```