Spatial Prediction of Burglaries with 311 Graffiti Removal Requests

MUSA 5080 - Fall 2025

Author

Isabelle Li

Published

December 11, 2025

Part 1: Introduction and Data Exploration

In this project, I use Graffiti Removal 311 requests as the main predictor to explore whether neighborhood disorder can help explain the spatial distribution of burglaries in Chicago. I chose graffiti because it is a common and consistently reported 311 issue, it has clear spatial patterns, and it is often used as a proxy for visible disorder in urban areas.

While graffiti has no direct causal link to burglary, it may reflect broader neighborhood conditions such as maintenance levels, social stress, or public activity, which could indirectly relate to crime patterns. My goal is to test whether this type of environmental signal provides predictive value and to understand how well disorder-based features perform compared to a purely spatial baseline like kernel density estimation (KDE).

There are 7,482 burglary incidents in the dataset.
These records come from the Chicago Crimes 2017 data, representing all reported burglary (forcible entry) cases during that calendar year.

Datasets are projected into Illinois State Plane East (EPSG:102271) for accurate distance and area calculations.

1.2: Visualize Point Data

Code

# Simple point map
p1 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_sf(data = burglaries, color = "#d62828", size = 0.1, alpha = 0.4) +
  labs(
    title = "Burglary Locations",
    subtitle = paste0("Chicago 2017, n = ", nrow(burglaries))
  )

# Density surface using modern syntax
p2 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_density_2d_filled(
    data = data.frame(st_coordinates(burglaries)),
    aes(X, Y),
    alpha = 0.7,
    bins = 8
  ) +
  scale_fill_viridis_d(
    option = "plasma",
    direction = -1,
    guide = "none"  # Modern ggplot2 syntax (not guide = FALSE)
  ) +
  labs(
    title = "Density Surface",
    subtitle = "Kernel density estimation"
  )

# Combine plots using patchwork (modern approach)
p1 + p2 + 
  plot_annotation(
    title = "Spatial Distribution of Burglaries in Chicago",
    tag_levels = 'A'
  )

The burglary incidents are not evenly distributed across Chicago. Instead, they exhibit a clear clustered pattern, with several well-defined hot spots.

The highest concentrations appear in:

the West Side (especially near Humboldt Park and Garfield Park),
parts of the South Side, and
portions of the Near North / Near West areas.

In contrast, the Northwest and Southwest outer neighborhoods show very few incidents, indicating substantially lower burglary activity.

Part 2: Create Fishnet Grid

2.1: Creating the Fishnet

A fishnet grid converts irregular point data into a regular grid of cells where we can:

Aggregate counts
Calculate spatial features
Apply regression models

Code

# Create 500m x 500m grid
fishnet <- st_make_grid(
  chicagoBoundary,
  cellsize = 500,  # 500 meters per cell
  square = TRUE
) %>%
  st_sf() %>%
  mutate(uniqueID = row_number())

# Keep only cells that intersect Chicago
fishnet <- fishnet[chicagoBoundary, ]

We use a regular fishnet grid because it provides equal-sized spatial units, which makes burglary counts directly comparable across the entire study area. Neighborhoods or census tracts vary widely in size and shape, and those differences can distort the spatial patterns we observe or bias the regression results. A uniform grid avoids those issues and creates a consistent framework for aggregating incidents, calculating spatial features, and fitting predictive models.

2.2: Aggregate Burglaries to Grid

Code

burglaries_fishnet <- st_join(burglaries, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(countBurglaries = n())

# Join back to fishnet (cells with 0 burglaries will be NA)
fishnet <- fishnet %>%
  left_join(burglaries_fishnet, by = "uniqueID") %>%
  mutate(countBurglaries = replace_na(countBurglaries, 0))

fishnet %>%
  st_drop_geometry() %>%
  summarize(
    Min = min(countBurglaries),
    Q1 = quantile(countBurglaries, 0.25),
    Median = median(countBurglaries),
    Mean = mean(countBurglaries),
    Q3 = quantile(countBurglaries, 0.75),
    Max = max(countBurglaries),
    Percent_zero = mean(countBurglaries == 0) * 100
  ) %>%
  mutate(across(everything(), ~round(., 2))) %>%
  kable(
    caption = "Summary of Burglary Counts per 500m Grid Cell"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Summary of Burglary Counts per 500m Grid Cell
Min	Q1	Median	Mean	Q3	Max	Percent_zero
0	0	2	3.04	5	40	31.77

The summary table shows that most cells have relatively low counts, while a few cells reach very high values, which confirms that burglary is highly concentrated in a small number of hotspots.

Code

# Visualize aggregated counts
ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_c(
    name = "Burglaries",
    option = "plasma",
    trans = "sqrt",  # Square root for better visualization of skewed data
    breaks = c(0, 1, 5, 10, 20, 40)
  ) +
  labs(
    title = "Burglary Counts by Grid Cell",
    subtitle = "500m x 500m cells, Chicago 2017"
  ) +
  theme_crime()

This map shows the spatial distribution of burglary counts aggregated to 500m × 500m grid cells. Burglary is not evenly distributed across Chicago. Instead, it forms distinct clusters, with noticeably higher concentrations in the West Side and the South Side, while the northern and far-southern areas show very low activity. The pattern also reveals a strong downtown-to-south corridor where incidents are more frequent. This further confirms that burglary follows a highly spatially concentrated and uneven pattern, which is why later steps such as KDE, hotspot detection, and spatially aware cross-validation are necessary.

2.3 Create a Kernel Density Baseline

Code

ggplot() +
  geom_sf(data = fishnet, aes(fill = kde_value), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_c(
    name = "KDE Value",
    option = "plasma"
  ) +
  labs(
    title = "Kernel Density Estimation Baseline",
    subtitle = "Simple spatial smoothing of burglary locations"
  ) +
  theme_crime()

The KDE map produces a much smoother representation of burglary intensity compared to the raw count map.

Part 3: Create Spatial Predictor Variables

3.1: Count of Graffiti Removal per Cell

Code

Removal_fishnet <- st_join(fishnet, Graffiti_removal, join = st_contains) %>%
  st_drop_geometry() %>%
  count(uniqueID, name = "Graffiti_removal")

fishnet$Graffiti_removal <- 0L

idx <- match(Removal_fishnet$uniqueID, fishnet$uniqueID)
fishnet$Graffiti_removal[idx] <- Removal_fishnet$Graffiti_removal

fishnet %>%
  st_drop_geometry() %>%
  summarize(
    Min = min(Graffiti_removal),
    Q1 = quantile(Graffiti_removal, 0.25),
    Median = median(Graffiti_removal),
    Mean = mean(Graffiti_removal),
    Q3 = quantile(Graffiti_removal, 0.75),
    Max = max(Graffiti_removal)
  ) %>%
  mutate(across(everything(), ~round(., 2))) %>%
  kable(
    caption = "Summary of Graffiti Removal 311 Calls per 500m Grid Cell"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Summary of Graffiti Removal 311 Calls per 500m Grid Cell
Min	Q1	Median	Mean	Q3	Max
1	16	80	427.82	487.75	7590

Code

p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = Graffiti_removal), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "magma") +
  labs(title = "Graffiti Removal 311 Calls") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "plasma") +
  labs(title = "Burglaries") +
  theme_crime()

p1 + p2 +
  plot_annotation(title = "Are Graffiti Removal and burglaries correlated?")

Visually, the two patterns do not align very closely. Graffiti Removal calls are concentrated in a few very specific pockets, mainly in the Northwest and parts of the West and Southwest sides. Burglary, however, is more broadly distributed, with multiple hotspots across the West Side, South Side, and the central areas of the city.

There is some general overlap in higher-activity regions, but the intensity and location of clusters differ enough that graffiti does not appear to be a consistent indicator of burglary risk. Graffiti reflects surface-level disorder in public space, while burglary involves private property intrusion, which likely responds to different environmental and situational factors.

3.2: Nearest Neighbor Features

Code

# Calculate mean distance to 3 nearest Graffiti Removal
Graffiti_clean <- Graffiti_removal %>%
  filter(!st_is_empty(geometry))

# Get coordinates
fishnet_coords <- st_coordinates(st_centroid(fishnet))
removal_coords <- st_coordinates(Graffiti_clean)

# Calculate k nearest neighbors and distances
nn_result <- get.knnx(removal_coords, fishnet_coords, k = 3)

# Add to fishnet
fishnet <- fishnet %>%
  mutate(
    Graffiti_clean.nn = rowMeans(nn_result$nn.dist)
  )

fishnet %>%
  st_drop_geometry() %>%
  summarize(
    Min = min(Graffiti_clean.nn),
    Q1 = quantile(Graffiti_clean.nn, 0.25),
    Median = median(Graffiti_clean.nn),
    Mean = mean(Graffiti_clean.nn),
    Q3 = quantile(Graffiti_clean.nn, 0.75),
    Max = max(Graffiti_clean.nn)
  ) %>%
  mutate(across(everything(), ~round(., 2))) %>%
  kable(
    caption = "Summary of Mean Distance to 3 Nearest Graffiti Removal Calls (meters)"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Summary of Mean Distance to 3 Nearest Graffiti Removal Calls (meters)
Min	Q1	Median	Mean	Q3	Max
0.55	28.61	60.28	135.14	137.95	1476.26

The distance-to-graffiti feature shows substantial variation, with a median of ~60 m but a long tail up to 1.4 km. This indicates that some neighborhoods are very close to multiple graffiti complaints, while others are far removed. This spatial heterogeneity helps capture environmental disorder that may relate to burglary risk.

3.3: Distance to Hot Spots

Code

# Visualize hot spots
ggplot() +
  geom_sf(
    data = fishnet, 
    aes(fill = moran_class), 
    color = NA
  ) +
  scale_fill_manual(
    values = c(
      "High-High" = "#d7191c",
      "High-Low" = "#fdae61",
      "Low-High" = "#abd9e9",
      "Low-Low" = "#2c7bb6",
      "Not Significant" = "gray90"
    ),
    name = "Cluster Type"
  ) +
  labs(
    title = "Local Moran's I: Graffiti Removal Clusters",
    subtitle = "High-High = Hot spots of disorder"
  ) +
  theme_crime()

Distance to a cluster of Graffiti Removal calls is more informative than distance to a single graffiti incident because it captures the broader neighborhood context of disorder. A single report might be noisy or idiosyncratic, but a High–High cluster from Local Moran’s I represents an area where both the cell and its neighbors have consistently high graffiti counts. Being close to such a cluster suggests that a grid cell lies within or near a larger zone of visible deterioration or neglect.

Part 4: Model Fitting

4.1: Poisson Regression

Code

# Fit Poisson regression
library(broom)

model_poisson <- glm(
  countBurglaries ~ Graffiti_removal + Graffiti_clean.nn + 
    dist_to_hotspot,
  data = fishnet_model,
  family = "poisson"
)

poisson_tbl <- tidy(model_poisson) %>% 
  mutate(
    estimate = round(estimate, 4),
    std.error = round(std.error, 4),
    statistic = round(statistic, 2),
    p.value = round(p.value, 4)
  )

kable(
  poisson_tbl,
  caption = "Poisson Regression Results"
) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Poisson Regression Results
term	estimate	std.error	statistic	p.value
(Intercept)	1.5900	0.0260	61.15	0.0000
Graffiti_removal	0.0001	0.0000	4.74	0.0000
Graffiti_clean.nn	-0.0063	0.0003	-22.46	0.0000
dist_to_hotspot	0.0000	0.0000	3.76	0.0002

All three predictors are statistically significant (p < 0.001).

Graffiti_removal has a positive association with burglary counts.

More graffiti complaints → slightly more burglaries.

Graffiti_clean.nn has a strong negative association.

Cells closer to graffiti activity → much higher burglary risk.

This is the strongest predictor in the model.
dist_to_hotspot has a very weak but statistically significant positive association.

Being farther from graffiti hotspots slightly increases predicted burglaries.

Effect size is tiny and may not be meaningful.

4.2: Negative Binomial Regression

If overdispersed, use Negative Binomial regression (more flexible).

Code

# Fit Negative Binomial model
model_nb <- glm.nb(
  countBurglaries ~ Graffiti_removal + Graffiti_clean.nn + 
    dist_to_hotspot,
  data = fishnet_model
)

nb_tbl <- tidy(model_nb) %>%
  mutate(
    estimate = round(estimate, 4),
    std.error = round(std.error, 4),
    statistic = round(statistic, 2),
    p.value = round(p.value, 4)
  )

kable(
  nb_tbl,
  caption = "Negative Binomial Regression Results"
) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Negative Binomial Regression Results
term	estimate	std.error	statistic	p.value
(Intercept)	1.6604	0.0497	33.38	0.0000
Graffiti_removal	0.0001	0.0000	1.55	0.1211
Graffiti_clean.nn	-0.0079	0.0005	-17.15	0.0000
dist_to_hotspot	0.0000	0.0000	3.34	0.0008

Its AIC (7654) is substantially lower than the Poisson model’s AIC (9456), indicating that the NB model provides a better balance between model complexity and goodness of fit.

The improvement confirms that our burglary count data are overdispersed (variance > mean), which violates a key Poisson assumption. As a result, the Poisson model inflated the significance of several predictors, while the NB model provides more reliable coefficient estimates.

Part 5: Spatial Cross-Validation

Code

# Show results
cv_results %>%
  arrange(desc(mae)) %>%
  kable(
    digits = 2,
    caption = "LOGO CV Results by District"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

LOGO CV Results by District
fold	test_district	n_test	mae	rmse
7	3	43	6.31	8.42
4	6	63	3.87	5.35
12	12	73	3.44	4.63
2	4	235	3.38	5.45
6	7	52	3.38	4.23
14	11	43	3.12	4.02
15	18	30	2.73	3.85
5	8	197	2.72	3.72
10	10	63	2.62	3.44
21	20	30	2.58	2.98
17	14	46	2.57	3.86
8	2	56	2.55	3.41
16	25	85	2.54	3.46
9	9	107	2.46	2.86
11	1	28	2.44	2.93
19	16	129	2.36	2.76
22	24	41	2.35	2.64
13	15	32	2.32	2.96
3	22	112	2.17	2.76
20	17	82	2.14	2.48
1	5	98	1.92	2.87
18	19	63	1.91	2.34

Spatial CV is more appropriate than random CV because crime data is highly spatially autocorrelated. Random CV would mix neighboring cells into both training and testing sets, causing spatial leakage and artificially inflating model performance. LOGO CV avoids this by holding out entire police districts, which tests the model’s ability to generalize to new spatial areas.

From the results, District 3 was the hardest to predict (MAE = 6.31), followed by District 6, suggesting that these areas have unique burglary patterns not well explained by graffiti-based predictors. In contrast, District 19 was one of the easiest districts to predict (MAE ≈ 1.9), indicating burglary patterns there are more stable and similar to the training data.

Part 6: Model Evaluation

6.1: Compare Model vs. KDE Baseline

Code

# Create three maps
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "plasma", limits = c(0, 15)) +
  labs(title = "Actual Burglaries") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_nb), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
  labs(title = "Model Predictions (Neg. Binomial)") +
  theme_crime()

p3 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_kde), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
  labs(title = "KDE Baseline Predictions") +
  theme_crime()

p1 + p2 + p3 +
  plot_annotation(
    title = "Actual vs. Predicted Burglaries",
    subtitle = "Does our complex model outperform simple KDE?"
  )

Code

# Calculate performance metrics
comparison <- fishnet %>%
  st_drop_geometry() %>%
  filter(!is.na(prediction_nb), !is.na(prediction_kde)) %>%
  summarize(
    model_mae = mean(abs(countBurglaries - prediction_nb)),
    model_rmse = sqrt(mean((countBurglaries - prediction_nb)^2)),
    kde_mae = mean(abs(countBurglaries - prediction_kde)),
    kde_rmse = sqrt(mean((countBurglaries - prediction_kde)^2))
  )

comparison %>%
  pivot_longer(everything(), names_to = "metric", values_to = "value") %>%
  separate(metric, into = c("approach", "metric"), sep = "_") %>%
  pivot_wider(names_from = metric, values_from = value) %>%
  kable(
    digits = 2,
    caption = "Model Performance Comparison"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Model Performance Comparison
approach	mae	rmse
model	2.56	3.66
kde	2.06	2.95

The KDE baseline performs better than the complex model

The KDE has lower MAE and RMSE, which means:

Smaller average prediction error
More stable predictions
Better overall fit to the spatial pattern of burglaries

This is a common result in spatial prediction tasks: KDE is extremely strong because crime tends to cluster, and KDE captures clustering very well without requiring any assumptions or regression structure.

6.2: Where Does the Model Work Well?

Code

# Calculate errors
fishnet <- fishnet %>%
  mutate(
    error_nb = countBurglaries - prediction_nb,
    error_kde = countBurglaries - prediction_kde,
    abs_error_nb = abs(error_nb),
    abs_error_kde = abs(error_kde)
  )

# Map errors
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = error_nb), color = NA) +
  scale_fill_gradient2(
    name = "Error",
    low = "#2166ac", mid = "white", high = "#b2182b",
    midpoint = 0,
    limits = c(-10, 10)
  ) +
  labs(title = "Model Errors (Actual - Predicted)") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = abs_error_nb), color = NA) +
  scale_fill_viridis_c(name = "Abs. Error", option = "magma") +
  labs(title = "Absolute Model Errors") +
  theme_crime()

p1 + p2

The error maps show that the Negative Binomial model systematically underpredicts burglaries in major hotspot areas especially in the South and West Sides, while slightly overpredicting in low-crime neighborhoods. This pattern suggests that the model cannot fully capture sharp, localized spikes in burglary activity, likely because graffiti-related predictors only partially reflect the underlying opportunity structures of crime. As a result, absolute errors are largest in high-crime districts and much smaller in low-crime ones, indicating that the model performs unevenly across space and is least accurate exactly where prediction is most important.

6.3: Model Summary Table

Code

# Create nice summary table
model_summary <- broom::tidy(final_model, exponentiate = TRUE) %>%
  mutate(
    across(where(is.numeric), ~round(., 3))
  )

model_summary %>%
  kable(
    caption = "Final Negative Binomial Model Coefficients (Exponentiated)",
    col.names = c("Variable", "Rate Ratio", "Std. Error", "Z", "P-Value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  footnote(
    general = "Rate ratios > 1 indicate positive association with burglary counts."
  )

Final Negative Binomial Model Coefficients (Exponentiated)
Variable	Rate Ratio	Std. Error	Z	P-Value
(Intercept)	5.261	0.05	33.376	0.000
Graffiti_removal	1.000	0.00	1.550	0.121
Graffiti_clean.nn	0.992	0.00	-17.147	0.000
dist_to_hotspot	1.000	0.00	3.343	0.001
Note:
Rate ratios > 1 indicate positive association with burglary counts.

6.4 Key Findings Summary

Overall, the analysis shows that graffiti-related spatial features provide limited predictive power for burglary counts. Although the Negative Binomial model identifies some statistically significant relationships such as closer proximity to graffiti clusters being associated with slightly higher burglary counts, the effect sizes are small, and the model struggles to capture sharp crime hotspots.

The KDE baseline clearly outperforms the regression model in both MAE and RMSE, reinforcing that burglary exhibits strong spatial clustering that is better captured through spatial density patterns than through indirect disorder indicators.

Error maps also show that the model consistently underpredicts major hotspot areas and performs best in low-crime neighborhoods, suggesting that graffiti reflects general urban conditions but does not directly track burglary risk.

In short, disorder signals like graffiti may help describe neighborhood environments, but they are not sufficient for accurately predicting burglary patterns on their own. This might because Graffiti Removal 311 calls reflect residents’ reporting behavior more than actual vandalism. High-income neighborhoods tend to overreport, while disadvantaged neighborhoods underreport, producing a spatial pattern that diverges from the true distribution of disorder and therefore has limited predictive value for burglary.

--- title: "Spatial Prediction of Burglaries with 311 Graffiti Removal Requests" subtitle: "MUSA 5080 - Fall 2025" author: "Isabelle Li" date: today format: html: code-fold: show code-tools: true toc: true toc-depth: 3 toc-location: left theme: cosmo embed-resources: true editor: visual execute: warning: false message: false --- ```{r setup} #| echo: false #| message: false #| warning: false # Load required packages library(tidyverse) library(sf) library(here) library(viridis) library(terra) library(spdep) library(FNN) library(MASS) library(patchwork) library(knitr) library(kableExtra) library(classInt) library(here) # Spatstat split into sub-packages library(spatstat.geom) library(spatstat.explore) # Set options options(scipen = 999) set.seed(5080) # Create consistent theme for visualizations theme_crime <- function(base_size = 11) { theme_minimal(base_size = base_size) + theme( plot.title = element_text(face = "bold", size = base_size + 1), plot.subtitle = element_text(color = "gray30", size = base_size - 1), legend.position = "right", panel.grid.minor = element_blank(), axis.text = element_blank(), axis.title = element_blank() ) } # Set as default theme_set(theme_crime()) ``` # Part 1: Introduction and Data Exploration In this project, I use **Graffiti Removal 311 requests** as the main predictor to explore whether neighborhood disorder can help explain the spatial distribution of burglaries in Chicago. I chose graffiti because it is a common and consistently reported 311 issue, it has clear spatial patterns, and it is often used as a proxy for visible disorder in urban areas. While graffiti has no direct causal link to burglary, it may reflect broader neighborhood conditions such as maintenance levels, social stress, or public activity, which could indirectly relate to crime patterns. My goal is to test whether this type of environmental signal provides predictive value and to understand how well disorder-based features perform compared to a purely spatial baseline like kernel density estimation (KDE). ```{r load-boundaries} #| echo: false #| message: false #| warning: false # Load police districts policeDistricts <- st_read( "https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON", quiet = TRUE ) %>% st_transform('ESRI:102271') %>% dplyr::select(District = dist_num) # Load police beats policeBeats <- st_read( "https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON", quiet = TRUE ) %>% st_transform('ESRI:102271') %>% dplyr::select(Beat = beat_num) # Load Chicago boundary chicagoBoundary <- st_read( "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson", quiet = TRUE ) %>% st_transform('ESRI:102271') ``` ```{r load-burglaries} #| echo: false #| message: false #| warning: false # Load from provided data file (downloaded from Chicago open data portal) burglaries <- st_read( here("assignments", "assignment4", "data", "burglaries.shp"), quiet = TRUE ) %>% st_transform('ESRI:102271') ``` There are **7,482 burglary incidents** in the dataset.\ These records come from the **Chicago Crimes 2017** data, representing all **reported burglary (forcible entry)** cases during that calendar year. Datasets are projected into Illinois State Plane East (EPSG:102271) for accurate distance and area calculations. ## 1.2: Visualize Point Data ```{r visualize-points} #| fig-width: 10 #| fig-height: 5 # Simple point map p1 <- ggplot() + geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") + geom_sf(data = burglaries, color = "#d62828", size = 0.1, alpha = 0.4) + labs( title = "Burglary Locations", subtitle = paste0("Chicago 2017, n = ", nrow(burglaries)) ) # Density surface using modern syntax p2 <- ggplot() + geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") + geom_density_2d_filled( data = data.frame(st_coordinates(burglaries)), aes(X, Y), alpha = 0.7, bins = 8 ) + scale_fill_viridis_d( option = "plasma", direction = -1, guide = "none" # Modern ggplot2 syntax (not guide = FALSE) ) + labs( title = "Density Surface", subtitle = "Kernel density estimation" ) # Combine plots using patchwork (modern approach) p1 + p2 + plot_annotation( title = "Spatial Distribution of Burglaries in Chicago", tag_levels = 'A' ) ``` The burglary incidents are **not evenly distributed** across Chicago. Instead, they exhibit a clear clustered pattern, with several well-defined hot spots. The **highest concentrations** appear in: - the **West Side** (especially near Humboldt Park and Garfield Park), - parts of the **South Side**, and - portions of the **Near North / Near West** areas. In contrast, the Northwest and Southwest outer neighborhoods show very few incidents, indicating substantially lower burglary activity. # Part 2: Create Fishnet Grid ## 2.1: Creating the Fishnet A **fishnet grid** converts irregular point data into a regular grid of cells where we can: - Aggregate counts - Calculate spatial features - Apply regression models ```{r create-fishnet} # Create 500m x 500m grid fishnet <- st_make_grid( chicagoBoundary, cellsize = 500, # 500 meters per cell square = TRUE ) %>% st_sf() %>% mutate(uniqueID = row_number()) # Keep only cells that intersect Chicago fishnet <- fishnet[chicagoBoundary, ] ``` We use a regular fishnet grid because it provides **equal-sized spatial units**, which makes burglary counts directly comparable across the entire study area. Neighborhoods or census tracts vary widely in size and shape, and those differences can distort the spatial patterns we observe or bias the regression results. A uniform grid avoids those issues and creates a consistent framework for aggregating incidents, calculating spatial features, and fitting predictive models. ## 2.2: Aggregate Burglaries to Grid ```{r Burglaries Aggregate} burglaries_fishnet <- st_join(burglaries, fishnet, join = st_within) %>% st_drop_geometry() %>% group_by(uniqueID) %>% summarize(countBurglaries = n()) # Join back to fishnet (cells with 0 burglaries will be NA) fishnet <- fishnet %>% left_join(burglaries_fishnet, by = "uniqueID") %>% mutate(countBurglaries = replace_na(countBurglaries, 0)) fishnet %>% st_drop_geometry() %>% summarize( Min = min(countBurglaries), Q1 = quantile(countBurglaries, 0.25), Median = median(countBurglaries), Mean = mean(countBurglaries), Q3 = quantile(countBurglaries, 0.75), Max = max(countBurglaries), Percent_zero = mean(countBurglaries == 0) * 100 ) %>% mutate(across(everything(), ~round(., 2))) %>% kable( caption = "Summary of Burglary Counts per 500m Grid Cell" ) %>% kable_styling(bootstrap_options = c("striped", "hover")) ``` The summary table shows that most cells have relatively low counts, while a few cells reach very high values, which confirms that burglary is highly concentrated in a small number of hotspots. ```{r visualize-fishnet} #| fig-width: 8 #| fig-height: 6 # Visualize aggregated counts ggplot() + geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) + geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) + scale_fill_viridis_c( name = "Burglaries", option = "plasma", trans = "sqrt", # Square root for better visualization of skewed data breaks = c(0, 1, 5, 10, 20, 40) ) + labs( title = "Burglary Counts by Grid Cell", subtitle = "500m x 500m cells, Chicago 2017" ) + theme_crime() ``` This map shows the spatial distribution of burglary counts aggregated to 500m × 500m grid cells. Burglary is **not evenly distributed** across Chicago. Instead, it forms distinct clusters, with noticeably higher concentrations in the **West Side and the South Side**, while the northern and far-southern areas show very low activity. The pattern also reveals a strong **downtown-to-south corridor** where incidents are more frequent. This further confirms that burglary follows a highly spatially concentrated and uneven pattern, which is why later steps such as KDE, hotspot detection, and spatially aware cross-validation are necessary. ## 2.3 Create a Kernel Density Baseline ```{r kde-baseline} #| message: false #| warning: false #| echo: false # Convert burglaries to ppp (point pattern) format for spatstat burglaries_ppp <- as.ppp( st_coordinates(burglaries), W = as.owin(st_bbox(chicagoBoundary)) ) # Calculate KDE with 1km bandwidth kde_burglaries <- density.ppp( burglaries_ppp, sigma = 1000, # 1km bandwidth edge = TRUE # Edge correction ) # Convert to terra raster (modern approach, not raster::raster) kde_raster <- rast(kde_burglaries) # Extract KDE values to fishnet cells fishnet <- fishnet %>% mutate( kde_value = terra::extract( kde_raster, vect(fishnet), fun = mean, na.rm = TRUE )[, 2] # Extract just the values column ) ``` ```{r visualize-kde} #| fig-width: 8 #| fig-height: 6 ggplot() + geom_sf(data = fishnet, aes(fill = kde_value), color = NA) + geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) + scale_fill_viridis_c( name = "KDE Value", option = "plasma" ) + labs( title = "Kernel Density Estimation Baseline", subtitle = "Simple spatial smoothing of burglary locations" ) + theme_crime() ``` The KDE map produces a much smoother representation of burglary intensity compared to the raw count map. # Part 3: Create Spatial Predictor Variables ```{r load-abandoned-cars} #| echo: false #| message: false #| warning: false Graffiti_removal <- read_csv( here("assignments", "assignment4", "data", "Graffiti_Removal.csv"))%>% filter(!is.na(Latitude), !is.na(Longitude)) %>% st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326) %>% st_transform('ESRI:102271') ``` ## 3.1: Count of Graffiti Removal per Cell ```{r count-Graffiti_removal} Removal_fishnet <- st_join(fishnet, Graffiti_removal, join = st_contains) %>% st_drop_geometry() %>% count(uniqueID, name = "Graffiti_removal") fishnet$Graffiti_removal <- 0L idx <- match(Removal_fishnet$uniqueID, fishnet$uniqueID) fishnet$Graffiti_removal[idx] <- Removal_fishnet$Graffiti_removal fishnet %>% st_drop_geometry() %>% summarize( Min = min(Graffiti_removal), Q1 = quantile(Graffiti_removal, 0.25), Median = median(Graffiti_removal), Mean = mean(Graffiti_removal), Q3 = quantile(Graffiti_removal, 0.75), Max = max(Graffiti_removal) ) %>% mutate(across(everything(), ~round(., 2))) %>% kable( caption = "Summary of Graffiti Removal 311 Calls per 500m Grid Cell" ) %>% kable_styling(bootstrap_options = c("striped", "hover")) ``` ```{r visualize-Graffiti-Removal } #| fig-width: 10 #| fig-height: 4 p1 <- ggplot() + geom_sf(data = fishnet, aes(fill = Graffiti_removal), color = NA) + scale_fill_viridis_c(name = "Count", option = "magma") + labs(title = "Graffiti Removal 311 Calls") + theme_crime() p2 <- ggplot() + geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) + scale_fill_viridis_c(name = "Count", option = "plasma") + labs(title = "Burglaries") + theme_crime() p1 + p2 + plot_annotation(title = "Are Graffiti Removal and burglaries correlated?") ``` Visually, the two patterns do not align very closely. Graffiti Removal calls are concentrated in a few very specific pockets, mainly in the Northwest and parts of the West and Southwest sides. Burglary, however, is more broadly distributed, with multiple hotspots across the West Side, South Side, and the central areas of the city. There is some general overlap in higher-activity regions, but the intensity and location of clusters differ enough that graffiti does not appear to be a consistent indicator of burglary risk. Graffiti reflects surface-level disorder in public space, while burglary involves private property intrusion, which likely responds to different environmental and situational factors. ## 3.2: Nearest Neighbor Features ```{r nn-feature} #| message: false #| warning: false # Calculate mean distance to 3 nearest Graffiti Removal Graffiti_clean <- Graffiti_removal %>% filter(!st_is_empty(geometry)) # Get coordinates fishnet_coords <- st_coordinates(st_centroid(fishnet)) removal_coords <- st_coordinates(Graffiti_clean) # Calculate k nearest neighbors and distances nn_result <- get.knnx(removal_coords, fishnet_coords, k = 3) # Add to fishnet fishnet <- fishnet %>% mutate( Graffiti_clean.nn = rowMeans(nn_result$nn.dist) ) fishnet %>% st_drop_geometry() %>% summarize( Min = min(Graffiti_clean.nn), Q1 = quantile(Graffiti_clean.nn, 0.25), Median = median(Graffiti_clean.nn), Mean = mean(Graffiti_clean.nn), Q3 = quantile(Graffiti_clean.nn, 0.75), Max = max(Graffiti_clean.nn) ) %>% mutate(across(everything(), ~round(., 2))) %>% kable( caption = "Summary of Mean Distance to 3 Nearest Graffiti Removal Calls (meters)" ) %>% kable_styling(bootstrap_options = c("striped", "hover")) ``` The distance-to-graffiti feature shows substantial variation, with a median of \~60 m but a long tail up to 1.4 km. This indicates that some neighborhoods are very close to multiple graffiti complaints, while others are far removed. This spatial heterogeneity helps capture environmental disorder that may relate to burglary risk. ## 3.3: Distance to Hot Spots ```{r local-morans-abandoned} #| echo: false #| message: false #| warning: false # Function to calculate Local Moran's I calculate_local_morans <- function(data, variable, k = 5) { # Create spatial weights coords <- st_coordinates(st_centroid(data)) neighbors <- knn2nb(knearneigh(coords, k = k)) weights <- nb2listw(neighbors, style = "W", zero.policy = TRUE) # Calculate Local Moran's I local_moran <- localmoran(data[[variable]], weights) # Classify clusters mean_val <- mean(data[[variable]], na.rm = TRUE) data %>% mutate( local_i = local_moran[, 1], p_value = local_moran[, 5], is_significant = p_value < 0.05, moran_class = case_when( !is_significant ~ "Not Significant", local_i > 0 & .data[[variable]] > mean_val ~ "High-High", local_i > 0 & .data[[variable]] <= mean_val ~ "Low-Low", local_i < 0 & .data[[variable]] > mean_val ~ "High-Low", local_i < 0 & .data[[variable]] <= mean_val ~ "Low-High", TRUE ~ "Not Significant" ) ) } # Apply to Graffiti removal fishnet <- calculate_local_morans(fishnet, "Graffiti_removal", k = 5) ``` ```{r visualize-morans} #| fig-width: 8 #| fig-height: 6 # Visualize hot spots ggplot() + geom_sf( data = fishnet, aes(fill = moran_class), color = NA ) + scale_fill_manual( values = c( "High-High" = "#d7191c", "High-Low" = "#fdae61", "Low-High" = "#abd9e9", "Low-Low" = "#2c7bb6", "Not Significant" = "gray90" ), name = "Cluster Type" ) + labs( title = "Local Moran's I: Graffiti Removal Clusters", subtitle = "High-High = Hot spots of disorder" ) + theme_crime() ``` ```{r distance-to-hotspots} #| echo: false #| message: false #| warning: false # Identify hotspot cells hotspots <- fishnet %>% filter(moran_class == "High-High") %>% st_centroid() # Calculate distance from each cell to nearest hotspot if (nrow(hotspots) > 0) { fishnet <- fishnet %>% mutate( dist_to_hotspot = as.numeric( st_distance(st_centroid(fishnet), hotspots %>% st_union()) ) ) } else { fishnet <- fishnet %>% mutate(dist_to_hotspot = 0) } ``` Distance to a cluster of Graffiti Removal calls is more informative than distance to a single graffiti incident because it captures the broader neighborhood context of disorder. A single report might be noisy or idiosyncratic, but a High–High cluster from Local Moran’s I represents an area where both the cell and its neighbors have consistently high graffiti counts. Being close to such a cluster suggests that a grid cell lies within or near a larger zone of visible deterioration or neglect. ------------------------------------------------------------------------ ```{r join-districts} #| echo: false #| message: false #| warning: false # Join district information to fishnet fishnet <- st_join( fishnet, policeDistricts, join = st_within, left = TRUE ) %>% filter(!is.na(District)) # Remove cells outside districts ``` # Part 4: Model Fitting ## 4.1: Poisson Regression ```{r prepare-data} #| echo: false #| message: false #| warning: false # Create clean modeling dataset fishnet_model <- fishnet %>% st_drop_geometry() %>% dplyr::select( uniqueID, District, countBurglaries, Graffiti_removal, Graffiti_clean.nn, dist_to_hotspot ) %>% na.omit() # Remove any remaining NAs ``` ```{r fit-poisson} # Fit Poisson regression library(broom) model_poisson <- glm( countBurglaries ~ Graffiti_removal + Graffiti_clean.nn + dist_to_hotspot, data = fishnet_model, family = "poisson" ) poisson_tbl <- tidy(model_poisson) %>% mutate( estimate = round(estimate, 4), std.error = round(std.error, 4), statistic = round(statistic, 2), p.value = round(p.value, 4) ) kable( poisson_tbl, caption = "Poisson Regression Results" ) %>% kable_styling(bootstrap_options = c("striped", "hover")) ``` **All three predictors are statistically significant (p \< 0.001).** 1. **Graffiti_removal** has a *positive* association with burglary counts. ``` More graffiti complaints → slightly more burglaries. ``` 2. **Graffiti_clean.nn** has a *strong negative* association. Cells closer to graffiti activity → much higher burglary risk. This is the strongest predictor in the model. 3. **dist_to_hotspot** has a very weak but statistically significant *positive* association. Being farther from graffiti hotspots slightly increases predicted burglaries. Effect size is tiny and may not be meaningful. ```{r check-overdispersion} #| echo: false #| message: false #| warning: false # Calculate dispersion parameter dispersion <- sum(residuals(model_poisson, type = "pearson")^2) / model_poisson$df.residual ``` ## 4.2: Negative Binomial Regression If overdispersed, use **Negative Binomial regression** (more flexible). ```{r fit-negbin} # Fit Negative Binomial model model_nb <- glm.nb( countBurglaries ~ Graffiti_removal + Graffiti_clean.nn + dist_to_hotspot, data = fishnet_model ) nb_tbl <- tidy(model_nb) %>% mutate( estimate = round(estimate, 4), std.error = round(std.error, 4), statistic = round(statistic, 2), p.value = round(p.value, 4) ) kable( nb_tbl, caption = "Negative Binomial Regression Results" ) %>% kable_styling(bootstrap_options = c("striped", "hover")) ``` Its AIC (7654) is substantially lower than the Poisson model's AIC (9456), indicating that the NB model provides a better balance between model complexity and goodness of fit. The improvement confirms that our burglary count data are overdispersed (variance \> mean), which violates a key Poisson assumption. As a result, the Poisson model inflated the significance of several predictors, while the NB model provides more reliable coefficient estimates. # Part 5: Spatial Cross-Validation ```{r spatial-cv} #| echo: false #| message: false #| warning: false #| results: hide # Get unique districts districts <- unique(fishnet_model$District) cv_results <- tibble() cat("Running LOGO Cross-Validation...\n") for (i in seq_along(districts)) { test_district <- districts[i] # Split data train_data <- fishnet_model %>% filter(District != test_district) test_data <- fishnet_model %>% filter(District == test_district) # Fit model on training data model_cv <- glm.nb( countBurglaries ~ Graffiti_removal + Graffiti_clean.nn + dist_to_hotspot, data = train_data ) # Predict on test data test_data <- test_data %>% mutate( prediction = predict(model_cv, test_data, type = "response") ) # Calculate metrics mae <- mean(abs(test_data$countBurglaries - test_data$prediction)) rmse <- sqrt(mean((test_data$countBurglaries - test_data$prediction)^2)) # Store results cv_results <- bind_rows( cv_results, tibble( fold = i, test_district = test_district, n_test = nrow(test_data), mae = mae, rmse = rmse ) ) cat(" Fold", i, "/", length(districts), "- District", test_district, "- MAE:", round(mae, 2), "\n") } ``` ```{r cv-results-table} # Show results cv_results %>% arrange(desc(mae)) %>% kable( digits = 2, caption = "LOGO CV Results by District" ) %>% kable_styling(bootstrap_options = c("striped", "hover")) ``` Spatial CV is more appropriate than random CV because crime data is highly spatially autocorrelated. Random CV would mix neighboring cells into both training and testing sets, causing spatial leakage and artificially inflating model performance. LOGO CV avoids this by holding out entire police districts, which tests the model’s ability to generalize to new spatial areas. From the results, District 3 was the hardest to predict (MAE = 6.31), followed by District 6, suggesting that these areas have unique burglary patterns not well explained by graffiti-based predictors. In contrast, District 19 was one of the easiest districts to predict (MAE ≈ 1.9), indicating burglary patterns there are more stable and similar to the training data. # Part 6: Model Evaluation ```{r final-predictions} #| echo: false #| message: false #| warning: false # Fit final model on all data final_model <- glm.nb( countBurglaries ~ Graffiti_removal + Graffiti_clean.nn + dist_to_hotspot, data = fishnet_model ) # Add predictions back to fishnet fishnet <- fishnet %>% mutate( prediction_nb = predict(final_model, fishnet_model, type = "response")[match(uniqueID, fishnet_model$uniqueID)] ) # Also add KDE predictions (normalize to same scale as counts) kde_sum <- sum(fishnet$kde_value, na.rm = TRUE) count_sum <- sum(fishnet$countBurglaries, na.rm = TRUE) fishnet <- fishnet %>% mutate( prediction_kde = (kde_value / kde_sum) * count_sum ) ``` ## 6.1: Compare Model vs. KDE Baseline ```{r compare-models} #| fig-width: 12 #| fig-height: 4 # Create three maps p1 <- ggplot() + geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) + scale_fill_viridis_c(name = "Count", option = "plasma", limits = c(0, 15)) + labs(title = "Actual Burglaries") + theme_crime() p2 <- ggplot() + geom_sf(data = fishnet, aes(fill = prediction_nb), color = NA) + scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) + labs(title = "Model Predictions (Neg. Binomial)") + theme_crime() p3 <- ggplot() + geom_sf(data = fishnet, aes(fill = prediction_kde), color = NA) + scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) + labs(title = "KDE Baseline Predictions") + theme_crime() p1 + p2 + p3 + plot_annotation( title = "Actual vs. Predicted Burglaries", subtitle = "Does our complex model outperform simple KDE?" ) ``` ```{r model-comparison-metrics} # Calculate performance metrics comparison <- fishnet %>% st_drop_geometry() %>% filter(!is.na(prediction_nb), !is.na(prediction_kde)) %>% summarize( model_mae = mean(abs(countBurglaries - prediction_nb)), model_rmse = sqrt(mean((countBurglaries - prediction_nb)^2)), kde_mae = mean(abs(countBurglaries - prediction_kde)), kde_rmse = sqrt(mean((countBurglaries - prediction_kde)^2)) ) comparison %>% pivot_longer(everything(), names_to = "metric", values_to = "value") %>% separate(metric, into = c("approach", "metric"), sep = "_") %>% pivot_wider(names_from = metric, values_from = value) %>% kable( digits = 2, caption = "Model Performance Comparison" ) %>% kable_styling(bootstrap_options = c("striped", "hover")) ``` ### **The KDE baseline performs better than the complex model** The KDE has **lower MAE and RMSE**, which means: - Smaller average prediction error - More stable predictions - Better overall fit to the spatial pattern of burglaries This is a common result in spatial prediction tasks: KDE is extremely strong because crime tends to cluster, and KDE captures clustering very well without requiring any assumptions or regression structure. ## 6.2: Where Does the Model Work Well? ```{r prediction-errors} #| fig-width: 10 #| fig-height: 5 # Calculate errors fishnet <- fishnet %>% mutate( error_nb = countBurglaries - prediction_nb, error_kde = countBurglaries - prediction_kde, abs_error_nb = abs(error_nb), abs_error_kde = abs(error_kde) ) # Map errors p1 <- ggplot() + geom_sf(data = fishnet, aes(fill = error_nb), color = NA) + scale_fill_gradient2( name = "Error", low = "#2166ac", mid = "white", high = "#b2182b", midpoint = 0, limits = c(-10, 10) ) + labs(title = "Model Errors (Actual - Predicted)") + theme_crime() p2 <- ggplot() + geom_sf(data = fishnet, aes(fill = abs_error_nb), color = NA) + scale_fill_viridis_c(name = "Abs. Error", option = "magma") + labs(title = "Absolute Model Errors") + theme_crime() p1 + p2 ``` The error maps show that the Negative Binomial model systematically underpredicts burglaries in major hotspot areas especially in the South and West Sides, while slightly overpredicting in low-crime neighborhoods. This pattern suggests that the model cannot fully capture sharp, localized spikes in burglary activity, likely because graffiti-related predictors only partially reflect the underlying opportunity structures of crime. As a result, absolute errors are largest in high-crime districts and much smaller in low-crime ones, indicating that the model performs unevenly across space and is least accurate exactly where prediction is most important. ## 6.3: Model Summary Table ```{r model-summary-table} # Create nice summary table model_summary <- broom::tidy(final_model, exponentiate = TRUE) %>% mutate( across(where(is.numeric), ~round(., 3)) ) model_summary %>% kable( caption = "Final Negative Binomial Model Coefficients (Exponentiated)", col.names = c("Variable", "Rate Ratio", "Std. Error", "Z", "P-Value") ) %>% kable_styling(bootstrap_options = c("striped", "hover")) %>% footnote( general = "Rate ratios > 1 indicate positive association with burglary counts." ) ``` ## 6.4 Key Findings Summary Overall, the analysis shows that graffiti-related spatial features provide limited predictive power for burglary counts. Although the Negative Binomial model identifies some statistically significant relationships such as closer proximity to graffiti clusters being associated with slightly higher burglary counts, the effect sizes are small, and the model struggles to capture sharp crime hotspots. The **KDE baseline** clearly outperforms the regression model in both MAE and RMSE, reinforcing that burglary exhibits strong spatial clustering that is better captured through spatial density patterns than through indirect disorder indicators. Error maps also show that the model consistently underpredicts major hotspot areas and performs best in low-crime neighborhoods, suggesting that graffiti reflects general urban conditions but does not directly track burglary risk. In short, disorder signals like graffiti may help describe neighborhood environments, but they are not sufficient for accurately predicting burglary patterns on their own. This might because Graffiti Removal 311 calls reflect residents’ reporting behavior more than actual vandalism. High-income neighborhoods tend to overreport, while disadvantaged neighborhoods underreport, producing a spatial pattern that diverges from the true distribution of disorder and therefore has limited predictive value for burglary.