Spatial Prediction of Burglaries with 311 Graffiti Removal Requests

MUSA 5080 - Fall 2025

Author

Isabelle Li

Published

December 11, 2025

Part 1: Introduction and Data Exploration

In this project, I use Graffiti Removal 311 requests as the main predictor to explore whether neighborhood disorder can help explain the spatial distribution of burglaries in Chicago. I chose graffiti because it is a common and consistently reported 311 issue, it has clear spatial patterns, and it is often used as a proxy for visible disorder in urban areas.

While graffiti has no direct causal link to burglary, it may reflect broader neighborhood conditions such as maintenance levels, social stress, or public activity, which could indirectly relate to crime patterns. My goal is to test whether this type of environmental signal provides predictive value and to understand how well disorder-based features perform compared to a purely spatial baseline like kernel density estimation (KDE).

There are 7,482 burglary incidents in the dataset.
These records come from the Chicago Crimes 2017 data, representing all reported burglary (forcible entry) cases during that calendar year.

Datasets are projected into Illinois State Plane East (EPSG:102271) for accurate distance and area calculations.

1.2: Visualize Point Data

Code
# Simple point map
p1 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_sf(data = burglaries, color = "#d62828", size = 0.1, alpha = 0.4) +
  labs(
    title = "Burglary Locations",
    subtitle = paste0("Chicago 2017, n = ", nrow(burglaries))
  )

# Density surface using modern syntax
p2 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_density_2d_filled(
    data = data.frame(st_coordinates(burglaries)),
    aes(X, Y),
    alpha = 0.7,
    bins = 8
  ) +
  scale_fill_viridis_d(
    option = "plasma",
    direction = -1,
    guide = "none"  # Modern ggplot2 syntax (not guide = FALSE)
  ) +
  labs(
    title = "Density Surface",
    subtitle = "Kernel density estimation"
  )

# Combine plots using patchwork (modern approach)
p1 + p2 + 
  plot_annotation(
    title = "Spatial Distribution of Burglaries in Chicago",
    tag_levels = 'A'
  )

The burglary incidents are not evenly distributed across Chicago. Instead, they exhibit a clear clustered pattern, with several well-defined hot spots.

The highest concentrations appear in:

  • the West Side (especially near Humboldt Park and Garfield Park),

  • parts of the South Side, and

  • portions of the Near North / Near West areas.

In contrast, the Northwest and Southwest outer neighborhoods show very few incidents, indicating substantially lower burglary activity.

Part 2: Create Fishnet Grid

2.1: Creating the Fishnet

A fishnet grid converts irregular point data into a regular grid of cells where we can:

  • Aggregate counts
  • Calculate spatial features
  • Apply regression models
Code
# Create 500m x 500m grid
fishnet <- st_make_grid(
  chicagoBoundary,
  cellsize = 500,  # 500 meters per cell
  square = TRUE
) %>%
  st_sf() %>%
  mutate(uniqueID = row_number())

# Keep only cells that intersect Chicago
fishnet <- fishnet[chicagoBoundary, ]

We use a regular fishnet grid because it provides equal-sized spatial units, which makes burglary counts directly comparable across the entire study area. Neighborhoods or census tracts vary widely in size and shape, and those differences can distort the spatial patterns we observe or bias the regression results. A uniform grid avoids those issues and creates a consistent framework for aggregating incidents, calculating spatial features, and fitting predictive models.

2.2: Aggregate Burglaries to Grid

Code
burglaries_fishnet <- st_join(burglaries, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(countBurglaries = n())

# Join back to fishnet (cells with 0 burglaries will be NA)
fishnet <- fishnet %>%
  left_join(burglaries_fishnet, by = "uniqueID") %>%
  mutate(countBurglaries = replace_na(countBurglaries, 0))

fishnet %>%
  st_drop_geometry() %>%
  summarize(
    Min = min(countBurglaries),
    Q1 = quantile(countBurglaries, 0.25),
    Median = median(countBurglaries),
    Mean = mean(countBurglaries),
    Q3 = quantile(countBurglaries, 0.75),
    Max = max(countBurglaries),
    Percent_zero = mean(countBurglaries == 0) * 100
  ) %>%
  mutate(across(everything(), ~round(., 2))) %>%
  kable(
    caption = "Summary of Burglary Counts per 500m Grid Cell"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Summary of Burglary Counts per 500m Grid Cell
Min Q1 Median Mean Q3 Max Percent_zero
0 0 2 3.04 5 40 31.77

The summary table shows that most cells have relatively low counts, while a few cells reach very high values, which confirms that burglary is highly concentrated in a small number of hotspots.

Code
# Visualize aggregated counts
ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_c(
    name = "Burglaries",
    option = "plasma",
    trans = "sqrt",  # Square root for better visualization of skewed data
    breaks = c(0, 1, 5, 10, 20, 40)
  ) +
  labs(
    title = "Burglary Counts by Grid Cell",
    subtitle = "500m x 500m cells, Chicago 2017"
  ) +
  theme_crime()

This map shows the spatial distribution of burglary counts aggregated to 500m × 500m grid cells. Burglary is not evenly distributed across Chicago. Instead, it forms distinct clusters, with noticeably higher concentrations in the West Side and the South Side, while the northern and far-southern areas show very low activity. The pattern also reveals a strong downtown-to-south corridor where incidents are more frequent. This further confirms that burglary follows a highly spatially concentrated and uneven pattern, which is why later steps such as KDE, hotspot detection, and spatially aware cross-validation are necessary.

2.3 Create a Kernel Density Baseline

Code
ggplot() +
  geom_sf(data = fishnet, aes(fill = kde_value), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_c(
    name = "KDE Value",
    option = "plasma"
  ) +
  labs(
    title = "Kernel Density Estimation Baseline",
    subtitle = "Simple spatial smoothing of burglary locations"
  ) +
  theme_crime()

The KDE map produces a much smoother representation of burglary intensity compared to the raw count map.

Part 3: Create Spatial Predictor Variables

3.1: Count of Graffiti Removal per Cell

Code
Removal_fishnet <- st_join(fishnet, Graffiti_removal, join = st_contains) %>%
  st_drop_geometry() %>%
  count(uniqueID, name = "Graffiti_removal")

fishnet$Graffiti_removal <- 0L

idx <- match(Removal_fishnet$uniqueID, fishnet$uniqueID)
fishnet$Graffiti_removal[idx] <- Removal_fishnet$Graffiti_removal

fishnet %>%
  st_drop_geometry() %>%
  summarize(
    Min = min(Graffiti_removal),
    Q1 = quantile(Graffiti_removal, 0.25),
    Median = median(Graffiti_removal),
    Mean = mean(Graffiti_removal),
    Q3 = quantile(Graffiti_removal, 0.75),
    Max = max(Graffiti_removal)
  ) %>%
  mutate(across(everything(), ~round(., 2))) %>%
  kable(
    caption = "Summary of Graffiti Removal 311 Calls per 500m Grid Cell"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Summary of Graffiti Removal 311 Calls per 500m Grid Cell
Min Q1 Median Mean Q3 Max
1 16 80 427.82 487.75 7590
Code
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = Graffiti_removal), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "magma") +
  labs(title = "Graffiti Removal 311 Calls") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "plasma") +
  labs(title = "Burglaries") +
  theme_crime()

p1 + p2 +
  plot_annotation(title = "Are Graffiti Removal and burglaries correlated?")

Visually, the two patterns do not align very closely. Graffiti Removal calls are concentrated in a few very specific pockets, mainly in the Northwest and parts of the West and Southwest sides. Burglary, however, is more broadly distributed, with multiple hotspots across the West Side, South Side, and the central areas of the city.

There is some general overlap in higher-activity regions, but the intensity and location of clusters differ enough that graffiti does not appear to be a consistent indicator of burglary risk. Graffiti reflects surface-level disorder in public space, while burglary involves private property intrusion, which likely responds to different environmental and situational factors.

3.2: Nearest Neighbor Features

Code
# Calculate mean distance to 3 nearest Graffiti Removal
Graffiti_clean <- Graffiti_removal %>%
  filter(!st_is_empty(geometry))

# Get coordinates
fishnet_coords <- st_coordinates(st_centroid(fishnet))
removal_coords <- st_coordinates(Graffiti_clean)

# Calculate k nearest neighbors and distances
nn_result <- get.knnx(removal_coords, fishnet_coords, k = 3)

# Add to fishnet
fishnet <- fishnet %>%
  mutate(
    Graffiti_clean.nn = rowMeans(nn_result$nn.dist)
  )

fishnet %>%
  st_drop_geometry() %>%
  summarize(
    Min = min(Graffiti_clean.nn),
    Q1 = quantile(Graffiti_clean.nn, 0.25),
    Median = median(Graffiti_clean.nn),
    Mean = mean(Graffiti_clean.nn),
    Q3 = quantile(Graffiti_clean.nn, 0.75),
    Max = max(Graffiti_clean.nn)
  ) %>%
  mutate(across(everything(), ~round(., 2))) %>%
  kable(
    caption = "Summary of Mean Distance to 3 Nearest Graffiti Removal Calls (meters)"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Summary of Mean Distance to 3 Nearest Graffiti Removal Calls (meters)
Min Q1 Median Mean Q3 Max
0.55 28.61 60.28 135.14 137.95 1476.26

The distance-to-graffiti feature shows substantial variation, with a median of ~60 m but a long tail up to 1.4 km. This indicates that some neighborhoods are very close to multiple graffiti complaints, while others are far removed. This spatial heterogeneity helps capture environmental disorder that may relate to burglary risk.

3.3: Distance to Hot Spots

Code
# Visualize hot spots
ggplot() +
  geom_sf(
    data = fishnet, 
    aes(fill = moran_class), 
    color = NA
  ) +
  scale_fill_manual(
    values = c(
      "High-High" = "#d7191c",
      "High-Low" = "#fdae61",
      "Low-High" = "#abd9e9",
      "Low-Low" = "#2c7bb6",
      "Not Significant" = "gray90"
    ),
    name = "Cluster Type"
  ) +
  labs(
    title = "Local Moran's I: Graffiti Removal Clusters",
    subtitle = "High-High = Hot spots of disorder"
  ) +
  theme_crime()

Distance to a cluster of Graffiti Removal calls is more informative than distance to a single graffiti incident because it captures the broader neighborhood context of disorder. A single report might be noisy or idiosyncratic, but a High–High cluster from Local Moran’s I represents an area where both the cell and its neighbors have consistently high graffiti counts. Being close to such a cluster suggests that a grid cell lies within or near a larger zone of visible deterioration or neglect.


Part 4: Model Fitting

4.1: Poisson Regression

Code
# Fit Poisson regression
library(broom)

model_poisson <- glm(
  countBurglaries ~ Graffiti_removal + Graffiti_clean.nn + 
    dist_to_hotspot,
  data = fishnet_model,
  family = "poisson"
)

poisson_tbl <- tidy(model_poisson) %>% 
  mutate(
    estimate = round(estimate, 4),
    std.error = round(std.error, 4),
    statistic = round(statistic, 2),
    p.value = round(p.value, 4)
  )

kable(
  poisson_tbl,
  caption = "Poisson Regression Results"
) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Poisson Regression Results
term estimate std.error statistic p.value
(Intercept) 1.5900 0.0260 61.15 0.0000
Graffiti_removal 0.0001 0.0000 4.74 0.0000
Graffiti_clean.nn -0.0063 0.0003 -22.46 0.0000
dist_to_hotspot 0.0000 0.0000 3.76 0.0002

All three predictors are statistically significant (p < 0.001).

  1. Graffiti_removal has a positive association with burglary counts.
More graffiti complaints → slightly more burglaries.
  1. Graffiti_clean.nn has a strong negative association.

    Cells closer to graffiti activity → much higher burglary risk.

    This is the strongest predictor in the model.

  2. dist_to_hotspot has a very weak but statistically significant positive association.

    Being farther from graffiti hotspots slightly increases predicted burglaries.

    Effect size is tiny and may not be meaningful.

4.2: Negative Binomial Regression

If overdispersed, use Negative Binomial regression (more flexible).

Code
# Fit Negative Binomial model
model_nb <- glm.nb(
  countBurglaries ~ Graffiti_removal + Graffiti_clean.nn + 
    dist_to_hotspot,
  data = fishnet_model
)

nb_tbl <- tidy(model_nb) %>%
  mutate(
    estimate = round(estimate, 4),
    std.error = round(std.error, 4),
    statistic = round(statistic, 2),
    p.value = round(p.value, 4)
  )

kable(
  nb_tbl,
  caption = "Negative Binomial Regression Results"
) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Negative Binomial Regression Results
term estimate std.error statistic p.value
(Intercept) 1.6604 0.0497 33.38 0.0000
Graffiti_removal 0.0001 0.0000 1.55 0.1211
Graffiti_clean.nn -0.0079 0.0005 -17.15 0.0000
dist_to_hotspot 0.0000 0.0000 3.34 0.0008

Its AIC (7654) is substantially lower than the Poisson model’s AIC (9456), indicating that the NB model provides a better balance between model complexity and goodness of fit.

The improvement confirms that our burglary count data are overdispersed (variance > mean), which violates a key Poisson assumption. As a result, the Poisson model inflated the significance of several predictors, while the NB model provides more reliable coefficient estimates.

Part 5: Spatial Cross-Validation

Code
# Show results
cv_results %>%
  arrange(desc(mae)) %>%
  kable(
    digits = 2,
    caption = "LOGO CV Results by District"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
LOGO CV Results by District
fold test_district n_test mae rmse
7 3 43 6.31 8.42
4 6 63 3.87 5.35
12 12 73 3.44 4.63
2 4 235 3.38 5.45
6 7 52 3.38 4.23
14 11 43 3.12 4.02
15 18 30 2.73 3.85
5 8 197 2.72 3.72
10 10 63 2.62 3.44
21 20 30 2.58 2.98
17 14 46 2.57 3.86
8 2 56 2.55 3.41
16 25 85 2.54 3.46
9 9 107 2.46 2.86
11 1 28 2.44 2.93
19 16 129 2.36 2.76
22 24 41 2.35 2.64
13 15 32 2.32 2.96
3 22 112 2.17 2.76
20 17 82 2.14 2.48
1 5 98 1.92 2.87
18 19 63 1.91 2.34

Spatial CV is more appropriate than random CV because crime data is highly spatially autocorrelated. Random CV would mix neighboring cells into both training and testing sets, causing spatial leakage and artificially inflating model performance. LOGO CV avoids this by holding out entire police districts, which tests the model’s ability to generalize to new spatial areas.

From the results, District 3 was the hardest to predict (MAE = 6.31), followed by District 6, suggesting that these areas have unique burglary patterns not well explained by graffiti-based predictors. In contrast, District 19 was one of the easiest districts to predict (MAE ≈ 1.9), indicating burglary patterns there are more stable and similar to the training data.

Part 6: Model Evaluation

6.1: Compare Model vs. KDE Baseline

Code
# Create three maps
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "plasma", limits = c(0, 15)) +
  labs(title = "Actual Burglaries") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_nb), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
  labs(title = "Model Predictions (Neg. Binomial)") +
  theme_crime()

p3 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_kde), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
  labs(title = "KDE Baseline Predictions") +
  theme_crime()

p1 + p2 + p3 +
  plot_annotation(
    title = "Actual vs. Predicted Burglaries",
    subtitle = "Does our complex model outperform simple KDE?"
  )

Code
# Calculate performance metrics
comparison <- fishnet %>%
  st_drop_geometry() %>%
  filter(!is.na(prediction_nb), !is.na(prediction_kde)) %>%
  summarize(
    model_mae = mean(abs(countBurglaries - prediction_nb)),
    model_rmse = sqrt(mean((countBurglaries - prediction_nb)^2)),
    kde_mae = mean(abs(countBurglaries - prediction_kde)),
    kde_rmse = sqrt(mean((countBurglaries - prediction_kde)^2))
  )

comparison %>%
  pivot_longer(everything(), names_to = "metric", values_to = "value") %>%
  separate(metric, into = c("approach", "metric"), sep = "_") %>%
  pivot_wider(names_from = metric, values_from = value) %>%
  kable(
    digits = 2,
    caption = "Model Performance Comparison"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Model Performance Comparison
approach mae rmse
model 2.56 3.66
kde 2.06 2.95

The KDE baseline performs better than the complex model

The KDE has lower MAE and RMSE, which means:

  • Smaller average prediction error

  • More stable predictions

  • Better overall fit to the spatial pattern of burglaries

This is a common result in spatial prediction tasks: KDE is extremely strong because crime tends to cluster, and KDE captures clustering very well without requiring any assumptions or regression structure.

6.2: Where Does the Model Work Well?

Code
# Calculate errors
fishnet <- fishnet %>%
  mutate(
    error_nb = countBurglaries - prediction_nb,
    error_kde = countBurglaries - prediction_kde,
    abs_error_nb = abs(error_nb),
    abs_error_kde = abs(error_kde)
  )

# Map errors
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = error_nb), color = NA) +
  scale_fill_gradient2(
    name = "Error",
    low = "#2166ac", mid = "white", high = "#b2182b",
    midpoint = 0,
    limits = c(-10, 10)
  ) +
  labs(title = "Model Errors (Actual - Predicted)") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = abs_error_nb), color = NA) +
  scale_fill_viridis_c(name = "Abs. Error", option = "magma") +
  labs(title = "Absolute Model Errors") +
  theme_crime()

p1 + p2

The error maps show that the Negative Binomial model systematically underpredicts burglaries in major hotspot areas especially in the South and West Sides, while slightly overpredicting in low-crime neighborhoods. This pattern suggests that the model cannot fully capture sharp, localized spikes in burglary activity, likely because graffiti-related predictors only partially reflect the underlying opportunity structures of crime. As a result, absolute errors are largest in high-crime districts and much smaller in low-crime ones, indicating that the model performs unevenly across space and is least accurate exactly where prediction is most important.

6.3: Model Summary Table

Code
# Create nice summary table
model_summary <- broom::tidy(final_model, exponentiate = TRUE) %>%
  mutate(
    across(where(is.numeric), ~round(., 3))
  )

model_summary %>%
  kable(
    caption = "Final Negative Binomial Model Coefficients (Exponentiated)",
    col.names = c("Variable", "Rate Ratio", "Std. Error", "Z", "P-Value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  footnote(
    general = "Rate ratios > 1 indicate positive association with burglary counts."
  )
Final Negative Binomial Model Coefficients (Exponentiated)
Variable Rate Ratio Std. Error Z P-Value
(Intercept) 5.261 0.05 33.376 0.000
Graffiti_removal 1.000 0.00 1.550 0.121
Graffiti_clean.nn 0.992 0.00 -17.147 0.000
dist_to_hotspot 1.000 0.00 3.343 0.001
Note:
Rate ratios > 1 indicate positive association with burglary counts.

6.4 Key Findings Summary

Overall, the analysis shows that graffiti-related spatial features provide limited predictive power for burglary counts. Although the Negative Binomial model identifies some statistically significant relationships such as closer proximity to graffiti clusters being associated with slightly higher burglary counts, the effect sizes are small, and the model struggles to capture sharp crime hotspots.

The KDE baseline clearly outperforms the regression model in both MAE and RMSE, reinforcing that burglary exhibits strong spatial clustering that is better captured through spatial density patterns than through indirect disorder indicators.

Error maps also show that the model consistently underpredicts major hotspot areas and performs best in low-crime neighborhoods, suggesting that graffiti reflects general urban conditions but does not directly track burglary risk.

In short, disorder signals like graffiti may help describe neighborhood environments, but they are not sufficient for accurately predicting burglary patterns on their own. This might because Graffiti Removal 311 calls reflect residents’ reporting behavior more than actual vandalism. High-income neighborhoods tend to overreport, while disadvantaged neighborhoods underreport, producing a spatial pattern that diverges from the true distribution of disorder and therefore has limited predictive value for burglary.