This initial phase establishes the foundational dataset for our spatial analysis of burglary patterns in Chicago. The process systematically assembles and prepares multiple data sources: administrative boundaries (police districts, beats, and city limits) provide the spatial framework; burglary incidents serve as the primary outcome variable; and street light outage reports function as a key environmental predictor. Each dataset undergoes rigorous processing including coordinate system standardization, temporal filtering to 2017, and geometric validation.
Reading layer `OGRGeoJSON' from data source
`https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON'
using driver `GeoJSON'
Simple feature collection with 277 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
Geodetic CRS: WGS 84
Code
# Load Chicago boundarychicagoBoundary <-st_read("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson") %>%st_transform('ESRI:102271')
Reading layer `chicagoBoundary' from data source
`https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson'
using driver `GeoJSON'
Simple feature collection with 1 feature and 1 field
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: -87.8367 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
Geodetic CRS: WGS 84
1.2: Load Burglary Data
Code
# Load from provided data file (downloaded from Chicago open data portal)burglaries <-st_read("D:/MUSA5080PPA/portfolio-setup-ChristineCui12/labs/lab_4/data/burglaries.shp") %>%st_transform('ESRI:102271')
Reading layer `burglaries' from data source
`D:\MUSA5080PPA\portfolio-setup-ChristineCui12\labs\lab_4\data\burglaries.shp'
using driver `ESRI Shapefile'
Simple feature collection with 7482 features and 22 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 340492 ymin: 552959.6 xmax: 367153.5 ymax: 594815.1
Projected CRS: NAD83(HARN) / Illinois East
The selection of “Alley Lights Out” service requests as our predictor variable is grounded in both theoretical foundations and practical analytical considerations. From an environmental criminology perspective, inadequate alley illumination directly impacts crime opportunities by reducing natural surveillance and creating concealment spaces, aligning with CPTED principles. This violation type differs significantly from the classroom example of abandoned vehicles, representing a distinct category of infrastructure issues with unique spatial characteristics. Furthermore, alley lighting problems typically exhibit temporal stability due to municipal maintenance cycles, providing reliable spatial patterns for predictive modeling.
For our analytical framework, this dataset offers several advantages. The persistent nature of lighting infrastructure ensures strong spatial autocorrelation, making it a robust predictor at the grid-cell level. The geographic specificity of alley locations enables precise spatial analysis, while the substantial data volume with quality coordinates supports reliable modeling. Importantly, findings from this analysis can directly inform evidence-based policies for infrastructure maintenance and targeted crime prevention, bridging academic research with practical urban governance applications.
1.4: Visualize the spatial distribution
Code
# Simple point mapp1 <-ggplot() +geom_sf(data = chicagoBoundary, fill ="gray95", color ="gray60") +geom_sf(data = lights_out, color ="#d62828", size =0.1, alpha =0.4) +labs(title ="Lights out Locations",subtitle =paste0("Chicago 2017, n = ", nrow(lights_out)) )# Density surface using modern syntaxp2 <-ggplot() +geom_sf(data = chicagoBoundary, fill ="gray95", color ="gray60") +geom_density_2d_filled(data =data.frame(st_coordinates(lights_out)),aes(X, Y),alpha =0.7,bins =8 ) +scale_fill_viridis_d(option ="plasma",direction =-1,guide ="none"# Modern ggplot2 syntax (not guide = FALSE) ) +labs(title ="Density Surface",subtitle ="Kernel density estimation" )# Combine plots using patchwork (modern approach)p1 + p2 +plot_annotation(title ="Spatial Distribution of Lights out in Chicago",tag_levels ='A' )
From the maps above, it can be found that the spatial distribution of alley light outages in Chicago demonstrates pronounced unevenness across the urban landscape. The visualization reveals distinct concentration patterns, with particularly dense clusters forming continuous corridors in western and southern sectors, while central and northern lakeshore areas remain notably sparse.
The kernel density analysis further validates this spatial disparity, highlighting persistent high-intensity zones in historically underserved neighborhoods. This geographical patterning reflects underlying infrastructure maintenance disparities and suggests potential correlations between illumination deficiencies and neighborhood socioeconomic characteristics. The stark contrast between high-frequency reporting areas and minimally affected regions underscores systemic inequities in public service distribution, while the concentrated nature of these outages may significantly influence local safety conditions and nocturnal activity patterns.
Part 2: Fishnet Grid Creation and Distribution
This section establishes the analytical foundation by transforming raw spatial data into structured, comparable units that enable systematic examination of crime patterns and environmental factors. The process begins with creating a uniform 500m x 500m grid system spanning Chicago’s urban landscape, which serves as our standardized spatial framework. This grid-based approach is crucial because it transcends traditional administrative boundaries (unequal sizes), allowing us to analyze urban phenomena in consistent, equally-sized, and no boundary bias units that capture local variations while maintaining city-wide comparability.
2.1: Fishnet Creation
Code
# Create 500m x 500m gridfishnet <-st_make_grid( chicagoBoundary,cellsize =500, # 500 meters per cellsquare =TRUE) %>%st_sf() %>%mutate(uniqueID =row_number())# Keep only cells that intersect Chicagofishnet <- fishnet[chicagoBoundary, ]
2.2: Aggregate Burglaries to Grid
Code
# Spatial join: which cell contains each burglary?burglaries_fishnet <-st_join(burglaries, fishnet, join = st_within) %>%st_drop_geometry() %>%group_by(uniqueID) %>%summarize(countBurglaries =n())# Join back to fishnet (cells with 0 burglaries will be NA)fishnet <- fishnet %>%left_join(burglaries_fishnet, by ="uniqueID") %>%mutate(countBurglaries =replace_na(countBurglaries, 0))# Summary statisticssummary(fishnet$countBurglaries)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 2.000 3.042 5.000 40.000
Code
cat("\nCells with zero burglaries:", sum(fishnet$countBurglaries ==0), "/", nrow(fishnet),"(", round(100*sum(fishnet$countBurglaries ==0) /nrow(fishnet), 1), "%)\n")
Cells with zero burglaries: 781 / 2458 ( 31.8 %)
Code
# Visualize aggregated countsggplot() +geom_sf(data = fishnet, aes(fill = countBurglaries), color =NA) +geom_sf(data = chicagoBoundary, fill =NA, color ="white", linewidth =1) +scale_fill_viridis_c(name ="Burglaries",option ="plasma",trans ="sqrt", # Square root for better visualization of skewed databreaks =c(0, 1, 5, 10, 20, 40) ) +labs(title ="Burglary Counts by Grid Cell",subtitle ="500m x 500m cells, Chicago 2017" ) +theme_crime()
The aggregation of burglary incidents to these grid cells reveals striking spatial disparities in crime distribution. Rather than being evenly dispersed across the city, burglaries concentrate in specific hotspots while large portions of the grid remain crime-free. This clustering pattern underscores the non-random nature of criminal activity and highlights the limitations of traditional Poisson regression for modeling such overdispersed count data, pointing toward the need for more flexible statistical approaches like negative binomial regression.
2.3: Count of Streets Lights out per Cell
Code
# Aggregate lights out calls to fishnetlightsout_fishnet <-st_join(lights_out, fishnet, join = st_within) %>%st_drop_geometry() %>%group_by(uniqueID) %>%summarize(lights_out =n())# Join to fishnetfishnet <- fishnet %>%left_join(lightsout_fishnet, by ="uniqueID") %>%mutate(lights_out =replace_na(lights_out, 0))summary(fishnet$lights_out)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 1.00 7.00 11.35 17.00 217.00
Code
# Visualize aggregated countsggplot() +geom_sf(data = fishnet, aes(fill = lights_out), color =NA) +geom_sf(data = chicagoBoundary, fill =NA, color ="white", linewidth =1) +scale_fill_viridis_c(name ="Lights Out",option ="plasma",trans ="sqrt", # Square root for better visualization of skewed databreaks =c(0, 1, 5, 10, 20, 40) ) +labs(title ="Alley Lights Out by Grid Cell",subtitle ="500m x 500m cells, Chicago 2017" ) +theme_crime()
When we map alley light outage reports onto the same grid system, we observe distinct spatial concentrations that visually correspond with burglary hotspots (to some extent), particularly in western and southern sectors of the city. This spatial correlation between infrastructure deficiencies and crime patterns provides preliminary support for environmental criminology theories, suggesting that inadequate lighting may create environmental conditions conducive to criminal activity by reducing natural surveillance and increasing perceived opportunities for offenders. The consistent spatial framework enables us to systematically explore these relationships in subsequent modeling phases.
Part 3: Create Spatial Features
This section develops advanced spatial predictors to characterize the environmental context of lighting infrastructure and its potential relationship to crime patterns. The analysis progresses from measuring local lighting conditions through nearest-neighbor distances to identifying statistically significant spatial clusters of illumination deficiencies. These clusters are then operationalized as distance-based features, capturing proximity to areas of concentrated infrastructure problems. Finally, the framework is enriched with administrative boundaries, vacant buildings and demographic characteristics, creating a comprehensive feature set for subsequent burglary prediction modeling.
3.1: K-Nearest Neighbor Features
This step calculates the average distance from each grid cell centroid to its three nearest alley light outage reports. We first extract the centroid coordinates of all fishnet grid cells and the geographic coordinates of all light outage incidents, then use k-nearest neighbor algorithm (k=3) to compute the mean distance for each cell.
Code
# Calculate mean distance to 3 nearest abandoned cars# (Do this OUTSIDE of mutate to avoid sf conflicts)# Get coordinatesfishnet_coords <-st_coordinates(st_centroid(fishnet))lightsout_coords <-st_coordinates(lights_out)# Calculate k nearest neighbors and distancesnn_result <-get.knnx(lightsout_coords, fishnet_coords, k =3)# Add to fishnetfishnet <- fishnet %>%mutate(lights_out.nn =rowMeans(nn_result$nn.dist) )summary(fishnet$lights_out.nn)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.602 102.009 172.653 295.489 328.127 2313.532
This spatial feature captures the local environmental context of lighting infrastructure conditions beyond simple count-based measures. This is crucial because concentrated lighting deficiencies may create extended dark corridors that facilitate criminal activity, whereas isolated outages might have less impact on overall neighborhood safety perception.
The distance statistics reveal substantial spatial variation in lighting infrastructure problems across Chicago. With distances ranging from 4.6 meters to over 2.3 kilometers, and a median distance of 172.7 meters, we observe that some areas experience very dense concentrations of lighting issues (low distance values) while others have relatively sparse problems (high distance values). The right-skewed distribution (mean of 295.5 meters exceeding the median) suggests that while many cells have relatively accessible lighting infrastructure, there are significant areas where residents must travel considerable distances to encounter multiple lighting problems, potentially indicating systematic infrastructure neglect in certain neighborhoods.
3.2: Local Moran’s I analysis
This step implements spatial autocorrelation analysis using Local Moran’s I to identify statistically significant clustering patterns of alley light outages across the grid system. The method constructs a spatial weights matrix based on k-nearest neighbors (k=5) from grid centroids, computes local spatial autocorrelation statistics, and classifies each grid cell into distinct spatial patterns including hotspots (High-High), coldspots (Low-Low), and spatial outliers (High-Low, Low-High).
Code
# Function to calculate Local Moran's Icalculate_local_morans <-function(data, variable, k =5) {# Create spatial weights coords <-st_coordinates(st_centroid(data)) neighbors <-knn2nb(knearneigh(coords, k = k)) weights <-nb2listw(neighbors, style ="W", zero.policy =TRUE)# Calculate Local Moran's I local_moran <-localmoran(data[[variable]], weights)# Classify clusters mean_val <-mean(data[[variable]], na.rm =TRUE) data %>%mutate(local_i = local_moran[, 1],p_value = local_moran[, 5],is_significant = p_value <0.05,moran_class =case_when(!is_significant ~"Not Significant", local_i >0& .data[[variable]] > mean_val ~"High-High", local_i >0& .data[[variable]] <= mean_val ~"Low-Low", local_i <0& .data[[variable]] > mean_val ~"High-Low", local_i <0& .data[[variable]] <= mean_val ~"Low-High",TRUE~"Not Significant" ) )}# Apply to abandoned carsfishnet <-calculate_local_morans(fishnet, "lights_out", k =5)
3.3: Identify Hot Spots and Cold Spots
Code
# Visualize hot spotsggplot() +geom_sf(data = fishnet, aes(fill = moran_class), color =NA ) +scale_fill_manual(values =c("High-High"="#d7191c","High-Low"="#fdae61","Low-High"="#abd9e9","Low-Low"="#2c7bb6","Not Significant"="gray90" ),name ="Cluster Type" ) +labs(title ="Local Moran's I: Lights out Clusters",subtitle ="High-High = Hot spots of disorder" ) +theme_crime()
The visualization reveals several key spatial patterns. High-High clusters (hotspots) form contiguous zones primarily in the west and south sides of Chicago, indicating neighborhoods where lighting problems are widespread and mutually reinforcing across multiple adjacent grid cells. These hotspots represent areas of systematic infrastructure neglect that likely require comprehensive intervention strategies. The scattered Low-High outliers suggest isolated pockets of lighting problems within otherwise well-maintained areas, possibly indicating specific properties or blocks needing targeted repairs. The extensive gray areas (non-significant) demonstrate that many parts of the city, particularly in the central and northern regions, do not exhibit statistically significant clustering of lighting issues. This clear spatial differentiation provides valuable evidence for resource allocation decisions and confirms that lighting infrastructure disparities follow broader urban inequality patterns.
3.4: Distance to Hot Spots
This step quantifies the spatial proximity of each grid cell to identified lighting infrastructure hotspots by calculating the minimum distance from each cell’s centroid to the nearest High-High cluster centroid. The process involves extracting the central points of all statistically significant hotspot cells, combining them into a unified spatial object, and computing the shortest distance from every grid cell to this collective hotspot geometry.
Code
# Get centroids of "High-High" cells (hot spots)hotspots <- fishnet %>%filter(moran_class =="High-High") %>%st_centroid()# Calculate distance from each cell to nearest hot spotif (nrow(hotspots) >0) { fishnet <- fishnet %>%mutate(dist_to_hotspot =as.numeric(st_distance(st_centroid(fishnet), hotspots %>%st_union()) ) )cat(" - Number of hot spot cells:", nrow(hotspots), "\n")} else { fishnet <- fishnet %>%mutate(dist_to_hotspot =0)cat("⚠ No significant hot spots found\n")}
- Number of hot spot cells: 193
The implementation successfully identified 193 hotspot cells and calculated distance values for all grid cells. The resulting distance metric creates a spatial gradient that decreases as we move toward the core of identified problem areas and increases in peripheral and well-maintained regions. This feature will enable us to test whether burglary risk exhibits distance-decay relationships relative to lighting infrastructure hotspots, potentially revealing how the spatial influence of problematic areas extends beyond their immediate boundaries.
3.5: Join Additional Contextual Data
The code below integrates police district information, which allows for analyzing jurisdictional variations in service delivery and law enforcement practices. Then, we obtain demographic and socioeconomic data from the American Community Survey (ACS), including total population, median income, and the population aged 65 and over, and use this to calculate the proportion of elderly population. Next, we load vacant building data and aggregate it into the grid. These steps aim to enrich each grid cell with comprehensive background variables, allowing for better control of additional influencing factors in subsequent modeling.
Join district information to fishnet
Code
# Join district information to fishnetfishnet <-st_join( fishnet, policeDistricts,join = st_within,left =TRUE) %>%filter(!is.na(District)) # Remove cells outside districtscat(" - Districts:", length(unique(fishnet$District)), "\n")
- Districts: 22
Code
cat(" - Cells:", nrow(fishnet), "\n")
- Cells: 1708
Join demographic information to fishnet
Code
acs_vars <-c(total_pop ="B01003_001",med_income ="B19013_001",# Male 65 years and over (sum of 6 categories)male_65_66 ="B01001_020",male_67_69 ="B01001_021",male_70_74 ="B01001_022",male_75_79 ="B01001_023",male_80_84 ="B01001_024",male_85_up ="B01001_025",# Female 65 years and over (sum of 6 categories)female_65_66 ="B01001_044",female_67_69 ="B01001_045",female_70_74 ="B01001_046",female_75_79 ="B01001_047",female_80_84 ="B01001_048",female_85_up ="B01001_049")
This section establishes the modeling framework by systematically evaluating alternative regression specifications for burglary count prediction. The process involves preparing the analytical dataset, implementing a baseline Poisson model, diagnosing its limitations through dispersion testing, and then proceeding to a more flexible Negative Binomial formulation to address identified statistical issues.
4.1: Poisson Regression
We begin with a Poisson regression, which serves as the baseline model for count data analysis. The model incorporates key predictors including alley light outage counts, nearest neighbor distances to light problems, distance to lighting hotspots, and control variables for income, elderly population proportion, and vacant buildings.
Call:
glm(formula = countBurglaries ~ lights_out + lights_out.nn +
dist_to_hotspot + med_income + pct_65_over + count_vacant,
family = "poisson", data = fishnet_model)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.2935556927 0.0326873382 70.166 < 0.0000000000000002 ***
lights_out 0.0029259386 0.0004614293 6.341 0.000000000228 ***
lights_out.nn -0.0029046996 0.0000914734 -31.755 < 0.0000000000000002 ***
dist_to_hotspot -0.0000613090 0.0000063612 -9.638 < 0.0000000000000002 ***
med_income -0.0000036522 0.0000003212 -11.370 < 0.0000000000000002 ***
pct_65_over -0.0200035311 0.0014396462 -13.895 < 0.0000000000000002 ***
count_vacant 0.0014450388 0.0002285951 6.321 0.000000000259 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 15489 on 4203 degrees of freedom
Residual deviance: 12169 on 4197 degrees of freedom
AIC: 23225
Number of Fisher Scoring iterations: 5
Based on the Poisson regression results, all predictor variables demonstrate statistically significant relationships with burglary counts (p < 0.001), though their directional influences vary notably. Contrary to theoretical expectations, increased street light outages (lights_out) show a slight negative association with burglaries, while greater distances from lighting problem clusters (dist_to_hotspot) and individual outages (lights_out.nn) both correlate with reduced crime rates. More consistent with conventional criminological theory, higher median income (med_income) and larger elderly population proportions (pct_65_over) both exhibit protective effects against burglaries. Conversely, the coefficient for count_vacant is 0.001445. The positive value indicates that a higher number of vacant buildings is associated with a higher expected log count of burglaries. The IRR (Incidence Rate Ratio) is exp(0.001445) ≈ 1.00145, meaning that for each additional vacant building, the expected number of burglaries is multiplied by 1.00145, representing a 0.145% increase. These findings suggest that while physical disorder indicators like vacancy align with theoretical predictions, the relationships between lighting infrastructure issues and crime patterns appear more complex than initially hypothesized, potentially reflecting spatial interdependencies or countervailing mechanisms in the urban environment.
4.2: Check for Overdispersion
Poisson regression assumes mean = variance. Real count data often violates this (overdispersion). Diagnostic checks reveal significant overdispersion in the Poisson model (dispersion parameter ≈ 3.12), violating the fundamental assumption of equal mean and variance. This leads us to employ a Negative Binomial regression, which introduces an additional parameter to accommodate extra-Poisson variation.
The comparison of AIC values clearly demonstrates the superiority of the Negative Binomial specification, with a substantial reduction in AIC from approximately 23,224.7 to a lower value 19,782.2. This improvement reflects better handling of overdispersion (θ = 1.9727). Notably, lights_out shows a positive coefficient (0.0035) in the NB model, aligning with theoretical expectations that lighting failures increase burglary risk. All predictors remain highly significant (p < 0.001), with consistent protective effects for income and elderly population, and risk-increasing effects for vacancy. The NB model provides more reliable estimates for spatial crime analysis.
Part 5: Spatial Cross-Validation
This section establishes a baseline for model evaluation by creating a kernel density estimate of burglary patterns, then implements leave-one-group-out cross-validation using police districts to assess the model’s spatial generalizability. The process compares our complex regression model against a simple spatial smoothing approach and quantifies prediction errors across different geographic areas.
5.1: Create a Kernel Density Baseline
This step creates a baseline prediction model using Kernel Density Estimation (KDE) to analyze the spatial concentration patterns of burglary incidents. The process involves converting burglary point locations into a continuous probability surface using a 1km bandwidth kernel, then extracting the resulting density values to each fishnet grid cell. This creates a smoothed representation of burglary risk based solely on historical crime locations.
Code
# Convert burglaries to ppp (point pattern) format for spatstatburglaries_ppp <-as.ppp(st_coordinates(burglaries),W =as.owin(st_bbox(chicagoBoundary)))# Calculate KDE with 1km bandwidthkde_burglaries <-density.ppp( burglaries_ppp,sigma =1000, # 1km bandwidthedge =TRUE# Edge correction)# Convert to terra raster (modern approach, not raster::raster)kde_raster <-rast(kde_burglaries)# Extract KDE values to fishnet cellsfishnet <- fishnet %>%mutate(kde_value = terra::extract( kde_raster,vect(fishnet),fun = mean,na.rm =TRUE )[, 2] # Extract just the values column )
Establishing a KDE baseline serves multiple critical purposes. First, it provides a simple but effective benchmark against which to compare our more complex regression models - if our sophisticated models cannot outperform this basic spatial smoothing approach, their added complexity isn’t justified. Second, KDE captures the fundamental spatial autocorrelation inherent in crime patterns, representing the “where crime has happened before” hypothesis. Third, it helps us understand the underlying spatial structure of burglary risk before introducing explanatory variables.
Code
ggplot() +geom_sf(data = fishnet, aes(fill = kde_value), color =NA) +geom_sf(data = chicagoBoundary, fill =NA, color ="white", linewidth =1) +scale_fill_viridis_c(name ="KDE Value",option ="plasma" ) +labs(title ="Kernel Density Estimation Baseline",subtitle ="Simple spatial smoothing of burglary locations" ) +theme_crime()
The KDE map reveals pronounced spatial heterogeneity in burglary risk, with dense hotspot emerging in southern and northern sectors while middle areas remain relatively safe. The smooth risk gradients demonstrate spatial autocorrelation and distance decay patterns, providing a crucial baseline for evaluating more complex predictive models.
5.2: Leave-One-Group-Out (LOGO) Cross-Validation
This step implements Leave-One-Group-Out cross-validation using police districts to assess the model’s spatial generalizability. By iteratively training the model on all but one district and testing on the excluded district, we evaluate how well the model transfers to unseen geographic areas. The process reveals whether the relationships learned from most districts can effectively predict burglary patterns in completely new locations.
Code
# Get unique districtsdistricts <-unique(fishnet_model$District)cv_results <-tibble()cat("Running LOGO Cross-Validation...\n")
Running LOGO Cross-Validation...
Code
for (i inseq_along(districts)) { test_district <- districts[i]# Split data train_data <- fishnet_model %>%filter(District != test_district) test_data <- fishnet_model %>%filter(District == test_district)# Fit model on training data model_cv <-glm.nb( countBurglaries ~ lights_out + lights_out.nn + dist_to_hotspot + med_income + pct_65_over + count_vacant,data = train_data )# Predict on test data test_data <- test_data %>%mutate(prediction =predict(model_cv, test_data, type ="response") )# Calculate metrics mae <-mean(abs(test_data$countBurglaries - test_data$prediction)) rmse <-sqrt(mean((test_data$countBurglaries - test_data$prediction)^2))# Store results cv_results <-bind_rows( cv_results,tibble(fold = i,test_district = test_district,n_test =nrow(test_data),mae = mae,rmse = rmse ) )cat(" Fold", i, "/", length(districts), "- District", test_district, "- MAE:", round(mae, 2), "\n")}
Fold 1 / 22 - District 5 - MAE: 2.13
Fold 2 / 22 - District 4 - MAE: 2.36
Fold 3 / 22 - District 22 - MAE: 2.25
Fold 4 / 22 - District 6 - MAE: 3.2
Fold 5 / 22 - District 8 - MAE: 3.33
Fold 6 / 22 - District 7 - MAE: 2.91
Fold 7 / 22 - District 3 - MAE: 6.21
Fold 8 / 22 - District 2 - MAE: 2.84
Fold 9 / 22 - District 9 - MAE: 3.29
Fold 10 / 22 - District 10 - MAE: 2.63
Fold 11 / 22 - District 1 - MAE: 2.34
Fold 12 / 22 - District 12 - MAE: 3.56
Fold 13 / 22 - District 15 - MAE: 1.98
Fold 14 / 22 - District 11 - MAE: 3.14
Fold 15 / 22 - District 18 - MAE: 2.21
Fold 16 / 22 - District 25 - MAE: 2.87
Fold 17 / 22 - District 14 - MAE: 3.14
Fold 18 / 22 - District 19 - MAE: 2.07
Fold 19 / 22 - District 16 - MAE: 2.25
Fold 20 / 22 - District 17 - MAE: 1.69
Fold 21 / 22 - District 20 - MAE: 1.85
Fold 22 / 22 - District 24 - MAE: 2.01
The results show an average MAE of 2.74 burglaries per grid cell, indicating reasonable predictive accuracy. However, the variation across districts (MAE ranging from 1.69 to 6.21) suggests significant geographic heterogeneity in prediction performance. District 3’s high error (6.21) likely reflects unique local factors not captured by our predictors, highlighting areas where the model may require additional contextual variables or specialized approaches.
5.3: Error metrics calculation and report
This step systematically analyzes the spatial cross-validation results by organizing prediction errors across all police districts. The table ranks districts by MAE (Mean Absolute Error), providing a clear comparison of model performance across different geographic areas. This organization allows for identifying patterns in prediction accuracy and pinpointing where the model struggles most.
Code
# Show resultscv_results %>%arrange(desc(mae)) %>%kable(digits =2,caption ="LOGO CV Results by District" ) %>%kable_styling(bootstrap_options =c("striped", "hover"))
LOGO CV Results by District
fold
test_district
n_test
mae
rmse
7
3
120
6.21
7.87
12
12
203
3.56
4.91
5
8
419
3.33
4.17
9
9
291
3.29
4.83
4
6
154
3.20
4.54
14
11
124
3.14
3.79
17
14
164
3.14
4.66
6
7
149
2.91
3.77
16
25
228
2.87
3.82
8
2
183
2.84
3.48
10
10
173
2.63
3.37
2
4
418
2.36
4.21
11
1
60
2.34
3.23
3
22
229
2.25
2.69
19
16
296
2.25
2.63
15
18
95
2.21
3.65
1
5
191
2.13
2.93
18
19
234
2.07
2.57
22
24
111
2.01
2.71
13
15
77
1.98
2.60
21
20
79
1.85
2.19
20
17
206
1.69
2.05
The results reveal substantial spatial variation in prediction quality. District 3 shows the highest error (MAE: 6.21), suggesting unique local dynamics not captured by our predictors, while District 17 demonstrates the best performance (MAE: 1.69).
Part 6: Model Evaluation
This section evaluates model performance by comparing our Negative Binomial regression predictions against the KDE baseline across two key aspects: quantitative error metrics calculation (MAE and RMSE) and spatial error distribution analysis. The evaluation aims to determine whether the complex model’s added predictive power justifies its computational complexity and to identify geographic patterns in prediction accuracy.
6.1: Generate Final Predictions
Code
# Fit final model on all datafinal_model <-glm.nb( countBurglaries ~ lights_out + lights_out.nn + dist_to_hotspot + med_income + pct_65_over + count_vacant,data = fishnet_model)# Add predictions back to fishnetfishnet <- fishnet %>%mutate(prediction_nb =predict(final_model, fishnet_model, type ="response")[match(uniqueID, fishnet_model$uniqueID)] )# Also add KDE predictions (normalize to same scale as counts)kde_sum <-sum(fishnet$kde_value, na.rm =TRUE)count_sum <-sum(fishnet$countBurglaries, na.rm =TRUE)fishnet <- fishnet %>%mutate(prediction_kde = (kde_value / kde_sum) * count_sum )
6.2: Compare Model vs. KDE Baseline
This step visually compares the actual burglary distribution against predictions from both the complex model and the simple KDE baseline.
The result differs from expectations… The negative binomial regression does not appear to resemble the actual burglary patterns as closely as KDE. However, the negative binomial model produces clearer hotspot boundaries and better distinguishes between high-risk and medium-risk areas, while the KDE baseline creates smoother, more diffused risk surfaces. The identification of complex model in hotspot areas shows some deviation from the actual situation.
The performance metrics reveal a nuanced comparison between the two modeling approaches. Our lighting-focused Negative Binomial regression (MAE: 2.60, RMSE: 3.63) is outperformed by the KDE baseline (MAE: 2.17, RMSE: 3.06) in terms of pure prediction accuracy for 2017 burglary patterns. This slightly higher MAE and RMSE for our model suggests that the complex relationships we modeled - incorporating alley lighting conditions, nearest-neighbor distances, hotspot proximity, and socioeconomic factors - do not provide additional predictive power beyond simple spatial smoothing of historical crime patterns for within-year prediction.
However, this numerical comparison tells only part of the story. While KDE excels at capturing the strong spatial autocorrelation inherent in burglary patterns, it offers limited explanatory value and cannot identify the underlying environmental factors driving crime risk. Our model, despite its slightly higher errors, provides crucial insights into how specific urban infrastructure conditions - particularly alley lighting deficiencies and their spatial distribution - contribute to burglary risk. The model’s structured approach, which quantifies the effects of lighting problems, income levels, elderly population concentrations, and vacant properties, offers valuable guidance for targeted interventions and policy decisions that go beyond mere crime prediction.
6.3: Prediction Errors
This step analyzes the spatial distribution of prediction errors by calculating both directional errors (actual minus predicted) and absolute errors for the Negative Binomial model.
The error maps reveal distinct spatial patterns in model performance. The directional error map shows systematic overprediction (red areas) primarily in higher-risk neighborhoods, where the model likely overestimates burglary risk due to unaccounted protective factors. Meanwhile, underprediction (blue areas) concentrates in lower-crime western sectors, suggesting the model misses certain risk amplifiers in these regions. The absolute error map confirms that prediction inaccuracies are most pronounced in areas with extreme crime counts - both very high and very low - indicating the model struggles with accurately predicting both crime hotspots and exceptionally safe areas.
Final Negative Binomial Model Coefficients (Exponentiated)
Variable
Rate Ratio
Std. Error
Z
P-Value
(Intercept)
9.770
0.058
38.975
0
lights_out
1.004
0.001
3.628
0
lights_out.nn
0.997
0.000
-22.438
0
dist_to_hotspot
1.000
0.000
-5.202
0
med_income
1.000
0.000
-5.268
0
pct_65_over
0.978
0.002
-8.995
0
count_vacant
1.002
0.000
4.255
0
Note:
Rate ratios > 1 indicate positive association with burglary counts.
7.2: Key Findings Summary
Overall Performance Assessment:
The spatial predictive analysis reveals a nuanced picture of model performance. While the Negative Binomial regression demonstrated strong explanatory power with all predictors showing statistical significance (p < 0.001), it was ultimately outperformed by the simpler Kernel Density Estimation baseline in predictive accuracy. The KDE achieved lower error metrics (MAE: 2.17 vs 2.60; RMSE: 3.06 vs 3.63), indicating that for within-year prediction of burglary patterns, historical crime locations alone provide marginally better forecasts than our complex model incorporating lighting infrastructure, socioeconomic factors, and vacancy measures.
Key Predictive Relationships:
The regression analysis yielded several theoretically meaningful findings. Vacant building counts showed a positive association with burglary risk, consistent with broken windows theory. Higher median income and larger elderly population proportions exhibited protective effects against burglaries. However, the relationships between lighting infrastructure and crime patterns proved more complex than anticipated, with both proximity to individual light outages and distance to lighting hotspots showing significant but counterintuitive associations that may reflect spatial interdependencies in the urban environment.
Spatial Generalizability:
Leave-one-group-out cross-validation revealed substantial geographic variation in model performance across police districts (MAE range: 1.69-6.21). This spatial heterogeneity suggests that the model’s predictive power is context-dependent, performing well in some areas while struggling in others, particularly District 3 which showed the highest prediction errors. This indicates the potential need for localized modeling approaches or additional place-specific variables.
Practical Implications:
The superior performance of the simple KDE model carries important policy implications: basic hotspot mapping of historical crime data remains the most cost-effective approach for short-term resource allocation. Despite its slightly inferior predictive accuracy compared to KDE, the regression model offers valuable policy insights by quantifying how specific environmental factors influence burglary risk. It provides valuable guidance for long-term preventive strategies. The identification of lighting infrastructure hotspots, combined with socioeconomic and vacancy measures, provides actionable intelligence for targeted interventions that go beyond simple crime prediction to inform infrastructure investment and community safety strategies.
Source Code
---title: "Spatial Predictive Analysis"subtitle: "Building a spatial predictive model using a 311 service request type"author: "Christine"date: todayformat: html: code-fold: true code-tools: true toc: true toc-depth: 3 toc-location: left theme: cosmo embed-resources: trueeditor: visualexecute: warning: false message: false---## About This ExerciseIn this exercise, I will build a spatial predictive model for burglaries using count regression and spatial features.# Setup```{r setup}#| message: false#| warning: false# Load required packageslibrary(tidyverse) # Data manipulationlibrary(sf) # Spatial operationslibrary(here) # Relative file pathslibrary(viridis) # Color scaleslibrary(terra) # Raster operations (replaces 'raster')library(spdep) # Spatial dependencelibrary(FNN) # Fast nearest neighborslibrary(MASS) # Negative binomial regressionlibrary(patchwork) # Plot composition (replaces grid/gridExtra)library(knitr) # Tableslibrary(kableExtra) # Table formattinglibrary(classInt) # Classification intervalslibrary(here)library(tidycensus)# Spatstat split into sub-packageslibrary(spatstat.geom) # Spatial geometrieslibrary(spatstat.explore) # Spatial exploration/KDE# Set optionsoptions(scipen =999) # No scientific notationset.seed(5080) # Reproducibility# Create consistent theme for visualizationstheme_crime <-function(base_size =11) {theme_minimal(base_size = base_size) +theme(plot.title =element_text(face ="bold", size = base_size +1),plot.subtitle =element_text(color ="gray30", size = base_size -1),legend.position ="right",panel.grid.minor =element_blank(),axis.text =element_blank(),axis.title =element_blank() )}# Set as defaulttheme_set(theme_crime())```# Part 1: **Data Loading & Exploration**This initial phase establishes the foundational dataset for our spatial analysis of burglary patterns in Chicago. The process systematically assembles and prepares multiple data sources: administrative boundaries (police districts, beats, and city limits) provide the spatial framework; burglary incidents serve as the primary outcome variable; and street light outage reports function as a key environmental predictor. Each dataset undergoes rigorous processing including coordinate system standardization, temporal filtering to 2017, and geometric validation.## 1.1: Load Chicago Spatial Data```{r load-boundaries}#| message: false# Load police districts (used for spatial cross-validation)policeDistricts <-st_read("https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON") %>%st_transform('ESRI:102271') %>% dplyr::select(District = dist_num)# Load police beats (smaller administrative units)policeBeats <-st_read("https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON") %>%st_transform('ESRI:102271') %>% dplyr::select(Beat = beat_num)# Load Chicago boundarychicagoBoundary <-st_read("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson") %>%st_transform('ESRI:102271')```## 1.2: Load Burglary Data```{r load-burglaries}#| message: false# Load from provided data file (downloaded from Chicago open data portal)burglaries <-st_read("D:/MUSA5080PPA/portfolio-setup-ChristineCui12/labs/lab_4/data/burglaries.shp") %>%st_transform('ESRI:102271')```## 1.3: Load 311 Street Lights Out Calls```{r load-lights-out}#| message: falselights_out <-read_csv("D:/MUSA5080PPA/portfolio-setup-ChristineCui12/labs/lab_4/data/Alley_Lights_Out_2025.csv")%>%mutate(date_received =mdy(`Creation Date`)) %>%filter(!is.na(Latitude), !is.na(Longitude), year(date_received) ==2017) %>%st_as_sf(coords =c("Longitude", "Latitude"), crs =4326) %>%st_transform('ESRI:102271')```The selection of "Alley Lights Out" service requests as our predictor variable is grounded in both theoretical foundations and practical analytical considerations. From an environmental criminology perspective, inadequate alley illumination directly impacts crime opportunities by reducing natural surveillance and creating concealment spaces, aligning with CPTED principles. This violation type differs significantly from the classroom example of abandoned vehicles, representing a distinct category of infrastructure issues with unique spatial characteristics. Furthermore, alley lighting problems typically exhibit temporal stability due to municipal maintenance cycles, providing reliable spatial patterns for predictive modeling.For our analytical framework, this dataset offers several advantages. The persistent nature of lighting infrastructure ensures strong spatial autocorrelation, making it a robust predictor at the grid-cell level. The geographic specificity of alley locations enables precise spatial analysis, while the substantial data volume with quality coordinates supports reliable modeling. Importantly, findings from this analysis can directly inform evidence-based policies for infrastructure maintenance and targeted crime prevention, bridging academic research with practical urban governance applications.## 1.4: Visualize the spatial distribution```{r visualize-points}#| fig-width: 15#| fig-height: 10# Simple point mapp1 <-ggplot() +geom_sf(data = chicagoBoundary, fill ="gray95", color ="gray60") +geom_sf(data = lights_out, color ="#d62828", size =0.1, alpha =0.4) +labs(title ="Lights out Locations",subtitle =paste0("Chicago 2017, n = ", nrow(lights_out)) )# Density surface using modern syntaxp2 <-ggplot() +geom_sf(data = chicagoBoundary, fill ="gray95", color ="gray60") +geom_density_2d_filled(data =data.frame(st_coordinates(lights_out)),aes(X, Y),alpha =0.7,bins =8 ) +scale_fill_viridis_d(option ="plasma",direction =-1,guide ="none"# Modern ggplot2 syntax (not guide = FALSE) ) +labs(title ="Density Surface",subtitle ="Kernel density estimation" )# Combine plots using patchwork (modern approach)p1 + p2 +plot_annotation(title ="Spatial Distribution of Lights out in Chicago",tag_levels ='A' )```From the maps above, it can be found that the spatial distribution of alley light outages in Chicago demonstrates pronounced unevenness across the urban landscape. The visualization reveals distinct concentration patterns, with particularly dense clusters forming continuous corridors in western and southern sectors, while central and northern lakeshore areas remain notably sparse.The kernel density analysis further validates this spatial disparity, highlighting persistent high-intensity zones in historically underserved neighborhoods. This geographical patterning reflects underlying infrastructure maintenance disparities and suggests potential correlations between illumination deficiencies and neighborhood socioeconomic characteristics. The stark contrast between high-frequency reporting areas and minimally affected regions underscores systemic inequities in public service distribution, while the concentrated nature of these outages may significantly influence local safety conditions and nocturnal activity patterns.# Part 2: **Fishnet Grid Creation and Distribution**This section establishes the analytical foundation by transforming raw spatial data into structured, comparable units that enable systematic examination of crime patterns and environmental factors. The process begins with creating a uniform 500m x 500m grid system spanning Chicago's urban landscape, which serves as our standardized spatial framework. This grid-based approach is crucial because it transcends traditional administrative boundaries (unequal sizes), allowing us to analyze urban phenomena in consistent, equally-sized, and no boundary bias units that capture local variations while maintaining city-wide comparability.## 2.1: Fishnet Creation```{r create-fishnet}# Create 500m x 500m gridfishnet <-st_make_grid( chicagoBoundary,cellsize =500, # 500 meters per cellsquare =TRUE) %>%st_sf() %>%mutate(uniqueID =row_number())# Keep only cells that intersect Chicagofishnet <- fishnet[chicagoBoundary, ]```## 2.2: Aggregate Burglaries to Grid```{r aggregate-burglaries}# Spatial join: which cell contains each burglary?burglaries_fishnet <-st_join(burglaries, fishnet, join = st_within) %>%st_drop_geometry() %>%group_by(uniqueID) %>%summarize(countBurglaries =n())# Join back to fishnet (cells with 0 burglaries will be NA)fishnet <- fishnet %>%left_join(burglaries_fishnet, by ="uniqueID") %>%mutate(countBurglaries =replace_na(countBurglaries, 0))# Summary statisticssummary(fishnet$countBurglaries)cat("\nCells with zero burglaries:", sum(fishnet$countBurglaries ==0), "/", nrow(fishnet),"(", round(100*sum(fishnet$countBurglaries ==0) /nrow(fishnet), 1), "%)\n")``````{r visualize-fishnet-burglaries}#| fig-width: 8#| fig-height: 6# Visualize aggregated countsggplot() +geom_sf(data = fishnet, aes(fill = countBurglaries), color =NA) +geom_sf(data = chicagoBoundary, fill =NA, color ="white", linewidth =1) +scale_fill_viridis_c(name ="Burglaries",option ="plasma",trans ="sqrt", # Square root for better visualization of skewed databreaks =c(0, 1, 5, 10, 20, 40) ) +labs(title ="Burglary Counts by Grid Cell",subtitle ="500m x 500m cells, Chicago 2017" ) +theme_crime()```The aggregation of burglary incidents to these grid cells reveals striking spatial disparities in crime distribution. Rather than being evenly dispersed across the city, burglaries concentrate in specific hotspots while large portions of the grid remain crime-free. This clustering pattern underscores the non-random nature of criminal activity and highlights the limitations of traditional Poisson regression for modeling such overdispersed count data, pointing toward the need for more flexible statistical approaches like negative binomial regression.## 2.3: Count of Streets Lights out per Cell```{r count-lights-out}# Aggregate lights out calls to fishnetlightsout_fishnet <-st_join(lights_out, fishnet, join = st_within) %>%st_drop_geometry() %>%group_by(uniqueID) %>%summarize(lights_out =n())# Join to fishnetfishnet <- fishnet %>%left_join(lightsout_fishnet, by ="uniqueID") %>%mutate(lights_out =replace_na(lights_out, 0))summary(fishnet$lights_out)``````{r visualize-fishnet-lightsout}#| fig-width: 8#| fig-height: 6# Visualize aggregated countsggplot() +geom_sf(data = fishnet, aes(fill = lights_out), color =NA) +geom_sf(data = chicagoBoundary, fill =NA, color ="white", linewidth =1) +scale_fill_viridis_c(name ="Lights Out",option ="plasma",trans ="sqrt", # Square root for better visualization of skewed databreaks =c(0, 1, 5, 10, 20, 40) ) +labs(title ="Alley Lights Out by Grid Cell",subtitle ="500m x 500m cells, Chicago 2017" ) +theme_crime()```When we map alley light outage reports onto the same grid system, we observe distinct spatial concentrations that visually correspond with burglary hotspots (to some extent), particularly in western and southern sectors of the city. This spatial correlation between infrastructure deficiencies and crime patterns provides preliminary support for environmental criminology theories, suggesting that inadequate lighting may create environmental conditions conducive to criminal activity by reducing natural surveillance and increasing perceived opportunities for offenders. The consistent spatial framework enables us to systematically explore these relationships in subsequent modeling phases.# Part 3: Create **Spatial Features**This section develops advanced spatial predictors to characterize the environmental context of lighting infrastructure and its potential relationship to crime patterns. The analysis progresses from measuring local lighting conditions through nearest-neighbor distances to identifying statistically significant spatial clusters of illumination deficiencies. These clusters are then operationalized as distance-based features, capturing proximity to areas of concentrated infrastructure problems. Finally, the framework is enriched with administrative boundaries, vacant buildings and demographic characteristics, creating a comprehensive feature set for subsequent burglary prediction modeling.## 3.1: **K-**Nearest Neighbor FeaturesThis step calculates the average distance from each grid cell centroid to its three nearest alley light outage reports. We first extract the centroid coordinates of all fishnet grid cells and the geographic coordinates of all light outage incidents, then use k-nearest neighbor algorithm (k=3) to compute the mean distance for each cell.```{r nn-feature}#| message: false# Calculate mean distance to 3 nearest abandoned cars# (Do this OUTSIDE of mutate to avoid sf conflicts)# Get coordinatesfishnet_coords <-st_coordinates(st_centroid(fishnet))lightsout_coords <-st_coordinates(lights_out)# Calculate k nearest neighbors and distancesnn_result <-get.knnx(lightsout_coords, fishnet_coords, k =3)# Add to fishnetfishnet <- fishnet %>%mutate(lights_out.nn =rowMeans(nn_result$nn.dist) )summary(fishnet$lights_out.nn)```This spatial feature captures the local environmental context of lighting infrastructure conditions beyond simple count-based measures. This is crucial because concentrated lighting deficiencies may create extended dark corridors that facilitate criminal activity, whereas isolated outages might have less impact on overall neighborhood safety perception.The distance statistics reveal substantial spatial variation in lighting infrastructure problems across Chicago. With distances ranging from 4.6 meters to over 2.3 kilometers, and a median distance of 172.7 meters, we observe that some areas experience very dense concentrations of lighting issues (low distance values) while others have relatively sparse problems (high distance values). The right-skewed distribution (mean of 295.5 meters exceeding the median) suggests that while many cells have relatively accessible lighting infrastructure, there are significant areas where residents must travel considerable distances to encounter multiple lighting problems, potentially indicating **systematic infrastructure neglect** in certain neighborhoods.## 3.2: Local Moran's I analysisThis step implements spatial autocorrelation analysis using Local Moran's I to identify statistically significant clustering patterns of alley light outages across the grid system. The method constructs a spatial weights matrix based on k-nearest neighbors (k=5) from grid centroids, computes local spatial autocorrelation statistics, and classifies each grid cell into distinct spatial patterns including hotspots (High-High), coldspots (Low-Low), and spatial outliers (High-Low, Low-High).```{r local-morans-abandoned}# Function to calculate Local Moran's Icalculate_local_morans <-function(data, variable, k =5) {# Create spatial weights coords <-st_coordinates(st_centroid(data)) neighbors <-knn2nb(knearneigh(coords, k = k)) weights <-nb2listw(neighbors, style ="W", zero.policy =TRUE)# Calculate Local Moran's I local_moran <-localmoran(data[[variable]], weights)# Classify clusters mean_val <-mean(data[[variable]], na.rm =TRUE) data %>%mutate(local_i = local_moran[, 1],p_value = local_moran[, 5],is_significant = p_value <0.05,moran_class =case_when(!is_significant ~"Not Significant", local_i >0& .data[[variable]] > mean_val ~"High-High", local_i >0& .data[[variable]] <= mean_val ~"Low-Low", local_i <0& .data[[variable]] > mean_val ~"High-Low", local_i <0& .data[[variable]] <= mean_val ~"Low-High",TRUE~"Not Significant" ) )}# Apply to abandoned carsfishnet <-calculate_local_morans(fishnet, "lights_out", k =5)```## 3.3: **Identify Hot Spots and Cold Spots**```{r visualize-morans}#| fig-width: 8#| fig-height: 6# Visualize hot spotsggplot() +geom_sf(data = fishnet, aes(fill = moran_class), color =NA ) +scale_fill_manual(values =c("High-High"="#d7191c","High-Low"="#fdae61","Low-High"="#abd9e9","Low-Low"="#2c7bb6","Not Significant"="gray90" ),name ="Cluster Type" ) +labs(title ="Local Moran's I: Lights out Clusters",subtitle ="High-High = Hot spots of disorder" ) +theme_crime()```The visualization reveals several key spatial patterns. High-High clusters (hotspots) form contiguous zones primarily in the west and south sides of Chicago, indicating neighborhoods where lighting problems are widespread and mutually reinforcing across multiple adjacent grid cells. These hotspots represent areas of systematic infrastructure neglect that likely require comprehensive intervention strategies. The scattered Low-High outliers suggest isolated pockets of lighting problems within otherwise well-maintained areas, possibly indicating specific properties or blocks needing targeted repairs. The extensive gray areas (non-significant) demonstrate that many parts of the city, particularly in the central and northern regions, do not exhibit statistically significant clustering of lighting issues. This clear spatial differentiation provides valuable evidence for resource allocation decisions and confirms that lighting infrastructure disparities follow broader urban inequality patterns.## 3.4: Distance to Hot SpotsThis step quantifies the spatial proximity of each grid cell to identified lighting infrastructure hotspots by calculating the minimum distance from each cell's centroid to the nearest High-High cluster centroid. The process involves extracting the central points of all statistically significant hotspot cells, combining them into a unified spatial object, and computing the shortest distance from every grid cell to this collective hotspot geometry.```{r distance-to-hotspots}# Get centroids of "High-High" cells (hot spots)hotspots <- fishnet %>%filter(moran_class =="High-High") %>%st_centroid()# Calculate distance from each cell to nearest hot spotif (nrow(hotspots) >0) { fishnet <- fishnet %>%mutate(dist_to_hotspot =as.numeric(st_distance(st_centroid(fishnet), hotspots %>%st_union()) ) )cat(" - Number of hot spot cells:", nrow(hotspots), "\n")} else { fishnet <- fishnet %>%mutate(dist_to_hotspot =0)cat("⚠ No significant hot spots found\n")}```The implementation successfully identified 193 hotspot cells and calculated distance values for all grid cells. The resulting distance metric creates a spatial gradient that decreases as we move toward the core of identified problem areas and increases in peripheral and well-maintained regions. This feature will enable us to test whether burglary risk exhibits distance-decay relationships relative to lighting infrastructure hotspots, potentially revealing how the spatial influence of problematic areas extends beyond their immediate boundaries.## 3.5: Join **Additional Contextual Data**The code below integrates police district information, which allows for analyzing jurisdictional variations in service delivery and law enforcement practices. Then, we obtain demographic and socioeconomic data from the American Community Survey (ACS), including total population, median income, and the population aged 65 and over, and use this to calculate the proportion of elderly population. Next, we load vacant building data and aggregate it into the grid. These steps aim to enrich each grid cell with comprehensive background variables, allowing for better control of additional influencing factors in subsequent modeling.### Join district information to fishnet```{r join-districts}# Join district information to fishnetfishnet <-st_join( fishnet, policeDistricts,join = st_within,left =TRUE) %>%filter(!is.na(District)) # Remove cells outside districtscat(" - Districts:", length(unique(fishnet$District)), "\n")cat(" - Cells:", nrow(fishnet), "\n")```### Join demographic information to fishnet```{r}acs_vars <-c(total_pop ="B01003_001",med_income ="B19013_001",# Male 65 years and over (sum of 6 categories)male_65_66 ="B01001_020",male_67_69 ="B01001_021",male_70_74 ="B01001_022",male_75_79 ="B01001_023",male_80_84 ="B01001_024",male_85_up ="B01001_025",# Female 65 years and over (sum of 6 categories)female_65_66 ="B01001_044",female_67_69 ="B01001_045",female_70_74 ="B01001_046",female_75_79 ="B01001_047",female_80_84 ="B01001_048",female_85_up ="B01001_049")``````{r load-acs-data, message=FALSE}tracts_all <-get_acs(geography ="tract",state ="IL",county ="Cook",variables = acs_vars,output ="wide",year =2017, geometry =TRUE,progress_bar =FALSE) ``````{r}tracts_chi <- tracts_all %>%st_transform(st_crs(chicagoBoundary)) %>%st_filter(chicagoBoundary, .predicate = st_intersects) %>% dplyr::mutate(pop_65_over = (male_65_66E + male_67_69E + male_70_74E + male_75_79E + male_80_84E + male_85_upE + female_65_66E + female_67_69E + female_70_74E + female_75_79E + female_80_84E + female_85_upE),pct_65_over =100* pop_65_over /pmax(total_popE, 1),total_pop = total_popE,med_income = med_incomeE ) %>%# Select and rename final columns for clarity and consistency dplyr::select(GEOID,total_pop,med_income,pop_65_over,pct_65_over)# Join to fishnetfishnet <- fishnet %>%st_join(tracts_chi, join = st_intersects, left =TRUE)```### Join vacant buildings information to fishnet```{r}# Load vacant buildings datavacant <-st_read("data/Vacant_and_Abandoned_Buildings.geojson", quiet =TRUE) %>%st_transform("ESRI:102271")# Spatial join.vacant_fishnet <-st_join(vacant, fishnet, join = st_within) %>%st_drop_geometry() %>%group_by(uniqueID) %>%summarize(count_vacant =n())# Join to fishnetfishnet <- fishnet %>%left_join(vacant_fishnet, by ="uniqueID") %>%mutate(count_vacant =replace_na(count_vacant, 0))```# Part 4: Model FittingThis section establishes the modeling framework by systematically evaluating alternative regression specifications for burglary count prediction. The process involves preparing the analytical dataset, implementing a baseline Poisson model, diagnosing its limitations through dispersion testing, and then proceeding to a more flexible Negative Binomial formulation to address identified statistical issues.## 4.1: Poisson RegressionWe begin with a Poisson regression, which serves as the baseline model for count data analysis. The model incorporates key predictors including alley light outage counts, nearest neighbor distances to light problems, distance to lighting hotspots, and control variables for income, elderly population proportion, and vacant buildings.```{r prepare-data}# Create clean modeling datasetfishnet_model <- fishnet %>%st_drop_geometry() %>% dplyr::select( uniqueID, District, countBurglaries, lights_out, lights_out.nn, dist_to_hotspot, med_income, pct_65_over, count_vacant ) %>%na.omit() # Remove any remaining NAscat(" - Observations:", nrow(fishnet_model), "\n")cat(" - Variables:", ncol(fishnet_model), "\n")``````{r fit-poisson}# Fit Poisson regressionmodel_poisson <-glm( countBurglaries ~ lights_out + lights_out.nn + dist_to_hotspot + med_income + pct_65_over + count_vacant,data = fishnet_model,family ="poisson")# Summarysummary(model_poisson)```Based on the Poisson regression results, all predictor variables demonstrate statistically significant relationships with burglary counts (p \< 0.001), though their directional influences vary notably. Contrary to theoretical expectations, increased street light outages (lights_out) show a slight negative association with burglaries, while greater distances from lighting problem clusters (dist_to_hotspot) and individual outages (lights_out.nn) both correlate with reduced crime rates. More consistent with conventional criminological theory, higher median income (med_income) and larger elderly population proportions (pct_65_over) both exhibit protective effects against burglaries. Conversely, the coefficient for count_vacant is 0.001445. The positive value indicates that a higher number of vacant buildings is associated with a higher expected log count of burglaries. The IRR (Incidence Rate Ratio) is exp(0.001445) ≈ 1.00145, meaning that for each additional vacant building, the expected number of burglaries is multiplied by 1.00145, representing a 0.145% increase. These findings suggest that while physical disorder indicators like vacancy align with theoretical predictions, the relationships between lighting infrastructure issues and crime patterns appear more complex than initially hypothesized, potentially reflecting spatial interdependencies or countervailing mechanisms in the urban environment.## 4.2: Check for OverdispersionPoisson regression assumes mean = variance. Real count data often violates this (overdispersion). Diagnostic checks reveal significant overdispersion in the Poisson model (dispersion parameter ≈ 3.12), violating the fundamental assumption of equal mean and variance. This leads us to employ a Negative Binomial regression, which introduces an additional parameter to accommodate extra-Poisson variation.```{r check-overdispersion}# Calculate dispersion parameterdispersion <-sum(residuals(model_poisson, type ="pearson")^2) / model_poisson$df.residualcat("Dispersion parameter:", round(dispersion, 2), "\n")cat("Rule of thumb: >1.5 suggests overdispersion\n")if (dispersion >1.5) {cat("⚠ Overdispersion detected! Consider Negative Binomial model.\n")} else {cat("✓ Dispersion looks okay for Poisson model.\n")}```## 4.3: Negative Binomial Regression```{r fit-negbin}# Fit Negative Binomial modelmodel_nb <-glm.nb( countBurglaries ~ lights_out + lights_out.nn + dist_to_hotspot + med_income + pct_65_over + count_vacant,data = fishnet_model)# Summarysummary(model_nb)# Compare AIC (lower is better)cat("\nModel Comparison:\n")cat("Poisson AIC:", round(AIC(model_poisson), 1), "\n")cat("Negative Binomial AIC:", round(AIC(model_nb), 1), "\n")```The comparison of AIC values clearly demonstrates the superiority of the Negative Binomial specification, with a substantial reduction in AIC from approximately 23,224.7 to a lower value 19,782.2. This improvement reflects better handling of overdispersion (θ = 1.9727). Notably, lights_out shows a positive coefficient (0.0035) in the NB model, aligning with theoretical expectations that lighting failures increase burglary risk. All predictors remain highly significant (p \< 0.001), with consistent protective effects for income and elderly population, and risk-increasing effects for vacancy. The NB model provides more reliable estimates for spatial crime analysis.# Part 5: Spatial Cross-ValidationThis section establishes a baseline for model evaluation by creating a kernel density estimate of burglary patterns, then implements leave-one-group-out cross-validation using police districts to assess the model's spatial generalizability. The process compares our complex regression model against a simple spatial smoothing approach and quantifies prediction errors across different geographic areas.## 5.1: Create a Kernel Density BaselineThis step creates a baseline prediction model using Kernel Density Estimation (KDE) to analyze the spatial concentration patterns of burglary incidents. The process involves converting burglary point locations into a continuous probability surface using a 1km bandwidth kernel, then extracting the resulting density values to each fishnet grid cell. This creates a smoothed representation of burglary risk based solely on historical crime locations.```{r kde-baseline}#| message: false# Convert burglaries to ppp (point pattern) format for spatstatburglaries_ppp <-as.ppp(st_coordinates(burglaries),W =as.owin(st_bbox(chicagoBoundary)))# Calculate KDE with 1km bandwidthkde_burglaries <-density.ppp( burglaries_ppp,sigma =1000, # 1km bandwidthedge =TRUE# Edge correction)# Convert to terra raster (modern approach, not raster::raster)kde_raster <-rast(kde_burglaries)# Extract KDE values to fishnet cellsfishnet <- fishnet %>%mutate(kde_value = terra::extract( kde_raster,vect(fishnet),fun = mean,na.rm =TRUE )[, 2] # Extract just the values column )```Establishing a KDE baseline serves multiple critical purposes. First, it provides a simple but effective benchmark against which to compare our more complex regression models - if our sophisticated models cannot outperform this basic spatial smoothing approach, their added complexity isn't justified. Second, KDE captures the fundamental spatial autocorrelation inherent in crime patterns, representing the "where crime has happened before" hypothesis. Third, it helps us understand the underlying spatial structure of burglary risk before introducing explanatory variables.```{r visualize-kde}#| fig-width: 8#| fig-height: 6ggplot() +geom_sf(data = fishnet, aes(fill = kde_value), color =NA) +geom_sf(data = chicagoBoundary, fill =NA, color ="white", linewidth =1) +scale_fill_viridis_c(name ="KDE Value",option ="plasma" ) +labs(title ="Kernel Density Estimation Baseline",subtitle ="Simple spatial smoothing of burglary locations" ) +theme_crime()```The KDE map reveals pronounced spatial heterogeneity in burglary risk, with dense hotspot emerging in southern and northern sectors while middle areas remain relatively safe. The smooth risk gradients demonstrate spatial autocorrelation and distance decay patterns, providing a crucial baseline for evaluating more complex predictive models.## 5.2: **Leave-One-Group-Out (LOGO) Cross-Validation**This step implements Leave-One-Group-Out cross-validation using police districts to assess the model's spatial generalizability. By iteratively training the model on all but one district and testing on the excluded district, we evaluate how well the model transfers to unseen geographic areas. The process reveals whether the relationships learned from most districts can effectively predict burglary patterns in completely new locations.```{r spatial-cv}# Get unique districtsdistricts <-unique(fishnet_model$District)cv_results <-tibble()cat("Running LOGO Cross-Validation...\n")for (i inseq_along(districts)) { test_district <- districts[i]# Split data train_data <- fishnet_model %>%filter(District != test_district) test_data <- fishnet_model %>%filter(District == test_district)# Fit model on training data model_cv <-glm.nb( countBurglaries ~ lights_out + lights_out.nn + dist_to_hotspot + med_income + pct_65_over + count_vacant,data = train_data )# Predict on test data test_data <- test_data %>%mutate(prediction =predict(model_cv, test_data, type ="response") )# Calculate metrics mae <-mean(abs(test_data$countBurglaries - test_data$prediction)) rmse <-sqrt(mean((test_data$countBurglaries - test_data$prediction)^2))# Store results cv_results <-bind_rows( cv_results,tibble(fold = i,test_district = test_district,n_test =nrow(test_data),mae = mae,rmse = rmse ) )cat(" Fold", i, "/", length(districts), "- District", test_district, "- MAE:", round(mae, 2), "\n")}# Overall resultscat("Mean MAE:", round(mean(cv_results$mae), 2), "\n")cat("Mean RMSE:", round(mean(cv_results$rmse), 2), "\n")```The results show an average MAE of 2.74 burglaries per grid cell, indicating reasonable predictive accuracy. However, the variation across districts (MAE ranging from 1.69 to 6.21) suggests significant geographic heterogeneity in prediction performance. District 3's high error (6.21) likely reflects unique local factors not captured by our predictors, highlighting areas where the model may require additional contextual variables or specialized approaches.## 5.3: **Error metrics calculation and report**This step systematically analyzes the spatial cross-validation results by organizing prediction errors across all police districts. The table ranks districts by MAE (Mean Absolute Error), providing a clear comparison of model performance across different geographic areas. This organization allows for identifying patterns in prediction accuracy and pinpointing where the model struggles most.```{r cv-results-table}# Show resultscv_results %>%arrange(desc(mae)) %>%kable(digits =2,caption ="LOGO CV Results by District" ) %>%kable_styling(bootstrap_options =c("striped", "hover"))```The results reveal substantial spatial variation in prediction quality. District 3 shows the highest error (MAE: 6.21), suggesting unique local dynamics not captured by our predictors, while District 17 demonstrates the best performance (MAE: 1.69).# Part 6: Model **Evaluation**This section evaluates model performance by comparing our Negative Binomial regression predictions against the KDE baseline across two key aspects: quantitative error metrics calculation (MAE and RMSE) and spatial error distribution analysis. The evaluation aims to determine whether the complex model's added predictive power justifies its computational complexity and to identify geographic patterns in prediction accuracy.## 6.1: Generate Final Predictions```{r final-predictions}# Fit final model on all datafinal_model <-glm.nb( countBurglaries ~ lights_out + lights_out.nn + dist_to_hotspot + med_income + pct_65_over + count_vacant,data = fishnet_model)# Add predictions back to fishnetfishnet <- fishnet %>%mutate(prediction_nb =predict(final_model, fishnet_model, type ="response")[match(uniqueID, fishnet_model$uniqueID)] )# Also add KDE predictions (normalize to same scale as counts)kde_sum <-sum(fishnet$kde_value, na.rm =TRUE)count_sum <-sum(fishnet$countBurglaries, na.rm =TRUE)fishnet <- fishnet %>%mutate(prediction_kde = (kde_value / kde_sum) * count_sum )```## 6.2: Compare Model vs. KDE BaselineThis step visually compares the actual burglary distribution against predictions from both the complex model and the simple KDE baseline.```{r compare-models}#| fig-width: 15#| fig-height: 6# Create three mapsp1 <-ggplot() +geom_sf(data = fishnet, aes(fill = countBurglaries), color =NA) +scale_fill_viridis_c(name ="Count", option ="plasma", limits =c(0, 15)) +labs(title ="Actual Burglaries") +theme_crime()p2 <-ggplot() +geom_sf(data = fishnet, aes(fill = prediction_nb), color =NA) +scale_fill_viridis_c(name ="Predicted", option ="plasma", limits =c(0, 15)) +labs(title ="Model Predictions (Neg. Binomial)") +theme_crime()p3 <-ggplot() +geom_sf(data = fishnet, aes(fill = prediction_kde), color =NA) +scale_fill_viridis_c(name ="Predicted", option ="plasma", limits =c(0, 15)) +labs(title ="KDE Baseline Predictions") +theme_crime()p1 + p2 + p3 +plot_annotation(title ="Actual vs. Predicted Burglaries",subtitle ="Does our complex model outperform simple KDE?" )```The result differs from expectations... The negative binomial regression does not appear to resemble the actual burglary patterns as closely as KDE. However, the negative binomial model produces clearer hotspot boundaries and better distinguishes between high-risk and medium-risk areas, while the KDE baseline creates smoother, more diffused risk surfaces. The identification of complex model in hotspot areas shows some deviation from the actual situation.```{r model-comparison-metrics}# Calculate performance metricscomparison <- fishnet %>%st_drop_geometry() %>%filter(!is.na(prediction_nb), !is.na(prediction_kde)) %>%summarize(model_mae =mean(abs(countBurglaries - prediction_nb)),model_rmse =sqrt(mean((countBurglaries - prediction_nb)^2)),kde_mae =mean(abs(countBurglaries - prediction_kde)),kde_rmse =sqrt(mean((countBurglaries - prediction_kde)^2)) )comparison %>%pivot_longer(everything(), names_to ="metric", values_to ="value") %>%separate(metric, into =c("approach", "metric"), sep ="_") %>%pivot_wider(names_from = metric, values_from = value) %>%kable(digits =2,caption ="Model Performance Comparison" ) %>%kable_styling(bootstrap_options =c("striped", "hover"))```The performance metrics reveal a nuanced comparison between the two modeling approaches. Our lighting-focused Negative Binomial regression (MAE: 2.60, RMSE: 3.63) is outperformed by the KDE baseline (MAE: 2.17, RMSE: 3.06) in terms of pure prediction accuracy for 2017 burglary patterns. This slightly higher MAE and RMSE for our model suggests that the complex relationships we modeled - incorporating alley lighting conditions, nearest-neighbor distances, hotspot proximity, and socioeconomic factors - do not provide additional predictive power beyond simple spatial smoothing of historical crime patterns for within-year prediction.However, this numerical comparison tells only part of the story. While KDE excels at capturing the strong spatial autocorrelation inherent in burglary patterns, it offers limited explanatory value and cannot identify the underlying environmental factors driving crime risk. Our model, despite its slightly higher errors, provides crucial insights into how specific urban infrastructure conditions - particularly alley lighting deficiencies and their spatial distribution - contribute to burglary risk. The model's structured approach, which quantifies the effects of lighting problems, income levels, elderly population concentrations, and vacant properties, offers valuable guidance for targeted interventions and policy decisions that go beyond mere crime prediction.## 6.3: **Prediction Errors**This step analyzes the spatial distribution of prediction errors by calculating both directional errors (actual minus predicted) and absolute errors for the Negative Binomial model.```{r prediction-errors}#| fig-width: 10#| fig-height: 5# Calculate errorsfishnet <- fishnet %>%mutate(error_nb = countBurglaries - prediction_nb,error_kde = countBurglaries - prediction_kde,abs_error_nb =abs(error_nb),abs_error_kde =abs(error_kde) )# Map errorsp1 <-ggplot() +geom_sf(data = fishnet, aes(fill = error_nb), color =NA) +scale_fill_gradient2(name ="Error",low ="#2166ac", mid ="white", high ="#b2182b",midpoint =0,limits =c(-10, 10) ) +labs(title ="Model Errors (Actual - Predicted)") +theme_crime()p2 <-ggplot() +geom_sf(data = fishnet, aes(fill = abs_error_nb), color =NA) +scale_fill_viridis_c(name ="Abs. Error", option ="plasma") +labs(title ="Absolute Model Errors") +theme_crime()p1 + p2```The error maps reveal distinct spatial patterns in model performance. The directional error map shows systematic overprediction (red areas) primarily in higher-risk neighborhoods, where the model likely overestimates burglary risk due to unaccounted protective factors. Meanwhile, underprediction (blue areas) concentrates in lower-crime western sectors, suggesting the model misses certain risk amplifiers in these regions. The absolute error map confirms that prediction inaccuracies are most pronounced in areas with extreme crime counts - both very high and very low - indicating the model struggles with accurately predicting both crime hotspots and exceptionally safe areas.# Part 7: Conclusion and Discussion## 7.1: Model Summary Table```{r model-summary-table}# Create nice summary tablemodel_summary <- broom::tidy(final_model, exponentiate =TRUE) %>%mutate(across(where(is.numeric), ~round(., 3)) )model_summary %>%kable(caption ="Final Negative Binomial Model Coefficients (Exponentiated)",col.names =c("Variable", "Rate Ratio", "Std. Error", "Z", "P-Value") ) %>%kable_styling(bootstrap_options =c("striped", "hover")) %>%footnote(general ="Rate ratios > 1 indicate positive association with burglary counts." )```## 7.2: Key Findings Summary**Overall Performance Assessment:**\The spatial predictive analysis reveals a nuanced picture of model performance. While the Negative Binomial regression demonstrated strong explanatory power with all predictors showing statistical significance (p \< 0.001), it was ultimately outperformed by the simpler Kernel Density Estimation baseline in predictive accuracy. The KDE achieved lower error metrics (MAE: 2.17 vs 2.60; RMSE: 3.06 vs 3.63), indicating that for within-year prediction of burglary patterns, historical crime locations alone provide marginally better forecasts than our complex model incorporating lighting infrastructure, socioeconomic factors, and vacancy measures.**Key Predictive Relationships:**\The regression analysis yielded several theoretically meaningful findings. Vacant building counts showed a positive association with burglary risk, consistent with broken windows theory. Higher median income and larger elderly population proportions exhibited protective effects against burglaries. However, the relationships between lighting infrastructure and crime patterns proved more complex than anticipated, with both proximity to individual light outages and distance to lighting hotspots showing significant but counterintuitive associations that may reflect spatial interdependencies in the urban environment.**Spatial Generalizability:**\Leave-one-group-out cross-validation revealed substantial geographic variation in model performance across police districts (MAE range: 1.69-6.21). This spatial heterogeneity suggests that the model's predictive power is context-dependent, performing well in some areas while struggling in others, particularly District 3 which showed the highest prediction errors. This indicates the potential need for localized modeling approaches or additional place-specific variables.**Practical Implications:**\The superior performance of the simple KDE model carries important policy implications: basic hotspot mapping of historical crime data remains the most cost-effective approach for short-term resource allocation. Despite its slightly inferior predictive accuracy compared to KDE, the regression model offers valuable policy insights by quantifying how specific environmental factors influence burglary risk. It provides valuable guidance for long-term preventive strategies. The identification of lighting infrastructure hotspots, combined with socioeconomic and vacancy measures, provides actionable intelligence for targeted interventions that go beyond simple crime prediction to inform infrastructure investment and community safety strategies.