We’re using ESRI:102271 (NAD 1983 HARN StatePlane Illinois East). This is appropriate for Chicago because:
It minimizes distortion in this region
Its units are meters (not US feet — the feet-based State Plane variants use different ESRI codes), which matches the 500 m grid cells and distance features used below
Allows accurate distance calculations
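To see why the projected CRS matters, here is a minimal base-R sketch (with assumed example coordinates near Chicago, not points from the dataset): in raw longitude/latitude degrees you need spherical formulas like haversine to get ground distance, while in a projected CRS in meters plain Euclidean distance is already correct.

```r
# Haversine great-circle distance in meters for lon/lat points
haversine_m <- function(lon1, lat1, lon2, lat2, R = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * R * asin(sqrt(a))
}

# Two points 0.01 degrees of longitude apart near Chicago (~41.88 N)
d <- haversine_m(-87.63, 41.88, -87.62, 41.88)
cat("Ground distance:", round(d), "m\n")  # roughly 830 m

# In a projected CRS (meters), simple Euclidean distance is the right tool:
euclid <- function(x1, y1, x2, y2) sqrt((x2 - x1)^2 + (y2 - y1)^2)
```

This is why we reproject everything to ESRI:102271 before computing grids, nearest neighbors, or hot-spot distances.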
Exercise 1.2: Load Burglary Data
Code
# Load from provided data file (downloaded from Chicago open data portal)
burglaries <- st_read(here("data", "burglaries.shp")) %>%
  st_transform('ESRI:102271')
Reading layer `burglaries' from data source
`D:\MUSA5080PPA\portfolio-setup-ChristineCui12\weekly-notes\week-09\data\burglaries.shp'
using driver `ESRI Shapefile'
Simple feature collection with 7482 features and 22 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 340492 ymin: 552959.6 xmax: 367153.5 ymax: 594815.1
Projected CRS: NAD83(HARN) / Illinois East
Code
# Check the data
cat("\n✓ Loaded burglary data\n")
✓ Loaded burglary data
Code
cat(" - Number of burglaries:", nrow(burglaries), "\n")
Question 1.1: How many burglaries are in the dataset? What time period does this cover? Why does the coordinate reference system matter for our spatial analysis?
Your answer here: - Number of burglaries: 7482 - Time period: 2017-01-01 to 2018-01-01 - Coordinate reference system: it matters because a local projected CRS (in meters) keeps distances, areas, and grid cell sizes accurate; in unprojected longitude/latitude the spatial joins and distance features below would be distorted.
Critical Pause #1: Data Provenance
Before proceeding, consider where this data came from:
Who recorded this data? Chicago Police Department officers and detectives
What might be missing?
Unreported burglaries (victims didn’t call police)
Incidents police chose not to record
Downgraded offenses (burglary recorded as trespassing)
Spatial bias (more patrol = more recorded crime)
Think About Was there a Department of Justice investigation of CPD during this period? What did they find about data practices?
Exercise 1.3: Visualize Point Data
Code
# Simple point map
p1 <- ggplot() +
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_sf(data = burglaries, color = "#d62828", size = 0.1, alpha = 0.4) +
  labs(
    title = "Burglary Locations",
    subtitle = paste0("Chicago 2017, n = ", nrow(burglaries))
  )

# Density surface using modern syntax
p2 <- ggplot() +
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_density_2d_filled(
    data = data.frame(st_coordinates(burglaries)),
    aes(X, Y),
    alpha = 0.7,
    bins = 8
  ) +
  scale_fill_viridis_d(
    option = "plasma",
    direction = -1,
    guide = "none"  # Modern ggplot2 syntax (not guide = FALSE)
  ) +
  labs(
    title = "Density Surface",
    subtitle = "Kernel density estimation"
  )

# Combine plots using patchwork (modern approach)
p1 + p2 + plot_annotation(
  title = "Spatial Distribution of Burglaries in Chicago",
  tag_levels = 'A'
)
Question 1.2: What spatial patterns do you observe? Are burglaries evenly distributed across Chicago? Where are the highest concentrations? What might explain these patterns?
Your answer here: - Spatial patterns: strongly clustered rather than uniform. - Evenly distributed: No; burglaries are concentrated in two areas, one on the North Side and one on the South Side. - Highest concentrations: the South Side. - Explanation:
Part 2: Create Fishnet Grid
Exercise 2.1: Understanding the Fishnet
A fishnet grid converts irregular point data into a regular grid of cells where we can:
Aggregate counts
Calculate spatial features
Apply regression models
Think of it as overlaying graph paper on a map.
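The graph-paper idea can be sketched in a few lines of base R (toy coordinates, not the real burglary data): the core of a fishnet aggregation is assigning each point to a grid cell by integer division of its projected coordinates, then counting points per cell.

```r
cellsize <- 500  # meters, matching the fishnet built below

# Toy points in projected coordinates (meters)
pts <- data.frame(
  x = c(120, 480, 510, 990, 1490),
  y = c( 80,  60, 450, 450,  900)
)

# Integer division assigns each point to a cell
pts$cell_x <- floor(pts$x / cellsize)
pts$cell_y <- floor(pts$y / cellsize)
pts$cell_id <- paste(pts$cell_x, pts$cell_y, sep = "_")

# Count points per cell, as the st_join/group_by step does at scale
counts <- table(pts$cell_id)
print(counts)  # cell "0_0" holds the two points with x and y below 500
```

`st_make_grid()` plus a spatial join does exactly this, but with real geometries and a boundary clip.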
Code
# Create 500m x 500m grid
fishnet <- st_make_grid(
  chicagoBoundary,
  cellsize = 500,  # 500 meters per cell
  square = TRUE
) %>%
  st_sf() %>%
  mutate(uniqueID = row_number())

# Keep only cells that intersect Chicago
fishnet <- fishnet[chicagoBoundary, ]

# View basic info
cat("✓ Created fishnet grid\n")
Question 2.1: Why do we use a regular grid instead of existing boundaries like neighborhoods or census tracts? What are the advantages and disadvantages of this approach?
Your answer here: - Reasons for using a regular grid: existing boundaries like neighborhoods or census tracts vary in scale, are arbitrary, and have unequal sizes, which invites the Modifiable Areal Unit Problem. - Pros and cons: a fishnet grid has consistent cell size and no boundary bias; it is the standard approach in predictive policing, makes spatial operations easier, and gives a consistent unit of analysis. But the grid itself is arbitrary and may split “natural” areas.
Exercise 2.2: Aggregate Burglaries to Grid
Code
# Spatial join: which cell contains each burglary?
burglaries_fishnet <- st_join(burglaries, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(countBurglaries = n())

# Join back to fishnet (cells with 0 burglaries will be NA)
fishnet <- fishnet %>%
  left_join(burglaries_fishnet, by = "uniqueID") %>%
  mutate(countBurglaries = replace_na(countBurglaries, 0))

# Summary statistics
cat("\nBurglary count distribution:\n")
Burglary count distribution:
Code
summary(fishnet$countBurglaries)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 2.000 3.042 5.000 40.000
Code
cat(
  "\nCells with zero burglaries:", sum(fishnet$countBurglaries == 0), "/", nrow(fishnet),
  "(", round(100 * sum(fishnet$countBurglaries == 0) / nrow(fishnet), 1), "%)\n"
)
Cells with zero burglaries: 781 / 2458 ( 31.8 %)
Code
# Visualize aggregated counts
ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_c(
    name = "Burglaries",
    option = "plasma",
    trans = "sqrt",  # Square root for better visualization of skewed data
    breaks = c(0, 1, 5, 10, 20, 40)
  ) +
  labs(
    title = "Burglary Counts by Grid Cell",
    subtitle = "500m x 500m cells, Chicago 2017"
  ) +
  theme_crime()
Question 2.2: What is the distribution of burglary counts across cells? Why do so many cells have zero burglaries? Is this distribution suitable for count regression? (Hint: look up overdispersion)
Your answer here: - Distribution of burglary counts: heavily right-skewed, with many zeros (median 2, mean 3.04, max 40). - Reasons cells have zero burglaries: - Suitability for count regression: not for a plain Poisson model. With so many zeros the data are highly dispersed, so the variance exceeds the mean. Poisson regression assumes variance = mean, but crime data typically show variance > mean (overdispersion), and a Poisson model will underestimate uncertainty when overdispersed. Because the Poisson assumption is violated, we use negative binomial regression instead.
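The overdispersion diagnostic mentioned in the hint can be sketched in base R. This uses simulated negative-binomial counts as a stand-in for the real burglary counts (an assumption for illustration only): compare variance to mean, the same logic applied to the Poisson model later in the lab.

```r
set.seed(1)
# Simulated counts, overdispersed by construction (variance = mu + mu^2/size)
counts <- rnbinom(2000, mu = 3, size = 1.2)

m <- mean(counts)
v <- var(counts)
cat("mean:", round(m, 2), " variance:", round(v, 2), "\n")

# Poisson assumes variance ≈ mean; a ratio well above 1 flags overdispersion
dispersion_ratio <- v / m
cat("variance/mean ratio:", round(dispersion_ratio, 2), "\n")
```

With real crime counts, a ratio like this well above 1 is the signal to move from Poisson to negative binomial.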
Part 3: Create a Kernel Density Baseline
Before building complex models, let’s create a simple baseline using Kernel Density Estimation (KDE).
The KDE baseline asks: “What if crime just happens where it happened before?” (simple spatial smoothing, no predictors)
Code
# Convert burglaries to ppp (point pattern) format for spatstat
burglaries_ppp <- as.ppp(
  st_coordinates(burglaries),
  W = as.owin(st_bbox(chicagoBoundary))
)

# Calculate KDE with 1km bandwidth
kde_burglaries <- density.ppp(
  burglaries_ppp,
  sigma = 1000,  # 1km bandwidth
  edge = TRUE    # Edge correction
)

# Convert to terra raster (modern approach, not raster::raster)
kde_raster <- rast(kde_burglaries)

# Extract KDE values to fishnet cells
fishnet <- fishnet %>%
  mutate(
    kde_value = terra::extract(
      kde_raster,
      vect(fishnet),
      fun = mean,
      na.rm = TRUE
    )[, 2]  # Extract just the values column
  )

cat("✓ Calculated KDE baseline\n")
✓ Calculated KDE baseline
Code
ggplot() +
  geom_sf(data = fishnet, aes(fill = kde_value), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_c(
    name = "KDE Value",
    option = "plasma"
  ) +
  labs(
    title = "Kernel Density Estimation Baseline",
    subtitle = "Simple spatial smoothing of burglary locations"
  ) +
  theme_crime()
Question 3.1: How does the KDE map compare to the count map? What does KDE capture well? What does it miss?
Your answer here: - Comparison to the count map: - KDE pros and cons:
Why Start with KDE?
The KDE represents our null hypothesis: burglaries happen where they happened before, with no other information.
Your complex model must outperform this simple baseline to justify its complexity.
We’ll compare back to this at the end!
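That end-of-lab comparison boils down to one idea, sketched here with toy numbers (the prediction vectors are hypothetical, not model output): compute the same error metric for each set of predictions against the observed counts, and keep the complex model only if it beats the baseline.

```r
# Mean absolute error, the headline metric used in the cross-validation later
mae <- function(actual, predicted) mean(abs(actual - predicted))

actual   <- c(0, 2, 5, 1, 9, 0)
pred_kde <- c(1, 1, 4, 2, 6, 1)   # baseline: smoothed past counts
pred_mod <- c(0, 2, 4, 1, 8, 1)   # hypothetical model predictions

cat("KDE baseline MAE:", round(mae(actual, pred_kde), 3), "\n")
cat("Model MAE:       ", round(mae(actual, pred_mod), 3), "\n")
# A lower MAE than the baseline is the minimum bar for the model to clear
```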
Part 4: Create Spatial Predictor Variables
Now we’ll create features that might help predict burglaries. We’ll use “broken windows theory” logic: signs of disorder predict crime.
Exercise 4.1: Load 311 Abandoned Vehicle Calls
Code
abandoned_cars <- read_csv(here("data/abandoned_cars_2017.csv")) %>%
  filter(!is.na(Latitude), !is.na(Longitude)) %>%
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326) %>%
  st_transform('ESRI:102271')

cat("✓ Loaded abandoned vehicle calls\n")
cat(" - Number of calls:", nrow(abandoned_cars), "\n")
Data Loading Note
The data was downloaded from Chicago’s Open Data Portal. You can also request an API key from the Chicago portal and pull the data directly.
Consider: How might the 311 reporting system itself be biased? Who calls 311? What neighborhoods have better 311 awareness?
Exercise 4.2: Count of Abandoned Cars per Cell
Code
# Aggregate abandoned car calls to fishnet
abandoned_fishnet <- st_join(abandoned_cars, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(abandoned_cars = n())

# Join to fishnet
fishnet <- fishnet %>%
  left_join(abandoned_fishnet, by = "uniqueID") %>%
  mutate(abandoned_cars = replace_na(abandoned_cars, 0))

cat("Abandoned car distribution:\n")
summary(fishnet$abandoned_cars)
Code
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = abandoned_cars), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "magma") +
  labs(title = "Abandoned Vehicle 311 Calls") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "plasma") +
  labs(title = "Burglaries") +
  theme_crime()

p1 + p2 + plot_annotation(title = "Are abandoned cars and burglaries correlated?")
Question 4.1: Do you see a visual relationship between abandoned cars and burglaries? What does this suggest?
Your answer here:
Exercise 4.3: Nearest Neighbor Features
Count in a cell is one measure. Distance to the nearest 3 abandoned cars captures local context.
Code
# Calculate mean distance to 3 nearest abandoned cars
# (Do this OUTSIDE of mutate to avoid sf conflicts)

# Get coordinates
fishnet_coords <- st_coordinates(st_centroid(fishnet))
abandoned_coords <- st_coordinates(abandoned_cars)

# Calculate k nearest neighbors and distances
nn_result <- get.knnx(abandoned_coords, fishnet_coords, k = 3)

# Add to fishnet
fishnet <- fishnet %>%
  mutate(abandoned_cars.nn = rowMeans(nn_result$nn.dist))

cat("✓ Calculated nearest neighbor distances\n")
✓ Calculated nearest neighbor distances
Code
summary(fishnet$abandoned_cars.nn)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.386 88.247 143.293 246.946 271.283 2195.753
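Under the hood this feature is simple; here is a base-R sketch with toy coordinates (not the real data) of what `get.knnx` computes: for each query point, take the k smallest Euclidean distances to the reference points and average them.

```r
# Mean distance to the k nearest reference points, one value per query point
knn_mean_dist <- function(query, ref, k = 3) {
  apply(query, 1, function(q) {
    d <- sqrt((ref[, 1] - q[1])^2 + (ref[, 2] - q[2])^2)
    mean(sort(d)[1:k])
  })
}

ref   <- cbind(x = c(0, 10, 20, 100), y = c(0, 0, 0, 0))
query <- cbind(x = c(0, 50), y = c(0, 0))

knn_mean_dist(query, ref, k = 3)
# First query point sits on a reference point: mean 3-NN distance is
# (0 + 10 + 20) / 3 = 10; for the second it is (30 + 40 + 50) / 3 = 40
```

`get.knnx` does the same thing with a fast spatial index instead of brute force.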
Question 4.2: What does a low value of abandoned_cars.nn mean? A high value? Why might this be informative?
Your answer here:
Exercise 4.4: Distance to Hot Spots
Let’s identify clusters of abandoned cars using Local Moran’s I, then calculate distance to these hot spots.
Code
# Function to calculate Local Moran's I
calculate_local_morans <- function(data, variable, k = 5) {
  # Create spatial weights
  coords <- st_coordinates(st_centroid(data))
  neighbors <- knn2nb(knearneigh(coords, k = k))
  weights <- nb2listw(neighbors, style = "W", zero.policy = TRUE)

  # Calculate Local Moran's I
  local_moran <- localmoran(data[[variable]], weights)

  # Classify clusters
  mean_val <- mean(data[[variable]], na.rm = TRUE)

  data %>%
    mutate(
      local_i = local_moran[, 1],
      p_value = local_moran[, 5],
      is_significant = p_value < 0.05,
      moran_class = case_when(
        !is_significant ~ "Not Significant",
        local_i > 0 & .data[[variable]] > mean_val ~ "High-High",
        local_i > 0 & .data[[variable]] <= mean_val ~ "Low-Low",
        local_i < 0 & .data[[variable]] > mean_val ~ "High-Low",
        local_i < 0 & .data[[variable]] <= mean_val ~ "Low-High",
        TRUE ~ "Not Significant"
      )
    )
}

# Apply to abandoned cars
fishnet <- calculate_local_morans(fishnet, "abandoned_cars", k = 5)
Code
# Visualize hot spots
ggplot() +
  geom_sf(data = fishnet, aes(fill = moran_class), color = NA) +
  scale_fill_manual(
    values = c(
      "High-High" = "#d7191c",
      "High-Low" = "#fdae61",
      "Low-High" = "#abd9e9",
      "Low-Low" = "#2c7bb6",
      "Not Significant" = "gray90"
    ),
    name = "Cluster Type"
  ) +
  labs(
    title = "Local Moran's I: Abandoned Car Clusters",
    subtitle = "High-High = Hot spots of disorder"
  ) +
  theme_crime()
Code
# Get centroids of "High-High" cells (hot spots)
hotspots <- fishnet %>%
  filter(moran_class == "High-High") %>%
  st_centroid()

# Calculate distance from each cell to nearest hot spot
if (nrow(hotspots) > 0) {
  fishnet <- fishnet %>%
    mutate(
      dist_to_hotspot = as.numeric(
        st_distance(st_centroid(fishnet), hotspots %>% st_union())
      )
    )
  cat("✓ Calculated distance to abandoned car hot spots\n")
  cat(" - Number of hot spot cells:", nrow(hotspots), "\n")
} else {
  fishnet <- fishnet %>% mutate(dist_to_hotspot = 0)
  cat("⚠ No significant hot spots found\n")
}
✓ Calculated distance to abandoned car hot spots
- Number of hot spot cells: 275
Question 4.3: Why might distance to a cluster of abandoned cars be more informative than distance to a single abandoned car? What does Local Moran’s I tell us?
Your answer here:
Note
Local Moran’s I identifies:
High-High: Hot spots (high values surrounded by high values)
Low-Low: Cold spots (low values surrounded by low values)
High-Low / Low-High: Spatial outliers
This helps us understand spatial clustering patterns.
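The labels above reduce to a few threshold comparisons. This minimal sketch (toy values, standalone function) mirrors the `case_when()` logic inside `calculate_local_morans()`:

```r
# Classify one cell from its value, its local Moran statistic, and p-value
classify_moran <- function(value, local_i, p_value, mean_val, alpha = 0.05) {
  if (p_value >= alpha) return("Not Significant")
  if (local_i > 0 && value >  mean_val) return("High-High")
  if (local_i > 0 && value <= mean_val) return("Low-Low")
  if (local_i < 0 && value >  mean_val) return("High-Low")
  "Low-High"
}

classify_moran(value = 12, local_i =  1.8, p_value = 0.01, mean_val = 4)  # "High-High"
classify_moran(value = 12, local_i = -0.9, p_value = 0.02, mean_val = 4)  # "High-Low"
classify_moran(value =  1, local_i =  2.1, p_value = 0.30, mean_val = 4)  # "Not Significant"
```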
Part 5: Join Police Districts for Cross-Validation
We’ll use police districts for our spatial cross-validation.
Code
# Join district information to fishnet
fishnet <- st_join(
  fishnet,
  policeDistricts,
  join = st_within,
  left = TRUE
) %>%
  filter(!is.na(District))  # Remove cells outside districts

cat("✓ Joined police districts\n")
Part 6: Model Fitting
Exercise 6.1: Poisson Regression
Burglary counts are count data (0, 1, 2, 3...). We’ll use Poisson regression.
Code
# Create clean modeling dataset
fishnet_model <- fishnet %>%
  st_drop_geometry() %>%
  dplyr::select(
    uniqueID, District, countBurglaries,
    abandoned_cars, abandoned_cars.nn, dist_to_hotspot
  ) %>%
  na.omit()  # Remove any remaining NAs
Code
# Fit Poisson regression
model_poisson <- glm(
  countBurglaries ~ abandoned_cars + abandoned_cars.nn + dist_to_hotspot,
  data = fishnet_model,
  family = "poisson"
)
summary(model_poisson)
Question 6.1: Interpret the coefficients. Which variables are significant? What do the signs (positive/negative) tell you?
Your answer here:
Exercise 6.2: Check for Overdispersion
Poisson regression assumes mean = variance. Real count data often violates this (overdispersion).
Code
# Calculate dispersion parameter
dispersion <- sum(residuals(model_poisson, type = "pearson")^2) / model_poisson$df.residual
cat("Dispersion parameter:", round(dispersion, 2), "\n")
cat("Rule of thumb: >1.5 suggests overdispersion\n")

if (dispersion > 1.5) {
  cat("⚠ Overdispersion detected! Consider Negative Binomial model.\n")
} else {
  cat("✓ Dispersion looks okay for Poisson model.\n")
}
Exercise 6.3: Negative Binomial Regression
If overdispersed, use Negative Binomial regression (more flexible).
Code
# Fit Negative Binomial model
model_nb <- glm.nb(
  countBurglaries ~ abandoned_cars + abandoned_cars.nn + dist_to_hotspot,
  data = fishnet_model
)
summary(model_nb)

# Compare AIC (lower is better)
cat("\nModel Comparison:\n")
cat("Poisson AIC:", round(AIC(model_poisson), 1), "\n")
cat("Negative Binomial AIC:", round(AIC(model_nb), 1), "\n")
Question 6.2: Which model fits better (lower AIC)? What does this tell you about the data?
Your answer here:
Part 7: Spatial Cross-Validation
Standard cross-validation randomly splits data. But with spatial data, this means training on cells right next to test cells (information leakage!).
Leave-One-Group-Out (LOGO) Cross-Validation trains on all districts except one, then tests on the held-out district.
Code
# Get unique districts
districts <- unique(fishnet_model$District)
cv_results <- tibble()

for (i in seq_along(districts)) {
  test_district <- districts[i]

  # Split data
  train_data <- fishnet_model %>% filter(District != test_district)
  test_data <- fishnet_model %>% filter(District == test_district)

  # Fit model on training data
  model_cv <- glm.nb(
    countBurglaries ~ abandoned_cars + abandoned_cars.nn + dist_to_hotspot,
    data = train_data
  )

  # Predict on test data
  test_data <- test_data %>%
    mutate(prediction = predict(model_cv, test_data, type = "response"))

  # Calculate metrics
  mae <- mean(abs(test_data$countBurglaries - test_data$prediction))
  rmse <- sqrt(mean((test_data$countBurglaries - test_data$prediction)^2))

  # Store results
  cv_results <- bind_rows(
    cv_results,
    tibble(
      fold = i, test_district = test_district,
      n_test = nrow(test_data), mae = mae, rmse = rmse
    )
  )
}

cat("\n✓ Cross-Validation Complete\n")
cat("Mean MAE:", round(mean(cv_results$mae), 2), "\n")
cat("Mean RMSE:", round(mean(cv_results$rmse), 2), "\n")
Code
# Show results
cv_results %>%
  arrange(desc(mae)) %>%
  kable(
    digits = 2,
    caption = "LOGO CV Results by District"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
LOGO CV Results by District

| fold | test_district | n_test | mae  | rmse |
|------|---------------|--------|------|------|
| 7    | 3             | 43     | 6.05 | 8.08 |
| 4    | 6             | 63     | 3.30 | 4.75 |
| 14   | 11            | 43     | 3.19 | 4.09 |
| 12   | 12            | 73     | 3.10 | 4.62 |
| 6    | 7             | 52     | 3.08 | 4.07 |
| 19   | 16            | 129    | 2.98 | 3.48 |
| 17   | 14            | 46     | 2.96 | 4.24 |
| 16   | 25            | 85     | 2.75 | 3.62 |
| 15   | 18            | 30     | 2.75 | 4.15 |
| 8    | 2             | 56     | 2.69 | 3.60 |
| 21   | 20            | 30     | 2.68 | 3.11 |
| 22   | 24            | 41     | 2.65 | 2.98 |
| 5    | 8             | 197    | 2.53 | 3.48 |
| 3    | 22            | 112    | 2.26 | 2.83 |
| 10   | 10            | 63     | 2.19 | 3.09 |
| 20   | 17            | 82     | 2.17 | 2.60 |
| 9    | 9             | 107    | 2.16 | 2.59 |
| 18   | 19            | 63     | 2.10 | 2.57 |
| 13   | 15            | 32     | 2.08 | 2.67 |
| 1    | 5             | 98     | 2.04 | 3.09 |
| 2    | 4             | 235    | 1.84 | 3.69 |
| 11   | 1             | 28     | 1.76 | 2.11 |
Question 7.1: Why is spatial CV more appropriate than random CV for this problem? Which districts were hardest to predict?
Your answer here:
Connection to Week 5
Remember learning about train/test splits and cross-validation? This is a spatial version of that concept!
Why it matters: If we can only predict well in areas we’ve already heavily policed, what does that tell us about the model’s utility?
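The heart of the spatial version is that the test set is an entire spatial group, so no cell from the held-out district leaks into training. A minimal base-R sketch with a toy data frame (hypothetical district labels, not the real fishnet):

```r
df <- data.frame(
  District = c("1", "1", "2", "2", "3", "3"),
  countBurglaries = c(4, 2, 7, 5, 1, 0)
)

# Leave-One-Group-Out: each district is held out in turn
for (d in unique(df$District)) {
  train <- df[df$District != d, ]
  test  <- df[df$District == d, ]
  # fit on train, predict on test; here we only show the split sizes
  cat("Hold out district", d, "- train:", nrow(train), "test:", nrow(test), "\n")
}
```

Random CV would instead scatter a district's cells across both sides of the split, letting the model "memorize" local conditions it will never have in a truly new area.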
Part 8: Model Predictions and Comparison
Exercise 8.1: Generate Final Predictions
Code
# Fit final model on all data
final_model <- glm.nb(
  countBurglaries ~ abandoned_cars + abandoned_cars.nn + dist_to_hotspot,
  data = fishnet_model
)

# Add predictions back to fishnet
fishnet <- fishnet %>%
  mutate(
    prediction_nb = predict(final_model, fishnet_model, type = "response")[
      match(uniqueID, fishnet_model$uniqueID)
    ]
  )

# Also add KDE predictions (normalize to same scale as counts)
kde_sum <- sum(fishnet$kde_value, na.rm = TRUE)
count_sum <- sum(fishnet$countBurglaries, na.rm = TRUE)

fishnet <- fishnet %>%
  mutate(prediction_kde = (kde_value / kde_sum) * count_sum)
---title: "Predictive Policing - Technical Implementation"subtitle: "MUSA 5080 - Fall 2025"author: "Christine"date: todayformat: html: code-fold: show code-tools: true toc: true toc-depth: 3 toc-location: left theme: cosmo embed-resources: trueeditor: visualexecute: warning: false message: false---## About This ExerciseIn this exercise, you will build a spatial predictive model for burglaries using count regression and spatial features.# Learning ObjectivesBy the end of this exercise, you will be able to:1. Create a fishnet grid for aggregating point-level crime data2. Calculate spatial features including k-nearest neighbors and distance measures3. Diagnose spatial autocorrelation using Local Moran's I4. Fit and interpret Poisson and Negative Binomial regression for count data5. Implement spatial cross-validation (Leave-One-Group-Out)6. Compare model predictions to a Kernel Density Estimation baseline7. Evaluate model performance using appropriate metrics# Setup```{r setup}#| message: false#| warning: false# Load required packageslibrary(tidyverse) # Data manipulationlibrary(sf) # Spatial operationslibrary(here) # Relative file pathslibrary(viridis) # Color scaleslibrary(terra) # Raster operations (replaces 'raster')library(spdep) # Spatial dependencelibrary(FNN) # Fast nearest neighborslibrary(MASS) # Negative binomial regressionlibrary(patchwork) # Plot composition (replaces grid/gridExtra)library(knitr) # Tableslibrary(kableExtra) # Table formattinglibrary(classInt) # Classification intervalslibrary(here)# Spatstat split into sub-packageslibrary(spatstat.geom) # Spatial geometrieslibrary(spatstat.explore) # Spatial exploration/KDE# Set optionsoptions(scipen =999) # No scientific notationset.seed(5080) # Reproducibility# Create consistent theme for visualizationstheme_crime <-function(base_size =11) {theme_minimal(base_size = base_size) +theme(plot.title =element_text(face ="bold", size = base_size +1),plot.subtitle =element_text(color ="gray30", size = base_size 
-1),legend.position ="right",panel.grid.minor =element_blank(),axis.text =element_blank(),axis.title =element_blank() )}# Set as defaulttheme_set(theme_crime())cat("✓ All packages loaded successfully!\n")cat("✓ Working directory:", getwd(), "\n")```# Part 1: Load and Explore Data## Exercise 1.1: Load Chicago Spatial Data```{r load-boundaries}#| message: false# Load police districts (used for spatial cross-validation)policeDistricts <-st_read("https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON") %>%st_transform('ESRI:102271') %>% dplyr::select(District = dist_num)# Load police beats (smaller administrative units)policeBeats <-st_read("https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON") %>%st_transform('ESRI:102271') %>% dplyr::select(Beat = beat_num)# Load Chicago boundarychicagoBoundary <-st_read("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson") %>%st_transform('ESRI:102271')cat("✓ Loaded spatial boundaries\n")cat(" - Police districts:", nrow(policeDistricts), "\n")cat(" - Police beats:", nrow(policeBeats), "\n")```::: callout-note## Coordinate Reference SystemWe're using `ESRI:102271` (Illinois State Plane East, NAD83, US Feet). This is appropriate for Chicago because:- It minimizes distortion in this region- Uses feet (common in US planning)- Allows accurate distance calculations:::## Exercise 1.2: Load Burglary Data```{r load-burglaries}#| message: false# Load from provided data file (downloaded from Chicago open data portal)burglaries <-st_read(here("data", "burglaries.shp")) %>%st_transform('ESRI:102271')# Check the datacat("\n✓ Loaded burglary data\n")cat(" - Number of burglaries:", nrow(burglaries), "\n")cat(" - CRS:", st_crs(burglaries)$input, "\n")cat(" - Date range:", min(burglaries$Date, na.rm =TRUE), "to", max(burglaries$Date, na.rm =TRUE), "\n")```**Question 1.1:** How many burglaries are in the dataset? 
What time period does this cover? Why does the coordinate reference system matter for our spatial analysis?*Your answer here:* - Number of burglaries: 7482 - Time period: 2017.1.1-2018.1.1 - Coordinate reference system: ?????::: callout-warning## Critical Pause #1: Data ProvenanceBefore proceeding, consider where this data came from:**Who recorded this data?** Chicago Police Department officers and detectives**What might be missing?**- Unreported burglaries (victims didn't call police)- Incidents police chose not to record- Downgraded offenses (burglary recorded as trespassing)- Spatial bias (more patrol = more recorded crime)**Think About** Was there a Department of Justice investigation of CPD during this period? What did they find about data practices?:::## Exercise 1.3: Visualize Point Data```{r visualize-points}#| fig-width: 10#| fig-height: 5# Simple point mapp1 <-ggplot() +geom_sf(data = chicagoBoundary, fill ="gray95", color ="gray60") +geom_sf(data = burglaries, color ="#d62828", size =0.1, alpha =0.4) +labs(title ="Burglary Locations",subtitle =paste0("Chicago 2017, n = ", nrow(burglaries)) )# Density surface using modern syntaxp2 <-ggplot() +geom_sf(data = chicagoBoundary, fill ="gray95", color ="gray60") +geom_density_2d_filled(data =data.frame(st_coordinates(burglaries)),aes(X, Y),alpha =0.7,bins =8 ) +scale_fill_viridis_d(option ="plasma",direction =-1,guide ="none"# Modern ggplot2 syntax (not guide = FALSE) ) +labs(title ="Density Surface",subtitle ="Kernel density estimation" )# Combine plots using patchwork (modern approach)p1 + p2 +plot_annotation(title ="Spatial Distribution of Burglaries in Chicago",tag_levels ='A' )```**Question 1.2:** What spatial patterns do you observe? Are burglaries evenly distributed across Chicago? Where are the highest concentrations? What might explain these patterns?*Your answer here:* - Spatial patterns: - Evenly distributed: No, it's dispersed obviously and it's concentrated in 2 parts-north and south. 
- Highest concentrations: The highest concentration is in the south side. - Explaination:# Part 2: Create Fishnet Grid## Exercise 2.1: Understanding the FishnetA **fishnet grid** converts irregular point data into a regular grid of cells where we can:- Aggregate counts- Calculate spatial features- Apply regression modelsThink of it as overlaying graph paper on a map.```{r create-fishnet}# Create 500m x 500m gridfishnet <-st_make_grid( chicagoBoundary,cellsize =500, # 500 meters per cellsquare =TRUE) %>%st_sf() %>%mutate(uniqueID =row_number())# Keep only cells that intersect Chicagofishnet <- fishnet[chicagoBoundary, ]# View basic infocat("✓ Created fishnet grid\n")cat(" - Number of cells:", nrow(fishnet), "\n")cat(" - Cell size:", 500, "x", 500, "meters\n")cat(" - Cell area:", round(st_area(fishnet[1,])), "square meters\n")```**Question 2.1:** Why do we use a regular grid instead of existing boundaries like neighborhoods or census tracts? What are the advantages and disadvantages of this approach?*Your answer here:* - Reasons for using a regular grid: Existing boundaries like neighborhoods or census tracts尺度不一样,rbitrary, unequal sizes, Modifiable Areal Unit Problem. - Pros and Cons: Fishnet grid has Consistent size, no boundary bias. We use fishnet because: Standard approach in predictive policing, Easier spatial operations, Consistent unit of analysis. 
But it has Arbitrary, may split “natural” areas.## Exercise 2.2: Aggregate Burglaries to Grid```{r aggregate-burglaries}# Spatial join: which cell contains each burglary?burglaries_fishnet <-st_join(burglaries, fishnet, join = st_within) %>%st_drop_geometry() %>%group_by(uniqueID) %>%summarize(countBurglaries =n())# Join back to fishnet (cells with 0 burglaries will be NA)fishnet <- fishnet %>%left_join(burglaries_fishnet, by ="uniqueID") %>%mutate(countBurglaries =replace_na(countBurglaries, 0))# Summary statisticscat("\nBurglary count distribution:\n")summary(fishnet$countBurglaries)cat("\nCells with zero burglaries:", sum(fishnet$countBurglaries ==0), "/", nrow(fishnet),"(", round(100*sum(fishnet$countBurglaries ==0) /nrow(fishnet), 1), "%)\n")``````{r visualize-fishnet}#| fig-width: 8#| fig-height: 6# Visualize aggregated countsggplot() +geom_sf(data = fishnet, aes(fill = countBurglaries), color =NA) +geom_sf(data = chicagoBoundary, fill =NA, color ="white", linewidth =1) +scale_fill_viridis_c(name ="Burglaries",option ="plasma",trans ="sqrt", # Square root for better visualization of skewed databreaks =c(0, 1, 5, 10, 20, 40) ) +labs(title ="Burglary Counts by Grid Cell",subtitle ="500m x 500m cells, Chicago 2017" ) +theme_crime()```**Question 2.2:** What is the distribution of burglary counts across cells? Why do so many cells have zero burglaries? Is this distribution suitable for count regression? (Hint: look up overdispersion)*Your answer here:* - Distribution of burglary counts: - Reasons for cells have zero burglaries: - Suitability for count regression: No, 有太多0值,代表数据非常分散,方差应该大于平均值。泊松分布cannot handle overdispersion and will underestimated if overdispersed. 
Poisson assumption: Variance = Mean, but Reality with crime data: Variance \> Mean ,所以不符合泊松分布的假设,所以要用负二向式回归。# Part 3: Create a Kernel Density BaselineBefore building complex models, let's create a simple baseline using **Kernel Density Estimation (KDE)**.**The KDE baseline asks:** "What if crime just happens where it happened before?" (simple spatial smoothing, no predictors)```{r kde-baseline}#| message: false# Convert burglaries to ppp (point pattern) format for spatstatburglaries_ppp <-as.ppp(st_coordinates(burglaries),W =as.owin(st_bbox(chicagoBoundary)))# Calculate KDE with 1km bandwidthkde_burglaries <-density.ppp( burglaries_ppp,sigma =1000, # 1km bandwidthedge =TRUE# Edge correction)# Convert to terra raster (modern approach, not raster::raster)kde_raster <-rast(kde_burglaries)# Extract KDE values to fishnet cellsfishnet <- fishnet %>%mutate(kde_value = terra::extract( kde_raster,vect(fishnet),fun = mean,na.rm =TRUE )[, 2] # Extract just the values column )cat("✓ Calculated KDE baseline\n")``````{r visualize-kde}#| fig-width: 8#| fig-height: 6ggplot() +geom_sf(data = fishnet, aes(fill = kde_value), color =NA) +geom_sf(data = chicagoBoundary, fill =NA, color ="white", linewidth =1) +scale_fill_viridis_c(name ="KDE Value",option ="plasma" ) +labs(title ="Kernel Density Estimation Baseline",subtitle ="Simple spatial smoothing of burglary locations" ) +theme_crime()```**Question 3.1:** How does the KDE map compare to the count map? What does KDE capture well? What does it miss?*Your answer here:* - Comparison to the count map: - KDE pros and cons:::: callout-tip## Why Start with KDE?The KDE represents our **null hypothesis**: burglaries happen where they happened before, with no other information.**Your complex model must outperform this simple baseline to justify its complexity.**We'll compare back to this at the end!:::# Part 4: Create Spatial Predictor VariablesNow we'll create features that might help predict burglaries. 
We'll use "broken windows theory" logic: signs of disorder predict crime.## Exercise 4.1: Load 311 Abandoned Vehicle Calls```{r load-abandoned-cars}#| message: falseabandoned_cars <-read_csv(here("data/abandoned_cars_2017.csv"))%>%filter(!is.na(Latitude), !is.na(Longitude)) %>%st_as_sf(coords =c("Longitude", "Latitude"), crs =4326) %>%st_transform('ESRI:102271')cat("✓ Loaded abandoned vehicle calls\n")cat(" - Number of calls:", nrow(abandoned_cars), "\n")```::: callout-note## Data Loading NoteThe data was downloaded from Chicago's Open Data Portal. You can now request an api from the Chicago portal and tap into the data there.**Consider:** How might the 311 reporting system itself be biased? Who calls 311? What neighborhoods have better 311 awareness?:::## Exercise 4.2: Count of Abandoned Cars per Cell```{r count-abandoned-cars}# Aggregate abandoned car calls to fishnetabandoned_fishnet <-st_join(abandoned_cars, fishnet, join = st_within) %>%st_drop_geometry() %>%group_by(uniqueID) %>%summarize(abandoned_cars =n())# Join to fishnetfishnet <- fishnet %>%left_join(abandoned_fishnet, by ="uniqueID") %>%mutate(abandoned_cars =replace_na(abandoned_cars, 0))cat("Abandoned car distribution:\n")summary(fishnet$abandoned_cars)``````{r visualize-abandoned-cars}#| fig-width: 10#| fig-height: 4p1 <-ggplot() +geom_sf(data = fishnet, aes(fill = abandoned_cars), color =NA) +scale_fill_viridis_c(name ="Count", option ="magma") +labs(title ="Abandoned Vehicle 311 Calls") +theme_crime()p2 <-ggplot() +geom_sf(data = fishnet, aes(fill = countBurglaries), color =NA) +scale_fill_viridis_c(name ="Count", option ="plasma") +labs(title ="Burglaries") +theme_crime()p1 + p2 +plot_annotation(title ="Are abandoned cars and burglaries correlated?")```**Question 4.1:** Do you see a visual relationship between abandoned cars and burglaries? What does this suggest?## *Your answer here:*## Exercise 4.3: Nearest Neighbor FeaturesCount in a cell is one measure. 
Distance to the nearest 3 abandoned cars captures local context.```{r nn-feature}#| message: false# Calculate mean distance to 3 nearest abandoned cars# (Do this OUTSIDE of mutate to avoid sf conflicts)# Get coordinatesfishnet_coords <-st_coordinates(st_centroid(fishnet))abandoned_coords <-st_coordinates(abandoned_cars)# Calculate k nearest neighbors and distancesnn_result <-get.knnx(abandoned_coords, fishnet_coords, k =3)# Add to fishnetfishnet <- fishnet %>%mutate(abandoned_cars.nn =rowMeans(nn_result$nn.dist) )cat("✓ Calculated nearest neighbor distances\n")summary(fishnet$abandoned_cars.nn)```**Question 4.2:** What does a low value of `abandoned_cars.nn` mean? A high value? Why might this be informative?*Your answer here:*## Exercise 4.4: Distance to Hot SpotsLet's identify clusters of abandoned cars using Local Moran's I, then calculate distance to these hot spots.```{r local-morans-abandoned}# Function to calculate Local Moran's Icalculate_local_morans <-function(data, variable, k =5) {# Create spatial weights coords <-st_coordinates(st_centroid(data)) neighbors <-knn2nb(knearneigh(coords, k = k)) weights <-nb2listw(neighbors, style ="W", zero.policy =TRUE)# Calculate Local Moran's I local_moran <-localmoran(data[[variable]], weights)# Classify clusters mean_val <-mean(data[[variable]], na.rm =TRUE) data %>%mutate(local_i = local_moran[, 1],p_value = local_moran[, 5],is_significant = p_value <0.05,moran_class =case_when(!is_significant ~"Not Significant", local_i >0& .data[[variable]] > mean_val ~"High-High", local_i >0& .data[[variable]] <= mean_val ~"Low-Low", local_i <0& .data[[variable]] > mean_val ~"High-Low", local_i <0& .data[[variable]] <= mean_val ~"Low-High",TRUE~"Not Significant" ) )}# Apply to abandoned carsfishnet <-calculate_local_morans(fishnet, "abandoned_cars", k =5)``````{r visualize-morans}#| fig-width: 8#| fig-height: 6# Visualize hot spotsggplot() +geom_sf(data = fishnet, aes(fill = moran_class), color =NA ) +scale_fill_manual(values 
=c("High-High"="#d7191c","High-Low"="#fdae61","Low-High"="#abd9e9","Low-Low"="#2c7bb6","Not Significant"="gray90" ),name ="Cluster Type" ) +labs(title ="Local Moran's I: Abandoned Car Clusters",subtitle ="High-High = Hot spots of disorder" ) +theme_crime()``````{r distance-to-hotspots}# Get centroids of "High-High" cells (hot spots)hotspots <- fishnet %>%filter(moran_class =="High-High") %>%st_centroid()# Calculate distance from each cell to nearest hot spotif (nrow(hotspots) >0) { fishnet <- fishnet %>%mutate(dist_to_hotspot =as.numeric(st_distance(st_centroid(fishnet), hotspots %>%st_union()) ) )cat("✓ Calculated distance to abandoned car hot spots\n")cat(" - Number of hot spot cells:", nrow(hotspots), "\n")} else { fishnet <- fishnet %>%mutate(dist_to_hotspot =0)cat("⚠ No significant hot spots found\n")}```**Question 4.3:** Why might distance to a cluster of abandoned cars be more informative than distance to a single abandoned car? What does Local Moran's I tell us?*Your answer here:*::: callout-note**Local Moran's I** identifies:- **High-High**: Hot spots (high values surrounded by high values)- **Low-Low**: Cold spots (low values surrounded by low values)- **High-Low / Low-High**: Spatial outliersThis helps us understand spatial clustering patterns.:::------------------------------------------------------------------------# Part 5: Join Police Districts for Cross-ValidationWe'll use police districts for our spatial cross-validation.```{r join-districts}# Join district information to fishnetfishnet <-st_join( fishnet, policeDistricts,join = st_within,left =TRUE) %>%filter(!is.na(District)) # Remove cells outside districtscat("✓ Joined police districts\n")cat(" - Districts:", length(unique(fishnet$District)), "\n")cat(" - Cells:", nrow(fishnet), "\n")```# Part 6: Model Fitting## Exercise 6.1: Poisson RegressionBurglary counts are count data (0, 1, 2, 3...). 
We'll use **Poisson regression**.```{r prepare-data}# Create clean modeling datasetfishnet_model <- fishnet %>%st_drop_geometry() %>% dplyr::select( uniqueID, District, countBurglaries, abandoned_cars, abandoned_cars.nn, dist_to_hotspot ) %>%na.omit() # Remove any remaining NAscat("✓ Prepared modeling data\n")cat(" - Observations:", nrow(fishnet_model), "\n")cat(" - Variables:", ncol(fishnet_model), "\n")``````{r fit-poisson}# Fit Poisson regressionmodel_poisson <-glm( countBurglaries ~ abandoned_cars + abandoned_cars.nn + dist_to_hotspot,data = fishnet_model,family ="poisson")# Summarysummary(model_poisson)```**Question 6.1:** Interpret the coefficients. Which variables are significant? What do the signs (positive/negative) tell you?*Your answer here:*## Exercise 6.2: Check for OverdispersionPoisson regression assumes mean = variance. Real count data often violates this (overdispersion).```{r check-overdispersion}# Calculate dispersion parameterdispersion <-sum(residuals(model_poisson, type ="pearson")^2) / model_poisson$df.residualcat("Dispersion parameter:", round(dispersion, 2), "\n")cat("Rule of thumb: >1.5 suggests overdispersion\n")if (dispersion >1.5) {cat("⚠ Overdispersion detected! Consider Negative Binomial model.\n")} else {cat("✓ Dispersion looks okay for Poisson model.\n")}```## Exercise 6.3: Negative Binomial RegressionIf overdispersed, use **Negative Binomial regression** (more flexible).```{r fit-negbin}# Fit Negative Binomial modelmodel_nb <-glm.nb( countBurglaries ~ abandoned_cars + abandoned_cars.nn + dist_to_hotspot,data = fishnet_model)# Summarysummary(model_nb)# Compare AIC (lower is better)cat("\nModel Comparison:\n")cat("Poisson AIC:", round(AIC(model_poisson), 1), "\n")cat("Negative Binomial AIC:", round(AIC(model_nb), 1), "\n")```**Question 6.2:** Which model fits better (lower AIC)? What does this tell you about the data?*Your answer here:*# Part 7: Spatial Cross-ValidationStandard cross-validation randomly splits data. 
But with spatial data, a random split means training on cells right next to test cells (information leakage!).

**Leave-One-Group-Out (LOGO) Cross-Validation** trains on all districts except one, then tests on the held-out district.

```{r spatial-cv}
# Get the unique districts
districts <- unique(fishnet_model$District)

cv_results <- tibble()

cat("Running LOGO Cross-Validation...\n")

for (i in seq_along(districts)) {
  test_district <- districts[i]

  # Split the data
  train_data <- fishnet_model %>% filter(District != test_district)
  test_data <- fishnet_model %>% filter(District == test_district)

  # Fit the model on the training data
  model_cv <- glm.nb(
    countBurglaries ~ abandoned_cars + abandoned_cars.nn + dist_to_hotspot,
    data = train_data
  )

  # Predict on the held-out district
  test_data <- test_data %>%
    mutate(prediction = predict(model_cv, test_data, type = "response"))

  # Calculate error metrics
  mae <- mean(abs(test_data$countBurglaries - test_data$prediction))
  rmse <- sqrt(mean((test_data$countBurglaries - test_data$prediction)^2))

  # Store the results
  cv_results <- bind_rows(
    cv_results,
    tibble(
      fold = i,
      test_district = test_district,
      n_test = nrow(test_data),
      mae = mae,
      rmse = rmse
    )
  )

  cat("  Fold", i, "/", length(districts),
      "- District", test_district,
      "- MAE:", round(mae, 2), "\n")
}

# Overall results
cat("\n✓ Cross-Validation Complete\n")
cat("Mean MAE:", round(mean(cv_results$mae), 2), "\n")
cat("Mean RMSE:", round(mean(cv_results$rmse), 2), "\n")
```

```{r cv-results-table}
# Show the results, worst-predicted district first
cv_results %>%
  arrange(desc(mae)) %>%
  kable(
    digits = 2,
    caption = "LOGO CV Results by District"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```

**Question 7.1:** Why is spatial CV more appropriate than random CV for this problem? Which districts were hardest to predict?

*Your answer here:*

::: callout-note
## Connection to Week 5

Remember learning about train/test splits and cross-validation?
This is a spatial version of that concept!

**Why it matters:** If we can only predict well in areas we've already heavily policed, what does that tell us about the model's utility?
:::

# Part 8: Model Predictions and Comparison

## Exercise 8.1: Generate Final Predictions

```{r final-predictions}
# Fit the final model on all of the data
final_model <- glm.nb(
  countBurglaries ~ abandoned_cars + abandoned_cars.nn + dist_to_hotspot,
  data = fishnet_model
)

# Add predictions back to the fishnet, matching rows by uniqueID
fishnet <- fishnet %>%
  mutate(
    prediction_nb = predict(final_model, fishnet_model, type = "response")[
      match(uniqueID, fishnet_model$uniqueID)
    ]
  )

# Also add KDE predictions, rescaled so they sum to the observed total count
kde_sum <- sum(fishnet$kde_value, na.rm = TRUE)
count_sum <- sum(fishnet$countBurglaries, na.rm = TRUE)

fishnet <- fishnet %>%
  mutate(prediction_kde = (kde_value / kde_sum) * count_sum)
```

## Exercise 8.2: Compare Model vs. KDE Baseline

```{r compare-models}
#| fig-width: 12
#| fig-height: 4

# Create three maps: actual counts, model predictions, KDE baseline
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "plasma", limits = c(0, 15)) +
  labs(title = "Actual Burglaries") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_nb), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
  labs(title = "Model Predictions (Neg. Binomial)") +
  theme_crime()

p3 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_kde), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
  labs(title = "KDE Baseline Predictions") +
  theme_crime()

p1 + p2 + p3 +
  plot_annotation(
    title = "Actual vs. Predicted Burglaries",
    subtitle = "Does our complex model outperform simple KDE?"
  )
```

```{r model-comparison-metrics}
# Calculate performance metrics for both approaches
comparison <- fishnet %>%
  st_drop_geometry() %>%
  filter(!is.na(prediction_nb), !is.na(prediction_kde)) %>%
  summarize(
    model_mae = mean(abs(countBurglaries - prediction_nb)),
    model_rmse = sqrt(mean((countBurglaries - prediction_nb)^2)),
    kde_mae = mean(abs(countBurglaries - prediction_kde)),
    kde_rmse = sqrt(mean((countBurglaries - prediction_kde)^2))
  )

comparison %>%
  pivot_longer(everything(), names_to = "metric", values_to = "value") %>%
  separate(metric, into = c("approach", "metric"), sep = "_") %>%
  pivot_wider(names_from = metric, values_from = value) %>%
  kable(
    digits = 2,
    caption = "Model Performance Comparison"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```

**Question 8.1:** Does the complex model outperform the simple KDE baseline? By how much? Is the added complexity worth it?

*Your answer here:*

## Exercise 8.3: Where Does the Model Work Well?

```{r prediction-errors}
#| fig-width: 10
#| fig-height: 5

# Calculate signed and absolute errors
fishnet <- fishnet %>%
  mutate(
    error_nb = countBurglaries - prediction_nb,
    error_kde = countBurglaries - prediction_kde,
    abs_error_nb = abs(error_nb),
    abs_error_kde = abs(error_kde)
  )

# Map the errors
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = error_nb), color = NA) +
  scale_fill_gradient2(
    name = "Error",
    low = "#2166ac", mid = "white", high = "#b2182b",
    midpoint = 0,
    limits = c(-10, 10)
  ) +
  labs(title = "Model Errors (Actual - Predicted)") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = abs_error_nb), color = NA) +
  scale_fill_viridis_c(name = "Abs. Error", option = "magma") +
  labs(title = "Absolute Model Errors") +
  theme_crime()

p1 + p2
```

**Question 8.2:** Where does the model make the biggest errors? Are there spatial patterns in the errors?
What might this reveal?

*Your answer here:*

# Part 10: Summary Statistics and Tables

## Exercise 10.1: Model Summary Table

```{r model-summary-table}
# Create a tidy summary table of exponentiated coefficients (rate ratios)
model_summary <- broom::tidy(final_model, exponentiate = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 3)))

model_summary %>%
  kable(
    caption = "Final Negative Binomial Model Coefficients (Exponentiated)",
    col.names = c("Variable", "Rate Ratio", "Std. Error", "Z", "P-Value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  footnote(
    general = "Rate ratios > 1 indicate a positive association with burglary counts."
  )
```

## Exercise 10.2: Key Findings Summary

Based on your analysis, complete this summary:

**Technical Performance:**

- Cross-validation MAE: `r round(mean(cv_results$mae), 2)`
- Model vs. KDE: \[Which performed better?\]
- Most predictive variable: \[Which had the largest effect?\]

**Spatial Patterns:**

- Burglaries are \[evenly distributed / clustered\]
- Hot spots are located in \[describe\]
- Model errors show \[random / systematic\] patterns

**Model Limitations:**

- Overdispersion: \[Yes/No\]
- Spatial autocorrelation in residuals: \[Test this!\]
- Cells with zero counts: \[What % of data?\]

# Conclusion and Next Steps

::: callout-important
## What You've Accomplished

You've successfully built a spatial predictive model for burglaries using:

✓ Fishnet aggregation\
✓ Spatial features (counts, distances, nearest neighbors)\
✓ Spatial autocorrelation diagnostics (Local Moran's I)\
✓ Count regression (Poisson and Negative Binomial)\
✓ Spatial cross-validation (LOGO)\
✓ Comparison to baseline (KDE)
:::
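The limitations checklist above flags spatial autocorrelation in the residuals as something still to test. As a closing sketch, here is global Moran's I computed by hand on a toy 10×10 grid using rook (edge-sharing) neighbours, in base R only. In the real workflow you would more likely build a weights matrix and call something like `spdep::moran.test()` on the model residuals; treat this as an illustration of the arithmetic, with made-up grid values.

```r
# Hand-rolled global Moran's I for values on an nr x nc grid (rook weights)
moran_i <- function(values, nr, nc) {
  dev <- values - mean(values)
  idx <- function(r, c) (r - 1) * nc + c # row-major cell index
  num <- 0
  W <- 0
  for (r in 1:nr) for (c in 1:nc) {
    for (d in list(c(-1, 0), c(1, 0), c(0, -1), c(0, 1))) {
      rr <- r + d[1]
      cc <- c + d[2]
      if (rr >= 1 && rr <= nr && cc >= 1 && cc <= nc) {
        num <- num + dev[idx(r, c)] * dev[idx(rr, cc)]
        W <- W + 1
      }
    }
  }
  (length(values) / W) * num / sum(dev^2)
}

# Smooth west-east gradient: strong positive spatial autocorrelation
gradient <- rep(1:10, times = 10)
# Checkerboard: perfect negative spatial autocorrelation (I = -1 here)
checker <- as.numeric(outer(1:10, 1:10, "+") %% 2)

moran_i(gradient, 10, 10) # positive, close to +1
moran_i(checker, 10, 10)  # exactly -1
```

If the model's residuals behave like the gradient rather than like noise, the model is leaving spatial structure on the table.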