Question 1.1: How many burglaries are in the dataset? What time period does this cover? Why does the coordinate reference system matter for our spatial analysis?
Your answer here:
Warning: Critical Pause #1: Data Provenance
Before proceeding, consider where this data came from:
Who recorded this data? Chicago Police Department officers and detectives.
What might be distorted in the record?
Downgraded offenses (a burglary recorded as a lesser charge such as trespassing)
Spatial bias (more patrol = more recorded crime)
Think about: Was there a Department of Justice investigation of CPD during this period? What did it find about data practices?
Exercise 1.3: Visualize Point Data
Code
# Simple point map
p1 <- ggplot() +
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_sf(data = burglaries, color = "#d62828", size = 0.1, alpha = 0.4) +
  labs(
    title = "Burglary Locations",
    subtitle = paste0("Chicago 2017, n = ", nrow(burglaries))
  )

# Density surface using modern syntax
p2 <- ggplot() +
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_density_2d_filled(
    data = data.frame(st_coordinates(burglaries)),
    aes(X, Y),
    alpha = 0.7,
    bins = 8
  ) +
  scale_fill_viridis_d(
    option = "plasma",
    direction = -1,
    guide = "none"  # Modern ggplot2 syntax (not guide = FALSE)
  ) +
  labs(
    title = "Density Surface",
    subtitle = "Kernel density estimation"
  )

# Combine plots using patchwork (modern approach)
p1 + p2 +
  plot_annotation(
    title = "Spatial Distribution of Burglaries in Chicago",
    tag_levels = 'A'
  )
Question 1.2: What spatial patterns do you observe? Are burglaries evenly distributed across Chicago? Where are the highest concentrations? What might explain these patterns?
Your answer here:
Part 2: Create Fishnet Grid
Exercise 2.1: Understanding the Fishnet
A fishnet grid converts irregular point data into a regular grid of cells where we can:
Aggregate counts
Calculate spatial features
Apply regression models
Think of it as overlaying graph paper on a map.
Code
# Create 500m x 500m grid
fishnet <- st_make_grid(
  chicagoBoundary,
  cellsize = 500,  # 500 meters per cell
  square = TRUE
) %>%
  st_sf() %>%
  mutate(uniqueID = row_number())

# Keep only cells that intersect Chicago
fishnet <- fishnet[chicagoBoundary, ]

# View basic info
cat("✓ Created fishnet grid\n")
Question 2.1: Why do we use a regular grid instead of existing boundaries like neighborhoods or census tracts? What are the advantages and disadvantages of this approach?
Your answer here:
Exercise 2.2: Aggregate Burglaries to Grid
Code
# Spatial join: which cell contains each burglary?
burglaries_fishnet <- st_join(burglaries, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(countBurglaries = n())

# Join back to fishnet (cells with 0 burglaries will be NA)
fishnet <- fishnet %>%
  left_join(burglaries_fishnet, by = "uniqueID") %>%
  mutate(countBurglaries = replace_na(countBurglaries, 0))

# Summary statistics
cat("\nBurglary count distribution:\n")
Burglary count distribution:
Code
summary(fishnet$countBurglaries)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 2.000 3.042 5.000 40.000
Code
cat("\nCells with zero burglaries:", sum(fishnet$countBurglaries == 0), "/", nrow(fishnet),
    "(", round(100 * sum(fishnet$countBurglaries == 0) / nrow(fishnet), 1), "%)\n")
Cells with zero burglaries: 781 / 2458 ( 31.8 %)
Code
# Visualize aggregated counts
ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_c(
    name = "Burglaries",
    option = "plasma",
    trans = "sqrt",  # Square root for better visualization of skewed data
    breaks = c(0, 1, 5, 10, 20, 40)
  ) +
  labs(
    title = "Burglary Counts by Grid Cell",
    subtitle = "500m x 500m cells, Chicago 2017"
  ) +
  theme_crime()
Question 2.2: What is the distribution of burglary counts across cells? Why do so many cells have zero burglaries? Is this distribution suitable for count regression? (Hint: look up overdispersion)
Your answer here:
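To explore the overdispersion hint above, here is a minimal base-R sketch of a dispersion check. It uses a toy count vector standing in for the fishnet counts; in the lab you would pass `fishnet$countBurglaries` instead. Poisson counts have variance roughly equal to the mean, so a ratio well above 1 signals overdispersion and motivates the negative binomial model used later.

```r
# Dispersion ratio: variance / mean. For a Poisson variable this is ~1;
# a value well above 1 indicates overdispersion.
dispersion_ratio <- function(counts) var(counts) / mean(counts)

# Toy counts mimicking the skewed cell distribution (many zeros, a few large cells)
set.seed(42)
counts <- c(rep(0, 30), rpois(60, 3), 25, 32, 40)
dispersion_ratio(counts)  # well above 1 -> overdispersed
```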
Part 3: Create a Kernel Density Baseline
Before building complex models, let’s create a simple baseline using Kernel Density Estimation (KDE).
The KDE baseline asks: “What if crime just happens where it happened before?” (simple spatial smoothing, no predictors)
Code
# Convert burglaries to ppp (point pattern) format for spatstat
burglaries_ppp <- as.ppp(
  st_coordinates(burglaries),
  W = as.owin(st_bbox(chicagoBoundary))
)

# Calculate KDE with 1km bandwidth
kde_burglaries <- density.ppp(
  burglaries_ppp,
  sigma = 1000,  # 1km bandwidth
  edge = TRUE    # Edge correction
)

# Convert to terra raster (modern approach, not raster::raster)
kde_raster <- rast(kde_burglaries)

# Extract KDE values to fishnet cells
fishnet <- fishnet %>%
  mutate(
    kde_value = terra::extract(
      kde_raster,
      vect(fishnet),
      fun = mean,
      na.rm = TRUE
    )[, 2]  # Extract just the values column
  )

cat("✓ Calculated KDE baseline\n")
✓ Calculated KDE baseline
Code
ggplot() +
  geom_sf(data = fishnet, aes(fill = kde_value), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_c(
    name = "KDE Value",
    option = "plasma"
  ) +
  labs(
    title = "Kernel Density Estimation Baseline",
    subtitle = "Simple spatial smoothing of burglary locations"
  ) +
  theme_crime()
Question 3.1: How does the KDE map compare to the count map? What does KDE capture well? What does it miss?
Your answer here:
Tip: Why Start with KDE?
The KDE represents our null hypothesis: burglaries happen where they happened before, with no other information.
Your complex model must outperform this simple baseline to justify its complexity.
We’ll compare back to this at the end!
Part 4: Create Spatial Predictor Variables
Now we’ll create features that might help predict burglaries. We’ll use “broken windows theory” logic: signs of disorder predict crime.
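The exercises below reference an `abandoned_cars` count column on the fishnet that is created in a step not shown in this excerpt. As a hedged sketch, that aggregation would mirror the burglary aggregation in Exercise 2.2, assuming `abandoned_cars` is an sf point layer already loaded and projected to the same CRS as the fishnet:

```r
# Assumed step (mirrors Exercise 2.2): count abandoned-car reports per cell.
# `abandoned_cars` is assumed to be an sf point layer in the fishnet's CRS.
abandoned_fishnet <- st_join(abandoned_cars, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(abandoned_cars = n())

# Join back; cells with no reports get 0 rather than NA
fishnet <- fishnet %>%
  left_join(abandoned_fishnet, by = "uniqueID") %>%
  mutate(abandoned_cars = replace_na(abandoned_cars, 0))
```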
Question 4.1: Do you see a visual relationship between abandoned cars and burglaries? What does this suggest?
Your answer here:
Exercise 4.3: Nearest Neighbor Features
The count within a cell is one measure. The mean distance to the 3 nearest abandoned cars adds local context from beyond the cell's boundaries.
Code
# Calculate mean distance to 3 nearest abandoned cars
# (Do this OUTSIDE of mutate to avoid sf conflicts)

# Get coordinates
fishnet_coords <- st_coordinates(st_centroid(fishnet))
abandoned_coords <- st_coordinates(abandoned_cars)

# Calculate k nearest neighbors and distances (get.knnx is from the FNN package)
nn_result <- get.knnx(abandoned_coords, fishnet_coords, k = 3)

# Add to fishnet
fishnet <- fishnet %>%
  mutate(
    abandoned_cars.nn = rowMeans(nn_result$nn.dist)
  )

cat("✓ Calculated nearest neighbor distances\n")
✓ Calculated nearest neighbor distances
Code
summary(fishnet$abandoned_cars.nn)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.386 88.247 143.293 246.946 271.283 2195.753
Question 4.2: What does a low value of abandoned_cars.nn mean? A high value? Why might this be informative?
Your answer here:
Exercise 4.4: Distance to Hot Spots
Let’s identify clusters of abandoned cars using Local Moran’s I, then calculate distance to these hot spots.
Code
# Function to calculate Local Moran's I
calculate_local_morans <- function(data, variable, k = 5) {
  # Create spatial weights
  coords <- st_coordinates(st_centroid(data))
  neighbors <- knn2nb(knearneigh(coords, k = k))
  weights <- nb2listw(neighbors, style = "W", zero.policy = TRUE)

  # Calculate Local Moran's I
  local_moran <- localmoran(data[[variable]], weights)

  # Classify clusters
  mean_val <- mean(data[[variable]], na.rm = TRUE)

  data %>%
    mutate(
      local_i = local_moran[, 1],
      p_value = local_moran[, 5],
      is_significant = p_value < 0.05,
      moran_class = case_when(
        !is_significant ~ "Not Significant",
        local_i > 0 & .data[[variable]] > mean_val ~ "High-High",
        local_i > 0 & .data[[variable]] <= mean_val ~ "Low-Low",
        local_i < 0 & .data[[variable]] > mean_val ~ "High-Low",
        local_i < 0 & .data[[variable]] <= mean_val ~ "Low-High",
        TRUE ~ "Not Significant"
      )
    )
}

# Apply to abandoned cars
fishnet <- calculate_local_morans(fishnet, "abandoned_cars", k = 5)
Code
# Visualize hot spots
ggplot() +
  geom_sf(data = fishnet, aes(fill = moran_class), color = NA) +
  scale_fill_manual(
    values = c(
      "High-High" = "#d7191c",
      "High-Low" = "#fdae61",
      "Low-High" = "#abd9e9",
      "Low-Low" = "#2c7bb6",
      "Not Significant" = "gray90"
    ),
    name = "Cluster Type"
  ) +
  labs(
    title = "Local Moran's I: Abandoned Car Clusters",
    subtitle = "High-High = Hot spots of disorder"
  ) +
  theme_crime()
Code
# Get centroids of "High-High" cells (hot spots)
hotspots <- fishnet %>%
  filter(moran_class == "High-High") %>%
  st_centroid()

# Calculate distance from each cell to nearest hot spot
if (nrow(hotspots) > 0) {
  fishnet <- fishnet %>%
    mutate(
      dist_to_hotspot = as.numeric(
        st_distance(st_centroid(fishnet), hotspots %>% st_union())
      )
    )
  cat("✓ Calculated distance to abandoned car hot spots\n")
  cat("  - Number of hot spot cells:", nrow(hotspots), "\n")
} else {
  fishnet <- fishnet %>%
    mutate(dist_to_hotspot = 0)
  cat("⚠ No significant hot spots found\n")
}
✓ Calculated distance to abandoned car hot spots
- Number of hot spot cells: 275
Question 4.3: Why might distance to a cluster of abandoned cars be more informative than distance to a single abandoned car? What does Local Moran’s I tell us?
Your answer here:
Note
Local Moran’s I identifies:
High-High: Hot spots (high values surrounded by high values)
Low-Low: Cold spots (low values surrounded by low values)
High-Low / Low-High: Spatial outliers
This helps us understand spatial clustering patterns.
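To make the statistic concrete, here is a toy base-R computation of Local Moran's I for six cells forming one high cluster and one low cluster. The weights matrix `W` and values `x` are invented for illustration, and the variance convention is one common choice (spdep's `localmoran` normalizes slightly differently):

```r
# Local Moran statistic for cell i:
#   I_i = ((x_i - xbar) / s^2) * sum_j( w_ij * (x_j - xbar) )
# Positive I_i with high x_i -> High-High; positive I_i with low x_i -> Low-Low.
x <- c(10, 9, 8, 1, 2, 1)  # two clusters: three high cells, three low cells

W <- matrix(0, 6, 6)       # row-standardized neighbor weights (toy adjacency)
W[1, c(2, 3)] <- 0.5; W[2, c(1, 3)] <- 0.5; W[3, c(1, 2)] <- 0.5
W[4, c(5, 6)] <- 0.5; W[5, c(4, 6)] <- 0.5; W[6, c(4, 5)] <- 0.5

z <- x - mean(x)
I_local <- (z / var(x)) * as.vector(W %*% z)
round(I_local, 2)  # all positive: every cell sits in a like-valued cluster
```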
Part 5: Join Police Districts for Cross-Validation
We’ll use police districts for our spatial cross-validation.
Code
# Join district information to fishnet
fishnet <- st_join(
  fishnet,
  policeDistricts,
  join = st_within,
  left = TRUE
) %>%
  filter(!is.na(District))  # Remove cells outside districts

cat("✓ Joined police districts\n")
Code
# Show results
cv_results %>%
  arrange(desc(mae)) %>%
  kable(
    digits = 2,
    caption = "LOGO CV Results by District"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
LOGO CV Results by District

| fold | test_district | n_test | mae  | rmse |
|------|---------------|--------|------|------|
| 7    | 3             | 43     | 6.05 | 8.08 |
| 4    | 6             | 63     | 3.30 | 4.75 |
| 14   | 11            | 43     | 3.19 | 4.09 |
| 12   | 12            | 73     | 3.10 | 4.62 |
| 6    | 7             | 52     | 3.08 | 4.07 |
| 19   | 16            | 129    | 2.98 | 3.48 |
| 17   | 14            | 46     | 2.96 | 4.24 |
| 16   | 25            | 85     | 2.75 | 3.62 |
| 15   | 18            | 30     | 2.75 | 4.15 |
| 8    | 2             | 56     | 2.69 | 3.60 |
| 21   | 20            | 30     | 2.68 | 3.11 |
| 22   | 24            | 41     | 2.65 | 2.98 |
| 5    | 8             | 197    | 2.53 | 3.48 |
| 3    | 22            | 112    | 2.26 | 2.83 |
| 10   | 10            | 63     | 2.19 | 3.09 |
| 20   | 17            | 82     | 2.17 | 2.60 |
| 9    | 9             | 107    | 2.16 | 2.59 |
| 18   | 19            | 63     | 2.10 | 2.57 |
| 13   | 15            | 32     | 2.08 | 2.67 |
| 1    | 5             | 98     | 2.04 | 3.09 |
| 2    | 4             | 235    | 1.84 | 3.69 |
| 11   | 1             | 28     | 1.76 | 2.11 |
Question 7.1: Why is spatial CV more appropriate than random CV for this problem? Which districts were hardest to predict?
Your answer here:
Note: Connection to Week 5
Remember learning about train/test splits and cross-validation? This is a spatial version of that concept!
Why it matters: If we can only predict well in areas we’ve already heavily policed, what does that tell us about the model’s utility?
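The LOGO (leave-one-group-out) idea can be sketched in a few lines. This is a hedged, self-contained illustration with toy data and a Poisson `glm`; the lab's actual loop runs `glm.nb` on the fishnet with districts as the groups:

```r
# Minimal LOGO CV: each district is held out once as the test set, so the
# model is always scored on a spatial unit it never saw during training.
set.seed(1)
d <- data.frame(district = rep(1:3, each = 20), x = rnorm(60))
d$y <- rpois(60, lambda = exp(0.3 + 0.5 * d$x))

logo_mae <- sapply(unique(d$district), function(g) {
  train <- d[d$district != g, ]
  test  <- d[d$district == g, ]
  fit   <- glm(y ~ x, data = train, family = poisson)
  mean(abs(test$y - predict(fit, test, type = "response")))
})
round(logo_mae, 2)  # one MAE per held-out district
```

A district with a much higher held-out MAE than the rest is one the model generalizes to poorly, exactly the comparison Question 7.1 asks for.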
Part 8: Model Predictions and Comparison
Exercise 8.1: Generate Final Predictions
Code
# Fit final model on all data (glm.nb is from the MASS package)
final_model <- glm.nb(
  countBurglaries ~ abandoned_cars + abandoned_cars.nn + dist_to_hotspot,
  data = fishnet_model
)

# Add predictions back to fishnet
fishnet <- fishnet %>%
  mutate(
    prediction_nb = predict(final_model, fishnet_model, type = "response")[
      match(uniqueID, fishnet_model$uniqueID)
    ]
  )

# Also add KDE predictions (normalize to same scale as counts)
kde_sum <- sum(fishnet$kde_value, na.rm = TRUE)
count_sum <- sum(fishnet$countBurglaries, na.rm = TRUE)

fishnet <- fishnet %>%
  mutate(
    prediction_kde = (kde_value / kde_sum) * count_sum
  )
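With both prediction columns in place, the comparison promised in the Part 3 tip can be scored with MAE. A minimal sketch (the commented lines assume the `fishnet` columns created above):

```r
# Mean absolute error: average size of the prediction miss, in counts.
mae <- function(obs, pred) mean(abs(obs - pred), na.rm = TRUE)

# Hypothetical usage after Part 8 completes:
# mae(fishnet$countBurglaries, fishnet$prediction_nb)
# mae(fishnet$countBurglaries, fishnet$prediction_kde)

# Tiny self-contained check:
mae(c(0, 2, 5), c(1, 2, 3))  # (1 + 0 + 2) / 3 = 1
```

If the model's MAE is not clearly below the KDE baseline's, the extra complexity of the negative binomial model is not buying predictive value.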