Week 9: MUSA 5080 - Dirty Data, Bad Predictions, and the Ethics of Crime Forecasting
Dr. Elizabeth Delmelle
2025-11-03
Modified Schedule - Dr. Delmelle at CPLN Virtual Open House until ~ 10:30AM
10:15 - 10:30AM (15 min) - Watch video introduction to today's topic
10:30 - 11:15 Small Group Discussion (at tables):
The Richardson et al. paper (posted on Canvas) describes 'dirty data' in 13 police departments that used predictive policing.
The main points are summarized in these slides, but let’s make this learning more active.
At your table, discuss:
What is ‘dirty data’ and how does it get created?
Pick ONE of the city case studies (Chicago, New Orleans, Maricopa County, etc.) and identify 2-3 key concerns - what went wrong?
What is a consent decree in the context of predictive policing? Does it solve the problems?
Opening Thought
Before we discuss HOW to build predictive policing systems, we need to ask: SHOULD we?
“A statistically ‘good’ model can still be socially harmful.”
Richardson, Schultz & Crawford (2019)
Today’s Critical Questions
Technical Questions:
How do we model crime counts?
What spatial features predict crime?
How do we validate predictions?
Can we outperform baseline methods?
Critical Questions:
Whose data? Whose crimes?
What if the data is “dirty”?
Who benefits? Who is harmed?
What feedback loops are created?
Can technical solutions fix social problems?
Today’s Approach
We will learn the technical methods AND critically interrogate their use. Both skills are essential for ethical data science.
Part 1: The Seductive Promise of Predictive Policing
The Sales Pitch
What vendors and police departments claim:
Efficiency: “Deploy limited resources where they’re needed most”
Objectivity: “Remove human bias from decision-making”
Proactivity: “Prevent crime before it happens”
Data-driven: “Let the data tell us where crime will occur”
Sounds great, right?
But these claims rest on critical assumptions:
That crime data accurately reflects crime (it doesn’t)
That past patterns predict future crime (they might just predict policing)
That we can separate “good” from “bad” data (we often can’t)
That technical solutions can fix social problems (they can’t)
The Technical Evolution
| Generation | Method | Data Used | Example |
|---|---|---|---|
| 1st: Hotspots | Kernel Density | Past crime locations | KDE maps |
| 2nd: Risk Terrain | Logistic Reg. | Crime + features | RTM software |
| 3rd: ML | Random Forest, Neural Nets | Everything | PredPol, Palantir |
| 4th: Person-Based | Network analysis | Social connections | Strategic Subject List |
Each generation claims to be more “objective” and “accurate”
But what if they’re all built on the same flawed foundation?
Part 2: The Dirty Data Problem
What Is “Dirty Data”?
Traditional definition (data mining):
Missing data
Incorrect data
Non-standardized formats
Extended definition (Richardson et al. 2019):
“Data derived from or influenced by corrupt, biased, and unlawful practices, including data that has been intentionally manipulated or ‘juked,’ as well as data that is distorted by individual and societal biases.”
The Many Forms of Dirty Data
1. Fabricated/Manipulated Data
False arrests (planted evidence)
Downgraded crime classifications to “juke the stats”
Pressuring victims not to file reports
2. Systematically Biased Data
Over-policing of certain communities → more recorded “crime”
Under-policing of white-collar crime → appears less common
Racial profiling → disproportionate stops/arrests
3. Missing/Incomplete Data
Unreported crimes (especially in over-policed areas with low police trust)
Ignored complaints
Incomplete records
4. Proxy Problems
Arrests ≠ crimes committed
Calls for service ≠ actual need
Gang database ≠ actual gang membership
Case Study 1: “Juking the Stats” - The Wire and Baltimore
From TV to Reality:
The Wire (2004): “If the crime rate doesn’t fall, you most certainly will”
Baltimore Reality (2008-2018):
14,000+ serious assaults misrecorded as minor offenses
Extensive Gun Trace Task Force corruption
Officers robbing residents, planting evidence
False arrests, fabricated reports
Data manipulation to show “success”
Result: 55+ potential lawsuits, thousands of convictions questioned
Question: What happens when this data trains a predictive algorithm?
Case Study 2: CompStat and NYPD
The Promise: Data-driven accountability, crime reduction
The Reality Revealed:
100+ retired NYPD captains surveyed: Intense pressure led to stat manipulation
Serious crimes downgraded to meet targets
Officers planting drugs to meet arrest quotas
Commanders persuading victims not to file reports
The Dual Strategy:
Downgrade serious crimes (reported to FBI) → claim success
Increase minor arrests (stops, summonses) → show “control”
2013: Independent audit confirmed systematic data problems
The Feedback Loop Diagram
The Confirmation Bias Loop:
Algorithm learns: “Crime happens in neighborhood X”
Police sent to neighborhood X
More arrests in neighborhood X (regardless of actual crime)
Algorithm “confirmed”: “We were right about neighborhood X!”
Cycle intensifies
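The loop above can be sketched as a toy simulation. All numbers here are illustrative assumptions (two neighborhoods with identical true crime, where recorded arrests track patrol allocation and each retraining round shifts resources toward whichever area has more recorded arrests):

```r
# Toy model of the confirmation-bias loop (illustrative numbers only)
true_crime   <- c(A = 10, B = 10)     # identical underlying crime in both areas
patrol_share <- c(A = 0.55, B = 0.45) # algorithm starts slightly biased toward A

trajectory <- numeric(6)
trajectory[1] <- patrol_share["A"]

for (step in 1:5) {
  arrests <- true_crime * patrol_share           # recorded "crime" tracks patrols
  top <- names(which.max(arrests))               # the data's apparent hotspot
  patrol_share[top] <- patrol_share[top] + 0.05  # shift more resources there
  patrol_share <- patrol_share / sum(patrol_share)
  trajectory[step + 1] <- patrol_share["A"]
}

round(trajectory, 3)
# A's patrol share rises every round even though true crime never changed
```

The direction of the drift is set entirely by the initial bias in deployment, not by any difference in true crime; that is the feedback loop in miniature.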
Part 3: Technical Fixes Can’t Solve Social Problems
Vendor Claims About Bias Mitigation
PredPol claims:
“Uses ONLY 3 data points—crime type, crime location, and crime date/time”
HunchLab claims:
“We would not use data that relates people to predict places—no arrests, no social media, no gang status”
Both exclude: Arrest data, stop data, traffic stops
Both include: Crime reports, calls for service
Why “Cleaning” The Data Isn’t Enough
Problem 1: Crime reports reflect police decisions
Officer decides what to investigate
Officer decides what to classify as “crime”
Officer decides what to document
Problem 2: Calls for service reflect community bias
Neighbors calling police on Black people barbecuing
“Suspicious activity” = person of color in “wrong” neighborhood
Gentrification → increased 311 calls on existing residents
Problem 3: What counts as “clean” data?
If drug arrests are racially biased, exclude them ✓
But isn’t burglary enforcement also biased? What about assault?
Where do you draw the line?
The Impossibility of Neutral Crime Data
Crime data is ALWAYS:
Socially constructed - Societies define what counts as “crime”
Selectively enforced - More resources to some neighborhoods
Organizationally filtered - Police priorities, department culture
Politically shaped - “Tough on crime” eras, moral panics
Why Not Linear Regression for Counts?
Problems with OLS on count outcomes:
Can predict negative values (impossible for counts)
Assumes constant variance (for counts, variance grows with the mean)
Assumes a continuous outcome (counts are discrete)
Assumes normally distributed errors (count data is right-skewed)
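A quick simulated check (not the course data) makes the first problem concrete: fit OLS and a Poisson GLM to the same counts and compare their minimum predictions.

```r
# Sketch: OLS vs. Poisson on simulated count data
set.seed(123)
n <- 500
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.2 + 0.8 * x))  # counts from a log-linear rate

ols  <- lm(y ~ x)
pois <- glm(y ~ x, family = poisson)

min(predict(ols))                      # OLS predicts impossible negative counts
min(predict(pois, type = "response"))  # Poisson predictions stay positive
```

The Poisson model's log link guarantees positive predictions by construction, which is one reason count models are the default for crime counts.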
Distribution of Crime Counts
# Typical pattern for crime data
ggplot(fishnet, aes(x = countBurglaries)) +
  geom_histogram(binwidth = 1, fill = "#440154FF", color = "white") +
  labs(
    title = "Distribution of Burglary Counts",
    subtitle = "Most cells have 0-2 burglaries, few have many",
    x = "Burglaries per Cell",
    y = "Number of Cells"
  ) +
  theme_minimal()
Administrative units (e.g., census tracts)
Con: Arbitrary, unequal sizes, Modifiable Areal Unit Problem
Fishnet grid (regular cells)
Pro: Consistent size, no boundary bias
Con: Arbitrary, may split “natural” areas
We use fishnet because:
Standard approach in predictive policing
Easier spatial operations
Consistent unit of analysis
Creating a Fishnet Grid
library(sf)
library(dplyr)  # for mutate(), row_number()

# Step 1: Define cell size (in map units - meters for our projection)
cell_size <- 500  # 500m x 500m cells

# Step 2: Create grid over study area
fishnet <- st_make_grid(
  chicago_boundary,
  cellsize = cell_size,
  square = TRUE,
  what = "polygons"
) %>%
  st_sf() %>%
  mutate(uniqueID = row_number())

# Step 3: Clip to study area (remove cells outside boundary)
fishnet <- fishnet[chicago_boundary, ]

# Check results
nrow(fishnet)          # Number of cells
st_area(fishnet[1, ])  # Area of one cell (should be 250,000 m²)
Grid Cell Size: A Critical Choice
Common sizes:
250m × 250m: Fine-grained, many cells, computationally intensive
500m × 500m: Standard, balance detail and computation
1000m × 1000m: Coarse, faster, loses local detail
Smaller cells:
✓ More spatial detail
✓ Better capture local patterns
✗ More zeros (sparse data)
✗ Computational cost
Larger cells:
✓ Fewer zeros
✓ More stable estimates
✗ Lose local variation
✗ May obscure hotspots
Choice affects results! No “correct” answer.
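To see the trade-off numerically, here is a sketch using a toy 10 km × 10 km square as the study area (a stand-in, not the Chicago boundary):

```r
library(sf)

# Toy square study area, 10,000 m on each side
study_area <- st_sfc(st_polygon(list(rbind(
  c(0, 0), c(10000, 0), c(10000, 10000), c(0, 10000), c(0, 0)
))))

grid_500  <- st_make_grid(study_area, cellsize = 500)   # 20 x 20 grid
grid_1000 <- st_make_grid(study_area, cellsize = 1000)  # 10 x 10 grid

length(grid_500)   # 400 cells
length(grid_1000)  # 100 cells
```

Halving the cell size quadruples the number of cells, so the same set of crime points is spread across four times as many units; that is exactly why smaller cells mean more zeros and more computation.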
Aggregating Points to Grid
Process:
Spatial join between crimes (points) and fishnet (polygons)
Count crimes per cell
Handle cells with zero crimes
library(tidyr)  # for replace_na()

# Count burglaries per cell
burglary_counts <- st_join(burglaries, fishnet) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(countBurglaries = n())

# Join back to fishnet (left_join keeps cells with zero crimes)
fishnet <- fishnet %>%
  left_join(burglary_counts, by = "uniqueID") %>%
  mutate(countBurglaries = replace_na(countBurglaries, 0))

# Summary
summary(fishnet$countBurglaries)
# Min. 1st Qu. Median  Mean 3rd Qu.  Max.
#    0       0      1   2.3       3    47
Handling Zeros in Count Data
Crime data typically has MANY zeros
Example distribution:
40% of cells: 0 burglaries
30% of cells: 1 burglary
20% of cells: 2-3 burglaries
10% of cells: 4+ burglaries
Implications:
Poisson handles zeros naturally (built into distribution)
Zero-inflation: If >60% zeros, consider Zero-Inflated Poisson (ZIP)
For today: Standard Negative Binomial handles our zeros fine
Critical interpretation: Are zeros “true zeros” (no crime) or “missing data” (unreported)?
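One way to judge whether zeros are excessive is to compare the observed share of zero cells to what a Poisson with the same mean would expect. The sketch below uses simulated overdispersed counts (the mu and size values are arbitrary stand-ins for real crime data):

```r
# Sketch: zero-inflation check on simulated overdispersed counts
set.seed(7)
counts <- rnbinom(1000, mu = 2.3, size = 0.8)

obs_zeros  <- mean(counts == 0)               # observed share of zero cells
pois_zeros <- dpois(0, lambda = mean(counts)) # zero share a same-mean Poisson expects

obs_zeros   # far more zeros than...
pois_zeros  # ...a plain Poisson would predict
```

A large gap favors the Negative Binomial; a gap that survives even the Negative Binomial fit is the signal to consider ZIP.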
Part 7E: Spatial Cross-Validation
Why Standard Cross-Validation Fails for Spatial Data
Standard k-fold CV:
Randomly split data into k folds
Train on k-1 folds, test on 1
Repeat k times
Problem with spatial data:
Nearby observations are correlated
Training set includes cells adjacent to test cells
Spatial leakage: Model learns from neighbors of test set
Overly optimistic performance estimates
Solution: Spatial cross-validation
Leave-One-Group-Out Cross-Validation (LOGO-CV)
Principle: Hold out entire spatial groups, not individual cells
Process:
Divide study area into groups (e.g., police districts)
Hold out all cells in District 1
Train model on Districts 2-N
Predict for District 1
Repeat for each district
Why better:
Tests generalization to truly new areas
No spatial leakage between train/test
More realistic deployment scenario
Conservative performance estimates
LOGO-CV Implementation
library(MASS)  # for glm.nb()

# Get unique districts
districts <- unique(fishnet$District)

# Initialize results
cv_results <- list()

# Loop through districts
for (dist in districts) {
  # Split data
  train_data <- fishnet %>% filter(District != dist)
  test_data  <- fishnet %>% filter(District == dist)

  # Fit model on training data
  model_cv <- glm.nb(
    countBurglaries ~ Abandoned_Cars + Abandoned_Cars.nn + abandoned.isSig.dist,
    data = train_data
  )

  # Predict on test data
  test_data$prediction <- predict(model_cv, test_data, type = "response")

  # Store results
  cv_results[[dist]] <- test_data
}

# Combine all predictions
all_predictions <- bind_rows(cv_results)
Evaluating CV Performance
Common metrics for count models:
Mean Absolute Error (MAE):
\[MAE = \frac{1}{n}\sum_i |y_i - \hat{y}_i|\]

Root Mean Squared Error (RMSE):
\[RMSE = \sqrt{\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2}\]

Mean Error (Bias):
\[ME = \frac{1}{n}\sum_i (y_i - \hat{y}_i)\]
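All three metrics are one-liners in R. The sketch below uses toy vectors; in practice y and yhat would come from the LOGO-CV output (e.g., all_predictions$countBurglaries and all_predictions$prediction).

```r
# Sketch: CV metrics on toy observed/predicted counts
y    <- c(0, 2, 1, 5, 3)  # observed
yhat <- c(1, 2, 0, 4, 5)  # predicted

mae  <- mean(abs(y - yhat))       # average error magnitude
rmse <- sqrt(mean((y - yhat)^2))  # penalizes large misses more heavily
bias <- mean(y - yhat)            # sign shows systematic over/under-prediction

c(MAE = mae, RMSE = rmse, Bias = bias)
```

A negative bias means the model over-predicts on average; with LOGO-CV it is worth computing these per held-out district, since errors often vary sharply across areas.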