---
title: "Eviction Risk Prediction in Philadelphia"
author:
- "Lingxuan Gao"
- "Xiaoqing Chen"
date: "`r Sys.Date()`"
format:
html:
code-fold: true
code-tools: true
toc: true
toc-depth: 3
toc-location: left
theme: cosmo
embed-resources: true
editor: visual
execute:
warning: false
message: false
---
# Introduction: Predicting the 2026 Housing Cliff
As Philadelphia approaches 2026, the city faces an unprecedented "housing cliff" driven by federal policy shocks. New mandates are shifting away from the "Housing First" model and enforcing stricter NSPIRE inspection standards, threatening the funding for approximately 1,200 permanent supportive housing units and risking mass disqualifications of affordable stock.
In this volatile landscape, relying solely on historical averages is insufficient. This project develops a spatial predictive model to forecast eviction risk for the Nov 2025 – Nov 2026 period. By integrating structural distress indicators (L&I violations), spatial context, and demographic data, we aim to identify the specific census tracts most vulnerable to this new wave of instability, enabling a "High Precision" allocation of limited legal aid resources.
------------------------------------------------------------------------
# Setup
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
warning = FALSE,
message = FALSE,
cache = TRUE
)
Sys.setlocale("LC_TIME", "English")
```
## Load Libraries
```{r load_libraries}
# Core tidyverse
library(tidyverse)
library(lubridate)
library(janitor)
library(scales)
# Spatial data
library(sf)
library(tigris)
# Census data
library(tidycensus)
# Weather data
library(riem) # For Philadelphia weather from ASOS stations
# Visualization
library(viridis)
library(gridExtra)
library(knitr)
library(kableExtra)
# Get rid of scientific notation. We gotta look good!
options(scipen = 999)
```
## Define Themes
```{r themes}
plotTheme <- theme(
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10),
plot.caption = element_text(size = 8),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
axis.title = element_text(size = 11, face = "bold"),
panel.background = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.ticks = element_blank(),
legend.position = "right"
)
mapTheme <- theme(
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10),
plot.caption = element_text(size = 8),
axis.line = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
panel.background = element_blank(),
panel.border = element_blank(),
panel.grid.major = element_line(colour = 'transparent'),
panel.grid.minor = element_blank(),
legend.position = "right",
plot.margin = margin(1, 1, 1, 1, 'cm'),
legend.key.height = unit(1, "cm"),
legend.key.width = unit(0.2, "cm")
)
palette5 <- c("#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c")
```
# 1. Data Source
## Load and clean all the datasets
```{r load data}
# Monthly totals vs baseline (city-wide)
philly_barchart <- read_csv("data/philadelphia_barchart.csv") %>%
mutate(
date = my(month)
)
# Claims severity over time
philly_claims_monthly <- read_csv("data/philadelphia_claims_monthly.csv") %>%
mutate(
month_date = ymd(month_date)
)
# Group-level (race, gender, etc.) monthly trends
philly_linechart <- read_csv("data/philadelphia_linechart.csv") %>%
mutate(
month_date = my(month)
)
# Census tract map snapshot
philly_map <- read_csv("data/philadelphia_map.csv") %>%
mutate(
GEOID = as.character(id),
month_date = dmy(month_date)
)
# Tract-level monthly pre/post pandemic
philly_monthly <- read_csv("data/philadelphia_monthly_2020_2021.csv") %>%
mutate(
month_date = my(month)
)
# Tract-level weekly
weekly <- read_csv("data/philadelphia_weekly_2020_2021.csv") %>%
mutate(
week_date = ymd(week_date)
)
# Code / license violations (point data, older)
li_violations <- read_csv("data/li_violations.csv")
# Hotspot properties (top filers)
hotspots <- read_csv("data/philadelphia_hotspots_media_report.csv") %>%
mutate(
time_period = ymd(time_period),
end_date = ymd(end_date)
)
```
**Data sources**:
Primary dataset: Eviction Lab – Philadelphia Tracking
- City-level series
- Tract-level monthly data over multiple years
- Tract-level snapshot of Philadelphia eviction risk (since Nov 2024)
- Group-level disparities
- landlord hotspots
# 2. Exploratory Data Analysis
## 2.1 Citywide filings over time
We reorganized the monthly eviction-filing counts by grouping each month into a season (Winter, Spring, Summer, Fall). We computed four separate seasonal baselines:
the historical average for Winter/Spring/Summer/Fall months
Then we produced a four-panel faceted plot to let us compare each season to what is “normal” for that season.
```{r season}
# Add season column
philly_barchart2 <- philly_barchart %>%
mutate(
# extract month number
m = month(date),
season = case_when(
m %in% c(12, 1, 2) ~ "Winter",
m %in% c(3, 4, 5) ~ "Spring",
m %in% c(6, 7, 8) ~ "Summer",
m %in% c(9, 10, 11) ~ "Fall"
)
)
# Calculate true seasonal averages
season_baseline <- philly_barchart2 %>%
group_by(season) %>%
summarise(season_avg = mean(month_filings, na.rm = TRUE))
season_baseline
```
```{r seaon chart}
ggplot(philly_barchart2, aes(x = date, y = month_filings)) +
geom_col(fill = "#3182bd") +
geom_hline(
data = season_baseline,
aes(yintercept = season_avg),
linetype = "dashed"
) +
facet_wrap(~ season, ncol = 2, scales = "free_y") +
labs(
title = "Monthly Eviction Filings by Season",
subtitle = "Each panel shows filings and seasonal baseline",
x = "Date",
y = "Eviction filings"
) +
theme_minimal(base_size = 14) +
theme(
strip.text = element_text(face = "bold", size = 13),
axis.text.x = element_text(angle = 45, hjust = 1)
)
```
**Patterns:**
1.Eviction filings were heavily suppressed across all seasons in 2020–2021. Every season shows extremely low activity during the early pandemic period.
2.Beginning in 2022, filings rebound sharply—but not uniformly across seasons. -Summer 2022 displays one of the largest spikes in the dataset. -Fall and Spring quickly rise back to their historical levels. -Winter rebounds more unevenly but eventually stabilizes. So the seasonal cut reveals that the return to “normal” was staggered, with some seasons experiencing larger surges than others.
3.From 2023 onward, virtually every season settles at or slightly above its long-term seasonal baseline. This indicates a new post-pandemic equilibrium where eviction activity has stabilized.
## 2.2 Tract-level risk and spatial patterns
We uses the multi-year tract-level monthly data to see how skewed filings are across tracts.
```{r eda_tract_distribution}
tract_annual <- philly_monthly %>%
group_by(GEOID) %>%
summarize(
total_filings = sum(filings_counts, na.rm = TRUE),
mean_monthly = mean(filings_counts, na.rm = TRUE),
n_months = n(),
.groups = "drop"
)
# Add a vertical line at the 80th percentile (top 20%)
p80 <- quantile(tract_annual$mean_monthly, 0.80, na.rm = TRUE)
ggplot(tract_annual, aes(x = mean_monthly)) +
geom_histogram(bins = 30, fill = "#3182bd") +
geom_vline(xintercept = p80, linetype = "dashed", color = "red") +
labs(
title = "Distribution of Mean Monthly Eviction Filings per Tract",
subtitle = paste0("Dashed line = 80th percentile (candidate high-risk cutoff: ",
round(p80, 2), " filings/month)"),
x = "Mean monthly filings (multi-year)",
y = "Number of tracts"
) +
plotTheme
```
This histogram shows that most census tracts have very low average eviction activity—often fewer than 2 filings per month.A small number of tracts sit far to the right, with 5–10+ filings per month, meaning they experience consistently high eviction pressure over multiple years. So eviction risk in Philly is highly **concentrated**, not evenly spread, which supports focusing limited legal aid and rent subsidies on that right-tail group of chronic hotspot tracts.
### Load Spatial Data
```{r}
# Read your GeoJSONs
boundary <- st_read("data/City_Limits.geojson")
pa_tracts <- st_read("data/PA-tracts.geojson")
# Join eviction data with tract geometry
philly_tracts <- pa_tracts %>%
filter(str_starts(GEOID, "42101"))
philly_map <- philly_map %>%
mutate(id = as.character(id))
philly_tracts <- philly_tracts %>%
mutate(GEOID = as.character(GEOID))
map_sf <- philly_tracts %>%
left_join(philly_map, by = c("GEOID" = "id"))
```
### Eviction rate map since Nov 2024
```{r}
ggplot(
map_sf %>% filter(!is.na(month_rate))
) +
geom_sf(aes(fill = month_rate), color = NA) +
scale_fill_viridis(option = "magma", direction = -1) +
labs(
title = "Eviction Filing Rate by Census Tract (Nov 2024–Nov 2025)",
fill = "Eviction Rate"
) +
coord_sf(datum = NA) +
theme_void() +
theme(
legend.position = "right",
plot.title = element_text(size = 14, face = "bold")
)
```
```{r}
map_compare <- map_sf %>%
mutate(
# month_diff is a ratio vs baseline
diff_vs_baseline = month_diff - 1
)
ggplot(
map_compare %>% filter(!is.na(diff_vs_baseline))
) +
geom_sf(aes(fill = diff_vs_baseline), color = NA) +
scale_fill_gradient2(
low = "#4575b4",
mid = "white",
high = "#d73027",
midpoint = 0,
labels = percent_format(accuracy = 1),
name = "Change vs baseline\n(2023–24)"
) +
labs(
title = "Eviction Filing Rate by Census Tract",
subtitle = "Nov 2024–Nov 2025 vs typical 2023–2024 rate"
) +
coord_sf(datum = NA) +
theme_void() +
theme(
legend.position = "right",
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10)
)
```
Philly isn’t on fire everywhere—a handful of neighborhoods are overheating while others are holding steady or cooling down. Those red clusters are exactly where we’d point legal aid and rent relief first.
### Persistence of high-risk tracts
```{r}
threshold <- quantile(map_sf$month_rate, 0.8, na.rm = TRUE)
tract_risk <- map_sf %>%
st_drop_geometry() %>% # strip geometry
mutate(high = if_else(month_rate >= threshold, 1L, 0L)) %>%
group_by(GEOID) %>%
summarise(
times_high = sum(high, na.rm = TRUE),
.groups = "drop"
)
risk_sf <- philly_tracts %>%
left_join(tract_risk, by = "GEOID") # now y is NOT sf → OK
ggplot(risk_sf) +
geom_sf(aes(fill = times_high), color = NA) +
scale_fill_viridis_c(option = "inferno") +
labs(
title = "Persistence of High Eviction Risk Across Tracts",
fill = "Months above 80th percentile"
) +
coord_sf(datum = NA) +
theme_void() +
theme_minimal()
```
### Hotspot landlords (top eviction filers)
```{r}
hotspot_sf <- st_as_sf(
hotspots,
coords = c("lon", "lat"),
crs = 4326
)
ggplot() +
geom_sf(data = boundary, fill = "grey95", color = NA) +
geom_sf(data = hotspot_sf, aes(size = filings), alpha = 0.6, color = "#d7301f") +
scale_size(range = c(1, 10)) +
labs(
title = "Top Eviction-Filing Landlords in Philadelphia",
subtitle = "Bubble size shows number of filings",
size = "Filings"
) +
coord_sf(datum = NA) +
theme_void() +
theme_minimal()
```
The map reveals that eviction activity is not driven solely by tenant poverty, but also by landlord behavior. The large red bubbles represent the specific locations of the city's top filers. We observe that high-volume eviction filing is not evenly distributed; it is highly concentrated among a specific subset of property owners. The spatial pattern is stark and non-random. The largest clusters of high-volume filers are located in: North Philadelphia and West Philadelphia.
# 3.Get Philadelphia Spatial Context
Now we shift from “describe” to “explain”:\
We want to add **structural predictors**:
- Demographics & housing (ACS)\
- Landlord concentration (hotspots)\
- Housing quality (old violations)
Additional dataset:
- ACS (American Community Survey) 5-year 2022 data for Philadelphia census tracts
- Historic code-violation (2012-2016) (From Licenses and Inspections Department of Philadelphia)
- 311 Service and Information Requests
## 3.1 Base tract list
```{r spatial_base}
# One row per tract ID from the joined spatial layer
tract_base <- map_sf %>%
st_drop_geometry() %>%
distinct(GEOID) %>%
arrange(GEOID)
```
## 3.2 Add ACS demographics & housing
```{r spatial_acs, message=FALSE, warning=FALSE}
acs_vars <- c(
total_pop = "B01003_001", # total population
pov_total = "B17001_001", # poverty universe
pov_below = "B17001_002", # below poverty
occ_total = "B25003_001", # occupied housing units
occ_renter = "B25003_003", # renter-occupied units
med_rent = "B25064_001", # median gross rent
white = "B03002_003",
black = "B03002_004",
hispanic = "B03002_012"
)
acs_philly <- get_acs(
geography = "tract",
state = "PA",
county = "Philadelphia",
survey = "acs5",
year = 2022,
variables = acs_vars,
output = "wide",
geometry = FALSE
) %>%
transmute(
GEOID,
total_pop = total_popE,
pov_rate = if_else(pov_totalE > 0, pov_belowE / pov_totalE, NA_real_),
renter_share = if_else(occ_totalE > 0, occ_renterE / occ_totalE, NA_real_),
med_rent = med_rentE,
pct_black = if_else(total_popE > 0, blackE / total_popE, NA_real_),
pct_hispanic = if_else(total_popE > 0, hispanicE / total_popE, NA_real_)
)
# Attach ACS to tract base and to the spatial object
tract_context <- tract_base %>%
left_join(acs_philly, by = "GEOID")
map_sf_context <- map_sf %>% # or map_sf_context if you’d already created it
left_join(acs_philly, by = "GEOID")
```
## 3.3 Add landlord hotspot intensity
```{r spatial_hotspots, message=FALSE, warning=FALSE}
# Hotspot landlords as points (you already use this for the bubble map)
hotspot_sf <- st_as_sf(
hotspots,
coords = c("lon", "lat"),
crs = 4326
) %>%
st_transform(st_crs(map_sf_context))
# Spatial join: assign each hotspot property to a tract
hotspots_by_tract <- st_join(
hotspot_sf,
map_sf_context["GEOID"],
left = FALSE
) %>%
st_drop_geometry() %>%
group_by(GEOID) %>%
summarise(
hotspot_filings = sum(filings, na.rm = TRUE),
n_hotspot_props = n(),
n_hotspot_landlords = n_distinct(xplaintiff),
.groups = "drop"
)
map_sf_context <- map_sf_context %>%
left_join(hotspots_by_tract, by = "GEOID") %>%
mutate(
across(
c(hotspot_filings, n_hotspot_props, n_hotspot_landlords),
~ replace_na(., 0)
)
)
```
## 3.4 Add historic code-violation intensity (2012-2016)
To capture long-term structural neglect, we transformed historical code violation data (2012–2016) from individual geographic points into a tract-level density metric. We spatially joined each violation instance to its corresponding census tract, aggregated the total counts, and normalized them by population to derive the 'Violations per 1,000 Residents' (viol_per_1k) variable. This serves as a standardized proxy for 'legacy distress' within the built environment, distinguishing chronic disrepair from recent instability.
```{r spatial_violations, message=FALSE, warning=FALSE}
# Violations as points
li_violations_sf <- st_as_sf(
li_violations,
coords = c("lng", "lat"),
crs = 4326
) %>%
st_transform(st_crs(map_sf_context))
# Join each violation to a tract
viol_by_tract <- st_join(
li_violations_sf,
map_sf_context["GEOID"],
left = FALSE
) %>%
st_drop_geometry() %>%
group_by(GEOID) %>%
summarise(
violations_total = n(),
violations_types = n_distinct(violationdescription),
.groups = "drop"
)
map_sf_context <- map_sf_context %>%
left_join(viol_by_tract, by = "GEOID") %>%
mutate(
violations_total = replace_na(violations_total, 0),
violations_types = replace_na(violations_types, 0),
viol_per_1k = if_else(
total_pop > 0,
1000 * violations_total / total_pop,
NA_real_
)
)
```
## 3.5 311 housing-related complaint data
To quantify localized housing distress, we processed raw 311 service request data by filtering for ten specific housing-related categories recorded during 2025. These individual incidents were spatially joined to census tracts to aggregate point-level data into neighborhood-level counts. Finally, we normalized these figures by total population to create a standardized 'Complaints per 1,000 Residents' metric (housing311_per_1k), serving as a dynamic proxy for property mismanagement and structural neglect.
```{r add_311_housing, message=FALSE, warning=FALSE}
housing_types <- c(
"Dangerous Building Complaint",
"Construction Complaints",
"Maintenance Complaint",
"Sanitation Violation",
"Sanitation / Dumpster Violation",
"Dumpster Violation",
"Fire Safety Complaint",
"Smoke Detector",
"Graffiti Removal",
"Homeless Encampment Request"
)
housing311_raw <- read_csv("data/public_cases_fc.csv") %>%
clean_names() %>%
filter(service_name %in% housing_types) %>%
filter(!is.na(lon), !is.na(lat)) %>% # adjust if your columns are lon/lat
mutate(
request_date = as.Date(requested_datetime), # change name if needed
year = year(request_date)
) %>%
filter(year >= 2020) # recent years only
```
```{r agg_311_by_tract}
housing311_sf <- housing311_raw %>%
st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
st_transform(st_crs(map_sf))
housing311_tract <- st_join(
housing311_sf,
map_sf["GEOID"],
left = FALSE
)
housing311_by_tract <- housing311_tract %>%
st_drop_geometry() %>%
count(GEOID, name = "housing311_count") %>%
left_join(
tract_context %>% select(GEOID, total_pop),
by = "GEOID"
) %>%
mutate(
housing311_per_1k = if_else(
total_pop > 0,
1000 * housing311_count / total_pop,
NA_real_
)
) %>%
select(GEOID, housing311_per_1k)
```
# 4. Build & Compare Predictive Models
## 4.1 Master Data Setup & Feature Engineering
### 4.1.1 Define Custom Time Periods (The "Nov-Nov" Fix)
To align our historical training data with the target prediction window (November 2025 – November 2026), we engineered a custom time period variable (period_year). Standard calendar years (Jan–Dec) were replaced with 'Eviction Years' that run from November to October. For instance, data from November and December of 2020 were reassigned to the '2021 Period.' This ensures that every year in our panel represents a full 12-month cycle that perfectly mirrors the seasonal structure of our final forecast.
```{r}
# Prepare the Monthly Data with Custom Time Periods
panel_setup <- philly_monthly %>%
mutate(
# Convert text to Date
month_date = my(month),
# --- CRITICAL FIX: Custom "Eviction Year" Logic ---
# We want the year to start in November.
# If month is Nov (11) or Dec (12), we assign it to the NEXT year.
# Example: Nov 2020 -> Period 2021
# Example: Nov 2024 -> Period 2025 (The most recent complete year)
period_year = if_else(
month(month_date) >= 11,
year(month_date) + 1,
year(month_date)
)
) %>%
# Filter out data before Nov 2020 to ensure we have full 12-month cycles
filter(period_year >= 2021)
```
### 4.1.2 Build Basic Tract-Year Panel
```{r}
# Aggregate from Monthly to Annual (Period) Level
model_df <- panel_setup %>%
group_by(GEOID, period_year) %>%
summarise(
# --- FIX: Variable Name Changed from filings_2020 ---
filings_total = sum(filings_counts, na.rm = TRUE),
mean_monthly = mean(filings_counts, na.rm = TRUE),
# Count how many months of data we have in this custom period
n_months = n(),
.groups = "drop"
) %>%
# Data Quality Check: Only keep periods with at least 10 months of data
filter(n_months >= 10)
```
Here we define what "High Risk" means and create the lag variable.
```{r}
# Create Target Variable (Y) and Lagged Predictor (X)
model_df_lagged <- model_df %>%
# A. Define "High Risk" Threshold per Year
group_by(period_year) %>%
mutate(
# Calculate the 80th percentile threshold for THIS specific period
cutoff80 = quantile(mean_monthly, 0.80, na.rm = TRUE),
# Create Binary Target: 1 if this tract is in the top 20%, 0 otherwise
high_risk = if_else(mean_monthly >= cutoff80, 1L, 0L)
) %>%
ungroup() %>%
# Create Lagged History (The most important predictor)
arrange(GEOID, period_year) %>%
group_by(GEOID) %>%
mutate(
# Previous year's mean monthly filings
lag_mean_monthly = lag(mean_monthly)
) %>%
ungroup() %>%
# Remove rows with no history (we can't train on the first year)
filter(!is.na(lag_mean_monthly))
```
::: callout-note
We use 0.8 as the threshold because of The Pareto Principle / 80-20 Rule.
:::
### 4.1.3 Prepare External Features (Context)
```{r}
# A. Legacy Distress (Historic Violations 2012-2016)
legacy_distress_join <- tract_context %>%
select(GEOID, total_pop) %>%
left_join(viol_by_tract, by = "GEOID") %>%
mutate(
hist_viol_per_1k = if_else(total_pop > 0, (violations_total / total_pop) * 1000, 0)
) %>%
select(GEOID, hist_viol_per_1k)
# B. Hotspot Landlord Activity
hotspot_join <- hotspots_by_tract %>%
select(GEOID, hotspot_filings)
# C. 311 Housing Complaints
housing311_join <- housing311_by_tract %>%
select(GEOID, housing311_per_1k)
# D. Spatial Context 1: Distance to City Hall
tract_centroids <- philly_tracts %>% st_centroid()
city_hall <- st_sfc(st_point(c(-75.1635, 39.9526)), crs = 4326) %>%
st_transform(st_crs(tract_centroids))
dist_center <- st_distance(tract_centroids, city_hall)
dist_center_join <- data.frame(
GEOID = tract_centroids$GEOID,
dist_center_km = as.numeric(dist_center) / 1000
)
# E. Spatial Context 2: Distance to Nearest Hotspot
dist_hotspot <- st_distance(tract_centroids, hotspot_sf)
min_dist_hotspot <- apply(dist_hotspot, 1, min)
dist_hotspot_join <- data.frame(
GEOID = tract_centroids$GEOID,
dist_hotspot_km = as.numeric(min_dist_hotspot) / 1000
)
# F. Neighborhood Typology (Clustering)
# We cluster tracts based on static demographics to capture "neighborhood type"
cluster_input <- tract_context %>%
st_drop_geometry() %>%
select(GEOID, pov_rate, renter_share, med_rent, pct_black, pct_hispanic) %>%
drop_na()
set.seed(42)
clusters <- kmeans(scale(cluster_input %>% select(-GEOID)), centers = 5)
cluster_input$neighborhood_cluster <- factor(clusters$cluster)
cluster_join <- cluster_input %>% select(GEOID, neighborhood_cluster)
```
### 4.1.4 Build Final Master Dataset
```{r}
training_data <- model_df_lagged %>%
# Join Demographics (ACS)
left_join(tract_context %>% select(GEOID, pov_rate, renter_share, med_rent, pct_black, pct_hispanic), by = "GEOID") %>%
# Join External Predictors
left_join(legacy_distress_join, by = "GEOID") %>%
left_join(hotspot_join, by = "GEOID") %>%
left_join(housing311_join, by = "GEOID") %>%
left_join(dist_center_join, by = "GEOID") %>%
left_join(dist_hotspot_join, by = "GEOID") %>%
left_join(cluster_join, by = "GEOID") %>%
# Handle NAs created by joins (fill counts with 0)
mutate(
hotspot_filings = replace_na(hotspot_filings, 0),
hist_viol_per_1k = replace_na(hist_viol_per_1k, 0),
housing311_per_1k = replace_na(housing311_per_1k, 0)
) %>%
# Final cleanup
drop_na(pov_rate, renter_share)
```
## 4.2 Model Building
### 4.2.1 Train/Test Split
We employed a Time-Based Train/Test Split (Temporal Splitting) combined with Complete Case Analysis. Instead of randomly shuffling the data (which is standard for non-time-series problems), we split the dataset chronologically.
```{r}
# 1. Define Variables
# Ensure we drop NAs for ALL potential variables to keep samples consistent across models
all_vars <- c("high_risk", "lag_mean_monthly",
"pov_rate", "renter_share", "med_rent", "pct_black", "pct_hispanic",
"hist_viol_per_1k", "hotspot_filings", "housing311_per_1k",
"dist_center_km", "dist_hotspot_km", "neighborhood_cluster")
model_data <- training_data %>%
drop_na(all_of(all_vars))
# 2. Time-Based Split
train_set <- model_data %>% filter(period_year < 2025)
test_set <- model_data %>% filter(period_year == 2025)
```
### 4.2.2 Build 5 Competing Models
We employed Hierarchical Model Building using Logistic Regression.
- Model 1 (Baseline): Tests "Inertia" (does past risk predict future risk?)
- Models 2-4 (Thematic): Test specific domains (Demographics, Built Environment, Spatial Context).
- Model 5 (Integrated): A comprehensive model combining all domains.
```{r}
# --- Model 1: Inertia Only (Lag) ---
formula_1 <- high_risk ~ lag_mean_monthly
# --- Model 2: Demographics (People) ---
formula_2 <- high_risk ~ lag_mean_monthly +
pov_rate + renter_share + pct_black + med_rent
# --- Model 3: Built Environment (Place Quality) ---
formula_3 <- high_risk ~ lag_mean_monthly +
hist_viol_per_1k + hotspot_filings + housing311_per_1k
# --- Model 4: Spatial Context (Location) ---
formula_4 <- high_risk ~ lag_mean_monthly +
dist_center_km + dist_hotspot_km + neighborhood_cluster
# --- Model 5: Full Integrated Model ---
formula_5 <- high_risk ~ lag_mean_monthly +
pov_rate + pct_black + renter_share +
hist_viol_per_1k + hotspot_filings + housing311_per_1k +
dist_center_km + neighborhood_cluster
# Train all models on the Training Set
m1 <- glm(formula_1, data = train_set, family = "binomial")
m2 <- glm(formula_2, data = train_set, family = "binomial")
m3 <- glm(formula_3, data = train_set, family = "binomial")
m4 <- glm(formula_4, data = train_set, family = "binomial")
m5 <- glm(formula_5, data = train_set, family = "binomial")
```
### 4.2.3 Compare Model Performance
We applied three standard statistical metrics to the 2025 Test Set: AUC,Accuracy and AIC.
```{r}
library(pROC)
library(caret)
# Function to calculate metrics for a model
calc_metrics <- function(model, test_data) {
# Predict probabilities
probs <- predict(model, newdata = test_data, type = "response")
# Predict class (0 or 1)
preds <- if_else(probs > 0.5, 1, 0)
# Calculate Metrics
roc_obj <- roc(test_data$high_risk, probs)
auc_val <- as.numeric(auc(roc_obj))
accuracy <- mean(preds == test_data$high_risk)
aic_val <- AIC(model)
return(c(AUC = auc_val, Accuracy = accuracy, AIC = aic_val))
}
# Run comparison
results <- data.frame(
Model = c("M1: Inertia", "M2: Demographics", "M3: Built Env", "M4: Spatial", "M5: Full"),
rbind(
calc_metrics(m1, test_set),
calc_metrics(m2, test_set),
calc_metrics(m3, test_set),
calc_metrics(m4, test_set),
calc_metrics(m5, test_set)
)
)
# Display Table
library(knitr)
kable(results, digits = 3, caption = "Model Comparison Table (Test Set Performance)")
```
The results establish Model 5 (Full Integrated) as the champion. While Model 4 (Spatial) achieved a marginally higher AUC (0.975 vs 0.969), Model 5 demonstrated superior Accuracy (94.3%), correctly classifying the vast majority of tracts in the 2025 test set. Crucially, Model 5 achieved the lowest AIC (609.176), significantly outperforming the baseline models (AIC \> 700). This drastic drop in AIC confirms that the added complexity—combining demographics, built environment, and spatial features—provides essential explanatory power rather than noise. Therefore, we select Model 5 as the most robust instrument for forecasting 2026 risk.
### 4.2.4 a Visual Comparison (ROC Curves)
```{r}
# Generate ROC objects
roc1 <- roc(test_set$high_risk, predict(m1, newdata = test_set, type = "response"))
roc2 <- roc(test_set$high_risk, predict(m2, newdata = test_set, type = "response"))
roc3 <- roc(test_set$high_risk, predict(m3, newdata = test_set, type = "response"))
roc4 <- roc(test_set$high_risk, predict(m4, newdata = test_set, type = "response"))
roc5 <- roc(test_set$high_risk, predict(m5, newdata = test_set, type = "response"))
# Plot
ggroc(list(M1=roc1, M2=roc2, M3=roc3, M4=roc4, M5=roc5), size = 1) +
labs(title = "ROC Curve Comparison",
subtitle = "Which model predicts 2025 risk best?",
color = "Model Type") +
theme_minimal() +
scale_color_brewer(palette = "Set1")
```
### 4.2.5 Check Multicollinearity
```{r}
library(car)
vif(m5)
```
VIF diagnostic confirms that Model 5 is statistically free from severe multicollinearity issues.
### 4.2.6 Check Feature Importance
```{r}
library(caret)
# 1. Calculate Variable Importance
# For GLM, this calculates the absolute value of the t-statistic (or z-statistic)
importance_df <- varImp(m5, scale = FALSE)
# 2. Convert to a clean Data Frame for Plotting
importance_plot_df <- importance_df %>%
as.data.frame() %>%
rownames_to_column("Feature") %>%
arrange(desc(Overall)) %>%
# Remove the intercept (it's not a feature)
filter(Feature != "(Intercept)")
# 3. Visualize
ggplot(importance_plot_df, aes(x = reorder(Feature, Overall), y = Overall)) +
geom_col(fill = "#3182bd") +
coord_flip() + # Horizontal bars are easier to read
labs(
title = "Feature Importance: What drives Eviction Risk?",
subtitle = "Based on Model 5 (Full Integrated Model)",
x = NULL,
y = "Importance Score (Absolute z-statistic)"
) +
theme_minimal() +
theme(
axis.text.y = element_text(size = 11, face = "bold"),
plot.title = element_text(face = "bold")
)
```
Our feature importance analysis reveals that Historical Inertia and Landlord Behavior are the primary drivers of eviction risk.
## 4.3 Diagnostics on the Winning Model (M5)
### 4.3.1 Spatial Autocorrelation Diagnostic
We check if the Full Model successfully removed spatial clustering in errors.
```{r}
library(spdep)
# Extract residuals from the best model
train_set$m5_residuals <- residuals(m5, type = "deviance")
# Map residuals to geometry
resid_sf <- philly_tracts %>%
left_join(train_set %>% select(GEOID, m5_residuals), by = "GEOID") %>%
filter(!is.na(m5_residuals))
# Calculate Moran's I
coords <- st_centroid(resid_sf)
knn_nb <- knearneigh(coords, k = 5)
lw <- nb2listw(knn2nb(knn_nb), style = "W")
moran_result <- moran.test(resid_sf$m5_residuals, lw)
# Print Result (Ideally p-value > 0.05, or at least Moran's I is close to 0)
print(moran_result)
```
The Moran’s I test on the residuals yielded a statistic of 0.115 with a p-value \< 0.05. The value is relatively close to 0 (which indicates perfect randomness). This suggests that our model has successfully explained the vast majority of spatial patterns in eviction risk.And the result of p-value is still statistically significant.
### 4.3.2 Visualize Spatial Error Patterns (Residual Map)
We visualized the Deviance Residuals of Model 5 to identify local pockets of model error. By mapping these residuals, we can diagnose if the model systematically biases its predictions in specific neighborhoods.
```{r}
ggplot(resid_sf) +
geom_sf(aes(fill = m5_residuals), color = NA) +
scale_fill_gradient2(
low = "#4575b4", # Blue: Model predicted High, Actual was Low (Overestimation)
mid = "white", # White: Perfect Prediction
high = "#d73027", # Red: Model predicted Low, Actual was High (Underestimation)
midpoint = 0,
name = "Deviance\nResiduals"
) +
labs(
title = "Map of Model Errors (Residuals)",
subtitle = "Where is Model 5 failing?",
caption = "Red = Underestimation (Missed Risk) | Blue = Overestimation (False Alarm)"
) +
mapTheme +
theme(legend.position = "right")
```
While the errors show some local clustering (consistent with our Moran's I result of 0.115), they do not show a city-wide systemic bias. This confirms Model 5 is generally robust.
### 4.3.3 Spatial Cross-Validation (LOGOCV)
We used the 5 Neighborhood Clusters (created in Section 4.1.3) as the splitting criteria. In each round of validation, we completely held out one entire cluster of neighborhoods and trained the model on the remaining four clusters. We then asked the model to predict risk in the unseen cluster.
```{r}
library(caret)
# 1. Define the Control Method: Leave-One-Group-Out
# We use the 'neighborhood_cluster' we created earlier as the "Group"
fit_control <- trainControl(
method = "CV",
number = 5, # 5 Clusters
savePredictions = TRUE,
classProbs = TRUE, # Needed for ROC/AUC
summaryFunction = twoClassSummary, # Optimizes for AUC
index = createFolds(train_set$neighborhood_cluster, k = 5) # Critical: Split by Cluster!
)
# 2. Prepare Data for Caret (Needs "Yes/No" factor for classification)
cv_data <- train_set %>%
mutate(high_risk_fac = factor(if_else(high_risk == 1, "High", "Low"),
levels = c("High", "Low"))) %>%
drop_na()
# 3. Re-train Model 5 using Spatial CV
# Note: We use the exact same formula as M5
model_spatial_cv <- train(
high_risk_fac ~ lag_mean_monthly +
pov_rate + pct_black + renter_share +
hist_viol_per_1k + hotspot_filings + housing311_per_1k +
dist_center_km, # Removed 'neighborhood_cluster' from predictors because it's the split variable!
data = cv_data,
method = "glm",
family = "binomial",
trControl = fit_control,
metric = "ROC"
)
# 4. Print Results
print(model_spatial_cv)
```
Comparing this to our temporal test result (AUC \~0.969), the drop to 0.91 is expected and healthy. It indicates that while the model relies partly on local history ('Inertia'), it still retains excellent predictive power (0.91) even when blindfolded to specific neighborhood locations. This confirms the model has learned structural rules about eviction risk.
# 5. Prediction for 2026 (The Future)
We use the "Full Integrated Model" (M5) to forecast risk. The input for this prediction is the data from the most recent period (Nov 2024 – Nov 2025).
```{r}
# 1. Prepare the Input Data for 2026
# Logic: To predict 2026, the "lagged history" is the actual performance of 2025.
future_input <- training_data %>%
filter(period_year == 2025) %>% # Select the latest available data
select(
GEOID,
# CRITICAL STEP: The 2025 average becomes the "lag" for 2026
lag_mean_monthly = mean_monthly,
# Keep all other static/structural predictors
pov_rate, pct_black, renter_share, med_rent,
hist_viol_per_1k, hotspot_filings, housing311_per_1k,
dist_center_km, neighborhood_cluster, dist_hotspot_km
) %>%
drop_na()
# 2. Generate Predictions
# We use Model 5 (m5) to predict probabilities (0 to 1)
# type = "response" output probabilities
future_input$pred_prob <- predict(m5, newdata = future_input, type = "response")
# 3. Classify Risk
# We interpret > 50% probability as "High Risk"
future_input <- future_input %>%
mutate(
pred_risk_label = if_else(pred_prob >= 0.5, "High Risk", "Low Risk")
)
# 4. Join Predictions back to Map Geometry
forecast_map <- philly_tracts %>%
left_join(future_input, by = "GEOID") %>%
filter(!is.na(pred_prob))
```
## 5.1 Visualize the Forecast (Probability Map)
```{r}
ggplot(forecast_map) +
geom_sf(aes(fill = pred_prob), color = NA) +
scale_fill_viridis_c(
option = "rocket",
direction = -1,
name = "Risk Probability",
labels = percent
) +
labs(
title = "Forecast: Eviction Risk Probability (Nov 2025 – Nov 2026)",
subtitle = "Predicted by Model 5 (Full Integrated Model)\nDarker/Red areas indicate >80% probability of high eviction activity",
caption = "Source: Philadelphia Eviction Data"
) +
mapTheme +
theme(legend.position = "right")
```
High-Intensity Clusters: The darkest areas (probabilities \> 75%) are tightly clustered in specific corridors of North and West Philadelphia. This indicates that the model is extremely confident that these specific neighborhoods will experience severe eviction pressure in 2026.
## 5.2.1 Visualize the Forecast (Binary Risk Map)
We use three distinct cutoffs:
- Aggressive (\> 0.3): This casts a wide net to catch as many potential eviction cases as possible, accepting a higher rate of false alarms.
- Balanced (\> 0.5): This uses the standard statistical default.
- Conservative (\> 0.8): This targets only the 'sure bets' to minimize wasted resources.
```{r}
# --- 1. Create Multiple Scenarios ---
future_scenarios <- future_input %>%
mutate(
# Strategy A: High Recall (Catch almost all potential risks, but more false alarms)
`Strategy: > 0.3 (Aggressive)` = if_else(pred_prob > 0.3, "High Risk", "Low Risk"),
# Strategy B: Balanced (The statistical default)
`Strategy: > 0.5 (Balanced)` = if_else(pred_prob > 0.5, "High Risk", "Low Risk"),
# Strategy C: High Precision (Target only the absolute worst hotspots)
`Strategy: > 0.8 (Conservative)` = if_else(pred_prob > 0.8, "High Risk", "Low Risk")
) %>%
# Convert wide format to long format for plotting
pivot_longer(
cols = starts_with("Strategy"),
names_to = "Strategy_Name",
values_to = "Risk_Label"
)
# --- 2. Join to Geometry ---
# Note: This will triple the number of rows (one geometry per strategy per tract)
scenario_map <- philly_tracts %>%
left_join(future_scenarios, by = "GEOID") %>%
filter(!is.na(Risk_Label))
# --- 3. Plot Side-by-Side Maps ---
ggplot(scenario_map) +
geom_sf(aes(fill = Risk_Label), color = "white", lwd = 0.01) +
scale_fill_manual(
values = c("High Risk" = "#d73027", "Low Risk" = "#4575b4"),
name = "Risk Class"
) +
# This creates the 3 panels
facet_wrap(~ Strategy_Name, ncol = 3) +
labs(
title = "Policy Trade-offs: Comparing Intervention Thresholds",
subtitle = "How does the definition of 'High Risk' change our target areas for 2026?",
caption = "Aggressive (>0.3) = Max Coverage | Balanced (>0.5) = Standard | Conservative (>0.8) = Max Precision"
) +
mapTheme +
theme(
legend.position = "bottom",
strip.text = element_text(size = 11, face = "bold") # Make panel titles larger
)
```
**Recommendation**: Given the looming 2026 funding cuts, we recommend this 'High Precision' strategy for allocating expensive resources (like direct rental assistance), while reserving the 'Aggressive' strategy for low-cost interventions (like information campaigns).
# 6 Summary of Findings
## 6.1 Conclusion
```{r}
# Calculate summary statistics for the report text
n_high_risk <- sum(future_input$pred_risk_label == "High Risk")
pct_high_risk <- round(mean(future_input$pred_risk_label == "High Risk") * 100, 1)
# Identify the top 5 highest risk tracts
top_risk_tracts <- future_input %>%
arrange(desc(pred_prob)) %>%
head(5) %>%
select(GEOID, pred_prob, lag_mean_monthly, pov_rate)
print(paste("Total High Risk Tracts Predicted:", n_high_risk))
print(paste("Percentage of City:", pct_high_risk, "%"))
print(top_risk_tracts)
```
Our Model 5 forecast projects that 71 census tracts will be classified as 'High Risk' in the upcoming year. These 71 tracts represent exactly 20.2% of the city's neighborhoods. This finding is remarkably consistent with the Pareto Principle (80/20 rule) we observed in the historical data, confirming that the 2026 crisis will likely remain highly concentrated rather than spreading evenly across the city. For these specific neighborhoods, the model is effectively finding 'near certainty' of high eviction activity. These areas should be the immediate first stop for any 'High Precision' legal aid deployment, as they represent the absolute epicenter of displacement risk.
## 6.2 Limitations
1. **Reporting Bias in 311 data**: Our model uses 311 complaints as a proxy for housing distress.However, the tenants in the most vulnerable neighborhoods may be less likely to report violations than residents in gentrifying areas.For example, they fear landlord retaliations. So our model may paradoxically interpret a lack of complaints as "stability" in areas that are actually suffering from silent, unreported neglect.
2. **Census-Tract Scale, Not Individual Properties**: Our predictions operate at the Census Tract level. High-risk classification for a tract does not mean every property in that tract is at risk. So interventions should be verified at the individual property level to avoid wasting resources on stable buildings within distressed areas.
## 6.3 Recommendations
1. For small set of extreme hotspots, we could operate direct rent relief, full legal representation, intensive case management. Also, we need to verify needs at the property/household level with local partners before committing these high-cost resources.
2. For wider ring of at-risk neighborhoods around the hotspots, we can operate low-cost interventions – flyers, text alerts, know-your-rights workshops, landlord engagement. Also we can use this tier to reach areas where 311 distress may be under-reported, and monitor for emerging hotspots to feed back into the model.