Lingxuan Gao - MUSA 5080 Portfolio
  • Home
  • Weekly Notes
    • Weekly Notes 01: Introduction to R and dplyr
    • Weekly Notes 02: Algorithmic Decision Making & The Census
    • Weekly Notes 03: Data Visualization & Exploratory Analysis
    • Weekly Notes 04: Spatial Data & GIS Operations in R
    • Weekly Notes 05: Introduction to Linear Regression
    • Weekly Notes 11: Space-Time Prediction
  • Labs
    • Lab 0: dplyr Basics
    • Lab 1: Census Data Quality for Policy Decisions
    • Lab 2: Spatial Analysis and Visualization-Healthcare Access and Equity in Pennsylvania
    • Lab 4: Spatial Predictive Analysis
    • Lab 5: Space-Time Prediction
  • Midterm
    • Appendix
    • Presentation
  • Final
    • Eviction Risk Prediction in Philadelphia

On this page

  • Introduction: Predicting the 2026 Housing Cliff
  • Setup
    • Load Libraries
    • Define Themes
  • 1. Data Source
    • Load and clean all the datasets
  • 2. Exploratory Data Analysis
    • 2.1 Citywide filings over time
    • 2.2 Tract-level risk and spatial patterns
      • Load Spatial Data
      • Eviction rate map since Nov 2024
      • Persistence of high-risk tracts
      • Hotspot landlords (top eviction filers)
  • 3.Get Philadelphia Spatial Context
    • 3.1 Base tract list
    • 3.2 Add ACS demographics & housing
    • 3.3 Add landlord hotspot intensity
    • 3.4 Add historic code-violation intensity (2012-2016)
    • 3.5 311 housing-related complaint data
  • 4. Build & Compare Predictive Models
    • 4.1 Master Data Setup & Feature Engineering
      • 4.1.1 Define Custom Time Periods (The “Nov-Nov” Fix)
      • 4.1.2 Build Basic Tract-Year Panel
      • 4.1.3 Prepare External Features (Context)
      • 4.1.4 Build Final Master Dataset
    • 4.2 Model Building
      • 4.2.1 Train/Test Split
      • 4.2.2 Build 5 Competing Models
      • 4.2.3 Compare Model Performance
      • 4.2.4 a Visual Comparison (ROC Curves)
      • 4.2.5 Check Multicollinearity
      • 4.2.6 Check Feature Importance
    • 4.3 Diagnostics on the Winning Model (M5)
      • 4.3.1 Spatial Autocorrelation Diagnostic
      • 4.3.2 Visualize Spatial Error Patterns (Residual Map)
      • 4.3.3 Spatial Cross-Validation (LOGOCV)
  • 5. Prediction for 2026 (The Future)
    • 5.1 Visualize the Forecast (Probability Map)
    • 5.2.1 Visualize the Forecast (Binary Risk Map)
  • 6 Summary of Findings
    • 6.1 Conclusion
    • 6.2 Limitations
    • 6.3 Recommendations

Eviction Risk Prediction in Philadelphia

  • Show All Code
  • Hide All Code

  • View Source
Authors

Lingxuan Gao

Xiaoqing Chen

Published

December 8, 2025

Introduction: Predicting the 2026 Housing Cliff

As Philadelphia approaches 2026, the city faces an unprecedented “housing cliff” driven by federal policy shocks. New mandates are shifting away from the “Housing First” model and enforcing stricter NSPIRE inspection standards, threatening the funding for approximately 1,200 permanent supportive housing units and risking mass disqualifications of affordable stock.

In this volatile landscape, relying solely on historical averages is insufficient. This project develops a spatial predictive model to forecast eviction risk for the Nov 2025 – Nov 2026 period. By integrating structural distress indicators (L&I violations), spatial context, and demographic data, we aim to identify the specific census tracts most vulnerable to this new wave of instability, enabling a “High Precision” allocation of limited legal aid resources.


Setup

Load Libraries

Code
# Core tidyverse
library(tidyverse)
library(lubridate)
library(janitor)
library(scales)

# Spatial data
library(sf)
library(tigris)

# Census data
library(tidycensus)

# Weather data
library(riem)  # For Philadelphia weather from ASOS stations

# Visualization
library(viridis)
library(gridExtra)
library(knitr)
library(kableExtra)
# Get rid of scientific notation. We gotta look good!
options(scipen = 999)

Define Themes

Code
plotTheme <- theme(
  plot.title = element_text(size = 14, face = "bold"),
  plot.subtitle = element_text(size = 10),
  plot.caption = element_text(size = 8),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title = element_text(size = 11, face = "bold"),
  panel.background = element_blank(),
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank(),
  axis.ticks = element_blank(),
  legend.position = "right"
)

mapTheme <- theme(
  plot.title = element_text(size = 14, face = "bold"),
  plot.subtitle = element_text(size = 10),
  plot.caption = element_text(size = 8),
  axis.line = element_blank(),
  axis.text = element_blank(),
  axis.ticks = element_blank(),
  axis.title = element_blank(),
  panel.background = element_blank(),
  panel.border = element_blank(),
  panel.grid.major = element_line(colour = 'transparent'),
  panel.grid.minor = element_blank(),
  legend.position = "right",
  plot.margin = margin(1, 1, 1, 1, 'cm'),
  legend.key.height = unit(1, "cm"),
  legend.key.width = unit(0.2, "cm")
)

palette5 <- c("#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c")

1. Data Source

Load and clean all the datasets

Code
# Monthly totals vs baseline (city-wide)
philly_barchart <- read_csv("data/philadelphia_barchart.csv") %>%
  mutate(
    date = my(month)
  )

# Claims severity over time
philly_claims_monthly <- read_csv("data/philadelphia_claims_monthly.csv") %>%
  mutate(
    month_date = ymd(month_date)
  )

# Group-level (race, gender, etc.) monthly trends
philly_linechart <- read_csv("data/philadelphia_linechart.csv") %>%
  mutate(
    month_date = my(month)
  )

# Census tract map snapshot
philly_map <- read_csv("data/philadelphia_map.csv") %>%
  mutate(
   GEOID = as.character(id),
    month_date = dmy(month_date)
  )

# Tract-level monthly pre/post pandemic
philly_monthly <- read_csv("data/philadelphia_monthly_2020_2021.csv") %>%
  mutate(
    month_date = my(month)
  )

# Tract-level weekly 
weekly <- read_csv("data/philadelphia_weekly_2020_2021.csv") %>%
  mutate(
    week_date = ymd(week_date)
  )

# Code / license violations (point data, older)
li_violations <- read_csv("data/li_violations.csv")

# Hotspot properties (top filers)
hotspots <- read_csv("data/philadelphia_hotspots_media_report.csv") %>%
  mutate(
    time_period = ymd(time_period),
    end_date    = ymd(end_date)
  )

Data sources:

Primary dataset: Eviction Lab – Philadelphia Tracking

  • City-level series

  • Tract-level monthly data over multiple years

  • Tract-level snapshot of Philadelphia eviction risk (since Nov 2024)

  • Group-level disparities

  • landlord hotspots

2. Exploratory Data Analysis

2.1 Citywide filings over time

We reorganized the monthly eviction-filing counts by grouping each month into a season (Winter, Spring, Summer, Fall). We computed four separate seasonal baselines:

the historical average for Winter/Spring/Summer/Fall months

Then we produced a four-panel faceted plot to let us compare each season to what is “normal” for that season.

Code
# Add season column
philly_barchart2 <- philly_barchart %>%
  mutate(
    # extract month number
    m = month(date),
    season = case_when(
      m %in% c(12, 1, 2) ~ "Winter",
      m %in% c(3, 4, 5)  ~ "Spring",
      m %in% c(6, 7, 8)  ~ "Summer",
      m %in% c(9, 10, 11) ~ "Fall"
    )
  )

# Calculate true seasonal averages
season_baseline <- philly_barchart2 %>%
  group_by(season) %>%
  summarise(season_avg = mean(month_filings, na.rm = TRUE))

season_baseline
# A tibble: 4 × 2
  season season_avg
  <chr>       <dbl>
1 Fall        1008.
2 Spring       795.
3 Summer       873.
4 Winter      1115.
Code
ggplot(philly_barchart2, aes(x = date, y = month_filings)) +
  geom_col(fill = "#3182bd") +
  geom_hline(
    data = season_baseline,
    aes(yintercept = season_avg),
    linetype = "dashed"
  ) +
  facet_wrap(~ season, ncol = 2, scales = "free_y") +
  labs(
    title = "Monthly Eviction Filings by Season",
    subtitle = "Each panel shows filings and seasonal baseline",
    x = "Date",
    y = "Eviction filings"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    strip.text = element_text(face = "bold", size = 13),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

Patterns:

1.Eviction filings were heavily suppressed across all seasons in 2020–2021. Every season shows extremely low activity during the early pandemic period.

2.Beginning in 2022, filings rebound sharply—but not uniformly across seasons. -Summer 2022 displays one of the largest spikes in the dataset. -Fall and Spring quickly rise back to their historical levels. -Winter rebounds more unevenly but eventually stabilizes. So the seasonal cut reveals that the return to “normal” was staggered, with some seasons experiencing larger surges than others.

3.From 2023 onward, virtually every season settles at or slightly above its long-term seasonal baseline. This indicates a new post-pandemic equilibrium where eviction activity has stabilized.

2.2 Tract-level risk and spatial patterns

We uses the multi-year tract-level monthly data to see how skewed filings are across tracts.

Code
tract_annual <- philly_monthly %>%
  group_by(GEOID) %>%
  summarize(
    total_filings = sum(filings_counts, na.rm = TRUE),
    mean_monthly  = mean(filings_counts, na.rm = TRUE),
    n_months      = n(),
    .groups = "drop"
  )

# Add a vertical line at the 80th percentile (top 20%)
p80 <- quantile(tract_annual$mean_monthly, 0.80, na.rm = TRUE)

ggplot(tract_annual, aes(x = mean_monthly)) +
  geom_histogram(bins = 30, fill = "#3182bd") +
  geom_vline(xintercept = p80, linetype = "dashed", color = "red") +
  labs(
    title = "Distribution of Mean Monthly Eviction Filings per Tract",
    subtitle = paste0("Dashed line = 80th percentile (candidate high-risk cutoff: ",
                      round(p80, 2), " filings/month)"),
    x = "Mean monthly filings (multi-year)",
    y = "Number of tracts"
  ) +
  plotTheme

This histogram shows that most census tracts have very low average eviction activity—often fewer than 2 filings per month.A small number of tracts sit far to the right, with 5–10+ filings per month, meaning they experience consistently high eviction pressure over multiple years. So eviction risk in Philly is highly concentrated, not evenly spread, which supports focusing limited legal aid and rent subsidies on that right-tail group of chronic hotspot tracts.

Load Spatial Data

Code
# Read your GeoJSONs
boundary <- st_read("data/City_Limits.geojson")
Reading layer `City_Limits' from data source 
  `D:\MUSA\MUSA_5080\portfolio-setup-LingxuanGao\Final\data\City_Limits.geojson' 
  using driver `GeoJSON'
Simple feature collection with 1 feature and 3 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -75.28031 ymin: 39.86747 xmax: -74.95575 ymax: 40.13793
Geodetic CRS:  WGS 84
Code
pa_tracts <- st_read("data/PA-tracts.geojson")
Reading layer `PA-tracts' from data source 
  `D:\MUSA\MUSA_5080\portfolio-setup-LingxuanGao\Final\data\PA-tracts.geojson' 
  using driver `GeoJSON'
Simple feature collection with 3217 features and 398 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -80.51989 ymin: 39.7198 xmax: -74.68952 ymax: 42.26986
Geodetic CRS:  WGS 84
Code
# Join eviction data with tract geometry
philly_tracts <- pa_tracts %>%
  filter(str_starts(GEOID, "42101"))

philly_map <- philly_map %>%
  mutate(id = as.character(id))

philly_tracts <- philly_tracts %>%
  mutate(GEOID = as.character(GEOID))


map_sf <- philly_tracts %>%
  left_join(philly_map, by = c("GEOID" = "id"))

Eviction rate map since Nov 2024

Code
ggplot(
  map_sf %>% filter(!is.na(month_rate))
) +
  geom_sf(aes(fill = month_rate), color = NA) +
  scale_fill_viridis(option = "magma", direction = -1) +
  labs(
    title = "Eviction Filing Rate by Census Tract (Nov 2024–Nov 2025)",
    fill  = "Eviction Rate"
  ) +
  coord_sf(datum = NA) +
  theme_void() +
  theme(
    legend.position = "right",
    plot.title      = element_text(size = 14, face = "bold")
  )

Code
map_compare <- map_sf %>%
  mutate(
    # month_diff is a ratio vs baseline
    diff_vs_baseline = month_diff - 1
  )

ggplot(
  map_compare %>% filter(!is.na(diff_vs_baseline))
) +
  geom_sf(aes(fill = diff_vs_baseline), color = NA) +
  scale_fill_gradient2(
    low  = "#4575b4",
    mid  = "white",
    high = "#d73027",
    midpoint = 0,
    labels = percent_format(accuracy = 1),
    name = "Change vs baseline\n(2023–24)"
  ) +
  labs(
    title    = "Eviction Filing Rate by Census Tract",
    subtitle = "Nov 2024–Nov 2025 vs typical 2023–2024 rate"
  ) +
  coord_sf(datum = NA) +
  theme_void() +
  theme(
    legend.position = "right",
    plot.title      = element_text(size = 14, face = "bold"),
    plot.subtitle   = element_text(size = 10)
  )

Philly isn’t on fire everywhere—a handful of neighborhoods are overheating while others are holding steady or cooling down. Those red clusters are exactly where we’d point legal aid and rent relief first.

Persistence of high-risk tracts

Code
threshold <- quantile(map_sf$month_rate, 0.8, na.rm = TRUE)

tract_risk <- map_sf %>%
  st_drop_geometry() %>%                        # strip geometry
  mutate(high = if_else(month_rate >= threshold, 1L, 0L)) %>%
  group_by(GEOID) %>%
  summarise(
    times_high = sum(high, na.rm = TRUE),
    .groups = "drop"
  )

risk_sf <- philly_tracts %>%
  left_join(tract_risk, by = "GEOID")          # now y is NOT sf → OK

ggplot(risk_sf) +
  geom_sf(aes(fill = times_high), color = NA) +
  scale_fill_viridis_c(option = "inferno") +
  labs(
    title = "Persistence of High Eviction Risk Across Tracts",
    fill  = "Months above 80th percentile"
  ) +
  coord_sf(datum = NA) +
  theme_void() +
  theme_minimal()

Hotspot landlords (top eviction filers)

Code
hotspot_sf <- st_as_sf(
  hotspots,
  coords = c("lon", "lat"),
  crs = 4326
)

ggplot() +
  geom_sf(data = boundary, fill = "grey95", color = NA) +
  geom_sf(data = hotspot_sf, aes(size = filings), alpha = 0.6, color = "#d7301f") +
  scale_size(range = c(1, 10)) +
  labs(
    title = "Top Eviction-Filing Landlords in Philadelphia",
    subtitle = "Bubble size shows number of filings",
    size = "Filings"
  ) +
  coord_sf(datum = NA) +
  theme_void() +
  theme_minimal()

The map reveals that eviction activity is not driven solely by tenant poverty, but also by landlord behavior. The large red bubbles represent the specific locations of the city’s top filers. We observe that high-volume eviction filing is not evenly distributed; it is highly concentrated among a specific subset of property owners. The spatial pattern is stark and non-random. The largest clusters of high-volume filers are located in: North Philadelphia and West Philadelphia.

3.Get Philadelphia Spatial Context

Now we shift from “describe” to “explain”:
We want to add structural predictors:

  • Demographics & housing (ACS)
  • Landlord concentration (hotspots)
  • Housing quality (old violations)

Additional dataset:

  • ACS (American Community Survey) 5-year 2022 data for Philadelphia census tracts

  • Historic code-violation (2012-2016) (From Licenses and Inspections Department of Philadelphia)

  • 311 Service and Information Requests

3.1 Base tract list

Code
# One row per tract ID from the joined spatial layer
tract_base <- map_sf %>%
  st_drop_geometry() %>%
  distinct(GEOID) %>%
  arrange(GEOID)

3.2 Add ACS demographics & housing

Code
acs_vars <- c(
  total_pop  = "B01003_001",  # total population
  pov_total  = "B17001_001",  # poverty universe
  pov_below  = "B17001_002",  # below poverty
  occ_total  = "B25003_001",  # occupied housing units
  occ_renter = "B25003_003",  # renter-occupied units
  med_rent   = "B25064_001",  # median gross rent
  white      = "B03002_003",
  black      = "B03002_004",
  hispanic   = "B03002_012"
)

acs_philly <- get_acs(
  geography = "tract",
  state     = "PA",
  county    = "Philadelphia",
  survey    = "acs5",
  year      = 2022,
  variables = acs_vars,
  output    = "wide",
  geometry  = FALSE
) %>%
  transmute(
    GEOID,
    total_pop    = total_popE,
    pov_rate     = if_else(pov_totalE > 0, pov_belowE / pov_totalE, NA_real_),
    renter_share = if_else(occ_totalE > 0, occ_renterE / occ_totalE, NA_real_),
    med_rent     = med_rentE,
    pct_black    = if_else(total_popE > 0, blackE / total_popE, NA_real_),
    pct_hispanic = if_else(total_popE > 0, hispanicE / total_popE, NA_real_)
  )

# Attach ACS to tract base and to the spatial object
tract_context <- tract_base %>%
  left_join(acs_philly, by = "GEOID")

map_sf_context <- map_sf %>%       # or map_sf_context if you’d already created it
  left_join(acs_philly, by = "GEOID")

3.3 Add landlord hotspot intensity

Code
# Hotspot landlords as points (you already use this for the bubble map)
hotspot_sf <- st_as_sf(
  hotspots,
  coords = c("lon", "lat"),
  crs = 4326
) %>%
  st_transform(st_crs(map_sf_context))

# Spatial join: assign each hotspot property to a tract
hotspots_by_tract <- st_join(
  hotspot_sf,
  map_sf_context["GEOID"],
  left = FALSE
) %>%
  st_drop_geometry() %>%
  group_by(GEOID) %>%
  summarise(
    hotspot_filings     = sum(filings, na.rm = TRUE),
    n_hotspot_props     = n(),
    n_hotspot_landlords = n_distinct(xplaintiff),
    .groups = "drop"
  )

map_sf_context <- map_sf_context %>%
  left_join(hotspots_by_tract, by = "GEOID") %>%
  mutate(
    across(
      c(hotspot_filings, n_hotspot_props, n_hotspot_landlords),
      ~ replace_na(., 0)
    )
  )

3.4 Add historic code-violation intensity (2012-2016)

To capture long-term structural neglect, we transformed historical code violation data (2012–2016) from individual geographic points into a tract-level density metric. We spatially joined each violation instance to its corresponding census tract, aggregated the total counts, and normalized them by population to derive the ‘Violations per 1,000 Residents’ (viol_per_1k) variable. This serves as a standardized proxy for ‘legacy distress’ within the built environment, distinguishing chronic disrepair from recent instability.

Code
# Violations as points
li_violations_sf <- st_as_sf(
  li_violations,
  coords = c("lng", "lat"),
  crs = 4326
) %>%
  st_transform(st_crs(map_sf_context))

# Join each violation to a tract
viol_by_tract <- st_join(
  li_violations_sf,
  map_sf_context["GEOID"],
  left = FALSE
) %>%
  st_drop_geometry() %>%
  group_by(GEOID) %>%
  summarise(
    violations_total = n(),
    violations_types = n_distinct(violationdescription),
    .groups = "drop"
  )

map_sf_context <- map_sf_context %>%
  left_join(viol_by_tract, by = "GEOID") %>%
  mutate(
    violations_total = replace_na(violations_total, 0),
    violations_types = replace_na(violations_types, 0),
    viol_per_1k = if_else(
      total_pop > 0,
      1000 * violations_total / total_pop,
      NA_real_
    )
  )

3.5 311 housing-related complaint data

To quantify localized housing distress, we processed raw 311 service request data by filtering for ten specific housing-related categories recorded during 2025. These individual incidents were spatially joined to census tracts to aggregate point-level data into neighborhood-level counts. Finally, we normalized these figures by total population to create a standardized ‘Complaints per 1,000 Residents’ metric (housing311_per_1k), serving as a dynamic proxy for property mismanagement and structural neglect.

Code
housing_types <- c(
  "Dangerous Building Complaint",
  "Construction Complaints",
  "Maintenance Complaint",
  "Sanitation Violation",
  "Sanitation / Dumpster Violation",
  "Dumpster Violation",
  "Fire Safety Complaint",
  "Smoke Detector",
  "Graffiti Removal",
  "Homeless Encampment Request"
)

housing311_raw <- read_csv("data/public_cases_fc.csv") %>%
  clean_names() %>%
  filter(service_name %in% housing_types) %>%
  filter(!is.na(lon), !is.na(lat)) %>%   # adjust if your columns are lon/lat
  mutate(
    request_date = as.Date(requested_datetime),   # change name if needed
    year         = year(request_date)
  ) %>%
  filter(year >= 2020)               # recent years only
Code
housing311_sf <- housing311_raw %>%
  st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
  st_transform(st_crs(map_sf))

housing311_tract <- st_join(
  housing311_sf,
  map_sf["GEOID"],
  left = FALSE
)

housing311_by_tract <- housing311_tract %>%
  st_drop_geometry() %>%
  count(GEOID, name = "housing311_count") %>%
  left_join(
    tract_context %>% select(GEOID, total_pop),
    by = "GEOID"
  ) %>%
  mutate(
    housing311_per_1k = if_else(
      total_pop > 0,
      1000 * housing311_count / total_pop,
      NA_real_
    )
  ) %>%
  select(GEOID, housing311_per_1k)

4. Build & Compare Predictive Models

4.1 Master Data Setup & Feature Engineering

4.1.1 Define Custom Time Periods (The “Nov-Nov” Fix)

To align our historical training data with the target prediction window (November 2025 – November 2026), we engineered a custom time period variable (period_year). Standard calendar years (Jan–Dec) were replaced with ‘Eviction Years’ that run from November to October. For instance, data from November and December of 2020 were reassigned to the ‘2021 Period.’ This ensures that every year in our panel represents a full 12-month cycle that perfectly mirrors the seasonal structure of our final forecast.

Code
#  Prepare the Monthly Data with Custom Time Periods
panel_setup <- philly_monthly %>% 
  mutate(
    # Convert text to Date
    month_date = my(month),
    
    # --- CRITICAL FIX: Custom "Eviction Year" Logic ---
    # We want the year to start in November.
    # If month is Nov (11) or Dec (12), we assign it to the NEXT year.
    # Example: Nov 2020 -> Period 2021
    # Example: Nov 2024 -> Period 2025 (The most recent complete year)
    period_year = if_else(
      month(month_date) >= 11, 
      year(month_date) + 1, 
      year(month_date)
    )
  ) %>%
  # Filter out data before Nov 2020 to ensure we have full 12-month cycles
  filter(period_year >= 2021)

4.1.2 Build Basic Tract-Year Panel

Code
# Aggregate from Monthly to Annual (Period) Level
model_df <- panel_setup %>% 
  group_by(GEOID, period_year) %>% 
  summarise(
    # --- FIX: Variable Name Changed from filings_2020 ---
    filings_total = sum(filings_counts, na.rm = TRUE),
    mean_monthly  = mean(filings_counts, na.rm = TRUE),
    
    # Count how many months of data we have in this custom period
    n_months      = n(),
    .groups = "drop"
  ) %>% 
  
  # Data Quality Check: Only keep periods with at least 10 months of data
  filter(n_months >= 10)

Here we define what “High Risk” means and create the lag variable.

Code
# Create Target Variable (Y) and Lagged Predictor (X)
model_df_lagged <- model_df %>% 
  # A. Define "High Risk" Threshold per Year
  group_by(period_year) %>% 
  mutate(
    # Calculate the 80th percentile threshold for THIS specific period
    cutoff80  = quantile(mean_monthly, 0.80, na.rm = TRUE),
    
    # Create Binary Target: 1 if this tract is in the top 20%, 0 otherwise
    high_risk = if_else(mean_monthly >= cutoff80, 1L, 0L)
  ) %>% 
  ungroup() %>% 
  
  # Create Lagged History (The most important predictor)
  arrange(GEOID, period_year) %>% 
  group_by(GEOID) %>% 
  mutate(
    # Previous year's mean monthly filings
    lag_mean_monthly = lag(mean_monthly)
  ) %>% 
  ungroup() %>% 
  
  # Remove rows with no history (we can't train on the first year)
  filter(!is.na(lag_mean_monthly))
Note

We use 0.8 as the threshold because of The Pareto Principle / 80-20 Rule.

4.1.3 Prepare External Features (Context)

Code
# A. Legacy Distress (Historic Violations 2012-2016)
legacy_distress_join <- tract_context %>% 
  select(GEOID, total_pop) %>% 
  left_join(viol_by_tract, by = "GEOID") %>% 
  mutate(
    hist_viol_per_1k = if_else(total_pop > 0, (violations_total / total_pop) * 1000, 0)
  ) %>%
  select(GEOID, hist_viol_per_1k)

# B. Hotspot Landlord Activity
hotspot_join <- hotspots_by_tract %>% 
  select(GEOID, hotspot_filings)

# C. 311 Housing Complaints
housing311_join <- housing311_by_tract %>% 
  select(GEOID, housing311_per_1k)

# D. Spatial Context 1: Distance to City Hall
tract_centroids <- philly_tracts %>% st_centroid()
city_hall <- st_sfc(st_point(c(-75.1635, 39.9526)), crs = 4326) %>% 
  st_transform(st_crs(tract_centroids))

dist_center <- st_distance(tract_centroids, city_hall)

dist_center_join <- data.frame(
  GEOID = tract_centroids$GEOID,
  dist_center_km = as.numeric(dist_center) / 1000
)

# E. Spatial Context 2: Distance to Nearest Hotspot
dist_hotspot <- st_distance(tract_centroids, hotspot_sf) 
min_dist_hotspot <- apply(dist_hotspot, 1, min)

dist_hotspot_join <- data.frame(
  GEOID = tract_centroids$GEOID,
  dist_hotspot_km = as.numeric(min_dist_hotspot) / 1000
)

# F. Neighborhood Typology (Clustering)
# We cluster tracts based on static demographics to capture "neighborhood type"
cluster_input <- tract_context %>%
  st_drop_geometry() %>%
  select(GEOID, pov_rate, renter_share, med_rent, pct_black, pct_hispanic) %>%
  drop_na()

set.seed(42)
clusters <- kmeans(scale(cluster_input %>% select(-GEOID)), centers = 5)
cluster_input$neighborhood_cluster <- factor(clusters$cluster)

cluster_join <- cluster_input %>% select(GEOID, neighborhood_cluster)

4.1.4 Build Final Master Dataset

Code
training_data <- model_df_lagged %>% 
  # Join Demographics (ACS)
  left_join(tract_context %>% select(GEOID, pov_rate, renter_share, med_rent, pct_black, pct_hispanic), by = "GEOID") %>%
  
  # Join External Predictors
  left_join(legacy_distress_join, by = "GEOID") %>%
  left_join(hotspot_join, by = "GEOID") %>%
  left_join(housing311_join, by = "GEOID") %>%
  left_join(dist_center_join, by = "GEOID") %>%
  left_join(dist_hotspot_join, by = "GEOID") %>%  
  left_join(cluster_join, by = "GEOID") %>%       
  
  # Handle NAs created by joins (fill counts with 0)
  mutate(
    hotspot_filings   = replace_na(hotspot_filings, 0),
    hist_viol_per_1k  = replace_na(hist_viol_per_1k, 0),
    housing311_per_1k = replace_na(housing311_per_1k, 0)
  ) %>%
  
  # Final cleanup
  drop_na(pov_rate, renter_share)

4.2 Model Building

4.2.1 Train/Test Split

We employed a Time-Based Train/Test Split (Temporal Splitting) combined with Complete Case Analysis. Instead of randomly shuffling the data (which is standard for non-time-series problems), we split the dataset chronologically.

Code
# 1. Define Variables
# Ensure we drop NAs for ALL potential variables to keep samples consistent across models
all_vars <- c("high_risk", "lag_mean_monthly", 
              "pov_rate", "renter_share", "med_rent", "pct_black", "pct_hispanic",
              "hist_viol_per_1k", "hotspot_filings", "housing311_per_1k",
              "dist_center_km", "dist_hotspot_km", "neighborhood_cluster")

model_data <- training_data %>%
  drop_na(all_of(all_vars))

# 2. Time-Based Split
train_set <- model_data %>% filter(period_year < 2025)
test_set  <- model_data %>% filter(period_year == 2025)

4.2.2 Build 5 Competing Models

We employed Hierarchical Model Building using Logistic Regression.

  • Model 1 (Baseline): Tests “Inertia” (does past risk predict future risk?)
  • Models 2-4 (Thematic): Test specific domains (Demographics, Built Environment, Spatial Context).
  • Model 5 (Integrated): A comprehensive model combining all domains.
Code
# --- Model 1: Inertia Only (Lag) ---
formula_1 <- high_risk ~ lag_mean_monthly

# --- Model 2: Demographics (People) ---
formula_2 <- high_risk ~ lag_mean_monthly + 
                         pov_rate + renter_share + pct_black + med_rent

# --- Model 3: Built Environment (Place Quality) ---
formula_3 <- high_risk ~ lag_mean_monthly + 
                         hist_viol_per_1k + hotspot_filings + housing311_per_1k

# --- Model 4: Spatial Context (Location) ---
formula_4 <- high_risk ~ lag_mean_monthly + 
                         dist_center_km + dist_hotspot_km + neighborhood_cluster

# --- Model 5: Full Integrated Model ---
formula_5 <- high_risk ~ lag_mean_monthly + 
                         pov_rate + pct_black + renter_share +
                         hist_viol_per_1k + hotspot_filings + housing311_per_1k +
                         dist_center_km + neighborhood_cluster

# Train all models on the Training Set
m1 <- glm(formula_1, data = train_set, family = "binomial")
m2 <- glm(formula_2, data = train_set, family = "binomial")
m3 <- glm(formula_3, data = train_set, family = "binomial")
m4 <- glm(formula_4, data = train_set, family = "binomial")
m5 <- glm(formula_5, data = train_set, family = "binomial")

4.2.3 Compare Model Performance

We applied three standard statistical metrics to the 2025 Test Set: AUC,Accuracy and AIC.

Code
library(pROC)
library(caret)

# Function to calculate metrics for a model
calc_metrics <- function(model, test_data) {
  # Predict probabilities
  probs <- predict(model, newdata = test_data, type = "response")
  # Predict class (0 or 1)
  preds <- if_else(probs > 0.5, 1, 0)
  
  # Calculate Metrics
  roc_obj  <- roc(test_data$high_risk, probs)
  auc_val  <- as.numeric(auc(roc_obj))
  accuracy <- mean(preds == test_data$high_risk)
  aic_val  <- AIC(model)
  
  return(c(AUC = auc_val, Accuracy = accuracy, AIC = aic_val))
}

# Run comparison
results <- data.frame(
  Model = c("M1: Inertia", "M2: Demographics", "M3: Built Env", "M4: Spatial", "M5: Full"),
  rbind(
    calc_metrics(m1, test_set),
    calc_metrics(m2, test_set),
    calc_metrics(m3, test_set),
    calc_metrics(m4, test_set),
    calc_metrics(m5, test_set)
  )
)

# Display Table
library(knitr)
kable(results, digits = 3, caption = "Model Comparison Table (Test Set Performance)")
Model Comparison Table (Test Set Performance)
Model AUC Accuracy AIC
M1: Inertia 0.971 0.929 707.873
M2: Demographics 0.974 0.912 661.983
M3: Built Env 0.971 0.920 638.294
M4: Spatial 0.975 0.920 676.184
M5: Full 0.969 0.943 609.176

The results establish Model 5 (Full Integrated) as the champion. While Model 4 (Spatial) achieved a marginally higher AUC (0.975 vs 0.969), Model 5 demonstrated superior Accuracy (94.3%), correctly classifying the vast majority of tracts in the 2025 test set. Crucially, Model 5 achieved the lowest AIC (609.176), significantly outperforming the baseline models (AIC > 700). This drastic drop in AIC confirms that the added complexity—combining demographics, built environment, and spatial features—provides essential explanatory power rather than noise. Therefore, we select Model 5 as the most robust instrument for forecasting 2026 risk.

4.2.4 a Visual Comparison (ROC Curves)

Code
# Generate ROC objects
roc1 <- roc(test_set$high_risk, predict(m1, newdata = test_set, type = "response"))
roc2 <- roc(test_set$high_risk, predict(m2, newdata = test_set, type = "response"))
roc3 <- roc(test_set$high_risk, predict(m3, newdata = test_set, type = "response"))
roc4 <- roc(test_set$high_risk, predict(m4, newdata = test_set, type = "response"))
roc5 <- roc(test_set$high_risk, predict(m5, newdata = test_set, type = "response"))

# Plot
ggroc(list(M1=roc1, M2=roc2, M3=roc3, M4=roc4, M5=roc5), size = 1) +
  labs(title = "ROC Curve Comparison",
       subtitle = "Which model predicts 2025 risk best?",
       color = "Model Type") +
  theme_minimal() +
  scale_color_brewer(palette = "Set1")

4.2.5 Check Multicollinearity

Code
library(car)

vif(m5)
                          GVIF Df GVIF^(1/(2*Df))
lag_mean_monthly      1.036732  1        1.018201
pov_rate              2.428184  1        1.558263
pct_black             4.647796  1        2.155875
renter_share          3.104750  1        1.762030
hist_viol_per_1k      2.282594  1        1.510826
hotspot_filings       1.306671  1        1.143097
housing311_per_1k     2.410423  1        1.552554
dist_center_km        1.955131  1        1.398260
neighborhood_cluster 21.083598  4        1.463838

VIF diagnostic confirms that Model 5 is statistically free from severe multicollinearity issues.

4.2.6 Check Feature Importance

Code
library(caret)

# 1. Calculate Variable Importance
# For GLM, this calculates the absolute value of the t-statistic (or z-statistic)
importance_df <- varImp(m5, scale = FALSE)

# 2. Convert to a clean Data Frame for Plotting
importance_plot_df <- importance_df %>%
  as.data.frame() %>%
  rownames_to_column("Feature") %>%
  arrange(desc(Overall)) %>%
  # Remove the intercept (it's not a feature)
  filter(Feature != "(Intercept)")

# 3. Visualize
ggplot(importance_plot_df, aes(x = reorder(Feature, Overall), y = Overall)) +
  geom_col(fill = "#3182bd") +
  coord_flip() +  # Horizontal bars are easier to read
  labs(
    title = "Feature Importance: What drives Eviction Risk?",
    subtitle = "Based on Model 5 (Full Integrated Model)",
    x = NULL,
    y = "Importance Score (Absolute z-statistic)"
  ) +
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 11, face = "bold"),
    plot.title = element_text(face = "bold")
  )

Our feature importance analysis reveals that Historical Inertia and Landlord Behavior are the primary drivers of eviction risk.

4.3 Diagnostics on the Winning Model (M5)

4.3.1 Spatial Autocorrelation Diagnostic

We check if the Full Model successfully removed spatial clustering in errors.

Code
library(spdep)

# Extract residuals from the best model
train_set$m5_residuals <- residuals(m5, type = "deviance")

# Map residuals to geometry
resid_sf <- philly_tracts %>%
  left_join(train_set %>% select(GEOID, m5_residuals), by = "GEOID") %>%
  filter(!is.na(m5_residuals))

# Calculate Moran's I
coords <- st_centroid(resid_sf)
knn_nb <- knearneigh(coords, k = 5)
lw <- nb2listw(knn2nb(knn_nb), style = "W")

moran_result <- moran.test(resid_sf$m5_residuals, lw)

# Print Result (Ideally p-value > 0.05, or at least Moran's I is close to 0)
print(moran_result)

    Moran I test under randomisation

data:  resid_sf$m5_residuals  
weights: lw    

Moran I statistic standard deviate = 6.3875, p-value = 0.00000000008433
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
     0.1150100632     -0.0009478673      0.0003295672 

The Moran’s I test on the residuals yielded a statistic of 0.115 with a p-value < 0.05. The value is relatively close to 0 (which indicates perfect randomness). This suggests that our model has successfully explained the vast majority of spatial patterns in eviction risk.And the result of p-value is still statistically significant.

4.3.2 Visualize Spatial Error Patterns (Residual Map)

We visualized the Deviance Residuals of Model 5 to identify local pockets of model error. By mapping these residuals, we can diagnose if the model systematically biases its predictions in specific neighborhoods.

Code
ggplot(resid_sf) +
  geom_sf(aes(fill = m5_residuals), color = NA) +
  scale_fill_gradient2(
    low = "#4575b4",   # Blue: Model predicted High, Actual was Low (Overestimation)
    mid = "white",     # White: Perfect Prediction
    high = "#d73027",  # Red:  Model predicted Low, Actual was High (Underestimation)
    midpoint = 0,
    name = "Deviance\nResiduals"
  ) +
  labs(
    title = "Map of Model Errors (Residuals)",
    subtitle = "Where is Model 5 failing?",
    caption = "Red = Underestimation (Missed Risk) | Blue = Overestimation (False Alarm)"
  ) +
  mapTheme +
  theme(legend.position = "right")

While the errors show some local clustering (consistent with our Moran’s I result of 0.115), they do not show a city-wide systemic bias. This confirms Model 5 is generally robust.

4.3.3 Spatial Cross-Validation (LOGOCV)

We used the 5 Neighborhood Clusters (created in Section 4.1.3) as the splitting criteria. In each round of validation, we completely held out one entire cluster of neighborhoods and trained the model on the remaining four clusters. We then asked the model to predict risk in the unseen cluster.

Code
library(caret)

# 1. Define the Control Method: Leave-One-Group-Out
# We use the 'neighborhood_cluster' we created earlier as the "Group"
fit_control <- trainControl(
  method = "CV", 
  number = 5,              # 5 Clusters
  savePredictions = TRUE,
  classProbs = TRUE,       # Needed for ROC/AUC
  summaryFunction = twoClassSummary, # Optimizes for AUC
  index = createFolds(train_set$neighborhood_cluster, k = 5) # Critical: Split by Cluster!
)

# 2. Prepare Data for Caret (Needs "Yes/No" factor for classification)
cv_data <- train_set %>%
  mutate(high_risk_fac = factor(if_else(high_risk == 1, "High", "Low"), 
                                levels = c("High", "Low"))) %>%
  drop_na()

# 3. Re-train Model 5 using Spatial CV
# Note: We use the exact same formula as M5
model_spatial_cv <- train(
  high_risk_fac ~ lag_mean_monthly + 
                  pov_rate + pct_black + renter_share +
                  hist_viol_per_1k + hotspot_filings + housing311_per_1k +
                  dist_center_km, # Removed 'neighborhood_cluster' from predictors because it's the split variable!
  data = cv_data,
  method = "glm",
  family = "binomial",
  trControl = fit_control,
  metric = "ROC"
)

# 4. Print Results
print(model_spatial_cv)
Generalized Linear Model 

1056 samples
   8 predictor
   2 classes: 'High', 'Low' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 211, 211, 212, 211, 211 
Resampling results:

  ROC        Sens       Spec     
  0.9105117  0.6142652  0.9394755

Comparing this to our temporal test result (AUC ~0.969), the drop to 0.91 is expected and healthy. It indicates that while the model relies partly on local history (‘Inertia’), it still retains excellent predictive power (0.91) even when blindfolded to specific neighborhood locations. This confirms the model has learned structural rules about eviction risk.

5. Prediction for 2026 (The Future)

We use the “Full Integrated Model” (M5) to forecast risk. The input for this prediction is the data from the most recent period (Nov 2024 – Nov 2025).

Code
#  1. Prepare the Input Data for 2026 
# Logic: To predict 2026, the "lagged history" is the actual performance of 2025.

future_input <- training_data %>%
  filter(period_year == 2025) %>%   # Select the latest available data
  select(
    GEOID, 
    # CRITICAL STEP: The 2025 average becomes the "lag" for 2026
    lag_mean_monthly = mean_monthly, 
    
    # Keep all other static/structural predictors
    pov_rate, pct_black, renter_share, med_rent,
    hist_viol_per_1k, hotspot_filings, housing311_per_1k,
    dist_center_km, neighborhood_cluster, dist_hotspot_km
  ) %>%
  drop_na()

#  2. Generate Predictions 
# We use Model 5 (m5) to predict probabilities (0 to 1)
# type = "response" output probabilities
future_input$pred_prob <- predict(m5, newdata = future_input, type = "response")

#  3. Classify Risk 
# We interpret > 50% probability as "High Risk"
future_input <- future_input %>%
  mutate(
    pred_risk_label = if_else(pred_prob >= 0.5, "High Risk", "Low Risk")
  )

#  4. Join Predictions back to Map Geometry 
forecast_map <- philly_tracts %>%
  left_join(future_input, by = "GEOID") %>%
  filter(!is.na(pred_prob))

5.1 Visualize the Forecast (Probability Map)

Code
ggplot(forecast_map) +
  geom_sf(aes(fill = pred_prob), color = NA) +
  scale_fill_viridis_c(
    option = "rocket", 
    direction = -1, 
    name = "Risk Probability",
    labels = percent
  ) +
  labs(
    title = "Forecast: Eviction Risk Probability (Nov 2025 – Nov 2026)",
    subtitle = "Predicted by Model 5 (Full Integrated Model)\nDarker/Red areas indicate >80% probability of high eviction activity",
    caption = "Source: Philadelphia Eviction Data"
  ) +
  mapTheme +
  theme(legend.position = "right")

High-Intensity Clusters: The darkest areas (probabilities > 75%) are tightly clustered in specific corridors of North and West Philadelphia. This indicates that the model is extremely confident that these specific neighborhoods will experience severe eviction pressure in 2026.

5.2.1 Visualize the Forecast (Binary Risk Map)

We use three distinct cutoffs:

  • Aggressive (> 0.3): This casts a wide net to catch as many potential eviction cases as possible, accepting a higher rate of false alarms.

  • Balanced (> 0.5): This uses the standard statistical default.

  • Conservative (> 0.8): This targets only the ‘sure bets’ to minimize wasted resources.

Code
# --- 1. Create Multiple Scenarios ---
future_scenarios <- future_input %>%
  mutate(
    # Strategy A: High Recall (Catch almost all potential risks, but more false alarms)
    `Strategy: > 0.3 (Aggressive)` = if_else(pred_prob > 0.3, "High Risk", "Low Risk"),
    
    # Strategy B: Balanced (The statistical default)
    `Strategy: > 0.5 (Balanced)`   = if_else(pred_prob > 0.5, "High Risk", "Low Risk"),
    
    # Strategy C: High Precision (Target only the absolute worst hotspots)
    `Strategy: > 0.8 (Conservative)` = if_else(pred_prob > 0.8, "High Risk", "Low Risk")
  ) %>%
  # Convert wide format to long format for plotting
  pivot_longer(
    cols = starts_with("Strategy"),
    names_to = "Strategy_Name",
    values_to = "Risk_Label"
  )

# --- 2. Join to Geometry ---
# Note: This will triple the number of rows (one geometry per strategy per tract)
scenario_map <- philly_tracts %>%
  left_join(future_scenarios, by = "GEOID") %>%
  filter(!is.na(Risk_Label))

# --- 3. Plot Side-by-Side Maps ---
ggplot(scenario_map) +
  geom_sf(aes(fill = Risk_Label), color = "white", lwd = 0.01) +
  scale_fill_manual(
    values = c("High Risk" = "#d73027", "Low Risk" = "#4575b4"),
    name = "Risk Class"
  ) +
  # This creates the 3 panels
  facet_wrap(~ Strategy_Name, ncol = 3) + 
  labs(
    title = "Policy Trade-offs: Comparing Intervention Thresholds",
    subtitle = "How does the definition of 'High Risk' change our target areas for 2026?",
    caption = "Aggressive (>0.3) = Max Coverage | Balanced (>0.5) = Standard | Conservative (>0.8) = Max Precision"
  ) +
  mapTheme +
  theme(
    legend.position = "bottom",
    strip.text = element_text(size = 11, face = "bold") # Make panel titles larger
  )

Recommendation: Given the looming 2026 funding cuts, we recommend this ‘High Precision’ strategy for allocating expensive resources (like direct rental assistance), while reserving the ‘Aggressive’ strategy for low-cost interventions (like information campaigns).

6 Summary of Findings

6.1 Conclusion

Code
# Calculate summary statistics for the report text
n_high_risk <- sum(future_input$pred_risk_label == "High Risk")
pct_high_risk <- round(mean(future_input$pred_risk_label == "High Risk") * 100, 1)

# Identify the top 5 highest risk tracts
top_risk_tracts <- future_input %>%
  arrange(desc(pred_prob)) %>%
  head(5) %>%
  select(GEOID, pred_prob, lag_mean_monthly, pov_rate)

print(paste("Total High Risk Tracts Predicted:", n_high_risk))
[1] "Total High Risk Tracts Predicted: 71"
Code
print(paste("Percentage of City:", pct_high_risk, "%"))
[1] "Percentage of City: 20.2 %"
Code
print(top_risk_tracts)
# A tibble: 5 × 4
  GEOID       pred_prob lag_mean_monthly pov_rate
  <chr>           <dbl>            <dbl>    <dbl>
1 42101021800     1.000             12.5   0.148 
2 42101036100     1.000             11.6   0.0478
3 42101030100     1.000             13.8   0.442 
4 42101012100     0.999             10.3   0.169 
5 42101023900     0.999              9     0.101 

Our Model 5 forecast projects that 71 census tracts will be classified as ‘High Risk’ in the upcoming year. These 71 tracts represent exactly 20.2% of the city’s neighborhoods. This finding is remarkably consistent with the Pareto Principle (80/20 rule) we observed in the historical data, confirming that the 2026 crisis will likely remain highly concentrated rather than spreading evenly across the city. For these specific neighborhoods, the model is effectively finding ‘near certainty’ of high eviction activity. These areas should be the immediate first stop for any ‘High Precision’ legal aid deployment, as they represent the absolute epicenter of displacement risk.

6.2 Limitations

  1. Reporting Bias in 311 data: Our model uses 311 complaints as a proxy for housing distress.However, the tenants in the most vulnerable neighborhoods may be less likely to report violations than residents in gentrifying areas.For example, they fear landlord retaliations. So our model may paradoxically interpret a lack of complaints as “stability” in areas that are actually suffering from silent, unreported neglect.

  2. Census-Tract Scale, Not Individual Properties: Our predictions operate at the Census Tract level. High-risk classification for a tract does not mean every property in that tract is at risk. So interventions should be verified at the individual property level to avoid wasting resources on stable buildings within distressed areas.

6.3 Recommendations

  1. For small set of extreme hotspots, we could operate direct rent relief, full legal representation, intensive case management. Also, we need to verify needs at the property/household level with local partners before committing these high-cost resources.

  2. For wider ring of at-risk neighborhoods around the hotspots, we can operate low-cost interventions – flyers, text alerts, know-your-rights workshops, landlord engagement. Also we can use this tier to reach areas where 311 distress may be under-reported, and monitor for emerging hotspots to feed back into the model.

Source Code
---
title: "Eviction Risk Prediction in Philadelphia"
author:
  - "Lingxuan Gao"
  - "Xiaoqing Chen"
date: "`r Sys.Date()`"
format:
  html:
    code-fold: true
    code-tools: true
    toc: true
    toc-depth: 3
    toc-location: left
    theme: cosmo
    embed-resources: true
editor: visual
execute:
  warning: false
  message: false
---

# Introduction: Predicting the 2026 Housing Cliff

As Philadelphia approaches 2026, the city faces an unprecedented "housing cliff" driven by federal policy shocks. New mandates are shifting away from the "Housing First" model and enforcing stricter NSPIRE inspection standards, threatening the funding for approximately 1,200 permanent supportive housing units and risking mass disqualifications of affordable stock.

In this volatile landscape, relying solely on historical averages is insufficient. This project develops a spatial predictive model to forecast eviction risk for the Nov 2025 – Nov 2026 period. By integrating structural distress indicators (L&I violations), spatial context, and demographic data, we aim to identify the specific census tracts most vulnerable to this new wave of instability, enabling a "High Precision" allocation of limited legal aid resources.

------------------------------------------------------------------------

# Setup

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  warning = FALSE,
  message = FALSE,
  cache = TRUE
)

Sys.setlocale("LC_TIME", "English")
```

## Load Libraries

```{r load_libraries}
# Core tidyverse
library(tidyverse)
library(lubridate)
library(janitor)
library(scales)

# Spatial data
library(sf)
library(tigris)

# Census data
library(tidycensus)

# Weather data
library(riem)  # For Philadelphia weather from ASOS stations

# Visualization
library(viridis)
library(gridExtra)
library(knitr)
library(kableExtra)
# Get rid of scientific notation. We gotta look good!
options(scipen = 999)
```

## Define Themes

```{r themes}
plotTheme <- theme(
  plot.title = element_text(size = 14, face = "bold"),
  plot.subtitle = element_text(size = 10),
  plot.caption = element_text(size = 8),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title = element_text(size = 11, face = "bold"),
  panel.background = element_blank(),
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank(),
  axis.ticks = element_blank(),
  legend.position = "right"
)

mapTheme <- theme(
  plot.title = element_text(size = 14, face = "bold"),
  plot.subtitle = element_text(size = 10),
  plot.caption = element_text(size = 8),
  axis.line = element_blank(),
  axis.text = element_blank(),
  axis.ticks = element_blank(),
  axis.title = element_blank(),
  panel.background = element_blank(),
  panel.border = element_blank(),
  panel.grid.major = element_line(colour = 'transparent'),
  panel.grid.minor = element_blank(),
  legend.position = "right",
  plot.margin = margin(1, 1, 1, 1, 'cm'),
  legend.key.height = unit(1, "cm"),
  legend.key.width = unit(0.2, "cm")
)

palette5 <- c("#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c")
```

# 1. Data Source

## Load and clean all the datasets

```{r load data}

# Monthly totals vs baseline (city-wide)
philly_barchart <- read_csv("data/philadelphia_barchart.csv") %>%
  mutate(
    date = my(month)
  )

# Claims severity over time
philly_claims_monthly <- read_csv("data/philadelphia_claims_monthly.csv") %>%
  mutate(
    month_date = ymd(month_date)
  )

# Group-level (race, gender, etc.) monthly trends
philly_linechart <- read_csv("data/philadelphia_linechart.csv") %>%
  mutate(
    month_date = my(month)
  )

# Census tract map snapshot
philly_map <- read_csv("data/philadelphia_map.csv") %>%
  mutate(
   GEOID = as.character(id),
    month_date = dmy(month_date)
  )

# Tract-level monthly pre/post pandemic
philly_monthly <- read_csv("data/philadelphia_monthly_2020_2021.csv") %>%
  mutate(
    month_date = my(month)
  )

# Tract-level weekly 
weekly <- read_csv("data/philadelphia_weekly_2020_2021.csv") %>%
  mutate(
    week_date = ymd(week_date)
  )

# Code / license violations (point data, older)
li_violations <- read_csv("data/li_violations.csv")

# Hotspot properties (top filers)
hotspots <- read_csv("data/philadelphia_hotspots_media_report.csv") %>%
  mutate(
    time_period = ymd(time_period),
    end_date    = ymd(end_date)
  )

```

**Data sources**:

Primary dataset: Eviction Lab – Philadelphia Tracking 

- City-level series 

- Tract-level monthly data over multiple years 

- Tract-level snapshot of Philadelphia eviction risk (since Nov 2024) 

- Group-level disparities 

- landlord hotspots

# 2. Exploratory Data Analysis

## 2.1 Citywide filings over time

We reorganized the monthly eviction-filing counts by grouping each month into a season (Winter, Spring, Summer, Fall). We computed four separate seasonal baselines:

the historical average for Winter/Spring/Summer/Fall months

Then we produced a four-panel faceted plot to let us compare each season to what is “normal” for that season.

```{r season}

# Add season column
philly_barchart2 <- philly_barchart %>%
  mutate(
    # extract month number
    m = month(date),
    season = case_when(
      m %in% c(12, 1, 2) ~ "Winter",
      m %in% c(3, 4, 5)  ~ "Spring",
      m %in% c(6, 7, 8)  ~ "Summer",
      m %in% c(9, 10, 11) ~ "Fall"
    )
  )

# Calculate true seasonal averages
season_baseline <- philly_barchart2 %>%
  group_by(season) %>%
  summarise(season_avg = mean(month_filings, na.rm = TRUE))

season_baseline


```

```{r seaon chart}

ggplot(philly_barchart2, aes(x = date, y = month_filings)) +
  geom_col(fill = "#3182bd") +
  geom_hline(
    data = season_baseline,
    aes(yintercept = season_avg),
    linetype = "dashed"
  ) +
  facet_wrap(~ season, ncol = 2, scales = "free_y") +
  labs(
    title = "Monthly Eviction Filings by Season",
    subtitle = "Each panel shows filings and seasonal baseline",
    x = "Date",
    y = "Eviction filings"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    strip.text = element_text(face = "bold", size = 13),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

```

**Patterns:**

1.Eviction filings were heavily suppressed across all seasons in 2020–2021. Every season shows extremely low activity during the early pandemic period.

2.Beginning in 2022, filings rebound sharply—but not uniformly across seasons. -Summer 2022 displays one of the largest spikes in the dataset. -Fall and Spring quickly rise back to their historical levels. -Winter rebounds more unevenly but eventually stabilizes. So the seasonal cut reveals that the return to “normal” was staggered, with some seasons experiencing larger surges than others.

3.From 2023 onward, virtually every season settles at or slightly above its long-term seasonal baseline. This indicates a new post-pandemic equilibrium where eviction activity has stabilized.

## 2.2 Tract-level risk and spatial patterns

We uses the multi-year tract-level monthly data to see how skewed filings are across tracts.

```{r eda_tract_distribution}
tract_annual <- philly_monthly %>%
  group_by(GEOID) %>%
  summarize(
    total_filings = sum(filings_counts, na.rm = TRUE),
    mean_monthly  = mean(filings_counts, na.rm = TRUE),
    n_months      = n(),
    .groups = "drop"
  )

# Add a vertical line at the 80th percentile (top 20%)
p80 <- quantile(tract_annual$mean_monthly, 0.80, na.rm = TRUE)

ggplot(tract_annual, aes(x = mean_monthly)) +
  geom_histogram(bins = 30, fill = "#3182bd") +
  geom_vline(xintercept = p80, linetype = "dashed", color = "red") +
  labs(
    title = "Distribution of Mean Monthly Eviction Filings per Tract",
    subtitle = paste0("Dashed line = 80th percentile (candidate high-risk cutoff: ",
                      round(p80, 2), " filings/month)"),
    x = "Mean monthly filings (multi-year)",
    y = "Number of tracts"
  ) +
  plotTheme

```

This histogram shows that most census tracts have very low average eviction activity—often fewer than 2 filings per month.A small number of tracts sit far to the right, with 5–10+ filings per month, meaning they experience consistently high eviction pressure over multiple years. So eviction risk in Philly is highly **concentrated**, not evenly spread, which supports focusing limited legal aid and rent subsidies on that right-tail group of chronic hotspot tracts.

### Load Spatial Data

```{r}
# Read your GeoJSONs
boundary <- st_read("data/City_Limits.geojson")
pa_tracts <- st_read("data/PA-tracts.geojson")

# Join eviction data with tract geometry
philly_tracts <- pa_tracts %>%
  filter(str_starts(GEOID, "42101"))

philly_map <- philly_map %>%
  mutate(id = as.character(id))

philly_tracts <- philly_tracts %>%
  mutate(GEOID = as.character(GEOID))


map_sf <- philly_tracts %>%
  left_join(philly_map, by = c("GEOID" = "id"))
```

### Eviction rate map since Nov 2024

```{r}
ggplot(
  map_sf %>% filter(!is.na(month_rate))
) +
  geom_sf(aes(fill = month_rate), color = NA) +
  scale_fill_viridis(option = "magma", direction = -1) +
  labs(
    title = "Eviction Filing Rate by Census Tract (Nov 2024–Nov 2025)",
    fill  = "Eviction Rate"
  ) +
  coord_sf(datum = NA) +
  theme_void() +
  theme(
    legend.position = "right",
    plot.title      = element_text(size = 14, face = "bold")
  )


```

```{r}
map_compare <- map_sf %>%
  mutate(
    # month_diff is a ratio vs baseline
    diff_vs_baseline = month_diff - 1
  )

ggplot(
  map_compare %>% filter(!is.na(diff_vs_baseline))
) +
  geom_sf(aes(fill = diff_vs_baseline), color = NA) +
  scale_fill_gradient2(
    low  = "#4575b4",
    mid  = "white",
    high = "#d73027",
    midpoint = 0,
    labels = percent_format(accuracy = 1),
    name = "Change vs baseline\n(2023–24)"
  ) +
  labs(
    title    = "Eviction Filing Rate by Census Tract",
    subtitle = "Nov 2024–Nov 2025 vs typical 2023–2024 rate"
  ) +
  coord_sf(datum = NA) +
  theme_void() +
  theme(
    legend.position = "right",
    plot.title      = element_text(size = 14, face = "bold"),
    plot.subtitle   = element_text(size = 10)
  )


```

Philly isn’t on fire everywhere—a handful of neighborhoods are overheating while others are holding steady or cooling down. Those red clusters are exactly where we’d point legal aid and rent relief first.

### Persistence of high-risk tracts

```{r}
threshold <- quantile(map_sf$month_rate, 0.8, na.rm = TRUE)

tract_risk <- map_sf %>%
  st_drop_geometry() %>%                        # strip geometry
  mutate(high = if_else(month_rate >= threshold, 1L, 0L)) %>%
  group_by(GEOID) %>%
  summarise(
    times_high = sum(high, na.rm = TRUE),
    .groups = "drop"
  )

risk_sf <- philly_tracts %>%
  left_join(tract_risk, by = "GEOID")          # now y is NOT sf → OK

ggplot(risk_sf) +
  geom_sf(aes(fill = times_high), color = NA) +
  scale_fill_viridis_c(option = "inferno") +
  labs(
    title = "Persistence of High Eviction Risk Across Tracts",
    fill  = "Months above 80th percentile"
  ) +
  coord_sf(datum = NA) +
  theme_void() +
  theme_minimal()


```

### Hotspot landlords (top eviction filers)

```{r}
hotspot_sf <- st_as_sf(
  hotspots,
  coords = c("lon", "lat"),
  crs = 4326
)

ggplot() +
  geom_sf(data = boundary, fill = "grey95", color = NA) +
  geom_sf(data = hotspot_sf, aes(size = filings), alpha = 0.6, color = "#d7301f") +
  scale_size(range = c(1, 10)) +
  labs(
    title = "Top Eviction-Filing Landlords in Philadelphia",
    subtitle = "Bubble size shows number of filings",
    size = "Filings"
  ) +
  coord_sf(datum = NA) +
  theme_void() +
  theme_minimal()
```

The map reveals that eviction activity is not driven solely by tenant poverty, but also by landlord behavior. The large red bubbles represent the specific locations of the city's top filers. We observe that high-volume eviction filing is not evenly distributed; it is highly concentrated among a specific subset of property owners. The spatial pattern is stark and non-random. The largest clusters of high-volume filers are located in: North Philadelphia and West Philadelphia.

# 3.Get Philadelphia Spatial Context

Now we shift from “describe” to “explain”:\
We want to add **structural predictors**:

-   Demographics & housing (ACS)\
-   Landlord concentration (hotspots)\
-   Housing quality (old violations)

Additional dataset:

- ACS (American Community Survey) 5-year 2022 data for Philadelphia census tracts 

- Historic code-violation (2012-2016) (From Licenses and Inspections Department of Philadelphia)

- 311 Service and Information Requests

## 3.1 Base tract list

```{r spatial_base}
# One row per tract ID from the joined spatial layer
tract_base <- map_sf %>%
  st_drop_geometry() %>%
  distinct(GEOID) %>%
  arrange(GEOID)
```

## 3.2 Add ACS demographics & housing

```{r spatial_acs, message=FALSE, warning=FALSE}
acs_vars <- c(
  total_pop  = "B01003_001",  # total population
  pov_total  = "B17001_001",  # poverty universe
  pov_below  = "B17001_002",  # below poverty
  occ_total  = "B25003_001",  # occupied housing units
  occ_renter = "B25003_003",  # renter-occupied units
  med_rent   = "B25064_001",  # median gross rent
  white      = "B03002_003",
  black      = "B03002_004",
  hispanic   = "B03002_012"
)

acs_philly <- get_acs(
  geography = "tract",
  state     = "PA",
  county    = "Philadelphia",
  survey    = "acs5",
  year      = 2022,
  variables = acs_vars,
  output    = "wide",
  geometry  = FALSE
) %>%
  transmute(
    GEOID,
    total_pop    = total_popE,
    pov_rate     = if_else(pov_totalE > 0, pov_belowE / pov_totalE, NA_real_),
    renter_share = if_else(occ_totalE > 0, occ_renterE / occ_totalE, NA_real_),
    med_rent     = med_rentE,
    pct_black    = if_else(total_popE > 0, blackE / total_popE, NA_real_),
    pct_hispanic = if_else(total_popE > 0, hispanicE / total_popE, NA_real_)
  )

# Attach ACS to tract base and to the spatial object
tract_context <- tract_base %>%
  left_join(acs_philly, by = "GEOID")

map_sf_context <- map_sf %>%       # or map_sf_context if you’d already created it
  left_join(acs_philly, by = "GEOID")
```

## 3.3 Add landlord hotspot intensity

```{r spatial_hotspots, message=FALSE, warning=FALSE}
# Hotspot landlords as points (you already use this for the bubble map)
hotspot_sf <- st_as_sf(
  hotspots,
  coords = c("lon", "lat"),
  crs = 4326
) %>%
  st_transform(st_crs(map_sf_context))

# Spatial join: assign each hotspot property to a tract
hotspots_by_tract <- st_join(
  hotspot_sf,
  map_sf_context["GEOID"],
  left = FALSE
) %>%
  st_drop_geometry() %>%
  group_by(GEOID) %>%
  summarise(
    hotspot_filings     = sum(filings, na.rm = TRUE),
    n_hotspot_props     = n(),
    n_hotspot_landlords = n_distinct(xplaintiff),
    .groups = "drop"
  )

map_sf_context <- map_sf_context %>%
  left_join(hotspots_by_tract, by = "GEOID") %>%
  mutate(
    across(
      c(hotspot_filings, n_hotspot_props, n_hotspot_landlords),
      ~ replace_na(., 0)
    )
  )
```

## 3.4 Add historic code-violation intensity (2012-2016)

To capture long-term structural neglect, we transformed historical code violation data (2012–2016) from individual geographic points into a tract-level density metric. We spatially joined each violation instance to its corresponding census tract, aggregated the total counts, and normalized them by population to derive the 'Violations per 1,000 Residents' (viol_per_1k) variable. This serves as a standardized proxy for 'legacy distress' within the built environment, distinguishing chronic disrepair from recent instability.

```{r spatial_violations, message=FALSE, warning=FALSE}
# Violations as points
li_violations_sf <- st_as_sf(
  li_violations,
  coords = c("lng", "lat"),
  crs = 4326
) %>%
  st_transform(st_crs(map_sf_context))

# Join each violation to a tract
viol_by_tract <- st_join(
  li_violations_sf,
  map_sf_context["GEOID"],
  left = FALSE
) %>%
  st_drop_geometry() %>%
  group_by(GEOID) %>%
  summarise(
    violations_total = n(),
    violations_types = n_distinct(violationdescription),
    .groups = "drop"
  )

map_sf_context <- map_sf_context %>%
  left_join(viol_by_tract, by = "GEOID") %>%
  mutate(
    violations_total = replace_na(violations_total, 0),
    violations_types = replace_na(violations_types, 0),
    viol_per_1k = if_else(
      total_pop > 0,
      1000 * violations_total / total_pop,
      NA_real_
    )
  )
```

## 3.5 311 housing-related complaint data

To quantify localized housing distress, we processed raw 311 service request data by filtering for ten specific housing-related categories recorded during 2025. These individual incidents were spatially joined to census tracts to aggregate point-level data into neighborhood-level counts. Finally, we normalized these figures by total population to create a standardized 'Complaints per 1,000 Residents' metric (housing311_per_1k), serving as a dynamic proxy for property mismanagement and structural neglect.

```{r add_311_housing, message=FALSE, warning=FALSE}

housing_types <- c(
  "Dangerous Building Complaint",
  "Construction Complaints",
  "Maintenance Complaint",
  "Sanitation Violation",
  "Sanitation / Dumpster Violation",
  "Dumpster Violation",
  "Fire Safety Complaint",
  "Smoke Detector",
  "Graffiti Removal",
  "Homeless Encampment Request"
)

housing311_raw <- read_csv("data/public_cases_fc.csv") %>%
  clean_names() %>%
  filter(service_name %in% housing_types) %>%
  filter(!is.na(lon), !is.na(lat)) %>%   # adjust if your columns are lon/lat
  mutate(
    request_date = as.Date(requested_datetime),   # change name if needed
    year         = year(request_date)
  ) %>%
  filter(year >= 2020)               # recent years only

```

```{r agg_311_by_tract}

housing311_sf <- housing311_raw %>%
  st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
  st_transform(st_crs(map_sf))

housing311_tract <- st_join(
  housing311_sf,
  map_sf["GEOID"],
  left = FALSE
)

housing311_by_tract <- housing311_tract %>%
  st_drop_geometry() %>%
  count(GEOID, name = "housing311_count") %>%
  left_join(
    tract_context %>% select(GEOID, total_pop),
    by = "GEOID"
  ) %>%
  mutate(
    housing311_per_1k = if_else(
      total_pop > 0,
      1000 * housing311_count / total_pop,
      NA_real_
    )
  ) %>%
  select(GEOID, housing311_per_1k)

```

# 4. Build & Compare Predictive Models

## 4.1 Master Data Setup & Feature Engineering

### 4.1.1 Define Custom Time Periods (The "Nov-Nov" Fix)

To align our historical training data with the target prediction window (November 2025 – November 2026), we engineered a custom time period variable (period_year). Standard calendar years (Jan–Dec) were replaced with 'Eviction Years' that run from November to October. For instance, data from November and December of 2020 were reassigned to the '2021 Period.' This ensures that every year in our panel represents a full 12-month cycle that perfectly mirrors the seasonal structure of our final forecast.

```{r}
#  Prepare the Monthly Data with Custom Time Periods
panel_setup <- philly_monthly %>% 
  mutate(
    # Convert text to Date
    month_date = my(month),
    
    # --- CRITICAL FIX: Custom "Eviction Year" Logic ---
    # We want the year to start in November.
    # If month is Nov (11) or Dec (12), we assign it to the NEXT year.
    # Example: Nov 2020 -> Period 2021
    # Example: Nov 2024 -> Period 2025 (The most recent complete year)
    period_year = if_else(
      month(month_date) >= 11, 
      year(month_date) + 1, 
      year(month_date)
    )
  ) %>%
  # Filter out data before Nov 2020 to ensure we have full 12-month cycles
  filter(period_year >= 2021)
```

### 4.1.2 Build Basic Tract-Year Panel

```{r}
# Aggregate from Monthly to Annual (Period) Level
model_df <- panel_setup %>% 
  group_by(GEOID, period_year) %>% 
  summarise(
    # --- FIX: Variable Name Changed from filings_2020 ---
    filings_total = sum(filings_counts, na.rm = TRUE),
    mean_monthly  = mean(filings_counts, na.rm = TRUE),
    
    # Count how many months of data we have in this custom period
    n_months      = n(),
    .groups = "drop"
  ) %>% 
  
  # Data Quality Check: Only keep periods with at least 10 months of data
  filter(n_months >= 10)
```

Here we define what "High Risk" means and create the lag variable.

```{r}
# Create Target Variable (Y) and Lagged Predictor (X)
model_df_lagged <- model_df %>% 
  # A. Define "High Risk" Threshold per Year
  group_by(period_year) %>% 
  mutate(
    # Calculate the 80th percentile threshold for THIS specific period
    cutoff80  = quantile(mean_monthly, 0.80, na.rm = TRUE),
    
    # Create Binary Target: 1 if this tract is in the top 20%, 0 otherwise
    high_risk = if_else(mean_monthly >= cutoff80, 1L, 0L)
  ) %>% 
  ungroup() %>% 
  
  # Create Lagged History (The most important predictor)
  arrange(GEOID, period_year) %>% 
  group_by(GEOID) %>% 
  mutate(
    # Previous year's mean monthly filings
    lag_mean_monthly = lag(mean_monthly)
  ) %>% 
  ungroup() %>% 
  
  # Remove rows with no history (we can't train on the first year)
  filter(!is.na(lag_mean_monthly))


```

::: callout-note
We use 0.8 as the threshold because of The Pareto Principle / 80-20 Rule.
:::

### 4.1.3 Prepare External Features (Context)

```{r}
# A. Legacy Distress (Historic Violations 2012-2016)
legacy_distress_join <- tract_context %>% 
  select(GEOID, total_pop) %>% 
  left_join(viol_by_tract, by = "GEOID") %>% 
  mutate(
    hist_viol_per_1k = if_else(total_pop > 0, (violations_total / total_pop) * 1000, 0)
  ) %>%
  select(GEOID, hist_viol_per_1k)

# B. Hotspot Landlord Activity
hotspot_join <- hotspots_by_tract %>% 
  select(GEOID, hotspot_filings)

# C. 311 Housing Complaints
housing311_join <- housing311_by_tract %>% 
  select(GEOID, housing311_per_1k)

# D. Spatial Context 1: Distance to City Hall
tract_centroids <- philly_tracts %>% st_centroid()
city_hall <- st_sfc(st_point(c(-75.1635, 39.9526)), crs = 4326) %>% 
  st_transform(st_crs(tract_centroids))

dist_center <- st_distance(tract_centroids, city_hall)

dist_center_join <- data.frame(
  GEOID = tract_centroids$GEOID,
  dist_center_km = as.numeric(dist_center) / 1000
)

# E. Spatial Context 2: Distance to Nearest Hotspot
dist_hotspot <- st_distance(tract_centroids, hotspot_sf) 
min_dist_hotspot <- apply(dist_hotspot, 1, min)

dist_hotspot_join <- data.frame(
  GEOID = tract_centroids$GEOID,
  dist_hotspot_km = as.numeric(min_dist_hotspot) / 1000
)

# F. Neighborhood Typology (Clustering)
# We cluster tracts based on static demographics to capture "neighborhood type"
cluster_input <- tract_context %>%
  st_drop_geometry() %>%
  select(GEOID, pov_rate, renter_share, med_rent, pct_black, pct_hispanic) %>%
  drop_na()

set.seed(42)
clusters <- kmeans(scale(cluster_input %>% select(-GEOID)), centers = 5)
cluster_input$neighborhood_cluster <- factor(clusters$cluster)

cluster_join <- cluster_input %>% select(GEOID, neighborhood_cluster)
```

### 4.1.4 Build Final Master Dataset

```{r}
training_data <- model_df_lagged %>% 
  # Join Demographics (ACS)
  left_join(tract_context %>% select(GEOID, pov_rate, renter_share, med_rent, pct_black, pct_hispanic), by = "GEOID") %>%
  
  # Join External Predictors
  left_join(legacy_distress_join, by = "GEOID") %>%
  left_join(hotspot_join, by = "GEOID") %>%
  left_join(housing311_join, by = "GEOID") %>%
  left_join(dist_center_join, by = "GEOID") %>%
  left_join(dist_hotspot_join, by = "GEOID") %>%  
  left_join(cluster_join, by = "GEOID") %>%       
  
  # Handle NAs created by joins (fill counts with 0)
  mutate(
    hotspot_filings   = replace_na(hotspot_filings, 0),
    hist_viol_per_1k  = replace_na(hist_viol_per_1k, 0),
    housing311_per_1k = replace_na(housing311_per_1k, 0)
  ) %>%
  
  # Final cleanup
  drop_na(pov_rate, renter_share)
```

## 4.2 Model Building

### 4.2.1 Train/Test Split

We employed a Time-Based Train/Test Split (Temporal Splitting) combined with Complete Case Analysis. Instead of randomly shuffling the data (which is standard for non-time-series problems), we split the dataset chronologically.

```{r}
# 1. Define Variables
# Ensure we drop NAs for ALL potential variables to keep samples consistent across models
all_vars <- c("high_risk", "lag_mean_monthly", 
              "pov_rate", "renter_share", "med_rent", "pct_black", "pct_hispanic",
              "hist_viol_per_1k", "hotspot_filings", "housing311_per_1k",
              "dist_center_km", "dist_hotspot_km", "neighborhood_cluster")

model_data <- training_data %>%
  drop_na(all_of(all_vars))

# 2. Time-Based Split
train_set <- model_data %>% filter(period_year < 2025)
test_set  <- model_data %>% filter(period_year == 2025)
```

### 4.2.2 Build 5 Competing Models

We employed Hierarchical Model Building using Logistic Regression.

-   Model 1 (Baseline): Tests "Inertia" (does past risk predict future risk?)
-   Models 2-4 (Thematic): Test specific domains (Demographics, Built Environment, Spatial Context).
-   Model 5 (Integrated): A comprehensive model combining all domains.

```{r}
# --- Model 1: Inertia Only (Lag) ---
formula_1 <- high_risk ~ lag_mean_monthly

# --- Model 2: Demographics (People) ---
formula_2 <- high_risk ~ lag_mean_monthly + 
                         pov_rate + renter_share + pct_black + med_rent

# --- Model 3: Built Environment (Place Quality) ---
formula_3 <- high_risk ~ lag_mean_monthly + 
                         hist_viol_per_1k + hotspot_filings + housing311_per_1k

# --- Model 4: Spatial Context (Location) ---
formula_4 <- high_risk ~ lag_mean_monthly + 
                         dist_center_km + dist_hotspot_km + neighborhood_cluster

# --- Model 5: Full Integrated Model ---
formula_5 <- high_risk ~ lag_mean_monthly + 
                         pov_rate + pct_black + renter_share +
                         hist_viol_per_1k + hotspot_filings + housing311_per_1k +
                         dist_center_km + neighborhood_cluster

# Train all models on the Training Set
m1 <- glm(formula_1, data = train_set, family = "binomial")
m2 <- glm(formula_2, data = train_set, family = "binomial")
m3 <- glm(formula_3, data = train_set, family = "binomial")
m4 <- glm(formula_4, data = train_set, family = "binomial")
m5 <- glm(formula_5, data = train_set, family = "binomial")
```

### 4.2.3 Compare Model Performance

We applied three standard statistical metrics to the 2025 Test Set: AUC,Accuracy and AIC.

```{r}
library(pROC)
library(caret)

# Function to calculate metrics for a model
calc_metrics <- function(model, test_data) {
  # Predict probabilities
  probs <- predict(model, newdata = test_data, type = "response")
  # Predict class (0 or 1)
  preds <- if_else(probs > 0.5, 1, 0)
  
  # Calculate Metrics
  roc_obj  <- roc(test_data$high_risk, probs)
  auc_val  <- as.numeric(auc(roc_obj))
  accuracy <- mean(preds == test_data$high_risk)
  aic_val  <- AIC(model)
  
  return(c(AUC = auc_val, Accuracy = accuracy, AIC = aic_val))
}

# Run comparison
results <- data.frame(
  Model = c("M1: Inertia", "M2: Demographics", "M3: Built Env", "M4: Spatial", "M5: Full"),
  rbind(
    calc_metrics(m1, test_set),
    calc_metrics(m2, test_set),
    calc_metrics(m3, test_set),
    calc_metrics(m4, test_set),
    calc_metrics(m5, test_set)
  )
)

# Display Table
library(knitr)
kable(results, digits = 3, caption = "Model Comparison Table (Test Set Performance)")
```

The results establish Model 5 (Full Integrated) as the champion. While Model 4 (Spatial) achieved a marginally higher AUC (0.975 vs 0.969), Model 5 demonstrated superior Accuracy (94.3%), correctly classifying the vast majority of tracts in the 2025 test set. Crucially, Model 5 achieved the lowest AIC (609.176), significantly outperforming the baseline models (AIC \> 700). This drastic drop in AIC confirms that the added complexity—combining demographics, built environment, and spatial features—provides essential explanatory power rather than noise. Therefore, we select Model 5 as the most robust instrument for forecasting 2026 risk.

### 4.2.4 a Visual Comparison (ROC Curves)

```{r}
# Generate ROC objects
roc1 <- roc(test_set$high_risk, predict(m1, newdata = test_set, type = "response"))
roc2 <- roc(test_set$high_risk, predict(m2, newdata = test_set, type = "response"))
roc3 <- roc(test_set$high_risk, predict(m3, newdata = test_set, type = "response"))
roc4 <- roc(test_set$high_risk, predict(m4, newdata = test_set, type = "response"))
roc5 <- roc(test_set$high_risk, predict(m5, newdata = test_set, type = "response"))

# Plot
ggroc(list(M1=roc1, M2=roc2, M3=roc3, M4=roc4, M5=roc5), size = 1) +
  labs(title = "ROC Curve Comparison",
       subtitle = "Which model predicts 2025 risk best?",
       color = "Model Type") +
  theme_minimal() +
  scale_color_brewer(palette = "Set1")
```

### 4.2.5 Check Multicollinearity

```{r}
library(car)

vif(m5)
```

VIF diagnostic confirms that Model 5 is statistically free from severe multicollinearity issues.

### 4.2.6 Check Feature Importance

```{r}
library(caret)

# 1. Calculate Variable Importance
# For GLM, this calculates the absolute value of the t-statistic (or z-statistic)
importance_df <- varImp(m5, scale = FALSE)

# 2. Convert to a clean Data Frame for Plotting
importance_plot_df <- importance_df %>%
  as.data.frame() %>%
  rownames_to_column("Feature") %>%
  arrange(desc(Overall)) %>%
  # Remove the intercept (it's not a feature)
  filter(Feature != "(Intercept)")

# 3. Visualize
ggplot(importance_plot_df, aes(x = reorder(Feature, Overall), y = Overall)) +
  geom_col(fill = "#3182bd") +
  coord_flip() +  # Horizontal bars are easier to read
  labs(
    title = "Feature Importance: What drives Eviction Risk?",
    subtitle = "Based on Model 5 (Full Integrated Model)",
    x = NULL,
    y = "Importance Score (Absolute z-statistic)"
  ) +
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 11, face = "bold"),
    plot.title = element_text(face = "bold")
  )
```

Our feature importance analysis reveals that Historical Inertia and Landlord Behavior are the primary drivers of eviction risk.

## 4.3 Diagnostics on the Winning Model (M5)

### 4.3.1 Spatial Autocorrelation Diagnostic

We check if the Full Model successfully removed spatial clustering in errors.

```{r}
library(spdep)

# Extract residuals from the best model
train_set$m5_residuals <- residuals(m5, type = "deviance")

# Map residuals to geometry
resid_sf <- philly_tracts %>%
  left_join(train_set %>% select(GEOID, m5_residuals), by = "GEOID") %>%
  filter(!is.na(m5_residuals))

# Calculate Moran's I
coords <- st_centroid(resid_sf)
knn_nb <- knearneigh(coords, k = 5)
lw <- nb2listw(knn2nb(knn_nb), style = "W")

moran_result <- moran.test(resid_sf$m5_residuals, lw)

# Print Result (Ideally p-value > 0.05, or at least Moran's I is close to 0)
print(moran_result)
```

The Moran’s I test on the residuals yielded a statistic of 0.115 with a p-value \< 0.05. The value is relatively close to 0 (which indicates perfect randomness). This suggests that our model has successfully explained the vast majority of spatial patterns in eviction risk.And the result of p-value is still statistically significant.

### 4.3.2 Visualize Spatial Error Patterns (Residual Map)

We visualized the Deviance Residuals of Model 5 to identify local pockets of model error. By mapping these residuals, we can diagnose if the model systematically biases its predictions in specific neighborhoods.

```{r}
ggplot(resid_sf) +
  geom_sf(aes(fill = m5_residuals), color = NA) +
  scale_fill_gradient2(
    low = "#4575b4",   # Blue: Model predicted High, Actual was Low (Overestimation)
    mid = "white",     # White: Perfect Prediction
    high = "#d73027",  # Red:  Model predicted Low, Actual was High (Underestimation)
    midpoint = 0,
    name = "Deviance\nResiduals"
  ) +
  labs(
    title = "Map of Model Errors (Residuals)",
    subtitle = "Where is Model 5 failing?",
    caption = "Red = Underestimation (Missed Risk) | Blue = Overestimation (False Alarm)"
  ) +
  mapTheme +
  theme(legend.position = "right")
```

While the errors show some local clustering (consistent with our Moran's I result of 0.115), they do not show a city-wide systemic bias. This confirms Model 5 is generally robust.

### 4.3.3 Spatial Cross-Validation (LOGOCV)

We used the 5 Neighborhood Clusters (created in Section 4.1.3) as the splitting criteria. In each round of validation, we completely held out one entire cluster of neighborhoods and trained the model on the remaining four clusters. We then asked the model to predict risk in the unseen cluster.

```{r}
library(caret)

# 1. Define the Control Method: Leave-One-Group-Out
# We use the 'neighborhood_cluster' we created earlier as the "Group"
fit_control <- trainControl(
  method = "CV", 
  number = 5,              # 5 Clusters
  savePredictions = TRUE,
  classProbs = TRUE,       # Needed for ROC/AUC
  summaryFunction = twoClassSummary, # Optimizes for AUC
  index = createFolds(train_set$neighborhood_cluster, k = 5) # Critical: Split by Cluster!
)

# 2. Prepare Data for Caret (Needs "Yes/No" factor for classification)
cv_data <- train_set %>%
  mutate(high_risk_fac = factor(if_else(high_risk == 1, "High", "Low"), 
                                levels = c("High", "Low"))) %>%
  drop_na()

# 3. Re-train Model 5 using Spatial CV
# Note: We use the exact same formula as M5
model_spatial_cv <- train(
  high_risk_fac ~ lag_mean_monthly + 
                  pov_rate + pct_black + renter_share +
                  hist_viol_per_1k + hotspot_filings + housing311_per_1k +
                  dist_center_km, # Removed 'neighborhood_cluster' from predictors because it's the split variable!
  data = cv_data,
  method = "glm",
  family = "binomial",
  trControl = fit_control,
  metric = "ROC"
)

# 4. Print Results
print(model_spatial_cv)
```

Comparing this to our temporal test result (AUC \~0.969), the drop to 0.91 is expected and healthy. It indicates that while the model relies partly on local history ('Inertia'), it still retains excellent predictive power (0.91) even when blindfolded to specific neighborhood locations. This confirms the model has learned structural rules about eviction risk.

# 5. Prediction for 2026 (The Future)

We use the "Full Integrated Model" (M5) to forecast risk. The input for this prediction is the data from the most recent period (Nov 2024 – Nov 2025).

```{r}
#  1. Prepare the Input Data for 2026 
# Logic: To predict 2026, the "lagged history" is the actual performance of 2025.

future_input <- training_data %>%
  filter(period_year == 2025) %>%   # Select the latest available data
  select(
    GEOID, 
    # CRITICAL STEP: The 2025 average becomes the "lag" for 2026
    lag_mean_monthly = mean_monthly, 
    
    # Keep all other static/structural predictors
    pov_rate, pct_black, renter_share, med_rent,
    hist_viol_per_1k, hotspot_filings, housing311_per_1k,
    dist_center_km, neighborhood_cluster, dist_hotspot_km
  ) %>%
  drop_na()

#  2. Generate Predictions 
# We use Model 5 (m5) to predict probabilities (0 to 1)
# type = "response" output probabilities
future_input$pred_prob <- predict(m5, newdata = future_input, type = "response")

#  3. Classify Risk 
# We interpret > 50% probability as "High Risk"
future_input <- future_input %>%
  mutate(
    pred_risk_label = if_else(pred_prob >= 0.5, "High Risk", "Low Risk")
  )

#  4. Join Predictions back to Map Geometry 
forecast_map <- philly_tracts %>%
  left_join(future_input, by = "GEOID") %>%
  filter(!is.na(pred_prob))
```

## 5.1 Visualize the Forecast (Probability Map)

```{r}
ggplot(forecast_map) +
  geom_sf(aes(fill = pred_prob), color = NA) +
  scale_fill_viridis_c(
    option = "rocket", 
    direction = -1, 
    name = "Risk Probability",
    labels = percent
  ) +
  labs(
    title = "Forecast: Eviction Risk Probability (Nov 2025 – Nov 2026)",
    subtitle = "Predicted by Model 5 (Full Integrated Model)\nDarker/Red areas indicate >80% probability of high eviction activity",
    caption = "Source: Philadelphia Eviction Data"
  ) +
  mapTheme +
  theme(legend.position = "right")
```

High-Intensity Clusters: The darkest areas (probabilities \> 75%) are tightly clustered in specific corridors of North and West Philadelphia. This indicates that the model is extremely confident that these specific neighborhoods will experience severe eviction pressure in 2026.

## 5.2.1 Visualize the Forecast (Binary Risk Map)

We use three distinct cutoffs: 

- Aggressive (\> 0.3): This casts a wide net to catch as many potential eviction cases as possible, accepting a higher rate of false alarms. 

- Balanced (\> 0.5): This uses the standard statistical default. 

- Conservative (\> 0.8): This targets only the 'sure bets' to minimize wasted resources.

```{r}
# --- 1. Create Multiple Scenarios ---
future_scenarios <- future_input %>%
  mutate(
    # Strategy A: High Recall (Catch almost all potential risks, but more false alarms)
    `Strategy: > 0.3 (Aggressive)` = if_else(pred_prob > 0.3, "High Risk", "Low Risk"),
    
    # Strategy B: Balanced (The statistical default)
    `Strategy: > 0.5 (Balanced)`   = if_else(pred_prob > 0.5, "High Risk", "Low Risk"),
    
    # Strategy C: High Precision (Target only the absolute worst hotspots)
    `Strategy: > 0.8 (Conservative)` = if_else(pred_prob > 0.8, "High Risk", "Low Risk")
  ) %>%
  # Convert wide format to long format for plotting
  pivot_longer(
    cols = starts_with("Strategy"),
    names_to = "Strategy_Name",
    values_to = "Risk_Label"
  )

# --- 2. Join to Geometry ---
# Note: This will triple the number of rows (one geometry per strategy per tract)
scenario_map <- philly_tracts %>%
  left_join(future_scenarios, by = "GEOID") %>%
  filter(!is.na(Risk_Label))

# --- 3. Plot Side-by-Side Maps ---
ggplot(scenario_map) +
  geom_sf(aes(fill = Risk_Label), color = "white", lwd = 0.01) +
  scale_fill_manual(
    values = c("High Risk" = "#d73027", "Low Risk" = "#4575b4"),
    name = "Risk Class"
  ) +
  # This creates the 3 panels
  facet_wrap(~ Strategy_Name, ncol = 3) + 
  labs(
    title = "Policy Trade-offs: Comparing Intervention Thresholds",
    subtitle = "How does the definition of 'High Risk' change our target areas for 2026?",
    caption = "Aggressive (>0.3) = Max Coverage | Balanced (>0.5) = Standard | Conservative (>0.8) = Max Precision"
  ) +
  mapTheme +
  theme(
    legend.position = "bottom",
    strip.text = element_text(size = 11, face = "bold") # Make panel titles larger
  )
```

**Recommendation**: Given the looming 2026 funding cuts, we recommend this 'High Precision' strategy for allocating expensive resources (like direct rental assistance), while reserving the 'Aggressive' strategy for low-cost interventions (like information campaigns).

# 6 Summary of Findings

## 6.1 Conclusion

```{r}
# Calculate summary statistics for the report text
n_high_risk <- sum(future_input$pred_risk_label == "High Risk")
pct_high_risk <- round(mean(future_input$pred_risk_label == "High Risk") * 100, 1)

# Identify the top 5 highest risk tracts
top_risk_tracts <- future_input %>%
  arrange(desc(pred_prob)) %>%
  head(5) %>%
  select(GEOID, pred_prob, lag_mean_monthly, pov_rate)

print(paste("Total High Risk Tracts Predicted:", n_high_risk))
print(paste("Percentage of City:", pct_high_risk, "%"))
print(top_risk_tracts)
```

Our Model 5 forecast projects that 71 census tracts will be classified as 'High Risk' in the upcoming year. These 71 tracts represent exactly 20.2% of the city's neighborhoods. This finding is remarkably consistent with the Pareto Principle (80/20 rule) we observed in the historical data, confirming that the 2026 crisis will likely remain highly concentrated rather than spreading evenly across the city. For these specific neighborhoods, the model is effectively finding 'near certainty' of high eviction activity. These areas should be the immediate first stop for any 'High Precision' legal aid deployment, as they represent the absolute epicenter of displacement risk.

## 6.2 Limitations

1.  **Reporting Bias in 311 data**: Our model uses 311 complaints as a proxy for housing distress.However, the tenants in the most vulnerable neighborhoods may be less likely to report violations than residents in gentrifying areas.For example, they fear landlord retaliations. So our model may paradoxically interpret a lack of complaints as "stability" in areas that are actually suffering from silent, unreported neglect.

2.  **Census-Tract Scale, Not Individual Properties**: Our predictions operate at the Census Tract level. High-risk classification for a tract does not mean every property in that tract is at risk. So interventions should be verified at the individual property level to avoid wasting resources on stable buildings within distressed areas.

## 6.3 Recommendations

1. For small set of extreme hotspots, we could operate direct rent relief, full legal representation, intensive case management. Also, we need to verify needs at the property/household level with local partners before committing these high-cost resources.

2. For wider ring of at-risk neighborhoods around the hotspots,  we can operate low-cost interventions – flyers, text alerts, know-your-rights workshops, landlord engagement. Also we can use this tier to reach areas where 311 distress may be under-reported, and monitor for emerging hotspots to feed back into the model.