The Philadelphia Eviction Early Warning System

A Predictive Model for Proactive Resource Allocation

Authors

Angel Rutherford

Ixchel Ramirez

Tess Vu

Published

December 8, 2025

EXECUTIVE SUMMARY

Eviction is both a cause and consequence of poverty that destabilizes entire neighborhoods. Currently, city responses to eviction are reactive, with resources like legal aid and rental assistance deployed after filing volumes become a crisis. This project develops a Real-Time Operational Tool for the Philadelphia Office of Homeless Services and the Fair Housing Commission. By shifting from reactive to predictive analysis, we enable the city to allocate limited staff to specific census tracts predicted to experience elevated eviction filings in the coming month.

Our Negative Binomial regression model leverages temporal momentum, spatial spillover effects, policy intervention effects, property tax delinquency stress, and American Community Survey socioeconomic indicators to forecast monthly eviction filing counts at the census tract level.

The model demonstrates strong performance with meaningful improvement, and sets the foundation for building up to a practical and usable tool down the line. Using a robust temporal validation strategy (training through 2023, testing on 2024-2025), the model generalizes well to future periods without overfitting. Also stark racial disparities in eviction burden was identified, with Black-majority tracts accounting for disproportionate shares of filings. These findings emphasize the need for equity-centered implementation safeguards to prevent perpetuating existing disparities through algorithmic resource allocation.

I. EXPLORATORY DATA ANALYSIS (EDA)

1. SETUP AND LIBRARY LOADING

All required R packages were loaded at the start to streamline data cleaning, visualization, spatial analysis, and modeling.

Code
library(lubridate)
library(sf)
library(broom)
library(scales)
library(patchwork)
library(viridis)
library(kableExtra)
library(corrplot)
library(zoo)
library(MASS)
library(car)
library(caret)
library(pROC)
library(spdep)
library(tidycensus)
library(httr)
library(jsonlite)
library(glue)
library(stringr)
library(tidyr)
library(tidyverse)
library(dplyr)
library(knitr)

# Override masking.
select <- dplyr::select
filter <- dplyr::filter

# Disable scientific notation.
options(scipen = 999)

# Save figures.
#opts_chunk$set(fig.path = "final_figures/", dpi = 300)

Two visualization techniques were employed to provide clear and effective evidence of our findings. First, a consistent plotting theme was established to ensure readability and uniformity across all figures. Second, custom color palettes were applied to highlight key analytical distinctions, specifically racial categories and policy periods. Racial differences in eviction filings are central to the Eviction Lab’s dataset and visualizations and were incorporated to acknowledge inequitable outcomes. In addition, the periods before, during, and after the moratorium were visually differentiated to measure the impact of policy interventions on eviction patterns.

Code
# Custom color palette for policy periods.
period_colors <- c(
  "Pre-Moratorium" = "#3498DB",
  "Moratorium" = "#27AE60",
  "Post-Moratorium" = "#E67E22"
)

# Custom color palette for racial categories.
racial_colors <- c(
  "Black" = "#E74C3C",
  "White" = "#3498DB",
  "Hispanic" = "#F39C12",
  "Other" = "#9B59B6"
)

2. DATA LOADING AND INITIAL EXPLORATION

Monthly Tract-Level Eviction Data

The data was accessed and downloaded from the Eviction Lab, an online database that compiles nationwide eviction filings from court records. For states and major cities, the Eviction Lab provides eviction datasets at two temporal resolutions: weekly and monthly. Both scales were initially examined for their distinct analytical value: weekly data highlight short‑term spikes and localized patterns, while monthly data reveal longer‑term trends and align more closely with policy implementation, as months serve as the standard catchment period for resource allocation and housing interventions.

A summary of the dataset’s dimensions and preview of the first few rows shows that the monthly eviction data provides space-time insights into filing patterns. The dataset is structured into eight columns:

Code
# Load monthly eviction filings data at census tract level.
df_monthly_raw <- read.csv("data/eviction/philadelphia_monthly_2020_2021.csv")

# Display initial data dimensions.
cat("MONTHLY EVICTION DATA DIMENSIONS\n")
MONTHLY EVICTION DATA DIMENSIONS
Code
cat(sprintf("Rows: %s\n", comma(nrow(df_monthly_raw))))
Rows: 29,039
Code
cat(sprintf("Columns: %d\n", ncol(df_monthly_raw)))
Columns: 8
Code
cat(sprintf("Variables: %s\n", paste(names(df_monthly_raw), collapse = ", ")))
Variables: type, GEOID, racial_majority, month, filings_2020, filings_avg, filings_avg_prepandemic_baseline, last_updated
Code
# Display first rows to understand data structure.
head(df_monthly_raw, 10) %>%
  kable(caption = "Sample of Raw Monthly Eviction Filing Data",
    col.names = c("Type", "GEOID", "Racial  Majority", "Month", "Filings", "Average Filings", "Pre-Pandemic Average Filings", "Last Updated")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Sample of Raw Monthly Eviction Filing Data
Type GEOID Racial Majority Month Filings Average Filings Pre-Pandemic Average Filings Last Updated
Census Tract 42101000101 White 01/2020 0 2.5 0.75
Census Tract 42101000101 White 02/2020 0 1.5 1.00
Census Tract 42101000101 White 03/2020 0 1.5 1.00
Census Tract 42101000101 White 04/2020 0 1.0 1.75
Census Tract 42101000101 White 05/2020 0 2.0 1.75
Census Tract 42101000101 White 06/2020 0 1.0 2.00
Census Tract 42101000101 White 07/2020 0 2.5 0.50
Census Tract 42101000101 White 08/2020 1 1.5 0.75
Census Tract 42101000101 White 09/2020 2 0.0 3.50
Census Tract 42101000101 White 10/2020 2 1.5 1.50

Weekly Tract-Level Eviction Data

The weekly dataset mirrors the structure of the monthly data set but differs as it contains more rows of data due to its finer granularity and adds additional columns that specific the week relative to the dataset and the calendar date the data was recorded.

Code
# Load weekly eviction filings data for high-frequency analysis.
df_weekly_raw <- read.csv("data/eviction/philadelphia_weekly_2020_2021.csv")

# Display weekly data dimensions.
cat("WEEKLY EVICTION DATA DIMENSIONS\n")
WEEKLY EVICTION DATA DIMENSIONS
Code
cat(sprintf("Rows: %s\n", comma(nrow(df_weekly_raw))))
Rows: 125,563
Code
cat(sprintf("Columns: %d\n", ncol(df_weekly_raw)))
Columns: 9
Code
# Display sample of weekly data.
head(df_weekly_raw, 10) %>%
  kable(caption = "Sample of Raw Weekly Eviction Filing Data",
    col.names = c("Type", "GEOID", "Racial  Majority", "Week", "Date", "Filings", "Average Filings", "Pre-Pandemic Average Filings", "Last Updated")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Sample of Raw Weekly Eviction Filing Data
Type GEOID Racial Majority Week Date Filings Average Filings Pre-Pandemic Average Filings Last Updated
Census Tract 42101000101 White 1 2019-12-29 0 0.5 0.00 2025-11-15
Census Tract 42101000101 White 2 2020-01-05 0 0.5 0.25 2025-11-15
Census Tract 42101000101 White 3 2020-01-12 0 0.5 0.25 2025-11-15
Census Tract 42101000101 White 4 2020-01-19 0 1.0 0.00 2025-11-15
Census Tract 42101000101 White 5 2020-01-26 0 0.0 0.25 2025-11-15
Census Tract 42101000101 White 6 2020-02-02 0 0.0 0.25 2025-11-15
Census Tract 42101000101 White 7 2020-02-09 0 0.5 0.50 2025-11-15
Census Tract 42101000101 White 8 2020-02-16 0 0.0 0.00 2025-11-15
Census Tract 42101000101 White 9 2020-02-23 0 1.0 0.25 2025-11-15
Census Tract 42101000101 White 10 2020-03-01 0 0.0 0.25 2025-11-15

Claims Data

The Eviction Lab provides tract‑level data that summarizes historical and contemporary monthly claims statistics, where claims represent the financial compensation sought by landlords in eviction proceedings. In addition to reporting the median claim amount, the dataset includes thresholds that capture the relative burden these claims pose to renters, offering a spectrum of financial severity across space and time. While these measures are not the primary variables used in this analysis, they provide valuable context for understanding eviction dynamics. Exploratory review suggests that approximately 3% of claims fall below $1,000, which may reflect unpaid fees rather than market‑rate rent, though in some cases it could indicate subsidized public housing rents. Around 6% of claims align with median market rents, while 14% represent unpaid rent for at least six months. These timelines highlight the varied circumstances underlying eviction filings, including the delays introduced by lengthy legal processes. Together, the dataset’s structured measures and exploratory insights illustrate both the quantitative severity of claims and the qualitative complexity of eviction proceedings.

Code
# Load aggregated monthly claims data showing financial severity.
df_claims_raw <- read.csv("data/eviction/philadelphia_claims_monthly.csv")

# Display claims data dimensions.
cat("CLAIMS DATA DIMENSIONS\n")
CLAIMS DATA DIMENSIONS
Code
cat(sprintf("Rows: %s\n", comma(nrow(df_claims_raw))))
Rows: 70
Code
cat(sprintf("Columns: %d\n", ncol(df_claims_raw)))
Columns: 6
Code
# Display claims data structure.
head(df_claims_raw, 10) %>%
  kable(digits = 2, caption = "Sample of Monthly Claims Severity Data",
    col.names = c("Date", "Median Claim", "Below $1,000", "Below Median Rent", "Over Six Months Rent", "Baseline Median Claim")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Sample of Monthly Claims Severity Data
Date Median Claim Below $1,000 Below Median Rent Over Six Months Rent Baseline Median Claim
2020-01-01 1958.00 0.03 0.06 0.14 4200
2020-02-01 1944.00 0.03 0.06 0.14 4200
2020-03-01 2050.00 0.03 0.06 0.14 4200
2020-04-01 NA NA NA NA 4200
2020-05-01 NA NA NA NA 4200
2020-06-01 NA NA NA NA 4200
2020-07-01 3975.00 0.03 0.06 0.14 4200
2020-08-01 3285.31 0.03 0.06 0.14 4200
2020-09-01 3639.32 0.03 0.06 0.14 4200
2020-10-01 3422.50 0.03 0.06 0.14 4200

This is not used in the dataset, but for exploration behind the scenes of evictions. It looks like 3% of claims are for below $1,000, which could reflect unpaid fees rather than market rate rent. However, it could reflect rent in subsidized public housing. 6% is for the median market rate rents. 14% is for rent unpaid for at least 6 months. These timelines could have varying reasons, especially due to the time-consuming legal aspect that could delay filings.

Real Estate Tax Balances

In our analysis, tax delinquency is treated as a proxy for landlord financial stress; theoretically properties with outstanding tax balances could pressure landlords into pursuing evictions as a means of recovering income or offsetting financial strain. Thus, property tax assessments were explored to determine how financial stress on landlords may shape eviction filings.

Code
# Load real estate tax balances aggregated by census tract.
df_tax_raw <- read.csv("data/eviction/real_estate_tax_balances_census_tract.csv")

# Display tax data dimensions.
cat("TAX DELINQUENCY DATA DIMENSIONS\n")
TAX DELINQUENCY DATA DIMENSIONS
Code
cat(sprintf("Rows: %s\n", comma(nrow(df_tax_raw))))
Rows: 402
Code
cat(sprintf("Columns: %d\n", ncol(df_tax_raw)))
Columns: 11
Code
cat(sprintf("Variables: %s\n", paste(names(df_tax_raw), collapse = ", ")))
Variables: objectid, census_tract, num_props, min_period, max_period, principal, interest, penalty, other, balance, avg_balance
Code
# Display sample of tax delinquency data.
head(df_tax_raw, 10) %>%
  kable(digits = 2, caption = "Sample of Real Estate Tax Balances by Census Tract",
    col.names = c("Object ID", "Census Tract", "Properties", "Minimum Period", "Maximum Period", "Principal", "Interest", "Penalty", "Other", "Balance", "Average Balance")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Sample of Real Estate Tax Balances by Census Tract
Object ID Census Tract Properties Minimum Period Maximum Period Principal Interest Penalty Other Balance Average Balance
1 42101000101 74 1985 2025 1064210.8 155230.30 29017.78 32204.00 1280662.9 17306.26
2 42101000102 144 1978 2025 902561.4 162841.03 15843.21 65842.76 1147088.4 7965.89
3 42101000200 62 2007 2025 1178299.6 158347.04 52879.01 54369.75 1443895.4 23288.64
4 42101000300 38 1982 2025 1160409.9 204199.86 53290.48 10210.36 1428110.6 37581.86
5 42101000401 110 1986 2025 875019.7 101247.26 26525.82 14006.92 1016799.7 9243.63
6 42101000403 23 2017 2025 324383.0 47260.28 8921.37 20065.11 400629.8 17418.69
7 42101000404 12 2020 2025 187919.8 29479.15 4030.51 2819.58 224249.1 18687.42
8 42101000500 19 1978 2025 323535.4 76752.47 4661.40 14405.20 419354.5 22071.29
9 42101000600 23 2020 2025 552582.2 76400.67 25171.38 20471.78 674626.1 29331.57
10 42101000701 29 2022 2025 193651.5 21334.13 3937.60 6824.62 225747.8 7784.41

Philadelphia Census Tract Geometry

Since all datasets in this analysis are aggregated at the census tract level, a spatial file containing tract boundaries was loaded to access the geographic and geometric attributes of each tract. The inclusion of this file allows for the integration of eviction, claim, property, and other datasets with physical locations for mapping and spatial analysis.

Code
# Load Pennsylvania census tracts shapefile.
pa_tracts_sf <- st_read("data/pa_tracts/pa_tracts.shp", quiet = TRUE)

# Filter to Philadelphia County only using FIPS code 42101.
philly_tracts_sf <- pa_tracts_sf %>%
  filter(str_starts(GEOID, "42101"))

# Display spatial data summary.
cat("PHILADELPHIA CENSUS TRACT GEOMETRY\n")
PHILADELPHIA CENSUS TRACT GEOMETRY
Code
cat(sprintf("Total Philadelphia Tracts: %d\n", nrow(philly_tracts_sf)))
Total Philadelphia Tracts: 408
Code
cat(sprintf("Coordinate Reference System: %s\n", st_crs(philly_tracts_sf)$input))
Coordinate Reference System: NAD83

3. DATA CLEANING AND FEATURE ENGINEERING

Certain data cleaning and manipulation techniques were applied across all datasets to establish structural uniformity that would streamline joinability with other datasets. These techniques include removing invalid records, turning geographic identifiers into character strings to ensure they aren’t treated as numeric values, and formatting dates. Each dataset was also cleaned and manipulated based on their respective qualities and potential insights.

Monthly Eviction Data Preparation

To prepare the monthly eviction data for analysis, raw filings were cleaned and enriched with theoretically motivated features. A categorical variable was created to distinguish pre‑moratorium, moratorium, and post‑moratorium periods, alongside a binary flag for months when the moratorium was active. Months were categorized into seasonal indicators as seasonal patterns in filings can be seen in the figures provided on the Eviction Lab’s website.

Code
# Clean and prepare monthly data with temporal and policy features.
df_monthly <- df_monthly_raw %>%
  # Rename filings variable for clarity.
  rename(filings_count = filings_2020) %>%
  # Parse month string to proper date format.
  mutate(
    date = as.Date(paste0("01/", month), format = "%d/%m/%Y"),
    year = year(date),
    month_num = month(date),
    month_name = month(date, label = TRUE, abbr = FALSE)
  ) %>%
  # Filter out rows with invalid dates.
  filter(!is.na(date)) %>%
  # Convert GEOID to character for joining operations.
  mutate(GEOID = as.character(GEOID)) %>%
  # Create policy intervention indicator for eviction moratorium period.
  mutate(
    moratorium_active = ifelse(
      date >= as.Date("2020-03-01") & date <= as.Date("2021-09-30"),
      1,
      0
    ),
    # Create categorical period variable.
    period = case_when(
      date < as.Date("2020-03-01") ~ "Pre-Moratorium",
      date >= as.Date("2020-03-01") & date <= as.Date("2021-09-30") ~ "Moratorium",
      date > as.Date("2021-09-30") ~ "Post-Moratorium"
    ),
    period = factor(period, levels = c("Pre-Moratorium", "Moratorium", "Post-Moratorium"))
  ) %>%
  # Create seasonal indicators for seasonality analysis.
  mutate(
    season = case_when(
      month_num %in% c(12, 1, 2) ~ "Winter",
      month_num %in% c(3, 4, 5) ~ "Spring",
      month_num %in% c(6, 7, 8) ~ "Summer",
      month_num %in% c(9, 10, 11) ~ "Fall"
    ),
    season = factor(season, levels = c("Winter", "Spring", "Summer", "Fall"))
  ) %>%
  # Create post-moratorium ramp variable to capture rebound effect.
  mutate(
    post_moratorium_months = ifelse(
      date > as.Date("2021-09-30"),
      as.numeric(difftime(date, as.Date("2021-09-30"), units = "days")) / 30,
      0
    )
  )

# Display summary of cleaned monthly data.
cat("CLEANED MONTHLY DATA SUMMARY\n")
CLEANED MONTHLY DATA SUMMARY
Code
summary(df_monthly %>% select(filings_count, date, year, moratorium_active))
 filings_count          date                 year      moratorium_active
 Min.   :  0.000   Min.   :2020-01-01   Min.   :2020   Min.   :0.0000   
 1st Qu.:  0.000   1st Qu.:2021-06-01   1st Qu.:2021   1st Qu.:0.0000   
 Median :  1.000   Median :2022-12-01   Median :2022   Median :0.0000   
 Mean   :  2.286   Mean   :2022-12-01   Mean   :2022   Mean   :0.2676   
 3rd Qu.:  3.000   3rd Qu.:2024-06-01   3rd Qu.:2024   3rd Qu.:1.0000   
 Max.   :694.000   Max.   :2025-11-01   Max.   :2025   Max.   :1.0000   

To verify the scope of the cleaned dataset, a summary table was generated showing the earliest and latest months covered, the total number of distinct months, the number of census tracts represented, and the overall number of observations. This check confirms that the dataset ranges from January 2020 to November 2025.

Code
# Check temporal coverage of cleaned data.
df_monthly %>%
  summarize(
    min_date = min(date),
    max_date = max(date),
    n_months = n_distinct(date),
    n_tracts = n_distinct(GEOID),
    total_obs = n()
  ) %>%
  kable(caption = "Monthly Data Temporal Coverage",
        col.names = c("Minimum Date", "Maximum Date", "Months", "Tracts", "Observations")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Monthly Data Temporal Coverage
Minimum Date Maximum Date Months Tracts Observations
2020-01-01 2025-11-01 71 409 29039

Weekly Data Preparation

Weekly data was cleaned and enriched in parallel with the monthly dataset, but adapted to reflect the finer weekly temporal scale. A summary table of temporal coverage reveals that the dataset spans from the final days of December 2019 through the second week of November 2025, offering a more nuanced timeline down to the exact day.

Code
# Clean and prepare weekly data for finer temporal analysis.
df_weekly <- df_weekly_raw %>%
  # Rename filings variable for consistency.
  rename(filings_count = filings_2020) %>%
  # Convert week_date to proper date format.
  mutate(
    date = as.Date(week_date),
    year = year(date),
    month_num = month(date),
    week_of_year = week(date)
  ) %>%
  # Filter out rows with invalid dates.
  filter(!is.na(date)) %>%
  # Convert GEOID to character for consistency.
  mutate(GEOID = as.character(GEOID)) %>%
  # Create moratorium indicator.
  mutate(
    moratorium_active = ifelse(
      date >= as.Date("2020-03-01") & date <= as.Date("2021-09-30"),
      1,
      0
    )
  )

# Display summary of cleaned monthly data.
cat("CLEANED WEEKLY DATA SUMMARY\n")
CLEANED WEEKLY DATA SUMMARY
Code
summary(df_weekly %>% select(filings_count, date, year, moratorium_active))
 filings_count           date                 year      moratorium_active
 Min.   :  0.0000   Min.   :2019-12-29   Min.   :2019   Min.   :0.0000   
 1st Qu.:  0.0000   1st Qu.:2021-06-13   1st Qu.:2021   1st Qu.:0.0000   
 Median :  0.0000   Median :2022-12-04   Median :2022   Median :0.0000   
 Mean   :  0.5286   Mean   :2022-12-04   Mean   :2022   Mean   :0.2704   
 3rd Qu.:  1.0000   3rd Qu.:2024-05-26   3rd Qu.:2024   3rd Qu.:1.0000   
 Max.   :260.0000   Max.   :2025-11-09   Max.   :2025   Max.   :1.0000   
Code
# Check weekly temporal coverage.
df_weekly %>%
  summarize(
    min_date = min(date),
    max_date = max(date),
    n_weeks = n_distinct(date),
    n_tracts = n_distinct(GEOID),
    total_obs = n()
  ) %>%
  kable(caption = "Weekly Data Temporal Coverage",
        col.names = c("Minimum Date", "Maximum Date", "Weeks", "Tracts", "Observations")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Weekly Data Temporal Coverage
Minimum Date Maximum Date Weeks Tracts Observations
2019-12-29 2025-11-09 307 409 125563

Claims Data Preparation

Monthly eviction claims data was mutated to include a new column, claims ratio, that was calculated to represent median claim values relative to pre‑pandemic baselines.

Code
# Clean claims data showing financial severity of evictions.
df_claims <- df_claims_raw %>%
  # Convert month_date to proper date format.
  mutate(
    date = as.Date(month_date),
    year = year(date),
    month_num = month(date)
  ) %>%
  # Filter valid dates.
  filter(!is.na(date)) %>%
  # Calculate ratio to pre-pandemic baseline.
  mutate(claim_ratio = median_claim / median_claim_baseline)

# Display claims summary stats.
cat("CLAIMS DATA SUMMARY\n")
CLAIMS DATA SUMMARY
Code
summary(df_claims %>% select(median_claim, median_claim_baseline, claim_ratio))
  median_claim  median_claim_baseline  claim_ratio    
 Min.   :1944   Min.   :4200          Min.   :0.4629  
 1st Qu.:4088   1st Qu.:4200          1st Qu.:0.9733  
 Median :4236   Median :4200          Median :1.0085  
 Mean   :4346   Mean   :4200          Mean   :1.0348  
 3rd Qu.:4724   3rd Qu.:4200          3rd Qu.:1.1247  
 Max.   :6128   Max.   :4200          Max.   :1.4591  
 NA's   :3                            NA's   :3       

Tax Delinquency Data Preparation

We defined delinquency in the property tax dataset as properties that received penalties rather than properties that just have overdue balances. Due to the marginal frequency observed in the dataset’s summary statistics (only 8 properties identified as overdue with no penalties), overdue properties were removed. For each tract, counts of delinquent properties, total balances, and average balances per property were calculated, along with log transformations of balances and delinquency age in years. Summary statistics were generated to report mean and median property counts, balances, and average balances across tracts, providing a tract‑level profile of tax delinquency severity.

Code
# Clean and prepare tax data with delinquency separation.
df_tax <- df_tax_raw %>%
  mutate(
    census_tract = as.character(census_tract)
  ) %>%
  filter(!str_detect(census_tract, "Other/Unidentified")) %>%
  mutate(
    # True delinquency indicators, based on penalty being on previous year's taxes.
    has_penalty = penalty > 0,
    delinquent_prop_count = ifelse(has_penalty, num_props, 0),
    delinquent_balance = ifelse(has_penalty, balance, 0),
    avg_delinquent_balance = ifelse(has_penalty, balance / num_props, 0),
    
    # Overdue-only indicators, based on it being the current year.
    overdue_only_props = ifelse(!has_penalty, num_props, 0),
    overdue_only_balance = ifelse(!has_penalty, balance, 0),
    
    # Transformations for modeling.
    log_delinquent_balance = log1p(delinquent_balance),
    log_overdue_balance = log1p(overdue_only_balance)
  ) %>%
  rename(GEOID = census_tract)

# Display tax data summary.
cat("TAX DELINQUENCY DATA SUMMARY\n")
TAX DELINQUENCY DATA SUMMARY
Code
cat(sprintf("Tracts with Tax Data: %d\n", nrow(df_tax)))
Tracts with Tax Data: 401
Code
cat(sprintf("Total Properties with Any Tax Balance: %s\n", comma(sum(df_tax$num_props))))
Total Properties with Any Tax Balance: 84,943
Code
cat(sprintf("Delinquent Properties (Penalty > 0): %s (%.1f%%)\n", 
            comma(sum(df_tax$delinquent_prop_count)),
            sum(df_tax$delinquent_prop_count) / sum(df_tax$num_props) * 100))
Delinquent Properties (Penalty > 0): 84,935 (100.0%)
Code
cat(sprintf("Overdue Properties (No Penalty): %s (%.1f%%)\n",
            comma(sum(df_tax$overdue_only_props)),
            sum(df_tax$overdue_only_props) / sum(df_tax$num_props) * 100))
Overdue Properties (No Penalty): 8 (0.0%)
Code
cat(sprintf("Total Delinquent Balance: $%s\n", comma(sum(df_tax$delinquent_balance))))
Total Delinquent Balance: $508,908,377
Code
cat(sprintf("Total Overdue-Only Balance: $%s\n", comma(sum(df_tax$overdue_only_balance))))
Total Overdue-Only Balance: $55,963
Code
# 8 overdue properties is marginal, so remove.
df_tax <- df_tax_raw %>%
  mutate(
    census_tract = as.character(census_tract)
  ) %>%
  filter(!str_detect(census_tract, "Other/Unidentified")) %>%
  filter(penalty > 0) %>%
  mutate(
    delinquent_prop_count = num_props,
    delinquent_balance = balance,
    avg_delinquent_balance_per_prop = balance / num_props,
    log_delinquent_balance = log1p(balance),
    delinquency_age_years = 2025 - min_period
  ) %>%
  rename(GEOID = census_tract)

# Recalculate average.
df_tax <- df_tax %>%
  mutate(avg_balance = balance / num_props)
Code
# Calculate summary stats for tax delinquency.
tax_summary <- df_tax %>%
  summarize(
    mean_props = mean(num_props),
    median_props = median(num_props),
    mean_balance = mean(balance),
    median_balance = median(balance),
    mean_avg_balance = mean(avg_balance),
    median_avg_balance = median(avg_balance)
  )

tax_summary %>%
  pivot_longer(everything(), names_to = "statistic", values_to = "value") %>%
  mutate(value = ifelse(str_detect(statistic, "balance"), 
                        paste0("$", comma(round(value, 2))), 
                        comma(round(value, 2)))) %>%
  kable(caption = "Tax Delinquency Summary Statistics",
        col.names = c("Statistic", "Value")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Tax Delinquency Summary Statistics
Statistic Value
mean_props 214
median_props 143
mean_balance $1,285,122
median_balance $1,013,871
mean_avg_balance $8,730
median_avg_balance $5,892

Sealed Tract Assessment

Monthly eviction filings were assessed for sealed census tracts, where GEOIDs are suppressed for privacy protection. These tracts accounted for a small share of observations but exhibited unusually high filing counts. Because sealed tracts cannot be geographically mapped, they were excluded from spatial modeling to preserve tract‑level specificity. While this exclusion reduces the ability to design geographically actionable policies and outreach, the sealed tracts nonetheless represent legitimate evictions and highlight systematic vulnerabilities that cannot be directly addressed in this analysis.

Code
# Identify sealed census tracts where GEOID is hidden for privacy protection.
# Reduces geographic specificity for modeling.
sealed_count <- df_monthly %>%
  filter(GEOID == "sealed" | is.na(GEOID) | GEOID == "") %>%
  nrow()

sealed_pct <- sealed_count / nrow(df_monthly) * 100

cat("SEALED TRACT ASSESSMENT\n")
SEALED TRACT ASSESSMENT
Code
cat(sprintf("Sealed or Missing Tract Observations: %s (%.1f%%)\n", comma(sealed_count), sealed_pct))
Sealed or Missing Tract Observations: 71 (0.2%)
Code
# Analyze sealed tracts before removal.
sealed_tracts_analysis <- df_monthly %>%
  filter(GEOID == "sealed" | is.na(GEOID) | GEOID == "") %>%
  summarize(
    n_obs = n(),
    n_unique_dates = n_distinct(date),
    total_filings = sum(filings_count, na.rm = TRUE),
    mean_filings = mean(filings_count, na.rm = TRUE),
    median_filings = median(filings_count, na.rm = TRUE),
    min_filings = min(filings_count, na.rm = TRUE),
    max_filings = max(filings_count, na.rm = TRUE),
    sd_filings = sd(filings_count, na.rm = TRUE),
    zero_pct = sum(filings_count == 0) / n() * 100
  )

cat(sprintf("Total Observations: %s\n", comma(sealed_tracts_analysis$n_obs)))
Total Observations: 71
Code
cat(sprintf("Months of Data: %d\n", sealed_tracts_analysis$n_unique_dates))
Months of Data: 71
Code
cat(sprintf("Total Filings in Sealed Tracts: %s\n", comma(sealed_tracts_analysis$total_filings)))
Total Filings in Sealed Tracts: 1,043
Code
cat(sprintf("Mean Filings per Tract-Month: %.1f\n", sealed_tracts_analysis$mean_filings))
Mean Filings per Tract-Month: 14.7
Code
cat(sprintf("Median Filings per Tract-Month: %.1f\n", sealed_tracts_analysis$median_filings))
Median Filings per Tract-Month: 1.0
Code
cat(sprintf("Range: %d to %d\n", sealed_tracts_analysis$min_filings, sealed_tracts_analysis$max_filings))
Range: 0 to 694
Code
cat(sprintf("Standard Deviation: %.1f\n", sealed_tracts_analysis$sd_filings))
Standard Deviation: 85.9
Code
cat(sprintf("Zero-Filing Months: %.1f%%\n", sealed_tracts_analysis$zero_pct))
Zero-Filing Months: 43.7%
Code
# Show distribution of sealed tract filings.
sealed_distribution <- df_monthly %>%
  filter(GEOID == "sealed") %>%
  group_by(filings_count) %>%
  summarize(n = n(), .groups = "drop") %>%
  arrange(desc(filings_count))

head(sealed_distribution, 10) %>%
  kable(caption = "Highest Filing Counts in Sealed Tracts",
        col.names = c("Filings", "Frequency")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Highest Filing Counts in Sealed Tracts
Filings Frequency
694 1
221 1
22 1
11 1
7 1
6 2
5 4
4 3
3 5
2 8

The sealed tracts are outliers and are legitimate evictions that reflect the current systematic issues, and could represent vulnerable communities experiencing a recent mass eviction (highest unsealed filings are from 2025). They aren’t included because the tracts cannot be mapped to in the data if they’re sealed, this privacy protection prevents geographically actionable policies and outreach, however, it has to be stressed that this is absolutely not ideal.

4. EXPLORATORY DATA ANALYSIS

Distribution Analysis: Zero-Inflation Assessment

Understanding the distribution of eviction filings is critical for selecting the appropriate modeling approach. Because eviction filings are discrete, non‑continuous count values, they require models such as Poisson and Negative Binomial regression specifically designed to handle count data. A major difference between the two approaches is their ability to handle overdispersion, with the Poisson model assuming equal mean and variance while the Negative Binomial model allows the variance to exceed the mean, making it more flexible for data with excess variability and zeros. Zero-inflation statistics were calculated and visualized to assess the distribution of monthly eviction filings. The summary statistics report reveals substantial zero-inflation and variance that exceeded the mean, yielding a dispersion ratio well over 1, indicating that the negative binomial model as a more appropriate modeling approach for our data.

A histogram of raw counts of evictions further highlights the prevalence of zeros and the data’s skewed distribution, while a log‑transformed histogram provides a clearer view of variation.

Code
# Calculate zero-inflation stats for monthly data.
zero_stats_monthly <- df_monthly %>%
  summarize(
    total_obs = n(),
    zero_count = sum(filings_count == 0),
    zero_pct = zero_count / total_obs * 100,
    positive_count = sum(filings_count > 0),
    mean_all = mean(filings_count),
    mean_positive = mean(filings_count[filings_count > 0]),
    median_all = median(filings_count),
    variance = var(filings_count),
    dispersion_ratio = variance / mean_all
  )

# Display zero-inflation stats.
cat("MONTHLY ZERO-INFLATION STATS\n")
MONTHLY ZERO-INFLATION STATS
Code
cat(sprintf("Total Observations: %s\n", comma(zero_stats_monthly$total_obs)))
Total Observations: 29,039
Code
cat(sprintf("Zero Filings: %s (%.1f%%)\n", 
            comma(zero_stats_monthly$zero_count), 
            zero_stats_monthly$zero_pct))
Zero Filings: 10,752 (37.0%)
Code
cat(sprintf("Positive Filings: %s (%.1f%%)\n", 
            comma(zero_stats_monthly$positive_count), 
            100 - zero_stats_monthly$zero_pct))
Positive Filings: 18,287 (63.0%)
Code
cat(sprintf("Mean (all observations): %.2f\n", zero_stats_monthly$mean_all))
Mean (all observations): 2.29
Code
cat(sprintf("Mean (positive only): %.2f\n", zero_stats_monthly$mean_positive))
Mean (positive only): 3.63
Code
cat(sprintf("Variance: %.2f\n", zero_stats_monthly$variance))
Variance: 28.75
Code
cat(sprintf("Dispersion Ratio (Variance/Mean): %.2f\n", zero_stats_monthly$dispersion_ratio))
Dispersion Ratio (Variance/Mean): 12.58
Code
# Raw distribution.
monthly_plot_raw <- df_monthly %>%
  ggplot(aes(x = filings_count)) +
  geom_histogram(bins = 30, fill = "#E74C3C", color = "white") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Raw Count Distribution",
    x = "Monthly Filings Count",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    axis.title = element_text(size = 10)
  )

# Log-transformed distribution.
monthly_plot_log <- df_monthly %>%
  mutate(log_filings = log10(filings_count + 1)) %>%
  ggplot(aes(x = log_filings)) +
  geom_histogram(bins = 30, fill = "#E67E22", color = "white") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Log-Transformed Distribution",
    x = "log(Monthly Filings + 1)",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    axis.title = element_text(size = 10)
  )

# Display combined plot.
monthly_plot_raw + monthly_plot_log +
  plot_annotation(
    title = "Distribution of Monthly Eviction Filings by Census Tract",
    subtitle = sprintf("%.1f%% Zeros | Dispersion Ratio = %.1f", 
                       zero_stats_monthly$zero_pct,
                       zero_stats_monthly$dispersion_ratio),
    caption = "Data: Philadelphia Eviction Filings 2020-2025 via Eviction Lab",
    theme = theme(
      plot.title = element_text(face = "bold", size = 14),
      plot.subtitle = element_text(size = 11),
      plot.caption = element_text(size = 9)
    )
  )

Weekly Filing Distribution

Weekly eviction filings were also assessed for zero-inflation and overdispersion; the distribution was highly zero-inflated, making even standard Negative Binomial models unsuitable. However, the data structure is compatible with Zero-Inflated Negative Binomial models, which may be explored in future work. In this analysis we chose to proceed with the monthly eviction data to preserve the performance of the Negative Binomial model.

Code
# Calculate zero-inflation stats for weekly data.
zero_stats_weekly <- df_weekly %>%
  summarize(
    total_obs = n(),
    zero_count = sum(filings_count == 0),
    zero_pct = zero_count / total_obs * 100,
    mean_all = mean(filings_count),
    variance = var(filings_count),
    dispersion_ratio = variance / mean_all
  )

cat("WEEKLY ZERO-INFLATION STATS\n")
WEEKLY ZERO-INFLATION STATS
Code
cat(sprintf("Total Observations: %s\n", comma(zero_stats_weekly$total_obs)))
Total Observations: 125,563
Code
cat(sprintf("Zero Filings: %s (%.1f%%)\n", 
            comma(zero_stats_weekly$zero_count), 
            zero_stats_weekly$zero_pct))
Zero Filings: 88,027 (70.1%)
Code
cat(sprintf("Mean: %.2f\n", zero_stats_weekly$mean_all))
Mean: 0.53
Code
cat(sprintf("Dispersion Ratio: %.2f\n", zero_stats_weekly$dispersion_ratio))
Dispersion Ratio: 5.62
Code
# Raw distribution.
plot_weekly_raw <- df_weekly %>%
  ggplot(aes(x = filings_count)) +
  geom_histogram(bins = 30, fill = "#3498DB", color = "white") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Raw Count Distribution",
    x = "Weekly Filings Count",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    axis.title = element_text(size = 10)
  )

# Log-transformed distribution.
plot_weekly_log <- df_weekly %>%
  mutate(log_filings = log10(filings_count + 1)) %>%
  ggplot(aes(x = log_filings)) +
  geom_histogram(bins = 30, fill = "#8E44AD", color = "white") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Log-Transformed Distribution",
    x = "log(Weekly Filings + 1)",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    axis.title = element_text(size = 10)
  )

# Display combined plot.
plot_weekly_raw + plot_weekly_log +
  plot_annotation(
    title = "Distribution of Weekly Eviction Filings by Census Tract",
    subtitle = sprintf("%.1f%% Zeros | Dispersion Ratio = %.1f", 
                       zero_stats_weekly$zero_pct,
                       zero_stats_weekly$dispersion_ratio),
    caption = "Data: Philadelphia Eviction Filings via Eviction Lab",
    theme = theme(plot.title = element_text(face = "bold", size = 14),
                  plot.subtitle = element_text(size = 11))
  )

Tax Delinquency Distribution

The histogram of average delinquent balances per property across census tracts is highly right-skewed, with most tracts showing low balances and a few exhibiting extreme outliers revealing that a small number of tracts carry disproportionately high tax burdens. A histogram of log transformed tax balances normalizes this skew, producing a more symmetric, bell-shaped distribution suitable for statistical modeling.

Code
# Raw distribution.
plot_tax_raw <- df_tax %>%
  ggplot(aes(x = avg_delinquent_balance_per_prop)) +
  geom_histogram(bins = 40, fill = "#9B59B6", color = "white") +
  scale_x_continuous(labels = dollar_format()) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Raw Average Balance",
    x = "Average Delinquent Balance per Property ($)",
    y = "Number of Census Tracts"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

# Log-transformed distribution.
plot_tax_log <- df_tax %>%
  mutate(log_balance = log10(avg_delinquent_balance_per_prop + 1)) %>%
  ggplot(aes(x = log_balance)) +
  geom_histogram(bins = 40, fill = "#3498DB", color = "white") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Log-Transformed",
    x = "log(Average Balance + 1)",
    y = "Number of Census Tracts"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

# Display combined plot.
plot_tax_raw + plot_tax_log +
  plot_annotation(
    title = "Distribution of Average Tax Delinquency Balance by Census Tract",
    caption = "Data: OpenDataPhilly Real Estate Tax Balances",
    theme = theme(plot.title = element_text(face = "bold", size = 14),
                  plot.subtitle = element_text(size = 11))
  )

Quantile Analysis

To further assess the tail-behavior of monthly eviction filings, statistice for quantiles thresholds were calculated. This included the minimum, 1st, 5th, 25th, 50th (median), 75th, 95th, and 99th percentiles, as well as the maximum. These statistics provide a clear view of the heavy right tail of the distribution, complementing the zero‑inflation and over‑dispersion diagnostics as well as indicating the need to create a variable the addresses abnormal spikes.

Code
# Calculate quantiles for positive filings to understand tail behavior.
quantile_stats_monthly <- df_monthly %>%
  filter(filings_count > 0) %>%
  summarize(
    min = min(filings_count),
    q01 = quantile(filings_count, 0.01),
    q05 = quantile(filings_count, 0.05),
    q25 = quantile(filings_count, 0.25),
    median = median(filings_count),
    q75 = quantile(filings_count, 0.75),
    q95 = quantile(filings_count, 0.95),
    q99 = quantile(filings_count, 0.99),
    max = max(filings_count)
  )

# Display quantile distribution.
quantile_stats_monthly %>%
  pivot_longer(everything(), names_to = "quantile", values_to = "filings") %>%
  kable(digits = 1, caption = "Monthly Filing Count Quantiles (Positive Counts Only)",
        col.names = c("Quantile", "Filings")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Monthly Filing Count Quantiles (Positive Counts Only)
Quantile Filings
min 1
q01 1
q05 1
q25 1
median 3
q75 5
q95 10
q99 16
max 694

6. RACIAL DISPARITY ANALYSIS

Filing Burden by Racial Majority

Summary statistics and plots reveal stark disparities across tract racial majorities: 51.9% of filings occurred in Black-majority tracts, despite these tracts representing a smaller share of the city overall. In contrast, White-majority tracts accounted for just 22.1% of filings, though they comprised 36.0% of the comparison group. Hispanic-majority tracts showed near parity, while “Other” tracts had slightly lower filing rates than their representation. These patterns suggest disproportionate eviction exposure in Black communities, underscoring the need for equity-based analysis of our model’s performance.

Code
# Calculate filing statistics by tract racial majority.
racial_stats <- df_monthly %>%
  filter(!is.na(racial_majority), racial_majority != "") %>%
  group_by(racial_majority) %>%
  summarize(
    n_obs = n(),
    n_tracts = n_distinct(GEOID),
    total_filings = sum(filings_count, na.rm = TRUE),
    mean_filings = mean(filings_count, na.rm = TRUE),
    median_filings = median(filings_count, na.rm = TRUE),
    sd_filings = sd(filings_count, na.rm = TRUE),
    zero_pct = sum(filings_count == 0) / n() * 100,
    .groups = "drop"
  ) %>%
  mutate(
    pct_total_filings = total_filings / sum(total_filings) * 100,
    pct_tracts = n_tracts / sum(n_tracts) * 100
  )

# Display racial disparity statistics.
racial_stats %>%
  kable(digits = 2, caption = "Eviction Filing Statistics by Tract Racial Majority",
        col.names = c("Racial Majority", "Observations", "Tracts", "Filings", "Mean Filings", "Median Filings", "Standard Deviation Filings", "% Zero", "% Total Filings", "% Tracts")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Eviction Filing Statistics by Tract Racial Majority
Racial Majority Observations Tracts Filings Mean Filings Median Filings Standard Deviation Filings % Zero % Total Filings % Tracts
Black 10366 146 33875 3.27 2 3.63 22.52 51.85 35.78
Hispanic 1633 23 3805 2.33 2 2.40 23.33 5.82 5.64
Other 6532 92 13201 2.02 1 3.04 43.05 20.21 22.55
White 10437 147 14450 1.38 1 2.81 49.77 22.12 36.03
Code
# Create side-by-side comparison of tract share vs filing share.
racial_comparison <- racial_stats %>%
  select(racial_majority, pct_tracts, pct_total_filings) %>%
  pivot_longer(cols = c(pct_tracts, pct_total_filings),
               names_to = "metric", values_to = "percentage") %>%
  mutate(
    metric = ifelse(metric == "pct_tracts", "Share of Tracts", "Share of Filings")
  )

# Create grouped bar plot showing disparity.
racial_disparity_plot <- racial_comparison %>%
  ggplot(aes(x = racial_majority, y = percentage, fill = metric)) +
  geom_col(position = "dodge", width = 0.7) +
  geom_text(aes(label = sprintf("%.1f%%", percentage)), 
            position = position_dodge(width = 0.7), vjust = -0.5, size = 3.5) +
  scale_fill_manual(values = c("Share of Tracts" = "#3498DB", "Share of Filings" = "#E74C3C")) +
  scale_y_continuous(labels = percent_format(scale = 1), expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Racial Disparity in Eviction Filings",
    x = "Tract Racial Majority",
    y = "Percentage",
    fill = "Metric",
    caption = "Data: Philadelphia Eviction Filings via Eviction Lab"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "bottom"
  )

# Display racial disparity plot.
racial_disparity_plot

7. GEOGRAPHIC ANALYSIS

High-Risk Tract Identification

High‑risk tracts were identified by calculating cumulative eviction filings across the study period to pinpoint neighborhoods with disproportionate concentrations of cases. The analysis shows that the top 20 census tracts alone accounted for 16.6% of all filings.

Code
# Calculate total filings by tract across full study period.
tract_totals <- df_monthly %>%
  filter(GEOID != "sealed", !is.na(GEOID), GEOID != "") %>%
  group_by(GEOID, racial_majority) %>%
  summarize(
    total_filings = sum(filings_count, na.rm = TRUE),
    mean_monthly = mean(filings_count, na.rm = TRUE),
    median_monthly = median(filings_count, na.rm = TRUE),
    sd_monthly = sd(filings_count, na.rm = TRUE),
    # Standardized measure of dispersion. Helps compare volatility. The higher the more volatile.
    cv = ifelse(mean_monthly > 0, sd_monthly / mean_monthly, NA),
    months_observed = n(),
    months_with_filings = sum(filings_count > 0),
    .groups = "drop"
  )

# Identify top 20 highest-risk tracts by total filings.
top_risk_tracts <- tract_totals %>%
  arrange(desc(total_filings)) %>%
  head(20)

# Display top risk tracts.
top_risk_tracts %>%
  select(GEOID, racial_majority, total_filings, mean_monthly, cv) %>%
  kable(digits = 2, caption = "Top 20 Highest-Risk Census Tracts (Total Filings 2020-2025)",
        col.names = c("GEOID", "Racial Majority", "Filings", "Monthly Mean", "Coefficient of Variation")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Top 20 Highest-Risk Census Tracts (Total Filings 2020-2025)
GEOID Racial Majority Filings Monthly Mean Coefficient of Variation
42101036100 White 831 11.70 1.17
42101021800 White 827 11.65 0.79
42101030100 Other 713 10.04 0.65
42101026800 Black 608 8.56 0.68
42101027200 Black 590 8.31 1.55
42101024000 Black 579 8.15 0.74
42101025200 Black 548 7.72 0.67
42101017800 Other 545 7.68 0.55
42101027600 Black 518 7.30 0.65
42101030000 Black 508 7.15 0.58
42101023800 Black 491 6.92 0.79
42101015200 Black 487 6.86 0.70
42101020101 Black 471 6.63 0.74
42101034600 Other 471 6.63 0.88
42101008302 Black 455 6.41 0.69
42101035302 White 455 6.41 0.85
42101024300 Black 444 6.25 0.84
42101023900 Black 441 6.21 0.72
42101012201 Black 428 6.03 0.76
42101010800 Black 426 6.00 0.69
Code
# Calculate concentration statistics showing geographic clustering of evictions.
total_system_filings <- sum(tract_totals$total_filings)
top_20_pct <- sum(top_risk_tracts$total_filings) / total_system_filings * 100

cat("GEOGRAPHIC CONCENTRATION ANALYSIS\n")
GEOGRAPHIC CONCENTRATION ANALYSIS
Code
cat(sprintf("Total System Filings: %s\n", comma(total_system_filings)))
Total System Filings: 65,331
Code
cat(sprintf("Top 20 Tracts Account For: %.1f%% of all filings\n", top_20_pct))
Top 20 Tracts Account For: 16.6% of all filings

Spatial Mapping of Eviction Burden

The choropleth plot showing the spatial distribution of evictions reveals regional concentration of census tracts with high filing levels in North, Upper North, and West Philadelphia.

Code
# Join tract totals to spatial geometry.
philly_eviction_sf <- philly_tracts_sf %>%
  left_join(tract_totals, by = "GEOID")

# Create choropleth map of total filings.
eviction_map <- ggplot(philly_eviction_sf) +
  geom_sf(aes(fill = total_filings), color = NA) +
  scale_fill_viridis_c(
    option = "inferno",
    trans = "log1p",
    labels = comma,
    na.value = "gray50",
    breaks = c(0, 25, 100, 250, 831)
  ) +
  labs(
    title = "Total Eviction Filings by Census Tract (2020-2025)",
    fill = "Total\nFilings",
    caption = "Data: Philadelphia Eviction Filings via Eviction Lab"
  ) +
  theme_void() +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    legend.position = "right"
  )

# Display eviction map.
eviction_map

Spatial Mapping of Tax Delinquency

The choropleth plot showing the spatial distribution of average property tax balances reveals distinct pocket of high tax burdens in Center City as well as South, North, Upper North, and Northeast Philadelphia.

Code
# Join tax data to spatial geometry.
philly_tax_sf <- philly_tracts_sf %>%
  left_join(df_tax, by = "GEOID")

# Create choropleth map of average tax balance.
tax_map <- ggplot(philly_tax_sf) +
  geom_sf(aes(fill = avg_balance), color = NA) +
  scale_fill_viridis_c(
    option = "plasma",
    trans = "log1p",
    labels = dollar_format(),
    na.value = "gray50",
    breaks = c(1000, 5000, 10000, 50000)
  ) +
  labs(
    title = "Average Property Tax Delinquency by Census Tract",
    fill = "Average\nTax\nBalance",
    caption = "Data: OpenDataPhilly Real Estate Tax Balances"
  ) +
  theme_void() +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    legend.position = "right"
  )

# Display tax map.
tax_map

8. PRE-PANDEMIC BASELINE COMPARISON

Monthly Filings vs Baseline Ratio

The plot and summary statistics shows that the Covid-19 pandemic posed a mass disruption to eviction trends as filings drop well below the pre-pandemic baseline before gradually increasing through 2021 and 2022. By 2024 and 2025, filings stabilized near the pre-pandemic average, indicating a return to normative patterns. This trajectory emphasizes the need to capture immediate post-moratorium dynamics.

Code
# Calculate system-wide ratio to pre-pandemic baseline over time.
monthly_baseline_ratio <- df_monthly %>%
  group_by(date) %>%
  summarize(
    total_filings = sum(filings_count, na.rm = TRUE),
    total_baseline = sum(filings_avg_prepandemic_baseline, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(total_baseline > 0) %>%
  mutate(ratio_to_baseline = total_filings / total_baseline)

# Create line plot showing normalized recovery trajectory.
# Ratio greater than 1 indicates system stress relative to historical norm.
baseline_ratio_plot <- ggplot(monthly_baseline_ratio, aes(x = date, y = ratio_to_baseline)) +
  geom_line(color = "#16A085", linewidth = 0.8, alpha = 0.6) +
  geom_smooth(method = "loess", se = TRUE, color = "#E74C3C", linewidth = 1) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "black", linewidth = 1) +
  annotate("text", x = min(monthly_baseline_ratio$date + 100), y = 1.08,
           label = "Pre-Pandemic Average", hjust = 0, fontface = "bold") +
  geom_vline(xintercept = as.Date("2020-03-01"), 
             linetype = "dotted", color = "#27AE60", alpha = 1, linewidth = 1) +
  geom_vline(xintercept = as.Date("2021-09-30"), 
             linetype = "dotted", color = "#E67E22", alpha = 1, linewidth = 1) +
  labs(
    title = "Monthly Filings Normalized to Pre-Pandemic Baseline",
    x = "Date (Month)",
    y = "Monthly Filings / Pre-Pandemic Average Ratio",
    caption = "Data: Philadelphia Monthly Eviction Filings via Eviction Lab"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold", size = 14))

# Display baseline ratio plot.
baseline_ratio_plot

Code
# Calculate summary statistics by policy period.
monthly_baseline_ratio %>%
  mutate(period = case_when(
    date < as.Date("2020-03-01") ~ "pre_moratorium",
    date >= as.Date("2020-03-01") & date <= as.Date("2021-09-30") ~ "moratorium",
    date > as.Date("2021-09-30") ~ "post_moratorium"
  )) %>%
  group_by(period) %>%
  summarize(
    n_weeks = n(),
    mean_ratio = mean(ratio_to_baseline, na.rm = TRUE),
    median_ratio = median(ratio_to_baseline, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  kable(digits = 2, caption = "Baseline Ratio Statistics by Policy Period",
        col.names = c("Period", "Weeks", "Mean Ratio", "Median Ratio")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Baseline Ratio Statistics by Policy Period
Period Weeks Mean Ratio Median Ratio
moratorium 19 0.23 0.25
post_moratorium 50 0.63 0.63
pre_moratorium 2 1.07 1.07

II. MODELING

1. FEATURE ENGINEERING FOR MODELING

Creating Temporal Lag Variables

Temporal lag variables show how eviction filings carry over from one month to the next. If a tract had high filings last month, it is more likely to have high filings this month too. The one-, two-, and three-month lags capture short-term momentum, while the three-month rolling average smooths out short-term patterns.

Code
# Create lagged features.
# When predicting month t, we can only use data from months t-1, t-2, t-3, etc.
df_model_prep <- df_monthly %>%
  filter(GEOID != "sealed", !is.na(GEOID), GEOID != "") %>%
  arrange(GEOID, date) %>%
  group_by(GEOID) %>%
  mutate(
    # One month lag captures immediate momentum.
    filings_lag1 = lag(filings_count, n = 1, order_by = date),
    # Two month lag captures medium-term trend.
    filings_lag2 = lag(filings_count, n = 2, order_by = date),
    # Three month lag captures quarterly pattern.
    filings_lag3 = lag(filings_count, n = 3, order_by = date),
    # Rolling 3-month average of lagged values (t-1, t-2, t-3).
    # Original rollmean with align = "right" included current month, please don't change this.
    filings_ma3 = (filings_lag1 + filings_lag2 + filings_lag3) / 3
  ) %>%
  ungroup()

# Check missingness of lag variables.
lag_coverage <- df_model_prep %>%
  summarize(
    total_obs = n(),
    missing_lag1 = sum(is.na(filings_lag1)),
    missing_lag2 = sum(is.na(filings_lag2)),
    missing_lag3 = sum(is.na(filings_lag3)),
    missing_ma3 = sum(is.na(filings_ma3)),
    pct_missing_lag1 = missing_lag1 / total_obs * 100,
    pct_missing_lag2 = missing_lag2 / total_obs * 100,
    pct_missing_lag3 = missing_lag3 / total_obs * 100,
    pct_missing_ma3 = missing_ma3 / total_obs * 100
  )

# Display lag coverage stats.
lag_coverage %>%
  kable(digits = 1, caption = "Lagged Variable Coverage Stats",
        col.names = c("Observations", "Missing Lag 1", "Missing Lag 2", "Missing Lag 3", "Missing Lag 3-Month Rolling Average", "% Missing Lag 1", "% Missing Lag 2", "% Missing Lag 3", "% Missing Lag 3-Month Rolling Average")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Lagged Variable Coverage Stats
Observations Missing Lag 1 Missing Lag 2 Missing Lag 3 Missing Lag 3-Month Rolling Average % Missing Lag 1 % Missing Lag 2 % Missing Lag 3 % Missing Lag 3-Month Rolling Average
28968 408 816 1224 1224 1.4 2.8 4.2 4.2

Creating Inactivity Flag

About 37% of monthly tract observations have zero filings thus these flags are intended to help the model separate active periods from months with no eviction activity. This matters because it can potentially address problematic zero-inflation in the data and highlight longer stretches of inactivity that may reflect structural differences across tracts.

Code
# There are around 37% zeros, so add zero flags for last month and consecutive.
df_model_prep <- df_model_prep %>%
  group_by(GEOID) %>%
  mutate(
    was_zero_last_month = lag(ifelse(filings_count == 0, 1, 0)),
    consecutive_zeros = sequence(rle(filings_count == 0)$lengths) * (filings_count == 0)
  ) %>%
  ungroup()

# Check missingness of zero flag variables.
zero_coverage <- df_model_prep %>%
  summarize(
    total_obs = n(),
    missing_zero_last_month = sum(is.na(was_zero_last_month)),
    missing_consecutive_zeros = sum(is.na(consecutive_zeros)),
    pct_missing_zero_last_month = missing_zero_last_month / total_obs * 100,
    pct_missing_consecutive_zeros = missing_consecutive_zeros / total_obs * 100
  )

# Display zero flag coverage stats.
zero_coverage %>%
  kable(digits = 1, caption = "Zero Flag Variable Coverage Stats",
        col.names = c("Observations", "Missing 0 Last Month", "Missing Consecutive Zeros", "% Missing 0 Last Month", "% Missing Consecutive Zeros")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Zero Flag Variable Coverage Stats
Observations Missing 0 Last Month Missing Consecutive Zeros % Missing 0 Last Month % Missing Consecutive Zeros
28968 408 0 1.4 0

Merging Claims Severity Data

The following represents various merges that integrate claims severity, tax delinquency, and American Community Survey (ACS) socioeconomic indicators into the tract-level eviction panel.

Code
# Create year-month key for joining claims data to tract-level data.
df_model_prep <- df_model_prep %>%
  mutate(year_month = floor_date(date, "month"))

df_claims_join <- df_claims %>%
  mutate(year_month = floor_date(date, "month")) %>%
  select(year_month, median_claim, claim_ratio)

# Merge claims data to tract-level panel.
df_model_prep <- df_model_prep %>%
  left_join(df_claims_join, by = "year_month")

# Check merge success.
claims_merge_stats <- df_model_prep %>%
  summarize(
    total_obs = n(),
    has_claims = sum(!is.na(median_claim)),
    pct_with_claims = has_claims / total_obs * 100
  )

cat(sprintf("Observations with claims data: %s (%.1f%%)\n", 
            comma(claims_merge_stats$has_claims), 
            claims_merge_stats$pct_with_claims))
Observations with claims data: 27,336 (94.4%)

Merging Tax Delinquency Data

Tax delinquency serves as a proxy for landlord financial stress. Properties with outstanding tax balances may have landlords more likely to pursue evictions.

Code
# Prepare tax data for joining.
df_tax_join <- df_tax %>%
  select(GEOID, num_props, balance, avg_balance, log_delinquent_balance, 
                delinquent_prop_count, delinquency_age_years)

# Merge tax delinquency data to tract-level panel.
df_model_prep <- df_model_prep %>%
  left_join(df_tax_join, by = "GEOID")

# Check merge success.
tax_merge_stats <- df_model_prep %>%
  summarize(
    total_obs = n(),
    has_tax_data = sum(!is.na(avg_balance)),
    pct_with_tax = has_tax_data / total_obs * 100
  )

cat(sprintf("Observations with tax data: %s (%.1f%%)\n", 
            comma(tax_merge_stats$has_tax_data), 
            tax_merge_stats$pct_with_tax))
Observations with tax data: 28,116 (97.1%)

Integrating ACS Socioeconomic Data

American Community Survey data provides tract-level socioeconomic indicators that capture structural vulnerability to eviction. Variables chosen pose theoretical relevance to eviction data as they related to housing, income, and demographic composition. ACS data was further engineered to allow for meaningful interpretation and comparison with other variables.

Code
# Retrieve ACS 5-year estimates for Philadelphia County tracts.
# Using 2023 ACS 5-year estimates.

# Define ACS variables to retrieve.
acs_vars <- c(
  # Housing characteristics.
  "B25003_001",  # total occupied housing units
  "B25003_003",  # renter-occupied units
  "B25064_001",  # median gross rent
  "B25071_001",  # median gross rent as % of income
  "B25077_001",  # median home value
  

  # Income and poverty.
  "B19013_001",  # median household income
  "B17001_001",  # total population for poverty status
  "B17001_002",  # population below poverty level
  
  # Employment.
  "B23025_001",  # total population 16+ in labor force
  "B23025_005",  # unemployed population
  
  # Education.
  "B15003_001",  # total population 25+ for education
  "B15003_017",  # high school diploma
  "B15003_022",  # bachelor's degree
  "B15003_023",  # master's degree
  "B15003_025",  # doctorate degree
  
  # Demographics.
  "B01003_001",  # total population
  "B01002_001",  # median age
  
  # Household composition.
  "B11001_001",  # total households
  "B11001_006",  # female householder no spouse with children
  
  # Housing cost burden.
  "B25070_001",  # total renter households for rent burden
  "B25070_007",  # renters paying 30-34.9% income on rent
  "B25070_008",  # renters paying 35-39.9% income on rent
  "B25070_009",  # renters paying 40-49.9% income on rent
  "B25070_010"   # renters paying 50%+ income on rent
)

# Retrieve ACS data for Philadelphia County.
acs_data <- get_acs(
  geography = "tract",
  variables = acs_vars,
  state = "PA",
  county = "Philadelphia",
  year = 2023,
  survey = "acs5",
  output = "wide",
  geometry = FALSE
)

cat(sprintf("Tracts with ACS data: %d\n", nrow(acs_data)))
Tracts with ACS data: 408
Code
# Engineer meaningful features from raw ACS variables.
df_acs <- acs_data %>%
  transmute(
    GEOID = GEOID,
    
    # Renter percentage.
    pct_renter = ifelse(B25003_001E > 0, B25003_003E / B25003_001E * 100, NA),
    
    # Median gross rent.
    median_rent = B25064_001E,
    
    # Rent burden (rent as % of income).
    rent_burden_pct = B25071_001E,
    
    # Median household income (log transform for modeling).
    median_income = B19013_001E,
    log_median_income = log1p(B19013_001E),
    
    # Poverty rate.
    poverty_rate = ifelse(B17001_001E > 0, B17001_002E / B17001_001E * 100, NA),
    
    # Unemployment rate.
    unemployment_rate = ifelse(B23025_001E > 0, B23025_005E / B23025_001E * 100, NA),
    
    # Educational attainment (% with at least bachelor's).
    pct_college = ifelse(B15003_001E > 0, 
                         (B15003_022E + B15003_023E + B15003_025E) / B15003_001E * 100, NA),
    
    # Total population.
    total_pop = B01003_001E,
    log_pop = log1p(B01003_001E),
    
    # Median age.
    median_age = B01002_001E,
    
    # Single mother households (% total households).
    pct_single_mother = ifelse(B11001_001E > 0, B11001_006E / B11001_001E * 100, NA),
    
    # Severe rent burden (% renters paying 35%+ income on rent).
    severe_rent_burden = ifelse(B25070_001E > 0, 
                                (B25070_008E + B25070_009E + B25070_010E) / B25070_001E * 100, NA)
  )

Single mothers and those who experience severe rent burdens may be more theoretically linked to eviction filings as their means are limited.

Code
# Calculate summary stats for ACS features.
acs_summary <- df_acs %>%
  select(-GEOID) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  group_by(variable) %>%
  summarize(
    mean = mean(value, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    sd = sd(value, na.rm = TRUE),
    min = min(value, na.rm = TRUE),
    max = max(value, na.rm = TRUE),
    pct_missing = sum(is.na(value)) / n() * 100,
    .groups = "drop"
  )

acs_summary %>%
  kable(digits = 2, caption = "ACS Feature Summary Stats",
        col.names = c("Variable", "Mean", "Median", "Standard Deviation", "Minimum", "Maximum", "% Missing")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
ACS Feature Summary Stats
Variable Mean Median Standard Deviation Minimum Maximum % Missing
log_median_income 10.99 11.02 0.49 9.53 12.17 6.62
log_pop 7.86 8.25 1.72 0.00 9.25 0.00
median_age 36.76 35.40 7.71 5.70 74.90 4.17
median_income 66877.27 60817.00 33071.00 13721.00 192727.00 6.62
median_rent 1374.54 1285.50 391.30 407.00 3132.00 6.37
pct_college 33.01 27.31 21.37 1.50 92.42 4.17
pct_renter 48.21 47.15 20.49 0.00 100.00 4.90
pct_single_mother 19.53 18.12 13.53 0.00 62.82 4.90
poverty_rate 21.60 18.54 14.04 0.00 73.91 4.66
rent_burden_pct 33.56 31.40 9.46 17.30 51.00 5.64
severe_rent_burden 41.35 41.01 17.05 4.19 100.00 5.15
total_pop 3878.51 3813.50 1834.07 0.00 10406.00 0.00
unemployment_rate 5.18 4.62 3.43 0.00 19.64 4.17
Code
# Merge ACS data into model prep dataset.
df_model_prep <- df_model_prep %>%
  left_join(df_acs, by = "GEOID")

# Check merge.
acs_merge_stats <- df_model_prep %>%
  summarize(
    total_obs = n(),
    has_acs = sum(!is.na(pct_renter)),
    pct_with_acs = has_acs / total_obs * 100
  )

cat(sprintf("Observations with ACS data: %s (%.1f%%)\n", 
            comma(acs_merge_stats$has_acs), 
            acs_merge_stats$pct_with_acs))
Observations with ACS data: 27,548 (95.1%)

Creating a spatial weights matrix for our model is motivated by the maps predicting the spatial distribution of eviction filings across Philadelphia; eviction activity tends to not be isolated to specific tracts and emerge as part of broader neighborhood-level patterns. The spatial weights matrix allows the model to capture these localized spillovers, ensuring that predictions reflect the reality that housing instability is often similar in adjacent communities rather than occurring in isolation.

Code
# Filter spatial data to only tracts present in eviction data.
tracts_in_data <- unique(df_model_prep$GEOID)

philly_model_sf <- philly_tracts_sf %>%
  filter(GEOID %in% tracts_in_data)

# Create spatial neighbors using Queen contiguity.
neighbors_queen <- poly2nb(philly_model_sf, queen = TRUE)

# Convert to spatial weights matrix.
weights_matrix <- nb2listw(neighbors_queen, style = "W", zero.policy = TRUE)

cat(sprintf("Number of tracts in spatial analysis: %d\n", length(neighbors_queen)))
Number of tracts in spatial analysis: 408
Code
cat(sprintf("Average number of neighbors per tract: %.1f\n", 
            mean(card(neighbors_queen))))
Average number of neighbors per tract: 6.0
Code
# Calculate lagged spatial lag of filings for each month.
# Use neighbors' filings from month t-1 (previous month).
# Using same-month neighbor data would be data leakage since we wouldn't have that info at prediction time.

# Create function to calculate spatial lag for a given month using previous month's data.
# Shift each tract's filings by 1 month.
df_model_prep <- df_model_prep %>%
  arrange(GEOID, date) %>%
  group_by(GEOID) %>%
  mutate(filings_count_lag1 = lag(filings_count, n = 1)) %>%
  ungroup()

# For each month, calculate spatial lag using the lagged filings.
df_model_prep <- df_model_prep %>%
  group_by(date) %>%
  mutate(
    # Match order to spatial data for this month.
    spatial_lag_filings = {
      sf_matched <- philly_model_sf %>%
        left_join(
          cur_data() %>% select(GEOID, filings_count_lag1), 
          by = "GEOID"
        ) %>%
        mutate(filings_count_lag1 = ifelse(is.na(filings_count_lag1), 0, filings_count_lag1))
      
      # Calculate weighted average of neighbors' lagged filings.
      lag.listw(weights_matrix, sf_matched$filings_count_lag1, zero.policy = TRUE)
    }
  ) %>%
  ungroup()

cat(sprintf("Observations with spatial lag: %s\n", 
            comma(sum(!is.na(df_model_prep$spatial_lag_filings)))))
Observations with spatial lag: 28,968

Creating Neighborhood Fixed-Effects

Neighborhoods provide a broader geographic context for understanding eviction filings. By linking tracts to neighborhood boundaries, we can test whether neighborhood characteristics are associated with filing rates.

Code
# Load boundaries.
neighborhoods <- st_read("data/philadelphia-neighborhoods/philadelphia-neighborhoods.shp")
Reading layer `philadelphia-neighborhoods' from data source 
  `C:\Users\Tess\Documents\GitHub\portfolio-setup-TessaVu\final-project\data\philadelphia-neighborhoods\philadelphia-neighborhoods.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 159 features and 5 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -75.28026 ymin: 39.86701 xmax: -74.95576 ymax: 40.13799
Geodetic CRS:  WGS 84
Code
# Make valid for duplicate vertex error.
neighborhoods <- st_make_valid(neighborhoods)

# Change CRS.
neighborhoods <- st_transform(neighborhoods, st_crs(philly_model_sf))

# Select only GEOID and geometry from tracts.
tracts_sf_subset <- philly_model_sf %>% select(GEOID, geometry)

# Select only neighborhood name and geometry from neighborhoods.
neighborhoods_sf_subset <- neighborhoods %>% select(MAPNAME, geometry) %>% rename(neighborhood = MAPNAME)

# Spatial join. Keep tracts even if they don't perfectly intersect a neighborhood.
tracts_with_neighborhood <- st_join(
    tracts_sf_subset,
    neighborhoods_sf_subset,
    join = st_intersects,
    left = TRUE
) %>%
    # Drop geometry.
    st_drop_geometry()

# Select only GEOID and neighborhood.
df_neighborhood_join <- tracts_with_neighborhood %>%
    select(GEOID, neighborhood)

# Choose first match if multiple.
df_neighborhood_join <- df_neighborhood_join %>%
    group_by(GEOID) %>%
    slice(1) %>%
    ungroup()

# Merge.
df_model_prep <- df_model_prep %>%
    left_join(df_neighborhood_join, by = "GEOID")

# Check merge.
cat(sprintf("Total observations: %s\n", comma(nrow(df_model_prep))))
Total observations: 28,968
Code
cat(sprintf("Observations with neighborhood data: %s\n",
            comma(sum(!is.na(df_model_prep$neighborhood)))))
Observations with neighborhood data: 28,968
Code
# Convert new column to factor for FE.
df_model_prep$neighborhood <- as.factor(df_model_prep$neighborhood)

# Change reference.
df_model_prep$neighborhood <- relevel(df_model_prep$neighborhood, ref = "East Falls")

A binary flag was set to identify high concentrations of filings. This helps separate routine variation from extreme events and allows the model to account for rare but impactful spikes without letting them distort patterns in the broader dataset.

Code
# Define threshold for capping target variable in training set.
cap_threshold = 20

# Create spike binary flag for all data to identify structural risk.
df_model_prep <- df_model_prep %>%
  mutate(is_extreme_spike = ifelse(filings_count > cap_threshold, 1, 0))

We generated a correlation matrix for all numeric variables to identify redundancy and guide model specification. Because poverty rate and logged median income captured similar socioeconomic conditions, we retained poverty rate due to its stronger theoretical link to eviction risk. We selected delinquent property count over logged delinquent balance because it showed a stronger empirical correlation with eviction filings. Finally, we used the three-month rolling spatial lag rather than the one- or two-month lags, as the rolling measure captured short-term fluctuations while reducing redundancy. This process ensured that the final predictors were theoretically relevant and minimally collinear.

Code
# Custom colors.
custom_colors <- colorRampPalette(c("#f03e37", "white", "#189ace")) 

color_vector <- custom_colors(200)

# Select and apply numeric columns only.
df_model_prep_correlation <- df_model_prep[, sapply(df_model_prep, is.numeric)]

# Correlation matrix.
correlation_matrix <- cor(df_model_prep_correlation, use = "pairwise.complete.obs")

# Display matrix.
corrplot(correlation_matrix,
         method = "square",
         order = "FPC", 
         type = "lower",
         diag = TRUE,
         tl.col = "black",
         cl.pos = "n",
         addCoef.col = "black",
         col = color_vector
         )

2. TRAIN/TEST SPLIT STRATEGY

Time-Based Validation Split

For a truly predictive model, we must use time-based splitting to avoid data leakage. The model learns from historical data and predicts future periods.

Code
# Define temporal cutoff for train/test split.
train_end_date <- as.Date("2023-12-31")
test_start_date <- as.Date("2024-01-01")

# Filter for all variables needed by most complex model.
df_filtered <- df_model_prep %>%
  filter(
    !is.na(filings_ma3),
    !is.na(spatial_lag_filings),
    !is.na(delinquent_prop_count),
    !is.na(pct_renter),
    !is.na(poverty_rate),
    !is.na(severe_rent_burden),
    !is.na(pct_single_mother),
    !is.na(racial_majority),
    !is.na(moratorium_active),
    !is.na(neighborhood)
  )

# Apply the cap only to training target variable to make it stable.
# Create training set where filings are capped at 99.75 percentile = 20 filings.
df_train <- df_filtered %>%
  filter(date <= train_end_date) %>%
  mutate(filings_count_capped = pmin(filings_count, cap_threshold))

# Create testing set where target is including spikes for real-world.
df_test <- df_filtered %>%
  filter(date >= test_start_date)

# Display split statistics.
cat("TRAIN/TEST SPLIT SUMMARY\n")
TRAIN/TEST SPLIT SUMMARY
Code
cat(sprintf("Training Period: %s to %s\n", min(df_train$date), max(df_train$date)))
Training Period: 2020-04-01 to 2023-12-01
Code
cat(sprintf("Training Observations: %s\n", comma(nrow(df_train))))
Training Observations: 17,280
Code
cat(sprintf("Training Months: %d\n", n_distinct(df_train$date)))
Training Months: 45
Code
cat(sprintf("Test Period: %s to %s\n", min(df_test$date), max(df_test$date)))
Test Period: 2024-01-01 to 2025-11-01
Code
cat(sprintf("Test Observations: %s\n", comma(nrow(df_test))))
Test Observations: 8,832
Code
cat(sprintf("Test Months: %d\n", n_distinct(df_test$date)))
Test Months: 23
Code
# Adjust so that the extreme outliers are not included. Cap lagged momentum signal to reduce distortion from rare mass-eviction spikes, which is 99.8 percentile.
df_train <- df_train %>% mutate(filings_ma3 = pmin(filings_ma3, 20))
df_test <- df_test %>% mutate(filings_ma3 = pmin(filings_ma3, 20))
Code
# Compare filing distributions between train and test sets.
train_test_comparison <- bind_rows(
  df_train %>% mutate(set = "training"),
  df_test %>% mutate(set = "test")
) %>%
  group_by(set) %>%
  summarize(
    n_obs = n(),
    mean_filings = mean(filings_count, na.rm = TRUE),
    median_filings = median(filings_count, na.rm = TRUE),
    sd_filings = sd(filings_count, na.rm = TRUE),
    zero_pct = sum(filings_count == 0) / n() * 100,
    .groups = "drop"
  )

train_test_comparison %>%
  kable(digits = 2, caption = "Filing Distribution Comparison: Training vs Test Sets",
        col.names = c("Set", "Observations", "Mean Filings", "Median Filings", "Standard Deviation Filings", "% Zero")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Filing Distribution Comparison: Training vs Test Sets
Set Observations Mean Filings Median Filings Standard Deviation Filings % Zero
test 8832 2.64 2 3.20 27.39
training 17280 2.12 1 3.17 37.60

To prevent mass displacement events from destabilizing the Negative Binomial model, we capped the target variable at the ~99th percentile with 20 filings in the training set only. However, we retained the raw, uncapped values in the test set to validate the model’s performance against real-world unpredictability.

3. NEGATIVE BINOMIAL REGRESSION MODEL

Model Specification Rationale

Because of overdispersion, Negative Binomial regression is preferred over Poisson.

Model 1: Baseline with Temporal, Policy, and Monthly Features

Code
# Fit baseline Negative Binomial model with core time, policy, and monthly predictors.
model_nb_1 <- glm.nb(filings_count_capped ~ filings_ma3 + moratorium_active + factor(month_num) + is_extreme_spike,
                     data = df_train)

# Display 1st model summary.
cat("BASELINE NEGATIVE BINOMIAL MODEL SUMMARY\n")
BASELINE NEGATIVE BINOMIAL MODEL SUMMARY
Code
summary(model_nb_1)

Call:
glm.nb(formula = filings_count_capped ~ filings_ma3 + moratorium_active + 
    factor(month_num) + is_extreme_spike, data = df_train, init.theta = 1.652485134, 
    link = log)

Coefficients:
                     Estimate Std. Error z value             Pr(>|z|)    
(Intercept)          0.406671   0.032979  12.331 < 0.0000000000000002 ***
filings_ma3          0.211950   0.002944  72.000 < 0.0000000000000002 ***
moratorium_active   -0.587494   0.019745 -29.754 < 0.0000000000000002 ***
factor(month_num)2  -0.264619   0.045157  -5.860   0.0000000046293608 ***
factor(month_num)3   0.018918   0.043832   0.432             0.666037    
factor(month_num)4  -0.722253   0.045008 -16.047 < 0.0000000000000002 ***
factor(month_num)5  -0.432966   0.043909  -9.861 < 0.0000000000000002 ***
factor(month_num)6  -0.338440   0.043762  -7.734   0.0000000000000104 ***
factor(month_num)7  -0.143537   0.043031  -3.336             0.000851 ***
factor(month_num)8   0.010293   0.042140   0.244             0.807022    
factor(month_num)9   0.118125   0.041477   2.848             0.004400 ** 
factor(month_num)10  0.029408   0.040993   0.717             0.473130    
factor(month_num)11 -0.125061   0.041365  -3.023             0.002500 ** 
factor(month_num)12 -0.092971   0.041199  -2.257             0.024029 *  
is_extreme_spike     1.666549   0.129682  12.851 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1.6525) family taken to be 1)

    Null deviance: 27521  on 17279  degrees of freedom
Residual deviance: 18556  on 17265  degrees of freedom
AIC: 60201

Number of Fisher Scoring iterations: 1

              Theta:  1.6525 
          Std. Err.:  0.0388 

 2 x log-likelihood:  -60169.1300 

Model 2: Add Tax Delinquency and Spatial Lag

Code
model_nb_2 <- glm.nb(filings_count_capped ~
                       filings_ma3 + moratorium_active + factor(month_num) + is_extreme_spike +
                       # Added spatial lag and delinquent properties because they're more actionable.
                       spatial_lag_filings + delinquent_prop_count,
                     data = df_train)

# Display 2nd model summary.
cat("2ND NEGATIVE BINOMIAL MODEL SUMMARY\n")
2ND NEGATIVE BINOMIAL MODEL SUMMARY
Code
summary(model_nb_2)

Call:
glm.nb(formula = filings_count_capped ~ filings_ma3 + moratorium_active + 
    factor(month_num) + is_extreme_spike + spatial_lag_filings + 
    delinquent_prop_count, data = df_train, init.theta = 1.785836474, 
    link = log)

Coefficients:
                         Estimate  Std. Error z value             Pr(>|z|)    
(Intercept)            0.16958717  0.03601190   4.709   0.0000024869281117 ***
filings_ma3            0.18508414  0.00302674  61.150 < 0.0000000000000002 ***
moratorium_active     -0.59704251  0.02096097 -28.484 < 0.0000000000000002 ***
factor(month_num)2    -0.26768237  0.04441417  -6.027   0.0000000016707370 ***
factor(month_num)3     0.03809786  0.04309203   0.884             0.376640    
factor(month_num)4    -0.70713449  0.04430404 -15.961 < 0.0000000000000002 ***
factor(month_num)5    -0.41107419  0.04346477  -9.458 < 0.0000000000000002 ***
factor(month_num)6    -0.32981983  0.04313644  -7.646   0.0000000000000207 ***
factor(month_num)7    -0.14707775  0.04236759  -3.471             0.000518 ***
factor(month_num)8     0.00797282  0.04146404   0.192             0.847521    
factor(month_num)9     0.10634056  0.04078377   2.607             0.009123 ** 
factor(month_num)10    0.03113026  0.04025258   0.773             0.439302    
factor(month_num)11   -0.12814163  0.04065231  -3.152             0.001621 ** 
factor(month_num)12   -0.08544880  0.04048530  -2.111             0.034806 *  
is_extreme_spike       1.82586047  0.12550920  14.548 < 0.0000000000000002 ***
spatial_lag_filings    0.02619239  0.00486258   5.387   0.0000000718332945 ***
delinquent_prop_count  0.00099969  0.00004104  24.361 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1.7858) family taken to be 1)

    Null deviance: 28450  on 17279  degrees of freedom
Residual deviance: 18548  on 17263  degrees of freedom
AIC: 59653

Number of Fisher Scoring iterations: 1

              Theta:  1.7858 
          Std. Err.:  0.0434 

 2 x log-likelihood:  -59616.6730 

Model 3: Add ACS

Code
model_nb_3 <- glm.nb(filings_count_capped ~
                       # 1st model.
                       filings_ma3 + moratorium_active + factor(month_num) + is_extreme_spike + neighborhood +
                       # 2nd model.
                       spatial_lag_filings + delinquent_prop_count +
                       # Add demographics.
                       pct_renter + severe_rent_burden + poverty_rate + pct_single_mother +
                       factor(racial_majority),
                     data = df_train)

# Display 3rd model summary.
cat("3RD NEGATIVE BINOMIAL MODEL SUMMARY\n")
3RD NEGATIVE BINOMIAL MODEL SUMMARY
Code
summary(model_nb_3)

Call:
glm.nb(formula = filings_count_capped ~ filings_ma3 + moratorium_active + 
    factor(month_num) + is_extreme_spike + neighborhood + spatial_lag_filings + 
    delinquent_prop_count + pct_renter + severe_rent_burden + 
    poverty_rate + pct_single_mother + factor(racial_majority), 
    data = df_train, init.theta = 2.456047024, link = log)

Coefficients:
                                           Estimate  Std. Error z value
(Intercept)                             -0.32885876  0.08373969  -3.927
filings_ma3                              0.10824420  0.00324140  33.394
moratorium_active                       -0.72840866  0.02044538 -35.627
factor(month_num)2                      -0.26247687  0.04149013  -6.326
factor(month_num)3                       0.04896265  0.04011348   1.221
factor(month_num)4                      -0.67002325  0.04171184 -16.063
factor(month_num)5                      -0.41028851  0.04087245 -10.038
factor(month_num)6                      -0.33329963  0.04040477  -8.249
factor(month_num)7                      -0.18329105  0.03969309  -4.618
factor(month_num)8                      -0.01468898  0.03872740  -0.379
factor(month_num)9                       0.08037622  0.03807647   2.111
factor(month_num)10                     -0.00068793  0.03751738  -0.018
factor(month_num)11                     -0.14435744  0.03793421  -3.805
factor(month_num)12                     -0.08616436  0.03772642  -2.284
is_extreme_spike                         1.75915476  0.11004495  15.986
neighborhoodAcademy Gardens              0.46868176  0.15752547   2.975
neighborhoodAllegheny West              -0.48975161  0.10430841  -4.695
neighborhoodAndorra                     -3.70144762  1.00371354  -3.688
neighborhoodBartram Village             -0.47563860  0.11981972  -3.970
neighborhoodBelmont                     -0.00989062  0.13651137  -0.072
neighborhoodBrewerytown                 -0.10763228  0.08395922  -1.282
neighborhoodBridesburg                  -0.53703472  0.12210109  -4.398
neighborhoodBurholme                     0.34905619  0.14083720   2.478
neighborhoodBustleton                    0.10438794  0.07693362   1.357
neighborhoodByberry                      0.67177557  0.12873656   5.218
neighborhoodCarroll Park                 0.14207463  0.13914364   1.021
neighborhoodCedar Park                   0.05976045  0.17349190   0.344
neighborhoodCedarbrook                  -0.00166750  0.09703355  -0.017
neighborhoodCenter City East            -0.41877600  0.11827091  -3.541
neighborhoodChestnut Hill               -0.67456234  0.10638519  -6.341
neighborhoodClearview                    0.28564927  0.08874398   3.219
neighborhoodCobbs Creek                  0.13771192  0.07563235   1.821
neighborhoodCrescentville                0.64174551  0.14655620   4.379
neighborhoodDickinson Narrows           -0.37530165  0.12653466  -2.966
neighborhoodEast Germantown              0.01876951  0.08429356   0.223
neighborhoodEast Mount Airy              0.18446467  0.07272055   2.537
neighborhoodEast Oak Lane               -0.73782079  0.22568629  -3.269
neighborhoodEast Parkside               -0.13621482  0.11061447  -1.231
neighborhoodEastwick                    -0.22687952  0.11669662  -1.944
neighborhoodElmwood                      0.00155153  0.15001950   0.010
neighborhoodFairhill                    -0.59160524  0.11584999  -5.107
neighborhoodFairmount                   -0.71642909  0.13054756  -5.488
neighborhoodFeltonville                  0.24669600  0.13596714   1.814
neighborhoodFishtown - Lower Kensington -0.66113045  0.14283317  -4.629
neighborhoodFitler Square               -0.87777924  0.19971650  -4.395
neighborhoodFox Chase                    0.31386497  0.11179379   2.808
neighborhoodFrancisville                -0.17055858  0.11865966  -1.437
neighborhoodFrankford                    0.39768227  0.08851345   4.493
neighborhoodFranklin Mills              -1.03757555  0.35782763  -2.900
neighborhoodFranklinville               -0.24506397  0.11310195  -2.167
neighborhoodGermantown - Morton         -0.69023633  0.17911193  -3.854
neighborhoodGermantown - Penn Knox       0.27843900  0.13060709   2.132
neighborhoodGirard Estates               0.06866719  0.16590901   0.414
neighborhoodGlenwood                    -0.52931597  0.12785307  -4.140
neighborhoodGraduate Hospital           -0.44607208  0.12062821  -3.698
neighborhoodGrays Ferry                 -0.01153080  0.07274276  -0.159
neighborhoodHaddington                   0.01473490  0.08847075   0.167
neighborhoodHarrowgate                   0.16413256  0.15706132   1.045
neighborhoodHartranft                   -0.41205999  0.12573499  -3.277
neighborhoodHawthorne                   -0.27119831  0.14180180  -1.913
neighborhoodHolmesburg                   0.83856618  0.09352799   8.966
neighborhoodHunting Park                -0.08154019  0.18864774  -0.432
neighborhoodIndustrial                   0.41139489  0.09656724   4.260
neighborhoodJuniata Park                 0.47813540  0.14745273   3.243
neighborhoodKingsessing                  0.01155933  0.07994566   0.145
neighborhoodLawndale                     0.65475002  0.17783919   3.682
neighborhoodLogan                        0.29100304  0.09990076   2.913
neighborhoodLogan Square                -0.20965619  0.13893856  -1.509
neighborhoodLower Moyamensing           -0.20926456  0.11343650  -1.845
neighborhoodLudlow                       0.25386145  0.15420013   1.646
neighborhoodMantua                      -0.22466546  0.15190610  -1.479
neighborhoodMayfair                      0.71759154  0.09380730   7.650
neighborhoodMill Creek                   0.21340550  0.15166722   1.407
neighborhoodModena                       1.38672039  0.13835932  10.023
neighborhoodMorrell Park                -0.02342778  0.23184103  -0.101
neighborhoodNicetown                     0.02498215  0.10258258   0.244
neighborhoodNorth Central               -0.28967180  0.10525821  -2.752
neighborhoodNortheast Phila Airport      0.37007935  0.13553448   2.731
neighborhoodNorthern Liberties          -0.34656504  0.12223663  -2.835
neighborhoodNorthwood                    0.50739140  0.08945392   5.672
neighborhoodOgontz                       0.09058871  0.14605998   0.620
neighborhoodOld City                    -0.70295702  0.11739622  -5.988
neighborhoodOld Kensington              -0.85874062  0.23393610  -3.671
neighborhoodOlney                        0.33488920  0.06826431   4.906
neighborhoodOverbrook                    0.16604677  0.07843303   2.117
neighborhoodOxford Circle                0.46689987  0.07862386   5.938
neighborhoodParkwood Manor               0.16288090  0.15392819   1.058
neighborhoodPaschall                    -0.21198188  0.09024141  -2.349
neighborhoodPassyunk Square             -0.77349522  0.18858421  -4.102
neighborhoodPennypack Park               0.41927837  0.07837068   5.350
neighborhoodPoint Breeze                -0.71228184  0.12134541  -5.870
neighborhoodRhawnhurst                   0.32020992  0.07986139   4.010
neighborhoodRichmond                     0.13562758  0.07875827   1.722
neighborhoodRittenhouse                 -1.05725258  0.10171197 -10.395
neighborhoodRiverfront                  -0.17987930  0.08014787  -2.244
neighborhoodRoxborough                  -0.48517106  0.12948921  -3.747
neighborhoodSharswood                   -0.37498504  0.10777194  -3.479
neighborhoodSociety Hill                -0.97128479  0.13490591  -7.200
neighborhoodSomerton                     0.29230952  0.09917735   2.947
neighborhoodSouthwest Germantown        -0.21810837  0.11764461  -1.854
neighborhoodSpring Garden                0.12993231  0.15357065   0.846
neighborhoodStadium District            -0.08256467  0.22021152  -0.375
neighborhoodStanton                     -0.03674313  0.07639561  -0.481
neighborhoodStrawberry Mansion           0.27142511  0.13993797   1.940
neighborhoodSummerdale                   1.01430415  0.16144081   6.283
neighborhoodTacony                       0.13075839  0.16491604   0.793
neighborhoodTioga                       -0.07611914  0.09054053  -0.841
neighborhoodTorresdale                   0.26925474  0.18542236   1.452
neighborhoodUniversity City             -0.46304900  0.13215360  -3.504
neighborhoodUpper Kensington            -0.30332474  0.17418694  -1.741
neighborhoodUpper Roxborough            -0.16332421  0.12475403  -1.309
neighborhoodWalnut Hill                  0.05295536  0.07598310   0.697
neighborhoodWashington Square West      -0.74713151  0.11504714  -6.494
neighborhoodWest Kensington             -0.04189261  0.13133936  -0.319
neighborhoodWest Mount Airy              0.00491165  0.14234699   0.035
neighborhoodWest Oak Lane                0.06840645  0.07560329   0.905
neighborhoodWest Parkside                0.23100539  0.11546496   2.001
neighborhoodWest Powelton               -0.15412798  0.11410181  -1.351
neighborhoodWest Torresdale              0.55821049  0.10741736   5.197
neighborhoodWissahickon Park             0.14953999  0.08944801   1.672
neighborhoodWissinoming                  0.18973127  0.15874658   1.195
neighborhoodWoodland Terrace            -0.66444414  0.11855089  -5.605
neighborhoodWynnefield Heights          -0.03266726  0.11637500  -0.281
neighborhoodYorktown                    -0.26931181  0.15850742  -1.699
spatial_lag_filings                      0.03170557  0.00497125   6.378
delinquent_prop_count                    0.00116241  0.00007638  15.219
pct_renter                               0.01658298  0.00071223  23.283
severe_rent_burden                       0.00135210  0.00069063   1.958
poverty_rate                            -0.00390303  0.00115768  -3.371
pct_single_mother                        0.00036465  0.00119749   0.305
factor(racial_majority)Hispanic         -0.22254760  0.06140714  -3.624
factor(racial_majority)Other            -0.11998669  0.03720466  -3.225
factor(racial_majority)White            -0.30372526  0.04583831  -6.626
                                                    Pr(>|z|)    
(Intercept)                               0.0000859564901167 ***
filings_ma3                             < 0.0000000000000002 ***
moratorium_active                       < 0.0000000000000002 ***
factor(month_num)2                        0.0000000002511928 ***
factor(month_num)3                                  0.222236    
factor(month_num)4                      < 0.0000000000000002 ***
factor(month_num)5                      < 0.0000000000000002 ***
factor(month_num)6                      < 0.0000000000000002 ***
factor(month_num)7                        0.0000038800410245 ***
factor(month_num)8                                  0.704471    
factor(month_num)9                                  0.034780 *  
factor(month_num)10                                 0.985371    
factor(month_num)11                                 0.000142 ***
factor(month_num)12                                 0.022376 *  
is_extreme_spike                        < 0.0000000000000002 ***
neighborhoodAcademy Gardens                         0.002927 ** 
neighborhoodAllegheny West                0.0000026631137103 ***
neighborhoodAndorra                                 0.000226 ***
neighborhoodBartram Village               0.0000719877138962 ***
neighborhoodBelmont                                 0.942242    
neighborhoodBrewerytown                             0.199857    
neighborhoodBridesburg                    0.0000109112440242 ***
neighborhoodBurholme                                0.013196 *  
neighborhoodBustleton                               0.174827    
neighborhoodByberry                       0.0000001806517035 ***
neighborhoodCarroll Park                            0.307224    
neighborhoodCedar Park                              0.730503    
neighborhoodCedarbrook                              0.986289    
neighborhoodCenter City East                        0.000399 ***
neighborhoodChestnut Hill                 0.0000000002286426 ***
neighborhoodClearview                               0.001287 ** 
neighborhoodCobbs Creek                             0.068636 .  
neighborhoodCrescentville                 0.0000119315120275 ***
neighborhoodDickinson Narrows                       0.003017 ** 
neighborhoodEast Germantown                         0.823794    
neighborhoodEast Mount Airy                         0.011193 *  
neighborhoodEast Oak Lane                           0.001078 ** 
neighborhoodEast Parkside                           0.218159    
neighborhoodEastwick                                0.051873 .  
neighborhoodElmwood                                 0.991748    
neighborhoodFairhill                      0.0000003279222426 ***
neighborhoodFairmount                     0.0000000406790723 ***
neighborhoodFeltonville                             0.069619 .  
neighborhoodFishtown - Lower Kensington   0.0000036798642764 ***
neighborhoodFitler Square                 0.0000110708341392 ***
neighborhoodFox Chase                               0.004992 ** 
neighborhoodFrancisville                            0.150611    
neighborhoodFrankford                     0.0000070259142674 ***
neighborhoodFranklin Mills                          0.003736 ** 
neighborhoodFranklinville                           0.030254 *  
neighborhoodGermantown - Morton                     0.000116 ***
neighborhoodGermantown - Penn Knox                  0.033016 *  
neighborhoodGirard Estates                          0.678959    
neighborhoodGlenwood                      0.0000347255376780 ***
neighborhoodGraduate Hospital                       0.000217 ***
neighborhoodGrays Ferry                             0.874051    
neighborhoodHaddington                              0.867723    
neighborhoodHarrowgate                              0.296013    
neighborhoodHartranft                               0.001048 ** 
neighborhoodHawthorne                               0.055810 .  
neighborhoodHolmesburg                  < 0.0000000000000002 ***
neighborhoodHunting Park                            0.665570    
neighborhoodIndustrial                    0.0000204252035669 ***
neighborhoodJuniata Park                            0.001184 ** 
neighborhoodKingsessing                             0.885035    
neighborhoodLawndale                                0.000232 ***
neighborhoodLogan                                   0.003581 ** 
neighborhoodLogan Square                            0.131303    
neighborhoodLower Moyamensing                       0.065071 .  
neighborhoodLudlow                                  0.099700 .  
neighborhoodMantua                                  0.139147    
neighborhoodMayfair                       0.0000000000000202 ***
neighborhoodMill Creek                              0.159408    
neighborhoodModena                      < 0.0000000000000002 ***
neighborhoodMorrell Park                            0.919510    
neighborhoodNicetown                                0.807593    
neighborhoodNorth Central                           0.005923 ** 
neighborhoodNortheast Phila Airport                 0.006323 ** 
neighborhoodNorthern Liberties                      0.004580 ** 
neighborhoodNorthwood                     0.0000000141059370 ***
neighborhoodOgontz                                  0.535116    
neighborhoodOld City                      0.0000000021256564 ***
neighborhoodOld Kensington                          0.000242 ***
neighborhoodOlney                         0.0000009306018136 ***
neighborhoodOverbrook                               0.034255 *  
neighborhoodOxford Circle                 0.0000000028781827 ***
neighborhoodParkwood Manor                          0.289982    
neighborhoodPaschall                                0.018821 *  
neighborhoodPassyunk Square               0.0000410319892452 ***
neighborhoodPennypack Park                0.0000000879837556 ***
neighborhoodPoint Breeze                  0.0000000043613604 ***
neighborhoodRhawnhurst                    0.0000608291612128 ***
neighborhoodRichmond                                0.085056 .  
neighborhoodRittenhouse                 < 0.0000000000000002 ***
neighborhoodRiverfront                              0.024810 *  
neighborhoodRoxborough                              0.000179 ***
neighborhoodSharswood                               0.000502 ***
neighborhoodSociety Hill                  0.0000000000006034 ***
neighborhoodSomerton                                0.003205 ** 
neighborhoodSouthwest Germantown                    0.063745 .  
neighborhoodSpring Garden                           0.397511    
neighborhoodStadium District                        0.707710    
neighborhoodStanton                                 0.630546    
neighborhoodStrawberry Mansion                      0.052427 .  
neighborhoodSummerdale                    0.0000000003324778 ***
neighborhoodTacony                                  0.427849    
neighborhoodTioga                                   0.400505    
neighborhoodTorresdale                              0.146469    
neighborhoodUniversity City                         0.000459 ***
neighborhoodUpper Kensington                        0.081618 .  
neighborhoodUpper Roxborough                        0.190477    
neighborhoodWalnut Hill                             0.485843    
neighborhoodWashington Square West        0.0000000000835124 ***
neighborhoodWest Kensington                         0.749753    
neighborhoodWest Mount Airy                         0.972475    
neighborhoodWest Oak Lane                           0.365567    
neighborhoodWest Parkside                           0.045430 *  
neighborhoodWest Powelton                           0.176762    
neighborhoodWest Torresdale               0.0000002029112975 ***
neighborhoodWissahickon Park                        0.094562 .  
neighborhoodWissinoming                             0.232015    
neighborhoodWoodland Terrace              0.0000000208596177 ***
neighborhoodWynnefield Heights                      0.778935    
neighborhoodYorktown                                0.089310 .  
spatial_lag_filings                       0.0000000001796708 ***
delinquent_prop_count                   < 0.0000000000000002 ***
pct_renter                              < 0.0000000000000002 ***
severe_rent_burden                                  0.050258 .  
poverty_rate                                        0.000748 ***
pct_single_mother                                   0.760736    
factor(racial_majority)Hispanic                     0.000290 ***
factor(racial_majority)Other                        0.001260 ** 
factor(racial_majority)White              0.0000000000344875 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(2.456) family taken to be 1)

    Null deviance: 32272  on 17279  degrees of freedom
Residual deviance: 18400  on 17147  degrees of freedom
AIC: 57706

Number of Fisher Scoring iterations: 1

              Theta:  2.4560 
          Std. Err.:  0.0686 

 2 x log-likelihood:  -57437.6890 

Model 4: Full Model with Interactions

Code
model_nb_4 <- glm.nb(filings_count_capped ~
                       # 1st model.
                       filings_ma3 + moratorium_active + factor(month_num) + is_extreme_spike + neighborhood +
                       # 2nd model.
                       spatial_lag_filings + delinquent_prop_count +
                       # 3rd model.
                       pct_renter + severe_rent_burden + poverty_rate + pct_single_mother +
                       factor(racial_majority) +
                       # Add interactions.
                       factor(racial_majority) * moratorium_active,
                     data = df_train)

cat("FULL NEGATIVE BINOMIAL MODEL SUMMARY\n")
FULL NEGATIVE BINOMIAL MODEL SUMMARY
Code
summary(model_nb_4)

Call:
glm.nb(formula = filings_count_capped ~ filings_ma3 + moratorium_active + 
    factor(month_num) + is_extreme_spike + neighborhood + spatial_lag_filings + 
    delinquent_prop_count + pct_renter + severe_rent_burden + 
    poverty_rate + pct_single_mother + factor(racial_majority) + 
    factor(racial_majority) * moratorium_active, data = df_train, 
    init.theta = 2.465189995, link = log)

Coefficients:
                                                    Estimate Std. Error z value
(Intercept)                                       -0.3248197  0.0838622  -3.873
filings_ma3                                        0.1080455  0.0032524  33.220
moratorium_active                                 -0.7519105  0.0281807 -26.682
factor(month_num)2                                -0.2630601  0.0414593  -6.345
factor(month_num)3                                 0.0488936  0.0400794   1.220
factor(month_num)4                                -0.6696615  0.0416854 -16.065
factor(month_num)5                                -0.4092353  0.0408380 -10.021
factor(month_num)6                                -0.3326693  0.0403722  -8.240
factor(month_num)7                                -0.1830087  0.0396631  -4.614
factor(month_num)8                                -0.0149657  0.0386996  -0.387
factor(month_num)9                                 0.0807294  0.0380454   2.122
factor(month_num)10                               -0.0006626  0.0374831  -0.018
factor(month_num)11                               -0.1440741  0.0378994  -3.801
factor(month_num)12                               -0.0850610  0.0376887  -2.257
is_extreme_spike                                   1.7676705  0.1098917  16.086
neighborhoodAcademy Gardens                        0.4680273  0.1577582   2.967
neighborhoodAllegheny West                        -0.4861105  0.1042729  -4.662
neighborhoodAndorra                               -3.7044683  1.0035887  -3.691
neighborhoodBartram Village                       -0.4749069  0.1197343  -3.966
neighborhoodBelmont                               -0.0055532  0.1364268  -0.041
neighborhoodBrewerytown                           -0.1048998  0.0839285  -1.250
neighborhoodBridesburg                            -0.5357079  0.1220534  -4.389
neighborhoodBurholme                               0.3446663  0.1405201   2.453
neighborhoodBustleton                              0.1048551  0.0769905   1.362
neighborhoodByberry                                0.6730944  0.1288547   5.224
neighborhoodCarroll Park                           0.1460024  0.1390734   1.050
neighborhoodCedar Park                             0.0569091  0.1738193   0.327
neighborhoodCedarbrook                             0.0005676  0.0969952   0.006
neighborhoodCenter City East                      -0.4180745  0.1183566  -3.532
neighborhoodChestnut Hill                         -0.6696446  0.1062139  -6.305
neighborhoodClearview                              0.2887634  0.0887244   3.255
neighborhoodCobbs Creek                            0.1407026  0.0756202   1.861
neighborhoodCrescentville                          0.6452426  0.1464806   4.405
neighborhoodDickinson Narrows                     -0.3720910  0.1264063  -2.944
neighborhoodEast Germantown                        0.0217092  0.0842761   0.258
neighborhoodEast Mount Airy                        0.1867841  0.0727120   2.569
neighborhoodEast Oak Lane                         -0.7370008  0.2256317  -3.266
neighborhoodEast Parkside                         -0.1329889  0.1105773  -1.203
neighborhoodEastwick                              -0.2262931  0.1166871  -1.939
neighborhoodElmwood                                0.0035989  0.1499660   0.024
neighborhoodFairhill                              -0.5885811  0.1156196  -5.091
neighborhoodFairmount                             -0.7180338  0.1306574  -5.496
neighborhoodFeltonville                            0.2488892  0.1356230   1.835
neighborhoodFishtown - Lower Kensington           -0.6614383  0.1429244  -4.628
neighborhoodFitler Square                         -0.8817042  0.2000285  -4.408
neighborhoodFox Chase                              0.3145977  0.1118466   2.813
neighborhoodFrancisville                          -0.1664610  0.1185074  -1.405
neighborhoodFrankford                              0.3985985  0.0884240   4.508
neighborhoodFranklin Mills                        -1.0399820  0.3580811  -2.904
neighborhoodFranklinville                         -0.2423228  0.1129082  -2.146
neighborhoodGermantown - Morton                   -0.6895901  0.1790879  -3.851
neighborhoodGermantown - Penn Knox                 0.2823489  0.1305342   2.163
neighborhoodGirard Estates                         0.0727690  0.1659803   0.438
neighborhoodGlenwood                              -0.5275492  0.1278454  -4.126
neighborhoodGraduate Hospital                     -0.4458133  0.1206666  -3.695
neighborhoodGrays Ferry                           -0.0095505  0.0727336  -0.131
neighborhoodHaddington                             0.0175236  0.0884526   0.198
neighborhoodHarrowgate                             0.1635468  0.1565196   1.045
neighborhoodHartranft                             -0.4085759  0.1256149  -3.253
neighborhoodHawthorne                             -0.2703531  0.1418869  -1.905
neighborhoodHolmesburg                             0.8400096  0.0934559   8.988
neighborhoodHunting Park                          -0.0686518  0.1878736  -0.365
neighborhoodIndustrial                             0.4121473  0.0965939   4.267
neighborhoodJuniata Park                           0.4744505  0.1469665   3.228
neighborhoodKingsessing                            0.0141586  0.0799532   0.177
neighborhoodLawndale                               0.6567920  0.1774442   3.701
neighborhoodLogan                                  0.2948786  0.0998634   2.953
neighborhoodLogan Square                          -0.2060583  0.1389616  -1.483
neighborhoodLower Moyamensing                     -0.2074244  0.1133818  -1.829
neighborhoodLudlow                                 0.2534097  0.1544635   1.641
neighborhoodMantua                                -0.2229635  0.1518729  -1.468
neighborhoodMayfair                                0.7186441  0.0937483   7.666
neighborhoodMill Creek                             0.2162096  0.1516117   1.426
neighborhoodModena                                 1.3920054  0.1384975  10.051
neighborhoodMorrell Park                          -0.0201722  0.2317480  -0.087
neighborhoodNicetown                               0.0286773  0.1025483   0.280
neighborhoodNorth Central                         -0.2864888  0.1051156  -2.725
neighborhoodNortheast Phila Airport                0.3729762  0.1350170   2.762
neighborhoodNorthern Liberties                    -0.3441610  0.1220899  -2.819
neighborhoodNorthwood                              0.5071661  0.0893051   5.679
neighborhoodOgontz                                 0.0940710  0.1459915   0.644
neighborhoodOld City                              -0.6999758  0.1172796  -5.968
neighborhoodOld Kensington                        -0.8603177  0.2341487  -3.674
neighborhoodOlney                                  0.3380187  0.0682557   4.952
neighborhoodOverbrook                              0.1691286  0.0784081   2.157
neighborhoodOxford Circle                          0.4665678  0.0785510   5.940
neighborhoodParkwood Manor                         0.1619696  0.1540848   1.051
neighborhoodPaschall                              -0.2091473  0.0902250  -2.318
neighborhoodPassyunk Square                       -0.7735358  0.1886362  -4.101
neighborhoodPennypack Park                         0.4205041  0.0783707   5.366
neighborhoodPoint Breeze                          -0.7099338  0.1213075  -5.852
neighborhoodRhawnhurst                             0.3200247  0.0798196   4.009
neighborhoodRichmond                               0.1350127  0.0787448   1.715
neighborhoodRittenhouse                           -1.0582831  0.1017593 -10.400
neighborhoodRiverfront                            -0.1798084  0.0801993  -2.242
neighborhoodRoxborough                            -0.4851839  0.1295713  -3.745
neighborhoodSharswood                             -0.3688774  0.1075890  -3.429
neighborhoodSociety Hill                          -0.9726251  0.1349939  -7.205
neighborhoodSomerton                               0.2952169  0.0992126   2.976
neighborhoodSouthwest Germantown                  -0.2157996  0.1176222  -1.835
neighborhoodSpring Garden                          0.1295709  0.1537993   0.842
neighborhoodStadium District                      -0.0805165  0.2202462  -0.366
neighborhoodStanton                               -0.0332452  0.0763496  -0.435
neighborhoodStrawberry Mansion                     0.2739033  0.1398906   1.958
neighborhoodSummerdale                             1.0123590  0.1611385   6.283
neighborhoodTacony                                 0.1320930  0.1645324   0.803
neighborhoodTioga                                 -0.0725610  0.0905149  -0.802
neighborhoodTorresdale                             0.2682252  0.1856364   1.445
neighborhoodUniversity City                       -0.4615446  0.1320569  -3.495
neighborhoodUpper Kensington                      -0.2904119  0.1733686  -1.675
neighborhoodUpper Roxborough                      -0.1678052  0.1249725  -1.343
neighborhoodWalnut Hill                            0.0558667  0.0759833   0.735
neighborhoodWashington Square West                -0.7470012  0.1151232  -6.489
neighborhoodWest Kensington                       -0.0373199  0.1311339  -0.285
neighborhoodWest Mount Airy                        0.0036632  0.1424805   0.026
neighborhoodWest Oak Lane                          0.0706722  0.0755976   0.935
neighborhoodWest Parkside                          0.2344305  0.1154254   2.031
neighborhoodWest Powelton                         -0.1508088  0.1140628  -1.322
neighborhoodWest Torresdale                        0.5626531  0.1074484   5.236
neighborhoodWissahickon Park                       0.1516284  0.0894946   1.694
neighborhoodWissinoming                            0.1916163  0.1583303   1.210
neighborhoodWoodland Terrace                      -0.6633801  0.1185524  -5.596
neighborhoodWynnefield Heights                    -0.0315577  0.1163597  -0.271
neighborhoodYorktown                              -0.2689479  0.1585172  -1.697
spatial_lag_filings                                0.0318017  0.0049681   6.401
delinquent_prop_count                              0.0011601  0.0000763  15.206
pct_renter                                         0.0165956  0.0007124  23.297
severe_rent_burden                                 0.0013658  0.0006901   1.979
poverty_rate                                      -0.0039244  0.0011564  -3.394
pct_single_mother                                  0.0003694  0.0011962   0.309
factor(racial_majority)Hispanic                   -0.2788499  0.0649017  -4.296
factor(racial_majority)Other                      -0.1557477  0.0393570  -3.957
factor(racial_majority)White                      -0.2877332  0.0471982  -6.096
moratorium_active:factor(racial_majority)Hispanic  0.1913971  0.0714723   2.678
moratorium_active:factor(racial_majority)Other     0.1267347  0.0462870   2.738
moratorium_active:factor(racial_majority)White    -0.0620417  0.0435842  -1.423
                                                              Pr(>|z|)    
(Intercept)                                                   0.000107 ***
filings_ma3                                       < 0.0000000000000002 ***
moratorium_active                                 < 0.0000000000000002 ***
factor(month_num)2                                  0.0000000002223956 ***
factor(month_num)3                                            0.222496    
factor(month_num)4                                < 0.0000000000000002 ***
factor(month_num)5                                < 0.0000000000000002 ***
factor(month_num)6                                < 0.0000000000000002 ***
factor(month_num)7                                  0.0000039483381727 ***
factor(month_num)8                                            0.698968    
factor(month_num)9                                            0.033844 *  
factor(month_num)10                                           0.985896    
factor(month_num)11                                           0.000144 ***
factor(month_num)12                                           0.024012 *  
is_extreme_spike                                  < 0.0000000000000002 ***
neighborhoodAcademy Gardens                                   0.003010 ** 
neighborhoodAllegheny West                          0.0000031329499555 ***
neighborhoodAndorra                                           0.000223 ***
neighborhoodBartram Village                         0.0000729852146261 ***
neighborhoodBelmont                                           0.967531    
neighborhoodBrewerytown                                       0.211346    
neighborhoodBridesburg                              0.0000113805825700 ***
neighborhoodBurholme                                          0.014175 *  
neighborhoodBustleton                                         0.173222    
neighborhoodByberry                                 0.0000001754113622 ***
neighborhoodCarroll Park                                      0.293800    
neighborhoodCedar Park                                        0.743363    
neighborhoodCedarbrook                                        0.995330    
neighborhoodCenter City East                                  0.000412 ***
neighborhoodChestnut Hill                           0.0000000002887952 ***
neighborhoodClearview                                         0.001135 ** 
neighborhoodCobbs Creek                                       0.062794 .  
neighborhoodCrescentville                           0.0000105799199585 ***
neighborhoodDickinson Narrows                                 0.003244 ** 
neighborhoodEast Germantown                                   0.796718    
neighborhoodEast Mount Airy                                   0.010205 *  
neighborhoodEast Oak Lane                                     0.001089 ** 
neighborhoodEast Parkside                                     0.229101    
neighborhoodEastwick                                          0.052463 .  
neighborhoodElmwood                                           0.980854    
neighborhoodFairhill                                0.0000003568024108 ***
neighborhoodFairmount                               0.0000000389499049 ***
neighborhoodFeltonville                                       0.066483 .  
neighborhoodFishtown - Lower Kensington             0.0000036941031252 ***
neighborhoodFitler Square                           0.0000104380975822 ***
neighborhoodFox Chase                                         0.004912 ** 
neighborhoodFrancisville                                      0.160126    
neighborhoodFrankford                               0.0000065500328111 ***
neighborhoodFranklin Mills                                    0.003681 ** 
neighborhoodFranklinville                                     0.031858 *  
neighborhoodGermantown - Morton                               0.000118 ***
neighborhoodGermantown - Penn Knox                            0.030539 *  
neighborhoodGirard Estates                                    0.661082    
neighborhoodGlenwood                                0.0000368388621264 ***
neighborhoodGraduate Hospital                                 0.000220 ***
neighborhoodGrays Ferry                                       0.895532    
neighborhoodHaddington                                        0.842957    
neighborhoodHarrowgate                                        0.296071    
neighborhoodHartranft                                         0.001144 ** 
neighborhoodHawthorne                                         0.056726 .  
neighborhoodHolmesburg                            < 0.0000000000000002 ***
neighborhoodHunting Park                                      0.714802    
neighborhoodIndustrial                              0.0000198292975115 ***
neighborhoodJuniata Park                                      0.001245 ** 
neighborhoodKingsessing                                       0.859440    
neighborhoodLawndale                                          0.000214 ***
neighborhoodLogan                                             0.003149 ** 
neighborhoodLogan Square                                      0.138116    
neighborhoodLower Moyamensing                                 0.067335 .  
neighborhoodLudlow                                            0.100885    
neighborhoodMantua                                            0.142079    
neighborhoodMayfair                                 0.0000000000000178 ***
neighborhoodMill Creek                                        0.153847    
neighborhoodModena                                < 0.0000000000000002 ***
neighborhoodMorrell Park                                      0.930637    
neighborhoodNicetown                                          0.779749    
neighborhoodNorth Central                                     0.006421 ** 
neighborhoodNortheast Phila Airport                           0.005737 ** 
neighborhoodNorthern Liberties                                0.004819 ** 
neighborhoodNorthwood                               0.0000000135464300 ***
neighborhoodOgontz                                            0.519343    
neighborhoodOld City                                0.0000000023953619 ***
neighborhoodOld Kensington                                    0.000239 ***
neighborhoodOlney                                   0.0000007336212283 ***
neighborhoodOverbrook                                         0.031003 *  
neighborhoodOxford Circle                           0.0000000028557960 ***
neighborhoodParkwood Manor                                    0.293180    
neighborhoodPaschall                                          0.020446 *  
neighborhoodPassyunk Square                         0.0000411946753022 ***
neighborhoodPennypack Park                          0.0000000806898321 ***
neighborhoodPoint Breeze                            0.0000000048467465 ***
neighborhoodRhawnhurst                              0.0000608859578170 ***
neighborhoodRichmond                                          0.086426 .  
neighborhoodRittenhouse                           < 0.0000000000000002 ***
neighborhoodRiverfront                                        0.024960 *  
neighborhoodRoxborough                                        0.000181 ***
neighborhoodSharswood                                         0.000607 ***
neighborhoodSociety Hill                            0.0000000000005806 ***
neighborhoodSomerton                                          0.002924 ** 
neighborhoodSouthwest Germantown                              0.066552 .  
neighborhoodSpring Garden                                     0.399526    
neighborhoodStadium District                                  0.714682    
neighborhoodStanton                                           0.663248    
neighborhoodStrawberry Mansion                                0.050232 .  
neighborhoodSummerdale                              0.0000000003330862 ***
neighborhoodTacony                                            0.422068    
neighborhoodTioga                                             0.422757    
neighborhoodTorresdale                                        0.148487    
neighborhoodUniversity City                                   0.000474 ***
neighborhoodUpper Kensington                                  0.093912 .  
neighborhoodUpper Roxborough                                  0.179357    
neighborhoodWalnut Hill                                       0.462188    
neighborhoodWashington Square West                  0.0000000000865749 ***
neighborhoodWest Kensington                                   0.775955    
neighborhoodWest Mount Airy                                   0.979488    
neighborhoodWest Oak Lane                                     0.349867    
neighborhoodWest Parkside                                     0.042254 *  
neighborhoodWest Powelton                                     0.186116    
neighborhoodWest Torresdale                         0.0000001636537881 ***
neighborhoodWissahickon Park                                  0.090213 .  
neighborhoodWissinoming                                       0.226190    
neighborhoodWoodland Terrace                        0.0000000219768638 ***
neighborhoodWynnefield Heights                                0.786231    
neighborhoodYorktown                                          0.089763 .  
spatial_lag_filings                                 0.0000000001541891 ***
delinquent_prop_count                             < 0.0000000000000002 ***
pct_renter                                        < 0.0000000000000002 ***
severe_rent_burden                                            0.047794 *  
poverty_rate                                                  0.000690 ***
pct_single_mother                                             0.757454    
factor(racial_majority)Hispanic                     0.0000173516370187 ***
factor(racial_majority)Other                        0.0000758004716086 ***
factor(racial_majority)White                        0.0000000010857173 ***
moratorium_active:factor(racial_majority)Hispanic             0.007408 ** 
moratorium_active:factor(racial_majority)Other                0.006181 ** 
moratorium_active:factor(racial_majority)White                0.154594    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(2.4652) family taken to be 1)

    Null deviance: 32316  on 17279  degrees of freedom
Residual deviance: 18401  on 17144  degrees of freedom
AIC: 57691

Number of Fisher Scoring iterations: 1

              Theta:  2.4652 
          Std. Err.:  0.0690 

 2 x log-likelihood:  -57417.1070 
Code
# Compare model fit statistics.
model_comparison <- data.frame(
  model = c("Baseline", "+ Tax and Spatial", "+ ACS", "+ Interactions"),
  aic = c(AIC(model_nb_1), AIC(model_nb_2), AIC(model_nb_3), AIC(model_nb_4)),
  theta = c(model_nb_1$theta, model_nb_2$theta, model_nb_3$theta, model_nb_4$theta),
  loglik = c(logLik(model_nb_1), logLik(model_nb_2), logLik(model_nb_3), logLik(model_nb_4))
) %>%
  mutate(
    aic_change = aic - min(aic),
    rank = rank(aic)
  )

model_comparison %>%
  kable(digits = 2, caption = "Model Comparison Statistics (Lower AIC = Better Fit)",
        col.names = c("Model", "AIC", "Theta", "Log Likelihood", "AIC Change", "Rank")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Model Comparison Statistics (Lower AIC = Better Fit)
Model AIC Theta Log Likelihood AIC Change Rank
Baseline 60201.13 1.65 -30084.56 2510.02 4
+ Tax and Spatial 59652.67 1.79 -29808.34 1961.57 3
+ ACS 57705.69 2.46 -28718.84 14.58 2
+ Interactions 57691.11 2.47 -28708.55 0.00 1

The addition of the policy interaction term (racial_majority * moratorium_active) resulted in the lowest AIC, making it the best-fitting model. However, for deployment, the 3rd model is best with lower MAE and a more simple model to interpret. filings_ma3 is a strong predicting for past filing volume and reinforcing that eviction risk is persistent. is_extreme_spike is significant showcasing tracts with a history of mass-eviction events face a structurally higher baseline risk at a whopping 5.8 IRR. spatial_lag_filings is positive and significant, showing that there is spatial spillover of eviction risk. delinquent_prop_count is also significant, and this reinforces the link between landlord financial instability and pressure to pursue eviction filings. The interaction terms also reveal a nuanced policy impact. While the moratorium reduced filings universally, the interaction coefficients indicate that the benefit was not uniform across racial groups, as what was expected.

4. MODEL VALIDATION AND DIAGNOSTICS

Train and Test Predictions

Generate predictions for all four models on both training and test sets.

Code
# TRAINING SET PREDICTIONS.
# Model 1: Baseline.
train_pred_1 <- df_train %>%
  mutate(
    predicted = predict(model_nb_1, newdata = ., type = "response"),
    residual = filings_count_capped - predicted,
    abs_error = abs(residual)
  )

# Model 2: + Tax/Spatial.
train_pred_2 <- df_train %>%
  mutate(
    predicted = predict(model_nb_2, newdata = ., type = "response"),
    residual = filings_count_capped - predicted,
    abs_error = abs(residual)
  )

# Model 3: + ACS.
train_pred_3 <- df_train %>%
  mutate(
    predicted = predict(model_nb_3, newdata = ., type = "response"),
    residual = filings_count_capped - predicted,
    abs_error = abs(residual)
  )

# Model 4: + Interactions.
train_pred_4 <- df_train %>%
  mutate(
    predicted = predict(model_nb_4, newdata = ., type = "response"),
    residual = filings_count_capped - predicted,
    abs_error = abs(residual)
  )

# TEST SET PREDICTIONS.
# Model 1: Baseline.
test_pred_1 <- df_test %>%
  mutate(
    predicted = predict(model_nb_1, newdata = ., type = "response"),
    residual = filings_count - predicted,
    abs_error = abs(residual)
  )

# Model 2: + Tax/Spatial.
test_pred_2 <- df_test %>%
  mutate(
    predicted = predict(model_nb_2, newdata = ., type = "response"),
    residual = filings_count - predicted,
    abs_error = abs(residual)
  )

# Model 3: + ACS.
test_pred_3 <- df_test %>%
  mutate(
    predicted = predict(model_nb_3, newdata = ., type = "response"),
    residual = filings_count - predicted,
    abs_error = abs(residual)
  )

# Model 4: + Interactions.
test_pred_4 <- df_test %>%
  mutate(
    predicted = predict(model_nb_4, newdata = ., type = "response"),
    residual = filings_count - predicted,
    abs_error = abs(residual)
  )

cat(sprintf("Training Set: %s observations\n", comma(nrow(train_pred_1))))
Training Set: 17,280 observations
Code
cat(sprintf("Test Set: %s observations\n", comma(nrow(test_pred_1))))
Test Set: 8,832 observations

Training Set Performance

Code
# Calculate training metrics for each model explicitly.
train_metrics_1 <- train_pred_1 %>%
  summarize(
    model = "Baseline",
    n = n(),
    MAE = mean(abs_error, na.rm = TRUE),
    RMSE = sqrt(mean(residual^2, na.rm = TRUE)),
    mean_observed = mean(filings_count_capped, na.rm = TRUE),
    mean_predicted = mean(predicted, na.rm = TRUE),
    correlation = cor(filings_count_capped, predicted, use = "complete.obs"),
    bias = mean(predicted - filings_count_capped, na.rm = TRUE)
  )

train_metrics_2 <- train_pred_2 %>%
  summarize(
    model = "+ Tax/Spatial",
    n = n(),
    MAE = mean(abs_error, na.rm = TRUE),
    RMSE = sqrt(mean(residual^2, na.rm = TRUE)),
    mean_observed = mean(filings_count_capped, na.rm = TRUE),
    mean_predicted = mean(predicted, na.rm = TRUE),
    correlation = cor(filings_count_capped, predicted, use = "complete.obs"),
    bias = mean(predicted - filings_count_capped, na.rm = TRUE)
  )

train_metrics_3 <- train_pred_3 %>%
  summarize(
    model = "+ ACS",
    n = n(),
    MAE = mean(abs_error, na.rm = TRUE),
    RMSE = sqrt(mean(residual^2, na.rm = TRUE)),
    mean_observed = mean(filings_count_capped, na.rm = TRUE),
    mean_predicted = mean(predicted, na.rm = TRUE),
    correlation = cor(filings_count_capped, predicted, use = "complete.obs"),
    bias = mean(predicted - filings_count_capped, na.rm = TRUE)
  )

train_metrics_4 <- train_pred_4 %>%
  summarize(
    model = "+ Interactions",
    n = n(),
    MAE = mean(abs_error, na.rm = TRUE),
    RMSE = sqrt(mean(residual^2, na.rm = TRUE)),
    mean_observed = mean(filings_count_capped, na.rm = TRUE),
    mean_predicted = mean(predicted, na.rm = TRUE),
    correlation = cor(filings_count_capped, predicted, use = "complete.obs"),
    bias = mean(predicted - filings_count_capped, na.rm = TRUE)
  )

# Combine training metrics.
train_metrics <- bind_rows(train_metrics_1, train_metrics_2, train_metrics_3, train_metrics_4)

# Display training performance.
train_metrics %>%
  kable(digits = 3, caption = "Training Set Performance Across Models",
        col.names = c("Model", "Number", "MAE", "RMSE", "Mean Observed", "Mean Predicted", "Correlation", "Bias")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Training Set Performance Across Models
Model Number MAE RMSE Mean Observed Mean Predicted Correlation Bias
Baseline 17280 1.967 11.455 2.097 2.573 0.295 0.476
+ Tax/Spatial 17280 1.780 7.796 2.097 2.419 0.346 0.322
+ ACS 17280 1.480 3.192 2.097 2.205 0.545 0.108
+ Interactions 17280 1.478 3.204 2.097 2.204 0.544 0.107

Test Set Performance

Code
# Calculate test metrics for each model explicitly.
test_metrics_1 <- test_pred_1 %>%
  summarize(
    model = "Baseline",
    n = n(),
    MAE = mean(abs_error, na.rm = TRUE),
    RMSE = sqrt(mean(residual^2, na.rm = TRUE)),
    mean_observed = mean(filings_count, na.rm = TRUE),
    mean_predicted = mean(predicted, na.rm = TRUE),
    correlation = cor(filings_count, predicted, use = "complete.obs"),
    bias = mean(predicted - filings_count, na.rm = TRUE)
  )

test_metrics_2 <- test_pred_2 %>%
  summarize(
    model = "+ Tax/Spatial",
    n = n(),
    MAE = mean(abs_error, na.rm = TRUE),
    RMSE = sqrt(mean(residual^2, na.rm = TRUE)),
    mean_observed = mean(filings_count, na.rm = TRUE),
    mean_predicted = mean(predicted, na.rm = TRUE),
    correlation = cor(filings_count, predicted, use = "complete.obs"),
    bias = mean(predicted - filings_count, na.rm = TRUE)
  )

test_metrics_3 <- test_pred_3 %>%
  summarize(
    model = "+ ACS",
    n = n(),
    MAE = mean(abs_error, na.rm = TRUE),
    RMSE = sqrt(mean(residual^2, na.rm = TRUE)),
    mean_observed = mean(filings_count, na.rm = TRUE),
    mean_predicted = mean(predicted, na.rm = TRUE),
    correlation = cor(filings_count, predicted, use = "complete.obs"),
    bias = mean(predicted - filings_count, na.rm = TRUE)
  )

test_metrics_4 <- test_pred_4 %>%
  summarize(
    model = "+ Interactions",
    n = n(),
    MAE = mean(abs_error, na.rm = TRUE),
    RMSE = sqrt(mean(residual^2, na.rm = TRUE)),
    mean_observed = mean(filings_count, na.rm = TRUE),
    mean_predicted = mean(predicted, na.rm = TRUE),
    correlation = cor(filings_count, predicted, use = "complete.obs"),
    bias = mean(predicted - filings_count, na.rm = TRUE)
  )

# Combine test metrics.
test_metrics <- bind_rows(test_metrics_1, test_metrics_2, test_metrics_3, test_metrics_4)

# Display test performance.
test_metrics %>%
  kable(digits = 3, caption = "Test Set Performance Across Models",
        col.names = c("Model", "Number", "MAE", "RMSE", "Mean Observed", "Mean Predicted", "Correlation", "Bias")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Test Set Performance Across Models
Model Number MAE RMSE Mean Observed Mean Predicted Correlation Bias
Baseline 8832 2.127 8.330 2.638 3.158 0.406 0.520
+ Tax/Spatial 8832 1.981 5.714 2.638 2.996 0.450 0.358
+ ACS 8832 1.766 3.032 2.638 2.835 0.567 0.197
+ Interactions 8832 1.766 3.047 2.638 2.833 0.566 0.195

Overfitting Diagnostic Check

A critical concern in predictive modeling is overfitting, where the model learns noise in the training data rather than generalizable patterns. We assess this by comparing train vs test performance.

Code
# Compare train vs test performance
overfitting_check <- train_metrics %>%
  select(model, MAE_train = MAE, RMSE_train = RMSE, corr_train = correlation) %>%
  left_join(
    test_metrics %>% select(model, MAE_test = MAE, RMSE_test = RMSE, corr_test = correlation),
    by = "model"
  ) %>%
  mutate(
    MAE_change = (MAE_test - MAE_train) / MAE_train * 100,
    RMSE_change = (RMSE_test - RMSE_train) / RMSE_train * 100,
    corr_change = (corr_train - corr_test) / corr_train * 100
  )

overfitting_check %>%
  kable(digits = 2, caption = "Train vs Test Performance: Overfitting Assessment",
        col.names = c("Model", "Training MAE", "Training RMSE", "Training Correlation", "Testing MAE", "Testing RMSE", "Testing Correlation", "MAE Change", "RMSE Change", "Correlation Change")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Train vs Test Performance: Overfitting Assessment
Model Training MAE Training RMSE Training Correlation Testing MAE Testing RMSE Testing Correlation MAE Change RMSE Change Correlation Change
Baseline 1.97 11.45 0.30 2.13 8.33 0.41 8.15 -27.28 -37.67
+ Tax/Spatial 1.78 7.80 0.35 1.98 5.71 0.45 11.29 -26.71 -30.09
+ ACS 1.48 3.19 0.55 1.77 3.03 0.57 19.36 -5.00 -3.99
+ Interactions 1.48 3.20 0.54 1.77 3.05 0.57 19.44 -4.89 -3.93

The worst MAE degradation across models is 19.44%, so there is good generalization. In addition, the test MAE for the best model, the ACS addition, is off by 1.77 filings and accounts for 57% of variation. This is a good, foundational start considering this study is measuring a response variable that is very tied to human behavior.

Select Best Model for Further Analysis

Based on the model comparison and overfitting assessment, we select the best-performing model for detailed diagnostics. Model 3 with ACS provides the best balance of fit and generalization and interpretation for officials.

Code
# Use Model 3 for downstream analysis, this is for operational deployment.
df_test_pred <- test_pred_3
df_train_pred <- train_pred_3

# Extract scalar metrics for the selected model.
selected_test_metrics <- test_metrics_3

cat("SELECTED MODEL: Model 3 (+ ACS)\n")
SELECTED MODEL: Model 3 (+ ACS)
Code
cat(sprintf("Test MAE: %.3f\n", selected_test_metrics$MAE))
Test MAE: 1.766
Code
cat(sprintf("Test RMSE: %.3f\n", selected_test_metrics$RMSE))
Test RMSE: 3.032
Code
cat(sprintf("Test Correlation: %.3f\n", selected_test_metrics$correlation))
Test Correlation: 0.567
Code
# Create scatter plot of predicted vs observed values using selected model.
pred_obs_plot <- df_test_pred %>%
  ggplot(aes(x = predicted, y = filings_count)) +
  geom_hex(bins = 50) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "#E74C3C", linewidth = 1) +
  scale_fill_viridis_c(option = "plasma", trans = "log10", labels = comma) +
  scale_x_continuous(limits = c(0, 155)) +
  scale_y_continuous(limits = c(0, 50)) +
  labs(
    title = "Predicted vs Observed Eviction Filings (Test Set | Model 3)",
    subtitle = sprintf("MAE = %.2f | RMSE = %.2f | Correlation = %.3f",
                       selected_test_metrics$MAE, 
                       selected_test_metrics$RMSE, 
                       selected_test_metrics$correlation),
    x = "Predicted Filings",
    y = "Observed Filings",
    fill = "Count",
    caption = "Data: Philadelphia Eviction Filings 2024-2025"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "right"
  )

# Display predicted vs observed plot.
pred_obs_plot

Performance by Racial Majority

Code
# Calculate test set metrics by racial majority.
test_metrics_by_race <- df_test_pred %>%
  filter(!is.na(racial_majority), racial_majority != "") %>%
  group_by(racial_majority) %>%
  summarize(
    n = n(),
    mae = mean(abs_error, na.rm = TRUE),
    rmse = sqrt(mean(residual^2, na.rm = TRUE)),
    mean_observed = mean(filings_count, na.rm = TRUE),
    mean_predicted = mean(predicted, na.rm = TRUE),
    mean_error = mean(residual, na.rm = TRUE),
    .groups = "drop"
  )

test_metrics_by_race %>%
  kable(digits = 2, caption = "Test Set Performance by Tract Racial Majority",
        col.names = c("Racial Majority", "Number", "MAE", "RMSE", "Mean Observed", "Mean Predicted", "Mean Error")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Test Set Performance by Tract Racial Majority
Racial Majority Number MAE RMSE Mean Observed Mean Predicted Mean Error
Black 3289 2.22 3.18 3.63 3.94 -0.31
Hispanic 529 1.57 2.03 2.51 2.77 -0.27
Other 1656 1.86 3.21 2.77 3.07 -0.30
White 3358 1.31 2.92 1.62 1.65 -0.03

The model, mirroring real-life eviction biases, also has a larger MAE for Black communities, with white communities having the lowest MAE.

Residual Analysis

Code
# Create residual distribution plot.
residual_plot <- df_test_pred %>%
  ggplot(aes(x = residual)) +
  geom_histogram(bins = 50, fill = "#3498DB", color = "white", alpha = 1) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "#E74C3C", linewidth = 1) +
  scale_x_continuous(limits = c(-15, 15)) +
  labs(
    title = "Distribution of Prediction Residuals (Test Set)",
    subtitle = "Centered near zero indicates unbiased predictions.",
    x = "Residual (Observed - Predicted)",
    y = "Frequency",
    caption = "Data: Philadelphia Eviction Filings 2024-2025"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold", size = 14))

# Display residual plot.
residual_plot

5. RISK CLASSIFICATION SYSTEM

Creating Risk Quintiles

Code
# Create risk quintile classification based on predicted values.
df_test_risk <- df_test_pred %>%
  mutate(
    risk_quintile = ntile(predicted, 5),
    risk_category = case_when(
      risk_quintile == 1 ~ "Lowest Risk",
      risk_quintile == 2 ~ "Low Risk",
      risk_quintile == 3 ~ "Moderate Risk",
      risk_quintile == 4 ~ "High Risk",
      risk_quintile == 5 ~ "Highest Risk"
    ),
    risk_category = factor(risk_category, 
                          levels = c("Lowest Risk", "Low Risk", "Moderate Risk", 
                                    "High Risk", "Highest Risk"))
  )

# Calculate observed filings by risk category.
risk_validation <- df_test_risk %>%
  group_by(risk_category) %>%
  summarize(
    n_tract_months = n(),
    mean_observed = mean(filings_count, na.rm = TRUE),
    median_observed = median(filings_count, na.rm = TRUE),
    total_filings = sum(filings_count, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(pct_total_filings = total_filings / sum(total_filings) * 100)

# Display risk validation table.
risk_validation %>%
  kable(digits = 2, caption = "Observed Filings by Predicted Risk Category",
        col.names = c("Risk Category", "Observations", "Mean Observed", "Median Observed", "Filings", "% Filings")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Observed Filings by Predicted Risk Category
Risk Category Observations Mean Observed Median Observed Filings % Filings
Lowest Risk 1767 0.66 0 1169 5.02
Low Risk 1767 1.27 1 2251 9.66
Moderate Risk 1766 2.18 2 3854 16.54
High Risk 1766 3.26 3 5761 24.73
Highest Risk 1766 5.81 5 10263 44.05
Code
# Create bar plot showing observed filings by risk category.
risk_bar_plot <- risk_validation %>%
  ggplot(aes(x = risk_category, y = mean_observed, fill = risk_category)) +
  geom_col(width = 0.7) +
  geom_text(aes(label = sprintf("%.1f%%", pct_total_filings)), 
            vjust = -0.5, size = 4) +
  scale_fill_viridis_d(option = "plasma", direction = -1) +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Mean Observed Filings by Predicted Risk Category",
    subtitle = "Higher predicted risk categories correctly capture higher observed filing rates.",
    x = "Predicted Risk Category",
    y = "Mean Observed Filings per Tract-Month",
    caption = "Data: Philadelphia Eviction Filings 2024-2025"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    axis.text.x = element_text(angle = 0, hjust = 0.5),
    legend.position = "none"
  )

# Display risk bar plot.
risk_bar_plot

High-Risk Tract Profile

Code
# Profile characteristics of highest risk tract-months.
high_risk_profile <- df_test_risk %>%
  filter(risk_category == "Highest Risk") %>%
  group_by(racial_majority) %>%
  summarize(
    n_tract_months = n(),
    pct_of_high_risk = n() / nrow(df_test_risk %>% filter(risk_category == "Highest Risk")) * 100,
    mean_filings = mean(filings_count, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(n_tract_months))

high_risk_profile %>%
  kable(digits = 1, caption = "Racial Composition of Highest Risk Tract-Months",
        col.names = c("Racial Majority", "Observations", "% High Risk", "Mean Filings")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Racial Composition of Highest Risk Tract-Months
Racial Majority Observations % High Risk Mean Filings
Black 1191 67.4 5.7
Other 341 19.3 5.3
White 148 8.4 8.9
Hispanic 86 4.9 4.2

6. MODEL INTERPRETATION

Feature Importance

Code
# Get standardized coefficients for interpretation.
feature_importance <- tidy(model_nb_3) %>%
  filter(term != "(Intercept)") %>%
  mutate(
    abs_estimate = abs(estimate),
    direction = ifelse(estimate > 0, "Positive", "Negative"),
    category = case_when(
      str_detect(term, "filings_") ~ "Temporal",
      str_detect(term, "spatial_lag_filings") ~ "Spatial",
      str_detect(term, "log_avg_balance") ~ "Tax Delinquency",
      str_detect(term, "pct_renter") |
        str_detect(term, "poverty_rate") |
        str_detect(term, "severe_rent_burden") |
        str_detect(term, "unemployment_rate") |
        str_detect(term, "log_median_income") |
        str_detect(term, "pct_single_mother") ~ "Socioeconomic (ACS)",
      str_detect(term, "factor\\(month_num\\)") ~ "Seasonality",
      str_detect(term, "moratorium") ~ "Policy",
      str_detect(term, "factor\\(racial_majority\\)") ~ "Demographic",
      str_detect(term, "neighborhood") ~ "Neighborhood",
      TRUE ~ "Other"
    )
  ) %>%
  arrange(desc(abs_estimate)) %>%
  head(10)

# Create feature importance plot.
importance_plot <- feature_importance %>%
  ggplot(aes(x = reorder(term, abs_estimate), y = estimate, fill = category)) +
  geom_col(width = 0.7) +
  coord_flip() +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Top 10 Most Important Predictors",
    subtitle = "Coefficient magnitude indicates effect size on log(expected filings).",
    x = "Predictor Variable",
    y = "Coefficient Estimate",
    fill = "Category",
    caption = "Negative Binomial Regression Coefficients"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "bottom"
  )

# Display importance plot.
importance_plot

Code
# Get standardized coefficients for interpretation.
all_coefficients <- tidy(model_nb_3) %>%
  # Remove intercept term.
  filter(term != "(Intercept)") %>%
  # Filter out terms with "neighborhood".
  filter(!str_detect(term, "neighborhood")) %>%
  mutate(
    abs_estimate = abs(estimate),
    direction = ifelse(estimate > 0, "Positive", "Negative"),
    category = case_when(
      str_detect(term, "filings_") ~ "Temporal",
      str_detect(term, "spatial_lag_filings") ~ "Spatial",
      str_detect(term, "log_avg_balance") ~ "Tax Delinquency",
      str_detect(term, "pct_renter") |
        str_detect(term, "poverty_rate") |
        str_detect(term, "severe_rent_burden") |
        str_detect(term, "unemployment_rate") |
        str_detect(term, "log_median_income") |
        str_detect(term, "pct_single_mother") ~ "Socioeconomic (ACS)",
      str_detect(term, "factor\\(month_num\\)") ~ "Seasonality",
      str_detect(term, "moratorium") ~ "Policy",
      str_detect(term, "factor\\(racial_majority\\)") ~ "Demographic",
      TRUE ~ "Other"
    )
  ) %>%
  arrange(desc(abs_estimate)) %>%
  head(10)

# Create feature importance plot.
other_importance_plot <- all_coefficients %>%
  ggplot(aes(x = reorder(term, abs_estimate), y = estimate, fill = category)) +
  geom_col(width = 0.7) +
  coord_flip() +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "All Non-Neighborhood Predictors",
    subtitle = "Coefficient magnitude indicates effect size on log(expected filings).",
    x = "Predictor Variable",
    y = "Coefficient Estimate",
    fill = "Category"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "bottom"
  )

# Display importance plot.
other_importance_plot

Code
# Specifically analyze contribution of ACS features.
acs_contribution <- tidy(model_nb_3) %>%
  filter(str_detect(term, "pct_renter|poverty_rate|severe_rent_burden|unemployment_rate")) %>%
  mutate(
    irr = exp(estimate),
    pct_change = (irr - 1) * 100,
    significant = ifelse(p.value < 0.05, "Yes", "No")
  )%>%
  rename(standard_error = std.error, p_value = p.value, percent_change = pct_change) %>%
  select(term, estimate, standard_error, p_value, irr, percent_change, significant)

acs_contribution %>%
  kable(digits = 3, caption = "ACS Socioeconomic Feature Effects",
        col.names = c("Term", "Estimate", "Standard Error", "P-Value", "IRR", "% Change", "Significant")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
ACS Socioeconomic Feature Effects
Term Estimate Standard Error P-Value IRR % Change Significant
pct_renter 0.017 0.001 0.000 1.017 1.672 Yes
severe_rent_burden 0.001 0.001 0.050 1.001 0.135 No
poverty_rate -0.004 0.001 0.001 0.996 -0.390 Yes

Incidence Rate Ratios

Code
# Calculate incidence rate ratios for interpretation.
irr_table <- tidy(model_nb_3) %>%
  mutate(
    irr = exp(estimate),
    # 1.96 is z-score for 95% confidence level.
    irr_lower = exp(estimate - 1.96 * std.error),
    irr_upper = exp(estimate + 1.96 * std.error),
    pct_change = (irr - 1) * 100
  ) %>%
  filter(term != "(Intercept)") %>%
  rename(p_value = p.value) %>%
  select(term, irr, irr_lower, irr_upper, pct_change, p_value)

irr_table %>%
  head(15) %>%
  kable(digits = 3, caption = "Incidence Rate Ratios (IRR) for Key Predictors",
        col.names = c("Term", "IRR", "IRR Lower Confidence", "IRR Upper Confidence", "% Change", "P-Value")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Incidence Rate Ratios (IRR) for Key Predictors
Term IRR IRR Lower Confidence IRR Upper Confidence % Change P-Value
filings_ma3 1.114 1.107 1.121 11.432 0.000
moratorium_active 0.483 0.464 0.502 -51.732 0.000
factor(month_num)2 0.769 0.709 0.834 -23.086 0.000
factor(month_num)3 1.050 0.971 1.136 5.018 0.222
factor(month_num)4 0.512 0.472 0.555 -48.830 0.000
factor(month_num)5 0.663 0.612 0.719 -33.654 0.000
factor(month_num)6 0.717 0.662 0.776 -28.344 0.000
factor(month_num)7 0.833 0.770 0.900 -16.747 0.000
factor(month_num)8 0.985 0.913 1.063 -1.458 0.704
factor(month_num)9 1.084 1.006 1.168 8.369 0.035
factor(month_num)10 0.999 0.928 1.076 -0.069 0.985
factor(month_num)11 0.866 0.804 0.932 -13.442 0.000
factor(month_num)12 0.917 0.852 0.988 -8.256 0.022
is_extreme_spike 5.808 4.681 7.205 480.753 0.000
neighborhoodAcademy Gardens 1.598 1.173 2.176 59.789 0.003

filings_ma3: For every one-unit increase in the 3-month moving average of past eviction filings, the expected number of current eviction filings increases by a factor of 1.114 or 11.4%. So a history of higher recent filings strongly predicts higher current filings.

moratorium_active: When an eviction moratorium is active the expected number of eviction filings is multiplied by 0.483 or is 51.7% lower compared to when the moratorium is not active.

is_extreme_spike: If a time period is classified as having an extreme spike in filings, the expected number of eviction filings is nearly six times higher multiplied by 5.808, or increases by a staggering 481%, compared to a period without a spike.

neighborhoodAcademy Gardens: Compared to the baseline East Falls neighborhood, Academy Gardens is expected to have 59.8% more eviction filings.

Code
# Create actionable output for city agencies with intervention thresholds.
operational_output <- df_test_pred %>%
  filter(predicted >= 5) %>%
  group_by(GEOID) %>%
  summarize(
    predicted_filings_next_month = mean(predicted),
    n_high_risk_months = n(),
    .groups = "drop"
  ) %>%
  left_join(df_acs, by = "GEOID") %>%
  arrange(desc(predicted_filings_next_month))

operational_output
# A tibble: 107 × 16
   GEOID       predicted_filings_nex…¹ n_high_risk_months pct_renter median_rent
   <chr>                         <dbl>              <int>      <dbl>       <dbl>
 1 42101021800                   23.7                  21       71.3        1558
 2 42101030100                   17.0                  21       45.5         749
 3 42101025200                   14.3                  22       57.6         879
 4 42101036100                   12.8                  17       26.7        1639
 5 42101015200                   10.7                  22       68.3         961
 6 42101025700                   10.5                   4       71.5        1737
 7 42101024000                   10.5                  22       72.2        1098
 8 42101023900                    9.78                 22       92.7        1500
 9 42101037700                    9.73                 15       85.9        1159
10 42101008500                    9.45                 23       64.2        1028
# ℹ 97 more rows
# ℹ abbreviated name: ¹​predicted_filings_next_month
# ℹ 11 more variables: rent_burden_pct <dbl>, median_income <dbl>,
#   log_median_income <dbl>, poverty_rate <dbl>, unemployment_rate <dbl>,
#   pct_college <dbl>, total_pop <dbl>, log_pop <dbl>, median_age <dbl>,
#   pct_single_mother <dbl>, severe_rent_burden <dbl>

By integrating temporal, spatial, and socioeconomic indicators, the best model achieves a prediction error of < 2 filings per tract, outperforming the baseline model. Also, the model successfully identifies the “Highest Risk” tracts that account for the vast majority of displacement. This tool offers the City of Philadelphia a concrete mechanism to shift from reactive to proactive, targeting aid to the specific blocks where it is needed most and ideally before any landlord-tenant tensions escalate into expensive legal battles.

III. EQUITY ASSESSMENT

1. Impact Analysis

Code
# Calculate prediction errors by racial majority to assess disparate impact.
equity_metrics <- df_test_pred %>%
  filter(!is.na(racial_majority), racial_majority != "") %>%
  group_by(racial_majority) %>%
  summarize(
    n_obs = n(),
    mean_observed = mean(filings_count, na.rm = TRUE),
    mean_predicted = mean(predicted, na.rm = TRUE),
    mae = mean(abs_error, na.rm = TRUE),
    mean_error = mean(residual, na.rm = TRUE),
    under_prediction_rate = sum(residual > 0) / n() * 100,
    .groups = "drop"
  )

# Display equity metrics table.
equity_metrics %>%
  kable(digits = 2, caption = "Model Performance Metrics by Tract Racial Majority",
        col.names = c("Racial Majority", "Observations", "Mean Observed", "Mean Predicted", "MAE", "Mean Error", "Under-Prediction Rate")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Model Performance Metrics by Tract Racial Majority
Racial Majority Observations Mean Observed Mean Predicted MAE Mean Error Under-Prediction Rate
Black 3289 3.63 3.94 2.22 -0.31 39.68
Hispanic 529 2.51 2.77 1.57 -0.27 41.59
Other 1656 2.77 3.07 1.86 -0.30 36.78
White 3358 1.62 1.65 1.31 -0.03 37.25
Code
# Create plot comparing MAE across racial categories.
equity_plot <- equity_metrics %>%
  ggplot(aes(x = racial_majority, y = mae, fill = racial_majority)) +
  geom_col(width = 0.7) +
  geom_text(aes(label = sprintf("%.2f", mae)), vjust = -0.5, size = 4) +
  scale_fill_manual(values = racial_colors) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Model Accuracy (MAE) by Tract Racial Majority",
    subtitle = "Lower values indicate better prediction accuracy.",
    x = "Tract Racial Majority",
    y = "Mean Absolute Error",
    caption = "Data: Philadelphia Eviction Filings 2024-2025"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "none"
  )

# Display equity plot.
equity_plot

2. Prediction Calibration by Race

Code
# Create calibration plot showing predicted vs observed by racial majority.
calibration_plot <- df_test_pred %>%
  filter(!is.na(racial_majority), racial_majority != "") %>%
  mutate(predicted_bin = cut(predicted, breaks = seq(0, 15, by = 1), include.lowest = TRUE)) %>%
  filter(!is.na(predicted_bin)) %>%
  group_by(racial_majority, predicted_bin) %>%
  summarize(
    mean_predicted = mean(predicted, na.rm = TRUE),
    mean_observed = mean(filings_count, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  ) %>%
  filter(n >= 10) %>%
  ggplot(aes(x = mean_predicted, y = mean_observed, color = racial_majority)) +
  geom_point(aes(size = n), alpha = 0.75) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black") +
  scale_color_manual(values = racial_colors) +
  scale_size_continuous(range = c(2, 8)) +
  labs(
    title = "Model by Tract Racial Majority",
    subtitle = "Points on diagonal indicate good predictions.",
    x = "Mean Predicted Filings",
    y = "Mean Observed Filings",
    color = "Racial\nMajority",
    size = "Observations",
    caption = "Data: Philadelphia Eviction Filings 2024-2025"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "right"
  )

# Display calibration plot.
calibration_plot

3. Risk Category Distribution by Race

Code
# Calculate risk category distribution by racial majority.
risk_by_race <- df_test_risk %>%
  filter(!is.na(racial_majority), racial_majority != "") %>%
  group_by(racial_majority, risk_category) %>%
  summarize(n = n(), .groups = "drop") %>%
  group_by(racial_majority) %>%
  mutate(pct = n / sum(n) * 100) %>%
  ungroup()

# Create stacked bar plot.
risk_race_plot <- risk_by_race %>%
  ggplot(aes(x = racial_majority, y = pct, fill = risk_category)) +
  geom_col(width = 0.7) +
  scale_fill_viridis_d(option = "plasma", direction = -1) +
  scale_y_continuous(labels = percent_format(scale = 1)) +
  labs(
    title = "Risk Distribution by Tract Racial Majority",
    subtitle = "Black-majority tracts classified as high risk.",
    x = "Tract Racial Majority",
    y = "Percentage of Tract-Months",
    fill = "Risk\nCategory",
    caption = "Data: Philadelphia Eviction Filings 2024-2025"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "right"
  )

# Display risk by race plot.
risk_race_plot

Code
# Create equity comparison plot.
equity_plot <- equity_metrics %>%
  pivot_longer(cols = c(mean_observed, mean_predicted), 
               names_to = "type", values_to = "value") %>%
  mutate(type = ifelse(type == "mean_observed", "Observed", "Predicted")) %>%
  ggplot(aes(x = racial_majority, y = value, fill = type)) +
  geom_col(position = "dodge", width = 0.7) +
  scale_fill_manual(values = c("Observed" = "#E74C3C", "Predicted" = "#3498DB")) +
  labs(
    title = "Model Calibration by Racial Majority: Observed vs Predicted",
    subtitle = "Close alignment indicates equitable prediction accuracy across communities.",
    x = "Tract Racial Majority",
    y = "Mean Filings per Tract-Month",
    fill = "Type",
    caption = "Data: Philadelphia Eviction Filings 2024-2025"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "bottom"
  )

# Display equity plot.
equity_plot

Future modifications need to be made to account for Black communities who account for the majority of Philadelphia’s population, despite predicting Black communities to be at more risk, there is the simultaneous higher MAE, meaning that this tool is fit as long as there is human oversight and it is not used for absolute decision-making. It struggles to capture the full issues that come from structural racism, so going into detail with the resource allocation it should be legal aid or cash assistance supportive survices, never property or police enforcement.

SESSION INFORMATION

Code
# R session info.
sessionInfo()
R version 4.5.1 (2025-06-13 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] knitr_1.50        forcats_1.0.0     dplyr_1.1.4       purrr_1.1.0      
 [5] readr_2.1.5       tibble_3.3.0      tidyverse_2.0.0   tidyr_1.3.1      
 [9] stringr_1.5.1     glue_1.8.0        jsonlite_2.0.0    httr_1.4.7       
[13] tidycensus_1.7.3  spdep_1.4-1       spData_2.3.4      pROC_1.19.0.1    
[17] caret_7.0-1       lattice_0.22-7    ggplot2_3.5.2     car_3.1-3        
[21] carData_3.0-5     MASS_7.3-65       zoo_1.8-14        corrplot_0.95    
[25] kableExtra_1.4.0  viridis_0.6.5     viridisLite_0.4.2 patchwork_1.3.2  
[29] scales_1.4.0      broom_1.0.9       sf_1.0-21         lubridate_1.9.4  

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3   rstudioapi_0.17.1    wk_0.9.4            
  [4] magrittr_2.0.3       TH.data_1.1-4        farver_2.1.2        
  [7] rmarkdown_2.29       vctrs_0.6.5          htmltools_0.5.8.1   
 [10] curl_7.0.0           s2_1.1.9             Formula_1.2-5       
 [13] LearnBayes_2.15.1    parallelly_1.45.1    KernSmooth_2.23-26  
 [16] htmlwidgets_1.6.4    plyr_1.8.9           sandwich_3.1-1      
 [19] uuid_1.2-1           igraph_2.2.1         lifecycle_1.0.4     
 [22] iterators_1.0.14     pkgconfig_2.0.3      Matrix_1.7-3        
 [25] R6_2.6.1             fastmap_1.2.0        future_1.67.0       
 [28] digest_0.6.37        ps_1.9.1             textshaping_1.0.1   
 [31] labeling_0.4.3       spatialreg_1.4-2     timechange_0.3.0    
 [34] abind_1.4-8          mgcv_1.9-3           compiler_4.5.1      
 [37] proxy_0.4-27         withr_3.0.2          backports_1.5.0     
 [40] DBI_1.2.3            hexbin_1.28.5        lava_1.8.1          
 [43] rappdirs_0.3.3       classInt_0.4-11      ModelMetrics_1.2.2.2
 [46] tools_4.5.1          chromote_0.5.1       units_0.8-7         
 [49] future.apply_1.20.0  nnet_7.3-20          nlme_3.1-168        
 [52] promises_1.3.3       grid_4.5.1           reshape2_1.4.4      
 [55] generics_0.1.4       recipes_1.3.1        gtable_0.3.6        
 [58] tzdb_0.5.0           class_7.3-23         websocket_1.4.4     
 [61] data.table_1.17.8    hms_1.1.3            utf8_1.2.6          
 [64] sp_2.2-0             xml2_1.4.0           foreach_1.5.2       
 [67] pillar_1.11.1        later_1.4.4          splines_4.5.1       
 [70] survival_3.8-3       deldir_2.0-4         tidyselect_1.2.1    
 [73] gridExtra_2.3        svglite_2.2.1        stats4_4.5.1        
 [76] xfun_0.53            hardhat_1.4.2        timeDate_4041.110   
 [79] stringi_1.8.7        yaml_2.3.10          boot_1.3-31         
 [82] evaluate_1.0.4       codetools_0.2-20     cli_3.6.5           
 [85] rpart_4.1.24         systemfonts_1.2.3    processx_3.8.6      
 [88] Rcpp_1.1.0           globals_0.18.0       coda_0.19-4.1       
 [91] parallel_4.5.1       gower_1.0.2          listenv_0.9.1       
 [94] mvtnorm_1.3-3        tigris_2.2.1.9000    ipred_0.9-15        
 [97] prodlim_2025.04.28   e1071_1.7-16         crayon_1.5.3        
[100] rlang_1.1.6          rvest_1.0.4          multcomp_1.4-29     

DATA DOWNLOADS

Eviction Lab Main Data

Eviction Lab Claims Data

Real Estate Tax Balances

Neighborhood Boundaries

Tract Boundaries

ACS 2023 Data via tidycensus API

AI USAGE

Free version of Claude Sonnet 4.5 used for debugging code, especially the final panel data portion.

Several code chunks provided by Dr. Delmelle with adjustments from previous labs and lectures.