The Philadelphia Eviction Early Warning System

A Predictive Model for Proactive Resource Allocation

Authors

Angel Rutherford

Ixchel Ramirez

Tess Vu

Published

December 8, 2025

EXECUTIVE SUMMARY

Eviction is both a cause and consequence of poverty that destabilizes entire neighborhoods. Currently, city responses to eviction are reactive, with resources like legal aid and rental assistance deployed after filing volumes become a crisis. This project develops a Real-Time Operational Tool for the Philadelphia Office of Homeless Services and the Fair Housing Commission. By shifting from reactive to predictive analysis, we enable the city to allocate limited staff to specific census tracts predicted to experience elevated eviction filings in the coming month.

Our Negative Binomial regression model leverages temporal momentum, spatial spillover effects, policy intervention effects, property tax delinquency stress, and American Community Survey socioeconomic indicators to forecast monthly eviction filing counts at the census tract level.

The model demonstrates strong performance with meaningful improvement, and sets the foundation for building up to a practical and usable tool down the line. Using a robust temporal validation strategy (training through 2023, testing on 2024-2025), the model generalizes well to future periods without overfitting. Also stark racial disparities in eviction burden was identified, with Black-majority tracts accounting for disproportionate shares of filings. These findings emphasize the need for equity-centered implementation safeguards to prevent perpetuating existing disparities through algorithmic resource allocation.

I. EXPLORATORY DATA ANALYSIS (EDA)

1. SETUP AND LIBRARY LOADING

All required R packages were loaded at the start to streamline data cleaning, visualization, spatial analysis, and modeling.

Code

library(lubridate)
library(sf)
library(broom)
library(scales)
library(patchwork)
library(viridis)
library(kableExtra)
library(corrplot)
library(zoo)
library(MASS)
library(car)
library(caret)
library(pROC)
library(spdep)
library(tidycensus)
library(httr)
library(jsonlite)
library(glue)
library(stringr)
library(tidyr)
library(tidyverse)
library(dplyr)
library(knitr)

# Override masking.
select <- dplyr::select
filter <- dplyr::filter

# Disable scientific notation.
options(scipen = 999)

# Save figures.
#opts_chunk$set(fig.path = "final_figures/", dpi = 300)

Two visualization techniques were employed to provide clear and effective evidence of our findings. First, a consistent plotting theme was established to ensure readability and uniformity across all figures. Second, custom color palettes were applied to highlight key analytical distinctions, specifically racial categories and policy periods. Racial differences in eviction filings are central to the Eviction Lab’s dataset and visualizations and were incorporated to acknowledge inequitable outcomes. In addition, the periods before, during, and after the moratorium were visually differentiated to measure the impact of policy interventions on eviction patterns.

Code

# Custom color palette for policy periods.
period_colors <- c(
  "Pre-Moratorium" = "#3498DB",
  "Moratorium" = "#27AE60",
  "Post-Moratorium" = "#E67E22"
)

# Custom color palette for racial categories.
racial_colors <- c(
  "Black" = "#E74C3C",
  "White" = "#3498DB",
  "Hispanic" = "#F39C12",
  "Other" = "#9B59B6"
)

2. DATA LOADING AND INITIAL EXPLORATION

Monthly Tract-Level Eviction Data

The data was accessed and downloaded from the Eviction Lab, an online database that compiles nationwide eviction filings from court records. For states and major cities, the Eviction Lab provides eviction datasets at two temporal resolutions: weekly and monthly. Both scales were initially examined for their distinct analytical value: weekly data highlight short‑term spikes and localized patterns, while monthly data reveal longer‑term trends and align more closely with policy implementation, as months serve as the standard catchment period for resource allocation and housing interventions.

A summary of the dataset’s dimensions and preview of the first few rows shows that the monthly eviction data provides space-time insights into filing patterns. The dataset is structured into eight columns:

Code

# Load monthly eviction filings data at census tract level.
df_monthly_raw <- read.csv("data/eviction/philadelphia_monthly_2020_2021.csv")

# Display initial data dimensions.
cat("MONTHLY EVICTION DATA DIMENSIONS\n")

MONTHLY EVICTION DATA DIMENSIONS

Code

cat(sprintf("Rows: %s\n", comma(nrow(df_monthly_raw))))

Rows: 29,039

Code

cat(sprintf("Columns: %d\n", ncol(df_monthly_raw)))

Columns: 8

Code

cat(sprintf("Variables: %s\n", paste(names(df_monthly_raw), collapse = ", ")))

Variables: type, GEOID, racial_majority, month, filings_2020, filings_avg, filings_avg_prepandemic_baseline, last_updated

Code

# Display first rows to understand data structure.
head(df_monthly_raw, 10) %>%
  kable(caption = "Sample of Raw Monthly Eviction Filing Data",
    col.names = c("Type", "GEOID", "Racial  Majority", "Month", "Filings", "Average Filings", "Pre-Pandemic Average Filings", "Last Updated")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Sample of Raw Monthly Eviction Filing Data
Type	GEOID	Racial Majority	Month	Filings	Average Filings	Pre-Pandemic Average Filings
Census Tract	42101000101	White	01/2020	0	2.5	0.75
Census Tract	42101000101	White	02/2020	0	1.5	1.00
Census Tract	42101000101	White	03/2020	0	1.5	1.00
Census Tract	42101000101	White	04/2020	0	1.0	1.75
Census Tract	42101000101	White	05/2020	0	2.0	1.75
Census Tract	42101000101	White	06/2020	0	1.0	2.00
Census Tract	42101000101	White	07/2020	0	2.5	0.50
Census Tract	42101000101	White	08/2020	1	1.5	0.75
Census Tract	42101000101	White	09/2020	2	0.0	3.50
Census Tract	42101000101	White	10/2020	2	1.5	1.50

Weekly Tract-Level Eviction Data

The weekly dataset mirrors the structure of the monthly data set but differs as it contains more rows of data due to its finer granularity and adds additional columns that specific the week relative to the dataset and the calendar date the data was recorded.

Code

# Load weekly eviction filings data for high-frequency analysis.
df_weekly_raw <- read.csv("data/eviction/philadelphia_weekly_2020_2021.csv")

# Display weekly data dimensions.
cat("WEEKLY EVICTION DATA DIMENSIONS\n")

WEEKLY EVICTION DATA DIMENSIONS

Code

cat(sprintf("Rows: %s\n", comma(nrow(df_weekly_raw))))

Rows: 125,563

Code

cat(sprintf("Columns: %d\n", ncol(df_weekly_raw)))

Columns: 9

Code

# Display sample of weekly data.
head(df_weekly_raw, 10) %>%
  kable(caption = "Sample of Raw Weekly Eviction Filing Data",
    col.names = c("Type", "GEOID", "Racial  Majority", "Week", "Date", "Filings", "Average Filings", "Pre-Pandemic Average Filings", "Last Updated")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Sample of Raw Weekly Eviction Filing Data
Type	GEOID	Racial Majority	Week	Date	Average Filings	Pre-Pandemic Average Filings	Last Updated
Census Tract	42101000101	White	1	2019-12-29	0.5	0.00	2025-11-15
Census Tract	42101000101	White	2	2020-01-05	0.5	0.25	2025-11-15
Census Tract	42101000101	White	3	2020-01-12	0.5	0.25	2025-11-15
Census Tract	42101000101	White	4	2020-01-19	1.0	0.00	2025-11-15
Census Tract	42101000101	White	5	2020-01-26	0.0	0.25	2025-11-15
Census Tract	42101000101	White	6	2020-02-02	0.0	0.25	2025-11-15
Census Tract	42101000101	White	7	2020-02-09	0.5	0.50	2025-11-15
Census Tract	42101000101	White	8	2020-02-16	0.0	0.00	2025-11-15
Census Tract	42101000101	White	9	2020-02-23	1.0	0.25	2025-11-15
Census Tract	42101000101	White	10	2020-03-01	0.0	0.25	2025-11-15

Claims Data

The Eviction Lab provides tract‑level data that summarizes historical and contemporary monthly claims statistics, where claims represent the financial compensation sought by landlords in eviction proceedings. In addition to reporting the median claim amount, the dataset includes thresholds that capture the relative burden these claims pose to renters, offering a spectrum of financial severity across space and time. While these measures are not the primary variables used in this analysis, they provide valuable context for understanding eviction dynamics. Exploratory review suggests that approximately 3% of claims fall below $1,000, which may reflect unpaid fees rather than market‑rate rent, though in some cases it could indicate subsidized public housing rents. Around 6% of claims align with median market rents, while 14% represent unpaid rent for at least six months. These timelines highlight the varied circumstances underlying eviction filings, including the delays introduced by lengthy legal processes. Together, the dataset’s structured measures and exploratory insights illustrate both the quantitative severity of claims and the qualitative complexity of eviction proceedings.

Code

# Load aggregated monthly claims data showing financial severity.
df_claims_raw <- read.csv("data/eviction/philadelphia_claims_monthly.csv")

# Display claims data dimensions.
cat("CLAIMS DATA DIMENSIONS\n")

CLAIMS DATA DIMENSIONS

Code

cat(sprintf("Rows: %s\n", comma(nrow(df_claims_raw))))

Rows: 70

Code

cat(sprintf("Columns: %d\n", ncol(df_claims_raw)))

Columns: 6

Code

# Display claims data structure.
head(df_claims_raw, 10) %>%
  kable(digits = 2, caption = "Sample of Monthly Claims Severity Data",
    col.names = c("Date", "Median Claim", "Below $1,000", "Below Median Rent", "Over Six Months Rent", "Baseline Median Claim")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Sample of Monthly Claims Severity Data
Date	Median Claim	Below $1,000	Below Median Rent	Over Six Months Rent	Baseline Median Claim
2020-01-01	1958.00	0.03	0.06	0.14	4200
2020-02-01	1944.00	0.03	0.06	0.14	4200
2020-03-01	2050.00	0.03	0.06	0.14	4200
2020-04-01	NA	NA	NA	NA	4200
2020-05-01	NA	NA	NA	NA	4200
2020-06-01	NA	NA	NA	NA	4200
2020-07-01	3975.00	0.03	0.06	0.14	4200
2020-08-01	3285.31	0.03	0.06	0.14	4200
2020-09-01	3639.32	0.03	0.06	0.14	4200
2020-10-01	3422.50	0.03	0.06	0.14	4200

This is not used in the dataset, but for exploration behind the scenes of evictions. It looks like 3% of claims are for below $1,000, which could reflect unpaid fees rather than market rate rent. However, it could reflect rent in subsidized public housing. 6% is for the median market rate rents. 14% is for rent unpaid for at least 6 months. These timelines could have varying reasons, especially due to the time-consuming legal aspect that could delay filings.

Real Estate Tax Balances

In our analysis, tax delinquency is treated as a proxy for landlord financial stress; theoretically properties with outstanding tax balances could pressure landlords into pursuing evictions as a means of recovering income or offsetting financial strain. Thus, property tax assessments were explored to determine how financial stress on landlords may shape eviction filings.

Code

# Load real estate tax balances aggregated by census tract.
df_tax_raw <- read.csv("data/eviction/real_estate_tax_balances_census_tract.csv")

# Display tax data dimensions.
cat("TAX DELINQUENCY DATA DIMENSIONS\n")

TAX DELINQUENCY DATA DIMENSIONS

Code

cat(sprintf("Rows: %s\n", comma(nrow(df_tax_raw))))

Rows: 402

Code

cat(sprintf("Columns: %d\n", ncol(df_tax_raw)))

Columns: 11

Code

cat(sprintf("Variables: %s\n", paste(names(df_tax_raw), collapse = ", ")))

Variables: objectid, census_tract, num_props, min_period, max_period, principal, interest, penalty, other, balance, avg_balance

Code

# Display sample of tax delinquency data.
head(df_tax_raw, 10) %>%
  kable(digits = 2, caption = "Sample of Real Estate Tax Balances by Census Tract",
    col.names = c("Object ID", "Census Tract", "Properties", "Minimum Period", "Maximum Period", "Principal", "Interest", "Penalty", "Other", "Balance", "Average Balance")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Sample of Real Estate Tax Balances by Census Tract
Object ID	Census Tract	Properties	Minimum Period	Maximum Period	Principal	Interest	Penalty	Other	Balance	Average Balance
1	42101000101	74	1985	2025	1064210.8	155230.30	29017.78	32204.00	1280662.9	17306.26
2	42101000102	144	1978	2025	902561.4	162841.03	15843.21	65842.76	1147088.4	7965.89
3	42101000200	62	2007	2025	1178299.6	158347.04	52879.01	54369.75	1443895.4	23288.64
4	42101000300	38	1982	2025	1160409.9	204199.86	53290.48	10210.36	1428110.6	37581.86
5	42101000401	110	1986	2025	875019.7	101247.26	26525.82	14006.92	1016799.7	9243.63
6	42101000403	23	2017	2025	324383.0	47260.28	8921.37	20065.11	400629.8	17418.69
7	42101000404	12	2020	2025	187919.8	29479.15	4030.51	2819.58	224249.1	18687.42
8	42101000500	19	1978	2025	323535.4	76752.47	4661.40	14405.20	419354.5	22071.29
9	42101000600	23	2020	2025	552582.2	76400.67	25171.38	20471.78	674626.1	29331.57
10	42101000701	29	2022	2025	193651.5	21334.13	3937.60	6824.62	225747.8	7784.41

Philadelphia Census Tract Geometry

Since all datasets in this analysis are aggregated at the census tract level, a spatial file containing tract boundaries was loaded to access the geographic and geometric attributes of each tract. The inclusion of this file allows for the integration of eviction, claim, property, and other datasets with physical locations for mapping and spatial analysis.

Code

# Load Pennsylvania census tracts shapefile.
pa_tracts_sf <- st_read("data/pa_tracts/pa_tracts.shp", quiet = TRUE)

# Filter to Philadelphia County only using FIPS code 42101.
philly_tracts_sf <- pa_tracts_sf %>%
  filter(str_starts(GEOID, "42101"))

# Display spatial data summary.
cat("PHILADELPHIA CENSUS TRACT GEOMETRY\n")

PHILADELPHIA CENSUS TRACT GEOMETRY

Code

cat(sprintf("Total Philadelphia Tracts: %d\n", nrow(philly_tracts_sf)))

Total Philadelphia Tracts: 408

Code

cat(sprintf("Coordinate Reference System: %s\n", st_crs(philly_tracts_sf)$input))

Coordinate Reference System: NAD83

3. DATA CLEANING AND FEATURE ENGINEERING

Certain data cleaning and manipulation techniques were applied across all datasets to establish structural uniformity that would streamline joinability with other datasets. These techniques include removing invalid records, turning geographic identifiers into character strings to ensure they aren’t treated as numeric values, and formatting dates. Each dataset was also cleaned and manipulated based on their respective qualities and potential insights.

Monthly Eviction Data Preparation

To prepare the monthly eviction data for analysis, raw filings were cleaned and enriched with theoretically motivated features. A categorical variable was created to distinguish pre‑moratorium, moratorium, and post‑moratorium periods, alongside a binary flag for months when the moratorium was active. Months were categorized into seasonal indicators as seasonal patterns in filings can be seen in the figures provided on the Eviction Lab’s website.

Code

# Clean and prepare monthly data with temporal and policy features.
df_monthly <- df_monthly_raw %>%
  # Rename filings variable for clarity.
  rename(filings_count = filings_2020) %>%
  # Parse month string to proper date format.
  mutate(
    date = as.Date(paste0("01/", month), format = "%d/%m/%Y"),
    year = year(date),
    month_num = month(date),
    month_name = month(date, label = TRUE, abbr = FALSE)
  ) %>%
  # Filter out rows with invalid dates.
  filter(!is.na(date)) %>%
  # Convert GEOID to character for joining operations.
  mutate(GEOID = as.character(GEOID)) %>%
  # Create policy intervention indicator for eviction moratorium period.
  mutate(
    moratorium_active = ifelse(
      date >= as.Date("2020-03-01") & date <= as.Date("2021-09-30"),
      1,
      0
    ),
    # Create categorical period variable.
    period = case_when(
      date < as.Date("2020-03-01") ~ "Pre-Moratorium",
      date >= as.Date("2020-03-01") & date <= as.Date("2021-09-30") ~ "Moratorium",
      date > as.Date("2021-09-30") ~ "Post-Moratorium"
    ),
    period = factor(period, levels = c("Pre-Moratorium", "Moratorium", "Post-Moratorium"))
  ) %>%
  # Create seasonal indicators for seasonality analysis.
  mutate(
    season = case_when(
      month_num %in% c(12, 1, 2) ~ "Winter",
      month_num %in% c(3, 4, 5) ~ "Spring",
      month_num %in% c(6, 7, 8) ~ "Summer",
      month_num %in% c(9, 10, 11) ~ "Fall"
    ),
    season = factor(season, levels = c("Winter", "Spring", "Summer", "Fall"))
  ) %>%
  # Create post-moratorium ramp variable to capture rebound effect.
  mutate(
    post_moratorium_months = ifelse(
      date > as.Date("2021-09-30"),
      as.numeric(difftime(date, as.Date("2021-09-30"), units = "days")) / 30,
      0
    )
  )

# Display summary of cleaned monthly data.
cat("CLEANED MONTHLY DATA SUMMARY\n")

CLEANED MONTHLY DATA SUMMARY

Code

summary(df_monthly %>% select(filings_count, date, year, moratorium_active))

 filings_count          date                 year      moratorium_active
 Min.   :  0.000   Min.   :2020-01-01   Min.   :2020   Min.   :0.0000   
 1st Qu.:  0.000   1st Qu.:2021-06-01   1st Qu.:2021   1st Qu.:0.0000   
 Median :  1.000   Median :2022-12-01   Median :2022   Median :0.0000   
 Mean   :  2.286   Mean   :2022-12-01   Mean   :2022   Mean   :0.2676   
 3rd Qu.:  3.000   3rd Qu.:2024-06-01   3rd Qu.:2024   3rd Qu.:1.0000   
 Max.   :694.000   Max.   :2025-11-01   Max.   :2025   Max.   :1.0000

To verify the scope of the cleaned dataset, a summary table was generated showing the earliest and latest months covered, the total number of distinct months, the number of census tracts represented, and the overall number of observations. This check confirms that the dataset ranges from January 2020 to November 2025.

Code

# Check temporal coverage of cleaned data.
df_monthly %>%
  summarize(
    min_date = min(date),
    max_date = max(date),
    n_months = n_distinct(date),
    n_tracts = n_distinct(GEOID),
    total_obs = n()
  ) %>%
  kable(caption = "Monthly Data Temporal Coverage",
        col.names = c("Minimum Date", "Maximum Date", "Months", "Tracts", "Observations")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Monthly Data Temporal Coverage
Minimum Date	Maximum Date	Months	Tracts	Observations
2020-01-01	2025-11-01	71	409	29039

Weekly Data Preparation

Weekly data was cleaned and enriched in parallel with the monthly dataset, but adapted to reflect the finer weekly temporal scale. A summary table of temporal coverage reveals that the dataset spans from the final days of December 2019 through the second week of November 2025, offering a more nuanced timeline down to the exact day.

Code

# Clean and prepare weekly data for finer temporal analysis.
df_weekly <- df_weekly_raw %>%
  # Rename filings variable for consistency.
  rename(filings_count = filings_2020) %>%
  # Convert week_date to proper date format.
  mutate(
    date = as.Date(week_date),
    year = year(date),
    month_num = month(date),
    week_of_year = week(date)
  ) %>%
  # Filter out rows with invalid dates.
  filter(!is.na(date)) %>%
  # Convert GEOID to character for consistency.
  mutate(GEOID = as.character(GEOID)) %>%
  # Create moratorium indicator.
  mutate(
    moratorium_active = ifelse(
      date >= as.Date("2020-03-01") & date <= as.Date("2021-09-30"),
      1,
      0
    )
  )

# Display summary of cleaned monthly data.
cat("CLEANED WEEKLY DATA SUMMARY\n")

CLEANED WEEKLY DATA SUMMARY

Code

summary(df_weekly %>% select(filings_count, date, year, moratorium_active))

 filings_count           date                 year      moratorium_active
 Min.   :  0.0000   Min.   :2019-12-29   Min.   :2019   Min.   :0.0000   
 1st Qu.:  0.0000   1st Qu.:2021-06-13   1st Qu.:2021   1st Qu.:0.0000   
 Median :  0.0000   Median :2022-12-04   Median :2022   Median :0.0000   
 Mean   :  0.5286   Mean   :2022-12-04   Mean   :2022   Mean   :0.2704   
 3rd Qu.:  1.0000   3rd Qu.:2024-05-26   3rd Qu.:2024   3rd Qu.:1.0000   
 Max.   :260.0000   Max.   :2025-11-09   Max.   :2025   Max.   :1.0000

Code

# Check weekly temporal coverage.
df_weekly %>%
  summarize(
    min_date = min(date),
    max_date = max(date),
    n_weeks = n_distinct(date),
    n_tracts = n_distinct(GEOID),
    total_obs = n()
  ) %>%
  kable(caption = "Weekly Data Temporal Coverage",
        col.names = c("Minimum Date", "Maximum Date", "Weeks", "Tracts", "Observations")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Weekly Data Temporal Coverage
Minimum Date	Maximum Date	Weeks	Tracts	Observations
2019-12-29	2025-11-09	307	409	125563

Claims Data Preparation

Monthly eviction claims data was mutated to include a new column, claims ratio, that was calculated to represent median claim values relative to pre‑pandemic baselines.

Code

# Clean claims data showing financial severity of evictions.
df_claims <- df_claims_raw %>%
  # Convert month_date to proper date format.
  mutate(
    date = as.Date(month_date),
    year = year(date),
    month_num = month(date)
  ) %>%
  # Filter valid dates.
  filter(!is.na(date)) %>%
  # Calculate ratio to pre-pandemic baseline.
  mutate(claim_ratio = median_claim / median_claim_baseline)

# Display claims summary stats.
cat("CLAIMS DATA SUMMARY\n")

CLAIMS DATA SUMMARY

Code

summary(df_claims %>% select(median_claim, median_claim_baseline, claim_ratio))

  median_claim  median_claim_baseline  claim_ratio    
 Min.   :1944   Min.   :4200          Min.   :0.4629  
 1st Qu.:4088   1st Qu.:4200          1st Qu.:0.9733  
 Median :4236   Median :4200          Median :1.0085  
 Mean   :4346   Mean   :4200          Mean   :1.0348  
 3rd Qu.:4724   3rd Qu.:4200          3rd Qu.:1.1247  
 Max.   :6128   Max.   :4200          Max.   :1.4591  
 NA's   :3                            NA's   :3

Tax Delinquency Data Preparation

We defined delinquency in the property tax dataset as properties that received penalties rather than properties that just have overdue balances. Due to the marginal frequency observed in the dataset’s summary statistics (only 8 properties identified as overdue with no penalties), overdue properties were removed. For each tract, counts of delinquent properties, total balances, and average balances per property were calculated, along with log transformations of balances and delinquency age in years. Summary statistics were generated to report mean and median property counts, balances, and average balances across tracts, providing a tract‑level profile of tax delinquency severity.

Code

# Clean and prepare tax data with delinquency separation.
df_tax <- df_tax_raw %>%
  mutate(
    census_tract = as.character(census_tract)
  ) %>%
  filter(!str_detect(census_tract, "Other/Unidentified")) %>%
  mutate(
    # True delinquency indicators, based on penalty being on previous year's taxes.
    has_penalty = penalty > 0,
    delinquent_prop_count = ifelse(has_penalty, num_props, 0),
    delinquent_balance = ifelse(has_penalty, balance, 0),
    avg_delinquent_balance = ifelse(has_penalty, balance / num_props, 0),
    
    # Overdue-only indicators, based on it being the current year.
    overdue_only_props = ifelse(!has_penalty, num_props, 0),
    overdue_only_balance = ifelse(!has_penalty, balance, 0),
    
    # Transformations for modeling.
    log_delinquent_balance = log1p(delinquent_balance),
    log_overdue_balance = log1p(overdue_only_balance)
  ) %>%
  rename(GEOID = census_tract)

# Display tax data summary.
cat("TAX DELINQUENCY DATA SUMMARY\n")

TAX DELINQUENCY DATA SUMMARY

Code

cat(sprintf("Tracts with Tax Data: %d\n", nrow(df_tax)))

Tracts with Tax Data: 401

Code

cat(sprintf("Total Properties with Any Tax Balance: %s\n", comma(sum(df_tax$num_props))))

Total Properties with Any Tax Balance: 84,943

Code

cat(sprintf("Delinquent Properties (Penalty > 0): %s (%.1f%%)\n", 
            comma(sum(df_tax$delinquent_prop_count)),
            sum(df_tax$delinquent_prop_count) / sum(df_tax$num_props) * 100))

Delinquent Properties (Penalty > 0): 84,935 (100.0%)

Code

cat(sprintf("Overdue Properties (No Penalty): %s (%.1f%%)\n",
            comma(sum(df_tax$overdue_only_props)),
            sum(df_tax$overdue_only_props) / sum(df_tax$num_props) * 100))

Overdue Properties (No Penalty): 8 (0.0%)

Code

cat(sprintf("Total Delinquent Balance: $%s\n", comma(sum(df_tax$delinquent_balance))))

Total Delinquent Balance: $508,908,377

Code

cat(sprintf("Total Overdue-Only Balance: $%s\n", comma(sum(df_tax$overdue_only_balance))))

Total Overdue-Only Balance: $55,963

Code

# 8 overdue properties is marginal, so remove.
df_tax <- df_tax_raw %>%
  mutate(
    census_tract = as.character(census_tract)
  ) %>%
  filter(!str_detect(census_tract, "Other/Unidentified")) %>%
  filter(penalty > 0) %>%
  mutate(
    delinquent_prop_count = num_props,
    delinquent_balance = balance,
    avg_delinquent_balance_per_prop = balance / num_props,
    log_delinquent_balance = log1p(balance),
    delinquency_age_years = 2025 - min_period
  ) %>%
  rename(GEOID = census_tract)

# Recalculate average.
df_tax <- df_tax %>%
  mutate(avg_balance = balance / num_props)

Code

# Calculate summary stats for tax delinquency.
tax_summary <- df_tax %>%
  summarize(
    mean_props = mean(num_props),
    median_props = median(num_props),
    mean_balance = mean(balance),
    median_balance = median(balance),
    mean_avg_balance = mean(avg_balance),
    median_avg_balance = median(avg_balance)
  )

tax_summary %>%
  pivot_longer(everything(), names_to = "statistic", values_to = "value") %>%
  mutate(value = ifelse(str_detect(statistic, "balance"), 
                        paste0("$", comma(round(value, 2))), 
                        comma(round(value, 2)))) %>%
  kable(caption = "Tax Delinquency Summary Statistics",
        col.names = c("Statistic", "Value")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Tax Delinquency Summary Statistics
Statistic	Value
mean_props	214
median_props	143
mean_balance	$1,285,122
median_balance	$1,013,871
mean_avg_balance	$8,730
median_avg_balance	$5,892

Sealed Tract Assessment

Monthly eviction filings were assessed for sealed census tracts, where GEOIDs are suppressed for privacy protection. These tracts accounted for a small share of observations but exhibited unusually high filing counts. Because sealed tracts cannot be geographically mapped, they were excluded from spatial modeling to preserve tract‑level specificity. While this exclusion reduces the ability to design geographically actionable policies and outreach, the sealed tracts nonetheless represent legitimate evictions and highlight systematic vulnerabilities that cannot be directly addressed in this analysis.

Code

# Identify sealed census tracts where GEOID is hidden for privacy protection.
# Reduces geographic specificity for modeling.
sealed_count <- df_monthly %>%
  filter(GEOID == "sealed" | is.na(GEOID) | GEOID == "") %>%
  nrow()

sealed_pct <- sealed_count / nrow(df_monthly) * 100

cat("SEALED TRACT ASSESSMENT\n")

SEALED TRACT ASSESSMENT

Code

cat(sprintf("Sealed or Missing Tract Observations: %s (%.1f%%)\n", comma(sealed_count), sealed_pct))

Sealed or Missing Tract Observations: 71 (0.2%)

Code

# Analyze sealed tracts before removal.
sealed_tracts_analysis <- df_monthly %>%
  filter(GEOID == "sealed" | is.na(GEOID) | GEOID == "") %>%
  summarize(
    n_obs = n(),
    n_unique_dates = n_distinct(date),
    total_filings = sum(filings_count, na.rm = TRUE),
    mean_filings = mean(filings_count, na.rm = TRUE),
    median_filings = median(filings_count, na.rm = TRUE),
    min_filings = min(filings_count, na.rm = TRUE),
    max_filings = max(filings_count, na.rm = TRUE),
    sd_filings = sd(filings_count, na.rm = TRUE),
    zero_pct = sum(filings_count == 0) / n() * 100
  )

cat(sprintf("Total Observations: %s\n", comma(sealed_tracts_analysis$n_obs)))

Total Observations: 71

Code

cat(sprintf("Months of Data: %d\n", sealed_tracts_analysis$n_unique_dates))

Months of Data: 71

Code

cat(sprintf("Total Filings in Sealed Tracts: %s\n", comma(sealed_tracts_analysis$total_filings)))

Total Filings in Sealed Tracts: 1,043

Code

cat(sprintf("Mean Filings per Tract-Month: %.1f\n", sealed_tracts_analysis$mean_filings))

Mean Filings per Tract-Month: 14.7

Code

cat(sprintf("Median Filings per Tract-Month: %.1f\n", sealed_tracts_analysis$median_filings))

Median Filings per Tract-Month: 1.0

Code

cat(sprintf("Range: %d to %d\n", sealed_tracts_analysis$min_filings, sealed_tracts_analysis$max_filings))

Range: 0 to 694

Code

cat(sprintf("Standard Deviation: %.1f\n", sealed_tracts_analysis$sd_filings))

Standard Deviation: 85.9

Code

cat(sprintf("Zero-Filing Months: %.1f%%\n", sealed_tracts_analysis$zero_pct))

Zero-Filing Months: 43.7%

Code

# Show distribution of sealed tract filings.
sealed_distribution <- df_monthly %>%
  filter(GEOID == "sealed") %>%
  group_by(filings_count) %>%
  summarize(n = n(), .groups = "drop") %>%
  arrange(desc(filings_count))

head(sealed_distribution, 10) %>%
  kable(caption = "Highest Filing Counts in Sealed Tracts",
        col.names = c("Filings", "Frequency")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Highest Filing Counts in Sealed Tracts
Filings	Frequency
694	1
221	1
22	1
11	1
7	1
6	2
5	4
4	3
3	5
2	8

The sealed tracts are outliers and are legitimate evictions that reflect the current systematic issues, and could represent vulnerable communities experiencing a recent mass eviction (highest unsealed filings are from 2025). They aren’t included because the tracts cannot be mapped to in the data if they’re sealed, this privacy protection prevents geographically actionable policies and outreach, however, it has to be stressed that this is absolutely not ideal.

4. EXPLORATORY DATA ANALYSIS

Distribution Analysis: Zero-Inflation Assessment

Understanding the distribution of eviction filings is critical for selecting the appropriate modeling approach. Because eviction filings are discrete, non‑continuous count values, they require models such as Poisson and Negative Binomial regression specifically designed to handle count data. A major difference between the two approaches is their ability to handle overdispersion, with the Poisson model assuming equal mean and variance while the Negative Binomial model allows the variance to exceed the mean, making it more flexible for data with excess variability and zeros. Zero-inflation statistics were calculated and visualized to assess the distribution of monthly eviction filings. The summary statistics report reveals substantial zero-inflation and variance that exceeded the mean, yielding a dispersion ratio well over 1, indicating that the negative binomial model as a more appropriate modeling approach for our data.

A histogram of raw counts of evictions further highlights the prevalence of zeros and the data’s skewed distribution, while a log‑transformed histogram provides a clearer view of variation.

Code

# Calculate zero-inflation stats for monthly data.
zero_stats_monthly <- df_monthly %>%
  summarize(
    total_obs = n(),
    zero_count = sum(filings_count == 0),
    zero_pct = zero_count / total_obs * 100,
    positive_count = sum(filings_count > 0),
    mean_all = mean(filings_count),
    mean_positive = mean(filings_count[filings_count > 0]),
    median_all = median(filings_count),
    variance = var(filings_count),
    dispersion_ratio = variance / mean_all
  )

# Display zero-inflation stats.
cat("MONTHLY ZERO-INFLATION STATS\n")

MONTHLY ZERO-INFLATION STATS

Code

cat(sprintf("Total Observations: %s\n", comma(zero_stats_monthly$total_obs)))

Total Observations: 29,039

Code

cat(sprintf("Zero Filings: %s (%.1f%%)\n", 
            comma(zero_stats_monthly$zero_count), 
            zero_stats_monthly$zero_pct))

Zero Filings: 10,752 (37.0%)

Code

cat(sprintf("Positive Filings: %s (%.1f%%)\n", 
            comma(zero_stats_monthly$positive_count), 
            100 - zero_stats_monthly$zero_pct))

Positive Filings: 18,287 (63.0%)

Code

cat(sprintf("Mean (all observations): %.2f\n", zero_stats_monthly$mean_all))

Mean (all observations): 2.29

Code

cat(sprintf("Mean (positive only): %.2f\n", zero_stats_monthly$mean_positive))

Mean (positive only): 3.63

Code

cat(sprintf("Variance: %.2f\n", zero_stats_monthly$variance))

Variance: 28.75

Code

cat(sprintf("Dispersion Ratio (Variance/Mean): %.2f\n", zero_stats_monthly$dispersion_ratio))

Dispersion Ratio (Variance/Mean): 12.58

Code

# Raw distribution.
monthly_plot_raw <- df_monthly %>%
  ggplot(aes(x = filings_count)) +
  geom_histogram(bins = 30, fill = "#E74C3C", color = "white") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Raw Count Distribution",
    x = "Monthly Filings Count",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    axis.title = element_text(size = 10)
  )

# Log-transformed distribution.
monthly_plot_log <- df_monthly %>%
  mutate(log_filings = log10(filings_count + 1)) %>%
  ggplot(aes(x = log_filings)) +
  geom_histogram(bins = 30, fill = "#E67E22", color = "white") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Log-Transformed Distribution",
    x = "log(Monthly Filings + 1)",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    axis.title = element_text(size = 10)
  )

# Display combined plot.
monthly_plot_raw + monthly_plot_log +
  plot_annotation(
    title = "Distribution of Monthly Eviction Filings by Census Tract",
    subtitle = sprintf("%.1f%% Zeros | Dispersion Ratio = %.1f", 
                       zero_stats_monthly$zero_pct,
                       zero_stats_monthly$dispersion_ratio),
    caption = "Data: Philadelphia Eviction Filings 2020-2025 via Eviction Lab",
    theme = theme(
      plot.title = element_text(face = "bold", size = 14),
      plot.subtitle = element_text(size = 11),
      plot.caption = element_text(size = 9)
    )
  )

Weekly Filing Distribution

Weekly eviction filings were also assessed for zero-inflation and overdispersion; the distribution was highly zero-inflated, making even standard Negative Binomial models unsuitable. However, the data structure is compatible with Zero-Inflated Negative Binomial models, which may be explored in future work. In this analysis we chose to proceed with the monthly eviction data to preserve the performance of the Negative Binomial model.

Code

# Calculate zero-inflation stats for weekly data.
zero_stats_weekly <- df_weekly %>%
  summarize(
    total_obs = n(),
    zero_count = sum(filings_count == 0),
    zero_pct = zero_count / total_obs * 100,
    mean_all = mean(filings_count),
    variance = var(filings_count),
    dispersion_ratio = variance / mean_all
  )

cat("WEEKLY ZERO-INFLATION STATS\n")

WEEKLY ZERO-INFLATION STATS

Code

cat(sprintf("Total Observations: %s\n", comma(zero_stats_weekly$total_obs)))

Total Observations: 125,563

Code

cat(sprintf("Zero Filings: %s (%.1f%%)\n", 
            comma(zero_stats_weekly$zero_count), 
            zero_stats_weekly$zero_pct))

Zero Filings: 88,027 (70.1%)

Code

cat(sprintf("Mean: %.2f\n", zero_stats_weekly$mean_all))

Mean: 0.53

Code

cat(sprintf("Dispersion Ratio: %.2f\n", zero_stats_weekly$dispersion_ratio))

Dispersion Ratio: 5.62

Code

# Raw distribution.
plot_weekly_raw <- df_weekly %>%
  ggplot(aes(x = filings_count)) +
  geom_histogram(bins = 30, fill = "#3498DB", color = "white") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Raw Count Distribution",
    x = "Weekly Filings Count",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    axis.title = element_text(size = 10)
  )

# Log-transformed distribution.
plot_weekly_log <- df_weekly %>%
  mutate(log_filings = log10(filings_count + 1)) %>%
  ggplot(aes(x = log_filings)) +
  geom_histogram(bins = 30, fill = "#8E44AD", color = "white") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Log-Transformed Distribution",
    x = "log(Weekly Filings + 1)",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    axis.title = element_text(size = 10)
  )

# Display combined plot.
plot_weekly_raw + plot_weekly_log +
  plot_annotation(
    title = "Distribution of Weekly Eviction Filings by Census Tract",
    subtitle = sprintf("%.1f%% Zeros | Dispersion Ratio = %.1f", 
                       zero_stats_weekly$zero_pct,
                       zero_stats_weekly$dispersion_ratio),
    caption = "Data: Philadelphia Eviction Filings via Eviction Lab",
    theme = theme(plot.title = element_text(face = "bold", size = 14),
                  plot.subtitle = element_text(size = 11))
  )

Tax Delinquency Distribution

The histogram of average delinquent balances per property across census tracts is highly right-skewed, with most tracts showing low balances and a few exhibiting extreme outliers revealing that a small number of tracts carry disproportionately high tax burdens. A histogram of log transformed tax balances normalizes this skew, producing a more symmetric, bell-shaped distribution suitable for statistical modeling.

Code

# Raw distribution.
plot_tax_raw <- df_tax %>%
  ggplot(aes(x = avg_delinquent_balance_per_prop)) +
  geom_histogram(bins = 40, fill = "#9B59B6", color = "white") +
  scale_x_continuous(labels = dollar_format()) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Raw Average Balance",
    x = "Average Delinquent Balance per Property ($)",
    y = "Number of Census Tracts"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

# Log-transformed distribution.
plot_tax_log <- df_tax %>%
  mutate(log_balance = log10(avg_delinquent_balance_per_prop + 1)) %>%
  ggplot(aes(x = log_balance)) +
  geom_histogram(bins = 40, fill = "#3498DB", color = "white") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Log-Transformed",
    x = "log(Average Balance + 1)",
    y = "Number of Census Tracts"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

# Display combined plot.
plot_tax_raw + plot_tax_log +
  plot_annotation(
    title = "Distribution of Average Tax Delinquency Balance by Census Tract",
    caption = "Data: OpenDataPhilly Real Estate Tax Balances",
    theme = theme(plot.title = element_text(face = "bold", size = 14),
                  plot.subtitle = element_text(size = 11))
  )

Quantile Analysis

To further assess the tail-behavior of monthly eviction filings, statistice for quantiles thresholds were calculated. This included the minimum, 1st, 5th, 25th, 50th (median), 75th, 95th, and 99th percentiles, as well as the maximum. These statistics provide a clear view of the heavy right tail of the distribution, complementing the zero‑inflation and over‑dispersion diagnostics as well as indicating the need to create a variable the addresses abnormal spikes.

Code

# Calculate quantiles for positive filings to understand tail behavior.
quantile_stats_monthly <- df_monthly %>%
  filter(filings_count > 0) %>%
  summarize(
    min = min(filings_count),
    q01 = quantile(filings_count, 0.01),
    q05 = quantile(filings_count, 0.05),
    q25 = quantile(filings_count, 0.25),
    median = median(filings_count),
    q75 = quantile(filings_count, 0.75),
    q95 = quantile(filings_count, 0.95),
    q99 = quantile(filings_count, 0.99),
    max = max(filings_count)
  )

# Display quantile distribution.
quantile_stats_monthly %>%
  pivot_longer(everything(), names_to = "quantile", values_to = "filings") %>%
  kable(digits = 1, caption = "Monthly Filing Count Quantiles (Positive Counts Only)",
        col.names = c("Quantile", "Filings")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Monthly Filing Count Quantiles (Positive Counts Only)
Quantile	Filings
min	1
q01	1
q05	1
q25	1
median	3
q75	5
q95	10
q99	16
max	694

5. TEMPORAL ANALYSIS: SYSTEM-WIDE TRENDS

Monthly Aggregate Trend with Policy Markers

The plot and summary statistics of monthly evictions overtime show a clear policy-linked trajectory from 2020 to 2025 as filings dropped sharply at the onset of the eviction moratorium, remained suppressed throughout its duration, and rose after the moratorium ended. This pattern underscores the impact of policy interventions as well as highlights the need for temporal controls in modeling.

Code

# Calculate total monthly filings across all tracts.
monthly_aggregate <- df_monthly %>%
  group_by(date, period) %>%
  summarize(
    total_filings = sum(filings_count, na.rm = TRUE),
    mean_filings = mean(filings_count, na.rm = TRUE),
    median_filings = median(filings_count, na.rm = TRUE),
    n_tracts = n_distinct(GEOID),
    .groups = "drop"
  )

# Create line plot with policy intervention markers.
monthly_trend_plot <- ggplot(monthly_aggregate, aes(x = date, y = total_filings)) +
  geom_line(color = "#C0392B", linewidth = 1.5) +
  geom_point(aes(color = period), size = 4) +
  # Mark start of eviction moratorium.
  geom_vline(xintercept = as.Date("2020-03-01"), 
             linetype = "dashed", color = "#27AE60", linewidth = 1) +
  annotate("text", x = as.Date("2020-03-01"), y = max(monthly_aggregate$total_filings) * 0.95,
           label = "Moratorium\nStarts", hjust = -0.1, color = "#27AE60", size = 4, fontface = "bold") +
  # Mark end of eviction moratorium.
  geom_vline(xintercept = as.Date("2021-09-30"), 
             linetype = "dashed", color = "#E67E22", linewidth = 1) +
  annotate("text", x = as.Date("2021-09-30"), y = max(monthly_aggregate$total_filings) * 0.85,
           label = "Moratorium\nEnds", hjust = 1.1, color = "#E67E22", size = 4, fontface = "bold") +
  scale_y_continuous(labels = comma) +
  scale_color_manual(values = period_colors) +
  labs(
    title = "Total Monthly Eviction Filings: System-Wide Trend (2020-2025)",
    subtitle = "Suppression during moratorium followed by sustained elevation after end.",
    x = "Date (Month)",
    y = "Total Monthly Filings Across All Census Tracts",
    color = "Policy Period",
    caption = "Data: Philadelphia Eviction Filings via Eviction Lab"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "bottom"
  )

# Display system-wide trend plot.
monthly_trend_plot

Code

# Calculate summary statistics by policy period.
monthly_aggregate %>%
  group_by(period) %>%
  summarize(
    n_months = n(),
    mean_monthly = mean(total_filings),
    median_monthly = median(total_filings),
    min_monthly = min(total_filings),
    max_monthly = max(total_filings),
    sd_monthly = sd(total_filings),
    .groups = "drop"
  ) %>%
  kable(digits = 0, caption = "Monthly Filing Statistics by Policy Period",
        col.names = c("Period", "Months", "Monthly Mean", "Monthly Median", "Monthly Minimum", "Monthly Maximum", "Monthly Standard Deviation")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Monthly Filing Statistics by Policy Period
Period	Months	Monthly Mean	Monthly Median	Monthly Minimum	Monthly Maximum	Monthly Standard Deviation
Pre-Moratorium	2	1868	1868	1720	2015	209
Moratorium	19	412	494	0	813	272
Post-Moratorium	50	1096	1097	246	1768	234

Seasonality Analysis

The box plot of post moratorium evictions by season and month show elevated filings in winter and summer as well as monthly alternating patterns of elevation in the winter to early spring months. Summary statistics of both plots illustrates that evictions typically remained stable across seasons and months but do possess slightly higher numbers consistent with the visual patterns.

Code

# Calculate seasonal pattern excluding moratorium period for cleaner view.
seasonality_data <- df_monthly %>%
  filter(period == "Post-Moratorium") %>%
  group_by(season) %>%
  summarize(
    mean_filings = mean(filings_count, na.rm = TRUE),
    median_filings = median(filings_count, na.rm = TRUE),
    sd_filings = sd(filings_count, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(season)

# Create seasonal pattern bar plot.
seasonality_boxplot <- seasonality_data %>%
  ggplot(aes(x = season, y = mean_filings)) +
  geom_col(fill = "#3498DB", alpha = 1) +
  geom_errorbar(aes(ymin = mean_filings - sd_filings, ymax = mean_filings + sd_filings),
                width = 0.3, color = "black") +
  labs(
    title = "Seasonal Pattern in Eviction Filings",
    x = "Season",
    y = "Mean Filings per Tract-Month"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

# Calculate monthly pattern excluding moratorium period for cleaner view.
monthly_data <- df_monthly %>%
  filter(period == "Post-Moratorium") %>%
  group_by(month_name, month_num) %>%
  summarize(
    mean_filings = mean(filings_count, na.rm = TRUE),
    median_filings = median(filings_count, na.rm = TRUE),
    sd_filings = sd(filings_count, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(month_num)

# Create monthly pattern bar plot.
monthly_boxplot <- monthly_data %>%
  ggplot(aes(x = reorder(month_name, month_num), y = mean_filings)) +
  geom_col(fill = "#3498DB", alpha = 1) +
  geom_errorbar(aes(ymin = mean_filings - sd_filings, ymax = mean_filings + sd_filings),
                width = 0.3, color = "black") +
  labs(
    title = "Monthly Pattern in Eviction Filings",
    x = "Month",
    y = "Mean Filings per Tract-Month"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

# Display combined plot.
# Display seasonality plot.
seasonality_boxplot + monthly_boxplot +
  plot_annotation(
    title = "Seasonal and Monthly Patterns in Eviction Filings",
    subtitle = "Post-Moratorium",
    caption = "Data: Philadelphia Eviction Claims via Eviction Lab",
    theme = theme(plot.title = element_text(face = "bold", size = 14),
                  plot.subtitle = element_text(size = 12))
  )

Code

# Display seasonality statistics table.
seasonality_data %>%
  kable(digits = 2, caption = "Seasonal Stats (Post-Moratorium Period)",
        col.names = c("Season", "Mean Filings", "Median Filings", "Standard Deviation Filings")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Seasonal Stats (Post-Moratorium Period)
Season	Mean Filings	Median Filings	Standard Deviation Filings
Winter	2.75	2	3.55
Spring	2.54	2	3.05
Summer	2.87	2	10.56
Fall	2.58	1	4.41

Code

# Display seasonality statistics table.
monthly_data %>%
  kable(digits = 2, caption = "Monthly Stats (Post-Moratorium Period)",
        col.names = c("Month", "Month Number", "Mean Filings", "Median Filings", "Standard Deviation Filings")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Monthly Stats (Post-Moratorium Period)
Month	Month Number	Mean Filings	Median Filings	Standard Deviation Filings
January	1	2.98	2	3.57
February	2	2.45	1	3.72
March	3	2.68	2	3.09
April	4	2.24	1	2.77
May	5	2.69	2	3.26
June	6	2.77	2	3.98
July	7	2.77	2	3.28
August	8	3.07	1	17.56
September	9	2.90	2	6.51
October	10	2.74	2	3.25
November	11	2.16	1	3.13
December	12	2.82	2	3.34

Claims Severity Over Time

A temporal plot shows that median eviction claim amounts rose sharply at the onset of the pandemic, peaking in 2021–2022 before gradually declining and stabilizing near the pre-pandemic baseline, suggesting a temporary inflation in claim severity during the moratorium and immediate post-moratorium periods.

Code

# Create median claim amount trend showing financial severity evolution.
claims_plot <- ggplot(df_claims %>% filter(!is.na(median_claim)), 
                      aes(x = date, y = median_claim)) +
  geom_line(color = "#27AE60", linewidth = 1.5) +
  geom_point(color = "#27AE60", size = 3) +
  geom_hline(aes(yintercept = median_claim_baseline), 
             linetype = "dashed", color = "#E74C3C", linewidth = 1) +
  annotate("text", x = min(df_claims$date - 100), y = df_claims$median_claim_baseline[1] + 200,
           label = sprintf("Pre-Pandemic Baseline: $%s", 
                          comma(df_claims$median_claim_baseline[1])),
           hjust = 0, color = "#E74C3C", fontface = "bold") +
  geom_vline(xintercept = as.Date("2020-03-01"), 
             linetype = "dotted", color = "#27AE60", alpha = 1, linewidth = 1) +
  geom_vline(xintercept = as.Date("2021-09-30"), 
             linetype = "dotted", color = "#E67E22", alpha = 1, linewidth = 1) +
  scale_y_continuous(labels = dollar_format()) +
  labs(
    title = "Median Eviction Claim Amount Over Time",
    x = "Date (Month)",
    y = "Median Claim Amount ($)",
    caption = "Data: Philadelphia Eviction Claims via Eviction Lab"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold", size = 14))

# Display claims trend plot.
claims_plot

6. RACIAL DISPARITY ANALYSIS

Filing Burden by Racial Majority

Summary statistics and plots reveal stark disparities across tract racial majorities: 51.9% of filings occurred in Black-majority tracts, despite these tracts representing a smaller share of the city overall. In contrast, White-majority tracts accounted for just 22.1% of filings, though they comprised 36.0% of the comparison group. Hispanic-majority tracts showed near parity, while “Other” tracts had slightly lower filing rates than their representation. These patterns suggest disproportionate eviction exposure in Black communities, underscoring the need for equity-based analysis of our model’s performance.

Code

# Calculate filing statistics by tract racial majority.
racial_stats <- df_monthly %>%
  filter(!is.na(racial_majority), racial_majority != "") %>%
  group_by(racial_majority) %>%
  summarize(
    n_obs = n(),
    n_tracts = n_distinct(GEOID),
    total_filings = sum(filings_count, na.rm = TRUE),
    mean_filings = mean(filings_count, na.rm = TRUE),
    median_filings = median(filings_count, na.rm = TRUE),
    sd_filings = sd(filings_count, na.rm = TRUE),
    zero_pct = sum(filings_count == 0) / n() * 100,
    .groups = "drop"
  ) %>%
  mutate(
    pct_total_filings = total_filings / sum(total_filings) * 100,
    pct_tracts = n_tracts / sum(n_tracts) * 100
  )

# Display racial disparity statistics.
racial_stats %>%
  kable(digits = 2, caption = "Eviction Filing Statistics by Tract Racial Majority",
        col.names = c("Racial Majority", "Observations", "Tracts", "Filings", "Mean Filings", "Median Filings", "Standard Deviation Filings", "% Zero", "% Total Filings", "% Tracts")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Eviction Filing Statistics by Tract Racial Majority
Racial Majority	Observations	Tracts	Filings	Mean Filings	Median Filings	Standard Deviation Filings	% Zero	% Total Filings	% Tracts
Black	10366	146	33875	3.27	2	3.63	22.52	51.85	35.78
Hispanic	1633	23	3805	2.33	2	2.40	23.33	5.82	5.64
Other	6532	92	13201	2.02	1	3.04	43.05	20.21	22.55
White	10437	147	14450	1.38	1	2.81	49.77	22.12	36.03

Code

# Create side-by-side comparison of tract share vs filing share.
racial_comparison <- racial_stats %>%
  select(racial_majority, pct_tracts, pct_total_filings) %>%
  pivot_longer(cols = c(pct_tracts, pct_total_filings),
               names_to = "metric", values_to = "percentage") %>%
  mutate(
    metric = ifelse(metric == "pct_tracts", "Share of Tracts", "Share of Filings")
  )

# Create grouped bar plot showing disparity.
racial_disparity_plot <- racial_comparison %>%
  ggplot(aes(x = racial_majority, y = percentage, fill = metric)) +
  geom_col(position = "dodge", width = 0.7) +
  geom_text(aes(label = sprintf("%.1f%%", percentage)), 
            position = position_dodge(width = 0.7), vjust = -0.5, size = 3.5) +
  scale_fill_manual(values = c("Share of Tracts" = "#3498DB", "Share of Filings" = "#E74C3C")) +
  scale_y_continuous(labels = percent_format(scale = 1), expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Racial Disparity in Eviction Filings",
    x = "Tract Racial Majority",
    y = "Percentage",
    fill = "Metric",
    caption = "Data: Philadelphia Eviction Filings via Eviction Lab"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "bottom"
  )

# Display racial disparity plot.
racial_disparity_plot

7. GEOGRAPHIC ANALYSIS

High-Risk Tract Identification

High‑risk tracts were identified by calculating cumulative eviction filings across the study period to pinpoint neighborhoods with disproportionate concentrations of cases. The analysis shows that the top 20 census tracts alone accounted for 16.6% of all filings.

Code

# Calculate total filings by tract across full study period.
tract_totals <- df_monthly %>%
  filter(GEOID != "sealed", !is.na(GEOID), GEOID != "") %>%
  group_by(GEOID, racial_majority) %>%
  summarize(
    total_filings = sum(filings_count, na.rm = TRUE),
    mean_monthly = mean(filings_count, na.rm = TRUE),
    median_monthly = median(filings_count, na.rm = TRUE),
    sd_monthly = sd(filings_count, na.rm = TRUE),
    # Standardized measure of dispersion. Helps compare volatility. The higher the more volatile.
    cv = ifelse(mean_monthly > 0, sd_monthly / mean_monthly, NA),
    months_observed = n(),
    months_with_filings = sum(filings_count > 0),
    .groups = "drop"
  )

# Identify top 20 highest-risk tracts by total filings.
top_risk_tracts <- tract_totals %>%
  arrange(desc(total_filings)) %>%
  head(20)

# Display top risk tracts.
top_risk_tracts %>%
  select(GEOID, racial_majority, total_filings, mean_monthly, cv) %>%
  kable(digits = 2, caption = "Top 20 Highest-Risk Census Tracts (Total Filings 2020-2025)",
        col.names = c("GEOID", "Racial Majority", "Filings", "Monthly Mean", "Coefficient of Variation")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Top 20 Highest-Risk Census Tracts (Total Filings 2020-2025)
GEOID	Racial Majority	Filings	Monthly Mean	Coefficient of Variation
42101036100	White	831	11.70	1.17
42101021800	White	827	11.65	0.79
42101030100	Other	713	10.04	0.65
42101026800	Black	608	8.56	0.68
42101027200	Black	590	8.31	1.55
42101024000	Black	579	8.15	0.74
42101025200	Black	548	7.72	0.67
42101017800	Other	545	7.68	0.55
42101027600	Black	518	7.30	0.65
42101030000	Black	508	7.15	0.58
42101023800	Black	491	6.92	0.79
42101015200	Black	487	6.86	0.70
42101020101	Black	471	6.63	0.74
42101034600	Other	471	6.63	0.88
42101008302	Black	455	6.41	0.69
42101035302	White	455	6.41	0.85
42101024300	Black	444	6.25	0.84
42101023900	Black	441	6.21	0.72
42101012201	Black	428	6.03	0.76
42101010800	Black	426	6.00	0.69

Code

# Calculate concentration statistics showing geographic clustering of evictions.
total_system_filings <- sum(tract_totals$total_filings)
top_20_pct <- sum(top_risk_tracts$total_filings) / total_system_filings * 100

cat("GEOGRAPHIC CONCENTRATION ANALYSIS\n")

GEOGRAPHIC CONCENTRATION ANALYSIS

Code

cat(sprintf("Total System Filings: %s\n", comma(total_system_filings)))

Total System Filings: 65,331

Code

cat(sprintf("Top 20 Tracts Account For: %.1f%% of all filings\n", top_20_pct))

Top 20 Tracts Account For: 16.6% of all filings

Spatial Mapping of Eviction Burden

The choropleth plot showing the spatial distribution of evictions reveals regional concentration of census tracts with high filing levels in North, Upper North, and West Philadelphia.

Code

# Join tract totals to spatial geometry.
philly_eviction_sf <- philly_tracts_sf %>%
  left_join(tract_totals, by = "GEOID")

# Create choropleth map of total filings.
eviction_map <- ggplot(philly_eviction_sf) +
  geom_sf(aes(fill = total_filings), color = NA) +
  scale_fill_viridis_c(
    option = "inferno",
    trans = "log1p",
    labels = comma,
    na.value = "gray50",
    breaks = c(0, 25, 100, 250, 831)
  ) +
  labs(
    title = "Total Eviction Filings by Census Tract (2020-2025)",
    fill = "Total\nFilings",
    caption = "Data: Philadelphia Eviction Filings via Eviction Lab"
  ) +
  theme_void() +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    legend.position = "right"
  )

# Display eviction map.
eviction_map

Spatial Mapping of Tax Delinquency

The choropleth plot showing the spatial distribution of average property tax balances reveals distinct pocket of high tax burdens in Center City as well as South, North, Upper North, and Northeast Philadelphia.

Code

# Join tax data to spatial geometry.
philly_tax_sf <- philly_tracts_sf %>%
  left_join(df_tax, by = "GEOID")

# Create choropleth map of average tax balance.
tax_map <- ggplot(philly_tax_sf) +
  geom_sf(aes(fill = avg_balance), color = NA) +
  scale_fill_viridis_c(
    option = "plasma",
    trans = "log1p",
    labels = dollar_format(),
    na.value = "gray50",
    breaks = c(1000, 5000, 10000, 50000)
  ) +
  labs(
    title = "Average Property Tax Delinquency by Census Tract",
    fill = "Average\nTax\nBalance",
    caption = "Data: OpenDataPhilly Real Estate Tax Balances"
  ) +
  theme_void() +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    legend.position = "right"
  )

# Display tax map.
tax_map

8. PRE-PANDEMIC BASELINE COMPARISON

Monthly Filings vs Baseline Ratio

The plot and summary statistics shows that the Covid-19 pandemic posed a mass disruption to eviction trends as filings drop well below the pre-pandemic baseline before gradually increasing through 2021 and 2022. By 2024 and 2025, filings stabilized near the pre-pandemic average, indicating a return to normative patterns. This trajectory emphasizes the need to capture immediate post-moratorium dynamics.

Code

# Calculate system-wide ratio to pre-pandemic baseline over time.
monthly_baseline_ratio <- df_monthly %>%
  group_by(date) %>%
  summarize(
    total_filings = sum(filings_count, na.rm = TRUE),
    total_baseline = sum(filings_avg_prepandemic_baseline, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(total_baseline > 0) %>%
  mutate(ratio_to_baseline = total_filings / total_baseline)

# Create line plot showing normalized recovery trajectory.
# Ratio greater than 1 indicates system stress relative to historical norm.
baseline_ratio_plot <- ggplot(monthly_baseline_ratio, aes(x = date, y = ratio_to_baseline)) +
  geom_line(color = "#16A085", linewidth = 0.8, alpha = 0.6) +
  geom_smooth(method = "loess", se = TRUE, color = "#E74C3C", linewidth = 1) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "black", linewidth = 1) +
  annotate("text", x = min(monthly_baseline_ratio$date + 100), y = 1.08,
           label = "Pre-Pandemic Average", hjust = 0, fontface = "bold") +
  geom_vline(xintercept = as.Date("2020-03-01"), 
             linetype = "dotted", color = "#27AE60", alpha = 1, linewidth = 1) +
  geom_vline(xintercept = as.Date("2021-09-30"), 
             linetype = "dotted", color = "#E67E22", alpha = 1, linewidth = 1) +
  labs(
    title = "Monthly Filings Normalized to Pre-Pandemic Baseline",
    x = "Date (Month)",
    y = "Monthly Filings / Pre-Pandemic Average Ratio",
    caption = "Data: Philadelphia Monthly Eviction Filings via Eviction Lab"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold", size = 14))

# Display baseline ratio plot.
baseline_ratio_plot

Code

# Calculate summary statistics by policy period.
monthly_baseline_ratio %>%
  mutate(period = case_when(
    date < as.Date("2020-03-01") ~ "pre_moratorium",
    date >= as.Date("2020-03-01") & date <= as.Date("2021-09-30") ~ "moratorium",
    date > as.Date("2021-09-30") ~ "post_moratorium"
  )) %>%
  group_by(period) %>%
  summarize(
    n_weeks = n(),
    mean_ratio = mean(ratio_to_baseline, na.rm = TRUE),
    median_ratio = median(ratio_to_baseline, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  kable(digits = 2, caption = "Baseline Ratio Statistics by Policy Period",
        col.names = c("Period", "Weeks", "Mean Ratio", "Median Ratio")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Baseline Ratio Statistics by Policy Period
Period	Weeks	Mean Ratio	Median Ratio
moratorium	19	0.23	0.25
post_moratorium	50	0.63	0.63
pre_moratorium	2	1.07	1.07

II. MODELING

1. FEATURE ENGINEERING FOR MODELING

Creating Temporal Lag Variables

Temporal lag variables show how eviction filings carry over from one month to the next. If a tract had high filings last month, it is more likely to have high filings this month too. The one-, two-, and three-month lags capture short-term momentum, while the three-month rolling average smooths out short-term patterns.

Code

# Create lagged features.
# When predicting month t, we can only use data from months t-1, t-2, t-3, etc.
df_model_prep <- df_monthly %>%
  filter(GEOID != "sealed", !is.na(GEOID), GEOID != "") %>%
  arrange(GEOID, date) %>%
  group_by(GEOID) %>%
  mutate(
    # One month lag captures immediate momentum.
    filings_lag1 = lag(filings_count, n = 1, order_by = date),
    # Two month lag captures medium-term trend.
    filings_lag2 = lag(filings_count, n = 2, order_by = date),
    # Three month lag captures quarterly pattern.
    filings_lag3 = lag(filings_count, n = 3, order_by = date),
    # Rolling 3-month average of lagged values (t-1, t-2, t-3).
    # Original rollmean with align = "right" included current month, please don't change this.
    filings_ma3 = (filings_lag1 + filings_lag2 + filings_lag3) / 3
  ) %>%
  ungroup()

# Check missingness of lag variables.
lag_coverage <- df_model_prep %>%
  summarize(
    total_obs = n(),
    missing_lag1 = sum(is.na(filings_lag1)),
    missing_lag2 = sum(is.na(filings_lag2)),
    missing_lag3 = sum(is.na(filings_lag3)),
    missing_ma3 = sum(is.na(filings_ma3)),
    pct_missing_lag1 = missing_lag1 / total_obs * 100,
    pct_missing_lag2 = missing_lag2 / total_obs * 100,
    pct_missing_lag3 = missing_lag3 / total_obs * 100,
    pct_missing_ma3 = missing_ma3 / total_obs * 100
  )

# Display lag coverage stats.
lag_coverage %>%
  kable(digits = 1, caption = "Lagged Variable Coverage Stats",
        col.names = c("Observations", "Missing Lag 1", "Missing Lag 2", "Missing Lag 3", "Missing Lag 3-Month Rolling Average", "% Missing Lag 1", "% Missing Lag 2", "% Missing Lag 3", "% Missing Lag 3-Month Rolling Average")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Lagged Variable Coverage Stats
Observations	Missing Lag 1	Missing Lag 2	Missing Lag 3	Missing Lag 3-Month Rolling Average	% Missing Lag 1	% Missing Lag 2	% Missing Lag 3	% Missing Lag 3-Month Rolling Average
28968	408	816	1224	1224	1.4	2.8	4.2	4.2

Creating Inactivity Flag

About 37% of monthly tract observations have zero filings thus these flags are intended to help the model separate active periods from months with no eviction activity. This matters because it can potentially address problematic zero-inflation in the data and highlight longer stretches of inactivity that may reflect structural differences across tracts.

Code

# There are around 37% zeros, so add zero flags for last month and consecutive.
df_model_prep <- df_model_prep %>%
  group_by(GEOID) %>%
  mutate(
    was_zero_last_month = lag(ifelse(filings_count == 0, 1, 0)),
    consecutive_zeros = sequence(rle(filings_count == 0)$lengths) * (filings_count == 0)
  ) %>%
  ungroup()

# Check missingness of zero flag variables.
zero_coverage <- df_model_prep %>%
  summarize(
    total_obs = n(),
    missing_zero_last_month = sum(is.na(was_zero_last_month)),
    missing_consecutive_zeros = sum(is.na(consecutive_zeros)),
    pct_missing_zero_last_month = missing_zero_last_month / total_obs * 100,
    pct_missing_consecutive_zeros = missing_consecutive_zeros / total_obs * 100
  )

# Display zero flag coverage stats.
zero_coverage %>%
  kable(digits = 1, caption = "Zero Flag Variable Coverage Stats",
        col.names = c("Observations", "Missing 0 Last Month", "Missing Consecutive Zeros", "% Missing 0 Last Month", "% Missing Consecutive Zeros")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Zero Flag Variable Coverage Stats
Observations	Missing 0 Last Month	Missing Consecutive Zeros	% Missing 0 Last Month	% Missing Consecutive Zeros
28968	408	0	1.4	0

Merging Claims Severity Data

The following represents various merges that integrate claims severity, tax delinquency, and American Community Survey (ACS) socioeconomic indicators into the tract-level eviction panel.

Code

# Create year-month key for joining claims data to tract-level data.
df_model_prep <- df_model_prep %>%
  mutate(year_month = floor_date(date, "month"))

df_claims_join <- df_claims %>%
  mutate(year_month = floor_date(date, "month")) %>%
  select(year_month, median_claim, claim_ratio)

# Merge claims data to tract-level panel.
df_model_prep <- df_model_prep %>%
  left_join(df_claims_join, by = "year_month")

# Check merge success.
claims_merge_stats <- df_model_prep %>%
  summarize(
    total_obs = n(),
    has_claims = sum(!is.na(median_claim)),
    pct_with_claims = has_claims / total_obs * 100
  )

cat(sprintf("Observations with claims data: %s (%.1f%%)\n", 
            comma(claims_merge_stats$has_claims), 
            claims_merge_stats$pct_with_claims))

Observations with claims data: 27,336 (94.4%)

Merging Tax Delinquency Data

Tax delinquency serves as a proxy for landlord financial stress. Properties with outstanding tax balances may have landlords more likely to pursue evictions.

Code

# Prepare tax data for joining.
df_tax_join <- df_tax %>%
  select(GEOID, num_props, balance, avg_balance, log_delinquent_balance, 
                delinquent_prop_count, delinquency_age_years)

# Merge tax delinquency data to tract-level panel.
df_model_prep <- df_model_prep %>%
  left_join(df_tax_join, by = "GEOID")

# Check merge success.
tax_merge_stats <- df_model_prep %>%
  summarize(
    total_obs = n(),
    has_tax_data = sum(!is.na(avg_balance)),
    pct_with_tax = has_tax_data / total_obs * 100
  )

cat(sprintf("Observations with tax data: %s (%.1f%%)\n", 
            comma(tax_merge_stats$has_tax_data), 
            tax_merge_stats$pct_with_tax))

Observations with tax data: 28,116 (97.1%)

Integrating ACS Socioeconomic Data

American Community Survey data provides tract-level socioeconomic indicators that capture structural vulnerability to eviction. Variables chosen pose theoretical relevance to eviction data as they related to housing, income, and demographic composition. ACS data was further engineered to allow for meaningful interpretation and comparison with other variables.

Code

# Retrieve ACS 5-year estimates for Philadelphia County tracts.
# Using 2023 ACS 5-year estimates.

# Define ACS variables to retrieve.
acs_vars <- c(
  # Housing characteristics.
  "B25003_001",  # total occupied housing units
  "B25003_003",  # renter-occupied units
  "B25064_001",  # median gross rent
  "B25071_001",  # median gross rent as % of income
  "B25077_001",  # median home value
  

  # Income and poverty.
  "B19013_001",  # median household income
  "B17001_001",  # total population for poverty status
  "B17001_002",  # population below poverty level
  
  # Employment.
  "B23025_001",  # total population 16+ in labor force
  "B23025_005",  # unemployed population
  
  # Education.
  "B15003_001",  # total population 25+ for education
  "B15003_017",  # high school diploma
  "B15003_022",  # bachelor's degree
  "B15003_023",  # master's degree
  "B15003_025",  # doctorate degree
  
  # Demographics.
  "B01003_001",  # total population
  "B01002_001",  # median age
  
  # Household composition.
  "B11001_001",  # total households
  "B11001_006",  # female householder no spouse with children
  
  # Housing cost burden.
  "B25070_001",  # total renter households for rent burden
  "B25070_007",  # renters paying 30-34.9% income on rent
  "B25070_008",  # renters paying 35-39.9% income on rent
  "B25070_009",  # renters paying 40-49.9% income on rent
  "B25070_010"   # renters paying 50%+ income on rent
)

# Retrieve ACS data for Philadelphia County.
acs_data <- get_acs(
  geography = "tract",
  variables = acs_vars,
  state = "PA",
  county = "Philadelphia",
  year = 2023,
  survey = "acs5",
  output = "wide",
  geometry = FALSE
)

cat(sprintf("Tracts with ACS data: %d\n", nrow(acs_data)))

Tracts with ACS data: 408

Code

# Engineer meaningful features from raw ACS variables.
df_acs <- acs_data %>%
  transmute(
    GEOID = GEOID,
    
    # Renter percentage.
    pct_renter = ifelse(B25003_001E > 0, B25003_003E / B25003_001E * 100, NA),
    
    # Median gross rent.
    median_rent = B25064_001E,
    
    # Rent burden (rent as % of income).
    rent_burden_pct = B25071_001E,
    
    # Median household income (log transform for modeling).
    median_income = B19013_001E,
    log_median_income = log1p(B19013_001E),
    
    # Poverty rate.
    poverty_rate = ifelse(B17001_001E > 0, B17001_002E / B17001_001E * 100, NA),
    
    # Unemployment rate.
    unemployment_rate = ifelse(B23025_001E > 0, B23025_005E / B23025_001E * 100, NA),
    
    # Educational attainment (% with at least bachelor's).
    pct_college = ifelse(B15003_001E > 0, 
                         (B15003_022E + B15003_023E + B15003_025E) / B15003_001E * 100, NA),
    
    # Total population.
    total_pop = B01003_001E,
    log_pop = log1p(B01003_001E),
    
    # Median age.
    median_age = B01002_001E,
    
    # Single mother households (% total households).
    pct_single_mother = ifelse(B11001_001E > 0, B11001_006E / B11001_001E * 100, NA),
    
    # Severe rent burden (% renters paying 35%+ income on rent).
    severe_rent_burden = ifelse(B25070_001E > 0, 
                                (B25070_008E + B25070_009E + B25070_010E) / B25070_001E * 100, NA)
  )

Single mothers and those who experience severe rent burdens may be more theoretically linked to eviction filings as their means are limited.

Code

# Calculate summary stats for ACS features.
acs_summary <- df_acs %>%
  select(-GEOID) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  group_by(variable) %>%
  summarize(
    mean = mean(value, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    sd = sd(value, na.rm = TRUE),
    min = min(value, na.rm = TRUE),
    max = max(value, na.rm = TRUE),
    pct_missing = sum(is.na(value)) / n() * 100,
    .groups = "drop"
  )

acs_summary %>%
  kable(digits = 2, caption = "ACS Feature Summary Stats",
        col.names = c("Variable", "Mean", "Median", "Standard Deviation", "Minimum", "Maximum", "% Missing")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

ACS Feature Summary Stats
Variable	Mean	Median	Standard Deviation	Minimum	Maximum	% Missing
log_median_income	10.99	11.02	0.49	9.53	12.17	6.62
log_pop	7.86	8.25	1.72	0.00	9.25	0.00
median_age	36.76	35.40	7.71	5.70	74.90	4.17
median_income	66877.27	60817.00	33071.00	13721.00	192727.00	6.62
median_rent	1374.54	1285.50	391.30	407.00	3132.00	6.37
pct_college	33.01	27.31	21.37	1.50	92.42	4.17
pct_renter	48.21	47.15	20.49	0.00	100.00	4.90
pct_single_mother	19.53	18.12	13.53	0.00	62.82	4.90
poverty_rate	21.60	18.54	14.04	0.00	73.91	4.66
rent_burden_pct	33.56	31.40	9.46	17.30	51.00	5.64
severe_rent_burden	41.35	41.01	17.05	4.19	100.00	5.15
total_pop	3878.51	3813.50	1834.07	0.00	10406.00	0.00
unemployment_rate	5.18	4.62	3.43	0.00	19.64	4.17

Code

# Merge ACS data into model prep dataset.
df_model_prep <- df_model_prep %>%
  left_join(df_acs, by = "GEOID")

# Check merge.
acs_merge_stats <- df_model_prep %>%
  summarize(
    total_obs = n(),
    has_acs = sum(!is.na(pct_renter)),
    pct_with_acs = has_acs / total_obs * 100
  )

cat(sprintf("Observations with ACS data: %s (%.1f%%)\n", 
            comma(acs_merge_stats$has_acs), 
            acs_merge_stats$pct_with_acs))

Observations with ACS data: 27,548 (95.1%)

Creating a spatial weights matrix for our model is motivated by the maps predicting the spatial distribution of eviction filings across Philadelphia; eviction activity tends to not be isolated to specific tracts and emerge as part of broader neighborhood-level patterns. The spatial weights matrix allows the model to capture these localized spillovers, ensuring that predictions reflect the reality that housing instability is often similar in adjacent communities rather than occurring in isolation.

Code

# Filter spatial data to only tracts present in eviction data.
tracts_in_data <- unique(df_model_prep$GEOID)

philly_model_sf <- philly_tracts_sf %>%
  filter(GEOID %in% tracts_in_data)

# Create spatial neighbors using Queen contiguity.
neighbors_queen <- poly2nb(philly_model_sf, queen = TRUE)

# Convert to spatial weights matrix.
weights_matrix <- nb2listw(neighbors_queen, style = "W", zero.policy = TRUE)

cat(sprintf("Number of tracts in spatial analysis: %d\n", length(neighbors_queen)))

Number of tracts in spatial analysis: 408

Code

cat(sprintf("Average number of neighbors per tract: %.1f\n", 
            mean(card(neighbors_queen))))

Average number of neighbors per tract: 6.0

Code

# Calculate lagged spatial lag of filings for each month.
# Use neighbors' filings from month t-1 (previous month).
# Using same-month neighbor data would be data leakage since we wouldn't have that info at prediction time.

# Create function to calculate spatial lag for a given month using previous month's data.
# Shift each tract's filings by 1 month.
df_model_prep <- df_model_prep %>%
  arrange(GEOID, date) %>%
  group_by(GEOID) %>%
  mutate(filings_count_lag1 = lag(filings_count, n = 1)) %>%
  ungroup()

# For each month, calculate spatial lag using the lagged filings.
df_model_prep <- df_model_prep %>%
  group_by(date) %>%
  mutate(
    # Match order to spatial data for this month.
    spatial_lag_filings = {
      sf_matched <- philly_model_sf %>%
        left_join(
          cur_data() %>% select(GEOID, filings_count_lag1), 
          by = "GEOID"
        ) %>%
        mutate(filings_count_lag1 = ifelse(is.na(filings_count_lag1), 0, filings_count_lag1))
      
      # Calculate weighted average of neighbors' lagged filings.
      lag.listw(weights_matrix, sf_matched$filings_count_lag1, zero.policy = TRUE)
    }
  ) %>%
  ungroup()

cat(sprintf("Observations with spatial lag: %s\n", 
            comma(sum(!is.na(df_model_prep$spatial_lag_filings)))))

Observations with spatial lag: 28,968

Creating Neighborhood Fixed-Effects

Neighborhoods provide a broader geographic context for understanding eviction filings. By linking tracts to neighborhood boundaries, we can test whether neighborhood characteristics are associated with filing rates.

Code

# Load boundaries.
neighborhoods <- st_read("data/philadelphia-neighborhoods/philadelphia-neighborhoods.shp")

Reading layer `philadelphia-neighborhoods' from data source 
  `C:\Users\Tess\Documents\GitHub\portfolio-setup-TessaVu\final-project\data\philadelphia-neighborhoods\philadelphia-neighborhoods.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 159 features and 5 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -75.28026 ymin: 39.86701 xmax: -74.95576 ymax: 40.13799
Geodetic CRS:  WGS 84

Code

# Make valid for duplicate vertex error.
neighborhoods <- st_make_valid(neighborhoods)

# Change CRS.
neighborhoods <- st_transform(neighborhoods, st_crs(philly_model_sf))

# Select only GEOID and geometry from tracts.
tracts_sf_subset <- philly_model_sf %>% select(GEOID, geometry)

# Select only neighborhood name and geometry from neighborhoods.
neighborhoods_sf_subset <- neighborhoods %>% select(MAPNAME, geometry) %>% rename(neighborhood = MAPNAME)

# Spatial join. Keep tracts even if they don't perfectly intersect a neighborhood.
tracts_with_neighborhood <- st_join(
    tracts_sf_subset,
    neighborhoods_sf_subset,
    join = st_intersects,
    left = TRUE
) %>%
    # Drop geometry.
    st_drop_geometry()

# Select only GEOID and neighborhood.
df_neighborhood_join <- tracts_with_neighborhood %>%
    select(GEOID, neighborhood)

# Choose first match if multiple.
df_neighborhood_join <- df_neighborhood_join %>%
    group_by(GEOID) %>%
    slice(1) %>%
    ungroup()

# Merge.
df_model_prep <- df_model_prep %>%
    left_join(df_neighborhood_join, by = "GEOID")

# Check merge.
cat(sprintf("Total observations: %s\n", comma(nrow(df_model_prep))))

Total observations: 28,968

Code

cat(sprintf("Observations with neighborhood data: %s\n",
            comma(sum(!is.na(df_model_prep$neighborhood)))))

Observations with neighborhood data: 28,968

Code

# Convert new column to factor for FE.
df_model_prep$neighborhood <- as.factor(df_model_prep$neighborhood)

# Change reference.
df_model_prep$neighborhood <- relevel(df_model_prep$neighborhood, ref = "East Falls")

A binary flag was set to identify high concentrations of filings. This helps separate routine variation from extreme events and allows the model to account for rare but impactful spikes without letting them distort patterns in the broader dataset.

Code

# Define threshold for capping target variable in training set.
cap_threshold = 20

# Create spike binary flag for all data to identify structural risk.
df_model_prep <- df_model_prep %>%
  mutate(is_extreme_spike = ifelse(filings_count > cap_threshold, 1, 0))

We generated a correlation matrix for all numeric variables to identify redundancy and guide model specification. Because poverty rate and logged median income captured similar socioeconomic conditions, we retained poverty rate due to its stronger theoretical link to eviction risk. We selected delinquent property count over logged delinquent balance because it showed a stronger empirical correlation with eviction filings. Finally, we used the three-month rolling spatial lag rather than the one- or two-month lags, as the rolling measure captured short-term fluctuations while reducing redundancy. This process ensured that the final predictors were theoretically relevant and minimally collinear.

Code

# Custom colors.
custom_colors <- colorRampPalette(c("#f03e37", "white", "#189ace")) 

color_vector <- custom_colors(200)

# Select and apply numeric columns only.
df_model_prep_correlation <- df_model_prep[, sapply(df_model_prep, is.numeric)]

# Correlation matrix.
correlation_matrix <- cor(df_model_prep_correlation, use = "pairwise.complete.obs")

# Display matrix.
corrplot(correlation_matrix,
         method = "square",
         order = "FPC", 
         type = "lower",
         diag = TRUE,
         tl.col = "black",
         cl.pos = "n",
         addCoef.col = "black",
         col = color_vector
         )

2. TRAIN/TEST SPLIT STRATEGY

Time-Based Validation Split

For a truly predictive model, we must use time-based splitting to avoid data leakage. The model learns from historical data and predicts future periods.

Code

# Define temporal cutoff for train/test split.
train_end_date <- as.Date("2023-12-31")
test_start_date <- as.Date("2024-01-01")

# Filter for all variables needed by most complex model.
df_filtered <- df_model_prep %>%
  filter(
    !is.na(filings_ma3),
    !is.na(spatial_lag_filings),
    !is.na(delinquent_prop_count),
    !is.na(pct_renter),
    !is.na(poverty_rate),
    !is.na(severe_rent_burden),
    !is.na(pct_single_mother),
    !is.na(racial_majority),
    !is.na(moratorium_active),
    !is.na(neighborhood)
  )

# Apply the cap only to training target variable to make it stable.
# Create training set where filings are capped at 99.75 percentile = 20 filings.
df_train <- df_filtered %>%
  filter(date <= train_end_date) %>%
  mutate(filings_count_capped = pmin(filings_count, cap_threshold))

# Create testing set where target is including spikes for real-world.
df_test <- df_filtered %>%
  filter(date >= test_start_date)

# Display split statistics.
cat("TRAIN/TEST SPLIT SUMMARY\n")

TRAIN/TEST SPLIT SUMMARY

Code

cat(sprintf("Training Period: %s to %s\n", min(df_train$date), max(df_train$date)))

Training Period: 2020-04-01 to 2023-12-01

Code

cat(sprintf("Training Observations: %s\n", comma(nrow(df_train))))

Training Observations: 17,280

Code

cat(sprintf("Training Months: %d\n", n_distinct(df_train$date)))

Training Months: 45

Code

cat(sprintf("Test Period: %s to %s\n", min(df_test$date), max(df_test$date)))

Test Period: 2024-01-01 to 2025-11-01

Code

cat(sprintf("Test Observations: %s\n", comma(nrow(df_test))))

Test Observations: 8,832

Code

cat(sprintf("Test Months: %d\n", n_distinct(df_test$date)))

Test Months: 23

Code

# Adjust so that the extreme outliers are not included. Cap lagged momentum signal to reduce distortion from rare mass-eviction spikes, which is 99.8 percentile.
df_train <- df_train %>% mutate(filings_ma3 = pmin(filings_ma3, 20))
df_test <- df_test %>% mutate(filings_ma3 = pmin(filings_ma3, 20))

Code

# Compare filing distributions between train and test sets.
train_test_comparison <- bind_rows(
  df_train %>% mutate(set = "training"),
  df_test %>% mutate(set = "test")
) %>%
  group_by(set) %>%
  summarize(
    n_obs = n(),
    mean_filings = mean(filings_count, na.rm = TRUE),
    median_filings = median(filings_count, na.rm = TRUE),
    sd_filings = sd(filings_count, na.rm = TRUE),
    zero_pct = sum(filings_count == 0) / n() * 100,
    .groups = "drop"
  )

train_test_comparison %>%
  kable(digits = 2, caption = "Filing Distribution Comparison: Training vs Test Sets",
        col.names = c("Set", "Observations", "Mean Filings", "Median Filings", "Standard Deviation Filings", "% Zero")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Filing Distribution Comparison: Training vs Test Sets
Set	Observations	Mean Filings	Median Filings	Standard Deviation Filings	% Zero
test	8832	2.64	2	3.20	27.39
training	17280	2.12	1	3.17	37.60

To prevent mass displacement events from destabilizing the Negative Binomial model, we capped the target variable at the ~99th percentile with 20 filings in the training set only. However, we retained the raw, uncapped values in the test set to validate the model’s performance against real-world unpredictability.

3. NEGATIVE BINOMIAL REGRESSION MODEL

Model Specification Rationale

Because of overdispersion, Negative Binomial regression is preferred over Poisson.

Model 1: Baseline with Temporal, Policy, and Monthly Features

Code

# Fit baseline Negative Binomial model with core time, policy, and monthly predictors.
model_nb_1 <- glm.nb(filings_count_capped ~ filings_ma3 + moratorium_active + factor(month_num) + is_extreme_spike,
                     data = df_train)

# Display 1st model summary.
cat("BASELINE NEGATIVE BINOMIAL MODEL SUMMARY\n")

BASELINE NEGATIVE BINOMIAL MODEL SUMMARY

Code

summary(model_nb_1)


Call:
glm.nb(formula = filings_count_capped ~ filings_ma3 + moratorium_active + 
    factor(month_num) + is_extreme_spike, data = df_train, init.theta = 1.652485134, 
    link = log)

Coefficients:
                     Estimate Std. Error z value             Pr(>|z|)    
(Intercept)          0.406671   0.032979  12.331 < 0.0000000000000002 ***
filings_ma3          0.211950   0.002944  72.000 < 0.0000000000000002 ***
moratorium_active   -0.587494   0.019745 -29.754 < 0.0000000000000002 ***
factor(month_num)2  -0.264619   0.045157  -5.860   0.0000000046293608 ***
factor(month_num)3   0.018918   0.043832   0.432             0.666037    
factor(month_num)4  -0.722253   0.045008 -16.047 < 0.0000000000000002 ***
factor(month_num)5  -0.432966   0.043909  -9.861 < 0.0000000000000002 ***
factor(month_num)6  -0.338440   0.043762  -7.734   0.0000000000000104 ***
factor(month_num)7  -0.143537   0.043031  -3.336             0.000851 ***
factor(month_num)8   0.010293   0.042140   0.244             0.807022    
factor(month_num)9   0.118125   0.041477   2.848             0.004400 ** 
factor(month_num)10  0.029408   0.040993   0.717             0.473130    
factor(month_num)11 -0.125061   0.041365  -3.023             0.002500 ** 
factor(month_num)12 -0.092971   0.041199  -2.257             0.024029 *  
is_extreme_spike     1.666549   0.129682  12.851 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1.6525) family taken to be 1)

    Null deviance: 27521  on 17279  degrees of freedom
Residual deviance: 18556  on 17265  degrees of freedom
AIC: 60201

Number of Fisher Scoring iterations: 1

              Theta:  1.6525 
          Std. Err.:  0.0388 

 2 x log-likelihood:  -60169.1300

Model 2: Add Tax Delinquency and Spatial Lag

Code

model_nb_2 <- glm.nb(filings_count_capped ~
                       filings_ma3 + moratorium_active + factor(month_num) + is_extreme_spike +
                       # Added spatial lag and delinquent properties because they're more actionable.
                       spatial_lag_filings + delinquent_prop_count,
                     data = df_train)

# Display 2nd model summary.
cat("2ND NEGATIVE BINOMIAL MODEL SUMMARY\n")

2ND NEGATIVE BINOMIAL MODEL SUMMARY

Code

summary(model_nb_2)


Call:
glm.nb(formula = filings_count_capped ~ filings_ma3 + moratorium_active + 
    factor(month_num) + is_extreme_spike + spatial_lag_filings + 
    delinquent_prop_count, data = df_train, init.theta = 1.785836474, 
    link = log)

Coefficients:
                         Estimate  Std. Error z value             Pr(>|z|)    
(Intercept)            0.16958717  0.03601190   4.709   0.0000024869281117 ***
filings_ma3            0.18508414  0.00302674  61.150 < 0.0000000000000002 ***
moratorium_active     -0.59704251  0.02096097 -28.484 < 0.0000000000000002 ***
factor(month_num)2    -0.26768237  0.04441417  -6.027   0.0000000016707370 ***
factor(month_num)3     0.03809786  0.04309203   0.884             0.376640    
factor(month_num)4    -0.70713449  0.04430404 -15.961 < 0.0000000000000002 ***
factor(month_num)5    -0.41107419  0.04346477  -9.458 < 0.0000000000000002 ***
factor(month_num)6    -0.32981983  0.04313644  -7.646   0.0000000000000207 ***
factor(month_num)7    -0.14707775  0.04236759  -3.471             0.000518 ***
factor(month_num)8     0.00797282  0.04146404   0.192             0.847521    
factor(month_num)9     0.10634056  0.04078377   2.607             0.009123 ** 
factor(month_num)10    0.03113026  0.04025258   0.773             0.439302    
factor(month_num)11   -0.12814163  0.04065231  -3.152             0.001621 ** 
factor(month_num)12   -0.08544880  0.04048530  -2.111             0.034806 *  
is_extreme_spike       1.82586047  0.12550920  14.548 < 0.0000000000000002 ***
spatial_lag_filings    0.02619239  0.00486258   5.387   0.0000000718332945 ***
delinquent_prop_count  0.00099969  0.00004104  24.361 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1.7858) family taken to be 1)

    Null deviance: 28450  on 17279  degrees of freedom
Residual deviance: 18548  on 17263  degrees of freedom
AIC: 59653

Number of Fisher Scoring iterations: 1

              Theta:  1.7858 
          Std. Err.:  0.0434 

 2 x log-likelihood:  -59616.6730

Model 3: Add ACS

Code

model_nb_3 <- glm.nb(filings_count_capped ~
                       # 1st model.
                       filings_ma3 + moratorium_active + factor(month_num) + is_extreme_spike + neighborhood +
                       # 2nd model.
                       spatial_lag_filings + delinquent_prop_count +
                       # Add demographics.
                       pct_renter + severe_rent_burden + poverty_rate + pct_single_mother +
                       factor(racial_majority),
                     data = df_train)

# Display 3rd model summary.
cat("3RD NEGATIVE BINOMIAL MODEL SUMMARY\n")

3RD NEGATIVE BINOMIAL MODEL SUMMARY

Code

summary(model_nb_3)


Call:
glm.nb(formula = filings_count_capped ~ filings_ma3 + moratorium_active + 
    factor(month_num) + is_extreme_spike + neighborhood + spatial_lag_filings + 
    delinquent_prop_count + pct_renter + severe_rent_burden + 
    poverty_rate + pct_single_mother + factor(racial_majority), 
    data = df_train, init.theta = 2.456047024, link = log)

Coefficients:
                                           Estimate  Std. Error z value
(Intercept)                             -0.32885876  0.08373969  -3.927
filings_ma3                              0.10824420  0.00324140  33.394
moratorium_active                       -0.72840866  0.02044538 -35.627
factor(month_num)2                      -0.26247687  0.04149013  -6.326
factor(month_num)3                       0.04896265  0.04011348   1.221
factor(month_num)4                      -0.67002325  0.04171184 -16.063
factor(month_num)5                      -0.41028851  0.04087245 -10.038
factor(month_num)6                      -0.33329963  0.04040477  -8.249
factor(month_num)7                      -0.18329105  0.03969309  -4.618
factor(month_num)8                      -0.01468898  0.03872740  -0.379
factor(month_num)9                       0.08037622  0.03807647   2.111
factor(month_num)10                     -0.00068793  0.03751738  -0.018
factor(month_num)11                     -0.14435744  0.03793421  -3.805
factor(month_num)12                     -0.08616436  0.03772642  -2.284
is_extreme_spike                         1.75915476  0.11004495  15.986
neighborhoodAcademy Gardens              0.46868176  0.15752547   2.975
neighborhoodAllegheny West              -0.48975161  0.10430841  -4.695
neighborhoodAndorra                     -3.70144762  1.00371354  -3.688
neighborhoodBartram Village             -0.47563860  0.11981972  -3.970
neighborhoodBelmont                     -0.00989062  0.13651137  -0.072
neighborhoodBrewerytown                 -0.10763228  0.08395922  -1.282
neighborhoodBridesburg                  -0.53703472  0.12210109  -4.398
neighborhoodBurholme                     0.34905619  0.14083720   2.478
neighborhoodBustleton                    0.10438794  0.07693362   1.357
neighborhoodByberry                      0.67177557  0.12873656   5.218
neighborhoodCarroll Park                 0.14207463  0.13914364   1.021
neighborhoodCedar Park                   0.05976045  0.17349190   0.344
neighborhoodCedarbrook                  -0.00166750  0.09703355  -0.017
neighborhoodCenter City East            -0.41877600  0.11827091  -3.541
neighborhoodChestnut Hill               -0.67456234  0.10638519  -6.341
neighborhoodClearview                    0.28564927  0.08874398   3.219
neighborhoodCobbs Creek                  0.13771192  0.07563235   1.821
neighborhoodCrescentville                0.64174551  0.14655620   4.379
neighborhoodDickinson Narrows           -0.37530165  0.12653466  -2.966
neighborhoodEast Germantown              0.01876951  0.08429356   0.223
neighborhoodEast Mount Airy              0.18446467  0.07272055   2.537
neighborhoodEast Oak Lane               -0.73782079  0.22568629  -3.269
neighborhoodEast Parkside               -0.13621482  0.11061447  -1.231
neighborhoodEastwick                    -0.22687952  0.11669662  -1.944
neighborhoodElmwood                      0.00155153  0.15001950   0.010
neighborhoodFairhill                    -0.59160524  0.11584999  -5.107
neighborhoodFairmount                   -0.71642909  0.13054756  -5.488
neighborhoodFeltonville                  0.24669600  0.13596714   1.814
neighborhoodFishtown - Lower Kensington -0.66113045  0.14283317  -4.629
neighborhoodFitler Square               -0.87777924  0.19971650  -4.395
neighborhoodFox Chase                    0.31386497  0.11179379   2.808
neighborhoodFrancisville                -0.17055858  0.11865966  -1.437
neighborhoodFrankford                    0.39768227  0.08851345   4.493
neighborhoodFranklin Mills              -1.03757555  0.35782763  -2.900
neighborhoodFranklinville               -0.24506397  0.11310195  -2.167
neighborhoodGermantown - Morton         -0.69023633  0.17911193  -3.854
neighborhoodGermantown - Penn Knox       0.27843900  0.13060709   2.132
neighborhoodGirard Estates               0.06866719  0.16590901   0.414
neighborhoodGlenwood                    -0.52931597  0.12785307  -4.140
neighborhoodGraduate Hospital           -0.44607208  0.12062821  -3.698
neighborhoodGrays Ferry                 -0.01153080  0.07274276  -0.159
neighborhoodHaddington                   0.01473490  0.08847075   0.167
neighborhoodHarrowgate                   0.16413256  0.15706132   1.045
neighborhoodHartranft                   -0.41205999  0.12573499  -3.277
neighborhoodHawthorne                   -0.27119831  0.14180180  -1.913
neighborhoodHolmesburg                   0.83856618  0.09352799   8.966
neighborhoodHunting Park                -0.08154019  0.18864774  -0.432
neighborhoodIndustrial                   0.41139489  0.09656724   4.260
neighborhoodJuniata Park                 0.47813540  0.14745273   3.243
neighborhoodKingsessing                  0.01155933  0.07994566   0.145
neighborhoodLawndale                     0.65475002  0.17783919   3.682
neighborhoodLogan                        0.29100304  0.09990076   2.913
neighborhoodLogan Square                -0.20965619  0.13893856  -1.509
neighborhoodLower Moyamensing           -0.20926456  0.11343650  -1.845
neighborhoodLudlow                       0.25386145  0.15420013   1.646
neighborhoodMantua                      -0.22466546  0.15190610  -1.479
neighborhoodMayfair                      0.71759154  0.09380730   7.650
neighborhoodMill Creek                   0.21340550  0.15166722   1.407
neighborhoodModena                       1.38672039  0.13835932  10.023
neighborhoodMorrell Park                -0.02342778  0.23184103  -0.101
neighborhoodNicetown                     0.02498215  0.10258258   0.244
neighborhoodNorth Central               -0.28967180  0.10525821  -2.752
neighborhoodNortheast Phila Airport      0.37007935  0.13553448   2.731
neighborhoodNorthern Liberties          -0.34656504  0.12223663  -2.835
neighborhoodNorthwood                    0.50739140  0.08945392   5.672
neighborhoodOgontz                       0.09058871  0.14605998   0.620
neighborhoodOld City                    -0.70295702  0.11739622  -5.988
neighborhoodOld Kensington              -0.85874062  0.23393610  -3.671
neighborhoodOlney                        0.33488920  0.06826431   4.906
neighborhoodOverbrook                    0.16604677  0.07843303   2.117
neighborhoodOxford Circle                0.46689987  0.07862386   5.938
neighborhoodParkwood Manor               0.16288090  0.15392819   1.058
neighborhoodPaschall                    -0.21198188  0.09024141  -2.349
neighborhoodPassyunk Square             -0.77349522  0.18858421  -4.102
neighborhoodPennypack Park               0.41927837  0.07837068   5.350
neighborhoodPoint Breeze                -0.71228184  0.12134541  -5.870
neighborhoodRhawnhurst                   0.32020992  0.07986139   4.010
neighborhoodRichmond                     0.13562758  0.07875827   1.722
neighborhoodRittenhouse                 -1.05725258  0.10171197 -10.395
neighborhoodRiverfront                  -0.17987930  0.08014787  -2.244
neighborhoodRoxborough                  -0.48517106  0.12948921  -3.747
neighborhoodSharswood                   -0.37498504  0.10777194  -3.479
neighborhoodSociety Hill                -0.97128479  0.13490591  -7.200
neighborhoodSomerton                     0.29230952  0.09917735   2.947
neighborhoodSouthwest Germantown        -0.21810837  0.11764461  -1.854
neighborhoodSpring Garden                0.12993231  0.15357065   0.846
neighborhoodStadium District            -0.08256467  0.22021152  -0.375
neighborhoodStanton                     -0.03674313  0.07639561  -0.481
neighborhoodStrawberry Mansion           0.27142511  0.13993797   1.940
neighborhoodSummerdale                   1.01430415  0.16144081   6.283
neighborhoodTacony                       0.13075839  0.16491604   0.793
neighborhoodTioga                       -0.07611914  0.09054053  -0.841
neighborhoodTorresdale                   0.26925474  0.18542236   1.452
neighborhoodUniversity City             -0.46304900  0.13215360  -3.504
neighborhoodUpper Kensington            -0.30332474  0.17418694  -1.741
neighborhoodUpper Roxborough            -0.16332421  0.12475403  -1.309
neighborhoodWalnut Hill                  0.05295536  0.07598310   0.697
neighborhoodWashington Square West      -0.74713151  0.11504714  -6.494
neighborhoodWest Kensington             -0.04189261  0.13133936  -0.319
neighborhoodWest Mount Airy              0.00491165  0.14234699   0.035
neighborhoodWest Oak Lane                0.06840645  0.07560329   0.905
neighborhoodWest Parkside                0.23100539  0.11546496   2.001
neighborhoodWest Powelton               -0.15412798  0.11410181  -1.351
neighborhoodWest Torresdale              0.55821049  0.10741736   5.197
neighborhoodWissahickon Park             0.14953999  0.08944801   1.672
neighborhoodWissinoming                  0.18973127  0.15874658   1.195
neighborhoodWoodland Terrace            -0.66444414  0.11855089  -5.605
neighborhoodWynnefield Heights          -0.03266726  0.11637500  -0.281
neighborhoodYorktown                    -0.26931181  0.15850742  -1.699
spatial_lag_filings                      0.03170557  0.00497125   6.378
delinquent_prop_count                    0.00116241  0.00007638  15.219
pct_renter                               0.01658298  0.00071223  23.283
severe_rent_burden                       0.00135210  0.00069063   1.958
poverty_rate                            -0.00390303  0.00115768  -3.371
pct_single_mother                        0.00036465  0.00119749   0.305
factor(racial_majority)Hispanic         -0.22254760  0.06140714  -3.624
factor(racial_majority)Other            -0.11998669  0.03720466  -3.225
factor(racial_majority)White            -0.30372526  0.04583831  -6.626
                                                    Pr(>|z|)    
(Intercept)                               0.0000859564901167 ***
filings_ma3                             < 0.0000000000000002 ***
moratorium_active                       < 0.0000000000000002 ***
factor(month_num)2                        0.0000000002511928 ***
factor(month_num)3                                  0.222236    
factor(month_num)4                      < 0.0000000000000002 ***
factor(month_num)5                      < 0.0000000000000002 ***
factor(month_num)6                      < 0.0000000000000002 ***
factor(month_num)7                        0.0000038800410245 ***
factor(month_num)8                                  0.704471    
factor(month_num)9                                  0.034780 *  
factor(month_num)10                                 0.985371    
factor(month_num)11                                 0.000142 ***
factor(month_num)12                                 0.022376 *  
is_extreme_spike                        < 0.0000000000000002 ***
neighborhoodAcademy Gardens                         0.002927 ** 
neighborhoodAllegheny West                0.0000026631137103 ***
neighborhoodAndorra                                 0.000226 ***
neighborhoodBartram Village               0.0000719877138962 ***
neighborhoodBelmont                                 0.942242    
neighborhoodBrewerytown                             0.199857    
neighborhoodBridesburg                    0.0000109112440242 ***
neighborhoodBurholme                                0.013196 *  
neighborhoodBustleton                               0.174827    
neighborhoodByberry                       0.0000001806517035 ***
neighborhoodCarroll Park                            0.307224    
neighborhoodCedar Park                              0.730503    
neighborhoodCedarbrook                              0.986289    
neighborhoodCenter City East                        0.000399 ***
neighborhoodChestnut Hill                 0.0000000002286426 ***
neighborhoodClearview                               0.001287 ** 
neighborhoodCobbs Creek                             0.068636 .  
neighborhoodCrescentville                 0.0000119315120275 ***
neighborhoodDickinson Narrows                       0.003017 ** 
neighborhoodEast Germantown                         0.823794    
neighborhoodEast Mount Airy                         0.011193 *  
neighborhoodEast Oak Lane                           0.001078 ** 
neighborhoodEast Parkside                           0.218159    
neighborhoodEastwick                                0.051873 .  
neighborhoodElmwood                                 0.991748    
neighborhoodFairhill                      0.0000003279222426 ***
neighborhoodFairmount                     0.0000000406790723 ***
neighborhoodFeltonville                             0.069619 .  
neighborhoodFishtown - Lower Kensington   0.0000036798642764 ***
neighborhoodFitler Square                 0.0000110708341392 ***
neighborhoodFox Chase                               0.004992 ** 
neighborhoodFrancisville                            0.150611    
neighborhoodFrankford                     0.0000070259142674 ***
neighborhoodFranklin Mills                          0.003736 ** 
neighborhoodFranklinville                           0.030254 *  
neighborhoodGermantown - Morton                     0.000116 ***
neighborhoodGermantown - Penn Knox                  0.033016 *  
neighborhoodGirard Estates                          0.678959    
neighborhoodGlenwood                      0.0000347255376780 ***
neighborhoodGraduate Hospital                       0.000217 ***
neighborhoodGrays Ferry                             0.874051    
neighborhoodHaddington                              0.867723    
neighborhoodHarrowgate                              0.296013    
neighborhoodHartranft                               0.001048 ** 
neighborhoodHawthorne                               0.055810 .  
neighborhoodHolmesburg                  < 0.0000000000000002 ***
neighborhoodHunting Park                            0.665570    
neighborhoodIndustrial                    0.0000204252035669 ***
neighborhoodJuniata Park                            0.001184 ** 
neighborhoodKingsessing                             0.885035    
neighborhoodLawndale                                0.000232 ***
neighborhoodLogan                                   0.003581 ** 
neighborhoodLogan Square                            0.131303    
neighborhoodLower Moyamensing                       0.065071 .  
neighborhoodLudlow                                  0.099700 .  
neighborhoodMantua                                  0.139147    
neighborhoodMayfair                       0.0000000000000202 ***
neighborhoodMill Creek                              0.159408    
neighborhoodModena                      < 0.0000000000000002 ***
neighborhoodMorrell Park                            0.919510    
neighborhoodNicetown                                0.807593    
neighborhoodNorth Central                           0.005923 ** 
neighborhoodNortheast Phila Airport                 0.006323 ** 
neighborhoodNorthern Liberties                      0.004580 ** 
neighborhoodNorthwood                     0.0000000141059370 ***
neighborhoodOgontz                                  0.535116    
neighborhoodOld City                      0.0000000021256564 ***
neighborhoodOld Kensington                          0.000242 ***
neighborhoodOlney                         0.0000009306018136 ***
neighborhoodOverbrook                               0.034255 *  
neighborhoodOxford Circle                 0.0000000028781827 ***
neighborhoodParkwood Manor                          0.289982    
neighborhoodPaschall                                0.018821 *  
neighborhoodPassyunk Square               0.0000410319892452 ***
neighborhoodPennypack Park                0.0000000879837556 ***
neighborhoodPoint Breeze                  0.0000000043613604 ***
neighborhoodRhawnhurst                    0.0000608291612128 ***
neighborhoodRichmond                                0.085056 .  
neighborhoodRittenhouse                 < 0.0000000000000002 ***
neighborhoodRiverfront                              0.024810 *  
neighborhoodRoxborough                              0.000179 ***
neighborhoodSharswood                               0.000502 ***
neighborhoodSociety Hill                  0.0000000000006034 ***
neighborhoodSomerton                                0.003205 ** 
neighborhoodSouthwest Germantown                    0.063745 .  
neighborhoodSpring Garden                           0.397511    
neighborhoodStadium District                        0.707710    
neighborhoodStanton                                 0.630546    
neighborhoodStrawberry Mansion                      0.052427 .  
neighborhoodSummerdale                    0.0000000003324778 ***
neighborhoodTacony                                  0.427849    
neighborhoodTioga                                   0.400505    
neighborhoodTorresdale                              0.146469    
neighborhoodUniversity City                         0.000459 ***
neighborhoodUpper Kensington                        0.081618 .  
neighborhoodUpper Roxborough                        0.190477    
neighborhoodWalnut Hill                             0.485843    
neighborhoodWashington Square West        0.0000000000835124 ***
neighborhoodWest Kensington                         0.749753    
neighborhoodWest Mount Airy                         0.972475    
neighborhoodWest Oak Lane                           0.365567    
neighborhoodWest Parkside                           0.045430 *  
neighborhoodWest Powelton                           0.176762    
neighborhoodWest Torresdale               0.0000002029112975 ***
neighborhoodWissahickon Park                        0.094562 .  
neighborhoodWissinoming                             0.232015    
neighborhoodWoodland Terrace              0.0000000208596177 ***
neighborhoodWynnefield Heights                      0.778935    
neighborhoodYorktown                                0.089310 .  
spatial_lag_filings                       0.0000000001796708 ***
delinquent_prop_count                   < 0.0000000000000002 ***
pct_renter                              < 0.0000000000000002 ***
severe_rent_burden                                  0.050258 .  
poverty_rate                                        0.000748 ***
pct_single_mother                                   0.760736    
factor(racial_majority)Hispanic                     0.000290 ***
factor(racial_majority)Other                        0.001260 ** 
factor(racial_majority)White              0.0000000000344875 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(2.456) family taken to be 1)

    Null deviance: 32272  on 17279  degrees of freedom
Residual deviance: 18400  on 17147  degrees of freedom
AIC: 57706

Number of Fisher Scoring iterations: 1

              Theta:  2.4560 
          Std. Err.:  0.0686 

 2 x log-likelihood:  -57437.6890

Model 4: Full Model with Interactions

Code

model_nb_4 <- glm.nb(filings_count_capped ~
                       # 1st model.
                       filings_ma3 + moratorium_active + factor(month_num) + is_extreme_spike + neighborhood +
                       # 2nd model.
                       spatial_lag_filings + delinquent_prop_count +
                       # 3rd model.
                       pct_renter + severe_rent_burden + poverty_rate + pct_single_mother +
                       factor(racial_majority) +
                       # Add interactions.
                       factor(racial_majority) * moratorium_active,
                     data = df_train)

cat("FULL NEGATIVE BINOMIAL MODEL SUMMARY\n")

FULL NEGATIVE BINOMIAL MODEL SUMMARY

Code

summary(model_nb_4)


Call:
glm.nb(formula = filings_count_capped ~ filings_ma3 + moratorium_active + 
    factor(month_num) + is_extreme_spike + neighborhood + spatial_lag_filings + 
    delinquent_prop_count + pct_renter + severe_rent_burden + 
    poverty_rate + pct_single_mother + factor(racial_majority) + 
    factor(racial_majority) * moratorium_active, data = df_train, 
    init.theta = 2.465189995, link = log)

Coefficients:
                                                    Estimate Std. Error z value
(Intercept)                                       -0.3248197  0.0838622  -3.873
filings_ma3                                        0.1080455  0.0032524  33.220
moratorium_active                                 -0.7519105  0.0281807 -26.682
factor(month_num)2                                -0.2630601  0.0414593  -6.345
factor(month_num)3                                 0.0488936  0.0400794   1.220
factor(month_num)4                                -0.6696615  0.0416854 -16.065
factor(month_num)5                                -0.4092353  0.0408380 -10.021
factor(month_num)6                                -0.3326693  0.0403722  -8.240
factor(month_num)7                                -0.1830087  0.0396631  -4.614
factor(month_num)8                                -0.0149657  0.0386996  -0.387
factor(month_num)9                                 0.0807294  0.0380454   2.122
factor(month_num)10                               -0.0006626  0.0374831  -0.018
factor(month_num)11                               -0.1440741  0.0378994  -3.801
factor(month_num)12                               -0.0850610  0.0376887  -2.257
is_extreme_spike                                   1.7676705  0.1098917  16.086
neighborhoodAcademy Gardens                        0.4680273  0.1577582   2.967
neighborhoodAllegheny West                        -0.4861105  0.1042729  -4.662
neighborhoodAndorra                               -3.7044683  1.0035887  -3.691
neighborhoodBartram Village                       -0.4749069  0.1197343  -3.966
neighborhoodBelmont                               -0.0055532  0.1364268  -0.041
neighborhoodBrewerytown                           -0.1048998  0.0839285  -1.250
neighborhoodBridesburg                            -0.5357079  0.1220534  -4.389
neighborhoodBurholme                               0.3446663  0.1405201   2.453
neighborhoodBustleton                              0.1048551  0.0769905   1.362
neighborhoodByberry                                0.6730944  0.1288547   5.224
neighborhoodCarroll Park                           0.1460024  0.1390734   1.050
neighborhoodCedar Park                             0.0569091  0.1738193   0.327
neighborhoodCedarbrook                             0.0005676  0.0969952   0.006
neighborhoodCenter City East                      -0.4180745  0.1183566  -3.532
neighborhoodChestnut Hill                         -0.6696446  0.1062139  -6.305
neighborhoodClearview                              0.2887634  0.0887244   3.255
neighborhoodCobbs Creek                            0.1407026  0.0756202   1.861
neighborhoodCrescentville                          0.6452426  0.1464806   4.405
neighborhoodDickinson Narrows                     -0.3720910  0.1264063  -2.944
neighborhoodEast Germantown                        0.0217092  0.0842761   0.258
neighborhoodEast Mount Airy                        0.1867841  0.0727120   2.569
neighborhoodEast Oak Lane                         -0.7370008  0.2256317  -3.266
neighborhoodEast Parkside                         -0.1329889  0.1105773  -1.203
neighborhoodEastwick                              -0.2262931  0.1166871  -1.939
neighborhoodElmwood                                0.0035989  0.1499660   0.024
neighborhoodFairhill                              -0.5885811  0.1156196  -5.091
neighborhoodFairmount                             -0.7180338  0.1306574  -5.496
neighborhoodFeltonville                            0.2488892  0.1356230   1.835
neighborhoodFishtown - Lower Kensington           -0.6614383  0.1429244  -4.628
neighborhoodFitler Square                         -0.8817042  0.2000285  -4.408
neighborhoodFox Chase                              0.3145977  0.1118466   2.813
neighborhoodFrancisville                          -0.1664610  0.1185074  -1.405
neighborhoodFrankford                              0.3985985  0.0884240   4.508
neighborhoodFranklin Mills                        -1.0399820  0.3580811  -2.904
neighborhoodFranklinville                         -0.2423228  0.1129082  -2.146
neighborhoodGermantown - Morton                   -0.6895901  0.1790879  -3.851
neighborhoodGermantown - Penn Knox                 0.2823489  0.1305342   2.163
neighborhoodGirard Estates                         0.0727690  0.1659803   0.438
neighborhoodGlenwood                              -0.5275492  0.1278454  -4.126
neighborhoodGraduate Hospital                     -0.4458133  0.1206666  -3.695
neighborhoodGrays Ferry                           -0.0095505  0.0727336  -0.131
neighborhoodHaddington                             0.0175236  0.0884526   0.198
neighborhoodHarrowgate                             0.1635468  0.1565196   1.045
neighborhoodHartranft                             -0.4085759  0.1256149  -3.253
neighborhoodHawthorne                             -0.2703531  0.1418869  -1.905
neighborhoodHolmesburg                             0.8400096  0.0934559   8.988
neighborhoodHunting Park                          -0.0686518  0.1878736  -0.365
neighborhoodIndustrial                             0.4121473  0.0965939   4.267
neighborhoodJuniata Park                           0.4744505  0.1469665   3.228
neighborhoodKingsessing                            0.0141586  0.0799532   0.177
neighborhoodLawndale                               0.6567920  0.1774442   3.701
neighborhoodLogan                                  0.2948786  0.0998634   2.953
neighborhoodLogan Square                          -0.2060583  0.1389616  -1.483
neighborhoodLower Moyamensing                     -0.2074244  0.1133818  -1.829
neighborhoodLudlow                                 0.2534097  0.1544635   1.641
neighborhoodMantua                                -0.2229635  0.1518729  -1.468
neighborhoodMayfair                                0.7186441  0.0937483   7.666
neighborhoodMill Creek                             0.2162096  0.1516117   1.426
neighborhoodModena                                 1.3920054  0.1384975  10.051
neighborhoodMorrell Park                          -0.0201722  0.2317480  -0.087
neighborhoodNicetown                               0.0286773  0.1025483   0.280
neighborhoodNorth Central                         -0.2864888  0.1051156  -2.725
neighborhoodNortheast Phila Airport                0.3729762  0.1350170   2.762
neighborhoodNorthern Liberties                    -0.3441610  0.1220899  -2.819
neighborhoodNorthwood                              0.5071661  0.0893051   5.679
neighborhoodOgontz                                 0.0940710  0.1459915   0.644
neighborhoodOld City                              -0.6999758  0.1172796  -5.968
neighborhoodOld Kensington                        -0.8603177  0.2341487  -3.674
neighborhoodOlney                                  0.3380187  0.0682557   4.952
neighborhoodOverbrook                              0.1691286  0.0784081   2.157
neighborhoodOxford Circle                          0.4665678  0.0785510   5.940
neighborhoodParkwood Manor                         0.1619696  0.1540848   1.051
neighborhoodPaschall                              -0.2091473  0.0902250  -2.318
neighborhoodPassyunk Square                       -0.7735358  0.1886362  -4.101
neighborhoodPennypack Park                         0.4205041  0.0783707   5.366
neighborhoodPoint Breeze                          -0.7099338  0.1213075  -5.852
neighborhoodRhawnhurst                             0.3200247  0.0798196   4.009
neighborhoodRichmond                               0.1350127  0.0787448   1.715
neighborhoodRittenhouse                           -1.0582831  0.1017593 -10.400
neighborhoodRiverfront                            -0.1798084  0.0801993  -2.242
neighborhoodRoxborough                            -0.4851839  0.1295713  -3.745
neighborhoodSharswood                             -0.3688774  0.1075890  -3.429
neighborhoodSociety Hill                          -0.9726251  0.1349939  -7.205
neighborhoodSomerton                               0.2952169  0.0992126   2.976
neighborhoodSouthwest Germantown                  -0.2157996  0.1176222  -1.835
neighborhoodSpring Garden                          0.1295709  0.1537993   0.842
neighborhoodStadium District                      -0.0805165  0.2202462  -0.366
neighborhoodStanton                               -0.0332452  0.0763496  -0.435
neighborhoodStrawberry Mansion                     0.2739033  0.1398906   1.958
neighborhoodSummerdale                             1.0123590  0.1611385   6.283
neighborhoodTacony                                 0.1320930  0.1645324   0.803
neighborhoodTioga                                 -0.0725610  0.0905149  -0.802
neighborhoodTorresdale                             0.2682252  0.1856364   1.445
neighborhoodUniversity City                       -0.4615446  0.1320569  -3.495
neighborhoodUpper Kensington                      -0.2904119  0.1733686  -1.675
neighborhoodUpper Roxborough                      -0.1678052  0.1249725  -1.343
neighborhoodWalnut Hill                            0.0558667  0.0759833   0.735
neighborhoodWashington Square West                -0.7470012  0.1151232  -6.489
neighborhoodWest Kensington                       -0.0373199  0.1311339  -0.285
neighborhoodWest Mount Airy                        0.0036632  0.1424805   0.026
neighborhoodWest Oak Lane                          0.0706722  0.0755976   0.935
neighborhoodWest Parkside                          0.2344305  0.1154254   2.031
neighborhoodWest Powelton                         -0.1508088  0.1140628  -1.322
neighborhoodWest Torresdale                        0.5626531  0.1074484   5.236
neighborhoodWissahickon Park                       0.1516284  0.0894946   1.694
neighborhoodWissinoming                            0.1916163  0.1583303   1.210
neighborhoodWoodland Terrace                      -0.6633801  0.1185524  -5.596
neighborhoodWynnefield Heights                    -0.0315577  0.1163597  -0.271
neighborhoodYorktown                              -0.2689479  0.1585172  -1.697
spatial_lag_filings                                0.0318017  0.0049681   6.401
delinquent_prop_count                              0.0011601  0.0000763  15.206
pct_renter                                         0.0165956  0.0007124  23.297
severe_rent_burden                                 0.0013658  0.0006901   1.979
poverty_rate                                      -0.0039244  0.0011564  -3.394
pct_single_mother                                  0.0003694  0.0011962   0.309
factor(racial_majority)Hispanic                   -0.2788499  0.0649017  -4.296
factor(racial_majority)Other                      -0.1557477  0.0393570  -3.957
factor(racial_majority)White                      -0.2877332  0.0471982  -6.096
moratorium_active:factor(racial_majority)Hispanic  0.1913971  0.0714723   2.678
moratorium_active:factor(racial_majority)Other     0.1267347  0.0462870   2.738
moratorium_active:factor(racial_majority)White    -0.0620417  0.0435842  -1.423
                                                              Pr(>|z|)    
(Intercept)                                                   0.000107 ***
filings_ma3                                       < 0.0000000000000002 ***
moratorium_active                                 < 0.0000000000000002 ***
factor(month_num)2                                  0.0000000002223956 ***
factor(month_num)3                                            0.222496    
factor(month_num)4                                < 0.0000000000000002 ***
factor(month_num)5                                < 0.0000000000000002 ***
factor(month_num)6                                < 0.0000000000000002 ***
factor(month_num)7                                  0.0000039483381727 ***
factor(month_num)8                                            0.698968    
factor(month_num)9                                            0.033844 *  
factor(month_num)10                                           0.985896    
factor(month_num)11                                           0.000144 ***
factor(month_num)12                                           0.024012 *  
is_extreme_spike                                  < 0.0000000000000002 ***
neighborhoodAcademy Gardens                                   0.003010 ** 
neighborhoodAllegheny West                          0.0000031329499555 ***
neighborhoodAndorra                                           0.000223 ***
neighborhoodBartram Village                         0.0000729852146261 ***
neighborhoodBelmont                                           0.967531    
neighborhoodBrewerytown                                       0.211346    
neighborhoodBridesburg                              0.0000113805825700 ***
neighborhoodBurholme                                          0.014175 *  
neighborhoodBustleton                                         0.173222    
neighborhoodByberry                                 0.0000001754113622 ***
neighborhoodCarroll Park                                      0.293800    
neighborhoodCedar Park                                        0.743363    
neighborhoodCedarbrook                                        0.995330    
neighborhoodCenter City East                                  0.000412 ***
neighborhoodChestnut Hill                           0.0000000002887952 ***
neighborhoodClearview                                         0.001135 ** 
neighborhoodCobbs Creek                                       0.062794 .  
neighborhoodCrescentville                           0.0000105799199585 ***
neighborhoodDickinson Narrows                                 0.003244 ** 
neighborhoodEast Germantown                                   0.796718    
neighborhoodEast Mount Airy                                   0.010205 *  
neighborhoodEast Oak Lane                                     0.001089 ** 
neighborhoodEast Parkside                                     0.229101    
neighborhoodEastwick                                          0.052463 .  
neighborhoodElmwood                                           0.980854    
neighborhoodFairhill                                0.0000003568024108 ***
neighborhoodFairmount                               0.0000000389499049 ***
neighborhoodFeltonville                                       0.066483 .  
neighborhoodFishtown - Lower Kensington             0.0000036941031252 ***
neighborhoodFitler Square                           0.0000104380975822 ***
neighborhoodFox Chase                                         0.004912 ** 
neighborhoodFrancisville                                      0.160126    
neighborhoodFrankford                               0.0000065500328111 ***
neighborhoodFranklin Mills                                    0.003681 ** 
neighborhoodFranklinville                                     0.031858 *  
neighborhoodGermantown - Morton                               0.000118 ***
neighborhoodGermantown - Penn Knox                            0.030539 *  
neighborhoodGirard Estates                                    0.661082    
neighborhoodGlenwood                                0.0000368388621264 ***
neighborhoodGraduate Hospital                                 0.000220 ***
neighborhoodGrays Ferry                                       0.895532    
neighborhoodHaddington                                        0.842957    
neighborhoodHarrowgate                                        0.296071    
neighborhoodHartranft                                         0.001144 ** 
neighborhoodHawthorne                                         0.056726 .  
neighborhoodHolmesburg                            < 0.0000000000000002 ***
neighborhoodHunting Park                                      0.714802    
neighborhoodIndustrial                              0.0000198292975115 ***
neighborhoodJuniata Park                                      0.001245 ** 
neighborhoodKingsessing                                       0.859440    
neighborhoodLawndale                                          0.000214 ***
neighborhoodLogan                                             0.003149 ** 
neighborhoodLogan Square                                      0.138116    
neighborhoodLower Moyamensing                                 0.067335 .  
neighborhoodLudlow                                            0.100885    
neighborhoodMantua                                            0.142079    
neighborhoodMayfair                                 0.0000000000000178 ***
neighborhoodMill Creek                                        0.153847    
neighborhoodModena                                < 0.0000000000000002 ***
neighborhoodMorrell Park                                      0.930637    
neighborhoodNicetown                                          0.779749    
neighborhoodNorth Central                                     0.006421 ** 
neighborhoodNortheast Phila Airport                           0.005737 ** 
neighborhoodNorthern Liberties                                0.004819 ** 
neighborhoodNorthwood                               0.0000000135464300 ***
neighborhoodOgontz                                            0.519343    
neighborhoodOld City                                0.0000000023953619 ***
neighborhoodOld Kensington                                    0.000239 ***
neighborhoodOlney                                   0.0000007336212283 ***
neighborhoodOverbrook                                         0.031003 *  
neighborhoodOxford Circle                           0.0000000028557960 ***
neighborhoodParkwood Manor                                    0.293180    
neighborhoodPaschall                                          0.020446 *  
neighborhoodPassyunk Square                         0.0000411946753022 ***
neighborhoodPennypack Park                          0.0000000806898321 ***
neighborhoodPoint Breeze                            0.0000000048467465 ***
neighborhoodRhawnhurst                              0.0000608859578170 ***
neighborhoodRichmond                                          0.086426 .  
neighborhoodRittenhouse                           < 0.0000000000000002 ***
neighborhoodRiverfront                                        0.024960 *  
neighborhoodRoxborough                                        0.000181 ***
neighborhoodSharswood                                         0.000607 ***
neighborhoodSociety Hill                            0.0000000000005806 ***
neighborhoodSomerton                                          0.002924 ** 
neighborhoodSouthwest Germantown                              0.066552 .  
neighborhoodSpring Garden                                     0.399526    
neighborhoodStadium District                                  0.714682    
neighborhoodStanton                                           0.663248    
neighborhoodStrawberry Mansion                                0.050232 .  
neighborhoodSummerdale                              0.0000000003330862 ***
neighborhoodTacony                                            0.422068    
neighborhoodTioga                                             0.422757    
neighborhoodTorresdale                                        0.148487    
neighborhoodUniversity City                                   0.000474 ***
neighborhoodUpper Kensington                                  0.093912 .  
neighborhoodUpper Roxborough                                  0.179357    
neighborhoodWalnut Hill                                       0.462188    
neighborhoodWashington Square West                  0.0000000000865749 ***
neighborhoodWest Kensington                                   0.775955    
neighborhoodWest Mount Airy                                   0.979488    
neighborhoodWest Oak Lane                                     0.349867    
neighborhoodWest Parkside                                     0.042254 *  
neighborhoodWest Powelton                                     0.186116    
neighborhoodWest Torresdale                         0.0000001636537881 ***
neighborhoodWissahickon Park                                  0.090213 .  
neighborhoodWissinoming                                       0.226190    
neighborhoodWoodland Terrace                        0.0000000219768638 ***
neighborhoodWynnefield Heights                                0.786231    
neighborhoodYorktown                                          0.089763 .  
spatial_lag_filings                                 0.0000000001541891 ***
delinquent_prop_count                             < 0.0000000000000002 ***
pct_renter                                        < 0.0000000000000002 ***
severe_rent_burden                                            0.047794 *  
poverty_rate                                                  0.000690 ***
pct_single_mother                                             0.757454    
factor(racial_majority)Hispanic                     0.0000173516370187 ***
factor(racial_majority)Other                        0.0000758004716086 ***
factor(racial_majority)White                        0.0000000010857173 ***
moratorium_active:factor(racial_majority)Hispanic             0.007408 ** 
moratorium_active:factor(racial_majority)Other                0.006181 ** 
moratorium_active:factor(racial_majority)White                0.154594    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(2.4652) family taken to be 1)

    Null deviance: 32316  on 17279  degrees of freedom
Residual deviance: 18401  on 17144  degrees of freedom
AIC: 57691

Number of Fisher Scoring iterations: 1

              Theta:  2.4652 
          Std. Err.:  0.0690 

 2 x log-likelihood:  -57417.1070

Code

# Compare model fit statistics.
model_comparison <- data.frame(
  model = c("Baseline", "+ Tax and Spatial", "+ ACS", "+ Interactions"),
  aic = c(AIC(model_nb_1), AIC(model_nb_2), AIC(model_nb_3), AIC(model_nb_4)),
  theta = c(model_nb_1$theta, model_nb_2$theta, model_nb_3$theta, model_nb_4$theta),
  loglik = c(logLik(model_nb_1), logLik(model_nb_2), logLik(model_nb_3), logLik(model_nb_4))
) %>%
  mutate(
    aic_change = aic - min(aic),
    rank = rank(aic)
  )

model_comparison %>%
  kable(digits = 2, caption = "Model Comparison Statistics (Lower AIC = Better Fit)",
        col.names = c("Model", "AIC", "Theta", "Log Likelihood", "AIC Change", "Rank")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Model Comparison Statistics (Lower AIC = Better Fit)
Model	AIC	Theta	Log Likelihood	AIC Change	Rank
Baseline	60201.13	1.65	-30084.56	2510.02	4
+ Tax and Spatial	59652.67	1.79	-29808.34	1961.57	3
+ ACS	57705.69	2.46	-28718.84	14.58	2
+ Interactions	57691.11	2.47	-28708.55	0.00	1

The addition of the policy interaction term (racial_majority * moratorium_active) resulted in the lowest AIC, making it the best-fitting model. However, for deployment, the 3rd model is best with lower MAE and a more simple model to interpret. filings_ma3 is a strong predicting for past filing volume and reinforcing that eviction risk is persistent. is_extreme_spike is significant showcasing tracts with a history of mass-eviction events face a structurally higher baseline risk at a whopping 5.8 IRR. spatial_lag_filings is positive and significant, showing that there is spatial spillover of eviction risk. delinquent_prop_count is also significant, and this reinforces the link between landlord financial instability and pressure to pursue eviction filings. The interaction terms also reveal a nuanced policy impact. While the moratorium reduced filings universally, the interaction coefficients indicate that the benefit was not uniform across racial groups, as what was expected.

4. MODEL VALIDATION AND DIAGNOSTICS

Train and Test Predictions

Generate predictions for all four models on both training and test sets.

Code

# TRAINING SET PREDICTIONS.
# Model 1: Baseline.
train_pred_1 <- df_train %>%
  mutate(
    predicted = predict(model_nb_1, newdata = ., type = "response"),
    residual = filings_count_capped - predicted,
    abs_error = abs(residual)
  )

# Model 2: + Tax/Spatial.
train_pred_2 <- df_train %>%
  mutate(
    predicted = predict(model_nb_2, newdata = ., type = "response"),
    residual = filings_count_capped - predicted,
    abs_error = abs(residual)
  )

# Model 3: + ACS.
train_pred_3 <- df_train %>%
  mutate(
    predicted = predict(model_nb_3, newdata = ., type = "response"),
    residual = filings_count_capped - predicted,
    abs_error = abs(residual)
  )

# Model 4: + Interactions.
train_pred_4 <- df_train %>%
  mutate(
    predicted = predict(model_nb_4, newdata = ., type = "response"),
    residual = filings_count_capped - predicted,
    abs_error = abs(residual)
  )

# TEST SET PREDICTIONS.
# Model 1: Baseline.
test_pred_1 <- df_test %>%
  mutate(
    predicted = predict(model_nb_1, newdata = ., type = "response"),
    residual = filings_count - predicted,
    abs_error = abs(residual)
  )

# Model 2: + Tax/Spatial.
test_pred_2 <- df_test %>%
  mutate(
    predicted = predict(model_nb_2, newdata = ., type = "response"),
    residual = filings_count - predicted,
    abs_error = abs(residual)
  )

# Model 3: + ACS.
test_pred_3 <- df_test %>%
  mutate(
    predicted = predict(model_nb_3, newdata = ., type = "response"),
    residual = filings_count - predicted,
    abs_error = abs(residual)
  )

# Model 4: + Interactions.
test_pred_4 <- df_test %>%
  mutate(
    predicted = predict(model_nb_4, newdata = ., type = "response"),
    residual = filings_count - predicted,
    abs_error = abs(residual)
  )

cat(sprintf("Training Set: %s observations\n", comma(nrow(train_pred_1))))

Training Set: 17,280 observations

Code

cat(sprintf("Test Set: %s observations\n", comma(nrow(test_pred_1))))

Test Set: 8,832 observations

Training Set Performance

Code

# Calculate training metrics for each model explicitly.
train_metrics_1 <- train_pred_1 %>%
  summarize(
    model = "Baseline",
    n = n(),
    MAE = mean(abs_error, na.rm = TRUE),
    RMSE = sqrt(mean(residual^2, na.rm = TRUE)),
    mean_observed = mean(filings_count_capped, na.rm = TRUE),
    mean_predicted = mean(predicted, na.rm = TRUE),
    correlation = cor(filings_count_capped, predicted, use = "complete.obs"),
    bias = mean(predicted - filings_count_capped, na.rm = TRUE)
  )

train_metrics_2 <- train_pred_2 %>%
  summarize(
    model = "+ Tax/Spatial",
    n = n(),
    MAE = mean(abs_error, na.rm = TRUE),
    RMSE = sqrt(mean(residual^2, na.rm = TRUE)),
    mean_observed = mean(filings_count_capped, na.rm = TRUE),
    mean_predicted = mean(predicted, na.rm = TRUE),
    correlation = cor(filings_count_capped, predicted, use = "complete.obs"),
    bias = mean(predicted - filings_count_capped, na.rm = TRUE)
  )

train_metrics_3 <- train_pred_3 %>%
  summarize(
    model = "+ ACS",
    n = n(),
    MAE = mean(abs_error, na.rm = TRUE),
    RMSE = sqrt(mean(residual^2, na.rm = TRUE)),
    mean_observed = mean(filings_count_capped, na.rm = TRUE),
    mean_predicted = mean(predicted, na.rm = TRUE),
    correlation = cor(filings_count_capped, predicted, use = "complete.obs"),
    bias = mean(predicted - filings_count_capped, na.rm = TRUE)
  )

train_metrics_4 <- train_pred_4 %>%
  summarize(
    model = "+ Interactions",
    n = n(),
    MAE = mean(abs_error, na.rm = TRUE),
    RMSE = sqrt(mean(residual^2, na.rm = TRUE)),
    mean_observed = mean(filings_count_capped, na.rm = TRUE),
    mean_predicted = mean(predicted, na.rm = TRUE),
    correlation = cor(filings_count_capped, predicted, use = "complete.obs"),
    bias = mean(predicted - filings_count_capped, na.rm = TRUE)
  )

# Combine training metrics.
train_metrics <- bind_rows(train_metrics_1, train_metrics_2, train_metrics_3, train_metrics_4)

# Display training performance.
train_metrics %>%
  kable(digits = 3, caption = "Training Set Performance Across Models",
        col.names = c("Model", "Number", "MAE", "RMSE", "Mean Observed", "Mean Predicted", "Correlation", "Bias")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Training Set Performance Across Models
Model	Number	MAE	RMSE	Mean Observed	Mean Predicted	Correlation	Bias
Baseline	17280	1.967	11.455	2.097	2.573	0.295	0.476
+ Tax/Spatial	17280	1.780	7.796	2.097	2.419	0.346	0.322
+ ACS	17280	1.480	3.192	2.097	2.205	0.545	0.108
+ Interactions	17280	1.478	3.204	2.097	2.204	0.544	0.107

Test Set Performance

Code

# Calculate test metrics for each model explicitly.
test_metrics_1 <- test_pred_1 %>%
  summarize(
    model = "Baseline",
    n = n(),
    MAE = mean(abs_error, na.rm = TRUE),
    RMSE = sqrt(mean(residual^2, na.rm = TRUE)),
    mean_observed = mean(filings_count, na.rm = TRUE),
    mean_predicted = mean(predicted, na.rm = TRUE),
    correlation = cor(filings_count, predicted, use = "complete.obs"),
    bias = mean(predicted - filings_count, na.rm = TRUE)
  )

test_metrics_2 <- test_pred_2 %>%
  summarize(
    model = "+ Tax/Spatial",
    n = n(),
    MAE = mean(abs_error, na.rm = TRUE),
    RMSE = sqrt(mean(residual^2, na.rm = TRUE)),
    mean_observed = mean(filings_count, na.rm = TRUE),
    mean_predicted = mean(predicted, na.rm = TRUE),
    correlation = cor(filings_count, predicted, use = "complete.obs"),
    bias = mean(predicted - filings_count, na.rm = TRUE)
  )

test_metrics_3 <- test_pred_3 %>%
  summarize(
    model = "+ ACS",
    n = n(),
    MAE = mean(abs_error, na.rm = TRUE),
    RMSE = sqrt(mean(residual^2, na.rm = TRUE)),
    mean_observed = mean(filings_count, na.rm = TRUE),
    mean_predicted = mean(predicted, na.rm = TRUE),
    correlation = cor(filings_count, predicted, use = "complete.obs"),
    bias = mean(predicted - filings_count, na.rm = TRUE)
  )

test_metrics_4 <- test_pred_4 %>%
  summarize(
    model = "+ Interactions",
    n = n(),
    MAE = mean(abs_error, na.rm = TRUE),
    RMSE = sqrt(mean(residual^2, na.rm = TRUE)),
    mean_observed = mean(filings_count, na.rm = TRUE),
    mean_predicted = mean(predicted, na.rm = TRUE),
    correlation = cor(filings_count, predicted, use = "complete.obs"),
    bias = mean(predicted - filings_count, na.rm = TRUE)
  )

# Combine test metrics.
test_metrics <- bind_rows(test_metrics_1, test_metrics_2, test_metrics_3, test_metrics_4)

# Display test performance.
test_metrics %>%
  kable(digits = 3, caption = "Test Set Performance Across Models",
        col.names = c("Model", "Number", "MAE", "RMSE", "Mean Observed", "Mean Predicted", "Correlation", "Bias")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Test Set Performance Across Models
Model	Number	MAE	RMSE	Mean Observed	Mean Predicted	Correlation	Bias
Baseline	8832	2.127	8.330	2.638	3.158	0.406	0.520
+ Tax/Spatial	8832	1.981	5.714	2.638	2.996	0.450	0.358
+ ACS	8832	1.766	3.032	2.638	2.835	0.567	0.197
+ Interactions	8832	1.766	3.047	2.638	2.833	0.566	0.195

Overfitting Diagnostic Check

A critical concern in predictive modeling is overfitting, where the model learns noise in the training data rather than generalizable patterns. We assess this by comparing train vs test performance.

Code

# Compare train vs test performance
overfitting_check <- train_metrics %>%
  select(model, MAE_train = MAE, RMSE_train = RMSE, corr_train = correlation) %>%
  left_join(
    test_metrics %>% select(model, MAE_test = MAE, RMSE_test = RMSE, corr_test = correlation),
    by = "model"
  ) %>%
  mutate(
    MAE_change = (MAE_test - MAE_train) / MAE_train * 100,
    RMSE_change = (RMSE_test - RMSE_train) / RMSE_train * 100,
    corr_change = (corr_train - corr_test) / corr_train * 100
  )

overfitting_check %>%
  kable(digits = 2, caption = "Train vs Test Performance: Overfitting Assessment",
        col.names = c("Model", "Training MAE", "Training RMSE", "Training Correlation", "Testing MAE", "Testing RMSE", "Testing Correlation", "MAE Change", "RMSE Change", "Correlation Change")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Train vs Test Performance: Overfitting Assessment
Model	Training MAE	Training RMSE	Training Correlation	Testing MAE	Testing RMSE	Testing Correlation	MAE Change	RMSE Change	Correlation Change
Baseline	1.97	11.45	0.30	2.13	8.33	0.41	8.15	-27.28	-37.67
+ Tax/Spatial	1.78	7.80	0.35	1.98	5.71	0.45	11.29	-26.71	-30.09
+ ACS	1.48	3.19	0.55	1.77	3.03	0.57	19.36	-5.00	-3.99
+ Interactions	1.48	3.20	0.54	1.77	3.05	0.57	19.44	-4.89	-3.93

The worst MAE degradation across models is 19.44%, so there is good generalization. In addition, the test MAE for the best model, the ACS addition, is off by 1.77 filings and accounts for 57% of variation. This is a good, foundational start considering this study is measuring a response variable that is very tied to human behavior.

Select Best Model for Further Analysis

Based on the model comparison and overfitting assessment, we select the best-performing model for detailed diagnostics. Model 3 with ACS provides the best balance of fit and generalization and interpretation for officials.

Code

# Use Model 3 for downstream analysis, this is for operational deployment.
df_test_pred <- test_pred_3
df_train_pred <- train_pred_3

# Extract scalar metrics for the selected model.
selected_test_metrics <- test_metrics_3

cat("SELECTED MODEL: Model 3 (+ ACS)\n")

SELECTED MODEL: Model 3 (+ ACS)

Code

cat(sprintf("Test MAE: %.3f\n", selected_test_metrics$MAE))

Test MAE: 1.766

Code

cat(sprintf("Test RMSE: %.3f\n", selected_test_metrics$RMSE))

Test RMSE: 3.032

Code

cat(sprintf("Test Correlation: %.3f\n", selected_test_metrics$correlation))

Test Correlation: 0.567

Code

# Create scatter plot of predicted vs observed values using selected model.
pred_obs_plot <- df_test_pred %>%
  ggplot(aes(x = predicted, y = filings_count)) +
  geom_hex(bins = 50) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "#E74C3C", linewidth = 1) +
  scale_fill_viridis_c(option = "plasma", trans = "log10", labels = comma) +
  scale_x_continuous(limits = c(0, 155)) +
  scale_y_continuous(limits = c(0, 50)) +
  labs(
    title = "Predicted vs Observed Eviction Filings (Test Set | Model 3)",
    subtitle = sprintf("MAE = %.2f | RMSE = %.2f | Correlation = %.3f",
                       selected_test_metrics$MAE, 
                       selected_test_metrics$RMSE, 
                       selected_test_metrics$correlation),
    x = "Predicted Filings",
    y = "Observed Filings",
    fill = "Count",
    caption = "Data: Philadelphia Eviction Filings 2024-2025"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "right"
  )

# Display predicted vs observed plot.
pred_obs_plot

Performance by Racial Majority

Code

# Calculate test set metrics by racial majority.
test_metrics_by_race <- df_test_pred %>%
  filter(!is.na(racial_majority), racial_majority != "") %>%
  group_by(racial_majority) %>%
  summarize(
    n = n(),
    mae = mean(abs_error, na.rm = TRUE),
    rmse = sqrt(mean(residual^2, na.rm = TRUE)),
    mean_observed = mean(filings_count, na.rm = TRUE),
    mean_predicted = mean(predicted, na.rm = TRUE),
    mean_error = mean(residual, na.rm = TRUE),
    .groups = "drop"
  )

test_metrics_by_race %>%
  kable(digits = 2, caption = "Test Set Performance by Tract Racial Majority",
        col.names = c("Racial Majority", "Number", "MAE", "RMSE", "Mean Observed", "Mean Predicted", "Mean Error")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Test Set Performance by Tract Racial Majority
Racial Majority	Number	MAE	RMSE	Mean Observed	Mean Predicted	Mean Error
Black	3289	2.22	3.18	3.63	3.94	-0.31
Hispanic	529	1.57	2.03	2.51	2.77	-0.27
Other	1656	1.86	3.21	2.77	3.07	-0.30
White	3358	1.31	2.92	1.62	1.65	-0.03

The model, mirroring real-life eviction biases, also has a larger MAE for Black communities, with white communities having the lowest MAE.

Residual Analysis

Code

# Create residual distribution plot.
residual_plot <- df_test_pred %>%
  ggplot(aes(x = residual)) +
  geom_histogram(bins = 50, fill = "#3498DB", color = "white", alpha = 1) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "#E74C3C", linewidth = 1) +
  scale_x_continuous(limits = c(-15, 15)) +
  labs(
    title = "Distribution of Prediction Residuals (Test Set)",
    subtitle = "Centered near zero indicates unbiased predictions.",
    x = "Residual (Observed - Predicted)",
    y = "Frequency",
    caption = "Data: Philadelphia Eviction Filings 2024-2025"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold", size = 14))

# Display residual plot.
residual_plot

5. RISK CLASSIFICATION SYSTEM

Creating Risk Quintiles

Code

# Create risk quintile classification based on predicted values.
df_test_risk <- df_test_pred %>%
  mutate(
    risk_quintile = ntile(predicted, 5),
    risk_category = case_when(
      risk_quintile == 1 ~ "Lowest Risk",
      risk_quintile == 2 ~ "Low Risk",
      risk_quintile == 3 ~ "Moderate Risk",
      risk_quintile == 4 ~ "High Risk",
      risk_quintile == 5 ~ "Highest Risk"
    ),
    risk_category = factor(risk_category, 
                          levels = c("Lowest Risk", "Low Risk", "Moderate Risk", 
                                    "High Risk", "Highest Risk"))
  )

# Calculate observed filings by risk category.
risk_validation <- df_test_risk %>%
  group_by(risk_category) %>%
  summarize(
    n_tract_months = n(),
    mean_observed = mean(filings_count, na.rm = TRUE),
    median_observed = median(filings_count, na.rm = TRUE),
    total_filings = sum(filings_count, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(pct_total_filings = total_filings / sum(total_filings) * 100)

# Display risk validation table.
risk_validation %>%
  kable(digits = 2, caption = "Observed Filings by Predicted Risk Category",
        col.names = c("Risk Category", "Observations", "Mean Observed", "Median Observed", "Filings", "% Filings")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Observed Filings by Predicted Risk Category
Risk Category	Observations	Mean Observed	Median Observed	Filings	% Filings
Lowest Risk	1767	0.66	0	1169	5.02
Low Risk	1767	1.27	1	2251	9.66
Moderate Risk	1766	2.18	2	3854	16.54
High Risk	1766	3.26	3	5761	24.73
Highest Risk	1766	5.81	5	10263	44.05

Code

# Create bar plot showing observed filings by risk category.
risk_bar_plot <- risk_validation %>%
  ggplot(aes(x = risk_category, y = mean_observed, fill = risk_category)) +
  geom_col(width = 0.7) +
  geom_text(aes(label = sprintf("%.1f%%", pct_total_filings)), 
            vjust = -0.5, size = 4) +
  scale_fill_viridis_d(option = "plasma", direction = -1) +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Mean Observed Filings by Predicted Risk Category",
    subtitle = "Higher predicted risk categories correctly capture higher observed filing rates.",
    x = "Predicted Risk Category",
    y = "Mean Observed Filings per Tract-Month",
    caption = "Data: Philadelphia Eviction Filings 2024-2025"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    axis.text.x = element_text(angle = 0, hjust = 0.5),
    legend.position = "none"
  )

# Display risk bar plot.
risk_bar_plot

High-Risk Tract Profile

Code

# Profile characteristics of highest risk tract-months.
high_risk_profile <- df_test_risk %>%
  filter(risk_category == "Highest Risk") %>%
  group_by(racial_majority) %>%
  summarize(
    n_tract_months = n(),
    pct_of_high_risk = n() / nrow(df_test_risk %>% filter(risk_category == "Highest Risk")) * 100,
    mean_filings = mean(filings_count, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(n_tract_months))

high_risk_profile %>%
  kable(digits = 1, caption = "Racial Composition of Highest Risk Tract-Months",
        col.names = c("Racial Majority", "Observations", "% High Risk", "Mean Filings")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Racial Composition of Highest Risk Tract-Months
Racial Majority	Observations	% High Risk	Mean Filings
Black	1191	67.4	5.7
Other	341	19.3	5.3
White	148	8.4	8.9
Hispanic	86	4.9	4.2

6. MODEL INTERPRETATION

Feature Importance

Code

# Get standardized coefficients for interpretation.
feature_importance <- tidy(model_nb_3) %>%
  filter(term != "(Intercept)") %>%
  mutate(
    abs_estimate = abs(estimate),
    direction = ifelse(estimate > 0, "Positive", "Negative"),
    category = case_when(
      str_detect(term, "filings_") ~ "Temporal",
      str_detect(term, "spatial_lag_filings") ~ "Spatial",
      str_detect(term, "log_avg_balance") ~ "Tax Delinquency",
      str_detect(term, "pct_renter") |
        str_detect(term, "poverty_rate") |
        str_detect(term, "severe_rent_burden") |
        str_detect(term, "unemployment_rate") |
        str_detect(term, "log_median_income") |
        str_detect(term, "pct_single_mother") ~ "Socioeconomic (ACS)",
      str_detect(term, "factor\\(month_num\\)") ~ "Seasonality",
      str_detect(term, "moratorium") ~ "Policy",
      str_detect(term, "factor\\(racial_majority\\)") ~ "Demographic",
      str_detect(term, "neighborhood") ~ "Neighborhood",
      TRUE ~ "Other"
    )
  ) %>%
  arrange(desc(abs_estimate)) %>%
  head(10)

# Create feature importance plot.
importance_plot <- feature_importance %>%
  ggplot(aes(x = reorder(term, abs_estimate), y = estimate, fill = category)) +
  geom_col(width = 0.7) +
  coord_flip() +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Top 10 Most Important Predictors",
    subtitle = "Coefficient magnitude indicates effect size on log(expected filings).",
    x = "Predictor Variable",
    y = "Coefficient Estimate",
    fill = "Category",
    caption = "Negative Binomial Regression Coefficients"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "bottom"
  )

# Display importance plot.
importance_plot

Code

# Get standardized coefficients for interpretation.
all_coefficients <- tidy(model_nb_3) %>%
  # Remove intercept term.
  filter(term != "(Intercept)") %>%
  # Filter out terms with "neighborhood".
  filter(!str_detect(term, "neighborhood")) %>%
  mutate(
    abs_estimate = abs(estimate),
    direction = ifelse(estimate > 0, "Positive", "Negative"),
    category = case_when(
      str_detect(term, "filings_") ~ "Temporal",
      str_detect(term, "spatial_lag_filings") ~ "Spatial",
      str_detect(term, "log_avg_balance") ~ "Tax Delinquency",
      str_detect(term, "pct_renter") |
        str_detect(term, "poverty_rate") |
        str_detect(term, "severe_rent_burden") |
        str_detect(term, "unemployment_rate") |
        str_detect(term, "log_median_income") |
        str_detect(term, "pct_single_mother") ~ "Socioeconomic (ACS)",
      str_detect(term, "factor\\(month_num\\)") ~ "Seasonality",
      str_detect(term, "moratorium") ~ "Policy",
      str_detect(term, "factor\\(racial_majority\\)") ~ "Demographic",
      TRUE ~ "Other"
    )
  ) %>%
  arrange(desc(abs_estimate)) %>%
  head(10)

# Create feature importance plot.
other_importance_plot <- all_coefficients %>%
  ggplot(aes(x = reorder(term, abs_estimate), y = estimate, fill = category)) +
  geom_col(width = 0.7) +
  coord_flip() +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "All Non-Neighborhood Predictors",
    subtitle = "Coefficient magnitude indicates effect size on log(expected filings).",
    x = "Predictor Variable",
    y = "Coefficient Estimate",
    fill = "Category"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "bottom"
  )

# Display importance plot.
other_importance_plot

Code

# Specifically analyze contribution of ACS features.
acs_contribution <- tidy(model_nb_3) %>%
  filter(str_detect(term, "pct_renter|poverty_rate|severe_rent_burden|unemployment_rate")) %>%
  mutate(
    irr = exp(estimate),
    pct_change = (irr - 1) * 100,
    significant = ifelse(p.value < 0.05, "Yes", "No")
  )%>%
  rename(standard_error = std.error, p_value = p.value, percent_change = pct_change) %>%
  select(term, estimate, standard_error, p_value, irr, percent_change, significant)

acs_contribution %>%
  kable(digits = 3, caption = "ACS Socioeconomic Feature Effects",
        col.names = c("Term", "Estimate", "Standard Error", "P-Value", "IRR", "% Change", "Significant")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

ACS Socioeconomic Feature Effects
Term	Estimate	Standard Error	P-Value	IRR	% Change	Significant
pct_renter	0.017	0.001	0.000	1.017	1.672	Yes
severe_rent_burden	0.001	0.001	0.050	1.001	0.135	No
poverty_rate	-0.004	0.001	0.001	0.996	-0.390	Yes

Incidence Rate Ratios

Code

# Calculate incidence rate ratios for interpretation.
irr_table <- tidy(model_nb_3) %>%
  mutate(
    irr = exp(estimate),
    # 1.96 is z-score for 95% confidence level.
    irr_lower = exp(estimate - 1.96 * std.error),
    irr_upper = exp(estimate + 1.96 * std.error),
    pct_change = (irr - 1) * 100
  ) %>%
  filter(term != "(Intercept)") %>%
  rename(p_value = p.value) %>%
  select(term, irr, irr_lower, irr_upper, pct_change, p_value)

irr_table %>%
  head(15) %>%
  kable(digits = 3, caption = "Incidence Rate Ratios (IRR) for Key Predictors",
        col.names = c("Term", "IRR", "IRR Lower Confidence", "IRR Upper Confidence", "% Change", "P-Value")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Incidence Rate Ratios (IRR) for Key Predictors
Term	IRR	IRR Lower Confidence	IRR Upper Confidence	% Change	P-Value
filings_ma3	1.114	1.107	1.121	11.432	0.000
moratorium_active	0.483	0.464	0.502	-51.732	0.000
factor(month_num)2	0.769	0.709	0.834	-23.086	0.000
factor(month_num)3	1.050	0.971	1.136	5.018	0.222
factor(month_num)4	0.512	0.472	0.555	-48.830	0.000
factor(month_num)5	0.663	0.612	0.719	-33.654	0.000
factor(month_num)6	0.717	0.662	0.776	-28.344	0.000
factor(month_num)7	0.833	0.770	0.900	-16.747	0.000
factor(month_num)8	0.985	0.913	1.063	-1.458	0.704
factor(month_num)9	1.084	1.006	1.168	8.369	0.035
factor(month_num)10	0.999	0.928	1.076	-0.069	0.985
factor(month_num)11	0.866	0.804	0.932	-13.442	0.000
factor(month_num)12	0.917	0.852	0.988	-8.256	0.022
is_extreme_spike	5.808	4.681	7.205	480.753	0.000
neighborhoodAcademy Gardens	1.598	1.173	2.176	59.789	0.003

filings_ma3: For every one-unit increase in the 3-month moving average of past eviction filings, the expected number of current eviction filings increases by a factor of 1.114 or 11.4%. So a history of higher recent filings strongly predicts higher current filings.

moratorium_active: When an eviction moratorium is active the expected number of eviction filings is multiplied by 0.483 or is 51.7% lower compared to when the moratorium is not active.

is_extreme_spike: If a time period is classified as having an extreme spike in filings, the expected number of eviction filings is nearly six times higher multiplied by 5.808, or increases by a staggering 481%, compared to a period without a spike.

neighborhoodAcademy Gardens: Compared to the baseline East Falls neighborhood, Academy Gardens is expected to have 59.8% more eviction filings.

Code

# Create actionable output for city agencies with intervention thresholds.
operational_output <- df_test_pred %>%
  filter(predicted >= 5) %>%
  group_by(GEOID) %>%
  summarize(
    predicted_filings_next_month = mean(predicted),
    n_high_risk_months = n(),
    .groups = "drop"
  ) %>%
  left_join(df_acs, by = "GEOID") %>%
  arrange(desc(predicted_filings_next_month))

operational_output

# A tibble: 107 × 16
   GEOID       predicted_filings_nex…¹ n_high_risk_months pct_renter median_rent
   <chr>                         <dbl>              <int>      <dbl>       <dbl>
 1 42101021800                   23.7                  21       71.3        1558
 2 42101030100                   17.0                  21       45.5         749
 3 42101025200                   14.3                  22       57.6         879
 4 42101036100                   12.8                  17       26.7        1639
 5 42101015200                   10.7                  22       68.3         961
 6 42101025700                   10.5                   4       71.5        1737
 7 42101024000                   10.5                  22       72.2        1098
 8 42101023900                    9.78                 22       92.7        1500
 9 42101037700                    9.73                 15       85.9        1159
10 42101008500                    9.45                 23       64.2        1028
# ℹ 97 more rows
# ℹ abbreviated name: ¹predicted_filings_next_month
# ℹ 11 more variables: rent_burden_pct <dbl>, median_income <dbl>,
#   log_median_income <dbl>, poverty_rate <dbl>, unemployment_rate <dbl>,
#   pct_college <dbl>, total_pop <dbl>, log_pop <dbl>, median_age <dbl>,
#   pct_single_mother <dbl>, severe_rent_burden <dbl>

By integrating temporal, spatial, and socioeconomic indicators, the best model achieves a prediction error of < 2 filings per tract, outperforming the baseline model. Also, the model successfully identifies the “Highest Risk” tracts that account for the vast majority of displacement. This tool offers the City of Philadelphia a concrete mechanism to shift from reactive to proactive, targeting aid to the specific blocks where it is needed most and ideally before any landlord-tenant tensions escalate into expensive legal battles.

III. EQUITY ASSESSMENT

1. Impact Analysis

Code

# Calculate prediction errors by racial majority to assess disparate impact.
equity_metrics <- df_test_pred %>%
  filter(!is.na(racial_majority), racial_majority != "") %>%
  group_by(racial_majority) %>%
  summarize(
    n_obs = n(),
    mean_observed = mean(filings_count, na.rm = TRUE),
    mean_predicted = mean(predicted, na.rm = TRUE),
    mae = mean(abs_error, na.rm = TRUE),
    mean_error = mean(residual, na.rm = TRUE),
    under_prediction_rate = sum(residual > 0) / n() * 100,
    .groups = "drop"
  )

# Display equity metrics table.
equity_metrics %>%
  kable(digits = 2, caption = "Model Performance Metrics by Tract Racial Majority",
        col.names = c("Racial Majority", "Observations", "Mean Observed", "Mean Predicted", "MAE", "Mean Error", "Under-Prediction Rate")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Model Performance Metrics by Tract Racial Majority
Racial Majority	Observations	Mean Observed	Mean Predicted	MAE	Mean Error	Under-Prediction Rate
Black	3289	3.63	3.94	2.22	-0.31	39.68
Hispanic	529	2.51	2.77	1.57	-0.27	41.59
Other	1656	2.77	3.07	1.86	-0.30	36.78
White	3358	1.62	1.65	1.31	-0.03	37.25

Code

# Create plot comparing MAE across racial categories.
equity_plot <- equity_metrics %>%
  ggplot(aes(x = racial_majority, y = mae, fill = racial_majority)) +
  geom_col(width = 0.7) +
  geom_text(aes(label = sprintf("%.2f", mae)), vjust = -0.5, size = 4) +
  scale_fill_manual(values = racial_colors) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Model Accuracy (MAE) by Tract Racial Majority",
    subtitle = "Lower values indicate better prediction accuracy.",
    x = "Tract Racial Majority",
    y = "Mean Absolute Error",
    caption = "Data: Philadelphia Eviction Filings 2024-2025"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "none"
  )

# Display equity plot.
equity_plot

2. Prediction Calibration by Race

Code

# Create calibration plot showing predicted vs observed by racial majority.
calibration_plot <- df_test_pred %>%
  filter(!is.na(racial_majority), racial_majority != "") %>%
  mutate(predicted_bin = cut(predicted, breaks = seq(0, 15, by = 1), include.lowest = TRUE)) %>%
  filter(!is.na(predicted_bin)) %>%
  group_by(racial_majority, predicted_bin) %>%
  summarize(
    mean_predicted = mean(predicted, na.rm = TRUE),
    mean_observed = mean(filings_count, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  ) %>%
  filter(n >= 10) %>%
  ggplot(aes(x = mean_predicted, y = mean_observed, color = racial_majority)) +
  geom_point(aes(size = n), alpha = 0.75) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black") +
  scale_color_manual(values = racial_colors) +
  scale_size_continuous(range = c(2, 8)) +
  labs(
    title = "Model by Tract Racial Majority",
    subtitle = "Points on diagonal indicate good predictions.",
    x = "Mean Predicted Filings",
    y = "Mean Observed Filings",
    color = "Racial\nMajority",
    size = "Observations",
    caption = "Data: Philadelphia Eviction Filings 2024-2025"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "right"
  )

# Display calibration plot.
calibration_plot

3. Risk Category Distribution by Race

Code

# Calculate risk category distribution by racial majority.
risk_by_race <- df_test_risk %>%
  filter(!is.na(racial_majority), racial_majority != "") %>%
  group_by(racial_majority, risk_category) %>%
  summarize(n = n(), .groups = "drop") %>%
  group_by(racial_majority) %>%
  mutate(pct = n / sum(n) * 100) %>%
  ungroup()

# Create stacked bar plot.
risk_race_plot <- risk_by_race %>%
  ggplot(aes(x = racial_majority, y = pct, fill = risk_category)) +
  geom_col(width = 0.7) +
  scale_fill_viridis_d(option = "plasma", direction = -1) +
  scale_y_continuous(labels = percent_format(scale = 1)) +
  labs(
    title = "Risk Distribution by Tract Racial Majority",
    subtitle = "Black-majority tracts classified as high risk.",
    x = "Tract Racial Majority",
    y = "Percentage of Tract-Months",
    fill = "Risk\nCategory",
    caption = "Data: Philadelphia Eviction Filings 2024-2025"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "right"
  )

# Display risk by race plot.
risk_race_plot

Code

# Create equity comparison plot.
equity_plot <- equity_metrics %>%
  pivot_longer(cols = c(mean_observed, mean_predicted), 
               names_to = "type", values_to = "value") %>%
  mutate(type = ifelse(type == "mean_observed", "Observed", "Predicted")) %>%
  ggplot(aes(x = racial_majority, y = value, fill = type)) +
  geom_col(position = "dodge", width = 0.7) +
  scale_fill_manual(values = c("Observed" = "#E74C3C", "Predicted" = "#3498DB")) +
  labs(
    title = "Model Calibration by Racial Majority: Observed vs Predicted",
    subtitle = "Close alignment indicates equitable prediction accuracy across communities.",
    x = "Tract Racial Majority",
    y = "Mean Filings per Tract-Month",
    fill = "Type",
    caption = "Data: Philadelphia Eviction Filings 2024-2025"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "bottom"
  )

# Display equity plot.
equity_plot

Future modifications need to be made to account for Black communities who account for the majority of Philadelphia’s population, despite predicting Black communities to be at more risk, there is the simultaneous higher MAE, meaning that this tool is fit as long as there is human oversight and it is not used for absolute decision-making. It struggles to capture the full issues that come from structural racism, so going into detail with the resource allocation it should be legal aid or cash assistance supportive survices, never property or police enforcement.

SESSION INFORMATION

Code

# R session info.
sessionInfo()

R version 4.5.1 (2025-06-13 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] knitr_1.50        forcats_1.0.0     dplyr_1.1.4       purrr_1.1.0      
 [5] readr_2.1.5       tibble_3.3.0      tidyverse_2.0.0   tidyr_1.3.1      
 [9] stringr_1.5.1     glue_1.8.0        jsonlite_2.0.0    httr_1.4.7       
[13] tidycensus_1.7.3  spdep_1.4-1       spData_2.3.4      pROC_1.19.0.1    
[17] caret_7.0-1       lattice_0.22-7    ggplot2_3.5.2     car_3.1-3        
[21] carData_3.0-5     MASS_7.3-65       zoo_1.8-14        corrplot_0.95    
[25] kableExtra_1.4.0  viridis_0.6.5     viridisLite_0.4.2 patchwork_1.3.2  
[29] scales_1.4.0      broom_1.0.9       sf_1.0-21         lubridate_1.9.4  

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3   rstudioapi_0.17.1    wk_0.9.4            
  [4] magrittr_2.0.3       TH.data_1.1-4        farver_2.1.2        
  [7] rmarkdown_2.29       vctrs_0.6.5          htmltools_0.5.8.1   
 [10] curl_7.0.0           s2_1.1.9             Formula_1.2-5       
 [13] LearnBayes_2.15.1    parallelly_1.45.1    KernSmooth_2.23-26  
 [16] htmlwidgets_1.6.4    plyr_1.8.9           sandwich_3.1-1      
 [19] uuid_1.2-1           igraph_2.2.1         lifecycle_1.0.4     
 [22] iterators_1.0.14     pkgconfig_2.0.3      Matrix_1.7-3        
 [25] R6_2.6.1             fastmap_1.2.0        future_1.67.0       
 [28] digest_0.6.37        ps_1.9.1             textshaping_1.0.1   
 [31] labeling_0.4.3       spatialreg_1.4-2     timechange_0.3.0    
 [34] abind_1.4-8          mgcv_1.9-3           compiler_4.5.1      
 [37] proxy_0.4-27         withr_3.0.2          backports_1.5.0     
 [40] DBI_1.2.3            hexbin_1.28.5        lava_1.8.1          
 [43] rappdirs_0.3.3       classInt_0.4-11      ModelMetrics_1.2.2.2
 [46] tools_4.5.1          chromote_0.5.1       units_0.8-7         
 [49] future.apply_1.20.0  nnet_7.3-20          nlme_3.1-168        
 [52] promises_1.3.3       grid_4.5.1           reshape2_1.4.4      
 [55] generics_0.1.4       recipes_1.3.1        gtable_0.3.6        
 [58] tzdb_0.5.0           class_7.3-23         websocket_1.4.4     
 [61] data.table_1.17.8    hms_1.1.3            utf8_1.2.6          
 [64] sp_2.2-0             xml2_1.4.0           foreach_1.5.2       
 [67] pillar_1.11.1        later_1.4.4          splines_4.5.1       
 [70] survival_3.8-3       deldir_2.0-4         tidyselect_1.2.1    
 [73] gridExtra_2.3        svglite_2.2.1        stats4_4.5.1        
 [76] xfun_0.53            hardhat_1.4.2        timeDate_4041.110   
 [79] stringi_1.8.7        yaml_2.3.10          boot_1.3-31         
 [82] evaluate_1.0.4       codetools_0.2-20     cli_3.6.5           
 [85] rpart_4.1.24         systemfonts_1.2.3    processx_3.8.6      
 [88] Rcpp_1.1.0           globals_0.18.0       coda_0.19-4.1       
 [91] parallel_4.5.1       gower_1.0.2          listenv_0.9.1       
 [94] mvtnorm_1.3-3        tigris_2.2.1.9000    ipred_0.9-15        
 [97] prodlim_2025.04.28   e1071_1.7-16         crayon_1.5.3        
[100] rlang_1.1.6          rvest_1.0.4          multcomp_1.4-29

DATA DOWNLOADS

Eviction Lab Main Data

Eviction Lab Claims Data

Real Estate Tax Balances

Neighborhood Boundaries

Tract Boundaries

ACS 2023 Data via tidycensus API

AI USAGE

Free version of Claude Sonnet 4.5 used for debugging code, especially the final panel data portion.

Several code chunks provided by Dr. Delmelle with adjustments from previous labs and lectures.