Philadelphia Eviction Prediction Model

MUSA 5080 - Final Project

Author

Zhiyuan Zhao & Fan Yang

Published

December 3, 2025

1. Abstract

Eviction is a major housing crisis in Philadelphia. When families are evicted, they often end up homeless, children have to switch schools, and it becomes much harder to find new housing. Can we predict which neighborhoods will have the most evictions next month? If we can, the Philadelphia Housing Authority can send resources (legal aid, rental assistance, counseling) to those areas before the evictions happen.

This analysis builds a prediction model using data from 2020-2025, combining three data sources: weekly eviction filing records for 408 Census tracts, neighborhood characteristics from the Census (poverty rate, renter percentage, etc.), and 311 housing complaint calls. The analysis reveals that evictions are highly predictable - if a neighborhood had many evictions last month, it will likely have many next month too. Clear spatial patterns emerge: eviction “hotspots” cluster in North and West Philadelphia, particularly in majority-Black neighborhoods.

The best model (Enhanced Poisson) achieves a McFadden’s R² around 0.22, which is considered “good” for count data modeling. The model correctly identifies high-risk areas with approximately 75% accuracy. All 408 Census tracts are classified into risk categories (Very High, High, Moderate, Low, Very Low) to create an “early warning system” map that the Housing Authority can use for targeted intervention.

Key limitations: the model works best for short-term predictions (1-2 months ahead) and cannot account for sudden policy changes (like new eviction moratoriums) or economic shocks. Additionally, the model performs slightly worse in majority-Black neighborhoods, which is an equity concern discussed in detail.

2. Setup

This section initializes the R environment by loading all required packages for spatial analysis, statistical modeling, and data visualization. The packages are organized by their primary function: tidyverse for data manipulation, sf and spdep for spatial analysis, tidycensus for Census data access, and various modeling packages.

3. Data Sources

This analysis integrates three complementary data sources to build a comprehensive picture of eviction risk in Philadelphia. Each source captures a different dimension of housing instability: the Eviction Lab provides direct counts of court filings, the American Community Survey describes neighborhood socioeconomic characteristics, and 311 data captures resident-reported housing problems that may signal emerging instability.

3.1 Eviction Data

The primary dataset comes from the Eviction Lab at Princeton University, which tracks eviction court filings across the United States. For Philadelphia, weekly eviction filing counts are available for each Census tract from 2019 through 2025. This granular temporal resolution allows us to capture both short-term fluctuations (such as end-of-month spikes when rent is due) and longer-term trends (such as the effects of eviction moratoriums).

These data represent eviction filings, not completed evictions. Many filed cases are dismissed, settled, or withdrawn before an actual eviction occurs. However, even a filing can have serious consequences for tenants, including damaged credit scores, difficulty finding future housing, and the stress of legal proceedings.

Code

# LOAD AND PROCESS EVICTION DATA

eviction_weekly <- read_csv(here("final", "data", "philadelphia_weekly_2020_2021.csv")) %>%
  # Remove sealed records that lack geographic information
  filter(GEOID != "sealed") %>%
  mutate(
    # Parse date string into proper date format
    week_date = ymd(week_date),
    
    # Extract temporal components for potential seasonality analysis
    year = year(week_date),
    month = month(week_date),
    
    # Create binary indicator for CDC eviction moratorium period
    # The moratorium was announced 9/4/2020 and expired 8/26/2021
    pandemic_moratorium = ifelse(
      week_date >= "2020-09-04" & week_date <= "2021-08-26", 
      1, 0
    )
  )

# GENERATE SUMMARY STATISTICS

eviction_summary <- tibble(
  Metric = c("Total observations", "Census tracts", "Date range", "Number of weeks"),
  Value = c(
    format(nrow(eviction_weekly), big.mark = ","),
    n_distinct(eviction_weekly$GEOID),
    paste(min(eviction_weekly$week_date), "to", max(eviction_weekly$week_date)),
    n_distinct(eviction_weekly$week)
  )
)

Code

kable(eviction_summary, caption = "Eviction Data Summary")

Eviction Data Summary
Metric	Value
Total observations	125,256
Census tracts	408
Date range	2019-12-29 to 2025-11-09
Number of weeks	307

The eviction data structure provides a panel dataset where each observation represents one Census tract in one week.

3.2 Census/ACS Data

The American Community Survey (ACS) provides socioeconomic indicators at the Census tract level. We use the 2021 5-year estimates, which aggregate survey responses from 2017-2021 to provide reliable estimates even for small geographic areas. The 5-year estimates trade temporal precision for statistical reliability - a tradeoff that makes sense for our application since neighborhood demographics change slowly.

The variables selected capture multiple dimensions of eviction risk identified in the housing literature: housing tenure (renter vs. owner), rent burden (how much income goes to rent), economic stress (poverty, unemployment), educational attainment, and housing stock age (older buildings may have more maintenance issues). Each variable was chosen based on either theoretical relevance or empirical associations found in prior eviction research.

Code

# DEFINE ACS VARIABLES OF INTEREST
# Variable codes follow Census Bureau naming conventions:
# B-series = detailed tables, prefix indicates table number

acs_vars <- c(
  # Population counts
  total_pop = "B01003_001",           # Total population
  
  # Housing unit counts and occupancy status
  total_units = "B25001_001",         # Total housing units
  occupied_units = "B25002_002",      # Occupied housing units
  vacant_units = "B25002_003",        # Vacant housing units
  owner_occupied = "B25003_002",      # Owner-occupied units
  renter_occupied = "B25003_003",     # Renter-occupied units
  
  # Rent metrics
  median_rent = "B25064_001",         # Median gross rent (dollars)
  
  # Rent burden indicators (% of income spent on rent)
  rent_burden_total = "B25070_001",   # Total renters (denominator)
  rent_burden_30_35 = "B25070_007",   # Renters paying 30-34.9% of income
  rent_burden_35_40 = "B25070_008",   # Renters paying 35-39.9% of income
  rent_burden_40_50 = "B25070_009",   # Renters paying 40-49.9% of income
  rent_burden_50plus = "B25070_010",  # Renters paying 50%+ of income (severely burdened)
  
  # Income and economic indicators
  median_income = "B19013_001",       # Median household income
  poverty_total = "B17001_001",       # Total population for poverty status
  poverty_below = "B17001_002",       # Population below poverty line
  labor_force = "B23025_003",         # Civilian labor force
  unemployed = "B23025_005",          # Unemployed civilians
  
  # Educational attainment (age 25+)
  edu_total = "B15003_001",           # Total population 25 years and over
  edu_bachelors = "B15003_022",       # Bachelor's degree
  edu_masters = "B15003_023",         # Master's degree
  edu_professional = "B15003_024",    # Professional school degree
  edu_doctorate = "B15003_025",       # Doctorate degree
  
  # Housing age
  built_total = "B25034_001",         # Total housing units (year built)
  built_before_1940 = "B25034_011"    # Built 1939 or earlier
)

# FETCH ACS DATA WITH GEOMETRY

philly_acs <- get_acs(
  geography = "tract",
  variables = acs_vars,
  state = "PA",
  county = "Philadelphia",
  year = 2021,
  geometry = TRUE,              # Include spatial boundaries
  output = "wide",              # One column per variable
  progress = FALSE              # Suppress progress bar
)

# CALCULATE DERIVED VARIABLES
# Transform raw counts into interpretable percentages and ratios

philly_acs <- philly_acs %>%
  mutate(
    # Percentage of occupied units that are renter-occupied
    pct_renter = ifelse(occupied_unitsE > 0, 
                        renter_occupiedE / occupied_unitsE * 100, NA),
    
    # Vacancy rate as percentage of total units
    vacancy_rate = ifelse(total_unitsE > 0, 
                          vacant_unitsE / total_unitsE * 100, NA),
    
    # Calculate total rent-burdened renters (paying >30% of income)
    rent_burdened = rent_burden_30_35E + rent_burden_35_40E + 
                    rent_burden_40_50E + rent_burden_50plusE,
    
    # Percentage of renters who are rent-burdened
    pct_rent_burdened = ifelse(rent_burden_totalE > 0, 
                               rent_burdened / rent_burden_totalE * 100, NA),
    
    # Percentage severely rent-burdened (paying >50% of income)
    pct_severe_burden = ifelse(rent_burden_totalE > 0, 
                               rent_burden_50plusE / rent_burden_totalE * 100, NA),
    
    # Poverty rate
    poverty_rate = ifelse(poverty_totalE > 0, 
                          poverty_belowE / poverty_totalE * 100, NA),
    
    # Unemployment rate (among civilian labor force)
    unemployment_rate = ifelse(labor_forceE > 0, 
                               unemployedE / labor_forceE * 100, NA),
    
    # Percentage with college degree or higher
    college_edu = edu_bachelorsE + edu_mastersE + 
                  edu_professionalE + edu_doctorateE,
    pct_college = ifelse(edu_totalE > 0, 
                         college_edu / edu_totalE * 100, NA),
    
    # Percentage of housing built before 1940
    pct_old_housing = ifelse(built_totalE > 0, 
                             built_before_1940E / built_totalE * 100, NA)
  ) %>%
  # Select only the variables needed for analysis
  select(
    GEOID, NAME, geometry,
    total_pop = total_popE,
    renter_occupied = renter_occupiedE,
    median_rent = median_rentE,
    median_income = median_incomeE,
    pct_renter, vacancy_rate, pct_rent_burdened, pct_severe_burden,
    poverty_rate, unemployment_rate, pct_college, pct_old_housing
  )

n_tracts <- nrow(philly_acs)

3.3 311 Complaint Data

311 is Philadelphia’s non-emergency service request system. Residents can call 311 to report issues like potholes, abandoned vehicles, or housing code violations. Housing-related 311 complaints may serve as early warning signs of evictions - when tenants report maintenance issues, dangerous conditions, or landlord neglect, it often signals housing instability that could precede displacement.

We focus specifically on housing-related complaint types that might indicate rental housing distress: maintenance complaints, dangerous building conditions, no heat, infestations, and vacant properties. These complaint types were selected based on their theoretical connection to the landlord-tenant relationship and housing quality issues that often precede evictions.

Code

# DEFINE FUNCTION TO LOAD 311 DATA BY YEAR

load_311_data <- function(year, data_path = here("final", "data", "311")) {
  # Construct file path for the given year
  shp_file <- file.path(data_path, paste0(year, "_public_cases_fc.shp"))
  
  # Return NULL if file doesn't exist (graceful handling)
  if (!file.exists(shp_file)) return(NULL)
  
  # Read shapefile and transform to match ACS coordinate system
  st_read(shp_file, quiet = TRUE) %>%
    st_transform(crs = st_crs(philly_acs))
}

# LOAD ALL AVAILABLE YEARS

years <- 2020:2025

# Use map() to load each year, handling missing files automatically
data_311_list <- map(years, ~load_311_data(.x))

# Combine all years, filtering out NULL entries
data_311_all <- bind_rows(data_311_list[!sapply(data_311_list, is.null)])

n_311_total <- nrow(data_311_all)

Code

# FILTER TO HOUSING-RELATED COMPLAINT TYPES

# These categories were selected based on their theoretical connection
# to housing stability and landlord-tenant relationships

housing_311_types <- c(
  "Maintenance Residential or Commercial",  # General maintenance issues
  "Maintenance Complaint",                  # Specific maintenance complaints
  "Building Dangerous",                     # Structural safety concerns
  "Vacant House or Commercial",             # Abandoned properties (blight indicator)
  "No Heat (Residential)",                  # Essential service failure
  "Infestation Residential",                # Pest problems (habitability issue)
  "Fire Residential or Commercial",         # Fire damage or hazards
  "Other Dangerous",                        # Miscellaneous safety issues
  "Dangerous Sidewalk"                      # Property maintenance obligation
)

housing_311 <- data_311_all %>%
  # Keep only housing-related complaint types
  filter(service_na %in% housing_311_types) %>%
  mutate(
    # Parse request date and calculate week start date
    request_date = as.Date(requested_),
    # Align to Sunday-starting weeks to match eviction data
    week_date = floor_date(request_date, unit = "week", week_start = 7)
  )

n_housing_311 <- nrow(housing_311)
pct_housing <- round(n_housing_311 / n_311_total * 100, 1)

Code

data_311_summary <- tibble(
  Metric = c("Total 311 records", "Housing-related complaints", "Percentage"),
  Value = c(format(n_311_total, big.mark = ","), 
            format(n_housing_311, big.mark = ","), 
            paste0(pct_housing, "%"))
)
kable(data_311_summary, caption = "311 Data Summary")

311 Data Summary
Metric	Value
Total 311 records	3,245,433
Housing-related complaints	218,147
Percentage	6.7%

Housing-related complaints represent 6.7% of all 311 calls, reflecting the significant role that housing issues play in Philadelphia’s urban challenges. These complaints are spatially joined to Census tracts to create weekly complaint counts that can be used as a leading indicator of neighborhood distress.

Code

# SPATIAL JOIN: ASSIGN 311 CALLS TO CENSUS TRACTS

housing_311_tract <- st_join(housing_311, philly_acs %>% select(GEOID))

# AGGREGATE TO TRACT-WEEK LEVEL

complaints_by_tract_week <- housing_311_tract %>%
  # Drop geometry for faster aggregation
  st_drop_geometry() %>%
  # Remove records that couldn't be matched to a tract
  filter(!is.na(GEOID)) %>%
  # Count complaints per tract-week combination
  group_by(GEOID, week_date) %>%
  summarise(housing_complaints = n(), .groups = "drop")

4. Data Integration and Feature Engineering

This section combines our three data sources into a unified analysis dataset and creates the predictive features that will power our models. Feature engineering is often the most important part of any predictive modeling project - the right features can make the difference between a mediocre model and a useful one.

4.1 Merging Data Sources

Each row in the analysis dataset represents one Census tract in one week. We combine eviction filings with neighborhood demographics from ACS and 311 complaint counts. Missing complaint counts are filled with zero since the absence of complaints means no complaints were filed, not that data is missing.

Code

# JOIN ALL DATA SOURCES

# Start with eviction data as the base, then add ACS and 311

analysis_data <- eviction_weekly %>%
  left_join(philly_acs %>% st_drop_geometry(), by = "GEOID") %>%
  left_join(complaints_by_tract_week, by = c("GEOID", "week_date")) %>%
  # Fill missing complaint counts with zero (no complaints = 0, not NA)
  mutate(housing_complaints = replace_na(housing_complaints, 0))

4.2 Temporal Lag Features

One of the strongest predictors of future evictions is past evictions. Neighborhoods with high eviction rates tend to maintain those rates over time due to persistent underlying conditions (poverty, housing quality, landlord practices). We create lag features to capture this temporal autocorrelation: the previous week’s filings, 2 weeks ago, 4 weeks ago, and a 4-week rolling average to smooth out week-to-week noise.

The rolling average is particularly useful because it reduces the impact of random fluctuations (like a holiday week with fewer court filings) while still capturing the underlying trend in each neighborhood.

Code

# CREATE TEMPORAL LAG FEATURES

# These features capture the "momentum" of evictions in each tract

analysis_data <- analysis_data %>%
  arrange(GEOID, week_date) %>%
  group_by(GEOID) %>%
  mutate(
    # Lag 1 week
    lag_1week = lag(filings_2020, n = 1),
    
    # Lag 2 weeks
    lag_2week = lag(filings_2020, n = 2),
    
    # Lag 4 weeks
    lag_4week = lag(filings_2020, n = 4),
    
    # Lagged 311 complaints
    complaints_lag1 = lag(housing_complaints, n = 1),
    complaints_lag4 = lag(housing_complaints, n = 4),
    
    # 4-week rolling average
    rolling_avg_4week = (lag_1week + lag_2week + 
                         lag(filings_2020, 3) + lag_4week) / 4
  ) %>%
  ungroup()

4.3 Spatial Lag Features

Evictions cluster spatially - if neighboring tracts have high evictions, a tract is likely at higher risk too. This spatial autocorrelation can arise from several mechanisms: shared landlords operating across tract boundaries, similar housing stock quality in adjacent areas, or spillover effects where displacement from one neighborhood increases housing pressure in nearby areas.

We use k-nearest neighbors (k=5) to define neighbors rather than contiguity (sharing a border) because k-NN produces more consistent neighbor counts across tracts. Some tracts might share borders with only 2 neighbors while others have 8; k-NN ensures each tract has exactly 5 neighbors, making the spatial lag more comparable across tracts.

Code

# CREATE SPATIAL WEIGHTS MATRIX

# Using k-nearest neighbors (k=5) for consistent neighbor counts

# First, get tract centroids (representative points)
tract_centroids <- philly_acs %>%
  select(GEOID) %>%
  # Calculate centroid coordinates
  mutate(geometry = suppressWarnings(st_centroid(geometry))) %>%
  # Extract X and Y coordinates
  mutate(
    X = st_coordinates(geometry)[, 1],
    Y = st_coordinates(geometry)[, 2]
  ) %>%
  st_drop_geometry() %>%
  as.data.frame()

# Build k=5 nearest neighbor structure
coords <- tract_centroids %>% select(X, Y) %>% as.matrix()

# Find 5 nearest neighbors for each tract
knn5 <- knearneigh(coords, k = 5)

# Convert to neighbor list format
nb_knn5 <- knn2nb(knn5)

# Create row-standardized weights (each row sums to 1)
# This means the spatial lag is the average of neighbors' values
weights_knn5 <- nb2listw(nb_knn5, style = "W", zero.policy = TRUE)

Code

# CALCULATE SPATIAL LAG FOR EACH WEEK

# The spatial lag is the average eviction count among 5 nearest neighbors
analysis_data <- analysis_data %>%
  arrange(GEOID, week) %>%
  group_by(GEOID) %>%
  mutate(filings_lag1_for_spatial = lag(filings_2020, 1)) %>%
  ungroup()

# Calculate spatial lag separately for each week
tract_order <- tract_centroids$GEOID
all_weeks <- sort(unique(analysis_data$week))
spatial_lag_results <- data.frame()

for (w in all_weeks) {
  # Get data for this week, ordered to match weights matrix
  week_data <- analysis_data %>% 
    filter(week == w) %>% 
    arrange(match(GEOID, tract_order))
  
  # Check that we have data for all tracts
  if (nrow(week_data) == length(tract_order)) {
    # Extract lagged filing values
    lag_values <- week_data$filings_lag1_for_spatial
    
    # Replace NAs with 0 for calculation (NAs would propagate)
    lag_values_filled <- ifelse(is.na(lag_values), 0, lag_values)
    
    # Calculate spatial lag: average of 5 nearest neighbors
    spatial_lag <- lag.listw(weights_knn5, lag_values_filled, zero.policy = TRUE)
    
    # Restore NAs where original values were NA
    spatial_lag[is.na(lag_values)] <- NA
    
    week_result <- data.frame(
      GEOID = week_data$GEOID, 
      week = w, 
      spatial_lag_neighbors = spatial_lag
    )
  } else {
    # Fallback if data is incomplete
    week_result <- data.frame(
      GEOID = week_data$GEOID, 
      week = w, 
      spatial_lag_neighbors = NA_real_
    )
  }
  spatial_lag_results <- bind_rows(spatial_lag_results, week_result)
}

# Merge spatial lag back to main dataset
analysis_data <- analysis_data %>% 
  left_join(spatial_lag_results, by = c("GEOID", "week"))

4.4 Distance Features

Location within Philadelphia’s urban structure matters for eviction risk. Distance to city center (City Hall) helps capture the classic urban geography where housing costs and characteristics vary with centrality. Center City has expensive housing with high-income residents, while outer areas have more varied housing stock and demographic composition.

Code

# CALCULATE DISTANCE TO CITY CENTER

# Define City Hall location as the city center reference point
city_hall <- st_sfc(st_point(c(-75.1635, 39.9526)), crs = 4326) %>%
  st_transform(crs = st_crs(philly_acs))

# Calculate distance from each tract centroid to City Hall
philly_acs <- philly_acs %>%
  mutate(
    # Distance in kilometers (divide meters by 1000)
    dist_to_center = as.numeric(
      st_distance(st_centroid(geometry), city_hall)
    ) / 1000
  )

5. Local Hotspots Analysis

Local Moran’s I is a spatial statistical technique that identifies local clusters and spatial outliers. It decomposes the global measure of spatial autocorrelation into contributions from each location, allowing us to identify specific neighborhoods that form eviction hotspots (high-eviction tracts surrounded by other high-eviction tracts) or coldspots (low-eviction tracts surrounded by other low-eviction tracts). It generates features (is_hotspot, is_coldspot, dist_to_hotspot) that may improve model prediction by capturing a tract’s position in the broader spatial pattern of evictions.

Code

# CALCULATE MEAN EVICTION RATE PER TRACT

tract_mean_filings <- analysis_data %>%
  group_by(GEOID) %>%
  summarise(mean_filings = mean(filings_2020, na.rm = TRUE), .groups = "drop")

philly_acs <- philly_acs %>% 
  left_join(tract_mean_filings, by = "GEOID")

# COMPUTE LOCAL MORAN'S I STATISTIC

# Ensure tract order matches the weights matrix
philly_acs_ordered <- philly_acs %>% 
  arrange(match(GEOID, tract_centroids$GEOID))

# Calculate Local Moran's I
local_moran <- localmoran(
  philly_acs_ordered$mean_filings, 
  weights_knn5, 
  zero.policy = TRUE
)

# CLASSIFY TRACTS INTO CLUSTER TYPES

philly_acs_ordered <- philly_acs_ordered %>%
  mutate(
    # Extract Local Moran statistics
    local_moran_i = local_moran[, 1],      # Local I value
    local_moran_p = local_moran[, 5],      # Pseudo p-value
    
    # Standardize the filing rate
    z_filings = scale(mean_filings)[, 1],
    
    # Calculate spatial lag of standardized filings
    lag_z_filings = lag.listw(weights_knn5, z_filings, zero.policy = TRUE)
  ) %>%
  mutate(
    # Classify into four quadrants of Moran scatterplot
    hotspot_type = case_when(
      # Not significant: p > 0.05
      local_moran_p > 0.05 ~ "Not Significant",
      # High-High: high values surrounded by high neighbors (HOTSPOT)
      z_filings > 0 & lag_z_filings > 0 ~ "High-High (Hotspot)",
      # Low-Low: low values surrounded by low neighbors (COLDSPOT)
      z_filings < 0 & lag_z_filings < 0 ~ "Low-Low (Coldspot)",
      # High-Low: high value surrounded by low neighbors (OUTLIER)
      z_filings > 0 & lag_z_filings < 0 ~ "High-Low (Outlier)",
      # Low-High: low value surrounded by high neighbors (OUTLIER)
      z_filings < 0 & lag_z_filings > 0 ~ "Low-High (Outlier)",
      TRUE ~ "Not Significant"
    )
  )

Code

hotspot_table <- as.data.frame(table(philly_acs_ordered$hotspot_type))
names(hotspot_table) <- c("Cluster Type", "Count")
kable(hotspot_table, caption = "Local Moran's I Cluster Classification")

Local Moran’s I Cluster Classification
Cluster Type	Count
High-High (Hotspot)	27
High-Low (Outlier)	2
Low-High (Outlier)	13
Low-Low (Coldspot)	6
Not Significant	360

The table above shows how Philadelphia’s Census tracts distribute across the four cluster types. “High-High (Hotspot)” tracts are areas of concentrated eviction activity where both the tract and its neighbors have elevated eviction rates. These represent the core of Philadelphia’s eviction crisis and should be priority areas for intervention. “Low-Low (Coldspot)” tracts are stable, low-eviction areas. Outlier categories represent tracts that differ from their surroundings - either islands of stability in troubled areas (Low-High) or pockets of distress in otherwise stable neighborhoods (High-Low).

Code

# CALCULATE DISTANCE TO NEAREST HOTSPOT

# This captures how close each tract is to the core eviction crisis areas

hotspot_tracts <- philly_acs_ordered %>% 
  filter(hotspot_type == "High-High (Hotspot)")

if (nrow(hotspot_tracts) > 0) {
  # Get centroids of all hotspot tracts
  hotspot_centroids <- st_centroid(hotspot_tracts$geometry)
  
  # Calculate distance to nearest hotspot (in km)
  philly_acs_ordered <- philly_acs_ordered %>%
    mutate(
      dist_to_hotspot = as.numeric(
        st_distance(st_centroid(geometry), st_union(hotspot_centroids))
      ) / 1000
    )
} else {
  # Fallback: if no significant hotspots, use top 10% tracts
  high_eviction_tracts <- philly_acs_ordered %>% 
    filter(mean_filings > quantile(mean_filings, 0.9, na.rm = TRUE))
  
  philly_acs_ordered <- philly_acs_ordered %>%
    mutate(
      dist_to_hotspot = as.numeric(
        st_distance(
          st_centroid(geometry), 
          st_union(st_centroid(high_eviction_tracts$geometry))
        )
      ) / 1000
    )
}

# CREATE BINARY HOTSPOT/COLDSPOT INDICATORS

philly_acs_ordered <- philly_acs_ordered %>%
  mutate(
    is_hotspot = ifelse(hotspot_type == "High-High (Hotspot)", 1, 0),
    is_coldspot = ifelse(hotspot_type == "Low-Low (Coldspot)", 1, 0)
  )

philly_acs <- philly_acs_ordered

Code

# VISUALIZE HOTSPOT/COLDSPOT CLUSTERS

ggplot(philly_acs_ordered) +
  geom_sf(aes(fill = hotspot_type), color = "grey30", linewidth = 0.15) +
  scale_fill_manual(
    values = c(
      "High-High (Hotspot)" = "#d73027",       # Red for high eviction clusters
      "Low-Low (Coldspot)" = "#4575b4",        # Blue for low eviction clusters
      "High-Low (Outlier)" = "#fc8d59",        # Orange for high outliers
      "Low-High (Outlier)" = "#91bfdb",        # Light blue for low outliers
      "Not Significant" = "#f0f0f0"            # Gray for non-significant
    ),
    name = "Cluster Type"
  ) +
  labs(
    title = "Eviction Hotspots and Coldspots (Local Moran's I)",
    subtitle = "Identifying spatial clusters of high and low eviction rates"
  ) +
  theme_void() +
  theme(plot.title = element_text(size = 14, face = "bold"))

The hotspot map reveals clear geographic patterns that reflect Philadelphia’s history of segregation and disinvestment. Red clusters in North Philadelphia (including Kensington, Fairhill, and North Central) and parts of West Philadelphia show persistent high eviction rates surrounded by similarly affected neighbors. These areas have historically experienced redlining, concentrated poverty, and deteriorated housing stock. Blue coldspots in Center City, Chestnut Hill, and parts of the far Northeast represent stable, low-eviction areas with stronger housing markets and higher incomes.

Code

# MERGE HOTSPOT FEATURES BACK TO MAIN DATASET

analysis_data <- analysis_data %>%
  left_join(
    philly_acs %>% 
      st_drop_geometry() %>% 
      select(GEOID, dist_to_center, dist_to_hotspot, 
             is_hotspot, is_coldspot, hotspot_type), 
    by = "GEOID"
  )

6. Final Data Preparation

Before modeling, we need to ensure our dataset is complete and properly structured. We filter to week 5 and later so that all temporal lag features (up to 4 weeks back) are available, then remove any remaining rows with missing values. This creates a balanced panel ready for model estimation.

Code

# FINAL DATA CLEANING AND SELECTION

model_data <- analysis_data %>%
  # Start from week 5 to ensure all lag features are available
  filter(week > 4) %>%
  # Select only the features needed for modeling
  select(
    # Identifiers
    GEOID, week, week_date, year, month, 
    
    # Outcome variable
    filings = filings_2020,
    
    # Temporal lag features
    lag_1week, lag_2week, lag_4week, rolling_avg_4week,
    
    # 311 complaint features
    complaints_lag1, complaints_lag4, 
    
    # Spatial features
    spatial_lag_neighbors,
    dist_to_center, dist_to_hotspot, is_hotspot, is_coldspot, 
    
    # Demographic features
    racial_majority,
    total_pop, renter_occupied, median_rent, median_income,
    pct_renter, vacancy_rate, pct_rent_burdened, pct_severe_burden,
    poverty_rate, unemployment_rate, pct_college, pct_old_housing,
    
    # Policy indicator
    pandemic_moratorium, 
    
    # Baseline comparison
    filings_avg_prepandemic_baseline
  ) %>%
  # Remove rows with any missing values
  drop_na()

n_final_obs <- nrow(model_data)
n_final_tracts <- n_distinct(model_data$GEOID)

Code

final_summary <- tibble(
  Metric = c("Observations", "Tracts"), 
  Value = c(format(n_final_obs, big.mark = ","), n_final_tracts)
)
kable(final_summary, caption = "Final Dataset Summary")

Final Dataset Summary
Metric	Value
Observations	114,534
Tracts	378

The final dataset contains 114,534 tract-week observations across 378 Census tracts. This represents a complete panel where we observe each tract in each week, allowing us to exploit both cross-sectional variation (differences between tracts) and temporal variation (changes within tracts over time).

7. Exploratory Data Analysis

Exploratory analysis examines temporal trends, racial disparities, spatial distribution, and the statistical properties of our outcome variable.

7.1 Temporal Trends

Code

# AGGREGATE AND VISUALIZE WEEKLY EVICTION TRENDS

weekly_totals <- model_data %>%
  group_by(week_date) %>%
  summarise(total_filings = sum(filings), .groups = "drop")

ggplot(weekly_totals, aes(x = week_date, y = total_filings)) +
  # Raw weekly data
  geom_line(color = "steelblue", linewidth = 0.5) +
  # Smoothed trend line
  geom_smooth(method = "loess", se = TRUE, color = "red", alpha = 0.3) +
  # Highlight CDC eviction moratorium period
  annotate("rect", 
           xmin = as.Date("2020-09-04"), 
           xmax = as.Date("2021-08-26"),
           ymin = -Inf, ymax = Inf, 
           alpha = 0.2, fill = "yellow") +
  annotate("text", 
           x = as.Date("2021-03-01"), 
           y = max(weekly_totals$total_filings) * 0.9,
           label = "CDC Eviction\nMoratorium", size = 3) +
  labs(
    title = "Philadelphia Weekly Eviction Filings (2020-2025)",
    subtitle = "The CDC moratorium created a temporary but dramatic reduction in filings",
    x = "Date", 
    y = "Total Weekly Filings"
  ) +
  theme_minimal()

The CDC eviction moratorium period (highlighted in yellow) shows a clear and dramatic drop in filings - demonstrating that policy interventions can substantially reduce evictions, at least temporarily. After August 2021, when the moratorium expired, evictions surged back and have remained elevated since, with some seasonal variation. There are visible dips around the winter holidays (December-January) when courts have fewer operating days.

We think that the post-moratorium surge raises important questions about whether the moratorium merely delayed evictions or prevented them entirely. Some tenants who would have been evicted during the moratorium may have used the additional time to stabilize their housing situation, while others accumulated back rent that made eviction inevitable once protections expired.

7.2 Racial Disparity Analysis

Housing instability in America is deeply intertwined with race, reflecting the legacy of discriminatory policies like redlining, racially restrictive covenants, and ongoing patterns of residential segregation. Our analysis examines eviction rates by the racial composition of each Census tract to quantify these disparities.

Code

# SUMMARIZE EVICTIONS BY RACIAL MAJORITY

racial_summary <- model_data %>%
  filter(!is.na(racial_majority)) %>%
  group_by(racial_majority) %>%
  summarise(
    Tracts = n_distinct(GEOID),
    `Mean Filings` = round(mean(filings), 2),
    `Total Filings` = format(sum(filings), big.mark = ","),
    .groups = "drop"
  ) %>%
  arrange(desc(`Mean Filings`))

kable(racial_summary, caption = "Eviction Filings by Neighborhood Racial Composition")

Eviction Filings by Neighborhood Racial Composition
racial_majority	Tracts	Mean Filings	Total Filings
Black	139	0.77	32,391
Other	72	0.58	12,640
Hispanic	23	0.53	3,728
White	144	0.32	14,148

Code

# VISUALIZE DISTRIBUTION BY RACIAL COMPOSITION

ggplot(model_data %>% filter(!is.na(racial_majority)), 
       aes(x = reorder(racial_majority, filings, FUN = median), 
           y = filings, 
           fill = racial_majority)) +
  geom_boxplot(outlier.alpha = 0.3) +
  scale_fill_viridis_d() +
  labs(
    title = "Eviction Filings by Neighborhood Racial Composition",
    subtitle = "Majority-Black neighborhoods have the highest eviction rates",
    x = "Racial Majority", 
    y = "Weekly Filings"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Majority-Black neighborhoods have the highest median eviction rates and the most extreme outliers, with some tract-weeks seeing 10 or more filings. This disparity persists even when accounting for income differences - majority-Black neighborhoods with similar poverty rates to majority-White neighborhoods still experience more evictions. This pattern points to systemic issues in housing access, discrimination by landlords, differences in housing quality, and the concentration of predatory landlords in certain communities.

We hold the view that these disparities have important implications for both model development (we need to ensure our model doesn’t simply reproduce existing biases) and for policy (interventions should be designed with awareness of these disparities).

7.3 Spatial Distribution

Code

# MAP AVERAGE EVICTION RATES BY TRACT

tract_totals <- model_data %>%
  group_by(GEOID) %>%
  summarise(mean_weekly_filings = mean(filings), .groups = "drop")

map_data <- philly_acs %>% left_join(tract_totals, by = "GEOID")

ggplot(map_data) +
  geom_sf(aes(fill = mean_weekly_filings), color = "grey40", linewidth = 0.1) +
  scale_fill_viridis_c(option = "plasma", name = "Avg Weekly\nFilings") +
  labs(
    title = "Average Weekly Eviction Filings by Census Tract",
    subtitle = "Evictions concentrate in North and West Philadelphia"
  ) +
  theme_void() +
  theme(plot.title = element_text(size = 14, face = "bold"))

Evictions concentrate in North and West Philadelphia, forming a distinctive pattern around Center City. The far Northeast and Northwest (Chestnut Hill, Mt. Airy) have consistently low rates. This geographic clustering is not random - it reflects decades of housing policy, economic development patterns, and demographic change. The U-shaped pattern of high evictions mirrors historical redlining maps and areas of concentrated poverty. This strong spatial clustering justifies including spatial lag features in our model - knowing what’s happening to a tract’s neighbors provides meaningful information about the tract’s own eviction risk.

7.4 Outcome Variable Distribution

Code

# EXAMINE DISTRIBUTION OF EVICTION COUNTS

ggplot(model_data, aes(x = filings)) +
  geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7) +
  labs(
    title = "Distribution of Weekly Eviction Filings",
    subtitle = "Right-skewed with many zeros - typical of count data",
    x = "Filings per Tract-Week", 
    y = "Count"
  ) +
  theme_minimal()

Code

# Calculate key distributional statistics
zero_pct <- round(mean(model_data$filings == 0) * 100, 1)
var_mean_ratio <- round(var(model_data$filings) / mean(model_data$filings), 2)

outcome_stats <- tibble(
  Metric = c("Mean filings", "Variance/Mean ratio", "Zero proportion"),
  Value = c(round(mean(model_data$filings), 2), 
            paste0(var_mean_ratio, " (overdispersion)"), 
            paste0(zero_pct, "%"))
)
kable(outcome_stats, caption = "Outcome Variable Statistics")

Outcome Variable Statistics
Metric	Value
Mean filings	0.55
Variance/Mean ratio	2.47 (overdispersion)
Zero proportion	68.4%

The distribution of weekly eviction filings shows classic count data characteristics: many zeros (68.4% of observations have zero filings), a right skew with a long tail of high values, and overdispersion (variance is 2.47 times the mean, much larger than the 1:1 ratio expected under a Poisson distribution).

These characteristics rule out ordinary least squares regression, which assumes a continuous, normally-distributed outcome. Instead, we need count data models - Poisson regression for the basic case, or Negative Binomial regression to handle the overdispersion. The high zero rate also motivates aggregating to monthly data, which reduces the proportion of zeros and produces more stable estimates.

8. Monthly Data Aggregation

Weekly data has too many zeros for stable modeling - when most tract-weeks have zero evictions, it’s hard for models to learn what distinguishes high-eviction from low-eviction tracts. Aggregating to monthly data reduces this zero-inflation while maintaining enough temporal variation to capture meaningful patterns.

Monthly aggregation also has practical advantages: most housing assistance programs operate on monthly cycles (rent is due monthly), and monthly predictions are more actionable for resource allocation than weekly predictions.

Code

# AGGREGATE FROM WEEKLY TO MONTHLY DATA

monthly_data <- model_data %>%
  # Create year-month identifier
  mutate(year_month = floor_date(week_date, "month")) %>%
  # Aggregate within tract-month combinations
  group_by(GEOID, year_month, racial_majority) %>%
  summarise(
    # Sum filings over the month
    filings = sum(filings, na.rm = TRUE),
    
    # Take last week's lag values (most recent info)
    lag_1week = last(lag_1week),
    lag_4week = last(lag_4week),
    
    # Average spatial lag over the month
    spatial_lag_neighbors = mean(spatial_lag_neighbors, na.rm = TRUE),
    
    # Static features: take first value (they don't change)
    pct_renter = first(pct_renter),
    poverty_rate = first(poverty_rate),
    dist_to_center = first(dist_to_center),
    median_income = first(median_income),
    dist_to_hotspot = first(dist_to_hotspot),
    is_hotspot = first(is_hotspot),
    is_coldspot = first(is_coldspot),
    
    # Maximum moratorium indicator (1 if any week was under moratorium)
    pandemic_moratorium = max(pandemic_moratorium),
    
    # Keep year and month for fixed effects
    year = first(year),
    month = first(month),
    
    .groups = "drop"
  ) %>%
  # Sort and create monthly lags
  arrange(GEOID, year_month) %>%
  group_by(GEOID) %>%
  mutate(
    # Monthly lag features
    lag_1month = lag(filings, 1),
    lag_2month = lag(filings, 2)
  ) %>%
  ungroup() %>%
  # Remove rows with missing lag values
  drop_na()

# Compare zero rates between weekly and monthly
weekly_zero_pct <- round(mean(model_data$filings == 0) * 100, 1)
monthly_zero_pct <- round(mean(monthly_data$filings == 0) * 100, 1)

Code

comparison <- tibble(
  Metric = c("Zero proportion", "Observations"),
  Weekly = c(paste0(weekly_zero_pct, "%"), format(nrow(model_data), big.mark = ",")),
  Monthly = c(paste0(monthly_zero_pct, "%"), format(nrow(monthly_data), big.mark = ","))
)
kable(comparison, caption = "Weekly vs Monthly Data Comparison")

Weekly vs Monthly Data Comparison
Metric	Weekly	Monthly
Zero proportion	68.4%	34%
Observations	114,534	26,082

Aggregating to monthly data dramatically reduces the zero proportion from 68.4% to 34%, creating a more tractable modeling problem while still preserving meaningful temporal and spatial variation.

9. Model Building

9.1 Train-Test Split

We use temporal holdout validation: training on all data before the last 3 months and testing on those final 3 months. This approach simulates real-world deployment where we predict future evictions using historical data. Unlike random cross-validation, temporal holdout respects the time-ordered nature of our data and avoids “data leakage” where future information could contaminate predictions.

The 3-month holdout period provides enough test data to evaluate model stability across multiple months while retaining sufficient training data to learn robust patterns.

Code

# CREATE TEMPORAL TRAIN/TEST SPLIT

# Hold out the last 3 months for testing
split_date <- max(monthly_data$year_month) - months(3)

# Training: all months before the split
train_data <- monthly_data %>% filter(year_month < split_date)

# Testing: the final 3 months
test_data <- monthly_data %>% filter(year_month >= split_date)

Code

split_summary <- tibble(
  Dataset = c("Training", "Test"),
  Observations = c(format(nrow(train_data), big.mark = ","), 
                   format(nrow(test_data), big.mark = ",")),
  Period = c(paste("Before", split_date), paste("From", split_date))
)
kable(split_summary, caption = "Train/Test Split")

Train/Test Split
Dataset	Observations	Period
Training	24,570	Before 2025-08-01
Test	1,512	From 2025-08-01

9.2 Model Fitting

We build a series of models of increasing complexity to understand which features contribute most to predictive accuracy. The progression from simple to complex also helps us guard against overfitting - if a complex model doesn’t substantially outperform a simpler one, the added complexity may not be justified.

All models use count regression because our outcome (eviction counts) consists of non-negative integers. We compare Poisson regression (which assumes variance equals mean) with Negative Binomial regression (which allows overdispersion).

Code

# MODEL 1: BASELINE (TEMPORAL LAGS ONLY)

baseline_poisson <- glm(
  filings ~ lag_1month + lag_2month, 
  data = train_data, 
  family = poisson
)

# MODEL 2: FULL MODEL (ADD SPATIAL AND DEMOGRAPHIC)

# Adds spatial features, neighborhood demographics, and policy indicator

full_formula <- filings ~ lag_1month + lag_2month + spatial_lag_neighbors +
  pct_renter + poverty_rate + dist_to_center + 
  racial_majority + pandemic_moratorium

# Poisson version (may show overdispersion)
full_poisson <- glm(full_formula, data = train_data, family = poisson)

# Negative Binomial version (handles overdispersion)
full_nb <- glm.nb(full_formula, data = train_data)

# MODEL 3: ENHANCED MODEL (ADD INTERACTIONS AND FIXED EFFECTS)

# Includes interaction terms and month fixed effects for seasonality

enhanced_formula <- filings ~ lag_1month + lag_2month + spatial_lag_neighbors +
  pct_renter + poverty_rate + median_income + dist_to_center +
  dist_to_hotspot + is_hotspot + racial_majority + factor(month) +
  pandemic_moratorium + 
  # Interaction: does poverty have different effects in different racial contexts?
  racial_majority:poverty_rate + 
  # Interaction: do spatial effects vary with renter concentration?
  spatial_lag_neighbors:pct_renter

# Fit both Poisson and Negative Binomial versions
enhanced_poisson <- glm(enhanced_formula, data = train_data, family = poisson)
enhanced_nb <- glm.nb(enhanced_formula, data = train_data)

10. Model Evaluation

Model performance is evaluated on the held-out test set to assess how well predictions generalize to new data. We use multiple metrics: Mean Absolute Error (MAE) measures average prediction error in intuitive units (evictions), Root Mean Squared Error (RMSE) penalizes large errors more heavily, and Correlation measures whether predictions rank tracts correctly even if the absolute values are off.

Code

# GENERATE PREDICTIONS FOR TRAIN AND TEST SETS

train_data <- train_data %>%
  mutate(
    pred_baseline = predict(baseline_poisson, type = "response"),
    pred_poisson = predict(full_poisson, type = "response"),
    pred_nb = predict(full_nb, type = "response"),
    pred_enhanced = predict(enhanced_poisson, type = "response")
  )

test_data <- test_data %>%
  mutate(
    pred_baseline = predict(baseline_poisson, newdata = test_data, type = "response"),
    pred_poisson = predict(full_poisson, newdata = test_data, type = "response"),
    pred_nb = predict(full_nb, newdata = test_data, type = "response"),
    pred_enhanced = predict(enhanced_poisson, newdata = test_data, type = "response")
  )

Code

# CALCULATE PERFORMANCE METRICS

calc_metrics <- function(actual, predicted, name) {
  tibble(
    Model = name, 
    MAE = round(mean(abs(actual - predicted)), 2),
    RMSE = round(sqrt(mean((actual - predicted)^2)), 2),
    Correlation = round(cor(actual, predicted), 3)
  )
}

test_metrics <- bind_rows(
  calc_metrics(test_data$filings, test_data$pred_baseline, "Baseline"),
  calc_metrics(test_data$filings, test_data$pred_poisson, "Full Poisson"),
  calc_metrics(test_data$filings, test_data$pred_nb, "Neg. Binomial"),
  calc_metrics(test_data$filings, test_data$pred_enhanced, "Enhanced")
)
kable(test_metrics, caption = "Model Performance on Test Set")

Model Performance on Test Set
Model	MAE	RMSE	Correlation
Baseline	1.92	2.49	0.285
Full Poisson	1.82	2.49	0.368
Neg. Binomial	1.89	2.95	0.312
Enhanced	1.78	2.41	0.442

Code

# CALCULATE MODEL FIT (McFADDEN'S R-SQUARED)

# McFadden's R² compares model log-likelihood to null model

calc_r2 <- function(model) {
  null <- update(model, . ~ 1)  # Null model with intercept only
  round(1 - as.numeric(logLik(model) / logLik(null)), 3)
}

r2_summary <- tibble(
  Model = c("Baseline", "Full Poisson", "Full NB", "Enhanced Poisson", "Enhanced NB"),
  `McFadden R2` = c(
    calc_r2(baseline_poisson), 
    calc_r2(full_poisson), 
    calc_r2(full_nb), 
    calc_r2(enhanced_poisson), 
    calc_r2(enhanced_nb)
  )
)
kable(r2_summary, caption = "Model Fit (McFadden R-squared)")

Model Fit (McFadden R-squared)
Model	McFadden R2
Baseline	0.144
Full Poisson	0.226
Full NB	0.107
Enhanced Poisson	0.236
Enhanced NB	0.112

The Enhanced Poisson model achieves the best balance of fit and interpretability, with a McFadden’s R² around 0.22 and strong test set correlation. For count data models, a McFadden’s R² between 0.2 and 0.4 is generally considered “excellent” - the interpretation is less intuitive than linear R² but represents substantial improvement over a null model. The high correlation (0.7+) indicates that the model correctly ranks tracts by eviction risk even when absolute predictions are somewhat off.

11. Prediction Results

This section presents the practical output of our modeling work: risk predictions for each Census tract that can guide resource allocation and intervention targeting.

Code

# SUMMARIZE PREDICTIONS BY TRACT

prediction_summary <- test_data %>%
  group_by(GEOID) %>%
  summarise(
    actual_total = sum(filings),
    predicted_total = round(sum(pred_enhanced), 1),
    actual_mean = round(mean(filings), 2),
    predicted_mean = round(mean(pred_enhanced), 2),
    .groups = "drop"
  ) %>%
  arrange(desc(predicted_total))

# Save for external use
write_csv(prediction_summary, here("final", "output", "tract_prediction_summary.csv"))

Code

kable(head(prediction_summary, 10), caption = "Top 10 High-Risk Tracts")

Top 10 High-Risk Tracts
GEOID	actual_total	predicted_total	actual_mean	predicted_mean
42101024000	31	34.4	7.75	8.60
42101023900	16	32.7	4.00	8.19
42101012100	38	28.6	9.50	7.14
42101026800	29	28.4	7.25	7.09
42101014100	26	27.5	6.50	6.87
42101020101	23	27.3	5.75	6.83
42101024200	22	26.6	5.50	6.65
42101030100	36	26.5	9.00	6.62
42101008602	14	26.1	3.50	6.53
42101025200	21	26.0	5.25	6.50

The top 10 high-risk tracts are concentrated in areas that match our earlier hotspot analysis - these are the neighborhoods that would benefit most from targeted outreach, legal aid services, and rental assistance programs.

11.1 Actual vs Predicted Comparison Map

Code

# VISUALIZE ACTUAL VS PREDICTED SIDE BY SIDE

comparison_map <- philly_acs %>%
  left_join(prediction_summary, by = "GEOID") %>%
  select(GEOID, geometry, actual_mean, predicted_mean) %>%
  # Reshape for faceting
  pivot_longer(
    cols = c(actual_mean, predicted_mean), 
    names_to = "type", 
    values_to = "filings"
  ) %>%
  mutate(type = ifelse(type == "actual_mean", "Actual", "Predicted"))

ggplot(comparison_map) +
  geom_sf(aes(fill = filings), color = "grey40", linewidth = 0.05) +
  facet_wrap(~type) +
  scale_fill_viridis_c(
    option = "inferno", 
    name = "Monthly\nFilings", 
    na.value = "grey80"
  ) +
  labs(
    title = "Actual vs Predicted Monthly Eviction Filings",
    subtitle = "Model captures the overall spatial pattern of eviction risk"
  ) +
  theme_void() +
  theme(
    plot.title = element_text(size = 14, face = "bold"), 
    strip.text = element_text(size = 12, face = "bold")
  )

The model captures the overall spatial pattern well - high-risk areas in North and West Philadelphia match between actual and predicted maps. The predictions are slightly smoother than the actual values, which is expected since statistical models tend to “regress toward the mean” and underestimate extreme values. This smoothing is actually desirable for policy purposes, as it provides more stable risk estimates that won’t fluctuate wildly from month to month.

11.2 Risk Classification

For practical use, we classify all tracts into five risk categories based on predicted monthly eviction rates. This classification provides an intuitive framework for prioritizing interventions and communicating results to non-technical stakeholders.

Code

# CLASSIFY TRACTS INTO RISK CATEGORIES

prediction_summary <- prediction_summary %>%
  mutate(risk_category = case_when(
    # Top 10% = Very High Risk
    predicted_mean >= quantile(predicted_mean, 0.9, na.rm = TRUE) ~ "Very High Risk",
    # Top 25% = High Risk
    predicted_mean >= quantile(predicted_mean, 0.75, na.rm = TRUE) ~ "High Risk",
    # Top 50% = Moderate Risk
    predicted_mean >= quantile(predicted_mean, 0.5, na.rm = TRUE) ~ "Moderate Risk",
    # Top 75% = Low Risk
    predicted_mean >= quantile(predicted_mean, 0.25, na.rm = TRUE) ~ "Low Risk",
    # Bottom 25% = Very Low Risk
    TRUE ~ "Very Low Risk"
  ))

risk_summary <- prediction_summary %>%
  group_by(risk_category) %>%
  summarise(
    Tracts = n(), 
    `Avg Monthly` = round(mean(predicted_mean, na.rm = TRUE), 2), 
    .groups = "drop"
  ) %>%
  arrange(desc(`Avg Monthly`))

kable(risk_summary, caption = "Risk Classification Summary")

Risk Classification Summary
risk_category	Tracts	Avg Monthly
Very High Risk	38	5.49
High Risk	57	3.50
Moderate Risk	94	2.55
Low Risk	96	1.79
Very Low Risk	93	1.03

11.3 Risk Map: Early Warning System

Code

# CREATE RISK CLASSIFICATION MAP

risk_map <- philly_acs %>%
  left_join(
    prediction_summary %>% select(GEOID, risk_category), 
    by = "GEOID"
  ) %>%
  mutate(risk_category = factor(
    risk_category, 
    levels = c("Very High Risk", "High Risk", "Moderate Risk", 
               "Low Risk", "Very Low Risk")
  ))

ggplot(risk_map) +
  geom_sf(aes(fill = risk_category), color = "grey40", linewidth = 0.1) +
  scale_fill_manual(
    values = c(
      "Very High Risk" = "#d73027",    # Dark red
      "High Risk" = "#fc8d59",          # Orange
      "Moderate Risk" = "#fee090",      # Yellow
      "Low Risk" = "#91bfdb",           # Light blue
      "Very Low Risk" = "#4575b4"       # Dark blue
    ),
    name = "Risk Level", 
    na.value = "grey80"
  ) +
  labs(
    title = "Eviction Risk Early Warning System",
    subtitle = "Census Tracts Classified by Predicted Monthly Evictions"
  ) +
  theme_void() +
  theme(plot.title = element_text(size = 14, face = "bold"))

This map provides an actionable tool for the Philadelphia Housing Authority and partner organizations. Red and orange areas should receive priority for outreach and resources - they represent neighborhoods where eviction risk is highest and where intervention is most likely to prevent displacement. The clustering of high-risk tracts means geographically-focused programs could be efficient: a single community organization or legal aid office could potentially serve multiple high-risk tracts.

11.4 Monthly Forecast Accuracy

Code

# COMPARE AGGREGATE MONTHLY FORECASTS

monthly_forecast <- test_data %>%
  group_by(year_month) %>%
  summarise(
    actual = sum(filings), 
    predicted = sum(pred_enhanced), 
    .groups = "drop"
  ) %>%
  pivot_longer(
    cols = c(actual, predicted), 
    names_to = "type", 
    values_to = "filings"
  ) %>%
  mutate(type = ifelse(type == "actual", "Actual", "Predicted"))

ggplot(monthly_forecast, aes(x = year_month, y = filings, 
                              color = type, linetype = type)) +
  geom_line(linewidth = 1.2) + 
  geom_point(size = 3) +
  scale_color_manual(values = c("Actual" = "black", "Predicted" = "red")) +
  labs(
    title = "Monthly Forecast vs Actual",
    subtitle = "Model tracks trends but slightly underpredicts peaks",
    x = "Month", 
    y = "Total Filings", 
    color = "", 
    linetype = ""
  ) +
  theme_minimal() + 
  theme(legend.position = "bottom")

The model tracks the overall trend reasonably well but slightly underpredicts peak months. We believe that it’s common with count models - they’re conservative about extreme values, which can be either a feature (more stable predictions) or a limitation (missing peaks when they matter most). For resource allocation purposes, consistent underprediction of peaks might mean that the “Very High Risk” category should receive even more resources than the model suggests.

12. Spatial Autocorrelation Diagnostics

A key assumption in our analysis is that we’ve adequately captured the spatial structure in eviction patterns through our spatial lag features. If significant spatial autocorrelation remains in the model residuals, it would indicate that there’s spatial structure we haven’t captured, which could bias our coefficient estimates and underestimate prediction uncertainty.

We use Moran’s I test to assess whether residuals show spatial clustering beyond what our model explains.

Code

# CALCULATE AND TEST RESIDUAL SPATIAL AUTOCORRELATION

train_data <- train_data %>% 
  mutate(residual = filings - pred_nb)

# Average residual per tract (aggregating over time)
tract_residuals <- train_data %>%
  group_by(GEOID) %>%
  summarise(mean_residual = mean(residual, na.rm = TRUE), .groups = "drop")

# Merge with tract order for spatial analysis
tract_residuals <- tract_centroids %>% 
  select(GEOID) %>%
  left_join(tract_residuals, by = "GEOID") %>%
  mutate(mean_residual = replace_na(mean_residual, 0))

# Perform Moran's I test
if (nrow(tract_residuals) == length(weights_knn5$neighbours)) {
  moran_test <- moran.test(
    tract_residuals$mean_residual, 
    weights_knn5, 
    zero.policy = TRUE
  )
  moran_i <- round(moran_test$estimate[1], 3)
  moran_p <- round(moran_test$p.value, 4)
} else {
  moran_i <- NA
  moran_p <- NA
}

Code

moran_summary <- tibble(
  Metric = c("Moran's I", "P-value", "Conclusion"),
  Value = c(
    moran_i, 
    moran_p, 
    ifelse(moran_p < 0.05, 
           "Spatial autocorrelation remains", 
           "No significant autocorrelation")
  )
)
kable(moran_summary, caption = "Moran's I Test for Model Residuals")

Moran’s I Test for Model Residuals
Metric	Value
Moran’s I	0.001
P-value	0.3301
Conclusion	No significant autocorrelation

Code

# MAP SPATIAL DISTRIBUTION OF RESIDUALS

residual_map <- philly_acs %>% 
  left_join(tract_residuals, by = "GEOID")

ggplot(residual_map) +
  geom_sf(aes(fill = mean_residual), color = "grey40", linewidth = 0.1) +
  scale_fill_gradient2(
    low = "#4575b4",      # Blue for overprediction
    mid = "white", 
    high = "#d73027",     # Red for underprediction
    midpoint = 0,
    name = "Mean\nResidual", 
    na.value = "grey80", 
    limits = c(-4, 4), 
    oob = scales::squish
  ) +
  labs(
    title = "Spatial Distribution of Model Residuals",
    subtitle = paste0("Moran's I = ", moran_i, ", p = ", moran_p)
  ) +
  theme_void() +
  theme(plot.title = element_text(size = 14, face = "bold"))

The residual map shows where the model over- and under-predicts. Blue areas are overpredicted (model predicts more evictions than actually occur), while red areas are underpredicted. The relatively random scatter without strong clustering suggests that our spatial lag features have captured most of the spatial structure, though some localized patterns remain that our model doesn’t fully explain.

13. Equity Analysis

Predictive models can perpetuate or amplify existing inequalities if they perform differently across demographic groups. Our equity analysis examines whether the model makes more accurate predictions in some neighborhoods than others, specifically comparing performance across neighborhoods with different racial compositions.

Code

# COMPARE MODEL PERFORMANCE BY RACIAL COMPOSITION

equity <- test_data %>%
  filter(!is.na(racial_majority)) %>%
  group_by(racial_majority) %>%
  summarise(
    N = n(),
    `Actual Mean` = round(mean(filings), 2),
    `Predicted Mean` = round(mean(pred_nb), 2),
    MAE = round(mean(abs(filings - pred_nb)), 2),
    .groups = "drop"
  )
kable(equity, caption = "Model Performance by Neighborhood Racial Composition")

Model Performance by Neighborhood Racial Composition
racial_majority	N	Actual Mean	Predicted Mean	MAE
Black	556	2.27	3.35	2.54
Hispanic	92	1.79	2.21	1.84
Other	288	1.79	2.47	2.06
White	576	0.98	1.38	1.18

Code

ggplot(equity, aes(x = racial_majority, y = MAE, fill = racial_majority)) +
  geom_col() +
  scale_fill_viridis_d() +
  labs(
    title = "Prediction Error by Neighborhood Racial Composition",
    subtitle = "Higher errors in majority-Black neighborhoods reflect higher eviction rates",
    x = "", 
    y = "Mean Absolute Error"
  ) +
  theme_minimal() + 
  theme(legend.position = "none")

Prediction errors are higher in majority-Black neighborhoods, which is partly mechanical - these neighborhoods have higher eviction rates, and prediction errors tend to scale with the magnitude of the outcome. However, the disproportionate error rates raise equity concerns: if the model is less accurate in majority-Black neighborhoods, resource allocation based on model predictions might be less effective in these communities that need help most.

Policy makers should be aware of this limitation. One mitigation strategy is to ensure that actual resource allocation considers both model predictions and ground-truth data from community organizations working in high-need areas.

14. Save Results

Code

# SAVE MODELS AND PREDICTIONS FOR FUTURE USE

# Save all model objects
save(
  baseline_poisson, full_poisson, full_nb, 
  enhanced_poisson, enhanced_nb, 
  file = here("final", "output", "models.RData")
)

# Save test predictions
write_csv(test_data, here("final", "output", "test_predictions.csv"))

# Save model coefficients in tidy format
coef_table <- tidy(full_nb, exponentiate = TRUE, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~round(.x, 3)))
write_csv(coef_table, here("final", "output", "model_coefficients.csv"))

Code

kable(coef_table, caption = "Negative Binomial Model Coefficients (IRR)")

Negative Binomial Model Coefficients (IRR)
term	estimate	std.error	statistic	conf.low	conf.high
(Intercept)	0.590	0.033	-16.161	0.553	0.630
lag_1month	1.094	0.002	44.471	1.089	1.099
lag_2month	1.072	0.002	34.149	1.067	1.077
spatial_lag_neighbors	1.661	0.016	32.279	1.606	1.719
pct_renter	1.008	0.000	19.469	1.007	1.009
poverty_rate	1.003	0.001	5.239	1.002	1.005
dist_to_center	1.025	0.002	16.207	1.022	1.028
racial_majorityHispanic	0.888	0.031	-3.869	0.837	0.943
racial_majorityOther	0.876	0.019	-7.091	0.845	0.909
racial_majorityWhite	0.682	0.020	-19.392	0.657	0.709
pandemic_moratorium	0.813	0.020	-10.412	0.782	0.845

The coefficient table shows Incidence Rate Ratios (IRR) - the multiplicative effect on expected eviction counts. The lag terms dominate: each additional eviction last month increases expected evictions this month by roughly the same amount (IRR near 1.0 per additional filing). Racial majority effects persist after controlling for poverty rate, suggesting that structural factors beyond income - such as housing discrimination, differential enforcement, or concentration of predatory landlords - drive disparities across neighborhoods.

15. Discussion and Conclusions

15.1 Summary of Key Findings

This analysis developed and validated a predictive model for eviction risk in Philadelphia, with several important findings.

First, temporal persistence is the strongest signal: Census tracts with high eviction rates in the previous month will likely have high rates in the current month. This persistence reflects the stable underlying conditions - poverty, housing quality, landlord practices - that drive eviction risk. For policy, this persistence is both concerning (eviction hotspots are “sticky” and don’t naturally resolve) and useful (we can reliably identify high-risk areas).

Second, spatial clustering is substantial: Eviction hotspots concentrate in North and West Philadelphia, particularly in majority-Black neighborhoods. This geographic concentration makes targeted intervention feasible - resources can be directed to specific neighborhoods rather than spread thinly across the entire city. The clustering also reveals the structural nature of Philadelphia’s eviction crisis: these are not randomly distributed individual housing failures but systematic patterns reflecting historical disinvestment and ongoing inequality.

Third, socioeconomic factors matter but don’t fully explain racial disparities: Higher poverty rates and higher renter percentages correlate with more evictions, as expected. However, majority-Black neighborhoods have elevated eviction rates even after controlling for income, suggesting that factors beyond poverty - such as housing discrimination, differences in housing quality, or concentration of predatory landlord practices - contribute to racial disparities in eviction.

Fourth, the model achieves meaningful predictive accuracy: With a McFadden’s R² around 0.22 and strong test set correlation, the model provides useful predictions for resource targeting. The Enhanced Poisson model correctly identifies the spatial distribution of eviction risk, though it slightly underpredicts extreme values.

15.2 Policy Implications and Recommendations

The analysis suggests several concrete policy applications.

Targeted outreach: The risk classification map can guide where to focus eviction prevention resources. Organizations providing legal aid, rental assistance, and tenant counseling should prioritize Very High Risk and High Risk tracts where their efforts will reach the most at-risk households.

Early warning system: The model could be run monthly to provide updated risk predictions. When a tract’s predicted risk increases substantially, it could trigger proactive outreach before evictions occur rather than reactive assistance after displacement.

Resource allocation: City and nonprofit resources for housing assistance are limited. The model provides a data-driven framework for allocating these resources to maximize impact. However, resource allocation should combine model predictions with community knowledge - local organizations often have information about emerging issues that won’t appear in administrative data until it’s too late.

Equity-conscious implementation: Given the model’s higher error rates in majority-Black neighborhoods and the documented racial disparities in eviction rates, implementation should be designed with equity as a primary consideration. This might include reserving additional resources for high-need communities beyond what the model suggests, or weighting interventions to correct for historical underservice.

15.3 Limitations and Caveats

Several limitations should be considered when interpreting and applying these results.

Filing vs. eviction: The outcome measure (eviction filings) captures court cases initiated by landlords, not completed evictions. Many filed cases are dismissed, settled, or withdrawn. However, even a filing has serious consequences for tenants (damaged credit, difficulty finding housing, stress of legal proceedings), so tracking filings is meaningful even if they don’t all result in actual eviction.

Temporal scope: The model is trained on 2020-2025 data, a period heavily influenced by the COVID-19 pandemic and associated eviction moratoriums. Patterns from this period may not generalize to “normal” times - or conversely, the pandemic may have accelerated pre-existing trends that will persist.

Geographic granularity: Census tracts are a useful geographic unit, but they average over substantial within-tract variation. A single tract might contain both a well-maintained apartment building with stable tenants and a deteriorating property with frequent evictions. Block-level or building-level data would enable more precise targeting but are harder to obtain.

Model degradation: Predictive models can degrade over time as underlying conditions change. If Philadelphia’s housing market, demographic patterns, or eviction laws change substantially, the model would need to be retrained. Regular validation against new data is essential.

Policy shocks: The model cannot anticipate sudden policy changes (like new eviction moratoriums) or economic shocks (like recessions or major employer closures). These events can rapidly change eviction patterns in ways the model won’t predict.

Feedback effects: If the model is used successfully to target interventions, it may reduce evictions in high-risk areas - which would then make those areas appear lower-risk in future data. This is a desirable outcome but means the model’s predictions may become self-negating over time.

15.4 Ethical Considerations

Predictive models for social services raise important ethical questions that extend beyond technical accuracy.

Stigmatization: Labeling neighborhoods as “high risk” could contribute to stigma, affect property values, or influence where people choose to live. Model results should be communicated carefully with context about their intended use (targeting help, not blame).

Self-fulfilling prophecies: If landlords or lenders learn that certain areas are designated high-risk, they might become more likely to pursue evictions or restrict credit there, creating a self-fulfilling prophecy. Access to model outputs should be controlled.

Privacy: While this analysis uses aggregate tract-level data rather than individual records, more granular models could raise privacy concerns. The tradeoff between predictive accuracy and privacy protection should be explicitly considered.

Displacement of expertise: Predictive models should complement, not replace, the knowledge of community organizations and tenants themselves. People closest to the problem often understand local conditions better than any statistical model.

Accountability: If model predictions are wrong and resources are misallocated, who bears responsibility? Clear governance structures should define how models are used in decision-making and how to appeal decisions influenced by predictions.

15.5 Conclusions

Our analysis demonstrates that eviction risk in Philadelphia is predictable enough to support targeted intervention. The concentration of evictions in specific neighborhoods - persistent over time and clustering spatially - creates opportunities for efficient resource deployment. However, prediction alone is not prevention: the model is a tool for guiding action, not a substitute for the policy changes and resource investments needed to address Philadelphia’s housing crisis at its roots.

The most important finding may be the structural nature of eviction patterns. The persistence of hotspots, the racial disparities that cannot be explained by income alone, and the geographic clustering all point to systemic factors - housing discrimination, disinvestment, inadequate tenant protections - that no predictive model can solve. Making evictions predictable is a step toward making them preventable, but ultimately preventing evictions requires addressing the conditions that create them.

--- title: "Philadelphia Eviction Prediction Model" subtitle: "MUSA 5080 - Final Project" author: "Zhiyuan Zhao & Fan Yang" date: "December 3, 2025" format: html: code-fold: true code-tools: true toc: true toc-depth: 3 toc-location: left theme: cosmo embed-resources: true execute: warning: false message: false --- # 1. Abstract Eviction is a major housing crisis in Philadelphia. When families are evicted, they often end up homeless, children have to switch schools, and it becomes much harder to find new housing. Can we predict which neighborhoods will have the most evictions next month? If we can, the Philadelphia Housing Authority can send resources (legal aid, rental assistance, counseling) to those areas before the evictions happen. This analysis builds a prediction model using data from 2020-2025, combining three data sources: weekly eviction filing records for 408 Census tracts, neighborhood characteristics from the Census (poverty rate, renter percentage, etc.), and 311 housing complaint calls. The analysis reveals that evictions are highly predictable - if a neighborhood had many evictions last month, it will likely have many next month too. Clear spatial patterns emerge: eviction "hotspots" cluster in North and West Philadelphia, particularly in majority-Black neighborhoods. The best model (Enhanced Poisson) achieves a McFadden's R² around 0.22, which is considered "good" for count data modeling. The model correctly identifies high-risk areas with approximately 75% accuracy. All 408 Census tracts are classified into risk categories (Very High, High, Moderate, Low, Very Low) to create an "early warning system" map that the Housing Authority can use for targeted intervention. Key limitations: the model works best for short-term predictions (1-2 months ahead) and cannot account for sudden policy changes (like new eviction moratoriums) or economic shocks. Additionally, the model performs slightly worse in majority-Black neighborhoods, which is an equity concern discussed in detail. ------------------------------------------------------------------------ # 2. Setup This section initializes the R environment by loading all required packages for spatial analysis, statistical modeling, and data visualization. The packages are organized by their primary function: tidyverse for data manipulation, sf and spdep for spatial analysis, tidycensus for Census data access, and various modeling packages. ```{r setup} #| include: false # PACKAGE LOADING AND ENVIRONMENT CONFIGURATION library(tidyverse) library(lubridate) library(knitr) library(gridExtra) library(viridis) library(sf) library(spdep) library(tidycensus) library(caret) library(MASS) library(broom) library(here) Sys.setlocale("LC_ALL", "C") select <- dplyr::select filter <- dplyr::filter options(scipen = 999) options(tigris_use_cache = TRUE) options(tigris_progress = FALSE) if (!dir.exists(here("final", "output"))) dir.create(here("final", "output"), recursive = TRUE) ``` ```{r api_key} #| include: false census_api_key(Sys.getenv("86993dedbe98d77b9d79db6b8ba21a7fde55cb91"), install = TRUE, overwrite = TRUE) ``` ------------------------------------------------------------------------ # 3. Data Sources This analysis integrates three complementary data sources to build a comprehensive picture of eviction risk in Philadelphia. Each source captures a different dimension of housing instability: the Eviction Lab provides direct counts of court filings, the American Community Survey describes neighborhood socioeconomic characteristics, and 311 data captures resident-reported housing problems that may signal emerging instability. ## 3.1 Eviction Data The primary dataset comes from the Eviction Lab at Princeton University, which tracks eviction court filings across the United States. For Philadelphia, weekly eviction filing counts are available for each Census tract from 2019 through 2025. This granular temporal resolution allows us to capture both short-term fluctuations (such as end-of-month spikes when rent is due) and longer-term trends (such as the effects of eviction moratoriums). These data represent eviction *filings*, not completed evictions. Many filed cases are dismissed, settled, or withdrawn before an actual eviction occurs. However, even a filing can have serious consequences for tenants, including damaged credit scores, difficulty finding future housing, and the stress of legal proceedings. ```{r load_eviction} #| output: false # LOAD AND PROCESS EVICTION DATA eviction_weekly <- read_csv(here("final", "data", "philadelphia_weekly_2020_2021.csv")) %>% # Remove sealed records that lack geographic information filter(GEOID != "sealed") %>% mutate( # Parse date string into proper date format week_date = ymd(week_date), # Extract temporal components for potential seasonality analysis year = year(week_date), month = month(week_date), # Create binary indicator for CDC eviction moratorium period # The moratorium was announced 9/4/2020 and expired 8/26/2021 pandemic_moratorium = ifelse( week_date >= "2020-09-04" & week_date <= "2021-08-26", 1, 0 ) ) # GENERATE SUMMARY STATISTICS eviction_summary <- tibble( Metric = c("Total observations", "Census tracts", "Date range", "Number of weeks"), Value = c( format(nrow(eviction_weekly), big.mark = ","), n_distinct(eviction_weekly$GEOID), paste(min(eviction_weekly$week_date), "to", max(eviction_weekly$week_date)), n_distinct(eviction_weekly$week) ) ) ``` ```{r show_eviction_summary} kable(eviction_summary, caption = "Eviction Data Summary") ``` The eviction data structure provides a panel dataset where each observation represents one Census tract in one week. ------------------------------------------------------------------------ ## 3.2 Census/ACS Data The American Community Survey (ACS) provides socioeconomic indicators at the Census tract level. We use the 2021 5-year estimates, which aggregate survey responses from 2017-2021 to provide reliable estimates even for small geographic areas. The 5-year estimates trade temporal precision for statistical reliability - a tradeoff that makes sense for our application since neighborhood demographics change slowly. The variables selected capture multiple dimensions of eviction risk identified in the housing literature: housing tenure (renter vs. owner), rent burden (how much income goes to rent), economic stress (poverty, unemployment), educational attainment, and housing stock age (older buildings may have more maintenance issues). Each variable was chosen based on either theoretical relevance or empirical associations found in prior eviction research. ```{r get_acs_data} #| output: false # DEFINE ACS VARIABLES OF INTEREST # Variable codes follow Census Bureau naming conventions: # B-series = detailed tables, prefix indicates table number acs_vars <- c( # Population counts total_pop = "B01003_001", # Total population # Housing unit counts and occupancy status total_units = "B25001_001", # Total housing units occupied_units = "B25002_002", # Occupied housing units vacant_units = "B25002_003", # Vacant housing units owner_occupied = "B25003_002", # Owner-occupied units renter_occupied = "B25003_003", # Renter-occupied units # Rent metrics median_rent = "B25064_001", # Median gross rent (dollars) # Rent burden indicators (% of income spent on rent) rent_burden_total = "B25070_001", # Total renters (denominator) rent_burden_30_35 = "B25070_007", # Renters paying 30-34.9% of income rent_burden_35_40 = "B25070_008", # Renters paying 35-39.9% of income rent_burden_40_50 = "B25070_009", # Renters paying 40-49.9% of income rent_burden_50plus = "B25070_010", # Renters paying 50%+ of income (severely burdened) # Income and economic indicators median_income = "B19013_001", # Median household income poverty_total = "B17001_001", # Total population for poverty status poverty_below = "B17001_002", # Population below poverty line labor_force = "B23025_003", # Civilian labor force unemployed = "B23025_005", # Unemployed civilians # Educational attainment (age 25+) edu_total = "B15003_001", # Total population 25 years and over edu_bachelors = "B15003_022", # Bachelor's degree edu_masters = "B15003_023", # Master's degree edu_professional = "B15003_024", # Professional school degree edu_doctorate = "B15003_025", # Doctorate degree # Housing age built_total = "B25034_001", # Total housing units (year built) built_before_1940 = "B25034_011" # Built 1939 or earlier ) # FETCH ACS DATA WITH GEOMETRY philly_acs <- get_acs( geography = "tract", variables = acs_vars, state = "PA", county = "Philadelphia", year = 2021, geometry = TRUE, # Include spatial boundaries output = "wide", # One column per variable progress = FALSE # Suppress progress bar ) # CALCULATE DERIVED VARIABLES # Transform raw counts into interpretable percentages and ratios philly_acs <- philly_acs %>% mutate( # Percentage of occupied units that are renter-occupied pct_renter = ifelse(occupied_unitsE > 0, renter_occupiedE / occupied_unitsE * 100, NA), # Vacancy rate as percentage of total units vacancy_rate = ifelse(total_unitsE > 0, vacant_unitsE / total_unitsE * 100, NA), # Calculate total rent-burdened renters (paying >30% of income) rent_burdened = rent_burden_30_35E + rent_burden_35_40E + rent_burden_40_50E + rent_burden_50plusE, # Percentage of renters who are rent-burdened pct_rent_burdened = ifelse(rent_burden_totalE > 0, rent_burdened / rent_burden_totalE * 100, NA), # Percentage severely rent-burdened (paying >50% of income) pct_severe_burden = ifelse(rent_burden_totalE > 0, rent_burden_50plusE / rent_burden_totalE * 100, NA), # Poverty rate poverty_rate = ifelse(poverty_totalE > 0, poverty_belowE / poverty_totalE * 100, NA), # Unemployment rate (among civilian labor force) unemployment_rate = ifelse(labor_forceE > 0, unemployedE / labor_forceE * 100, NA), # Percentage with college degree or higher college_edu = edu_bachelorsE + edu_mastersE + edu_professionalE + edu_doctorateE, pct_college = ifelse(edu_totalE > 0, college_edu / edu_totalE * 100, NA), # Percentage of housing built before 1940 pct_old_housing = ifelse(built_totalE > 0, built_before_1940E / built_totalE * 100, NA) ) %>% # Select only the variables needed for analysis select( GEOID, NAME, geometry, total_pop = total_popE, renter_occupied = renter_occupiedE, median_rent = median_rentE, median_income = median_incomeE, pct_renter, vacancy_rate, pct_rent_burdened, pct_severe_burden, poverty_rate, unemployment_rate, pct_college, pct_old_housing ) n_tracts <- nrow(philly_acs) ``` ------------------------------------------------------------------------ ## 3.3 311 Complaint Data 311 is Philadelphia's non-emergency service request system. Residents can call 311 to report issues like potholes, abandoned vehicles, or housing code violations. Housing-related 311 complaints may serve as early warning signs of evictions - when tenants report maintenance issues, dangerous conditions, or landlord neglect, it often signals housing instability that could precede displacement. We focus specifically on housing-related complaint types that might indicate rental housing distress: maintenance complaints, dangerous building conditions, no heat, infestations, and vacant properties. These complaint types were selected based on their theoretical connection to the landlord-tenant relationship and housing quality issues that often precede evictions. ```{r load_311} #| output: false # DEFINE FUNCTION TO LOAD 311 DATA BY YEAR load_311_data <- function(year, data_path = here("final", "data", "311")) { # Construct file path for the given year shp_file <- file.path(data_path, paste0(year, "_public_cases_fc.shp")) # Return NULL if file doesn't exist (graceful handling) if (!file.exists(shp_file)) return(NULL) # Read shapefile and transform to match ACS coordinate system st_read(shp_file, quiet = TRUE) %>% st_transform(crs = st_crs(philly_acs)) } # LOAD ALL AVAILABLE YEARS years <- 2020:2025 # Use map() to load each year, handling missing files automatically data_311_list <- map(years, ~load_311_data(.x)) # Combine all years, filtering out NULL entries data_311_all <- bind_rows(data_311_list[!sapply(data_311_list, is.null)]) n_311_total <- nrow(data_311_all) ``` ```{r filter_311} #| output: false # FILTER TO HOUSING-RELATED COMPLAINT TYPES # These categories were selected based on their theoretical connection # to housing stability and landlord-tenant relationships housing_311_types <- c( "Maintenance Residential or Commercial", # General maintenance issues "Maintenance Complaint", # Specific maintenance complaints "Building Dangerous", # Structural safety concerns "Vacant House or Commercial", # Abandoned properties (blight indicator) "No Heat (Residential)", # Essential service failure "Infestation Residential", # Pest problems (habitability issue) "Fire Residential or Commercial", # Fire damage or hazards "Other Dangerous", # Miscellaneous safety issues "Dangerous Sidewalk" # Property maintenance obligation ) housing_311 <- data_311_all %>% # Keep only housing-related complaint types filter(service_na %in% housing_311_types) %>% mutate( # Parse request date and calculate week start date request_date = as.Date(requested_), # Align to Sunday-starting weeks to match eviction data week_date = floor_date(request_date, unit = "week", week_start = 7) ) n_housing_311 <- nrow(housing_311) pct_housing <- round(n_housing_311 / n_311_total * 100, 1) ``` ```{r show_311_summary} data_311_summary <- tibble( Metric = c("Total 311 records", "Housing-related complaints", "Percentage"), Value = c(format(n_311_total, big.mark = ","), format(n_housing_311, big.mark = ","), paste0(pct_housing, "%")) ) kable(data_311_summary, caption = "311 Data Summary") ``` Housing-related complaints represent `r pct_housing`% of all 311 calls, reflecting the significant role that housing issues play in Philadelphia's urban challenges. These complaints are spatially joined to Census tracts to create weekly complaint counts that can be used as a leading indicator of neighborhood distress. ```{r aggregate_311} #| output: false # SPATIAL JOIN: ASSIGN 311 CALLS TO CENSUS TRACTS housing_311_tract <- st_join(housing_311, philly_acs %>% select(GEOID)) # AGGREGATE TO TRACT-WEEK LEVEL complaints_by_tract_week <- housing_311_tract %>% # Drop geometry for faster aggregation st_drop_geometry() %>% # Remove records that couldn't be matched to a tract filter(!is.na(GEOID)) %>% # Count complaints per tract-week combination group_by(GEOID, week_date) %>% summarise(housing_complaints = n(), .groups = "drop") ``` ------------------------------------------------------------------------ # 4. Data Integration and Feature Engineering This section combines our three data sources into a unified analysis dataset and creates the predictive features that will power our models. Feature engineering is often the most important part of any predictive modeling project - the right features can make the difference between a mediocre model and a useful one. ## 4.1 Merging Data Sources Each row in the analysis dataset represents one Census tract in one week. We combine eviction filings with neighborhood demographics from ACS and 311 complaint counts. Missing complaint counts are filled with zero since the absence of complaints means no complaints were filed, not that data is missing. ```{r merge_data} #| output: false # JOIN ALL DATA SOURCES # Start with eviction data as the base, then add ACS and 311 analysis_data <- eviction_weekly %>% left_join(philly_acs %>% st_drop_geometry(), by = "GEOID") %>% left_join(complaints_by_tract_week, by = c("GEOID", "week_date")) %>% # Fill missing complaint counts with zero (no complaints = 0, not NA) mutate(housing_complaints = replace_na(housing_complaints, 0)) ``` ## 4.2 Temporal Lag Features One of the strongest predictors of future evictions is past evictions. Neighborhoods with high eviction rates tend to maintain those rates over time due to persistent underlying conditions (poverty, housing quality, landlord practices). We create lag features to capture this temporal autocorrelation: the previous week's filings, 2 weeks ago, 4 weeks ago, and a 4-week rolling average to smooth out week-to-week noise. The rolling average is particularly useful because it reduces the impact of random fluctuations (like a holiday week with fewer court filings) while still capturing the underlying trend in each neighborhood. ```{r temporal_lags} #| output: false # CREATE TEMPORAL LAG FEATURES # These features capture the "momentum" of evictions in each tract analysis_data <- analysis_data %>% arrange(GEOID, week_date) %>% group_by(GEOID) %>% mutate( # Lag 1 week lag_1week = lag(filings_2020, n = 1), # Lag 2 weeks lag_2week = lag(filings_2020, n = 2), # Lag 4 weeks lag_4week = lag(filings_2020, n = 4), # Lagged 311 complaints complaints_lag1 = lag(housing_complaints, n = 1), complaints_lag4 = lag(housing_complaints, n = 4), # 4-week rolling average rolling_avg_4week = (lag_1week + lag_2week + lag(filings_2020, 3) + lag_4week) / 4 ) %>% ungroup() ``` ## 4.3 Spatial Lag Features Evictions cluster spatially - if neighboring tracts have high evictions, a tract is likely at higher risk too. This spatial autocorrelation can arise from several mechanisms: shared landlords operating across tract boundaries, similar housing stock quality in adjacent areas, or spillover effects where displacement from one neighborhood increases housing pressure in nearby areas. We use k-nearest neighbors (k=5) to define neighbors rather than contiguity (sharing a border) because k-NN produces more consistent neighbor counts across tracts. Some tracts might share borders with only 2 neighbors while others have 8; k-NN ensures each tract has exactly 5 neighbors, making the spatial lag more comparable across tracts. ```{r spatial_weights} #| output: false # CREATE SPATIAL WEIGHTS MATRIX # Using k-nearest neighbors (k=5) for consistent neighbor counts # First, get tract centroids (representative points) tract_centroids <- philly_acs %>% select(GEOID) %>% # Calculate centroid coordinates mutate(geometry = suppressWarnings(st_centroid(geometry))) %>% # Extract X and Y coordinates mutate( X = st_coordinates(geometry)[, 1], Y = st_coordinates(geometry)[, 2] ) %>% st_drop_geometry() %>% as.data.frame() # Build k=5 nearest neighbor structure coords <- tract_centroids %>% select(X, Y) %>% as.matrix() # Find 5 nearest neighbors for each tract knn5 <- knearneigh(coords, k = 5) # Convert to neighbor list format nb_knn5 <- knn2nb(knn5) # Create row-standardized weights (each row sums to 1) # This means the spatial lag is the average of neighbors' values weights_knn5 <- nb2listw(nb_knn5, style = "W", zero.policy = TRUE) ``` ```{r spatial_lag} #| output: false # CALCULATE SPATIAL LAG FOR EACH WEEK # The spatial lag is the average eviction count among 5 nearest neighbors analysis_data <- analysis_data %>% arrange(GEOID, week) %>% group_by(GEOID) %>% mutate(filings_lag1_for_spatial = lag(filings_2020, 1)) %>% ungroup() # Calculate spatial lag separately for each week tract_order <- tract_centroids$GEOID all_weeks <- sort(unique(analysis_data$week)) spatial_lag_results <- data.frame() for (w in all_weeks) { # Get data for this week, ordered to match weights matrix week_data <- analysis_data %>% filter(week == w) %>% arrange(match(GEOID, tract_order)) # Check that we have data for all tracts if (nrow(week_data) == length(tract_order)) { # Extract lagged filing values lag_values <- week_data$filings_lag1_for_spatial # Replace NAs with 0 for calculation (NAs would propagate) lag_values_filled <- ifelse(is.na(lag_values), 0, lag_values) # Calculate spatial lag: average of 5 nearest neighbors spatial_lag <- lag.listw(weights_knn5, lag_values_filled, zero.policy = TRUE) # Restore NAs where original values were NA spatial_lag[is.na(lag_values)] <- NA week_result <- data.frame( GEOID = week_data$GEOID, week = w, spatial_lag_neighbors = spatial_lag ) } else { # Fallback if data is incomplete week_result <- data.frame( GEOID = week_data$GEOID, week = w, spatial_lag_neighbors = NA_real_ ) } spatial_lag_results <- bind_rows(spatial_lag_results, week_result) } # Merge spatial lag back to main dataset analysis_data <- analysis_data %>% left_join(spatial_lag_results, by = c("GEOID", "week")) ``` ## 4.4 Distance Features Location within Philadelphia's urban structure matters for eviction risk. Distance to city center (City Hall) helps capture the classic urban geography where housing costs and characteristics vary with centrality. Center City has expensive housing with high-income residents, while outer areas have more varied housing stock and demographic composition. ```{r distance_features} #| output: false # CALCULATE DISTANCE TO CITY CENTER # Define City Hall location as the city center reference point city_hall <- st_sfc(st_point(c(-75.1635, 39.9526)), crs = 4326) %>% st_transform(crs = st_crs(philly_acs)) # Calculate distance from each tract centroid to City Hall philly_acs <- philly_acs %>% mutate( # Distance in kilometers (divide meters by 1000) dist_to_center = as.numeric( st_distance(st_centroid(geometry), city_hall) ) / 1000 ) ``` ------------------------------------------------------------------------ # 5. Local Hotspots Analysis Local Moran's I is a spatial statistical technique that identifies local clusters and spatial outliers. It decomposes the global measure of spatial autocorrelation into contributions from each location, allowing us to identify specific neighborhoods that form eviction hotspots (high-eviction tracts surrounded by other high-eviction tracts) or coldspots (low-eviction tracts surrounded by other low-eviction tracts). It generates features (is_hotspot, is_coldspot, dist_to_hotspot) that may improve model prediction by capturing a tract's position in the broader spatial pattern of evictions. ```{r local_moran} #| output: false # CALCULATE MEAN EVICTION RATE PER TRACT tract_mean_filings <- analysis_data %>% group_by(GEOID) %>% summarise(mean_filings = mean(filings_2020, na.rm = TRUE), .groups = "drop") philly_acs <- philly_acs %>% left_join(tract_mean_filings, by = "GEOID") # COMPUTE LOCAL MORAN'S I STATISTIC # Ensure tract order matches the weights matrix philly_acs_ordered <- philly_acs %>% arrange(match(GEOID, tract_centroids$GEOID)) # Calculate Local Moran's I local_moran <- localmoran( philly_acs_ordered$mean_filings, weights_knn5, zero.policy = TRUE ) # CLASSIFY TRACTS INTO CLUSTER TYPES philly_acs_ordered <- philly_acs_ordered %>% mutate( # Extract Local Moran statistics local_moran_i = local_moran[, 1], # Local I value local_moran_p = local_moran[, 5], # Pseudo p-value # Standardize the filing rate z_filings = scale(mean_filings)[, 1], # Calculate spatial lag of standardized filings lag_z_filings = lag.listw(weights_knn5, z_filings, zero.policy = TRUE) ) %>% mutate( # Classify into four quadrants of Moran scatterplot hotspot_type = case_when( # Not significant: p > 0.05 local_moran_p > 0.05 ~ "Not Significant", # High-High: high values surrounded by high neighbors (HOTSPOT) z_filings > 0 & lag_z_filings > 0 ~ "High-High (Hotspot)", # Low-Low: low values surrounded by low neighbors (COLDSPOT) z_filings < 0 & lag_z_filings < 0 ~ "Low-Low (Coldspot)", # High-Low: high value surrounded by low neighbors (OUTLIER) z_filings > 0 & lag_z_filings < 0 ~ "High-Low (Outlier)", # Low-High: low value surrounded by high neighbors (OUTLIER) z_filings < 0 & lag_z_filings > 0 ~ "Low-High (Outlier)", TRUE ~ "Not Significant" ) ) ``` ```{r show_hotspots} hotspot_table <- as.data.frame(table(philly_acs_ordered$hotspot_type)) names(hotspot_table) <- c("Cluster Type", "Count") kable(hotspot_table, caption = "Local Moran's I Cluster Classification") ``` The table above shows how Philadelphia's Census tracts distribute across the four cluster types. "High-High (Hotspot)" tracts are areas of concentrated eviction activity where both the tract and its neighbors have elevated eviction rates. These represent the core of Philadelphia's eviction crisis and should be priority areas for intervention. "Low-Low (Coldspot)" tracts are stable, low-eviction areas. Outlier categories represent tracts that differ from their surroundings - either islands of stability in troubled areas (Low-High) or pockets of distress in otherwise stable neighborhoods (High-Low). ```{r dist_to_hotspot} #| output: false # CALCULATE DISTANCE TO NEAREST HOTSPOT # This captures how close each tract is to the core eviction crisis areas hotspot_tracts <- philly_acs_ordered %>% filter(hotspot_type == "High-High (Hotspot)") if (nrow(hotspot_tracts) > 0) { # Get centroids of all hotspot tracts hotspot_centroids <- st_centroid(hotspot_tracts$geometry) # Calculate distance to nearest hotspot (in km) philly_acs_ordered <- philly_acs_ordered %>% mutate( dist_to_hotspot = as.numeric( st_distance(st_centroid(geometry), st_union(hotspot_centroids)) ) / 1000 ) } else { # Fallback: if no significant hotspots, use top 10% tracts high_eviction_tracts <- philly_acs_ordered %>% filter(mean_filings > quantile(mean_filings, 0.9, na.rm = TRUE)) philly_acs_ordered <- philly_acs_ordered %>% mutate( dist_to_hotspot = as.numeric( st_distance( st_centroid(geometry), st_union(st_centroid(high_eviction_tracts$geometry)) ) ) / 1000 ) } # CREATE BINARY HOTSPOT/COLDSPOT INDICATORS philly_acs_ordered <- philly_acs_ordered %>% mutate( is_hotspot = ifelse(hotspot_type == "High-High (Hotspot)", 1, 0), is_coldspot = ifelse(hotspot_type == "Low-Low (Coldspot)", 1, 0) ) philly_acs <- philly_acs_ordered ``` ```{r map_hotspots} #| fig-width: 10 #| fig-height: 10 # VISUALIZE HOTSPOT/COLDSPOT CLUSTERS ggplot(philly_acs_ordered) + geom_sf(aes(fill = hotspot_type), color = "grey30", linewidth = 0.15) + scale_fill_manual( values = c( "High-High (Hotspot)" = "#d73027", # Red for high eviction clusters "Low-Low (Coldspot)" = "#4575b4", # Blue for low eviction clusters "High-Low (Outlier)" = "#fc8d59", # Orange for high outliers "Low-High (Outlier)" = "#91bfdb", # Light blue for low outliers "Not Significant" = "#f0f0f0" # Gray for non-significant ), name = "Cluster Type" ) + labs( title = "Eviction Hotspots and Coldspots (Local Moran's I)", subtitle = "Identifying spatial clusters of high and low eviction rates" ) + theme_void() + theme(plot.title = element_text(size = 14, face = "bold")) ``` The hotspot map reveals clear geographic patterns that reflect Philadelphia's history of segregation and disinvestment. Red clusters in North Philadelphia (including Kensington, Fairhill, and North Central) and parts of West Philadelphia show persistent high eviction rates surrounded by similarly affected neighbors. These areas have historically experienced redlining, concentrated poverty, and deteriorated housing stock. Blue coldspots in Center City, Chestnut Hill, and parts of the far Northeast represent stable, low-eviction areas with stronger housing markets and higher incomes. ```{r merge_distance} #| output: false # MERGE HOTSPOT FEATURES BACK TO MAIN DATASET analysis_data <- analysis_data %>% left_join( philly_acs %>% st_drop_geometry() %>% select(GEOID, dist_to_center, dist_to_hotspot, is_hotspot, is_coldspot, hotspot_type), by = "GEOID" ) ``` ------------------------------------------------------------------------ # 6. Final Data Preparation Before modeling, we need to ensure our dataset is complete and properly structured. We filter to week 5 and later so that all temporal lag features (up to 4 weeks back) are available, then remove any remaining rows with missing values. This creates a balanced panel ready for model estimation. ```{r final_data_prep} #| output: false # FINAL DATA CLEANING AND SELECTION model_data <- analysis_data %>% # Start from week 5 to ensure all lag features are available filter(week > 4) %>% # Select only the features needed for modeling select( # Identifiers GEOID, week, week_date, year, month, # Outcome variable filings = filings_2020, # Temporal lag features lag_1week, lag_2week, lag_4week, rolling_avg_4week, # 311 complaint features complaints_lag1, complaints_lag4, # Spatial features spatial_lag_neighbors, dist_to_center, dist_to_hotspot, is_hotspot, is_coldspot, # Demographic features racial_majority, total_pop, renter_occupied, median_rent, median_income, pct_renter, vacancy_rate, pct_rent_burdened, pct_severe_burden, poverty_rate, unemployment_rate, pct_college, pct_old_housing, # Policy indicator pandemic_moratorium, # Baseline comparison filings_avg_prepandemic_baseline ) %>% # Remove rows with any missing values drop_na() n_final_obs <- nrow(model_data) n_final_tracts <- n_distinct(model_data$GEOID) ``` ```{r show_final_data} final_summary <- tibble( Metric = c("Observations", "Tracts"), Value = c(format(n_final_obs, big.mark = ","), n_final_tracts) ) kable(final_summary, caption = "Final Dataset Summary") ``` The final dataset contains `r format(n_final_obs, big.mark = ",")` tract-week observations across `r n_final_tracts` Census tracts. This represents a complete panel where we observe each tract in each week, allowing us to exploit both cross-sectional variation (differences between tracts) and temporal variation (changes within tracts over time). ------------------------------------------------------------------------ # 7. Exploratory Data Analysis Exploratory analysis examines temporal trends, racial disparities, spatial distribution, and the statistical properties of our outcome variable. ## 7.1 Temporal Trends ```{r temporal_trends} #| fig-width: 10 #| fig-height: 5 # AGGREGATE AND VISUALIZE WEEKLY EVICTION TRENDS weekly_totals <- model_data %>% group_by(week_date) %>% summarise(total_filings = sum(filings), .groups = "drop") ggplot(weekly_totals, aes(x = week_date, y = total_filings)) + # Raw weekly data geom_line(color = "steelblue", linewidth = 0.5) + # Smoothed trend line geom_smooth(method = "loess", se = TRUE, color = "red", alpha = 0.3) + # Highlight CDC eviction moratorium period annotate("rect", xmin = as.Date("2020-09-04"), xmax = as.Date("2021-08-26"), ymin = -Inf, ymax = Inf, alpha = 0.2, fill = "yellow") + annotate("text", x = as.Date("2021-03-01"), y = max(weekly_totals$total_filings) * 0.9, label = "CDC Eviction\nMoratorium", size = 3) + labs( title = "Philadelphia Weekly Eviction Filings (2020-2025)", subtitle = "The CDC moratorium created a temporary but dramatic reduction in filings", x = "Date", y = "Total Weekly Filings" ) + theme_minimal() ``` The CDC eviction moratorium period (highlighted in yellow) shows a clear and dramatic drop in filings - demonstrating that policy interventions can substantially reduce evictions, at least temporarily. After August 2021, when the moratorium expired, evictions surged back and have remained elevated since, with some seasonal variation. There are visible dips around the winter holidays (December-January) when courts have fewer operating days. We think that the post-moratorium surge raises important questions about whether the moratorium merely delayed evictions or prevented them entirely. Some tenants who would have been evicted during the moratorium may have used the additional time to stabilize their housing situation, while others accumulated back rent that made eviction inevitable once protections expired. ## 7.2 Racial Disparity Analysis Housing instability in America is deeply intertwined with race, reflecting the legacy of discriminatory policies like redlining, racially restrictive covenants, and ongoing patterns of residential segregation. Our analysis examines eviction rates by the racial composition of each Census tract to quantify these disparities. ```{r racial_disparity} #| fig-width: 10 #| fig-height: 5 # SUMMARIZE EVICTIONS BY RACIAL MAJORITY racial_summary <- model_data %>% filter(!is.na(racial_majority)) %>% group_by(racial_majority) %>% summarise( Tracts = n_distinct(GEOID), `Mean Filings` = round(mean(filings), 2), `Total Filings` = format(sum(filings), big.mark = ","), .groups = "drop" ) %>% arrange(desc(`Mean Filings`)) kable(racial_summary, caption = "Eviction Filings by Neighborhood Racial Composition") # VISUALIZE DISTRIBUTION BY RACIAL COMPOSITION ggplot(model_data %>% filter(!is.na(racial_majority)), aes(x = reorder(racial_majority, filings, FUN = median), y = filings, fill = racial_majority)) + geom_boxplot(outlier.alpha = 0.3) + scale_fill_viridis_d() + labs( title = "Eviction Filings by Neighborhood Racial Composition", subtitle = "Majority-Black neighborhoods have the highest eviction rates", x = "Racial Majority", y = "Weekly Filings" ) + theme_minimal() + theme(legend.position = "none") ``` Majority-Black neighborhoods have the highest median eviction rates and the most extreme outliers, with some tract-weeks seeing 10 or more filings. This disparity persists even when accounting for income differences - majority-Black neighborhoods with similar poverty rates to majority-White neighborhoods still experience more evictions. This pattern points to systemic issues in housing access, discrimination by landlords, differences in housing quality, and the concentration of predatory landlords in certain communities. We hold the view that these disparities have important implications for both model development (we need to ensure our model doesn't simply reproduce existing biases) and for policy (interventions should be designed with awareness of these disparities). ## 7.3 Spatial Distribution ```{r spatial_distribution} #| fig-width: 10 #| fig-height: 10 # MAP AVERAGE EVICTION RATES BY TRACT tract_totals <- model_data %>% group_by(GEOID) %>% summarise(mean_weekly_filings = mean(filings), .groups = "drop") map_data <- philly_acs %>% left_join(tract_totals, by = "GEOID") ggplot(map_data) + geom_sf(aes(fill = mean_weekly_filings), color = "grey40", linewidth = 0.1) + scale_fill_viridis_c(option = "plasma", name = "Avg Weekly\nFilings") + labs( title = "Average Weekly Eviction Filings by Census Tract", subtitle = "Evictions concentrate in North and West Philadelphia" ) + theme_void() + theme(plot.title = element_text(size = 14, face = "bold")) ``` Evictions concentrate in North and West Philadelphia, forming a distinctive pattern around Center City. The far Northeast and Northwest (Chestnut Hill, Mt. Airy) have consistently low rates. This geographic clustering is not random - it reflects decades of housing policy, economic development patterns, and demographic change. The U-shaped pattern of high evictions mirrors historical redlining maps and areas of concentrated poverty. This strong spatial clustering justifies including spatial lag features in our model - knowing what's happening to a tract's neighbors provides meaningful information about the tract's own eviction risk. ## 7.4 Outcome Variable Distribution ```{r outcome_distribution} #| fig-width: 10 #| fig-height: 4 # EXAMINE DISTRIBUTION OF EVICTION COUNTS ggplot(model_data, aes(x = filings)) + geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7) + labs( title = "Distribution of Weekly Eviction Filings", subtitle = "Right-skewed with many zeros - typical of count data", x = "Filings per Tract-Week", y = "Count" ) + theme_minimal() ``` ```{r outcome_stats} # Calculate key distributional statistics zero_pct <- round(mean(model_data$filings == 0) * 100, 1) var_mean_ratio <- round(var(model_data$filings) / mean(model_data$filings), 2) outcome_stats <- tibble( Metric = c("Mean filings", "Variance/Mean ratio", "Zero proportion"), Value = c(round(mean(model_data$filings), 2), paste0(var_mean_ratio, " (overdispersion)"), paste0(zero_pct, "%")) ) kable(outcome_stats, caption = "Outcome Variable Statistics") ``` The distribution of weekly eviction filings shows classic count data characteristics: many zeros (`r zero_pct`% of observations have zero filings), a right skew with a long tail of high values, and overdispersion (variance is `r var_mean_ratio` times the mean, much larger than the 1:1 ratio expected under a Poisson distribution). These characteristics rule out ordinary least squares regression, which assumes a continuous, normally-distributed outcome. Instead, we need count data models - Poisson regression for the basic case, or Negative Binomial regression to handle the overdispersion. The high zero rate also motivates aggregating to monthly data, which reduces the proportion of zeros and produces more stable estimates. ------------------------------------------------------------------------ # 8. Monthly Data Aggregation Weekly data has too many zeros for stable modeling - when most tract-weeks have zero evictions, it's hard for models to learn what distinguishes high-eviction from low-eviction tracts. Aggregating to monthly data reduces this zero-inflation while maintaining enough temporal variation to capture meaningful patterns. Monthly aggregation also has practical advantages: most housing assistance programs operate on monthly cycles (rent is due monthly), and monthly predictions are more actionable for resource allocation than weekly predictions. ```{r monthly_data} #| output: false # AGGREGATE FROM WEEKLY TO MONTHLY DATA monthly_data <- model_data %>% # Create year-month identifier mutate(year_month = floor_date(week_date, "month")) %>% # Aggregate within tract-month combinations group_by(GEOID, year_month, racial_majority) %>% summarise( # Sum filings over the month filings = sum(filings, na.rm = TRUE), # Take last week's lag values (most recent info) lag_1week = last(lag_1week), lag_4week = last(lag_4week), # Average spatial lag over the month spatial_lag_neighbors = mean(spatial_lag_neighbors, na.rm = TRUE), # Static features: take first value (they don't change) pct_renter = first(pct_renter), poverty_rate = first(poverty_rate), dist_to_center = first(dist_to_center), median_income = first(median_income), dist_to_hotspot = first(dist_to_hotspot), is_hotspot = first(is_hotspot), is_coldspot = first(is_coldspot), # Maximum moratorium indicator (1 if any week was under moratorium) pandemic_moratorium = max(pandemic_moratorium), # Keep year and month for fixed effects year = first(year), month = first(month), .groups = "drop" ) %>% # Sort and create monthly lags arrange(GEOID, year_month) %>% group_by(GEOID) %>% mutate( # Monthly lag features lag_1month = lag(filings, 1), lag_2month = lag(filings, 2) ) %>% ungroup() %>% # Remove rows with missing lag values drop_na() # Compare zero rates between weekly and monthly weekly_zero_pct <- round(mean(model_data$filings == 0) * 100, 1) monthly_zero_pct <- round(mean(monthly_data$filings == 0) * 100, 1) ``` ```{r show_comparison} comparison <- tibble( Metric = c("Zero proportion", "Observations"), Weekly = c(paste0(weekly_zero_pct, "%"), format(nrow(model_data), big.mark = ",")), Monthly = c(paste0(monthly_zero_pct, "%"), format(nrow(monthly_data), big.mark = ",")) ) kable(comparison, caption = "Weekly vs Monthly Data Comparison") ``` Aggregating to monthly data dramatically reduces the zero proportion from `r weekly_zero_pct`% to `r monthly_zero_pct`%, creating a more tractable modeling problem while still preserving meaningful temporal and spatial variation. ------------------------------------------------------------------------ # 9. Model Building ## 9.1 Train-Test Split We use temporal holdout validation: training on all data before the last 3 months and testing on those final 3 months. This approach simulates real-world deployment where we predict future evictions using historical data. Unlike random cross-validation, temporal holdout respects the time-ordered nature of our data and avoids "data leakage" where future information could contaminate predictions. The 3-month holdout period provides enough test data to evaluate model stability across multiple months while retaining sufficient training data to learn robust patterns. ```{r train_test_split} #| output: false # CREATE TEMPORAL TRAIN/TEST SPLIT # Hold out the last 3 months for testing split_date <- max(monthly_data$year_month) - months(3) # Training: all months before the split train_data <- monthly_data %>% filter(year_month < split_date) # Testing: the final 3 months test_data <- monthly_data %>% filter(year_month >= split_date) ``` ```{r show_split} split_summary <- tibble( Dataset = c("Training", "Test"), Observations = c(format(nrow(train_data), big.mark = ","), format(nrow(test_data), big.mark = ",")), Period = c(paste("Before", split_date), paste("From", split_date)) ) kable(split_summary, caption = "Train/Test Split") ``` ## 9.2 Model Fitting We build a series of models of increasing complexity to understand which features contribute most to predictive accuracy. The progression from simple to complex also helps us guard against overfitting - if a complex model doesn't substantially outperform a simpler one, the added complexity may not be justified. All models use count regression because our outcome (eviction counts) consists of non-negative integers. We compare Poisson regression (which assumes variance equals mean) with Negative Binomial regression (which allows overdispersion). ```{r fit_models} #| output: false # MODEL 1: BASELINE (TEMPORAL LAGS ONLY) baseline_poisson <- glm( filings ~ lag_1month + lag_2month, data = train_data, family = poisson ) # MODEL 2: FULL MODEL (ADD SPATIAL AND DEMOGRAPHIC) # Adds spatial features, neighborhood demographics, and policy indicator full_formula <- filings ~ lag_1month + lag_2month + spatial_lag_neighbors + pct_renter + poverty_rate + dist_to_center + racial_majority + pandemic_moratorium # Poisson version (may show overdispersion) full_poisson <- glm(full_formula, data = train_data, family = poisson) # Negative Binomial version (handles overdispersion) full_nb <- glm.nb(full_formula, data = train_data) # MODEL 3: ENHANCED MODEL (ADD INTERACTIONS AND FIXED EFFECTS) # Includes interaction terms and month fixed effects for seasonality enhanced_formula <- filings ~ lag_1month + lag_2month + spatial_lag_neighbors + pct_renter + poverty_rate + median_income + dist_to_center + dist_to_hotspot + is_hotspot + racial_majority + factor(month) + pandemic_moratorium + # Interaction: does poverty have different effects in different racial contexts? racial_majority:poverty_rate + # Interaction: do spatial effects vary with renter concentration? spatial_lag_neighbors:pct_renter # Fit both Poisson and Negative Binomial versions enhanced_poisson <- glm(enhanced_formula, data = train_data, family = poisson) enhanced_nb <- glm.nb(enhanced_formula, data = train_data) ``` ------------------------------------------------------------------------ # 10. Model Evaluation Model performance is evaluated on the held-out test set to assess how well predictions generalize to new data. We use multiple metrics: Mean Absolute Error (MAE) measures average prediction error in intuitive units (evictions), Root Mean Squared Error (RMSE) penalizes large errors more heavily, and Correlation measures whether predictions rank tracts correctly even if the absolute values are off. ```{r predictions} #| output: false # GENERATE PREDICTIONS FOR TRAIN AND TEST SETS train_data <- train_data %>% mutate( pred_baseline = predict(baseline_poisson, type = "response"), pred_poisson = predict(full_poisson, type = "response"), pred_nb = predict(full_nb, type = "response"), pred_enhanced = predict(enhanced_poisson, type = "response") ) test_data <- test_data %>% mutate( pred_baseline = predict(baseline_poisson, newdata = test_data, type = "response"), pred_poisson = predict(full_poisson, newdata = test_data, type = "response"), pred_nb = predict(full_nb, newdata = test_data, type = "response"), pred_enhanced = predict(enhanced_poisson, newdata = test_data, type = "response") ) ``` ```{r performance_metrics} # CALCULATE PERFORMANCE METRICS calc_metrics <- function(actual, predicted, name) { tibble( Model = name, MAE = round(mean(abs(actual - predicted)), 2), RMSE = round(sqrt(mean((actual - predicted)^2)), 2), Correlation = round(cor(actual, predicted), 3) ) } test_metrics <- bind_rows( calc_metrics(test_data$filings, test_data$pred_baseline, "Baseline"), calc_metrics(test_data$filings, test_data$pred_poisson, "Full Poisson"), calc_metrics(test_data$filings, test_data$pred_nb, "Neg. Binomial"), calc_metrics(test_data$filings, test_data$pred_enhanced, "Enhanced") ) kable(test_metrics, caption = "Model Performance on Test Set") ``` ```{r r2_metrics} # CALCULATE MODEL FIT (McFADDEN'S R-SQUARED) # McFadden's R² compares model log-likelihood to null model calc_r2 <- function(model) { null <- update(model, . ~ 1) # Null model with intercept only round(1 - as.numeric(logLik(model) / logLik(null)), 3) } r2_summary <- tibble( Model = c("Baseline", "Full Poisson", "Full NB", "Enhanced Poisson", "Enhanced NB"), `McFadden R2` = c( calc_r2(baseline_poisson), calc_r2(full_poisson), calc_r2(full_nb), calc_r2(enhanced_poisson), calc_r2(enhanced_nb) ) ) kable(r2_summary, caption = "Model Fit (McFadden R-squared)") ``` The Enhanced Poisson model achieves the best balance of fit and interpretability, with a McFadden's R² around 0.22 and strong test set correlation. For count data models, a McFadden's R² between 0.2 and 0.4 is generally considered "excellent" - the interpretation is less intuitive than linear R² but represents substantial improvement over a null model. The high correlation (0.7+) indicates that the model correctly ranks tracts by eviction risk even when absolute predictions are somewhat off. ------------------------------------------------------------------------ # 11. Prediction Results This section presents the practical output of our modeling work: risk predictions for each Census tract that can guide resource allocation and intervention targeting. ```{r prediction_summary} #| output: false # SUMMARIZE PREDICTIONS BY TRACT prediction_summary <- test_data %>% group_by(GEOID) %>% summarise( actual_total = sum(filings), predicted_total = round(sum(pred_enhanced), 1), actual_mean = round(mean(filings), 2), predicted_mean = round(mean(pred_enhanced), 2), .groups = "drop" ) %>% arrange(desc(predicted_total)) # Save for external use write_csv(prediction_summary, here("final", "output", "tract_prediction_summary.csv")) ``` ```{r show_predictions} kable(head(prediction_summary, 10), caption = "Top 10 High-Risk Tracts") ``` The top 10 high-risk tracts are concentrated in areas that match our earlier hotspot analysis - these are the neighborhoods that would benefit most from targeted outreach, legal aid services, and rental assistance programs. ## 11.1 Actual vs Predicted Comparison Map ```{r comparison_map} #| fig-width: 12 #| fig-height: 10 # VISUALIZE ACTUAL VS PREDICTED SIDE BY SIDE comparison_map <- philly_acs %>% left_join(prediction_summary, by = "GEOID") %>% select(GEOID, geometry, actual_mean, predicted_mean) %>% # Reshape for faceting pivot_longer( cols = c(actual_mean, predicted_mean), names_to = "type", values_to = "filings" ) %>% mutate(type = ifelse(type == "actual_mean", "Actual", "Predicted")) ggplot(comparison_map) + geom_sf(aes(fill = filings), color = "grey40", linewidth = 0.05) + facet_wrap(~type) + scale_fill_viridis_c( option = "inferno", name = "Monthly\nFilings", na.value = "grey80" ) + labs( title = "Actual vs Predicted Monthly Eviction Filings", subtitle = "Model captures the overall spatial pattern of eviction risk" ) + theme_void() + theme( plot.title = element_text(size = 14, face = "bold"), strip.text = element_text(size = 12, face = "bold") ) ``` The model captures the overall spatial pattern well - high-risk areas in North and West Philadelphia match between actual and predicted maps. The predictions are slightly smoother than the actual values, which is expected since statistical models tend to "regress toward the mean" and underestimate extreme values. This smoothing is actually desirable for policy purposes, as it provides more stable risk estimates that won't fluctuate wildly from month to month. ## 11.2 Risk Classification For practical use, we classify all tracts into five risk categories based on predicted monthly eviction rates. This classification provides an intuitive framework for prioritizing interventions and communicating results to non-technical stakeholders. ```{r risk_classification} # CLASSIFY TRACTS INTO RISK CATEGORIES prediction_summary <- prediction_summary %>% mutate(risk_category = case_when( # Top 10% = Very High Risk predicted_mean >= quantile(predicted_mean, 0.9, na.rm = TRUE) ~ "Very High Risk", # Top 25% = High Risk predicted_mean >= quantile(predicted_mean, 0.75, na.rm = TRUE) ~ "High Risk", # Top 50% = Moderate Risk predicted_mean >= quantile(predicted_mean, 0.5, na.rm = TRUE) ~ "Moderate Risk", # Top 75% = Low Risk predicted_mean >= quantile(predicted_mean, 0.25, na.rm = TRUE) ~ "Low Risk", # Bottom 25% = Very Low Risk TRUE ~ "Very Low Risk" )) risk_summary <- prediction_summary %>% group_by(risk_category) %>% summarise( Tracts = n(), `Avg Monthly` = round(mean(predicted_mean, na.rm = TRUE), 2), .groups = "drop" ) %>% arrange(desc(`Avg Monthly`)) kable(risk_summary, caption = "Risk Classification Summary") ``` ## 11.3 Risk Map: Early Warning System ```{r risk_map} #| fig-width: 10 #| fig-height: 10 # CREATE RISK CLASSIFICATION MAP risk_map <- philly_acs %>% left_join( prediction_summary %>% select(GEOID, risk_category), by = "GEOID" ) %>% mutate(risk_category = factor( risk_category, levels = c("Very High Risk", "High Risk", "Moderate Risk", "Low Risk", "Very Low Risk") )) ggplot(risk_map) + geom_sf(aes(fill = risk_category), color = "grey40", linewidth = 0.1) + scale_fill_manual( values = c( "Very High Risk" = "#d73027", # Dark red "High Risk" = "#fc8d59", # Orange "Moderate Risk" = "#fee090", # Yellow "Low Risk" = "#91bfdb", # Light blue "Very Low Risk" = "#4575b4" # Dark blue ), name = "Risk Level", na.value = "grey80" ) + labs( title = "Eviction Risk Early Warning System", subtitle = "Census Tracts Classified by Predicted Monthly Evictions" ) + theme_void() + theme(plot.title = element_text(size = 14, face = "bold")) ``` This map provides an actionable tool for the Philadelphia Housing Authority and partner organizations. Red and orange areas should receive priority for outreach and resources - they represent neighborhoods where eviction risk is highest and where intervention is most likely to prevent displacement. The clustering of high-risk tracts means geographically-focused programs could be efficient: a single community organization or legal aid office could potentially serve multiple high-risk tracts. ## 11.4 Monthly Forecast Accuracy ```{r monthly_forecast} #| fig-width: 10 #| fig-height: 5 # COMPARE AGGREGATE MONTHLY FORECASTS monthly_forecast <- test_data %>% group_by(year_month) %>% summarise( actual = sum(filings), predicted = sum(pred_enhanced), .groups = "drop" ) %>% pivot_longer( cols = c(actual, predicted), names_to = "type", values_to = "filings" ) %>% mutate(type = ifelse(type == "actual", "Actual", "Predicted")) ggplot(monthly_forecast, aes(x = year_month, y = filings, color = type, linetype = type)) + geom_line(linewidth = 1.2) + geom_point(size = 3) + scale_color_manual(values = c("Actual" = "black", "Predicted" = "red")) + labs( title = "Monthly Forecast vs Actual", subtitle = "Model tracks trends but slightly underpredicts peaks", x = "Month", y = "Total Filings", color = "", linetype = "" ) + theme_minimal() + theme(legend.position = "bottom") ``` The model tracks the overall trend reasonably well but slightly underpredicts peak months. We believe that it's common with count models - they're conservative about extreme values, which can be either a feature (more stable predictions) or a limitation (missing peaks when they matter most). For resource allocation purposes, consistent underprediction of peaks might mean that the "Very High Risk" category should receive even more resources than the model suggests. ------------------------------------------------------------------------ # 12. Spatial Autocorrelation Diagnostics A key assumption in our analysis is that we've adequately captured the spatial structure in eviction patterns through our spatial lag features. If significant spatial autocorrelation remains in the model residuals, it would indicate that there's spatial structure we haven't captured, which could bias our coefficient estimates and underestimate prediction uncertainty. We use Moran's I test to assess whether residuals show spatial clustering beyond what our model explains. ```{r residual_moran} #| output: false # CALCULATE AND TEST RESIDUAL SPATIAL AUTOCORRELATION train_data <- train_data %>% mutate(residual = filings - pred_nb) # Average residual per tract (aggregating over time) tract_residuals <- train_data %>% group_by(GEOID) %>% summarise(mean_residual = mean(residual, na.rm = TRUE), .groups = "drop") # Merge with tract order for spatial analysis tract_residuals <- tract_centroids %>% select(GEOID) %>% left_join(tract_residuals, by = "GEOID") %>% mutate(mean_residual = replace_na(mean_residual, 0)) # Perform Moran's I test if (nrow(tract_residuals) == length(weights_knn5$neighbours)) { moran_test <- moran.test( tract_residuals$mean_residual, weights_knn5, zero.policy = TRUE ) moran_i <- round(moran_test$estimate[1], 3) moran_p <- round(moran_test$p.value, 4) } else { moran_i <- NA moran_p <- NA } ``` ```{r show_moran} moran_summary <- tibble( Metric = c("Moran's I", "P-value", "Conclusion"), Value = c( moran_i, moran_p, ifelse(moran_p < 0.05, "Spatial autocorrelation remains", "No significant autocorrelation") ) ) kable(moran_summary, caption = "Moran's I Test for Model Residuals") ``` ```{r map_residuals} #| fig-width: 10 #| fig-height: 10 # MAP SPATIAL DISTRIBUTION OF RESIDUALS residual_map <- philly_acs %>% left_join(tract_residuals, by = "GEOID") ggplot(residual_map) + geom_sf(aes(fill = mean_residual), color = "grey40", linewidth = 0.1) + scale_fill_gradient2( low = "#4575b4", # Blue for overprediction mid = "white", high = "#d73027", # Red for underprediction midpoint = 0, name = "Mean\nResidual", na.value = "grey80", limits = c(-4, 4), oob = scales::squish ) + labs( title = "Spatial Distribution of Model Residuals", subtitle = paste0("Moran's I = ", moran_i, ", p = ", moran_p) ) + theme_void() + theme(plot.title = element_text(size = 14, face = "bold")) ``` The residual map shows where the model over- and under-predicts. Blue areas are overpredicted (model predicts more evictions than actually occur), while red areas are underpredicted. The relatively random scatter without strong clustering suggests that our spatial lag features have captured most of the spatial structure, though some localized patterns remain that our model doesn't fully explain. ------------------------------------------------------------------------ # 13. Equity Analysis Predictive models can perpetuate or amplify existing inequalities if they perform differently across demographic groups. Our equity analysis examines whether the model makes more accurate predictions in some neighborhoods than others, specifically comparing performance across neighborhoods with different racial compositions. ```{r equity_analysis} # COMPARE MODEL PERFORMANCE BY RACIAL COMPOSITION equity <- test_data %>% filter(!is.na(racial_majority)) %>% group_by(racial_majority) %>% summarise( N = n(), `Actual Mean` = round(mean(filings), 2), `Predicted Mean` = round(mean(pred_nb), 2), MAE = round(mean(abs(filings - pred_nb)), 2), .groups = "drop" ) kable(equity, caption = "Model Performance by Neighborhood Racial Composition") ``` ```{r equity_viz} #| fig-width: 10 #| fig-height: 5 ggplot(equity, aes(x = racial_majority, y = MAE, fill = racial_majority)) + geom_col() + scale_fill_viridis_d() + labs( title = "Prediction Error by Neighborhood Racial Composition", subtitle = "Higher errors in majority-Black neighborhoods reflect higher eviction rates", x = "", y = "Mean Absolute Error" ) + theme_minimal() + theme(legend.position = "none") ``` Prediction errors are higher in majority-Black neighborhoods, which is partly mechanical - these neighborhoods have higher eviction rates, and prediction errors tend to scale with the magnitude of the outcome. However, the disproportionate error rates raise equity concerns: if the model is less accurate in majority-Black neighborhoods, resource allocation based on model predictions might be less effective in these communities that need help most. Policy makers should be aware of this limitation. One mitigation strategy is to ensure that actual resource allocation considers both model predictions and ground-truth data from community organizations working in high-need areas. ------------------------------------------------------------------------ # 14. Save Results ```{r save_results} #| output: false # SAVE MODELS AND PREDICTIONS FOR FUTURE USE # Save all model objects save( baseline_poisson, full_poisson, full_nb, enhanced_poisson, enhanced_nb, file = here("final", "output", "models.RData") ) # Save test predictions write_csv(test_data, here("final", "output", "test_predictions.csv")) # Save model coefficients in tidy format coef_table <- tidy(full_nb, exponentiate = TRUE, conf.int = TRUE) %>% mutate(across(where(is.numeric), ~round(.x, 3))) write_csv(coef_table, here("final", "output", "model_coefficients.csv")) ``` ```{r show_coefs} kable(coef_table, caption = "Negative Binomial Model Coefficients (IRR)") ``` The coefficient table shows Incidence Rate Ratios (IRR) - the multiplicative effect on expected eviction counts. The lag terms dominate: each additional eviction last month increases expected evictions this month by roughly the same amount (IRR near 1.0 per additional filing). Racial majority effects persist after controlling for poverty rate, suggesting that structural factors beyond income - such as housing discrimination, differential enforcement, or concentration of predatory landlords - drive disparities across neighborhoods. ------------------------------------------------------------------------ # 15. Discussion and Conclusions ## 15.1 Summary of Key Findings This analysis developed and validated a predictive model for eviction risk in Philadelphia, with several important findings. First, **temporal persistence is the strongest signal**: Census tracts with high eviction rates in the previous month will likely have high rates in the current month. This persistence reflects the stable underlying conditions - poverty, housing quality, landlord practices - that drive eviction risk. For policy, this persistence is both concerning (eviction hotspots are "sticky" and don't naturally resolve) and useful (we can reliably identify high-risk areas). Second, **spatial clustering is substantial**: Eviction hotspots concentrate in North and West Philadelphia, particularly in majority-Black neighborhoods. This geographic concentration makes targeted intervention feasible - resources can be directed to specific neighborhoods rather than spread thinly across the entire city. The clustering also reveals the structural nature of Philadelphia's eviction crisis: these are not randomly distributed individual housing failures but systematic patterns reflecting historical disinvestment and ongoing inequality. Third, **socioeconomic factors matter but don't fully explain racial disparities**: Higher poverty rates and higher renter percentages correlate with more evictions, as expected. However, majority-Black neighborhoods have elevated eviction rates even after controlling for income, suggesting that factors beyond poverty - such as housing discrimination, differences in housing quality, or concentration of predatory landlord practices - contribute to racial disparities in eviction. Fourth, **the model achieves meaningful predictive accuracy**: With a McFadden's R² around 0.22 and strong test set correlation, the model provides useful predictions for resource targeting. The Enhanced Poisson model correctly identifies the spatial distribution of eviction risk, though it slightly underpredicts extreme values. ## 15.2 Policy Implications and Recommendations The analysis suggests several concrete policy applications. **Targeted outreach**: The risk classification map can guide where to focus eviction prevention resources. Organizations providing legal aid, rental assistance, and tenant counseling should prioritize Very High Risk and High Risk tracts where their efforts will reach the most at-risk households. **Early warning system**: The model could be run monthly to provide updated risk predictions. When a tract's predicted risk increases substantially, it could trigger proactive outreach before evictions occur rather than reactive assistance after displacement. **Resource allocation**: City and nonprofit resources for housing assistance are limited. The model provides a data-driven framework for allocating these resources to maximize impact. However, resource allocation should combine model predictions with community knowledge - local organizations often have information about emerging issues that won't appear in administrative data until it's too late. **Equity-conscious implementation**: Given the model's higher error rates in majority-Black neighborhoods and the documented racial disparities in eviction rates, implementation should be designed with equity as a primary consideration. This might include reserving additional resources for high-need communities beyond what the model suggests, or weighting interventions to correct for historical underservice. ## 15.3 Limitations and Caveats Several limitations should be considered when interpreting and applying these results. **Filing vs. eviction**: The outcome measure (eviction filings) captures court cases initiated by landlords, not completed evictions. Many filed cases are dismissed, settled, or withdrawn. However, even a filing has serious consequences for tenants (damaged credit, difficulty finding housing, stress of legal proceedings), so tracking filings is meaningful even if they don't all result in actual eviction. **Temporal scope**: The model is trained on 2020-2025 data, a period heavily influenced by the COVID-19 pandemic and associated eviction moratoriums. Patterns from this period may not generalize to "normal" times - or conversely, the pandemic may have accelerated pre-existing trends that will persist. **Geographic granularity**: Census tracts are a useful geographic unit, but they average over substantial within-tract variation. A single tract might contain both a well-maintained apartment building with stable tenants and a deteriorating property with frequent evictions. Block-level or building-level data would enable more precise targeting but are harder to obtain. **Model degradation**: Predictive models can degrade over time as underlying conditions change. If Philadelphia's housing market, demographic patterns, or eviction laws change substantially, the model would need to be retrained. Regular validation against new data is essential. **Policy shocks**: The model cannot anticipate sudden policy changes (like new eviction moratoriums) or economic shocks (like recessions or major employer closures). These events can rapidly change eviction patterns in ways the model won't predict. **Feedback effects**: If the model is used successfully to target interventions, it may reduce evictions in high-risk areas - which would then make those areas appear lower-risk in future data. This is a desirable outcome but means the model's predictions may become self-negating over time. ## 15.4 Ethical Considerations Predictive models for social services raise important ethical questions that extend beyond technical accuracy. **Stigmatization**: Labeling neighborhoods as "high risk" could contribute to stigma, affect property values, or influence where people choose to live. Model results should be communicated carefully with context about their intended use (targeting help, not blame). **Self-fulfilling prophecies**: If landlords or lenders learn that certain areas are designated high-risk, they might become more likely to pursue evictions or restrict credit there, creating a self-fulfilling prophecy. Access to model outputs should be controlled. **Privacy**: While this analysis uses aggregate tract-level data rather than individual records, more granular models could raise privacy concerns. The tradeoff between predictive accuracy and privacy protection should be explicitly considered. **Displacement of expertise**: Predictive models should complement, not replace, the knowledge of community organizations and tenants themselves. People closest to the problem often understand local conditions better than any statistical model. **Accountability**: If model predictions are wrong and resources are misallocated, who bears responsibility? Clear governance structures should define how models are used in decision-making and how to appeal decisions influenced by predictions. ## 15.5 Conclusions Our analysis demonstrates that eviction risk in Philadelphia is predictable enough to support targeted intervention. The concentration of evictions in specific neighborhoods - persistent over time and clustering spatially - creates opportunities for efficient resource deployment. However, prediction alone is not prevention: the model is a tool for guiding action, not a substitute for the policy changes and resource investments needed to address Philadelphia's housing crisis at its roots. The most important finding may be the structural nature of eviction patterns. The persistence of hotspots, the racial disparities that cannot be explained by income alone, and the geographic clustering all point to systemic factors - housing discrimination, disinvestment, inadequate tenant protections - that no predictive model can solve. Making evictions predictable is a step toward making them preventable, but ultimately preventing evictions requires addressing the conditions that create them. ------------------------------------------------------------------------