SEPTA Bus Ridership Prediction in Philadelphia, PA

Identifying Transit Deserts to Help Improve Equity-Focused Investments

Author

Itsnatani Anaqami, Jinheng Cen, Henry Yang

Published

December 7, 2025

Data Sources

SEPTA Ridership Statistics – 2023 fall boardings and alightings by census tract, used for the dependent variable.
Census Data ACS 2023 5-year – tract-level demographics, income, education, employment, car ownership, commute time, etc.
Philadelphia Universities & Colleges – locations of higher-education campuses.
School Facilities – locations of public, charter, and private schools.
Philadelphia Hospitals - locations of healthcare facilities.
PPR Properties – locations of parks and recreation assets for green-space proximity.
Indego Bike Stations – station coordinates for active bike-share docks.
Crime Incidents – police incident reports classified to violent vs. non-violent for tract aggregation.
Neighborhood Boundaries – used for neighborhood fixed effects and descriptive labeling.
City Hall coordinates (39.952381, -75.163619) – used to compute distance to the City Center.

Part 1: Setup

Initial Setup

Several preliminary steps such as loading required R packages and the Census API key are omitted here for clarity.

We created helper functions to streamline data preparation: reading and projecting shapefiles, computing distances between locations, and cleaning crime descriptions. Crime incidents are then categorized into violent and non-violent groups using a simple keyword-based rule.

Code

# read the single shapefile inside a folder
read_shp_from_dir <- function(dir_path) {
  shp_path <- list.files(dir_path, pattern = "\\.shp$", full.names = TRUE)
  if (length(shp_path) == 0) {
    stop(glue("No .shp file found inside {dir_path}"))
  }
  st_read(shp_path[1], quiet = TRUE) %>%
    st_transform(CRS_TARGET) %>%
    st_make_valid()
}

st_dist_mi <- function(from, to) {
  st_distance(from, to, by_element = TRUE) %>%
    set_units("mi") %>%
    drop_units()
}

# Crime classification
norm_tgc <- function(x) {
  x %>% str_squish() %>% str_to_lower()
}

violent_set <- c(
  "homicide - criminal",
  "homicide - justifiable",
  "rape",
  "robbery firearm",
  "robbery no firearm",
  "aggravated assault firearm",
  "aggravated assault no firearm",
  "other assaults",
  "offenses against family and children"
)

make_violence_bins <- function(df) {
  df %>%
    mutate(
      key = norm_tgc(text_general_code),
      gun_involved = str_detect(key, "firearm|weapon"),
      violence_bin = case_when(
        gun_involved ~ "Violent",
        key %in% violent_set ~ "Violent",
        TRUE ~ "Misdemeanor/Non-violent"
      ),
      violence_bin = factor(
        violence_bin,
        levels = c("Violent", "Misdemeanor/Non-violent")
      )
    )
}

Part 2: Data Preparation & Exploratory Data Analysis

Ridership Processing

This step loads and cleans the ridership data, filters to Fall 2023 bus records in Philadelphia, and computes average daily ridership for each census tract by summing boardings and alightings.

Code

RIDERSHIP_PATH <- file.path(DATA_DIR, "Bus_Ridership_by_Census_Tract.csv")

ridership_raw <- read_csv(RIDERSHIP_PATH, show_col_types = FALSE) %>%
  clean_names() %>%
  mutate(
    geoid = str_trim(census_tract_id),
    season_year = str_extract(season, "\\d{4}") %>% as.integer(),
    season_label = str_extract(season, "^[A-Za-z]+")
  )

ridership <- ridership_raw %>%
  filter(
    county_name == "Philadelphia County",
    str_to_lower(mode) == "bus",
    season_year == 2023
  ) %>%
  mutate(
    total_rides = coalesce(on, 0) + coalesce(off, 0)
  )

ridership_clean <- ridership %>%
  filter(season_label == "Fall") %>%
  group_by(geoid) %>%
  summarise(
    ridership_2023_fall = mean(total_rides, na.rm = TRUE),
    .groups = "drop"
  )

Histogram: Distribution of Bus Ridership

Code

ridership_avg_tract <- ridership %>%
  group_by(geoid) %>%
  summarise(
    avg_daily_rides = mean(total_rides, na.rm = TRUE),
    .groups = "drop"
  )

mean_val <- mean(ridership_avg_tract$avg_daily_rides, na.rm = TRUE)

ggplot(ridership_avg_tract, aes(x = avg_daily_rides)) +
  geom_histogram(bins = 150, fill = "#3182bd", color = "white") +
  # Mean line
  geom_vline(xintercept = mean_val, color = "red", size = 0.5) +
  geom_text(
    aes(x = mean_val+ 1200, y = 30, label = paste0("Mean = ", round(mean_val, 1))),
    color = "black",
    vjust = -0.5,
    hjust = 0.5,
    size = 4
  ) +
  labs(
    title = "Distribution of Average Daily Bus Ridership by Tract (Fall 2023)",
    x = "Average Boardings + Alightings by Tract per Day",
    y = "Number of Tracts"
  ) +
  theme_minimal()

Interpretation: The histogram shows that average daily bus ridership is very uneven across Philadelphia census tracts. Most tracts fall below 2,000 riders per day, indicating relatively modest transit activity in many neighborhoods. Meanwhile, a small number of tracts account for extremely high ridership, creating a heavily skewed distribution. The mean value of about 994 riders. The outliers with higher riderships create a long right tail and likely correspond to major corridors or transit centers.

Choropleth: Spatial Distribution of Bus Ridership

Code

tracts_map <- tigris::tracts(
  state = STATE_FIPS,
  county = COUNTY_FIPS,
  year = ACS_YEAR,
  cb = TRUE
) %>%
  st_transform(CRS_TARGET) %>%
  dplyr::select(GEOID)

tracts_ridership <- tracts_map %>%
  left_join(ridership_clean, by = c("GEOID" = "geoid"))

ridership_breaks <- c(
  0,
  scales::pretty_breaks(n = 4)(tracts_ridership$ridership_2023_fall)
) %>%
  unique() %>%
  sort()

ggplot(tracts_ridership) +
  geom_sf(aes(fill = ridership_2023_fall), color = NA) +
  scale_fill_viridis_c(
    option = "plasma",
    trans = "sqrt",
    breaks = ridership_breaks,
    labels = scales::comma
  ) +
  labs(
    title = "Spatial Distribution of Average Daily Bus Ridership (Fall 2023)",
    fill = "Avg Daily Riders"
  ) +
  theme_minimal() +
  theme(
    legend.position = "right",
    axis.text = element_blank(),
    axis.title = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
  )

Interpretation: The spatial map reinforces the pattern seen in the histogram. Most census tracts have relatively low bus activity, and only a small group of tracts stand out with much higher volumes, which aligns with the long right tail observed earlier. Overall the distribution suggests that frequent bus use is concentrated in a limited part of the city, while many neighborhoods experience modest ridership levels.

Data Cleaning

Through our EDA, we identified both tracts as extreme high-value outliers in the ridership distribution. Mapping and further investigation revealed that these tracts contain major multimodal transfer hubs, Olney Transportation Center and Frankford Transportation Center, where high bus riderships are driven primarily by transfers from the subway, rather than by local demographic, socioeconomic, or spatial characteristics.

These two tracts do not align with the underlying assumptions of the model and would distort inference. Since the ridership dataset is aggregated at the tract level and cannot distinguish transfer activity from local activities, we exclude the two tracts as outliers from the following analysis.

Code

ridership_clean <- ridership_clean %>%
  filter(!geoid %in% c("42101030200","42101028200"))

Loading ACS Data

Code

tracts <- get_acs(
  geography = GEOG,
  variables = "B01003_001",
  state = STATE_FIPS,
  county = COUNTY_FIPS,
  year = ACS_YEAR,
  geometry = TRUE,
  cache_table = TRUE
) %>%
  st_transform(CRS_TARGET) %>%
  st_make_valid() %>%
  dplyr::select(GEOID, NAME, total_pop = estimate) %>%
  mutate(
    tract_area_sqmi = st_area(geometry) %>% set_units(mi^2) %>% drop_units()
  )

acs_vars <- c(
  total_pop    = "B01003_001",
  median_income = "B19013_001",
  poverty_total = "B17001_002",
  pop_16plus    = "B23025_007",
  employed      = "B23025_004",
  unemployed    = "B23025_005",
  car0          = "B08201_002",
  white         = "B03002_003",
  black         = "B03002_004",
  hispanic      = "B03002_012",
  asian         = "B03002_006",
  bachelors     = "B15003_022",
  grad          = "B15003_023",
  total_edu     = "B15003_001",
  housing_units = "B25001_001",
  lt5 = "B08012_003",
  m5_9 = "B08012_004",
  m10_14 = "B08012_005",
  m15_19 = "B08012_006",
  m20_24 = "B08012_007",
  m25_29 = "B08012_008",
  m30_34 = "B08012_009",
  m35_39 = "B08012_010",
  m40_44 = "B08012_011",
  m45_59 = "B08012_012",
  m60_89 = "B08012_013",
  m90plus = "B08012_014",
  wfh  = "B08126_017"
)

acs_data <- get_acs(
  geography = GEOG,
  variables = acs_vars,
  state = STATE_FIPS,
  county = COUNTY_FIPS,
  year = ACS_YEAR,
  geometry = FALSE,
  cache_table = TRUE
) %>%
  dplyr::select(GEOID, variable, estimate) %>%
  pivot_wider(names_from = variable, values_from = estimate) %>%
  rename_with(
    ~ names(acs_vars)[match(.x, acs_vars)],
    any_of(acs_vars)
  ) %>%
  mutate(
    total_commuters = lt5 + m5_9 + m10_14 + m15_19 + m20_24 +
                      m25_29 + m30_34 + m35_39 + m40_44 +
                      m45_59 + m60_89 + m90plus,
    commute_score = dplyr::if_else(
      total_commuters > 0,
      (lt5    * 0  +
       m5_9   * 5  +
       m10_14 * 10 +
       m15_19 * 15 +
       m20_24 * 20 +
       m25_29 * 25 +
       m30_34 * 30 +
       m35_39 * 35 +
       m40_44 * 40 +
       m45_59 * 45 +
       m60_89 * 60 +
       m90plus* 90 ) / total_commuters,
      NA_real_
    ),
    pct_white = white / total_pop,
    pct_black = black / total_pop,
    pct_hispanic = hispanic / total_pop,
    pct_asian = asian / total_pop,
    pct_carless = car0 / housing_units,
    pct_bachelors_plus = (bachelors + grad) / total_edu,
    unemployment_rate = unemployed / pop_16plus,
    poverty_rate = poverty_total / total_pop,
    pct_wfh = wfh / pop_16plus
  )

Loading Spatial Layers

Code

universities <- read_shp_from_dir(file.path(DATA_DIR, "Universities_Colleges-shp"))

schools <- read_shp_from_dir(file.path(DATA_DIR, "Schools"))
if (any(st_geometry_type(schools) %in% c("POLYGON", "MULTIPOLYGON"))) {
  st_geometry(schools) <- st_point_on_surface(schools)
}

hospitals <- st_read(file.path(DATA_DIR, "Hospitals.geojson"), quiet = TRUE) %>%
  st_transform(CRS_TARGET) %>%
  st_make_valid()

parks <- read_shp_from_dir(file.path(DATA_DIR, "PPR_Properties"))
if (any(st_geometry_type(parks) %in% c("POINT", "MULTIPOINT"))) {
  parks <- parks %>%
    filter(st_geometry_type(geometry) %in% c("POLYGON", "MULTIPOLYGON"))
}

crime_raw <- read_shp_from_dir(file.path(DATA_DIR, "crime_incidents")) %>%
  clean_names()

crime_code_col <- names(crime_raw)[
  str_detect(names(crime_raw), "text.*general|general.*text|text_general|text_gener")
][1]

if (is.na(crime_code_col)) {
  stop("Could not locate the offense description column in crime shapefile.")
}

crime_bins <- crime_raw %>%
  rename(text_general_code = .data[[crime_code_col]]) %>%
  make_violence_bins() %>%
  filter(!st_is_empty(geometry))

crime_violent <- crime_bins %>%
  filter(violence_bin == "Violent")

cityhall <- tibble(long = -75.163619, lat = 39.952381) %>%
  st_as_sf(coords = c("long", "lat"), crs = 4326) %>%
  st_transform(CRS_TARGET)

indego_json <- fromJSON(file.path(DATA_DIR, "shared_bike_station.json"))
indego <- indiv <- indego_json$features
indego_sf <- tibble(
  longitude = map_dbl(indiv$geometry$coordinates, 1),
  latitude  = map_dbl(indiv$geometry$coordinates, 2)
) %>%
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
  st_transform(CRS_TARGET)

NEIGH_URL   <- "https://raw.githubusercontent.com/opendataphilly/open-geo-data/refs/heads/master/philadelphia-neighborhoods/philadelphia-neighborhoods.geojson"
neigh  <- read_sf(NEIGH_URL) %>%
  st_transform(st_crs(tracts))

Metrics Calculation

Calculating nearest distances from POIs (Points of interests) to census tracts

Code

tract_centroids <- st_centroid(tracts)

nearest_distance <- function(from_pts, to_feats, suffix) {
  idx <- st_nearest_feature(from_pts, to_feats)
  tibble(
    GEOID = from_pts$GEOID,
    !!glue("dist_{suffix}_mi") := st_dist_mi(from_pts, to_feats[idx, ])
  )
}

dist_university <- nearest_distance(tract_centroids, universities, "university")
dist_school     <- nearest_distance(tract_centroids, schools, "school")
dist_hospital   <- nearest_distance(tract_centroids, hospitals, "hospital")
dist_cityhall   <- nearest_distance(tract_centroids, cityhall, "cityhall")
dist_park       <- nearest_distance(tract_centroids, st_point_on_surface(parks), "park")
dist_indego     <- nearest_distance(tract_centroids, indego_sf, "indego")

Aggregating violent crime data using st_contains to census tracts

Code

crime_counts <- st_join(
  tracts %>% dplyr::select(GEOID),
  crime_violent,
  join = st_contains
) %>%
  st_drop_geometry() %>%
  count(GEOID, name = "violent_incidents_2014_2023")

Aggregating neighborhood data using st_intersection to census tracts (If mulitiple neighborhoods fall into one census tract, the one with biggest intersection area will be selected.)

Code

tracts_simple <- tracts %>% 
  dplyr::select(GEOID)

neigh_simple <- neigh %>% 
  dplyr::select(neigh_name = NAME)

tract_neigh_int <- st_intersection(tracts_simple, neigh_simple)

tract_neigh_largest <- tract_neigh_int %>%
  mutate(int_area = st_area(geometry)) %>%
  st_drop_geometry() %>%
  group_by(GEOID) %>%
  slice_max(int_area, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  dplyr::select(GEOID, neigh_name)

tracts_neigh <- tracts %>%
  left_join(tract_neigh_largest, by = "GEOID")

Tract Feature Table

Code

tract_features <- tracts_neigh %>%
  st_drop_geometry() %>%
  left_join(acs_data %>% dplyr::select(-total_pop),
            by = "GEOID") %>%
  left_join(ridership_clean, by = c("GEOID" = "geoid")) %>%
  left_join(dist_university, by = "GEOID") %>%
  left_join(dist_school, by = "GEOID") %>%
  left_join(dist_hospital, by = "GEOID") %>%
  left_join(dist_cityhall, by = "GEOID") %>%
  left_join(dist_park, by = "GEOID") %>%
  left_join(dist_indego, by = "GEOID") %>%
  left_join(crime_counts, by = "GEOID") %>%
  mutate(
    pop_density = total_pop / tract_area_sqmi,
    emp_density = employed / tract_area_sqmi,
    car_ownership_rate = 1 - pct_carless,
    fe_bucket = neigh_name
  )

tract_features %>%
  dplyr::select(
    GEOID,
    fe_bucket,
    ridership_2023_fall,
    median_income,
    pct_white,
    pct_black,
    pct_bachelors_plus,
    car_ownership_rate,
    poverty_rate,
    commute_score,
    pop_density,
    violent_incidents_2014_2023,
    dist_cityhall_mi
  ) %>%
  arrange(desc(ridership_2023_fall)) %>%
  slice_head(n = 10) %>%
  mutate(
    median_income = scales::dollar(median_income, accuracy = 1),
    pct_white = scales::percent(pct_white, accuracy = 0.1),
    pct_black = scales::percent(pct_black, accuracy = 0.1),
    pct_bachelors_plus = scales::percent(pct_bachelors_plus, accuracy = 0.1),
    car_ownership_rate = scales::percent(car_ownership_rate, accuracy = 0.1),
    poverty_rate = scales::percent(poverty_rate, accuracy = 0.1)
  ) %>%
  knitr::kable(
    col.names = c(
      "Tract GEOID", "Neighborhood", "Avg Daily Riders", "Median Income",
      "% White", "% Black", "% Bachelor+", "Car Ownership Rate", "Poverty Rate",
      "Commute Time (mins)", "Population Density", "Violent Incidents",
      "Dist to City Hall (mi)"),
    caption = "Top 10 Census Tracts with Highest Average Daily Bus Ridership",
    format.args = list(big.mark = ","),
    digits = 1
  )

Top 10 Census Tracts with Highest Average Daily Bus Ridership
Tract GEOID	Neighborhood	Avg Daily Riders	Median Income	% White	% Black	% Bachelor+	Car Ownership Rate	Poverty Rate	Commute Time (mins)	Population Density	Violent Incidents	Dist to City Hall (mi)
42101000500	CENTER_CITY	11,651.0	$68,977	33.7%	32.6%	35.9%	37.7%	12.0%	40.6	19,986.4	378	0.3
42101000403	LOGAN_SQUARE	7,065.3	$187,743	74.4%	2.7%	68.0%	57.4%	8.0%	55.6	19,166.8	123	0.1
42101000701	RITTENHOUSE	5,987.0	$90,343	77.0%	3.5%	63.9%	50.3%	11.6%	45.7	37,228.3	129	0.3
42101000101	OLD_CITY	5,189.0	$108,438	83.6%	1.6%	73.6%	59.4%	1.6%	42.0	15,000.4	88	0.9
42101024100	GERMANTOWN_PENN_KNOX	4,999.0	$18,646	19.7%	67.7%	27.2%	45.5%	30.7%	51.9	9,366.2	100	5.6
42101000600	WASHINGTON_SQUARE	4,749.7	$145,036	66.4%	8.6%	65.8%	58.5%	7.4%	42.5	28,495.8	212	0.3
42101008500	COBBS_CREEK	4,153.3	$45,747	13.2%	77.0%	26.8%	64.2%	23.6%	50.3	25,633.7	233	3.4
42101008400	COBBS_CREEK	3,997.0	$53,182	2.7%	78.0%	17.3%	66.3%	23.0%	45.7	27,450.2	167	3.8
42101020102	TIOGA	3,842.3	$33,947	0.6%	89.8%	8.9%	65.7%	52.9%	47.6	26,643.7	165	4.1
42101020200	TIOGA	3,570.7	$35,000	1.6%	93.6%	11.3%	69.4%	29.3%	49.6	14,110.1	243	3.8

Interpretation:

The table highlights the census tracts with the highest average daily bus ridership. The top-ranked tracts are mostly located in Center City and adjacent neighborhoods, areas that combine dense land use, strong job concentration, and good walkability. These tracts tend to have higher educational attainment, relatively low car ownership, and short distances to the Center City.

A few other top-ranked tracts, such as Germantown, Cobbs Creek and Tioga, have high ridership despite lower median incomes and higher poverty rates. These cases likely reflect stronger dependence on transit rather than proximity to employment centers. Overall, the table shows that both central-city accessibility and transit dependency contribute to elevated bus ridership in different parts of Philadelphia.

Variable Dictionary

Variable	Definition
`GEOID`	Census tract identifier from ACS 2023 geometry.
`NAME`	Census Bureau tract label (e.g., “Census Tract 1, Philadelphia County”).
`total_pop`	Total ACS population (B01003_001).
`tract_area_sqmi`	Tract land area calculated from geometry (square miles).
`median_income`	Median household income (B19013_001).
`poverty_total`	Count of residents below the poverty line (B17001_002).
`pop_16plus`	Population age 16+ (B23025_007).
`employed`, `unemployed`	Employment status counts (B23025_001, _005).
`car0`	Household count with zero vehicles (B08201_002).
`white`, `black`, `hispanic`, `asian`	Population counts by race/ethnicity (B03002 components).
`bachelors`, `grad`, `total_edu`	Educational attainment counts (B15003_022, _023, _001).
`housing_units`	Total housing units (B25001_001).
`lt5`–`m90plus`	Commuter counts by travel time (B08012 bins).
`wfh`	Workers working from home (B08126_017).
`pct_white`, `pct_black`, `pct_hispanic`, `pct_asian`	Race/ethnicity shares (`group / total_pop`).
`pct_carless`	Household share with zero vehicles (`car0 / housing_units`).
`pct_bachelors_plus`	Bachelor’s or graduate degree share (`(bachelors + grad) / total_edu`).
`unemployment_rate`	Unemployment rate (`unemployed / pop_16plus`).
`poverty_rate`	Poverty rate (`poverty_total / total_pop`).
`pct_wfh`	Share of workers working from home (`wfh / pop_16plus`).
`commute_score`	Weighted average commute minutes (shorter = better access), using start points of B08012 bins.
`ridership_2023_fall`	Mean bus boardings + alightings in Fall 2023 per tract (Philadelphia County subset).
`dist_university_mi`, `dist_school_mi`, `dist_hospital_mi`, `dist_cityhall_mi`, `dist_park_mi`, `dist_indego_mi`	Distance (mi) from tract centroid to nearest amenity of each type.
`violent_incidents_2014_2023`	Violent crime incidents (per `make_violence_bins()` classification) within tract boundaries, 2014–2023.
`pop_density`, `emp_density`	Population and employment density (per sq mi).
`car_ownership_rate`	Share of households owning ≥1 car (`1 - pct_carless`).
`fe_bucket`	Neighborhood name (OpenDataPhilly) used as fixed-effect bucket.

Ridership Relationships

Code

p1 <- ggplot(tract_features, aes(x = violent_incidents_2014_2023, 
                                 y = ridership_2023_fall)) +
  geom_point(size = 0.5, alpha = 0.5, color = "#d73027") +
  geom_smooth(size = 0.5, method = "lm", se = FALSE, color = "#a50026") +
  scale_y_continuous(labels = scales::comma) +
  scale_x_continuous(labels = scales::comma) +
  labs(
    title = "Ridership vs. Violent Crime Incidents",
    x = "# Violent Crime Incidents (2014–2023)",
    y = "Avg Daily Ridership"
  ) +
  theme_minimal()

p2 <- ggplot(tract_features, aes(x = pct_asian, 
                                 y = ridership_2023_fall)) +
  geom_point(size = 0.5, alpha = 0.5, color = "#542788") +
  geom_smooth(size = 0.5, method = "lm", se = FALSE, color = "#2d004b") +
  scale_y_continuous(labels = scales::comma) +
  scale_x_continuous(labels = scales::percent) +
  labs(
    title = "Ridership vs. % Asian Population",
    x = "% Asian Residents",
    y = "Avg Daily Ridership"
  ) +
  theme_minimal()

p3 <- ggplot(tract_features, aes(x = commute_score, 
                                 y = ridership_2023_fall)) +
  geom_point(size = 0.5, alpha = 0.5, color = "#4575b4") +
  geom_smooth(size = 0.5, method = "lm", se = FALSE, color = "#313695") +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Ridership vs. Commute Time",
    x = "Avg Daily Commute Time (minutes)",
    y = "Avg Daily Ridership"
  ) +
  theme_minimal()

p4 <- ggplot(tract_features, aes(x = pct_carless, 
                                 y = ridership_2023_fall)) +
  geom_point(size = 0.5, alpha = 0.5, color = "#1a9850") +
  geom_smooth(size = 0.5, method = "lm", se = FALSE, color = "#006837") +
  scale_y_continuous(labels = scales::comma) +
  scale_x_continuous(labels = scales::percent) +
  labs(
    title = "Ridership vs. No Car Ownership",
    x = "% Households Without a Vehicle",
    y = "Avg Daily Ridership"
  ) +
  theme_minimal()

ridership_relation <- (p1 | p2) / (p3 | p4)
ridership_relation

Interpretation:

Ridership vs Crime Incidents: We initially expected tracts with higher violent crime to have lower ridership, assuming safety concerns would discourage transit use. Instead, the data show the opposite pattern: tracts with more violent incidents tend to have higher bus ridership. This likely reflects broader socio-economic conditions rather than crime itself. The relationship suggests that structural transit dependence outweighs safety-related deterrence at the tract level.
Ridership vs Asian Population: The share of Asian residents shows a mild positive relationship with bus ridership. This pattern may reflect specific corridors—such as parts of South Philadelphia and the Northeast—where immigrant communities rely more on bus services.
Ridership vs Commute Time: Average commute time is weakly and slightly negatively related to bus ridership. Tracts with longer commutes may rely more on regional rail or driving rather than local buses, which could explain the small downward slope.
Ridership vs Car Ownership: The strongest relationship among the four plots is with car ownership. Tracts with higher shares of households without a vehicle show notably higher bus ridership, which aligns with expectations. This indicates that household transportation access is a key driver of bus usage.

Part 3: Modeling & Cross-Validation

Model Building

Spatial unit = census tract
Target = Fall 2023 average daily bus riders

We use 5-fold CV to compare three feature sets: - ACS-only socio-economics - adding spatial accessibility and crime - adding neighborhood fixed effects

We tested three models - OLS, Poisson, and NegBin models.

Code

acs_vars <- c(
  "commute_score",
  "pct_hispanic", 
  "pct_asian",
  "pct_carless", 
  "emp_density",
  "pop_density",
  "pct_wfh"
)

spatial_vars <- c(
  "dist_university_mi",
  "dist_hospital_mi",
  "violent_incidents_2014_2023"
)

model_data <- tract_features %>%
  dplyr::select(GEOID, fe_bucket, ridership_2023_fall, all_of(acs_vars), all_of(spatial_vars)) %>%
  drop_na(ridership_2023_fall) %>%
  distinct(GEOID, .keep_all = TRUE)

folds <- vfold_cv(model_data, v = 5)

build_rhs <- function(features) {
  if (length(features) > 0) paste(features, collapse = " + ") else "1"
}

calc_train_r2 <- function(features, fe = FALSE, type = "OLS") {
  rhs <- build_rhs(features)
  formula_str <- paste("ridership_2023_fall ~", rhs)
  preds <- switch(type,
    "OLS" = {
      if (!fe) {
        fit <- lm(as.formula(formula_str), data = model_data)
      } else {
        fit <- feols(as.formula(paste(formula_str, "| fe_bucket")), data = model_data)
      }
      predict(fit, newdata = model_data)
    },
    "Poisson" = {
      if (!fe) {
        fit <- glm(as.formula(formula_str), data = model_data, family = poisson())
        predict(fit, newdata = model_data, type = "response")
      } else {
        fit <- fepois(as.formula(paste(formula_str, "| fe_bucket")), data = model_data)
        predict(fit, newdata = model_data)
      }
    }
  )
  rsq_vec(model_data$ridership_2023_fall, preds)
}

cv_mae <- function(folds, features, fe = FALSE, type = "OLS") {
  map_dbl(folds$splits, function(split) {
    train <- analysis(split)
    test  <- assessment(split)
    rhs <- build_rhs(features)
    formula_str <- paste("ridership_2023_fall ~", rhs)
    preds <- switch(type,
      "OLS" = {
        if (!fe) {
          fit <- lm(as.formula(formula_str), data = train)
          predict(fit, newdata = test)
        } else {
          fit <- feols(as.formula(paste(formula_str, "| fe_bucket")), data = train)
          predict(fit, newdata = test)
        }
      },
      "Poisson" = {
        if (!fe) {
          fit <- glm(as.formula(formula_str), data = train, family = poisson())
          predict(fit, newdata = test, type = "response")
        } else {
          fit <- fepois(as.formula(paste(formula_str, "| fe_bucket")), data = train)
          predict(fit, newdata = test)
        }
      }
    )
    mae_vec(test$ridership_2023_fall, preds)
  }) %>%
    mean()
}

specs <- tribble(
  ~spec_name, ~features, ~fe,
  "ACS only", acs_vars, FALSE,
  "ACS + Spatial", c(acs_vars, spatial_vars), FALSE,
  "ACS + Spatial + FE", c(acs_vars, spatial_vars), TRUE
)

model_specs_full <- expand_grid(
  model_type = c("OLS", "Poisson"),
  specs
) %>%
  mutate(model_label = paste(model_type, spec_name, sep = " - "))

model_performance <- model_specs_full %>%
  mutate(
    mae = pmap_dbl(list(features, fe, model_type), ~ cv_mae(folds, ..1, ..2, ..3)),
    feature_count = map_int(features, length),
    train_r2 = pmap_dbl(list(features, fe, model_type), ~ calc_train_r2(..1, ..2, ..3))
  ) %>%
  dplyr::select(model = model_label, mae, train_r2, feature_count, model_type, spec_name)

knitr::kable(
  model_performance %>%
    mutate(
      mae = scales::comma(mae, accuracy = 1),
      train_r2 = scales::number(train_r2, accuracy = 0.01)
    ),
  col.names = c("Model Spec", "CV MAE (riders)", "Train R²", "Feature Count", "Model Type", "Feature Set"),
  format.args = list(big.mark = ","),
  digits = 1
)

Model Spec	CV MAE (riders)	Train R²	Feature Count	Model Type	Feature Set
OLS - ACS only	615	0.21	7	OLS	ACS only
OLS - ACS + Spatial	527	0.43	10	OLS	ACS + Spatial
OLS - ACS + Spatial + FE	532	0.70	10	OLS	ACS + Spatial + FE
Poisson - ACS only	608	0.26	7	Poisson	ACS only
Poisson - ACS + Spatial	477	0.58	10	Poisson	ACS + Spatial
Poisson - ACS + Spatial + FE	491	0.77	10	Poisson	ACS + Spatial + FE

Interpretation:

Across all specifications, the Poisson models consistently outperform the OLS models in terms of predictive accuracy. This aligns with expectations, since bus ridership is a count-like outcome with a skewed distribution, making Poisson a more appropriate functional form.
Adding spatial features substantially improves performance for both model families. Moving from “ACS only” to “ACS + Spatial” reduces MAE from 615 to 527 in the OLS models and from 608 to 477 in the Poisson models. This suggests that proximity to key destinations such as universities and hospitals captures meaningful variation in transit demand that demographic or socioeconomic variables alone cannot explain.
Including neighborhood fixed effects increases the training R² (up to 0.77 in the Poisson FE model), indicating stronger in-sample fit, but does not necessarily reduce cross-validated MAE. In both OLS and Poisson, the FE models show slightly higher error than their non-FE spatial counterparts. This pattern suggests that fixed effects absorb tract-level heterogeneity helpful for interpretation, but may not generalize as well for prediction when using cross-validation.

Spatial KDE Benchmark

As a non-parametric spatial baseline, we compute a KDE smoother that weights nearby tracts by distance. The bandwidth of 2,500 feet (≈0.47 miles) approximates a 10-minute walk shed.

We include this KDE model to establish a spatial benchmark: if our feature-based OLS and Poisson models cannot outperform a simple distance-weighted smoother, then their added complexity is not meaningful.

Code

tract_centroids <- tracts %>%
  st_centroid(of_largest_polygon = TRUE) %>%
  mutate(
    coord_x = st_coordinates(.)[, 1],
    coord_y = st_coordinates(.)[, 2]
  ) %>%
  st_drop_geometry() %>%
  dplyr::select(GEOID, coord_x, coord_y)

kde_model_data <- tract_features %>%
  left_join(tract_centroids, by = "GEOID") %>%
  dplyr::select(GEOID, ridership_2023_fall, coord_x, coord_y) %>%
  drop_na()

coord_matrix <- as.matrix(kde_model_data[, c("coord_x", "coord_y")])

kde_bandwidth <- 2500  # ≈0.47 miles under EPSG:3365 (10-minute walk scale)

gaussian_kernel <- function(distance, bandwidth) {
  exp(-0.5 * (distance / bandwidth)^2)
}

predict_kde <- function(train_coords, train_values, query_coords, bandwidth) {
  purrr::map_dbl(seq_len(nrow(query_coords)), function(i) {
    diffs <- sweep(train_coords, 2, query_coords[i, ], "-")
    dists <- sqrt(rowSums(diffs^2))
    weights <- gaussian_kernel(dists, bandwidth)
    if (all(weights == 0)) {
      return(mean(train_values))
    }
    sum(weights * train_values) / sum(weights)
  })
}

kde_folds <- vfold_cv(kde_model_data, v = 5)

kde_mae <- map_dbl(kde_folds$splits, function(split) {
  train <- analysis(split)
  test  <- assessment(split)
  preds <- predict_kde(
    train_coords = as.matrix(train[, c("coord_x", "coord_y")]),
    train_values = train$ridership_2023_fall,
    query_coords = as.matrix(test[, c("coord_x", "coord_y")]),
    bandwidth = kde_bandwidth
  )
  mae_vec(test$ridership_2023_fall, preds)
}) %>%
  mean()

kde_train_preds <- purrr::map_dbl(seq_len(nrow(kde_model_data)), function(i) {
  train_coords <- coord_matrix[-i, , drop = FALSE]
  train_values <- kde_model_data$ridership_2023_fall[-i]
  query_coord  <- matrix(coord_matrix[i, ], nrow = 1)
  predict_kde(train_coords, train_values, query_coord, kde_bandwidth)
})

kde_train_r2 <- rsq_vec(kde_model_data$ridership_2023_fall, kde_train_preds)

kde_metrics <- tibble(
  Model = "KDE - Spatial Only",
  MAE = kde_mae,
  "Train r2" = kde_train_r2,
  "Feature Count" = 0,
  "Model Type" = "KDE",
  "spec_name" = "Spatial neighbors"
)

knitr::kable(
  kde_metrics %>%
    mutate(
      MAE = scales::comma(MAE, accuracy = 1),
      `Train r2` = scales::number(`Train r2`, accuracy = 0.01)
    ),
  col.names = c("Model Spec", "CV MAE (riders)", "Train R²", "Feature Count", "Model Type", "Feature Set"),
  format.args = list(big.mark = ","),
  digits = 1
)

Model Spec	CV MAE (riders)	Train R²	Feature Count	Model Type	Feature Set
KDE - Spatial Only	582	0.19	0	KDE	Spatial neighbors

Code

kde_full_preds <- predict_kde(
  train_coords = coord_matrix,
  train_values = kde_model_data$ridership_2023_fall,
  query_coords = coord_matrix,
  bandwidth = kde_bandwidth
)

kde_map <- tracts %>%
  left_join(
    kde_model_data %>%
      mutate(kde_riders = kde_full_preds) %>%
      dplyr::select(GEOID, kde_riders),
    by = "GEOID"
  ) %>%
  filter(!is.na(kde_riders))

ggplot(kde_map) +
  geom_sf(aes(fill = kde_riders), color = "grey85", size = 0.1) +
  scale_fill_viridis_c(labels = scales::comma) +
  labs(
    title = "Spatial KDE Estimate of Average Daily Ridership",
    fill = "KDE riders"
  ) +
  theme_minimal() +
  theme(
    axis.text = element_blank(),
    axis.title = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
  )

Cross-Validated Performance

Code

model_performance_all <- bind_rows(
  model_performance,
  tibble(
    model = "KDE - Spatial neighbors",
    mae = kde_mae,
    train_r2 = kde_train_r2,
    feature_count = 0,
    model_type = "KDE",
    spec_name = "Spatial neighbors"
  )
)

model_performance_all <- model_performance_all %>%
  mutate(
    color_group = case_when(
      model %in% c("Poisson - ACS + Spatial",
                   "KDE - Spatial neighbors") ~ "highlight",
      TRUE ~ "normal"
    )
  )

ggplot(model_performance_all, aes(x = reorder(model, mae), y = mae, fill = color_group)) +
  geom_col() +
  geom_text(aes(label = scales::comma(mae, accuracy = 1)), vjust = -0.5) +
  expand_limits(y = max(model_performance_all$mae) * 1.1) +
  scale_fill_manual(
    values = c(
      "highlight" = "#1f4e79",
      "normal" = "#4c78a8"
    ),
    guide = "none"
  ) +
  labs(
    title = "Cross-validated MAE by Specification (5-fold)",
    x = NULL,
    y = "Mean Absolute Error (Average Daily Riders)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Interpretation: The cross-validated results show the best-performing specification is the Poisson model with ACS + spatial features (MAE ≈ 477). The KDE baseline sits in the middle at 582, which provides a useful benchmark. This indicates that demographic and built-environment features add real predictive value. Meanwhile, ACS-only models for both OLS and Poisson perform the worst, showing that spatial information is essential for explaining ridership patterns.

Best Specification Output

Below is the full coefficient output for the best spec (Poisson, ACS + spatial, no fixed effects) to interpret direction and significance.

Code

best_label <- "Poisson - ACS + Spatial"

models_fitted_1 <- model_specs_full %>%
  mutate(
    fit = pmap(
      list(features, fe, model_type),
      ~ {
        formula_str <- paste("ridership_2023_fall ~", build_rhs(..1))
        fe_flag <- ..2
        mtype   <- ..3
        
        if (mtype == "OLS") {
          if (!fe_flag) {
            lm(as.formula(formula_str), data = model_data)
          } else {
            feols(as.formula(paste(formula_str, "| fe_bucket")), data = model_data)
          }
        } else if (mtype == "Poisson") {
          if (!fe_flag) {
            glm(as.formula(formula_str), data = model_data, family = poisson())
          } else {
            fepois(as.formula(paste(formula_str, "| fe_bucket")), data = model_data)
          }
        }
      }
    )
  )

models_fitted_1 %>%
  filter(model_label == best_label) %>%
  pull(fit) %>%
  .[[1]] %>%
  summary()


Call:
glm(formula = as.formula(formula_str), family = poisson(), data = model_data)

Coefficients:
                                 Estimate    Std. Error z value
(Intercept)                  4.9044317612  0.0216038130  227.02
commute_score                0.0306005371  0.0004235091   72.25
pct_hispanic                -0.4309619521  0.0109908720  -39.21
pct_asian                    2.4851685409  0.0194129008  128.02
pct_carless                  1.4310436921  0.0162758559   87.92
emp_density                  0.0000384163  0.0000005788   66.37
pop_density                 -0.0000348522  0.0000003880  -89.82
pct_wfh                      0.3669917713  0.1510452521    2.43
dist_university_mi          -0.1251829378  0.0027487451  -45.54
dist_hospital_mi            -0.2949065327  0.0036478756  -80.84
violent_incidents_2014_2023  0.0063247778  0.0000248343  254.68
                                       Pr(>|z|)    
(Intercept)                 <0.0000000000000002 ***
commute_score               <0.0000000000000002 ***
pct_hispanic                <0.0000000000000002 ***
pct_asian                   <0.0000000000000002 ***
pct_carless                 <0.0000000000000002 ***
emp_density                 <0.0000000000000002 ***
pop_density                 <0.0000000000000002 ***
pct_wfh                                  0.0151 *  
dist_university_mi          <0.0000000000000002 ***
dist_hospital_mi            <0.0000000000000002 ***
violent_incidents_2014_2023 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 304490  on 382  degrees of freedom
Residual deviance: 140423  on 372  degrees of freedom
  (17 observations deleted due to missingness)
AIC: Inf

Number of Fisher Scoring iterations: 5

Interpretation: The coefficient table shows that most predictors, except for % worked from home, are statistically significant in the Poisson model. Because Poisson coefficients are on the log scale, their raw magnitudes are not directly comparable, so we rely on cross-validated MAE rather than coefficient size to evaluate model performance.

Part 4: Model Evaluation & Spatial Diagnostics

Spatial Autocorrelation Checks

Use the best model’s predictions/residuals to test remaining spatial autocorrelation (Moran’s I) and locate systematic over/under-estimation for policy interpretation.

Code

best_fit <- glm(
  as.formula(paste("ridership_2023_fall ~", build_rhs(c(acs_vars, spatial_vars)))),
  data   = model_data,
  family = poisson()
)

best_preds <- predict(best_fit, newdata = model_data, type = "response")

model_data_eval <- model_data %>%
  mutate(
    best_pred  = best_preds,
    best_resid = ridership_2023_fall - best_pred
  )

tracts_eval <- tracts %>%
  left_join(model_data_eval %>% dplyr::select(GEOID, ridership_2023_fall, best_resid), by = "GEOID") %>%
  filter(!is.na(best_resid))

nb <- poly2nb(tracts_eval, queen = TRUE)
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)
r_moran    <- moran.test(tracts_eval$ridership_2023_fall, lw, zero.policy = TRUE)
resid_moran <- moran.test(tracts_eval$best_resid, lw, zero.policy = TRUE)

diagnostics <- tibble(
  Metric = c("Ridership Moran's I", "Residual Moran's I"),
  Statistic = c(r_moran$estimate[["Moran I statistic"]], resid_moran$estimate[["Moran I statistic"]]),
  "P-value" = c(r_moran$p.value, resid_moran$p.value)
)

knitr::kable(diagnostics, digits = 6)

Metric	Statistic	P-value
Ridership Moran’s I	0.371491	0.000000
Residual Moran’s I	0.135753	0.000004

Code

ggplot(tracts_eval) +
  geom_sf(aes(fill = best_resid), color = "grey85", size = 0.1) +
  scale_fill_gradient2(
    low = "#2166ac", high = "#b2182b", mid = "white",
    midpoint = 0, labels = scales::comma
  ) +
  labs(
    title = paste("Residual Map -", best_label),
    fill = "Residual (Predicted - Actual)"
  ) +
  theme_minimal() +
  theme(
    legend.position = "right",
    axis.text = element_blank(),
    axis.title = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
  )

Interpretation: The strong positive Moran’s I for ridership (0.37, p < 0.001) confirms that bus usage in Philadelphia is highly clustered spatially, i.e. tracts near one another tend to have similar ridership levels. After applying our best-performing model (Poisson with ACS and spatial features), the residual Moran’s I drops to 0.14 but remains statistically significant. This suggests that the model captures part of the spatial structure in transit demand, but not all, which is consistent with the idea that neighborhood-level transit behavior is influenced by spatial processes that are not fully represented in our current feature set. It also explains why adding neighborhood fixed effects reduced spatial autocorrelation more aggressively but performed worse in cross-validation: fixed effects can absorb spatial clustering, but at the cost of overfitting.

Extreme Residual Tracts

To target systematic over/under-estimation, the table lists the 10 tracts with largest absolute residuals alongside key attributes.

Code

resid_max <- model_data_eval %>%
  arrange(desc(best_resid)) %>%
  slice_head(n = 10) %>%
  mutate(type = "Largest Over-prediction")

resid_min <- model_data_eval %>%
  arrange(best_resid) %>%
  slice_head(n = 10) %>%
  mutate(type = "Largest Under-prediction")

worst_tracts <- bind_rows(resid_max, resid_min) %>%
  left_join(
    tract_features %>% dplyr::select(GEOID, NAME),
    by = "GEOID"
  ) %>%
  mutate(NAME = str_remove(NAME, ";\\s*Philadelphia County; Pennsylvania"))

knitr::kable(
  worst_tracts %>%
    dplyr::select(type, NAME, best_resid, pct_hispanic, pct_asian,
                  pct_carless, pop_density, emp_density,
                  dist_university_mi, dist_hospital_mi,
                  violent_incidents_2014_2023) %>%
    mutate(
      pct_hispanic = scales::percent(pct_hispanic, accuracy = 0.1),
      pct_asian = scales::percent(pct_asian, accuracy = 0.1),
      pct_carless = scales::percent(pct_carless, accuracy = 0.1),
    ),
  col.names = c(
    "Residual Type", "Tract Name", "Residual (Pred - Actual)",
    "% Hispanic", "% Asian", "% Zero-car HHs",
    "Population Density", "Employment Density",
    "Dist to Univ (mi)", "Dist to Hospital (mi)",
    "Violent Incidents 2014-2023"),
  caption = "Census Tract with Extreme Model Residuals",
  format.args = list(big.mark = ","),
  digits = 1
)

Census Tract with Extreme Model Residuals
Residual Type	Tract Name	Residual (Pred - Actual)	% Hispanic	% Asian	% Zero-car HHs	Population Density	Employment Density	Dist to Univ (mi)	Dist to Hospital (mi)	Violent Incidents 2014-2023
Largest Over-prediction	Census Tract 4.03	4,173.8	7.1%	10.0%	42.6%	19,166.8	13,010.0	0.0	0.3	123
Largest Over-prediction	Census Tract 1.01	3,837.2	8.4%	3.7%	40.6%	15,000.4	12,783.4	0.1	0.5	88
Largest Over-prediction	Census Tract 241	3,466.8	2.6%	0.0%	54.5%	9,366.2	3,721.2	0.8	0.9	100
Largest Over-prediction	Census Tract 7.01	3,377.1	4.6%	14.0%	49.7%	37,228.3	29,355.9	0.0	0.4	129
Largest Over-prediction	Census Tract 84	2,677.3	1.2%	4.2%	33.7%	27,450.2	12,086.5	1.0	0.5	167
Largest Over-prediction	Census Tract 201.02	2,534.4	6.3%	1.3%	34.3%	26,643.7	8,742.4	0.3	0.3	165
Largest Over-prediction	Census Tract 38	2,300.7	7.1%	3.5%	10.6%	9,142.7	4,927.4	0.7	0.7	75
Largest Over-prediction	Census Tract 372	1,991.6	9.9%	25.6%	19.4%	9,895.5	4,724.3	0.7	0.8	80
Largest Over-prediction	Census Tract 4.04	1,819.8	2.3%	9.3%	51.9%	44,491.0	28,686.0	0.2	0.4	65
Largest Over-prediction	Census Tract 24	1,617.8	4.2%	7.0%	29.5%	22,571.3	13,738.2	0.4	0.7	73
Largest Under-prediction	Census Tract 152	-2,187.1	10.1%	7.9%	48.7%	32,090.6	11,268.5	0.5	1.0	298
Largest Under-prediction	Census Tract 9.01	-1,747.3	6.4%	12.6%	66.6%	52,178.7	36,036.9	0.1	0.2	128
Largest Under-prediction	Census Tract 140	-1,503.6	6.0%	12.2%	29.4%	29,685.1	18,501.7	0.2	0.1	203
Largest Under-prediction	Census Tract 280	-1,495.3	5.3%	0.0%	50.7%	14,920.3	5,223.0	0.4	0.9	131
Largest Under-prediction	Census Tract 108	-1,346.9	2.1%	3.6%	39.4%	21,868.0	7,870.0	0.4	0.6	182
Largest Under-prediction	Census Tract 153	-1,182.9	7.9%	6.9%	43.0%	31,095.8	12,737.1	0.0	0.8	211
Largest Under-prediction	Census Tract 107	-1,095.1	4.1%	0.0%	40.1%	18,890.9	6,807.5	0.2	0.8	208
Largest Under-prediction	Census Tract 147	-1,021.8	9.8%	10.2%	40.8%	25,329.6	14,333.0	0.1	0.5	254
Largest Under-prediction	Census Tract 122.03	-1,001.8	14.5%	19.3%	13.8%	12,374.3	7,454.4	0.2	0.6	48
Largest Under-prediction	Census Tract 91	-991.1	5.9%	21.2%	47.9%	18,030.3	8,866.1	0.0	0.2	77

The residual map shows where our best model systematically over- or under-predicts ridership, revealing places where bus use behaves differently from what the demographic and spatial variables would suggest.

Code

worst_sf <- tracts %>%
  semi_join(worst_tracts, by = "GEOID") %>%
  left_join(
    worst_tracts %>%
      dplyr::select(GEOID, NAME_clean = NAME, best_resid, type),
    by = "GEOID"
  ) %>%
  mutate(
    directional_label = if_else(st_coordinates(st_centroid(geometry))[,1] < median(st_coordinates(st_centroid(tracts))[,1]), "West", "East"),
    clean_name = str_remove(NAME_clean, "Census Tract\\s*"),
    label = clean_name,
    color_group = if_else(type == "Largest Over-prediction", "Low", "High")
  )

worst_points <- st_point_on_surface(worst_sf)
worst_points_df <- worst_points %>%
  st_coordinates() %>%
  as_tibble() %>%
  bind_cols(st_drop_geometry(worst_sf))

ggplot() +
  geom_sf(data = tracts, fill = "grey95", color = "grey85", size = 0.1) +
  geom_sf(data = worst_sf, aes(fill = color_group), color = "grey30", size = 0.15, alpha = 0.8) +
  scale_fill_manual(
    values = c("High" = "#4575b4", "Low" = "#d73027"),
    name = "Residual Type\n(Labels = Census Tract IDs)",
    labels = c("High" = "Under predict", "Low" = "Over predict")
  ) +
  scale_color_manual(
    values = c("High" = "black", "Low" = "black"),
    guide = "none"
  ) +
  geom_label_repel(
    data = worst_points_df,
    aes(
      x = X,
      y = Y,
      label = label,
      color = color_group
    ),
    size = 3,
    fill = NA,
    color = "black",
    label.size = 0,
    segment.color = "grey50",
    min.segment.length = 0,
    segment.curvature = 0,
    segment.ncp = 3,
    box.padding = 0.2,
    point.padding = 0.2,
    nudge_y = 0.08,
    nudge_x = 0.12 * if_else(worst_points_df$X < median(worst_points_df$X), -1, 1),
    max.overlaps = Inf,
    direction = "both",
    force = 15,
    max.time = 5,
    hjust = if_else(worst_points_df$X < median(worst_points_df$X), 1, 0)
  ) +
  labs(
    title = "Top 10 High & Low Residual Census Tracts",
    subtitle = paste0(best_label)
  ) +
  theme_minimal() +
  theme(
    axis.text = element_blank(),
    axis.title = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
  )

Interpretation: - The largest over-predictions appear in Center City and South Philadelphia. These areas are very dense and have many zero-car households, so the model expects extremely high bus usage, yet the actual numbers are much lower. This pattern likely reflects the availability of strong substitutes such as subway or trolley.

In contrast, the largest under-predictions occur in parts of North and West Philadelphia, where ridership is higher than the model anticipates. These tracts tend to have higher poverty rates, more zero-car households, and elevated crime levels. Residents tend to rely on buses more heavily because they have fewer mobility alternatives.

Together, these patterns show that the model performs well overall but still misses certain behavioral dynamics: bus use is lower than expected in multimodal, choice-rich areas and higher than expected in disadvantaged, transit-dependent neighborhoods.

Part 5: Conclusions & Recommendations

Limitations of the Current Model

The set of data sources is restricted. We rely only on publicly available datasets from SEPTA, ACS, and OpenDataPhilly. Important factors such as transit service supply, parking availability, or temporal travel patterns are not included. Because we do not have stop-level ridership data, we also cannot precisely isolate the influence of large transportation centers; instead, we excluded two entire census tracts that were dominated by these hubs.
The update time of the datasets is not fully aligned. We use Fall 2023 ridership as the main observations, but some predictors are only available as annual ACS 2023 estimates or more recent 2025 releases. These mismatches introduce unavoidable temporal inconsistencies across variables.
The fixed-effect structure is coarse. We assign tracts to neighborhood-level buckets, which may be too aggregated to capture meaningful within-neighborhood variation. This may explain why adding FE did not improve predictive accuracy and, in some cases, increased MAE.
The models are linear specifications (OLS and Poisson). While they offer interpretability, they do not fully reflect spatial dependence or network effects. Our spatial diagnostics show that ridership exhibits strong spatial autocorrelation, but the residuals still retain a moderate level of spatial clustering. This indicates that the demographic and spatial predictors in the model do not fully absorb geographic spillovers or network-based behaviors. In other words, nearby tracts still resemble each other in ways that the model cannot capture.

Next Steps

Expand the data foundation. The model could incorporate OSM street networks, land-use layers, and other environmental features to construct indicators for accessibility, walkability, and land-use pattern. More detailed transit supply data, ideally matched to the same time period as ridership, would further improve model accuracy and likely help absorb the spatial clustering that appears in the residuals.
Exploring alternative modeling approaches. Since the residuals still show signs of spatial autocorrelation, future work should consider spatial lag or spatial error models, geographically weighted regression, or even machine-learning methods that account for geographic structure. These approaches may better capture localized clustering, network effects, and non-linear relationships, leading to more robust predictions and a clearer understanding of how ridership varies across the city.