SEPTA Bus Ridership Prediction in Philadelphia, PA

Identifying Transit Deserts to Help Improve Equity-Focused Investments

Author

Itsnatani Anaqami, Jinheng Cen, Henry Yang

Published

December 7, 2025

Data Sources


Part 1: Setup

Initial Setup

Several preliminary steps such as loading required R packages and the Census API key are omitted here for clarity.

We created helper functions to streamline data preparation: reading and projecting shapefiles, computing distances between locations, and cleaning crime descriptions. Crime incidents are then categorized into violent and non-violent groups using a simple keyword-based rule.

Code
# read the single shapefile inside a folder
read_shp_from_dir <- function(dir_path) {
  shp_path <- list.files(dir_path, pattern = "\\.shp$", full.names = TRUE)
  if (length(shp_path) == 0) {
    stop(glue("No .shp file found inside {dir_path}"))
  }
  st_read(shp_path[1], quiet = TRUE) %>%
    st_transform(CRS_TARGET) %>%
    st_make_valid()
}

st_dist_mi <- function(from, to) {
  st_distance(from, to, by_element = TRUE) %>%
    set_units("mi") %>%
    drop_units()
}

# Crime classification
norm_tgc <- function(x) {
  x %>% str_squish() %>% str_to_lower()
}

violent_set <- c(
  "homicide - criminal",
  "homicide - justifiable",
  "rape",
  "robbery firearm",
  "robbery no firearm",
  "aggravated assault firearm",
  "aggravated assault no firearm",
  "other assaults",
  "offenses against family and children"
)

make_violence_bins <- function(df) {
  df %>%
    mutate(
      key = norm_tgc(text_general_code),
      gun_involved = str_detect(key, "firearm|weapon"),
      violence_bin = case_when(
        gun_involved ~ "Violent",
        key %in% violent_set ~ "Violent",
        TRUE ~ "Misdemeanor/Non-violent"
      ),
      violence_bin = factor(
        violence_bin,
        levels = c("Violent", "Misdemeanor/Non-violent")
      )
    )
}

Part 2: Data Preparation & Exploratory Data Analysis

Ridership Processing

This step loads and cleans the ridership data, filters to Fall 2023 bus records in Philadelphia, and computes average daily ridership for each census tract by summing boardings and alightings.

Code
RIDERSHIP_PATH <- file.path(DATA_DIR, "Bus_Ridership_by_Census_Tract.csv")

ridership_raw <- read_csv(RIDERSHIP_PATH, show_col_types = FALSE) %>%
  clean_names() %>%
  mutate(
    geoid = str_trim(census_tract_id),
    season_year = str_extract(season, "\\d{4}") %>% as.integer(),
    season_label = str_extract(season, "^[A-Za-z]+")
  )

ridership <- ridership_raw %>%
  filter(
    county_name == "Philadelphia County",
    str_to_lower(mode) == "bus",
    season_year == 2023
  ) %>%
  mutate(
    total_rides = coalesce(on, 0) + coalesce(off, 0)
  )

ridership_clean <- ridership %>%
  filter(season_label == "Fall") %>%
  group_by(geoid) %>%
  summarise(
    ridership_2023_fall = mean(total_rides, na.rm = TRUE),
    .groups = "drop"
  )

Histogram: Distribution of Bus Ridership

Code
ridership_avg_tract <- ridership %>%
  group_by(geoid) %>%
  summarise(
    avg_daily_rides = mean(total_rides, na.rm = TRUE),
    .groups = "drop"
  )

mean_val <- mean(ridership_avg_tract$avg_daily_rides, na.rm = TRUE)

ggplot(ridership_avg_tract, aes(x = avg_daily_rides)) +
  geom_histogram(bins = 150, fill = "#3182bd", color = "white") +
  # Mean line
  geom_vline(xintercept = mean_val, color = "red", size = 0.5) +
  geom_text(
    aes(x = mean_val+ 1200, y = 30, label = paste0("Mean = ", round(mean_val, 1))),
    color = "black",
    vjust = -0.5,
    hjust = 0.5,
    size = 4
  ) +
  labs(
    title = "Distribution of Average Daily Bus Ridership by Tract (Fall 2023)",
    x = "Average Boardings + Alightings by Tract per Day",
    y = "Number of Tracts"
  ) +
  theme_minimal()

Interpretation: The histogram shows that average daily bus ridership is very uneven across Philadelphia census tracts. Most tracts fall below 2,000 riders per day, indicating relatively modest transit activity in many neighborhoods. Meanwhile, a small number of tracts account for extremely high ridership, creating a heavily skewed distribution. The mean value of about 994 riders. The outliers with higher riderships create a long right tail and likely correspond to major corridors or transit centers.

Choropleth: Spatial Distribution of Bus Ridership

Code
tracts_map <- tigris::tracts(
  state = STATE_FIPS,
  county = COUNTY_FIPS,
  year = ACS_YEAR,
  cb = TRUE
) %>%
  st_transform(CRS_TARGET) %>%
  dplyr::select(GEOID)

tracts_ridership <- tracts_map %>%
  left_join(ridership_clean, by = c("GEOID" = "geoid"))

ridership_breaks <- c(
  0,
  scales::pretty_breaks(n = 4)(tracts_ridership$ridership_2023_fall)
) %>%
  unique() %>%
  sort()

ggplot(tracts_ridership) +
  geom_sf(aes(fill = ridership_2023_fall), color = NA) +
  scale_fill_viridis_c(
    option = "plasma",
    trans = "sqrt",
    breaks = ridership_breaks,
    labels = scales::comma
  ) +
  labs(
    title = "Spatial Distribution of Average Daily Bus Ridership (Fall 2023)",
    fill = "Avg Daily Riders"
  ) +
  theme_minimal() +
  theme(
    legend.position = "right",
    axis.text = element_blank(),
    axis.title = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
  )

Interpretation: The spatial map reinforces the pattern seen in the histogram. Most census tracts have relatively low bus activity, and only a small group of tracts stand out with much higher volumes, which aligns with the long right tail observed earlier. Overall the distribution suggests that frequent bus use is concentrated in a limited part of the city, while many neighborhoods experience modest ridership levels.


Data Cleaning

Through our EDA, we identified both tracts as extreme high-value outliers in the ridership distribution. Mapping and further investigation revealed that these tracts contain major multimodal transfer hubs, Olney Transportation Center and Frankford Transportation Center, where high bus riderships are driven primarily by transfers from the subway, rather than by local demographic, socioeconomic, or spatial characteristics.

These two tracts do not align with the underlying assumptions of the model and would distort inference. Since the ridership dataset is aggregated at the tract level and cannot distinguish transfer activity from local activities, we exclude the two tracts as outliers from the following analysis.

Code
ridership_clean <- ridership_clean %>%
  filter(!geoid %in% c("42101030200","42101028200"))

Loading ACS Data

Code
tracts <- get_acs(
  geography = GEOG,
  variables = "B01003_001",
  state = STATE_FIPS,
  county = COUNTY_FIPS,
  year = ACS_YEAR,
  geometry = TRUE,
  cache_table = TRUE
) %>%
  st_transform(CRS_TARGET) %>%
  st_make_valid() %>%
  dplyr::select(GEOID, NAME, total_pop = estimate) %>%
  mutate(
    tract_area_sqmi = st_area(geometry) %>% set_units(mi^2) %>% drop_units()
  )

acs_vars <- c(
  total_pop    = "B01003_001",
  median_income = "B19013_001",
  poverty_total = "B17001_002",
  pop_16plus    = "B23025_007",
  employed      = "B23025_004",
  unemployed    = "B23025_005",
  car0          = "B08201_002",
  white         = "B03002_003",
  black         = "B03002_004",
  hispanic      = "B03002_012",
  asian         = "B03002_006",
  bachelors     = "B15003_022",
  grad          = "B15003_023",
  total_edu     = "B15003_001",
  housing_units = "B25001_001",
  lt5 = "B08012_003",
  m5_9 = "B08012_004",
  m10_14 = "B08012_005",
  m15_19 = "B08012_006",
  m20_24 = "B08012_007",
  m25_29 = "B08012_008",
  m30_34 = "B08012_009",
  m35_39 = "B08012_010",
  m40_44 = "B08012_011",
  m45_59 = "B08012_012",
  m60_89 = "B08012_013",
  m90plus = "B08012_014",
  wfh  = "B08126_017"
)

acs_data <- get_acs(
  geography = GEOG,
  variables = acs_vars,
  state = STATE_FIPS,
  county = COUNTY_FIPS,
  year = ACS_YEAR,
  geometry = FALSE,
  cache_table = TRUE
) %>%
  dplyr::select(GEOID, variable, estimate) %>%
  pivot_wider(names_from = variable, values_from = estimate) %>%
  rename_with(
    ~ names(acs_vars)[match(.x, acs_vars)],
    any_of(acs_vars)
  ) %>%
  mutate(
    total_commuters = lt5 + m5_9 + m10_14 + m15_19 + m20_24 +
                      m25_29 + m30_34 + m35_39 + m40_44 +
                      m45_59 + m60_89 + m90plus,
    commute_score = dplyr::if_else(
      total_commuters > 0,
      (lt5    * 0  +
       m5_9   * 5  +
       m10_14 * 10 +
       m15_19 * 15 +
       m20_24 * 20 +
       m25_29 * 25 +
       m30_34 * 30 +
       m35_39 * 35 +
       m40_44 * 40 +
       m45_59 * 45 +
       m60_89 * 60 +
       m90plus* 90 ) / total_commuters,
      NA_real_
    ),
    pct_white = white / total_pop,
    pct_black = black / total_pop,
    pct_hispanic = hispanic / total_pop,
    pct_asian = asian / total_pop,
    pct_carless = car0 / housing_units,
    pct_bachelors_plus = (bachelors + grad) / total_edu,
    unemployment_rate = unemployed / pop_16plus,
    poverty_rate = poverty_total / total_pop,
    pct_wfh = wfh / pop_16plus
  )

Loading Spatial Layers

Code
universities <- read_shp_from_dir(file.path(DATA_DIR, "Universities_Colleges-shp"))

schools <- read_shp_from_dir(file.path(DATA_DIR, "Schools"))
if (any(st_geometry_type(schools) %in% c("POLYGON", "MULTIPOLYGON"))) {
  st_geometry(schools) <- st_point_on_surface(schools)
}

hospitals <- st_read(file.path(DATA_DIR, "Hospitals.geojson"), quiet = TRUE) %>%
  st_transform(CRS_TARGET) %>%
  st_make_valid()

parks <- read_shp_from_dir(file.path(DATA_DIR, "PPR_Properties"))
if (any(st_geometry_type(parks) %in% c("POINT", "MULTIPOINT"))) {
  parks <- parks %>%
    filter(st_geometry_type(geometry) %in% c("POLYGON", "MULTIPOLYGON"))
}

crime_raw <- read_shp_from_dir(file.path(DATA_DIR, "crime_incidents")) %>%
  clean_names()

crime_code_col <- names(crime_raw)[
  str_detect(names(crime_raw), "text.*general|general.*text|text_general|text_gener")
][1]

if (is.na(crime_code_col)) {
  stop("Could not locate the offense description column in crime shapefile.")
}

crime_bins <- crime_raw %>%
  rename(text_general_code = .data[[crime_code_col]]) %>%
  make_violence_bins() %>%
  filter(!st_is_empty(geometry))

crime_violent <- crime_bins %>%
  filter(violence_bin == "Violent")

cityhall <- tibble(long = -75.163619, lat = 39.952381) %>%
  st_as_sf(coords = c("long", "lat"), crs = 4326) %>%
  st_transform(CRS_TARGET)

indego_json <- fromJSON(file.path(DATA_DIR, "shared_bike_station.json"))
indego <- indiv <- indego_json$features
indego_sf <- tibble(
  longitude = map_dbl(indiv$geometry$coordinates, 1),
  latitude  = map_dbl(indiv$geometry$coordinates, 2)
) %>%
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
  st_transform(CRS_TARGET)

NEIGH_URL   <- "https://raw.githubusercontent.com/opendataphilly/open-geo-data/refs/heads/master/philadelphia-neighborhoods/philadelphia-neighborhoods.geojson"
neigh  <- read_sf(NEIGH_URL) %>%
  st_transform(st_crs(tracts))

Metrics Calculation

  • Calculating nearest distances from POIs (Points of interests) to census tracts
Code
tract_centroids <- st_centroid(tracts)

nearest_distance <- function(from_pts, to_feats, suffix) {
  idx <- st_nearest_feature(from_pts, to_feats)
  tibble(
    GEOID = from_pts$GEOID,
    !!glue("dist_{suffix}_mi") := st_dist_mi(from_pts, to_feats[idx, ])
  )
}

dist_university <- nearest_distance(tract_centroids, universities, "university")
dist_school     <- nearest_distance(tract_centroids, schools, "school")
dist_hospital   <- nearest_distance(tract_centroids, hospitals, "hospital")
dist_cityhall   <- nearest_distance(tract_centroids, cityhall, "cityhall")
dist_park       <- nearest_distance(tract_centroids, st_point_on_surface(parks), "park")
dist_indego     <- nearest_distance(tract_centroids, indego_sf, "indego")
  • Aggregating violent crime data using st_contains to census tracts
Code
crime_counts <- st_join(
  tracts %>% dplyr::select(GEOID),
  crime_violent,
  join = st_contains
) %>%
  st_drop_geometry() %>%
  count(GEOID, name = "violent_incidents_2014_2023")
  • Aggregating neighborhood data using st_intersection to census tracts (If mulitiple neighborhoods fall into one census tract, the one with biggest intersection area will be selected.)
Code
tracts_simple <- tracts %>% 
  dplyr::select(GEOID)

neigh_simple <- neigh %>% 
  dplyr::select(neigh_name = NAME)

tract_neigh_int <- st_intersection(tracts_simple, neigh_simple)

tract_neigh_largest <- tract_neigh_int %>%
  mutate(int_area = st_area(geometry)) %>%
  st_drop_geometry() %>%
  group_by(GEOID) %>%
  slice_max(int_area, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  dplyr::select(GEOID, neigh_name)

tracts_neigh <- tracts %>%
  left_join(tract_neigh_largest, by = "GEOID")

Tract Feature Table

Code
tract_features <- tracts_neigh %>%
  st_drop_geometry() %>%
  left_join(acs_data %>% dplyr::select(-total_pop),
            by = "GEOID") %>%
  left_join(ridership_clean, by = c("GEOID" = "geoid")) %>%
  left_join(dist_university, by = "GEOID") %>%
  left_join(dist_school, by = "GEOID") %>%
  left_join(dist_hospital, by = "GEOID") %>%
  left_join(dist_cityhall, by = "GEOID") %>%
  left_join(dist_park, by = "GEOID") %>%
  left_join(dist_indego, by = "GEOID") %>%
  left_join(crime_counts, by = "GEOID") %>%
  mutate(
    pop_density = total_pop / tract_area_sqmi,
    emp_density = employed / tract_area_sqmi,
    car_ownership_rate = 1 - pct_carless,
    fe_bucket = neigh_name
  )

tract_features %>%
  dplyr::select(
    GEOID,
    fe_bucket,
    ridership_2023_fall,
    median_income,
    pct_white,
    pct_black,
    pct_bachelors_plus,
    car_ownership_rate,
    poverty_rate,
    commute_score,
    pop_density,
    violent_incidents_2014_2023,
    dist_cityhall_mi
  ) %>%
  arrange(desc(ridership_2023_fall)) %>%
  slice_head(n = 10) %>%
  mutate(
    median_income = scales::dollar(median_income, accuracy = 1),
    pct_white = scales::percent(pct_white, accuracy = 0.1),
    pct_black = scales::percent(pct_black, accuracy = 0.1),
    pct_bachelors_plus = scales::percent(pct_bachelors_plus, accuracy = 0.1),
    car_ownership_rate = scales::percent(car_ownership_rate, accuracy = 0.1),
    poverty_rate = scales::percent(poverty_rate, accuracy = 0.1)
  ) %>%
  knitr::kable(
    col.names = c(
      "Tract GEOID", "Neighborhood", "Avg Daily Riders", "Median Income",
      "% White", "% Black", "% Bachelor+", "Car Ownership Rate", "Poverty Rate",
      "Commute Time (mins)", "Population Density", "Violent Incidents",
      "Dist to City Hall (mi)"),
    caption = "Top 10 Census Tracts with Highest Average Daily Bus Ridership",
    format.args = list(big.mark = ","),
    digits = 1
  )
Top 10 Census Tracts with Highest Average Daily Bus Ridership
Tract GEOID Neighborhood Avg Daily Riders Median Income % White % Black % Bachelor+ Car Ownership Rate Poverty Rate Commute Time (mins) Population Density Violent Incidents Dist to City Hall (mi)
42101000500 CENTER_CITY 11,651.0 $68,977 33.7% 32.6% 35.9% 37.7% 12.0% 40.6 19,986.4 378 0.3
42101000403 LOGAN_SQUARE 7,065.3 $187,743 74.4% 2.7% 68.0% 57.4% 8.0% 55.6 19,166.8 123 0.1
42101000701 RITTENHOUSE 5,987.0 $90,343 77.0% 3.5% 63.9% 50.3% 11.6% 45.7 37,228.3 129 0.3
42101000101 OLD_CITY 5,189.0 $108,438 83.6% 1.6% 73.6% 59.4% 1.6% 42.0 15,000.4 88 0.9
42101024100 GERMANTOWN_PENN_KNOX 4,999.0 $18,646 19.7% 67.7% 27.2% 45.5% 30.7% 51.9 9,366.2 100 5.6
42101000600 WASHINGTON_SQUARE 4,749.7 $145,036 66.4% 8.6% 65.8% 58.5% 7.4% 42.5 28,495.8 212 0.3
42101008500 COBBS_CREEK 4,153.3 $45,747 13.2% 77.0% 26.8% 64.2% 23.6% 50.3 25,633.7 233 3.4
42101008400 COBBS_CREEK 3,997.0 $53,182 2.7% 78.0% 17.3% 66.3% 23.0% 45.7 27,450.2 167 3.8
42101020102 TIOGA 3,842.3 $33,947 0.6% 89.8% 8.9% 65.7% 52.9% 47.6 26,643.7 165 4.1
42101020200 TIOGA 3,570.7 $35,000 1.6% 93.6% 11.3% 69.4% 29.3% 49.6 14,110.1 243 3.8

Interpretation:

The table highlights the census tracts with the highest average daily bus ridership. The top-ranked tracts are mostly located in Center City and adjacent neighborhoods, areas that combine dense land use, strong job concentration, and good walkability. These tracts tend to have higher educational attainment, relatively low car ownership, and short distances to the Center City.

A few other top-ranked tracts, such as Germantown, Cobbs Creek and Tioga, have high ridership despite lower median incomes and higher poverty rates. These cases likely reflect stronger dependence on transit rather than proximity to employment centers. Overall, the table shows that both central-city accessibility and transit dependency contribute to elevated bus ridership in different parts of Philadelphia.


Variable Dictionary

Variable Definition
GEOID Census tract identifier from ACS 2023 geometry.
NAME Census Bureau tract label (e.g., “Census Tract 1, Philadelphia County”).
total_pop Total ACS population (B01003_001).
tract_area_sqmi Tract land area calculated from geometry (square miles).
median_income Median household income (B19013_001).
poverty_total Count of residents below the poverty line (B17001_002).
pop_16plus Population age 16+ (B23025_007).
employed, unemployed Employment status counts (B23025_001, _005).
car0 Household count with zero vehicles (B08201_002).
white, black, hispanic, asian Population counts by race/ethnicity (B03002 components).
bachelors, grad, total_edu Educational attainment counts (B15003_022, _023, _001).
housing_units Total housing units (B25001_001).
lt5m90plus Commuter counts by travel time (B08012 bins).
wfh Workers working from home (B08126_017).
pct_white, pct_black, pct_hispanic, pct_asian Race/ethnicity shares (group / total_pop).
pct_carless Household share with zero vehicles (car0 / housing_units).
pct_bachelors_plus Bachelor’s or graduate degree share ((bachelors + grad) / total_edu).
unemployment_rate Unemployment rate (unemployed / pop_16plus).
poverty_rate Poverty rate (poverty_total / total_pop).
pct_wfh Share of workers working from home (wfh / pop_16plus).
commute_score Weighted average commute minutes (shorter = better access), using start points of B08012 bins.
ridership_2023_fall Mean bus boardings + alightings in Fall 2023 per tract (Philadelphia County subset).
dist_university_mi, dist_school_mi, dist_hospital_mi, dist_cityhall_mi, dist_park_mi, dist_indego_mi Distance (mi) from tract centroid to nearest amenity of each type.
violent_incidents_2014_2023 Violent crime incidents (per make_violence_bins() classification) within tract boundaries, 2014–2023.
pop_density, emp_density Population and employment density (per sq mi).
car_ownership_rate Share of households owning ≥1 car (1 - pct_carless).
fe_bucket Neighborhood name (OpenDataPhilly) used as fixed-effect bucket.

Ridership Relationships

Code
p1 <- ggplot(tract_features, aes(x = violent_incidents_2014_2023, 
                                 y = ridership_2023_fall)) +
  geom_point(size = 0.5, alpha = 0.5, color = "#d73027") +
  geom_smooth(size = 0.5, method = "lm", se = FALSE, color = "#a50026") +
  scale_y_continuous(labels = scales::comma) +
  scale_x_continuous(labels = scales::comma) +
  labs(
    title = "Ridership vs. Violent Crime Incidents",
    x = "# Violent Crime Incidents (2014–2023)",
    y = "Avg Daily Ridership"
  ) +
  theme_minimal()

p2 <- ggplot(tract_features, aes(x = pct_asian, 
                                 y = ridership_2023_fall)) +
  geom_point(size = 0.5, alpha = 0.5, color = "#542788") +
  geom_smooth(size = 0.5, method = "lm", se = FALSE, color = "#2d004b") +
  scale_y_continuous(labels = scales::comma) +
  scale_x_continuous(labels = scales::percent) +
  labs(
    title = "Ridership vs. % Asian Population",
    x = "% Asian Residents",
    y = "Avg Daily Ridership"
  ) +
  theme_minimal()

p3 <- ggplot(tract_features, aes(x = commute_score, 
                                 y = ridership_2023_fall)) +
  geom_point(size = 0.5, alpha = 0.5, color = "#4575b4") +
  geom_smooth(size = 0.5, method = "lm", se = FALSE, color = "#313695") +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Ridership vs. Commute Time",
    x = "Avg Daily Commute Time (minutes)",
    y = "Avg Daily Ridership"
  ) +
  theme_minimal()

p4 <- ggplot(tract_features, aes(x = pct_carless, 
                                 y = ridership_2023_fall)) +
  geom_point(size = 0.5, alpha = 0.5, color = "#1a9850") +
  geom_smooth(size = 0.5, method = "lm", se = FALSE, color = "#006837") +
  scale_y_continuous(labels = scales::comma) +
  scale_x_continuous(labels = scales::percent) +
  labs(
    title = "Ridership vs. No Car Ownership",
    x = "% Households Without a Vehicle",
    y = "Avg Daily Ridership"
  ) +
  theme_minimal()

ridership_relation <- (p1 | p2) / (p3 | p4)
ridership_relation

Interpretation:

  • Ridership vs Crime Incidents: We initially expected tracts with higher violent crime to have lower ridership, assuming safety concerns would discourage transit use. Instead, the data show the opposite pattern: tracts with more violent incidents tend to have higher bus ridership. This likely reflects broader socio-economic conditions rather than crime itself. The relationship suggests that structural transit dependence outweighs safety-related deterrence at the tract level.

  • Ridership vs Asian Population: The share of Asian residents shows a mild positive relationship with bus ridership. This pattern may reflect specific corridors—such as parts of South Philadelphia and the Northeast—where immigrant communities rely more on bus services.

  • Ridership vs Commute Time: Average commute time is weakly and slightly negatively related to bus ridership. Tracts with longer commutes may rely more on regional rail or driving rather than local buses, which could explain the small downward slope.

  • Ridership vs Car Ownership: The strongest relationship among the four plots is with car ownership. Tracts with higher shares of households without a vehicle show notably higher bus ridership, which aligns with expectations. This indicates that household transportation access is a key driver of bus usage.


Part 3: Modeling & Cross-Validation

Model Building

  • Spatial unit = census tract
  • Target = Fall 2023 average daily bus riders

We use 5-fold CV to compare three feature sets: - ACS-only socio-economics - adding spatial accessibility and crime - adding neighborhood fixed effects

We tested three models - OLS, Poisson, and NegBin models.

Code
acs_vars <- c(
  "commute_score",
  "pct_hispanic", 
  "pct_asian",
  "pct_carless", 
  "emp_density",
  "pop_density",
  "pct_wfh"
)

spatial_vars <- c(
  "dist_university_mi",
  "dist_hospital_mi",
  "violent_incidents_2014_2023"
)

model_data <- tract_features %>%
  dplyr::select(GEOID, fe_bucket, ridership_2023_fall, all_of(acs_vars), all_of(spatial_vars)) %>%
  drop_na(ridership_2023_fall) %>%
  distinct(GEOID, .keep_all = TRUE)

folds <- vfold_cv(model_data, v = 5)

build_rhs <- function(features) {
  if (length(features) > 0) paste(features, collapse = " + ") else "1"
}

calc_train_r2 <- function(features, fe = FALSE, type = "OLS") {
  rhs <- build_rhs(features)
  formula_str <- paste("ridership_2023_fall ~", rhs)
  preds <- switch(type,
    "OLS" = {
      if (!fe) {
        fit <- lm(as.formula(formula_str), data = model_data)
      } else {
        fit <- feols(as.formula(paste(formula_str, "| fe_bucket")), data = model_data)
      }
      predict(fit, newdata = model_data)
    },
    "Poisson" = {
      if (!fe) {
        fit <- glm(as.formula(formula_str), data = model_data, family = poisson())
        predict(fit, newdata = model_data, type = "response")
      } else {
        fit <- fepois(as.formula(paste(formula_str, "| fe_bucket")), data = model_data)
        predict(fit, newdata = model_data)
      }
    }
  )
  rsq_vec(model_data$ridership_2023_fall, preds)
}

cv_mae <- function(folds, features, fe = FALSE, type = "OLS") {
  map_dbl(folds$splits, function(split) {
    train <- analysis(split)
    test  <- assessment(split)
    rhs <- build_rhs(features)
    formula_str <- paste("ridership_2023_fall ~", rhs)
    preds <- switch(type,
      "OLS" = {
        if (!fe) {
          fit <- lm(as.formula(formula_str), data = train)
          predict(fit, newdata = test)
        } else {
          fit <- feols(as.formula(paste(formula_str, "| fe_bucket")), data = train)
          predict(fit, newdata = test)
        }
      },
      "Poisson" = {
        if (!fe) {
          fit <- glm(as.formula(formula_str), data = train, family = poisson())
          predict(fit, newdata = test, type = "response")
        } else {
          fit <- fepois(as.formula(paste(formula_str, "| fe_bucket")), data = train)
          predict(fit, newdata = test)
        }
      }
    )
    mae_vec(test$ridership_2023_fall, preds)
  }) %>%
    mean()
}

specs <- tribble(
  ~spec_name, ~features, ~fe,
  "ACS only", acs_vars, FALSE,
  "ACS + Spatial", c(acs_vars, spatial_vars), FALSE,
  "ACS + Spatial + FE", c(acs_vars, spatial_vars), TRUE
)

model_specs_full <- expand_grid(
  model_type = c("OLS", "Poisson"),
  specs
) %>%
  mutate(model_label = paste(model_type, spec_name, sep = " - "))

model_performance <- model_specs_full %>%
  mutate(
    mae = pmap_dbl(list(features, fe, model_type), ~ cv_mae(folds, ..1, ..2, ..3)),
    feature_count = map_int(features, length),
    train_r2 = pmap_dbl(list(features, fe, model_type), ~ calc_train_r2(..1, ..2, ..3))
  ) %>%
  dplyr::select(model = model_label, mae, train_r2, feature_count, model_type, spec_name)

knitr::kable(
  model_performance %>%
    mutate(
      mae = scales::comma(mae, accuracy = 1),
      train_r2 = scales::number(train_r2, accuracy = 0.01)
    ),
  col.names = c("Model Spec", "CV MAE (riders)", "Train R²", "Feature Count", "Model Type", "Feature Set"),
  format.args = list(big.mark = ","),
  digits = 1
)
Model Spec CV MAE (riders) Train R² Feature Count Model Type Feature Set
OLS - ACS only 615 0.21 7 OLS ACS only
OLS - ACS + Spatial 527 0.43 10 OLS ACS + Spatial
OLS - ACS + Spatial + FE 532 0.70 10 OLS ACS + Spatial + FE
Poisson - ACS only 608 0.26 7 Poisson ACS only
Poisson - ACS + Spatial 477 0.58 10 Poisson ACS + Spatial
Poisson - ACS + Spatial + FE 491 0.77 10 Poisson ACS + Spatial + FE

Interpretation:

  • Across all specifications, the Poisson models consistently outperform the OLS models in terms of predictive accuracy. This aligns with expectations, since bus ridership is a count-like outcome with a skewed distribution, making Poisson a more appropriate functional form.

  • Adding spatial features substantially improves performance for both model families. Moving from “ACS only” to “ACS + Spatial” reduces MAE from 615 to 527 in the OLS models and from 608 to 477 in the Poisson models. This suggests that proximity to key destinations such as universities and hospitals captures meaningful variation in transit demand that demographic or socioeconomic variables alone cannot explain.

  • Including neighborhood fixed effects increases the training R² (up to 0.77 in the Poisson FE model), indicating stronger in-sample fit, but does not necessarily reduce cross-validated MAE. In both OLS and Poisson, the FE models show slightly higher error than their non-FE spatial counterparts. This pattern suggests that fixed effects absorb tract-level heterogeneity helpful for interpretation, but may not generalize as well for prediction when using cross-validation.


Spatial KDE Benchmark

As a non-parametric spatial baseline, we compute a KDE smoother that weights nearby tracts by distance. The bandwidth of 2,500 feet (≈0.47 miles) approximates a 10-minute walk shed.

We include this KDE model to establish a spatial benchmark: if our feature-based OLS and Poisson models cannot outperform a simple distance-weighted smoother, then their added complexity is not meaningful.

Code
tract_centroids <- tracts %>%
  st_centroid(of_largest_polygon = TRUE) %>%
  mutate(
    coord_x = st_coordinates(.)[, 1],
    coord_y = st_coordinates(.)[, 2]
  ) %>%
  st_drop_geometry() %>%
  dplyr::select(GEOID, coord_x, coord_y)

kde_model_data <- tract_features %>%
  left_join(tract_centroids, by = "GEOID") %>%
  dplyr::select(GEOID, ridership_2023_fall, coord_x, coord_y) %>%
  drop_na()

coord_matrix <- as.matrix(kde_model_data[, c("coord_x", "coord_y")])

kde_bandwidth <- 2500  # ≈0.47 miles under EPSG:3365 (10-minute walk scale)

gaussian_kernel <- function(distance, bandwidth) {
  exp(-0.5 * (distance / bandwidth)^2)
}

predict_kde <- function(train_coords, train_values, query_coords, bandwidth) {
  purrr::map_dbl(seq_len(nrow(query_coords)), function(i) {
    diffs <- sweep(train_coords, 2, query_coords[i, ], "-")
    dists <- sqrt(rowSums(diffs^2))
    weights <- gaussian_kernel(dists, bandwidth)
    if (all(weights == 0)) {
      return(mean(train_values))
    }
    sum(weights * train_values) / sum(weights)
  })
}

kde_folds <- vfold_cv(kde_model_data, v = 5)

kde_mae <- map_dbl(kde_folds$splits, function(split) {
  train <- analysis(split)
  test  <- assessment(split)
  preds <- predict_kde(
    train_coords = as.matrix(train[, c("coord_x", "coord_y")]),
    train_values = train$ridership_2023_fall,
    query_coords = as.matrix(test[, c("coord_x", "coord_y")]),
    bandwidth = kde_bandwidth
  )
  mae_vec(test$ridership_2023_fall, preds)
}) %>%
  mean()

kde_train_preds <- purrr::map_dbl(seq_len(nrow(kde_model_data)), function(i) {
  train_coords <- coord_matrix[-i, , drop = FALSE]
  train_values <- kde_model_data$ridership_2023_fall[-i]
  query_coord  <- matrix(coord_matrix[i, ], nrow = 1)
  predict_kde(train_coords, train_values, query_coord, kde_bandwidth)
})

kde_train_r2 <- rsq_vec(kde_model_data$ridership_2023_fall, kde_train_preds)

kde_metrics <- tibble(
  Model = "KDE - Spatial Only",
  MAE = kde_mae,
  "Train r2" = kde_train_r2,
  "Feature Count" = 0,
  "Model Type" = "KDE",
  "spec_name" = "Spatial neighbors"
)

knitr::kable(
  kde_metrics %>%
    mutate(
      MAE = scales::comma(MAE, accuracy = 1),
      `Train r2` = scales::number(`Train r2`, accuracy = 0.01)
    ),
  col.names = c("Model Spec", "CV MAE (riders)", "Train R²", "Feature Count", "Model Type", "Feature Set"),
  format.args = list(big.mark = ","),
  digits = 1
)
Model Spec CV MAE (riders) Train R² Feature Count Model Type Feature Set
KDE - Spatial Only 582 0.19 0 KDE Spatial neighbors
Code
kde_full_preds <- predict_kde(
  train_coords = coord_matrix,
  train_values = kde_model_data$ridership_2023_fall,
  query_coords = coord_matrix,
  bandwidth = kde_bandwidth
)

kde_map <- tracts %>%
  left_join(
    kde_model_data %>%
      mutate(kde_riders = kde_full_preds) %>%
      dplyr::select(GEOID, kde_riders),
    by = "GEOID"
  ) %>%
  filter(!is.na(kde_riders))

ggplot(kde_map) +
  geom_sf(aes(fill = kde_riders), color = "grey85", size = 0.1) +
  scale_fill_viridis_c(labels = scales::comma) +
  labs(
    title = "Spatial KDE Estimate of Average Daily Ridership",
    fill = "KDE riders"
  ) +
  theme_minimal() +
  theme(
    axis.text = element_blank(),
    axis.title = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
  )


Cross-Validated Performance

Code
model_performance_all <- bind_rows(
  model_performance,
  tibble(
    model = "KDE - Spatial neighbors",
    mae = kde_mae,
    train_r2 = kde_train_r2,
    feature_count = 0,
    model_type = "KDE",
    spec_name = "Spatial neighbors"
  )
)

model_performance_all <- model_performance_all %>%
  mutate(
    color_group = case_when(
      model %in% c("Poisson - ACS + Spatial",
                   "KDE - Spatial neighbors") ~ "highlight",
      TRUE ~ "normal"
    )
  )

ggplot(model_performance_all, aes(x = reorder(model, mae), y = mae, fill = color_group)) +
  geom_col() +
  geom_text(aes(label = scales::comma(mae, accuracy = 1)), vjust = -0.5) +
  expand_limits(y = max(model_performance_all$mae) * 1.1) +
  scale_fill_manual(
    values = c(
      "highlight" = "#1f4e79",
      "normal" = "#4c78a8"
    ),
    guide = "none"
  ) +
  labs(
    title = "Cross-validated MAE by Specification (5-fold)",
    x = NULL,
    y = "Mean Absolute Error (Average Daily Riders)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Interpretation: The cross-validated results show the best-performing specification is the Poisson model with ACS + spatial features (MAE ≈ 477). The KDE baseline sits in the middle at 582, which provides a useful benchmark. This indicates that demographic and built-environment features add real predictive value. Meanwhile, ACS-only models for both OLS and Poisson perform the worst, showing that spatial information is essential for explaining ridership patterns.


Best Specification Output

Below is the full coefficient output for the best spec (Poisson, ACS + spatial, no fixed effects) to interpret direction and significance.

Code
best_label <- "Poisson - ACS + Spatial"

models_fitted_1 <- model_specs_full %>%
  mutate(
    fit = pmap(
      list(features, fe, model_type),
      ~ {
        formula_str <- paste("ridership_2023_fall ~", build_rhs(..1))
        fe_flag <- ..2
        mtype   <- ..3
        
        if (mtype == "OLS") {
          if (!fe_flag) {
            lm(as.formula(formula_str), data = model_data)
          } else {
            feols(as.formula(paste(formula_str, "| fe_bucket")), data = model_data)
          }
        } else if (mtype == "Poisson") {
          if (!fe_flag) {
            glm(as.formula(formula_str), data = model_data, family = poisson())
          } else {
            fepois(as.formula(paste(formula_str, "| fe_bucket")), data = model_data)
          }
        }
      }
    )
  )

models_fitted_1 %>%
  filter(model_label == best_label) %>%
  pull(fit) %>%
  .[[1]] %>%
  summary()

Call:
glm(formula = as.formula(formula_str), family = poisson(), data = model_data)

Coefficients:
                                 Estimate    Std. Error z value
(Intercept)                  4.9044317612  0.0216038130  227.02
commute_score                0.0306005371  0.0004235091   72.25
pct_hispanic                -0.4309619521  0.0109908720  -39.21
pct_asian                    2.4851685409  0.0194129008  128.02
pct_carless                  1.4310436921  0.0162758559   87.92
emp_density                  0.0000384163  0.0000005788   66.37
pop_density                 -0.0000348522  0.0000003880  -89.82
pct_wfh                      0.3669917713  0.1510452521    2.43
dist_university_mi          -0.1251829378  0.0027487451  -45.54
dist_hospital_mi            -0.2949065327  0.0036478756  -80.84
violent_incidents_2014_2023  0.0063247778  0.0000248343  254.68
                                       Pr(>|z|)    
(Intercept)                 <0.0000000000000002 ***
commute_score               <0.0000000000000002 ***
pct_hispanic                <0.0000000000000002 ***
pct_asian                   <0.0000000000000002 ***
pct_carless                 <0.0000000000000002 ***
emp_density                 <0.0000000000000002 ***
pop_density                 <0.0000000000000002 ***
pct_wfh                                  0.0151 *  
dist_university_mi          <0.0000000000000002 ***
dist_hospital_mi            <0.0000000000000002 ***
violent_incidents_2014_2023 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 304490  on 382  degrees of freedom
Residual deviance: 140423  on 372  degrees of freedom
  (17 observations deleted due to missingness)
AIC: Inf

Number of Fisher Scoring iterations: 5

Interpretation: The coefficient table shows that most predictors, except for % worked from home, are statistically significant in the Poisson model. Because Poisson coefficients are on the log scale, their raw magnitudes are not directly comparable, so we rely on cross-validated MAE rather than coefficient size to evaluate model performance.


Part 4: Model Evaluation & Spatial Diagnostics

Spatial Autocorrelation Checks

Use the best model’s predictions/residuals to test remaining spatial autocorrelation (Moran’s I) and locate systematic over/under-estimation for policy interpretation.

Code
best_fit <- glm(
  as.formula(paste("ridership_2023_fall ~", build_rhs(c(acs_vars, spatial_vars)))),
  data   = model_data,
  family = poisson()
)

best_preds <- predict(best_fit, newdata = model_data, type = "response")

model_data_eval <- model_data %>%
  mutate(
    best_pred  = best_preds,
    best_resid = ridership_2023_fall - best_pred
  )

tracts_eval <- tracts %>%
  left_join(model_data_eval %>% dplyr::select(GEOID, ridership_2023_fall, best_resid), by = "GEOID") %>%
  filter(!is.na(best_resid))

nb <- poly2nb(tracts_eval, queen = TRUE)
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)
r_moran    <- moran.test(tracts_eval$ridership_2023_fall, lw, zero.policy = TRUE)
resid_moran <- moran.test(tracts_eval$best_resid, lw, zero.policy = TRUE)

diagnostics <- tibble(
  Metric = c("Ridership Moran's I", "Residual Moran's I"),
  Statistic = c(r_moran$estimate[["Moran I statistic"]], resid_moran$estimate[["Moran I statistic"]]),
  "P-value" = c(r_moran$p.value, resid_moran$p.value)
)

knitr::kable(diagnostics, digits = 6)
Metric Statistic P-value
Ridership Moran’s I 0.371491 0.000000
Residual Moran’s I 0.135753 0.000004
Code
ggplot(tracts_eval) +
  geom_sf(aes(fill = best_resid), color = "grey85", size = 0.1) +
  scale_fill_gradient2(
    low = "#2166ac", high = "#b2182b", mid = "white",
    midpoint = 0, labels = scales::comma
  ) +
  labs(
    title = paste("Residual Map -", best_label),
    fill = "Residual (Predicted - Actual)"
  ) +
  theme_minimal() +
  theme(
    legend.position = "right",
    axis.text = element_blank(),
    axis.title = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
  )

Interpretation: The strong positive Moran’s I for ridership (0.37, p < 0.001) confirms that bus usage in Philadelphia is highly clustered spatially, i.e. tracts near one another tend to have similar ridership levels. After applying our best-performing model (Poisson with ACS and spatial features), the residual Moran’s I drops to 0.14 but remains statistically significant. This suggests that the model captures part of the spatial structure in transit demand, but not all, which is consistent with the idea that neighborhood-level transit behavior is influenced by spatial processes that are not fully represented in our current feature set. It also explains why adding neighborhood fixed effects reduced spatial autocorrelation more aggressively but performed worse in cross-validation: fixed effects can absorb spatial clustering, but at the cost of overfitting.


Extreme Residual Tracts

To target systematic over/under-estimation, the table lists the 10 tracts with largest absolute residuals alongside key attributes.

Code
resid_max <- model_data_eval %>%
  arrange(desc(best_resid)) %>%
  slice_head(n = 10) %>%
  mutate(type = "Largest Over-prediction")

resid_min <- model_data_eval %>%
  arrange(best_resid) %>%
  slice_head(n = 10) %>%
  mutate(type = "Largest Under-prediction")

worst_tracts <- bind_rows(resid_max, resid_min) %>%
  left_join(
    tract_features %>% dplyr::select(GEOID, NAME),
    by = "GEOID"
  ) %>%
  mutate(NAME = str_remove(NAME, ";\\s*Philadelphia County; Pennsylvania"))

knitr::kable(
  worst_tracts %>%
    dplyr::select(type, NAME, best_resid, pct_hispanic, pct_asian,
                  pct_carless, pop_density, emp_density,
                  dist_university_mi, dist_hospital_mi,
                  violent_incidents_2014_2023) %>%
    mutate(
      pct_hispanic = scales::percent(pct_hispanic, accuracy = 0.1),
      pct_asian = scales::percent(pct_asian, accuracy = 0.1),
      pct_carless = scales::percent(pct_carless, accuracy = 0.1),
    ),
  col.names = c(
    "Residual Type", "Tract Name", "Residual (Pred - Actual)",
    "% Hispanic", "% Asian", "% Zero-car HHs",
    "Population Density", "Employment Density",
    "Dist to Univ (mi)", "Dist to Hospital (mi)",
    "Violent Incidents 2014-2023"),
  caption = "Census Tract with Extreme Model Residuals",
  format.args = list(big.mark = ","),
  digits = 1
)
Census Tract with Extreme Model Residuals
Residual Type Tract Name Residual (Pred - Actual) % Hispanic % Asian % Zero-car HHs Population Density Employment Density Dist to Univ (mi) Dist to Hospital (mi) Violent Incidents 2014-2023
Largest Over-prediction Census Tract 4.03 4,173.8 7.1% 10.0% 42.6% 19,166.8 13,010.0 0.0 0.3 123
Largest Over-prediction Census Tract 1.01 3,837.2 8.4% 3.7% 40.6% 15,000.4 12,783.4 0.1 0.5 88
Largest Over-prediction Census Tract 241 3,466.8 2.6% 0.0% 54.5% 9,366.2 3,721.2 0.8 0.9 100
Largest Over-prediction Census Tract 7.01 3,377.1 4.6% 14.0% 49.7% 37,228.3 29,355.9 0.0 0.4 129
Largest Over-prediction Census Tract 84 2,677.3 1.2% 4.2% 33.7% 27,450.2 12,086.5 1.0 0.5 167
Largest Over-prediction Census Tract 201.02 2,534.4 6.3% 1.3% 34.3% 26,643.7 8,742.4 0.3 0.3 165
Largest Over-prediction Census Tract 38 2,300.7 7.1% 3.5% 10.6% 9,142.7 4,927.4 0.7 0.7 75
Largest Over-prediction Census Tract 372 1,991.6 9.9% 25.6% 19.4% 9,895.5 4,724.3 0.7 0.8 80
Largest Over-prediction Census Tract 4.04 1,819.8 2.3% 9.3% 51.9% 44,491.0 28,686.0 0.2 0.4 65
Largest Over-prediction Census Tract 24 1,617.8 4.2% 7.0% 29.5% 22,571.3 13,738.2 0.4 0.7 73
Largest Under-prediction Census Tract 152 -2,187.1 10.1% 7.9% 48.7% 32,090.6 11,268.5 0.5 1.0 298
Largest Under-prediction Census Tract 9.01 -1,747.3 6.4% 12.6% 66.6% 52,178.7 36,036.9 0.1 0.2 128
Largest Under-prediction Census Tract 140 -1,503.6 6.0% 12.2% 29.4% 29,685.1 18,501.7 0.2 0.1 203
Largest Under-prediction Census Tract 280 -1,495.3 5.3% 0.0% 50.7% 14,920.3 5,223.0 0.4 0.9 131
Largest Under-prediction Census Tract 108 -1,346.9 2.1% 3.6% 39.4% 21,868.0 7,870.0 0.4 0.6 182
Largest Under-prediction Census Tract 153 -1,182.9 7.9% 6.9% 43.0% 31,095.8 12,737.1 0.0 0.8 211
Largest Under-prediction Census Tract 107 -1,095.1 4.1% 0.0% 40.1% 18,890.9 6,807.5 0.2 0.8 208
Largest Under-prediction Census Tract 147 -1,021.8 9.8% 10.2% 40.8% 25,329.6 14,333.0 0.1 0.5 254
Largest Under-prediction Census Tract 122.03 -1,001.8 14.5% 19.3% 13.8% 12,374.3 7,454.4 0.2 0.6 48
Largest Under-prediction Census Tract 91 -991.1 5.9% 21.2% 47.9% 18,030.3 8,866.1 0.0 0.2 77

The residual map shows where our best model systematically over- or under-predicts ridership, revealing places where bus use behaves differently from what the demographic and spatial variables would suggest.

Code
worst_sf <- tracts %>%
  semi_join(worst_tracts, by = "GEOID") %>%
  left_join(
    worst_tracts %>%
      dplyr::select(GEOID, NAME_clean = NAME, best_resid, type),
    by = "GEOID"
  ) %>%
  mutate(
    directional_label = if_else(st_coordinates(st_centroid(geometry))[,1] < median(st_coordinates(st_centroid(tracts))[,1]), "West", "East"),
    clean_name = str_remove(NAME_clean, "Census Tract\\s*"),
    label = clean_name,
    color_group = if_else(type == "Largest Over-prediction", "Low", "High")
  )

worst_points <- st_point_on_surface(worst_sf)
worst_points_df <- worst_points %>%
  st_coordinates() %>%
  as_tibble() %>%
  bind_cols(st_drop_geometry(worst_sf))

ggplot() +
  geom_sf(data = tracts, fill = "grey95", color = "grey85", size = 0.1) +
  geom_sf(data = worst_sf, aes(fill = color_group), color = "grey30", size = 0.15, alpha = 0.8) +
  scale_fill_manual(
    values = c("High" = "#4575b4", "Low" = "#d73027"),
    name = "Residual Type\n(Labels = Census Tract IDs)",
    labels = c("High" = "Under predict", "Low" = "Over predict")
  ) +
  scale_color_manual(
    values = c("High" = "black", "Low" = "black"),
    guide = "none"
  ) +
  geom_label_repel(
    data = worst_points_df,
    aes(
      x = X,
      y = Y,
      label = label,
      color = color_group
    ),
    size = 3,
    fill = NA,
    color = "black",
    label.size = 0,
    segment.color = "grey50",
    min.segment.length = 0,
    segment.curvature = 0,
    segment.ncp = 3,
    box.padding = 0.2,
    point.padding = 0.2,
    nudge_y = 0.08,
    nudge_x = 0.12 * if_else(worst_points_df$X < median(worst_points_df$X), -1, 1),
    max.overlaps = Inf,
    direction = "both",
    force = 15,
    max.time = 5,
    hjust = if_else(worst_points_df$X < median(worst_points_df$X), 1, 0)
  ) +
  labs(
    title = "Top 10 High & Low Residual Census Tracts",
    subtitle = paste0(best_label)
  ) +
  theme_minimal() +
  theme(
    axis.text = element_blank(),
    axis.title = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
  )

Interpretation: - The largest over-predictions appear in Center City and South Philadelphia. These areas are very dense and have many zero-car households, so the model expects extremely high bus usage, yet the actual numbers are much lower. This pattern likely reflects the availability of strong substitutes such as subway or trolley.

  • In contrast, the largest under-predictions occur in parts of North and West Philadelphia, where ridership is higher than the model anticipates. These tracts tend to have higher poverty rates, more zero-car households, and elevated crime levels. Residents tend to rely on buses more heavily because they have fewer mobility alternatives.

Together, these patterns show that the model performs well overall but still misses certain behavioral dynamics: bus use is lower than expected in multimodal, choice-rich areas and higher than expected in disadvantaged, transit-dependent neighborhoods.


Part 5: Conclusions & Recommendations

Limitations of the Current Model

  • The set of data sources is restricted. We rely only on publicly available datasets from SEPTA, ACS, and OpenDataPhilly. Important factors such as transit service supply, parking availability, or temporal travel patterns are not included. Because we do not have stop-level ridership data, we also cannot precisely isolate the influence of large transportation centers; instead, we excluded two entire census tracts that were dominated by these hubs.

  • The update time of the datasets is not fully aligned. We use Fall 2023 ridership as the main observations, but some predictors are only available as annual ACS 2023 estimates or more recent 2025 releases. These mismatches introduce unavoidable temporal inconsistencies across variables.

  • The fixed-effect structure is coarse. We assign tracts to neighborhood-level buckets, which may be too aggregated to capture meaningful within-neighborhood variation. This may explain why adding FE did not improve predictive accuracy and, in some cases, increased MAE.

  • The models are linear specifications (OLS and Poisson). While they offer interpretability, they do not fully reflect spatial dependence or network effects. Our spatial diagnostics show that ridership exhibits strong spatial autocorrelation, but the residuals still retain a moderate level of spatial clustering. This indicates that the demographic and spatial predictors in the model do not fully absorb geographic spillovers or network-based behaviors. In other words, nearby tracts still resemble each other in ways that the model cannot capture.


Next Steps

  1. Expand the data foundation. The model could incorporate OSM street networks, land-use layers, and other environmental features to construct indicators for accessibility, walkability, and land-use pattern. More detailed transit supply data, ideally matched to the same time period as ridership, would further improve model accuracy and likely help absorb the spatial clustering that appears in the residuals.

  2. Exploring alternative modeling approaches. Since the residuals still show signs of spatial autocorrelation, future work should consider spatial lag or spatial error models, geographically weighted regression, or even machine-learning methods that account for geographic structure. These approaches may better capture localized clustering, network effects, and non-linear relationships, leading to more robust predictions and a clearer understanding of how ridership varies across the city.