Space-Time Prediction of Bike Share Demand: Philadelphia Indego

MUSA 5080 - Fall 2025

Author

Isabelle Li

Published

December 11, 2025

Part 1: Data Import & Preparation

1.1 Set up

Load packages

Define theme

Code
plotTheme <- theme(
  plot.title = element_text(size = 14, face = "bold"),
  plot.subtitle = element_text(size = 10),
  plot.caption = element_text(size = 8),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title = element_text(size = 11, face = "bold"),
  panel.background = element_blank(),
  panel.grid.major = element_line(colour = "#D0D0D0", size = 0.2),
  panel.grid.minor = element_blank(),
  axis.ticks = element_blank(),
  legend.position = "right"
)

mapTheme <- theme(
  plot.title = element_text(size = 14, face = "bold"),
  plot.subtitle = element_text(size = 10),
  plot.caption = element_text(size = 8),
  axis.line = element_blank(),
  axis.text = element_blank(),
  axis.ticks = element_blank(),
  axis.title = element_blank(),
  panel.background = element_blank(),
  panel.border = element_blank(),
  panel.grid.major = element_line(colour = 'transparent'),
  panel.grid.minor = element_blank(),
  legend.position = "right",
  plot.margin = margin(1, 1, 1, 1, 'cm'),
  legend.key.height = unit(1, "cm"),
  legend.key.width = unit(0.2, "cm")
)

palette5 <- c("#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c")

Set Census API Key

1.2 Data Import

Create Time Bins

Code
indego <- indego %>%
  mutate(
    # Parse datetime
    start_datetime = mdy_hm(start_time),
    end_datetime = mdy_hm(end_time),
    
    # Create hourly bins
    interval60 = floor_date(start_datetime, unit = "hour"),
    
    # Extract time features
    week = week(interval60),
    month = month(interval60, label = TRUE),
    dotw = wday(interval60, label = TRUE),
    hour = hour(interval60),
    date = as.Date(interval60),
    
    # Create useful indicators
    weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
    rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
  )
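The hourly binning above relies on lubridate, but the logic is easy to verify in base R: `trunc(x, units = "hours")` reproduces what `floor_date(x, unit = "hour")` does for a POSIXct timestamp. A minimal sketch with a made-up timestamp (not a row from the Indego data):

```r
# Base-R equivalent of floor_date(x, "hour") for a single timestamp
t <- as.POSIXct("2025-04-15 08:47:00", tz = "UTC")
interval60 <- as.POSIXct(trunc(t, units = "hours"))

format(interval60, "%Y-%m-%d %H:%M")  # "2025-04-15 08:00"

# The rush-hour flag follows directly from the binned hour
hr <- as.integer(format(interval60, "%H"))
rush_hour <- as.integer(hr %in% c(7, 8, 9, 16, 17, 18))
```

Everything downstream (week, day of week, weekend flag) is derived from this floored timestamp, so trips starting at 8:01 and 8:59 land in the same station-hour cell.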

1.3 Exploratory Analysis

Trips over time

The Q2 2025 ridership pattern reflects typical late-spring to early-summer behavior. Overall, daily trip volumes are high compared to the winter months, with most days falling between 3,500 and 5,000 trips.

Hourly Patterns

The hourly ridership profile shows a strong distinction between weekdays and weekends. On weekdays, the pattern is dominated by two sharp peaks: a large morning spike around 8–9 AM and an even stronger afternoon peak around 5–6 PM. These align closely with traditional commuting hours, indicating that Indego is heavily used for work-related travel during the week.

In contrast, weekend ridership follows a flatter, more gradual curve. Activity starts later in the morning, rises steadily through midday, and maintains moderate levels into the afternoon without a pronounced rush-hour peak. This reflects more recreational and discretionary trips rather than structured commuting patterns.

Top Stations

Code
# Most popular origin stations
top_stations <- indego %>%
  count(start_station, start_lat, start_lon, name = "trips") %>%
  arrange(desc(trips)) %>%
  head(20)

kable(top_stations, 
      caption = "Top 20 Indego Stations by Trip Origins",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Top 20 Indego Stations by Trip Origins
start_station start_lat start_lon trips
3,010 39.94711 -75.16618 6,278
3,032 39.94527 -75.17971 4,751
3,359 39.94888 -75.16978 4,181
3,028 39.94061 -75.14958 4,028
3,163 39.94974 -75.18097 4,000
3,020 39.94855 -75.19007 3,946
3,185 39.95169 -75.15888 3,914
3,066 39.94561 -75.17348 3,899
3,022 39.95472 -75.18323 3,856
3,054 39.96250 -75.17420 3,767
3,161 39.95486 -75.18091 3,629
3,362 39.94816 -75.16226 3,552
3,063 39.94633 -75.16980 3,516
3,213 39.93887 -75.16663 3,332
3,059 39.96244 -75.16121 3,320
3,012 39.94218 -75.17747 3,264
3,007 39.94517 -75.15993 3,242
3,061 39.95425 -75.17761 3,193
3,101 39.94295 -75.15955 3,170
3,296 39.95134 -75.16758 3,088

Get Philadelphia Spatial Context

Join Census Data to Stations

Code
# Create sf object for stations
stations_sf <- indego %>%
  distinct(start_station, start_lat, start_lon) %>%
  filter(!is.na(start_lat), !is.na(start_lon)) %>%
  st_as_sf(coords = c("start_lon", "start_lat"), crs = 4326)

# Spatial join to get census tract for each station
stations_census <- st_join(stations_sf, philly_census, left = TRUE) %>%
  st_drop_geometry()

# Add census variables back to the trip data
indego_census <- indego %>%
  left_join(
    stations_census %>% 
      select(start_station, Med_Inc, Percent_Taking_Transit, 
             Percent_White, Total_Pop),
    by = "start_station"
  )
Code
# Identify which stations to keep
valid_stations <- stations_census %>%
  filter(!is.na(Med_Inc)) %>%
  pull(start_station)

# Filter trip data to valid stations only
indego_census <- indego %>%
  filter(start_station %in% valid_stations) %>%
  left_join(
    stations_census %>% 
      select(start_station, Med_Inc, Percent_Taking_Transit, 
             Percent_White, Total_Pop),
    by = "start_station"
  )

Get Weather Data

Code
# Get weather from Philadelphia International Airport (KPHL)
weather_data <- riem_measures(
  station = "PHL",  # Philadelphia International Airport
  date_start = "2025-04-01",
  date_end = "2025-07-31"
)

# Process weather data
weather_processed <- weather_data %>%
  mutate(
    interval60 = floor_date(valid, unit = "hour"),
    Temperature = tmpf,  # Temperature in Fahrenheit
    Precipitation = ifelse(is.na(p01i), 0, p01i),  # Hourly precip in inches
    Wind_Speed = sknt  # Wind speed in knots
  ) %>%
  select(interval60, Temperature, Precipitation, Wind_Speed) %>%
  distinct()

# Check for missing hours and interpolate if needed
weather_complete <- weather_processed %>%
  complete(interval60 = seq(min(interval60), max(interval60), by = "hour")) %>%
  fill(Temperature, Precipitation, Wind_Speed, .direction = "down")
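`fill(..., .direction = "down")` carries the last observed value forward into the gaps created by `complete()`. A base-R sketch of that last-observation-carried-forward step, using a toy vector rather than the actual weather series:

```r
# Last observation carried forward, mirroring fill(.direction = "down")
locf <- function(x) {
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i])) x[i] <- x[i - 1]
  }
  x
}

locf(c(70, NA, NA, 65, NA))  # 70 70 70 65 65
```

One caveat carries over from `fill()` as well: leading NAs (a gap before the first observation) stay NA, since there is nothing earlier to carry forward.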

Create Space-Time Panel

Code
# Count trips by station-hour
trips_panel <- indego_census %>%
  group_by(interval60, start_station, start_lat, start_lon,
           Med_Inc, Percent_Taking_Transit, Percent_White, Total_Pop) %>%
  summarize(Trip_Count = n()) %>%
  ungroup()

# How many station-hour observations?
nrow(trips_panel)

# How many unique stations?
length(unique(trips_panel$start_station))

# How many unique hours?
length(unique(trips_panel$interval60))

Create Complete Panel Structure

Code
# Calculate expected panel size
n_stations <- length(unique(trips_panel$start_station))
n_hours <- length(unique(trips_panel$interval60))
expected_rows <- n_stations * n_hours

cat("Expected panel rows:", format(expected_rows, big.mark = ","), "\n")
Expected panel rows: 548,604 
Code
cat("Current rows:", format(nrow(trips_panel), big.mark = ","), "\n")
Current rows: 179,107 
Code
cat("Missing rows:", format(expected_rows - nrow(trips_panel), big.mark = ","), "\n")
Missing rows: 369,497 
Code
# Create complete panel
study_panel <- expand.grid(
  interval60 = unique(trips_panel$interval60),
  start_station = unique(trips_panel$start_station)
) %>%
  # Join trip counts
  left_join(trips_panel, by = c("interval60", "start_station")) %>%
  # Replace NA trip counts with 0
  mutate(Trip_Count = replace_na(Trip_Count, 0))

# Fill in station attributes (they're the same for all hours)
station_attributes <- trips_panel %>%
  group_by(start_station) %>%
  summarize(
    start_lat = first(start_lat),
    start_lon = first(start_lon),
    Med_Inc = first(Med_Inc),
    Percent_Taking_Transit = first(Percent_Taking_Transit),
    Percent_White = first(Percent_White),
    Total_Pop = first(Total_Pop)
  )

study_panel <- study_panel %>%
  left_join(station_attributes, by = "start_station")

# Verify we have complete panel
cat("Complete panel rows:", format(nrow(study_panel), big.mark = ","), "\n")
Complete panel rows: 548,604 
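The expand-join-zero-fill pattern generalizes beyond this dataset. A toy base-R version (two hypothetical stations over three hours, with `merge` standing in for `left_join`) makes the row arithmetic concrete:

```r
# Observed station-hours only; missing combinations mean zero trips
obs <- data.frame(station = c("A", "A", "B"),
                  hour    = c(1, 2, 1),
                  trips   = c(5, 3, 2))

# Full grid: every station crossed with every hour
panel <- expand.grid(hour = 1:3, station = c("A", "B"))

# Left join, then replace NA counts with 0
panel <- merge(panel, obs, by = c("station", "hour"), all.x = TRUE)
panel$trips[is.na(panel$trips)] <- 0

nrow(panel)       # 6  (2 stations x 3 hours)
sum(panel$trips)  # 10 (total trips unchanged)
```

The same invariants hold for the real panel: row count equals stations times hours (548,604), and the total trip count is preserved because zero-filling only adds empty cells.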

Add Time Features

Code
study_panel <- study_panel %>%
  mutate(
    week = week(interval60),
    month = lubridate::month(interval60, label = TRUE, abbr = TRUE),
    month = factor(month, levels = month.abb),  # Jan, Feb, ..., Dec
    dotw = wday(interval60, label = TRUE),
    hour = hour(interval60),
    date = as.Date(interval60),
    weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
    rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
  )

Join Weather Data

Code
study_panel <- study_panel %>%
  left_join(weather_complete, by = "interval60")

Create Temporal Lag Variables

Code
# Sort by station and time
study_panel <- study_panel %>%
  arrange(start_station, interval60)

# Create lag variables WITHIN each station
study_panel <- study_panel %>%
  group_by(start_station) %>%
  mutate(
    lag1Hour = lag(Trip_Count, 1),
    lag2Hours = lag(Trip_Count, 2),
    lag3Hours = lag(Trip_Count, 3),
    lag12Hours = lag(Trip_Count, 12),
    lag1day = lag(Trip_Count, 24)
  ) %>%
  ungroup()

# Remove rows with NA lags (first 24 hours for each station)
study_panel_complete <- study_panel %>%
  filter(!is.na(lag1day))
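The per-station lag construction can be sketched in base R: a lag of n shifts the series down by n positions and pads the start with NA, and `ave()` applies that shift group-wise, like `group_by()` plus `lag()`. Toy data with hypothetical station IDs:

```r
# Shift a series down by n, padding the start with NA
lag_n <- function(x, n) c(rep(NA_real_, n), head(x, -n))

df <- data.frame(station = c("A", "A", "A", "B", "B"),
                 trips   = c(4, 7, 2, 9, 1))

# Lag computed WITHIN each station, so series never leak across stations
df$lag1 <- ave(df$trips, df$station, FUN = function(x) lag_n(x, 1))

df$lag1  # NA 4 7 NA 9
```

Grouping before lagging is the important part: without it, station B's first hour would wrongly inherit station A's last trip count.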

Temporal Train/Test Split

Code
# Identify early and late periods within the study window
# Training covers weeks 13–21; testing covers weeks 22–26

early_stations <- study_panel_complete %>%
  filter(week >= 13, week < 22) %>%    # weeks 13–21
  filter(Trip_Count > 0) %>%
  distinct(start_station) %>%
  pull(start_station)

late_stations <- study_panel_complete %>%
  filter(week >= 22, week <= 26) %>%   # weeks 22–26
  filter(Trip_Count > 0) %>%
  distinct(start_station) %>%
  pull(start_station)

# Keep only stations appearing in both early and late
common_stations <- intersect(early_stations, late_stations)

study_panel_complete <- study_panel_complete %>%
  filter(start_station %in% common_stations)

# Now create train/test split
train <- study_panel_complete %>%
  filter(week >= 13, week < 22)

test <- study_panel_complete %>%
  filter(week >= 22, week <= 26)

1.4 Build Predictive Models

Model 1: Baseline (Time + Weather)

Code
# Create day of week factor with treatment (dummy) coding
train <- train %>%
  mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))

# Set contrasts to treatment coding (dummy variables)
contrasts(train$dotw_simple) <- contr.treatment(7)

# Now run the model
model1 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation,
  data = train
)

model1_tidy <- tidy(model1) %>%
  filter(term %in% c("Temperature", "Precipitation", 
                     "dotw_simple2", "dotw_simple3", "dotw_simple4",
                     "dotw_simple5", "dotw_simple6", "dotw_simple7")) %>%
  mutate(
    estimate = round(estimate, 3),
    p.value = format.pval(p.value, digits = 3)
  )

kable(model1_tidy,
      caption = "Selected Coefficients from Model 1 (Q2 2025)",
      col.names = c("Term", "Estimate", "Std. Error", "t value", "p value")) %>%
  kable_classic(full_width = FALSE)
Selected Coefficients from Model 1 (Q2 2025)
Term Estimate Std. Error t value p value
dotw_simple2 0.047 0.0062891 7.438893 0.000000000000102
dotw_simple3 0.008 0.0063365 1.305814 0.19162
dotw_simple4 0.042 0.0062043 6.761778 0.000000000013648
dotw_simple5 -0.017 0.0061253 -2.825335 0.00472
dotw_simple6 -0.027 0.0064698 -4.238667 0.000022489666761
dotw_simple7 -0.068 0.0065597 -10.320271 < 0.0000000000000002
Temperature 0.013 0.0001728 76.011086 < 0.0000000000000002
Precipitation -0.025 0.0232841 -1.060071 0.28911

Ridership is strongly driven by time of day: morning and evening commute hours show the largest positive effects, while late-night hours have the lowest activity.

Weekdays generally have higher demand than weekends, and temperature is strongly and positively associated with trip counts. Precipitation shows a small negative effect but is not statistically significant in this period.
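Because Monday is the baseline level, the `dotw_simple2` through `dotw_simple7` terms in the table are offsets for Tuesday through Sunday relative to Monday. A base-R sketch of how treatment coding builds those dummy columns (two illustrative observations only):

```r
dotw <- factor(c("Mon", "Sat"),
               levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
contrasts(dotw) <- contr.treatment(7)

mm <- model.matrix(~ dotw)

# Monday row: all dummies are 0 (the baseline level)
mm[1, -1]
# Saturday row: only the 6th dummy ("dotw6") is 1
mm[2, "dotw6"]  # 1
```

This is why, for example, the negative `dotw_simple7` coefficient reads as "Sunday demand is lower than Monday's, all else equal."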

Model 2: Add Temporal Lags

Code
model2 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day,
  data = train
)

model2_tidy <- tidy(model2) %>%
  filter(term %in% c("Temperature", "Precipitation",
                     "lag1Hour", "lag3Hours", "lag1day")) %>%
  mutate(
    estimate   = round(estimate, 3),
    std.error  = round(std.error, 3),
    statistic  = round(statistic, 1),
    p.value    = format.pval(p.value, digits = 3, eps = 1e-4)
  )

kable(
  model2_tidy,
  caption = "Key coefficients from Model 2 (Q2 2025)",
  col.names = c("Term", "Estimate", "Std. Error", "t value", "p value")
) %>%
  kable_classic(full_width = FALSE)
Key coefficients from Model 2 (Q2 2025)
Term Estimate Std. Error t value p value
Temperature 0.004 0.000 25.0 < 0.0001
Precipitation -0.070 0.020 -3.6 0.000378
lag1Hour 0.415 0.001 296.4 < 0.0001
lag3Hours 0.120 0.001 86.7 < 0.0001
lag1day 0.137 0.001 107.1 < 0.0001

The coefficients on the lag variables are all positive and highly significant: the previous hour’s demand (lag1Hour, β ≈ 0.42), demand three hours ago (lag3Hours, β ≈ 0.12), and demand 24 hours ago (lag1day, β ≈ 0.14) all strongly predict current ridership. This means that if a station was busy in the recent past, it is very likely to be busy now as well. Compared to Model 1, the effect of temperature becomes smaller because part of the temporal pattern is now captured by lagged demand. Precipitation becomes more negative and statistically significant, indicating that rainy hours reduce trips even after controlling for recent usage. Overall, adding lag features increases R² from about 0.11 to 0.36 and reduces the residual error, showing that short-term temporal dependence is a key driver of Indego ridership.
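The R² jump from adding lags mirrors what happens in any autocorrelated series. A quick simulated illustration (an AR(1) process with coefficient 0.6, not the Indego data): regressing the series on its one-step lag recovers roughly phi^2 ≈ 0.36 of the variance, the same order of improvement seen above.

```r
set.seed(1)
# Simulate an AR(1) series: y_t = 0.6 * y_{t-1} + noise
y <- as.numeric(arima.sim(model = list(ar = 0.6), n = 2000))

# One-step lag as the sole predictor
lag1 <- c(NA, head(y, -1))
r2 <- summary(lm(y ~ lag1))$r.squared
round(r2, 2)  # close to 0.6^2 = 0.36
```

The parallel is loose, since ridership also has strong hourly and weekly seasonality, but it shows how much variance a single lag can absorb when demand is persistent.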

Model 3: Add Demographics

Code
model3 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day +
    Med_Inc.x + Percent_Taking_Transit.y + Percent_White.y,
  data = train
)

model3_tidy <- tidy(model3) %>%
  filter(term %in% c("Temperature", "Precipitation",
                     "lag1Hour", "lag3Hours", "lag1day",
                     "Med_Inc.x", "Percent_Taking_Transit.y", "Percent_White.y")) %>%
  mutate(
    estimate  = round(estimate, 4),
    std.error = round(std.error, 4),
    statistic = round(statistic, 1),
    p.value   = format.pval(p.value, digits = 3, eps = 1e-4)
  )

kable(
  model3_tidy,
  caption = "Key Coefficients for Model 3 (Adding Census Features)",
  col.names = c("Term", "Estimate", "Std. Error", "t value", "p value")
) %>%
  kable_classic(full_width = FALSE)
Key Coefficients for Model 3 (Adding Census Features)
Term Estimate Std. Error t value p value
Temperature 0.0072 0.0004 20.3 < 0.0001
Precipitation -0.3160 0.0364 -8.7 < 0.0001
lag1Hour 0.3077 0.0023 133.6 < 0.0001
lag3Hours 0.0855 0.0024 36.1 < 0.0001
lag1day 0.1191 0.0022 53.3 < 0.0001
Med_Inc.x 0.0000 0.0000 2.7 0.00689
Percent_Taking_Transit.y -0.0041 0.0004 -9.7 < 0.0001
Percent_White.y 0.0025 0.0002 11.6 < 0.0001

The lag variables (lag1Hour, lag3Hours, lag1day) remain the strongest predictors, confirming that short-term demand history is still the main driver of current ridership. Among the census features, higher median income around a station is associated with slightly higher usage, while neighborhoods with higher transit dependence tend to show lower Indego demand. The percentage of White residents is positively associated with trip counts. Temperature remains positively significant, whereas precipitation becomes strongly negative after controlling for lagged demand and neighborhood attributes.

Model 4: Add Station Fixed Effects

Code
model4 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day +
    Med_Inc.x + Percent_Taking_Transit.y + Percent_White.y +
    as.factor(start_station),
  data = train
)

model4_metrics <- tibble(
  Metric = c("R-squared", "Adjusted R-squared", "Residual Std. Error", "N (train)"),
  Value = c(
    round(summary(model4)$r.squared, 4),
    round(summary(model4)$adj.r.squared, 4),
    round(summary(model4)$sigma, 4),
    nrow(train)
  )
)

kable(model4_metrics, 
      caption = "Model 4: Key Performance Metrics",
      col.names = c("Metric", "Value")) %>%
  kable_classic(full_width = FALSE)
Model 4: Key Performance Metrics
Metric Value
R-squared 0.2720
Adjusted R-squared 0.2705
Residual Std. Error 1.1827
N (train) 438712.0000

Model 5: Add Rush Hour Interaction

Code
model5 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day + rush_hour + month +
    Med_Inc.x + Percent_Taking_Transit.y + Percent_White.y +
    as.factor(start_station) +
    rush_hour * weekend,  # Rush hour effects different on weekends
  data = train
)

# Extract key metrics from Model 5
model5_metrics <- tibble(
  Metric = c("R-squared", "Adjusted R-squared", "Residual Std. Error", "N (train)"),
  Value = c(
    round(summary(model5)$r.squared, 4),
    round(summary(model5)$adj.r.squared, 4),
    round(summary(model5)$sigma, 4),
    nrow(train)
  )
)

kable(model5_metrics,
      caption = "Model 5: Key Performance Metrics",
      col.names = c("Metric", "Value")) %>%
  kable_classic(full_width = FALSE)
Model 5: Key Performance Metrics
Metric Value
R-squared 0.2758
Adjusted R-squared 0.2743
Residual Std. Error 1.1796
N (train) 438712.0000
Code
# Get predictions on test set

# Create day of week factor with treatment (dummy) coding
test <- test %>%
  mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))

# Set contrasts to treatment coding (dummy variables)
contrasts(test$dotw_simple) <- contr.treatment(7)

test$month <- factor(test$month, levels = levels(model5$model$month))

test <- test %>%
  mutate(
    pred1 = predict(model1, newdata = test),
    pred2 = predict(model2, newdata = test),
    pred3 = predict(model3, newdata = test),
    pred4 = predict(model4, newdata = test),
    pred5 = predict(model5, newdata = test)
  )

# Calculate MAE for each model
mae_results <- data.frame(
  Model = c(
    "1. Time + Weather",
    "2. + Temporal Lags",
    "3. + Demographics",
    "4. + Station FE",
    "5. + Rush Hour Interaction"
  ),
  MAE = c(
    mean(abs(test$Trip_Count - test$pred1), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred2), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred3), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred4), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred5), na.rm = TRUE)
  )
)

kable(mae_results, 
      digits = 2,
      caption = "Mean Absolute Error by Model (Test Set)",
      col.names = c("Model", "MAE (trips)")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Mean Absolute Error by Model (Test Set)
Model MAE (trips)
1. Time + Weather 0.80
2. + Temporal Lags 0.61
3. + Demographics 0.86
4. + Station FE 0.85
5. + Rush Hour Interaction 0.80

Model Performance Comparison (Q1 vs Q2)

Model MAE (Q1 2025) MAE (Q2 2025)
1. Time + Weather 0.60 0.80
2. + Temporal Lags 0.50 0.61
3. + Demographics 0.74 0.86
4. + Station FE 0.73 0.85
5. + Rush Hour Interaction 0.73 0.80

Overall, prediction errors are higher in Q2 than in Q1.

For example, the best model (Model 2) sees its MAE rise from about 0.50 trips in Q1 to about 0.61 trips in Q2. All other models likewise show larger MAE in Q2 (e.g., Model 1 increases from 0.60 to 0.80 trips).

This suggests that ridership in spring–early summer is somewhat harder to predict than in winter. In Q1, demand is dominated by regular commuter patterns and strong weather effects (cold temperatures and snow suppress many trips), which makes behavior more stable. In Q2, warmer weather allows for more casual and recreational trips, as well as events and irregular travel, so there is more noise that the model cannot fully capture.

Temporal patterns also differ between Q1 and Q2.

In Q1, ridership is strongly concentrated in weekday morning and evening peaks, consistent with commuter trips, and weekends are relatively quiet. In Q2, there is more activity during the middle of the day and on weekends, reflecting recreational and leisure use in warmer months. This makes demand less purely “rush-hour driven” and more spread out, which again increases prediction error.

In Q2, the most important predictors are still the temporal lag features.

The coefficients on lag1Hour, lag3Hours, and lag1day are large and highly significant, indicating that recent station-level demand is the strongest signal of current ridership. Time-of-day (hour dummies), day-of-week, and temperature also matter, but their marginal effects become smaller once lags are included. Census variables and station fixed effects add some spatial structure, but they do not reduce MAE and in some cases slightly worsen out-of-sample performance, likely because they add model complexity (and, for the census variables, some missing values) without contributing new temporal signal.

Part 2: Error Analysis

2.1 Spatial patterns

Create error maps

Code
station_errors <- test %>%
  filter(!is.na(pred2)) %>%
  group_by(start_station, start_lat, start_lon) %>%
  summarize(
    MAE = mean(abs(Trip_Count - pred2), na.rm = TRUE),
    avg_demand = mean(Trip_Count, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(!is.na(start_lat), !is.na(start_lon))


# Map 1 – Prediction Errors
p1_clean <- ggplot() +
  geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.2) +
  geom_point(
    data = station_errors,
    aes(x = start_lon, y = start_lat, color = MAE),
    size = 3.0, alpha = 0.7
  ) +
  scale_color_viridis(
    option = "plasma",
    name = "MAE (trips)",
    direction = -1
  ) +
  labs(
    title = "Prediction Errors",
    subtitle = "Higher in Center City"
  ) +
  mapTheme +
  theme(
    legend.position = "bottom",
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 10)
  )

# Map 2 – Average Demand
p2_clean <- ggplot() +
  geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.2) +
  geom_point(
    data = station_errors,
    aes(x = start_lon, y = start_lat, color = avg_demand),
    size = 3.0, alpha = 0.7
  ) +
  scale_color_viridis(
    option = "viridis",
    name = "Avg Demand (trips/hour)",
    direction = -1
  ) +
  labs(
    title = "Average Demand",
    subtitle = "Trips per station-hour"
  ) +
  mapTheme +
  theme(
    legend.position = "bottom",
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 10)
  )

# Only the maps (no legends)
p1_noleg <- p1_clean + theme(legend.position = "none")
p2_noleg <- p2_clean + theme(legend.position = "none")

# Extract legends
leg1 <- get_legend(p1_clean)
leg2 <- get_legend(p2_clean)

# Arrange clean layout
final_plot <- ggarrange(
  p1_noleg, p2_noleg,
  ncol = 2,
  labels = NULL
)

final_output <- ggarrange(
  final_plot,
  ggarrange(leg1, leg2, ncol = 2),
  ncol = 1,
  heights = c(4, 1)
)

final_output

From the Prediction Errors map (Model 2), the stations with the highest MAE all cluster tightly around:

1. Center City (highest error cluster)

  • Rittenhouse Square

  • Market Street corridor

  • City Hall / Suburban Station area

  • University City eastern edge (34th–30th Street)

2. University City (moderately high error)

  • Drexel / Penn campus areas

  • 33rd St, Lancaster Ave

  • Walnut/Chestnut corridor

3. Waterfront / Northern Liberties (moderate)

  • Penn’s Landing

  • Spring Garden

4. Areas with low error

  • Residential neighborhoods (West Philly, South Philly)

Highest errors occur in Center City and eastern University City. These areas include Rittenhouse Square, Market Street, City Hall / Suburban Station, Drexel/Penn corridors, and parts of the waterfront. They share two characteristics: (1) high demand, and (2) high behavioral variability.

Errors are likely driven by features missing from the model such as special events, tourism flows, campus academic schedules, and spatial spillover between nearby stations. Center City also has the strongest AM/PM peak dynamics, making it more challenging for a linear model to predict accurately.

Low-demand residential neighborhoods show much lower error (<0.5 MAE) because ridership is stable and predictable.

2.2 Temporal Patterns

Code
# MAE by time of day and day type
test <- test %>%
  mutate(
    error = Trip_Count - pred2,       
    abs_error = abs(error),
    time_of_day = case_when(
      hour < 7 ~ "Overnight",
      hour >= 7 & hour < 10 ~ "AM Rush",
      hour >= 10 & hour < 15 ~ "Mid-Day",
      hour >= 15 & hour <= 18 ~ "PM Rush",
      hour > 18 ~ "Evening"
    )
  )


temporal_errors <- test %>%
  group_by(time_of_day, weekend) %>%
  summarize(
    MAE = mean(abs_error, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))

ggplot(temporal_errors, aes(x = time_of_day, y = MAE, fill = day_type)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
  labs(
    title = "Prediction Errors by Time Period",
    subtitle = "When is the model struggling most?",
    x = "Time of Day",
    y = "Mean Absolute Error (trips)",
    fill = "Day Type"
  ) +
  plotTheme +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
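The `case_when()` banding above is equivalent to a single `cut()` over the hour, which makes the interval boundaries easier to audit. A base-R sketch, equivalent for integer hours 0–23:

```r
# Same five bands as the case_when(); breaks are right-closed intervals
time_of_day <- function(hour) {
  cut(hour,
      breaks = c(-1, 6, 9, 14, 18, 23),
      labels = c("Overnight", "AM Rush", "Mid-Day", "PM Rush", "Evening"))
}

as.character(time_of_day(c(0, 8, 12, 17, 22)))
# "Overnight" "AM Rush" "Mid-Day" "PM Rush" "Evening"
```

The `cut()` form also returns an ordered set of factor levels for free, which keeps the bars in chronological order without manual releveling.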

Across the full time spectrum, prediction errors are highest during the peak commute periods, especially the PM rush, where demand rises sharply and fluctuates too quickly for the model to follow, leading to consistent under-prediction of the strongest spikes. The AM rush shows a similar pattern, though slightly milder, reflecting the model’s difficulty capturing sudden early-morning surges.

Mid-day errors are moderately high as well, particularly on weekends, when riding behavior is less routine and more sensitive to social activities or weather changes, making demand harder to predict. In contrast, overnight hours have the lowest errors because ridership is both low and stable, leaving little room for large deviations. Taken together, these patterns show that the model tends to underestimate sudden increases and slightly overestimate during gradual declines, a structural bias that appears repeatedly across hours and between weekdays and weekends.

2.3 Demographic patterns

Code
# Join demographic data to station errors
station_errors_demo <- station_errors %>%
  left_join(
    station_attributes %>% select(start_station, Med_Inc, Percent_Taking_Transit, Percent_White),
    by = "start_station"
  ) %>%
  filter(!is.na(Med_Inc))

# Create plots
p1 <- ggplot(station_errors_demo, aes(x = Med_Inc, y = MAE)) +
  geom_point(alpha = 0.5, color = "#3182bd") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  scale_x_continuous(labels = scales::dollar) +
  labs(title = "Errors vs. Median Income", x = "Median Income", y = "MAE") +
  plotTheme

p2 <- ggplot(station_errors_demo, aes(x = Percent_Taking_Transit, y = MAE)) +
  geom_point(alpha = 0.5, color = "#3182bd") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Errors vs. Transit Usage", x = "% Taking Transit", y = "MAE") +
  plotTheme

p3 <- ggplot(station_errors_demo, aes(x = Percent_White, y = MAE)) +
  geom_point(alpha = 0.5, color = "#3182bd") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Errors vs. Race", x = "% White", y = "MAE") +
  plotTheme

grid.arrange(p1, p2, p3, ncol = 2)

The weak but noticeable positive slope with median income suggests that stations located in higher-income neighborhoods tend to have slightly larger prediction errors. These areas often have more discretionary travel, irregular trip purposes, and stronger weekend or leisure-based demand spikes that a basic temporal-lag model cannot fully capture. As a result, ridership becomes harder to forecast with simple hourly and daily patterns.

In contrast, the negative association with transit-use share implies that stations in neighborhoods with a high reliance on public transportation are more predictable. These communities typically have more consistent commuting routines and a tighter dependence on fixed schedules, which aligns very well with the temporal-lag structure of Model 2. When people ride transit regularly, they also tend to use bike share in more stable, repeated daily patterns, making these locations easier for the model to learn.

Finally, the mild positive relationship between errors and percent White mirrors the pattern observed with income: neighborhoods that are more affluent, have a higher share of White residents, or are less transit-dependent tend to exhibit more variable, less routine mobility behavior. These stations likely experience more irregular trip timing, spontaneous travel, and stronger seasonal fluctuations, features not fully captured by the current model.

Part 3: Feature Engineering & model improvement

3.1 Add Features

Code
city_hall_lat <- 39.9526
city_hall_lon <- -75.1652

holiday_dates <- as.Date(c(
  "2025-04-20", 
  "2025-04-22", 
  "2025-05-11",
  "2025-05-26",
  "2025-06-15",
  "2025-06-19"
))
Code
train <- train %>%
  mutate(
    perfect_weather = ifelse(
      Temperature >= 60 & Temperature <= 75 & Precipitation == 0,
      1, 0
    ),
    weekend_nice = weekend * perfect_weather,

    dist_to_center_city = distHaversine(
      cbind(start_lon.y, start_lat.y),
      c(city_hall_lon, city_hall_lat)
    ) / 1000,

    holiday = ifelse(date %in% holiday_dates, 1, 0)
  )

test <- test %>%
  mutate(
    perfect_weather = ifelse(
      Temperature >= 60 & Temperature <= 75 & Precipitation == 0,
      1, 0
    ),
    weekend_nice = weekend * perfect_weather,

    dist_to_center_city = distHaversine(
      cbind(start_lon.y, start_lat.y),
      c(city_hall_lon, city_hall_lat)
    ) / 1000,

    holiday = ifelse(date %in% holiday_dates, 1, 0)
  )
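`distHaversine()` from geosphere returns great-circle distance in meters; the underlying formula is short enough to sketch in base R. The 30th Street Station coordinates below are approximate and serve only as a sanity check on the units:

```r
# Great-circle distance in km via the haversine formula
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

# Approx. 30th Street Station -> City Hall: roughly 1.5 km
d <- haversine_km(-75.1820, 39.9557, -75.1652, 39.9526)
round(d, 1)
```

Dividing the geosphere result by 1,000, as in the code above, puts `dist_to_center_city` on the same kilometer scale.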

3.2 Implementation

Add Features to The Best Model

Code
model6 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple +
    Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day +
    perfect_weather + weekend_nice +
    dist_to_center_city + holiday,
  data = train
)


# Tidy model and keep key variables
coef_table6 <- tidy(model6) %>%
  filter(term %in% c(
    "Temperature",
    "Precipitation",
    "lag1Hour",
    "lag3Hours",
    "lag1day",
    "perfect_weather",
    "weekend_nice",
    "dist_to_center_city",
    "holiday"
  )) %>%
  mutate(
    term = recode(term,
      "Temperature"          = "Temperature (°F)",
      "Precipitation"        = "Precipitation (in)",
      "lag1Hour"             = "Trips 1 hour ago",
      "lag3Hours"            = "Trips 3 hours ago",
      "lag1day"              = "Trips 24 hours ago",
      "perfect_weather"      = "Perfect weather (60–75°F, dry)",
      "weekend_nice"         = "Weekend × perfect weather",
      "dist_to_center_city"  = "Distance to Center City (km)",
      "holiday"              = "Holiday indicator"
    )
  )

kable(
  coef_table6 %>%
    select(term, estimate, std.error, statistic, p.value),
  digits = 3,
  col.names = c("Term", "Estimate", "Std. Error", "t value", "p value"),
  caption = "Key coefficients in Model 6 (with new engineered features)"
) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Key coefficients in Model 6 (with new engineered features)
Term Estimate Std. Error t value p value
Temperature (°F) 0.004 0.000 25.555 0.000
Precipitation (in) -0.066 0.020 -3.307 0.001
Trips 1 hour ago 0.399 0.001 283.079 0.000
Trips 3 hours ago 0.102 0.001 73.467 0.000
Trips 24 hours ago 0.119 0.001 91.684 0.000
Perfect weather (60–75°F, dry) -0.003 0.004 -0.722 0.470
Weekend × perfect weather 0.028 0.007 4.191 0.000
Distance to Center City (km) -0.063 0.001 -74.449 0.000
Holiday indicator -0.013 0.007 -1.932 0.053
Code
# Compare MAE before and after adding engineered features
# --- Recalculate pred6 safely ---
test <- test %>% 
  mutate(
    pred6 = predict(model6, newdata = test)
  )

# --- Compute baseline (model2) and new model (model6) MAE ---
mae_before <- mean(abs(test$Trip_Count - test$pred2), na.rm = TRUE)
mae_after  <- mean(abs(test$Trip_Count - test$pred6), na.rm = TRUE)

mae_compare <- data.frame(
  Model = c("Baseline Model 2 (Lags Only)", 
            "New Model 6 (+ Weather + Distance + Holiday)"),
  MAE = c(mae_before, mae_after)
)

kable(mae_compare, digits = 3,
      caption = "MAE Before vs. After New Features (Q2 Test Set)")
MAE Before vs. After New Features (Q2 Test Set)

| Model | MAE |
|---|---|
| Baseline Model 2 (Lags Only) | 0.611 |
| New Model 6 (+ Weather + Distance + Holiday) | 0.613 |
Code
mae2 <- mean(abs(test$Trip_Count - test$pred2), na.rm = TRUE)
mae6 <- mean(abs(test$Trip_Count - test$pred6), na.rm = TRUE)

mae2
[1] 0.6106225
Code
mae6
[1] 0.613463
Code
mae6 - mae2
[1] 0.002840553

To test whether additional behavioral, spatial, and contextual signals could improve prediction accuracy, I introduced three new feature groups: (1) perfect-weather conditions and a weekend × nice-weather interaction, capturing the intuition that leisure-driven demand spikes under ideal conditions; (2) distance to Center City, representing the strong spatial gradient in bike-share usage; and (3) holiday indicators for major Q2 events such as Easter, Memorial Day, and Juneteenth, which may alter mobility patterns.
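The three feature groups above can be sketched as follows. The 60–75°F dry-weather cutoff and the Q2 2025 holiday dates mirror the text; the toy data frame, column names, and the approximate City Hall coordinates used as the Center City reference point are illustrative assumptions, not the actual pipeline:

```r
library(dplyr)

# Toy station-hour rows (illustrative; the real panel comes from the indego data)
panel <- data.frame(
  Temperature   = c(55, 68, 72, 80),
  Precipitation = c(0, 0, 0.2, 0),
  weekend       = c(0, 1, 1, 0),
  date          = as.Date(c("2025-04-20", "2025-05-26", "2025-06-19", "2025-06-20")),
  station_lon   = c(-75.16, -75.18, -75.19, -75.20),
  station_lat   = c(39.952, 39.960, 39.970, 39.980)
)

# Q2 2025 holidays named in the text: Easter, Memorial Day, Juneteenth
q2_holidays <- as.Date(c("2025-04-20", "2025-05-26", "2025-06-19"))

# Center City reference point (approximate City Hall coordinates)
cc_lon <- -75.1635
cc_lat <- 39.9526

panel <- panel %>%
  mutate(
    # (1) Ideal-conditions indicator (60–75°F and no rain) and its weekend interaction
    perfect_weather = as.integer(Temperature >= 60 & Temperature <= 75 &
                                   Precipitation == 0),
    weekend_nice    = weekend * perfect_weather,
    # (2) Straight-line distance to Center City in km
    #     (equirectangular approximation; adequate at city scale)
    dist_to_center_city = sqrt(
      ((station_lon - cc_lon) * cos(cc_lat * pi / 180) * 111.32)^2 +
        ((station_lat - cc_lat) * 111.32)^2
    ),
    # (3) Holiday indicator
    holiday = as.integer(date %in% q2_holidays)
  )
```

In this sketch only the second row qualifies as perfect weather (68°F, dry); the third row is warm enough but wet, so both `perfect_weather` and `weekend_nice` are zero there.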

Comparing performance before and after adding these features shows that the enhanced model (Model 6) performs almost identically to the strongest baseline model (Model 2). Model 2 achieves an MAE of 0.6106, while Model 6 yields 0.6135, a negligible increase of approximately 0.003 trips. Although the engineered variables add meaningful behavioral interpretation, they do not improve out-of-sample predictive accuracy.

This result reflects the nature of Q2 bike-share demand: spring and early-summer ridership is already strongly explained by temporal autocorrelation (lag features) and hourly/weekly seasonality, with relatively consistent warm weather and limited extremes. Because these patterns dominate usage, additional features offer little incremental predictive power and instead introduce slight overfitting.

3.3 Poisson Model

Code
model6_pois <- glm(
  Trip_Count ~ as.factor(hour) + dotw_simple +
    Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day +
    perfect_weather + weekend_nice +
    dist_to_center_city + holiday,
  data = train,
  family = poisson(link = "log")
)

poisson_summary <- data.frame(
  Metric = c(
    "Null deviance",
    "Residual deviance",
    "AIC"
  ),
  Value = c(
    round(model6_pois$null.deviance, 2),
    round(model6_pois$deviance, 2),
    round(model6_pois$aic, 2)
  )
)

kable(poisson_summary,
      caption = "Poisson Regression Model Summary (Key Metrics)",
      col.names = c("Metric", "Value")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Poisson Regression Model Summary (Key Metrics)

| Metric | Value |
|---|---|
| Null deviance | 701096.0 |
| Residual deviance | 441087.1 |
| AIC | 774319.7 |
Code
test <- test %>% 
  mutate(
    pred6_pois = predict(model6_pois, newdata = test, type = "response")
  )

mae_pois <- mean(abs(test$Trip_Count - test$pred6_pois), na.rm = TRUE)
mae_pois
[1] 0.639133
Code
mae_lin <- mean(abs(test$Trip_Count - test$pred6), na.rm = TRUE)

data.frame(
  Model = c("Linear Model 6", "Poisson Model 6"),
  MAE = c(mae_lin, mae_pois)
)
            Model      MAE
1  Linear Model 6 0.613463
2 Poisson Model 6 0.639133

Although trip counts are discrete and non-negative, making Poisson regression a natural candidate, the Poisson specification did not outperform the linear model in this case. The Poisson model substantially reduced the null deviance (from about 701,096 to 441,087), which indicates that it captures important structure in the data. However, when evaluated out-of-sample on the Q2 test set, its MAE is 0.639, which is noticeably worse than the MAE of the linear Model 6 (0.613).
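One diagnostic worth running before preferring a Poisson specification is an overdispersion check: when the conditional variance of the counts exceeds the mean, the Poisson variance assumption is violated and quasi-Poisson or negative binomial alternatives become more appropriate. The real check would use `model6_pois`; the sketch below uses synthetic overdispersed counts purely to illustrate the calculation:

```r
# Pearson dispersion statistic: sum of squared Pearson residuals over
# residual degrees of freedom. Values well above 1 indicate overdispersion.
set.seed(42)
toy <- data.frame(x = rnorm(500))
# Negative-binomial counts (size = 0.5) are strongly overdispersed
toy$y <- rnbinom(500, mu = exp(0.4 + 0.2 * toy$x), size = 0.5)

fit <- glm(y ~ x, data = toy, family = poisson(link = "log"))

dispersion <- sum(residuals(fit, type = "pearson")^2) / fit$df.residual
dispersion  # >> 1 here; suggests quasipoisson() or MASS::glm.nb() instead
```

If the same statistic for `model6_pois` is far above 1, that would help explain why the Poisson fit reduces deviance substantially yet does not translate into better out-of-sample MAE.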

Part 4: Critical Reflection

4.1 Operational implications

With a final test-set MAE of roughly 0.61 trips per station-hour, the model is accurate enough to provide meaningful operational support for Indego, especially for routine rebalancing on normal weekdays. Most stations average 0–2 trips per hour in Q2, so an average error of 0.6 represents a relatively small deviation from true demand. In other words, the model reliably captures broad temporal patterns: commuting peaks, weather-driven changes, and weekend shifts, which are the operational factors that matter most for daily fleet management. However, the error analysis shows that the system performs less consistently in areas with highly irregular or discretionary travel (Center City, university areas, tourist clusters). These are precisely the locations where under- or over-prediction can create rebalancing problems: late-morning surges, sudden evening spikes, or unusually warm weekends can leave docks empty or bikes stranded unless staff intervene manually.

Given this, the model can be deployed operationally, but with clear conditions. It should be used as a decision-support tool, not an autonomous system. The predictions can guide baseline redistribution plans, flag high-risk hours, and help prioritize which stations need closer monitoring. But stations with historically volatile demand and days with atypical weather, holidays, or major events will still require human oversight or real-time adjustments.

4.2 Equity considerations

The error patterns suggest that prediction accuracy is not evenly distributed across Philadelphia. Stations in higher-income, majority-White neighborhoods, particularly those with more discretionary and irregular travel, show larger residuals, while stations in transit-dependent, lower-income communities tend to be more predictable. This raises an important equity concern: the model performs best in neighborhoods where travel patterns are already stable and predictable, and worst in areas where riders may rely on Indego as a flexible or secondary mobility option.

If such a system were deployed without safeguards, it could unintentionally reinforce existing disparities. Under-predicting demand in transit-reliant communities could mean empty docks, reduced availability, and longer wait times, directly undermining mobility for riders with fewer alternatives.

To mitigate these risks, several safeguards should be implemented:

  1. Equity-weighted rebalancing rules ensuring underserved neighborhoods receive a minimum guaranteed baseline of bike availability regardless of predicted demand

  2. Error monitoring that tracks model performance separately by neighborhood demographics

  3. Human-in-the-loop overrides where planners can adjust allocations in census tracts showing sustained high error

  4. Periodic model retraining with explicit fairness constraints or stratified sampling to ensure equitable performance across different community types
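Safeguard (2) could be prototyped as a simple disaggregated-error report. The demographic grouping here is a hypothetical tract-level income label joined to the test set; the toy values and column names are illustrative assumptions, not Indego data:

```r
library(dplyr)

# Toy station-hour predictions tagged with a (hypothetical) tract income group
errors <- data.frame(
  income_group = rep(c("Lower-income", "Higher-income"), each = 4),
  Trip_Count   = c(1, 0, 2, 1,   3, 5, 1, 4),
  pred6        = c(1.2, 0.3, 1.8, 0.9,   2.1, 3.4, 2.2, 2.8)
)

# MAE and mean bias (prediction minus actual) by group; a sustained gap
# between groups is the signal that would trigger review or retraining
equity_report <- errors %>%
  group_by(income_group) %>%
  summarise(
    MAE  = mean(abs(Trip_Count - pred6)),
    Bias = mean(pred6 - Trip_Count),
    .groups = "drop"
  )
equity_report
```

Reporting bias alongside MAE matters here: systematic under-prediction (negative bias) in transit-reliant tracts is exactly the failure mode that would translate into empty docks.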

4.3 Model limitations

Despite incorporating temporal lags, weather, spatial context, and holiday effects, the model still misses several important behavioral patterns. Bicycle demand is heavily influenced by micro-events such as neighborhood festivals, block parties, school schedules, university calendars, and special events at venues like the stadium district—none of which are captured in the current dataset. The model also assumes that relationships between predictors and demand are smooth and stable over time, but real bike-share patterns often shift abruptly due to construction, temporary station closures, seasonal promotions, or sudden weather changes that are not recorded by hourly temperature and precipitation alone. Moreover, riders make spontaneous decisions in response to perceived safety, traffic conditions, and trip chaining with transit—complex behaviors that cannot be explained using simple linear terms.

If given more time and richer data, the predictive system could be significantly improved by incorporating features that better capture the behavioral, spatial, and contextual complexity of bike-share demand. The current model relies primarily on temporal lags, weather, and basic spatial structure, which limits its ability to account for major demand drivers such as large events, university schedules, tourism flows, or transit disruptions. With access to real event data (concerts, stadium schedules, conventions, sports games), SEPTA operational data, academic calendars for Penn/Drexel/Temple, and park/office/restaurant POI densities, the model would be able to reflect more realistic fluctuations in ridership patterns. These variables are known to heavily shape Philadelphia’s mobility rhythms but are not represented in the present dataset.

Methodologically, the model could be strengthened by moving beyond linear regression into more flexible forms to capture non-linear relationships and temporal dependencies that the current model cannot express. Expanding the training window to multiple years rather than a single quarter would also allow the model to learn seasonal differences, long-term trends, and year-over-year behavioral changes. With richer data and more advanced modeling frameworks, Indego could move from hour-ahead prediction toward more robust real-time forecasting and dynamic rebalancing systems.