---
title: "Space-Time Prediction of Bike Share Demand: Philadelphia Indego"
subtitle: "MUSA 5080 - Fall 2025"
author: "Isabelle Li"
date: today
format:
  html:
    code-fold: show
    code-tools: true
    toc: true
    toc-depth: 3
    toc-location: left
    theme: cosmo
    embed-resources: true
editor: visual
execute:
  warning: false
  message: false
---
# Part 1: Data Import & Preparation
## 1.1 Setup
### Load packages
```{r load_libraries,echo=FALSE}
# Core tidyverse
library(tidyverse)
library(lubridate)
# Spatial data
library(sf)
library(tigris)
library(geosphere)
# Census data
library(tidycensus)
# Weather data
library(riem)
# Visualization
library(viridis)
library(gridExtra)
library(knitr)
library(kableExtra)
library(broom)
library(grid)
library(ggpubr)
# here!
library(here)
options(scipen = 999)
```
### Define theme
```{r themes}
plotTheme <- theme(
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10),
plot.caption = element_text(size = 8),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
axis.title = element_text(size = 11, face = "bold"),
panel.background = element_blank(),
panel.grid.major = element_line(colour = "#D0D0D0", linewidth = 0.2),
panel.grid.minor = element_blank(),
axis.ticks = element_blank(),
legend.position = "right"
)
mapTheme <- theme(
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10),
plot.caption = element_text(size = 8),
axis.line = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
panel.background = element_blank(),
panel.border = element_blank(),
panel.grid.major = element_line(colour = 'transparent'),
panel.grid.minor = element_blank(),
legend.position = "right",
plot.margin = margin(1, 1, 1, 1, 'cm'),
legend.key.height = unit(1, "cm"),
legend.key.width = unit(0.2, "cm")
)
palette5 <- c("#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c")
```
### Set Census API Key
```{r census_key, eval=FALSE,include=FALSE}
# Replace with your own key; install = TRUE stores it in .Renviron for future sessions
census_api_key("YOUR_CENSUS_API_KEY", overwrite = TRUE, install = TRUE)
```
```{r census_key_hidden, include=FALSE}
# Read the key from the environment at render time; never hard-code it in a shared document
census_api_key(Sys.getenv("CENSUS_API_KEY"), overwrite = TRUE)
```
## 1.2 Data Import
```{r load_indego,echo=FALSE,include=FALSE}
# Read Q2 2025 data
indego <- read_csv(here("assignments/assignment5/data/indego-trips-2025-q2.csv"))
# Quick look at the data
glimpse(indego)
```
```{r explore_data,include=FALSE}
# How many trips?
cat("Total trips in Q2 2025:", nrow(indego), "\n")
# Date range
cat("Date range:",
format(min(mdy_hm(indego$start_time))), "to",
format(max(mdy_hm(indego$start_time))), "\n")
# How many unique stations?
cat("Unique start stations:", length(unique(indego$start_station)), "\n")
# Trip types
table(indego$trip_route_category)
# Passholder types
table(indego$passholder_type)
# Bike types
table(indego$bike_type)
```
### Create Time Bins
```{r create_time_bins}
indego <- indego %>%
mutate(
# Parse datetime
start_datetime = mdy_hm(start_time),
end_datetime = mdy_hm(end_time),
# Create hourly bins
interval60 = floor_date(start_datetime, unit = "hour"),
# Extract time features
week = week(interval60),
month = month(interval60, label = TRUE),
dotw = wday(interval60, label = TRUE),
hour = hour(interval60),
date = as.Date(interval60),
# Create useful indicators
weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
)
```
## 1.3 Exploratory Analysis
### Trips over time
```{r trips_over_time,echo=FALSE}
# Daily trip counts
daily_trips <- indego %>%
group_by(date) %>%
summarize(trips = n())
ggplot(daily_trips, aes(x = date, y = trips)) +
geom_line(color = "#3182bd", linewidth = 1) +
geom_smooth(se = FALSE, color = "red", linetype = "dashed") +
labs(
title = "Indego Daily Ridership - Q2 2025",
subtitle = "Spring demand patterns in Philadelphia",
x = "Date",
y = "Daily Trips",
caption = "Source: Indego bike share"
) +
plotTheme
```
The Q2 2025 ridership pattern reflects typical late-spring to early-summer behavior. Overall, daily trip volumes are high compared with the winter months, with most days falling between 3,500 and 5,000 trips.
### Hourly Patterns
```{r hourly_patterns,echo=FALSE}
# Average trips by hour and day type
hourly_patterns <- indego %>%
group_by(hour, weekend) %>%
summarize(avg_trips = n() / n_distinct(date)) %>%
mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))
ggplot(hourly_patterns, aes(x = hour, y = avg_trips, color = day_type)) +
geom_line(linewidth = 1.2) +
scale_color_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
labs(
title = "Average Hourly Ridership Patterns",
subtitle = "Clear commute patterns on weekdays",
x = "Hour of Day",
y = "Average Trips per Hour",
color = "Day Type"
) +
plotTheme
```
The hourly ridership profile shows a strong distinction between weekdays and weekends. On weekdays, the pattern is dominated by two sharp peaks: a large morning spike around 8–9 AM and an even stronger afternoon peak around 5–6 PM. These align closely with traditional commuting hours, indicating that Indego is heavily used for work-related travel during the week.
In contrast, weekend ridership follows a flatter, more gradual curve. Activity starts later in the morning, rises steadily through midday, and maintains moderate levels into the afternoon without a pronounced rush-hour peak. This reflects more recreational and discretionary trips rather than structured commuting patterns.
### Top Stations
```{r top_stations}
# Most popular origin stations
top_stations <- indego %>%
count(start_station, start_lat, start_lon, name = "trips") %>%
arrange(desc(trips)) %>%
head(20)
kable(top_stations,
caption = "Top 20 Indego Stations by Trip Origins",
format.args = list(big.mark = ",")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
### Get Philadelphia Spatial Context
```{r load_census,include=FALSE}
# Get Philadelphia census tracts
philly_census <- get_acs(
geography = "tract",
variables = c(
"B01003_001", # Total population
"B19013_001", # Median household income
"B08301_001", # Total commuters
"B08301_010", # Commute by transit
"B02001_002", # White alone
"B25077_001" # Median home value
),
state = "PA",
county = "Philadelphia",
year = 2022,
geometry = TRUE,
output = "wide"
) %>%
rename(
Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
Total_Commuters = B08301_001E,
Transit_Commuters = B08301_010E,
White_Pop = B02001_002E,
Med_Home_Value = B25077_001E
) %>%
mutate(
Percent_Taking_Transit = (Transit_Commuters / Total_Commuters) * 100,
Percent_White = (White_Pop / Total_Pop) * 100
) %>%
st_transform(crs = 4326) # WGS84 for lat/lon matching
```
### Join Census Data to Stations
```{r}
# Create sf object for stations
stations_sf <- indego %>%
distinct(start_station, start_lat, start_lon) %>%
filter(!is.na(start_lat), !is.na(start_lon)) %>%
st_as_sf(coords = c("start_lon", "start_lat"), crs = 4326)
# Spatial join to get census tract for each station
stations_census <- st_join(stations_sf, philly_census, left = TRUE) %>%
st_drop_geometry()
# Add census variables back to the trip data
indego_census <- indego %>%
left_join(
stations_census %>%
select(start_station, Med_Inc, Percent_Taking_Transit,
Percent_White, Total_Pop),
by = "start_station"
)
```
```{r}
# Identify which stations to keep
valid_stations <- stations_census %>%
filter(!is.na(Med_Inc)) %>%
pull(start_station)
# Filter trip data to valid stations only
indego_census <- indego %>%
filter(start_station %in% valid_stations) %>%
left_join(
stations_census %>%
select(start_station, Med_Inc, Percent_Taking_Transit,
Percent_White, Total_Pop),
by = "start_station"
)
```
### Get Weather Data
```{r}
# Get weather from Philadelphia International Airport (KPHL)
weather_data <- riem_measures(
station = "PHL", # Philadelphia International Airport
date_start = "2025-04-01",
date_end = "2025-07-31"
)
# Process weather data
weather_processed <- weather_data %>%
mutate(
interval60 = floor_date(valid, unit = "hour"),
Temperature = tmpf, # Temperature in Fahrenheit
Precipitation = ifelse(is.na(p01i), 0, p01i), # Hourly precip in inches
Wind_Speed = sknt # Wind speed in knots
) %>%
select(interval60, Temperature, Precipitation, Wind_Speed) %>%
distinct()
# Check for missing hours and interpolate if needed
weather_complete <- weather_processed %>%
complete(interval60 = seq(min(interval60), max(interval60), by = "hour")) %>%
fill(Temperature, Precipitation, Wind_Speed, .direction = "down")
```
### Create Space-Time Panel
```{r aggregate_trips,results='hide'}
# Count trips by station-hour
trips_panel <- indego_census %>%
group_by(interval60, start_station, start_lat, start_lon,
Med_Inc, Percent_Taking_Transit, Percent_White, Total_Pop) %>%
summarize(Trip_Count = n()) %>%
ungroup()
# How many station-hour observations?
nrow(trips_panel)
# How many unique stations?
length(unique(trips_panel$start_station))
# How many unique hours?
length(unique(trips_panel$interval60))
```
### Create Complete Panel Structure
```{r complete_panel}
# Calculate expected panel size
n_stations <- length(unique(trips_panel$start_station))
n_hours <- length(unique(trips_panel$interval60))
expected_rows <- n_stations * n_hours
cat("Expected panel rows:", format(expected_rows, big.mark = ","), "\n")
cat("Current rows:", format(nrow(trips_panel), big.mark = ","), "\n")
cat("Missing rows:", format(expected_rows - nrow(trips_panel), big.mark = ","), "\n")
# Create complete panel
study_panel <- expand.grid(
interval60 = unique(trips_panel$interval60),
start_station = unique(trips_panel$start_station)
) %>%
# Join trip counts
left_join(trips_panel, by = c("interval60", "start_station")) %>%
# Replace NA trip counts with 0
mutate(Trip_Count = replace_na(Trip_Count, 0))
# Fill in station attributes (they're the same for all hours)
station_attributes <- trips_panel %>%
group_by(start_station) %>%
summarize(
start_lat = first(start_lat),
start_lon = first(start_lon),
Med_Inc = first(Med_Inc),
Percent_Taking_Transit = first(Percent_Taking_Transit),
Percent_White = first(Percent_White),
Total_Pop = first(Total_Pop)
)
study_panel <- study_panel %>%
left_join(station_attributes, by = "start_station")
# Verify we have complete panel
cat("Complete panel rows:", format(nrow(study_panel), big.mark = ","), "\n")
```
### Add Time Features
```{r add_time_features}
study_panel <- study_panel %>%
mutate(
week = week(interval60),
month = lubridate::month(interval60, label = TRUE, abbr = TRUE),
month = factor(month, levels = month.abb), # Jan, Feb, ..., Dec
dotw = wday(interval60, label = TRUE),
hour = hour(interval60),
date = as.Date(interval60),
weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
)
```
### Join Weather Data
```{r join_weather,warning=FALSE}
study_panel <- study_panel %>%
left_join(weather_complete, by = "interval60")
```
### Create Temporal Lag Variables
```{r create_lags}
# Sort by station and time
study_panel <- study_panel %>%
arrange(start_station, interval60)
# Create lag variables WITHIN each station
study_panel <- study_panel %>%
group_by(start_station) %>%
mutate(
lag1Hour = lag(Trip_Count, 1),
lag2Hours = lag(Trip_Count, 2),
lag3Hours = lag(Trip_Count, 3),
lag12Hours = lag(Trip_Count, 12),
lag1day = lag(Trip_Count, 24)
) %>%
ungroup()
# Remove rows with NA lags (first 24 hours for each station)
study_panel_complete <- study_panel %>%
filter(!is.na(lag1day))
```
```{r,echo=FALSE,include=FALSE}
# 1. put back start_lat / start_lon
study_panel_complete <- study_panel_complete %>%
mutate(
start_lat = start_lat.y,
start_lon = start_lon.y
)
# 2. perfect weather + weekend interaction
study_panel_complete <- study_panel_complete %>%
mutate(
perfect_weather = ifelse(
Temperature >= 60 & Temperature <= 80 & Precipitation == 0,
1, 0
),
weekend_nice = weekend * perfect_weather
)
# 3. distance to center city
city_hall_lat <- 39.9526
city_hall_lon <- -75.1652
study_panel_complete <- study_panel_complete %>%
mutate(
dist_to_center_city = distHaversine(
cbind(start_lon, start_lat),
c(city_hall_lon, city_hall_lat)
) / 1000
)
# 4. holiday indicator
holiday_dates <- as.Date(c(
"2025-04-20", # Easter
"2025-04-22", # Earth Day
"2025-05-11", # Mother's Day
"2025-05-26", # Memorial Day
"2025-06-15", # Father's Day
"2025-06-19" # Juneteenth
))
study_panel_complete <- study_panel_complete %>%
mutate(
holiday = ifelse(date %in% holiday_dates, 1, 0)
)
```
### Temporal Train/Test Split
```{r temporal_split}
# Identify early and late periods for Q2
# Q2 2025 spans roughly weeks 13–26 of the year
early_stations <- study_panel_complete %>%
filter(week >= 13, week < 22) %>% # weeks 13–21 (training period)
filter(Trip_Count > 0) %>%
distinct(start_station) %>%
pull(start_station)
late_stations <- study_panel_complete %>%
filter(week >= 22, week <= 26) %>% # weeks 22–26
filter(Trip_Count > 0) %>%
distinct(start_station) %>%
pull(start_station)
# Keep only stations appearing in both early and late
common_stations <- intersect(early_stations, late_stations)
study_panel_complete <- study_panel_complete %>%
filter(start_station %in% common_stations)
# Now create train/test split
train <- study_panel_complete %>%
filter(week >= 13, week < 22)
test <- study_panel_complete %>%
filter(week >= 22, week <= 26)
```
## 1.4 Build Predictive Models
### Model 1: Baseline (Time + Weather)
```{r model1}
# Create day of week factor with treatment (dummy) coding
train <- train %>%
mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
# Set contrasts to treatment coding (dummy variables)
contrasts(train$dotw_simple) <- contr.treatment(7)
# Now run the model
model1 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation,
data = train
)
model1_tidy <- tidy(model1) %>%
filter(term %in% c("Temperature", "Precipitation",
"dotw_simple2", "dotw_simple3", "dotw_simple4",
"dotw_simple5", "dotw_simple6", "dotw_simple7")) %>%
mutate(
estimate = round(estimate, 3),
p.value = format.pval(p.value, digits = 3)
)
kable(model1_tidy,
caption = "Selected Coefficients from Model 1 (Q2 2025)",
col.names = c("Term", "Estimate", "Std. Error", "t value", "p value")) %>%
kable_classic(full_width = FALSE)
```
Ridership is driven largely by time of day: morning and evening commute hours show the largest positive effects, while late-night hours have the lowest activity.\
Weekdays generally see higher demand than weekends, and temperature is strongly and positively associated with trip counts. Precipitation shows a small negative effect but is not statistically strong in this period.
### Model 2: Add Temporal Lags
```{r model2}
model2 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day,
data = train
)
model2_tidy <- tidy(model2) %>%
filter(term %in% c("Temperature", "Precipitation",
"lag1Hour", "lag3Hours", "lag1day")) %>%
mutate(
estimate = round(estimate, 3),
std.error = round(std.error, 3),
statistic = round(statistic, 1),
p.value = format.pval(p.value, digits = 3, eps = 1e-4)
)
kable(
model2_tidy,
caption = "Key coefficients from Model 2 (Q2 2025)",
col.names = c("Term", "Estimate", "Std. Error", "t value", "p value")
) %>%
kable_classic(full_width = FALSE)
```
The coefficients on the lag variables are all positive and highly significant: the previous hour’s demand (lag1Hour, β ≈ 0.42), demand three hours ago (lag3Hours, β ≈ 0.12), and demand 24 hours ago (lag1day, β ≈ 0.14) all strongly predict current ridership. This means that if a station was busy in the recent past, it is very likely to be busy now as well. Compared to Model 1, the effect of temperature becomes smaller because part of the temporal pattern is now captured by lagged demand. Precipitation becomes more negative and statistically significant, indicating that rainy hours reduce trips even after controlling for recent usage. Overall, adding lag features increases R² from about 0.11 to 0.36 and reduces the residual error, showing that short-term temporal dependence is a key driver of Indego ridership.
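The in-sample gain from the lag features can be verified directly from the fitted objects. A quick check, assuming `model1` and `model2` from the chunks above are still in memory:

```r
# Compare in-sample fit of the baseline and lagged models
library(dplyr)
library(broom)

bind_rows(
  glance(model1) %>% mutate(model = "Model 1: Time + Weather"),
  glance(model2) %>% mutate(model = "Model 2: + Temporal Lags")
) %>%
  select(model, r.squared, adj.r.squared, sigma)
```

The `sigma` column is the residual standard error, so the table summarizes both the R² jump and the error reduction discussed above in one place.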
### Model 3: Add Demographics
```{r model3}
model3 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day +
Med_Inc.x + Percent_Taking_Transit.y + Percent_White.y,
data = train
)
model3_tidy <- tidy(model3) %>%
filter(term %in% c("Temperature", "Precipitation",
"lag1Hour", "lag3Hours", "lag1day",
"Med_Inc.x", "Percent_Taking_Transit.y", "Percent_White.y")) %>%
mutate(
estimate = round(estimate, 4),
std.error = round(std.error, 4),
statistic = round(statistic, 1),
p.value = format.pval(p.value, digits = 3, eps = 1e-4)
)
kable(
model3_tidy,
caption = "Key Coefficients for Model 3 (Adding Census Features)",
col.names = c("Term", "Estimate", "Std. Error", "t value", "p value")
) %>%
kable_classic(full_width = FALSE)
```
The lag variables (lag1Hour, lag3Hours, lag1day) remain the strongest predictors, confirming that short-term demand history is still the main driver of current ridership. Among the census features, higher median income around a station is associated with slightly higher usage, while neighborhoods with higher transit dependence tend to show lower Indego demand. The percentage of White residents is positively associated with trip counts. Temperature remains positively significant, whereas precipitation becomes strongly negative after controlling for lagged demand and neighborhood attributes.
### Model 4: Add Station Fixed Effects
```{r model4}
model4 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day +
Med_Inc.x + Percent_Taking_Transit.y + Percent_White.y +
as.factor(start_station),
data = train
)
model4_metrics <- tibble(
Metric = c("R-squared", "Adjusted R-squared", "Residual Std. Error", "N (train)"),
Value = c(
round(summary(model4)$r.squared, 4),
round(summary(model4)$adj.r.squared, 4),
round(summary(model4)$sigma, 4),
nrow(train)
)
)
kable(model4_metrics,
caption = "Model 4: Key Performance Metrics",
col.names = c("Metric", "Value")) %>%
kable_classic(full_width = FALSE)
```
### Model 5: Add Rush Hour Interaction
```{r model5}
model5 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day + rush_hour + month +
Med_Inc.x + Percent_Taking_Transit.y + Percent_White.y +
as.factor(start_station) +
rush_hour * weekend, # Rush hour effects different on weekends
data = train
)
# Extract key metrics from Model 5
model5_metrics <- tibble(
Metric = c("R-squared", "Adjusted R-squared", "Residual Std. Error", "N (train)"),
Value = c(
round(summary(model5)$r.squared, 4),
round(summary(model5)$adj.r.squared, 4),
round(summary(model5)$sigma, 4),
nrow(train)
)
)
kable(model5_metrics,
caption = "Model 5: Key Performance Metrics",
col.names = c("Metric", "Value")) %>%
kable_classic(full_width = FALSE)
```
```{r calculate_mae}
# Get predictions on test set
# Create day of week factor with treatment (dummy) coding
test <- test %>%
mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
# Set contrasts to treatment coding (dummy variables)
contrasts(test$dotw_simple) <- contr.treatment(7)
test$month <- factor(test$month, levels = levels(model5$model$month))
test <- test %>%
mutate(
pred1 = predict(model1, newdata = test),
pred2 = predict(model2, newdata = test),
pred3 = predict(model3, newdata = test),
pred4 = predict(model4, newdata = test),
pred5 = predict(model5, newdata = test)
)
# Calculate MAE for each model
mae_results <- data.frame(
Model = c(
"1. Time + Weather",
"2. + Temporal Lags",
"3. + Demographics",
"4. + Station FE",
"5. + Rush Hour Interaction"
),
MAE = c(
mean(abs(test$Trip_Count - test$pred1), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred2), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred3), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred4), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred5), na.rm = TRUE)
)
)
kable(mae_results,
digits = 2,
caption = "Mean Absolute Error by Model (Test Set)",
col.names = c("Model", "MAE (trips)")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
### Model Performance Comparison (Q1 vs Q2)
| Model | MAE (Q1 2025) | MAE (Q2 2025) |
|-----------------------------|---------------|---------------|
| 1\. Time + Weather | **0.60** | **0.80** |
| 2\. + Temporal Lags | **0.50** | **0.61** |
| 3\. + Demographics | 0.74 | 0.86 |
| 4\. + Station FE | 0.73 | 0.85 |
| 5\. + Rush Hour Interaction | 0.73 | 0.80 |
**Overall, prediction errors are higher in Q2 than in Q1.**\
For example, the best model (Model 2) improves MAE from about 0.50 trips in Q1 to about 0.61 trips in Q2. All other models also show larger MAE in Q2 (e.g., Model 1 increases from 0.60 to 0.80 trips).\
This suggests that ridership in spring–early summer is somewhat harder to predict than in winter. In Q1, demand is dominated by regular commuter patterns and strong weather effects (cold temperatures and snow suppress many trips), which makes behavior more stable. In Q2, warmer weather allows for more casual and recreational trips, as well as events and irregular travel, so there is more noise that the model cannot fully capture.
**Temporal patterns also differ between Q1 and Q2.**\
In Q1, ridership is strongly concentrated in weekday morning and evening peaks, consistent with commuter trips, and weekends are relatively quiet. In Q2, there is more activity during the middle of the day and on weekends, reflecting recreational and leisure use in warmer months. This makes demand less purely “rush-hour driven” and more spread out, which again increases prediction error.
**In Q2, the most important predictors are still the temporal lag features.**\
The coefficients on `lag1Hour`, `lag3Hours`, and `lag1day` are large and highly significant, indicating that recent station-level demand is the strongest signal of current ridership. Time-of-day (hour dummies), day-of-week, and temperature also matter, but their marginal effects become smaller once lags are included. Census variables and station fixed effects add some spatial structure, but they do not reduce MAE and in some cases slightly worsen out-of-sample performance, likely because they are associated with missing data and additional model complexity.
# Part 2: Error Analysis
## 2.1 Spatial Patterns
### Create error maps
```{r}
station_errors <- test %>%
filter(!is.na(pred2)) %>%
group_by(start_station, start_lat, start_lon) %>%
summarize(
MAE = mean(abs(Trip_Count - pred2), na.rm = TRUE),
avg_demand = mean(Trip_Count, na.rm = TRUE),
.groups = "drop"
) %>%
filter(!is.na(start_lat), !is.na(start_lon))
# Map 1 – Prediction Errors
p1_clean <- ggplot() +
geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.2) +
geom_point(
data = station_errors,
aes(x = start_lon, y = start_lat, color = MAE),
size = 3.0, alpha = 0.7
) +
scale_color_viridis(
option = "plasma",
name = "MAE (trips)",
direction = -1
) +
labs(
title = "Prediction Errors",
subtitle = "Higher in Center City"
) +
mapTheme +
theme(
legend.position = "bottom",
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10)
)
# Map 2 – Average Demand
p2_clean <- ggplot() +
geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.2) +
geom_point(
data = station_errors,
aes(x = start_lon, y = start_lat, color = avg_demand),
size = 3.0, alpha = 0.7
) +
scale_color_viridis(
option = "viridis",
name = "Avg Demand (trips/hour)",
direction = -1
) +
labs(
title = "Average Demand",
subtitle = "Trips per station-hour"
) +
mapTheme +
theme(
legend.position = "bottom",
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10)
)
# Only the maps (no legends)
p1_noleg <- p1_clean + theme(legend.position = "none")
p2_noleg <- p2_clean + theme(legend.position = "none")
# Extract legends
leg1 <- get_legend(p1_clean)
leg2 <- get_legend(p2_clean)
# Arrange clean layout
final_plot <- ggarrange(
p1_noleg, p2_noleg,
ncol = 2,
labels = NULL
)
final_output <- ggarrange(
final_plot,
ggarrange(leg1, leg2, ncol = 2),
ncol = 1,
heights = c(4, 1)
)
final_output
```
From the *Prediction Errors* map (Model 2), the stations with the highest MAE all cluster tightly around:
### **1. Center City (highest error cluster)**
- Rittenhouse Square
- Market Street corridor
- City Hall / Suburban Station area
- University City eastern edge (34th–30th Street)
### **2. University City (moderately high error)**
- Drexel / Penn campus areas
- 33rd St, Lancaster Ave
- Walnut/Chestnut corridor
### **3. Waterfront / Northern Liberties (moderate)**
- Penn’s Landing
- Spring Garden
### **4. Areas with low error**
- Residential neighborhoods (West Philly, South Philly)
Highest errors occur in Center City and eastern University City. These areas include Rittenhouse Square, Market Street, City Hall / Suburban Station, Drexel/Penn corridors, and parts of the waterfront. They share two characteristics: (1) **high demand**, and (2) **high behavioral variability**.
Errors are likely driven by features missing from the model such as special events, tourism flows, campus academic schedules, and spatial spillover between nearby stations. Center City also has the strongest AM/PM peak dynamics, making it more challenging for a linear model to predict accurately.
Low-demand residential neighborhoods show much lower error (\<0.5 MAE) because ridership is stable and predictable.
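One way to probe the spatial-spillover hypothesis is to engineer a spatial lag feature: the average same-hour demand at each station's nearest neighbors. A minimal sketch using the `station_attributes` and `study_panel_complete` objects built earlier (the k = 5 cutoff is illustrative, and the `relationship` argument assumes dplyr ≥ 1.1):

```r
library(dplyr)
library(geosphere)

# Pairwise great-circle distances between stations (meters)
coords <- station_attributes %>% select(start_lon, start_lat)
dmat <- distm(as.matrix(coords), fun = distHaversine)
diag(dmat) <- Inf  # a station is not its own neighbor

# IDs of each station's 5 nearest neighbors (k x n matrix, one column per station)
k <- 5
nearest_ids <- apply(dmat, 1, function(d) {
  station_attributes$start_station[order(d)[1:k]]
})

# Long lookup table: focal station -> each of its k neighbors
neighbor_lookup <- tibble(
  start_station = rep(station_attributes$start_station, each = k),
  neighbor      = as.vector(nearest_ids)
)

# Average same-hour demand across each station's neighbors
spatial_lag <- study_panel_complete %>%
  select(interval60, neighbor = start_station, neighbor_trips = Trip_Count) %>%
  inner_join(neighbor_lookup, by = "neighbor", relationship = "many-to-many") %>%
  group_by(start_station, interval60) %>%
  summarize(spatial_lag = mean(neighbor_trips), .groups = "drop")
```

Joining `spatial_lag` onto the panel and adding it to Model 2 would test whether demand at adjacent stations carries signal beyond a station's own temporal lags.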
## 2.2 Temporal Patterns
```{r}
# MAE by time of day and day type
test <- test %>%
mutate(
error = Trip_Count - pred2,
abs_error = abs(error),
time_of_day = case_when(
hour < 7 ~ "Overnight",
hour >= 7 & hour < 10 ~ "AM Rush",
hour >= 10 & hour < 15 ~ "Mid-Day",
hour >= 15 & hour <= 18 ~ "PM Rush",
hour > 18 ~ "Evening"
)
)
temporal_errors <- test %>%
group_by(time_of_day, weekend) %>%
summarize(
MAE = mean(abs_error, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))
ggplot(temporal_errors, aes(x = time_of_day, y = MAE, fill = day_type)) +
geom_col(position = "dodge") +
scale_fill_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
labs(
title = "Prediction Errors by Time Period",
subtitle = "When is the model struggling most?",
x = "Time of Day",
y = "Mean Absolute Error (trips)",
fill = "Day Type"
) +
plotTheme +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
Across the full time spectrum, prediction errors are highest during the peak commute periods, especially the **PM rush**, where demand rises sharply and fluctuates too quickly for the model to follow, leading to consistent under-prediction of the strongest spikes. The **AM rush** shows a similar pattern, though slightly milder, reflecting the model’s difficulty capturing sudden early-morning surges.
Mid-day errors are moderately high as well, particularly on weekends, when riding behavior is less routine and more sensitive to social activities or weather changes, making demand harder to predict. In contrast, **overnight hours** have the lowest errors because ridership is both low and stable, leaving little room for large deviations. Taken together, these patterns show that the model tends to underestimate sudden increases and slightly overestimate during gradual declines, a structural bias that appears repeatedly across hours and between weekdays and weekends.
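The under-/over-prediction claim can be checked directly with the signed error already computed above (since `error = Trip_Count - pred2`, a positive mean means under-prediction):

```r
# Mean signed error by time period: positive = model under-predicts
library(dplyr)

test %>%
  mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday")) %>%
  group_by(time_of_day, day_type) %>%
  summarize(
    mean_error = mean(error, na.rm = TRUE),
    MAE        = mean(abs_error, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(mean_error))
```

If the structural-bias interpretation is right, the AM and PM rush rows should show the largest positive `mean_error` values.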
## 2.3 Demographic Patterns
```{r}
# Join demographic data to station errors
station_errors_demo <- station_errors %>%
left_join(
station_attributes %>% select(start_station, Med_Inc, Percent_Taking_Transit, Percent_White),
by = "start_station"
) %>%
filter(!is.na(Med_Inc))
# Create plots
p1 <- ggplot(station_errors_demo, aes(x = Med_Inc, y = MAE)) +
geom_point(alpha = 0.5, color = "#3182bd") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
scale_x_continuous(labels = scales::dollar) +
labs(title = "Errors vs. Median Income", x = "Median Income", y = "MAE") +
plotTheme
p2 <- ggplot(station_errors_demo, aes(x = Percent_Taking_Transit, y = MAE)) +
geom_point(alpha = 0.5, color = "#3182bd") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Errors vs. Transit Usage", x = "% Taking Transit", y = "MAE") +
plotTheme
p3 <- ggplot(station_errors_demo, aes(x = Percent_White, y = MAE)) +
geom_point(alpha = 0.5, color = "#3182bd") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Errors vs. Race", x = "% White", y = "MAE") +
plotTheme
grid.arrange(p1, p2, p3, ncol = 2)
```
The weak but noticeable **positive slope with median income** suggests that stations located in higher-income neighborhoods tend to have slightly larger prediction errors. These areas often have more discretionary travel, irregular trip purposes, and stronger weekend or leisure-based demand spikes that a basic temporal-lag model cannot fully capture. As a result, ridership becomes harder to forecast with simple hourly and daily patterns.
In contrast, the **negative association with transit-use share** implies that stations in neighborhoods with a high reliance on public transportation are more predictable. These communities typically have more consistent commuting routines and a tighter dependence on fixed schedules, which aligns very well with the temporal-lag structure of Model 2. When people ride transit regularly, they also tend to use bike share in more stable, repeated daily patterns, making these locations easier for the model to learn.
Finally, the mild **positive relationship between errors and percent White** mirrors the pattern observed with income: neighborhoods that are more affluent, have a higher percentage of White residents, or are less transit-dependent tend to exhibit more variable, less routine mobility behavior. These stations likely experience more irregular trip timing, spontaneous travel, and stronger seasonal fluctuations, features not fully captured by the current model.
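These visual relationships can be quantified with simple correlations between station-level MAE and each census variable (a rough check; the fitted lines in the plots above make the same point):

```r
# Correlation between station-level MAE and each demographic variable
library(dplyr)

station_errors_demo %>%
  summarize(
    cor_income  = cor(MAE, Med_Inc, use = "complete.obs"),
    cor_transit = cor(MAE, Percent_Taking_Transit, use = "complete.obs"),
    cor_white   = cor(MAE, Percent_White, use = "complete.obs")
  )
```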
# Part 3: Feature Engineering & Model Improvement
## 3.1 Add Features
```{r}
city_hall_lat <- 39.9526
city_hall_lon <- -75.1652
holiday_dates <- as.Date(c(
"2025-04-20",
"2025-04-22",
"2025-05-11",
"2025-05-26",
"2025-06-15",
"2025-06-19"
))
```
```{r}
train <- train %>%
mutate(
perfect_weather = ifelse(
Temperature >= 60 & Temperature <= 75 & Precipitation == 0,
1, 0
),
weekend_nice = weekend * perfect_weather,
dist_to_center_city = distHaversine(
cbind(start_lon.y, start_lat.y),
c(city_hall_lon, city_hall_lat)
) / 1000,
holiday = ifelse(date %in% holiday_dates, 1, 0)
)
test <- test %>%
mutate(
perfect_weather = ifelse(
Temperature >= 60 & Temperature <= 75 & Precipitation == 0,
1, 0
),
weekend_nice = weekend * perfect_weather,
dist_to_center_city = distHaversine(
cbind(start_lon.y, start_lat.y),
c(city_hall_lon, city_hall_lat)
) / 1000,
holiday = ifelse(date %in% holiday_dates, 1, 0)
)
```
## 3.2 Implementation
### Add Features to the Best Model
```{r}
model6 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple +
Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day +
perfect_weather + weekend_nice +
dist_to_center_city + holiday,
data = train
)
# Tidy model and keep key variables
coef_table6 <- tidy(model6) %>%
filter(term %in% c(
"Temperature",
"Precipitation",
"lag1Hour",
"lag3Hours",
"lag1day",
"perfect_weather",
"weekend_nice",
"dist_to_center_city",
"holiday"
)) %>%
mutate(
term = recode(term,
"Temperature" = "Temperature (°F)",
"Precipitation" = "Precipitation (in)",
"lag1Hour" = "Trips 1 hour ago",
"lag3Hours" = "Trips 3 hours ago",
"lag1day" = "Trips 24 hours ago",
"perfect_weather" = "Perfect weather (60–75°F, dry)",
"weekend_nice" = "Weekend × perfect weather",
"dist_to_center_city" = "Distance to Center City (km)",
"holiday" = "Holiday indicator"
)
)
kable(
coef_table6 %>%
select(term, estimate, std.error, statistic, p.value),
digits = 3,
col.names = c("Term", "Estimate", "Std. Error", "t value", "p value"),
caption = "Key coefficients in Model 6 (with new engineered features)"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
```{r}
# Compare MAE before and after adding engineered features
# --- Recalculate pred6 safely ---
test <- test %>%
mutate(
pred6 = predict(model6, newdata = test)
)
# --- Compute baseline (model2) and new model (model6) MAE ---
mae_before <- mean(abs(test$Trip_Count - test$pred2), na.rm = TRUE)
mae_after <- mean(abs(test$Trip_Count - test$pred6), na.rm = TRUE)
mae_compare <- data.frame(
Model = c("Baseline Model 2 (Lags Only)",
"New Model 6 (+ Weather + Distance + Holiday)"),
MAE = c(mae_before, mae_after)
)
kable(mae_compare, digits = 3,
caption = "MAE Before vs. After New Features (Q2 Test Set)")
```
```{r,warning=FALSE}
# Report the raw MAE values and their difference
mae2 <- mean(abs(test$Trip_Count - test$pred2), na.rm = TRUE)
mae6 <- mean(abs(test$Trip_Count - test$pred6), na.rm = TRUE)
mae2
mae6
mae6 - mae2  # positive = engineered features slightly worsen out-of-sample MAE
```
To test whether additional behavioral, spatial, and contextual signals could improve prediction accuracy, I introduced three new feature groups: (1) **perfect-weather conditions** and a **weekend × nice-weather interaction**, capturing the intuition that leisure-driven demand spikes under ideal conditions; (2) **distance to Center City**, representing the strong spatial gradient in bike-share usage; and (3) **holiday indicators** for major Q2 events such as Easter, Memorial Day, and Juneteenth, which may alter mobility patterns.
Comparing performance before and after adding these features shows that the enhanced model (Model 6) performs almost identically to the strongest baseline model (Model 2). Model 2 achieves an MAE of **0.6106**, while Model 6 yields **0.6135**, a negligible increase of approximately **0.003 trips**. Although the engineered variables add meaningful behavioral interpretation, they do **not** improve out-of-sample predictive accuracy.
This result reflects the nature of Q2 bike-share demand: spring and early-summer ridership is already strongly explained by **temporal autocorrelation (lag features)** and **hourly/weekly seasonality**, with relatively consistent warm weather and limited extremes. Because these patterns dominate usage, additional features offer little incremental predictive power and instead introduce slight overfitting.
## **3.3 Poisson Model**
```{r}
model6_pois <- glm(
Trip_Count ~ as.factor(hour) + dotw_simple +
Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day +
perfect_weather + weekend_nice +
dist_to_center_city + holiday,
data = train,
family = poisson(link = "log")
)
poisson_summary <- data.frame(
Metric = c(
"Null deviance",
"Residual deviance",
"AIC"
),
Value = c(
round(model6_pois$null.deviance, 2),
round(model6_pois$deviance, 2),
round(model6_pois$aic, 2)
)
)
kable(poisson_summary,
caption = "Poisson Regression Model Summary (Key Metrics)",
col.names = c("Metric", "Value")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
```{r,warning=FALSE}
test <- test %>%
mutate(
pred6_pois = predict(model6_pois, newdata = test, type = "response")
)
mae_pois <- mean(abs(test$Trip_Count - test$pred6_pois), na.rm = TRUE)
mae_pois
```
```{r}
mae_lin <- mean(abs(test$Trip_Count - test$pred6), na.rm = TRUE)
kable(
data.frame(
Model = c("Linear Model 6", "Poisson Model 6"),
MAE = c(mae_lin, mae_pois)
),
digits = 3,
caption = "Out-of-Sample MAE: Linear vs. Poisson Model 6"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
Although trip counts are discrete and non-negative, making Poisson regression a natural candidate, the Poisson specification did not outperform the linear model in this case. The Poisson model substantially reduced the null deviance (from about 701,096 to 441,087), which indicates that it captures important structure in the data. However, when evaluated out-of-sample on the Q2 test set, its MAE is **0.639**, which is noticeably worse than the MAE of the linear Model 6 (**0.613**).
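One plausible reason the Poisson fit trails the linear model is overdispersion: station-hour counts with many zeros and occasional spikes violate the Poisson assumption that the variance equals the mean. A quick diagnostic, sketched below on simulated counts rather than the Indego data, is the ratio of residual deviance to residual degrees of freedom; values well above 1 suggest trying a quasi-Poisson or negative binomial specification instead.

```{r}
set.seed(1)

# Simulated overdispersed counts: negative-binomial draws whose variance
# exceeds the Poisson assumption of variance == mean
x <- runif(500)
y <- rnbinom(500, mu = exp(1 + 2 * x), size = 1.5)

fit <- glm(y ~ x, family = poisson(link = "log"))

# Dispersion ratio: deviance / residual df is roughly 1 for true Poisson data
dispersion <- fit$deviance / fit$df.residual
dispersion  # well above 1 here, signalling overdispersion
```

The same ratio can be read directly off `model6_pois` (`deviance` and `df.residual` are components of any fitted `glm`) to check whether the Indego counts show the same pattern.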
# Part 4: Critical Reflection
## **4.1 Operational implications**
With a final test-set MAE of roughly **0.61 trips per station-hour**, the model is accurate enough to provide meaningful operational support for Indego, especially for routine rebalancing during normal weekdays. Most stations average between 0 and 2 trips per hour in Q2, so an average error of 0.6 represents a relatively small deviation from true demand. In other words, the model reliably captures broad temporal patterns: commuting peaks, weather-driven changes, and weekend shifts, which are the operational factors that matter most for daily fleet management. However, the error analysis shows the model performs less consistently in areas with highly irregular or discretionary travel (Center City, university areas, tourist clusters). These are precisely the locations where under-prediction or over-prediction can create rebalancing problems: late-morning surges, sudden evening spikes, or unusually warm weekends can leave docks empty or bikes stranded unless staff intervene manually.
Given this, the model can be **deployed operationally**, but with clear conditions. It should be used as a *decision-support tool*, not an autonomous system. The predictions can guide baseline redistribution plans, flag high-risk hours, and help prioritize which stations need closer monitoring. But stations with historically volatile demand and days with atypical weather, holidays, or major events will still require human oversight or real-time adjustments.
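As a small illustration of the decision-support use described above, predicted counts could be turned into simple monitoring flags. The data frame, dock capacities, and the 20% threshold below are all hypothetical, not Indego's actual values; this is a sketch of the idea, not a deployed rule.

```{r}
library(dplyr)

# Hypothetical station-hour predictions (stands in for test$pred6)
preds <- tibble::tibble(
  station  = c("A", "A", "B", "B"),
  hour     = c(8, 17, 8, 17),
  pred6    = c(3.2, 0.4, 1.1, 2.8),
  capacity = c(15, 15, 10, 10)  # hypothetical dock counts
)

# Flag station-hours where predicted hourly demand exceeds a fixed share of
# dock capacity -- these are the hours staff would watch most closely
flags <- preds %>%
  mutate(high_risk = pred6 > 0.2 * capacity) %>%
  filter(high_risk)

flags
```

In practice the threshold would be tuned against historical stockout events, and the flags would feed a planner's dashboard rather than trigger automatic redistribution.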
## **4.2 Equity considerations**
The error patterns suggest that prediction accuracy is not evenly distributed across Philadelphia. Stations located in higher-income, majority-White neighborhoods, particularly those with more discretionary and irregular travel, show larger residuals, while stations in transit-dependent, lower-income communities tend to be more predictable. This raises an important equity concern: the model performs best in neighborhoods where travel patterns are already stable and predictable, and worst in areas where riders may rely on Indego as a flexible or secondary mobility option.
If such a system were deployed without safeguards, it could unintentionally reinforce existing disparities. Under-predicting demand in transit-reliant communities could mean empty docks, reduced availability, and longer wait times, directly undermining mobility for riders with fewer alternatives.
To mitigate these risks, several safeguards should be implemented:\
1. **Equity-weighted rebalancing rules** ensuring underserved neighborhoods receive a minimum guaranteed baseline of bike availability regardless of predicted demand\
2. **Error monitoring** that tracks model performance separately by neighborhood demographics\
3. **Human-in-the-loop overrides** where planners can adjust allocations in census tracts showing sustained high error\
4. **Periodic model retraining** with explicit fairness constraints or stratified sampling to ensure equitable performance across different community types
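Safeguard 2 could be prototyped with a simple grouped summary. The demographic grouping below is hypothetical; the real version would join station-level errors to the ACS tract data used earlier in the analysis.

```{r}
library(dplyr)

# Hypothetical station-level errors joined to a tract demographic grouping
station_errors <- tibble::tibble(
  station     = c("A", "B", "C", "D"),
  abs_error   = c(0.9, 0.4, 1.2, 0.3),
  income_band = c("high", "low", "high", "low")  # hypothetical ACS band
)

# Monitor MAE separately by neighborhood type; a persistent gap between
# groups would trigger the human-in-the-loop review described above
equity_check <- station_errors %>%
  group_by(income_band) %>%
  summarise(group_MAE = mean(abs_error), n_stations = n(), .groups = "drop")

equity_check
```

Run on a rolling window, a table like this makes the equity gap an explicit, monitored quantity rather than an incidental finding.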
## **4.3 Model limitations**
Despite incorporating temporal lags, weather, spatial context, and holiday effects, the model still misses several important behavioral patterns. Bicycle demand is heavily influenced by **micro-events** such as neighborhood festivals, block parties, school schedules, university calendars, and special events at venues like the stadium district—none of which are captured in the current dataset. The model also assumes that relationships between predictors and demand are **smooth and stable over time**, but real bike-share patterns often shift abruptly due to construction, temporary station closures, seasonal promotions, or sudden weather changes that are not recorded by hourly temperature and precipitation alone. Moreover, riders make spontaneous decisions in response to perceived safety, traffic conditions, and trip chaining with transit—complex behaviors that cannot be explained using simple linear terms.
If given more time and richer data, the predictive system could be significantly improved by incorporating features that better capture the behavioral, spatial, and contextual complexity of bike-share demand. The current model relies primarily on temporal lags, weather, and basic spatial structure, which limits its ability to account for major demand drivers such as large events, university schedules, tourism flow, or transit disruptions. With access to real event data (concerts, stadium schedules, conventions, sports games), SEPTA operational data, academic calendars for Penn/Drexel/Temple, and park/office/restaurant POI densities, the model would be able to reflect more realistic fluctuations in ridership patterns. These variables are known to heavily shape Philadelphia’s mobility rhythms but are not represented in the present dataset.
Methodologically, the model could be strengthened by moving beyond linear regression into more flexible forms to capture non-linear relationships and temporal dependencies that the current model cannot express. Expanding the training window to multiple years rather than a single quarter would also allow the model to learn seasonal differences, long-term trends, and year-over-year behavioral changes. With richer data and more advanced modeling frameworks, Indego could move from hour-ahead prediction toward more robust real-time forecasting and dynamic rebalancing systems.
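As one hedged example of the "more flexible forms" mentioned above, a generalized additive model can let the hour-of-day and temperature effects bend nonlinearly instead of entering as fixed-effect dummies and a linear slope. The sketch below fits such a model on simulated data that reuses the training-set column names; it is an illustration of the approach, not a fitted Indego model.

```{r}
library(mgcv)  # recommended package shipped with R

set.seed(42)

# Simulated station-hour data mimicking the training columns
n <- 1000
sim <- data.frame(
  hour        = sample(0:23, n, replace = TRUE),
  Temperature = runif(n, 40, 90),
  lag1Hour    = rpois(n, 1)
)
sim$Trip_Count <- rpois(
  n,
  exp(0.3 + 0.5 * sin(2 * pi * sim$hour / 24) +
        0.01 * sim$Temperature + 0.2 * sim$lag1Hour)
)

# Cyclic spline for hour-of-day (hour 23 wraps to hour 0), a smooth for
# temperature, and the lag term kept linear
gam_fit <- gam(
  Trip_Count ~ s(hour, bs = "cc") + s(Temperature) + lag1Hour,
  data = sim, family = poisson(link = "log")
)

summary(gam_fit)$r.sq
```

The cyclic basis (`bs = "cc"`) is the natural choice for hour-of-day, since demand at 11 p.m. should join smoothly to demand at midnight, something the `as.factor(hour)` dummies in Model 6 cannot enforce.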