---
title: "Space-Time Prediction of Bike Share Demand: Philadelphia Indego"
subtitle: "MUSA 5080 - Fall 2025"
author: "Isabelle Li"
date: today
format:
  html:
    code-fold: show
    code-tools: true
    toc: true
    toc-depth: 3
    toc-location: left
    theme: cosmo
    embed-resources: true
editor: visual
execute:
  warning: false
  message: false
---
# Part 1: Data Import & Preparation
## 1.1 Setup
### Load packages
```{r load_libraries,echo=FALSE}
# Core tidyverse
library(tidyverse)
library(lubridate)
# Spatial data
library(sf)
library(tigris)
library(geosphere)
# Census data
library(tidycensus)
# Weather data
library(riem)
# Visualization
library(viridis)
library(gridExtra)
library(knitr)
library(kableExtra)
library(broom)
library(grid)
library(ggpubr)
# here!
library(here)
options(scipen = 999)
```
### Define theme
```{r themes}
plotTheme <- theme(
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10),
plot.caption = element_text(size = 8),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
axis.title = element_text(size = 11, face = "bold"),
panel.background = element_blank(),
panel.grid.major = element_line(colour = "#D0D0D0", linewidth = 0.2),
panel.grid.minor = element_blank(),
axis.ticks = element_blank(),
legend.position = "right"
)
mapTheme <- theme(
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10),
plot.caption = element_text(size = 8),
axis.line = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
panel.background = element_blank(),
panel.border = element_blank(),
panel.grid.major = element_line(colour = 'transparent'),
panel.grid.minor = element_blank(),
legend.position = "right",
plot.margin = margin(1, 1, 1, 1, 'cm'),
legend.key.height = unit(1, "cm"),
legend.key.width = unit(0.2, "cm")
)
palette5 <- c("#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c")
```
### Set Census API Key
```{r census_key, eval=FALSE,include=FALSE}
# Replace with your own key; install = TRUE stores it in .Renviron for future sessions
census_api_key("YOUR_CENSUS_API_KEY", overwrite = TRUE, install = TRUE)
```
```{r census_key_hidden, include=FALSE}
# Read the key from the environment at render time; never hard-code it in a shared document
census_api_key(Sys.getenv("CENSUS_API_KEY"), overwrite = TRUE)
```
## 1.2 Data Import
```{r load_indego,echo=FALSE,include=FALSE}
# Read Q2 2025 data
indego <- read_csv(here("assignments/assignment5/data/indego-trips-2025-q2.csv"))
# Quick look at the data
glimpse(indego)
```
```{r explore_data,include=FALSE}
# How many trips?
cat("Total trips in Q2 2025:", nrow(indego), "\n")
# Date range
cat("Date range:",
format(min(mdy_hm(indego$start_time))), "to",
format(max(mdy_hm(indego$start_time))), "\n")
# How many unique stations?
cat("Unique start stations:", length(unique(indego$start_station)), "\n")
# Trip types
table(indego$trip_route_category)
# Passholder types
table(indego$passholder_type)
# Bike types
table(indego$bike_type)
```
### Create Time Bins
```{r create_time_bins}
indego <- indego %>%
mutate(
# Parse datetime
start_datetime = mdy_hm(start_time),
end_datetime = mdy_hm(end_time),
# Create hourly bins
interval60 = floor_date(start_datetime, unit = "hour"),
# Extract time features
week = week(interval60),
month = month(interval60, label = TRUE),
dotw = wday(interval60, label = TRUE),
hour = hour(interval60),
date = as.Date(interval60),
# Create useful indicators
weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
)
```
## 1.3 Exploratory Analysis
### Trips over time
```{r trips_over_time,echo=FALSE}
# Daily trip counts
daily_trips <- indego %>%
group_by(date) %>%
summarize(trips = n())
ggplot(daily_trips, aes(x = date, y = trips)) +
geom_line(color = "#3182bd", linewidth = 1) +
geom_smooth(se = FALSE, color = "red", linetype = "dashed") +
labs(
title = "Indego Daily Ridership - Q2 2025",
subtitle = "Spring demand patterns in Philadelphia",
x = "Date",
y = "Daily Trips",
caption = "Source: Indego bike share"
) +
plotTheme
```
The Q2 2025 ridership pattern reflects typical late-spring to early-summer behavior. Overall, daily trip volumes are high compared with the winter months, with most days falling between 3,500 and 5,000 trips.
### Hourly Patterns
```{r hourly_patterns,echo=FALSE}
# Average trips by hour and day type
hourly_patterns <- indego %>%
group_by(hour, weekend) %>%
summarize(avg_trips = n() / n_distinct(date)) %>%
mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))
ggplot(hourly_patterns, aes(x = hour, y = avg_trips, color = day_type)) +
geom_line(linewidth = 1.2) +
scale_color_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
labs(
title = "Average Hourly Ridership Patterns",
subtitle = "Clear commute patterns on weekdays",
x = "Hour of Day",
y = "Average Trips per Hour",
color = "Day Type"
) +
plotTheme
```
The hourly ridership profile shows a strong distinction between weekdays and weekends. On weekdays, the pattern is dominated by two sharp peaks: a large morning spike around 8–9 AM and an even stronger afternoon peak around 5–6 PM. These align closely with traditional commuting hours, indicating that Indego is heavily used for work-related travel during the week.
In contrast, weekend ridership follows a flatter, more gradual curve. Activity starts later in the morning, rises steadily through midday, and maintains moderate levels into the afternoon without a pronounced rush-hour peak. This reflects more recreational and discretionary trips rather than structured commuting patterns.
### Top Stations
```{r top_stations}
# Most popular origin stations
top_stations <- indego %>%
count(start_station, start_lat, start_lon, name = "trips") %>%
arrange(desc(trips)) %>%
head(20)
kable(top_stations,
caption = "Top 20 Indego Stations by Trip Origins",
format.args = list(big.mark = ",")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
### Get Philadelphia Spatial Context
```{r load_census,include=FALSE}
# Get Philadelphia census tracts
philly_census <- get_acs(
geography = "tract",
variables = c(
"B01003_001", # Total population
"B19013_001", # Median household income
"B08301_001", # Total commuters
"B08301_010", # Commute by transit
"B02001_002", # White alone
"B25077_001" # Median home value
),
state = "PA",
county = "Philadelphia",
year = 2022,
geometry = TRUE,
output = "wide"
) %>%
rename(
Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
Total_Commuters = B08301_001E,
Transit_Commuters = B08301_010E,
White_Pop = B02001_002E,
Med_Home_Value = B25077_001E
) %>%
mutate(
Percent_Taking_Transit = (Transit_Commuters / Total_Commuters) * 100,
Percent_White = (White_Pop / Total_Pop) * 100
) %>%
st_transform(crs = 4326) # WGS84 for lat/lon matching
```
### Join Census Data to Stations
```{r}
# Create sf object for stations
stations_sf <- indego %>%
distinct(start_station, start_lat, start_lon) %>%
filter(!is.na(start_lat), !is.na(start_lon)) %>%
st_as_sf(coords = c("start_lon", "start_lat"), crs = 4326)
# Spatial join to get census tract for each station
stations_census <- st_join(stations_sf, philly_census, left = TRUE) %>%
st_drop_geometry()
# Add census variables back to the trip data
indego_census <- indego %>%
left_join(
stations_census %>%
select(start_station, Med_Inc, Percent_Taking_Transit,
Percent_White, Total_Pop),
by = "start_station"
)
```
```{r}
# Identify which stations to keep
valid_stations <- stations_census %>%
filter(!is.na(Med_Inc)) %>%
pull(start_station)
# Filter trip data to valid stations only
indego_census <- indego %>%
filter(start_station %in% valid_stations) %>%
left_join(
stations_census %>%
select(start_station, Med_Inc, Percent_Taking_Transit,
Percent_White, Total_Pop),
by = "start_station"
)
```
### Get Weather Data
```{r}
# Get weather from Philadelphia International Airport (KPHL)
weather_data <- riem_measures(
station = "PHL", # Philadelphia International Airport
date_start = "2025-04-01",
date_end = "2025-07-31"
)
# Process weather data
weather_processed <- weather_data %>%
mutate(
interval60 = floor_date(valid, unit = "hour"),
Temperature = tmpf, # Temperature in Fahrenheit
Precipitation = ifelse(is.na(p01i), 0, p01i), # Hourly precip in inches
Wind_Speed = sknt # Wind speed in knots
) %>%
select(interval60, Temperature, Precipitation, Wind_Speed) %>%
distinct()
# Check for missing hours and interpolate if needed
weather_complete <- weather_processed %>%
complete(interval60 = seq(min(interval60), max(interval60), by = "hour")) %>%
fill(Temperature, Precipitation, Wind_Speed, .direction = "down")
```
### Create Space-Time Panel
```{r aggregate_trips,results='hide'}
# Count trips by station-hour
trips_panel <- indego_census %>%
group_by(interval60, start_station, start_lat, start_lon,
Med_Inc, Percent_Taking_Transit, Percent_White, Total_Pop) %>%
summarize(Trip_Count = n()) %>%
ungroup()
# How many station-hour observations?
nrow(trips_panel)
# How many unique stations?
length(unique(trips_panel$start_station))
# How many unique hours?
length(unique(trips_panel$interval60))
```
### Create Complete Panel Structure
```{r complete_panel}
# Calculate expected panel size
n_stations <- length(unique(trips_panel$start_station))
n_hours <- length(unique(trips_panel$interval60))
expected_rows <- n_stations * n_hours
cat("Expected panel rows:", format(expected_rows, big.mark = ","), "\n")
cat("Current rows:", format(nrow(trips_panel), big.mark = ","), "\n")
cat("Missing rows:", format(expected_rows - nrow(trips_panel), big.mark = ","), "\n")
# Create complete panel
study_panel <- expand.grid(
interval60 = unique(trips_panel$interval60),
start_station = unique(trips_panel$start_station)
) %>%
# Join trip counts
left_join(trips_panel, by = c("interval60", "start_station")) %>%
# Replace NA trip counts with 0
mutate(Trip_Count = replace_na(Trip_Count, 0))
# Fill in station attributes (they're the same for all hours)
station_attributes <- trips_panel %>%
group_by(start_station) %>%
summarize(
start_lat = first(start_lat),
start_lon = first(start_lon),
Med_Inc = first(Med_Inc),
Percent_Taking_Transit = first(Percent_Taking_Transit),
Percent_White = first(Percent_White),
Total_Pop = first(Total_Pop)
)
study_panel <- study_panel %>%
left_join(station_attributes, by = "start_station")
# Verify we have complete panel
cat("Complete panel rows:", format(nrow(study_panel), big.mark = ","), "\n")
```
### Add Time Features
```{r add_time_features}
study_panel <- study_panel %>%
mutate(
week = week(interval60),
month = lubridate::month(interval60, label = TRUE, abbr = TRUE),
month = factor(month, levels = month.abb), # Jan, Feb, ..., Dec
dotw = wday(interval60, label = TRUE),
hour = hour(interval60),
date = as.Date(interval60),
weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
)
```
### Join Weather Data
```{r join_weather,warning=FALSE}
study_panel <- study_panel %>%
left_join(weather_complete, by = "interval60")
```
### Create Temporal Lag Variables
```{r create_lags}
# Sort by station and time
study_panel <- study_panel %>%
arrange(start_station, interval60)
# Create lag variables WITHIN each station
study_panel <- study_panel %>%
group_by(start_station) %>%
mutate(
lag1Hour = lag(Trip_Count, 1),
lag2Hours = lag(Trip_Count, 2),
lag3Hours = lag(Trip_Count, 3),
lag12Hours = lag(Trip_Count, 12),
lag1day = lag(Trip_Count, 24)
) %>%
ungroup()
# Remove rows with NA lags (first 24 hours for each station)
study_panel_complete <- study_panel %>%
filter(!is.na(lag1day))
```
```{r,echo=FALSE,include=FALSE}
# 1. put back start_lat / start_lon
study_panel_complete <- study_panel_complete %>%
mutate(
start_lat = start_lat.y,
start_lon = start_lon.y
)
# 2. perfect weather + weekend interaction
study_panel_complete <- study_panel_complete %>%
mutate(
perfect_weather = ifelse(
Temperature >= 60 & Temperature <= 80 & Precipitation == 0,
1, 0
),
weekend_nice = weekend * perfect_weather
)
# 3. distance to center city
city_hall_lat <- 39.9526
city_hall_lon <- -75.1652
study_panel_complete <- study_panel_complete %>%
mutate(
dist_to_center_city = distHaversine(
cbind(start_lon, start_lat),
c(city_hall_lon, city_hall_lat)
) / 1000
)
# 4. holiday indicator
holiday_dates <- as.Date(c(
"2025-04-20", # Easter
"2025-04-22", # Earth Day
"2025-05-11", # Mother's Day
"2025-05-26", # Memorial Day
"2025-06-15", # Father's Day
"2025-06-19" # Juneteenth
))
study_panel_complete <- study_panel_complete %>%
mutate(
holiday = ifelse(date %in% holiday_dates, 1, 0)
)
```
### Temporal Train/Test Split
```{r temporal_split}
# Identify early and late periods for Q2
# Q2 2025 spans roughly weeks 13–26 of the year
early_stations <- study_panel_complete %>%
filter(week >= 13, week < 22) %>% # weeks 13–21 (training period)
filter(Trip_Count > 0) %>%
distinct(start_station) %>%
pull(start_station)
late_stations <- study_panel_complete %>%
filter(week >= 22, week <= 26) %>% # weeks 22–26
filter(Trip_Count > 0) %>%
distinct(start_station) %>%
pull(start_station)
# Keep only stations appearing in both early and late
common_stations <- intersect(early_stations, late_stations)
study_panel_complete <- study_panel_complete %>%
filter(start_station %in% common_stations)
# Now create train/test split
train <- study_panel_complete %>%
filter(week >= 13, week < 22)
test <- study_panel_complete %>%
filter(week >= 22, week <= 26)
```
## 1.4 Build Predictive Models
### Model 1: Baseline (Time + Weather)
```{r model1}
# Create day of week factor with treatment (dummy) coding
train <- train %>%
mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
# Set contrasts to treatment coding (dummy variables)
contrasts(train$dotw_simple) <- contr.treatment(7)
# Now run the model
model1 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation,
data = train
)
model1_tidy <- tidy(model1) %>%
filter(term %in% c("Temperature", "Precipitation",
"dotw_simple2", "dotw_simple3", "dotw_simple4",
"dotw_simple5", "dotw_simple6", "dotw_simple7")) %>%
mutate(
estimate = round(estimate, 3),
p.value = format.pval(p.value, digits = 3)
)
kable(model1_tidy,
caption = "Selected Coefficients from Model 1 (Q2 2025)",
col.names = c("Term", "Estimate", "Std. Error", "t value", "p value")) %>%
kable_classic(full_width = FALSE)
```
Ridership is driven largely by time of day: morning and evening commute hours show the largest positive effects, while late-night hours have the lowest activity.\
Weekdays generally see higher demand than weekends, and temperature is strongly and positively associated with trip counts. Precipitation shows a small negative effect but is not statistically strong in this period.
### Model 2: Add Temporal Lags
```{r model2}
model2 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day,
data = train
)
model2_tidy <- tidy(model2) %>%
filter(term %in% c("Temperature", "Precipitation",
"lag1Hour", "lag3Hours", "lag1day")) %>%
mutate(
estimate = round(estimate, 3),
std.error = round(std.error, 3),
statistic = round(statistic, 1),
p.value = format.pval(p.value, digits = 3, eps = 1e-4)
)
kable(
model2_tidy,
caption = "Key coefficients from Model 2 (Q2 2025)",
col.names = c("Term", "Estimate", "Std. Error", "t value", "p value")
) %>%
kable_classic(full_width = FALSE)
```
The coefficients on the lag variables are all positive and highly significant: the previous hour’s demand (lag1Hour, β ≈ 0.42), demand three hours ago (lag3Hours, β ≈ 0.12), and demand 24 hours ago (lag1day, β ≈ 0.14) all strongly predict current ridership. This means that if a station was busy in the recent past, it is very likely to be busy now as well. Compared to Model 1, the effect of temperature becomes smaller because part of the temporal pattern is now captured by lagged demand. Precipitation becomes more negative and statistically significant, indicating that rainy hours reduce trips even after controlling for recent usage. Overall, adding lag features increases R² from about 0.11 to 0.36 and reduces the residual error, showing that short-term temporal dependence is a key driver of Indego ridership.
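The in-sample gain from the lag features can be verified directly from the fitted objects. A quick check, assuming `model1` and `model2` from the chunks above are still in memory:

```r
# Compare in-sample fit of the baseline and lagged models
library(dplyr)
library(broom)

bind_rows(
  glance(model1) %>% mutate(model = "Model 1: Time + Weather"),
  glance(model2) %>% mutate(model = "Model 2: + Temporal Lags")
) %>%
  select(model, r.squared, adj.r.squared, sigma)
```

The `sigma` column is the residual standard error, so the table summarizes both the R² jump and the error reduction discussed above in one place.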
### Model 3: Add Demographics
```{r model3}
model3 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day +
Med_Inc.x + Percent_Taking_Transit.y + Percent_White.y,
data = train
)
model3_tidy <- tidy(model3) %>%
filter(term %in% c("Temperature", "Precipitation",
"lag1Hour", "lag3Hours", "lag1day",
"Med_Inc.x", "Percent_Taking_Transit.y", "Percent_White.y")) %>%
mutate(
estimate = round(estimate, 4),
std.error = round(std.error, 4),
statistic = round(statistic, 1),
p.value = format.pval(p.value, digits = 3, eps = 1e-4)
)
kable(
model3_tidy,
caption = "Key Coefficients for Model 3 (Adding Census Features)",
col.names = c("Term", "Estimate", "Std. Error", "t value", "p value")
) %>%
kable_classic(full_width = FALSE)
```
The lag variables (lag1Hour, lag3Hours, lag1day) remain the strongest predictors, confirming that short-term demand history is still the main driver of current ridership. Among the census features, higher median income around a station is associated with slightly higher usage, while neighborhoods with higher transit dependence tend to show lower Indego demand. The percentage of White residents is positively associated with trip counts. Temperature remains positively significant, whereas precipitation becomes strongly negative after controlling for lagged demand and neighborhood attributes.
### Model 4: Add Station Fixed Effects
```{r model4}
model4 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day +
Med_Inc.x + Percent_Taking_Transit.y + Percent_White.y +
as.factor(start_station),
data = train
)
model4_metrics <- tibble(
Metric = c("R-squared", "Adjusted R-squared", "Residual Std. Error", "N (train)"),
Value = c(
round(summary(model4)$r.squared, 4),
round(summary(model4)$adj.r.squared, 4),
round(summary(model4)$sigma, 4),
nrow(train)
)
)
kable(model4_metrics,
caption = "Model 4: Key Performance Metrics",
col.names = c("Metric", "Value")) %>%
kable_classic(full_width = FALSE)
```
### Model 5: Add Rush Hour Interaction
```{r model5}
model5 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day + rush_hour + month +
Med_Inc.x + Percent_Taking_Transit.y + Percent_White.y +
as.factor(start_station) +
rush_hour * weekend, # Rush hour effects different on weekends
data = train
)
# Extract key metrics from Model 5
model5_metrics <- tibble(
Metric = c("R-squared", "Adjusted R-squared", "Residual Std. Error", "N (train)"),
Value = c(
round(summary(model5)$r.squared, 4),
round(summary(model5)$adj.r.squared, 4),
round(summary(model5)$sigma, 4),
nrow(train)
)
)
kable(model5_metrics,
caption = "Model 5: Key Performance Metrics",
col.names = c("Metric", "Value")) %>%
kable_classic(full_width = FALSE)
```
```{r calculate_mae}
# Get predictions on test set
# Create day of week factor with treatment (dummy) coding
test <- test %>%
mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
# Set contrasts to treatment coding (dummy variables)
contrasts(test$dotw_simple) <- contr.treatment(7)
test$month <- factor(test$month, levels = levels(model5$model$month))
test <- test %>%
mutate(
pred1 = predict(model1, newdata = test),
pred2 = predict(model2, newdata = test),
pred3 = predict(model3, newdata = test),
pred4 = predict(model4, newdata = test),
pred5 = predict(model5, newdata = test)
)
# Calculate MAE for each model
mae_results <- data.frame(
Model = c(
"1. Time + Weather",
"2. + Temporal Lags",
"3. + Demographics",
"4. + Station FE",
"5. + Rush Hour Interaction"
),
MAE = c(
mean(abs(test$Trip_Count - test$pred1), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred2), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred3), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred4), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred5), na.rm = TRUE)
)
)
kable(mae_results,
digits = 2,
caption = "Mean Absolute Error by Model (Test Set)",
col.names = c("Model", "MAE (trips)")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
### Model Performance Comparison (Q1 vs Q2)
| Model | MAE (Q1 2025) | MAE (Q2 2025) |
|-----------------------------|---------------|---------------|
| 1\. Time + Weather | **0.60** | **0.80** |
| 2\. + Temporal Lags | **0.50** | **0.61** |
| 3\. + Demographics | 0.74 | 0.86 |
| 4\. + Station FE | 0.73 | 0.85 |
| 5\. + Rush Hour Interaction | 0.73 | 0.80 |
**Overall, prediction errors are higher in Q2 than in Q1.**\
For example, the best model (Model 2) improves MAE from about 0.50 trips in Q1 to about 0.61 trips in Q2. All other models also show larger MAE in Q2 (e.g., Model 1 increases from 0.60 to 0.80 trips).\
This suggests that ridership in spring–early summer is somewhat harder to predict than in winter. In Q1, demand is dominated by regular commuter patterns and strong weather effects (cold temperatures and snow suppress many trips), which makes behavior more stable. In Q2, warmer weather allows for more casual and recreational trips, as well as events and irregular travel, so there is more noise that the model cannot fully capture.
**Temporal patterns also differ between Q1 and Q2.**\
In Q1, ridership is strongly concentrated in weekday morning and evening peaks, consistent with commuter trips, and weekends are relatively quiet. In Q2, there is more activity during the middle of the day and on weekends, reflecting recreational and leisure use in warmer months. This makes demand less purely “rush-hour driven” and more spread out, which again increases prediction error.
**In Q2, the most important predictors are still the temporal lag features.**\
The coefficients on `lag1Hour`, `lag3Hours`, and `lag1day` are large and highly significant, indicating that recent station-level demand is the strongest signal of current ridership. Time-of-day (hour dummies), day-of-week, and temperature also matter, but their marginal effects become smaller once lags are included. Census variables and station fixed effects add some spatial structure, but they do not reduce MAE and in some cases slightly worsen out-of-sample performance, likely because they are associated with missing data and additional model complexity.
# Part 2: Error Analysis
## 2.1 Spatial Patterns
### Create error maps
```{r}
station_errors <- test %>%
filter(!is.na(pred2)) %>%
group_by(start_station, start_lat, start_lon) %>%
summarize(
MAE = mean(abs(Trip_Count - pred2), na.rm = TRUE),
avg_demand = mean(Trip_Count, na.rm = TRUE),
.groups = "drop"
) %>%
filter(!is.na(start_lat), !is.na(start_lon))
# Map 1 – Prediction Errors
p1_clean <- ggplot() +
geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.2) +
geom_point(
data = station_errors,
aes(x = start_lon, y = start_lat, color = MAE),
size = 3.0, alpha = 0.7
) +
scale_color_viridis(
option = "plasma",
name = "MAE (trips)",
direction = -1
) +
labs(
title = "Prediction Errors",
subtitle = "Higher in Center City"
) +
mapTheme +
theme(
legend.position = "bottom",
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10)
)
# Map 2 – Average Demand
p2_clean <- ggplot() +
geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.2) +
geom_point(
data = station_errors,
aes(x = start_lon, y = start_lat, color = avg_demand),
size = 3.0, alpha = 0.7
) +
scale_color_viridis(
option = "viridis",
name = "Avg Demand (trips/hour)",
direction = -1
) +
labs(
title = "Average Demand",
subtitle = "Trips per station-hour"
) +
mapTheme +
theme(
legend.position = "bottom",
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10)
)
# Only the maps (no legends)
p1_noleg <- p1_clean + theme(legend.position = "none")
p2_noleg <- p2_clean + theme(legend.position = "none")
# Extract legends
leg1 <- get_legend(p1_clean)
leg2 <- get_legend(p2_clean)
# Arrange clean layout
final_plot <- ggarrange(
p1_noleg, p2_noleg,
ncol = 2,
labels = NULL
)
final_output <- ggarrange(
final_plot,
ggarrange(leg1, leg2, ncol = 2),
ncol = 1,
heights = c(4, 1)
)
final_output
```
From the *Prediction Errors* map (Model 2), the stations with the highest MAE all cluster tightly around:
### **1. Center City (highest error cluster)**
- Rittenhouse Square
- Market Street corridor
- City Hall / Suburban Station area
- University City eastern edge (34th–30th Street)
### **2. University City (moderately high error)**
- Drexel / Penn campus areas
- 33rd St, Lancaster Ave
- Walnut/Chestnut corridor
### **3. Waterfront / Northern Liberties (moderate)**
- Penn’s Landing
- Spring Garden
### **4. Areas with low error**
- Residential neighborhoods (West Philly, South Philly)
Highest errors occur in Center City and eastern University City. These areas include Rittenhouse Square, Market Street, City Hall / Suburban Station, Drexel/Penn corridors, and parts of the waterfront. They share two characteristics: (1) **high demand**, and (2) **high behavioral variability**.
Errors are likely driven by features missing from the model such as special events, tourism flows, campus academic schedules, and spatial spillover between nearby stations. Center City also has the strongest AM/PM peak dynamics, making it more challenging for a linear model to predict accurately.
Low-demand residential neighborhoods show much lower error (\<0.5 MAE) because ridership is stable and predictable.
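One way to probe the spatial-spillover hypothesis is to engineer a spatial lag feature: the average same-hour demand at each station's nearest neighbors. A minimal sketch using the `station_attributes` and `study_panel_complete` objects built earlier (the k = 5 cutoff is illustrative, and the `relationship` argument assumes dplyr ≥ 1.1):

```r
library(dplyr)
library(geosphere)

# Pairwise great-circle distances between stations (meters)
coords <- station_attributes %>% select(start_lon, start_lat)
dmat <- distm(as.matrix(coords), fun = distHaversine)
diag(dmat) <- Inf  # a station is not its own neighbor

# IDs of each station's 5 nearest neighbors (k x n matrix, one column per station)
k <- 5
nearest_ids <- apply(dmat, 1, function(d) {
  station_attributes$start_station[order(d)[1:k]]
})

# Long lookup table: focal station -> each of its k neighbors
neighbor_lookup <- tibble(
  start_station = rep(station_attributes$start_station, each = k),
  neighbor      = as.vector(nearest_ids)
)

# Average same-hour demand across each station's neighbors
spatial_lag <- study_panel_complete %>%
  select(interval60, neighbor = start_station, neighbor_trips = Trip_Count) %>%
  inner_join(neighbor_lookup, by = "neighbor", relationship = "many-to-many") %>%
  group_by(start_station, interval60) %>%
  summarize(spatial_lag = mean(neighbor_trips), .groups = "drop")
```

Joining `spatial_lag` onto the panel and adding it to Model 2 would test whether demand at adjacent stations carries signal beyond a station's own temporal lags.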
## 2.2 Temporal Patterns
```{r}
# MAE by time of day and day type
test <- test %>%
mutate(
error = Trip_Count - pred2,
abs_error = abs(error),
time_of_day = case_when(
hour < 7 ~ "Overnight",
hour >= 7 & hour < 10 ~ "AM Rush",
hour >= 10 & hour < 15 ~ "Mid-Day",
hour >= 15 & hour <= 18 ~ "PM Rush",
hour > 18 ~ "Evening"
)
)
temporal_errors <- test %>%
group_by(time_of_day, weekend) %>%
summarize(
MAE = mean(abs_error, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))
ggplot(temporal_errors, aes(x = time_of_day, y = MAE, fill = day_type)) +
geom_col(position = "dodge") +
scale_fill_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
labs(
title = "Prediction Errors by Time Period",
subtitle = "When is the model struggling most?",
x = "Time of Day",
y = "Mean Absolute Error (trips)",
fill = "Day Type"
) +
plotTheme +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
Across the full time spectrum, prediction errors are highest during the peak commute periods, especially the **PM rush**, where demand rises sharply and fluctuates too quickly for the model to follow, leading to consistent under-prediction of the strongest spikes. The **AM rush** shows a similar pattern, though slightly milder, reflecting the model’s difficulty capturing sudden early-morning surges.
Mid-day errors are moderately high as well, particularly on weekends, when riding behavior is less routine and more sensitive to social activities or weather changes, making demand harder to predict. In contrast, **overnight hours** have the lowest errors because ridership is both low and stable, leaving little room for large deviations. Taken together, these patterns show that the model tends to underestimate sudden increases and slightly overestimate during gradual declines, a structural bias that appears repeatedly across hours and between weekdays and weekends.
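The under-/over-prediction claim can be checked directly with the signed error already computed above (since `error = Trip_Count - pred2`, a positive mean means under-prediction):

```r
# Mean signed error by time period: positive = model under-predicts
library(dplyr)

test %>%
  mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday")) %>%
  group_by(time_of_day, day_type) %>%
  summarize(
    mean_error = mean(error, na.rm = TRUE),
    MAE        = mean(abs_error, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(mean_error))
```

If the structural-bias interpretation is right, the AM and PM rush rows should show the largest positive `mean_error` values.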
## 2.3 Demographic Patterns
```{r}
# Join demographic data to station errors
station_errors_demo <- station_errors %>%
left_join(
station_attributes %>% select(start_station, Med_Inc, Percent_Taking_Transit, Percent_White),
by = "start_station"
) %>%
filter(!is.na(Med_Inc))
# Create plots
p1 <- ggplot(station_errors_demo, aes(x = Med_Inc, y = MAE)) +
geom_point(alpha = 0.5, color = "#3182bd") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
scale_x_continuous(labels = scales::dollar) +
labs(title = "Errors vs. Median Income", x = "Median Income", y = "MAE") +
plotTheme
p2 <- ggplot(station_errors_demo, aes(x = Percent_Taking_Transit, y = MAE)) +
geom_point(alpha = 0.5, color = "#3182bd") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Errors vs. Transit Usage", x = "% Taking Transit", y = "MAE") +
plotTheme
p3 <- ggplot(station_errors_demo, aes(x = Percent_White, y = MAE)) +
geom_point(alpha = 0.5, color = "#3182bd") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Errors vs. Race", x = "% White", y = "MAE") +
plotTheme
grid.arrange(p1, p2, p3, ncol = 2)
```
The weak but noticeable **positive slope with median income** suggests that stations located in higher-income neighborhoods tend to have slightly larger prediction errors. These areas often have more discretionary travel, irregular trip purposes, and stronger weekend or leisure-based demand spikes that a basic temporal-lag model cannot fully capture. As a result, ridership becomes harder to forecast with simple hourly and daily patterns.
In contrast, the **negative association with transit-use share** implies that stations in neighborhoods with a high reliance on public transportation are more predictable. These communities typically have more consistent commuting routines and a tighter dependence on fixed schedules, which aligns very well with the temporal-lag structure of Model 2. When people ride transit regularly, they also tend to use bike share in more stable, repeated daily patterns, making these locations easier for the model to learn.
Finally, the mild **positive relationship between errors and percent White** mirrors the pattern observed with income: neighborhoods that are more affluent, have a higher percentage of White residents, or are less transit-dependent tend to exhibit more variable, less routine mobility behavior. These stations likely experience more irregular trip timing, spontaneous travel, and stronger seasonal fluctuations, features not fully captured by the current model.
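These visual relationships can be quantified with simple correlations between station-level MAE and each census variable (a rough check; the fitted lines in the plots above make the same point):

```r
# Correlation between station-level MAE and each demographic variable
library(dplyr)

station_errors_demo %>%
  summarize(
    cor_income  = cor(MAE, Med_Inc, use = "complete.obs"),
    cor_transit = cor(MAE, Percent_Taking_Transit, use = "complete.obs"),
    cor_white   = cor(MAE, Percent_White, use = "complete.obs")
  )
```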
# Part 3: Feature Engineering & Model Improvement
## 3.1 Add Features
```{r}
city_hall_lat <- 39.9526
city_hall_lon <- -75.1652
holiday_dates <- as.Date(c(
"2025-04-20",
"2025-04-22",
"2025-05-11",
"2025-05-26",
"2025-06-15",
"2025-06-19"
))
```
```{r}
train <- train %>%
mutate(
perfect_weather = ifelse(
Temperature >= 60 & Temperature <= 75 & Precipitation == 0,
1, 0
),
weekend_nice = weekend * perfect_weather,
dist_to_center_city = distHaversine(
cbind(start_lon.y, start_lat.y),
c(city_hall_lon, city_hall_lat)
) / 1000,
holiday = ifelse(date %in% holiday_dates, 1, 0)
)
test <- test %>%
mutate(
perfect_weather = ifelse(
Temperature >= 60 & Temperature <= 75 & Precipitation == 0,
1, 0
),
weekend_nice = weekend * perfect_weather,
dist_to_center_city = distHaversine(
cbind(start_lon.y, start_lat.y),
c(city_hall_lon, city_hall_lat)
) / 1000,
holiday = ifelse(date %in% holiday_dates, 1, 0)
)
```
## 3.2 Implementation
### Add Features to the Best Model
```{r}
model6 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple +
Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day +
perfect_weather + weekend_nice +
dist_to_center_city + holiday,
data = train
)
# Tidy model and keep key variables
coef_table6 <- tidy(model6) %>%
filter(term %in% c(
"Temperature",
"Precipitation",
"lag1Hour",
"lag3Hours",
"lag1day",
"perfect_weather",
"weekend_nice",
"dist_to_center_city",
"holiday"
)) %>%
mutate(
term = recode(term,
"Temperature" = "Temperature (°F)",
"Precipitation" = "Precipitation (in)",
"lag1Hour" = "Trips 1 hour ago",
"lag3Hours" = "Trips 3 hours ago",
"lag1day" = "Trips 24 hours ago",
"perfect_weather" = "Perfect weather (60–75°F, dry)",
"weekend_nice" = "Weekend × perfect weather",
"dist_to_center_city" = "Distance to Center City (km)",
"holiday" = "Holiday indicator"
)
)
kable(
coef_table6 %>%
select(term, estimate, std.error, statistic, p.value),
digits = 3,
col.names = c("Term", "Estimate", "Std. Error", "t value", "p value"),
caption = "Key coefficients in Model 6 (with new engineered features)"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
```{r}
# Compare MAE before and after adding engineered features
# --- Recalculate pred6 safely ---
test <- test %>%
mutate(
pred6 = predict(model6, newdata = test)
)
# --- Compute baseline (model2) and new model (model6) MAE ---
mae_before <- mean(abs(test$Trip_Count - test$pred2), na.rm = TRUE)
mae_after <- mean(abs(test$Trip_Count - test$pred6), na.rm = TRUE)
mae_compare <- data.frame(
Model = c("Baseline Model 2 (Lags Only)",
"New Model 6 (+ Weather + Distance + Holiday)"),
MAE = c(mae_before, mae_after)
)
kable(mae_compare, digits = 3,
caption = "MAE Before vs. After New Features (Q2 Test Set)")
```
```{r,warning=FALSE}
# Report the raw MAE values and their difference
mae2 <- mean(abs(test$Trip_Count - test$pred2), na.rm = TRUE)
mae6 <- mean(abs(test$Trip_Count - test$pred6), na.rm = TRUE)
mae2
mae6
mae6 - mae2  # positive = engineered features slightly worsen out-of-sample MAE
```
To test whether additional behavioral, spatial, and contextual signals could improve prediction accuracy, I introduced three new feature groups: (1) **perfect-weather conditions** and a **weekend × nice-weather interaction**, capturing the intuition that leisure-driven demand spikes under ideal conditions; (2) **distance to Center City**, representing the strong spatial gradient in bike-share usage; and (3) **holiday indicators** for major Q2 events such as Easter, Memorial Day, and Juneteenth, which may alter mobility patterns.
Comparing performance before and after adding these features shows that the enhanced model (Model 6) performs almost identically to the strongest baseline model (Model 2). Model 2 achieves an MAE of **0.6106**, while Model 6 yields **0.6135**, a negligible increase of approximately **0.003 trips**. Although the engineered variables add meaningful behavioral interpretation, they do **not** improve out-of-sample predictive accuracy.
This result reflects the nature of Q2 bike-share demand: spring and early-summer ridership is already strongly explained by **temporal autocorrelation (lag features)** and **hourly/weekly seasonality**, with relatively consistent warm weather and limited extremes. Because these patterns dominate usage, additional features offer little incremental predictive power and instead introduce slight overfitting.
## **3.3 Poisson Model**
```{r}
model6_pois <- glm(
Trip_Count ~ as.factor(hour) + dotw_simple +
Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day +
perfect_weather + weekend_nice +
dist_to_center_city + holiday,
data = train,
family = poisson(link = "log")
)
poisson_summary <- data.frame(
Metric = c(
"Null deviance",
"Residual deviance",
"AIC"
),
Value = c(
round(model6_pois$null.deviance, 2),
round(model6_pois$deviance, 2),
round(model6_pois$aic, 2)
)
)
kable(poisson_summary,
caption = "Poisson Regression Model Summary (Key Metrics)",
col.names = c("Metric", "Value")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
```{r,warning=FALSE}
test <- test %>%
mutate(
pred6_pois = predict(model6_pois, newdata = test, type = "response")
)
mae_pois <- mean(abs(test$Trip_Count - test$pred6_pois), na.rm = TRUE)
mae_pois
```
```{r}
mae_lin <- mean(abs(test$Trip_Count - test$pred6), na.rm = TRUE)
kable(
data.frame(
Model = c("Linear Model 6", "Poisson Model 6"),
MAE = c(mae_lin, mae_pois)
),
digits = 3,
caption = "Out-of-Sample MAE: Linear vs. Poisson Model 6"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
Although trip counts are discrete and non-negative, making Poisson regression a natural candidate, the Poisson specification did not outperform the linear model in this case. The Poisson model substantially reduced the null deviance (from about 701,096 to 441,087), which indicates that it captures important structure in the data. However, when evaluated out-of-sample on the Q2 test set, its MAE is **0.639**, which is noticeably worse than the MAE of the linear Model 6 (**0.613**).
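One plausible reason the Poisson fit trails the linear model is overdispersion: station-hour counts with many zeros and occasional spikes violate the Poisson assumption that the variance equals the mean. A quick diagnostic, sketched below on simulated counts rather than the Indego data, is the ratio of residual deviance to residual degrees of freedom; values well above 1 suggest trying a quasi-Poisson or negative binomial specification instead.

```{r}
set.seed(1)

# Simulated overdispersed counts: negative-binomial draws whose variance
# exceeds the Poisson assumption of variance == mean
x <- runif(500)
y <- rnbinom(500, mu = exp(1 + 2 * x), size = 1.5)

fit <- glm(y ~ x, family = poisson(link = "log"))

# Dispersion ratio: deviance / residual df is roughly 1 for true Poisson data
dispersion <- fit$deviance / fit$df.residual
dispersion  # well above 1 here, signalling overdispersion
```

The same ratio can be read directly off `model6_pois` (`deviance` and `df.residual` are components of any fitted `glm`) to check whether the Indego counts show the same pattern.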
# Part 4: Critical Reflection
## **4.1 Operational implications**
With a final test-set MAE of roughly **0.61 trips per station-hour**, the model is accurate enough to provide meaningful operational support for Indego, especially for routine rebalancing during normal weekdays. Most stations average between 0 and 2 trips per hour in Q2, so an average error of 0.6 represents a relatively small deviation from true demand. In other words, the model reliably captures broad temporal patterns: commuting peaks, weather-driven changes, and weekend shifts, which are the operational factors that matter most for daily fleet management. However, the error analysis shows the model performs less consistently in areas with highly irregular or discretionary travel (Center City, university areas, tourist clusters). These are precisely the locations where under-prediction or over-prediction can create rebalancing problems: late-morning surges, sudden evening spikes, or unusually warm weekends can leave docks empty or bikes stranded unless staff intervene manually.
Given this, the model can be **deployed operationally**, but with clear conditions. It should be used as a *decision-support tool*, not an autonomous system. The predictions can guide baseline redistribution plans, flag high-risk hours, and help prioritize which stations need closer monitoring. But stations with historically volatile demand and days with atypical weather, holidays, or major events will still require human oversight or real-time adjustments.
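As a small illustration of the decision-support use described above, predicted counts could be turned into simple monitoring flags. The data frame, dock capacities, and the 20% threshold below are all hypothetical, not Indego's actual values; this is a sketch of the idea, not a deployed rule.

```{r}
library(dplyr)

# Hypothetical station-hour predictions (stands in for test$pred6)
preds <- tibble::tibble(
  station  = c("A", "A", "B", "B"),
  hour     = c(8, 17, 8, 17),
  pred6    = c(3.2, 0.4, 1.1, 2.8),
  capacity = c(15, 15, 10, 10)  # hypothetical dock counts
)

# Flag station-hours where predicted hourly demand exceeds a fixed share of
# dock capacity -- these are the hours staff would watch most closely
flags <- preds %>%
  mutate(high_risk = pred6 > 0.2 * capacity) %>%
  filter(high_risk)

flags
```

In practice the threshold would be tuned against historical stockout events, and the flags would feed a planner's dashboard rather than trigger automatic redistribution.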
## **4.2 Equity considerations**
The error patterns suggest that prediction accuracy is not evenly distributed across Philadelphia. Stations located in higher-income, majority-White neighborhoods, particularly those with more discretionary and irregular travel, show larger residuals, while stations in transit-dependent, lower-income communities tend to be more predictable. This raises an important equity concern: the model performs best in neighborhoods where travel patterns are already stable and predictable, and worst in areas where riders may rely on Indego as a flexible or secondary mobility option.
If such a system were deployed without safeguards, it could unintentionally reinforce existing disparities. Under-predicting demand in transit-reliant communities could mean empty docks, reduced availability, and longer wait times, directly undermining mobility for riders with fewer alternatives.
To mitigate these risks, several safeguards should be implemented:\
1. **Equity-weighted rebalancing rules** ensuring underserved neighborhoods receive a minimum guaranteed baseline of bike availability regardless of predicted demand\
2. **Error monitoring** that tracks model performance separately by neighborhood demographics\
3. **Human-in-the-loop overrides** where planners can adjust allocations in census tracts showing sustained high error\
4. **Periodic model retraining** with explicit fairness constraints or stratified sampling to ensure equitable performance across different community types
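Safeguard 2 could be prototyped with a simple grouped summary. The demographic grouping below is hypothetical; the real version would join station-level errors to the ACS tract data used earlier in the analysis.

```{r}
library(dplyr)

# Hypothetical station-level errors joined to a tract demographic grouping
station_errors <- tibble::tibble(
  station     = c("A", "B", "C", "D"),
  abs_error   = c(0.9, 0.4, 1.2, 0.3),
  income_band = c("high", "low", "high", "low")  # hypothetical ACS band
)

# Monitor MAE separately by neighborhood type; a persistent gap between
# groups would trigger the human-in-the-loop review described above
equity_check <- station_errors %>%
  group_by(income_band) %>%
  summarise(group_MAE = mean(abs_error), n_stations = n(), .groups = "drop")

equity_check
```

Run on a rolling window, a table like this makes the equity gap an explicit, monitored quantity rather than an incidental finding.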
## **4.3 Model limitations**
Despite incorporating temporal lags, weather, spatial context, and holiday effects, the model still misses several important behavioral patterns. Bicycle demand is heavily influenced by **micro-events** such as neighborhood festivals, block parties, school schedules, university calendars, and special events at venues like the stadium district—none of which are captured in the current dataset. The model also assumes that relationships between predictors and demand are **smooth and stable over time**, but real bike-share patterns often shift abruptly due to construction, temporary station closures, seasonal promotions, or sudden weather changes that are not recorded by hourly temperature and precipitation alone. Moreover, riders make spontaneous decisions in response to perceived safety, traffic conditions, and trip chaining with transit—complex behaviors that cannot be explained using simple linear terms.
If given more time and richer data, the predictive system could be significantly improved by incorporating features that better capture the behavioral, spatial, and contextual complexity of bike-share demand. The current model relies primarily on temporal lags, weather, and basic spatial structure, which limits its ability to account for major demand drivers such as large events, university schedules, tourism flow, or transit disruptions. With access to real event data (concerts, stadium schedules, conventions, sports games), SEPTA operational data, academic calendars for Penn/Drexel/Temple, and park/office/restaurant POI densities, the model would be able to reflect more realistic fluctuations in ridership patterns. These variables are known to heavily shape Philadelphia’s mobility rhythms but are not represented in the present dataset.
Methodologically, the model could be strengthened by moving beyond linear regression into more flexible forms to capture non-linear relationships and temporal dependencies that the current model cannot express. Expanding the training window to multiple years rather than a single quarter would also allow the model to learn seasonal differences, long-term trends, and year-over-year behavioral changes. With richer data and more advanced modeling frameworks, Indego could move from hour-ahead prediction toward more robust real-time forecasting and dynamic rebalancing systems.
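As one hedged example of the "more flexible forms" mentioned above, a generalized additive model can let the hour-of-day and temperature effects bend nonlinearly instead of entering as fixed-effect dummies and a linear slope. The sketch below fits such a model on simulated data that reuses the training-set column names; it is an illustration of the approach, not a fitted Indego model.

```{r}
library(mgcv)  # recommended package shipped with R

set.seed(42)

# Simulated station-hour data mimicking the training columns
n <- 1000
sim <- data.frame(
  hour        = sample(0:23, n, replace = TRUE),
  Temperature = runif(n, 40, 90),
  lag1Hour    = rpois(n, 1)
)
sim$Trip_Count <- rpois(
  n,
  exp(0.3 + 0.5 * sin(2 * pi * sim$hour / 24) +
        0.01 * sim$Temperature + 0.2 * sim$lag1Hour)
)

# Cyclic spline for hour-of-day (hour 23 wraps to hour 0), a smooth for
# temperature, and the lag term kept linear
gam_fit <- gam(
  Trip_Count ~ s(hour, bs = "cc") + s(Temperature) + lag1Hour,
  data = sim, family = poisson(link = "log")
)

summary(gam_fit)$r.sq
```

The cyclic basis (`bs = "cc"`) is the natural choice for hour-of-day, since demand at 11 p.m. should join smoothly to demand at midnight, something the `as.factor(hour)` dummies in Model 6 cannot enforce.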