---
title: "Space-Time Prediction of Bike Share Demand: Philadelphia Indego"
author: "Jingqi Lu"
date: today
format:
html:
number-sections: false
code-fold: true
code-tools: true
toc: true
toc-depth: 3
toc-location: left
theme: cosmo
embed-resources: true
editor: visual
execute:
warning: false
message: false
---
# Setup
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
warning = FALSE,
message = FALSE,
cache = TRUE
)
```
```{r load-packages}
# Core tidyverse
library(tidyverse)
library(lubridate)
# Spatial data
library(sf)
library(tigris)
# Census data
library(tidycensus)
# Weather data
library(riem) # For Philadelphia weather from ASOS stations
# Visualization
library(viridis)
library(gridExtra)
library(knitr)
library(kableExtra)
# here!
library(here)
# Get rid of scientific notation. We gotta look good!
options(scipen = 999)
```
## Define Themes
These themes will be used for consistent styling of plots.
```{r themes}
plotTheme <- theme(
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10),
plot.caption = element_text(size = 8),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
axis.title = element_text(size = 11, face = "bold"),
panel.background = element_blank(),
panel.grid.major = element_line(colour = "#D0D0D0", size = 0.2),
panel.grid.minor = element_blank(),
axis.ticks = element_blank(),
legend.position = "right"
)
mapTheme <- theme(
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10),
plot.caption = element_text(size = 8),
axis.line = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
panel.background = element_blank(),
panel.border = element_blank(),
panel.grid.major = element_line(colour = 'transparent'),
panel.grid.minor = element_blank(),
legend.position = "right",
plot.margin = margin(1, 1, 1, 1, 'cm'),
legend.key.height = unit(1, "cm"),
legend.key.width = unit(0.2, "cm")
)
palette5 <- c("#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c")
```
```{r load_indego}
# Read Q1 2025 data
library(readr)
indego <- read_csv("data/indego-trips-2024-q2.csv")
# Quick look at the data
glimpse(indego)
```
------------------------------------------------------------------------
# 1. Introduction
This report analyzes hourly bike share demand in Philadelphia using Indego data from Q2 2024. The goal is to build and evaluate space-time predictive models, diagnose where and why errors occur, and propose new features that improve forecasting performance for operational planning.
------------------------------------------------------------------------
## 1.1 Examine the Data Structure
```{r explore_data}
# How many trips?
cat("Total trips in Q2 2024:", nrow(indego), "\n")
# Date range
cat("Date range:",
min(mdy_hm(indego$start_time)), "to",
max(mdy_hm(indego$start_time)), "\n")
# How many unique stations?
cat("Unique start stations:", length(unique(indego$start_station)), "\n")
# Trip types
table(indego$trip_route_category)
# Passholder types
table(indego$passholder_type)
# Bike types
table(indego$bike_type)
```
## 1.2 Create Time Bins
We need to aggregate trips into hourly intervals for our panel data structure.
```{r create_time_bins}
indego <- indego %>%
mutate(
# Parse datetime
start_datetime = mdy_hm(start_time),
end_datetime = mdy_hm(end_time),
# Create hourly bins
interval60 = floor_date(start_datetime, unit = "hour"),
# Extract time features
week = week(interval60),
month = month(interval60, label = TRUE),
dotw = wday(interval60, label = TRUE),
hour = hour(interval60),
date = as.Date(interval60),
# Create useful indicators
weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
)
# Look at temporal features
head(indego %>% select(start_datetime, interval60, week, dotw, hour, weekend))
```
# 2. Exploratory Data Analysis
## 2.1 Trips Over Time
```{r trips_over_time}
# Daily trip counts
daily_trips <- indego %>%
group_by(date) %>%
summarize(trips = n())
ggplot(daily_trips, aes(x = date, y = trips)) +
geom_line(color = "#3182bd", linewidth = 1) +
geom_smooth(se = FALSE, color = "red", linetype = "dashed") +
labs(
title = "Indego Daily Ridership - Q2 2024",
subtitle = "Winter demand patterns in Philadelphia",
x = "Date",
y = "Daily Trips",
caption = "Source: Indego bike share"
) +
plotTheme
```
Daily trip volumes rise sharply from early April into mid-June, consistent with the seasonal transition into warmer weather. The series remains noisy—weekends, rain events, and holiday fluctuations introduce large day-to-day swings—but the overall trend is a steady climb toward peak summer demand. Compared to Q1 2025, the level is higher and the pattern is less stable, reflecting how warmer months attract more discretionary, non-commuting trips.
##2.2 Hourly Patterns
```{r hourly_patterns}
# Average trips by hour and day type
hourly_patterns <- indego %>%
group_by(hour, weekend) %>%
summarize(avg_trips = n() / n_distinct(date)) %>%
mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))
ggplot(hourly_patterns, aes(x = hour, y = avg_trips, color = day_type)) +
geom_line(linewidth = 1.2) +
scale_color_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
labs(
title = "Average Hourly Ridership Patterns",
subtitle = "Clear commute patterns on weekdays",
x = "Hour of Day",
y = "Average Trips per Hour",
color = "Day Type"
) +
plotTheme
```
Hourly usage patterns retain the classic two-peak commute structure on weekdays, with strong surges around 8 AM and 5 PM. Weekends, by contrast, show a smoother, flatter trajectory centered on late morning and mid-afternoon leisure travel. The contrast suggests that Q2 demand mixes both utilitarian and recreational usage, producing more varied temporal dynamics than in winter.
## 2.3 Top Stations
```{r top_stations}
# Most popular origin stations
top_stations <- indego %>%
count(start_station, start_lat, start_lon, name = "trips") %>%
arrange(desc(trips)) %>%
head(20)
kable(top_stations,
caption = "Top 20 Indego Stations by Trip Origins",
format.args = list(big.mark = ",")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
# 3. Get Philadelphia Spatial Context
## 3.1 Load Philadelphia Census Data
We'll get census tract data to add demographic context to our stations.
```{r philly-census, include=FALSE}
# Get Philadelphia census tracts
philly_census <- get_acs(
geography = "tract",
variables = c(
"B01003_001", # Total population
"B19013_001", # Median household income
"B08301_001", # Total commuters
"B08301_010", # Commute by transit
"B02001_002", # White alone
"B25077_001" # Median home value
),
state = "PA",
county = "Philadelphia",
year = 2022,
geometry = TRUE,
output = "wide"
) %>%
rename(
Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
Total_Commuters = B08301_001E,
Transit_Commuters = B08301_010E,
White_Pop = B02001_002E,
Med_Home_Value = B25077_001E
) %>%
mutate(
Percent_Taking_Transit = (Transit_Commuters / Total_Commuters) * 100,
Percent_White = (White_Pop / Total_Pop) * 100
) %>%
st_transform(crs = 4326) # WGS84 for lat/lon matching
```
## 3.2 Map Philadelphia Context
```{r map_philly}
# Map median income
ggplot() +
geom_sf(data = philly_census, aes(fill = Med_Inc), color = NA) +
scale_fill_viridis(
option = "viridis",
name = "Median\nIncome",
labels = scales::dollar
) +
labs(
title = "Philadelphia Median Household Income by Census Tract",
subtitle = "Context for understanding bike share demand patterns"
) +
# Stations
geom_point(
data = indego,
aes(x = start_lon, y = start_lat),
color = "red", size = 0.25, alpha = 0.6
) +
mapTheme
```
The spatial distribution of median household income shows a familiar pattern: higher-income tracts cluster in Center City, University City, and along the northwest corridor, while lower-income neighborhoods are concentrated in North and Southwest Philadelphia.
## 3.3 Join Census Data to Stations
We'll spatially join census characteristics to each bike station.
```{r join_census_to_stations}
# Create sf object for stations
stations_sf <- indego %>%
distinct(start_station, start_lat, start_lon) %>%
filter(!is.na(start_lat), !is.na(start_lon)) %>%
st_as_sf(coords = c("start_lon", "start_lat"), crs = 4326)
# Spatial join to get census tract for each station
stations_census <- st_join(stations_sf, philly_census, left = TRUE) %>%
st_drop_geometry()
# Look at the result - investigate whether all of the stations joined to census data -- according to the map above there are stations in non-residential tracts.
stations_for_map <- indego %>%
distinct(start_station, start_lat, start_lon) %>%
filter(!is.na(start_lat), !is.na(start_lon)) %>%
left_join(
stations_census %>% select(start_station, Med_Inc),
by = "start_station"
) %>%
mutate(has_census = !is.na(Med_Inc))
# Add back to trip data
indego_census <- indego %>%
left_join(
stations_census %>%
select(start_station, Med_Inc, Percent_Taking_Transit,
Percent_White, Total_Pop),
by = "start_station"
)
# Prepare data for visualization
stations_for_map <- indego %>%
distinct(start_station, start_lat, start_lon) %>%
filter(!is.na(start_lat), !is.na(start_lon)) %>%
left_join(
stations_census %>% select(start_station, Med_Inc),
by = "start_station"
) %>%
mutate(has_census = !is.na(Med_Inc))
# Create the map showing problem stations
ggplot() +
geom_sf(data = philly_census, aes(fill = Med_Inc), color = "white", size = 0.1) +
scale_fill_viridis(
option = "viridis",
name = "Median\nIncome",
labels = scales::dollar,
na.value = "grey90"
) +
# Stations with census data (small grey dots)
geom_point(
data = stations_for_map %>% filter(has_census),
aes(x = start_lon, y = start_lat),
color = "grey30", size = 1, alpha = 0.6
) +
# Stations WITHOUT census data (red X marks the spot)
geom_point(
data = stations_for_map %>% filter(!has_census),
aes(x = start_lon, y = start_lat),
color = "red", size = 1, shape = 4, stroke = 1.5
) +
labs(
title = "Philadelphia Median Household Income by Census Tract",
subtitle = "Indego stations shown (RED = no census data match)",
caption = "Red X marks indicate stations that didn't join to census tracts"
) +
mapTheme
```
Most Indego stations successfully join to their surrounding census tract, but a subset in the downtown core and waterfront areas fall outside tract boundaries or sit within non-residential zones. These locations—shown as red X marks—typically correspond to transit hubs, tourist destinations, or large institutional complexes. Their absence from the demographic join highlights where neighborhood characteristics may not meaningfully describe station demand, and it motivates either separate modeling or additional spatial features for these atypical sites.
## 3.4 Dealing with missing data
We need to decide what to do with the non-residential bike share stations. For this example, we are going to remove them -- this is not necessarily the right way to do things always, but for the sake of simplicity, we are narrowing our scope to only stations in residential neighborhoods. We might opt to create a separate model for non-residential stations..
```{r}
# Identify which stations to keep
valid_stations <- stations_census %>%
filter(!is.na(Med_Inc)) %>%
pull(start_station)
# Filter trip data to valid stations only
indego_census <- indego %>%
filter(start_station %in% valid_stations) %>%
left_join(
stations_census %>%
select(start_station, Med_Inc, Percent_Taking_Transit,
Percent_White, Total_Pop),
by = "start_station"
)
```
## 3.5 Get Weather Data
Weather significantly affects bike share demand! Let's get hourly weather for Philadelphia.
```{r get_weather}
# Get weather from Philadelphia International Airport (KPHL)
# This covers Q2 2024: April 1 - June 30
weather_data <- riem_measures(
station = "PHL", # Philadelphia International Airport
date_start = "2024-04-01",
date_end = "2024-06-30"
)
# Process weather data
weather_processed <- weather_data %>%
mutate(
interval60 = floor_date(valid, unit = "hour"),
Temperature = tmpf, # Temperature in Fahrenheit
Precipitation = ifelse(is.na(p01i), 0, p01i), # Hourly precip in inches
Wind_Speed = sknt # Wind speed in knots
) %>%
select(interval60, Temperature, Precipitation, Wind_Speed) %>%
distinct()
# Check for missing hours and interpolate if needed
weather_complete <- weather_processed %>%
complete(interval60 = seq(min(interval60), max(interval60), by = "hour")) %>%
fill(Temperature, Precipitation, Wind_Speed, .direction = "down")
# Look at the weather
summary(weather_complete %>% select(Temperature, Precipitation, Wind_Speed))
```
## 3,6 Visualize Weather Patterns
Who is ready for a Philly winter?!
```{r visualize_weather}
ggplot(weather_complete, aes(x = interval60, y = Temperature)) +
geom_line(color = "#3182bd", alpha = 0.7) +
geom_smooth(se = FALSE, color = "red") +
labs(
title = "Philadelphia Temperature - Q2 2024",
subtitle = "Spring to Summer transition",
x = "Date",
y = "Temperature (°F)"
) +
plotTheme
```
Temperatures rise steadily from early April into late June, marking the transition into Philadelphia’s warm season. Daily fluctuations remain sharp, but the overall shift from cool spring days to consistently warm summer weather is clear. This warming pattern aligns with the strong seasonal increase in ridership, reinforcing the well-known sensitivity of bike share demand to comfortable outdoor conditions.
## 3.7 Create Space-Time Panel
```{r aggregate_trips}
# Count trips by station-hour
trips_panel <- indego_census %>%
group_by(interval60, start_station, start_lat, start_lon,
Med_Inc, Percent_Taking_Transit, Percent_White, Total_Pop) %>%
summarize(Trip_Count = n()) %>%
ungroup()
# How many station-hour observations?
nrow(trips_panel)
# How many unique stations?
length(unique(trips_panel$start_station))
# How many unique hours?
length(unique(trips_panel$interval60))
```
## 3.8 Create Complete Panel Structure
Not every station has trips every hour. We need a **complete panel** where every station-hour combination exists (even if Trip_Count = 0).
```{r complete_panel}
# Calculate expected panel size
n_stations <- length(unique(trips_panel$start_station))
n_hours <- length(unique(trips_panel$interval60))
expected_rows <- n_stations * n_hours
cat("Expected panel rows:", format(expected_rows, big.mark = ","), "\n")
cat("Current rows:", format(nrow(trips_panel), big.mark = ","), "\n")
cat("Missing rows:", format(expected_rows - nrow(trips_panel), big.mark = ","), "\n")
# Create complete panel
study_panel <- expand.grid(
interval60 = unique(trips_panel$interval60),
start_station = unique(trips_panel$start_station)
) %>%
# Join trip counts
left_join(trips_panel, by = c("interval60", "start_station")) %>%
# Replace NA trip counts with 0
mutate(Trip_Count = replace_na(Trip_Count, 0))
# Fill in station attributes (they're the same for all hours)
station_attributes <- trips_panel %>%
group_by(start_station) %>%
summarize(
start_lat = first(start_lat),
start_lon = first(start_lon),
Med_Inc = first(Med_Inc),
Percent_Taking_Transit = first(Percent_Taking_Transit),
Percent_White = first(Percent_White),
Total_Pop = first(Total_Pop)
)
study_panel <- study_panel %>%
left_join(station_attributes, by = "start_station")
# Verify we have complete panel
cat("Complete panel rows:", format(nrow(study_panel), big.mark = ","), "\n")
```
## 3.9 Add Time Features
```{r add_time_features}
study_panel <- study_panel %>%
mutate(
week = week(interval60),
month = month(interval60, label = TRUE),
dotw = wday(interval60, label = TRUE),
hour = hour(interval60),
date = as.Date(interval60),
weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
)
```
## 3.10 Join Weather Data
```{r join_weather}
study_panel <- study_panel %>%
left_join(weather_complete, by = "interval60")
# Check for missing values
summary(study_panel %>% select(Trip_Count, Temperature, Precipitation))
```
## 3.11 Create Temporal Lag Variables
The key innovation for space-time prediction: **past demand predicts future demand**.
If there were 15 bike trips from Station A at 8:00 AM, there will probably be \~15 trips at 9:00 AM. We can use this temporal persistence to improve predictions.
```{r create_lags}
# Sort by station and time
study_panel <- study_panel %>%
arrange(start_station, interval60)
# Create lag variables WITHIN each station
study_panel <- study_panel %>%
group_by(start_station) %>%
mutate(
lag1Hour = lag(Trip_Count, 1),
lag2Hours = lag(Trip_Count, 2),
lag3Hours = lag(Trip_Count, 3),
lag12Hours = lag(Trip_Count, 12),
lag1day = lag(Trip_Count, 24)
) %>%
ungroup()
# Remove rows with NA lags (first 24 hours for each station)
study_panel_complete <- study_panel %>%
filter(!is.na(lag1day))
cat("Rows after removing NA lags:", format(nrow(study_panel_complete), big.mark = ","), "\n")
```
## 3.12 Visualize Lag Correlations
```{r lag_correlations}
# Sample one station to visualize
example_station <- study_panel_complete %>%
filter(start_station == first(start_station)) %>%
head(168) # One week
# Plot actual vs lagged demand
ggplot(example_station, aes(x = interval60)) +
geom_line(aes(y = Trip_Count, color = "Current"), linewidth = 1) +
geom_line(aes(y = lag1Hour, color = "1 Hour Ago"), linewidth = 1, alpha = 0.7) +
geom_line(aes(y = lag1day, color = "24 Hours Ago"), linewidth = 1, alpha = 0.7) +
scale_color_manual(values = c(
"Current" = "#08519c",
"1 Hour Ago" = "#3182bd",
"24 Hours Ago" = "#6baed6"
)) +
labs(
title = "Temporal Lag Patterns at One Station",
subtitle = "Past demand predicts future demand",
x = "Date-Time",
y = "Trip Count",
color = "Time Period"
) +
plotTheme
```
Trip counts at a single station show clear short-term persistence: hours with activity are typically followed by additional activity in the next hour, while long stretches of zero demand also cluster together. The 1-hour lag tracks the current series most closely, but the 24-hour lag preserves the broader daily rhythm. These patterns justify the use of temporal lag features in the predictive models, since recent demand provides meaningful information about expected short-run usage.
## 3.13 Temporal Train/Test Split
**CRITICAL:** We must train on PAST data and test on FUTURE data!
```{r temporal_split}
# Split by week
# Q1 has weeks 14-26 (April-June)
# Train on weeks 14-23 (April 1 - early June)
# Test on weeks 23-26 (rest of June)
# Which stations have trips in BOTH early and late periods?
early_stations <- study_panel_complete %>%
filter(week < 23) %>%
filter(Trip_Count > 0) %>%
distinct(start_station) %>%
pull(start_station)
late_stations <- study_panel_complete %>%
filter(week >= 23) %>%
filter(Trip_Count > 0) %>%
distinct(start_station) %>%
pull(start_station)
# Keep only stations that appear in BOTH periods
common_stations <- intersect(early_stations, late_stations)
# Filter panel to only common stations
study_panel_complete <- study_panel_complete %>%
filter(start_station %in% common_stations)
# NOW create train/test split
train <- study_panel_complete %>%
filter(week < 23)
test <- study_panel_complete %>%
filter(week >= 23)
cat("Training observations:", format(nrow(train), big.mark = ","), "\n")
cat("Testing observations:", format(nrow(test), big.mark = ","), "\n")
cat("Training date range:", min(train$date), "to", max(train$date), "\n")
cat("Testing date range:", min(test$date), "to", max(test$date), "\n")
```
# 4. Build Predictive Models
We'll build 5 models with increasing complexity to see what improves predictions.
## 4.1 Model 1: Baseline (Time + Weather)
```{r model1}
# Create day of week factor with treatment (dummy) coding
train <- train %>%
mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
# Set contrasts to treatment coding (dummy variables)
contrasts(train$dotw_simple) <- contr.treatment(7)
# Now run the model
model1 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation,
data = train
)
summary(model1)
```
## 4.2 Model 2: Add Temporal Lags
```{r model2}
model2 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day,
data = train
)
summary(model2)
```
## 4.3 Model 3: Add Demographics
```{r model3}
model3 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day +
Med_Inc.x + Percent_Taking_Transit.y + Percent_White.y,
data = train
)
summary(model3)
```
## 4.4 Model 4: Add Station Fixed Effects
```{r model4}
model4 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day +
Med_Inc.x + Percent_Taking_Transit.y + Percent_White.y +
as.factor(start_station),
data = train
)
# Summary too long with all station dummies, just show key metrics
cat("Model 4 R-squared:", summary(model4)$r.squared, "\n")
cat("Model 4 Adj R-squared:", summary(model4)$adj.r.squared, "\n")
```
## 4.5 Model 5: Add Rush Hour Interaction
```{r model5}
model5 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day + rush_hour + as.factor(month) +
Med_Inc.x + Percent_Taking_Transit.y + Percent_White.y +
as.factor(start_station) +
rush_hour * weekend, # Rush hour effects different on weekends
data = train
)
cat("Model 5 R-squared:", summary(model5)$r.squared, "\n")
cat("Model 5 Adj R-squared:", summary(model5)$adj.r.squared, "\n")
```
## 4.6 Model Performance
Model 1 — Time + Weather Explains only about 11% of variation. Captures basic hourly/weekly patterns and weather, but predictions are weak.
Model 2 — + Temporal Lags Big improvement: R² jumps to 0.34. The 1-hour lag is the strongest predictor. Demand is highly path-dependent.
Model 3 — + Demographics No real improvement. Demographic variables add little to short-run hourly prediction.
Model 4 — + Station Fixed Effects Moderate gain (R² ≈ 0.26). Captures stable differences between stations, but less important than lag effects.
Model 5 — Full Model Small additional improvement (R² ≈ 0.264). Interactions help only slightly; most predictability still comes from recent demand.
In Q2 2024 specifically, the one-hour lag (`lag1Hour`) and 24-hour lag (`lag1day`) are by far the strongest predictors: hours with high recent demand almost always see high current demand. Temperature has a clear positive effect, while precipitation slightly suppresses ridership. By contrast, median income, racial composition, and transit share have tiny coefficients and do not materially change out-of-sample MAE, confirming that short-run dynamics matter much more than static neighborhood traits in this quarter.
## 4.7 Comparison to Q1 2025
Compared to the original Q1 2025 analysis, the best Q2 2024 model achieves a similar order of magnitude of error, with an MAE of about 0.71 trips per station-hour. The key difference is that Q2 demand is higher and more volatile: warm-weather leisure and discretionary trips introduce more noise than winter commute patterns, so the model must cope with sharper peaks and more irregular usage. As a result, the overall fit is comparable, but prediction is more challenging exactly in the busiest summer periods.
# 5.Model Evaluation
## 5.1 Calculate Predictions and MAE
```{r calculate_mae}
# Get predictions on test set
# Create day of week factor with treatment (dummy) coding
test <- test %>%
mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
# Set contrasts to treatment coding (dummy variables)
contrasts(test$dotw_simple) <- contr.treatment(7)
test <- test %>%
mutate(
pred1 = predict(model1, newdata = test),
pred2 = predict(model2, newdata = test),
pred3 = predict(model3, newdata = test),
pred4 = predict(model4, newdata = test),
pred5 = predict(model5, newdata = test)
)
# Calculate MAE for each model
mae_results <- data.frame(
Model = c(
"1. Time + Weather",
"2. + Temporal Lags",
"3. + Demographics",
"4. + Station FE",
"5. + Rush Hour Interaction"
),
MAE = c(
mean(abs(test$Trip_Count - test$pred1), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred2), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred3), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred4), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred5), na.rm = TRUE)
)
)
kable(mae_results,
digits = 2,
caption = "Mean Absolute Error by Model (Test Set)",
col.names = c("Model", "MAE (trips)")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
## 5.2 Visualize Model Comparison
```{r compare_models}
ggplot(mae_results, aes(x = reorder(Model, -MAE), y = MAE)) +
geom_col(fill = "#3182bd", alpha = 0.8) +
geom_text(aes(label = round(MAE, 2)), vjust = -0.5) +
labs(
title = "Model Performance Comparison",
subtitle = "Lower MAE = Better Predictions",
x = "Model",
y = "Mean Absolute Error (trips)"
) +
plotTheme +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
Model 2 — the version with temporal lags — clearly performs best, with an MAE of about 0.71 trips, much lower than all other models. Model 1 performs reasonably (0.87), but the demographic model (Model 3) and fixed-effect models (Models 4–5) do not improve accuracy.
## 5.3 Compare Q2 2024 vs. Q1 2025
```{r}
mae_q2 <- mae_results
mae_q1 <- tibble(
Model = c(
"1. Time + Weather",
"2. + Temporal Lags",
"3. + Demographics",
"4. + Station FE",
"5. + Rush Hour Interaction"
),
MAE_Q1_2025 = c(0.60, 0.50, 0.74, 0.73, 0.73)
)
mae_compare <- mae_q1 %>%
left_join(mae_q2, by = "Model") %>%
rename(MAE_Q2_2024 = MAE)
kable(
mae_compare,
digits = 2,
caption = "MAE comparison: Q1 2025 vs Q2 2024",
col.names = c("Model", "MAE (Q1 2025)", "MAE (Q2 2024)")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
Across all five models, Q2 2024 has consistently **lower MAE** than Q1 2025.\
For the best-performing model (Model 2, with temporal lags), MAE drops from **0.71** trips per station-hour in Q1 to **0.50** in Q2. The simple time–weather baseline also improves from **0.87** to **0.60**, while the more complex models (3–5) remain worse than the lag model in both quarters.
## 5.4 Space-Time Error Analysis: Observed vs. Predicted
Let's use our best model (Model 2) for error analysis.
```{r obs_vs_pred}
test <- test %>%
mutate(
error = Trip_Count - pred2,
abs_error = abs(error),
time_of_day = case_when(
hour < 7 ~ "Overnight",
hour >= 7 & hour < 10 ~ "AM Rush",
hour >= 10 & hour < 15 ~ "Mid-Day",
hour >= 15 & hour <= 18 ~ "PM Rush",
hour > 18 ~ "Evening"
)
)
# Scatter plot by time and day type
ggplot(test, aes(x = Trip_Count, y = pred2)) +
geom_point(alpha = 0.2, color = "#3182bd") +
geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
geom_smooth(method = "lm", se = FALSE, color = "darkgreen") +
facet_grid(weekend ~ time_of_day) +
labs(
title = "Observed vs. Predicted Bike Trips",
subtitle = "Model 3 performance by time period",
x = "Observed Trips",
y = "Predicted Trips",
caption = "Red line = perfect predictions; Green line = actual model fit"
) +
plotTheme
```
## 5.5 Spatial Error Patterns
Are prediction errors clustered in certain parts of Philadelphia?
```{r spatial_errors}
# Calculate MAE by station
station_errors <- test %>%
group_by(start_station, start_lat.x, start_lon.y) %>%
summarize(
MAE = mean(abs_error, na.rm = TRUE),
avg_demand = mean(Trip_Count, na.rm = TRUE),
.groups = "drop"
) %>%
filter(!is.na(start_lat.x), !is.na(start_lon.y))
## Create Two Maps Side-by-Side with Proper Legends (sorry these maps are ugly)
# Calculate station errors
station_errors <- test %>%
filter(!is.na(pred2)) %>%
group_by(start_station, start_lat.x, start_lon.y) %>%
summarize(
MAE = mean(abs(Trip_Count - pred2), na.rm = TRUE),
avg_demand = mean(Trip_Count, na.rm = TRUE),
.groups = "drop"
) %>%
filter(!is.na(start_lat.x), !is.na(start_lon.y))
# Map 1: Prediction Errors
p1 <- ggplot() +
geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.2) +
geom_point(
data = station_errors,
aes(x = start_lon, y = start_lat, color = MAE),
size = 3.5,
alpha = 0.7
) +
scale_color_viridis(
option = "plasma",
name = "MAE\n(trips)",
direction = -1,
breaks = c(0.5, 1.0, 1.5), # Fewer, cleaner breaks
labels = c("0.5", "1.0", "1.5")
) +
labs(title = "Prediction Errors",
subtitle = "Higher in Center City") +
mapTheme +
theme(
legend.position = "right",
legend.title = element_text(size = 10, face = "bold"),
legend.text = element_text(size = 9),
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10)
) +
guides(color = guide_colorbar(
barwidth = 1.5,
barheight = 12,
title.position = "top",
title.hjust = 0.5
))
# Map 2: Average Demand
p2 <- ggplot() +
geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.2) +
geom_point(
data = station_errors,
aes(x = start_lon.y, y = start_lat.x, color = avg_demand),
size = 3.5,
alpha = 0.7
) +
scale_color_viridis(
option = "viridis",
name = "Avg\nDemand",
direction = -1,
breaks = c(0.5, 1.0, 1.5, 2.0, 2.5), # Clear breaks
labels = c("0.5", "1.0", "1.5", "2.0", "2.5")
) +
labs(title = "Average Demand",
subtitle = "Trips per station-hour") +
mapTheme +
theme(
legend.position = "right",
legend.title = element_text(size = 10, face = "bold"),
legend.text = element_text(size = 9),
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10)
) +
guides(color = guide_colorbar(
barwidth = 1.5,
barheight = 12,
title.position = "top",
title.hjust = 0.5
))
# Map 1: Prediction Errors
p1 <- ggplot() +
geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.1) +
geom_point(
data = station_errors,
aes(x = start_lon.y, y = start_lat.x, color = MAE),
size = 3.5,
alpha = 0.7
) +
scale_color_viridis(
option = "plasma",
name = "MAE (trips)",
direction = -1,
breaks = c(0.5, 1.0, 1.5),
labels = c("0.5", "1.0", "1.5")
) +
labs(title = "Prediction Errors") +
mapTheme +
theme(
legend.position = "bottom",
legend.title = element_text(size = 10, face = "bold"),
legend.text = element_text(size = 9),
plot.title = element_text(size = 14, face = "bold", hjust = 0.5)
) +
guides(color = guide_colorbar(
barwidth = 12,
barheight = 1,
title.position = "top",
title.hjust = 0.5
))
# Map 2: Average Demand
p2 <- ggplot() +
geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.1) +
geom_point(
data = station_errors,
aes(x = start_lon.y, y = start_lat.x, color = avg_demand),
size = 3.5,
alpha = 0.7
) +
scale_color_viridis(
option = "viridis",
name = "Avg Demand (trips/hour)",
direction = -1,
breaks = c(0.5, 1.0, 1.5, 2.0, 2.5),
labels = c("0.5", "1.0", "1.5", "2.0", "2.5")
) +
labs(title = "Average Demand") +
mapTheme +
theme(
legend.position = "bottom",
legend.title = element_text(size = 10, face = "bold"),
legend.text = element_text(size = 9),
plot.title = element_text(size = 14, face = "bold", hjust = 0.5)
) +
guides(color = guide_colorbar(
barwidth = 12,
barheight = 1,
title.position = "top",
title.hjust = 0.5
))
# Combine
grid.arrange(
p1, p2,
ncol = 2
)
```
Prediction errors are highest in Center City and University City, the two areas with the strongest and most volatile bike activity. These stations experience sharp swings in demand—commuters, tourists, and university traffic—making them harder to model with simple hourly features. In contrast, outlying residential neighborhoods show lower errors because their usage patterns are more stable and predictable. The close alignment between “high average demand” and “high error” suggests that variability, rather than low volume, drives most of the model’s difficulty.
## 5.6 Temporal Error Patterns
When are we most wrong?
```{r temporal_errors}
# MAE by time of day and day type
temporal_errors <- test %>%
group_by(time_of_day, weekend) %>%
summarize(
MAE = mean(abs_error, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))
ggplot(temporal_errors, aes(x = time_of_day, y = MAE, fill = day_type)) +
geom_col(position = "dodge") +
scale_fill_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
labs(
title = "Prediction Errors by Time Period",
subtitle = "When is the model struggling most?",
x = "Time of Day",
y = "Mean Absolute Error (trips)",
fill = "Day Type"
) +
plotTheme +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
```{r}
weekly_errors <- test %>%
group_by(week, weekend) %>%
summarise(
MAE = mean(abs_error, na.rm = TRUE),
mean_error = mean(error, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(
day_type = ifelse(weekend == 1, "Weekend", "Weekday"),
week = factor(week)
)
p_week_mae <- ggplot(weekly_errors,
aes(x = week, y = MAE, fill = day_type)) +
geom_col(position = "dodge") +
scale_fill_manual(values = c("Weekday" = "#08519c",
"Weekend" = "#6baed6")) +
labs(
title = "Weekly Pattern in Prediction Errors",
subtitle = "Mean Absolute Error by week and day type (test set)",
x = "Week of year",
y = "Mean Absolute Error (trips)",
fill = "Day Type"
) +
plotTheme
p_week_bias <- ggplot(weekly_errors,
aes(x = week, y = mean_error, fill = day_type)) +
geom_col(position = "dodge") +
geom_hline(yintercept = 0, linetype = "dashed") +
scale_fill_manual(values = c("Weekday" = "#08519c",
"Weekend" = "#6baed6")) +
labs(
title = "Weekly Under- / Over-Prediction",
subtitle = "Mean signed error by week and day type (test set)",
x = "Week of year",
y = "Mean Error (trips)",
fill = "Day Type"
) +
plotTheme
gridExtra::grid.arrange(p_week_mae, p_week_bias, ncol = 2)
```
Across the four test weeks (23–26), prediction errors are fairly stable: weekday and weekend MAE stay in the 0.7–0.8 trip range, with only a slight peak in week 24. Bias is small in most weeks, but weeks 23–24 show mild over-prediction, especially on weekdays. In contrast, week 25 exhibits noticeable under-prediction on weekends, suggesting an unusually strong spike in weekend demand that the model did not anticipate. By week 26, signed errors return close to zero, indicating that systematic under/over-prediction does not persist over time.
## 5.7 Errors and Demographics
Are prediction errors related to neighborhood characteristics?
```{r errors_demographics}
# Join demographic data to station errors
station_errors_demo <- station_errors %>%
left_join(
station_attributes %>% select(start_station, Med_Inc, Percent_Taking_Transit, Percent_White),
by = "start_station"
) %>%
filter(!is.na(Med_Inc))
# Create plots
p1 <- ggplot(station_errors_demo, aes(x = Med_Inc, y = MAE)) +
geom_point(alpha = 0.5, color = "#3182bd") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
scale_x_continuous(labels = scales::dollar) +
labs(title = "Errors vs. Median Income", x = "Median Income", y = "MAE") +
plotTheme
p2 <- ggplot(station_errors_demo, aes(x = Percent_Taking_Transit, y = MAE)) +
geom_point(alpha = 0.5, color = "#3182bd") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Errors vs. Transit Usage", x = "% Taking Transit", y = "MAE") +
plotTheme
p3 <- ggplot(station_errors_demo, aes(x = Percent_White, y = MAE)) +
geom_point(alpha = 0.5, color = "#3182bd") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Errors vs. Race", x = "% White", y = "MAE") +
plotTheme
grid.arrange(p1, p2, p3, ncol = 2)
```
Station–level errors are only weakly related to neighborhood demographics. MAE tends to be slightly higher in tracts with higher median income and a larger share of white residents, while it declines in tracts where a higher share of commuters take transit. In other words, the model struggles a bit more in richer, whiter areas and performs better in transit-oriented neighborhoods.
These differences appear modest and are driven mainly by higher and more volatile demand at busy, centrally located stations, rather than systematic under-performance in disadvantaged communities, though any deployment should still monitor errors by neighborhood over time.
# 6. Feature Engineering & Model Improvement
## 6.1 Create New Features
The choice of new features is guided directly by the error analysis in Section 5. Spatially and temporally, the model struggled most during warm, high-demand peaks in Center City and University City – especially on PM rush hours and busy weekends – and we also saw clear weekly repetition in the demand patterns. To target these misspecifications, I focus on features that (i) sharpen the model’s sensitivity to “perfect” biking weather and (ii) capture medium-run weekly structure beyond the short-run lags already in Model 2.
```{r}
# Perfect weather indicator: 60–75F & no rain
train <- train %>%
mutate(
perfect_weather = ifelse(Temperature >= 60 &
Temperature <= 75 &
Precipitation == 0, 1, 0)
)
test <- test %>%
mutate(
perfect_weather = ifelse(Temperature >= 60 &
Temperature <= 75 &
Precipitation == 0, 1, 0)
)
# Same hour last week (t-168 lag)
train <- train %>%
arrange(start_station, interval60) %>%
group_by(start_station) %>%
mutate(lag168 = lag(Trip_Count, 24 * 7)) %>%
ungroup()
test <- test %>%
arrange(start_station, interval60) %>%
group_by(start_station) %>%
mutate(lag168 = lag(Trip_Count, 24 * 7)) %>%
ungroup()
# Rolling 7-day mean demand
library(zoo)
train <- train %>%
arrange(start_station, interval60) %>%
group_by(start_station) %>%
mutate(roll7 = rollmean(Trip_Count, k = 24 * 7, fill = NA, align = "right")) %>%
ungroup()
test <- test %>%
arrange(start_station, interval60) %>%
group_by(start_station) %>%
mutate(roll7 = rollmean(Trip_Count, k = 24 * 7, fill = NA, align = "right")) %>%
ungroup()
```
## 6.2 Improved Model
```{r}
train_improved <- train %>%
drop_na(lag1Hour, lag3Hours, lag1day,
lag168, roll7, perfect_weather)
test_improved <- test %>%
drop_na(lag1Hour, lag3Hours, lag1day,
lag168, roll7, perfect_weather)
# improved OLS model
model_improved <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple +
Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day +
perfect_weather + lag168 + roll7,
data = train_improved
)
summary(model_improved)
```
## 6.3 Compare Improved Model to Baseline Model 2
```{r}
# baseline Model 2 predictions
pred_base <- predict(model2, newdata = test_improved)
# improved model predictions
pred_improved <- predict(model_improved, newdata = test_improved)
# just in case, drop any rows where predictions are NA
valid_idx <- !is.na(pred_base) & !is.na(pred_improved) & !is.na(test_improved$Trip_Count)
mae_base <- mean(abs(test_improved$Trip_Count[valid_idx] - pred_base[valid_idx]))
mae_improved <- mean(abs(test_improved$Trip_Count[valid_idx] - pred_improved[valid_idx]))
cat("Model 2 MAE (baseline): ", round(mae_base, 3), "\n")
cat("Improved Model MAE: ", round(mae_improved, 3), "\n")
```
The improved model is built by augmenting Model 2 (time, weather, and short-run lags) with the three new features. On the held-out test set, however, performance does not improve: the baseline Model 2 achieves an MAE of 0.72 trips, while the enhanced model’s MAE rises slightly to 0.73 trips.
Breaking the errors down by station and time of day confirms that there is no clear subset where the new features deliver large gains: MAE by time period and by station changes only marginally, and the ranking of “easy” vs. “hard” stations is essentially unchanged. In other words, the extra features slightly reshuffle individual residuals but do not systematically fix the large errors identified in the earlier analysis..
## 6.4 Poisson Model
```{r}
model_pois <- glm(
Trip_Count ~ as.factor(hour) + dotw_simple +
Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day +
perfect_weather + lag168 + roll7,
data = train_improved,
family = poisson(link = "log")
)
pred_pois <- predict(model_pois, newdata = test_improved, type = "response")
valid_pois <- !is.na(pred_pois) & !is.na(test_improved$Trip_Count)
mae_pois <- mean(abs(test_improved$Trip_Count[valid_pois] - pred_pois[valid_pois]))
cat("Poisson MAE: ", round(mae_pois, 3), "\n")
```
I also estimated a Poisson regression using the same feature set as the improved linear model. The Poisson model reaches an MAE of about 0.73 trips, essentially identical to the enhanced OLS model and still slightly worse than the simpler Model 2. This indicates that, although count data models are theoretically appealing, the Poisson specification does not offer practical gains here—likely because hourly Indego counts are over-dispersed and include many zeros.
# 7. Brief Report Summary
## Quarter and Motivation
This project uses **Q2 2024** Indego data (April–June). This quarter captures the transition into Philadelphia’s warm season, when bike share demand is higher, more volatile, and more influenced by discretionary and leisure trips than in winter. It therefore provides a stronger test of short-run demand prediction.
## Model Comparison
I estimate five nested OLS models. A simple time + weather model (Model 1) explains only about 11% of variation and yields an MAE of **0.87** trips. Adding temporal lag terms (Model 2) is crucial: R² rises to about **0.34** and MAE falls to **0.71**, making it the best-performing baseline. Demographics, station fixed effects, and rush-hour interactions (Models 3–5) increase model complexity but do not improve out-of-sample MAE. Recent demand at each station is far more predictive than static neighborhood characteristics.
## Error Analysis
Spatially, errors are highest at busy stations in Center City and University City, where commuter, tourist, and university traffic create sharp, irregular peaks. Peripheral residential stations are more predictable. Temporally, weekday PM peaks show the largest errors, while overnight hours are very stable. Demographically, errors are slightly higher in higher-income and whiter tracts, mostly because those areas host the busiest and most volatile stations.
## New Features and Poisson Model
To improve Model 2, I add three features: a perfect-weather indicator, a same-hour-last-week lag, and a rolling 7-day average of demand. However, the enhanced model’s MAE rises slightly to **0.73**, and a Poisson count model with the same features also achieves about **0.73**. These results suggest that the short-run lag structure already captures most usable signal, and the additional temporal trend features mainly add noise in this three-month sample.
## Critical Reflection
An MAE of roughly **0.7 trips per station-hour** is operationally useful for high-level planning and rebalancing strategy, but not sufficient for fully automated real-time control, especially during volatile PM peaks in the downtown core. Errors do not appear to systematically disadvantage low-income or high-transit neighborhoods, but any deployment should monitor error patterns by area and enforce minimum service standards in equity-priority communities. The current model ignores events, disruptions, and spatial spillovers; with more time and data, I would incorporate event calendars, network structure, and more flexible model families to better capture the complex dynamics of Philadelphia’s bike share system.