---
title: "Space-Time Prediction"
subtitle: "Bike Share Demand Forecasting with Panel Data & Temporal Lags"
author: "Sihan Yu"
date: today
format:
html:
code-fold: show
code-tools: true
toc: true
toc-depth: 3
toc-location: left
theme: cosmo
embed-resources: true
editor: visual
execute:
warning: false
message: false
---
# Introduction
Bike share systems generate rich space–time data that can be used to support day-to-day operations such as bike rebalancing and staffing. In this assignment, I build panel-data models with temporal lags to forecast Indego bike share demand in Philadelphia.
In class, we worked through a full end-to-end workflow for Q1 2025. Here, I replicate and extend that workflow using **Q4 2024** data. This quarter spans late fall and early winter, when demand is strongly affected by colder temperatures, shorter daylight hours, and major holidays such as Thanksgiving and Christmas. After building a set of baseline models, I compare their performance to the in-class Q1 results, analyze systematic patterns in prediction errors across space, time, and demographics, and then engineer new features to improve the model. I conclude with a critical reflection on operational usefulness, equity implications, and the limitations of this modeling approach.
# Part 1: Replication with a Different Quarter
## 1.0 Preparation
In this section, we set up the analysis environment by loading necessary R libraries, defining visualization themes, and configuring the Census API key.
### Setup
```{r}
#| message: false
#| warning: false
# Core tidyverse
library(tidyverse)
library(lubridate)
# Spatial data
library(sf)
library(tigris)
# Census data
library(tidycensus)
# Weather data
library(riem) # For Philadelphia weather from ASOS stations
# Visualization
library(viridis)
library(gridExtra)
library(knitr)
library(kableExtra)
# here!
library(here)
# Get rid of scientific notation. We gotta look good!
options(scipen = 999)
```
### Define Themes
```{r}
#| warning: false
plotTheme <- theme(
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10),
plot.caption = element_text(size = 8),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
axis.title = element_text(size = 11, face = "bold"),
panel.background = element_blank(),
panel.grid.major = element_line(colour = "#D0D0D0", linewidth = 0.2),
panel.grid.minor = element_blank(),
axis.ticks = element_blank(),
legend.position = "right"
)
mapTheme <- theme(
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10),
plot.caption = element_text(size = 8),
axis.line = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
panel.background = element_blank(),
panel.border = element_blank(),
panel.grid.major = element_line(colour = 'transparent'),
panel.grid.minor = element_blank(),
legend.position = "right",
plot.margin = margin(1, 1, 1, 1, 'cm'),
legend.key.height = unit(1, "cm"),
legend.key.width = unit(0.2, "cm")
)
palette5 <- c("#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c")
```
### Set Census API key
```{r census_key_hidden, include=FALSE}
# Hidden key for rendering
# Key read from an environment variable so the secret is not committed to the document
census_api_key(Sys.getenv("CENSUS_API_KEY"), overwrite = TRUE)
```
## 1.1 Load and Inspect Data
Here I import the Indego trip data for Q4 2024 and Philadelphia Census data, perform a spatial join to associate stations with census tracts, filter for valid stations, and visualize daily ridership trends to understand the data structure.
### Bike Share data
```{r}
indego <- read_csv("data/indego-trips-2024-q4.csv")
indego <- indego %>%
mutate(start_station = as.character(start_station))
# Quick look at the data
# glimpse(indego)
```
```{r}
indego <- indego %>%
mutate(
# Parse datetime
start_datetime = mdy_hm(start_time),
end_datetime = mdy_hm(end_time),
# Create hourly bins
interval60 = floor_date(start_datetime, unit = "hour"),
# Extract time features
week = week(interval60),
month = month(interval60, label = TRUE),
dotw = wday(interval60, label = TRUE),
hour = hour(interval60),
date = as.Date(interval60),
# Create useful indicators
weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
)
# Look at temporal features
head(indego %>% select(start_datetime, interval60, week, dotw, hour, weekend))
```
```{r}
# Daily trip counts
daily_trips <- indego %>%
group_by(date) %>%
summarize(trips = n())
ggplot(daily_trips, aes(x = date, y = trips)) +
geom_line(color = "#3182bd", linewidth = 1) +
geom_smooth(se = FALSE, color = "red", linetype = "dashed") +
labs(
title = "Indego Daily Ridership - Q4 2024",
subtitle = "Winter demand patterns in Philadelphia",
x = "Date",
y = "Daily Trips",
caption = "Source: Indego bike share"
) +
plotTheme
```
The chart illustrates the Indego bike share system's daily trip volume during Q4 2024, revealing two strong patterns. Firstly, there is a clear seasonal decline in demand, with trips steadily falling from approximately 4,500–5,000 in October to 1,000–2,000 by late December as winter approached. Secondly, a high degree of weekly volatility persists, with sharp peaks corresponding to weekdays and significant troughs for weekends or days with adverse weather conditions. This blend of decaying long-term trend and stable weekly cycles is critical for the space-time prediction models.
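The weekday/weekend contrast described above can be quantified directly. A minimal sketch, assuming the `daily_trips` table built in the previous chunk:

```{r}
# Compare average daily trips by day type (sketch; assumes `daily_trips`)
daily_trips %>%
  mutate(day_type = ifelse(wday(date, label = TRUE) %in% c("Sat", "Sun"),
                           "Weekend", "Weekday")) %>%
  group_by(day_type) %>%
  summarize(mean_trips = mean(trips), sd_trips = sd(trips), .groups = "drop")
```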
### Load Philadelphia Census Data
```{r, message=FALSE, warning=FALSE}
# Get Philadelphia census tracts
philly_census <- get_acs(
geography = "tract",
variables = c(
"B01003_001", # Total population
"B19013_001", # Median household income
"B08301_001", # Total commuters
"B08301_010", # Commute by transit
"B02001_002", # White alone
"B25077_001" # Median home value
),
state = "PA",
county = "Philadelphia",
year = 2022,
geometry = TRUE,
progress_bar = FALSE,
output = "wide"
) %>%
rename(
Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
Total_Commuters = B08301_001E,
Transit_Commuters = B08301_010E,
White_Pop = B02001_002E,
Med_Home_Value = B25077_001E
) %>%
mutate(
Percent_Taking_Transit = (Transit_Commuters / Total_Commuters) * 100,
Percent_White = (White_Pop / Total_Pop) * 100
) %>%
st_transform(crs = 4326) # WGS84 for lat/lon matching
# Check the data
# glimpse(philly_census)
```
### Join data to station
```{r}
# Create sf object for stations
stations_sf <- indego %>%
distinct(start_station, start_lat, start_lon) %>%
filter(!is.na(start_lat), !is.na(start_lon)) %>%
st_as_sf(coords = c("start_lon", "start_lat"), crs = 4326)
# Spatial join to get census tract for each station
stations_census <- st_join(stations_sf, philly_census, left = TRUE) %>%
st_drop_geometry()
# Investigate whether all stations joined to census data -- some stations may fall in tracts without ACS estimates (e.g., non-residential tracts).
stations_for_map <- indego %>%
distinct(start_station, start_lat, start_lon) %>%
filter(!is.na(start_lat), !is.na(start_lon)) %>%
left_join(
stations_census %>% select(start_station, Med_Inc),
by = "start_station"
) %>%
mutate(has_census = !is.na(Med_Inc))
# Add back to trip data
indego_census <- indego %>%
left_join(
stations_census %>%
select(start_station, Med_Inc, Percent_Taking_Transit,
Percent_White, Total_Pop),
by = "start_station"
)
```
```{r}
# Identify which stations to keep
valid_stations <- stations_census %>%
filter(!is.na(Med_Inc)) %>%
pull(start_station)
# Filter trip data to valid stations only
indego_census <- indego %>%
filter(start_station %in% valid_stations) %>%
left_join(
stations_census %>%
select(start_station, Med_Inc, Percent_Taking_Transit,
Percent_White, Total_Pop),
by = "start_station"
)
```
## 1.2 Adapt Code
This section focuses on data processing and panel construction. Historical weather data is fetched for the quarter, trips are aggregated into a "station-hour" panel, and temporal lag features are created to capture demand persistence. Finally, the data are split into training and testing sets for out-of-sample evaluation.
### Update weather data
```{r}
weather_data <- riem_measures(
station = "PHL", # Philadelphia International Airport
date_start = "2024-10-01",
date_end = "2024-12-31"
)
# Process weather data
weather_processed <- weather_data %>%
mutate(
interval60 = floor_date(valid, unit = "hour"),
Temperature = tmpf, # Temperature in Fahrenheit
Precipitation = ifelse(is.na(p01i), 0, p01i), # Hourly precip in inches
Wind_Speed = sknt # Wind speed in knots
) %>%
arrange(interval60, valid) %>%
distinct(interval60, .keep_all = TRUE) %>%
select(interval60, Temperature, Precipitation, Wind_Speed)
# Fill any missing hours, carrying the last observation forward (not true interpolation)
weather_complete <- weather_processed %>%
complete(interval60 = seq(min(interval60), max(interval60), by = "hour")) %>%
fill(Temperature, Precipitation, Wind_Speed, .direction = "down")
# Look at the weather
#summary(weather_complete %>% select(Temperature, Precipitation, Wind_Speed))
```
### Aggregate to station-hour level
```{r}
# Count trips by station-hour
trips_panel <- indego_census %>%
group_by(interval60, start_station, start_lat, start_lon,
Med_Inc, Percent_Taking_Transit, Percent_White, Total_Pop) %>%
summarize(Trip_Count = n()) %>%
ungroup()
```
```{r}
# Calculate expected panel size
n_stations <- length(unique(trips_panel$start_station))
n_hours <- length(unique(trips_panel$interval60))
expected_rows <- n_stations * n_hours
# Create complete panel
study_panel <- expand.grid(
interval60 = unique(trips_panel$interval60),
start_station = unique(trips_panel$start_station)
) %>%
# Join trip counts
left_join(trips_panel, by = c("interval60", "start_station")) %>%
# Replace NA trip counts with 0
mutate(Trip_Count = replace_na(Trip_Count, 0))
# Fill in station attributes (they're the same for all hours)
station_attributes <- trips_panel %>%
group_by(start_station) %>%
summarize(
start_lat = first(start_lat),
start_lon = first(start_lon),
Med_Inc = first(Med_Inc),
Percent_Taking_Transit = first(Percent_Taking_Transit),
Percent_White = first(Percent_White),
Total_Pop = first(Total_Pop)
)
study_panel <- study_panel %>%
left_join(station_attributes, by = "start_station")
# Verify we have complete panel
# cat("Complete panel rows:", format(nrow(study_panel), big.mark = ","), "\n")
```
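A quick sanity check, assuming `study_panel` and `expected_rows` from the chunk above, confirms the panel really is balanced:

```{r}
# Every station should appear in every hour, with no missing trip counts
stopifnot(
  nrow(study_panel) == expected_rows,
  !any(is.na(study_panel$Trip_Count))
)
```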
```{r}
study_panel <- study_panel %>%
mutate(
week = week(interval60),
month = month(interval60, label = TRUE),
dotw = wday(interval60, label = TRUE),
hour = hour(interval60),
date = as.Date(interval60),
weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
)
study_panel <- study_panel %>%
left_join(weather_complete, by = "interval60")
```
### Temporal Lag
```{r}
# Sort by station and time
study_panel <- study_panel %>%
arrange(start_station, interval60)
# Create lag variables WITHIN each station
study_panel <- study_panel %>%
group_by(start_station) %>%
mutate(
lag1Hour = lag(Trip_Count, 1),
lag2Hours = lag(Trip_Count, 2),
lag3Hours = lag(Trip_Count, 3),
lag12Hours = lag(Trip_Count, 12),
lag1day = lag(Trip_Count, 24)
) %>%
ungroup()
# Remove rows with NA lags (first 24 hours for each station)
study_panel_complete <- study_panel %>%
filter(!is.na(lag1day))
```
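To make the grouped-lag logic concrete, here is a small self-contained toy example showing that `lag()` inside `group_by()` never carries values across stations:

```{r}
# Toy panel: two stations, three hours each
toy <- tibble(
  station = rep(c("A", "B"), each = 3),
  hour = rep(1:3, times = 2),
  trips = c(5, 7, 2, 1, 0, 4)
)
toy %>%
  group_by(station) %>%
  mutate(lag1 = lag(trips, 1)) %>%  # B's hour-1 lag is NA, not A's last value
  ungroup()
```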
```{r}
# Split by week
# Q4 2024 spans weeks ~40-53 (Oct-Dec)
# Train on weeks < 51 (October - mid-December)
# Test on weeks >= 51 (late December)
# Which stations have trips in BOTH early and late periods?
early_stations <- study_panel_complete %>%
filter(week < 51) %>%
filter(Trip_Count > 0) %>%
distinct(start_station) %>%
pull(start_station)
late_stations <- study_panel_complete %>%
filter(week >= 51) %>%
filter(Trip_Count > 0) %>%
distinct(start_station) %>%
pull(start_station)
# Keep only stations that appear in BOTH periods
common_stations <- intersect(early_stations, late_stations)
# Filter panel to only common stations
study_panel_complete <- study_panel_complete %>%
filter(start_station %in% common_stations)
# NOW create train/test split
train <- study_panel_complete %>%
filter(week < 51)
test <- study_panel_complete %>%
filter(week >= 51)
```
## 1.3 Predictive Models
Five linear regression models are built with increasing complexity, ranging from a baseline time/weather model to advanced models including temporal lags and station fixed effects. We then evaluate their performance by calculating the Mean Absolute Error (MAE) on the test set.
### Model 1: Baseline (Time + Weather)
```{r}
# Create day of week factor with treatment (dummy) coding
train <- train %>%
mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
# Set contrasts to treatment coding (dummy variables)
contrasts(train$dotw_simple) <- contr.treatment(7)
# Now run the model
model1 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation,
data = train
)
```
### Model 2: Add Temporal Lags
```{r model2}
model2 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day,
data = train
)
```
### Model 3: Add Demographics
```{r model3}
# Note: the .x/.y suffixes arise because station attributes were joined twice
# during panel construction (once via trips_panel, once via station_attributes)
model3 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day +
Med_Inc.x + Percent_Taking_Transit.y + Percent_White.y,
data = train
)
```
### Model 4: Add Station Fixed Effects
```{r model4}
model4 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day +
Med_Inc.x + Percent_Taking_Transit.y + Percent_White.y +
as.factor(start_station),
data = train
)
```
### Model 5: Add Rush Hour Interaction
```{r model5}
model5 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day + rush_hour + as.factor(month) +
Med_Inc.x + Percent_Taking_Transit.y + Percent_White.y +
as.factor(start_station) +
rush_hour * weekend, # Rush hour effects different on weekends
data = train
)
```
### Predictive Models Analysis
```{r}
# Create day of week factor with treatment (dummy) coding
test <- test %>%
mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
# Set contrasts to treatment coding (dummy variables)
contrasts(test$dotw_simple) <- contr.treatment(7)
test <- test %>%
mutate(
pred1 = predict(model1, newdata = test),
pred2 = predict(model2, newdata = test),
pred3 = predict(model3, newdata = test),
pred4 = predict(model4, newdata = test),
pred5 = predict(model5, newdata = test)
)
# Calculate MAE for each model
mae_results <- data.frame(
Model = c(
"1. Time + Weather",
"2. + Temporal Lags",
"3. + Demographics",
"4. + Station FE",
"5. + Rush Hour Interaction"
),
MAE = c(
mean(abs(test$Trip_Count - test$pred1), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred2), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred3), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred4), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred5), na.rm = TRUE)
)
)
kable(mae_results,
digits = 2,
caption = "Mean Absolute Error by Model",
col.names = c("Model", "MAE (trips)")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
Analysis of the Q4 2024 data shows that Model 2 (Temporal Lags) performs best, achieving the lowest Mean Absolute Error (MAE) of 0.40. Introducing lagged variables lowered the MAE from 0.52 (baseline Model 1) to 0.40, confirming that recent demand history is a strong predictor of future trips. Interestingly, increasing model complexity by adding demographics (Model 3) and station fixed effects (Model 4) actually worsened performance, pushing the MAE up to 0.63 and 0.65 respectively. This suggests that static features such as station fixed effects overfit the higher-demand autumn training data and failed to generalize to the distinct, lower-demand winter patterns in the test set.
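Since rebalancing failures are driven by occasional large misses, it may also be worth reporting RMSE, which penalizes large errors more heavily than MAE. A sketch, assuming the `pred1`-`pred5` columns created in the evaluation chunk above:

```{r}
# RMSE alongside MAE (sketch; assumes predictions already added to `test`)
sapply(paste0("pred", 1:5), function(p) {
  sqrt(mean((test$Trip_Count - test[[p]])^2, na.rm = TRUE))
})
```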
## 1.4 MAE Comparison to Q1 2025
```{r}
library(kableExtra)
# Create the comparison dataframe based on your results
comparison_data <- data.frame(
Model = c(
"1. Time + Weather",
"2. + Temporal Lags",
"3. + Demographics",
"4. + Station FE",
"5. + Rush Hour Interaction"
),
# Q4 2024 results from this analysis
MAE_Q4_2024 = c(0.52, 0.40, 0.63, 0.65, 0.64),
# Q1 2025 reference values from the in-class workflow
MAE_Q1_2025 = c(0.60, 0.50, 0.74, 0.73, 0.73)
)
# Generate the table
kable(comparison_data,
digits = 2,
caption = "Model Performance Comparison: Q4 2024 vs. Q1 2025",
col.names = c("Model", "Q4 2024 MAE", "Q1 2025 MAE")) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
```
Comparing our results to the Q1 2025 baseline, the Q4 models achieved lower MAE values across the board, and in both quarters the MAE increased substantially when static features such as demographics and station fixed effects (Models 3 & 4) were added to the simpler lag model (Model 2).
Our best model reached an MAE of 0.40, compared to 0.50 in Q1. This difference is likely driven by the lower overall trip volumes during the late-December test weeks in Q4 compared to the active winter commuting period in Q1. Temporally, Q4 differs from Q1 by exhibiting a steep seasonal decline from October to December, as opposed to the stable low-demand baseline of deep winter. Despite these seasonal differences, temporal lags remained the most important feature in our quarter. However, unlike Q1, our Q4 model relied more heavily on weather variables to capture the dramatic transition from comfortable autumn temperatures to freezing winter conditions.
# Part 2: Error Analysis
## 2.1 Spatial patterns
```{r spatial_errors}
test <- test %>%
mutate(
error = Trip_Count - pred2,
abs_error = abs(error),
time_of_day = case_when(
hour < 7 ~ "Overnight",
hour >= 7 & hour < 10 ~ "AM Rush",
hour >= 10 & hour < 15 ~ "Mid-Day",
hour >= 15 & hour <= 18 ~ "PM Rush",
hour > 18 ~ "Evening"
)
)
## Create Two Maps Side-by-Side with Proper Legends
# Calculate station errors
station_errors <- test %>%
filter(!is.na(pred2)) %>%
group_by(start_station, start_lat.y, start_lon.y) %>%
summarize(
MAE = mean(abs(Trip_Count - pred2), na.rm = TRUE),
avg_demand = mean(Trip_Count, na.rm = TRUE),
.groups = "drop"
) %>%
filter(!is.na(start_lat.y), !is.na(start_lon.y))
# Map 1: Prediction Errors
p1 <- ggplot() +
geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.2) +
geom_point(
data = station_errors,
aes(x = start_lon.y, y = start_lat.y, color = MAE),
size = 3.5,
alpha = 0.7
) +
scale_color_viridis(
option = "plasma",
name = "MAE\n(trips)",
direction = -1,
breaks = c(0.5, 1.0, 1.5), # Fewer, cleaner breaks
labels = c("0.5", "1.0", "1.5")
) +
labs(title = "Prediction Errors",
subtitle = "Higher in Center City") +
mapTheme +
theme(
legend.position = "right",
legend.title = element_text(size = 10, face = "bold"),
legend.text = element_text(size = 9),
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10)
) +
guides(color = guide_colorbar(
barwidth = 1.5,
barheight = 12,
title.position = "top",
title.hjust = 0.5
))
# Map 2: Average Demand
p2 <- ggplot() +
geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.2) +
geom_point(
data = station_errors,
aes(x = start_lon.y, y = start_lat.y, color = avg_demand),
size = 3.5,
alpha = 0.7
) +
scale_color_viridis(
option = "viridis",
name = "Avg\nDemand",
direction = -1,
breaks = c(0.5, 1.0, 1.5, 2.0, 2.5), # Clear breaks
labels = c("0.5", "1.0", "1.5", "2.0", "2.5")
) +
labs(title = "Average Demand",
subtitle = "Trips per station-hour") +
mapTheme +
theme(
legend.position = "right",
legend.title = element_text(size = 10, face = "bold"),
legend.text = element_text(size = 9),
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10)
) +
guides(color = guide_colorbar(
barwidth = 1.5,
barheight = 12,
title.position = "top",
title.hjust = 0.5
))
# Combine with better layout
library(gridExtra)
library(grid)
grid.arrange(
p1, p2,
ncol = 2,
top = textGrob(
"Model 2 Performance: Errors vs. Demand Patterns",
gp = gpar(fontsize = 16, fontface = "bold")
)
)
```
**Interpretation**\
The error maps show a clear link between demand and prediction errors. The areas with the highest MAE, colored orange and red, directly match the areas with the highest average demand. These errors are focused in Center City and University City. This shows the model struggles with the high volume of trips in these busy areas. The demand in these zones is often spontaneous, which makes it much harder to predict compared to quieter residential neighborhoods.
## 2.2 Temporal patterns
```{r}
# MAE by time of day and day type
temporal_errors <- test %>%
group_by(time_of_day, weekend) %>%
summarize(
MAE = mean(abs_error, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))
ggplot(temporal_errors, aes(x = time_of_day, y = MAE, fill = day_type)) +
geom_col(position = "dodge") +
scale_fill_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
labs(
title = "Prediction Errors by Time Period",
subtitle = "When is the model struggling most?",
x = "Time of Day",
y = "Mean Absolute Error (trips)",
fill = "Day Type"
) +
plotTheme +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
**Interpretation**\
The temporal analysis highlights distinct challenges during peak times. The bar chart clearly indicates that the model struggles the most during the Weekday PM Rush. The Mean Absolute Error (MAE) reaches almost 0.7 trips at this time, which is much higher than during the AM Rush or midday. This suggests that afternoon travel behavior is simply more unpredictable. People may run errands or attend social events after work, which is less regular than the fixed morning commute. On the other hand, the model performs best overnight when demand is consistently low.\
## 2.3 Demographic patterns
```{r}
# Join demographic data to station errors
station_errors_demo <- station_errors %>%
left_join(
station_attributes %>% select(start_station, Med_Inc, Percent_Taking_Transit, Percent_White),
by = "start_station"
) %>%
filter(!is.na(Med_Inc))
# Create plots
p1 <- ggplot(station_errors_demo, aes(x = Med_Inc, y = MAE)) +
geom_point(alpha = 0.5, color = "#3182bd") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
scale_x_continuous(labels = scales::dollar) +
labs(title = "Errors vs. Median Income", x = "Median Income", y = "MAE") +
plotTheme
p2 <- ggplot(station_errors_demo, aes(x = Percent_Taking_Transit, y = MAE)) +
geom_point(alpha = 0.5, color = "#3182bd") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Errors vs. Transit Usage", x = "% Taking Transit", y = "MAE") +
plotTheme
p3 <- ggplot(station_errors_demo, aes(x = Percent_White, y = MAE)) +
geom_point(alpha = 0.5, color = "#3182bd") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Errors vs. Race", x = "% White", y = "MAE") +
plotTheme
grid.arrange(p1, p2, p3, ncol = 2)
```
**Interpretation**\
The scatterplots show an interesting pattern between neighborhood data and errors.
- Income: There is a positive trend with Median Income. As income increases, the prediction error also increases.
- Race: Similarly, neighborhoods with a higher percentage of white residents show higher errors.
- Transit: Areas with high public transit use have lower prediction errors.
This pattern occurs because wealthy, white neighborhoods often have the busiest bike share stations. The higher error is simply caused by the higher volume of riders. It does not reflect the model's ability to understand specific populations.
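The volume-bias explanation above can be checked directly with a correlation, assuming the `station_errors` table from Section 2.1:

```{r}
# If errors simply track volume, MAE and average demand should be
# strongly positively correlated across stations
cor(station_errors$MAE, station_errors$avg_demand, use = "complete.obs")
```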
# Part 3: Feature Engineering & Model Improvement
The new features chosen were a holiday indicator, a "feels like" temperature, and a 7-day rolling average of demand. These were added to the best-performing lag model, Model 2.
**Holiday**: Bike share demand changes drastically on major holidays like Thanksgiving or Christmas. These events are strong temporal factors. The model needs a specific dummy variable to account for these massive shifts in rider behavior.
**Feels Like Temperature** : Since the quarter is Q4 (moving into winter), wind chill becomes a major factor. The "Feels Like" Temperature (calculated from actual temperature and wind speed) is a better predictor of rider comfort and willingness to ride than the raw air temperature. A 40°F day with high wind feels significantly colder and discourages riding more than a calm 40°F day.
**7-Day Rolling Average**: Here I calculate a rolling mean of trip counts over the previous 7 days (168 hours) for each station to act as a demand "smoother". By incorporating the 7-day rolling average, the model captures each station's recent baseline demand. This allows it to distinguish random spikes from genuine shifts in usage, such as a station trending busier over the past week, providing a more stable signal than single-point lags alone.
## 3.1 My Model: Model 2 + New Features
```{r}
library(zoo)
# holidays
holiday_dates <- as.Date(c(
"2024-11-28", # Thanksgiving
"2024-11-29", # Black Friday
"2024-12-24", # Christmas Eve
"2024-12-25", # Christmas Day
"2024-12-31" # New Year's Eve
))
# Add the new features on top of study_panel_complete
study_panel_fe <- study_panel_complete %>%
arrange(start_station, interval60) %>%
group_by(start_station) %>%
mutate(
# 7-day (168-hour) rolling mean of hourly trip counts
rolling7 = rollmean(Trip_Count, k = 7 * 24, fill = NA, align = "right")
) %>%
ungroup() %>%
mutate(
rolling7 = replace_na(rolling7, 0),
holiday = ifelse(date %in% holiday_dates, 1, 0),
wind_mph = Wind_Speed * 1.15078,
feels_like = ifelse(
Temperature <= 50 & wind_mph > 3,
35.74 + 0.6215 * Temperature -
35.75 * (wind_mph^0.16) +
0.4275 * Temperature * (wind_mph^0.16),
Temperature
)
)
```
```{r}
train_fe <- study_panel_fe %>%
filter(week < 51)
test_fe <- study_panel_fe %>%
filter(week >= 51)
# day-of-week
train_fe <- train_fe %>%
mutate(dotw_simple = factor(dotw,
levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
test_fe <- test_fe %>%
mutate(dotw_simple = factor(dotw,
levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
# Treatment coding
contrasts(train_fe$dotw_simple) <- contr.treatment(7)
contrasts(test_fe$dotw_simple) <- contr.treatment(7)
```
```{r}
# Baseline model on feature-engineered data: same structure as original Model 2
model2_fe <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple +
Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day,
data = train_fe
)
```
```{r}
# Model 6: Model 2 + (holiday, feels_like, rolling7)
model6 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple +
Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day +
holiday + feels_like + rolling7,
data = train_fe
)
```
## 3.2 MAE comparison
```{r}
# ---- Predict using Model 2_fe and Model 6 ----
# Predictions on test_fe
test_fe$pred2 <- predict(model2_fe, newdata = test_fe)
test_fe$pred6 <- predict(model6, newdata = test_fe)
# Compute MAE with na.rm = TRUE
mae2 <- mean(abs(test_fe$Trip_Count - test_fe$pred2), na.rm = TRUE)
mae6 <- mean(abs(test_fe$Trip_Count - test_fe$pred6), na.rm = TRUE)
mae_comparison <- data.frame(
Model = c("Model 2_fe: Time + Weather + Lags",
"Model 6: Model 2 + Holiday + Feels-like + Rolling7"),
MAE = c(mae2, mae6)
)
mae_comparison %>%
kable(digits = 3,
caption = "MAE Comparison: Baseline Model 2_fe vs. Improved Model 6") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
The addition of the holiday, feels-like temperature, and rolling-average features modestly improved model performance: the Mean Absolute Error (MAE) decreased from 0.397 for the baseline Model 2 to 0.394 for the improved Model 6.
These features likely improved predictions where weather and holidays are key. The feels_like feature improves prediction accuracy on cold or windy days. The holiday feature specifically improves accuracy on high-impact dates like Thanksgiving, where demand patterns completely shift away from a standard Thursday schedule.
**Trying a Poisson model for count data**
```{r}
model_pois <- glm(
Trip_Count ~ as.factor(hour) + dotw_simple +
Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day +
holiday + feels_like + rolling7,
data = train_fe,
family = poisson(link = "log")
)
```
```{r}
test_fe$pred_pois <- predict(model_pois, newdata = test_fe, type = "response")
mae_pois <- mean(abs(test_fe$Trip_Count - test_fe$pred_pois), na.rm = TRUE)
mae_pois
```
Unfortunately, the Poisson model, which is theoretically better suited to non-negative count data than standard linear regression, did not improve prediction performance: it yielded an MAE of 0.398, slightly worse than both the linear baseline (0.397) and our feature-engineered model (0.394). This outcome likely occurred because, for high-demand stations, trip volumes are large enough that the data approximates a normal distribution, allowing the linear model to remain highly robust. Consequently, the added complexity of the Poisson link function did not provide a tangible benefit over the strong predictive power of the linear autoregressive lags in this specific context.
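One way to probe why the Poisson fit disappointed is to check its equal mean-variance assumption on the training counts; strong overdispersion would argue for an alternative such as a negative binomial model. A sketch, assuming `train_fe` from the chunks above:

```{r}
# Poisson assumes variance == mean; a variance far above the mean
# indicates overdispersion
c(mean = mean(train_fe$Trip_Count), variance = var(train_fe$Trip_Count))
```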
# Part 4: Critical Reflection
1. **Operational implications & limitations:** Achieving a final MAE of 0.394 is encouraging: it implies an average deviation of less than half a trip per station-hour, a reliable baseline for fleet scheduling. However, this overall accuracy masks a critical time sensitivity: the model is weakest during the PM Rush, when underestimating demand creates a serious problem, with commuters finding full stations and no place to dock. I therefore suggest treating the model not as an "autopilot" solution but as a decision-support tool that assists dispatchers rather than replaces them. A further key limitation is the assumption of unconstrained vehicle supply: the model predicts latent demand rather than realized trips. Future improvements should integrate real-time "bikes available" data and account for spatial spillover effects to better reflect operational reality.
2. **Equity considerations:**\
From an equity perspective, absolute errors are significantly higher in Center City and in high-income neighborhoods. While this is primarily a volume bias rather than algorithmic discrimination, relying solely on these predictions for rebalancing could unintentionally prioritize high-traffic areas and marginalize lower-demand or low-income communities. To prevent this, I recommend that Indego set a "minimum service level" guaranteeing baseline inventory regardless of predicted demand, and monitor relative error to ensure consistent performance across demographics. Low demand does not mean unimportant; efficiency must be balanced with equity so the system benefits all residents.
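The relative-error monitoring suggested above could be sketched as follows, assuming the `station_errors_demo` table from Section 2.3; the median-income split and the 0.1-trip demand floor are illustrative choices, not part of the original analysis:

```{r}
# Relative MAE scales each station's error by its typical demand, so
# low-volume stations are not hidden behind small absolute errors
station_errors_demo %>%
  mutate(
    rel_MAE = MAE / pmax(avg_demand, 0.1),  # floor avoids divide-by-near-zero
    income_group = ifelse(Med_Inc > median(Med_Inc, na.rm = TRUE),
                          "Above-median income", "Below-median income")
  ) %>%
  group_by(income_group) %>%
  summarize(mean_rel_MAE = mean(rel_MAE, na.rm = TRUE), .groups = "drop")
```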