Assignment 5: Space-Time Prediction of Bike Share Demand: Philadelphia Indego

Authors

Lingxuan Gao

Jinyang Xu

Published

December 1, 2025

Introduction

The Rebalancing Challenge in Philadelphia

Philadelphia’s Indego bike share system faces the same operational challenge as every bike share system: rebalancing bikes to meet anticipated demand.

In this lab. we build predictive models that forecast bike share demand across space (different stations) and time (different hours) to help solve this operational problem.

Setup

Load Libraries

Code

# Core tidyverse
library(tidyverse)
library(lubridate)

# Spatial data
library(sf)
library(tigris)

# Census data
library(tidycensus)

# Weather data
library(riem)  # For Philadelphia weather from ASOS stations

# Visualization
library(viridis)
library(gridExtra)
library(knitr)
library(kableExtra)
# Get rid of scientific notation. We gotta look good!
options(scipen = 999)

Define Themes

Code

plotTheme <- theme(
  plot.title = element_text(size = 14, face = "bold"),
  plot.subtitle = element_text(size = 10),
  plot.caption = element_text(size = 8),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title = element_text(size = 11, face = "bold"),
  panel.background = element_blank(),
  panel.grid.major = element_line(colour = "#D0D0D0", size = 0.2),
  panel.grid.minor = element_blank(),
  axis.ticks = element_blank(),
  legend.position = "right"
)

mapTheme <- theme(
  plot.title = element_text(size = 14, face = "bold"),
  plot.subtitle = element_text(size = 10),
  plot.caption = element_text(size = 8),
  axis.line = element_blank(),
  axis.text = element_blank(),
  axis.ticks = element_blank(),
  axis.title = element_blank(),
  panel.background = element_blank(),
  panel.border = element_blank(),
  panel.grid.major = element_line(colour = 'transparent'),
  panel.grid.minor = element_blank(),
  legend.position = "right",
  plot.margin = margin(1, 1, 1, 1, 'cm'),
  legend.key.height = unit(1, "cm"),
  legend.key.width = unit(0.2, "cm")
)

palette5 <- c("#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c")

Part 1: Replicate with Q3 2024 & Compare to Q1 2025

Data Import & Preparation

In this lab, we will use the Indego data (Q3 2024), because we want to explore the pattern’s diffenrence between summer and winter.

Download data for Q3 2024

Code

# Read Q3 2024 data
indego_q3 <- read_csv("data/indego-trips-2024-q3.csv")

Examine the Data Structure

Code

# How many trips?
cat("Total trips in Q3 2024:", nrow(indego_q3), "\n")

Total trips in Q3 2024: 408408

Code

# Date range
cat("Date range:", 
    min(mdy_hm(indego_q3$start_time)), "to", 
    max(mdy_hm(indego_q3$start_time)), "\n")

Date range: 1719792120 to 1727740740

Code

as.POSIXct(1719792120, origin = "1970-01-01", tz = "America/New_York")

[1] "2024-06-30 20:02:00 EDT"

Code

as.POSIXct(1727740740, origin = "1970-01-01", tz = "America/New_York")

[1] "2024-09-30 19:59:00 EDT"

Code

# How many unique stations?
cat("Unique start stations:", length(unique(indego_q3$start_station)), "\n")

Unique start stations: 261

Code

# Trip types
table(indego_q3$trip_route_category)


   One Way Round Trip 
    380939      27469

Code

# Passholder types
table(indego_q3$passholder_type)


 Day Pass  Indego30 Indego365   Walk-up 
    22885    239857    132358     13308

Code

# Bike types
table(indego_q3$bike_type)


electric standard 
  236839   171569

Create Time Bins

We need to aggregate trips into hourly intervals for our panel data structure.

Code

indego_q3 <- indego_q3 %>%
  mutate(
    # Parse datetime
    start_datetime = mdy_hm(start_time),
    end_datetime = mdy_hm(end_time),
    
    # Create hourly bins
    interval60 = floor_date(start_datetime, unit = "hour"),
    
    # Extract time features
    week = week(interval60),
    month = month(interval60, label = TRUE),
    dotw = wday(interval60, label = TRUE),
    hour = hour(interval60),
    date = as.Date(interval60),
    
    # Create useful indicators
    weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
    rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
  )

# Look at temporal features
head(indego_q3 %>% select(start_datetime, interval60, week, dotw, hour, weekend))

# A tibble: 6 × 6
  start_datetime      interval60           week dotw   hour weekend
  <dttm>              <dttm>              <dbl> <ord> <int>   <dbl>
1 2024-07-01 00:02:00 2024-07-01 00:00:00    27 Mon       0       0
2 2024-07-01 00:03:00 2024-07-01 00:00:00    27 Mon       0       0
3 2024-07-01 00:04:00 2024-07-01 00:00:00    27 Mon       0       0
4 2024-07-01 00:05:00 2024-07-01 00:00:00    27 Mon       0       0
5 2024-07-01 00:06:00 2024-07-01 00:00:00    27 Mon       0       0
6 2024-07-01 00:06:00 2024-07-01 00:00:00    27 Mon       0       0

Exploratory Analysis

We aggregated individual trip records by date to obtain total daily ridership, then used ggplot2 to plot a time-series line chart with a smoothed trend line to visualize how Indego bike-share demand changed over the third quarter of 2024.

Trips Over Time

Code

# Daily trip counts
daily_trips <- indego_q3 %>%
  group_by(date) %>%
  summarize(trips = n())

ggplot(daily_trips, aes(x = date, y = trips)) +
  geom_line(color = "#3182bd", linewidth = 1) +
  geom_smooth(se = FALSE, color = "red", linetype = "dashed") +
  labs(
    title = "Indego Daily Ridership - Q3 2024",
    subtitle = "Winter demand patterns in Philadelphia",
    x = "Date",
    y = "Daily Trips",
    caption = "Source: Indego bike share"
  ) +
  plotTheme

Interpretation:

The first pattern is the weekly heartbeat. Those sharp peaks and valleys repeat with metronomic regularity. Indego always dips on the same days each week, typically weekends, because commuting demand relaxes

Then there’s the gentle seasonal slope underneath that weekly pulse. Early July starts relatively high, and then the baseline slowly drifts downward into early August (Maybe because of the summer vocation). Mid-August shows a little recovery, followed by a September lift since the fall semester began. Before the season’s first chilly whisper pushes ridership down again heading into October.

Hourly Patterns

Code

# Average trips by hour and day type
hourly_patterns <- indego_q3 %>%
  group_by(hour, weekend) %>%
  summarize(avg_trips = n() / n_distinct(date)) %>%
  mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))

ggplot(hourly_patterns, aes(x = hour, y = avg_trips, color = day_type)) +
  geom_line(linewidth = 1.2) +
  scale_color_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
  labs(
    title = "Average Hourly Ridership Patterns",
    subtitle = "Clear commute patterns on weekdays",
    x = "Hour of Day",
    y = "Average Trips per Hour",
    color = "Day Type"
  ) +
  plotTheme

Interpretation: On weekdays, the peaks are tied to commuting: A morning spike around 8–9 AM, when the city jolts awake and people pedal to work. A bigger afternoon/evening spike around 5–6 PM, the classic homebound surge. The slope up to these points is steep, like a heartbeat aligned with office life. On weekends, the rhythm relaxes. There’s no sharp, early-morning people are not rushing anywhere. Ridership climbs more slowly through the morning toward a broad mid-day plateau, roughly 11 AM to 3 PM, and then drifts downward without the evening spike seen on weekdays.

So the pattern suggests that weekday trips are purposeful and synchronized with the workday, while weekend trips stretch out across the middle of the day in a more leisurely wave.

Top Stations

Code

# Most popular origin stations
top_stations <- indego_q3 %>%
  count(start_station, start_lat, start_lon, name = "trips") %>%
  arrange(desc(trips)) %>%
  head(20)

kable(top_stations, 
      caption = "Top 20 Indego Stations by Trip Origins",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Top 20 Indego Stations by Trip Origins
start_station	start_lat	start_lon	trips
3,010	39.94711	-75.16618	6,654
3,032	39.94527	-75.17971	5,436
3,244	39.93865	-75.16674	4,421
3,054	39.96250	-75.17420	4,305
3,359	39.94888	-75.16978	4,305
3,296	39.95134	-75.16758	4,252
3,101	39.94295	-75.15955	4,226
3,020	39.94855	-75.19007	4,154
3,295	39.95028	-75.16027	4,144
3,066	39.94561	-75.17348	4,114
3,059	39.96244	-75.16121	4,022
3,022	39.95472	-75.18323	3,954
3,362	39.94816	-75.16226	3,948
3,163	39.94974	-75.18097	3,942
3,028	39.94061	-75.14958	3,921
3,208	39.95048	-75.19324	3,919
3,061	39.95425	-75.17761	3,796
3,046	39.95012	-75.14472	3,734
3,063	39.94633	-75.16980	3,726
3,185	39.95169	-75.15888	3,678

Get Philadelphia Spatial Context

Load Philadelphia Census Data

We’ll get census tract data to add demographic context to our stations.

Code

# Get Philadelphia census tracts
philly_census <- get_acs(
  geography = "tract",
  variables = c(
    "B01003_001",  # Total population
    "B19013_001",  # Median household income
    "B08301_001",  # Total commuters
    "B08301_010",  # Commute by transit
    "B02001_002",  # White alone
    "B25077_001"   # Median home value
  ),
  state = "PA",
  county = "Philadelphia",
  year = 2022,
  geometry = TRUE,
  output = "wide",
  progress_bar = FALSE
) %>%
  rename(
    Total_Pop = B01003_001E,
    Med_Inc = B19013_001E,
    Total_Commuters = B08301_001E,
    Transit_Commuters = B08301_010E,
    White_Pop = B02001_002E,
    Med_Home_Value = B25077_001E
  ) %>%
  mutate(
    Percent_Taking_Transit = (Transit_Commuters / Total_Commuters) * 100,
    Percent_White = (White_Pop / Total_Pop) * 100
  ) %>%
  st_transform(crs = 4326)  # WGS84 for lat/lon matching

# Check the data
glimpse(philly_census)

Rows: 408
Columns: 17
$ GEOID                  <chr> "42101001500", "42101001800", "42101002802", "4…
$ NAME                   <chr> "Census Tract 15; Philadelphia County; Pennsylv…
$ Total_Pop              <dbl> 3251, 3300, 5720, 4029, 4415, 1815, 3374, 2729,…
$ B01003_001M            <dbl> 677, 369, 796, 437, 853, 210, 480, 734, 763, 11…
$ Med_Inc                <dbl> 110859, 114063, 78871, 61583, 32347, 48581, 597…
$ B19013_001M            <dbl> 24975, 30714, 20396, 22293, 4840, 13812, 6278, …
$ Total_Commuters        <dbl> 2073, 2255, 3032, 2326, 1980, 969, 2427, 708, 2…
$ B08301_001M            <dbl> 387, 308, 478, 383, 456, 189, 380, 281, 456, 68…
$ Transit_Commuters      <dbl> 429, 123, 685, 506, 534, 192, 658, 218, 438, 51…
$ B08301_010M            <dbl> 188, 66, 219, 144, 285, 71, 278, 184, 176, 235,…
$ White_Pop              <dbl> 2185, 2494, 3691, 3223, 182, 984, 2111, 231, 35…
$ B02001_002M            <dbl> 268, 381, 592, 380, 88, 190, 463, 112, 238, 778…
$ Med_Home_Value         <dbl> 568300, 605000, 350600, 296400, 76600, 289700, …
$ B25077_001M            <dbl> 58894, 34876, 12572, 22333, 10843, 118720, 1506…
$ geometry               <MULTIPOLYGON [°]> MULTIPOLYGON (((-75.16558 3..., MU…
$ Percent_Taking_Transit <dbl> 20.694645, 5.454545, 22.592348, 21.754084, 26.9…
$ Percent_White          <dbl> 67.2100892, 75.5757576, 64.5279720, 79.9950360,…

Map Philadelphia Context

Our code builds a contextual map by shading each Philadelphia census tract according to its median household income. Then it overlays red points representing Indego trip start locations. The figure shows that most bike-share activity clusters in central and lower-income neighborhoods, while higher-income outer areas have fewer stations and fewer trip origins.

Code

# Map median income
ggplot() +
  geom_sf(data = philly_census, aes(fill = Med_Inc), color = NA) +
  scale_fill_viridis(
    option = "viridis",
    name = "Median\nIncome",
    labels = scales::dollar
  ) +
  labs(
    title = "Philadelphia Median Household Income by Census Tract",
    subtitle = "Context for understanding bike share demand patterns"
  ) +
  # Stations 
  geom_point(
    data = indego_q3,
    aes(x = start_lon, y = start_lat),
    color = "red", size = 0.25, alpha = 0.6
  ) +
  mapTheme

Join Census Data to Stations

We’ll spatially join census characteristics to each bike station.

Code

# Create sf object for stations
stations_sf <- indego_q3 %>%
  distinct(start_station, start_lat, start_lon) %>%
  filter(!is.na(start_lat), !is.na(start_lon)) %>%
  st_as_sf(coords = c("start_lon", "start_lat"), crs = 4326)

# Spatial join to get census tract for each station
stations_census <- st_join(stations_sf, philly_census, left = TRUE) %>%
  st_drop_geometry()

# Look at the result - investigate whether all of the stations joined to census data -- according to the map above there are stations in non-residential tracts.

stations_for_map <- indego_q3 %>%
  distinct(start_station, start_lat, start_lon) %>%
  filter(!is.na(start_lat), !is.na(start_lon)) %>%
  left_join(
    stations_census %>% select(start_station, Med_Inc),
    by = "start_station"
  ) %>%
  mutate(has_census = !is.na(Med_Inc))

# Add back to trip data
indego_q3_census <- indego_q3 %>%
  left_join(
    stations_census %>% 
      select(start_station, Med_Inc, Percent_Taking_Transit, 
             Percent_White, Total_Pop),
    by = "start_station"
  )


# Prepare data for visualization
stations_for_map <- indego_q3 %>%
  distinct(start_station, start_lat, start_lon) %>%
  filter(!is.na(start_lat), !is.na(start_lon)) %>%
  left_join(
    stations_census %>% select(start_station, Med_Inc),
    by = "start_station"
  ) %>%
  mutate(has_census = !is.na(Med_Inc))

# Create the map showing problem stations
ggplot() +
  geom_sf(data = philly_census, aes(fill = Med_Inc), color = "white", size = 0.1) +
  scale_fill_viridis(
    option = "viridis",
    name = "Median\nIncome",
    labels = scales::dollar,
    na.value = "grey90"
  ) +
  # Stations with census data (small grey dots)
  geom_point(
    data = stations_for_map %>% filter(has_census),
    aes(x = start_lon, y = start_lat),
    color = "grey30", size = 1, alpha = 0.6
  ) +
  # Stations WITHOUT census data (red X marks the spot)
  geom_point(
    data = stations_for_map %>% filter(!has_census),
    aes(x = start_lon, y = start_lat),
    color = "red", size = 1, shape = 4, stroke = 1.5
  ) +
  labs(
    title = "Philadelphia Median Household Income by Census Tract",
    subtitle = "Indego stations shown (RED = no census data match)",
    caption = "Red X marks indicate stations that didn't join to census tracts"
  ) +
  mapTheme

Dealing with missing data

In this step, we are going to remove the non-residential bike share stations.

Code

# Identify which stations to keep
valid_stations <- stations_census %>%
  filter(!is.na(Med_Inc)) %>%
  pull(start_station)

# Filter trip data to valid stations only
indego_q3_census <- indego_q3 %>%
  filter(start_station %in% valid_stations) %>%
  left_join(
    stations_census %>% 
      select(start_station, Med_Inc, Percent_Taking_Transit, 
             Percent_White, Total_Pop),
    by = "start_station"
  )

Get Weather Data

Weather significantly affects bike share demand. We get hourly weather for Philadelphia in this part.

Code

# Get weather from Philadelphia International Airport (KPHL)
# This covers Q3 2024: July 1 - September 31
weather_data <- riem_measures(
  station = "PHL",  # Philadelphia International Airport
  date_start = "2024-07-01",
  date_end = "2024-09-30"
)

# Process weather data
weather_processed <- weather_data %>%
  mutate(
    interval60 = floor_date(valid, unit = "hour"),
    Temperature = tmpf,  # Temperature in Fahrenheit
    Precipitation = ifelse(is.na(p01i), 0, p01i),  # Hourly precip in inches
    Wind_Speed = sknt  # Wind speed in knots
  ) %>%
  select(interval60, Temperature, Precipitation, Wind_Speed) %>%
  distinct()

# Check for missing hours and interpolate if needed
weather_complete <- weather_processed %>%
  complete(interval60 = seq(min(interval60), max(interval60), by = "hour")) %>%
  fill(Temperature, Precipitation, Wind_Speed, .direction = "down")

# Look at the weather
summary(weather_complete %>% select(Temperature, Precipitation, Wind_Speed))

  Temperature    Precipitation        Wind_Speed    
 Min.   :55.00   Min.   :0.000000   Min.   : 0.000  
 1st Qu.:70.00   1st Qu.:0.000000   1st Qu.: 4.000  
 Median :76.00   Median :0.000000   Median : 7.000  
 Mean   :75.59   Mean   :0.007896   Mean   : 6.893  
 3rd Qu.:81.00   3rd Qu.:0.000000   3rd Qu.: 9.000  
 Max.   :98.00   Max.   :1.250000   Max.   :44.000

Visualize Weather Patterns

Code

ggplot(weather_complete, aes(x = interval60, y = Temperature)) +
  geom_line(color = "#3182bd", alpha = 0.7) +
  geom_smooth(se = FALSE, color = "red") +
  labs(
    title = "Philadelphia Temperature - Q3 2024",
    subtitle = "Summer to early autumn transition",
    x = "Date",
    y = "Temperature (°F)"
  ) +
  plotTheme

Interpretation: The pattern is a classic seasonal transition: hot, volatile summer days gradually giving way to cooler, steadier early-fall conditions.

Create Space-Time Panel

Aggregate Trips to Station-Hour Level

Code

# Count trips by station-hour
trips_panel <- indego_q3_census %>%
  group_by(interval60, start_station, start_lat, start_lon,
           Med_Inc, Percent_Taking_Transit, Percent_White, Total_Pop) %>%
  summarize(Trip_Count = n()) %>%
  ungroup()

# How many station-hour observations?
nrow(trips_panel)

[1] 193072

Code

# How many unique stations?
length(unique(trips_panel$start_station))

[1] 241

Code

# How many unique hours?
length(unique(trips_panel$interval60))

[1] 2202

Create Complete Panel Structure

Not every station has trips every hour. We need a complete panel where every station-hour combination exists (even if Trip_Count = 0).

Code

# Calculate expected panel size
n_stations <- length(unique(trips_panel$start_station))
n_hours <- length(unique(trips_panel$interval60))
expected_rows <- n_stations * n_hours

cat("Expected panel rows:", format(expected_rows, big.mark = ","), "\n")

Expected panel rows: 530,682

Code

cat("Current rows:", format(nrow(trips_panel), big.mark = ","), "\n")

Current rows: 193,072

Code

cat("Missing rows:", format(expected_rows - nrow(trips_panel), big.mark = ","), "\n")

Missing rows: 337,610

Code

# Create complete panel
study_panel <- expand.grid(
  interval60 = unique(trips_panel$interval60),
  start_station = unique(trips_panel$start_station)
) %>%
  # Join trip counts
  left_join(trips_panel, by = c("interval60", "start_station")) %>%
  # Replace NA trip counts with 0
  mutate(Trip_Count = replace_na(Trip_Count, 0))

# Fill in station attributes (they're the same for all hours)
station_attributes <- trips_panel %>%
  group_by(start_station) %>%
  summarize(
    start_lat = first(start_lat),
    start_lon = first(start_lon),
    Med_Inc = first(Med_Inc),
    Percent_Taking_Transit = first(Percent_Taking_Transit),
    Percent_White = first(Percent_White),
    Total_Pop = first(Total_Pop)
  )

study_panel <- study_panel %>%
  left_join(station_attributes, by = "start_station")

# Verify we have complete panel
cat("Complete panel rows:", format(nrow(study_panel), big.mark = ","), "\n")

Complete panel rows: 530,682

Add Time Features

Code

study_panel <- study_panel %>%
  mutate(
    week = week(interval60),
    month = month(interval60, label = TRUE),
    dotw = wday(interval60, label = TRUE),
    hour = hour(interval60),
    date = as.Date(interval60),
    weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
    rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
  )

# Create day of week factor with treatment (dummy) coding
study_panel <- study_panel %>%
  mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))

# Set contrasts to treatment coding (dummy variables)
contrasts(study_panel$dotw_simple) <- contr.treatment(7)

Join Weather Data

Code

study_panel <- study_panel %>%
  left_join(weather_complete, by = "interval60")

# Check for missing values
summary(study_panel %>% select(Trip_Count, Temperature, Precipitation))

   Trip_Count       Temperature    Precipitation   
 Min.   : 0.0000   Min.   :55.00   Min.   :0.0000  
 1st Qu.: 0.0000   1st Qu.:70.00   1st Qu.:0.0000  
 Median : 0.0000   Median :76.00   Median :0.0000  
 Mean   : 0.7068   Mean   :75.59   Mean   :0.0079  
 3rd Qu.: 1.0000   3rd Qu.:81.00   3rd Qu.:0.0000  
 Max.   :24.0000   Max.   :98.00   Max.   :1.2500  
                   NA's   :5784    NA's   :5784

Create Temporal Lag Variables

The key innovation for space-time prediction: past demand predicts future demand.

Why Lags?

If there were 15 bike trips from Station A at 8:00 AM, there will probably be ~15 trips at 9:00 AM. We can use this temporal persistence to improve predictions.

Code

# Sort by station and time
study_panel <- study_panel %>%
  arrange(start_station, interval60)

# Create lag variables WITHIN each station
study_panel <- study_panel %>%
  group_by(start_station) %>%
  mutate(
    lag1Hour = lag(Trip_Count, 1),
    lag2Hours = lag(Trip_Count, 2),
    lag3Hours = lag(Trip_Count, 3),
    lag12Hours = lag(Trip_Count, 12),
    lag1day = lag(Trip_Count, 24),
    lag1week = lag(Trip_Count, 24*7)
  ) %>%
  ungroup()

# Remove rows with NA lags (first 24 hours for each station)
study_panel_complete <- study_panel %>%
  filter(!is.na(lag1day))

cat("Rows after removing NA lags:", format(nrow(study_panel_complete), big.mark = ","), "\n")

Rows after removing NA lags: 646,362

Visualize Lag Correlations

Code

# Sample one station to visualize
example_station <- study_panel_complete %>%
  filter(start_station == first(start_station)) %>%
  head(168)  # One week

# Plot actual vs lagged demand
ggplot(example_station, aes(x = interval60)) +
  geom_line(aes(y = Trip_Count, color = "Current"), linewidth = 1) +
  geom_line(aes(y = lag1Hour, color = "1 Hour Ago"), linewidth = 1, alpha = 0.7) +
  geom_line(aes(y = lag1day, color = "24 Hours Ago"), linewidth = 1, alpha = 0.7) +
  scale_color_manual(values = c(
    "Current" = "#08519c",
    "1 Hour Ago" = "#3182bd",
    "24 Hours Ago" = "#6baed6"
  )) +
  labs(
    title = "Temporal Lag Patterns at One Station",
    subtitle = "Past demand predicts future demand",
    x = "Date-Time",
    y = "Trip Count",
    color = "Time Period"
  ) +
  plotTheme

Temporal Train/Test Split

The code performs a clean temporal split for forecasting. It first separates the dataset into early weeks and later weeks, identifies stations that appear in both periods, and keeps only those consistent stations. Then it creates a training set using weeks 27–36 and a testing set using weeks 37–40, ensuring the model is trained on past data and evaluated on future data.

Code

# Split by week
# Q3 has weeks 27-40 (Jul-Sep)
# Train on weeks 27-36 (Jan 1 - early September)
# Test on weeks 37-40 (rest of September)

# Which stations have trips in BOTH early and late periods?
early_stations <- study_panel_complete %>%
  filter(week < 37) %>%
  filter(Trip_Count > 0) %>%
  distinct(start_station) %>%
  pull(start_station)

late_stations <- study_panel_complete %>%
  filter(week >= 37) %>%
  filter(Trip_Count > 0) %>%
  distinct(start_station) %>%
  pull(start_station)

# Keep only stations that appear in BOTH periods
common_stations <- intersect(early_stations, late_stations)

# Filter panel to only common stations
study_panel_complete <- study_panel_complete %>%
  filter(start_station %in% common_stations)

# NOW create train/test split
train <- study_panel_complete %>%
  filter(week < 37)

test <- study_panel_complete %>%
  filter(week >= 37)

cat("Training observations:", format(nrow(train), big.mark = ","), "\n")

Training observations: 470,940

Code

cat("Testing observations:", format(nrow(test), big.mark = ","), "\n")

Testing observations: 159,330

Code

cat("Training date range:", min(train$date), "to", max(train$date), "\n")

Training date range: 19905 to 19974

Code

cat("Testing date range:", min(test$date), "to", max(test$date), "\n")

Testing date range: 19975 to 19996

Build Predictive Models

We’ll build 5 models with increasing complexity to see what improves predictions.

This is a model evaluation function that can generate data

Code

# model evaluation function
model_metrics_lm <- function(model, test_data, response_var = "Trip_Count") {
  model_name <- deparse(substitute(model))
  
  if (inherits(model, "glm") && family(model)$family == "poisson") {
    type       <- "Poisson"
    y_train    <- model$y
    yhat_train <- predict(model, type = "response")
    
    sse <- sum((y_train - yhat_train)^2)
    sst <- sum((y_train - mean(y_train))^2)
    r2  <- 1 - sse / sst
    
    n <- length(y_train)
    p <- length(coef(model))
    adjr2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
    
  } else if (inherits(model, "lm")) {
    type       <- "OLS"
    s          <- summary(model)
    r2         <- s$r.squared
    adjr2      <- s$adj.r.squared
    y_train    <- model$model[[response_var]]
    yhat_train <- fitted(model)
    
  } else {
    stop("Only supports lm and glm(poisson).")
  }
  
  mae_train <- mean(abs(y_train - yhat_train))
  
  if (inherits(model, "glm") && family(model)$family == "poisson") {
    yhat_test <- predict(model, newdata = test_data, type = "response")
  } else {
    yhat_test <- predict(model, newdata = test_data)
  }
  y_test <- test_data[[response_var]]
  cc <- complete.cases(y_test, yhat_test)
  mae_test <- mean(abs(y_test[cc] - yhat_test[cc]))
  
  cat("Model:", model_name, "\n")
  cat("  Type:              ", type, "\n")
  cat("  R-squared (train): ", round(r2, 4), "\n")
  cat("  Adj R-squared:     ", round(adjr2, 4), "\n")
  cat("  MAE (train):       ", round(mae_train, 4), "\n")
  cat("  MAE (test):        ", round(mae_test, 4), "\n\n")
}

Model 1: Baseline (Time + Weather)

This model captures daily and weekly cycles, weather effects.

Code

# Now run the model
model1 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation,
  data = train
)

Code

model_metrics_lm(model1, test_data = test)

Model: model1 
  Type:               OLS 
  R-squared (train):  0.1085 
  Adj R-squared:      0.1085 
  MAE (train):        0.7937 
  MAE (test):         0.8293

Interpretation: The model shows that trip counts vary strongly by hour of day. Compared to the baseline hour (midnight), coefficients rise sharply in the morning and peak around the late afternoon, matching the commuter pattern observed earlier. Day-of-week effects are smaller but still significant: mid-week days tend to have slightly higher ridership, while weekends show lower counts. Weather matters too—higher temperatures slightly reduce trips, and precipitation has a larger negative effect. Although all predictors are statistically significant (given the huge sample size), the model explains only about 11% of the variation, meaning most hour-to-hour fluctuations are driven by other unmeasured factors such as station-specific conditions or random demand swings.

Model 2: Add Temporal Lags

Code

model2 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day,
  data = train
)

Code

model_metrics_lm(model2, test_data = test)

Model: model2 
  Type:               OLS 
  R-squared (train):  0.3318 
  Adj R-squared:      0.3317 
  MAE (train):        0.6655 
  MAE (test):         0.6857

Interpretation: The model still captures the usual hourly and day-of-week ridership patterns, but the major change comes from adding the lagged demand terms. The one-hour lag has the strongest effect—stations that were busy an hour ago are much more likely to be busy now. The three-hour lag and the previous-day same-hour lag also contribute meaningful predictive power. Weather remains significant but plays a smaller role. Overall model fit improves substantially: R² rises from about 0.11 to about 0.33, and residual error drops, showing that including recent past demand explains much more of the variation. This confirms that short-term station activity is strongly tied to its own immediate history.

Model 3: Add Demographics

This model captures neighborhood effects on demand

Code

model3 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day +
    Med_Inc.y + Percent_Taking_Transit.y + Percent_White.y,
  data = train
)

Code

model_metrics_lm(model3, test_data = test)

Model: model3 
  Type:               OLS 
  R-squared (train):  0.3373 
  Adj R-squared:      0.3373 
  MAE (train):        0.6673 
  MAE (test):         0.6868

Interpretation: The R square means the model explains about 34% of hourly variation in bike demand.Adding demographics raises R² slightly compared to Model 2, they help, but only modestly.Temporal lags dominate the model, while demographics show systematic spatial patterns. All three demographic variables are all significant, and it shows that demographics matter, but compared to temporal lags and hour-of-day, their effect sizes are small.

Model 4: Add Station Fixed Effects

Code

model4 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day +
    Med_Inc.y + Percent_Taking_Transit.y + Percent_White.y +
    as.factor(start_station),
  data = train
)

Code

model_metrics_lm(model4, test_data = test)

Model: model4 
  Type:               OLS 
  R-squared (train):  0.3627 
  Adj R-squared:      0.3624 
  MAE (train):        0.6667 
  MAE (test):         0.6831

Interpretation: Model 4 adds a dummy variable for every station, letting the model control for unobserved station-specific factors. It shows that station-specific factors are strong drivers of bike-share demand. Adding fixed effects significantly boosts the model’s ability to explain past variation, but the improvement does not reliably boost predictive accuracy. Even though R² increases, the model’s MAE does not improve much. Because station fixed effects are great for explaining historical data, but they don’t generalize as well when predicting new periods.

Model 5: Add Rush Hour Interaction

Code

model5 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day + rush_hour + as.factor(month) +
    Med_Inc.y + Percent_Taking_Transit.y + Percent_White.y +
    as.factor(start_station) +
    rush_hour * weekend,  # Rush hour effects different on weekends
  data = train
)

Code

model_metrics_lm(model5, test_data = test)

Model: model5 
  Type:               OLS 
  R-squared (train):  0.3669 
  Adj R-squared:      0.3665 
  MAE (train):        0.6653 
  MAE (test):         0.6833

Interpretation: Model 5 introduces an interaction between rush hour and weekends, allowing the model to differentiate the strong weekday commuting peaks from the much softer weekend patterns. It also adds month effects to capture seasonal shifts across July, August, and September. Even though Model 5 has the highest R², it performs worse on the test set. This indicates that Model 5 is overfit: it explains historical variation better but loses forecasting accuracy. Model 4 remains the best balance of simplicity and predictive accuracy.

Model Evaluation

Calculate Predictions and MAE

Code

# Get predictions on test set
test <- test %>%
  mutate(
    pred1 = predict(model1, newdata = test),
    pred2 = predict(model2, newdata = test),
    pred3 = predict(model3, newdata = test),
    pred4 = predict(model4, newdata = test),
    pred5 = predict(model5, newdata = test)
  )

# Calculate MAE for each model
mae_results1 <- data.frame(
  Model = c(
    "1. Time + Weather",
    "2. + Temporal Lags",
    "3. + Demographics",
    "4. + Station FE",
    "5. + Rush Hour Interaction"
  ),
  MAE = c(
    mean(abs(test$Trip_Count - test$pred1), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred2), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred3), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred4), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred5), na.rm = TRUE)
  )
)

kable(mae_results1, 
      digits = 4,
      caption = "Mean Absolute Error by Model (Test Set)",
      col.names = c("Model", "MAE (trips)")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Mean Absolute Error by Model (Test Set)
Model	MAE (trips)
1. Time + Weather	0.8293
2. + Temporal Lags	0.6857
3. + Demographics	0.6868
4. + Station FE	0.6831
5. + Rush Hour Interaction	0.6833

Interpretation: 1. Temporal lags are the most important predictive feature.

Model 2 gives the biggest drop in MAE, showing that short-term demand patterns drive accuracy.

Demographics help explain variation but don’t improve prediction.
Station fixed effects give a small but meaningful improvement.

They adjust for inherent differences between stations.

Extra complexity (Model 5) does not yield better predictions.

The model is already near its predictive ceiling.

Best predictive models are Model 2 (simplest) and Model 4 (slightly better).

Visualize Model Comparison

Code

ggplot(mae_results1, aes(x = reorder(Model, -MAE), y = MAE)) +
  geom_col(fill = "#3182bd",alpha = 0.8) +
  geom_text(aes(label = round(MAE, 4)), vjust = -0.5) +
  labs(
    title = "Model Performance Comparison",
    subtitle = "Lower MAE = Better Predictions",
    x = "Model",
    y = "Mean Absolute Error (trips)"
  ) +
  scale_y_continuous(
    expand = expansion(mult = c(0, 0.1))
  ) +
  plotTheme +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Compare results to Q1 2025:

1. How do the MAE values compare across quarters?

Q1 2025 MAE values are generally lower than those we reported for Q3 2024.

So predictions in Q1 are simply easier.

2. Why might they differ?

Winter(Q1) has smoother, more predictable demand: Riders may behave more consistently in colder months. There are fewer tourists. Weather is cold but stable—more consistently low temperatures, fewer big swings.

Summer(Q3) is noisy and harder to predict: Weather-driven volatility: heat waves, thunderstorms, humidity. More variation in leisure rides, weekend trips, and irregular patterns. Tourism and events add bursts of unpredictable activity.

So the model in Q3 faces more randomness, and MAE naturally rises.

3. Are temporal patterns different (summer vs. winter)?

Yes. In winter, ridership follows a tighter, more predictable rhythm. The daily totals rise gradually from January into March without the big, volatile spikes that appear in summer. Hourly patterns also become cleaner: weekday mornings and evenings show sharp commuter peaks, while weekends stay much flatter with only a modest mid-day bump. By contrast, summer brings stronger mid-day activity, higher weekend volume, and much more weather-driven noise. Winter demand is steadier and more commute-oriented, whereas summer mixes commuting with a large amount of leisure riding, creating far more variation.

4. Which features are most important in your quarter?

Temporal lags. The results show a large performance jump when lag variables (1 hour, 3 hours, 1 day) are added. Summer ridership changes quickly with events, weather shifts, and bursts of leisure activity.

Part 2: Error Analysis

Space-Time Error Analysis

Observed vs. Predicted

Let’s use our best model (Model 4) for error analysis.

Code

test <- test %>%
  mutate(
    error = Trip_Count - pred4,
    abs_error = abs(error),
    time_of_day = case_when(
      hour < 7 ~ "Overnight",
      hour >= 7 & hour < 10 ~ "AM Rush",
      hour >= 10 & hour < 15 ~ "Mid-Day",
      hour >= 15 & hour <= 18 ~ "PM Rush",
      hour > 18 ~ "Evening"
    )
  )

# Scatter plot by time and day type
ggplot(test, aes(x = Trip_Count, y = pred4)) +
  geom_point(alpha = 0.2, color = "#3182bd") +
  geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
  geom_smooth(method = "lm", se = FALSE, color = "darkgreen") +
  facet_grid(weekend ~ time_of_day) +
  labs(
    title = "Observed vs. Predicted Bike Trips",
    subtitle = "Model 4 performance by time period",
    x = "Observed Trips",
    y = "Predicted Trips",
    caption = "Red line = perfect predictions; Green line = actual model fit"
  ) +
  plotTheme

Interpretation: Where the model performs well: It does best in low-to-moderate demand conditions, especially overnight and during mid-day. Those periods have fewer big swings and smaller trip counts, so the model can track the pattern closely. On weekends—when demand is generally smoother—the points cluster tightly and the green line stays close to the red line. The model doesn’t dramatically over- or under-shoot in these calmer periods.

Where the model struggles: It struggles during high-demand commuter windows, especially the AM and PM rush hours. In these panels, the cloud of points stretches upward far past the green line, showing that the model systematically under-predicts the busiest moments. Sharp surges in demand are hard to anticipate with simple temporal lags, so the model tends to flatten out the peaks. There is also more scatter on weekday rush hours than on weekends, reflecting the added variability of work-day travel.

In conclusion, the model handles routine, low-volume times well, but it misses the sudden spikes that define heavy commuting periods.

Spatial Error Patterns

In this part, we are exploring whether prediction errors are clustered in certain parts of Philadelphia.

Code

# 1. Calculate MAE by stations
station_errors <- test %>%
  filter(!is.na(pred4)) %>%
  group_by(start_station, start_lat.y, start_lon.y) %>%
  summarize(
    MAE = mean(abs(Trip_Count - pred2), na.rm = TRUE),
    avg_demand = mean(Trip_Count, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(!is.na(start_lat.y), !is.na(start_lon.y))

# 2. Map 1：Prediction Errors
p1 <- ggplot() +
  geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.1) +
  geom_point(
    data = station_errors,
    aes(x = start_lon.y, y = start_lat.y, color = MAE),
    size = 0.5, alpha = 0.7
  ) +
  scale_color_viridis(
    option = "plasma",
    name = "MAE (trips)",
    direction = -1,
    breaks = c(0.5, 1.0, 1.5),
    labels = c("0.5", "1.0", "1.5")
  ) +
  labs(title = "Prediction Errors") +
  mapTheme +
  theme(
    legend.position = "bottom",
    legend.title = element_text(size = 10, face = "bold"),
    legend.text = element_text(size = 9),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5)
  ) +
  guides(color = guide_colorbar(
    barwidth = 12, barheight = 1,
    title.position = "top", title.hjust = 0.5
  ))

# 3. Map 2：Average Demand
p2 <- ggplot() +
  geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.1) +
  geom_point(
    data = station_errors,
    aes(x = start_lon.y, y = start_lat.y, color = avg_demand),
    size = 0.5, alpha = 0.7
  ) +
  scale_color_viridis(
    option = "viridis",
    name = "Avg Demand (trips/hour)",
    direction = -1,
    breaks = c(0.5, 1.0, 1.5, 2.0, 2.5),
    labels = c("0.5", "1.0", "1.5", "2.0", "2.5")
  ) +
  labs(title = "Average Demand") +
  mapTheme +
  theme(
    legend.position = "bottom",
    legend.title = element_text(size = 10, face = "bold"),
    legend.text = element_text(size = 9),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5)
  ) +
  guides(color = guide_colorbar(
    barwidth = 12, barheight = 1,
    title.position = "top", title.hjust = 0.5
  ))

# 4. Combine two maps
grid.arrange(p1, p2, ncol = 2)

Interpretation: Errors form a dense cluster in and around Center City, extending slightly into University City and parts of South Philly. The neighborhoods in center city have the highest errors.Because these stations often sit near job centers, transit hubs, parks, and retail corridors — all places with unpredictable surges. In short, The model struggles where human behavior is least predictable and demand is most dynamic.

Temporal Error Patterns

When are we most wrong?

Code

# MAE by time of day and day type
temporal_errors <- test %>%
  group_by(time_of_day, weekend) %>%
  summarize(
    MAE = mean(abs_error, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))

ggplot(temporal_errors, aes(x = time_of_day, y = MAE, fill = day_type)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
  labs(
    title = "Prediction Errors by Time Period",
    subtitle = "When is the model struggling most?",
    x = "Time of Day",
    y = "Mean Absolute Error (trips)",
    fill = "Day Type"
  ) +
  plotTheme +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Interpretation: When are errors highest? Errors peak during the commute periods: PM Rush and AM Rush. These periods have the most unpredictable spikes in demand.

Errors and Demographics

Are prediction errors related to neighborhood characteristics?

Code

# Join demographic data to station errors
station_errors_demo <- station_errors %>%
  left_join(
    station_attributes %>% select(start_station, Med_Inc, Percent_Taking_Transit, Percent_White),
    by = "start_station"
  ) %>%
  filter(!is.na(Med_Inc))

# Create plots
p1 <- ggplot(station_errors_demo, aes(x = Med_Inc, y = MAE)) +
  geom_point(alpha = 0.5, color = "#3182bd") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  scale_x_continuous(labels = scales::dollar) +
  labs(title = "Errors vs. Median Income", x = "Median Income", y = "MAE") +
  plotTheme

p2 <- ggplot(station_errors_demo, aes(x = Percent_Taking_Transit, y = MAE)) +
  geom_point(alpha = 0.5, color = "#3182bd") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Errors vs. Transit Usage", x = "% Taking Transit", y = "MAE") +
  plotTheme

p3 <- ggplot(station_errors_demo, aes(x = Percent_White, y = MAE)) +
  geom_point(alpha = 0.5, color = "#3182bd") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Errors vs. Race", x = "% White", y = "MAE") +
  plotTheme

grid.arrange(p1, p2, p3, ncol = 2)

Interpretation: Prediction errors slightly related to neighborhood characteristics?

Higher-income neighborhoods show slightly higher errors. The trend line slopes upward: as median income increases, MAE tends to rise a bit. This likely reflects busier, more mixed-use areas where demand is less predictable.
Neighborhoods with high transit use show lower errors. Stations located in areas with more transit riders tend to have lower MAE. Transit-heavy areas typically have more stable, commute-driven travel patterns that are easier for the model to predict.
Stations in whiter neighborhoods show slightly higher errors. The relationship is mild, but the upward slope suggests that stations in predominantly white, higher-income areas may see more inconsistent demand—often driven by leisure or irregular trip purposes.

Part 3 : Feature Engineering & Model Improvement

Feature Selection and Engineering

From the error analysis above, we noticed that stations with large prediction errors are concentrated in Center City. From the data structure, we also know that the distribution of Trip_Count is highly skewed. There are many observations with 0 or 1 trip and a few observations with very large counts. The large number of 0 or 1 counts makes the model under-predict for high values. The extreme large counts make the model over-predict for low values. This creates a structural bias that we cannot fully remove.

To address this, we want to add variables that can detect and partially adjust for this pattern. Distance to Center City is a natural candidate. It can help the model sense the spatial structure behind these errors. However, the station dummy variables may already capture most of the distance differences. Therefore, the distance variable needs to be transformed or used in interaction terms to avoid collinearity with the station fixed effects.

When we examine the trip-count-over-time plots, we also see spikes that appear irregularly. These spikes may reflect special events that disrupt the usual demand pattern. They do not happen often, but they can create extreme values and break the roughly linear relationship that the model tries to learn. This can have a strong impact on model performance. Federal holidays are one potential source of such events. The current model does not include enough variables to detect these effects.

Based on these observations, we decide to add two new variables:

1.Holiday indicator

Code

library(lubridate)

study_panel_complete <- study_panel_complete %>%
  mutate(
    # Convert datetime column to Date
    date = as.Date(interval60),

    # Weekday with Monday = 1, Tuesday = 2, ..., Sunday = 7
    wday_mon = wday(date, week_start = 1),
    
    # US Federal Holidays (rule-based, works for any year)
    
    # New Year's Day: January 1st
    holiday_newyear = month(date) == 1  & mday(date) == 1,

    # Martin Luther King Jr. Day: 3rd Monday in January
    holiday_mlk = month(date) == 1 &
                  wday_mon == 1 &        
                  mday(date) >= 15 & mday(date) <= 21,

    # Presidents’ Day: 3rd Monday in February
    holiday_pres = month(date) == 2 &
                   wday_mon == 1 &       
                   mday(date) >= 15 & mday(date) <= 21,

    # Memorial Day: last Monday in May (Monday on or after the 25th)
    holiday_memorial = month(date) == 5 &
                       wday_mon == 1 &     
                       mday(date) >= 25,

    # Independence Day: July 4th
    holiday_july4 = month(date) == 7 & mday(date) == 4,

    # Labor Day: 1st Monday in September (Monday between 1st and 7th)
    holiday_labor = month(date) == 9 &
                    wday_mon == 1 &        
                    mday(date) <= 7,

    # Thanksgiving: 4th Thursday in November
    holiday_thanks = month(date) == 11 &
                     wday_mon == 4 &              # Thursday
                     mday(date) >= 22 & mday(date) <= 28,

    # Christmas Day: December 25th
    holiday_xmas = month(date) == 12 & mday(date) == 25,

    # Combined holiday indicator: 1 if any of the above holidays, 0 otherwise
    holiday_any = as.integer(
      holiday_newyear | holiday_mlk | holiday_pres |
      holiday_memorial | holiday_july4 | holiday_labor |
      holiday_thanks | holiday_xmas
    )
  )

2.Distance to Center City

Code

# 1. City Hall point
philly_cbd_3857 <- st_sfc(
  st_point(c(-75.1636, 39.9526)),   # lon, lat
  crs = 4326
) %>%
  st_transform(3857)

# project the dataframe to the same CRS 3857 (to calculate distance in meter)
stations_dist <- stations_sf %>%
  st_transform(3857) %>%
  mutate(
    dist_center_m = as.numeric(st_distance(geometry, philly_cbd_3857))
  ) %>%
  st_drop_geometry() %>%
  select(start_station, dist_center_m) %>%
  distinct()

study_panel_complete <- study_panel_complete %>%
  left_join(stations_dist, by = "start_station")


study_panel_complete <- study_panel_complete %>%
  mutate(
    Rain_any   = as.integer(Precipitation > 0),
    Heavy_rain = as.integer(Precipitation >= 0.3) 
  )

# let's make a little change in the intersection by using "weekday*rush_hour"
study_panel_complete <- study_panel_complete %>%
  mutate(
    weekday = ifelse(dotw %in% c("Mon", "Tue", "Wed", "Thu", "Fri"), 1, 0)
  )

#split the train adn test data
train <- study_panel_complete %>%
  filter(week < 37)

test <- study_panel_complete %>%
  filter(week >= 37)

Model Improvement

Model 6: Adjust variables in model 4

In this model, we refine some existing variables and use the result as the second baseline for later improved models.

First, based on the summary of Model 4, we noticed that the Precipitation variable did not perform well. However, rain conditions should be an important factor for bike demand. The original variable is very small in magnitude because the total amount of rain is spread across 24 hours. To better capture the effect of rain, we transform it into two dummy variables that indicate whether it is rainy and whether it is raining heavily.

Second, we believe the weekend indicator is important because the demand pattern on weekends is different from weekdays. In the previous model, it did not work well because the interaction term weekend:rush_hour is highly collinear with as.factor(hour). To keep the weekend/weekday effect in the model, we replace this term with the interaction as.factor(hour):weekend.

Code

model6 <- lm(
  Trip_Count ~ as.factor(hour)* weekend + dotw_simple + 
    Temperature + 
    Rain_any + Heavy_rain +
    lag1Hour + lag3Hours + lag1day  + 
    as.factor(month) + 
    Med_Inc.y + Percent_Taking_Transit.y + Percent_White.y + 
    as.factor(start_station),  # Rush hour effects different on weekends
  data = train
)

Code

model_metrics_lm(model6, test_data = test)

Model: model6 
  Type:               OLS 
  R-squared (train):  0.3702 
  Adj R-squared:      0.3698 
  MAE (train):        0.664 
  MAE (test):         0.6783

Interpretation: Model 6 serves as our second baseline model after refining several key predictors.

On weekdays, the model shows strong morning and evening peaks that are consistent with commuting.
On weekends, the peaks flatten and shift toward the middle of the day.
The new rain dummies behave as expected: both rainy and heavy-rain hours are associated with lower trip counts compared with dry conditions.
Demographics and station fixed effects mainly shift the overall level of demand at each station, while the refined weather and weekend terms improve the shape of the predicted time profile without a large change in overall MAE.

Model 7: Add Holiday Indicator and Distance to center city Interation

This model is based on Model 6 and adds two new components: a holiday indicator and an interaction between distance to Center City and rush hour. As discussed above, the interaction term is not only used to capture the relationship between distance and peak-hour demand, but also to reduce collinearity with the station fixed effects. Adding these two variables may help the model better detect structural patterns in the data and improve overall performance.

Code

model7 <- lm(
  Trip_Count ~ as.factor(hour)* weekday + 
    dotw_simple + 
    Temperature + 
    Rain_any + Heavy_rain +
    lag1Hour + lag3Hours + lag1day  + 
    as.factor(month) + 
    Med_Inc.y + Percent_Taking_Transit.y + Percent_White.y + 
    dist_center_m*rush_hour  +                              # interaction
    holiday_any +                                           # holiday indicator
    as.factor(start_station),  
  data = train
)

Code

model_metrics_lm(model7, test_data = test)

Model: model7 
  Type:               OLS 
  R-squared (train):  0.3751 
  Adj R-squared:      0.3748 
  MAE (train):        0.6584 
  MAE (test):         0.6762

Interpretation: Model 7 builds on Model 6 by adding two variables that are motivated directly by our error analysis and knowledge of Philadelphia’s spatial and temporal structure: a holiday indicator and an interaction between distance to Center City and rush hour. The idea is that holidays and downtown peak periods are exactly when we observed the largest systematic errors in previous models. These additions slightly improve performance: the test MAE decreases from 0.6783 in Model 6 to 0.6762 in Model 7, and (R^2) rises only marginally from 0.3702 to 0.3751. This shows that, even after trying to encode what we learned about the error patterns, urban form, and time-of-day effects into new variables, most of the predictive power still comes from the core ingredients introduced earlier (time of day, temporal lags, and station fixed effects). The remaining errors are likely driven by noise and unobserved factors—such as special events or local disruptions—that are difficult to capture with simple linear terms.

Model 8: Try a poisson model for count data (base on model 7)

In addition to the linear regression models, we also estimate a Poisson regression with a log link to better reflect the count nature of the bike trip data. For this specification, we keep the same set of predictors as in Model 7 and only change the estimation method from OLS to Poisson. This allows us to directly compare the two approaches under an identical model structure. The Poisson model serves as a robustness check: it tests whether the main effects of time-of-day, weekend, weather, lagged demand, distance to Center City, and holidays remain consistent when we use a count model that is more appropriate for non-negative integer outcomes.

Code

# Poisson model with log link
poisson_model7 <- glm(
  Trip_Count ~ 
    as.factor(hour) * weekday +       
    dotw_simple +                     
    Temperature + Rain_any + Heavy_rain +
    lag1Hour + lag3Hours + lag1day +
    as.factor(month) +
    Med_Inc.y + Percent_Taking_Transit.y + Percent_White.y +
    dist_center_m*rush_hour +          
    holiday_any +
    as.factor(start_station),
  family = poisson(link = "log"),
  data = train
)

Code

model_metrics_lm(poisson_model7, test_data = test)

Model: poisson_model7 
  Type:               Poisson 
  R-squared (train):  0.3828 
  Adj R-squared:      0.3824 
  MAE (train):        0.6221 
  MAE (test):         0.6461

Interpretation: Finally, we re-estimate Model 7 using a Poisson regression with a log link while keeping exactly the same set of predictors. This change in estimation method leads to a modest but clear improvement in predictive performance: the test MAE decreases from 0.6762 in the OLS version of Model 7 to 0.6461 in the Poisson model, and the (R^2) on the training set increases slightly from 0.3751 to 0.3828. The Poisson specification fits the count nature of bike trips better and reduces some of the systematic under- and over-prediction we saw in the linear model, but the gain is not dramatic, which suggests that most remaining errors are due to noise and unobserved shocks rather than the choice between OLS and Poisson.

Overall Model Evaluation

Code

# Get predictions on test set

test <- test %>%
  mutate(
    pred1 = predict(model1, newdata = test),
    pred2 = predict(model2, newdata = test),
    pred3 = predict(model3, newdata = test),
    pred4 = predict(model4, newdata = test),
    pred5 = predict(model5, newdata = test),
    pred6 = predict(model6, newdata = test),
    pred7 = predict(model7, newdata = test),
    pred8 = predict(poisson_model7, newdata = test, type = "response")
  )

# Calculate MAE for each model
mae_results2 <- data.frame(
  
  Model = paste0(
    "Model ", 1:8
  ),
  Description = c(
    "1. Time + Weather",
    "2. + Temporal Lags",
    "3. + Demographics",
    "4. + Station FE",
    "5. + Rush Hour Interaction",
    "6. Adjust variables (Base on M4)",
    "7. + Holiday Indicator and Distance Interaction",
    "8. Poisson using variables in model 7"
  ),
  MAE = c(
    mean(abs(test$Trip_Count - test$pred1), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred2), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred3), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred4), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred5), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred6), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred7), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred8), na.rm = TRUE)
  ) 
) %>%
  mutate(
    improvement_from_baseline = round((MAE[1] - MAE) / MAE[1] * 100, 1),
    improvement_from_lags = round((MAE[6] - MAE) / MAE[2] * 100, 1)
  )

kable(mae_results2, 
      digits = 4,
      caption = "Mean Absolute Error by Model (Test Set)",
      col.names = c("Model", "Description", "MAE (trips)", "% Better than M1", "% Better than M6")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Mean Absolute Error by Model (Test Set)
Model	Description	MAE (trips)	% Better than M1	% Better than M6
Model 1	1. Time + Weather	0.8293	0.0	-22.0
Model 2	2. + Temporal Lags	0.6857	17.3	-1.1
Model 3	3. + Demographics	0.6868	17.2	-1.2
Model 4	4. + Station FE	0.6831	17.6	-0.7
Model 5	5. + Rush Hour Interaction	0.6833	17.6	-0.7
Model 6	6. Adjust variables (Base on M4)	0.6783	18.2	0.0
Model 7	7. + Holiday Indicator and Distance Interaction	0.6762	18.5	0.3
Model 8	8. Poisson using variables in model 7	0.6461	22.1	4.7

Interpretation: From Model 3 to Model 7, we keep adding more variables and refining the specification, but the improvement over Model 2 is quite limited. After including temporal lags in Model 2, the MAE already drops from 0.83 to about 0.69 trips, and this remains the dominant gain. Models 3–5 add demographics, station fixed effects, and a rush-hour interaction, and Models 6–7 further adjust weather, weekend, distance, and holiday variables. However, the MAE stays in a very narrow range around 0.68–0.676, which is only a small improvement compared to Model 2.

This pattern suggests that once we control for time-of-day and recent demand history, most additional predictors only provide marginal extra information for short-term station-level demand. Demographics and station dummies mainly shift overall levels rather than change hour-to-hour variation, and the added interaction terms capture more subtle structure but do not drastically reduce prediction error on the test set.

In contrast, Model 8 changes the estimation method rather than the predictor set. It applies a Poisson regression using the same variables as Model 7. This shift to a count model leads to a more noticeable improvement: the MAE decreases to 0.6461 trips, which is substantially better than the OLS models with the same predictors. This indicates that better matching the distributional form of the outcome (non-negative counts) can yield additional gains even when the set of explanatory variables is already rich.

Visualize Model Comparison

Code

ggplot(mae_results2, aes(x = reorder(Model, -MAE), y = MAE)) +
  geom_col(fill = "#3182bd",alpha = 0.8) +
  geom_text(aes(label = round(MAE, 4)), vjust = -0.5) +
  labs(
    title = "Model Performance Comparison",
    subtitle = "Lower MAE = Better Predictions",
    x = "Model",
    y = "Mean Absolute Error (trips)"
  ) +
  scale_y_continuous(
    expand = expansion(mult = c(0, 0.1))
  ) +
  plotTheme +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Part 4: Critical Reflection

Operational implications:

Our best model (Poisson, Model 8) has a test-set MAE of about 0.65 trips per station–hour. At this resolution a typical error is well under one bike, which is reasonably good given how noisy and skewed bike-share demand is. In practice Indego would often aggregate predictions over several hours and across groups of stations, so some of these small errors would cancel out and the effective error at the operational level would be even smaller.

Prediction errors become operationally problematic mainly when they are systematic in high-demand contexts. In our results the largest errors tend to occur at busy Center City stations during peak hours and around special days such as holidays or events. Under-predicting in those situations can leave key stations empty or full exactly when demand is highest, while over-predicting at low-demand stations mostly has minor consequences.

I would not use this model as a fully automated rebalancing system, but I would recommend it as a decision-support tool. It is best used on aggregated time windows and station clusters, to flag where demand is likely to be high relative to current inventory, while human operators retain control. As long as staff treat the forecasts as guidance rather than exact numbers, remain cautious around holidays or special events, and periodically monitor and retrain the model, the current performance is good enough to provide useful support for Indego’s rebalancing decisions.

Equity considerations:

From an equity perspective, the key question is whether our prediction errors are larger or more systematic in certain kinds of neighborhoods. In our error maps, the most obvious clusters of high error appear in Center City and other high-demand stations, which are generally higher-income and whiter areas, so the model does not obviously underperform only in disadvantaged neighborhoods. However, even small absolute errors in lower-income or majority-nonwhite areas can matter more, because there are fewer nearby alternatives if a station is empty or full. If rebalancing decisions rely heavily on these forecasts, any systematic under-prediction in these communities could translate into more frequent stock-outs and longer walking distances, which would reinforce existing disparities in bike access.

To reduce this risk, the system should include explicit safeguards. First, errors and service levels should be monitored by neighborhood characteristics such as income, race, and transit dependence, not only in aggregate. Second, rebalancing rules can include minimum service standards for equity-priority zones, so that these stations are never deprioritized solely because predicted demand is low or uncertain. Third, planners should periodically audit the model for bias, adjust the specification if one group is consistently under-served, and involve community stakeholders when defining which areas are “must-serve” regardless of forecasted demand. Used with these safeguards, the forecasting system is less likely to worsen existing disparities and can even help identify where additional investment in stations or rebalancing capacity is needed.

Model limitations:

Our model still misses several important patterns. It struggles with sharp spikes caused by special events, construction, or one-off disruptions, which do not follow the typical weekday/weekend and seasonal structure learned from the data. It also assumes that the relationship between predictors and demand is stable over time and largely linear on the log scale, even though real-world behavior can change with new infrastructure, pricing, or cultural shifts. In deployment, station capacity constraints, operational decisions (like temporary closures or rebalancing itself), and policy changes would all feed back into observed demand in ways the current model does not explicitly represent. With more time and data, I would incorporate richer features (such as real-time event data, land use and network connectivity, or app usage patterns), explore more flexible models such as gradient boosting or random forests with careful regularization, and estimate separate models for different station types or clusters. I would also set up an ongoing evaluation pipeline to monitor performance over time, detect drift, and retrain or recalibrate the model as Indego’s system and user behavior evolve.