Space-Time Prediction of Bike Share Demand: Philadelphia Indego

Author

Jingqi Lu

Published

December 8, 2025

Setup

Code

# Core tidyverse
library(tidyverse)
library(lubridate)

# Spatial data
library(sf)
library(tigris)

# Census data
library(tidycensus)

# Weather data
library(riem)  # For Philadelphia weather from ASOS stations

# Visualization
library(viridis)
library(gridExtra)
library(knitr)
library(kableExtra)

# here!
library(here)
# Get rid of scientific notation. We gotta look good!
options(scipen = 999)

Define Themes

These themes will be used for consistent styling of plots.

Code

plotTheme <- theme(
  plot.title = element_text(size = 14, face = "bold"),
  plot.subtitle = element_text(size = 10),
  plot.caption = element_text(size = 8),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title = element_text(size = 11, face = "bold"),
  panel.background = element_blank(),
  panel.grid.major = element_line(colour = "#D0D0D0", size = 0.2),
  panel.grid.minor = element_blank(),
  axis.ticks = element_blank(),
  legend.position = "right"
)

mapTheme <- theme(
  plot.title = element_text(size = 14, face = "bold"),
  plot.subtitle = element_text(size = 10),
  plot.caption = element_text(size = 8),
  axis.line = element_blank(),
  axis.text = element_blank(),
  axis.ticks = element_blank(),
  axis.title = element_blank(),
  panel.background = element_blank(),
  panel.border = element_blank(),
  panel.grid.major = element_line(colour = 'transparent'),
  panel.grid.minor = element_blank(),
  legend.position = "right",
  plot.margin = margin(1, 1, 1, 1, 'cm'),
  legend.key.height = unit(1, "cm"),
  legend.key.width = unit(0.2, "cm")
)

palette5 <- c("#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c")

Code

# Read Q1 2025 data
library(readr)
indego <- read_csv("data/indego-trips-2024-q2.csv")

# Quick look at the data
glimpse(indego)

Rows: 368,091
Columns: 15
$ trip_id             <dbl> 867727465, 867732652, 867732304, 867732603, 867732…
$ duration            <dbl> 8, 19, 9, 16, 7, 255, 56, 145, 37, 4, 40, 6, 3, 8,…
$ start_time          <chr> "4/1/2024 0:00", "4/1/2024 0:03", "4/1/2024 0:04",…
$ end_time            <chr> "4/1/2024 0:08", "4/1/2024 0:22", "4/1/2024 0:13",…
$ start_station       <dbl> 3168, 3317, 3104, 3295, 3041, 3075, 3112, 3362, 30…
$ start_lat           <dbl> 39.95134, 39.93633, 39.96664, 39.95028, 39.96849, …
$ start_lon           <dbl> -75.17394, -75.19447, -75.19209, -75.16027, -75.13…
$ end_station         <dbl> 3365, 3359, 3104, 3119, 3376, 3075, 3112, 3253, 30…
$ end_lat             <dbl> 39.96173, 39.94888, 39.96664, 39.96674, 39.97183, …
$ end_lon             <dbl> -75.18788, -75.16978, -75.19209, -75.20799, -75.14…
$ bike_id             <chr> "24987", "11555", "31244", "23243", "03696", "2535…
$ plan_duration       <dbl> 30, 30, 365, 30, 365, 30, 1, 30, 30, 365, 1, 30, 3…
$ trip_route_category <chr> "One Way", "One Way", "Round Trip", "One Way", "On…
$ passholder_type     <chr> "Indego30", "Indego30", "Indego365", "Indego30", "…
$ bike_type           <chr> "electric", "standard", "electric", "electric", "s…

1. Introduction

This report analyzes hourly bike share demand in Philadelphia using Indego data from Q2 2024. The goal is to build and evaluate space-time predictive models, diagnose where and why errors occur, and propose new features that improve forecasting performance for operational planning.

1.1 Examine the Data Structure

Code

# How many trips?
cat("Total trips in Q2 2024:", nrow(indego), "\n")

Total trips in Q2 2024: 368091

Code

# Date range
cat("Date range:", 
    min(mdy_hm(indego$start_time)), "to", 
    max(mdy_hm(indego$start_time)), "\n")

Date range: 1711929600 to 1719791760

Code

# How many unique stations?
cat("Unique start stations:", length(unique(indego$start_station)), "\n")

Unique start stations: 255

Code

# Trip types
table(indego$trip_route_category)


   One Way Round Trip 
    342299      25792

Code

# Passholder types
table(indego$passholder_type)


 Day Pass  Indego30 Indego365 
    20926    231909    115256

Code

# Bike types
table(indego$bike_type)


electric standard 
  216497   151594

1.2 Create Time Bins

We need to aggregate trips into hourly intervals for our panel data structure.

Code

indego <- indego %>%
  mutate(
    # Parse datetime
    start_datetime = mdy_hm(start_time),
    end_datetime = mdy_hm(end_time),
    
    # Create hourly bins
    interval60 = floor_date(start_datetime, unit = "hour"),
    
    # Extract time features
    week = week(interval60),
    month = month(interval60, label = TRUE),
    dotw = wday(interval60, label = TRUE),
    hour = hour(interval60),
    date = as.Date(interval60),
    
    # Create useful indicators
    weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
    rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
  )

# Look at temporal features
head(indego %>% select(start_datetime, interval60, week, dotw, hour, weekend))

# A tibble: 6 × 6
  start_datetime      interval60           week dotw   hour weekend
  <dttm>              <dttm>              <dbl> <ord> <int>   <dbl>
1 2024-04-01 00:00:00 2024-04-01 00:00:00    14 Mon       0       0
2 2024-04-01 00:03:00 2024-04-01 00:00:00    14 Mon       0       0
3 2024-04-01 00:04:00 2024-04-01 00:00:00    14 Mon       0       0
4 2024-04-01 00:04:00 2024-04-01 00:00:00    14 Mon       0       0
5 2024-04-01 00:06:00 2024-04-01 00:00:00    14 Mon       0       0
6 2024-04-01 00:07:00 2024-04-01 00:00:00    14 Mon       0       0

2. Exploratory Data Analysis

2.1 Trips Over Time

Code

# Daily trip counts
daily_trips <- indego %>%
  group_by(date) %>%
  summarize(trips = n())

ggplot(daily_trips, aes(x = date, y = trips)) +
  geom_line(color = "#3182bd", linewidth = 1) +
  geom_smooth(se = FALSE, color = "red", linetype = "dashed") +
  labs(
    title = "Indego Daily Ridership - Q2 2024",
    subtitle = "Winter demand patterns in Philadelphia",
    x = "Date",
    y = "Daily Trips",
    caption = "Source: Indego bike share"
  ) +
  plotTheme

Daily trip volumes rise sharply from early April into mid-June, consistent with the seasonal transition into warmer weather. The series remains noisy—weekends, rain events, and holiday fluctuations introduce large day-to-day swings—but the overall trend is a steady climb toward peak summer demand. Compared to Q1 2025, the level is higher and the pattern is less stable, reflecting how warmer months attract more discretionary, non-commuting trips.

##2.2 Hourly Patterns

Code

# Average trips by hour and day type
hourly_patterns <- indego %>%
  group_by(hour, weekend) %>%
  summarize(avg_trips = n() / n_distinct(date)) %>%
  mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))

ggplot(hourly_patterns, aes(x = hour, y = avg_trips, color = day_type)) +
  geom_line(linewidth = 1.2) +
  scale_color_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
  labs(
    title = "Average Hourly Ridership Patterns",
    subtitle = "Clear commute patterns on weekdays",
    x = "Hour of Day",
    y = "Average Trips per Hour",
    color = "Day Type"
  ) +
  plotTheme

Hourly usage patterns retain the classic two-peak commute structure on weekdays, with strong surges around 8 AM and 5 PM. Weekends, by contrast, show a smoother, flatter trajectory centered on late morning and mid-afternoon leisure travel. The contrast suggests that Q2 demand mixes both utilitarian and recreational usage, producing more varied temporal dynamics than in winter.

2.3 Top Stations

Code

# Most popular origin stations
top_stations <- indego %>%
  count(start_station, start_lat, start_lon, name = "trips") %>%
  arrange(desc(trips)) %>%
  head(20)

kable(top_stations, 
      caption = "Top 20 Indego Stations by Trip Origins",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Top 20 Indego Stations by Trip Origins
start_station	start_lat	start_lon	trips
3,010	39.94711	-75.16618	6,115
3,032	39.94527	-75.17971	5,231
3,295	39.95028	-75.16027	4,451
3,359	39.94888	-75.16978	4,248
3,022	39.95472	-75.18323	4,070
3,028	39.94061	-75.14958	4,052
3,066	39.94561	-75.17348	4,047
3,054	39.96250	-75.17420	3,847
3,020	39.94855	-75.19007	3,792
3,208	39.95048	-75.19324	3,661
3,052	39.94732	-75.15695	3,651
3,012	39.94218	-75.17747	3,573
3,163	39.94974	-75.18097	3,558
3,244	39.93865	-75.16674	3,549
3,185	39.95169	-75.15888	3,523
3,007	39.94517	-75.15993	3,492
3,059	39.96244	-75.16121	3,470
3,362	39.94816	-75.16226	3,443
3,101	39.94295	-75.15955	3,425
3,063	39.94633	-75.16980	3,404

3. Get Philadelphia Spatial Context

3.1 Load Philadelphia Census Data

We’ll get census tract data to add demographic context to our stations.

3.2 Map Philadelphia Context

Code

# Map median income
ggplot() +
  geom_sf(data = philly_census, aes(fill = Med_Inc), color = NA) +
  scale_fill_viridis(
    option = "viridis",
    name = "Median\nIncome",
    labels = scales::dollar
  ) +
  labs(
    title = "Philadelphia Median Household Income by Census Tract",
    subtitle = "Context for understanding bike share demand patterns"
  ) +
  # Stations 
  geom_point(
    data = indego,
    aes(x = start_lon, y = start_lat),
    color = "red", size = 0.25, alpha = 0.6
  ) +
  mapTheme

The spatial distribution of median household income shows a familiar pattern: higher-income tracts cluster in Center City, University City, and along the northwest corridor, while lower-income neighborhoods are concentrated in North and Southwest Philadelphia.

3.3 Join Census Data to Stations

We’ll spatially join census characteristics to each bike station.

Code

# Create sf object for stations
stations_sf <- indego %>%
  distinct(start_station, start_lat, start_lon) %>%
  filter(!is.na(start_lat), !is.na(start_lon)) %>%
  st_as_sf(coords = c("start_lon", "start_lat"), crs = 4326)

# Spatial join to get census tract for each station
stations_census <- st_join(stations_sf, philly_census, left = TRUE) %>%
  st_drop_geometry()

# Look at the result - investigate whether all of the stations joined to census data -- according to the map above there are stations in non-residential tracts.

stations_for_map <- indego %>%
  distinct(start_station, start_lat, start_lon) %>%
  filter(!is.na(start_lat), !is.na(start_lon)) %>%
  left_join(
    stations_census %>% select(start_station, Med_Inc),
    by = "start_station"
  ) %>%
  mutate(has_census = !is.na(Med_Inc))

# Add back to trip data
indego_census <- indego %>%
  left_join(
    stations_census %>% 
      select(start_station, Med_Inc, Percent_Taking_Transit, 
             Percent_White, Total_Pop),
    by = "start_station"
  )


# Prepare data for visualization
stations_for_map <- indego %>%
  distinct(start_station, start_lat, start_lon) %>%
  filter(!is.na(start_lat), !is.na(start_lon)) %>%
  left_join(
    stations_census %>% select(start_station, Med_Inc),
    by = "start_station"
  ) %>%
  mutate(has_census = !is.na(Med_Inc))

# Create the map showing problem stations
ggplot() +
  geom_sf(data = philly_census, aes(fill = Med_Inc), color = "white", size = 0.1) +
  scale_fill_viridis(
    option = "viridis",
    name = "Median\nIncome",
    labels = scales::dollar,
    na.value = "grey90"
  ) +
  # Stations with census data (small grey dots)
  geom_point(
    data = stations_for_map %>% filter(has_census),
    aes(x = start_lon, y = start_lat),
    color = "grey30", size = 1, alpha = 0.6
  ) +
  # Stations WITHOUT census data (red X marks the spot)
  geom_point(
    data = stations_for_map %>% filter(!has_census),
    aes(x = start_lon, y = start_lat),
    color = "red", size = 1, shape = 4, stroke = 1.5
  ) +
  labs(
    title = "Philadelphia Median Household Income by Census Tract",
    subtitle = "Indego stations shown (RED = no census data match)",
    caption = "Red X marks indicate stations that didn't join to census tracts"
  ) +
  mapTheme

Most Indego stations successfully join to their surrounding census tract, but a subset in the downtown core and waterfront areas fall outside tract boundaries or sit within non-residential zones. These locations—shown as red X marks—typically correspond to transit hubs, tourist destinations, or large institutional complexes. Their absence from the demographic join highlights where neighborhood characteristics may not meaningfully describe station demand, and it motivates either separate modeling or additional spatial features for these atypical sites.

3.4 Dealing with missing data

We need to decide what to do with the non-residential bike share stations. For this example, we are going to remove them – this is not necessarily the right way to do things always, but for the sake of simplicity, we are narrowing our scope to only stations in residential neighborhoods. We might opt to create a separate model for non-residential stations..

Code

# Identify which stations to keep
valid_stations <- stations_census %>%
  filter(!is.na(Med_Inc)) %>%
  pull(start_station)

# Filter trip data to valid stations only
indego_census <- indego %>%
  filter(start_station %in% valid_stations) %>%
  left_join(
    stations_census %>% 
      select(start_station, Med_Inc, Percent_Taking_Transit, 
             Percent_White, Total_Pop),
    by = "start_station"
  )

3.5 Get Weather Data

Weather significantly affects bike share demand! Let’s get hourly weather for Philadelphia.

Code

# Get weather from Philadelphia International Airport (KPHL)
# This covers Q2 2024: April 1 - June 30
weather_data <- riem_measures(
  station = "PHL",  # Philadelphia International Airport
  date_start = "2024-04-01",
  date_end = "2024-06-30"
)

# Process weather data
weather_processed <- weather_data %>%
  mutate(
    interval60 = floor_date(valid, unit = "hour"),
    Temperature = tmpf,  # Temperature in Fahrenheit
    Precipitation = ifelse(is.na(p01i), 0, p01i),  # Hourly precip in inches
    Wind_Speed = sknt  # Wind speed in knots
  ) %>%
  select(interval60, Temperature, Precipitation, Wind_Speed) %>%
  distinct()

# Check for missing hours and interpolate if needed
weather_complete <- weather_processed %>%
  complete(interval60 = seq(min(interval60), max(interval60), by = "hour")) %>%
  fill(Temperature, Precipitation, Wind_Speed, .direction = "down")

# Look at the weather
summary(weather_complete %>% select(Temperature, Precipitation, Wind_Speed))

  Temperature    Precipitation       Wind_Speed    
 Min.   :37.00   Min.   :0.00000   Min.   : 0.000  
 1st Qu.:54.00   1st Qu.:0.00000   1st Qu.: 5.000  
 Median :66.00   Median :0.00000   Median : 7.000  
 Mean   :65.05   Mean   :0.00794   Mean   : 7.677  
 3rd Qu.:74.00   3rd Qu.:0.00000   3rd Qu.:10.000  
 Max.   :98.00   Max.   :1.05000   Max.   :30.000

3,6 Visualize Weather Patterns

Who is ready for a Philly winter?!

Code

ggplot(weather_complete, aes(x = interval60, y = Temperature)) +
  geom_line(color = "#3182bd", alpha = 0.7) +
  geom_smooth(se = FALSE, color = "red") +
  labs(
    title = "Philadelphia Temperature - Q2 2024",
    subtitle = "Spring to Summer transition",
    x = "Date",
    y = "Temperature (°F)"
  ) +
  plotTheme

Temperatures rise steadily from early April into late June, marking the transition into Philadelphia’s warm season. Daily fluctuations remain sharp, but the overall shift from cool spring days to consistently warm summer weather is clear. This warming pattern aligns with the strong seasonal increase in ridership, reinforcing the well-known sensitivity of bike share demand to comfortable outdoor conditions.

3.7 Create Space-Time Panel

Code

# Count trips by station-hour
trips_panel <- indego_census %>%
  group_by(interval60, start_station, start_lat, start_lon,
           Med_Inc, Percent_Taking_Transit, Percent_White, Total_Pop) %>%
  summarize(Trip_Count = n()) %>%
  ungroup()

# How many station-hour observations?
nrow(trips_panel)

[1] 175936

Code

# How many unique stations?
length(unique(trips_panel$start_station))

[1] 235

Code

# How many unique hours?
length(unique(trips_panel$interval60))

[1] 2184

3.8 Create Complete Panel Structure

Not every station has trips every hour. We need a complete panel where every station-hour combination exists (even if Trip_Count = 0).

Code

# Calculate expected panel size
n_stations <- length(unique(trips_panel$start_station))
n_hours <- length(unique(trips_panel$interval60))
expected_rows <- n_stations * n_hours

cat("Expected panel rows:", format(expected_rows, big.mark = ","), "\n")

Expected panel rows: 513,240

Code

cat("Current rows:", format(nrow(trips_panel), big.mark = ","), "\n")

Current rows: 175,936

Code

cat("Missing rows:", format(expected_rows - nrow(trips_panel), big.mark = ","), "\n")

Missing rows: 337,304

Code

# Create complete panel
study_panel <- expand.grid(
  interval60 = unique(trips_panel$interval60),
  start_station = unique(trips_panel$start_station)
) %>%
  # Join trip counts
  left_join(trips_panel, by = c("interval60", "start_station")) %>%
  # Replace NA trip counts with 0
  mutate(Trip_Count = replace_na(Trip_Count, 0))

# Fill in station attributes (they're the same for all hours)
station_attributes <- trips_panel %>%
  group_by(start_station) %>%
  summarize(
    start_lat = first(start_lat),
    start_lon = first(start_lon),
    Med_Inc = first(Med_Inc),
    Percent_Taking_Transit = first(Percent_Taking_Transit),
    Percent_White = first(Percent_White),
    Total_Pop = first(Total_Pop)
  )

study_panel <- study_panel %>%
  left_join(station_attributes, by = "start_station")

# Verify we have complete panel
cat("Complete panel rows:", format(nrow(study_panel), big.mark = ","), "\n")

Complete panel rows: 513,240

3.9 Add Time Features

Code

study_panel <- study_panel %>%
  mutate(
    week = week(interval60),
    month = month(interval60, label = TRUE),
    dotw = wday(interval60, label = TRUE),
    hour = hour(interval60),
    date = as.Date(interval60),
    weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
    rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
  )

3.10 Join Weather Data

Code

study_panel <- study_panel %>%
  left_join(weather_complete, by = "interval60")

# Check for missing values
summary(study_panel %>% select(Trip_Count, Temperature, Precipitation))

   Trip_Count       Temperature    Precipitation   
 Min.   : 0.0000   Min.   :37.00   Min.   :0.0000  
 1st Qu.: 0.0000   1st Qu.:54.00   1st Qu.:0.0000  
 Median : 0.0000   Median :66.00   Median :0.0000  
 Mean   : 0.6415   Mean   :65.05   Mean   :0.0079  
 3rd Qu.: 1.0000   3rd Qu.:74.00   3rd Qu.:0.0000  
 Max.   :22.0000   Max.   :98.00   Max.   :1.0500  
                   NA's   :5640    NA's   :5640

3.11 Create Temporal Lag Variables

The key innovation for space-time prediction: past demand predicts future demand.

If there were 15 bike trips from Station A at 8:00 AM, there will probably be ~15 trips at 9:00 AM. We can use this temporal persistence to improve predictions.

Code

# Sort by station and time
study_panel <- study_panel %>%
  arrange(start_station, interval60)

# Create lag variables WITHIN each station
study_panel <- study_panel %>%
  group_by(start_station) %>%
  mutate(
    lag1Hour = lag(Trip_Count, 1),
    lag2Hours = lag(Trip_Count, 2),
    lag3Hours = lag(Trip_Count, 3),
    lag12Hours = lag(Trip_Count, 12),
    lag1day = lag(Trip_Count, 24)
  ) %>%
  ungroup()

# Remove rows with NA lags (first 24 hours for each station)
study_panel_complete <- study_panel %>%
  filter(!is.na(lag1day))

cat("Rows after removing NA lags:", format(nrow(study_panel_complete), big.mark = ","), "\n")

Rows after removing NA lags: 632,150

3.12 Visualize Lag Correlations

Code

# Sample one station to visualize
example_station <- study_panel_complete %>%
  filter(start_station == first(start_station)) %>%
  head(168)  # One week

# Plot actual vs lagged demand
ggplot(example_station, aes(x = interval60)) +
  geom_line(aes(y = Trip_Count, color = "Current"), linewidth = 1) +
  geom_line(aes(y = lag1Hour, color = "1 Hour Ago"), linewidth = 1, alpha = 0.7) +
  geom_line(aes(y = lag1day, color = "24 Hours Ago"), linewidth = 1, alpha = 0.7) +
  scale_color_manual(values = c(
    "Current" = "#08519c",
    "1 Hour Ago" = "#3182bd",
    "24 Hours Ago" = "#6baed6"
  )) +
  labs(
    title = "Temporal Lag Patterns at One Station",
    subtitle = "Past demand predicts future demand",
    x = "Date-Time",
    y = "Trip Count",
    color = "Time Period"
  ) +
  plotTheme

Trip counts at a single station show clear short-term persistence: hours with activity are typically followed by additional activity in the next hour, while long stretches of zero demand also cluster together. The 1-hour lag tracks the current series most closely, but the 24-hour lag preserves the broader daily rhythm. These patterns justify the use of temporal lag features in the predictive models, since recent demand provides meaningful information about expected short-run usage.

3.13 Temporal Train/Test Split

CRITICAL: We must train on PAST data and test on FUTURE data!

Code

# Split by week
# Q1 has weeks 14-26 (April-June)
# Train on weeks 14-23 (April 1 - early June)
# Test on weeks 23-26 (rest of June)

# Which stations have trips in BOTH early and late periods?
early_stations <- study_panel_complete %>%
  filter(week < 23) %>%
  filter(Trip_Count > 0) %>%
  distinct(start_station) %>%
  pull(start_station)

late_stations <- study_panel_complete %>%
  filter(week >= 23) %>%
  filter(Trip_Count > 0) %>%
  distinct(start_station) %>%
  pull(start_station)

# Keep only stations that appear in BOTH periods
common_stations <- intersect(early_stations, late_stations)


# Filter panel to only common stations
study_panel_complete <- study_panel_complete %>%
  filter(start_station %in% common_stations)

# NOW create train/test split
train <- study_panel_complete %>%
  filter(week < 23)

test <- study_panel_complete %>%
  filter(week >= 23)

cat("Training observations:", format(nrow(train), big.mark = ","), "\n")

Training observations: 447,528

Code

cat("Testing observations:", format(nrow(test), big.mark = ","), "\n")

Testing observations: 176,552

Code

cat("Training date range:", min(train$date), "to", max(train$date), "\n")

Training date range: 19814 to 19876

Code

cat("Testing date range:", min(test$date), "to", max(test$date), "\n")

Testing date range: 19877 to 19904

4. Build Predictive Models

We’ll build 5 models with increasing complexity to see what improves predictions.

4.1 Model 1: Baseline (Time + Weather)

Code

# Create day of week factor with treatment (dummy) coding
train <- train %>%
  mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))

# Set contrasts to treatment coding (dummy variables)
contrasts(train$dotw_simple) <- contr.treatment(7)

# Now run the model
model1 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation,
  data = train
)

summary(model1)


Call:
lm(formula = Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + 
    Precipitation, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7197 -0.6759 -0.2053  0.1850 20.6050 

Coefficients:
                   Estimate Std. Error t value             Pr(>|t|)    
(Intercept)       -0.623380   0.014322 -43.527 < 0.0000000000000002 ***
as.factor(hour)1  -0.072867   0.011938  -6.104 0.000000001037041260 ***
as.factor(hour)2  -0.100186   0.011862  -8.446 < 0.0000000000000002 ***
as.factor(hour)3  -0.101366   0.011827  -8.571 < 0.0000000000000002 ***
as.factor(hour)4  -0.098172   0.011994  -8.185 0.000000000000000272 ***
as.factor(hour)5   0.031520   0.011964   2.635             0.008424 ** 
as.factor(hour)6   0.251304   0.011708  21.464 < 0.0000000000000002 ***
as.factor(hour)7   0.539861   0.011720  46.065 < 0.0000000000000002 ***
as.factor(hour)8   0.854340   0.011759  72.652 < 0.0000000000000002 ***
as.factor(hour)9   0.645589   0.011734  55.020 < 0.0000000000000002 ***
as.factor(hour)10  0.548414   0.011459  47.857 < 0.0000000000000002 ***
as.factor(hour)11  0.579472   0.011564  50.111 < 0.0000000000000002 ***
as.factor(hour)12  0.626987   0.011257  55.699 < 0.0000000000000002 ***
as.factor(hour)13  0.623662   0.011744  53.104 < 0.0000000000000002 ***
as.factor(hour)14  0.656044   0.011488  57.107 < 0.0000000000000002 ***
as.factor(hour)15  0.760629   0.011788  64.526 < 0.0000000000000002 ***
as.factor(hour)16  0.862528   0.011435  75.426 < 0.0000000000000002 ***
as.factor(hour)17  1.157733   0.011763  98.420 < 0.0000000000000002 ***
as.factor(hour)18  0.930235   0.011977  77.668 < 0.0000000000000002 ***
as.factor(hour)19  0.681722   0.011792  57.813 < 0.0000000000000002 ***
as.factor(hour)20  0.432304   0.011798  36.642 < 0.0000000000000002 ***
as.factor(hour)21  0.269929   0.011696  23.078 < 0.0000000000000002 ***
as.factor(hour)22  0.191038   0.011726  16.291 < 0.0000000000000002 ***
as.factor(hour)23  0.069112   0.011871   5.822 0.000000005812931084 ***
dotw_simple2       0.061945   0.006312   9.814 < 0.0000000000000002 ***
dotw_simple3       0.021134   0.006089   3.471             0.000519 ***
dotw_simple4       0.103407   0.006324  16.352 < 0.0000000000000002 ***
dotw_simple5       0.057970   0.006276   9.237 < 0.0000000000000002 ***
dotw_simple6       0.007771   0.006527   1.191             0.233767    
dotw_simple7      -0.034128   0.006504  -5.247 0.000000154440328777 ***
Temperature        0.012729   0.000171  74.419 < 0.0000000000000002 ***
Precipitation     -2.054152   0.062168 -33.042 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.117 on 447496 degrees of freedom
Multiple R-squared:  0.1143,    Adjusted R-squared:  0.1142 
F-statistic:  1863 on 31 and 447496 DF,  p-value: < 0.00000000000000022

4.2 Model 2: Add Temporal Lags

Code

model2 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day,
  data = train
)

summary(model2)


Call:
lm(formula = Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + 
    Precipitation + lag1Hour + lag3Hours + lag1day, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.1097 -0.4314 -0.1254  0.1217 17.3211 

Coefficients:
                    Estimate Std. Error t value             Pr(>|t|)    
(Intercept)       -0.2516703  0.0123654 -20.353 < 0.0000000000000002 ***
as.factor(hour)1  -0.0315610  0.0102751  -3.072             0.002129 ** 
as.factor(hour)2  -0.0255020  0.0102137  -2.497             0.012531 *  
as.factor(hour)3  -0.0193838  0.0101872  -1.903             0.057071 .  
as.factor(hour)4  -0.0118938  0.0103351  -1.151             0.249811    
as.factor(hour)5   0.0879106  0.0103161   8.522 < 0.0000000000000002 ***
as.factor(hour)6   0.2413446  0.0101023  23.890 < 0.0000000000000002 ***
as.factor(hour)7   0.3912950  0.0101257  38.644 < 0.0000000000000002 ***
as.factor(hour)8   0.5751810  0.0101693  56.561 < 0.0000000000000002 ***
as.factor(hour)9   0.2586771  0.0101578  25.466 < 0.0000000000000002 ***
as.factor(hour)10  0.2239928  0.0099018  22.621 < 0.0000000000000002 ***
as.factor(hour)11  0.2750789  0.0099905  27.534 < 0.0000000000000002 ***
as.factor(hour)12  0.3434636  0.0097188  35.340 < 0.0000000000000002 ***
as.factor(hour)13  0.3325602  0.0101359  32.810 < 0.0000000000000002 ***
as.factor(hour)14  0.3735677  0.0099128  37.686 < 0.0000000000000002 ***
as.factor(hour)15  0.4382105  0.0101782  43.054 < 0.0000000000000002 ***
as.factor(hour)16  0.5186079  0.0098810  52.485 < 0.0000000000000002 ***
as.factor(hour)17  0.7372223  0.0101810  72.412 < 0.0000000000000002 ***
as.factor(hour)18  0.3982131  0.0103986  38.295 < 0.0000000000000002 ***
as.factor(hour)19  0.2607371  0.0102099  25.538 < 0.0000000000000002 ***
as.factor(hour)20  0.1041311  0.0102076  10.201 < 0.0000000000000002 ***
as.factor(hour)21  0.0773039  0.0100888   7.662  0.00000000000001829 ***
as.factor(hour)22  0.0804148  0.0100993   7.962  0.00000000000000169 ***
as.factor(hour)23  0.0141841  0.0102172   1.388             0.165062    
dotw_simple2       0.0186587  0.0054346   3.433             0.000596 ***
dotw_simple3      -0.0004020  0.0052441  -0.077             0.938896    
dotw_simple4       0.0308340  0.0054469   5.661  0.00000001507660401 ***
dotw_simple5       0.0029701  0.0054080   0.549             0.582861    
dotw_simple6      -0.0249257  0.0056246  -4.432  0.00000935943656583 ***
dotw_simple7      -0.0338659  0.0055995  -6.048  0.00000000146779797 ***
Temperature        0.0037869  0.0001493  25.370 < 0.0000000000000002 ***
Precipitation     -0.9449920  0.0535786 -17.637 < 0.0000000000000002 ***
lag1Hour           0.4030038  0.0013966 288.563 < 0.0000000000000002 ***
lag3Hours          0.1234155  0.0013804  89.408 < 0.0000000000000002 ***
lag1day            0.1253160  0.0012810  97.830 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.961 on 447493 degrees of freedom
Multiple R-squared:  0.344, Adjusted R-squared:  0.344 
F-statistic:  6903 on 34 and 447493 DF,  p-value: < 0.00000000000000022

4.3 Model 3: Add Demographics

Code

model3 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day +
    Med_Inc.x + Percent_Taking_Transit.y + Percent_White.y,
  data = train
)

summary(model3)


Call:
lm(formula = Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + 
    Precipitation + lag1Hour + lag3Hours + lag1day + Med_Inc.x + 
    Percent_Taking_Transit.y + Percent_White.y, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.0962 -0.6781 -0.2691  0.4234 17.2038 

Coefficients:
                              Estimate    Std. Error t value
(Intercept)               0.3857918409  0.0371091868  10.396
as.factor(hour)1         -0.0021036158  0.0412694199  -0.051
as.factor(hour)2         -0.0088620570  0.0469025109  -0.189
as.factor(hour)3         -0.0851082042  0.0529075781  -1.609
as.factor(hour)4         -0.1044632428  0.0557223076  -1.875
as.factor(hour)5          0.0516286807  0.0371091993   1.391
as.factor(hour)6          0.2505851189  0.0312856619   8.010
as.factor(hour)7          0.4005664111  0.0294980575  13.579
as.factor(hour)8          0.6170072423  0.0287282776  21.477
as.factor(hour)9          0.1382271180  0.0288552539   4.790
as.factor(hour)10         0.1084529302  0.0287107985   3.777
as.factor(hour)11         0.1546773002  0.0286616605   5.397
as.factor(hour)12         0.2287878417  0.0281439459   8.129
as.factor(hour)13         0.2593480167  0.0286602368   9.049
as.factor(hour)14         0.2789001366  0.0282806593   9.862
as.factor(hour)15         0.3576516996  0.0284181466  12.585
as.factor(hour)16         0.4807434184  0.0279992423  17.170
as.factor(hour)17         0.8226919934  0.0281755098  29.199
as.factor(hour)18         0.3568673063  0.0285563485  12.497
as.factor(hour)19         0.1776173104  0.0286237376   6.205
as.factor(hour)20         0.0116961313  0.0293501783   0.399
as.factor(hour)21         0.0203613434  0.0299654314   0.679
as.factor(hour)22         0.0304705944  0.0307099983   0.992
as.factor(hour)23        -0.0160285584  0.0326523113  -0.491
dotw_simple2              0.0547342235  0.0119297080   4.588
dotw_simple3              0.0108137555  0.0116378166   0.929
dotw_simple4              0.0381015979  0.0117220832   3.250
dotw_simple5              0.0054589733  0.0118327085   0.461
dotw_simple6              0.0459381855  0.0125566250   3.658
dotw_simple7              0.0389827321  0.0126414408   3.084
Temperature               0.0075088185  0.0003400333  22.083
Precipitation            -2.1950253458  0.1294673174 -16.954
lag1Hour                  0.2941552792  0.0022571175 130.323
lag3Hours                 0.0851609125  0.0023233110  36.655
lag1day                   0.0958710202  0.0022118269  43.345
Med_Inc.x                 0.0000001627  0.0000001133   1.435
Percent_Taking_Transit.y -0.0032455105  0.0004140532  -7.838
Percent_White.y           0.0032869902  0.0002079131  15.809
                                     Pr(>|t|)    
(Intercept)              < 0.0000000000000002 ***
as.factor(hour)1                     0.959347    
as.factor(hour)2                     0.850135    
as.factor(hour)3                     0.107702    
as.factor(hour)4                     0.060834 .  
as.factor(hour)5                     0.164148    
as.factor(hour)6         0.000000000000001159 ***
as.factor(hour)7         < 0.0000000000000002 ***
as.factor(hour)8         < 0.0000000000000002 ***
as.factor(hour)9         0.000001666444693928 ***
as.factor(hour)10                    0.000159 ***
as.factor(hour)11        0.000000067997969458 ***
as.factor(hour)12        0.000000000000000435 ***
as.factor(hour)13        < 0.0000000000000002 ***
as.factor(hour)14        < 0.0000000000000002 ***
as.factor(hour)15        < 0.0000000000000002 ***
as.factor(hour)16        < 0.0000000000000002 ***
as.factor(hour)17        < 0.0000000000000002 ***
as.factor(hour)18        < 0.0000000000000002 ***
as.factor(hour)19        0.000000000547584341 ***
as.factor(hour)20                    0.690260    
as.factor(hour)21                    0.496826    
as.factor(hour)22                    0.321099    
as.factor(hour)23                    0.623508    
dotw_simple2             0.000004477560354601 ***
dotw_simple3                         0.352792    
dotw_simple4                         0.001153 ** 
dotw_simple5                         0.644551    
dotw_simple6                         0.000254 ***
dotw_simple7                         0.002045 ** 
Temperature              < 0.0000000000000002 ***
Precipitation            < 0.0000000000000002 ***
lag1Hour                 < 0.0000000000000002 ***
lag3Hours                < 0.0000000000000002 ***
lag1day                  < 0.0000000000000002 ***
Med_Inc.x                            0.151217    
Percent_Taking_Transit.y 0.000000000000004594 ***
Percent_White.y          < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.221 on 145195 degrees of freedom
  (302295 observations deleted due to missingness)
Multiple R-squared:  0.2319,    Adjusted R-squared:  0.2317 
F-statistic:  1185 on 37 and 145195 DF,  p-value: < 0.00000000000000022

4.4 Model 4: Add Station Fixed Effects

Code

model4 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day +
    Med_Inc.x + Percent_Taking_Transit.y + Percent_White.y +
    as.factor(start_station),
  data = train
)

# Summary too long with all station dummies, just show key metrics
cat("Model 4 R-squared:", summary(model4)$r.squared, "\n")

Model 4 R-squared: 0.259619

Code

cat("Model 4 Adj R-squared:", summary(model4)$adj.r.squared, "\n")

Model 4 Adj R-squared: 0.2582656

4.5 Model 5: Add Rush Hour Interaction

Code

model5 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day + rush_hour + as.factor(month) +
    Med_Inc.x + Percent_Taking_Transit.y + Percent_White.y +
    as.factor(start_station) +
    rush_hour * weekend,  # Rush hour effects different on weekends
  data = train
)

cat("Model 5 R-squared:", summary(model5)$r.squared, "\n")

Model 5 R-squared: 0.2635307

Code

cat("Model 5 Adj R-squared:", summary(model5)$adj.r.squared, "\n")

Model 5 Adj R-squared: 0.2621691

4.6 Model Performance

Model 1 — Time + Weather Explains only about 11% of variation. Captures basic hourly/weekly patterns and weather, but predictions are weak.

Model 2 — + Temporal Lags Big improvement: R² jumps to 0.34. The 1-hour lag is the strongest predictor. Demand is highly path-dependent.

Model 3 — + Demographics No real improvement. Demographic variables add little to short-run hourly prediction.

Model 4 — + Station Fixed Effects Moderate gain (R² ≈ 0.26). Captures stable differences between stations, but less important than lag effects.

Model 5 — Full Model Small additional improvement (R² ≈ 0.264). Interactions help only slightly; most predictability still comes from recent demand.

In Q2 2024 specifically, the one-hour lag (lag1Hour) and 24-hour lag (lag1day) are by far the strongest predictors: hours with high recent demand almost always see high current demand. Temperature has a clear positive effect, while precipitation slightly suppresses ridership. By contrast, median income, racial composition, and transit share have tiny coefficients and do not materially change out-of-sample MAE, confirming that short-run dynamics matter much more than static neighborhood traits in this quarter.

4.7 Comparison to Q1 2025

Compared to the original Q1 2025 analysis, the best Q2 2024 model achieves a similar order of magnitude of error, with an MAE of about 0.71 trips per station-hour. The key difference is that Q2 demand is higher and more volatile: warm-weather leisure and discretionary trips introduce more noise than winter commute patterns, so the model must cope with sharper peaks and more irregular usage. As a result, the overall fit is comparable, but prediction is more challenging exactly in the busiest summer periods.

5.Model Evaluation

5.1 Calculate Predictions and MAE

Code

# Get predictions on test set

# Create day of week factor with treatment (dummy) coding
test <- test %>%
  mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))

# Set contrasts to treatment coding (dummy variables)
contrasts(test$dotw_simple) <- contr.treatment(7)

test <- test %>%
  mutate(
    pred1 = predict(model1, newdata = test),
    pred2 = predict(model2, newdata = test),
    pred3 = predict(model3, newdata = test),
    pred4 = predict(model4, newdata = test),
    pred5 = predict(model5, newdata = test)
  )

# Calculate MAE for each model
mae_results <- data.frame(
  Model = c(
    "1. Time + Weather",
    "2. + Temporal Lags",
    "3. + Demographics",
    "4. + Station FE",
    "5. + Rush Hour Interaction"
  ),
  MAE = c(
    mean(abs(test$Trip_Count - test$pred1), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred2), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred3), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred4), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred5), na.rm = TRUE)
  )
)

kable(mae_results, 
      digits = 2,
      caption = "Mean Absolute Error by Model (Test Set)",
      col.names = c("Model", "MAE (trips)")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Mean Absolute Error by Model (Test Set)
Model	MAE (trips)
1. Time + Weather	0.87
2. + Temporal Lags	0.71
3. + Demographics	0.97
4. + Station FE	0.95
5. + Rush Hour Interaction	0.95

5.2 Visualize Model Comparison

Code

ggplot(mae_results, aes(x = reorder(Model, -MAE), y = MAE)) +
  geom_col(fill = "#3182bd", alpha = 0.8) +
  geom_text(aes(label = round(MAE, 2)), vjust = -0.5) +
  labs(
    title = "Model Performance Comparison",
    subtitle = "Lower MAE = Better Predictions",
    x = "Model",
    y = "Mean Absolute Error (trips)"
  ) +
  plotTheme +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Model 2 — the version with temporal lags — clearly performs best, with an MAE of about 0.71 trips, much lower than all other models. Model 1 performs reasonably (0.87), but the demographic model (Model 3) and fixed-effect models (Models 4–5) do not improve accuracy.

5.3 Compare Q2 2024 vs. Q1 2025

Code

mae_q2 <- mae_results 

mae_q1 <- tibble(
  Model = c(
    "1. Time + Weather",
    "2. + Temporal Lags",
    "3. + Demographics",
    "4. + Station FE",
    "5. + Rush Hour Interaction"
  ),
  MAE_Q1_2025 = c(0.60, 0.50, 0.74, 0.73, 0.73)
)


mae_compare <- mae_q1 %>%
  left_join(mae_q2, by = "Model") %>%
  rename(MAE_Q2_2024 = MAE)

kable(
  mae_compare,
  digits = 2,
  caption = "MAE comparison: Q1 2025 vs Q2 2024",
  col.names = c("Model", "MAE (Q1 2025)", "MAE (Q2 2024)")
) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

MAE comparison: Q1 2025 vs Q2 2024
Model	MAE (Q1 2025)	MAE (Q2 2024)
1. Time + Weather	0.60	0.87
2. + Temporal Lags	0.50	0.71
3. + Demographics	0.74	0.97
4. + Station FE	0.73	0.95
5. + Rush Hour Interaction	0.73	0.95

Across all five models, Q2 2024 has consistently lower MAE than Q1 2025.
For the best-performing model (Model 2, with temporal lags), MAE drops from 0.71 trips per station-hour in Q1 to 0.50 in Q2. The simple time–weather baseline also improves from 0.87 to 0.60, while the more complex models (3–5) remain worse than the lag model in both quarters.

5.4 Space-Time Error Analysis: Observed vs. Predicted

Let’s use our best model (Model 2) for error analysis.

Code

test <- test %>%
  mutate(
    error = Trip_Count - pred2,
    abs_error = abs(error),
    time_of_day = case_when(
      hour < 7 ~ "Overnight",
      hour >= 7 & hour < 10 ~ "AM Rush",
      hour >= 10 & hour < 15 ~ "Mid-Day",
      hour >= 15 & hour <= 18 ~ "PM Rush",
      hour > 18 ~ "Evening"
    )
  )

# Scatter plot by time and day type
ggplot(test, aes(x = Trip_Count, y = pred2)) +
  geom_point(alpha = 0.2, color = "#3182bd") +
  geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
  geom_smooth(method = "lm", se = FALSE, color = "darkgreen") +
  facet_grid(weekend ~ time_of_day) +
  labs(
    title = "Observed vs. Predicted Bike Trips",
    subtitle = "Model 3 performance by time period",
    x = "Observed Trips",
    y = "Predicted Trips",
    caption = "Red line = perfect predictions; Green line = actual model fit"
  ) +
  plotTheme

5.5 Spatial Error Patterns

Are prediction errors clustered in certain parts of Philadelphia?

Code

# Calculate MAE by station
station_errors <- test %>%
  group_by(start_station, start_lat.x, start_lon.y) %>%
  summarize(
    MAE = mean(abs_error, na.rm = TRUE),
    avg_demand = mean(Trip_Count, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(!is.na(start_lat.x), !is.na(start_lon.y))

## Create Two Maps Side-by-Side with Proper Legends (sorry these maps are ugly)

# Calculate station errors
station_errors <- test %>%
  filter(!is.na(pred2)) %>%
  group_by(start_station, start_lat.x, start_lon.y) %>%
  summarize(
    MAE = mean(abs(Trip_Count - pred2), na.rm = TRUE),
    avg_demand = mean(Trip_Count, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(!is.na(start_lat.x), !is.na(start_lon.y))

# Map 1: Prediction Errors
p1 <- ggplot() +
  geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.2) +
  geom_point(
    data = station_errors,
    aes(x = start_lon, y = start_lat, color = MAE),
    size = 3.5,
    alpha = 0.7
  ) +
  scale_color_viridis(
    option = "plasma",
    name = "MAE\n(trips)",
    direction = -1,
    breaks = c(0.5, 1.0, 1.5),  # Fewer, cleaner breaks
    labels = c("0.5", "1.0", "1.5")
  ) +
  labs(title = "Prediction Errors",
       subtitle = "Higher in Center City") +
  mapTheme +
  theme(
    legend.position = "right",
    legend.title = element_text(size = 10, face = "bold"),
    legend.text = element_text(size = 9),
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 10)
  ) +
  guides(color = guide_colorbar(
    barwidth = 1.5,
    barheight = 12,
    title.position = "top",
    title.hjust = 0.5
  ))

# Map 2: Average Demand
p2 <- ggplot() +
  geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.2) +
  geom_point(
    data = station_errors,
    aes(x = start_lon.y, y = start_lat.x, color = avg_demand),
    size = 3.5,
    alpha = 0.7
  ) +
  scale_color_viridis(
    option = "viridis",
    name = "Avg\nDemand",
    direction = -1,
    breaks = c(0.5, 1.0, 1.5, 2.0, 2.5),  # Clear breaks
    labels = c("0.5", "1.0", "1.5", "2.0", "2.5")
  ) +
  labs(title = "Average Demand",
       subtitle = "Trips per station-hour") +
  mapTheme +
  theme(
    legend.position = "right",
    legend.title = element_text(size = 10, face = "bold"),
    legend.text = element_text(size = 9),
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 10)
  ) +
  guides(color = guide_colorbar(
    barwidth = 1.5,
    barheight = 12,
    title.position = "top",
    title.hjust = 0.5
  ))


# Map 1: Prediction Errors
p1 <- ggplot() +
  geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.1) +
  geom_point(
    data = station_errors,
    aes(x = start_lon.y, y = start_lat.x, color = MAE),
    size = 3.5,
    alpha = 0.7
  ) +
  scale_color_viridis(
    option = "plasma",
    name = "MAE (trips)",
    direction = -1,
    breaks = c(0.5, 1.0, 1.5),
    labels = c("0.5", "1.0", "1.5")
  ) +
  labs(title = "Prediction Errors") +
  mapTheme +
  theme(
    legend.position = "bottom",
    legend.title = element_text(size = 10, face = "bold"),
    legend.text = element_text(size = 9),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5)
  ) +
  guides(color = guide_colorbar(
    barwidth = 12,
    barheight = 1,
    title.position = "top",
    title.hjust = 0.5
  ))

# Map 2: Average Demand  
p2 <- ggplot() +
  geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.1) +
  geom_point(
    data = station_errors,
    aes(x = start_lon.y, y = start_lat.x, color = avg_demand),
    size = 3.5,
    alpha = 0.7
  ) +
  scale_color_viridis(
    option = "viridis",
    name = "Avg Demand (trips/hour)",
    direction = -1,
    breaks = c(0.5, 1.0, 1.5, 2.0, 2.5),
    labels = c("0.5", "1.0", "1.5", "2.0", "2.5")
  ) +
  labs(title = "Average Demand") +
  mapTheme +
  theme(
    legend.position = "bottom",
    legend.title = element_text(size = 10, face = "bold"),
    legend.text = element_text(size = 9),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5)
  ) +
  guides(color = guide_colorbar(
    barwidth = 12,
    barheight = 1,
    title.position = "top",
    title.hjust = 0.5
  ))

# Combine
grid.arrange(
  p1, p2,
  ncol = 2
  )

Prediction errors are highest in Center City and University City, the two areas with the strongest and most volatile bike activity. These stations experience sharp swings in demand—commuters, tourists, and university traffic—making them harder to model with simple hourly features. In contrast, outlying residential neighborhoods show lower errors because their usage patterns are more stable and predictable. The close alignment between “high average demand” and “high error” suggests that variability, rather than low volume, drives most of the model’s difficulty.

5.6 Temporal Error Patterns

When are we most wrong?

Code

# MAE by time of day and day type
temporal_errors <- test %>%
  group_by(time_of_day, weekend) %>%
  summarize(
    MAE = mean(abs_error, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))

ggplot(temporal_errors, aes(x = time_of_day, y = MAE, fill = day_type)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
  labs(
    title = "Prediction Errors by Time Period",
    subtitle = "When is the model struggling most?",
    x = "Time of Day",
    y = "Mean Absolute Error (trips)",
    fill = "Day Type"
  ) +
  plotTheme +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Code

weekly_errors <- test %>%
  group_by(week, weekend) %>%
  summarise(
    MAE        = mean(abs_error, na.rm = TRUE),
    mean_error = mean(error,     na.rm = TRUE),
    .groups    = "drop"
  ) %>%
  mutate(
    day_type = ifelse(weekend == 1, "Weekend", "Weekday"),
    week = factor(week)
  )

p_week_mae <- ggplot(weekly_errors,
                     aes(x = week, y = MAE, fill = day_type)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("Weekday" = "#08519c",
                               "Weekend" = "#6baed6")) +
  labs(
    title = "Weekly Pattern in Prediction Errors",
    subtitle = "Mean Absolute Error by week and day type (test set)",
    x = "Week of year",
    y = "Mean Absolute Error (trips)",
    fill = "Day Type"
  ) +
  plotTheme

p_week_bias <- ggplot(weekly_errors,
                      aes(x = week, y = mean_error, fill = day_type)) +
  geom_col(position = "dodge") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  scale_fill_manual(values = c("Weekday" = "#08519c",
                               "Weekend" = "#6baed6")) +
  labs(
    title = "Weekly Under- / Over-Prediction",
    subtitle = "Mean signed error by week and day type (test set)",
    x = "Week of year",
    y = "Mean Error (trips)",
    fill = "Day Type"
  ) +
  plotTheme

gridExtra::grid.arrange(p_week_mae, p_week_bias, ncol = 2)

Across the four test weeks (23–26), prediction errors are fairly stable: weekday and weekend MAE stay in the 0.7–0.8 trip range, with only a slight peak in week 24. Bias is small in most weeks, but weeks 23–24 show mild over-prediction, especially on weekdays. In contrast, week 25 exhibits noticeable under-prediction on weekends, suggesting an unusually strong spike in weekend demand that the model did not anticipate. By week 26, signed errors return close to zero, indicating that systematic under/over-prediction does not persist over time.

5.7 Errors and Demographics

Are prediction errors related to neighborhood characteristics?

Code

# Join demographic data to station errors
station_errors_demo <- station_errors %>%
  left_join(
    station_attributes %>% select(start_station, Med_Inc, Percent_Taking_Transit, Percent_White),
    by = "start_station"
  ) %>%
  filter(!is.na(Med_Inc))

# Create plots
p1 <- ggplot(station_errors_demo, aes(x = Med_Inc, y = MAE)) +
  geom_point(alpha = 0.5, color = "#3182bd") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  scale_x_continuous(labels = scales::dollar) +
  labs(title = "Errors vs. Median Income", x = "Median Income", y = "MAE") +
  plotTheme

p2 <- ggplot(station_errors_demo, aes(x = Percent_Taking_Transit, y = MAE)) +
  geom_point(alpha = 0.5, color = "#3182bd") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Errors vs. Transit Usage", x = "% Taking Transit", y = "MAE") +
  plotTheme

p3 <- ggplot(station_errors_demo, aes(x = Percent_White, y = MAE)) +
  geom_point(alpha = 0.5, color = "#3182bd") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Errors vs. Race", x = "% White", y = "MAE") +
  plotTheme

grid.arrange(p1, p2, p3, ncol = 2)

Station–level errors are only weakly related to neighborhood demographics. MAE tends to be slightly higher in tracts with higher median income and a larger share of white residents, while it declines in tracts where a higher share of commuters take transit. In other words, the model struggles a bit more in richer, whiter areas and performs better in transit-oriented neighborhoods.

These differences appear modest and are driven mainly by higher and more volatile demand at busy, centrally located stations, rather than systematic under-performance in disadvantaged communities, though any deployment should still monitor errors by neighborhood over time.

6. Feature Engineering & Model Improvement

6.1 Create New Features

The choice of new features is guided directly by the error analysis in Section 5. Spatially and temporally, the model struggled most during warm, high-demand peaks in Center City and University City – especially on PM rush hours and busy weekends – and we also saw clear weekly repetition in the demand patterns. To target these misspecifications, I focus on features that (i) sharpen the model’s sensitivity to “perfect” biking weather and (ii) capture medium-run weekly structure beyond the short-run lags already in Model 2.

Code

# Perfect weather indicator: 60–75F & no rain
train <- train %>%
  mutate(
    perfect_weather = ifelse(Temperature >= 60 &
                               Temperature <= 75 &
                               Precipitation == 0, 1, 0)
  )

test <- test %>%
  mutate(
    perfect_weather = ifelse(Temperature >= 60 &
                               Temperature <= 75 &
                               Precipitation == 0, 1, 0)
  )

# Same hour last week (t-168 lag)
train <- train %>%
  arrange(start_station, interval60) %>%
  group_by(start_station) %>%
  mutate(lag168 = lag(Trip_Count, 24 * 7)) %>%
  ungroup()

test <- test %>%
  arrange(start_station, interval60) %>%
  group_by(start_station) %>%
  mutate(lag168 = lag(Trip_Count, 24 * 7)) %>%
  ungroup()

# Rolling 7-day mean demand
library(zoo)

train <- train %>%
  arrange(start_station, interval60) %>%
  group_by(start_station) %>%
  mutate(roll7 = rollmean(Trip_Count, k = 24 * 7, fill = NA, align = "right")) %>%
  ungroup()

test <- test %>%
  arrange(start_station, interval60) %>%
  group_by(start_station) %>%
  mutate(roll7 = rollmean(Trip_Count, k = 24 * 7, fill = NA, align = "right")) %>%
  ungroup()

6.2 Improved Model

Code

train_improved <- train %>%
  drop_na(lag1Hour, lag3Hours, lag1day,
          lag168, roll7, perfect_weather)

test_improved <- test %>%
  drop_na(lag1Hour, lag3Hours, lag1day,
          lag168, roll7, perfect_weather)

# improved OLS model
model_improved <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple +
    Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day +
    perfect_weather + lag168 + roll7,
  data = train_improved
)

summary(model_improved)


Call:
lm(formula = Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + 
    Precipitation + lag1Hour + lag3Hours + lag1day + perfect_weather + 
    lag168 + roll7, data = train_improved)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.7670 -0.4742 -0.1195  0.2492 17.5809 

Coefficients:
                   Estimate Std. Error t value             Pr(>|t|)    
(Intercept)       -0.347813   0.014132 -24.612 < 0.0000000000000002 ***
as.factor(hour)1  -0.045320   0.010779  -4.204       0.000026189516 ***
as.factor(hour)2  -0.052158   0.010826  -4.818       0.000001450054 ***
as.factor(hour)3  -0.060508   0.010759  -5.624       0.000000018662 ***
as.factor(hour)4  -0.049480   0.010929  -4.527       0.000005976126 ***
as.factor(hour)5   0.067097   0.010881   6.166       0.000000000699 ***
as.factor(hour)6   0.240965   0.010712  22.496 < 0.0000000000000002 ***
as.factor(hour)7   0.412024   0.010738  38.371 < 0.0000000000000002 ***
as.factor(hour)8   0.632008   0.010752  58.779 < 0.0000000000000002 ***
as.factor(hour)9   0.339184   0.010779  31.466 < 0.0000000000000002 ***
as.factor(hour)10  0.305729   0.010491  29.142 < 0.0000000000000002 ***
as.factor(hour)11  0.366893   0.010621  34.545 < 0.0000000000000002 ***
as.factor(hour)12  0.445347   0.010291  43.274 < 0.0000000000000002 ***
as.factor(hour)13  0.430450   0.010679  40.306 < 0.0000000000000002 ***
as.factor(hour)14  0.491254   0.010499  46.789 < 0.0000000000000002 ***
as.factor(hour)15  0.566113   0.010813  52.354 < 0.0000000000000002 ***
as.factor(hour)16  0.684121   0.010559  64.788 < 0.0000000000000002 ***
as.factor(hour)17  0.923141   0.010919  84.543 < 0.0000000000000002 ***
as.factor(hour)18  0.563153   0.011074  50.851 < 0.0000000000000002 ***
as.factor(hour)19  0.407262   0.010920  37.296 < 0.0000000000000002 ***
as.factor(hour)20  0.210928   0.010895  19.360 < 0.0000000000000002 ***
as.factor(hour)21  0.153190   0.010764  14.231 < 0.0000000000000002 ***
as.factor(hour)22  0.134214   0.010824  12.400 < 0.0000000000000002 ***
as.factor(hour)23  0.043454   0.010766   4.036       0.000054350305 ***
dotw_simple2       0.080028   0.005953  13.443 < 0.0000000000000002 ***
dotw_simple3       0.048226   0.005684   8.484 < 0.0000000000000002 ***
dotw_simple4       0.030337   0.005695   5.327       0.000000099800 ***
dotw_simple5      -0.007292   0.005582  -1.306                0.191    
dotw_simple6      -0.050845   0.005802  -8.763 < 0.0000000000000002 ***
dotw_simple7      -0.058297   0.005756 -10.128 < 0.0000000000000002 ***
Temperature        0.001028   0.000184   5.589       0.000000022844 ***
Precipitation     -2.853109   0.120123 -23.752 < 0.0000000000000002 ***
lag1Hour           0.333305   0.001495 223.008 < 0.0000000000000002 ***
lag3Hours          0.060813   0.001489  40.833 < 0.0000000000000002 ***
lag1day            0.061350   0.001403  43.718 < 0.0000000000000002 ***
perfect_weather    0.015343   0.003476   4.413       0.000010185951 ***
lag168            -0.007327   0.001428  -5.133       0.000000285408 ***
roll7              0.534007   0.004073 131.092 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9703 on 408514 degrees of freedom
Multiple R-squared:  0.3657,    Adjusted R-squared:  0.3657 
F-statistic:  6367 on 37 and 408514 DF,  p-value: < 0.00000000000000022

6.3 Compare Improved Model to Baseline Model 2

Code

# baseline Model 2 predictions
pred_base <- predict(model2, newdata = test_improved)
# improved model predictions
pred_improved <- predict(model_improved, newdata = test_improved)

# just in case, drop any rows where predictions are NA
valid_idx <- !is.na(pred_base) & !is.na(pred_improved) & !is.na(test_improved$Trip_Count)

mae_base <- mean(abs(test_improved$Trip_Count[valid_idx] - pred_base[valid_idx]))
mae_improved <- mean(abs(test_improved$Trip_Count[valid_idx] - pred_improved[valid_idx]))

cat("Model 2 MAE (baseline): ", round(mae_base, 3), "\n")

Model 2 MAE (baseline):  0.721

Code

cat("Improved Model MAE:     ", round(mae_improved, 3), "\n")

Improved Model MAE:      0.729

The improved model is built by augmenting Model 2 (time, weather, and short-run lags) with the three new features. On the held-out test set, however, performance does not improve: the baseline Model 2 achieves an MAE of 0.72 trips, while the enhanced model’s MAE rises slightly to 0.73 trips.

Breaking the errors down by station and time of day confirms that there is no clear subset where the new features deliver large gains: MAE by time period and by station changes only marginally, and the ranking of “easy” vs. “hard” stations is essentially unchanged. In other words, the extra features slightly reshuffle individual residuals but do not systematically fix the large errors identified in the earlier analysis..

6.4 Poisson Model

Code

model_pois <- glm(
  Trip_Count ~ as.factor(hour) + dotw_simple +
    Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day +
    perfect_weather + lag168 + roll7,
  data = train_improved,
  family = poisson(link = "log")
)

pred_pois <- predict(model_pois, newdata = test_improved, type = "response")
valid_pois <- !is.na(pred_pois) & !is.na(test_improved$Trip_Count)

mae_pois <- mean(abs(test_improved$Trip_Count[valid_pois] - pred_pois[valid_pois]))

cat("Poisson MAE:            ", round(mae_pois, 3), "\n")

Poisson MAE:             0.729

I also estimated a Poisson regression using the same feature set as the improved linear model. The Poisson model reaches an MAE of about 0.73 trips, essentially identical to the enhanced OLS model and still slightly worse than the simpler Model 2. This indicates that, although count data models are theoretically appealing, the Poisson specification does not offer practical gains here—likely because hourly Indego counts are over-dispersed and include many zeros.

7. Brief Report Summary

Quarter and Motivation

This project uses Q2 2024 Indego data (April–June). This quarter captures the transition into Philadelphia’s warm season, when bike share demand is higher, more volatile, and more influenced by discretionary and leisure trips than in winter. It therefore provides a stronger test of short-run demand prediction.

Model Comparison

I estimate five nested OLS models. A simple time + weather model (Model 1) explains only about 11% of variation and yields an MAE of 0.87 trips. Adding temporal lag terms (Model 2) is crucial: R² rises to about 0.34 and MAE falls to 0.71, making it the best-performing baseline. Demographics, station fixed effects, and rush-hour interactions (Models 3–5) increase model complexity but do not improve out-of-sample MAE. Recent demand at each station is far more predictive than static neighborhood characteristics.

Error Analysis

Spatially, errors are highest at busy stations in Center City and University City, where commuter, tourist, and university traffic create sharp, irregular peaks. Peripheral residential stations are more predictable. Temporally, weekday PM peaks show the largest errors, while overnight hours are very stable. Demographically, errors are slightly higher in higher-income and whiter tracts, mostly because those areas host the busiest and most volatile stations.

New Features and Poisson Model

To improve Model 2, I add three features: a perfect-weather indicator, a same-hour-last-week lag, and a rolling 7-day average of demand. However, the enhanced model’s MAE rises slightly to 0.73, and a Poisson count model with the same features also achieves about 0.73. These results suggest that the short-run lag structure already captures most usable signal, and the additional temporal trend features mainly add noise in this three-month sample.

Critical Reflection

An MAE of roughly 0.7 trips per station-hour is operationally useful for high-level planning and rebalancing strategy, but not sufficient for fully automated real-time control, especially during volatile PM peaks in the downtown core. Errors do not appear to systematically disadvantage low-income or high-transit neighborhoods, but any deployment should monitor error patterns by area and enforce minimum service standards in equity-priority communities. The current model ignores events, disruptions, and spatial spillovers; with more time and data, I would incorporate event calendars, network structure, and more flexible model families to better capture the complex dynamics of Philadelphia’s bike share system.