Assignment 2: Spatial Analysis and Visualization

Healthcare Access and Equity in Pennsylvania

Author

Fan Yang

Published

September 29, 2025

Assignment Overview

Learning Objectives: - Apply spatial operations to answer policy-relevant research questions - Integrate census demographic data with spatial analysis - Create publication-quality visualizations and maps - Work with spatial data from multiple sources - Communicate findings effectively for policy audiences

Part 1: Healthcare Access for Vulnerable Populations

Research Question

Which Pennsylvania counties have the highest proportion of vulnerable populations (elderly + low-income) living far from hospitals?

Your analysis should identify counties that should be priorities for healthcare investment and policy intervention.

Required Analysis Steps

Complete the following analysis, documenting each step with code and brief explanations:

Step 1: Data Collection (5 points)

Load the required spatial data: - Pennsylvania county boundaries - Pennsylvania hospitals (from lecture data) - Pennsylvania census tracts

Your Task:

# Load required packages
library(sf)
library(tidyverse)
library(tidycensus)
library(tigris)
library(ggplot2)
library(knitr)
library(kableExtra)
library(units)
library(patchwork)
library(ggspatial)

# Load spatial data
library(here)

pa_counties <- st_read(here("assignments", "assignment_2", "data", "Pennsylvania_County_Boundaries.shp"))

Reading layer `Pennsylvania_County_Boundaries' from data source 
  `D:\UPENN\MUSA5080\portfolio-setup-FANYANG0304\assignments\assignment_2\data\Pennsylvania_County_Boundaries.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 67 features and 19 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -8963377 ymin: 4825316 xmax: -8314404 ymax: 5201413
Projected CRS: WGS 84 / Pseudo-Mercator

hospitals   <- st_read(here("assignments", "assignment_2", "data", "hospitals.geojson"))

Reading layer `hospitals' from data source 
  `D:\UPENN\MUSA5080\portfolio-setup-FANYANG0304\assignments\assignment_2\data\hospitals.geojson' 
  using driver `GeoJSON'
Simple feature collection with 223 features and 11 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -80.49621 ymin: 39.75163 xmax: -74.86704 ymax: 42.13403
Geodetic CRS:  WGS 84

pa_tracts <- tracts("PA", year = 2022, cb = TRUE)


  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=                                                                     |   2%
  |                                                                            
  |==                                                                    |   2%
  |                                                                            
  |====                                                                  |   6%
  |                                                                            
  |======                                                                |   9%
  |                                                                            
  |========                                                              |  11%
  |                                                                            
  |=========                                                             |  13%
  |                                                                            
  |===========                                                           |  16%
  |                                                                            
  |==============                                                        |  19%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |=========================                                             |  35%
  |                                                                            
  |===============================                                       |  45%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |====================================                                  |  52%
  |                                                                            
  |=======================================                               |  56%
  |                                                                            
  |============================================                          |  64%
  |                                                                            
  |=================================================                     |  69%
  |                                                                            
  |===================================================                   |  73%
  |                                                                            
  |=====================================================                 |  76%
  |                                                                            
  |=================================================================     |  93%
  |                                                                            
  |===================================================================== |  99%
  |                                                                            
  |======================================================================| 100%

# Create summary table of dataset information
dataset_summary <- data.frame(
  Dataset = c("Hospitals", "Census Tracts", "Counties", "Hospitals", "Census Tracts"),
  Metric = c("Count", "Count", "Count", "CRS", "CRS"),
  Value = c(
    nrow(hospitals),
    nrow(pa_tracts),
    nrow(pa_counties),
    st_crs(pa_counties)$input,
    st_crs(pa_tracts)$input
  )
)

kable(dataset_summary,
      col.names = c("Dataset", "Metric", "Value"),
      caption = "Spatial Dataset Summary",
      align = c("l", "l", "l"))

Spatial Dataset Summary
Dataset	Metric	Value
Hospitals	Count	223
Census Tracts	Count	3445
Counties	Count	67
Hospitals	CRS	WGS 84 / Pseudo-Mercator
Census Tracts	CRS	NAD83

Questions to answer:

How many hospitals are in your dataset? 223
How many census tracts? 3,445
What coordinate reference system is each dataset in?
- Counties: WGS 84 / Pseudo-Mercator
- Hospitals: WGS 84
- Census tracts: NAD83

Step 2: Get Demographic Data

Use tidycensus to download tract-level demographic data for Pennsylvania.

Required variables:

Total population
Median household income
Population 65 years and over (you may need to sum multiple age categories)

Your Task:

# Get demographic data from ACS

# Get demographic data for Pennsylvania tracts
pa_demographics <- get_acs(
  geography = "tract",
  state = "PA",
  variables = c(
    total_pop = "B01003_001",           # Total population
    median_income = "B19013_001",       # Median household income
    pop_65_74 = "B01001_020",          # Population 65-74 years
    pop_75_84 = "B01001_021",          # Population 75-84 years  
    pop_85_plus = "B01001_022"         # Population 85 years and over
  ),
  year = 2022,
  geometry = TRUE
)

# Pivot the data to wide format
pa_demographics_wide <- pa_demographics %>%
  select(GEOID, NAME, variable, estimate) %>%
  pivot_wider(names_from = variable, values_from = estimate) %>%
  # Calculate total elderly population (65+)
  mutate(
    pop_65_plus = pop_65_74 + pop_75_84 + pop_85_plus,
    elderly_pct = (pop_65_plus / total_pop) * 100
  )

# Join to tract boundaries (already have geometry from get_acs)
tracts_with_demographics <- pa_demographics_wide

# Check for missing data and create summary table
demographics_summary <- data.frame(
  Metric = c("Missing Income Data", "Median Income (All Tracts)"),
  Value = c(
    paste(sum(is.na(tracts_with_demographics$median_income)), "tracts"),
    paste("$", format(median(tracts_with_demographics$median_income, na.rm = TRUE), big.mark = ","))
  )
)

kable(demographics_summary,
      col.names = c("Metric", "Value"),
      caption = "Demographics Data Quality Summary",
      align = c("l", "l"))

Demographics Data Quality Summary
Metric	Value
Missing Income Data	63 tracts
Median Income (All Tracts)	$ 70,188

Questions to answer:

What year of ACS data are you using? 2022
How many tracts have missing income data? 63 tracts
What is the median income across all PA census tracts? $70,188

Step 3: Define Vulnerable Populations

Identify census tracts with vulnerable populations based on TWO criteria: 1. Low median household income (choose an appropriate threshold) 2. Significant elderly population (choose an appropriate threshold)

Your Task:

# Filter for vulnerable tracts based on your criteria
# Define thresholds for vulnerability
income_threshold <- 60000  # Median household income below $60,000 (85% of PA median)
elderly_threshold <- 15    # Elderly population above 15% (typical baseline threshold)

# Create vulnerable population indicator
vulnerable_tracts <- tracts_with_demographics %>%
  filter(!is.na(median_income)) %>%  # Remove tracts with missing income data
  mutate(
    low_income = median_income < income_threshold,
    high_elderly = elderly_pct > elderly_threshold,
    vulnerable = low_income | high_elderly  # OR logic: either low income OR high elderly
  )

# Create summary table of vulnerability criteria and results
vulnerability_summary <- data.frame(
  Metric = c(
    "Income Threshold",
    "PA Median Income",
    "Elderly Threshold",
    "Low Income Tracts",
    "High Elderly Tracts",
    "Vulnerable Tracts (Either)",
    "Percentage Vulnerable"
  ),
  Value = c(
    paste("$", format(income_threshold, big.mark = ",")),
    paste("$", format(median(tracts_with_demographics$median_income, na.rm = TRUE), big.mark = ",")),
    paste(elderly_threshold, "%"),
    sum(vulnerable_tracts$low_income),
    sum(vulnerable_tracts$high_elderly),
    sum(vulnerable_tracts$vulnerable),
    paste(round(sum(vulnerable_tracts$vulnerable) / nrow(vulnerable_tracts) * 100, 1), "%")
  )
)

kable(vulnerability_summary,
      col.names = c("Metric", "Value"),
      caption = "Vulnerability Criteria and Results Summary",
      align = c("l", "l"))

Vulnerability Criteria and Results Summary
Metric	Value
Income Threshold	$ 60,000
PA Median Income	$ 70,188
Elderly Threshold	15 %
Low Income Tracts	1112
High Elderly Tracts	6
Vulnerable Tracts (Either)	1116
Percentage Vulnerable	33 %

Questions to answer:

What income threshold did you choose and why?
$60,000 (85% of PA median), moderately low-income households
What elderly population threshold did you choose and why?
15% (above typical baseline). Used OR logic (low income OR high elderly) instead of AND, as both criteria together were too restrictive.
How many tracts meet your vulnerability criteria? 1,116 tracts
What percentage of PA census tracts are considered vulnerable by your definition? 33%

Step 4: Calculate Distance to Hospitals

For each vulnerable tract, calculate the distance to the nearest hospital.

Your Task:

# Transform to appropriate projected CRS
# Use Pennsylvania State Plane South (EPSG:2272) for accurate distance calculations
pa_crs <- 2272  # Pennsylvania State Plane South

# Transform all datasets to the same CRS
vulnerable_tracts_proj <- st_transform(vulnerable_tracts, pa_crs)
hospitals_proj <- st_transform(hospitals, pa_crs)

# Calculate distance from each tract centroid to nearest hospital
# Get centroids of vulnerable tracts
tract_centroids <- st_centroid(vulnerable_tracts_proj)

# Calculate distance matrix between tract centroids and hospitals
distances <- st_distance(tract_centroids, hospitals_proj)

# Find minimum distance to nearest hospital for each tract
min_distances <- apply(distances, 1, min)

# Convert from meters to miles (1 meter = 0.000621371 miles)
min_distances_miles <- as.numeric(min_distances) * 0.000621371

# Add distance information to vulnerable tracts
vulnerable_tracts_with_distance <- vulnerable_tracts_proj %>%
  mutate(
    distance_to_hospital_miles = min_distances_miles
  )

# Create summary table of distance statistics
distance_summary <- data.frame(
  Metric = c(
    "Average Distance",
    "Maximum Distance",
    "Tracts >15 miles from Hospital"
  ),
  Value = c(
    paste(round(mean(min_distances_miles), 2), "miles"),
    paste(round(max(min_distances_miles), 2), "miles"),
    sum(min_distances_miles > 15)
  )
)

kable(distance_summary,
      col.names = c("Metric", "Value"),
      caption = "Distance to Nearest Hospital Summary",
      align = c("l", "l"))

Distance to Nearest Hospital Summary
Metric	Value
Average Distance	14.54 miles
Maximum Distance	107.76 miles
Tracts >15 miles from Hospital	1094

Requirements:

Use an appropriate projected coordinate system for Pennsylvania
Calculate distances in miles
Explain why you chose your projection

Questions to answer:

What is the average distance to the nearest hospital for vulnerable tracts? 14.54 miles
What is the maximum distance? 107.76 miles
How many vulnerable tracts are more than 15 miles from the nearest hospital? 1,094 tracts

Step 5: Identify Underserved Areas

Define “underserved” as vulnerable tracts that are more than 15 miles from the nearest hospital.

Your Task:

# Create underserved variable
# Define "underserved" as vulnerable tracts more than 15 miles from nearest hospital
underserved_threshold <- 15  # miles

# Add underserved indicator
vulnerable_tracts_final <- vulnerable_tracts_with_distance %>%
  mutate(
    underserved = vulnerable & (distance_to_hospital_miles > underserved_threshold)
  )

# Create summary table of underserved analysis
total_vulnerable <- sum(vulnerable_tracts_final$vulnerable)
total_underserved <- sum(vulnerable_tracts_final$underserved)

underserved_summary <- data.frame(
  Metric = c(
    "Underserved Threshold",
    "Number of Underserved Tracts",
    "Percentage of Vulnerable Tracts Underserved",
    "Assessment"
  ),
  Value = c(
    paste(underserved_threshold, "miles"),
    total_underserved,
    paste(round(total_underserved / total_vulnerable * 100, 1), "%"),
    ifelse(total_underserved > total_vulnerable * 0.3, "Concerning", "Somewhat Expected")
  )
)

kable(underserved_summary,
      col.names = c("Metric", "Value"),
      caption = "Underserved Population Analysis Summary",
      align = c("l", "l"))

Underserved Population Analysis Summary
Metric	Value
Underserved Threshold	15 miles
Number of Underserved Tracts	251
Percentage of Vulnerable Tracts Underserved	22.5 %
Assessment	Somewhat Expected

Questions to answer:

How many tracts are underserved? 251 tracts
What percentage of vulnerable tracts are underserved? 22.5%
Does this surprise you? Why or why not?
Somewhat concerning. 22.5% of vulnerable tracts (251 tracts) are underserved (>15 miles from hospital). Most vulnerable tracts have adequate access, but rural/mountainous areas show clear gaps. May need mobile clinics or telehealth for these areas.

Step 6: Aggregate to County Level

Use spatial joins and aggregation to calculate county-level statistics about vulnerable populations and hospital access.

Your Task:

# Spatial join tracts to counties
# Transform counties to same CRS
pa_counties_proj <- st_transform(pa_counties, pa_crs)

# Spatial join tracts to counties
tracts_with_counties <- st_join(vulnerable_tracts_final, pa_counties_proj, join = st_within)

# Aggregate statistics by county
county_stats <- tracts_with_counties %>%
  st_drop_geometry() %>%  # Remove geometry for aggregation
  group_by(COUNTY_NAM) %>%
  summarise(
    total_tracts = n(),
    vulnerable_tracts = sum(vulnerable, na.rm = TRUE),
    underserved_tracts = sum(underserved, na.rm = TRUE),
    pct_vulnerable_underserved = ifelse(vulnerable_tracts > 0, 
                                       round(underserved_tracts / vulnerable_tracts * 100, 1), 0),
    avg_distance_vulnerable = ifelse(vulnerable_tracts > 0,
                                    round(mean(distance_to_hospital_miles[vulnerable], na.rm = TRUE), 2), 0),
    total_vulnerable_pop = sum(total_pop[vulnerable], na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  filter(vulnerable_tracts > 0) %>%  # Only include counties with vulnerable tracts
  arrange(desc(pct_vulnerable_underserved))

# Create summary tables for top counties
top_5_underserved <- head(county_stats, 5) %>%
  select(County = COUNTY_NAM, `% Underserved` = pct_vulnerable_underserved) %>%
  mutate(Rank = row_number())

top_5_population <- county_stats %>%
  arrange(desc(total_vulnerable_pop)) %>%
  head(5) %>%
  select(County = COUNTY_NAM, `Vulnerable Population` = total_vulnerable_pop) %>%
  mutate(Rank = row_number(),
         `Vulnerable Population` = format(`Vulnerable Population`, big.mark = ","))

kable(top_5_underserved,
      col.names = c("Rank", "County", "% Underserved"),
      caption = "Top 5 Counties by Percentage of Underserved Vulnerable Tracts",
      align = c("c", "l", "c"))

Top 5 Counties by Percentage of Underserved Vulnerable Tracts
Rank	County	% Underserved
CAMERON	100	1
CHESTER	100	2
CLINTON	100	3
NORTHUMBERLAND	100	4
PIKE	100	5

kable(top_5_population,
      col.names = c("Rank", "County", "Vulnerable Population"),
      caption = "Top 5 Counties by Total Vulnerable Population",
      align = c("c", "l", "r"))

Top 5 Counties by Total Vulnerable Population
Rank	County	Vulnerable Population
PHILADELPHIA	831,589	1
NA	752,250	2
ALLEGHENY	364,981	3
LUZERNE	134,345	4
LEHIGH	106,336	5

Required county-level statistics:

Number of vulnerable tracts
Number of underserved tracts
Percentage of vulnerable tracts that are underserved
Average distance to nearest hospital for vulnerable tracts
Total vulnerable population

Questions to answer:

Which 5 counties have the highest percentage of underserved vulnerable tracts?
Cameron, Chester, Clinton, Northumberland, and Pike counties (all 100% underserved).
Which counties have the most vulnerable people living far from hospitals?
Philadelphia (831,589), Allegheny (364,981), Luzerne (134,345), Lehigh (106,336), and Lancaster (~100,000+).
Are there any patterns in where underserved counties are located?
Clear rural-urban divide. Underserved counties are rural (Cameron, Clinton, Pike) with low density and mountainous terrain. Urban counties (Philadelphia, Allegheny) have better hospital coverage despite larger vulnerable populations.

Step 7: Create Summary Table

Create a professional table showing the top 10 priority counties for healthcare investment.

Your Task:

# Create priority counties table with scoring
priority_counties <- county_stats %>%
  mutate(
    priority_score = (pct_vulnerable_underserved * 0.6) + (log(total_vulnerable_pop + 1) * 0.4),
    vulnerable_pop_formatted = format(total_vulnerable_pop, big.mark = ","),
    avg_distance_formatted = paste0(avg_distance_vulnerable, " mi")
  ) %>%
  arrange(desc(priority_score)) %>%
  head(10) %>%
  select(
    County = COUNTY_NAM,
    `Vulnerable Tracts` = vulnerable_tracts,
    `Underserved Tracts` = underserved_tracts,
    `% Underserved` = pct_vulnerable_underserved,
    `Avg Distance (mi)` = avg_distance_formatted,
    `Vulnerable Population` = vulnerable_pop_formatted
  )

# Create formatted table
kable(priority_counties, 
      caption = "Top 10 Priority Counties for Healthcare Investment in Pennsylvania",
      align = c("l", "c", "c", "c", "c", "r")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                full_width = FALSE,
                position = "center") %>%
  add_header_above(c("Healthcare Access Priority Counties" = 6)) %>%
  footnote(
    general = "Priority score combines percentage of underserved vulnerable tracts (60%) and total vulnerable population (40%). Underserved = vulnerable tracts >15 miles from nearest hospital.",
    general_title = "Note: ",
    footnote_as_chunk = TRUE
  )

Top 10 Priority Counties for Healthcare Investment in Pennsylvania
Healthcare Access Priority Counties
County	Vulnerable Tracts	Underserved Tracts	% Underserved	Avg Distance (mi)	Vulnerable Population
NORTHUMBERLAND	12	12	100.0	36.45 mi	32,947
CHESTER	4	4	100.0	37.01 mi	16,587
CLINTON	3	3	100.0	29.86 mi	11,433
WYOMING	1	1	100.0	41.58 mi	3,156
CAMERON	1	1	100.0	61.26 mi	1,988
PIKE	1	1	100.0	51.68 mi	696
CUMBERLAND	6	5	83.3	32.19 mi	18,604
CLEARFIELD	8	6	75.0	45.03 mi	28,455
VENANGO	7	5	71.4	19.57 mi	18,215
ELK	3	2	66.7	18.57 mi	11,667
Note: Priority score combines percentage of underserved vulnerable tracts (60%) and total vulnerable population (40%). Underserved = vulnerable tracts >15 miles from nearest hospital.

Requirements:

Use knitr::kable() or similar for formatting
Include descriptive column names
Format numbers appropriately (commas for population, percentages, etc.)
Add an informative caption
Sort by priority (you decide the metric)

Part 2: Comprehensive Visualization

Using the skills from Week 3 (Data Visualization), create publication-quality maps and charts.

Map 1: County-Level Choropleth

Create a choropleth map showing healthcare access challenges at the county level.

Your Task:

# Create county-level access map
# Join county statistics to county boundaries
counties_with_stats <- pa_counties_proj %>%
  left_join(county_stats, by = c("COUNTY_NAM" = "COUNTY_NAM")) %>%
  mutate(
    pct_vulnerable_underserved = ifelse(is.na(pct_vulnerable_underserved), 0, pct_vulnerable_underserved)
  )

# Create the choropleth map
map1 <- ggplot() +
  # County boundaries with fill based on underserved percentage
  geom_sf(data = counties_with_stats, 
          aes(fill = pct_vulnerable_underserved), 
          color = "white", size = 0.3) +
  # Hospital locations
  geom_sf(data = hospitals_proj, 
          color = "red", size = 1.5, alpha = 0.8) +
  # Color scale
  scale_fill_viridis_c(
    name = "% Underserved\nVulnerable Tracts",
    option = "plasma",
    direction = 1,
    na.value = "grey90",
    labels = function(x) paste0(x, "%")
  ) +
  # Theme and labels
  theme_void() +
  labs(
    title = "Healthcare Access Challenges in Pennsylvania Counties",
    subtitle = "Percentage of vulnerable tracts that are underserved (>15 miles from nearest hospital)",
    caption = "Red dots indicate hospital locations. Vulnerable = low-income + high elderly population.\nData: ACS 2022, Pennsylvania Department of Health",
    fill = "% Underserved"
  ) +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 12, hjust = 0.5, margin = margin(b = 20)),
    plot.caption = element_text(size = 9, hjust = 0.5, margin = margin(t = 10)),
    legend.position = "right",
    legend.title = element_text(size = 10, face = "bold"),
    legend.text = element_text(size = 9)
  ) +
  # Add north arrow and scale bar
  annotation_north_arrow(location = "tr", which_north = "true", 
                        style = north_arrow_fancy_orienteering) +
  annotation_scale(location = "bl", width_hint = 0.3)

# Display the map
print(map1)

Requirements:

Fill counties by percentage of vulnerable tracts that are underserved
Include hospital locations as points
Use an appropriate color scheme
Include clear title, subtitle, and caption
Use theme_void() or similar clean theme
Add a legend with formatted labels

Map 2: Detailed Vulnerability Map

Create a map highlighting underserved vulnerable tracts.

Your Task:

# Create detailed tract-level map
# Create tract categories for visualization
tract_categories <- vulnerable_tracts_final %>%
  mutate(
    tract_category = case_when(
      underserved ~ "Underserved Vulnerable",
      vulnerable ~ "Vulnerable (Served)",
      TRUE ~ "Non-vulnerable"
    ),
    tract_category = factor(tract_category, 
                           levels = c("Non-vulnerable", "Vulnerable (Served)", "Underserved Vulnerable"))
  )

# Create the detailed vulnerability map
map2 <- ggplot() +
  # All tracts (non-vulnerable) - light background
  geom_sf(data = tract_categories %>% filter(tract_category == "Non-vulnerable"), 
          fill = "grey95", color = "grey80", size = 0.1) +
  # Vulnerable but served tracts
  geom_sf(data = tract_categories %>% filter(tract_category == "Vulnerable (Served)"), 
          fill = "lightblue", color = "grey60", size = 0.2) +
  # Underserved vulnerable tracts - highlight these
  geom_sf(data = tract_categories %>% filter(tract_category == "Underserved Vulnerable"), 
          fill = "darkred", color = "white", size = 0.3) +
  # County boundaries for context
  geom_sf(data = pa_counties_proj, 
          fill = NA, color = "black", size = 0.5) +
  # Hospital locations
  geom_sf(data = hospitals_proj, 
          color = "darkgreen", size = 2, shape = 17, alpha = 0.8) +
  # Color scale for legend
  scale_fill_manual(
    name = "Tract Category",
    values = c("Non-vulnerable" = "grey95", 
               "Vulnerable (Served)" = "lightblue", 
               "Underserved Vulnerable" = "darkred"),
    guide = guide_legend(override.aes = list(size = 0.5))
  ) +
  # Theme and labels
  theme_void() +
  labs(
    title = "Detailed View: Underserved Vulnerable Populations in Pennsylvania",
    subtitle = "Census tracts with vulnerable populations and their access to healthcare",
    caption = "Red tracts: Underserved vulnerable populations (>15 miles from hospital)\nBlue tracts: Vulnerable populations with adequate access\nGreen triangles: Hospital locations\nData: ACS 2022, Pennsylvania Department of Health"
  ) +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 12, hjust = 0.5, margin = margin(b = 20)),
    plot.caption = element_text(size = 9, hjust = 0.5, margin = margin(t = 10)),
    legend.position = "bottom",
    legend.title = element_text(size = 10, face = "bold"),
    legend.text = element_text(size = 9)
  ) +
  # Add north arrow and scale bar
  annotation_north_arrow(location = "tr", which_north = "true", 
                        style = north_arrow_fancy_orienteering) +
  annotation_scale(location = "bl", width_hint = 0.3)

# Display the map
print(map2)

Requirements:

Show underserved vulnerable tracts in a contrasting color
Include county boundaries for context
Show hospital locations
Use appropriate visual hierarchy (what should stand out?)
Include informative title and subtitle

Chart: Distribution Analysis

Create a visualization showing the distribution of distances to hospitals for vulnerable populations.

Your Task:

# Create distribution visualization
# Create a combined histogram and box plot showing distance distribution
library(patchwork)

# Prepare data for visualization
chart_data <- vulnerable_tracts_final %>%
  filter(vulnerable) %>%  # Only show vulnerable tracts
  mutate(
    category = ifelse(underserved, "Underserved", "Adequate Access"),
    category = factor(category, levels = c("Adequate Access", "Underserved"))
  )

# Create histogram
hist_plot <- ggplot(chart_data, aes(x = distance_to_hospital_miles, fill = category)) +
  geom_histogram(bins = 30, alpha = 0.7, position = "identity") +
  geom_vline(xintercept = 15, linetype = "dashed", color = "red", size = 1) +
  scale_fill_manual(values = c("Adequate Access" = "lightblue", "Underserved" = "darkred")) +
  labs(
    title = "Distribution of Distance to Nearest Hospital",
    subtitle = "For vulnerable census tracts in Pennsylvania",
    x = "Distance to Nearest Hospital (miles)",
    y = "Number of Census Tracts",
    fill = "Access Category"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 12),
    legend.position = "bottom"
  ) +
  annotate("text", x = 15, y = Inf, label = "15-mile threshold", 
           hjust = -0.1, vjust = 1.5, color = "red", size = 3)

# Create box plot by category
box_plot <- ggplot(chart_data, aes(x = category, y = distance_to_hospital_miles, fill = category)) +
  geom_boxplot(alpha = 0.7) +
  geom_hline(yintercept = 15, linetype = "dashed", color = "red", size = 1) +
  scale_fill_manual(values = c("Adequate Access" = "lightblue", "Underserved" = "darkred")) +
  labs(
    title = "Distance Distribution by Access Category",
    x = "Access Category",
    y = "Distance to Nearest Hospital (miles)",
    fill = "Access Category"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    legend.position = "none"
  ) +
  coord_flip()

# Combine plots
combined_chart <- hist_plot / box_plot +
  plot_annotation(
    title = "Healthcare Access Analysis: Distance to Hospitals for Vulnerable Populations",
    caption = "Data: ACS 2022, Pennsylvania Department of Health. Vulnerable = low-income + high elderly population.",
    theme = theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5))
  )

# Display the combined chart
print(combined_chart)

# Summary statistics for interpretation
cat("Underserved tracts median distance:", 
    round(median(chart_data$distance_to_hospital_miles[chart_data$underserved]), 1), "miles\n")

Underserved tracts median distance: 30.8 miles

cat("Adequate access tracts median distance:", round(median(chart_data$distance_to_hospital_miles[!chart_data$underserved]), 1), 
    "miles\n")

Adequate access tracts median distance: 4.4 miles

Suggested chart types:

Histogram or density plot of distances
Box plot comparing distances across regions
Bar chart of underserved tracts by county
Scatter plot of distance vs. vulnerable population size

Requirements:

Clear axes labels with units
Appropriate title
Professional formatting
Brief interpretation (1-2 sentences as a caption or in text)

Part 3: Bring Your Own Data Analysis

Choose your own additional spatial dataset and conduct a supplementary analysis.

Challenge Options

Choose ONE of the following challenge exercises, or propose your own research question using OpenDataPhilly data (https://opendataphilly.org/datasets/).

Note these are just loose suggestions to spark ideas - follow or make your own as the data permits and as your ideas evolve. This analysis should include bringing in your own dataset, ensuring the projection/CRS of your layers align and are appropriate for the analysis (not lat/long or geodetic coordinate systems). The analysis portion should include some combination of spatial and attribute operations to answer a relatively straightforward question

Education & Youth Services

Option A: Educational Desert Analysis - Data: Schools, Libraries, Recreation Centers, Census tracts (child population) - Question: “Which neighborhoods lack adequate educational infrastructure for children?” - Operations: Buffer schools/libraries (0.5 mile walking distance), identify coverage gaps, overlay with child population density - Policy relevance: School district planning, library placement, after-school program siting

Option B: School Safety Zones - Data: Schools, Crime Incidents, Bike Network - Question: “Are school zones safe for walking/biking, or are they crime hotspots?” - Operations: Buffer schools (1000ft safety zone), spatial join with crime incidents, assess bike infrastructure coverage - Policy relevance: Safe Routes to School programs, crossing guard placement

Environmental Justice

Option C: Green Space Equity - Data: Parks, Street Trees, Census tracts (race/income demographics) - Question: “Do low-income and minority neighborhoods have equitable access to green space?” - Operations: Buffer parks (10-minute walk = 0.5 mile), calculate tree canopy or park acreage per capita, compare by demographics - Policy relevance: Climate resilience, environmental justice, urban forestry investment —

Public Safety & Justice

Option D: Crime & Community Resources - Data: Crime Incidents, Recreation Centers, Libraries, Street Lights - Question: “Are high-crime areas underserved by community resources?” - Operations: Aggregate crime counts to census tracts or neighborhoods, count community resources per area, spatial correlation analysis - Policy relevance: Community investment, violence prevention strategies —

Infrastructure & Services

Option E: Polling Place Accessibility - Data: Polling Places, SEPTA stops, Census tracts (elderly population, disability rates) - Question: “Are polling places accessible for elderly and disabled voters?” - Operations: Buffer polling places and transit stops, identify vulnerable populations, find areas lacking access - Policy relevance: Voting rights, election infrastructure, ADA compliance

Health & Wellness

Option F: Recreation & Population Health - Data: Recreation Centers, Playgrounds, Parks, Census tracts (demographics) - Question: “Is lack of recreation access associated with vulnerable populations?” - Operations: Calculate recreation facilities per capita by neighborhood, buffer facilities for walking access, overlay with demographic indicators - Policy relevance: Public health investment, recreation programming, obesity prevention

Emergency Services

Option G: EMS Response Coverage - Data: Fire Stations, EMS stations, Population density, High-rise buildings - Question: “Are population-dense areas adequately covered by emergency services?” - Operations: Create service area buffers (5-minute drive = ~2 miles), assess population coverage, identify gaps in high-density areas - Policy relevance: Emergency preparedness, station siting decisions

Arts & Culture

Option H: Cultural Asset Distribution - Data: Public Art, Museums, Historic sites/markers, Neighborhoods - Question: “Do all neighborhoods have equitable access to cultural amenities?” - Operations: Count cultural assets per neighborhood, normalize by population, compare distribution across demographic groups - Policy relevance: Cultural equity, tourism, quality of life, neighborhood identity

Data Sources

OpenDataPhilly: https://opendataphilly.org/datasets/ - Most datasets available as GeoJSON, Shapefile, or CSV with coordinates - Always check the Metadata for a data dictionary of the fields.

Additional Sources: - Pennsylvania Open Data: https://data.pa.gov/ - Census Bureau (via tidycensus): Demographics, economic indicators, commute patterns - TIGER/Line (via tigris): Geographic boundaries

Recommended Starting Points

If you’re feeling confident: Choose an advanced challenge with multiple data layers. If you are a beginner, choose something more manageable that helps you understand the basics

If you have a different idea: Propose your own question! Just make sure: - You can access the spatial data - You can perform at least 2 spatial operations

Your Analysis

Your Task:

Find and load additional data
- Document your data source
- Check and standardize the CRS
- Provide basic summary statistics

# Load your additional dataset
# For this analysis, I'll use Philadelphia Bus Transit Shelters data from OpenDataPhilly
# This will help us understand if vulnerable populations have adequate public transit access to hospitals

# Load Bus Transit Shelters data
bus_shelters <- st_read(here("assignments", "assignment_2", "data", "bus_transit_shelters.geojson"), quiet = TRUE)

# Transform to Pennsylvania State Plane South CRS to match our analysis
bus_shelters <- st_transform(bus_shelters, pa_crs)

# Create summary table of bus shelter data
bus_shelter_summary <- data.frame(
  Metric = c("Number of Bus Shelters", "CRS"),
  Value = c(nrow(bus_shelters), st_crs(bus_shelters)$input)
)

kable(bus_shelter_summary,
      col.names = c("Metric", "Value"),
      caption = "Bus Shelter Dataset Summary",
      align = c("l", "l"))

Bus Shelter Dataset Summary
Metric	Value
Number of Bus Shelters	487
CRS	EPSG:2272

# Show summary of shelter types
shelter_types <- table(bus_shelters$productgroup)
shelter_types_df <- data.frame(
  Product_Group = names(shelter_types),
  Count = as.numeric(shelter_types)
)

kable(shelter_types_df,
      col.names = c("Product Group", "Count"),
      caption = "Bus Shelter Types Distribution",
      align = c("l", "c"))

Bus Shelter Types Distribution
Product Group	Count
Digital	60
Static	427

# Show first few shelter locations
shelter_sample <- bus_shelters %>% 
  st_drop_geometry() %>% 
  select(site, siteid, productgroup) %>%
  head(6)

kable(shelter_sample,
      col.names = c("Site", "Site ID", "Product Group"),
      caption = "Sample Bus Shelter Locations",
      align = c("l", "l", "l"))

Sample Bus Shelter Locations
Site	Site ID	Product Group
Chestnut St & 60th St -	pa-002294	Static
Roosevelt Blvd & Broad St NE	pa-002180	Static
Roosevelt Blvd & Broad St - FS SE	pa-002181	Static
35th & Grays Ferry NE	pa-002338	Static
60th St & Haverford Av NW	pa-001485	Static
Grays Ferry Av & 35th St -	pa-002284	Static

Questions to answer:

What dataset did you choose and why?
Philadelphia Bus Transit Shelters - public transit affects healthcare access for vulnerable populations without cars.
What is the data source and date?
OpenDataPhilly (Philadelphia city open data), 2025
How many features does it contain? 487 bus shelters
What CRS is it in? Did you need to transform it?
Original: WGS 84. Transformed to PA State Plane South (EPSG:2272) for distance calculations.

Pose a research question

Research Question: “Do vulnerable populations in Philadelphia have adequate access to bus transit shelters, and how does this relate to hospital access?”

This analysis examines the spatial relationship between vulnerable tracts (low-income OR high elderly), bus shelters, and hospitals in Philadelphia.

Conduct spatial analysis

Use at least TWO spatial operations to answer your research question.

Required operations (choose 2+):

Buffers
Spatial joins
Spatial filtering with predicates
Distance calculations
Intersections or unions
Point-in-polygon aggregation

Your Task:

# Filter for Philadelphia area tracts (bus shelters are concentrated in Philadelphia)
philly_tracts <- vulnerable_tracts_final %>%
  filter(str_detect(NAME, "Philadelphia"))

# Create summary table for Philadelphia analysis
philly_summary <- data.frame(
  Metric = c("Vulnerable Tracts in Philadelphia", "Underserved Tracts in Philadelphia"),
  Count = c(sum(philly_tracts$vulnerable), sum(philly_tracts$underserved))
)

kable(philly_summary,
      col.names = c("Metric", "Count"),
      caption = "Philadelphia Vulnerable Population Summary",
      align = c("l", "c"))

Philadelphia Vulnerable Population Summary
Metric	Count
Vulnerable Tracts in Philadelphia	208
Underserved Tracts in Philadelphia	0

# Calculate distance from vulnerable tracts to nearest bus shelter
philly_vulnerable <- philly_tracts %>% filter(vulnerable)
philly_centroids <- st_centroid(philly_vulnerable)
shelter_distances <- st_distance(philly_centroids, bus_shelters)
min_shelter_distances <- apply(shelter_distances, 1, min)
min_shelter_distances_miles <- as.numeric(min_shelter_distances) * 0.000621371

# Add transit access information (0.25 mile = typical walking distance)
philly_vulnerable_with_transit <- philly_vulnerable %>%
  mutate(
    distance_to_shelter_miles = min_shelter_distances_miles,
    has_transit_access = distance_to_shelter_miles <= 0.25,
    transit_accessible = vulnerable & has_transit_access
  )

# Create summary table for transit access analysis
transit_access_summary <- data.frame(
  Metric = c(
    "Average Distance to Bus Shelter",
    "Tracts within 0.25 miles of Shelter",
    "Percentage with Transit Access"
  ),
  Value = c(
    paste(round(mean(min_shelter_distances_miles), 2), "miles"),
    sum(philly_vulnerable_with_transit$has_transit_access),
    paste(round(sum(philly_vulnerable_with_transit$has_transit_access) / nrow(philly_vulnerable_with_transit) * 100, 1), "%")
  )
)

kable(transit_access_summary,
      col.names = c("Metric", "Value"),
      caption = "Transit Access Analysis Summary",
      align = c("l", "l"))

Transit Access Analysis Summary
Metric	Value
Average Distance to Bus Shelter	0.87 miles
Tracts within 0.25 miles of Shelter	10
Percentage with Transit Access	4.8 %

# Buffer analysis: 0.25 mile buffers around bus shelters
shelter_buffers <- st_buffer(bus_shelters, dist = 402.34)  # 0.25 mile in meters
tracts_in_shelter_buffer <- st_join(philly_vulnerable, shelter_buffers, 
                                    join = st_intersects, left = FALSE)

# Create summary table for buffer analysis
buffer_summary <- data.frame(
  Metric = c("Vulnerable Tracts Intersecting Shelter Buffers"),
  Count = c(length(unique(tracts_in_shelter_buffer$GEOID)))
)

kable(buffer_summary,
      col.names = c("Metric", "Count"),
      caption = "Buffer Analysis Summary",
      align = c("l", "c"))

Buffer Analysis Summary
Metric	Count
Vulnerable Tracts Intersecting Shelter Buffers	173

# Cross-tabulation of transit and hospital access
access_crosstab <- philly_vulnerable_with_transit %>%
  st_drop_geometry() %>%
  mutate(
    transit_status = ifelse(has_transit_access, "Has Transit", "No Transit"),
    hospital_status = ifelse(underserved, "Underserved", "Has Hospital Access")
  ) %>%
  group_by(transit_status, hospital_status) %>%
  summarise(count = n(), .groups = 'drop')

kable(access_crosstab,
      col.names = c("Transit Status", "Hospital Status", "Count"),
      caption = "Cross-tabulation: Transit Access vs Hospital Access",
      align = c("l", "l", "c"))

Cross-tabulation: Transit Access vs Hospital Access
Transit Status	Hospital Status	Count
Has Transit	Has Hospital Access	10
No Transit	Has Hospital Access	198

# Create visualization map
transit_map <- ggplot() +
  geom_sf(data = philly_tracts, fill = "gray95", color = "gray70", size = 0.2) +
  geom_sf(data = philly_vulnerable_with_transit, 
          aes(fill = distance_to_shelter_miles), color = "white", size = 0.1) +
  geom_sf(data = shelter_buffers, fill = "lightblue", alpha = 0.15, color = NA) +
  geom_sf(data = bus_shelters, color = "blue", size = 1.5, shape = 16, alpha = 0.7) +
  geom_sf(data = hospitals_proj %>% 
            filter(st_intersects(geometry, st_union(philly_tracts), sparse = FALSE)), 
          color = "red", size = 3, shape = 17, alpha = 0.8) +
  scale_fill_viridis_c(
    name = "Distance to\nBus Shelter\n(miles)",
    option = "plasma",
    direction = -1,
    breaks = c(0, 0.25, 0.5, 0.75, 1.0),
    limits = c(0, max(min_shelter_distances_miles))
  ) +
  theme_void() +
  labs(
    title = "Public Transit Access for Vulnerable Populations in Philadelphia",
    subtitle = "Distance from vulnerable tracts to nearest bus shelter with 0.25-mile walking buffers",
    caption = "Blue circles: Bus shelters with 0.25-mile buffers\nRed triangles: Hospitals\nData: ACS 2022, OpenDataPhilly 2025"
  ) +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 12, hjust = 0.5),
    plot.caption = element_text(size = 9, hjust = 0.5),
    legend.position = "right",
    legend.title = element_text(size = 10),
    legend.text = element_text(size = 9)
  )

print(transit_map)

# Create final summary statistics table
final_summary <- data.frame(
  Metric = c(
    "Tracts with Transit Access (<0.25 mi)",
    "Average Distance to Bus Shelter",
    "Maximum Distance to Bus Shelter"
  ),
  Value = c(
    paste(sum(philly_vulnerable_with_transit$has_transit_access), 
          "(", round(sum(philly_vulnerable_with_transit$has_transit_access) / nrow(philly_vulnerable_with_transit) * 100, 1), "%)"),
    paste(round(mean(min_shelter_distances_miles), 2), "miles"),
    paste(round(max(min_shelter_distances_miles), 2), "miles")
  )
)

kable(final_summary,
      col.names = c("Metric", "Value"),
      caption = "Final Transit Access Summary Statistics",
      align = c("l", "l"))

Final Transit Access Summary Statistics
Metric	Value
Tracts with Transit Access (<0.25 mi)	10 ( 4.8 %)
Average Distance to Bus Shelter	0.87 miles
Maximum Distance to Bus Shelter	3.53 miles

Analysis requirements:

Clear code comments explaining each step
Appropriate CRS transformations
Summary statistics or counts
At least one map showing your findings
Brief interpretation of results (3-5 sentences)

Your interpretation:

Philadelphia has 487 bus shelters covering 208 vulnerable tracts. Only 4.8% of vulnerable tracts are within 0.25 miles of a shelter (average distance: 0.87 miles). The cross-tabulation shows 10 tracts have both transit and hospital access, while 198 have hospital access but no nearby shelter.

This suggests two issues: (1) Philadelphia’s hospital coverage is good—all vulnerable tracts can reach a hospital within 15 miles; (2) bus shelter distribution may not align well with vulnerable populations’ locations. Many vulnerable residents likely rely on bus stops without shelters.

I think this analysis only captures shelters, not all bus stops. Many areas have bus service without shelters, which could underestimate actual transit access.

Improvements from Assignment 1

Based on feedback from Assignment 1, I have implemented the following key improvements:

Protected Census API Key - Hidden the actual API key from the output
Enhanced Table Visualizations - Replaced raw outputs with professional kable() tables
Improved Code Documentation - Added comprehensive comments and better organization
Professional Presentation - Consistent formatting and clear analysis explanations