Assignment 2: Spatial Analysis and Visualization

Healthcare Access and Equity in Pennsylvania

Author

Christine

Published

October 10, 2025

Assignment Overview

Learning Objectives:

  • Apply spatial operations to answer policy-relevant research questions
  • Integrate census demographic data with spatial analysis
  • Create publication-quality visualizations and maps
  • Work with spatial data from multiple sources
  • Communicate findings effectively for policy audiences

Part 1: Healthcare Access for Vulnerable Populations

Research Question

Which Pennsylvania counties have the highest proportion of vulnerable populations (elderly + low-income) living far from hospitals?

Your analysis should identify counties that should be priorities for healthcare investment and policy intervention.

Required Analysis Steps

Complete the following analysis, documenting each step with code and brief explanations:

Step 1: Data Collection (5 points)

Load the required spatial data:
- Pennsylvania county boundaries
- Pennsylvania hospitals (from lecture data)
- Pennsylvania census tracts

Your Task:

Code
# Load required packages
library(sf)
library(tidyverse)
library(tigris)
library(tidycensus)
library(scales)
library(patchwork)
library(here)

# Set census API key (this will be hidden in final output)
census_api_key("20068788c6e79d5716fbceb0dcd562ab23f74ca1", install = TRUE, overwrite = TRUE)
[1] "20068788c6e79d5716fbceb0dcd562ab23f74ca1"
Code
# Load spatial data
# 1. Pennsylvania county boundaries
pa_counties <- st_read(here("labs/lab_2/data/Pennsylvania_County_Boundaries.shp"))
Reading layer `Pennsylvania_County_Boundaries' from data source 
  `D:\MUSA5080PPA\portfolio-setup-ChristineCui12\labs\lab_2\data\Pennsylvania_County_Boundaries.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 67 features and 19 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -8963377 ymin: 4825316 xmax: -8314404 ymax: 5201413
Projected CRS: WGS 84 / Pseudo-Mercator
Code
# 2. Pennsylvania hospitals (from lecture data)
hospitals <- st_read(here("labs/lab_2/data/hospitals.geojson"))
Reading layer `hospitals' from data source 
  `D:\MUSA5080PPA\portfolio-setup-ChristineCui12\labs\lab_2\data\hospitals.geojson' 
  using driver `GeoJSON'
Simple feature collection with 223 features and 11 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -80.49621 ymin: 39.75163 xmax: -74.86704 ymax: 42.13403
Geodetic CRS:  WGS 84
Code
# 3. Pennsylvania census tracts
census_tracts <- tracts(state = "PA", cb = TRUE, progress_bar = FALSE)

# Check that all data loaded correctly
Code
#check the CRS
cat("1. PA Counties:", st_crs(pa_counties)$input, "\n")
1. PA Counties: WGS 84 / Pseudo-Mercator 
Code
cat("2. Hospitals:", st_crs(hospitals)$input, "\n")
2. Hospitals: WGS 84 
Code
cat("3. Census Tracts:", st_crs(census_tracts)$input, "\n")
3. Census Tracts: NAD83 

Questions to answer:

  • How many hospitals are in your dataset?
    • 223
  • How many census tracts?
    • 3,445
  • What coordinate reference system is each dataset in?
    • PA county boundaries CRS: PCS WGS 84 / Pseudo-Mercator
    • PA hospitals CRS: GCS WGS 84
    • PA census tracts CRS: GCS NAD 83

Step 2: Get Demographic Data

Use tidycensus to download tract-level demographic data for Pennsylvania.

Required variables:
- Total population
- Median household income
- Population 65 years and over (you may need to sum multiple age categories)

Your Task:

Code
# Get demographic data from ACS
acs_vars <- c(
  total_pop = "B01003_001",
  med_hh_income = "B19013_001",
  # Male 65 years and over (sum of 6 categories)
  male_65_66 = "B01001_020",
  male_67_69 = "B01001_021",
  male_70_74 = "B01001_022",
  male_75_79 = "B01001_023",
  male_80_84 = "B01001_024",
  male_85_up = "B01001_025",
  # Female 65 years and over (sum of 6 categories)
  female_65_66 = "B01001_044",
  female_67_69 = "B01001_045",
  female_70_74 = "B01001_046",
  female_75_79 = "B01001_047",
  female_80_84 = "B01001_048",
  female_85_up = "B01001_049"
)

pa_demographics <- get_acs(
  geography = "tract",
  state = "PA",
  variables = acs_vars,
  output = "wide",
  year = 2022, 
  geometry = FALSE 
) %>%
  # Calculate the combined 65+ population and rename variables
  mutate(
    pop_65_over = male_65_66E + male_67_69E + male_70_74E + male_75_79E + male_80_84E + male_85_upE +
                  female_65_66E + female_67_69E + female_70_74E + female_75_79E + female_80_84E + female_85_upE
  ) %>%
  # Select and rename final columns for clarity and consistency
  select(
    GEOID,
    total_pop = total_popE,
    med_hh_income = med_hh_incomeE,
    pop_65_over
  )
# Join to tract boundaries
pa_tracts_demographics <- census_tracts %>%
 left_join(pa_demographics, by = "GEOID")
Code
#tracts have missing income data
missing_income_count <- pa_tracts_demographics %>%
  summarise(missing_count = sum(is.na(med_hh_income))) %>%
  pull(missing_count)
#median income across all PA census tracts
median_tract_income <- pa_tracts_demographics %>%
  filter(!is.na(med_hh_income)) %>%
  summarise(median_income = median(med_hh_income)) %>%
  pull(median_income)

Questions to answer:

  • What year of ACS data are you using?
    • 2022 ACS data
  • How many tracts have missing income data?
    • 62 tracts
  • What is the median income across all PA census tracts?
    • $70,188

Step 3: Define Vulnerable Populations

Identify census tracts with vulnerable populations based on TWO criteria:
1. Low median household income (choose an appropriate threshold)
2. Significant elderly population (choose an appropriate threshold)

Your Task:

Code
# Filter for vulnerable tracts based on your criteria
pa_tracts_demographics <- pa_tracts_demographics %>%
  mutate(pct_65_over = (pop_65_over / total_pop) * 100)

income_threshold <- quantile(pa_tracts_demographics$med_hh_income, 0.25, na.rm = TRUE)
elderly_threshold <- quantile(pa_tracts_demographics$pct_65_over, 0.75, na.rm = TRUE)

pa_tracts_demographics <- pa_tracts_demographics %>%
  mutate(
    income_low = med_hh_income < income_threshold,
    elderly_high = pct_65_over > elderly_threshold,
    vulnerable = if_else(income_low & elderly_high, TRUE, FALSE)
  )
Code
# Summary: how many tracts are vulnerable?
vulnerable_summary <- pa_tracts_demographics %>%
  summarise(
    total_tracts = n(),
    vulnerable_tracts = sum(vulnerable, na.rm = TRUE),
    pct_vulnerable = round((vulnerable_tracts / total_tracts) * 100, 1)
  )

vulnerable_summary
Simple feature collection with 1 feature and 3 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -80.51989 ymin: 39.7198 xmax: -74.68952 ymax: 42.26986
Geodetic CRS:  NAD83
  total_tracts vulnerable_tracts pct_vulnerable                       geometry
1         3445               165            4.8 POLYGON ((-75.13022 39.9585...

Questions to answer:

  • What income threshold did you choose and why?
    • I defined low-income tracts as those with a median household income below the 25th percentile of all Pennsylvania census tracts. This data-driven threshold identifies the bottom quarter of tracts in terms of income, representing communities that are relatively economically disadvantaged compared to the rest of the state. Using a percentile-based cutoff avoids arbitrary dollar amounts and allows for consistent comparison across regions and time periods.
  • What elderly population threshold did you choose and why?
    • I defined tracts with a significant elderly population as those where the percentage of residents aged 65 and over exceeds the 75th percentile statewide. This threshold highlights the top quartile of tracts with the highest concentration of older adults—populations that are more likely to experience increased healthcare needs and mobility challenges. By focusing on the upper quartile, the analysis targets areas where aging demographics are most pronounced and where healthcare accessibility is especially critical.
  • How many tracts meet your vulnerability criteria?
    • 165 tracts
  • What percentage of PA census tracts are considered vulnerable by your definition?
    • 4.8%

Step 4: Calculate Distance to Hospitals

For each vulnerable tract, calculate the distance to the nearest hospital.

Your Task:

Code
# Transform to appropriate projected CRS
pa_tracts_proj <- st_transform(pa_tracts_demographics, crs = 4326)
hospitals_proj <- st_transform(hospitals, crs = 4326)

# Calculate distance from each tract centroid to nearest hospital 中心点 (Centroid)
tract_centroids <- st_centroid(pa_tracts_proj)

nearest_dist_m <- st_distance(tract_centroids, hospitals_proj) %>%
  apply(1, min)  # minimum distance for each tract
pa_tracts_proj <- pa_tracts_proj %>%
  mutate(
    dist_to_hospital_m = nearest_dist_m,
    dist_to_hospital_mi = dist_to_hospital_m / 1609.34
  )

vulnerable_distance_summary <- pa_tracts_proj %>%
  filter(vulnerable == TRUE) %>%
  summarise(
    avg_distance_mi = mean(dist_to_hospital_mi, na.rm = TRUE),
    max_distance_mi = max(dist_to_hospital_mi, na.rm = TRUE),
    over_15_miles = sum(dist_to_hospital_mi > 15, na.rm = TRUE)
  )

vulnerable_distance_summary
Simple feature collection with 1 feature and 3 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -80.51942 ymin: 39.72143 xmax: -75.00539 ymax: 42.13183
Geodetic CRS:  WGS 84
  avg_distance_mi max_distance_mi over_15_miles                       geometry
1        4.742913        19.18347            10 MULTIPOLYGON (((-75.2495 39...

Requirements:
- Use an appropriate projected coordinate system for Pennsylvania
- Calculate distances in miles
- Explain why you chose your projection

Questions to answer:

  • What is the average distance to the nearest hospital for vulnerable tracts?
    • 4.75 miles
  • What is the maximum distance?
    • 19.32 miles
  • How many vulnerable tracts are more than 15 miles from the nearest hospital?
    • 10 tracts
  • Why you chose your projection
    • For this analysis, I used the NAD83 / Conus Albers Equal Area projection (EPSG:5070). This projection is commonly used for spatial analysis across the continental United States because it preserves both area and distance reasonably well over large regions. Pennsylvania spans multiple UTM zones, so using a single statewide or national projection like Albers Equal Area helps avoid spatial distortion that would occur near zone boundaries. The units of this projection are in meters, which allows for accurate distance calculations when measuring the proximity of census tracts to hospitals. Overall, EPSG:5070 provides a balance between accuracy and consistency, making it appropriate for statewide healthcare accessibility analysis.

Step 5: Identify Underserved Areas

Define “underserved” as vulnerable tracts that are more than 15 miles from the nearest hospital.

Your Task:

Code
# Create underserved variable
pa_tracts_proj <- pa_tracts_proj %>%
  mutate(
    underserved = if_else(vulnerable == TRUE & dist_to_hospital_mi > 15, TRUE, FALSE)
  )
# Summarize underserved tracts
underserved_summary <- pa_tracts_proj %>%
  summarise(
    total_tracts = n(),
    vulnerable_tracts = sum(vulnerable, na.rm = TRUE),
    underserved_tracts = sum(underserved, na.rm = TRUE),
    pct_underserved = round((underserved_tracts / vulnerable_tracts) * 100, 1)
  )

underserved_summary
Simple feature collection with 1 feature and 4 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -80.51989 ymin: 39.7198 xmax: -74.68952 ymax: 42.26986
Geodetic CRS:  WGS 84
  total_tracts vulnerable_tracts underserved_tracts pct_underserved
1         3445               165                 10             6.1
                        geometry
1 POLYGON ((-75.13022 39.9585...

Questions to answer:

  • How many tracts are underserved?
    • 10 tracts
  • What percentage of vulnerable tracts are underserved?
    • 6.1%
  • Does this surprise you? Why or why not?
    • The results show that only a small percentage of vulnerable tracts are located more than 15 miles from the nearest hospital. This is not particularly surprising, as Pennsylvania has a relatively dense healthcare network, especially around urban and suburban areas. However, the underserved tracts that do exist are likely concentrated in rural or mountainous regions where population density is low and travel distances are longer. These areas may require targeted healthcare investment or mobile medical services to improve access.

Step 6: Aggregate to County Level

Use spatial joins and aggregation to calculate county-level statistics about vulnerable populations and hospital access.

Your Task:

Code
# Spatial join tracts to counties
pa_counties_proj <- st_transform(pa_counties, st_crs(pa_tracts_proj))
tracts_county_join <- st_join(pa_tracts_proj, pa_counties_proj, join = st_intersects, left = TRUE)

# Aggregate statistics by county
county_summary <- tracts_county_join %>%
  st_drop_geometry() %>%
  group_by(COUNTY_NAM) %>%
  summarise(
    n_tracts = n(),
    vulnerable_tracts = sum(vulnerable, na.rm = TRUE),
    underserved_tracts = sum(underserved, na.rm = TRUE),
    pct_underserved_vulnerable = round((underserved_tracts / vulnerable_tracts) * 100, 1),
    avg_distance_vulnerable = mean(dist_to_hospital_mi[vulnerable == TRUE], na.rm = TRUE),
    total_vulnerable_pop = sum(total_pop[vulnerable == TRUE], na.rm = TRUE),
    underserved_vulnerable_pop = sum(total_pop[vulnerable == TRUE & underserved == TRUE], na.rm = TRUE)
  ) %>%
  arrange(desc(pct_underserved_vulnerable))

head(county_summary, 10)
# A tibble: 10 × 8
   COUNTY_NAM n_tracts vulnerable_tracts underserved_tracts
   <chr>         <int>             <int>              <int>
 1 BRADFORD         23                 1                  1
 2 COLUMBIA         30                 1                  1
 3 JUNIATA          20                 1                  1
 4 MONROE           72                 1                  1
 5 PERRY            29                 2                  2
 6 SULLIVAN         13                 1                  1
 7 CLINTON          21                 2                  1
 8 DAUPHIN          83                 2                  1
 9 FOREST           12                 2                  1
10 HUNTINGDON       29                 2                  1
# ℹ 4 more variables: pct_underserved_vulnerable <dbl>,
#   avg_distance_vulnerable <dbl>, total_vulnerable_pop <dbl>,
#   underserved_vulnerable_pop <dbl>
Code
# Plot map of percent underserved by Pennsylvania counties.

county_map <- pa_counties_proj %>%
  left_join(county_summary %>% select(COUNTY_NAM, pct_underserved_vulnerable),
            by = c("COUNTY_NAM" = "COUNTY_NAM"))

ggplot() +
  # Counties filled by % underserved vulnerable tracts
  geom_sf(data = county_map, aes(fill = pct_underserved_vulnerable), 
          color = "white", size = 0.3) +
  
  # Color scale for underserved percentage
  scale_fill_viridis_c(
    name = "Percent Underserved",
    option = "viridis",  
    direction = -1,
    labels = function(x) paste0(x, "%")
  ) +
  
  # Map titles and caption
  labs(
    title = "PERCENT UNDERSERVED BY PENNSYLVANIA COUNTIES",
    subtitle = "Pennsylvania, US",
    caption = "Data sources: ACS (2022)"
  ) +
  
  # Clean minimal theme with custom styling
  theme_void() +
  theme(
    panel.background = element_rect(fill = "white", color = NA),
    plot.background = element_rect(fill = "white", color = NA),
    plot.title = element_text(
      size = 14, 
      face = "bold", 
      hjust = 0.5,
      margin = margin(b = 5, unit = "pt")
    ),
    plot.subtitle = element_text(
      size = 10, 
      hjust = 0.5,
      margin = margin(b = 15, t = 5, unit = "pt")
    ),
    legend.title = element_text(
      size = 11, 
      face = "bold"
    ),
    legend.text = element_text(
      size = 10,
    ),
    plot.caption = element_text(
      size = 9, 
      color = "gray40",
      face = "italic",
      margin = margin(t = 10, unit = "pt")
    ),
    legend.position = "right",
    legend.box.margin = margin(l = 0, unit = "pt")
  )

Required county-level statistics:
- Number of vulnerable tracts
- Number of underserved tracts
- Percentage of vulnerable tracts that are underserved
- Average distance to nearest hospital for vulnerable tracts
- Total vulnerable population

Code
top5_pct <- county_summary %>% 
  arrange(desc(pct_underserved_vulnerable)) %>% 
  slice_head(n = 5)
top5_pct
# A tibble: 5 × 8
  COUNTY_NAM n_tracts vulnerable_tracts underserved_tracts
  <chr>         <int>             <int>              <int>
1 BRADFORD         23                 1                  1
2 COLUMBIA         30                 1                  1
3 JUNIATA          20                 1                  1
4 MONROE           72                 1                  1
5 PERRY            29                 2                  2
# ℹ 4 more variables: pct_underserved_vulnerable <dbl>,
#   avg_distance_vulnerable <dbl>, total_vulnerable_pop <dbl>,
#   underserved_vulnerable_pop <dbl>
Code
top1_pop <- county_summary %>% 
  arrange(desc(underserved_vulnerable_pop)) %>% 
  slice_head(n = 1)
top1_pop
# A tibble: 1 × 8
  COUNTY_NAM n_tracts vulnerable_tracts underserved_tracts
  <chr>         <int>             <int>              <int>
1 PERRY            29                 2                  2
# ℹ 4 more variables: pct_underserved_vulnerable <dbl>,
#   avg_distance_vulnerable <dbl>, total_vulnerable_pop <dbl>,
#   underserved_vulnerable_pop <dbl>

Questions to answer:

  • Which 5 counties have the highest percentage of underserved vulnerable tracts?
    • The top five counties with the highest percentage of underserved vulnerable tracts are BRADFORD,COLUMBIA,JUNIATA,MONROE and PERRY. They are primarily rural, located in northern and central Pennsylvania. These areas tend to have sparse hospital coverage and lower population density, contributing to longer travel distances for healthcare access.
  • Which counties have the most vulnerable people living far from hospitals?
    • PERRY has the largest underserved vulnerable population, reflecting both a relatively high overall share of vulnerable residents and the presence of several tracts located far from hospitals. This suggests that limited healthcare infrastructure and rural settlement patterns contribute to significant access challenges for vulnerable groups in the county.
  • Are there any patterns in where underserved counties are located?
    • The choropleth map reveals clear geographic disparities in healthcare access across Pennsylvania, with underserved counties concentrated primarily in rural northern and central regions. This pattern contrasts sharply with urban centers like Philadelphia and Pittsburgh, which show significantly better healthcare accessibility. The clustering of high-need counties in remote areas highlights systemic infrastructure gaps and underscores the need for targeted healthcare interventions in Pennsylvania’s rural communities.

Step 7: Create Summary Table

Create a professional table showing the top 10 priority counties for healthcare investment.

Your Task:

Code
# Create and format priority counties table
library(knitr)
library(scales)

# 7.1 Select and format key variables for display
priority_counties <- county_summary %>%
  arrange(desc(pct_underserved_vulnerable)) %>%    
  slice_head(n = 10) %>%    
  select(
    County = COUNTY_NAM,
    `Vulnerable Tracts` = vulnerable_tracts,
    `Underserved Tracts` = underserved_tracts,
    `% Underserved` = pct_underserved_vulnerable,
    `Avg Distance (mi)` = avg_distance_vulnerable,
    `Total Vulnerable Population` = total_vulnerable_pop
  ) %>%
  mutate(
    `Total Vulnerable Population` = comma(`Total Vulnerable Population`),  # add commas
    `% Underserved` = percent(`% Underserved` / 100, accuracy = 0.1),
    `Avg Distance (mi)` = round(`Avg Distance (mi)`, 1)
  )

# 2. Create professional summary table
kable(
  priority_counties,
  caption = "Table 1. Top 10 Pennsylvania Counties with the Highest Percentage of Underserved Vulnerable Tracts",
  align = "lccccc",
  digits = 1,
  format = "html"
)
Table 1. Top 10 Pennsylvania Counties with the Highest Percentage of Underserved Vulnerable Tracts
County Vulnerable Tracts Underserved Tracts % Underserved Avg Distance (mi) Total Vulnerable Population
BRADFORD 1 1 100.0% 16.7 5,466
COLUMBIA 1 1 100.0% 16.9 918
JUNIATA 1 1 100.0% 15.9 1,782
MONROE 1 1 100.0% 17.7 1,299
PERRY 2 2 100.0% 17.6 5,800
SULLIVAN 1 1 100.0% 16.9 918
CLINTON 2 1 50.0% 11.0 5,124
DAUPHIN 2 1 50.0% 10.0 8,410
FOREST 2 1 50.0% 15.2 4,406
HUNTINGDON 2 1 50.0% 15.0 4,340

Requirements:
- Use knitr::kable() or similar for formatting
- Include descriptive column names
- Format numbers appropriately (commas for population, percentages, etc.)
- Add an informative caption
- Sort by priority (you decide the metric)


Part 2: Comprehensive Visualization

Using the skills from Week 3 (Data Visualization), create publication-quality maps and charts.

Map 1: County-Level Choropleth

Create a choropleth map showing healthcare access challenges at the county level.

Your Task:

Code
# Create county-level access map
library(ggplot2)
library(viridis)

county_map_data <- pa_counties_proj %>%
  left_join(county_summary, by = c("COUNTY_NAM" = "COUNTY_NAM"))

ggplot() +
  # Counties filled by % underserved vulnerable tracts
  geom_sf(data = county_map_data, aes(fill = pct_underserved_vulnerable), 
          color = "white", size = 0.3) +
  
  # Hospital locations as points
  geom_sf(data = hospitals_proj, aes(color = "Hospitals"), 
          size = 1, alpha = 0.7, show.legend = "point") +
  
  # Color scale for access challenge
  scale_fill_viridis(
    name = "Percent Underserved",
    option = "C",
    direction = -1,
    labels = function(x) paste0(x, "%")
  ) +
  
  # Hospital point color scale
  scale_color_manual(
    name = "",
    values = c("Hospitals" = "red2")
  ) +
  
  # Map titles and caption
  labs(
    title = "URDERSERVED COUNTIES AND HOSPITAL LOCATION",
    subtitle = "Pennsylvania, US",
    caption = "Data sources: ACS (2022), Pennsylvania Hospital Data"
  ) +
  
  # Clean minimal theme with custom styling
  theme_void() +
  theme(
    panel.background = element_rect(fill = "white", color = NA),
    plot.background = element_rect(fill = "white", color = NA),
    plot.title = element_text(
      size = 14, 
      face = "bold", 
      hjust = 0.5,
      margin = margin(b = 5, unit = "pt")
    ),
    plot.subtitle = element_text(
      size = 10, 
      hjust = 0.5,
      margin = margin(b = 15, t = 5, unit = "pt")
    ),
    legend.title = element_text(
      size = 11, 
      face = "bold"
    ),
    legend.text = element_text(
      size = 10,
    ),
    plot.caption = element_text(
      size = 9, 
      color = "gray40",
      face = "italic",
      margin = margin(t = 10, unit = "pt")
    ),
    legend.position = "right",
    legend.box.margin = margin(l = 10, unit = "pt")
  )

Requirements:
- Fill counties by percentage of vulnerable tracts that are underserved
- Include hospital locations as points
- Use an appropriate color scheme
- Include clear title, subtitle, and caption
- Use theme_void() or similar clean theme
- Add a legend with formatted labels


Map 2: Detailed Vulnerability Map

Create a map highlighting underserved vulnerable tracts.

Your Task:

Code
# Create detailed tract-level map
ggplot() +
  geom_sf(data = pa_tracts_proj, fill = "grey90", color = NA) +
  
  geom_sf(data = filter(pa_tracts_proj, vulnerable == TRUE & underserved == FALSE),
          fill = "#FED976", color = NA) +
  
  geom_sf(data = filter(pa_tracts_proj, underserved == TRUE & vulnerable == TRUE),
          fill = "#E31A1C", color = "white", size = 0.05) +
  
  geom_sf(data = pa_counties_proj, fill = NA, color = "white", size = 0.3) +
  
  geom_sf(data = hospitals_proj, color = "blue", size = 0.8, alpha = 0.7) +
  
  labs(
    title = "Underserved Vulnerable Tracts in Pennsylvania",
    subtitle = "Red areas represent underserved tracts, Yellow areas represent vulnerable tracts",
    caption = "Data sources: ACS (2022)"
  ) +
  
  # style
  theme_void() +
  theme(
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 12),
    plot.caption = element_text(size = 9, color = "grey50",face = "italic",)
  )

Requirements:
- Show underserved vulnerable tracts in a contrasting color
- Include county boundaries for context
- Show hospital locations
- Use appropriate visual hierarchy (what should stand out?)
- Include informative title and subtitle


Chart: Distribution Analysis

Create a visualization showing the distribution of distances to hospitals for vulnerable populations.

Your Task:

Code
# Create distribution visualization
library(ggplot2)
library(scales)

vulnerable_tract <- pa_tracts_demographics %>%
  filter(vulnerable == TRUE)
ggplot(
  
  data = pa_tracts_proj %>% filter(vulnerable == TRUE), 
  aes(x = dist_to_hospital_mi)
) +
  
  geom_histogram(
    aes(y = after_stat(count), weight = total_pop), 
    bins = 30, 
    fill = "#74add1", 
    color = "white", 
    alpha = 0.8
  ) +
  
  geom_vline(xintercept = 15, linetype = "dashed", color = "red", size = 1) +
  
  annotate("text", x = 15.5, y = 10000, label = "15-mile threshold", hjust = 0, color = "red", size = 3) +
  
  labs(
    title = "Distribution of Hospital Distance for Vulnerable Population",
    subtitle = "Histogram weighted by the total population of vulnerable census tracts",
    x = "Distance to Nearest Hospital (miles)",
    y = "Total Vulnerable Population Count", 
    caption = "Source: ACS (2022), Pennsylvania Hospital Data"
  ) +
  scale_y_continuous(labels = scales::comma) + 
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold"),
    plot.subtitle = element_text(size = 11),
    plot.caption = element_text(size = 9, color = "gray50",face = "italic",)
  )

Interpretation: The distribution reveals that the vast majority of vulnerable populations reside within 5 miles of healthcare facilities, with a sharp decline in population counts as distance increases. However, a small but significant number of vulnerable individuals still face substantial barriers, living more than 15 miles from the nearest hospital, highlighting persistent healthcare access disparities in remote areas of Pennsylvania.

Suggested chart types:
- Histogram or density plot of distances
- Box plot comparing distances across regions
- Bar chart of underserved tracts by county
- Scatter plot of distance vs. vulnerable population size

Requirements:
- Clear axes labels with units
- Appropriate title
- Professional formatting
- Brief interpretation (1-2 sentences as a caption or in text)


Part 3: Bring Your Own Data Analysis

Choose your own additional spatial dataset and conduct a supplementary analysis.

Your Analysis

Your Task:

  1. Find and load additional data
    • Document your data source
    • Check and standardize the CRS
    • Provide basic summary statistics
Code
# Load your additional dataset
parks <- st_read(here("labs/lab_2/data/PPR_Properties.geojson"))
Reading layer `PPR_Properties' from data source 
  `D:\MUSA5080PPA\portfolio-setup-ChristineCui12\labs\lab_2\data\PPR_Properties.geojson' 
  using driver `GeoJSON'
Simple feature collection with 504 features and 25 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -75.28353 ymin: 39.87117 xmax: -74.95865 ymax: 40.13186
Geodetic CRS:  WGS 84
Code
trees <- st_read(here("labs/lab_2/data/ppr_tree_canopy_points_2015.geojson"))
Reading layer `ppr_tree_canopy_points_2015' from data source 
  `D:\MUSA5080PPA\portfolio-setup-ChristineCui12\labs\lab_2\data\ppr_tree_canopy_points_2015.geojson' 
  using driver `GeoJSON'
Simple feature collection with 2480 features and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -75.26242 ymin: 39.88838 xmax: -74.95928 ymax: 40.13635
Geodetic CRS:  WGS 84
Code
# Transform to match our previous analysis (EPSG:4326)
parks_proj <- st_transform(parks, crs = 4326)
trees_proj <- st_transform(trees, crs = 4326)

# Basic summary statistics
cat("Total parks in Philadelphia:", nrow(parks), "\n")
Total parks in Philadelphia: 504 
Code
cat("Total street trees in Philadelphia:", nrow(trees), "\n")
Total street trees in Philadelphia: 2480 
Code
philly_tracts <- philly_tracts %>%
  select(GEOID, NAME, variable, estimate, geometry) %>%
  st_as_sf() %>%
  st_transform(3857) %>%  
  pivot_wider(names_from = variable, values_from = estimate) %>%
  mutate(
    pct_white = white / total_pop * 100,
    pct_black = black / total_pop * 100,
    pct_asian = asian / total_pop * 100,
    pct_hispanic = hispanic / total_pop * 100,
    pct_nonwhite = 100 - pct_white
  )

philly_tracts_proj <- st_transform(philly_tracts, crs = 4326)

cat("Total tracts in Philadelphia:", nrow(philly_tracts), "\n")
Total tracts in Philadelphia: 408 

Questions to answer:

  • What dataset did you choose and why?
    • I chose three datasets for this analysis: PPR Properties (parks), PPR Tree Canopy Points (street trees), and demographic data from the U.S. Census Bureau’s American Community Survey (ACS) for Philadelphia census tracts. These datasets were selected to address the Green Space Equity challenge, which aims to answer the question: “Do low-income and minority neighborhoods have equitable access to green space?” To investigate this, I need spatial data on green infrastructure (parks and trees) and demographic data (like race and income) at a neighborhood level (census tracts).
  • What is the data source and date?
    • Parks & Trees: The data comes from Philadelphia Parks & Recreation (PPR), likely sourced from OpenDataPhilly. The tree canopy data is explicitly from 2015, as indicated by the filename (ppr_tree_canopy_points_2015.geojson). The date for the park properties is not specified in the file name but is assumed to be a contemporary dataset from the same source.
    • Census Demographics: The data is from the U.S. Census Bureau. Specifically, it’s the 2022 American Community Survey (ACS) 5-Year Estimates.
  • How many features does it contain?
    • The parks data has 504 features, trees data has 2480 features, and census data has 408 features.
  • What CRS is it in? Did you need to transform it?
    • The original CRS for the parks and trees datasets is not stated in the code, but it is clear that a transformation was necessary. Both the parks and trees datasets were immediately re-projected to WGS 84 (EPSG: 4326) to ensure they align. Similarly, the census tract data, after being downloaded, was transformed first to a projected CRS suitable for spatial calculations (EPSG: 3857) and then to WGS 84 (EPSG: 4326) to match the other layers. This standardization is crucial for performing accurate spatial operations.

  1. Pose a research question
  • Do low-income and minority neighborhoods have equitable access to green space?

  1. Conduct spatial analysis

Use at least TWO spatial operations to answer your research question.

Required operations (choose 2+):
- Buffers
- Spatial joins
- Spatial filtering with predicates
- Distance calculations
- Intersections or unions
- Point-in-polygon aggregation

Your Task:

Code
# Your spatial analysis
# --- 1. Calculate park and tree accessibility metrics ---

# Compute centroids for each census tract
tract_centroids <- st_centroid(philly_tracts_proj)

# Calculate distance (in meters) from each tract centroid to the nearest park
dist_to_park <- st_distance(tract_centroids, parks_proj) %>%
  apply(1, min)  # min distance for each tract

# Calculate distance to the nearest tree canopy point (proxy for green coverage)
dist_to_tree <- st_distance(tract_centroids, trees_proj) %>%
  apply(1, min)

# Add distance variables back to tracts
philly_tracts_proj <- philly_tracts_proj %>%
  mutate(
    dist_to_park_m = as.numeric(dist_to_park),
    dist_to_tree_m = as.numeric(dist_to_tree),
    dist_to_park_km = dist_to_park_m / 1000,
    dist_to_tree_km = dist_to_tree_m / 1000
  )

# --- 2. Explore correlations between green access and demographics ---

# Examine whether lower-income or higher-nonwhite areas are farther from green space
cor_test_income <- cor.test(philly_tracts_proj$median_income, philly_tracts_proj$dist_to_park_km, use = "complete.obs")
cor_test_nonwhite <- cor.test(philly_tracts_proj$pct_nonwhite, philly_tracts_proj$dist_to_park_km, use = "complete.obs")

cat("Income vs. Distance to Park (r):", round(cor_test_income$estimate, 3), "\n")
Income vs. Distance to Park (r): -0.132 
Code
cat("Pct Nonwhite vs. Distance to Park (r):", round(cor_test_nonwhite$estimate, 3), "\n")
Pct Nonwhite vs. Distance to Park (r): 0.104 
Code
# --- 3. Classify tracts by socioeconomic characteristics ---

income_median <- median(philly_tracts_proj$median_income, na.rm = TRUE)
nonwhite_median <- median(philly_tracts_proj$pct_nonwhite, na.rm = TRUE)

philly_tracts_proj <- philly_tracts_proj %>%
  mutate(
    income_group = if_else(median_income < income_median, "Low Income", "High Income"),
    race_group = if_else(pct_nonwhite > nonwhite_median, "High Nonwhite", "Low Nonwhite")
  )

# --- 4. Visualize distance to parks by income and race group ---

# Boxplot comparison
ggplot(philly_tracts_proj %>% st_drop_geometry(), 
       aes(x = income_group, y = dist_to_park_km, fill = race_group)) +
  geom_boxplot(alpha = 0.8, outlier.color = "gray40") +
  scale_fill_viridis_d(name = "Race Group") +
  labs(
    title = "Park Accessibility by Income and Racial Composition",
    subtitle = "Philadelphia Census Tracts (2022 ACS + OpenDataPhilly)",
    x = "Income Group",
    y = "Distance to Nearest Park (km)"
  ) +
  theme_minimal(base_size = 13)

Code
# --- 5. Map visualization: spatial distribution of green access ---
green_access_map <- ggplot(philly_tracts_proj) +
  # Census tracts are color-filled by distance to parks
  geom_sf(
    aes(fill = dist_to_park_km),
    color = "white",
    linewidth = 0.1
  ) +
  # parks boundary
  geom_sf(
    data = parks_proj, 
    fill = "#c7e9c0", 
    color = "#a1d99b", 
    alpha = 0.7,
    linewidth = 0.3
  ) +
  # trees location
  geom_sf(
    data = trees_proj %>% sample_n(min(5000, nrow(trees_proj))),  
    color = "#006d2c",
    size = 0.1,
    alpha = 0.3
  ) +
  # Use a sequential color scale
  scale_fill_gradientn(
    name = "Distance to Park (km)",
    colors = c("#fff7ec", "#fee8c8", "#fdd49e", "#fdbb84", "#fc8d59", "#ef6548", "#d7301f", "#b30000"),
    limits = c(0, max(philly_tracts_proj$dist_to_park_km, na.rm = TRUE)),
    na.value = "grey80"
  ) +
  labs(
    title = "GREEN SPACE ACCESSIBILITY IN PHILADELPHIA",
    subtitle = "Distance to Nearest Park by Census Tract",
    caption = "Data sources: ACS (2022)"
  ) +
  theme_void()

# Apply the theme style
green_access_map + theme(
  panel.background = element_rect(fill = "white", color = NA),
  plot.background = element_rect(fill = "white", color = NA),
  legend.box.margin = margin(l = 10, unit = "pt"),
  legend.title = element_text(face = "bold", size = 10),
  legend.text = element_text(size = 8),
  plot.title = element_text(face = "bold", size = 16, hjust = 0.4),
  plot.subtitle = element_text(size = 12, hjust = 0.5, margin = margin(b = 15, t = 5, unit = "pt")),
  plot.caption = element_text(margin = margin(t = 15, unit = "pt"), face = "italic", size = 9)
)

Code
# --- 6. Scatter Plot: Income vs. Park Accessibility ---
scatter_plot <- ggplot(philly_tracts_proj %>% st_drop_geometry(), 
                       aes(x = median_income, y = dist_to_park_km)) +
  geom_point(aes(color = pct_nonwhite, size = total_pop), alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "#e31a1c", linetype = "dashed") +
  scale_color_gradientn(
    name = "Nonwhite %",
    colors = c("#fef0d9", "#fdcc8a", "#fc8d59", "#e34a33", "#b30000")
  ) +
  scale_size_continuous(name = "Population", range = c(1, 6)) +
  labs(
    title = "INCOME VS. PARK ACCESSIBILITY",
    subtitle = "Each point represents a census tract (size = population, color = % nonwhite)",
    x = "Median Household Income ($)",
    y = "Distance to Nearest Park (km)",
    caption = paste("Correlation r =", round(cor_test_income$estimate, 3))
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    legend.position = "right"
  )

print(scatter_plot)

Code
# --- 7. Scatter Plot: Racial Composition vs. Park Accessibility ---
scatter_plot_race <- ggplot(philly_tracts_proj %>% st_drop_geometry(), 
                           aes(x = pct_nonwhite, y = dist_to_park_km)) +
  geom_point(aes(color = median_income, size = total_pop), alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "#e31a1c", linetype = "dashed") +
  scale_color_gradientn(
    name = "Median Income ($)",
    colors = c("#f7fbff", "#c6dbef", "#9ecae1", "#6baed6", "#4292c6", "#2171b5", "#08519c", "#08306b"),
    labels = scales::dollar
  ) +
  scale_size_continuous(name = "Population", range = c(1, 6)) +
  scale_x_continuous(labels = scales::percent_format(scale = 1)) +
  labs(
    title = "RACIAL COMPOSITION VS. PARK ACCESSIBILITY",
    subtitle = "Each point represents a census tract (size = population, color = median income)",
    x = "Percentage Nonwhite Population",
    y = "Distance to Nearest Park (km)",
    caption = paste("Correlation r =", round(cor_test_nonwhite$estimate, 3))
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    legend.position = "right"
  )

print(scatter_plot_race)

Analysis requirements:
- Clear code comments explaining each step
- Appropriate CRS transformations
- Summary statistics or counts
- At least one map showing your findings
- Brief interpretation of results (3-5 sentences)

Your interpretation:

The spatial analysis and the provided visualizations suggest a clear pattern of inequitable access to parks in Philadelphia based on both race and income.
The statistical analysis showed a positive correlation between the percentage of nonwhite residents and distance to the park (r≈0.104), and a negative correlation between median income and park distance (r≈−0.132). This means that census tracts with a higher proportion of nonwhite residents or lower incomes tend to be farther away from the nearest park. Notably, both correlation coefficients are relatively weak, suggesting that while statistically significant, income and race individually explain only limited variation in park accessibility. This complexity highlights that environmental justice issues exist but require nuanced understanding - some low-income or high-minority communities actually enjoy good park access, while the relationships are not absolute. The findings emphasize the need for multivariate analysis and targeted interventions rather than one-size-fits-all approaches to addressing green space inequality. The boxplot clearly visualizes high Nonwhite tracts consistently exhibit a higher median distance to the nearest park compared to Low Nonwhite tracts, regardless of the income group. And the most spatially disadvantaged group is the Low Income / High Nonwhite group, which has a median park distance of about 0.35 km, significantly higher than the High Income / Low Nonwhite group’s median of approximately 0.2 km.
The choropleth map, “Green Space Accessibility in Philadelphia,” visually confirms this pattern, showing that the darkest green areas (closest proximity to parks) are concentrated along the major river parks, while lighter green and white areas (lower accessibility) are prevalent in parts of North and Northeast Philadelphia, regions that often correlate with the socioeconomically disadvantaged tracts identified in the boxplot.
In summary, the findings strongly support the research question, demonstrating that low-income and minority neighborhoods in Philadelphia generally experience less equitable spatial access to nearby green spaces.

Finally - A few comments about your incorporation of feedback!

I have incorporated feedback from my first assignment by hiding sensitive API keys in setup code blocks with “include = FALSE” and suppressing verbose progress bars to create cleaner, more professional output. I also improved code organization by separating data loading, analysis, and visualization into logical sections with clear documentation.


Submission Requirements

What to submit:

  1. Rendered HTML document posted to your course portfolio with all code, outputs, maps, and text
    • Use embed-resources: true in YAML so it’s a single file
    • All code should run without errors
    • All maps and charts should display correctly

File naming: LastName_FirstName_Assignment2.html and LastName_FirstName_Assignment2.qmd