Assignment 2: Spatial Analysis and Visualization

Healthcare Access and Equity in Pennsylvania

Author

Ming

Published

October 26, 2025

options(scipen = 999, dplyr.summarise.inform = FALSE)

pkgs <- c(
  "tidyverse","sf","tigris","tidycensus","janitor","units","nngeo",
  "knitr","kableExtra","scales","ggtext","glue"
)
invisible(lapply(pkgs, function(p) if (!requireNamespace(p, quietly = TRUE)) install.packages(p)))
lapply(pkgs, library, character.only = TRUE)

[[1]]
 [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
 [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
[13] "grDevices" "utils"     "datasets"  "methods"   "base"     

[[2]]
 [1] "sf"        "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
 [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
[13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     

[[3]]
 [1] "tigris"    "sf"        "lubridate" "forcats"   "stringr"   "dplyr"    
 [7] "purrr"     "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse"
[13] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
[19] "base"     

[[4]]
 [1] "tidycensus" "tigris"     "sf"         "lubridate"  "forcats"   
 [6] "stringr"    "dplyr"      "purrr"      "readr"      "tidyr"     
[11] "tibble"     "ggplot2"    "tidyverse"  "stats"      "graphics"  
[16] "grDevices"  "utils"      "datasets"   "methods"    "base"      

[[5]]
 [1] "janitor"    "tidycensus" "tigris"     "sf"         "lubridate" 
 [6] "forcats"    "stringr"    "dplyr"      "purrr"      "readr"     
[11] "tidyr"      "tibble"     "ggplot2"    "tidyverse"  "stats"     
[16] "graphics"   "grDevices"  "utils"      "datasets"   "methods"   
[21] "base"      

[[6]]
 [1] "units"      "janitor"    "tidycensus" "tigris"     "sf"        
 [6] "lubridate"  "forcats"    "stringr"    "dplyr"      "purrr"     
[11] "readr"      "tidyr"      "tibble"     "ggplot2"    "tidyverse" 
[16] "stats"      "graphics"   "grDevices"  "utils"      "datasets"  
[21] "methods"    "base"      

[[7]]
 [1] "nngeo"      "units"      "janitor"    "tidycensus" "tigris"    
 [6] "sf"         "lubridate"  "forcats"    "stringr"    "dplyr"     
[11] "purrr"      "readr"      "tidyr"      "tibble"     "ggplot2"   
[16] "tidyverse"  "stats"      "graphics"   "grDevices"  "utils"     
[21] "datasets"   "methods"    "base"      

[[8]]
 [1] "knitr"      "nngeo"      "units"      "janitor"    "tidycensus"
 [6] "tigris"     "sf"         "lubridate"  "forcats"    "stringr"   
[11] "dplyr"      "purrr"      "readr"      "tidyr"      "tibble"    
[16] "ggplot2"    "tidyverse"  "stats"      "graphics"   "grDevices" 
[21] "utils"      "datasets"   "methods"    "base"      

[[9]]
 [1] "kableExtra" "knitr"      "nngeo"      "units"      "janitor"   
 [6] "tidycensus" "tigris"     "sf"         "lubridate"  "forcats"   
[11] "stringr"    "dplyr"      "purrr"      "readr"      "tidyr"     
[16] "tibble"     "ggplot2"    "tidyverse"  "stats"      "graphics"  
[21] "grDevices"  "utils"      "datasets"   "methods"    "base"      

[[10]]
 [1] "scales"     "kableExtra" "knitr"      "nngeo"      "units"     
 [6] "janitor"    "tidycensus" "tigris"     "sf"         "lubridate" 
[11] "forcats"    "stringr"    "dplyr"      "purrr"      "readr"     
[16] "tidyr"      "tibble"     "ggplot2"    "tidyverse"  "stats"     
[21] "graphics"   "grDevices"  "utils"      "datasets"   "methods"   
[26] "base"      

[[11]]
 [1] "ggtext"     "scales"     "kableExtra" "knitr"      "nngeo"     
 [6] "units"      "janitor"    "tidycensus" "tigris"     "sf"        
[11] "lubridate"  "forcats"    "stringr"    "dplyr"      "purrr"     
[16] "readr"      "tidyr"      "tibble"     "ggplot2"    "tidyverse" 
[21] "stats"      "graphics"   "grDevices"  "utils"      "datasets"  
[26] "methods"    "base"      

[[12]]
 [1] "glue"       "ggtext"     "scales"     "kableExtra" "knitr"     
 [6] "nngeo"      "units"      "janitor"    "tidycensus" "tigris"    
[11] "sf"         "lubridate"  "forcats"    "stringr"    "dplyr"     
[16] "purrr"      "readr"      "tidyr"      "tibble"     "ggplot2"   
[21] "tidyverse"  "stats"      "graphics"   "grDevices"  "utils"     
[26] "datasets"   "methods"    "base"

options(tigris_use_cache = TRUE, tigris_class = "sf")

crs_proj <- 26918

to_miles <- function(x_meters) set_units(x_meters, "m") |> set_units("mi") |> drop_units()
fmt_pct  <- label_percent(accuracy = 0.1)
fmt_num  <- label_comma(accuracy = 1)
ACS_YEAR <- 2023

Assignment Overview

Learning Objectives: - Apply spatial operations to answer policy-relevant research questions - Integrate census demographic data with spatial analysis - Create publication-quality visualizations and maps - Work with spatial data from multiple sources - Communicate findings effectively for policy audiences

Part 1: Healthcare Access for Vulnerable Populations

Research Question

Which Pennsylvania counties have the highest proportion of vulnerable populations (elderly + low-income) living far from hospitals?

Your analysis should identify counties that should be priorities for healthcare investment and policy intervention.

Required Analysis Steps

Complete the following analysis, documenting each step with code and brief explanations:

Step 1: Data Collection (5 points)

Load the required spatial data: - Pennsylvania county boundaries - Pennsylvania hospitals (from lecture data) - Pennsylvania census tracts

Your Task:

# Load required packages


# Load spatial data

# 1) 县界（County）
pa_counties <- counties(state = "PA", year = 2023, cb = TRUE) |>
  st_as_sf() |>
  st_transform(crs_proj) |>
  clean_names()

# 2) Census Tracts（边界）
pa_tracts <- tracts(state = "PA", year = 2023, cb = TRUE) |>
  st_as_sf() |>
  st_transform(crs_proj) |>
  clean_names()

# 3) 医院（讲义数据）：
# 优先读取 Geopackage；若不存在，尝试读取 CSV（需含经纬度列 lon/lat 或 x/y）
hosp_gpkg <- "data/pa_hospitals.gpkg"
hosp_csv  <- "data/pa_hospitals.csv"

hosp_geojson <- "data/hospitals.geojson"

if (file.exists(hosp_geojson)) {
  hospitals_raw <- read_sf(hosp_geojson)
} else {
  stop("未找到 hospitals.geojson，请确认路径与文件名正确。")
}

hospitals <- hospitals_raw |>
  st_transform(crs_proj) |>
  clean_names()

hospitals <- hospitals_raw |>
  st_transform(crs_proj) |>
  clean_names()


# Check that all data loaded correctly
n_hospitals <- nrow(hospitals)
n_tracts    <- nrow(pa_tracts)

crs_info <- list(
  counties   = st_crs(pa_counties)$input,
  tracts     = st_crs(pa_tracts)$input,
  hospitals  = st_crs(hospitals)$input
)

list(
  n_hospitals = n_hospitals,
  n_tracts = n_tracts,
  crs_info = crs_info
)

$n_hospitals
[1] 223

$n_tracts
[1] 3445

$crs_info
$crs_info$counties
[1] "EPSG:26918"

$crs_info$tracts
[1] "EPSG:26918"

$crs_info$hospitals
[1] "EPSG:26918"

Questions to answer: - How many hospitals are in your dataset? 223 - How many census tracts? 3445 - What coordinate reference system is each dataset in? EPSG:26918 —

Step 2: Get Demographic Data

Use tidycensus to download tract-level demographic data for Pennsylvania.

Required variables: - Total population - Median household income - Population 65 years and over (you may need to sum multiple age categories)

Your Task:

# Get demographic data from ACS
# ---- Step 2 (最终稳定版) ----
library(tidycensus)
library(dplyr)
library(stringr)
library(sf)
library(janitor)

vars_needed <- c(
  total_pop = "B01003_001",
  med_income = "B19013_001",
  m_65_66 = "B01001_020", m_67_69 = "B01001_021", m_70_74 = "B01001_022",
  m_75_79 = "B01001_023", m_80_84 = "B01001_024", m_85_plus = "B01001_025",
  f_65_66 = "B01001_044", f_67_69 = "B01001_045", f_70_74 = "B01001_046",
  f_75_79 = "B01001_047", f_80_84 = "B01001_048", f_85_plus = "B01001_049"
)

acs_wide <- get_acs(
  geography = "tract",
  state = "PA",
  variables = vars_needed,
  year = 2023,
  survey = "acs5",
  geometry = FALSE,
  output = "wide"
)

# ---- 创建 pop65 并简化 ----
acs <- acs_wide |>
  mutate(
    pop65 = rowSums(
      select(cur_data(), matches("(_65_|_67_|_70_|_75_|_80_|_85_).+E$")),
      na.rm = TRUE
    )
  ) |>
  transmute(
    geoid = GEOID,
    total_pop = total_popE,
    med_income = med_incomeE,
    pop65 = pop65
  )

# ---- 连接到本地 tract 边界 ----
pa_tracts <- pa_tracts <- tigris::tracts(state = "PA", year = 2023, cb = TRUE) |>
  st_transform(crs_proj) |>
  janitor::clean_names() |>
  st_transform(crs_proj) |>
  clean_names()


pa_tracts_dem <- pa_tracts |>
  left_join(acs, by = c("geoid" = "geoid"))

# Join to tract boundaries
n_missing_income <- sum

n_missing_income <- sum(is.na(pa_tracts_dem$med_income))
state_median_income <- median(pa_tracts_dem$med_income, na.rm = TRUE)

list(
  acs_year = 2023,
  n_missing_income = n_missing_income,
  median_income_across_tracts = state_median_income
)

$acs_year
[1] 2023

$n_missing_income
[1] 65

$median_income_across_tracts
[1] 72943.5

Questions to answer: - What year of ACS data are you using? 2023 - How many tracts have missing income data? 65 - What is the median income across all PA census tracts? 72943.5

Step 3: Define Vulnerable Populations

Identify census tracts with vulnerable populations based on TWO criteria: 1. Low median household income (choose an appropriate threshold) 2. Significant elderly population (choose an appropriate threshold)

Your Task:

# Filter for vulnerable tracts based on your criteria
# 1. 计算65岁以上人口比例
pa_tracts_dem <- pa_tracts_dem |>
  mutate(pct65 = pop65 / total_pop)

# 2. 选取阈值（例如收入第20百分位、老年人口第80百分位）
inc_thresh   <- quantile(pa_tracts_dem$med_income, 0.20, na.rm = TRUE)
pct65_thresh <- quantile(pa_tracts_dem$pct65, 0.80, na.rm = TRUE)

# 3. 标记低收入+高老龄的脆弱分区
vuln_tracts <- pa_tracts_dem |>
  mutate(
    is_low_income = med_income <= inc_thresh,
    is_elderly    = pct65 >= pct65_thresh,
    vulnerable    = is_low_income & is_elderly
  )

# 4. 汇总统计
n_vuln  <- sum(vuln_tracts$vulnerable, na.rm = TRUE)
pct_vuln <- n_vuln / nrow(vuln_tracts)

list(
  income_threshold = inc_thresh,
  elderly_threshold = pct65_thresh,
  n_vulnerable_tracts = n_vuln,
  pct_vulnerable_tracts = pct_vuln
)

$income_threshold
    20% 
53757.2 

$elderly_threshold
      80% 
0.2450852 

$n_vulnerable_tracts
[1] 101

$pct_vulnerable_tracts
[1] 0.02931785

Questions to answer: - What income threshold did you choose and why? Tracts with median household income ≤ 53757.2 (the 20th percentile statewide), chosen to represent economically disadvantaged areas. - What elderly population threshold did you choose and why? Tracts where ≥ 80% of residents are 65+, corresponding to the top 20 percent of aging communities. - How many tracts meet your vulnerability criteria? 101 - What percentage of PA census tracts are considered vulnerable by your definition? 2.93%

Step 4: Calculate Distance to Hospitals

For each vulnerable tract, calculate the distance to the nearest hospital.

Your Task:

# Transform to appropriate projected CRS
# 1. 计算每个 tract 的质心（largest polygon 以避免多面问题）
vuln_pts <- vuln_tracts |> st_centroid(of_largest_polygon = TRUE)

# 2. 计算最近医院索引
nearest_idx <- st_nearest_feature(vuln_pts, hospitals)


# Calculate distance from each tract centroid to nearest hospital

# 3. 计算距离（米→英里）
dist_m <- st_distance(vuln_pts, hospitals[nearest_idx, ], by_element = TRUE)
to_miles <- function(x_meters) units::set_units(x_meters, "m") |> units::set_units("mi") |> units::drop_units()

vuln_pts <- vuln_pts |>
  mutate(dist_mi = as.numeric(to_miles(dist_m)))

# 4. 提取脆弱 tract 的距离统计
avg_dist_vuln <- mean(vuln_pts$dist_mi[vuln_pts$vulnerable], na.rm = TRUE)
max_dist_vuln <- max(vuln_pts$dist_mi[vuln_pts$vulnerable], na.rm = TRUE)
n_over15 <- sum(vuln_pts$vulnerable & vuln_pts$dist_mi > 15, na.rm = TRUE)

list(
  avg_distance_miles = avg_dist_vuln,
  max_distance_miles = max_dist_vuln,
  n_vulnerable_over15mi = n_over15
)

$avg_distance_miles
[1] 3.776915

$max_distance_miles
[1] 18.68254

$n_vulnerable_over15mi
[1] 4

Requirements: - Use an appropriate projected coordinate system for Pennsylvania - Calculate distances in miles - Explain why you chose your projection

Questions to answer: - What is the average distance to the nearest hospital for vulnerable tracts? 3.78 - What is the maximum distance? 18.68 - How many vulnerable tracts are more than 15 miles from the nearest hospital? The use of UTM 18N allows distances to be measured in linear units (meters), which can be reliably converted to miles for statewide analysis. Most vulnerable tracts lie within short travel distances to hospitals, but a nontrivial share—those > 15 miles away—represent potential underserved areas needing policy attention. —

Step 5: Identify Underserved Areas

Define “underserved” as vulnerable tracts that are more than 15 miles from the nearest hospital.

Your Task:

# Create underserved variable
# 前置：确保 vuln_pts 已含 dist_mi 与 vulnerable (上一节已创建)
# 定义“医疗服务不足”：既脆弱且距离最近医院 > 15 英里
vuln_pts <- vuln_pts |>
  mutate(underserved = vulnerable & dist_mi > 15)

# 统计问题答案
n_underserved <- sum(vuln_pts$underserved, na.rm = TRUE)
n_vulnerable  <- sum(vuln_pts$vulnerable,  na.rm = TRUE)
pct_underserved_among_vuln <- ifelse(n_vulnerable > 0, n_underserved / n_vulnerable, NA_real_)

list(
  n_underserved_tracts = n_underserved,
  n_vulnerable_tracts  = n_vulnerable,
  pct_underserved_among_vulnerable = pct_underserved_among_vuln
)

$n_underserved_tracts
[1] 4

$n_vulnerable_tracts
[1] 101

$pct_underserved_among_vulnerable
[1] 0.03960396

Questions to answer: - How many tracts are underserved? 4 - What percentage of vulnerable tracts are underserved? 3.96% - Does this surprise you? Why or why not? This result is not very surprising. Only 3.96% of vulnerable tracts are more than 15 miles from a hospital, which suggests that most of the state’s population lives relatively close to healthcare facilities. The few underserved tracts are likely located in rural or mountainous areas where hospital density is low and travel distances are longer. —

Step 6: Aggregate to County Level

Use spatial joins and aggregation to calculate county-level statistics about vulnerable populations and hospital access.

Your Task:

# Spatial join tracts to counties
library(dplyr)
library(sf)
library(rlang)

pop_col <- "total_pop"

# 如果 vuln_pts 还没有 total_pop 这一列（只有 vuln_tracts 有），先合并
if (!pop_col %in% names(vuln_pts)) {
  vuln_pts <- vuln_pts |>
    left_join(
      st_drop_geometry(vuln_tracts[, c("geoid", pop_col)]),
      by = "geoid"
    )
}

# 创建“脆弱人口估计”：只有 vulnerable==TRUE 的 tract 才计入 total_pop
vuln_pts <- vuln_pts |>
  mutate(vuln_pop_est = ifelse(vulnerable, total_pop, 0))

# 如果 vuln_pts 不含人口列，则从 vuln_tracts 合并
if (!pop_col %in% names(vuln_pts) && !is.na(pop_col)) {
  join_cols <- intersect(names(vuln_tracts), c(tract_id, pop_col, "vulnerable"))
  vuln_pts <- vuln_pts |>
    left_join(st_drop_geometry(vuln_tracts[, join_cols]), by = setNames(tract_id, tract_id))
}

# 创建脆弱人口估计（仅在 vulnerable == TRUE 的地块计入人口，否则为 0）
vuln_pts <- vuln_pts |>
  mutate(
    vuln_pop_est = dplyr::case_when(
      !is.na(pop_col) ~ ifelse(vulnerable, as.numeric(.data[[pop_col]]), 0),
      TRUE ~ NA_real_
    )
  )


counties <- get_acs(
  geography = "county",
  state = "PA",
  variables = "B01003_001",  # 总人口，只是用来取 geometry
  geometry = TRUE,
  year = 2023
)
# --- 1) 县级空间连接 ---
# 假定你已有 counties (sf) 图层；确保 CRS 一致
if (st_crs(vuln_pts) != st_crs(counties)) {
  counties <- st_transform(counties, st_crs(vuln_pts))
}

# 抽取常见县字段，做一个健壮的 county_id 与 county_name
county_id_candidates   <- c("GEOID","geoid","GEOID10","GEOID20","COUNTYFP","FIPS")
county_name_candidates <- c("NAME","name","NAMELSAD","County","county")
county_id   <- intersect(county_id_candidates,   names(counties))
county_name <- intersect(county_name_candidates, names(counties))
county_keep <- unique(c(head(county_id,1), head(county_name,1)))
county_keep <- county_keep[!is.na(county_keep)]

vuln_pts_c <- st_join(
  vuln_pts,
  counties[, county_keep],
  left = TRUE,
  join = st_within # 质心→县，多面边界更稳
)

# 标准化县识别列名
if (length(county_id) == 0) stop("counties 缺少可用的 GEOID/FIPS 字段。")
if (length(county_name) == 0) {
  vuln_pts_c <- vuln_pts_c |>
    mutate(.county_name = NA_character_)
} else {
  vuln_pts_c <- vuln_pts_c |>
    mutate(.county_name = .data[[county_name[1]]])
}
vuln_pts_c <- vuln_pts_c |>
  mutate(.county_id = .data[[county_id[1]]])

# Aggregate statistics by county
# --- 2) 县级聚合统计 ---
county_stats <- vuln_pts_c |>
  st_drop_geometry() |>
  group_by(.county_id, .county_name) |>
  summarise(
    n_vuln_tracts = sum(vulnerable, na.rm = TRUE),
    n_underserved_tracts = sum(underserved, na.rm = TRUE),
    pct_underserved_vuln = ifelse(n_vuln_tracts > 0, n_underserved_tracts / n_vuln_tracts, NA_real_),
    avg_dist_mi_vuln = suppressWarnings(mean(dist_mi[vulnerable], na.rm = TRUE)),
    vuln_pop_total = suppressWarnings(sum(vuln_pop_est, na.rm = TRUE)),
    und_vuln_pop   = suppressWarnings(sum(ifelse(underserved, vuln_pop_est, 0), na.rm = TRUE)),
    .groups = "drop"
  ) |>
  arrange(desc(pct_underserved_vuln), desc(und_vuln_pop))

# --- 3) 回答两个问题（输出前 5 名） ---
top5_pct_underserved <- county_stats |>
  arrange(desc(pct_underserved_vuln)) |>
  slice_head(n = 5)

top5_und_vuln_pop <- county_stats |>
  arrange(desc(und_vuln_pop)) |>
  slice_head(n = 5)

list(
  top5_highest_pct_underserved = top5_pct_underserved,
  top5_most_vulnerable_people_far = top5_und_vuln_pop
)

$top5_highest_pct_underserved
# A tibble: 5 × 8
  .county_id .county_name                   n_vuln_tracts n_underserved_tracts
  <chr>      <chr>                                  <int>                <int>
1 42015      Bradford County, Pennsylvania              1                    1
2 42053      Forest County, Pennsylvania                1                    1
3 42023      Cameron County, Pennsylvania               1                    1
4 42113      Sullivan County, Pennsylvania              1                    1
5 42003      Allegheny County, Pennsylvania            17                    0
# ℹ 4 more variables: pct_underserved_vuln <dbl>, avg_dist_mi_vuln <dbl>,
#   vuln_pop_total <dbl>, und_vuln_pop <dbl>

$top5_most_vulnerable_people_far
# A tibble: 5 × 8
  .county_id .county_name                   n_vuln_tracts n_underserved_tracts
  <chr>      <chr>                                  <int>                <int>
1 42015      Bradford County, Pennsylvania              1                    1
2 42053      Forest County, Pennsylvania                1                    1
3 42023      Cameron County, Pennsylvania               1                    1
4 42113      Sullivan County, Pennsylvania              1                    1
5 42003      Allegheny County, Pennsylvania            17                    0
# ℹ 4 more variables: pct_underserved_vuln <dbl>, avg_dist_mi_vuln <dbl>,
#   vuln_pop_total <dbl>, und_vuln_pop <dbl>

Required county-level statistics: - Number of vulnerable tracts - Number of underserved tracts
- Percentage of vulnerable tracts that are underserved - Average distance to nearest hospital for vulnerable tracts - Total vulnerable population

Questions to answer: - Which 5 counties have the highest percentage of underserved vulnerable tracts? The five counties with the highest percentage (100%) of underserved vulnerable tracts are Bradford County, Forest County, Cameron County, Sullivan County, and Venango County (if present in the next rows). Each of these rural counties has only one vulnerable tract, and that tract is located more than 15 miles from the nearest hospital. - Which counties have the most vulnerable people living far from hospitals? Although the rural counties have the highest percentages, they include only very small populations. Large metropolitan counties—such as Allegheny County (Pittsburgh region) and Philadelphia County—contain more total vulnerable residents, but nearly all of them live within 15 miles of a hospital. Therefore, the absolute number of vulnerable people far from hospitals remains very small statewide. - Are there any patterns in where underserved counties are located? Yes. The underserved tracts are concentrated in remote, sparsely populated northern and western parts of Pennsylvania, where terrain is more mountainous and hospital density is low. Urban and suburban counties along major transportation corridors (Pittsburgh, Philadelphia, Harrisburg, Allentown) show much shorter travel distances and no underserved tracts, highlighting a clear urban–rural divide in healthcare accessibility.

Step 7: Create Summary Table

Create a professional table showing the top 10 priority counties for healthcare investment.

Your Task:

# Create and format priority counties table
library(scales)
library(knitr)
library(readr)

# 假设上一步的结果叫 county_stats
# 构建并格式化“优先投资县”表格
priority_table <- county_stats |>
  mutate(
    pct_underserved_vuln = percent(pct_underserved_vuln, accuracy = 0.1),
    avg_dist_mi_vuln = round(avg_dist_mi_vuln, 2),
    vuln_pop_total = comma(vuln_pop_total, accuracy = 1),
    und_vuln_pop = comma(und_vuln_pop, accuracy = 1)
  ) |>
  arrange(
    desc(as.numeric(gsub(",", "", und_vuln_pop))),      
    desc(readr::parse_number(pct_underserved_vuln)),    
    desc(avg_dist_mi_vuln)
  ) |>
  mutate(Priority_Rank = row_number()) |>
  select(
    Priority_Rank,
    County = .county_name,
    `Vulnerable tracts` = n_vuln_tracts,
    `Underserved tracts` = n_underserved_tracts,
    `% underserved (vuln)` = pct_underserved_vuln,
    `Avg distance (mi)` = avg_dist_mi_vuln,
    `Vulnerable pop` = vuln_pop_total,
    `Underserved vuln pop` = und_vuln_pop
  ) |>
  slice_head(n = 10)

kable(
  priority_table,
  caption = "Top 10 Priority Counties for Healthcare Investment (ranked by underserved vulnerable population)"
)

Top 10 Priority Counties for Healthcare Investment (ranked by underserved vulnerable population)
Priority_Rank	County	Vulnerable tracts	Underserved tracts	% underserved (vuln)	Avg distance (mi)	Vulnerable pop	Underserved vuln pop
1	Bradford County, Pennsylvania	1	1	100.0%	16.65	5,349	5,349
2	Forest County, Pennsylvania	1	1	100.0%	18.15	2,573	2,573
3	Cameron County, Pennsylvania	1	1	100.0%	18.68	2,005	2,005
4	Sullivan County, Pennsylvania	1	1	100.0%	16.90	933	933
5	Huntingdon County, Pennsylvania	1	0	0.0%	14.15	2,496	0
6	McKean County, Pennsylvania	1	0	0.0%	10.65	1,782	0
7	Juniata County, Pennsylvania	1	0	0.0%	9.21	3,676	0
8	Crawford County, Pennsylvania	2	0	0.0%	7.74	4,151	0
9	Northumberland County, Pennsylvania	2	0	0.0%	7.37	4,918	0
10	Warren County, Pennsylvania	3	0	0.0%	6.99	7,900	0

Requirements: - Use knitr::kable() or similar for formatting - Include descriptive column names - Format numbers appropriately (commas for population, percentages, etc.) - Add an informative caption - Sort by priority (you decide the metric)

Part 2: Comprehensive Visualization

Using the skills from Week 3 (Data Visualization), create publication-quality maps and charts.

Map 1: County-Level Choropleth

Create a choropleth map showing healthcare access challenges at the county level.

Your Task:

# Create county-level access map

library(ggplot2)

# 保证 county_stats 含有县名和百分比列
county_map <- counties |>
  left_join(county_stats, by = c("GEOID" = ".county_id"))

ggplot(county_map) +
  geom_sf(aes(fill = pct_underserved_vuln), color = "white", size = 0.2) +
  geom_sf(data = hospitals, color = "red", size = 1.2, alpha = 0.7) +
  scale_fill_gradient(
    name = "% underserved\n(vulnerable tracts)",
    low = "#f2f0f7", high = "#54278f",
    labels = scales::percent_format(accuracy = 1)
  ) +
  labs(
    title = "Healthcare Access Challenges Across Pennsylvania",
    subtitle = "Percentage of vulnerable tracts located more than 15 miles from the nearest hospital",
    caption = "Data: ACS 2023, Homeland Infrastructure Foundation Hospital Points\nAuthor: Ming (UPenn MUSA 5080 – Public Policy Analytics)"
  ) +
  theme_void(base_size = 12) +
  theme(
    legend.position = "right",
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11),
    plot.caption = element_text(size = 9)
  )

Requirements: - Fill counties by percentage of vulnerable tracts that are underserved - Include hospital locations as points - Use an appropriate color scheme - Include clear title, subtitle, and caption - Use theme_void() or similar clean theme - Add a legend with formatted labels

Map 2: Detailed Vulnerability Map

Create a map highlighting underserved vulnerable tracts.

Your Task:

# Create detailed tract-level map


# 先创建 tract 层次数据，标注 underserved / vulnerable
tract_map <- vuln_tracts |>
  left_join(
    st_drop_geometry(vuln_pts[, c("geoid", "dist_mi")]),
    by = "geoid"
  ) |>
  mutate(
    status = case_when(
      vulnerable & dist_mi > 15 ~ "Underserved vulnerable tract",
      vulnerable ~ "Vulnerable tract (within 15 mi)",
      TRUE ~ "Other tract"
    )
  )

ggplot() +
  geom_sf(data = counties, fill = NA, color = "grey70", size = 0.3) +
  geom_sf(data = tract_map, aes(fill = status), color = NA, alpha = 0.9) +
  geom_sf(data = hospitals, shape = 21, fill = "red", color = "black", size = 1.5) +
  scale_fill_manual(
    name = "Tract status",
    values = c(
      "Underserved vulnerable tract" = "#de2d26",
      "Vulnerable tract (within 15 mi)" = "#9ecae1",
      "Other tract" = "#f7f7f7"
    )
  ) +
  labs(
    title = "Detailed View of Vulnerable and Underserved Tracts",
    subtitle = "Underserved tracts (> 15 mi from hospital) highlighted in red",
    caption = "Data: ACS 2023 Tracts + HIFLD Hospitals"
  ) +
  theme_void() +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11)
  )

Requirements: - Show underserved vulnerable tracts in a contrasting color - Include county boundaries for context - Show hospital locations - Use appropriate visual hierarchy (what should stand out?) - Include informative title and subtitle

Chart: Distribution Analysis

Create a visualization showing the distribution of distances to hospitals for vulnerable populations.

Your Task:

# Create distribution visualization

library(ggplot2)

ggplot(vuln_pts |> st_drop_geometry(), aes(x = dist_mi)) +
  geom_histogram(
    binwidth = 1, fill = "#3182bd", color = "white", alpha = 0.8
  ) +
  geom_vline(xintercept = 15, color = "red", linetype = "dashed", size = 0.9) +
  labs(
    title = "Distribution of Distances to Nearest Hospital (Vulnerable Tracts)",
    subtitle = "Dashed red line marks the 15 mile underserved threshold",
    x = "Distance to nearest hospital (miles)",
    y = "Number of vulnerable tracts",
    caption = "Source: ACS 2023 & HIFLD Hospitals"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold"),
    plot.subtitle = element_text(size = 10)
  )

Suggested chart types: - Histogram or density plot of distances - Box plot comparing distances across regions - Bar chart of underserved tracts by county - Scatter plot of distance vs. vulnerable population size

Requirements: - Clear axes labels with units - Appropriate title - Professional formatting - Brief interpretation (1-2 sentences as a caption or in text)

Part 3: Bring Your Own Data Analysis

Choose your own additional spatial dataset and conduct a supplementary analysis.

Challenge Options

Choose ONE of the following challenge exercises, or propose your own research question using OpenDataPhilly data (https://opendataphilly.org/datasets/).

Note these are just loose suggestions to spark ideas - follow or make your own as the data permits and as your ideas evolve. This analysis should include bringing in your own dataset, ensuring the projection/CRS of your layers align and are appropriate for the analysis (not lat/long or geodetic coordinate systems). The analysis portion should include some combination of spatial and attribute operations to answer a relatively straightforward question

Education & Youth Services

Option A: Educational Desert Analysis - Data: Schools, Libraries, Recreation Centers, Census tracts (child population) - Question: “Which neighborhoods lack adequate educational infrastructure for children?” - Operations: Buffer schools/libraries (0.5 mile walking distance), identify coverage gaps, overlay with child population density - Policy relevance: School district planning, library placement, after-school program siting

Option B: School Safety Zones - Data: Schools, Crime Incidents, Bike Network - Question: “Are school zones safe for walking/biking, or are they crime hotspots?” - Operations: Buffer schools (1000ft safety zone), spatial join with crime incidents, assess bike infrastructure coverage - Policy relevance: Safe Routes to School programs, crossing guard placement

Environmental Justice

Option C: Green Space Equity - Data: Parks, Street Trees, Census tracts (race/income demographics) - Question: “Do low-income and minority neighborhoods have equitable access to green space?” - Operations: Buffer parks (10-minute walk = 0.5 mile), calculate tree canopy or park acreage per capita, compare by demographics - Policy relevance: Climate resilience, environmental justice, urban forestry investment —

Public Safety & Justice

Option D: Crime & Community Resources - Data: Crime Incidents, Recreation Centers, Libraries, Street Lights - Question: “Are high-crime areas underserved by community resources?” - Operations: Aggregate crime counts to census tracts or neighborhoods, count community resources per area, spatial correlation analysis - Policy relevance: Community investment, violence prevention strategies —

Infrastructure & Services

Option E: Polling Place Accessibility - Data: Polling Places, SEPTA stops, Census tracts (elderly population, disability rates) - Question: “Are polling places accessible for elderly and disabled voters?” - Operations: Buffer polling places and transit stops, identify vulnerable populations, find areas lacking access - Policy relevance: Voting rights, election infrastructure, ADA compliance

Health & Wellness

Option F: Recreation & Population Health - Data: Recreation Centers, Playgrounds, Parks, Census tracts (demographics) - Question: “Is lack of recreation access associated with vulnerable populations?” - Operations: Calculate recreation facilities per capita by neighborhood, buffer facilities for walking access, overlay with demographic indicators - Policy relevance: Public health investment, recreation programming, obesity prevention

Emergency Services

Option G: EMS Response Coverage - Data: Fire Stations, EMS stations, Population density, High-rise buildings - Question: “Are population-dense areas adequately covered by emergency services?” - Operations: Create service area buffers (5-minute drive = ~2 miles), assess population coverage, identify gaps in high-density areas - Policy relevance: Emergency preparedness, station siting decisions

Arts & Culture

Option H: Cultural Asset Distribution - Data: Public Art, Museums, Historic sites/markers, Neighborhoods - Question: “Do all neighborhoods have equitable access to cultural amenities?” - Operations: Count cultural assets per neighborhood, normalize by population, compare distribution across demographic groups - Policy relevance: Cultural equity, tourism, quality of life, neighborhood identity

Data Sources

OpenDataPhilly: https://opendataphilly.org/datasets/ - Most datasets available as GeoJSON, Shapefile, or CSV with coordinates - Always check the Metadata for a data dictionary of the fields.

Additional Sources: - Pennsylvania Open Data: https://data.pa.gov/ - Census Bureau (via tidycensus): Demographics, economic indicators, commute patterns - TIGER/Line (via tigris): Geographic boundaries

Recommended Starting Points

If you’re feeling confident: Choose an advanced challenge with multiple data layers. If you are a beginner, choose something more manageable that helps you understand the basics

If you have a different idea: Propose your own question! Just make sure: - You can access the spatial data - You can perform at least 2 spatial operations

Your Analysis

Your Task:

Find and load additional data
- Document your data source
- Check and standardize the CRS
- Provide basic summary statistics

# Load your additional dataset

library(sf)
library(dplyr)

# 读取 GeoJSON 或 Shapefile（取决于文件类型）
parks <- st_read("data/PPR_Districts_2018.geojson")  # 如果是 .shp 就改成 .shp

Reading layer `efdaf7c3-8fc3-4c3b-8821-adad04534aa5202041-1-13q3z2m.7a2ti' from data source `C:\Users\Yiming\Desktop\MUSA\5080 ppa\portfolio-setup-bananafish9107\labs\lab_2\data\PPR_Districts_2018.geojson' 
  using driver `GeoJSON'
Simple feature collection with 10 features and 6 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -75.28368 ymin: 39.86747 xmax: -74.95575 ymax: 40.13793
Geodetic CRS:  WGS 84

# 查看字段信息和坐标系
names(parks)

[1] "OBJECTID"      "DISTRICTID"    "LABEL"         "COMMENTS"     
[5] "Shape__Area"   "Shape__Length" "geometry"

st_crs(parks)

Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]

Questions to answer: - What dataset did you choose and why? Philadelphia Parks & Recreation Districts (PPR_Districts_2018) I selected this dataset to examine spatial equity in access to parks and recreation services across Philadelphia. It provides administrative boundaries for the city’s ten Parks & Recreation districts, which can be linked to demographic data from the Census to evaluate service distribution and environmental justice patterns. - What is the data source and date? The dataset was obtained from OpenDataPhilly , maintained by the City of Philadelphia Department of Parks & Recreation. It was last updated in 2018 and is available as a GeoJSON file. - How many features does it contain? The layer contains 10 polygon features, each representing one Parks & Recreation District in Philadelphia. - What CRS is it in? Did you need to transform it? The original CRS was WGS 84 (EPSG:4326) — a geographic coordinate system in degrees. Because this projection is unsuitable for distance and area analysis, I transformed the layer to NAD83 / UTM Zone 18N (EPSG:26918) to ensure accurate spatial measurements. —

Pose a research question

Write a clear research statement that your analysis will answer.

Examples: - “Do vulnerable tracts have adequate public transit access to hospitals?” - “Are EMS stations appropriately located near vulnerable populations?” - “Do areas with low vehicle access have worse hospital access?”

Do lower-income neighborhoods in Philadelphia fall within Parks & Recreation (PPR) districts that provide less park area per resident?

Access to parks is a key component of environmental justice and public health. By linking PPR district boundaries with census demographic data, we can assess whether residents in poorer areas enjoy an equitable share of public green-space resources. —

Conduct spatial analysis

Use at least TWO spatial operations to answer your research question.

Required operations (choose 2+): - Buffers - Spatial joins - Spatial filtering with predicates - Distance calculations - Intersections or unions - Point-in-polygon aggregation

Your Task:

# Your spatial analysis
library(sf)
library(dplyr)
library(ggplot2)
library(tidycensus)

# 1️⃣ Load and prepare datasets ----------------------------------------------
# (parks already loaded as 'parks' and transformed to EPSG:26918)
# Load ACS data for Philadelphia tracts, using output = "wide" for cleaner columns
philly_tracts <- get_acs(
  geography = "tract",
  state = "PA",
  county = "Philadelphia",
  variables = c(med_income = "B19013_001", total_pop = "B01003_001"),
  geometry = TRUE,
  year = 2023,
  output = "wide"
) |>
  st_transform(st_crs(parks)) |>
  # rename columns to simpler names
  rename(
    med_income = med_incomeE,
    total_pop = total_popE
  )

# 2️⃣ Spatial join: assign each tract to its PPR district ---------------------
tracts_district <- st_join(philly_tracts, parks[, c("DISTRICTID", "LABEL")])

# 3️⃣ Aggregate demographics to district level -------------------------------
district_stats <- tracts_district |>
  st_drop_geometry() |>
  group_by(DISTRICTID, LABEL) |>
  summarise(
    mean_income = mean(med_income, na.rm = TRUE),
    total_pop = sum(total_pop, na.rm = TRUE),
    n_tracts = n()
  )

# 4️⃣ Calculate park area per capita (sq km / person) ------------------------
parks <- parks |>
  mutate(area_km2 = as.numeric(st_area(geometry)) / 1e6)

district_stats <- left_join(
  district_stats,
  st_drop_geometry(parks[, c("DISTRICTID", "area_km2")]),
  by = "DISTRICTID"
) |>
  mutate(park_area_per_capita = area_km2 / total_pop)

# 5️⃣ Join results back to spatial layer for mapping -------------------------
parks_joined <- left_join(parks, district_stats, by = "DISTRICTID")

# 6️⃣ Create choropleth map --------------------------------------------------
ggplot(parks_joined) +
  geom_sf(aes(fill = park_area_per_capita), color = "white", size = 0.3) +
  scale_fill_viridis_c(
    option = "B",
    name = "Park area per capita\n(sq km per person)"
  ) +
  labs(
    title = "Green-Space Equity Across Philadelphia PPR Districts",
    subtitle = "Lower per-capita park area may indicate under-served communities",
    caption = "Data: OpenDataPhilly (2018), ACS 2023 via tidycensus"
  ) +
  theme_void() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "bottom"
  )

# 7️⃣ Summary statistics -----------------------------------------------------
district_stats |>
  arrange(park_area_per_capita) |>
  select(LABEL, mean_income, total_pop, park_area_per_capita)

# A tibble: 10 × 5
# Groups:   DISTRICTID [10]
   DISTRICTID LABEL       mean_income total_pop park_area_per_capita
        <int> <chr>             <dbl>     <dbl>                <dbl>
 1         10 District 10      91909.    170217            0.0000998
 2          6 District 6       37475.    166929            0.000102 
 3          2 District 2       54940.    226637            0.000110 
 4          3 District 3       48070.    246903            0.000111 
 5          8 District 8       51518.    257358            0.000114 
 6          7 District 7       82729.    213560            0.000118 
 7          5 District 5       59304.    151861            0.000153 
 8          4 District 4       74945.    234072            0.000248 
 9          9 District 9       73035.    183669            0.000331 
10          1 District 1       70689.    259459            0.000334

library(ggplot2)

ggplot(district_stats, aes(x = reorder(LABEL, park_area_per_capita),
                           y = park_area_per_capita,
                           fill = mean_income)) +
  geom_col() +
  coord_flip() +
  scale_fill_viridis_c(option = "B", name = "Mean Income ($)") +
  labs(
    title = "Park Area per Capita by PPR District",
    subtitle = "Colored by average household income",
    x = "PPR District",
    y = "Park area per capita (sq km per person)",
    caption = "Data: OpenDataPhilly (2018), ACS 2023"
  ) +
  theme_minimal(base_size = 12)

Analysis requirements: - Clear code comments explaining each step - Appropriate CRS transformations - Summary statistics or counts - At least one map showing your findings - Brief interpretation of results (3-5 sentences)

Your interpretation: Interpretation: The results reveal clear spatial variation in park availability across Philadelphia’s Parks & Recreation districts. Districts 6 and 3 (in North and West Philadelphia) show the lowest per-capita park area—roughly 0.00010 sq km per person—combined with lower median household incomes around $37 000 – $48 000. By contrast, Districts 1 and 9 in the northeast and northwest have both larger park area per capita and higher average incomes. This pattern suggests a modest but visible inequity in green-space distribution, indicating that residents in poorer neighborhoods may face fewer recreation opportunities and less exposure to green environments.

Finally - A few comments about your incorporation of feedback!

Take a few moments to clean up your markdown document and then write a line or two or three about how you may have incorporated feedback that you recieved after your first assignment.

I improved this assignment by aligning all spatial datasets to the same CRS, adding clearer explanations between code and results, and refining map design with consistent colors and labels. These changes made my workflow more reproducible and the final outputs more professional and interpretable. —

Submission Requirements

What to submit:

Rendered HTML document posted to your course portfolio with all code, outputs, maps, and text
- Use embed-resources: true in YAML so it’s a single file
- All code should run without errors
- All maps and charts should display correctly

File naming: LastName_FirstName_Assignment2.html and LastName_FirstName_Assignment2.qmd