Learning Objectives: - Apply spatial operations to answer policy-relevant research questions - Integrate census demographic data with spatial analysis - Create publication-quality visualizations and maps - Work with spatial data from multiple sources - Communicate findings effectively for policy audiences
Part 1: Healthcare Access for Vulnerable Populations
Research Question
Which Pennsylvania counties have the highest proportion of vulnerable populations (elderly + low-income) living far from hospitals?
Step 1: Data Collection (5 points)
Load the required spatial data: - Pennsylvania county boundaries - Pennsylvania hospitals (from lecture data) - Pennsylvania census tracts
Your Task:
# Load required packageslibrary(sf)library(tidyverse)library(tigris)library(tidycensus)library(scales)library(patchwork)library(here)# Set Census API keycensus_api_key("e173851c633e89c20243632174db68f63bac8856")# Load spatial datapa_counties <-st_read(here("assignments/assignments2/data/Pennsylvania_County_Boundaries.shp"))
Reading layer `Pennsylvania_County_Boundaries' from data source
`D:\MUSA5080_Github\portfolio-setup-sihan-yu429\assignments\assignments2\data\Pennsylvania_County_Boundaries.shp'
using driver `ESRI Shapefile'
Simple feature collection with 67 features and 19 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -8963377 ymin: 4825316 xmax: -8314404 ymax: 5201413
Projected CRS: WGS 84 / Pseudo-Mercator
Reading layer `hospitals' from data source
`D:\MUSA5080_Github\portfolio-setup-sihan-yu429\assignments\assignments2\data\hospitals.geojson'
using driver `GeoJSON'
Simple feature collection with 223 features and 11 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -80.49621 ymin: 39.75163 xmax: -74.86704 ymax: 42.13403
Geodetic CRS: WGS 84
census_tracts <-tracts(state ="PA", cb =TRUE, progress_bar =FALSE)# Standardize CRShospitals <-st_transform(hospitals, st_crs(pa_counties))census_tracts <-st_transform(census_tracts, st_crs(pa_counties))# check that all data loaded correctly# Number of hospitalsn_hospitals <-nrow(hospitals)# Number of census tractsn_tracts <-nrow(census_tracts)cat("Number of hospitals:", nrow(hospitals), "\n")
Number of hospitals: 223
cat("Number of census tracts:", nrow(census_tracts), "\n")
Questions to answer: - How many hospitals are in your dataset? There are 223 hospitals in the dataset. - How many census tracts? There are 3445 census tracts. - What coordinate reference system is each dataset in? They share the same coordinate reference system: WGS 84.
Step 2: Get Demographic Data
Use tidycensus to download tract-level demographic data for Pennsylvania.
Required variables: - Total population - Median household income - Population 65 years and over (you may need to sum multiple age categories)
Your Task:
# Get demographic data from ACS#### Step 2: Get Demographic Data ----pa_demo <-get_acs(geography ="tract",state ="PA",year =2022, survey ="acs5",variables =c(total_pop ="B01003_001", # Total populationmedian_income ="B19013_001", # Median household incomeage_65_66 ="B01001_020", # Male 65–66age_67_69 ="B01001_021",age_70_74 ="B01001_022",age_75_79 ="B01001_023",age_80_84 ="B01001_024",age_85_plus ="B01001_025",f_65_66 ="B01001_044", # Female 65–66f_67_69 ="B01001_045",f_70_74 ="B01001_046",f_75_79 ="B01001_047",f_80_84 ="B01001_048",f_85_plus ="B01001_049" ),output ="wide")pa_demo <- pa_demo %>%mutate(pop_65_over = age_65_66E + age_67_69E + age_70_74E + age_75_79E + age_80_84E + age_85_plusE + f_65_66E + f_67_69E + f_70_74E + f_75_79E + f_80_84E + f_85_plusE ) %>%select(GEOID, total_pop = total_popE, median_income = median_incomeE, pop_65_over)# Join to tract boundariescensus_tracts_demo <- census_tracts %>%left_join(pa_demo, by ="GEOID")# 1. What year of ACS data are you using?acs_year <-2022cat("ACS Year Used:", acs_year, "\n")
ACS Year Used: 2022
# 2. How many tracts have missing income data?missing_income <-sum(is.na(census_tracts_demo$median_income))cat("Number of tracts with missing income data:", missing_income, "\n")
Number of tracts with missing income data: 62
# 3. What is the median income across all PA census tracts?median_income_pa <-median(census_tracts_demo$median_income, na.rm =TRUE)cat("Median income across all PA census tracts:", dollar(median_income_pa), "\n")
Median income across all PA census tracts: $70,188
Questions to answer: - What year of ACS data are you using? I used data of 2022. - How many tracts have missing income data? There are 62 tracts missing. - What is the median income across all PA census tracts? The medium income across all PA census tracts is $70,188. —
Step 3: Define Vulnerable Populations
Identify census tracts with vulnerable populations based on TWO criteria: 1. Low median household income (choose an appropriate threshold) 2. Significant elderly population (choose an appropriate threshold)
Your Task:
# Filter for vulnerable tracts based on your criteriaincome_threshold <-quantile(census_tracts_demo$median_income, 0.25, na.rm =TRUE)elderly_threshold <-quantile( (census_tracts_demo$pop_65_over / census_tracts_demo$total_pop), 0.75, na.rm =TRUE)cat("Low-income threshold:", scales::dollar(income_threshold), "\n")
Low-income threshold: $55,923.50
cat("High-elderly threshold (share of 65+):", scales::percent(elderly_threshold), "\n")
total_tracts <-nrow(census_tracts_demo)vulnerable_count <-nrow(vulnerable_tracts)vulnerable_pct <- (vulnerable_count / total_tracts) *100cat("Percentage of PA census tracts considered vulnerable:", round(vulnerable_pct, 1), "%\n")
Percentage of PA census tracts considered vulnerable: 4.8 %
Questions to answer: - What income threshold did you choose and why? I used $55,923.50 has the income threshold. It is based on the percentiles of the general income data and this is the benchmarks of the bottom 25% of the income across Pennsylvania counties. - What elderly population threshold did you choose and why? I used the same method as for the income threshold. I used the top 25% elderly share(23%) as my threshold. - How many tracts meet your vulnerability criteria? There are 165 tracts meet my vulnerability criteria. - What percentage of PA census tracts are considered vulnerable by your definition? There are 4.8% of PA census tracts that are vulnerable. —
Step 4: Calculate Distance to Hospitals
For each vulnerable tract, calculate the distance to the nearest hospital.
Your Task:
# Transform to appropriate projected CRSlibrary(units)vulnerable_tracts_proj <-st_transform(vulnerable_tracts, 5070)hospitals_proj <-st_transform(hospitals, 5070)tract_centroids <-st_centroid(vulnerable_tracts_proj)# Calculate distance from each tract centroid to nearest hospitaldist_matrix <-st_distance(tract_centroids, hospitals_proj)nearest_hospital_dist_m <-apply(dist_matrix, 1, min)nearest_hospital_dist_miles <-as.numeric(nearest_hospital_dist_m)/1609.34vulnerable_tracts_proj <- vulnerable_tracts_proj %>%mutate(distance_to_hospital_miles = nearest_hospital_dist_miles)avg_distance <-mean(vulnerable_tracts_proj$distance_to_hospital_miles, na.rm =TRUE)max_distance <-max(vulnerable_tracts_proj$distance_to_hospital_miles, na.rm =TRUE)num_over_15 <-sum(vulnerable_tracts_proj$distance_to_hospital_miles >15, na.rm =TRUE)cat("Average distance to nearest hospital (miles):", round(avg_distance, 2), "\n","Maximum distance to nearest hospital (miles):", round(max_distance, 2), "\n","Number of vulnerable tracts >15 miles from hospital:", num_over_15, "\n")
Average distance to nearest hospital (miles): 4.75
Maximum distance to nearest hospital (miles): 19.32
Number of vulnerable tracts >15 miles from hospital: 10
Requirements: - Use an appropriate projected coordinate system for Pennsylvania - Calculate distances in miles - Explain why you chose your projection
Questions to answer: - What is the average distance to the nearest hospital for vulnerable tracts? The average distance is 4.75 miles. - What is the maximum distance? The maximum distance is 19.16 miles. - How many vulnerable tracts are more than 15 miles from the nearest hospital? There are 9 tracts are more than 15 miles from the nearest hospital. —
Step 5: Identify Underserved Areas
Define “underserved” as vulnerable tracts that are more than 15 miles from the nearest hospital.
Your Task:
# Create underserved variablevulnerable_tracts_proj <- vulnerable_tracts_proj %>%mutate(underserved = distance_to_hospital_miles >15)num_underserved <-sum(vulnerable_tracts_proj$underserved, na.rm =TRUE)# Percentage of vulnerable tracts that are underservedperc_underserved <- num_underserved /nrow(vulnerable_tracts_proj) *100cat("Number of underserved tracts:", num_underserved, "\n")
Number of underserved tracts: 10
cat("Percentage of vulnerable tracts underserved:", round(perc_underserved, 2), "%\n")
Percentage of vulnerable tracts underserved: 6.06 %
Questions to answer: - How many tracts are underserved? There are 9 tracts are underserved. - What percentage of vulnerable tracts are underserved? There are 5.45% of the vulnerable tracts are underserved. - Does this surprise you? Why or why not? It doesn’t surprise me. It shows that the hospitals are relatively well_distributed in Pennsylvania that only a small amount of vulnerable tracts are more than 15 miles from the nearest hospital. The few underserved tract are likely in rural areas, however, most of the population concentrated in urban and suburban counties. —
Step 6: Aggregate to County Level
Use spatial joins and aggregation to calculate county-level statistics about vulnerable populations and hospital access.
Required county-level statistics: - Number of vulnerable tracts - Number of underserved tracts
- Percentage of vulnerable tracts that are underserved - Average distance to nearest hospital for vulnerable tracts - Total vulnerable population
Questions to answer: - Which 5 counties have the highest percentage of underserved vulnerable tracts? The 5 counties are Bradford, Forest, Juniata, Monroe, Sullivan.They all have 1 tract that is vulnerable and undeserved. - Which counties have the most vulnerable people living far from hospitals? Allegheny County has the most vulnerable people ~51790 living far from hospitals. - Are there any patterns in where underserved counties are located? The underserved counties are located in northern and central rural Pennsyvania with relatively low population density. —
Step 7: Create Summary Table
Create a professional table showing the top 10 priority counties for healthcare investment.
Your Task:
# Create and format priority counties tablelibrary(knitr)library(dplyr)library(scales)priority_counties <- county_stats %>%arrange(desc(pct_underserved)) %>%mutate(pct_underserved =round(pct_underserved, 1),avg_distance_miles =round(avg_distance_miles, 1),total_vulnerable_pop =comma(total_vulnerable_pop) ) %>%select(County = COUNTY_NAM,`Vulnerable Tracts`= n_vulnerable_tracts,`Underserved Tracts`= n_underserved,`Underserved (%)`= pct_underserved,`Avg Distance to Hospital (miles)`= avg_distance_miles,`Total Vulnerable Population`= total_vulnerable_pop ) priority_counties_top10 <- priority_counties %>%slice(1:10)kable(st_drop_geometry(priority_counties),caption ="Table: Top 10 Priority Counties for Healthcare Investment in Pennsylvania",align ="lccccr",format ="pipe")
Table: Top 10 Priority Counties for Healthcare Investment in Pennsylvania
County
Vulnerable Tracts
Underserved Tracts
Underserved (%)
Avg Distance to Hospital (miles)
Total Vulnerable Population
BRADFORD
1
1
100.0
16.7
5,466
FOREST
1
1
100.0
18.0
2,701
JUNIATA
1
1
100.0
16.0
1,782
MONROE
1
1
100.0
17.6
1,299
SULLIVAN
1
1
100.0
16.9
918
CLEARFIELD
4
2
50.0
11.3
13,056
DAUPHIN
2
1
50.0
10.0
8,410
POTTER
3
1
33.3
9.5
9,062
CRAWFORD
4
1
25.0
8.3
9,072
ALLEGHENY
22
0
0.0
2.8
51,790
ARMSTRONG
1
0
0.0
10.1
2,106
BEAVER
5
0
0.0
2.8
9,763
BEDFORD
2
0
0.0
8.0
7,578
BLAIR
3
0
0.0
3.1
7,602
CAMBRIA
6
0
0.0
4.8
13,551
CLARION
2
0
0.0
9.3
3,586
CLINTON
1
0
0.0
4.9
3,176
CUMBERLAND
1
0
0.0
2.0
2,438
DELAWARE
1
0
0.0
0.5
2,972
ELK
1
0
0.0
10.4
1,557
ERIE
4
0
0.0
3.5
12,216
FAYETTE
7
0
0.0
4.3
25,230
FRANKLIN
2
0
0.0
2.6
7,949
HUNTINGDON
1
0
0.0
14.2
2,558
INDIANA
2
0
0.0
10.2
5,703
JEFFERSON
2
0
0.0
5.6
4,566
LACKAWANNA
5
0
0.0
1.8
13,322
LANCASTER
1
0
0.0
6.5
4,387
LAWRENCE
2
0
0.0
4.4
4,200
LEBANON
1
0
0.0
6.3
3,107
LEHIGH
1
0
0.0
3.4
3,974
LUZERNE
7
0
0.0
3.3
21,062
LYCOMING
1
0
0.0
6.8
2,052
MCKEAN
3
0
0.0
6.0
7,832
MERCER
5
0
0.0
1.8
16,686
MONTGOMERY
1
0
0.0
1.3
2,069
NORTHAMPTON
3
0
0.0
2.6
10,073
NORTHUMBERLAND
4
0
0.0
11.2
9,087
PHILADELPHIA
16
0
0.0
1.0
61,873
SCHUYLKILL
3
0
0.0
2.2
11,194
SOMERSET
4
0
0.0
8.4
11,454
SUSQUEHANNA
1
0
0.0
5.8
1,612
TIOGA
1
0
0.0
0.2
3,459
UNION
2
0
0.0
2.6
9,762
VENANGO
1
0
0.0
8.1
4,785
WARREN
3
0
0.0
6.9
8,041
WASHINGTON
3
0
0.0
2.3
5,008
WAYNE
1
0
0.0
9.4
2,445
WESTMORELAND
11
0
0.0
2.9
25,093
YORK
4
0
0.0
1.8
11,199
Requirements: - Use knitr::kable() or similar for formatting - Include descriptive column names - Format numbers appropriately (commas for population, percentages, etc.) - Add an informative caption - Sort by priority (you decide the metric)
Part 2: Comprehensive Visualization
Using the skills from Week 3 (Data Visualization), create publication-quality maps and charts.
Map 1: County-Level Choropleth
Create a choropleth map showing healthcare access challenges at the county level.
Your Task:
# Create county-level access maplibrary(ggplot2)library(scales)library(dplyr)pa_counties_map <- pa_counties_proj %>%left_join( priority_counties %>%st_drop_geometry() %>%select(County, `Underserved (%)`),by =c("COUNTY_NAM"="County") )ggplot() +geom_sf(data = pa_counties_map,aes(fill =`Underserved (%)`),color ="white", size =0.25 ) +geom_sf(data = hospitals,shape =21, fill ="white", color ="black", size =2, alpha =0.8 ) +scale_fill_viridis_c(option ="Viridis",name ="% Underserved (of vulnerable tracts)",labels =function(x) paste0(x, "%"),na.value ="grey90",limits =c(0, 100) ) +labs(title ="Healthcare Access Challenges in Pennsylvania",subtitle ="Percent of vulnerable census tracts that are underserved (>15 miles from nearest hospital)",caption ="Sources: ACS 2022 5-year (tidycensus), lecture hospital data, county boundaries" ) +theme_void()
Requirements: - Fill counties by percentage of vulnerable tracts that are underserved - Include hospital locations as points - Use an appropriate color scheme - Include clear title, subtitle, and caption - Use theme_void() or similar clean theme - Add a legend with formatted labels
Map 2: Detailed Vulnerability Map
Create a map highlighting underserved vulnerable tracts.
Your Task:
# Create detailed tract-level mapvulnerable_tracts_proj <- vulnerable_tracts_proj %>%mutate(vulnerable= median_income <= income_threshold & elderly_share >= elderly_threshold ) %>%mutate(tract_status =case_when( vulnerable & underserved ~"Underserved & Vulnerable", vulnerable &!underserved ~"Served & Vulnerable",TRUE~"Other Tracts" ))ggplot()+geom_sf(data = pa_counties_proj, fill ="grey50", color ="white", size =0.3) +geom_sf(data = vulnerable_tracts_proj,aes(fill = tract_status),color =NA,alpha =0.8 ) +geom_sf(data = hospitals, shape =21, fill ="white", color ="black", size =2, alpha =0.8 ) +scale_fill_manual(name ="Tract Status",values =c("Underserved & Vulnerable"="#d7191c","Served & Vulnerable"="#fdae61","Other Tracts"="#cccccc" ) ) +labs(title ="Underserved Vulnerable Tracts in Pennsylvania",subtitle ="Tracts identified as both socially vulnerable and >15 miles from nearest hospital",caption ="Sources: ACS 2022 5-year (tidycensus), hospital data, county boundaries" ) +theme_void()
Requirements: - Show underserved vulnerable tracts in a contrasting color - Include county boundaries for context - Show hospital locations - Use appropriate visual hierarchy (what should stand out?) - Include informative title and subtitle
Chart: Distribution Analysis
Create a visualization showing the distribution of distances to hospitals for vulnerable populations.
Your Task:
# Create distribution visualizationggplot(vulnerable_tracts_proj, aes(x = distance_to_hospital_miles)) +geom_histogram(bins =20, fill ="#4575b4", color ="white", alpha =0.8) +geom_vline(xintercept =15, color ="#d7191c", linetype ="dashed", size =1) +labs(title ="Distribution of Distance to Nearest Hospital for Vulnerable Tracts",x ="Distance to Nearest Hospital (miles)",y ="Number of Vulnerable Tracts",caption ="Dashed line indicates 15-mile threshold for underserved classification" ) +theme_minimal(base_size =12)
Suggested chart types: - Histogram or density plot of distances - Box plot comparing distances across regions - Bar chart of underserved tracts by county - Scatter plot of distance vs. vulnerable population size
Requirements: - Clear axes labels with units - Appropriate title - Professional formatting - Brief interpretation (1-2 sentences as a caption or in text)
Part 3: Bring Your Own Data Analysis
Environmental Justice
Option C: Green Space Equity - Data: Parks, Street Trees, Census tracts (race/income demographics) - Question: “Do low-income and minority neighborhoods have equitable access to green space?” - Operations: Buffer parks (10-minute walk = 0.5 mile), calculate tree canopy or park acreage per capita, compare by demographics - Policy relevance: Climate resilience, environmental justice, urban forestry investment — Your Task:
Find and load additional data
Document your data source
Check and standardize the CRS
Provide basic summary statistics
# Load your additional datasetlibrary(sf)parks <-st_read("PPR_Program_Sites.geojson")
Reading layer `PPR_Program_Sites' from data source
`D:\MUSA5080_Github\portfolio-setup-sihan-yu429\assignments\assignments2\PPR_Program_Sites.geojson'
using driver `GeoJSON'
Simple feature collection with 171 features and 10 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -75.2563 ymin: 39.90444 xmax: -74.96944 ymax: 40.12284
Geodetic CRS: WGS 84
Girth Height Volume
Min. : 8.30 Min. :63 Min. :10.20
1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40
Median :12.90 Median :76 Median :24.20
Mean :13.25 Mean :76 Mean :30.17
3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30
Max. :20.60 Max. :87 Max. :77.00
Questions to answer: - What dataset did you choose and why? I chose the Philadelphia parks dataset because public parks represent accessible green space for residents. Analyzing their distribution helps assess equity in green space access across different neighborhoods. - What is the data source and date? The dataset comes from OpenDataPhilly, specifically the “PPR Program Sites” dataset, most recently updated in 2023. - How many features does it contain? The dataset contains 171 park features. - What CRS is it in? Did you need to transform it? The parks dataset was originally in WGS84 (EPSG:4326). For consistency with other spatial layers and to facilitate distance and area calculations, I transformed it to EPSG:3857 (Web Mercator projection). —
Pose a research question
Do low-income and minority neighborhoods in Philadelphia have equitable access to green spaces within a 10-minute walk distance?
Conduct spatial analysis
Use at least TWO spatial operations to answer your research question.
# Your spatial analysis# buffer a 0.5 mile distance from the park polygon ~10minparks_buffer <-st_buffer(parks, dist =800)parks_buffer <-st_transform(parks_buffer, st_crs(phl_tracts))# Spatial join to identify tracts within 0.5 miles of any park census_access <- phl_tracts %>%mutate(has_access =lengths(st_intersects(geometry, parks_buffer)) >0)# Summary statistics access_summary <- census_access %>%group_by(income_group =case_when( median_income <median(median_income, na.rm =TRUE) ~"Low Income",TRUE~"High Income" )) %>%summarise(total_tracts =n(),tracts_with_access =sum(has_access),pct_with_access =round(mean(has_access) *100, 1) )print(access_summary)
Simple feature collection with 2 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -75.28027 ymin: 39.86705 xmax: -74.95578 ymax: 40.13799
Geodetic CRS: NAD83
# A tibble: 2 × 5
income_group total_tracts tracts_with_access pct_with_access
<chr> <int> <int> <dbl>
1 High Income 218 190 87.2
2 Low Income 190 181 95.3
# ℹ 1 more variable: geometry <MULTIPOLYGON [°]>
ggplot() +geom_sf(data = census_access, aes(fill = median_income), color ="white", size =0.1) +geom_sf(data = parks_buffer,fill ="#32CD32", alpha =0.25, color =NA ) +geom_sf(data = parks,color ="#228B22", size =1.2 ) +scale_fill_gradientn(name ="Median Household Income ($)",colors =c("#4575b4", "#91bfdb", "#fdae61","orange", "#d73027") # 蓝→浅蓝→浅橙→橙 ) +labs(title ="Philadelphia Census Tracts by Income and Park Accessibility",subtitle ="Red shaded areas represent 500 m park buffer zones",caption ="Data: OpenDataPhilly Parks, ACS 2021" ) +theme_minimal()
Your interpretation:
The table shows that park accessibility is 87.2% in high-income areas and slightly higher at 95.3% in low-income areas, indicating that not all high-income tracts are near parks, though there is a cluster of parks in moderate-income areas in South Philly. Maps show that parks and housing are more concentrated in the city center and low-income areas, while high-income areas in the northwest, northeast, and south are larger but have more dispersed parks. This likely reflects land use and planning: low-density neighborhoods with private yards leave fewer public parks. Overall, park distribution appears influenced more by urban layout and planning than by income alone.
Finally - A few comments about your incorporation of feedback!
I really appreciated the detailed feedback on my first assignment. It suggested that I clean up unnecessary content and remove figures that didn’t need to be displayed. Following this advice, I tried to clean my Markdown document this time to try to make it look much cleaner and more professional. I hope to continue improving in this way with future assignments.