Assignment 2: Spatial Analysis and Visualization

Healthcare Access and Equity in Pennsylvania

Author

Jed Chew

Published

December 8, 2025

Assignment Overview

Learning Objectives:

  • Apply spatial operations to answer policy-relevant research questions

  • Integrate census demographic data with spatial analysis

  • Create publication-quality visualizations and maps

  • Work with spatial data from multiple sources

  • Communicate findings effectively for policy audiences


Part 1: Healthcare Access for Vulnerable Populations

Research Question

Which Pennsylvania counties have the highest proportion of vulnerable populations (elderly + low-income) living far from hospitals?

Your analysis should identify counties that should be priorities for healthcare investment and policy intervention.

Required Analysis Steps

Complete the following analysis, documenting each step with code and brief explanations:

Step 1: Data Collection (5 points)

Load the required spatial data:

  • Pennsylvania county boundaries (from lecture data)

  • Pennsylvania hospitals (from lecture data)

  • Pennsylvania census tracts

Your Task:

# Load spatial data
pa_counties <- st_read("data/Pennsylvania_County_Boundaries.shp")
districts <- st_read("data/districts.geojson")
hospitals <- st_read("data/hospitals.geojson")
census_tracts <- tracts(state = "PA", cb = TRUE, progress = FALSE)
metro_areas <- core_based_statistical_areas(cb = TRUE, progress = FALSE)

# Standardize CRS
hospitals <- st_transform(hospitals, st_crs(pa_counties))
census_tracts <- st_transform(census_tracts, st_crs(pa_counties))
metro_areas <- st_transform(metro_areas, st_crs(pa_counties))
districts <- st_transform(districts, st_crs(census_tracts))
# Check that all data loaded correctly
ggplot(pa_counties) +
  geom_sf() +
  labs(title = "PA Counties") +
  theme_void()

ggplot(census_tracts) +
  geom_sf() +
  labs(title = "PA Census Tracts") +
  theme_void()

ggplot(districts) +
  geom_sf() +
  labs(title = "PA districts") +
  theme_void()

ggplot(hospitals) +
  geom_sf() +
  labs(title = "Hospitals") +
  theme_minimal()

Questions to answer:

  • How many hospitals are in your dataset? – 223 hospitals
glimpse(hospitals) 
  • How many census tracts? – 3445 census tracts
glimpse(census_tracts)
  • What coordinate reference system is each dataset in? – all datasets have been projected to WGS 84 / Psuedo-Mercator
st_crs(hospitals)$Name
st_crs(census_tracts)$Name

Step 2: Get Demographic Data

Use tidycensus to download tract-level demographic data for Pennsylvania.

Required variables: - Total population - Median household income - Population 65 years and over (you may need to sum multiple age categories)

Your Task:

# Load all available variables for ACS 5-year 2022
acs_vars_2022 <- load_variables(2022, "acs5", cache = TRUE)
# Helper Variable for Summing Population Values for Elderly Population
elderly_pop = c("B01001_020", "B01001_021", "B01001_022", "B01001_023", 
                "B01001_024", "B01001_025", 
                "B01001_044", "B01001_045", "B01001_046", 
                "B01001_047", "B01001_048", "B01001_049")

# Get tract-level demographic data from ACS
pa_data <- get_acs(
  geography = "tract",
  variables = c(
    total_pop = "B01003_001",
    median_income = "B19013_001",
    elderly_pop = elderly_pop
  ),
  state = "PA",
  year = 2022,
  survey = "acs5",
  output = "wide" 
)

# Clean the county names to remove state name and "County" 
pa_clean <- pa_data |> 
  separate(
    NAME, 
    into = c("tract_name", "county_name", "state_name"), 
    sep = "; "
  ) |> 
  mutate(
    tract_name = str_remove(tract_name, "Census Tract "),
    county_name = str_remove(county_name, " County")
  )
head(pa_clean)
# Sum elderly population
pa_summary <- pa_clean |>
  mutate(
    elderly_popE = rowSums(across(matches("^elderly_pop\\d+E$")), na.rm = TRUE),
    # for MOE, ACS guidance is to combine by quadrature
    elderly_popM = sqrt(rowSums(across(matches("^elderly_pop\\d+M$"))^2, na.rm = 
                                  TRUE)),
    pct_elderly = elderly_popE / total_popE
  ) |>
  select(GEOID, tract_name, county_name, total_popE, median_incomeE, elderly_popE, 
         elderly_popM, pct_elderly)
head(pa_summary)
# Join to tract boundaries
tracts_with_data <- census_tracts |> 
  left_join(pa_summary, by = "GEOID")

# Map to confirm
ggplot(tracts_with_data) + 
  geom_sf(aes(fill = median_incomeE), linewidth = 0) +
  scale_fill_continuous(labels = label_dollar()) + 
  labs(fill = "Median Income") + 
  theme_void()

Questions to answer:

  • What year of ACS data are you using? – 2022 5-Year Estimates

  • How many tracts have missing income data? – 63

  • What is the median income across all PA census tracts? – $70,188

pa_tract_income_qns <- pa_summary |>
  summarize(
    tracts_missing_income = sum(is.na(median_incomeE)),
    tracts_median_income = median(median_incomeE, na.rm = TRUE),
    tract_median_elderly = mean(pct_elderly, na.rm = TRUE)
  )
pa_tract_income_qns

Step 3: Define Vulnerable Populations

Identify census tracts with vulnerable populations based on TWO criteria:

  1. Low median household income (choose an appropriate threshold) – $70,000
  2. Significant elderly population (choose an appropriate threshold) – 25%

Your Task:

# Filter for vulnerable tracts based on your criteria
pa_vulnerable <- pa_summary |> 
  filter(median_incomeE <= 70000 & pct_elderly >= 0.25)
pa_vulnerable

Questions to answer:

  • What income threshold did you choose and why? - $70,000. This threshold is slightly below median income across all PA census tracts of $70,188

  • What elderly population threshold did you choose and why? - 25%. This threshold is slightly higher than the mean percentage elderly population across all PA census tracts of 19.0%

  • How many tracts meet your vulnerability criteria? - 293 tracts

  • What percentage of PA census tracts are considered vulnerable by your definition? = 8.5% are considered vulnerable

# Percentage of PA census tracts considered vulnerable
293 / 3445

Step 4: Calculate Distance to Hospitals

For each vulnerable tract, calculate the distance to the nearest hospital.

Your Task:

# Update Median Income Map from Step 2
vulnerable_tracts <- tracts_with_data |>
  filter(median_incomeE <= 70000, pct_elderly >= 0.25)   

# Project to Accurate Projected CRS of NAD 83 / Pennsylvania South State Plane
pa_crs <- 32129
vul_proj <- st_transform(vulnerable_tracts, pa_crs)
hosp_proj   <- st_transform(hospitals, pa_crs)

# Map to confirm
ggplot(vul_proj) + 
  geom_sf(aes(fill = median_incomeE), linewidth = 0) +
  scale_fill_continuous(labels = label_dollar()) + 
  labs(fill = "Median Income") + 
  theme_void()

ggplot(hosp_proj) + 
  geom_sf() + 
  theme_void()
# Calculate distance from each centroid to nearest hospital, then convert to miles
centroids <- st_centroid(vul_proj, of_largest_polygon = TRUE)              
nn_idx    <- st_nearest_feature(centroids, hosp_proj)         
dist_m    <- st_distance(centroids, hosp_proj[nn_idx, ], by_element = TRUE)

vul_proj$dist_nearest_hosp_mi <- set_units(dist_m, "mi") |> drop_units()

vul_proj |>
  st_drop_geometry() |>
  select(GEOID, tract_name, county_name, dist_nearest_hosp_mi) |>
  arrange(desc(dist_nearest_hosp_mi)) |>
  head(10)

Requirements:

  • Use an appropriate projected coordinate system for Pennsylvania

  • Calculate distances in miles

  • Explain why you chose your projection

Questions to answer:

  • What is the average distance to the nearest hospital for vulnerable tracts? – 5.70 miles

  • What is the maximum distance? – 29.1 miles

pa_vul_tract_qns <- vul_proj |>
  st_drop_geometry() |>
  summarize(
    average_distance = mean(dist_nearest_hosp_mi, na.rm = TRUE),
    max_distance = max(dist_nearest_hosp_mi),
  )
pa_vul_tract_qns
  • How many vulnerable tracts are more than 15 miles from the nearest hospital? – 17 vulnerable tracts
pa_num_vul_qns <- vul_proj |>
  st_drop_geometry() |>
  filter(dist_nearest_hosp_mi > 15)
pa_num_vul_qns

Step 5: Identify Underserved Areas

Define “underserved” as vulnerable tracts that are more than 15 miles from the nearest hospital.

Questions to answer:

  • How many tracts are underserved? – 17 tracts (same as last question of Step 4)

  • What percentage of vulnerable tracts are underserved? – 5.8% of vulnerable tracts are underserved

17 / 293
  • Does this surprise you? Why or why not? – This low percentage does not surprise me, given that Pennsylvania is home to the hospital system of Penn Medicine and the Children’s Hospital of Philadelphia (CHOP).

Step 6: Aggregate to County Level

Use spatial joins and aggregation to calculate county-level statistics about vulnerable populations and hospital access.

Your Task:

# Normalize the county layer (create county_geoid + county_name)
pa_counties_norm <- pa_counties |> 
  mutate(
    county_geoid = paste0("42", str_pad(as.character(FIPS_COUNT), 3, pad = "0")),
    county_name  = COUNTY_NAM
  ) |> 
  st_transform(32129) |>                     # PA South State Plane (meters)
  select(county_geoid, county_name, geometry)

# Define underserved threshold
underserved_threshold_mi <- 15

# Spatial join tracts → counties and aggregate
county_summary <- pa_counties_norm |> 
  st_join(
    vul_proj |> 
      mutate(underserved = dist_nearest_hosp_mi > underserved_threshold_mi) |> 
      select(tr_geoid = GEOID,
             dist_nearest_hosp_mi,
             total_popE,
             elderly_popE,
             underserved),
    join = st_intersects,
    left = TRUE                                  # keep all 67 counties
  ) %>%
  st_drop_geometry() |> 
  group_by(county_geoid, county_name) |> 
  summarise(
    n_vulnerable_tracts        = sum(!is.na(tr_geoid)),
    n_underserved_tracts       = sum(underserved %in% TRUE, na.rm = TRUE),
    pct_vuln_underserved       = 100 * n_underserved_tracts / n_vulnerable_tracts,
    avg_dist_nearest_hosp_mi   = mean(dist_nearest_hosp_mi, na.rm = TRUE),
    total_vulnerable_residents = sum(replace_na(total_popE, 0), na.rm = TRUE),
  ) |> 
ungroup() |> 
arrange(desc(pct_vuln_underserved))

county_summary

Required county-level statistics: - Number of vulnerable tracts - Number of underserved tracts
- Percentage of vulnerable tracts that are underserved - Average distance to nearest hospital for vulnerable tracts - Total vulnerable population

Questions to answer:

  • Which 5 counties have the highest percentage of underserved vulnerable tracts? – Dauphin, Perry, Bradford, Clinton, Sullivan
slice(county_summary, 1:5)
  • Which counties have the most vulnerable people living far from hospitals? – Pike, Bradford, Forest, Sullivan, Perry, Fulton, and Dauphin
county_summary |> 
  filter(avg_dist_nearest_hosp_mi > 15) |> 
  arrange(desc(total_vulnerable_residents))
  • Are there any patterns in where underserved counties are located?

Step 7: Create Summary Table

Create a professional table showing the top 10 priority counties for healthcare investment.

Your Task:

priority_counties <- county_summary |> 
  arrange(desc(avg_dist_nearest_hosp_mi), desc(total_vulnerable_residents)) |> 
  select(county_name, total_vulnerable_residents, n_vulnerable_tracts, 
         n_underserved_tracts, avg_dist_nearest_hosp_mi) |> 
  slice(1:10)

kable(priority_counties,
      col.names = c("County", "Vulnerable Residents", "Vulnerable Tracts", 
                    "Underserved Tracts", "Avg Distance to Nearest Hospital"),
      caption = "Top 10 Priority Counties in PA for Healthcare Investment",
      format.args = list(big.mark = ","))

Requirements:

  • Use knitr::kable() or similar for formatting

  • Include descriptive column names

  • Format numbers appropriately (commas for population, percentages, etc.)

  • Add an informative caption

  • Sort by priority (you decide the metric)


Part 2: Comprehensive Visualization

Using the skills from Week 3 (Data Visualization), create publication-quality maps and charts.

Map 1: County-Level Choropleth

Create a choropleth map showing healthcare access challenges at the county level.

Your Task:

library(scales)
library(RColorBrewer)

# Spatial Joins and Projections
county_summary_sf <- pa_counties_norm %>%
  left_join(county_summary, by = "county_geoid")
hosp_pts <- hospitals %>% st_transform(st_crs(county_summary_sf))

# Create county-level access map
ggplot(county_summary_sf) +
  geom_sf(aes(fill = pct_vuln_underserved), color = "white", linewidth = 0.25) +
  geom_sf(data = hosp_pts, size = 0.9, alpha = 0.8) +
  scale_fill_distiller(
    name = "% of vulnerable tracts\n> 15 miles from a hospital",
    palette = "Reds",       
    direction = 1,          # light (low) -> dark (high)
    na.value = "grey90",
    limits = c(0, 100),
    breaks = c(0, 25, 50, 75, 100),
    labels = label_percent(accuracy = 1, scale = 1),
    guide = guide_colorbar(barheight = unit(80, "pt"))
  ) +
  labs(
    title = "Healthcare Access Challenges by County — Pennsylvania",
    subtitle = "Percentage of vulnerable tracts that are underserved (threshold = 15 miles)",
    caption  = "Notes: Grey counties have no vulnerable tracts. Distances computed in PA South State Plane"
  ) +
  theme_void() +
  theme(legend.position = "right")

Requirements: - Fill counties by percentage of vulnerable tracts that are underserved - Include hospital locations as points - Use an appropriate color scheme - Include clear title, subtitle, and caption - Use theme_void() or similar clean theme - Add a legend with formatted labels


Map 2: Detailed Vulnerability Map

Create a map highlighting underserved vulnerable tracts.

Your Task:

# Create detailed tract-level map
underserved_threshold_mi <- 15
vul_status <- vul_proj |> 
  mutate(
    status = if_else(dist_nearest_hosp_mi > underserved_threshold_mi,
                     "Underserved", "Vulnerable")
  )

# 2) Plot
ggplot() +
  # Base: county boundaries for context (low visual weight)
  geom_sf(data = pa_counties_norm, fill = NA, color = "grey60", linewidth = 0.3) +

  # Other vulnerable tracts (de-emphasized)
  geom_sf(data = subset(vul_status, status == "Vulnerable"),
          aes(fill = status), color = NA, alpha = 0.35) +

  # Underserved vulnerable tracts (primary focus)
  geom_sf(data = subset(vul_status, status == "Underserved"),
          aes(fill = status), color = "#7f0000", linewidth = 0.2, alpha = 0.9) +

  # Hospitals (symbols on top for maximum legibility)
  geom_sf(data = hosp_pts, aes(shape = "Hospital"),
          size = 1.2, color = "black", fill = "white", stroke = 0.4) +

  # Manual scales to control contrast & legend
  scale_fill_manual(
    name   = NULL,
    values = c("Underserved" = "#cb181d",  # strong red
               "Vulnerable"  = "grey80")
  ) +
  scale_shape_manual(
    name   = NULL,
    values = c("Hospital" = 21)  # filled circle w/ border
  ) +
  guides(
    fill  = guide_legend(order = 1, override.aes = list(alpha = c(0.9, 0.35))),
    shape = guide_legend(order = 2, override.aes = list(size = 3))
  ) +

  labs(
    title    = "Underserved Census Tracts and Hospital Locations — Pennsylvania",
    subtitle = "Red tracts: vulnerable and > 15 miles from the nearest hospital",
    caption  = "Distances computed in EPSG:32129 (PA South State Plane)"
  ) +
  theme_void(base_size = 12) +
  theme(
    legend.position = "right",
  ) +
  coord_sf(datum = NA)

Requirements: - Show underserved vulnerable tracts in a contrasting color - Include county boundaries for context - Show hospital locations - Use appropriate visual hierarchy (what should stand out?) - Include informative title and subtitle


Chart: Distribution Analysis

Create a visualization showing the distribution of distances to hospitals for vulnerable populations.

Your Task:

# Create distribution visualization
vul_regions <- vul_proj |> 
  mutate(underserved = dist_nearest_hosp_mi > underserved_threshold_mi)

# Compute tract centroids in lon/lat just for region assignment
vul_ll <- st_transform(vul_regions, 4326)
ctr    <- st_centroid(vul_ll)
lon    <- st_coordinates(ctr)[,1]
q      <- quantile(lon, probs = c(1/3, 2/3), na.rm = TRUE)

vul_regions <- vul_regions |> 
  mutate(region3 = case_when(
    lon <= q[1] ~ "West",
    lon >  q[2] ~ "East",
    TRUE        ~ "Central"
  )) %>%
  mutate(region3 = factor(region3, levels = c("West","Central","East")))

# 1) Histogram of distances (with underserved threshold)
p_hist <- ggplot(vul_regions, aes(x = dist_nearest_hosp_mi)) +
  geom_histogram(binwidth = 2, boundary = 0, closed = "left",
                 fill = "#9ecae1", color = "white") +
  geom_vline(xintercept = underserved_threshold_mi, linetype = "dashed") +
  labs(
    title   = "Distribution of Distances to the Nearest Hospital (Vulnerable Tracts)",
    x       = "Distance to nearest hospital (miles)",
    y       = "Number of tracts",
    caption = "Dashed line marks the underserved threshold (15 miles). Look for right-skew and the share beyond the threshold."
  ) +
  theme_minimal(base_size = 12)

# 2) Box plot by regions (West/Central/East)
p_box <- ggplot(vul_regions, aes(x = region3, y = dist_nearest_hosp_mi, fill = region3)) +
  geom_boxplot(outlier.alpha = 0.3) +
  geom_hline(yintercept = underserved_threshold_mi, linetype = "dashed") +
  scale_fill_brewer(palette = "Set2", guide = "none") +
  labs(
    title    = "Distance to Hospital by Region (Vulnerable Tracts)",
    subtitle = "Regions defined by longitude terciles (West, Central, East)",
    x        = NULL,
    y        = "Distance to nearest hospital (miles)",
    caption  = "Regions with higher medians indicate broader access challenges."
  ) +
  theme_minimal(base_size = 12)

# 3) Bar chart: underserved tracts by county (Top 15 for readability)
bar_df <- county_summary %>%
  select(county_name, n_underserved_tracts) %>%
  arrange(desc(n_underserved_tracts)) %>%
  slice_head(n = 15)

p_bar <- ggplot(bar_df,
                aes(x = reorder(county_name, n_underserved_tracts),
                    y = n_underserved_tracts)) +
  geom_col(fill = "#cb181d") +
  coord_flip() +
  labs(
    title   = "Underserved Vulnerable Tracts by County (Top 15)",
    x       = NULL,
    y       = "Underserved vulnerable tracts (count)",
    caption = "Highlights where the count of tracts beyond 15 miles is most concentrated."
  ) +
  theme_minimal(base_size = 12)

# 4) Scatter: distance vs. vulnerable population size (per tract)
p_scatter <- ggplot(vul_regions,
                    aes(x = dist_nearest_hosp_mi,
                        y = total_popE,
                        color = underserved)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE,
              linetype = "dashed", color = "black", linewidth = 0.8) +
  scale_color_manual(
    values = c(`TRUE` = "#cb181d", `FALSE` = "grey60"),
    name   = "Underserved?",
    labels = c("No (≤ 15 mi)", "Yes (> 15 mi)")
  ) +
  scale_y_continuous(labels = comma) +
  labs(
    title   = "Distance vs. Vulnerable Population Size (Tracts)",
    x       = "Distance to nearest hospital (miles)",
    y       = "Vulnerable population (tract total)",
    caption = "Dashed curve shows the local trend"
  ) +
  theme_minimal(base_size = 12)
p_hist
p_box
p_bar
p_scatter

Suggested chart types: - Histogram or density plot of distances - Box plot comparing distances across regions - Bar chart of underserved tracts by county - Scatter plot of distance vs. vulnerable population size

Requirements: - Clear axes labels with units - Appropriate title - Professional formatting - Brief interpretation (1-2 sentences as a caption or in text)


Part 3: Bring Your Own Data Analysis

Choose your own additional spatial dataset and conduct a supplementary analysis.

Challenge Options

Choose ONE of the following challenge exercises, or propose your own research question using OpenDataPhilly data (https://opendataphilly.org/datasets/).

Note these are just loose suggestions to spark ideas - follow or make your own as the data permits and as your ideas evolve. This analysis should include bringing in your own dataset, ensuring the projection/CRS of your layers align and are appropriate for the analysis (not lat/long or geodetic coordinate systems). The analysis portion should include some combination of spatial and attribute operations to answer a relatively straightforward question


Education & Youth Services

Option A: Educational Desert Analysis

  • Data: Schools, Libraries, Recreation Centers, Census tracts (child population)

  • Question: “Which neighborhoods lack adequate educational infrastructure for children?”

  • Operations: Buffer schools/libraries (0.5 mile walking distance), identify coverage gaps, overlay with child population density

  • Policy relevance: School district planning, library placement, after-school program siting

Option B: School Safety Zones

  • Data: Schools, Crime Incidents, Bike Network

  • Question: “Are school zones safe for walking/biking, or are they crime hotspots?”

  • Operations: Buffer schools (1000ft safety zone), spatial join with crime incidents, assess bike infrastructure coverage

  • Policy relevance: Safe Routes to School programs, crossing guard placement


Environmental Justice

Option C: Green Space Equity

  • Data: Parks, Street Trees, Census tracts (race/income demographics)
  • Question: “Do low-income and minority neighborhoods have equitable access to green space?”
  • Operations: Buffer parks (10-minute walk = 0.5 mile), calculate tree canopy or park acreage per capita, compare by demographics
  • Policy relevance: Climate resilience, environmental justice, urban forestry investment

Public Safety & Justice

Option D: Crime & Community Resources

  • Data: Crime Incidents, Recreation Centers, Libraries, Street Lights

  • Question: “Are high-crime areas underserved by community resources?”

  • Operations: Aggregate crime counts to census tracts or neighborhoods, count community resources per area, spatial correlation analysis

  • Policy relevance: Community investment, violence prevention strategies


Infrastructure & Services

Option E: Polling Place Accessibility

  • Data: Polling Places, SEPTA stops, Census tracts (elderly population, disability rates)

  • Question: “Are polling places accessible for elderly and disabled voters?”

  • Operations: Buffer polling places and transit stops, identify vulnerable populations, find areas lacking access

  • Policy relevance: Voting rights, election infrastructure, ADA compliance


Health & Wellness

Option F: Recreation & Population Health

  • Data: Recreation Centers, Playgrounds, Parks, Census tracts (demographics)

  • Question: “Is lack of recreation access associated with vulnerable populations?”

  • Operations: Calculate recreation facilities per capita by neighborhood, buffer facilities for walking access, overlay with demographic indicators

  • Policy relevance: Public health investment, recreation programming, obesity prevention


Emergency Services

Option G: EMS Response Coverage

  • Data: Fire Stations, EMS stations, Population density, High-rise buildings

  • Question: “Are population-dense areas adequately covered by emergency services?”

  • Operations: Create service area buffers (5-minute drive = ~2 miles), assess population coverage, identify gaps in high-density areas

  • Policy relevance: Emergency preparedness, station siting decisions


Arts & Culture

Option H: Cultural Asset Distribution

  • Data: Public Art, Museums, Historic sites/markers, Neighborhoods

  • Question: “Do all neighborhoods have equitable access to cultural amenities?”

  • Operations: Count cultural assets per neighborhood, normalize by population, compare distribution across demographic groups

  • Policy relevance: Cultural equity, tourism, quality of life, neighborhood identity


Data Sources

OpenDataPhilly: https://opendataphilly.org/datasets/ - Most datasets available as GeoJSON, Shapefile, or CSV with coordinates - Always check the Metadata for a data dictionary of the fields.

Additional Sources:

  • Pennsylvania Open Data: https://data.pa.gov/

  • Census Bureau (via tidycensus): Demographics, economic indicators, commute patterns

  • TIGER/Line (via tigris): Geographic boundaries

Your Analysis

My Chosen Question: School Access and School Choice in City of Philadelphia

  • Data: Schools, Census Tracts (race/income demographics)
  • Question: “Is lack of school access associated with vulnerable populations?”
  • Operations:
    • Clean datasets
    • Distance to nearest school
    • Buffer schools (10-minute walk = 0.5 mile)
    • Compare types and enrollment of schools by demographics

Your Task:

  1. Find and load additional data
    • Document your data source
    • Check and standardize the CRS
    • Provide basic summary statistics
# Load your additional dataset
phl_schools <- st_read("data/parcels.geojson")
phl_school_points <- st_read("data/points.geojson")
#Standardize CRS
phl_schools <- st_transform(phl_schools, st_crs(pa_counties))
phl_school_points <- st_transform(phl_school_points, st_crs(pa_counties))

ggplot(phl_schools) +
  geom_sf() +
  labs(title = "PHL School Parcels") +
  theme_void()

ggplot(phl_school_points) +
  geom_sf() +
  labs(title = "PHL School Points") +
  theme_void()

Questions to answer:

  • What dataset did you choose and why?

    • School locations to answer my research question on school access
  • What is the data source and date?

    • Data Source: Open Data Philly

    • Date of School Parcels – last updated 02/02/2024 (from metadata)

    • Data of School Points – last updated 02/02/2024 (from metadata)

  • How many features does it contain?

    • School Parcels – 17 features

    • School Points – 14 features

  • What CRS is it in? Did you need to transform it? – WGS 84

# Helper Variables for Summing Population Values for Children aged 5-19
child_pop = c("B01001_004", "B01001_005", "B01001_006", "B01001_007", 
                "B01001_028", "B01001_029", "B01001_030", "B01001_031")

# Get tract-level demographic data from 2022 ACS 5-Yr Estimates for Philadelphia
phl_data <- get_acs(
  geography = "tract",
  variables = c(
    total_pop = "B01003_001",
    child_pop = child_pop,
    median_income = "B19013_001",
    poverty = "B17001_001",
    White     = "B03002_003",
    Black     = "B03002_004",
    Hispanic  = "B03002_012"
  ),
  state = "PA",
  county = "Philadelphia",
  year = 2022,
  survey = "acs5",
  output = "wide",
  geometry = TRUE
)

# Clean the county names to remove state name and "County"
phl_clean <- phl_data |> 
  separate(
    NAME, 
    into = c("tract_name", "county_name", "state_name"), 
    sep = "; "
  ) |> 
  mutate(
    tract_name = str_remove(tract_name, "Census Tract "),
    county_name = str_remove(county_name, " County")
  ) 
phl_clean
# Summarize Raw Data
phl_summary <- phl_clean |>
  mutate(
    child_popE = rowSums(across(matches("^child_pop\\d+E$")), na.rm = TRUE),
    child_popM = sqrt(rowSums(across(matches("^child_pop\\d+M$"))^2, na.rm = 
                                  TRUE)),
    pct_child = round((child_popE/total_popE) * 100, 2),
    pct_poverty = round((povertyE/total_popE) * 100, 2),
    pct_white = round((WhiteE/total_popE) * 100, 2),
    pct_black = round((BlackE/total_popE) * 100, 2),
    pct_hispanic = round((HispanicE/total_popE) * 100, 2)
  ) |>
  select(GEOID, tract_name, median_incomeE, median_incomeM, total_popE, total_popM,
         child_popE, child_popM, pct_child, pct_poverty, 
         pct_white, pct_black, pct_hispanic)
head(phl_summary)

# Map to confirm
phl_summary <- st_transform(phl_summary, st_crs(pa_counties))

ggplot(phl_summary) + 
  geom_sf(aes(fill = median_incomeE), linewidth = 0) +
  scale_fill_continuous(labels = label_dollar()) + 
  labs(
    title = "PHL Census Tracts",
    fill = "Median Income") + 
  theme_void()

Note: Philadelphia has 408 census tracts, but as the map above visualizes, some of these census tracts have missing/incomplete data. Hence, the next part of this analysis will only use the 383 census tracts that do not have missing/NaN values

phl_summary_clean <- phl_summary |>
  # Drop rows with NA/NaN in any estimate column (keeps rows even if MOE is NA)
  filter(!if_any(ends_with("E"), ~ is.na(.) | is.nan(.)))

ggplot(phl_summary_clean) + 
  geom_sf(aes(fill = median_incomeE), linewidth = 0) +
  scale_fill_continuous(labels = label_dollar()) + 
  labs(
    title = "PHL Census Tracts (Cleaned)",
    fill = "Median Income") + 
  theme_void()

  1. Pose a research question

Write a clear research statement that your analysis will answer.

My Research Question: Is lack of school access associated with vulnerable populations?

Examples:

  • “Do vulnerable tracts have adequate public transit access to hospitals?”

  • Are EMS stations appropriately located near vulnerable populations?”

  • “Do areas with low vehicle access have worse hospital access?”


  1. Conduct spatial analysis

Use at least TWO spatial operations to answer your research question.

Required operations (choose 2+):

  • Buffers / Intersections / Unions / Distance Calculations

  • Spatial joins

  • Spatial filtering with predicates

  • Point-in-polygon aggregation

Your Task:

# Calculate the Distance from each Census tract to the nearest School
centroids <- st_centroid(phl_summary_clean, of_largest_polygon = TRUE)            
nn_idx    <- st_nearest_feature(centroids, phl_school_points)         
dist_m    <- st_distance(centroids, phl_school_points[nn_idx, ], by_element = TRUE)

phl_summary_clean$dist_sch_mi <- set_units(dist_m, "mi") |> drop_units()

phl_summary_clean |>
  st_drop_geometry() |>
  select(GEOID, tract_name, dist_sch_mi, pct_child, pct_poverty) |>
  arrange(desc(dist_sch_mi)) |>
  head(20)

Note: From this initial exploration, there is no clear linear relationship between the distance to the nearest school and the poverty percentage of a census tract. Instead, it might be useful to first identify underserved tracts that are more than 0.5 miles (i.e. a 10-minute walking distance) from the nearest school.

Identify Underserved Tracts

underserved_threshold <- 0.5
phl_underserved <- phl_summary_clean |>
  filter(dist_sch_mi > underserved_threshold)

phl_underserved_table <- phl_underserved |>
  st_drop_geometry() |>
  select(GEOID, tract_name, dist_sch_mi, pct_child, pct_poverty) |>
  arrange(desc(dist_sch_mi))

# Map Visualization by School Type
ggplot() + 
  geom_sf(data = phl_summary, fill = "gray95", color = "black", linewidth = 0.1, inherit.aes = FALSE) +
  geom_sf(data = phl_underserved, aes(fill = pct_poverty)) +
  geom_sf(data = phl_school_points, color = "black", size = 0.5, alpha = 0.75) +
  facet_wrap(~type_specific) +
  scale_fill_continuous() + 
  labs(
    title = "Philadelphia Census Tracts Underserved by Schools (by school type)",
    fill = "Poverty %") + 
  theme_void()

# Scatter Plot: distance vs. percentage of children (per tract)
phl_sch_scatter <- ggplot(phl_summary_clean,
                    aes(x = dist_sch_mi,
                        y = pct_child,
                        )) +
  geom_point(alpha = 0.5) +
  coord_cartesian(xlim = c(0, 1.2)) +
  geom_vline(xintercept = 0.5) +
  labs(
    title   = "Distance vs. Percentage of Children in Population (Tracts)",
    x       = "Distance to nearest school (miles)",
    y       = "Percentage of Children in Population",
  ) +
  theme_minimal(base_size = 12)
phl_sch_scatter

# Scatter Plot: distance vs. poverty percentage (per tract)
phl_poverty_sch_scatter <- ggplot(phl_summary_clean,
                    aes(x = dist_sch_mi,
                        y = pct_poverty,
                        )) +
  geom_point(alpha = 0.5) +
  coord_cartesian(xlim = c(0, 1.2)) +
  geom_vline(xintercept = 0.5) +
  labs(
    title   = "Distance vs. Percentage of Population in Poverty (Tracts)",
    x       = "Distance to nearest school (miles)",
    y       = "Percentage of Population in Poverty",
  ) +
  theme_minimal(base_size = 12)
phl_poverty_sch_scatter

As shown in the scatter plots above, there is no single clear relationship between the distance to the nearest school and the percentage of children or the level of poverty in a census tract. However, we can still do some explanatory analysis to further classify schools by their average attendance levels.

# Load School Attendance Data
attendance <- read_csv("data/SDP_Student_ADA_School_20241127.csv")
head(attendance)

# Filter School Attendance Data
attendance_clean <- attendance |>
  rename(
    sch_yr = `School Year`,
    sch_name = `School Name`,
    attendance = `Average Daily Attendance (YTD)`
  ) |> 
  filter(sch_yr == "2022-2023") |> 
  select (sch_yr, sch_name, attendance) |> 
  mutate(
    attendance_type = case_when(
      attendance < 80        ~ "Low",
      attendance <= 90       ~ "Medium",
      attendance > 90        ~ "High"),
    attendance_type = factor(attendance_type,
                            levels = c("Low", "Medium", "High"),
                            ordered = TRUE)
  )
head(attendance_clean)
# Spatial Join Attendance to School Locations
phl_school_points_clean <- phl_school_points |> 
  mutate(sch_name = str_squish(str_to_lower(school_name_label)))

attendance_clean <- attendance_clean |> 
  mutate(sch_name = str_squish(str_to_lower(sch_name)))

phl_school_points_attendance <- phl_school_points_clean |> 
  left_join(attendance_clean |>  select(sch_name, attendance, attendance_type), by = "sch_name")

# Map Visualization
ggplot() + 
  geom_sf(data = phl_summary, fill = "gray95", color = "black", linewidth = 0.1, inherit.aes = FALSE) +
  geom_sf(data = phl_underserved, aes(fill = pct_poverty)) +
  geom_sf(data = phl_school_points_attendance, aes(color = attendance_type), size = 0.5, alpha = 0.9) +
 scale_fill_distiller(
  name = "Poverty %",
  palette = "Reds", 
  direction = 1,
  na.value = "grey90",
  labels = scales::label_number(accuracy = 1, suffix = "%")
  ) +
  scale_color_brewer(name = "Attendance", palette = "Set2", na.value = "grey50") +
  labs(
    title = "Philadelphia Census Tracts Underserved by Schools",
    subtitle = "Tracts shaded by poverty rate; schools colored by attendance 
    category",
    fill = "Poverty %") + 
  theme_void()

Analysis requirements:

  • Clear code comments explaining each step

  • Appropriate CRS transformations

  • Summary statistics or counts

  • At least one map showing your findings

  • Brief interpretation of results (3-5 sentences)

Finally - A few comments about your incorporation of feedback!

Take a few moments to clean up your markdown document and then write a line or two or three about how you may have incorporated feedback that you received after your first assignment.

TA Feedback from Lab 1:

  1. Hide the setup code block. The Census API key is not supposed to be shared publicly. I used “include = FALSE” and “results = ‘hide’” in the relevant R code chunks
  2. Printing in Console instead of RMD: if you want to print something, provide a brief explanation

Submission Requirements

What to submit:

  1. Rendered HTML document posted to your course portfolio with all code, outputs, maps, and text
    • Use embed-resources: true in YAML so it’s a single file
    • All code should run without errors
    • All maps and charts should display correctly

File naming: LastName_FirstName_Assignment2.html and LastName_FirstName_Assignment2.qmd