Lingxuan Gao - MUSA 5080 Portfolio
  • Home
  • Weekly Notes
    • Weekly Notes 01: Introduction to R and dplyr
    • Weekly Notes 02: Algorithmic Decision Making & The Census
    • Weekly Notes 03: Data Visualization & Exploratory Analysis
    • Weekly Notes 04: Spatial Data & GIS Operations in R
    • Weekly Notes 05: Introduction to Linear Regression
    • Weekly Notes 11: Space-Time Prediction
  • Labs
    • Lab 0: dplyr Basics
    • Lab 1: Census Data Quality for Policy Decisions
    • Lab 2: Spatial Analysis and Visualization-Healthcare Access and Equity in Pennsylvania
    • Lab 4: Spatial Predictive Analysis
    • Lab 5: Space-Time Prediction
  • Midterm
    • Appendix
    • Presentation
  • Final
    • Eviction Risk Prediction in Philadelphia

On this page

  • Assignment Overview
  • Part 1: Healthcare Access for Vulnerable Populations
    • Research Question
    • Required Analysis Steps
  • Part 2: Comprehensive Visualization
    • Map 1: County-Level Choropleth
    • Map 2: Detailed Vulnerability Map
    • Chart: Distribution Analysis
  • Part 3: Bring Your Own Data Analysis
    • Challenge Options
    • Data Sources
    • Recommended Starting Points
    • Your Analysis
  • Finally - A few comments about your incorporation of feedback!
  • Submission Requirements

Assignment 2: Spatial Analysis and Visualization

Healthcare Access and Equity in Pennsylvania

Author

Lingxuan Gao

Published

Invalid Date

Assignment Overview

Learning Objectives: - Apply spatial operations to answer policy-relevant research questions - Integrate census demographic data with spatial analysis - Create publication-quality visualizations and maps - Work with spatial data from multiple sources - Communicate findings effectively for policy audiences


Part 1: Healthcare Access for Vulnerable Populations

Research Question

Which Pennsylvania counties have the highest proportion of vulnerable populations (elderly + low-income) living far from hospitals?

Your analysis should identify counties that should be priorities for healthcare investment and policy intervention.

Required Analysis Steps

Complete the following analysis, documenting each step with code and brief explanations:

Step 1: Data Collection (5 points)

Load the required spatial data: - Pennsylvania county boundaries - Pennsylvania hospitals (from lecture data) - Pennsylvania census tracts

Your Task:

# Load required packages
library(sf)
library(tidyverse)
library(tigris)
library(tidycensus)
library(scales)
library(patchwork)
library(here)

# Add this near the top of your .qmd after loading libraries
options(tigris_use_cache = TRUE)
options(tigris_progress = FALSE)  # Suppress tigris progress bars

# Load spatial data
pa_counties <- st_read("./data/Pennsylvania_County_Boundaries.shp")
Reading layer `Pennsylvania_County_Boundaries' from data source 
  `D:\MUSA\MUSA_5080\portfolio-setup-LingxuanGao\labs\lab_2\data\Pennsylvania_County_Boundaries.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 67 features and 19 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -8963377 ymin: 4825316 xmax: -8314404 ymax: 5201413
Projected CRS: WGS 84 / Pseudo-Mercator
districts <- st_read("data/districts.geojson")
Reading layer `U.S._Congressional_Districts_for_Pennsylvania' from data source 
  `D:\MUSA\MUSA_5080\portfolio-setup-LingxuanGao\labs\lab_2\data\districts.geojson' 
  using driver `GeoJSON'
Simple feature collection with 17 features and 8 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -80.51939 ymin: 39.71986 xmax: -74.68956 ymax: 42.26935
Geodetic CRS:  WGS 84
hospitals <- st_read("data/hospitals.geojson")
Reading layer `hospitals' from data source 
  `D:\MUSA\MUSA_5080\portfolio-setup-LingxuanGao\labs\lab_2\data\hospitals.geojson' 
  using driver `GeoJSON'
Simple feature collection with 223 features and 11 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -80.49621 ymin: 39.75163 xmax: -74.86704 ymax: 42.13403
Geodetic CRS:  WGS 84
census_tracts <- tracts(state = "PA", cb = TRUE)

Questions to answer: - How many hospitals are in your dataset? 223 - How many census tracts? 3445 - What coordinate reference system is each dataset in? census_tracts: EPSG:4269 districts: EPSG:4326 hospitals: EPSG:4326 pa_counties: EPSG:3857


Step 2: Get Demographic Data

Use tidycensus to download tract-level demographic data for Pennsylvania.

Required variables: - Total population - Median household income - Population 65 years and over (you may need to sum multiple age categories)

Your Task:

# Get demographic data from ACS
acs_vars <- c(
  total_pop = "B01003_001",        
  median_income = "B19013_001", 
  age65_over_male = "B01001_020",  
  age65_over_female = "B01001_044" 
)

pa_acs <- get_acs(
  geography = "tract",
  state = "PA",
  variables = acs_vars,
  year = 2022,
  geometry = TRUE,
  survey = "acs5",
  output = "wide"
)

pa_acs <- pa_acs %>%
  mutate(
    pop_65plus = age65_over_maleE + age65_over_femaleE,
    pct_65plus = pop_65plus / total_popE * 100
  )

# Join to tract boundaries
pa_acs_sf <- st_transform(pa_acs, 4326)
sum(is.na(pa_acs_sf$median_incomeE))
[1] 63
median(pa_acs_sf$median_incomeE, na.rm = TRUE)
[1] 70188

Questions to answer: - What year of ACS data are you using? 2022 - How many tracts have missing income data? 63 - What is the median income across all PA census tracts? 70188


Step 3: Define Vulnerable Populations

Identify census tracts with vulnerable populations based on TWO criteria: 1. Low median household income (choose an appropriate threshold) 2. Significant elderly population (choose an appropriate threshold)

Your Task:

# Filter for vulnerable tracts based on your criteria
# Define vulnerability thresholds
income_threshold <- quantile(pa_acs_sf$median_incomeE, 0.2, na.rm = TRUE)
elderly_threshold <- quantile(pa_acs_sf$pct_65plus, 0.8, na.rm = TRUE)

cat("Income threshold:", round(income_threshold, 0), "\n")
Income threshold: 51725 
cat("Elderly population threshold:", round(elderly_threshold, 1), "%\n")
Elderly population threshold: 3.6 %
# Filter for vulnerable tracts based on criteria
vulnerable_tracts <- pa_acs_sf %>%
  mutate(
    is_low_income = median_incomeE <= income_threshold,
    is_elderly = pct_65plus >= elderly_threshold,
    vulnerable = ifelse(is_low_income & is_elderly, 1, 0)
  )

# Summary
num_vulnerable <- sum(vulnerable_tracts$vulnerable, na.rm = TRUE)
total_tracts <- nrow(vulnerable_tracts)
pct_vulnerable <- num_vulnerable / total_tracts * 100

cat("Number of vulnerable tracts:", num_vulnerable, "\n")
Number of vulnerable tracts: 119 
cat("Percentage of all PA tracts:", round(pct_vulnerable, 2), "%\n")
Percentage of all PA tracts: 3.45 %

Questions to answer: - What income threshold did you choose and why? I chose the 20th percentile of the statewide median household income as the low-income threshold.This aligns with the Pareto principle (80/20 rule).

  • What elderly population threshold did you choose and why? I chose the 80th percentile of the percentage of population aged 65 and over as the threshold for defining elderly concentration. This aligns with the Pareto principle (80/20 rule).

  • How many tracts meet your vulnerability criteria? 119

  • What percentage of PA census tracts are considered vulnerable by your definition? 3.45 %


Step 4: Calculate Distance to Hospitals

For each vulnerable tract, calculate the distance to the nearest hospital.

Your Task:

# Transform to appropriate projected CRS
vulnerable_tracts_proj <- st_transform(vulnerable_tracts, 5070)
hospitals_proj <- st_transform(hospitals, 5070)

# Calculate distance from each tract centroid to nearest hospital
tract_centroids <- st_centroid(vulnerable_tracts_proj)

dist_matrix <- st_distance(tract_centroids, hospitals_proj)  # in meters
nearest_hospital_dist <- apply(dist_matrix, 1, min)  # get min distance per tract

# Add distances (converted to miles)
vulnerable_tracts_proj$dist_to_hospital_mi <- as.numeric(nearest_hospital_dist) / 1609.34

# Summarize results
avg_dist <- mean(vulnerable_tracts_proj$dist_to_hospital_mi, na.rm = TRUE)
max_dist <- max(vulnerable_tracts_proj$dist_to_hospital_mi, na.rm = TRUE)
over_15mi <- sum(vulnerable_tracts_proj$dist_to_hospital_mi > 15, na.rm = TRUE)

cat("Average distance to nearest hospital (miles):", round(avg_dist, 2), "\n")
Average distance to nearest hospital (miles): 4.41 
cat("Maximum distance to nearest hospital (miles):", round(max_dist, 2), "\n")
Maximum distance to nearest hospital (miles): 32.69 
cat("Number of vulnerable tracts >15 miles from hospital:", over_15mi, "\n")
Number of vulnerable tracts >15 miles from hospital: 130 
# I use EPSG:5070 (CONUS Albers) for a single, statewide projection. It keeps linear distortion small at Pennsylvania’s scale while avoiding UTM zone splits, making nearest-hospital distance comparisons and thresholds robust and reproducible.

Requirements: - Use an appropriate projected coordinate system for Pennsylvania - Calculate distances in miles - Explain why you chose your projection

Questions to answer: - What is the average distance to the nearest hospital for vulnerable tracts? 4.41 miles - What is the maximum distance? 32.69 miles - How many vulnerable tracts are more than 15 miles from the nearest hospital? 130


Step 5: Identify Underserved Areas

Define “underserved” as vulnerable tracts that are more than 15 miles from the nearest hospital.

Your Task:

# Create underserved variable
vulnerable_tracts_proj <- vulnerable_tracts_proj %>%
  mutate(
    underserved = ifelse(vulnerable == 1 & dist_to_hospital_mi > 15, 1, 0)
  )

num_underserved <- sum(vulnerable_tracts_proj$underserved, na.rm = TRUE)
num_vulnerable <- sum(vulnerable_tracts_proj$vulnerable, na.rm = TRUE)
pct_underserved <- num_underserved / num_vulnerable * 100

cat("Number of underserved tracts:", num_underserved, "\n")
Number of underserved tracts: 4 
cat("Percentage of vulnerable tracts that are underserved:", round(pct_underserved, 2), "%\n")
Percentage of vulnerable tracts that are underserved: 3.36 %

Questions to answer: - How many tracts are underserved? 4 - What percentage of vulnerable tracts are underserved? 3.36% - Does this surprise you? Why or why not? No. Although some rural parts of Pennsylvania are geographically remote, the majority of the population lives in urban and suburban regions where hospitals are relatively dense. Therefore, only a small share of vulnerable tracts (about 3.36%) are more than 15 miles away from a hospital.


Step 6: Aggregate to County Level

Use spatial joins and aggregation to calculate county-level statistics about vulnerable populations and hospital access.

Your Task:

# Spatial join tracts to counties
library(dplyr)
pa_counties <- st_transform(pa_counties, 5070)
vul_with_county <- st_join(vulnerable_tracts_proj, pa_counties, join = st_intersects)

# Aggregate statistics by county
county_summary <- vul_with_county %>%
  st_drop_geometry() %>%  
  group_by(COUNTY_NAM) %>% 
  summarise(
    n_vulnerable = sum(vulnerable, na.rm = TRUE),
    n_underserved = sum(underserved, na.rm = TRUE),
    pct_underserved = ifelse(n_vulnerable > 0, (n_underserved / n_vulnerable) * 100, NA),
    avg_dist_mi = mean(dist_to_hospital_mi, na.rm = TRUE),
    total_vulnerable_pop = sum(total_popE[vulnerable == 1], na.rm = TRUE)
  ) %>%
  arrange(desc(pct_underserved))

top5_pct <- head(county_summary, 5)

cat("Top 5 counties with highest % of underserved vulnerable tracts:\n")
Top 5 counties with highest % of underserved vulnerable tracts:
print(top5_pct[, c("COUNTY_NAM", "pct_underserved")])
# A tibble: 5 × 2
  COUNTY_NAM pct_underserved
  <chr>                <dbl>
1 FRANKLIN               100
2 JUNIATA                100
3 MIFFLIN                100
4 MONROE                 100
5 PERRY                  100
top_vulnerable_pop <- vul_with_county %>%
  filter(underserved == 1) %>%
  group_by(COUNTY_NAM) %>%
  summarise(vulnerable_far_pop = sum(total_popE, na.rm = TRUE)) %>%
  arrange(desc(vulnerable_far_pop)) 

Required county-level statistics: - Number of vulnerable tracts - Number of underserved tracts
- Percentage of vulnerable tracts that are underserved - Average distance to nearest hospital for vulnerable tracts - Total vulnerable population

Questions to answer: - Which 5 counties have the highest percentage of underserved vulnerable tracts? FRANKLIN, JUNIATA, MIFFLIN, MONROE and PERRY. - Which counties have the most vulnerable people living far from hospitals? CRAWFORD, FRANKLIN, HUNTINGDON, JUNIATA, MIFFLIN - Are there any patterns in where underserved counties are located? Underserved counties are primarily located in central to western Pennsylvania’s rural belt, particularly along the Appalachian Mountain region (e.g., Franklin, Juniata, Mifflin, Huntingdon) and parts of the northeastern fringe (e.g., Monroe and Perry). These areas tend to have low population density, challenging topography, and limited transportation infrastructure, which increase travel distances to the nearest hospitals.


Step 7: Create Summary Table

Create a professional table showing the top 10 priority counties for healthcare investment.

Your Task:

# Create and format priority counties table
library(knitr)
library(scales)

priority_table <- county_summary %>%
  arrange(desc(pct_underserved), desc(total_vulnerable_pop)) %>%
  select(
    County = COUNTY_NAM,
    `Vulnerable Tracts` = n_vulnerable,
    `Underserved Tracts` = n_underserved,
    `% Underserved` = pct_underserved,
    `Avg Distance to Hospital (mi)` = avg_dist_mi,
    `Total Vulnerable Population` = total_vulnerable_pop
  ) %>%
  mutate(
    `% Underserved` = round(`% Underserved`, 2),
    `Avg Distance to Hospital (mi)` = round(`Avg Distance to Hospital (mi)`, 1),
    `Total Vulnerable Population` = comma(`Total Vulnerable Population`)
  ) %>%
  head(10)

# Display formatted table
kable(
  priority_table,
  caption = "Top 10 Priority Counties for Healthcare Investment in Pennsylvania",
  align = "lccccc",
  format.args = list(big.mark = ",")
)
Top 10 Priority Counties for Healthcare Investment in Pennsylvania
County Vulnerable Tracts Underserved Tracts % Underserved Avg Distance to Hospital (mi) Total Vulnerable Population
FRANKLIN 1 1 100.00 5.9 1,782
JUNIATA 1 1 100.00 14.7 1,782
MIFFLIN 1 1 100.00 8.7 1,782
PERRY 1 1 100.00 11.3 1,782
MONROE 1 1 100.00 10.3 1,299
SULLIVAN 1 1 100.00 16.8 918
HUNTINGDON 2 1 50.00 9.1 5,111
LYCOMING 2 1 50.00 6.3 4,094
COLUMBIA 2 1 50.00 6.5 3,446
LUZERNE 3 1 33.33 5.2 10,055

Requirements: - Use knitr::kable() or similar for formatting - Include descriptive column names - Format numbers appropriately (commas for population, percentages, etc.) - Add an informative caption - Sort by priority (you decide the metric)


Part 2: Comprehensive Visualization

Using the skills from Week 3 (Data Visualization), create publication-quality maps and charts.

Map 1: County-Level Choropleth

Create a choropleth map showing healthcare access challenges at the county level.

Your Task:

# Create county-level access map
# Merge geometry back to county summary
pa_counties_5070 <- st_transform(pa_counties, 5070)

county_map <- pa_counties_5070 %>%
  left_join(county_summary, by = c("COUNTY_NAM"))

# Create choropleth
ggplot() +
  geom_sf(data = county_map, aes(fill = pct_underserved), color = "white", size = 0.2) +
  geom_sf(data = hospitals_proj, color = "blue", size = 0.5, alpha = 0.6) +
  scale_fill_gradient(
    name = "% of Vulnerable Tracts Underserved",
    low = "#fee8c8", high = "#e34a33",
    na.value = "gray90",
    labels = function(x) paste0(round(x, 1), "%")
  ) +
  labs(
    title = "Healthcare Access Challenges in Pennsylvania",
    subtitle = "Percentage of Vulnerable Census Tracts that are Underserved, by County",
    caption = "Data sources: U.S. Census ACS 2022, Pennsylvania Hospital Locations\nProjection: NAD83 / Conus Albers Equal Area (EPSG:5070)"
  ) +
  theme_void() +
  theme(
    legend.position = "right",
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 12),
    legend.title = element_text(size = 10, face = "bold"),
    legend.text = element_text(size = 9),
    plot.caption = element_text(size = 8, hjust = 0)
  )

Requirements: - Fill counties by percentage of vulnerable tracts that are underserved - Include hospital locations as points - Use an appropriate color scheme - Include clear title, subtitle, and caption - Use theme_void() or similar clean theme - Add a legend with formatted labels


Map 2: Detailed Vulnerability Map

Create a map highlighting underserved vulnerable tracts.

Your Task:

# Create detailed tract-level map
ggplot() +
  geom_sf(data = pa_counties_5070, fill = "gray98", color = "gray80", size = 0.2) +

  geom_sf(data = vulnerable_tracts_proj, aes(fill = factor(vulnerable)), 
          color = NA, alpha = 0.4, show.legend = FALSE) +

  geom_sf(data = subset(vulnerable_tracts_proj, underserved == 1), 
          fill = "#d7301f", color = "white", size = 0.15) +

  geom_sf(data = hospitals_proj, color = "#045a8d", size = 0.6, alpha = 0.8) +

  labs(
    title = "Underserved Vulnerable Census Tracts in Pennsylvania",
    subtitle = "Tracts with low income and high elderly population located more than 15 miles from the nearest hospital",
    caption = "Data sources: U.S. Census ACS 2022, Pennsylvania Hospital Locations\nProjection: NAD83 / CONUS Albers Equal Area (EPSG:5070)"
  ) +
  
  theme_void() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 12),
    plot.caption = element_text(size = 8, hjust = 0),
    legend.position = "none"
  )

Requirements: - Show underserved vulnerable tracts in a contrasting color - Include county boundaries for context - Show hospital locations - Use appropriate visual hierarchy (what should stand out?) - Include informative title and subtitle


Chart: Distribution Analysis

Create a visualization showing the distribution of distances to hospitals for vulnerable populations.

Your Task:

# Create distribution visualization
ggplot(vulnerable_tracts_proj, aes(x = dist_to_hospital_mi)) +
  geom_histogram(
    fill = "#3182bd", color = "white", bins = 30, alpha = 0.8
  ) +
  geom_vline(aes(xintercept = mean(dist_to_hospital_mi, na.rm = TRUE)),
             color = "red", linetype = "dashed", size = 1) +
  labs(
    title = "Distribution of Distance to Nearest Hospital for Vulnerable Tracts",
    subtitle = "Red dashed line shows mean distance across all vulnerable census tracts",
    x = "Distance to Nearest Hospital (miles)",
    y = "Number of Vulnerable Tracts",
    caption = "Data: ACS 2022, Pennsylvania Hospital Locations\nProjection: EPSG 5070"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11),
    plot.caption = element_text(size = 8, hjust = 0),
    panel.grid.minor = element_blank()
  )

Suggested chart types: - Histogram or density plot of distances - Box plot comparing distances across regions - Bar chart of underserved tracts by county - Scatter plot of distance vs. vulnerable population size

Requirements: - Clear axes labels with units - Appropriate title - Professional formatting - Brief interpretation (1-2 sentences as a caption or in text)


Part 3: Bring Your Own Data Analysis

Choose your own additional spatial dataset and conduct a supplementary analysis.

Challenge Options

Choose ONE of the following challenge exercises, or propose your own research question using OpenDataPhilly data (https://opendataphilly.org/datasets/).

Note these are just loose suggestions to spark ideas - follow or make your own as the data permits and as your ideas evolve. This analysis should include bringing in your own dataset, ensuring the projection/CRS of your layers align and are appropriate for the analysis (not lat/long or geodetic coordinate systems). The analysis portion should include some combination of spatial and attribute operations to answer a relatively straightforward question


Education & Youth Services

Option A: Educational Desert Analysis - Data: Schools, Libraries, Recreation Centers, Census tracts (child population) - Question: “Which neighborhoods lack adequate educational infrastructure for children?” - Operations: Buffer schools/libraries (0.5 mile walking distance), identify coverage gaps, overlay with child population density - Policy relevance: School district planning, library placement, after-school program siting

Option B: School Safety Zones - Data: Schools, Crime Incidents, Bike Network - Question: “Are school zones safe for walking/biking, or are they crime hotspots?” - Operations: Buffer schools (1000ft safety zone), spatial join with crime incidents, assess bike infrastructure coverage - Policy relevance: Safe Routes to School programs, crossing guard placement


Environmental Justice

Option C: Green Space Equity - Data: Parks, Street Trees, Census tracts (race/income demographics) - Question: “Do low-income and minority neighborhoods have equitable access to green space?” - Operations: Buffer parks (10-minute walk = 0.5 mile), calculate tree canopy or park acreage per capita, compare by demographics - Policy relevance: Climate resilience, environmental justice, urban forestry investment —

Public Safety & Justice

Option D: Crime & Community Resources - Data: Crime Incidents, Recreation Centers, Libraries, Street Lights - Question: “Are high-crime areas underserved by community resources?” - Operations: Aggregate crime counts to census tracts or neighborhoods, count community resources per area, spatial correlation analysis - Policy relevance: Community investment, violence prevention strategies —

Infrastructure & Services

Option E: Polling Place Accessibility - Data: Polling Places, SEPTA stops, Census tracts (elderly population, disability rates) - Question: “Are polling places accessible for elderly and disabled voters?” - Operations: Buffer polling places and transit stops, identify vulnerable populations, find areas lacking access - Policy relevance: Voting rights, election infrastructure, ADA compliance


Health & Wellness

Option F: Recreation & Population Health - Data: Recreation Centers, Playgrounds, Parks, Census tracts (demographics) - Question: “Is lack of recreation access associated with vulnerable populations?” - Operations: Calculate recreation facilities per capita by neighborhood, buffer facilities for walking access, overlay with demographic indicators - Policy relevance: Public health investment, recreation programming, obesity prevention


Emergency Services

Option G: EMS Response Coverage - Data: Fire Stations, EMS stations, Population density, High-rise buildings - Question: “Are population-dense areas adequately covered by emergency services?” - Operations: Create service area buffers (5-minute drive = ~2 miles), assess population coverage, identify gaps in high-density areas - Policy relevance: Emergency preparedness, station siting decisions


Arts & Culture

Option H: Cultural Asset Distribution - Data: Public Art, Museums, Historic sites/markers, Neighborhoods - Question: “Do all neighborhoods have equitable access to cultural amenities?” - Operations: Count cultural assets per neighborhood, normalize by population, compare distribution across demographic groups - Policy relevance: Cultural equity, tourism, quality of life, neighborhood identity


Data Sources

OpenDataPhilly: https://opendataphilly.org/datasets/ - Most datasets available as GeoJSON, Shapefile, or CSV with coordinates - Always check the Metadata for a data dictionary of the fields.

Additional Sources: - Pennsylvania Open Data: https://data.pa.gov/ - Census Bureau (via tidycensus): Demographics, economic indicators, commute patterns - TIGER/Line (via tigris): Geographic boundaries

Recommended Starting Points

If you’re feeling confident: Choose an advanced challenge with multiple data layers. If you are a beginner, choose something more manageable that helps you understand the basics

If you have a different idea: Propose your own question! Just make sure: - You can access the spatial data - You can perform at least 2 spatial operations

Your Analysis

Your Task:

  1. Find and load additional data
    • Document your data source
    • Check and standardize the CRS
    • Provide basic summary statistics
# Load your additional dataset
schools <- st_read("./data/Schools.geojson")
Reading layer `Schools' from data source 
  `D:\MUSA\MUSA_5080\portfolio-setup-LingxuanGao\labs\lab_2\data\Schools.geojson' 
  using driver `GeoJSON'
Simple feature collection with 495 features and 14 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -75.2665 ymin: 39.90781 xmax: -74.97057 ymax: 40.12974
Geodetic CRS:  WGS 84
crimes  <- st_read("./data/incidents_part1_part2/incidents_part1_part2.shp")
Reading layer `incidents_part1_part2' from data source 
  `D:\MUSA\MUSA_5080\portfolio-setup-LingxuanGao\labs\lab_2\data\incidents_part1_part2\incidents_part1_part2.shp' 
  using driver `ESRI Shapefile'
replacing null geometries with empty geometries
Simple feature collection with 151197 features and 13 fields (with 890 geometries empty)
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -75.2753 ymin: 39.87731 xmax: -74.95981 ymax: 40.13646
Geodetic CRS:  WGS 84
bike_network <- st_read("./data/Bike_Network.geojson")
Reading layer `Bike_Network' from data source 
  `D:\MUSA\MUSA_5080\portfolio-setup-LingxuanGao\labs\lab_2\data\Bike_Network.geojson' 
  using driver `GeoJSON'
Simple feature collection with 5225 features and 8 fields
Geometry type: LINESTRING
Dimension:     XY
Bounding box:  xmin: -75.26927 ymin: 39.87528 xmax: -74.96572 ymax: 40.124
Geodetic CRS:  WGS 84
st_crs(schools)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]
st_crs(crimes)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["latitude",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["longitude",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]
st_crs(bike_network)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]
schools_2272 <- st_transform(schools, 2272)
crimes_2272 <- st_transform(crimes, 2272)
bike_network_2272 <- st_transform(bike_network, 2272)

Questions to answer: - What dataset did you choose and why? I selected the Schools, Crime Incidents, and Bike Network datasets for Philadelphia to analyze safety conditions within 1000-ft school zones and the availability of bicycle infrastructure. This analysis helps identify areas where students may face higher crime risk and limited safe travel options, supporting Safe Routes to School policy planning. - What is the data source and date? All datasets were obtained from OpenDataPhilly. - How many features does it contain? Schools: ≈ 300 points Crime Incidents: ≈ 150000 records Bike Network: ≈ 5000 road segments - What CRS is it in? Did you need to transform it? All three datasets were originally in WGS 84 (EPSG 4326) with units in degrees. They were transformed to EPSG 2272 (Pennsylvania State Plane South, US Feet) to ensure accurate distance and buffer calculations.


  1. Pose a research question

Which school zones in Philadelphia face both high crime rates and limited bicycle infrastructure, indicating a need for targeted ‘Safe Routes to School’ interventions?


  1. Conduct spatial analysis

Use at least TWO spatial operations to answer your research question.

Required operations (choose 2+): - Buffers - Spatial joins - Spatial filtering with predicates - Distance calculations - Intersections or unions - Point-in-polygon aggregation

Your Task:

# Your spatial analysis

schools_2272 <- schools_2272 %>%
  mutate(school_id = row_number())

school_buffer <- st_buffer(schools_2272, dist = 304.8)

crimes_joined <- st_join(crimes_2272, school_buffer, join = st_within, left = FALSE)

crime_summary <- crimes_joined %>%
  st_drop_geometry() %>%
  group_by(school_id) %>%
  summarise(crime_count = n(), .groups = "drop")

safety_summary <- school_buffer %>%
  left_join(crime_summary, by = "school_id") %>%
  mutate(
    crime_count = replace_na(crime_count, 0),
    risk_level = case_when(
      crime_count >= quantile(crime_count, 0.75) ~ "High Risk",
      crime_count >= quantile(crime_count, 0.25) ~ "Moderate Risk",
      TRUE ~ "Low Risk"
    ),
    has_bike_lane = lengths(st_intersects(., bike_network_2272)) > 0
  )

library(knitr)
kable(
  safety_summary %>%
    st_drop_geometry() %>%
    select(school_name, crime_count, has_bike_lane, risk_level) %>%
    arrange(desc(crime_count)) %>%
    head(10),
  caption = "Top 10 Schools by Crime Count within 1000-ft Safety Zones"
)
Top 10 Schools by Crime Count within 1000-ft Safety Zones
school_name crime_count has_bike_lane risk_level
FREIRE CHARTER SCHOOL 236 TRUE High Risk
K012 KIDS SCHOOL 191 TRUE High Risk
VISITATION SCHOOL 177 TRUE High Risk
ROMAN CATHOLIC HIGH SCHOOL 150 FALSE High Risk
JUNIATA PARK ACADEMY 118 TRUE High Risk
TECH FREIRE CHARTER SCHOOL 109 FALSE High Risk
GIRLS, PHILA HIGH SCHOOL FOR 107 TRUE High Risk
SCIENCE LEADERSHIP ACADEMY 103 TRUE High Risk
WRIGHT, RICHARD R. SCHOOL 98 FALSE High Risk
YES PHILLY 93 TRUE High Risk
# Filter: only high-risk schools without nearby bike lanes
highrisk_nobike <- safety_summary %>%
  filter(risk_level == "High Risk" & has_bike_lane == FALSE)

library(sf)
library(dplyr)
library(ggplot2)
library(here)

pa_tracts <- st_read("./data/PA-tracts.geojson")
Reading layer `PA-tracts' from data source 
  `D:\MUSA\MUSA_5080\portfolio-setup-LingxuanGao\labs\lab_2\data\PA-tracts.geojson' 
  using driver `GeoJSON'
Simple feature collection with 3217 features and 398 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -80.51989 ymin: 39.7198 xmax: -74.68952 ymax: 42.26986
Geodetic CRS:  WGS 84
philly_tracts <- pa_tracts %>%
  filter(pl == "Philadelphia County, Pennsylvania")

philly_tracts_2272 <- st_transform(philly_tracts, 2272)


plot(st_geometry(philly_tracts_2272))

ggplot() +
  geom_sf(data = philly_tracts_2272, fill = NA, color = "gray40", size = 0.1) +
  
  geom_sf(data = bike_network_2272, color = "#74c476", size = 0.3, alpha = 0.5) +
  
  geom_sf(data = highrisk_nobike, fill = "#cb181d", color = "#cb181d", size = 0.6, alpha = 1) +
  
  geom_sf(data = schools_2272, color = "black", fill = "black", size = 0.3, shape = 21) +
  
  labs(
    title = "High-Risk Schools without Bike Lanes in Philadelphia County",
    subtitle = "Tracts within the Philadelphia County boundary",
    caption = "Data: OpenDataPhilly & Eviction Lab (2024) | CRS: EPSG 2272 | Analysis by Lingxuan Gao"
  ) +
  theme_void() +
  theme(
    plot.title = element_text(face = "bold", size = 15, hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    plot.caption = element_text(size = 9, color = "gray40", hjust = 0.5)
  )

Analysis requirements: - Clear code comments explaining each step - Appropriate CRS transformations - Summary statistics or counts - At least one map showing your findings - Brief interpretation of results (3-5 sentences)

Your interpretation:

The map shows that high-risk schools without nearby bike lanes are scattered across Philadelphia County, with notable clusters in Central Philadelphia and the southern parts of North Philadelphia. A smaller number of red points also appear in West Philadelphia and South Philadelphia, indicating that safety and bike infrastructure gaps are not limited to any single district. Overall, most schools with higher crime counts are located along major urban corridors, where dense street networks and limited bike lanes may increase exposure to unsafe environments.

Finally - A few comments about your incorporation of feedback!

Take a few moments to clean up your markdown document and then write a line or two or three about how you may have incorporated feedback that you recieved after your first assignment.


Submission Requirements

What to submit:

  1. Rendered HTML document posted to your course portfolio with all code, outputs, maps, and text
    • Use embed-resources: true in YAML so it’s a single file
    • All code should run without errors
    • All maps and charts should display correctly

File naming: LastName_FirstName_Assignment2.html and LastName_FirstName_Assignment2.qmd