Lingxuan Gao - MUSA 5080 Portfolio

Assignment 1: Census Data Quality for Policy Decisions

Evaluating Data Reliability for Algorithmic Decision-Making

Author

Lingxuan Gao

Published

September 28, 2025

Assignment Overview

Scenario

You are a data analyst for the Pennsylvania Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.

Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.

Learning Objectives

  • Apply dplyr functions to real census data for policy analysis
  • Evaluate data quality using margins of error
  • Connect technical analysis to algorithmic decision-making
  • Identify potential equity implications of data reliability issues
  • Create professional documentation for policy stakeholders

Submission Instructions

Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/

Make sure to update your _quarto.yml navigation to include this assignment under an “Assignments” menu.

Part 1: Portfolio Integration

Create this assignment in your portfolio repository under an assignments/assignment_1/ folder structure. Update your navigation menu to include:

- text: Assignments
  menu:
    - href: assignments/assignment_1/your_file_name.qmd
      text: "Assignment 1: Census Data Exploration"

If the text contains a special character (such as a comma or a colon), enclose it in double quotes so that Quarto reads it as text.

Setup

# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
library(tidycensus)
library(tidyverse)
library(knitr)

# Set your Census API key
census_api_key("YOUR_CENSUS_API_KEY", install = TRUE, overwrite = TRUE)
[1] "YOUR_CENSUS_API_KEY"
readRenviron("~/.Renviron") 

# Choose your state for analysis - assign it to a variable called my_state
my_state <- "Pennsylvania"

State Selection: I have chosen Pennsylvania for this analysis because it is where I am located.
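A key typed directly into a published script is visible to anyone who opens the page. A minimal sketch of one alternative, assuming the key has already been stored in an environment variable named CENSUS_API_KEY (for example in ~/.Renviron); the helper name get_census_key() is hypothetical and not part of tidycensus:

```r
# Hypothetical helper: read the Census API key from the environment
# instead of writing it into the script.
get_census_key <- function(var = "CENSUS_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("No key found in ", var, "; add it to ~/.Renviron and restart R.")
  }
  key
}

# Usage (not run here because it needs a real key):
# tidycensus::census_api_key(get_census_key())
```

With this pattern the rendered document shows only the variable name, never the key itself.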

Part 2: County-Level Resource Assessment

2.1 Data Retrieval

Your Task: Use get_acs() to retrieve county-level data for your chosen state.

Requirements:

  • Geography: county level
  • Variables: median household income (B19013_001) and total population (B01003_001)
  • Year: 2022
  • Survey: acs5
  • Output format: wide

Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.

# Write your get_acs() code here
county_data <- get_acs(
  geography = "county", 
  variables = c(
    median_income = "B19013_001",   
    total_pop = "B01003_001"    
  ),
  year = 2022,
  survey = "acs5",
  state = my_state,
  output = "wide"
)

# Clean the county names to remove state name and "County"
county_data <- county_data %>%
  mutate(NAME = str_remove(NAME, " County, Pennsylvania"))

# Hint: use mutate() with str_remove()

# Display the first few rows
head(county_data)
# A tibble: 6 × 6
  GEOID NAME      median_incomeE median_incomeM total_popE total_popM
  <chr> <chr>              <dbl>          <dbl>      <dbl>      <dbl>
1 42001 Adams              78975           3334     104604         NA
2 42003 Allegheny          72537            869    1245310         NA
3 42005 Armstrong          61011           2202      65538         NA
4 42007 Beaver             67194           1531     167629         NA
5 42009 Bedford            58337           2606      47613         NA
6 42011 Berks              74617           1191     428483         NA
glimpse(county_data)
Rows: 67
Columns: 6
$ GEOID          <chr> "42001", "42003", "42005", "42007", "42009", "42011", "…
$ NAME           <chr> "Adams", "Allegheny", "Armstrong", "Beaver", "Bedford",…
$ median_incomeE <dbl> 78975, 72537, 61011, 67194, 58337, 74617, 59386, 60650,…
$ median_incomeM <dbl> 3334, 869, 2202, 1531, 2606, 1191, 2058, 2167, 1516, 21…
$ total_popE     <dbl> 104604, 1245310, 65538, 167629, 47613, 428483, 122640, …
$ total_popM     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

2.2 Data Quality Assessment

Your Task: Calculate margin of error percentages and create reliability categories.

Requirements:

  • Calculate MOE percentage: (margin of error / estimate) * 100
  • Create reliability categories:
    • High Confidence: MOE < 5%
    • Moderate Confidence: MOE 5-10%
    • Low Confidence: MOE > 10%
  • Create a flag for unreliable estimates (MOE > 10%)

Hint: Use mutate() with case_when() for the categories.

# Calculate MOE percentage and reliability categories using mutate()
county_data <- county_data %>%
  mutate(
    moe_pct_income = median_incomeM / median_incomeE * 100,
    reliability = case_when(
      moe_pct_income < 5 ~ "High Confidence",
      moe_pct_income >= 5 & moe_pct_income <= 10 ~ "Moderate Confidence",
      moe_pct_income > 10 ~ "Low Confidence"
    ),
    unreliable_flag = moe_pct_income > 10   # TRUE if MOE > 10%
  )

# Create a summary showing count of counties in each reliability category
reliability_summary <- county_data %>%
  count(reliability) %>%
  mutate(percentage = n / sum(n) * 100)
# Hint: use count() and mutate() to add percentages
reliability_summary
# A tibble: 2 × 3
  reliability             n percentage
  <chr>               <int>      <dbl>
1 High Confidence        57       85.1
2 Moderate Confidence    10       14.9
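The summary above lists only two categories because no county falls in the Low Confidence group. A minimal sketch of one way to keep the empty category visible, assuming dplyr is installed; the three-row `toy` table is illustrative (its MOE percentages are the Adams, Allegheny, and Forest values reported elsewhere in this document) and stands in for the full county_data:

```r
library(dplyr)

# Fixed category order so that empty categories can still be reported
reliability_levels <- c("High Confidence", "Moderate Confidence", "Low Confidence")

# Illustrative subset, not the full county table
toy <- tibble(
  NAME = c("Adams", "Allegheny", "Forest"),
  moe_pct_income = c(4.22, 1.20, 9.99)
)

toy_summary <- toy %>%
  mutate(
    reliability = case_when(
      moe_pct_income < 5   ~ "High Confidence",
      moe_pct_income <= 10 ~ "Moderate Confidence",
      TRUE                 ~ "Low Confidence"
    ),
    # Converting to a factor lets count() know about unused levels
    reliability = factor(reliability, levels = reliability_levels)
  ) %>%
  count(reliability, .drop = FALSE) %>%   # keep zero-count levels
  mutate(percentage = n / sum(n) * 100)

print(toy_summary)
```

Showing a zero count makes it explicit to a policy reader that the Low Confidence category was checked and found empty, rather than omitted.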

2.3 High Uncertainty Counties

Your Task: Identify the 5 counties with the highest MOE percentages.

Requirements:

  • Sort by MOE percentage (highest first)
  • Select the top 5 counties
  • Display: county name, median income, margin of error, MOE percentage, reliability category
  • Format as a professional table using kable()

Hint: Use arrange(), slice(), and select() functions.

# Create table of top 5 counties by MOE percentage
top5_MOE <- county_data %>%
  arrange(desc(moe_pct_income)) %>%
  slice(1:5) %>%
  select(NAME, median_incomeE, median_incomeM, moe_pct_income, reliability)

# Format as table with kable() - include appropriate column names and caption
knitr::kable(
  top5_MOE,
  digits = 2,
  caption = "Top 5 Counties with the highest MOE percentages"
)
Top 5 Counties with the highest MOE percentages
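| NAME     | median_incomeE | median_incomeM | moe_pct_income | reliability         |
|:---------|---------------:|---------------:|---------------:|:--------------------|
| Forest   |          46188 |           4612 |           9.99 | Moderate Confidence |
| Sullivan |          62910 |           5821 |           9.25 | Moderate Confidence |
| Union    |          64914 |           4753 |           7.32 | Moderate Confidence |
| Montour  |          72626 |           5146 |           7.09 | Moderate Confidence |
| Elk      |          61672 |           4091 |           6.63 | Moderate Confidence |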

Data Quality Commentary:

The five counties with the highest income MOE percentages fall between 6.6% and 10.0%, so all of them sit in the “Moderate Confidence” group: the estimates are usable but carry noticeable uncertainty. Forest (9.99%) and Sullivan (9.25%) are right at the edge of the 10% cutoff, so their true median incomes could differ substantially from the published estimates.
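To make the uncertainty concrete, the margin of error can be turned into a range. ACS margins of error are published at the 90% confidence level, so estimate ± MOE gives an approximate 90% interval and MOE / 1.645 gives an approximate standard error. A minimal base R sketch using the Forest and Sullivan values from the table; the function name moe_interval() is hypothetical:

```r
# Hypothetical helper: 90% interval and approximate standard error
# from an ACS estimate and its published margin of error.
moe_interval <- function(estimate, moe) {
  data.frame(
    lower = estimate - moe,
    upper = estimate + moe,
    se    = moe / 1.645   # ACS MOEs correspond to a 90% confidence level
  )
}

# Values copied from the top-5 table
top2 <- data.frame(
  NAME     = c("Forest", "Sullivan"),
  estimate = c(46188, 62910),
  moe      = c(4612, 5821)
)

intervals <- cbind(top2["NAME"], moe_interval(top2$estimate, top2$moe))
print(intervals)
```

For Forest County the interval runs from $41,576 to $50,800, a spread of more than $9,000 around a $46,188 estimate.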

Part 3: Neighborhood-Level Analysis

3.1 Focus Area Selection

Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.

Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.

# Use filter() to select 2-3 counties from your county_reliability data
# Store the selected counties in a variable called selected_counties
selected_counties <- bind_rows(
  county_data %>% filter(moe_pct_income < 5) %>% slice(1),    
  county_data %>% filter(moe_pct_income >= 5 & moe_pct_income <= 10) %>% slice(1), 
  county_data %>% filter(moe_pct_income > 10) %>% slice(1)     
)

# Display the selected counties with their key characteristics
knitr::kable(
  selected_counties %>%
    select(NAME, median_incomeE, moe_pct_income, reliability),
  digits = 2,
  col.names = c("county name", "median income", "MOE percentage", "reliability category"),
  caption = "Selected Counties by MOE Percentage (Different Confidence Levels)"
)
Selected Counties by MOE Percentage (Different Confidence Levels)
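| county name | median income | MOE percentage | reliability category |
|:------------|--------------:|---------------:|:---------------------|
| Adams       |         78975 |           4.22 | High Confidence      |
| Cameron     |         46186 |           5.64 | Moderate Confidence  |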
# Show: county name, median income, MOE percentage, reliability category

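Comment on the output: Adams County’s estimate is relatively stable and reliable (MOE 4.22%). Cameron County’s estimate is still usable but comes with more uncertainty (MOE 5.64%). Because no Pennsylvania county exceeds the 10% threshold, the selection contains only two counties instead of three.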

3.2 Tract-Level Demographics

Your Task: Get demographic data for census tracts in your selected counties.

Requirements:

  • Geography: tract level
  • Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
  • Use the same state and year as before
  • Output format: wide
  • Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.

# Define your race/ethnicity variables with descriptive names
race_vars <- c(
  white = "B03002_003",
  black = "B03002_004",
  hispanic = "B03002_012",
  total_pop = "B03002_001"
)

# Use get_acs() to retrieve tract-level data
tract_demo <- get_acs(
  geography = "tract",
  variables = race_vars,
  year = 2022,
  survey = "acs5",
  state = my_state,
  county = c("001", "023"),  
  output = "wide"
)

# Hint: You may need to specify county codes in the county parameter

# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations
tract_demo <- tract_demo %>%
  mutate(
    pct_white = (whiteE / total_popE) * 100,
    pct_black = (blackE / total_popE) * 100,
    pct_hispanic = (hispanicE / total_popE) * 100
  )

# Add readable county and tract code columns parsed from the 11-digit GEOID
# (characters 1-2 = state, 3-5 = county, 6-11 = tract)
tract_demo <- tract_demo %>%
  mutate(
    county_code = str_sub(GEOID, 3, 5),
    tract_code  = str_sub(GEOID, 6, 11)
  )
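The county codes passed to get_acs() do not have to be typed by hand, because they are embedded in the GEOID. A minimal base R sketch of the GEOID layout; "42001" is the Adams County GEOID shown in the county table, "42023" for Cameron is inferred from the county code "023" used in the tract request, and the tract GEOID is assembled from the county and tract codes reported in the results:

```r
# County GEOIDs: 2-digit state FIPS + 3-digit county FIPS
county_geoids <- c(Adams = "42001", Cameron = "42023")

# Characters 3-5 are the county code that get_acs(county = ...) expects
county_codes <- substr(county_geoids, 3, 5)
print(county_codes)

# Tract GEOIDs: state (2) + county (3) + tract (6) = 11 characters
tract_geoid <- "42001031502"
parts <- c(
  state  = substr(tract_geoid, 1, 2),
  county = substr(tract_geoid, 3, 5),
  tract  = substr(tract_geoid, 6, 11)
)
print(parts)
```

Deriving the codes from the selected counties keeps the tract request in sync if the county selection changes.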

3.3 Demographic Analysis

Your Task: Analyze the demographic patterns in your selected areas.

# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract
top_hispanic_tract <- tract_demo %>%
  arrange(desc(pct_hispanic)) %>%
  slice(1) %>%
  select(county_code, tract_code, pct_hispanic, pct_white, pct_black)

# Calculate average demographics by county using group_by() and summarize()
county_demographics <- tract_demo %>%
  group_by(county_code) %>%
  summarize(
    tracts = n(),
    avg_white = mean(pct_white, na.rm = TRUE),
    avg_black = mean(pct_black, na.rm = TRUE),
    avg_hispanic = mean(pct_hispanic, na.rm = TRUE)
  )

# Show: number of tracts, average percentage for each racial/ethnic group

# Create a nicely formatted table of your results using kable()
knitr::kable(
  county_demographics,
  digits = 1,
  col.names = c("County Code", "Tracts", "Avg % White", "Avg % Black", "Avg % Hispanic"),
  caption = "Average Demographics by County (Selected Counties)"
)
Average Demographics by County (Selected Counties)
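| County Code | Tracts | Avg % White | Avg % Black | Avg % Hispanic |
|:------------|-------:|------------:|------------:|---------------:|
| 001         |     27 |        88.3 |         1.3 |            7.1 |
| 023         |      2 |        93.2 |         0.0 |            2.1 |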
top_hispanic_tract
# A tibble: 1 × 5
  county_code tract_code pct_hispanic pct_white pct_black
  <chr>       <chr>             <dbl>     <dbl>     <dbl>
1 001         031502             20.9      73.1      2.74
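The county averages here are unweighted means of tract percentages, so a tract with 1,000 residents counts as much as one with 3,000. A minimal sketch of how that can differ from a population-weighted share, assuming dplyr is installed; the two tracts and their counts are invented for illustration and are not taken from the ACS data:

```r
library(dplyr)

# Invented tracts in a hypothetical county "A"
toy_tracts <- tibble(
  county_code = c("A", "A"),
  total_popE  = c(1000, 3000),
  hispanicE   = c(100, 150)
)

comparison <- toy_tracts %>%
  mutate(pct_hispanic = hispanicE / total_popE * 100) %>%   # 10% and 5%
  group_by(county_code) %>%
  summarize(
    mean_of_tract_pcts = mean(pct_hispanic),                     # each tract counts equally
    pop_weighted_pct   = sum(hispanicE) / sum(total_popE) * 100  # each resident counts equally
  )

print(comparison)
```

In this toy case the unweighted mean is 7.5% while the population-weighted share is 6.25%. Which one is appropriate depends on whether the question is about typical tracts or about residents.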

Part 4: Comprehensive Data Quality Evaluation

4.1 MOE Analysis for Demographic Variables

Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.

Requirements:

  • Calculate MOE percentages for each demographic variable
  • Flag tracts where any demographic variable has MOE > 15%
  • Create summary statistics

# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)
tract_demo <- tract_demo %>%
  mutate(
    moe_pct_white    = (whiteM / whiteE) * 100,
    moe_pct_black    = (blackM / blackE) * 100,
    moe_pct_hispanic = (hispanicM / hispanicE) * 100
  )

# Create a flag for tracts with high MOE on any demographic variable
# Logical OR (|) already returns TRUE/FALSE, so no ifelse() wrapper is needed
tract_demo <- tract_demo %>%
  mutate(
    high_moe_flag = moe_pct_white > 15 | moe_pct_black > 15 | moe_pct_hispanic > 15
  )

# Create summary statistics showing how many tracts have data quality issues
moe_summary <- tract_demo %>%
  summarize(
    total_tracts = n(),
    tracts_high_moe = sum(high_moe_flag, na.rm = TRUE),
    pct_high_moe = (tracts_high_moe / total_tracts) * 100
  )

moe_summary
# A tibble: 1 × 3
  total_tracts tracts_high_moe pct_high_moe
         <int>           <int>        <dbl>
1           29              29          100
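A 100% flag rate is worth a second look. One possible contributor (an assumption, not verified against the tract data here) is division by zero: when a group's estimate is 0, margin / estimate evaluates to Inf in R, and Inf > 15 is TRUE. The 0.0% average Black share in county 023 suggests that some zero estimates are present. A minimal base R sketch with invented numbers; safe_moe_pct() is a hypothetical helper:

```r
# Hypothetical helper: return NA instead of Inf when the estimate is zero
safe_moe_pct <- function(moe, estimate) {
  ifelse(estimate > 0, moe / estimate * 100, NA_real_)
}

# Invented estimates and margins of error for three tracts
estimate <- c(0, 400, 2500)
moe      <- c(12, 50, 150)

raw_pct  <- moe / estimate * 100        # first value is Inf (12 / 0)
safe_pct <- safe_moe_pct(moe, estimate) # first value is NA

raw_flag  <- raw_pct > 15               # Inf > 15 is TRUE, so the tract is flagged
safe_flag <- safe_pct > 15              # NA propagates and can be handled explicitly

print(data.frame(estimate, moe, raw_pct, safe_pct, raw_flag, safe_flag))
```

Whether a zero estimate should count as unreliable is a policy choice; returning NA simply makes that choice explicit instead of leaving it to how R handles Inf.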

4.2 Pattern Analysis

Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.

# Group tracts by whether they have high MOE issues
# Calculate average characteristics for each group:
# - population size, demographic percentages
# Use group_by() and summarize() to create this comparison
pattern_summary <- tract_demo %>%
  group_by(high_moe_flag) %>%
  summarize(
    tracts = n(),
    avg_pop = mean(total_popE, na.rm = TRUE),
    avg_white = mean(pct_white, na.rm = TRUE),
    avg_black = mean(pct_black, na.rm = TRUE),
    avg_hispanic = mean(pct_hispanic, na.rm = TRUE)
  )

# Create a professional table showing the patterns
knitr::kable(
  pattern_summary,
  digits = 1,
  col.names = c("High MOE?", "Tracts", "Avg Population", "Avg White (%)", "Avg Black (%)", "Avg Hispanic (%)"),
  caption = "Community Characteristics by Data Reliability"
)
Community Characteristics by Data Reliability
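| High MOE? | Tracts | Avg Population | Avg White (%) | Avg Black (%) | Avg Hispanic (%) |
|:----------|-------:|---------------:|--------------:|--------------:|-----------------:|
| TRUE      |     29 |         3763.4 |          88.7 |           1.2 |              6.8 |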

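Pattern Analysis: All 29 tracts in the two selected counties are flagged as high-MOE, so there is no low-MOE group to compare against. The flagged tracts average about 3,760 residents and are mostly White (88.7%), with low Black (1.2%) and Hispanic (6.8%) shares. With so few residents in the smaller groups, the estimates for those groups rest on very small samples and are unstable.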

Part 5: Policy Recommendations

5.1 Analysis Integration and Professional Summary

Your Task: Write an executive summary that integrates findings from all four analyses.

Executive Summary Requirements:

  1. Overall Pattern Identification: What are the systematic patterns across all your analyses?
  2. Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings?
  3. Root Cause Analysis: What underlying factors drive both data quality issues and bias risk?
  4. Strategic Recommendations: What should the Department implement to address these systematic issues?

Executive Summary:

  1. Overall pattern: Differences in data quality are systematic, not random. Smaller geographic units tend to have higher margins of error, so both income and demographic estimates are less reliable there.
  2. Equity assessment: In the counties examined, the communities facing the greatest risk of algorithmic bias are small and majority-White.
  3. Root causes: The main drivers are survey design and population structure. Small samples and demographic homogeneity (which leaves very few respondents in the smaller groups) lead to unstable estimates.
  4. Strategic recommendations: The Department should run MOE checks before any data use and flag tracts with an MOE above 15% for review. For small-population tracts, it should use aggregated geographies or multi-year averages. It should also monitor equity outcomes so that high-MOE communities are not overlooked in resource allocation.
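The aggregation recommendation depends on how margins of error combine. A sketch under stated assumptions: the Census Bureau's standard approximation for the MOE of a sum of estimates is the square root of the sum of the squared MOEs (tidycensus provides moe_sum() for this calculation). The helper aggregate_moe() is hypothetical and the three tract values are invented:

```r
# Hypothetical helper: approximate MOE of a sum of ACS estimates
aggregate_moe <- function(moe) {
  sqrt(sum(moe^2))
}

# Invented subgroup counts and MOEs for three neighbouring tracts
tract_estimate <- c(120, 80, 100)
tract_moe      <- c(60, 50, 55)

# Each tract on its own is far above a 15% threshold
tract_moe_pct <- tract_moe / tract_estimate * 100

# Combine the tracts into one larger area
combined_estimate <- sum(tract_estimate)
combined_moe      <- aggregate_moe(tract_moe)
combined_moe_pct  <- combined_moe / combined_estimate * 100

print(round(tract_moe_pct, 1))
print(round(combined_moe_pct, 1))
```

In this toy case the individual tracts have MOE percentages of 50% to 62.5%, while the combined area comes in at about 32%. Aggregation narrows the relative error but does not guarantee that a threshold is met, and it trades away geographic detail.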

5.2 Specific Recommendations

Your Task: Create a decision framework for algorithm implementation.

# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category

# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"  
# - Low Confidence: "Requires manual review or additional data"
recommendations_data <- county_data %>%
  select(NAME, median_incomeE, moe_pct_income, reliability) %>%
  mutate(
    recommendation = case_when(
      reliability == "High Confidence" ~ "Safe for algorithmic decisions",
      reliability == "Moderate Confidence" ~ "Use with caution - monitor outcomes",
      reliability == "Low Confidence" ~ "Requires manual review or additional data",
      TRUE ~ "Unclassified"
    )
  )
# Format as a professional table with kable()
knitr::kable(
  recommendations_data,
  digits = 1,
  col.names = c("County", "Median Income", "MOE Percentage", "Reliability Category", "Algorithm Recommendation"),
  caption = "Decision Framework for Algorithm Implementation by County"
)
Decision Framework for Algorithm Implementation by County
| County         | Median Income | MOE Percentage | Reliability Category | Algorithm Recommendation            |
|:---------------|--------------:|---------------:|:---------------------|:------------------------------------|
| Adams          |         78975 |            4.2 | High Confidence      | Safe for algorithmic decisions      |
| Allegheny      |         72537 |            1.2 | High Confidence      | Safe for algorithmic decisions      |
| Armstrong      |         61011 |            3.6 | High Confidence      | Safe for algorithmic decisions      |
| Beaver         |         67194 |            2.3 | High Confidence      | Safe for algorithmic decisions      |
| Bedford        |         58337 |            4.5 | High Confidence      | Safe for algorithmic decisions      |
| Berks          |         74617 |            1.6 | High Confidence      | Safe for algorithmic decisions      |
| Blair          |         59386 |            3.5 | High Confidence      | Safe for algorithmic decisions      |
| Bradford       |         60650 |            3.6 | High Confidence      | Safe for algorithmic decisions      |
| Bucks          |        107826 |            1.4 | High Confidence      | Safe for algorithmic decisions      |
| Butler         |         82932 |            2.6 | High Confidence      | Safe for algorithmic decisions      |
| Cambria        |         54221 |            3.3 | High Confidence      | Safe for algorithmic decisions      |
| Cameron        |         46186 |            5.6 | Moderate Confidence  | Use with caution - monitor outcomes |
| Carbon         |         64538 |            5.3 | Moderate Confidence  | Use with caution - monitor outcomes |
| Centre         |         70087 |            2.8 | High Confidence      | Safe for algorithmic decisions      |
| Chester        |        118574 |            1.7 | High Confidence      | Safe for algorithmic decisions      |
| Clarion        |         58690 |            4.4 | High Confidence      | Safe for algorithmic decisions      |
| Clearfield     |         56982 |            2.8 | High Confidence      | Safe for algorithmic decisions      |
| Clinton        |         59011 |            3.9 | High Confidence      | Safe for algorithmic decisions      |
| Columbia       |         59457 |            3.8 | High Confidence      | Safe for algorithmic decisions      |
| Crawford       |         58734 |            3.9 | High Confidence      | Safe for algorithmic decisions      |
| Cumberland     |         82849 |            2.2 | High Confidence      | Safe for algorithmic decisions      |
| Dauphin        |         71046 |            2.3 | High Confidence      | Safe for algorithmic decisions      |
| Delaware       |         86390 |            1.5 | High Confidence      | Safe for algorithmic decisions      |
| Elk            |         61672 |            6.6 | Moderate Confidence  | Use with caution - monitor outcomes |
| Erie           |         59396 |            2.6 | High Confidence      | Safe for algorithmic decisions      |
| Fayette        |         55579 |            4.2 | High Confidence      | Safe for algorithmic decisions      |
| Forest         |         46188 |           10.0 | Moderate Confidence  | Use with caution - monitor outcomes |
| Franklin       |         71808 |            3.0 | High Confidence      | Safe for algorithmic decisions      |
| Fulton         |         63153 |            3.6 | High Confidence      | Safe for algorithmic decisions      |
| Greene         |         66283 |            6.4 | Moderate Confidence  | Use with caution - monitor outcomes |
| Huntingdon     |         61300 |            4.7 | High Confidence      | Safe for algorithmic decisions      |
| Indiana        |         57170 |            4.6 | High Confidence      | Safe for algorithmic decisions      |
| Jefferson      |         56607 |            3.4 | High Confidence      | Safe for algorithmic decisions      |
| Juniata        |         61915 |            4.8 | High Confidence      | Safe for algorithmic decisions      |
| Lackawanna     |         63739 |            2.6 | High Confidence      | Safe for algorithmic decisions      |
| Lancaster      |         81458 |            1.8 | High Confidence      | Safe for algorithmic decisions      |
| Lawrence       |         57585 |            3.1 | High Confidence      | Safe for algorithmic decisions      |
| Lebanon        |         72532 |            2.7 | High Confidence      | Safe for algorithmic decisions      |
| Lehigh         |         74973 |            2.0 | High Confidence      | Safe for algorithmic decisions      |
| Luzerne        |         60836 |            2.4 | High Confidence      | Safe for algorithmic decisions      |
| Lycoming       |         63437 |            4.4 | High Confidence      | Safe for algorithmic decisions      |
| McKean         |         57861 |            4.7 | High Confidence      | Safe for algorithmic decisions      |
| Mercer         |         57353 |            3.6 | High Confidence      | Safe for algorithmic decisions      |
| Mifflin        |         58012 |            3.4 | High Confidence      | Safe for algorithmic decisions      |
| Monroe         |         80656 |            3.2 | High Confidence      | Safe for algorithmic decisions      |
| Montgomery     |        107441 |            1.3 | High Confidence      | Safe for algorithmic decisions      |
| Montour        |         72626 |            7.1 | Moderate Confidence  | Use with caution - monitor outcomes |
| Northampton    |         82201 |            1.9 | High Confidence      | Safe for algorithmic decisions      |
| Northumberland |         55952 |            2.7 | High Confidence      | Safe for algorithmic decisions      |
| Perry          |         76103 |            3.2 | High Confidence      | Safe for algorithmic decisions      |
| Philadelphia   |         57537 |            1.4 | High Confidence      | Safe for algorithmic decisions      |
| Pike           |         76416 |            4.9 | High Confidence      | Safe for algorithmic decisions      |
| Potter         |         56491 |            4.4 | High Confidence      | Safe for algorithmic decisions      |
| Schuylkill     |         63574 |            2.4 | High Confidence      | Safe for algorithmic decisions      |
| Snyder         |         65914 |            5.6 | Moderate Confidence  | Use with caution - monitor outcomes |
| Somerset       |         57357 |            2.8 | High Confidence      | Safe for algorithmic decisions      |
| Sullivan       |         62910 |            9.3 | Moderate Confidence  | Use with caution - monitor outcomes |
| Susquehanna    |         63968 |            3.1 | High Confidence      | Safe for algorithmic decisions      |
| Tioga          |         59707 |            3.2 | High Confidence      | Safe for algorithmic decisions      |
| Union          |         64914 |            7.3 | Moderate Confidence  | Use with caution - monitor outcomes |
| Venango        |         59278 |            3.4 | High Confidence      | Safe for algorithmic decisions      |
| Warren         |         57925 |            5.2 | Moderate Confidence  | Use with caution - monitor outcomes |
| Washington     |         74403 |            2.4 | High Confidence      | Safe for algorithmic decisions      |
| Wayne          |         59240 |            4.8 | High Confidence      | Safe for algorithmic decisions      |
| Westmoreland   |         69454 |            2.0 | High Confidence      | Safe for algorithmic decisions      |
| Wyoming        |         67968 |            3.9 | High Confidence      | Safe for algorithmic decisions      |
| York           |         79183 |            1.8 | High Confidence      | Safe for algorithmic decisions      |

Key Recommendations:

Your Task: Use your analysis results to provide specific guidance to the department.

  1. Counties suitable for immediate algorithmic implementation: Adams, Allegheny, Armstrong, Beaver, Bedford, Berks, Blair, Bradford, Bucks, Butler, Cambria, Centre, Chester, Clarion, Clearfield, Clinton, Columbia, Crawford, Cumberland, Dauphin, Delaware, Erie, Fayette, Franklin, Fulton, Huntingdon, Indiana, Jefferson, Juniata, Lackawanna, Lancaster, Lawrence, Lebanon, Lehigh, Luzerne, Lycoming, McKean, Mercer, Mifflin, Monroe, Montgomery, Northampton, Northumberland, Perry, Philadelphia, Pike, Potter, Schuylkill, Somerset, Susquehanna, Tioga, Venango, Washington, Wayne, Westmoreland, Wyoming, and York. Their income MOE percentages are all below 5%, which makes the estimates reliable enough for immediate algorithmic use.

  2. Counties requiring additional oversight: Cameron, Carbon, Elk, Forest, Greene, Montour, Snyder, Sullivan, Union, and Warren. Their income MOE percentages fall between 5% and 10%. Algorithms can be applied, but with additional monitoring, such as regularly validating outputs against conditions on the ground and tracking resource allocation to prevent bias amplification.

  3. Counties needing alternative approaches: No counties are currently classified as Low Confidence. If any county exceeds the 10% MOE threshold in a future data release, the algorithm should not be applied to it directly. Alternatives include manual review by analysts, supplemental surveys to collect more data, and aggregation of small tracts to reduce error.

Questions for Further Investigation

  1. Do high-MOE communities show spatial clustering, such as being concentrated in rural areas?
  2. Beyond population size and racial composition, are economic factors (e.g., poverty rates, housing types) also systematically related to data quality?

Technical Notes

Data Sources:

  • U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
  • Retrieved via tidycensus R package on [date]

Reproducibility:

  • All analysis conducted in R version 4.5.1
  • Census API key required for replication
  • Complete code and documentation available at: https://musa-5080-fall-2025.github.io/portfolio-setup-LingxuanGao/labs/lab_1/assignment1_template.html

Methodology Notes:

  • Data Source: The analysis used ACS 2022 5-year estimates, accessed via the tidycensus package.
  • Data Processing: Variables were cleaned and renamed; MOE percentages were calculated as the reliability measure; GEOIDs were parsed into county and tract codes.
  • Geographic Scope: Pennsylvania was selected as the study area to align with class examples and allow direct comparison.
  • Analytical Choices: County-level income estimates were classed as High Confidence (MOE < 5%), Moderate Confidence (5-10%), or Low Confidence (> 10%). Tract-level demographic estimates were flagged when any MOE exceeded 15%. Both county- and tract-level data were used to compare reliability across spatial scales.

Limitations: ACS data are sample-based, and small-population communities (especially rural tracts) have large margins of error, making some results unstable.


Submission Checklist

Before submitting your portfolio link on Canvas:

Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/your_file_name.html