Assignment 1: Census Data Quality for Policy Decisions

Evaluating Data Reliability for Algorithmic Decision-Making

Author

Hope Levin

Published

September 28, 2025

Assignment Overview

Scenario

You are a data analyst for the New York State Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.

Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.

Learning Objectives

Apply dplyr functions to real census data for policy analysis
Evaluate data quality using margins of error
Connect technical analysis to algorithmic decision-making
Identify potential equity implications of data reliability issues
Create professional documentation for policy stakeholders

Submission Instructions

Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/

Make sure to update your _quarto.yml navigation to include this assignment under an “Assignments” menu.

Part 1: Portfolio Integration

Create this assignment in your portfolio repository under an assignments/assignment_1/ folder structure. Update your navigation menu to include:

- text: Assignments
  menu:
    - href: assignments/assignment_1/your_file_name.qmd
      text: "Assignment 1: Census Data Exploration"

If the text contains a special character such as a colon or comma, wrap it in double quotes so that Quarto reads it as text.

Setup

# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
library(tidycensus)
library(tidyverse)
library(knitr)

# Set your Census API key
census_api_key("YOUR_CENSUS_API_KEY", install = TRUE, overwrite = TRUE)

[1] "YOUR_CENSUS_API_KEY"

readRenviron("~/.Renviron")

# Choose your state for analysis - assign it to a variable called my_state
my_state <- "New York"

State Selection:

I have chosen New York for this analysis because: I’m from New York! Also, I’ve never been able to work with NYS data for a class assignment at Penn.

Part 2: County-Level Resource Assessment

2.1 Data Retrieval

Your Task: Use get_acs() to retrieve county-level data for your chosen state.

Requirements:
- Geography: county level
- Variables: median household income (B19013_001) and total population (B01003_001)
- Year: 2022
- Survey: acs5
- Output format: wide

Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.

# Write your get_acs() code here
ny_county <- get_acs(
    geography = "county",
    state = my_state,
    variables = c(median_h_income = "B19013_001", total_population = "B01003_001"),
    survey = "acs5",
    year = 2022,
    output = "wide"
)

# Look at the first few rows
head(ny_county)

# A tibble: 6 × 6
  GEOID NAME                 median_h_incomeE median_h_incomeM total_populationE
  <chr> <chr>                           <dbl>            <dbl>             <dbl>
1 36001 Albany County, New …            78829             2049            315041
2 36003 Allegany County, Ne…            58725             1965             47222
3 36005 Bronx County, New Y…            47036              890           1443229
4 36007 Broome County, New …            58317             1761            198365
5 36009 Cattaraugus County,…            56889             1778             77000
6 36011 Cayuga County, New …            63227             2736             76171
# ℹ 1 more variable: total_populationM <dbl>

# Clean the county names to remove state name and "County" 
# Hint: use mutate() with str_remove()
ny_clean <- ny_county %>%
    mutate(NAME = str_remove(NAME, "County, New York"))

# Display the first few rows
head(ny_clean)

# A tibble: 6 × 6
  GEOID NAME           median_h_incomeE median_h_incomeM total_populationE
  <chr> <chr>                     <dbl>            <dbl>             <dbl>
1 36001 "Albany "                 78829             2049            315041
2 36003 "Allegany "               58725             1965             47222
3 36005 "Bronx "                  47036              890           1443229
4 36007 "Broome "                 58317             1761            198365
5 36009 "Cattaraugus "            56889             1778             77000
6 36011 "Cayuga "                 63227             2736             76171
# ℹ 1 more variable: total_populationM <dbl>
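The cleaned names in the output above keep a trailing space (for example "Albany "), because the removed pattern starts at "County" and leaves the space in front of it. A minimal sketch of one way to avoid that, using three county names copied from the output; the object names `counties` and `counties_clean` are only for illustration:

```r
library(dplyr)
library(stringr)

# Illustrative rows copied from the county output above
counties <- tibble(
  NAME = c("Albany County, New York",
           "Bronx County, New York",
           "Cayuga County, New York")
)

counties_clean <- counties %>%
  mutate(
    # Include the leading space in the pattern, then trim any stray whitespace
    NAME = str_trim(str_remove(NAME, " County, New York$"))
  )

counties_clean$NAME
```

Names cleaned this way have no trailing space, so any later `filter()` on county names would need to match "Albany" rather than "Albany ".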

2.2 Data Quality Assessment

Your Task: Calculate margin of error percentages and create reliability categories.

Requirements:
- Calculate MOE percentage: (margin of error / estimate) * 100
- Create reliability categories:
  - High Confidence: MOE < 5%
  - Moderate Confidence: MOE 5-10%
  - Low Confidence: MOE > 10%
- Create a flag for unreliable estimates (MOE > 10%)

Hint: Use mutate() with case_when() for the categories.

# Calculate MOE percentage and reliability categories using mutate()
ny_reliability <- ny_clean %>%
  mutate(
    moe_percentage = round((median_h_incomeM/median_h_incomeE) * 100, 2),
    reliability = case_when(
      moe_percentage < 5 ~ "High Confidence",
      moe_percentage >= 5 & moe_percentage <= 10 ~ "Moderate Confidence",
      moe_percentage > 10 ~ "Low Confidence")
    )
      
# Create a summary showing count of counties in each reliability category
# Hint: use count() and mutate() to add percentages

reliability_summary <- ny_reliability %>%
  count(reliability) %>%
  mutate(percentage = round(100*n/sum(n), 1))

head(reliability_summary)

# A tibble: 3 × 3
  reliability             n percentage
  <chr>               <int>      <dbl>
1 High Confidence        56       90.3
2 Low Confidence          1        1.6
3 Moderate Confidence     5        8.1
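As a check on the formula, Albany's MOE percentage works out to 2049 / 78829 * 100, or about 2.60. The sketch below repeats that calculation for three counties using the estimates and margins printed earlier, and stores the category as a factor with explicit levels. That last step is an optional variation rather than part of the original analysis: with a factor, `count()` lists the categories in High, Moderate, Low order instead of alphabetically.

```r
library(dplyr)

# Estimates and margins of error copied from the county output above
income <- tibble(
  county   = c("Albany", "Allegany", "Bronx"),
  estimate = c(78829, 58725, 47036),
  moe      = c(2049, 1965, 890)
)

income_checked <- income %>%
  mutate(
    moe_percentage = round(moe / estimate * 100, 2),
    # Same thresholds as the analysis, stored as a factor with ordered levels
    reliability = factor(
      case_when(
        moe_percentage < 5   ~ "High Confidence",
        moe_percentage <= 10 ~ "Moderate Confidence",
        moe_percentage > 10  ~ "Low Confidence"
      ),
      levels = c("High Confidence", "Moderate Confidence", "Low Confidence")
    )
  )

income_checked
```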

2.3 High Uncertainty Counties

Your Task: Identify the 5 counties with the highest MOE percentages.

Requirements:
- Sort by MOE percentage (highest first)
- Select the top 5 counties
- Display: county name, median income, margin of error, MOE percentage, reliability category
- Format as a professional table using kable()

Hint: Use arrange(), slice(), and select() functions.

# Create table of top 5 counties by MOE percentage
# Sort first, then slice: slice(1:5) keeps rows by position, so it must come after arrange()
high_moe <- ny_reliability %>%
  arrange(desc(moe_percentage)) %>%
  slice(1:5) %>%
  select(NAME, median_h_incomeE, median_h_incomeM, moe_percentage, reliability)

# Format as table with kable() - include appropriate column names and caption
kable(high_moe,
      col.names = c("County", "Median Household Income", "Margin of Error", "MOE Percent", "Reliability"),
      caption = "Top 5 NY Counties with Highest Median Household Income Uncertainty",
      format.args = list(big.mark = ",")
      )

Top 5 NY Counties with Highest Median Household Income Uncertainty
County	Median Household Income	Margin of Error	MOE Percent	Reliability
Allegany	58,725	1,965	3.35	High Confidence
Cattaraugus	56,889	1,778	3.13	High Confidence
Broome	58,317	1,761	3.02	High Confidence
Albany	78,829	2,049	2.60	High Confidence
Bronx	47,036	890	1.89	High Confidence
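Because `slice()` selects rows by position, it only returns the highest-MOE counties when the data has already been sorted. The first five counties alphabetically (Albany through Cattaraugus) all have MOE percentages under 5%, so a table listing those counties reflects row order rather than uncertainty. A minimal sketch of the sort-then-slice order, using MOE percentages copied from the county decision framework table in this document; the object names are only for illustration:

```r
library(dplyr)

# MOE percentages copied from the county decision framework table in this document
moe_by_county <- tibble(
  county = c("Albany", "Allegany", "Bronx", "Essex", "Greene",
             "Hamilton", "Schuyler", "Seneca", "Yates"),
  moe_percentage = c(2.60, 3.35, 1.89, 5.27, 6.18, 11.39, 9.49, 5.24, 5.84)
)

top5 <- moe_by_county %>%
  arrange(desc(moe_percentage)) %>%  # sort from highest to lowest MOE first
  slice(1:5)                         # then keep the first five rows

top5$county
```

Sorted this way, the five highest-MOE counties are Hamilton, Schuyler, Greene, Yates, and Essex. `slice_max(moe_percentage, n = 5)` is another way to get the same rows in one step.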

Data Quality Commentary:

Counties with a high MOE, meaning those marked "Low Confidence," may be poorly served by algorithms that assume their input data is precise. In this analysis only Hamilton County falls into that category, with an MOE of 11.39%. Such counties tend to be rural, with much smaller populations than urban or suburban counties, so their estimates rest on smaller samples and carry wider margins of error.
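A margin of error can also be read as the half-width of a confidence interval around the estimate. The sketch below assumes the margins are the ACS defaults at the 90% confidence level, which is what `get_acs()` returns unless its `moe_level` argument is changed; the values are copied from the county output in this document and the object names are illustrative.

```r
library(dplyr)

# Estimates and margins of error copied from the county output in this document
income <- tibble(
  county   = c("Albany", "Bronx"),
  estimate = c(78829, 47036),
  moe      = c(2049, 890)
)

income_ci <- income %>%
  mutate(
    lower     = estimate - moe,   # lower bound of the 90% interval
    upper     = estimate + moe,   # upper bound of the 90% interval
    std_error = moe / 1.645       # 1.645 is the z-value for 90% confidence
  )

income_ci
```

Under that assumption, Albany's median household income falls between $76,780 and $80,878, and the Bronx's between $46,146 and $47,926.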

Part 3: Neighborhood-Level Analysis

3.1 Focus Area Selection

Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.

Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.

# Use filter() to select 2-3 counties from your county_reliability data
# Store the selected counties in a variable called selected_counties
selected_counties <- ny_reliability %>%
  filter(NAME %in% c("Albany ", "Essex ", "Hamilton "))

# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category
selected_counties %>%
  select(
    County = NAME,
    Median_Income = median_h_incomeE,
    MOE_Percentage = moe_percentage,
    Reliability = reliability
  )

# A tibble: 3 × 4
  County      Median_Income MOE_Percentage Reliability        
  <chr>               <dbl>          <dbl> <chr>              
1 "Albany "           78829           2.6  High Confidence    
2 "Essex "            68090           5.27 Moderate Confidence
3 "Hamilton "         66891          11.4  Low Confidence

Comment on the Output:

It is interesting that Hamilton and Essex Counties border each other and are both known mainly for the Adirondacks, yet have such different population and MOE figures. The most interesting part, though, is seeing my work come to life! Feeling like Dr. Frankenstein, hehe.

3.2 Tract-Level Demographics

Your Task: Get demographic data for census tracts in your selected counties.

Requirements:
- Geography: tract level
- Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
- Use the same state and year as before
- Output format: wide
- Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.

# Define your race/ethnicity variables with descriptive names
raceVariables <- c(White = "B03002_003",
                   Black = "B03002_004",
                   Hispanic = "B03002_012",
                   Total = "B03002_001")

# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter
# County FIPS codes are the last three digits of each county GEOID:
# Albany = 001, Essex = 031, Hamilton = 041
countyCodes <- c("001","031","041")

acsData <- get_acs(
    geography = "tract",
    state = "NY",
    county = countyCodes,
    variables = raceVariables,
    survey = "acs5",
    year = 2022,
    output = "wide"
)

# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations
group_percents <- acsData %>%
  mutate(
    white_percentage = round((WhiteE / TotalE) * 100),
    black_percentage = round((BlackE / TotalE) * 100),
    hispanic_percentage = round((HispanicE / TotalE) * 100)
  )

# Add readable tract and county name columns using str_extract() or similar
group_percents <- group_percents %>%
  mutate(NAME= str_extract(NAME, "Albany|Essex|Hamilton")
  )
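The three-digit county codes passed to `get_acs()` do not have to be typed by hand. Each county GEOID is the two-digit state FIPS code ("36" for New York) followed by the three-digit county code, so the codes can be extracted from GEOID. In this sketch, Albany's GEOID comes from the county output above, while the Essex and Hamilton GEOIDs are assembled from the state prefix and the codes used in this analysis; the object names are illustrative.

```r
library(dplyr)
library(stringr)

# GEOID = state FIPS code ("36") + three-digit county code
selected <- tibble(
  GEOID = c("36001", "36031", "36041"),
  NAME  = c("Albany", "Essex", "Hamilton")
)

county_codes <- selected %>%
  mutate(county_code = str_sub(GEOID, 3, 5)) %>%  # characters 3 to 5 of GEOID
  pull(county_code)

county_codes
```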

3.3 Demographic Analysis

Your Task: Analyze the demographic patterns in your selected areas.

# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract
hispanic_pop <- group_percents %>%
  arrange(desc(hispanic_percentage)) %>%
  slice(1) %>%
  select(NAME, HispanicE, TotalE, hispanic_percentage)

# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group
averages <- group_percents %>%
  group_by(NAME) %>%
  summarize(
    tracts = n(),
    avg_white_percent = round(mean(white_percentage, na.rm = TRUE), 0),
    avg_black_percent = round(mean(black_percentage, na.rm = TRUE), 0),
    avg_hispanic_percent = round(mean(hispanic_percentage, na.rm = TRUE), 0)
  )

# Create a nicely formatted table of your results using kable()
kable(averages,
  col.names = c("County", "Number of Tracts", "Avg White %", "Avg Black %", "Avg Hispanic %"),
  caption = "Average Percent of White, Black, and Hispanic Populations in Selected Counties",
  format.args = list(big.mark = ",")
)

Average Percent of White, Black, and Hispanic Populations in Selected Counties
County	Number of Tracts	Avg White %	Avg Black %	Avg Hispanic %
Albany	85	69	13	7
Essex	18	93	2	2
Hamilton	4	92	1	2
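The averages above are unweighted means of tract-level percentages, so a small tract counts as much as a large one. That is not the same as the share of the county's whole population. A minimal sketch of the difference, using made-up tract counts that are purely hypothetical and not ACS values:

```r
library(dplyr)

# Hypothetical tract counts, for illustration only (not ACS values)
tracts <- tibble(
  county    = c("Example", "Example"),
  HispanicE = c(100, 1200),
  TotalE    = c(1000, 4000)
)

comparison <- tracts %>%
  group_by(county) %>%
  summarize(
    # Average of the two tract percentages (10% and 30%)
    mean_of_tract_pcts = mean(HispanicE / TotalE * 100),
    # Share of the combined population (1,300 of 5,000)
    pooled_pct = sum(HispanicE) / sum(TotalE) * 100
  )

comparison
```

In this example the mean of tract percentages is 20% while the pooled share is 26%. Which one is appropriate depends on whether the question is about a typical tract or about the county's population as a whole.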

Part 4: Comprehensive Data Quality Evaluation

4.1 MOE Analysis for Demographic Variables

Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.

Requirements:
- Calculate MOE percentages for each demographic variable
- Flag tracts where any demographic variable has MOE > 15%
- Create summary statistics

# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)
race_moe <- group_percents %>%
  mutate(
    white_moe = round((WhiteM/WhiteE)*100, 2),
    black_moe = round((BlackM/BlackE)*100, 2),
    hispanic_moe = round(HispanicM/HispanicE*100, 2),
  )

# Create a flag for tracts with high MOE on any demographic variable
# Use logical operators (| for OR) in an ifelse() statement
race_moe <- race_moe %>%
  mutate(
    high_moe = ifelse(
      white_moe > 15 | black_moe > 15 | hispanic_moe > 15,
      TRUE,
      FALSE
    )
  )
      
# Create a summary showing the count of tracts with and without a high-MOE flag
# Hint: use count() and mutate() to add percentages
race_moe_summary <- race_moe %>%
  count(high_moe) %>%
  mutate(percentage = round(100*n/sum(n),1))

# Create summary statistics showing how many tracts have data quality issues
race_moe_summary

# A tibble: 1 × 3
  high_moe     n percentage
  <lgl>    <int>      <dbl>
1 TRUE       107        100

4.2 Pattern Analysis

Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.

# Group tracts by whether they have high MOE issues
# Calculate average characteristics for each group:
# - population size, demographic percentages
pattern_analysis <- race_moe %>%
  group_by(high_moe) %>%
  summarize(
    n_tracts = n(),
    avg_white = round(mean(white_moe, na.rm = TRUE), 0),
    avg_black = round(mean(black_moe, na.rm = TRUE), 0),
    avg_hispanic = round(mean(hispanic_moe, na.rm = TRUE), 0),
    avg_pop = round(mean(TotalE, na.rm = TRUE), 0)
    )

# Use group_by() and summarize() to create this comparison

# Create a professional table showing the patterns
kable(pattern_analysis,
        col.names = c("High MOE", "Number of Tracts", "Avg White MOE", "Avg Black MOE", "Avg Hispanic MOE", "Avg Pop"),
        caption = "Comparison of Selected Tracts by MOE Reliability",
        format.args = list(big.mark = ",")
        )

Comparison of Selected Tracts by MOE Reliability
High MOE	Number of Tracts	Avg White MOE	Avg Black MOE	Avg Hispanic MOE	Avg Pop
TRUE	107	19	Inf	Inf	3,341

Pattern Analysis:

All 107 tracts were flagged as having a high MOE. Because tract-level samples are small, the margin of error is large relative to the estimate, which pushes the MOE percentage above the 15% threshold.

Many of the estimates for Black and Hispanic populations were zero, so dividing the margin of error by the estimate produced Inf in R. Estimates for these groups are therefore less reliable.
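In R, dividing a positive number by zero returns `Inf`, and a mean that includes `Inf` is itself `Inf`, which is consistent with the Inf averages in the table above. One possible way to handle this, sketched below with hypothetical tract values (not ACS data), is to treat the MOE percentage as undefined (`NA`) when the estimate is zero and average only the defined values:

```r
library(dplyr)

# Hypothetical tract values, for illustration only (not ACS values)
tracts <- tibble(
  BlackE = c(250, 0, 40),
  BlackM = c(90, 12, 35)
)

tracts_moe <- tracts %>%
  mutate(
    raw_moe  = round(BlackM / BlackE * 100, 2),          # 12 / 0 gives Inf
    safe_moe = if_else(BlackE == 0, NA_real_, raw_moe)   # treat as undefined
  )

mean(tracts_moe$raw_moe)                  # Inf: one zero estimate dominates
mean(tracts_moe$safe_moe, na.rm = TRUE)   # average over defined values only
```

Dropping undefined values changes what the average describes, so tracts with zero estimates may still deserve their own flag rather than being silently excluded.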

Part 5: Policy Recommendations

5.1 Analysis Integration and Professional Summary

Your Task: Write an executive summary that integrates findings from all four analyses.

Executive Summary Requirements:
1. Overall Pattern Identification: What are the systematic patterns across all your analyses?
2. Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings?
3. Root Cause Analysis: What underlying factors drive both data quality issues and bias risk?
4. Strategic Recommendations: What should the Department implement to address these systematic issues?

Executive Summary:

The systematic patterns across these analyses concern data reliability and demographics. At the county level, 56 of 62 median household income estimates were rated "High Confidence," five were rated "Moderate Confidence," and one (Hamilton) was rated "Low Confidence." At the tract level, reliability problems emerged for Black and Hispanic population estimates: all 107 tracts were flagged for an MOE above 15%, and many tracts had estimates of zero, which made the MOE percentage undefined (R reports it as "Inf").
The communities that face the greatest risk of algorithmic bias are rural areas and other small populations, along with racial and ethnic groups that make up a small share of the local population.
The underlying factors driving both data quality issues and bias risk include small population sizes, imprecise estimates for non-white populations, and the limits of census sampling in rural areas.
To count New York State’s populations more accurately, the Census Bureau should plan targeted engagement in historically undercounted areas and populations. The Department could also adjust the algorithm to account for reliability, for example by using weighted counts.

5.2 Specific Recommendations

Your Task: Create a decision framework for algorithm implementation.

# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category

# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"  
# - Low Confidence: "Requires manual review or additional data"

specific_recommendations <- ny_reliability %>%
  select(NAME, median_h_incomeE, moe_percentage, reliability) %>%
  mutate(
    algorithm_rec = case_when(
      reliability == "High Confidence" ~ "Safe for algorithmic decisions",
      reliability == "Moderate Confidence" ~ "Use with caution - monitor outcomes",
      reliability == "Low Confidence" ~ "Requires manual review or additional data"
    )
  )

# Format as a professional table with kable()
kable(specific_recommendations,
      col.names = c("County", "Median Household Income", "MOE", "Reliability Interpretation", "Algorithm Recommendation"),
      caption = "Decision Framework for Algorithm Implementation Across New York State Counties",
      format.args = list(big.mark = ",")
)

Decision Framework for Algorithm Implementation Across New York State Counties
County	Median Household Income	MOE	Reliability Interpretation	Algorithm Recommendation
Albany	78,829	2.60	High Confidence	Safe for algorithmic decisions
Allegany	58,725	3.35	High Confidence	Safe for algorithmic decisions
Bronx	47,036	1.89	High Confidence	Safe for algorithmic decisions
Broome	58,317	3.02	High Confidence	Safe for algorithmic decisions
Cattaraugus	56,889	3.13	High Confidence	Safe for algorithmic decisions
Cayuga	63,227	4.33	High Confidence	Safe for algorithmic decisions
Chautauqua	54,625	3.21	High Confidence	Safe for algorithmic decisions
Chemung	61,358	4.03	High Confidence	Safe for algorithmic decisions
Chenango	61,741	4.09	High Confidence	Safe for algorithmic decisions
Clinton	67,097	4.18	High Confidence	Safe for algorithmic decisions
Columbia	81,741	3.39	High Confidence	Safe for algorithmic decisions
Cortland	65,029	4.42	High Confidence	Safe for algorithmic decisions
Delaware	58,338	3.67	High Confidence	Safe for algorithmic decisions
Dutchess	94,578	2.66	High Confidence	Safe for algorithmic decisions
Erie	68,014	1.18	High Confidence	Safe for algorithmic decisions
Essex	68,090	5.27	Moderate Confidence	Use with caution - monitor outcomes
Franklin	60,270	4.81	High Confidence	Safe for algorithmic decisions
Fulton	60,557	4.37	High Confidence	Safe for algorithmic decisions
Genesee	68,178	4.57	High Confidence	Safe for algorithmic decisions
Greene	70,294	6.18	Moderate Confidence	Use with caution - monitor outcomes
Hamilton	66,891	11.39	Low Confidence	Requires manual review or additional data
Herkimer	68,104	4.79	High Confidence	Safe for algorithmic decisions
Jefferson	62,782	3.64	High Confidence	Safe for algorithmic decisions
Kings	74,692	1.27	High Confidence	Safe for algorithmic decisions
Lewis	64,401	4.16	High Confidence	Safe for algorithmic decisions
Livingston	70,443	3.99	High Confidence	Safe for algorithmic decisions
Madison	68,869	4.04	High Confidence	Safe for algorithmic decisions
Monroe	71,450	1.35	High Confidence	Safe for algorithmic decisions
Montgomery	58,033	3.63	High Confidence	Safe for algorithmic decisions
Nassau	137,709	1.39	High Confidence	Safe for algorithmic decisions
New York	99,880	1.78	High Confidence	Safe for algorithmic decisions
Niagara	65,882	2.67	High Confidence	Safe for algorithmic decisions
Oneida	66,402	3.27	High Confidence	Safe for algorithmic decisions
Onondaga	71,479	1.57	High Confidence	Safe for algorithmic decisions
Ontario	76,603	2.94	High Confidence	Safe for algorithmic decisions
Orange	91,806	1.94	High Confidence	Safe for algorithmic decisions
Orleans	61,069	4.89	High Confidence	Safe for algorithmic decisions
Oswego	65,054	3.26	High Confidence	Safe for algorithmic decisions
Otsego	65,778	4.51	High Confidence	Safe for algorithmic decisions
Putnam	120,970	4.03	High Confidence	Safe for algorithmic decisions
Queens	82,431	1.06	High Confidence	Safe for algorithmic decisions
Rensselaer	83,734	2.27	High Confidence	Safe for algorithmic decisions
Richmond	96,185	2.60	High Confidence	Safe for algorithmic decisions
Rockland	106,173	2.88	High Confidence	Safe for algorithmic decisions
St. Lawrence	58,339	3.47	High Confidence	Safe for algorithmic decisions
Saratoga	97,038	2.26	High Confidence	Safe for algorithmic decisions
Schenectady	75,056	3.03	High Confidence	Safe for algorithmic decisions
Schoharie	71,479	3.96	High Confidence	Safe for algorithmic decisions
Schuyler	61,316	9.49	Moderate Confidence	Use with caution - monitor outcomes
Seneca	64,050	5.24	Moderate Confidence	Use with caution - monitor outcomes
Steuben	62,506	2.87	High Confidence	Safe for algorithmic decisions
Suffolk	122,498	1.18	High Confidence	Safe for algorithmic decisions
Sullivan	67,841	4.35	High Confidence	Safe for algorithmic decisions
Tioga	70,427	3.99	High Confidence	Safe for algorithmic decisions
Tompkins	69,995	4.01	High Confidence	Safe for algorithmic decisions
Ulster	77,197	4.52	High Confidence	Safe for algorithmic decisions
Warren	74,531	4.74	High Confidence	Safe for algorithmic decisions
Washington	68,703	3.41	High Confidence	Safe for algorithmic decisions
Wayne	71,007	3.10	High Confidence	Safe for algorithmic decisions
Westchester	114,651	1.56	High Confidence	Safe for algorithmic decisions
Wyoming	65,066	3.38	High Confidence	Safe for algorithmic decisions
Yates	63,974	5.84	Moderate Confidence	Use with caution - monitor outcomes

Key Recommendations:

Your Task: Use your analysis results to provide specific guidance to the department.

Counties suitable for immediate algorithmic implementation: Albany, Allegany, Bronx, Broome, Cattaraugus, Cayuga, Chautauqua, Chemung, Chenango, Clinton, Columbia, Cortland, Delaware, Dutchess, Erie, Franklin, Fulton, Genesee, Herkimer, Jefferson, Kings, Lewis, Livingston, Madison, Monroe, Montgomery, Nassau, New York, Niagara, Oneida, Onondaga, Ontario, Orange, Orleans, Oswego, Otsego, Putnam, Queens, Rensselaer, Richmond, Rockland, St. Lawrence, Saratoga, Schenectady, Schoharie, Steuben, Suffolk, Sullivan, Tioga, Tompkins, Ulster, Warren, Washington, Wayne, Westchester, Wyoming

These counties are suitable as their MOE is less than 5%, an indication that the data is reliable.

Counties requiring additional oversight: Essex, Greene, Schuyler, Seneca, Yates

These counties need additional oversight because their MOE is between 5% and 10%. To boost confidence, the algorithm should weight income data by its reliability, and outcomes in these counties should be monitored. Alternatively, reliability would have to be improved at the source, through more intensive census sampling in these counties.

Counties needing alternative approaches: Hamilton

This county has an MOE greater than 10%, which indicates that the data is unreliable. To counteract this, estimates should be manually reviewed and validated before the algorithm processes them. Otherwise, the remaining options are to supplement the estimates with local data or to exclude the county's data until reliable estimates are available.

Questions for Further Investigation

How have MOEs changed over time? Have demographic estimates improved as the federal government has come to better recognize minority groups in the US?
Do other states also show clusters of high- or low-MOE tracts? If so, what do those tracts have in common?

Technical Notes

Data Sources:
- U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
- Retrieved via tidycensus R package on 9/28/25

Reproducibility:
- All analysis conducted in RStudio version 2025.05.1+513
- Census API key required for replication
- Complete code and documentation available at: [your portfolio URL]

Methodology Notes: N/A - no extreme decisions were made that would affect future reproducibility.

Limitations: Small ACS sample sizes in low-population counties and tracts, a tract-level scope limited to three counties, and the analyst's level of experience with R.

Submission Checklist

Before submitting your portfolio link on Canvas:

All code chunks run without errors
All “[Fill this in]” prompts have been completed
Tables are properly formatted and readable
Executive summary addresses all four required components
Portfolio navigation includes this assignment
Census API key is properly set
Document renders correctly to HTML

Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/your_file_name.html