Assignment 1: Census Data Quality for Policy Decisions

Evaluating Data Reliability for Algorithmic Decision-Making

Author

Jed Chew

Published

December 8, 2025

Assignment Overview

Scenario

You are a data analyst for the New York Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.

Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.

Learning Objectives

- Apply dplyr functions to real census data for policy analysis
- Evaluate data quality using margins of error
- Connect technical analysis to algorithmic decision-making
- Identify potential equity implications of data reliability issues
- Create professional documentation for policy stakeholders

Submission Instructions

Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/

Make sure to update your _quarto.yml navigation to include this assignment under an “Assignments” menu.

Part 1: Portfolio Integration

Create this assignment in your portfolio repository under an assignments/assignment_1/ folder structure. Update your navigation menu to include:

- text: Assignments
  menu:
    - href: assignments/assignment_1/your_file_name.qmd
      text: "Assignment 1: Census Data Exploration"

If the text contains a special character such as a colon or comma, wrap it in double quotation marks so that Quarto reads it as text.

Setup

# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
library(tidyverse)
library(tidycensus)
library(knitr)

# Set your Census API key
# (assumes the key is stored in the CENSUS_API_KEY environment variable rather than hard-coded)
census_api_key(Sys.getenv("CENSUS_API_KEY"))

# Choose your state for analysis - assign it to a variable called my_state
my_state <- "NY"

State Selection: I have chosen New York State for this analysis because I will be working in New York City post-graduation.

Part 2: County-Level Resource Assessment

2.1 Data Retrieval

Your Task: Use get_acs() to retrieve county-level data for your chosen state.

Requirements:
- Geography: county level
- Variables: median household income (B19013_001) and total population (B01003_001)
- Year: 2022
- Survey: acs5
- Output format: wide

Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.

# Write your get_acs() code here
ny_data <- get_acs(
  geography = "county",
  variables = c(
    total_pop = "B01003_001",
    median_income = "B19013_001"
  ),
  state = "NY",
  year = 2022,
  survey = "acs5",
  output = "wide" 
)

# Clean the county names to remove state name and "County" 
# Hint: use mutate() with str_remove()
ny_clean <- ny_data |> 
  mutate(
    # Remove state name from county names
    county_name = str_remove(NAME, ", New York"),
    # Remove "County" word
    county_name = str_remove(county_name, " County")
  )

# Display the first few rows
head(ny_clean)

2.2 Data Quality Assessment

Your Task: Calculate margin of error percentages and create reliability categories.

Requirements:

- Calculate MOE percentage: (margin of error / estimate) * 100
- Create reliability categories:
  - High Confidence: MOE < 5%
  - Moderate Confidence: MOE 5-10%
  - Low Confidence: MOE > 10%
- Create a flag for unreliable estimates (MOE > 10%)

Hint: Use mutate() with case_when() for the categories.

# Calculate MOE percentage and reliability categories using mutate()
ny_reliability <- ny_clean |> 
  mutate(
    moe_percent = round((median_incomeM / median_incomeE) * 100, 2),
    
    # Create reliability categories
    reliability = case_when(
      moe_percent < 5 ~ "High",
      moe_percent >= 5 & moe_percent <= 10 ~ "Moderate",
      moe_percent > 10 ~ "Low"
    ),

    # Flag unreliable estimates (MOE > 10%)
    unreliable = moe_percent > 10
  )
ny_reliability

# Create a summary showing count of counties in each reliability category
# Hint: use count() and mutate() to add percentages
ny_reliability |> 
  count(reliability) |> 
  mutate(percent = round(n / sum(n) * 100, 1))
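
To make the thresholds concrete, here is a minimal sketch on a hypothetical three-county tibble. The county names and values are invented for illustration and are not ACS figures. It applies the same MOE formula and case_when() categories as above, then computes each category's share of counties.

```r
library(dplyr)

# Hypothetical estimates and margins of error (illustrative only, not ACS data)
toy_counties <- tibble(
  county_name    = c("Alpha", "Beta", "Gamma"),
  median_incomeE = c(60000, 50000, 40000),
  median_incomeM = c(1200, 3500, 6000)
)

toy_reliability <- toy_counties |>
  mutate(
    # MOE as a percentage of the estimate: 2%, 7%, and 15% here
    moe_percent = round((median_incomeM / median_incomeE) * 100, 2),
    reliability = case_when(
      moe_percent < 5 ~ "High",
      moe_percent >= 5 & moe_percent <= 10 ~ "Moderate",
      moe_percent > 10 ~ "Low"
    )
  )

# Count of counties per category, plus each category's share of all counties
toy_summary <- toy_reliability |>
  count(reliability) |>
  mutate(percent = round(n / sum(n) * 100, 1))

toy_summary
```

With one county per category, each row of toy_summary should show a count of 1.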

2.3 High Uncertainty Counties

Your Task: Identify the 5 counties with the highest MOE percentages.

Requirements:
- Sort by MOE percentage (highest first)
- Select the top 5 counties
- Display: county name, median income, margin of error, MOE percentage, reliability category
- Format as a professional table using kable()

Hint: Use arrange(), slice(), and select() functions.

# Create table of top 5 counties by MOE percentage
high_uncertainty <- ny_reliability |> 
  arrange(desc(moe_percent)) |> 
  slice(1:5) |> 
  select(county_name, median_incomeE, median_incomeM, moe_percent, reliability)
high_uncertainty

# Format as table with kable() - include appropriate column names and caption
kable(high_uncertainty,
      col.names = c("County", "Median Income", "MOE", "MOE %", "Reliability"),
      caption = "NY Counties with Greatest Income Data Uncertainty",
      format.args = list(big.mark = ",")) #large nums have 000s separators

Data Quality Commentary: Algorithmic decision-making across New York State cannot be applied with one broad brush. While 56 of 62 NY counties are classified as having highly reliable income data, 5 counties have moderate reliability and 1 county has low reliability. As shown in the table below, one possible reason for the low or moderate reliability of these 6 counties is their relatively small populations compared with other counties. Hence, special attention must be paid to these 6 counties: algorithmic tools for identifying communities that should receive priority for social service funding and outreach programs should be complemented by qualitative on-the-ground outreach and interviews with local residents.

high_uncertainty_pop <- ny_reliability |> 
  arrange(desc(moe_percent)) |> 
  slice(1:5) |> 
  select(county_name, total_popE, median_incomeE, moe_percent, reliability)
high_uncertainty_pop

Part 3: Neighborhood-Level Analysis

3.1 Focus Area Selection

Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.

Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.

# Select 2-3 counties from the ny_reliability data
# (here: the county with the lowest MOE% in each reliability category)
# Store the selected counties in a variable called selected_counties
selected_counties <- ny_reliability |> 
  slice_min(moe_percent, n = 1, by = reliability)
selected_counties

# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category
select(selected_counties, county_name, median_incomeE, moe_percent, reliability)

Comment on the output: There is a significant difference between the margin of error for different counties. For example, Queens County has a margin of error percentage of only 1.06%, whereas Hamilton County has a margin of error percentage of 11.39%.
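
As a small illustration of how the selection step behaves, the sketch below runs slice_min() with the by argument on a hypothetical tibble (invented names and values, not the ACS results). It assumes dplyr 1.1.0 or later, where the slice functions accept by. Note that picking the minimum within each category means the "Low" county chosen is the least uncertain of the low-reliability counties.

```r
library(dplyr)

# Hypothetical counties (illustrative only)
toy <- tibble(
  county_name = c("A", "B", "C", "D", "E"),
  moe_percent = c(1.1, 3.4, 6.2, 8.9, 11.4),
  reliability = c("High", "High", "Moderate", "Moderate", "Low")
)

# One row per category: the county with the smallest MOE% in that category
picked <- toy |>
  slice_min(moe_percent, n = 1, by = reliability)

picked
```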

3.2 Tract-Level Demographics

Your Task: Get demographic data for census tracts in your selected counties.

Requirements:

- Geography: tract level
- Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
- Use the same state and year as before
- Output format: wide

Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.
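
One way to approach the challenge, sketched under the assumption that county GEOIDs are five-character strings made up of a two-digit state FIPS code followed by a three-digit county FIPS code. The tibble toy_geoids is a hand-typed stand-in for the county data, so the example runs without an API call.

```r
library(dplyr)
library(stringr)

# Hand-typed stand-in for the county-level data (GEOID = state FIPS + county FIPS)
toy_geoids <- tibble(
  county_name = c("Queens", "Seneca", "Hamilton"),
  GEOID       = c("36081", "36099", "36041")
)

# Characters 3-5 of the GEOID are the county code expected by get_acs(county = ...)
county_codes <- toy_geoids |>
  mutate(county_code = str_sub(GEOID, 3, 5)) |>
  pull(county_code)

county_codes
```

Keeping the codes as character strings preserves any leading zeros, which would be lost if they were converted to numbers.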

# Define your race/ethnicity variables with descriptive names
race_vars <- c(
    total_pop = "B03002_001",
    White     = "B03002_003",
    Black     = "B03002_004",
    Hispanic  = "B03002_012"
)

# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter
tract_data <- get_acs(
  geography = "tract",
  variables = race_vars,
  state = "NY",
  # County FIPS codes (last three digits of the GEOID) for Queens, Seneca, and Hamilton Counties
  county = c("081", "099", "041"),
  year = 2022,
  survey = "acs5",
  output = "wide"
)
tract_data

# Add readable tract and county name columns using str_extract() or similar
tract_clean <- tract_data |> 
  filter(total_popE > 0) |> # remove tracts with zero estimated population
  separate(
    NAME, 
    into = c("tract_name", "county_name", "state_name"), 
    sep = "; "
  ) |> 
  mutate(
    tract_name = str_remove(tract_name, "Census Tract "),
    county_name = str_remove(county_name, " County")
  )
tract_clean

# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations
tract_clean <- tract_clean |> 
  mutate(
    percent_white = round((WhiteE/total_popE) * 100, 2),
    percent_black = round((BlackE/total_popE) * 100, 2),
    percent_hispanic = round((HispanicE/total_popE) * 100, 2),
  )
tract_clean

3.3 Demographic Analysis

Your Task: Analyze the demographic patterns in your selected areas.

# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract
tract_highest_hispanic <- tract_clean |> 
  arrange(desc(percent_hispanic)) |> 
  slice(1) |> 
  select(tract_name, county_name, total_popE, percent_hispanic)
tract_highest_hispanic

# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group

demo_county <- tract_clean |>
  group_by(county_name) |> 
  summarize(
    num_tracts = n(),
    mean_white = round(mean(percent_white), 2),
    mean_black = round(mean(percent_black), 2),
    mean_hispanic = round(mean(percent_hispanic), 2)
  )

# Create a nicely formatted table of your results using kable()
kable(demo_county,
      col.names = c("County Name", "# Tracts", "% White", "% Black", "% Hispanic"),
      caption = "Average Demographics by County for Selected NY Counties",
      format.args = list(big.mark = ",")) #large nums have 000s separators
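
A caveat when reading this table: averaging tract percentages gives every tract equal weight regardless of population, so the result can differ from the county-wide share. The sketch below uses two hypothetical tracts (invented numbers) to show the gap between the mean of tract percentages and a pooled percentage computed from summed counts.

```r
library(dplyr)

# Two hypothetical tracts in one hypothetical county (illustrative only)
toy_tracts <- tibble(
  county_name = c("X", "X"),
  total_popE  = c(1000, 9000),
  HispanicE   = c(500, 900)
)

toy_compare <- toy_tracts |>
  group_by(county_name) |>
  summarize(
    # Unweighted: each tract counts equally, (50% + 10%) / 2
    mean_of_tract_pct = mean(HispanicE / total_popE * 100),
    # Pooled: total Hispanic count over total population, 1400 / 10000
    pooled_pct = sum(HispanicE) / sum(total_popE) * 100
  )

toy_compare
```

Which figure is appropriate depends on the question: the unweighted mean describes a typical tract, while the pooled value describes the county's residents as a whole.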

Part 4: Comprehensive Data Quality Evaluation

4.1 MOE Analysis for Demographic Variables

Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.

Requirements:

- Calculate MOE percentages for each demographic variable
- Flag tracts where any demographic variable has MOE > 15%
- Create summary statistics

# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)

tract_reliability <- tract_clean |> 
  mutate(
    moe_percent_white = round((WhiteM/WhiteE) * 100, 2), 
    moe_percent_black = round((BlackM/BlackE) * 100, 2),
    moe_percent_hispanic = round((HispanicM/HispanicE) * 100, 2)
  )

# Create a flag for tracts with high MOE on any demographic variable
# Use logical operators (| for OR) in an if_else() statement
tract_reliability <- tract_reliability |> 
  mutate(reliability = if_else(
      moe_percent_white > 15 | moe_percent_black > 15 | moe_percent_hispanic > 15,
      "Low",
      "High"
    )
  )
tract_reliability

# Create summary statistics showing how many tracts have data quality issues
count(tract_reliability, reliability)

4.2 Pattern Analysis

Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.

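# Group tracts by whether they have high MOE issues
# Calculate average characteristics for each group (population size, demographic percentages)
# Use group_by() and summarize() to create this comparison, then format with kable()
tract_reliability |> 
  group_by(reliability) |> 
  summarize(
    num_tracts = n(),
    mean_pop = round(mean(total_popE)),
    mean_white = round(mean(percent_white), 2),
    mean_black = round(mean(percent_black), 2),
    mean_hispanic = round(mean(percent_hispanic), 2)
  ) |> 
  kable(caption = "Average Tract Characteristics by MOE Flag")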

Pattern Analysis: As shown in the tibble in Section 4.1 above, all 704 census tracts in my selected NY counties were flagged for low reliability because at least one of their demographic variables has an MOE > 15%. I was initially very confused by this, because I had deliberately chosen one county from each reliability category earlier (as shown below). However, I realized that the MOE calculation in earlier sections was based on median income rather than population. After discussing this with Dr. Delmelle during office hours, we came up with some possible reasons for these high MOEs, including low population (especially in Hamilton County) and non-residential census tracts (such as LaGuardia Airport or Citi Field in Queens County).

select(selected_counties, county_name, median_incomeE, moe_percent, reliability)
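
One mechanical factor that may contribute to the tract-level flags, offered as something to check rather than a confirmed cause: when a group's estimate is very small or zero, the MOE percentage becomes very large or infinite. The sketch below uses hypothetical tract values (not ACS data) to show how the formula behaves in those cases.

```r
library(dplyr)

# Hypothetical tract rows (illustrative only, not ACS data)
toy_tracts <- tibble(
  tract_name = c("1", "2", "3"),
  BlackE     = c(400, 12, 0),
  BlackM     = c(40, 15, 13)
)

toy_moe <- toy_tracts |>
  mutate(
    # 40/400 = 10%, 15/12 = 125%, 13/0 = Inf
    moe_percent_black = round((BlackM / BlackE) * 100, 2),
    flagged = moe_percent_black > 15
  )

toy_moe
```

In this toy data, only the tract with a large estimate stays under the 15% threshold; the tract with a zero estimate is flagged because its MOE percentage is Inf.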

Part 5: Policy Recommendations

5.1 Analysis Integration and Professional Summary

Your Task: Write an executive summary that integrates findings from all four analyses.

Executive Summary Requirements:

1. Overall Pattern Identification: What are the systematic patterns across all your analyses?

2. Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings?

3. Root Cause Analysis: What underlying factors drive both data quality issues and bias risk?

4. Strategic Recommendations: What should the Department implement to address these systematic issues?

Executive Summary:

[Overall Pattern Identification] My analyses found that county-level indicators are generally high-reliability, but tract-level results show wide variance in MOE%, especially for demographic estimates such as race/ethnicity.

[Equity Assessment] Communities in tracts with high MOE% (threshold of ~15%) for key demographic and income indicators face the greatest risk of algorithmic bias, because even a small sampling error can flip a decision regarding eligibility or prioritization for social service and outreach programs. This risk is concentrated in low-population tracts and often in places where Black and Hispanic residents are small minorities at the tract level.

[Root Cause Analysis] The root cause of these high MOE percentages is likely ACS sampling variability. Each year, the ACS surveys about 1 in 40 housing units, and the 5-year estimates pool five consecutive years of these non-overlapping samples to approximate the long form of the decennial census. Where sample sizes are limited, this shows up as high MOE%. Pooling also means that major tract-level changes (such as redevelopment) or a redrawing of tract boundaries can be smoothed over in the 5-year estimates, adding substantial noise to the data.

[Strategic Recommendations] NY Department of Human Services should adopt a tiered reliability policy (e.g., High = proceed; Moderate = monitor with additional oversight; Low = human review or supplemental data) before applying thresholds. Pair this with routine fairness audits and stability checks, prioritize five-year ACS estimates where appropriate, and fund targeted data collection or administrative-data linkages in the selected tracts that repeatedly produce the highest MOE% to reduce structural uncertainty of algorithmic decisions over time.
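
The point in the equity assessment about sampling error flipping a decision can be sketched as a simple stability check. ACS margins of error are published at the 90% confidence level, so estimate ± MOE gives a 90% interval. The cutoff value, tract names, and figures below are all hypothetical, chosen only to illustrate the check.

```r
library(dplyr)

# Hypothetical eligibility cutoff and tract values (illustrative only)
income_cutoff <- 50000

toy_income <- tibble(
  tract_name     = c("A", "B", "C"),
  median_incomeE = c(42000, 49000, 58000),
  median_incomeM = c(3000, 6000, 4000)
)

toy_decisions <- toy_income |>
  mutate(
    lower = median_incomeE - median_incomeM,
    upper = median_incomeE + median_incomeM,
    # Decision based on the point estimate alone
    eligible_on_estimate = median_incomeE < income_cutoff,
    # The decision is unstable when the cutoff falls inside the 90% interval
    decision_could_flip = lower < income_cutoff & upper >= income_cutoff
  )

toy_decisions
```

Here tract B is eligible on its point estimate, but its interval (43,000 to 55,000) straddles the cutoff, so it is the kind of case a tiered policy would route to human review.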

5.2 Specific Recommendations

Your Task: Create a decision framework for algorithm implementation.

# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category

# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"  
# - Low Confidence: "Requires manual review or additional data"

# Format as a professional table with kable()
county_summary <- ny_reliability |> 
  select(county_name, median_incomeE, moe_percent, reliability) |> 
  arrange(moe_percent) |> 
  mutate(
    recommendation = case_when(
      reliability == "High" ~ "Safe for algorithmic decisions",
      reliability == "Moderate" ~ "Use with caution - monitor outcomes",  
      reliability == "Low" ~ "Requires manual review or additional data"
    )
  )

kable(county_summary,
      col.names = c("County", "Median Income", "MOE %", "Reliability", "Recommendation"),
      caption = toupper("Recommended Algorithm for NY Counties"),
      format.args = list(big.mark = ","))

Key Recommendations:

Your Task: Use your analysis results to provide specific guidance to the department.

Counties suitable for immediate algorithmic implementation: [List counties with high confidence data and explain why they’re appropriate]

The 56 NY counties listed below have MOE percentages below 5%, so we classify their income data as having “high reliability.” This data can thus be used to train, fit, and test a suitable algorithm for the Department of Human Services.
```
high_reliability <- county_summary |> 
  filter(reliability == "High")

kable(high_reliability,
      col.names = c("County", "Median Income", "MOE %", "Reliability", "Recommendation"),
      caption = toupper("NY Counties with High Reliability"),
      format.args = list(big.mark = ","))
```
Counties requiring additional oversight: [List counties with moderate confidence data and describe what kind of monitoring would be needed]

The 5 counties below have MOE percentages between 5% and 10%, so we classify their income data as having “moderate reliability.” Oversight measures may include periodic monitoring of outcome distributions, bias audits against protected groups, and sensitivity testing to ensure model decisions remain robust when confidence intervals shift.
```
moderate_reliability <- county_summary |> 
  filter(reliability == "Moderate")

kable(moderate_reliability,
      col.names = c("County", "Median Income", "MOE %", "Reliability", "Recommendation"),
      caption = toupper("NY Counties with Moderate Reliability"),
      format.args = list(big.mark = ","))
```
Counties needing alternative approaches: [List counties with low confidence data and suggest specific alternatives - manual review, additional surveys, etc.]

Only one NY county, Hamilton County, has an MOE percentage above 10%, so we classify its income data as having “low reliability.” Given this unreliability, the data is not suitable for direct algorithmic implementation. Instead, the department should rely on manual review, locally administered surveys, or integration of administrative datasets as proxies (e.g., tax records, school enrollment, social services data) to supplement ACS estimates. This ensures decisions affecting communities in Hamilton County are based on reliable, context-specific evidence rather than unreliable model inputs.
```
low_reliability <- county_summary |> 
  filter(reliability == "Low")

kable(low_reliability,
      col.names = c("County", "Median Income", "MOE %", "Reliability", "Recommendation"),
      caption = toupper("NY Counties with Low Reliability"),
      format.args = list(big.mark = ","))
```

Questions for Further Investigation

[List 2-3 questions that your analysis raised that you’d like to explore further in future assignments. Consider questions about spatial patterns, time trends, or other demographic factors.]

- How are MOEs for population data at the census tract level estimated and computed, given that MOEs for population data at the county level are not reported because various other proxies can help create a reliable population estimate?
- How have MOEs changed over time for ACS data, i.e., have there been any improvements in methods and reporting over the years? In addition, how should we think about MOEs for other important datasets, such as the annual homelessness point-in-time (PIT) counts?
- How can we map and visualize these tibbles effectively in R, without having to export our data tables to ArcGIS?

Technical Notes

Data Sources:
- U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
- Retrieved via tidycensus R package on September 23, 2025

Reproducibility:
- All analysis conducted in R version 4.5.1
- Census API key required for replication
- Complete code and documentation available at: https://musa-5080-fall-2025.github.io/portfolio-setup-jedchewjm/

Methodology Notes:
- I made no significant decisions that might affect reproducibility
- I selected Hamilton, Queens, and Seneca counties because each represents a different reliability level (1 high confidence, 1 moderate, 1 low confidence)

Limitations:
- Given the task of implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs, this analysis is too limited in temporal and geographic scope to provide a comprehensive recommendation
- For example, this analysis uses only 2018-2022 ACS estimates and cannot show how the demographic and income indicators in various communities have changed over time

Submission Checklist

Before submitting your portfolio link on Canvas:

- All code chunks run without errors
- All “[Fill this in]” prompts have been completed
- Tables are properly formatted and readable
- Executive summary addresses all four required components
- Portfolio navigation includes this assignment
- Census API key is properly set
- Document renders correctly to HTML

Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/your_file_name.html