Assignment 1: Census Data Quality for Policy Decisions

Evaluating Data Reliability for Algorithmic Decision-Making

Author

Sam Sen

Published

September 23, 2025

Assignment Overview

Scenario

You are a data analyst for the [Your State] Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.

Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.

Learning Objectives

Apply dplyr functions to real census data for policy analysis
Evaluate data quality using margins of error
Connect technical analysis to algorithmic decision-making
Identify potential equity implications of data reliability issues
Create professional documentation for policy stakeholders

Submission Instructions

Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/

Make sure to update your _quarto.yml navigation to include this assignment under an “Assignments” menu.

Part 1: Portfolio Integration

Create this assignment in your portfolio repository under an assignments/assignment_1/ folder structure. Update your navigation menu to include:

- text: Assignments
  menu:
    - href: assignments/assignment_1/your_file_name.qmd
      text: "Assignment 1: Census Data Exploration"

If the text contains a special character such as a colon or comma, wrap it in double quotes so that Quarto reads it as a single text value.

Setup

# Load required packages (hint: you need tidycensus, tidyverse, and knitr)

# Set your Census API key

# Choose your state for analysis - assign it to a variable called my_state

library(tidycensus)
library(tidyverse)
library(knitr)
library(stringr)

options(tigris_use_cache = TRUE)
my_state <- "New York"

State Selection: I have chosen New York for this analysis because I am originally from NYC and would like to take a deep dive into some of the census data for this area.
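The setup chunk above loads the packages and sets my_state but does not show the API key step. A minimal sketch of one way to handle it, assuming the key has already been saved in an environment variable named CENSUS_API_KEY (for example in ~/.Renviron); the guard and the message text are illustrative, not part of the assignment template:

```r
# Assumption: the key lives in the CENSUS_API_KEY environment variable,
# so it never has to be typed into the published .qmd file.
key <- Sys.getenv("CENSUS_API_KEY")

if (nzchar(key)) {
  # Registers the key with tidycensus for the current session
  tidycensus::census_api_key(key)
} else {
  message("No CENSUS_API_KEY found; request one at ",
          "https://api.census.gov/data/key_signup.html")
}
```

Keeping the key out of the document matters here because the rendered portfolio is public.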

Part 2: County-Level Resource Assessment

2.1 Data Retrieval

Your Task: Use get_acs() to retrieve county-level data for your chosen state.

Requirements:
- Geography: county level
- Variables: median household income (B19013_001) and total population (B01003_001)
- Year: 2022
- Survey: acs5
- Output format: wide

Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.

# Write your get_acs() code here

# Clean the county names to remove state name and "County" 
# Hint: use mutate() with str_remove()

# Display the first few rows

# ---- 2.1 County-level data (ACS 2022 5-year) ----
vars <- c(
  med_income = "B19013_001",  # Median household income
  total_pop  = "B01003_001"   # Total population
)

county_raw <- get_acs(
  geography = "county",
  variables = vars,
  year = 2022,
  survey = "acs5",
  state = my_state,
  output = "wide"
)

county <- county_raw %>%
  mutate(
    county = NAME %>%
      stringr::str_remove(",\\s*.*$") %>%  # drop ", State"
      stringr::str_remove("\\s*County$")   # drop trailing "County"
  ) %>%
  select(
    county, GEOID,
    med_incomeE, med_incomeM,
    total_popE,  total_popM
  )

knitr::kable(
  head(county, 10),
  caption = paste("County-level ACS (2022 5-year) for", my_state),
  digits = 0
)

County-level ACS (2022 5-year) for New York
county	GEOID	med_incomeE	med_incomeM	total_popE	total_popM
Albany	36001	78829	2049	315041	NA
Allegany	36003	58725	1965	47222	NA
Bronx	36005	47036	890	1443229	NA
Broome	36007	58317	1761	198365	NA
Cattaraugus	36009	56889	1778	77000	NA
Cayuga	36011	63227	2736	76171	NA
Chautauqua	36013	54625	1754	127440	NA
Chemung	36015	61358	2475	83584	NA
Chenango	36017	61741	2526	47096	NA
Clinton	36019	67097	2802	79839	NA

2.2 Data Quality Assessment

Your Task: Calculate margin of error percentages and create reliability categories.

Requirements:
- Calculate MOE percentage: (margin of error / estimate) * 100
- Create reliability categories:
  - High Confidence: MOE < 5%
  - Moderate Confidence: MOE 5-10%
  - Low Confidence: MOE > 10%
- Create a flag for unreliable estimates (MOE > 10%)

Hint: Use mutate() with case_when() for the categories.

# Calculate MOE percentage and reliability categories using mutate()

# Create a summary showing count of counties in each reliability category
# Hint: use count() and mutate() to add percentages

# ---- 2.2A: compute MOE% for median income ----
county_moe <- county %>%
  dplyr::mutate(
    income_moe_pct = dplyr::if_else(
      !is.na(med_incomeE) & med_incomeE > 0 & !is.na(med_incomeM),
      100 * med_incomeM / med_incomeE,
      NA_real_
    )
  )

# peek at just the columns we care about
knitr::kable(
  county_moe %>%
    dplyr::select(county, med_incomeE, med_incomeM, income_moe_pct) %>%
    head(10),
  caption = "MOE% for median income (first 10 counties)",
  digits = c(0, 0, 0, 1)
)

MOE% for median income (first 10 counties)
county	med_incomeE	med_incomeM	income_moe_pct
Albany	78829	2049	2.6
Allegany	58725	1965	3.3
Bronx	47036	890	1.9
Broome	58317	1761	3.0
Cattaraugus	56889	1778	3.1
Cayuga	63227	2736	4.3
Chautauqua	54625	1754	3.2
Chemung	61358	2475	4.0
Chenango	61741	2526	4.1
Clinton	67097	2802	4.2

# ---- 2.2B: add reliability category ----
county_reliability <- county_moe %>%
  dplyr::mutate(
    reliability = dplyr::case_when(
      is.na(income_moe_pct) ~ "Unavailable",
      income_moe_pct < 5    ~ "High Confidence",
      income_moe_pct <= 10  ~ "Moderate Confidence",
      income_moe_pct > 10   ~ "Low Confidence"
    )
  )

# quick preview
knitr::kable(
  county_reliability %>%
    dplyr::select(county, med_incomeE, med_incomeM, income_moe_pct, reliability) %>%
    head(10),
  caption = "Income MOE% + reliability category (first 10)",
  digits = c(0, 0, 0, 1, 0)
)

Income MOE% + reliability category (first 10)
county	med_incomeE	med_incomeM	income_moe_pct	reliability
Albany	78829	2049	2.6	High Confidence
Allegany	58725	1965	3.3	High Confidence
Bronx	47036	890	1.9	High Confidence
Broome	58317	1761	3.0	High Confidence
Cattaraugus	56889	1778	3.1	High Confidence
Cayuga	63227	2736	4.3	High Confidence
Chautauqua	54625	1754	3.2	High Confidence
Chemung	61358	2475	4.0	High Confidence
Chenango	61741	2526	4.1	High Confidence
Clinton	67097	2802	4.2	High Confidence

# ---- 2.2C: unreliable flag + summary table ----
county_reliability <- county_reliability %>%
  dplyr::mutate(unreliable = income_moe_pct > 10)

rel_summary <- county_reliability %>%
  dplyr::count(reliability) %>%
  dplyr::mutate(pct = round(100 * n / sum(n), 1)) %>%
  dplyr::arrange(match(reliability, c("High Confidence","Moderate Confidence","Low Confidence","Unavailable")))

knitr::kable(rel_summary, caption = "County reliability categories", digits = 1)

County reliability categories
reliability	n	pct
High Confidence	56	90.3
Moderate Confidence	5	8.1
Low Confidence	1	1.6

2.3 High Uncertainty Counties

Your Task: Identify the 5 counties with the highest MOE percentages.

Requirements:
- Sort by MOE percentage (highest first)
- Select the top 5 counties
- Display: county name, median income, margin of error, MOE percentage, reliability category
- Format as a professional table using kable()

Hint: Use arrange(), slice(), and select() functions.

# Create table of top 5 counties by MOE percentage

# Format as table with kable() - include appropriate column names and caption

# ---- 2.3: Top 5 counties by income MOE% ----
top5_uncertainty <- county_reliability %>%
  dplyr::filter(!is.na(income_moe_pct)) %>%                 # ignore rows where MOE% couldn't be computed
  dplyr::arrange(dplyr::desc(income_moe_pct)) %>%           # highest MOE% first
  dplyr::slice(1:5) %>%                                     # top 5
  dplyr::transmute(                                         # select + rename for presentation
    county,
    `Median income ($)` = med_incomeE,
    `Income MOE ($)`    = med_incomeM,
    `Income MOE %`      = sprintf("%.1f%%", income_moe_pct),
    Reliability         = reliability
  )

knitr::kable(
  top5_uncertainty,
  caption = "Top 5 counties by median income MOE% (ACS 2022 5-year)"
)

Top 5 counties by median income MOE% (ACS 2022 5-year)
county	Median income ($)	Income MOE ($)	Income MOE %	Reliability
Hamilton	66891	7622	11.4%	Low Confidence
Schuyler	61316	5818	9.5%	Moderate Confidence
Greene	70294	4341	6.2%	Moderate Confidence
Yates	63974	3733	5.8%	Moderate Confidence
Essex	68090	3590	5.3%	Moderate Confidence

Data Quality Commentary:

[Counties with higher MOE percentages are likely to benefit less from algorithmic decision-making, because their margins of error are too large to give us high confidence in the median income figures. Algorithms that rely on this data may therefore cause harm. For example, if the state decides that counties with a median income below $60K will receive social services, Hamilton’s estimate of $66,891 looks well above the threshold, but its margin of error of $7,622 puts the lower bound at about $59,269, so the county could in fact qualify.]
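The Hamilton example can be checked with quick arithmetic. The sketch below uses the Hamilton estimate and margin of error from the table above and the $60K cutoff from the commentary, which is a hypothetical policy threshold rather than a real rule. ACS margins of error are published at the 90% confidence level, so the bounds are a 90% interval.

```r
library(dplyr)

threshold <- 60000  # hypothetical eligibility cutoff from the commentary

hamilton <- tibble::tibble(
  county      = "Hamilton",
  med_incomeE = 66891,   # estimate, from the top-5 table
  med_incomeM = 7622     # margin of error, from the top-5 table
) %>%
  mutate(
    lower            = med_incomeE - med_incomeM,
    upper            = med_incomeE + med_incomeM,
    estimate_below   = med_incomeE < threshold,
    interval_crosses = lower < threshold & upper >= threshold
  )

hamilton
```

The point estimate alone would exclude the county, while interval_crosses shows that the data cannot rule out eligibility.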

Part 3: Neighborhood-Level Analysis

3.1 Focus Area Selection

Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.

Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.

# Use filter() to select 2-3 counties from your county_reliability data
# Store the selected counties in a variable called selected_counties

# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category
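One possible way to fill in this chunk is sketched below. So that the block runs on its own, county_reliability is replaced by a small stand-in built from rows shown in the tables above; the choice of Albany, Schuyler, and Hamilton (one county per reliability level) is an assumption, not a required selection.

```r
library(dplyr)

# Stand-in for county_reliability: values copied from the tables above
county_reliability <- tibble::tribble(
  ~county,    ~med_incomeE, ~med_incomeM, ~income_moe_pct, ~reliability,
  "Albany",   78829,        2049,         2.6,             "High Confidence",
  "Bronx",    47036,        890,          1.9,             "High Confidence",
  "Schuyler", 61316,        5818,         9.5,             "Moderate Confidence",
  "Hamilton", 66891,        7622,         11.4,            "Low Confidence"
)

# Keep one county per reliability level
selected_counties <- county_reliability %>%
  filter(county %in% c("Albany", "Schuyler", "Hamilton")) %>%
  select(county, med_incomeE, income_moe_pct, reliability)

knitr::kable(
  selected_counties,
  col.names = c("County", "Median income ($)", "Income MOE %", "Reliability"),
  caption   = "Counties selected for tract-level study",
  digits    = c(0, 0, 1, 0)
)
```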

Comment on the output: [write something :)]

3.2 Tract-Level Demographics

Your Task: Get demographic data for census tracts in your selected counties.

Requirements:
- Geography: tract level
- Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
- Use the same state and year as before
- Output format: wide
- Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.

# Define your race/ethnicity variables with descriptive names

# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter

# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations

# Add readable tract and county name columns using str_extract() or similar
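A hedged sketch of these steps follows. The county code is characters 3-5 of a county GEOID (the first two digits are the state). The two GEOIDs come from the county table above. The tract rows are hypothetical values in the shape get_acs() returns with output = "wide". The helper fetch_tracts() is an illustrative name that is defined but not called, since it needs an API key and network access. The name parsing assumes NAME values separated by semicolons or commas.

```r
library(dplyr)
library(stringr)

# County codes from county GEOIDs (Albany and Bronx, from the table above)
county_geoids <- c("36001", "36005")
county_codes  <- str_sub(county_geoids, 3, 5)

race_vars <- c(
  total    = "B03002_001",
  white    = "B03002_003",
  black    = "B03002_004",
  hispanic = "B03002_012"
)

# Hypothetical helper, not run here: requires a Census API key
fetch_tracts <- function(codes) {
  tidycensus::get_acs(
    geography = "tract", variables = race_vars,
    state = "New York", county = codes,
    year = 2022, survey = "acs5", output = "wide"
  )
}

# Hypothetical rows standing in for the API result
tracts <- tibble::tribble(
  ~GEOID,        ~NAME,                                     ~totalE, ~whiteE, ~blackE, ~hispanicE,
  "36001000100", "Census Tract 1; Albany County; New York", 2000,    1000,    600,     300,
  "36005000200", "Census Tract 2; Bronx County; New York",  4000,    400,     1200,    2200
) %>%
  mutate(
    pct_white    = 100 * whiteE / totalE,
    pct_black    = 100 * blackE / totalE,
    pct_hispanic = 100 * hispanicE / totalE,
    tract  = str_extract(NAME, "^Census Tract [0-9.]+"),
    county = NAME %>%
      str_extract("[^;,]+ County") %>%
      str_remove(" County$") %>%
      str_trim()
  )

tracts %>% select(tract, county, starts_with("pct_"))
```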

3.3 Demographic Analysis

Your Task: Analyze the demographic patterns in your selected areas.

# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract

# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group

# Create a nicely formatted table of your results using kable()
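The three steps above could look roughly like this. The tracts data frame is a hypothetical stand-in with made-up percentages, used only so the sketch runs by itself; with real data it would be the tract table built in 3.2.

```r
library(dplyr)

# Hypothetical tract-level percentages (not real census values)
tracts <- tibble::tribble(
  ~tract,           ~county,  ~pct_white, ~pct_black, ~pct_hispanic,
  "Census Tract 1", "Albany", 50,         30,         15,
  "Census Tract 2", "Albany", 70,         10,         10,
  "Census Tract 3", "Bronx",  10,         30,         55,
  "Census Tract 4", "Bronx",  20,         35,         40
)

# Tract with the highest Hispanic/Latino share
top_hispanic <- tracts %>%
  arrange(desc(pct_hispanic)) %>%
  slice(1)

# Average demographics by county
county_summary <- tracts %>%
  group_by(county) %>%
  summarize(
    n_tracts         = n(),
    avg_pct_white    = mean(pct_white),
    avg_pct_black    = mean(pct_black),
    avg_pct_hispanic = mean(pct_hispanic),
    .groups = "drop"
  )

knitr::kable(
  county_summary,
  col.names = c("County", "Tracts", "Avg % white", "Avg % Black", "Avg % Hispanic"),
  caption   = "Average tract demographics by county (illustrative data)",
  digits    = 1
)
```

These are unweighted means of tract percentages, as the task asks. A population-weighted county share would divide summed group counts by summed totals instead.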

Part 4: Comprehensive Data Quality Evaluation

4.1 MOE Analysis for Demographic Variables

Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.

Requirements:
- Calculate MOE percentages for each demographic variable
- Flag tracts where any demographic variable has MOE > 15%
- Create summary statistics

# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)

# Create a flag for tracts with high MOE on any demographic variable
# Use logical operators (| for OR) in an ifelse() statement

# Create summary statistics showing how many tracts have data quality issues
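A sketch of the flagging logic, under two stated assumptions: the tract rows are hypothetical, and a tract whose MOE percentage cannot be computed (estimate of zero) is treated as flagged. That second choice is a judgment call for illustration; the assignment does not specify it. The helper moe_pct() is an illustrative name.

```r
library(dplyr)

# Hypothetical estimates (E) and margins of error (M) for three tracts
tracts <- tibble::tribble(
  ~tract, ~whiteE, ~whiteM, ~blackE, ~blackM, ~hispanicE, ~hispanicM,
  "A",    1000,    100,     600,     150,     300,        120,
  "B",    2000,    200,     0,       12,      500,        60,
  "C",    3000,    150,     1000,    100,     2000,       200
)

# Same formula as in Part 2; undefined when the estimate is zero
moe_pct <- function(moe, est) if_else(est > 0, 100 * moe / est, NA_real_)

tract_moe <- tracts %>%
  mutate(
    white_moe_pct    = moe_pct(whiteM, whiteE),
    black_moe_pct    = moe_pct(blackM, blackE),
    hispanic_moe_pct = moe_pct(hispanicM, hispanicE),
    any_over = white_moe_pct > 15 | black_moe_pct > 15 | hispanic_moe_pct > 15,
    # Assumption: an uncomputable MOE% counts as a data quality issue
    high_moe = ifelse(is.na(any_over) | any_over, "High MOE", "Acceptable")
  )

moe_summary <- tract_moe %>%
  count(high_moe) %>%
  mutate(pct = round(100 * n / sum(n), 1))

moe_summary
```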

4.2 Pattern Analysis

Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.

# Group tracts by whether they have high MOE issues
# Calculate average characteristics for each group:
# - population size, demographic percentages

# Use group_by() and summarize() to create this comparison
# Create a professional table showing the patterns
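The comparison described above can be sketched as follows. All values are hypothetical and chosen only to show the mechanics, so the resulting table carries no finding about real communities; with real data the input would be the flagged tract table from 4.1.

```r
library(dplyr)

# Hypothetical flagged tracts (not real census values)
tract_moe <- tibble::tribble(
  ~tract, ~totalE, ~pct_white, ~pct_black, ~pct_hispanic, ~high_moe,
  "A",    1200,    80,         5,          10,            TRUE,
  "B",    1800,    60,         20,         15,            TRUE,
  "C",    5200,    40,         30,         25,            FALSE,
  "D",    4800,    50,         20,         25,            FALSE
)

pattern_summary <- tract_moe %>%
  group_by(high_moe) %>%
  summarize(
    n_tracts         = n(),
    avg_population   = mean(totalE),
    avg_pct_white    = mean(pct_white),
    avg_pct_black    = mean(pct_black),
    avg_pct_hispanic = mean(pct_hispanic),
    .groups = "drop"
  )

knitr::kable(
  pattern_summary,
  col.names = c("High MOE flag", "Tracts", "Avg population",
                "Avg % white", "Avg % Black", "Avg % Hispanic"),
  caption   = "Tract characteristics by data quality flag (illustrative data)",
  digits    = 1
)
```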

Pattern Analysis: [Describe any patterns you observe. Do certain types of communities have less reliable data? What might explain this?]

Part 5: Policy Recommendations

5.1 Analysis Integration and Professional Summary

Your Task: Write an executive summary that integrates findings from all four analyses.

Executive Summary Requirements:
1. Overall Pattern Identification: What are the systematic patterns across all your analyses?
2. Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings?
3. Root Cause Analysis: What underlying factors drive both data quality issues and bias risk?
4. Strategic Recommendations: What should the Department implement to address these systematic issues?

Executive Summary:

[Your integrated 4-paragraph summary here]

5.2 Specific Recommendations

Your Task: Create a decision framework for algorithm implementation.

# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category

# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"  
# - Low Confidence: "Requires manual review or additional data"

# Format as a professional table with kable()
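A minimal sketch of the recommendation column. The three rows are copied from the tables above as a stand-in for the full county_reliability data, and the recommendation wording is taken from the chunk comments. Categories not listed there (such as "Unavailable") are left as NA because the template does not say how to treat them.

```r
library(dplyr)

# Stand-in for county_reliability: values copied from the tables above
county_reliability <- tibble::tribble(
  ~county,    ~med_incomeE, ~income_moe_pct, ~reliability,
  "Albany",   78829,        2.6,             "High Confidence",
  "Schuyler", 61316,        9.5,             "Moderate Confidence",
  "Hamilton", 66891,        11.4,            "Low Confidence"
)

recommendations <- county_reliability %>%
  mutate(
    recommendation = case_when(
      reliability == "High Confidence"     ~ "Safe for algorithmic decisions",
      reliability == "Moderate Confidence" ~ "Use with caution - monitor outcomes",
      reliability == "Low Confidence"      ~ "Requires manual review or additional data"
      # any other category falls through to NA
    )
  )

knitr::kable(
  recommendations,
  col.names = c("County", "Median income ($)", "Income MOE %",
                "Reliability", "Recommendation"),
  caption   = "Algorithm recommendations by county reliability",
  digits    = c(0, 0, 1, 0, 0)
)
```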

Key Recommendations:

Your Task: Use your analysis results to provide specific guidance to the department.

Counties suitable for immediate algorithmic implementation: [List counties with high confidence data and explain why they’re appropriate]
Counties requiring additional oversight: [List counties with moderate confidence data and describe what kind of monitoring would be needed]
Counties needing alternative approaches: [List counties with low confidence data and suggest specific alternatives - manual review, additional surveys, etc.]

Questions for Further Investigation

[List 2-3 questions that your analysis raised that you’d like to explore further in future assignments. Consider questions about spatial patterns, time trends, or other demographic factors.]

Technical Notes

Data Sources:
- U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
- Retrieved via tidycensus R package on [date]

Reproducibility:
- All analysis conducted in R version [your version]
- Census API key required for replication
- Complete code and documentation available at: [your portfolio URL]

Methodology Notes: [Describe any decisions you made about data processing, county selection, or analytical choices that might affect reproducibility]

Limitations: [Note any limitations in your analysis - sample size issues, geographic scope, temporal factors, etc.]

Submission Checklist

Before submitting your portfolio link on Canvas:

All code chunks run without errors
All “[Fill this in]” prompts have been completed
Tables are properly formatted and readable
Executive summary addresses all four required components
Portfolio navigation includes this assignment
Census API key is properly set
Document renders correctly to HTML

Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/your_file_name.html