Assignment 1: Census Data Quality for Policy Decisions
Evaluating Data Reliability for Algorithmic Decision-Making
Author
Luciano
Published
December 8, 2025
Setup
# Load required packages
library(tidycensus)
library(tidyverse)
library(knitr)

# Set your Census API key
census_api_key("b236a5b2547ce79c3e203c3e1366ed7fa7b3d463", install = FALSE)
Sys.getenv("CENSUS_API_KEY")
[1] "b236a5b2547ce79c3e203c3e1366ed7fa7b3d463"
# Choose your state for analysis
my_state <- "Missouri"
I have chosen Missouri for this analysis because it combines major metropolitan areas (St. Louis and Kansas City) with a large number of rural counties. This mix helps illustrate how data quality may vary between urban and rural areas, which is important for equitable policy decisions.
Part 2: County-Level Resource Assessment
# Retrieve county-level ACS data for Missouri
county_data <- get_acs(
  geography = "county",
  state = my_state,
  variables = c(
    median_income = "B19013_001",
    total_pop = "B01003_001"
  ),
  year = 2022,
  survey = "acs5",
  output = "wide"
)

# Clean county names: remove ", Missouri" and " County"
county_data <- county_data %>%
  mutate(
    NAME = str_remove(NAME, ", Missouri"),
    NAME = str_remove(NAME, " County")
  )

# Display the first few rows
head(county_data)
# A tibble: 6 × 6
GEOID NAME median_incomeE median_incomeM total_popE total_popM
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 29001 Adair 51020 4430 25299 NA
2 29003 Andrew 68774 4776 18069 NA
3 29005 Atchison 58521 3686 5270 NA
4 29007 Audrain 51745 2309 24873 NA
5 29009 Barry 55592 5385 34701 NA
6 29011 Barton 48105 5576 11683 NA
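The later chunks use an `income_reliability` table with `income_moe_pct` and `reliability` columns, but the step that builds it does not appear in this write-up. The following is a minimal sketch of how such a table could be derived, assuming the 5% and 10% cutoffs described in the Methodology Notes. How values exactly at 5% or 10% are classified is an assumption, and the six rows are copied from the `head(county_data)` output rather than pulled from the API.

```r
library(dplyr)
library(tibble)

# County rows copied from the head(county_data) output above
county_data <- tribble(
  ~NAME,      ~median_incomeE, ~median_incomeM,
  "Adair",    51020,           4430,
  "Andrew",   68774,           4776,
  "Atchison", 58521,           3686,
  "Audrain",  51745,           2309,
  "Barry",    55592,           5385,
  "Barton",   48105,           5576
)

income_reliability <- county_data %>%
  mutate(
    # MOE expressed as a percentage of the estimate
    income_moe_pct = 100 * median_incomeM / median_incomeE,
    # Assumed cutoffs: under 5% high, 5-10% moderate, over 10% low
    reliability = case_when(
      is.na(income_moe_pct) ~ NA_character_,
      income_moe_pct < 5    ~ "High Confidence",
      income_moe_pct <= 10  ~ "Moderate Confidence",
      TRUE                  ~ "Low Confidence"
    )
  )

income_reliability
```

With these cutoffs the six counties receive the same categories that appear in the decision framework table later in the report.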
# Create table of top 5 counties by MOE percentage
library(knitr)

top5_uncertain <- income_reliability %>%
  arrange(desc(income_moe_pct)) %>%
  slice(1:5) %>%
  select(
    County = NAME,
    total_popE = total_popE,
    `Median Income (Estimate)` = median_incomeE,
    `Margin of Error` = median_incomeM,
    `MOE %` = income_moe_pct,
    `Reliability` = reliability
  )

# Format as table with kable() - include appropriate column names and caption
kable(
  top5_uncertain,
  caption = "Top 5 Counties with Highest Median Income MOE Percentages in Missouri"
)
Top 5 Counties with Highest Median Income MOE Percentages in Missouri
| County | total_popE | Median Income (Estimate) | Margin of Error | MOE % | Reliability |
|---|---|---|---|---|---|
| Shannon | 7132 | 46767 | 9920 | 21.21154 | Low Confidence |
| Carter | 5299 | 45737 | 8517 | 18.62168 | Low Confidence |
| Mississippi | 12305 | 40833 | 7546 | 18.48015 | Low Confidence |
| Ozark | 8688 | 39125 | 7092 | 18.12652 | Low Confidence |
| Mercer | 3517 | 55592 | 10045 | 18.06915 | Low Confidence |
Data Quality Commentary:
The top five counties all have small populations, ranging from about 3,500 to 12,300 residents. For example, Mercer has only 3,517 residents, and its MOE reaches 18%. This suggests that smaller populations lead to higher uncertainty in estimates, likely because of smaller sample sizes in the ACS survey. Carter, Shannon, and Ozark are located in Missouri’s Ozark region, an area characterized by both limited resources and highly dispersed populations. These factors can make data collection harder and the resulting estimates less reliable.
Part 3: Neighborhood-Level Analysis
# Use filter() to select 2-3 counties from your county_reliability data
# Store the selected counties in a variable called selected_counties
selected_counties <- income_reliability %>%
  filter(
    NAME %in% c(
      "St. Louis", # High Confidence
      "Buchanan",  # Moderate Confidence
      "Texas"      # Low Confidence
    )
  ) %>%
  select(
    County = NAME,
    `Median Income (Estimate)` = median_incomeE,
    `MOE %` = income_moe_pct,
    `Reliability` = reliability
  )

# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category
selected_counties
# A tibble: 3 × 4
County `Median Income (Estimate)` `MOE %` Reliability
<chr> <dbl> <dbl> <chr>
1 Buchanan 58303 5.07 Moderate Confidence
2 St. Louis 78067 1.64 High Confidence
3 Texas 42870 11.4 Low Confidence
I selected St. Louis, Buchanan, and Texas counties to represent high, moderate, and low data reliability contexts—urban, mid-sized, and rural areas, respectively.
3.2 Tract-Level Demographics
# Define your race/ethnicity variables with descriptive names
race_vars <- c(
  total_pop = "B03002_001",
  white = "B03002_003",
  black = "B03002_004",
  hispanic = "B03002_012"
)

# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter
county_codes <- income_reliability %>%
  filter(NAME %in% selected_counties$County) %>%
  transmute(county_code = str_sub(GEOID, 3, 5)) %>%
  distinct() %>%
  pull(county_code)

tract_demo_raw <- get_acs(
  geography = "tract",
  state = my_state,
  county = county_codes,
  variables = race_vars,
  year = 2022,
  survey = "acs5",
  output = "wide"
)

# Create percentages for white, Black, and Hispanic populations
tract_demo <- tract_demo_raw %>%
  mutate(
    white_pct = if_else(total_popE > 0, 100 * whiteE / total_popE, NA_real_),
    black_pct = if_else(total_popE > 0, 100 * blackE / total_popE, NA_real_),
    hispanic_pct = if_else(total_popE > 0, 100 * hispanicE / total_popE, NA_real_),
    total_population = total_popE,
    tract_name = str_extract(NAME, "Census Tract[^,]+"),
    county_name = str_extract(NAME, "Census Tract[^,]+")
  ) %>%
  select(
    GEOID, tract_name, county_name, total_population,
    whiteE, blackE, hispanicE,
    white_pct, black_pct, hispanic_pct
  )

tract_demo <- tract_demo %>%
  mutate(
    county_name = county_name %>%
      str_replace_all(",", ";") %>%
      { str_split_fixed(., ";", 3)[, 2] } %>%
      str_trim()
  )

# Add readable tract and county name columns using str_extract() or similar
kable(
  head(tract_demo, 10),
  caption = "Selected Counties: Tract-Level Race/Ethnicity (ACS 2018–2022)"
)
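Tract names in this pull are separated by semicolons (for example "Census Tract 2147; St. Louis County; Missouri"), so a pattern that stops only at commas keeps the whole string in `tract_name`. The sketch below shows one possible way to split the name into separate tract and county columns in a single step. It accepts either separator; the second row is a hypothetical comma-separated name added only to illustrate that case.

```r
library(dplyr)
library(stringr)
library(tibble)

tracts <- tibble(
  GEOID = c("29189214700", "00000000000"),
  NAME = c(
    "Census Tract 2147; St. Louis County; Missouri", # format seen in this report
    "Census Tract 1, Example County, Missouri"       # hypothetical comma format
  )
)

# Split on the first two separators (";" or ","), trimming surrounding spaces
name_parts <- str_split_fixed(tracts$NAME, "\\s*[;,]\\s*", 3)

tracts_named <- tracts %>%
  mutate(
    tract_name = name_parts[, 1],
    county_name = name_parts[, 2]
  )

tracts_named
```

This keeps `tract_name` short (just "Census Tract 2147") while producing the same `county_name` values as the two-step approach.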
# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract
top_hispanic_tract <- tract_demo %>%
  arrange(desc(hispanic_pct)) %>%
  slice(1) %>%
  select(
    GEOID, tract_name, county_name, total_population,
    white_pct, black_pct, hispanic_pct
  )

kable(
  top_hispanic_tract,
  caption = "Tract with Highest Hispanic/Latino Percentage (Selected Counties)"
)
Tract with Highest Hispanic/Latino Percentage (Selected Counties)
| GEOID | tract_name | county_name | total_population | white_pct | black_pct | hispanic_pct |
|---|---|---|---|---|---|---|
| 29189214700 | Census Tract 2147; St. Louis County; Missouri | St. Louis County | 8305 | 43.66045 | 18.81999 | 32.51054 |
# Calculate average demographics by county using group_by() and summarize()
county_summary_unweighted <- tract_demo %>%
  group_by(county_name) %>%
  summarise(
    n_tracts = n(),
    avg_white_pct = mean(white_pct, na.rm = TRUE),
    avg_black_pct = mean(black_pct, na.rm = TRUE),
    avg_hispanic_pct = mean(hispanic_pct, na.rm = TRUE)
  ) %>%
  arrange(desc(avg_hispanic_pct))

# Show: number of tracts, average percentage for each racial/ethnic group
# Create a nicely formatted table of your results using kable()
kable(
  county_summary_unweighted,
  caption = "Average Demographics by County"
)
Average Demographics by County
| county_name | n_tracts | avg_white_pct | avg_black_pct | avg_hispanic_pct |
|---|---|---|---|---|
| Buchanan County | 26 | 80.95432 | 5.691374 | 7.305424 |
| St. Louis County | 236 | 61.42962 | 26.113783 | 3.002732 |
| Texas County | 8 | 90.21293 | 1.883072 | 2.467462 |
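The averages above are unweighted means of tract percentages, so a tract with 1,000 residents counts as much as one with 9,000. The sketch below, using hypothetical tract values rather than ACS data, shows how an unweighted mean can differ from a population-weighted mean, and that the weighted mean matches the share computed from summed counts.

```r
library(dplyr)
library(tibble)

# Hypothetical tracts (illustrative values only, not ACS estimates)
toy_tracts <- tribble(
  ~county_name, ~total_population, ~hispanicE,
  "County A",   1000,              200,
  "County A",   9000,              180,
  "County B",   4000,              120,
  "County B",   4000,              200
) %>%
  mutate(hispanic_pct = 100 * hispanicE / total_population)

county_compare <- toy_tracts %>%
  group_by(county_name) %>%
  summarise(
    # Every tract counts equally
    avg_unweighted = mean(hispanic_pct, na.rm = TRUE),
    # Tracts count in proportion to their population
    avg_weighted = weighted.mean(hispanic_pct, w = total_population, na.rm = TRUE),
    # Share computed directly from county totals
    county_share = 100 * sum(hispanicE) / sum(total_population)
  )

county_compare
```

When tract populations are similar (County B) the two averages agree; when they are very unequal (County A) the unweighted mean can be far from the county-wide share.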
Part 4: Comprehensive Data Quality Evaluation
# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)
moe_pct <- tract_demo_raw %>%
  transmute(
    GEOID,
    white_moe_pct = if_else(whiteE > 0, 100 * whiteM / whiteE, NA_real_),
    black_moe_pct = if_else(blackE > 0, 100 * blackM / blackE, NA_real_),
    hispanic_moe_pct = if_else(hispanicE > 0, 100 * hispanicM / hispanicE, NA_real_)
  )

# Create a flag for tracts with high MOE on any demographic variable
# Use logical operators (| for OR) in an ifelse() statement
tract_quality <- tract_demo %>%
  select(GEOID, county_name, tract_name, total_population, white_pct, black_pct, hispanic_pct) %>%
  left_join(moe_pct, by = "GEOID") %>%
  mutate(
    high_moe_flag = ifelse(
      coalesce(white_moe_pct > 50, FALSE) |
        coalesce(black_moe_pct > 50, FALSE) |
        coalesce(hispanic_moe_pct > 50, FALSE),
      TRUE, FALSE
    )
  )

# Create summary statistics showing how many tracts have data quality issues
overall_summary <- tract_quality %>%
  summarise(
    n_tracts = n(),
    n_high_moe = sum(high_moe_flag, na.rm = TRUE),
    share_high_moe = round(100 * n_high_moe / n_tracts, 1)
  )

kable(overall_summary, caption = "Overall count and share of high-MOE tracts")
# Group tracts by whether they have high MOE issues
# Calculate average characteristics for each group:
# - population size, demographic percentages
pattern_summary <- tract_quality %>%
  group_by(high_moe_flag) %>%
  summarise(
    n_tracts = n(),
    avg_population = mean(total_population, na.rm = TRUE),
    avg_white_pct = mean(white_pct, na.rm = TRUE),
    avg_black_pct = mean(black_pct, na.rm = TRUE),
    avg_hispanic_pct = mean(hispanic_pct, na.rm = TRUE)
  )

# Use group_by() and summarize() to create this comparison
# Create a professional table showing the patterns
kable(
  pattern_summary,
  caption = "Comparison of Community Characteristics by Data Quality Flag"
)
Comparison of Community Characteristics by Data Quality Flag
| high_moe_flag | n_tracts | avg_population | avg_white_pct | avg_black_pct | avg_hispanic_pct |
|---|---|---|---|---|---|
| FALSE | 15 | 4488.133 | 35.73905 | 53.43592 | 4.259571 |
| TRUE | 255 | 4085.306 | 65.83459 | 21.66414 | 3.350713 |
Pattern Analysis: Even after raising the threshold for high margins of error from 15% to 50%, about 94% of tracts (255 of 270) remain flagged as high-MOE. Such an extreme distribution calls the reliability of the data into question: either the ACS sample sizes in these counties are insufficient, or the estimation procedures struggle to produce stable values for small groups. Some results look counterintuitive at first. For instance, the 15 unflagged tracts have somewhat larger populations and a much higher average Black share (53% versus 22%). This is consistent with how relative MOE behaves: the MOE percentage shrinks as the estimated count grows, so a tract escapes the flag only when none of its three group estimates has an MOE above 50%. At this stage, it is difficult to draw substantive conclusions; what emerges most clearly is the presence of potential bias and instability within the data itself.
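To see which variable drives the flag, the share of tracts exceeding the threshold can be computed separately for each group. The sketch below applies the same 50% rule to three hypothetical tracts (the counts and margins are made up for illustration), treating a zero estimate as "not flagged" in the same way the `coalesce()` calls above do.

```r
library(dplyr)
library(tibble)

# Hypothetical tract estimates (E) and margins of error (M)
toy <- tribble(
  ~GEOID, ~whiteE, ~whiteM, ~blackE, ~blackM, ~hispanicE, ~hispanicM,
  "T1",   3000,    400,     40,      45,      25,         30,
  "T2",   1500,    300,     2200,    450,     60,         55,
  "T3",   2500,    350,     0,       13,      300,        140
)

flags <- toy %>%
  mutate(
    white_moe_pct    = if_else(whiteE > 0, 100 * whiteM / whiteE, NA_real_),
    black_moe_pct    = if_else(blackE > 0, 100 * blackM / blackE, NA_real_),
    hispanic_moe_pct = if_else(hispanicE > 0, 100 * hispanicM / hispanicE, NA_real_),
    # Missing MOE percentages (zero estimates) count as not flagged
    white_flag    = coalesce(white_moe_pct > 50, FALSE),
    black_flag    = coalesce(black_moe_pct > 50, FALSE),
    hispanic_flag = coalesce(hispanic_moe_pct > 50, FALSE),
    any_flag      = white_flag | black_flag | hispanic_flag
  )

# Percentage of tracts flagged by each variable and overall
flag_shares <- flags %>%
  summarise(across(ends_with("_flag"), ~ 100 * mean(.x)))

flag_shares
```

In this toy data the large white counts never trip the flag, while the small Black and Hispanic counts do, which is the mechanism behind the pattern described above. Note that a zero estimate (tract T3, Black population) is silently treated as reliable under this rule, which may or may not be the desired behavior.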
Part 5: Policy Recommendations
5.1 Analysis Integration and Professional Summary
Executive Summary:
The analysis shows that data reliability problems are not randomly distributed but concentrated in specific variables and geographies. At the county level, median household income estimates (B19013_001) exhibit much higher margins of error in small, rural counties: Shannon (21.2%), Carter (18.6%), Mississippi (18.5%), Ozark (18.1%), and Mercer (18.1%) all far exceed the 10% threshold, while larger metropolitan counties such as St. Louis report far lower error rates (1.64%). At the tract level, racial and ethnic variables (B03002) are especially unstable when minority group counts are small. Using a 50% MOE threshold, 94.4% of tracts were flagged as high-MOE, with Black and Hispanic estimates most often responsible for the flag. This pattern is stark in predominantly white counties such as Buchanan (100% of tracts flagged) and Texas (75%), where Black and Hispanic residents each average less than 10% of tract population. Even in St. Louis County, where the sample base is larger, 94.5% of tracts still exceeded the threshold, largely because Hispanic residents average only about 3% of tract population.
Because algorithms allocate resources based on point estimates without accounting for their uncertainty, unstable figures can translate directly into misclassification. In rural tracts with very small Black or Hispanic populations, ACS samples often produce highly volatile estimates. An algorithm that interprets these values at face value may conclude that such communities have little or no need for targeted services, even when real needs exist. Conversely, areas where small samples happen to inflate minority counts could be over-prioritized. The core risk is that statistical noise is treated as social reality, leaving already vulnerable groups more exposed to under-investment.
The ACS is a sample survey, which creates three related problems.
1. Sample size. For a large city or county, hundreds of households might be surveyed, which supports reasonably stable estimates of median income and demographic counts. For a tiny county, only a handful of households are surveyed, so the estimates are much less stable.
2. Small subgroups. When a population has a dominant majority and a few minority members, the minority data will have high variance. Missouri’s rural counties often have very few minority residents, and thus data about those residents is scant and uncertain. If the algorithm tries to pinpoint, say, where to fund a minority outreach program, it might miss small communities entirely due to these data gaps.
3. Nonresponse. Some communities (e.g., very low-income households, remote rural residents, certain ethnic minorities, immigrants) have historically lower response rates to the census and surveys. Language barriers, distrust in government, or simply being hard to reach (no internet, P.O. box addresses, etc.) can lead to undercounting. That means the ACS might have not just statistical uncertainty but actual systematic bias in who is represented.
The algorithm should not treat all data points equally. Wherever possible, include the margin of error or a confidence weight in the algorithm’s calculations. For example, if ranking counties by median income (lowest income = highest need), adjust the ranking to account for MOE. A county with a median income of $40k ± $8k should be treated as uncertain and perhaps be grouped with counties at, say, $35k ± $2k, rather than confidently placed above or below them. Set rules that flag communities (counties or tracts) with low data confidence for manual review. If the data is too poor, the algorithm alone should not make the call. For instance, the algorithm could produce a preliminary list of priority areas but mark any area with “Low Confidence” data (such as counties with MOE above 10%, or tracts with MOE above 50% on key statistics) for human analysts to double-check. This ensures places like Shannon County or Texas County are not ignored just because of shaky data.
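One way to decide whether two counties should be grouped rather than ranked is the standard comparison of two ACS estimates: convert each 90% MOE to a standard error by dividing by 1.645, then compare the difference to the combined standard error. The sketch below is an illustration rather than a prescribed rule; the function names are hypothetical, the first pair is the hypothetical $40k/$35k example from the text, and the other pairs use county values reported earlier in this report.

```r
# Standard error implied by an ACS 90% margin of error
moe_to_se <- function(moe, z = 1.645) moe / z

# TRUE when two estimates differ at the 90% confidence level
differs_at_90 <- function(est1, moe1, est2, moe2, z_crit = 1.645) {
  z <- abs(est1 - est2) / sqrt(moe_to_se(moe1)^2 + moe_to_se(moe2)^2)
  z > z_crit
}

hypothetical_pair <- differs_at_90(40000, 8000, 35000, 2000)
shannon_vs_mercer <- differs_at_90(46767, 9920, 55592, 10045)
andrew_vs_audrain <- differs_at_90(68774, 4776, 51745, 2309)

c(hypothetical_pair, shannon_vs_mercer, andrew_vs_audrain)
```

Pairs that return `FALSE` cannot be reliably ordered and are candidates for being placed in the same priority tier. Here the hypothetical pair and Shannon versus Mercer are not distinguishable, while Andrew and Audrain are.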
5.2 Specific Recommendations
# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category
# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"
# - Low Confidence: "Requires manual review or additional data"
recommendations <- income_reliability %>%
  mutate(
    recommendation = case_when(
      reliability == "High Confidence" ~ "Safe for algorithmic decisions",
      reliability == "Moderate Confidence" ~ "Use with caution - monitor outcomes",
      reliability == "Low Confidence" ~ "Requires manual review or additional data",
      TRUE ~ NA_character_
    )
  ) %>%
  select(
    County = NAME,
    `Median Income (Estimate)` = median_incomeE,
    `MOE %` = income_moe_pct,
    `Reliability` = reliability,
    Recommendation = recommendation
  )

kable(
  head(recommendations, 10),
  caption = "Decision Framework for Algorithm Implementation"
)
Decision Framework for Algorithm Implementation
| County | Median Income (Estimate) | MOE % | Reliability | Recommendation |
|---|---|---|---|---|
| Adair | 51020 | 8.682870 | Moderate Confidence | Use with caution - monitor outcomes |
| Andrew | 68774 | 6.944485 | Moderate Confidence | Use with caution - monitor outcomes |
| Atchison | 58521 | 6.298594 | Moderate Confidence | Use with caution - monitor outcomes |
| Audrain | 51745 | 4.462267 | High Confidence | Safe for algorithmic decisions |
| Barry | 55592 | 9.686646 | Moderate Confidence | Use with caution - monitor outcomes |
| Barton | 48105 | 11.591311 | Low Confidence | Requires manual review or additional data |
| Bates | 54122 | 9.574665 | Moderate Confidence | Use with caution - monitor outcomes |
| Benton | 50229 | 7.722630 | Moderate Confidence | Use with caution - monitor outcomes |
| Bollinger | 52306 | 16.525829 | Low Confidence | Requires manual review or additional data |
| Boone | 66564 | 2.665104 | High Confidence | Safe for algorithmic decisions |
Key Recommendations:
Counties suitable for immediate algorithmic implementation: Major metropolitan and other large counties with reliable data (High Confidence). For Missouri, these include places like St. Louis County, St. Louis City, Jackson County, St. Charles County, Clay County, Greene County, and a few others. These areas have MOEs well under 5% for key metrics. The department can confidently use the algorithm to drive decisions in these locales because any ranking or prioritization based on ACS data is grounded in fairly accurate information.
Counties requiring additional oversight: Mid-sized or somewhat smaller counties with moderate data confidence. This list might include Buchanan, Cole, Jasper, Newton, Platte, Cape Girardeau, etc., roughly counties with populations from a few tens of thousands up to around 100k. In these cases, the algorithm’s output should be reviewed by staff. For instance, if the algorithm ranks Buchanan County as the 10th highest need, staff might double-check recent economic conditions there because of the moderate MOE (maybe there was a plant closure not fully reflected in the 2018–2022 data, or maybe the MOE means it could actually rank 8th or 12th). Outcomes should also be monitored. For example, if funds were given to a county for outreach but uptake is low, was it because the need was overestimated? Or if a county not prioritized starts showing signs of distress, was it an oversight due to data noise?
Counties needing alternative approaches: Counties with low confidence data (mostly rural counties and those flagged earlier like Shannon, Carter, Ozark, Mississippi, Mercer, and many others). In these cases, I recommend manual review and supplementary analysis as a prerequisite for decision-making. The algorithm might initially rank these places oddly (for example, not high need because median income was overestimated, or not low need because of an unstable population estimate). Supplementary steps could include:
Look at local poverty indicators (e.g., school district free lunch percentages, local food pantry demand).
Consult qualitative reports (maybe county commissioners or local non-profits can speak to the community’s situation).
Possibly use regional grouping: If one county’s data is flaky, consider looking at a cluster of surrounding similar counties to infer needs.
Questions for Further Investigation
We observed that many of the highest-MOE counties cluster in certain regions (e.g., the Ozarks). A deeper spatial analysis could reveal regional trends. So, are there geographic clusters of poor data quality?
It would be insightful to examine if data reliability is improving or worsening over time. For instance, how do the 2018–2022 ACS margins of error compare to 2010–2014 ACS for these same counties?
We focused on median income and a few racial groups. What about other variables that an algorithm might use? For example, poverty rates, unemployment rates, education levels, or age distributions by tract/county.
What factors best predict high MOE or data issues? Our analysis suggests population size and homogeneity are factors. We could formally test correlations: e.g., does a lower response rate or a higher proportion of rental housing correlate with higher MOEs?
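As a first step toward testing such correlations, the sketch below computes a Spearman rank correlation between county population and income MOE percentage for the eleven counties whose values appear in this report. This is only an illustration of the method: the sample mixes the first six counties alphabetically with the five highest-MOE counties, so it is not representative, and a real test would use all Missouri counties.

```r
# Populations and income MOE percentages copied from the tables in this report
counties <- data.frame(
  county = c("Adair", "Andrew", "Atchison", "Audrain", "Barry", "Barton",
             "Shannon", "Carter", "Mississippi", "Ozark", "Mercer"),
  total_pop = c(25299, 18069, 5270, 24873, 34701, 11683,
                7132, 5299, 12305, 8688, 3517),
  moe_pct = c(8.682870, 6.944485, 6.298594, 4.462267, 9.686646, 11.591311,
              21.21154, 18.62168, 18.48015, 18.12652, 18.06915)
)

# Rank-based correlation, which does not assume a linear relationship
rho <- cor(counties$total_pop, counties$moe_pct, method = "spearman")
rho
```

A negative value is consistent with smaller counties having larger relative margins of error. With the full county table, `cor.test()` could be used in the same way to attach a p-value.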
Technical Notes
Data Sources: All data for this analysis comes from the U.S. Census Bureau’s American Community Survey (ACS) 2018–2022 5-Year Estimates. The data was accessed via the tidycensus R package. Key tables used were:
B19013: Median Household Income in the Past 12 Months (in 2022 inflation-adjusted dollars) – for county-level median income and MOE.
B01003: Total Population – for county populations.
B03002: Hispanic or Latino Origin by Race – for tract-level total population and breakdown by White (non-Hispanic), Black (non-Hispanic), and Hispanic (any race) populations and their MOEs.
Reproducibility: The analysis was conducted using R (version 4.x) and the following main packages: tidycensus for data retrieval, tidyverse (dplyr, stringr) for data manipulation, and knitr/kable for presenting tables. A Census API key (personal to the analyst) was used to authenticate data requests; this key is required to replicate the data pulls. All code and documentation for this analysis are available in https://musa-5080-fall-2025.github.io/portfolio-setup-lluluciano0505/
Methodology Notes:
a. Reliability Thresholds: We defined “High”, “Moderate”, and “Low” confidence using specific MOE percentage cutoffs (5% and 10%). These thresholds are somewhat arbitrary but are common-sense rules of thumb in survey analysis. An MOE below 5% indicates a very tight estimate, while an MOE above 10% calls for caution.
b. High MOE Flag at Tract Level: We chose a 50% MOE as the flag criterion for tract-level demographic data. The assignment prompt mentioned 15% as a possible threshold for “unreliable,” but using 15% would flag virtually every tract. The higher cutoff focuses discussion on the worst data problems; however, it means we understate the prevalence of “moderate” data issues. In reality, even a 20% MOE might be problematic for decision-making, so our flag is conservative in the sense that it captures only the most severe cases.
c. County Selection: We intentionally picked one county for each reliability category to compare (rather than selecting at random). This allowed us to illustrate contrasts, but it means some findings (like patterns in St. Louis vs. Texas County) are examples, not exhaustive. If a different high-confidence county were chosen (say, Jackson County instead of St. Louis County), many patterns would be similar, but specific numbers would differ.
Limitations:
Margin of Error Interpretation: All MOEs discussed are at the 90% confidence level. Roughly speaking, this means there is a 10% chance the true value lies outside the given interval. If one prefers 95% confidence, margins would be wider by a factor of about 1.96/1.645. For simplicity, we stuck to the ACS MOE as given.
ACS Data Limitations: The ACS 5-year data, while comprehensive, has known limitations. Small or hard-to-reach populations (e.g., homeless individuals, transient laborers, undocumented immigrants) may be undercounted. Our analysis doesn’t correct for any undercount bias; it only measures sampling error (MOE).
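For readers who prefer 95% intervals, a published 90% MOE can be rescaled by the ratio of the two critical values (1.960 for 95%, 1.645 for 90%). The sketch below shows the arithmetic using Shannon County’s income MOE from the top-five table; the helper name `rescale_moe` is hypothetical.

```r
# Rescale an ACS 90% margin of error to another confidence level
rescale_moe <- function(moe_90, z_target = 1.960, z_90 = 1.645) {
  moe_90 * z_target / z_90
}

shannon_moe_90 <- 9920                      # from the top-five table
shannon_moe_95 <- rescale_moe(shannon_moe_90)

# MOE as a percentage of the 46767 median income estimate, at each level
shannon_pct_90 <- 100 * shannon_moe_90 / 46767
shannon_pct_95 <- 100 * shannon_moe_95 / 46767

round(c(moe_95 = shannon_moe_95, pct_90 = shannon_pct_90, pct_95 = shannon_pct_95), 2)
```

Because the rescaling is a constant multiplier, it changes every county’s MOE percentage by the same factor, so the ranking of counties by uncertainty is unchanged, though fixed cutoffs such as 5% and 10% would classify more counties as lower confidence.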