Assignment 1: Census Data Quality for Policy Decisions

Evaluating Data Reliability for Algorithmic Decision-Making

Author

Xiaoqing(Kelsey) Chen

Published

September 27, 2025

Part 1: Portfolio Integration

- text: Assignments
  menu:
    - href: assignments/assignment_1/your_file_name.qmd
      text: "Assignment 1: Census Data Exploration"

Setup

library(tidycensus)
library(tidyverse)
library(knitr)

# Read the key from the CENSUS_API_KEY environment variable instead of hard-coding it
census_api_key(Sys.getenv("CENSUS_API_KEY"))

State Selection: I have chosen [New York] for this analysis because: [I travelled to New York this month, so I’m curious about this state.]

Part 2: County-Level Resource Assessment

2.1 Data Retrieval

state_pop <- get_acs(
  geography = "county",
  variables = "B01003_001",  
  state = "NY", 
  year = 2022,
  survey = "acs5"
)
glimpse(state_pop)
Rows: 62
Columns: 5
$ GEOID    <chr> "36001", "36003", "36005", "36007", "36009", "36011", "36013"…
$ NAME     <chr> "Albany County, New York", "Allegany County, New York", "Bron…
$ variable <chr> "B01003_001", "B01003_001", "B01003_001", "B01003_001", "B010…
$ estimate <dbl> 315041, 47222, 1443229, 198365, 77000, 76171, 127440, 83584, …
$ moe      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
ny_data <- get_acs(
  geography = "county",
  variables = c(
    total_pop = "B01003_001",
    median_income = "B19013_001"
  ),
  state = "NY",
  year = 2022,
  output = "wide"
)

head(ny_data)
# A tibble: 6 × 6
  GEOID NAME                 total_popE total_popM median_incomeE median_incomeM
  <chr> <chr>                     <dbl>      <dbl>          <dbl>          <dbl>
1 36001 Albany County, New …     315041         NA          78829           2049
2 36003 Allegany County, Ne…      47222         NA          58725           1965
3 36005 Bronx County, New Y…    1443229         NA          47036            890
4 36007 Broome County, New …     198365         NA          58317           1761
5 36009 Cattaraugus County,…      77000         NA          56889           1778
6 36011 Cayuga County, New …      76171         NA          63227           2736
ny_clean <- ny_data %>%
  mutate(
    county_name = str_remove(NAME, ", New York"),
    county_name = str_remove(county_name, " County")
  )

2.2 Data Quality Assessment

select(ny_clean, NAME, county_name)
# A tibble: 62 × 2
   NAME                         county_name
   <chr>                        <chr>      
 1 Albany County, New York      Albany     
 2 Allegany County, New York    Allegany   
 3 Bronx County, New York       Bronx      
 4 Broome County, New York      Broome     
 5 Cattaraugus County, New York Cattaraugus
 6 Cayuga County, New York      Cayuga     
 7 Chautauqua County, New York  Chautauqua 
 8 Chemung County, New York     Chemung    
 9 Chenango County, New York    Chenango   
10 Clinton County, New York     Clinton    
# ℹ 52 more rows
ny_reliability <- ny_clean %>%
  mutate(
    moe_percentage = round((median_incomeM / median_incomeE) * 100, 2),

    reliability = case_when(
      moe_percentage < 5 ~ "High Confidence",
      moe_percentage >= 5 & moe_percentage <= 10 ~ "Moderate",
      moe_percentage > 10 ~ "Low Confidence"
    )
  )

count(ny_reliability, reliability)
# A tibble: 3 × 2
  reliability         n
  <chr>           <int>
1 High Confidence    56
2 Low Confidence      1
3 Moderate            5
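
As a worked check of the classification above, the relative MOE can be recomputed by hand from the estimates and margins printed in section 2.1. The sketch below uses base R only; `classify_reliability()` is a hypothetical helper name, and its thresholds are meant to mirror the `case_when()` rules.

```r
# Estimates and margins copied from the head(ny_data) output in section 2.1
income <- data.frame(
  county   = c("Albany", "Allegany", "Bronx"),
  estimate = c(78829, 58725, 47036),
  moe      = c(2049, 1965, 890)
)

# Relative MOE: the margin as a percentage of the estimate
income$moe_percentage <- round(100 * income$moe / income$estimate, 2)

# Hypothetical helper mirroring the case_when() thresholds used in this report
classify_reliability <- function(p) {
  ifelse(is.na(p), NA_character_,
    ifelse(p < 5, "High Confidence",
      ifelse(p <= 10, "Moderate", "Low Confidence")))
}

income$reliability <- classify_reliability(income$moe_percentage)
print(income)
```

All three counties land in the "High Confidence" tier, which agrees with the county-level results reported later.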

2.3 High Uncertainty Counties

high_uncertainty <- ny_reliability %>%
  filter(moe_percentage > 8) %>%
  arrange(desc(moe_percentage)) %>%
  select(county_name, median_incomeE, moe_percentage)

reliability_summary <- ny_reliability %>%
  group_by(reliability) %>%
  summarize(
    counties = n(),
    avg_income = round(mean(median_incomeE, na.rm = TRUE), 0)
  )

kable(high_uncertainty,
      col.names = c("County", "Median Income", "MOE %"),
      caption = "Counties with Highest Income Data Uncertainty",
      format.args = list(big.mark = ","))
Counties with Highest Income Data Uncertainty
County    Median Income  MOE %
Hamilton  66,891         11.39
Schuyler  61,316         9.49

Data Quality Commentary:

[The table shows that Hamilton and Schuyler have the highest relative MOEs in the income data (11.39% and 9.49%), so algorithmic decisions based on these estimates may be less reliable for their residents. Small populations and limited survey samples likely contribute to this uncertainty, which could lead to unfair treatment or laws that affect local people in these areas.]
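
To make this uncertainty concrete, a published ACS margin of error can be read as a 90% confidence interval of estimate ± MOE. The sketch below is illustrative only. Albany's margin comes from the section 2.1 output; Hamilton's margin is not printed in this report, so it is back-calculated from the rounded 11.39% figure and is approximate. The $65,000 cutoff is a purely hypothetical eligibility threshold.

```r
counties <- data.frame(
  county   = c("Albany", "Hamilton"),
  estimate = c(78829, 66891),
  # Albany's margin is printed in section 2.1; Hamilton's is an approximation
  # recovered from its rounded relative MOE (11.39%), not a published value
  moe      = c(2049, 66891 * 11.39 / 100)
)

# ACS margins are published at the 90% confidence level
counties$lower <- counties$estimate - counties$moe
counties$upper <- counties$estimate + counties$moe

# Hypothetical cutoff, used only to illustrate the decision risk
threshold <- 65000
counties$threshold_inside_interval <-
  threshold >= counties$lower & threshold <= counties$upper

print(counties)
```

Under these assumptions the cutoff falls inside Hamilton's interval but outside Albany's, so the data cannot say with 90% confidence which side of the cutoff Hamilton is on.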

Part 3: Neighborhood-Level Analysis

3.1 Focus Area Selection

selected_counties <- ny_reliability %>%
  filter(
    county_name %in% c("Albany", "Schuyler", "Hamilton")
  )
selected_counties %>%
  select(county_name, median_incomeE, moe_percentage, reliability)
# A tibble: 3 × 4
  county_name median_incomeE moe_percentage reliability    
  <chr>                <dbl>          <dbl> <chr>          
1 Albany               78829           2.6  High Confidence
2 Hamilton             66891          11.4  Low Confidence 
3 Schuyler             61316           9.49 Moderate       

Comment on the output: [These three counties reflect different levels of reliability. Albany has highly reliable income data with a very low MOE (2.6%), Schuyler is moderate (9.49%), and Hamilton shows the highest uncertainty (11.4%). Smaller populations and limited sample sizes likely reduce confidence in the data, which could lead to less stable estimates and algorithmic decisions.]

3.2 Tract-Level Demographics

selected_counties <- c("Albany", "Hamilton", "Schuyler")

ny_lookup <- tidycensus::fips_codes %>%
  filter(state == "NY", county %in% paste0(selected_counties, " County")) %>%
  distinct(county_code, county) %>%
  mutate(county_name = str_remove(county, " County")) %>%
  select(county_code, county_name)

race_tract <- get_acs(
  geography = "tract",
  variables = c(
    white     = "B03002_003",
    black     = "B03002_004",
    hispanic  = "B03002_012",
    total_pop = "B03002_001"
  ),
  state  = "NY",
  county = ny_lookup$county_code, 
  year   = 2022,
  survey = "acs5",
  output = "wide"
)

race_tract <- race_tract %>%
  mutate(
    white_pct    = if_else(total_popE > 0, whiteE    / total_popE, NA_real_),
    black_pct    = if_else(total_popE > 0, blackE    / total_popE, NA_real_),
    hispanic_pct = if_else(total_popE > 0, hispanicE / total_popE, NA_real_)
  )

race_tract <- race_tract %>%
  mutate(
    county_fips = str_sub(GEOID, 3, 5),
    tract_id    = str_sub(GEOID, 6)
  ) %>%
  left_join(ny_lookup, by = c("county_fips" = "county_code")) %>%
  mutate(
    tract_label = paste0(county_name, " County – Tract ", tract_id)
  )

race_tract %>%
  select(GEOID, tract_label, county_name,
         total_popE, whiteE, blackE, hispanicE,
         white_pct, black_pct, hispanic_pct) %>%
  arrange(county_name, tract_label) %>%
  head()
# A tibble: 6 × 10
  GEOID     tract_label county_name total_popE whiteE blackE hispanicE white_pct
  <chr>     <chr>       <chr>            <dbl>  <dbl>  <dbl>     <dbl>     <dbl>
1 36001000… Albany Cou… Albany            2259    725    982       346     0.321
2 36001000… Albany Cou… Albany            2465    372   1742       174     0.151
3 36001000… Albany Cou… Albany            2374    317   1952        45     0.134
4 36001000… Albany Cou… Albany            2837    678   1271       673     0.239
5 36001000… Albany Cou… Albany            3200   1963    538       183     0.613
6 36001000… Albany Cou… Albany            2301   2012    134        98     0.874
# ℹ 2 more variables: black_pct <dbl>, hispanic_pct <dbl>
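
The `str_sub()` calls above rely on the fixed layout of an 11-digit tract GEOID: two digits for the state, three for the county, and six for the tract. A minimal base R sketch of that split, using the GEOID of one Albany County tract from this dataset (`split_geoid()` is a hypothetical helper name):

```r
# Split an 11-digit tract GEOID into its state, county, and tract components
split_geoid <- function(geoid) {
  stopifnot(all(nchar(geoid) == 11))
  data.frame(
    state_fips  = substr(geoid, 1, 2),
    county_fips = substr(geoid, 3, 5),
    tract_id    = substr(geoid, 6, 11),
    stringsAsFactors = FALSE
  )
}

parts <- split_geoid("36001002000")
print(parts)
```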

3.3 Demographic Analysis

top_hispanic <- race_tract %>%
  arrange(desc(hispanic_pct)) %>%
  slice(1)

top_hispanic
# A tibble: 1 × 17
  GEOID       NAME    whiteE whiteM blackE blackM hispanicE hispanicM total_popE
  <chr>       <chr>    <dbl>  <dbl>  <dbl>  <dbl>     <dbl>     <dbl>      <dbl>
1 36001002000 Census…   2846    835   1460    445      1562       811       6576
# ℹ 8 more variables: total_popM <dbl>, white_pct <dbl>, black_pct <dbl>,
#   hispanic_pct <dbl>, county_fips <chr>, tract_id <chr>, county_name <chr>,
#   tract_label <chr>
county_summary <- race_tract %>%
  group_by(county_name) %>%
  summarize(
    n_tracts = n(),
    avg_white_pct = mean(white_pct, na.rm = TRUE),
    avg_black_pct = mean(black_pct, na.rm = TRUE),
    avg_hispanic_pct = mean(hispanic_pct, na.rm = TRUE),
  )


# Render the county-level summary computed above as a table
county_summary %>%
  kable(digits = 2, caption = "Average Demographics by County (Albany, Hamilton, Schuyler)")
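
Note that `mean(white_pct)` gives every tract equal weight regardless of its population, which is not the same as the county-wide share. The sketch below contrasts the two using only the six Albany tracts printed in section 3.2, so the numbers are illustrative rather than county totals.

```r
# Six Albany County tracts copied from the head() output in section 3.2
tracts <- data.frame(
  total_pop = c(2259, 2465, 2374, 2837, 3200, 2301),
  white     = c(725, 372, 317, 678, 1963, 2012)
)

# Unweighted: average of tract-level shares (what mean(white_pct) computes)
unweighted_share <- mean(tracts$white / tracts$total_pop)

# Population-weighted: pooled count divided by pooled population
weighted_share <- sum(tracts$white) / sum(tracts$total_pop)

round(c(unweighted = unweighted_share, weighted = weighted_share), 3)
```

If the county-wide share is the quantity of interest, the pooled version is likely the better choice; the unweighted mean describes the typical tract instead.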

Part 4: Comprehensive Data Quality Evaluation

4.1 MOE Analysis for Demographic Variables

demographic_moe <- race_tract %>%
  mutate(
    white_moe_pct    = if_else(is.na(whiteE)    | whiteE    == 0, NA_real_, 100 * whiteM    / whiteE),
    black_moe_pct    = if_else(is.na(blackE)    | blackE    == 0, NA_real_, 100 * blackM    / blackE),
    hispanic_moe_pct = if_else(is.na(hispanicE) | hispanicE == 0, NA_real_, 100 * hispanicM / hispanicE)
  ) 

demographic_moe <- demographic_moe %>%
  mutate(
    high_moe_flag = if_else(
      white_moe_pct > 10 | black_moe_pct > 10 | hispanic_moe_pct > 10,
      TRUE, FALSE
    )
  )

summary_tracts <- demographic_moe %>%
  summarise(
    total_tracts    = n(),
    high_moe_tracts = sum(high_moe_flag, na.rm = TRUE),
    pct_high_moe    = 100 * mean(high_moe_flag, na.rm = TRUE)
  )
summary_tracts
# A tibble: 1 × 3
  total_tracts high_moe_tracts pct_high_moe
         <int>           <int>        <dbl>
1           94              94          100
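
The 100% flag rate follows from how large the margins are relative to tract-level subgroup counts. As a worked check, the sketch below applies the same relative-MOE rule to the one tract whose estimates and margins are printed in section 3.3 (GEOID 36001002000). `relative_moe()` is a hypothetical helper name.

```r
# Guarded relative MOE: NA when the estimate is missing or zero
relative_moe <- function(margin, estimate) {
  ifelse(is.na(estimate) | estimate == 0, NA_real_, 100 * margin / estimate)
}

# Estimates and margins for tract 36001002000, copied from section 3.3
tract <- data.frame(
  group    = c("white", "black", "hispanic"),
  estimate = c(2846, 1460, 1562),
  moe      = c(835, 445, 811)
)

tract$moe_pct <- round(relative_moe(tract$moe, tract$estimate), 1)

# The tract is flagged if any group exceeds the 10% threshold
high_moe_flag <- any(tract$moe_pct > 10, na.rm = TRUE)

print(tract)
print(high_moe_flag)
```

Every group in this tract is well above 10%, which suggests that the threshold borrowed from the county-level analysis may be too strict to distinguish between tracts.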

4.2 Pattern Analysis

avg_characteristics <- demographic_moe %>%
  group_by(high_moe_flag) %>%
  summarise(
    avg_pop      = mean(total_popE, na.rm = TRUE),
    avg_white    = mean(white_pct, na.rm = TRUE),
    avg_black    = mean(black_pct, na.rm = TRUE),
    avg_hispanic = mean(hispanic_pct, na.rm = TRUE),
    n_tracts     = n()
  )

avg_characteristics
# A tibble: 1 × 6
  high_moe_flag avg_pop avg_white avg_black avg_hispanic n_tracts
  <lgl>           <dbl>     <dbl>     <dbl>        <dbl>    <int>
1 TRUE            3596.     0.711     0.116       0.0607       94

Pattern Analysis: [Every one of the 94 tracts was flagged as high MOE, so there is no low-MOE group to compare against. The likely driver is small populations: tracts average only about 3,600 people, and minority groups such as Hispanic residents account for only about 6% of the population on average, so the related estimates rest on small samples and are less reliable.]

Part 5: Policy Recommendations

5.1 Analysis Integration and Professional Summary

Executive Summary Requirements:

  1. Overall Pattern Identification: What are the systematic patterns across all your analyses?
  2. Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings?
  3. Root Cause Analysis: What underlying factors drive both data quality issues and bias risk?
  4. Strategic Recommendations: What should the Department implement to address these systematic issues?

Executive Summary:

[All 94 census tracts show at least one demographic group with an MOE above 10%, indicating that data reliability issues are widespread. For example, Albany County Tract 000100 has White MOE = 46.97%, Black MOE = 36.86%, and Hispanic MOE = 62.71%. Hispanic populations consistently show the highest MOEs, suggesting this group faces the greatest data reliability challenges. In many tracts, Hispanic MOEs exceed 60% or even 100%, while White and Black MOEs are also high but relatively lower. High MOE levels are largely driven by small population sizes in many tracts, which make reliable estimates difficult. On average, tracts flagged as high MOE have only about 3,596 people, with Hispanic populations averaging just 6%. Improving reliability requires either aggregating data or using multi-year ACS estimates to reduce sampling error. For example, combining smaller tracts or applying a population threshold before analysis would help minimize bias in Hispanic estimates.]
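
The aggregation recommendation can be illustrated with the Census Bureau's approximation for the margin of a sum of estimates: the square root of the sum of the squared margins. The sketch below combines the Hispanic estimate for tract 36001002000 (from section 3.3) with a hypothetical neighboring tract that has identical values; the second tract is an assumption made purely for illustration. tidycensus provides `moe_sum()` for this calculation, and the base R version here only makes the arithmetic visible.

```r
# Approximate MOE of a sum of estimates: root of the sum of squared margins
moe_of_sum <- function(moes) sqrt(sum(moes^2))

# First tract from section 3.3; second tract is a hypothetical duplicate
hispanic_est <- c(1562, 1562)
hispanic_moe <- c(811, 811)

# Relative MOE (%) for one tract alone and for the two tracts combined
single_rel   <- 100 * hispanic_moe[1] / hispanic_est[1]
combined_rel <- 100 * moe_of_sum(hispanic_moe) / sum(hispanic_est)

round(c(single_tract = single_rel, combined = combined_rel), 1)
```

With two identical tracts the relative MOE shrinks by a factor of the square root of two, which is why pooling many small tracts can bring subgroup estimates closer to a usable range.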

5.2 Specific Recommendations

library(dplyr)
library(kableExtra)
library(knitr)

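# Apply kableExtra styling only when the package is installed
use_kableExtra <- requireNamespace("kableExtra", quietly = TRUE)

kbl_safe <- function(x, ...) {
  tbl <- knitr::kable(x, ...)
  if (use_kableExtra) {
    tbl <- tbl %>% kableExtra::kable_styling(
      full_width = FALSE,
      bootstrap_options = c("striped","condensed")
    )
  }
  tbl
}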

moe_pct <- function(margin, estimate) {
  ifelse(is.na(estimate) | estimate == 0, NA_real_, 100 * margin / estimate)
}

recommendations_data <- ny_reliability %>%
  mutate(
    income_moe_pct = if_else(is.na(median_incomeE) | median_incomeE == 0,
                          NA_real_, 100 * median_incomeM / median_incomeE),

  reliability_category = case_when(
      !is.na(income_moe_pct) & income_moe_pct <= 5   ~ "High Confidence",
      !is.na(income_moe_pct) & income_moe_pct <= 10  ~ "Moderate Confidence",
      TRUE                                     ~ "Low Confidence"
    ),
  recommendation = case_when(
      reliability_category == "High Confidence"     ~ "Safe for algorithmic decisions",
      reliability_category == "Moderate Confidence" ~ "Use with caution – monitor outcomes",
      reliability_category == "Low Confidence"      ~ "Requires manual review or additional data"
    )
  ) %>%
  transmute(
    county_name,
    median_income = median_incomeE,
    income_moe_pct = round(income_moe_pct, 1),
    reliability_category,
    recommendation
  )

kable(recommendations_data,
      col.names = c("County","Median Income ($)","Income MOE %","Reliability","Recommendation"),
      caption = "Decision Framework for Algorithmic Implementation")
Decision Framework for Algorithmic Implementation
County        Median Income ($)  Income MOE %  Reliability          Recommendation
Albany        78829              2.6           High Confidence      Safe for algorithmic decisions
Allegany      58725              3.3           High Confidence      Safe for algorithmic decisions
Bronx         47036              1.9           High Confidence      Safe for algorithmic decisions
Broome        58317              3.0           High Confidence      Safe for algorithmic decisions
Cattaraugus   56889              3.1           High Confidence      Safe for algorithmic decisions
Cayuga        63227              4.3           High Confidence      Safe for algorithmic decisions
Chautauqua    54625              3.2           High Confidence      Safe for algorithmic decisions
Chemung       61358              4.0           High Confidence      Safe for algorithmic decisions
Chenango      61741              4.1           High Confidence      Safe for algorithmic decisions
Clinton       67097              4.2           High Confidence      Safe for algorithmic decisions
Columbia      81741              3.4           High Confidence      Safe for algorithmic decisions
Cortland      65029              4.4           High Confidence      Safe for algorithmic decisions
Delaware      58338              3.7           High Confidence      Safe for algorithmic decisions
Dutchess      94578              2.7           High Confidence      Safe for algorithmic decisions
Erie          68014              1.2           High Confidence      Safe for algorithmic decisions
Essex         68090              5.3           Moderate Confidence  Use with caution – monitor outcomes
Franklin      60270              4.8           High Confidence      Safe for algorithmic decisions
Fulton        60557              4.4           High Confidence      Safe for algorithmic decisions
Genesee       68178              4.6           High Confidence      Safe for algorithmic decisions
Greene        70294              6.2           Moderate Confidence  Use with caution – monitor outcomes
Hamilton      66891              11.4          Low Confidence       Requires manual review or additional data
Herkimer      68104              4.8           High Confidence      Safe for algorithmic decisions
Jefferson     62782              3.6           High Confidence      Safe for algorithmic decisions
Kings         74692              1.3           High Confidence      Safe for algorithmic decisions
Lewis         64401              4.2           High Confidence      Safe for algorithmic decisions
Livingston    70443              4.0           High Confidence      Safe for algorithmic decisions
Madison       68869              4.0           High Confidence      Safe for algorithmic decisions
Monroe        71450              1.4           High Confidence      Safe for algorithmic decisions
Montgomery    58033              3.6           High Confidence      Safe for algorithmic decisions
Nassau        137709             1.4           High Confidence      Safe for algorithmic decisions
New York      99880              1.8           High Confidence      Safe for algorithmic decisions
Niagara       65882              2.7           High Confidence      Safe for algorithmic decisions
Oneida        66402              3.3           High Confidence      Safe for algorithmic decisions
Onondaga      71479              1.6           High Confidence      Safe for algorithmic decisions
Ontario       76603              2.9           High Confidence      Safe for algorithmic decisions
Orange        91806              1.9           High Confidence      Safe for algorithmic decisions
Orleans       61069              4.9           High Confidence      Safe for algorithmic decisions
Oswego        65054              3.3           High Confidence      Safe for algorithmic decisions
Otsego        65778              4.5           High Confidence      Safe for algorithmic decisions
Putnam        120970             4.0           High Confidence      Safe for algorithmic decisions
Queens        82431              1.1           High Confidence      Safe for algorithmic decisions
Rensselaer    83734              2.3           High Confidence      Safe for algorithmic decisions
Richmond      96185              2.6           High Confidence      Safe for algorithmic decisions
Rockland      106173             2.9           High Confidence      Safe for algorithmic decisions
St. Lawrence  58339              3.5           High Confidence      Safe for algorithmic decisions
Saratoga      97038              2.3           High Confidence      Safe for algorithmic decisions
Schenectady   75056              3.0           High Confidence      Safe for algorithmic decisions
Schoharie     71479              4.0           High Confidence      Safe for algorithmic decisions
Schuyler      61316              9.5           Moderate Confidence  Use with caution – monitor outcomes
Seneca        64050              5.2           Moderate Confidence  Use with caution – monitor outcomes
Steuben       62506              2.9           High Confidence      Safe for algorithmic decisions
Suffolk       122498             1.2           High Confidence      Safe for algorithmic decisions
Sullivan      67841              4.4           High Confidence      Safe for algorithmic decisions
Tioga         70427              4.0           High Confidence      Safe for algorithmic decisions
Tompkins      69995              4.0           High Confidence      Safe for algorithmic decisions
Ulster        77197              4.5           High Confidence      Safe for algorithmic decisions
Warren        74531              4.7           High Confidence      Safe for algorithmic decisions
Washington    68703              3.4           High Confidence      Safe for algorithmic decisions
Wayne         71007              3.1           High Confidence      Safe for algorithmic decisions
Westchester   114651             1.6           High Confidence      Safe for algorithmic decisions
Wyoming       65066              3.4           High Confidence      Safe for algorithmic decisions
Yates         63974              5.8           Moderate Confidence  Use with caution – monitor outcomes

recommendations_data %>% filter(reliability_category == "High Confidence")
recommendations_data %>% filter(reliability_category == "Moderate Confidence")
recommendations_data %>% filter(reliability_category == "Low Confidence")

  1. Counties suitable for immediate algorithmic implementation: [All counties with MOE below 5% can be safely used for automated decisions. For example, Albany (2.6%), Allegany (3.3%), and Bronx (1.9%) all show strong reliability. Other high-confidence counties include Broome, Cattaraugus, Cayuga, Chautauqua, Chemung, Chenango, and Clinton.]

  2. Counties requiring additional oversight: [Counties with MOE between 5–10% should be used with caution and monitored for stability. Essex (5.3%), Greene (6.2%), and Schuyler (9.5%) fall into this category, as do Seneca (5.2%) and Yates (5.8%); outcomes for these counties should be reviewed periodically.]

  3. Counties needing alternative approaches: [Counties with MOE above 10% require manual review or supplemental data. Hamilton is the only county in this category, with an income MOE of 11.4%, which signals insufficient reliability for direct algorithmic use.]
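
The three lists above can be cross-checked against the decision framework table. A minimal base R sketch, with MOE values copied from that table for the six counties above 5% and the same cut points as the `reliability_category` rules:

```r
# Income MOE % for the six counties above 5%, copied from the table
income_moe_pct <- c(
  Essex = 5.3, Greene = 6.2, Hamilton = 11.4,
  Schuyler = 9.5, Seneca = 5.2, Yates = 5.8
)

# Same cut points as the reliability_category rules (<= 5, <= 10, otherwise)
tier <- ifelse(income_moe_pct <= 5, "High Confidence",
          ifelse(income_moe_pct <= 10, "Moderate Confidence", "Low Confidence"))

# County names grouped by tier
tiers <- split(names(income_moe_pct), tier)
print(tiers)
```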

Questions for Further Investigation

[
  1. How do MOE patterns differ between urban and rural counties, and what does this mean for equitable decision-making?
  2. Do time trends (e.g., comparing single-year vs. five-year ACS data) change the reliability patterns we observed?
  3. Are smaller minority populations systematically associated with higher MOEs, and how might improved sampling address this?
]

Technical Notes

Data Sources:
- U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
- Retrieved via tidycensus R package on [2025.09.20]

Reproducibility:
- All analysis conducted in R version [4.3.1, run in RStudio]
- Census API key required for replication
- Complete code and documentation available at: [your portfolio URL]

Methodology Notes: [Defined high MOE as >10% relative margin of error. Used county-level median income MOE% as the primary reliability measure.]

Limitations: [Tracts with small minority populations produce very high MOE percentages, which may lead to unreliable results.]