Assignment 1: Census Data Quality for Policy Decisions

Evaluating Data Reliability for Algorithmic Decision-Making

Author

Xiaoqing(Kelsey) Chen

Published

September 27, 2025

Part 1: Portfolio Integration

- text: Assignments
  menu:
    - href: assignments/assignment_1/your_file_name.qmd
      text: "Assignment 1: Census Data Exploration"

Setup

library(tidycensus)
library(tidyverse)
library(knitr)

# Read the Census API key from an environment variable rather than hardcoding it
census_api_key(Sys.getenv("CENSUS_API_KEY"))

State Selection: I have chosen [New York] for this analysis because: [I travelled to New York this month, so I’m curious about this state.]

Part 2: County-Level Resource Assessment

2.1 Data Retrieval

state_pop <- get_acs(
  geography = "county",
  variables = "B01003_001",  
  state = "NY", 
  year = 2022,
  survey = "acs5"
)
glimpse(state_pop)

Rows: 62
Columns: 5
$ GEOID    <chr> "36001", "36003", "36005", "36007", "36009", "36011", "36013"…
$ NAME     <chr> "Albany County, New York", "Allegany County, New York", "Bron…
$ variable <chr> "B01003_001", "B01003_001", "B01003_001", "B01003_001", "B010…
$ estimate <dbl> 315041, 47222, 1443229, 198365, 77000, 76171, 127440, 83584, …
$ moe      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

ny_data <- get_acs(
  geography = "county",
  variables = c(
    total_pop = "B01003_001",
    median_income = "B19013_001"
  ),
  state = "NY",
  year = 2022,
  output = "wide"
)

head(ny_data)

# A tibble: 6 × 6
  GEOID NAME                 total_popE total_popM median_incomeE median_incomeM
  <chr> <chr>                     <dbl>      <dbl>          <dbl>          <dbl>
1 36001 Albany County, New …     315041         NA          78829           2049
2 36003 Allegany County, Ne…      47222         NA          58725           1965
3 36005 Bronx County, New Y…    1443229         NA          47036            890
4 36007 Broome County, New …     198365         NA          58317           1761
5 36009 Cattaraugus County,…      77000         NA          56889           1778
6 36011 Cayuga County, New …      76171         NA          63227           2736

ny_clean <- ny_data %>%
  mutate(
    county_name = str_remove(NAME, ", New York"),
    county_name = str_remove(county_name, " County")
  )

2.2 Data Quality Assessment

select(ny_clean, NAME, county_name)

# A tibble: 62 × 2
   NAME                         county_name
   <chr>                        <chr>      
 1 Albany County, New York      Albany     
 2 Allegany County, New York    Allegany   
 3 Bronx County, New York       Bronx      
 4 Broome County, New York      Broome     
 5 Cattaraugus County, New York Cattaraugus
 6 Cayuga County, New York      Cayuga     
 7 Chautauqua County, New York  Chautauqua 
 8 Chemung County, New York     Chemung    
 9 Chenango County, New York    Chenango   
10 Clinton County, New York     Clinton    
# ℹ 52 more rows

ny_reliability <- ny_clean %>%
  mutate(
    moe_percentage = round((median_incomeM / median_incomeE) * 100, 2),

    reliability = case_when(
      moe_percentage < 5 ~ "High Confidence",
      moe_percentage >= 5 & moe_percentage <= 10 ~ "Moderate",
      moe_percentage > 10 ~ "Low Confidence"
    )
  )

count(ny_reliability, reliability)

# A tibble: 3 × 2
  reliability         n
  <chr>           <int>
1 High Confidence    56
2 Low Confidence      1
3 Moderate            5
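To make the classification concrete, here is a minimal base R sketch that repeats the calculation by hand for the three counties whose income estimates and margins of error appear in the head(ny_data) output in section 2.1. It uses the same thresholds as the case_when() call above; the data frame is typed in by hand and is only an illustration, not a replacement for the API call.

```r
# County values copied from the head(ny_data) output in section 2.1.
counties <- data.frame(
  county   = c("Albany", "Allegany", "Bronx"),
  estimate = c(78829, 58725, 47036),  # median household income (median_incomeE)
  moe      = c(2049, 1965, 890)       # margin of error (median_incomeM)
)

# Relative MOE: the margin of error as a percentage of the estimate.
counties$moe_percentage <- round(counties$moe / counties$estimate * 100, 2)

# Same thresholds as the case_when() call: < 5, 5 to 10, > 10.
counties$reliability <- ifelse(
  counties$moe_percentage < 5, "High Confidence",
  ifelse(counties$moe_percentage <= 10, "Moderate", "Low Confidence")
)

print(counties)
```

All three counties land in the "High Confidence" group, which matches their rows in the later tables.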

2.3 High Uncertainty Counties

high_uncertainty <- ny_reliability %>%
  filter(moe_percentage > 8) %>%
  arrange(desc(moe_percentage)) %>%
  select(county_name, median_incomeE, moe_percentage)

reliability_summary <- ny_reliability %>%
  group_by(reliability) %>%
  summarize(
    counties = n(),
    avg_income = round(mean(median_incomeE, na.rm = TRUE), 0)
  )

kable(high_uncertainty,
      col.names = c("County", "Median Income", "MOE %"),
      caption = "Counties with Highest Income Data Uncertainty",
      format.args = list(big.mark = ","))

Counties with Highest Income Data Uncertainty
County	Median Income	MOE %
Hamilton	66,891	11.39
Schuyler	61,316	9.49

Data Quality Commentary:

[The table shows that Hamilton and Schuyler have the highest relative MOEs in the income data (11.39% and 9.49%), so algorithmic decisions based on these estimates may be less reliable for their residents. Small populations and limited survey samples likely contribute to this uncertainty, which could result in unfair treatment or poorly grounded laws that affect local people in these areas.]
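One way to read a margin of error is as a confidence interval. The sketch below assumes the standard ACS convention that published MOEs are at the 90% confidence level, so dividing by 1.645 gives an approximate standard error. It uses the Albany County income estimate and MOE from the section 2.1 output; the variable names are chosen for this illustration.

```r
# ACS margins of error are published at the 90% confidence level,
# so estimate +/- MOE is a 90% interval and MOE / 1.645 approximates the standard error.
estimate <- 78829  # Albany median household income (section 2.1 output)
moe      <- 2049   # Albany margin of error (section 2.1 output)

ci_90 <- c(lower = estimate - moe, upper = estimate + moe)
standard_error <- moe / 1.645

# Widen to an approximate 95% interval by rescaling the standard error.
ci_95 <- estimate + c(lower = -1, upper = 1) * 1.96 * standard_error

print(ci_90)
print(round(ci_95))
```

The same arithmetic applied to a county with a larger relative MOE, such as Hamilton, gives a proportionally wider interval around the estimate.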

Part 3: Neighborhood-Level Analysis

3.1 Focus Area Selection

selected_counties <- ny_reliability %>%
  filter(
    county_name %in% c("Albany", "Schuyler", "Hamilton")
  )
selected_counties %>%
  select(county_name, median_incomeE, moe_percentage, reliability)

# A tibble: 3 × 4
  county_name median_incomeE moe_percentage reliability    
  <chr>                <dbl>          <dbl> <chr>          
1 Albany               78829           2.6  High Confidence
2 Hamilton             66891          11.4  Low Confidence 
3 Schuyler             61316           9.49 Moderate

Comment on the output: [These three counties reflect different levels of reliability. Albany has highly reliable income data with a very low MOE (2.6%), Hamilton’s data show much higher uncertainty (11.4%), and Schuyler falls in between (9.49%). Smaller populations and limited sample sizes likely reduce confidence in the data, which could lead to less stable estimates and algorithmic decisions.]

3.2 Tract-Level Demographics

selected_counties <- c("Albany", "Hamilton", "Schuyler")

ny_lookup <- tidycensus::fips_codes %>%
  filter(state == "NY", county %in% paste0(selected_counties, " County")) %>%
  distinct(county_code, county) %>%
  mutate(county_name = str_remove(county, " County")) %>%
  select(county_code, county_name)

race_tract <- get_acs(
  geography = "tract",
  variables = c(
    white     = "B03002_003",
    black     = "B03002_004",
    hispanic  = "B03002_012",
    total_pop = "B03002_001"
  ),
  state  = "NY",
  county = ny_lookup$county_code, 
  year   = 2022,
  survey = "acs5",
  output = "wide"
)

race_tract <- race_tract %>%
  mutate(
    white_pct    = if_else(total_popE > 0, whiteE    / total_popE, NA_real_),
    black_pct    = if_else(total_popE > 0, blackE    / total_popE, NA_real_),
    hispanic_pct = if_else(total_popE > 0, hispanicE / total_popE, NA_real_)
  )

race_tract <- race_tract %>%
  mutate(
    county_fips = str_sub(GEOID, 3, 5),
    tract_id    = str_sub(GEOID, 6)
  ) %>%
  left_join(ny_lookup, by = c("county_fips" = "county_code")) %>%
  mutate(
    tract_label = paste0(county_name, " County – Tract ", tract_id)
  )

race_tract %>%
  select(GEOID, tract_label, county_name,
         total_popE, whiteE, blackE, hispanicE,
         white_pct, black_pct, hispanic_pct) %>%
  arrange(county_name, tract_label) %>%
  head()

# A tibble: 6 × 10
  GEOID     tract_label county_name total_popE whiteE blackE hispanicE white_pct
  <chr>     <chr>       <chr>            <dbl>  <dbl>  <dbl>     <dbl>     <dbl>
1 36001000… Albany Cou… Albany            2259    725    982       346     0.321
2 36001000… Albany Cou… Albany            2465    372   1742       174     0.151
3 36001000… Albany Cou… Albany            2374    317   1952        45     0.134
4 36001000… Albany Cou… Albany            2837    678   1271       673     0.239
5 36001000… Albany Cou… Albany            3200   1963    538       183     0.613
6 36001000… Albany Cou… Albany            2301   2012    134        98     0.874
# ℹ 2 more variables: black_pct <dbl>, hispanic_pct <dbl>
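The str_sub() calls above rely on the fixed layout of a tract GEOID: 2 digits for the state, 3 for the county, and 6 for the tract. As a small base R sketch, the split can be checked on a single GEOID; the example uses the Albany County tract that appears in the section 3.3 output, and substr() stands in for str_sub() so nothing beyond base R is needed.

```r
geoid <- "36001002000"  # Albany County tract GEOID from the section 3.3 output

state_fips  <- substr(geoid, 1, 2)   # "36"     = New York
county_fips <- substr(geoid, 3, 5)   # "001"    = Albany County
tract_id    <- substr(geoid, 6, 11)  # "002000" = tract code

# Rebuild a label in the same shape as tract_label in the code above.
tract_label <- paste0("Albany", " County – Tract ", tract_id)
print(c(state_fips, county_fips, tract_id, tract_label))
```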

3.3 Demographic Analysis

top_hispanic <- race_tract %>%
  arrange(desc(hispanic_pct)) %>%
  slice(1)

top_hispanic

# A tibble: 1 × 17
  GEOID       NAME    whiteE whiteM blackE blackM hispanicE hispanicM total_popE
  <chr>       <chr>    <dbl>  <dbl>  <dbl>  <dbl>     <dbl>     <dbl>      <dbl>
1 36001002000 Census…   2846    835   1460    445      1562       811       6576
# ℹ 8 more variables: total_popM <dbl>, white_pct <dbl>, black_pct <dbl>,
#   hispanic_pct <dbl>, county_fips <chr>, tract_id <chr>, county_name <chr>,
#   tract_label <chr>

county_summary <- race_tract %>%
  group_by(county_name) %>%
  summarize(
    n_tracts = n(),
    avg_white_pct = mean(white_pct, na.rm = TRUE),
    avg_black_pct = mean(black_pct, na.rm = TRUE),
    avg_hispanic_pct = mean(hispanic_pct, na.rm = TRUE)
  )

# Render the county-level summary (not the raw tract table) as a formatted table
county_summary %>%
  kable(digits = 2, caption = "Average Demographics by County (Albany, Hamilton, Schuyler)")

Part 4: Comprehensive Data Quality Evaluation

4.1 MOE Analysis for Demographic Variables

demographic_moe <- race_tract %>%
  mutate(
    white_moe_pct    = if_else(is.na(whiteE)    | whiteE    == 0, NA_real_, 100 * whiteM    / whiteE),
    black_moe_pct    = if_else(is.na(blackE)    | blackE    == 0, NA_real_, 100 * blackM    / blackE),
    hispanic_moe_pct = if_else(is.na(hispanicE) | hispanicE == 0, NA_real_, 100 * hispanicM / hispanicE)
  ) 

demographic_moe <- demographic_moe %>%
  mutate(
    high_moe_flag = if_else(
      white_moe_pct > 10 | black_moe_pct > 10 | hispanic_moe_pct > 10,
      TRUE, FALSE
    )
  )

summary_tracts <- demographic_moe %>%
  summarise(
    total_tracts    = n(),
    high_moe_tracts = sum(high_moe_flag, na.rm = TRUE),
    pct_high_moe    = 100 * mean(high_moe_flag, na.rm = TRUE)
  )
summary_tracts

# A tibble: 1 × 3
  total_tracts high_moe_tracts pct_high_moe
         <int>           <int>        <dbl>
1           94              94          100
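To see why every tract ends up flagged, it helps to work one tract by hand. The sketch below uses the estimates and margins of error for tract 36001002000 shown in the section 3.3 output and applies the same 10% cutoff; it is written in base R with a hand-typed data frame, purely as an illustration.

```r
# Estimates and MOEs for tract 36001002000, copied from the section 3.3 output.
tract <- data.frame(
  group    = c("white", "black", "hispanic"),
  estimate = c(2846, 1460, 1562),
  moe      = c(835, 445, 811)
)

# Relative MOE per group, in percent.
tract$moe_pct  <- round(100 * tract$moe / tract$estimate, 1)
tract$above_10 <- tract$moe_pct > 10

# The tract is flagged if any single group exceeds the cutoff.
high_moe_flag <- any(tract$above_10)

print(tract)
print(high_moe_flag)
```

Even in this comparatively large tract, all three groups are well above 10%, so a single 10% cutoff applied to tract-level subgroup counts separates nothing.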

4.2 Pattern Analysis

avg_characteristics <- demographic_moe %>%
  group_by(high_moe_flag) %>%
  summarise(
    avg_pop      = mean(total_popE, na.rm = TRUE),
    avg_white    = mean(white_pct, na.rm = TRUE),
    avg_black    = mean(black_pct, na.rm = TRUE),
    avg_hispanic = mean(hispanic_pct, na.rm = TRUE),
    n_tracts     = n()
  )

avg_characteristics

# A tibble: 1 × 6
  high_moe_flag avg_pop avg_white avg_black avg_hispanic n_tracts
  <lgl>           <dbl>     <dbl>     <dbl>        <dbl>    <int>
1 TRUE            3596.     0.711     0.116       0.0607       94

Pattern Analysis: [Small tract populations lead to high MOEs, so the related estimates are less reliable. Because all 94 tracts were flagged, there is no low-MOE group to compare against. On average, the flagged tracts have only about 3,600 people, and minority groups such as Hispanic residents account for only about 6% of the total population.]

Part 5: Policy Recommendations

5.1 Analysis Integration and Professional Summary

Executive Summary Requirements:
1. Overall Pattern Identification: What are the systematic patterns across all your analyses?
2. Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings?
3. Root Cause Analysis: What underlying factors drive both data quality issues and bias risk?
4. Strategic Recommendations: What should the Department implement to address these systematic issues?

Executive Summary:

[All 94 census tracts show at least one demographic group with an MOE above 10%, indicating that data reliability issues are widespread. For example, Albany County Tract 000100 has White MOE = 46.97%, Black MOE = 36.86%, and Hispanic MOE = 62.71%. Hispanic populations consistently show the highest MOEs, suggesting this group faces the greatest data reliability challenges. In many tracts, Hispanic MOEs exceed 60% or even 100%, while White and Black MOEs are also high but relatively lower. High MOE levels are largely driven by small population sizes in many tracts, which make reliable estimates difficult. On average, tracts flagged as high MOE have only about 3,596 people, with Hispanic residents averaging just 6% of the tract population. Improving reliability requires either aggregating data or using multi-year ACS estimates to reduce sampling error. For example, combining smaller tracts or applying a population threshold before analysis would help minimize bias in Hispanic estimates.]
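The aggregation recommendation can be illustrated with the standard ACS approximation for the MOE of a sum: the square root of the sum of squared MOEs. The sketch below is base R only and combines the Hispanic count for tract 36001002000 (from the section 3.3 output) with a hypothetical neighbouring tract that is assumed to have identical values. The second tract and the helper name moe_of_sum are assumptions made for this example. tidycensus appears to provide a moe_sum() helper for the same purpose.

```r
# MOE of a sum of estimates: square root of the sum of squared MOEs
# (the usual ACS approximation for derived sums).
moe_of_sum <- function(moes) sqrt(sum(moes^2))

# First value: Hispanic count and MOE for tract 36001002000 (section 3.3 output).
# Second value: a HYPOTHETICAL neighbouring tract assumed identical, for illustration only.
estimates <- c(1562, 1562)
moes      <- c(811, 811)

single_pct   <- 100 * moes[1] / estimates[1]
combined_pct <- 100 * moe_of_sum(moes) / sum(estimates)

print(round(c(single = single_pct, combined = combined_pct), 1))
```

Because the estimate doubles while the MOE grows only by a factor of the square root of two, the relative MOE shrinks. This is the mechanism behind the suggestion to combine small tracts, at the cost of coarser geography.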

5.2 Specific Recommendations

library(dplyr)
library(kableExtra)
library(knitr)

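# Apply kableExtra styling only when the package is installed
use_kableExtra <- requireNamespace("kableExtra", quietly = TRUE)

# Wrapper around kable() that adds striped, condensed styling when available
kbl_safe <- function(x, ...) {
  tbl <- knitr::kable(x, ...)
  if (use_kableExtra) {
    tbl <- tbl %>% kableExtra::kable_styling(
      full_width = FALSE,
      bootstrap_options = c("striped", "condensed")
    )
  }
  tbl
}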

moe_pct <- function(margin, estimate) {
  ifelse(is.na(estimate) | estimate == 0, NA_real_, 100 * margin / estimate)
}

recommendations_data <- ny_reliability %>%
  mutate(
    income_moe_pct = if_else(is.na(median_incomeE) | median_incomeE == 0,
                             NA_real_, 100 * median_incomeM / median_incomeE),

    reliability_category = case_when(
      !is.na(income_moe_pct) & income_moe_pct <= 5  ~ "High Confidence",
      !is.na(income_moe_pct) & income_moe_pct <= 10 ~ "Moderate Confidence",
      TRUE                                          ~ "Low Confidence"
    ),
    recommendation = case_when(
      reliability_category == "High Confidence"     ~ "Safe for algorithmic decisions",
      reliability_category == "Moderate Confidence" ~ "Use with caution – monitor outcomes",
      reliability_category == "Low Confidence"      ~ "Requires manual review or additional data"
    )
  ) %>%
  transmute(
    county_name,
    median_income = median_incomeE,
    income_moe_pct = round(income_moe_pct, 1),
    reliability_category,
    recommendation
  )

kable(recommendations_data,
      col.names = c("County","Median Income ($)","Income MOE %","Reliability","Recommendation"),
      caption = "Decision Framework for Algorithmic Implementation")

Decision Framework for Algorithmic Implementation
County	Median Income ($)	Income MOE %	Reliability	Recommendation
Albany	78829	2.6	High Confidence	Safe for algorithmic decisions
Allegany	58725	3.3	High Confidence	Safe for algorithmic decisions
Bronx	47036	1.9	High Confidence	Safe for algorithmic decisions
Broome	58317	3.0	High Confidence	Safe for algorithmic decisions
Cattaraugus	56889	3.1	High Confidence	Safe for algorithmic decisions
Cayuga	63227	4.3	High Confidence	Safe for algorithmic decisions
Chautauqua	54625	3.2	High Confidence	Safe for algorithmic decisions
Chemung	61358	4.0	High Confidence	Safe for algorithmic decisions
Chenango	61741	4.1	High Confidence	Safe for algorithmic decisions
Clinton	67097	4.2	High Confidence	Safe for algorithmic decisions
Columbia	81741	3.4	High Confidence	Safe for algorithmic decisions
Cortland	65029	4.4	High Confidence	Safe for algorithmic decisions
Delaware	58338	3.7	High Confidence	Safe for algorithmic decisions
Dutchess	94578	2.7	High Confidence	Safe for algorithmic decisions
Erie	68014	1.2	High Confidence	Safe for algorithmic decisions
Essex	68090	5.3	Moderate Confidence	Use with caution – monitor outcomes
Franklin	60270	4.8	High Confidence	Safe for algorithmic decisions
Fulton	60557	4.4	High Confidence	Safe for algorithmic decisions
Genesee	68178	4.6	High Confidence	Safe for algorithmic decisions
Greene	70294	6.2	Moderate Confidence	Use with caution – monitor outcomes
Hamilton	66891	11.4	Low Confidence	Requires manual review or additional data
Herkimer	68104	4.8	High Confidence	Safe for algorithmic decisions
Jefferson	62782	3.6	High Confidence	Safe for algorithmic decisions
Kings	74692	1.3	High Confidence	Safe for algorithmic decisions
Lewis	64401	4.2	High Confidence	Safe for algorithmic decisions
Livingston	70443	4.0	High Confidence	Safe for algorithmic decisions
Madison	68869	4.0	High Confidence	Safe for algorithmic decisions
Monroe	71450	1.4	High Confidence	Safe for algorithmic decisions
Montgomery	58033	3.6	High Confidence	Safe for algorithmic decisions
Nassau	137709	1.4	High Confidence	Safe for algorithmic decisions
New York	99880	1.8	High Confidence	Safe for algorithmic decisions
Niagara	65882	2.7	High Confidence	Safe for algorithmic decisions
Oneida	66402	3.3	High Confidence	Safe for algorithmic decisions
Onondaga	71479	1.6	High Confidence	Safe for algorithmic decisions
Ontario	76603	2.9	High Confidence	Safe for algorithmic decisions
Orange	91806	1.9	High Confidence	Safe for algorithmic decisions
Orleans	61069	4.9	High Confidence	Safe for algorithmic decisions
Oswego	65054	3.3	High Confidence	Safe for algorithmic decisions
Otsego	65778	4.5	High Confidence	Safe for algorithmic decisions
Putnam	120970	4.0	High Confidence	Safe for algorithmic decisions
Queens	82431	1.1	High Confidence	Safe for algorithmic decisions
Rensselaer	83734	2.3	High Confidence	Safe for algorithmic decisions
Richmond	96185	2.6	High Confidence	Safe for algorithmic decisions
Rockland	106173	2.9	High Confidence	Safe for algorithmic decisions
St. Lawrence	58339	3.5	High Confidence	Safe for algorithmic decisions
Saratoga	97038	2.3	High Confidence	Safe for algorithmic decisions
Schenectady	75056	3.0	High Confidence	Safe for algorithmic decisions
Schoharie	71479	4.0	High Confidence	Safe for algorithmic decisions
Schuyler	61316	9.5	Moderate Confidence	Use with caution – monitor outcomes
Seneca	64050	5.2	Moderate Confidence	Use with caution – monitor outcomes
Steuben	62506	2.9	High Confidence	Safe for algorithmic decisions
Suffolk	122498	1.2	High Confidence	Safe for algorithmic decisions
Sullivan	67841	4.4	High Confidence	Safe for algorithmic decisions
Tioga	70427	4.0	High Confidence	Safe for algorithmic decisions
Tompkins	69995	4.0	High Confidence	Safe for algorithmic decisions
Ulster	77197	4.5	High Confidence	Safe for algorithmic decisions
Warren	74531	4.7	High Confidence	Safe for algorithmic decisions
Washington	68703	3.4	High Confidence	Safe for algorithmic decisions
Wayne	71007	3.1	High Confidence	Safe for algorithmic decisions
Westchester	114651	1.6	High Confidence	Safe for algorithmic decisions
Wyoming	65066	3.4	High Confidence	Safe for algorithmic decisions
Yates	63974	5.8	Moderate Confidence	Use with caution – monitor outcomes

recommendations_data %>% filter(reliability_category == "High Confidence")
recommendations_data %>% filter(reliability_category == "Moderate Confidence")
recommendations_data %>% filter(reliability_category == "Low Confidence")

Counties suitable for immediate algorithmic implementation: [All counties with an income MOE of 5% or less can be safely used for automated decisions. For example, Albany (2.6%), Allegany (3.3%), and Bronx (1.9%) all show low income MOEs, indicating strong reliability. Other high-confidence counties include Broome, Cattaraugus, Cayuga, Chautauqua, Chemung, Chenango, and Clinton.]
Counties requiring additional oversight: [Counties with MOE between 5% and 10% should be used with caution and monitored for stability. For instance, Essex (5.3%), Greene (6.2%), and Schuyler (9.5%) fall into this category, suggesting periodic review of outcomes. Seneca (5.2%) and Yates (5.8%) are also in this category.]
Counties needing alternative approaches: [Counties with MOE above 10% require manual review or supplemental data. Hamilton is the only such county, with an income MOE of 11.4%, which signals insufficient reliability for direct algorithmic use.]
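The three tiers can also be expressed as one small function, which makes the boundary behaviour easy to check. This is a base R sketch that mirrors the thresholds in the recommendation code above (at or below 5, at or below 10, otherwise low, with missing values treated as low confidence). The function name classify_income_moe is made up for this example, and the MOE values are taken from the decision framework table.

```r
# Map an income MOE percentage to the reliability tiers used in the table above.
# Missing values fall through to "Low Confidence", as in the case_when() fallback.
classify_income_moe <- function(moe_pct) {
  ifelse(is.na(moe_pct), "Low Confidence",
    ifelse(moe_pct <= 5, "High Confidence",
      ifelse(moe_pct <= 10, "Moderate Confidence", "Low Confidence")))
}

# MOE percentages copied from the decision framework table.
examples <- c(Albany = 2.6, Essex = 5.3, Schuyler = 9.5, Hamilton = 11.4)
print(classify_income_moe(examples))
```

Note that this rule places a county at exactly 5% in the high-confidence tier, whereas the section 2.2 classification uses a strict `< 5` cutoff and would label it "Moderate". No county in this dataset sits on the boundary, but the two rules are worth aligning.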

Questions for Further Investigation

[1. How do MOE patterns differ between urban and rural counties, and what does this mean for equitable decision-making?
2. Do time trends (e.g., comparing single-year vs. five-year ACS data) change the reliability patterns we observed?
3. Are smaller minority populations systematically associated with higher MOEs, and how might improved sampling address this?]

Technical Notes

Data Sources:
- U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
- Retrieved via tidycensus R package on [2025.09.20]

Reproducibility:
- All analysis conducted in R version [4.3.1, run in RStudio]
- Census API key required for replication
- Complete code and documentation available at: [your portfolio URL]

Methodology Notes: [Defined high MOE as >10% relative margin of error. Used county-level median income MOE% as the primary reliability measure.]

Limitations: [Tracts with small minority populations produce very high MOE percentages, which may lead to unreliable results.]