library(tidycensus)
library(tidyverse)
library(knitr)
# Read the key from an environment variable instead of hard-coding it in the document
census_api_key(Sys.getenv("CENSUS_API_KEY"))
Assignment 1: Census Data Quality for Policy Decisions
Evaluating Data Reliability for Algorithmic Decision-Making
Part 1: Portfolio Integration
- text: Assignments
  menu:
    - href: assignments/assignment_1/your_file_name.qmd
      text: "Assignment 1: Census Data Exploration"
Setup
State Selection: I have chosen [New York] for this analysis because: [I travelled to New York this month, so I’m curious about this state.]
Part 2: County-Level Resource Assessment
2.1 Data Retrieval
state_pop <- get_acs(
  geography = "county",
  variables = "B01003_001",
  state = "NY",
  year = 2022,
  survey = "acs5"
)
glimpse(state_pop)
Rows: 62
Columns: 5
$ GEOID <chr> "36001", "36003", "36005", "36007", "36009", "36011", "36013"…
$ NAME <chr> "Albany County, New York", "Allegany County, New York", "Bron…
$ variable <chr> "B01003_001", "B01003_001", "B01003_001", "B01003_001", "B010…
$ estimate <dbl> 315041, 47222, 1443229, 198365, 77000, 76171, 127440, 83584, …
$ moe <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
ny_data <- get_acs(
  geography = "county",
  variables = c(
    total_pop = "B01003_001",
    median_income = "B19013_001"
  ),
  state = "NY",
  year = 2022,
  output = "wide"
)
head(ny_data)
# A tibble: 6 × 6
GEOID NAME total_popE total_popM median_incomeE median_incomeM
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 36001 Albany County, New … 315041 NA 78829 2049
2 36003 Allegany County, Ne… 47222 NA 58725 1965
3 36005 Bronx County, New Y… 1443229 NA 47036 890
4 36007 Broome County, New … 198365 NA 58317 1761
5 36009 Cattaraugus County,… 77000 NA 56889 1778
6 36011 Cayuga County, New … 76171 NA 63227 2736
ny_clean <- ny_data %>%
  mutate(
    county_name = str_remove(NAME, ", New York"),
    county_name = str_remove(county_name, " County")
  )
2.2 Data Quality Assessment
select(ny_clean, NAME, county_name)
# A tibble: 62 × 2
NAME county_name
<chr> <chr>
1 Albany County, New York Albany
2 Allegany County, New York Allegany
3 Bronx County, New York Bronx
4 Broome County, New York Broome
5 Cattaraugus County, New York Cattaraugus
6 Cayuga County, New York Cayuga
7 Chautauqua County, New York Chautauqua
8 Chemung County, New York Chemung
9 Chenango County, New York Chenango
10 Clinton County, New York Clinton
# ℹ 52 more rows
ny_reliability <- ny_clean %>%
  mutate(
    moe_percentage = round((median_incomeM / median_incomeE) * 100, 2),
    reliability = case_when(
      moe_percentage < 5 ~ "High Confidence",
      moe_percentage >= 5 & moe_percentage <= 10 ~ "Moderate",
      moe_percentage > 10 ~ "Low Confidence"
    )
  )
count(ny_reliability, reliability)
# A tibble: 3 × 2
reliability n
<chr> <int>
1 High Confidence 56
2 Low Confidence 1
3 Moderate 5
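As a minimal sketch of the tiers used above, the base-R helper below recomputes the MOE percentage and reliability tier for the first three counties in the `head(ny_data)` output. The name `classify_moe` is hypothetical and not part of the original analysis; the thresholds simply mirror the `case_when()` call.

```r
# Hypothetical helper: relative MOE (in percent) and its reliability tier.
# Thresholds mirror the case_when() above: < 5 high, 5-10 moderate, > 10 low.
classify_moe <- function(moe, estimate) {
  # Guard against missing or zero estimates before dividing
  moe_pct <- ifelse(is.na(estimate) | estimate == 0, NA_real_, 100 * moe / estimate)
  tier <- ifelse(
    is.na(moe_pct), NA_character_,
    ifelse(moe_pct < 5, "High Confidence",
           ifelse(moe_pct <= 10, "Moderate", "Low Confidence"))
  )
  data.frame(moe_pct = round(moe_pct, 2), tier = tier)
}

# Median income MOEs and estimates for Albany, Allegany, and Bronx (from the output above)
res <- classify_moe(
  moe      = c(2049, 1965, 890),
  estimate = c(78829, 58725, 47036)
)
print(res)
```

All three counties land in the "High Confidence" tier, consistent with the county table.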
2.3 High Uncertainty Counties
high_uncertainty <- ny_reliability %>%
  filter(moe_percentage > 8) %>%
  arrange(desc(moe_percentage)) %>%
  select(county_name, median_incomeE, moe_percentage)
reliability_summary <- ny_reliability %>%
  group_by(reliability) %>%
  summarize(
    counties = n(),
    avg_income = round(mean(median_incomeE, na.rm = TRUE), 0)
  )
kable(high_uncertainty,
col.names = c("County", "Median Income", "MOE %"),
caption = "Counties with Highest Income Data Uncertainty",
format.args = list(big.mark = ","))
County | Median Income | MOE % |
---|---|---|
Hamilton | 66,891 | 11.39 |
Schuyler | 61,316 | 9.49 |
Data Quality Commentary:
[The table shows that counties such as Hamilton and Schuyler have high MOEs in their income data, which means algorithmic decisions based on these estimates may be less reliable for residents. Small populations and limited survey samples likely contribute to this uncertainty, so policies built on these estimates could treat people in these areas unfairly.]
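To make this uncertainty concrete: ACS margins of error are published at the 90% confidence level, so the estimate plus or minus the MOE gives an approximate 90% interval. A rough worked example for Hamilton, assuming the dollar MOE is back-calculated from the rounded 11.39% figure in the table:

```latex
\[
\text{MOE} \approx 0.1139 \times 66{,}891 \approx 7{,}619,
\qquad
66{,}891 \pm 7{,}619 \;\Rightarrow\; [\,59{,}272,\ 74{,}510\,]
\]
```

In other words, Hamilton's median income could plausibly sit anywhere in a range roughly $15,000 wide.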
Part 3: Neighborhood-Level Analysis
3.1 Focus Area Selection
selected_counties <- ny_reliability %>%
  filter(
    county_name %in% c("Albany", "Schuyler", "Hamilton")
  )
selected_counties %>%
  select(county_name, median_incomeE, moe_percentage, reliability)
# A tibble: 3 × 4
county_name median_incomeE moe_percentage reliability
<chr> <dbl> <dbl> <chr>
1 Albany 78829 2.6 High Confidence
2 Hamilton 66891 11.4 Low Confidence
3 Schuyler 61316 9.49 Moderate
Comment on the output: [These three counties reflect different levels of reliability. Albany has highly reliable income data with a very low MOE, while Hamilton’s data shows much higher uncertainty. Smaller populations and limited sample sizes likely reduce confidence in the data, which could lead to less stable estimates and algorithmic decisions.]
3.2 Tract-Level Demographics
selected_counties <- c("Albany", "Hamilton", "Schuyler")
ny_lookup <- tidycensus::fips_codes %>%
  filter(state == "NY", county %in% paste0(selected_counties, " County")) %>%
  distinct(county_code, county) %>%
  mutate(county_name = str_remove(county, " County")) %>%
  select(county_code, county_name)
race_tract <- get_acs(
  geography = "tract",
  variables = c(
    white = "B03002_003",
    black = "B03002_004",
    hispanic = "B03002_012",
    total_pop = "B03002_001"
  ),
  state = "NY",
  county = ny_lookup$county_code,
  year = 2022,
  survey = "acs5",
  output = "wide"
)
race_tract <- race_tract %>%
  mutate(
    white_pct = if_else(total_popE > 0, whiteE / total_popE, NA_real_),
    black_pct = if_else(total_popE > 0, blackE / total_popE, NA_real_),
    hispanic_pct = if_else(total_popE > 0, hispanicE / total_popE, NA_real_)
  )
race_tract <- race_tract %>%
  mutate(
    county_fips = str_sub(GEOID, 3, 5),
    tract_id = str_sub(GEOID, 6)
  ) %>%
  left_join(ny_lookup, by = c("county_fips" = "county_code")) %>%
  mutate(
    tract_label = paste0(county_name, " County – Tract ", tract_id)
  )
race_tract %>%
  select(GEOID, tract_label, county_name,
         total_popE, whiteE, blackE, hispanicE,
         white_pct, black_pct, hispanic_pct) %>%
  arrange(county_name, tract_label) %>%
  head()
# A tibble: 6 × 10
GEOID tract_label county_name total_popE whiteE blackE hispanicE white_pct
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 36001000… Albany Cou… Albany 2259 725 982 346 0.321
2 36001000… Albany Cou… Albany 2465 372 1742 174 0.151
3 36001000… Albany Cou… Albany 2374 317 1952 45 0.134
4 36001000… Albany Cou… Albany 2837 678 1271 673 0.239
5 36001000… Albany Cou… Albany 3200 1963 538 183 0.613
6 36001000… Albany Cou… Albany 2301 2012 134 98 0.874
# ℹ 2 more variables: black_pct <dbl>, hispanic_pct <dbl>
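The `str_sub()` calls above rely on the fixed layout of a tract GEOID: 2 digits for the state, 3 for the county, and 6 for the tract. A small base-R sketch of the same split, using an Albany County tract GEOID as the example input (the variable names here are illustrative):

```r
# An 11-digit tract GEOID = state FIPS (2) + county FIPS (3) + tract code (6)
geoid <- "36001002000"

state_fips  <- substr(geoid, 1, 2)             # state portion
county_fips <- substr(geoid, 3, 5)             # same positions as str_sub(GEOID, 3, 5)
tract_id    <- substr(geoid, 6, nchar(geoid))  # same as str_sub(GEOID, 6)

print(c(state = state_fips, county = county_fips, tract = tract_id))
```

Because the county portion is kept as a character string, leading zeros survive and the join against `county_code` in the lookup table matches as expected.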
3.3 Demographic Analysis
top_hispanic <- race_tract %>%
  arrange(desc(hispanic_pct)) %>%
  slice(1)
top_hispanic
# A tibble: 1 × 17
GEOID NAME whiteE whiteM blackE blackM hispanicE hispanicM total_popE
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 36001002000 Census… 2846 835 1460 445 1562 811 6576
# ℹ 8 more variables: total_popM <dbl>, white_pct <dbl>, black_pct <dbl>,
# hispanic_pct <dbl>, county_fips <chr>, tract_id <chr>, county_name <chr>,
# tract_label <chr>
county_summary <- race_tract %>%
  group_by(county_name) %>%
  summarize(
    n_tracts = n(),
    avg_white_pct = mean(white_pct, na.rm = TRUE),
    avg_black_pct = mean(black_pct, na.rm = TRUE),
    avg_hispanic_pct = mean(hispanic_pct, na.rm = TRUE)
  )
# Display the county-level summary rather than overwriting it with the tract table
county_summary %>%
  kable(digits = 2, caption = "Average Demographics by County (Albany, Hamilton, Schuyler)")
Part 4: Comprehensive Data Quality Evaluation
4.1 MOE Analysis for Demographic Variables
demographic_moe <- race_tract %>%
  mutate(
    white_moe_pct = if_else(is.na(whiteE) | whiteE == 0, NA_real_, 100 * whiteM / whiteE),
    black_moe_pct = if_else(is.na(blackE) | blackE == 0, NA_real_, 100 * blackM / blackE),
    hispanic_moe_pct = if_else(is.na(hispanicE) | hispanicE == 0, NA_real_, 100 * hispanicM / hispanicE)
  )
demographic_moe <- demographic_moe %>%
  mutate(
    high_moe_flag = if_else(
      white_moe_pct > 10 | black_moe_pct > 10 | hispanic_moe_pct > 10,
      TRUE, FALSE
    )
  )
summary_tracts <- demographic_moe %>%
  summarise(
    total_tracts = n(),
    high_moe_tracts = sum(high_moe_flag, na.rm = TRUE),
    pct_high_moe = 100 * mean(high_moe_flag, na.rm = TRUE)
  )
summary_tracts
# A tibble: 1 × 3
total_tracts high_moe_tracts pct_high_moe
<int> <int> <dbl>
1 94 94 100
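One detail worth checking in the flag above is how missing MOE percentages behave, since a group with a zero estimate gets `NA`. The base-R sketch below shows R's three-valued logic for `|`. The first row uses the Albany County Tract 000100 percentages reported in this document; the other two rows are hypothetical values chosen for illustration.

```r
# Row 1: Albany County Tract 000100 (from this report); rows 2-3 are hypothetical
white_moe_pct    <- c(46.97, 8, NA)
black_moe_pct    <- c(36.86, 9, 4)
hispanic_moe_pct <- c(62.71, 7, NA)

# TRUE if any group exceeds 10%; NA if nothing exceeds 10% but some value is missing
flag <- white_moe_pct > 10 | black_moe_pct > 10 | hispanic_moe_pct > 10
print(flag)

# na.rm = TRUE drops the undetermined row from both the count and the share
n_flagged   <- sum(flag, na.rm = TRUE)
pct_flagged <- 100 * mean(flag, na.rm = TRUE)
print(c(n_flagged = n_flagged, pct_flagged = pct_flagged))
```

A tract whose only non-missing values are below 10% is therefore excluded from the percentage rather than counted as reliable, which is worth keeping in mind when reading the 100% figure.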
4.2 Pattern Analysis
avg_characteristics <- demographic_moe %>%
  group_by(high_moe_flag) %>%
  summarise(
    avg_pop = mean(total_popE, na.rm = TRUE),
    avg_white = mean(white_pct, na.rm = TRUE),
    avg_black = mean(black_pct, na.rm = TRUE),
    avg_hispanic = mean(hispanic_pct, na.rm = TRUE),
    n_tracts = n()
  )
avg_characteristics
# A tibble: 1 × 6
high_moe_flag avg_pop avg_white avg_black avg_hispanic n_tracts
<lgl> <dbl> <dbl> <dbl> <dbl> <int>
1 TRUE 3596. 0.711 0.116 0.0607 94
Pattern Analysis: [Small tract populations lead to high MOEs, so the related estimates are less reliable. On average, tracts flagged as high MOE have only about 3,600 people, with minority groups such as Hispanic residents accounting for only about 6% of the total population.]
Part 5: Policy Recommendations
5.1 Analysis Integration and Professional Summary
Executive Summary Requirements: 1. Overall Pattern Identification: What are the systematic patterns across all your analyses? 2. Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings? 3. Root Cause Analysis: What underlying factors drive both data quality issues and bias risk? 4. Strategic Recommendations: What should the Department implement to address these systematic issues?
Executive Summary:
[All 94 census tracts have at least one demographic group with an MOE above 10%, so data reliability problems are widespread. For example, Albany County Tract 000100 has a White MOE of 46.97%, a Black MOE of 36.86%, and a Hispanic MOE of 62.71%. Hispanic populations consistently show the highest MOEs, suggesting this group faces the greatest data reliability challenges: in many tracts, Hispanic MOEs exceed 60% or even 100%, while White and Black MOEs are high but comparatively lower. High MOEs are largely driven by small populations in many tracts, which make reliable estimates difficult. On average, flagged tracts have only about 3,596 people, and Hispanic residents average just 6% of the tract population. Improving reliability requires either aggregating data or using multi-year ACS estimates to reduce sampling error. For example, combining smaller tracts or applying a population threshold before analysis would help minimize bias in Hispanic estimates.]
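The aggregation idea can be illustrated with the Census Bureau's standard approximation for the MOE of a sum of estimates: the square root of the sum of the squared MOEs. In this sketch the first tract uses the Hispanic estimate and MOE from the `top_hispanic` output (1,562 and 811); the second tract's values are hypothetical, and `aggregate_moe` is an illustrative helper rather than part of the original analysis.

```r
# Approximate MOE for a sum of estimates: sqrt(sum of squared MOEs)
aggregate_moe <- function(moes) sqrt(sum(moes^2))

estimates <- c(1562, 400)  # tract 1 from top_hispanic; tract 2 is hypothetical
moes      <- c(811, 200)   # tract 1 from top_hispanic; tract 2 is hypothetical

# Relative MOE (percent) for each tract on its own
individual_pct <- 100 * moes / estimates

# Relative MOE (percent) after combining the two tracts
combined_estimate <- sum(estimates)
combined_moe      <- aggregate_moe(moes)
combined_pct      <- 100 * combined_moe / combined_estimate

print(round(individual_pct, 1))
print(round(combined_pct, 1))
```

Under these assumptions the combined relative MOE comes out lower than either tract's on its own, which is the effect the recommendation relies on. tidycensus also provides `moe_sum()` for this calculation, which may be preferable inside a dplyr pipeline.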
5.2 Specific Recommendations
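library(dplyr)
library(kableExtra)
library(knitr)
# Assumption: use_kableExtra is not defined in the visible code, so it is set here
use_kableExtra <- requireNamespace("kableExtra", quietly = TRUE)
# Build a kable and apply kableExtra styling only when it is available
kbl_safe <- function(x, ...) {
  tbl <- knitr::kable(x, ...)
  if (use_kableExtra) {
    tbl <- tbl %>% kableExtra::kable_styling(
      full_width = FALSE,
      bootstrap_options = c("striped", "condensed")
    )
  }
  tbl
}
# Relative MOE in percent; NA when the estimate is missing or zero
moe_pct <- function(margin, estimate) {
  ifelse(is.na(estimate) | estimate == 0, NA_real_, 100 * margin / estimate)
}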
recommendations_data <- ny_reliability %>%
  mutate(
    income_moe_pct = if_else(is.na(median_incomeE) | median_incomeE == 0,
                             NA_real_, 100 * median_incomeM / median_incomeE),
    reliability_category = case_when(
      !is.na(income_moe_pct) & income_moe_pct <= 5 ~ "High Confidence",
      !is.na(income_moe_pct) & income_moe_pct <= 10 ~ "Moderate Confidence",
      TRUE ~ "Low Confidence"
    ),
    recommendation = case_when(
      reliability_category == "High Confidence" ~ "Safe for algorithmic decisions",
      reliability_category == "Moderate Confidence" ~ "Use with caution – monitor outcomes",
      reliability_category == "Low Confidence" ~ "Requires manual review or additional data"
    )
  ) %>%
  transmute(
    county_name,
    median_income = median_incomeE,
    income_moe_pct = round(income_moe_pct, 1),
    reliability_category,
    recommendation
  )
kable(recommendations_data,
col.names = c("County","Median Income ($)","Income MOE %","Reliability","Recommendation"),
caption = "Decision Framework for Algorithmic Implementation")
County | Median Income ($) | Income MOE % | Reliability | Recommendation |
---|---|---|---|---|
Albany | 78829 | 2.6 | High Confidence | Safe for algorithmic decisions |
Allegany | 58725 | 3.3 | High Confidence | Safe for algorithmic decisions |
Bronx | 47036 | 1.9 | High Confidence | Safe for algorithmic decisions |
Broome | 58317 | 3.0 | High Confidence | Safe for algorithmic decisions |
Cattaraugus | 56889 | 3.1 | High Confidence | Safe for algorithmic decisions |
Cayuga | 63227 | 4.3 | High Confidence | Safe for algorithmic decisions |
Chautauqua | 54625 | 3.2 | High Confidence | Safe for algorithmic decisions |
Chemung | 61358 | 4.0 | High Confidence | Safe for algorithmic decisions |
Chenango | 61741 | 4.1 | High Confidence | Safe for algorithmic decisions |
Clinton | 67097 | 4.2 | High Confidence | Safe for algorithmic decisions |
Columbia | 81741 | 3.4 | High Confidence | Safe for algorithmic decisions |
Cortland | 65029 | 4.4 | High Confidence | Safe for algorithmic decisions |
Delaware | 58338 | 3.7 | High Confidence | Safe for algorithmic decisions |
Dutchess | 94578 | 2.7 | High Confidence | Safe for algorithmic decisions |
Erie | 68014 | 1.2 | High Confidence | Safe for algorithmic decisions |
Essex | 68090 | 5.3 | Moderate Confidence | Use with caution – monitor outcomes |
Franklin | 60270 | 4.8 | High Confidence | Safe for algorithmic decisions |
Fulton | 60557 | 4.4 | High Confidence | Safe for algorithmic decisions |
Genesee | 68178 | 4.6 | High Confidence | Safe for algorithmic decisions |
Greene | 70294 | 6.2 | Moderate Confidence | Use with caution – monitor outcomes |
Hamilton | 66891 | 11.4 | Low Confidence | Requires manual review or additional data |
Herkimer | 68104 | 4.8 | High Confidence | Safe for algorithmic decisions |
Jefferson | 62782 | 3.6 | High Confidence | Safe for algorithmic decisions |
Kings | 74692 | 1.3 | High Confidence | Safe for algorithmic decisions |
Lewis | 64401 | 4.2 | High Confidence | Safe for algorithmic decisions |
Livingston | 70443 | 4.0 | High Confidence | Safe for algorithmic decisions |
Madison | 68869 | 4.0 | High Confidence | Safe for algorithmic decisions |
Monroe | 71450 | 1.4 | High Confidence | Safe for algorithmic decisions |
Montgomery | 58033 | 3.6 | High Confidence | Safe for algorithmic decisions |
Nassau | 137709 | 1.4 | High Confidence | Safe for algorithmic decisions |
New York | 99880 | 1.8 | High Confidence | Safe for algorithmic decisions |
Niagara | 65882 | 2.7 | High Confidence | Safe for algorithmic decisions |
Oneida | 66402 | 3.3 | High Confidence | Safe for algorithmic decisions |
Onondaga | 71479 | 1.6 | High Confidence | Safe for algorithmic decisions |
Ontario | 76603 | 2.9 | High Confidence | Safe for algorithmic decisions |
Orange | 91806 | 1.9 | High Confidence | Safe for algorithmic decisions |
Orleans | 61069 | 4.9 | High Confidence | Safe for algorithmic decisions |
Oswego | 65054 | 3.3 | High Confidence | Safe for algorithmic decisions |
Otsego | 65778 | 4.5 | High Confidence | Safe for algorithmic decisions |
Putnam | 120970 | 4.0 | High Confidence | Safe for algorithmic decisions |
Queens | 82431 | 1.1 | High Confidence | Safe for algorithmic decisions |
Rensselaer | 83734 | 2.3 | High Confidence | Safe for algorithmic decisions |
Richmond | 96185 | 2.6 | High Confidence | Safe for algorithmic decisions |
Rockland | 106173 | 2.9 | High Confidence | Safe for algorithmic decisions |
St. Lawrence | 58339 | 3.5 | High Confidence | Safe for algorithmic decisions |
Saratoga | 97038 | 2.3 | High Confidence | Safe for algorithmic decisions |
Schenectady | 75056 | 3.0 | High Confidence | Safe for algorithmic decisions |
Schoharie | 71479 | 4.0 | High Confidence | Safe for algorithmic decisions |
Schuyler | 61316 | 9.5 | Moderate Confidence | Use with caution – monitor outcomes |
Seneca | 64050 | 5.2 | Moderate Confidence | Use with caution – monitor outcomes |
Steuben | 62506 | 2.9 | High Confidence | Safe for algorithmic decisions |
Suffolk | 122498 | 1.2 | High Confidence | Safe for algorithmic decisions |
Sullivan | 67841 | 4.4 | High Confidence | Safe for algorithmic decisions |
Tioga | 70427 | 4.0 | High Confidence | Safe for algorithmic decisions |
Tompkins | 69995 | 4.0 | High Confidence | Safe for algorithmic decisions |
Ulster | 77197 | 4.5 | High Confidence | Safe for algorithmic decisions |
Warren | 74531 | 4.7 | High Confidence | Safe for algorithmic decisions |
Washington | 68703 | 3.4 | High Confidence | Safe for algorithmic decisions |
Wayne | 71007 | 3.1 | High Confidence | Safe for algorithmic decisions |
Westchester | 114651 | 1.6 | High Confidence | Safe for algorithmic decisions |
Wyoming | 65066 | 3.4 | High Confidence | Safe for algorithmic decisions |
Yates | 63974 | 5.8 | Moderate Confidence | Use with caution – monitor outcomes |
recommendations_data %>% filter(reliability_category == "High Confidence")
recommendations_data %>% filter(reliability_category == "Moderate Confidence")
recommendations_data %>% filter(reliability_category == "Low Confidence")
Counties suitable for immediate algorithmic implementation: [All counties with an income MOE below 5% can be used for automated decisions. For example, Albany (2.6%), Allegany (3.3%), and Bronx (1.9%) all show low income MOEs, indicating strong reliability. Other high-confidence counties include Broome, Cattaraugus, Cayuga, Chautauqua, Chemung, Chenango, and Clinton.]
Counties requiring additional oversight: [Counties with an MOE between 5% and 10% should be used with caution and monitored for stability. Essex (5.3%), Greene (6.2%), Schuyler (9.5%), Seneca (5.2%), and Yates (5.8%) fall into this category, which suggests periodic review of outcomes.]
Counties needing alternative approaches: [Counties with an MOE above 10% require manual review or supplemental data. Only Hamilton falls into this category: its income MOE of 11.4% signals insufficient reliability for direct algorithmic use.]
Questions for Further Investigation
[1. How do MOE patterns differ between urban and rural counties, and what does this mean for equitable decision-making? 2. Do time trends (e.g., comparing single-year vs. five-year ACS data) change the reliability patterns we observed? 3. Are smaller minority populations systematically associated with higher MOEs, and how might improved sampling address this?]
Technical Notes
Data Sources: - U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates - Retrieved via tidycensus R package on [2025.09.20]
Reproducibility: - All analysis conducted in R version [R Studio 4.3.1] - Census API key required for replication - Complete code and documentation available at: [your portfolio URL]
Methodology Notes: [Defined high MOE as >10% relative margin of error. Used county-level median income MOE% as the primary reliability measure.]
Limitations: [Tracts with small minority populations produce very high MOE percentages, which may lead to unreliable results.]