Assignment 1: Census Data Quality for Policy Decisions

Evaluating Data Reliability for Algorithmic Decision-Making

Author

Sihan Yu

Published

October 14, 2025

Assignment Overview

Scenario

You are a data analyst for the [Your State] Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.

Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.

Learning Objectives

  • Apply dplyr functions to real census data for policy analysis
  • Evaluate data quality using margins of error
  • Connect technical analysis to algorithmic decision-making
  • Identify potential equity implications of data reliability issues
  • Create professional documentation for policy stakeholders

Submission Instructions

Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/

Make sure to update your _quarto.yml navigation to include this assignment under an “Assignments” menu.

Part 1: Portfolio Integration

Create this assignment in your portfolio repository under an assignments/assignment_1/ folder structure. Update your navigation menu to include:

- text: Assignments
  menu:
    - href: assignments/assignment_1/your_file_name.qmd
      text: "Assignment 1: Census Data Exploration"

If the text contains a special character such as a colon or comma, wrap it in double quotes so that Quarto reads it as text.

Setup

# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
library(tidycensus)
library(tidyverse)
library(knitr)
# Set your Census API key (read from an environment variable so the key is not published)
census_api_key(Sys.getenv("CENSUS_API_KEY"))
# Choose your state for analysis - assign it to a variable called my_state
my_state <- "New York"

State Selection: I have chosen New York for this analysis because it has a large population, which makes it interesting to study how census data reflects socioeconomic patterns at a large scale.

Part 2: County-Level Resource Assessment

2.1 Data Retrieval

Your Task: Use get_acs() to retrieve county-level data for your chosen state.

Requirements:

  • Geography: county level
  • Variables: median household income (B19013_001) and total population (B01003_001)
  • Year: 2022
  • Survey: acs5
  • Output format: wide

Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.

# Write your get_acs() code here
county_data <- get_acs(
  geography = "county",
  variables = c(Median_income = "B19013_001",
                population = "B01003_001"
                ),
  state = "NY",
  year = 2022,
  survey = "acs5",
  output = "wide"
  
)
# Clean the county names to remove state name and "County" 
# Hint: use mutate() with str_remove()
county_data <- county_data %>%
  mutate(
    NAME = NAME %>%
      str_remove(", New York") %>%
      str_remove("County")
  )
# Display the first few rows
head(county_data)
# A tibble: 6 × 6
  GEOID NAME           Median_incomeE Median_incomeM populationE populationM
  <chr> <chr>                   <dbl>          <dbl>       <dbl>       <dbl>
1 36001 "Albany "               78829           2049      315041          NA
2 36003 "Allegany "             58725           1965       47222          NA
3 36005 "Bronx "                47036            890     1443229          NA
4 36007 "Broome "               58317           1761      198365          NA
5 36009 "Cattaraugus "          56889           1778       77000          NA
6 36011 "Cayuga "               63227           2736       76171          NA

2.2 Data Quality Assessment

Your Task: Calculate margin of error percentages and create reliability categories.

Requirements:

  • Calculate MOE percentage: (margin of error / estimate) * 100
  • Create reliability categories:
      • High Confidence: MOE < 5%
      • Moderate Confidence: MOE 5-10%
      • Low Confidence: MOE > 10%
  • Create a flag for unreliable estimates (MOE > 10%)

Hint: Use mutate() with case_when() for the categories.

# Calculate MOE percentage and reliability categories using mutate()
county_data <- county_data %>%
  mutate(
    MOE_percentage = (Median_incomeM / Median_incomeE) * 100,
    # Conditions are checked in order, so values of exactly 5 or 10
    # land in "Moderate confidence" rather than being left as NA
    reliability = case_when(
      MOE_percentage < 5   ~ "High confidence",
      MOE_percentage <= 10 ~ "Moderate confidence",
      MOE_percentage > 10  ~ "Low confidence"
    ),
    # Flag counties whose MOE exceeds 10% of the estimate
    unreliable_flag = MOE_percentage > 10
  )
county_data
# A tibble: 62 × 9
   GEOID NAME           Median_incomeE Median_incomeM populationE populationM
   <chr> <chr>                   <dbl>          <dbl>       <dbl>       <dbl>
 1 36001 "Albany "               78829           2049      315041          NA
 2 36003 "Allegany "             58725           1965       47222          NA
 3 36005 "Bronx "                47036            890     1443229          NA
 4 36007 "Broome "               58317           1761      198365          NA
 5 36009 "Cattaraugus "          56889           1778       77000          NA
 6 36011 "Cayuga "               63227           2736       76171          NA
 7 36013 "Chautauqua "           54625           1754      127440          NA
 8 36015 "Chemung "              61358           2475       83584          NA
 9 36017 "Chenango "             61741           2526       47096          NA
10 36019 "Clinton "              67097           2802       79839          NA
# ℹ 52 more rows
# ℹ 3 more variables: MOE_percentage <dbl>, reliability <chr>,
#   unreliable_flag <lgl>
# Create a summary showing count of counties in each reliability category
# Hint: use count() and mutate() to add percentages
library(dplyr)
library(knitr)

reliability_summary <- county_data %>%
  count(reliability) %>%                    
  mutate(percentage = n / sum(n) * 100)     
reliability_summary
# A tibble: 3 × 3
  reliability             n percentage
  <chr>               <int>      <dbl>
1 High confidence        56      90.3 
2 Low confidence          1       1.61
3 Moderate confidence     5       8.06
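
One detail worth checking in reliability categories like these is how values that fall exactly on a cutoff (5% or 10%) are labeled. The sketch below is a minimal check, not part of the assignment: `classify_reliability()` is a hypothetical helper name, and the test values are made up rather than taken from the ACS.

```r
library(dplyr)

# Hypothetical helper: assign a reliability label from an MOE percentage.
# case_when() evaluates conditions in order, so each branch only needs an upper bound.
classify_reliability <- function(moe_pct) {
  case_when(
    moe_pct < 5   ~ "High confidence",
    moe_pct <= 10 ~ "Moderate confidence",
    moe_pct > 10  ~ "Low confidence"
  )
}

# Made-up test values, including both boundaries (5 and 10)
test_values <- c(4.9, 5, 7.5, 10, 10.1)
labels <- classify_reliability(test_values)

print(data.frame(moe_pct = test_values, reliability = labels))
```

If any label came back as NA, it would mean a boundary value slipped between two conditions.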

2.3 High Uncertainty Counties

Your Task: Identify the 5 counties with the highest MOE percentages.

Requirements:

  • Sort by MOE percentage (highest first)
  • Select the top 5 counties
  • Display: county name, median income, margin of error, MOE percentage, reliability category
  • Format as a professional table using kable()

Hint: Use arrange(), slice(), and select() functions.

# Create table of top 5 counties by MOE percentage
top_MOE <- county_data %>%
  arrange(desc(MOE_percentage)) %>%
  slice(1:5) %>%
  select(
  County = NAME,
  Median_Income = Median_incomeE,
  MOE = Median_incomeM,
  MOE_percentage = MOE_percentage,
  Reliability = reliability
  )
# Format as table with kable() - include appropriate column names and caption
kable(
  top_MOE,
  caption = "Top 5 Counties with Highest Income MOE Percentage",
  col.names = c(
    "County",
    "Median Income (Estimate)",
    "Median Income (MOE)",
    "MOE Percentage",
    "Reliability Category"
  ),
  digits = 1 
)
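Top 5 Counties with Highest Income MOE Percentage

| County   | Median Income (Estimate) | Median Income (MOE) | MOE Percentage | Reliability Category |
|----------|--------------------------|---------------------|----------------|----------------------|
| Hamilton | 66891                    | 7622                | 11.4           | Low confidence       |
| Schuyler | 61316                    | 5818                | 9.5            | Moderate confidence  |
| Greene   | 70294                    | 4341                | 6.2            | Moderate confidence  |
| Yates    | 63974                    | 3733                | 5.8            | Moderate confidence  |
| Essex    | 68090                    | 3590                | 5.3            | Moderate confidence  |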

Data Quality Commentary: The table lists the five counties with the highest MOE percentages for median household income. Hamilton County has the highest at 11.4% and is the only county above the 10% cutoff; the other four fall between 5% and 10%. The most likely cause is sample size: Hamilton is the least populous county in New York, so the ACS surveys relatively few households there. High income variability within a county can widen the margin further. Algorithms that treat these median income estimates as exact may rank or prioritize these counties inaccurately.
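
To make the Hamilton County figure concrete, the sketch below works through the arithmetic using the estimate and margin of error from the table. It relies on one fact not stated in the assignment: ACS margins of error are published at the 90 percent confidence level, so dividing by 1.645 gives an approximate standard error. The coefficient of variation is shown only as an additional way to express the same uncertainty.

```r
# Values for Hamilton County taken from the top-5 table
estimate <- 66891   # median household income estimate
moe      <- 7622    # published margin of error (90% confidence level)

# MOE as a percentage of the estimate (the measure used in this assignment)
moe_pct <- moe / estimate * 100

# Approximate standard error: ACS MOEs correspond to a z-value of 1.645
se <- moe / 1.645

# Coefficient of variation: standard error relative to the estimate, in percent
cv <- se / estimate * 100

# 90% confidence interval implied by the published MOE
ci <- c(lower = estimate - moe, upper = estimate + moe)

print(round(c(moe_pct = moe_pct, se = se, cv = cv), 2))
print(ci)
```

The interval spans more than 15,000 dollars, which is wide enough to move Hamilton across several positions in a ranking of counties by income.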

Part 3: Neighborhood-Level Analysis

3.1 Focus Area Selection

Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.

Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.

# Use filter() to select 2-3 counties from your county_reliability data
# Store the selected counties in a variable called selected_counties
my_counties <- c("Orange", "Yates", "Hamilton")
selected_counties <- county_data %>%
  filter(str_detect(NAME, paste(my_counties, collapse = "|")))

# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category
selected_counties %>%
  select(
    County = NAME,
    Median_income = Median_incomeE,
    MOE_percentage,
    reliability
  ) %>%
  kable(caption = "Selected Counties with Median Income, MOE %, and Reliability")
Selected Counties with Median Income, MOE %, and Reliability

| County   | Median_income | MOE_percentage | reliability         |
|----------|---------------|----------------|---------------------|
| Hamilton | 66891         | 11.394657      | Low confidence      |
| Orange   | 91806         | 1.939960       | High confidence     |
| Yates    | 63974         | 5.835183       | Moderate confidence |

Comment on the output: Orange County has the highest median income of the three and a low MOE percentage, so its estimate is the most reliable. Yates and Hamilton have lower median incomes and larger MOE percentages, which reduces confidence in their estimates. Among these three counties, the wealthier county has the more reliable data, although three counties are too few to generalize from.

3.2 Tract-Level Demographics

Your Task: Get demographic data for census tracts in your selected counties.

Requirements:

  • Geography: tract level
  • Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
  • Use the same state and year as before
  • Output format: wide
  • Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.

# Define your race/ethnicity variables with descriptive names

race = c(
  White = "B03002_003",
  Black = "B03002_004",
  Hispanic = "B03002_012",
  total_pop ="B03002_001"
  )

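# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter
# No county argument is passed here, so every tract in the state is returned;
# the selected counties are filtered later using county_name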
tract_data = get_acs(
  geography = "tract",
  variables = race,
  state = "NY",
  year = 2022,
  survey = "acs5",
  output = "wide" 
)
# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations
tract_data = tract_data %>%
  mutate(
    white_p = 100 * WhiteE / total_popE,
    Black_p = 100 * BlackE / total_popE,
    Hispanic_p = 100 * HispanicE / total_popE,
  )


# Add readable tract and county name columns using str_extract() or similar
tract_data <- tract_data %>%
  mutate(
    tract_name  = str_replace(NAME, ";\\s*[^,]+ County;\\s*[^,]+$", ""),
    county_name = str_extract(NAME, ";\\s*[^,]+ County") %>% 
                  str_remove("^;\\s*") %>% 
                  str_remove("\\s*County$")                             
  )

kable(head(tract_data))
| GEOID | NAME | WhiteE | WhiteM | BlackE | BlackM | HispanicE | HispanicM | total_popE | total_popM | white_p | Black_p | Hispanic_p | tract_name | county_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 36001000100 | Census Tract 1; Albany County; New York | 725 | 340 | 982 | 362 | 346 | 217 | 2259 | 512 | 32.09385 | 43.470562 | 15.316512 | Census Tract 1 | Albany |
| 36001000201 | Census Tract 2.01; Albany County; New York | 372 | 198 | 1742 | 613 | 174 | 156 | 2465 | 608 | 15.09128 | 70.669371 | 7.058823 | Census Tract 2.01 | Albany |
| 36001000202 | Census Tract 2.02; Albany County; New York | 317 | 193 | 1952 | 684 | 45 | 68 | 2374 | 668 | 13.35299 | 82.224094 | 1.895535 | Census Tract 2.02 | Albany |
| 36001000301 | Census Tract 3.01; Albany County; New York | 678 | 431 | 1271 | 480 | 673 | 278 | 2837 | 581 | 23.89848 | 44.800846 | 23.722242 | Census Tract 3.01 | Albany |
| 36001000302 | Census Tract 3.02; Albany County; New York | 1963 | 496 | 538 | 345 | 183 | 113 | 3200 | 500 | 61.34375 | 16.812500 | 5.718750 | Census Tract 3.02 | Albany |
| 36001000401 | Census Tract 4.01; Albany County; New York | 2012 | 366 | 134 | 92 | 98 | 78 | 2301 | 399 | 87.44024 | 5.823555 | 4.259018 | Census Tract 4.01 | Albany |
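
The challenge in this task mentions county codes. The retrieval in this section requests every tract in the state; a narrower alternative is to derive the three-digit county FIPS codes from the county GEOIDs and pass them to the `county` argument of `get_acs()`. The sketch below assumes a small stand-in data frame (`selected`) in place of the real `selected_counties` object, and the API call is wrapped in `if (FALSE)` because it needs tidycensus and a Census API key to run.

```r
# Stand-in for selected_counties: a county GEOID is the two-digit state FIPS
# code followed by the three-digit county FIPS code
selected <- data.frame(
  GEOID = c("36041", "36071", "36123"),
  NAME  = c("Hamilton", "Orange", "Yates")
)

# Characters 3-5 of a county GEOID are the county FIPS code
county_codes <- substr(selected$GEOID, 3, 5)
print(county_codes)

# Sketch of the narrower request (not run here: needs tidycensus and an API key)
if (FALSE) {
  tract_data_selected <- tidycensus::get_acs(
    geography = "tract",
    variables = c(total_pop = "B03002_001"),
    state     = "NY",
    county    = county_codes,
    year      = 2022,
    survey    = "acs5",
    output    = "wide"
  )
}
```

Requesting only the selected counties keeps the download small, while the statewide request used in this analysis makes the later statewide MOE summary possible.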

3.3 Demographic Analysis

Your Task: Analyze the demographic patterns in your selected areas.

# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract
top_hispanic <- tract_data %>%
  filter(county_name %in% my_counties)%>%
  arrange(desc(Hispanic_p)) %>%
  slice(1)
# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group

county_summary <- tract_data %>%
  filter(county_name %in% my_counties) %>% 
  group_by(county_name) %>%
  summarize(
    num_tracts      = n(),
    avg_white_pct   = mean(white_p),
    avg_black_pct   = mean(Black_p),
    avg_hispanic_pct= mean(Hispanic_p)
  )


# Create a nicely formatted table of your results using kable()
kable(
  top_hispanic,
  caption = "Tract with Highest Hispanic/Latino Percentage (Selected Counties)"
)
Tract with Highest Hispanic/Latino Percentage (Selected Counties)

| GEOID | NAME | WhiteE | WhiteM | BlackE | BlackM | HispanicE | HispanicM | total_popE | total_popM | white_p | Black_p | Hispanic_p | tract_name | county_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 36071000501 | Census Tract 5.01; Orange County; New York | 246 | 136 | 362 | 274 | 1723 | 631 | 2473 | 681 | 9.947432 | 14.63809 | 69.67246 | Census Tract 5.01 | Orange |
kable(county_summary, caption = "Average Demographic Percentages by County")
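Average Demographic Percentages by County

| county_name | num_tracts | avg_white_pct | avg_black_pct | avg_hispanic_pct |
|-------------|------------|---------------|---------------|------------------|
| Hamilton    | 4          | 91.87527      | 1.240264      | 2.038062         |
| Orange      | 92         | 59.23838      | 10.868294     | 23.054479        |
| Yates       | 8          | 94.17341      | 0.520878      | 2.116160         |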
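
The county averages here are unweighted means of tract percentages, so a tract with 500 residents counts as much as one with 5,000. The sketch below contrasts that with a population-weighted share. It reuses the White and total population estimates of the six Albany County tracts displayed in section 3.2 purely as sample input; the column name `weighted_white_pct` is an illustrative choice, not part of the assignment.

```r
library(dplyr)

# Sample input: six Albany County tracts shown in section 3.2
tracts <- tibble(
  county_name = "Albany",
  WhiteE      = c(725, 372, 317, 678, 1963, 2012),
  total_popE  = c(2259, 2465, 2374, 2837, 3200, 2301)
)

comparison <- tracts %>%
  group_by(county_name) %>%
  summarize(
    num_tracts         = n(),
    # Unweighted: average of the tract-level percentages
    avg_white_pct      = mean(100 * WhiteE / total_popE),
    # Weighted: share of the combined population across tracts
    weighted_white_pct = 100 * sum(WhiteE) / sum(total_popE)
  )

print(comparison)
```

The two figures differ whenever tract populations are unequal; which one is appropriate depends on whether the unit of interest is the typical tract or the typical resident.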

Part 4: Comprehensive Data Quality Evaluation

4.1 MOE Analysis for Demographic Variables

Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.

Requirements:

  • Calculate MOE percentages for each demographic variable
  • Flag tracts where any demographic variable has MOE > 15%
  • Create summary statistics

# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)

# Keep only tracts with a nonzero estimate for every group,
# since the MOE percentage is undefined when the estimate is 0
tract_data_filtered <- tract_data %>%
  filter(WhiteE > 0, BlackE > 0, HispanicE > 0)

tract_data_filtered <- tract_data_filtered %>%
  mutate(
    white_moe    = 100 * WhiteM / WhiteE,
    Black_moe    = 100 * BlackM / BlackE,
    Hispanic_moe = 100 * HispanicM / HispanicE
  ) %>%
  # Drop tracts where any group's MOE is at least as large as its estimate
  filter(
    white_moe < 100,
    Black_moe < 100,
    Hispanic_moe < 100
  )

# Create a flag for tracts with high MOE on any demographic variable
# Use logical operators (| for OR) in an ifelse() statement
tract_moe <- tract_data_filtered %>%
  mutate(
    high_moe = ifelse(
      white_moe > 15 | Black_moe > 15 | Hispanic_moe > 15,
      TRUE, FALSE
    )
  )

# Create summary statistics showing how many tracts have data quality issues
# (named overall_summary to avoid masking base R's summary() function)
overall_summary <- tract_moe %>%
  summarize(
    total_tracts    = n(),
    tracts_high_moe = sum(high_moe),
    pct_high_moe    = sum(high_moe) / n() * 100
  )

summary_county <- tract_moe %>%
  filter(county_name %in% my_counties) %>%
  group_by(county_name) %>%
  summarize(
    total_tracts    = n(),
    tracts_high_moe = sum(high_moe),
    pct_high_moe    = sum(high_moe) / n() * 100
  )
kable(overall_summary, caption = "Overall Data Quality: MOE Analysis")
Overall Data Quality: MOE Analysis

| total_tracts | tracts_high_moe | pct_high_moe |
|--------------|-----------------|--------------|
| 2669         | 2669            | 100          |
summary_county %>%
  kable(
    caption = "Summary of tracts with high MOE (>15%)",
    col.names = c("County", "N tracts", "High-MOE tracts", "% High-MOE")
  ) 
Summary of tracts with high MOE (>15%)

| County   | N tracts | High-MOE tracts | % High-MOE |
|----------|----------|-----------------|------------|
| Hamilton | 1        | 1               | 100        |
| Orange   | 68       | 68              | 100        |
| Yates    | 2        | 2               | 100        |
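
One common way to reduce tract-level uncertainty is to combine neighboring tracts before analysis. The sketch below illustrates the idea with the Hispanic/Latino estimates for three Albany County tracts shown in section 3.2, using the standard root-sum-of-squares approximation for the margin of error of a sum. It is an illustration under that approximation, not a step performed in this analysis; tidycensus offers `moe_sum()` for the same kind of calculation.

```r
# Hispanic/Latino estimates and MOEs for three Albany County tracts (section 3.2)
hispanic_est <- c(346, 174, 45)
hispanic_moe <- c(217, 156, 68)

# Tract-level MOE percentages
tract_moe_pct <- 100 * hispanic_moe / hispanic_est

# Combined estimate and its approximate MOE (root sum of squares of the tract MOEs)
combined_est     <- sum(hispanic_est)
combined_moe     <- sqrt(sum(hispanic_moe^2))
combined_moe_pct <- 100 * combined_moe / combined_est

print(round(tract_moe_pct, 1))
print(round(c(estimate = combined_est, moe = combined_moe, moe_pct = combined_moe_pct), 1))
```

The combined MOE percentage is lower than that of any single tract, which is the trade-off behind aggregation: less geographic detail in exchange for more stable estimates.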

4.2 Pattern Analysis

Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.

# Group tracts by whether they have high MOE issues
# Calculate average characteristics for each group:
# - population size, demographic percentages
# Use group_by() and summarize() to create this comparison

tract_moe_changed = tract_data_filtered %>% 
  mutate(
    high_moe = ifelse(
      white_moe >50 | Black_moe >50 |Hispanic_moe >50, 
      TRUE, FALSE
    )
  )

pattern_analysis <- tract_moe_changed %>%
  mutate(high_moe_flag = ifelse(high_moe, "High MOE", "Low MOE")) %>%
  group_by(high_moe_flag) %>%
  summarize(
    num_tracts       = n(),
    avg_population   = mean(total_popE, na.rm = TRUE),
    avg_white_pct    = mean(WhiteE / total_popE * 100, na.rm = TRUE),
    avg_black_pct    = mean(BlackE / total_popE * 100, na.rm = TRUE),
    avg_hispanic_pct = mean(HispanicE / total_popE * 100, na.rm = TRUE)
  )


# Create a professional table showing the patterns
pattern_analysis %>%
  kable(
    digits = 1,
    col.names = c("High MOE Flag", "N tracts", "Avg Pop", "Avg % White", "Avg % Black", "Avg % Hispanic"),
    caption = "Comparison of tract characteristics by data quality group"
  )
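Comparison of tract characteristics by data quality group

| High MOE Flag | N tracts | Avg Pop | Avg % White | Avg % Black | Avg % Hispanic |
|---------------|----------|---------|-------------|-------------|----------------|
| High MOE      | 2141     | 3989.4  | 47.7        | 17.0        | 21.5           |
| Low MOE       | 528      | 4564.0  | 35.3        | 25.7        | 27.6           |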

Pattern Analysis: With the 15% MOE threshold, every tract was flagged, which left only one group and made comparison impossible. To enable analysis, the threshold was raised to 50%, yielding two groups: High MOE (2,141 tracts, about 80%) and Low MOE (528 tracts). Even under this more lenient criterion, most tracts remain classified as High MOE, which points to substantial reliability concerns for tract-level race and ethnicity estimates. High-MOE tracts tend to have smaller populations, higher proportions of White residents, and lower proportions of Black and Hispanic residents. This runs counter to the initial expectation that more racially diverse areas would exhibit greater uncertainty. One likely explanation is that a tract is flagged when any of the three groups exceeds the threshold, and in predominantly White tracts the Black and Hispanic estimates are small counts with large relative margins of error. The comparison covers only the 2,669 tracts that remained after dropping tracts with a zero estimate or an MOE of 100% or more for any group.
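
A possible mechanism behind this pattern is that the flag is triggered by whichever group has the least precise estimate in a tract, and that is usually the smallest group. The sketch below checks this on two majority-White Albany County tracts shown in section 3.2. Two tracts are only an illustration and not evidence for the statewide pattern; the column names `worst_group` and `max_moe_pct` are illustrative.

```r
# Estimates and MOEs for two Albany County tracts shown in section 3.2
tracts <- data.frame(
  tract     = c("Census Tract 3.02", "Census Tract 4.01"),
  WhiteE    = c(1963, 2012), WhiteM    = c(496, 366),
  BlackE    = c(538, 134),   BlackM    = c(345, 92),
  HispanicE = c(183, 98),    HispanicM = c(113, 78)
)

# MOE percentage for each group, one row per tract
moe_pct <- with(tracts, cbind(
  White    = 100 * WhiteM / WhiteE,
  Black    = 100 * BlackM / BlackE,
  Hispanic = 100 * HispanicM / HispanicE
))

# Which group has the largest MOE percentage, and does it exceed the 50% threshold?
tracts$worst_group <- colnames(moe_pct)[max.col(moe_pct, ties.method = "first")]
tracts$max_moe_pct <- apply(moe_pct, 1, max)
tracts$high_moe    <- tracts$max_moe_pct > 50

print(tracts[, c("tract", "worst_group", "max_moe_pct", "high_moe")])
```

In both tracts the White estimate is well under the 50% threshold, and the flag is set by a much smaller Black or Hispanic estimate.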

Part 5: Policy Recommendations

5.1 Analysis Integration and Professional Summary

Your Task: Write an executive summary that integrates findings from all four analyses.

Executive Summary Requirements:

  1. Overall Pattern Identification: What are the systematic patterns across all your analyses?
  2. Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings?
  3. Root Cause Analysis: What underlying factors drive both data quality issues and bias risk?
  4. Strategic Recommendations: What should the Department implement to address these systematic issues?

Executive Summary: The analyses reveal consistent geographic and demographic disparities in data quality across counties and tracts. At the county level, 56 of 62 counties (about 90%) have high-confidence median income estimates. Larger, higher-income counties such as Orange (median income ≈ $91,806, MOE 1.9%) show high reliability, while small rural counties such as Hamilton (≈ $66,891, MOE 11.4%) show low confidence. Income alone does not determine reliability: the Bronx has the lowest median income in the state (≈ $47,036) yet an MOE of only 1.9%. Across tracts, High-MOE areas tend to have smaller populations (≈ 3,989 per tract on average) and higher percentages of White residents, whereas Low-MOE areas tend to be larger and more diverse.

Communities at greatest risk of algorithmic bias are not necessarily those in tracts with higher minority percentages, but rather those in small, high-MOE, predominantly White tracts. These high-uncertainty areas are more likely to produce unreliable estimates, so an algorithm could allocate resources unevenly in less diverse areas. Low-MOE, more diverse tracts are comparatively better represented in the data, which reduces the risk of bias for minority communities living there.

High margins of error (MOE) and data unreliability in New York State are driven by several factors. Population size and demographic homogeneity play a major role: small, predominantly White tracts are more likely to exhibit high MOE, which amplifies uncertainty for algorithmic applications. Sampling variability contributes as well, since rural or low-density areas yield small samples. Some communities, such as extremely low-income households, residents of rural areas, certain minority populations, and immigrants, may have low response rates to the census and surveys, resulting in sparse and uncertain data. Socioeconomic factors, including income levels and population dispersion, also affect the statistical precision of estimates. Additionally, some tracts report MOEs greater than 100% of the estimate, which makes those estimates statistically very unreliable and sharply increases the risk of misrepresentation in algorithms and policy decisions.

Strategic Recommendations

  1. Enhance Data Collection: Increase survey coverage in high-MOE, low-population tracts to improve the reliability of estimates.

  2. Incorporate MOE in Modeling: Integrate uncertainty measures into algorithmic models to weight outputs appropriately and reduce reliance on potentially biased estimates.

  3. Community Engagement: Encourage participation in surveys and censuses in smaller or rural tracts, particularly among low-income, minority, and immigrant populations, to enhance data completeness and reduce uncertainty.

  4. Equity-Focused Resource Allocation: Implement human review for areas with poor data quality; if data reliability is low, do not let algorithms make decisions independently. Pay special attention to high-MOE tracts to prevent misdirected interventions.
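
As one illustration of the second recommendation, an algorithm can test whether two estimates are statistically distinguishable before ranking one place above another. The sketch below applies the usual two-estimate z-test to the county-level median incomes of Hamilton and Yates. `acs_differs()` is a hypothetical helper written for this example, and it assumes the published MOEs are at the 90 percent confidence level (z = 1.645), as is standard for the ACS.

```r
# Hypothetical helper: test whether two ACS estimates differ at the 90% level
acs_differs <- function(est1, moe1, est2, moe2, z_crit = 1.645) {
  se1 <- moe1 / 1.645   # convert published MOEs to approximate standard errors
  se2 <- moe2 / 1.645
  z   <- (est1 - est2) / sqrt(se1^2 + se2^2)
  list(z = z, significant = abs(z) > z_crit)
}

# Median household income: Hamilton (66891 +/- 7622) vs. Yates (63974 +/- 3733)
result <- acs_differs(est1 = 66891, moe1 = 7622, est2 = 63974, moe2 = 3733)
print(result)
```

Here the difference is not statistically significant, so a ranking that places Hamilton above Yates on income would be relying on noise rather than a measurable gap.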

5.2 Specific Recommendations

Your Task: Create a decision framework for algorithm implementation.

# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category
# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"  
# - Low Confidence: "Requires manual review or additional data"
county_data <- county_data %>%
  mutate(
    algorithm_recommendation = case_when(
      reliability == "High confidence" ~ "Safe for algorithmic decisions",
      reliability == "Moderate confidence" ~ "Use with caution - monitor outcomes",
      reliability == "Low confidence" ~ "Requires manual review or additional data",
      TRUE ~ "Unknown"
    )
  )

# Format as a professional table with kable()

summary_table <- county_data %>%
  select(county_name = NAME, Median_income = Median_incomeE, MOE_percentage, 
         reliability, algorithm_recommendation)

# Display professional table
kable(summary_table, 
      caption = "County-Level Median Income Reliability and Algorithm Recommendations",
      col.names = c("County", "Median Income", "MOE (%)", "Reliability Category", 
                    "Algorithm Recommendation"),
      digits = 2)
County-Level Median Income Reliability and Algorithm Recommendations

| County | Median Income | MOE (%) | Reliability Category | Algorithm Recommendation |
|---|---|---|---|---|
| Albany | 78829 | 2.60 | High confidence | Safe for algorithmic decisions |
| Allegany | 58725 | 3.35 | High confidence | Safe for algorithmic decisions |
| Bronx | 47036 | 1.89 | High confidence | Safe for algorithmic decisions |
| Broome | 58317 | 3.02 | High confidence | Safe for algorithmic decisions |
| Cattaraugus | 56889 | 3.13 | High confidence | Safe for algorithmic decisions |
| Cayuga | 63227 | 4.33 | High confidence | Safe for algorithmic decisions |
| Chautauqua | 54625 | 3.21 | High confidence | Safe for algorithmic decisions |
| Chemung | 61358 | 4.03 | High confidence | Safe for algorithmic decisions |
| Chenango | 61741 | 4.09 | High confidence | Safe for algorithmic decisions |
| Clinton | 67097 | 4.18 | High confidence | Safe for algorithmic decisions |
| Columbia | 81741 | 3.39 | High confidence | Safe for algorithmic decisions |
| Cortland | 65029 | 4.42 | High confidence | Safe for algorithmic decisions |
| Delaware | 58338 | 3.67 | High confidence | Safe for algorithmic decisions |
| Dutchess | 94578 | 2.66 | High confidence | Safe for algorithmic decisions |
| Erie | 68014 | 1.18 | High confidence | Safe for algorithmic decisions |
| Essex | 68090 | 5.27 | Moderate confidence | Use with caution - monitor outcomes |
| Franklin | 60270 | 4.81 | High confidence | Safe for algorithmic decisions |
| Fulton | 60557 | 4.37 | High confidence | Safe for algorithmic decisions |
| Genesee | 68178 | 4.57 | High confidence | Safe for algorithmic decisions |
| Greene | 70294 | 6.18 | Moderate confidence | Use with caution - monitor outcomes |
| Hamilton | 66891 | 11.39 | Low confidence | Requires manual review or additional data |
| Herkimer | 68104 | 4.79 | High confidence | Safe for algorithmic decisions |
| Jefferson | 62782 | 3.64 | High confidence | Safe for algorithmic decisions |
| Kings | 74692 | 1.27 | High confidence | Safe for algorithmic decisions |
| Lewis | 64401 | 4.16 | High confidence | Safe for algorithmic decisions |
| Livingston | 70443 | 3.99 | High confidence | Safe for algorithmic decisions |
| Madison | 68869 | 4.04 | High confidence | Safe for algorithmic decisions |
| Monroe | 71450 | 1.35 | High confidence | Safe for algorithmic decisions |
| Montgomery | 58033 | 3.63 | High confidence | Safe for algorithmic decisions |
| Nassau | 137709 | 1.39 | High confidence | Safe for algorithmic decisions |
| New York | 99880 | 1.78 | High confidence | Safe for algorithmic decisions |
| Niagara | 65882 | 2.67 | High confidence | Safe for algorithmic decisions |
| Oneida | 66402 | 3.27 | High confidence | Safe for algorithmic decisions |
| Onondaga | 71479 | 1.57 | High confidence | Safe for algorithmic decisions |
| Ontario | 76603 | 2.94 | High confidence | Safe for algorithmic decisions |
| Orange | 91806 | 1.94 | High confidence | Safe for algorithmic decisions |
| Orleans | 61069 | 4.89 | High confidence | Safe for algorithmic decisions |
| Oswego | 65054 | 3.26 | High confidence | Safe for algorithmic decisions |
| Otsego | 65778 | 4.51 | High confidence | Safe for algorithmic decisions |
| Putnam | 120970 | 4.03 | High confidence | Safe for algorithmic decisions |
| Queens | 82431 | 1.06 | High confidence | Safe for algorithmic decisions |
| Rensselaer | 83734 | 2.27 | High confidence | Safe for algorithmic decisions |
| Richmond | 96185 | 2.60 | High confidence | Safe for algorithmic decisions |
| Rockland | 106173 | 2.88 | High confidence | Safe for algorithmic decisions |
| St. Lawrence | 58339 | 3.47 | High confidence | Safe for algorithmic decisions |
| Saratoga | 97038 | 2.26 | High confidence | Safe for algorithmic decisions |
| Schenectady | 75056 | 3.03 | High confidence | Safe for algorithmic decisions |
| Schoharie | 71479 | 3.96 | High confidence | Safe for algorithmic decisions |
| Schuyler | 61316 | 9.49 | Moderate confidence | Use with caution - monitor outcomes |
| Seneca | 64050 | 5.24 | Moderate confidence | Use with caution - monitor outcomes |
| Steuben | 62506 | 2.87 | High confidence | Safe for algorithmic decisions |
| Suffolk | 122498 | 1.18 | High confidence | Safe for algorithmic decisions |
| Sullivan | 67841 | 4.35 | High confidence | Safe for algorithmic decisions |
| Tioga | 70427 | 3.99 | High confidence | Safe for algorithmic decisions |
| Tompkins | 69995 | 4.01 | High confidence | Safe for algorithmic decisions |
| Ulster | 77197 | 4.52 | High confidence | Safe for algorithmic decisions |
| Warren | 74531 | 4.74 | High confidence | Safe for algorithmic decisions |
| Washington | 68703 | 3.41 | High confidence | Safe for algorithmic decisions |
| Wayne | 71007 | 3.10 | High confidence | Safe for algorithmic decisions |
| Westchester | 114651 | 1.56 | High confidence | Safe for algorithmic decisions |
| Wyoming | 65066 | 3.38 | High confidence | Safe for algorithmic decisions |
| Yates | 63974 | 5.84 | Moderate confidence | Use with caution - monitor outcomes |

Key Recommendations:

Your Task: Use your analysis results to provide specific guidance to the department.

  1. Counties suitable for immediate algorithmic implementation: Albany, Bronx, Broome, Cattaraugus, Cayuga, Chautauqua, Chemung, Chenango, Clinton, Columbia, Cortland, Delaware, Dutchess, Erie, Franklin, Fulton, Genesee, Herkimer, Jefferson, Kings, Lewis, Livingston, Madison, Monroe, Montgomery, Nassau, New York, Niagara, Oneida, Onondaga, Ontario, Orange, Orleans, Oswego, Otsego, Putnam, Queens, Rensselaer, Richmond, Rockland, St. Lawrence, Saratoga, Schenectady, Schoharie, Steuben, Suffolk, Sullivan, Tioga, Tompkins, Ulster, Warren, Washington, Wayne, Westchester, Wyoming.

These counties have low MOEs (below 5%), so their estimates can be used with confidence.

  2. Counties requiring additional oversight: Essex, Greene, Schuyler, Seneca, Yates

These counties have moderate-confidence data (MOE of 5-10%). Algorithmic decisions can be used, but outcomes should be monitored, with periodic checks to verify the accuracy of predictions.

  3. Counties needing alternative approaches: Hamilton

This county has low-confidence data, so algorithmic decisions are not recommended. Manual review or additional data collection is required. Human oversight allows algorithmic outputs to be validated and corrected if needed.

Questions for Further Investigation

  1. How do spatial patterns in income and demographic distributions affect the reliability of county- and tract-level data? Are there specific geographic clusters with consistently low reliability?

  2. What temporal trends or changes over time could be observed if historical ACS survey data were included?

  3. How do measurement errors and low survey response rates impact smaller racial or ethnic groups, and what effective safeguards can help prevent bias in algorithmic decisions?

Technical Notes

Data Sources:

  • U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
  • Retrieved via tidycensus R package on 09/20/2025

Reproducibility:

  • All analysis conducted in R version 4.5.1
  • Census API key required for replication
  • Complete code and documentation available at: https://musa-5080-fall-2025.github.io/portfolio-setup-sihan-yu429/assignments/assignments1/scripts/assignment1_template.html

Methodology Notes: Several key decisions were made during the analysis. Counties were classified into “High,” “Moderate,” and “Low” confidence categories based on ACS income margins of error (MOE), using 5% and 10% cutoffs: an MOE below 5% indicates a precise estimate, and an MOE above 10% signals caution. One county was intentionally selected from each reliability category to illustrate contrasts. This approach highlights differences, but it means the county-specific findings are illustrative rather than representative. At the tract level, MOE percentages were calculated for each racial/ethnic group after excluding tracts with a zero estimate or an MOE of 100% or more for any group, and a 50% MOE flag was applied to identify particularly unreliable tracts. Although the assignment suggested a 15% threshold, that threshold flagged every remaining tract, so the higher threshold focuses discussion on the worst data quality issues. These decisions shaped the reliability categories and subsequent recommendations, and they should guide interpretation of the county- and tract-level results.

Limitations: The analysis is limited by sample size constraints, especially for smaller racial/ethnic groups at the tract level. This can result in high margins of error. Geographic coverage focuses on selected counties and tracts, so findings may not generalize elsewhere. Temporal scope is limited to the 2018–2022 ACS 5-year estimates, which may not capture rapid changes. Additionally, ACS survey data itself contains inherent inaccuracies due to sampling and nonresponse errors.


Submission Checklist

Before submitting your portfolio link on Canvas:

Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/your_file_name.html