Assignment 1: Census Data Quality for Policy Decisions

Evaluating Data Reliability for Algorithmic Decision-Making

Author

Sihan Yu

Published

October 14, 2025

Assignment Overview

Scenario

You are a data analyst for the [Your State] Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.

Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.

Learning Objectives

Apply dplyr functions to real census data for policy analysis
Evaluate data quality using margins of error
Connect technical analysis to algorithmic decision-making
Identify potential equity implications of data reliability issues
Create professional documentation for policy stakeholders

Submission Instructions

Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/

Make sure to update your _quarto.yml navigation to include this assignment under an “Assignments” menu.

Part 1: Portfolio Integration

Create this assignment in your portfolio repository under an assignments/assignment_1/ folder structure. Update your navigation menu to include:

- text: Assignments
  menu:
    - href: assignments/assignment_1/your_file_name.qmd
      text: "Assignment 1: Census Data Exploration"

If the text contains a special character such as a colon or comma, wrap it in double quotes so that Quarto treats it as text.

Setup

# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
library(tidycensus)
library(tidyverse)
library(knitr)
# Set your Census API key (read from an environment variable rather than hard-coding it)
census_api_key(Sys.getenv("CENSUS_API_KEY"))
# Choose your state for analysis - assign it to a variable called my_state
my_state <- "New York"

State Selection: I have chosen New York for this analysis because it has a large population, which makes it interesting to study how census data reflects socioeconomic patterns at a large scale.

Part 2: County-Level Resource Assessment

2.1 Data Retrieval

Your Task: Use get_acs() to retrieve county-level data for your chosen state.

Requirements:
- Geography: county level
- Variables: median household income (B19013_001) and total population (B01003_001)
- Year: 2022
- Survey: acs5
- Output format: wide

Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.

# Write your get_acs() code here
county_data <- get_acs(
  geography = "county",
  variables = c(Median_income = "B19013_001",
                population = "B01003_001"
                ),
  state = "NY",
  year = 2022,
  survey = "acs5",
  output = "wide"
  
)
# Clean the county names to remove state name and "County" 
# Hint: use mutate() with str_remove()
# Note: removing only "County" leaves a trailing space (e.g. "Albany "), as the output below shows
county_data <- county_data %>%
  mutate(
    NAME = NAME %>%
      str_remove(", New York") %>%
      str_remove("County")
  )
# Display the first few rows
head(county_data)

# A tibble: 6 × 6
  GEOID NAME           Median_incomeE Median_incomeM populationE populationM
  <chr> <chr>                   <dbl>          <dbl>       <dbl>       <dbl>
1 36001 "Albany "               78829           2049      315041          NA
2 36003 "Allegany "             58725           1965       47222          NA
3 36005 "Bronx "                47036            890     1443229          NA
4 36007 "Broome "               58317           1761      198365          NA
5 36009 "Cattaraugus "          56889           1778       77000          NA
6 36011 "Cayuga "               63227           2736       76171          NA

2.2 Data Quality Assessment

Your Task: Calculate margin of error percentages and create reliability categories.

Requirements:
- Calculate MOE percentage: (margin of error / estimate) * 100
- Create reliability categories:
  - High Confidence: MOE < 5%
  - Moderate Confidence: MOE 5-10%
  - Low Confidence: MOE > 10%
- Create a flag for unreliable estimates (MOE > 10%)

Hint: Use mutate() with case_when() for the categories.

# Calculate MOE percentage and reliability categories using mutate()
county_data <- county_data %>%
  mutate(
    MOE_percentage = (Median_incomeM / Median_incomeE) * 100,
    reliability = case_when(
      MOE_percentage < 5   ~ "High confidence",
      MOE_percentage <= 10 ~ "Moderate confidence",  # includes exactly 5% and 10%
      MOE_percentage > 10  ~ "Low confidence"
    ),
    # Compare the column itself; the quoted string "MOE percentage" is not a column reference
    unreliable_flag = MOE_percentage > 10
  )
county_data

# A tibble: 62 × 9
   GEOID NAME           Median_incomeE Median_incomeM populationE populationM
   <chr> <chr>                   <dbl>          <dbl>       <dbl>       <dbl>
 1 36001 "Albany "               78829           2049      315041          NA
 2 36003 "Allegany "             58725           1965       47222          NA
 3 36005 "Bronx "                47036            890     1443229          NA
 4 36007 "Broome "               58317           1761      198365          NA
 5 36009 "Cattaraugus "          56889           1778       77000          NA
 6 36011 "Cayuga "               63227           2736       76171          NA
 7 36013 "Chautauqua "           54625           1754      127440          NA
 8 36015 "Chemung "              61358           2475       83584          NA
 9 36017 "Chenango "             61741           2526       47096          NA
10 36019 "Clinton "              67097           2802       79839          NA
# ℹ 52 more rows
# ℹ 3 more variables: MOE_percentage <dbl>, reliability <chr>,
#   unreliable_flag <lgl>

# Create a summary showing count of counties in each reliability category
# Hint: use count() and mutate() to add percentages
library(dplyr)
library(knitr)

reliability_summary <- county_data %>%
  count(reliability) %>%                    
  mutate(percentage = n / sum(n) * 100)     
reliability_summary

# A tibble: 3 × 3
  reliability             n percentage
  <chr>               <int>      <dbl>
1 High confidence        56      90.3 
2 Low confidence          1       1.61
3 Moderate confidence     5       8.06
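
A quick way to check the category logic is to run it on a few hand-picked values, including the boundaries at exactly 5% and 10%, which strict inequalities on both sides would leave as NA. The sketch below is a toy example: the `toy` data frame and its values are made up for illustration, and placing exactly 5% and 10% in the moderate band is an assumption, since the requirements leave the boundaries open.

```r
library(dplyr)

# Hypothetical MOE percentages, chosen to hit both boundaries
toy <- data.frame(moe_pct = c(1.9, 5, 7.5, 10, 11.4))

toy <- toy %>%
  mutate(
    reliability = case_when(
      moe_pct < 5   ~ "High confidence",
      moe_pct <= 10 ~ "Moderate confidence",  # assumption: 5 and 10 count as moderate
      moe_pct > 10  ~ "Low confidence"
    ),
    # The flag is a plain logical comparison on the column itself
    unreliable_flag = moe_pct > 10
  )

toy
```

Because `case_when()` evaluates its conditions in order, the second branch only sees values that are already 5 or above, so no lower bound is needed there.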

2.3 High Uncertainty Counties

Your Task: Identify the 5 counties with the highest MOE percentages.

Requirements:
- Sort by MOE percentage (highest first)
- Select the top 5 counties
- Display: county name, median income, margin of error, MOE percentage, reliability category
- Format as a professional table using kable()

Hint: Use arrange(), slice(), and select() functions.

# Create table of top 5 counties by MOE percentage
top_MOE <- county_data %>%
  arrange(desc(MOE_percentage)) %>%
  slice(1:5) %>%
  select(
  County = NAME,
  Median_Income = Median_incomeE,
  MOE = Median_incomeM,
  MOE_percentage = MOE_percentage,
  Reliability = reliability
  )
# Format as table with kable() - include appropriate column names and caption
kable(
  top_MOE,
  caption = "Top 5 Counties with Highest Income MOE Percentage",
  col.names = c(
    "County",
    "Median Income (Estimate)",
    "Median Income (MOE)",
    "MOE Percentage",
    "Reliability Category"
  ),
  digits = 1 
)

Top 5 Counties with Highest Income MOE Percentage
County	Median Income (Estimate)	Median Income (MOE)	MOE Percentage	Reliability Category
Hamilton	66891	7622	11.4	Low confidence
Schuyler	61316	5818	9.5	Moderate confidence
Greene	70294	4341	6.2	Moderate confidence
Yates	63974	3733	5.8	Moderate confidence
Essex	68090	3590	5.3	Moderate confidence

Data Quality Commentary: The table shows the top 5 counties with relatively high MOE percentages, indicating that the median household income estimates have larger uncertainties. Among them, Hamilton County has the highest MOE percentage at 11.4%, suggesting that income data in this county may be less reliable. This could be due to high income variability within the county, a relatively small population, or the presence of outliers. Algorithms that rely on these median income estimates may produce biased or inaccurate decisions in these areas.
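
To make the uncertainty concrete, the MOE can be turned into a confidence interval and a coefficient of variation. The sketch below uses the Hamilton County values from the table above and relies on the fact that ACS margins of error are published at the 90% confidence level; the variable names are illustrative and not part of the assignment template.

```r
# Hamilton County values from the table above
estimate <- 66891
moe      <- 7622

# ACS margins of error are published at the 90% confidence level,
# so estimate +/- MOE is the 90% confidence interval
ci_90 <- c(lower = estimate - moe, upper = estimate + moe)

# Standard error: divide the MOE by 1.645 (the 90% z-score)
se <- moe / 1.645

# Coefficient of variation: standard error relative to the estimate, in percent
cv_pct <- 100 * se / estimate

ci_90
round(cv_pct, 1)
```

Read this way, Hamilton's median household income could plausibly lie anywhere between about $59,000 and $74,500, a range wide enough to move the county across many funding cutoffs.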

Part 3: Neighborhood-Level Analysis

3.1 Focus Area Selection

Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.

Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.

# Use filter() to select 2-3 counties from your county_reliability data
# Store the selected counties in a variable called selected_counties
my_counties <- c("Orange", "Yates", "Hamilton")
selected_counties <- county_data %>%
  filter(str_detect(NAME, paste(my_counties, collapse = "|")))

# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category
selected_counties %>%
  select(
    County = NAME,
    Median_income = Median_incomeE,
    MOE_percentage,
    reliability
  ) %>%
  kable(caption = "Selected Counties with Median Income, MOE %, and Reliability")

Selected Counties with Median Income, MOE %, and Reliability
County	Median_income	MOE_percentage	reliability
Hamilton	66891	11.394657	Low confidence
Orange	91806	1.939960	High confidence
Yates	63974	5.835183	Moderate confidence

Comment on the output: Orange County has the highest median income and the lowest MOE percentage of the three, so its estimate is the most reliable. Yates and Hamilton have lower median incomes and larger MOE percentages, which reduces confidence in their estimates. This suggests that wealthier counties might have more reliable data, although three counties are too few to separate the effect of income from that of population size.
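
One rough way to probe whether income or population tracks the MOE percentage more closely is a rank correlation. The sketch below is only illustrative: it uses the six counties shown in the `head(county_data)` output in section 2.1 rather than all 62, so it cannot settle the question, and the choice of Spearman correlation is an assumption made to keep the Bronx's very large population from dominating.

```r
# First six counties from the head(county_data) output in section 2.1
counties <- data.frame(
  county     = c("Albany", "Allegany", "Bronx", "Broome", "Cattaraugus", "Cayuga"),
  income     = c(78829, 58725, 47036, 58317, 56889, 63227),
  income_moe = c(2049, 1965, 890, 1761, 1778, 2736),
  population = c(315041, 47222, 1443229, 198365, 77000, 76171)
)
counties$moe_pct <- 100 * counties$income_moe / counties$income

# Rank-based (Spearman) correlations of MOE percentage with income and population
cor_income <- cor(counties$income, counties$moe_pct, method = "spearman")
cor_pop    <- cor(counties$population, counties$moe_pct, method = "spearman")

c(income = cor_income, population = cor_pop)
```

Running the same two lines on the full `county_data` table would give a better-grounded answer than any three-county comparison.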

3.2 Tract-Level Demographics

Your Task: Get demographic data for census tracts in your selected counties.

Requirements:
- Geography: tract level
- Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
- Use the same state and year as before
- Output format: wide
- Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.

# Define your race/ethnicity variables with descriptive names

race = c(
  White = "B03002_003",
  Black = "B03002_004",
  Hispanic = "B03002_012",
  total_pop ="B03002_001"
  )

# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter
tract_data = get_acs(
  geography = "tract",
  variables = race,
  state = "NY",
  year = 2022,
  survey = "acs5",
  output = "wide" 
)
# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations
tract_data = tract_data %>%
  mutate(
    white_p = 100 * WhiteE / total_popE,
    Black_p = 100 * BlackE / total_popE,
    Hispanic_p = 100 * HispanicE / total_popE,
  )


# Add readable tract and county name columns using str_extract() or similar
tract_data <- tract_data %>%
  mutate(
    tract_name  = str_replace(NAME, ";\\s*[^,]+ County;\\s*[^,]+$", ""),
    county_name = str_extract(NAME, ";\\s*[^,]+ County") %>% 
                  str_remove("^;\\s*") %>% 
                  str_remove("\\s*County$")                             
  )

kable(head(tract_data))

GEOID	NAME	WhiteE	WhiteM	BlackE	BlackM	HispanicE	HispanicM	total_popE	total_popM	white_p	Black_p	Hispanic_p	tract_name	county_name
36001000100	Census Tract 1; Albany County; New York	725	340	982	362	346	217	2259	512	32.09385	43.470562	15.316512	Census Tract 1	Albany
36001000201	Census Tract 2.01; Albany County; New York	372	198	1742	613	174	156	2465	608	15.09128	70.669371	7.058823	Census Tract 2.01	Albany
36001000202	Census Tract 2.02; Albany County; New York	317	193	1952	684	45	68	2374	668	13.35299	82.224094	1.895535	Census Tract 2.02	Albany
36001000301	Census Tract 3.01; Albany County; New York	678	431	1271	480	673	278	2837	581	23.89848	44.800846	23.722242	Census Tract 3.01	Albany
36001000302	Census Tract 3.02; Albany County; New York	1963	496	538	345	183	113	3200	500	61.34375	16.812500	5.718750	Census Tract 3.02	Albany
36001000401	Census Tract 4.01; Albany County; New York	2012	366	134	92	98	78	2301	399	87.44024	5.823555	4.259018	Census Tract 4.01	Albany
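
The county codes hinted at in the requirements can be read straight off the GEOIDs. A minimal sketch, using GEOIDs that appear in the outputs above; base R `substr()` is used here to keep the example dependency-free, and the variable names are illustrative.

```r
# County GEOIDs as they appear in the county-level output (state 36 = New York)
county_geoids <- c(Albany = "36001", Bronx = "36005", Cayuga = "36011")

# Characters 1-2 are the state FIPS code, characters 3-5 the county FIPS code
county_codes <- substr(county_geoids, 3, 5)
county_codes

# A tract GEOID starts with the same five characters as its county
tract_geoid <- "36001000100"  # Census Tract 1, Albany County
substr(tract_geoid, 1, 5) == county_geoids[["Albany"]]
```

Three-digit codes like these can be passed to the `county` argument of `get_acs()` to limit a tract-level request to the selected counties instead of downloading every tract in the state.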

3.3 Demographic Analysis

Your Task: Analyze the demographic patterns in your selected areas.

# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract
top_hispanic <- tract_data %>%
  filter(county_name %in% my_counties)%>%
  arrange(desc(Hispanic_p)) %>%
  slice(1)
# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group

county_summary <- tract_data %>%
  filter(county_name %in% my_counties) %>% 
  group_by(county_name) %>%
  summarize(
    num_tracts      = n(),
    avg_white_pct   = mean(white_p),
    avg_black_pct   = mean(Black_p),
    avg_hispanic_pct= mean(Hispanic_p)
  )


# Create a nicely formatted table of your results using kable()
kable(
  top_hispanic,
  caption = "Tract with Highest Hispanic/Latino Percentage (Selected Counties)"
)

Tract with Highest Hispanic/Latino Percentage (Selected Counties)
GEOID	NAME	WhiteE	WhiteM	BlackE	BlackM	HispanicE	HispanicM	total_popE	total_popM	white_p	Black_p	Hispanic_p	tract_name	county_name
36071000501	Census Tract 5.01; Orange County; New York	246	136	362	274	1723	631	2473	681	9.947432	14.63809	69.67246	Census Tract 5.01	Orange

kable(county_summary, caption = "Average Demographic Percentages by County")

Average Demographic Percentages by County
county_name	num_tracts	avg_white_pct	avg_black_pct	avg_hispanic_pct
Hamilton	4	91.87527	1.240264	2.038062
Orange	92	59.23838	10.868294	23.054479
Yates	8	94.17341	0.520878	2.116160
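
Note that averaging tract percentages treats every tract equally regardless of size, which is not the same as the county-wide share. The sketch below contrasts the two using three Albany County tracts from the section 3.2 output; the three-tract subset is chosen only for illustration.

```r
# Census Tracts 1, 3.02 and 4.01 in Albany County (values from section 3.2)
white <- c(725, 1963, 2012)
total <- c(2259, 3200, 2301)

# Unweighted mean of tract percentages: every tract counts equally
mean_of_pcts <- mean(100 * white / total)

# Pooled share: total White residents over total residents
pooled_pct <- 100 * sum(white) / sum(total)

round(c(mean_of_tract_pcts = mean_of_pcts, pooled = pooled_pct), 2)
```

The gap between the two numbers widens as tract populations become more unequal, so it is worth stating which one a table reports.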

Part 4: Comprehensive Data Quality Evaluation

4.1 MOE Analysis for Demographic Variables

Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.

Requirements:
- Calculate MOE percentages for each demographic variable
- Flag tracts where any demographic variable has MOE > 15%
- Create summary statistics

# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)
tract_data_filtered <- tract_data %>%
  filter(WhiteE > 0, BlackE> 0, HispanicE>0)

tract_data_filtered = tract_data_filtered %>%
  mutate(
    white_moe = 100 * WhiteM / WhiteE,
    Black_moe = 100 * BlackM / BlackE,
    Hispanic_moe = 100 * HispanicM / HispanicE
  ) %>%
  filter( 
    white_moe < 100, 
    Black_moe < 100, 
    Hispanic_moe <100
    )

# Create a flag for tracts with high MOE on any demographic variable
tract_moe = tract_data_filtered %>% 
  mutate(
    high_moe = ifelse(
      white_moe >15 | Black_moe >15 |Hispanic_moe >15, 
      TRUE, FALSE
    )
  )
# Use logical operators (| for OR) in an ifelse() statement

# Create summary statistics showing how many tracts have data quality issues
summary <- tract_moe %>%
  summarize(
    total_tracts = n(),
    tracts_high_moe = sum(high_moe),
    pct_high_moe = (sum(high_moe)/n()*100),
  )

summary_county <- tract_moe %>%
  filter(county_name %in% my_counties)%>%
  group_by(county_name)%>%
  summarize(
    total_tracts = n(),
    tracts_high_moe = sum(high_moe),
    pct_high_moe = (sum(high_moe)/n()*100),
  )
kable(summary, caption = "Overall Data Quality: MOE Analysis")

Overall Data Quality: MOE Analysis
total_tracts	tracts_high_moe	pct_high_moe
2669	2669	100

summary_county %>%
  kable(
    caption = "Summary of tracts with high MOE (>15%)",
    col.names = c("County", "N tracts", "High-MOE tracts", "% High-MOE")
  )

Summary of tracts with high MOE (>15%)
County	N tracts	High-MOE tracts	% High-MOE
Hamilton	1	1	100
Orange	68	68	100
Yates	2	2	100
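
The MOE percentages above describe the raw counts. The percentages computed in section 3.2 are derived estimates with their own margin of error, which the Census Bureau approximates from the MOEs of the numerator and denominator. A sketch of that approximation for Census Tract 5.01 in Orange County follows; it is written out in base R for transparency, and tidycensus ships a `moe_prop()` helper for the same calculation.

```r
# Census Tract 5.01, Orange County (values from the table in section 3.3)
hispanic     <- 1723
hispanic_moe <- 631
total        <- 2473
total_moe    <- 681

p <- hispanic / total

# Census Bureau approximation for the MOE of a proportion
under_root <- hispanic_moe^2 - p^2 * total_moe^2
moe_p <- if (under_root >= 0) {
  sqrt(under_root) / total
} else {
  # Fallback when the term under the root is negative: use the ratio formula
  sqrt(hispanic_moe^2 + p^2 * total_moe^2) / total
}

# Share and its margin of error, in percentage points
round(100 * c(share = p, moe = moe_p), 1)
```

For this tract the Hispanic/Latino share of about 69.7% carries a margin of roughly 17 percentage points, so even the headline tract in section 3.3 is far from precisely measured.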

4.2 Pattern Analysis

Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.

# Group tracts by whether they have high MOE issues
# Calculate average characteristics for each group:
# - population size, demographic percentages
# Use group_by() and summarize() to create this comparison

tract_moe_changed = tract_data_filtered %>% 
  mutate(
    high_moe = ifelse(
      white_moe >50 | Black_moe >50 |Hispanic_moe >50, 
      TRUE, FALSE
    )
  )

pattern_analysis <- tract_moe_changed %>%
  mutate(high_moe_flag = ifelse(high_moe, "High MOE", "Low MOE")) %>%
  group_by(high_moe_flag) %>%
  summarize(
    num_tracts       = n(),
    avg_population   = mean(total_popE, na.rm = TRUE),
    avg_white_pct    = mean(WhiteE / total_popE * 100, na.rm = TRUE),
    avg_black_pct    = mean(BlackE / total_popE * 100, na.rm = TRUE),
    avg_hispanic_pct = mean(HispanicE / total_popE * 100, na.rm = TRUE)
  )


# Create a professional table showing the patterns
pattern_analysis %>%
  kable(
    digits = 1,
    col.names = c("High MOE Flag", "N tracts", "Avg Pop", "Avg % White", "Avg % Black", "Avg % Hispanic"),
    caption = "Comparison of tract characteristics by data quality group"
  )

Comparison of tract characteristics by data quality group
High MOE Flag	N tracts	Avg Pop	Avg % White	Avg % Black	Avg % Hispanic
High MOE	2141	3989.4	47.7	17.0	21.5
Low MOE	528	4564.0	35.3	25.7	27.6

Pattern Analysis: The 15% MOE threshold flagged every tract, leaving only one group and nothing to compare. To enable analysis, the threshold was raised to 50%, yielding two groups: High MOE (2,141 tracts, ~80%) and Low MOE (528 tracts, ~20%). Even under this more lenient criterion, most tracts remain classified as High MOE, which points to substantial data reliability concerns. High-MOE tracts tend to have smaller populations, higher proportions of White residents, and lower proportions of Black and Hispanic residents, a pattern that runs counter to the initial expectation that more racially diverse areas would exhibit greater uncertainty.
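
Because the choice of threshold drives the size of each group, it can help to tabulate the share of flagged tracts at several candidate cutoffs before settling on one. The sketch below uses made-up MOE percentages for ten hypothetical tracts; with real data, `moe_pct` would be the largest of the three group MOE percentages in each tract.

```r
# Hypothetical MOE percentages for ten tracts (made-up values)
moe_pct <- c(12, 18, 25, 33, 41, 48, 55, 62, 78, 95)

thresholds <- c(15, 30, 50)

# Share of tracts flagged as high-MOE at each candidate threshold
share_flagged <- sapply(thresholds, function(t) mean(moe_pct > t))

data.frame(threshold = thresholds, pct_flagged = 100 * share_flagged)
```

Reporting such a table alongside the chosen cutoff makes it clear how sensitive the High/Low split is to that decision.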

Part 5: Policy Recommendations

5.1 Analysis Integration and Professional Summary

Your Task: Write an executive summary that integrates findings from all four analyses.

Executive Summary Requirements:
1. Overall Pattern Identification: What are the systematic patterns across all your analyses?
2. Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings?
3. Root Cause Analysis: What underlying factors drive both data quality issues and bias risk?
4. Strategic Recommendations: What should the Department implement to address these systematic issues?

Executive Summary: The analyses reveal consistent geographic and demographic disparities in median income and data quality across counties. Higher-income counties, such as Orange (median income ≈ $91,806, MOE 1.9%), generally show high data reliability, while small rural counties, such as Yates (≈ $63,974, MOE 5.8%) and Hamilton (≈ $66,891, MOE 11.4%), fall into the moderate- and low-confidence categories. Income alone does not explain the pattern: the Bronx has the lowest median income in the state (≈ $47,036) yet a high-confidence estimate (MOE 1.9%), which suggests that population size matters more than wealth. At the tract level, using the 50% MOE threshold, High-MOE tracts tend to have smaller populations (≈ 3,989 vs. ≈ 4,564 residents on average) and higher percentages of White residents, whereas Low-MOE tracts tend to be larger and more diverse.

Communities at greatest risk of algorithmic bias are not necessarily tracts with higher minority percentages, but rather small, high-MOE, predominantly White tracts. These high-uncertainty areas are more likely to produce unreliable estimates, meaning that algorithms could allocate resources unevenly in less diverse counties. Low-MOE, more diverse tracts are comparatively better represented in the data, which reduces the risk of bias for minority communities.

High margins of error (MOE) and data unreliability in New York State are driven by several factors. Population size and demographic homogeneity play a major role: small, predominantly White tracts are more likely to exhibit high MOE, likely because estimates for the few Black or Hispanic residents in such tracts rest on very small samples. Sampling variability further contributes, as rural or low-density areas are harder to survey accurately. Some communities, such as extremely low-income households, residents of rural areas, certain minority populations, and immigrants, may have low response rates to the census and surveys, resulting in sparse and uncertain data. Socioeconomic factors, including income levels and population dispersion, also affect the statistical precision of estimates. Additionally, some tracts report MOEs greater than 100% of the estimate, which makes those estimates statistically very unreliable and sharply increases the risk of misrepresentation in algorithms and policy decisions.

Strategic Recommendations

Enhance Data Collection: Increase survey coverage in high-MOE, low-population tracts to improve the reliability of estimates.
Incorporate MOE in Modeling: Integrate uncertainty measures into algorithmic models to weight outputs appropriately and reduce reliance on potentially biased estimates.
Community Engagement: Encourage participation in surveys and censuses in smaller or rural tracts, particularly among low-income, minority, and immigrant populations, to enhance data completeness and reduce uncertainty.
Equity-Focused Resource Allocation: Implement human review for areas with poor data quality; if data reliability is low, do not let algorithms make decisions independently. Pay special attention to high-MOE tracts to prevent misdirected interventions.
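
One way to act on the "Incorporate MOE in Modeling" and human-review recommendations is to compare each county's full confidence interval, not just its point estimate, against a funding cutoff. The sketch below uses estimates and MOEs from the county tables above; the $65,000 cutoff and the decision labels are hypothetical and are not part of the assignment.

```r
library(dplyr)

# Estimates and MOEs from the county tables above
counties <- data.frame(
  county = c("Bronx", "Yates", "Hamilton", "Albany"),
  income = c(47036, 63974, 66891, 78829),
  moe    = c(890, 3733, 7622, 2049)
)

# Hypothetical eligibility cutoff, not from the assignment
cutoff <- 65000

triage <- counties %>%
  mutate(
    lower = income - moe,
    upper = income + moe,
    decision = case_when(
      upper < cutoff ~ "Below cutoff",
      lower > cutoff ~ "Above cutoff",
      # The interval straddles the cutoff, so the data cannot decide
      TRUE           ~ "Uncertain - route to human review"
    )
  )

triage
```

Under this rule only counties whose whole interval falls on one side of the cutoff are decided automatically; Yates and Hamilton straddle it and would go to manual review.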

5.2 Specific Recommendations

Your Task: Create a decision framework for algorithm implementation.

# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category
# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"  
# - Low Confidence: "Requires manual review or additional data"
county_data <- county_data %>%
  mutate(
    algorithm_recommendation = case_when(
      reliability == "High confidence" ~ "Safe for algorithmic decisions",
      reliability == "Moderate confidence" ~ "Use with caution - monitor outcomes",
      reliability == "Low confidence" ~ "Requires manual review or additional data",
      TRUE ~ "Unknown"
    )
  )

# Format as a professional table with kable()

summary_table <- county_data %>%
  select(county_name = NAME, Median_income = Median_incomeE, MOE_percentage, 
         reliability, algorithm_recommendation)

# Display professional table
kable(summary_table, 
      caption = "County-Level Median Income Reliability and Algorithm Recommendations",
      col.names = c("County", "Median Income", "MOE (%)", "Reliability Category", 
                    "Algorithm Recommendation"),
      digits = 2)

County-Level Median Income Reliability and Algorithm Recommendations
County	Median Income	MOE (%)	Reliability Category	Algorithm Recommendation
Albany	78829	2.60	High confidence	Safe for algorithmic decisions
Allegany	58725	3.35	High confidence	Safe for algorithmic decisions
Bronx	47036	1.89	High confidence	Safe for algorithmic decisions
Broome	58317	3.02	High confidence	Safe for algorithmic decisions
Cattaraugus	56889	3.13	High confidence	Safe for algorithmic decisions
Cayuga	63227	4.33	High confidence	Safe for algorithmic decisions
Chautauqua	54625	3.21	High confidence	Safe for algorithmic decisions
Chemung	61358	4.03	High confidence	Safe for algorithmic decisions
Chenango	61741	4.09	High confidence	Safe for algorithmic decisions
Clinton	67097	4.18	High confidence	Safe for algorithmic decisions
Columbia	81741	3.39	High confidence	Safe for algorithmic decisions
Cortland	65029	4.42	High confidence	Safe for algorithmic decisions
Delaware	58338	3.67	High confidence	Safe for algorithmic decisions
Dutchess	94578	2.66	High confidence	Safe for algorithmic decisions
Erie	68014	1.18	High confidence	Safe for algorithmic decisions
Essex	68090	5.27	Moderate confidence	Use with caution - monitor outcomes
Franklin	60270	4.81	High confidence	Safe for algorithmic decisions
Fulton	60557	4.37	High confidence	Safe for algorithmic decisions
Genesee	68178	4.57	High confidence	Safe for algorithmic decisions
Greene	70294	6.18	Moderate confidence	Use with caution - monitor outcomes
Hamilton	66891	11.39	Low confidence	Requires manual review or additional data
Herkimer	68104	4.79	High confidence	Safe for algorithmic decisions
Jefferson	62782	3.64	High confidence	Safe for algorithmic decisions
Kings	74692	1.27	High confidence	Safe for algorithmic decisions
Lewis	64401	4.16	High confidence	Safe for algorithmic decisions
Livingston	70443	3.99	High confidence	Safe for algorithmic decisions
Madison	68869	4.04	High confidence	Safe for algorithmic decisions
Monroe	71450	1.35	High confidence	Safe for algorithmic decisions
Montgomery	58033	3.63	High confidence	Safe for algorithmic decisions
Nassau	137709	1.39	High confidence	Safe for algorithmic decisions
New York	99880	1.78	High confidence	Safe for algorithmic decisions
Niagara	65882	2.67	High confidence	Safe for algorithmic decisions
Oneida	66402	3.27	High confidence	Safe for algorithmic decisions
Onondaga	71479	1.57	High confidence	Safe for algorithmic decisions
Ontario	76603	2.94	High confidence	Safe for algorithmic decisions
Orange	91806	1.94	High confidence	Safe for algorithmic decisions
Orleans	61069	4.89	High confidence	Safe for algorithmic decisions
Oswego	65054	3.26	High confidence	Safe for algorithmic decisions
Otsego	65778	4.51	High confidence	Safe for algorithmic decisions
Putnam	120970	4.03	High confidence	Safe for algorithmic decisions
Queens	82431	1.06	High confidence	Safe for algorithmic decisions
Rensselaer	83734	2.27	High confidence	Safe for algorithmic decisions
Richmond	96185	2.60	High confidence	Safe for algorithmic decisions
Rockland	106173	2.88	High confidence	Safe for algorithmic decisions
St. Lawrence	58339	3.47	High confidence	Safe for algorithmic decisions
Saratoga	97038	2.26	High confidence	Safe for algorithmic decisions
Schenectady	75056	3.03	High confidence	Safe for algorithmic decisions
Schoharie	71479	3.96	High confidence	Safe for algorithmic decisions
Schuyler	61316	9.49	Moderate confidence	Use with caution - monitor outcomes
Seneca	64050	5.24	Moderate confidence	Use with caution - monitor outcomes
Steuben	62506	2.87	High confidence	Safe for algorithmic decisions
Suffolk	122498	1.18	High confidence	Safe for algorithmic decisions
Sullivan	67841	4.35	High confidence	Safe for algorithmic decisions
Tioga	70427	3.99	High confidence	Safe for algorithmic decisions
Tompkins	69995	4.01	High confidence	Safe for algorithmic decisions
Ulster	77197	4.52	High confidence	Safe for algorithmic decisions
Warren	74531	4.74	High confidence	Safe for algorithmic decisions
Washington	68703	3.41	High confidence	Safe for algorithmic decisions
Wayne	71007	3.10	High confidence	Safe for algorithmic decisions
Westchester	114651	1.56	High confidence	Safe for algorithmic decisions
Wyoming	65066	3.38	High confidence	Safe for algorithmic decisions
Yates	63974	5.84	Moderate confidence	Use with caution - monitor outcomes

Key Recommendations:

Your Task: Use your analysis results to provide specific guidance to the department.

Counties suitable for immediate algorithmic implementation: Albany, Bronx, Broome, Cattaraugus, Cayuga, Chautauqua, Chemung, Chenango, Clinton, Columbia, Cortland, Delaware, Dutchess, Erie, Franklin, Fulton, Genesee, Herkimer, Jefferson, Kings, Lewis, Livingston, Madison, Monroe, Montgomery, Nassau, New York, Niagara, Oneida, Onondaga, Ontario, Orange, Orleans, Oswego, Otsego, Putnam, Queens, Rensselaer, Richmond, Rockland, St. Lawrence, Saratoga, Schenectady, Schoharie, Steuben, Suffolk, Sullivan, Tioga, Tompkins, Ulster, Warren, Washington, Wayne, Westchester, Wyoming.

These counties have low MOE percentages (below 5%), so their estimates can be used with confidence.

Counties requiring additional oversight: Essex, Greene, Schuyler, Seneca, Yates

These counties have moderate-confidence data (5-10% MOE). Algorithmic decisions can be used, but outcomes should be monitored, with periodic checks to confirm the accuracy of predictions.

Counties needing alternative approaches: Hamilton

This county has low-confidence data, so algorithmic decisions are not recommended. Manual review or additional data collection is required; human oversight allows algorithmic outputs to be validated and corrected if needed.

Questions for Further Investigation

How do spatial patterns in income and demographic distributions affect the reliability of county- and tract-level data? Are there specific geographic clusters with consistently low reliability?
What temporal trends or changes over time could be observed if historical ACS survey data were included?
How do measurement errors and low survey response rates impact smaller racial or ethnic groups, and what effective safeguards can help prevent bias in algorithmic decisions?

Technical Notes

Data Sources:
- U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
- Retrieved via tidycensus R package on 09/20/2025

Reproducibility:
- All analysis conducted in R version 4.5.1
- Census API key required for replication
- Complete code and documentation available at: https://musa-5080-fall-2025.github.io/portfolio-setup-sihan-yu429/assignments/assignments1/scripts/assignment1_template.html

Methodology Notes: Several key decisions shaped the analysis. Counties were classified into “High,” “Moderate,” and “Low” confidence categories based on ACS income margins of error (MOE), using 5% and 10% cutoffs, where an MOE below 5% indicates a precise estimate and one above 10% signals caution. One county was intentionally selected from each reliability category to illustrate contrasts. This approach highlights differences but means specific findings are illustrative. At the tract level, MOE percentages were calculated for each racial/ethnic group; tracts with a zero estimate for any group, or with an MOE of 100% or more of the estimate, were excluded, and a 50% MOE flag was applied to the remaining tracts to identify particularly unreliable ones. Although the assignment suggested a 15% threshold, it flagged every remaining tract, so the higher threshold focuses discussion on the worst data quality issues. These decisions shaped the reliability categories and subsequent recommendations and should guide interpretation of the county- and tract-level results.

Limitations: The analysis is limited by sample size constraints, especially for smaller racial/ethnic groups at the tract level. This can result in high margins of error. Geographic coverage focuses on selected counties and tracts, so findings may not generalize elsewhere. Temporal scope is limited to the 2018–2022 ACS 5-year estimates, which may not capture rapid changes. Additionally, ACS survey data itself contains inherent inaccuracies due to sampling and nonresponse errors.

Submission Checklist

Before submitting your portfolio link on Canvas:

All code chunks run without errors
All “[Fill this in]” prompts have been completed
Tables are properly formatted and readable
Executive summary addresses all four required components
Portfolio navigation includes this assignment
Census API key is properly set
Document renders correctly to HTML

Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/your_file_name.html