# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
library(tidycensus)
library(tidyverse)
library(knitr)
# Set your Census API key
census_api_key("e173851c633e89c20243632174db68f63bac8856")
# Choose your state for analysis - assign it to a variable called my_state
my_state <- "New York"
Assignment 1: Census Data Quality for Policy Decisions
Evaluating Data Reliability for Algorithmic Decision-Making
Assignment Overview
Scenario
You are a data analyst for the [Your State] Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.
Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.
Learning Objectives
- Apply dplyr functions to real census data for policy analysis
- Evaluate data quality using margins of error
- Connect technical analysis to algorithmic decision-making
- Identify potential equity implications of data reliability issues
- Create professional documentation for policy stakeholders
Submission Instructions
Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/
Make sure to update your _quarto.yml navigation to include this assignment under an “Assignments” menu.
Part 1: Portfolio Integration
Create this assignment in your portfolio repository under an assignments/assignment_1/ folder structure. Update your navigation menu to include:
- text: Assignments
  menu:
    - href: assignments/assignment_1/your_file_name.qmd
      text: "Assignment 1: Census Data Exploration"
If the text contains a special character such as a comma or colon, enclose it in double quotes so that Quarto reads it as text.
Setup
State Selection: I have chosen New York for this analysis because it has a large population, which makes it an interesting case for studying how census data reflects socioeconomic patterns at a large scale.
Part 2: County-Level Resource Assessment
2.1 Data Retrieval
Your Task: Use get_acs() to retrieve county-level data for your chosen state.
Requirements:
- Geography: county level
- Variables: median household income (B19013_001) and total population (B01003_001)
- Year: 2022
- Survey: acs5
- Output format: wide
Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.
# Write your get_acs() code here
county_data <- get_acs(
  geography = "county",
  variables = c(
    Median_income = "B19013_001",
    population = "B01003_001"
  ),
  state = "NY",
  year = 2022,
  survey = "acs5",
  output = "wide"
)

# Clean the county names to remove state name and "County"
# Hint: use mutate() with str_remove()
county_data <- county_data %>%
  mutate(
    NAME = NAME %>%
      str_remove(", New York") %>%
      str_remove("County")
  )

# Display the first few rows
head(county_data)
# A tibble: 6 × 6
GEOID NAME Median_incomeE Median_incomeM populationE populationM
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 36001 "Albany " 78829 2049 315041 NA
2 36003 "Allegany " 58725 1965 47222 NA
3 36005 "Bronx " 47036 890 1443229 NA
4 36007 "Broome " 58317 1761 198365 NA
5 36009 "Cattaraugus " 56889 1778 77000 NA
6 36011 "Cayuga " 63227 2736 76171 NA
2.2 Data Quality Assessment
Your Task: Calculate margin of error percentages and create reliability categories.
Requirements:
- Calculate MOE percentage: (margin of error / estimate) * 100
- Create reliability categories:
  - High Confidence: MOE < 5%
  - Moderate Confidence: MOE 5-10%
  - Low Confidence: MOE > 10%
- Create a flag for unreliable estimates (MOE > 10%)
Hint: Use mutate() with case_when() for the categories.
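As a quick check of the formula, here is the calculation worked by hand for Albany County, using the estimate (78,829) and margin of error (2,049) from the head() output in 2.1:

```latex
\[
\text{MOE\%} = \frac{\text{margin of error}}{\text{estimate}} \times 100
             = \frac{2049}{78829} \times 100 \approx 2.60\%
\]
```

Because 2.60% is below the 5% cutoff, Albany falls in the High Confidence category.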
# Calculate MOE percentage and reliability categories using mutate()
county_data = county_data %>%
  mutate(
    MOE_percentage = (Median_incomeM / Median_incomeE) * 100,
    reliability = case_when(
      MOE_percentage < 5 ~ "High confidence",
      # <= 10 so that values of exactly 5% or 10% are not left as NA
      MOE_percentage <= 10 ~ "Moderate confidence",
      MOE_percentage > 10 ~ "Low confidence"
    ),
    # Compare the column itself, not the string "MOE percentage"
    unreliable_flag = MOE_percentage > 10
  )
county_data
# A tibble: 62 × 9
GEOID NAME Median_incomeE Median_incomeM populationE populationM
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 36001 "Albany " 78829 2049 315041 NA
2 36003 "Allegany " 58725 1965 47222 NA
3 36005 "Bronx " 47036 890 1443229 NA
4 36007 "Broome " 58317 1761 198365 NA
5 36009 "Cattaraugus " 56889 1778 77000 NA
6 36011 "Cayuga " 63227 2736 76171 NA
7 36013 "Chautauqua " 54625 1754 127440 NA
8 36015 "Chemung " 61358 2475 83584 NA
9 36017 "Chenango " 61741 2526 47096 NA
10 36019 "Clinton " 67097 2802 79839 NA
# ℹ 52 more rows
# ℹ 3 more variables: MOE_percentage <dbl>, reliability <chr>,
# unreliable_flag <lgl>
# Create a summary showing count of counties in each reliability category
# Hint: use count() and mutate() to add percentages
library(dplyr)
library(knitr)
reliability_summary <- county_data %>%
  count(reliability) %>%
  mutate(percentage = n / sum(n) * 100)
reliability_summary
# A tibble: 3 × 3
reliability n percentage
<chr> <int> <dbl>
1 High confidence 56 90.3
2 Low confidence 1 1.61
3 Moderate confidence 5 8.06
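One detail worth checking in any case_when() classification like this is what happens when an MOE percentage lands exactly on a cutoff. The sketch below is illustrative only: the Albany, Schuyler, and Hamilton values are copied from the tables in this document, while the two "Boundary" rows are hypothetical values chosen to hit exactly 5% and 10%. It assumes that boundary values belong to the Moderate category, which is one reasonable reading of "MOE 5-10%".

```r
library(dplyr)

# Real values from this document plus two hypothetical boundary rows
demo <- data.frame(
  county   = c("Albany", "Schuyler", "Hamilton", "Boundary 5", "Boundary 10"),
  estimate = c(78829, 61316, 66891, 50000, 50000),
  moe      = c(2049, 5818, 7622, 2500, 5000)
)

demo_classified <- demo %>%
  mutate(
    moe_pct = 100 * moe / estimate,
    reliability = case_when(
      moe_pct < 5   ~ "High confidence",
      moe_pct <= 10 ~ "Moderate confidence",  # includes exactly 5% and 10%
      TRUE          ~ "Low confidence"
    )
  )

demo_classified
```

With conditions written this way every row receives a category. Conditions that use only strict inequalities on both sides (for example, > 5 and < 10) would leave the boundary rows as NA.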
2.3 High Uncertainty Counties
Your Task: Identify the 5 counties with the highest MOE percentages.
Requirements:
- Sort by MOE percentage (highest first)
- Select the top 5 counties
- Display: county name, median income, margin of error, MOE percentage, reliability category
- Format as a professional table using kable()
Hint: Use arrange(), slice(), and select() functions.
# Create table of top 5 counties by MOE percentage
top_MOE <- county_data %>%
  arrange(desc(MOE_percentage)) %>%
  slice(1:5) %>%
  select(
    County = NAME,
    Median_Income = Median_incomeE,
    MOE = Median_incomeM,
    MOE_percentage = MOE_percentage,
    Reliability = reliability
  )

# Format as table with kable() - include appropriate column names and caption
kable(
  top_MOE,
  caption = "Top 5 Counties with Highest Income MOE Percentage",
  col.names = c(
    "County",
    "Median Income (Estimate)",
    "Median Income (MOE)",
    "MOE Percentage",
    "Reliability Category"
  ),
  digits = 1
)
County | Median Income (Estimate) | Median Income (MOE) | MOE Percentage | Reliability Category |
---|---|---|---|---|
Hamilton | 66891 | 7622 | 11.4 | Low confidence |
Schuyler | 61316 | 5818 | 9.5 | Moderate confidence |
Greene | 70294 | 4341 | 6.2 | Moderate confidence |
Yates | 63974 | 3733 | 5.8 | Moderate confidence |
Essex | 68090 | 3590 | 5.3 | Moderate confidence |
Data Quality Commentary: The table shows the top 5 counties with relatively high MOE percentages, indicating that the median household income estimates have larger uncertainties. Among them, Hamilton County has the highest MOE percentage at 11.4%, suggesting that income data in this county may be less reliable. This could be due to high income variability within the county, a relatively small population, or the presence of outliers. Algorithms that rely on these median income estimates may produce biased or inaccurate decisions in these areas.
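To make the uncertainty more concrete, the margins of error can be turned into intervals. ACS margins of error are published at the 90% confidence level, so estimate ± MOE gives an approximate 90% confidence interval. The sketch below uses only the five rows from the table above; the column names lower, upper, and overlaps_hamilton are illustrative choices, not part of the assignment template.

```r
# Values copied from the top-5 table above
top5 <- data.frame(
  county   = c("Hamilton", "Schuyler", "Greene", "Yates", "Essex"),
  estimate = c(66891, 61316, 70294, 63974, 68090),
  moe      = c(7622, 5818, 4341, 3733, 3590)
)

# Approximate 90% confidence interval: estimate +/- MOE
top5$lower <- top5$estimate - top5$moe
top5$upper <- top5$estimate + top5$moe

# Does each county's interval overlap Hamilton's?
hamilton <- top5[top5$county == "Hamilton", ]
top5$overlaps_hamilton <- top5$lower <= hamilton$upper &
  top5$upper >= hamilton$lower

top5
```

Hamilton's interval runs from 59,269 to 74,513 and overlaps the intervals of the other four counties, so ranking these counties by point estimate alone would overstate how well the data can tell them apart.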
Part 3: Neighborhood-Level Analysis
3.1 Focus Area Selection
Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.
Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.
# Use filter() to select 2-3 counties from your county_reliability data
# Store the selected counties in a variable called selected_counties
my_counties <- c("Orange", "Yates", "Hamilton")
selected_counties <- county_data %>%
  filter(str_detect(NAME, paste(my_counties, collapse = "|")))

# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category
selected_counties %>%
  select(
    County = NAME,
    Median_income = Median_incomeE,
    MOE_percentage,
    reliability
  ) %>%
  kable(caption = "Selected Counties with Median Income, MOE %, and Reliability")
County | Median_income | MOE_percentage | reliability |
---|---|---|---|
Hamilton | 66891 | 11.394657 | Low confidence |
Orange | 91806 | 1.939960 | High confidence |
Yates | 63974 | 5.835183 | Moderate confidence |
Comment on the output: Orange County has the highest median income of the three and a low MOE percentage, which makes its estimate the most reliable. Yates and Hamilton have lower median incomes and larger MOE percentages, which reduces confidence in their estimates. This suggests that wealthier counties might have more reliable data.
3.2 Tract-Level Demographics
Your Task: Get demographic data for census tracts in your selected counties.
Requirements:
- Geography: tract level
- Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
- Use the same state and year as before
- Output format: wide
- Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.
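The GEOID pattern mentioned in the challenge can be sketched with base R. The two GEOIDs below are copied from outputs in this document (Albany County and a tract in Orange County); the variable names are illustrative.

```r
# GEOIDs copied from this document's outputs
county_geoid <- "36001"        # Albany County
tract_geoid  <- "36071000501"  # Census Tract 5.01, Orange County

# Characters 1-2 are the state FIPS code and characters 3-5 the county code;
# for a tract GEOID, the remaining six characters identify the tract.
state_fips   <- substr(county_geoid, 1, 2)
county_code  <- substr(county_geoid, 3, 5)
tract_county <- substr(tract_geoid, 3, 5)

c(state = state_fips, albany = county_code, orange = tract_county)
```

Three-digit codes extracted this way could be passed to the county argument of get_acs() to restrict a tract-level request to the selected counties, rather than downloading every tract in the state and filtering afterwards.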
# Define your race/ethnicity variables with descriptive names
race = c(
  White = "B03002_003",
  Black = "B03002_004",
  Hispanic = "B03002_012",
  total_pop = "B03002_001"
)

# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter
tract_data = get_acs(
  geography = "tract",
  variables = race,
  state = "NY",
  year = 2022,
  survey = "acs5",
  output = "wide"
)

# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations
tract_data = tract_data %>%
  mutate(
    white_p = 100 * WhiteE / total_popE,
    Black_p = 100 * BlackE / total_popE,
    Hispanic_p = 100 * HispanicE / total_popE
  )

# Add readable tract and county name columns using str_extract() or similar
tract_data <- tract_data %>%
  mutate(
    tract_name = str_replace(NAME, ";\\s*[^,]+ County;\\s*[^,]+$", ""),
    county_name = str_extract(NAME, ";\\s*[^,]+ County") %>%
      str_remove("^;\\s*") %>%
      str_remove("\\s*County$")
  )
kable(head(tract_data))
GEOID | NAME | WhiteE | WhiteM | BlackE | BlackM | HispanicE | HispanicM | total_popE | total_popM | white_p | Black_p | Hispanic_p | tract_name | county_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
36001000100 | Census Tract 1; Albany County; New York | 725 | 340 | 982 | 362 | 346 | 217 | 2259 | 512 | 32.09385 | 43.470562 | 15.316512 | Census Tract 1 | Albany |
36001000201 | Census Tract 2.01; Albany County; New York | 372 | 198 | 1742 | 613 | 174 | 156 | 2465 | 608 | 15.09128 | 70.669371 | 7.058823 | Census Tract 2.01 | Albany |
36001000202 | Census Tract 2.02; Albany County; New York | 317 | 193 | 1952 | 684 | 45 | 68 | 2374 | 668 | 13.35299 | 82.224094 | 1.895535 | Census Tract 2.02 | Albany |
36001000301 | Census Tract 3.01; Albany County; New York | 678 | 431 | 1271 | 480 | 673 | 278 | 2837 | 581 | 23.89848 | 44.800846 | 23.722242 | Census Tract 3.01 | Albany |
36001000302 | Census Tract 3.02; Albany County; New York | 1963 | 496 | 538 | 345 | 183 | 113 | 3200 | 500 | 61.34375 | 16.812500 | 5.718750 | Census Tract 3.02 | Albany |
36001000401 | Census Tract 4.01; Albany County; New York | 2012 | 366 | 134 | 92 | 98 | 78 | 2301 | 399 | 87.44024 | 5.823555 | 4.259018 | Census Tract 4.01 | Albany |
3.3 Demographic Analysis
Your Task: Analyze the demographic patterns in your selected areas.
# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract
top_hispanic <- tract_data %>%
  filter(county_name %in% my_counties) %>%
  arrange(desc(Hispanic_p)) %>%
  slice(1)

# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group
county_summary <- tract_data %>%
  filter(county_name %in% my_counties) %>%
  group_by(county_name) %>%
  summarize(
    num_tracts = n(),
    avg_white_pct = mean(white_p),
    avg_black_pct = mean(Black_p),
    avg_hispanic_pct = mean(Hispanic_p)
  )

# Create a nicely formatted table of your results using kable()
kable(
  top_hispanic,
  caption = "Tract with Highest Hispanic/Latino Percentage (Selected Counties)"
)
GEOID | NAME | WhiteE | WhiteM | BlackE | BlackM | HispanicE | HispanicM | total_popE | total_popM | white_p | Black_p | Hispanic_p | tract_name | county_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
36071000501 | Census Tract 5.01; Orange County; New York | 246 | 136 | 362 | 274 | 1723 | 631 | 2473 | 681 | 9.947432 | 14.63809 | 69.67246 | Census Tract 5.01 | Orange |
kable(county_summary, caption = "Average Demographic Percentages by County")
county_name | num_tracts | avg_white_pct | avg_black_pct | avg_hispanic_pct |
---|---|---|---|---|
Hamilton | 4 | 91.87527 | 1.240264 | 2.038062 |
Orange | 92 | 59.23838 | 10.868294 | 23.054479 |
Yates | 8 | 94.17341 | 0.520878 | 2.116160 |
Part 4: Comprehensive Data Quality Evaluation
4.1 MOE Analysis for Demographic Variables
Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.
Requirements:
- Calculate MOE percentages for each demographic variable
- Flag tracts where any demographic variable has MOE > 15%
- Create summary statistics
# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)
# Keep only tracts with a nonzero estimate for all three groups
tract_data_filtered <- tract_data %>%
  filter(WhiteE > 0, BlackE > 0, HispanicE > 0)

# Compute MOE percentages, then drop tracts where any MOE is 100% or more
tract_data_filtered = tract_data_filtered %>%
  mutate(
    white_moe = 100 * WhiteM / WhiteE,
    Black_moe = 100 * BlackM / BlackE,
    Hispanic_moe = 100 * HispanicM / HispanicE
  ) %>%
  filter(
    white_moe < 100,
    Black_moe < 100,
    Hispanic_moe < 100
  )

# Create a flag for tracts with high MOE on any demographic variable
# Use logical operators (| for OR) in an ifelse() statement
tract_moe = tract_data_filtered %>%
  mutate(
    high_moe = ifelse(
      white_moe > 15 | Black_moe > 15 | Hispanic_moe > 15,
      TRUE, FALSE
    )
  )

# Create summary statistics showing how many tracts have data quality issues
summary <- tract_moe %>%
  summarize(
    total_tracts = n(),
    tracts_high_moe = sum(high_moe),
    pct_high_moe = (sum(high_moe) / n() * 100)
  )

summary_county <- tract_moe %>%
  filter(county_name %in% my_counties) %>%
  group_by(county_name) %>%
  summarize(
    total_tracts = n(),
    tracts_high_moe = sum(high_moe),
    pct_high_moe = (sum(high_moe) / n() * 100)
  )

kable(summary, caption = "Overall Data Quality: MOE Analysis")
total_tracts | tracts_high_moe | pct_high_moe |
---|---|---|
2669 | 2669 | 100 |
summary_county %>%
  kable(
    caption = "Summary of tracts with high MOE (>15%)",
    col.names = c("County", "N tracts", "High-MOE tracts", "% High-MOE")
  )
County | N tracts | High-MOE tracts | % High-MOE |
---|---|---|---|
Hamilton | 1 | 1 | 100 |
Orange | 68 | 68 | 100 |
Yates | 2 | 2 | 100 |
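A 100% flag rate looks less surprising once the tract-level numbers are checked by hand. The sketch below recomputes the MOE percentages for two of the Albany County tracts shown in the 3.2 output (the first and last rows of that table), using the same formula as the analysis code. The min_moe column is an illustrative addition, not part of the assignment template.

```r
# Estimates (E) and margins of error (M) copied from the 3.2 output
tracts <- data.frame(
  tract     = c("Census Tract 1", "Census Tract 4.01"),
  WhiteE    = c(725, 2012), WhiteM    = c(340, 366),
  BlackE    = c(982, 134),  BlackM    = c(362, 92),
  HispanicE = c(346, 98),   HispanicM = c(217, 78)
)

tracts$white_moe    <- 100 * tracts$WhiteM / tracts$WhiteE
tracts$Black_moe    <- 100 * tracts$BlackM / tracts$BlackE
tracts$Hispanic_moe <- 100 * tracts$HispanicM / tracts$HispanicE

# Smallest of the three MOE percentages in each tract
tracts$min_moe <- pmin(tracts$white_moe, tracts$Black_moe, tracts$Hispanic_moe)

print(
  tracts[, c("tract", "white_moe", "Black_moe", "Hispanic_moe", "min_moe")],
  digits = 3
)
```

Even the most precise group estimate in each of these two tracts has an MOE above 15% of the estimate, so a rule that flags a tract when any group exceeds 15% will flag both.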
4.2 Pattern Analysis
Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.
# Group tracts by whether they have high MOE issues
# Calculate average characteristics for each group:
# - population size, demographic percentages
# Use group_by() and summarize() to create this comparison
# The flag threshold is raised from 15% to 50% here so that two groups exist
tract_moe_changed = tract_data_filtered %>%
  mutate(
    high_moe = ifelse(
      white_moe > 50 | Black_moe > 50 | Hispanic_moe > 50,
      TRUE, FALSE
    )
  )

pattern_analysis <- tract_moe_changed %>%
  mutate(high_moe_flag = ifelse(high_moe, "High MOE", "Low MOE")) %>%
  group_by(high_moe_flag) %>%
  summarize(
    num_tracts = n(),
    avg_population = mean(total_popE, na.rm = TRUE),
    avg_white_pct = mean(WhiteE / total_popE * 100, na.rm = TRUE),
    avg_black_pct = mean(BlackE / total_popE * 100, na.rm = TRUE),
    avg_hispanic_pct = mean(HispanicE / total_popE * 100, na.rm = TRUE)
  )

# Create a professional table showing the patterns
pattern_analysis %>%
  kable(
    digits = 1,
    col.names = c("High MOE Flag", "N tracts", "Avg Pop", "Avg % White", "Avg % Black", "Avg % Hispanic"),
    caption = "Comparison of tract characteristics by data quality group"
  )
High MOE Flag | N tracts | Avg Pop | Avg % White | Avg % Black | Avg % Hispanic |
---|---|---|---|---|---|
High MOE | 2141 | 3989.4 | 47.7 | 17.0 | 21.5 |
Low MOE | 528 | 4564.0 | 35.3 | 25.7 | 27.6 |
Pattern Analysis: Using a 15% MOE threshold produced only one group, limiting the ability to compare patterns. To enable analysis, the threshold was adjusted to 50%, yielding two groups: High MOE (2141 tracts, ~80%) and Low MOE (528 tracts). Yet even under this more lenient criterion, the majority of tracts remain classified as High MOE, highlighting substantial data reliability concerns. Interestingly, these tracts tend to have smaller populations, higher proportions of White residents, and lower proportions of Black and Hispanic residents—patterns that run counter to the initial expectation that more racially diverse areas would exhibit greater uncertainty. Overall, the prevalence of high MOE across most tracts underscores serious data reliability issues.
Part 5: Policy Recommendations
5.1 Analysis Integration and Professional Summary
Your Task: Write an executive summary that integrates findings from all four analyses.
Executive Summary Requirements:
1. Overall Pattern Identification: What are the systematic patterns across all your analyses?
2. Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings?
3. Root Cause Analysis: What underlying factors drive both data quality issues and bias risk?
4. Strategic Recommendations: What should the Department implement to address these systematic issues?
Executive Summary: The analyses reveal consistent geographic and demographic disparities in median income and data quality across counties. Higher-income counties, such as Orange (median income ≈ 91,806), generally show high data reliability and low margins of error (MOE), while more rural counties such as Hamilton (median income ≈ 66,891, MOE 11.4%) show low confidence in their estimates. Income alone is not a reliable guide, however: Bronx has the lowest median income in the county table (≈ 47,036) yet an MOE of only 1.9%, which places it in the high-confidence category. Across all tracts, High-MOE areas tend to have smaller populations (≈3,989 per tract on average) and higher percentages of White residents, whereas Low-MOE areas tend to be larger and more diverse.
Communities at greatest risk of algorithmic bias are not necessarily the tracts with higher minority percentages, but rather the small, high-MOE, predominantly White tracts. These high-uncertainty areas are more likely to produce unreliable estimates, so an algorithm could allocate resources unevenly in less diverse counties. Low-MOE, more diverse tracts are comparatively better represented in the data, which reduces the risk of bias for minority communities.
High margins of error (MOE) and data unreliability in New York State are driven by several factors. Population size and demographic homogeneity play a major role: predominantly White, small-population tracts are more likely to exhibit high MOE, amplifying uncertainty for algorithmic applications. Sampling variability further contributes, as rural or low-density areas are harder to survey accurately. Some communities (such as extremely low-income households, residents of rural areas, certain minority populations, and immigrants) might have low response rates to the census and surveys, resulting in sparse and uncertain data. Socioeconomic factors, including income levels and population dispersion, also affect the statistical precision of estimates. Additionally, some tracts report MOEs greater than 100% of the estimate, which highlights the severity of the uncertainty: such estimates are statistically very unreliable and sharply increase the risk of misrepresentation in algorithms and policy decisions.
Strategic Recommendations
Enhance Data Collection: Increase survey coverage in high-MOE, low-population tracts to improve the reliability of estimates.
Incorporate MOE in Modeling: Integrate uncertainty measures into algorithmic models to weight outputs appropriately and reduce reliance on potentially biased estimates.
Community Engagement: Encourage participation in surveys and censuses in smaller or rural tracts, particularly among low-income, minority, and immigrant populations, to enhance data completeness and reduce uncertainty.
Equity-Focused Resource Allocation: Implement human review for areas with poor data quality; if data reliability is low, do not let algorithms make decisions independently. Pay special attention to high-MOE tracts to prevent misdirected interventions.
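One simple way to bring margins of error into a decision rule is to check whether two estimates are statistically distinguishable before ranking one area above another. The sketch below is a minimal illustration, not the Department's method: acs_differs() is a hypothetical helper written for this example, it assumes the two estimates are independent, and it relies on the fact that ACS margins of error are published at the 90% confidence level (standard error = MOE / 1.645). The income values come from the county tables in this document.

```r
# Hypothetical helper: tests whether two ACS estimates differ
# at the 90% confidence level.
acs_differs <- function(est1, moe1, est2, moe2, z_crit = 1.645) {
  se1 <- moe1 / 1.645  # convert 90% MOE to standard error
  se2 <- moe2 / 1.645
  z <- (est1 - est2) / sqrt(se1^2 + se2^2)
  list(z = z, significant = abs(z) > z_crit)
}

# Median household income: Hamilton vs. Yates
hamilton_vs_yates <- acs_differs(66891, 7622, 63974, 3733)

# Median household income: Albany vs. Hamilton
albany_vs_hamilton <- acs_differs(78829, 2049, 66891, 7622)

hamilton_vs_yates$significant
albany_vs_hamilton$significant
```

In this example the Hamilton and Yates estimates are not distinguishable (z is about 0.57), while Albany and Hamilton are (z is about 2.49). An allocation rule could treat statistically indistinguishable areas as tied and send those cases to human review instead of ranking them.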
5.2 Specific Recommendations
Your Task: Create a decision framework for algorithm implementation.
# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category
# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"
# - Low Confidence: "Requires manual review or additional data"
county_data <- county_data %>%
  mutate(
    algorithm_recommendation = case_when(
      reliability == "High confidence" ~ "Safe for algorithmic decisions",
      reliability == "Moderate confidence" ~ "Use with caution - monitor outcomes",
      reliability == "Low confidence" ~ "Requires manual review or additional data",
      TRUE ~ "Unknown"
    )
  )

# Format as a professional table with kable()
summary_table <- county_data %>%
  select(county_name = NAME, Median_income = Median_incomeE, MOE_percentage,
         reliability, algorithm_recommendation)

# Display professional table
kable(summary_table,
      caption = "County-Level Median Income Reliability and Algorithm Recommendations",
      col.names = c("County", "Median Income", "MOE (%)", "Reliability Category",
                    "Algorithm Recommendation"),
      digits = 2)
County | Median Income | MOE (%) | Reliability Category | Algorithm Recommendation |
---|---|---|---|---|
Albany | 78829 | 2.60 | High confidence | Safe for algorithmic decisions |
Allegany | 58725 | 3.35 | High confidence | Safe for algorithmic decisions |
Bronx | 47036 | 1.89 | High confidence | Safe for algorithmic decisions |
Broome | 58317 | 3.02 | High confidence | Safe for algorithmic decisions |
Cattaraugus | 56889 | 3.13 | High confidence | Safe for algorithmic decisions |
Cayuga | 63227 | 4.33 | High confidence | Safe for algorithmic decisions |
Chautauqua | 54625 | 3.21 | High confidence | Safe for algorithmic decisions |
Chemung | 61358 | 4.03 | High confidence | Safe for algorithmic decisions |
Chenango | 61741 | 4.09 | High confidence | Safe for algorithmic decisions |
Clinton | 67097 | 4.18 | High confidence | Safe for algorithmic decisions |
Columbia | 81741 | 3.39 | High confidence | Safe for algorithmic decisions |
Cortland | 65029 | 4.42 | High confidence | Safe for algorithmic decisions |
Delaware | 58338 | 3.67 | High confidence | Safe for algorithmic decisions |
Dutchess | 94578 | 2.66 | High confidence | Safe for algorithmic decisions |
Erie | 68014 | 1.18 | High confidence | Safe for algorithmic decisions |
Essex | 68090 | 5.27 | Moderate confidence | Use with caution - monitor outcomes |
Franklin | 60270 | 4.81 | High confidence | Safe for algorithmic decisions |
Fulton | 60557 | 4.37 | High confidence | Safe for algorithmic decisions |
Genesee | 68178 | 4.57 | High confidence | Safe for algorithmic decisions |
Greene | 70294 | 6.18 | Moderate confidence | Use with caution - monitor outcomes |
Hamilton | 66891 | 11.39 | Low confidence | Requires manual review or additional data |
Herkimer | 68104 | 4.79 | High confidence | Safe for algorithmic decisions |
Jefferson | 62782 | 3.64 | High confidence | Safe for algorithmic decisions |
Kings | 74692 | 1.27 | High confidence | Safe for algorithmic decisions |
Lewis | 64401 | 4.16 | High confidence | Safe for algorithmic decisions |
Livingston | 70443 | 3.99 | High confidence | Safe for algorithmic decisions |
Madison | 68869 | 4.04 | High confidence | Safe for algorithmic decisions |
Monroe | 71450 | 1.35 | High confidence | Safe for algorithmic decisions |
Montgomery | 58033 | 3.63 | High confidence | Safe for algorithmic decisions |
Nassau | 137709 | 1.39 | High confidence | Safe for algorithmic decisions |
New York | 99880 | 1.78 | High confidence | Safe for algorithmic decisions |
Niagara | 65882 | 2.67 | High confidence | Safe for algorithmic decisions |
Oneida | 66402 | 3.27 | High confidence | Safe for algorithmic decisions |
Onondaga | 71479 | 1.57 | High confidence | Safe for algorithmic decisions |
Ontario | 76603 | 2.94 | High confidence | Safe for algorithmic decisions |
Orange | 91806 | 1.94 | High confidence | Safe for algorithmic decisions |
Orleans | 61069 | 4.89 | High confidence | Safe for algorithmic decisions |
Oswego | 65054 | 3.26 | High confidence | Safe for algorithmic decisions |
Otsego | 65778 | 4.51 | High confidence | Safe for algorithmic decisions |
Putnam | 120970 | 4.03 | High confidence | Safe for algorithmic decisions |
Queens | 82431 | 1.06 | High confidence | Safe for algorithmic decisions |
Rensselaer | 83734 | 2.27 | High confidence | Safe for algorithmic decisions |
Richmond | 96185 | 2.60 | High confidence | Safe for algorithmic decisions |
Rockland | 106173 | 2.88 | High confidence | Safe for algorithmic decisions |
St. Lawrence | 58339 | 3.47 | High confidence | Safe for algorithmic decisions |
Saratoga | 97038 | 2.26 | High confidence | Safe for algorithmic decisions |
Schenectady | 75056 | 3.03 | High confidence | Safe for algorithmic decisions |
Schoharie | 71479 | 3.96 | High confidence | Safe for algorithmic decisions |
Schuyler | 61316 | 9.49 | Moderate confidence | Use with caution - monitor outcomes |
Seneca | 64050 | 5.24 | Moderate confidence | Use with caution - monitor outcomes |
Steuben | 62506 | 2.87 | High confidence | Safe for algorithmic decisions |
Suffolk | 122498 | 1.18 | High confidence | Safe for algorithmic decisions |
Sullivan | 67841 | 4.35 | High confidence | Safe for algorithmic decisions |
Tioga | 70427 | 3.99 | High confidence | Safe for algorithmic decisions |
Tompkins | 69995 | 4.01 | High confidence | Safe for algorithmic decisions |
Ulster | 77197 | 4.52 | High confidence | Safe for algorithmic decisions |
Warren | 74531 | 4.74 | High confidence | Safe for algorithmic decisions |
Washington | 68703 | 3.41 | High confidence | Safe for algorithmic decisions |
Wayne | 71007 | 3.10 | High confidence | Safe for algorithmic decisions |
Westchester | 114651 | 1.56 | High confidence | Safe for algorithmic decisions |
Wyoming | 65066 | 3.38 | High confidence | Safe for algorithmic decisions |
Yates | 63974 | 5.84 | Moderate confidence | Use with caution - monitor outcomes |
Key Recommendations:
Your Task: Use your analysis results to provide specific guidance to the department.
- Counties suitable for immediate algorithmic implementation: Albany, Allegany, Bronx, Broome, Cattaraugus, Cayuga, Chautauqua, Chemung, Chenango, Clinton, Columbia, Cortland, Delaware, Dutchess, Erie, Franklin, Fulton, Genesee, Herkimer, Jefferson, Kings, Lewis, Livingston, Madison, Monroe, Montgomery, Nassau, New York, Niagara, Oneida, Onondaga, Ontario, Orange, Orleans, Oswego, Otsego, Putnam, Queens, Rensselaer, Richmond, Rockland, St. Lawrence, Saratoga, Schenectady, Schoharie, Steuben, Suffolk, Sullivan, Tioga, Tompkins, Ulster, Warren, Washington, Wayne, Westchester, Wyoming.
These 56 counties have an MOE below 5%, so their estimates can be used with confidence.
- Counties requiring additional oversight: Essex, Greene, Schuyler, Seneca, Yates
These counties have moderate-confidence data, with an MOE of 5-10%. Algorithmic decisions can be used, but the outcomes should be monitored, with periodic checks to confirm the accuracy of predictions.
- Counties needing alternative approaches: Hamilton
This county has low-confidence data (MOE above 10%), so algorithmic decisions are not recommended. Manual review or additional data collection is required. Adding human oversight allows algorithmic outputs to be validated and corrected if needed.
Questions for Further Investigation
How do spatial patterns in income and demographic distributions affect the reliability of county- and tract-level data? Are there specific geographic clusters with consistently low reliability?
What temporal trends or changes over time could be observed if historical ACS survey data were included?
How do measurement errors and low survey response rates impact smaller racial or ethnic groups, and what effective safeguards can help prevent bias in algorithmic decisions?
Technical Notes
Data Sources:
- U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
- Retrieved via tidycensus R package on 09/20/2025
Reproducibility:
- All analysis conducted in R version [4.5.1]
- Census API key required for replication
- Complete code and documentation available at: https://musa-5080-fall-2025.github.io/portfolio-setup-sihan-yu429/assignments/assignments1/scripts/assignment1_template.html
Methodology Notes: Several key decisions shaped the analysis. Counties were classified into “High,” “Moderate,” and “Low” confidence categories based on ACS income margins of error (MOE), using 5% and 10% cutoffs: below 5% indicates a precise estimate, and above 10% signals caution. One county was intentionally selected from each reliability category to illustrate contrasts, so the county-specific findings are illustrative and not representative. At the tract level, MOE percentages were calculated for each racial/ethnic group after excluding tracts with a zero estimate for any group or an MOE of 100% or more. The assignment suggested a 15% threshold, but it flagged every remaining tract, so a 50% flag was applied to focus the comparison on the worst data quality issues. These decisions shaped the reliability categories and the subsequent recommendations, and they should guide interpretation of the county- and tract-level results.
Limitations: The analysis is limited by sample size constraints, especially for smaller racial/ethnic groups at the tract level. This can result in high margins of error. Geographic coverage focuses on selected counties and tracts, so findings may not generalize elsewhere. Temporal scope is limited to the 2018–2022 ACS 5-year estimates, which may not capture rapid changes. Additionally, ACS survey data itself contains inherent inaccuracies due to sampling and nonresponse errors.
Submission Checklist
Before submitting your portfolio link on Canvas:
Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/your_file_name.html