Assignment 1: Census Data Quality for Policy Decisions

Evaluating Data Reliability for Algorithmic Decision-Making

Author

Xinyuan(Christine) Cui

Published

September 29, 2025

Assignment Overview

Scenario

You are a data analyst for the Texas Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.

Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.

Learning Objectives

  • Apply dplyr functions to real census data for policy analysis
  • Evaluate data quality using margins of error
  • Connect technical analysis to algorithmic decision-making
  • Identify potential equity implications of data reliability issues
  • Create professional documentation for policy stakeholders

Submission Instructions

Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/

Make sure to update your _quarto.yml navigation to include this assignment under an “Assignments” menu.

Part 1: Portfolio Integration

Create this assignment in your portfolio repository under an assignments/assignment_1/ folder structure. Update your navigation menu to include:

- text: Assignments
  menu:
    - href: assignments/assignment_1/your_file_name.qmd
      text: "Assignment 1: Census Data Exploration"

If the text contains a special character such as a comma, wrap it in double quotes so that Quarto parses the value as a single string.

Setup

# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
"knitr" %in% rownames(installed.packages())
[1] TRUE
library(tidycensus)
library(tidyverse)
library(knitr)
# Set your Census API key (keep the literal key out of published documents;
# store it in .Renviron and read it with Sys.getenv())
census_api_key(Sys.getenv("CENSUS_API_KEY"), overwrite = TRUE)
# Choose your state for analysis - assign it to a variable called my_state
my_state <- "TX"

State Selection: I chose Texas because its combination of rapid population growth, demographic diversity, and wide geographic variation creates exactly the kinds of data quality challenges that matter when evaluating algorithmic fairness.

Part 2: County-Level Resource Assessment

2.1 Data Retrieval

Your Task: Use get_acs() to retrieve county-level data for your chosen state.

Requirements:

  • Geography: county level
  • Variables: median household income (B19013_001) and total population (B01003_001)
  • Year: 2022
  • Survey: acs5
  • Output format: wide

Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.

# Write your get_acs() code here
county_data_texas <- get_acs(
  geography = "county",        
  variables = c(               
    median_household_income = "B19013_001",
    total_population = "B01003_001"
  ),
  state = my_state,            
  year = 2022,                 
  survey = "acs5",             
  output = "wide"              
)

# Clean the county names to remove state name and "County" 
# Hint: use mutate() with str_remove()
county_data_texas <- county_data_texas %>%
  mutate(
    # remove state name(Texas)
    NAME = str_remove(NAME, ", Texas"),
    # remove " County"
    NAME = str_remove(NAME, " County")
  )
# Display the first few rows
head(county_data_texas)
# A tibble: 6 × 6
  GEOID NAME     median_household_inc…¹ median_household_inc…² total_populationE
  <chr> <chr>                     <dbl>                  <dbl>             <dbl>
1 48001 Anderson                  57445                   4562             58077
2 48003 Andrews                   86458                  16116             18362
3 48005 Angelina                  57055                   2484             86608
4 48007 Aransas                   58168                   6458             24048
5 48009 Archer                    69954                   8482              8649
6 48011 Armstro…                  70417                  14574              1912
# ℹ abbreviated names: ¹​median_household_incomeE, ²​median_household_incomeM
# ℹ 1 more variable: total_populationM <dbl>

2.2 Data Quality Assessment

Your Task: Calculate margin of error percentages and create reliability categories.

Requirements:

  • Calculate MOE percentage: (margin of error / estimate) * 100
  • Create reliability categories:
    • High Confidence: MOE < 5%
    • Moderate Confidence: MOE 5-10%
    • Low Confidence: MOE > 10%
  • Create a flag for unreliable estimates (MOE > 10%)

Hint: Use mutate() with case_when() for the categories.

# Calculate MOE percentage and reliability categories using mutate()
county_data_texas <- county_data_texas %>%
  mutate(
    # Calculate MOE percentage
    income_moe_pct = (median_household_incomeM / median_household_incomeE) * 100,
    
    # Create reliability categories
    income_reliability = case_when(
      income_moe_pct < 5 ~ "High Confidence",
      income_moe_pct >= 5 & income_moe_pct <= 10 ~ "Moderate Confidence",
      income_moe_pct > 10 ~ "Low Confidence"
    ),
    
    # Create a flag for unreliable estimates
    unreliable_income = income_moe_pct > 10
  )
# Create a summary showing count of counties in each reliability category
# Hint: use count() and mutate() to add percentages
county_data_texas %>%
  count(income_reliability) %>%
  mutate(
    percentage = (n / sum(n)) * 100
  )
# A tibble: 4 × 3
  income_reliability      n percentage
  <chr>               <int>      <dbl>
1 High Confidence        58     22.8  
2 Low Confidence        113     44.5  
3 Moderate Confidence    82     32.3  
4 <NA>                    1      0.394

2.3 High Uncertainty Counties

Your Task: Identify the 5 counties with the highest MOE percentages.

Requirements:

  • Sort by MOE percentage (highest first)
  • Select the top 5 counties
  • Display: county name, median income, margin of error, MOE percentage, reliability category
  • Format as a professional table using kable()

Hint: Use arrange(), slice(), and select() functions.

# Create table of top 5 counties by MOE percentage
# and format it with kable() - appropriate column names and caption
county_data_texas %>%
  arrange(desc(income_moe_pct)) %>%
  slice(1:5) %>%
  select(
    `County Name` = NAME,
    `Median Income` = median_household_incomeE,
    `Margin of Error` = median_household_incomeM,
    `MOE Percentage` = income_moe_pct,
    `Reliability Category` = income_reliability
  ) %>%
  kable(
    caption = "Top 5 Counties with The Highest MOE Percentages",
    digits = 2
  )
Top 5 Counties with The Highest MOE Percentages

| County Name | Median Income | Margin of Error | MOE Percentage | Reliability Category |
|---|---|---|---|---|
| Jeff Davis | 38125 | 25205 | 66.11 | Low Confidence |
| Culberson | 35924 | 18455 | 51.37 | Low Confidence |
| King | 59375 | 29395 | 49.51 | Low Confidence |
| Kinney | 52386 | 23728 | 45.29 | Low Confidence |
| Dimmit | 27374 | 12374 | 45.20 | Low Confidence |

Data Quality Commentary: Based on these results, algorithmic funding decisions relying on median income in counties like Jeff Davis and Culberson (MOE > 50%) would be highly unreliable. These counties risk being poorly served because the true income could lie far from the estimate, potentially leading the algorithm to severely over- or under-allocate crucial resources. The extreme uncertainty stems primarily from small population sizes and low survey response counts, which make precise income estimation effectively impossible in these sparsely populated areas.
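Since ACS margins of error are published at the 90% confidence level, the practical impact of a 66% MOE can be made concrete by computing the implied interval. A minimal base-R sketch using the Jeff Davis figures from the table above:

```r
# ACS margins of error are reported at the 90% confidence level,
# so the interval is simply estimate +/- MOE.
estimate <- 38125  # Jeff Davis median household income
moe      <- 25205  # its margin of error

lower <- estimate - moe
upper <- estimate + moe

c(lower = lower, upper = upper)
# 90% interval: roughly $12,920 to $63,330
```

An algorithm thresholding on, say, a $50,000 income cutoff could not determine from this estimate whether Jeff Davis falls above or below it.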

Part 3: Neighborhood-Level Analysis

3.1 Focus Area Selection

Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.

Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.

# Use filter() to select 2-3 counties from your county_reliability data
# Store the selected counties in a variable called selected_counties
selected_counties <- county_data_texas %>%
  filter(
    income_moe_pct > 30 |
    (income_moe_pct >= 5 & income_moe_pct <= 10) |
    income_moe_pct < 2
  ) %>%
  # keep the first county in each reliability category (3 total)
  group_by(income_reliability) %>%
  slice(1) %>%
  ungroup()
# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category
selected_counties %>%
  select(
    `County Name` = NAME,
    `Median Income` = median_household_incomeE,
    `MOE Percentage` = income_moe_pct,
    `Reliability Category` = income_reliability
  ) %>%
  kable(
    caption = "3 Counties Selected for Tract-Level Study",
    digits= 2
  )
3 Counties Selected for Tract-Level Study

| County Name | Median Income | MOE Percentage | Reliability Category |
|---|---|---|---|
| Bexar | 67275 | 1.15 | High Confidence |
| Brooks | 30566 | 33.82 | Low Confidence |
| Anderson | 57445 | 7.94 | Moderate Confidence |

Comment on the output: I deliberately selected counties representing the more extreme ends of the High and Low Confidence ranges. This strategy ensures a clearer and more impactful comparison of data quality. By focusing on these distinct cases (one very high confidence, one moderate, and one very low confidence), the tract-level analysis will better expose the full spectrum of data reliability issues that could affect the department’s funding algorithms.

3.2 Tract-Level Demographics

Your Task: Get demographic data for census tracts in your selected counties.

Requirements:

  • Geography: tract level
  • Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
  • Use the same state and year as before
  • Output format: wide

Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.

# Define your race/ethnicity variables with descriptive names
ethnicity_variables <- c(
  white_alone = "B03002_003",
  Black_African_American = "B03002_004",
  Hispanic_Latino = "B03002_012",
  total_population = "B03002_001"
)

# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter
selected_county_geoids <- selected_counties$GEOID
selected_county_codes <- str_sub(selected_county_geoids, start = -3) # the last 3 digits of a county GEOID are the county code

tract_data_texas <- get_acs(
  geography = "tract",        
  variables = ethnicity_variables,         
  state = my_state,           
  county = selected_county_codes, 
  year = 2022,                 
  survey = "acs5",             
  output = "wide"             
)

# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations
tract_data_texas <- tract_data_texas %>%
  mutate(
    pct_white = (white_aloneE / total_populationE) * 100,
    pct_black = (Black_African_AmericanE / total_populationE) * 100,
    pct_hispanic = (Hispanic_LatinoE / total_populationE) * 100
  )

# Add readable tract and county name columns using str_extract() or similar
tract_data_texas <- tract_data_texas %>%
  mutate(
   Tract  = str_extract(NAME, "Census Tract \\d+(?:\\.\\d+)?"),
   County = str_extract(NAME, "[A-Za-z]+ County")
  )

head(tract_data_texas)
# A tibble: 6 × 15
  GEOID       NAME              white_aloneE white_aloneM Black_African_Americ…¹
  <chr>       <chr>                    <dbl>        <dbl>                  <dbl>
1 48001950100 Census Tract 950…         4061          682                    213
2 48001950401 Census Tract 950…         1073          308                   1569
3 48001950402 Census Tract 950…         1920          245                   2419
4 48001950500 Census Tract 950…         1924          381                    801
5 48001950600 Census Tract 950…         3287          674                    711
6 48001950700 Census Tract 950…          626          174                   1344
# ℹ abbreviated name: ¹​Black_African_AmericanE
# ℹ 10 more variables: Black_African_AmericanM <dbl>, Hispanic_LatinoE <dbl>,
#   Hispanic_LatinoM <dbl>, total_populationE <dbl>, total_populationM <dbl>,
#   pct_white <dbl>, pct_black <dbl>, pct_hispanic <dbl>, Tract <chr>,
#   County <chr>

3.3 Demographic Analysis

Your Task: Analyze the demographic patterns in your selected areas.

# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract

top_hispanic_tract <- tract_data_texas %>%
  # sort by Hispanic/Latino percentage (pct_hispanic) in descending order
  arrange(desc(pct_hispanic)) %>%
  slice(1)

top_hispanic_tract %>%
  kable(
    caption = "Tract with the Highest Percentage of Hispanic/Latino Residents"
  )
Tract with the Highest Percentage of Hispanic/Latino Residents

| GEOID | NAME | white_aloneE | white_aloneM | Black_African_AmericanE | Black_African_AmericanM | Hispanic_LatinoE | Hispanic_LatinoM | total_populationE | total_populationM | pct_white | pct_black | pct_hispanic | Tract | County |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 48029160100 | Census Tract 1601; Bexar County; Texas | 18 | 26 | 0 | 21 | 6931 | 1114 | 6949 | 1112 | 0.2590301 | 0 | 99.74097 | Census Tract 1601 | Bexar County |
# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group
county_demographics_summary <- tract_data_texas %>%
  group_by(County) %>%
  summarize(
    number_of_tracts = n(),
    avg_pct_white = mean(pct_white, na.rm = TRUE),
    avg_pct_black = mean(pct_black, na.rm = TRUE),
    avg_pct_hispanic = mean(pct_hispanic, na.rm = TRUE),
    total_population_E = sum(total_populationE, na.rm = TRUE)
  )

county_demographics_summary %>%
  select(
    `County` = County,
    `Tracts` = number_of_tracts,
    `Avg White/%` = avg_pct_white,
    `Avg Black/%` = avg_pct_black,
    `Avg Hispanic/%` = avg_pct_hispanic,
    `Total Pop` = total_population_E
  ) %>%
  kable(
    caption = "Average Demographics by County",
    digits = 2
  )
Average Demographics by County

| County | Tracts | Avg White/% | Avg Black/% | Avg Hispanic/% | Total Pop |
|---|---|---|---|---|---|
| Anderson County | 12 | 56.89 | 18.80 | 17.98 | 58077 |
| Bexar County | 375 | 25.50 | 6.42 | 62.54 | 2014059 |
| Brooks County | 2 | 8.42 | 0.88 | 89.16 | 7059 |

Part 4: Comprehensive Data Quality Evaluation

4.1 MOE Analysis for Demographic Variables

Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.

Requirements:

  • Calculate MOE percentages for each demographic variable
  • Flag tracts where any demographic variable has MOE > 15%
  • Create summary statistics

# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)
tract_data_texas <- tract_data_texas %>%
  mutate(
    # Calculate MOE percentage for each group
    # (note: a zero or near-zero estimate makes this ratio huge or infinite)
    white_moe_pct = (white_aloneM / white_aloneE) * 100,
    black_moe_pct = (Black_African_AmericanM / Black_African_AmericanE) * 100,
    hispanic_moe_pct = (Hispanic_LatinoM / Hispanic_LatinoE) * 100,

    # Flag tracts with high MOE on any demographic variable,
    # using logical OR (|) inside ifelse()
    demographic_reliability = ifelse(
      white_moe_pct > 15 | black_moe_pct > 15 | hispanic_moe_pct > 15,
      "High MOE",
      "Acceptable"
    )
  )
    
# Create summary statistics showing how many tracts have data quality issues
# Calculate MOE percentage and reliability categories using mutate()
tract_data_texas %>%
  count(demographic_reliability) %>%
  mutate(
    percentage = (n / sum(n)) * 100
  )
# A tibble: 1 × 3
  demographic_reliability     n percentage
  <chr>                   <int>      <dbl>
1 High MOE                  389        100

4.2 Pattern Analysis

Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.

# Group tracts by whether they have high MOE issues
# Calculate average characteristics for each group:
# - population size, demographic percentages
# Use group_by() and summarize() to create this comparison
# Create a professional table showing the patterns
tract_reliability_comparison <- tract_data_texas %>%
  group_by(demographic_reliability) %>%
  summarize(
    Number_of_Tracts = n(),
    Avg_Population = mean(total_populationE, na.rm = TRUE),
    Avg_Pct_White = mean(pct_white, na.rm = TRUE),
    Avg_Pct_Black = mean(pct_black, na.rm = TRUE),
    Avg_Pct_Hispanic = mean(pct_hispanic, na.rm = TRUE)
  )

tract_reliability_comparison %>%
  select(
    `Reliability Group` = demographic_reliability,
    `Tract Count` = Number_of_Tracts,
    `Avg Pop. Size` = Avg_Population,
    `Avg White/%` = Avg_Pct_White,
    `Avg Black/%` = Avg_Pct_Black,
    `Avg Hispanic/%` = Avg_Pct_Hispanic
  ) %>%   # pipe the renamed data frame to kable()
  kable(
    caption = "Comparison of Tract Characteristics by Data Reliability",
    digits = 2
  )
Comparison of Tract Characteristics by Data Reliability

| Reliability Group | Tract Count | Avg Pop. Size | Avg White/% | Avg Black/% | Avg Hispanic/% |
|---|---|---|---|---|---|
| High MOE | 389 | 5344.97 | 26.39 | 6.78 | 61.29 |

Pattern Analysis: The analysis reveals an extreme, systemic data quality issue: 100% of the tracts in the selected focus areas fall into the High MOE category, meaning data reliability is not merely poor in isolated cases but universally insufficient across these tracts. Part of this result is mechanical: whenever any one group's estimate in a tract is small (or zero), its relative MOE balloons, so even tracts in urban Bexar County trip the 15% flag. The flagged tracts average around 5,345 residents and are, on average, 61.29% Hispanic. This pattern suggests that demographic estimates become unstable at the tract level wherever subgroup sample counts are small, and in this selection those tracts are disproportionately Hispanic-majority. Consequently, any algorithm relying on this demographic data would be making resource allocation decisions on top of raw statistical uncertainty, disproportionately impacting Hispanic-majority communities.
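One caveat worth making explicit: dividing a count's MOE by the count itself is a rough reliability check, and it explodes for small estimates. For derived proportions, the Census Bureau's ACS handbook gives an approximation (implemented in tidycensus as moe_prop()). A base-R sketch of that formula, applied to the Census Tract 1601 figures shown above:

```r
# Approximate MOE for a derived proportion p = num/denom, following the
# Census Bureau's ACS formula; falls back to the ratio version when the
# term under the square root is negative.
moe_for_prop <- function(num, denom, moe_num, moe_denom) {
  p <- num / denom
  under <- moe_num^2 - p^2 * moe_denom^2
  if (under < 0) under <- moe_num^2 + p^2 * moe_denom^2
  sqrt(under) / denom
}

# Census Tract 1601: Hispanic estimate 6931 (MOE 1114), total 6949 (MOE 1112)
moe_for_prop(6931, 6949, 1114, 1112) * 100  # about 1.5 percentage points
```

By the raw-count check this tract looks unreliable (1114/6931 is about 16%), yet the Hispanic share itself is pinned down to within roughly 1.5 percentage points, which is one reason a proportion-aware MOE is the better input for a flagging rule.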

Part 5: Policy Recommendations

5.1 Analysis Integration and Professional Summary

Your Task: Write an executive summary that integrates findings from all four analyses.

Executive Summary Requirements: 1. Overall Pattern Identification: What are the systematic patterns across all your analyses? 2. Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings? 3. Root Cause Analysis: What underlying factors drive both data quality issues and bias risk? 4. Strategic Recommendations: What should the Department implement to address these systematic issues?

Executive Summary:

The analysis reveals a two-level, systematic breakdown in data reliability. At the county level, high MOE is concentrated in sparsely populated counties like Jeff Davis and Culberson; these areas immediately fail basic reliability thresholds for household income estimates. At the tract level, 100% of all tracts examined fell into the High Uncertainty category, with an MOE greater than 15% on at least one demographic variable. Data quality thus collapses at the neighborhood level within these regions, making reliable demographic information universally insufficient for use in funding or allocation decisions.

Communities with the highest risk of algorithmic bias are those that are small, rural, and possess a high concentration of minority residents. The tracts flagged as universally unreliable show an average of 61.29% Hispanic residents. Algorithms relying on this fundamentally flawed data risk systematically under-serving these communities, as any resource allocation based on the estimate would carry a high degree of statistical uncertainty, effectively rendering the data unusable for precise needs assessment.

The underlying factor driving both data quality issues and the resulting bias risk is the insufficient statistical sample size combined with geographical isolation. In sparsely populated, often rural counties, the ACS cannot collect enough household responses to produce stable estimates for either income or demographics. Because high-minority communities are overrepresented in these low-population areas in Texas, the data gaps systematically align with ethnic lines, translating a statistical problem into an undeniable equity problem.

To reduce this systemic risk and ensure equitable allocation, the Department should implement the following strategic measures:

1. Implement a strict rule requiring that any data used by the allocation algorithm must have an MOE percentage below a pre-defined threshold (e.g., 10%). Funding allocation logic must be halted if this threshold is breached.
2. For all counties or tracts flagged as “High Uncertainty” (MOE > 15%), the algorithm must default to using supplementary, non-ACS data rather than relying on unreliable census estimates.
3. Program the algorithm to automatically flag and defer funding decisions for any tract with high data uncertainty to a human review committee, ensuring local context and qualitative data are applied where quantitative data is weakest.
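These measures can be sketched as a single routing function. The function name and the exact wording of the actions are illustrative; the 10% and 15% thresholds are the ones proposed above:

```r
# Route a geography to an action based on its MOE percentage.
# Thresholds follow the recommendations above: <= 10% may feed the
# algorithm; 10-15% halts automated allocation; above 15% (or missing
# data, as for Loving County) defers to human review with non-ACS data.
route_decision <- function(moe_pct) {
  if (is.na(moe_pct) || moe_pct > 15) {
    "Defer to human review committee (use supplementary, non-ACS data)"
  } else if (moe_pct > 10) {
    "Halt automated allocation; seek additional data"
  } else {
    "Eligible for algorithmic allocation"
  }
}

route_decision(1.15)   # Bexar-like county: eligible
route_decision(12.5)   # halts automated allocation
route_decision(33.8)   # Brooks-like county: defer to human review
```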

5.2 Specific Recommendations

Your Task: Create a decision framework for algorithm implementation.

# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category
county_data_texas %>%
  select(
    `County Name` = NAME,
    `Median Income` = median_household_incomeE,
    `MOE Percentage` = income_moe_pct,
    `Reliability Category` = income_reliability
  ) %>%
# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"  
# - Low Confidence: "Requires manual review or additional data"
mutate(
    `Algorithm Recommendation` = case_when(
      # High Confidence: MOE < 5%
      `Reliability Category` == "High Confidence" ~ "Safe for algorithmic decisions",
      # Moderate Confidence: MOE 5-10%
      `Reliability Category` == "Moderate Confidence" ~ "Use with caution - monitor outcomes",
      # Low Confidence: MOE > 10%
      `Reliability Category` == "Low Confidence" ~ "Requires manual review or additional data"
    )
  ) %>%
# Format as a professional table with kable()
  kable(
    caption = "Algorithmic Implementation Framework",
    digits = 2 
  )
Algorithmic Implementation Framework

| County Name | Median Income | MOE Percentage | Reliability Category | Algorithm Recommendation |
|---|---|---|---|---|
| Anderson | 57445 | 7.94 | Moderate Confidence | Use with caution - monitor outcomes |
| Andrews | 86458 | 18.64 | Low Confidence | Requires manual review or additional data |
| Angelina | 57055 | 4.35 | High Confidence | Safe for algorithmic decisions |
| Aransas | 58168 | 11.10 | Low Confidence | Requires manual review or additional data |
| Archer | 69954 | 12.13 | Low Confidence | Requires manual review or additional data |
| Armstrong | 70417 | 20.70 | Low Confidence | Requires manual review or additional data |
| Atascosa | 67442 | 6.39 | Moderate Confidence | Use with caution - monitor outcomes |
| Austin | 73556 | 6.47 | Moderate Confidence | Use with caution - monitor outcomes |
| Bailey | 69830 | 18.79 | Low Confidence | Requires manual review or additional data |
| Bandera | 70965 | 8.05 | Moderate Confidence | Use with caution - monitor outcomes |
| Bastrop | 80151 | 6.09 | Moderate Confidence | Use with caution - monitor outcomes |
| Baylor | 52716 | 25.14 | Low Confidence | Requires manual review or additional data |
| Bee | 50283 | 10.00 | Moderate Confidence | Use with caution - monitor outcomes |
| Bell | 62858 | 2.80 | High Confidence | Safe for algorithmic decisions |
| Bexar | 67275 | 1.15 | High Confidence | Safe for algorithmic decisions |
| Blanco | 79717 | 9.35 | Moderate Confidence | Use with caution - monitor outcomes |
| Borden | 80625 | 24.88 | Low Confidence | Requires manual review or additional data |
| Bosque | 63868 | 6.05 | Moderate Confidence | Use with caution - monitor outcomes |
| Bowie | 56628 | 4.12 | High Confidence | Safe for algorithmic decisions |
| Brazoria | 91972 | 3.31 | High Confidence | Safe for algorithmic decisions |
| Brazos | 57562 | 3.45 | High Confidence | Safe for algorithmic decisions |
| Brewster | 47747 | 10.97 | Low Confidence | Requires manual review or additional data |
| Briscoe | 35446 | 27.64 | Low Confidence | Requires manual review or additional data |
| Brooks | 30566 | 33.82 | Low Confidence | Requires manual review or additional data |
| Brown | 53792 | 4.75 | High Confidence | Safe for algorithmic decisions |
| Burleson | 71745 | 6.53 | Moderate Confidence | Use with caution - monitor outcomes |
| Burnet | 71482 | 8.09 | Moderate Confidence | Use with caution - monitor outcomes |
| Caldwell | 66779 | 6.96 | Moderate Confidence | Use with caution - monitor outcomes |
| Calhoun | 62267 | 9.92 | Moderate Confidence | Use with caution - monitor outcomes |
| Callahan | 63906 | 3.63 | High Confidence | Safe for algorithmic decisions |
| Cameron | 47435 | 3.41 | High Confidence | Safe for algorithmic decisions |
| Camp | 53968 | 7.64 | Moderate Confidence | Use with caution - monitor outcomes |
| Carson | 83199 | 4.51 | High Confidence | Safe for algorithmic decisions |
| Cass | 54303 | 6.82 | Moderate Confidence | Use with caution - monitor outcomes |
| Castro | 59886 | 17.03 | Low Confidence | Requires manual review or additional data |
| Chambers | 106103 | 8.28 | Moderate Confidence | Use with caution - monitor outcomes |
| Cherokee | 56971 | 8.45 | Moderate Confidence | Use with caution - monitor outcomes |
| Childress | 56063 | 29.34 | Low Confidence | Requires manual review or additional data |
| Clay | 75227 | 7.11 | Moderate Confidence | Use with caution - monitor outcomes |
| Cochran | 41597 | 16.97 | Low Confidence | Requires manual review or additional data |
| Coke | 40230 | 13.80 | Low Confidence | Requires manual review or additional data |
| Coleman | 51034 | 7.95 | Moderate Confidence | Use with caution - monitor outcomes |
| Collin | 113255 | 1.16 | High Confidence | Safe for algorithmic decisions |
| Collingsworth | 52045 | 22.02 | Low Confidence | Requires manual review or additional data |
| Colorado | 63352 | 8.17 | Moderate Confidence | Use with caution - monitor outcomes |
| Comal | 93744 | 2.83 | High Confidence | Safe for algorithmic decisions |
| Comanche | 57383 | 13.32 | Low Confidence | Requires manual review or additional data |
| Concho | 55750 | 27.87 | Low Confidence | Requires manual review or additional data |
| Cooke | 66374 | 8.36 | Moderate Confidence | Use with caution - monitor outcomes |
| Coryell | 63281 | 3.62 | High Confidence | Safe for algorithmic decisions |
| Cottle | 47625 | 37.82 | Low Confidence | Requires manual review or additional data |
| Crane | 71364 | 32.89 | Low Confidence | Requires manual review or additional data |
| Crockett | 64103 | 33.95 | Low Confidence | Requires manual review or additional data |
| Crosby | 50268 | 10.40 | Low Confidence | Requires manual review or additional data |
| Culberson | 35924 | 51.37 | Low Confidence | Requires manual review or additional data |
| Dallam | 71969 | 9.75 | Moderate Confidence | Use with caution - monitor outcomes |
| Dallas | 70732 | 0.86 | High Confidence | Safe for algorithmic decisions |
| Dawson | 45268 | 27.64 | Low Confidence | Requires manual review or additional data |
| Deaf Smith | 51942 | 6.01 | Moderate Confidence | Use with caution - monitor outcomes |
| Delta | 68491 | 27.75 | Low Confidence | Requires manual review or additional data |
| Denton | 104180 | 1.30 | High Confidence | Safe for algorithmic decisions |
| DeWitt | 61100 | 7.86 | Moderate Confidence | Use with caution - monitor outcomes |
| Dickens | 46638 | 13.62 | Low Confidence | Requires manual review or additional data |
| Dimmit | 27374 | 45.20 | Low Confidence | Requires manual review or additional data |
| Donley | 51711 | 12.45 | Low Confidence | Requires manual review or additional data |
| Duval | 50697 | 20.28 | Low Confidence | Requires manual review or additional data |
| Eastland | 52902 | 12.21 | Low Confidence | Requires manual review or additional data |
| Ector | 70566 | 4.20 | High Confidence | Safe for algorithmic decisions |
| Edwards | 40809 | 27.09 | Low Confidence | Requires manual review or additional data |
| Ellis | 93248 | 2.66 | High Confidence | Safe for algorithmic decisions |
| El Paso | 55417 | 1.92 | High Confidence | Safe for algorithmic decisions |
| Erath | 59654 | 6.73 | Moderate Confidence | Use with caution - monitor outcomes |
| Falls | 45172 | 15.35 | Low Confidence | Requires manual review or additional data |
| Fannin | 65835 | 6.09 | Moderate Confidence | Use with caution - monitor outcomes |
| Fayette | 72881 | 5.03 | Moderate Confidence | Use with caution - monitor outcomes |
| Fisher | 60461 | 8.19 | Moderate Confidence | Use with caution - monitor outcomes |
| Floyd | 49321 | 8.99 | Moderate Confidence | Use with caution - monitor outcomes |
| Foard | 41944 | 20.94 | Low Confidence | Requires manual review or additional data |
| Fort Bend | 109987 | 2.64 | High Confidence | Safe for algorithmic decisions |
| Franklin | 67915 | 4.37 | High Confidence | Safe for algorithmic decisions |
| Freestone | 55902 | 10.53 | Low Confidence | Requires manual review or additional data |
| Frio | 56042 | 30.34 | Low Confidence | Requires manual review or additional data |
| Gaines | 73299 | 13.82 | Low Confidence | Requires manual review or additional data |
| Galveston | 83913 | 2.78 | High Confidence | Safe for algorithmic decisions |
| Garza | 56215 | 34.96 | Low Confidence | Requires manual review or additional data |
| Gillespie | 70162 | 8.15 | Moderate Confidence | Use with caution - monitor outcomes |
| Glasscock | 112188 | 27.86 | Low Confidence | Requires manual review or additional data |
| Goliad | 58125 | 25.84 | Low Confidence | Requires manual review or additional data |
| Gonzales | 64255 | 8.43 | Moderate Confidence | Use with caution - monitor outcomes |
| Gray | 54563 | 7.16 | Moderate Confidence | Use with caution - monitor outcomes |
| Grayson | 66608 | 3.60 | High Confidence | Safe for algorithmic decisions |
| Gregg | 63811 | 3.91 | High Confidence | Safe for algorithmic decisions |
| Grimes | 63484 | 9.10 | Moderate Confidence | Use with caution - monitor outcomes |
| Guadalupe | 88111 | 3.93 | High Confidence | Safe for algorithmic decisions |
| Hale | 50721 | 8.97 | Moderate Confidence | Use with caution - monitor outcomes |
| Hall | 43873 | 10.96 | Low Confidence | Requires manual review or additional data |
| Hamilton | 54890 | 17.27 | Low Confidence | Requires manual review or additional data |
| Hansford | 62350 | 19.70 | Low Confidence | Requires manual review or additional data |
| Hardeman | 60455 | 15.33 | Low Confidence | Requires manual review or additional data |
| Hardin | 70164 | 5.15 | Moderate Confidence | Use with caution - monitor outcomes |
| Harris | 70789 | 0.68 | High Confidence | Safe for algorithmic decisions |
| Harrison | 63427 | 4.81 | High Confidence | Safe for algorithmic decisions |
| Hartley | 78065 | 27.00 | Low Confidence | Requires manual review or additional data |
| Haskell | 52786 | 16.33 | Low Confidence | Requires manual review or additional data |
| Hays | 79990 | 3.73 | High Confidence | Safe for algorithmic decisions |
| Hemphill | 67798 | 27.70 | Low Confidence | Requires manual review or additional data |
| Henderson | 59778 | 4.35 | High Confidence | Safe for algorithmic decisions |
| Hidalgo | 49371 | 2.31 | High Confidence | Safe for algorithmic decisions |
| Hill | 60669 | 6.01 | Moderate Confidence | Use with caution - monitor outcomes |
| Hockley | 53283 | 7.51 | Moderate Confidence | Use with caution - monitor outcomes |
| Hood | 80013 | 4.72 | High Confidence | Safe for algorithmic decisions |
| Hopkins | 63766 | 5.67 | Moderate Confidence | Use with caution - monitor outcomes |
| Houston | 51043 | 10.50 | Low Confidence | Requires manual review or additional data |
| Howard | 67243 | 6.53 | Moderate Confidence | Use with caution - monitor outcomes |
| Hudspeth | 35163 | 23.20 | Low Confidence | Requires manual review or additional data |
| Hunt | 66885 | 4.17 | High Confidence | Safe for algorithmic decisions |
| Hutchinson | 62211 | 7.55 | Moderate Confidence | Use with caution - monitor outcomes |
| Irion | 54708 | 16.96 | Low Confidence | Requires manual review or additional data |
| Jack | 58861 | 13.22 | Low Confidence | Requires manual review or additional data |
| Jackson | 67176 | 17.91 | Low Confidence | Requires manual review or additional data |
| Jasper | 48818 | 9.84 | Moderate Confidence | Use with caution - monitor outcomes |
| Jeff Davis | 38125 | 66.11 | Low Confidence | Requires manual review or additional data |
| Jefferson | 57294 | 2.91 | High Confidence | Safe for algorithmic decisions |
| Jim Hogg | 42292 | 13.64 | Low Confidence | Requires manual review or additional data |
| Jim Wells | 46626 | 12.69 | Low Confidence | Requires manual review or additional data |
| Johnson | 77058 | 3.06 | High Confidence | Safe for algorithmic decisions |
| Jones | 59361 | 10.04 | Low Confidence | Requires manual review or additional data |
| Karnes | 57798 | 14.47 | Low Confidence | Requires manual review or additional data |
| Kaufman | 84075 | 4.11 | High Confidence | Safe for algorithmic decisions |
| Kendall | 104196 | 8.26 | Moderate Confidence | Use with caution - monitor outcomes |
| Kenedy | 45455 | 25.12 | Low Confidence | Requires manual review or additional data |
| Kent | 68553 | 15.57 | Low Confidence | Requires manual review or additional data |
| Kerr | 66713 | 6.23 | Moderate Confidence | Use with caution - monitor outcomes |
| Kimble | 62386 | 22.58 | Low Confidence | Requires manual review or additional data |
| King | 59375 | 49.51 | Low Confidence | Requires manual review or additional data |
| Kinney | 52386 | 45.29 | Low Confidence | Requires manual review or additional data |
| Kleberg | 52487 | 9.50 | Moderate Confidence | Use with caution - monitor outcomes |
| Knox | 48750 | 9.82 | Moderate Confidence | Use with caution - monitor outcomes |
| Lamar | 58246 | 4.88 | High Confidence | Safe for algorithmic decisions |
| Lamb | 54519 | 8.61 | Moderate Confidence | Use with caution - monitor outcomes |
| Lampasas | 73269 | 7.44 | Moderate Confidence | Use with caution - monitor outcomes |
| La Salle | 62798 | 26.04 | Low Confidence | Requires manual review or additional data |
| Lavaca | 58530 | 7.71 | Moderate Confidence | Use with caution - monitor outcomes |
| Lee | 66448 | 10.40 | Low Confidence | Requires manual review or additional data |
| Leon | 57363 | 12.04 | Low Confidence | Requires manual review or additional data |
| Liberty | 59605 | 6.69 | Moderate Confidence | Use with caution - monitor outcomes |
| Limestone | 53102 | 7.12 | Moderate Confidence | Use with caution - monitor outcomes |
| Lipscomb | 71625 | 12.92 | Low Confidence | Requires manual review or additional data |
| Live Oak | 55949 | 17.98 | Low Confidence | Requires manual review or additional data |
| Llano | 64241 | 8.81 | Moderate Confidence | Use with caution - monitor outcomes |
| Loving | NA | NA | NA | NA |
| Lubbock | 61911 | 3.54 | High Confidence | Safe for algorithmic decisions |
| Lynn | 52996 | 7.09 | Moderate Confidence | Use with caution - monitor outcomes |
| McCulloch | 53214 | 16.82 | Low Confidence | Requires manual review or additional data |
| McLennan | 59781 | 3.19 | High Confidence | Safe for algorithmic decisions |
| McMullen | 60313 | 41.18 | Low Confidence | Requires manual review or additional data |
| Madison | 65768 | 9.56 | Moderate Confidence | Use with caution - monitor outcomes |
Marion 48040 4.95 High Confidence Safe for algorithmic decisions
Martin 70217 27.10 Low Confidence Requires manual review or additional data
Mason 77583 15.82 Low Confidence Requires manual review or additional data
Matagorda 56412 6.28 Moderate Confidence Use with caution - monitor outcomes
Maverick 48497 10.10 Low Confidence Requires manual review or additional data
Medina 73060 4.04 High Confidence Safe for algorithmic decisions
Menard 40945 17.91 Low Confidence Requires manual review or additional data
Midland 90123 5.80 Moderate Confidence Use with caution - monitor outcomes
Milam 56985 5.92 Moderate Confidence Use with caution - monitor outcomes
Mills 59315 9.26 Moderate Confidence Use with caution - monitor outcomes
Mitchell 49869 12.52 Low Confidence Requires manual review or additional data
Montague 63336 8.60 Moderate Confidence Use with caution - monitor outcomes
Montgomery 95946 3.45 High Confidence Safe for algorithmic decisions
Moore 59041 6.55 Moderate Confidence Use with caution - monitor outcomes
Morris 51532 6.71 Moderate Confidence Use with caution - monitor outcomes
Motley 66528 8.43 Moderate Confidence Use with caution - monitor outcomes
Nacogdoches 51153 4.19 High Confidence Safe for algorithmic decisions
Navarro 56261 7.70 Moderate Confidence Use with caution - monitor outcomes
Newton 38871 16.91 Low Confidence Requires manual review or additional data
Nolan 47437 7.72 Moderate Confidence Use with caution - monitor outcomes
Nueces 64027 2.26 High Confidence Safe for algorithmic decisions
Ochiltree 62240 17.77 Low Confidence Requires manual review or additional data
Oldham 71103 11.17 Low Confidence Requires manual review or additional data
Orange 71910 7.89 Moderate Confidence Use with caution - monitor outcomes
Palo Pinto 65242 4.40 High Confidence Safe for algorithmic decisions
Panola 58205 18.34 Low Confidence Requires manual review or additional data
Parker 95721 3.91 High Confidence Safe for algorithmic decisions
Parmer 65575 13.74 Low Confidence Requires manual review or additional data
Pecos 59325 17.15 Low Confidence Requires manual review or additional data
Polk 57315 5.19 Moderate Confidence Use with caution - monitor outcomes
Potter 47974 3.95 High Confidence Safe for algorithmic decisions
Presidio 29012 23.98 Low Confidence Requires manual review or additional data
Rains 60291 10.78 Low Confidence Requires manual review or additional data
Randall 78038 3.45 High Confidence Safe for algorithmic decisions
Reagan 70319 12.82 Low Confidence Requires manual review or additional data
Real 46842 32.96 Low Confidence Requires manual review or additional data
Red River 44583 9.73 Moderate Confidence Use with caution - monitor outcomes
Reeves 57487 22.06 Low Confidence Requires manual review or additional data
Refugio 54304 4.88 High Confidence Safe for algorithmic decisions
Roberts 62667 14.22 Low Confidence Requires manual review or additional data
Robertson 59410 17.70 Low Confidence Requires manual review or additional data
Rockwall 121303 3.78 High Confidence Safe for algorithmic decisions
Runnels 55424 5.97 Moderate Confidence Use with caution - monitor outcomes
Rusk 61661 9.85 Moderate Confidence Use with caution - monitor outcomes
Sabine 47061 16.90 Low Confidence Requires manual review or additional data
San Augustine 45888 9.69 Moderate Confidence Use with caution - monitor outcomes
San Jacinto 54839 12.97 Low Confidence Requires manual review or additional data
San Patricio 63842 6.88 Moderate Confidence Use with caution - monitor outcomes
San Saba 54087 16.38 Low Confidence Requires manual review or additional data
Schleicher 53774 15.15 Low Confidence Requires manual review or additional data
Scurry 58932 21.39 Low Confidence Requires manual review or additional data
Shackelford 60924 14.04 Low Confidence Requires manual review or additional data
Shelby 49231 9.96 Moderate Confidence Use with caution - monitor outcomes
Sherman 66169 27.93 Low Confidence Requires manual review or additional data
Smith 69053 3.19 High Confidence Safe for algorithmic decisions
Somervell 87899 33.73 Low Confidence Requires manual review or additional data
Starr 35979 8.42 Moderate Confidence Use with caution - monitor outcomes
Stephens 44712 18.85 Low Confidence Requires manual review or additional data
Sterling 63558 22.25 Low Confidence Requires manual review or additional data
Stonewall 66591 32.48 Low Confidence Requires manual review or additional data
Sutton 56778 22.74 Low Confidence Requires manual review or additional data
Swisher 40290 13.69 Low Confidence Requires manual review or additional data
Tarrant 78872 1.01 High Confidence Safe for algorithmic decisions
Taylor 61806 3.90 High Confidence Safe for algorithmic decisions
Terrell 52813 21.03 Low Confidence Requires manual review or additional data
Terry 42694 10.93 Low Confidence Requires manual review or additional data
Throckmorton 55221 21.09 Low Confidence Requires manual review or additional data
Titus 57634 8.12 Moderate Confidence Use with caution - monitor outcomes
Tom Green 67215 4.68 High Confidence Safe for algorithmic decisions
Travis 92731 1.19 High Confidence Safe for algorithmic decisions
Trinity 51165 11.37 Low Confidence Requires manual review or additional data
Tyler 50898 10.34 Low Confidence Requires manual review or additional data
Upshur 60456 7.74 Moderate Confidence Use with caution - monitor outcomes
Upton 55284 21.72 Low Confidence Requires manual review or additional data
Uvalde 55000 15.28 Low Confidence Requires manual review or additional data
Val Verde 57250 8.01 Moderate Confidence Use with caution - monitor outcomes
Van Zandt 62334 8.10 Moderate Confidence Use with caution - monitor outcomes
Victoria 66308 3.55 High Confidence Safe for algorithmic decisions
Walker 47193 6.32 Moderate Confidence Use with caution - monitor outcomes
Waller 71643 5.98 Moderate Confidence Use with caution - monitor outcomes
Ward 70771 12.60 Low Confidence Requires manual review or additional data
Washington 70043 9.41 Moderate Confidence Use with caution - monitor outcomes
Webb 59984 3.11 High Confidence Safe for algorithmic decisions
Wharton 59712 6.84 Moderate Confidence Use with caution - monitor outcomes
Wheeler 58158 14.24 Low Confidence Requires manual review or additional data
Wichita 58862 3.24 High Confidence Safe for algorithmic decisions
Wilbarger 50769 18.36 Low Confidence Requires manual review or additional data
Willacy 42839 13.05 Low Confidence Requires manual review or additional data
Williamson 102851 1.42 High Confidence Safe for algorithmic decisions
Wilson 89708 4.89 High Confidence Safe for algorithmic decisions
Winkler 89155 16.06 Low Confidence Requires manual review or additional data
Wise 85385 6.07 Moderate Confidence Use with caution - monitor outcomes
Wood 61748 6.02 Moderate Confidence Use with caution - monitor outcomes
Yoakum 80317 8.81 Moderate Confidence Use with caution - monitor outcomes
Young 65565 16.86 Low Confidence Requires manual review or additional data
Zapata 35061 10.72 Low Confidence Requires manual review or additional data
Zavala 49243 28.78 Low Confidence Requires manual review or additional data

Key Recommendations:

Your Task: Use your analysis results to provide specific guidance to the department.

  1. Counties suitable for immediate algorithmic implementation:

Counties categorized as High Confidence (median income MOE below 5% of the estimate) are suitable. These are typically large, urban counties with stable estimates, such as Harris or Dallas County. Their income estimates are statistically stable (ACS margins of error are published at the 90% confidence level), so resource allocations based on them carry minimal risk of systematic misallocation.

  2. Counties requiring additional oversight:

For counties categorized as Moderate Confidence (median income MOE between 5% and 10% of the estimate), the algorithm should be deployed with caution. The Department should implement ongoing monitoring that compares algorithmic allocation decisions against actual service uptake and ground-level caseworker reports. A 5-10% uncertainty range calls for human oversight to validate outcomes and to trigger manual review whenever allocation patterns deviate significantly from expected need.

  3. Counties needing alternative approaches:

Standard algorithmic application should be suspended immediately for all Low Confidence counties (MOE > 10%), especially those with extreme uncertainty such as Jeff Davis (MOE ≈ 66%) and Culberson (MOE ≈ 53%). In these areas, where tract-level data is broadly unreliable and many communities have large Hispanic populations, the Department should transition to manual caseworker review for needs assessment, or invest in small, targeted administrative surveys as a more reliable proxy for need than the ACS income estimates.
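The three-tier rule behind these recommendations can be sketched with dplyr. This is a minimal sketch, not the assignment's exact code: the column names `estimate` and `moe` are hypothetical stand-ins for the tidycensus output, while the 5% and 10% cutoffs match the thresholds used above.

```r
# Classification sketch: MOE as a percentage of the estimate drives the tier.
# Column names (estimate, moe) are assumed, not taken from the original code.
library(dplyr)

classify_reliability <- function(df) {
  df %>%
    mutate(
      moe_pct = 100 * moe / estimate,   # MOE as % of the income estimate
      reliability = case_when(
        is.na(moe_pct) ~ NA_character_, # e.g., Loving County (suppressed data)
        moe_pct < 5    ~ "High Confidence",
        moe_pct <= 10  ~ "Moderate Confidence",
        TRUE           ~ "Low Confidence"
      ),
      recommendation = case_when(
        reliability == "High Confidence"     ~ "Safe for algorithmic decisions",
        reliability == "Moderate Confidence" ~ "Use with caution - monitor outcomes",
        reliability == "Low Confidence"      ~ "Requires manual review or additional data",
        TRUE                                 ~ NA_character_
      )
    )
}

# Example with made-up values:
demo <- tibble(county   = c("A", "B", "C"),
               estimate = c(80000, 60000, 50000),
               moe      = c(2400, 4500, 9000))
classify_reliability(demo)
```

The `is.na()` branch mirrors how counties with suppressed estimates (such as Loving in the table above) fall out of the classification rather than being mislabeled.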

Questions for Further Investigation

  1. How do the tracts flagged as High MOE cluster geographically, and is there a measurable correlation between these clusters and existing disparities in access to Department resources or social infrastructure?

  2. Beyond the Hispanic population, are High MOE issues equally prevalent across other small, concentrated minority groups (e.g., specific Native American or Asian subgroups), or is the sample-size challenge uniquely affecting the Hispanic population in the Texas border regions?

Technical Notes

Data Sources:

  • U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
  • Retrieved via tidycensus R package on September 19, 2025
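For replication, the county-level income data can be pulled with tidycensus roughly as follows. This is a sketch under stated assumptions: a Census API key must already be registered via `census_api_key()`, and `B19013_001` is the ACS median household income variable.

```r
# Retrieval sketch, assuming a Census API key is already set with
# tidycensus::census_api_key(). Defined but not called here, since the
# call requires network access and a valid key.
library(tidycensus)

fetch_tx_income <- function() {
  get_acs(
    geography = "county",
    variables = c(median_income = "B19013_001"),  # median household income
    state     = "TX",
    year      = 2022,    # endpoint of the 2018-2022 5-year window
    survey    = "acs5"
  )
}
# counties <- fetch_tx_income()  # returns GEOID, NAME, estimate, moe columns
```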

Reproducibility:

  • All analysis conducted in R version 4.5.1
  • Census API key required for replication
  • Complete code and documentation available at: https://github.com/MUSA-5080-Fall-2025/portfolio-setup-ChristineCui12

Methodology Notes:

The analysis employed several deliberate choices to focus the data quality evaluation. First, all estimates were sourced from the American Community Survey (ACS) 2018-2022 5-Year Estimates, the most geographically detailed and statistically stable data available for tract-level analysis. Second, the three focus counties were intentionally selected from the MOE analysis to represent the extremes of data reliability (the High, Moderate, and Low Confidence ranges) rather than counties near the confidence boundaries, in order to maximize the observed contrast in data quality patterns during the tract-level deep dive. Third, a conservative 15% MOE threshold was set for the tract-level demographic variables: if the MOE for any of the three major demographic groups (White, Black, Hispanic) exceeded this threshold, the entire tract was flagged as High MOE.
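The "flag the whole tract if any group exceeds the threshold" rule can be sketched as below. The per-group column names (`white_est`, `white_moe`, etc.) are hypothetical stand-ins for the actual ACS variables; only the 15% threshold and the any-group logic come from the methodology described above.

```r
# Tract flagging sketch: a tract is High MOE if the MOE for ANY of the
# three major demographic groups exceeds 15% of that group's estimate.
# Column names are assumed, not taken from the original analysis.
library(dplyr)

flag_high_moe <- function(tracts, threshold = 15) {
  tracts %>%
    mutate(
      white_moe_pct    = 100 * white_moe    / white_est,
      black_moe_pct    = 100 * black_moe    / black_est,
      hispanic_moe_pct = 100 * hispanic_moe / hispanic_est,
      high_moe_flag = white_moe_pct    > threshold |
                      black_moe_pct    > threshold |
                      hispanic_moe_pct > threshold
    )
}
```

Because a single unreliable group is enough to flag a tract, this rule is deliberately conservative: it errs toward manual review rather than toward trusting a partially unreliable estimate.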

Limitations:

The findings are restricted to Texas and the three specifically selected counties. The conclusion that 100% of tracts in the focus area are unreliable is an extreme finding that may not generalize to counties in other parts of the state or country. In addition, the use of 5-Year ACS estimates means the data represents a rolling average over 2018-2022; this dampens volatility but may mask rapid economic changes or population shifts that have occurred since the beginning of the period.


Submission Checklist

Before submitting your portfolio link on Canvas:

  • [√] All code chunks run without errors
  • [√] All “[Fill this in]” prompts have been completed
  • [√] Tables are properly formatted and readable
  • [√] Executive summary addresses all four required components
  • [√] Portfolio navigation includes this assignment
  • [√] Census API key is properly set
  • [√] Document renders correctly to HTML

Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/your_file_name.html