Assignment 1: Census Data Quality for Policy Decisions

Evaluating Data Reliability for Algorithmic Decision-Making

Author

Xinyuan(Christine) Cui

Published

September 29, 2025

Assignment Overview

Scenario

You are a data analyst for the Texas Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.

Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.

Learning Objectives

  • Apply dplyr functions to real census data for policy analysis
  • Evaluate data quality using margins of error
  • Connect technical analysis to algorithmic decision-making
  • Identify potential equity implications of data reliability issues
  • Create professional documentation for policy stakeholders

Submission Instructions

Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/

Make sure to update your _quarto.yml navigation to include this assignment under an “Assignments” menu.

Part 1: Portfolio Integration

Create this assignment in your portfolio repository under an assignments/assignment_1/ folder structure. Update your navigation menu to include:

- text: Assignments
  menu:
    - href: assignments/assignment_1/your_file_name.qmd
      text: "Assignment 1: Census Data Exploration"

If the text contains a special character such as a comma, wrap it in double quotes so that Quarto parses the value as a single string.

Setup

# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
"knitr" %in% rownames(installed.packages())
[1] TRUE
library(tidycensus)
library(tidyverse)
library(knitr)
# Set your Census API key (keep the literal key out of published documents;
# store it in .Renviron and read it with Sys.getenv())
census_api_key(Sys.getenv("CENSUS_API_KEY"), overwrite = TRUE)
# Choose your state for analysis - assign it to a variable called my_state
my_state <- "TX"

State Selection: I chose Texas because its combination of rapid population growth, demographic diversity, and wide geographic variation creates exactly the kinds of data quality challenges that matter when evaluating algorithmic fairness.

Part 2: County-Level Resource Assessment

2.1 Data Retrieval

Your Task: Use get_acs() to retrieve county-level data for your chosen state.

Requirements:

  • Geography: county level
  • Variables: median household income (B19013_001) and total population (B01003_001)
  • Year: 2022
  • Survey: acs5
  • Output format: wide

Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.

# Write your get_acs() code here
county_data_texas <- get_acs(
  geography = "county",        
  variables = c(               
    median_household_income = "B19013_001",
    total_population = "B01003_001"
  ),
  state = my_state,            
  year = 2022,                 
  survey = "acs5",             
  output = "wide"              
)

# Clean the county names to remove state name and "County" 
# Hint: use mutate() with str_remove()
county_data_texas <- county_data_texas %>%
  mutate(
    # remove state name(Texas)
    NAME = str_remove(NAME, ", Texas"),
    # remove " County"
    NAME = str_remove(NAME, " County")
  )
# Display the first few rows
head(county_data_texas)
# A tibble: 6 × 6
  GEOID NAME     median_household_inc…¹ median_household_inc…² total_populationE
  <chr> <chr>                     <dbl>                  <dbl>             <dbl>
1 48001 Anderson                  57445                   4562             58077
2 48003 Andrews                   86458                  16116             18362
3 48005 Angelina                  57055                   2484             86608
4 48007 Aransas                   58168                   6458             24048
5 48009 Archer                    69954                   8482              8649
6 48011 Armstro…                  70417                  14574              1912
# ℹ abbreviated names: ¹​median_household_incomeE, ²​median_household_incomeM
# ℹ 1 more variable: total_populationM <dbl>

2.2 Data Quality Assessment

Your Task: Calculate margin of error percentages and create reliability categories.

Requirements:

  • Calculate MOE percentage: (margin of error / estimate) * 100
  • Create reliability categories:
    • High Confidence: MOE < 5%
    • Moderate Confidence: MOE 5-10%
    • Low Confidence: MOE > 10%
  • Create a flag for unreliable estimates (MOE > 10%)

Hint: Use mutate() with case_when() for the categories.

# Calculate MOE percentage and reliability categories using mutate()
county_data_texas <- county_data_texas %>%
  mutate(
    # Calculate MOE percentage
    income_moe_pct = (median_household_incomeM / median_household_incomeE) * 100,
    
    # Create reliability categories
    income_reliability = case_when(
      income_moe_pct < 5 ~ "High Confidence",
      income_moe_pct >= 5 & income_moe_pct <= 10 ~ "Moderate Confidence",
      income_moe_pct > 10 ~ "Low Confidence"
    ),
    
    # Create a flag for unreliable estimates
    unreliable_income = income_moe_pct > 10
  )
# Create a summary showing count of counties in each reliability category
# Hint: use count() and mutate() to add percentages
county_data_texas %>%
  count(income_reliability) %>%
  mutate(
    percentage = (n / sum(n)) * 100
  )
# A tibble: 4 × 3
  income_reliability      n percentage
  <chr>               <int>      <dbl>
1 High Confidence        58     22.8  
2 Low Confidence        113     44.5  
3 Moderate Confidence    82     32.3  
4 <NA>                    1      0.394

2.3 High Uncertainty Counties

Your Task: Identify the 5 counties with the highest MOE percentages.

Requirements:

  • Sort by MOE percentage (highest first)
  • Select the top 5 counties
  • Display: county name, median income, margin of error, MOE percentage, reliability category
  • Format as a professional table using kable()

Hint: Use arrange(), slice(), and select() functions.

# Create table of top 5 counties by MOE percentage
# and format it with kable() - appropriate column names and caption
county_data_texas %>%
  arrange(desc(income_moe_pct)) %>%
  slice(1:5) %>%
  select(
    `County Name` = NAME,
    `Median Income` = median_household_incomeE,
    `Margin of Error` = median_household_incomeM,
    `MOE Percentage` = income_moe_pct,
    `Reliability Category` = income_reliability
  ) %>%
  kable(
    caption = "Top 5 Counties with The Highest MOE Percentages",
    digits = 2
  )
Top 5 Counties with The Highest MOE Percentages

| County Name | Median Income | Margin of Error | MOE Percentage | Reliability Category |
|---|---|---|---|---|
| Jeff Davis | 38125 | 25205 | 66.11 | Low Confidence |
| Culberson | 35924 | 18455 | 51.37 | Low Confidence |
| King | 59375 | 29395 | 49.51 | Low Confidence |
| Kinney | 52386 | 23728 | 45.29 | Low Confidence |
| Dimmit | 27374 | 12374 | 45.20 | Low Confidence |

Data Quality Commentary: Based on these results, algorithmic funding decisions relying on median income in counties like Jeff Davis and Culberson (MOE > 50%) would be highly unreliable. These counties risk being poorly served because the true income could lie far from the estimate, potentially leading the algorithm to severely over- or under-allocate crucial resources. The extreme uncertainty stems primarily from small population sizes and low survey response counts, which make precise income estimation effectively impossible in these sparsely populated areas.
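Since ACS margins of error are published at the 90% confidence level, the practical impact of a 66% MOE can be made concrete by computing the implied interval. A minimal base-R sketch using the Jeff Davis figures from the table above:

```r
# ACS margins of error are reported at the 90% confidence level,
# so the interval is simply estimate +/- MOE.
estimate <- 38125  # Jeff Davis median household income
moe      <- 25205  # its margin of error

lower <- estimate - moe
upper <- estimate + moe

c(lower = lower, upper = upper)
# 90% interval: roughly $12,920 to $63,330
```

An algorithm thresholding on, say, a $50,000 income cutoff could not determine from this estimate whether Jeff Davis falls above or below it.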

Part 3: Neighborhood-Level Analysis

3.1 Focus Area Selection

Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.

Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.

# Use filter() to select 2-3 counties from your county_reliability data
# Store the selected counties in a variable called selected_counties
selected_counties <- county_data_texas %>%
  filter(
    income_moe_pct > 30 |
    (income_moe_pct >= 5 & income_moe_pct <= 10) |
    income_moe_pct < 2
  ) %>%
  # keep the first county in each reliability category (3 total)
  group_by(income_reliability) %>%
  slice(1) %>%
  ungroup()
# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category
selected_counties %>%
  select(
    `County Name` = NAME,
    `Median Income` = median_household_incomeE,
    `MOE Percentage` = income_moe_pct,
    `Reliability Category` = income_reliability
  ) %>%
  kable(
    caption = "3 Counties Selected for Tract-Level Study",
    digits= 2
  )
3 Counties Selected for Tract-Level Study

| County Name | Median Income | MOE Percentage | Reliability Category |
|---|---|---|---|
| Bexar | 67275 | 1.15 | High Confidence |
| Brooks | 30566 | 33.82 | Low Confidence |
| Anderson | 57445 | 7.94 | Moderate Confidence |

Comment on the output: I deliberately selected counties representing the more extreme ends of the High and Low Confidence ranges. This strategy ensures a clearer and more impactful comparison of data quality. By focusing on these distinct cases (one very high confidence, one moderate, and one very low confidence), the tract-level analysis will better expose the full spectrum of data reliability issues that could affect the department’s funding algorithms.

3.2 Tract-Level Demographics

Your Task: Get demographic data for census tracts in your selected counties.

Requirements:

  • Geography: tract level
  • Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
  • Use the same state and year as before
  • Output format: wide

Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.

# Define your race/ethnicity variables with descriptive names
ethnicity_variables <- c(
  white_alone = "B03002_003",
  Black_African_American = "B03002_004",
  Hispanic_Latino = "B03002_012",
  total_population = "B03002_001"
)

# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter
selected_county_geoids <- selected_counties$GEOID
selected_county_codes <- str_sub(selected_county_geoids, start = -3) # the last 3 digits of a county GEOID are the county code

tract_data_texas <- get_acs(
  geography = "tract",        
  variables = ethnicity_variables,         
  state = my_state,           
  county = selected_county_codes, 
  year = 2022,                 
  survey = "acs5",             
  output = "wide"             
)

# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations
tract_data_texas <- tract_data_texas %>%
  mutate(
    pct_white = (white_aloneE / total_populationE) * 100,
    pct_black = (Black_African_AmericanE / total_populationE) * 100,
    pct_hispanic = (Hispanic_LatinoE / total_populationE) * 100
  )

# Add readable tract and county name columns using str_extract() or similar
tract_data_texas <- tract_data_texas %>%
  mutate(
   Tract  = str_extract(NAME, "Census Tract \\d+(?:\\.\\d+)?"),
   County = str_extract(NAME, "[A-Za-z]+ County")
  )

head(tract_data_texas)
# A tibble: 6 × 15
  GEOID       NAME              white_aloneE white_aloneM Black_African_Americ…¹
  <chr>       <chr>                    <dbl>        <dbl>                  <dbl>
1 48001950100 Census Tract 950…         4061          682                    213
2 48001950401 Census Tract 950…         1073          308                   1569
3 48001950402 Census Tract 950…         1920          245                   2419
4 48001950500 Census Tract 950…         1924          381                    801
5 48001950600 Census Tract 950…         3287          674                    711
6 48001950700 Census Tract 950…          626          174                   1344
# ℹ abbreviated name: ¹​Black_African_AmericanE
# ℹ 10 more variables: Black_African_AmericanM <dbl>, Hispanic_LatinoE <dbl>,
#   Hispanic_LatinoM <dbl>, total_populationE <dbl>, total_populationM <dbl>,
#   pct_white <dbl>, pct_black <dbl>, pct_hispanic <dbl>, Tract <chr>,
#   County <chr>

3.3 Demographic Analysis

Your Task: Analyze the demographic patterns in your selected areas.

# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract

top_hispanic_tract <- tract_data_texas %>%
  # sort by Hispanic/Latino percentage (pct_hispanic) in descending order
  arrange(desc(pct_hispanic)) %>%
  slice(1)

top_hispanic_tract %>%
  kable(
    caption = "Tract with the Highest Percentage of Hispanic/Latino Residents"
  )
Tract with the Highest Percentage of Hispanic/Latino Residents

| GEOID | NAME | white_aloneE | white_aloneM | Black_African_AmericanE | Black_African_AmericanM | Hispanic_LatinoE | Hispanic_LatinoM | total_populationE | total_populationM | pct_white | pct_black | pct_hispanic | Tract | County |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 48029160100 | Census Tract 1601; Bexar County; Texas | 18 | 26 | 0 | 21 | 6931 | 1114 | 6949 | 1112 | 0.2590301 | 0 | 99.74097 | Census Tract 1601 | Bexar County |
# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group
county_demographics_summary <- tract_data_texas %>%
  group_by(County) %>%
  summarize(
    number_of_tracts = n(),
    avg_pct_white = mean(pct_white, na.rm = TRUE),
    avg_pct_black = mean(pct_black, na.rm = TRUE),
    avg_pct_hispanic = mean(pct_hispanic, na.rm = TRUE),
    total_population_E = sum(total_populationE, na.rm = TRUE)
  )

county_demographics_summary %>%
  select(
    `County` = County,
    `Tracts` = number_of_tracts,
    `Avg White/%` = avg_pct_white,
    `Avg Black/%` = avg_pct_black,
    `Avg Hispanic/%` = avg_pct_hispanic,
    `Total Pop` = total_population_E
  ) %>%
  kable(
    caption = "Average Demographics by County",
    digits = 2
  )
Average Demographics by County

| County | Tracts | Avg White/% | Avg Black/% | Avg Hispanic/% | Total Pop |
|---|---|---|---|---|---|
| Anderson County | 12 | 56.89 | 18.80 | 17.98 | 58077 |
| Bexar County | 375 | 25.50 | 6.42 | 62.54 | 2014059 |
| Brooks County | 2 | 8.42 | 0.88 | 89.16 | 7059 |

Part 4: Comprehensive Data Quality Evaluation

4.1 MOE Analysis for Demographic Variables

Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.

Requirements:

  • Calculate MOE percentages for each demographic variable
  • Flag tracts where any demographic variable has MOE > 15%
  • Create summary statistics

# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)
tract_data_texas <- tract_data_texas %>%
  mutate(
    # Calculate MOE percentage for each group
    # (note: a zero or near-zero estimate makes this ratio huge or infinite)
    white_moe_pct = (white_aloneM / white_aloneE) * 100,
    black_moe_pct = (Black_African_AmericanM / Black_African_AmericanE) * 100,
    hispanic_moe_pct = (Hispanic_LatinoM / Hispanic_LatinoE) * 100,

    # Flag tracts with high MOE on any demographic variable,
    # using logical OR (|) inside ifelse()
    demographic_reliability = ifelse(
      white_moe_pct > 15 | black_moe_pct > 15 | hispanic_moe_pct > 15,
      "High MOE",
      "Acceptable"
    )
  )
    
# Create summary statistics showing how many tracts have data quality issues
# Calculate MOE percentage and reliability categories using mutate()
tract_data_texas %>%
  count(demographic_reliability) %>%
  mutate(
    percentage = (n / sum(n)) * 100
  )
# A tibble: 1 × 3
  demographic_reliability     n percentage
  <chr>                   <int>      <dbl>
1 High MOE                  389        100

4.2 Pattern Analysis

Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.

# Group tracts by whether they have high MOE issues
# Calculate average characteristics for each group:
# - population size, demographic percentages
# Use group_by() and summarize() to create this comparison
# Create a professional table showing the patterns
tract_reliability_comparison <- tract_data_texas %>%
  group_by(demographic_reliability) %>%
  summarize(
    Number_of_Tracts = n(),
    Avg_Population = mean(total_populationE, na.rm = TRUE),
    Avg_Pct_White = mean(pct_white, na.rm = TRUE),
    Avg_Pct_Black = mean(pct_black, na.rm = TRUE),
    Avg_Pct_Hispanic = mean(pct_hispanic, na.rm = TRUE)
  )

tract_reliability_comparison %>%
  select(
    `Reliability Group` = demographic_reliability,
    `Tract Count` = Number_of_Tracts,
    `Avg Pop. Size` = Avg_Population,
    `Avg White/%` = Avg_Pct_White,
    `Avg Black/%` = Avg_Pct_Black,
    `Avg Hispanic/%` = Avg_Pct_Hispanic
  ) %>%   # pipe the renamed data frame to kable()
  kable(
    caption = "Comparison of Tract Characteristics by Data Reliability",
    digits = 2
  )
Comparison of Tract Characteristics by Data Reliability

| Reliability Group | Tract Count | Avg Pop. Size | Avg White/% | Avg Black/% | Avg Hispanic/% |
|---|---|---|---|---|---|
| High MOE | 389 | 5344.97 | 26.39 | 6.78 | 61.29 |

Pattern Analysis: The analysis reveals an extreme, systemic data quality issue: 100% of the tracts in the selected focus areas fall into the High MOE category, meaning data reliability is not merely poor in isolated cases but universally insufficient across these tracts. Part of this result is mechanical: whenever any one group's estimate in a tract is small (or zero), its relative MOE balloons, so even tracts in urban Bexar County trip the 15% flag. The flagged tracts average around 5,345 residents and are, on average, 61.29% Hispanic. This pattern suggests that demographic estimates become unstable at the tract level wherever subgroup sample counts are small, and in this selection those tracts are disproportionately Hispanic-majority. Consequently, any algorithm relying on this demographic data would be making resource allocation decisions on top of raw statistical uncertainty, disproportionately impacting Hispanic-majority communities.
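One caveat worth making explicit: dividing a count's MOE by the count itself is a rough reliability check, and it explodes for small estimates. For derived proportions, the Census Bureau's ACS handbook gives an approximation (implemented in tidycensus as moe_prop()). A base-R sketch of that formula, applied to the Census Tract 1601 figures shown above:

```r
# Approximate MOE for a derived proportion p = num/denom, following the
# Census Bureau's ACS formula; falls back to the ratio version when the
# term under the square root is negative.
moe_for_prop <- function(num, denom, moe_num, moe_denom) {
  p <- num / denom
  under <- moe_num^2 - p^2 * moe_denom^2
  if (under < 0) under <- moe_num^2 + p^2 * moe_denom^2
  sqrt(under) / denom
}

# Census Tract 1601: Hispanic estimate 6931 (MOE 1114), total 6949 (MOE 1112)
moe_for_prop(6931, 6949, 1114, 1112) * 100  # about 1.5 percentage points
```

By the raw-count check this tract looks unreliable (1114/6931 is about 16%), yet the Hispanic share itself is pinned down to within roughly 1.5 percentage points, which is one reason a proportion-aware MOE is the better input for a flagging rule.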

Part 5: Policy Recommendations

5.1 Analysis Integration and Professional Summary

Your Task: Write an executive summary that integrates findings from all four analyses.

Executive Summary Requirements: 1. Overall Pattern Identification: What are the systematic patterns across all your analyses? 2. Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings? 3. Root Cause Analysis: What underlying factors drive both data quality issues and bias risk? 4. Strategic Recommendations: What should the Department implement to address these systematic issues?

Executive Summary:

The analysis reveals a two-level, systematic breakdown in data reliability. At the county level, high MOE is concentrated in sparsely populated counties like Jeff Davis and Culberson; these areas immediately fail basic reliability thresholds for household income estimates. At the tract level, 100% of all tracts examined fell into the High Uncertainty category, with an MOE greater than 15% on at least one demographic variable. Data quality thus collapses at the neighborhood level within these regions, making reliable demographic information universally insufficient for use in funding or allocation decisions.

Communities with the highest risk of algorithmic bias are those that are small, rural, and possess a high concentration of minority residents. The tracts flagged as universally unreliable show an average of 61.29% Hispanic residents. Algorithms relying on this fundamentally flawed data risk systematically under-serving these communities, as any resource allocation based on the estimate would carry a high degree of statistical uncertainty, effectively rendering the data unusable for precise needs assessment.

The underlying factor driving both data quality issues and the resulting bias risk is the insufficient statistical sample size combined with geographical isolation. In sparsely populated, often rural counties, the ACS cannot collect enough household responses to produce stable estimates for either income or demographics. Because high-minority communities are overrepresented in these low-population areas in Texas, the data gaps systematically align with ethnic lines, translating a statistical problem into an undeniable equity problem.

To reduce this systemic risk and ensure equitable allocation, the Department should implement the following strategic measures:

1. Implement a strict rule requiring that any data used by the allocation algorithm must have an MOE percentage below a pre-defined threshold (e.g., 10%). Funding allocation logic must be halted if this threshold is breached.
2. For all counties or tracts flagged as “High Uncertainty” (MOE > 15%), the algorithm must default to using supplementary, non-ACS data rather than relying on unreliable census estimates.
3. Program the algorithm to automatically flag and defer funding decisions for any tract with high data uncertainty to a human review committee, ensuring local context and qualitative data are applied where quantitative data is weakest.
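These measures can be sketched as a single routing function. The function name and the exact wording of the actions are illustrative; the 10% and 15% thresholds are the ones proposed above:

```r
# Route a geography to an action based on its MOE percentage.
# Thresholds follow the recommendations above: <= 10% may feed the
# algorithm; 10-15% halts automated allocation; above 15% (or missing
# data, as for Loving County) defers to human review with non-ACS data.
route_decision <- function(moe_pct) {
  if (is.na(moe_pct) || moe_pct > 15) {
    "Defer to human review committee (use supplementary, non-ACS data)"
  } else if (moe_pct > 10) {
    "Halt automated allocation; seek additional data"
  } else {
    "Eligible for algorithmic allocation"
  }
}

route_decision(1.15)   # Bexar-like county: eligible
route_decision(12.5)   # halts automated allocation
route_decision(33.8)   # Brooks-like county: defer to human review
```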

5.2 Specific Recommendations

Your Task: Create a decision framework for algorithm implementation.

# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category
county_data_texas %>%
  select(
    `County Name` = NAME,
    `Median Income` = median_household_incomeE,
    `MOE Percentage` = income_moe_pct,
    `Reliability Category` = income_reliability
  ) %>%
# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"  
# - Low Confidence: "Requires manual review or additional data"
mutate(
    `Algorithm Recommendation` = case_when(
      # High Confidence: MOE < 5%
      `Reliability Category` == "High Confidence" ~ "Safe for algorithmic decisions",
      # Moderate Confidence: MOE 5-10%
      `Reliability Category` == "Moderate Confidence" ~ "Use with caution - monitor outcomes",
      # Low Confidence: MOE > 10%
      `Reliability Category` == "Low Confidence" ~ "Requires manual review or additional data"
    )
  ) %>%
# Format as a professional table with kable()
  kable(
    caption = "Algorithmic Implementation Framework",
    digits = 2 
  )
Algorithmic Implementation Framework

| County Name | Median Income | MOE Percentage | Reliability Category | Algorithm Recommendation |
|---|---|---|---|---|
| Anderson | 57445 | 7.94 | Moderate Confidence | Use with caution - monitor outcomes |
| Andrews | 86458 | 18.64 | Low Confidence | Requires manual review or additional data |
| Angelina | 57055 | 4.35 | High Confidence | Safe for algorithmic decisions |
| Aransas | 58168 | 11.10 | Low Confidence | Requires manual review or additional data |
| Archer | 69954 | 12.13 | Low Confidence | Requires manual review or additional data |
| Armstrong | 70417 | 20.70 | Low Confidence | Requires manual review or additional data |
| Atascosa | 67442 | 6.39 | Moderate Confidence | Use with caution - monitor outcomes |
| Austin | 73556 | 6.47 | Moderate Confidence | Use with caution - monitor outcomes |
| Bailey | 69830 | 18.79 | Low Confidence | Requires manual review or additional data |
| Bandera | 70965 | 8.05 | Moderate Confidence | Use with caution - monitor outcomes |
| Bastrop | 80151 | 6.09 | Moderate Confidence | Use with caution - monitor outcomes |
| Baylor | 52716 | 25.14 | Low Confidence | Requires manual review or additional data |
| Bee | 50283 | 10.00 | Moderate Confidence | Use with caution - monitor outcomes |
| Bell | 62858 | 2.80 | High Confidence | Safe for algorithmic decisions |
| Bexar | 67275 | 1.15 | High Confidence | Safe for algorithmic decisions |
| Blanco | 79717 | 9.35 | Moderate Confidence | Use with caution - monitor outcomes |
| Borden | 80625 | 24.88 | Low Confidence | Requires manual review or additional data |
| Bosque | 63868 | 6.05 | Moderate Confidence | Use with caution - monitor outcomes |
| Bowie | 56628 | 4.12 | High Confidence | Safe for algorithmic decisions |
| Brazoria | 91972 | 3.31 | High Confidence | Safe for algorithmic decisions |
| Brazos | 57562 | 3.45 | High Confidence | Safe for algorithmic decisions |
| Brewster | 47747 | 10.97 | Low Confidence | Requires manual review or additional data |
| Briscoe | 35446 | 27.64 | Low Confidence | Requires manual review or additional data |
| Brooks | 30566 | 33.82 | Low Confidence | Requires manual review or additional data |
| Brown | 53792 | 4.75 | High Confidence | Safe for algorithmic decisions |
| Burleson | 71745 | 6.53 | Moderate Confidence | Use with caution - monitor outcomes |
| Burnet | 71482 | 8.09 | Moderate Confidence | Use with caution - monitor outcomes |
| Caldwell | 66779 | 6.96 | Moderate Confidence | Use with caution - monitor outcomes |
| Calhoun | 62267 | 9.92 | Moderate Confidence | Use with caution - monitor outcomes |
| Callahan | 63906 | 3.63 | High Confidence | Safe for algorithmic decisions |
| Cameron | 47435 | 3.41 | High Confidence | Safe for algorithmic decisions |
| Camp | 53968 | 7.64 | Moderate Confidence | Use with caution - monitor outcomes |
| Carson | 83199 | 4.51 | High Confidence | Safe for algorithmic decisions |
| Cass | 54303 | 6.82 | Moderate Confidence | Use with caution - monitor outcomes |
| Castro | 59886 | 17.03 | Low Confidence | Requires manual review or additional data |
| Chambers | 106103 | 8.28 | Moderate Confidence | Use with caution - monitor outcomes |
| Cherokee | 56971 | 8.45 | Moderate Confidence | Use with caution - monitor outcomes |
| Childress | 56063 | 29.34 | Low Confidence | Requires manual review or additional data |
| Clay | 75227 | 7.11 | Moderate Confidence | Use with caution - monitor outcomes |
| Cochran | 41597 | 16.97 | Low Confidence | Requires manual review or additional data |
| Coke | 40230 | 13.80 | Low Confidence | Requires manual review or additional data |
| Coleman | 51034 | 7.95 | Moderate Confidence | Use with caution - monitor outcomes |
| Collin | 113255 | 1.16 | High Confidence | Safe for algorithmic decisions |
| Collingsworth | 52045 | 22.02 | Low Confidence | Requires manual review or additional data |
| Colorado | 63352 | 8.17 | Moderate Confidence | Use with caution - monitor outcomes |
| Comal | 93744 | 2.83 | High Confidence | Safe for algorithmic decisions |
| Comanche | 57383 | 13.32 | Low Confidence | Requires manual review or additional data |
| Concho | 55750 | 27.87 | Low Confidence | Requires manual review or additional data |
| Cooke | 66374 | 8.36 | Moderate Confidence | Use with caution - monitor outcomes |
| Coryell | 63281 | 3.62 | High Confidence | Safe for algorithmic decisions |
| Cottle | 47625 | 37.82 | Low Confidence | Requires manual review or additional data |
| Crane | 71364 | 32.89 | Low Confidence | Requires manual review or additional data |
| Crockett | 64103 | 33.95 | Low Confidence | Requires manual review or additional data |
| Crosby | 50268 | 10.40 | Low Confidence | Requires manual review or additional data |
| Culberson | 35924 | 51.37 | Low Confidence | Requires manual review or additional data |
| Dallam | 71969 | 9.75 | Moderate Confidence | Use with caution - monitor outcomes |
| Dallas | 70732 | 0.86 | High Confidence | Safe for algorithmic decisions |
| Dawson | 45268 | 27.64 | Low Confidence | Requires manual review or additional data |
| Deaf Smith | 51942 | 6.01 | Moderate Confidence | Use with caution - monitor outcomes |
| Delta | 68491 | 27.75 | Low Confidence | Requires manual review or additional data |
| Denton | 104180 | 1.30 | High Confidence | Safe for algorithmic decisions |
| DeWitt | 61100 | 7.86 | Moderate Confidence | Use with caution - monitor outcomes |
| Dickens | 46638 | 13.62 | Low Confidence | Requires manual review or additional data |
| Dimmit | 27374 | 45.20 | Low Confidence | Requires manual review or additional data |
| Donley | 51711 | 12.45 | Low Confidence | Requires manual review or additional data |
| Duval | 50697 | 20.28 | Low Confidence | Requires manual review or additional data |
| Eastland | 52902 | 12.21 | Low Confidence | Requires manual review or additional data |
| Ector | 70566 | 4.20 | High Confidence | Safe for algorithmic decisions |
| Edwards | 40809 | 27.09 | Low Confidence | Requires manual review or additional data |
| Ellis | 93248 | 2.66 | High Confidence | Safe for algorithmic decisions |
| El Paso | 55417 | 1.92 | High Confidence | Safe for algorithmic decisions |
| Erath | 59654 | 6.73 | Moderate Confidence | Use with caution - monitor outcomes |
| Falls | 45172 | 15.35 | Low Confidence | Requires manual review or additional data |
| Fannin | 65835 | 6.09 | Moderate Confidence | Use with caution - monitor outcomes |
| Fayette | 72881 | 5.03 | Moderate Confidence | Use with caution - monitor outcomes |
| Fisher | 60461 | 8.19 | Moderate Confidence | Use with caution - monitor outcomes |
| Floyd | 49321 | 8.99 | Moderate Confidence | Use with caution - monitor outcomes |
| Foard | 41944 | 20.94 | Low Confidence | Requires manual review or additional data |
| Fort Bend | 109987 | 2.64 | High Confidence | Safe for algorithmic decisions |
| Franklin | 67915 | 4.37 | High Confidence | Safe for algorithmic decisions |
| Freestone | 55902 | 10.53 | Low Confidence | Requires manual review or additional data |
| Frio | 56042 | 30.34 | Low Confidence | Requires manual review or additional data |
| Gaines | 73299 | 13.82 | Low Confidence | Requires manual review or additional data |
| Galveston | 83913 | 2.78 | High Confidence | Safe for algorithmic decisions |
| Garza | 56215 | 34.96 | Low Confidence | Requires manual review or additional data |
| Gillespie | 70162 | 8.15 | Moderate Confidence | Use with caution - monitor outcomes |
| Glasscock | 112188 | 27.86 | Low Confidence | Requires manual review or additional data |
| Goliad | 58125 | 25.84 | Low Confidence | Requires manual review or additional data |
| Gonzales | 64255 | 8.43 | Moderate Confidence | Use with caution - monitor outcomes |
| Gray | 54563 | 7.16 | Moderate Confidence | Use with caution - monitor outcomes |
| Grayson | 66608 | 3.60 | High Confidence | Safe for algorithmic decisions |
| Gregg | 63811 | 3.91 | High Confidence | Safe for algorithmic decisions |
| Grimes | 63484 | 9.10 | Moderate Confidence | Use with caution - monitor outcomes |
| Guadalupe | 88111 | 3.93 | High Confidence | Safe for algorithmic decisions |
| Hale | 50721 | 8.97 | Moderate Confidence | Use with caution - monitor outcomes |
| Hall | 43873 | 10.96 | Low Confidence | Requires manual review or additional data |
| Hamilton | 54890 | 17.27 | Low Confidence | Requires manual review or additional data |
| Hansford | 62350 | 19.70 | Low Confidence | Requires manual review or additional data |
| Hardeman | 60455 | 15.33 | Low Confidence | Requires manual review or additional data |
| Hardin | 70164 | 5.15 | Moderate Confidence | Use with caution - monitor outcomes |
| Harris | 70789 | 0.68 | High Confidence | Safe for algorithmic decisions |
| Harrison | 63427 | 4.81 | High Confidence | Safe for algorithmic decisions |
| Hartley | 78065 | 27.00 | Low Confidence | Requires manual review or additional data |
| Haskell | 52786 | 16.33 | Low Confidence | Requires manual review or additional data |
| Hays | 79990 | 3.73 | High Confidence | Safe for algorithmic decisions |
| Hemphill | 67798 | 27.70 | Low Confidence | Requires manual review or additional data |
| Henderson | 59778 | 4.35 | High Confidence | Safe for algorithmic decisions |
| Hidalgo | 49371 | 2.31 | High Confidence | Safe for algorithmic decisions |
| Hill | 60669 | 6.01 | Moderate Confidence | Use with caution - monitor outcomes |
| Hockley | 53283 | 7.51 | Moderate Confidence | Use with caution - monitor outcomes |
| Hood | 80013 | 4.72 | High Confidence | Safe for algorithmic decisions |
| Hopkins | 63766 | 5.67 | Moderate Confidence | Use with caution - monitor outcomes |
| Houston | 51043 | 10.50 | Low Confidence | Requires manual review or additional data |
| Howard | 67243 | 6.53 | Moderate Confidence | Use with caution - monitor outcomes |
| Hudspeth | 35163 | 23.20 | Low Confidence | Requires manual review or additional data |
| Hunt | 66885 | 4.17 | High Confidence | Safe for algorithmic decisions |
| Hutchinson | 62211 | 7.55 | Moderate Confidence | Use with caution - monitor outcomes |
| Irion | 54708 | 16.96 | Low Confidence | Requires manual review or additional data |
| Jack | 58861 | 13.22 | Low Confidence | Requires manual review or additional data |
| Jackson | 67176 | 17.91 | Low Confidence | Requires manual review or additional data |
| Jasper | 48818 | 9.84 | Moderate Confidence | Use with caution - monitor outcomes |
| Jeff Davis | 38125 | 66.11 | Low Confidence | Requires manual review or additional data |
| Jefferson | 57294 | 2.91 | High Confidence | Safe for algorithmic decisions |
| Jim Hogg | 42292 | 13.64 | Low Confidence | Requires manual review or additional data |
| Jim Wells | 46626 | 12.69 | Low Confidence | Requires manual review or additional data |
| Johnson | 77058 | 3.06 | High Confidence | Safe for algorithmic decisions |
| Jones | 59361 | 10.04 | Low Confidence | Requires manual review or additional data |
| Karnes | 57798 | 14.47 | Low Confidence | Requires manual review or additional data |
| Kaufman | 84075 | 4.11 | High Confidence | Safe for algorithmic decisions |
| Kendall | 104196 | 8.26 | Moderate Confidence | Use with caution - monitor outcomes |
| Kenedy | 45455 | 25.12 | Low Confidence | Requires manual review or additional data |
| Kent | 68553 | 15.57 | Low Confidence | Requires manual review or additional data |
| Kerr | 66713 | 6.23 | Moderate Confidence | Use with caution - monitor outcomes |
| Kimble | 62386 | 22.58 | Low Confidence | Requires manual review or additional data |
| King | 59375 | 49.51 | Low Confidence | Requires manual review or additional data |
| Kinney | 52386 | 45.29 | Low Confidence | Requires manual review or additional data |
| Kleberg | 52487 | 9.50 | Moderate Confidence | Use with caution - monitor outcomes |
| Knox | 48750 | 9.82 | Moderate Confidence | Use with caution - monitor outcomes |
| Lamar | 58246 | 4.88 | High Confidence | Safe for algorithmic decisions |
| Lamb | 54519 | 8.61 | Moderate Confidence | Use with caution - monitor outcomes |
| Lampasas | 73269 | 7.44 | Moderate Confidence | Use with caution - monitor outcomes |
| La Salle | 62798 | 26.04 | Low Confidence | Requires manual review or additional data |
| Lavaca | 58530 | 7.71 | Moderate Confidence | Use with caution - monitor outcomes |
| Lee | 66448 | 10.40 | Low Confidence | Requires manual review or additional data |
| Leon | 57363 | 12.04 | Low Confidence | Requires manual review or additional data |
| Liberty | 59605 | 6.69 | Moderate Confidence | Use with caution - monitor outcomes |
| Limestone | 53102 | 7.12 | Moderate Confidence | Use with caution - monitor outcomes |
| Lipscomb | 71625 | 12.92 | Low Confidence | Requires manual review or additional data |
| Live Oak | 55949 | 17.98 | Low Confidence | Requires manual review or additional data |
| Llano | 64241 | 8.81 | Moderate Confidence | Use with caution - monitor outcomes |
| Loving | NA | NA | NA | NA |
| Lubbock | 61911 | 3.54 | High Confidence | Safe for algorithmic decisions |
| Lynn | 52996 | 7.09 | Moderate Confidence | Use with caution - monitor outcomes |
| McCulloch | 53214 | 16.82 | Low Confidence | Requires manual review or additional data |
| McLennan | 59781 | 3.19 | High Confidence | Safe for algorithmic decisions |
| McMullen | 60313 | 41.18 | Low Confidence | Requires manual review or additional data |
| Madison | 65768 | 9.56 | Moderate Confidence | Use with caution - monitor outcomes |
Marion 48040 4.95 High Confidence Safe for algorithmic decisions
Martin 70217 27.10 Low Confidence Requires manual review or additional data
Mason 77583 15.82 Low Confidence Requires manual review or additional data
Matagorda 56412 6.28 Moderate Confidence Use with caution - monitor outcomes
Maverick 48497 10.10 Low Confidence Requires manual review or additional data
Medina 73060 4.04 High Confidence Safe for algorithmic decisions
Menard 40945 17.91 Low Confidence Requires manual review or additional data
Midland 90123 5.80 Moderate Confidence Use with caution - monitor outcomes
Milam 56985 5.92 Moderate Confidence Use with caution - monitor outcomes
Mills 59315 9.26 Moderate Confidence Use with caution - monitor outcomes
Mitchell 49869 12.52 Low Confidence Requires manual review or additional data
Montague 63336 8.60 Moderate Confidence Use with caution - monitor outcomes
Montgomery 95946 3.45 High Confidence Safe for algorithmic decisions
Moore 59041 6.55 Moderate Confidence Use with caution - monitor outcomes
Morris 51532 6.71 Moderate Confidence Use with caution - monitor outcomes
Motley 66528 8.43 Moderate Confidence Use with caution - monitor outcomes
Nacogdoches 51153 4.19 High Confidence Safe for algorithmic decisions
Navarro 56261 7.70 Moderate Confidence Use with caution - monitor outcomes
Newton 38871 16.91 Low Confidence Requires manual review or additional data
Nolan 47437 7.72 Moderate Confidence Use with caution - monitor outcomes
Nueces 64027 2.26 High Confidence Safe for algorithmic decisions
Ochiltree 62240 17.77 Low Confidence Requires manual review or additional data
Oldham 71103 11.17 Low Confidence Requires manual review or additional data
Orange 71910 7.89 Moderate Confidence Use with caution - monitor outcomes
Palo Pinto 65242 4.40 High Confidence Safe for algorithmic decisions
Panola 58205 18.34 Low Confidence Requires manual review or additional data
Parker 95721 3.91 High Confidence Safe for algorithmic decisions
Parmer 65575 13.74 Low Confidence Requires manual review or additional data
Pecos 59325 17.15 Low Confidence Requires manual review or additional data
Polk 57315 5.19 Moderate Confidence Use with caution - monitor outcomes
Potter 47974 3.95 High Confidence Safe for algorithmic decisions
Presidio 29012 23.98 Low Confidence Requires manual review or additional data
Rains 60291 10.78 Low Confidence Requires manual review or additional data
Randall 78038 3.45 High Confidence Safe for algorithmic decisions
Reagan 70319 12.82 Low Confidence Requires manual review or additional data
Real 46842 32.96 Low Confidence Requires manual review or additional data
Red River 44583 9.73 Moderate Confidence Use with caution - monitor outcomes
Reeves 57487 22.06 Low Confidence Requires manual review or additional data
Refugio 54304 4.88 High Confidence Safe for algorithmic decisions
Roberts 62667 14.22 Low Confidence Requires manual review or additional data
Robertson 59410 17.70 Low Confidence Requires manual review or additional data
Rockwall 121303 3.78 High Confidence Safe for algorithmic decisions
Runnels 55424 5.97 Moderate Confidence Use with caution - monitor outcomes
Rusk 61661 9.85 Moderate Confidence Use with caution - monitor outcomes
Sabine 47061 16.90 Low Confidence Requires manual review or additional data
San Augustine 45888 9.69 Moderate Confidence Use with caution - monitor outcomes
San Jacinto 54839 12.97 Low Confidence Requires manual review or additional data
San Patricio 63842 6.88 Moderate Confidence Use with caution - monitor outcomes
San Saba 54087 16.38 Low Confidence Requires manual review or additional data
Schleicher 53774 15.15 Low Confidence Requires manual review or additional data
Scurry 58932 21.39 Low Confidence Requires manual review or additional data
Shackelford 60924 14.04 Low Confidence Requires manual review or additional data
Shelby 49231 9.96 Moderate Confidence Use with caution - monitor outcomes
Sherman 66169 27.93 Low Confidence Requires manual review or additional data
Smith 69053 3.19 High Confidence Safe for algorithmic decisions
Somervell 87899 33.73 Low Confidence Requires manual review or additional data
Starr 35979 8.42 Moderate Confidence Use with caution - monitor outcomes
Stephens 44712 18.85 Low Confidence Requires manual review or additional data
Sterling 63558 22.25 Low Confidence Requires manual review or additional data
Stonewall 66591 32.48 Low Confidence Requires manual review or additional data
Sutton 56778 22.74 Low Confidence Requires manual review or additional data
Swisher 40290 13.69 Low Confidence Requires manual review or additional data
Tarrant 78872 1.01 High Confidence Safe for algorithmic decisions
Taylor 61806 3.90 High Confidence Safe for algorithmic decisions
Terrell 52813 21.03 Low Confidence Requires manual review or additional data
Terry 42694 10.93 Low Confidence Requires manual review or additional data
Throckmorton 55221 21.09 Low Confidence Requires manual review or additional data
Titus 57634 8.12 Moderate Confidence Use with caution - monitor outcomes
Tom Green 67215 4.68 High Confidence Safe for algorithmic decisions
Travis 92731 1.19 High Confidence Safe for algorithmic decisions
Trinity 51165 11.37 Low Confidence Requires manual review or additional data
Tyler 50898 10.34 Low Confidence Requires manual review or additional data
Upshur 60456 7.74 Moderate Confidence Use with caution - monitor outcomes
Upton 55284 21.72 Low Confidence Requires manual review or additional data
Uvalde 55000 15.28 Low Confidence Requires manual review or additional data
Val Verde 57250 8.01 Moderate Confidence Use with caution - monitor outcomes
Van Zandt 62334 8.10 Moderate Confidence Use with caution - monitor outcomes
Victoria 66308 3.55 High Confidence Safe for algorithmic decisions
Walker 47193 6.32 Moderate Confidence Use with caution - monitor outcomes
Waller 71643 5.98 Moderate Confidence Use with caution - monitor outcomes
Ward 70771 12.60 Low Confidence Requires manual review or additional data
Washington 70043 9.41 Moderate Confidence Use with caution - monitor outcomes
Webb 59984 3.11 High Confidence Safe for algorithmic decisions
Wharton 59712 6.84 Moderate Confidence Use with caution - monitor outcomes
Wheeler 58158 14.24 Low Confidence Requires manual review or additional data
Wichita 58862 3.24 High Confidence Safe for algorithmic decisions
Wilbarger 50769 18.36 Low Confidence Requires manual review or additional data
Willacy 42839 13.05 Low Confidence Requires manual review or additional data
Williamson 102851 1.42 High Confidence Safe for algorithmic decisions
Wilson 89708 4.89 High Confidence Safe for algorithmic decisions
Winkler 89155 16.06 Low Confidence Requires manual review or additional data
Wise 85385 6.07 Moderate Confidence Use with caution - monitor outcomes
Wood 61748 6.02 Moderate Confidence Use with caution - monitor outcomes
Yoakum 80317 8.81 Moderate Confidence Use with caution - monitor outcomes
Young 65565 16.86 Low Confidence Requires manual review or additional data
Zapata 35061 10.72 Low Confidence Requires manual review or additional data
Zavala 49243 28.78 Low Confidence Requires manual review or additional data

Key Recommendations:

Your Task: Use your analysis results to provide specific guidance to the department.

  1. Counties suitable for immediate algorithmic implementation:

Counties categorized as High Confidence (median income MOE below 5% of the estimate) are suitable. These are typically large, urban counties with stable estimates, such as Harris or Dallas County. Their income estimates are statistically stable (ACS margins of error are published at the 90% confidence level), so resource allocations based on them carry minimal risk of systematic misallocation.

  2. Counties requiring additional oversight:

For counties categorized as Moderate Confidence (median income MOE between 5% and 10% of the estimate), the algorithm should be deployed with caution. The Department should implement ongoing monitoring that compares algorithmic allocation decisions against actual service uptake and ground-level caseworker reports. A 5-10% uncertainty range calls for human oversight to validate outcomes and to trigger manual review whenever allocation patterns deviate significantly from expected need.

  3. Counties needing alternative approaches:

Standard algorithmic application should be suspended immediately for all Low Confidence counties (MOE > 10%), especially those with extreme uncertainty such as Jeff Davis (MOE ≈ 66%) and Culberson (MOE ≈ 53%). In these areas, where tract-level data is broadly unreliable and many communities have large Hispanic populations, the Department should transition to manual caseworker review for needs assessment, or invest in small, targeted administrative surveys as a more reliable proxy for need than the ACS income estimates.
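The three-tier rule behind these recommendations can be sketched with dplyr. This is a minimal sketch, not the assignment's exact code: the column names `estimate` and `moe` are hypothetical stand-ins for the tidycensus output, while the 5% and 10% cutoffs match the thresholds used above.

```r
# Classification sketch: MOE as a percentage of the estimate drives the tier.
# Column names (estimate, moe) are assumed, not taken from the original code.
library(dplyr)

classify_reliability <- function(df) {
  df %>%
    mutate(
      moe_pct = 100 * moe / estimate,   # MOE as % of the income estimate
      reliability = case_when(
        is.na(moe_pct) ~ NA_character_, # e.g., Loving County (suppressed data)
        moe_pct < 5    ~ "High Confidence",
        moe_pct <= 10  ~ "Moderate Confidence",
        TRUE           ~ "Low Confidence"
      ),
      recommendation = case_when(
        reliability == "High Confidence"     ~ "Safe for algorithmic decisions",
        reliability == "Moderate Confidence" ~ "Use with caution - monitor outcomes",
        reliability == "Low Confidence"      ~ "Requires manual review or additional data",
        TRUE                                 ~ NA_character_
      )
    )
}

# Example with made-up values:
demo <- tibble(county   = c("A", "B", "C"),
               estimate = c(80000, 60000, 50000),
               moe      = c(2400, 4500, 9000))
classify_reliability(demo)
```

The `is.na()` branch mirrors how counties with suppressed estimates (such as Loving in the table above) fall out of the classification rather than being mislabeled.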

Questions for Further Investigation

  1. How do the tracts flagged as High MOE cluster geographically, and is there a measurable correlation between these clusters and existing disparities in access to Department resources or social infrastructure?

  2. Beyond the Hispanic population, are High MOE issues equally prevalent across other small, concentrated minority groups (e.g., specific Native American or Asian subgroups), or is the sample-size challenge uniquely affecting the Hispanic population in the Texas border regions?

Technical Notes

Data Sources:

  • U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
  • Retrieved via tidycensus R package on September 19, 2025
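For replication, the county-level income data can be pulled with tidycensus roughly as follows. This is a sketch under stated assumptions: a Census API key must already be registered via `census_api_key()`, and `B19013_001` is the ACS median household income variable.

```r
# Retrieval sketch, assuming a Census API key is already set with
# tidycensus::census_api_key(). Defined but not called here, since the
# call requires network access and a valid key.
library(tidycensus)

fetch_tx_income <- function() {
  get_acs(
    geography = "county",
    variables = c(median_income = "B19013_001"),  # median household income
    state     = "TX",
    year      = 2022,    # endpoint of the 2018-2022 5-year window
    survey    = "acs5"
  )
}
# counties <- fetch_tx_income()  # returns GEOID, NAME, estimate, moe columns
```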

Reproducibility:

  • All analysis conducted in R version 4.5.1
  • Census API key required for replication
  • Complete code and documentation available at: https://github.com/MUSA-5080-Fall-2025/portfolio-setup-ChristineCui12

Methodology Notes:

The analysis employed several deliberate choices to focus the data quality evaluation. First, all estimates were sourced from the American Community Survey (ACS) 2018-2022 5-Year Estimates, the most geographically detailed and statistically stable data available for tract-level analysis. Second, the three focus counties were intentionally selected from the MOE analysis to represent the extremes of data reliability (the High, Moderate, and Low Confidence ranges) rather than counties near the confidence boundaries, in order to maximize the observed contrast in data quality patterns during the tract-level deep dive. Third, a conservative 15% MOE threshold was set for the tract-level demographic variables: if the MOE for any of the three major demographic groups (White, Black, Hispanic) exceeded this threshold, the entire tract was flagged as High MOE.
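The "flag the whole tract if any group exceeds the threshold" rule can be sketched as below. The per-group column names (`white_est`, `white_moe`, etc.) are hypothetical stand-ins for the actual ACS variables; only the 15% threshold and the any-group logic come from the methodology described above.

```r
# Tract flagging sketch: a tract is High MOE if the MOE for ANY of the
# three major demographic groups exceeds 15% of that group's estimate.
# Column names are assumed, not taken from the original analysis.
library(dplyr)

flag_high_moe <- function(tracts, threshold = 15) {
  tracts %>%
    mutate(
      white_moe_pct    = 100 * white_moe    / white_est,
      black_moe_pct    = 100 * black_moe    / black_est,
      hispanic_moe_pct = 100 * hispanic_moe / hispanic_est,
      high_moe_flag = white_moe_pct    > threshold |
                      black_moe_pct    > threshold |
                      hispanic_moe_pct > threshold
    )
}
```

Because a single unreliable group is enough to flag a tract, this rule is deliberately conservative: it errs toward manual review rather than toward trusting a partially unreliable estimate.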

Limitations:

The findings are restricted to Texas and the three specifically selected counties. The conclusion that 100% of tracts in the focus area are unreliable is an extreme finding that may not generalize to counties in other parts of the state or country. In addition, the use of 5-Year ACS estimates means the data represents a rolling average over 2018-2022; this dampens volatility but may mask rapid economic changes or population shifts that have occurred since the beginning of the period.


Submission Checklist

Before submitting your portfolio link on Canvas:

  • [√] All code chunks run without errors
  • [√] All “[Fill this in]” prompts have been completed
  • [√] Tables are properly formatted and readable
  • [√] Executive summary addresses all four required components
  • [√] Portfolio navigation includes this assignment
  • [√] Census API key is properly set
  • [√] Document renders correctly to HTML

Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/your_file_name.html