Assignment 1: Census Data Quality for Policy Decisions

Evaluating Data Reliability for Algorithmic Decision-Making

Author

Jun Luu

Published

September 29, 2025

Assignment Overview

Scenario

You are a data analyst for the Virginia Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.

Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.

Learning Objectives

  • Apply dplyr functions to real census data for policy analysis
  • Evaluate data quality using margins of error
  • Connect technical analysis to algorithmic decision-making
  • Identify potential equity implications of data reliability issues
  • Create professional documentation for policy stakeholders

Submission Instructions

Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/

Make sure to update your _quarto.yml navigation to include this assignment under an “Assignments” menu.

Part 1: Portfolio Integration

Create this assignment in your portfolio repository under an assignments/assignment_1/ folder structure. Update your navigation menu to include:

- text: Assignments
  menu:
    - href: assignments/assignment_1/your_file_name.qmd
      text: "Assignment 1: Census Data Exploration"

If the text contains a special character such as a comma, wrap it in double quotes so that Quarto parses it as a single string.

Setup

# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
library(tidyverse)
library(tidycensus)
library(knitr)

# Set your Census API key
census_api_key(Sys.getenv("CENSUS_API_KEY"))  # Sys.getenv() takes the name of the environment variable, not the key itself

# Choose your state for analysis - assign it to a variable called my_state
my_state <- "VA"

State Selection: I have chosen Virginia for this analysis because I am interested in my home state!

Part 2: County-Level Resource Assessment

2.1 Data Retrieval

Your Task: Use get_acs() to retrieve county-level data for your chosen state.

Requirements:

  • Geography: county level
  • Variables: median household income (B19013_001) and total population (B01003_001)
  • Year: 2022
  • Survey: acs5
  • Output format: wide

Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.

# Write your get_acs() code here
county_acs <- get_acs(
  variables = c(
    median_household_income = "B19013_001", 
    total_population = "B01003_001"),
  year = 2022, 
  geography = "county", 
  state = my_state, 
  survey = "acs5", 
  cache = TRUE, 
  output = "wide")

# Clean the county names to remove state name and "County" 
# Hint: use mutate() with str_remove()
county_clean <- county_acs %>%
  mutate(
    NAME = NAME %>%
      str_remove(" County,") %>%
      str_remove(" Virginia") %>%
      str_remove(" city,") # Virginia's independent cities appear as "... city, Virginia"
  )

# Display the first few rows
head(county_clean)
# A tibble: 6 × 6
  GEOID NAME     median_household_inc…¹ median_household_inc…² total_populationE
  <chr> <chr>                     <dbl>                  <dbl>             <dbl>
1 51001 Accomack                  52694                   5883             33367
2 51003 Albemar…                  97708                   3686            112513
3 51005 Allegha…                  52546                   3958             15159
4 51007 Amelia                    63438                  15114             13309
5 51009 Amherst                   64454                   4514             31426
6 51011 Appomat…                  60041                   7091             16253
# ℹ abbreviated names: ¹​median_household_incomeE, ²​median_household_incomeM
# ℹ 1 more variable: total_populationM <dbl>

2.2 Data Quality Assessment

Your Task: Calculate margin of error percentages and create reliability categories.

Requirements:

  • Calculate MOE percentage: (margin of error / estimate) * 100
  • Create reliability categories:
    • High Confidence: MOE < 5%
    • Moderate Confidence: MOE 5-10%
    • Low Confidence: MOE > 10%
  • Create a flag for unreliable estimates (MOE > 10%)

Hint: Use mutate() with case_when() for the categories.

# Calculate MOE percentage and reliability categories using mutate()
county_reliability <- county_clean %>%
  mutate(
    hh_income_moe_pct = (median_household_incomeM / median_household_incomeE) * 100,
    hh_income_moe_cat = case_when(
      hh_income_moe_pct < 5 ~ "High Confidence",
      hh_income_moe_pct <= 10 ~ "Moderate Confidence",
      TRUE ~ "Low Confidence"
    ),
    # Flag estimates with MOE above 10% as unreliable
    hh_income_unreliable = hh_income_moe_pct > 10
  )

# Create a summary showing count of counties in each reliability category
# Hint: use count() and mutate() to add percentages

reliability_summary <- county_reliability %>%
  count(hh_income_moe_cat, name = "reliability_count") %>%
  mutate(percentage = reliability_count / sum(reliability_count) * 100)

2.3 High Uncertainty Counties

Your Task: Identify the 5 counties with the highest MOE percentages.

Requirements:

  • Sort by MOE percentage (highest first)
  • Select the top 5 counties
  • Display: county name, median income, margin of error, MOE percentage, reliability category
  • Format as a professional table using kable()

Hint: Use arrange(), slice(), and select() functions.

# Create table of top 5 counties by MOE percentage
highest_moe <- county_reliability %>%
  arrange(desc(hh_income_moe_pct))

top_moe_counties <- slice_head(highest_moe, n=5)

# Format as table with kable() - include appropriate column names and caption
highest_moe_table <- top_moe_counties %>%
  select(
    "County" = NAME,
    "Median Income" = median_household_incomeE,
    "Margin of Error (%)" = hh_income_moe_pct,
    "Reliability" = hh_income_moe_cat
  )

kable(highest_moe_table, caption = "Top 5 Counties in Virginia by Median Household Income MOE", booktabs = TRUE, digits = 2, align = "c")
Top 5 Counties in Virginia by Median Household Income MOE

| County         | Median Income | Margin of Error (%) | Reliability    |
|----------------|--------------:|--------------------:|----------------|
| King and Queen |         70147 |               26.88 | Low Confidence |
| Norton         |         36974 |               25.41 | Low Confidence |
| Amelia         |         63438 |               23.82 | Low Confidence |
| Bath           |         55699 |               21.37 | Low Confidence |
| Lexington      |         93651 |               20.95 | Low Confidence |

Data Quality Commentary:

The counties listed above have high margins of error for their median household income estimates. This can happen for several reasons, including small population size (and therefore small survey samples) and high income variability within the county. Estimates this uncertain could misrepresent these communities and lead to biased decision-making.
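A quick, informal way to check the population-size explanation is to compare MOE percentages across population terciles (a sketch, using the county_reliability data frame created above; the cut points are arbitrary):

# Do smaller counties tend to have larger income MOEs? (rough check)
county_reliability %>%
  mutate(pop_group = ntile(total_populationE, 3)) %>%  # 1 = smallest third
  group_by(pop_group) %>%
  summarize(
    counties = n(),
    median_moe_pct = median(hh_income_moe_pct, na.rm = TRUE)
  )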

Part 3: Neighborhood-Level Analysis

3.1 Focus Area Selection

Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.

Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.
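One hedged way to pick candidates programmatically rather than by familiarity is to take one county from each reliability category, for example the most populous in each (a sketch only; the analysis below instead selects three counties by name):

# One candidate per reliability category (most populous in each)
county_reliability %>%
  group_by(hh_income_moe_cat) %>%
  slice_max(total_populationE, n = 1) %>%
  ungroup() %>%
  select(NAME, total_populationE, hh_income_moe_pct, hh_income_moe_cat)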

# Use filter() to select 2-3 counties from your county_reliability data
# Store the selected counties in a variable called selected_counties
selected_counties <- c("Loudoun", "Essex", "Falls Church")

filtered_counties <- county_reliability %>%
  filter(NAME %in% selected_counties)

# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category

selected_counties_table <- filtered_counties %>%
  select(  
    "County" = NAME,
    "Median Income" = median_household_incomeE,
    "Margin Error (%)" = hh_income_moe_pct,
    "Reliability" = hh_income_moe_cat 
  )

kable(selected_counties_table, caption = "Median Household Income of Selected Counties in Virginia", booktabs = TRUE, digits = 2, align = "c")
Median Household Income of Selected Counties in Virginia

| County       | Median Income | Margin of Error (%) | Reliability         |
|--------------|--------------:|--------------------:|---------------------|
| Essex        |         52335 |               15.29 | Low Confidence      |
| Loudoun      |        170463 |                2.05 | High Confidence     |
| Falls Church |        164536 |                8.01 | Moderate Confidence |

Comment on the output: There appears to be an inverse relationship between MOE and median income: the lowest-income county (Essex) has the least reliable estimate, while the highest-income county (Loudoun) has the most reliable one.
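That impression can be checked informally across all counties rather than just these three (a sketch; ggplot2 is already loaded via the tidyverse):

# Correlation between income estimates and their relative MOE; a negative value
# would suggest lower-income counties tend to have noisier estimates
cor(county_reliability$median_household_incomeE,
    county_reliability$hh_income_moe_pct,
    use = "complete.obs")

# Or visually
ggplot(county_reliability,
       aes(x = median_household_incomeE, y = hh_income_moe_pct)) +
  geom_point() +
  labs(x = "Median household income ($)", y = "Income MOE (%)")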

3.2 Tract-Level Demographics

Your Task: Get demographic data for census tracts in your selected counties.

Requirements:

  • Geography: tract level
  • Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
  • Use the same state and year as before
  • Output format: wide
  • Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints (one approach is sketched below).
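For the challenge above, the three-digit county FIPS codes can be pulled from the GEOIDs already retrieved in Part 2 (a sketch; note that get_acs() generally also accepts county names, which is what the code below passes):

# County GEOIDs are state FIPS (2 digits) + county FIPS (3 digits)
selected_county_codes <- county_reliability %>%
  filter(NAME %in% selected_counties) %>%
  pull(GEOID) %>%
  str_sub(3, 5)

selected_county_codes  # e.g. "107" for Loudoun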

# Define your race/ethnicity variables with descriptive names
race_ethnicity <- c(
  white_alone = "B03002_003",
  black = "B03002_004",
  hispanic_latino = "B03002_012",
  total_pop = "B03002_001"
  )
  
# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter
tract_acs <- get_acs(
  variables = race_ethnicity,
  year = 2022, 
  geography = "tract", 
  state = my_state, 
  survey = "acs5", 
  county = selected_counties,
  cache = TRUE, 
  output = "wide")

# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations
tract_acs <- tract_acs %>%
  mutate(
    white_alone_pct = (white_aloneE / total_popE) * 100,
    black_pct = (blackE / total_popE) * 100,
    hispanic_latino_pct = (hispanic_latinoE / total_popE) * 100
  )

# Add readable tract and county name columns using str_extract() or similar
tract_acs <- tract_acs %>%
  mutate(
    tract = str_extract(NAME, "Tract [^;]+"),
    county = str_extract(NAME, "(?<=; )[^;]+(?=;)") %>%
      str_remove(" city") %>%   # independent cities, e.g. "Falls Church city"
      str_remove(" County")
  )

3.3 Demographic Analysis

Your Task: Analyze the demographic patterns in your selected areas.

# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract
top_hispanic_tract <- tract_acs %>%
  arrange(desc(hispanic_latino_pct)) %>% 
  slice_head(n = 1) 

print(top_hispanic_tract)
# A tibble: 1 × 15
  GEOID       NAME      white_aloneE white_aloneM blackE blackM hispanic_latinoE
  <chr>       <chr>            <dbl>        <dbl>  <dbl>  <dbl>            <dbl>
1 51107611602 Census T…          574          202    196    151             2325
# ℹ 8 more variables: hispanic_latinoM <dbl>, total_popE <dbl>,
#   total_popM <dbl>, white_alone_pct <dbl>, black_pct <dbl>,
#   hispanic_latino_pct <dbl>, tract <chr>, county <chr>
# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group
demographics_summary <- tract_acs %>%
  group_by(county) %>%
  summarize(
    tracts_count = n(),
    avg_white_pct = mean(white_alone_pct, na.rm = TRUE),
    avg_black_pct = mean(black_pct, na.rm = TRUE),
    avg_hispanic_latino_pct = mean(hispanic_latino_pct, na.rm = TRUE)
  )

# Create a nicely formatted table of your results using kable()

demographics_summary_table <- demographics_summary %>%
  select( 
    "County" = county,
    "Number of Tracts" = tracts_count,
    "Average White Only Population (%)" = avg_white_pct,
    "Average Black Population (%)" = avg_black_pct,
    "Average Hispanic/Latino Population (%)" = avg_hispanic_latino_pct,    
  )

kable(demographics_summary_table, caption = "Average Demographics by County", booktabs = TRUE, digits = 2, align = "c")
Average Demographics by County

| County       | Number of Tracts | Average White Only Population (%) | Average Black Population (%) | Average Hispanic/Latino Population (%) |
|--------------|-----------------:|----------------------------------:|-----------------------------:|---------------------------------------:|
| Essex        |                3 |                             53.82 |                        37.84 |                                   4.38 |
| Falls Church |                3 |                             69.04 |                         4.18 |                                  11.34 |
| Loudoun      |               75 |                             54.07 |                         7.28 |                                  14.21 |

Part 4: Comprehensive Data Quality Evaluation

4.1 MOE Analysis for Demographic Variables

Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.

Requirements:

  • Calculate MOE percentages for each demographic variable
  • Flag tracts where any demographic variable has MOE > 15%
  • Create summary statistics

# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)
demographics_reliability <- tract_acs %>%
  mutate(
    white_alone_moe_pct = (white_aloneM / white_aloneE) * 100,
    black_moe_pct = (blackM / blackE) * 100,
    hispanic_latino_moe_pct = (hispanic_latinoM / hispanic_latinoE) * 100,

    # Create a flag for tracts with high MOE on any demographic variable
    # using logical OR (|)
    high_moe_flag = (white_alone_moe_pct > 15) |
                    (black_moe_pct > 15) |
                    (hispanic_latino_moe_pct > 15)
  )

# Create summary statistics showing how many tracts have data quality issues
moe_summary <- demographics_reliability %>%
  summarise(
    total_tracts = n(),                        # total number of tracts
    tracts_high_moe = sum(high_moe_flag),      # number of tracts flagged as high MOE
    pct_high_moe = (tracts_high_moe / total_tracts) * 100  # percentage of tracts with high MOE
  )

moe_summary_table <- moe_summary %>%
  select(  
  "Total Number of Tracts" = total_tracts,
  "Tracts with High MOE" = tracts_high_moe,
  "Tracts with High MOE (%)" = pct_high_moe
  )

kable(moe_summary_table, caption = "Tracts with Data Quality Issues", booktabs = TRUE, digits = 2, align = "c")
Tracts with Data Quality Issues

| Total Number of Tracts | Tracts with High MOE | Tracts with High MOE (%) |
|-----------------------:|---------------------:|-------------------------:|
|                     81 |                   81 |                      100 |

4.2 Pattern Analysis

Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.

# Group tracts by whether they have high MOE issues
# Calculate average characteristics for each group:
# - population size, demographic percentages

moe_tract <- demographics_reliability %>%
  group_by(high_moe_flag) %>%
  summarize(
    avg_pop = mean(total_popE, na.rm = TRUE),
    avg_white_alone = mean(white_aloneE, na.rm = TRUE),
    avg_black = mean(blackE, na.rm = TRUE),
    avg_hispanic = mean(hispanic_latinoE, na.rm = TRUE),
    .groups = "drop"
  )

# Use group_by() and summarize() to create this comparison
# Create a professional table showing the patterns

moe_tract_table <- moe_tract %>%
  select(
    "Total" = avg_pop,
    "White" = avg_white_alone,
    "Black" = avg_black,
    "Hispanic/Latino" = avg_hispanic
  )
kable(moe_tract_table, caption = "Average Population of Tracts with High MOE", booktabs = TRUE, digits = 2, align = "c") # Since all my tracts have high MOE
Average Population of Tracts with High MOE

|   Total |   White |  Black | Hispanic/Latino |
|--------:|--------:|-------:|----------------:|
| 5505.57 | 2955.37 | 441.56 |          744.53 |

Pattern Analysis: Margins of error for income varied across counties, but the demographic estimates had consistently higher margins of error, and the data was particularly unreliable for Black and Hispanic/Latino populations, in some cases exceeding 100%. This is likely due to small group sizes within certain counties, which make estimates less certain. For household income, counties in southern Virginia showed higher margins of error (lower confidence), and these counties also tended to have lower household incomes. By contrast, counties with larger populations generally had more reliable estimates for both household income and demographics.
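A hedged way to quantify the gap between income and demographic reliability is to compare typical MOE percentages for each demographic variable by county (a sketch, using the demographics_reliability data frame above; infinite values from zero estimates are dropped):

# Median relative MOE for each demographic variable, by county
demographics_reliability %>%
  group_by(county) %>%
  summarize(
    med_white_moe_pct    = median(white_alone_moe_pct[is.finite(white_alone_moe_pct)]),
    med_black_moe_pct    = median(black_moe_pct[is.finite(black_moe_pct)]),
    med_hispanic_moe_pct = median(hispanic_latino_moe_pct[is.finite(hispanic_latino_moe_pct)])
  )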

Part 5: Policy Recommendations

5.1 Analysis Integration and Professional Summary

Your Task: Write an executive summary that integrates findings from all four analyses.

Executive Summary Requirements: 1. Overall Pattern Identification: What are the systematic patterns across all your analyses? 2. Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings? 3. Root Cause Analysis: What underlying factors drive both data quality issues and bias risk? 4. Strategic Recommendations: What should the Department implement to address these systematic issues?

Executive Summary:

Across all analyses, a few systematic patterns emerged in the reliability of demographic and socioeconomic data. Low-confidence household income estimates were concentrated in southern Virginia counties that also had lower household incomes. Demographic data consistently showed higher margins of error, particularly for Black and Hispanic/Latino populations, in some cases exceeding 100%. Counties with larger populations tended to have more reliable estimates across both demographic and income indicators, highlighting a structural imbalance in data quality.

From an equity perspective, these disparities mean that communities of Black and Hispanic/Latino residents face a greater risk of algorithmic bias when data is used for policy, funding, or service allocation. When data carries high uncertainty, any algorithm built on it risks incorrectly representing those groups. Similarly, lower-income communities in southern Virginia also face heightened risk, as unreliable income data undermines equitable targeting of economic support or development programs.

The root causes of these issues stem primarily from small sample sizes in survey-based data collection, which reduce reliability for smaller populations. Smaller minority populations, rural communities, and low-income counties are more likely to experience data quality issues and are also more dependent on accurate representation for equitable policy outcomes. These factors create a feedback loop in which the communities most in need of resources have the weakest statistical representation.

To address these systemic challenges, the Department should invest in data collection for small populations, for example by increasing the share of each county's population that is sampled. The Department could also draw on additional data sources or surveys to reduce margins of error. These steps would help the Department mitigate the risk of algorithmic bias while ensuring that vulnerable communities are not disadvantaged by unreliable data. If these changes are not feasible, I would urge the Department to report margins of error alongside any estimates drawn from these data sets.

5.2 Specific Recommendations

Your Task: Create a decision framework for algorithm implementation.

# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category
county_reliability_table <- county_reliability %>%
  select(
    "County" = NAME,
    "Median Income" = median_household_incomeE,
    "MOE (%)" = hh_income_moe_pct,
    "Reliability" = hh_income_moe_cat
  )
# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"  
# - Low Confidence: "Requires manual review or additional data"

county_reliability_table <- county_reliability_table %>%
  mutate(
    "Algorithm Recommendation" = case_when(
      Reliability == "High Confidence"   ~ "Safe for algorithmic decisions",
      Reliability == "Moderate Confidence" ~ "Use with caution - monitor outcomes",
      Reliability == "Low Confidence"    ~ "Requires manual review or additional data",
      TRUE ~ NA_character_  # for any missing/unexpected values
    )
  )
  
# Format as a professional table with kable()
kable(county_reliability_table, caption = "Household Income Reliability by County", booktabs = TRUE, digits = 2, align = "c") 
Household Income Reliability by County
County Median Income MOE (%) Reliability Algorithm Recommendation
Accomack 52694 11.16 Low Confidence Requires manual review or additional data
Albemarle 97708 3.77 High Confidence Safe for algorithmic decisions
Alleghany 52546 7.53 Moderate Confidence Use with caution - monitor outcomes
Amelia 63438 23.82 Low Confidence Requires manual review or additional data
Amherst 64454 7.00 Moderate Confidence Use with caution - monitor outcomes
Appomattox 60041 11.81 Low Confidence Requires manual review or additional data
Arlington 137387 1.98 High Confidence Safe for algorithmic decisions
Augusta 76124 4.17 High Confidence Safe for algorithmic decisions
Bath 55699 21.37 Low Confidence Requires manual review or additional data
Bedford 74773 4.38 High Confidence Safe for algorithmic decisions
Bland 59901 4.73 High Confidence Safe for algorithmic decisions
Botetourt 77680 7.33 Moderate Confidence Use with caution - monitor outcomes
Brunswick 52678 5.92 Moderate Confidence Use with caution - monitor outcomes
Buchanan 39591 7.40 Moderate Confidence Use with caution - monitor outcomes
Buckingham 59894 14.55 Low Confidence Requires manual review or additional data
Campbell 59022 6.30 Moderate Confidence Use with caution - monitor outcomes
Caroline 83562 8.56 Moderate Confidence Use with caution - monitor outcomes
Carroll 49113 9.77 Moderate Confidence Use with caution - monitor outcomes
Charles City 65573 5.26 Moderate Confidence Use with caution - monitor outcomes
Charlotte 51548 17.84 Low Confidence Requires manual review or additional data
Chesterfield 95757 2.22 High Confidence Safe for algorithmic decisions
Clarke 107475 14.79 Low Confidence Requires manual review or additional data
Craig 66286 12.27 Low Confidence Requires manual review or additional data
Culpeper 92359 4.65 High Confidence Safe for algorithmic decisions
Cumberland 56497 14.25 Low Confidence Requires manual review or additional data
Dickenson 40143 7.46 Moderate Confidence Use with caution - monitor outcomes
Dinwiddie 77225 9.55 Moderate Confidence Use with caution - monitor outcomes
Essex 52335 15.29 Low Confidence Requires manual review or additional data
Fairfax 145165 1.16 High Confidence Safe for algorithmic decisions
Fauquier 122785 5.40 Moderate Confidence Use with caution - monitor outcomes
Floyd 57146 10.90 Low Confidence Requires manual review or additional data
Fluvanna 90766 6.05 Moderate Confidence Use with caution - monitor outcomes
Franklin 66275 5.74 Moderate Confidence Use with caution - monitor outcomes
Frederick 92443 4.99 High Confidence Safe for algorithmic decisions
Giles 61987 6.19 Moderate Confidence Use with caution - monitor outcomes
Gloucester 83750 5.24 Moderate Confidence Use with caution - monitor outcomes
Goochland 105600 8.09 Moderate Confidence Use with caution - monitor outcomes
Grayson 43348 10.54 Low Confidence Requires manual review or additional data
Greene 81338 6.64 Moderate Confidence Use with caution - monitor outcomes
Greensville 51823 13.84 Low Confidence Requires manual review or additional data
Halifax 49145 5.18 Moderate Confidence Use with caution - monitor outcomes
Hanover 104678 3.36 High Confidence Safe for algorithmic decisions
Henrico 82424 2.10 High Confidence Safe for algorithmic decisions
Henry 43694 5.98 Moderate Confidence Use with caution - monitor outcomes
Highland 57070 14.45 Low Confidence Requires manual review or additional data
Isle of Wight 91680 4.09 High Confidence Safe for algorithmic decisions
James City 100711 4.64 High Confidence Safe for algorithmic decisions
King and Queen 70147 26.88 Low Confidence Requires manual review or additional data
King George 103264 8.14 Moderate Confidence Use with caution - monitor outcomes
King William 79398 11.86 Low Confidence Requires manual review or additional data
Lancaster 62674 7.10 Moderate Confidence Use with caution - monitor outcomes
Lee 41619 12.35 Low Confidence Requires manual review or additional data
Loudoun 170463 2.05 High Confidence Safe for algorithmic decisions
Louisa 76594 9.58 Moderate Confidence Use with caution - monitor outcomes
Lunenburg 54438 14.45 Low Confidence Requires manual review or additional data
Madison 74586 8.68 Moderate Confidence Use with caution - monitor outcomes
Mathews 79054 18.95 Low Confidence Requires manual review or additional data
Mecklenburg 51265 8.19 Moderate Confidence Use with caution - monitor outcomes
Middlesex 69389 9.39 Moderate Confidence Use with caution - monitor outcomes
Montgomery 65270 5.45 Moderate Confidence Use with caution - monitor outcomes
Nelson 64028 17.43 Low Confidence Requires manual review or additional data
New Kent 113120 6.15 Moderate Confidence Use with caution - monitor outcomes
Northampton 54693 13.07 Low Confidence Requires manual review or additional data
Northumberland 64655 14.33 Low Confidence Requires manual review or additional data
Nottoway 62366 13.91 Low Confidence Requires manual review or additional data
Orange 87309 9.40 Moderate Confidence Use with caution - monitor outcomes
Page 56760 7.45 Moderate Confidence Use with caution - monitor outcomes
Patrick 49180 10.87 Low Confidence Requires manual review or additional data
Pittsylvania 52619 5.76 Moderate Confidence Use with caution - monitor outcomes
Powhatan 108089 4.56 High Confidence Safe for algorithmic decisions
Prince Edward 57304 6.61 Moderate Confidence Use with caution - monitor outcomes
Prince George 80318 6.31 Moderate Confidence Use with caution - monitor outcomes
Prince William 123193 2.19 High Confidence Safe for algorithmic decisions
Pulaski 59740 6.43 Moderate Confidence Use with caution - monitor outcomes
Rappahannock 98663 11.29 Low Confidence Requires manual review or additional data
Richmond 62708 18.57 Low Confidence Requires manual review or additional data
Roanoke 80872 2.34 High Confidence Safe for algorithmic decisions
Rockbridge 61903 5.71 Moderate Confidence Use with caution - monitor outcomes
Rockingham 73232 3.18 High Confidence Safe for algorithmic decisions
Russell 44088 8.87 Moderate Confidence Use with caution - monitor outcomes
Scott 44535 6.38 Moderate Confidence Use with caution - monitor outcomes
Shenandoah 62149 6.68 Moderate Confidence Use with caution - monitor outcomes
Smyth 45061 7.21 Moderate Confidence Use with caution - monitor outcomes
Southampton 67813 10.69 Low Confidence Requires manual review or additional data
Spotsylvania 105068 4.52 High Confidence Safe for algorithmic decisions
Stafford 128036 3.18 High Confidence Safe for algorithmic decisions
Surry 68655 10.95 Low Confidence Requires manual review or additional data
Sussex 59195 11.64 Low Confidence Requires manual review or additional data
Tazewell 46508 7.10 Moderate Confidence Use with caution - monitor outcomes
Warren 79313 8.86 Moderate Confidence Use with caution - monitor outcomes
Washington 59116 4.31 High Confidence Safe for algorithmic decisions
Westmoreland 56647 12.26 Low Confidence Requires manual review or additional data
Wise 47541 6.43 Moderate Confidence Use with caution - monitor outcomes
Wythe 53921 7.47 Moderate Confidence Use with caution - monitor outcomes
York 105154 3.38 High Confidence Safe for algorithmic decisions
Alexandria 113179 2.24 High Confidence Safe for algorithmic decisions
Bristol 45250 6.95 Moderate Confidence Use with caution - monitor outcomes
Buena Vista 48783 14.29 Low Confidence Requires manual review or additional data
Charlottesville 67177 7.63 Moderate Confidence Use with caution - monitor outcomes
Chesapeake 92703 2.37 High Confidence Safe for algorithmic decisions
Colonial Heights 72216 7.53 Moderate Confidence Use with caution - monitor outcomes
Covington 45737 15.38 Low Confidence Requires manual review or additional data
Danville 41484 8.28 Moderate Confidence Use with caution - monitor outcomes
Emporia 41442 14.39 Low Confidence Requires manual review or additional data
Fairfax 128708 9.39 Moderate Confidence Use with caution - monitor outcomes
Falls Church 164536 8.01 Moderate Confidence Use with caution - monitor outcomes
Franklin 57537 9.03 Moderate Confidence Use with caution - monitor outcomes
Fredericksburg 83445 6.36 Moderate Confidence Use with caution - monitor outcomes
Galax 44612 20.94 Low Confidence Requires manual review or additional data
Hampton 64430 3.09 High Confidence Safe for algorithmic decisions
Harrisonburg 56050 3.21 High Confidence Safe for algorithmic decisions
Hopewell 50661 10.43 Low Confidence Requires manual review or additional data
Lexington 93651 20.95 Low Confidence Requires manual review or additional data
Lynchburg 56243 4.99 High Confidence Safe for algorithmic decisions
Manassas 110559 7.71 Moderate Confidence Use with caution - monitor outcomes
Manassas Park 91673 16.15 Low Confidence Requires manual review or additional data
Martinsville 39127 9.27 Moderate Confidence Use with caution - monitor outcomes
Newport News 63355 3.31 High Confidence Safe for algorithmic decisions
Norfolk 60998 3.08 High Confidence Safe for algorithmic decisions
Norton 36974 25.41 Low Confidence Requires manual review or additional data
Petersburg 46930 6.60 Moderate Confidence Use with caution - monitor outcomes
Poquoson 114503 7.31 Moderate Confidence Use with caution - monitor outcomes
Portsmouth 57154 4.32 High Confidence Safe for algorithmic decisions
Radford 51039 14.27 Low Confidence Requires manual review or additional data
Richmond 59606 3.96 High Confidence Safe for algorithmic decisions
Roanoke 51523 4.47 High Confidence Safe for algorithmic decisions
Salem 68402 6.98 Moderate Confidence Use with caution - monitor outcomes
Staunton 59731 8.95 Moderate Confidence Use with caution - monitor outcomes
Suffolk 87758 4.22 High Confidence Safe for algorithmic decisions
Virginia Beach 87544 2.24 High Confidence Safe for algorithmic decisions
Waynesboro 52519 8.79 Moderate Confidence Use with caution - monitor outcomes
Williamsburg 66815 7.32 Moderate Confidence Use with caution - monitor outcomes
Winchester 62495 6.93 Moderate Confidence Use with caution - monitor outcomes

Key Recommendations:

Your Task: Use your analysis results to provide specific guidance to the department.

  1. Counties suitable for immediate algorithmic implementation: Albemarle, Arlington, Augusta, Bedford, Bland, Chesterfield, Culpeper, Fairfax, Frederick, Hanover, Henrico, Isle of Wight, James City, Loudoun, Powhatan, Prince William, Roanoke, Rockingham, Spotsylvania, Stafford, Washington, York, Alexandria, Chesapeake, Hampton, Harrisonburg, Lynchburg, Newport News, Norfolk, Portsmouth, Richmond, Roanoke, Suffolk, Virginia Beach.

The high confidence counties have data with low margins of error. These counties are considered “safe” because their estimates are statistically sound, enabling reliable automated decision-making with minimal risk of error or bias.

  2. Counties requiring additional oversight: Alleghany, Amherst, Botetourt, Brunswick, Buchanan, Campbell, Caroline, Carroll, Charles City, Dickenson, Dinwiddie, Fauquier, Fluvanna, Franklin, Giles, Gloucester, Goochland, Greene, Halifax, Henry, King George, Lancaster, Louisa, Madison, Mecklenburg, Middlesex, Montgomery, New Kent, Orange, Page, Pittsylvania, Prince Edward, Prince George, Pulaski, Rockbridge, Russell, Scott, Shenandoah, Smyth, Tazewell, Warren, Wise, Wythe, Bristol, Charlottesville, Colonial Heights, Danville, Fairfax, Falls Church, Franklin, Fredericksburg, Manassas, Martinsville, Petersburg, Poquoson, Salem, Staunton, Waynesboro, Williamsburg, Winchester.

Moderate confidence county data should be used cautiously. The data is generally usable but carries enough uncertainty that algorithmic decisions could occasionally be misleading. Though these counties do not need to be manually reviewed every time, it is important to monitor any abnormalities in the data.

  3. Counties needing alternative approaches: Accomack, Amelia, Appomattox, Bath, Buckingham, Charlotte, Clarke, Craig, Cumberland, Essex, Floyd, Grayson, Greensville, Highland, King and Queen, King William, Lee, Lunenburg, Mathews, Nelson, Northampton, Northumberland, Nottoway, Patrick, Rappahannock, Richmond, Southampton, Surry, Sussex, Westmoreland, Buena Vista, Covington, Emporia, Galax, Hopewell, Lexington, Manassas Park, Norton, Radford.

These counties have high margins of error, meaning their estimates are based on smaller samples and are more variable. Decisions based on this data should be reviewed manually before implementation.

Questions for Further Investigation

  1. Are there regional clusters of counties with consistently high margins of error, and how do these clusters relate to population size, demographics, rurality, or economic conditions?
  2. How do margins of error and confidence levels change over time? (One possible starting point is sketched below.)
  3. Do certain demographic characteristics such as race and age contribute to higher MOE or lower confidence in algorithmic outputs?
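For question 2, one possible starting point is to pull the same income variable from several ACS 5-year releases and track the MOE percentage over time (a sketch only; year coverage and variable codes should be re-checked before use):

# Income MOE percentage for Virginia counties across several ACS 5-year releases
years <- 2018:2022
moe_over_time <- map_dfr(years, function(yr) {
  get_acs(
    geography = "county",
    state = my_state,
    variables = c(median_household_income = "B19013_001"),
    year = yr,
    survey = "acs5",
    output = "wide"
  ) %>%
    mutate(
      year = yr,
      hh_income_moe_pct = (median_household_incomeM / median_household_incomeE) * 100
    )
})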

Technical Notes

Data Sources:

  • U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
  • Retrieved via the tidycensus R package on 09/28/2025

Reproducibility:

  • All analysis conducted in R (RStudio 2025.05.1+513)
  • Census API key required for replication
  • Complete code and documentation available at: https://musa-5080-fall-2025.github.io/portfolio-setup-jenniferluu6/
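A lightweight way to record the exact R and package versions behind these results (a minimal sketch):

# Capture session details for reproducibility
sessionInfo()
packageVersion("tidycensus")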

Methodology Notes: In processing the data, I selected Virginia as the focus of analysis. During cleaning, I had to make adjustments to county names by removing the word “City” in certain cases to ensure consistency across data sets. This additional step suggests that similar cleaning may be required if applying the same methods to other states. For county selection, I chose locations I was already familiar with, which provided helpful context for interpreting the results. However, this choice may limit reproducibility since another researcher might select different counties and arrive at slightly different insights.

Limitations: Small sample sizes affected the reliability of estimates, particularly for smaller demographic groups such as Black and Hispanic/Latino populations, where margins of error sometimes exceeded 100%. In addition, county selection was influenced by personal familiarity, which, while helpful for interpretation, may introduce bias and reduce reproducibility.


Submission Checklist

Before submitting your portfolio link on Canvas:

Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/your_file_name.html