Assignment 1: Census Data Quality for Policy Decisions

Evaluating Data Reliability for Algorithmic Decision-Making

Author

Jun Luu

Published

September 29, 2025

Assignment Overview

Scenario

You are a data analyst for the Virginia Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.

Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.

Learning Objectives

Apply dplyr functions to real census data for policy analysis
Evaluate data quality using margins of error
Connect technical analysis to algorithmic decision-making
Identify potential equity implications of data reliability issues
Create professional documentation for policy stakeholders

Submission Instructions

Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/

Make sure to update your _quarto.yml navigation to include this assignment under an “Assignments” menu.

Part 1: Portfolio Integration

Create this assignment in your portfolio repository under an assignments/assignment_1/ folder structure. Update your navigation menu to include:

- text: Assignments
  menu:
    - href: assignments/assignment_1/your_file_name.qmd
      text: "Assignment 1: Census Data Exploration"

If the text contains a special character such as a comma, wrap it in double quotes so that Quarto reads it as text.

Setup

# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
library(tidyverse)
library(tidycensus)
library(knitr)

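# Set your Census API key
# Read the key from an environment variable instead of hard-coding it;
# Sys.getenv() expects the variable's name, not the key itself
census_api_key(Sys.getenv("CENSUS_API_KEY"))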

# Choose your state for analysis - assign it to a variable called my_state
my_state <- "VA"
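One way to keep the key out of the rendered document is to store it in an environment variable and fail early when it is missing. The sketch below assumes the variable is named CENSUS_API_KEY, which is the name tidycensus looks up; get_census_key() is a hypothetical helper written for illustration, not a tidycensus function.

```r
# Hypothetical helper: read the Census API key from an environment variable
# and stop with a clear message if it has not been set.
get_census_key <- function(var = "CENSUS_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set; add it to ~/.Renviron.")
  }
  key
}

# Usage (requires the variable to be set first):
# tidycensus::census_api_key(get_census_key())
```

Storing the key in ~/.Renviron keeps it out of both the .qmd source and the published HTML.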

State Selection: I have chosen Virginia for this analysis because I am interested in my home state!

Part 2: County-Level Resource Assessment

2.1 Data Retrieval

Your Task: Use get_acs() to retrieve county-level data for your chosen state.

Requirements:
- Geography: county level
- Variables: median household income (B19013_001) and total population (B01003_001)
- Year: 2022
- Survey: acs5
- Output format: wide

Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.

# Write your get_acs() code here
county_acs <- get_acs(
  variables = c(
    median_household_income = "B19013_001", 
    total_population = "B01003_001"),
  year = 2022, 
  geography = "county", 
  state = my_state, 
  survey = "acs5", 
  cache = TRUE, 
  output = "wide")

# Clean the county names to remove state name and "County" 
# Hint: use mutate() with str_remove()
county_clean <- county_acs %>%
  mutate(
    NAME = NAME %>%
      str_remove(" County,") %>%
      str_remove(" Virginia") %>%
      str_remove(" city,") # Virginia's independent cities are labeled "city" rather than "County"
  )

# Display the first few rows
head(county_clean)

# A tibble: 6 × 6
  GEOID NAME     median_household_inc…¹ median_household_inc…² total_populationE
  <chr> <chr>                     <dbl>                  <dbl>             <dbl>
1 51001 Accomack                  52694                   5883             33367
2 51003 Albemar…                  97708                   3686            112513
3 51005 Allegha…                  52546                   3958             15159
4 51007 Amelia                    63438                  15114             13309
5 51009 Amherst                   64454                   4514             31426
6 51011 Appomat…                  60041                   7091             16253
# ℹ abbreviated names: ¹median_household_incomeE, ²median_household_incomeM
# ℹ 1 more variable: total_populationM <dbl>
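A side effect of stripping both suffixes is that a county and an independent city with the same name become indistinguishable (Fairfax, Franklin, Richmond, and Roanoke each appear twice in Virginia). The sketch below shows one possible alternative that keeps a "(city)" marker. It uses base R so it runs without the tidyverse, and the three NAME strings are a small illustrative subset.

```r
# NAME values in the format returned for county-level ACS data
raw_names <- c("Fairfax County, Virginia",
               "Fairfax city, Virginia",
               "Accomack County, Virginia")

# Stripping both suffixes collapses the two Fairfax entries into one label
stripped <- sub(" (County|city), Virginia$", "", raw_names)

# Alternative: drop "County" but keep a marker for independent cities
labelled <- sub(" County, Virginia$", "", raw_names)
labelled <- sub(" city, Virginia$", " (city)", labelled)

print(stripped)
print(labelled)
```

Joining or filtering on GEOID rather than NAME avoids the ambiguity entirely.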

2.2 Data Quality Assessment

Your Task: Calculate margin of error percentages and create reliability categories.

Requirements:
- Calculate MOE percentage: (margin of error / estimate) * 100
- Create reliability categories:
  - High Confidence: MOE < 5%
  - Moderate Confidence: MOE 5-10%
  - Low Confidence: MOE > 10%
- Create a flag for unreliable estimates (MOE > 10%)

Hint: Use mutate() with case_when() for the categories.

# Calculate MOE percentage and reliability categories using mutate()
county_reliability <- county_clean %>%
  mutate(
    hh_income_moe_pct = (median_household_incomeM / median_household_incomeE) * 100,
    hh_income_moe_cat = case_when(
      hh_income_moe_pct < 5 ~ "High Confidence",
      hh_income_moe_pct < 10 ~ "Moderate Confidence",
      TRUE ~ "Low Confidence"
    ),
    # Flag unreliable estimates (MOE above 10%)
    hh_income_unreliable = hh_income_moe_pct > 10
  )

# Create a summary showing count of counties in each reliability category
# Hint: use count() and mutate() to add percentages

reliability_summary <- county_reliability %>%
  count(hh_income_moe_cat, name = "reliability_count") %>%
  mutate(percentage = reliability_count / sum(reliability_count) * 100)
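As a check on the formula, the MOE percentage and category can be worked by hand for the six counties printed in section 2.1. This is a base R sketch using the estimates and margins from that output; cut() with right = FALSE stands in for case_when() and is assumed to reproduce the same thresholds (below 5, below 10, otherwise low).

```r
# Estimates and margins of error copied from the head(county_clean) output
county   <- c("Accomack", "Albemarle", "Alleghany", "Amelia", "Amherst", "Appomattox")
estimate <- c(52694, 97708, 52546, 63438, 64454, 60041)
moe      <- c(5883, 3686, 3958, 15114, 4514, 7091)

# MOE as a percentage of the estimate
moe_pct <- moe / estimate * 100

# Intervals are [0, 5), [5, 10), [10, Inf), matching the case_when() order
category <- cut(
  moe_pct,
  breaks = c(-Inf, 5, 10, Inf),
  right  = FALSE,
  labels = c("High Confidence", "Moderate Confidence", "Low Confidence")
)

print(data.frame(county, moe_pct = round(moe_pct, 2), category))

# Count and percentage of counties per category (for these six rows only)
print(table(category))
print(round(prop.table(table(category)) * 100, 1))
```

For example, Accomack's margin of 5883 on an estimate of 52694 works out to about 11.16%, which places it in the Low Confidence category.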

2.3 High Uncertainty Counties

Your Task: Identify the 5 counties with the highest MOE percentages.

Requirements:
- Sort by MOE percentage (highest first)
- Select the top 5 counties
- Display: county name, median income, margin of error, MOE percentage, reliability category
- Format as a professional table using kable()

Hint: Use arrange(), slice(), and select() functions.

# Create table of top 5 counties by MOE percentage
highest_moe <- county_reliability %>%
  arrange(desc(hh_income_moe_pct))

top_moe_counties <- slice_head(highest_moe, n=5)

# Format as table with kable() - include appropriate column names and caption
highest_moe_table <- top_moe_counties %>%
  select(  
    "County" = NAME,
    "Median Income" = median_household_incomeE,
    "Margin Error (%)" = hh_income_moe_pct,
    "Reliability" = hh_income_moe_cat 
  )

kable(highest_moe_table, caption = "Top 5 Counties in Virginia by Median Household Income MOE", booktabs = TRUE, digits = 2, align = "c")

Top 5 Counties in Virginia by Median Household Income MOE
County	Median Income	Margin Error (%)	Reliability
King and Queen	70147	26.88	Low Confidence
Norton	36974	25.41	Low Confidence
Amelia	63438	23.82	Low Confidence
Bath	55699	21.37	Low Confidence
Lexington	93651	20.95	Low Confidence

Data Quality Commentary:

The counties listed above have high margins of error for median household income. This can happen for several reasons, including small population size and high income variability. Estimates this uncertain could misrepresent the population and lead to biased decision-making.
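To make the uncertainty concrete, a margin of error can be turned into an interval around the estimate. ACS margins of error are published at the 90 percent confidence level, so estimate ± MOE gives a 90 percent confidence interval. The sketch below uses Amelia's estimate and margin from the section 2.1 output.

```r
# Amelia County, from the head(county_clean) output
estimate <- 63438   # median household income estimate
moe      <- 15114   # margin of error at the 90% confidence level

# 90% confidence interval: estimate minus/plus the margin of error
ci <- c(lower = estimate - moe, upper = estimate + moe)
print(ci)
```

An interval running from roughly $48,000 to $79,000 spans very different funding priorities, which is why a single point estimate is risky as an algorithm input for this county.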

Part 3: Neighborhood-Level Analysis

3.1 Focus Area Selection

Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.

Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.

# Use filter() to select 2-3 counties from your county_reliability data
# Store the selected counties in a variable called selected_counties
selected_counties <- c("Loudoun", "Essex", "Falls Church")

filtered_counties <- county_reliability %>%
  filter(NAME %in% selected_counties)

# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category

selected_counties_table <- filtered_counties %>%
  select(  
    "County" = NAME,
    "Median Income" = median_household_incomeE,
    "Margin Error (%)" = hh_income_moe_pct,
    "Reliability" = hh_income_moe_cat 
  )

kable(selected_counties_table, caption = "Median Household Income of Selected Counties Virginia", booktabs = TRUE, digits = 2, align = "c")

Median Household Income of Selected Counties Virginia
County	Median Income	Margin Error (%)	Reliability
Essex	52335	15.29	Low Confidence
Loudoun	170463	2.05	High Confidence
Falls Church	164536	8.01	Moderate Confidence

Comment on the output: In this selection, the county with the lowest median income (Essex) also has the highest MOE percentage, which suggests there may be a relationship between MOE and median income.

3.2 Tract-Level Demographics

Your Task: Get demographic data for census tracts in your selected counties.

Requirements:
- Geography: tract level
- Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
- Use the same state and year as before
- Output format: wide
- Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.
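The GEOID pattern mentioned in the challenge can be sketched as follows: a county GEOID is the 2-digit state FIPS code followed by the 3-digit county code, and a tract GEOID appends a 6-digit tract code. The county GEOIDs below are the first three rows of the county output; the tract GEOID is one that appears in this analysis.

```r
# County GEOIDs: 2-digit state code + 3-digit county code
county_geoid <- c(Accomack = "51001", Albemarle = "51003", Alleghany = "51005")
county_code  <- substr(county_geoid, 3, 5)
print(county_code)

# Tract GEOID: state (2) + county (3) + tract (6)
tract_geoid <- "51107611602"
parts <- c(
  state  = substr(tract_geoid, 1, 2),
  county = substr(tract_geoid, 3, 5),
  tract  = substr(tract_geoid, 6, 11)
)
print(parts)
```

The 3-digit codes extracted this way can be passed to the county argument of get_acs().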

# Define your race/ethnicity variables with descriptive names
race_ethnicity <- c(
  white_alone = "B03002_003",
  black = "B03002_004",
  hispanic_latino = "B03002_012",
  total_pop = "B03002_001"
  )
  
# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter
tract_acs <- get_acs(
  variables = race_ethnicity,
  year = 2022, 
  geography = "tract", 
  state = my_state, 
  survey = "acs5", 
  county = selected_counties,
  cache = TRUE, 
  output = "wide")

# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations
tract_acs <- tract_acs %>%
  mutate(
    white_alone_pct = (white_aloneE / total_popE) * 100,
    black_pct = (blackE / total_popE) * 100,
    hispanic_latino_pct = (hispanic_latinoE / total_popE) * 100
  )

# Add readable tract and county name columns using str_extract() or similar
tract_acs <- tract_acs %>%
  mutate(
  tract = str_extract(NAME, "Tract [^;]+"),
  county = str_extract(NAME, "(?<=; )[^;]+(?=;)") %>%
    str_remove(" city") %>%   # include the leading space so no trailing blank is left behind
    str_remove(" County")
  )
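The extraction above relies on tract names being split into three parts by semicolons. The sketch below shows the same idea in base R on two illustrative NAME strings; the strings are assumptions written in the semicolon-separated format the regular expressions expect, not values copied from the data.

```r
# Illustrative tract NAME values (assumed format: tract; county; state)
tract_name <- c("Census Tract 6116.02; Loudoun County; Virginia",
                "Census Tract 5001; Falls Church city; Virginia")

# Split each name on "; " into its three components
parts <- strsplit(tract_name, "; ", fixed = TRUE)

# First component, without the leading "Census "
tract <- sub("^Census ", "", vapply(parts, `[`, character(1), 1))

# Second component, without the trailing " County" or " city"
county <- sub(" (County|city)$", "", vapply(parts, `[`, character(1), 2))

print(data.frame(tract, county))
```

Anchoring the pattern with $ removes the suffix only at the end of the string, so a place name that happens to contain "city" elsewhere is left alone.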

3.3 Demographic Analysis

Your Task: Analyze the demographic patterns in your selected areas.

# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract
top_hispanic_tract <- tract_acs %>%
  arrange(desc(hispanic_latino_pct)) %>% 
  slice_head(n = 1) 

print(top_hispanic_tract)

# A tibble: 1 × 15
  GEOID       NAME      white_aloneE white_aloneM blackE blackM hispanic_latinoE
  <chr>       <chr>            <dbl>        <dbl>  <dbl>  <dbl>            <dbl>
1 51107611602 Census T…          574          202    196    151             2325
# ℹ 8 more variables: hispanic_latinoM <dbl>, total_popE <dbl>,
#   total_popM <dbl>, white_alone_pct <dbl>, black_pct <dbl>,
#   hispanic_latino_pct <dbl>, tract <chr>, county <chr>

# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group
demographics_summary <- tract_acs %>%
  group_by(county) %>%
  summarize(
    tracts_count = n(),
    avg_white_pct = mean(white_alone_pct, na.rm = TRUE),
    avg_black_pct = mean(black_pct, na.rm = TRUE),
    avg_hispanic_latino_pct = mean(hispanic_latino_pct, na.rm = TRUE)
  )

# Create a nicely formatted table of your results using kable()

demographics_summary_table <- demographics_summary %>%
  select( 
    "County" = county,
    "Number of Tracts" = tracts_count,
    "Average White Only Population (%)" = avg_white_pct,
    "Average Black Population (%)" = avg_black_pct,
    "Average Hispanic/Latino Population (%)" = avg_hispanic_latino_pct,    
  )

kable(demographics_summary_table, caption = "Average Demographics by County", booktabs = TRUE, digits = 2, align = "c")

Average Demographics by County
County	Number of Tracts	Average White Only Population (%)	Average Black Population (%)	Average Hispanic/Latino Population (%)
Essex	3	53.82	37.84	4.38
Falls Church	3	69.04	4.18	11.34
Loudoun	75	54.07	7.28	14.21
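Note that these county figures are unweighted means of tract percentages, so a small tract counts as much as a large one. The sketch below, using two hypothetical tracts with made-up counts, shows how that can differ from the pooled percentage computed from summed counts.

```r
# Hypothetical tracts (illustrative numbers, not from the ACS pull)
hispanic_latino <- c(50, 900)     # group population per tract
total_pop       <- c(1000, 3000)  # total population per tract

# Unweighted mean of tract percentages (what mean() of the pct column gives)
mean_of_pcts <- mean(hispanic_latino / total_pop * 100)

# Pooled percentage: sum the counts first, then divide
pooled_pct <- sum(hispanic_latino) / sum(total_pop) * 100

print(c(mean_of_pcts = mean_of_pcts, pooled_pct = pooled_pct))
```

Either figure can be appropriate; the choice depends on whether the question is about the typical tract or about the county's residents as a whole.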

Part 4: Comprehensive Data Quality Evaluation

4.1 MOE Analysis for Demographic Variables

Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.

Requirements:
- Calculate MOE percentages for each demographic variable
- Flag tracts where any demographic variable has MOE > 15%
- Create summary statistics

# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)
demographics_reliability <- tract_acs %>%
  mutate(
    white_alone_moe_pct = (white_aloneM / white_aloneE) * 100,
    black_moe_pct = (blackM / blackE) * 100,
    hispanic_latino_moe_pct = (hispanic_latinoM / hispanic_latinoE) * 100,
    
# Create a flag for tracts with high MOE on any demographic variable
# Use logical operators (| for OR) in an ifelse() statement
        high_moe_flag = (white_alone_moe_pct > 15) | 
                        (black_moe_pct > 15) | 
                        (hispanic_latino_moe_pct > 15)
  )

# Create summary statistics showing how many tracts have data quality issues
moe_summary <- demographics_reliability %>%
  summarise(
    total_tracts = n(),                        # total number of tracts
    tracts_high_moe = sum(high_moe_flag),      # number of tracts flagged as high MOE
    pct_high_moe = (tracts_high_moe / total_tracts) * 100  # percentage of tracts with high MOE
  )

moe_summary_table <- moe_summary %>%
  select(  
  "Total Number of Tracts" = total_tracts,
  "Tracts with High MOE" = tracts_high_moe,
  "Tracts with High MOE (%)" = pct_high_moe
  )

kable(moe_summary_table, caption = "Tracts with Data Quality Issues", booktabs = TRUE, digits = 2, align = "c")

Tracts with Data Quality Issues
Total Number of Tracts	Tracts with High MOE	Tracts with High MOE (%)
81	81	100
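A worked example helps explain why every tract is flagged. The sketch below applies the same formula to the tract printed in section 3.3, using its white alone and Black estimates and margins, and then shows what happens when an estimate is zero (the margin of 12 in that last step is a hypothetical value).

```r
# Estimates and margins for tract 51107611602, from the printed output
est <- c(white_alone = 574, black = 196)
moe <- c(white_alone = 202, black = 151)

# MOE as a percentage of the estimate for each group
moe_pct <- moe / est * 100
print(round(moe_pct, 2))

# The tract is flagged if any group exceeds the 15% threshold
high_moe_flag <- any(moe_pct > 15)
print(high_moe_flag)

# An estimate of zero with a nonzero margin (hypothetical margin of 12)
# gives an infinite MOE percentage, which also exceeds the threshold
zero_case <- 12 / 0 * 100
print(zero_case > 15)
```

With subgroup counts in the hundreds and margins of similar size, a 15% threshold is hard to meet at the tract level, so a 100% flag rate says as much about the threshold as about these particular counties.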

4.2 Pattern Analysis

Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.

# Group tracts by whether they have high MOE issues
# Calculate average characteristics for each group:
# - population size, demographic percentages

moe_tract <- demographics_reliability %>%
  group_by(high_moe_flag) %>%
  summarize(
    avg_pop = mean(total_popE, na.rm = TRUE),
    avg_white_alone = mean(white_aloneE, na.rm = TRUE),
    avg_black = mean(blackE, na.rm = TRUE),
    avg_hispanic = mean(hispanic_latinoE, na.rm = TRUE),
    .groups = "drop"
  )

# Use group_by() and summarize() to create this comparison
# Create a professional table showing the patterns

moe_tract_table <- moe_tract %>%
  select(
    "Total" = avg_pop,
    "White" = avg_white_alone,
    "Black" = avg_black,
    "Hispanic/Latino" = avg_hispanic
  )
kable(moe_tract_table, caption = "Average Population of Tracts with High MOE", booktabs = TRUE, digits = 2, align = "c") # Since all my tracts have high MOE

Average Population of Tracts with High MOE
Total	White	Black	Hispanic/Latino
5505.57	2955.37	441.56	744.53

Pattern Analysis: I found that margins of error for income varied from county to county, while the tract-level demographic data had consistently higher margins of error. Because all 81 tracts were flagged, the grouped comparison above produces a single group rather than a contrast between high-MOE and low-MOE tracts. The data was particularly unreliable for Black and Hispanic/Latino populations, with margins of error in some cases exceeding 100% of the estimate. This is likely due to small population sizes within certain counties, which makes estimates less certain. For income levels, counties in southern Virginia showed lower confidence, and these counties also tended to have lower household incomes. By contrast, counties with larger population sizes generally had higher reliability, both in household income estimates and demographic data.

Part 5: Policy Recommendations

5.1 Analysis Integration and Professional Summary

Your Task: Write an executive summary that integrates findings from all four analyses.

Executive Summary Requirements:
1. Overall Pattern Identification: What are the systematic patterns across all your analyses?
2. Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings?
3. Root Cause Analysis: What underlying factors drive both data quality issues and bias risk?
4. Strategic Recommendations: What should the Department implement to address these systematic issues?

Executive Summary:

Across all analyses, a few systematic patterns emerge in the reliability of demographic and socioeconomic data. Low-confidence household income estimates were concentrated in southern Virginia counties that also had lower household incomes. Demographic data consistently showed higher margins of error, particularly for Black and Hispanic/Latino populations, in some cases exceeding 100% of the estimate. Counties with larger population sizes tended to have more reliable estimates across both demographic and income indicators, highlighting a structural imbalance in data quality.

From an equity perspective, these disparities mean that communities of Black and Hispanic/Latino residents face a greater risk of algorithmic bias when data is used for policy, funding, or service allocation. When data carries high uncertainty, any algorithm built on it risks incorrectly representing those groups. Similarly, lower-income communities in southern Virginia also face heightened risk, as unreliable income data undermines equitable targeting of economic support or development programs.

The root causes of these issues stem primarily from small sample sizes in survey-based data collection, which reduce reliability for smaller populations. Smaller minority populations, rural communities, and low-income counties are more likely to experience data quality issues and are also more dependent on accurate representation for equitable policy outcomes. These factors create a feedback loop in which the communities most in need of resources have the weakest statistical representation.

To address these systemic challenges, the Department should invest in data collection for small populations by increasing the share of each county's population that is sampled. The Department could also draw on other surveys or data sources in order to decrease margins of error. These options could help the Department mitigate risks of algorithmic bias while ensuring that vulnerable communities are not disadvantaged by unreliable data. If it is unable to make these changes, I would urge it to report margins of error alongside any estimates it uses from these data sets.

5.2 Specific Recommendations

Your Task: Create a decision framework for algorithm implementation.

# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category
county_reliability_table <- county_reliability %>%
  select(
    "County" = NAME,
    "Median Income" = median_household_incomeE,
    "MOE (%)" = hh_income_moe_pct,
    "Reliability" = hh_income_moe_cat
  )
# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"  
# - Low Confidence: "Requires manual review or additional data"

county_reliability_table <- county_reliability_table %>%
  mutate(
    "Algorithm Recommendation" = case_when(
      Reliability == "High Confidence"   ~ "Safe for algorithmic decisions",
      Reliability == "Moderate Confidence" ~ "Use with caution - monitor outcomes",
      Reliability == "Low Confidence"    ~ "Requires manual review or additional data",
      TRUE ~ NA_character_  # for any missing/unexpected values
    )
  )
  
# Format as a professional table with kable()
kable(county_reliability_table, caption = "Household Income Reliability by County", booktabs = TRUE, digits = 2, align = "c")

Household Income Reliability by County
County	Median Income	MOE (%)	Reliability	Algorithm Recommendation
Accomack	52694	11.16	Low Confidence	Requires manual review or additional data
Albemarle	97708	3.77	High Confidence	Safe for algorithmic decisions
Alleghany	52546	7.53	Moderate Confidence	Use with caution - monitor outcomes
Amelia	63438	23.82	Low Confidence	Requires manual review or additional data
Amherst	64454	7.00	Moderate Confidence	Use with caution - monitor outcomes
Appomattox	60041	11.81	Low Confidence	Requires manual review or additional data
Arlington	137387	1.98	High Confidence	Safe for algorithmic decisions
Augusta	76124	4.17	High Confidence	Safe for algorithmic decisions
Bath	55699	21.37	Low Confidence	Requires manual review or additional data
Bedford	74773	4.38	High Confidence	Safe for algorithmic decisions
Bland	59901	4.73	High Confidence	Safe for algorithmic decisions
Botetourt	77680	7.33	Moderate Confidence	Use with caution - monitor outcomes
Brunswick	52678	5.92	Moderate Confidence	Use with caution - monitor outcomes
Buchanan	39591	7.40	Moderate Confidence	Use with caution - monitor outcomes
Buckingham	59894	14.55	Low Confidence	Requires manual review or additional data
Campbell	59022	6.30	Moderate Confidence	Use with caution - monitor outcomes
Caroline	83562	8.56	Moderate Confidence	Use with caution - monitor outcomes
Carroll	49113	9.77	Moderate Confidence	Use with caution - monitor outcomes
Charles City	65573	5.26	Moderate Confidence	Use with caution - monitor outcomes
Charlotte	51548	17.84	Low Confidence	Requires manual review or additional data
Chesterfield	95757	2.22	High Confidence	Safe for algorithmic decisions
Clarke	107475	14.79	Low Confidence	Requires manual review or additional data
Craig	66286	12.27	Low Confidence	Requires manual review or additional data
Culpeper	92359	4.65	High Confidence	Safe for algorithmic decisions
Cumberland	56497	14.25	Low Confidence	Requires manual review or additional data
Dickenson	40143	7.46	Moderate Confidence	Use with caution - monitor outcomes
Dinwiddie	77225	9.55	Moderate Confidence	Use with caution - monitor outcomes
Essex	52335	15.29	Low Confidence	Requires manual review or additional data
Fairfax	145165	1.16	High Confidence	Safe for algorithmic decisions
Fauquier	122785	5.40	Moderate Confidence	Use with caution - monitor outcomes
Floyd	57146	10.90	Low Confidence	Requires manual review or additional data
Fluvanna	90766	6.05	Moderate Confidence	Use with caution - monitor outcomes
Franklin	66275	5.74	Moderate Confidence	Use with caution - monitor outcomes
Frederick	92443	4.99	High Confidence	Safe for algorithmic decisions
Giles	61987	6.19	Moderate Confidence	Use with caution - monitor outcomes
Gloucester	83750	5.24	Moderate Confidence	Use with caution - monitor outcomes
Goochland	105600	8.09	Moderate Confidence	Use with caution - monitor outcomes
Grayson	43348	10.54	Low Confidence	Requires manual review or additional data
Greene	81338	6.64	Moderate Confidence	Use with caution - monitor outcomes
Greensville	51823	13.84	Low Confidence	Requires manual review or additional data
Halifax	49145	5.18	Moderate Confidence	Use with caution - monitor outcomes
Hanover	104678	3.36	High Confidence	Safe for algorithmic decisions
Henrico	82424	2.10	High Confidence	Safe for algorithmic decisions
Henry	43694	5.98	Moderate Confidence	Use with caution - monitor outcomes
Highland	57070	14.45	Low Confidence	Requires manual review or additional data
Isle of Wight	91680	4.09	High Confidence	Safe for algorithmic decisions
James City	100711	4.64	High Confidence	Safe for algorithmic decisions
King and Queen	70147	26.88	Low Confidence	Requires manual review or additional data
King George	103264	8.14	Moderate Confidence	Use with caution - monitor outcomes
King William	79398	11.86	Low Confidence	Requires manual review or additional data
Lancaster	62674	7.10	Moderate Confidence	Use with caution - monitor outcomes
Lee	41619	12.35	Low Confidence	Requires manual review or additional data
Loudoun	170463	2.05	High Confidence	Safe for algorithmic decisions
Louisa	76594	9.58	Moderate Confidence	Use with caution - monitor outcomes
Lunenburg	54438	14.45	Low Confidence	Requires manual review or additional data
Madison	74586	8.68	Moderate Confidence	Use with caution - monitor outcomes
Mathews	79054	18.95	Low Confidence	Requires manual review or additional data
Mecklenburg	51265	8.19	Moderate Confidence	Use with caution - monitor outcomes
Middlesex	69389	9.39	Moderate Confidence	Use with caution - monitor outcomes
Montgomery	65270	5.45	Moderate Confidence	Use with caution - monitor outcomes
Nelson	64028	17.43	Low Confidence	Requires manual review or additional data
New Kent	113120	6.15	Moderate Confidence	Use with caution - monitor outcomes
Northampton	54693	13.07	Low Confidence	Requires manual review or additional data
Northumberland	64655	14.33	Low Confidence	Requires manual review or additional data
Nottoway	62366	13.91	Low Confidence	Requires manual review or additional data
Orange	87309	9.40	Moderate Confidence	Use with caution - monitor outcomes
Page	56760	7.45	Moderate Confidence	Use with caution - monitor outcomes
Patrick	49180	10.87	Low Confidence	Requires manual review or additional data
Pittsylvania	52619	5.76	Moderate Confidence	Use with caution - monitor outcomes
Powhatan	108089	4.56	High Confidence	Safe for algorithmic decisions
Prince Edward	57304	6.61	Moderate Confidence	Use with caution - monitor outcomes
Prince George	80318	6.31	Moderate Confidence	Use with caution - monitor outcomes
Prince William	123193	2.19	High Confidence	Safe for algorithmic decisions
Pulaski	59740	6.43	Moderate Confidence	Use with caution - monitor outcomes
Rappahannock	98663	11.29	Low Confidence	Requires manual review or additional data
Richmond	62708	18.57	Low Confidence	Requires manual review or additional data
Roanoke	80872	2.34	High Confidence	Safe for algorithmic decisions
Rockbridge	61903	5.71	Moderate Confidence	Use with caution - monitor outcomes
Rockingham	73232	3.18	High Confidence	Safe for algorithmic decisions
Russell	44088	8.87	Moderate Confidence	Use with caution - monitor outcomes
Scott	44535	6.38	Moderate Confidence	Use with caution - monitor outcomes
Shenandoah	62149	6.68	Moderate Confidence	Use with caution - monitor outcomes
Smyth	45061	7.21	Moderate Confidence	Use with caution - monitor outcomes
Southampton	67813	10.69	Low Confidence	Requires manual review or additional data
Spotsylvania	105068	4.52	High Confidence	Safe for algorithmic decisions
Stafford	128036	3.18	High Confidence	Safe for algorithmic decisions
Surry	68655	10.95	Low Confidence	Requires manual review or additional data
Sussex	59195	11.64	Low Confidence	Requires manual review or additional data
Tazewell	46508	7.10	Moderate Confidence	Use with caution - monitor outcomes
Warren	79313	8.86	Moderate Confidence	Use with caution - monitor outcomes
Washington	59116	4.31	High Confidence	Safe for algorithmic decisions
Westmoreland	56647	12.26	Low Confidence	Requires manual review or additional data
Wise	47541	6.43	Moderate Confidence	Use with caution - monitor outcomes
Wythe	53921	7.47	Moderate Confidence	Use with caution - monitor outcomes
York	105154	3.38	High Confidence	Safe for algorithmic decisions
Alexandria	113179	2.24	High Confidence	Safe for algorithmic decisions
Bristol	45250	6.95	Moderate Confidence	Use with caution - monitor outcomes
Buena Vista	48783	14.29	Low Confidence	Requires manual review or additional data
Charlottesville	67177	7.63	Moderate Confidence	Use with caution - monitor outcomes
Chesapeake	92703	2.37	High Confidence	Safe for algorithmic decisions
Colonial Heights	72216	7.53	Moderate Confidence	Use with caution - monitor outcomes
Covington	45737	15.38	Low Confidence	Requires manual review or additional data
Danville	41484	8.28	Moderate Confidence	Use with caution - monitor outcomes
Emporia	41442	14.39	Low Confidence	Requires manual review or additional data
Fairfax	128708	9.39	Moderate Confidence	Use with caution - monitor outcomes
Falls Church	164536	8.01	Moderate Confidence	Use with caution - monitor outcomes
Franklin	57537	9.03	Moderate Confidence	Use with caution - monitor outcomes
Fredericksburg	83445	6.36	Moderate Confidence	Use with caution - monitor outcomes
Galax	44612	20.94	Low Confidence	Requires manual review or additional data
Hampton	64430	3.09	High Confidence	Safe for algorithmic decisions
Harrisonburg	56050	3.21	High Confidence	Safe for algorithmic decisions
Hopewell	50661	10.43	Low Confidence	Requires manual review or additional data
Lexington	93651	20.95	Low Confidence	Requires manual review or additional data
Lynchburg	56243	4.99	High Confidence	Safe for algorithmic decisions
Manassas	110559	7.71	Moderate Confidence	Use with caution - monitor outcomes
Manassas Park	91673	16.15	Low Confidence	Requires manual review or additional data
Martinsville	39127	9.27	Moderate Confidence	Use with caution - monitor outcomes
Newport News	63355	3.31	High Confidence	Safe for algorithmic decisions
Norfolk	60998	3.08	High Confidence	Safe for algorithmic decisions
Norton	36974	25.41	Low Confidence	Requires manual review or additional data
Petersburg	46930	6.60	Moderate Confidence	Use with caution - monitor outcomes
Poquoson	114503	7.31	Moderate Confidence	Use with caution - monitor outcomes
Portsmouth	57154	4.32	High Confidence	Safe for algorithmic decisions
Radford	51039	14.27	Low Confidence	Requires manual review or additional data
Richmond	59606	3.96	High Confidence	Safe for algorithmic decisions
Roanoke	51523	4.47	High Confidence	Safe for algorithmic decisions
Salem	68402	6.98	Moderate Confidence	Use with caution - monitor outcomes
Staunton	59731	8.95	Moderate Confidence	Use with caution - monitor outcomes
Suffolk	87758	4.22	High Confidence	Safe for algorithmic decisions
Virginia Beach	87544	2.24	High Confidence	Safe for algorithmic decisions
Waynesboro	52519	8.79	Moderate Confidence	Use with caution - monitor outcomes
Williamsburg	66815	7.32	Moderate Confidence	Use with caution - monitor outcomes
Winchester	62495	6.93	Moderate Confidence	Use with caution - monitor outcomes
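The tiering used in this table can also be written as a small function that maps an MOE percentage straight to a recommendation, which makes the thresholds easy to adjust later. This is a base R sketch; recommend() is a hypothetical helper, and the three test values are the MOE percentages of the counties selected in Part 3.

```r
# Hypothetical helper: map an MOE percentage to an algorithm recommendation
# using the same thresholds as the reliability categories (5% and 10%).
recommend <- function(moe_pct) {
  ifelse(
    is.na(moe_pct), NA_character_,
    ifelse(
      moe_pct < 5, "Safe for algorithmic decisions",
      ifelse(
        moe_pct < 10, "Use with caution - monitor outcomes",
        "Requires manual review or additional data"
      )
    )
  )
}

# MOE percentages for the three counties selected in Part 3
selected <- c(Loudoun = 2.05, `Falls Church` = 8.01, Essex = 15.29)
print(recommend(selected))
```

Missing MOE values return NA rather than falling into the lowest tier, so gaps in the data stay visible instead of being treated as a reliability judgment.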

Key Recommendations:

Your Task: Use your analysis results to provide specific guidance to the department.

Counties suitable for immediate algorithmic implementation: Albemarle, Arlington, Augusta, Bedford, Bland, Chesterfield, Culpeper, Fairfax County, Frederick, Hanover, Henrico, Isle of Wight, James City, Loudoun, Powhatan, Prince William, Roanoke County, Rockingham, Spotsylvania, Stafford, Washington, York, Alexandria, Chesapeake, Hampton, Harrisonburg, Lynchburg, Newport News, Norfolk, Portsmouth, Richmond city, Roanoke city, Suffolk, Virginia Beach.

The high confidence counties have data with low margins of error. These counties are considered “safe” because the data is statistically sound, enabling reliable automated decision-making with minimal risk of error or bias.

Counties requiring additional oversight: Alleghany, Amherst, Botetourt, Brunswick, Buchanan, Campbell, Caroline, Carroll, Charles City, Dickenson, Dinwiddie, Fauquier, Fluvanna, Franklin County, Giles, Gloucester, Goochland, Greene, Halifax, Henry, King George, Lancaster, Louisa, Madison, Mecklenburg, Middlesex, Montgomery, New Kent, Orange, Page, Pittsylvania, Prince Edward, Prince George, Pulaski, Rockbridge, Russell, Scott, Shenandoah, Smyth, Tazewell, Warren, Wise, Wythe, Bristol, Charlottesville, Colonial Heights, Danville, Fairfax city, Falls Church, Franklin city, Fredericksburg, Manassas, Martinsville, Petersburg, Poquoson, Salem, Staunton, Waynesboro, Williamsburg, Winchester.

Moderate confidence county data should be used cautiously. The data is generally usable but carries enough uncertainty that algorithmic decisions could occasionally be misleading. Though these counties do not need to be manually reviewed every time, it is important to monitor any abnormalities in the data.

Counties needing alternative approaches: Accomack, Amelia, Appomattox, Bath, Buckingham, Charlotte, Clarke, Craig, Cumberland, Essex, Floyd, Grayson, Greensville, Highland, King and Queen, King William, Lee, Lunenburg, Mathews, Nelson, Northampton, Northumberland, Nottoway, Patrick, Rappahannock, Richmond County, Southampton, Surry, Sussex, Westmoreland, Buena Vista, Covington, Emporia, Galax, Hopewell, Lexington, Manassas Park, Norton, Radford.

These counties have high margins of error, meaning their estimates are more variable and are likely based on fewer survey responses. Decisions made using these data should be checked manually before implementation.

Questions for Further Investigation

Are there regional clusters of counties with consistently high margins of error, and how do these clusters relate to population size, demographics, rurality, or economic conditions?
How do margins of error and confidence levels change over time?
Do certain demographic characteristics such as race and age contribute to higher MOE or lower confidence in algorithmic outputs?

Technical Notes

Data Sources:
- U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
- Retrieved via tidycensus R package on 09/28/2025

Reproducibility:
- All analysis conducted in RStudio version 2025.05.1+513
- Census API key required for replication
- Complete code and documentation available at: https://musa-5080-fall-2025.github.io/portfolio-setup-jenniferluu6/

Methodology Notes: In processing the data, I selected Virginia as the focus of analysis. During cleaning, I had to make adjustments to county names by removing the word “City” in certain cases to ensure consistency across data sets. This additional step suggests that similar cleaning may be required if applying the same methods to other states. For county selection, I chose locations I was already familiar with, which provided helpful context for interpreting the results. However, this choice may limit reproducibility since another researcher might select different counties and arrive at slightly different insights.

Limitations: The sample size issues affected the reliability of estimates, particularly for smaller demographic groups such as Black and Hispanic/Latino populations, where margins of error sometimes exceeded 100%. Also, county selection was influenced by personal familiarity, which, while helpful for interpretation, may introduce bias and reduce reproducibility.

Submission Checklist

Before submitting your portfolio link on Canvas:

All code chunks run without errors
All “[Fill this in]” prompts have been completed
Tables are properly formatted and readable
Executive summary addresses all four required components
Portfolio navigation includes this assignment
Census API key is properly set
Document renders correctly to HTML

Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/your_file_name.html