# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
library(tidyverse)
library(tidycensus)
library(knitr)
# Set your Census API key
# Read the key from the CENSUS_API_KEY environment variable instead of hard-coding it
census_api_key(Sys.getenv("CENSUS_API_KEY"))
# Choose your state for analysis - assign it to a variable called my_state
my_state <- "VA"
Assignment 1: Census Data Quality for Policy Decisions
Evaluating Data Reliability for Algorithmic Decision-Making
Assignment Overview
Scenario
You are a data analyst for the Virginia Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.
Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.
Learning Objectives
- Apply dplyr functions to real census data for policy analysis
- Evaluate data quality using margins of error
- Connect technical analysis to algorithmic decision-making
- Identify potential equity implications of data reliability issues
- Create professional documentation for policy stakeholders
Submission Instructions
Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/
Make sure to update your _quarto.yml navigation to include this assignment under an “Assignments” menu.
Part 1: Portfolio Integration
Create this assignment in your portfolio repository under an assignments/assignment_1/ folder structure. Update your navigation menu to include:
- text: Assignments
  menu:
    - href: assignments/assignment_1/your_file_name.qmd
      text: "Assignment 1: Census Data Exploration"
If the text contains a special character such as a colon or comma, wrap it in double quotes so that Quarto reads it as text.
Setup
State Selection: I have chosen Virginia for this analysis because I am interested in my home state!
Part 2: County-Level Resource Assessment
2.1 Data Retrieval
Your Task: Use get_acs() to retrieve county-level data for your chosen state.
Requirements:
- Geography: county level
- Variables: median household income (B19013_001) and total population (B01003_001)
- Year: 2022
- Survey: acs5
- Output format: wide
Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.
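As a minimal sketch of what the hint's naming syntax leads to, the base R snippet below builds the named vector and derives the column names that appear in the wide output further down (an `E` suffix for estimates and an `M` suffix for margins of error). It makes no API call; `acs_vars` and `wide_cols` are illustrative names, not part of the assignment.

```r
# Named vector: the names become column prefixes, the values are ACS variable codes
acs_vars <- c(
  median_household_income = "B19013_001",
  total_population = "B01003_001"
)

# In wide output each variable yields an estimate (E) and a margin of error (M) column
wide_cols <- as.vector(rbind(
  paste0(names(acs_vars), "E"),
  paste0(names(acs_vars), "M")
))
print(wide_cols)
```

Descriptive names matter here because every later step (for example `median_household_incomeM / median_household_incomeE`) refers to these generated column names.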
# Write your get_acs() code here
county_acs <- get_acs(
  variables = c(
    median_household_income = "B19013_001",
    total_population = "B01003_001"),
  year = 2022,
  geography = "county",
  state = my_state,
  survey = "acs5",
  cache = TRUE,
  output = "wide")
# Clean the county names to remove state name and "County"
# Hint: use mutate() with str_remove()
county_clean <- county_acs %>%
  mutate(
    NAME = NAME %>%
      str_remove(" County,") %>%
      str_remove(" Virginia") %>%
      str_remove(" city,") # I had some counties that were named city as well as county
  )
# Display the first few rows
head(county_clean)
# A tibble: 6 × 6
GEOID NAME median_household_inc…¹ median_household_inc…² total_populationE
<chr> <chr> <dbl> <dbl> <dbl>
1 51001 Accomack 52694 5883 33367
2 51003 Albemar… 97708 3686 112513
3 51005 Allegha… 52546 3958 15159
4 51007 Amelia 63438 15114 13309
5 51009 Amherst 64454 4514 31426
6 51011 Appomat… 60041 7091 16253
# ℹ abbreviated names: ¹median_household_incomeE, ²median_household_incomeM
# ℹ 1 more variable: total_populationM <dbl>
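The name-cleaning step can be checked on a few sample strings. The sketch below uses base R `sub()` instead of `str_remove()` and assumes names in the "place, state" form. It also shows a side effect worth knowing about: once both " County" and " city" are stripped, a county and an independent city with the same base name (Fairfax, for example) become indistinguishable by NAME, although their GEOIDs remain unique. The `labelled_names` variant is a hypothetical alternative, not the approach used in this analysis.

```r
# Sample names in the "<place>, <state>" form assumed for county-level output
raw_names <- c(
  "Accomack County, Virginia",
  "Fairfax County, Virginia",
  "Fairfax city, Virginia",
  "Falls Church city, Virginia"
)

# Drop the state, then the " County" or " city" suffix
short_names <- sub(" (County|city)$", "", sub(", Virginia$", "", raw_names))

# Hypothetical alternative: keep the "city" marker so cities stay distinguishable
labelled_names <- sub(" County$", "", sub(", Virginia$", "", raw_names))

print(short_names)
print(anyDuplicated(short_names) > 0)  # TRUE when two places now share a label
print(labelled_names)
```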
2.2 Data Quality Assessment
Your Task: Calculate margin of error percentages and create reliability categories.
Requirements:
- Calculate MOE percentage: (margin of error / estimate) * 100
- Create reliability categories:
  - High Confidence: MOE < 5%
  - Moderate Confidence: MOE 5-10%
  - Low Confidence: MOE > 10%
- Create a flag for unreliable estimates (MOE > 10%)
Hint: Use mutate() with case_when() for the categories.
# Calculate MOE percentage and reliability categories using mutate()
county_reliability <- county_clean %>%
  mutate(
    hh_income_moe_pct = (median_household_incomeM / median_household_incomeE) * 100,
    hh_income_moe_cat = case_when(
      hh_income_moe_pct < 5 ~ "High Confidence",
      hh_income_moe_pct < 10 ~ "Moderate Confidence",
      TRUE ~ "Low Confidence"
    )
  )
# Create a summary showing count of counties in each reliability category
# Hint: use count() and mutate() to add percentages
reliability_summary <- county_reliability %>%
  count(hh_income_moe_cat, name = "reliability_count") %>%
  mutate(percentage = reliability_count / sum(reliability_count) * 100)
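As a worked check of the MOE percentage formula, the sketch below applies it to three counties whose estimates and margins appear in the head() output of section 2.1. It uses base R `cut()` rather than `case_when()`; with `right = FALSE` the bins are [0, 5), [5, 10) and [10, Inf), which are the same boundaries as the `case_when()` in this section.

```r
# Estimates and margins of error taken from the head() output in section 2.1
county   <- c("Accomack", "Albemarle", "Amelia")
estimate <- c(52694, 97708, 63438)
moe      <- c(5883, 3686, 15114)

# MOE percentage: (margin of error / estimate) * 100
moe_pct <- moe / estimate * 100

# Bin into the three reliability categories
category <- cut(
  moe_pct,
  breaks = c(0, 5, 10, Inf),
  labels = c("High Confidence", "Moderate Confidence", "Low Confidence"),
  right = FALSE
)

print(data.frame(county, moe_pct = round(moe_pct, 2), category))
```

The rounded percentages (11.16, 3.77, 23.82) agree with the values reported for these counties in the county reliability table later in the document.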
2.3 High Uncertainty Counties
Your Task: Identify the 5 counties with the highest MOE percentages.
Requirements:
- Sort by MOE percentage (highest first)
- Select the top 5 counties
- Display: county name, median income, margin of error, MOE percentage, reliability category
- Format as a professional table using kable()
Hint: Use arrange(), slice(), and select() functions.
# Create table of top 5 counties by MOE percentage
highest_moe <- county_reliability %>%
  arrange(desc(hh_income_moe_pct))
top_moe_counties <- slice_head(highest_moe, n = 5)
# Format as table with kable() - include appropriate column names and caption
highest_moe_table <- top_moe_counties %>%
  select(
    "County" = NAME,
    "Median Income" = median_household_incomeE,
    "Margin Error (%)" = hh_income_moe_pct,
    "Reliability" = hh_income_moe_cat
  )
kable(highest_moe_table, caption = "Top 5 Counties in Virginia by Median Household Income MOE", booktabs = TRUE, digits = 2, align = "c")
County | Median Income | Margin Error (%) | Reliability |
---|---|---|---|
King and Queen | 70147 | 26.88 | Low Confidence |
Norton | 36974 | 25.41 | Low Confidence |
Amelia | 63438 | 23.82 | Low Confidence |
Bath | 55699 | 21.37 | Low Confidence |
Lexington | 93651 | 20.95 | Low Confidence |
Data Quality Commentary:
The counties and independent cities listed above have high margins of error relative to their median household income estimates. This can happen for several reasons, including a small population (and therefore a small survey sample) and high income variability within the area. Estimates this uncertain could misrepresent the population and lead to biased decision-making.
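To make the size of this uncertainty concrete, the sketch below converts the MOE percentage for King and Queen back into dollars and prints the resulting range. It assumes the margin is the standard published ACS margin at the 90% confidence level; the variable names are illustrative.

```r
# King and Queen County, values from the top-5 table
estimate <- 70147
moe_pct  <- 26.88

# Recover the margin of error in dollars, then the interval bounds
moe   <- estimate * moe_pct / 100
lower <- estimate - moe
upper <- estimate + moe

cat(sprintf("Plausible range: $%.0f to $%.0f\n", lower, upper))
```

A range spanning roughly $51,000 to $89,000 means this county could plausibly sit near the bottom or near the middle of the statewide income ranking, which is why a ranking-based algorithm would be fragile here.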
Part 3: Neighborhood-Level Analysis
3.1 Focus Area Selection
Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.
Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.
# Use filter() to select 2-3 counties from your county_reliability data
# Store the selected counties in a variable called selected_counties
selected_counties <- c("Loudoun", "Essex", "Falls Church")
filtered_counties <- county_reliability %>%
  filter(NAME %in% selected_counties)
# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category
selected_counties_table <- filtered_counties %>%
  select(
    "County" = NAME,
    "Median Income" = median_household_incomeE,
    "Margin Error (%)" = hh_income_moe_pct,
    "Reliability" = hh_income_moe_cat
  )
kable(selected_counties_table, caption = "Median Household Income of Selected Counties Virginia", booktabs = TRUE, digits = 2, align = "c")
County | Median Income | Margin Error (%) | Reliability |
---|---|---|---|
Essex | 52335 | 15.29 | Low Confidence |
Loudoun | 170463 | 2.05 | High Confidence |
Falls Church | 164536 | 8.01 | Moderate Confidence |
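Comment on the output: Loudoun, the highest-income county of the three, has the lowest MOE (2.05%), while Essex, the lowest-income, has the highest (15.29%). This suggests a possible relationship between MOE and median income, although three counties are too few to confirm it.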
3.2 Tract-Level Demographics
Your Task: Get demographic data for census tracts in your selected counties.
Requirements:
- Geography: tract level
- Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
- Use the same state and year as before
- Output format: wide
- Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.
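The GEOID pattern mentioned in the challenge can be sketched with `substr()`. The GEOIDs below are copied from the county output in section 2.1 and the tract output in section 3.3; the variable names are illustrative.

```r
# GEOIDs copied from the county and tract output in this document
county_geoids <- c("51001", "51003", "51005")
tract_geoid   <- "51107611602"

# County GEOID = 2-digit state FIPS + 3-digit county FIPS
state_fips  <- substr(county_geoids, 1, 2)
county_fips <- substr(county_geoids, 3, 5)

# Tract GEOID = state (2 digits) + county (3 digits) + tract (6 digits)
tract_parts <- c(
  state  = substr(tract_geoid, 1, 2),
  county = substr(tract_geoid, 3, 5),
  tract  = substr(tract_geoid, 6, 11)
)

print(county_fips)
print(tract_parts)
```

Three-digit county codes extracted this way can be passed to the county argument of get_acs(), which also accepts county names.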
# Define your race/ethnicity variables with descriptive names
race_ethnicity <- c(
  white_alone = "B03002_003",
  black = "B03002_004",
  hispanic_latino = "B03002_012",
  total_pop = "B03002_001"
)
# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter
tract_acs <- get_acs(
  variables = race_ethnicity,
  year = 2022,
  geography = "tract",
  state = my_state,
  survey = "acs5",
  county = selected_counties,
  cache = TRUE,
  output = "wide")
# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations
tract_acs <- tract_acs %>%
  mutate(
    white_alone_pct = (white_aloneE / total_popE) * 100,
    black_pct = (blackE / total_popE) * 100,
    hispanic_latino_pct = (hispanic_latinoE / total_popE) * 100
  )
# Add readable tract and county name columns using str_extract() or similar
tract_acs <- tract_acs %>%
  mutate(
    tract = str_extract(NAME, "Tract [^;]+"),
    county = str_extract(NAME, "(?<=; )[^;]+(?=;)") %>%
      str_remove(" city") %>% # include the leading space so no trailing blank is left behind
      str_remove(" County")
  )
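The extraction above relies on tract names being separated by semicolons. As a sketch of the same idea without regular-expression lookarounds, the snippet below splits on "; " in base R. The two tract names are hypothetical examples in the assumed format, not rows from the data.

```r
# Hypothetical tract names in the semicolon-separated form the regex above expects
tract_names <- c(
  "Census Tract 1; Example County; Virginia",
  "Census Tract 2.01; Sample city; Virginia"
)

# Split on "; " and take the first two pieces of each name
parts  <- strsplit(tract_names, "; ", fixed = TRUE)
tract  <- sub("^Census ", "", vapply(parts, `[`, character(1), 1))
county <- sub(" (County|city)$", "", vapply(parts, `[`, character(1), 2))

print(data.frame(tract, county))
```

Anchoring the suffix pattern with `$` removes " County" or " city" only at the end of the name, so a place whose name merely contains those letters is left alone.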
3.3 Demographic Analysis
Your Task: Analyze the demographic patterns in your selected areas.
# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract
top_hispanic_tract <- tract_acs %>%
  arrange(desc(hispanic_latino_pct)) %>%
  slice_head(n = 1)
print(top_hispanic_tract)
# A tibble: 1 × 15
GEOID NAME white_aloneE white_aloneM blackE blackM hispanic_latinoE
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 51107611602 Census T… 574 202 196 151 2325
# ℹ 8 more variables: hispanic_latinoM <dbl>, total_popE <dbl>,
# total_popM <dbl>, white_alone_pct <dbl>, black_pct <dbl>,
# hispanic_latino_pct <dbl>, tract <chr>, county <chr>
# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group
demographics_summary <- tract_acs %>%
  group_by(county) %>%
  summarize(
    tracts_count = n(),
    avg_white_pct = mean(white_alone_pct, na.rm = TRUE),
    avg_black_pct = mean(black_pct, na.rm = TRUE),
    avg_hispanic_latino_pct = mean(hispanic_latino_pct, na.rm = TRUE)
  )
# Create a nicely formatted table of your results using kable()
demographics_summary_table <- demographics_summary %>%
  select(
    "County" = county,
    "Number of Tracts" = tracts_count,
    "Average White Only Population (%)" = avg_white_pct,
    "Average Black Population (%)" = avg_black_pct,
    "Average Hispanic/Latino Population (%)" = avg_hispanic_latino_pct
  )
kable(demographics_summary_table, caption = "Average Demographics by County", booktabs = TRUE, digits = 2, align = "c")
County | Number of Tracts | Average White Only Population (%) | Average Black Population (%) | Average Hispanic/Latino Population (%) |
---|---|---|---|---|
Essex | 3 | 53.82 | 37.84 | 4.38 |
Falls Church | 3 | 69.04 | 4.18 | 11.34 |
Loudoun | 75 | 54.07 | 7.28 | 14.21 |
Part 4: Comprehensive Data Quality Evaluation
4.1 MOE Analysis for Demographic Variables
Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.
Requirements:
- Calculate MOE percentages for each demographic variable
- Flag tracts where any demographic variable has MOE > 15%
- Create summary statistics
# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)
demographics_reliability <- tract_acs %>%
  mutate(
    white_alone_moe_pct = (white_aloneM / white_aloneE) * 100,
    black_moe_pct = (blackM / blackE) * 100,
    hispanic_latino_moe_pct = (hispanic_latinoM / hispanic_latinoE) * 100,
    # Create a flag for tracts with high MOE on any demographic variable
    # Use logical operators (| for OR) in an ifelse() statement
    high_moe_flag = (white_alone_moe_pct > 15) |
      (black_moe_pct > 15) |
      (hispanic_latino_moe_pct > 15)
  )
# Create summary statistics showing how many tracts have data quality issues
moe_summary <- demographics_reliability %>%
  summarise(
    total_tracts = n(), # total number of tracts
    tracts_high_moe = sum(high_moe_flag), # number of tracts flagged as high MOE
    pct_high_moe = (tracts_high_moe / total_tracts) * 100 # percentage of tracts with high MOE
  )
moe_summary_table <- moe_summary %>%
  select(
    "Total Number of Tracts" = total_tracts,
    "Tracts with High MOE" = tracts_high_moe,
    "Tracts with High MOE (%)" = pct_high_moe
  )
kable(moe_summary_table, caption = "Tracts with Data Quality Issues", booktabs = TRUE, digits = 2, align = "c")
Total Number of Tracts | Tracts with High MOE | Tracts with High MOE (%) |
---|---|---|
81 | 81 | 100 |
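A 100% flag rate is less surprising once the formula is applied to a single tract. The sketch below uses the white and Black estimates and margins for tract 51107611602 from the output in section 3.3; either one alone is enough to trip the 15% threshold. The last lines illustrate, with a made-up margin, how a zero estimate behaves: the division yields `Inf`, which also exceeds the threshold.

```r
# Estimates and margins for tract 51107611602, from the output in section 3.3
white_e <- 574; white_m <- 202
black_e <- 196; black_m <- 151

white_moe_pct <- white_m / white_e * 100
black_moe_pct <- black_m / black_e * 100

# Same OR logic as high_moe_flag, restricted to the two groups shown here
flag <- (white_moe_pct > 15) | (black_moe_pct > 15)

# Edge case: a zero estimate with a non-zero margin divides to Inf
zero_estimate_pct <- 12 / 0 * 100 # 12 is a made-up margin for illustration

print(round(c(white = white_moe_pct, black = black_moe_pct), 2))
print(flag)
print(zero_estimate_pct > 15)
```

Because tract-level counts for individual groups are small, a 15% cut-off on any one variable flags nearly everything; a threshold applied per variable, or a higher cut-off, would separate tracts more usefully.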
4.2 Pattern Analysis
Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.
# Group tracts by whether they have high MOE issues
# Calculate average characteristics for each group:
# - population size, demographic percentages
moe_tract <- demographics_reliability %>%
  group_by(high_moe_flag) %>%
  summarize(
    avg_pop = mean(total_popE, na.rm = TRUE),
    avg_white_alone = mean(white_aloneE, na.rm = TRUE),
    avg_black = mean(blackE, na.rm = TRUE),
    avg_hispanic = mean(hispanic_latinoE, na.rm = TRUE),
    .groups = "drop"
  )
# Use group_by() and summarize() to create this comparison
# Create a professional table showing the patterns
moe_tract_table <- moe_tract %>%
  select(
    "Total" = avg_pop,
    "White" = avg_white_alone,
    "Black" = avg_black,
    "Hispanic/Latino" = avg_hispanic
  )
kable(moe_tract_table, caption = "Average Population of Tracts with High MOE", booktabs = TRUE, digits = 2, align = "c") # Since all my tracts have high MOE
Total | White | Black | Hispanic/Latino |
---|---|---|---|
5505.57 | 2955.37 | 441.56 | 744.53 |
Pattern Analysis: Margins of error for income varied from county to county, but the demographic data had consistently higher margins of error. The data was particularly unreliable for Black and Hispanic/Latino populations, with margins of error in some cases exceeding 100% of the estimate. This is likely due to the small size of these populations within certain counties, which makes estimates less certain. For income, counties in southern Virginia tended to fall in the lower-confidence categories, and these counties also tended to have lower household incomes. By contrast, counties with larger populations generally had more reliable household income and demographic estimates.
Part 5: Policy Recommendations
5.1 Analysis Integration and Professional Summary
Your Task: Write an executive summary that integrates findings from all four analyses.
Executive Summary Requirements:
1. Overall Pattern Identification: What are the systematic patterns across all your analyses?
2. Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings?
3. Root Cause Analysis: What underlying factors drive both data quality issues and bias risk?
4. Strategic Recommendations: What should the Department implement to address these systematic issues?
Executive Summary:
Across all analyses, a few systematic patterns emerged in the reliability of demographic and socioeconomic data. Lower-confidence household income estimates were concentrated in southern Virginia counties that also had lower household incomes. Demographic data consistently showed higher margins of error, particularly for Black and Hispanic/Latino populations, in some cases exceeding 100% of the estimate in the counties examined. Counties with larger populations tended to have more reliable estimates across both demographic and income indicators, highlighting a structural imbalance in data quality.
From an equity perspective, these disparities mean that Black and Hispanic/Latino communities face a greater risk of algorithmic bias when data is used for policy, funding, or service allocation. When data carries high uncertainty, any algorithm built on it risks misrepresenting those groups. Lower-income communities in southern Virginia face a similar risk, as unreliable income data undermines equitable targeting of economic support or development programs.
The root causes of these issues stem primarily from small sample sizes in survey-based data collection, which reduce reliability for smaller populations. Smaller minority populations, rural communities, and low-income counties are more likely to experience data quality issues and are also more dependent on accurate representation for equitable policy outcomes. Together, these factors create a feedback loop in which the communities most in need of resources have the weakest statistical representation.
To address these systemic challenges, the Department should invest in data collection for small populations by increasing the share of each county's population that is sampled. The Department could also draw on other surveys or data sources to decrease margins of error. These options could help the Department mitigate the risk of algorithmic bias while ensuring that vulnerable communities are not disadvantaged by unreliable data. If the Department is unable to make these changes, I would urge it to report margins of error alongside any estimates drawn from these data sets.
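One way the final suggestion could be put into practice is to check whether the gap between two estimates is larger than their combined uncertainty before ranking one community above another. The sketch below assumes the standard approach of converting a 90% margin of error to a standard error by dividing by 1.645, and uses the Accomack and Amelia values from the head() output in section 2.1. It is an illustration of the idea, not a procedure the Department currently uses.

```r
# County estimates and 90% margins of error from the head() output in section 2.1
est_a <- 52694; moe_a <- 5883  # Accomack
est_b <- 63438; moe_b <- 15114 # Amelia

# Convert 90% margins to standard errors, then form the z statistic for the difference
z_90 <- 1.645
se_a <- moe_a / z_90
se_b <- moe_b / z_90
z    <- abs(est_a - est_b) / sqrt(se_a^2 + se_b^2)

# TRUE would mean the gap is statistically distinguishable at the 90% level
significant <- z > z_90
print(round(z, 2))
print(significant)
```

Here the two medians differ by more than $10,000, yet the z statistic of about 1.09 falls short of 1.645, so the data cannot reliably say which county has the lower income. An algorithm that ignores this would treat the difference as real.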
5.2 Specific Recommendations
Your Task: Create a decision framework for algorithm implementation.
# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category
county_reliability_table <- county_reliability %>%
  select(
    "County" = NAME,
    "Median Income" = median_household_incomeE,
    "MOE (%)" = hh_income_moe_pct,
    "Reliability" = hh_income_moe_cat
  )
# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"
# - Low Confidence: "Requires manual review or additional data"
county_reliability_table <- county_reliability_table %>%
  mutate(
    "Algorithm Recommendation" = case_when(
      Reliability == "High Confidence" ~ "Safe for algorithmic decisions",
      Reliability == "Moderate Confidence" ~ "Use with caution - monitor outcomes",
      Reliability == "Low Confidence" ~ "Requires manual review or additional data",
      TRUE ~ NA_character_ # for any missing/unexpected values
    )
  )
# Format as a professional table with kable()
kable(county_reliability_table, caption = "Household Income Reliability by County", booktabs = TRUE, digits = 2, align = "c")
County | Median Income | MOE (%) | Reliability | Algorithm Recommendation |
---|---|---|---|---|
Accomack | 52694 | 11.16 | Low Confidence | Requires manual review or additional data |
Albemarle | 97708 | 3.77 | High Confidence | Safe for algorithmic decisions |
Alleghany | 52546 | 7.53 | Moderate Confidence | Use with caution - monitor outcomes |
Amelia | 63438 | 23.82 | Low Confidence | Requires manual review or additional data |
Amherst | 64454 | 7.00 | Moderate Confidence | Use with caution - monitor outcomes |
Appomattox | 60041 | 11.81 | Low Confidence | Requires manual review or additional data |
Arlington | 137387 | 1.98 | High Confidence | Safe for algorithmic decisions |
Augusta | 76124 | 4.17 | High Confidence | Safe for algorithmic decisions |
Bath | 55699 | 21.37 | Low Confidence | Requires manual review or additional data |
Bedford | 74773 | 4.38 | High Confidence | Safe for algorithmic decisions |
Bland | 59901 | 4.73 | High Confidence | Safe for algorithmic decisions |
Botetourt | 77680 | 7.33 | Moderate Confidence | Use with caution - monitor outcomes |
Brunswick | 52678 | 5.92 | Moderate Confidence | Use with caution - monitor outcomes |
Buchanan | 39591 | 7.40 | Moderate Confidence | Use with caution - monitor outcomes |
Buckingham | 59894 | 14.55 | Low Confidence | Requires manual review or additional data |
Campbell | 59022 | 6.30 | Moderate Confidence | Use with caution - monitor outcomes |
Caroline | 83562 | 8.56 | Moderate Confidence | Use with caution - monitor outcomes |
Carroll | 49113 | 9.77 | Moderate Confidence | Use with caution - monitor outcomes |
Charles City | 65573 | 5.26 | Moderate Confidence | Use with caution - monitor outcomes |
Charlotte | 51548 | 17.84 | Low Confidence | Requires manual review or additional data |
Chesterfield | 95757 | 2.22 | High Confidence | Safe for algorithmic decisions |
Clarke | 107475 | 14.79 | Low Confidence | Requires manual review or additional data |
Craig | 66286 | 12.27 | Low Confidence | Requires manual review or additional data |
Culpeper | 92359 | 4.65 | High Confidence | Safe for algorithmic decisions |
Cumberland | 56497 | 14.25 | Low Confidence | Requires manual review or additional data |
Dickenson | 40143 | 7.46 | Moderate Confidence | Use with caution - monitor outcomes |
Dinwiddie | 77225 | 9.55 | Moderate Confidence | Use with caution - monitor outcomes |
Essex | 52335 | 15.29 | Low Confidence | Requires manual review or additional data |
Fairfax | 145165 | 1.16 | High Confidence | Safe for algorithmic decisions |
Fauquier | 122785 | 5.40 | Moderate Confidence | Use with caution - monitor outcomes |
Floyd | 57146 | 10.90 | Low Confidence | Requires manual review or additional data |
Fluvanna | 90766 | 6.05 | Moderate Confidence | Use with caution - monitor outcomes |
Franklin | 66275 | 5.74 | Moderate Confidence | Use with caution - monitor outcomes |
Frederick | 92443 | 4.99 | High Confidence | Safe for algorithmic decisions |
Giles | 61987 | 6.19 | Moderate Confidence | Use with caution - monitor outcomes |
Gloucester | 83750 | 5.24 | Moderate Confidence | Use with caution - monitor outcomes |
Goochland | 105600 | 8.09 | Moderate Confidence | Use with caution - monitor outcomes |
Grayson | 43348 | 10.54 | Low Confidence | Requires manual review or additional data |
Greene | 81338 | 6.64 | Moderate Confidence | Use with caution - monitor outcomes |
Greensville | 51823 | 13.84 | Low Confidence | Requires manual review or additional data |
Halifax | 49145 | 5.18 | Moderate Confidence | Use with caution - monitor outcomes |
Hanover | 104678 | 3.36 | High Confidence | Safe for algorithmic decisions |
Henrico | 82424 | 2.10 | High Confidence | Safe for algorithmic decisions |
Henry | 43694 | 5.98 | Moderate Confidence | Use with caution - monitor outcomes |
Highland | 57070 | 14.45 | Low Confidence | Requires manual review or additional data |
Isle of Wight | 91680 | 4.09 | High Confidence | Safe for algorithmic decisions |
James City | 100711 | 4.64 | High Confidence | Safe for algorithmic decisions |
King and Queen | 70147 | 26.88 | Low Confidence | Requires manual review or additional data |
King George | 103264 | 8.14 | Moderate Confidence | Use with caution - monitor outcomes |
King William | 79398 | 11.86 | Low Confidence | Requires manual review or additional data |
Lancaster | 62674 | 7.10 | Moderate Confidence | Use with caution - monitor outcomes |
Lee | 41619 | 12.35 | Low Confidence | Requires manual review or additional data |
Loudoun | 170463 | 2.05 | High Confidence | Safe for algorithmic decisions |
Louisa | 76594 | 9.58 | Moderate Confidence | Use with caution - monitor outcomes |
Lunenburg | 54438 | 14.45 | Low Confidence | Requires manual review or additional data |
Madison | 74586 | 8.68 | Moderate Confidence | Use with caution - monitor outcomes |
Mathews | 79054 | 18.95 | Low Confidence | Requires manual review or additional data |
Mecklenburg | 51265 | 8.19 | Moderate Confidence | Use with caution - monitor outcomes |
Middlesex | 69389 | 9.39 | Moderate Confidence | Use with caution - monitor outcomes |
Montgomery | 65270 | 5.45 | Moderate Confidence | Use with caution - monitor outcomes |
Nelson | 64028 | 17.43 | Low Confidence | Requires manual review or additional data |
New Kent | 113120 | 6.15 | Moderate Confidence | Use with caution - monitor outcomes |
Northampton | 54693 | 13.07 | Low Confidence | Requires manual review or additional data |
Northumberland | 64655 | 14.33 | Low Confidence | Requires manual review or additional data |
Nottoway | 62366 | 13.91 | Low Confidence | Requires manual review or additional data |
Orange | 87309 | 9.40 | Moderate Confidence | Use with caution - monitor outcomes |
Page | 56760 | 7.45 | Moderate Confidence | Use with caution - monitor outcomes |
Patrick | 49180 | 10.87 | Low Confidence | Requires manual review or additional data |
Pittsylvania | 52619 | 5.76 | Moderate Confidence | Use with caution - monitor outcomes |
Powhatan | 108089 | 4.56 | High Confidence | Safe for algorithmic decisions |
Prince Edward | 57304 | 6.61 | Moderate Confidence | Use with caution - monitor outcomes |
Prince George | 80318 | 6.31 | Moderate Confidence | Use with caution - monitor outcomes |
Prince William | 123193 | 2.19 | High Confidence | Safe for algorithmic decisions |
Pulaski | 59740 | 6.43 | Moderate Confidence | Use with caution - monitor outcomes |
Rappahannock | 98663 | 11.29 | Low Confidence | Requires manual review or additional data |
Richmond | 62708 | 18.57 | Low Confidence | Requires manual review or additional data |
Roanoke | 80872 | 2.34 | High Confidence | Safe for algorithmic decisions |
Rockbridge | 61903 | 5.71 | Moderate Confidence | Use with caution - monitor outcomes |
Rockingham | 73232 | 3.18 | High Confidence | Safe for algorithmic decisions |
Russell | 44088 | 8.87 | Moderate Confidence | Use with caution - monitor outcomes |
Scott | 44535 | 6.38 | Moderate Confidence | Use with caution - monitor outcomes |
Shenandoah | 62149 | 6.68 | Moderate Confidence | Use with caution - monitor outcomes |
Smyth | 45061 | 7.21 | Moderate Confidence | Use with caution - monitor outcomes |
Southampton | 67813 | 10.69 | Low Confidence | Requires manual review or additional data |
Spotsylvania | 105068 | 4.52 | High Confidence | Safe for algorithmic decisions |
Stafford | 128036 | 3.18 | High Confidence | Safe for algorithmic decisions |
Surry | 68655 | 10.95 | Low Confidence | Requires manual review or additional data |
Sussex | 59195 | 11.64 | Low Confidence | Requires manual review or additional data |
Tazewell | 46508 | 7.10 | Moderate Confidence | Use with caution - monitor outcomes |
Warren | 79313 | 8.86 | Moderate Confidence | Use with caution - monitor outcomes |
Washington | 59116 | 4.31 | High Confidence | Safe for algorithmic decisions |
Westmoreland | 56647 | 12.26 | Low Confidence | Requires manual review or additional data |
Wise | 47541 | 6.43 | Moderate Confidence | Use with caution - monitor outcomes |
Wythe | 53921 | 7.47 | Moderate Confidence | Use with caution - monitor outcomes |
York | 105154 | 3.38 | High Confidence | Safe for algorithmic decisions |
Alexandria | 113179 | 2.24 | High Confidence | Safe for algorithmic decisions |
Bristol | 45250 | 6.95 | Moderate Confidence | Use with caution - monitor outcomes |
Buena Vista | 48783 | 14.29 | Low Confidence | Requires manual review or additional data |
Charlottesville | 67177 | 7.63 | Moderate Confidence | Use with caution - monitor outcomes |
Chesapeake | 92703 | 2.37 | High Confidence | Safe for algorithmic decisions |
Colonial Heights | 72216 | 7.53 | Moderate Confidence | Use with caution - monitor outcomes |
Covington | 45737 | 15.38 | Low Confidence | Requires manual review or additional data |
Danville | 41484 | 8.28 | Moderate Confidence | Use with caution - monitor outcomes |
Emporia | 41442 | 14.39 | Low Confidence | Requires manual review or additional data |
Fairfax | 128708 | 9.39 | Moderate Confidence | Use with caution - monitor outcomes |
Falls Church | 164536 | 8.01 | Moderate Confidence | Use with caution - monitor outcomes |
Franklin | 57537 | 9.03 | Moderate Confidence | Use with caution - monitor outcomes |
Fredericksburg | 83445 | 6.36 | Moderate Confidence | Use with caution - monitor outcomes |
Galax | 44612 | 20.94 | Low Confidence | Requires manual review or additional data |
Hampton | 64430 | 3.09 | High Confidence | Safe for algorithmic decisions |
Harrisonburg | 56050 | 3.21 | High Confidence | Safe for algorithmic decisions |
Hopewell | 50661 | 10.43 | Low Confidence | Requires manual review or additional data |
Lexington | 93651 | 20.95 | Low Confidence | Requires manual review or additional data |
Lynchburg | 56243 | 4.99 | High Confidence | Safe for algorithmic decisions |
Manassas | 110559 | 7.71 | Moderate Confidence | Use with caution - monitor outcomes |
Manassas Park | 91673 | 16.15 | Low Confidence | Requires manual review or additional data |
Martinsville | 39127 | 9.27 | Moderate Confidence | Use with caution - monitor outcomes |
Newport News | 63355 | 3.31 | High Confidence | Safe for algorithmic decisions |
Norfolk | 60998 | 3.08 | High Confidence | Safe for algorithmic decisions |
Norton | 36974 | 25.41 | Low Confidence | Requires manual review or additional data |
Petersburg | 46930 | 6.60 | Moderate Confidence | Use with caution - monitor outcomes |
Poquoson | 114503 | 7.31 | Moderate Confidence | Use with caution - monitor outcomes |
Portsmouth | 57154 | 4.32 | High Confidence | Safe for algorithmic decisions |
Radford | 51039 | 14.27 | Low Confidence | Requires manual review or additional data |
Richmond | 59606 | 3.96 | High Confidence | Safe for algorithmic decisions |
Roanoke | 51523 | 4.47 | High Confidence | Safe for algorithmic decisions |
Salem | 68402 | 6.98 | Moderate Confidence | Use with caution - monitor outcomes |
Staunton | 59731 | 8.95 | Moderate Confidence | Use with caution - monitor outcomes |
Suffolk | 87758 | 4.22 | High Confidence | Safe for algorithmic decisions |
Virginia Beach | 87544 | 2.24 | High Confidence | Safe for algorithmic decisions |
Waynesboro | 52519 | 8.79 | Moderate Confidence | Use with caution - monitor outcomes |
Williamsburg | 66815 | 7.32 | Moderate Confidence | Use with caution - monitor outcomes |
Winchester | 62495 | 6.93 | Moderate Confidence | Use with caution - monitor outcomes |
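The recommendation column in the table above is a fixed mapping from reliability category to guidance. As an alternative to `case_when()`, the same mapping can be sketched as a named lookup vector in base R, which keeps the wording in one place if the categories change. The `"Unknown"` entry is a made-up value included only to show that unmatched categories come back as `NA`.

```r
# Lookup table: reliability category -> recommendation (wording from this section)
recommendations <- c(
  "High Confidence"     = "Safe for algorithmic decisions",
  "Moderate Confidence" = "Use with caution - monitor outcomes",
  "Low Confidence"      = "Requires manual review or additional data"
)

# Three categories from the table above, plus a made-up value to show the NA case
reliability <- c("Low Confidence", "High Confidence", "Moderate Confidence", "Unknown")
recommendation <- unname(recommendations[reliability])

print(data.frame(reliability, recommendation))
```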
Key Recommendations:
Your Task: Use your analysis results to provide specific guidance to the department.
- Counties suitable for immediate algorithmic implementation: Albemarle, Arlington, Augusta, Bedford, Bland, Chesterfield, Culpeper, Fairfax (county), Frederick, Hanover, Henrico, Isle of Wight, James City, Loudoun, Powhatan, Prince William, Roanoke (county), Rockingham, Spotsylvania, Stafford, Washington, York, Alexandria, Chesapeake, Hampton, Harrisonburg, Lynchburg, Newport News, Norfolk, Portsmouth, Richmond (city), Roanoke (city), Suffolk, Virginia Beach.
The high-confidence counties have low margins of error. They are considered “safe” because their estimates are statistically sound, enabling reliable automated decision-making with a lower risk of error or bias arising from data quality. Where a county and an independent city share a name, the type is given in parentheses.
- Counties requiring additional oversight: Alleghany, Amherst, Botetourt, Brunswick, Buchanan, Campbell, Caroline, Carroll, Charles City, Dickenson, Dinwiddie, Fauquier, Fluvanna, Franklin (county), Giles, Gloucester, Goochland, Greene, Halifax, Henry, King George, Lancaster, Louisa, Madison, Mecklenburg, Middlesex, Montgomery, New Kent, Orange, Page, Pittsylvania, Prince Edward, Prince George, Pulaski, Rockbridge, Russell, Scott, Shenandoah, Smyth, Tazewell, Warren, Wise, Wythe, Bristol, Charlottesville, Colonial Heights, Danville, Fairfax (city), Falls Church, Franklin (city), Fredericksburg, Manassas, Martinsville, Petersburg, Poquoson, Salem, Staunton, Waynesboro, Williamsburg, Winchester.
Moderate confidence county data should be used cautiously. The data is generally usable but carries enough uncertainty that algorithmic decisions could occasionally be misleading. Though these counties do not need to be manually reviewed every time, it is important to monitor any abnormalities in the data.
- Counties needing alternative approaches: Accomack, Amelia, Appomattox, Bath, Buckingham, Charlotte, Clarke, Craig, Cumberland, Essex, Floyd, Grayson, Greensville, Highland, King and Queen, King William, Lee, Lunenburg, Mathews, Nelson, Northampton, Northumberland, Nottoway, Patrick, Rappahannock, Richmond (county), Southampton, Surry, Sussex, Westmoreland, Buena Vista, Covington, Emporia, Galax, Hopewell, Lexington, Manassas Park, Norton, Radford.
These counties have high margins of error, meaning their estimates rest on fewer survey responses and vary more. Decisions based on this data should be checked manually before implementation.
Questions for Further Investigation
- Are there regional clusters of counties with consistently high margins of error, and how do these clusters relate to population size, demographics, rurality, or economic conditions?
- How do margins of error and confidence levels change over time?
- Do certain demographic characteristics such as race and age contribute to higher MOE or lower confidence in algorithmic outputs?
Technical Notes
Data Sources:
- U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
- Retrieved via tidycensus R package on 09/28/2025
Reproducibility:
- All analysis conducted in RStudio version 2025.05.1+513
- Census API key required for replication
- Complete code and documentation available at: https://musa-5080-fall-2025.github.io/portfolio-setup-jenniferluu6/
Methodology Notes: In processing the data, I selected Virginia as the focus of analysis. During cleaning, I had to make adjustments to county names by removing the word “City” in certain cases to ensure consistency across data sets. This additional step suggests that similar cleaning may be required if applying the same methods to other states. For county selection, I chose locations I was already familiar with, which provided helpful context for interpreting the results. However, this choice may limit reproducibility since another researcher might select different counties and arrive at slightly different insights.
Limitations: The sample size issues affected the reliability of estimates, particularly for smaller demographic groups such as Black and Hispanic/Latino populations, where margins of error sometimes exceeded 100%. Also, county selection was influenced by personal familiarity, which, while helpful for interpretation, may introduce bias and reduce reproducibility.
Submission Checklist
Before submitting your portfolio link on Canvas:
Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/your_file_name.html