# Load required packages
library(tidyverse)
library(tidycensus)
library(scales)
library(RColorBrewer)
library(knitr)
# Set your Census API key if you haven't already (keep real keys out of published documents)
census_api_key("YOUR_CENSUS_API_KEY")
# Previous weeks used Pennsylvania; this analysis uses California
state_choice <- "California"
Assignment 1: Census Data Quality for Policy Decisions
Evaluating Data Reliability for Algorithmic Decision-Making
Assignment Overview
Scenario
You are a data analyst for the California Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.
Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.
Learning Objectives
- Apply dplyr functions to real census data for policy analysis
- Evaluate data quality using margins of error
- Connect technical analysis to algorithmic decision-making
- Identify potential equity implications of data reliability issues
- Create professional documentation for policy stakeholders
Submission Instructions
Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/
Make sure to update your _quarto.yml navigation to include this assignment under an “Assignments” menu.
Part 1: Portfolio Integration
Create this assignment in your portfolio repository under an assignments/assignment_1/ folder structure. Update your navigation menu to include:
- text: Assignments
  menu:
    - href: assignments/assignment_1/assignment1.qmd
      text: "Assignment 1: Census Data Exploration"
If the text contains a special character such as a colon, wrap it in double quotes so that Quarto reads it as text.
Setup
State Selection: I have chosen California for this analysis because it is a well-known state with a large population, and my best friend now lives there.
Part 2: County-Level Resource Assessment
2.1 Data Retrieval
Your Task: Use get_acs() to retrieve county-level data for your chosen state.
Requirements:
- Geography: county level
- Variables: median household income (B19013_001) and total population (B01003_001)
- Year: 2022
- Survey: acs5
- Output format: wide
Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.
# Write your get_acs() code here
# Clean the county names to remove state name and "County"
# Hint: use mutate() with str_remove()
county_data <- get_acs(
  geography = "county",
  state = state_choice,
  variables = c(
    median_income = "B19013_001",
    total_population = "B01003_001"
  ),
  year = 2022,
  survey = "acs5",
  output = "wide"
) %>%
  mutate(
    NAME = str_remove(NAME, ", California"),
    NAME = str_remove(NAME, " County")
  )
# Display the first few rows
head(county_data)
# A tibble: 6 × 6
GEOID NAME median_incomeE median_incomeM total_populationE total_populationM
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 06001 Alame… 122488 1231 1663823 NA
2 06003 Alpine 101125 17442 1515 206
3 06005 Amador 74853 6048 40577 NA
4 06007 Butte 66085 2261 213605 NA
5 06009 Calav… 77526 3875 45674 NA
6 06011 Colusa 69619 5745 21811 NA
2.2 Data Quality Assessment
Your Task: Calculate margin of error percentages and create reliability categories.
Requirements:
- Calculate MOE percentage: (margin of error / estimate) * 100
- Create reliability categories:
  - High Confidence: MOE < 5%
  - Moderate Confidence: MOE 5-10%
  - Low Confidence: MOE > 10%
- Create a flag for unreliable estimates (MOE > 10%)
Hint: Use mutate() with case_when() for the categories.
# Calculate MOE percentage and reliability categories using mutate()
# Create a summary showing count of counties in each reliability category
income_reliability <- county_data %>%
  mutate(
    income_moe_pct = (median_incomeM / median_incomeE) * 100,
    income_reliability = case_when(
      income_moe_pct < 5 ~ "High Confidence",
      income_moe_pct >= 5 & income_moe_pct <= 10 ~ "Moderate Confidence",
      income_moe_pct > 10 ~ "Low Confidence"
    ),
    unreliable_flag = income_moe_pct > 10
  )
income_summary <- income_reliability %>%
  count(income_reliability) %>%
  mutate(percent = n / sum(n) * 100)
income_summary %>% kable()
income_reliability | n | percent |
---|---|---|
High Confidence | 42 | 72.41379 |
Low Confidence | 5 | 8.62069 |
Moderate Confidence | 11 | 18.96552 |
# Hint: use count() and mutate() to add percentages
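As a quick check of the classification rule, the sketch below recomputes the MOE percentage for the six counties shown in the head() output, using base R only. The data frame name `check` and its column names are illustrative and not part of the assignment code.

```r
# Estimates and margins of error copied from the county output above
check <- data.frame(
  county   = c("Alameda", "Alpine", "Amador", "Butte", "Calaveras", "Colusa"),
  estimate = c(122488, 101125, 74853, 66085, 77526, 69619),
  moe      = c(1231, 17442, 6048, 2261, 3875, 5745)
)

# MOE as a percentage of the estimate
check$moe_pct <- check$moe / check$estimate * 100

# Same thresholds as the case_when() call: < 5, 5-10, > 10
check$reliability <- ifelse(
  check$moe_pct < 5, "High Confidence",
  ifelse(check$moe_pct <= 10, "Moderate Confidence", "Low Confidence")
)

print(check)
```

Calaveras sits right at the boundary: its unrounded MOE percentage is just under 5, so it is classified as High Confidence even though it displays as 5.0 when rounded to one decimal.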
2.3 High Uncertainty Counties
Your Task: Identify the 5 counties with the highest MOE percentages.
Requirements:
- Sort by MOE percentage (highest first)
- Select the top 5 counties
- Display: county name, median income, margin of error, MOE percentage, reliability category
- Format as a professional table using kable()
Hint: Use arrange(), slice(), and select() functions.
# Create table of top 5 counties by MOE percentage
high_uncertainty <- income_reliability %>%
  arrange(desc(income_moe_pct)) %>%
  slice(1:5) %>%
  select(
    County = NAME,
    Median_Income = median_incomeE,
    Margin_of_Error = median_incomeM,
    MOE_Percentage = income_moe_pct,
    Reliability = income_reliability
  )
high_uncertainty %>%
  kable(
    digits = 1,
    caption = "Top 5 Counties with Highest Margin of Error Percentages (Median Income, 2022)"
  )
County | Median_Income | Margin_of_Error | MOE_Percentage | Reliability |
---|---|---|---|---|
Mono | 82038 | 15388 | 18.8 | Low Confidence |
Alpine | 101125 | 17442 | 17.2 | Low Confidence |
Sierra | 61108 | 9237 | 15.1 | Low Confidence |
Trinity | 47317 | 5890 | 12.4 | Low Confidence |
Plumas | 67885 | 7772 | 11.4 | Low Confidence |
# Format as table with kable() - include appropriate column names and caption
Data Quality Commentary:
The results show that several counties have relatively high MOE in their reported median household income. If an algorithm uses this data to allocate resources, those counties may be mis-classified and could receive less support than they need.
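ACS margins of error are published at the 90 percent confidence level, so they can be converted to standard errors by dividing by 1.645. The sketch below does this for the five counties in the table above and also reports the coefficient of variation (standard error divided by estimate), another common way to express reliability. The helper names `moe_to_se()` and `cv_pct()` are hypothetical and not part of tidycensus.

```r
# Convert a 90% ACS margin of error to a standard error
moe_to_se <- function(moe, z = 1.645) moe / z

# Coefficient of variation, in percent
cv_pct <- function(estimate, moe) moe_to_se(moe) / estimate * 100

# Values copied from the top-5 table above
top5 <- data.frame(
  county   = c("Mono", "Alpine", "Sierra", "Trinity", "Plumas"),
  estimate = c(82038, 101125, 61108, 47317, 67885),
  moe      = c(15388, 17442, 9237, 5890, 7772)
)

top5$se     <- moe_to_se(top5$moe)
top5$cv_pct <- cv_pct(top5$estimate, top5$moe)

print(top5, digits = 3)
```

Because the coefficient of variation is the MOE percentage divided by a constant, the ranking of counties is unchanged; only the scale and the conventional thresholds differ.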
Part 3: Neighborhood-Level Analysis
3.1 Focus Area Selection
Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.
Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.
# Use filter() to select 2-3 counties from your county_reliability data
# Store the selected counties in a variable called selected_counties
selected_counties <- income_reliability %>%
  filter(
    NAME %in% c("Los Angeles", "Yuba", "Sierra")
  ) %>%
  select(
    County = NAME,
    Median_Income = median_incomeE,
    MOE_Percentage = income_moe_pct,
    Reliability = income_reliability
  )
selected_counties %>%
  kable(
    digits = 1,
    caption = "Selected Counties for Tract-Level Analysis"
  )
County | Median_Income | MOE_Percentage | Reliability |
---|---|---|---|
Los Angeles | 83411 | 0.5 | High Confidence |
Sierra | 61108 | 15.1 | Low Confidence |
Yuba | 66693 | 4.2 | High Confidence |
# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category
Comment on the output: These selected counties illustrate how data quality varies across contexts: Los Angeles and Yuba show relatively reliable estimates, while Sierra has a much higher MOE, raising concerns about the accuracy of income-based classifications there.
3.2 Tract-Level Demographics
Your Task: Get demographic data for census tracts in your selected counties.
Requirements:
- Geography: tract level
- Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
- Use the same state and year as before
- Output format: wide
- Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.
# Define your race/ethnicity variables with descriptive names
race_vars <- c(
  total_pop = "B03002_001",
  white_alone = "B03002_003",
  black_alone = "B03002_004",
  hispanic = "B03002_012"
)
# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter
# Five-digit county GEOIDs, kept for reference; get_acs() below is called with county names
selected_codes <- county_data %>%
  filter(NAME %in% selected_counties$County) %>%
  pull(GEOID)
selected_counties_vec <- c("Los Angeles", "Sierra", "Yuba")
tract_data <- get_acs(
  geography = "tract",
  state = state_choice,
  county = selected_counties_vec,
  variables = race_vars,
  year = 2022,
  survey = "acs5",
  output = "wide"
) %>%
  mutate(
    pct_white = (white_aloneE / total_popE) * 100,
    pct_black = (black_aloneE / total_popE) * 100,
    pct_hispanic = (hispanicE / total_popE) * 100,
    # 2022 tract names are semicolon-delimited: "Census Tract ...; ... County; California"
    County = str_extract(NAME, "(?<=; ).*(?= County; California)"),
    Tract = str_extract(NAME, "Census Tract \\d+(\\.\\d+)?")
  )
head(tract_data)
# A tibble: 6 × 15
GEOID NAME total_popE total_popM white_aloneE white_aloneM black_aloneE
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 06037101110 Cens… 4014 473 2236 312 121
2 06037101122 Cens… 4164 822 2882 790 76
3 06037101220 Cens… 3481 467 1485 269 0
4 06037101221 Cens… 3756 687 1970 658 101
5 06037101222 Cens… 2808 424 1277 440 15
6 06037101300 Cens… 4071 445 2998 448 66
# ℹ 8 more variables: black_aloneM <dbl>, hispanicE <dbl>, hispanicM <dbl>,
# pct_white <dbl>, pct_black <dbl>, pct_hispanic <dbl>, County <chr>,
# Tract <chr>
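The challenge in the requirements points at the GEOID structure: a county GEOID is the two-digit state FIPS code followed by a three-digit county code, and each tract GEOID starts with its county's GEOID. The sketch below splits GEOIDs with base R. The Los Angeles prefix (06037) and the tract GEOID appear in the output above; the values for Sierra (06091) and Yuba (06115) are not shown in this document and are assumed from the standard FIPS county codes.

```r
# County GEOIDs: 2-digit state code + 3-digit county code
# (Sierra and Yuba values are assumptions, see the note above)
county_geoids <- c("Los Angeles" = "06037", "Sierra" = "06091", "Yuba" = "06115")

state_code   <- substr(county_geoids, 1, 2)
county_codes <- substr(county_geoids, 3, 5)
print(county_codes)

# A tract GEOID begins with the 5-digit county GEOID
tract_geoid  <- "06037101110"
tract_county <- substr(tract_geoid, 1, 5)
print(names(county_geoids)[county_geoids == tract_county])
```

Three-digit codes like these can be passed to the county argument of get_acs() in place of county names.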
3.3 Demographic Analysis
Your Task: Analyze the demographic patterns in your selected areas.
tract_data <- tract_data %>%
  mutate(
    # Tract names look like "Census Tract 1011.10; Los Angeles County; California"
    County = str_remove(NAME, "^Census Tract [^;]+; "),
    County = str_remove(County, " County; California$"),
    Tract = str_extract(NAME, "Census Tract \\d+(\\.\\d+)?")
  )
# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract
top_hispanic_tract <- tract_data %>%
  arrange(desc(pct_hispanic)) %>%
  slice(1) %>%
  select(County, Tract, pct_hispanic)
# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group
county_demographics <- tract_data %>%
  group_by(County) %>%
  summarize(
    n_tracts = n(),
    avg_pct_white = mean(pct_white, na.rm = TRUE),
    avg_pct_black = mean(pct_black, na.rm = TRUE),
    avg_pct_hispanic = mean(pct_hispanic, na.rm = TRUE)
  )
# Create a nicely formatted table of your results using kable()
county_demographics %>%
  kable(
    digits = 1,
    caption = "Average Tract-Level Demographics by Selected County"
  )
County | n_tracts | avg_pct_white | avg_pct_black | avg_pct_hispanic |
---|---|---|---|---|
Los Angeles | 2498 | 26.3 | 7.6 | 47.6 |
Sierra | 1 | 86.6 | 0.2 | 11.4 |
Yuba | 19 | 56.0 | 3.2 | 26.7 |
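The county figures above are unweighted means of tract percentages, so a tract with 1,000 residents counts as much as one with 4,000. A pooled share (summing counts before dividing) weights tracts by population instead. The sketch below contrasts the two on made-up numbers; the data frame and its values are hypothetical and not taken from the ACS.

```r
# Hypothetical tracts: county A has one small and one large tract
tracts <- data.frame(
  County    = c("A", "A", "B"),
  total_pop = c(1000, 4000, 2000),
  hispanic  = c(100, 2000, 500)
)
tracts$pct_hispanic <- tracts$hispanic / tracts$total_pop * 100

by_county <- split(tracts, tracts$County)

# Unweighted: average the tract percentages (what mean(pct_hispanic) does)
unweighted <- sapply(by_county, function(d) mean(d$pct_hispanic))

# Pooled: sum the counts first, then take the share
pooled <- sapply(by_county, function(d) sum(d$hispanic) / sum(d$total_pop) * 100)

print(round(rbind(unweighted, pooled), 1))
```

In county A the unweighted mean is 30 percent while the pooled share is 42 percent, because the larger tract has the higher share. Which figure is appropriate depends on whether the question is about the typical tract or the county's residents as a whole.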
Part 4: Comprehensive Data Quality Evaluation
4.1 MOE Analysis for Demographic Variables
Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.
Requirements:
- Calculate MOE percentages for each demographic variable
- Flag tracts where any demographic variable has MOE > 15%
- Create summary statistics
# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)
tract_data <- tract_data %>%
  mutate(
    pct_white_moe = (white_aloneM / white_aloneE) * 100,
    pct_black_moe = (black_aloneM / black_aloneE) * 100,
    pct_hispanic_moe = (hispanicM / hispanicE) * 100,
    high_moe_flag = pct_white_moe > 15 | pct_black_moe > 15 | pct_hispanic_moe > 15
  )
# Create a flag for tracts with high MOE on any demographic variable
moe_summary <- tract_data %>%
  summarize(
    total_tracts = n(),
    tracts_high_moe = sum(high_moe_flag, na.rm = TRUE),
    percent_high_moe = (tracts_high_moe / total_tracts) * 100
  )
# Create summary statistics showing how many tracts have data quality issues
moe_summary %>%
  kable(
    digits = 1,
    caption = "Summary of Tracts with High MOE (>15%) for Demographic Variables"
  )
total_tracts | tracts_high_moe | percent_high_moe |
---|---|---|
2518 | 2517 | 100 |
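With 2,517 of 2,518 tracts flagged, it is worth checking how much of the result comes from very small or zero estimates. In R, dividing a positive MOE by a zero estimate gives Inf, which passes the > 15 test, and 0 / 0 gives NaN. The sketch below shows this behaviour and one possible way to separate zero-estimate cases from the MOE percentage. The vectors are hypothetical values, not tract data.

```r
estimate <- c(2236, 121, 0, 0)   # hypothetical estimates
moe      <- c(312, 95, 13, 0)    # hypothetical margins of error

raw_pct <- moe / estimate * 100
print(raw_pct)        # third element is Inf, fourth is NaN
print(raw_pct > 15)   # Inf > 15 is TRUE; NaN > 15 is NA

# One option: leave the percentage undefined when the estimate is zero
# and track those cases in a separate flag
zero_estimate <- estimate == 0
safe_pct <- ifelse(zero_estimate, NA_real_, moe / estimate * 100)
high_moe <- !zero_estimate & safe_pct > 15

print(data.frame(estimate, moe, safe_pct, zero_estimate, high_moe))
```

Reporting the zero-estimate count alongside the high-MOE count would show whether the 15% threshold or the zero denominators are driving the near-universal flag.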
4.2 Pattern Analysis
Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.
# Group tracts by whether they have high MOE issues
# Calculate average characteristics for each group:
# - population size, demographic percentages
# Use group_by() and summarize() to create this comparison
moe_pattern <- tract_data %>%
  group_by(high_moe_flag) %>%
  summarize(
    n_tracts = n(),
    avg_total_population = mean(total_popE, na.rm = TRUE),
    avg_pct_white = mean(pct_white, na.rm = TRUE),
    avg_pct_black = mean(pct_black, na.rm = TRUE),
    avg_pct_hispanic = mean(pct_hispanic, na.rm = TRUE)
  )
# Create a professional table showing the patterns
moe_pattern %>%
  kable(
    digits = 1,
    caption = "Comparison of Tract Characteristics by Data Reliability (High MOE Flag)"
  )
high_moe_flag | n_tracts | avg_total_population | avg_pct_white | avg_pct_black | avg_pct_hispanic |
---|---|---|---|---|---|
FALSE | 1 | 8994.0 | 16.8 | 33.6 | 41.4 |
TRUE | 2517 | 3977.9 | 26.6 | 7.5 | 47.5 |
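Pattern Analysis: Nearly every tract (2,517 of 2,518) is flagged, so the comparison group is a single tract and the demographic differences in the table cannot be generalized. The clearest signal is population size: flagged tracts average about 3,978 residents, compared with 8,994 in the one unflagged tract, which is consistent with small survey samples driving high MOE. Groups that make up a small share of a tract, such as Black residents at 7.5% on average in flagged tracts, are especially likely to exceed the 15% threshold. This concentration of uncertainty could lead algorithms to under-represent the needs of small communities and small population groups.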
Part 5: Policy Recommendations
5.1 Analysis Integration and Professional Summary
Your Task: Write an executive summary that integrates findings from all four analyses.
Executive Summary Requirements:
1. Overall Pattern Identification: What are the systematic patterns across all your analyses?
2. Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings?
3. Root Cause Analysis: What underlying factors drive both data quality issues and bias risk?
4. Strategic Recommendations: What should the Department implement to address these systematic issues?
Executive Summary:
1. Overall Pattern Identification: Data reliability varies with population size. Larger counties have highly reliable estimates, and MOE percentages at the tract level are much higher than at the county level, indicating that estimates for smaller populations are more prone to being unreliable.
2. Equity Assessment: Communities with smaller populations face the greatest risk of being misrepresented, along with minority groups whose tract-level counts are small.
3. Root Cause Analysis: Data quality issues largely stem from limited survey sample sizes in sparsely populated or demographically diverse areas. Smaller samples increase MOE, leading to higher uncertainty.
4. Strategic Recommendations: The Department should adjust algorithmic approaches to account for data reliability, for example by weighting estimates by MOE or by implementing safeguards for high-uncertainty communities through additional outreach, data validation, and supplemental surveys.
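One way to build such a safeguard is to check whether a county's 90 percent interval (estimate plus or minus MOE) straddles an eligibility cutoff before letting the algorithm decide. The sketch below applies this to the five high-uncertainty counties from Section 2.3. The $65,000 cutoff and the function name `classify_against_cutoff()` are hypothetical, chosen only for illustration.

```r
# Compare the interval [estimate - moe, estimate + moe] with a cutoff
classify_against_cutoff <- function(estimate, moe, cutoff) {
  lower <- estimate - moe
  upper <- estimate + moe
  ifelse(
    upper < cutoff, "Below cutoff",
    ifelse(lower > cutoff, "Above cutoff", "Uncertain - manual review")
  )
}

# Values copied from the top-5 MOE table in Section 2.3
counties <- data.frame(
  county   = c("Mono", "Alpine", "Sierra", "Trinity", "Plumas"),
  estimate = c(82038, 101125, 61108, 47317, 67885),
  moe      = c(15388, 17442, 9237, 5890, 7772)
)

# Hypothetical eligibility cutoff for median household income
counties$decision <- classify_against_cutoff(
  counties$estimate, counties$moe, cutoff = 65000
)

print(counties)
```

With this cutoff, Sierra and Plumas would be routed to manual review because their intervals contain the cutoff, while Trinity falls below it and Mono and Alpine fall above it.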
5.2 Specific Recommendations
Your Task: Create a decision framework for algorithm implementation.
# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category
# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"
# - Low Confidence: "Requires manual review or additional data"
algorithm_recommendations <- income_reliability %>%
  select(
    County = NAME,
    Median_Income = median_incomeE,
    MOE_Percentage = income_moe_pct,
    Reliability = income_reliability
  ) %>%
  mutate(
    Recommendation = case_when(
      Reliability == "High Confidence" ~ "Safe for algorithmic decisions",
      Reliability == "Moderate Confidence" ~ "Use with caution - monitor outcomes",
      Reliability == "Low Confidence" ~ "Requires manual review or additional data"
    )
  )
# Format as a professional table with kable()
algorithm_recommendations %>%
  kable(
    digits = 1,
    caption = "Algorithmic Decision Framework Based on County Data Reliability"
  )
County | Median_Income | MOE_Percentage | Reliability | Recommendation |
---|---|---|---|---|
Alameda | 122488 | 1.0 | High Confidence | Safe for algorithmic decisions |
Alpine | 101125 | 17.2 | Low Confidence | Requires manual review or additional data |
Amador | 74853 | 8.1 | Moderate Confidence | Use with caution - monitor outcomes |
Butte | 66085 | 3.4 | High Confidence | Safe for algorithmic decisions |
Calaveras | 77526 | 5.0 | High Confidence | Safe for algorithmic decisions |
Colusa | 69619 | 8.3 | Moderate Confidence | Use with caution - monitor outcomes |
Contra Costa | 120020 | 1.2 | High Confidence | Safe for algorithmic decisions |
Del Norte | 61149 | 7.2 | Moderate Confidence | Use with caution - monitor outcomes |
El Dorado | 99246 | 3.4 | High Confidence | Safe for algorithmic decisions |
Fresno | 67756 | 1.4 | High Confidence | Safe for algorithmic decisions |
Glenn | 64033 | 6.2 | Moderate Confidence | Use with caution - monitor outcomes |
Humboldt | 57881 | 3.7 | High Confidence | Safe for algorithmic decisions |
Imperial | 53847 | 4.1 | High Confidence | Safe for algorithmic decisions |
Inyo | 63417 | 8.6 | Moderate Confidence | Use with caution - monitor outcomes |
Kern | 63883 | 2.1 | High Confidence | Safe for algorithmic decisions |
Kings | 68540 | 3.3 | High Confidence | Safe for algorithmic decisions |
Lake | 56259 | 4.3 | High Confidence | Safe for algorithmic decisions |
Lassen | 59515 | 6.0 | Moderate Confidence | Use with caution - monitor outcomes |
Los Angeles | 83411 | 0.5 | High Confidence | Safe for algorithmic decisions |
Madera | 73543 | 3.9 | High Confidence | Safe for algorithmic decisions |
Marin | 142019 | 2.9 | High Confidence | Safe for algorithmic decisions |
Mariposa | 60021 | 8.8 | Moderate Confidence | Use with caution - monitor outcomes |
Mendocino | 61335 | 3.6 | High Confidence | Safe for algorithmic decisions |
Merced | 64772 | 3.3 | High Confidence | Safe for algorithmic decisions |
Modoc | 54962 | 9.8 | Moderate Confidence | Use with caution - monitor outcomes |
Mono | 82038 | 18.8 | Low Confidence | Requires manual review or additional data |
Monterey | 91043 | 2.1 | High Confidence | Safe for algorithmic decisions |
Napa | 105809 | 2.8 | High Confidence | Safe for algorithmic decisions |
Nevada | 79395 | 4.8 | High Confidence | Safe for algorithmic decisions |
Orange | 109361 | 0.8 | High Confidence | Safe for algorithmic decisions |
Placer | 109375 | 1.7 | High Confidence | Safe for algorithmic decisions |
Plumas | 67885 | 11.4 | Low Confidence | Requires manual review or additional data |
Riverside | 84505 | 1.3 | High Confidence | Safe for algorithmic decisions |
Sacramento | 84010 | 1.0 | High Confidence | Safe for algorithmic decisions |
San Benito | 104451 | 5.2 | Moderate Confidence | Use with caution - monitor outcomes |
San Bernardino | 77423 | 1.0 | High Confidence | Safe for algorithmic decisions |
San Diego | 96974 | 1.0 | High Confidence | Safe for algorithmic decisions |
San Francisco | 136689 | 1.4 | High Confidence | Safe for algorithmic decisions |
San Joaquin | 82837 | 1.7 | High Confidence | Safe for algorithmic decisions |
San Luis Obispo | 90158 | 2.6 | High Confidence | Safe for algorithmic decisions |
San Mateo | 149907 | 1.7 | High Confidence | Safe for algorithmic decisions |
Santa Barbara | 92332 | 2.0 | High Confidence | Safe for algorithmic decisions |
Santa Clara | 153792 | 1.0 | High Confidence | Safe for algorithmic decisions |
Santa Cruz | 104409 | 3.0 | High Confidence | Safe for algorithmic decisions |
Shasta | 68347 | 3.6 | High Confidence | Safe for algorithmic decisions |
Sierra | 61108 | 15.1 | Low Confidence | Requires manual review or additional data |
Siskiyou | 53898 | 4.9 | High Confidence | Safe for algorithmic decisions |
Solano | 97037 | 1.8 | High Confidence | Safe for algorithmic decisions |
Sonoma | 99266 | 2.0 | High Confidence | Safe for algorithmic decisions |
Stanislaus | 74872 | 1.8 | High Confidence | Safe for algorithmic decisions |
Sutter | 72654 | 4.7 | High Confidence | Safe for algorithmic decisions |
Tehama | 59029 | 7.0 | Moderate Confidence | Use with caution - monitor outcomes |
Trinity | 47317 | 12.4 | Low Confidence | Requires manual review or additional data |
Tulare | 64474 | 2.3 | High Confidence | Safe for algorithmic decisions |
Tuolumne | 70432 | 6.7 | Moderate Confidence | Use with caution - monitor outcomes |
Ventura | 102141 | 1.5 | High Confidence | Safe for algorithmic decisions |
Yolo | 85097 | 2.7 | High Confidence | Safe for algorithmic decisions |
Yuba | 66693 | 4.2 | High Confidence | Safe for algorithmic decisions |
Key Recommendations:
Your Task: Use your analysis results to provide specific guidance to the department.
Counties suitable for immediate algorithmic implementation: Los Angeles and Yuba, among the counties selected for detailed study. Both have high-confidence data with MOE percentages below 5%, so algorithms can use their county-level income estimates for resource allocation.
Counties requiring additional oversight: None of the three counties selected for tract-level study falls in this category, but the county table lists 11 moderate-confidence counties, such as Amador, Colusa, and Del Norte. For these, I recommend periodic review of algorithm outputs and the use of supplemental indicators or multiple data sources to reduce uncertainty.
Counties needing alternative approaches: Sierra, along with the other low-confidence counties (Mono, Alpine, Trinity, and Plumas). I suggest not relying solely on algorithmic outputs here; additional survey data or qualitative local knowledge should inform decisions.
Questions for Further Investigation
1. Are high MOE tracts clustered geographically?
2. How do demographic changes over time affect the reliability of income and population estimates for small counties?
Technical Notes
Data Sources:
- U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
- Retrieved via tidycensus R package on 2025.9.27
Reproducibility:
- All analysis conducted in R version 4.5.1
- Census API key required for replication
- Complete code and documentation available at: https://musa-5080-fall-2025.github.io/portfolio-setup-xqdqc-hub/
Methodology Notes: Counties for tract-level analysis were selected to contrast reliability levels: Los Angeles and Yuba are high confidence and Sierra is low confidence; no moderate-confidence county was included. MOE percentages were calculated for income and demographic variables, and county and tract names were cleaned for consistency.
Limitations: Reliance on ACS data alone excludes other sources that could improve decision-making accuracy.
Submission Checklist
Before submitting your portfolio link on Canvas:
Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/your_file_name.html