# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
library(tidycensus)
library(tidyverse)
library(knitr)
library(kableExtra)
# Set your Census API key
# census_api_key("YOUR_CENSUS_API_KEY", install = TRUE)  # run once; never publish a real key
# Choose your state for analysis - assign it to a variable called my_state
my_state <- "Pennsylvania"
Assignment 1: Census Data Quality for Policy Decisions
Evaluating Data Reliability for Algorithmic Decision-Making
Assignment Overview
Scenario
You are a data analyst for the [Your State] Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.
Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.
Learning Objectives
- Apply dplyr functions to real census data for policy analysis
- Evaluate data quality using margins of error
- Connect technical analysis to algorithmic decision-making
- Identify potential equity implications of data reliability issues
- Create professional documentation for policy stakeholders
Submission Instructions
Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/
Make sure to update your _quarto.yml navigation to include this assignment under an “Assignments” menu.
Part 1: Portfolio Integration
Create this assignment in your portfolio repository under an assignments/assignment_1/ folder structure. Update your navigation menu to include:
- text: Assignments
  menu:
    - href: assignments/assignment_1/your_file_name.qmd
      text: "Assignment 1: Census Data Exploration"
If the text contains a special character such as a comma or colon, wrap it in double quotation marks so that Quarto reads it as text.
Setup
State Selection: I have chosen Pennsylvania for this analysis because it is where I currently live.
Part 2: County-Level Resource Assessment
2.1 Data Retrieval
Your Task: Use get_acs() to retrieve county-level data for your chosen state.
Requirements:
- Geography: county level
- Variables: median household income (B19013_001) and total population (B01003_001)
- Year: 2022
- Survey: acs5
- Output format: wide
Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.
# Write your get_acs() code here
acs_vars <- c(
median_income = "B19013_001",
total_population = "B01003_001"
)
county_data <- get_acs(
geography = "county",
variables = acs_vars,
state = my_state,
year = 2022,
survey = "acs5",
output = "wide"
)
# Clean the county names to remove state name and "County"
county_data <- county_data %>%
mutate(
county = str_remove(NAME, " County,.*")
) %>%
select(county, everything())
# Hint: use mutate() with str_remove()
# Display the first few rows
head(county_data)
# A tibble: 6 × 7
county GEOID NAME median_incomeE median_incomeM total_populationE
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Adams 42001 Adams County,… 78975 3334 104604
2 Allegheny 42003 Allegheny Cou… 72537 869 1245310
3 Armstrong 42005 Armstrong Cou… 61011 2202 65538
4 Beaver 42007 Beaver County… 67194 1531 167629
5 Bedford 42009 Bedford Count… 58337 2606 47613
6 Berks 42011 Berks County,… 74617 1191 428483
# ℹ 1 more variable: total_populationM <dbl>
2.2 Data Quality Assessment
Your Task: Calculate margin of error percentages and create reliability categories.
Requirements:
- Calculate MOE percentage: (margin of error / estimate) * 100
- Create reliability categories:
  - High Confidence: MOE < 5%
  - Moderate Confidence: MOE 5-10%
  - Low Confidence: MOE > 10%
- Create a flag for unreliable estimates (MOE > 10%)
Hint: Use mutate() with case_when() for the categories.
# Calculate MOE percentage and reliability categories using mutate()
county_reliability <- county_data %>%
mutate(
moe_pct_income = (median_incomeM / median_incomeE) * 100,
reliability = case_when(
moe_pct_income < 5 ~ "High Confidence",
moe_pct_income >= 5 & moe_pct_income <= 10 ~ "Moderate Confidence",
moe_pct_income > 10 ~ "Low Confidence",
TRUE ~ "Missing"
)
)
# Create a summary showing count of counties in each reliability category
reliability_summary <- county_reliability %>%
count(reliability) %>%
mutate(
pct = round(100 * n / sum(n), 1)
)
reliability_summary
# A tibble: 2 × 3
reliability n pct
<chr> <int> <dbl>
1 High Confidence 57 85.1
2 Moderate Confidence 10 14.9
# Hint: use count() and mutate() to add percentages
2.3 High Uncertainty Counties
Your Task: Identify the 5 counties with the highest MOE percentages.
Requirements:
- Sort by MOE percentage (highest first)
- Select the top 5 counties
- Display: county name, median income, margin of error, MOE percentage, reliability category
- Format as a professional table using kable()
Hint: Use arrange(), slice(), and select() functions.
# Create table of top 5 counties by MOE percentage
library(knitr)
top_uncertainty <- county_reliability %>%
arrange(desc(moe_pct_income)) %>%
slice_head(n = 5) %>%
select(
county,
median_incomeE,
median_incomeM,
moe_pct_income,
reliability
)
# Format as table with kable() - include appropriate column names and caption
kable(
top_uncertainty,
caption = "Top 5 Counties with Highest Income MOE Percentage",
col.names = c(
"County",
"Median Income (Estimate)",
"Median Income (MOE)",
"MOE Percentage",
"Reliability Category"
),
digits = 1
)
| County | Median Income (Estimate) | Median Income (MOE) | MOE Percentage | Reliability Category |
|---|---|---|---|---|
| Forest | 46188 | 4612 | 10.0 | Moderate Confidence |
| Sullivan | 62910 | 5821 | 9.3 | Moderate Confidence |
| Union | 64914 | 4753 | 7.3 | Moderate Confidence |
| Montour | 72626 | 5146 | 7.1 | Moderate Confidence |
| Elk | 61672 | 4091 | 6.6 | Moderate Confidence |
Data Quality Commentary: Counties such as Forest and Sullivan show higher uncertainty in their income estimates, meaning algorithms could misjudge local needs if they rely on these values. This uncertainty is often linked to smaller populations and limited ACS survey samples in rural areas.
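The MOE percentages in the table can be checked by hand. The sketch below recomputes Forest County's value from the two published numbers and then converts the margin of error to a coefficient of variation (CV), another common reliability measure. It assumes the standard ACS convention that published MOEs correspond to a 90% confidence level (z = 1.645); the variable names are illustrative and not part of the assignment code.

```r
# Forest County values taken from the table above
estimate <- 46188   # median household income (estimate)
moe      <- 4612    # published margin of error (assumed 90% confidence level)

# MOE as a percentage of the estimate, as in section 2.2
moe_pct <- 100 * moe / estimate

# Standard error and coefficient of variation (assumes z = 1.645 for ACS MOEs)
se     <- moe / 1.645
cv_pct <- 100 * se / estimate

# Same thresholds as the case_when() in section 2.2
reliability <- if (moe_pct < 5) {
  "High Confidence"
} else if (moe_pct <= 10) {
  "Moderate Confidence"
} else {
  "Low Confidence"
}

round(moe_pct, 2)
round(cv_pct, 2)
reliability
```

Forest's MOE percentage lands just under the 10% cut-off, so a small change in either number would move the county into the Low Confidence category.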
Part 3: Neighborhood-Level Analysis
3.1 Focus Area Selection
Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.
Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.
# Use filter() to select 2-3 counties from your county_reliability data
library(dplyr)
library(stringr)
pick_county_by_conf <- function(df, conf) {
df_conf <- df %>% filter(reliability == conf)
if (nrow(df_conf) == 0) {
return(tibble())
}
med_moe <- median(df_conf$moe_pct_income, na.rm = TRUE)
picked <- df_conf %>%
mutate(diff = abs(moe_pct_income - med_moe)) %>%
arrange(diff) %>%
slice_head(n = 1)
return(picked %>% select(GEOID, county, median_incomeE, median_incomeM, moe_pct_income, reliability))
}
# Store the selected counties in a variable called selected_counties
hc_pick <- pick_county_by_conf(county_reliability, "High Confidence")
mc_pick <- pick_county_by_conf(county_reliability, "Moderate Confidence")
lc_pick <- pick_county_by_conf(county_reliability, "Low Confidence")
# Display the selected counties with their key characteristics
if (nrow(hc_pick) == 0) {
hc_pick <- county_reliability %>% arrange(moe_pct_income) %>% slice_head(n = 1)
}
if (nrow(mc_pick) == 0) {
mc_pick <- county_reliability %>% arrange(desc(moe_pct_income)) %>% slice_head(n = 1)
}
if (nrow(lc_pick) == 0) {
lc_pick <- county_reliability %>% arrange(desc(moe_pct_income)) %>% slice_head(n = 1)
}
# Show: county name, median income, MOE percentage, reliability category
selected_counties <- bind_rows(hc_pick, mc_pick, lc_pick) %>%
distinct(GEOID, .keep_all = TRUE) %>%
mutate(
median_income = median_incomeE,
median_income_moe = median_incomeM,
moe_pct = round(moe_pct_income, 2)
) %>%
select(GEOID, county, median_income, median_income_moe, moe_pct, reliability)
selected_counties %>%
mutate(
median_income = scales::dollar(median_income),
median_income_moe = scales::dollar(median_income_moe)
) %>%
knitr::kable(caption = "Selected counties for tract-level analysis (representing different reliability levels)")
| GEOID | county | median_income | median_income_moe | moe_pct | reliability |
|---|---|---|---|---|---|
| 42115 | Susquehanna | $63,968 | $2,010 | 3.14 | High Confidence |
| 42047 | Elk | $61,672 | $4,091 | 6.63 | Moderate Confidence |
| 42053 | Forest | $46,188 | $4,612 | 9.99 | Moderate Confidence |
Comment on the output: [write something :)]
3.2 Tract-Level Demographics
Your Task: Get demographic data for census tracts in your selected counties.
Requirements:
- Geography: tract level
- Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
- Use the same state and year as before
- Output format: wide
- Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.
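One way to approach the county-code challenge: a county GEOID is the two-digit state FIPS code followed by the three-digit county FIPS code. A minimal sketch, using the GEOIDs of the three counties selected in section 3.1 (the object names here are illustrative):

```r
# County GEOIDs from the selected_counties table in section 3.1
county_geoids <- c(Susquehanna = "42115", Elk = "42047", Forest = "42053")

# Characters 1-2 are the state FIPS code, characters 3-5 the county FIPS code
state_fips  <- substr(county_geoids, 1, 2)
county_fips <- substr(county_geoids, 3, 5)

state_fips
county_fips
```

The three-digit codes can be passed to the county argument of get_acs() in place of county names. Tract GEOIDs extend the same pattern with six more digits (for example 42047950100).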
library(tidycensus)
library(dplyr)
library(stringr)
library(knitr)
library(kableExtra)
# Define your race/ethnicity variables with descriptive names
tract_vars <- c(
total_pop = "B03002_001",
white = "B03002_003",
black = "B03002_004",
hispanic = "B03002_012"
)
counties_to_get <- c("Susquehanna", "Elk", "Forest")
# Use get_acs() to retrieve tract-level data
tract_raw <- get_acs(
geography = "tract",
state = my_state,
county = counties_to_get,
variables = tract_vars,
year = 2022,
survey = "acs5",
output = "wide",
geometry = FALSE
)
# Hint: You may need to specify county codes in the county parameter
# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations
tract_demo <- tract_raw %>%
transmute(
GEOID,
tract_name = NAME,
county = str_extract(NAME, paste(counties_to_get, collapse = "|")),
total_pop = total_popE,
total_pop_moe = total_popM,
white = whiteE, white_moe = whiteM,
black = blackE, black_moe = blackM,
hispanic = hispanicE, hispanic_moe = hispanicM,
pct_white = if_else(total_pop > 0, 100 * white / total_pop, NA_real_),
pct_black = if_else(total_pop > 0, 100 * black / total_pop, NA_real_),
pct_hispanic = if_else(total_pop > 0, 100 * hispanic / total_pop, NA_real_)
) %>%
mutate(
pct_white = round(pct_white, 1),
pct_black = round(pct_black, 1),
pct_hispanic = round(pct_hispanic, 1)
)
# Add readable tract and county name columns using str_extract() or similar
tract_demo %>%
arrange(county, GEOID) %>%
group_by(county) %>%
slice_head(n = 8) %>%
ungroup() %>%
select(GEOID, tract_name, county, total_pop, pct_white, pct_black, pct_hispanic) %>%
kable(
caption = "Sample tracts (first 8 per selected county)",
col.names = c("GEOID", "Tract name", "County", "Total pop", "% White", "% Black", "% Hispanic")
) %>%
kable_styling(full_width = FALSE)
| GEOID | Tract name | County | Total pop | % White | % Black | % Hispanic |
|---|---|---|---|---|---|---|
| 42047950100 | Census Tract 9501; Elk County; Pennsylvania | Elk | 1557 | 94.3 | 0.0 | 0.2 |
| 42047950200 | Census Tract 9502; Elk County; Pennsylvania | Elk | 2929 | 98.5 | 0.0 | 0.8 |
| 42047950400 | Census Tract 9504; Elk County; Pennsylvania | Elk | 4014 | 95.4 | 0.2 | 0.4 |
| 42047950500 | Census Tract 9505; Elk County; Pennsylvania | Elk | 2293 | 97.3 | 0.0 | 2.0 |
| 42047950900 | Census Tract 9509; Elk County; Pennsylvania | Elk | 2475 | 96.4 | 0.0 | 0.0 |
| 42047951000 | Census Tract 9510; Elk County; Pennsylvania | Elk | 4927 | 97.8 | 0.9 | 0.0 |
| 42047951100 | Census Tract 9511; Elk County; Pennsylvania | Elk | 5360 | 94.1 | 0.4 | 2.3 |
| 42047951200 | Census Tract 9512; Elk County; Pennsylvania | Elk | 2018 | 97.0 | 0.0 | 0.0 |
| 42053530100 | Census Tract 5301; Forest County; Pennsylvania | Forest | 4258 | 60.1 | 23.1 | 7.0 |
| 42053530200 | Census Tract 5302; Forest County; Pennsylvania | Forest | 2701 | 82.3 | 4.0 | 7.7 |
| 42115032000 | Census Tract 320; Susquehanna County; Pennsylvania | Susquehanna | 2770 | 93.8 | 0.3 | 3.0 |
| 42115032100 | Census Tract 321; Susquehanna County; Pennsylvania | Susquehanna | 3591 | 95.8 | 0.2 | 1.6 |
| 42115032200 | Census Tract 322; Susquehanna County; Pennsylvania | Susquehanna | 3442 | 94.0 | 0.2 | 2.7 |
| 42115032300 | Census Tract 323; Susquehanna County; Pennsylvania | Susquehanna | 3455 | 97.9 | 0.4 | 0.2 |
| 42115032401 | Census Tract 324.01; Susquehanna County; Pennsylvania | Susquehanna | 1870 | 94.8 | 0.4 | 3.0 |
| 42115032402 | Census Tract 324.02; Susquehanna County; Pennsylvania | Susquehanna | 2103 | 96.6 | 0.6 | 1.2 |
| 42115032500 | Census Tract 325; Susquehanna County; Pennsylvania | Susquehanna | 3835 | 96.2 | 0.0 | 2.1 |
| 42115032600 | Census Tract 326; Susquehanna County; Pennsylvania | Susquehanna | 4026 | 94.8 | 0.2 | 1.3 |
3.3 Demographic Analysis
Your Task: Analyze the demographic patterns in your selected areas.
# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract
top_hispanic <- tract_demo %>%
filter(!is.na(pct_hispanic)) %>%
arrange(desc(pct_hispanic)) %>%
slice_head(n = 1) %>%
select(GEOID, tract_name, county, total_pop, pct_hispanic)
knitr::kable(
top_hispanic,
caption = "Tract with highest % Hispanic among selected counties",
col.names = c("GEOID", "Tract Name", "County", "Total Population", "% Hispanic")
)
| GEOID | Tract Name | County | Total Population | % Hispanic |
|---|---|---|---|---|
| 42053530200 | Census Tract 5302; Forest County; Pennsylvania | Forest | 2701 | 7.7 |
# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group
county_summary <- tract_demo %>%
group_by(county) %>%
summarize(
n_tracts = n(),
avg_pct_white = round(mean(pct_white, na.rm = TRUE), 1),
avg_pct_black = round(mean(pct_black, na.rm = TRUE), 1),
avg_pct_hispanic = round(mean(pct_hispanic, na.rm = TRUE), 1)
) %>%
arrange(desc(avg_pct_hispanic))
# Create a nicely formatted table of your results using kable()
knitr::kable(
county_summary,
caption = "Average tract-level demographics by county",
col.names = c("County", "N tracts", "Avg % White", "Avg % Black", "Avg % Hispanic")
)
| County | N tracts | Avg % White | Avg % Black | Avg % Hispanic |
|---|---|---|---|---|
| Forest | 2 | 71.2 | 13.6 | 7.3 |
| Susquehanna | 12 | 94.9 | 0.4 | 2.2 |
| Elk | 9 | 95.9 | 0.5 | 0.7 |
Part 4: Comprehensive Data Quality Evaluation
4.1 MOE Analysis for Demographic Variables
Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.
Requirements:
- Calculate MOE percentages for each demographic variable
- Flag tracts where any demographic variable has MOE > 15%
- Create summary statistics
# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)
tract_moe <- tract_demo %>%
mutate(
white_moe_pct = if_else(white > 0, 100 * white_moe / white, NA_real_),
black_moe_pct = if_else(black > 0, 100 * black_moe / black, NA_real_),
hispanic_moe_pct = if_else(hispanic > 0, 100 * hispanic_moe / hispanic, NA_real_),
high_moe_flag = if_else(
white_moe_pct > 15 | black_moe_pct > 15 | hispanic_moe_pct > 15,
TRUE, FALSE
)
)
# Create a flag for tracts with high MOE on any demographic variable
# Use logical operators (| for OR) in an ifelse() statement
tract_moe %>%
arrange(county, GEOID) %>%
group_by(county) %>%
slice_head(n = 5) %>%
ungroup() %>%
select(GEOID, county,
white, white_moe_pct,
black, black_moe_pct,
hispanic, hispanic_moe_pct,
high_moe_flag) %>%
kable(
caption = "MOE% for race/ethnicity estimates (sample tracts)",
col.names = c("GEOID", "County",
"White (count)", "White MOE%",
"Black (count)", "Black MOE%",
"Hispanic (count)", "Hispanic MOE%",
"High MOE Flag (>15%)")
) %>%
kable_styling(full_width = FALSE)
| GEOID | County | White (count) | White MOE% | Black (count) | Black MOE% | Hispanic (count) | Hispanic MOE% | High MOE Flag (>15%) |
|---|---|---|---|---|---|---|---|---|
| 42047950100 | Elk | 1469 | 11.300204 | 0 | NA | 3 | 133.33333 | TRUE |
| 42047950200 | Elk | 2884 | 8.495146 | 0 | NA | 22 | 109.09091 | TRUE |
| 42047950400 | Elk | 3831 | 2.584182 | 7 | 242.85714 | 16 | 131.25000 | TRUE |
| 42047950500 | Elk | 2232 | 10.528674 | 0 | NA | 46 | 71.73913 | TRUE |
| 42047950900 | Elk | 2386 | 9.555742 | 0 | NA | 0 | NA | NA |
| 42053530100 | Forest | 2558 | 6.802189 | 983 | 14.85249 | 299 | 25.08361 | TRUE |
| 42053530200 | Forest | 2223 | 7.827261 | 109 | 124.77064 | 209 | 35.88517 | TRUE |
| 42115032000 | Susquehanna | 2597 | 7.893724 | 8 | 87.50000 | 84 | 48.80952 | TRUE |
| 42115032100 | Susquehanna | 3439 | 10.293690 | 6 | 150.00000 | 59 | 54.23729 | TRUE |
| 42115032200 | Susquehanna | 3237 | 9.576769 | 6 | 183.33333 | 94 | 45.74468 | TRUE |
| 42115032300 | Susquehanna | 3384 | 9.072104 | 14 | 114.28571 | 6 | 83.33333 | TRUE |
| 42115032401 | Susquehanna | 1773 | 7.727016 | 7 | 71.42857 | 56 | 62.50000 | TRUE |
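The NA in the flag column for tract 42047950900 follows from R's three-valued logic: a comparison against NA yields NA, and FALSE | NA is NA, whereas TRUE | NA is TRUE. The sketch below reproduces that behavior with two rows from the table and shows one possible NA-safe variant that treats a missing MOE percentage as "not over the threshold". Whether a zero-count estimate should instead be treated as unreliable is a judgment call the assignment does not settle. The object names are illustrative.

```r
# MOE percentages for two Elk County tracts, copied from the table above
moe <- data.frame(
  GEOID    = c("42047950100", "42047950900"),
  white    = c(11.300204, 9.555742),
  black    = c(NA_real_, NA_real_),   # estimate was 0, so MOE% is undefined
  hispanic = c(133.33333, NA_real_)
)

# Original approach: NA propagates unless some comparison is TRUE
flag_original <- moe$white > 15 | moe$black > 15 | moe$hispanic > 15

# NA-safe variant: count the comparisons that are TRUE, ignoring NAs
over <- moe[, c("white", "black", "hispanic")] > 15
flag_na_safe <- rowSums(over, na.rm = TRUE) > 0

data.frame(GEOID = moe$GEOID, flag_original, flag_na_safe)
```

With the NA-safe variant, tracts like 42047950900 would be counted as FALSE rather than NA, which also keeps them out of a separate NA group in later group_by() summaries.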
# Create summary statistics showing how many tracts have data quality issues
moe_summary <- tract_moe %>%
group_by(county) %>%
summarize(
n_tracts = n(),
n_high_moe = sum(high_moe_flag, na.rm = TRUE),
pct_high_moe = round(100 * n_high_moe / n_tracts, 1)
)
moe_summary %>%
kable(
caption = "Summary of tracts with high MOE (>15%)",
col.names = c("County", "N tracts", "High-MOE tracts", "% High-MOE")
) %>%
kable_styling(full_width = FALSE)
| County | N tracts | High-MOE tracts | % High-MOE |
|---|---|---|---|
| Elk | 9 | 7 | 77.8 |
| Forest | 2 | 2 | 100.0 |
| Susquehanna | 12 | 12 | 100.0 |
4.2 Pattern Analysis
Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.
# Group tracts by whether they have high MOE issues
# Calculate average characteristics for each group:
# - population size, demographic percentages
pattern_summary <- tract_moe %>%
group_by(high_moe_flag) %>%
summarize(
n_tracts = n(),
avg_pop = mean(total_pop, na.rm = TRUE),
avg_pct_white = mean(pct_white, na.rm = TRUE),
avg_pct_black = mean(pct_black, na.rm = TRUE),
avg_pct_hispanic = mean(pct_hispanic, na.rm = TRUE)
)
# Use group_by() and summarize() to create this comparison
# Create a professional table showing the patterns
library(knitr)
pattern_summary %>%
kable(
digits = 1,
col.names = c("High MOE Flag", "N tracts", "Avg Pop", "Avg % White", "Avg % Black", "Avg % Hispanic"),
caption = "Comparison of tract characteristics by data quality group"
)
| High MOE Flag | N tracts | Avg Pop | Avg % White | Avg % Black | Avg % Hispanic |
|---|---|---|---|---|---|
| TRUE | 21 | 3423.4 | 92.9 | 1.7 | 2.3 |
| NA | 2 | 2246.5 | 96.7 | 0.0 | 0.0 |
Pattern Analysis: The pattern analysis shows that nearly all tracts in the study area fall into the high-MOE category. On average, these high-MOE tracts have populations of about 3,400 residents and are predominantly White (93%), with very small Black (1.7%) and Hispanic (2.3%) populations.
By contrast, the two tracts without usable MOE calculations are somewhat smaller (about 2,250 residents) and almost entirely White.
These findings suggest that data quality problems are not randomly distributed but instead strongly associated with the demographic composition of the counties. In particular, tracts with very small minority populations tend to have especially large margins of error, reflecting the challenges of reliably estimating characteristics in small, rural communities.
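The demographic percentages used above are derived estimates, so they carry their own margins of error. A minimal sketch of the Census Bureau's approximation for the MOE of a proportion follows, applied to the Hispanic share of tract 42053530100. The numerator values come from the tables above (299 residents, with an MOE of about 75, i.e. 25.08% of 299). The total-population MOE of 300 is a made-up placeholder, because that column is not displayed in this document. The helper name prop_moe is hypothetical; tidycensus offers moe_prop() for the same purpose.

```r
# MOE of a proportion p = num / denom (Census Bureau approximation).
# Falls back to the ratio formula when the term under the square root is negative.
prop_moe <- function(num, denom, moe_num, moe_denom) {
  p <- num / denom
  radicand <- moe_num^2 - p^2 * moe_denom^2
  if (radicand < 0) {
    radicand <- moe_num^2 + p^2 * moe_denom^2
  }
  sqrt(radicand) / denom
}

hispanic      <- 299    # from the section 4.1 table
hispanic_moe  <- 75     # 25.08% of 299, from the same table
total_pop     <- 4258   # from the section 3.2 table
total_pop_moe <- 300    # HYPOTHETICAL value, for illustration only

p     <- hispanic / total_pop
p_moe <- prop_moe(hispanic, total_pop, hispanic_moe, total_pop_moe)

# Share and its margin of error, in percentage points
round(100 * c(share = p, moe = p_moe), 2)
```

Replacing the placeholder with the tract's actual total_pop_moe from tract_demo would give the real margin of error for the percentage.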
Part 5: Policy Recommendations
5.1 Analysis Integration and Professional Summary
Your Task: Write an executive summary that integrates findings from all four analyses.
Executive Summary Requirements:
Overall Pattern Identification: What are the systematic patterns across all your analyses?
Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings?
Root Cause Analysis: What underlying factors drive both data quality issues and bias risk?
Strategic Recommendations: What should the Department implement to address these systematic issues?
Executive Summary:
1. Overall Pattern Identification
Our analysis of income and demographic data for Pennsylvania, with a tract-level focus on Forest, Susquehanna, and Elk counties, reveals systematic data quality challenges. At the county level, income estimates are mostly reliable: 57 of 67 counties (85.1%) fall in the High Confidence category and none exceeds the 10% Low Confidence threshold, although the five least reliable counties have margins of error between 6.6% and 10.0%. At the tract level, demographic estimates show far greater uncertainty: 77.8% of tracts in Elk and 100% of tracts in both Forest and Susquehanna exceed the 15% MOE threshold for at least one racial/ethnic group. These patterns indicate that uncertainty is not random but concentrated in small, rural communities.
2. Equity Assessment
Communities with the highest data uncertainty are predominantly rural and overwhelmingly White, with very small Black and Hispanic populations. Because minority groups are so sparsely represented, the estimates for these populations are especially unreliable. Algorithms that rely on these data to allocate resources or to assess equity therefore risk systematically under-serving minority communities in these counties, since both absolute population counts and proportions may be skewed by large margins of error.
3. Root Cause Analysis
The underlying drivers of these data quality issues include small population size at the tract level, limited Census sampling coverage in rural areas, and the statistical difficulty of estimating characteristics for very small subgroups. These conditions compound in counties like Forest and Susquehanna, where small total populations and very limited diversity magnify the uncertainty of demographic estimates. As a result, the same communities most in need of precise measurement are those where statistical reliability is lowest.
4. Strategic Recommendations
To mitigate risks of algorithmic bias, the Department should adopt a cautious approach to using tract-level data in small rural counties. Strategies may include: (1) implementing reliability thresholds that flag and down-weight high-MOE estimates in automated systems; (2) supplementing ACS data with administrative or state-collected sources to validate minority population estimates; (3) investing in oversampling or improved survey design in rural areas; and (4) ensuring that equity assessments explicitly account for measurement error. These steps will improve the fairness and robustness of algorithmic decision-making for Pennsylvania’s most vulnerable communities.
5.2 Specific Recommendations
Your Task: Create a decision framework for algorithm implementation.
# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category
library(dplyr)
library(knitr)
recommendations <- selected_counties %>%
transmute(
County=county,
Median_Income = median_income,
MOE_Pct = round(moe_pct, 1),
Reliability = reliability,
Recommendation = case_when(
Reliability == "High Confidence" ~ "Safe for algorithmic decisions",
Reliability == "Moderate Confidence" ~ "Use with caution - monitor outcomes",
Reliability == "Low Confidence" ~ "Requires manual review or additional data",
TRUE ~ "Not classified"
)
)
# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"
# - Low Confidence: "Requires manual review or additional data"
# Format as a professional table with kable()
recommendations %>%
kable(
digits = 1,
col.names = c("County", "Median Income", "MOE %", "Reliability Category", "Algorithm Recommendation"),
caption = "Framework for Algorithmic Decision-Making by County"
)
| County | Median Income | MOE % | Reliability Category | Algorithm Recommendation |
|---|---|---|---|---|
| Susquehanna | 63968 | 3.1 | High Confidence | Safe for algorithmic decisions |
| Elk | 61672 | 6.6 | Moderate Confidence | Use with caution - monitor outcomes |
| Forest | 46188 | 10.0 | Moderate Confidence | Use with caution - monitor outcomes |
Key Recommendations:
Your Task: Use your analysis results to provide specific guidance to the department.
Counties suitable for immediate algorithmic implementation: Susquehanna County qualifies for immediate algorithmic implementation. Its median household income estimates have a low margin of error (3.1%) and fall into the High Confidence category. This level of data reliability indicates that automated systems can be used here with minimal risk of bias or misallocation.
Counties requiring additional oversight: Elk County and Forest County fall under the Moderate Confidence category, with MOE percentages of 6.6% and 10.0% respectively. Algorithms may be applied in these counties, but their outcomes should be closely monitored. Recommended oversight includes: (a) periodic validation of algorithm outputs against ground-level administrative records, and (b) monitoring for systematic underrepresentation of smaller minority populations.
Counties needing alternative approaches: In this analysis, no counties fell into the Low Confidence category. However, if future updates reveal tracts with very high MOE or unstable estimates, the Department should consider alternatives such as manual review of high-risk cases, supplemental state-level surveys, or partnerships with local agencies to gather more reliable demographic information.
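Because an allocation algorithm effectively ranks counties, it helps to ask whether two estimates differ by more than sampling noise. The sketch below applies the standard test for the difference between two ACS estimates to the three selected counties, assuming the published MOEs are at the 90% confidence level (z = 1.645). The helper name is_different is hypothetical.

```r
# Estimates and MOEs for the selected counties, from the tables above
income <- c(Susquehanna = 63968, Elk = 61672, Forest = 46188)
moe    <- c(Susquehanna = 2010,  Elk = 4091,  Forest = 4612)

# Two estimates differ at the 90% level when |Z| exceeds 1.645
is_different <- function(a, b) {
  se_a <- moe[[a]] / 1.645
  se_b <- moe[[b]] / 1.645
  z <- (income[[a]] - income[[b]]) / sqrt(se_a^2 + se_b^2)
  abs(z) > 1.645
}

is_different("Susquehanna", "Elk")
is_different("Elk", "Forest")
```

With these values the gap between Susquehanna and Elk is within sampling error, while the gap between Elk and Forest is not, so a ranking that separates the first pair on income alone would rest on noise.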
Questions for Further Investigation
1. Spatial patterns of uncertainty: Are data quality issues more pronounced in specific geographic areas (e.g., border tracts, rural interior tracts) within these counties?
2. Temporal stability: How do MOE levels and demographic compositions change across different ACS release years? Do certain counties show increasing or decreasing reliability over time?
3. Equity implications: How might measurement error differentially affect smaller racial/ethnic populations, and what safeguards could be added to prevent bias in algorithmic decision-making?
Technical Notes
Data Sources:
- U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
- Retrieved via tidycensus R package on [date]
Reproducibility:
- All analysis conducted in R version [your version]
- Census API key required for replication
- Complete code and documentation available at: [your portfolio URL]
Methodology Notes: Several key decisions were made during the analysis that affect reproducibility. First, counties were selected based on reliability categories derived from ACS income margins of error, ensuring representation across high- and moderate-confidence cases. Where no counties fell into the “Low Confidence” category, the county with the highest MOE percentage was included to illustrate potential issues. At the tract level, demographic estimates were processed by calculating MOE percentages for racial/ethnic groups and flagging tracts above a 15% threshold. These processing choices shaped the reliability categories and downstream recommendations.
Limitations: This analysis is constrained by several limitations. The ACS relies on sample surveys, and margins of error are particularly high in small, rural tracts, limiting the precision of estimates for minority populations. Geographic scope was restricted to three Pennsylvania counties, meaning that findings may not generalize to more urban or diverse areas. In addition, the analysis uses a single ACS release, so temporal variability in income and demographic estimates was not captured. These limitations suggest that conclusions should be interpreted with caution and supplemented with additional data sources where possible.
Submission Checklist
Before submitting your portfolio link on Canvas:
Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/your_file_name.html