# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
library(tidyverse)
library(tidycensus)
library(knitr)
# Set your Census API key
# Choose your state for analysis - assign it to a variable called my_state
my_state <- "NJ"
Assignment 1: Census Data Quality for Policy Decisions
Evaluating Data Reliability for Algorithmic Decision-Making
Assignment Overview
Scenario
You are a data analyst for the New Jersey Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.
Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.
Learning Objectives
- Apply dplyr functions to real census data for policy analysis
- Evaluate data quality using margins of error
- Connect technical analysis to algorithmic decision-making
- Identify potential equity implications of data reliability issues
- Create professional documentation for policy stakeholders
Submission Instructions
Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/
Make sure to update your _quarto.yml navigation to include this assignment under an “Assignments” menu.
Part 1: Portfolio Integration
Create this assignment in your portfolio repository under an assignments/assignment_1/ folder structure. Update your navigation menu to include:
- text: Assignments
  menu:
    - href: assignments/assignment_1/your_file_name.qmd
      text: "Assignment 1: Census Data Exploration"
If the text contains a special character such as a colon, wrap it in double quotes so that Quarto reads it as text.
Setup
State Selection: I have chosen New Jersey for this analysis because it sits in the Northeast Megalopolis between two major metropolitan centers, New York City and Philadelphia. This location gives New Jersey a critical role in regional urban development, which I find particularly interesting.
Part 2: County-Level Resource Assessment
2.1 Data Retrieval
Your Task: Use get_acs() to retrieve county-level data for your chosen state.
Requirements:
- Geography: county level
- Variables: median household income (B19013_001) and total population (B01003_001)
- Year: 2022
- Survey: acs5
- Output format: wide
Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.
# Write your get_acs() code here
county_data <- get_acs(
  geography = "county",
  variables = c(
    median_income = "B19013_001",    # Median household income
    total_population = "B01003_001"  # Total population
  ),
  state = my_state,
  year = 2022,
  survey = "acs5",
  output = "wide"
)
# Clean the county names to remove state name and "County"
# Hint: use mutate() with str_remove()
county_data <- county_data %>%
  mutate(county_name = str_remove(NAME, " County, New Jersey"))
# Display the first few rows
head(county_data, 10)
# A tibble: 10 × 7
GEOID NAME median_incomeE median_incomeM total_populationE total_populationM
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 34001 Atla… 73113 1917 274339 NA
2 34003 Berg… 118714 1607 953243 NA
3 34005 Burl… 102615 1436 461853 NA
4 34007 Camd… 82005 1414 522581 NA
5 34009 Cape… 83870 3707 95456 NA
6 34011 Cumb… 62310 2205 153588 NA
7 34013 Esse… 73785 1477 853374 NA
8 34015 Glou… 99668 2605 302621 NA
9 34017 Huds… 86854 1782 712029 NA
10 34019 Hunt… 133534 3236 129099 NA
# ℹ 1 more variable: county_name <chr>
2.2 Data Quality Assessment
Your Task: Calculate margin of error percentages and create reliability categories.
Requirements:
- Calculate MOE percentage: (margin of error / estimate) * 100
- Create reliability categories:
  - High Confidence: MOE < 5%
  - Moderate Confidence: MOE 5-10%
  - Low Confidence: MOE > 10%
- Create a flag for unreliable estimates (MOE > 10%)
Hint: Use mutate() with case_when() for the categories.
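As a small illustration of the requirements above, the sketch below applies the MOE percentage formula, the three reliability categories, and the unreliable-estimate flag to a toy data frame. The Bergen and Cape May rows reuse values from the county output in this document; the "Example" row and the column names `moe_pct`, `reliability`, and `unreliable` are hypothetical and are included only so that the Low Confidence branch is exercised.

```r
library(dplyr)

# Two rows copied from the county output plus one made-up row
# ("Example" is hypothetical and not a real county).
toy <- data.frame(
  county_name    = c("Bergen", "Cape May", "Example"),
  median_incomeE = c(118714, 83870, 50000),
  median_incomeM = c(1607, 3707, 6000)
)

toy_rel <- toy %>%
  mutate(
    # MOE as a percentage of the estimate
    moe_pct = median_incomeM / median_incomeE * 100,
    # Thresholds taken from the requirements list
    reliability = case_when(
      moe_pct < 5   ~ "High Confidence",
      moe_pct <= 10 ~ "Moderate Confidence",
      TRUE          ~ "Low Confidence"
    ),
    # Logical flag for estimates with MOE above 10%
    unreliable = moe_pct > 10
  )

print(toy_rel)
```

A logical flag column like `unreliable` can be passed straight to `filter()` or `count()`, which is one reason to keep it separate from the text category.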
# Calculate MOE percentage and reliability categories using mutate()
county_data <- county_data %>%
  mutate(
    median_incomeMOE = (median_incomeM / median_incomeE) * 100,
    rel_categories = case_when(
      median_incomeMOE < 5 ~ "High Confidence",
      median_incomeMOE >= 5 & median_incomeMOE < 10 ~ "Moderate Confidence",
      median_incomeMOE >= 10 ~ "Low Confidence"
    )
  )
# Create a summary showing count of counties in each reliability category
# Hint: use count() and mutate() to add percentages
summary_data <- county_data %>%
  count(rel_categories) %>%
  mutate(percentage = n / sum(n) * 100)
2.3 High Uncertainty Counties
Your Task: Identify the 5 counties with the highest MOE percentages.
Requirements:
- Sort by MOE percentage (highest first)
- Select the top 5 counties
- Display: county name, median income, margin of error, MOE percentage, reliability category
- Format as a professional table using kable()
Hint: Use arrange(), slice(), and select() functions.
# Create table of top 5 counties by MOE percentage
#?arrange
#?select
#glimpse(county_data)
top_5 <- county_data %>%
  arrange(desc(median_incomeMOE)) %>%
  slice_head(n = 5) %>%
  select(county_name, median_incomeE, median_incomeM, median_incomeMOE, rel_categories)
top_5
# A tibble: 5 × 5
county_name median_incomeE median_incomeM median_incomeMOE rel_categories
<chr> <dbl> <dbl> <dbl> <chr>
1 Cape May 83870 3707 4.42 High Confidence
2 Salem 73378 3047 4.15 High Confidence
3 Cumberland 62310 2205 3.54 High Confidence
4 Atlantic 73113 1917 2.62 High Confidence
5 Gloucester 99668 2605 2.61 High Confidence
# Format as table with kable() - include appropriate column names and caption
#?kable
kable(top_5, digits = getOption("digits"), caption = "Top 5 counties in NJ with the highest median income MOE")
| county_name | median_incomeE | median_incomeM | median_incomeMOE | rel_categories |
|---|---|---|---|---|
| Cape May | 83870 | 3707 | 4.419936 | High Confidence |
| Salem | 73378 | 3047 | 4.152471 | High Confidence |
| Cumberland | 62310 | 2205 | 3.538758 | High Confidence |
| Atlantic | 73113 | 1917 | 2.621969 | High Confidence |
| Gloucester | 99668 | 2605 | 2.613677 | High Confidence |
Data Quality Commentary:
[Write 2-3 sentences explaining what these results mean for algorithmic decision-making. Consider: Which counties might be poorly served by algorithms that rely on this income data? What factors might contribute to higher uncertainty?]
Counties with high MOE percentages are poorly suited to algorithmic decision-making because a large margin of error means the true median income could differ substantially from the estimate, so the algorithm's output would be unreliable and hard to defend to the public. In this New Jersey data, Cape May has the highest MOE percentage (4.42%), which makes it the county most exposed to this problem even though it still falls in the High Confidence category. Factors that could contribute to higher uncertainty include income variation across groups, small sample sizes, and data collection methods.
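To make the size of that uncertainty concrete, the sketch below turns the Cape May estimate and margin of error from the table above into an interval. ACS margins of error are published at the 90 percent confidence level, so the estimate plus or minus the MOE is a 90 percent interval; the rescaling to 95 percent through the standard error is an optional extra step, and the variable names are illustrative.

```r
# Cape May County values from the top-5 table
estimate <- 83870
moe_90   <- 3707   # published ACS MOE (90% confidence level)

# 90% confidence interval: estimate plus or minus the published MOE
ci_90 <- c(lower = estimate - moe_90, upper = estimate + moe_90)

# Rescale to a 95% interval via the standard error
# (z = 1.645 for 90%, z = 1.96 for 95%)
se     <- moe_90 / 1.645
moe_95 <- 1.96 * se
ci_95  <- c(lower = estimate - moe_95, upper = estimate + moe_95)

print(ci_90)
print(round(ci_95))
```

Read this way, Cape May's median household income plausibly lies anywhere from about $80,000 to about $88,000, which matters if a funding threshold falls inside that range.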
Part 3: Neighborhood-Level Analysis
3.1 Focus Area Selection
Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.
Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.
# Use filter() to select 2-3 counties from your county_reliability data
# Store the selected counties in a variable called selected_counties
# NOTE: Since all NJ counties are High Confidence, I chose the counties with the highest and lowest MOE percentages, plus one in between
selected_counties <- county_data %>%
  filter(county_name %in% c("Cape May", "Bergen", "Union"))
selected_counties
# A tibble: 3 × 9
GEOID NAME median_incomeE median_incomeM total_populationE total_populationM
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 34003 Berge… 118714 1607 953243 NA
2 34009 Cape … 83870 3707 95456 NA
3 34039 Union… 95000 2210 572079 NA
# ℹ 3 more variables: county_name <chr>, median_incomeMOE <dbl>,
# rel_categories <chr>
# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category
selected_counties <- selected_counties %>%
  select(county_name, median_incomeE, median_incomeMOE, rel_categories)
kable(selected_counties, digits = getOption("digits"), caption = "Selected Counties")
| county_name | median_incomeE | median_incomeMOE | rel_categories |
|---|---|---|---|
| Bergen | 118714 | 1.353673 | High Confidence |
| Cape May | 83870 | 4.419936 | High Confidence |
| Union | 95000 | 2.326316 | High Confidence |
Comment on the output: Among these three counties, median income and MOE percentage appear to be inversely related: the higher the median income, the lower the MOE percentage.
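Three counties are too few to establish a relationship, so one way to check the comment above is to compute a correlation across all 21 counties. The sketch below is only a rough check: the values are transcribed from the county recommendation table later in this document, and the object name `county_check` is illustrative. A negative coefficient would be consistent with the comment, but with 21 observations it should be read as suggestive rather than conclusive.

```r
# Median income and MOE percentage for all 21 NJ counties,
# transcribed from the county summary table in this document
county_check <- data.frame(
  county_name = c("Atlantic", "Bergen", "Burlington", "Camden", "Cape May",
                  "Cumberland", "Essex", "Gloucester", "Hudson", "Hunterdon",
                  "Mercer", "Middlesex", "Monmouth", "Morris", "Ocean",
                  "Passaic", "Salem", "Somerset", "Sussex", "Union", "Warren"),
  median_incomeE = c(73113, 118714, 102615, 82005, 83870,
                     62310, 73785, 99668, 86854, 133534,
                     92697, 105206, 118527, 130808, 82379,
                     84465, 73378, 131948, 111094, 95000, 92620),
  moe_pct = c(2.621969, 1.353673, 1.399406, 1.724285, 4.419936,
              3.538758, 2.001762, 2.613677, 2.051719, 2.423353,
              2.325857, 1.461894, 1.605541, 2.084735, 1.766227,
              1.851654, 4.152471, 2.485828, 2.460979, 2.326316, 2.598791)
)

# Pearson correlation (linear association)
r_pearson <- cor(county_check$median_incomeE, county_check$moe_pct)

# Spearman correlation (rank-based, less sensitive to Cape May and Salem)
r_spearman <- cor(county_check$median_incomeE, county_check$moe_pct,
                  method = "spearman")

print(c(pearson = r_pearson, spearman = r_spearman))
```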
3.2 Tract-Level Demographics
Your Task: Get demographic data for census tracts in your selected counties.
Requirements:
- Geography: tract level
- Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
- Use the same state and year as before
- Output format: wide
- Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.
# Define your race/ethnicity variables with descriptive names
vars <- c(
  white = "B03002_003",
  Black = "B03002_004",
  Hispanic = "B03002_012",
  total_population = "B03002_001"
)
# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter
tract_data <- get_acs(
  geography = "tract",
  variables = vars,
  state = my_state,
  county = c("003", "009", "039"),  # Bergen, Cape May, Union
  year = 2022,
  survey = "acs5",
  output = "wide"
)
# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations
tract_data <- tract_data %>%
  mutate(
    white_percentage = (whiteE / total_populationE) * 100,
    black_percentage = (BlackE / total_populationE) * 100,
    his_percentage = (HispanicE / total_populationE) * 100
  )
# Add readable tract and county name columns using str_extract() or similar
#?str_extract()
tract_data <- tract_data %>%
  mutate(
    tract_name = str_trim(str_split_fixed(NAME, ";", 3)[, 1]),
    county_name = str_trim(str_remove(str_split_fixed(NAME, ";", 3)[, 2], " County"))
  )
3.3 Demographic Analysis
Your Task: Analyze the demographic patterns in your selected areas.
# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract
top_his_tract <- tract_data %>%
  arrange(desc(his_percentage)) %>%
  slice_head(n = 1)
top_his_tract
# A tibble: 1 × 15
GEOID NAME whiteE whiteM BlackE BlackM HispanicE HispanicM total_populationE
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 34039… Cens… 218 120 310 219 2935 624 3483
# ℹ 6 more variables: total_populationM <dbl>, white_percentage <dbl>,
# black_percentage <dbl>, his_percentage <dbl>, tract_name <chr>,
# county_name <chr>
# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group
summary_data_2 <- tract_data %>%
  group_by(county_name) %>%
  summarize(
    num_tract = n(),
    avg_white_p = mean(white_percentage, na.rm = TRUE),
    avg_black_p = mean(black_percentage, na.rm = TRUE),
    avg_his_p = mean(his_percentage, na.rm = TRUE)
  )
# Create a nicely formatted table of your results using kable()
kable(summary_data_2, digits = getOption("digits"), caption = "Average percentage of races")
| county_name | num_tract | avg_white_p | avg_black_p | avg_his_p |
|---|---|---|---|---|
| Bergen | 203 | 53.08512 | 5.390523 | 21.595950 |
| Cape May | 33 | 86.01229 | 2.830430 | 7.664504 |
| Union | 120 | 35.83627 | 20.277607 | 34.532339 |
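The percentages in this section are derived estimates, so they carry their own margins of error that differ from the MOEs of the underlying counts. The sketch below shows the Census Bureau's standard approximation for the MOE of a proportion, with the usual fallback to the ratio formula when the term under the square root is negative. The helper name `moe_prop_manual` is hypothetical (tidycensus also ships a `moe_prop()` helper for this purpose). The Hispanic count, its MOE, and the tract total come from the top-tract output above, but the total-population MOE of 650 is an assumed value, because that column is truncated in the printed output.

```r
# Hypothetical helper: MOE of a proportion p = num / denom
moe_prop_manual <- function(num, denom, moe_num, moe_denom) {
  p <- num / denom
  radicand <- moe_num^2 - p^2 * moe_denom^2
  # If the radicand is negative, fall back to the ratio formula (plus sign)
  radicand <- ifelse(radicand < 0, moe_num^2 + p^2 * moe_denom^2, radicand)
  sqrt(radicand) / denom
}

# Tract with the highest Hispanic/Latino share (values from the output above)
hispanic_est <- 2935
hispanic_moe <- 624
total_est    <- 3483
total_moe    <- 650   # ASSUMED: the real value is hidden in the truncated output

share     <- hispanic_est / total_est
share_moe <- moe_prop_manual(hispanic_est, total_est, hispanic_moe, total_moe)

# Report as percentage points
print(round(c(share = share, moe = share_moe) * 100, 1))
```

Under that assumed denominator MOE, the tract's Hispanic/Latino share of roughly 84 percent would carry a margin of several percentage points, which is worth keeping in mind when ranking tracts by demographic share.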
Part 4: Comprehensive Data Quality Evaluation
4.1 MOE Analysis for Demographic Variables
Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.
Requirements:
- Calculate MOE percentages for each demographic variable
- Flag tracts where any demographic variable has MOE > 15%
- Create summary statistics
# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)
tract_data <- tract_data %>%
  mutate(
    white_MOE = (whiteM / whiteE) * 100,
    black_MOE = (BlackM / BlackE) * 100,
    his_MOE = (HispanicM / HispanicE) * 100
  )
# Create a flag for tracts with high MOE on any demographic variable
# Use logical operators (| for OR) in an ifelse() statement
tract_data <- tract_data %>%
  mutate(
    flag = if_else(
      white_MOE > 15 | black_MOE > 15 | his_MOE > 15,
      "Flag",
      "Moderate"
    )
  )
# Create summary statistics showing how many tracts have data quality issues
num_flag <- tract_data %>%
  filter(flag == "Flag") %>%
  summarise(
    num = n()
  ) # 356
4.2 Pattern Analysis
Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.
# Group tracts by whether they have high MOE issues
# Calculate average characteristics for each group:
# - population size, demographic percentages
# Use group_by() and summarize() to create this comparison
# Create a professional table showing the patterns
# NOTE: I used white alone and calculated flag_white, because every tract was flagged in the previous step
tract_data <- tract_data %>%
  mutate(
    flag_white = if_else(
      white_MOE > 15,
      "Flag",
      "Moderate"
    )
  )
summary_data_3 <- tract_data %>%
  group_by(flag_white) %>%
  summarise(
    num = n(),
    avg_pop = mean(total_populationE, na.rm = TRUE),
    avg_white_p = mean(white_percentage, na.rm = TRUE)
  )
kable(summary_data_3, digits = getOption("digits"), caption = "Summary of flagged and unflagged groups")
| flag_white | num | avg_pop | avg_white_p |
|---|---|---|---|
| Flag | 256 | 4440.484 | 40.63099 |
| Moderate | 100 | 4840.140 | 74.68121 |
Pattern Analysis: [Describe any patterns you observe. Do certain types of communities have less reliable data? What might explain this?] Tracts with a larger population and a higher concentration of the group being measured are less likely to be flagged. The unflagged tracts average about 4,840 residents and are 74.7% white, compared with about 4,440 residents and 40.6% white for the flagged tracts. Sample size likely explains this: the more sampled residents an estimate is based on, the smaller its MOE relative to the estimate.
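One possible contributor to every tract being flagged in 4.1 is that small or zero subgroup estimates produce very large, or even infinite, MOE percentages, since the formula divides by the estimate. The sketch below illustrates this on made-up tract rows and shows one way to separate "no meaningful estimate" from "high MOE". All values and the `_raw` column names are hypothetical, and whether a zero estimate should count as unreliable is a judgment call that this sketch does not settle.

```r
library(dplyr)

# Hypothetical tract rows (not real ACS values)
toy_tracts <- data.frame(
  tract  = c("A", "B", "C"),
  BlackE = c(0, 40, 1200),
  BlackM = c(13, 35, 150)
)

toy_tracts <- toy_tracts %>%
  mutate(
    # Naive formula: dividing by a zero estimate yields Inf
    black_MOE_raw = BlackM / BlackE * 100,
    # Guarded version: MOE percentage is undefined when the estimate is zero
    black_MOE = if_else(BlackE > 0, BlackM / BlackE * 100, NA_real_),
    # Inf > 15 is TRUE, so the naive flag marks tract A as high MOE
    flag_raw = black_MOE_raw > 15,
    # Guarded flag: only tracts with a defined MOE percentage can be flagged
    flag = !is.na(black_MOE) & black_MOE > 15
  )

print(toy_tracts)
```

Counting how many tracts fall into the undefined case before flagging can show whether the 15% rule is measuring data quality or simply the absence of a group in a tract.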
Part 5: Policy Recommendations
5.1 Analysis Integration and Professional Summary
Your Task: Write an executive summary that integrates findings from all four analyses.
Executive Summary Requirements:
1. Overall Pattern Identification: What are the systematic patterns across all your analyses?
2. Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings?
3. Root Cause Analysis: What underlying factors drive both data quality issues and bias risk?
4. Strategic Recommendations: What should the Department implement to address these systematic issues?
Executive Summary:
In this study, I used MOE percentages as the key metric and divided the dataset into groups to look for differences between them. My aim was to identify counties or communities where data-driven decision-making is poorly supported, from both an economic and a racial perspective.
My findings are as follows. Communities with smaller populations are at greater risk of algorithmic bias. In addition, diverse communities with many racial and demographic groups have higher MOE percentages. Economically disadvantaged communities with income inequality are also likely to face this bias.
The data quality issues and bias risk may stem from: (1) Sample size. Smaller populations usually have higher sampling error and more missing data, and minority groups are also harder for the survey to reach. (2) Socioeconomic and demographic variance. Diverse communities have large internal variation, which leads to larger margins of error.
The Department should not rely on the algorithm alone to make its decisions. On-site investigation and community engagement are crucial for understanding conditions on the ground, because stories and local knowledge cannot be captured in numbers. Planners should examine the results of data-driven analysis critically.
5.2 Specific Recommendations
Your Task: Create a decision framework for algorithm implementation.
# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category
summary_county <- county_data %>%
  select(county_name, median_incomeE, median_incomeMOE, rel_categories)
summary_county
# A tibble: 21 × 4
county_name median_incomeE median_incomeMOE rel_categories
<chr> <dbl> <dbl> <chr>
1 Atlantic 73113 2.62 High Confidence
2 Bergen 118714 1.35 High Confidence
3 Burlington 102615 1.40 High Confidence
4 Camden 82005 1.72 High Confidence
5 Cape May 83870 4.42 High Confidence
6 Cumberland 62310 3.54 High Confidence
7 Essex 73785 2.00 High Confidence
8 Gloucester 99668 2.61 High Confidence
9 Hudson 86854 2.05 High Confidence
10 Hunterdon 133534 2.42 High Confidence
# ℹ 11 more rows
# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"
# - Low Confidence: "Requires manual review or additional data"
summary_county <- summary_county %>%
  mutate(
    recommendation = case_when(
      rel_categories == "High Confidence" ~ "Safe for algorithmic decisions",
      rel_categories == "Moderate Confidence" ~ "Use with caution - monitor outcomes",
      rel_categories == "Low Confidence" ~ "Requires manual review or additional data"
    )
  )
# Format as a professional table with kable()
kable(summary_county, digits = getOption("digits"), caption = "Recommendation for algorithmic decisions")
| county_name | median_incomeE | median_incomeMOE | rel_categories | recommendation |
|---|---|---|---|---|
| Atlantic | 73113 | 2.621969 | High Confidence | Safe for algorithmic decisions |
| Bergen | 118714 | 1.353673 | High Confidence | Safe for algorithmic decisions |
| Burlington | 102615 | 1.399406 | High Confidence | Safe for algorithmic decisions |
| Camden | 82005 | 1.724285 | High Confidence | Safe for algorithmic decisions |
| Cape May | 83870 | 4.419936 | High Confidence | Safe for algorithmic decisions |
| Cumberland | 62310 | 3.538758 | High Confidence | Safe for algorithmic decisions |
| Essex | 73785 | 2.001762 | High Confidence | Safe for algorithmic decisions |
| Gloucester | 99668 | 2.613677 | High Confidence | Safe for algorithmic decisions |
| Hudson | 86854 | 2.051719 | High Confidence | Safe for algorithmic decisions |
| Hunterdon | 133534 | 2.423353 | High Confidence | Safe for algorithmic decisions |
| Mercer | 92697 | 2.325857 | High Confidence | Safe for algorithmic decisions |
| Middlesex | 105206 | 1.461894 | High Confidence | Safe for algorithmic decisions |
| Monmouth | 118527 | 1.605541 | High Confidence | Safe for algorithmic decisions |
| Morris | 130808 | 2.084735 | High Confidence | Safe for algorithmic decisions |
| Ocean | 82379 | 1.766227 | High Confidence | Safe for algorithmic decisions |
| Passaic | 84465 | 1.851654 | High Confidence | Safe for algorithmic decisions |
| Salem | 73378 | 4.152471 | High Confidence | Safe for algorithmic decisions |
| Somerset | 131948 | 2.485828 | High Confidence | Safe for algorithmic decisions |
| Sussex | 111094 | 2.460979 | High Confidence | Safe for algorithmic decisions |
| Union | 95000 | 2.326316 | High Confidence | Safe for algorithmic decisions |
| Warren | 92620 | 2.598791 | High Confidence | Safe for algorithmic decisions |
Key Recommendations:
Your Task: Use your analysis results to provide specific guidance to the department.
Counties suitable for immediate algorithmic implementation: Low MOE group (MOE ≈ 1.85% – 1.35%). Counties: Passaic, Ocean, Camden, Monmouth, Middlesex, Burlington, Bergen
Counties requiring additional oversight: Mid MOE group (MOE ≈ 2.46% – 2.00%). Counties: Sussex, Hunterdon, Union, Mercer, Morris, Hudson, Essex
Counties needing alternative approaches: High MOE group (MOE ≈ 4.42% – 2.49%). Counties: Cape May, Salem, Cumberland, Atlantic, Gloucester, Warren, Somerset
These groups rank counties relative to one another. All 21 counties fall in the High Confidence category under the 5% threshold.
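The three groups above split the 21 counties into equal-sized thirds by MOE percentage. A minimal sketch of how such a relative grouping could be reproduced with `dplyr::ntile()` follows. The MOE values are transcribed from the recommendation table, while the object name `county_moe` and the group labels are illustrative rather than taken from the original analysis.

```r
library(dplyr)

# MOE percentages transcribed from the recommendation table
county_moe <- data.frame(
  county_name = c("Atlantic", "Bergen", "Burlington", "Camden", "Cape May",
                  "Cumberland", "Essex", "Gloucester", "Hudson", "Hunterdon",
                  "Mercer", "Middlesex", "Monmouth", "Morris", "Ocean",
                  "Passaic", "Salem", "Somerset", "Sussex", "Union", "Warren"),
  moe_pct = c(2.621969, 1.353673, 1.399406, 1.724285, 4.419936,
              3.538758, 2.001762, 2.613677, 2.051719, 2.423353,
              2.325857, 1.461894, 1.605541, 2.084735, 1.766227,
              1.851654, 4.152471, 2.485828, 2.460979, 2.326316, 2.598791)
)

county_groups <- county_moe %>%
  mutate(
    # ntile() ranks ascending: 1 = lowest third, 3 = highest third
    moe_tercile = ntile(moe_pct, 3),
    moe_group = case_when(
      moe_tercile == 1 ~ "Low MOE group",
      moe_tercile == 2 ~ "Mid MOE group",
      TRUE             ~ "High MOE group"
    )
  ) %>%
  arrange(desc(moe_pct))

# Range of MOE percentages within each group
group_ranges <- county_groups %>%
  group_by(moe_group) %>%
  summarise(n = n(), min_moe = min(moe_pct), max_moe = max(moe_pct))

print(group_ranges)
```

Because terciles are relative, the same code would always place a third of the counties in the "High MOE group", even when every county is well under the 5% threshold, so the grouping is best read alongside the absolute categories.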
Questions for Further Investigation
- Do counties with higher margins of error cluster geographically?
- Do certain counties show persistently high uncertainty?
- Are higher MOEs systematically associated with specific demographic factors?
Technical Notes
Data Sources:
- U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
- Retrieved via tidycensus R package on September 26, 2025
Reproducibility:
- All analysis conducted in RStudio version 2025.05.1+513
- Census API key required for replication
- Complete code and documentation available at: https://musa-5080-fall-2025.github.io/portfolio-setup-CenJinHeng/
Methodology Notes: The choice of threshold values (e.g., 15% and 10%) to categorize reliability was subjective.
Limitations:
- Geographic Scope: The analysis focused solely on New Jersey, which limits the ability to generalize findings to other states or regional contexts.
- Temporal Factors: Only 2022 ACS 5-year estimates were used, providing a snapshot.
- Variable Selection: The study emphasized median household income and racial/ethnic categories.
Submission Checklist
Before submitting your portfolio link on Canvas:
- [✅] All code chunks run without errors
- [✅] All “[Fill this in]” prompts have been completed
- [✅] Tables are properly formatted and readable
- [✅] Executive summary addresses all four required components
- [✅] Portfolio navigation includes this assignment
- [✅] Census API key is properly set
- [✅] Document renders correctly to HTML
Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/your_file_name.html