# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
library(tidycensus)
library(tidyverse)
library(knitr)
# Set your Census API key
census_api_key("595f8315942e8965e5a91f7cfa62b7210fe62245")
# Choose your state for analysis - assign it to a variable called my_state
<- "PA" my_state
Assignment 1: Census Data Quality for Policy Decisions
Evaluating Data Reliability for Algorithmic Decision-Making
Assignment Overview
Scenario
You are a data analyst for the Pennsylvania Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.
Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.
Learning Objectives
- Apply dplyr functions to real census data for policy analysis
- Evaluate data quality using margins of error
- Connect technical analysis to algorithmic decision-making
- Identify potential equity implications of data reliability issues
- Create professional documentation for policy stakeholders
Submission Instructions
Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/
Make sure to update your _quarto.yml navigation to include this assignment under an “Assignments” menu.
Part 1: Portfolio Integration
Create this assignment in your portfolio repository under an assignments/assignment_1/ folder structure. Update your navigation menu to include:
- text: Assignments
  menu:
    - href: assignments/assignment_1/your_file_name.qmd
      text: "Assignment 1: Census Data Exploration"
If the text contains a special character such as a comma or colon, wrap it in double quotes so that Quarto can identify it as text.
Setup
State Selection: I have chosen Pennsylvania for this analysis because I currently live here, which gives the results personal relevance.
Part 2: County-Level Resource Assessment
2.1 Data Retrieval
Your Task: Use get_acs() to retrieve county-level data for your chosen state.
Requirements:
- Geography: county level
- Variables: median household income (B19013_001) and total population (B01003_001)
- Year: 2022
- Survey: acs5
- Output format: wide
Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.
# Write your get_acs() code here
county_data <- get_acs(
  geography = "county",
  variables = c(
    median_income = "B19013_001",
    total_pop = "B01003_001"
  ),
  state = my_state,
  year = 2022,
  output = "wide"
)

# Clean the county names to remove state name and "County"
# Hint: use mutate() with str_remove()
county_data <- county_data %>%
  mutate(county_name = str_remove(NAME, " County, Pennsylvania"))

# Display the first few rows
glimpse(county_data)
Rows: 67
Columns: 7
$ GEOID <chr> "42001", "42003", "42005", "42007", "42009", "42011", "…
$ NAME <chr> "Adams County, Pennsylvania", "Allegheny County, Pennsy…
$ median_incomeE <dbl> 78975, 72537, 61011, 67194, 58337, 74617, 59386, 60650,…
$ median_incomeM <dbl> 3334, 869, 2202, 1531, 2606, 1191, 2058, 2167, 1516, 21…
$ total_popE <dbl> 104604, 1245310, 65538, 167629, 47613, 428483, 122640, …
$ total_popM <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ county_name <chr> "Adams", "Allegheny", "Armstrong", "Beaver", "Bedford",…
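A note on the column names above: with output = "wide", tidycensus appends an E suffix for each estimate and an M suffix for its margin of error, which is why the income variable appears as median_incomeE and median_incomeM. A quick check:

# E = estimate, M = margin of error (tidycensus wide-output convention)
names(county_data)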
2.2 Data Quality Assessment
Your Task: Calculate margin of error percentages and create reliability categories.
Requirements:
- Calculate MOE percentage: (margin of error / estimate) * 100
- Create reliability categories:
  - High Confidence: MOE < 5%
  - Moderate Confidence: MOE 5-10%
  - Low Confidence: MOE > 10%
- Create a flag for unreliable estimates (MOE > 10%)
Hint: Use mutate() with case_when() for the categories.
# Calculate MOE percentage and reliability categories using mutate()
county_data <- county_data %>%
  mutate(
    MOE_PCT = (median_incomeM / median_incomeE) * 100,
    reliability = case_when(
      MOE_PCT < 5 ~ "High Confidence (<5%)",
      MOE_PCT <= 10 ~ "Moderate Confidence (5%-10%)",
      TRUE ~ "Low Confidence (>10%)"
    )
  )
county_data
# A tibble: 67 × 9
GEOID NAME median_incomeE median_incomeM total_popE total_popM county_name
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 42001 Adams … 78975 3334 104604 NA Adams
2 42003 Allegh… 72537 869 1245310 NA Allegheny
3 42005 Armstr… 61011 2202 65538 NA Armstrong
4 42007 Beaver… 67194 1531 167629 NA Beaver
5 42009 Bedfor… 58337 2606 47613 NA Bedford
6 42011 Berks … 74617 1191 428483 NA Berks
7 42013 Blair … 59386 2058 122640 NA Blair
8 42015 Bradfo… 60650 2167 60159 NA Bradford
9 42017 Bucks … 107826 1516 645163 NA Bucks
10 42019 Butler… 82932 2164 194562 NA Butler
# ℹ 57 more rows
# ℹ 2 more variables: MOE_PCT <dbl>, reliability <chr>
# Create a summary showing count of counties in each reliability category
# Hint: use count() and mutate() to add percentages
reliability_summary <- county_data %>%
  count(reliability) %>%
  mutate(
    percentage = round(n / sum(n) * 100, 1),
    total = sum(n)
  )
reliability_summary
# A tibble: 2 × 4
reliability n percentage total
<chr> <int> <dbl> <int>
1 High Confidence (<5%) 57 85.1 67
2 Moderate Confidence (5%-10%) 10 14.9 67
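As a worked interpretation of these categories: ACS margins of error are published at the 90 percent confidence level, so a rough interval for an estimate is estimate ± MOE. Using Adams County from the data above:

# Adams County: median income estimate 78975, MOE 3334 (90% confidence level)
78975 - 3334        # lower bound: 75641
78975 + 3334        # upper bound: 82309
3334 / 78975 * 100  # MOE percentage: about 4.2%, i.e., High Confidence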
2.3 High Uncertainty Counties
Your Task: Identify the 5 counties with the highest MOE percentages.
Requirements:
- Sort by MOE percentage (highest first)
- Select the top 5 counties
- Display: county name, median income, margin of error, MOE percentage, reliability category
- Format as a professional table using kable()
Hint: Use the arrange(), slice(), and select() functions.
# Create table of top 5 counties by MOE percentage
top_5_counties <- county_data %>%
  arrange(desc(MOE_PCT)) %>%
  slice_head(n = 5)

# Format as table with kable() - include appropriate column names and caption
top_5_counties %>%
  select(county_name, median_incomeE, median_incomeM, MOE_PCT, reliability) %>%
  kable(
    col.names = c("County Name", "Median Income ($)", "MOE ($)", "MOE Percentage (%)", "Reliability"),
    caption = "Top 5 Counties Exhibiting High Uncertainty",
    digits = c(0, 0, 1, 1, 0)
  )
County Name | Median Income ($) | MOE ($) | MOE Percentage (%) | Reliability |
---|---|---|---|---|
Forest | 46188 | 4612 | 10.0 | Moderate Confidence (5%-10%) |
Sullivan | 62910 | 5821 | 9.3 | Moderate Confidence (5%-10%) |
Union | 64914 | 4753 | 7.3 | Moderate Confidence (5%-10%) |
Montour | 72626 | 5146 | 7.1 | Moderate Confidence (5%-10%) |
Elk | 61672 | 4091 | 6.6 | Moderate Confidence (5%-10%) |
Data Quality Commentary:
Although all five of these counties fall in the moderate-confidence category for their income estimates, their margins of error vary considerably. Forest County, with the largest margin of error, sits on the cusp of low confidence, while Elk County, with the smallest, is much closer to high confidence. These discrepancies, on top of the general uncertainty, indicate a need for additional oversight when income data from these counties is used for decision-making.
Part 3: Neighborhood-Level Analysis
3.1 Focus Area Selection
Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.
Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.
# Use filter() to select 2-3 counties from your county_reliability data
# Store the selected counties in a variable called selected_counties
# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category
selected_counties <- county_data %>%
  filter(county_name %in% c("Allegheny", "Carbon", "Forest")) %>%
  select(county_name, median_incomeE, MOE_PCT, reliability)
selected_counties
# A tibble: 3 × 4
county_name median_incomeE MOE_PCT reliability
<chr> <dbl> <dbl> <chr>
1 Allegheny 72537 1.20 High Confidence (<5%)
2 Carbon 64538 5.31 Moderate Confidence (5%-10%)
3 Forest 46188 9.99 Moderate Confidence (5%-10%)
Comment on the output: Because my data contained no low-confidence counties, I chose the top county from my “Top 5 Counties Exhibiting High Uncertainty” table as the low-confidence observation. A gradient in reliability is still achieved, however, since, as previously observed, the margins of error of the two moderate-confidence counties differ substantially.
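For portability to other states, a programmatic alternative could pick the least, median, and most uncertain counties instead of hard-coding names. A sketch (selected_counties_auto is a hypothetical name):

# Pick the least, middle, and most uncertain counties by MOE percentage
selected_counties_auto <- county_data %>%
  arrange(MOE_PCT) %>%
  slice(c(1, ceiling(n() / 2), n())) %>%
  select(county_name, median_incomeE, MOE_PCT, reliability)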
3.2 Tract-Level Demographics
Your Task: Get demographic data for census tracts in your selected counties.
Requirements:
- Geography: tract level
- Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
- Use the same state and year as before
- Output format: wide
- Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.
# Define your race/ethnicity variables with descriptive names
# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter
find_codes <- county_data %>%
  select(county_name, GEOID) %>%
  filter(county_name %in% c("Allegheny", "Carbon", "Forest")) %>%
  mutate(county_code = str_remove(GEOID, "^42"))
find_codes
# A tibble: 3 × 3
county_name GEOID county_code
<chr> <chr> <chr>
1 Allegheny 42003 003
2 Carbon 42025 025
3 Forest 42053 053
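Two alternatives worth noting: a county GEOID is the two-digit state FIPS followed by the three-digit county FIPS, so str_sub() can extract the code positionally, and get_acs() also accepts county names directly. A sketch (find_codes_alt is a hypothetical name):

# Positional alternative: characters 3-5 of a county GEOID are the county FIPS
find_codes_alt <- county_data %>%
  filter(county_name %in% c("Allegheny", "Carbon", "Forest")) %>%
  mutate(county_code = str_sub(GEOID, 3, 5))
# get_acs() also accepts names: county = c("Allegheny", "Carbon", "Forest")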
tract_data <- get_acs(
  geography = "tract",
  variables = c(
    white = "B03002_003",
    black = "B03002_004",
    hispanic = "B03002_012",
    total_pop = "B03002_001"
  ),
  state = my_state,
  county = c("003", "025", "053"),
  year = 2022,
  output = "wide"
)
tract_data
# A tibble: 413 × 10
GEOID NAME whiteE whiteM blackE blackM hispanicE hispanicM total_popE
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 42003010301 Censu… 755 108 1086 125 60 47 2028
2 42003010302 Censu… 3283 254 678 174 269 147 4631
3 42003020100 Censu… 2915 319 541 142 192 102 4310
4 42003020300 Censu… 1170 201 15 23 44 32 1471
5 42003030500 Censu… 724 179 1821 451 135 101 3044
6 42003040200 Censu… 947 234 493 140 53 30 1843
7 42003040400 Censu… 1192 216 61 49 17 21 1629
8 42003040500 Censu… 2342 404 79 63 88 56 2840
9 42003040600 Censu… 1928 417 69 63 67 60 2302
10 42003040900 Censu… 2496 409 609 248 92 63 3722
# ℹ 403 more rows
# ℹ 1 more variable: total_popM <dbl>
# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations
# Add readable tract and county name columns using str_extract() or similar
unreadable <- str_split(tract_data$NAME, ";", simplify = TRUE)

tract_data <- tract_data %>%
  mutate(
    white_pct = (whiteE / total_popE) * 100,
    black_pct = (blackE / total_popE) * 100,
    hispanic_pct = (hispanicE / total_popE) * 100,
    tract_name = unreadable[, 1],
    county_name = str_remove(unreadable[, 2], "County")
  )
glimpse(tract_data)
Rows: 413
Columns: 15
$ GEOID <chr> "42003010301", "42003010302", "42003020100", "42003020300…
$ NAME <chr> "Census Tract 103.01; Allegheny County; Pennsylvania", "C…
$ whiteE <dbl> 755, 3283, 2915, 1170, 724, 947, 1192, 2342, 1928, 2496, …
$ whiteM <dbl> 108, 254, 319, 201, 179, 234, 216, 404, 417, 409, 57, 150…
$ blackE <dbl> 1086, 678, 541, 15, 1821, 493, 61, 79, 69, 609, 1450, 132…
$ blackM <dbl> 125, 174, 142, 23, 451, 140, 49, 63, 63, 248, 404, 422, 3…
$ hispanicE <dbl> 60, 269, 192, 44, 135, 53, 17, 88, 67, 92, 3, 0, 84, 98, …
$ hispanicM <dbl> 47, 147, 102, 32, 101, 30, 21, 56, 60, 63, 8, 11, 73, 54,…
$ total_popE <dbl> 2028, 4631, 4310, 1471, 3044, 1843, 1629, 2840, 2302, 372…
$ total_popM <dbl> 61, 198, 368, 206, 535, 235, 220, 423, 446, 452, 401, 377…
$ white_pct <dbl> 37.2287968, 70.8918160, 67.6334107, 79.5377294, 23.784494…
$ black_pct <dbl> 53.5502959, 14.6404664, 12.5522042, 1.0197145, 59.8226018…
$ hispanic_pct <dbl> 2.9585799, 5.8086806, 4.4547564, 2.9911625, 4.4349540, 2.…
$ tract_name <chr> "Census Tract 103.01", "Census Tract 103.02", "Census Tra…
$ county_name <chr> " Allegheny ", " Allegheny ", " Allegheny ", " Allegheny …
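One artifact visible above: county_name keeps the spaces surrounding the split delimiter (e.g. " Allegheny "). A cleanup pass, sketched here but not applied to the outputs that follow, would make exact string matching safer:

# Optional cleanup: trim whitespace left over from splitting NAME on ";"
tract_data <- tract_data %>%
  mutate(county_name = str_trim(county_name))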
3.3 Demographic Analysis
Your Task: Analyze the demographic patterns in your selected areas.
# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract
highest_hispanic_pct <- tract_data %>%
  arrange(desc(hispanic_pct)) %>%
  slice(1)
glimpse(highest_hispanic_pct)
Rows: 1
Columns: 15
$ GEOID <chr> "42003120300"
$ NAME <chr> "Census Tract 1203; Allegheny County; Pennsylvania"
$ whiteE <dbl> 33
$ whiteM <dbl> 38
$ blackE <dbl> 1678
$ blackM <dbl> 494
$ hispanicE <dbl> 343
$ hispanicM <dbl> 499
$ total_popE <dbl> 2093
$ total_popM <dbl> 681
$ white_pct <dbl> 1.576684
$ black_pct <dbl> 80.172
$ hispanic_pct <dbl> 16.38796
$ tract_name <chr> "Census Tract 1203"
$ county_name <chr> " Allegheny "
# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group
summary_by_county <- tract_data %>%
  group_by(county_name) %>%
  summarize(
    number_of_tracts = n(),
    mean_white_pct = round(mean(white_pct, na.rm = TRUE), 1),
    mean_black_pct = round(mean(black_pct, na.rm = TRUE), 1),
    mean_hispanic_pct = round(mean(hispanic_pct, na.rm = TRUE), 1)
  )
summary_by_county
# A tibble: 3 × 5
county_name number_of_tracts mean_white_pct mean_black_pct mean_hispanic_pct
<chr> <int> <dbl> <dbl> <dbl>
1 " Allegheny " 394 74.5 15.4 2.4
2 " Carbon " 17 89 1.9 6.4
3 " Forest " 2 71.2 13.6 7.4
# Create a nicely formatted table of your results using kable()
summary_by_county %>%
  kable(
    col.names = c("County Name", "Total Tracts per County", "Average White Residents (%)", "Average Black Residents (%)", "Average Hispanic Residents (%)"),
    caption = "Average Demographics per County",
    digits = c(0, 0, 1, 1, 1)
  )
County Name | Total Tracts per County | Average White Residents (%) | Average Black Residents (%) | Average Hispanic Residents (%) |
---|---|---|---|---|
Allegheny | 394 | 74.5 | 15.4 | 2.4 |
Carbon | 17 | 89.0 | 1.9 | 6.4 |
Forest | 2 | 71.2 | 13.6 | 7.4 |
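A methodological aside: the averages above weight each tract equally, so a 1,500-person tract counts as much as a 4,600-person one. A population-weighted variant, a sketch not used elsewhere in this analysis, would describe the average resident rather than the average tract:

# Population-weighted county averages (weights = tract population)
tract_data %>%
  group_by(county_name) %>%
  summarize(
    weighted_white_pct = round(weighted.mean(white_pct, w = total_popE, na.rm = TRUE), 1),
    weighted_black_pct = round(weighted.mean(black_pct, w = total_popE, na.rm = TRUE), 1),
    weighted_hispanic_pct = round(weighted.mean(hispanic_pct, w = total_popE, na.rm = TRUE), 1)
  )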
Part 4: Comprehensive Data Quality Evaluation
4.1 MOE Analysis for Demographic Variables
Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.
Requirements:
- Calculate MOE percentages for each demographic variable
- Flag tracts where any demographic variable has MOE > 15%
- Create summary statistics
# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)
# Create a flag for tracts with high MOE on any demographic variable
# Use logical operators (| for OR) in an ifelse() statement
tract_data <- tract_data %>%
  mutate(
    white_moe_pct = (whiteM / whiteE) * 100,
    black_moe_pct = (blackM / blackE) * 100,
    hispanic_moe_pct = (hispanicM / hispanicE) * 100,
    moe_flag = ifelse(
      white_moe_pct > 15 |
        black_moe_pct > 15 |
        hispanic_moe_pct > 15,
      TRUE, FALSE
    )
  )
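Note that when a subgroup estimate is zero, the division above returns Inf (or NaN for 0/0), and Inf > 15 evaluates to TRUE, so zero-population subgroups are silently flagged; this likely contributes to the 100 percent flag rate shown below (see also the Limitations note). A guarded sketch (tract_data_checked is a hypothetical name) makes that handling explicit:

# Guarded variant: zero-denominator cases become NA, then count as flagged
tract_data_checked <- tract_data %>%
  mutate(
    white_moe_pct = if_else(whiteE > 0, whiteM / whiteE * 100, NA_real_),
    black_moe_pct = if_else(blackE > 0, blackM / blackE * 100, NA_real_),
    hispanic_moe_pct = if_else(hispanicE > 0, hispanicM / hispanicE * 100, NA_real_),
    moe_flag = coalesce(
      white_moe_pct > 15 | black_moe_pct > 15 | hispanic_moe_pct > 15,
      TRUE  # NA (zero-population group) is treated as unreliable
    )
  )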
# Create summary statistics showing how many tracts have data quality issues
quality_summary <- tract_data %>%
  count(moe_flag) %>%
  mutate(
    percentage = round(n / sum(n) * 100, 1),
    total = sum(n)
  )
quality_summary
# A tibble: 1 × 4
moe_flag n percentage total
<lgl> <int> <dbl> <int>
1 TRUE 413 100 413
4.2 Pattern Analysis
Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.
# Group tracts by whether they have high MOE issues
# Calculate average characteristics for each group:
# - population size, demographic percentages
# Use group_by() and summarize() to create this comparison
summary_by_moe <- tract_data %>%
  group_by(moe_flag) %>%
  summarize(
    number_of_tracts = n(),
    mean_population = round(mean(total_popE, na.rm = TRUE), 0),
    mean_white_pct = round(mean(white_pct, na.rm = TRUE), 1),
    mean_black_pct = round(mean(black_pct, na.rm = TRUE), 1),
    mean_hispanic_pct = round(mean(hispanic_pct, na.rm = TRUE), 1)
  )
summary_by_moe
# A tibble: 1 × 6
moe_flag number_of_tracts mean_population mean_white_pct mean_black_pct
<lgl> <int> <dbl> <dbl> <dbl>
1 TRUE 413 3190 75 14.8
# ℹ 1 more variable: mean_hispanic_pct <dbl>
# Create a professional table showing the patterns
summary_by_moe %>%
  kable(
    col.names = c("Quality Issues (MOE > 15%)", "Total Tracts Detected", "Average Total Population", "Average White Residents (%)", "Average Black Residents (%)", "Average Hispanic Residents (%)"),
    caption = "Tract Characteristics by Data Quality Flag",
    digits = c(0, 0, 0, 1, 1, 1)
  )
Quality Issues (MOE > 15%) | Total Tracts Detected | Average Total Population | Average White Residents (%) | Average Black Residents (%) | Average Hispanic Residents (%) |
---|---|---|---|---|---|
TRUE | 413 | 3190 | 75 | 14.8 | 2.6 |
Pattern Analysis: Every tract investigated was flagged for a high margin of error on at least one demographic variable, so the grouped comparison has no unflagged group to contrast against. Because MOE percentages are calculated relative to each group's estimate, small subgroup counts (and zero-count groups, whose MOE percentages become infinite) drive the flag. In these predominantly white counties, that means the estimates for Black and Hispanic residents, the very groups an equity-focused algorithm needs to measure accurately, are the least reliable.
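Since every tract is flagged, a grouped comparison cannot reveal contrast; a continuous check, sketched below under the assumption that only finite MOE percentages are kept, would test whether smaller tracts tend to have less reliable demographic estimates:

# Correlation between tract population and Hispanic MOE percentage
# (finite values only; zero estimates produce Inf and are dropped here)
tract_data %>%
  filter(is.finite(hispanic_moe_pct)) %>%
  summarize(cor_pop_moe = cor(total_popE, hispanic_moe_pct))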
Part 5: Policy Recommendations
5.1 Analysis Integration and Professional Summary
Your Task: Write an executive summary that integrates findings from all four analyses.
Executive Summary Requirements: 1. Overall Pattern Identification: What are the systematic patterns across all your analyses? 2. Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings? 3. Root Cause Analysis: What underlying factors drive both data quality issues and bias risk? 4. Strategic Recommendations: What should the Department implement to address these systematic issues?
Executive Summary:
Overall Pattern Identification: Across all analyses, the smaller the community being surveyed, the less reliable the data became. Although the counties chosen for tract-level analysis represented a gradient in the reliability of their income estimates, they shared similar unreliability in their demographic data. Allegheny County, with the smallest margin of error in its income estimate, had the largest population of the chosen counties, while Forest County, with the largest margin of error, had the smallest. All three counties, however, had disproportionately large white populations compared to other racial and ethnic groups and returned high margins of error in their demographic data.
Equity Assessment: The county data emphasizes that margins of error vary by the type of data collected and are influenced both by the absolute size of a population and by its proportion within the broader population. More specifically, smaller groups, such as counties with small populations and racial or ethnic minorities, are more prone to data inaccuracies and thus more vulnerable to bias in algorithmic decision-making. For instance, in the tract with the highest Hispanic percentage, the Hispanic population estimate (343) carried a margin of error (499) larger than the estimate itself. Decisions based on this demographic data may vastly misrepresent the true needs of the Hispanic population.
Root Cause Analysis: Small and spatially dispersed populations yield small survey samples, which produce high relative margins of error. When algorithms take these inaccurate inputs at face value, they inherently reproduce non-representative outputs. The subjective choices made when analyzing the data, such as thresholds and category definitions, can further skew results.
Strategic Recommendations: To mitigate these risks, the Department should report margins of error alongside every estimate and adopt clear thresholds for interpreting them. Estimates exceeding certain thresholds should trigger human oversight and require additional social, historical, and spatial context before automated decisions are made.
5.2 Specific Recommendations
Your Task: Create a decision framework for algorithm implementation.
# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category
# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"
# - Low Confidence: "Requires manual review or additional data"
# Format as a professional table with kable()
county_summary <- county_data %>%
  mutate(
    rec_for_algorithm = case_when(
      reliability == "High Confidence (<5%)" ~ "Safe for algorithmic decisions",
      reliability == "Moderate Confidence (5%-10%)" ~ "Use with caution - monitor outcomes",
      reliability == "Low Confidence (>10%)" ~ "Requires manual review or additional data"
    )
  ) %>%
  select(county_name, median_incomeE, MOE_PCT, reliability, rec_for_algorithm) %>%
  arrange(desc(MOE_PCT)) %>%
  kable(
    col.names = c("County Name", "Median Income ($)", "MOE Percentage (%)", "Reliability", "Algorithm Recommendation"),
    caption = "County-Level Algorithmic Recommendations",
    digits = c(0, 0, 1, 0, 0)
  )
county_summary
County Name | Median Income ($) | MOE Percentage (%) | Reliability | Algorithm Recommendation |
---|---|---|---|---|
Forest | 46188 | 10.0 | Moderate Confidence (5%-10%) | Use with caution - monitor outcomes |
Sullivan | 62910 | 9.3 | Moderate Confidence (5%-10%) | Use with caution - monitor outcomes |
Union | 64914 | 7.3 | Moderate Confidence (5%-10%) | Use with caution - monitor outcomes |
Montour | 72626 | 7.1 | Moderate Confidence (5%-10%) | Use with caution - monitor outcomes |
Elk | 61672 | 6.6 | Moderate Confidence (5%-10%) | Use with caution - monitor outcomes |
Greene | 66283 | 6.4 | Moderate Confidence (5%-10%) | Use with caution - monitor outcomes |
Cameron | 46186 | 5.6 | Moderate Confidence (5%-10%) | Use with caution - monitor outcomes |
Snyder | 65914 | 5.6 | Moderate Confidence (5%-10%) | Use with caution - monitor outcomes |
Carbon | 64538 | 5.3 | Moderate Confidence (5%-10%) | Use with caution - monitor outcomes |
Warren | 57925 | 5.2 | Moderate Confidence (5%-10%) | Use with caution - monitor outcomes |
Pike | 76416 | 4.9 | High Confidence (<5%) | Safe for algorithmic decisions |
Wayne | 59240 | 4.8 | High Confidence (<5%) | Safe for algorithmic decisions |
Juniata | 61915 | 4.8 | High Confidence (<5%) | Safe for algorithmic decisions |
McKean | 57861 | 4.7 | High Confidence (<5%) | Safe for algorithmic decisions |
Huntingdon | 61300 | 4.7 | High Confidence (<5%) | Safe for algorithmic decisions |
Indiana | 57170 | 4.6 | High Confidence (<5%) | Safe for algorithmic decisions |
Bedford | 58337 | 4.5 | High Confidence (<5%) | Safe for algorithmic decisions |
Potter | 56491 | 4.4 | High Confidence (<5%) | Safe for algorithmic decisions |
Lycoming | 63437 | 4.4 | High Confidence (<5%) | Safe for algorithmic decisions |
Clarion | 58690 | 4.4 | High Confidence (<5%) | Safe for algorithmic decisions |
Adams | 78975 | 4.2 | High Confidence (<5%) | Safe for algorithmic decisions |
Fayette | 55579 | 4.2 | High Confidence (<5%) | Safe for algorithmic decisions |
Crawford | 58734 | 3.9 | High Confidence (<5%) | Safe for algorithmic decisions |
Clinton | 59011 | 3.9 | High Confidence (<5%) | Safe for algorithmic decisions |
Wyoming | 67968 | 3.9 | High Confidence (<5%) | Safe for algorithmic decisions |
Columbia | 59457 | 3.8 | High Confidence (<5%) | Safe for algorithmic decisions |
Fulton | 63153 | 3.6 | High Confidence (<5%) | Safe for algorithmic decisions |
Mercer | 57353 | 3.6 | High Confidence (<5%) | Safe for algorithmic decisions |
Armstrong | 61011 | 3.6 | High Confidence (<5%) | Safe for algorithmic decisions |
Bradford | 60650 | 3.6 | High Confidence (<5%) | Safe for algorithmic decisions |
Blair | 59386 | 3.5 | High Confidence (<5%) | Safe for algorithmic decisions |
Venango | 59278 | 3.4 | High Confidence (<5%) | Safe for algorithmic decisions |
Mifflin | 58012 | 3.4 | High Confidence (<5%) | Safe for algorithmic decisions |
Jefferson | 56607 | 3.4 | High Confidence (<5%) | Safe for algorithmic decisions |
Cambria | 54221 | 3.3 | High Confidence (<5%) | Safe for algorithmic decisions |
Tioga | 59707 | 3.2 | High Confidence (<5%) | Safe for algorithmic decisions |
Monroe | 80656 | 3.2 | High Confidence (<5%) | Safe for algorithmic decisions |
Perry | 76103 | 3.2 | High Confidence (<5%) | Safe for algorithmic decisions |
Susquehanna | 63968 | 3.1 | High Confidence (<5%) | Safe for algorithmic decisions |
Lawrence | 57585 | 3.1 | High Confidence (<5%) | Safe for algorithmic decisions |
Franklin | 71808 | 3.0 | High Confidence (<5%) | Safe for algorithmic decisions |
Clearfield | 56982 | 2.8 | High Confidence (<5%) | Safe for algorithmic decisions |
Somerset | 57357 | 2.8 | High Confidence (<5%) | Safe for algorithmic decisions |
Centre | 70087 | 2.8 | High Confidence (<5%) | Safe for algorithmic decisions |
Lebanon | 72532 | 2.7 | High Confidence (<5%) | Safe for algorithmic decisions |
Northumberland | 55952 | 2.7 | High Confidence (<5%) | Safe for algorithmic decisions |
Butler | 82932 | 2.6 | High Confidence (<5%) | Safe for algorithmic decisions |
Lackawanna | 63739 | 2.6 | High Confidence (<5%) | Safe for algorithmic decisions |
Erie | 59396 | 2.6 | High Confidence (<5%) | Safe for algorithmic decisions |
Schuylkill | 63574 | 2.4 | High Confidence (<5%) | Safe for algorithmic decisions |
Washington | 74403 | 2.4 | High Confidence (<5%) | Safe for algorithmic decisions |
Luzerne | 60836 | 2.4 | High Confidence (<5%) | Safe for algorithmic decisions |
Beaver | 67194 | 2.3 | High Confidence (<5%) | Safe for algorithmic decisions |
Dauphin | 71046 | 2.3 | High Confidence (<5%) | Safe for algorithmic decisions |
Cumberland | 82849 | 2.2 | High Confidence (<5%) | Safe for algorithmic decisions |
Lehigh | 74973 | 2.0 | High Confidence (<5%) | Safe for algorithmic decisions |
Westmoreland | 69454 | 2.0 | High Confidence (<5%) | Safe for algorithmic decisions |
Northampton | 82201 | 1.9 | High Confidence (<5%) | Safe for algorithmic decisions |
Lancaster | 81458 | 1.8 | High Confidence (<5%) | Safe for algorithmic decisions |
York | 79183 | 1.8 | High Confidence (<5%) | Safe for algorithmic decisions |
Chester | 118574 | 1.7 | High Confidence (<5%) | Safe for algorithmic decisions |
Berks | 74617 | 1.6 | High Confidence (<5%) | Safe for algorithmic decisions |
Delaware | 86390 | 1.5 | High Confidence (<5%) | Safe for algorithmic decisions |
Bucks | 107826 | 1.4 | High Confidence (<5%) | Safe for algorithmic decisions |
Philadelphia | 57537 | 1.4 | High Confidence (<5%) | Safe for algorithmic decisions |
Montgomery | 107441 | 1.3 | High Confidence (<5%) | Safe for algorithmic decisions |
Allegheny | 72537 | 1.2 | High Confidence (<5%) | Safe for algorithmic decisions |
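A design note on the code above: because the pipeline ends in kable(), county_summary stores formatted table markup rather than a data frame, so the recommendation tiers cannot be queried later. A sketch (county_rec is a hypothetical name) that keeps the data separate from its presentation:

# Store the data first, format second; the tiers stay queryable
county_rec <- county_data %>%
  mutate(rec_for_algorithm = case_when(
    reliability == "High Confidence (<5%)" ~ "Safe for algorithmic decisions",
    reliability == "Moderate Confidence (5%-10%)" ~ "Use with caution - monitor outcomes",
    TRUE ~ "Requires manual review or additional data"
  ))
county_rec %>% count(rec_for_algorithm)  # 57 high-confidence, 10 moderate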
Key Recommendations:
Your Task: Use your analysis results to provide specific guidance to the department.
- Counties suitable for immediate algorithmic implementation: Allegheny, Montgomery, Bucks, Chester, Delaware, Lancaster, York, Northampton, Lehigh, Cumberland, Dauphin, Beaver, Luzerne, Washington, Schuylkill, Erie, Lackawanna, Butler, Lebanon, Centre, Somerset, Clearfield, Franklin, Lawrence, Susquehanna, Perry, Monroe, Tioga, Cambria, Jefferson, Mifflin, Venango, Blair, Bradford, Armstrong, Mercer, Fulton, Columbia, Wyoming, Clinton, Crawford, Fayette, Adams, Clarion, Lycoming, Potter, Bedford, Indiana, Huntingdon, McKean, Juniata, Wayne, Pike
Justification: Low margins of error indicate a reduced risk of algorithmic bias.
- Counties requiring additional oversight: Sullivan, Union, Montour, Elk, Greene, Cameron, Snyder, Carbon, Warren
Justification: Although these counties fall in the moderate confidence range, outcomes should be monitored to ensure appropriateness.
- Counties needing alternative approaches: Forest
Justification: Although Forest County falls in the moderate-confidence range, it borders the low-confidence threshold; automated decisions there should be avoided until data quality improves.
Questions for Further Investigation
- How narrow or strict should data reliability categories be?
- Would it be statistically meaningful to compare margin of error percentages of different variables?
Technical Notes
Data Sources:
- U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
- Retrieved via the tidycensus R package on 09/26/2025
Reproducibility:
- All analysis conducted in R version 4.5.1
- Census API key required for replication
- Complete code and documentation available at: https://musa-5080-fall-2025.github.io/portfolio-setup-aruth3/
Methodology Notes: County selection was based on numerical reliability values rather than categorical reliability labels, due to a lack of categorical variation.
Limitations: Certain percentage calculations (demographic shares and MOE percentages) returned infinite values where the denominator estimate was zero. These results should be interpreted with caution, as they do not represent meaningful statistical outcomes.
Submission Checklist
Before submitting your portfolio link on Canvas:
Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/assignments/assignment_1/your_file_name.html