Assignment 1: Census Data Quality for Policy Decisions

Evaluating Data Reliability for Algorithmic Decision-Making

Author

Tess Vu

Published

September 8, 2025

OVERVIEW

Scenario

I am a data analyst for the Utah Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. My supervisor has asked me to evaluate the quality and reliability of available census data to inform this decision.

Drawing on the Week 2 discussion of algorithmic bias, I need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.

PART I: SETUP

Code

# Load required packages
library(tidycensus)
library(tidyverse)
library(knitr)
library(kableExtra)

# Set Census API key (read from the CENSUS_API_KEY environment variable)
census_api_key(Sys.getenv("CENSUS_API_KEY"))

# Choose state for analysis - assign it to a variable called my_state
my_state <- "Utah"

State Selection: I’ve chosen Utah for this analysis because it’s my home state. I grew up in several neighborhoods around the Salt Lake Valley and would like to explore the data behind my memories, and perhaps see how things have changed since I last lived there about three years ago.

PART II: COUNTY-LEVEL RESOURCE ASSESSMENT

i. Data Retrieval

Code

# Write get_acs() code here
utah_reliability <- get_acs(
  geography = "county",
  state = my_state,
  # Median household income and total population.
  variables = c(median_inc = "B19013_001", total_pop = "B01003_001"),
  year = 2022,
  survey = "acs5",
  output = "wide"
)

# Clean the county names to remove state name and "County" 
# Hint: use mutate() with str_remove()
utah_reliability <- mutate(utah_reliability, NAME = str_remove(NAME, " County, Utah"))

# Display the first few rows
glimpse(utah_reliability)

Rows: 29
Columns: 6
$ GEOID       <chr> "49001", "49003", "49005", "49007", "49009", "49011", "490…
$ NAME        <chr> "Beaver", "Box Elder", "Cache", "Carbon", "Daggett", "Davi…
$ median_incE <dbl> 80268, 72769, 72719, 53734, 61250, 101285, 70821, 67056, 5…
$ median_incM <dbl> 11189, 2873, 2016, 4102, 14033, 1754, 3448, 3821, 7536, 67…
$ total_popE  <dbl> 7102, 58291, 134428, 20338, 638, 363032, 19779, 9898, 5121…
$ total_popM  <dbl> NA, NA, NA, NA, 124, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

ii. Data Quality Assessment

Code

# Calculate MOE percentage and reliability categories using mutate()
utah_reliability <- utah_reliability %>%
  mutate(
    # Relative margin of error for median household income, in percent.
    MOE_percent = (median_incM / median_incE) * 100,
    reliability_cat = case_when(
      MOE_percent < 5 ~ "High Confidence",
      MOE_percent <= 10 ~ "Moderate Confidence",
      MOE_percent > 10 ~ "Low Confidence"
    ),
    low_flag = reliability_cat == "Low Confidence"
  )

# Create a summary showing count of counties in each reliability category
# Hint: use count() and mutate() to add percentages

count_summary <- utah_reliability %>%
  count(reliability_cat, name = "frequency") %>%
  # percent_counties is stored as a proportion (0 to 1), not a percentage.
  mutate(percent_counties = frequency / sum(frequency))

count_summary

# A tibble: 3 × 3
  reliability_cat     frequency percent_counties
  <chr>                   <int>            <dbl>
1 High Confidence            10            0.345
2 Low Confidence              9            0.310
3 Moderate Confidence        10            0.345
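
The MOE percentage above can also be expressed as a standard error and a coefficient of variation (CV). ACS margins of error are published at the 90 percent confidence level, so dividing by 1.645 recovers the standard error. The following is a minimal base R sketch, not part of the assignment template: it reuses the first four counties from the glimpse() output, and the helper name `classify_reliability` is hypothetical.

```r
# First four counties from the glimpse() output above.
counties <- data.frame(
  NAME        = c("Beaver", "Box Elder", "Cache", "Carbon"),
  median_incE = c(80268, 72769, 72719, 53734),
  median_incM = c(11189, 2873, 2016, 4102)
)

# Hypothetical helper: same thresholds as the case_when() call above.
classify_reliability <- function(moe_percent) {
  ifelse(moe_percent < 5, "High Confidence",
         ifelse(moe_percent <= 10, "Moderate Confidence", "Low Confidence"))
}

counties$MOE_percent <- counties$median_incM / counties$median_incE * 100

# ACS MOEs use a 90% confidence level, so SE = MOE / 1.645.
counties$std_error <- counties$median_incM / 1.645

# Coefficient of variation: standard error relative to the estimate, in percent.
counties$cv_percent <- counties$std_error / counties$median_incE * 100

counties$reliability_cat <- classify_reliability(counties$MOE_percent)

print(counties)
```

Because the CV is just the MOE percentage divided by 1.645, the ranking of counties is unchanged; the CV is simply the form in which reliability thresholds are often quoted.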

iii. High Uncertainty Counties

Code

# Create table of top 5 counties by MOE percentage
top_5 <- utah_reliability %>%
  arrange(desc(MOE_percent)) %>%
  slice(1:5) %>%
  # Columns 2 to 8: NAME through reliability_cat (drops GEOID and low_flag).
  select(2:8)

# Format as table with kable() - include appropriate column names and caption
kable(
  top_5,
  col.names = c("County", "Median Income", "Median Income MOE", "Total Population", "Total Population MOE", "Percent MOE", "MOE Confidence"),
  digits = 2,
  caption = "<b>TOP 5 UTAH COUNTIES: HIGHEST MEDIAN INCOME MARGIN-OF-ERROR</b>"
) %>%
  kable_styling(latex_options = "striped") %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, color = "white", background = "black")

**TOP 5 UTAH COUNTIES: HIGHEST MEDIAN INCOME MARGIN-OF-ERROR**
County	Median Income	Median Income MOE	Total Population	Total Population MOE	Percent MOE	MOE Confidence
Piute	33359	12379	1764	124	37.11	Low Confidence
Wayne	64870	16573	2532	NA	25.55	Low Confidence
Daggett	61250	14033	638	124	22.91	Low Confidence
Kane	70327	11967	7814	NA	17.02	Low Confidence
Beaver	80268	11189	7102	NA	13.94	Low Confidence

Data Quality Commentary:

The five counties with the highest MOEs are all rural. Piute, Wayne, Daggett, and Kane counties lack a major interstate highway, in a state where travel depends on cars both within urban areas and across vast stretches of undeveloped desert. Beaver County, in fifth place, does have the I-15 freeway. These counties also have small populations (638 to 7,814 residents in the table above), so their ACS samples are small, which widens the margins of error. Topography is another likely factor: all five counties are mostly mountainous, which makes infrastructure expensive to build and travel difficult.

PART III: NEIGHBORHOOD-LEVEL ANALYSIS

i. Focus Area Selection

Code

# Use filter() to select 2-3 counties from utah_reliability data
# Store the selected counties in a variable called selected_counties
selected_counties <- utah_reliability %>%
  filter(NAME %in% c("Salt Lake", "Morgan", "Piute")) %>%
  arrange(MOE_percent)

# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category
kable(
  selected_counties[c("GEOID", "NAME", "median_incE", "MOE_percent", "reliability_cat")],
  col.names = c("GEOID", "County", "Median Income", "MOE Percentage", "MOE Confidence"),
  digits = 2,
  caption = "<b>SELECTED UTAH COUNTIES</b>"
) %>%
  kable_styling(latex_options = "striped") %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, color = "white", background = "black")

**SELECTED UTAH COUNTIES**
GEOID	County	Median Income	MOE Percentage	MOE Confidence
49035	Salt Lake	90011	1.17	High Confidence
49029	Morgan	120854	7.78	Moderate Confidence
49031	Piute	33359	37.11	Low Confidence

Comment on the output: The table shows Salt Lake, Morgan, and Piute counties, which have high, moderate, and low confidence, respectively. Salt Lake has an MOE of 1.17% and Morgan 7.78%, while Piute, at 37.11%, sits 29.33 percentage points above Morgan. Salt Lake and Morgan counties are adjacent to one another, whereas Piute is in rural central Utah.

ii. Tract-Level Demographics

Code

# Define race/ethnicity variables with descriptive names
# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter
tract_level <- get_acs(
  geography = "tract",
  state = my_state,
  # Non-Hispanic white, non-Hispanic Black, Hispanic or Latino, and total population.
  variables = c(white = "B03002_003", black = "B03002_004", hispanic = "B03002_012", total = "B03002_001"),
  year = 2022,
  survey = "acs5",
  output = "wide"
) %>%
  filter(str_detect(GEOID, "^49035") | str_detect(GEOID, "^49029") | str_detect(GEOID, "^49031"))

# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations
tract_level <- tract_level %>%
  mutate(
    "White Percentage" = (whiteE / totalE) * 100,
    "Black Percentage" = (blackE / totalE) * 100,
    "Hispanic Percentage" = (hispanicE / totalE) * 100
  )

# Add readable tract and county name columns using str_extract() or similar
tract_level <- tract_level %>%
  separate(
    col = NAME,
    into = c("TRACT", "COUNTY", "STATE"),
    sep = ";"
  )

tract_level

# A tibble: 255 × 15
   GEOID      TRACT COUNTY STATE whiteE whiteM blackE blackM hispanicE hispanicM
   <chr>      <chr> <chr>  <chr>  <dbl>  <dbl>  <dbl>  <dbl>     <dbl>     <dbl>
 1 490299701… Cens… " Mor… " Ut…   5822    369     64     70       214       116
 2 490299701… Cens… " Mor… " Ut…   4031    362      5     10       159       118
 3 490299702… Cens… " Mor… " Ut…   1679    232      0     12         6        12
 4 490319601… Cens… " Piu… " Ut…   1674    134      3      7        75        57
 5 490351001… Cens… " Sal… " Ut…   1853    370    119    125       663       384
 6 490351002… Cens… " Sal… " Ut…   1150    228     13     16       115        65
 7 490351003… Cens… " Sal… " Ut…   1358    329    216    124      3319      1218
 8 490351003… Cens… " Sal… " Ut…   1371    267     85    107      2603       475
 9 490351003… Cens… " Sal… " Ut…    812    231    364    230      2358       454
10 490351005… Cens… " Sal… " Ut…   2788    617    102    115      2010       730
# ℹ 245 more rows
# ℹ 5 more variables: totalE <dbl>, totalM <dbl>, `White Percentage` <dbl>,
#   `Black Percentage` <dbl>, `Hispanic Percentage` <dbl>
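
The COUNTY and STATE values printed above keep a leading space because NAME is split on ";" alone. One way to avoid that is to split on the semicolon plus any whitespace that follows it. The sketch below uses base R and hypothetical NAME strings in the semicolon-separated tract format (the tract numbers and county names are made up for illustration).

```r
# HYPOTHETICAL NAME values in the semicolon-separated tract format.
tract_names <- c(
  "Census Tract 1; Example County; Utah",
  "Census Tract 2.01; Another County; Utah"
)

# Split on a semicolon plus any following whitespace so that no piece
# keeps a leading space.
parts <- do.call(rbind, strsplit(tract_names, ";\\s*"))

name_cols <- data.frame(
  TRACT  = parts[, 1],
  COUNTY = parts[, 2],
  STATE  = parts[, 3]
)

print(name_cols)
```

With tidyr, passing the same regular expression as `sep = ";\\s*"` to separate() should have the same effect, since a character `sep` is treated as a regular expression.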

iii. Demographic Analysis

Code

# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract
tract_level <- tract_level %>%
  arrange(desc(`Hispanic Percentage`))

# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group
tract_level_summary <- tract_level %>%
  group_by(COUNTY) %>%
  summarize("Number of Tracts" = n(),
            "White Average Percent" = mean(`White Percentage`, na.rm = TRUE),
            "Black Average Percent" = mean(`Black Percentage`, na.rm = TRUE),
            "Hispanic Average Percent" = mean(`Hispanic Percentage`, na.rm = TRUE)
            )

# Create a nicely formatted table of results using kable()
kable(
  tract_level_summary,
  caption = "<b>SELECTED UTAH COUNTIES: Average Racial Percents</b>"
) %>%
  kable_styling(latex_options = "striped") %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, color = "white", background = "black")

**SELECTED UTAH COUNTIES: Average Racial Percents**
COUNTY	Number of Tracts	White Average Percent	Black Average Percent	Hispanic Average Percent
Morgan County	3	93.48180	0.3784084	2.480240
Piute County	1	94.89796	0.1700680	4.251701
Salt Lake County	251	69.57696	1.7770290	18.331447
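
Note that the table above reports the unweighted mean of tract percentages, which can differ from a county's overall share when tracts vary in size. The sketch below illustrates the difference in base R. The two tract counts are HYPOTHETICAL numbers chosen for easy arithmetic, not ACS values.

```r
# HYPOTHETICAL tract counts for one county (not ACS values).
tracts <- data.frame(
  hispanicE = c(50, 900),
  totalE    = c(1000, 6000)
)

tract_pct <- tracts$hispanicE / tracts$totalE * 100

# Unweighted mean of tract percentages (the approach used in the table above).
mean_of_tracts <- mean(tract_pct)

# Population-weighted share: all Hispanic residents over all residents.
county_share <- sum(tracts$hispanicE) / sum(tracts$totalE) * 100

print(c(mean_of_tracts = mean_of_tracts, county_share = county_share))
```

In this toy case the larger tract pulls the county share above the simple average. Which figure is appropriate depends on whether the unit of interest is the typical tract or the county's residents as a whole.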

PART IV: COMPREHENSIVE DATA QUALITY EVALUATION

i. MOE Analysis for Demographic Variables

Code

# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)
options(scipen = 999)

# Avoid Inf values by changing 0 values to 0.1.
MOE_percentages <- tract_level %>%
  mutate(across(c(whiteM, whiteE, blackM, blackE, hispanicM, hispanicE),
                ~ifelse(. == 0, 0.1, .)))

MOE_percentages <- MOE_percentages %>%
  mutate(
    "White MOE" = (whiteM / whiteE) * 100,
    "Black MOE" = (blackM / blackE) * 100,
    "Hispanic MOE" = (hispanicM / hispanicE) * 100
  )

# Use logical operators (| for OR) in an ifelse() statement
MOE_percentages <- MOE_percentages %>%
  mutate(
    # TRUE when at least one group's MOE exceeds 15% of its estimate.
    "MOE Flag" = if_else(
      `White MOE` > 15 | `Black MOE` > 15 | `Hispanic MOE` > 15, TRUE, FALSE
    )
  )

# Create summary statistics showing how many tracts have data quality issues
data_quality_summary <- MOE_percentages %>%
  group_by(COUNTY) %>%
  summarize(
    "Number of Tracts" = n(),
    "Data Quality Issues" = sum(`MOE Flag`, na.rm = TRUE)
    )

data_quality_summary

# A tibble: 3 × 3
  COUNTY              `Number of Tracts` `Data Quality Issues`
  <chr>                            <int>                 <int>
1 " Morgan County"                     3                     3
2 " Piute County"                      1                     1
3 " Salt Lake County"                251                   251
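
The MOE percentages above describe the counts. The demographic percentages from Part III are derived values and carry their own margins of error. A minimal base R sketch of the Census Bureau's approximation for the MOE of a proportion follows. The helper name `moe_proportion` is hypothetical (tidycensus offers `moe_prop()` for the same purpose). The inputs come from Piute County's single tract: 75 Hispanic residents with an MOE of 57 and a total of 1,764. The total's MOE of 124 is an ASSUMPTION borrowed from the county-level total population, since the tract-level totalM column is not displayed above.

```r
# Hypothetical helper: approximate MOE of a proportion p = num / denom.
moe_proportion <- function(num, denom, moe_num, moe_denom) {
  p <- num / denom
  radicand <- moe_num^2 - p^2 * moe_denom^2
  # If the radicand is negative, fall back to the ratio formula (plus sign).
  radicand <- ifelse(radicand < 0, moe_num^2 + p^2 * moe_denom^2, radicand)
  sqrt(radicand) / denom
}

# Piute County's single tract, using values shown in the tables above.
# ASSUMPTION: the tract total shares the county total's MOE of 124.
hispanic_share <- 75 / 1764
hispanic_moe <- moe_proportion(num = 75, denom = 1764,
                               moe_num = 57, moe_denom = 124)

# Share and its margin of error, both in percentage points.
print(round(c(share = hispanic_share, moe = hispanic_moe) * 100, 2))
```

Under these assumptions the margin of error is close in size to the estimated share itself, which is another way of seeing why small-group estimates in small tracts are fragile.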

ii. Pattern Analysis

Code

# Group tracts by whether they have high MOE issues
# Calculate average characteristics for each group:
# - population size, demographic percentages
# Use group_by() and summarize() to create this comparison

flagged_MOE <- MOE_percentages %>%
  group_by(`MOE Flag`) %>%
  summarize(
    "Avg White" = mean(whiteE, na.rm = TRUE), "Avg % White" = mean(`White Percentage`, na.rm = TRUE),
    "Avg Black" = mean(blackE, na.rm = TRUE), "Avg % Black" = mean(`Black Percentage`, na.rm = TRUE),
    "Avg Hispanic" = mean(hispanicE, na.rm = TRUE), "Avg % Hispanic" = mean(`Hispanic Percentage`, na.rm = TRUE)
    )

# Create a professional table showing the patterns
kable(
  flagged_MOE,
  digits = 2,
  caption = "<b>SALT LAKE, MORGAN, AND PIUTE COUNTIES: Average Population and Percents by Race</b>"
) %>%
  kable_styling(latex_options = "striped") %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, color = "white", background = "black")

**SALT LAKE, MORGAN, AND PIUTE COUNTIES: Average Population and Percents by Race**
MOE Flag	Avg White	Avg % White	Avg Black	Avg % Black	Avg Hispanic	Avg % Hispanic
TRUE	3252.52	69.96	77.52	1.75	884.83	18.09

Code

# Check minimums.
min(MOE_percentages[["White MOE"]])

[1] 6.309819

Code

min(MOE_percentages[["Black MOE"]])

[1] 54.38596

Code

min(MOE_percentages[["Hispanic MOE"]])

[1] 17.59717

Code

# Flag tracts where at least one group's MOE is below 15%.
low_MOE <- MOE_percentages %>%
  mutate(
    "Low MOE" = if_else(
      `White MOE` < 15 | `Black MOE` < 15 | `Hispanic MOE` < 15, TRUE, FALSE
    )
  )

# Group low MOEs.
low_MOE_summary <- low_MOE %>%
  group_by(`Low MOE`) %>%
  summarize(
    "Avg White" = mean(whiteE, na.rm = TRUE), "Avg % White" = mean(`White Percentage`, na.rm = TRUE),
    "Avg Black" = mean(blackE, na.rm = TRUE), "Avg % Black" = mean(`Black Percentage`, na.rm = TRUE),
    "Avg Hispanic" = mean(hispanicE, na.rm = TRUE), "Avg % Hispanic" = mean(`Hispanic Percentage`, na.rm = TRUE)
    )
  
low_MOE_summary

# A tibble: 2 × 7
  `Low MOE` `Avg White` `Avg % White` `Avg Black` `Avg % Black` `Avg Hispanic`
  <lgl>           <dbl>         <dbl>       <dbl>         <dbl>          <dbl>
1 FALSE           2930.          63.8        96.2          2.15          1117.
2 TRUE            3736.          79.0        49.5          1.17           536.
# ℹ 1 more variable: `Avg % Hispanic` <dbl>

Pattern Analysis: Every census tract in Salt Lake, Morgan, and Piute counties had at least one group with an MOE above 15%, which suggests substantial uncertainty in the tract-level estimates. The minimum MOE percentages show where that uncertainty concentrates: the white population’s minimum is 6.31%, while the Hispanic minimum is 17.60% and the Black minimum is 54.39%, both well above the white figure. Because no tract has a Black or Hispanic MOE below 15%, the flag is triggered in every tract. One possible explanation is the legacy of redlining; from anecdotal knowledge, diversity in Salt Lake County tends to be clustered on the west side and periphery, where Black, Hispanic, and Asian populations are larger. Another is that the other two counties are smaller and have much larger white shares, so estimates for smaller groups there rest on very few sampled households.

PART V: POLICY RECOMMENDATIONS

i. Analysis Integration and Professional Summary

Here I address:

Overall Pattern Identification: What are the systematic patterns across all my analyses?
Equity Assessment: Which communities face the greatest risk of algorithmic bias based on my findings?
Root Cause Analysis: What underlying factors drive both data quality issues and bias risk?
Strategic Recommendations: What should the Department implement to address these systematic issues?

Looking at the data without any maps or charts, many of the counties with high margins of error tend to be rural, and a good number of these rural counties have as few as three census tracts, compared with urban counties that have nearly a hundred or more than two hundred. Even counties with moderate confidence have very few census tracts: Morgan, which is adjacent to and northeast of high-confidence Salt Lake County, is one example. Among the low-confidence counties, the trend seems to be that confidence rises with income and population and falls as the margin of error rises. However, this trend appears to hold mainly for more densely populated counties: Morgan County has a median income around 120,000 with moderate reliability and fewer than 12,000 residents, and Beaver County has a median income around 80,000 with low confidence and fewer than 8,000 residents. Income is also strongly related to race, so Black and Hispanic communities may show a similar pattern, although this applies more to diverse, urban areas than to rural ones.

Because of these observations, the communities at greatest risk of algorithmic bias would be Black, Hispanic, rural, and low-income populations, and potentially other groups not analyzed here, such as Native Americans. Much of the diversity in the US results from immigrants arriving in coastal states and from enslaved Black people who were concentrated in the South, and over the decades that diversity has spread to more inland states. Utah in particular was an overwhelming 98% white around the 1970s, and a little over five decades later it is about 90% white overall.

In addition, Utah’s desert and rocky terrain make construction difficult and expensive. The terrain also makes travel difficult, even with private vehicles. Much of Utah is mountainous or desert, and many rural communities lack access to broadband. This digital gap could be an obstacle for rural residents, who receive census surveys by mail and who may live along routes that are difficult for the US Postal Service to reach. Where rural areas or developing towns do have internet service, it tends to run on older and much slower infrastructure such as copper, coaxial cable, or even satellite, which is outdated compared with the current fiber optic standard. Mail is therefore the best way to reach rural areas and Native reservations, but responses by mail are slower to collect because of the physical distances involved.

Looking at the averages for tracts with a margin of error below 15% in Part IV, section ii, the lower margins of error occur in tracts that are even less diverse than the rest: the average white share is 15.23 percentage points higher, while the average Black and Hispanic shares are 0.98 and 12.50 percentage points lower, respectively. From anecdotal knowledge as a born-and-raised Utahn from Salt Lake County, the valley has the picturesque Wasatch mountains to the east, and flatter, dustier mountain ranges and the Kennecott Copper Mine to the far west. The county is also split from north to south by the main interstate highway, I-15. Due to historical redlining and more frequent efforts to uplift communities near the Wasatch mountains, many wealthy and white residents live east of I-15, while much of Salt Lake’s diversity, namely its Black, Hispanic, and Asian communities, is clustered west of I-15.

The Department of Health and Human Services could work with the Department of Commerce, which oversees telecommunications, to develop a program that incentivizes or subsidizes telecommunications companies to build out fiber broadband networks offering at least 250 to 300 Mbps. Because fiber is expensive to build out, companies could use older fiber cable technology as a compromise relative to the 500 Mbps to 1 Gbps commonly found in more urban regions. While rural areas receive the census by mail, the vast majority of rural residents own smartphones and rely on their internet being dependable even if it is not very fast. They are unlikely to have trouble navigating internet browsers, but filling out the census on a mobile phone could be a hurdle, so nearby public facilities such as libraries could offer in-person or virtual help with the census survey. This suggestion might produce only a marginal change because most Americans use smartphones, but slow internet can often be a deterrent when filling out forms.

The higher margins of error for certain races in Utah could also be due to cultural differences. Many immigrants, who make up the majority of BIPOC demographics, may not read English well, or may be less likely to fill out the census for a variety of reasons. In the 2020 ACS, only twelve languages were available for reading the survey questions, and languages outside those twelve require extra steps to receive translations. It could be worth creating a help center in public libraries for speakers of minority languages, although whether individuals would come in for that service is another matter entirely. Providing online videos and glossaries in minority languages could therefore be helpful, as could allowing individuals to request a mailed paper glossary with key translations in their language.

ii. Specific Recommendations

Code

# Create a summary table using county reliability data
# Include: county name, median income, MOE percentage, reliability category
summary_table <- utah_reliability %>%
  # select() keeps one row per county and renames the columns for display.
  select(
    "County" = NAME,
    "Median Income" = median_incE,
    "MOE Percentage" = MOE_percent,
    "Reliability Category" = reliability_cat
  )

# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"  
# - Low Confidence: "Requires manual review or additional data"
summary_table <- summary_table %>%
  mutate(
    "Algorithm Recommendations" = case_when(
      `Reliability Category` == "High Confidence" ~ "Safe for algorithmic decisions",
      `Reliability Category` == "Moderate Confidence" ~ "Use with caution - monitor outcomes",
      `Reliability Category` == "Low Confidence" ~ "Requires manual review or additional data"
    )
  )

# Format as a professional table with kable()
kable(
  summary_table,
  digits = 2,
  caption = "<b>UTAH COUNTIES: Median Income and Reliability for Algorithmic Decision-Making</b>"
) %>%
  kable_styling(latex_options = "striped") %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, color = "white", background = "black")

**UTAH COUNTIES: Median Income and Reliability for Algorithmic Decision-Making**
County	Median Income	MOE Percentage	Reliability Category	Algorithm Recommendations
Beaver	80268	13.94	Low Confidence	Requires manual review or additional data
Box Elder	72769	3.95	High Confidence	Safe for algorithmic decisions
Cache	72719	2.77	High Confidence	Safe for algorithmic decisions
Carbon	53734	7.63	Moderate Confidence	Use with caution - monitor outcomes
Daggett	61250	22.91	Low Confidence	Requires manual review or additional data
Davis	101285	1.73	High Confidence	Safe for algorithmic decisions
Duchesne	70821	4.87	High Confidence	Safe for algorithmic decisions
Emery	67056	5.70	Moderate Confidence	Use with caution - monitor outcomes
Garfield	56481	13.34	Low Confidence	Requires manual review or additional data
Grand	59171	11.41	Low Confidence	Requires manual review or additional data
Iron	63005	6.97	Moderate Confidence	Use with caution - monitor outcomes
Juab	88048	7.68	Moderate Confidence	Use with caution - monitor outcomes
Kane	70327	17.02	Low Confidence	Requires manual review or additional data
Millard	69403	4.16	High Confidence	Safe for algorithmic decisions
Morgan	120854	7.78	Moderate Confidence	Use with caution - monitor outcomes
Piute	33359	37.11	Low Confidence	Requires manual review or additional data
Rich	69250	12.42	Low Confidence	Requires manual review or additional data
Salt Lake	90011	1.17	High Confidence	Safe for algorithmic decisions
San Juan	52108	10.34	Low Confidence	Requires manual review or additional data
Sanpete	64356	5.29	Moderate Confidence	Use with caution - monitor outcomes
Sevier	66972	7.06	Moderate Confidence	Use with caution - monitor outcomes
Summit	126392	6.33	Moderate Confidence	Use with caution - monitor outcomes
Tooele	95545	4.57	High Confidence	Safe for algorithmic decisions
Uintah	67983	7.48	Moderate Confidence	Use with caution - monitor outcomes
Utah	91263	1.09	High Confidence	Safe for algorithmic decisions
Wasatch	104855	7.16	Moderate Confidence	Use with caution - monitor outcomes
Washington	71976	4.74	High Confidence	Safe for algorithmic decisions
Wayne	64870	25.55	Low Confidence	Requires manual review or additional data
Weber	82291	2.15	High Confidence	Safe for algorithmic decisions

Key Recommendations:

Counties suitable for immediate algorithmic implementation: Box Elder, Cache, Davis, Duchesne, Millard, Salt Lake, Tooele, Utah, Washington, and Weber counties have high confidence data, especially since these counties are either major urban regions in the state or have relatively large populations of around 20,000 or more. However, even where counties are suitable for algorithmic implementation, automated decision-making should have flagging features built in for potential review. The volume of data recorded in these counties is large, which makes any manual process of collecting and wrangling it time-consuming, monotonous, and prone to human error. Letting the algorithm do the heavy lifting, with human review whenever a flagging safeguard is triggered, could be crucial to maintaining or further improving high confidence data.
Counties requiring additional oversight: Carbon, Emery, Iron, Juab, Morgan, Sanpete, Sevier, Summit, Uintah, and Wasatch counties have moderate confidence data. Tracking how their margins of error have changed across census releases might be a worthwhile endeavor. If certain counties are gradually improving, oversight can help their data reliability transition to high confidence. If certain counties have stagnant margins of error, higher year-to-year variability than other counties, or a worsening trend, then manual review could be introduced in parts of the data collection and cleaning process. Even sending out an additional survey might improve data confidence.
Counties needing alternative approaches: Beaver, Daggett, Garfield, Grand, Kane, Piute, Rich, San Juan, and Wayne counties have low confidence data. These counties are in the more rural parts of Utah, which tend to have harsher desert environments and rougher terrain for travel, and they also include some of Utah’s Native American reservation land. Their much smaller populations and population densities could make manual review worthwhile, along with sending additional surveys and following up to make sure people return them. While USPS delivery procedures are largely the same everywhere, mailing to reservations poses a unique challenge beyond remoteness: many residences do not have street addresses. Sending more surveys to make sure rural areas and reservations receive the census could be part of a solution. The Department could also introduce a pilot program to provide P.O. boxes for reservations, if Indigenous peoples in the region are open to the idea, although it would take many years to see whether it substantially improved census accuracy and lowered MOEs.

Questions for Further Investigation

How would observations and interpretations change if other races were included, like Asians and Native Americans?
Because the latest 5-year ACS for Utah had reported significant errors, how would that survey compare to previous surveys?
How would this look spatially visualized? A deeper dive into neighborhoods with high MOE census tracts could be beneficial.
What is the margin-of-error’s relationship with population density and Utah’s topography, in addition to median income?

Technical Notes

Data Sources: U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates. Retrieved via the tidycensus R package on 09/15/25.

Reproducibility: All analysis was conducted in R version 4.5.1. A Census API key is required for replication. Complete code and documentation are available at: https://github.com/MUSA-5080-Fall-2025/portfolio-setup-TessaVu

Methodology Notes: Because several estimates were zero, they were replaced with a small value (0.1) to avoid division by zero, which produced extreme MOE percentages such as 17,000.
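
The effect of the zero substitution can be seen on the first three tract rows printed in Part III, where the third row has a Black estimate of 0 with an MOE of 12. The base R sketch below contrasts the substitution with one possible alternative, which is an assumption on my part rather than the method used in this report: leave the ratio undefined (NA) and record the zero estimate in a separate flag.

```r
# Black estimates and MOEs from the first three tract rows printed in Part III.
blackE <- c(64, 5, 0)
blackM <- c(70, 10, 12)

# Substituting 0.1 for a zero estimate inflates the ratio dramatically.
substituted <- blackM / ifelse(blackE == 0, 0.1, blackE) * 100

# Alternative sketch: leave the ratio undefined and flag the tract separately.
moe_pct <- ifelse(blackE == 0, NA_real_, blackM / blackE * 100)
zero_estimate <- blackE == 0

print(data.frame(blackE, blackM, substituted, moe_pct, zero_estimate))
```

With this alternative, summaries such as min() or mean() would need `na.rm = TRUE`, and the zero-estimate flag could be reported alongside the MOE flag instead of being folded into an extreme percentage.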

Limitations: Tracts reporting zero Black or Hispanic residents caused sampling issues, because dividing by zero produced infinite values. This project selected Salt Lake, Morgan, and Piute counties, which have high, moderate, and low confidence, respectively. However, Morgan and Piute counties have very few census tracts, at 3 and 1, compared with 251 for Salt Lake. In addition, Utah was one of the 14 states in the recent 5-year census that over-counted its citizens by almost 3% with several individuals double-counted.

NOTE: Credit and AI Usage

No AI usage.

Base code template provided by Dr. Delmelle with adjustments.

Some visuals taken from lecture examples and modified.