Week 3 Notes - Data Viz and Exploratory Analysis

Published

September 22, 2025

Key Concepts Learned

  • Data viz basics with ggplot2 in RStudio
  • Exploratory Data Analysis (EDA) Workflow
    • Load and inspect - dimensions, variable types, missing data
    • Assess reliability - examine margins of error, calculate coefficients of variation
    • Visualize distributions - histograms, boxplots for each variable
    • Explore relationships - scatter plots, correlations
    • Identify patterns - grouping, clustering, geographical patterns
    • Question anomalies - investigate outliers and unusual patterns
    • Document limitations - prepare honest communication about data quality
  • Combining datasets through table joins (dplyr), matching rows on a shared key column
    • left_join() - Keep all rows from left dataset
    • right_join() - Keep all rows from right dataset
    • inner_join() - Keep only rows that match in both
    • full_join() - Keep all rows from both datasets
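
The four join types above can be compared on two tiny tables. This is a minimal sketch assuming the dplyr package is installed; the tables `counties` and `incomes`, the key column `GEOID`, and all values are made up for illustration.

```r
library(dplyr)

# Hypothetical toy tables sharing the key column `GEOID` (values invented)
counties <- data.frame(
  GEOID = c("001", "003", "005"),
  name  = c("A County", "B County", "C County")
)
incomes <- data.frame(
  GEOID         = c("003", "005", "007"),
  median_income = c(52000, 61000, 48000)
)

left  <- left_join(counties, incomes, by = "GEOID")   # every row of counties
right <- right_join(counties, incomes, by = "GEOID")  # every row of incomes
inner <- inner_join(counties, incomes, by = "GEOID")  # only GEOIDs found in both
full  <- full_join(counties, incomes, by = "GEOID")   # every GEOID from either table

# Row counts make the difference between the joins visible
print(c(left = nrow(left), right = nrow(right),
        inner = nrow(inner), full = nrow(full)))
```

Rows without a match are kept by the left, right, and full joins, with `NA` filling the columns that come from the other table.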

Coding Techniques

Formatting Charts: Grammar of Graphics

Data → Aesthetics → Geometries → Visual

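# Formatting Charts
library(ggplot2)

# Data: demo_data is assumed to hold one row per county, with ACS estimate
# columns total_popE and median_incomeE
ggplot(demo_data) +
  # Aesthetics: map variables to the x and y axes
  aes(x = total_popE, y = median_incomeE) +
  # Geometries: semi-transparent points plus a linear trend line with a confidence band
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE) +
  # Labels: title, axes, and data source
  labs(
    title = "Income vs Population in Pennsylvania Counties",
    subtitle = "2018-2022 ACS 5-Year Estimates",
    x = "Total Population",
    y = "Median Household Income ($)",
    caption = "Source: U.S. Census Bureau ACS"
  ) +
  theme_minimal() +
  # Format axis tick labels as dollars and comma-separated counts
  scale_y_continuous(labels = scales::dollar) +
  scale_x_continuous(labels = scales::comma)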

Finding Census Variable Codes

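# Finding Census Variable Codes
# Assumes acs_vars_2022 was created earlier, presumably with
# tidycensus::load_variables(2022, "acs5"), so it has a `label` column
library(dplyr)
library(stringr)

# Variables whose label mentions a total population
population_vars <- acs_vars_2022 %>%
  filter(str_detect(label, "Total.*population"))

# Median age variables ([Mm] matches either capitalization)
age_vars <- acs_vars_2022 %>%
  filter(str_detect(label, "[Mm]edian.*age"))

# Median rent or median home value variables
housing_vars <- acs_vars_2022 %>%
  filter(str_detect(label, "[Mm]edian.*(rent|value)"))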

Questions & Challenges

  • With more practice, I hope to write code more independently instead of relying on the demo code

Connections to Policy

  • Summary statistics can hide critical patterns
  • Outliers may represent important communities
  • Relationships aren’t always linear
  • Visual inspection reveals data quality issues
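
The first and third points can be checked with the `anscombe` dataset that ships with base R. This is a small sketch rather than the course's own code; the names `sets` and `summary_stats` are chosen here for illustration.

```r
# `anscombe` is built into R (datasets package): columns x1..x4 and y1..y4
sets <- 1:4

summary_stats <- data.frame(
  set    = sets,
  mean_x = sapply(sets, function(i) mean(anscombe[[paste0("x", i)]])),
  mean_y = sapply(sets, function(i) mean(anscombe[[paste0("y", i)]])),
  cor_xy = sapply(sets, function(i) cor(anscombe[[paste0("x", i)]],
                                        anscombe[[paste0("y", i)]]))
)

# The four rows are nearly identical, so the summary table alone
# gives no hint that the underlying scatter plots look very different
print(round(summary_stats, 2))
```

Plotting each x/y pair shows what the table hides: one roughly linear cloud, one curve, one line with a single outlier, and one set where a single point drives the whole correlation.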

Reflection

  • I really liked the Anscombe’s Quartet example: it is a powerful and intuitive demonstration of why data visualization matters
  • It was astounding to learn that only 27% of planners warn users about unreliable ACS data. I admit to being guilty of this oversight too
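
One way to act on the ACS reliability point is to flag estimates by their coefficient of variation (CV). The sketch below uses invented numbers and column names that follow the `E`/`M` suffix pattern seen earlier in these notes. The 12% and 40% cutoffs are an assumption for illustration, not an official standard.

```r
# Hypothetical ACS-style estimates (E) and margins of error (M); values invented
acs_demo <- data.frame(
  name           = c("Tract 1", "Tract 2", "Tract 3"),
  median_incomeE = c(60000, 45000, 30000),
  median_incomeM = c(3000, 9000, 21000)
)

# ACS margins of error are published at the 90% confidence level,
# so dividing by 1.645 recovers the standard error
acs_demo$se <- acs_demo$median_incomeM / 1.645

# CV = standard error as a percentage of the estimate
acs_demo$cv <- 100 * acs_demo$se / acs_demo$median_incomeE

# Assumed cutoffs (12% and 40%); adjust to whatever the analysis calls for
acs_demo$reliability <- cut(
  acs_demo$cv,
  breaks = c(-Inf, 12, 40, Inf),
  labels = c("high", "moderate", "low")
)

print(acs_demo[, c("name", "cv", "reliability")])
```

Carrying a reliability column like this alongside each estimate makes it easier to include a warning whenever the numbers are reported or mapped.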