MUSA 5080 Notes #3
Week 3: Data Visualization & Exploratory Analysis
Date: 09/22/2025
Part 1: Why Visualization Matters
The Power of Visual Communication
This week I learned about the critical importance of data visualization, especially in policy contexts. For Assignment 1, I created tables showing income reliability patterns across counties, but I realized that visual presentation could dramatically change the impact of my analysis when presenting to:
- The state legislature (2-minute briefing)
- Community advocacy groups
- Local news reporters
Anscombe’s Quartet: The Famous Lesson
Four datasets with identical summary statistics:
- Same means (x̄ = 9, ȳ = 7.5)
- Same variances
- Same correlation (r = 0.816)
- Same regression line
But completely different patterns when visualized
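This is easy to verify in R, since the quartet ships with base R as the anscombe dataset. A quick sketch (the pivot_longer reshaping follows the pattern from the tidyr documentation):

```r
library(tidyverse)

# Each x/y pair has essentially the same correlation (~0.816)...
sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})

# ...but faceted scatter plots reveal four very different patterns
anscombe_long <- anscombe %>%
  pivot_longer(
    everything(),
    names_to = c(".value", "set"),
    names_pattern = "(x|y)(\\d)"
  )

ggplot(anscombe_long, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~set)
```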
Policy Implications I Need to Remember
Why this matters for my work:
- Summary statistics can hide critical patterns
- Outliers may represent important communities
- Relationships aren't always linear
- Visual inspection reveals data quality issues
Example: A county with “average” income might have extreme inequality that algorithms would miss without visualization.
Ethical Data Communication
From last week’s algorithmic bias discussion:
Research finding: Only 27% of planners warn users about unreliable ACS data
- Most planners don't report margins of error
- Many lack training on statistical uncertainty
- This violates the AICP Code of Ethics
My responsibility:
- Create honest, transparent visualizations
- Always assess and communicate data quality
- Consider who might be harmed by uncertain data
Real Consequences of Bad Visualizations
Common problems in government data presentation:
- Misleading scales or axes
- Cherry-picked time periods
- Hidden or ignored uncertainty
- Missing context about data reliability
Real impact: The Jurjevich et al. study found that 72% of Portland census tracts had unreliable child poverty estimates, yet planners rarely communicated this uncertainty.
Result: Poor policy decisions based on misunderstood data
Part 2: Grammar of Graphics
The ggplot2 Philosophy
Grammar of Graphics principles:
Data → Aesthetics → Geometries → Visual
- Data: My dataset (census data, survey responses, etc.)
- Aesthetics: What variables map to visual properties (x, y, color, size)
- Geometries: How to display the data (points, bars, lines)
- Additional layers: Scales, themes, facets, annotations
Basic ggplot2 Structure
Every ggplot I create has this pattern:
```r
ggplot(data = my_data) +
  aes(x = variable1, y = variable2) +
  geom_something() +
  additional_layers()
```
I build plots by adding layers with +
Aesthetic Mappings: The Key to ggplot2
Aesthetics map data to visual properties:
- x, y - position
- color - point/line color
- fill - area fill color
- size - point/line size
- shape - point shape
- alpha - transparency

Important: Aesthetics go inside aes(), constants go outside
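A minimal illustration of that distinction (the region column here is hypothetical):

```r
# Mapping: color varies with the data, so it goes inside aes()
ggplot(county_data) +
  aes(x = median_income, y = total_population, color = region) +
  geom_point()

# Constant: every point gets the same color, so it goes outside aes()
ggplot(county_data) +
  aes(x = median_income, y = total_population) +
  geom_point(color = "steelblue")
```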
Example: Basic Scatter Plot
```r
library(tidyverse)

# Basic scatter plot
ggplot(county_data) +
  aes(x = median_income, y = total_population) +
  geom_point() +
  labs(
    title = "Income vs Population in PA Counties",
    x = "Median Household Income ($)",
    y = "Total Population"
  ) +
  theme_minimal()
```
Part 3: Exploratory Data Analysis
The EDA Mindset
Exploratory Data Analysis is detective work I need to master:
- What does the data look like? (distributions, missing values)
- What patterns exist? (relationships, clusters, trends)
- What’s unusual? (outliers, anomalies, data quality issues)
- What questions does this raise? (hypotheses for further investigation)
- How reliable is this data?
Goal: Understand my data before making decisions or building models
Enhanced EDA Workflow for Policy Analysis
My enhanced process:
1. Load and inspect - dimensions, variable types, missing data
2. Assess reliability - examine margins of error, calculate coefficients of variation (see the sketch after this list)
3. Visualize distributions - histograms, boxplots for each variable
4. Explore relationships - scatter plots, correlations
5. Identify patterns - grouping, clustering, geographical patterns
6. Question anomalies - investigate outliers and unusual patterns
7. Document limitations - prepare honest communication about data quality
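A minimal sketch of the first two steps, assuming county_data came from tidycensus in wide form with median_incomeE (estimate) and median_incomeM (margin of error) columns:

```r
library(tidyverse)

# Step 1: inspect structure and missingness
glimpse(county_data)
colSums(is.na(county_data))

# Step 2: assess reliability with the coefficient of variation
county_data %>%
  mutate(
    se = median_incomeM / 1.645,     # ACS MOEs use a 90% confidence level
    cv = 100 * se / median_incomeE   # CV as a percentage of the estimate
  ) %>%
  arrange(desc(cv))                  # least reliable counties first
```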
Understanding Distributions
Why distribution shape matters:
```r
# Exploring income distribution
ggplot(county_data) +
  aes(x = median_income) +
  geom_histogram(bins = 15, fill = "steelblue", alpha = 0.7) +
  labs(
    title = "Distribution of Median Income Across PA Counties",
    x = "Median Household Income ($)",
    y = "Number of Counties"
  )
What I should look for: Skewness, outliers, multiple peaks, gaps
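One quick numeric check to pair with the histogram, using the same median_income column: if the mean sits well above the median, the distribution is right-skewed.

```r
# A mean far above the median suggests right skew
county_data %>%
  summarize(
    mean_income = mean(median_income, na.rm = TRUE),
    median_of_medians = median(median_income, na.rm = TRUE)
  )
```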
Boxplots for Quick Summaries
```r
# Boxplot to understand income distribution
ggplot(county_data) +
  aes(y = median_income) +
  geom_boxplot(fill = "lightblue") +
  labs(
    title = "Income Distribution Summary",
    y = "Median Household Income ($)"
  )
```
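Boxplots become even more useful for comparing groups; a sketch assuming a hypothetical region column in county_data:

```r
# Compare income distributions across regions (region is hypothetical)
ggplot(county_data) +
  aes(x = region, y = median_income) +
  geom_boxplot(fill = "lightblue") +
  labs(x = "Region", y = "Median Household Income ($)")
```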
Critical: Data Quality Through Visualization
Research-Based Best Practices
Jurjevich et al. (2018): 5 Essential Guidelines I Must Follow:
1. Report the corresponding MOEs of ACS estimates - always include margin of error values
2. Include a footnote when not reporting MOEs - explicitly acknowledge the omission
3. Provide context for (un)reliability - use the coefficient of variation (CV):
   - CV < 12% = reliable (green coding)
   - CV 12-40% = somewhat reliable (yellow)
   - CV > 40% = unreliable (red coding)
4. Reduce statistical uncertainty - collapse data detail, aggregate geographies, use multi-year estimates
5. Always conduct statistical significance tests - when comparing ACS estimates over time
Key insight: These are not just technical best practices; they are ethical requirements under the AICP Code of Ethics
Visualizing Data Quality
```r
# Visualizing margin of error patterns
county_reliability <- county_data %>%
  mutate(
    moe_percentage = (median_incomeM / median_incomeE) * 100,
    cv = moe_percentage / 1.645  # Convert MOE to CV
  )

ggplot(county_reliability) +
  aes(x = total_population, y = moe_percentage) +
  geom_point(alpha = 0.7) +
  labs(
    title = "Data Quality vs Population Size",
    x = "Total Population",
    y = "Margin of Error (%)",
    subtitle = "Smaller populations have higher uncertainty"
  ) +
  theme_minimal()
```
Pattern I observed: Smaller populations have higher uncertainty
Ethical implication: These communities might be systematically undercounted
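Building on the cv column computed above, a sketch of the green/yellow/red coding from the Jurjevich et al. thresholds:

```r
# Classify counties by the CV reliability thresholds
county_reliability <- county_reliability %>%
  mutate(
    reliability = case_when(
      cv < 12  ~ "Reliable",
      cv <= 40 ~ "Somewhat reliable",
      TRUE     ~ "Unreliable"
    )
  )

ggplot(county_reliability) +
  aes(x = total_population, y = cv, color = reliability) +
  geom_point(alpha = 0.7) +
  scale_color_manual(values = c(
    "Reliable"          = "darkgreen",
    "Somewhat reliable" = "goldenrod",
    "Unreliable"        = "red"
  )) +
  labs(x = "Total Population", y = "Coefficient of Variation (%)")
```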
Part 4: Data Joins & Integration
Why I Need to Join Data
To combine datasets effectively:
- Census demographics + Economic indicators
- Survey responses + Geographic boundaries
- Current data + Historical trends
- Administrative records + Survey data
Types of Joins in dplyr
Four main types I'll use:
- left_join() - Keep all rows from the left dataset
- right_join() - Keep all rows from the right dataset
- inner_join() - Keep only rows that match in both
- full_join() - Keep all rows from both datasets

Most common: left_join() to add columns to my main dataset
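A toy demonstration with made-up tibbles to show how the row counts differ:

```r
library(tidyverse)

a <- tibble(GEOID = c("42001", "42003"), income = c(60000, 70000))
b <- tibble(GEOID = c("42003", "42005"), college = c(0.35, 0.28))

left_join(a, b, by = "GEOID")   # 2 rows: all of a, NA where b has no match
inner_join(a, b, by = "GEOID")  # 1 row: only GEOID 42003 appears in both
full_join(a, b, by = "GEOID")   # 3 rows: every GEOID from either table
```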
Example: Joining Census Tables
```r
library(tidycensus)
library(tidyverse)

# Get income data; rename estimate/moe so the checks below can
# refer to median_income and income_moe
income_data <- get_acs(
  geography = "county",
  variables = "B19013_001",
  state = "PA",
  year = 2022,
  survey = "acs5"
) %>%
  select(GEOID, NAME, median_income = estimate, income_moe = moe)

# Get education data
education_data <- get_acs(
  geography = "county",
  variables = "B15003_022", # Bachelor's degree
  state = "PA",
  year = 2022,
  survey = "acs5"
) %>%
  select(GEOID, college_pop = estimate, college_moe = moe)

# Join the datasets
combined_data <- income_data %>%
  left_join(education_data, by = "GEOID")
```
Checking Join Results and Data Quality
I must always verify joins AND assess combined reliability:
```r
# Check dimensions
nrow(income_data)    # Should match
nrow(education_data) # Should match
nrow(combined_data)  # Should match

# Check for missing values
combined_data %>%
  summarize(
    missing_income = sum(is.na(median_income)),
    missing_education = sum(is.na(college_pop))
  )

# Calculate reliability for both variables
combined_data %>%
  mutate(
    income_cv = (income_moe / median_income) * 100 / 1.645,
    college_cv = (college_moe / college_pop) * 100 / 1.645
  )
```
EDA for Policy Analysis
Key Questions I Should Always Ask
For census data specifically:
- Geographic patterns: Are problems concentrated in certain areas?
- Population relationships: How does size affect data quality?
- Demographic patterns: Are certain communities systematically different?
- Temporal trends: How do patterns change over time?
- Data integrity: Where might survey bias affect results?
- Reliability assessment: Which estimates should I trust?
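A sketch of the first question, assuming the income_cv column from the reliability step has been saved back to combined_data, and using a hypothetical region grouping column:

```r
# Are unreliable estimates concentrated in particular regions?
combined_data %>%
  group_by(region) %>%  # region is a hypothetical grouping column
  summarize(
    mean_cv = mean(income_cv, na.rm = TRUE),
    pct_unreliable = mean(income_cv > 40, na.rm = TRUE) * 100
  ) %>%
  arrange(desc(mean_cv))
```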
Professional Visualization Standards
```r
# Create publication-ready visualization
# (assumes college_percentage and total_population have been added
#  to combined_data)
ggplot(combined_data) +
  aes(x = median_income, y = college_percentage) +
  geom_point(aes(size = total_population), alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE) +
  scale_size_continuous(name = "Population") +
  scale_x_continuous(labels = scales::dollar_format()) +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +
  labs(
    title = "Education and Income Relationship in PA Counties",
    subtitle = "Higher income counties tend to have more college graduates",
    x = "Median Household Income",
    y = "Percent with Bachelor's Degree",
    caption = "Source: ACS 2018-2022 5-year estimates"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 12),
    legend.position = "bottom"
  )
```
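To export a figure like this for a report, ggsave() saves the most recent plot (the filename here is illustrative):

```r
# Save the last plot at print quality
ggsave("income_education_pa.png", width = 8, height = 6, dpi = 300)
```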
Summary
This week I learned that visualization is not just about making pretty charts—it’s about ethical communication and uncovering patterns that summary statistics might hide. Key takeaways:
- Anscombe’s Quartet taught me that identical statistics can hide completely different patterns
- Professional ethics require me to always assess and communicate data uncertainty
- ggplot2’s Grammar of Graphics provides a systematic approach to building meaningful visualizations
- Enhanced EDA workflow helps me understand data quality before making policy decisions
- Data joins allow me to combine multiple sources while maintaining quality standards
Most important lesson: As a future planner, I have an ethical obligation to communicate data uncertainty honestly. My visualizations should reveal, not hide, the limitations of my data.
Skills I Practiced
ggplot2 fundamentals:
- Scatter plots, histograms, boxplots
- Aesthetic mappings and customization
- Professional themes and labels

EDA workflow:
- Distribution analysis
- Outlier detection
- Pattern identification

Ethical data practice:
- Visualizing and reporting margins of error
- Using coefficient of variation to assess reliability
- Creating honest, transparent communications
Moving forward: I will always start my analysis with exploratory visualization and end with honest communication about data limitations. This is both a technical best practice and an ethical requirement.