Week 3 Notes - Data Visualization & Exploratory Analysis

Published

September 22, 2025

Key Concepts Learned

Anscombe’s Quartet: four datasets with nearly identical summary statistics but completely different patterns when visualized.
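The quartet ships with base R as the `anscombe` data frame, so the claim is easy to check. The sketch below is only an illustration: the names `quartet_summary` and `plot_quartet` are made up for this example and do not come from the notes.

```r
# `anscombe` is part of base R (datasets package): columns x1..x4 and y1..y4.
data(anscombe)

# Compute the same summary statistics for each of the four x/y pairs.
quartet_summary <- data.frame(
  dataset = 1:4,
  mean_x  = sapply(1:4, function(i) mean(anscombe[[paste0("x", i)]])),
  mean_y  = sapply(1:4, function(i) mean(anscombe[[paste0("y", i)]])),
  var_x   = sapply(1:4, function(i) var(anscombe[[paste0("x", i)]])),
  cor_xy  = sapply(1:4, function(i) {
    cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
  })
)

# After rounding, the four rows are essentially the same.
print(round(quartet_summary, 2))

# Hypothetical helper: draw one scatter plot per dataset with base graphics.
plot_quartet <- function() {
  op <- par(mfrow = c(2, 2))
  on.exit(par(op))  # restore the previous plotting layout
  for (i in 1:4) {
    plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
         xlab = paste0("x", i), ylab = paste0("y", i),
         main = paste("Dataset", i))
  }
}
```

Calling `plot_quartet()` draws the four panels. The summary table looks the same for every dataset, while the plots do not.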

The Policy Implications

Why it matters:
- Summary statistics can hide critical patterns
- Outliers may represent important communities
- Relationships aren’t always linear
- Visual inspection reveals data quality issues

Consequences of Bad Visualizations

Common problems in government data presentation:
- Misleading scales or axes
- Cherry-picked time periods
- Hidden or ignored uncertainty
- Missing context about data reliability

The smaller your sample size, the larger your margin of error.
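A small worked example of that relationship, under assumptions that are not in the notes: a simple random sample, an estimated proportion, 95% confidence (z = 1.96), and p = 0.5, which gives the widest interval. The function name `margin_of_error` is illustrative.

```r
# Margin of error for an estimated proportion under simple random sampling.
# Assumed defaults: p = 0.5 (worst case) and z = 1.96 (95% confidence).
margin_of_error <- function(n, p = 0.5, z = 1.96) {
  z * sqrt(p * (1 - p) / n)
}

sample_sizes <- c(50, 200, 1000, 5000)
moe <- margin_of_error(sample_sizes)

# Show the margin of error in percentage points for each sample size.
print(data.frame(n = sample_sizes, moe_pct = round(100 * moe, 1)))
```

Because the sample size sits under a square root, quadrupling n only halves the margin of error.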

Grammar of Graphics

The ggplot2 Philosophy

Principles: Data → Aesthetics → Geometries → Visual

Data: Your dataset (census data, survey responses, etc.). This is the “I would like to make a plot” step.
Code: ggplot(data = your_data)

Aesthetics: Which variables map to which visual properties (x, y, color, size), that is, how each variable connects to a specific element of the plot.
Code: aes(x = variable1, y = variable2)
- x, y: position
- color: point/line color
- fill: area fill color
- size: point/line size
- shape: point shape
- alpha: transparency

Geometries: How to display the data (points, bars, lines).
Code: geom_*(), for example geom_point(), geom_col(), geom_line()

Additional layers: Scales, themes, facets, annotations.
Code: further calls such as scale_*(), theme_*(), facet_wrap(), annotate()

Build plots by adding layers with +.
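Putting the pieces together, a minimal sketch of a layered plot might look like the following. The `counties` data frame and its column names are invented for illustration; only the ggplot2 functions themselves are real.

```r
library(ggplot2)

# Hypothetical county-level data, made up purely for this example.
counties <- data.frame(
  median_income = c(42000, 55000, 61000, 48000, 73000, 39000),
  pct_college   = c(18, 27, 33, 22, 41, 15),
  region        = c("North", "North", "South", "South", "East", "East")
)

p <- ggplot(data = counties,                        # data
            mapping = aes(x = median_income,        # aesthetics
                          y = pct_college,
                          color = region)) +
  geom_point(size = 3, alpha = 0.8) +               # geometry
  labs(title = "College attainment vs. income",     # additional layers
       x = "Median household income (USD)",
       y = "Adults with a college degree (%)",
       color = "Region") +
  theme_minimal()
```

Printing `p` (or calling `print(p)` in a script) renders the plot. Each `+` adds one layer, so swapping `geom_point()` for another geometry changes the display without touching the data or the aesthetic mapping.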

Understand your data before making decisions or building models

What does the data look like? (distributions, missing values)
What patterns exist? (relationships, clusters, trends)
What’s unusual? (outliers, anomalies, data quality issues)
What questions does this raise? (hypotheses for further investigation)
How reliable is this data?
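A few base R calls are often enough for a first pass at these questions. This is a sketch on a made-up `survey` data frame (the tract IDs and values are hypothetical), and the 1.5 × IQR rule used by `boxplot.stats()` is one common convention for flagging outliers, not the only one.

```r
# Hypothetical survey extract, invented for illustration. It contains one
# missing household count and one very large income value.
survey <- data.frame(
  tract      = paste0("T", 1:8),
  households = c(1200, 950, NA, 1430, 1100, 870, 1310, 990),
  income     = c(41000, 45500, 39000, 52000, 47000, 43000, 250000, 44000)
)

# What does the data look like? Structure and distribution summary.
str(survey)
print(summary(survey$income))

# Missing values per column.
missing_counts <- colSums(is.na(survey))
print(missing_counts)

# What's unusual? Values beyond 1.5 * IQR from the hinges.
income_outliers <- boxplot.stats(survey$income)$out
print(income_outliers)

# What patterns exist? Correlation using only complete rows.
households_income_cor <- cor(survey$households, survey$income,
                             use = "complete.obs")
print(households_income_cor)
```

A flagged value such as the large income here is a prompt for investigation rather than something to delete automatically.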

Enhanced process for policy analysis

Load and inspect - dimensions, variable types, missing data
Assess reliability - examine margins of error, calculate coefficients of variation
Visualize distributions - histograms, boxplots for each variable
Explore relationships - scatter plots, correlations
Identify patterns - grouping, clustering, geographical patterns
Question anomalies - investigate outliers and unusual patterns
Document limitations - prepare honest communication about data quality
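The reliability step can be sketched as follows. The assumptions are not from the notes: the estimates are taken to be ACS-style with margins of error published at 90% confidence (so standard error = MOE / 1.645), the `acs` data frame is invented, and the 12% and 40% cutoffs are illustrative thresholds rather than an official standard.

```r
# Hypothetical ACS-style estimates with published margins of error (MOE).
acs <- data.frame(
  tract    = c("A", "B", "C", "D"),
  estimate = c(52000, 48000, 61000, 35000),
  moe      = c(3100, 12000, 21000, 30000)
)

# Assumption: MOEs are at 90% confidence, so divide by 1.645 to get the
# standard error.
acs$se <- acs$moe / 1.645

# Coefficient of variation: standard error as a percentage of the estimate.
acs$cv_pct <- 100 * acs$se / acs$estimate

# Illustrative reliability bands (assumed cutoffs): CV under 12% is "high",
# 12% up to 40% is "medium", 40% and above is "low".
acs$reliability <- cut(acs$cv_pct,
                       breaks = c(0, 12, 40, Inf),
                       labels = c("high", "medium", "low"),
                       right = FALSE)

print(acs[, c("tract", "estimate", "moe", "cv_pct", "reliability")])
```

Keeping the reliability label next to each estimate makes it easier to document limitations honestly when the results are reported.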

Questions & Answers

Q: Why do algorithmic decision-making systems rely on “proxy” variables?
A: Because the data we want to measure directly is often unavailable or difficult to collect.