Week 3 Notes - Data Visualization & Exploratory Analysis
Key Concepts Learned
Anscombe’s Quartet: Four datasets with nearly identical summary statistics (means, variances, correlations) but completely different patterns when visualized.
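A minimal sketch of the idea, using the `anscombe` data frame that ships with base R (columns `x1`..`x4` and `y1`..`y4`). The name `quartet_summary` is introduced here only for illustration. The summary table comes out almost identical for all four datasets, while the four scatter plots look nothing alike.

```r
# Base R ships the quartet as the `anscombe` data frame.
data(anscombe)

# Compute the same summary statistics for each of the four x/y pairs.
quartet_summary <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(
    dataset = i,
    mean_x  = mean(x),
    mean_y  = mean(y),
    sd_x    = sd(x),
    sd_y    = sd(y),
    cor_xy  = cor(x, y)
  )
}))

print(round(quartet_summary, 2))

# Plotting is what reveals the differences: one panel per dataset.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste("Dataset", i))
}
par(op)
```

Base graphics are used here only to keep the example free of package dependencies.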
Policy Implications
Why it matters:
- Summary statistics can hide critical patterns
- Outliers may represent important communities
- Relationships aren’t always linear
- Visual inspection reveals data quality issues
Consequences of Bad Visualizations
Common problems in government data presentation:
- Misleading scales or axes
- Cherry-picked time periods
- Hidden or ignored uncertainty
- Missing context about data reliability
The smaller the sample size, the larger the margin of error.
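The relationship between sample size and margin of error can be sketched with the usual normal-approximation formula for a sample proportion. This is an illustration under stated assumptions (simple random sampling, roughly 95% confidence, p = 0.5), not the method any particular survey uses. The function name `margin_of_error` and the sample sizes are made up for the example.

```r
# Margin of error for a sample proportion at roughly 95% confidence,
# using the normal approximation: z * sqrt(p * (1 - p) / n).
margin_of_error <- function(p, n, z = 1.96) {
  z * sqrt(p * (1 - p) / n)
}

sample_sizes <- c(50, 200, 1000, 5000)  # illustrative values only
moe <- margin_of_error(p = 0.5, n = sample_sizes)

# Show the margin of error in percentage points for each sample size.
print(data.frame(n = sample_sizes, moe_pct = round(100 * moe, 1)))
```

Because the sample size sits under a square root, quadrupling the sample only halves the margin of error.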
Grammar of Graphics
The ggplot2 Philosophy
Principles: Data -> Aesthetics -> Geometries -> Visual
Data: Your dataset (census data, survey responses, etc.). This step says “I would like to make a plot.”
Code: ggplot(data = your_data)
Aesthetics: Which variables map to which visual properties (x, y, color, size). This is how a variable is connected to a specific element of the plot.
Code: aes(x = variable1, y = variable2)
- x, y: position
- color: point/line color
- fill: area fill color
- size: point/line size
- shape: point shape
- alpha: transparency
Geometries: How to display the data (points, bars, lines).
Code: geom_something(), for example geom_point(), geom_col(), or geom_line()
Additional layers: Scales, themes, facets, annotations.
Code: additional_layers() is a placeholder for functions such as scale_*(), theme_*(), facet_wrap(), and annotate()
Build plots by adding layers with +.
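Putting the pieces together, a layered plot might look like the sketch below. It assumes the ggplot2 package is installed and uses the built-in `mtcars` dataset as a stand-in for your own data. The variable choices and labels are illustrative, not from the lecture.

```r
library(ggplot2)  # assumes ggplot2 is installed

# Data: mtcars is a built-in example dataset, standing in for your own data.
p <- ggplot(data = mtcars) +
  # Aesthetics: map variables to position and color.
  aes(x = wt, y = mpg, color = factor(cyl)) +
  # Geometry: draw one point per row.
  geom_point(size = 2, alpha = 0.8) +
  # Additional layers: labels, facets, theme.
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders",
       title = "Fuel efficiency by weight") +
  facet_wrap(~ gear) +
  theme_minimal()

print(p)
```

Each `+` adds one layer, so a plot can be built up and checked one step at a time.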
Exploratory data analysis: understand your data before making decisions or building models. Questions to ask:
- What does the data look like? (distributions, missing values)
- What patterns exist? (relationships, clusters, trends)
- What’s unusual? (outliers, anomalies, data quality issues)
- What questions does this raise? (hypotheses for further investigation)
- How reliable is this data?
Enhanced process for policy analysis
- Load and inspect - dimensions, variable types, missing data
- Assess reliability - examine margins of error, calculate coefficients of variation
- Visualize distributions - histograms, boxplots for each variable
- Explore relationships - scatter plots, correlations
- Identify patterns - grouping, clustering, geographical patterns
- Question anomalies - investigate outliers and unusual patterns
- Document limitations - prepare honest communication about data quality
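The first two steps of the process above can be sketched as follows. The `tracts` data frame and its values are made up for illustration. The conversion assumes the margins of error were published at the 90% confidence level (z = 1.645), as is the convention for American Community Survey estimates; check the documentation of your own data source. The reliability cutoffs are illustrative, not an official standard.

```r
# Hypothetical tract-level estimates; values are made up for illustration.
tracts <- data.frame(
  tract    = c("A", "B", "C", "D"),
  estimate = c(52000, 48000, 61000, NA),  # e.g. median household income
  moe      = c(3000, 14000, 32000, NA)    # published margin of error
)

# Load and inspect: dimensions, variable types, missing data.
print(dim(tracts))
str(tracts)
print(colSums(is.na(tracts)))

# Assess reliability: convert the MOE to a standard error, then to a CV (%).
# Assumption: MOEs are at the 90% confidence level (z = 1.645).
z_90 <- 1.645
tracts$se <- tracts$moe / z_90
tracts$cv_pct <- 100 * tracts$se / tracts$estimate

# Flag estimates using illustrative cutoffs (not an official standard).
tracts$reliability <- cut(
  tracts$cv_pct,
  breaks = c(-Inf, 15, 30, Inf),
  labels = c("high", "moderate", "low")
)

print(tracts)
```

Rows with a missing estimate or margin of error end up with a missing reliability flag, which is itself worth noting when documenting limitations.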
Questions & Answers
Q: Why do algorithmic decision-making systems rely on “proxy” variables?
A: Because the data we want to measure directly is often unavailable or difficult to collect.