DPLYR BASICS AND CENSUS DATA
WEEK 2 NOTES
KEY CONCEPTS LEARNED
- Main Concepts
- Algorithms are a set of instructions or rules for solving a problem or completing a task.
- Algorithms may provide efficiency, consistency, objectivity, and cost-savings in certain scenarios, but they aren’t infallible, especially since the data that is fed to them could be collected in a biased manner.
- Algorithms can help with decision-making, but at the end of the day, it’s humans that make the decisions. The many choices behind creating algorithms and making predictions embed human values and biases.
- Data Science: Computer science/engineering focus on algorithms and methods.
- Data Analytics: Application of data science methods to other disciplines.
- Machine Learning: Algorithms for classification & prediction that learn from data.
- AI: Algorithms that adjust and improve across iterations (neural networks, etc.)
- Algorithmic Decision Making: Understanding why your analysis matters for real policy decisions.
- Data Subjectivity: Why we emphasize transparent, reproducible methods in this class.
- Census Data: The foundation for most urban planning and policy analysis.
- R Skills: The tools to do this work professionally and ethically.
- Technical Skills
- Census data is publicly accessible.
- County-level data is used for state and regional planning. Census tract-level data is used for neighborhood analysis. Block group-level data is used for very local analysis, but it has large margins of error.
- Data de-identification: mathematical “noise” is added to the data to protect privacy while preserving overall patterns. This has created impossible results, like people living underwater, so be wary.
- R packages like tidycensus can access census data directly, so you always have the latest data. This makes it easier to create reproducible workflows, and it provides automatic geographic boundaries and built-in error handling.
- Large MOE relative to estimate = less reliable. Small MOE relative to estimate = more reliable. Always report MOE alongside estimates. Be cautious comparing estimates with overlapping error margins. Consider using 5-year estimates for greater reliability.
- Census Summary Tables: Pre-calculated statistics by geography. Median income, percent college-educated, etc. Good for mapping and comparing places.
- PUMS, or Individual Records: Anonymous individual/household responses. Good for custom analysis and regression models. More complex, but more flexible.
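The MOE guidance above can be sketched in R. This is a minimal illustration under stated assumptions, not a formal significance test: the county names and numbers are made up, the table is typed by hand to mimic the estimate and moe columns that get_acs() returns, and overlaps() is a hypothetical helper written for this example.

```r
library(dplyr)

# Hypothetical values shaped like get_acs() output (estimate + moe columns).
# The names and numbers are made up so the example runs without an API key.
income <- tibble(
  NAME     = c("Alpha County", "Beta County", "Gamma County"),
  estimate = c(60000, 52000, 48000),
  moe      = c(1500, 4000, 9000)
)

income_checked <- income %>%
  mutate(
    moe_pct = 100 * moe / estimate,  # MOE as a percent of the estimate
    lower   = estimate - moe,        # lower bound of the error margin
    upper   = estimate + moe         # upper bound of the error margin
  )

# Hypothetical helper: two error margins overlap when each lower bound
# sits at or below the other interval's upper bound.
overlaps <- function(lower_a, upper_a, lower_b, upper_b) {
  lower_a <= upper_b & lower_b <= upper_a
}

alpha <- income_checked[1, ]
beta  <- income_checked[2, ]
gamma <- income_checked[3, ]

alpha_beta_overlap <- overlaps(alpha$lower, alpha$upper, beta$lower, beta$upper)
beta_gamma_overlap <- overlaps(beta$lower, beta$upper, gamma$lower, gamma$upper)

print(income_checked)
```

Here the Alpha and Beta margins do not overlap, while the Beta and Gamma margins do, so the Beta vs. Gamma comparison is the one to treat with caution. Overlap is only a rough screen for when to be careful.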
CODING TECHNIQUES
- The pipe %>% chains functions together. Read it as “and then”; it is equivalent to nesting functions, but easier to read.
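- get_acs() is used to retrieve American Community Survey (ACS) data.
- str_remove() is used to remove the first match of a pattern from a string.
- str_extract() is used to extract the first match of a pattern from a string.
- str_replace() is used to replace the first match of a pattern in a string.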
- case_when() for assigning categories based on conditions, such as reliability categories from MOE calculations.
- kable() for professional formatting.
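The functions above can be combined into one short pipeline. This is a sketch under stated assumptions: the rows are made up and typed by hand to mimic the NAME, estimate, and moe columns that get_acs() returns, the 5% and 10% reliability cutoffs are illustrative choices rather than an official standard, and the regular expressions assume a single-word state name.

```r
library(dplyr)
library(stringr)
library(knitr)

# Hypothetical rows shaped like get_acs() output; names and values are made up.
acs <- tibble(
  NAME     = c("Alpha County, Pennsylvania",
               "Beta County, Pennsylvania",
               "Gamma County, Pennsylvania"),
  estimate = c(60000, 52000, 48000),
  moe      = c(1500, 4000, 9000)
)

# Piped version: take acs, AND THEN add a column, AND THEN keep two columns.
piped <- acs %>%
  mutate(moe_pct = 100 * moe / estimate) %>%
  select(NAME, moe_pct)

# Equivalent nested version: same result, read from the inside out.
nested <- select(mutate(acs, moe_pct = 100 * moe / estimate), NAME, moe_pct)

acs_clean <- acs %>%
  mutate(
    state   = str_extract(NAME, "\\w+$"),              # last word = state name
    county  = str_remove(NAME, ", \\w+$"),             # drop the ", State" suffix
    county  = str_replace(county, " County", " Co."),  # abbreviate "County"
    moe_pct = 100 * moe / estimate,
    # Illustrative cutoffs (assumption): under 5% high, under 10% moderate.
    reliability = case_when(
      moe_pct < 5  ~ "High",
      moe_pct < 10 ~ "Moderate",
      TRUE         ~ "Low"
    )
  ) %>%
  select(county, state, estimate, moe, reliability)

# Format the result as a table, with thousands separators on the numbers.
kable(acs_clean,
      col.names   = c("County", "State", "Estimate", "MOE", "Reliability"),
      format.args = list(big.mark = ","))
```

The estimate and MOE stay side by side in the final table, in keeping with the note to always report MOE alongside estimates.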
QUESTIONS & CHALLENGES
- Using algorithms to completely replace human decision-makers is problematic because computers and algorithms, no matter how complex their programming, can’t perfectly mimic the messiness of everyday human life.
- Using historical data to inform policy is a challenge because policies heavily biased against particular races, genders, income levels, and more created socio-economic disadvantages that persist today. For example, trying to predict crime geographically would clearly target disinvested and oppressed communities that haven’t had a chance to recover, and likely won’t within a lifetime or more.
- Algorithms can fail because of people: humans are behind data cleaning decisions, data coding and classification, data collection (including the use of imperfect proxies), how results are interpreted, and what variables are put in the model.
- There are many real-life examples of biased algorithms harming historically minoritized communities.
CONNECTIONS TO POLICY
- “Algorithmic decision-making is used in government to assist or replace human decision-makers”.
- Historical data is used to inform policy and decision-makers by predicting the future as precisely as possible. There are applications in housing, finance, healthcare, crime, etc.
- Governments have limited budgets and need to serve everyone, so optimizing and being cost-effective means using algorithmic decision-making. Proponents argue that it provides efficiency, consistency, objectivity, and cost savings, but this isn’t always the case in real life.
- The decennial census asks 9 questions of the entire US population.
- The American Community Survey (ACS) surveys 3% of households annually with detailed questions.
- Algorithmic Fairness: Unreliable data can bias automated decisions.
- Resource Allocation: Know which areas need extra attention.
- Equity Analysis: Some communities may be systematically under-counted.
- Professional Credibility: Always assess your data quality.
REFLECTION
- Census data is used for understanding community demographics, allocating government resources, and tracking neighborhood change, but the census itself has had a biased and racist history.
- ACS 1-Year Estimates (areas > 65,000 people) have the most current data and the smallest sample. ACS 5-Year Estimates cover all areas, including census tracts, and have the most reliable data and the largest sample of the ACS. They still have a margin of error, but are pretty robust for data analysis.
- Even “objective” data involves subjective choices about privacy vs. accuracy (e.g. 2020 Decennial Census).
- What assumptions am I making in my data choices?
- Who might be excluded from my analysis?
- How could my findings be misused if taken out of context?
- What would I want policymakers to understand about my methods?