Key Concepts Learned
- 1. The Statistical Learning Framework
- Formalizing the relationship: Y = f(X) + ε
- Why estimate f? 1. Prediction 2. Inference
- How do we estimate f? Parametric methods (assume f is linear, then estimate β₀ and β₁) vs. non-parametric methods (let the data determine the shape of f)
- Parametric Approach: Linear Regression
- The assumption: the relationship between X and Y is linear, Y ≈ β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ (estimate the β coefficients)
- The task: Estimate the β coefficients using our sample data
- The method: Ordinary Least Squares (OLS)
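The parametric workflow above (assume a linear f, estimate the βs with OLS) can be sketched in R with `lm()`. The data here are simulated placeholders, not the course dataset:

```r
# Simulated data where the true relationship is linear: y = 2 + 3x + noise
set.seed(42)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.5)

# Ordinary Least Squares: lm() estimates the intercept (beta0) and slope (beta1)
fit <- lm(y ~ x)
coef(fit)  # estimates should land near the true values 2 and 3
```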
- 2. Two Different Goals
- Inference
- Focus on coefficients
- Statistical significance matters
- Understand mechanisms
- Prediction
- Focus on accuracy
- Prediction intervals matter
- 3. Building Your First Model
- Fit the Model
- R², p-value (< 0.001 → very unlikely to see this estimate if the true β₁ = 0, so we can reject the null hypothesis)
- Interpreting Coefficients
- The “Holy Grail” concept (all estimation is based on the sample, not the entire population, so we will never know the true relationship) (no relationship, i.e. a flat line, is the null hypothesis)
- Statistical Significance
- t-value (larger |t| = more confidence the relationship is real)
- p-value (the probability of seeing our estimate if H₀ is true; small p → reject H₀ and conclude a relationship exists)
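The coefficient, t-value, and p-value can all be read off `summary()`'s coefficient table; continuing with hypothetical simulated data:

```r
# Hypothetical example: inspect t- and p-values from a fitted model
set.seed(42)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.5)
fit <- lm(y ~ x)

# Coefficient table columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit)$coefficients
coefs["x", "t value"]   # large |t| -> more confidence the relationship is real
coefs["x", "Pr(>|t|)"]  # small p  -> reject H0 (true beta1 = 0)
```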
- 4. Model Evaluation
- How Good is This Model?
- How well does it fit the data we used? (in-sample fit: R². R² = 0.208 means about 21% of the variation in income is explained by population. R² alone doesn’t tell us if the model is trustworthy; overfitting means a high R² doesn’t guarantee good predictions!)
- How well would it predict new data? (out-of-sample performance: 70% train / 30% test split)
- Code: set.seed(123); n <- nrow(pa_data)
- Evaluate Predictions
- Code: # Calculate prediction error (RMSE): rmse_test <- sqrt(mean((test_data$median_income - test_predictions)^2)); rmse_train <- summary(model_train)$sigma (squaring magnifies large errors)
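A self-contained version of the split-and-evaluate workflow above; `pa_data` and its columns come from the course, so a simulated stand-in is used here:

```r
set.seed(123)
# Simulated stand-in for pa_data (population vs. median income)
pa_data <- data.frame(population = runif(200, 1e3, 1e6))
pa_data$median_income <- 30000 + 0.02 * pa_data$population + rnorm(200, sd = 5000)

# 70/30 train/test split
n <- nrow(pa_data)
train_idx <- sample(seq_len(n), size = 0.7 * n)
train_data <- pa_data[train_idx, ]
test_data  <- pa_data[-train_idx, ]

model_train <- lm(median_income ~ population, data = train_data)

# Out-of-sample RMSE: errors are squared (penalizing large misses), averaged,
# then square-rooted back to the original units
test_predictions <- predict(model_train, newdata = test_data)
rmse_test  <- sqrt(mean((test_data$median_income - test_predictions)^2))
rmse_train <- summary(model_train)$sigma  # in-sample residual standard error
```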
- Cross-Validation (better approach: multiple train/test splits)
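Cross-validation can be sketched by hand in base R: split the (simulated, placeholder) data into k folds and let each fold take a turn as the test set:

```r
set.seed(123)
pa_data <- data.frame(population = runif(200, 1e3, 1e6))
pa_data$median_income <- 30000 + 0.02 * pa_data$population + rnorm(200, sd = 5000)

# 5-fold cross-validation: assign each row to one of k folds at random
k <- 5
folds <- sample(rep(1:k, length.out = nrow(pa_data)))

cv_rmse <- sapply(1:k, function(i) {
  fit  <- lm(median_income ~ population, data = pa_data[folds != i, ])
  pred <- predict(fit, newdata = pa_data[folds == i, ])
  sqrt(mean((pa_data$median_income[folds == i] - pred)^2))
})
mean(cv_rmse)  # averaged out-of-sample error, more stable than a single split
```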
- 5. Checking Assumptions
- Assumption 1: Linearity
- What we assume: Relationship is actually linear
- How to check: residual plot (random scatter is fine; a clear pattern such as a funnel signals a problem)
- Assumption 2: Constant Variance
- Heteroscedasticity: variance changes across X (when the residuals grow, ask whether a new variable is needed to explain them → adding the right predictor can fix heteroscedasticity)
- Impact: Standard errors are wrong → p-values misleading
- Formal Test: Breusch-Pagan (a statistical hypothesis test for heteroscedasticity; its p-value is the test’s final output and decision tool)
- Interpretation: p > 0.05: Constant variance assumption OK; p < 0.05: Evidence of heteroscedasticity
- Solutions: 1. Transform Y (try log(income)) 2. Robust standard errors 3. Add missing variables 4. Accept it (point predictions still OK for prediction goals)
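The Breusch-Pagan test is available as `bptest()` in the `lmtest` package (assuming that package is installed); here it is run on simulated data whose noise deliberately grows with x:

```r
library(lmtest)  # provides bptest()

# Simulate heteroscedastic data: the noise sd increases with x
set.seed(1)
x <- runif(200)
y <- 1 + 2 * x + rnorm(200, sd = 0.3 + x)
fit <- lm(y ~ x)

bptest(fit)  # small p-value (< 0.05) -> evidence of heteroscedasticity
```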
- Assumption: Normality of Residuals (check with a Q-Q plot)
- Assumption 3: No Multicollinearity
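One common multicollinearity check (not shown in the notes, so treat as a supplementary sketch) is the variance inflation factor via `car::vif()`; values well above roughly 5-10 flag trouble:

```r
library(car)  # provides vif()

# Simulate two nearly collinear predictors
set.seed(7)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)  # x2 is almost a copy of x1
y  <- 1 + x1 + rnorm(100)

fit <- lm(y ~ x1 + x2)
vif(fit)  # very large values flag multicollinearity
```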
- Assumption 4: No Influential Outliers
- Not all outliers are problems - only those with high leverage AND large residuals
- Identify Influential Points (Cook’s D > 4/n = potentially influential)
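Cook’s distance and the 4/n rule of thumb from the notes can be applied with base R’s `cooks.distance()`; the data below are simulated with one deliberately influential point:

```r
# Simulate clean data, then corrupt one point with high leverage AND a large residual
set.seed(3)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
x[50] <- 10
y[50] <- -20
fit <- lm(y ~ x)

cd <- cooks.distance(fit)
threshold <- 4 / length(cd)    # rule-of-thumb cutoff from the notes
which(cd > threshold)          # indices of potentially influential points
```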
- 6. Improving the Model
- Adding More Predictors
- Log Transformations
- If relationship is curved, try transforming
- Interpretation changes: Log models show percentage relationships
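A log-transformed model can be sketched as follows (simulated income data, not the course dataset); the slope in a log(Y) model is read as an approximate percentage change per unit of X:

```r
# Simulate a multiplicative (curved) relationship between x and income
set.seed(9)
x <- runif(200, 1, 100)
income <- exp(10 + 0.01 * x + rnorm(200, sd = 0.2))

fit_log <- lm(log(income) ~ x)
coef(fit_log)[["x"]]  # ~0.01: each extra unit of x raises income by about 1%
```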
- Categorical Variables (create new variables)
Coding Techniques
- [New R functions or approaches]
- [Quarto features learned]
Questions & Challenges
- What I didn’t fully understand
- Areas needing more practice
Connections to Policy
- [How this week’s content applies to real policy work]
Reflection
- [What was most interesting]
- [How I’ll apply this knowledge]