Week 6 Notes - Spatial Machine Learning & Advanced Regression

Published

October 13, 2025

Key Concepts Learned

  • Basic Model Built
    • Ways to address its limitations
      • Add spatial features (crime nearby, distance to amenities)
      • Control for neighborhood (fixed effects)
      • Include interactions (does size matter more in wealthy areas?)
  • Converting to Spatial Data
    • Step 1: Make your data spatial
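      • st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326) %>%
        • Sets the initial coordinate reference system (CRS). 4326 is the EPSG code for WGS 84, a global geographic coordinate system commonly used for GPS coordinates; its units are degrees.
      • st_transform('ESRI:102286') # MA State Plane (feet)
        • Transforms (reprojects) the data to another CRS. ESRI:102286 is NAD_1983_HARN_StatePlane_Massachusetts_Mainland_FIPS_2001, a projected CRS covering the mainland part of Massachusetts (where Boston is located). The lecture comment gives its units as feet; st_crs(boston.sf)$units reports the units actually in use, which matters because buffer distances are read in CRS units.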
    • Step 2: Spatial Join with Neighborhoods
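      • st_join(nhoods, join = st_intersects)
        • st_join() is the core sf function for spatial joins: it combines two simple feature objects based on their spatial relationship.
        • nhoods is the second simple feature object in the join, here the polygons of Boston's neighborhood boundaries.
        • join = st_intersects requests an intersection join: each house point in boston.sf is matched to any neighborhood polygon in nhoods that intersects (i.e., contains) it.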
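The two steps above can be sketched end to end on toy data. This is a minimal sketch, not the lecture code: the three sales, the rectangular "neighborhoods", and the names `rect`, `West`, and `East` are all made up for illustration, and it assumes the `sf` package is installed.

```r
library(sf)

# Hypothetical sales table with GPS coordinates in degrees (WGS 84)
boston <- data.frame(
  SalePrice = c(450000, 610000, 380000),
  Longitude = c(-71.0589, -71.0810, -71.0300),
  Latitude  = c(42.3601, 42.3500, 42.3400)
)

# Step 1: declare the CRS the coordinates are stored in, then reproject
boston.sf <- st_as_sf(boston, coords = c("Longitude", "Latitude"), crs = 4326)
boston.sf <- st_transform(boston.sf, "ESRI:102286")

# Two made-up rectangles standing in for real neighborhood boundaries
rect <- function(xmin, xmax, ymin, ymax) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmax, ymin), c(xmax, ymax), c(xmin, ymax), c(xmin, ymin)
  )))
}
nhoods <- st_sf(
  Name = c("West", "East"),
  geometry = st_sfc(
    rect(-71.10, -71.05, 42.33, 42.38),
    rect(-71.05, -71.00, 42.33, 42.38),
    crs = 4326
  )
)

# Both layers must share a CRS before joining
nhoods <- st_transform(nhoods, st_crs(boston.sf))

# Step 2: attach the neighborhood each point falls in (left join by default)
boston.joined <- st_join(boston.sf, nhoods, join = st_intersects)
boston.joined$Name
```

Because `st_join()` is a left join by default, a sale that falls outside every polygon is kept with `NA` in the `Name` column rather than dropped.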
  • Part 1: Expanding Your Regression Toolkit
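    • Categorical variables: the type of the raw data (qualitative attributes)
    • Dummy Variables
      • A way to encode the original categorical variable numerically, using only 0 or 1. When a predictor is text rather than a number (e.g., neighborhood names such as "Back Bay" or "Beacon Hill"), R automatically converts it into a set of binary dummy variables.
      • The (n-1) Rule: with n categories, R creates only (n-1) dummy variables and automatically omits one as the reference group. This avoids perfect multicollinearity (the "dummy variable trap").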
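The (n-1) rule is easy to see by asking R for the design matrix directly. This is a small base-R sketch with made-up neighborhood labels; the data frame `sales` and the column `Name2` are illustrative names, not objects from the lecture.

```r
# Hypothetical neighborhood labels for five sales
sales <- data.frame(
  Name = factor(c("Back Bay", "Beacon Hill", "Allston", "Back Bay", "Allston"))
)

# model.matrix() exposes the design matrix that lm() builds from a formula
X <- model.matrix(~ Name, data = sales)
colnames(X)   # intercept plus one dummy per non-reference level

# Factor levels sort alphabetically, so "Allston" is the omitted reference here.
# relevel() picks a different reference group.
sales$Name2 <- relevel(sales$Name, ref = "Back Bay")
X2 <- model.matrix(~ Name2, data = sales)
colnames(X2)
```

With three neighborhoods there are two dummy columns; the omitted level is absorbed into the intercept.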
    • Add Dummy (Categorical) Variables to the Model
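      • Objective: add neighborhood fixed effects to the house price model. This is a common statistical technique for controlling for (removing) the influence of average price differences between neighborhoods. The omitted neighborhood serves as the reference category.
      • Interpretation: the coefficient on each neighborhood dummy is how much higher or lower that neighborhood's average price is than the reference neighborhood's, after controlling for living area (LivingArea).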
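The interpretation above can be checked on simulated data where the true premiums are known. This is a sketch under stated assumptions: the prices, the `premium` values, and the noise level are invented, and only the variable names `SalePrice`, `LivingArea`, and `Name` mirror the notes.

```r
set.seed(1)

# Simulated sales with hypothetical premiums relative to Allston
n_each  <- 100
premium <- c("Allston" = 0, "Back Bay" = 150000, "Beacon Hill" = 220000)
sales <- data.frame(
  Name       = factor(rep(names(premium), each = n_each)),
  LivingArea = runif(3 * n_each, 500, 3000)
)
sales$SalePrice <- 100000 + 200 * sales$LivingArea +
  unname(premium[as.character(sales$Name)]) +
  rnorm(nrow(sales), sd = 5000)

# Neighborhood enters as a factor; lm() expands it into (n-1) dummies
model_fe <- lm(SalePrice ~ LivingArea + Name, data = sales)
round(coef(model_fe))
```

The `NameBack Bay` and `NameBeacon Hill` coefficients should land close to the simulated premiums, each read as a difference from Allston (the reference) at the same living area.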
    • Interaction Effects: When Relationships Depend
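      • Mathematical Form: \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 \cdot X_2) + \epsilon\] The effect of \(X_1\) on \(Y\) changes by \(\beta_3\) units for each one-unit increase in \(X_2\), so the marginal effect of \(X_1\) is \(\beta_1 + \beta_3 X_2\).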
      • Theory: Luxury Premium Hypothesis
      • Create the Neighborhood Categories
        • Model 1: No Interaction (Parallel Slopes)
        • Model 2: With Interaction (Different Slopes)
      • Interpreting the Interaction Coefficients
      • Breaking Down the Coefficients
      • Compare Model Performance
      • When Not To Use Interactions
        • Small samples: Need sufficient data in each group
        • Overfitting: Too many interactions make models unstable
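A parallel-slopes model and an interaction model can be compared side by side on simulated data. This sketch assumes a 0/1 indicator `Wealthy` and invented coefficients (a slope of 100 per square foot elsewhere, 180 in wealthy areas); it is not the lecture's dataset or its neighborhood categories.

```r
set.seed(2)
n <- 400

# Simulated sales: size is worth more per square foot in wealthy areas
sales <- data.frame(
  LivingArea = runif(n, 500, 3000),
  Wealthy    = rep(c(0, 1), each = n / 2)   # 1 = wealthy neighborhood
)
sales$SalePrice <- 50000 + 100 * sales$LivingArea + 20000 * sales$Wealthy +
  80 * sales$LivingArea * sales$Wealthy + rnorm(n, sd = 10000)

# Model 1: no interaction (parallel slopes)
model_parallel <- lm(SalePrice ~ LivingArea + Wealthy, data = sales)
# Model 2: with interaction (different slopes); `*` adds main effects too
model_interact <- lm(SalePrice ~ LivingArea * Wealthy, data = sales)

# Breaking down the coefficients into one slope per group
b <- coef(model_interact)
slope_other   <- b[["LivingArea"]]
slope_wealthy <- b[["LivingArea"]] + b[["LivingArea:Wealthy"]]

# Compare model performance
c(parallel = summary(model_parallel)$adj.r.squared,
  interact = summary(model_interact)$adj.r.squared)
anova(model_parallel, model_interact)
```

The interaction coefficient is the difference between the two slopes, so the slope in wealthy areas is the sum of the `LivingArea` and `LivingArea:Wealthy` coefficients.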
    • Polynomial Terms: Non-Linear Relationships
      • Signs of Non-Linearity
        • Curved residual plots
        • U-shaped or inverted-U patterns
        • Diminishing returns or accelerating effects
      • Theory: The U-Shaped Age Effect
      • Model Build
        • Create Age Variable (visualization: ggplot)
        • First: Linear Model (Baseline)
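        • Add Polynomial Term: Age Squared. Command: model_age_quad <- lm(SalePrice ~ Age + I(Age^2) + LivingArea, data = boston.sf). Note: (1) wrap the squared term in I(), because a bare ^ inside a formula is read as a formula operator rather than arithmetic; (2) include both Age and Age^2.
        • Interpreting Polynomial Coefficients (the coefficients cannot be interpreted directly! The effect of one more year depends on the current age: \(\beta_{Age} + 2\beta_{Age^2} \cdot Age\))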
        • Compare Model Performance
        • Check Residual Plot
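Since the quadratic coefficients cannot be read off directly, it helps to compute the age-specific effect and the turning point of the U. This is a sketch on simulated data with an invented U-shape whose minimum sits at age 75; the helper `marginal_effect` and the data frame `sales` are illustrative names, while the model formula follows the one in the notes.

```r
set.seed(3)
n <- 500

# Simulated sales with a U-shaped age effect (minimum at age 75 by construction)
sales <- data.frame(
  Age        = runif(n, 0, 150),
  LivingArea = runif(n, 500, 3000)
)
sales$SalePrice <- 600000 - 3000 * sales$Age + 20 * sales$Age^2 +
  150 * sales$LivingArea + rnorm(n, sd = 20000)

# Baseline linear model, then the quadratic model
model_age_lin  <- lm(SalePrice ~ Age + LivingArea, data = sales)
model_age_quad <- lm(SalePrice ~ Age + I(Age^2) + LivingArea, data = sales)

b <- coef(model_age_quad)

# Effect of one more year at a given age: derivative of b1*Age + b2*Age^2
marginal_effect <- function(age) b[["Age"]] + 2 * b[["I(Age^2)"]] * age
marginal_effect(c(10, 100))

# Age at which the effect switches sign (bottom of the U)
turning_point <- -b[["Age"]] / (2 * b[["I(Age^2)"]])
turning_point

# Compare model performance
c(linear    = summary(model_age_lin)$adj.r.squared,
  quadratic = summary(model_age_quad)$adj.r.squared)
```

For the residual check, `plot(fitted(model_age_lin), resid(model_age_lin))` should show a curved pattern, and the same plot for `model_age_quad` should look like an unstructured cloud.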
  • Part 2: Creating Spatial Features
    • Three Approaches to Spatial Features
      • Buffer Aggregation (count or sum events within a defined distance)
      • k-Nearest Neighbors (kNN) (average distance to the k nearest events)
      • Distance to Specific Points (straight-line distance to important locations)
    • Example
      • Load and Prepare Crime Data
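      • Approach 1: Buffer Aggregation (660 ft, 500 ft)
      • Approach 2: k-Nearest Neighbors Method, the average distance to the k nearest events (the kNN feature with the strongest correlation tells us the relevant "zone of influence" for crime perception!)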
      • Approach 3: Distance to Downtown
      • All spatial features together
      • Model Comparison: Adding Spatial Features
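The three approaches can be sketched together on a handful of made-up points. Everything here is an assumption for illustration: the coordinates are invented, EPSG:2249 (NAD83 / Massachusetts Mainland, US survey feet) is chosen so that distances come out in feet, and the kNN step uses a brute-force distance matrix rather than whatever helper function the lecture used. It assumes the `sf` and `units` packages are installed.

```r
library(sf)

# EPSG:2249 is an assumed CRS in US survey feet; coordinates are made up
crs_ft <- 2249
houses <- st_as_sf(
  data.frame(id = 1:3,
             x = c(770000, 771000, 775000),
             y = c(2950000, 2950000, 2952000)),
  coords = c("x", "y"), crs = crs_ft
)
crimes <- st_as_sf(
  data.frame(x = c(770100, 770300, 771000, 771500),
             y = c(2950000, 2950400, 2950600, 2950000)),
  coords = c("x", "y"), crs = crs_ft
)
downtown <- st_sfc(st_point(c(772000, 2950000)), crs = crs_ft)

# Approach 1: count crimes inside a 660 ft buffer around each house
buffers <- st_buffer(houses, dist = 660)
houses$crimes_660ft <- lengths(st_intersects(buffers, crimes))

# Approach 2: mean distance to the k nearest crimes (brute-force matrix)
k <- 2
d_crime <- units::drop_units(st_distance(houses, crimes))
houses$crime_nn2 <- apply(d_crime, 1, function(d) mean(sort(d)[1:k]))

# Approach 3: straight-line distance to one important point
houses$dist_downtown <- as.numeric(st_distance(houses, downtown))

# All spatial features together, without the geometry column
st_drop_geometry(houses)
```

The distance matrix has one row per house and one column per crime, which is fine for a toy example but grows quickly; a dedicated nearest-neighbor routine is the usual choice for full datasets.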
  • Part 3: Fixed Effects(FE)
    • What Are Fixed Effects? Categorical variables that capture all unmeasured characteristics of a group
    • How Fixed Effects Work: Each coefficient = price premium/discount for that neighborhood (holding all else constant)
    • Why Use Fixed Effects?
  • Part 5: Cross-Validation (with Categorical Variables)
    • The Problem: Sparse Categories (rule of thumb: categories with n < 10 will likely cause CV problems)
    • Solution: Group Small Neighborhoods (trade-off: lose granularity for small neighborhoods, but avoid CV crashes)
    • Alternative: Drop Sparse Categories
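Both remedies can be sketched in base R. This is a minimal sketch under stated assumptions: the data are simulated, "Tiny A" and "Tiny B" are hypothetical sparse neighborhoods, the grouped label "Small Neighborhoods" is an invented name, and the manual 5-fold loop stands in for whatever cross-validation helper the course uses.

```r
set.seed(42)

# Simulated sales; "Tiny A" and "Tiny B" are hypothetical sparse neighborhoods
sales <- data.frame(
  Name = c(rep("Allston", 60), rep("Back Bay", 60),
           rep("Tiny A", 6), rep("Tiny B", 5)),
  stringsAsFactors = FALSE
)
sales$LivingArea <- runif(nrow(sales), 500, 3000)
sales$SalePrice <- 100000 + 200 * sales$LivingArea +
  ifelse(sales$Name == "Back Bay", 150000, 0) +
  rnorm(nrow(sales), sd = 20000)

# Solution: lump neighborhoods with fewer than 10 sales into one category
counts <- table(sales$Name)
small  <- names(counts)[counts < 10]
sales$Name_cv <- factor(
  ifelse(sales$Name %in% small, "Small Neighborhoods", sales$Name)
)

# Alternative: drop the sparse neighborhoods entirely
sales_dropped <- sales[!(sales$Name %in% small), ]

# Manual 5-fold cross-validation on the grouped variable
k    <- 5
fold <- sample(rep(1:k, length.out = nrow(sales)))
rmse <- numeric(k)
for (i in 1:k) {
  train <- sales[fold != i, ]
  test  <- sales[fold == i, ]
  fit   <- lm(SalePrice ~ LivingArea + Name_cv, data = train)
  pred  <- predict(fit, newdata = test)
  rmse[i] <- sqrt(mean((test$SalePrice - pred)^2))
}
mean(rmse)
```

Grouping keeps every sale in the model at the cost of one shared coefficient for the small neighborhoods, while dropping keeps the remaining coefficients clean but shrinks the sample.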

Coding Techniques

  • [New R functions or approaches]
  • [Quarto features learned]

Questions & Challenges

  • What I didn’t fully understand
    • left out???0??
  • Areas needing more practice
    • 1

Connections to Policy

  • [How this week’s content applies to real policy work]

Reflection

  • [What was most interesting]
  • [How I’ll apply this knowledge]