SPATIAL MACHINE LEARNING & ADVANCED REGRESSION

WEEK 6 NOTES

Author

Tess Vu

Published

October 13, 2025

Key Concepts Learned

Baseline Model

Code
# Load packages and data
library(tidyverse)
library(sf)
library(here)

# Load Boston housing data
boston <- read_csv(here("data/boston.csv"))

# Quick look at the data
glimpse(boston)
Rows: 1,485
Columns: 24
$ ...1       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
$ Parcel_No  <dbl> 100032000, 100058000, 100073000, 100112000, 100137000, 1001…
$ SalePrice  <dbl> 450000, 600000, 450000, 670000, 260000, 355000, 665000, 355…
$ PricePerSq <dbl> 228.89, 164.34, 105.98, 291.94, 217.21, 190.96, 227.35, 120…
$ LivingArea <dbl> 1966, 3840, 4246, 2295, 1197, 1859, 2925, 2904, 892, 1916, …
$ Style      <chr> "Conventional", "Semi?Det", "Decker", "Row\xa0End", "Coloni…
$ GROSS_AREA <dbl> 3111, 5603, 6010, 3482, 1785, 2198, 4341, 3892, 1658, 3318,…
$ NUM_FLOORS <dbl> 2.0, 3.0, 3.0, 3.0, 2.0, 1.5, 3.0, 3.0, 2.0, 2.0, 1.5, 2.0,…
$ R_BDRMS    <dbl> 4, 8, 9, 6, 2, 3, 8, 6, 2, 2, 4, 3, 3, 3, 6, 8, 4, 4, 3, 2,…
$ R_FULL_BTH <dbl> 2, 3, 3, 3, 1, 3, 3, 3, 1, 2, 2, 3, 1, 1, 3, 3, 2, 3, 1, 2,…
$ R_HALF_BTH <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,…
$ R_KITCH    <dbl> 2, 3, 3, 3, 1, 2, 3, 3, 1, 2, 2, 3, 1, 1, 3, 3, 2, 3, 2, 2,…
$ R_AC       <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N",…
$ R_FPLACE   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ LU         <chr> "R2", "R3", "R3", "R3", "R1", "R2", "R3", "E", "R1", "R2", …
$ OWN_OCC    <chr> "Y", "N", "Y", "N", "N", "N", "N", "N", "N", "Y", "N", "N",…
$ R_BLDG_STY <chr> "CV", "SD", "DK", "RE", "CL", "CV", "DK", "DK", "RE", "TF",…
$ R_ROOF_TYP <chr> "H", "F", "F", "F", "F", "G", "F", "F", "G", "F", "G", "F",…
$ R_EXT_FIN  <chr> "M", "B", "M", "M", "P", "M", "M", "A", "A", "M", "W", "W",…
$ R_TOTAL_RM <dbl> 10, 17, 20, 14, 5, 8, 14, 14, 4, 9, 7, 7, 5, 5, 15, 14, 11,…
$ R_HEAT_TYP <chr> "W", "W", "W", "W", "E", "E", "W", "W", "W", "W", "W", "W",…
$ YR_BUILT   <dbl> 1900, 1910, 1910, 1905, 1860, 1905, 1900, 1890, 1900, 1900,…
$ Latitude   <dbl> 42.37963, 42.37877, 42.37940, 42.38014, 42.37967, 42.37953,…
$ Longitude  <dbl> -71.03076, -71.02943, -71.02846, -71.02859, -71.02903, -71.…
Code
# Simple model: Predict price from living area
# SalePrice is y (response) and LivingArea is x (predictor).
baseline_model <- lm(SalePrice ~ LivingArea, data = boston)
summary(baseline_model)

Call:
lm(formula = SalePrice ~ LivingArea, data = boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-855962 -219491  -68291   55248 9296561 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 157968.32   35855.59   4.406 1.13e-05 ***
LivingArea     216.54      14.47  14.969  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 563800 on 1483 degrees of freedom
Multiple R-squared:  0.1313,    Adjusted R-squared:  0.1307 
F-statistic: 224.1 on 1 and 1483 DF,  p-value: < 2.2e-16
  • Interpretation

    • Intercept’s Estimate: The predicted sale price at 0 sq. ft. is ~$157,968. Not meaningful in practice, since no house has zero living area.

    • Living Area’s Estimate: Each additional square foot of living area adds ~$217 to sale price.

    • The relationship is statistically significant (p < 0.001); because p is tiny, we reject the null hypothesis of no relationship.

    • R² is 0.13, so the regression model explains only ~13% of the variation in sale price.

      • The remaining ~87% of variation is unexplained, so what is the model missing?

        • Limitations

          • Ignores location (North End vs. Roxbury vs. Back Bay); proximity to downtown, waterfront, parks, etc.; nearby crime levels; school quality; neighborhood characteristics.

          • Might fail because 1,000 sq ft in Back Bay is not equivalent to 1,000 sq ft in Roxbury; same house, different locations, means very different prices; location!

          • Could improve by adding spatial features like crime and distance to amenities; control for neighborhood (fixed effects); include interactions (does size matter more in wealthy areas)?

          • Just doing a linear regression is a limit, so that’s where spatial features come in.

Code
# Add number of bathrooms
better_model <- lm(SalePrice ~ LivingArea + R_FULL_BTH, data = boston)
summary(better_model)

Call:
lm(formula = SalePrice ~ LivingArea + R_FULL_BTH, data = boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-783019 -230756  -65737   75367 9193133 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 107683.96   37340.88   2.884  0.00399 ** 
LivingArea     145.68      21.33   6.828 1.25e-11 ***
R_FULL_BTH  106978.13   23800.30   4.495 7.50e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 560200 on 1482 degrees of freedom
Multiple R-squared:  0.1429,    Adjusted R-squared:  0.1418 
F-statistic: 123.6 on 2 and 1482 DF,  p-value: < 2.2e-16
Code
# Compare models
cat("Baseline R²:", summary(baseline_model)$r.squared, "\n")
Baseline R²: 0.1312636 
Code
cat("With bathrooms R²:", summary(better_model)$r.squared, "\n")
With bathrooms R²: 0.1429474 
Code
# Convert boston data to sf object
boston.sf <- boston %>%
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326) %>%
  st_transform('ESRI:102286')  # MA State Plane (feet)

# Check it worked
head(boston.sf)
Simple feature collection with 6 features and 22 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 238643.6 ymin: 903246 xmax: 238833.2 ymax: 903399.5
Projected CRS: NAD_1983_HARN_StatePlane_Massachusetts_Mainland_FIPS_2001
# A tibble: 6 × 23
   ...1 Parcel_No SalePrice PricePerSq LivingArea Style    GROSS_AREA NUM_FLOORS
  <dbl>     <dbl>     <dbl>      <dbl>      <dbl> <chr>         <dbl>      <dbl>
1     1 100032000    450000       229.       1966 "Conven…       3111        2  
2     2 100058000    600000       164.       3840 "Semi?D…       5603        3  
3     3 100073000    450000       106.       4246 "Decker"       6010        3  
4     4 100112000    670000       292.       2295 "Row\xa…       3482        3  
5     5 100137000    260000       217.       1197 "Coloni…       1785        2  
6     6 100138000    355000       191.       1859 "Conven…       2198        1.5
# ℹ 15 more variables: R_BDRMS <dbl>, R_FULL_BTH <dbl>, R_HALF_BTH <dbl>,
#   R_KITCH <dbl>, R_AC <chr>, R_FPLACE <dbl>, LU <chr>, OWN_OCC <chr>,
#   R_BLDG_STY <chr>, R_ROOF_TYP <chr>, R_EXT_FIN <chr>, R_TOTAL_RM <dbl>,
#   R_HEAT_TYP <chr>, YR_BUILT <dbl>, geometry <POINT [m]>
Code
class(boston.sf)  # Should show "sf" and "data.frame"
[1] "sf"         "tbl_df"     "tbl"        "data.frame"
Code
# Load neighborhood boundaries
nhoods <- read_sf(here("data/BPDA_Neighborhood_Boundaries.geojson")) %>%
  st_transform('ESRI:102286')  # Match CRS!

# Check the neighborhoods
head(nhoods)
Simple feature collection with 6 features and 7 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 229049.1 ymin: 890856 xmax: 236590.9 ymax: 900302.5
Projected CRS: NAD_1983_HARN_StatePlane_Massachusetts_Mainland_FIPS_2001
# A tibble: 6 × 8
  sqmiles name         neighborhood_id  acres SHAPE__Length objectid SHAPE__Area
    <dbl> <chr>        <chr>            <dbl>         <dbl>    <int>       <dbl>
1    2.51 Roslindale   15              1606.         53564.       53   69938273.
2    3.94 Jamaica Pla… 11              2519.         56350.       54  109737890.
3    0.55 Mission Hill 13               351.         17919.       55   15283120.
4    0.29 Longwood     28               189.         11909.       56    8215904.
5    0.04 Bay Village  33                26.5         4651.       57    1156071.
6    0.02 Leather Dis… 27                15.6         3237.       58     681272.
# ℹ 1 more variable: geometry <MULTIPOLYGON [m]>
Code
nrow(nhoods)  # How many neighborhoods?
[1] 26
Code
# Spatial join: Assign each house to its neighborhood
boston.sf <- boston.sf %>%
  st_join(nhoods, join = st_intersects)

# Check results
boston.sf %>%
  st_drop_geometry() %>%
  count(`name`) %>%
  arrange(desc(n))
# A tibble: 19 × 2
   name              n
   <chr>         <int>
 1 Dorchester      344
 2 West Roxbury    242
 3 Hyde Park       152
 4 East Boston     149
 5 Roslindale      142
 6 Jamaica Plain   113
 7 South Boston     80
 8 Charlestown      65
 9 Mattapan         65
10 Roxbury          63
11 South End        20
12 Beacon Hill      17
13 Mission Hill     14
14 Allston           6
15 Brighton          6
16 Back Bay          3
17 Fenway            2
18 Bay Village       1
19 Downtown          1
  • Why do you think certain neighborhoods command higher prices? Proximity to downtown? Historical character? School quality? Safety? All of the above? This is why we need spatial features and neighborhood controls!

Part 1: Expanding Toolkit

Categorical Variables

  • Continuous Variables

    • Square footage

    • Age of house

    • Income levels

    • Distance to downtown

  • Categorical Variables

    • Neighborhood

      • n-1 Rule: One neighborhood is chosen as the reference category (omitted). R picks the first alphabetically unless specified otherwise.
    • School district

    • Building type

    • Has garage

Code
# Ensure name is a factor
boston.sf <- boston.sf %>%
  mutate(name = as.factor(name))

# Check which is reference (first alphabetically)
levels(boston.sf$name)[1]
[1] "Allston"
Code
# Fit model with neighborhood fixed effects
model_neighborhoods <- lm(SalePrice ~ LivingArea + name, 
                          data = boston.sf)

# Show just first 10 coefficients
summary(model_neighborhoods)$coef[1:10, ]
                    Estimate   Std. Error    t value      Pr(>|t|)
(Intercept)      704306.0751 9.022210e+04  7.8063584  1.112892e-14
LivingArea          138.3206 6.468186e+00 21.3847638  1.605175e-88
nameBack Bay    7525585.0560 1.533714e+05 49.0677253 1.522921e-311
nameBay Village 1284057.5326 2.320256e+05  5.5341206  3.700442e-08
nameBeacon Hill 1767805.2297 1.020211e+05 17.3278366  2.465874e-61
nameBrighton    -168941.2113 1.239972e+05 -1.3624597  1.732623e-01
nameCharlestown   -2556.8667 9.208989e+04 -0.0277649  9.778534e-01
nameDorchester  -576819.4858 8.846990e+04 -6.5199517  9.653393e-11
nameDowntown      58553.1237 2.322150e+05  0.2521505  8.009601e-01
nameEast Boston -534060.4110 8.962950e+04 -5.9585340  3.182615e-09
  • Interpretation

    • Reference Category: Allston was automatically chosen as the intercept because it’s the first alphabetically.

    • Structural Variables

      • Living Area: Each additional sq. ft. adds ~$138 to predicted price (the same slope in every neighborhood).

      • Note: this model includes only LivingArea as a structural variable; bedrooms or bathrooms, if added, would likewise get a single coefficient shared across neighborhoods.

    • Neighborhood Dummies

      • Positive Coefficient: This neighborhood is more expensive than Allston.

      • Negative Coefficient: This neighborhood is less expensive than Allston.

      • All other variables constant and held equal, same size and same bathrooms.

        • House A in Back Bay

          • Living Area: 1,500 sq. ft.

          • Baths: 2

          • Neighborhood: Back Bay

          • Predicted Price: $8,458,873

        • House B in Roxbury

          • Living Area: 1,500 sq. ft.

          • Baths: 2

          • Neighborhood: Roxbury

          • Predicted Price: $311,066

        • Neighborhood Effect Price Difference: $8,127,807

          • Same house, different location, but huge price difference. This is what neighborhood dummies capture.

            • R encodes dummy variables as 0/1 indicators, so for the Back Bay dummy, 1 means the house is in Back Bay and 0 means it is not.
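The worked comparison above can be reproduced with predict(). A minimal sketch, using the model_neighborhoods fit from earlier (note that model includes only LivingArea and the neighborhood dummies, so the bathroom counts in the example do not enter it):

```r
# Two hypothetical houses: identical size, different neighborhoods
new_houses <- data.frame(
  LivingArea = c(1500, 1500),
  name = c("Back Bay", "Roxbury")
)

# Predicted prices from the neighborhood fixed-effects model
preds <- predict(model_neighborhoods, newdata = new_houses)
preds

# The gap between the two predictions is the neighborhood effect
diff(preds)
```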

Interactions: When Relationships Depend

  • The Question: Does the effect of one variable depend on the level of another variable?

    • Example Scenarios

      • Housing: Does square footage matter more in wealthy neighborhoods?

      • Education: Do tutoring effects vary by initial skill level?

      • Public Health: Do pollution effects differ by age?

    • Example for this study: Is the value of square footage the same across all Boston neighborhoods?

      • SalePrice = β₀ + β₁(LivingArea) + β₂(WealthyNeighborhood) + β₃(LivingArea × WealthyNeighborhood) + ε
  • Theory: Luxury Premium Hypothesis

    • In Wealthy Neighborhoods (Back Bay, Beacon Hill, South End)

      • High-end buyers pay premium for space.

      • Luxury finishes, location prestige.

      • Each sq. ft. adds substantial value.

      • Steep slope.

      • Hypothesis: $300+ per sq. ft.

    • In Working-Class Neighborhoods (Dorchester, Mattapan, East Boston)

      • Buyers value function over luxury.

      • More price-sensitive market.

      • Space matters, but less premium.

      • Flatter slope.

      • Hypothesis: $100-150 per sq. ft.

    • Key Question: If we assume one slope for all neighborhoods, are we misunderstanding the market?

Code
# Define wealthy neighborhoods based on median prices
wealthy_hoods <- c("Back Bay", "Beacon Hill", "South End", "Bay Village")

# Create binary indicator
boston.sf <- boston.sf %>%
  mutate(
    wealthy_neighborhood = ifelse(name %in% wealthy_hoods, "Wealthy", "Not Wealthy"),
    wealthy_neighborhood = as.factor(wealthy_neighborhood)
  )

# Check the split
boston.sf %>%
  st_drop_geometry() %>%
  count(wealthy_neighborhood)
# A tibble: 2 × 2
  wealthy_neighborhood     n
  <fct>                <int>
1 Not Wealthy           1444
2 Wealthy                 41

Model 1: No Interaction (Parallel Slopes)

Code
# Model assumes same slope everywhere
model_no_interact <- lm(SalePrice ~ LivingArea + wealthy_neighborhood, 
                        data = boston.sf)

summary(model_no_interact)$coef
                                Estimate   Std. Error   t value      Pr(>|t|)
(Intercept)                  213842.7642 23893.619394  8.949785  1.039392e-18
LivingArea                      160.2357     9.713346 16.496442  2.936087e-56
wealthy_neighborhoodWealthy 2591012.0471 59958.794614 43.213211 1.103579e-264
  • Assumes living area has the same effect in all neighborhoods. Only the intercept differs (wealthy areas start higher). Parallel lines on a plot.

Model 2: With Interaction (Different Slopes)

Code
# Model allows different slopes
model_interact <- lm(SalePrice ~ LivingArea * wealthy_neighborhood, 
                     data = boston.sf)

summary(model_interact)$coef
                                            Estimate   Std. Error   t value
(Intercept)                             358542.41696 1.863997e+04 19.235144
LivingArea                                  95.63947 7.620382e+00 12.550483
wealthy_neighborhoodWealthy            -377937.38372 1.005554e+05 -3.758498
LivingArea:wealthy_neighborhoodWealthy     985.12500 2.975904e+01 33.103383
                                            Pr(>|t|)
(Intercept)                             8.875710e-74
LivingArea                              2.079700e-34
wealthy_neighborhoodWealthy             1.775774e-04
LivingArea:wealthy_neighborhoodWealthy 2.445570e-180
  • Allows living area to have different effects in different neighborhoods. Both intercept and slope differ. Non-parallel lines on a plot.

  • We get the counterintuitive negative wealthy coefficient because it is an intercept adjustment (it applies at 0 sq. ft.). The slope difference (+$985 per sq. ft.) is what matters; we can calculate the size at which wealthy areas become more expensive: 377,937 / 985 ≈ 384 sq. ft.

    • Not Wealthy Areas Equation: \(Price = 358,542 + 96 × LivingArea\)

      • Interpretation is at base price $358,542, each sq. ft. adds $96.
    • Wealthy Areas Equation: \(Price = -19,395 + 1,081 × LivingArea\)

      • Interpretation is at base price -$19,395, each sq. ft. adds $1,081.
    • The Interaction Effect: wealthy areas value each sq. ft. about $985 more than non-wealthy areas!

  • Key Observation: The lines are NOT parallel when plotted - that’s the interaction!
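The two segment equations above can be recovered directly from the fitted coefficients. A short sketch, assuming the model_interact object from the previous code block:

```r
# Pull coefficients from the interaction model
b <- coef(model_interact)

# Not-wealthy segment: base intercept and slope
not_wealthy_slope <- b["LivingArea"]  # ~$96 per sq. ft.

# Wealthy segment: base terms plus the adjustment terms
wealthy_intercept <- b["(Intercept)"] + b["wealthy_neighborhoodWealthy"]  # ~ -$19,395
wealthy_slope <- b["LivingArea"] +
  b["LivingArea:wealthy_neighborhoodWealthy"]  # ~$1,081 per sq. ft.

# Size at which the wealthy prediction overtakes the not-wealthy one
crossover_sqft <- -b["wealthy_neighborhoodWealthy"] /
  b["LivingArea:wealthy_neighborhoodWealthy"]  # ~384 sq. ft.
```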

Comparing Model Performance

Code
# Compare R-squared
cat("Model WITHOUT interaction R²:", round(summary(model_no_interact)$r.squared, 4), "\n")
Model WITHOUT interaction R²: 0.6156 
Code
cat("Model WITH interaction R²:", round(summary(model_interact)$r.squared, 4), "\n")
Model WITH interaction R²: 0.7791 
Code
cat("Improvement:", round(summary(model_interact)$r.squared - summary(model_no_interact)$r.squared, 4), "\n")
Improvement: 0.1635 
  • Model Improvement: Adding the interaction raises R² by 0.1635 (a 26.6% relative improvement over 0.6156).

    • Interpretation: We explain 16.35 percentage points more of the price variation by allowing different slopes!

Polynomial Terms: Non-Linear Relationships When Straight Lines Don’t Fit

  • Signs of Non-Linearity:

    • Curved residual plots.

    • Diminishing returns.

    • Accelerating effects.

    • U-shaped or inverted-U patterns.

    • Theoretical reasons.

  • Examples:

    • House Age: Depreciation, then vintage premium.

    • Test Scores: Plateau after studying.

    • Advertising: Diminishing returns.

    • Crime Prevention: Early gains, then plateaus.

  • Polynomial Regression

    • SalePrice = β₀ + β₁(Age) + β₂(Age²) + ε

      • Allows for curved relationship.
  • Age’s Non-Linear Effect

    • New Houses (0-20 Years)

      • Modern amenities.

      • Move-in ready.

      • No repairs needed.

      • High value.

      • Steep depreciation initially.

    • Middle-Aged (20-80 Years)

      • Needs updates.

      • Wear and tear.

      • Not yet “historic”.

      • Lowest value.

      • Trough of the curve.

    • Historic/Vintage (80+ Years)

      • Architectural character.

      • Historic districts.

      • Prestige value.

      • Rising value.

      • “Vintage premium”.

  • Boston’s Context: The city has LOTS of historic homes (Back Bay, Beacon Hill built 1850s-1900s). Does age create a U-shaped curve?

Code
# Calculate age from year built
boston.sf <- boston.sf %>%
  mutate(Age = 2025 - YR_BUILT) %>%
  filter(Age < 2000)  # drop rows with a missing/zero YR_BUILT


# Check the distribution of age
summary(boston.sf$Age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   12.0    95.0   115.0   108.1   125.0   223.0 
Code
# Visualize age distribution
ggplot(boston.sf, aes(x = Age)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white", alpha = 0.7) +
  labs(title = "Distribution of House Age in Boston",
       x = "Age (years)",
       y = "Count") +
  theme_minimal()

Linear Model (Baseline)

Code
# Simple linear relationship
model_age_linear <- lm(SalePrice ~ Age + LivingArea, data = boston.sf)

summary(model_age_linear)$coef
                Estimate  Std. Error   t value     Pr(>|t|)
(Intercept) -120808.8738 57836.98918 -2.088782 3.689911e-02
Age            2834.0198   497.72013  5.694003 1.496246e-08
LivingArea      202.8312    15.10373 13.429214 7.228187e-39
  • Each additional year of age increases predicted price by ~$2,834, a constant rate the linear model assumes.

Add Polynomial Term: Age Squared

Code
# Quadratic model (Age²)
model_age_quad <- lm(SalePrice ~ Age + I(Age^2) + LivingArea, data = boston.sf)

summary(model_age_quad)$coef
                Estimate   Std. Error   t value     Pr(>|t|)
(Intercept) 570397.14406 97894.237962  5.826667 6.938345e-09
Age         -13007.51918  1896.446156 -6.858892 1.019524e-11
I(Age^2)        80.88988     9.360652  8.641479 1.425208e-17
LivingArea     203.75007    14.739041 13.823835 5.975020e-41
  • Important: the I() function. Why I(Age^2) instead of just Age^2? In R formulas, ^ has a special meaning (crossing of terms), so I() tells R to interpret the expression literally and compute Age². Without I(), R would expand the formula differently.

  • Model Equation: \(Price = 570,397 - 13,008 × Age + 81 × Age² + 204 × LivingArea\)

    • Warning: You can’t interpret the Age coefficients directly. With Age², the effect of age is no longer constant; you need the marginal effect: \(∂Price/∂Age = β₁ + 2 × β₂ × Age\). The effect changes at every age!
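The marginal-effect formula is easy to evaluate at a few ages. A sketch assuming the model_age_quad fit from above:

```r
# Marginal effect of age: dPrice/dAge = b1 + 2 * b2 * Age
b <- coef(model_age_quad)
ages <- c(20, 50, 80, 120, 200)
data.frame(
  Age = ages,
  MarginalEffect = round(b["Age"] + 2 * b["I(Age^2)"] * ages)
)

# Turning point of the U-shape: where the marginal effect crosses zero
-b["Age"] / (2 * b["I(Age^2)"])  # roughly 80 years
```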

Compare Models

Code
# R-squared comparison
r2_linear <- summary(model_age_linear)$r.squared
r2_quad <- summary(model_age_quad)$r.squared

cat("Linear model R²:", round(r2_linear, 4), "\n")
Linear model R²: 0.1549 
Code
cat("Quadratic model R²:", round(r2_quad, 4), "\n")
Quadratic model R²: 0.1959 
Code
cat("Improvement:", round(r2_quad - r2_linear, 4), "\n\n")
Improvement: 0.0409 
Code
# F-test: Is the Age² term significant?
anova(model_age_linear, model_age_quad)
Analysis of Variance Table

Model 1: SalePrice ~ Age + LivingArea
Model 2: SalePrice ~ Age + I(Age^2) + LivingArea
  Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
1   1469 4.5786e+14                                   
2   1468 4.3569e+14  1 2.2163e+13 74.675 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Check Residual Plot

Code
# Compare residual plots
par(mfrow = c(1, 2))

# Linear model residuals
plot(fitted(model_age_linear), residuals(model_age_linear),
     main = "Linear Model Residuals",
     xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red", lty = 2)

# Quadratic model residuals  
plot(fitted(model_age_quad), residuals(model_age_quad),
     main = "Quadratic Model Residuals",
     xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red", lty = 2)

Part 2: Why Space Matters

Hedonic Model Framework

Tobler’s First Law

“Everything is related to everything else, but near things are more related than distant things.”

- Waldo Tobler, 1970

  • For house prices, crime nearby matters more than crime across the city.

  • Parks within walking distance affect value.

  • Immediate neighborhood defines market.

Spatial Autocorrelation

Part 3: Creating Spatial Features

Buffer Aggregation

  • Count or sum events within a defined distance.

    • Example: Number of crimes within 500’.
Code
neighborhood_boundaries <- st_read("data/BPDA_Neighborhood_Boundaries.geojson")
Reading layer `BPDA_Neighborhood_Boundaries' from data source 
  `C:\Users\Tess\Documents\GitHub\portfolio-setup-TessaVu\weekly-notes\week-06\data\BPDA_Neighborhood_Boundaries.geojson' 
  using driver `GeoJSON'
Simple feature collection with 26 features and 7 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -71.19125 ymin: 42.22792 xmax: -70.92278 ymax: 42.39699
Geodetic CRS:  WGS 84
Code
crimes.sf <- read_csv("data/bostonCrimes.csv")
crimes.sf <- crimes.sf %>%
  na.omit()

# Use the ESRI authority code, matching the earlier st_transform calls
neighborhood_boundaries <- st_transform(neighborhood_boundaries, 'ESRI:102286')
boston.sf <- st_transform(boston.sf, 'ESRI:102286')
crimes.sf <- st_as_sf(crimes.sf, coords = c("Long", "Lat"), crs = 4326)
crimes.sf <- st_transform(crimes.sf, 'ESRI:102286')
Code
# Create buffer features - these will work now that CRS is correct
boston.sf <- boston.sf %>%
  mutate(
    crimes.Buffer = lengths(st_intersects(
      st_buffer(geometry, 660),
      crimes.sf
    )),
    crimes_500ft = lengths(st_intersects(
      st_buffer(geometry, 500),
      crimes.sf
    ))
  )

K-Nearest Neighbors

  • Average distance to k closest events.

    • Example: Average distance to 3 nearest violent crimes.
Code
# Calculate distance matrix (houses to crimes)
dist_matrix <- st_distance(boston.sf, crimes.sf)

# Function to get mean distance to k nearest neighbors
get_knn_distance <- function(dist_matrix, k) {
  apply(dist_matrix, 1, function(distances) {
    # Sort and take first k, then average
    mean(as.numeric(sort(distances)[1:k]))
  })
}

# Create multiple kNN features
boston.sf <- boston.sf %>%
  mutate(
    crime_nn1 = get_knn_distance(dist_matrix, k = 1),
    crime_nn3 = get_knn_distance(dist_matrix, k = 3),
    crime_nn5 = get_knn_distance(dist_matrix, k = 5)
  )

# Check results
summary(boston.sf %>% st_drop_geometry() %>% select(starts_with("crime_nn")))
   crime_nn1         crime_nn3         crime_nn5      
 Min.   :  16.58   Min.   :  16.58   Min.   :  16.58  
 1st Qu.: 299.19   1st Qu.: 331.19   1st Qu.: 382.47  
 Median : 585.10   Median : 611.03   Median : 679.62  
 Mean   : 757.88   Mean   : 791.84   Mean   : 857.06  
 3rd Qu.:1043.77   3rd Qu.:1079.86   3rd Qu.:1174.81  
 Max.   :3558.73   Max.   :3558.73   Max.   :3630.10  
  • A value like crime_nn3 = 83.29 would mean the average distance to the 3 nearest crimes is 83.29 feet.
Code
# Which k value correlates most with price?
boston.sf %>%
  st_drop_geometry() %>%
  select(SalePrice, crime_nn1, crime_nn3, crime_nn5) %>%
  cor(use = "complete.obs") %>%
  as.data.frame() %>%
  select(SalePrice)
           SalePrice
SalePrice 1.00000000
crime_nn1 0.03267382
crime_nn3 0.03117884
crime_nn5 0.06963100
  • Finding the kNN feature with the strongest correlation tells us the relevant “zone of influence” for crime perception!

Distance to Specific Points

  • Straight-line (Euclidean) distance to important locations.

    • Example: Distance to downtown, nearest T-Station.
Code
# Define downtown Boston (Boston Common: 42.3551° N, 71.0656° W)
downtown <- st_sfc(st_point(c(-71.0656, 42.3551)), crs = "EPSG:4326") %>%
  st_transform('ESRI:102286')

# Calculate distance from each house to downtown
boston.sf <- boston.sf %>%
  mutate(
    dist_downtown_ft = as.numeric(st_distance(geometry, downtown)),
    dist_downtown_mi = dist_downtown_ft / 5280
  )

# Summary
summary(boston.sf$dist_downtown_mi)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.05467 0.80476 1.37410 1.39107 1.93750 2.77678 
Code
# Summary of all spatial features created
spatial_summary <- boston.sf %>%
  st_drop_geometry() %>%
  select(crimes.Buffer, crimes_500ft, crime_nn3, dist_downtown_mi) %>%
  summary()

spatial_summary
 crimes.Buffer    crimes_500ft     crime_nn3       dist_downtown_mi 
 Min.   : 0.00   Min.   : 0.00   Min.   :  16.58   Min.   :0.05467  
 1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 331.19   1st Qu.:0.80476  
 Median : 3.00   Median : 0.00   Median : 611.03   Median :1.37410  
 Mean   : 7.78   Mean   : 4.49   Mean   : 791.84   Mean   :1.39107  
 3rd Qu.: 9.00   3rd Qu.: 6.00   3rd Qu.:1079.86   3rd Qu.:1.93750  
 Max.   :75.00   Max.   :63.00   Max.   :3558.73   Max.   :2.77678  
  • crimes.Buffer (660 ft) is a buffer count: the number of crimes near each house.

  • crimes_500ft is a buffer count: crimes within 500 ft.

  • crime_nn3 is a kNN distance: the average distance (in feet) to the 3 nearest crimes.

  • dist_downtown_mi is a point distance: miles from downtown Boston.

Code
# Recompute Age (note: this step uses 2015 as the reference year, unlike the
# earlier 2025; pick one convention when reproducing these models)
boston.sf <- boston.sf %>%
  mutate(Age = 2015 - YR_BUILT)

# Model 1: Structural only
model_structural <- lm(SalePrice ~ LivingArea + R_BDRMS + Age, 
                       data = boston.sf)

# Model 2: Add spatial features
model_spatial <- lm(SalePrice ~ LivingArea + R_BDRMS + Age +
                    crimes_500ft + crime_nn3 + dist_downtown_mi,
                    data = boston.sf)

# Compare
cat("Structural R²:", round(summary(model_structural)$r.squared, 4), "\n")
Structural R²: 0.2374 
Code
cat("With spatial R²:", round(summary(model_spatial)$r.squared, 4), "\n")
With spatial R²: 0.3644 
Code
cat("Improvement:", round(summary(model_spatial)$r.squared - 
                          summary(model_structural)$r.squared, 4), "\n")
Improvement: 0.127 

Part 4: Fixed Effects

  • Categorical variables that capture all unmeasured characteristics of a group.

  • In hedonic models:

    • Each neighborhood gets its own dummy variable.

    • Captures everything unique about that neighborhood we didn’t explicitly measure.

      • We technically already did this when we covered categorical variables.

      • What’s captured:

        • School quality.

        • Prestige or reputation.

        • Walkability.

        • Access to jobs.

        • Cultural amenities.

        • Things we can’t easily measure.

Code
# Behind the scenes, R creates dummies:
# is_BackBay = 1 if Back Bay, 0 otherwise
# is_Beacon = 1 if Beacon Hill, 0 otherwise
# is_Allston = 1 if Allston, 0 otherwise
# ... (R drops one as reference category)

Each coefficient is equal to the price premium/discount for that neighborhood (holding all else constant).
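You can inspect the dummies R builds with model.matrix(). A quick sketch using this week's data:

```r
# Design matrix R constructs behind the scenes for the fixed-effects model
X <- model.matrix(~ LivingArea + name, data = st_drop_geometry(boston.sf))

# One 0/1 column per neighborhood, minus the dropped reference (Allston)
colnames(X)
head(X[, 1:4])
```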

  • Why use fixed effects?

    • Dramatically improves prediction.

      • In our data, this is because neighborhoods bundle many unmeasured factors like school districts, job access, amenities, and “cool factor”.
    • Coefficients change.

      • Crime coefficient without FE: -$125/crime

      • Crime coefficient with FE: -$85/crime

        • Without FE: Captured confounders too.

        • With FE: Neighborhoods “absorb” other differences.

        • Now just the crime effect.

      • Trade-Off: FEs are powerful, but they’re a black box: we don’t know why Back Bay commands a premium.

Compare All Models

Part 5: Cross-Validation with Categorical Variables

CV Recap From Last Week

  • Three common validation approaches:

    • Train/Test Split: 80/20 training/testing split, simple but unstable.

    • k-Fold Cross-Validation: Split into k-folds, train on k-1, test on 1, repeat.

    • LOOCV: Leave one observation out at a time (special case of k-fold).

Use k-fold CV to compare hedonic models.

  • CV tells us how well model predicts new data.

  • More honest than in-sample R^2.

  • Helps detect overfitting.
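A minimal k-fold CV sketch using the caret package (assuming it is installed; the formula reuses variables created earlier this week):

```r
library(caret)

set.seed(825)

# 10-fold cross-validation on a hedonic model; metrics come from held-out folds
ctrl <- trainControl(method = "cv", number = 10)

cv_fit <- train(
  SalePrice ~ LivingArea + R_BDRMS + Age + crimes_500ft + dist_downtown_mi,
  data = st_drop_geometry(boston.sf),
  method = "lm",
  trControl = ctrl
)

cv_fit$results  # RMSE, R², and MAE averaged across the 10 folds
```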

Rule of Thumb: Categories with n < 10 will likely cause CV problems.

  • Solution: Group small neighborhoods.

    • Trade-Off: Lose granularity for small neighborhoods, but avoid CV crashes.
  • Alternative: Drop sparse categories.

    • Warning: Consider carefully, which neighborhoods are you excluding? Often those with less data are marginalized communities. Document what you removed and why.
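Grouping small neighborhoods can be done with forcats (loaded with the tidyverse). A sketch: fct_lump_min() collapses every level with fewer than min observations into an "Other" level.

```r
library(forcats)

# Collapse neighborhoods with fewer than 10 sales into "Other"
boston.sf <- boston.sf %>%
  mutate(name_grouped = fct_lump_min(name, min = 10, other_level = "Other"))

# Verify: every remaining level should have at least 10 sales
boston.sf %>%
  st_drop_geometry() %>%
  count(name_grouped) %>%
  arrange(n)
```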

RECOMMENDED WORKFLOW

  • Note: These values can be very large; remember RMSE squares the errors, so a few big outliers can inflate it dramatically.

  • Key Insight: Each layer improves out-of-sample prediction, with fixed effects providing the biggest boost.

  • Why? Neighborhoods bundle many unmeasured factors (schools, amenities, prestige) that we can’t easily quantify individually.

  • Look for:

    • Prices over $2-3 million (could be luxury condos or errors).

    • Prices near 0 (data errors).

    • A long right tail in the histogram.

    • Log Transform the skewed dependent variable.

      • Interpreting log models:

        • RMSE is now in log-dollars (hard to interpret).

        • To convert back: exp(predictions) gives actual dollars.

        • Coefficients now represent percentage changes, not dollar changes.

        • This is standard practice in hedonic modeling!
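The log-transform workflow in code. A sketch reusing this week's variables:

```r
# Fit on the log scale to tame the long right tail in SalePrice
model_log <- lm(log(SalePrice) ~ LivingArea + name, data = boston.sf)

# Coefficients are now approximate proportional effects:
# a LivingArea coefficient of 0.0003 would mean ~0.03% per extra sq. ft.
summary(model_log)$coef[1:3, ]

# Back-transform predictions to dollars before computing RMSE
preds_dollars <- exp(predict(model_log, newdata = boston.sf))
summary(preds_dollars)
```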

Coding Techniques

  • sum(is.na())

  • glimpse()

  • summary()

  • st_drop_geometry()

Questions & Challenges

  • Do NOT use interactions when:

    • There are small samples, sufficient data is needed in each group.

    • There are already too many interactions, it makes models unstable and causes overfitting.

  • How do we quantify “nearness” in a way a regression model can use?

    • We must create spatial features that measure proximity to amenities and disamenities.
  • When CV fails with categorical variables, the problem tends to be sparse categories.

    • In example:

      • Random split created 10 folds.

      • All “West End” sales ended up in one fold, the test fold.

      • Training folds never saw “West End”.

      • Model can’t predict for a category it never learned.

    • Issue: When neighborhoods have very few sales (<10), random CV splits can put all instances in the same fold, breaking the model.

Connections to Policy

  • Boston’s Housing Market

    • Market Segmentation: Boston operates as two distinct housing markets.

      • Luxury Market: Every sq. ft. is premium at $1,081/sq. ft.

      • Standard Market: Space is valued, but lower premium at $96 / sq. ft.

    • Affordability Crisis: The interaction amplifies inequality.

      • Large homes in wealthy areas become exponentially more expensive.

      • Creates barriers to mobility between neighborhoods.

    • Policy Design: One-size-fits-all policies likely will fail.

      • Property tax assessments should account for neighborhood-specific valuation.

      • Housing assistance needs vary dramatically by area.

Reflection

Tips for Success:

  • Do:

    • Start simple, add complexity.

    • Check for NAs: sum(is.na()).

    • Test on small subset first.

    • Comment code.

    • Check coefficient signs.

    • Use glimpse() and summary().

  • Don’t:

    • Add 50 variables at once.

    • Ignore errors.

    • Forget st_drop_geometry() for non-spatial operations.

    • Skip sparse category check.

  • Common Errors:

    • “Factor has new levels” -> Group sparse categories.

    • “Computationally singular” -> Remove collinear variables.

    • “Very high RMSE” -> Check outliers, scale.

    • “CV takes forever” -> Simplify model or reduce folds.

    • “Negative R^2” -> Model worse than mean, rethink variables.