Philadelphia Housing Price Prediction - Technical Appendix
Authors
Nina Carlsen, Ryan Drake, Sujanesh Kakumanu, Kavana Raju, Chloe Robinson, Henry Sywulak-Herr
This technical appendix documents the full workflow used to engineer and visualize spatial features for predicting residential housing prices in Philadelphia.
We apply several filters to the property data for quality and relevance. First, we restrict our analysis to residential properties sold between 2023 and 2024, excluding all other property categories. Second, we remove properties with sale prices of $10 or less, as these are abnormal prices for residential properties.
To stay within GitHub file size limits, the data is further trimmed of irrelevant columns.
Code
# Restrict to residential only
residential_categories <- c("APARTMENTS > 4 UNITS", "MULTI FAMILY", "SINGLE FAMILY", "MIXED USE")

residential_data <- sf_data %>%
  filter(
    category_code_description %in% residential_categories,
    year(sale_date) %in% c(2023, 2024),
    mailing_city_state == "PHILADELPHIA PA",
    sale_price > 10
  )

table(residential_data$category_code_description)

# Make sure the file saved to the repo is the trimmed data (to stay below GitHub data limits)
st_write(residential_data, "./data/OPA_data.geojson", driver = "GeoJSON",
         delete_dsn = TRUE, quiet = TRUE)
file.exists("./data/OPA_data.geojson")

OPA_raw <- st_read("./data/OPA_data.geojson", quiet = TRUE) %>%
  st_transform(2272)

# OPA_data -> cutting mostly-NA or irrelevant columns for this model
OPA_raw <- OPA_raw %>%
  select(-c(
    cross_reference, date_exterior_condition, exempt_land, fireplaces, fuel,
    garage_type, house_extension, mailing_address_2, mailing_care_of,
    market_value_date, number_of_rooms, other_building, owner_2,
    separate_utilities, sewer, site_type, street_direction, suffix,
    unfinished, unit, utility
  ))

names(OPA_raw)
The property sales data were gathered from the City of Philadelphia's public OPA Properties dataset, which contained 32 columns and 583,825 observations. This file was too large for our shared GitHub workspace, so it was reduced by filtering for residential properties, sale years 2023 and 2024, mailing addresses within Philadelphia, and sale prices over $10 (some recorded prices were NA, $1, or $10). This retained the most basic and general data while staying within GitHub size limits, reducing the dataset to 22,121 observations. The original GeoJSON file was overwritten and named OPA_data.
Properties selected as residential included apartments with more than 4 units, single family, multi-family, and mixed use. Mixed use was retained because these properties still contain residential units to account for, and they add more complex property types to the dataset when comparing sale price and other attributes, such as total area, across observations. These properties should also be cross-referenced with zoning codes in future research.
We retained mixed use during this process to give us the most general representation of the data. Data cleaning was otherwise limited to omitting columns that were mostly NA. This gave our model the most general dataset to work with, despite the effect on future RMSE values. Future research would be needed to weigh the trade-off between a more generalized Philadelphia housing market with less cleaning and a heavily cleaned, more specific market that may omit segments of the housing market, such as lower-income areas or mixed-use residential properties. NA values and sparse categories could also be grouped, and more complexity could be accounted for in future work.
This reflects our start-simple-and-add-complexity approach. Our OPA dataset went from 583,825 to 22,121 observations and from 32 to 68 variables between its original and final forms.
1.2 Exploratory Data Analysis
Below are selected property variables—Total Livable Area, Bedrooms, Bathrooms, and Age—in relation to Sale Price. Properties with excessive square footage (>10,000 sqft), missing bedroom or bathroom data, over 12 bathrooms with low sale prices, or implausible construction years were removed to reduce skew and data errors. This additional filtering was kept for the rest of the analysis in this report.
Code
# Filter out outliers from the dataset
OPA_data <- OPA_raw %>%
  filter(
    total_livable_area <= 10000,
    year_built > 1800,
    !is.na(number_of_bathrooms),
    !is.na(number_of_bedrooms),
    number_of_bathrooms < 12
  ) %>%
  mutate(
    year_built   = as.numeric(year_built),
    building_age = 2025 - year_built
  )

p1 <- ggplot(OPA_data, aes(x = total_livable_area, y = sale_price)) +
  geom_point(alpha = 0.3, size = 0.8) +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  scale_y_continuous(labels = dollar_format()) +
  scale_x_continuous(labels = comma_format()) +
  labs(title = "Sale Price vs. Total Livable Area",
       x = "Total Livable Area (sq ft)", y = "Sale Price") +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"))

p2 <- ggplot(OPA_data, aes(x = factor(number_of_bedrooms), y = sale_price)) +
  geom_boxplot(fill = "gray", alpha = 0.6, outlier.alpha = 0.3, outlier.size = 0.5) +
  stat_summary(fun = mean, geom = "point", color = "red", size = 2, shape = 18) +
  scale_y_continuous(labels = dollar_format()) +
  labs(title = "Sale Price vs. Number of Bedrooms",
       x = "Number of Bedrooms", y = "Sale Price") +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"))

p3 <- ggplot(OPA_data, aes(x = factor(number_of_bathrooms), y = sale_price)) +
  geom_boxplot(fill = "gray", alpha = 0.6, outlier.alpha = 0.3, outlier.size = 0.5) +
  stat_summary(fun = mean, geom = "point", color = "red", size = 2, shape = 18) +
  scale_y_continuous(labels = dollar_format()) +
  labs(title = "Sale Price vs. Number of Bathrooms",
       x = "Number of Bathrooms", y = "Sale Price") +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"))

p4 <- ggplot(OPA_data, aes(x = building_age, y = sale_price)) +
  geom_point(alpha = 0.3, size = 0.8) +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  scale_y_continuous(labels = dollar_format()) +
  labs(title = "Sale Price vs. Age", x = "Age", y = "Sale Price") +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"))

# Combine plots
(p1 | p2) / (p3 | p4)
1.3 Feature Engineering
Code
OPA_data <- OPA_data %>%
  mutate(
    # convert to numeric before building interactions
    total_livable_area = as.numeric(total_livable_area),
    census_tract       = as.numeric(as.character(census_tract)),
    year_built         = as.numeric(year_built),
    total_area         = as.numeric(total_area),
    market_value       = as.numeric(market_value),
    number_of_bedrooms = as.numeric(number_of_bedrooms),
    # building code and total area
    int_type_tarea  = as.numeric(as.factor(building_code_description)) * total_area,
    # market value and livable area
    int_value_larea = market_value * total_livable_area,
    # market value and total area
    int_value_tarea = market_value * total_area,
    # livable area and exterior condition
    int_larea_econd = total_livable_area * as.numeric(as.factor(exterior_condition)),
    # livable area and interior condition
    int_larea_icond = total_livable_area * as.numeric(as.factor(interior_condition)),
    # livable area and bedrooms
    int_larea_beds  = total_livable_area * number_of_bedrooms
  )
Code
pa_crs <- 2272
neighbor_points <- st_transform(OPA_data, pa_crs)
nrow(nhoods)
st_crs(neighbor_points)
nhoods <- st_transform(nhoods, 2272)

# Join houses to neighborhoods
neighbor_points <- neighbor_points %>%
  st_join(., nhoods, join = st_intersects)

# One property doesn't lie in any neighborhood
neighbor_points <- neighbor_points[!is.na(neighbor_points$NAME), ]

# Results
neighbor_points %>%
  st_drop_geometry() %>%
  count(NAME) %>%
  arrange(desc(n))
# Define wealthy as >= $420,000, which is 1.5x the city median of $279,900
nhoods_prices <- nhoods_prices %>%
  mutate(
    wealthy_neighborhood = ifelse(median_price >= 420000, "Wealthy", "Not Wealthy"),
    wealthy_neighborhood = as.factor(wealthy_neighborhood)
  )

nhoods_prices %>%
  st_drop_geometry() %>%
  count(wealthy_neighborhood)
wealthy_neighborhood n
1 Not Wealthy 123
2 Wealthy 26
3 <NA> 10
Code
neighbor_points <- neighbor_points %>%
  left_join(., nhoods_prices %>% st_drop_geometry() %>% select(NAME, wealthy_neighborhood),
            by = "NAME")
# Still need to add neighbor points to the OPA data
Neighborhoods were denoted as wealthy if their median sale price was at least $420,000, which is 1.5x the city median of $279,900. This indicator is used in an interaction in Model 4 to account for theoretical differences in wealthy neighborhoods, such as larger price premiums for additional home amenities like bedrooms, bathrooms, or livable floor area.
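To make the interaction idea concrete, here is a minimal self-contained sketch on simulated data (the variable names mirror the report's dataset, but the data and coefficients below are invented for illustration). In an `lm()` formula, `wealthy_neighborhood * number_of_bathrooms` expands to both main effects plus their interaction:

```r
# Hedged sketch on simulated data: a bathroom premium that is larger
# in wealthy neighborhoods, recovered via an interaction term.
set.seed(1)
n <- 200
toy <- data.frame(
  total_livable_area   = runif(n, 800, 3000),
  number_of_bathrooms  = sample(1:3, n, replace = TRUE),
  wealthy_neighborhood = factor(sample(c("Wealthy", "Not Wealthy"), n, replace = TRUE))
)
# Simulate a larger bathroom premium in wealthy neighborhoods
toy$sale_price <- 50000 + 150 * toy$total_livable_area +
  25000 * toy$number_of_bathrooms +
  20000 * (toy$wealthy_neighborhood == "Wealthy") * toy$number_of_bathrooms +
  rnorm(n, sd = 20000)

# `*` expands to main effects plus the interaction term
fit <- lm(sale_price ~ total_livable_area +
            wealthy_neighborhood * number_of_bathrooms, data = toy)
coef(fit)
```

The coefficient on `wealthy_neighborhoodWealthy:number_of_bathrooms` captures the extra per-bathroom premium in wealthy neighborhoods, which is the quantity Model 4 is designed to estimate.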
2 Census Data
2.1 Data Preparation
Code
# Link variables and aliases
vars <- c(
  "pop_tot"    = "B01001_001",
  "med_hh_inc" = "B19013_001",
  "med_age"    = "B01002_001"
)

# The FIPS code for the state of PA is 42
fips_pa <- 42
Variables pulled from the census include total population, median household income, and median age.
Code
# Retrieve data from the ACS for 2023
demo_vars_pa <- get_acs(
  geography    = "tract",
  variables    = vars,
  year         = 2023,
  state        = fips_pa,
  output       = "wide",
  geometry     = TRUE,
  progress_bar = FALSE
) %>%
  st_transform(2272)

# Separate the NAME column into its constituent parts
demo_vars_pa <- demo_vars_pa %>%
  separate(NAME, into = c("TRACT", "COUNTY", "STATE"), sep = "; ", remove = TRUE) %>%
  mutate(
    TRACT  = parse_number(TRACT),
    COUNTY = sub(x = COUNTY, " County", "")
  )

# Keep only Philadelphia tracts
demo_vars_phl <- demo_vars_pa %>%
  filter(COUNTY == "Philadelphia")
# Get NA counts per column
na_counts <- sapply(demo_vars_phl, function(x) sum(is.na(x)))
kable(t(as.data.frame(na_counts)))
| | GEOID | TRACT | COUNTY | STATE | pop_totE | pop_totM | med_hh_incE | med_hh_incM | med_ageE | med_ageM | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|
| na_counts | 0 | 0 | 0 | 0 | 0 | 0 | 27 | 27 | 17 | 17 | 0 |
Code
# Isolate rows that have at least one column with an NA value
na_index <- !complete.cases(demo_vars_phl %>% st_drop_geometry())
demo_vars_phl_na <- demo_vars_phl[na_index, ]

kable(
  head(demo_vars_phl_na, 5) %>% select(-ends_with("M")) %>% st_drop_geometry(),
  col.names = c("GeoID", "Tract", "County", "State", "Population",
                "Median HH Inc ($)", "Median Age (yrs)"),
  row.names = FALSE
)
| GeoID | Tract | County | State | Population | Median HH Inc ($) | Median Age (yrs) |
|---|---|---|---|---|---|---|
| 42101016500 | 165.00 | Philadelphia | Pennsylvania | 2165 | NA | 44.1 |
| 42101980902 | 9809.02 | Philadelphia | Pennsylvania | 0 | NA | NA |
| 42101980800 | 9808.00 | Philadelphia | Pennsylvania | 0 | NA | NA |
| 42101028500 | 285.00 | Philadelphia | Pennsylvania | 2625 | NA | 36.1 |
| 42101036901 | 369.01 | Philadelphia | Pennsylvania | 49 | NA | 42.1 |
Median household income is NA for 27 census tracts, and median age is NA for 17. For the 17 census tracts with no reported population, the median household income and median age will be set to 0. The remaining 10 census tracts, which have population but no reported median household income, will be omitted from the dataset.
Code
# Create a dataset with NAs replaced with zero where applicable
demo_vars_phl_rep <- demo_vars_phl %>%
  mutate(
    med_hh_incE = case_when(pop_totE == 0 & is.na(med_hh_incE) ~ 0,
                            .default = med_hh_incE),
    med_ageE    = case_when(pop_totE == 0 & is.na(med_ageE) ~ 0,
                            .default = med_ageE)
  )

# Final cleaned dataset without the 10 census tracts that have population values
# but NA values for Median Household Income
demo_vars_phl_clean <- demo_vars_phl_rep[
  complete.cases(demo_vars_phl_rep %>% select(-ends_with("M")) %>% st_drop_geometry()), ]

# Table with the omitted census tracts
demo_vars_phl_omit <- demo_vars_phl_rep[
  !complete.cases(demo_vars_phl_rep %>% select(-ends_with("M")) %>% st_drop_geometry()), ]

kable(
  demo_vars_phl_omit %>% select(-ends_with("M")) %>% st_drop_geometry(),
  col.names = c("GeoID", "Tract", "County", "State", "Population",
                "Median HH Inc ($)", "Median Age (yrs)"),
  row.names = FALSE,
  caption = "Census Tracts Omitted from Analysis due to Data Unavailability"
)
Census Tracts Omitted from Analysis due to Data Unavailability

| GeoID | Tract | County | State | Population | Median HH Inc ($) | Median Age (yrs) |
|---|---|---|---|---|---|---|
| 42101016500 | 165.00 | Philadelphia | Pennsylvania | 2165 | NA | 44.1 |
| 42101028500 | 285.00 | Philadelphia | Pennsylvania | 2625 | NA | 36.1 |
| 42101036901 | 369.01 | Philadelphia | Pennsylvania | 49 | NA | 42.1 |
| 42101014800 | 148.00 | Philadelphia | Pennsylvania | 892 | NA | 40.9 |
| 42101027700 | 277.00 | Philadelphia | Pennsylvania | 5489 | NA | 36.9 |
| 42101030100 | 301.00 | Philadelphia | Pennsylvania | 6446 | NA | 37.2 |
| 42101989100 | 9891.00 | Philadelphia | Pennsylvania | 1240 | NA | 29.7 |
| 42101989300 | 9893.00 | Philadelphia | Pennsylvania | 160 | NA | 30.5 |
| 42101020500 | 205.00 | Philadelphia | Pennsylvania | 3383 | NA | 33.3 |
| 42101980200 | 9802.00 | Philadelphia | Pennsylvania | 396 | NA | 74.9 |
Code
# Join census variables to the OPA data
OPA_data <- st_join(OPA_data,
                    demo_vars_phl_clean %>% select(pop_totE, med_hh_incE, med_ageE))

# Isolate NA rows and plot where they are
censusNAs <- OPA_data[is.na(OPA_data$med_hh_incE), ]

census_plt1 <- ggplot() +
  geom_sf(data = demo_vars_phl_clean$geometry) +
  geom_sf(data = censusNAs, size = 0.15) +
  theme_void() +
  labs(title = "Properties without Census Data")

census_plt2 <- ggplot() +
  geom_sf(data = demo_vars_phl_clean$geometry, fill = "black", color = "transparent") +
  theme_void() +
  labs(title = "Census Tracts with Data (Black)")

(census_plt1 | census_plt2)
Of the 22,121 properties remaining after cleaning and outlier removal, 248 (approximately 1.1% of the dataset) have no associated census data because their census tracts lack a Median Household Income value. Comparing the plot of property locations without census data against the plot of census tracts with data confirms this spatial relationship.
3 Spatial Features
3.1 Data Preparation
This stage prepares and validates the OpenDataPhilly spatial datasets used to engineer neighborhood-level variables for the housing model.
Steps Performed
Transformed all spatial datasets to EPSG 2272 (NAD83 / PA South ftUS) for consistent distance measurements.
Removed invalid geometries, dropped Z/M values, and converted all housing geometries to points.
Imported and projected OpenDataPhilly amenities:
Transit Stops
Hospitals
Parks & Recreation Sites
Schools Parcels (centroids created from polygon features)
Crime Incidents (2023 and 2024 combined)
Code
# CRS & radii
pa_crs <- 2272  # NAD83 / PA South (ftUS)
mi_to_ft <- 5280
r_park_ft   <- 0.25 * mi_to_ft
r_crime_ft  <- 0.50 * mi_to_ft
r_school_ft <- 0.50 * mi_to_ft

# Turn off spherical geometry (makes buffer/join ops faster)
sf::sf_use_s2(FALSE)

## CONVERT SALES DATA TO POINTS ##
OPA_points <- st_transform(OPA_data, pa_crs)

# Drop Z/M if present
st_geometry(OPA_points) <- st_zm(st_geometry(OPA_points), drop = TRUE, what = "ZM")

# Make geometries valid
st_geometry(OPA_points) <- st_make_valid(st_geometry(OPA_points))

# Ensure POINT geometry (works for points/lines/polygons/collections)
st_geometry(OPA_points) <- st_point_on_surface(st_geometry(OPA_points))

# Add sale ID
OPA_points <- OPA_points %>% mutate(sale_id = dplyr::row_number())

# Read & project layers
transit <- st_read("./data/Transit_Stops_(Spring_2025)/Transit_Stops_(Spring_2025).shp",
                   quiet = TRUE) |> st_transform(pa_crs)
hospitals <- st_read("./data/Hospitals.geojson", quiet = TRUE) |> st_transform(pa_crs)
parksrec  <- st_read("./data/PPR_Program_Sites.geojson", quiet = TRUE) |> st_transform(pa_crs)
schools_polygons <- st_read("./data/Schools_Parcels.geojson", quiet = TRUE) |>
  st_transform(pa_crs)
crime_2023 <- st_read("./data/crime_incidents_2023/incidents_part1_part2.shp",
                      quiet = TRUE) |> st_transform(pa_crs)
crime_2024 <- st_read("./data/crime_incidents_2024/incidents_part1_part2.shp",
                      quiet = TRUE) |> st_transform(pa_crs)

# Combine 2023 & 2024 crime datasets
crime <- rbind(crime_2023, crime_2024)

# Create centroids for the schools dataset
schools <- if (any(st_geometry_type(schools_polygons) %in% c("POLYGON", "MULTIPOLYGON"))) {
  st_centroid(schools_polygons)
} else {
  schools_polygons
}

# Crop transit data to Philadelphia
philly_boundary <- st_union(nhoods)
transit <- st_filter(transit, philly_boundary, .predicate = st_within)
3.2 Exploratory Data Analysis
Exploratory plots and maps examine the raw accessibility patterns across Philadelphia before feature engineering.
Code
# Transit stops (raw)
ggplot() +
  geom_sf(data = nhoods, fill = NA, color = "grey70", linewidth = 0.3) +
  geom_sf(data = transit, size = 0.3, alpha = 0.6) +
  labs(title = "Raw Layer Check: Transit Stops (SEPTA Spring 2025)") +
  theme_void()
# Parks & Recreation Program Sites (raw)
ggplot() +
  geom_sf(data = nhoods, fill = NA, color = "grey70", linewidth = 0.3) +
  geom_sf(data = parksrec, size = 0.4, alpha = 0.6) +
  labs(title = "Raw Layer Check: Parks & Recreation Sites") +
  theme_void()
Code
# Schools (centroids of polygons) - raw
ggplot() +
  geom_sf(data = nhoods, fill = NA, color = "grey70", linewidth = 0.3) +
  geom_sf(data = schools, size = 0.4, alpha = 0.7) +
  labs(title = "Raw Layer Check: Schools (Centroids)") +
  theme_void()
Code
# Crime points are numerous; sample for speed
set.seed(5080)
crime_quick <- if (nrow(crime) > 30000) dplyr::slice_sample(crime, n = 30000) else crime

ggplot() +
  geom_sf(data = nhoods, fill = NA, color = "grey70", linewidth = 0.3) +
  geom_sf(data = crime_quick, size = 0.1, alpha = 0.25) +
  labs(title = "Raw Layer Check: Crime Incidents (sampled if large)") +
  theme_void()
3.2.1 Interpretation
Transit Stops: Dense corridors radiate from Center City, showing strong transit coverage.
Hospitals: Sparse but geographically balanced.
Parks & Recreation: Unevenly distributed across neighborhoods.
Schools: Evenly distributed across most neighborhoods.
Crime: Visibly concentrated, confirming the need for log-transformed counts.
3.3 Feature Engineering
Spatial features were derived using two complementary approaches: k-Nearest Neighbor (kNN) and buffer-based counts, depending on whether accessibility was best captured as proximity or exposure.
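The kNN distance features (`dist_nearest_transit_ft` and `dist_nearest_hospital_ft`, used in the validation plots below) are not reproduced in the code block that follows, which begins at feature 3. A minimal sketch of how they could be computed with sf, assuming the projected `OPA_points`, `transit`, and `hospitals` layers from Section 3.1 (all in EPSG:2272, so distances are in feet), is:

```r
# Hedged sketch: nearest-feature (k = 1) distances with sf.
# st_nearest_feature() returns, for each property, the index of the
# closest feature in the amenity layer; st_distance(..., by_element = TRUE)
# then computes the pairwise distances.
library(sf)

nearest_transit  <- st_nearest_feature(OPA_points, transit)
nearest_hospital <- st_nearest_feature(OPA_points, hospitals)

OPA_points$dist_nearest_transit_ft <- as.numeric(
  st_distance(OPA_points, transit[nearest_transit, ], by_element = TRUE))
OPA_points$dist_nearest_hospital_ft <- as.numeric(
  st_distance(OPA_points, hospitals[nearest_hospital, ], by_element = TRUE))
```

Because the layers are already projected to a feet-based CRS, no unit conversion is needed.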
# Feature 3 - parks/rec sites within 0.25 mi (count)
rel_parks <- sf::st_is_within_distance(OPA_points, parksrec, dist = r_park_ft)
OPA_points <- OPA_points %>% mutate(parks_cnt_0p25mi = lengths(rel_parks))

# Feature 4 - crime count within 0.5 mi
crime_buffer <- st_buffer(OPA_points, dist = r_crime_ft)
rel_crime <- st_intersects(crime_buffer, crime, sparse = TRUE)

# Count number of crimes
crime_cnt <- lengths(rel_crime)
rm(rel_crime)

OPA_points <- OPA_points |>
  mutate(
    crime_cnt_0p5mi     = crime_cnt,
    log1p_crime_cnt_0p5 = log1p(crime_cnt_0p5mi)
  )

# Feature 5 - schools within 0.5 mi (using centroids)
rel_sch <- sf::st_is_within_distance(OPA_points, schools, dist = r_school_ft)
OPA_points <- OPA_points %>% mutate(schools_cnt_0p5mi = lengths(rel_sch))
3.3.1 Summary Table and Justification
| Feature | Method | Parameter | Theoretical Rationale |
|---|---|---|---|
| Distance to Nearest Transit Stop | kNN (k = 1) | – | Captures ease of access to public transport; nearest stop approximates walkability and job access. |
| Distance to Nearest Hospital | kNN (k = 1) | – | Reflects accessibility to health care and emergency services; proximity adds perceived security for households. |
| Parks & Rec Sites within 0.25 mi | Buffer Count | r = 0.25 mi | Measures exposure to green space and recreational facilities within a 5-minute walk; positive amenity effect on property value. |
| Crime Incidents within 0.5 mi | Buffer Count | r = 0.5 mi | Represents local safety environment; higher crime counts reduce housing desirability. |
| Schools within 0.5 mi | Buffer Count | r = 0.5 mi | Reflects educational access and family appeal; clustering of schools often raises residential demand. |
| Population | Census | – | Represents the present residential demand within a census tract. |
| Median Household Income | Census | – | Indicative of the ability of present residents of a census tract to afford housing. |
| Median Age | Census | – | Measure of the dominant age group in a census tract (i.e. high student or elderly population). |
3.3.2 Feature Validation and Visualization
Code
## Transit Accessibility
ggplot(OPA_points, aes(x = dist_nearest_transit_ft)) +
  geom_histogram(fill = "steelblue", color = "white", bins = 30) +
  labs(title = "Distribution: Distance to Nearest Transit Stop",
       x = "Feet to Nearest Stop", y = "Count") +
  theme_minimal()
Code
ggplot(OPA_points) +
  geom_sf(data = nhoods, fill = NA, color = "grey70", linewidth = 0.3) +
  geom_sf(aes(color = dist_nearest_transit_ft), size = 0.5) +
  scale_color_viridis_c(option = "plasma", labels = comma) +
  labs(title = "Transit Accessibility Across Sales Parcels",
       color = "Distance (ft)") +
  theme_void()
Code
## Hospital Proximity
ggplot(OPA_points, aes(x = dist_nearest_hospital_ft)) +
  geom_histogram(fill = "darkorange", color = "white", bins = 30) +
  labs(title = "Distribution: Distance to Nearest Hospital",
       x = "Feet to Nearest Hospital", y = "Count") +
  theme_minimal()
Code
ggplot(OPA_points) +
  geom_sf(data = nhoods, fill = NA, color = "grey70", linewidth = 0.3) +
  geom_sf(aes(color = dist_nearest_hospital_ft), size = 0.5) +
  scale_color_viridis_c(option = "magma", labels = comma) +
  labs(title = "Hospital Accessibility Across Sales Parcels",
       color = "Distance (ft)") +
  theme_void()
Code
## Parks & Recreation
ggplot(OPA_points, aes(x = parks_cnt_0p25mi)) +
  geom_histogram(fill = "seagreen", color = "white", bins = 20) +
  labs(title = "Distribution: Parks & Rec Sites Within 0.25 mi",
       x = "Count within 0.25 mi", y = "Number of Parcels") +
  theme_minimal()
Code
ggplot(OPA_points) +
  geom_sf(data = nhoods, fill = NA, color = "grey70", linewidth = 0.3) +
  geom_sf(aes(color = parks_cnt_0p25mi), size = 0.6) +
  scale_color_viridis_c(option = "viridis") +
  labs(title = "Proximity to Parks & Recreation (0.25 mi Buffer)",
       color = "Parks Count") +
  theme_void()
Code
## Crime Counts
ggplot(OPA_points, aes(x = crime_cnt_0p5mi)) +
  geom_histogram(fill = "firebrick", color = "white", bins = 30) +
  labs(title = "Distribution: Crime Incidents Within 0.5 mi",
       x = "Crime Count (2023-2024)", y = "Number of Parcels") +
  theme_minimal()
Code
ggplot(OPA_points) +
  geom_sf(data = nhoods, fill = NA, color = "grey70", linewidth = 0.3) +
  geom_sf(aes(color = log1p_crime_cnt_0p5), size = 0.6) +
  scale_color_viridis_c(option = "inferno") +
  labs(title = "Crime Exposure (log-transformed within 0.5 mi)",
       color = "log(1+Crime Count)") +
  theme_void()
Code
## Schools Accessibility
ggplot(OPA_points, aes(x = schools_cnt_0p5mi)) +
  geom_histogram(fill = "purple", color = "white", bins = 20) +
  labs(title = "Distribution: Schools Within 0.5 mi",
       x = "School Count (0.5 mi Buffer)", y = "Number of Parcels") +
  theme_minimal()
Code
ggplot(OPA_points) +
  geom_sf(data = nhoods, fill = NA, color = "grey70", linewidth = 0.3) +
  geom_sf(aes(color = schools_cnt_0p5mi), size = 0.6) +
  scale_color_viridis_c(option = "cividis") +
  labs(title = "School Accessibility (0.5 mi Buffer)",
       color = "Schools Count") +
  theme_void()
Interpretation:
Transit proximity: Most parcels are within 500 ft of a stop, confirming strong transit coverage across Philadelphia.
Hospital proximity: Right-skewed distribution, consistent with limited facility count.
Parks access: Sparse exposure (mostly 0–1 within 0.25 mi), highlighting recreational inequities.
Crime exposure: Wide variation, clustered along high-density corridors; log-transformed to stabilize scale.
School proximity: Uniform urban coverage with typical parcels having 4-7 schools within 0.5 mi.
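The log1p transform referenced above can be illustrated in one line of base R: it compresses the heavy right tail of the crime counts while remaining defined at zero, which a plain log transform is not.

```r
# log1p(x) = log(1 + x): defined at zero counts, compresses large counts
log1p(0)                  # 0
log1p(c(10, 100, 1000))   # increasingly compressed spacing
```

This is why the crime feature enters the maps and models as `log1p_crime_cnt_0p5` rather than the raw count.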
# Function to check for statistically significant correlations between independent variables
sig_corr <- function(dat, dep_var) {
  # Remove the dependent variable from the dataset
  dat_corr <- dat %>% select(-all_of(dep_var))

  # Run a correlation matrix for the independent vars
  correlation_matrix <- rcorr(as.matrix(dat_corr))
  values <- correlation_matrix$r
  vifs <- apply(values, 1, function(x) round(1 / (1 - abs(x)), 2))
  values_df <- values %>% as.data.frame()
  vifs_df   <- vifs %>% as.data.frame()

  # Convert correlation coefficients and p-values to long format
  corCoeff_df <- correlation_matrix$r %>% as.data.frame() %>% mutate(var1 = rownames(.))
  corVIF_df   <- vifs %>% as.data.frame() %>% mutate(var1 = rownames(.))
  corPval_df  <- correlation_matrix$P %>% as.data.frame() %>% mutate(var1 = rownames(.))

  # Merge long-format data
  corMerge <- list(
    corCoeff_df %>% pivot_longer(-var1, names_to = "var2", values_to = "correlation") %>%
      as.data.frame(),
    corVIF_df %>% pivot_longer(-var1, names_to = "var2", values_to = "vif_factor") %>%
      as.data.frame(),
    corPval_df %>% pivot_longer(-var1, names_to = "var2", values_to = "p_value") %>%
      as.data.frame()
  ) %>%
    reduce(left_join, by = c("var1", "var2"))

  # Filter to isolate unique pairs, then rows with VIF > 3 and p < 0.05
  corUnfiltered <- corMerge %>%
    filter(var1 != var2) %>%
    rowwise() %>%
    filter(var1 < var2) %>%
    ungroup() %>%
    as.data.frame()

  corFiltered <- corUnfiltered %>%
    filter(abs(vif_factor) > 3 & p_value < 0.05) %>%
    arrange(desc(abs(correlation)))

  # Save the raw correlation values and the filtered variable pairs
  final <- set_names(list(values_df, vifs_df, corUnfiltered, corFiltered),
                     c("R2", "VIF", "AllCor", "SelCor"))
  return(final)
}

# Create a dataset with just the modeling variables
OPA_modelvars <- OPA_points %>%
  select(sale_price, total_livable_area, building_age, number_of_bedrooms,
         number_of_bathrooms, pop_totE, med_hh_incE, med_ageE,
         dist_nearest_transit_ft, dist_nearest_hospital_ft,
         parks_cnt_0p25mi, log1p_crime_cnt_0p5, schools_cnt_0p5mi)

# Calculate VIFs and determine potentially troublesome correlations between variables
vif_check <- sig_corr(OPA_modelvars %>% st_drop_geometry(), dep_var = "sale_price")
kable(vif_check[["VIF"]])
| | total_livable_area | building_age | number_of_bedrooms | number_of_bathrooms | pop_totE | med_hh_incE | med_ageE | dist_nearest_transit_ft | dist_nearest_hospital_ft | parks_cnt_0p25mi | log1p_crime_cnt_0p5 | schools_cnt_0p5mi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| total_livable_area | Inf | 1.22 | 1.70 | 1.99 | 1.14 | 1.24 | 1.07 | 1.07 | 1.01 | 1.03 | 1.22 | 1.03 |
| building_age | 1.22 | Inf | 1.01 | 1.32 | 1.04 | 1.23 | 1.18 | 1.21 | 1.12 | 1.05 | 1.43 | 1.24 |
| number_of_bedrooms | 1.70 | 1.01 | Inf | 2.26 | 1.02 | 1.09 | 1.05 | 1.05 | 1.03 | 1.01 | 1.07 | 1.02 |
| number_of_bathrooms | 1.99 | 1.32 | 2.26 | Inf | 1.12 | 1.25 | 1.03 | 1.02 | 1.04 | 1.01 | 1.09 | 1.01 |
| pop_totE | 1.14 | 1.04 | 1.02 | 1.12 | Inf | 1.33 | 1.16 | 1.09 | 1.18 | 1.00 | 1.01 | 1.08 |
| med_hh_incE | 1.24 | 1.23 | 1.09 | 1.25 | 1.33 | Inf | 1.13 | 1.01 | 1.13 | 1.04 | 1.31 | 1.03 |
| med_ageE | 1.07 | 1.18 | 1.05 | 1.03 | 1.16 | 1.13 | Inf | 1.15 | 1.27 | 1.15 | 1.70 | 1.25 |
| dist_nearest_transit_ft | 1.07 | 1.21 | 1.05 | 1.02 | 1.09 | 1.01 | 1.15 | Inf | 1.25 | 1.11 | 1.72 | 1.44 |
| dist_nearest_hospital_ft | 1.01 | 1.12 | 1.03 | 1.04 | 1.18 | 1.13 | 1.27 | 1.25 | Inf | 1.10 | 1.85 | 1.66 |
| parks_cnt_0p25mi | 1.03 | 1.05 | 1.01 | 1.01 | 1.00 | 1.04 | 1.15 | 1.11 | 1.10 | Inf | 1.25 | 1.23 |
| log1p_crime_cnt_0p5 | 1.22 | 1.43 | 1.07 | 1.09 | 1.01 | 1.31 | 1.70 | 1.72 | 1.85 | 1.25 | Inf | 2.46 |
| schools_cnt_0p5mi | 1.03 | 1.24 | 1.02 | 1.01 | 1.08 | 1.03 | 1.25 | 1.44 | 1.66 | 1.23 | 2.46 | Inf |
None of the variable pairs tested has a VIF above 3, indicating little concern about multicollinearity in the models moving forward.
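For reference, the screen used above converts a pairwise correlation coefficient r into 1 / (1 − |r|), while the textbook variance inflation factor is 1 / (1 − R²). Both can be sketched in base R (hypothetical helper names; they are not part of the analysis code above). Since 1 − |r| ≤ 1 − r² for |r| ≤ 1, the |r|-based screen is the stricter of the two:

```r
# Pairwise screen used above vs. the textbook VIF definition
pairwise_screen <- function(r)  1 / (1 - abs(r))  # as computed in sig_corr()
textbook_vif    <- function(r2) 1 / (1 - r2)      # standard 1 / (1 - R^2)

pairwise_screen(0.5)   # 2
textbook_vif(0.5^2)    # ~1.33: the |r|-based screen flags collinearity sooner
```

Under either definition, values above roughly 3 would warrant attention, and none appear in the matrix above.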
4 Model Building
4.1 Model 1: Structural Terms
Our first model uses structural property characteristics to build a multiple linear regression, regressing sale price on total livable area, number of bedrooms, number of bathrooms, and building age.
Call:
lm(formula = sale_price ~ total_livable_area + number_of_bedrooms +
number_of_bathrooms + building_age, data = model1_data)
Residuals:
Min 1Q Median 3Q Max
-1063434 -92640 -20598 58880 7759292
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -11217.250 12625.853 -0.888 0.374
total_livable_area 221.516 5.126 43.217 <0.0000000000000002 ***
number_of_bedrooms -34668.316 3239.669 -10.701 <0.0000000000000002 ***
number_of_bathrooms 70054.014 3967.307 17.658 <0.0000000000000002 ***
building_age 144.447 104.814 1.378 0.168
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 233700 on 10324 degrees of freedom
Multiple R-squared: 0.2762, Adjusted R-squared: 0.2759
F-statistic: 984.7 on 4 and 10324 DF, p-value: < 0.00000000000000022
4.2 Model 2: Incorporation of Census Data
Our second model builds on the structural property characteristics regression by incorporating census tract–level variables, including population, median household income, and median age.
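The Model 2 fitting code is not reproduced in this chunk; a minimal sketch consistent with the variables named above and the joined dataset from Section 2 (this is an assumed specification, not the verbatim code) would be:

```r
# Hedged sketch: Model 2 adds census tract-level variables
# (population, median household income, median age) to the
# structural regression from Model 1.
model2 <- lm(
  sale_price ~ total_livable_area + number_of_bedrooms + number_of_bathrooms +
    building_age + pop_totE + med_hh_incE + med_ageE,
  data = st_drop_geometry(OPA_points)
)
summary(model2)
```

The cross-validation results for this specification appear in Section 4.5.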
4.4 Model 4: Incorporation of Interactions and Fixed Effects
Code
# Join data separately here to avoid conflicts with earlier code blocks
OPA_points_copy <- left_join(
  OPA_points,
  neighbor_points %>% select(parcel_number, wealthy_neighborhood) %>% st_drop_geometry(),
  by = "parcel_number"
)
# Only the wealthy-neighborhood premiums on total area, total livable area, and
# number of bathrooms are significant. There is also a significant coefficient on
# int_value_larea, from interacting the OPA data with itself, which captures
# market value and size scalability.
4.5 Comparison of Model Performance
We can evaluate performance by conducting a 10-fold cross-validation of the 4 models, and comparing their RMSE, MAE, and \(R^2\).
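The `train_control` object used by the `train()` calls below is not shown in this chunk; it is assumed to be a standard caret 10-fold setup with held-out predictions saved (which the predicted-vs-actual plots later require), for example:

```r
# Hedged sketch: caret 10-fold CV control, saving the final
# held-out predictions so model*_cv$pred can be plotted
library(caret)
train_control <- trainControl(method = "cv", number = 10, savePredictions = "final")
```

With `savePredictions = "final"`, each model's `$pred` data frame contains the observed (`obs`) and predicted (`pred`) values from the held-out folds.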
# Model 1: Structural Features Only
model1_cv <- train(
  sale_price ~ total_livable_area + number_of_bedrooms + number_of_bathrooms + building_age,
  data = na.omit(OPA_points),
  method = "lm",
  trControl = train_control
)
model1_cv
Linear Regression
10329 samples
5 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 9295, 9297, 9296, 9297, 9296, 9296, ...
Resampling results:
RMSE Rsquared MAE
230140.4 0.2865262 115174.1
Tuning parameter 'intercept' was held constant at a value of TRUE
Linear Regression
10329 samples
8 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 9296, 9296, 9297, 9297, 9297, 9296, ...
Resampling results:
RMSE Rsquared MAE
215430.3 0.3799429 90217.97
Tuning parameter 'intercept' was held constant at a value of TRUE
Linear Regression
10329 samples
13 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 9295, 9297, 9295, 9295, 9296, 9298, ...
Resampling results:
RMSE Rsquared MAE
211021.2 0.4053308 90144.36
Tuning parameter 'intercept' was held constant at a value of TRUE
Linear Regression
10329 samples
20 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 9297, 9296, 9296, 9296, 9296, 9297, ...
Resampling results:
RMSE Rsquared MAE
194649.2 0.4958671 71748.91
Tuning parameter 'intercept' was held constant at a value of TRUE
# Plot predicted vs. actual values
ggplot(model1_cv$pred, aes(x = obs, y = pred)) +
  geom_point(alpha = 0.3) +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  scale_x_continuous(labels = dollar_format()) +
  scale_y_continuous(labels = dollar_format()) +
  labs(title = "Model 1 Cross-Validation: Predicted vs. Actual Sale Price",
       x = "Actual Sale Price", y = "Predicted Sale Price") +
  theme_minimal()
Code
ggplot(model2_cv$pred, aes(x = obs, y = pred)) +
  geom_point(alpha = 0.3) +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  scale_x_continuous(labels = dollar_format()) +
  scale_y_continuous(labels = dollar_format()) +
  labs(title = "Model 2 Cross-Validation: Predicted vs. Actual Sale Price",
       x = "Actual Sale Price", y = "Predicted Sale Price") +
  theme_minimal()
Code
ggplot(model3_cv$pred, aes(x = obs, y = pred)) +
  geom_point(alpha = 0.3) +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  scale_x_continuous(labels = dollar_format()) +
  scale_y_continuous(labels = dollar_format()) +
  labs(title = "Model 3 Cross-Validation: Predicted vs. Actual Sale Price",
       x = "Actual Sale Price", y = "Predicted Sale Price") +
  theme_minimal()
Code
ggplot(model4_cv$pred, aes(x = obs, y = pred)) +
  geom_point(alpha = 0.3) +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  scale_x_continuous(labels = dollar_format()) +
  scale_y_continuous(labels = dollar_format()) +
  labs(title = "Model 4 Cross-Validation: Predicted vs. Actual Sale Price",
       x = "Actual Sale Price", y = "Predicted Sale Price") +
  theme_minimal()
5 Model Diagnostics
Code
# Create diagnostic plots
plot(model4)
In the plot of residuals versus fitted values, the data largely display a random scatter except for a spike at low sale price values. This indicates that the model may predict poorly for low-priced properties, which do not follow an entirely linear pattern. However, there is no visible cone shape in the data, a promising sign that this discrepancy is not part of a trend consistent across the entire dataset. The discrepancy is suspected to follow a neighborhood-based spatial pattern, which is explored below. The Q-Q plot indicates that residual values are not perfectly normally distributed, further confirming that the dataset does not display a linear relationship around extreme values. However, an analysis of the residuals in a Cook's Distance plot indicates that despite this lack of normality in the dependent variable, none of the observations are extraordinarily influential. These violations were therefore deemed mild enough not to warrant intervention.
The results of this map confirm suspicions of local parameters affecting sale price, and should guide future work as explained in the following section.
6 Conclusion and Recommendations
Our final model had an RMSE of $195,093.20 and an MAE of $71,587.66. This was an improvement over previous models; Model 1 had an RMSE of $315,407.20 and an MAE of $146,134.04. Despite these improvements, these values are still extremely high relative to the current median home price in Philadelphia of approximately $232,000, according to Zillow. The strongest predictors in our model included whether the neighborhood a property is located in had a median sale price of at least $420,000 and could be considered wealthy (\(\beta\) = 47,309.230, p < 0.01), the number of bedrooms (\(\beta\) = 35,121.540, p < 0.01), and the number of bathrooms (\(\beta\) = 25,942.210, p < 0.01). Furthermore, the interaction term between wealthy-neighborhood status and the number of bathrooms also strongly influenced the model output (\(\beta\) = 22,639.990, p < 0.01), indicating that properties in wealthier neighborhoods gain a stronger sale price premium from additional bathrooms than properties in non-wealthy neighborhoods.
Sale prices in Philadelphia do not seem to entirely follow a linear relationship, particularly among lower-priced homes. This resulted in limited prediction potential for properties with lower sale prices. This could produce potential equity issues related to pricing homes of lesser value and those in lower-income neighborhoods that are impacted by surrounding properties. Many of the variables used were simply aggregated without being weighted, such as how the count of crime incidents within a half-mile does not take into account the severity of such events or how the quality of parks within a quarter-mile of a property is not considered. This could over- or under-emphasize these variables in the model.