Philadelphia Housing Model—Technical Appendix

Predicting Residential Property Prices

Author

Ming Cao, Mark Deng, Jun Luu, Josh Rigsby, Alex Stauffer, Tess Vu

Data Cleaning

Primary Philadelphia Sales Data (OpenDataPhilly)

Code

# Import relevant libraries.
library(car)
library(dplyr)
library(ggcorrplot)
library(ggplot2)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(scales)
library(sf)
library(stringr)
library(tibble)
library(tidycensus)
library(tidyr)
library(tidyverse)
library(tigris)
library(tmap)
library(units)

options(scipen = 999)

Code

# Property data.
properties_path <- "data/philly_properties.csv"
properties <- read.csv(properties_path)

# Capture dimensions.
og_property_dimension <- dim(properties)

# Set Census API key.
census_api_key("3aaee31789e10b674a531e9f236c35d5394b19ed")

Code

# All variables are character strings, remove white space then convert numeric character variables to numeric classes for chosen variables.

properties <- properties %>%
  mutate(across(where(is.character), str_trim),
         across(c(fireplaces, garage_spaces, number_of_bathrooms, number_stories,
                  sale_price, total_livable_area, year_built), as.numeric)) %>%
  rename(fireplace_num = fireplaces,
         garage_num = garage_spaces,
         bath_num = number_of_bathrooms,
         story_num = number_stories,
         square_feet = total_livable_area,
         basement_type = basements,
         ac_binary = central_air,
         fuel_type = fuel,
         )

Code

# Filter to residential properties and 2023-2024 sales.
# Note: Category code #1 is for residential.
residential_prop <- properties %>%
  filter(.,
         category_code == 1,
         startsWith(sale_date, "2023") | startsWith(sale_date, "2024"))

# Drop empty variables or variables not needed for model.
residential_prop <- residential_prop %>%
  select(c(basement_type, ac_binary, fireplace_num, fuel_type, garage_num,
           bath_num, story_num, sale_date, sale_price,
           square_feet, year_built, shape)
         )

# Make empty character column values NA.
residential_prop <- residential_prop %>%
  mutate(across(where(is.character), ~na_if(., "")))

Code

# Drop prices that are less than $10,000 as a catch-all (might not be as reflective for rural areas). Avoiding dropping prices based on percent of assessed value since property assessments can be biased against minoritized communities. Ideal drop would add deed type to drop any family or forced transfers.
residential_prop <- residential_prop %>%
  filter(sale_price > 10000,
         square_feet > 0)

Code

# Create building age column, change central air to binary, and adjust basement and fuel types.
# Create log value for the sale price.
residential_prop <- residential_prop %>%
  mutate(ln_sale_price = log(sale_price),
         age = 2025 - year_built,
         ln_square_feet = log(square_feet),
         ac_binary = case_when(
           ac_binary == "Y" ~ 1,
           ac_binary == "N" ~ 0),
         basement_type = case_when(
           basement_type == "0" ~ "None",
           basement_type == "A" ~ "Full Finished",
           basement_type == "B" ~ "Full Semi-Finished",
           basement_type == "C" ~ "Full Unfinished",
           basement_type == "D" ~ "Full Unknown Finish",
           basement_type == "E" ~ "Partial Finished",
           basement_type == "F" ~ "Partial Semi-Finished",
           basement_type == "G" ~ "Partial Unfinished",
           basement_type == "H" ~ "Partial Unknown Finish",
           basement_type == "I" ~ "Unknown Size Finished",
           basement_type == "J" ~ "Unknown Size Unfinished"),
         fuel_type = case_when(
           fuel_type == "A" ~ "Natural Gas",
           fuel_type == "B" ~ "Oil Heat",
           fuel_type == "C" ~ "Electric",
           fuel_type == "D" ~ "Coal",
           fuel_type == "E" ~ "Solar",
           fuel_type == "F" ~ "Wood",
           fuel_type == "G" ~ "Other",
           fuel_type == "H" ~ "None")
         )

Code

# Turn categorical data into factors so OLS regression doesn't handle the data as a list of strings.
residential_prop$basement_type <- as.factor(residential_prop$basement_type)
residential_prop$fuel_type <- as.factor(residential_prop$fuel_type)

# Change the reference categories for baseline comparison.
residential_prop$basement_type <- relevel(residential_prop$basement_type, ref = "None")
residential_prop$fuel_type <- relevel(residential_prop$fuel_type, ref = "Natural Gas")

# Place fuel type with 10 or less counts into other category.
residential_prop <- residential_prop %>%
  mutate(fuel_type = fct_lump_min(fuel_type, min = 11, other_level = "Other"))

Code

# Fixed effect temporal market fluctuations. Based on sale date, splitting the years into quarters (Q1, Q2, Q3, Q4). Potential fixed effect.
residential_prop <- residential_prop %>%
  mutate(
    quarters_fe = quarter(as_datetime(sale_date))
    )

# Make it a factor.
residential_prop$quarters_fe <- factor(residential_prop$quarters_fe)

Code

# Capture dimensions.
after_property_dimension <- dim(residential_prop)

# Convert residential property to geodataframe. Use EPSG 2272 for South Pennsylvania in feet.
# Drop shape when finished creating geometry.
residential_prop_gdf <- residential_prop %>%
  mutate(geometry = st_as_sfc(shape)) %>%
  st_as_sf(crs = 2272) %>%
  rename(geometry_point = geometry) %>%
  select(-c(shape))

Spatial Data and Feature Engineering

Spatial Data (OpenDataPhilly)

Code

# Read in Philadelpha census tracts.
philly_tracts_path <- "data/philly_tracts/philly_tracts.shp"
philly_tracts <- st_read(philly_tracts_path)

Reading layer `philly_tracts' from data source 
  `C:\Users\Tess\Desktop\UPenn\UPenn_FW25\MUSA_5080-401_Public_Policy_Analytics\MUSA5080-Assignment3\data\philly_tracts\philly_tracts.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 3446 features and 12 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -80.51985 ymin: 39.7198 xmax: -74.68956 ymax: 42.51607
Geodetic CRS:  NAD83

Code

# Match CRS.
philly_tracts <- st_transform(philly_tracts, crs = 2272)

# Left spatial join.
residential_points <- st_join(residential_prop_gdf, philly_tracts)

# Drop unnecessary columns and remove incomplete observations (rows) for upcoming spatial feature computations.
residential_points <- residential_points %>%
  select(-c(FUNCSTAT, MTFCC, NAMELSAD, NAME,
            STATEFP, COUNTYFP, TRACTCE)) %>%
  na.omit(.)

# Remove unneeded datasets for housekeeping and call garbage collector to reduce memory.
rm(properties, residential_prop, residential_prop_gdf)
gc()

           used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells  8656917 462.4   14751678  787.9  10724177  572.8
Vcells 73508658 560.9  140103430 1069.0 140102907 1069.0

Code

# Proximity to downtown.

# Decided on Euclidean distance because network proximity computation is demanding on thousands of points, even with parallel programming.

# Create single Center City point feature based on City Hall coordinates.
center_city <- st_sfc(st_point(c(-75.163500, 39.952800)), crs = 4326) %>%
  st_transform(crs = 2272)

# Need to add mile units for operations. Then remove units object for easier plotting.
residential_points$city_dist_mi <- (st_distance(residential_points, st_union(center_city))) %>%
  set_units("mi") %>%
  drop_units() %>%
  as.numeric()

# Log transform because distance benefit diminishes, for potential use.
residential_points$ln_city_dist <- log(residential_points$city_dist_mi + 0.1)

Code

# Transit proximity.
# Major cities could be distance to nearest transit like metro/rail stations, but suburban and rural areas might be better served by distance to nearest major highway.
# Read in SEPTA stops.
septa_stops_path <- "data/septa_stops.csv"

septa_stops_df <- read.csv(septa_stops_path)

# Make csv a geodataframe.
septa_stops <- septa_stops_df %>%
  st_as_sf(., coords = c("Lon", "Lat"), crs = 4326)

# Match CRS.
septa_stops <- septa_stops %>%
  st_transform(., crs = 2272)

# Stops are duplicated for the same station because the data includes directions for all cardinal directions as well as bus, rail, and trolley for the same location. This means a single station could have more than one point representing a single location residents go to commute.
# Create new column with stop name without the cardinal suffixes and keep only the unique station values.
septa_stops <- septa_stops %>%
  mutate(stations = if_else(
    str_detect(StopAbbr, "NO$|SO$|EA$|WE$|NE$|NW$|SE$|SW$"),
    str_sub(StopAbbr, end = -3),
    str_sub(StopAbbr)
    )
  ) %>%
  distinct(stations, .keep_all = TRUE)

# Create buffer zone for stops within a half mile. This is ~10 minute walk, depending on topography.
# Note: EPSG 2272 is measured in feet, not miles.
septa_distance <- st_buffer(residential_points, 2640)

# Create number of stops in the buffer zone.
septa_stations <- st_intersects(septa_distance, septa_stops)

# Append buffer zone counts and put into main tract data. Create a logged version for potential use as well because distance benefit tapers off.
residential_points <- residential_points %>%
  mutate(
    septa_half_mi = lengths(septa_stations),
    ln_septa_half_mi = log(septa_half_mi + 0.1)
  )

Code

# Park proximity / size. Measuring distance is important for accessibility, but the size of the park often matters because a property near a block-sized pocket of green space is not equivalent to being near a large one like Wissahickon Valley Park.

# Read in geojson data.
parks_path <- "data/parks.geojson"

parks <- st_read(parks_path)

Reading layer `parks' from data source 
  `C:\Users\Tess\Desktop\UPenn\UPenn_FW25\MUSA_5080-401_Public_Policy_Analytics\MUSA5080-Assignment3\data\parks.geojson' 
  using driver `GeoJSON'
Simple feature collection with 63 features and 18 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -75.2837 ymin: 39.87048 xmax: -74.95865 ymax: 40.13191
Geodetic CRS:  WGS 84

Code

# Match CRS and filter by parks.
parks <- parks %>%
  st_transform(., crs = 2272) %>%
  filter(str_detect(USE_, "PARK"))

# Get distance to the edge of the nearest park.
# Note: Don't try to do spatial operations in apply() and mutate().
# Distance matrix of residential properties to parks.
parks_matrix <- st_distance(residential_points, parks)

# Get the nearest distance for each point.
residential_points$parks_mile <- apply(parks_matrix, 1, min)

# Convert to miles.
residential_points$parks_mile <- as.numeric(residential_points$parks_mile) / 5280

# Log parks data for potential use because of diminishing distance benefits.
residential_points$ln_park_dist <- as.numeric(log(residential_points$parks_mile + 0.1))

Code

# Convenience/Food points of interest. Using kNN to measure the density of these amenities rather than nearest amenity point.
amenities_path <- "data/osm_pois/osm_pois.shp"
amenities <- st_read(amenities_path)

Reading layer `osm_pois' from data source 
  `C:\Users\Tess\Desktop\UPenn\UPenn_FW25\MUSA_5080-401_Public_Policy_Analytics\MUSA5080-Assignment3\data\osm_pois\osm_pois.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 65127 features and 4 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -80.52111 ymin: 39.71816 xmax: -74.69473 ymax: 42.25797
Geodetic CRS:  WGS 84

Code

# Filter amenities by convenience and food.
amenities <- amenities %>%
  filter(fclass %in% c(
    "atm", "bakery", "bank", "bar", "beauty_shop", "biergarten", "bookshop",
    "butcher", "cafe", "convenience", "department_store", "fast_food", "food_court",
    "greengrocer", "hairdresser", "kiosk", "laundry", "market_place", "pharmacy",
    "mall", "pub", "restaurant", "supermarket"
  )
         ) %>%
  st_transform(., crs = 2272)

# Distance matrix of residential properties to amenities.
amenities_matrix <- st_distance(residential_points, amenities)

Code

# k-Nearest Neighbors (kNN) function.
knn_distance <- function(distance_matrix, k) {
  apply(distance_matrix, 1, function(distances){
    mean(as.numeric(sort(distances)[1:k]))
  })
}

# Create kNN feature for amenities. k = 4 to balance for urban and suburban areas, probably not as representative of rural areas.
residential_points <- residential_points %>%
  mutate(
    knn_amenity_mi = as.numeric(knn_distance(amenities_matrix, k = 4))
  )

# Convert to miles.
residential_points$knn_amenity_mi <- as.numeric(residential_points$knn_amenity_mi / 5280)

Code

# Fixed effect neighborhoods.
neighborhoods_path <- "data/philadelphia_neighborhoods/philadelphia_neighborhoods.shp"
philly_neighborhoods <- st_read(neighborhoods_path)

Reading layer `philadelphia_neighborhoods' from data source 
  `C:\Users\Tess\Desktop\UPenn\UPenn_FW25\MUSA_5080-401_Public_Policy_Analytics\MUSA5080-Assignment3\data\philadelphia_neighborhoods\philadelphia_neighborhoods.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 159 features and 5 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -75.28026 ymin: 39.86701 xmax: -74.95576 ymax: 40.13799
Geodetic CRS:  WGS 84

Code

# Match CRS.
philly_neighborhoods <- philly_neighborhoods %>%
  st_transform(., crs = 2272)

# Join to residential points and rename to neighborhoods.
residential_points <- residential_points %>%
  st_join(., philly_neighborhoods) %>%
  rename(neighborhood_fe = MAPNAME)

# Make the neighborhoods a factor.
residential_points$neighborhood_fe <- relevel(factor(residential_points$neighborhood_fe), ref = "East Falls")

# Place neighborhoods with 10 or less sales into a small neighborhoods category.
residential_points <- residential_points %>%
  mutate(neighborhood_fe = fct_lump_min(neighborhood_fe, min = 11, other_level = "Small Neighborhoods"))

Code

# Capture spatial feature dimensions.
after_feat_eng_dimension <- dim(residential_points)

Code

# Spatial feature creation table, select spatial features into a separate data frame and drop geometry.
spatial_feature_df <- residential_points %>%
  select(c(city_dist_mi, ln_city_dist, septa_half_mi, ln_septa_half_mi,
           parks_mile, ln_park_dist, knn_amenity_mi)) %>%
  na.omit(.) %>%
  st_drop_geometry()

# Create a tibble from the selected spatial features.
spatial_summary <- tibble(
  "Spatial Features" = names(spatial_feature_df),
  "Description" = c("Distance from city (mi).", "Log of distance from city.", "Within 0.5mi of SEPTA station.",
                    "Log of 0.5 SEPTA station.", "Distance from nearest park (mi).",
                    "Log of distance from nearest park.", "k-Nearest Neighbors convenience and food amenities.")
)

# Make Kable of spatial features.
spatial_kable <- kable(spatial_summary,
                       caption = "Feature Engineered Variables",
                       format.args = list(big.mark = ",")
) %>%
  kable_styling(latex_options = "striped",
                full_width = FALSE) %>%
  column_spec(1, bold = TRUE, width = "5cm") %>%
  row_spec(0, color = "#f5f4f0", background = "#ff4100", bold = TRUE)

spatial_kable

Feature Engineered Variables
Spatial Features	Description
city_dist_mi	Distance from city (mi).
ln_city_dist	Log of distance from city.
septa_half_mi	Within 0.5mi of SEPTA station.
ln_septa_half_mi	Log of 0.5 SEPTA station.
parks_mile	Distance from nearest park (mi).
ln_park_dist	Log of distance from nearest park.
knn_amenity_mi	k-Nearest Neighbors convenience and food amenities.

Primary:

From the CSV, we are analyzing the conditions of basements, number of fireplaces, garage spaces, number of bathrooms, number of stories, the total livable area in square feet, the existence of central air, and type of fuel used on the property. We filtered residential category code with the sale dates between 2023 and 2024. We eliminated property prices <10k. Rather than adhering to the percentage of assessed value as a guide for this filter, for it could incorporate marginalized bias, filtering the property prices removes non-market transactions but still incorporates a wide diversity of communities.

The forced the central air characteristic to become binary rather than “Y” and “N” and made sure to turn the categorical data to factors. The reference categories for types of basements is “None” and for fuel type, it’s “Natural Gas”. We including a building age category computed from the year built data. We logged the square footage and the sale price to correct for right-skewedness.

Spatial:

We inserted the Philadelphia census tracts, changing the CRS to 2272, the ideal projection for SE Pennsylvania analysis. We decided to perform a log transformation on the following variables - city_dist (distance from City Hall in Center City), septa_half_mi (half mile buffer zone from all septa stops within the city geometry), and parks_dist (distance to edge of nearest park in miles) - because their effects on housing prices were non-linear. This transformation ensures that changes in proximity are measured more consistently across the range of distances, rather than being dominated by properties very close to these features.

Regarding amenities, we used k-NN (k nearest neighbors) to measure the density of amenity accessibility rather than individual point data. The amenities are as follows: ATM, bakery, bank, bar, beauty shop, biergarten, bookshop, butcher, café, convenience, department store, fast food, food court, greengrocer, hairdresser, kiosk, laundry, marketplace, pharmacy, mall, pub, restaurant, supermarket. We filtered by convenience and food, transformed the CRS to 2272 for consistency, and then developed a matrix. The distance was inverted, log-transformed to account for diminishing returns, and scaled it to produce a single numeric value in which higher, positive values indicate greater accessibility to amenities.

We included neighborhoods as fixed effects to help explain unknown, unquantifiable factors like cultural atmosphere and other neighborhood-specific factors we cannot statistically account for that may influence housing prices. It was converted into a factor so each neighborhood can receive its own baseline model. Fiscal quarters were also introduced as fixed effects; splitting a year into 4 quarters for unknown factors when it comes to purchasing property (e.g. people are more likely to buy real estate in spring and summer).

Census Data (TidyCensus)

Code

# Open tidycensus data. Using 2023 data, because we are looking at sales 2023-2024
acs_vars <- load_variables(2023, "acs5", cache = TRUE)

# Get acs dimensions.
og_acs_dimension <- dim(acs_vars)

# The variables that we want from tidycensus
variables <- c(
  median_household_income = "B19013_001",
  total_pop = "B01003_001",
  poverty_white = "B17001A_001", # To get poverty percentage
  poverty_black = "B17001B_001",
  poverty_native = "B17001C_001",
  poverty_asian = "B17001D_001",
  poverty_islander = "B17001E_001",
  poverty_other = "B17001F_001",
  poverty_multiracial = "B17001G_001",
  male_18_24_bach = "B15001_009", # Tracts only show bachelor's degrees, unless we want to look at only people 25+
  male_25_34_bach = "B15001_017",
  male_35_44_bach = "B15001_025",
  male_45_64_bach = "B15001_033",
  male_65plus_bach = "B15001_041",
  female_18_24_bach = "B15001_050",
  female_25_34_bach = "B15001_058",
  female_35_44_bach = "B15001_066",
  female_45_64_bach = "B15001_074",
  female_65plus_bach = "B15001_082",
  total_vacant = "B25005_001", # To get vacancy percentage
  white_total_units = "B25032A_001", # Need total units to get percentage of single, detached units and vacant units.
  white_single_family = "B25032A_002",
  black_total_units = "B25032B_001",
  black_single_family = "B25032B_002",
  native_total_units = "B25032C_001",
  native_single_family = "B25032C_002",
  asian_total_units = "B25032D_001",
  asian_single_family = "B25032D_002",
  islander_total_units = "B25032E_001",
  islander_single_family = "B25032E_002",
  other_total_units = "B25032F_001",
  other_single_family = "B25032F_002",
  multiracial_total_units = "B25032G_001",
  multiracial_single_family = "B25032G_002",
  medhhinc_white = "B19013A_001", # Median Household Income
  medhhinc_black = "B19013B_001",
  medhhinc_native = "B19013C_001", 
  medhhinc_asian = "B19013D_001", 
  medhhinc_other = "B19013F_001",  # There is no tract data for native hawiian/pacific islander, I'm including it with other
  medhhinc_multiracial = "B19013G_001", 
  white_pop = "B01001A_001",
  black_pop = "B01001B_001",
  native_pop = "B01001C_001",
  asian_pop = "B01001D_001",
  islander_pop = "B01001E_001",
  other_pop = "B01001F_001",
  multiracial_pop = "B01001G_001"
)

# We are grouping our data by tracts
philly_tract_acs <- get_acs(
  geography = "tract",
  state = "PA",
  variables = variables,
  year = 2022,
  survey = "acs5",
  cache_table = TRUE, 
  output = "wide"
)

Code

# Summing up the variables that we need to create our percentage variables
philly_tract_acs <- philly_tract_acs %>%
  mutate(
    total_poverty = poverty_whiteE + poverty_blackE + poverty_nativeE + poverty_asianE + poverty_islanderE + poverty_otherE + poverty_multiracialE, # Adding all poverty populations together 
    
    total_bach = male_18_24_bachE + male_25_34_bachE + male_35_44_bachE + male_45_64_bachE + male_65plus_bachE + female_18_24_bachE + female_25_34_bachE + female_35_44_bachE + female_45_64_bachE + female_65plus_bachE, #Adding all bachelors degrees together
    
    total_units = white_total_unitsE + black_total_unitsE + native_total_unitsE + asian_total_unitsE + islander_total_unitsE + other_total_unitsE + multiracial_total_unitsE, # Total housing units
    
    total_single_family = white_single_familyE + black_single_familyE + native_single_familyE + asian_single_familyE + islander_single_familyE + other_single_familyE + multiracial_single_familyE # Total single family homes
  )

Code

# Creating our variables that we are going to analyze
philly_tract_acs <- philly_tract_acs %>%
  mutate(
    pct_poverty = (total_poverty/total_popE)*100, # Divide total poverty population by total population

    pct_bach = (total_bach/total_popE)*100, # Divide bachelor degree holders by total population
    
    pct_vacant = (total_vacantE/total_units)*100, # Divide vacant units by total housing units
    pct_vacant = ifelse(is.infinite(pct_vacant) | total_vacantE > total_units, 100, pct_vacant), # Fixing errors when units equal zero or high MOE
    
    pct_single_family = (total_single_family/total_units)*100, # Divide single family homes by total housing units
    
    medhhinc = 
  (ifelse(is.na(medhhinc_whiteE), 0, medhhinc_whiteE) * white_popE +
   ifelse(is.na(medhhinc_blackE), 0, medhhinc_blackE) * black_popE +
   ifelse(is.na(medhhinc_nativeE), 0, medhhinc_nativeE) * native_popE +
   ifelse(is.na(medhhinc_asianE), 0, medhhinc_asianE) * asian_popE +
   ifelse(is.na(medhhinc_otherE), 0, medhhinc_otherE) * (islander_popE + other_popE) +
   ifelse(is.na(medhhinc_multiracialE), 0, medhhinc_multiracialE) * multiracial_popE) / total_popE)
# For median household income, I had to turn all median household incomes that were NA to 0, so that it would not mess up the formula. 
# Multiplying median household income times population by race. There was no islander median household income, so I included it in other. All divided by the total population, to get the total median household income.

Code

# Creating a summary table 
philly_acs_summary <- philly_tract_acs %>%
  select(
    GEOID, 
    NAME,
    pct_poverty,
    pct_bach,
    pct_vacant,
    pct_single_family,
    medhhinc
  )

# Get after acs dimension.
after_acs_dimension <- dim(philly_acs_summary)

Code

# Join primary and census data.
final_data <- residential_points %>%
  left_join(philly_acs_summary, by = "GEOID") %>%
  select(-c(sale_date, year_built, ALAND, AWATER,
            INTPTLAT, INTPTLON, NAME.x, LISTNAME, NAME.y,
            Shape_Leng, Shape_Area)
         )

Code

# Create key variables list.
key_columns <- c("sale_price", "ln_sale_price", "square_feet", "ln_square_feet",
                 "bath_num", "fireplace_num", "garage_num", "ac_binary",
                 "story_num", "age", "city_dist_mi", "ln_city_dist",
                 "septa_half_mi", "ln_septa_half_mi", "parks_mile", "ln_park_dist",
                 "knn_amenity_mi", "pct_poverty", "pct_bach",
                 "pct_vacant", "pct_single_family", "medhhinc",
                 "basement_type", "fuel_type", "neighborhood_fe", "quarters_fe")

# Reorder for key columns first and drop all rows with NA because OLS needs complete observations.
final_data <- final_data %>%
  select(any_of(key_columns), everything()) %>%
  na.omit(.)

# Get final dimension.
final_dimension <- dim(final_data)

Code

# Separate before/after dimensions for data.
dimensions <- data.frame(
  rows_columns = c("Rows", "Columns"),
  "Property Data Before" = og_property_dimension,
  "Property Data After" = after_property_dimension,
  "Property Data After Feature Engineering" = after_feat_eng_dimension,
  "ACS Data Before" = og_acs_dimension,
  "ACS Data After" = after_acs_dimension,
  "Final Data" = final_dimension
)

# Make Kable of dimensions.
dimensions_kable <- kable(dimensions,
                          col.names = c("Dimensions", "Property Data Before", "Property Data After",
                                        "Property Data After Feature Engineering",
                                        "ACS Data Before", "ACS Data After", "Final Data"),
                          digits = 2,
                          caption = "Before and After Data Dimensions",
                          format.args = list(big.mark = ",")
) %>%
  kable_styling(latex_options = "striped",
                full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, color = "#f5f4f0", background = "#ff4100", bold = TRUE)

dimensions_kable

Before and After Data Dimensions
Dimensions	Property Data Before	Property Data After	Property Data After Feature Engineering	ACS Data Before	ACS Data After	Final Data
Rows	559,322	26,344	13,884	28,261	3,446	13,883
Columns	79	16	33	4	7	28

Census:

Using tidycensus, we imported all variables that aligned with our structural data from 2023 ACS data by tracts: median household income, total population, poverty by ethnicity (White, Black, Native American, Asian, Pacific Islander, “Other,” Multiracial), males and females aged 18–65+ with bachelor’s degrees or higher, total vacancy, and total housing units by ethnicity, as well as single-family households and median household income per ethnic group. We compiled the individual poverty, bachelor’s degree, unit, and single-family household counts by ethnicity to form the following percentage variables: total_poverty, total_bach, total_units, and total_single_family.

From this, we calculated pct_poverty, pct_bach, pct_vacant (accounting for ACS errors), pct_single_family, and medhhinc, transforming NAs to 0 for regression analysis.

Using our residential property vector data (which includes structural, spatial, and feature-engineered variables), we performed a left join on the cleaned ACS summary data by GEOID.

After organizing the final dataset so key variables appear first, we generated a kable summarizing the workflow. The final dataset contains 26,344 observations and 29 columns.

Exploratory Data Analysis (EDA)

Visualizations

Code

# Distribution of Sale Prices Histogram

sale_price_hist <- ggplot(final_data, aes(x = sale_price)) +
  geom_histogram(fill = "#6BAED6", color = "white", bins = 25) +
  labs(title = "Distribution of Sale Prices",
       x = "Sale Price ($)",
       y = "Count") +
  scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal(base_size = 13) +
  theme(
    panel.grid.major = element_line(color = "gray90", size = 0.3),
    panel.grid.minor = element_line(color = "gray95", size = 0.2),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14, margin = margin(b = 10)),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
    )

sale_price_hist

Distribution of Sales Prices - Histogram

The distribution of sales prices is heavily skewed right, with a high percentage of thee transactions below $500,000, and a relatively small number of homes falling in the multi-million dollar range. This chart also also gives insight into the potential presence of high leverage outliers falling toward the left. A concentration of thousands of sales clustered on the lower end of the price ranges suggests strong segmentation, meaning that the market is split into distinct groups, where you see a large cluster of homes that are relatively similar and a smaller but very different cluster of high-value properties with luxury features and few homes in-between, the market looks like separate groups rather than one gradually increasing scale. It’s important to note that this spread likely indicates heteroskedasticity, and would require a log transformation for statistical inference.

Code

# Geographic distribution map - Sale Price

tmap_mode("plot")

sale_price_map <- tm_shape(philly_neighborhoods) +
  tm_polygons(col = "gray95", border.col = "gray65") +
  tm_shape(final_data) +
  tm_dots(col = "sale_price",
          palette = "YlOrRd",
          size = 0.05,
          style = "quantile",
          title = "Sale Price ($)") +
  tm_layout(main.title = "Geographic Distribution of Sale Prices",
            main.title.position = "center",
            main.title.size = 1.2,
            legend.outside = TRUE,
            frame = FALSE,
            bg.color = "#f5f4f0")

sale_price_map

Code

#tmap_save(tm = sale_price_map, filename = "slide_images/geo-price-map.png", width = 10, height = 6, units = "in", dpi = 300, bg = "transparent")

Geographic Distribution of Sales Prices - Map

This map displays a high variation of sales prices distributed throughout the city. The higher priced clustered areas are the Center City, University City, the riverfront, and affluent pockets of the Northwest, potentially because these areas provide easy access to transit, employment centers, and cultural amenities. The northern area stretching above Broad Street into parts of West and North Philadelphia displays lower-priced homes, potentially reflecting long-term disinvestment, high vacancy rates, and aging non-renovated housing stock. The spatial clustering suggests that sale price is place-dependent in Philadelphia, mostly due to neighborhood qualities, and fixed effects.

Code

# Sale Price vs structural features scatter plot

# Sale Price vs. Square feet 
price_v_sqft_plot <- ggplot(final_data, aes(x = square_feet, y = sale_price)) +
  geom_point(alpha = 0.4, color = "#08519C", size = 1.3) +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
  labs(title = "Sale Price vs. Square Feet",
       x = "Square Feet",
       y = "Sale Price ($)") +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal(base_size = 13) +
  theme(
    panel.background = element_rect(fill = "#f5f4f0"),
    panel.grid.major = element_line(color = "gray90", size = 0.3),
    panel.grid.minor = element_line(color = "gray95", size = 0.2),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14, margin = margin(b = 10)),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
    )

price_v_sqft_plot

Code

#ggsave("slide_images/price-v-sqft.png", plot = price_v_sqft_plot, width = 6, height = 4, units = "in", dpi = 300)

Sale Price vs. Square Feet

This scatter plot highlights the importance of home size as a structural indicator of value, even though the relationship may not be linear. There is a dense concentration of homes clustered below 3,000 sq ft. and under $500,000 consistent with the sales price distribution. Here we are seeing the same strong skew to the right, displaying a relationship between the two variables. The upward trend is fairly obvious, larger homes = an increase in price, however the spread gets wider as square footage increases, indicating that while square footage is positively associated with price, the weaker relationship among larger properties suggests that additional living space contributes less to value once the home reaches a certain size category.

Code

# Sale Price vs. Number of Fireplaces 

price_v_fire_plot <- ggplot(final_data, aes(x = fireplace_num, y = sale_price)) +
  geom_jitter(alpha = 0.5, color = "#08519C", size = 1.3, width = 0.15, height = 0) +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
  labs(title = "Sale Price vs. Number of Fireplaces",
       x = "Number of Fireplaces",
       y = "Sale Price ($)") +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal(base_size = 13) +
  theme(
    panel.background = element_rect(fill = "#f5f4f0"),
    panel.grid.major = element_line(color = "gray90", size = 0.3),
    panel.grid.minor = element_line(color = "gray95", size = 0.2),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14, margin = margin(b = 10)),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
    )

price_v_fire_plot

Code

#ggsave("slide_images/price-v-fire.png", plot = price_v_fire_plot, width = 6, height = 4, units = "in", dpi = 300)

Sale Price vs. Number of Fireplaces

This chart displays the positive relationship between the value of a home in relation to the quality and character of aesthetic features. Nearly all homes with no fireplaces remain in the lower-mid price range, and once a home has two or more fire places the sale price increases to a much higher range. Homes in Philadelphia with several fireplaces are usually bigger, older, homes with higher-end finishes, meaning that fireplace count can serve as a secondary and indirect indicator of several other indicators that influence home value, and this is the reason for the high level of noise on the chart.

Code

# Sale Price vs. Spatial Features

# Sale Price vs Distance to city center
price_v_spatial_plot <- ggplot(final_data, aes(x = city_dist_mi, y = sale_price)) +
  geom_point(alpha = 0.4, color = "#41AB5D", size = 1.3) +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
  labs(title = "Sale Price vs. Distance to Downtown",
       x = "ln Distance to City Center, mi",
       y = "Sale Price") +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal(base_size = 13) +
  theme(
    panel.grid.major = element_line(color = "gray90", size = 0.3),
    panel.grid.minor = element_line(color = "gray95", size = 0.2),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14, margin = margin(b = 10)),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
    )

price_v_spatial_plot

Sale Price vs. Distance to City Center

The plots shows a weak negative relationship between price and distance to the city center. High valued homes fall both close to downtown and well outside of it, highlighting the structure of Philadelphia’s housing market as it relates to location. It’s not a city where closer is always better, instead certain neighborhoods like chestnut hill maintain their high premiums due to neighborhood reputation, and school quality. This pattern suggests that there are many fixed affects at play here, and it is important to note that similar patterns were displayed among many spatial variables.

Code

# Sale Price vs. Distance to Parks - Park Accessibility

price_v_park_plot <- ggplot(final_data, aes(x = parks_mile, y = sale_price)) +
  geom_point(alpha = 0.4, color = "#238B45", size = 1.3) +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
  labs(title = "Sale Price vs. Distance to Nearest Park",
       x = "Distance to Nearest Park (mi)",
       y = "Sale Price") +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal(base_size = 13) +
  theme(
    panel.grid.major = element_line(color = "gray90", size = 0.3),
    panel.grid.minor = element_line(color = "gray95", size = 0.2),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14, margin = margin(b = 10)),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
    )

price_v_park_plot

Sale Price vs. Distance to Parks - Park Accessibility

The plot suggests that distance to parks has very little trend with sale price. Many high-priced properties sit both very close and very far away from parks, suggesting that park accessibility alone may not hold much value. This could be correlated with the fact that some parks are major attractions; i.e Fairmount Park, while others have limited impact on neighborhood desirability, especially in areas of low investment. Because of the difference in park quality across the city, the distance metric may not represents the true relationship, and more neighborhood or amenity variables may be required to capture environmental quality more accurately.

Code

# Median income vs Sale Price per neighborhood

inc_v_price_plot <- ggplot(final_data, aes(x = medhhinc, y = sale_price, color = neighborhood_fe)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(title = "Relationship Between Median Income and Sale Price by Neighborhood",
       x = "Median Household Income ($)",
       y = "Sale Price") +
  scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = scales::dollar) +
  scale_color_viridis_d(option = "turbo") +
  theme_minimal(base_size = 13) +
  theme(
    panel.grid.major = element_line(color = "gray90", size = 0.3),
    panel.grid.minor = element_line(color = "gray95", size = 0.2),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14, margin = margin(b = 10)),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10)),
    legend.position = "bottom",
    legend.text = element_text(size = 7),
    legend.key.size = unit(0.4, "cm"),
    legend.box = "horizontal"
    ) +
  guides(color = guide_legend(ncol = 8, byrow = TRUE))

inc_v_price_plot

Median Income vs Sale Price per neighborhood

This plot further advances the claim that even with a lot of scatter in the points, there is a noticeable upward trend of homes in higher-income neighborhoods selling for higher prices. Many of the neighborhoods cluster in specific ranges of the price scale, even when income levels are similar, suggesting that the housing market in Philadelphia is dependent not only on demographics or income, but place effects, where reputation or fixed effects in a neighborhood increase or reduce prices. An example of this is some neighborhoods with moderate household incomes still show clusters of high-value transactions, while others with similar incomes remain in the lower end of the market, again highlighting other factors and fixed effects like transit access, historical character, and school quality. The main takeaway is that while higher-income neighborhood tend to have homes with higher sale prices, neighborhood identity also plays a big role in sale price.

Code

# Spatial Relationship Between sale price and structural predictors

tmap_mode("plot")

# Sale Price
map_value <- tm_shape(philly_tracts[philly_tracts$COUNTYFP == 101, ]) +
  tm_polygons(col = "gray90", border.col = "gray45", lwd = 0.5) +
  tm_shape(final_data) +
  tm_dots(col = "sale_price",
          palette = "YlOrRd",
          style = "quantile",
          size = 0.04,
          title = "Sale Price") +
  tm_layout(main.title = "Distribution of Sale Prices",
            main.title.position = "center",
            main.title.size = 1.2,
            legend.outside = TRUE,
            frame = FALSE)

Code

# Bathrooms Predictor

map_bath <-  tm_shape(philly_tracts[philly_tracts$COUNTYFP == 101, ]) +
  tm_polygons(col = "gray90", border.col = "gray45", lwd = 0.5) +
  tm_shape(final_data) +
  tm_dots(col = "bath_num",
          palette = "Blues",
          style = "quantile",
          size = 0.04,
          title = "Number of Bathrooms") +
  tm_layout(main.title = "Distribution of Bathrooms",
            main.title.position = "center",
            main.title.size = 1.2,
            legend.outside = TRUE,
            frame = FALSE)

Code

# Stories Predictor

map_story <-  tm_shape(philly_tracts[philly_tracts$COUNTYFP == 101, ]) +
  tm_polygons(col = "gray90", border.col = "gray45", lwd = 0.5) +
  tm_shape(final_data) +
  tm_dots(col = "story_num",
          palette = "Greens",
          style = "quantile",
          size = 0.04,
          title = "Number of Stories") +
  tm_layout(main.title = "Distribution of Stories",
            main.title.position = "center",
            main.title.size = 1.2,
            legend.outside = TRUE,
            frame = FALSE)

Code

# Display them side by side

tmap_arrange(map_value, map_bath, map_story, ncol = 3)

Code

map_struct <- tmap_arrange(map_value, map_bath, map_story, ncol = 3)

Spatial Relationship Between sale price and structural predictors (bathrooms and stories) - Map

This figure shows how housing characteristics are clustered across Philadelphia. Here we can see the sale price distribution directly against certain structural features. Bathrooms and number of stories follow similar geographic patterns, as sale price, areas with higher sales prices tend to have bigger homes with more bathrooms and more stories. In contrast, neighborhoods where homes have less structural amenities also have subsequently lower sale prices. This group of maps makes a strong visual argument that structural features and neighborhood context evolve together, supporting the modeling strategy of incorporating more structural than spatial predictors.

Code

# Sale price histogram.
price_hist <- ggplot(residential_points, aes(x = sale_price)) +
  geom_histogram(fill = "pink", color = "white") +
  labs(title = "Histogram of Sale Price", x = "Sale Price ($)", y = "Count") +
  theme_minimal()

# Log sale price histogram.
ln_price_hist <- ggplot(residential_points, aes(x = ln_sale_price)) +
  geom_histogram(fill = "pink", color = "white") +
  labs(title = "Histogram of ln(Sale Price)", x = "ln(Sale Price)", y = "Count") +
  theme_minimal()

grid.arrange(price_hist, ln_price_hist, ncol = 2)

The raw distribution of sale prices is right-skewed with a median of 250,000 USD and a mean of 343,867 USD. There is a substantial gap in price between the third quartile (375,000 USD) and the maximum price (15,428,633 USD). While 75% of houses sold for 375,000 USD or less, the upper 25% exhibits considerable variability, with prices ranging up to 15.4 million USD, affecting the tail distribution. This suggests that a small number of luxury properties are affecting the distribution of the sales price data To address this skewness and improve model performance, we performed a log transformation, making our data closer to normal by compressing the scale of higher values, emphasizing a standardized change in percentage over dollar amount.

Code

# Livable space histogram.
livable_area_hist <- ggplot(residential_points, aes(x = square_feet)) +
  geom_histogram(fill = "wheat", color = "white") +
  labs(title = "Histogram of Square Feet", x = "Square Feet", y = "Count") +
  theme_minimal()

# Log livable space histogram.
ln_square_feet_hist <- ggplot(residential_points, aes(x = ln_square_feet)) +
  geom_histogram(fill = "wheat", color = "white") +
  labs(title = "Histogram of ln(Square Feet)", x = "ln(Square Feet)", y = "Count") +
  theme_minimal()

grid.arrange(livable_area_hist, ln_square_feet_hist, ncol = 2)

The distribution of livable area is right-skewed, with a median of 1,216 sq ft and a mean of 1,372 sq ft. While 75% of homes are under 1,509 sq ft, a small number of larger properties extend the upper tail of the distribution. Notably, homes above 3,000 sq ft become increasingly sparse, suggesting a separation between standard and luxury housing markets. We applied a log transformation to this variable to create a more symmetric distribution by compressing the scale of larger homes, which improves linearity in our model and allows coefficients to represent proportional rather than absolute changes in square footage.

Code

# Distance to downtown histogram.
downtown_dist_hist <- ggplot(residential_points, aes(x = city_dist_mi)) +
  geom_histogram(fill = "darkblue", color = "white") +
  labs(title = "Histogram of Downtown Distance", x = "Downtown Distance (mi)", y = "Count") +
  theme_minimal()

# Log distance to downtown histogram.
ln_downtown_dist_hist <- ggplot(residential_points, aes(x = ln_city_dist)) +
  geom_histogram(fill = "darkblue", color = "white") +
  labs(title = "Histogram of ln(Downtown Distance)", x = "ln(Downtown Distance)", y = "Count") +
  theme_minimal()

grid.arrange(downtown_dist_hist, ln_downtown_dist_hist, ncol = 2)

The distribution of distance to Center City (City Hall) is right-skewed, with fewer observations of houses occurring as the distance from Center City/City Hall increases. The effect of distance on housing prices is non-linear: being 1 mile from Center City has a larger impact on price than being 6 vs. 11 miles out. To account for this, we applied a log transformation, which compresses the upper tail, creates a more symmetric distribution, and reduces the influence of extreme distances. This transformation improves linearity in our regression model and allows coefficients to be interpreted as proportional changes in price per proportional change in distance.

Code

# SEPTA buffer histogram.
septa_buffer_hist <- ggplot(residential_points, aes(x = septa_half_mi)) +
  geom_histogram(fill = "azure3", color = "white") +
  labs(title = "Histogram of SEPTA 0.5mi Buffer", x = "SEPTA 0.5mi Buffer (mi)", y = "Count") +
  theme_minimal()

# Log SEPTA buffer histogram.
ln_septa_buffer_hist <- ggplot(residential_points, aes(x = ln_septa_half_mi)) +
  geom_histogram(fill = "azure3", color = "white") +
  labs(title = "Histogram of ln(SEPTA 0.5mi Buffer)", x = "ln(SEPTA 0.5mi Buffer)", y = "Count") +
  theme_minimal()

grid.arrange(septa_buffer_hist, ln_septa_buffer_hist, ncol = 2)

The distribution of SEPTA access within a 0.5-mile buffer of each property is right-skewed, with a median of 44 and a mean of 52.5. While most properties have between 29 and 69 nearby SEPTA access points, there is substantial variation ranging from zero (likely remote suburban properties) to over 160 in the most transit-dense neighborhoods. The log transformation compresses this wide range and produces a more symmetric distribution, which is appropriate given that transit accessibility exhibits diminishing returns—the marginal benefit of additional access decreases as the total number increases.

Code

# kNN amenities histogram.
knn_amenities_hist <- ggplot(residential_points, aes(x = knn_amenity_mi)) +
  geom_histogram(fill = "darkseagreen", color = "white") +
  labs(title = "Histogram of kNN Amenities", x = "kNN Amenities (mi)", y = "Count") +
  theme_minimal()

grid.arrange(knn_amenities_hist)

For the kNN Amenities variable, the mean distance to the nearest amenity per household is 0.31 miles, and the median is 0.27 miles. Seventy-five percent of households are within 0.42 miles of any of the 23 amenities described in the data cleaning section. These statistics showcase Philadelphia’s reputation as a highly walkable city. Observations beyond 1 mile typically reflect suburban or rural settings. We did not transform this variable, instead using the raw distances to preserve the direct relationship between amenity proximity and property values.The kNN approach inherently reflects local amenity density, with shorter distances in denser areas and longer distances where households are more dispersed.

Code

# Distance to park histogram..
parks_dist <- ggplot(residential_points, aes(x = parks_mile)) +
  geom_histogram(fill = "darkgreen", color = "white") +
  labs(title = "Histogram of Parks Distance", x = "Parks Distance (mi)", y = "Count") +
  theme_minimal()

# Log distance to park histogram.
ln_parks_dist <- ggplot(residential_points, aes(x = ln_park_dist)) +
  geom_histogram(fill = "darkgreen", color = "white") +
  labs(title = "Histogram of ln(Parks Distance)", x = "Parks Distance", y = "Count") +
  theme_minimal()

grid.arrange(parks_dist, ln_parks_dist, ncol = 2)

The distribution of the variable Distance from Parks by miles is showing a slight right skew. Between the third quartile and the maximum there is a jump of about 3 miles, indicating outliers beyond 2 miles. Due to this, we chose to log the variable to ensure it is normal for our model.

Code

# Number of bathrooms histogram.
ggplot(residential_points, aes(x = bath_num)) +
  geom_histogram(fill = "gold", color = "white") +
  labs(title = "Histogram of Bathrooms", x = "Bathrooms", y = "Count") +
  theme_minimal()

The histogram showcases the number of observations of properties with n bathrooms. Most observations exhibit 1 or 2 bathrooms per property, with a an outlier of 8 bathrooms.

Code

# Number of fireplaces histogram.
ggplot(residential_points, aes(x = fireplace_num)) +
  geom_histogram(fill = "darkred", color = "white") +
  labs(title = "Histogram of Fireplaces", x = "Fireplaces", y = "Count") +
  theme_minimal()

This histogram showcases the number of observations with n fireplaces. Most observations contain 0 fireplaces, with outliers of property(ies) containing 2 or more.

Code

# Number of garages histogram.
ggplot(residential_points, aes(x = garage_num)) +
  geom_histogram(fill = "gray", color = "white") +
  labs(title = "Histogram of Garages", x = "Garages", y = "Count") +
  theme_minimal()

This histogram showcases the number of garages available per observation. The median is 0 garages per property, with a maximum of 5 garages per property.

Code

# Number of stories histogram.
ggplot(residential_points, aes(x = story_num)) +
  geom_histogram(fill = "orange", color = "white") +
  labs(title = "Histogram of Stories", x = "Stories", y = "Count") +
  theme_minimal()

This histogram showcases the number of observations that contain n housing stories. The median is 2 stories per property, with a maximum of 5 stories.

Code

# Age histogram.
ggplot(residential_points, aes(x = age)) +
  geom_histogram(fill = "black", color = "white") +
  labs(title = "Histogram of Age", x = "Age", y = "Count") +
  theme_minimal()

This histogram showcases the amount of observations with their calculated ages (years) based off year built. The median age is 100 years, and the maximum is 275 years. This was then transformed into a polynomial variable to account for the decrease in housing price as the home ages until a certain age, then the price rises again.

Code

# Histogram for pct_poverty
ggplot(philly_acs_summary, aes(x = pct_poverty)) +
  geom_histogram(fill = "skyblue", color = "white") +
  labs(title = "Histogram of Percent Poverty", x = "Percent Poverty (%)", y = "Count") +
  theme_minimal()

This histogram presents the tracts with their determined percent poverty. The percentage is suspiciously high with the range from the minimum (0%) from the first quartile is 98 percent.We also have 32 tracts with no data.

Code

# Histogram for pct_bach
ggplot(philly_acs_summary, aes(x = pct_bach)) +
  geom_histogram(binwidth = 5, fill = "lightgreen", color = "white") +
  labs(title = "Histogram of Percent Bachelor Degree Holders", x = "Percent Bachelor Degree Holders (%)", y = "Count") +
  theme_minimal()

This histogram shows the distribution of percent of bachelor + degree holders per census tract. The median is 13.5 %, with most tracts falling between the minimum and third quartile. We also have 32 tracts with no data.

Code

# Histogram for pct_vacant
ggplot(philly_acs_summary, aes(x = pct_vacant)) +
  geom_histogram(binwidth = 10, fill = "salmon", color = "white") +
  labs(title = "Histogram of Percent Vacant Units", x = "Percent Vacant (%)", y = "Count") +
  theme_minimal()

This histogram shows the distribution of percent vacant residences within the census tracts. The median percentage of vacant homes is 8 percent. We also have 44 tracts with no data.

Code

# Histogram for pct_single_family
ggplot(philly_acs_summary, aes(x = pct_single_family)) +
  geom_histogram(binwidth = 5, fill = "orange", color = "white") +
  labs(title = "Histogram of Percent Single-Family Homes ", x = "Percent Single-Family (%)", y = "Count") +
  theme_minimal()

This histogram represents the percent of single families per tract. The median for percent single families per tract is 66.58%. We also have 45 tracts with no data.

Code

# Histogram for medhhinc
ggplot(philly_acs_summary, aes(x = medhhinc)) +
  geom_histogram(binwidth = 10000, fill = "purple", color = "white") +
  labs(title = "Histogram of Median Household Income", x = "Median Household Income ($)", y = "Count") +
  theme_minimal()

This histogram represents the median household income value per census tract. The median of the median household income value is $66,795. Slightly right-skewed. We also have 32 tracts with no data.

Code

# Residential property correlation.
corr_matrix_df <- final_data %>%
  select(1:22) %>%
  st_drop_geometry()

corr_matrix <- round(cor(na.omit(corr_matrix_df)), 2)

corr_matrix_plot <- ggcorrplot(corr_matrix,
                               title = "Correlation Matrix",
                               hc.order = FALSE,
                               method = "square",
                               lab = TRUE,
                               lab_size = 3,
                               colors = c("#6D9EC1", "white", "#E46726"),
                               ggtheme = ggplot2::theme_gray,
                               insig = "blank"
                               )

corr_matrix_plot

Code

#ggsave("correlation_matrix.png", width = 8, height = 8, dpi = 400)

5 strongest correlations to price:

square feet
bathrooms
median income
percent bachelor’s
fireplaces

Seems like bachelor degrees are highly correlated with median income, and should be excluded as a predictor.

Model Building

Model Building Progression

Check for multicollinearity:

Code

# VIF check for multicollinearity.
vif_check <- lm(ln_sale_price ~ ln_square_feet + bath_num + ac_binary + fireplace_num + story_num  + garage_num + ln_septa_half_mi + ln_park_dist + ln_city_dist + basement_type + fuel_type, data = residential_points)

vif(vif_check)

                     GVIF Df GVIF^(1/(2*Df))
ln_square_feet   2.135764  1        1.461425
bath_num         1.898126  1        1.377725
ac_binary        1.326611  1        1.151786
fireplace_num    1.254647  1        1.120110
story_num        1.480500  1        1.216758
garage_num       2.278251  1        1.509387
ln_septa_half_mi 3.022487  1        1.738530
ln_park_dist     1.042356  1        1.020958
ln_city_dist     3.366197  1        1.834720
basement_type    3.098173 10        1.058170
fuel_type        1.050158  3        1.008190

Code

# Build Model step by step
# First Model (structural features only)

first_model <- lm(sale_price ~ ln_square_feet + bath_num + # Structural.
                    ac_binary + fireplace_num + story_num + garage_num + # Structural.
                    basement_type + fuel_type + # Categorical structural.
                    age + I(age^2), # Age polynomial.
                    data = final_data)

summary(first_model)


Call:
lm(formula = sale_price ~ ln_square_feet + bath_num + ac_binary + 
    fireplace_num + story_num + garage_num + basement_type + 
    fuel_type + age + I(age^2), data = final_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-948550  -77612  -12427   56061 4967680 

Coefficients:
                                          Estimate    Std. Error t value
(Intercept)                          -1573115.2430    47184.2963 -33.340
ln_square_feet                         265324.4547     6777.4520  39.148
bath_num                                50976.0380     2850.5451  17.883
ac_binary                               93358.6805     3519.3817  26.527
fireplace_num                          139754.2587     4574.3312  30.552
story_num                               36185.3161     3367.8324  10.744
garage_num                              73507.4314     4149.3241  17.716
basement_typeFull Finished             -53619.3219    10227.7625  -5.243
basement_typeFull Semi-Finished        -64747.7385    11812.4468  -5.481
basement_typeFull Unfinished           -69284.7145    10632.8706  -6.516
basement_typeFull Unknown Finish       -85987.4342    10988.0644  -7.826
basement_typePartial Finished         -119781.8006    10844.6785 -11.045
basement_typePartial Semi-Finished    -130192.6848    11305.7582 -11.516
basement_typePartial Unfinished       -124518.1899    13722.2343  -9.074
basement_typePartial Unknown Finish   -131660.7994    13342.8310  -9.868
basement_typeUnknown Size Finished      64719.7760    49966.7638   1.295
basement_typeUnknown Size Unfinished   -25066.6493    38151.1732  -0.657
fuel_typeElectric                      -10365.3404    10223.5159  -1.014
fuel_typeOil Heat                       -2321.1682    20691.7649  -0.112
fuel_typeOther                         261307.5132    62624.4546   4.173
age                                     -4531.8642      146.4668 -30.941
I(age^2)                                   26.6178        0.8083  32.929
                                                 Pr(>|t|)    
(Intercept)                          < 0.0000000000000002 ***
ln_square_feet                       < 0.0000000000000002 ***
bath_num                             < 0.0000000000000002 ***
ac_binary                            < 0.0000000000000002 ***
fireplace_num                        < 0.0000000000000002 ***
story_num                            < 0.0000000000000002 ***
garage_num                           < 0.0000000000000002 ***
basement_typeFull Finished            0.00000016071883817 ***
basement_typeFull Semi-Finished       0.00000004295551073 ***
basement_typeFull Unfinished          0.00000000007465800 ***
basement_typeFull Unknown Finish      0.00000000000000542 ***
basement_typePartial Finished        < 0.0000000000000002 ***
basement_typePartial Semi-Finished   < 0.0000000000000002 ***
basement_typePartial Unfinished      < 0.0000000000000002 ***
basement_typePartial Unknown Finish  < 0.0000000000000002 ***
basement_typeUnknown Size Finished                  0.195    
basement_typeUnknown Size Unfinished                0.511    
fuel_typeElectric                                   0.311    
fuel_typeOil Heat                                   0.911    
fuel_typeOther                        0.00003029641753816 ***
age                                  < 0.0000000000000002 ***
I(age^2)                             < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 176400 on 13861 degrees of freedom
Multiple R-squared:  0.5896,    Adjusted R-squared:  0.5889 
F-statistic: 948.1 on 21 and 13861 DF,  p-value: < 0.00000000000000022

Code

# Build Model step by step
# Second Model (structural features + census features)

second_model <- lm(sale_price ~ ln_square_feet + bath_num + # Structural.
                    ac_binary + fireplace_num + story_num + garage_num + # Structural.
                    basement_type + fuel_type + # Categorical structural.
                    age + I(age^2) + # Age polynomial.
                    medhhinc + pct_vacant + pct_single_family, # Census feature.
                    data = final_data)

summary(second_model)


Call:
lm(formula = sale_price ~ ln_square_feet + bath_num + ac_binary + 
    fireplace_num + story_num + garage_num + basement_type + 
    fuel_type + age + I(age^2) + medhhinc + pct_vacant + pct_single_family, 
    data = final_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-988077  -66190   -4479   51392 5089350 

Coefficients:
                                           Estimate     Std. Error t value
(Intercept)                          -1631653.69177    43478.62618 -37.528
ln_square_feet                         249024.29196     6287.93299  39.604
bath_num                                57294.02210     2625.66512  21.821
ac_binary                               51858.97037     3354.26227  15.461
fireplace_num                          123699.62363     4248.12112  29.119
story_num                                8121.18745     3187.48631   2.548
garage_num                              78147.86081     3858.85587  20.252
basement_typeFull Finished             -11441.07419     9433.38338  -1.213
basement_typeFull Semi-Finished        -28190.97570    10881.72867  -2.591
basement_typeFull Unfinished           -26373.93425     9803.25374  -2.690
basement_typeFull Unknown Finish       -35981.99248    10140.98926  -3.548
basement_typePartial Finished          -81503.33848    10010.20710  -8.142
basement_typePartial Semi-Finished     -80042.76840    10459.75246  -7.652
basement_typePartial Unfinished        -73358.61614    12658.62517  -5.795
basement_typePartial Unknown Finish    -84936.85419    12306.96737  -6.902
basement_typeUnknown Size Finished     100022.70014    45911.66164   2.179
basement_typeUnknown Size Unfinished    18865.34248    35068.79357   0.538
fuel_typeElectric                        3565.49288     9416.44401   0.379
fuel_typeOil Heat                       -9369.15017    19014.09219  -0.493
fuel_typeOther                         149197.56776    57590.28952   2.591
age                                     -3249.53619      137.00469 -23.718
I(age^2)                                   20.10269        0.75712  26.552
medhhinc                                    2.64611        0.05506  48.061
pct_vacant                                383.54766      219.97666   1.744
pct_single_family                       -1835.25686      167.77302 -10.939
                                                 Pr(>|t|)    
(Intercept)                          < 0.0000000000000002 ***
ln_square_feet                       < 0.0000000000000002 ***
bath_num                             < 0.0000000000000002 ***
ac_binary                            < 0.0000000000000002 ***
fireplace_num                        < 0.0000000000000002 ***
story_num                                        0.010850 *  
garage_num                           < 0.0000000000000002 ***
basement_typeFull Finished                       0.225216    
basement_typeFull Semi-Finished                  0.009589 ** 
basement_typeFull Unfinished                     0.007147 ** 
basement_typeFull Unknown Finish                 0.000389 ***
basement_typePartial Finished        0.000000000000000422 ***
basement_typePartial Semi-Finished   0.000000000000021017 ***
basement_typePartial Unfinished      0.000000006974481709 ***
basement_typePartial Unknown Finish  0.000000000005368236 ***
basement_typeUnknown Size Finished               0.029379 *  
basement_typeUnknown Size Unfinished             0.590619    
fuel_typeElectric                                0.704957    
fuel_typeOil Heat                                0.622199    
fuel_typeOther                                   0.009589 ** 
age                                  < 0.0000000000000002 ***
I(age^2)                             < 0.0000000000000002 ***
medhhinc                             < 0.0000000000000002 ***
pct_vacant                                       0.081254 .  
pct_single_family                    < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 162000 on 13858 degrees of freedom
Multiple R-squared:  0.6538,    Adjusted R-squared:  0.6532 
F-statistic:  1091 on 24 and 13858 DF,  p-value: < 0.00000000000000022

Code

# Build Model step by step
# Third Model (structural features + census features + spatial features)

third_model <- lm(sale_price ~ ln_square_feet + bath_num + # Structural.
                    ac_binary + fireplace_num + story_num + garage_num + # Structural.
                    basement_type + fuel_type + # Categorical structural.
                    age + I(age^2) + # Age polynomial.
                    medhhinc + pct_vacant + pct_single_family + # Census feature.
                    city_dist_mi + septa_half_mi + ln_park_dist + knn_amenity_mi, # Spatial 
                    data = final_data)

summary(third_model)


Call:
lm(formula = sale_price ~ ln_square_feet + bath_num + ac_binary + 
    fireplace_num + story_num + garage_num + basement_type + 
    fuel_type + age + I(age^2) + medhhinc + pct_vacant + pct_single_family + 
    city_dist_mi + septa_half_mi + ln_park_dist + knn_amenity_mi, 
    data = final_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-817667  -62638   -2348   51585 5096901 

Coefficients:
                                          Estimate    Std. Error t value
(Intercept)                          -1755937.2954    42209.8105 -41.600
ln_square_feet                         259188.8862     6030.4108  42.980
bath_num                                51003.1282     2524.1100  20.206
ac_binary                               46002.9566     3224.9122  14.265
fireplace_num                          127882.4271     4065.5162  31.455
story_num                               -3293.9735     3080.5023  -1.069
garage_num                              87635.7897     3702.7940  23.667
basement_typeFull Finished               9025.2270     9060.4446   0.996
basement_typeFull Semi-Finished          -655.1732    10446.8183  -0.063
basement_typeFull Unfinished            -6175.5266     9422.9158  -0.655
basement_typeFull Unknown Finish       -15053.7679     9744.2956  -1.545
basement_typePartial Finished          -37741.0741     9667.9264  -3.904
basement_typePartial Semi-Finished     -34526.5862    10142.6823  -3.404
basement_typePartial Unfinished        -43195.2313    12139.1172  -3.558
basement_typePartial Unknown Finish    -52068.7616    11806.6007  -4.410
basement_typeUnknown Size Finished     125455.6640    43907.0332   2.857
basement_typeUnknown Size Unfinished    36845.8296    33537.6606   1.099
fuel_typeElectric                        2967.6188     9004.5144   0.330
fuel_typeOil Heat                        -796.3198    18184.1313  -0.044
fuel_typeOther                         104285.2749    55084.8694   1.893
age                                     -2593.0061      132.4070 -19.584
I(age^2)                                   14.7489        0.7398  19.935
medhhinc                                    2.1122        0.0560  37.717
pct_vacant                              -1163.6035      224.4922  -5.183
pct_single_family                         724.0007      182.9895   3.957
city_dist_mi                            -2507.4907      766.2599  -3.272
septa_half_mi                            1848.1046       70.2789  26.297
ln_park_dist                           -16888.4011     2091.3400  -8.075
knn_amenity_mi                         -12260.4575     8546.2417  -1.435
                                                 Pr(>|t|)    
(Intercept)                          < 0.0000000000000002 ***
ln_square_feet                       < 0.0000000000000002 ***
bath_num                             < 0.0000000000000002 ***
ac_binary                            < 0.0000000000000002 ***
fireplace_num                        < 0.0000000000000002 ***
story_num                                        0.284954    
garage_num                           < 0.0000000000000002 ***
basement_typeFull Finished                       0.319213    
basement_typeFull Semi-Finished                  0.949994    
basement_typeFull Unfinished                     0.512238    
basement_typeFull Unknown Finish                 0.122398    
basement_typePartial Finished        0.000095166072865724 ***
basement_typePartial Semi-Finished               0.000666 ***
basement_typePartial Unfinished                  0.000374 ***
basement_typePartial Unknown Finish  0.000010408240366615 ***
basement_typeUnknown Size Finished               0.004279 ** 
basement_typeUnknown Size Unfinished             0.271944    
fuel_typeElectric                                0.741730    
fuel_typeOil Heat                                0.965071    
fuel_typeOther                                   0.058355 .  
age                                  < 0.0000000000000002 ***
I(age^2)                             < 0.0000000000000002 ***
medhhinc                             < 0.0000000000000002 ***
pct_vacant                           0.000000221096385096 ***
pct_single_family                    0.000076430420795547 ***
city_dist_mi                                     0.001069 ** 
septa_half_mi                        < 0.0000000000000002 ***
ln_park_dist                         0.000000000000000728 ***
knn_amenity_mi                                   0.151423    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 154900 on 13854 degrees of freedom
Multiple R-squared:  0.6837,    Adjusted R-squared:  0.6831 
F-statistic:  1069 on 28 and 13854 DF,  p-value: < 0.00000000000000022

Code

# Build Model step by step
# Final Model 
## (structural features + census features + spatial features + interactions and fixed effects)

final_model <- lm(sale_price ~ ln_square_feet + bath_num + # Structural.
                    ac_binary + fireplace_num + story_num + garage_num + # Structural.
                    basement_type + fuel_type + # Categorical structural.
                    age + I(age^2) + # Age polynomial.
                    medhhinc + pct_vacant + pct_single_family + # Census feature.
                    city_dist_mi + septa_half_mi + ln_park_dist + knn_amenity_mi + # Spatial 
                    knn_amenity_mi * city_dist_mi + septa_half_mi * city_dist_mi + # Interaction between amenities and downtown distance.
                    neighborhood_fe + quarters_fe, # Fixed effect 
                    data = final_data)

summary(final_model)


Call:
lm(formula = sale_price ~ ln_square_feet + bath_num + ac_binary + 
    fireplace_num + story_num + garage_num + basement_type + 
    fuel_type + age + I(age^2) + medhhinc + pct_vacant + pct_single_family + 
    city_dist_mi + septa_half_mi + ln_park_dist + knn_amenity_mi + 
    knn_amenity_mi * city_dist_mi + septa_half_mi * city_dist_mi + 
    neighborhood_fe + quarters_fe, data = final_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-798051  -47316    2231   44345 5123884 

Coefficients:
                                                 Estimate     Std. Error
(Intercept)                                -1657940.22470    45493.46650
ln_square_feet                               261653.15436     5664.41060
bath_num                                      43843.65408     2308.64344
ac_binary                                     38117.05447     3019.61163
fireplace_num                                 88491.39266     3861.20497
story_num                                     -3914.30380     2936.48610
garage_num                                    77030.43653     3411.56673
basement_typeFull Finished                   101483.65131     8830.21379
basement_typeFull Semi-Finished               92361.27832     9993.78345
basement_typeFull Unfinished                  81610.24126     9130.96925
basement_typeFull Unknown Finish              72063.11376     9381.23915
basement_typePartial Finished                 49757.63088     9336.90901
basement_typePartial Semi-Finished            45706.97947     9840.50165
basement_typePartial Unfinished               38542.17995    11431.04311
basement_typePartial Unknown Finish           32484.57614    11155.03267
basement_typeUnknown Size Finished           168383.76890    39699.41756
basement_typeUnknown Size Unfinished         130060.38774    30365.34179
fuel_typeElectric                             -9782.82669     8165.97207
fuel_typeOil Heat                              3778.65607    16324.04269
fuel_typeOther                                 8483.98198    49580.06057
age                                           -1482.81195      130.24128
I(age^2)                                          5.87768        0.74553
medhhinc                                          0.62950        0.08448
pct_vacant                                    -1005.64567      291.81015
pct_single_family                               219.33928      214.90003
city_dist_mi                                  -4819.21361     3219.40481
septa_half_mi                                  1424.38286      194.01801
ln_park_dist                                  -6234.17054     4838.09208
knn_amenity_mi                               -20833.95786    24021.24910
neighborhood_feAcademy Gardens                37183.75011    29063.24192
neighborhood_feAllegheny West                -81178.66645    19747.80365
neighborhood_feAndorra                       -36899.86992    30639.87807
neighborhood_feAston-Woodbridge               37471.10986    33193.50178
neighborhood_feBella Vista                    90030.05119    24534.03655
neighborhood_feBelmont                       -73260.20102    41183.35267
neighborhood_feBrewerytown                   -70887.58170    18851.85190
neighborhood_feBridesburg                      8200.62849    22173.10061
neighborhood_feBurholme                      -11020.59219    36141.98192
neighborhood_feBustleton                      26845.36972    24665.90193
neighborhood_feCarroll Park                  -79485.63764    20533.40133
neighborhood_feCedar Park                     50903.74805    21594.31513
neighborhood_feCedarbrook                    -18733.12094    23527.99233
neighborhood_feChestnut Hill                 405617.68702    20989.45599
neighborhood_feClearview                     -84924.33030    32006.87846
neighborhood_feCobbs Creek                   -77562.70250    16239.01668
neighborhood_feDickinson Narrows             -46228.73425    20672.82929
neighborhood_feDunlap                       -149077.53577    34850.29349
neighborhood_feEast Germantown               -55725.61363    22284.70084
neighborhood_feEast Kensington               -24931.83758    18649.97574
neighborhood_feEast Mount Airy               -39672.38823    18326.30068
neighborhood_feEast Oak Lane                -110720.66613    29189.26470
neighborhood_feEast Parkside                 -96875.29736    43009.02090
neighborhood_feEast Passyunk                  18548.68326    19771.17863
neighborhood_feEastwick                      -65187.86687    34422.87528
neighborhood_feElmwood                       -65408.97379    18135.99045
neighborhood_feFairhill                      -71704.26714    37417.40279
neighborhood_feFairmount                      79935.76169    18138.51850
neighborhood_feFeltonville                   -64193.35248    19842.10485
neighborhood_feFern Rock                     -80858.30468    34391.53574
neighborhood_feFishtown - Lower Kensington     9753.74777    15479.96362
neighborhood_feFitler Square                 433898.63083    30666.44346
neighborhood_feFox Chase                      -1191.12744    21434.30138
neighborhood_feFrancisville                  -55199.70458    23448.72689
neighborhood_feFrankford                     -65752.33751    18632.41316
neighborhood_feFranklinville                -135326.00022    29388.65347
neighborhood_feGermantown - Morton           -73568.42918    26505.05993
neighborhood_feGermantown - Penn Knox       -136518.39681    44353.19600
neighborhood_feGermantown - Westside         -98563.13632    39334.28353
neighborhood_feGermany Hill                   10231.17331    26115.45737
neighborhood_feGirard Estates                -30603.49061    18939.42860
neighborhood_feGlenwood                     -119474.06331    28556.83370
neighborhood_feGraduate Hospital              97520.93433    19973.04887
neighborhood_feGrays Ferry                   -75481.41036    17444.55221
neighborhood_feGreenwich                     -64175.19929    25230.25907
neighborhood_feHaddington                    -91270.64590    18398.19928
neighborhood_feHarrowgate                    -62121.18902    18977.91421
neighborhood_feHartranft                    -131582.73560    22639.15057
neighborhood_feHawthorne                      13169.95713    28842.84687
neighborhood_feHolmesburg                    -17670.00910    20551.32095
neighborhood_feHunting Park                  -53147.58934    22061.51403
neighborhood_feJuniata Park                  -49408.97887    17918.47259
neighborhood_feKingsessing                  -100245.71687    17213.22477
neighborhood_feLawndale                      -38945.07939    17685.32938
neighborhood_feLexington Park                 23243.39571    28523.99591
neighborhood_feLogan                        -102555.05340    19864.76119
neighborhood_feLogan Square                  379046.61233    30173.19574
neighborhood_feLower Moyamensing             -65365.63965    17491.07739
neighborhood_feManayunk                       28356.63404    18199.96310
neighborhood_feMantua                        -92910.84396    27416.06170
neighborhood_feMayfair                       -11299.33352    18328.17970
neighborhood_feMelrose Park Gardens          -68278.66749    32017.80130
neighborhood_feMill Creek                   -108864.06595    26183.51708
neighborhood_feMillbrook                      15534.97223    30551.88468
neighborhood_feModena                         55854.04856    28537.16201
neighborhood_feMorrell Park                   25251.89869    27932.79846
neighborhood_feNewbold                       -54636.97578    21808.17215
neighborhood_feNicetown                      -66893.42947    36298.67571
neighborhood_feNormandy Village              155197.36009    49122.79282
neighborhood_feNorth Central                -101900.04242    22183.92696
neighborhood_feNorthern Liberties              2692.49730    19255.79096
neighborhood_feNorthwood                     -96763.98083    23348.68911
neighborhood_feOgontz                        -38842.22862    20138.09837
neighborhood_feOld City                      443334.48569    45204.28297
neighborhood_feOld Kensington                -46454.88136    20353.62973
neighborhood_feOlney                         -68389.05352    17153.25904
neighborhood_feOverbrook                     -93077.05144    16987.27692
neighborhood_feOxford Circle                 -22701.66053    17820.65848
neighborhood_fePacker Park                     9689.89314    25692.21263
neighborhood_feParkwood Manor                 51865.84423    29534.39315
neighborhood_fePaschall                      -77498.68652    19758.48273
neighborhood_fePassyunk Square                 4243.48331    20608.82553
neighborhood_fePennsport                     -32232.31694    19668.11315
neighborhood_fePennypack                       -229.17130    23906.35132
neighborhood_fePennypack Woods                10601.67627    46721.77819
neighborhood_fePenrose                       -58590.52553    30805.52778
neighborhood_fePoint Breeze                  -63076.08386    17718.03825
neighborhood_feQueen Village                 110051.36389    21297.63346
neighborhood_feRhawnhurst                     13210.76897    21751.53460
neighborhood_feRichmond                      -28186.33567    15369.92197
neighborhood_feRittenhouse                   428527.90443    26394.39500
neighborhood_feRoxborough                     10264.37966    16135.89529
neighborhood_feRoxborough Park               -33966.75004    32811.40721
neighborhood_feSharswood                     -96773.71513    41275.66768
neighborhood_feSociety Hill                  350959.85928    26133.87223
neighborhood_feSomerton                       57933.82996    29085.20858
neighborhood_feSouthwest Germantown         -105279.23491    24716.20835
neighborhood_feSouthwest Schuylkill          -90431.06270    20983.94325
neighborhood_feSpring Garden                 178866.21603    25856.32707
neighborhood_feSpruce Hill                   134662.38807    30063.65847
neighborhood_feStadium District              -23294.11604    21985.18636
neighborhood_feStanton                      -147735.64393    20701.69637
neighborhood_feStrawberry Mansion           -107463.96962    20368.92240
neighborhood_feSummerdale                    -40151.90279    21740.17259
neighborhood_feTacony                        -38167.86864    19893.82475
neighborhood_feTioga                        -119532.35201    23356.75045
neighborhood_feTorresdale                    -23879.53124    27055.42084
neighborhood_feUpper Kensington              -95731.86958    18344.45529
neighborhood_feUpper Roxborough              -38230.35488    20207.99113
neighborhood_feWalnut Hill                   -59138.01438    30653.12012
neighborhood_feWashington Square West         88690.10153    29937.37012
neighborhood_feWest Central Germantown        16807.75200    27884.51013
neighborhood_feWest Kensington               -68127.87269    20317.08101
neighborhood_feWest Mount Airy                38179.51090    18510.39444
neighborhood_feWest Oak Lane                 -45315.59922    18269.26015
neighborhood_feWest Passyunk                 -79646.98859    19772.56079
neighborhood_feWest Poplar                  -110299.67792    43554.69769
neighborhood_feWest Powelton                -104040.85024    34399.77899
neighborhood_feWhitman                       -42920.67535    18272.47676
neighborhood_feWinchester Park                38808.93027    34718.30402
neighborhood_feWissahickon                    -5389.48604    20201.84628
neighborhood_feWissahickon Hills              12442.07991    32378.82072
neighborhood_feWissinoming                   -23043.18919    18367.46550
neighborhood_feWister                        -68117.95065    31454.19955
neighborhood_feWynnefield                    -99224.93028    20126.46981
neighborhood_feSmall Neighborhoods           -26240.56901    19047.76223
quarters_fe2                                   3325.18036     3378.65124
quarters_fe3                                   4564.25634     3438.64092
quarters_fe4                                   5216.49275     3566.47726
city_dist_mi:knn_amenity_mi                    1654.98301     2642.51941
city_dist_mi:septa_half_mi                     -332.55888       41.45544
                                           t value             Pr(>|t|)    
(Intercept)                                -36.443 < 0.0000000000000002 ***
ln_square_feet                              46.192 < 0.0000000000000002 ***
bath_num                                    18.991 < 0.0000000000000002 ***
ac_binary                                   12.623 < 0.0000000000000002 ***
fireplace_num                               22.918 < 0.0000000000000002 ***
story_num                                   -1.333             0.182558    
garage_num                                  22.579 < 0.0000000000000002 ***
basement_typeFull Finished                  11.493 < 0.0000000000000002 ***
basement_typeFull Semi-Finished              9.242 < 0.0000000000000002 ***
basement_typeFull Unfinished                 8.938 < 0.0000000000000002 ***
basement_typeFull Unknown Finish             7.682  0.00000000000001677 ***
basement_typePartial Finished                5.329  0.00000010024344867 ***
basement_typePartial Semi-Finished           4.645  0.00000343597723092 ***
basement_typePartial Unfinished              3.372             0.000749 ***
basement_typePartial Unknown Finish          2.912             0.003596 ** 
basement_typeUnknown Size Finished           4.241  0.00002235191311615 ***
basement_typeUnknown Size Unfinished         4.283  0.00001854901682208 ***
fuel_typeElectric                           -1.198             0.230938    
fuel_typeOil Heat                            0.231             0.816947    
fuel_typeOther                               0.171             0.864134    
age                                        -11.385 < 0.0000000000000002 ***
I(age^2)                                     7.884  0.00000000000000341 ***
medhhinc                                     7.452  0.00000000000009754 ***
pct_vacant                                  -3.446             0.000570 ***
pct_single_family                            1.021             0.307435    
city_dist_mi                                -1.497             0.134435    
septa_half_mi                                7.341  0.00000000000022310 ***
ln_park_dist                                -1.289             0.197573    
knn_amenity_mi                              -0.867             0.385785    
neighborhood_feAcademy Gardens               1.279             0.200775    
neighborhood_feAllegheny West               -4.111  0.00003966371497639 ***
neighborhood_feAndorra                      -1.204             0.228491    
neighborhood_feAston-Woodbridge              1.129             0.258973    
neighborhood_feBella Vista                   3.670             0.000244 ***
neighborhood_feBelmont                      -1.779             0.075282 .  
neighborhood_feBrewerytown                  -3.760             0.000170 ***
neighborhood_feBridesburg                    0.370             0.711503    
neighborhood_feBurholme                     -0.305             0.760428    
neighborhood_feBustleton                     1.088             0.276456    
neighborhood_feCarroll Park                 -3.871             0.000109 ***
neighborhood_feCedar Park                    2.357             0.018424 *  
neighborhood_feCedarbrook                   -0.796             0.425926    
neighborhood_feChestnut Hill                19.325 < 0.0000000000000002 ***
neighborhood_feClearview                    -2.653             0.007980 ** 
neighborhood_feCobbs Creek                  -4.776  0.00000180380477018 ***
neighborhood_feDickinson Narrows            -2.236             0.025354 *  
neighborhood_feDunlap                       -4.278  0.00001901511843808 ***
neighborhood_feEast Germantown              -2.501             0.012409 *  
neighborhood_feEast Kensington              -1.337             0.181300    
neighborhood_feEast Mount Airy              -2.165             0.030422 *  
neighborhood_feEast Oak Lane                -3.793             0.000149 ***
neighborhood_feEast Parkside                -2.252             0.024310 *  
neighborhood_feEast Passyunk                 0.938             0.348175    
neighborhood_feEastwick                     -1.894             0.058281 .  
neighborhood_feElmwood                      -3.607             0.000311 ***
neighborhood_feFairhill                     -1.916             0.055343 .  
neighborhood_feFairmount                     4.407  0.00001056260193704 ***
neighborhood_feFeltonville                  -3.235             0.001218 ** 
neighborhood_feFern Rock                    -2.351             0.018732 *  
neighborhood_feFishtown - Lower Kensington   0.630             0.528647    
neighborhood_feFitler Square                14.149 < 0.0000000000000002 ***
neighborhood_feFox Chase                    -0.056             0.955684    
neighborhood_feFrancisville                 -2.354             0.018584 *  
neighborhood_feFrankford                    -3.529             0.000419 ***
neighborhood_feFranklinville                -4.605  0.00000416765856661 ***
neighborhood_feGermantown - Morton          -2.776             0.005517 ** 
neighborhood_feGermantown - Penn Knox       -3.078             0.002088 ** 
neighborhood_feGermantown - Westside        -2.506             0.012230 *  
neighborhood_feGermany Hill                  0.392             0.695236    
neighborhood_feGirard Estates               -1.616             0.106147    
neighborhood_feGlenwood                     -4.184  0.00002885494372713 ***
neighborhood_feGraduate Hospital             4.883  0.00000105860760820 ***
neighborhood_feGrays Ferry                  -4.327  0.00001522694286549 ***
neighborhood_feGreenwich                    -2.544             0.010983 *  
neighborhood_feHaddington                   -4.961  0.00000071026962559 ***
neighborhood_feHarrowgate                   -3.273             0.001065 ** 
neighborhood_feHartranft                    -5.812  0.00000000630356266 ***
neighborhood_feHawthorne                     0.457             0.647958    
neighborhood_feHolmesburg                   -0.860             0.389915    
neighborhood_feHunting Park                 -2.409             0.016007 *  
neighborhood_feJuniata Park                 -2.757             0.005833 ** 
neighborhood_feKingsessing                  -5.824  0.00000000588257898 ***
neighborhood_feLawndale                     -2.202             0.027674 *  
neighborhood_feLexington Park                0.815             0.415160    
neighborhood_feLogan                        -5.163  0.00000024686521012 ***
neighborhood_feLogan Square                 12.562 < 0.0000000000000002 ***
neighborhood_feLower Moyamensing            -3.737             0.000187 ***
neighborhood_feManayunk                      1.558             0.119242    
neighborhood_feMantua                       -3.389             0.000704 ***
neighborhood_feMayfair                      -0.617             0.537574    
neighborhood_feMelrose Park Gardens         -2.133             0.032982 *  
neighborhood_feMill Creek                   -4.158  0.00003233743685558 ***
neighborhood_feMillbrook                     0.508             0.611126    
neighborhood_feModena                        1.957             0.050340 .  
neighborhood_feMorrell Park                  0.904             0.365999    
neighborhood_feNewbold                      -2.505             0.012245 *  
neighborhood_feNicetown                     -1.843             0.065371 .  
neighborhood_feNormandy Village              3.159             0.001585 ** 
neighborhood_feNorth Central                -4.593  0.00000439924270724 ***
neighborhood_feNorthern Liberties            0.140             0.888798    
neighborhood_feNorthwood                    -4.144  0.00003428980342820 ***
neighborhood_feOgontz                       -1.929             0.053777 .  
neighborhood_feOld City                      9.807 < 0.0000000000000002 ***
neighborhood_feOld Kensington               -2.282             0.022482 *  
neighborhood_feOlney                        -3.987  0.00006727698685700 ***
neighborhood_feOverbrook                    -5.479  0.00000004347320880 ***
neighborhood_feOxford Circle                -1.274             0.202722    
neighborhood_fePacker Park                   0.377             0.706066    
neighborhood_feParkwood Manor                1.756             0.079091 .  
neighborhood_fePaschall                     -3.922  0.00008813490635269 ***
neighborhood_fePassyunk Square               0.206             0.836867    
neighborhood_fePennsport                    -1.639             0.101276    
neighborhood_fePennypack                    -0.010             0.992352    
neighborhood_fePennypack Woods               0.227             0.820496    
neighborhood_fePenrose                      -1.902             0.057199 .  
neighborhood_fePoint Breeze                 -3.560             0.000372 ***
neighborhood_feQueen Village                 5.167  0.00000024082503649 ***
neighborhood_feRhawnhurst                    0.607             0.543629    
neighborhood_feRichmond                     -1.834             0.066696 .  
neighborhood_feRittenhouse                  16.236 < 0.0000000000000002 ***
neighborhood_feRoxborough                    0.636             0.524708    
neighborhood_feRoxborough Park              -1.035             0.300588    
neighborhood_feSharswood                    -2.345             0.019063 *  
neighborhood_feSociety Hill                 13.429 < 0.0000000000000002 ***
neighborhood_feSomerton                      1.992             0.046406 *  
neighborhood_feSouthwest Germantown         -4.260  0.00002062292616807 ***
neighborhood_feSouthwest Schuylkill         -4.310  0.00001647368688246 ***
neighborhood_feSpring Garden                 6.918  0.00000000000479383 ***
neighborhood_feSpruce Hill                   4.479  0.00000755137153481 ***
neighborhood_feStadium District             -1.060             0.289374    
neighborhood_feStanton                      -7.136  0.00000000000100613 ***
neighborhood_feStrawberry Mansion           -5.276  0.00000013413070739 ***
neighborhood_feSummerdale                   -1.847             0.064783 .  
neighborhood_feTacony                       -1.919             0.055058 .  
neighborhood_feTioga                        -5.118  0.00000031349740898 ***
neighborhood_feTorresdale                   -0.883             0.377460    
neighborhood_feUpper Kensington             -5.219  0.00000018293556101 ***
neighborhood_feUpper Roxborough             -1.892             0.058533 .  
neighborhood_feWalnut Hill                  -1.929             0.053718 .  
neighborhood_feWashington Square West        2.963             0.003057 ** 
neighborhood_feWest Central Germantown       0.603             0.546676    
neighborhood_feWest Kensington              -3.353             0.000801 ***
neighborhood_feWest Mount Airy               2.063             0.039170 *  
neighborhood_feWest Oak Lane                -2.480             0.013134 *  
neighborhood_feWest Passyunk                -4.028  0.00005651845672809 ***
neighborhood_feWest Poplar                  -2.532             0.011338 *  
neighborhood_feWest Powelton                -3.024             0.002495 ** 
neighborhood_feWhitman                      -2.349             0.018842 *  
neighborhood_feWinchester Park               1.118             0.263662    
neighborhood_feWissahickon                  -0.267             0.789641    
neighborhood_feWissahickon Hills             0.384             0.700787    
neighborhood_feWissinoming                  -1.255             0.209658    
neighborhood_feWister                       -2.166             0.030357 *  
neighborhood_feWynnefield                   -4.930  0.00000083160419737 ***
neighborhood_feSmall Neighborhoods          -1.378             0.168343    
quarters_fe2                                 0.984             0.325047    
quarters_fe3                                 1.327             0.184417    
quarters_fe4                                 1.463             0.143587    
city_dist_mi:knn_amenity_mi                  0.626             0.531135    
city_dist_mi:septa_half_mi                  -8.022  0.00000000000000112 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 138300 on 13723 degrees of freedom
Multiple R-squared:  0.7501,    Adjusted R-squared:  0.7472 
F-statistic:   259 on 159 and 13723 DF,  p-value: < 0.00000000000000022

Code

# Step model to check for most influential predictors.
step_model <- step(final_model, direction = "both")

Start:  AIC=328831.2
sale_price ~ ln_square_feet + bath_num + ac_binary + fireplace_num + 
    story_num + garage_num + basement_type + fuel_type + age + 
    I(age^2) + medhhinc + pct_vacant + pct_single_family + city_dist_mi + 
    septa_half_mi + ln_park_dist + knn_amenity_mi + knn_amenity_mi * 
    city_dist_mi + septa_half_mi * city_dist_mi + neighborhood_fe + 
    quarters_fe

                               Df      Sum of Sq             RSS    AIC
- fuel_type                     3    29291306230 262526956909507 328827
- quarters_fe                   3    48687286101 262546352889378 328828
- city_dist_mi:knn_amenity_mi   1     7502864212 262505168467489 328830
- pct_single_family             1    19926739927 262517592343204 328830
- ln_park_dist                  1    31760364605 262529425967882 328831
- story_num                     1    33988307880 262531653911157 328831
<none>                                           262497665603277 328831
- pct_vacant                    1   227177637523 262724843240800 328841
- medhhinc                      1  1062196974143 263559862577420 328885
- I(age^2)                      1  1188920399920 263686586003196 328892
- city_dist_mi:septa_half_mi    1  1230978421267 263728644024543 328894
- age                           1  2479426316334 264977091919610 328960
- ac_binary                     1  3047985297556 265545650900832 328990
- basement_type                10  5090476653267 267588142256544 329078
- bath_num                      1  6898841085908 269396506689184 329189
- garage_num                    1  9751992536383 272249658139660 329336
- fireplace_num                 1 10046915002573 272544580605849 329351
- ln_square_feet                1 40814913228775 303312578832052 330836
- neighborhood_fe             126 54479351260673 316977016863950 331197

Step:  AIC=328826.8
sale_price ~ ln_square_feet + bath_num + ac_binary + fireplace_num + 
    story_num + garage_num + basement_type + age + I(age^2) + 
    medhhinc + pct_vacant + pct_single_family + city_dist_mi + 
    septa_half_mi + ln_park_dist + knn_amenity_mi + neighborhood_fe + 
    quarters_fe + city_dist_mi:knn_amenity_mi + city_dist_mi:septa_half_mi

                               Df      Sum of Sq             RSS    AIC
- quarters_fe                   3    47663097716 262574620007222 328823
- city_dist_mi:knn_amenity_mi   1     7728079582 262534684989088 328825
- pct_single_family             1    19707606054 262546664515561 328826
- ln_park_dist                  1    32231192502 262559188102009 328827
- story_num                     1    33640799011 262560597708518 328827
<none>                                           262526956909507 328827
+ fuel_type                     3    29291306230 262497665603277 328831
- pct_vacant                    1   231494066009 262758450975516 328837
- medhhinc                      1  1068136777060 263595093686567 328881
- I(age^2)                      1  1187398099796 263714355009303 328887
- city_dist_mi:septa_half_mi    1  1235288890463 263762245799970 328890
- age                           1  2473620812909 265000577722415 328955
- ac_binary                     1  3111535277414 265638492186920 328988
- basement_type                10  5116255704022 267643212613528 329075
- bath_num                      1  6880305741509 269407262651016 329184
- garage_num                    1  9775449552879 272302406462386 329332
- fireplace_num                 1 10086445343028 272613402252535 329348
- ln_square_feet                1 41033877166244 303560834075751 330841
- neighborhood_fe             126 54518512496951 317045469406457 331194

Step:  AIC=328823.3
sale_price ~ ln_square_feet + bath_num + ac_binary + fireplace_num + 
    story_num + garage_num + basement_type + age + I(age^2) + 
    medhhinc + pct_vacant + pct_single_family + city_dist_mi + 
    septa_half_mi + ln_park_dist + knn_amenity_mi + neighborhood_fe + 
    city_dist_mi:knn_amenity_mi + city_dist_mi:septa_half_mi

                               Df      Sum of Sq             RSS    AIC
- city_dist_mi:knn_amenity_mi   1     7640688841 262582260696063 328822
- pct_single_family             1    20067833883 262594687841105 328822
- ln_park_dist                  1    32005690287 262606625697509 328823
- story_num                     1    33899752615 262608519759837 328823
<none>                                           262574620007222 328823
+ quarters_fe                   3    47663097716 262526956909507 328827
+ fuel_type                     3    28267117845 262546352889378 328828
- pct_vacant                    1   232781450230 262807401457452 328834
- medhhinc                      1  1070794141050 263645414148272 328878
- I(age^2)                      1  1188059138702 263762679145925 328884
- city_dist_mi:septa_half_mi    1  1240320543243 263814940550465 328887
- age                           1  2476861470042 265051481477264 328952
- ac_binary                     1  3114048174200 265688668181422 328985
- basement_type                10  5101378305872 267675998313094 329070
- bath_num                      1  6879515547403 269454135554626 329180
- garage_num                    1  9761217229099 272335837236321 329328
- fireplace_num                 1 10086091959024 272660711966246 329345
- ln_square_feet                1 41027723058193 303602343065415 330837
- neighborhood_fe             126 54533360011621 317107980018843 331191

Step:  AIC=328821.7
sale_price ~ ln_square_feet + bath_num + ac_binary + fireplace_num + 
    story_num + garage_num + basement_type + age + I(age^2) + 
    medhhinc + pct_vacant + pct_single_family + city_dist_mi + 
    septa_half_mi + ln_park_dist + knn_amenity_mi + neighborhood_fe + 
    city_dist_mi:septa_half_mi

                               Df      Sum of Sq             RSS    AIC
- knn_amenity_mi                1    10258397228 262592519093292 328820
- pct_single_family             1    19589044073 262601849740136 328821
- ln_park_dist                  1    31831658301 262614092354365 328821
- story_num                     1    32448748354 262614709444417 328821
<none>                                           262582260696063 328822
+ city_dist_mi:knn_amenity_mi   1     7640688841 262574620007222 328823
+ quarters_fe                   3    47575706975 262534684989088 328825
+ fuel_type                     3    28489340452 262553771355611 328826
- pct_vacant                    1   228890973776 262811151669839 328832
- medhhinc                      1  1123486534387 263705747230451 328879
- I(age^2)                      1  1189659359701 263771920055764 328882
- city_dist_mi:septa_half_mi    1  1431158316586 264013419012649 328895
- age                           1  2484992093025 265067252789089 328950
- ac_binary                     1  3117445785849 265699706481912 328984
- basement_type                10  5100919334738 267683180030801 329069
- bath_num                      1  6872409243178 269454669939241 329178
- garage_num                    1  9754485171728 272336745867792 329326
- fireplace_num                 1 10078758344826 272661019040889 329343
- ln_square_feet                1 41039873091006 303622133787069 330836
- neighborhood_fe             126 54611581815584 317193842511647 331193

Step:  AIC=328820.3
sale_price ~ ln_square_feet + bath_num + ac_binary + fireplace_num + 
    story_num + garage_num + basement_type + age + I(age^2) + 
    medhhinc + pct_vacant + pct_single_family + city_dist_mi + 
    septa_half_mi + ln_park_dist + neighborhood_fe + city_dist_mi:septa_half_mi

                              Df      Sum of Sq             RSS    AIC
- pct_single_family            1    17516329942 262610035423233 328819
- story_num                    1    31989214042 262624508307333 328820
- ln_park_dist                 1    33881262942 262626400356233 328820
<none>                                          262592519093292 328820
+ knn_amenity_mi               1    10258397228 262582260696063 328822
+ quarters_fe                  3    47914194039 262544604899253 328824
+ fuel_type                    3    28552031911 262563967061380 328825
- pct_vacant                   1   228071955219 262820591048511 328830
- medhhinc                     1  1154301728799 263746820822091 328879
- I(age^2)                     1  1187867988165 263780387081456 328881
- city_dist_mi:septa_half_mi   1  1428441824517 264020960917808 328894
- age                          1  2483425354156 265075944447448 328949
- ac_binary                    1  3115319852620 265707838945911 328982
- basement_type               10  5093400845586 267685919938877 329067
- bath_num                     1  6873267530482 269465786623774 329177
- garage_num                   1  9744591721973 272337110815265 329324
- fireplace_num                1 10068500036056 272661019129347 329341
- ln_square_feet               1 41051759221484 303644278314775 330835
- neighborhood_fe            126 54934695719999 317527214813290 331205

Step:  AIC=328819.2
sale_price ~ ln_square_feet + bath_num + ac_binary + fireplace_num + 
    story_num + garage_num + basement_type + age + I(age^2) + 
    medhhinc + pct_vacant + city_dist_mi + septa_half_mi + ln_park_dist + 
    neighborhood_fe + city_dist_mi:septa_half_mi

                              Df      Sum of Sq             RSS    AIC
- story_num                    1    33952275751 262643987698984 328819
- ln_park_dist                 1    35780677027 262645816100260 328819
<none>                                          262610035423233 328819
+ pct_single_family            1    17516329942 262592519093292 328820
+ knn_amenity_mi               1     8185683097 262601849740136 328821
+ quarters_fe                  3    48218416368 262561817006865 328823
+ fuel_type                    3    28307547225 262581727876008 328824
- pct_vacant                   1   225660530490 262835695953723 328829
- I(age^2)                     1  1189715385678 263799750808911 328880
- medhhinc                     1  1212143043197 263822178466430 328881
- city_dist_mi:septa_half_mi   1  1465288887215 264075324310448 328894
- age                          1  2483423804818 265093459228052 328948
- ac_binary                    1  3125197709254 265735233132488 328981
- basement_type               10  5157061941754 267767097364987 329069
- bath_num                     1  6886217998672 269496253421905 329177
- garage_num                   1  9861971915501 272472007338734 329329
- fireplace_num                1 10154625988422 272764661411656 329344
- ln_square_feet               1 41239244968238 303849280391471 330842
- neighborhood_fe            126 54917360702734 317527396125968 331203

Step:  AIC=328819
sale_price ~ ln_square_feet + bath_num + ac_binary + fireplace_num + 
    garage_num + basement_type + age + I(age^2) + medhhinc + 
    pct_vacant + city_dist_mi + septa_half_mi + ln_park_dist + 
    neighborhood_fe + city_dist_mi:septa_half_mi

                              Df      Sum of Sq             RSS    AIC
- ln_park_dist                 1    37088779964 262681076478948 328819
<none>                                          262643987698984 328819
+ story_num                    1    33952275751 262610035423233 328819
+ pct_single_family            1    19479391651 262624508307333 328820
+ knn_amenity_mi               1     7661975952 262636325723033 328821
+ quarters_fe                  3    48485012070 262595502686914 328822
+ fuel_type                    3    27891291636 262616096407348 328824
- pct_vacant                   1   223078062952 262867065761936 328829
- I(age^2)                     1  1171351745449 263815339444434 328879
- medhhinc                     1  1206630741698 263850618440683 328881
- city_dist_mi:septa_half_mi   1  1487117530887 264131105229872 328895
- age                          1  2478986749323 265122974448307 328947
- ac_binary                    1  3116812315900 265760800014885 328981
- basement_type               10  5124404684650 267768392383634 329067
- bath_num                     1  6865474227828 269509461926812 329175
- garage_num                   1 10077721127271 272721708826255 329340
- fireplace_num                1 10194706863620 272838694562604 329346
- ln_square_feet               1 46784996837845 309428984536830 331093
- neighborhood_fe            126 54884887831956 317528875530940 331202

Step:  AIC=328819
sale_price ~ ln_square_feet + bath_num + ac_binary + fireplace_num + 
    garage_num + basement_type + age + I(age^2) + medhhinc + 
    pct_vacant + city_dist_mi + septa_half_mi + neighborhood_fe + 
    city_dist_mi:septa_half_mi

                              Df      Sum of Sq             RSS    AIC
<none>                                          262681076478948 328819
+ ln_park_dist                 1    37088779964 262643987698984 328819
+ story_num                    1    35260378688 262645816100260 328819
+ pct_single_family            1    21554786315 262659521692633 328820
+ knn_amenity_mi               1     9414440263 262671662038685 328820
+ quarters_fe                  3    48291130287 262632785348661 328822
+ fuel_type                    3    28384069019 262652692409930 328823
- pct_vacant                   1   233928798083 262915005277031 328829
- I(age^2)                     1  1166982321537 263848058800485 328878
- medhhinc                     1  1199519985435 263880596464384 328880
- city_dist_mi:septa_half_mi   1  1547382207846 264228458686795 328898
- age                          1  2471158435449 265152234914397 328947
- ac_binary                    1  3093083166434 265774159645382 328979
- basement_type               10  5119477792449 267800554271397 329067
- bath_num                     1  6862714567634 269543791046582 329175
- garage_num                   1 10067293974161 272748370453110 329339
- fireplace_num                1 10167733347582 272848809826531 329344
- ln_square_feet               1 46823860752575 309504937231523 331094
- neighborhood_fe            126 56125236061007 318806312539955 331255

Code

# Compute TSS
y <- model.response(model.frame(step_model))
tss <- sum((y - mean(y))^2)

# Sequential (Type I) Sum Sq → ΔR²
anova_model <- anova(step_model)
anova_model <- anova_model[!is.na(anova_model$"Sum Sq"), , drop = FALSE]
seq_imp <- transform(anova_model,
                     term = rownames(anova_model),
                     delta_R2 = `Sum Sq` / tss)

# Get top 5 predictors
seq_top4 <- seq_imp[order(-seq_imp$delta_R2), c("term", "Df", "delta_R2")]
head(seq_top4, 5)

                           term    Df   delta_R2
ln_square_feet   ln_square_feet     1 0.40682553
Residuals             Residuals 13734 0.25010793
neighborhood_fe neighborhood_fe   126 0.06662910
medhhinc               medhhinc     1 0.06411606
bath_num               bath_num     1 0.05055231

Code

# Residual map preparation

# Match CRS
philly_neighborhoods <- st_transform(philly_neighborhoods, st_crs(final_data))

# Spatial join: assign each point to a neighborhood
points_with_neighborhood <- st_join(final_data, philly_neighborhoods)

# Add residual column
points_with_neighborhood$residuals <- residuals(final_model)

# Calculate average residual by neighborhood
neighborhood_residuals <- points_with_neighborhood %>%
  st_drop_geometry() %>%
  group_by(MAPNAME) %>%
  summarise(
    mean_residual = mean(residuals, na.rm = TRUE),
    median_residual = median(residuals, na.rm = TRUE),
    n_sales = n()
  )

# Join to neighborhoods
neighborhoods_with_residuals <- philly_neighborhoods %>%
  left_join(neighborhood_residuals, by = "MAPNAME")

Code

# Map the averaged residuals

neighborhood_map <- ggplot(neighborhoods_with_residuals) +
  geom_sf(aes(fill = mean_residual), color = "black", size = 0.3) +
  scale_fill_gradient2(
    low = "blue2", 
    mid = "#f5f4f0", 
    high = "red2",
    midpoint = 0,
    labels = scales::dollar,
    name = "Mean Residual ($)"
  ) +
  theme_minimal() +
  labs(
    title = "Average Model Residuals by Neighborhood",
    subtitle = "Red = Under-Predicted | Blue = Over-Predicted"
  ) +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    panel.grid = element_blank()
  )

neighborhood_map

Code

#ggsave("slide_images/residual-neighborhood.png", plot = neighborhood_map, width = 8, height = 6, units = "in", dpi = 300)

University City is the hardest to predict, Penn owns a lot of property that doesn’t get taxed. And some less wealthy and disinvested neighborhoods are overvalued, like Parkside and Wynnefield in West Philadelphia, but the overall model predicts pretty accurately for most neighborhoods in Philadelphia.

Cross-Validation

10-Fold Cross-Validation

Code

library(caret)
# 10-fold cross-validation
# first model
set.seed(0)

train_control <- trainControl(method = "cv", number = 10, savePredictions="final")

cv_model_1 <- train(sale_price ~ ln_square_feet + bath_num + # Structural.
                    ac_binary + fireplace_num + story_num + garage_num + # Structural.
                    basement_type + fuel_type + # Categorical structural.
                    age + I(age^2), # Age polynomial.
                    data = final_data,
                    method = "lm",
                    trControl = train_control)

cv_model_1$results

  intercept     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
1      TRUE 176344.9 0.5890063 102820.4 17348.48 0.04466116 3386.644

Code

library(caret)
# 10-fold cross-validation
# second model
set.seed(0)

train_control <- trainControl(method = "cv", number = 10, savePredictions="final")

cv_model_2 <- train(sale_price ~ ln_square_feet + bath_num + # Structural.
                    ac_binary + fireplace_num + story_num + garage_num + # Structural.
                    basement_type + fuel_type + # Categorical structural.
                    age + I(age^2) + # Age polynomial.
                    medhhinc + pct_vacant + pct_single_family, # census
                    data = final_data,
                    method = "lm",
                    trControl = train_control)

cv_model_2$results

  intercept     RMSE  Rsquared      MAE  RMSESD RsquaredSD    MAESD
1      TRUE 161630.5 0.6544361 88848.47 20418.2 0.05751453 3576.953

Code

library(caret)
# 10-fold cross-validation
# third model
set.seed(0)

train_control <- trainControl(method = "cv", number = 10, savePredictions="final")

cv_model_3 <- train(sale_price ~ ln_square_feet + bath_num + # Structural.
                    ac_binary + fireplace_num + story_num + garage_num + # Structural.
                    basement_type + fuel_type + # Categorical structural.
                    age + I(age^2) + # Age polynomial.
                    medhhinc + pct_vacant + pct_single_family +
                    city_dist_mi + septa_half_mi + ln_park_dist + knn_amenity_mi, # Spatial
                    data = final_data,
                    method = "lm",
                    trControl = train_control)

cv_model_3$results

  intercept     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
1      TRUE 154443.5 0.6843236 84936.68 20818.25 0.05820868 3103.814

Code

library(caret)
# 10-fold cross-validation
# third model
set.seed(0)

train_control <- trainControl(method = "cv", number = 10, savePredictions="final")

cv_model_4 <- train(sale_price ~ ln_square_feet + bath_num + # Structural.
                    ac_binary + fireplace_num + story_num + garage_num + # Structural.
                    basement_type + fuel_type + # Categorical structural.
                    age + I(age^2) + # Age polynomial.
                    medhhinc + pct_vacant + pct_single_family +
                    city_dist_mi + septa_half_mi + ln_park_dist + knn_amenity_mi + # Spatial
                    knn_amenity_mi * city_dist_mi + septa_half_mi * city_dist_mi + # Interaction between amenities and downtown distance.
                    neighborhood_fe, # Fixed effect  
                    data = final_data,
                    method = "lm",
                    trControl = train_control)

cv_model_4$results

  intercept     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
1      TRUE 138279.4 0.7459614 72567.76 22998.06 0.06401242 2898.615

Code

# Compare all 4 models

## Combine four models：cv_model_1, cv_model_2, cv_model_3, cv_model_4
model_compare <- bind_rows(
  cv_model_1$results %>% 
    mutate(Model = "Model 1: Structural"),
  cv_model_2$results %>% 
    mutate(Model = "Model 2: Structural + Census"),
  cv_model_3$results %>% 
    mutate(Model = "Model 3: Structural + Census + Spatial"),
  cv_model_4$results %>% 
    mutate(Model = "Model 4: Final Model")
) %>%
  select(Model, RMSE, Rsquared, MAE) %>% 
  mutate(across(c(RMSE, Rsquared, MAE), round, 3))

## Plot
model_kable <- kable(model_compare,
                     col.names = c("Model", "RMSE ($)", "R²", "MAE ($)"),
                     caption = "Model Performance Comparison (10-Fold Cross-Validation)",
                     digits = 4,
                     format.args = list(big.mark = ",")
) %>%
  kable_styling(latex_options = "striped", full_width = FALSE) %>%
  column_spec(1, bold = TRUE, width = "10cm") %>%
  row_spec(0, color = "#f5f4f0", background = "#ff4100")

model_kable

Model Performance Comparison (10-Fold Cross-Validation)
Model	RMSE ($)	R²	MAE ($)
Model 1: Structural	176,344.9	0.589	102,820.38
Model 2: Structural + Census	161,630.5	0.654	88,848.47
Model 3: Structural + Census + Spatial	154,443.5	0.684	84,936.68
Model 4: Final Model	138,279.4	0.746	72,567.76

Code

# Create predicted vs. actual plot

pred_df <- cv_model_4$pred

# Plot Predicted vs Actual 
pred_v_act <- ggplot(pred_df, aes(x = obs, y = pred)) +
  geom_point(alpha = 0.5, color = "#2C7BB6") +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  labs(
    title = "Predicted vs. Actual Sale Price",
    x = "Actual Sale Price ($)",
    y = "Predicted Sale Price ($)"
  ) +
  scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14, margin = margin(b = 10)),
    panel.grid.minor = element_blank(),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
  )

pred_v_act

Model Diagnostics

Check Assumptions

Code

resid_df <- data.frame(
  fitted = fitted(final_model),
  residuals = resid(final_model)
)

resid_v_fit <- ggplot(resid_df, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.5, color = "#2C7BB6") +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(
    title = "Residuals vs. Fitted Values",
    x = "Fitted Values (Predicted Sale Price)",
    y = "Residuals"
  ) +
  scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14, margin = margin(b = 10)),
    panel.grid.minor = element_blank(),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
  )

resid_v_fit

Residuals vs Fitted Values:

The residual plot shows that most residuals are centered around zero, but the spread of residuals increases as the fitted values grow. This “funnel-shaped” pattern suggests potential heteroskedasticity, meaning the variance of errors may increase for higher-priced properties. While the overall linearity assumption appears reasonable, the increasing dispersion indicates that the model’s prediction error is not constant across the price range.

Code

resid_df <- data.frame(residuals = residuals(final_model))

# Q-Q Plot
qq_plot <- ggplot(resid_df, aes(sample = residuals)) +
  stat_qq(color = "#2C7BB6", alpha = 0.6, size = 2) +      
  stat_qq_line(color = "red", linetype = "dashed") +        
  labs(
    title = "Normal Q-Q Plot of Residuals",
    subtitle = "Check for normality assumption.",
    x = "Theoretical Quantiles (Normal Distribution)",
    y = "Sample Quantiles (Residuals)"
  ) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14, margin = margin(b = 10)),
    plot.subtitle = element_text(size = 11),
    panel.grid.minor = element_blank(),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
  )

qq_plot

Normal Q–Q Plot:

The Q–Q plot reveals that the residuals deviate from the reference line in both tails, especially in the upper tail. This pattern indicates that the residuals are right-skewed and not perfectly normally distributed. The deviation is mainly driven by a small number of very high sale-price observations, which pull the residual distribution upward. However, moderate departures from normality are common in housing price data and generally do not invalidate the model.

Code

cd <- cooks.distance(final_model)

used_row_idx <- as.integer(rownames(model.frame(final_model)))

cooks_df <- tibble(
  row_in_data  = used_row_idx,           
  row_in_model = seq_along(cd),          
  cooks_distance = as.numeric(cd)
)

n_used <- length(cd)
threshold <- 4 / n_used

cooks_plot <- ggplot(cooks_df, aes(x = row_in_model, y = cooks_distance)) +
  geom_bar(stat = "identity", fill = "#2C7BB6", alpha = 0.6) +
  geom_hline(yintercept = threshold, color = "red", linetype = "dashed") +
  coord_cartesian(ylim = c(0, 0.02)) +  
  labs(
    title = "Cook's Distance (Zoomed In)",
    subtitle = paste0("Red Dashed Line = 4/n ≈ ", round(threshold, 5)),
    x = "Observation Index (In-Model)",
    y = "Cook's Distance"
  ) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14, margin = margin(b = 10)),
    plot.subtitle = element_text(size = 11),
    panel.grid.minor = element_blank(),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
  )

cooks_plot

Code

# Most influential
top_influential <- cooks_df %>%
  filter(cooks_distance > threshold) %>%
  arrange(desc(cooks_distance)) %>%
  slice_head(n = 10) %>%
  mutate(
    sale_price = final_data$sale_price[row_in_data]   
  ) %>%
  select(row_in_model, row_in_data, sale_price, cooks_distance)

top_influential

# A tibble: 10 × 4
   row_in_model row_in_data sale_price cooks_distance
          <int>       <int>      <dbl>          <dbl>
 1         6113        6113    3600000         0.197 
 2         6383        6383    5477901         0.129 
 3          198         198    3330400         0.0459
 4         8210        8211     480000         0.0454
 5         2496        2496    3995000         0.0395
 6        10469       10470     281000         0.0356
 7         7913        7914     330000         0.0325
 8         2620        2620    3000000         0.0283
 9         1921        1921     170000         0.0240
10         5439        5439    3850000         0.0233

Cook’s Distance:

The Cook’s distance plot shows that almost all observations have very small influence values (below the 4/n threshold), indicating that the model is not dominated by a few extreme points. A few cases exhibit slightly higher Cook’s D values, suggesting the presence of mildly influential outliers, but none appear to exert excessive leverage on the regression coefficients.

Detailed Discussion

Our final model achieves a cross-validated R² of 0.746, explaining approximately 75% of the variance in Philadelphia residential sale prices. The Mean Absolute Error (MAE) of 72,299 USD indicates that, on average, predicted sale prices deviate from actual prices by roughly 29% of the median home price (250,000 USD). However, the Root Mean Squared Error (RMSE) of 138,257USD—nearly double the MAE—reveals that the model struggles disproportionately with high-value properties. This discrepancy, combined with residuals ranging from -817,701 USD to +5.1 million USD, reflects the outsized influence of luxury homes on overall error. Diagnostic plots confirm these patterns: the Q-Q plot shows deviation from normality in both tails (especially the upper tail), while the residuals vs. fitted values plot exhibits a funnel-shaped pattern indicating heteroskedasticity—prediction error variance increases systematically for higher-priced properties. Despite these issues, the median residual of 2,339 USD (near zero) suggests the model’s predictions are generally unbiased for typical homes, and Cook’s distance values remain well below concerning thresholds, indicating no single observation dominates the model.

The model’s minimum and maximum residuals range from -817,701 USD to +5.1 million USD, reflecting the outlier influence from luxury properties. While simultaneously referencing the QQ Plot, the model reflects two tails in the plot, but more in the positive region, indicating the model is not 100% normal. However, the residual distribution with the median residual of 2,339 USD is fairly close to 0, meaning the model predictions are generally centered about sale prices with limited bias in sale prediction. It is imperative to keep in mind that in the residuals vs fitted values plot, the increase of observations fanning-out as the predicted sale price value increases indicates the existence of heteroskedasticity between some variables.

We have concluded that ln_square_feet, neighborhood fixed effects, median household income, number of bathrooms and number of fireplaces as the most significant variables in our model. We calculated this by doing a stepwise regression, and these five yield the highest $R^2$.

We grossly underpredicted for University City, and a decent amount for Fairmount. This model struggles to accurately predict rural areas and overpredicted uninvested neighborhoods. The severe underprediction of University City may be due to the existence and proximity to Drexel University and the University of Pennsylvania.

External Resources

Relevant Readings

Artificial Intelligence

Claude: For code debugging in data cleaning and visualizations.
Chat GPT: For code debugging in data cleaning and visualizations.