Midterm Challenge: Philadelphia Housing Price Prediction

MUSA 5080 - Public Policy Analytics

Overview

Due Date: October 27, 2025

In-Class Presentations: October 27, 2025 (5 minutes per team)

Weight: 15% of final grade

Team: You’ll work with your table-mates as a team. Feel free to delegate. Everyone should upload their final products onto their own portfolio websites. Be sure to acknowledge your team-mates.

Submission Format:

Presentation Slides (.qmd → revealjs, ~10-15 slides) - Main deliverable (see my weekly lecture notes for inspiration!)
Technical Appendix (.qmd → HTML document) - Supporting details

The Challenge

You are a consultancy (please name your consultancy) competing to win a bid for a project with the Philadelphia Office of Property Assessment. The city wants to improve its Automated Valuation Model (AVM) for property tax assessments. Your task is to build a predictive model for residential sale prices and present your findings to city officials in a 5-minute briefing.

Deliverables:

Presentation slides (10-15 slides MAX) - Your main findings for stakeholders
Technical appendix (HTML document) - All code, diagnostics, and detailed analysis
5-minute in-class presentation - Deliver your slides on October 27th

Your goal: Predict 2023-2024 home sale prices accurately while communicating findings clearly to a policy audience.

Two-Part Submission

Part A: Presentation Slides (Primary Deliverable)

Format: Quarto revealjs presentation

Example YAML:

---
title: "Philadelphia Housing Price Prediction"
subtitle: "Improving Property Tax Assessments"
author: "Your Name"
format: 
  revealjs:
    theme: simple
    slide-number: true
    smaller: true
---

Content: ~10-15 slides (could be fewer!) covering:

1. Research question & motivation (1-2 slides)
2. Data overview (1 slide)
3. Key visualizations (2-3 slides)
4. Model comparison results (1-2 slides)
5. Main findings (2-3 slides)
6. Policy recommendations (1-2 slides)

Audience: City officials who don’t know R and don’t care about the nitty-gritty of stats. They just want the best estimates possible.

No code in these slides - just polished visualizations and key takeaways

Part B: Technical Appendix (Supporting Documentation)

Format: Quarto HTML document

Example YAML:

---
title: "Philadelphia Housing Model - Technical Appendix"
author: "Your Name"
format: 
  html:
    code-fold: show
    toc: true
    toc-location: left
    theme: cosmo
---

Content: All the technical details:

- Complete data cleaning code
- All EDA visualizations
- Feature engineering code
- Full model outputs
- Diagnostic plots
- Detailed interpretations

Audience: Data scientists and technical reviewers

All code visible - this is where you show your work

Data Sources

Primary Dataset: Philadelphia Property Sales

Source: Philadelphia Property Sales

This dataset contains actual property sales with:

Sale price
Sale date
Property characteristics (bedrooms, bathrooms, sq ft, etc.)
Property location (address, coordinates)

You will need to:

Download the data
Clean it (missing values, outliers, data errors)
Filter to 2023-2024 residential sales only

Secondary Datasets (You Choose!)

Required: Browse the OpenDataPhilly portal and use Census data to incorporate spatial features into your model.

Your task: Think like an urban planner. What location factors matter for housing prices in Philadelphia?

Assignment Structure

Your work should follow this workflow, with results split between presentation and appendix:

Phase 1: Data Preparation (Technical Appendix)

Load and clean Philadelphia sales data:

Filter to residential properties, 2023-2024 sales
Remove obvious errors
Handle missing values
Document all cleaning decisions
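
One way to document cleaning decisions is to apply each rule as a named step and log the row count after it, which produces the before/after table directly. The sketch below uses a tiny made-up table; the column names, the residential labels, and the $10,000 price floor are all illustrative assumptions, not values taken from the actual dataset.

```r
# Toy stand-in for the downloaded sales table. Column names and values
# are assumptions for illustration only.
sales_raw <- data.frame(
  sale_date          = as.Date(c("2022-11-03", "2023-02-14", "2023-07-01",
                                 "2024-05-20", "2024-09-09", "2024-12-30")),
  sale_price         = c(250000, 1, 310000, 425000, NA, 189000),
  category           = c("SINGLE FAMILY", "SINGLE FAMILY", "COMMERCIAL",
                         "SINGLE FAMILY", "MULTI FAMILY", "SINGLE FAMILY"),
  total_livable_area = c(1200, 1100, 5000, 0, 1800, 950)
)

residential <- c("SINGLE FAMILY", "MULTI FAMILY")  # assumed residential labels

# Each cleaning rule is a named function so the log explains itself
steps <- list(
  "Raw data" = function(d) d,
  "Sold in 2023-2024" = function(d) {
    d[d$sale_date >= as.Date("2023-01-01") &
      d$sale_date <= as.Date("2024-12-31"), ]
  },
  "Residential only" = function(d) d[d$category %in% residential, ],
  "Non-missing price" = function(d) d[!is.na(d$sale_price), ],
  "Price above $10,000 (assumed floor)" = function(d) d[d$sale_price > 10000, ],
  "Positive livable area" = function(d) d[d$total_livable_area > 0, ]
)

# Apply the rules in order and record the row count after each one
sales_clean  <- sales_raw
cleaning_log <- data.frame(step = character(), rows = integer())
for (nm in names(steps)) {
  sales_clean  <- steps[[nm]](sales_clean)
  cleaning_log <- rbind(cleaning_log,
                        data.frame(step = nm, rows = nrow(sales_clean)))
}
print(cleaning_log)
```

The same pattern extends to any rule your team chooses; the narrative in the appendix then only has to justify each named step.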

Load secondary data:

Census data (tidycensus)
Spatial amenities (OpenDataPhilly)
Join to sales data appropriately
Make sure you have the correct CRS!

Deliverable (Appendix only):

Complete data cleaning code
Summary tables showing before/after dimensions
Narrative explaining decisions

Phase 2: Exploratory Data Analysis

Create at least 5 professional visualizations:

Distribution of sale prices (histogram)
Geographic distribution (map)
Price vs. structural features (scatter plots)
Price vs. spatial features (scatter plots)
One creative visualization

For presentation slides: Select your best 2-3 visualizations that tell a compelling story

For appendix: Include all visualizations with detailed interpretations

Example presentation slide:

## Where Are Expensive Homes in Philadelphia?

[Beautiful map showing price patterns]

**Key Findings:**
- Center City and University City command premium prices
- River wards show emerging appreciation
- Northeast Philadelphia remains most affordable

Phase 3: Feature Engineering (Technical Appendix)

Create spatial features (the examples below are suggestions; how you construct your model is up to your team):

Buffer-based features:
- Parks within 500ft, 1000ft
- Transit stops within 400ft
- Schools, crime, etc.
k-Nearest Neighbor features:
- Average distance to k nearest parks, transit, etc.
Census variables:
- Join median income, education, poverty, etc.
Interaction terms:
- Theoretically motivated combinations
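
The buffer and k-nearest-neighbor ideas above can be sketched with nothing more than a distance matrix. This example assumes the coordinates are already in a projected CRS measured in feet, and every point in it is made up. With real data, spatial packages such as sf offer equivalent distance operations; the base R version here is only meant to show the logic.

```r
# Toy coordinates in feet, as if already projected. All points are invented.
homes <- data.frame(id = 1:3,
                    x = c(0, 1000, 3000),
                    y = c(0, 0, 0))
parks <- data.frame(x = c(300, 400, 1200, 5000),
                    y = c(400, 0, 0, 0))

# Euclidean distance matrix: one row per home, one column per park
dist_matrix <- function(a, b) {
  sqrt(outer(a$x, b$x, "-")^2 + outer(a$y, b$y, "-")^2)
}
d <- dist_matrix(homes, parks)

# Buffer-based features: number of parks within a radius of each home
homes$parks_500ft  <- rowSums(d <= 500)
homes$parks_1000ft <- rowSums(d <= 1000)

# kNN feature: mean distance to the k nearest parks
k <- 2
homes$park_nn2 <- apply(d, 1, function(row) mean(sort(row)[1:k]))

print(homes)
```

Buffer counts are easy to explain to a policy audience, while kNN distances avoid the many zeros that buffers produce in sparse areas; which one fits better is an empirical question for your data.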

Deliverable (Appendix only):

All feature engineering code
Summary table of features created
Brief justification for each feature

Phase 4: Model Building

Build models progressively, with each model adding a layer to the previous one (for example):

Structural features only
+ Census variables
+ Spatial features
+ Interactions and fixed effects
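
A minimal sketch of building models in layers, assuming a log-price outcome and using `update()` so each formula extends the one before it. The data are simulated and every variable name and effect size is invented for illustration; your own specification will differ.

```r
set.seed(5080)
n <- 400
# Simulated stand-in data; every variable name and coefficient is made up
homes <- data.frame(
  sqft         = round(runif(n, 700, 3000)),
  bedrooms     = sample(1:5, n, replace = TRUE),
  med_income   = round(runif(n, 25000, 120000)),
  transit_dist = runif(n, 100, 5000),
  neighborhood = factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
)
homes$log_price <- 11 + 0.0003 * homes$sqft + 0.000004 * homes$med_income -
  0.00005 * homes$transit_dist + 0.15 * as.integer(homes$neighborhood) +
  rnorm(n, sd = 0.25)

# Each formula adds a layer on top of the previous one
f1 <- log_price ~ sqft + bedrooms
f2 <- update(f1, . ~ . + med_income)
f3 <- update(f2, . ~ . + transit_dist)
f4 <- update(f3, . ~ . + neighborhood + sqft:transit_dist)

models <- lapply(list(f1, f2, f3, f4), lm, data = homes)
names(models) <- c("Structural only", "+ Census", "+ Spatial",
                   "+ Interactions/FE")

# In-sample fit only; out-of-sample error comes from cross-validation
comparison <- data.frame(
  model  = names(models),
  r2     = sapply(models, function(m) summary(m)$r.squared),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  row.names = NULL
)
print(comparison, digits = 3)
```

In-sample R² can never fall when terms are added to a nested linear model, so this table alone does not show that a richer model predicts better.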

For presentation slides: Show one comparison table (RMSE, R² for 4 different models you constructed in your process)

For appendix:

Complete model code
Full stargazer/modelsummary output
Coefficient interpretations

Example presentation slide:

## Model Performance Improves with Each Layer

| Model | CV RMSE (log) | R² |
|-------|---------------|-----|
| Structural Only | 0.42 | 0.61 |
| + Census | 0.38 | 0.69 |
| + Spatial | 0.31 | 0.78 |
| + Interactions/FE | 0.26 | 0.84 |

**Bottom line:** Neighborhood effects matter most!

Phase 5: Model Validation

Use 10-fold cross-validation:

- Compare all 4 models
- Report RMSE, MAE, R² for each
- Create predicted vs. actual plot
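
For intuition about what 10-fold cross-validation does, here is a hand-rolled sketch in base R. It uses simulated data with invented names, and it pools the held-out predictions before computing metrics; packaged CV routines may instead average the metrics fold by fold, so expect small differences. The helper `cv_lm()` is hypothetical, written only for this example.

```r
set.seed(5080)
n <- 300
# Simulated stand-in data; names and coefficients are invented
homes <- data.frame(sqft = runif(n, 700, 3000),
                    transit_dist = runif(n, 100, 5000))
homes$log_price <- 11 + 0.0003 * homes$sqft - 0.00005 * homes$transit_dist +
  rnorm(n, sd = 0.3)

# Manual k-fold CV: every row is predicted by a model that never saw it
cv_lm <- function(formula, data, k = 10) {
  fold    <- sample(rep(seq_len(k), length.out = nrow(data)))  # random folds
  outcome <- all.vars(formula)[1]
  pred    <- numeric(nrow(data))
  for (i in seq_len(k)) {
    fit <- lm(formula, data = data[fold != i, ])
    pred[fold == i] <- predict(fit, newdata = data[fold == i, ])
  }
  actual <- data[[outcome]]
  list(
    metrics = c(RMSE = sqrt(mean((actual - pred)^2)),
                MAE  = mean(abs(actual - pred)),
                R2   = cor(actual, pred)^2),
    predicted = pred
  )
}

cv1 <- cv_lm(log_price ~ sqft, homes)
cv2 <- cv_lm(log_price ~ sqft + transit_dist, homes)
print(round(rbind("Structural only" = cv1$metrics,
                  "+ Spatial"       = cv2$metrics), 3))

# Predicted vs. actual for the richer model; the dashed line is a perfect fit
plot(homes$log_price, cv2$predicted,
     xlab = "Actual log price", ylab = "Predicted log price (held out)")
abline(0, 1, lty = 2)
```

Using the same fold assignment for all four models would make the comparison slightly fairer; this sketch redraws folds on each call for simplicity.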

For presentation slides: Final CV results table (shown above) + one compelling visual

For appendix:

Complete CV code
Detailed results
Predicted vs. actual scatter plot
Discussion of which features matter most

Phase 6: Model Diagnostics (Technical Appendix Only)

Check assumptions for best model:

Residual plot (linearity, homoscedasticity)
Q-Q plot (normality)
Cook’s distance (influential observations)
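
The three checks above can be produced with base R alone. The sketch below fits a toy model on simulated data (names and values invented) and flags influential observations with the common 4/n rule of thumb, which is a convention assumed here rather than a course requirement.

```r
set.seed(5080)
n <- 200
# Simulated stand-in data; in practice use your best model from Phase 4
homes <- data.frame(sqft = runif(n, 700, 3000))
homes$log_price <- 11 + 0.0003 * homes$sqft + rnorm(n, sd = 0.3)
fit <- lm(log_price ~ sqft, data = homes)

op <- par(mfrow = c(1, 3))
# 1. Residuals vs. fitted: look for curvature or a funnel shape
plot(fitted(fit), resid(fit), xlab = "Fitted", ylab = "Residual")
abline(h = 0, lty = 2)
# 2. Normal Q-Q plot: points should follow the reference line
qqnorm(rstandard(fit))
qqline(rstandard(fit))
# 3. Cook's distance with the assumed 4/n cutoff
cooks <- cooks.distance(fit)
plot(cooks, type = "h", ylab = "Cook's distance")
abline(h = 4 / n, lty = 2)
par(op)

# Row indices worth inspecting before deciding whether to keep them
influential <- which(cooks > 4 / n)
length(influential)
```

Flagged rows are candidates for a closer look, not automatic deletions; an influential sale can be perfectly legitimate.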

Deliverable (Appendix only):

All 3 diagnostic plots
Interpretation of each
How you addressed violations (if any)

Note: Don’t include diagnostic plots in presentation - too technical!

Phase 7: Conclusions & Recommendations

Answer these questions:

What is your final model’s accuracy?
Which features matter most for Philadelphia prices?
Which neighborhoods are hardest to predict?
Equity concerns?
Limitations?

For presentation slides: 1-2 slides with clear, concise answers (bullet points)

For appendix: 2-3 paragraphs with detailed discussion

Example presentation slide:

## Key Findings & Recommendations

**Model Accuracy:** RMSE = 0.26 (log scale) ≈ 26% typical error

**Top Predictors:**
- Neighborhood fixed effects (largest impact)
- Square footage (β = 0.0003, p < 0.001)
- Distance to transit (β = -0.05, p < 0.001)

**Recommendations:**
✓ Current AVM undervalues transit-accessible properties  
✓ Model struggles in rapidly gentrifying neighborhoods
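
Reading a log-scale RMSE as a percentage is a shortcut that works best for small values. A quick check, assuming the model predicts log price, shows what an RMSE of 0.26 implies once it is exponentiated back to dollars; the $300,000 home is a hypothetical used only to make the range concrete.

```r
rmse_log <- 0.26

# A prediction one RMSE too high or too low on the log scale, in percent
over_pct  <- (exp(rmse_log) - 1) * 100   # about 29.7
under_pct <- (1 - exp(-rmse_log)) * 100  # about 22.9

# Dollar range for a hypothetical home predicted at $300,000
predicted_price <- 300000
price_range <- predicted_price * exp(c(-rmse_log, rmse_log))

round(c(over_pct = over_pct, under_pct = under_pct), 1)
round(price_range)
```

The error band is asymmetric in dollars, which is worth stating plainly when presenting a single "typical error" figure to officials.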

Submission Requirements

What to Submit (by 9:59 AM, October 27, 2025)

Upload to Canvas a link to your portfolio that contains:

Presentation Slides
- LastName_FirstName_Presentation.html (rendered slides)
- LastName_FirstName_Presentation.qmd (source file)
- Must use format: revealjs
- 10-15 slides maximum
- No code visible in slides
Technical Appendix
- LastName_FirstName_Appendix.html (rendered document)
- LastName_FirstName_Appendix.qmd (source file)
- Must use format: html
- All code visible and commented
- Complete analysis documented
Data files OR clear download instructions in appendix

For Teams: Use LastName1_LastName2_Presentation.html

In-Class Presentation (October 27, 2025)

Format: 5 minutes per team

What to present:

Walk through your presentation slides. Choose a team spokesperson or take turns. You’ll all stand up there and try to look calm, confident, & collected.
Hit the highlights - research question, key viz, model results, recommendations
Speak to a policy audience (your classmates are pretending to be city officials)
Be ready for 1-2 questions
You are trying to win the bid! Convince the audience of the quality of your consultancy’s work.

What NOT to do:

Don’t read slides verbatim
Don’t show code
Don’t go into technical details
Don’t go over 5 minutes (I’ll cut you off!)

Grading Rubric (Scaled to 15% of course grade)

Presentation Slides

Component	Points	Criteria
Research Question	2	Clear motivation; sets the stage
Data Overview	2	Concise description of sources, sample size
Visualizations	3	2-3 polished, publication-quality visualizations; clear takeaways
Model Comparison	3	Clean results table; clear winner; interprets improvement
Key Findings	3	Top predictors identified; coefficients interpreted correctly
Presentation Quality	3	Professional design, no typos, flows logically, appropriate for audience

Key: Slides should tell a compelling story without technical jargon. Imagine presenting to the Deputy Mayor.

In-Class Presentation

Component	Points	Criteria
Content	2	Covers key points efficiently, answers questions thoughtfully
Time Management	2	Finishes within 5 minutes without rushing

Technical Appendix

Component	Points	Criteria
Data Cleaning	3	Complete code, proper filtering, missing value handling, documentation
EDA	3	5+ visualizations, each with interpretation
Feature Engineering	3	Buffers, kNN, census, interactions; all properly created
Model Building	3	4 progressive models, proper specification, handles sparse categories
Cross-Validation	3	Proper 10-fold CV, results table, code runs without errors
Diagnostics	3	Residual plots included, interpreted, violations addressed
Code Quality	3	Clean, commented, reproducible, follows best practices, no errors

Key: This is where technical reviewers verify your work. All code must run without errors and be reproducible!

Example Presentation Structure

Slide 1: Title
- Your team name and teammates
- Project title
- Date

Slide 2: The Problem

Slide 3: Data Sources
- Property sales (n = X,XXX, 2023-2024)
- Census ACS (income, education, poverty)
- OpenDataPhilly (parks, transit, crime)

Slide 4: Where Are Expensive Homes?
- [Map visualization]
- Key pattern observed

Slide 5: What Drives Prices?
- [Best scatter plot or faceted visualization]
- Key relationship identified

Slide 6: Model Comparison
- [Results table]
- “Each layer improves prediction”

Slide 7: Top Predictors
- Neighborhood (biggest impact)
- Square footage (β = X)
- Transit access (β = Y)

Slide 8: Model Performance
- Final RMSE: 0.26 (log scale)
- Translation: ~26% typical error
- Beats baseline by 40%

Slide 9: Hardest to Predict
- [Visualization of residuals by neighborhood]

Slide 10: Recommendations

Slide 11: Limitations & Next Steps

Slide 12: Questions?
- Thank you
- [Contact info]

Tips for Success

Start Early

Data cleaning always takes longer than expected
OpenDataPhilly can be slow to download
Leave time for troubleshooting AND RENDERING!!!

Check Your Work

Organize your file directory from the beginning.
Run your entire .qmd file from scratch before submitting
Make sure all visualizations display
Check that your narrative flows logically

Frequently Asked Questions

Q: Do I have to create both slides AND an appendix?

A: Yes! Slides are your main deliverable (present findings). Appendix proves you did the work correctly.

Q: Can code appear in my presentation slides?

A: NO! Slides are for city officials. All code goes in the technical appendix.

Q: How many slides should I have?

A: 10-15 maximum. Quality over quantity. Each slide should have a clear purpose.

Q: Can I use a different city?
A: No, everyone uses Philadelphia for comparability.

Q: How do I make my .qmd render as revealjs slides?
A: Use format: revealjs in your YAML (see template above). Test it early!

Q: My presentation is 8 minutes. Is that okay?
A: NO! You must cut it to 5 minutes. Practice and trim ruthlessly.

Q: Should I include all 5 EDA visualizations in my slides?
A: No! Slides should have your best 2-3 visualizations. Put all 5 in the appendix.

Q: My RMSE is 0.35 in log scale. Is that good?
A: Depends on your data, but 0.25-0.45 is typical for hedonic models. Compare to your baseline or put your data back into dollars!

Q: Should I remove all outliers?
A: No! Only remove obvious errors. Use log transformation to handle legitimate outliers.

Q: What if my code works on my computer but not when I knit?
A: Start fresh, restart R, knit in a clean session. Check for hard-coded paths.

Q: Can I use ChatGPT/Claude to write my analysis?
A: You may use AI for debugging code, but NOT for writing your analysis.

Q: How formal should my presentation be?
A: Professional but not stuffy. Like you’re briefing a city council member who’s smart but doesn’t know statistics.

Q: What happens if I go over 5 minutes?
A: I’ll politely cut you off and you’ll lose points. Practice with a timer!

Example Workflow

Week 1:
- Download data
- Initial cleaning
- Basic EDA

Week 2:
- Feature engineering
- Build 4 models
- Run cross-validation
- Diagnostics
- Write conclusions
- Proofread and submit

Academic Integrity

You may discuss concepts with classmates
You may NOT share code or slides
All work must be your own (or your team’s)
Cite any external resources used
Please acknowledge how you used AI in your work and which AI you used (For example, Claude helped me today in coming up with a draft of this assignment, but I edited it thoroughly!)
