Midterm Challenge: Philadelphia Housing Price Prediction
MUSA 5080 - Public Policy Analytics
Overview
Due Date: October 27, 2025
In-Class Presentations: October 27, 2025 (5 minutes per team)
Weight: 15% of final grade
Team: You’ll work with your table-mates as a team. Feel free to delegate tasks. Everyone should upload the final products to their own portfolio website. Be sure to acknowledge your teammates.
Submission Format:
Presentation Slides (.qmd → revealjs, ~10-15 slides) - Main deliverable (see my weekly lecture notes for inspiration!)
Technical Appendix (.qmd → HTML document) - Supporting details
The Challenge
You are a consultancy (please name your consultancy) competing to win a bid for a project with the Philadelphia Office of Property Assessment. The city wants to improve its Automated Valuation Model (AVM) for property tax assessments. Your task is to build a predictive model for residential sale prices and present your findings to city officials in a 5-minute briefing.
Deliverables:
Presentation slides (10-15 slides MAX) - Your main findings for stakeholders
Technical appendix (HTML document) - All code, diagnostics, and detailed analysis
5-minute in-class presentation - Deliver your slides on October 27th
Your goal: Predict 2023-2024 home sale prices accurately while communicating findings clearly to a policy audience.
Content: ~10-15 (could be less!!) slides covering:
1. Research question & motivation (1-2 slides)
2. Data overview (1 slide)
3. Key visualizations (2-3 slides)
4. Model comparison results (1-2 slides)
5. Main findings (2-3 slides)
6. Policy recommendations (1-2 slides)
Audience: City officials who don’t know R or care about the nitty gritty of stats. They just want the best estimates possible.
No code in these slides - just polished visualizations and key takeaways
Part B: Technical Appendix (Supporting Documentation)
Format: Quarto HTML document
Example YAML:
---
title: "Philadelphia Housing Model - Technical Appendix"
author: "Your Name"
format:
  html:
    code-fold: show
    toc: true
    toc-location: left
    theme: cosmo
---
Content: All the technical details:
- Complete data cleaning code
- All EDA visualizations
- Feature engineering code
- Full model outputs
- Diagnostic plots
- Detailed interpretations
Audience: Data scientists and technical reviewers
All code visible - this is where you show your work
Required: Browse the OpenDataPhilly portal and use Census data to incorporate spatial features into your model.
Your task: Think like an urban planner. What location factors matter for housing prices in Philadelphia?
Assignment Structure
Your work should follow this workflow, with results split between presentation and appendix:
Phase 1: Data Preparation (Technical Appendix)
Load and clean Philadelphia sales data:
Filter to residential properties, 2023-2024 sales
Remove obvious errors
Handle missing values
Document all cleaning decisions
Load secondary data:
Census data (tidycensus)
Spatial amenities (OpenDataPhilly)
Join to sales data appropriately
Make sure you have the correct CRS!
Deliverable (Appendix only):
Complete data cleaning code
Summary tables showing before/after dimensions
Narrative explaining decisions
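One way to produce the before/after summary is to record the row count after each cleaning step. The sketch below uses a small made-up data frame in base R; the column names (`sale_price`, `sale_date`, `category`) and the price thresholds are placeholders for illustration, not the actual field names in the sales data or recommended cutoffs.

```r
# Toy stand-in for the sales data; column names and values are hypothetical.
sales <- data.frame(
  sale_price = c(250000, 1, 410000, NA, 325000, 98000000, 180000, 560000),
  sale_date  = as.Date(c("2023-03-01", "2023-06-15", "2022-11-30", "2024-01-20",
                         "2024-08-05", "2023-09-09", "2024-02-11", "2023-12-01")),
  category   = c("residential", "residential", "residential", "residential",
                 "commercial", "residential", "residential", "residential")
)

# Append the current row count to a running log of cleaning steps.
log_step <- function(log, step, df) {
  rbind(log, data.frame(step = step, rows = nrow(df)))
}

cleaning_log <- log_step(NULL, "raw data", sales)

clean <- sales[sales$category == "residential", ]
cleaning_log <- log_step(cleaning_log, "residential only", clean)

clean <- clean[clean$sale_date >= as.Date("2023-01-01") &
               clean$sale_date <= as.Date("2024-12-31"), ]
cleaning_log <- log_step(cleaning_log, "2023-2024 sales", clean)

clean <- clean[!is.na(clean$sale_price), ]
cleaning_log <- log_step(cleaning_log, "non-missing price", clean)

# Illustrative thresholds for "obvious errors"; justify your own in the narrative.
clean <- clean[clean$sale_price > 10000 & clean$sale_price < 10000000, ]
cleaning_log <- log_step(cleaning_log, "plausible price range", clean)

print(cleaning_log)
```

The resulting `cleaning_log` table can go straight into the appendix next to the narrative explaining each decision.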
Phase 2: Exploratory Data Analysis
Create at least 5 professional visualizations:
Distribution of sale prices (histogram)
Geographic distribution (map)
Price vs. structural features (scatter plots)
Price vs. spatial features (scatter plots)
One creative visualization
For presentation slides: Select your best 2-3 visualizations that tell a compelling story
For appendix: Include all visualizations with detailed interpretations
Example presentation slide:
## Where Are Expensive Homes in Philadelphia?

[Beautiful map showing price patterns]

**Key Findings:**

- Center City and University City command premium prices
- River wards show emerging appreciation
- Northeast Philadelphia remains most affordable
Phase 3: Feature Engineering (Technical Appendix)
Create spatial features: (these are examples below, but how you construct your model is up to your team)
Buffer-based features:
Parks within 500ft, 1000ft
Transit stops within 400ft
Schools, crime, etc.
k-Nearest Neighbor features:
Average distance to k nearest parks, transit, etc.
Census variables:
Join median income, education, poverty, etc.
Interaction terms:
Theoretically motivated combinations
Deliverable (Appendix only):
All feature engineering code
Summary table of features created
Brief justification for each feature
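The buffer-based and k-nearest-neighbor features listed above can be sketched in base R as follows. This is a minimal illustration, assuming coordinates are already projected to a CRS measured in feet; the `homes` and `parks` data frames and their values are made up.

```r
# Hypothetical projected coordinates in feet; real data would come from
# your sales and amenity layers after transforming to a common CRS.
homes <- data.frame(id = 1:3,
                    x = c(0, 1000, 5000),
                    y = c(0, 0, 0))
parks <- data.frame(x = c(300, 400, 1200, 9000),
                    y = c(0, 300, 0, 0))

# Euclidean distance matrix: one row per home, one column per park.
dist_matrix <- function(a, b) {
  sqrt(outer(a$x, b$x, "-")^2 + outer(a$y, b$y, "-")^2)
}
d <- dist_matrix(homes, parks)

# Buffer-based feature: number of parks within a given radius of each home.
count_within <- function(d, radius) rowSums(d <= radius)

# k-nearest-neighbor feature: mean distance to the k closest parks.
mean_knn_dist <- function(d, k) {
  apply(d, 1, function(row) mean(sort(row)[seq_len(k)]))
}

homes$parks_500ft  <- count_within(d, 500)
homes$parks_1000ft <- count_within(d, 1000)
homes$park_knn2    <- mean_knn_dist(d, k = 2)
print(homes)
```

With spatial objects, the `sf` functions `st_is_within_distance()` and `st_distance()` play the same roles as `count_within()` and `dist_matrix()` here; a full distance matrix can get large, so it may need to be computed in chunks for the complete sales data.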
Phase 4: Model Building
Build models progressively: (for example)
Structural features only
Census variables
Spatial features
Interactions and fixed effects
For presentation slides: Show one comparison table (RMSE and R² for the 4 models you built along the way)
For appendix:
Complete model code
Full stargazer/modelsummary output
Coefficient interpretations
Example presentation slide:
## Model Performance Improves with Each Layer

| Model | CV RMSE (log) | R² |
|-------|---------------|-----|
| Structural Only | 0.42 | 0.61 |
| + Census | 0.38 | 0.69 |
| + Spatial | 0.31 | 0.78 |
| + Interactions/FE | 0.26 | 0.84 |

**Bottom line:** Neighborhood effects matter most!
Phase 5: Model Validation
Use 10-fold cross-validation:
- Compare all 4 models
- Report RMSE, MAE, R² for each
- Create predicted vs. actual plot
For presentation slides: Final CV results table (shown above) + one compelling visual
For appendix:
Complete CV code
Detailed results
Predicted vs. actual scatter plot
Discussion of which features matter most
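A hand-rolled version of 10-fold cross-validation might look like the sketch below. It runs on simulated data with hypothetical variable names (`sqft`, `med_income`, `dist_park`) and compares three nested models rather than four, purely for illustration; packages such as `caret` offer the same workflow with less code.

```r
set.seed(5080)

# Simulated data standing in for cleaned sales; variable names are hypothetical.
n <- 500
sales <- data.frame(
  sqft       = runif(n, 600, 3000),
  med_income = runif(n, 20000, 120000),
  dist_park  = runif(n, 100, 5000)
)
sales$log_price <- 11 + 0.0004 * sales$sqft + 0.000005 * sales$med_income -
  0.00005 * sales$dist_park + rnorm(n, sd = 0.3)

# Candidate models, built up progressively.
formulas <- list(
  structural = log_price ~ sqft,
  census     = log_price ~ sqft + med_income,
  spatial    = log_price ~ sqft + med_income + dist_park
)

# Assign each row to one of k folds at random.
k <- 10
fold <- sample(rep(seq_len(k), length.out = n))

cv_metrics <- function(f, data, fold) {
  pred <- numeric(nrow(data))
  for (i in sort(unique(fold))) {
    fit <- lm(f, data = data[fold != i, ])              # train on the other folds
    pred[fold == i] <- predict(fit, data[fold == i, ])  # predict the held-out fold
  }
  err <- data$log_price - pred
  data.frame(
    rmse = sqrt(mean(err^2)),
    mae  = mean(abs(err)),
    r2   = 1 - sum(err^2) / sum((data$log_price - mean(data$log_price))^2)
  )
}

results <- do.call(rbind, lapply(formulas, cv_metrics, data = sales, fold = fold))
print(round(results, 3))
```

Using the same `fold` assignment for every model keeps the comparison fair, since each model is scored on identical held-out sales. The out-of-fold predictions in `pred` are also what a predicted vs. actual plot would use.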
Phase 6: Model Diagnostics (Technical Appendix Only)
Check assumptions for best model:
Residual plot (linearity, homoscedasticity)
Q-Q plot (normality)
Cook’s distance (influential observations)
Deliverable (Appendix only):
All 3 diagnostic plots
Interpretation of each
How you addressed violations (if any)
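The three diagnostic plots are available from base R's `plot()` method for `lm` objects. The sketch below fits a toy model on simulated data (the variable names are hypothetical) and uses the common 4/n rule of thumb for Cook's distance, which is one convention among several rather than a fixed cutoff.

```r
set.seed(5080)

# Simulated stand-in for the final model's data; names are hypothetical.
n <- 200
sqft <- runif(n, 600, 3000)
log_price <- 11 + 0.0004 * sqft + rnorm(n, sd = 0.3)
fit <- lm(log_price ~ sqft)

# plot.lm panels: 1 = residuals vs. fitted, 2 = normal Q-Q, 4 = Cook's distance.
op <- par(mfrow = c(1, 3))
plot(fit, which = c(1, 2, 4))
par(op)

# Flag potentially influential sales using the 4/n rule of thumb.
cooks <- cooks.distance(fit)
influential <- which(cooks > 4 / n)
length(influential)
```

Flagged observations are candidates for a closer look, not automatic removal; the appendix should say what was done with them and why.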
Note: Don’t include diagnostic plots in presentation - too technical!
Phase 7: Conclusions & Recommendations
Answer these questions:
What is your final model’s accuracy?
Which features matter most for Philadelphia prices?
Which neighborhoods are hardest to predict?
Equity concerns?
Limitations?
For presentation slides: 1-2 slides with clear, concise answers (bullet points)
For appendix: 2-3 paragraphs with detailed discussion
Example presentation slide:
## Key Findings & Recommendations

**Model Accuracy:** RMSE = 0.26 (log scale) ≈ 26% typical error

**Top Predictors:**

- Neighborhood fixed effects (largest impact)
- Square footage (β = 0.0003, p < 0.001)
- Distance to transit (β = -0.05, p < 0.001)

**Recommendations:**

✓ Current AVM undervalues transit-accessible properties
✓ Model struggles in rapidly gentrifying neighborhoods
Submission Requirements
What to Submit (by 9:59 AM, October 27, 2025)
Upload to Canvas: a link to your portfolio that contains
Data files OR clear download instructions in appendix
For Teams: Use LastName1_LastName2_Presentation.html
In-Class Presentation (October 27, 2025)
Format: 5 minutes per team
What to present:
Walk through your presentation slides. Choose a team spokesperson or take turns. You’ll all stand up there and try to look calm, confident, & collected.
Hit the highlights - research question, key viz, model results, recommendations
Speak to a policy audience (your classmates are pretending to be city officials)
Be ready for 1-2 questions
You are trying to win the bid! Convince the audience of the quality of your consultancy’s work.
Slide 8: Model Performance
- Final RMSE: 0.26 (log scale)
- Translation: ~26% typical error
- Beats baseline by 40%
Slide 9: Hardest to Predict
- [Visualization of residuals by neighborhood]
Slide 10: Recommendations
Slide 11: Limitations & Next Steps
Slide 12: Questions?
- Thank you
- [Contact info]
Tips for Success
Start Early
Data cleaning always takes longer than expected
OpenDataPhilly can be slow to download
Leave time for troubleshooting AND RENDERING!!!
Check Your Work
Organize your file directory from the beginning.
Run your entire .qmd file from scratch before submitting
Make sure all visualizations display
Check that your narrative flows logically
Frequently Asked Questions
Q: Do I have to create both slides AND an appendix?
A: Yes! Slides are your main deliverable (present findings). Appendix proves you did the work correctly.
Q: Can code appear in my presentation slides?
A: NO! Slides are for city officials. All code goes in the technical appendix.
Q: How many slides should I have?
A: 10-15 maximum. Quality over quantity. Each slide should have a clear purpose.
Q: Can I use a different city?
A: No, everyone uses Philadelphia for comparability.
Q: How do I make my .qmd render as revealjs slides?
A: Use format: revealjs in your YAML (see template above). Test it early!
Q: My presentation is 8 minutes. Is that okay?
A: NO! You must cut it to 5 minutes. Practice and trim ruthlessly.
Q: Should I include all 5 EDA visualizations in my slides?
A: No! Slides should have your best 2-3 visualizations. Put all 5 in the appendix.
Q: My RMSE is 0.35 in log scale. Is that good?
A: Depends on your data, but 0.25-0.45 is typical for hedonic models. Compare to your baseline or put your data back into dollars!
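To make a log-scale RMSE easier to interpret, it can be converted to an approximate percentage error or recomputed in dollars. The sketch below is a rough illustration in base R; the prices and predictions are made-up numbers.

```r
rmse_log <- 0.35

# A one-RMSE miss on the log scale, expressed as a percentage of price.
over  <- exp(rmse_log) - 1    # prediction too high
under <- 1 - exp(-rmse_log)   # prediction too low
round(100 * c(over = over, under = under), 1)

# Dollar-scale RMSE from log-scale predictions; values here are made up.
actual_price   <- c(180000, 250000, 420000)
pred_log_price <- log(c(200000, 230000, 400000))
rmse_dollars <- sqrt(mean((actual_price - exp(pred_log_price))^2))
rmse_dollars
```

Reading a log RMSE directly as a percentage is only a close approximation for small values: here 0.35 corresponds to roughly 42% when a prediction is too high and roughly 30% when it is too low.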
Q: Should I remove all outliers?
A: No! Only remove obvious errors. Use log transformation to handle legitimate outliers.
Q: What if my code works on my computer but not when I knit?
A: Start fresh, restart R, knit in a clean session. Check for hard-coded paths.
Q: Can I use ChatGPT/Claude to write my analysis?
A: You may use AI for debugging code, but NOT for writing your analysis.
Q: How formal should my presentation be?
A: Professional but not stuffy. Like you’re briefing a city council member who’s smart but doesn’t know statistics.
Q: What happens if I go over 5 minutes?
A: I’ll politely cut you off and you’ll lose points. Practice with a timer!
Please acknowledge how you used AI in your work and which AI you used. (For example, Claude helped me today in coming up with a draft of this assignment, but I edited it thoroughly!)