This report seeks to build a predictive model of home prices in San Francisco that takes the city's local context into account. Predicting real estate prices in urban areas is increasingly important: prices not only reflect overall market conditions but are also a good indicator of economic health. Such analysis is also critical for assessments ranging from property tax collection to the valuation of residential mortgage-backed securities. Although machine learning techniques are widely used for this purpose, the challenge of this analysis lay in thoughtfully selecting localized variables that might strengthen the predictive power of our model, while taking care not to overfit it with too much specificity.
In constructing our model, we first attempted to identify a list of physical and place-based characteristics that could potentially influence housing prices. We then sought to improve our model by experimenting with different combinations of these. We identified the best model by comparing the average difference between observed and predicted prices for each candidate. To this end, we employed two metrics: the Mean Absolute Error (MAE) and the Mean Absolute Percent Error (MAPE).
With the set of variables we eventually decided on (figure 1), we obtained a MAE of $222,739.60 and a corresponding MAPE of 23.25%, with the mean home sale price for our dataset being $1,144,668. The standard deviation of our MAE across all folds when performing k-fold validation (an indicator of the model’s generalizability to new data) is $31,365.83.
We first used the tidycensus package to access American Community Survey (ACS) data for San Francisco County, and selected socio-economic and demographic variables that we deemed relevant for home price prediction. We also retrieved geographic information on localized features and amenities from San Francisco’s Open Data portal.
Summary statistics for all variables considered within our initial model are as follows. Variables eventually selected for our final model have been highlighted in grey:
Figure 1: List of variables selected for initial and final models.
The correlation matrix below presents pairwise correlations between variables and makes apparent the major associations of our variables with home sale prices. This was an important step in our exploratory analysis, as it allowed us to get a preliminary sense of which variables might have the largest explanatory power.
Figure 2: Correlation matrix of variables.
Scatterplots were then drawn for four independent variables to visualize their relationship to home sale prices.
Figure 3: Scatter plots of independent variables as a function of home sale price.
To permit visual inspection of potential relationships between dependent and independent variables across space, the following maps were plotted and selected variables were visualized through quintile breaks.
Figure 4: Maps of dependent and independent variables across San Francisco.
After gathering data as detailed in part 2.1, we selected predictors for our model by analyzing their relationship with the sale price. During this process, we were mindful to consider the three theoretical components of home prices: internal and parcel-level characteristics, public services and amenities, and the underlying spatial structure of home prices. The predictors retained in the final model were those that proved more significant in a linear regression run on all the initial predictors listed in figure 1.
A linear regression model was built to predict home sale prices, such that each prediction is an intercept plus a linear combination of our predictors. Coefficients were estimated using Ordinary Least Squares (OLS), thereby allowing us to predict the sale price of each house from the values of its predictors.
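The estimation step can be sketched in a few lines. The following is a minimal illustration, not the code used for this report: the helper names `ols_fit` and `ols_predict`, the two toy predictors (floor area in thousands of square feet, number of baths) and the prices (in thousands of dollars) are all hypothetical, and the coefficients are obtained by solving the normal equations directly rather than through a statistics package.

```python
def ols_fit(X, y):
    """Return OLS coefficients [intercept, b1, ..., bp] by solving (X'X) b = X'y."""
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend an intercept column
    p = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(p):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] for i in range(p)]


def ols_predict(beta, X):
    """Predicted value = intercept + sum of coefficient * predictor."""
    return [beta[0] + sum(b * x for b, x in zip(beta[1:], r)) for r in X]


# Hypothetical toy data: [floor area in 1000 sq ft, baths]; price in $1000s.
X = [[1.0, 1], [1.5, 2], [2.0, 2], [1.2, 1], [1.8, 3]]
y = [100 + 200 * area + 50 * baths for area, baths in X]
beta = ols_fit(X, y)           # recovers roughly [100, 200, 50]
predicted = ols_predict(beta, X)
```

Because the toy prices are generated without noise, the fitted coefficients recover the generating values; with real sale data the fit would only minimize the sum of squared residuals.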
We ran a few tests to measure the accuracy and generalizability of our model, in order to determine its predictive power. As a basic first test, we ran the linear regression using the predictors on the entire data set to see the relative goodness of fit on seen data by analyzing the R-squared value. We aimed to achieve a relatively high R-squared value, as this represents the proportion of the variance for the sale price that is explained by the predictors.
To test the model's power to predict data that it was not trained on, we randomly split the data into a training set (60% of the original data set) and a test set (40% of the original data set), and once again ran the regression using only the training set. We then used this model, with its own set of coefficients for the predictors, to predict sale prices on the test set. The absolute error for each house price was calculated and averaged across the test set to give the Mean Absolute Error (MAE); the Mean Absolute Percentage Error (MAPE) was obtained in the same way from each absolute error expressed as a percentage of the observed price. Both are metrics of how accurately the model predicts unseen data.
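As a rough sketch of this step, the two error metrics and a 60/40 random split might look as follows. The function names, the seed and the three example sales are hypothetical and chosen only for illustration.

```python
import random


def mae(observed, predicted):
    """Mean Absolute Error: average of |observed - predicted|."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mape(observed, predicted):
    """Mean Absolute Percentage Error: average of |observed - predicted| / observed, in percent."""
    return 100.0 * sum(abs(o - p) / o for o, p in zip(observed, predicted)) / len(observed)


def train_test_split(records, train_share=0.6, seed=42):
    """Shuffle a copy of the records and cut it into training and test sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(train_share * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical observed and predicted sale prices in dollars.
observed = [500_000, 1_000_000, 2_000_000]
predicted = [600_000, 900_000, 1_800_000]
example_mae = mae(observed, predicted)    # about 133,333
example_mape = mape(observed, predicted)  # about 13.33
```

Note that MAPE averages the per-sale percentage errors, so it generally differs from MAE divided by the mean price (here about 11.4%).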
As a more rigorous approach to the same concept, we ran 100-fold cross-validation on the model. This randomly partitions the data set into 100 folds, each 1/100th the size of the original data set, and repeats the aforementioned method 100 times, each time holding out a different fold as the test set and training on the remaining 99. The average MAE and average MAPE across the 100 iterations are calculated to provide a better indication of the accuracy of the model. The standard deviation of the MAE across iterations also provides an indicator of the generalizability of the model across different data sets.
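The cross-validation loop can be illustrated with a small sketch. To keep it self-contained it uses a hypothetical one-predictor model (`fit_line`), synthetic data and k = 10 rather than the full model and k = 100; all of these are assumptions made for the example.

```python
import random
import statistics


def fit_line(xs, ys):
    """Closed-form OLS for a single predictor: returns (intercept, slope)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope


def k_fold_mae(xs, ys, k, seed=0):
    """Return (mean MAE, standard deviation of MAE, per-fold MAE list)."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all rows
    fold_mae = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [abs(ys[i] - (intercept + slope * xs[i])) for i in held_out]
        fold_mae.append(statistics.mean(errors))
    return statistics.mean(fold_mae), statistics.stdev(fold_mae), fold_mae


# Synthetic example: price rises with floor area, plus random noise.
rng = random.Random(1)
areas = [rng.uniform(500, 3000) for _ in range(200)]
prices = [300 * a + rng.gauss(0, 50_000) for a in areas]
mean_mae, sd_mae, fold_mae = k_fold_mae(areas, prices, k=10)
```

A small standard deviation relative to the mean MAE indicates that accuracy does not depend much on which observations happen to be held out.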
To test the model's generalizability across space, we plotted the errors between predicted and observed prices on a map and calculated Moran’s I (a measure of spatial autocorrelation) to test for any spatial relationship not accounted for in the model. We also plotted the MAPE by neighborhood to inspect whether errors differed significantly between neighborhoods. We accounted for spatial differences in prices relating to neighborhood characteristics by incorporating neighborhood fixed effects in our model.
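For readers unfamiliar with the statistic, the sketch below computes Moran’s I and compares it with values obtained after randomly shuffling the data across locations, which is one common way to judge whether the observed value is unusually high. The report does not state which spatial weights were used, so the row-standardized k-nearest-neighbor weights here are an assumption, as are the function names and the toy grid.

```python
import math
import random


def knn_neighbors(points, k):
    """For each point, the indices of its k nearest other points."""
    result = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        result.append(others[:k])
    return result


def morans_i(values, neighbors):
    """Moran's I with row-standardized weights (each row sums to 1, so n / S0 = 1)."""
    mean = sum(values) / len(values)
    z = [v - mean for v in values]
    numerator = sum(z[i] * sum(z[j] for j in nb) / len(nb)
                    for i, nb in enumerate(neighbors))
    return numerator / sum(zi * zi for zi in z)


def permutation_test(values, neighbors, n_perm=999, seed=0):
    """Return (observed I, pseudo p-value) from random reshuffles of the values."""
    rng = random.Random(seed)
    observed = morans_i(values, neighbors)
    shuffled = values[:]
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, neighbors) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (n_perm + 1)


# Toy 6 x 6 grid whose values rise from west to east, i.e. strongly clustered.
points = [(x, y) for x in range(6) for y in range(6)]
clustered = [float(x) for x, _ in points]
neighbors = knn_neighbors(points, k=4)
observed_i, p_value = permutation_test(clustered, neighbors)
```

Applied to model errors rather than toy values, a large positive I with a small pseudo p-value would indicate that nearby errors resemble one another more than chance would suggest.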
We also analyzed the MAE in majority white tracts versus non-majority white tracts, and in higher income versus lower income tracts to test for any bias in our model. Should there be a significant difference between the MAEs, the model would be biased to a particular urban demographic and hence should not be used.
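A comparison of this kind amounts to labelling each tract and averaging errors within each label. The sketch below assumes the thresholds described later in this report (at least 51% white residents, income above the citywide mean); the function names and the four example sales are hypothetical.

```python
from collections import defaultdict


def tract_context(share_white, tract_income, citywide_mean_income):
    """Label a tract by racial and income context."""
    race = "Majority White" if share_white >= 0.51 else "Majority non-White"
    income = "High Income" if tract_income > citywide_mean_income else "Low Income"
    return race, income


def mape_by_group(records):
    """records: (group, observed, predicted) tuples -> {group: MAPE in percent}."""
    pct_errors = defaultdict(list)
    for group, observed, predicted in records:
        pct_errors[group].append(abs(observed - predicted) / observed)
    return {g: 100.0 * sum(v) / len(v) for g, v in pct_errors.items()}


# Hypothetical sales: (racial context, observed price, predicted price).
sales = [
    ("Majority White", 1_000_000, 800_000),
    ("Majority White", 2_000_000, 1_500_000),
    ("Majority non-White", 800_000, 700_000),
    ("Majority non-White", 500_000, 450_000),
]
group_mape = mape_by_group(sales)
labels = tract_context(share_white=0.62, tract_income=90_000,
                       citywide_mean_income=110_000)
```

In this toy example the two groups have MAPEs of 22.5% and 11.25%, a gap large enough to suggest the model treats the two contexts differently.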
Using the methods detailed in this section, we decided on the predictors used in the final model. We then ran the linear regression on the entire data set, and the estimated coefficients were used to predict the sale prices of houses whose sale prices were unknown.
Following processes of feature engineering and selection, we narrowed our list of predictors such that our final model comprised a total of 20 variables. These have been highlighted in grey in figure 1.
We trained this model on a subset of our data to analyze the significance and coefficients of our selected predictors in determining home sale prices. A table of results from this analysis is as follows:
The R-squared (a measure of goodness of fit) of the model run on the training set is 0.6995.
Predictor | Estimated Coefficient | Std. Error | t value | p-value |
---|---|---|---|---|
(Intercept) | 1027113.136512 | 383697.9774255 | 2.6768792 | 0.0074442 |
LotArea | 1.096873 | 0.0388050 | 28.2662701 | 0.0000000 |
PropArea | 242.093929 | 7.1158093 | 34.0219811 | 0.0000000 |
builtYear.catOld | -26071.081082 | 21824.0415591 | -1.1946037 | 0.2322724 |
builtYear.catVery Old | 20570.931888 | 19219.7177622 | 1.0703035 | 0.2845105 |
builtYear.catVeryNew | 262315.342934 | 41469.0233979 | 6.3255732 | 0.0000000 |
Beds | 6346.420175 | 2943.1025847 | 2.1563707 | 0.0310803 |
Baths | 38153.000935 | 5608.5928088 | 6.8025978 | 0.0000000 |
Year.cat2013 | 160450.083650 | 11000.4098518 | 14.5858278 | 0.0000000 |
Year.cat2014 | 327157.383482 | 11350.8366531 | 28.8223145 | 0.0000000 |
Year.cat2015 | 504720.217178 | 11505.1265944 | 43.8691581 | 0.0000000 |
Month.catAug | 32446.148235 | 18992.6537256 | 1.7083525 | 0.0876044 |
Month.catDec | 34008.066480 | 19546.4987727 | 1.7398546 | 0.0819177 |
Month.catFeb | -17648.944910 | 21068.2275629 | -0.8377043 | 0.4022184 |
Month.catJan | -129990.176862 | 22067.9321193 | -5.8904557 | 0.0000000 |
Month.catJul | 46246.538783 | 18788.2876738 | 2.4614558 | 0.0138555 |
Month.catJun | 21052.026780 | 18797.5469518 | 1.1199348 | 0.2627705 |
Month.catMar | -10726.872420 | 19779.5935516 | -0.5423202 | 0.5876110 |
Month.catMay | 40469.700422 | 18728.3302507 | 2.1608814 | 0.0307300 |
Month.catNov | 42482.136098 | 19125.9935641 | 2.2211728 | 0.0263633 |
Month.catOct | 50712.788781 | 18598.6141847 | 2.7266972 | 0.0064092 |
Month.catSept | 24327.857038 | 20047.1125673 | 1.2135342 | 0.2249565 |
MedInc | 4.881354 | 0.7022558 | 6.9509629 | 0.0000000 |
PRforeign | -20941.031488 | 5211.0851080 | -4.0185549 | 0.0000590 |
PRDrive | -17090.782921 | 7216.7472783 | -2.3682114 | 0.0178947 |
PRabv65 | 61931.750383 | 11667.9939999 | 5.3078319 | 0.0000001 |
PRSchool | -90385.294593 | 18297.7860907 | -4.9396847 | 0.0000008 |
PRPrivate | 126.543783 | 28.7425179 | 4.4026687 | 0.0000108 |
PRlackPlumbing | 35763.066428 | 25163.0117709 | 1.4212554 | 0.1552761 |
DistBusStops | 32222622.745540 | 6364394.2508101 | 5.0629520 | 0.0000004 |
DistRailStn | 1627630.504329 | 1589629.6151020 | 1.0239055 | 0.3059066 |
nhoodAnza Vista | -435617.651630 | 165942.5120153 | -2.6251118 | 0.0086763 |
nhoodAquatic Park / Ft. Mason | 1738683.552997 | 550551.5014706 | 3.1580761 | 0.0015932 |
nhoodAshbury Heights | -362332.235989 | 248374.1425332 | -1.4588163 | 0.1446495 |
nhoodBalboa Terrace | -1148592.876706 | 281836.9173022 | -4.0753812 | 0.0000463 |
nhoodBayview | -1445430.516934 | 380960.3818154 | -3.7941754 | 0.0001491 |
nhoodBernal Heights | -1031396.022226 | 379483.3002972 | -2.7178957 | 0.0065821 |
nhoodBret Harte | -1490479.748969 | 381878.8059559 | -3.9030177 | 0.0000957 |
nhoodBuena Vista | -501141.245364 | 245696.0352833 | -2.0396798 | 0.0414105 |
nhoodCandlestick Point SRA | -1469250.570772 | 394871.3966419 | -3.7208331 | 0.0001997 |
nhoodCastro | -643302.648457 | 273582.8654374 | -2.3513996 | 0.0187237 |
nhoodCathedral Hill | -548899.176880 | 205304.5862490 | -2.6735846 | 0.0075177 |
nhoodCayuga | -1365423.668010 | 398587.2743668 | -3.4256580 | 0.0006160 |
nhoodCentral Waterfront | -2287481.401002 | 480201.3341981 | -4.7635882 | 0.0000019 |
nhoodCivic Center | -440004.376661 | 405869.2335219 | -1.0841038 | 0.2783469 |
nhoodClarendon Heights | -609152.104027 | 270521.6584495 | -2.2517683 | 0.0243602 |
nhoodCole Valley | -300301.366021 | 253923.7964126 | -1.1826437 | 0.2369807 |
nhoodCorona Heights | -869177.515363 | 251638.6692421 | -3.4540698 | 0.0005547 |
nhoodCow Hollow | 409553.210166 | 175957.8723809 | 2.3275640 | 0.0199567 |
nhoodCrocker Amazon | -1498184.346755 | 397489.4615305 | -3.7691172 | 0.0001648 |
nhoodDiamond Heights | -809506.391750 | 267576.2641291 | -3.0253296 | 0.0024904 |
nhoodDogpatch | -732653.439074 | 343219.9377134 | -2.1346471 | 0.0328160 |
nhoodDolores Heights | -383015.187853 | 266214.8704210 | -1.4387445 | 0.1502566 |
nhoodDuboce Triangle | -540561.235882 | 249952.9901209 | -2.1626516 | 0.0305934 |
nhoodEureka Valley | -505785.957070 | 265473.3196549 | -1.9052233 | 0.0567819 |
nhoodExcelsior | -1452030.084527 | 397305.2997899 | -3.6546960 | 0.0002589 |
nhoodFairmount | -782080.216963 | 266541.0919615 | -2.9341825 | 0.0033525 |
nhoodFishermans Wharf | 2476943.783948 | 561740.3641066 | 4.4094104 | 0.0000105 |
nhoodForest Hill | -857033.766143 | 275617.9608549 | -3.1094990 | 0.0018797 |
nhoodForest Knolls | -1063067.928593 | 279005.1766367 | -3.8102086 | 0.0001397 |
nhoodGlen Park | -852119.635037 | 266842.1721651 | -3.1933469 | 0.0014110 |
nhoodGolden Gate Heights | -1081567.547128 | 273101.3209928 | -3.9603161 | 0.0000754 |
nhoodHaight Ashbury | -627824.235862 | 251578.7260461 | -2.4955379 | 0.0125939 |
nhoodHayes Valley | -478502.186276 | 167069.3048101 | -2.8640940 | 0.0041914 |
nhoodHolly Park | -971967.000842 | 382898.5301575 | -2.5384454 | 0.0111508 |
nhoodHunters Point | -1521803.492588 | 386100.3232775 | -3.9414717 | 0.0000816 |
nhoodIndia Basin | -1719624.368994 | 540455.6152674 | -3.1818050 | 0.0014684 |
nhoodIngleside | -1114908.833066 | 275538.8045297 | -4.0462861 | 0.0000525 |
nhoodIngleside Terraces | -1626241.432073 | 279578.9873509 | -5.8167513 | 0.0000000 |
nhoodInner Richmond | -12464.543862 | 156144.7930737 | -0.0798268 | 0.9363767 |
nhoodInner Sunset | -793286.776987 | 275546.2898436 | -2.8789601 | 0.0039990 |
nhoodJapantown | -137656.651617 | 164574.3055101 | -0.8364407 | 0.4029285 |
nhoodLaguna Honda | -1067436.492015 | 276459.1198360 | -3.8611007 | 0.0001136 |
nhoodLake Street | 260044.124387 | 164604.0682565 | 1.5798159 | 0.1141831 |
nhoodLakeshore | -1071132.728406 | 277751.9642604 | -3.8564362 | 0.0001158 |
nhoodLaurel Heights / Jordan Park | 322922.443097 | 162659.9676835 | 1.9852607 | 0.0471449 |
nhoodLittle Hollywood | -1472848.966476 | 390194.0760973 | -3.7746574 | 0.0001612 |
nhoodLone Mountain | -171569.704786 | 153499.5514958 | -1.1177212 | 0.2637150 |
nhoodLower Haight | -572613.508909 | 231454.1000893 | -2.4739830 | 0.0133794 |
nhoodLower Nob Hill | 628751.791381 | 436087.5106186 | 1.4418019 | 0.1493920 |
nhoodLower Pacific Heights | -29668.417446 | 147027.2721615 | -0.2017885 | 0.8400865 |
nhoodMarina | -335863.380010 | 167388.1718757 | -2.0064941 | 0.0448325 |
nhoodMerced Heights | -1173125.234241 | 276207.7854635 | -4.2472562 | 0.0000219 |
nhoodMerced Manor | -1149356.677654 | 277431.5766183 | -4.1428474 | 0.0000346 |
nhoodMidtown Terrace | -1075534.898122 | 272298.5781992 | -3.9498366 | 0.0000788 |
nhoodMint Hill | -459287.216028 | 275111.0669482 | -1.6694611 | 0.0950598 |
nhoodMiraloma Park | -1203282.293693 | 275857.7750698 | -4.3619662 | 0.0000130 |
nhoodMission | -575004.958828 | 270412.4235062 | -2.1263999 | 0.0334963 |
nhoodMission Bay | -445580.064259 | 361556.5610873 | -1.2323938 | 0.2178332 |
nhoodMission Dolores | -638715.713133 | 271356.1954374 | -2.3537908 | 0.0186038 |
nhoodMission Terrace | -1455552.826115 | 396946.4489094 | -3.6668745 | 0.0002469 |
nhoodMonterey Heights | -1083758.818163 | 280998.9880661 | -3.8568068 | 0.0001157 |
nhoodMt. Davidson Manor | -1260320.804314 | 282481.3843551 | -4.4616066 | 0.0000082 |
nhoodNob Hill | 1668029.269135 | 520027.7636490 | 3.2075773 | 0.0013431 |
nhoodNoe Valley | -420305.398434 | 265127.8217253 | -1.5852934 | 0.1129338 |
nhoodNorth Beach | 1816035.339000 | 518815.8760266 | 3.5003465 | 0.0004668 |
nhoodNorthern Waterfront | 1504947.713678 | 559468.4548783 | 2.6899599 | 0.0071588 |
nhoodOceanview | -1213238.837533 | 275491.5617225 | -4.4039056 | 0.0000108 |
nhoodOuter Mission | -1487334.766079 | 398306.3234045 | -3.7341480 | 0.0001895 |
nhoodOuter Richmond | -294541.653151 | 155433.8409023 | -1.8949648 | 0.0581281 |
nhoodOuter Sunset | -1302730.691530 | 379969.0299048 | -3.4285181 | 0.0006095 |
nhoodPacific Heights | -114548.465807 | 160542.2025803 | -0.7135100 | 0.4755482 |
nhoodPanhandle | -389171.270446 | 142012.7012498 | -2.7403976 | 0.0061482 |
nhoodParkmerced | 524687.125901 | 340702.9888638 | 1.5400133 | 0.1235912 |
nhoodParkside | -1329840.712146 | 380497.7801552 | -3.4950026 | 0.0004763 |
nhoodParnassus Heights | -349642.044542 | 280877.7527974 | -1.2448193 | 0.2132296 |
nhoodPeralta Heights | -1094383.453380 | 382185.8479011 | -2.8634850 | 0.0041995 |
nhoodPolk Gulch | 1263775.057865 | 548089.1446144 | 2.3057838 | 0.0211447 |
nhoodPortola | -1505007.680289 | 397293.8379478 | -3.7881476 | 0.0001527 |
nhoodPotrero Hill | -704303.266780 | 282804.1334817 | -2.4904278 | 0.0127763 |
nhoodPresidio Heights | 967724.644460 | 163992.9427365 | 5.9010140 | 0.0000000 |
nhoodPresidio Terrace | 399489.080475 | 171443.6264536 | 2.3301483 | 0.0198196 |
nhoodRincon Hill | -581685.783964 | 339796.0832195 | -1.7118672 | 0.0869546 |
nhoodRussian Hill | 2103049.016286 | 515736.0672249 | 4.0777622 | 0.0000459 |
nhoodSeacliff | 580192.585178 | 166166.4216773 | 3.4916356 | 0.0004823 |
nhoodSherwood Forest | -1248438.854768 | 278023.9574620 | -4.4904003 | 0.0000072 |
nhoodShowplace Square | -659327.236413 | 478379.3161579 | -1.3782520 | 0.1681587 |
nhoodSilver Terrace | -1393945.277448 | 381061.5114316 | -3.6580584 | 0.0002555 |
nhoodSouth Beach | 7074.615955 | 368070.1923692 | 0.0192208 | 0.9846654 |
nhoodSouth of Market | -695284.393202 | 286610.9122718 | -2.4258825 | 0.0152902 |
nhoodSt. Francis Wood | -662016.288233 | 277044.8539600 | -2.3895636 | 0.0168882 |
nhoodSt. Marys Park | -1249031.023803 | 385320.9200118 | -3.2415344 | 0.0011931 |
nhoodStonestown | -1442969.182196 | 284689.1961838 | -5.0685772 | 0.0000004 |
nhoodSunnydale | -1500633.751163 | 402042.6544850 | -3.7325237 | 0.0001907 |
nhoodSunnyside | -1079543.649362 | 275712.7102770 | -3.9154657 | 0.0000909 |
nhoodSutro Heights | -416240.973423 | 161361.0683295 | -2.5795626 | 0.0099078 |
nhoodTelegraph Hill | 2062146.008009 | 523375.6275863 | 3.9400880 | 0.0000820 |
nhoodTenderloin | 2561433.248380 | 501612.2107127 | 5.1064013 | 0.0000003 |
nhoodUnion Street | -566754.982909 | 175208.3947485 | -3.2347479 | 0.0012218 |
nhoodUniversity Mound | -1450449.996505 | 398060.6837104 | -3.6437912 | 0.0002701 |
nhoodUpper Market | -573758.093877 | 268325.2015798 | -2.1382937 | 0.0325190 |
nhoodVisitacion Valley | -1484554.854893 | 397532.3265407 | -3.7344255 | 0.0001893 |
nhoodWest Portal | -1113680.071044 | 275567.0284588 | -4.0414126 | 0.0000536 |
nhoodWestern Addition | -482555.811561 | 143011.3980956 | -3.3742472 | 0.0007432 |
nhoodWestwood Highlands | -1241505.813311 | 287389.0169747 | -4.3199487 | 0.0000158 |
nhoodWestwood Park | -1350987.607466 | 279809.2327628 | -4.8282453 | 0.0000014 |
crime_nn5 | -10906413.470544 | 17758591.8216114 | -0.6141486 | 0.5391322 |
DistSchool | 3819910.579148 | 3374207.9106327 | 1.1320911 | 0.2576254 |
districtBuena Vista | -128410.195646 | 293399.6579025 | -0.4376631 | 0.6616407 |
districtCentral | -202589.582056 | 273748.6900422 | -0.7400568 | 0.4592842 |
districtDowntown | -199451.593125 | 343939.1071240 | -0.5799038 | 0.5619936 |
districtIngleside | -184691.284006 | 282924.4633603 | -0.6527936 | 0.5139055 |
districtInner Sunset | -188491.299036 | 280870.1639506 | -0.6710976 | 0.5021750 |
districtMarina | 34882.660227 | 373572.9822982 | 0.0933758 | 0.9256071 |
districtMission | -373393.675577 | 279938.0302259 | -1.3338440 | 0.1822877 |
districtNortheast | -2494026.026325 | 522012.1189187 | -4.7777167 | 0.0000018 |
districtRichmond | -901207.676139 | 368129.7407858 | -2.4480708 | 0.0143807 |
districtSouth Central | 103733.593663 | 113384.4787626 | 0.9148835 | 0.3602765 |
districtSouth of Market | -263862.422096 | 295748.8171541 | -0.8921842 | 0.3723173 |
districtWestern Addition | -526909.892455 | 357768.5367545 | -1.4727676 | 0.1408476 |
Figure 5: Table of in-sample (training set) model results.
82 of the 150 estimated coefficients are statistically significant (i.e. they have p-values below 0.01) and are hence useful for sale price prediction; these have been highlighted in orange. The R-squared value of 0.6995 indicates that our model accounts for 69.95% of the variation in home sale prices in the training data. This is supported by figure 6, where quantile maps of predicted and actual home sale prices reflect similar spatial patterns.
Figure 6: Quantile maps of predicted and actual home sale prices across San Francisco.
To understand the accuracy of our model in predicting unseen data, we ran it on the remaining subset of our data to measure goodness of fit. As evident from figure 7, we obtained a MAE of $222,739.60 and a MAPE of 23.25%; for reference, the mean home sale price in our dataset is $1,144,668. On average, then, our predictions are off by slightly less than a quarter of the actual home sale price.
R-squared Value | MAE | MAPE |
---|---|---|
0.759643 | 222739.6 | 23.25 % |
Figure 7: Table of R-squared Value, Mean Absolute Error, and Mean Absolute Percent Error for the test set.
Diagnostic analysis in figure 8 suggests our model generates fairly accurate predictions for houses worth around $1.25 million; the model over-predicts for homes of lower value and significantly under-predicts for homes of much higher value.
Figure 8: Predicted sale prices (green line) as a function of observed sale prices (i.e. a perfect prediction, orange line).
Figure 9: Distribution of actual and predicted sale prices.
We performed k-fold cross-validation (where k=100) to test the generalizability of our model to unseen data; the results of this analysis are presented in figure 10. Our model generalizes reasonably well to new data, with comparable goodness-of-fit metrics across folds. We obtained a standard deviation of $31,365.83 for the MAE. This degree of variation is modest, as it amounts to only 2.74% of the overall mean sale price.
Mean | Standard Deviation |
---|---|
252482.9 | 31365.83 |
Figure 10: Mean and standard deviation of MAE from 100-fold cross-validation.
This is supported by a histogram of across-fold MAE in figure 11, where clustering of the errors around the mean MAE indicates that our model is quite generalizable to the 100 folds. Nonetheless, there are a few outliers at the extreme tails, indicating that our model could be overfitting certain characteristics.
Figure 11: Distribution of MAE from 100-fold cross-validation. Dotted line represents mean MAE.
Model errors are mapped in figure 12 to gain a deeper understanding of the extent to which missing information in the model may be spatial in nature. It is visually apparent that errors are clustered together.
Figure 12: Map of sale prices and errors.
This was tested empirically by computing the Moran’s I statistic. The frequencies of randomly permuted Moran’s I values are plotted as a histogram in figure 13, while the observed value is indicated by the orange line. The observed value is higher than the randomly generated values, confirming that errors from our predictive model exhibit spatial autocorrelation. This suggests that some degree of spatial variation relating to the underlying structure of home prices has not been captured by our model, even with fixed effects at the neighborhood and district level.
Figure 13: Observed (line) and randomly permuted (histogram) Moran’s I values.
Figure 14: Map of mean MAPE by neighborhoods in San Francisco.
However, the slight gradient of the graph in figure 15 suggests prediction errors are only weakly correlated with the mean sale price of each neighborhood. This implies that although our model loses accuracy when predicting sale prices for homes in neighborhoods with higher mean prices, this occurs only to a limited extent. Our model can hence be regarded as relatively generalizable across neighborhoods, at least on the basis of mean home sale prices. This prompts consideration of other urban contexts that could potentially divide San Francisco.
Figure 15: Scatterplot of mean MAPE by neighborhood as a function of mean sale price by neighborhood.
Figure 16 maps San Francisco according to race and income. Tracts in which at least 51% of residents are white were designated “Majority White”, while tracts in which incomes were greater than the citywide mean were designated “High Income”. Given the distinct spatial segregation, it appears unlikely that our simple linear model will generalize well across these contexts.
Figure 16: Maps of income and racial segregation in San Francisco.
Surprisingly, MAPE does not differ much between majority white and non-white neighborhoods (figure 17); this suggests our model generalizes well with respect to race and makes predictions of similar accuracy across neighborhoods of different racial compositions.
TractContext | mean.MAPE | mean.MAE |
---|---|---|
Majority non-White | 22.33% | 158457.0 |
Majority White | 24.95% | 347125.4 |
NA | 78.06% | 282592.9 |
Figure 17: Mean MAPE and MAE of home sale prices by neighborhood racial context.
However, the error rate is lower when predicting home values in high-income neighborhoods than in low-income neighborhoods (figure 18). Our model hence does not generalize well with respect to income: its MAPE is roughly 6.7 percentage points higher in low-income neighborhoods (27.92%) than in high-income neighborhoods (21.2%). This likely reflects a need for additional features that account for income differences between neighborhoods.
TractContext | mean.MAPE | mean.MAE |
---|---|---|
High Income | 21.2% | 251566.1 |
Low Income | 27.92% | 155602.0 |
NA | 78.06% | 282592.9 |
Figure 18: Mean MAPE and MAE of home sale prices by neighborhood income context.
As discussed in section 4, our model has room for improvement in terms of accuracy and generalizability. For instance, it tends to under-predict prices for homes worth above $1,250,000 (figure 8); should Zillow employ our model, this would set unrealistically low prices for home buyers and undervalue the selling price for owners of higher-value homes. Unsatisfied with the low valuations, owners of more expensive homes looking to sell their property would exit the Zillow market, thereby creating a market of ‘lemons’ (lower value houses) on the Zillow website. Nonetheless, it is worth noting that the model does relatively well at generalizing prices across groups of different racial demographics; as such, in spite of its shortcomings, our model provides a useful first step for further development.
Since the model under-predicts the prices of higher-value houses, one way to improve it would be to examine characteristics unique to such homes and seek to account for these within the model. Additionally, a spatial lag can be added to the model, since it is clear that a spatial relationship remains within the prediction errors. This can be achieved by considering mean home prices within a buffer distance of every house, or by taking the average price of some specified number of nearest neighbors. Finally, our model uses Ordinary Least Squares linear regression to model the relationship between predictors and home sale prices. However, it is likely that most predictors are not linearly related to housing prices due to the law of diminishing marginal returns. As such, a non-linear regression model could be used instead to improve the predictive power of our model.
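The nearest-neighbor variant of the spatial lag could be sketched as follows. This is an illustration under assumptions rather than a tested extension of our model: the function name `knn_price_lag`, the planar coordinates and the toy prices (in thousands of dollars) are hypothetical.

```python
import math


def knn_price_lag(coords, prices, k=5):
    """For each house, the mean sale price of its k nearest other houses."""
    lags = []
    for i, here in enumerate(coords):
        nearest = sorted((j for j in range(len(coords)) if j != i),
                         key=lambda j: math.dist(here, coords[j]))[:k]
        lags.append(sum(prices[j] for j in nearest) / len(nearest))
    return lags


# Toy example: four houses along a street, prices in $1000s.
coords = [(0, 0), (1, 0), (2, 0), (10, 0)]
prices = [100, 200, 300, 1000]
lags = knn_price_lag(coords, prices, k=2)  # [250.0, 200.0, 150.0, 250.0]
```

The resulting column would enter the regression as one more predictor. When predicting houses with unknown prices, the lag would presumably need to be computed from known sales only, so that a house's own price never informs its prediction.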