1. Introduction

This report seeks to build a predictive model of home prices in San Francisco through a consideration of its local context. Prediction of real estate prices in urban spaces is becoming increasingly important; prices not only reflect overall market conditions but are also a good indicator of economic health. Such analysis is also critical for assessment purposes ranging from property tax collection to the valuation of residential mortgage-backed securities. Although machine learning techniques are widely used for this purpose, the challenge of this analysis lay in thoughtfully selecting localized variables that might strengthen the predictive power of our model, while taking care not to overfit it with too much specificity.

In constructing our model, we first attempted to identify a list of physical and place-based characteristics that could potentially influence housing prices. We then sought to improve our model by experimenting with different combinations of these. We identified the best model by comparing the average difference between observed and predicted prices for each candidate model. To this end, we employed two metrics: the Mean Absolute Error (MAE) and the Mean Absolute Percent Error (MAPE).

With the set of variables we eventually decided on (figure 1), we obtained a MAE of $222,739.60 and a corresponding MAPE of 23.25% on the test set, with the mean home sale price for our dataset being $1,144,668. The standard deviation of our MAE across all folds when performing k-fold cross-validation (an indicator of the model’s generalizability to new data) is $31,365.83.


2. Data

2.1. Data and variables

We first used the tidycensus package to access American Community Survey (ACS) data for San Francisco County, and selected socio-economic and demographic variables that we deemed relevant for home price prediction. We also retrieved geographic information on localized features and amenities from San Francisco’s Open Data portal.

Summary statistics for all variables considered within our initial model are as follows. Variables eventually selected for our final model have been highlighted in grey:

Figure 1: List of variables selected for initial and final models.


2.2. Correlation of variables

The correlation matrix below presents pairwise correlations between variables and makes apparent the major associations of our variables with home sale prices. This was an important step in our exploratory analysis, as it allowed us to get a preliminary sense of which variables might have the largest explanatory power.

Figure 2: Correlation matrix of variables.


Scatterplots were then drawn for four independent variables to visualize their relationship to home sale prices.

Figure 3: Scatter plots of independent variables as a function of home sale price.

2.3. Spatial visualization of variables

To permit visual inspection of potential relationships between dependent and independent variables across space, the following maps were plotted and selected variables were visualized through quintile breaks.

Figure 4: Maps of dependent and independent variables across San Francisco.


3. Methods

We adopted the following pipeline approach for our analysis:

3.1. Determining predictors

After gathering data as detailed in section 2.1, we selected predictors for our model by analyzing their relationship with the sale price. During this process, we were mindful to consider the three theoretical components of home prices: internal and parcel-level characteristics, public services and amenities, and finally the underlying spatial structure of home prices. The predictors retained for the final model were those that proved more statistically significant in a linear regression run using all the initial predictors listed in figure 1.

3.2. Model building

A linear regression model was built to predict home sale prices, such that each prediction is an intercept plus a linear combination of our predictors. Coefficients were estimated using Ordinary Least Squares (OLS), thereby allowing us to predict the sale price of each house given the values of the predictors for that house.
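As an illustration of the estimation step described above, the following is a minimal sketch of OLS fitted by solving the normal equations. It is not the code used for this report: the function names `fit_ols` and `predict`, and the toy data (floor area and number of baths), are hypothetical and chosen only to show the mechanics.

```python
def fit_ols(X, y):
    """Estimate OLS coefficients by solving the normal equations (X'X) b = X'y.

    X is a list of rows of predictor values (without an intercept column).
    Returns [intercept, b1, b2, ...].
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend intercept column
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(A[k][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; X'X is singular")
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for k in range(p):
            if k != col:
                factor = A[k][col]
                A[k] = [vk - factor * vc for vk, vc in zip(A[k], A[col])]
    return [A[i][p] for i in range(p)]


def predict(coefs, x):
    """Predicted price = intercept + sum of (coefficient * predictor value)."""
    return coefs[0] + sum(b * v for b, v in zip(coefs[1:], x))


# Toy data (made up): columns are floor area in sq ft and number of baths.
X = [[1000, 1], [1500, 2], [2000, 2], [2500, 3], [1200, 1]]
# Prices constructed to be exactly linear, so OLS should recover the inputs.
y = [50_000 + 240 * area + 38_000 * baths for area, baths in X]

coefs = fit_ols(X, y)
print([round(c, 2) for c in coefs])
print(round(predict(coefs, [1800, 2]), 2))
```

Because the toy prices are exactly linear in the two predictors, the recovered coefficients should match the values used to build them, up to floating-point error. Real sale prices carry noise, so coefficients there come with standard errors such as those reported in figure 5.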

3.3. Model validation

We ran a few tests to measure the accuracy and generalizability of our model, in order to determine its predictive power. As a basic first test, we ran the linear regression using the predictors on the entire data set to see the relative goodness of fit on seen data by analyzing the R-squared value. We aimed to achieve a relatively high R-squared value, as this represents the proportion of the variance for the sale price that is explained by the predictors.

3.3.1. Accuracy

To test the model's power to predict data that it was not trained on, we randomly split the data into a training set (60% of the original data set) and a test set (40% of the original data set), and once again ran the regression, this time using only the training set. We then used this model, with its own set of coefficients for the predictors, to predict sale prices on the test set. The absolute error for each house was calculated and averaged across the test set to give the Mean Absolute Error (MAE); the absolute error expressed as a percentage of the observed price was averaged in the same way to give the Mean Absolute Percentage Error (MAPE). Both metrics indicate how accurately the model predicts unseen data.
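The two metrics and the 60/40 split can be sketched as follows. This is a minimal illustration under stated assumptions rather than the report's own code: the helper names (`mae`, `mape`, `train_test_split`) and the three example prices are invented for the purpose.

```python
import random


def mae(observed, predicted):
    """Mean Absolute Error: average of |observed - predicted|."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mape(observed, predicted):
    """Mean Absolute Percentage Error: average of |observed - predicted| / observed, in %."""
    return 100 * sum(abs(o - p) / o for o, p in zip(observed, predicted)) / len(observed)


def train_test_split(rows, train_share=0.6, seed=0):
    """Randomly split rows into a training set and a test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_share * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Made-up observed and predicted prices for three homes.
observed = [500_000, 1_000_000, 2_000_000]
predicted = [600_000, 1_000_000, 1_700_000]

print(round(mae(observed, predicted), 2))   # 133333.33
print(round(mape(observed, predicted), 2))  # 11.67

# MAPE is averaged per home, so it generally differs from MAE / mean price.
mean_price = sum(observed) / len(observed)
print(round(100 * mae(observed, predicted) / mean_price, 2))  # 11.43

train, test = train_test_split(range(10))
print(len(train), len(test))  # 6 4
```

The last comparison shows why the MAPE cannot be read off from the MAE and the mean sale price alone: percentage errors on cheaper homes weigh more heavily in the MAPE than they do in the MAE.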

As a more rigorous version of the approach described above, we ran 100-fold cross-validation on the model. The data are randomly partitioned into 100 folds; each fold, 1/100th the size of the original data set, is held out in turn as the test set while the model is trained on the remaining 99 folds. The average MAE and average MAPE across the 100 iterations are calculated to provide a better indication of the accuracy of the model. The standard deviation of the MAE across iterations also provides an indicator of the generalizability of the model across different data sets.
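A minimal sketch of this cross-validation procedure is given below. To stay self-contained it fits a one-predictor line rather than the full model, and it runs on synthetic data; the names `fit_line` and `k_fold_mae`, the seed values, and the price-generating formula are all assumptions made for illustration.

```python
import random
import statistics


def fit_line(xs, ys):
    """Closed-form OLS for a single predictor; returns (intercept, slope)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def k_fold_mae(xs, ys, k, seed=0):
    """Return the MAE of each of the k held-out folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all rows
    fold_maes = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [abs(ys[i] - (intercept + slope * xs[i])) for i in held_out]
        fold_maes.append(statistics.fmean(errors))
    return fold_maes


# Synthetic data: price is linear in floor area plus Gaussian noise.
rng = random.Random(42)
areas = [rng.uniform(600, 4000) for _ in range(500)]
prices = [200_000 + 250 * a + rng.gauss(0, 100_000) for a in areas]

fold_maes = k_fold_mae(areas, prices, k=100)
mean_mae = statistics.fmean(fold_maes)
sd_mae = statistics.stdev(fold_maes)
print(round(mean_mae), round(sd_mae))
```

The mean of `fold_maes` corresponds to the cross-validated MAE, and its standard deviation to the across-fold spread used here as a generalizability indicator. With k = 100, each fold holds few observations, so some spread between folds is expected even for a well-behaved model.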

3.3.2. Generalizability

To test the model's generalizability, we mapped the errors (the differences between predicted and observed prices) and calculated Moran’s I, a measure of spatial autocorrelation, to test for any spatial relationship that was not accounted for in the model. We also plotted the MAPE by neighborhood to inspect whether errors differed significantly between neighborhoods. We made sure to account for spatial differences in prices relating to neighborhood characteristics by incorporating neighborhood fixed effects in our model.
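The Moran’s I calculation and its permutation test can be sketched as follows. This is an illustrative sketch only: the choice of row-standardized k-nearest-neighbor weights, the function names, and the toy grid of errors are assumptions, not a description of the weights actually used in this report.

```python
import random


def knn_weights(coords, k):
    """Row-standardized spatial weights: each point's k nearest neighbors get weight 1/k."""
    n = len(coords)
    weights = [[0.0] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(coords):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: (coords[j][0] - xi) ** 2 + (coords[j][1] - yi) ** 2,
        )
        for j in others[:k]:
            weights[i][j] = 1.0 / k
    return weights


def morans_i(values, weights):
    """Moran's I = (n / sum of weights) * sum_ij w_ij d_i d_j / sum_i d_i^2."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)


def permutation_test(values, weights, n_perm=999, seed=0):
    """Compare the observed Moran's I with values from randomly shuffled errors."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, weights) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (n_perm + 1)


# Toy example: a 6x6 grid whose errors grow from west to east (clustered by design).
coords = [(x, y) for x in range(6) for y in range(6)]
errors = [x * 10_000.0 for x, _ in coords]

observed_i, p_value = permutation_test(errors, knn_weights(coords, k=4), n_perm=199)
print(round(observed_i, 3), p_value)
```

An observed Moran’s I well above the values obtained from shuffled errors, and hence a small pseudo p-value, indicates that errors of similar size sit near one another, which is the pattern the permutation histogram is meant to reveal.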

We also analyzed the MAE in majority white tracts versus non-majority white tracts, and in higher income versus lower income tracts to test for any bias in our model. Should there be a significant difference between the MAEs, the model would be biased to a particular urban demographic and hence should not be used.
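The group-wise comparison described above amounts to computing the error metrics separately for each tract context. The sketch below shows one way to do this; the function name `errors_by_group` and the four example records are hypothetical.

```python
from collections import defaultdict


def errors_by_group(records):
    """Compute mean MAE and mean MAPE per group.

    records: iterable of (group_label, observed_price, predicted_price).
    """
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # [sum abs error, sum abs pct error, count]
    for group, observed, predicted in records:
        entry = totals[group]
        entry[0] += abs(observed - predicted)
        entry[1] += abs(observed - predicted) / observed
        entry[2] += 1
    return {
        group: {"mean_MAE": abs_err / n, "mean_MAPE": 100 * pct_err / n}
        for group, (abs_err, pct_err, n) in totals.items()
    }


# Made-up sales: (tract context, observed price, predicted price).
records = [
    ("High Income", 2_000_000, 1_600_000),
    ("High Income", 1_500_000, 1_650_000),
    ("Low Income", 600_000, 750_000),
    ("Low Income", 800_000, 640_000),
]

summary = errors_by_group(records)
for group, metrics in summary.items():
    print(group, round(metrics["mean_MAE"]), round(metrics["mean_MAPE"], 2))
```

In this toy data the high-income group has the larger MAE but the smaller MAPE, because its homes are more expensive. The two metrics can therefore rank groups differently, which is worth keeping in mind when judging bias from either one alone.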

Using the methods detailed in this section, we decided on the predictors used in the final model. We then ran the linear regression on the entire data set, and the estimated coefficients of the predictors were used to predict the sale prices of the houses with unknown sale prices.


4. Analysis and discussion

Following processes of feature engineering and selection, we narrowed our list of predictors such that our final model comprised a total of 20 variables. These have been highlighted in grey in figure 1.

We trained this model on a subset of our data to analyze the significance and coefficients of our selected predictors in determining home sale prices. A table of results from this analysis is as follows:

The R-squared (a measure of the model's goodness of fit) of the model run on the training set is 0.6995.

Predictor Estimated Coefficient Std. Error t value p-value
(Intercept) 1027113.136512 383697.9774255 2.6768792 0.0074442
LotArea 1.096873 0.0388050 28.2662701 0.0000000
PropArea 242.093929 7.1158093 34.0219811 0.0000000
builtYear.catOld -26071.081082 21824.0415591 -1.1946037 0.2322724
builtYear.catVery Old 20570.931888 19219.7177622 1.0703035 0.2845105
builtYear.catVeryNew 262315.342934 41469.0233979 6.3255732 0.0000000
Beds 6346.420175 2943.1025847 2.1563707 0.0310803
Baths 38153.000935 5608.5928088 6.8025978 0.0000000
Year.cat2013 160450.083650 11000.4098518 14.5858278 0.0000000
Year.cat2014 327157.383482 11350.8366531 28.8223145 0.0000000
Year.cat2015 504720.217178 11505.1265944 43.8691581 0.0000000
Month.catAug 32446.148235 18992.6537256 1.7083525 0.0876044
Month.catDec 34008.066480 19546.4987727 1.7398546 0.0819177
Month.catFeb -17648.944910 21068.2275629 -0.8377043 0.4022184
Month.catJan -129990.176862 22067.9321193 -5.8904557 0.0000000
Month.catJul 46246.538783 18788.2876738 2.4614558 0.0138555
Month.catJun 21052.026780 18797.5469518 1.1199348 0.2627705
Month.catMar -10726.872420 19779.5935516 -0.5423202 0.5876110
Month.catMay 40469.700422 18728.3302507 2.1608814 0.0307300
Month.catNov 42482.136098 19125.9935641 2.2211728 0.0263633
Month.catOct 50712.788781 18598.6141847 2.7266972 0.0064092
Month.catSept 24327.857038 20047.1125673 1.2135342 0.2249565
MedInc 4.881354 0.7022558 6.9509629 0.0000000
PRforeign -20941.031488 5211.0851080 -4.0185549 0.0000590
PRDrive -17090.782921 7216.7472783 -2.3682114 0.0178947
PRabv65 61931.750383 11667.9939999 5.3078319 0.0000001
PRSchool -90385.294593 18297.7860907 -4.9396847 0.0000008
PRPrivate 126.543783 28.7425179 4.4026687 0.0000108
PRlackPlumbing 35763.066428 25163.0117709 1.4212554 0.1552761
DistBusStops 32222622.745540 6364394.2508101 5.0629520 0.0000004
DistRailStn 1627630.504329 1589629.6151020 1.0239055 0.3059066
nhoodAnza Vista -435617.651630 165942.5120153 -2.6251118 0.0086763
nhoodAquatic Park / Ft. Mason 1738683.552997 550551.5014706 3.1580761 0.0015932
nhoodAshbury Heights -362332.235989 248374.1425332 -1.4588163 0.1446495
nhoodBalboa Terrace -1148592.876706 281836.9173022 -4.0753812 0.0000463
nhoodBayview -1445430.516934 380960.3818154 -3.7941754 0.0001491
nhoodBernal Heights -1031396.022226 379483.3002972 -2.7178957 0.0065821
nhoodBret Harte -1490479.748969 381878.8059559 -3.9030177 0.0000957
nhoodBuena Vista -501141.245364 245696.0352833 -2.0396798 0.0414105
nhoodCandlestick Point SRA -1469250.570772 394871.3966419 -3.7208331 0.0001997
nhoodCastro -643302.648457 273582.8654374 -2.3513996 0.0187237
nhoodCathedral Hill -548899.176880 205304.5862490 -2.6735846 0.0075177
nhoodCayuga -1365423.668010 398587.2743668 -3.4256580 0.0006160
nhoodCentral Waterfront -2287481.401002 480201.3341981 -4.7635882 0.0000019
nhoodCivic Center -440004.376661 405869.2335219 -1.0841038 0.2783469
nhoodClarendon Heights -609152.104027 270521.6584495 -2.2517683 0.0243602
nhoodCole Valley -300301.366021 253923.7964126 -1.1826437 0.2369807
nhoodCorona Heights -869177.515363 251638.6692421 -3.4540698 0.0005547
nhoodCow Hollow 409553.210166 175957.8723809 2.3275640 0.0199567
nhoodCrocker Amazon -1498184.346755 397489.4615305 -3.7691172 0.0001648
nhoodDiamond Heights -809506.391750 267576.2641291 -3.0253296 0.0024904
nhoodDogpatch -732653.439074 343219.9377134 -2.1346471 0.0328160
nhoodDolores Heights -383015.187853 266214.8704210 -1.4387445 0.1502566
nhoodDuboce Triangle -540561.235882 249952.9901209 -2.1626516 0.0305934
nhoodEureka Valley -505785.957070 265473.3196549 -1.9052233 0.0567819
nhoodExcelsior -1452030.084527 397305.2997899 -3.6546960 0.0002589
nhoodFairmount -782080.216963 266541.0919615 -2.9341825 0.0033525
nhoodFishermans Wharf 2476943.783948 561740.3641066 4.4094104 0.0000105
nhoodForest Hill -857033.766143 275617.9608549 -3.1094990 0.0018797
nhoodForest Knolls -1063067.928593 279005.1766367 -3.8102086 0.0001397
nhoodGlen Park -852119.635037 266842.1721651 -3.1933469 0.0014110
nhoodGolden Gate Heights -1081567.547128 273101.3209928 -3.9603161 0.0000754
nhoodHaight Ashbury -627824.235862 251578.7260461 -2.4955379 0.0125939
nhoodHayes Valley -478502.186276 167069.3048101 -2.8640940 0.0041914
nhoodHolly Park -971967.000842 382898.5301575 -2.5384454 0.0111508
nhoodHunters Point -1521803.492588 386100.3232775 -3.9414717 0.0000816
nhoodIndia Basin -1719624.368994 540455.6152674 -3.1818050 0.0014684
nhoodIngleside -1114908.833066 275538.8045297 -4.0462861 0.0000525
nhoodIngleside Terraces -1626241.432073 279578.9873509 -5.8167513 0.0000000
nhoodInner Richmond -12464.543862 156144.7930737 -0.0798268 0.9363767
nhoodInner Sunset -793286.776987 275546.2898436 -2.8789601 0.0039990
nhoodJapantown -137656.651617 164574.3055101 -0.8364407 0.4029285
nhoodLaguna Honda -1067436.492015 276459.1198360 -3.8611007 0.0001136
nhoodLake Street 260044.124387 164604.0682565 1.5798159 0.1141831
nhoodLakeshore -1071132.728406 277751.9642604 -3.8564362 0.0001158
nhoodLaurel Heights / Jordan Park 322922.443097 162659.9676835 1.9852607 0.0471449
nhoodLittle Hollywood -1472848.966476 390194.0760973 -3.7746574 0.0001612
nhoodLone Mountain -171569.704786 153499.5514958 -1.1177212 0.2637150
nhoodLower Haight -572613.508909 231454.1000893 -2.4739830 0.0133794
nhoodLower Nob Hill 628751.791381 436087.5106186 1.4418019 0.1493920
nhoodLower Pacific Heights -29668.417446 147027.2721615 -0.2017885 0.8400865
nhoodMarina -335863.380010 167388.1718757 -2.0064941 0.0448325
nhoodMerced Heights -1173125.234241 276207.7854635 -4.2472562 0.0000219
nhoodMerced Manor -1149356.677654 277431.5766183 -4.1428474 0.0000346
nhoodMidtown Terrace -1075534.898122 272298.5781992 -3.9498366 0.0000788
nhoodMint Hill -459287.216028 275111.0669482 -1.6694611 0.0950598
nhoodMiraloma Park -1203282.293693 275857.7750698 -4.3619662 0.0000130
nhoodMission -575004.958828 270412.4235062 -2.1263999 0.0334963
nhoodMission Bay -445580.064259 361556.5610873 -1.2323938 0.2178332
nhoodMission Dolores -638715.713133 271356.1954374 -2.3537908 0.0186038
nhoodMission Terrace -1455552.826115 396946.4489094 -3.6668745 0.0002469
nhoodMonterey Heights -1083758.818163 280998.9880661 -3.8568068 0.0001157
nhoodMt. Davidson Manor -1260320.804314 282481.3843551 -4.4616066 0.0000082
nhoodNob Hill 1668029.269135 520027.7636490 3.2075773 0.0013431
nhoodNoe Valley -420305.398434 265127.8217253 -1.5852934 0.1129338
nhoodNorth Beach 1816035.339000 518815.8760266 3.5003465 0.0004668
nhoodNorthern Waterfront 1504947.713678 559468.4548783 2.6899599 0.0071588
nhoodOceanview -1213238.837533 275491.5617225 -4.4039056 0.0000108
nhoodOuter Mission -1487334.766079 398306.3234045 -3.7341480 0.0001895
nhoodOuter Richmond -294541.653151 155433.8409023 -1.8949648 0.0581281
nhoodOuter Sunset -1302730.691530 379969.0299048 -3.4285181 0.0006095
nhoodPacific Heights -114548.465807 160542.2025803 -0.7135100 0.4755482
nhoodPanhandle -389171.270446 142012.7012498 -2.7403976 0.0061482
nhoodParkmerced 524687.125901 340702.9888638 1.5400133 0.1235912
nhoodParkside -1329840.712146 380497.7801552 -3.4950026 0.0004763
nhoodParnassus Heights -349642.044542 280877.7527974 -1.2448193 0.2132296
nhoodPeralta Heights -1094383.453380 382185.8479011 -2.8634850 0.0041995
nhoodPolk Gulch 1263775.057865 548089.1446144 2.3057838 0.0211447
nhoodPortola -1505007.680289 397293.8379478 -3.7881476 0.0001527
nhoodPotrero Hill -704303.266780 282804.1334817 -2.4904278 0.0127763
nhoodPresidio Heights 967724.644460 163992.9427365 5.9010140 0.0000000
nhoodPresidio Terrace 399489.080475 171443.6264536 2.3301483 0.0198196
nhoodRincon Hill -581685.783964 339796.0832195 -1.7118672 0.0869546
nhoodRussian Hill 2103049.016286 515736.0672249 4.0777622 0.0000459
nhoodSeacliff 580192.585178 166166.4216773 3.4916356 0.0004823
nhoodSherwood Forest -1248438.854768 278023.9574620 -4.4904003 0.0000072
nhoodShowplace Square -659327.236413 478379.3161579 -1.3782520 0.1681587
nhoodSilver Terrace -1393945.277448 381061.5114316 -3.6580584 0.0002555
nhoodSouth Beach 7074.615955 368070.1923692 0.0192208 0.9846654
nhoodSouth of Market -695284.393202 286610.9122718 -2.4258825 0.0152902
nhoodSt. Francis Wood -662016.288233 277044.8539600 -2.3895636 0.0168882
nhoodSt. Marys Park -1249031.023803 385320.9200118 -3.2415344 0.0011931
nhoodStonestown -1442969.182196 284689.1961838 -5.0685772 0.0000004
nhoodSunnydale -1500633.751163 402042.6544850 -3.7325237 0.0001907
nhoodSunnyside -1079543.649362 275712.7102770 -3.9154657 0.0000909
nhoodSutro Heights -416240.973423 161361.0683295 -2.5795626 0.0099078
nhoodTelegraph Hill 2062146.008009 523375.6275863 3.9400880 0.0000820
nhoodTenderloin 2561433.248380 501612.2107127 5.1064013 0.0000003
nhoodUnion Street -566754.982909 175208.3947485 -3.2347479 0.0012218
nhoodUniversity Mound -1450449.996505 398060.6837104 -3.6437912 0.0002701
nhoodUpper Market -573758.093877 268325.2015798 -2.1382937 0.0325190
nhoodVisitacion Valley -1484554.854893 397532.3265407 -3.7344255 0.0001893
nhoodWest Portal -1113680.071044 275567.0284588 -4.0414126 0.0000536
nhoodWestern Addition -482555.811561 143011.3980956 -3.3742472 0.0007432
nhoodWestwood Highlands -1241505.813311 287389.0169747 -4.3199487 0.0000158
nhoodWestwood Park -1350987.607466 279809.2327628 -4.8282453 0.0000014
crime_nn5 -10906413.470544 17758591.8216114 -0.6141486 0.5391322
DistSchool 3819910.579148 3374207.9106327 1.1320911 0.2576254
districtBuena Vista -128410.195646 293399.6579025 -0.4376631 0.6616407
districtCentral -202589.582056 273748.6900422 -0.7400568 0.4592842
districtDowntown -199451.593125 343939.1071240 -0.5799038 0.5619936
districtIngleside -184691.284006 282924.4633603 -0.6527936 0.5139055
districtInner Sunset -188491.299036 280870.1639506 -0.6710976 0.5021750
districtMarina 34882.660227 373572.9822982 0.0933758 0.9256071
districtMission -373393.675577 279938.0302259 -1.3338440 0.1822877
districtNortheast -2494026.026325 522012.1189187 -4.7777167 0.0000018
districtRichmond -901207.676139 368129.7407858 -2.4480708 0.0143807
districtSouth Central 103733.593663 113384.4787626 0.9148835 0.3602765
districtSouth of Market -263862.422096 295748.8171541 -0.8921842 0.3723173
districtWestern Addition -526909.892455 357768.5367545 -1.4727676 0.1408476

Figure 5: Table of in-sample (training set) model results.

82 of our 150 predictors are statistically significant (i.e. they have p-values below 0.01) and are hence useful for sale price prediction; these have been highlighted in orange. The R-squared value of 0.6995 indicates that our model accounts for 69.95% of the variation in home prices throughout San Francisco. This is supported by figure 6, where quantile maps of predicted and actual home sale prices reflect similar spatial patterns.

Figure 6: Quantile maps of predicted and actual home sale prices across San Francisco.

4.1. Accuracy

To understand the accuracy of our model in predicting unseen data, we ran it on the held-out test set. As shown in figure 7, we obtained a MAE of $222,739.60 and a MAPE of 23.25%; for reference, the mean home sale price for our dataset is $1,144,668. On average, then, our predictions are off by slightly less than a quarter of the actual sale price.

R-squared Value MAE MAPE
0.759643 222739.6 23.25 %

Figure 7: Table of R-squared Value, Mean Absolute Error, and Mean Absolute Percent Error for the test set.

Diagnostic analysis in figure 8 suggests our model generates fairly accurate predictions for houses worth around $1.25 million; the model over-predicts for homes of lower value, and significantly under-predicts for homes of much higher value.

Figure 8: Predicted sale prices as a function of observed sale prices (green line). The orange line represents a perfect prediction.

This is supported by the density histogram in figure 9, where predictions deviate from actual prices above and below the $1.25 million mark.

Figure 9: Distribution of actual and predicted sale prices.

4.2. Generalizability

We performed k-fold cross-validation (where k=100) to test the generalizability of our model to unseen data; the results of this analysis are presented in figure 10. Our model is relatively generalizable to new data, with comparable goodness-of-fit metrics across folds. We obtained a standard deviation of $31,365.83 for the MAE. This degree of variation is reasonable, as it amounts to only 2.74% of the overall mean sale price.

Mean Standard Deviation
252482.9 31365.83

Figure 10: Mean and standard deviation of MAE from 100-fold cross-validation.

This is supported by the histogram of across-fold MAE in figure 11, where the clustering of errors around the mean MAE indicates that our model performs consistently across the 100 folds. Nonetheless, there are a few outliers at the extreme tails, indicating that our model could be overfitting to certain characteristics.

Figure 11: Distribution of MAE from 100-fold cross-validation. Dotted line represents mean MAE.

Model errors are mapped in figure 12 to gain a deeper understanding of the extent to which missing information in the model may be spatial in nature. It is visually apparent that errors are clustered together.

Figure 12: Map of sale prices and errors.

This was tested empirically by computing the Moran’s I statistic. The distribution of randomly permuted Moran’s I values is plotted as a histogram in figure 13, while the observed value is indicated by the orange line. The observed value is higher than the randomly generated values, which confirms that errors from our predictive model exhibit spatial autocorrelation. This suggests that some degree of spatial variation relating to the underlying structure of home prices has not been captured by our model, even with fixed effects at the neighborhood and district level.

Figure 13: Observed (line) and randomly permuted (histogram) Moran’s I values.

Figure 14 maps the MAPE produced by our model across different neighborhoods in San Francisco. Our model predicts with reasonable accuracy in western San Francisco, but loses predictive power in the north and south of the city. We hence conclude that our model is not generalizable across urban space.

Figure 14: Map of mean MAPE by neighborhoods in San Francisco.

However, the slight gradient of the graph in figure 15 suggests prediction errors are only weakly correlated with the mean sale price of each neighborhood. This implies that although our model loses accuracy when predicting sale prices for homes in neighborhoods with higher mean prices, this occurs only to a limited extent. Our model can hence be regarded as relatively generalizable across neighborhoods, at least on the basis of mean home sale prices. This prompts consideration of other urban contexts that could potentially divide San Francisco.

Figure 15: Scatterplot of mean MAPE by neighborhood as a function of mean sale price by neighborhood.

Figure 16 maps San Francisco according to race and income. Tracts in which at least 51% of residents are white were designated “Majority White”, while tracts in which incomes were greater than the citywide mean were designated “High Income”. Given the distinct spatial segregation, it appears unlikely that our simple linear model will generalize well across these contexts.

Figure 16: Maps of income and racial segregation in San Francisco.

Surprisingly, MAPE does not differ much between majority white and non-white neighborhoods (figure 17); this suggests our model generalizes well with respect to race and makes predictions of similar accuracy across neighborhoods of different racial compositions.

TractContext mean.MAPE mean.MAE
Majority non-White 22.33% 158457.0
Majority White 24.95% 347125.4
NA 78.06% 282592.9

Figure 17: Mean MAPE and MAE of home sale prices by neighborhood racial context.

However, the percentage error (MAPE) is lower when predicting home values in high-income neighborhoods than in low-income neighborhoods (figure 18). Our model hence does not generalize well with respect to income: its MAPE in low-income neighborhoods is about 6.7 percentage points higher than in high-income neighborhoods (27.92% versus 21.2%). This suggests a need to engineer more features that can account for income differences between neighborhoods.

TractContext mean.MAPE mean.MAE
High Income 21.2% 251566.1
Low Income 27.92% 155602.0
NA 78.06% 282592.9

Figure 18: Mean MAPE and MAE of home sale prices by neighborhood income context.


5. Conclusion

As discussed in section 4, our model has room for improvement in terms of accuracy and generalizability. For instance, it tends to under-predict prices for homes worth above $1,250,000 (figure 8); should Zillow employ our model, this would set unrealistically low prices for home buyers, and undervalue the selling price for owners of higher-value homes. Dissatisfied with the low valuations, owners of more expensive homes looking to sell their property would exit the Zillow market, thereby creating a market of ‘lemons’ (lower-value houses) on the Zillow website. Nonetheless, it is worth noting that the model generalizes relatively well across neighborhoods of different racial composition. As such, in spite of its shortcomings, our model provides a useful first step for further development.

5.1. Improvements

Since the model under-predicts prices for higher-value houses, one way to improve it would be to examine characteristics unique to homes with higher prices, and seek to account for these within the model. Additionally, a spatial lag term can be added to the model, since it is clear that a spatial relationship remains within the prediction errors. This can be achieved by taking the mean home price within a buffer distance of every house, or by averaging the prices of some specified number of nearest neighbors. Finally, our model uses Ordinary Least Squares linear regression to model the relationship between predictors and home sale prices. However, it is likely that most predictors are not linearly related to housing prices due to the law of diminishing marginal returns. As such, a non-linear regression model could be used instead to improve the predictive power of our model.
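The nearest-neighbor variant of the spatial lag could be sketched as below. This is a hypothetical illustration, not part of the model reported here: the function name `spatial_lag`, the planar coordinates, and the prices are invented, and a real implementation would use projected coordinates.

```python
import math


def spatial_lag(coords, prices, k=5):
    """For each house, average the sale prices of its k nearest neighbors.

    The house itself is excluded so that its own price does not leak into its lag.
    """
    lags = []
    for i, (xi, yi) in enumerate(coords):
        neighbors = sorted(
            (math.hypot(xj - xi, yj - yi), j)
            for j, (xj, yj) in enumerate(coords)
            if j != i
        )[:k]
        lags.append(sum(prices[j] for _, j in neighbors) / len(neighbors))
    return lags


# Made-up data: two clusters of three homes each, in arbitrary planar units.
coords = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
prices = [900_000, 1_000_000, 1_100_000, 2_000_000, 2_200_000, 2_400_000]

lag = spatial_lag(coords, prices, k=2)
print(lag[0], lag[3])  # 1050000.0 2300000.0
```

The resulting lag values would enter the regression as an additional predictor. When validating such a model, the lag for a test-set house should be computed from training-set sales only; otherwise held-out prices leak into the predictors and the reported errors look better than they are.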