HEDONIC HOME PRICE PREDICTION (San Francisco)

1. Introduction

This report seeks to build a predictive model of home prices in San Francisco that takes the city's local context into account. Predicting real estate prices in urban areas is increasingly important: prices not only reflect overall market conditions but are also a good indicator of economic health. Such analysis is also critical for assessment purposes ranging from property tax collection to the valuation of residential mortgage-backed securities. Although machine learning techniques are widely used for this purpose, the challenge of this analysis lay in thoughtfully selecting localized variables that might strengthen the predictive power of our model, while remaining mindful not to overfit it with too much specificity.

In constructing our model, we first attempted to identify a list of physical and place-based characteristics that could potentially influence housing prices. We then sought to improve our model by experimenting with different combinations of these. We identified the best model by comparing the average difference between observed and predicted prices for each model. To this end, we employed two metrics: the Mean Absolute Error (MAE) and the Mean Absolute Percent Error (MAPE).

With the set of variables we eventually decided on (figure 1), we obtained a MAE of $222,739.60 and a MAPE of 23.25% on the test set; for reference, the mean home sale price for our dataset is $1,144,668. The standard deviation of the MAE across all folds when performing k-fold cross-validation (an indicator of the model’s generalizability to new data) is $31,365.83.

2. Data

2.1. Data and variables

We first used the tidycensus package to access American Community Survey (ACS) data for San Francisco County, and selected socio-economic as well as demographic variables that we deemed relevant for home price prediction. We also retrieved geographic information regarding localized features and amenities from San Francisco’s Open Data portal.

Summary statistics for all variables considered within our initial model are as follows. Variables eventually selected for our final model have been highlighted in grey:

Figure 1: List of variables selected for initial and final models.

2.2. Correlation of variables

The correlation matrix below presents pairwise correlations between variables and makes apparent the major associations of our variables with home sale prices. This was an important step in our exploratory analysis, as it allowed us to get a preliminary sense of which variables might have the largest explanatory power.

Figure 2: Correlation matrix of variables.

Scatterplots were then drawn for four independent variables to visualize their relationship to home sale prices.

Figure 3: Scatter plots of independent variables as a function of home sale price.

2.3. Spatial visualization of variables

To permit visual inspection of potential relationships between dependent and independent variables across space, the following maps were plotted and selected variables were visualized through quintile breaks.

Figure 4: Maps of dependent and independent variables across San Francisco.

3. Methods

We adopted the following pipeline approach for our analysis:

3.1. Determining predictors

After gathering data as detailed in part 2.1, we selected predictors for our model by analyzing their relationship with the sale price. During this process, we were mindful to consider the three theoretical components of home prices: internal and parcel-level characteristics, public services and amenities, and finally the underlying spatial structure of home prices. The predictors selected for the final model were those that were more statistically significant in a linear regression run on all the initial predictors listed in figure 1.

3.2. Model building

A linear regression model was built to predict home sale prices, such that each prediction is an intercept plus a linear combination of our predictors. Coefficients were estimated using Ordinary Least Squares (OLS), thereby allowing us to predict the sale price of each house given the values of its predictors.
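As a rough illustration of the model form described above, the following sketch estimates OLS coefficients by solving the normal equations on a toy data set. It is not the code used for this report: the function names `fit_ols` and `predict`, the two toy predictors (area in thousands of square feet, number of baths), and the toy prices (in thousands of dollars) are all assumptions made for the example.

```python
def fit_ols(X, y):
    """Estimate OLS coefficients by solving the normal equations (X'X) b = X'y.

    X is a list of rows of predictor values (without an intercept column);
    y is a list of sale prices. Returns [intercept, b1, b2, ...].
    """
    rows = [[1.0] + list(map(float, r)) for r in X]  # prepend intercept column
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(A[k][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; X'X is singular")
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for k in range(p):
            if k != col:
                factor = A[k][col]
                A[k] = [vk - factor * vc for vk, vc in zip(A[k], A[col])]
    return [A[i][p] for i in range(p)]


def predict(coefs, X):
    """Predicted price = intercept + sum of (coefficient * predictor value)."""
    return [coefs[0] + sum(b * x for b, x in zip(coefs[1:], r)) for r in X]


# Hypothetical toy data: price = 100 + 200 * area + 50 * baths, with no noise.
X = [[1.0, 1], [1.5, 2], [2.0, 2], [2.5, 3], [1.2, 1]]
y = [100 + 200 * area + 50 * baths for area, baths in X]
coefs = fit_ols(X, y)
print([round(c, 6) for c in coefs])
```

Because the toy prices are generated without noise, the recovered coefficients should match the generating values up to floating-point error. With real data and categorical predictors such as neighborhood, each category level would first need to be expanded into its own indicator column.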

3.3. Model validation

We ran a few tests to measure the accuracy and generalizability of our model, in order to determine its predictive power. As a basic first test, we ran the linear regression using the predictors on the entire data set and examined the R-squared value to gauge goodness of fit on seen data. We aimed to achieve a relatively high R-squared value, as this represents the proportion of the variance in sale price that is explained by the predictors.

3.3.1. Accuracy

To test the model on its power to predict data that it was not trained on, we randomly split the data into a training set (60% of the original data set) and a test set (40% of the original data set), and once again ran the regression using only the training set. We then used this model, with its own set of estimated coefficients, to predict sale prices on the test set. The absolute error for each house was calculated and averaged to give the Mean Absolute Error (MAE); the absolute error expressed as a share of the observed price was likewise averaged to give the Mean Absolute Percentage Error (MAPE). Both metrics indicate how accurately the model predicts unseen data.
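The split and the two error metrics described above can be sketched as follows. This is a minimal illustration rather than the report's own code; the helper names (`train_test_split`, `mae`, `mape`), the fixed seed, and the three example prices are assumptions.

```python
import random


def train_test_split(rows, train_share=0.6, seed=0):
    """Randomly split rows into a training set and a test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_share * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def mae(observed, predicted):
    """Mean Absolute Error: the average of |observed - predicted|."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mape(observed, predicted):
    """Mean Absolute Percentage Error: the average of |observed - predicted| / observed."""
    return sum(abs(o - p) / o for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical observed and predicted sale prices for three test-set homes.
observed = [800_000, 1_000_000, 2_000_000]
predicted = [900_000, 1_000_000, 1_500_000]
print(mae(observed, predicted))   # 200000.0
print(mape(observed, predicted))  # 0.125

train, test = train_test_split(range(10))  # 6 training rows, 4 test rows
```

Note that both metrics are averages over homes, not sums, so they stay comparable between test sets of different sizes.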

As a more rigorous approach to the same concept, we ran 100-fold cross-validation on the model: the data set is randomly partitioned into 100 folds, and the above procedure is repeated 100 times, each time holding out a different fold (1/100th of the original data set) as the test set and training on the remainder. The average MAE and average MAPE across the 100 iterations are calculated to provide a better indication of the accuracy of the model. The standard deviation of the MAE across iterations also provides an indicator of the generalizability of the model across different data sets.
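The cross-validation loop described above might look like the sketch below. It assumes a generic `fit` function and `score` function supplied by the caller; the stand-in "model" used in the example simply predicts the mean training price, which is a placeholder and not the regression used in this report.

```python
import random
from statistics import mean, stdev


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 once and partition them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(rows, k, fit, score, seed=0):
    """Hold out each fold once, train on the rest, and return the per-fold scores."""
    scores = []
    for held_out in k_fold_indices(len(rows), k, seed):
        held = set(held_out)
        train = [r for i, r in enumerate(rows) if i not in held]
        test = [rows[i] for i in held_out]
        scores.append(score(fit(train), test))
    return scores


def fit_mean(train_prices):
    """Placeholder model (an assumption): predict the mean training price."""
    return mean(train_prices)


def mae_score(model, test_prices):
    """MAE of the placeholder model on the held-out fold."""
    return mean(abs(p - model) for p in test_prices)


prices = [float(p) for p in range(100, 200)]  # hypothetical sale prices
fold_maes = cross_validate(prices, k=10, fit=fit_mean, score=mae_score)
print(mean(fold_maes), stdev(fold_maes))
```

The mean of the per-fold MAEs summarizes accuracy, while their standard deviation summarizes how much accuracy varies from one held-out fold to the next.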

3.3.2. Generalizability

To test the model on its generalizability, we plotted the errors between the predicted price and observed price on a map and calculated Moran’s I (a measure of spatial autocorrelation) to test for any spatial relationship that was not accounted for in the model. We also plotted the MAPE by neighborhood to inspect if there were significant differences between errors in different neighborhoods. We made sure to account for spatial differences in prices relating to neighborhood characteristics by incorporating neighborhood fixed effects in our model.
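For readers unfamiliar with Moran's I, the sketch below computes the statistic for a set of prediction errors and compares it against values obtained by randomly shuffling the errors across locations. The spatial weights matrix, the six example errors, and the function names are assumptions made for illustration; the report does not state which weights specification was used.

```python
import random


def morans_i(values, weights):
    """Moran's I for values z_i and a spatial weights matrix w_ij (zero diagonal)."""
    n = len(values)
    m = sum(values) / n
    dev = [v - m for v in values]
    w_sum = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)


def permutation_test(values, weights, n_perm=999, seed=0):
    """Return the observed I and a one-sided pseudo p-value from random permutations."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, weights) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (n_perm + 1)


# Hypothetical example: six homes along a street, each neighboring the adjacent ones.
w = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(6)] for i in range(6)]
clustered_errors = [-3, -2, -1, 1, 2, 3]   # similar errors sit next to each other
alternating_errors = [1, -1, 1, -1, 1, -1]  # neighbors have opposite errors

observed_i, p_value = permutation_test(clustered_errors, w)
print(observed_i, p_value)
print(morans_i(alternating_errors, w))
```

A positive value indicates that nearby errors tend to be similar (clustering), while a negative value indicates that neighbors tend to have opposite errors. An observed value well above the bulk of the permuted values is evidence of spatial autocorrelation in the errors.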

We also analyzed the MAE in majority white tracts versus non-majority white tracts, and in higher income versus lower income tracts to test for any bias in our model. Should there be a significant difference between the MAEs, the model would be biased to a particular urban demographic and hence should not be used.
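The group comparison described above can be sketched as follows. The record layout, the function name `errors_by_group`, and the four example sales are assumptions chosen to show that MAE and MAPE can tell different stories when groups differ in price level.

```python
from collections import defaultdict


def errors_by_group(records):
    """records: iterable of (group, observed, predicted).

    Returns {group: (MAE, MAPE)} computed separately within each group.
    """
    buckets = defaultdict(list)
    for group, obs, pred in records:
        abs_err = abs(obs - pred)
        buckets[group].append((abs_err, abs_err / obs))
    return {
        g: (sum(a for a, _ in errs) / len(errs),
            sum(p for _, p in errs) / len(errs))
        for g, errs in buckets.items()
    }


# Hypothetical sales tagged with a tract context label.
records = [
    ("High Income", 2_000_000, 1_600_000),
    ("High Income", 1_000_000, 1_200_000),
    ("Low Income", 500_000, 600_000),
    ("Low Income", 400_000, 300_000),
]
by_group = errors_by_group(records)
for group, (group_mae, group_mape) in by_group.items():
    print(f"{group}: MAE={group_mae:,.0f}  MAPE={group_mape:.1%}")
```

In this toy example the high-income group has three times the MAE of the low-income group, yet a slightly lower MAPE, because its homes are more expensive. Comparing groups on MAPE as well as MAE therefore helps separate differences in price level from differences in relative accuracy.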

Using the methods detailed in this section, we decided on the predictors used in the final model. We then ran the linear regression on the entire data set, and the estimated coefficients of the predictors were used to predict the sale prices of the houses with unknown sale prices.

4. Analysis and discussion

Following processes of feature engineering and selection, we narrowed our list of predictors such that our final model comprised a total of 20 variables. These have been highlighted in grey in figure 1.

We trained this model on a subset of our data to analyze the significance and coefficients of our selected predictors in determining home sale prices. A table of results from this analysis is as follows:

The R² (measure of goodness of fit of the model) of the model run on the training set data is 0.6995.

Predictor	Estimated Coefficient	Std. Error	t value	p-value
(Intercept)	1027113.136512	383697.9774255	2.6768792	0.0074442
LotArea	1.096873	0.0388050	28.2662701	0.0000000
PropArea	242.093929	7.1158093	34.0219811	0.0000000
builtYear.catOld	-26071.081082	21824.0415591	-1.1946037	0.2322724
builtYear.catVery Old	20570.931888	19219.7177622	1.0703035	0.2845105
builtYear.catVeryNew	262315.342934	41469.0233979	6.3255732	0.0000000
Beds	6346.420175	2943.1025847	2.1563707	0.0310803
Baths	38153.000935	5608.5928088	6.8025978	0.0000000
Year.cat2013	160450.083650	11000.4098518	14.5858278	0.0000000
Year.cat2014	327157.383482	11350.8366531	28.8223145	0.0000000
Year.cat2015	504720.217178	11505.1265944	43.8691581	0.0000000
Month.catAug	32446.148235	18992.6537256	1.7083525	0.0876044
Month.catDec	34008.066480	19546.4987727	1.7398546	0.0819177
Month.catFeb	-17648.944910	21068.2275629	-0.8377043	0.4022184
Month.catJan	-129990.176862	22067.9321193	-5.8904557	0.0000000
Month.catJul	46246.538783	18788.2876738	2.4614558	0.0138555
Month.catJun	21052.026780	18797.5469518	1.1199348	0.2627705
Month.catMar	-10726.872420	19779.5935516	-0.5423202	0.5876110
Month.catMay	40469.700422	18728.3302507	2.1608814	0.0307300
Month.catNov	42482.136098	19125.9935641	2.2211728	0.0263633
Month.catOct	50712.788781	18598.6141847	2.7266972	0.0064092
Month.catSept	24327.857038	20047.1125673	1.2135342	0.2249565
MedInc	4.881354	0.7022558	6.9509629	0.0000000
PRforeign	-20941.031488	5211.0851080	-4.0185549	0.0000590
PRDrive	-17090.782921	7216.7472783	-2.3682114	0.0178947
PRabv65	61931.750383	11667.9939999	5.3078319	0.0000001
PRSchool	-90385.294593	18297.7860907	-4.9396847	0.0000008
PRPrivate	126.543783	28.7425179	4.4026687	0.0000108
PRlackPlumbing	35763.066428	25163.0117709	1.4212554	0.1552761
DistBusStops	32222622.745540	6364394.2508101	5.0629520	0.0000004
DistRailStn	1627630.504329	1589629.6151020	1.0239055	0.3059066
nhoodAnza Vista	-435617.651630	165942.5120153	-2.6251118	0.0086763
nhoodAquatic Park / Ft. Mason	1738683.552997	550551.5014706	3.1580761	0.0015932
nhoodAshbury Heights	-362332.235989	248374.1425332	-1.4588163	0.1446495
nhoodBalboa Terrace	-1148592.876706	281836.9173022	-4.0753812	0.0000463
nhoodBayview	-1445430.516934	380960.3818154	-3.7941754	0.0001491
nhoodBernal Heights	-1031396.022226	379483.3002972	-2.7178957	0.0065821
nhoodBret Harte	-1490479.748969	381878.8059559	-3.9030177	0.0000957
nhoodBuena Vista	-501141.245364	245696.0352833	-2.0396798	0.0414105
nhoodCandlestick Point SRA	-1469250.570772	394871.3966419	-3.7208331	0.0001997
nhoodCastro	-643302.648457	273582.8654374	-2.3513996	0.0187237
nhoodCathedral Hill	-548899.176880	205304.5862490	-2.6735846	0.0075177
nhoodCayuga	-1365423.668010	398587.2743668	-3.4256580	0.0006160
nhoodCentral Waterfront	-2287481.401002	480201.3341981	-4.7635882	0.0000019
nhoodCivic Center	-440004.376661	405869.2335219	-1.0841038	0.2783469
nhoodClarendon Heights	-609152.104027	270521.6584495	-2.2517683	0.0243602
nhoodCole Valley	-300301.366021	253923.7964126	-1.1826437	0.2369807
nhoodCorona Heights	-869177.515363	251638.6692421	-3.4540698	0.0005547
nhoodCow Hollow	409553.210166	175957.8723809	2.3275640	0.0199567
nhoodCrocker Amazon	-1498184.346755	397489.4615305	-3.7691172	0.0001648
nhoodDiamond Heights	-809506.391750	267576.2641291	-3.0253296	0.0024904
nhoodDogpatch	-732653.439074	343219.9377134	-2.1346471	0.0328160
nhoodDolores Heights	-383015.187853	266214.8704210	-1.4387445	0.1502566
nhoodDuboce Triangle	-540561.235882	249952.9901209	-2.1626516	0.0305934
nhoodEureka Valley	-505785.957070	265473.3196549	-1.9052233	0.0567819
nhoodExcelsior	-1452030.084527	397305.2997899	-3.6546960	0.0002589
nhoodFairmount	-782080.216963	266541.0919615	-2.9341825	0.0033525
nhoodFishermans Wharf	2476943.783948	561740.3641066	4.4094104	0.0000105
nhoodForest Hill	-857033.766143	275617.9608549	-3.1094990	0.0018797
nhoodForest Knolls	-1063067.928593	279005.1766367	-3.8102086	0.0001397
nhoodGlen Park	-852119.635037	266842.1721651	-3.1933469	0.0014110
nhoodGolden Gate Heights	-1081567.547128	273101.3209928	-3.9603161	0.0000754
nhoodHaight Ashbury	-627824.235862	251578.7260461	-2.4955379	0.0125939
nhoodHayes Valley	-478502.186276	167069.3048101	-2.8640940	0.0041914
nhoodHolly Park	-971967.000842	382898.5301575	-2.5384454	0.0111508
nhoodHunters Point	-1521803.492588	386100.3232775	-3.9414717	0.0000816
nhoodIndia Basin	-1719624.368994	540455.6152674	-3.1818050	0.0014684
nhoodIngleside	-1114908.833066	275538.8045297	-4.0462861	0.0000525
nhoodIngleside Terraces	-1626241.432073	279578.9873509	-5.8167513	0.0000000
nhoodInner Richmond	-12464.543862	156144.7930737	-0.0798268	0.9363767
nhoodInner Sunset	-793286.776987	275546.2898436	-2.8789601	0.0039990
nhoodJapantown	-137656.651617	164574.3055101	-0.8364407	0.4029285
nhoodLaguna Honda	-1067436.492015	276459.1198360	-3.8611007	0.0001136
nhoodLake Street	260044.124387	164604.0682565	1.5798159	0.1141831
nhoodLakeshore	-1071132.728406	277751.9642604	-3.8564362	0.0001158
nhoodLaurel Heights / Jordan Park	322922.443097	162659.9676835	1.9852607	0.0471449
nhoodLittle Hollywood	-1472848.966476	390194.0760973	-3.7746574	0.0001612
nhoodLone Mountain	-171569.704786	153499.5514958	-1.1177212	0.2637150
nhoodLower Haight	-572613.508909	231454.1000893	-2.4739830	0.0133794
nhoodLower Nob Hill	628751.791381	436087.5106186	1.4418019	0.1493920
nhoodLower Pacific Heights	-29668.417446	147027.2721615	-0.2017885	0.8400865
nhoodMarina	-335863.380010	167388.1718757	-2.0064941	0.0448325
nhoodMerced Heights	-1173125.234241	276207.7854635	-4.2472562	0.0000219
nhoodMerced Manor	-1149356.677654	277431.5766183	-4.1428474	0.0000346
nhoodMidtown Terrace	-1075534.898122	272298.5781992	-3.9498366	0.0000788
nhoodMint Hill	-459287.216028	275111.0669482	-1.6694611	0.0950598
nhoodMiraloma Park	-1203282.293693	275857.7750698	-4.3619662	0.0000130
nhoodMission	-575004.958828	270412.4235062	-2.1263999	0.0334963
nhoodMission Bay	-445580.064259	361556.5610873	-1.2323938	0.2178332
nhoodMission Dolores	-638715.713133	271356.1954374	-2.3537908	0.0186038
nhoodMission Terrace	-1455552.826115	396946.4489094	-3.6668745	0.0002469
nhoodMonterey Heights	-1083758.818163	280998.9880661	-3.8568068	0.0001157
nhoodMt. Davidson Manor	-1260320.804314	282481.3843551	-4.4616066	0.0000082
nhoodNob Hill	1668029.269135	520027.7636490	3.2075773	0.0013431
nhoodNoe Valley	-420305.398434	265127.8217253	-1.5852934	0.1129338
nhoodNorth Beach	1816035.339000	518815.8760266	3.5003465	0.0004668
nhoodNorthern Waterfront	1504947.713678	559468.4548783	2.6899599	0.0071588
nhoodOceanview	-1213238.837533	275491.5617225	-4.4039056	0.0000108
nhoodOuter Mission	-1487334.766079	398306.3234045	-3.7341480	0.0001895
nhoodOuter Richmond	-294541.653151	155433.8409023	-1.8949648	0.0581281
nhoodOuter Sunset	-1302730.691530	379969.0299048	-3.4285181	0.0006095
nhoodPacific Heights	-114548.465807	160542.2025803	-0.7135100	0.4755482
nhoodPanhandle	-389171.270446	142012.7012498	-2.7403976	0.0061482
nhoodParkmerced	524687.125901	340702.9888638	1.5400133	0.1235912
nhoodParkside	-1329840.712146	380497.7801552	-3.4950026	0.0004763
nhoodParnassus Heights	-349642.044542	280877.7527974	-1.2448193	0.2132296
nhoodPeralta Heights	-1094383.453380	382185.8479011	-2.8634850	0.0041995
nhoodPolk Gulch	1263775.057865	548089.1446144	2.3057838	0.0211447
nhoodPortola	-1505007.680289	397293.8379478	-3.7881476	0.0001527
nhoodPotrero Hill	-704303.266780	282804.1334817	-2.4904278	0.0127763
nhoodPresidio Heights	967724.644460	163992.9427365	5.9010140	0.0000000
nhoodPresidio Terrace	399489.080475	171443.6264536	2.3301483	0.0198196
nhoodRincon Hill	-581685.783964	339796.0832195	-1.7118672	0.0869546
nhoodRussian Hill	2103049.016286	515736.0672249	4.0777622	0.0000459
nhoodSeacliff	580192.585178	166166.4216773	3.4916356	0.0004823
nhoodSherwood Forest	-1248438.854768	278023.9574620	-4.4904003	0.0000072
nhoodShowplace Square	-659327.236413	478379.3161579	-1.3782520	0.1681587
nhoodSilver Terrace	-1393945.277448	381061.5114316	-3.6580584	0.0002555
nhoodSouth Beach	7074.615955	368070.1923692	0.0192208	0.9846654
nhoodSouth of Market	-695284.393202	286610.9122718	-2.4258825	0.0152902
nhoodSt. Francis Wood	-662016.288233	277044.8539600	-2.3895636	0.0168882
nhoodSt. Marys Park	-1249031.023803	385320.9200118	-3.2415344	0.0011931
nhoodStonestown	-1442969.182196	284689.1961838	-5.0685772	0.0000004
nhoodSunnydale	-1500633.751163	402042.6544850	-3.7325237	0.0001907
nhoodSunnyside	-1079543.649362	275712.7102770	-3.9154657	0.0000909
nhoodSutro Heights	-416240.973423	161361.0683295	-2.5795626	0.0099078
nhoodTelegraph Hill	2062146.008009	523375.6275863	3.9400880	0.0000820
nhoodTenderloin	2561433.248380	501612.2107127	5.1064013	0.0000003
nhoodUnion Street	-566754.982909	175208.3947485	-3.2347479	0.0012218
nhoodUniversity Mound	-1450449.996505	398060.6837104	-3.6437912	0.0002701
nhoodUpper Market	-573758.093877	268325.2015798	-2.1382937	0.0325190
nhoodVisitacion Valley	-1484554.854893	397532.3265407	-3.7344255	0.0001893
nhoodWest Portal	-1113680.071044	275567.0284588	-4.0414126	0.0000536
nhoodWestern Addition	-482555.811561	143011.3980956	-3.3742472	0.0007432
nhoodWestwood Highlands	-1241505.813311	287389.0169747	-4.3199487	0.0000158
nhoodWestwood Park	-1350987.607466	279809.2327628	-4.8282453	0.0000014
crime_nn5	-10906413.470544	17758591.8216114	-0.6141486	0.5391322
DistSchool	3819910.579148	3374207.9106327	1.1320911	0.2576254
districtBuena Vista	-128410.195646	293399.6579025	-0.4376631	0.6616407
districtCentral	-202589.582056	273748.6900422	-0.7400568	0.4592842
districtDowntown	-199451.593125	343939.1071240	-0.5799038	0.5619936
districtIngleside	-184691.284006	282924.4633603	-0.6527936	0.5139055
districtInner Sunset	-188491.299036	280870.1639506	-0.6710976	0.5021750
districtMarina	34882.660227	373572.9822982	0.0933758	0.9256071
districtMission	-373393.675577	279938.0302259	-1.3338440	0.1822877
districtNortheast	-2494026.026325	522012.1189187	-4.7777167	0.0000018
districtRichmond	-901207.676139	368129.7407858	-2.4480708	0.0143807
districtSouth Central	103733.593663	113384.4787626	0.9148835	0.3602765
districtSouth of Market	-263862.422096	295748.8171541	-0.8921842	0.3723173
districtWestern Addition	-526909.892455	357768.5367545	-1.4727676	0.1408476

Figure 5: Table of in-sample (training set) model results.

82 of the 150 estimated terms in figure 5 (the intercept plus 149 coefficients, with each level of a categorical variable counted separately) are statistically significant at the 1% level (i.e. they have p-values below 0.01) and are hence likely to be useful for sale price prediction; these have been highlighted in orange. The R-squared value of 0.6995 suggests our model accounts for 69.95% of the variance in home prices throughout San Francisco. This is supported by figure 6, where quantile maps of predicted and actual home sale prices reflect similar spatial patterns.

Figure 6: Quantile maps of predicted and actual home sale prices across San Francisco.

4.1. Accuracy

To understand the accuracy of our model in predicting unseen data, we ran it on the remaining subset of our data (the test set) and measured its prediction errors. As evident from figure 7, we obtained a MAE of $222,739.60 and a MAPE of 23.25%; for reference, the mean home sale price for our dataset is $1,144,668. Because MAPE averages each home's own percentage error, it is not simply the MAE divided by the mean price. The MAPE suggests that our predictions are off, on average, by slightly less than a quarter of the actual home sale price.

R-squared Value	MAE	MAPE
0.759643	222739.6	23.25 %

Figure 7: Table of R-squared Value, Mean Absolute Error, and Mean Absolute Percent Error for the test set.
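To make the relationship between the two error metrics in figure 7 concrete, the short calculation below divides the reported MAE by the mean sale price and compares the result with the reported MAPE. The two-home example that follows is hypothetical and is included only to show why the two quantities can differ.

```python
# Figures reported in this section (test set, figure 7) and the dataset mean price.
reported_mae = 222_739.60
reported_mape = 0.2325
mean_price = 1_144_668

# MAE divided by the mean price is a different quantity from MAPE.
mae_over_mean = reported_mae / mean_price
print(f"MAE / mean price: {mae_over_mean:.2%}")   # 19.46%
print(f"Reported MAPE:    {reported_mape:.2%}")   # 23.25%

# Hypothetical two-home example: the same dollar error weighs more on a cheaper home.
observed = [500_000, 2_000_000]
predicted = [700_000, 1_800_000]
abs_err = [abs(o - p) for o, p in zip(observed, predicted)]
toy_mae = sum(abs_err) / len(abs_err)
toy_ratio = toy_mae / (sum(observed) / len(observed))
toy_mape = sum(e / o for e, o in zip(abs_err, observed)) / len(observed)
print(f"toy MAE / mean price: {toy_ratio:.0%}, toy MAPE: {toy_mape:.0%}")
```

A MAPE that exceeds MAE divided by the mean price, as here, is what one would expect if percentage errors tend to be larger on lower-priced homes, which would be consistent with the over-prediction of lower-value homes discussed in this section.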

Diagnostic analysis in figure 8 suggests our model generates fairly accurate predictions for houses worth around $1.25 million; the model over-predicts for homes of lower values, and significantly under-predicts for homes of much higher values.

Figure 8: Predicted sale prices as a function of observed sale prices (green line), with a perfect prediction shown for reference (orange line).

This is supported by the density histogram in figure 9, where predictions deviate from actual prices above and below the $1.25 million mark.

Figure 9: Distribution of actual and predicted sale prices.

4.2. Generalizability

We performed k-fold cross-validation (with k = 100) to test the generalizability of our model to unseen data; the results for this analysis are presented in figure 10. Our model is relatively generalizable to new data, with comparable goodness of fit metrics across folds. We obtained a mean MAE of $252,482.90 with a standard deviation of $31,365.83. This degree of variation is modest relative to sale prices, at only 2.74% of the overall mean sale price, although it amounts to roughly 12% of the mean MAE itself.

Mean	Standard Deviation
252482.9	31365.83

Figure 10: Mean and standard deviation of MAE from 100-fold cross-validation.

This is supported by a histogram of across-fold MAE in figure 11, where clustering of the errors around the mean MAE indicates that our model is quite generalizable to the 100 folds. Nonetheless, there are a few outliers at the extreme tails, indicating that our model could be overfitting certain characteristics.

Figure 11: Distribution of MAE from 100-fold cross-validation. Dotted line represents mean MAE.

Model errors are mapped in figure 12 to gain a deeper understanding of the extent to which missing information in the model may be spatial in nature. It is visually apparent that errors are clustered together.

Figure 12: Map of sale prices and errors.

This was tested empirically through computation of the Moran’s I statistic. The frequencies of Moran’s I values obtained under random permutation are plotted as a histogram in figure 13, while the observed value is indicated by the orange line. The higher observed value relative to the randomly generated values confirms that errors from our predictive model exhibit spatial autocorrelation. This suggests that some degree of spatial variation relating to the underlying structure of home prices has not been captured by our model, even with fixed effects at the neighborhood and district level.

Figure 13: Observed (line) and randomly permuted (histogram) Moran’s I values.

Figure 14 maps the MAPE produced by our model across different neighborhoods in San Francisco. Our model predicts with reasonable accuracy in western San Francisco, but loses predictive power in the north and south of the city in particular. We hence conclude that our model is not fully generalizable across urban space.

Figure 14: Map of mean MAPE by neighborhoods in San Francisco.

However, the slight gradient of the graph in figure 15 suggests prediction errors are only weakly correlated with the mean sale price of each neighborhood. This implies that although our model loses accuracy when predicting sale prices for homes in neighborhoods with higher mean prices, this occurs only to a limited extent. Our model can hence be regarded as relatively generalizable across neighborhoods, at least on the basis of mean home sale prices. This prompts consideration of other urban contexts that could potentially divide San Francisco.

Figure 15: Scatterplot of mean MAPE by neighborhood as a function of mean sale price by neighborhood.

Figure 16 maps San Francisco according to race and income. Tracts in which at least 51% of residents are white were designated “Majority White”, while tracts in which incomes were greater than the citywide mean were designated “High Income”. Given the distinct spatial segregation, it appears unlikely that our simple linear model will generalize well across these contexts.

Figure 16: Maps of income and racial segregation in San Francisco.

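Surprisingly, MAPE does not differ much between majority white and majority non-white tracts (24.95% versus 22.33%, figure 17); this suggests our model generalizes well with respect to race and makes predictions of similar relative accuracy across tracts of different racial compositions. MAE, by contrast, is more than twice as large in majority white tracts ($347,125.40 versus $158,457.00); given the similar MAPE, this is consistent with higher sale prices in those tracts rather than with a larger proportional error.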

TractContext	mean.MAPE	mean.MAE
Majority non-White	22.33%	158457.0
Majority White	24.95%	347125.4
NA	78.06%	282592.9

Figure 17: Mean MAPE and MAE of home sale prices by neighborhood racial context.

However, there is a lower percentage error when predicting home values in high-income neighborhoods as compared to low-income neighborhoods (figure 18). Our model hence does not generalize as well with respect to income: its MAPE in low-income neighborhoods (27.92%) is about 6.7 percentage points higher than in high-income neighborhoods (21.2%). This likely points to a need for additional features that account for income differences between neighborhoods.

TractContext	mean.MAPE	mean.MAE
High Income	21.2%	251566.1
Low Income	27.92%	155602.0
NA	78.06%	282592.9

Figure 18: Mean MAPE and MAE of home sale prices by neighborhood income context.

5. Conclusion

As discussed in section 4, our model has room for improvement in terms of accuracy and generalizability. For instance, it tends to under-predict prices for homes worth above $1,250,000 (figure 8); should Zillow employ our model, this would set unrealistically low prices for home buyers, and undervalue the selling price for owners of higher-value homes. Unsatisfied with the low valuations, owners of more expensive homes looking to sell their property would exit the Zillow market, thereby creating a market of ‘lemons’ (lower value houses) on the Zillow website. Nonetheless, it is worth noting that the model does relatively well at generalizing across groups of different racial demographics; as such, in spite of its shortcomings, our model provides a useful first step for further developments.

5.1. Improvements

Since the model under-predicts prices for higher-value houses, one way to improve the model would be to examine characteristics unique to homes with higher prices, and seek to account for these within the model. Additionally, a spatial lag term can be added to the model, since it is clear that a spatial relationship remains within the prediction errors. This can be achieved by taking the mean home price within a buffer distance of every house, or by taking the average price of some specified number of nearest neighbors. Finally, our model uses Ordinary Least Squares linear regression to model the relationship between predictors and home sale prices. However, it is likely that many predictors are not linearly related to housing prices, for example because of diminishing marginal returns. As such, a non-linear regression model could be used instead to improve the predictive power of our model.
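The nearest-neighbor variant of the spatial lag suggested above could be sketched as follows. This is only an illustration under stated assumptions: the function name `knn_price_lag`, the use of planar (projected) coordinates with straight-line distance, and the four example homes are all hypothetical.

```python
import math


def knn_price_lag(coords, prices, k=5):
    """For each home, average the sale prices of its k nearest other homes.

    coords: list of (x, y) positions in a projected coordinate system (assumed).
    prices: list of sale prices in the same order as coords.
    """
    lags = []
    for i, (xi, yi) in enumerate(coords):
        # Distance from home i to every other home, sorted ascending.
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(coords)
            if j != i
        )
        nearest = [j for _, j in dists[:k]]
        lags.append(sum(prices[j] for j in nearest) / len(nearest))
    return lags


# Hypothetical example: two pairs of homes in two distant parts of the city.
coords = [(0, 0), (0, 1), (10, 10), (10, 11)]
prices = [100, 200, 900, 1000]
lags = knn_price_lag(coords, prices, k=1)
print(lags)  # [200.0, 100.0, 1000.0, 900.0]
```

The resulting column could then be added as an extra predictor. One design caution: when the lag is used for prediction, the neighbors' prices should come only from the training sales, otherwise information from the test set leaks into the model and the error estimates become optimistic.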