library(tidyverse)
library(caret)
library(knitr) # for modelling
library(pscl) #polscience computational library for new reg model
library(plotROC)
library(pROC)
library(ggplot2)
library(ggpubr)
library(gridExtra)
library(grid)
library(kableExtra)
palette5 <- c("#981FAC","#CB0F8B","#FF006A","#FE4C35","#FE9900")
palette4 <- c("#981FAC","#FF006A","#FE4C35","#FE9900")
palette2 <- c("#981FAC","#FF006A")
The purpose of this report is to increase the uptake of the Department of Housing and Community Development’s (HCD) credit program for homeowners in the City of Philadelphia to fund repairs for their homes. Homeowners whose houses are deemed to be blighted by the city are eligible for this credit of $5,000. In previous campaigns of this credit program, eligible owners have been reached out to at random, resulting in a low 11% uptake rate of this credit. The following analysis uses past data from campaigns on homeowners reached out to and associated factors; these include economic conditions, homeowner demographics, properties of the home and campaign factors. With this data, a binomial regression was used to train a model that can be used to predict if a certain person is likely to accept the credit or not. With this information, the HCD will be able to target their resources at a subset of eligible homeowners that will result in an optimized benefit; accounting for the benefit of the credit and the cost of marketing.
Useful features for predicting if the homeowner is likely to accept or not accept the credit are features where the likelihood of accepting is different between the categories of that feature. Hence, in exploring the data, the percentage of homeowners that accepted the credit was calculated for categories in each feature. If there was a greater difference, the feature was deemed appropriate for the final model. The features have been grouped into 4 categories:
. Economic Conditions
. Demographic of Owner
. Resident’s Home
. Campaign Features
An overall summary of all the original data features is shown below:
peopleCredit <- read.csv("C:/Users/leann/OneDrive/Desktop/MUSA507/Week08/people_based_ml_hw_2019/housingSubsidy.csv") %>% rename('accept' = 'y')
# overall summary
# sum(is.na(peopleCredit)) # no missing data
summary(peopleCredit)
## X age job marital
## Min. : 1 Min. :18.00 admin. :1012 divorced: 446
## 1st Qu.:1030 1st Qu.:32.00 blue-collar: 884 married :2509
## Median :2060 Median :38.00 technician : 691 single :1153
## Mean :2060 Mean :40.11 services : 393 unknown : 11
## 3rd Qu.:3090 3rd Qu.:47.00 management : 324
## Max. :4119 Max. :88.00 retired : 166
## (Other) : 649
## education taxLien mortgage taxbill_in_phl
## university.degree :1264 no :3315 no :1839 no : 770
## high.school : 921 unknown: 803 unknown: 105 yes:3349
## basic.9y : 574 yes : 1 yes :2175
## professional.course: 535
## basic.4y : 429
## basic.6y : 228
## (Other) : 168
## contact month day_of_week campaign
## cellular :2652 may :1378 fri:768 Min. : 1.000
## telephone:1467 jul : 711 mon:855 1st Qu.: 1.000
## aug : 636 thu:860 Median : 2.000
## jun : 530 tue:841 Mean : 2.537
## nov : 446 wed:795 3rd Qu.: 3.000
## apr : 215 Max. :35.000
## (Other): 203
## pdays previous poutcome unemploy_rate
## Min. : 0.0 Min. :0.0000 failure : 454 Min. :-3.40000
## 1st Qu.:999.0 1st Qu.:0.0000 nonexistent:3523 1st Qu.:-1.80000
## Median :999.0 Median :0.0000 success : 142 Median : 1.10000
## Mean :960.4 Mean :0.1903 Mean : 0.08497
## 3rd Qu.:999.0 3rd Qu.:0.0000 3rd Qu.: 1.40000
## Max. :999.0 Max. :6.0000 Max. : 1.40000
##
## cons.price.idx cons.conf.idx inflation_rate spent_on_repairs
## Min. :92.20 Min. :-50.8 Min. :0.635 Min. :4964
## 1st Qu.:93.08 1st Qu.:-42.7 1st Qu.:1.334 1st Qu.:5099
## Median :93.75 Median :-41.8 Median :4.857 Median :5191
## Mean :93.58 Mean :-40.5 Mean :3.621 Mean :5166
## 3rd Qu.:93.99 3rd Qu.:-36.4 3rd Qu.:4.961 3rd Qu.:5228
## Max. :94.77 Max. :-26.9 Max. :5.045 Max. :5228
##
## accept y_numeric
## no :3668 Min. :0.0000
## yes: 451 1st Qu.:0.0000
## Median :0.0000
## Mean :0.1095
## 3rd Qu.:0.0000
## Max. :1.0000
##
The data is split into a 65/35 training and test set. The regression model is run on the training set and this regressed model is then used to predict outcomes on the test set.
set.seed(3456)
trainIndex <- createDataPartition(peopleCredit$accept,
p = .65,
list = FALSE,
times = 1)
TrainSet <- peopleCredit[ trainIndex,]
TestSet <- peopleCredit[-trainIndex,]
As a baseline, the model is first run on all the available variables. A summary of the regression and the results of the predictions on the test sets are given below.
kitchenSink <- glm(TrainSet$y_numeric~ .,
data=TrainSet %>% dplyr::select(y_numeric, age, job, marital, education, mortgage, taxbill_in_phl, contact, month, day_of_week, campaign, pdays, previous, poutcome, unemploy_rate, cons.price.idx, cons.conf.idx, inflation_rate, spent_on_repairs),
family="binomial" (link="logit"))
summary(kitchenSink)
##
## Call:
## glm(formula = TrainSet$y_numeric ~ ., family = binomial(link = "logit"),
## data = TrainSet %>% dplyr::select(y_numeric, age, job, marital,
## education, mortgage, taxbill_in_phl, contact, month,
## day_of_week, campaign, pdays, previous, poutcome, unemploy_rate,
## cons.price.idx, cons.conf.idx, inflation_rate, spent_on_repairs))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4312 -0.4130 -0.3410 -0.2743 2.7971
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -64.4569819 131.5866187 -0.490 0.6242
## age 0.0040401 0.0085015 0.475 0.6346
## jobblue-collar -0.2019782 0.2695338 -0.749 0.4536
## jobentrepreneur -0.4398813 0.4574926 -0.962 0.3363
## jobhousemaid -0.6081370 0.5533354 -1.099 0.2718
## jobmanagement -0.2958843 0.3058101 -0.968 0.3333
## jobretired -0.0770689 0.3703018 -0.208 0.8351
## jobself-employed -0.7201679 0.4588905 -1.569 0.1166
## jobservices -0.3785310 0.2992370 -1.265 0.2059
## jobstudent -0.5827998 0.4749019 -1.227 0.2197
## jobtechnician -0.0815229 0.2332106 -0.350 0.7267
## jobunemployed -0.0036131 0.4282215 -0.008 0.9933
## jobunknown -0.3449106 0.6996710 -0.493 0.6220
## maritalmarried 0.0025538 0.2390554 0.011 0.9915
## maritalsingle -0.0272334 0.2743888 -0.099 0.9209
## maritalunknown -11.7954967 329.2148401 -0.036 0.9714
## educationbasic.6y 0.3670326 0.3999401 0.918 0.3588
## educationbasic.9y -0.0754439 0.3385100 -0.223 0.8236
## educationhigh.school 0.1334629 0.3225799 0.414 0.6791
## educationilliterate -13.1856118 882.7435574 -0.015 0.9881
## educationprofessional.course 0.1791336 0.3502604 0.511 0.6091
## educationuniversity.degree 0.2875891 0.3230265 0.890 0.3733
## educationunknown 0.3203488 0.4125826 0.776 0.4375
## mortgageunknown -0.1105733 0.5672672 -0.195 0.8455
## mortgageyes -0.2088628 0.1428249 -1.462 0.1436
## taxbill_in_phlyes 0.1817119 0.2008873 0.905 0.3657
## contacttelephone -0.6417183 0.2850122 -2.252 0.0244 *
## monthaug 0.0119168 0.4435987 0.027 0.9786
## monthdec 1.8188174 0.7795533 2.333 0.0196 *
## monthjul 0.0987037 0.3739052 0.264 0.7918
## monthjun 0.0601515 0.4719817 0.127 0.8986
## monthmar 1.3910066 0.5690012 2.445 0.0145 *
## monthmay -0.4100960 0.3159273 -1.298 0.1943
## monthnov -0.6154623 0.4249091 -1.448 0.1475
## monthoct 0.0704836 0.5478438 0.129 0.8976
## monthsep -0.9263980 0.6826431 -1.357 0.1748
## day_of_weekmon -0.3172493 0.2213597 -1.433 0.1518
## day_of_weekthu -0.1405554 0.2150320 -0.654 0.5133
## day_of_weektue -0.1571005 0.2166871 -0.725 0.4684
## day_of_weekwed -0.2054917 0.2312499 -0.889 0.3742
## campaign -0.1080497 0.0434241 -2.488 0.0128 *
## pdays -0.0007349 0.0007506 -0.979 0.3276
## previous 0.2429603 0.2130561 1.140 0.2541
## poutcomenonexistent 0.7425384 0.3500727 2.121 0.0339 *
## poutcomesuccess 1.1095148 0.7477278 1.484 0.1378
## unemploy_rate -0.9391549 0.4866028 -1.930 0.0536 .
## cons.price.idx 0.9077695 0.8614268 1.054 0.2920
## cons.conf.idx 0.0229236 0.0292698 0.783 0.4335
## inflation_rate 0.4997422 0.4479791 1.116 0.2646
## spent_on_repairs -0.0044715 0.0107313 -0.417 0.6769
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1853.7 on 2678 degrees of freedom
## Residual deviance: 1489.0 on 2629 degrees of freedom
## AIC: 1589
##
## Number of Fisher Scoring iterations: 13
kitchenSinkTEST <- data.frame(Outcome = as.factor(TestSet$y_numeric),
Probs = predict(kitchenSink, TestSet, type= "response"))
kitchenSinkTEST<-kitchenSinkTEST%>%
mutate(predOutcome = as.factor(ifelse(kitchenSinkTEST$Probs > 0.5 , 1, 0)))
caret::confusionMatrix(kitchenSinkTEST$predOutcome, kitchenSinkTEST$Outcome,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1258 119
## 1 25 38
##
## Accuracy : 0.9
## 95% CI : (0.8833, 0.915)
## No Information Rate : 0.891
## P-Value [Acc > NIR] : 0.1449
##
## Kappa : 0.3019
##
## Mcnemar's Test P-Value : 0.000000000000009189
##
## Sensitivity : 0.24204
## Specificity : 0.98051
## Pos Pred Value : 0.60317
## Neg Pred Value : 0.91358
## Prevalence : 0.10903
## Detection Rate : 0.02639
## Detection Prevalence : 0.04375
## Balanced Accuracy : 0.61128
##
## 'Positive' Class : 1
##
The baseline model has a sensitivity of around 20%. To improve this, new features were created to better predict both outcomes.
These new features are:
Variable Name | Description |
---|---|
UnEmGrouped | Is the unemployment rate <-1? |
ageGroups2 | Is the person <70 Years old? |
educationGroup | Grouping of the education types |
isSingle | Is the owner Single? |
jobGroup | Grouping of the job type |
spentGroups | Is the amount spend on home repairs less than 5050? |
previousSuccess | Did the person accept the credit previously? |
numWorkers | Is the number of workers for the current campaign less than 5? |
monthGroup | Is the month Dec, Oct, Sep, Mar? |
failedAndIncr | Previously failed and increased workers in next campaign? |
As demonstrated in section 2, these features were engineered as there seemed to be a significant difference in likelihood between the categories in the original feature. Since each of the above features represent these groupings, these categorical features aim to distinguish between homeowners who are likely to accept the credit and those who are likely not to accept the credit.
# creditModelTEST <- glm(TrainSet$y_numeric~ .,
# data=TrainSet %>% dplyr::select(y_numeric, ageGroups2, jobGroup, isSingle, educationGroup, contact, monthGroup, numWorkers, previousSuccess, UnEmGrouped, cons.price.idx, cons.conf.idx, inflation_rate, spentGroups, failedAndIncr),
# family="binomial" (link="logit"))
#
# summary(creditModelTEST)
#
# testProbsTEST <- data.frame(Outcome = as.factor(TestSet$y_numeric),
# Probs = predict(creditModelTEST, TestSet, type= "response"))
#
# testProbsTEST <-testProbsTEST %>%
# mutate(predOutcome = as.factor(ifelse(testProbsTEST$Probs > 0.5 , 1, 0)))
#
# caret::confusionMatrix(testProbsTEST$predOutcome, testProbsTEST$Outcome,
# positive = "1")
The final model uses the new features engineered that were listed in 3.2.2 as well as the consumer price index, inflation rate, month of contact and type of contact.
A summary of the model as well as the pseudo R-squared are printed below.
creditModel <- glm(TrainSet$y_numeric~ .,
data=TrainSet %>% dplyr::select(y_numeric, ageGroups2, jobGroup, isSingle, educationGroup, contact, monthGroup, numWorkers, previousSuccess, UnEmGrouped, cons.price.idx, cons.conf.idx, inflation_rate, spentGroups, failedAndIncr),
family="binomial" (link="logit"))
summary(creditModel)
##
## Call:
## glm(formula = TrainSet$y_numeric ~ ., family = binomial(link = "logit"),
## data = TrainSet %>% dplyr::select(y_numeric, ageGroups2,
## jobGroup, isSingle, educationGroup, contact, monthGroup,
## numWorkers, previousSuccess, UnEmGrouped, cons.price.idx,
## cons.conf.idx, inflation_rate, spentGroups, failedAndIncr))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9899 -0.4212 -0.3539 -0.3044 2.8808
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -71.63026 25.55717 -2.803 0.005067 **
## ageGroups2 -0.32236 0.57683 -0.559 0.576264
## jobGroupUnlikely -0.08465 0.25693 -0.329 0.741793
## isSingleYes -0.04963 0.14882 -0.333 0.738769
## educationGroupOther 0.26864 0.14031 1.915 0.055541 .
## contacttelephone -0.75844 0.22389 -3.388 0.000705 ***
## monthGroup 0.80585 0.29844 2.700 0.006929 **
## numWorkers>=5 -0.90249 0.31112 -2.901 0.003723 **
## previousSuccess 1.50542 0.27447 5.485 0.0000000414 ***
## UnEmGrouped 0.15495 1.22122 0.127 0.899031
## cons.price.idx 0.78225 0.28472 2.747 0.006007 **
## cons.conf.idx 0.07363 0.02273 3.240 0.001196 **
## inflation_rate -0.47065 0.37800 -1.245 0.213088
## spentGroupsSpend>5050 0.50274 0.52327 0.961 0.336669
## failedAndIncr 0.24150 0.12757 1.893 0.058352 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1853.7 on 2678 degrees of freedom
## Residual deviance: 1549.7 on 2664 degrees of freedom
## AIC: 1579.7
##
## Number of Fisher Scoring iterations: 6
pR2(creditModel)
## llh llhNull G2 McFadden r2ML
## -774.8257291 -926.8710680 304.0906778 0.1640415 0.1073039
## r2CU
## 0.2148645
To asses the model, predictions are generated on the test set using the final model above.
testProbs <- data.frame(Outcome = as.factor(TestSet$y_numeric),
Probs = predict(creditModel, TestSet, type= "response"))
ggplot(testProbs, aes(x = Probs, fill = as.factor(Outcome))) +
geom_density() +
facet_grid(Outcome ~ .) +
scale_fill_manual(values = palette2) +
labs(x = "Probability of Accept", y = "Density of probabilities",
title = "Distribution of predicted probabilities by observed outcome") +
theme(strip.text.x = element_text(size = 18),
legend.position = "none")
It is clear that the final model is good at predicting homeowners who do not accept the credit but is very bad at predicting homeowners who accept the credit. If the model were good at predicting the observed outcomes, there would be more clustering of the densities in the bottom graph around 1, rather than the heavy side around 0 as it is now.
To understand the accuracy of the model, the confusion matrix is provided below for the predictions on the test set using the final model. The ‘Reference’ is the actual outcome while the ‘Predicted’ is the predicted outcome by the model, where the probability threshold is set to 0.5. A threshold of 0.5 implies that if the predicted probability is more than 0.5, the predicted outcome is that the owner accepts the credit.
testProbs <-
testProbs %>%
mutate(predOutcome = as.factor(ifelse(testProbs$Probs > 0.5 , 1, 0)))
caret::confusionMatrix(testProbs$predOutcome, testProbs$Outcome,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1268 115
## 1 15 42
##
## Accuracy : 0.9097
## 95% CI : (0.8937, 0.924)
## No Information Rate : 0.891
## P-Value [Acc > NIR] : 0.01104
##
## Kappa : 0.3551
##
## Mcnemar's Test P-Value : < 0.0000000000000002
##
## Sensitivity : 0.26752
## Specificity : 0.98831
## Pos Pred Value : 0.73684
## Neg Pred Value : 0.91685
## Prevalence : 0.10903
## Detection Rate : 0.02917
## Detection Prevalence : 0.03958
## Balanced Accuracy : 0.62791
##
## 'Positive' Class : 1
##
To test the generalizability of the model, K-fold Cross Validation is run on the dataset, where k=100. The CV results of the final model are also compared to the ‘Kitchen Sink’ regression CV results. The following graphs show the ROC, Sensitivity (fraction of accepted correctly predicted as accept) and Specificity (fraction of not accepted correctly predicted as not accepted).
peopleCredit<-peopleCredit%>% mutate(
acceptCV = case_when(
accept == "no" ~ "no_accept",
accept == "yes" ~ "accept"
)
)
ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)
cvFit <- train(acceptCV ~ ., data = peopleCredit %>%
dplyr::select(acceptCV, ageGroups2, jobGroup, isSingle, educationGroup, contact, monthGroup, numWorkers, previousSuccess, UnEmGrouped, cons.price.idx, cons.conf.idx, inflation_rate, spentGroups, failedAndIncr),
method="glm", family="binomial",
metric="ROC", trControl = ctrl)
cvFitSINK <- train(acceptCV ~ ., data = peopleCredit%>%
dplyr::select(acceptCV, ageGroups2, jobGroup, marital, educationGroup, taxLien, mortgage, taxbill_in_phl, contact, monthGroup, day_of_week, numWorkers, pdays, previous, previousSuccess, UnEmGrouped, cons.price.idx, cons.conf.idx, inflation_rate, spentGroups, failedAndIncr),
method="glm", family="binomial",
metric="ROC", trControl = ctrl)
dplyr::select(cvFitSINK$resample, -Resample) %>%
gather(metric, value) %>%
left_join(gather(cvFitSINK$results[2:4], metric, mean)) %>%
ggplot(aes(value)) +
geom_histogram(bins=35, fill = "#FF006A") +
facet_wrap(~metric) +
geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
scale_x_continuous(limits = c(0, 1)) +
labs(x="Goodness of Fit", y="Count", title="KITCHEN SINK: CV Goodness of Fit Metrics",
subtitle = "Across-fold mean reprented as dotted lines")
dplyr::select(cvFit$resample, -Resample) %>%
gather(metric, value) %>%
left_join(gather(cvFit$results[2:4], metric, mean)) %>%
ggplot(aes(value)) +
geom_histogram(bins=35, fill = "#FF006A") +
facet_wrap(~metric) +
geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
scale_x_continuous(limits = c(0, 1)) +
labs(x="Goodness of Fit", y="Count", title="FINAL MODEL: CV Goodness of Fit Metrics",
subtitle = "Across-fold mean reprented as dotted lines")
There are minimal goodness of fit improvements between the final and the baseline model. The means of both are the same. It can be noted that the final model’s specificity has a lower variance than the baseline across the 100 folds, indicating that the final model is more generalizable to homeowners that do not accept the credit.
The y-axis of the ROC curve shows the rate of true positives (predicted accepted and accepted) for each threshold from 0.01 to 1. The x-axis shows the rate of false positives (predicted accepted and did not accept) for each threshold.
ggplot(testProbs, aes(d = as.numeric(testProbs$Outcome), m = Probs)) +
geom_roc(n.cuts = 50, labels = FALSE, colour = "#FE9900") +
style_roc(theme = theme_grey) +
geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
labs(title = "ROC Curve - credit accepting Model")
auc(testProbs$Outcome, testProbs$Probs)
## Area under the curve: 0.8187
The ROC lies above the 45 degree line, indicating that the model is an improvement from the randomization used by the HCD previously (it predicts better than a 50-50 coin flip). The model is also not over fit as it is not a right angle at any of the corners, thus we know that this model is fairly generalizable. However, the ROC can be a bit more concave, especially towards the higher values for the ‘True Positive’ and ‘False Positive’ Fraction. This can be adjusted for by creating a model that better predicts the ‘accept’ outcome.
True Positive: For 80%, cost is 2850 for marketing + 5000 for the credit = 7850. The benefit is 10,000 + 56,000 = 66,000 for the house/neighborhood benefit. The expected profit for this 80% is thus $58,150. For 20%, the cost is 2850 for marketing, with no revenue, hence the expected profit for this 20% is -$2580
True Negative: No marketing or credit was given, hence no benefit was enjoyed as well. Thus the overall profit is $0.
False Positive: 2850 for marketing only, hence the profit of this is -$2580.
False Negative: $0
The table below calculates the total monetary benefit of the final model.
cost_benefit_table <-
testProbs %>%
count(predOutcome, Outcome) %>%
summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
True_Positive = sum(n[predOutcome==1 & Outcome==1]),
False_Negative = sum(n[predOutcome==0 & Outcome==1]),
False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
gather(Variable, Count) %>%
mutate(Benefit =
ifelse(Variable == "True_Negative", Count * 0,
ifelse(Variable == "True_Positive",((58150) * (Count * .80)) + (-2850 * (Count * .20)),
ifelse(Variable == "False_Negative", (0) * Count,
ifelse(Variable == "False_Positive", (-2580) * Count, 0))))) %>%
bind_cols(data.frame(Description = c(
"We predicted the owner would not accept and did not market to them.",
"We predicted the owner would accept and marketed to them.",
"We predicted the owner would not accept and did not market to them, hence the owner made no action.",
"We predicted the owner would accept and marketed to them but the owner did not.")))
kable(cost_benefit_table,
caption = "Cost/Benefit Table") %>% kable_styling()
Variable | Count | Benefit | Description |
---|---|---|---|
True_Negative | 1268 | 0 | We predicted the owner would not accept and did not market to them. |
True_Positive | 42 | 1929900 | We predicted the owner would accept and marketed to them. |
False_Negative | 115 | 0 | We predicted the owner would not accept and did not market to them, hence the owner made no action. |
False_Positive | 15 | -38700 | We predicted the owner would accept and marketed to them but the owner did not. |
In section 4.2, a threshold was set such that for any predicted probability above the threshold, the predicted outcome is set to ‘accept’. This threshold can range from 0 to 1, and depending on this threshold, different levels of expected profit can be calculated. Hence the expected profit can be maximized at a certain threshold level and is calculated in this section.
iterateThresholds <- function(data) {
x = .01
all_prediction <- data.frame()
while (x <= 1) {
this_prediction <-
data %>%
mutate(predOutcome = ifelse(Probs > x, 1, 0)) %>%
count(predOutcome, Outcome) %>%
summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
True_Positive = sum(n[predOutcome==1 & Outcome==1]),
False_Negative = sum(n[predOutcome==0 & Outcome==1]),
False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
gather(Variable, Count) %>%
mutate(Benefit =
ifelse(Variable == "True_Negative", Count * 0,
ifelse(Variable == "True_Positive",((58150) * (Count * .80)) + (-2850 * (Count * .20)),
ifelse(Variable == "False_Negative", (0) * Count,
ifelse(Variable == "False_Positive", (-2580) * Count, 0)))),
Threshold = x)
all_prediction <- rbind(all_prediction, this_prediction)
x <- x + .01
}
return(all_prediction)
}
whichThreshold <- iterateThresholds(testProbs)
whichThreshold %>%
ggplot(.,aes(Threshold, Benefit, colour = Variable)) +
geom_point() +
scale_colour_manual(values = palette5[c(5, 1:3)]) +
labs(title = "Benefit by confusion matrix type and threshold cut off",
y="Benefit") +
guides(colour=guide_legend(title="Confusion Matrix"))
Note: The cost of true negative and false negative is zero, hence the true_negatives are masked in the above graph by the false_negatives.
whichThreshold_revenue <-
whichThreshold %>%
mutate(actualAccept = ifelse(Variable == "True_Positive", (Count * .8), 0))%>%
group_by(Threshold)%>%
summarize(Total_Benefit = sum(Benefit),
Total_Count_of_Credits = sum(actualAccept))
colnames(whichThreshold_revenue) = c("Threshold", "Total Benefit ($)", "Total Number of Credits")
kable(whichThreshold_revenue) %>% kable_styling(bootstrap_options = c("striped", "hover",fixed_thead= TRUE))%>%
scroll_box(width = "100%", height = "200px")
Threshold | Total Benefit ($) | Total Number of Credits |
---|---|---|
0.01 | 3906590 | 125.6 |
0.02 | 3960770 | 125.6 |
0.03 | 4042710 | 122.4 |
0.04 | 4166550 | 122.4 |
0.05 | 4149340 | 116.8 |
0.06 | 4387060 | 112.0 |
0.07 | 4501790 | 101.6 |
0.08 | 4634350 | 100.0 |
0.09 | 4676480 | 94.4 |
0.10 | 4643430 | 93.6 |
0.11 | 4621190 | 92.0 |
0.12 | 4539610 | 90.4 |
0.13 | 4597840 | 88.0 |
0.14 | 4534810 | 85.6 |
0.15 | 4534810 | 85.6 |
0.16 | 4385040 | 81.6 |
0.17 | 4097420 | 75.2 |
0.18 | 4105160 | 75.2 |
0.19 | 4013260 | 73.6 |
0.20 | 3939420 | 72.0 |
0.21 | 3804150 | 69.6 |
0.22 | 3811890 | 69.6 |
0.23 | 3595040 | 65.6 |
0.24 | 3342560 | 60.8 |
0.25 | 3207290 | 58.4 |
0.26 | 3120550 | 56.8 |
0.27 | 3130870 | 56.8 |
0.28 | 3049290 | 55.2 |
0.29 | 2926920 | 52.8 |
0.30 | 2842760 | 51.2 |
0.31 | 2750860 | 49.6 |
0.32 | 2572220 | 46.4 |
0.33 | 2531430 | 45.6 |
0.34 | 2531430 | 45.6 |
0.35 | 2439530 | 44.0 |
0.36 | 2393580 | 43.2 |
0.37 | 2352790 | 42.4 |
0.38 | 2220100 | 40.0 |
0.39 | 2176730 | 39.2 |
0.40 | 2133360 | 38.4 |
0.41 | 2087410 | 37.6 |
0.42 | 1949560 | 35.2 |
0.43 | 1959880 | 35.2 |
0.44 | 1959880 | 35.2 |
0.45 | 1873140 | 33.6 |
0.46 | 1875720 | 33.6 |
0.47 | 1878300 | 33.6 |
0.48 | 1880880 | 33.6 |
0.49 | 1888620 | 33.6 |
0.50 | 1891200 | 33.6 |
0.51 | 1753350 | 31.2 |
0.52 | 1709980 | 30.4 |
0.53 | 1618080 | 28.8 |
0.54 | 1528760 | 27.2 |
0.55 | 1436860 | 25.6 |
0.56 | 1439440 | 25.6 |
0.57 | 1393490 | 24.8 |
0.58 | 1393490 | 24.8 |
0.59 | 1350120 | 24.0 |
0.60 | 1352700 | 24.0 |
0.61 | 1352700 | 24.0 |
0.62 | 1306750 | 23.2 |
0.63 | 1263380 | 22.4 |
0.64 | 1265960 | 22.4 |
0.65 | 1089900 | 19.2 |
0.66 | 1089900 | 19.2 |
0.67 | 1043950 | 18.4 |
0.68 | 998000 | 17.6 |
0.69 | 998000 | 17.6 |
0.70 | 998000 | 17.6 |
0.71 | 952050 | 16.8 |
0.72 | 908680 | 16.0 |
0.73 | 773410 | 13.6 |
0.74 | 730040 | 12.8 |
0.75 | 638140 | 11.2 |
0.76 | 592190 | 10.4 |
0.77 | 410970 | 7.2 |
0.78 | 227170 | 4.0 |
0.79 | 183800 | 3.2 |
0.80 | 183800 | 3.2 |
0.81 | 183800 | 3.2 |
0.82 | 183800 | 3.2 |
0.83 | 183800 | 3.2 |
0.84 | 137850 | 2.4 |
0.85 | 137850 | 2.4 |
0.86 | 45950 | 0.8 |
0.87 | 45950 | 0.8 |
0.88 | 45950 | 0.8 |
0.89 | 0 | 0.0 |
0.90 | 0 | 0.0 |
0.91 | 0 | 0.0 |
0.92 | 0 | 0.0 |
0.93 | 0 | 0.0 |
0.94 | 0 | 0.0 |
0.95 | 0 | 0.0 |
0.96 | 0 | 0.0 |
0.97 | 0 | 0.0 |
0.98 | 0 | 0.0 |
0.99 | 0 | 0.0 |
whichThreshold_revenue %>%
gather(Variable, count, - Threshold)%>%
ggplot(., aes(Threshold, count, fill = Variable)) +
geom_bar(position = "dodge", stat="identity") +
facet_wrap(~Variable, scales="free") +
scale_fill_manual(values = palette5) +
labs(x="Threshold", y="Total Value",
title = "Total Benefit and Num of Credits as a function of Threshold", subtitle = "For predicted probabilty > threshold, the predicted outcome is accept.")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
As the threshold increases, the model predicts fewer outcomes to be ‘accept’ and hence the total benefit and number of credits decreases. This agrees with our cost/benefit table in 6.1.1. where the biggest benefit is from ‘true positives’; i.e. people who are predicted to accept the credit and actually accept the credit, which results in fulfilled benefits. The total number of credits can be interpreted as the number of true positives. The slope of the true positives is very steep at low threshold levels, which could mean that a number of observed positives have a very low predicted probability of conversion. This once again alludes to the poor sensitivity of the model.
top <- whichThreshold_revenue %>% top_n(1, `Total Benefit ($)`)
fiftyPercent<-whichThreshold_revenue %>% filter(Threshold < 0.51 & Threshold > 0.491)
rbind(top, fiftyPercent) %>% kable(.,
caption = "Optimal vs 50% Threshold") %>% kable_styling(bootstrap_options = c("striped", "hover"))
Threshold | Total Benefit ($) | Total Number of Credits |
---|---|---|
0.09 | 4676480 | 94.4 |
0.50 | 1891200 | 33.6 |
Choosing a threshold of 0.09 is much better in terms of the total benefit and number of credits distributed than the 50% threshold. HCD should use the model with a 9% threshold.
To conclude, the final model is not very good at predicting which homeowners would accept the credit, however it is good at predicting which homeowners would not accept the credit, which can significantly help HCD in their marketing efforts to maximize the benefit of the program. Hence HCD should not settle on this final model, however this model is still better than the previous strategy of randomization among the eligible homeowners. The expected benefit using this model is $4,676,480 . Since this model is very good at predicting which homeowners will not be likely to accept the credit, marketing materials should not be given to these homeowners and should only be focused on those who are predicted to ‘accept’. This model is also an improvement on the randomization, since the sensitivity for the test set was 27%, which is much higher than the 11% response rate of the previous strategy.
The model can be improved if more data on those who accepted the credit can be used. However this is inherently a problem if the number of people who accept the credit is low to begin with. Better features that separate those who accept and those who do not accept the credit can be created. HCD should consult workers who have ground experience for suggestions on what other features could be significant to differentiate between the two groups of homeowners.