library(tidyverse)
library(caret)
library(knitr) # for modelling 
library(pscl) #polscience computational library for new reg model
library(plotROC) 
library(pROC)
library(ggplot2)
library(ggpubr)
library(gridExtra)
library(grid)
library(kableExtra)

palette5 <- c("#981FAC","#CB0F8B","#FF006A","#FE4C35","#FE9900")
palette4 <- c("#981FAC","#FF006A","#FE4C35","#FE9900")
palette2 <- c("#981FAC","#FF006A")

1. Introduction

The purpose of this report is to increase the uptake of the Department of Housing and Community Development’s (HCD) credit program for homeowners in the City of Philadelphia to fund repairs for their homes. Homeowners whose houses are deemed to be blighted by the city are eligible for this credit of $5,000. In previous campaigns of this credit program, eligible owners have been reached out to at random, resulting in a low 11% uptake rate of this credit. The following analysis uses past data from campaigns on homeowners reached out to and associated factors; these include economic conditions, homeowner demographics, properties of the home and campaign factors. With this data, a binomial regression was used to train a model that can be used to predict if a certain person is likely to accept the credit or not. With this information, the HCD will be able to target their resources at a subset of eligible homeowners that will result in an optimized benefit; accounting for the benefit of the credit and the cost of marketing.

2. Exploring Data and Feature Engineering

Useful features for predicting if the homeowner is likely to accept or not accept the credit are features where the likelihood of accepting is different between the categories of that feature. Hence, in exploring the data, the percentage of homeowners that accepted the credit was calculated for categories in each feature. If there was a greater difference, the feature was deemed appropriate for the final model. The features have been grouped into 4 categories:

. Economic Conditions
. Demographic of Owner
. Resident’s Home
. Campaign Features

An overall summary of all the original data features is shown below:

peopleCredit <- read.csv("C:/Users/leann/OneDrive/Desktop/MUSA507/Week08/people_based_ml_hw_2019/housingSubsidy.csv") %>% rename('accept' = 'y')

# overall summary 
# sum(is.na(peopleCredit)) # no missing data 
summary(peopleCredit)
##        X             age                 job           marital    
##  Min.   :   1   Min.   :18.00   admin.     :1012   divorced: 446  
##  1st Qu.:1030   1st Qu.:32.00   blue-collar: 884   married :2509  
##  Median :2060   Median :38.00   technician : 691   single  :1153  
##  Mean   :2060   Mean   :40.11   services   : 393   unknown :  11  
##  3rd Qu.:3090   3rd Qu.:47.00   management : 324                  
##  Max.   :4119   Max.   :88.00   retired    : 166                  
##                                 (Other)    : 649                  
##                education       taxLien        mortgage    taxbill_in_phl
##  university.degree  :1264   no     :3315   no     :1839   no : 770      
##  high.school        : 921   unknown: 803   unknown: 105   yes:3349      
##  basic.9y           : 574   yes    :   1   yes    :2175                 
##  professional.course: 535                                               
##  basic.4y           : 429                                               
##  basic.6y           : 228                                               
##  (Other)            : 168                                               
##       contact         month      day_of_week    campaign     
##  cellular :2652   may    :1378   fri:768     Min.   : 1.000  
##  telephone:1467   jul    : 711   mon:855     1st Qu.: 1.000  
##                   aug    : 636   thu:860     Median : 2.000  
##                   jun    : 530   tue:841     Mean   : 2.537  
##                   nov    : 446   wed:795     3rd Qu.: 3.000  
##                   apr    : 215               Max.   :35.000  
##                   (Other): 203                               
##      pdays          previous             poutcome    unemploy_rate     
##  Min.   :  0.0   Min.   :0.0000   failure    : 454   Min.   :-3.40000  
##  1st Qu.:999.0   1st Qu.:0.0000   nonexistent:3523   1st Qu.:-1.80000  
##  Median :999.0   Median :0.0000   success    : 142   Median : 1.10000  
##  Mean   :960.4   Mean   :0.1903                      Mean   : 0.08497  
##  3rd Qu.:999.0   3rd Qu.:0.0000                      3rd Qu.: 1.40000  
##  Max.   :999.0   Max.   :6.0000                      Max.   : 1.40000  
##                                                                        
##  cons.price.idx  cons.conf.idx   inflation_rate  spent_on_repairs
##  Min.   :92.20   Min.   :-50.8   Min.   :0.635   Min.   :4964    
##  1st Qu.:93.08   1st Qu.:-42.7   1st Qu.:1.334   1st Qu.:5099    
##  Median :93.75   Median :-41.8   Median :4.857   Median :5191    
##  Mean   :93.58   Mean   :-40.5   Mean   :3.621   Mean   :5166    
##  3rd Qu.:93.99   3rd Qu.:-36.4   3rd Qu.:4.961   3rd Qu.:5228    
##  Max.   :94.77   Max.   :-26.9   Max.   :5.045   Max.   :5228    
##                                                                  
##  accept       y_numeric     
##  no :3668   Min.   :0.0000  
##  yes: 451   1st Qu.:0.0000  
##             Median :0.0000  
##             Mean   :0.1095  
##             3rd Qu.:0.0000  
##             Max.   :1.0000  
## 

3. The Model

3.1. Test and Training Set

The data is split into a 65/35 training and test set. The regression model is run on the training set and this regressed model is then used to predict outcomes on the test set.

set.seed(3456)
trainIndex <- createDataPartition(peopleCredit$accept, 
                                  p = .65,
                                  list = FALSE,
                                  times = 1)
TrainSet <- peopleCredit[ trainIndex,]
TestSet  <- peopleCredit[-trainIndex,]

3.2. Binomial Regression Model

3.2.1 Regression Summary for Kitchen Sink Model

As a baseline, the model is first run on all the available variables. A summary of the regression and the results of the predictions on the test sets are given below.

kitchenSink <- glm(TrainSet$y_numeric~ .,
                  data=TrainSet %>% dplyr::select(y_numeric, age,   job,    marital,    education,  mortgage,   taxbill_in_phl, contact,    month,  day_of_week,    campaign,   pdays,  previous,   poutcome,   unemploy_rate,  cons.price.idx, cons.conf.idx,  inflation_rate, spent_on_repairs),
                  family="binomial" (link="logit"))
summary(kitchenSink)
## 
## Call:
## glm(formula = TrainSet$y_numeric ~ ., family = binomial(link = "logit"), 
##     data = TrainSet %>% dplyr::select(y_numeric, age, job, marital, 
##         education, mortgage, taxbill_in_phl, contact, month, 
##         day_of_week, campaign, pdays, previous, poutcome, unemploy_rate, 
##         cons.price.idx, cons.conf.idx, inflation_rate, spent_on_repairs))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4312  -0.4130  -0.3410  -0.2743   2.7971  
## 
## Coefficients:
##                                 Estimate  Std. Error z value Pr(>|z|)  
## (Intercept)                  -64.4569819 131.5866187  -0.490   0.6242  
## age                            0.0040401   0.0085015   0.475   0.6346  
## jobblue-collar                -0.2019782   0.2695338  -0.749   0.4536  
## jobentrepreneur               -0.4398813   0.4574926  -0.962   0.3363  
## jobhousemaid                  -0.6081370   0.5533354  -1.099   0.2718  
## jobmanagement                 -0.2958843   0.3058101  -0.968   0.3333  
## jobretired                    -0.0770689   0.3703018  -0.208   0.8351  
## jobself-employed              -0.7201679   0.4588905  -1.569   0.1166  
## jobservices                   -0.3785310   0.2992370  -1.265   0.2059  
## jobstudent                    -0.5827998   0.4749019  -1.227   0.2197  
## jobtechnician                 -0.0815229   0.2332106  -0.350   0.7267  
## jobunemployed                 -0.0036131   0.4282215  -0.008   0.9933  
## jobunknown                    -0.3449106   0.6996710  -0.493   0.6220  
## maritalmarried                 0.0025538   0.2390554   0.011   0.9915  
## maritalsingle                 -0.0272334   0.2743888  -0.099   0.9209  
## maritalunknown               -11.7954967 329.2148401  -0.036   0.9714  
## educationbasic.6y              0.3670326   0.3999401   0.918   0.3588  
## educationbasic.9y             -0.0754439   0.3385100  -0.223   0.8236  
## educationhigh.school           0.1334629   0.3225799   0.414   0.6791  
## educationilliterate          -13.1856118 882.7435574  -0.015   0.9881  
## educationprofessional.course   0.1791336   0.3502604   0.511   0.6091  
## educationuniversity.degree     0.2875891   0.3230265   0.890   0.3733  
## educationunknown               0.3203488   0.4125826   0.776   0.4375  
## mortgageunknown               -0.1105733   0.5672672  -0.195   0.8455  
## mortgageyes                   -0.2088628   0.1428249  -1.462   0.1436  
## taxbill_in_phlyes              0.1817119   0.2008873   0.905   0.3657  
## contacttelephone              -0.6417183   0.2850122  -2.252   0.0244 *
## monthaug                       0.0119168   0.4435987   0.027   0.9786  
## monthdec                       1.8188174   0.7795533   2.333   0.0196 *
## monthjul                       0.0987037   0.3739052   0.264   0.7918  
## monthjun                       0.0601515   0.4719817   0.127   0.8986  
## monthmar                       1.3910066   0.5690012   2.445   0.0145 *
## monthmay                      -0.4100960   0.3159273  -1.298   0.1943  
## monthnov                      -0.6154623   0.4249091  -1.448   0.1475  
## monthoct                       0.0704836   0.5478438   0.129   0.8976  
## monthsep                      -0.9263980   0.6826431  -1.357   0.1748  
## day_of_weekmon                -0.3172493   0.2213597  -1.433   0.1518  
## day_of_weekthu                -0.1405554   0.2150320  -0.654   0.5133  
## day_of_weektue                -0.1571005   0.2166871  -0.725   0.4684  
## day_of_weekwed                -0.2054917   0.2312499  -0.889   0.3742  
## campaign                      -0.1080497   0.0434241  -2.488   0.0128 *
## pdays                         -0.0007349   0.0007506  -0.979   0.3276  
## previous                       0.2429603   0.2130561   1.140   0.2541  
## poutcomenonexistent            0.7425384   0.3500727   2.121   0.0339 *
## poutcomesuccess                1.1095148   0.7477278   1.484   0.1378  
## unemploy_rate                 -0.9391549   0.4866028  -1.930   0.0536 .
## cons.price.idx                 0.9077695   0.8614268   1.054   0.2920  
## cons.conf.idx                  0.0229236   0.0292698   0.783   0.4335  
## inflation_rate                 0.4997422   0.4479791   1.116   0.2646  
## spent_on_repairs              -0.0044715   0.0107313  -0.417   0.6769  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1853.7  on 2678  degrees of freedom
## Residual deviance: 1489.0  on 2629  degrees of freedom
## AIC: 1589
## 
## Number of Fisher Scoring iterations: 13
kitchenSinkTEST <- data.frame(Outcome = as.factor(TestSet$y_numeric),
                        Probs = predict(kitchenSink, TestSet, type= "response"))

kitchenSinkTEST<-kitchenSinkTEST%>%
  mutate(predOutcome = as.factor(ifelse(kitchenSinkTEST$Probs > 0.5 , 1, 0)))

caret::confusionMatrix(kitchenSinkTEST$predOutcome, kitchenSinkTEST$Outcome, 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1258  119
##          1   25   38
##                                               
##                Accuracy : 0.9                 
##                  95% CI : (0.8833, 0.915)     
##     No Information Rate : 0.891               
##     P-Value [Acc > NIR] : 0.1449              
##                                               
##                   Kappa : 0.3019              
##                                               
##  Mcnemar's Test P-Value : 0.000000000000009189
##                                               
##             Sensitivity : 0.24204             
##             Specificity : 0.98051             
##          Pos Pred Value : 0.60317             
##          Neg Pred Value : 0.91358             
##              Prevalence : 0.10903             
##          Detection Rate : 0.02639             
##    Detection Prevalence : 0.04375             
##       Balanced Accuracy : 0.61128             
##                                               
##        'Positive' Class : 1                   
## 

3.2.2. Improving the Kitchen Sink Model

The baseline model has a sensitivity of around 20%. To improve this, new features were created to better predict both outcomes.

These new features are:

Variable Name Description
UnEmGrouped Is the unemployment rate <-1?
ageGroups2 Is the person <70 Years old?
educationGroup Grouping of the education types
isSingle Is the owner Single?
jobGroup Grouping of the job type
spentGroups Is the amount spend on home repairs less than 5050?
previousSuccess Did the person accept the credit previously?
numWorkers Is the number of workers for the current campaign less than 5?
monthGroup Is the month Dec, Oct, Sep, Mar?
failedAndIncr Previously failed and increased workers in next campaign?

As demonstrated in section 2, these features were engineered as there seemed to be a significant difference in likelihood between the categories in the original feature. Since each of the above features represent these groupings, these categorical features aim to distinguish between homeowners who are likely to accept the credit and those who are likely not to accept the credit.

# creditModelTEST <- glm(TrainSet$y_numeric~ .,
#                   data=TrainSet %>% dplyr::select(y_numeric, ageGroups2,  jobGroup,   isSingle,   educationGroup, contact,    monthGroup, numWorkers, previousSuccess,    UnEmGrouped, cons.price.idx,    cons.conf.idx,  inflation_rate, spentGroups, failedAndIncr),
#                   family="binomial" (link="logit"))
# 
# summary(creditModelTEST)
# 
# testProbsTEST <- data.frame(Outcome = as.factor(TestSet$y_numeric),
#                         Probs = predict(creditModelTEST, TestSet, type= "response"))
# 
# testProbsTEST <-testProbsTEST %>%
#   mutate(predOutcome  = as.factor(ifelse(testProbsTEST$Probs > 0.5 , 1, 0)))
# 
# caret::confusionMatrix(testProbsTEST$predOutcome, testProbsTEST$Outcome,
#                        positive = "1")

3.2.3. The Final Model

The final model uses the new features engineered that were listed in 3.2.2 as well as the consumer price index, inflation rate, month of contact and type of contact.

A summary of the model as well as the pseudo R-squared are printed below.

creditModel <- glm(TrainSet$y_numeric~ .,
                  data=TrainSet %>% dplyr::select(y_numeric, ageGroups2,    jobGroup,   isSingle,   educationGroup, contact,    monthGroup, numWorkers, previousSuccess,    UnEmGrouped, cons.price.idx,    cons.conf.idx,  inflation_rate, spentGroups, failedAndIncr),
                  family="binomial" (link="logit"))

summary(creditModel)
## 
## Call:
## glm(formula = TrainSet$y_numeric ~ ., family = binomial(link = "logit"), 
##     data = TrainSet %>% dplyr::select(y_numeric, ageGroups2, 
##         jobGroup, isSingle, educationGroup, contact, monthGroup, 
##         numWorkers, previousSuccess, UnEmGrouped, cons.price.idx, 
##         cons.conf.idx, inflation_rate, spentGroups, failedAndIncr))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9899  -0.4212  -0.3539  -0.3044   2.8808  
## 
## Coefficients:
##                        Estimate Std. Error z value     Pr(>|z|)    
## (Intercept)           -71.63026   25.55717  -2.803     0.005067 ** 
## ageGroups2             -0.32236    0.57683  -0.559     0.576264    
## jobGroupUnlikely       -0.08465    0.25693  -0.329     0.741793    
## isSingleYes            -0.04963    0.14882  -0.333     0.738769    
## educationGroupOther     0.26864    0.14031   1.915     0.055541 .  
## contacttelephone       -0.75844    0.22389  -3.388     0.000705 ***
## monthGroup              0.80585    0.29844   2.700     0.006929 ** 
## numWorkers>=5          -0.90249    0.31112  -2.901     0.003723 ** 
## previousSuccess         1.50542    0.27447   5.485 0.0000000414 ***
## UnEmGrouped             0.15495    1.22122   0.127     0.899031    
## cons.price.idx          0.78225    0.28472   2.747     0.006007 ** 
## cons.conf.idx           0.07363    0.02273   3.240     0.001196 ** 
## inflation_rate         -0.47065    0.37800  -1.245     0.213088    
## spentGroupsSpend>5050   0.50274    0.52327   0.961     0.336669    
## failedAndIncr           0.24150    0.12757   1.893     0.058352 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1853.7  on 2678  degrees of freedom
## Residual deviance: 1549.7  on 2664  degrees of freedom
## AIC: 1579.7
## 
## Number of Fisher Scoring iterations: 6
pR2(creditModel)
##          llh      llhNull           G2     McFadden         r2ML 
## -774.8257291 -926.8710680  304.0906778    0.1640415    0.1073039 
##         r2CU 
##    0.2148645

4. Assessing the Model

To asses the model, predictions are generated on the test set using the final model above.

testProbs <- data.frame(Outcome = as.factor(TestSet$y_numeric),
                        Probs = predict(creditModel, TestSet, type= "response"))

4.1. Distribution of Predicted Probabilities

ggplot(testProbs, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = palette2) +
  labs(x = "Probability of Accept", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome") +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "none")

It is clear that the final model is good at predicting homeowners who do not accept the credit but is very bad at predicting homeowners who accept the credit. If the model were good at predicting the observed outcomes, there would be more clustering of the densities in the bottom graph around 1, rather than the heavy side around 0 as it is now.

4.2. Sensitivity and Specificity

To understand the accuracy of the model, the confusion matrix is provided below for the predictions on the test set using the final model. The ‘Reference’ is the actual outcome while the ‘Predicted’ is the predicted outcome by the model, where the probability threshold is set to 0.5. A threshold of 0.5 implies that if the predicted probability is more than 0.5, the predicted outcome is that the owner accepts the credit.

testProbs <- 
  testProbs %>%
  mutate(predOutcome  = as.factor(ifelse(testProbs$Probs > 0.5 , 1, 0)))
caret::confusionMatrix(testProbs$predOutcome, testProbs$Outcome, 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1268  115
##          1   15   42
##                                               
##                Accuracy : 0.9097              
##                  95% CI : (0.8937, 0.924)     
##     No Information Rate : 0.891               
##     P-Value [Acc > NIR] : 0.01104             
##                                               
##                   Kappa : 0.3551              
##                                               
##  Mcnemar's Test P-Value : < 0.0000000000000002
##                                               
##             Sensitivity : 0.26752             
##             Specificity : 0.98831             
##          Pos Pred Value : 0.73684             
##          Neg Pred Value : 0.91685             
##              Prevalence : 0.10903             
##          Detection Rate : 0.02917             
##    Detection Prevalence : 0.03958             
##       Balanced Accuracy : 0.62791             
##                                               
##        'Positive' Class : 1                   
## 

4.3. Generalizability - Cross Validation

To test the generalizability of the model, K-fold Cross Validation is run on the dataset, where k=100. The CV results of the final model are also compared to the ‘Kitchen Sink’ regression CV results. The following graphs show the ROC, Sensitivity (fraction of accepted correctly predicted as accept) and Specificity (fraction of not accepted correctly predicted as not accepted).

peopleCredit<-peopleCredit%>% mutate(
  acceptCV = case_when(
    accept == "no" ~ "no_accept",
    accept == "yes" ~ "accept"
  )
)
ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)

cvFit <- train(acceptCV ~ ., data = peopleCredit %>% 
                                   dplyr::select(acceptCV, ageGroups2,  jobGroup,   isSingle,   educationGroup, contact,    monthGroup, numWorkers, previousSuccess,    UnEmGrouped, cons.price.idx,    cons.conf.idx,  inflation_rate, spentGroups, failedAndIncr), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)



cvFitSINK <- train(acceptCV ~ ., data = peopleCredit%>% 
                     dplyr::select(acceptCV, ageGroups2,    jobGroup,   marital,    educationGroup, taxLien,    mortgage,   taxbill_in_phl, contact,    monthGroup, day_of_week,    numWorkers, pdays,  previous,   previousSuccess,    UnEmGrouped, cons.price.idx,    cons.conf.idx,  inflation_rate, spentGroups, failedAndIncr), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)
dplyr::select(cvFitSINK$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFitSINK$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#FF006A") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="KITCHEN SINK: CV Goodness of Fit Metrics",
         subtitle = "Across-fold mean reprented as dotted lines")

dplyr::select(cvFit$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#FF006A") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="FINAL MODEL: CV Goodness of Fit Metrics",
         subtitle = "Across-fold mean reprented as dotted lines")

There are minimal goodness of fit improvements between the final and the baseline model. The means of both are the same. It can be noted that the final model’s specificity has a lower variance than the baseline across the 100 folds, indicating that the final model is more generalizable to homeowners that do not accept the credit.

5. ROC Curve

The y-axis of the ROC curve shows the rate of true positives (predicted accepted and accepted) for each threshold from 0.01 to 1. The x-axis shows the rate of false positives (predicted accepted and did not accept) for each threshold.

ggplot(testProbs, aes(d = as.numeric(testProbs$Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#FE9900") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve - credit accepting Model")

auc(testProbs$Outcome, testProbs$Probs)
## Area under the curve: 0.8187

The ROC lies above the 45 degree line, indicating that the model is an improvement from the randomization used by the HCD previously (it predicts better than a 50-50 coin flip). The model is also not over fit as it is not a right angle at any of the corners, thus we know that this model is fairly generalizable. However, the ROC can be a bit more concave, especially towards the higher values for the ‘True Positive’ and ‘False Positive’ Fraction. This can be adjusted for by creating a model that better predicts the ‘accept’ outcome.

6. Cost-benefit Analysis

6.1. The Cost and Benefits of Predictions


True Positive: For 80%, cost is 2850 for marketing + 5000 for the credit = 7850. The benefit is 10,000 + 56,000 = 66,000 for the house/neighborhood benefit. The expected profit for this 80% is thus $58,150. For 20%, the cost is 2850 for marketing, with no revenue, hence the expected profit for this 20% is -$2580

True Negative: No marketing or credit was given, hence no benefit was enjoyed as well. Thus the overall profit is $0.

False Positive: 2850 for marketing only, hence the profit of this is -$2580.

False Negative: $0


The table below calculates the total monetary benefit of the final model.

cost_benefit_table <-
   testProbs %>%
      count(predOutcome, Outcome) %>%
      summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
                True_Positive = sum(n[predOutcome==1 & Outcome==1]),
                False_Negative = sum(n[predOutcome==0 & Outcome==1]),
                False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
       gather(Variable, Count) %>%
       mutate(Benefit =
               ifelse(Variable == "True_Negative", Count * 0,
               ifelse(Variable == "True_Positive",((58150) * (Count * .80)) + (-2850 * (Count * .20)),
               ifelse(Variable == "False_Negative", (0) * Count,
               ifelse(Variable == "False_Positive", (-2580) * Count, 0))))) %>%
    bind_cols(data.frame(Description = c(
              "We predicted the owner would not accept and did not market to them.",
              "We predicted the owner would accept and marketed to them.",
              "We predicted the owner would not accept and did not market to them, hence the owner made no action.",
              "We predicted the owner would accept and marketed to them but the owner did not.")))

kable(cost_benefit_table,
       caption = "Cost/Benefit Table") %>% kable_styling()
Cost/Benefit Table
Variable Count Benefit Description
True_Negative 1268 0 We predicted the owner would not accept and did not market to them.
True_Positive 42 1929900 We predicted the owner would accept and marketed to them.
False_Negative 115 0 We predicted the owner would not accept and did not market to them, hence the owner made no action.
False_Positive 15 -38700 We predicted the owner would accept and marketed to them but the owner did not.

6.2. Optimising the Cost/Benefit

In section 4.2, a threshold was set such that for any predicted probability above the threshold, the predicted outcome is set to ‘accept’. This threshold can range from 0 to 1, and depending on this threshold, different levels of expected profit can be calculated. Hence the expected profit can be maximized at a certain threshold level and is calculated in this section.

6.2.1. Confusion Matrix and Expected Benefit

iterateThresholds <- function(data) {
  x = .01
  all_prediction <- data.frame()
  while (x <= 1) {
  
  this_prediction <-
      data %>%
      mutate(predOutcome = ifelse(Probs > x, 1, 0)) %>%
      count(predOutcome, Outcome) %>%
      summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
                True_Positive = sum(n[predOutcome==1 & Outcome==1]),
                False_Negative = sum(n[predOutcome==0 & Outcome==1]),
                False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
     gather(Variable, Count) %>%
     mutate(Benefit =
               ifelse(Variable == "True_Negative", Count * 0,
               ifelse(Variable == "True_Positive",((58150) * (Count * .80)) + (-2850 * (Count * .20)),
               ifelse(Variable == "False_Negative", (0) * Count,
               ifelse(Variable == "False_Positive", (-2580) * Count, 0)))),
            Threshold = x)
  
  all_prediction <- rbind(all_prediction, this_prediction)
  x <- x + .01
  }
return(all_prediction)
}

whichThreshold <- iterateThresholds(testProbs)
whichThreshold %>%
  ggplot(.,aes(Threshold, Benefit, colour = Variable)) +
  geom_point() +
  scale_colour_manual(values = palette5[c(5, 1:3)]) +    
  labs(title = "Benefit by confusion matrix type and threshold cut off",
       y="Benefit") +
  guides(colour=guide_legend(title="Confusion Matrix"))

Note: The cost of true negative and false negative is zero, hence the true_negatives are masked in the above graph by the false_negatives.

6.2.2. Threshold as a function of the Total Benefit and Total number of Credits distributed

whichThreshold_revenue <- 
  whichThreshold %>% 
    mutate(actualAccept = ifelse(Variable == "True_Positive", (Count * .8), 0))%>% 
    group_by(Threshold)%>% 
    summarize(Total_Benefit = sum(Benefit),
              Total_Count_of_Credits = sum(actualAccept)) 

colnames(whichThreshold_revenue) = c("Threshold", "Total Benefit ($)", "Total Number of Credits") 

kable(whichThreshold_revenue) %>% kable_styling(bootstrap_options = c("striped", "hover",fixed_thead= TRUE))%>%
  scroll_box(width = "100%", height = "200px")
Threshold Total Benefit ($) Total Number of Credits
0.01 3906590 125.6
0.02 3960770 125.6
0.03 4042710 122.4
0.04 4166550 122.4
0.05 4149340 116.8
0.06 4387060 112.0
0.07 4501790 101.6
0.08 4634350 100.0
0.09 4676480 94.4
0.10 4643430 93.6
0.11 4621190 92.0
0.12 4539610 90.4
0.13 4597840 88.0
0.14 4534810 85.6
0.15 4534810 85.6
0.16 4385040 81.6
0.17 4097420 75.2
0.18 4105160 75.2
0.19 4013260 73.6
0.20 3939420 72.0
0.21 3804150 69.6
0.22 3811890 69.6
0.23 3595040 65.6
0.24 3342560 60.8
0.25 3207290 58.4
0.26 3120550 56.8
0.27 3130870 56.8
0.28 3049290 55.2
0.29 2926920 52.8
0.30 2842760 51.2
0.31 2750860 49.6
0.32 2572220 46.4
0.33 2531430 45.6
0.34 2531430 45.6
0.35 2439530 44.0
0.36 2393580 43.2
0.37 2352790 42.4
0.38 2220100 40.0
0.39 2176730 39.2
0.40 2133360 38.4
0.41 2087410 37.6
0.42 1949560 35.2
0.43 1959880 35.2
0.44 1959880 35.2
0.45 1873140 33.6
0.46 1875720 33.6
0.47 1878300 33.6
0.48 1880880 33.6
0.49 1888620 33.6
0.50 1891200 33.6
0.51 1753350 31.2
0.52 1709980 30.4
0.53 1618080 28.8
0.54 1528760 27.2
0.55 1436860 25.6
0.56 1439440 25.6
0.57 1393490 24.8
0.58 1393490 24.8
0.59 1350120 24.0
0.60 1352700 24.0
0.61 1352700 24.0
0.62 1306750 23.2
0.63 1263380 22.4
0.64 1265960 22.4
0.65 1089900 19.2
0.66 1089900 19.2
0.67 1043950 18.4
0.68 998000 17.6
0.69 998000 17.6
0.70 998000 17.6
0.71 952050 16.8
0.72 908680 16.0
0.73 773410 13.6
0.74 730040 12.8
0.75 638140 11.2
0.76 592190 10.4
0.77 410970 7.2
0.78 227170 4.0
0.79 183800 3.2
0.80 183800 3.2
0.81 183800 3.2
0.82 183800 3.2
0.83 183800 3.2
0.84 137850 2.4
0.85 137850 2.4
0.86 45950 0.8
0.87 45950 0.8
0.88 45950 0.8
0.89 0 0.0
0.90 0 0.0
0.91 0 0.0
0.92 0 0.0
0.93 0 0.0
0.94 0 0.0
0.95 0 0.0
0.96 0 0.0
0.97 0 0.0
0.98 0 0.0
0.99 0 0.0
Showing the Total Benefit and Total number of Credits that would be taken (true positives) for each Threshold.
whichThreshold_revenue %>%
  gather(Variable, count, - Threshold)%>%
      ggplot(., aes(Threshold, count, fill = Variable)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free") +
        scale_fill_manual(values = palette5) +
        labs(x="Threshold", y="Total Value",
             title = "Total Benefit and Num of Credits as a function of Threshold", subtitle = "For predicted probabilty > threshold, the predicted outcome is accept.")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

As the threshold increases, the model predicts fewer outcomes to be ‘accept’ and hence the total benefit and number of credits decreases. This agrees with our cost/benefit table in 6.1.1. where the biggest benefit is from ‘true positives’; i.e. people who are predicted to accept the credit and actually accept the credit, which results in fulfilled benefits. The total number of credits can be interpreted as the number of true positives. The slope of the true positives is very steep at low threshold levels, which could mean that a number of observed positives have a very low predicted probability of conversion. This once again alludes to the poor sensitivity of the model.

6.2.3. Optimal Threshold vs 50% Threshold

top <- whichThreshold_revenue %>% top_n(1, `Total Benefit ($)`)

fiftyPercent<-whichThreshold_revenue %>% filter(Threshold < 0.51 & Threshold > 0.491)

rbind(top, fiftyPercent) %>% kable(.,
       caption = "Optimal vs 50% Threshold") %>% kable_styling(bootstrap_options = c("striped", "hover"))
Optimal vs 50% Threshold
Threshold Total Benefit ($) Total Number of Credits
0.09 4676480 94.4
0.50 1891200 33.6

Choosing a threshold of 0.09 is much better in terms of the total benefit and number of credits distributed than the 50% threshold. HCD should use the model with a 9% threshold.

7. Conclusion

To conclude, the final model is not very good at predicting which homeowners would accept the credit, however it is good at predicting which homeowners would not accept the credit, which can significantly help HCD in their marketing efforts to maximize the benefit of the program. Hence HCD should not settle on this final model, however this model is still better than the previous strategy of randomization among the eligible homeowners. The expected benefit using this model is $4,676,480 . Since this model is very good at predicting which homeowners will not be likely to accept the credit, marketing materials should not be given to these homeowners and should only be focused on those who are predicted to ‘accept’. This model is also an improvement on the randomization, since the sensitivity for the test set was 27%, which is much higher than the 11% response rate of the previous strategy.

The model can be improved if more data on those who accepted the credit can be used. However this is inherently a problem if the number of people who accept the credit is low to begin with. Better features that separate those who accept and those who do not accept the credit can be created. HCD should consult workers who have ground experience for suggestions on what other features could be significant to differentiate between the two groups of homeowners.