library(tidyverse)
library(caret)
library(knitr) # for modelling 
library(pscl) #polscience computational library for new reg model
library(plotROC) 
library(pROC)
library(ggplot2)
library(ggpubr)
library(gridExtra)
library(grid)
library(kableExtra)

palette5 <- c("#981FAC","#CB0F8B","#FF006A","#FE4C35","#FE9900")
palette4 <- c("#981FAC","#FF006A","#FE4C35","#FE9900")
palette2 <- c("#981FAC","#FF006A")

1. Introduction

The purpose of this report is to increase the uptake of the Department of Housing and Community Development’s (HCD) credit program for homeowners in the City of Philadelphia to fund repairs for their homes. Homeowners whose houses are deemed to be blighted by the city are eligible for this credit of $5,000. In previous campaigns of this credit program, eligible owners have been reached out to at random, resulting in a low 11% uptake rate of this credit. The following analysis uses past data from campaigns on homeowners reached out to and associated factors; these include economic conditions, homeowner demographics, properties of the home and campaign factors. With this data, a binomial regression was used to train a model that can be used to predict if a certain person is likely to accept the credit or not. With this information, the HCD will be able to target their resources at a subset of eligible homeowners that will result in an optimized benefit; accounting for the benefit of the credit and the cost of marketing.

2. Exploring Data and Feature Engineering

Useful features for predicting if the homeowner is likely to accept or not accept the credit are features where the likelihood of accepting is different between the categories of that feature. Hence, in exploring the data, the percentage of homeowners that accepted the credit was calculated for categories in each feature. If there was a greater difference, the feature was deemed appropriate for the final model. The features have been grouped into 4 categories:

. Economic Conditions
. Demographic of Owner
. Resident’s Home
. Campaign Features

An overall summary of all the original data features is shown below:

peopleCredit <- read.csv("C:/Users/leann/OneDrive/Desktop/MUSA507/Week08/people_based_ml_hw_2019/housingSubsidy.csv") %>% rename('accept' = 'y')

# overall summary 
# sum(is.na(peopleCredit)) # no missing data 
summary(peopleCredit)

##        X             age                 job           marital    
##  Min.   :   1   Min.   :18.00   admin.     :1012   divorced: 446  
##  1st Qu.:1030   1st Qu.:32.00   blue-collar: 884   married :2509  
##  Median :2060   Median :38.00   technician : 691   single  :1153  
##  Mean   :2060   Mean   :40.11   services   : 393   unknown :  11  
##  3rd Qu.:3090   3rd Qu.:47.00   management : 324                  
##  Max.   :4119   Max.   :88.00   retired    : 166                  
##                                 (Other)    : 649                  
##                education       taxLien        mortgage    taxbill_in_phl
##  university.degree  :1264   no     :3315   no     :1839   no : 770      
##  high.school        : 921   unknown: 803   unknown: 105   yes:3349      
##  basic.9y           : 574   yes    :   1   yes    :2175                 
##  professional.course: 535                                               
##  basic.4y           : 429                                               
##  basic.6y           : 228                                               
##  (Other)            : 168                                               
##       contact         month      day_of_week    campaign     
##  cellular :2652   may    :1378   fri:768     Min.   : 1.000  
##  telephone:1467   jul    : 711   mon:855     1st Qu.: 1.000  
##                   aug    : 636   thu:860     Median : 2.000  
##                   jun    : 530   tue:841     Mean   : 2.537  
##                   nov    : 446   wed:795     3rd Qu.: 3.000  
##                   apr    : 215               Max.   :35.000  
##                   (Other): 203                               
##      pdays          previous             poutcome    unemploy_rate     
##  Min.   :  0.0   Min.   :0.0000   failure    : 454   Min.   :-3.40000  
##  1st Qu.:999.0   1st Qu.:0.0000   nonexistent:3523   1st Qu.:-1.80000  
##  Median :999.0   Median :0.0000   success    : 142   Median : 1.10000  
##  Mean   :960.4   Mean   :0.1903                      Mean   : 0.08497  
##  3rd Qu.:999.0   3rd Qu.:0.0000                      3rd Qu.: 1.40000  
##  Max.   :999.0   Max.   :6.0000                      Max.   : 1.40000  
##                                                                        
##  cons.price.idx  cons.conf.idx   inflation_rate  spent_on_repairs
##  Min.   :92.20   Min.   :-50.8   Min.   :0.635   Min.   :4964    
##  1st Qu.:93.08   1st Qu.:-42.7   1st Qu.:1.334   1st Qu.:5099    
##  Median :93.75   Median :-41.8   Median :4.857   Median :5191    
##  Mean   :93.58   Mean   :-40.5   Mean   :3.621   Mean   :5166    
##  3rd Qu.:93.99   3rd Qu.:-36.4   3rd Qu.:4.961   3rd Qu.:5228    
##  Max.   :94.77   Max.   :-26.9   Max.   :5.045   Max.   :5228    
##                                                                  
##  accept       y_numeric     
##  no :3668   Min.   :0.0000  
##  yes: 451   1st Qu.:0.0000  
##             Median :0.0000  
##             Mean   :0.1095  
##             3rd Qu.:0.0000  
##             Max.   :1.0000  
##

2.1. Features related to Economic Conditions

# explore relationship between accept and economy at time (continuous // gives the mean for y/n
peopleCredit%>%
  dplyr::select(accept, unemploy_rate, cons.price.idx,cons.conf.idx, inflation_rate) %>%
  gather(Variable, value, -accept) %>%
    ggplot(aes(accept, value, fill=accept)) + 
      geom_bar(position = "dodge", stat = "summary", fun.y = "mean") + 
      facet_wrap(~Variable, scales = "free") +
      scale_fill_manual(values = palette2) +
      labs(x="Accept", y="Value", 
           title = "Economic Conditions associations with the likelihood of accepting",
           subtitle = "(continous outcomes)") +
      theme(legend.position = "none")

Consumer and confidence index does not seem significant inflation_rate and unemployment_rate might be significant.

2.1.1. Consumer Confidence Index

groupedCCI<- peopleCredit %>%
  group_by(accept,cons.conf.idx) %>%
  summarise(counts = n())

CCIgraph<- ggplot(data=groupedCCI, aes(x=cons.conf.idx, y=counts, fill=accept)) +
  geom_bar(stat="identity")+
  scale_fill_manual(values = palette2)+
  theme(legend.position="bottom")

spreadCCI <- groupedCCI %>% spread(accept, counts) %>% mutate(total = no+yes)%>%
  mutate(percentAccept=(yes/total) *100)

percentagesCCI <- ggplot(data=spreadCCI, aes(x=cons.conf.idx, y=percentAccept)) +
  geom_line()+
  scale_fill_manual(values = palette2)

grid.arrange(CCIgraph, percentagesCCI, ncol=1)

2.1.2. Consumer Price Index

groupedCPI<- peopleCredit %>%
  group_by(accept,cons.price.idx) %>%
  summarise(counts = n())

CPIgraph<- ggplot(data=groupedCPI, aes(x=cons.price.idx, y=counts, fill=accept)) +
  geom_bar(stat="identity")+
  scale_fill_manual(values = palette2)+
  theme(legend.position="bottom")

spreadCPI <- groupedCPI %>% spread(accept, counts) %>% mutate(total = no+yes)%>%
  mutate(percentAccept=(yes/total) *100)

percentagesCPI <- ggplot(data=spreadCPI, aes(x=cons.price.idx, y=percentAccept)) +
  geom_line()+
  scale_fill_manual(values = palette2)

grid.arrange(CPIgraph, percentagesCPI, ncol=1)

2.1.3. Inflation Rate

groupedinfl<- peopleCredit %>%
  group_by(accept,inflation_rate) %>%
  summarise(counts = n())

inflationGraph<- ggplot(data=groupedinfl, aes(x=inflation_rate, y=counts, fill=accept)) +
  geom_bar(stat="identity")+
  scale_fill_manual(values = palette2)+
  theme(legend.position="bottom")

spreadInf <- groupedinfl %>% spread(accept, counts) %>% mutate(total = no+yes)%>%
  mutate(percentAccept=(yes/total) *100)

percentagesInf <- ggplot(data=spreadInf, aes(x=inflation_rate, y=percentAccept)) +
  geom_line()+
  scale_fill_manual(values = palette2)

grid.arrange(inflationGraph, percentagesInf, ncol=1)

#### 2.1.4. Unemployment Rate

groupedUnEm<- peopleCredit %>%
  group_by(accept,unemploy_rate) %>%
  summarise(counts = n())

UnEmGraph<- ggplot(data=groupedUnEm, aes(x=unemploy_rate, y=counts, fill=accept)) +
  geom_bar(stat="identity")+
  scale_fill_manual(values = palette2)+
  theme(legend.position="bottom")

spreadUnEm <- groupedUnEm %>% spread(accept, counts) %>% mutate(total = no+yes)%>%
  mutate(percentAccept=(yes/total) *100)

percentagesUnEm <- ggplot(data=spreadUnEm, aes(x=unemploy_rate, y=percentAccept)) +
  geom_line()+
  scale_fill_manual(values = palette2)

grid.arrange(UnEmGraph, percentagesUnEm, ncol=1)

Since there is a significant difference between the acceptance rate for different unemployment rate, one feature could be to split the unemployment rate into two groups; less than -1 and more than -1.

peopleCredit<- peopleCredit %>%
  mutate(UnEmGrouped = ifelse(unemploy_rate < -1, 1, 0))

2.2. Features related to Demographics

2.2.1. Age

df <- peopleCredit %>%
  group_by(accept,age) %>%
  summarise(counts = n())

barGraph <- ggplot(data=df, aes(x=age, y=counts, fill=accept)) +
  geom_bar(stat="identity")+
  scale_fill_manual(values = palette2)+
  theme(legend.position="bottom")

spreadAge <- df %>% spread(accept, counts) 
spreadAge[is.na(spreadAge)] <- 0
spreadAge<-spreadAge%>% mutate(total = no+yes)%>%
  mutate(percentAccept=(yes/total) *100)

percentages <- ggplot(data=spreadAge, aes(x=age, y=percentAccept)) +
  geom_line()+
  scale_fill_manual(values = palette2)

grid.arrange(barGraph, percentages, ncol=1)

Figure above shows that the likelihood of accepting credit for those aged >60 is much higher than those aged <70.

One way of splitting the age groups would be into only two categories: Less than 70 and more than 70.

peopleCredit <- 
  peopleCredit %>%
  mutate(ageGroups2 = case_when(
    age <70 ~ 0,
    age>=70 ~ 1
    ))

Age can also be split into three categories: young, middle_Aged, old.

peopleCredit <- 
  peopleCredit %>%
  mutate(ageGroups = case_when(
    age <= 20 ~ "Young",
    age >20 & age< 60  ~ "Middle Aged",
    age>=60  ~ "Old"
    ))

peopleCredit %>%
    dplyr::select(accept, ageGroups) %>%
    gather(Variable, value, -accept) %>%
    count(Variable, value, accept) %>%
      ggplot(., aes(value, n, fill = accept)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free") +
        scale_fill_manual(values = palette2) +
        labs(x="Accept", y="Value",
             title = "Age associations with the likelihood of Accepting",
             subtitle = "Three Age Categories") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

Old are more likely to accept than the middle aged. Hence age group is a good feature.

personal<-peopleCredit %>%
    dplyr::select(accept, job, marital, education) %>%
    gather(Variable, value, -accept)%>%
   count(Variable, value, accept)

employment <- personal%>% 
  filter(Variable=='job')%>%
  ggplot(., aes(value, n, fill = accept)) +   
        geom_bar(position = "dodge", stat="identity") +
        scale_fill_manual(values = palette2) +
        labs(x="Type of Employment", y="Count",
             title = "Employment associations")+
        theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")

marital <- personal%>% 
  filter(Variable=='marital')%>%
  ggplot(., aes(value, n, fill = accept)) +   
        geom_bar(position = "dodge", stat="identity") +
        scale_fill_manual(values = palette2) +
        labs(x="Martial Status", y="Count",
             title = "Marital Status associations")+
        theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")

Education <- personal%>% 
  filter(Variable=='education')%>%
  ggplot(., aes(value, n, fill = accept)) +   
        geom_bar(position = "dodge", stat="identity") +
        scale_fill_manual(values = palette2) +
        labs(x="Education Level", y="Count",
             title = "Education associations")+
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

# education
percentageEdu<- peopleCredit %>%
  group_by(accept,education) %>%
  summarise(counts = n()) %>% spread(accept, counts) %>% 
  mutate(total = no+yes)%>%
  mutate(percentAccept=(yes/total) *100)%>%
  ggplot(., aes(x=education, y=percentAccept)) +
  geom_point()+
  scale_fill_manual(values = palette2)

# marital
percentageMarital<- peopleCredit %>%
  group_by(accept,marital) %>%
  summarise(counts = n()) %>% spread(accept, counts) %>% 
  mutate(total = no+yes)%>%
  mutate(percentAccept=(yes/total) *100)%>%
  ggplot(., aes(x=marital, y=percentAccept)) +
  geom_point()+
  scale_fill_manual(values = palette2)

# job
percentageJOB<- peopleCredit %>%
  group_by(accept,job) %>%
  summarise(counts = n()) %>% spread(accept, counts) %>% 
  mutate(total = no+yes)%>%
  mutate(percentAccept=(yes/total) *100)%>%
  ggplot(., aes(x=job, y=percentAccept)) +
  geom_point()+
  scale_fill_manual(values = palette2)

2.2.2. Educational Background

grid.arrange(Education, percentageEdu)

Those with a basic education have lower percentage of accepting the credit than the other groups. Hence one feature could be to split education into two groups; basic and non-basic.

peopleCredit <- 
  peopleCredit %>%
  mutate(educationGroup = case_when(
    education == "basic.4y" ~ "Basic",
    education == "basic.6y" ~ "Basic",
    education == "basic.9y" ~ "Basic",
    education == "illiterate" ~ "Basic",
    education == "high.school" ~ "Basic",
    TRUE ~ "Other"
  ))

2.2.3. Marital Status

grid.arrange(marital, percentageMarital, ncol=2)

The likelihood of accepting is much higher for singles than non-singles.

peopleCredit <- 
  peopleCredit %>%
  mutate(isSingle = case_when(
    marital == "single" ~ "Yes",
    marital != "single" ~ "No"
  ))

2.2.4. Job

grid.arrange(employment, percentageJOB)

Retirees and students are more likely to accept the credit.

# (retired, admin, student, unemployed, )
peopleCredit <- 
  peopleCredit %>%
  mutate(jobGroup = case_when(
    job == "retired"  ~ "Likely",
    job == "student" ~ "Likely",
    TRUE ~ "Unlikely"
  ))

2.3. Features related to Resident’s Home

2.3.1. Mortgage, Tax Resident, Presence of Tax Lien

all <- peopleCredit %>%
    dplyr::select(accept, taxLien, mortgage, taxbill_in_phl) %>%
    gather(Variable, value, -accept) %>%
    count(Variable, value, accept) %>%
      ggplot(., aes(value, n, fill = accept)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free") +
        scale_fill_manual(values = palette2) +
        labs(x="Accept", y="Value",
             title = "Feature associations with the likelihood of accepting",
             subtitle = "Three category features") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))
all

# mortgage
groupedMortgage<- peopleCredit %>%
  group_by(accept,mortgage) %>%
  summarise(counts = n()) %>% spread(accept, counts) %>% 
  mutate(total = no+yes)%>%
  mutate(percentAccept=(yes/total) *100)%>%
  ggplot(., aes(x=mortgage, y=percentAccept)) +
  geom_point()+
  scale_fill_manual(values = palette2)

# tax bill 
groupedtax<- peopleCredit %>%
  group_by(accept,taxbill_in_phl) %>%
  summarise(counts = n()) %>% spread(accept, counts) %>% 
  mutate(total = no+yes)%>%
  mutate(percentAccept=(yes/total) *100)%>%
  ggplot(., aes(x=taxbill_in_phl, y=percentAccept)) +
  geom_point()+
  scale_fill_manual(values = palette2)

# lien
groupedLien<- peopleCredit %>%
  group_by(accept,taxLien) %>%
  summarise(counts = n()) %>% spread(accept, counts) %>% 
  mutate(total = no+yes)%>%
  mutate(percentAccept=(yes/total) *100)%>%
  ggplot(., aes(x=taxLien, y=percentAccept)) +
  geom_point()+
  scale_fill_manual(values = palette2)


grid.arrange(groupedMortgage, groupedtax, groupedLien)

The difference in percentages for each feature is not stark, hence they are not good features. Furthermore, taxLien only has 1 ‘yes’. The above graph shows that the people who do have a mortgage, are full time residents in Philadelphia and do not have a taxLien are still more likely to say no to the credit than to accept.

2.3.2. Amount Spent on Home Repairs

grouped<- peopleCredit %>%
  group_by(accept,spent_on_repairs) %>%
  summarise(counts = n())

plot3 <- ggplot(data=grouped, aes(x=spent_on_repairs, y=counts, fill=accept)) +
  geom_bar(stat="identity")+
  scale_fill_manual(values = palette2)+
  theme(legend.position="bottom")

spreadSpent<- grouped %>% spread(accept, counts) 
spreadSpent[is.na(spreadSpent)] <- 0
spreadSpent<-spreadSpent%>% mutate(total = no+yes)%>%
  mutate(percentAccept=(yes/total) *100)

percentagesSpent <- ggplot(data=spreadSpent, aes(x=spent_on_repairs, y=percentAccept)) +
  geom_line()+
  scale_fill_manual(values = palette2)

grid.arrange(plot3, percentagesSpent, ncol=2)

The likelihood of accepting is much higher for people who spent less on repairs. Hence a new feature can also be created based on this split.

peopleCredit <- 
  peopleCredit %>%
  mutate(spentGroups = case_when(
    spent_on_repairs <= 5050 ~ "Spend<5050",
    spent_on_repairs > 5050  ~ "Spend>5050"
    ))

peopleCredit %>%
    dplyr::select(accept, spentGroups) %>%
    gather(Variable, value, -accept)%>%
   count(Variable, value, accept) %>%
  ggplot(., aes(value, n, fill = accept)) +   
        geom_bar(position = "dodge", stat="identity") +
        scale_fill_manual(values = palette2) +
        labs(x="Amount spent on repairs", y="Count",
             title = "Expenditure on repairs associations")+
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

The above graph notes that those who spend <5050 on home repairs are more likely to accept the credit.

2.4. Features related to the Marketing Campaign

2.4.1. Outcome of Previous Campaign

peopleCredit%>%
  dplyr::select(accept, poutcome) %>%
    gather(Variable, value, -accept)%>%
   count(Variable, value, accept) %>%
  ggplot(., aes(value, n, fill = accept)) +   
        geom_bar(position = "dodge", stat="identity") +
        scale_fill_manual(values = palette2) +
        labs(x="Outcome of previous campaign", y="Count",
             title = "Previous outcome with current")+
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

Those who accepted previously are much more likely to accept again, hence a good feature could be one where previous success = 1, otherwise 0.

peopleCredit <- peopleCredit%>%
  mutate(previousSuccess = case_when(
    poutcome == "success" ~ 1,
    poutcome!="success" ~ 0))

2.4.2 Number of Workers on the Campaign

# see if increasing the number of workers on a case helps 
peopleCredit <- peopleCredit%>%
  mutate(increaseWorkers = case_when(
    poutcome != "nonexistent" & (campaign - previous) > 0 ~ "Increase",
    poutcome != "nonexistent" & (campaign - previous)<=0 ~ "No Increase",
    poutcome == "nonexistent" ~ "NA"))

meanPrevious <-peopleCredit%>%
  dplyr::select(poutcome, previous) %>%
  gather(Variable, value, -poutcome) %>%
    ggplot(aes(poutcome, value, fill=poutcome)) + 
      geom_bar(position = "dodge", stat = "summary", fun.y = "mean") + 
      facet_wrap(~Variable, scales = "free") +
      scale_fill_manual(values = palette5) +
      labs(x="Accept", y="Value", 
           title = "MEAN number of workers",
           subtitle = "Previous Campaign") +
      theme(legend.position = "none")


meanCurrent<-peopleCredit%>%
  dplyr::select(accept, campaign) %>%
  gather(Variable, value, -accept) %>%
    ggplot(aes(accept, value, fill=accept)) + 
      geom_bar(position = "dodge", stat = "summary", fun.y = "mean") + 
      facet_wrap(~Variable, scales = "free") +
      scale_fill_manual(values = palette2) +
      labs(x="Accept", y="Value", 
           title = "MEAN number of workers",
           subtitle = "Current Campaign") +
      theme(legend.position = "none")


increase <- peopleCredit %>%
    dplyr::select(accept, increaseWorkers) %>%
    gather(Variable, value, -accept) %>%
    count(Variable, value, accept) %>%
      ggplot(., aes(value, n, fill = accept)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free") +
        scale_fill_manual(values = palette2) +
        labs(x="Accept", y="Value",
             title = "Increase in Workers assocaition with the likelihood of Accepting",
             subtitle = "Workers increased from previous campaign to now?") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

# plot accept rate for those whose increaseWorkers>0, and accept rate for  increaseWorkers<0

grid.arrange(meanPrevious, meanCurrent, increase, ncol=2)

‘IncreaseWorkers’ is not a good indicator, not much difference between likelihood of accepting between those that increased and did not increase.

However, the number of current workers can be further analyzed.

countcurrent<-peopleCredit%>%
  dplyr::select(accept, campaign) %>%
    gather(Variable, value, -accept)%>%
   count(Variable, value, accept) %>%
  ggplot(., aes(value, n, fill = accept)) +   
        geom_bar(position = "dodge", stat="identity") +
        scale_fill_manual(values = palette2) +
        labs(x="Number of workers current campaign", y="Count",
             title = "Effect of num workers on acceptance")+
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

groupedCampaign<- peopleCredit %>%
  group_by(accept,campaign) %>%
  summarise(counts = n())


spreadCampaign<- groupedCampaign %>% spread(accept, counts) 
spreadCampaign[is.na(spreadCampaign)] <- 0
spreadCampaign<-spreadCampaign%>% mutate(total = no+yes)%>%
  mutate(percentAccept=(yes/total) *100)

percentagesCam <- ggplot(data=spreadCampaign, aes(x=campaign, y=percentAccept)) +
  geom_line()+
  scale_fill_manual(values = palette2)

grid.arrange(countcurrent, percentagesCam, ncol=2)

There is quite a significant difference in those that accepted based on the number of workers. Hence a new feature can be created that groups the number of workers.

peopleCredit <- peopleCredit %>% 
  mutate(numWorkers = case_when(
    campaign <5 ~ "<5",
    campaign >=5 ~ ">=5"
  ))

Could there be an effect on the acceptance rate of those who declined the credit previously but in the next campaign had more workers to market to them?

peopleCredit <- peopleCredit%>%
  mutate(failedAndIncr = case_when(
    poutcome == 'failure' & increaseWorkers == "Increase" ~ 1,
    poutcome == 'failure' & increaseWorkers == "No Increase" ~0,
    poutcome != "failure"  ~ 2)) # previously no failed 


peopleCredit%>%
  dplyr::select(accept, failedAndIncr) %>%
    gather(Variable, value, -accept)%>%
   count(Variable, value, accept) %>%
  filter(value!='2') %>%
  ggplot(., aes(value, n, fill = accept)) +   
        geom_bar(position = "dodge", stat="identity") +
        scale_fill_manual(values = palette2) +
        labs(x="failed and increased workers?", y="Count",
             title = "Does increasing workers covert those who 'failed'?")+
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

Based on the above graph, there is not much change in likelihood of accepting by increasing workers (‘1’) compared to no increase in workers (‘0’), hence this is not a good feature.

2.4.3. Type of Contact

peopleCredit%>%
  dplyr::select(accept, contact) %>%
    gather(Variable, value, -accept)%>%
   count(Variable, value, accept) %>%
  ggplot(., aes(value, n, fill = accept)) +   
        geom_bar(position = "dodge", stat="identity") +
  scale_fill_manual(values = palette2) +
        labs(x="Mode of contact", y="Count",
             title = "Mode of contact and likelihood of accepting")+
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

Those who use cellular are more likely to accept the credit, hence this is a good feature.

2.4.4. Month and Day of contact

# month 
barMonth <- peopleCredit%>%
  dplyr::select(accept, month) %>%
    gather(Variable, value, -accept)%>%
   count(Variable, value, accept) %>%
  ggplot(., aes(value, n, fill = accept)) +   
        geom_bar(position = "dodge", stat="identity") +
  scale_fill_manual(values = palette2) +
        labs(x="Month of contact", y="Count",
             title = "Month of contact and likelihood of accepting")+
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

groupedMonth<- peopleCredit %>%
  group_by(accept,month) %>%
  summarise(counts = n())

spreadMonth<- groupedMonth %>% spread(accept, counts) 
spreadMonth[is.na(spreadMonth)] <- 0
spreadMonth<-spreadMonth%>% mutate(total = no+yes)%>%
  mutate(percentAccept=(yes/total) *100)

percentagesMonth <- ggplot(data=spreadMonth, aes(x=month, y=percentAccept)) +
  geom_point()+
  scale_fill_manual(values = palette2)


# day 
barDay <- peopleCredit%>%
  dplyr::select(accept, day_of_week) %>%
    gather(Variable, value, -accept)%>%
   count(Variable, value, accept) %>%
  ggplot(., aes(value, n, fill = accept)) +   
        geom_bar(position = "dodge", stat="identity") +
  scale_fill_manual(values = palette2) +
        labs(x="Day of contact", y="Count",
             title = "Day of contact and likelihood of accepting")+
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

groupedDay<- peopleCredit %>%
  group_by(accept,day_of_week) %>%
  summarise(counts = n())

spreadDay<- groupedDay %>% spread(accept, counts) 
spreadDay[is.na(spreadDay)] <- 0
spreadDay<-spreadDay%>% mutate(total = no+yes)%>%
  mutate(percentAccept=(yes/total) *100)

percentagesDay <- ggplot(data=spreadDay, aes(x=day_of_week, y=percentAccept)) +
  geom_point()+
  scale_fill_manual(values = palette2)

grid.arrange(barMonth, percentagesMonth, barDay, percentagesDay, ncol=2)

There is not much difference between the day of the week of calling. However, there does seem to be some variation by months. Create a new feature according to month.

peopleCredit <- peopleCredit %>%
  mutate(monthGroup = case_when(
    month == "dec" ~ 1,
    month == "mar" ~ 1,
    month == "oct" ~ 1,
    month == "sep" ~ 1,
    TRUE ~ 0 
  ))

3. The Model

3.1. Test and Training Set

The data is split into a 65/35 training and test set. The regression model is run on the training set and this regressed model is then used to predict outcomes on the test set.

set.seed(3456)
trainIndex <- createDataPartition(peopleCredit$accept, 
                                  p = .65,
                                  list = FALSE,
                                  times = 1)
TrainSet <- peopleCredit[ trainIndex,]
TestSet  <- peopleCredit[-trainIndex,]

3.2. Binomial Regression Model

3.2.1 Regression Summary for Kitchen Sink Model

As a baseline, the model is first run on all the available variables. A summary of the regression and the results of the predictions on the test sets are given below.

kitchenSink <- glm(TrainSet$y_numeric~ .,
                  data=TrainSet %>% dplyr::select(y_numeric, age,   job,    marital,    education,  mortgage,   taxbill_in_phl, contact,    month,  day_of_week,    campaign,   pdays,  previous,   poutcome,   unemploy_rate,  cons.price.idx, cons.conf.idx,  inflation_rate, spent_on_repairs),
                  family="binomial" (link="logit"))
summary(kitchenSink)

## 
## Call:
## glm(formula = TrainSet$y_numeric ~ ., family = binomial(link = "logit"), 
##     data = TrainSet %>% dplyr::select(y_numeric, age, job, marital, 
##         education, mortgage, taxbill_in_phl, contact, month, 
##         day_of_week, campaign, pdays, previous, poutcome, unemploy_rate, 
##         cons.price.idx, cons.conf.idx, inflation_rate, spent_on_repairs))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4312  -0.4130  -0.3410  -0.2743   2.7971  
## 
## Coefficients:
##                                 Estimate  Std. Error z value Pr(>|z|)  
## (Intercept)                  -64.4569819 131.5866187  -0.490   0.6242  
## age                            0.0040401   0.0085015   0.475   0.6346  
## jobblue-collar                -0.2019782   0.2695338  -0.749   0.4536  
## jobentrepreneur               -0.4398813   0.4574926  -0.962   0.3363  
## jobhousemaid                  -0.6081370   0.5533354  -1.099   0.2718  
## jobmanagement                 -0.2958843   0.3058101  -0.968   0.3333  
## jobretired                    -0.0770689   0.3703018  -0.208   0.8351  
## jobself-employed              -0.7201679   0.4588905  -1.569   0.1166  
## jobservices                   -0.3785310   0.2992370  -1.265   0.2059  
## jobstudent                    -0.5827998   0.4749019  -1.227   0.2197  
## jobtechnician                 -0.0815229   0.2332106  -0.350   0.7267  
## jobunemployed                 -0.0036131   0.4282215  -0.008   0.9933  
## jobunknown                    -0.3449106   0.6996710  -0.493   0.6220  
## maritalmarried                 0.0025538   0.2390554   0.011   0.9915  
## maritalsingle                 -0.0272334   0.2743888  -0.099   0.9209  
## maritalunknown               -11.7954967 329.2148401  -0.036   0.9714  
## educationbasic.6y              0.3670326   0.3999401   0.918   0.3588  
## educationbasic.9y             -0.0754439   0.3385100  -0.223   0.8236  
## educationhigh.school           0.1334629   0.3225799   0.414   0.6791  
## educationilliterate          -13.1856118 882.7435574  -0.015   0.9881  
## educationprofessional.course   0.1791336   0.3502604   0.511   0.6091  
## educationuniversity.degree     0.2875891   0.3230265   0.890   0.3733  
## educationunknown               0.3203488   0.4125826   0.776   0.4375  
## mortgageunknown               -0.1105733   0.5672672  -0.195   0.8455  
## mortgageyes                   -0.2088628   0.1428249  -1.462   0.1436  
## taxbill_in_phlyes              0.1817119   0.2008873   0.905   0.3657  
## contacttelephone              -0.6417183   0.2850122  -2.252   0.0244 *
## monthaug                       0.0119168   0.4435987   0.027   0.9786  
## monthdec                       1.8188174   0.7795533   2.333   0.0196 *
## monthjul                       0.0987037   0.3739052   0.264   0.7918  
## monthjun                       0.0601515   0.4719817   0.127   0.8986  
## monthmar                       1.3910066   0.5690012   2.445   0.0145 *
## monthmay                      -0.4100960   0.3159273  -1.298   0.1943  
## monthnov                      -0.6154623   0.4249091  -1.448   0.1475  
## monthoct                       0.0704836   0.5478438   0.129   0.8976  
## monthsep                      -0.9263980   0.6826431  -1.357   0.1748  
## day_of_weekmon                -0.3172493   0.2213597  -1.433   0.1518  
## day_of_weekthu                -0.1405554   0.2150320  -0.654   0.5133  
## day_of_weektue                -0.1571005   0.2166871  -0.725   0.4684  
## day_of_weekwed                -0.2054917   0.2312499  -0.889   0.3742  
## campaign                      -0.1080497   0.0434241  -2.488   0.0128 *
## pdays                         -0.0007349   0.0007506  -0.979   0.3276  
## previous                       0.2429603   0.2130561   1.140   0.2541  
## poutcomenonexistent            0.7425384   0.3500727   2.121   0.0339 *
## poutcomesuccess                1.1095148   0.7477278   1.484   0.1378  
## unemploy_rate                 -0.9391549   0.4866028  -1.930   0.0536 .
## cons.price.idx                 0.9077695   0.8614268   1.054   0.2920  
## cons.conf.idx                  0.0229236   0.0292698   0.783   0.4335  
## inflation_rate                 0.4997422   0.4479791   1.116   0.2646  
## spent_on_repairs              -0.0044715   0.0107313  -0.417   0.6769  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1853.7  on 2678  degrees of freedom
## Residual deviance: 1489.0  on 2629  degrees of freedom
## AIC: 1589
## 
## Number of Fisher Scoring iterations: 13

kitchenSinkTEST <- data.frame(Outcome = as.factor(TestSet$y_numeric),
                        Probs = predict(kitchenSink, TestSet, type= "response"))

kitchenSinkTEST<-kitchenSinkTEST%>%
  mutate(predOutcome = as.factor(ifelse(kitchenSinkTEST$Probs > 0.5 , 1, 0)))

caret::confusionMatrix(kitchenSinkTEST$predOutcome, kitchenSinkTEST$Outcome, 
                       positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1258  119
##          1   25   38
##                                               
##                Accuracy : 0.9                 
##                  95% CI : (0.8833, 0.915)     
##     No Information Rate : 0.891               
##     P-Value [Acc > NIR] : 0.1449              
##                                               
##                   Kappa : 0.3019              
##                                               
##  Mcnemar's Test P-Value : 0.000000000000009189
##                                               
##             Sensitivity : 0.24204             
##             Specificity : 0.98051             
##          Pos Pred Value : 0.60317             
##          Neg Pred Value : 0.91358             
##              Prevalence : 0.10903             
##          Detection Rate : 0.02639             
##    Detection Prevalence : 0.04375             
##       Balanced Accuracy : 0.61128             
##                                               
##        'Positive' Class : 1                   
##

3.2.2. Improving the Kitchen Sink Model

The baseline model has a sensitivity of around 20%. To improve this, new features were created to better predict both outcomes.

These new features are:

Variable Name	Description
UnEmGrouped	Is the unemployment rate <-1?
ageGroups2	Is the person <70 Years old?
educationGroup	Grouping of the education types
isSingle	Is the owner Single?
jobGroup	Grouping of the job type
spentGroups	Is the amount spend on home repairs less than 5050?
previousSuccess	Did the person accept the credit previously?
numWorkers	Is the number of workers for the current campaign less than 5?
monthGroup	Is the month Dec, Oct, Sep, Mar?
failedAndIncr	Previously failed and increased workers in next campaign?

As demonstrated in section 2, these features were engineered as there seemed to be a significant difference in likelihood between the categories in the original feature. Since each of the above features represent these groupings, these categorical features aim to distinguish between homeowners who are likely to accept the credit and those who are likely not to accept the credit.

# creditModelTEST <- glm(TrainSet$y_numeric~ .,
#                   data=TrainSet %>% dplyr::select(y_numeric, ageGroups2,  jobGroup,   isSingle,   educationGroup, contact,    monthGroup, numWorkers, previousSuccess,    UnEmGrouped, cons.price.idx,    cons.conf.idx,  inflation_rate, spentGroups, failedAndIncr),
#                   family="binomial" (link="logit"))
# 
# summary(creditModelTEST)
# 
# testProbsTEST <- data.frame(Outcome = as.factor(TestSet$y_numeric),
#                         Probs = predict(creditModelTEST, TestSet, type= "response"))
# 
# testProbsTEST <-testProbsTEST %>%
#   mutate(predOutcome  = as.factor(ifelse(testProbsTEST$Probs > 0.5 , 1, 0)))
# 
# caret::confusionMatrix(testProbsTEST$predOutcome, testProbsTEST$Outcome,
#                        positive = "1")

3.2.3. The Final Model

The final model uses the new features engineered that were listed in 3.2.2 as well as the consumer price index, inflation rate, month of contact and type of contact.

A summary of the model as well as the pseudo R-squared are printed below.

creditModel <- glm(TrainSet$y_numeric~ .,
                  data=TrainSet %>% dplyr::select(y_numeric, ageGroups2,    jobGroup,   isSingle,   educationGroup, contact,    monthGroup, numWorkers, previousSuccess,    UnEmGrouped, cons.price.idx,    cons.conf.idx,  inflation_rate, spentGroups, failedAndIncr),
                  family="binomial" (link="logit"))

summary(creditModel)

## 
## Call:
## glm(formula = TrainSet$y_numeric ~ ., family = binomial(link = "logit"), 
##     data = TrainSet %>% dplyr::select(y_numeric, ageGroups2, 
##         jobGroup, isSingle, educationGroup, contact, monthGroup, 
##         numWorkers, previousSuccess, UnEmGrouped, cons.price.idx, 
##         cons.conf.idx, inflation_rate, spentGroups, failedAndIncr))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9899  -0.4212  -0.3539  -0.3044   2.8808  
## 
## Coefficients:
##                        Estimate Std. Error z value     Pr(>|z|)    
## (Intercept)           -71.63026   25.55717  -2.803     0.005067 ** 
## ageGroups2             -0.32236    0.57683  -0.559     0.576264    
## jobGroupUnlikely       -0.08465    0.25693  -0.329     0.741793    
## isSingleYes            -0.04963    0.14882  -0.333     0.738769    
## educationGroupOther     0.26864    0.14031   1.915     0.055541 .  
## contacttelephone       -0.75844    0.22389  -3.388     0.000705 ***
## monthGroup              0.80585    0.29844   2.700     0.006929 ** 
## numWorkers>=5          -0.90249    0.31112  -2.901     0.003723 ** 
## previousSuccess         1.50542    0.27447   5.485 0.0000000414 ***
## UnEmGrouped             0.15495    1.22122   0.127     0.899031    
## cons.price.idx          0.78225    0.28472   2.747     0.006007 ** 
## cons.conf.idx           0.07363    0.02273   3.240     0.001196 ** 
## inflation_rate         -0.47065    0.37800  -1.245     0.213088    
## spentGroupsSpend>5050   0.50274    0.52327   0.961     0.336669    
## failedAndIncr           0.24150    0.12757   1.893     0.058352 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1853.7  on 2678  degrees of freedom
## Residual deviance: 1549.7  on 2664  degrees of freedom
## AIC: 1579.7
## 
## Number of Fisher Scoring iterations: 6

pR2(creditModel)

##          llh      llhNull           G2     McFadden         r2ML 
## -774.8257291 -926.8710680  304.0906778    0.1640415    0.1073039 
##         r2CU 
##    0.2148645

4. Assessing the Model

To asses the model, predictions are generated on the test set using the final model above.

testProbs <- data.frame(Outcome = as.factor(TestSet$y_numeric),
                        Probs = predict(creditModel, TestSet, type= "response"))

4.1. Distribution of Predicted Probabilities

ggplot(testProbs, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = palette2) +
  labs(x = "Probability of Accept", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome") +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "none")

It is clear that the final model is good at predicting homeowners who do not accept the credit but is very bad at predicting homeowners who accept the credit. If the model were good at predicting the observed outcomes, there would be more clustering of the densities in the bottom graph around 1, rather than the heavy side around 0 as it is now.

4.2. Sensitivity and Specificity

To understand the accuracy of the model, the confusion matrix is provided below for the predictions on the test set using the final model. The ‘Reference’ is the actual outcome while the ‘Predicted’ is the predicted outcome by the model, where the probability threshold is set to 0.5. A threshold of 0.5 implies that if the predicted probability is more than 0.5, the predicted outcome is that the owner accepts the credit.

testProbs <- 
  testProbs %>%
  mutate(predOutcome  = as.factor(ifelse(testProbs$Probs > 0.5 , 1, 0)))

caret::confusionMatrix(testProbs$predOutcome, testProbs$Outcome, 
                       positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1268  115
##          1   15   42
##                                               
##                Accuracy : 0.9097              
##                  95% CI : (0.8937, 0.924)     
##     No Information Rate : 0.891               
##     P-Value [Acc > NIR] : 0.01104             
##                                               
##                   Kappa : 0.3551              
##                                               
##  Mcnemar's Test P-Value : < 0.0000000000000002
##                                               
##             Sensitivity : 0.26752             
##             Specificity : 0.98831             
##          Pos Pred Value : 0.73684             
##          Neg Pred Value : 0.91685             
##              Prevalence : 0.10903             
##          Detection Rate : 0.02917             
##    Detection Prevalence : 0.03958             
##       Balanced Accuracy : 0.62791             
##                                               
##        'Positive' Class : 1                   
##

4.3. Generalizability - Cross Validation

To test the generalizability of the model, K-fold Cross Validation is run on the dataset, where k=100. The CV results of the final model are also compared to the ‘Kitchen Sink’ regression CV results. The following graphs show the ROC, Sensitivity (fraction of accepted correctly predicted as accept) and Specificity (fraction of not accepted correctly predicted as not accepted).

peopleCredit<-peopleCredit%>% mutate(
  acceptCV = case_when(
    accept == "no" ~ "no_accept",
    accept == "yes" ~ "accept"
  )
)
ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)

cvFit <- train(acceptCV ~ ., data = peopleCredit %>% 
                                   dplyr::select(acceptCV, ageGroups2,  jobGroup,   isSingle,   educationGroup, contact,    monthGroup, numWorkers, previousSuccess,    UnEmGrouped, cons.price.idx,    cons.conf.idx,  inflation_rate, spentGroups, failedAndIncr), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)



cvFitSINK <- train(acceptCV ~ ., data = peopleCredit%>% 
                     dplyr::select(acceptCV, ageGroups2,    jobGroup,   marital,    educationGroup, taxLien,    mortgage,   taxbill_in_phl, contact,    monthGroup, day_of_week,    numWorkers, pdays,  previous,   previousSuccess,    UnEmGrouped, cons.price.idx,    cons.conf.idx,  inflation_rate, spentGroups, failedAndIncr), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)

dplyr::select(cvFitSINK$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFitSINK$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#FF006A") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="KITCHEN SINK: CV Goodness of Fit Metrics",
         subtitle = "Across-fold mean reprented as dotted lines")

dplyr::select(cvFit$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#FF006A") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="FINAL MODEL: CV Goodness of Fit Metrics",
         subtitle = "Across-fold mean reprented as dotted lines")

There are minimal goodness of fit improvements between the final and the baseline model. The means of both are the same. It can be noted that the final model’s specificity has a lower variance than the baseline across the 100 folds, indicating that the final model is more generalizable to homeowners that do not accept the credit.

5. ROC Curve

The y-axis of the ROC curve shows the rate of true positives (predicted accepted and accepted) for each threshold from 0.01 to 1. The x-axis shows the rate of false positives (predicted accepted and did not accept) for each threshold.

ggplot(testProbs, aes(d = as.numeric(testProbs$Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#FE9900") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve - credit accepting Model")

auc(testProbs$Outcome, testProbs$Probs)

## Area under the curve: 0.8187

The ROC lies above the 45 degree line, indicating that the model is an improvement from the randomization used by the HCD previously (it predicts better than a 50-50 coin flip). The model is also not over fit as it is not a right angle at any of the corners, thus we know that this model is fairly generalizable. However, the ROC can be a bit more concave, especially towards the higher values for the ‘True Positive’ and ‘False Positive’ Fraction. This can be adjusted for by creating a model that better predicts the ‘accept’ outcome.

6. Cost-benefit Analysis

6.1. The Cost and Benefits of Predictions

True Positive: For 80%, cost is 2850 for marketing + 5000 for the credit = 7850. The benefit is 10,000 + 56,000 = 66,000 for the house/neighborhood benefit. The expected profit for this 80% is thus $58,150. For 20%, the cost is 2850 for marketing, with no revenue, hence the expected profit for this 20% is -$2580

True Negative: No marketing or credit was given, hence no benefit was enjoyed as well. Thus the overall profit is $0.

False Positive: 2850 for marketing only, hence the profit of this is -$2580.

False Negative: $0

The table below calculates the total monetary benefit of the final model.

cost_benefit_table <-
   testProbs %>%
      count(predOutcome, Outcome) %>%
      summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
                True_Positive = sum(n[predOutcome==1 & Outcome==1]),
                False_Negative = sum(n[predOutcome==0 & Outcome==1]),
                False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
       gather(Variable, Count) %>%
       mutate(Benefit =
               ifelse(Variable == "True_Negative", Count * 0,
               ifelse(Variable == "True_Positive",((58150) * (Count * .80)) + (-2850 * (Count * .20)),
               ifelse(Variable == "False_Negative", (0) * Count,
               ifelse(Variable == "False_Positive", (-2580) * Count, 0))))) %>%
    bind_cols(data.frame(Description = c(
              "We predicted the owner would not accept and did not market to them.",
              "We predicted the owner would accept and marketed to them.",
              "We predicted the owner would not accept and did not market to them, hence the owner made no action.",
              "We predicted the owner would accept and marketed to them but the owner did not.")))

kable(cost_benefit_table,
       caption = "Cost/Benefit Table") %>% kable_styling()

Cost/Benefit Table
Variable	Count	Benefit	Description
True_Negative	1268	0	We predicted the owner would not accept and did not market to them.
True_Positive	42	1929900	We predicted the owner would accept and marketed to them.
False_Negative	115	0	We predicted the owner would not accept and did not market to them, hence the owner made no action.
False_Positive	15	-38700	We predicted the owner would accept and marketed to them but the owner did not.

6.2. Optimising the Cost/Benefit

In section 4.2, a threshold was set such that for any predicted probability above the threshold, the predicted outcome is set to ‘accept’. This threshold can range from 0 to 1, and depending on this threshold, different levels of expected profit can be calculated. Hence the expected profit can be maximized at a certain threshold level and is calculated in this section.

6.2.1. Confusion Matrix and Expected Benefit

iterateThresholds <- function(data) {
  x = .01
  all_prediction <- data.frame()
  while (x <= 1) {
  
  this_prediction <-
      data %>%
      mutate(predOutcome = ifelse(Probs > x, 1, 0)) %>%
      count(predOutcome, Outcome) %>%
      summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
                True_Positive = sum(n[predOutcome==1 & Outcome==1]),
                False_Negative = sum(n[predOutcome==0 & Outcome==1]),
                False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
     gather(Variable, Count) %>%
     mutate(Benefit =
               ifelse(Variable == "True_Negative", Count * 0,
               ifelse(Variable == "True_Positive",((58150) * (Count * .80)) + (-2850 * (Count * .20)),
               ifelse(Variable == "False_Negative", (0) * Count,
               ifelse(Variable == "False_Positive", (-2580) * Count, 0)))),
            Threshold = x)
  
  all_prediction <- rbind(all_prediction, this_prediction)
  x <- x + .01
  }
return(all_prediction)
}

whichThreshold <- iterateThresholds(testProbs)

whichThreshold %>%
  ggplot(.,aes(Threshold, Benefit, colour = Variable)) +
  geom_point() +
  scale_colour_manual(values = palette5[c(5, 1:3)]) +    
  labs(title = "Benefit by confusion matrix type and threshold cut off",
       y="Benefit") +
  guides(colour=guide_legend(title="Confusion Matrix"))

Note: The cost of true negative and false negative is zero, hence the true_negatives are masked in the above graph by the false_negatives.

6.2.2. Threshold as a function of the Total Benefit and Total number of Credits distributed

whichThreshold_revenue <- 
  whichThreshold %>% 
    mutate(actualAccept = ifelse(Variable == "True_Positive", (Count * .8), 0))%>% 
    group_by(Threshold)%>% 
    summarize(Total_Benefit = sum(Benefit),
              Total_Count_of_Credits = sum(actualAccept)) 

colnames(whichThreshold_revenue) = c("Threshold", "Total Benefit ($)", "Total Number of Credits") 

kable(whichThreshold_revenue) %>% kable_styling(bootstrap_options = c("striped", "hover",fixed_thead= TRUE))%>%
  scroll_box(width = "100%", height = "200px")

Threshold	Total Benefit ($)	Total Number of Credits
0.01	3906590	125.6
0.02	3960770	125.6
0.03	4042710	122.4
0.04	4166550	122.4
0.05	4149340	116.8
0.06	4387060	112.0
0.07	4501790	101.6
0.08	4634350	100.0
0.09	4676480	94.4
0.10	4643430	93.6
0.11	4621190	92.0
0.12	4539610	90.4
0.13	4597840	88.0
0.14	4534810	85.6
0.15	4534810	85.6
0.16	4385040	81.6
0.17	4097420	75.2
0.18	4105160	75.2
0.19	4013260	73.6
0.20	3939420	72.0
0.21	3804150	69.6
0.22	3811890	69.6
0.23	3595040	65.6
0.24	3342560	60.8
0.25	3207290	58.4
0.26	3120550	56.8
0.27	3130870	56.8
0.28	3049290	55.2
0.29	2926920	52.8
0.30	2842760	51.2
0.31	2750860	49.6
0.32	2572220	46.4
0.33	2531430	45.6
0.34	2531430	45.6
0.35	2439530	44.0
0.36	2393580	43.2
0.37	2352790	42.4
0.38	2220100	40.0
0.39	2176730	39.2
0.40	2133360	38.4
0.41	2087410	37.6
0.42	1949560	35.2
0.43	1959880	35.2
0.44	1959880	35.2
0.45	1873140	33.6
0.46	1875720	33.6
0.47	1878300	33.6
0.48	1880880	33.6
0.49	1888620	33.6
0.50	1891200	33.6
0.51	1753350	31.2
0.52	1709980	30.4
0.53	1618080	28.8
0.54	1528760	27.2
0.55	1436860	25.6
0.56	1439440	25.6
0.57	1393490	24.8
0.58	1393490	24.8
0.59	1350120	24.0
0.60	1352700	24.0
0.61	1352700	24.0
0.62	1306750	23.2
0.63	1263380	22.4
0.64	1265960	22.4
0.65	1089900	19.2
0.66	1089900	19.2
0.67	1043950	18.4
0.68	998000	17.6
0.69	998000	17.6
0.70	998000	17.6
0.71	952050	16.8
0.72	908680	16.0
0.73	773410	13.6
0.74	730040	12.8
0.75	638140	11.2
0.76	592190	10.4
0.77	410970	7.2
0.78	227170	4.0
0.79	183800	3.2
0.80	183800	3.2
0.81	183800	3.2
0.82	183800	3.2
0.83	183800	3.2
0.84	137850	2.4
0.85	137850	2.4
0.86	45950	0.8
0.87	45950	0.8
0.88	45950	0.8
0.89	0	0.0
0.90	0	0.0
0.91	0	0.0
0.92	0	0.0
0.93	0	0.0
0.94	0	0.0
0.95	0	0.0
0.96	0	0.0
0.97	0	0.0
0.98	0	0.0
0.99	0	0.0

Showing the Total Benefit and Total number of Credits that would be taken (true positives) for each Threshold.

whichThreshold_revenue %>%
  gather(Variable, count, - Threshold)%>%
      ggplot(., aes(Threshold, count, fill = Variable)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free") +
        scale_fill_manual(values = palette5) +
        labs(x="Threshold", y="Total Value",
             title = "Total Benefit and Num of Credits as a function of Threshold", subtitle = "For predicted probabilty > threshold, the predicted outcome is accept.")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

As the threshold increases, the model predicts fewer outcomes to be ‘accept’ and hence the total benefit and number of credits decreases. This agrees with our cost/benefit table in 6.1.1. where the biggest benefit is from ‘true positives’; i.e. people who are predicted to accept the credit and actually accept the credit, which results in fulfilled benefits. The total number of credits can be interpreted as the number of true positives. The slope of the true positives is very steep at low threshold levels, which could mean that a number of observed positives have a very low predicted probability of conversion. This once again alludes to the poor sensitivity of the model.

6.2.3. Optimal Threshold vs 50% Threshold

top <- whichThreshold_revenue %>% top_n(1, `Total Benefit ($)`)

fiftyPercent<-whichThreshold_revenue %>% filter(Threshold < 0.51 & Threshold > 0.491)

rbind(top, fiftyPercent) %>% kable(.,
       caption = "Optimal vs 50% Threshold") %>% kable_styling(bootstrap_options = c("striped", "hover"))

Optimal vs 50% Threshold
Threshold	Total Benefit ($)	Total Number of Credits
0.09	4676480	94.4
0.50	1891200	33.6

Choosing a threshold of 0.09 is much better in terms of the total benefit and number of credits distributed than the 50% threshold. HCD should use the model with a 9% threshold.

7. Conclusion

To conclude, the final model is not very good at predicting which homeowners would accept the credit, however it is good at predicting which homeowners would not accept the credit, which can significantly help HCD in their marketing efforts to maximize the benefit of the program. Hence HCD should not settle on this final model, however this model is still better than the previous strategy of randomization among the eligible homeowners. The expected benefit using this model is $4,676,480 . Since this model is very good at predicting which homeowners will not be likely to accept the credit, marketing materials should not be given to these homeowners and should only be focused on those who are predicted to ‘accept’. This model is also an improvement on the randomization, since the sensitivity for the test set was 27%, which is much higher than the 11% response rate of the previous strategy.

The model can be improved if more data on those who accepted the credit can be used. However this is inherently a problem if the number of people who accept the credit is low to begin with. Better features that separate those who accept and those who do not accept the credit can be created. HCD should consult workers who have ground experience for suggestions on what other features could be significant to differentiate between the two groups of homeowners.

Optimizing the Benefits of HCD’s Housing Credit Programme with Predictive Modelling

Leanne Chan

10/25/2019