This report seeks to predict food inspection failures in Chicago, in order to provide the Chicago Department of Public Health with a better way of allocating its limited health inspectors across the City's numerous food establishments. Current business-as-usual approaches to food inspections include responding to complaints from the public, randomized inspections, or checks performed in some arbitrary order (e.g. alphabetical). Such reactive approaches, however, often delay the identification of unhygienic food establishments that should be shut down. Given the hazards that unidentified eateries pose to the City, as well as their wider social and public health implications, it is necessary to identify such dangers expediently to prevent public health crises before they occur.
Given this motivation, this report estimates a logistic regression model to allow the Chicago Department of Public Health to take a more proactive approach to addressing this issue, minimizing the duration of public exposure to such risks. Specifically, it uses inspection data on food establishments from one year to predict, for the next, the likelihood that a food establishment will fail a health inspection. The Department can then prioritize high-risk dining establishments and dispatch its limited number of inspectors accordingly, instead of casting a wide net as is done presently.
The YouTube link to our presentation can be found (here).
# chicago and the fishnet
chicago <- st_read("chicago_boundary.shp")
fishnet <- st_make_grid(chicago, cellsize = 500) %>% st_sf()
chicagofishnet <- fishnet[chicago, ] %>% mutate(uniqueID = rownames(.)) %>% dplyr::select(uniqueID)
chicagofishnet <- st_transform(chicagofishnet, 3857)
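Note that st_make_grid() interprets cellsize in the units of the layer's coordinate reference system, so the 500m cell width assumes the boundary shapefile uses a projected, metric CRS. A quick sanity check (a sketch):
# cellsize = 500 only means 500m if the boundary's CRS is in metres
st_is_longlat(chicago)  # should be FALSE (projected), not TRUE (lon/lat degrees)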
df <- read.csv("Food_Inspections.csv")
# loading most recent inspections (all)
foodInspect <- df %>% filter(Results %in% c("Fail", "Pass", "Pass w/ Conditions")) %>%
filter(License.. > 0) %>% drop_na(Latitude) %>% drop_na(Longitude)
foodInspect$Results[foodInspect$Results == "Pass w/ Conditions"] <- "Pass"
mostRecent_all <- foodInspect %>% group_by(License..) %>% slice(which.max(as.Date(Inspection.Date,
"%m/%d/%Y"))) %>% dplyr::select("License..", "Results", "Inspection.Date", "Latitude",
"Longitude", "Facility.Type")
mostRecent_all <- st_as_sf(mostRecent_all, coords = c("Longitude", "Latitude"), crs = 4326)
mostRecent_all <- st_transform(mostRecent_all, 3857)
# loading most recent inspections (failures only)
mostRecent_fail <- mostRecent_all[mostRecent_all$Results == "Fail", ]
Past inspection data was first retrieved from the City of Chicago’s open data portal (access here). 311 service requests were also taken into consideration (access here) as an indicator of public health complaints. In addition, we used the tidycensus package to access American Community Survey (ACS) data for Cook County, where Chicago is located, in order to obtain socio-economic information at the census-tract level.
As part of our exploratory analysis to assess relevant features to include in our model, we considered the spatial processes that could influence the results of food inspections. For instance, pests and airborne diseases are rarely contained within a single food establishment, as they are likely to also affect restaurants in the immediate vicinity. Furthermore, the cleanliness of an establishment could vary depending on the type of clientele targeted: the general socio-economic characteristics of the neighborhood an establishment belongs to could influence its hygiene standards. Personal biases of food inspectors may also be spatial in nature. Instead of assessing each establishment objectively, inspectors could use the results of nearby restaurants as reference points; neighborhood characteristics could also produce preconceived prejudices against establishments in particular areas of the City.
To examine spatial patterns of past food inspections, we first created a fishnet grid of 500m cells across the City and plotted the count of failed establishments. Since the records include numerous past inspections for some establishments, only the most recent inspection of each establishment was taken into account to prevent overlaps. It is evident that failed inspections are clustered in northern areas of the city; in particular, there is a hotspot along the eastern edge around The Loop, a downtown neighborhood containing the City’s Central Business District. As such, proximity to the centroid of The Loop was adopted as a feature in our model.
# joining failed food inspections to the fishnet
failed_net <- mostRecent_fail %>% dplyr::select() %>% mutate(countFailed = 1) %>%
aggregate(., chicagofishnet, sum) %>% mutate(countFailed = ifelse(is.na(countFailed),
0, countFailed), uniqueID = rownames(.))
# plotting failed food inspections
ggplot() + geom_sf(data = failed_net, lwd = 0.1, aes(fill = countFailed)) + scale_fill_gradient2(low = "white",
high = "red") + labs(title = "Failed Food Inspections (Chicago)") + mapTheme()
Counts of 311 requests for sanitation code complaints and rodent baiting requests were plotted in the fishnet as indicators of public perceptions of hygiene levels. Counts of crimes and affordable housing units were also plotted as proxies for neighborhood characteristics. There are clearly some overlapping hotspots for failed food inspections and the 311 requests in the northern sections of the City. There are also some overlaps with crime spatial patterns, particularly close to The Loop. These variables are hence potentially relevant features that could serve as predictors, so they were retained. On the other hand, there are no obvious spatial similarities between failed food inspections and the presence of affordable housing units, so the latter was not included in our model.
# loading 311 sanitation code complaints
sanitationCode <- read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-No/rccf-5427") %>%
dplyr::select(Y = Latitude, X = Longitude) %>% na.omit() %>% st_as_sf(coords = c("X",
"Y"), crs = 4326, agr = "constant")
transformed_sanitationCode <- st_transform(sanitationCode, 3857)
sanitationCode_net <- transformed_sanitationCode %>% dplyr::select() %>% mutate(countSC = 1) %>%
aggregate(., chicagofishnet, sum) %>% mutate(countSC = ifelse(is.na(countSC),
0, countSC), uniqueID = rownames(.))
# loading 311 rodent baiting requests
rodentBaiting <- read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Rodent-Baiting-No-Duplicates/uqhs-j723") %>%
dplyr::select(Y = Latitude, X = Longitude) %>% na.omit() %>% st_as_sf(coords = c("X",
"Y"), crs = 4326, agr = "constant")
transformed_rodentBaiting <- st_transform(rodentBaiting, 3857)
rodentBaiting_net <- transformed_rodentBaiting %>% dplyr::select() %>% mutate(countRB = 1) %>%
aggregate(., chicagofishnet, sum) %>% mutate(countRB = ifelse(is.na(countRB),
0, countRB), uniqueID = rownames(.))
# loading 2017 crime data
crime <- read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2017/d62x-nvdr") %>%
dplyr::select(Y = Latitude, X = Longitude) %>% na.omit() %>% st_as_sf(coords = c("X",
"Y"), crs = 4326, agr = "constant")
transformed_crime <- st_transform(crime, 3857)
crime_net <- transformed_crime %>% dplyr::select() %>% mutate(countC = 1) %>% aggregate(.,
chicagofishnet, sum) %>% mutate(countC = ifelse(is.na(countC), 0, countC), uniqueID = rownames(.))
# loading affordable rental units
affordableRental <- read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Affordable-Rental-Housing-Developments/s6ha-ppgi") %>%
dplyr::select(Y = latitude, X = longitude) %>% na.omit() %>% st_as_sf(coords = c("X",
"Y"), crs = 4326, agr = "constant")
transformed_rental <- st_transform(affordableRental, 3857)
rental_net <- transformed_rental %>% dplyr::select() %>% mutate(countR = 1) %>% aggregate(.,
chicagofishnet, sum) %>% mutate(countR = ifelse(is.na(countR), 0, countR), uniqueID = rownames(.))
# plotting all
SC <- ggplot() + geom_sf(data = sanitationCode_net, lwd = 0.05, aes(fill = countSC)) +
scale_fill_gradient2(low = "white", high = "red") + labs(title = "Sanitation Code Complaints") +
mapTheme()
RB <- ggplot() + geom_sf(data = rodentBaiting_net, lwd = 0.05, aes(fill = countRB)) +
scale_fill_gradient2(low = "white", high = "red") + labs(title = "Rodent Baiting Requests") +
mapTheme()
C <- ggplot() + geom_sf(data = crime_net, lwd = 0.05, aes(fill = countC)) + scale_fill_gradient2(low = "white",
high = "red") + labs(title = "Crimes") + mapTheme()
R <- ggplot() + geom_sf(data = rental_net, lwd = 0.05, aes(fill = countR)) + scale_fill_gradient2(low = "white",
high = "red") + labs(title = "Affordable Rental Units") + mapTheme()
grid.arrange(SC, RB, C, R, nrow = 2, ncol = 2)
To assess whether the outcome of an inspection is influenced by the results of surrounding establishments (i.e. spatial lag), 100m buffers were created around all dining establishments, and the counts of failing establishments within these buffers were plotted. Eateries in the north of the city (particularly along the north-eastern edge) have higher counts of failing establishments within their buffers; this pattern coincides with the overall distribution of inspection failures across the city, hence spatial lag was taken into consideration as a feature within the model.
# creating 100m buffers around all inspected establishments
Buffer <- mostRecent_all %>% st_buffer(100)
# counting failed establishments within each buffer
Buffer$failed_count <- lengths(st_intersects(Buffer, mostRecent_fail))
# joining count of restaurants to original restaurants file
bufferCount <- left_join(mostRecent_all, dplyr::select(st_set_geometry(Buffer, NULL),
License.., failed_count), by = "License..")
# plotting number of restaurants within 100m buffer
ggplot() + geom_sf(data = bufferCount, aes(color = q5(failed_count), show.legend = "point"),
stroke = 10^(-10)) + geom_sf(data = chicago, fill = NA) + labs(title = "Failures within 100m Buffer of Each Restaurant") +
scale_color_manual(values = palette5, labels = qBr(bufferCount, "failed_count"),
name = "Count") + mapTheme()
We also examined the relationship between past outcomes and the result of the latest inspection for each establishment. We divided our data into four groups: establishments that were inspected for the first time, establishments that had more failures than passes prior to the latest inspection, establishments that had more passes than failures, and establishments that had no failures at all. Eateries that were inspected for the first time had a much higher rate of failure (8.73%) than the other three groups. Past outcomes were hence incorporated as a predictive feature within our model.
# getting overall percent fail per license ID
pastPercentFail <- foodInspect %>% dplyr::select(Results, License..) %>% count(License..,
Results) %>% pivot_wider(id_cols = "License..", names_from = "Results", values_from = "n") %>%
mutate(Pass = replace_na(Pass, 0), Fail = replace_na(Fail, 0)) %>% inner_join(mostRecent_all) %>%
mutate(previousPass = ifelse(Results == "Pass", Pass - 1, Pass)) %>% mutate(previousFail = ifelse(Results ==
"Fail", Fail - 1, Fail)) %>% mutate(TotalPrevious = previousPass + previousFail) %>%
mutate(percentPreviousFail = previousFail/TotalPrevious)
# categorize previous results
regroupedPastResult <- pastPercentFail %>% mutate(pastResults = ifelse(TotalPrevious ==
0, "First Inspection", ifelse(percentPreviousFail > 0.5, "More Failures", ifelse(percentPreviousFail >
0, "More Passes", "No Failures")))) %>% mutate(firstTime = ifelse(TotalPrevious ==
0, "Yes", "No"))
regroupedPastResult %>% ggplot(., aes(x = pastResults, fill = Results)) + geom_bar(stat = "count",
    colour = "black") + scale_fill_manual(values = c("#FA7800", "#FED8B1"))
PercentByPastResult <- regroupedPastResult %>% group_by(Results, pastResults) %>%
summarise(counts = n()) %>% spread(Results, counts) %>% mutate(total = Fail +
Pass) %>% mutate(percentFail = (Fail/total) * 100) %>% dplyr::select(pastResults,
percentFail) %>% pivot_wider(names_from = pastResults, values_from = percentFail)
kable(PercentByPastResult, caption = "Proportion of Failures by Past Outcomes (%)") %>%
kable_styling(full_width = F) %>% column_spec(1, background = "#FA7800", color = "black",
bold = T) %>% column_spec(2, background = "#FED8B1", color = "black", bold = F) %>%
column_spec(3, background = "#FED8B1", color = "black", bold = F) %>% column_spec(4,
background = "#FED8B1", color = "black", bold = F)
First Inspection | More Failures | More Passes | No Failures |
---|---|---|---|
8.72956 | 2.728111 | 1.984685 | 2.705768 |
Finally, we considered the duration between the two most recent inspections as a potential factor influencing the outcome of the latest inspection. Once again, we divided our data into four groups: establishments that were inspected for the first time, establishments that were last inspected within a month of the latest inspection, establishments that were last inspected within a year, and establishments that were last inspected more than a year before. Eateries that were inspected for the first time had a much higher rate of failure (8.93%) than the other three groups. Duration since the last inspection was hence incorporated as a predictive feature within our model.
# get most recent result date
recentPrevious <- foodInspect %>% arrange(as.Date(Inspection.Date, "%m/%d/%Y")) %>%
group_by(License..) %>% slice(-n()) %>% group_by(License..) %>% slice(which.max(as.Date(Inspection.Date,
"%m/%d/%Y"))) %>% dplyr::select("License..", "Results", "Inspection.Date") %>%
rename(previousResult = Results)
# tabular join
previous_and_mostRecent <- left_join(mostRecent_all, recentPrevious, by = "License..")
previous_and_mostRecent$time_lag <- difftime(as.Date(previous_and_mostRecent$Inspection.Date.x,
"%m/%d/%Y"), as.Date(previous_and_mostRecent$Inspection.Date.y, "%m/%d/%Y"),
units = c("weeks"))
previous_and_mostRecent <- previous_and_mostRecent %>% mutate(time_lag = replace_na(time_lag,
0)) %>% mutate(time_since_last_check = ifelse(time_lag == 0, "First Inspection",
ifelse(time_lag < 4, "< 1 Month", ifelse(time_lag < 52, "1 Year", "> 1 Year"))))
previous_and_mostRecent %>% ggplot(., aes(x = time_since_last_check, fill = Results)) +
    geom_bar(stat = "count", colour = "black") + scale_fill_manual(values = c("#FA7800",
    "#FED8B1"))
PercentByTimeLag <- previous_and_mostRecent %>% st_set_geometry(NULL) %>% group_by(Results,
time_since_last_check) %>% summarise(counts = n()) %>% spread(Results, counts) %>%
mutate(total = Fail + Pass) %>% mutate(percentFail = (Fail/total) * 100) %>%
dplyr::select(time_since_last_check, percentFail) %>% pivot_wider(names_from = time_since_last_check,
values_from = percentFail)
kable(PercentByTimeLag, caption = "Proportion of Failures by Duration Since Last Inspection (%)") %>%
kable_styling(full_width = F) %>% column_spec(1, background = "#FED8B1", color = "black",
bold = F) %>% column_spec(2, background = "#FED8B1", color = "black", bold = F) %>%
column_spec(3, background = "#FED8B1", color = "black", bold = F) %>% column_spec(4,
background = "#FA7800", color = "black", bold = T)
< 1 Month | > 1 Year | 1 Year | First Inspection |
---|---|---|---|
1.332746 | 2.566829 | 2.950706 | 8.930348 |
Following our exploratory analysis, we decided to incorporate the following variables in our predictive model: counts of sanitation code complaints, rodent baiting requests, and crimes in each grid cell; distance to The Loop; the count of failed establishments within a 100m buffer of each establishment; past inspection outcomes; duration since the last inspection; census-tract characteristics (median income, percent white, percent in poverty); whether the establishment is a restaurant; and the neighborhood it belongs to.
# joining fishnet variables (311 counts and crimes)
final_net <- left_join(sanitationCode_net, st_set_geometry(rodentBaiting_net, NULL),
by = "uniqueID")
final_net <- left_join(final_net, st_set_geometry(crime_net, NULL), by = "uniqueID")
# adding proximity to the loop
neighborhoods <- st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
st_transform(st_crs(chicagofishnet))
loopPoint <- neighborhoods %>% filter(name == "Loop") %>% st_centroid()
final_net$loopDistance = st_distance(st_centroid(final_net), loopPoint) %>% as.numeric()
transformed_net <- st_transform(final_net, 3857)
# performing spatial join with restaurants point file
variables = st_join(mostRecent_all, transformed_net)
# adding number of failed restaurants within 100m buffer
variables <- left_join(variables, dplyr::select(st_set_geometry(bufferCount, NULL),
License.., failed_count), by = "License..")
# adding past outcomes
variables <- merge(x = variables, y = regroupedPastResult[, c("pastResults", "License..")],
by = "License..", all.x = TRUE)
# adding duration since past inspection
variables <- left_join(variables, dplyr::select(st_set_geometry(previous_and_mostRecent,
NULL), License.., time_since_last_check), by = "License..")
# adding census variables - total pop, median income, white pop, NBELPOV
census_api_key("131eaa4664fa75f89806d6e6a481a61339ea5bb2", install = TRUE)
census_variables <- c("B00001_001E", "B06011_001E", "C02003_003E", "B06012_002E")
chicago_ACS_unpolished <- get_acs(geography = "tract", variables = census_variables,
state = "IL", county = "Cook", output = "wide", geometry = TRUE)
chicago_ACS <- chicago_ACS_unpolished %>% mutate(PctWhite = C02003_003E/B00001_001E,
PctPov = B06012_002E/B00001_001E) %>% rename(MedInc = B06011_001E)
chicago_ACS <- dplyr::select(chicago_ACS, -(contains("_")))
transformed_census <- st_transform(chicago_ACS, 3857)
variables = st_join(variables, transformed_census)
variables <- subset(variables, select = -c(GEOID))
# adding 'restaurant' classification
finding_Rest <- mostRecent_all %>% ungroup() %>% mutate(Restaurant = as.character(Facility.Type),
Restaurant = ifelse(str_detect(Restaurant, "(?i)resta(?-i)"), 1, 0))
variables <- left_join(variables, dplyr::select(st_set_geometry(finding_Rest, NULL),
License.., Restaurant), by = "License..")
# adding neighborhoods
chicago_nhoods <- st_read("chicago_nhoods.shp")
chicago_nhoods <- st_transform(chicago_nhoods, 3857)
variables = st_join(variables, chicago_nhoods)
variables = subset(variables, select = -c(shape_area, shape_len, sec_neigh))
variables <- na.omit(variables)
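At this point each row of variables corresponds to one establishment's most recent inspection, which can be confirmed before modeling:
# one row per establishment after the joins and NA removal
nrow(variables)  # 31794, matching the sample count reported by the cross-validation below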
We plotted a correlation matrix of the non-categorical variables to ensure no collinear factors were included in the model.
CorMat <- variables %>% st_set_geometry(NULL) %>% dplyr::select(-uniqueID, -Inspection.Date,
-Results, -License.., -pastResults, -time_since_last_check, -pri_neigh, -Facility.Type,
-Restaurant, -NAME)
M <- cor(CorMat)
corrplot(M, method = "color", col = palette5, addCoef.col = "black", tl.col = "black",
type = "lower", tl.srt = 45)
The data was then split 70/30 into training and test sets. The model was estimated on the training set, and the fitted model was then used to predict outcomes on the test set.
final_set <- variables %>% mutate(Fail = ifelse(Results == "Fail", 1, 0)) %>% st_set_geometry(NULL)
# splitting data
set.seed(3456)
trainIndex <- createDataPartition(y = paste(final_set$Fail, final_set$pri_neigh,
final_set$time_since_last_check, final_set$pastResults, final_set$Restaurant),
p = 0.7, list = FALSE, times = 1)
TrainSet <- final_set[trainIndex, ]
TestSet <- final_set[-trainIndex, ]
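Since failures make up only about 1.8% of observations, it is worth confirming that the stratified split preserved this balance on both sides (a quick check):
# both proportions should sit near the overall failure rate of ~0.018
mean(TrainSet$Fail)
mean(TestSet$Fail)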
A logistic regression model was estimated to predict the probability that each food establishment will fail an inspection. A summary of the model and the pseudo R-squared are printed below.
inspectionModel <- glm(TrainSet$Fail ~ ., data = TrainSet %>% dplyr::select(-uniqueID,
-Inspection.Date, -Results, -License.., -Fail, -Facility.Type, -NAME), family = binomial(link = "logit"))
summary <- as.data.frame(summary(inspectionModel)$coefficients)
names(summary)[names(summary) == "Pr(>|z|)"] <- "pvalue"
names(summary)[names(summary) == "Estimate"] <- "Estimated Coefficient for Predictor"
kable(summary) %>% kable_styling(bootstrap_options = c("striped", "hover"), fixed_thead = TRUE) %>%
row_spec(which(summary$pvalue < 0.05), bold = F, color = "white", background = "#FA7800") %>%
scroll_box(width = "100%", height = "300px")
Estimated Coefficient for Predictor | Std. Error | z value | pvalue | |
---|---|---|---|---|
(Intercept) | -6.4500817 | 0.6577053 | -9.8069480 | 0.0000000 |
countSC | 0.0012480 | 0.0006605 | 1.8895686 | 0.0588157 |
countRB | 0.0002148 | 0.0003426 | 0.6269515 | 0.5306910 |
countC | 0.0002951 | 0.0001841 | 1.6035828 | 0.1088060 |
loopDistance | 0.0000325 | 0.0000290 | 1.1191343 | 0.2630829 |
failed_count | -0.0017591 | 0.0022554 | -0.7799733 | 0.4354066 |
pastResultsMore Failures | 1.8302684 | 0.2788755 | 6.5630296 | 0.0000000 |
pastResultsMore Passes | 1.0902113 | 0.2696768 | 4.0426587 | 0.0000528 |
pastResultsNo Failures | 1.1676857 | 0.2714833 | 4.3011328 | 0.0000170 |
time_since_last_check> 1 Year | 1.1116934 | 0.1473919 | 7.5424297 | 0.0000000 |
time_since_last_check1 Year | 1.2050407 | 0.1305420 | 9.2310560 | 0.0000000 |
time_since_last_checkFirst Inspection | 3.4319209 | 0.2822389 | 12.1596304 | 0.0000000 |
MedInc | -0.0000042 | 0.0000045 | -0.9263022 | 0.3542889 |
PctWhite | -0.0102750 | 0.0107409 | -0.9566225 | 0.3387578 |
PctPov | -0.0006113 | 0.0214278 | -0.0285261 | 0.9772426 |
Restaurant | -0.3369386 | 0.0680900 | -4.9484297 | 0.0000007 |
pri_neighAndersonville | -0.8324202 | 1.0510155 | -0.7920152 | 0.4283518 |
pri_neighArcher Heights | 0.3481623 | 0.5193099 | 0.6704326 | 0.5025820 |
pri_neighArmour Square | 1.1468937 | 0.6760444 | 1.6964769 | 0.0897956 |
pri_neighAshburn | -0.1544215 | 0.5405630 | -0.2856679 | 0.7751325 |
pri_neighAuburn Gresham | 0.5952676 | 0.3963502 | 1.5018727 | 0.1331300 |
pri_neighAustin | 0.7880607 | 0.3495984 | 2.2541881 | 0.0241843 |
pri_neighAvalon Park | 0.7327319 | 0.5812354 | 1.2606457 | 0.2074365 |
pri_neighAvondale | 0.2826258 | 0.4379056 | 0.6454036 | 0.5186657 |
pri_neighBelmont Cragin | 0.5364805 | 0.3620915 | 1.4816158 | 0.1384425 |
pri_neighBeverly | -0.5459091 | 0.8270968 | -0.6600305 | 0.5092343 |
pri_neighBoystown | -0.6188676 | 1.0711071 | -0.5777831 | 0.5634106 |
pri_neighBridgeport | 1.0442146 | 0.4696263 | 2.2235011 | 0.0261820 |
pri_neighBrighton Park | 1.1388361 | 0.4044701 | 2.8156248 | 0.0048682 |
pri_neighBucktown | 0.8627946 | 0.5259103 | 1.6405738 | 0.1008859 |
pri_neighBurnside | -13.2001151 | 1696.4380397 | -0.0077811 | 0.9937917 |
pri_neighCalumet Heights | 0.3036962 | 0.6294151 | 0.4825054 | 0.6294470 |
pri_neighChatham | 1.1779407 | 0.4054352 | 2.9053733 | 0.0036682 |
pri_neighChicago Lawn | 0.5787476 | 0.3792768 | 1.5259241 | 0.1270288 |
pri_neighChinatown | 0.4942165 | 0.6462022 | 0.7648016 | 0.4443897 |
pri_neighClearing | 0.2111980 | 0.6783832 | 0.3113255 | 0.7555531 |
pri_neighDouglas | 1.1896075 | 0.5566439 | 2.1371068 | 0.0325893 |
pri_neighDunning | 0.7517766 | 0.4414263 | 1.7030627 | 0.0885563 |
pri_neighEast Side | -0.3955434 | 0.7395115 | -0.5348712 | 0.5927389 |
pri_neighEast Village | 0.7020522 | 0.6819247 | 1.0295157 | 0.3032374 |
pri_neighEdgewater | 0.5464897 | 0.4009887 | 1.3628555 | 0.1729281 |
pri_neighEdison Park | -12.9898482 | 273.1392344 | -0.0475576 | 0.9620688 |
pri_neighEnglewood | 0.6318210 | 0.3683551 | 1.7152499 | 0.0862994 |
pri_neighFuller Park | 1.2372966 | 0.6339539 | 1.9517138 | 0.0509722 |
pri_neighGage Park | 0.7474838 | 0.4340499 | 1.7221148 | 0.0850487 |
pri_neighGalewood | 1.1956098 | 0.5153785 | 2.3198675 | 0.0203480 |
pri_neighGarfield Park | 0.8990715 | 0.4219238 | 2.1308860 | 0.0330985 |
pri_neighGarfield Ridge | 0.5870836 | 0.4787084 | 1.2263910 | 0.2200516 |
pri_neighGold Coast | 0.5508527 | 0.7774487 | 0.7085389 | 0.4786107 |
pri_neighGrand Boulevard | 0.9182268 | 0.5150319 | 1.7828541 | 0.0746100 |
pri_neighGrand Crossing | 1.0463484 | 0.3802742 | 2.7515629 | 0.0059312 |
pri_neighGrant Park | -12.6519464 | 791.0754140 | -0.0159934 | 0.9872397 |
pri_neighGreektown | 1.5187210 | 0.8177549 | 1.8571836 | 0.0632850 |
pri_neighHegewisch | 0.4706400 | 0.7975067 | 0.5901393 | 0.5550973 |
pri_neighHermosa | 0.5753904 | 0.5000405 | 1.1506877 | 0.2498608 |
pri_neighHumboldt Park | 0.8509291 | 0.3904662 | 2.1792642 | 0.0293121 |
pri_neighHyde Park | 0.4303364 | 0.4798954 | 0.8967296 | 0.3698632 |
pri_neighIrving Park | 0.5522681 | 0.3690253 | 1.4965590 | 0.1345081 |
pri_neighJackson Park | -12.9654041 | 778.1362129 | -0.0166621 | 0.9867062 |
pri_neighJefferson Park | -0.1137911 | 0.6087860 | -0.1869147 | 0.8517275 |
pri_neighKenwood | 0.2729991 | 0.8050330 | 0.3391154 | 0.7345228 |
pri_neighLake View | 0.2999529 | 0.4182689 | 0.7171295 | 0.4732942 |
pri_neighLincoln Park | 0.2420321 | 0.4878732 | 0.4960962 | 0.6198266 |
pri_neighLincoln Square | 0.3665293 | 0.4153360 | 0.8824886 | 0.3775126 |
pri_neighLittle Italy, UIC | 0.9050287 | 0.5365292 | 1.6868211 | 0.0916378 |
pri_neighLittle Village | 1.0750242 | 0.3830847 | 2.8062315 | 0.0050125 |
pri_neighLogan Square | 0.3189543 | 0.4065709 | 0.7844987 | 0.4327476 |
pri_neighLoop | 0.2800160 | 0.5942655 | 0.4711968 | 0.6375002 |
pri_neighLower West Side | 1.1786577 | 0.4646940 | 2.5364167 | 0.0111993 |
pri_neighMagnificent Mile | -1.2530525 | 1.1510502 | -1.0886167 | 0.2763230 |
pri_neighMckinley Park | 1.2375685 | 0.5083024 | 2.4347089 | 0.0149038 |
pri_neighMillenium Park | 2.3362160 | 0.9924016 | 2.3541035 | 0.0185674 |
pri_neighMontclare | 0.8287847 | 0.5750212 | 1.4413114 | 0.1494967 |
pri_neighMorgan Park | 0.2763717 | 0.6337563 | 0.4360851 | 0.6627750 |
pri_neighMount Greenwood | -0.2295223 | 0.8602446 | -0.2668106 | 0.7896150 |
pri_neighMuseum Campus | 1.5827475 | 0.9184005 | 1.7233740 | 0.0848209 |
pri_neighNear South Side | 0.5738926 | 0.6322133 | 0.9077516 | 0.3640095 |
pri_neighNew City | 0.8690895 | 0.3955122 | 2.1973772 | 0.0279935 |
pri_neighNorth Center | 0.3721914 | 0.4702068 | 0.7915484 | 0.4286241 |
pri_neighNorth Lawndale | 0.9387365 | 0.4356362 | 2.1548635 | 0.0311725 |
pri_neighNorth Park | 0.6757518 | 0.4742598 | 1.4248559 | 0.1541989 |
pri_neighNorwood Park | 0.5145185 | 0.5406201 | 0.9517192 | 0.3412394 |
pri_neighOakland | -12.8746843 | 779.7725654 | -0.0165108 | 0.9868269 |
pri_neighOld Town | 1.1529615 | 0.5742687 | 2.0077038 | 0.0446748 |
pri_neighPortage Park | 0.5457624 | 0.3871401 | 1.4097282 | 0.1586199 |
pri_neighPrinters Row | -12.5156031 | 430.3112407 | -0.0290850 | 0.9767968 |
pri_neighPullman | -13.2415196 | 331.1389834 | -0.0399878 | 0.9681029 |
pri_neighRiver North | 0.2305131 | 0.5911093 | 0.3899670 | 0.6965610 |
pri_neighRiverdale | -13.1640759 | 524.5901463 | -0.0250940 | 0.9799800 |
pri_neighRogers Park | 0.4217797 | 0.3942600 | 1.0698008 | 0.2847090 |
pri_neighRoseland | 0.4396180 | 0.5050763 | 0.8703990 | 0.3840824 |
pri_neighRush & Division | -0.0428855 | 0.7769515 | -0.0551972 | 0.9559814 |
pri_neighSauganash,Forest Glen | 1.2950399 | 0.5565174 | 2.3270430 | 0.0199630 |
pri_neighSheffield & DePaul | 0.8504996 | 0.5683902 | 1.4963304 | 0.1345676 |
pri_neighSouth Chicago | 1.0089740 | 0.4304376 | 2.3440660 | 0.0190748 |
pri_neighSouth Deering | 0.5531488 | 0.5948863 | 0.9298394 | 0.3524543 |
pri_neighSouth Shore | 1.1030593 | 0.3885349 | 2.8390228 | 0.0045252 |
pri_neighStreeterville | 0.2749334 | 0.6493139 | 0.4234214 | 0.6719878 |
pri_neighUkrainian Village | 0.6091677 | 0.6542211 | 0.9311342 | 0.3517841 |
pri_neighUnited Center | 0.2238911 | 0.5863407 | 0.3818447 | 0.7025765 |
pri_neighUptown | 0.1899183 | 0.4084889 | 0.4649289 | 0.6419824 |
pri_neighWashington Heights | 0.3694993 | 0.5218740 | 0.7080239 | 0.4789304 |
pri_neighWashington Park | 1.6103156 | 0.4895781 | 3.2891906 | 0.0010048 |
pri_neighWest Elsdon | 0.8674430 | 0.5185024 | 1.6729777 | 0.0943317 |
pri_neighWest Lawn | 0.1497712 | 0.4946913 | 0.3027569 | 0.7620751 |
pri_neighWest Loop | 0.4605009 | 0.5967545 | 0.7716756 | 0.4403066 |
pri_neighWest Pullman | 0.6590860 | 0.6126632 | 1.0757721 | 0.2820292 |
pri_neighWest Ridge | 0.3884997 | 0.3745653 | 1.0372016 | 0.2996419 |
pri_neighWest Town | 0.3231333 | 0.5606866 | 0.5763171 | 0.5644009 |
pri_neighWicker Park | 0.3253613 | 0.5099153 | 0.6380692 | 0.5234287 |
pri_neighWoodlawn | 0.6372544 | 0.4918369 | 1.2956620 | 0.1950920 |
pri_neighWrigleyville | -1.1818581 | 1.0662973 | -1.1083758 | 0.2676996 |
pR2(inspectionModel)
## llh llhNull G2 McFadden r2ML
## -3.958876e+03 -4.402831e+03 8.879104e+02 1.008340e-01 3.746176e-02
## r2CU
## 1.188435e-01
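Since logistic regression coefficients are expressed in log-odds, exponentiating a coefficient yields an odds ratio. For example, the large first-inspection coefficient above implies roughly a 31-fold increase in the odds of failure relative to the reference category (establishments last inspected less than a month ago), holding all other predictors fixed:
# interpreting the 'First Inspection' coefficient as an odds ratio
exp(3.4319209)  # ~30.9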
To assess the model created above, predictions were generated on the test set.
testProbs <- data.frame(Tract = TestSet$NAME, License = TestSet$License.., MedInc = TestSet$MedInc,
PctWhite = TestSet$PctWhite, Restaurant = TestSet$Restaurant, Outcome = as.factor(TestSet$Fail),
Probs = predict(inspectionModel, TestSet, type = "response"))
The following histograms show the density of predicted probabilities by observed outcome. A well-performing model would show the ‘hump’ of predicted probabilities for establishments that failed (‘1’) clustered around 1.0 on the x-axis, while predicted probabilities for establishments that passed (‘0’) should cluster around 0.0. It is clear that our model is good at predicting the probabilities of passing an inspection, but relatively poor at predicting probabilities of failure.
ggplot(testProbs, aes(x = Probs, fill = as.factor(Outcome))) + geom_density() + facet_grid(Outcome ~
.) + scale_fill_manual(values = palette2) + labs(x = "Probabilities of Failure",
y = "Density of probabilities", title = "Distribution of predicted probabilities by observed outcome") +
theme(strip.text.x = element_text(size = 18), legend.position = "none")
While a cost-benefit analysis is often used to identify an optimal threshold beyond which dining establishments are predicted to fail an inspection, we felt it was beyond our scope to put a number on the social and economic costs associated with, for instance, missing a high-risk food establishment and allowing a public health risk to fester. To determine a probability threshold instead, we took the midpoint between the peaks of the two density histograms.
The modal predicted probability among establishments that passed was 0.0153, while that among establishments that failed was 0.153. The threshold used subsequently was hence calculated as (0.0153 + 0.153) / 2 ≈ 0.084.
passes <- testProbs[testProbs$Outcome == "0", ]
fails <- testProbs[testProbs$Outcome == "1", ]
densMode <- function(x) {
td <- density(x)
maxDens <- which.max(td$y)
list(x = td$x[maxDens], y = td$y[maxDens])
}
densMode(passes$Probs)
## $x
## [1] 0.0152725
##
## $y
## [1] 20.83022
densMode(fails$Probs)
## $x
## [1] 0.153341
##
## $y
## [1] 6.345045
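Rather than hard-coding the midpoint, it can also be computed directly from the two modal values above:
# midpoint between the modal predicted probabilities of the two groups
threshold <- (densMode(passes$Probs)$x + densMode(fails$Probs)$x)/2
threshold  # 0.0843, rounded to 0.084 below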
testProbs <- testProbs %>% mutate(predOutcome = as.factor(ifelse(testProbs$Probs >
0.084, 1, 0)))
For further analysis of accuracy, a confusion matrix was created with the probability threshold of 0.084, meaning that any predicted probability exceeding this results in a predicted outcome of Failure. ‘Reference’ refers to the actual outcome, while ‘Prediction’ refers to the predicted outcome. The interpretation of the confusion matrix is as follows:
The model has an overall accuracy of 86.4%, computed by adding the number of True Positives to the number of True Negatives and dividing by the total number of observations. Two other metrics, ‘sensitivity’ and ‘specificity’, were also computed. The former is the proportion of actual Failures that were predicted to be Failures (the ‘true positive rate’), while the latter is the proportion of actual Passes that were predicted to be Passes (the ‘true negative rate’). The model has a sensitivity of 72.6% and a specificity of 86.7%. Consistent with our interpretation in section 5.1., this suggests the model is better at predicting Passes than Failures; this is unsurprising given that a large majority of the data has this outcome.
caret::confusionMatrix(testProbs$predOutcome, testProbs$Outcome, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7263 43
## 1 1119 114
##
## Accuracy : 0.8639
## 95% CI : (0.8565, 0.8711)
## No Information Rate : 0.9816
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1358
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.72611
## Specificity : 0.86650
## Pos Pred Value : 0.09246
## Neg Pred Value : 0.99411
## Prevalence : 0.01839
## Detection Rate : 0.01335
## Detection Prevalence : 0.14440
## Balanced Accuracy : 0.79631
##
## 'Positive' Class : 1
##
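To make these definitions concrete, the headline metrics can be recomputed by hand from the four cells of the matrix above:
# recomputing the metrics from the confusion matrix cells
TN <- 7263; FN <- 43; FP <- 1119; TP <- 114
(TP + TN)/(TP + TN + FP + FN)  # accuracy: 0.8639
TP/(TP + FN)  # sensitivity (true positive rate): 0.7261
TN/(TN + FP)  # specificity (true negative rate): 0.8665
TP/(TP + FP)  # positive predictive value: 0.0925
TN/(TN + FN)  # negative predictive value: 0.9941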
A ROC curve was plotted to visualize the trade-off between False Positives and False Negatives. The y-axis of the ROC curve shows the true positive rate at each threshold between 0 and 1, while the x-axis shows the corresponding false positive rate.
ggplot(testProbs, aes(d = as.numeric(testProbs$Outcome), m = Probs)) + geom_roc(n.cuts = 50,
labels = FALSE, colour = "#FA7800") + style_roc(theme = theme_grey) + geom_abline(slope = 1,
intercept = 0, size = 1.5, color = "grey") + labs(title = "ROC Curve - Food Inspections")
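For reference, the operating point corresponding to our 0.084 threshold can be located on this curve directly from the test-set probabilities:
# true and false positive rates at the 0.084 threshold
c(TPR = mean(testProbs$Probs[testProbs$Outcome == "1"] > 0.084),
    FPR = mean(testProbs$Probs[testProbs$Outcome == "0"] > 0.084))
# TPR ~ 0.726 and FPR ~ 0.134, consistent with the confusion matrix above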
The Area Under the ROC Curve (AUC) was also obtained. The AUC of 0.878 is well above 0.5, and the ROC curve lies above the 45-degree line, indicating that the model performs much better than a coin flip.
auc(testProbs$Outcome, testProbs$Probs)
## Area under the curve: 0.8777
K-fold cross-validation (k = 100) was performed to test the generalizability of our model to unseen data. The three metrics output below for AUC, sensitivity and specificity are averages across all 100 folds.
final_set$Fail[final_set$Fail == "1"] <- "Fail"
final_set$Fail[final_set$Fail == "0"] <- "Pass"
ctrl <- trainControl(method = "cv", number = 100, classProbs = TRUE, summaryFunction = twoClassSummary)
cvFit <- train(Fail ~ ., data = final_set %>% dplyr::select(-uniqueID, -Inspection.Date,
-Results, -License.., -Facility.Type, -NAME), method = "glm", family = "binomial",
metric = "ROC", trControl = ctrl)
cvFit
## Generalized Linear Model
##
## 31794 samples
## 12 predictor
## 2 classes: 'Fail', 'Pass'
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 31476, 31475, 31475, 31477, 31476, 31475, ...
## Resampling results:
##
## ROC Sens Spec
## 0.7448534 0.002435897 0.9997708
The following histograms plot AUC, sensitivity and specificity across the 100 folds. If the model were perfectly generalizable, the goodness of fit would be the same regardless of which observations showed up in a random fold. In this case, all three metrics are tightly distributed around their means; taken at face value, this suggests that the model generalizes well to unseen data, particularly with respect to specificity and sensitivity.
However, this is likely because the default threshold set by the package is 0.5, much higher than the optimal threshold of 0.084 determined in section 5.2. It is clear from the histograms in section 5.2. that the probabilities of failure generated by the model fall mostly below 0.3; at a threshold of 0.5, virtually all establishments are predicted to Pass. Since the data largely comprises food establishments that passed their latest inspection, this explains why specificity is so high (almost all folds at 1.0) and sensitivity so low (almost all folds at 0.0) across the 100 folds. As such, this may not be an accurate indication of model generalizability; for a more representative picture, a much lower threshold (such as the one determined in section 5.2.) should be used instead.
dplyr::select(cvFit$resample, -Resample) %>% gather(metric, value) %>% left_join(gather(cvFit$results[2:4],
metric, mean)) %>% ggplot(aes(value)) + geom_histogram(bins = 15, fill = "#fed8b1") +
facet_wrap(~metric) + geom_vline(aes(xintercept = mean), colour = "#FA7800",
linetype = 3, size = 1) + scale_x_continuous(limits = c(-0.1, 1.1)) + labs(x = "Goodness of Fit",
y = "Count", title = "CV Goodness of Fit Metrics", subtitle = "Across-fold mean represented as dotted lines")
To examine if our model predicts well across urban space, we plotted the overall accuracy (defined as the number of correct predictions divided by the total number of predictions) within each neighborhood in the City. It is visually evident that the model predicts with reasonable accuracy in north-east and south-west Chicago, but loses predictive power in the north-western and south-eastern sections of the City. Our model is hence not generalizable across urban space.
# subtracting predicted outcome from actual outcome
testProbs$predOutcome <- as.numeric(as.character(testProbs$predOutcome))
testProbs$Outcome <- as.numeric(as.character(testProbs$Outcome))
testProbs$difference <- abs(testProbs$predOutcome - testProbs$Outcome)
# finding accuracy by neighborhood
grouped <- testProbs %>% group_by(Tract) %>% summarize(accuracy = 1 - (sum(difference)/n()))
# merging with tracts shapefile
names(chicago_ACS)[names(chicago_ACS) == "NAME"] <- "Tract"
grouped <- merge(x = chicago_ACS, y = grouped, by = "Tract", all.x = TRUE)
grouped <- na.omit(grouped)
# plotting accuracy by tract
ggplot() + geom_sf(data = grouped, aes(fill = q5(accuracy))) + scale_fill_manual(values = palette5,
labels = qBr(grouped, "accuracy", rnd = F), name = "Accuracy") + labs(title = "Overall Accuracy by Census Tract") +
mapTheme()
The following figures map Chicago according to the racial and income profiles of each census tract. Tracts in which at least 51% of residents are white were designated “Majority White”, while tracts with median incomes greater than the citywide median of $55,295 were designated “High Income”. A visual comparison of these plots with the plot of overall accuracy above suggests some similarities between the spatial distributions of these socio-economic variables and that of overall accuracy. For instance, the High Income cluster on the north-eastern edge of the City and the Majority White areas in the north have extremely high rates of accuracy. This implies the model may not generalize well across different socio-economic profiles, since its accuracy rates appear to be stratified by these characteristics.
# splitting majority white and high income
grouped$PctWhite <- ifelse(grouped$PctWhite > 50, "Majority White", "Majority Non-White")
grouped$MedInc <- ifelse(grouped$MedInc > 55295, "High Income", "Low Income")
# plotting by race and by income
grid.arrange(
ggplot() +
geom_sf(data = na.omit(grouped), aes(fill = PctWhite)) +
scale_fill_manual(values = c("#fed8b1","#FA7800")) +
labs(title = "Racial segregation") +
mapTheme() + theme(legend.position="bottom"),
ggplot() +
geom_sf(data = na.omit(grouped), aes(fill = MedInc)) +
scale_fill_manual(values = c("#fed8b1","#FA7800")) +
labs(title = "Income segregation") +
mapTheme() + theme(legend.position="bottom"), ncol = 2)
To further inspect the generalizability of our model across different socio-economic profiles, we computed the mean overall accuracy for each tract context. High Income tracts have a much higher mean overall accuracy of 94.9% relative to Low Income tracts, which have a mean overall accuracy of 82.1%. Similarly, Majority White tracts have a slightly higher mean overall accuracy of 87.8% as compared to Majority Non-White tracts, which have a lower mean overall accuracy of 82.1%. These findings are concordant with conclusions drawn above from visual inspection that Majority White and High Income tracts have higher accuracy rates relative to their counterparts.
According to the confusion matrix above, the model has a better hit-rate for predicting passes as compared to failures. The Negative Predictive Value is 0.994: for every 1000 passes predicted by the model, on average, 994 of these predictions will be correct. In contrast, the Positive Predictive Value is 0.092: for every 1000 failures predicted by the model, only 92 of these predictions will be correct. As such, Majority White and High Income tracts have higher overall accuracy since food establishments within them are more likely to pass food inspections. The model hence does not generalize well across different socio-economic profiles.
# finding mean overall accuracy across race
race <- grouped %>% group_by(PctWhite) %>% summarize(meanAccuracy = (mean(accuracy))) %>%
st_set_geometry(NULL)
names(race)[names(race) == "PctWhite"] <- "Tract Context"
names(race)[names(race) == "meanAccuracy"] <- "Mean Overall Accuracy"
kable(race, caption = "Mean Overall Accuracy of Tracts by Race") %>% kable_styling(full_width = F) %>%
column_spec(1, background = "#FA7800", color = "black", bold = F) %>% column_spec(2,
background = "#FED8B1", color = "black", bold = F)
Tract Context | Mean Overall Accuracy |
---|---|
Majority Non-White | 0.8214062 |
Majority White | 0.8777431 |
# finding mean overall accuracy across income
income <- grouped %>% group_by(MedInc) %>% summarize(meanAccuracy = (mean(accuracy))) %>%
st_set_geometry(NULL)
names(income)[names(income) == "MedInc"] <- "Tract Context"
names(income)[names(income) == "meanAccuracy"] <- "Mean Overall Accuracy"
kable(income, caption = "Mean Overall Accuracy of Tracts by Income") %>% kable_styling(full_width = F) %>%
column_spec(1, background = "#FA7800", color = "black", bold = F) %>% column_spec(2,
background = "#FED8B1", color = "black", bold = F)
Tract Context | Mean Overall Accuracy |
---|---|
High Income | 0.9488910 |
Low Income | 0.8211341 |
Finally, we examined the mean overall accuracy for Restaurants and non-Restaurants to test whether the model works better for particular kinds of food establishments. The model has a much higher overall accuracy of 92.9% for restaurants, compared to 76.6% for non-restaurants. This is likely for reasons similar to those discussed in the previous section: restaurants are more likely to pass food inspections than non-restaurants, which include mobile food dispensers and gas stations. As such, the model also does not generalize well across establishment types.
establishment <- testProbs %>% group_by(Restaurant) %>% summarize(accuracy = (1 -
mean(difference)))
establishment$Restaurant[establishment$Restaurant == "1"] <- "Restaurant"
establishment$Restaurant[establishment$Restaurant == "0"] <- "Non-Restaurant"
names(establishment)[names(establishment) == "accuracy"] <- "Mean Overall Accuracy"
kable(establishment, caption = "Mean Overall Accuracy by Establishment Type") %>%
kable_styling(full_width = F) %>% column_spec(1, background = "#FA7800", color = "black",
bold = F) %>% column_spec(2, background = "#FED8B1", color = "black", bold = F)
Restaurant | Mean Overall Accuracy |
---|---|
Non-Restaurant | 0.7659824 |
Restaurant | 0.9290310 |
One limitation of this analysis is that the model is far better at identifying true negatives (passes) than true positives (failures). This is especially problematic since the goal of the regression is to identify food establishments with high likelihoods of failing inspections, so that inspectors can be dispatched there expediently. Another limitation is the model’s lack of generalizability across different socio-economic profiles and establishment types. The lower overall accuracy in Majority Non-White and Low Income tracts may cause discrimination in the form of unfairly targeted inspections within these areas, which could be disruptive to local businesses. In the same manner, non-restaurants may be unduly subjected to more frequent inspections than restaurants.
Nonetheless, our model adopts a data-driven approach to identifying high-risk food establishments, optimizing the allocation of limited health inspectors; this constitutes an improvement over the business-as-usual approach of conducting checks in a random or arbitrary order. A more efficient approach is necessary given the health risks and social costs that result from public exposure to unhygienic eateries. By determining which food establishments are more likely to fail health inspections, health inspectors can focus their efforts on these establishments and minimize risks to public health.