Predicting Food Inspection Failures in Chicago

1. Introduction.

This report seeks to predict food inspection failures in Chicago, in order to provide the Chicago Department of Public Health with a better way of allocating their limited health inspectors across the numerous food establishments in the City. Current business-as-usual approaches to food inspections include responding to complaints from the public, randomized inspections, or simply checks performed in some arbitrary order (e.g. alphabetical). However, such reactive approaches often cause delays in determining which food establishments are unhygienic and as such should be shut down. Given the potential hazards that such unidentified eateries pose to the City, as well as the wider social and public health implications that these bring, it is necessary to identify such dangers expediently to prevent public health crises before they occur.

Given this problem motivation, this report will estimate a logistic regression model to allow the Chicago Health Department to take a more proactive approach in addresing this issue, so as to minimize the duration of public exposure to such risks. Specifically, it will use inspection data on food establishments from one year to predict for the next, to assess the likelihood that a food establishment will fail a health inspection. It can then prioritize high-risk dining establishments and dispatch its limited number of inspectors to focus their efforts accordingly, instead of casting a wide net as is being done presently.

The Youtube link to our presentation can be found (here).

# chicago and the fishnet
chicago <- st_read("chicago_boundary.shp")

fishnet <- st_make_grid(chicago, cellsize = 500) %>% st_sf()

chicagofishnet <- fishnet[chicago, ] %>% mutate(uniqueID = rownames(.)) %>% dplyr::select(uniqueID)

chicagofishnet <- st_transform(chicagofishnet, 3857)

df <- read.csv("Food_Inspections.csv")

# loading most recent inspections (all)
foodInspect <- df %>% filter(Results %in% c("Fail", "Pass", "Pass w/ Conditions")) %>% 
    filter(License.. > 0) %>% drop_na(Latitude) %>% drop_na(Longitude)

foodInspect$Results[foodInspect$Results == "Pass w/ Conditions"] <- "Pass"

mostRecent_all <- foodInspect %>% group_by(License..) %>% slice(which.max(as.Date(Inspection.Date, 
    "%m/%d/%Y"))) %>% dplyr::select("License..", "Results", "Inspection.Date", "Latitude", 
    "Longitude", "Facility.Type")

mostRecent_all <- st_as_sf(mostRecent_all, coords = c("Longitude", "Latitude"), crs = 4326)
mostRecent_all <- st_transform(mostRecent_all, 3857)

# loading most recent inspections (failures only)
mostRecent_fail <- mostRecent_all[mostRecent_all$Results == "Fail", ]

2. Data and Exploratory Approach.

Past inspection data was first retrieved from the City of Chicago’s open data portal (access here). 311 service requests were also taken into consideration (access here) as an indicator of public health complaints. In addition, we used the tidycensus package to access American Community Survey (ACS) data for Cook county, where Chicago is located, in order to obtain socio-economic information at census-tract level.

As part of our exploratory analysis to assess relevant features to include in our model, we considered the potential spatial processes at work that could potentially influence the results of food inspections. For instance, pests and airborne diseases are rarely contained within single food establishments, as they are likely to also affect restaurants in the immediate vicinity. Furthermore, cleanliness of an establishment could vary depending on the type of clientele targeted: in this way, general socio-economic characteristics of the neighborhood that the establishment belongs to could influence general hygiene standards. Personal biases of food inspectors may also be spatial in nature. Instead of assessing each establishment objectively, the results of nearby restaurants could constitute reference points for inspectors; neighborhood characteristics could also result in preconceived prejudices against establishments in particular areas of the City.

To examine spatial patterns of past food inspections, we first created a fishnet grid (of size 500m wide) across the City, and plotted the count of failed establishments. Since records include results from numerous past inspections for some establishments, only results of the most recent inspection for each establishment were taken into account to prevent any overlaps. It is evident that failed inspections are clustered in northern areas of the city; in particular, there is a hotspot located along the eastern edge, where The Loop is located. This is a downtown neighborhood in Chicago containing the City’s Central Business District. As such, proximity to the centroid of The Loop was adopted as a feature in our model.

# joining failed food inspections to the fishnet
failed_net <- mostRecent_fail %>% dplyr::select() %>% mutate(countFailed = 1) %>% 
    aggregate(., chicagofishnet, sum) %>% mutate(countFailed = ifelse(is.na(countFailed), 
    0, countFailed), uniqueID = rownames(.))

# plotting failed food inspections
ggplot() + geom_sf(data = failed_net, lwd = 0.1, aes(fill = countFailed)) + scale_fill_gradient2(low = "white", 
    high = "red") + labs(title = "Failed Food Inspections (Chicago)") + mapTheme()

Counts of 311 requests for sanitation code complaints and rodent baiting requests were plotted in the fishnet as indicators of public perceptions of hygiene levels. Counts of crimes and affordable housing units were also plotted as proxies for neighborhood characteristics. There are clearly some overlapping hotspots for failed food inspections and the 311 requests in the Northern sections of the City. There are also some overlaps with crime spatial patterns, particularly close to The Loop. These variables are hence potentially relevent features that could be used as predictors for the model, thus they were not discarded. On the other hand, there are no obvious spatial similarities between failed food inspections and the presence of affordable housing unit, hence it was not included in our model.

# loading 311 sanitation code complaints
sanitationCode <- read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-No/rccf-5427") %>% 
    dplyr::select(Y = Latitude, X = Longitude) %>% na.omit() %>% st_as_sf(coords = c("X", 
    "Y"), crs = 4326, agr = "constant")

transformed_sanitationCode <- st_transform(sanitationCode, 3857)

sanitationCode_net <- transformed_sanitationCode %>% dplyr::select() %>% mutate(countSC = 1) %>% 
    aggregate(., chicagofishnet, sum) %>% mutate(countSC = ifelse(is.na(countSC), 
    0, countSC), uniqueID = rownames(.))

# loading 311 rodent baiting requests
rodentBaiting <- read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Rodent-Baiting-No-Duplicates/uqhs-j723") %>% 
    dplyr::select(Y = Latitude, X = Longitude) %>% na.omit() %>% st_as_sf(coords = c("X", 
    "Y"), crs = 4326, agr = "constant")

transformed_rodentBaiting <- st_transform(rodentBaiting, 3857)

rodentBaiting_net <- transformed_rodentBaiting %>% dplyr::select() %>% mutate(countRB = 1) %>% 
    aggregate(., chicagofishnet, sum) %>% mutate(countRB = ifelse(is.na(countRB), 
    0, countRB), uniqueID = rownames(.))

# loading 2017 crime data
crime <- read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2017/d62x-nvdr") %>% 
    dplyr::select(Y = Latitude, X = Longitude) %>% na.omit() %>% st_as_sf(coords = c("X", 
    "Y"), crs = 4326, agr = "constant")

transformed_crime <- st_transform(crime, 3857)

crime_net <- transformed_crime %>% dplyr::select() %>% mutate(countC = 1) %>% aggregate(., 
    chicagofishnet, sum) %>% mutate(countC = ifelse(is.na(countC), 0, countC), uniqueID = rownames(.))

# loading affordable rental units
affordableRental <- read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Affordable-Rental-Housing-Developments/s6ha-ppgi") %>% 
    dplyr::select(Y = latitude, X = longitude) %>% na.omit() %>% st_as_sf(coords = c("X", 
    "Y"), crs = 4326, agr = "constant")

transformed_rental <- st_transform(affordableRental, 3857)

rental_net <- transformed_rental %>% dplyr::select() %>% mutate(countR = 1) %>% aggregate(., 
    chicagofishnet, sum) %>% mutate(countR = ifelse(is.na(countR), 0, countR), uniqueID = rownames(.))

# plotting all
SC <- ggplot() + geom_sf(data = sanitationCode_net, lwd = 0.05, aes(fill = countSC)) + 
    scale_fill_gradient2(low = "white", high = "red") + labs(title = "Sanitation Code Complaints") + 
    mapTheme()

RB <- ggplot() + geom_sf(data = rodentBaiting_net, lwd = 0.05, aes(fill = countRB)) + 
    scale_fill_gradient2(low = "white", high = "red") + labs(title = "Rodent Baiting Requests") + 
    mapTheme()

C <- ggplot() + geom_sf(data = crime_net, lwd = 0.05, aes(fill = countC)) + scale_fill_gradient2(low = "white", 
    high = "red") + labs(title = "Crimes") + mapTheme()

R <- ggplot() + geom_sf(data = rental_net, lwd = 0.05, aes(fill = countR)) + scale_fill_gradient2(low = "white", 
    high = "red") + labs(title = "Affordable Rental Units") + mapTheme()

grid.arrange(SC, RB, C, R, nrow = 2, ncol = 2)

To assess if the outcome of an inspection is influenced by the results of surrounding establishments (i.e. spatial lag), 100m buffers were created around all dining establishments, and the counts of other failing establishments within these buffers were plotted. It is evident that eateries along the north (particularly along the north-eastern edge) of the city have higher counts of other failing establishments within their buffers; this spatial pattern coincides with the overall distribution of inspection failures across the city in general, hence spatial lag was taken into consideration as a feature within the model.

# creating 100m buffers around restaurants
Buffer <- mostRecent_all %>% st_buffer(100)

# counting number of restaurants within buffer
Buffer$failed_count <- lengths(st_intersects(Buffer, mostRecent_all))

# joining count of restaurants to original restaurants file
bufferCount <- left_join(mostRecent_all, dplyr::select(st_set_geometry(Buffer, NULL), 
    License.., failed_count), by = "License..")

# plotting number of restaurants within 100m buffer
ggplot() + geom_sf(data = bufferCount, aes(color = q5(failed_count), show.legend = "point"), 
    stroke = 10^(-10)) + geom_sf(data = chicago, fill = NA) + labs(title = "Failures within 100m Buffer of Each Restaurant") + 
    scale_color_manual(values = palette5, labels = qBr(bufferCount, "failed_count"), 
        name = "Count") + mapTheme()

We also examined the relationship between past outcomes and results of the latest inspection for each establishment. We divided our data into four groups: establishments that were inspected for the first time, establishments that had more failures than passes prior to the latest inspection, establishments that had more passes than failures prior to the latest inspection, and establishments that had no failures prior to the latest inspection. It is evident that eateries that were inspected for the first time had a much higher rate of failure (8.72%) than the other 3 groups. Past outcomes were hence incorporated as a predictive feature within our model.

# getting overall percent fail per license ID
pastPercentFail <- foodInspect %>% dplyr::select(Results, License..) %>% count(License.., 
    Results) %>% pivot_wider(id_cols = "License..", names_from = "Results", values_from = "n") %>% 
    mutate(Pass = replace_na(Pass, 0), Fail = replace_na(Fail, 0)) %>% inner_join(mostRecent_all) %>% 
    mutate(previousPass = ifelse(Results == "Pass", Pass - 1, Pass)) %>% mutate(previousFail = ifelse(Results == 
    "Fail", Fail - 1, Fail)) %>% mutate(TotalPrevious = previousPass + previousFail) %>% 
    mutate(percentPreviousFail = previousFail/TotalPrevious)

# categorize previous results
regroupedPastResult <- pastPercentFail %>% mutate(pastResults = ifelse(TotalPrevious == 
    0, "First Inspection", ifelse(percentPreviousFail > 0.5, "More Failures", ifelse(percentPreviousFail > 
    0, "More Passes", "No Failures")))) %>% mutate(firstTime = ifelse(TotalPrevious == 
    0, "Yes", "No"))


regroupedPastResult %>% ggplot(., aes(x = pastResults, fill = Results, color = Results)) + 
    geom_bar(stat = "Count", colour = "black") + scale_fill_manual(values = c("#FA7800", 
    "#FED8B1"))

PercentByPastResult <- regroupedPastResult %>% group_by(Results, pastResults) %>% 
    summarise(counts = n()) %>% spread(Results, counts) %>% mutate(total = Fail + 
    Pass) %>% mutate(percentFail = (Fail/total) * 100) %>% dplyr::select(pastResults, 
    percentFail) %>% pivot_wider(names_from = pastResults, values_from = percentFail)

kable(PercentByPastResult, caption = "Proportion of Failures by Past Outcomes (%)") %>% 
    kable_styling(full_width = F) %>% column_spec(1, background = "#FA7800", color = "black", 
    bold = T) %>% column_spec(2, background = "#FED8B1", color = "black", bold = F) %>% 
    column_spec(3, background = "#FED8B1", color = "black", bold = F) %>% column_spec(4, 
    background = "#FED8B1", color = "black", bold = F)

Proportion of Failures by Past Outcomes (%)
First Inspection	More Failures	More Passes	No Failures
8.72956	2.728111	1.984685	2.705768

Finally, we considered the duration between the latest 2 inspections as a potential factor influencing outcomes of the most recent inspection. Once again, we divided our data into four groups: establishments that were inspected for the first time, establishments that were inspected within a month prior to the latest inspection, establishments that were inspected within a year prior to the latest inspection, and establishments that were inspected more than a year prior to the latest inspection. It is evident that eateries that were inspected for the first time had a much higher rate of failure (8.93%) than the other 3 groups. Duration since last inspection was hence incorporated as a predictive feature within our model.

# get most recent result date
recentPrevious <- foodInspect %>% arrange(as.Date(Inspection.Date, "%m/%d/%Y")) %>% 
    group_by(License..) %>% slice(-n()) %>% group_by(License..) %>% slice(which.max(as.Date(Inspection.Date, 
    "%m/%d/%Y"))) %>% dplyr::select("License..", "Results", "Inspection.Date") %>% 
    rename(previousResult = Results)

# tabular join
previous_and_mostRecent <- left_join(mostRecent_all, recentPrevious, by = "License..")

previous_and_mostRecent$time_lag <- difftime(as.Date(previous_and_mostRecent$Inspection.Date.x, 
    "%m/%d/%Y"), as.Date(previous_and_mostRecent$Inspection.Date.y, "%m/%d/%Y"), 
    units = c("weeks"))
previous_and_mostRecent <- previous_and_mostRecent %>% mutate(time_lag = replace_na(time_lag, 
    0)) %>% mutate(time_since_last_check = ifelse(time_lag == 0, "First Inspection", 
    ifelse(time_lag < 4, "< 1 Month", ifelse(time_lag < 52, "1 Year", "> 1 Year"))))

previous_and_mostRecent %>% ggplot(., aes(x = time_since_last_check, fill = Results, 
    color = Results)) + geom_bar(stat = "count", colour = "black") + scale_fill_manual(values = c("#FA7800", 
    "#FED8B1"))

PercentByTimeLag <- previous_and_mostRecent %>% st_set_geometry(NULL) %>% group_by(Results, 
    time_since_last_check) %>% summarise(counts = n()) %>% spread(Results, counts) %>% 
    mutate(total = Fail + Pass) %>% mutate(percentFail = (Fail/total) * 100) %>% 
    dplyr::select(time_since_last_check, percentFail) %>% pivot_wider(names_from = time_since_last_check, 
    values_from = percentFail)

kable(PercentByTimeLag, caption = "Proportion of Failures by Duration Since Last Inspection (%)") %>% 
    kable_styling(full_width = F) %>% column_spec(1, background = "#FED8B1", color = "black", 
    bold = F) %>% column_spec(2, background = "#FED8B1", color = "black", bold = F) %>% 
    column_spec(3, background = "#FED8B1", color = "black", bold = F) %>% column_spec(4, 
    background = "#FA7800", color = "black", bold = T)

Proportion of Failures by Duration Since Last Inspection (%)
< 1 Month	> 1 Year	1 Year	First Inspection
1.332746	2.566829	2.950706	8.930348

3. Features.

Following our exploratory analysis, we decided to incorporate the following list of variables in our predictive model:

Distance between the grid cell a food establishment belongs to and centroid of The Loop;
Count of sanitation code complaints in grid cell that a food establishment belongs to;
Count of rodent baiting requests in grid cell that a food establishment belongs to;
Count of crimes in grid cell that a food establishment belongs to;
Count of other eateries that failed their last inspection within a 100m buffer of a food establishment;
Establishment type (i.e. restaurant or not);
Past inspection outcomes;
Duration since past inspection;
Neighborhood that a food establishment belongs to;
Percentage of white population in census tract that a food establishment belongs to;
Percentage of residents living in poverty in census tract that a food establishment belongs to; and
Median income of the census tract that a food establishment belongs to.

# JOINING FISHNET VARIABLES 311 variables
final_net <- left_join(sanitationCode_net, st_set_geometry(rodentBaiting_net, NULL), 
    by = "uniqueID")

final_net <- left_join(final_net, st_set_geometry(crime_net, NULL), by = "uniqueID")

# adding proximity to the loop
neighborhoods <- st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>% 
    st_transform(st_crs(chicagofishnet))

loopPoint <- neighborhoods %>% filter(name == "Loop") %>% st_centroid()

final_net$loopDistance = st_distance(st_centroid(final_net), loopPoint) %>% as.numeric()

transformed_net <- st_transform(final_net, 3857)

# performing spatial join with restaurants point file
variables = st_join(mostRecent_all, transformed_net)

# adding number of failed restaurants within 100m buffer
variables <- left_join(variables, dplyr::select(st_set_geometry(bufferCount, NULL), 
    License.., failed_count), by = "License..")

# adding past outcomes
variables <- merge(x = variables, y = regroupedPastResult[, c("pastResults", "License..")], 
    by = "License..", all.x = TRUE)

# adding duration since past inspection
variables <- left_join(variables, dplyr::select(st_set_geometry(previous_and_mostRecent, 
    NULL), License.., time_since_last_check), by = "License..")

# adding census variables - total pop, median income, white pop, NBELPOV
census_api_key("131eaa4664fa75f89806d6e6a481a61339ea5bb2")
install = TRUE

census_variables <- c("B00001_001E", "B06011_001E", "C02003_003E", "B06012_002E")
chicago_ACS_unpolished <- get_acs(geography = "tract", variables = census_variables, 
    state = "IL", county = "Cook", output = "wide", geometry = TRUE)

chicago_ACS <- chicago_ACS_unpolished %>% mutate(PctWhite = C02003_003E/B00001_001E, 
    PctPov = B06012_002E/B00001_001E) %>% rename(MedInc = B06011_001E)

chicago_ACS <- dplyr::select(chicago_ACS, -(contains("_")))
transformed_census <- st_transform(chicago_ACS, 3857)

variables = st_join(variables, transformed_census)

variables <- subset(variables, select = -c(GEOID))

# adding 'restaurant' classification
finding_Rest <- mostRecent_all %>% ungroup() %>% mutate(Restaurant = as.character(Facility.Type), 
    Restaurant = ifelse(str_detect(Restaurant, "(?i)resta(?-i)"), 1, 0))
variables <- left_join(variables, dplyr::select(st_set_geometry(finding_Rest, NULL), 
    License.., Restaurant), by = "License..")

# adding neighborhoods
chicago_nhoods <- st_read("chicago_nhoods.shp")
chicago_nhoods <- st_transform(chicago_nhoods, 3857)
variables = st_join(variables, chicago_nhoods)
variables = subset(variables, select = -c(shape_area, shape_len, sec_neigh))

variables <- na.omit(variables)

We plotted a correlation matrix with non-categorical variables to ensure no collinear factors were included within the model.

CorMat <- variables %>% st_set_geometry(NULL) %>% dplyr::select(-uniqueID, -Inspection.Date, 
    -Results, -License.., -pastResults, -time_since_last_check, -pri_neigh, -Facility.Type, 
    -Restaurant, -NAME)

M <- cor(CorMat)
corrplot(M, method = "color", col = palette5, addCoef.col = "black", tl.col = "black", 
    type = "lower", tl.srt = 45)

4. The Model.

4.1. Test and Training Set.

The data is split into a 70/30 training and test set. The model is run on the training set and this regressed model is then used to predict outcomes on the test set.

final_set <- variables %>% mutate(Fail = ifelse(Results == "Fail", 1, 0)) %>% st_set_geometry(NULL)

# splitting data
set.seed(3456)
trainIndex <- createDataPartition(y = paste(final_set$Fail, final_set$pri_neigh, 
    final_set$time_since_last_check, final_set$pastResults, final_set$Restaurant), 
    p = 0.7, list = FALSE, times = 1)
TrainSet <- final_set[trainIndex, ]
TestSet <- final_set[-trainIndex, ]

4.2. Logistic Regression Model.

A logistic regression model was estimated to predict the probability that each food establishment will pass an inspection. A summary of the model as well as the pseudo R-squared are printed below.

inspectionModel <- glm(TrainSet$Fail ~ ., data = TrainSet %>% dplyr::select(-uniqueID, 
    -Inspection.Date, -Results, -License.., -Fail, -Facility.Type, -NAME), family = binomial(link = "logit"))

summary <- as.data.frame(summary(inspectionModel)$coefficients)
names(summary)[names(summary) == "Pr(>|z|)"] <- "pvalue"
names(summary)[names(summary) == "Estimate"] <- "Estimated Coefficient for Predictor"

kable(summary) %>% kable_styling(bootstrap_options = c("striped", "hover"), fixed_thead = TRUE) %>% 
    row_spec(which(summary$pvalue < 0.05), bold = F, color = "white", background = "#FA7800") %>% 
    scroll_box(width = "100%", height = "300px")

	Estimated Coefficient for Predictor	Std. Error	z value	pvalue
(Intercept)	-6.4500817	0.6577053	-9.8069480	0.0000000
countSC	0.0012480	0.0006605	1.8895686	0.0588157
countRB	0.0002148	0.0003426	0.6269515	0.5306910
countC	0.0002951	0.0001841	1.6035828	0.1088060
loopDistance	0.0000325	0.0000290	1.1191343	0.2630829
failed_count	-0.0017591	0.0022554	-0.7799733	0.4354066
pastResultsMore Failures	1.8302684	0.2788755	6.5630296	0.0000000
pastResultsMore Passes	1.0902113	0.2696768	4.0426587	0.0000528
pastResultsNo Failures	1.1676857	0.2714833	4.3011328	0.0000170
time_since_last_check> 1 Year	1.1116934	0.1473919	7.5424297	0.0000000
time_since_last_check1 Year	1.2050407	0.1305420	9.2310560	0.0000000
time_since_last_checkFirst Inspection	3.4319209	0.2822389	12.1596304	0.0000000
MedInc	-0.0000042	0.0000045	-0.9263022	0.3542889
PctWhite	-0.0102750	0.0107409	-0.9566225	0.3387578
PctPov	-0.0006113	0.0214278	-0.0285261	0.9772426
Restaurant	-0.3369386	0.0680900	-4.9484297	0.0000007
pri_neighAndersonville	-0.8324202	1.0510155	-0.7920152	0.4283518
pri_neighArcher Heights	0.3481623	0.5193099	0.6704326	0.5025820
pri_neighArmour Square	1.1468937	0.6760444	1.6964769	0.0897956
pri_neighAshburn	-0.1544215	0.5405630	-0.2856679	0.7751325
pri_neighAuburn Gresham	0.5952676	0.3963502	1.5018727	0.1331300
pri_neighAustin	0.7880607	0.3495984	2.2541881	0.0241843
pri_neighAvalon Park	0.7327319	0.5812354	1.2606457	0.2074365
pri_neighAvondale	0.2826258	0.4379056	0.6454036	0.5186657
pri_neighBelmont Cragin	0.5364805	0.3620915	1.4816158	0.1384425
pri_neighBeverly	-0.5459091	0.8270968	-0.6600305	0.5092343
pri_neighBoystown	-0.6188676	1.0711071	-0.5777831	0.5634106
pri_neighBridgeport	1.0442146	0.4696263	2.2235011	0.0261820
pri_neighBrighton Park	1.1388361	0.4044701	2.8156248	0.0048682
pri_neighBucktown	0.8627946	0.5259103	1.6405738	0.1008859
pri_neighBurnside	-13.2001151	1696.4380397	-0.0077811	0.9937917
pri_neighCalumet Heights	0.3036962	0.6294151	0.4825054	0.6294470
pri_neighChatham	1.1779407	0.4054352	2.9053733	0.0036682
pri_neighChicago Lawn	0.5787476	0.3792768	1.5259241	0.1270288
pri_neighChinatown	0.4942165	0.6462022	0.7648016	0.4443897
pri_neighClearing	0.2111980	0.6783832	0.3113255	0.7555531
pri_neighDouglas	1.1896075	0.5566439	2.1371068	0.0325893
pri_neighDunning	0.7517766	0.4414263	1.7030627	0.0885563
pri_neighEast Side	-0.3955434	0.7395115	-0.5348712	0.5927389
pri_neighEast Village	0.7020522	0.6819247	1.0295157	0.3032374
pri_neighEdgewater	0.5464897	0.4009887	1.3628555	0.1729281
pri_neighEdison Park	-12.9898482	273.1392344	-0.0475576	0.9620688
pri_neighEnglewood	0.6318210	0.3683551	1.7152499	0.0862994
pri_neighFuller Park	1.2372966	0.6339539	1.9517138	0.0509722
pri_neighGage Park	0.7474838	0.4340499	1.7221148	0.0850487
pri_neighGalewood	1.1956098	0.5153785	2.3198675	0.0203480
pri_neighGarfield Park	0.8990715	0.4219238	2.1308860	0.0330985
pri_neighGarfield Ridge	0.5870836	0.4787084	1.2263910	0.2200516
pri_neighGold Coast	0.5508527	0.7774487	0.7085389	0.4786107
pri_neighGrand Boulevard	0.9182268	0.5150319	1.7828541	0.0746100
pri_neighGrand Crossing	1.0463484	0.3802742	2.7515629	0.0059312
pri_neighGrant Park	-12.6519464	791.0754140	-0.0159934	0.9872397
pri_neighGreektown	1.5187210	0.8177549	1.8571836	0.0632850
pri_neighHegewisch	0.4706400	0.7975067	0.5901393	0.5550973
pri_neighHermosa	0.5753904	0.5000405	1.1506877	0.2498608
pri_neighHumboldt Park	0.8509291	0.3904662	2.1792642	0.0293121
pri_neighHyde Park	0.4303364	0.4798954	0.8967296	0.3698632
pri_neighIrving Park	0.5522681	0.3690253	1.4965590	0.1345081
pri_neighJackson Park	-12.9654041	778.1362129	-0.0166621	0.9867062
pri_neighJefferson Park	-0.1137911	0.6087860	-0.1869147	0.8517275
pri_neighKenwood	0.2729991	0.8050330	0.3391154	0.7345228
pri_neighLake View	0.2999529	0.4182689	0.7171295	0.4732942
pri_neighLincoln Park	0.2420321	0.4878732	0.4960962	0.6198266
pri_neighLincoln Square	0.3665293	0.4153360	0.8824886	0.3775126
pri_neighLittle Italy, UIC	0.9050287	0.5365292	1.6868211	0.0916378
pri_neighLittle Village	1.0750242	0.3830847	2.8062315	0.0050125
pri_neighLogan Square	0.3189543	0.4065709	0.7844987	0.4327476
pri_neighLoop	0.2800160	0.5942655	0.4711968	0.6375002
pri_neighLower West Side	1.1786577	0.4646940	2.5364167	0.0111993
pri_neighMagnificent Mile	-1.2530525	1.1510502	-1.0886167	0.2763230
pri_neighMckinley Park	1.2375685	0.5083024	2.4347089	0.0149038
pri_neighMillenium Park	2.3362160	0.9924016	2.3541035	0.0185674
pri_neighMontclare	0.8287847	0.5750212	1.4413114	0.1494967
pri_neighMorgan Park	0.2763717	0.6337563	0.4360851	0.6627750
pri_neighMount Greenwood	-0.2295223	0.8602446	-0.2668106	0.7896150
pri_neighMuseum Campus	1.5827475	0.9184005	1.7233740	0.0848209
pri_neighNear South Side	0.5738926	0.6322133	0.9077516	0.3640095
pri_neighNew City	0.8690895	0.3955122	2.1973772	0.0279935
pri_neighNorth Center	0.3721914	0.4702068	0.7915484	0.4286241
pri_neighNorth Lawndale	0.9387365	0.4356362	2.1548635	0.0311725
pri_neighNorth Park	0.6757518	0.4742598	1.4248559	0.1541989
pri_neighNorwood Park	0.5145185	0.5406201	0.9517192	0.3412394
pri_neighOakland	-12.8746843	779.7725654	-0.0165108	0.9868269
pri_neighOld Town	1.1529615	0.5742687	2.0077038	0.0446748
pri_neighPortage Park	0.5457624	0.3871401	1.4097282	0.1586199
pri_neighPrinters Row	-12.5156031	430.3112407	-0.0290850	0.9767968
pri_neighPullman	-13.2415196	331.1389834	-0.0399878	0.9681029
pri_neighRiver North	0.2305131	0.5911093	0.3899670	0.6965610
pri_neighRiverdale	-13.1640759	524.5901463	-0.0250940	0.9799800
pri_neighRogers Park	0.4217797	0.3942600	1.0698008	0.2847090
pri_neighRoseland	0.4396180	0.5050763	0.8703990	0.3840824
pri_neighRush & Division	-0.0428855	0.7769515	-0.0551972	0.9559814
pri_neighSauganash,Forest Glen	1.2950399	0.5565174	2.3270430	0.0199630
pri_neighSheffield & DePaul	0.8504996	0.5683902	1.4963304	0.1345676
pri_neighSouth Chicago	1.0089740	0.4304376	2.3440660	0.0190748
pri_neighSouth Deering	0.5531488	0.5948863	0.9298394	0.3524543
pri_neighSouth Shore	1.1030593	0.3885349	2.8390228	0.0045252
pri_neighStreeterville	0.2749334	0.6493139	0.4234214	0.6719878
pri_neighUkrainian Village	0.6091677	0.6542211	0.9311342	0.3517841
pri_neighUnited Center	0.2238911	0.5863407	0.3818447	0.7025765
pri_neighUptown	0.1899183	0.4084889	0.4649289	0.6419824
pri_neighWashington Heights	0.3694993	0.5218740	0.7080239	0.4789304
pri_neighWashington Park	1.6103156	0.4895781	3.2891906	0.0010048
pri_neighWest Elsdon	0.8674430	0.5185024	1.6729777	0.0943317
pri_neighWest Lawn	0.1497712	0.4946913	0.3027569	0.7620751
pri_neighWest Loop	0.4605009	0.5967545	0.7716756	0.4403066
pri_neighWest Pullman	0.6590860	0.6126632	1.0757721	0.2820292
pri_neighWest Ridge	0.3884997	0.3745653	1.0372016	0.2996419
pri_neighWest Town	0.3231333	0.5606866	0.5763171	0.5644009
pri_neighWicker Park	0.3253613	0.5099153	0.6380692	0.5234287
pri_neighWoodlawn	0.6372544	0.4918369	1.2956620	0.1950920
pri_neighWrigleyville	-1.1818581	1.0662973	-1.1083758	0.2676996

pR2(inspectionModel)

##           llh       llhNull            G2      McFadden          r2ML 
## -3.958876e+03 -4.402831e+03  8.879104e+02  1.008340e-01  3.746176e-02 
##          r2CU 
##  1.188435e-01

5. Assessing Model Accuracy.

To assess the model created above, predictions were generated on the test set.

testProbs <- data.frame(Tract = TestSet$NAME, License = TestSet$License.., MedInc = TestSet$MedInc, 
    PctWhite = TestSet$PctWhite, Restaurant = TestSet$Restaurant, Outcome = as.factor(TestSet$Fail), 
    Probs = predict(inspectionModel, TestSet, type = "response"))

5.1. Density of Predicted Probabilities.

The following are histograms showing the density of predicted probabilities by observed outcomes. A well-performing model would result in the ‘hump’ for predicted probabilities of failing an inspection (‘1’) to be clustered around 1.0 on the x-axis, while the predicted probabilities of passing an inspection (‘0’) should be clustered around 0.0. It is clear that our model is good at predicting the probabilities of passing an inspection, but is relatively poor at predicting probabilities of failure.

ggplot(testProbs, aes(x = Probs, fill = as.factor(Outcome))) + geom_density() + facet_grid(Outcome ~ 
    .) + scale_fill_manual(values = palette2) + labs(x = "Probabilities of Failure", 
    y = "Density of probabilities", title = "Distribution of predicted probabilities by observed outcome") + 
    theme(strip.text.x = element_text(size = 18), legend.position = "none")

5.2. Determination of Probability Thresholds.

While cost-benefit analyses might often be used to identify an optimal threshold beyond which dining establishments are predicted to fail an inspection, we felt it is beyond our scope to put a number on the social and economic costs associated with, for instance, missing a high-risk food establishment and allowing a public health risk to fester. As such, in order to determine a probability threshold, we took the midpoint between the peaks of both density histograms.

Most establishments that passed had a predicted probability of 0.0153, while most establishments that failed had a predicted probability of 0.153. The threshold we used subsequently was hence calculated as (0.0153 + 0.153) / 2 = 0.084.

passes <- testProbs[testProbs$Outcome == "0", ]
fails <- testProbs[testProbs$Outcome == "1", ]

densMode <- function(x) {
    td <- density(x)
    maxDens <- which.max(td$y)
    list(x = td$x[maxDens], y = td$y[maxDens])
}

densMode(passes$Probs)

## $x
## [1] 0.0152725
## 
## $y
## [1] 20.83022

densMode(fails$Probs)

## $x
## [1] 0.153341
## 
## $y
## [1] 6.345045

testProbs <- testProbs %>% mutate(predOutcome = as.factor(ifelse(testProbs$Probs > 
    0.084, 1, 0)))

5.3. Confusion Matrix.

For further analysis of accuracy, a confusion matrix was created with the probability threshold of 0.117, meaning that any predicted probabiltiy exceeding this would result in a predicted outcome of Failure. ‘Reference’ refers to the actual outcome, while ‘Prediction’ refers to the predicted outcome. The interpretation of the confusion matrix is as follows:

There are 114 True Positives, where the dining establishment was predicted to fail a food inspection, and actually failed the food inspection;
There are 7263 True Negatives, where the dining establishment was predicted to pass a food inspection, and actually passed the food inspection;
There are 1119 False Positives, where the dining establishment was predicted to fail a food inspection, but actually passed the food inspection; and
There are 43 False Negatives, where the dining establishment was predicted to pass a food inspection, but actually failed the food inspection.

The model has an overall accuracy of 86.4%, which was computed by adding the number of True Positives to the number of True Negatives and dividing this sum by the total number of observations. Two other metrics, ‘sensitivity’ and ‘specificity’, were also computed. The former refers to the proportion of actual Failures that were predicted to be Failures (this is known as the ‘true positive rate’), while the latter refers to the proportion of actual Passes that were predicted to be Passes (this is known as the ‘true negative rate’). The model has a sensitivity of 72.6%, and a specificity of 86.7%. Consistent with our interpretation in section 5.1., this suggests the model is better at predicting Passes than Failures; this is unsuprising given that a large majority of the data has this outcome.

caret::confusionMatrix(testProbs$predOutcome, testProbs$Outcome, positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7263   43
##          1 1119  114
##                                              
##                Accuracy : 0.8639             
##                  95% CI : (0.8565, 0.8711)   
##     No Information Rate : 0.9816             
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.1358             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.72611            
##             Specificity : 0.86650            
##          Pos Pred Value : 0.09246            
##          Neg Pred Value : 0.99411            
##              Prevalence : 0.01839            
##          Detection Rate : 0.01335            
##    Detection Prevalence : 0.14440            
##       Balanced Accuracy : 0.79631            
##                                              
##        'Positive' Class : 1                  
##

5.4. ROC Curve.

A ROC curve was plotted to visualize trade-offs between False Positives and False Negatives. The y-axis of the ROC curve shows the rate of true positives for each threshold from 0.01 to 1, while the x-axis shows the rate of false positives for each threshold.

ggplot(testProbs, aes(d = as.numeric(testProbs$Outcome), m = Probs)) + geom_roc(n.cuts = 50, 
    labels = FALSE, colour = "#FA7800") + style_roc(theme = theme_grey) + geom_abline(slope = 1, 
    intercept = 0, size = 1.5, color = "grey") + labs(title = "ROC Curve - Food Inspections")

The Area Under the ROC Curve (AUC) was also obtained. The AUC value of 0.878 is greater than 0.5, and the ROC curve lies above the 45 degree line, indicating that the model is much better than a coin-flip.

auc(testProbs$Outcome, testProbs$Probs)

## Area under the curve: 0.8777

6. Assessing Model Generalizability.

6.1. K-Fold Cross Generalization.

K-fold cross validation (where k=100) was performed to test the generalizability of our model to unseen data. The three metrics output below for AUC, sensitivity and specificty are averages across all 100 folds.

final_set$Fail[final_set$Fail == "1"] <- "Fail"
final_set$Fail[final_set$Fail == "0"] <- "Pass"

ctrl <- trainControl(method = "cv", number = 100, classProbs = TRUE, summaryFunction = twoClassSummary)

cvFit <- train(Fail ~ ., data = final_set %>% dplyr::select(-uniqueID, -Inspection.Date, 
    -Results, -License.., -Facility.Type, -NAME), method = "glm", family = "binomial", 
    metric = "ROC", trControl = ctrl)

cvFit

## Generalized Linear Model 
## 
## 31794 samples
##    12 predictor
##     2 classes: 'Fail', 'Pass' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 31476, 31475, 31475, 31477, 31476, 31475, ... 
## Resampling results:
## 
##   ROC        Sens         Spec     
##   0.7448534  0.002435897  0.9997708

The following histograms plot AUC, sensitivity and specificity across the 100 folds. If the model was perfectly generalizable, the goodness of fit would be the same regardless of which observations showed up in a random fold. In this case, all three metrics are tightly distributed around the mean; this suggests that the model generalizes well to unseen data, particularly with respect to specificity and sensitivity.

However, this is likely because the default threshold set by the package is 0.5; this is much higher than the optimal threshold of 0.084 determined in section 5.2. It is clear from the histograms in section 5.2. that the probabilities of failure generated by the model fall mostly below 0.3; at a threshold of 0.5, all establishments will be predicted to Pass. Since the data largely comprises food establishments that passed their latest inspection, this explains why the specificity is so high (almost all folds at 1.0) and the sensitivity is so low (almost all folds at 0.0) across the 100 folds. As such, this may not be an accurate indication of model generalizability to unseen data; for a more representative idea, a much lower threshold (such as the one determined earlier in section 5.2.) should be used instead.

dplyr::select(cvFit$resample, -Resample) %>% gather(metric, value) %>% left_join(gather(cvFit$results[2:4], 
    metric, mean)) %>% ggplot(aes(value)) + geom_histogram(bins = 15, fill = "#fed8b1") + 
    facet_wrap(~metric) + geom_vline(aes(xintercept = mean), colour = "#FA7800", 
    linetype = 3, size = 1) + scale_x_continuous(limits = c(-0.1, 1.1)) + labs(x = "Goodness of Fit", 
    y = "Count", title = "CV Goodness of Fit Metrics", subtitle = "Across-fold mean reprented as dotted lines")

6.2. Generalizability across Space.

To examine if our model predicts well across urban space, we plotted the overall accuracy (defined as the number of correct predictions divided by the total number of predictions) within each neighborhood in the City. It is visually evident that the model predicts with reasonable accuracy in north-east and south-west Chicago, but loses predictive power in the north-western and south-eastern sections of the City. Our model is hence not generalizable across urban space.

# subtracting predicted outcome from actual outcome
testProbs$predOutcome <- as.numeric(as.character(testProbs$predOutcome))
testProbs$Outcome <- as.numeric(as.character(testProbs$Outcome))

testProbs$difference <- abs(testProbs$predOutcome - testProbs$Outcome)

# finding accuracy by neighborhood
grouped <- testProbs %>% group_by(Tract) %>% summarize(accuracy = 1 - (sum(difference)/n()))

# merging with tracts shapefile
names(chicago_ACS)[names(chicago_ACS) == "NAME"] <- "Tract"
grouped <- merge(x = chicago_ACS, y = grouped, by = "Tract", all.x = TRUE)

grouped <- na.omit(grouped)

# plotting accuracy by tract
ggplot() + geom_sf(data = grouped, aes(fill = q5(accuracy))) + scale_fill_manual(values = palette5, 
    labels = qBr(grouped, "accuracy", rnd = F), name = "Accuracy") + labs(title = "Overall Accuracy by Census Tract") + 
    mapTheme()

6.3. Generalizability across Socio-Economic Profiles.

The following figures map Chiacgo according to the racial and income profiles of each census tract. Tracts in which at least 51% of residents are white were designated “Majority White”, while tracts in which median incomes were greater than the citywide median (found on Google to be $55,295) were designated “High Income”. A visual comparison of these plots with the plot of overall accuracy above suggests there are some similarities in the spatial distributions of these socio-economic variables and that of overall accuracy. For instance, the High Income cluster located on the north-eastern edge of the City and the Majority White areas in the north have extremely high rates of accuracy. This implies the model may not generalize well across different socio-economic profiles, since its accuracy rates appear to be stratified by these characteristics.

# splitting majority white and high income
grouped$PctWhite[grouped$PctWhite > "50"] <- "Majority White"
grouped$PctWhite[grouped$PctWhite <= "50"] <- "Majority Non-White"

grouped$MedInc[grouped$MedInc > "55295"] <- "High Income"
grouped$MedInc[grouped$MedInc <= "55295"] <- "Low Income"

# plotting by race and by income
grid.arrange(
  ggplot() + 
    geom_sf(data = na.omit(grouped), aes(fill = PctWhite)) +
    scale_fill_manual(values = c("#fed8b1","#FA7800")) +
    labs(title = "Racial segregation") +
    mapTheme() + theme(legend.position="bottom"), 
  
  ggplot() + 
    geom_sf(data = na.omit(grouped), aes(fill = MedInc)) +
    scale_fill_manual(values = c("#fed8b1","#FA7800")) +
    labs(title = "Income segregation") +
    mapTheme() + theme(legend.position="bottom"), ncol = 2)

To further inspect the generalizability of our model across different socio-economic profiles, we computed the mean overall accuracy for each tract context. High Income tracts have a much higher mean overall accuracy of 94.9% relative to Low Income tracts, which have a mean overall accuracy of 82.1%. Similarly, Majority White tracts have a slightly higher mean overall accuracy of 87.8% as compared to Majority Non-White tracts, which have a lower mean overall accuracy of 82.1%. These findings are concordant with conclusions drawn above from visual inspection that Majority White and High Income tracts have higher accuracy rates relative to their counterparts.

According to the confusion matrix above, the model has a better hit-rate for predicting passes as compared to failures. The Negative Predictive Value is 0.994: for every 1000 passes predicted by the model, on average, 994 of these predictions will be correct. In contrast, the Positive Predictive Value is 0.092: for every 1000 failures predicted by the model, only 92 of these predictions will be correct. As such, Majority White and High Income tracts have higher overall accuracy since food establishments within them are more likely to pass food inspections. The model hence does not generalize well across different socio-economic profiles.

# finding mean overall accuracy across race
race <- grouped %>% group_by(PctWhite) %>% summarize(meanAccuracy = (mean(accuracy))) %>% 
    st_set_geometry(NULL)

names(race)[names(race) == "PctWhite"] <- "Tract Context"
names(race)[names(race) == "meanAccuracy"] <- "Mean Overall Accuracy"

kable(race, caption = "Mean Overall Accuracy of Tracts by Race") %>% kable_styling(full_width = F) %>% 
    column_spec(1, background = "#FA7800", color = "black", bold = F) %>% column_spec(2, 
    background = "#FED8B1", color = "black", bold = F)

Mean Overall Accuracy of Tracts by Race
Tract Context	Mean Overall Accuracy
Majority Non-White	0.8214062
Majority White	0.8777431

# finding mean overall accuracy across income
income <- grouped %>% group_by(MedInc) %>% summarize(meanAccuracy = (mean(accuracy))) %>% 
    st_set_geometry(NULL)

names(income)[names(income) == "MedInc"] <- "Tract Context"
names(income)[names(income) == "meanAccuracy"] <- "Mean Overall Accuracy"

kable(income, caption = "Mean Overall Accuracy of Tracts by Income") %>% kable_styling(full_width = F) %>% 
    column_spec(1, background = "#FA7800", color = "black", bold = F) %>% column_spec(2, 
    background = "#FED8B1", color = "black", bold = F)

Mean Overall Accuracy of Tracts by Income
Tract Context	Mean Overall Accuracy
High Income	0.9488910
Low Income	0.8211341

6.4. Generalizability across Establishment Types.

Finally, we examined the mean overall accuracy for both Restaurants and non-Restaurants to test if the model works better for particular kinds of food establishments. The model has a much higher overall accuracy of 92.9% for restaurants, compared to the lower overall accuracy of 76.6% for non-restaurants. This is likely due to similar reasons as discussed in the previous section: restaurants are more likely to pass food inspections than non-restaurants, which includes mobile food dispensers and gas stations. As such, the model also does not generalize well across establishment types.

establishment <- testProbs %>% group_by(Restaurant) %>% summarize(accuracy = (1 - 
    mean(difference)))

establishment$Restaurant[establishment$Restaurant == "1"] <- "Restaurant"
establishment$Restaurant[establishment$Restaurant == "0"] <- "Non-Restaurant"

names(establishment)[names(establishment) == "accuracy"] <- "Mean Overall Accuracy"

kable(establishment, caption = "Mean Overall Accuracy by Establishment Type") %>% 
    kable_styling(full_width = F) %>% column_spec(1, background = "#FA7800", color = "black", 
    bold = F) %>% column_spec(2, background = "#FED8B1", color = "black", bold = F)

Mean Overall Accuracy by Establishment Type
Restaurant	Mean Overall Accuracy
Non-Restaurant	0.7659824
Restaurant	0.9290310

7. Conclusion.

One limitation of this analysis is that the model is far better at identifying true negatives (passes) than true positives (failures). This is problematic especially since the goal of the regression is to identify food establishments with high likelihoods of failing inspections, so that inspectors can be dispatched there expediently. Another limitation would be the model’s lack of generalizability acrosss different socio-economic profiles and establishment types. The lower overall accuracy in Majority Non-White and Low Income tracts may cause discrimination in the form of unfairly targeted inspections within these areas, which may be disruptive to businesses in the region. In the same manner, non-restaurants may be unduly subjected to more frequent inspections than restaurants.

Nonetheless, our model adopts a data-driven approach in identifying high-risk food establishments to optimize allocation of limited health inspectors; this constitutes an improvement from the business-as-usual approach, which is to conduct checks in a random or arbitrary order. A more efficient approach is necessary given the health risks and social costs that result from exposure of the public to such unhygienic eateries. By determining which food establishments are more likely to fail health inspections with our predictive model, health inspectors can thereby focus their efforts on these establishments to minimize public health risks to the public.