library(tidyverse)
library(sf)
library(QuantPsyc)
library(RSocrata)
library(viridis)
library(caret)
library(spatstat)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
library(raster)
library(RColorBrewer)

census_api_key('131eaa4664fa75f89806d6e6a481a61339ea5bb2')
install=TRUE

mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
}

1. Introduction

This report aims to predict areas of high and low risk of sexual assaults in Chicago, by borrowing the locational experience of past accounts of sexual assaults in Chicago, to predict for areas where there might be a risk of sexual assault, but a lack of reporting. These predictions can help improve the effectiveness of hot-spot policing by identifying hot-spots for sexual assaults, allowing for better allocation of social services and police patrolling. The model used for analysis hopes to predict the explicit crime risk by modelling risk and protective factors whose exposure to affects the propensity for crime. These are factors that contribute to selection bias. The selection bias of sexual assaults is discussed more in section 2. A Poisson model is then used to predict the number of sexual assaults per grid cell area, and cross-validation is used to test the generalizability of the model across grid cells, neighborhoods and police districts. This is done in section 5. Section 6 then analyzes goodness of fit metrics for the model. Accuracy is less important in the following report as the goal is to predict latent risk, and there are likely areas that suffer a reporting bias, hence the observed count of assaults is low despite there actually being crimes committed. Instead, the model hopes to improve resource allocation of policing compared to the current approach of identifying hot spots based on current crime reports.

In the following analysis, a fishnet grid is used as the spatial structure for analysis and is created over the map of Chicago in the following code.

chicagoBoundary <- 
  st_read("C:/Users/leann/OneDrive/Desktop/MUSA507/Week10 - Risk Prediction/riskPrediction_data/chicagoBoundary.shp") %>%
  st_transform(crs=102271)
## Reading layer `chicagoBoundary' from data source `C:\Users\leann\OneDrive\Desktop\MUSA507\Week10 - Risk Prediction\riskPrediction_data\chicagoBoundary.shp' using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 1 field
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 341164.1 ymin: 552875.4 xmax: 367345.4 ymax: 594870
## epsg (SRID):    NA
## proj4string:    +proj=tmerc +lat_0=36.66666666666666 +lon_0=-88.33333333333333 +k=0.9999749999999999 +x_0=300000 +y_0=0 +ellps=GRS80 +units=m +no_defs
fishnet <- 
  st_make_grid(chicagoBoundary, cellsize = 1000) %>%
  st_sf() #makes a grid, each entry is a gridcell 
fishnet <- 
  fishnet[chicagoBoundary,] %>%
  mutate(uniqueID = rownames(.))%>%# use the row as ID
  dplyr::select(uniqueID) # makes ID the first column 

2. Exploratory Analysis

2.1. Sexual Assault Cases in Chicago, 2019

sexualassult <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2019/w98m-zvie") %>% 
  filter(Primary.Type == "CRIM SEXUAL ASSAULT") %>%
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),
         Y = as.numeric(Y)) %>% 
  na.omit %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
  st_transform(102271) %>% 
  distinct()

Sexual assault suffers from large selection bias in the form of a reporting bias. Most victims of sexual assault do not report the crime because of feelings of shame or because of a relationship with their assailant. Men who have been sexually assaulted are even less likely than women to report the crime. With such a high reporting bias, it is even more essential to be able to model the latent risk of crime. Furthermore, the risk of sexual assault is higher and cases are less reported in communities with a general tolerance of sexual assault and where there are weak community sanctions on perpetrators. Hence perpetrators of sexual assault might actually ‘select’ into areas with less social support services and police stations. It has also been found that is a correlation between between sexual assaults and domestic violence and substance abuse.

A report on what factors might increase the risk of sexual assault can be found here: https://www.inspq.qc.ca/en/sexual-assault/understanding-sexual-assault/risk-factors.

As an exploratory analysis of sexual assaults, a map of assault in 2019 is plotted. In the following sections, a fishnet grid was overlayed over the map of Chicago to be used as the spatial reference. Next, a histogram of assault was plotted to match the distribution of the crime to an appropriate regression model. Finally, the spatial structure of sexual assaults was also examined to identify important spatial features for the regression.

ggplot() + 
  geom_sf(data = chicagoBoundary) +
  geom_sf(data = sexualassult, colour="#CC3C3C", size=0.5, show.legend = "point") +
  labs(title= "Sexual Assults, Chicago - 2019") +
  mapTheme()

2.1.1. Fishnet Count of Assaults

As a measure of space, a fishnet grid was used. A grid cell of size 1000 ft was chosen as that resulted in a more reasonable count per grid cell. With a cell of size 500, too many entries with 0 counts and most grid cells had only 1-3 counts.

crime_net <- 
  sexualassult %>% 
  dplyr::select() %>% 
  mutate(countAssaults = 1) %>% 
  aggregate(., fishnet, sum) %>% # joins assaults to net
  mutate(countAssaults = ifelse(is.na(countAssaults), 0, countAssaults),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet)/6.55), size=nrow(fishnet), replace = TRUE))
ggplot() +
  geom_sf(data = crime_net, aes(fill = countAssaults)) +
  scale_fill_viridis(option="C") +
  labs(title = "Count of Sexual Assaults (2019) for the fishnet") +
  mapTheme()

2.1.2. Histogram of Assaults per Grid Cell

ggplot(crime_net, aes(x=countAssaults)) + 
  geom_histogram(binwidth=1, color="black", fill="#CC3C3C")+
  labs(title="Histogram of Sexual Assaults in Cells, Chicago - 2019",x="Number of Sexual Assaults in a Grid Cell", y = "Frequency")

Given this skewed distribution, a Poisson regression was appropriate for the following analysis.

2.1.3. Spatial Strucutre of Assaults

To explore the hypothesis that crimes suffer from a selection bias that results in a clustering of crimes into ‘hot-spots’ and ‘cold-spots’, the Local Moran’s I for each grid cell was calculated using a ‘Queen’ weights matrix. The null hypothesis of the Local Moran’s I is that counts of assaults in each grid cell is randomly distributed relative to its immediate neighbors in the Queen configuration. Thus, p-values less than 0.05 indicate significant clustering of sexual assaults, and these are indicated in right most plot below.

# Queen Contiguity
crime_net.nb <- poly2nb(crime_net, queen=TRUE)
crime_net.weights <- nb2listw(crime_net.nb, style="W", zero.policy=TRUE)

# local Moran's I

#moranI <-localmoran(crime_net$countAssaults, final_net.weights) # returns 5 statistics for each cell 

crime_net.localMorans <- 
  cbind(
    as.data.frame(localmoran(crime_net$countAssaults, crime_net.weights)), # results
    as.data.frame(crime_net, NULL)) %>% 
    st_sf() %>%
    dplyr::select(Assault_Count = countAssaults, # create the variablees
                  Local_Morans_I = Ii, 
                  P_Value = `Pr(z > 0)`) %>%
    mutate(Significant_Hotspots = ifelse(P_Value <= 0.05, 1, 0)) %>%
    gather(Variable, Value, -geometry) # gathers by the geometry; variables were col names

vars <- unique(crime_net.localMorans$Variable) # the variables created in dataframe
varList <- list() # create empty list

# loop over the 4 indicators to print
for(i in vars){
  varList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(crime_net.localMorans, Variable == i), aes(fill = Value), colour=NA) +
      scale_fill_viridis(option="C", name="") +
      labs(title=i) +
      mapTheme()}

do.call(grid.arrange,c(varList, ncol = 4, top = "Local Morans I statistics, Sexual Assault"))

# do.call executes function grid.arrange on arguments the variables

The results of calculating the Local Moran’s I on sexual assaults in Chicago identify the significant hot spots, and these are the areas in yellow in the right-most plot. This clustering could help predict the risk of crime in a grid cell by observing the counts in its neighboring grid cells. The engineering of a feature that captures this spatial structure is calculated in section 3.3.

2.2. Risk Factors (Predictors)

As explained in section 2.1., selection bias is significant in the case of sexual assaults. Firstly, there is the reporting bias and this is hard to model. Secondly, there is the selection bias of perpetrators. These are both environmental risk and demographic risk. Past research shows that there is a correlation between domestic violence and substance abuse and the incidence of sexual assaults. Hence these were used as predictors. Third, there are also protective infrastructure that causes perpetrators to ‘select out’ of the areas. Hence police stations and family and support services were added as features.

  • Environmental Risks
    • Abandoned cars
    • Abandoned Buildings
    • Liquor Retail
    • Graffiti
    • Street lights not working
    • Sanitation Complaints
  • Demographic at risk
    • Substance Abuse
    • Domestic Violence
  • Protective Measures
    • Police Stations
    • Family and Support Services

Each of the above factors were plotted on a map of Chicago. Using these variables, features were engineered for the regression in section 3.

Environmental Risk Factors

abandonCars <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Abandoned-Vehicles/3c9v-pnva") %>%
  mutate(year = substr(creation_date,1,4)) %>%
  filter(year == "2019") %>%
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Abandoned_Cars_2019")
  
abandonBuildings <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
    mutate(year = substr(date_service_request_was_received,1,4)) %>%
    filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Abandoned_Buildings_2018")

graffiti <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-Historical/hec5-y4x5") %>%
    mutate(year = substr(creation_date,1,4)) %>%
    filter(year == "2018") %>%
    filter(where_is_the_graffiti_located_ == "Front" |
           where_is_the_graffiti_located_ == "Rear" | where_is_the_graffiti_located_ == "Side") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Graffiti_2018")

streetLightsOut <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
    mutate(year = substr(creation_date,1,4)) %>%
    filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Street_Lights_Out_2018")

sanitation <-
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") %>%
    mutate(year = substr(creation_date,1,4)) %>%
    filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Sanitation_2018")

liquorRetail <- 
  read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses-Cur   rent-Liquor-and-Public-Places/nrmj-3kcf") %>%
  filter(BUSINESS.ACTIVITY == "Retail Sales of Packaged Liquor") %>%
  dplyr::select(Y = LATITUDE, X = LONGITUDE) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Liquor_Retail") 

neighborhoods <- 
  st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
  st_transform(st_crs(fishnet)) 
## Reading layer `chicago' from data source `https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson' using driver `GeoJSON'
## Simple feature collection with 98 features and 4 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -87.94011 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
ggplot() +
  geom_sf(data=chicagoBoundary) +
  geom_sf(data = rbind(abandonCars,streetLightsOut,abandonBuildings,
                      liquorRetail, graffiti, sanitation),
         size = .1) +
  facet_wrap(~Legend, ncol = 3) +
  labs(title = "Environmental Risk Factors") +  
  mapTheme()

Demographic at Risk Factors

substanceAbuse <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2019/w98m-zvie") %>% 
  filter(Primary.Type == "NARCOTICS") %>%
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),
         Y = as.numeric(Y)) %>% 
  na.omit %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
  st_transform(102271) %>% 
  distinct()%>%
  mutate(Legend = "Substance_Abuse_2019")

domesticViolence <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2019/w98m-zvie") %>% 
  filter(Primary.Type == "BATTERY",
         Description == "DOMESTIC BATTERY SIMPLE") %>%
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),
         Y = as.numeric(Y)) %>% 
  na.omit %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
  st_transform(102271) %>% 
  distinct() %>%
  mutate(Legend = "Domestic_Violence_2019")

domesticViolence<-domesticViolence[-c(9402),]
ggplot() +
  geom_sf(data=chicagoBoundary)+
  geom_sf(data = rbind(substanceAbuse, domesticViolence),
         size = 0.1,
         colour="#CC3C3C") +
  facet_wrap(~Legend, ncol = 2) +
  labs(title = "Demographic at Risk Factors") +  
  mapTheme()

Protective Factors

policeStations <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Police-Stations-Map/gkur-vufi") %>% 
  mutate(x = gsub("[()]", "", LOCATION)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  na.omit %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
  st_transform(102271) %>% 
  mutate(Legend = "Police_Stations") %>%
  dplyr:: select(geometry, Legend)

supportServices<-
  read.socrata("https://data.cityofchicago.org/Health-Human-Services/Family-and-Support-Services-Delegate-Agencies-Map/xtdn-h9nk") %>% 
  filter(Division=="Domestic Violence")%>%
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),
         Y = as.numeric(Y)) %>% 
  na.omit %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
  st_transform(102271) %>% 
  mutate(Legend = "Support_Services") %>%
  dplyr:: select(geometry, Legend)
ggplot() +
  geom_sf(data=chicagoBoundary) +
  geom_sf(data = rbind(policeStations, supportServices),
         size = 5,
         colour="#10bfea") +
  facet_wrap(~Legend, ncol = 2) +
  labs(title = "Environmental Protective Factors") +  
  mapTheme()

3. Feature Engineering

To engineer the risk factors described in section 2.2, two approaches were taken. First, a count of each factor per grid cell was calculated. Second, the average distance of k-nearest neighbors was calculated for each of the factors. For environmental risk, a k of 3 was used; for demographic at risk, a k of 5 was used, and for the protective facilities, a k of 1 was used.

3.1. Count Per Grid Cell

vars_net <- 
  rbind(abandonCars,streetLightsOut,abandonBuildings,
        liquorRetail, graffiti, sanitation,
        substanceAbuse, domesticViolence,
        policeStations, supportServices) %>%
  st_join(., fishnet, join=st_within) %>%
  st_set_geometry(NULL) %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>%
  full_join(fishnet) %>%
  spread(Legend, count, fill=0) %>%
  st_sf() %>%
  dplyr::select(-`<NA>`) %>%
  na.omit()

vars_net
## Simple feature collection with 656 features and 11 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 341164.1 ymin: 552875.4 xmax: 368164.1 ymax: 594875.4
## epsg (SRID):    NA
## proj4string:    +proj=tmerc +lat_0=36.66666666666666 +lon_0=-88.33333333333333 +k=0.9999749999999999 +x_0=300000 +y_0=0 +ellps=GRS80 +units=m +no_defs
## # A tibble: 656 x 12
## # Groups:   uniqueID [656]
##    uniqueID                  geometry Abandoned_Build~ Abandoned_Cars_~
##    <chr>                <POLYGON [m]>            <dbl>            <dbl>
##  1 1        ((359164.1 552875.4, 360~                8                8
##  2 10       ((356164.1 553875.4, 357~                0                0
##  3 100      ((360164.1 558875.4, 361~               12               19
##  4 101      ((361164.1 558875.4, 362~                0                1
##  5 102      ((362164.1 558875.4, 363~                0               12
##  6 103      ((363164.1 558875.4, 364~                2                7
##  7 104      ((364164.1 558875.4, 365~                4               23
##  8 105      ((365164.1 558875.4, 366~                3               21
##  9 106      ((366164.1 558875.4, 367~                8               89
## 10 107      ((367164.1 558875.4, 368~                2                5
## # ... with 646 more rows, and 8 more variables:
## #   Domestic_Violence_2019 <dbl>, Graffiti_2018 <dbl>,
## #   Liquor_Retail <dbl>, Police_Stations <dbl>, Sanitation_2018 <dbl>,
## #   Street_Lights_Out_2018 <dbl>, Substance_Abuse_2019 <dbl>,
## #   Support_Services <dbl>
vars_net.long <- 
  vars_net %>%
  gather(Variable, value, -geometry, -uniqueID)

vars <- unique(vars_net.long$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
      scale_fill_viridis(option="C", name="") +
      labs(title=i) +
      mapTheme()}

do.call(grid.arrange,c(mapList, ncol =3, top = "Risk Factors by Fishnet"))

In the above plots, brighter areas indicate higher counts of the factor. Based on a general instinct, it is predicted that areas with higher counts of all the factors except police stations and support services will have a higher count of sexual assaults. This relationship is explored via correlation scatter-plots in section 4, feature selection.

3.2. Nearest Neighbours

Given that the grid cells were generated rather arbitrarily, Modifiable Areal Unit Problems (MAUP) could arise. A perhaps more useful feature would be one that uses the nearest neighbors function than the count per grid cell. The average distance to nearest neighbors could provide a better understanding of local exposure to risk factors. The distance is calculated relative to each grid cell which was converted to a point for the purpose of these calculations.

The nearest neighbor function is presented in this code block.

nn_function <- function(measureFrom,measureTo,k) {
  measureFrom_Matrix <-
    as.matrix(measureFrom)
  measureTo_Matrix <-
    as.matrix(measureTo)
  nn <-   
    get.knnx(measureTo, measureFrom, k)$nn.dist
    output <-
      as.data.frame(nn) %>%
      rownames_to_column(var = "thisPoint") %>%
      gather(points, point_distance, V1:ncol(.)) %>%
      arrange(as.numeric(thisPoint)) %>%
      group_by(thisPoint) %>%
      summarize(pointDistance = mean(point_distance)) %>%
      arrange(as.numeric(thisPoint)) %>% 
      dplyr::select(-thisPoint) %>%
      pull()
  
  return(output)  
}

For the environmental risk factors, the average distance was calculated between each grid cell and the 3 nearest occurrences of the feature to itself.

# environmental risk 
vars_net$Abandoned_Buildings.nn =
    nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(abandonBuildings), 3)
    
vars_net$Abandoned_Cars.nn =
    nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(abandonCars), 3)
    
vars_net$Graffiti.nn =
    nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(graffiti), 3)
    
vars_net$Liquor_Retail.nn =
    nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(liquorRetail), 3)

vars_net$Street_Lights_Out.nn =
    nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(streetLightsOut), 3)
    
vars_net$Sanitation.nn =
    nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(sanitation), 3)

For the demographic at risk (domestic violence and substance abuse), the average distance was calculated between each grid cell and the 5 nearest incidents to itself.

# risk demographic
vars_net$substanceAbuse.nn =
    nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(substanceAbuse), 5)

vars_net$domesticViolence.nn =
    nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(domesticViolence), 5)

Finally, for the protective features, since the number was low, just the distance between each grid cell and its nearest facility was calculated.

vars_net$policeStation.nn =
    nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(policeStations), 1)

vars_net$supportServices.nn =
    nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(supportServices), 1)

3.3. Significant Hotspots

Based on the Local Moran’s I calculated in 2.1.3, it was clear that the counts of sexual assaults of the neighboring grid cells had an effect on the count in its own grid cell. Thus, a useful feature in predicting latent risk might be the distance to significant hot spots. Instead of using a p-value of 0.05 as was done in 2.1.3, an even more precise measure of p=0.0000001 was used as a threshold in the following feature engineering process. Then, the distance of each grid-cell to the nearest significant grid-cell was measured for each cell. If a cell has a distance of 0, this indicates that itself was a significant hot spot.

crime_net <-
  crime_net %>% 
  mutate(sexualAssault.isSig = ifelse(localmoran(crime_net$countAssaults, 
                                            crime_net.weights)[,5] <= 0.0000001, 1, 0)) %>%
  mutate(sexualAssault.isSig.dist = nn_function(st_coordinates(st_centroid(crime_net)),
                                           st_coordinates(st_centroid(
                                             filter(crime_net, sexualAssault.isSig == 1))), 1 ))

ggplot() + 
  geom_sf(data = crime_net, aes(fill = sexualAssault.isSig.dist)) +
  scale_fill_viridis(option="C") +
  labs(title = "Distance to highly significant local Sexual Assault hotspots") +
  mapTheme()

From the above plot, one can identify roughly 4 areas that are hot spots and how the distance dissipates outwards rather linearly. This is an assumption about the clustering process that could be changed to improve the predictive power.

3.4. Distance to Center City

Finally, the distance to Chicago’s Central Business District is also calculated as a possible feature.

loopPoint <-
  neighborhoods %>%
  filter(name == "Loop") %>%
  st_centroid()

vars_net$loopDistance =
  st_distance(st_centroid(vars_net),loopPoint) %>%
  as.numeric() 

ggplot() +
  geom_sf(data=vars_net, aes(fill=loopDistance)) +
  scale_fill_viridis(option="C") +
  labs(title="Euclidean distance to The Loop") +
  mapTheme() 

3.5. Final Net

Having engineered features, a final geodataframe is created with all the features listed above. A spatial join is also done to add the neighborhood and police district of each grid cell to the final dataframe. This will later be useful in testing the generalizability of the model across different neighborhoods.

final_net <-
  left_join(crime_net, st_set_geometry(vars_net, NULL), by="uniqueID") 
policeDistricts <- 
  st_read("C:/Users/leann/OneDrive/Desktop/MUSA507/Week10 - Risk Prediction/riskPrediction_data/police_district.shp") %>%
  st_transform(crs=102271)
## Reading layer `police_district' from data source `C:\Users\leann\OneDrive\Desktop\MUSA507\Week10 - Risk Prediction\riskPrediction_data\police_district.shp' using driver `ESRI Shapefile'
## Simple feature collection with 25 features and 2 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
## epsg (SRID):    4326
## proj4string:    +proj=longlat +ellps=WGS84 +no_defs
final_net <-
  st_centroid(final_net) %>%
    st_join(., dplyr::select(neighborhoods, name)) %>%
    st_join(., dplyr::select(policeDistricts, dist_label)) %>%
      st_set_geometry(NULL) %>%
      left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
      st_sf() %>%
  na.omit() 

dplyr::select(final_net, name, dist_label) %>%
  gather(Variable, Value, -geometry) %>%
    ggplot() +
      geom_sf(aes(fill = Value)) +
      facet_wrap(~Variable) +
      scale_fill_viridis(discrete = TRUE, option="C") +
      labs(title = "Aggregate Areas") +
      mapTheme() + theme(legend.position = "none")

4. Feature Selection

To decide which features would have the most predictive power for the model, as well as to check for collinearity between variables, a correlation scatterplot is created. Each plot shows the relationship between each variable and the count of sexual assaults in each grid cell.

correlation.long <-
  st_set_geometry(final_net, NULL) %>%
    dplyr::select(-uniqueID, -cvID, -loopDistance, -name, -dist_label) %>%
    gather(Variable, Value, -countAssaults)

correlation.cor <-
  correlation.long %>%
    group_by(Variable) %>%
    summarize(correlation = cor(Value, countAssaults, use = "complete.obs"))
    
ggplot(correlation.long, aes(Value, countAssaults)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "#a2d7d8") +
  facet_wrap(~Variable, ncol = 2, scales = "free") +
  labs(title = "Sexual Assault count as a function of risk factors")

Collinearity between the counts and nearest-neighbors exist if one of them has a very low significance. For such pairs, only the variable with the higher significance was used. The above correlation plots indicate a very strong relationship between domestic violence and substance abuse with sexual assaults. It also appears that the presence of liquor retail and poorer sanitation increases the count of sexual assault. Surprisingly, there does not seem to be any correlation between the number or distance to police stations to the count of sexual assaults.

5. The Model

Given the distribution of the histogram in 2.1.2, it was concluded that a Poisson Model be best to predict the count of sexual assaults per grid cell.

Two regression models are created, one including the spatial significance factors and the other without. In section 6, we analyze the results of cross-validation and errors to determine the better model.

# just risk factors
reg.vars <- c("Abandoned_Buildings_2018", "Abandoned_Cars.nn", "Domestic_Violence_2019", "Graffiti.nn", 
              "Liquor_Retail", "policeStation.nn",  "Street_Lights_Out_2018", "Sanitation_2018", "Substance_Abuse_2019", "supportServices.nn", "loopDistance")

# risk factors with spatial structure 
reg.ss.vars <- c("Abandoned_Buildings_2018", "Abandoned_Cars.nn", "Domestic_Violence_2019", "Graffiti.nn", 
              "Liquor_Retail", "policeStation.nn",  "Street_Lights_Out_2018", "Sanitation_2018", "Substance_Abuse_2019", "supportServices.nn", "loopDistance", 
                 "sexualAssault.isSig", "sexualAssault.isSig.dist")

Cross Validation

Cross Validation tests are run on the model to test the generalizability of the model at both a Citywide and local scale. Geospatial risks are purely spatial in nature, hence spatial cross-validation is useful to gauge generalizability of the model. The Leave-One-Group-Out Cross Validation (LOGO-CV) is used as this is the best way to test if the model is generalizable to both the Citywide and Local scale.

To make a comparison, both random k-fold and LOGO CV are conducted. For LOGO CV, two types of cross-validations are conducted; one for the neighborhood level, and the other for the police district level. In total, 6 total regressions and cross validations are done; using the 2 regression models and the 3 methods of cross validation for each. All the results are analyzed in section 6.

crossValidate <- function(dataset, id, dependentVariable, indVariables) {

allPredictions <- data.frame()
cvID_list <- unique(dataset[[id]])

for (i in cvID_list) {

  thisFold <- i
  cat("This hold out fold is", thisFold, "\n")

  fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>% 
                dplyr::select(id, geometry, indVariables, dependentVariable)
  fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>% 
                dplyr::select(id, geometry, indVariables, dependentVariable)
  
  regression <-
    glm(countAssaults ~ ., family = "poisson", 
      data = fold.train %>% 
      dplyr::select(-geometry, -id))
  
  thisPrediction <- 
    mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
    
  allPredictions <-
    rbind(allPredictions, thisPrediction)
    
  }
  return(st_sf(allPredictions))
}

Random K-Fold Cross Validation

For the following cross validation, cvID is used as the id to choose each grid cell by. Since 100 different IDs were generated when creating the fishnet, 100-fold cross validation is conducted.

reg.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countAssaults",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = cvID, countAssaults, Prediction, geometry)

reg.ss.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countAssaults",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = cvID, countAssaults, Prediction, geometry)

Spatial Cross Validation using Neighborhoods

At each step, one neighborhood is left out as the test set while the regression is run on the other neighborhoods. This is repeated for each neighborhood.

reg.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countAssaults",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = name, countAssaults, Prediction, geometry)

reg.ss.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countAssaults",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = name, countAssaults, Prediction, geometry)

Spatial Cross Validation using Police Districts

Similarly to the CV with neighborhoods, at each step one police district is taken out to be used as the test set.

reg.spatialCV.police <- crossValidate(
  dataset = final_net,
  id = "dist_label",
  dependentVariable = "countAssaults",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = dist_label, countAssaults, Prediction, geometry)

reg.ss.spatialCV.police <- crossValidate(
  dataset = final_net,
  id = "dist_label",
  dependentVariable = "countAssaults",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = dist_label, countAssaults, Prediction, geometry)

6. Goodness of Fit Measures

6.1. Accuracy and Generalizability

reg.summary <- 
  rbind(
    mutate(reg.cv,           Error = countAssaults - Prediction,
                             Regression = "Random k-fold CV: Just Risk Factors"),
                             
    mutate(reg.ss.cv,        Error = countAssaults - Prediction,
                             Regression = "Random k-fold CV: Spatial Structure"),
    
    mutate(reg.spatialCV,    Error = countAssaults - Prediction,
                             Regression = "Spatial LOGO-CV (Nhoods): Just Risk Factors"),
                             
    mutate(reg.ss.spatialCV, Error = countAssaults - Prediction,
                             Regression = "Spatial LOGO-CV (Nhoods): Spatial Structure"),
    
    mutate(reg.spatialCV.police,    Error = countAssaults - Prediction,
                             Regression = "Spatial LOGO-CV (Po Districts): Just Risk Factors"),
                             
    mutate(reg.ss.spatialCV.police, Error = countAssaults - Prediction,
                             Regression = "Spatial LOGO-CV (Po Districts): Spatial Structure")) %>%
    st_sf() 

6.1.1. Predicted Counts of Sexual Assault

On the left hand side, the results from the 6 regressions in section 5 are plotted, and can be compared to the observed counts on the right hand side. The scales of all plots have been standardized to allow for easy comparison.

grid.arrange(
  reg.summary %>%
    ggplot() +
      geom_sf(aes(fill = Prediction)) +
      facet_wrap(~Regression) +
      scale_fill_viridis(limits=c(0,32), option="C") +
      labs(title = "Predicted sexual Assaults by Regression") +
      mapTheme() + theme(legend.position="bottom"),

  filter(reg.summary, Regression == "Random k-fold CV: Just Risk Factors") %>%
    ggplot() +
      geom_sf(aes(fill = countAssaults)) +
      scale_fill_viridis(limits = c(0,32), option="C") +
      labs(title = "Observed Sexual Assaults\n") +
      mapTheme() + theme(legend.position="bottom"), ncol = 2)

Based on a visual comparison of predicted counts, it seems that the model is fairly generalizable across neighborhoods and police districts since the predictions are similar to the random k-fold cross validation. Rather surprisingly, the model with the spatial features (isSig and dist.isSig) is not picking up the hot spot that is being picked up by the model without this spatial features.

6.1.2. Raw Errors for each Regression

For each grid cell, the raw error is calculated as the observed count minus the predicted count. Hence a negative error indicates an overprediction, while a positive error indicates underprediction.

reg.summary %>%
  ggplot() +
    geom_sf(aes(fill = Error)) +
    facet_wrap(~Regression) +
    scale_fill_viridis(option="C") +
    labs(title = "Errors in Predicted Cout of Sexual Assault by Regression") +
    mapTheme()

The model predicts well in most grid cells, however it is severely overpredicting counts of sexual assault in a particular grid cell. This could be due to that cell’s proximity to a hot spot that was identified in section 2.1.3 when calculating the local Moran’s I.

6.1.3. MAE by Regression

st_set_geometry(reg.summary, NULL) %>%
  group_by(Regression) %>% 
  summarize(MAE = round(mean(abs(Error), na.rm = T),2),
            SD_MAE = round(sd(abs(Error), na.rm = T),2)) %>% 
  kable(caption = "MAE by regression") %>%
    kable_styling("striped", full_width = F) %>%
    row_spec(2, color = "black", background = "#FDE725FF") %>%
    row_spec(4, color = "black", background = "#FDE725FF") %>%
    row_spec(6, color = "black", background = "#FDE725FF")
MAE by regression
Regression MAE SD_MAE
Random k-fold CV: Just Risk Factors 1.25 1.60
Random k-fold CV: Spatial Structure 1.19 1.30
Spatial LOGO-CV (Nhoods): Just Risk Factors 1.28 1.61
Spatial LOGO-CV (Nhoods): Spatial Structure 1.22 1.33
Spatial LOGO-CV (Po Districts): Just Risk Factors 1.33 1.67
Spatial LOGO-CV (Po Districts): Spatial Structure 1.26 1.40

The mean absolute errors are much lower when predicting for a grid cell using the spatial structure, and there is a smaller standard deviation, indicating that there the model is better generalizable with the spatial structures.

6.1.4. Deciles

The set of cells were split into 10 deciles depending on the cell’s count of sexual assaults. Then for each decile, the mean of the observed count and predicted count for the cells in that decile were calculated and plotted for each regression type.

st_set_geometry(reg.summary, NULL) %>%
  group_by(Regression) %>%
    mutate(assault_Decile = ntile(countAssaults, 10)) %>%
  group_by(Regression, assault_Decile) %>%
    summarize(meanObserved = mean(countAssaults, na.rm=T),
              meanPrediction = mean(Prediction, na.rm=T)) %>%
    gather(Variable, Value, -Regression, -assault_Decile) %>%          
    ggplot(aes(assault_Decile, Value, shape = Variable)) +
      geom_point(size = 2) + geom_path(aes(group = assault_Decile), colour = "black") +
      scale_shape_manual(values = c(2, 17)) +
      facet_wrap(~Regression) + xlim(0,10) +
      labs(title = "Predicted and observed sexual assaults by observed assault decile")

In the 10th decile, regressions with the spatial structure predict higher counts of crime on average compared to the regressions without. This implies that the regression with the spatial structure is better at capturing hot spots. The models also overpredict for the 1st to 5th deciles (cold spots), which might indicate that the model is identifying some latent risks of sexual assault not matched by the number of reports.

6.2. Generalizability by Neighborhood

Due to systematic over-policing of neighborhoods with minority races, it is imperative to check the model for any biases it might have been neighborhoods with different racial context.

A map of neighborhoods based on racial context in Chicago is plotted below, with a corresponding table of mean errors for grid cells classified by the neighborhood context.

tracts17 <- 
  get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E"), 
          year = 2017, state=17, county=031, geometry=T) %>%
  st_transform(102271)  %>% 
  dplyr::select(variable, estimate, GEOID) %>%
  spread(variable, estimate) %>%
  rename(TotalPop = B01001_001,
         NumberWhites = B01001A_001) %>%
  mutate(percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White")) %>%
  .[neighborhoods,]
ggplot() + 
  geom_sf(data = tracts17, aes(fill = raceContext)) +
  scale_fill_viridis(discrete = TRUE, option="C", begin=0.5, end=0.9) +
  labs(title = "Race Context", name="Race Context") +
  mapTheme() 

final_reg <- reg.summary%>%
  mutate(uniqueID = rownames(.))

final_reg.tracts <- 
  st_join(st_centroid(final_reg), tracts17) %>%
    st_set_geometry(NULL) %>%
    left_join(dplyr::select(final_reg, uniqueID)) %>%
    st_sf() %>%
    na.omit()

st_set_geometry(final_reg.tracts, NULL) %>%
  group_by(Regression, raceContext) %>%
  summarize(mean.Error = mean(Error, na.rm = T)) %>%
  spread(raceContext, mean.Error) %>%
  kable(caption = "Mean Error by neighborhood racial context") %>%
    kable_styling("striped", full_width = F) 
Mean Error by neighborhood racial context
Regression Majority_Non_White Majority_White
Random k-fold CV: Just Risk Factors 0.0553795 -0.1020274
Random k-fold CV: Spatial Structure 0.0327311 -0.0472431
Spatial LOGO-CV (Nhoods): Just Risk Factors 0.0147806 -0.1099984
Spatial LOGO-CV (Nhoods): Spatial Structure -0.0140769 -0.0557564
Spatial LOGO-CV (Po Districts): Just Risk Factors 0.0530017 -0.1169921
Spatial LOGO-CV (Po Districts): Spatial Structure 0.0422743 -0.0640765

For all models, there is an overprediction of sexual assaults in majority white neighborhoods. This could indicate some reporting bias in certain features. For example, graffiti could be higher reported in majority white neighborhoods due to a lower tolerance for graffiti, and thus the model predicts that there would be higher cases of sexual assault. Nonetheless, the absolute errors are very close for both racial contexts, indicating that the model generalizes well across race, especially with the spatial structure.

6.3. Compared to Traditional Models

Traditional methods of allocating policing in a city use Kernal Density to identify hot spots. This identifies hot spots purely based on spatial relationships to current reports of the crime.

A Kernal Density of size 1000 Foot search radius is used.

# Compute kernel density
sass_ppp <- as.ppp(st_coordinates(sexualassult), W = st_bbox(final_net))
sass_KD <- spatstat::density.ppp(sass_ppp, 1000)

# Convert kernel density to grid cells taking the mean
sass_KDE_sf <- as.data.frame(sass_KD) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) %>%
#Mutate the Risk_Category field as defined below.
  mutate(label = "Kernel Density",
         Risk_Category = ntile(value, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
#Bind to a layer where test set crime counts are spatially joined to the fisnnet.
  bind_cols(
    aggregate(
      dplyr::select(sexualassult) %>% mutate(assaultCount = 1), ., length) %>%
    mutate(assaultCount = replace_na(assaultCount, 0))) %>%
#Select the fields we need
  dplyr::select(label, Risk_Category, assaultCount)

The final model used was the LOGO CV using neighborhoods with the spatial structure.The grid cells were classified based on the predicted counts of sexual assault in that grid cell into 5 risk categories. The risk levels are based on the count of predicted sexual assaults and not on the observed counts. The hope is that more grid cells which might have a low observed count of sexual assaults in classified as high risk because of the high predicted count of sexual assaults. This would mean that the model is predicting the latent risk.

sass_risk_sf <-
  filter(final_reg, Regression == "Spatial LOGO-CV (Nhoods): Spatial Structure") %>%
  mutate(label = "Risk Predictions",
         Risk_Category = ntile(Prediction, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  bind_cols(
    aggregate(
      dplyr::select(sexualassult) %>% mutate(assaultCount = 1), ., length) %>%
      mutate(assaultCount = replace_na(assaultCount, 0))) %>%
  dplyr::select(label,Risk_Category, assaultCount)

Kernal Density hotspots vs Risk Predicted Hotspots

A plot comparing the kernal density and predicted risk categories are plotted side by side.

rbind(sass_KDE_sf, sass_risk_sf) %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
    geom_sf(aes(fill = Risk_Category), colour = NA) +
    geom_sf(data = sexualassult, size = .1, colour = "black") +
    facet_wrap(~label, ) +
    scale_fill_viridis(discrete = TRUE, option="C") +
    labs(title="Comparison of Kernel Density and Risk Predictions",
         subtitle="Relative to observed Assaults (in black)") +
    mapTheme()

The plot on the right shows the risk categories based on the predicted risks. The model does quite well in providing more fine-tuned hot spots than the traditional kernal density method on the left. This could result in more targeted allocation of policing.

Bar plot for comparison

As observed from the plot above, the traditional kernal density method and the risk predictions method have different combination of grid cells in each of the 5 risk brackets. The bar plot below then compares the count of observed sexual assaults in each of the grid cells classified in each risk category by the kernal density and the risk predictions.

rbind(sass_KDE_sf, sass_risk_sf) %>%
  st_set_geometry(NULL) %>%
  gather(Variable, Value, -label, -Risk_Category) %>%
  group_by(label, Risk_Category) %>%
  summarize(countAssaults = sum(Value)) %>%
  ungroup() %>%
  group_by(label) %>%
  mutate(Rate_of_sexual_assaults = countAssaults / sum(countAssaults)) %>%
    ggplot(aes(Risk_Category,Rate_of_sexual_assaults)) +
      geom_bar(aes(fill=label), position="dodge", stat="identity") +
      scale_fill_viridis(discrete = TRUE, option="C", begin = 0.3, end = 0.55)

The model predicts higher ‘hot spots’ than the kernal density model, hence it is marginally better. More crime actually falls in the lower risk categories for the categories using the risk prediction.

7. Conclusion

Recommedations
This model should not yet be used to allocate police resources to target sexual assault cases because it does only marginally better than the kernal density in capturing the observed sexual assaults in areas classified as hot spots. Based on the bar plot, more crimes actually fall in cells classified as low risk than in the traditional model and this is not the desirable outcome as we hope that the highest risk category captures more crimes. Following the model might lead to areas requiring more policing to not receive policing at all. Ways the model can be improved is discussed below. If the model is able to classify risk well, the involved authorities should keep in mind that the majority of sexual assault cases occur behind closed doors when allocating resources, and policing might not be the most effective strategy for reducing the cases of sexual assault. Instead, something more personal such as home visits or education would be more preventative.

Limitations and Improvements
An assumption made when building this model was that the count of sexual assaults dissipates linearly from hot spots. This is not likely the case, hence other functions could be used when engineering the feature that represents the spatial structure of sexual assaults. Additionally, different values of K can be used for the K-Nearest Neighbors calculation to improve the goodness of fit measures. Finally, the size of the grid cell in the fishnet was also chosen rather arbitrarily, and the above model can be repeated with different cell sizes, and then compared to the traditional model to identify which cell size results in a higher share of observed sexual assaults in the highest risk category captured by the risk predictions.