Blight Fight Capstone Project

By Jeison Cardoso

August 5, 2023

Introduction

The Blight Fight Capstone Project aims to develop a machine learning model that relates building blight to crime in Detroit. Blight is a serious problem in the city, and it has a negative impact on the city’s economy and its residents. The goal of this project is to build a model that helps city planners identify buildings at risk of becoming blighted so that they can take preventive action.

Data Analysis

The data used in this project was obtained from the City of Detroit’s Open Data Portal. It includes information on blight violations, demolition permits, 311 calls, and crimes. The data was cleaned and preprocessed to remove errors and inconsistencies.

Source Data

The original data can be found at the following links:

  • Blight Violations - blight violation tickets issued in the city of Detroit.
  • Demolition Permits - demolition permits issued in the city of Detroit.
  • 311 Calls - 311 service requests made in the city of Detroit.
  • Crimes - crime incidents reported in the city of Detroit.

Cleaning the data

The first step is to clean the data. We will remove the columns that are not useful for our analysis and remove the rows that have missing values.

Import the base libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime

Auxiliary functions

Functions to filter and extract data from the datasets.

Filter date limits

Filter the data from the datasets by the date limits.

def filter_date_limits(df: pd.DataFrame) -> pd.DataFrame:
    """
    Filter the dataframe to only include dates between 2013-01-01 and 2017-01-01
    """
    return df[(df["DATE"] <= datetime.datetime(2017,1,1)) & (df["DATE"] >= datetime.datetime(2013,1,1))]

Filter geographic limits

Filter by a pseudo-geographical bounding box around Detroit to improve the precision of the results.

def filter_geo_limits(df: pd.DataFrame) -> pd.DataFrame:
    """
    Filter the dataframe to only include locations within the Detroit city limits
    """
    df = df[(df['GEO_LAT'] > 42.25) & (df['GEO_LAT'] < 42.47)]
    df = df[(df['GEO_LON'] > -83.3) & (df['GEO_LON'] < -82.9)]
    df = df.dropna(subset=['GEO_LAT', 'GEO_LON'])
    return df

Extract geolocation

Extract the geolocation from the dataset. Some columns store addresses in a format like SOMETHING (LAT, LON), so the latitude and longitude need to be parsed out of the string.

def extract_geo_location(df: pd.DataFrame, label: str) -> pd.DataFrame:
    """
    Extract the geo location from the given label and add it to the dataframe
    """
    df.dropna(subset=[label], inplace=True)
    # run the regex once and split the two capture groups into latitude and longitude
    coords = df[label].str.extract(r"(\d+\.\d+),\s(-\d+\.\d+)", expand=True)
    df["GEO_LAT"] = coords[0].astype(float)
    df["GEO_LON"] = coords[1].astype(float)
    df.dropna(subset=['GEO_LAT', 'GEO_LON'], inplace=True)
    return df
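
As a quick illustration of the pattern (the address below is made up, not a row from the real dataset), the regular expression pulls the latitude and longitude out of the trailing (LAT, LON) part of the string:

sample = pd.DataFrame({"site_location": ["123 MAIN ST Detroit, MI (42.40, -83.10)"]})
sample = extract_geo_location(sample, "site_location")
# sample["GEO_LAT"] -> 42.40, sample["GEO_LON"] -> -83.10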

Create a geolocation grid

Create a data structure describing the grid of geolocations. It is a pseudo grid rather than a true geographic grid: it covers the bounding box defined above, is approximately square, and is parameterized by a tile size in meters. The function below returns the tile dimensions in degrees, the number of tiles per side, and the tile size.

def create_geo_location_grid(tile_size: int):
    """
    Create a grid of the given tile size for the Detroit city limits
    """
    # bounding box in degrees and its approximate extent in meters (~26.3 km per side)
    grid = {"lat":[42.25,42.47],"lon":[-83.3,-82.9],"x": 26300,"y": 26300}
    # tile height and width in degrees
    lat = (grid["lat"][1]-grid["lat"][0])*tile_size/grid["y"]
    lon = (grid["lon"][1]-grid["lon"][0])*tile_size/grid["x"]
    # number of tiles per side
    x = int(grid["x"]/tile_size) + 1
    y = int(grid["y"]/tile_size) + 1
    return {"lat":lat,"lon":lon,"x":x,"y":y,"factor":tile_size}
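
For the 30-meter tiles used in the cleaning process below, the returned values work out roughly as follows (a quick sanity check of the arithmetic, using the ~26.3 km bounding box above):

# create_geo_location_grid(tile_size=30) returns approximately:
# lat    = (42.47 - 42.25) * 30 / 26300 ≈ 0.000251   (tile height in degrees)
# lon    = (-82.9 + 83.3)  * 30 / 26300 ≈ 0.000456   (tile width in degrees)
# x = y  = int(26300 / 30) + 1 = 877                 (tiles per side)
# factor = 30                                        (tile size in meters)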

Convert geography coordinates to the pseudo grid coordinates

Convert the geographical coordinates to pseudo-grid coordinates. The grid coordinates are used to compute distances between points and to build an index (GEO_INDEX) for aggregation.

def convert_geo_location_to_grid(df: pd.DataFrame, grid: dict) -> pd.DataFrame:
    """
    Convert the geo location to a grid location
    """
    # integer tile coordinates along each axis
    x = ((df["GEO_LON"]-grid["lon"])/grid["lon"]).astype(int)
    y = ((df["GEO_LAT"]-grid["lat"])/grid["lat"]).astype(int)
    # single integer index used as the aggregation key
    df.insert(0, "GEO_INDEX", x + y*grid["x"])
    return df
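
To make the index concrete, here is a rough worked example for a point near the center of the bounding box (the coordinates are chosen for illustration only), using the 30-meter grid:

# with grid["lon"] ≈ 0.000456, grid["lat"] ≈ 0.000251 and grid["x"] = 877:
# for GEO_LON = -83.05, GEO_LAT = 42.35
# x = int((-83.05 - 0.000456) / 0.000456) ≈ -182,018
# y = int(( 42.35 - 0.000251) / 0.000251) ≈  168,756
# GEO_INDEX = x + y * 877 ≈ 147,817,000  (the same order of magnitude as the tables below)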

Cleaning process

The first step of the cleaning process is to create the pseudo grid; the resulting grid object is shared by all subsequent processing. The grid object is a dictionary with the following keys:

  • lat: tile height in degrees
  • lon: tile width in degrees
  • x: number of tiles along the longitude axis
  • y: number of tiles along the latitude axis
  • factor: the tile size in meters
grid = create_geo_location_grid(tile_size=30)

Cleaning Permits Data

After reading the data, we extract the date the permit was applied for, the parcel size, and the parcel ground area. We then use the common functions to extract the geographic coordinates and convert the data to the grid.

permits = pd.read_csv("./data/detroit-demolition-permits.tsv", sep="\t")
permits = permits[permits["PERMIT_APPLIED"].str.contains("^[0-9]{2}/[0-9]{2}/[0-9]{2}$", na=False)]
permits["DATE"] = pd.to_datetime(permits["PERMIT_APPLIED"], format="%m/%d/%y")
permits["PARCEL_SIZE"] = permits["PARCEL_SIZE"].astype(float).fillna(0)
permits["PARCEL_GROUND_AREA"] = permits["PARCEL_GROUND_AREA"].astype(float).fillna(0)
permits = extract_geo_location(permits, "site_location")
permits = filter_geo_limits(permits)
permits = filter_date_limits(permits)
permits = convert_geo_location_to_grid(permits, grid)
permits = permits[["GEO_INDEX", "GEO_LAT", "GEO_LON", "DATE", "PARCEL_SIZE", "PARCEL_GROUND_AREA"]]

permits.head()

GEO_INDEX GEO_LAT GEO_LON DATE PARCEL_SIZE PARCEL_GROUND_AREA
1517 148007437 42.404182 -82.988822 2014-12-19 6011.0 982.0
1518 147972940 42.394451 -83.123028 2014-12-19 3920.0 0.0
1519 148056587 42.418207 -82.971459 2014-12-19 3877.0 792.0
1520 148056587 42.418207 -82.971459 2014-12-19 3877.0 792.0
1521 147753854 42.331681 -83.047996 2014-12-19 3006.0 960.0

Cleaning Crimes Data

After reading the data, we extract the incident date and the coordinates. We then use the common filter functions and convert the data to the grid.

crimes = pd.read_csv("./data/detroit-crime.csv", low_memory=False)
crimes['GEO_LON'] = crimes['LON'].astype(float)
crimes['GEO_LAT'] = crimes['LAT'].astype(float)
crimes['DATE'] = pd.to_datetime(crimes['INCIDENTDATE'], format='%m/%d/%Y %I:%M:%S %p')

crimes = filter_geo_limits(crimes)
crimes = filter_date_limits(crimes)
crimes = convert_geo_location_to_grid(crimes, grid)
crimes = crimes[['GEO_INDEX', 'GEO_LON', 'GEO_LAT', 'CATEGORY', 'DATE']]

crimes.head()

GEO_INDEX GEO_LON GEO_LAT CATEGORY DATE
0 147879980 -83.1221 42.3678 ASSAULT 2015-06-03
1 147895587 -83.2035 42.3724 LARCENY 2015-03-01
2 148110845 -83.0241 42.4338 STOLEN VEHICLE 2015-02-08
3 147815923 -83.1381 42.3496 WEAPONS OFFENSES 2015-11-09
4 147810813 -83.0692 42.3481 LARCENY 2015-08-14

Clean Violations Data

After reading the data, we extract the date the ticket was issued and parse all of the monetary columns to look for correlations. We then use the common functions to extract the geographic coordinates and convert the data to the grid.

violations = pd.read_csv('./data/detroit-blight-violations.csv', low_memory=False)
violations = violations[violations["TicketIssuedDT"].str.contains("^[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}", na=False)]
violations["DATE"] = pd.to_datetime(violations["TicketIssuedDT"], format='%m/%d/%Y %I:%M:%S %p')
violations = extract_geo_location(violations, "ViolationAddress")
violations = filter_geo_limits(violations)
violations = filter_date_limits(violations)
violations['FineAmt'] = violations['FineAmt'].str.replace('$', '', regex=False).astype(float).fillna(0)
violations['AdminFee'] = violations['AdminFee'].str.replace('$', '', regex=False).astype(float).fillna(0)
violations['LateFee'] = violations['LateFee'].str.replace('$', '', regex=False).astype(float).fillna(0)
violations['StateFee'] = violations['StateFee'].str.replace('$', '', regex=False).astype(float).fillna(0)
violations['CleanUpCost'] = violations['CleanUpCost'].str.replace('$', '', regex=False).astype(float).fillna(0)
violations['JudgmentAmt'] = violations['JudgmentAmt'].str.replace('$', '', regex=False).astype(float).fillna(0)
violations = violations[['GEO_LAT', 'GEO_LON', 'DATE', 'FineAmt', 'AdminFee', 'LateFee', 'StateFee', 'CleanUpCost', 'JudgmentAmt']]
violations = violations.dropna()
violations = convert_geo_location_to_grid(violations, grid)

violations.head()

GEO_INDEX GEO_LAT GEO_LON DATE FineAmt AdminFee LateFee StateFee CleanUpCost JudgmentAmt
263034 147948266 42.387481 -83.176853 2013-01-09 100.0 20.0 10.0 10.0 0.0 140.0
263035 147938619 42.384741 -83.176734 2013-01-09 50.0 20.0 5.0 10.0 0.0 85.0
263036 147940373 42.385158 -83.176818 2013-01-09 50.0 20.0 5.0 10.0 0.0 85.0
263037 147945638 42.386708 -83.175247 2013-01-09 100.0 20.0 10.0 10.0 0.0 140.0
263038 147941256 42.385398 -83.173928 2013-01-09 50.0 20.0 5.0 10.0 0.0 85.0

Cleaning D311 Data

After reading the data, we extract the date the issue was acknowledged. We then use the common functions to filter by the geographic limits and convert the data to the grid.

issues = pd.read_csv('./data/detroit-311.csv')
issues['issue_type'] = issues['issue_type'].astype('category')
issues['DATE'] = pd.to_datetime(issues['acknowledged_at'], format='%m/%d/%Y %I:%M:%S %p')
issues['GEO_LAT'] = issues['lat'].astype('float')
issues['GEO_LON'] = issues['lng'].astype('float')

issues = filter_geo_limits(issues)
issues = filter_date_limits(issues)
issues = convert_geo_location_to_grid(issues, grid)
issues = issues[['GEO_INDEX', 'GEO_LAT', 'GEO_LON', 'DATE', 'issue_type']]
issues.head()

GEO_INDEX GEO_LAT GEO_LON DATE issue_type
0 147936022 42.383998 -83.161039 2015-03-06 22:03:38 Clogged Drain
1 148133523 42.440471 -83.080919 2015-03-11 16:23:11 Clogged Drain
2 148150446 42.445244 -82.962038 2015-03-11 15:39:05 Clogged Drain
3 148065807 42.421043 -83.166194 2015-03-11 15:35:02 Clogged Drain
4 147999162 42.402033 -83.162874 2015-03-11 15:04:59 Clogged Drain

Processing: Aggregation and Grouping

Counter functions

The grouping and counting functions use GEO_INDEX as the key, with GEO_LAT and GEO_LON as the coordinates of each cell; the aggregated data is used later for searching and analysis.

The first step is to use the functions below to keep only the columns of interest. The date columns are dropped because they are not used in this part of the analysis.

def create_base_df_counter(df : pd.DataFrame) -> pd.DataFrame:
    base_df = pd.DataFrame(index=df.index)
    base_df["GEO_INDEX"] = df["GEO_INDEX"]
    base_df["GEO_LAT"] = df["GEO_LAT"]
    base_df["GEO_LON"] = df["GEO_LON"]
    return base_df


def create_violations_df_counter(df : pd.DataFrame) -> pd.DataFrame:
    base_df = pd.DataFrame(index=df.index)
    base_df["GEO_INDEX"] = df["GEO_INDEX"]
    base_df["GEO_LAT"] = df["GEO_LAT"]
    base_df["GEO_LON"] = df["GEO_LON"]
    base_df["FineAmt"] = df["FineAmt"]
    base_df["AdminFee"] = df["AdminFee"]
    base_df["LateFee"] = df["LateFee"]
    base_df["StateFee"] = df["StateFee"]
    base_df["CleanUpCost"] = df["CleanUpCost"]
    base_df["JudgmentAmt"] = df["JudgmentAmt"]
    return base_df

def create_permits_df_counter(df : pd.DataFrame) -> pd.DataFrame:
    base_df = pd.DataFrame(index=df.index)
    base_df["GEO_INDEX"] = df["GEO_INDEX"]
    base_df["GEO_LAT"] = df["GEO_LAT"]
    base_df["GEO_LON"] = df["GEO_LON"]
    base_df["PARCEL_SIZE"] = df["PARCEL_SIZE"]
    base_df["PARCEL_GROUND_AREA"] = df["PARCEL_GROUND_AREA"]
    return base_df

Aggregate functions

These functions use GEO_INDEX as a counter and average GEO_LAT and GEO_LON for each GEO_INDEX, which gives the coordinates used in the next step.

def aggregate_by_geo_index(df : pd.DataFrame) -> pd.DataFrame:
    df = df.groupby("GEO_INDEX").agg({
        "GEO_INDEX": "count",
        "GEO_LAT": "mean",
        "GEO_LON": "mean",
        })
    return df


def aggregate_by_permit_type(df : pd.DataFrame) -> pd.DataFrame:
    df = df.groupby("GEO_INDEX").agg({
        "GEO_INDEX": "count",
        "GEO_LAT": "mean",
        "GEO_LON": "mean",
        "PARCEL_SIZE": "sum",
        "PARCEL_GROUND_AREA": "sum",
        })
    df.rename(columns={'GEO_INDEX': 'PERMITS'}, inplace=True)
    return df


def aggregate_by_geo_index_violations(df : pd.DataFrame) -> pd.DataFrame:
    df = df.groupby("GEO_INDEX").agg({
        "GEO_INDEX": "count",
        "GEO_LAT": "mean",
        "GEO_LON": "mean",
        "FineAmt": "sum",
        "AdminFee": "sum",
        "LateFee": "sum",
        "StateFee": "sum",
        "CleanUpCost": "sum",
        "JudgmentAmt": "sum",
        })
    df.rename(columns={'GEO_INDEX': 'VIOLATIONS'}, inplace=True)
    return df

def add_features(df, line, label, grid: dict, factor=25):
    """
    Sum the `label` column of df over a window of about (2*factor+1) tiles centered on `line`
    """
    # half-width of the window in degrees of latitude / longitude
    x = grid["lat"]*(factor+0.5)
    y = grid["lon"]*(factor+0.5)
    lat_sel = np.logical_and(df.GEO_LAT < line["GEO_LAT"]+x, df.GEO_LAT > line["GEO_LAT"]-x)
    long_sel = np.logical_and(df.GEO_LON < line["GEO_LON"]+y, df.GEO_LON > line["GEO_LON"]-y)
    g_sel = np.logical_and(lat_sel, long_sel)
    return df.loc[g_sel, label].sum()

def merge(df1, df2, label, grid: dict):
    # for every cell of df1, sum the `label` column of df2 over the surrounding window
    a1 = df1.apply(lambda x: add_features(df2, x, label, grid), axis=1)
    a1.name = label
    return df1.merge(a1, left_index=True, right_index=True)
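
For a rough sense of scale (assuming the 30-meter grid created earlier): with the default factor=25 the selection window extends 25.5 tiles on each side of a cell, so every merged value is a sum over roughly a 1.5 km square centered on that cell.

# half-width of the window, in tiles and meters (approximate):
# 25 + 0.5 = 25.5 tiles  ->  25.5 * 30 m ≈ 765 m on each side
# full window ≈ 1.53 km x 1.53 km around each grid cell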

Processing

The same process is applied to all of the datasets:

  • read the datasets
  • clean the datasets
  • create counter using the datasets
  • aggregate the datasets
  • merge the datasets
  • fill the missing values
permits_count = create_permits_df_counter(permits)
agg_permits = aggregate_by_permit_type(permits_count)

violations_count = create_violations_df_counter(violations)
agg_violations = aggregate_by_geo_index_violations(violations_count)

crimes_count = create_base_df_counter(crimes)
agg_crimes = aggregate_by_geo_index(crimes_count)
agg_crimes.rename(columns={'GEO_INDEX': 'CRIMES'}, inplace=True)

issues_count = create_base_df_counter(issues)
agg_issues = aggregate_by_geo_index(issues_count)
agg_issues.rename(columns={'GEO_INDEX': 'ISSUES'}, inplace=True)

processed_data = merge(agg_permits, agg_crimes, "CRIMES", grid)
processed_data = merge(processed_data, agg_issues, "ISSUES", grid)
processed_data = merge(processed_data, agg_violations, "VIOLATIONS", grid)

processed_data = processed_data.fillna(0)
processed_data.head()

PERMITS GEO_LAT GEO_LON PARCEL_SIZE PARCEL_GROUND_AREA CRIMES ISSUES VIOLATIONS
GEO_INDEX
147603664 1 42.288789 -83.149756 28793.0 0.0 49 17 14
147609815 2 42.290677 -83.144212 7230.0 0.0 62 22 23
147612480 1 42.291378 -83.128821 3180.0 445.0 176 46 74
147651934 1 42.302566 -83.133601 3006.0 1008.0 703 106 111
147651945 1 42.302684 -83.128960 3006.0 616.0 795 119 146

Modeling

A random forest regressor was trained on the aggregated data. A random forest is an ensemble model that combines the predictions of multiple decision trees. The model was trained on 80% of the data and evaluated on the remaining 20%, held out as a test set.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

train, test = train_test_split(processed_data, test_size=0.2)

# train the model
model = RandomForestRegressor(n_estimators=300, max_depth=13, random_state=0)
model.fit(train.drop(columns=['CRIMES']), train['CRIMES'])

# predict the test data
predictions = model.predict(test.drop(columns=['CRIMES']))

# evaluate the model (for a regressor, score() returns R², the coefficient of determination)
print("Mean Squared Error: ", mean_squared_error(test['CRIMES'], predictions))
print("Mean Absolute Error: ", mean_absolute_error(test['CRIMES'], predictions))
print("Model Confidence: ", model.score(test.drop(columns=['CRIMES']), test['CRIMES']))

# plot the results
plt.scatter(test['CRIMES'], predictions)
plt.xlabel('True Values')
plt.ylabel('Predictions')
plt.show()
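
Since the aim is to help planners see which blight-related signals matter most, it can also be useful to inspect the fitted forest’s feature importances. A minimal sketch (not part of the original analysis; it reuses the model and train variables above):

# relative importance of each blight-related feature for predicting CRIMES
importances = pd.Series(
    model.feature_importances_,
    index=train.drop(columns=['CRIMES']).columns,
).sort_values(ascending=False)
print(importances)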

Results

The random forest regressor reached an R² score (reported as "Model Confidence" above) of about 0.86 on the held-out test set. In other words, the blight-related features explain roughly 86% of the variance in the number of crimes per grid cell.

  • Mean Squared Error: 17795.52957708802
  • Mean Absolute Error: 95.73542779523734
  • Model Confidence: 0.8608635973486785

Conclusion

The results of this project suggest that it is possible to build a machine learning model that relates blight indicators to crime in Detroit. The random forest model reached an R² score of about 0.86 on the held-out data, which is a promising result. However, more work is needed to improve the model and to make it more robust to changes in the data.

Limitations

There are a number of limitations to this project. First, the data used was limited, and the relationship between the datasets is unclear because of the differences in their time coverage. The data also did not include all the factors that can contribute to crime and blight, such as economic conditions in the city. Second, the model was trained only on data from Detroit; it may not perform as well on data from other cities.

Future Work

There are a number of directions for future work in this area. First, more data could be collected to improve the accuracy of the model. Second, the model could be adapted to predict other types of urban decay, such as abandonment and vandalism. Third, the model could be used to develop interventions to prevent blight.

A time series model of the relationship between crime and blight in Detroit would be a natural extension. Such a model could forecast the number of blighted buildings in the city, which in turn could inform policy decisions about how to allocate resources to prevent blight and crime.
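
A minimal sketch of how the underlying time series could be built from the cleaned data, assuming the crimes dataframe from the cleaning step (which still carries its DATE column): count incidents per grid cell and month, then fit a standard forecasting model to the counts.

# monthly crime counts per grid cell (a sketch, not part of the current pipeline)
monthly = (
    crimes
    .groupby(["GEO_INDEX", pd.Grouper(key="DATE", freq="MS")])
    .size()
    .rename("CRIMES")
    .reset_index()
)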

Source code on GitHub
