Analyzing Baltimore's 911 Call Log Using PySpark
Johann Liang
October 3, 2016
¶I have used Apache Spark's Python API to analyze Baltimore City's 911 call log from January 1, 2015 to September 18, 2016. In the following, I will demonstrate how to import, explore, and analyze the data, and how to present the results graphically.
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
import seaborn as sns
import pandas as pd
dataPath = "Calls_for_Service20160918.csv"
# Read the CSV with the spark-csv package, using the first row as the header and
# letting Spark infer column types, then register it as a temporary table.
sqlContext.read.format("com.databricks.spark.csv")\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .load(dataPath)\
    .registerTempTable("service_df")
# Pull the table back out as a DataFrame, sorted by call time, and cache it.
service_df = sqlContext.sql("SELECT * FROM service_df").sort('callDateTime').cache()
We can check the schema and see that all fields were read in as strings.
service_df.printSchema()
Let's take a look at the first 20 rows of the DataFrame.
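A call to show(), which truncates long fields by default, does the trick:
service_df.show(20)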
We can find out the dimensions of the DataFrame, and what features it contains.
nRecords = service_df.count()
print('There are {} records in the dataset.'.format(nRecords))
print(len(service_df.columns))
print(service_df.columns)
Given that the timestamps in the original data set were in the 12-hour clock format, let's convert them into the 24-hour clock format to facilitate analysis in the following sections. We will also want to know the corresponding day of the week for a given timestamp, so we will do the conversion here as well.
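First we need two small helper UDFs; assuming the raw timestamps look like "04/27/2015 11:59:00 PM", a minimal sketch is:
from datetime import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType, IntegerType

# Convert a 12-hour-clock string such as "04/27/2015 11:59:00 PM" into a timestamp.
u_parse_time = udf(lambda s: datetime.strptime(s, '%m/%d/%Y %I:%M:%S %p'),
                   TimestampType())

# Day of the week for the same string, coded with 0 = Sunday, 1 = Monday, and so on.
u_get_weekday = udf(lambda s: (datetime.strptime(s, '%m/%d/%Y %I:%M:%S %p').weekday() + 1) % 7,
                    IntegerType())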
timed_service_df = (service_df
                    .select(u_parse_time(service_df['callDateTime']).alias('dateTime'),
                            u_get_weekday(service_df['callDateTime']).alias('weekday'), '*')
                    .drop('callDateTime')
                    .cache())
Let's check that the conversions have been made.
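A quick look at the new columns:
timed_service_df.select('dateTime', 'weekday').show(5)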
Now we are ready to explore this data set. Let's start by checking how the 911 calls were distributed in terms of priority. To do this, we can use the groupBy() and count() functions in Spark.
priority_summary_df = (timed_service_df
.groupBy('priority')
.count()
.sort('count', ascending=False)
)
priority_summary_df.show()
It appears that most calls are of medium or low priority. Let's make a visual comparison.
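Pulling the small summary table into Pandas, we can draw a bar chart:
# Convert the Spark summary to Pandas and plot calls per priority level.
priority_summary_pd = priority_summary_df.toPandas()
fig, ax = plt.subplots(figsize=(8, 5))
sns.barplot(x='priority', y='count', data=priority_summary_pd, ax=ax)
ax.set_xlabel('Priority')
ax.set_ylabel('Number of calls')
plt.show()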
Is the demand for service greater in certain districts than in others? Let's find out how many calls originated in each district. Again, the groupBy() and count() functions can help us accomplish this task.
district_summary_df = (timed_service_df
.groupBy('district')
.count()
.sort('count', ascending=False)
.cache())
district_summary_df.show()
# Compare the busiest district (first row) with the quietest of the nine districts (ninth row).
top9_counts = [float(row['count']) for row in district_summary_df.take(9)]
call_percent_diff = (top9_counts[0] - top9_counts[8]) * 100. / top9_counts[8]
print('The first nine entries are the publicized police districts in Baltimore City. The data show that the NE district received the highest number of calls, and the ED received the lowest number of calls. The difference is almost {:.2f}%.'.format(call_percent_diff))
Knowing that citizens may call 911 for various reasons, let's tally up the calls by descriptions.
desc_summary_df = (timed_service_df
.groupBy('description')
.count()
.orderBy('count', ascending=False)
.cache())
desc_summary_df.show()
initial_desc_count = desc_summary_df.count()
print("Currently, there are {0} distinct versions of descriptions in the data set.".format(initial_desc_count))
Our preliminary results already give some interesting insights. However, visual examination shows that the current tally is neither efficient nor accurate: the system reported 6133 different descriptions, yet some of these so-called "different" descriptions are in fact identical. This is because sometimes a word was spelled in different ways (e.g., "LARCENCY" clearly means larceny), and sometimes a description was recorded in all upper-case letters as opposed to title case. Let's clean up the data set by casting all descriptions to lower case, correcting some obvious spelling mistakes, and removing multiple spaces between words in a phrase.
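The exact clean-up rules aren't critical; a minimal sketch of this step (the specific spelling fix below is just illustrative) looks like this, producing a new column named descr:
import re
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def clean_description(s):
    # Lower-case, collapse repeated whitespace, and patch an obvious misspelling.
    if s is None:
        return s
    s = re.sub(r'\s+', ' ', s.strip().lower())
    return s.replace('larcency', 'larceny')

u_clean_descr = udf(clean_description, StringType())

cleaned_timed_service_df = (timed_service_df
                            .select('*', u_clean_descr('description').alias('descr'))
                            .drop('description')
                            .cache())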
desc_summary = cleaned_timed_service_df.groupBy('descr').count().sort('count', ascending=False).cache()
desc_summary.show()
new_desc_count = desc_summary.count()
print("Currently, there are {0} distinct versions of descriptions in the data set, {1} fewer distinct descriptions from the initial count.".format(new_desc_count, initial_desc_count-new_desc_count))
After taking our first step to standardize the descriptions, we cut the number of different descriptions down by 250. The description that tops the list is "911/no voice." At this point we don't know if this means that sound was not transmitted over the phone, that the caller hung up before speaking to an operator, or that the calls were text messages. The data set did not come with any description of its codes/phrases.
Let's take a look at other top reasons that Baltimoreans call 911.
top6desc = list(desc_summary.take(6))
print('The top reasons that people called 911 are: {0}, {1}, {2}, {3}, and {4}.'.format(str(top6desc[1][0]), str(top6desc[2][0]), str(top6desc[3][0]), str(top6desc[4][0]), str(top6desc[5][0])))
NCallNum = cleaned_timed_service_df.select('callNumber').distinct().count()
NRecord = cleaned_timed_service_df.count()
print('There are {0:,} call numbers among {1:,} call records.'.format(NCallNum, NRecord))
It appears that phone numbers were masked, so we can't tell if the same caller called 911 multiple times.
Now it is time to explore the information embedded in the timestamps of the calls. The data set contains 911 calls from all of 2015 and the first nine months of 2016. Let's isolate the 2015 call logs and look at them more closely.
from pyspark.sql.functions import year, month, dayofmonth, hour
calls2015 = (cleaned_timed_service_df
.select(year('dateTime').alias('year'), month('dateTime').alias('month'), dayofmonth('dateTime').alias('day'), hour('dateTime').alias('hour'),'*')
.filter('year == 2015')
.cache()
)
numOf2015Records = calls2015.count()
print('{0:,} call records belong to 2015.'.format(numOf2015Records))
Let's tally up the calls by month and visualize the results.
callsByMonth = calls2015.groupBy('month').count().sort('month').cache()
callsByMonth.show()
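A quick bar chart of the monthly totals (pulling the small table into Pandas first):
# Plot the number of 911 calls in each month of 2015.
callsByMonth_pd = callsByMonth.toPandas()
fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(callsByMonth_pd['month'], callsByMonth_pd['count'], align='center')
ax.set_xticks(callsByMonth_pd['month'])
ax.set_xlabel('Month')
ax.set_ylabel('Number of calls')
plt.show()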
With the help of the visualization, we can see that there is an uptick in 911 calls during the warmer months, i.e., June, July, August, and September.
Was the call volume relatively stable throughout the year, or did it peak on certain days? We can answer this question by tallying up the calls made on each day of 2015.
callsByDay_df = (calls2015
.groupBy('month', 'day')
.count()
.sort('month', 'day')
.cache()
)
Let's visualize this result.
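Pivoting the per-day tallies into a month-by-day grid, we can draw the whole year as a heatmap:
# One row per month, one column per day of the month, colored by call count.
callsByDay_pd = callsByDay_df.toPandas().pivot(index='month', columns='day', values='count')
fig, ax = plt.subplots(figsize=(16, 6))
sns.heatmap(callsByDay_pd, cmap='YlOrRd', ax=ax)
ax.set_xlabel('Day of month')
ax.set_ylabel('Month')
plt.show()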
This heatmap shows the total number of 911 calls for each day in 2015 as given in the data set. One striking feature is the intense color for April 28, which we recognize to be the day after the civic unrest in Baltimore. That day has 4441 registered phone calls, the largest number of calls for any single day in the data set. While this may seem reasonable at first, you may then ask, as I did: Why does April 27, the day of the unrest, look so pale in comparison? Indeed, according to news reports more than 11,000 calls were made to the 911 call center on April 27, 2015. However, our data set registers only 2418 calls for that day. Could the online data set be truncated?
Here we tally the calls by the day of the week. Note that in the following table, "0" represents Sunday, "1" represents Monday, etc.
callByDay = (calls2015
.groupBy('weekday')
.count()
.sort('weekday')
.cache()
)
callByDay.show()
It seems like call volume tended to increase as the week progressed, and it was highest on Fridays and lowest on Sundays.
It might be interesting to see if there are any variations in call volume over a 24-hour period. Presumably this would help management schedule operators more effectively to respond to the ebb and flow of demand.
callByHour = (calls2015
.groupBy('hour')
.count()
.sort('hour')
.cache()
)
callByHour.show(n=24)
This plot of call volume at different hours of the day traces out the cycle of human activity over a 24-hour period: call volumes gradually increased from early morning, peaked at 5 PM, stayed high until about 9 PM, and gradually decreased into the wee hours.
Given that the call logs contain addresses, let's identify the locations where incidents occur most frequently.
location_count = calls2015.select('incidentLocation').distinct().count()
print("There are {0:,} distinct locations in the data set.".format(location_count))
calls2015.groupBy('incidentLocation').count().sort('count',ascending=False).show()
It is not clear what the code "100" stands for; nonetheless, those of us who don't want to encounter unnecessary incidents may want to avoid the locations on this list.
Given the trend we saw in call volume at different times of day, it might be interesting to look at whether the call volume for each priority level exhibits the same trend. Let's first tally the calls by hour and priority.
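We can group by both columns at once; I'll call the result callsByHourPriority:
callsByHourPriority = (calls2015
                       .groupBy('hour', 'priority')
                       .count()
                       .sort('hour', 'priority')
                       .cache()
                       )
callsByHourPriority.show()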
Now let's visualize the results.
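Pivoting to one column per priority level and plotting one curve per column:
# Hour of day on the x-axis, one line per priority level.
hourPriority_pd = (callsByHourPriority.toPandas()
                   .pivot(index='hour', columns='priority', values='count'))
fig, ax = plt.subplots(figsize=(10, 6))
hourPriority_pd.plot(ax=ax)
ax.set_xlabel('Hour of day')
ax.set_ylabel('Number of calls')
plt.show()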
This plot is interesting, as it shows that the high-volume priority levels trace out the same daily cycle we saw in the overall call counts.
Since call volumes for different priority calls differ by orders of magnitude, the Emergency call volume has been reduced to a flat line in the above plot. We can look at a log plot of the data to examine variations in the number of Emergency calls more closely.
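The same data on a logarithmic vertical axis:
# Re-plot with a log scale so the low-volume Emergency curve is visible.
fig, ax = plt.subplots(figsize=(10, 6))
hourPriority_pd.plot(ax=ax, logy=True)
ax.set_xlabel('Hour of day')
ax.set_ylabel('Number of calls (log scale)')
plt.show()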
This plot confirms that the volume of Emergency calls also followed the 24-hour cycle of human activity.
Given that 911 calls are citizens' reports of incidents across the city, let's see if there are any trends for certain incidents at different hours of the day. Here I decided to focus on reports of "auto accident", "auto theft", "burglary", "common assault", "disorderly", and "larceny." We can use the filter() function in Spark to select the rows of interest.
hour_descr_select = (calls2015
.filter((calls2015.descr == "disorderly")
| (calls2015.descr == "common assault")
| (calls2015.descr == "auto accident")
| (calls2015.descr == "auto theft")
| (calls2015.descr == "burglary")
| (calls2015.descr == "larceny"))
.select('hour','descr')
.cache()
)
Let's visualize the results.
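Tallying the selected incident types by hour and plotting one curve per description:
# Hour of day on the x-axis, one line per incident description.
hourDescr_pd = (hour_descr_select
                .groupBy('hour', 'descr')
                .count()
                .toPandas()
                .pivot(index='hour', columns='descr', values='count'))
fig, ax = plt.subplots(figsize=(10, 6))
hourDescr_pd.plot(ax=ax)
ax.set_xlabel('Hour of day')
ax.set_ylabel('Number of calls')
plt.show()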
This plot reveals some interesting trends in when the chosen incidents occur over the course of a day.
Are certain days of the week particularly prone to incidents? Let's find out.
weekday_descr_select = (calls2015
.filter((calls2015.descr == "disorderly")
| (calls2015.descr == "common assault")
| (calls2015.descr == "auto accident")
| (calls2015.descr == "auto theft")
| (calls2015.descr == "burglary")
| (calls2015.descr == "larceny"))
.select('weekday','descr')
.cache()
)
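As before, we tally by day of the week and description, and plot grouped bars:
# Day of the week on the x-axis (0 = Sunday), one bar group per incident description.
weekdayDescr_pd = (weekday_descr_select
                   .groupBy('weekday', 'descr')
                   .count()
                   .toPandas()
                   .pivot(index='weekday', columns='descr', values='count'))
fig, ax = plt.subplots(figsize=(10, 6))
weekdayDescr_pd.plot(kind='bar', ax=ax)
ax.set_xlabel('Day of week (0 = Sunday)')
ax.set_ylabel('Number of calls')
plt.show()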
According to the data, Mondays and Tuesdays had the highest number of burglary and larceny incidents. Saturdays and Sundays were the worst days for disorderly behavior and common assault. As one might have expected, auto accidents were lowest on Saturdays and Sundays.
In the above analysis, we have seen that 911 call volume was influenced by events throughout the year. We have also seen that the top reasons Baltimoreans called 911 were: disorderly, traffic stop, common assault, auto accident, and narcotics outside. Let's see what we can learn from the data if we look at the intersection of these two features. More specifically, let's look at the top reason for 911 calls for each day in 2015.
First, let's filter out calls that were described as '911/no voice' and 'other', since we can't assess the reasons behind these calls. Then, we will group the data by date and description of the call. For each date, we will pick the call description with the highest count.
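One way to implement this is with a window function ranked by count; note that rank() keeps ties, which is why a date can end up with two "top" reasons:
from pyspark.sql import Window
from pyspark.sql.functions import col, rank

# Count calls per (month, day, descr), excluding the two uninformative descriptions.
dailyDescrCounts = (calls2015
                    .filter((col('descr') != '911/no voice') & (col('descr') != 'other'))
                    .groupBy('month', 'day', 'descr')
                    .count())

# For each date, keep the description(s) with the highest count.
w = Window.partitionBy('month', 'day').orderBy(col('count').desc())
descrWMaxCount = (dailyDescrCounts
                  .withColumn('rank', rank().over(w))
                  .filter(col('rank') == 1)
                  .drop('rank')
                  .cache())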
With the following command, we see that there are only four possible top reasons for Baltimoreans to call 911.
descrWMaxCount.select('descr').distinct().show()
At this point, we realize that April 14, 2015 had two competing top reasons for 911 calls.
descrWMaxCount.filter((descrWMaxCount.month == 4) & (descrWMaxCount.day == 14)).show()
For the visualization we will be making, let's drop one of those entries.
newDescrWMaxCount = descrWMaxCount.dropDuplicates(['month', 'day']).cache()
newDescrWMaxCount.filter((newDescrWMaxCount.month == 4) & (newDescrWMaxCount.day == 14)).show()
newDescrWMaxCount.count()
Now that we have made sure there is only one entry for each day, let's make a visualization.
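To draw the heatmap, we can map each of the four descriptions to an integer code (the exact lower-case strings below are assumed from the cleaning step) and reuse the month-by-day pivot; the colors follow the discussion below:
from matplotlib.colors import ListedColormap

# 0 = disorderly (red), 1 = traffic stop (blue), 2 = burglary (yellow), 3 = auto accident (cyan).
descr_codes = {'disorderly': 0, 'traffic stop': 1, 'burglary': 2, 'auto accident': 3}
topReason_pd = newDescrWMaxCount.toPandas()
topReason_pd['code'] = topReason_pd['descr'].map(descr_codes)
topReason_grid = topReason_pd.pivot(index='month', columns='day', values='code')

fig, ax = plt.subplots(figsize=(16, 6))
sns.heatmap(topReason_grid, cmap=ListedColormap(['red', 'blue', 'yellow', 'cyan']),
            cbar=False, ax=ax)
ax.set_xlabel('Day of month')
ax.set_ylabel('Month')
plt.show()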
This heatmap shows that the top reasons of the year were very much dominated by disorderly conduct (red) and traffic stops (blue). In particular, reports of traffic stops peaked in the colder winter months, whereas reports of disorderly conduct stayed high throughout the year. The two yellow days when reports of burglary peaked clearly show a correlation with the civil unrest. The four cyan days when reports of auto accidents peaked suggest that they might be related to weather conditions; a preliminary investigation shows that these four days were plagued by snow, rain, or fog.