1. Importing data into Spark

%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
import seaborn as sns
import pandas as pd
dataPath = "Calls_for_Service20160918.csv"
sqlContext.read.format("com.databricks.spark.csv")\
  .option("header","true")\
  .option("inferSchema", "true")\
  .load(dataPath)\
  .registerTempTable("service_df")

service_df = sqlContext.sql("SELECT * FROM service_df").sort('callDateTime').cache()

We can check that all fields/features are read in as strings.

service_df.printSchema()
root
 |-- callDateTime: string (nullable = true)
 |-- priority: string (nullable = true)
 |-- district: string (nullable = true)
 |-- description: string (nullable = true)
 |-- callNumber: string (nullable = true)
 |-- incidentLocation: string (nullable = true)
 |-- location: string (nullable = true)

Let's take a look at the first 20 rows of the DataFrame.

callDateTime priority district description callNumber incidentLocation location
0 01/01/2015 01:00:00 PM Medium CD Sick Case 150011416 1600 W MOUNT ROYAL (,)
1 01/01/2015 01:01:00 AM Medium SD Narcotics Inside 150010203 600 ARCHER ST (39.2819779,-76.6302082)
2 01/01/2015 01:01:00 AM High SW Dischrg Firearm 150010204 2100 WHISTLER AV (39.2643810,-76.6503750)
3 01/01/2015 01:01:00 AM Medium ED 911/No Voice 150010205 600 N WOLFE ST (39.2968573,-76.5916191)
4 01/01/2015 01:01:00 PM Low ND Other 150011412 3400 GREENWAY (39.3293529,-76.6177246)
5 01/01/2015 01:01:00 PM Medium NE 911/No Voice 150011413 7600 HARFORD RD (39.3721019,-76.5446527)
6 01/01/2015 01:01:00 PM Medium ED NarcoticsOutside 150011415 1300 N MILTON AV (39.3061440,-76.5830230)
7 01/01/2015 01:01:00 PM Medium WD NarcoticsOutside 150011417 W BALTIMORE ST/N CALHOUN ST (39.288494,-76.639561)
8 01/01/2015 01:01:00 PM Medium ND Disorderly 150011418 6200 THE ALAMEDA (39.3688984,-76.5936018)
9 01/01/2015 01:02:00 AM Low SE Other 150010207 DILLON ST/S STREEPER ST (,)
10 01/01/2015 01:03:00 AM High CD REQUEST SERVICE 150010206 0 SOUTH ST (39.2896598,-76.6109914)
11 01/01/2015 01:03:00 AM Medium SE Sick Case 150010208 ALICEANNA ST/S BROADWAY (39.283523,-76.593501)
12 01/01/2015 01:03:00 AM Medium ND Burglary 150010212 1000 DARTMOUTH RD (39.3664730,-76.6019690)
13 01/01/2015 01:03:00 AM Non-Emergency NE Loud Music 150010213 HARFORD RD/HUGO AV (39.322775,-76.588497)
14 01/01/2015 01:03:00 AM Low CD Auto Accident 150010214 0 E BALTIMORE ST (39.2895407,-76.6151315)
15 01/01/2015 01:03:00 AM Medium ND 911/No Voice 150010215 100 W 29TH ST (39.3231493,-76.6188892)
16 01/01/2015 01:03:00 AM Low SD Intoxicated Pers 150010216 1400 LIGHT ST (39.2741260,-76.6121919)
17 01/01/2015 01:03:00 AM Low SE Larceny 150010217 2300 ORLEANS ST (39.2955536,-76.5842243)
18 01/01/2015 01:03:00 PM Medium NE 911/No Voice 150011420 7600 HARFORD RD (39.3721019,-76.5446527)
19 01/01/2015 01:04:00 AM Medium ND Narcotics OnView 150010210 5400-BLK THE ALAMEDA (,)

We can find out the dimensions of the DataFrame, and what features it contains.

nRecords = service_df.count()
print('There are {} records in the dataset.'.format(nRecords))
There are 1833841 records in the dataset.
print(len(service_df.columns))
7
print(service_df.columns)
['callDateTime', 'priority', 'district', 'description', 'callNumber', 'incidentLocation', 'location']

2. Transforming the data set

Given that the timestamps in the original data set were in the 12-hour clock format, let's convert them into the 24-hour clock format to facilitate analysis in the following sections. We will also want to know the corresponding day of the week for a given timestamp, so we will do the conversion here as well.
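The UDFs `u_parse_time` and `u_get_weekday` used in the next cell are defined elsewhere in the notebook; here is a minimal sketch of what they might look like, assuming the 0 = Sunday weekday convention that the output below uses:

```python
from datetime import datetime

# Sketch of the two conversion helpers (assumptions, not the notebook's
# exact definitions): parse a 12-hour timestamp into 24-hour form, and
# compute the day of the week with 0 = Sunday ... 6 = Saturday.
def parse_time(s):
    """'01/01/2015 01:00:00 PM' -> '2015-01-01 13:00:00'."""
    return datetime.strptime(s, '%m/%d/%Y %I:%M:%S %p').strftime('%Y-%m-%d %H:%M:%S')

def get_weekday(s):
    """Day of week for the same timestamp format, with 0 = Sunday."""
    d = datetime.strptime(s, '%m/%d/%Y %I:%M:%S %p')
    return (d.weekday() + 1) % 7  # datetime.weekday() uses 0 = Monday

# In Spark these would be wrapped as UDFs, e.g.:
# from pyspark.sql.functions import udf
# u_parse_time = udf(parse_time)
# u_get_weekday = udf(get_weekday)
```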

timed_service_df = (service_df
                    .select(u_parse_time(service_df['callDateTime']).alias('dateTime'),
                            u_get_weekday(service_df['callDateTime']).alias('weekday'),
                            '*')
                    .drop('callDateTime')
                    .cache())

Let's check that the conversions have been made.

dateTime weekday priority district description callNumber incidentLocation location
0 2015-01-01 13:00:00 4 Medium CD Sick Case 150011416 1600 W MOUNT ROYAL (,)
1 2015-01-01 01:01:00 4 Medium SD Narcotics Inside 150010203 600 ARCHER ST (39.2819779,-76.6302082)
2 2015-01-01 01:01:00 4 High SW Dischrg Firearm 150010204 2100 WHISTLER AV (39.2643810,-76.6503750)
3 2015-01-01 01:01:00 4 Medium ED 911/No Voice 150010205 600 N WOLFE ST (39.2968573,-76.5916191)
4 2015-01-01 13:01:00 4 Low ND Other 150011412 3400 GREENWAY (39.3293529,-76.6177246)
5 2015-01-01 13:01:00 4 Medium NE 911/No Voice 150011413 7600 HARFORD RD (39.3721019,-76.5446527)
6 2015-01-01 13:01:00 4 Medium ED NarcoticsOutside 150011415 1300 N MILTON AV (39.3061440,-76.5830230)
7 2015-01-01 13:01:00 4 Medium WD NarcoticsOutside 150011417 W BALTIMORE ST/N CALHOUN ST (39.288494,-76.639561)
8 2015-01-01 13:01:00 4 Medium ND Disorderly 150011418 6200 THE ALAMEDA (39.3688984,-76.5936018)
9 2015-01-01 01:02:00 4 Low SE Other 150010207 DILLON ST/S STREEPER ST (,)

3. Analysis

3.1. Distribution of call priorities

Now we are ready to explore this data set. Let's start by checking how the 911 calls were distributed in terms of priority. To do this, we can use the groupBy() and count() functions in Spark.

priority_summary_df = (timed_service_df
                       .groupBy('priority')
                       .count()
                       .sort('count', ascending=False)
                    )

priority_summary_df.show()
+--------------+------+
|      priority| count|
+--------------+------+
|        Medium|916796|
|           Low|426095|
|          High|276784|
| Non-Emergency|206168|
|              |  6655|
|     Emergency|   700|
|Out of Service|   643|
+--------------+------+

It appears that most calls are of medium or low priority. Let's make a visual comparison.
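One way to sketch that comparison, using the counts from the table above (a plain matplotlib bar chart; not necessarily the notebook's original figure):

```python
import matplotlib
matplotlib.use('Agg')  # headless-safe backend
import matplotlib.pyplot as plt

# Counts copied from the priority_summary_df output above;
# '(blank)' stands in for the empty priority string.
priorities = ['Medium', 'Low', 'High', 'Non-Emergency', '(blank)',
              'Emergency', 'Out of Service']
counts = [916796, 426095, 276784, 206168, 6655, 700, 643]

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(range(len(priorities)), counts)
ax.set_xticks(range(len(priorities)))
ax.set_xticklabels(priorities, rotation=45, ha='right')
ax.set_ylabel('Number of calls')
ax.set_title('911 calls by priority')
fig.tight_layout()
```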

3.2. Call distribution among different police districts

Is the demand for service greater in certain districts than in others? Let's find out how many calls originated in each district. Again, the groupBy() and count() functions can help us accomplish this task.

district_summary_df = (timed_service_df
                      .groupBy('district')
                      .count()
                      .sort('count', ascending=False)
                      .cache())
district_summary_df.show()
+--------+------+
|district| count|
+--------+------+
|      NE|273211|
|      SD|200468|
|      CD|200221|
|      SW|200186|
|      SE|199221|
|      ND|185362|
|      NW|181499|
|      WD|174578|
|      ED|171093|
|      CW| 23374|
|     TRU| 18724|
|      SS|  5792|
|    EVT1|    44|
|    EVT2|    43|
|    INFO|    11|
|    EVT3|    10|
|      HP|     2|
|    FIR1|     2|
+--------+------+

ninth_count = float(district_summary_df.take(9)[8][1])
call_percent_diff = (float(district_summary_df.first()[1]) - ninth_count) * 100. / ninth_count
print('The first nine entries are the publicized police districts in Baltimore City. The data show that the NE district received the highest number of calls, and the ED received the lowest number of calls. The difference is almost {:.2f}%.'.format(call_percent_diff))
The first nine entries are the publicized police districts in Baltimore City. The data show that the NE district received the highest number of calls, and the ED received the lowest number of calls. The difference is almost 59.69%.

3.3. Top reasons for calling 911

Knowing that citizens may call 911 for various reasons, let's tally up the calls by descriptions.

desc_summary_df = (timed_service_df
                     .groupBy('description')
                     .count()
                     .orderBy('count', ascending=False)
                     .cache())
desc_summary_df.show()

initial_desc_count = desc_summary_df.count()
print("Currently, there are {0} distinct versions of descriptions in the data set.".format(initial_desc_count))
+----------------+------+
|     description| count|
+----------------+------+
|   911/NO  VOICE|310843|
|      DISORDERLY|151223|
|    Traffic Stop|116479|
|  COMMON ASSAULT| 74279|
|   AUTO ACCIDENT| 66885|
|NARCOTICSOutside| 56831|
|    SILENT ALARM| 49206|
| Repairs/Service| 44072|
|  FAMILY DISTURB| 43465|
|           OTHER| 41833|
|     INVESTIGATE| 37782|
|        BURGLARY| 34817|
|     HIT AND RUN| 32727|
|    911/No Voice| 32233|
|        LARCENCY| 29086|
| Field Interview| 24410|
| DESTRUCT PROPTY| 22988|
| SUSPICIOUS PERS| 21670|
|  LARCENY F/AUTO| 19125|
|     Lab Request| 18873|
+----------------+------+
only showing top 20 rows

Currently, there are 6133 distinct versions of descriptions in the data set.

Our preliminary results already give some interesting insights. However, visual examination shows that the current tally is neither efficient nor accurate: the system reported 6133 different descriptions, yet some of these so-called "different" descriptions are in fact identical. This is because a word was sometimes spelled in different ways (e.g., "LARCENCY" clearly means larceny), and a description was sometimes recorded in all upper-case letters as opposed to title case. Let's clean up the data set by casting all descriptions into lower-case letters, correcting some obvious spelling mistakes, and removing multiple spaces between words in a phrase.
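The cleaning step that produces `cleaned_timed_service_df` is not shown; a sketch of the rules just described (the actual list of spelling corrections in the notebook may be longer — "larcency" → "larceny" is one fix visible in the data):

```python
import re

# Assumed spelling fixes; the notebook likely corrects more than this one.
SPELLING_FIXES = {'larcency': 'larceny'}

def normalize_descr(descr):
    """Lower-case, collapse repeated spaces, and fix obvious misspellings."""
    s = descr.lower().strip()
    s = re.sub(r'\s+', ' ', s)  # collapse multiple spaces into one
    for wrong, right in SPELLING_FIXES.items():
        s = s.replace(wrong, right)
    return s

# In Spark this could be applied as a UDF to build the cleaned DataFrame:
# from pyspark.sql.functions import udf
# cleaned_timed_service_df = timed_service_df.withColumn(
#     'descr', udf(normalize_descr)(timed_service_df['description']))
```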

desc_summary = cleaned_timed_service_df.groupBy('descr').count().sort('count', ascending=False).cache()
desc_summary.show()

new_desc_count = desc_summary.count()
print("Currently, there are {0} distinct versions of descriptions in the data set, {1} fewer distinct descriptions from the initial count.".format(new_desc_count, initial_desc_count-new_desc_count))
+----------------+------+
|           descr| count|
+----------------+------+
|    911/no voice|343097|
|      disorderly|162497|
|    traffic stop|116480|
|  common assault| 80186|
|   auto accident| 73114|
|narcoticsoutside| 60441|
|           other| 59893|
|    silent alarm| 54749|
|  family disturb| 47784|
| repairs/service| 44072|
|     investigate| 37783|
|        burglary| 37772|
|         larceny| 37683|
|     hit and run| 35538|
| destruct propty| 24971|
| field interview| 24410|
| suspicious pers| 23389|
|  larceny f/auto| 20499|
|     foot patrol| 19947|
|     lab request| 18873|
+----------------+------+
only showing top 20 rows

Currently, there are 5883 distinct versions of descriptions in the data set, 250 fewer distinct descriptions from the initial count.

After taking our first step to standardize the descriptions, we cut the number of distinct descriptions down by 250. The description that tops the list is "911/no voice." At this point we don't know whether this means that sound was not transmitted over the phone, that the caller hung up before speaking to an operator, or that the calls were text messages. The data set did not come with any description of its codes/phrases.

Let's take a look at other top reasons that Baltimoreans call 911.

top6desc = desc_summary.take(6)  # take() already returns a list; index 0 is '911/no voice', skipped below

print('The top reasons that people called 911 are: {0}, {1}, {2}, {3}, and {4}.'.format(str(top6desc[1][0]), str(top6desc[2][0]), str(top6desc[3][0]), str(top6desc[4][0]), str(top6desc[5][0])))
The top reasons that people called 911 are: disorderly, traffic stop, common assault, auto accident, and narcoticsoutside.

3.4. Phone numbers were masked

NCallNum = cleaned_timed_service_df.select('callNumber').distinct().count()
NRecord = cleaned_timed_service_df.count()
print('There are {0:,} call numbers among {1:,} call records.'.format(NCallNum, NRecord))
There are 1,833,841 call numbers among 1,833,841 call records.

It appears that phone numbers were masked, so we can't tell if the same caller called 911 multiple times.

3.5. Day/Time of calls

Now it is time to explore the information embedded in the timestamps of the calls. The data set contains 911 calls from all of 2015 and the first nine months of 2016. Let's isolate and look closer at the 2015 call logs.

from pyspark.sql.functions import year, month, dayofmonth, hour

calls2015 = (cleaned_timed_service_df
             .select(year('dateTime').alias('year'), month('dateTime').alias('month'), dayofmonth('dateTime').alias('day'), hour('dateTime').alias('hour'),'*')
             .filter('year == 2015')
             .cache()
             )

numOf2015Records = calls2015.count()         
print('{0:,} call records belong to 2015.'.format(numOf2015Records))
1,071,776 call records belong to 2015.

3.5.1. Call volume by month

Let's tally up the calls by month and visualize the results.

callsByMonth = calls2015.groupBy('month').count().sort('month').cache()
callsByMonth.show()
+-----+-----+
|month|count|
+-----+-----+
|    1|83994|
|    2|79637|
|    3|86243|
|    4|90418|
|    5|89755|
|    6|93634|
|    7|97513|
|    8|95740|
|    9|93253|
|   10|90298|
|   11|84687|
|   12|86604|
+-----+-----+
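The visualization referred to below can be sketched directly from these tallies (again a plain matplotlib version with the counts copied from the table, not necessarily the notebook's original figure):

```python
import matplotlib
matplotlib.use('Agg')  # headless-safe backend
import matplotlib.pyplot as plt

# Monthly tallies copied from the callsByMonth.show() output above.
months = list(range(1, 13))
month_counts = [83994, 79637, 86243, 90418, 89755, 93634,
                97513, 95740, 93253, 90298, 84687, 86604]

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(months, month_counts)
ax.set_xticks(months)
ax.set_xlabel('Month (2015)')
ax.set_ylabel('Number of calls')
ax.set_title('911 call volume by month, 2015')
fig.tight_layout()
```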

With the help of the visualization, we can see that there is an uptick of 911 calls during the warmer months, i.e., June, July, August, and September.

3.5.2. Call volume by day

Was the call volume relatively stable throughout the year, or did it peak on certain days? We can answer this question by tallying up the calls made on each day of 2015.

callsByDay_df = (calls2015
             .groupBy('month', 'day')
             .count()
             .sort('month', 'day')
             .cache()
             )
month      1       2       3       4       5       6       7       8       9       10      11      12
day                                                                                                  
1      2549.0  2817.0  2390.0  3178.0  2541.0  2893.0  3304.0  3093.0  3260.0  2966.0  2923.0  2823.0
2      2603.0  2827.0  2781.0  3312.0  2645.0  2900.0  3514.0  2962.0  3138.0  3104.0  2822.0  2942.0
3      2258.0  2978.0  2525.0  3380.0  2663.0  3052.0  3408.0  3096.0  3198.0  2816.0  3059.0  3006.0
4      2192.0  2935.0  2581.0  3086.0  2812.0  2994.0  3374.0  2988.0  3349.0  2764.0  2973.0  2905.0
5      2694.0  3044.0  1935.0  2754.0  2993.0  3234.0  3213.0  3182.0  3044.0  3118.0  3068.0  2733.0
6      2230.0  3204.0  2523.0  3040.0  2918.0  3165.0  1999.0  3080.0  2908.0  3155.0  3320.0  2555.0
7      2376.0  3077.0  2552.0  2950.0  2939.0  2978.0  3260.0  3280.0  3023.0  3271.0  2998.0  2759.0
8      2334.0  2788.0  2665.0  2914.0  3189.0  3108.0  3220.0  3113.0  3313.0  3480.0  2558.0  2627.0
9      2460.0  2824.0  2861.0  2975.0  2860.0  3233.0  3147.0  2952.0  3073.0  3294.0  2809.0  2780.0
10     2435.0  2870.0  2683.0  2831.0  2632.0  3048.0  3207.0  2916.0  3141.0  1290.0  2913.0  3096.0
11     2414.0  3126.0  3016.0  3037.0  3001.0  3060.0  3018.0  3424.0  3318.0  2082.0  2842.0  3127.0
12     2469.0  3165.0  3128.0  2682.0  3024.0  3090.0  2950.0  3288.0  3137.0  2967.0  2853.0  3047.0
13     2737.0  3323.0  3107.0  3010.0  2755.0  3102.0  3231.0  3132.0  2769.0  3221.0  3059.0  2683.0
14     2714.0  2824.0  2806.0  2795.0  2966.0  3094.0  3220.0  3248.0  3106.0  3065.0  2904.0  2875.0
15     2790.0  2345.0  2677.0  3105.0  3161.0  3309.0  3172.0  3053.0  2998.0  3027.0  2735.0  2837.0
16     3087.0  2397.0  2886.0  3313.0  2837.0  3104.0  3111.0  2854.0  2979.0  3058.0  2904.0  2994.0
17     3032.0  2558.0  3028.0  3297.0  2766.0  3217.0  3098.0  3076.0  3153.0  2740.0  2944.0  3028.0
18     2580.0  2707.0  2787.0  3498.0  2824.0  3107.0  3119.0  3317.0  3113.0  2378.0  2816.0  2933.0
19     2769.0  2978.0  3013.0  2920.0  3182.0  3370.0  2962.0  3113.0  3292.0  2674.0  3009.0  2705.0
20     3028.0  2926.0  2871.0  2865.0  3075.0  3187.0  2974.0  3157.0  3020.0  2951.0  3264.0  2454.0
21     2709.0  2465.0  2806.0  3026.0  2695.0  2892.0  3100.0  3192.0  2889.0  3030.0  2898.0  2820.0
22     2920.0  2259.0  2675.0  3013.0  2916.0  3309.0  3085.0  3141.0  3119.0  3117.0  2569.0  2721.0
23     3118.0  2746.0  2702.0  2728.0  2490.0  3040.0  3167.0  2930.0  3031.0  3489.0  2523.0  2754.0
24     2947.0  2848.0  2901.0  2997.0  2766.0  3277.0  3266.0  3122.0  3087.0  3083.0  2701.0  2754.0
25     2698.0  2809.0  2867.0  2867.0  2696.0  3110.0  3232.0  3327.0  3160.0  2807.0  2584.0  2387.0
26     2780.0  2896.0  3009.0  2621.0  3120.0  3399.0  2959.0  1730.0  3187.0  2669.0  2214.0  2434.0
27     2695.0  3129.0  3138.0  2418.0  3159.0  2821.0  3092.0  3440.0  2848.0  2755.0  2648.0  2721.0
28     2945.0  2772.0  2831.0  4441.0  3186.0  3064.0  3256.0  3321.0  3092.0  2758.0  2791.0  2715.0
29     2989.0     NaN  2734.0  2800.0  3263.0  3252.0  3236.0  3140.0  3177.0  2993.0  2430.0  2905.0
30     3403.0     NaN  2874.0  2565.0  2899.0  3225.0  3263.0  2869.0  3331.0  3119.0  2556.0  2703.0
31     3039.0     NaN  2891.0     NaN  2782.0     NaN  3356.0  3204.0     NaN  3057.0     NaN  2781.0

Let's visualize this result.
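A sketch of how the day-by-month grid and heatmap might be produced — assuming `callsByDay_df` is converted to pandas and pivoted; only a few sample rows are used here for illustration:

```python
import matplotlib
matplotlib.use('Agg')  # headless-safe backend
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# In the notebook this would be callsByDay_df.toPandas(); a few sample
# (month, day, count) rows from the table above stand in for it here.
sample = pd.DataFrame({'month': [1, 4, 4, 10],
                       'day':   [1, 27, 28, 10],
                       'count': [2549, 2418, 4441, 1290]})
pivoted = sample.pivot(index='day', columns='month', values='count')

fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(pivoted, ax=ax, cmap='YlOrRd')
ax.set_title('911 calls per day, 2015')
fig.tight_layout()
```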

This heatmap shows the total number of 911 calls for each day in 2015. One striking feature is the intense color for April 28, which we recognize as the day after the civic unrest in Baltimore. That day saw 4441 registered calls, the largest number for any single day in the data set. While this may seem reasonable at first, like me, you may then ask: why does April 27, the day of the unrest itself, look so pale in comparison? Indeed, according to news reports, more than 11,000 calls were made to the 911 call center on April 27, 2015, yet our data set registers only 2418 calls for that day. Could the online data set be truncated?

3.5.3. Call volume by day of the week

Here we tally the calls by the day of the week. Note that in the following table, "0" represents Sunday, "1" represents Monday, etc.

callByDay = (calls2015
             .groupBy('weekday')
             .count()
             .sort('weekday')
             .cache()
             )
callByDay.show()
+-------+------+
|weekday| count|
+-------+------+
|      0|141703|
|      1|149178|
|      2|156376|
|      3|153477|
|      4|158682|
|      5|161583|
|      6|150777|
+-------+------+

It seems like call volume tended to increase as the week progressed, and it was highest on Fridays and lowest on Sundays.

3.5.4. Call volume by hour of the day

It might be interesting to see if there are any variations in call volume over a 24-hour period. Presumably this would help management schedule operators more effectively to respond to the ebbs and flows of demand.

callByHour = (calls2015
              .groupBy('hour')
              .count()
              .sort('hour')
              .cache()
              )

callByHour.show(n=24)
+----+-----+
|hour|count|
+----+-----+
|   0|54102|
|   1|30591|
|   2|25003|
|   3|18330|
|   4|14508|
|   5|13536|
|   6|16409|
|   7|25955|
|   8|37860|
|   9|42868|
|  10|47983|
|  11|52614|
|  12|38028|
|  13|55798|
|  14|62006|
|  15|62125|
|  16|63114|
|  17|66958|
|  18|65600|
|  19|62113|
|  20|60693|
|  21|58134|
|  22|51505|
|  23|45943|
+----+-----+

This plot of call volume at different hours of the day traces out the cycle of human activity over 24 hours: call volumes gradually increased from the early morning, peaked at 5 PM, stayed high until about 9 PM, and tapered off into the wee hours.

3.6. Locations with the most incidents

Given that the call logs contain addresses, let's identify the locations where incidents occur most frequently.

location_count = calls2015.select('incidentLocation').distinct().count()
print("There are {0:,} distinct locations in the data set.".format(location_count))
There are 56,664 distinct locations in the data set.
calls2015.groupBy('incidentLocation').count().sort('count',ascending=False).show()
+--------------------+-----+
|    incidentLocation|count|
+--------------------+-----+
|                 100|16045|
|    600 E FAYETTE ST| 8888|
|      1000 EDISON HY| 4973|
| 700 POPLAR GROVE ST| 4278|
|   200 FRANKFURST AV| 3922|
| 3800 W BELVEDERE AV| 3909|
|      400 N FRONT ST| 3877|
|  3000 DRUID PARK DR| 3812|
|  5600 LOCH RAVEN BL| 3756|
|2000-45 WINCHESTE...| 3716|
|      600 N WOLFE ST| 3677|
|     5300 MORAVIA RD| 3633|
|        5200 YORK RD| 3403|
|    2500 AISQUITH ST| 3023|
|  2000 WINCHESTER ST| 3004|
|    500 E PRESTON ST| 2815|
|1800 N PATTERSON ...| 2744|
|     2500 W NORTH AV| 2709|
| 2500 W BELVEDERE AV| 2704|
|  1100 DRUID HILL AV| 2537|
+--------------------+-----+
only showing top 20 rows

It is not clear what the code "100" stands for; nonetheless, those of us who want to avoid unnecessary incidents may want to steer clear of the locations on this list.

3.7. Priorities of call vs. the hours of call

Given the trend we saw in call volume at different times of day, it might be interesting to look at whether the call volumes for calls of different priorities exhibit the same trend. Let's first tally the calls by hour and priority.
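The reshaping step itself is not shown in the notebook; in Spark it could plausibly be `calls2015.groupBy('hour').pivot('priority').count().sort('hour')` (an assumption, not the notebook's code). The same idea in pandas, on a toy sample:

```python
import pandas as pd

# Toy sample of (hour, priority) pairs; the real data would come from
# calls2015. Column names match the tables in this section.
sample = pd.DataFrame({'hour':     [0, 0, 0, 1, 1],
                       'priority': ['Medium', 'Low', 'Medium', 'High', 'Medium']})

# Count calls per (hour, priority) cell, then unstack priorities into
# columns -- the same shape as the table below.
by_hour = (sample.groupby(['hour', 'priority']).size()
                 .unstack(fill_value=0))
```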

+----+---+---------+----+-----+------+-------------+--------------+
|hour|   |Emergency|High|  Low|Medium|Non-Emergency|Out of Service|
+----+---+---------+----+-----+------+-------------+--------------+
|   0|213|       20|6547|13200| 28852|         5236|            34|
|   1|325|       19|5348| 6017| 14877|         4005|             0|
|   2|236|       12|3992| 5014| 12577|         3172|             0|
|   3|168|        7|2968| 3482|  9045|         2660|             0|
|   4| 99|        8|2282| 2577|  7120|         2422|             0|
|   5| 60|        3|2020| 2571|  6621|         2260|             1|
|   6| 67|        2|2194| 3759|  8236|         2143|             8|
|   7| 92|        1|3148| 6858| 12440|         3409|             7|
|   8|134|        5|5110| 9583| 17804|         5206|            18|
|   9|152|       10|5937|10759| 20869|         5110|            31|
|  10|152|       24|6687|12063| 23692|         5330|            35|
|  11|203|       24|6789|12849| 27474|         5251|            24|
|  12|323|       22|7365| 7348| 17923|         5047|             0|
|  13|209|       19|6544|13399| 29920|         5661|            46|
|  14|263|       25|7440|16468| 31124|         6628|            58|
|  15|372|       25|7243|17328| 31426|         5682|            49|
|  16|339|       28|6991|17769| 32231|         5722|            34|
|  17|469|       25|8545|18280| 32767|         6842|            30|
|  18|395|       30|9729|16909| 31906|         6595|            36|
|  19|462|       32|9814|15738| 29505|         6537|            25|
|  20|513|       25|9833|14391| 29454|         6452|            25|
|  21|558|       28|9327|12796| 29293|         6116|            16|
|  22|537|       22|8291|10836| 26159|         5660|             0|
|  23|314|       32|8707| 9223| 21424|         6243|             0|
+----+---+---------+----+-----+------+-------------+--------------+

Now let's visualize the results.

This plot is interesting as it

  • confirms that Medium- and Low-priority calls dominated throughout the day,
  • confirms that calls of all priorities approximately followed the cycle of human activity over a 24-hour period, and
  • shows that the volumes of Medium- and Low-priority calls dropped drastically between 12 and 1 PM, suggesting that a common cause (lunch, perhaps) disrupted the trend in those hours. In contrast, the volumes of Emergency, High, and Non-Emergency calls did not abate during those hours.

Since call volumes for the different priorities differ by orders of magnitude, the Emergency call volume is reduced to a flat line in the plot above. We can look at a log plot of the data to examine variations in the number of Emergency calls more closely.

This plot confirms that the volume of Emergency calls also followed the 24-hour cycle of human activity.

3.8. Types of incident & the hours of call

Given that 911 calls are citizens' reports of incidents across the city, let's see if there are any trends for certain incidents at different hours of the day. Here I decided to focus on reports of "auto accident," "auto theft," "burglary," "common assault," "disorderly," and "larceny." We can use the filter() function in Spark to select the rows of interest.

hour_descr_select = (calls2015
                   .filter(calls2015.descr.isin("disorderly", "common assault",
                                                "auto accident", "auto theft",
                                                "burglary", "larceny"))
                   .select('hour', 'descr')
                   .cache()
                   )
+----+-------------+----------+--------+--------------+----------+-------+
|hour|auto accident|auto theft|burglary|common assault|disorderly|larceny|
+----+-------------+----------+--------+--------------+----------+-------+
|   0|         2299|       572|    1260|          1949|      4235|   1475|
|   1|          685|       254|     670|          1746|      3588|    325|
|   2|          693|       217|     592|          1447|      3228|    258|
|   3|          340|       185|     485|          1053|      2148|    215|
|   4|          243|       185|     444|           704|      1338|    153|
|   5|          426|       279|     385|           517|      1077|    148|
|   6|          765|       500|     383|           528|      1138|    223|
|   7|         1852|       682|     702|           810|      1489|    362|
|   8|         2198|       840|     834|          1099|      2227|    606|
|   9|         1873|       679|    1045|          1355|      2611|    906|
|  10|         1922|       670|    1276|          1574|      3176|   1116|
|  11|         2092|       620|    1203|          1822|      3607|   1298|
|  12|          806|       302|     741|          1887|      4154|    420|
|  13|         2320|       571|    1182|          2132|      4467|   1535|
|  14|         2964|       548|    1174|          2463|      4725|   1736|
|  15|         3085|       523|    1261|          2933|      5289|   1778|
|  16|         3521|       568|    1349|          3008|      5842|   1791|
|  17|         3416|       534|    1420|          2986|      6184|   1731|
|  18|         2718|       472|    1329|          2900|      6153|   1475|
|  19|         1917|       393|    1168|          2836|      5786|   1390|
|  20|         1600|       437|    1181|          2930|      5838|   1168|
|  21|         1345|       398|    1146|          3012|      5961|    901|
|  22|         1318|       449|    1032|          2606|      5365|    658|
|  23|         1069|       371|     771|          2134|      4657|    516|
+----+-------------+----------+--------+--------------+----------+-------+

Let's visualize the results.

This plot shows some interesting trends for the chosen incidents. For example,

  • The trends for disorderly conduct and common assault clearly show the characteristics of a 24-hour cycle of human activity. The trends for burglary and larceny follow the same pattern, even though the numbers of incidents in these categories were smaller.
  • The trend for auto accidents, on the other hand, suggests that it was strongly influenced by the large number of people on the road during 7-11 AM and 1-6 PM.
  • Reports of auto theft were highest at 7 and 8 o'clock in the morning, possibly because most people were heading to work and needed their cars in those hours.
  • The hour with the largest number of burglary reports was 5 PM; more generally, burglary reports rose by 10 AM and stayed high until 11 PM.
  • Reports of auto accidents, burglary, auto theft, and larceny all registered a dip around noon, suggesting that either the occurrence or the discovery of such incidents was disrupted by a common cause. In comparison, reports of disorderly conduct and common assault did not abate even in the noon hours.

3.9. Types of incident & the days of the week

Are certain days of the week particularly prone to incidents? Let's find out.

weekday_descr_select = (calls2015
                   .filter(calls2015.descr.isin("disorderly", "common assault",
                                                "auto accident", "auto theft",
                                                "burglary", "larceny"))
                   .select('weekday', 'descr')
                   .cache()
                   )
+-------+-------------+----------+--------+--------------+----------+-------+
|weekday|auto accident|auto theft|burglary|common assault|disorderly|larceny|
+-------+-------------+----------+--------+--------------+----------+-------+
|      0|         4518|      1492|    2646|          6854|     13892|   2671|
|      1|         5697|      1774|    3888|          6534|     12275|   3384|
|      2|         6281|      1719|    4331|          6649|     12780|   3415|
|      3|         6235|      1572|    3271|          6191|     12844|   3230|
|      4|         6632|      1624|    3169|          6516|     13127|   3340|
|      5|         6678|      1526|    3021|          6669|     14251|   3144|
|      6|         5426|      1542|    2707|          7018|     15114|   3000|
+-------+-------------+----------+--------+--------------+----------+-------+

According to the data, Mondays and Tuesdays had the highest numbers of burglary and larceny incidents, while Saturdays and Sundays were the worst days for disorderly behavior and common assault. As one might expect, auto accidents were lowest on Saturdays and Sundays.

3.10. Case study: the top reason for calling 911 in any given day

In the above analysis, we have seen that 911 call volume was influenced by events throughout the year. We have also seen that the top reasons Baltimoreans called 911 were: disorderly, traffic stop, common assault, auto accident, and narcotics outside. Let's see what we can learn from the data if we look at the intersection of these two features. More specifically, let's look at the top reason for 911 calls for each day in 2015.

First, let's filter out calls described as '911/no voice' and 'other,' since we can't assess the reasons behind them. Then we will group the data by date and description of the call. For each date, we will pick the call description with the highest count.
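The grouping step that produces `descrWMaxCount` is not shown. In Spark it could be done with a window function ranking counts within each (month, day) partition; the equivalent idea in pandas, sketched on a toy sample (column names are assumptions matching the output below):

```python
import pandas as pd

# Toy sample of per-(month, day, descr) counts; the real input would be
# the grouped 2015 call data. Values are taken from the output below.
daily = pd.DataFrame({'month': [1, 1, 1, 1],
                      'day':   [1, 1, 2, 2],
                      'descr': ['disorderly', 'traffic stop',
                                'traffic stop', 'auto accident'],
                      'count': [242, 230, 230, 180]})

# For each date, keep the row with the highest count.
idx = daily.groupby(['month', 'day'])['count'].idxmax()
descr_w_max_count = daily.loc[idx].reset_index(drop=True)
```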

+-----+---+-------------+-----+
|month|day|        descr|count|
+-----+---+-------------+-----+
|    1|  1|   disorderly|  242|
|    1|  2| traffic stop|  230|
|    1|  3| traffic stop|  196|
|    1|  4|   disorderly|  176|
|    1|  5| traffic stop|  258|
|    1|  6|auto accident|  183|
|    1|  7| traffic stop|  187|
|    1|  8| traffic stop|  215|
|    1|  9|   disorderly|  215|
|    1| 10|   disorderly|  227|
|    1| 11| traffic stop|  224|
|    1| 12| traffic stop|  213|
|    1| 13| traffic stop|  289|
|    1| 14| traffic stop|  333|
|    1| 15| traffic stop|  311|
|    1| 16| traffic stop|  445|
|    1| 17| traffic stop|  418|
|    1| 18| traffic stop|  303|
|    1| 19| traffic stop|  242|
|    1| 20| traffic stop|  338|
+-----+---+-------------+-----+
only showing top 20 rows

With the following command, we see that only four descriptions ever appear as the daily top reason for Baltimoreans to call 911.

descrWMaxCount.select('descr').distinct().show()
+-------------+
|        descr|
+-------------+
| traffic stop|
|auto accident|
|     burglary|
|   disorderly|
+-------------+

At this point, we realize that April 14, 2015 had two competing top reasons for 911 calls.

descrWMaxCount.filter((descrWMaxCount.month == 4) & (descrWMaxCount.day == 14)).show()
+-----+---+------------+-----+
|month|day|       descr|count|
+-----+---+------------+-----+
|    4| 14|traffic stop|  204|
|    4| 14|  disorderly|  204|
+-----+---+------------+-----+

For the visualization we will be making, let's drop one of those entries.

newDescrWMaxCount = descrWMaxCount.dropDuplicates(['month', 'day']).cache()
newDescrWMaxCount.filter((descrWMaxCount.month == 4) & (descrWMaxCount.day == 14)).show()
+-----+---+------------+-----+
|month|day|       descr|count|
+-----+---+------------+-----+
|    4| 14|traffic stop|  204|
+-----+---+------------+-----+

newDescrWMaxCount.count()
365

Now that we have ensured there is only one entry for each day, let's make a visualization.

This heatmap shows that the daily top reasons were dominated by disorderly conduct (red) and traffic stops (blue) throughout the year. In particular, traffic stops topped the list most often in the colder winter months, whereas disorderly conduct stayed high all year. The two yellow days on which burglary reports peaked clearly correlate with the civil unrest. The four cyan days on which auto accidents topped the list suggest a link to weather conditions; a preliminary investigation shows that these four days were plagued by snow, rain, or fog.
