Abstract: I analyzed the Baltimore City Employee Salaries FY2015 data set at Baltimore Open Data Portal. The data set recorded 14017 employees working for the City in 2015, and it provided employee information in 7 categories, which are name, job title, Agency ID, Agency, Hire Date, Annual Salary, and Gross Pay. After cleaning and preparing the data set, I analyzed the distributions of gross pay and tenure of City employees. The overall gross pay distribution has a median of $52753.13, and the middle 50% of city workers received pay between $34669.21 and $75771.01. The highest salary was $235767.92. The overall tenure distribution has a median of 9.95 years. The longest tenure was 63.02 years. The middle 50% of people had put in 4.58 to 19.78 years working for the City. While analyzing the distributions of gross pay and tenure with respect to the 30 agencies, I found that the Police Department, with 3097 employees, was the largest employer among different city agencies in 2015. The ratios of highest to median pays among different city agencies were around 3 except Municipal & Zoning Appeals and Recreation & Parks, whose ratios were more than 12. In terms of tenure, Legislative Reference had the longest median tenure (26.56 years) among different agencies.

Tags: data cleaning data analysis R

1 Introduction

As part of the efforts to provide transparency and build trust in the local government, Baltimore City has given the public online access to a slew of data related to the City’s operation. Among the data sets available at Baltimore Open Data Portal are city employee salaries since 2011. In this project, I will focus on analyzing the salary data from fiscal year 2015. In the follow-up projects, I will look at trends in the salary data from 2011 to 2015 and try to model the pay-scale based on the information available in the data sets.


First, let read in the data file and see what kind of information it contains.

BmoreSalaries2015 <- read.csv("Baltimore_City_Employee_Salaries_FY2015.csv", stringsAsFactors=FALSE)
BmoreSalaries2015 <- tbl_df(BmoreSalaries2015)
## Observations: 14,017
## Variables: 7
## $ name         (chr) "Aaron,Patricia G", "Aaron,Petra L", "Abaineh,Yohannes T", "Abbene,Anthony M", "Abbe...
## $ JobTitle     (chr) "Facilities/Office Services II", "ASSISTANT STATE'S ATTORNEY", "EPIDEMIOLOGIST", "PO...
## $ AgencyID     (chr) "A03031", "A29045", "A65026", "A99005", "A40001", "A90005", "A64120", "A99127", "A38...
## $ Agency       (chr) "OED-Employment Dev (031)", "States Attorneys Office (045)", "HLTH-Health Department...
## $ HireDate     (chr) "10/24/1979", "09/25/2006", "07/23/2009", "07/24/2013", "05/01/2013", "11/28/2014", ...
## $ AnnualSalary (chr) "$55314.00", "$74000.00", "$64500.00", "$46309.00", "$60060.00", "$42702.00", "$6217...
## $ GrossPay     (chr) "$53626.04", "$73000.08", "$64403.84", "$59620.16", "$54059.60", "$20250.80", "$8375...

The data set records 14017 employees working for the City in 2015. For each employee, the data set contains information on 7 categories, which are:

## [1] "name"         "JobTitle"     "AgencyID"     "Agency"       "HireDate"     "AnnualSalary" "GrossPay"

We see that all information, including AnnualSalary, GrossPay and HireDate, are read in as character strings. In order to do calculation and make plots using these data, we need to first convert them to the proper data types. In the following, I will first take steps to prepare the data set for analysis, and then use the data to answer the these questions:

2 Data Preparation

These are the major steps I will take to prepare the data set for analysis:

First, let’s convert the name of the name feature to upper camel case so we don’t have to remember which variable starts with an upper case letter and which one starts with a lower case letter.

names(BmoreSalaries2015)[1] <- "Name"

2.1 Format Salary Data

AnnualSalary is currently represented as character strings starting with a “$” sign. For analysis, I need to get rid of the “$” signs and convert those values to numeric. Note that AnnualSalary data do not include cents.

BmoreSalaries2015$AnnualSalary <- str_replace(BmoreSalaries2015$AnnualSalary, "[$]", "")
BmoreSalaries2015$AnnualSalary <- as.numeric(BmoreSalaries2015$AnnualSalary)

Do the same to GrossPay data. Note that GrossPay data do include cents.

BmoreSalaries2015$GrossPay <- str_replace(BmoreSalaries2015$GrossPay, "[$]", "")
BmoreSalaries2015$GrossPay <- as.numeric(BmoreSalaries2015$GrossPay)

According to Baltimore Open Data Portal,

Gross salary includes overtime, furloughs, adjustments. Contractual, temp, part-time or summer clerks/fellows, gross reflects actual amounts paid according to terms of employment. Comp & leave time excluded.

On the other hand,

Annual salary is projected over the full fiscal year. Contractual, temp, part-time or summer clerks/fellows, annual salary is the equivalent full-time annual salary. Comp & leave time excluded.

In other words, for someone who worked for the City during the summer months, his/her AnnualSalary was calculated based on the rate he/she was paid each month in the summer, multiplied with 12 to get the full year salary irrespective of his/her employment situation with the City in months other than the summer. In light of this information, I will use GrossPay to gauge the actual compensation an employee received.

Looking at the GrossPay data, we find that 269 individuals with records in the data set did not have GrossPay information. Upon further investigation, it appears that these individuals’ records did not have information on GrossPay because they were all hired in the second half of June, leaving them less than a two-week pay cycle before the data were collected.

Not counting the new hires, the following plot shows the distribution of gross pay received by all City employees in 2015. The distribution clearly skews to the right (direction of high gross pay). The median salary was $52753.13. The middle 50% of people received pay between $34669.21 and $75771.01. In the following plot, the red vertical line marks the median pay, and the two blue dashed vertical lines mark the 25 and 75 percentiles of pay respectively.

The highest salary, claimed by Nalewajko Jr,Stephen C of the Police Department (260), was $235767.92. The lowest salary, $0, was recorded for 11 individuals who had job titles such as “SECURITY GUARD CONV. CENTER”, “CONTRACT SERV SPEC II”, “EMT Firefighter Suppression”, “SR COMPANION STIPEND HLTH”, “RECREATION ARTS INSTRUCTOR”, “RECREATION ACTIVITIES ASSISTAN”, “LIFEGUARD I”, and “COMMUNITY AIDE” working at the Convention Center, the Fire Department, the Health Department, and Recreation and Parks. Each of these 11 individuals had at least 2.22 years of service recorded with the City.

2.2 Calculate Tenure of City Employees

I will estimate the tenure of City employees using their HireDate and the date of data collection.

BmoreSalaries2015$HireDate <- mdy(BmoreSalaries2015$HireDate)
DateOfData15 <- mdy("06-30-2015")
BmoreSalaries2015$YrOfSrv <- round(decimal_date(DateOfData15) - decimal_date(BmoreSalaries2015$HireDate), digits=nDecimalPlace)

The distribution for YrOfSrv skews to the right, which is consistent with the expectation for an organization that continues to recruit workers to replace those who had either retired or left the organization. The median tenure of City employees was 9.95 years. The longest tenure, 63.02 years, was achieved by Walter,William of the DPW-Water & Waste Water (101). The middle 50% of people had put in 4.58 to 19.78 years working for the City.

The following plot shows the distribution of tenure of City employees in 2015. The red vertical line marks the median tenure, and the two blue dashed vertical lines mark the 25 and 75 percentiles of tenure respectively.