library(lubridate)
library(dplyr)
library(stringr)
library(ggvis)
library(tidyr)
library(pander)

Abstract: I examined the employment and salary statistics of Baltimore City employees based on the 2011-2015 data from the Baltimore Open Data Portal. Among all City agencies, the Police Department had the largest number of employees over the 2011-2015 time period. Overall Baltimore City had a one-year employee retention rate of 87%, a two-year retention rate of 78%, a three-year retention rate of 71%, and a four-year retention rate of 65%. The number of employees who switched jobs among different City agencies were small, about 1.1%. The result indicates that the majority of employee attrition was due to employees leaving the City workforce all together. It also means that City workers, if they chose to stay working for the City, liked to stay with the same agencies. Analysis at a more granular scale shows that smaller agencies, if they continue to exist over the years, had higher retention rates than larger agencies. From analyzing salary data, I found that every year since 2011 there has been a steady, albeit modest, pay increase for City employees. The Police Department stood out as the only agency that has been experiencing a workforce reduction (since 2012) while increasing its salary spending at the same time. Data showed that the extra spending had impact on majority of the personnel in the Police Department, and was not the results of hiring a few high paying individuals. The data also showed that the Police Department had a higher pay scale than all the other City agencies. The data products from this analysis are available for download here.

Tags: data cleaning data analysis model fitting data visualization R

1 Introduction

This project analyzes Baltimore City Employee Salaries data from 2011 to 2015 provided at the Baltimore Open Data Portal. In particular, by identifying trends and features in these data I hope to uncover some of the consequences of the current policies on personnel allocation and financial support. To carry out this analysis, I will be reading in the data from separate files, clean the data, visualize them, and fit appropriate models to them.

2 Data Preparation

The five data files, one for each fiscal year, contain the same types of information. For example, each file records employee names, hiring dates, job titles, affiliating agencies, and salaries. While the variables in these data files may have names that differ from year to year, the fact that the same types of information were provided for all five years makes processing the data easy. Indeed, one can set up a loop that reads in the data, clean them up, and merge them into a master data set. This loop approach is scalable and saves memory space, so that’s what I did.

In the following, I will first define helper functions to carry out various tasks of cleaning the data and creating additional variables. The steps include:

2.1 Convert Salary Data from Character Strings to Numerics

# Remove "$" signs in salary data and convert them from character type to numeric type 
format_salary_data <- function(strings_salary) {
  numeric_salary <- strings_salary %>% 
    str_replace("[$]", "") %>%
    as.numeric()
}

2.2 Standardize Agency Names

I’ve learned from analyzing the 2015 data that the entries of Agency include the last three digits in the entries of AgencyID some of the times but not all the time. In the following I will remove the three digits reference in Agency (if they appear) and standardize the spellings of the entries in Agency.

2.2.1 Remove redundant location reference in the Agency column

The following function removes the three digit location reference (if they appear). It removes location reference that comes in the form “(nnn)”, “Location xxx”, or simply “(”. Since few departments indicated their employees as part-time, wkly, or full-time, I decided to focus this analysis on the inter-departmental differences only. As a result, this function will also remove any “(part-time)” or “(wkly)” designations.

# Identify and remove any location reference in a given character string
rm_location_ref <- function(cstr) {
  idx <- str_locate(cstr, "[(]|Location [:digit:]{3}|[:digit:]{3}")
  if (is.na(idx[1])) return(cstr) 
  else {
    idx <- idx[1] - 2
    # return(substring(cstr, 1, idx))
    return(str_trim(substring(cstr, 1, idx), "right"))
  }
}

# Apply rm_location_ref function to each character string in the `Agency` column
rm_location_ref_apply2vec <- function(cstr_vec) {
  unlist(lapply(cstr_vec, rm_location_ref))
}

2.2.2 Standardize the spellings of agency names

Some agency names have been spelled in multiple ways. For example,

  • “HLTH-Health Department”, “HLTH-Health Dept”, “HLTH-Heatlh Dept.”, and “HLTH-Health Dept.” will be unified as “HLTH-Health Department.”
  • “Mayor’s Office” and “Mayors Office” will be unified as “Mayor’s Office.”
  • The designation “Fire Academy Recruits” was only used in the 2011 data and it shared the same first three-digit AgencyID, “A64”, as the “Fire Department.” As a result, “Fire Academy Recruits” will be considered as a subsidiary agency of the “Fire Department.”
  • Out of data from all five year, only one person in 2013 was designated as “OED-Participants-Youth Summer.” Since the entry has the same AgencyID, “W02”, as others under “Youth Summer”, the single entry of “OED-Participants-Youth Summer” will be converted into “Youth Summer.”

The following function standardizes names for agencies and their subsidiary departments.

sync_name <- function(agency_vec) {
  agency_vec <- rm_location_ref_apply2vec(agency_vec)
  agency_vec <- agency_vec %>% str_replace("Fire Academy Recruits", "Fire Department")
  agency_vec <- agency_vec %>% str_replace("HLTH-Health Dept Locatio", "HLTH-Health Department")
  agency_vec <- agency_vec %>% str_replace("HLTH-Health Dept. Locatio", "HLTH-Health Department")
  agency_vec <- agency_vec %>% str_replace("HLTH-Heatlh Dept. Locatio", "HLTH-Health Department")
  agency_vec <- agency_vec %>% str_replace("HLTH-Heatlh $", "HLTH-Health Department")
  agency_vec <- agency_vec %>% str_replace("HLTH-Health Dept$", "HLTH-Health Department")
  agency_vec <- agency_vec %>% str_replace("HLTH-Health Dept.$", "HLTH-Health Department")
  agency_vec <- agency_vec %>% str_replace("HLTH-Heatlh Dept.$", "HLTH-Health Department")
  agency_vec <- agency_vec %>% str_replace("Mayors Office", "Mayor's Office")
  agency_vec <- agency_vec %>% str_replace("Civil Rights & Wage Enfor$", "Civil Rights & Wage Enforce")
  agency_vec <- agency_vec %>% str_replace("COMP-Communication Ser$", "COMP-Communication Services")
  agency_vec <- agency_vec %>% str_replace("COMP-Comptroller's O$", "COMP-Comptroller's Office")
  agency_vec <- agency_vec %>% str_replace("Municipal & Zoning Appeal$", "Municipal & Zoning Appeals")
  agency_vec <- agency_vec %>% str_replace("TRANS-Cross Guard-S$", "TRANS-Cross Guard-Summer")
  agency_vec <- agency_vec %>% str_replace("OED-Participants-Youth Summer", "Youth Summer")
}

2.2.3 Group subsidiary departments with their umbrella agencies

Among the agencies listed in each year’s salary file, some agencies were represented by their subsidiary departments. This resulted in 50+ different agencies. Since this level of refined differentiation can be overwhelming and may create confusion for a regular citizen, I limit the content of the Agency variable to the (umbrella) agencies only. The original super-agency-subsidiary-department description will be preserved in a different variable called AgencyDept. In this way, I can group data together at the super-agency level, and still be able to identify which department within an agency that a particular employee worked for by looking up content in the AgencyDept variable.

# Group sub-agencies under super-agencies, the sub-agency's name is provided as a character string
sub2sup_agencies <- function(cstr) {
  if (is.na(cstr)) return(cstr)
  if (str_detect(cstr, "COMP")) return("Comptroller's Office")
  if (str_detect(cstr, "DPW")) return("Dept of Public Works")
  if (str_detect(cstr, "ERS")) return("Employee's Retirement Sys")
  if (str_detect(cstr, "FIN")) return("Dept of Finance")
  if (str_detect(cstr, "FPR")) return("Fire & Police Retirement Sys Admin")
  if (str_detect(cstr, "HLTH")) return("Dept of Health")
  if (str_detect(cstr, "HR")) return("Human Resourcese")
  if (str_detect(cstr, "M-R")) return("Mayor's Office")
  if (str_detect(cstr, "OED")) return("Office of Employment Dev")
  if (str_detect(cstr, "R&P")) return("Recreation & Parks")
  if (str_detect(cstr, "TANF")) return("Temp Assist for Needy Families")
  if (str_detect(cstr, "TRANS")) return("Dept of Transportation")
  return(cstr)
}

# Apply `sub2sup_agencies` function to a vector of sub-agency names that are character strings
sub2sup_agencies_apply2vec <- function(cstr_vec) {
  unlist(lapply(cstr_vec, sub2sup_agencies))
}

2.3 Flag Records with No HireDate

These are records that have generic names, such as “BPD 1”, “BPD 2”, …, “BPD 96”, in the Name column. It appears that the payroll information of these records are credible, in the sense that they were actual salaries paid to certain individuals, However, these records can be linked to no identifiable persons. More specifically, for these unnamed individuals their JobTitle’s are identical to their names; their Agency were all listed as “Police Department”; and the first three digits of AgencyID are all “A99.”

Since Name and HireDate are the only demographic information provided in these data files, I will be using these two pieces of information to match records and merge them into a master data set. This means that I need to flag records that have generic Name’s and no HireDate in order to correctly match them later. I will give “Jan-1-2200”, a future date, as the HireDate to the these records.

flag_hiredate <- function(date_vec) {
  idx <- which(is.na(date_vec))
  # print(paste(as.character(length(idx)), "missing hire-date records identified.", sep=" "))
  date_vec[idx] <- mdy("01-01-2200")  
  return(date_vec)
}

2.4 Trim Trailing Space in Data Entries

Some entries of Name and Agency have included trailing space(s), which will lead to separate entries when we merge data records later on.

trim_trailing_space_cstr <- function(cstr) {
  str_trim(cstr, side="right")
}

trim_trailing_space2vec <- function(cstr_vec) {
  unlist(lapply(cstr_vec, trim_trailing_space_cstr))
}

2.5 Merge Data into One Master Data Set

BmoreSalaries1115 <- data.frame()

for (i in 11:15) {
  # Standardize names of variables in the incoming data file. This step is necessary since in some files, "AgencyID" is called "DEPTID", "Agency" is called "DESCR", etc.
  new_names <- c("Name", "JobTitle", "AgencyID", "Agency", "HireDate", "AnnualSalary", "GrossPay")
  
  # Read data from files
  if (i == 11) {
    new_csv = paste("Baltimore_City_Employee_Salaries_20",as.character(i),".csv", sep="")
  } else {
    new_csv = paste("Baltimore_City_Employee_Salaries_FY20",as.character(i),".csv", sep="")
  }
  generic_df <- read.csv(new_csv, stringsAsFactors=FALSE, col.names = new_names)
  
  # Prepare data
  generic_df$Name <- trim_trailing_space2vec(generic_df$Name)
  generic_df$JobTitle <- trim_trailing_space2vec(generic_df$JobTitle)
  generic_df$AgencyID <- trim_trailing_space2vec(generic_df$AgencyID)
  generic_df$Agency <- trim_trailing_space2vec(generic_df$Agency)
  
  generic_df$HireDate <- mdy(generic_df$HireDate)
  
  generic_df$HireDate <- flag_hiredate(generic_df$HireDate)
# [1] "75 missing hire-date records identified."
# [1] "96 missing hire-date records identified."
# [1] "52 missing hire-date records identified."
# [1] "70 missing hire-date records identified."
# [1] "10 missing hire-date records identified."
  
  generic_df$AnnualSalary <- format_salary_data(generic_df$AnnualSalary)
  generic_df$GrossPay <- format_salary_data(generic_df$GrossPay)
  generic_df$AgencyDept <- sync_name(generic_df$Agency)
  generic_df$Agency <- sub2sup_agencies_apply2vec(generic_df$AgencyDept)
  
  # Use "Name" and "HireDate" as unique employee identifier; arrange relevant variables to be next to each other in the data frame
  generic_df <- generic_df %>% 
    select(Name, HireDate, JobTitle, Agency, AgencyDept, AgencyID, AnnualSalary, GrossPay)
  
  # Re-name variables to reflect the year data were collected
  new_names <- c("Name", 
                 "HireDate",
                 paste("JobTitle",as.character(i), sep=""),
                 paste("Agency",as.character(i), sep=""),
                 paste("AgencyDept",as.character(i), sep=""),
                 paste("AgencyID",as.character(i), sep=""),
                 paste("AnnualSalary",as.character(i), sep=""),
                 paste("GrossPay",as.character(i), sep=""))
  names(generic_df) <- new_names

  # Merge data from different years into a single data frame
  if (i==11) {
    BmoreSalaries1115 <- generic_df
  } else {
    BmoreSalaries1115<-merge(generic_df, BmoreSalaries1115, by=c("Name", "HireDate"), all=TRUE)   
  }
}

At this point, all data have been read into the master data frame BmoreSalaries1115. This data frame has 29,425 records with each record characterized by 32 variables. Here is a glimpse of the master data frame at the current stage:

glimpse(BmoreSalaries1115)
## Observations: 29,425
## Variables: 32
## $ Name           (chr) "Aaron,Keairah T", "Aaron,Keontae E", "Aaron,Patricia G", "Aaron,Petra L", "Abaine...
## $ HireDate       (date) 2013-06-19, 2013-06-10, 1979-10-24, 2006-09-25, 2009-07-23, 2013-07-24, 2013-05-0...
## $ JobTitle15     (chr) NA, NA, "Facilities/Office Services II", "ASSISTANT STATE'S ATTORNEY", "EPIDEMIOLO...
## $ Agency15       (chr) NA, NA, "Office of Employment Dev", "States Attorneys Office", "Dept of Health", "...
## $ AgencyDept15   (chr) NA, NA, "OED-Employment Dev", "States Attorneys Office", "HLTH-Health Department",...
## $ AgencyID15     (chr) NA, NA, "A03031", "A29045", "A65026", "A99005", "A40001", "A90005", "A64120", "A99...
## $ AnnualSalary15 (dbl) NA, NA, 55314, 74000, 64500, 46309, 60060, 42702, 62175, 77343, 44548, 18408, 1840...
## $ GrossPay15     (dbl) NA, NA, 53626, 73000, 64404, 59620, 54060, 20251, 83757, 92575, 55943, 1051, 1051,...
## $ JobTitle14     (chr) NA, "AIDE BLUE CHIP", "Facilities/Office Services II", "ASSISTANT STATE'S ATTORNEY...
## $ Agency14       (chr) NA, "Youth Summer", "Office of Employment Dev", "States Attorneys Office", "Dept o...
## $ AgencyDept14   (chr) NA, "Youth Summer", "OED-Employment Dev", "States Attorneys Office", "HLTH-Health ...
## $ AgencyID14     (chr) NA, "W02200", "A03031", "A29005", "A65026", "A99416", "A40001", NA, "A64120", "A99...
## $ AnnualSalary14 (dbl) NA, 11310, 53428, 68300, 62000, 43999, 52000, NA, 62175, 70918, 42438, 11310, 1131...
## $ GrossPay14     (dbl) NA, 874, 52868, 67439, 58655, 39687, 47020, NA, 61452, 87900, 53668, NA, NA, NA, 1...
## $ JobTitle13     (chr) "AIDE BLUE CHIP", "AIDE BLUE CHIP", "Facilities/Office Services II", "ASSISTANT ST...
## $ Agency13       (chr) "Youth Summer", "Youth Summer", "Office of Employment Dev", "States Attorneys Offi...
## $ AgencyDept13   (chr) "Youth Summer", "Youth Summer", "OED-Employment Dev", "States Attorneys Office", "...
## $ AgencyID13     (chr) "W02278", "W02200", "A03031", "A29005", "A65026", NA, "A40001", NA, "A64120", "A99...
## $ AnnualSalary13 (dbl) 11310, 11310, 51862, 64000, 57900, NA, 52000, NA, 40650, 68847, 41194, NA, NA, NA,...
## $ GrossPay13     (dbl) NA, NA, 51771, 63909, 57429, NA, 3477, NA, 44160, 66496, 46254, NA, NA, NA, NA, 32...
## $ JobTitle12     (chr) NA, NA, "Facilities/Office Services II", "ASSISTANT STATE'S ATTORNEY", "EPIDEMIOLO...
## $ Agency12       (chr) NA, NA, "Office of Employment Dev", "States Attorneys Office", "Dept of Health", N...
## $ AgencyDept12   (chr) NA, NA, "OED-Employment Dev", "States Attorneys Office", "HLTH-Health Department",...
## $ AgencyID12     (chr) NA, NA, "A03031", "A29005", "A65026", NA, NA, NA, "A64215", "A99398", "A38410", NA...
## $ AnnualSalary12 (dbl) NA, NA, 51862, 64000, 57900, NA, NA, NA, 34146, 58244, 41194, NA, NA, NA, 44143, 3...
## $ GrossPay12     (dbl) NA, NA, 52247, 59027, 57130, NA, NA, NA, 35538, 62669, 46700, NA, NA, NA, 44638, 2...
## $ JobTitle11     (chr) NA, NA, "Facilities/Office Services II", "ASSISTANT STATE'S ATTORNEY", "EPIDEMIOLO...
## $ Agency11       (chr) NA, NA, "Office of Employment Dev", "States Attorneys Office", "Dept of Health", N...
## $ AgencyDept11   (chr) NA, NA, "OED-Employment Dev", "States Attorneys Office", "HLTH-Health Department",...
## $ AgencyID11     (chr) NA, NA, "A03031", "A29005", "A65026", NA, NA, NA, "A64063", "A99398", NA, NA, NA, ...
## $ AnnualSalary11 (dbl) NA, NA, 50845, 56595, 56500, NA, NA, NA, 33476, 50919, NA, NA, NA, NA, 43278, 2638...
## $ GrossPay11     (dbl) NA, NA, 45506, 51589, 50633, NA, NA, NA, 3889, 51422, NA, NA, NA, NA, 39117, 23758...
unique_names <- table(BmoreSalaries1115$Name)
length(unique_names)
## [1] 28832

There are 28,832 unique names in the data frame, which means that some of the records share the same employee names. So, do these records represent different people or the same ones? Given that I don’t have other demographic information of City employees, I looked up their HireDate, which were different. Given that employees who had switched agencies over the years were linked to the same HireDate, it seems that HireDate is a valid piece of information to differentiate employees who had the same names. For this reason I decided to use both Name and HireDate as unique identifiers of individual employees.

Here is a summary of the data:

summary(BmoreSalaries1115)
##      Name              HireDate           JobTitle15          Agency15         AgencyDept15      
##  Length:29425       Min.   :1952-06-23   Length:29425       Length:29425       Length:29425      
##  Class :character   1st Qu.:2001-09-17   Class :character   Class :character   Class :character  
##  Mode  :character   Median :2009-06-11   Mode  :character   Mode  :character   Mode  :character  
##                     Mean   :2006-04-02                                                           
##                     3rd Qu.:2013-06-10                                                           
##                     Max.   :2200-01-01                                                           
##                                                                                                  
##   AgencyID15        AnnualSalary15     GrossPay15      JobTitle14          Agency14        
##  Length:29425       Min.   :   900   Min.   :     0   Length:29425       Length:29425      
##  Class :character   1st Qu.: 33354   1st Qu.: 34666   Class :character   Class :character  
##  Mode  :character   Median : 48126   Median : 52753   Mode  :character   Mode  :character  
##                     Mean   : 51710   Mean   : 55782                                        
##                     3rd Qu.: 68112   3rd Qu.: 75768                                        
##                     Max.   :238772   Max.   :235768                                        
##                     NA's   :15406    NA's   :15675                                         
##  AgencyDept14        AgencyID14        AnnualSalary14     GrossPay14      JobTitle13       
##  Length:29425       Length:29425       Min.   :     0   Min.   :     0   Length:29425      
##  Class :character   Class :character   1st Qu.: 11310   1st Qu.: 22065   Class :character  
##  Mode  :character   Mode  :character   Median : 36400   Median : 45267   Mode  :character  
##                                        Mean   : 39757   Mean   : 45634                     
##                                        3rd Qu.: 60598   3rd Qu.: 67816                     
##                                        Max.   :238772   Max.   :238772                     
##                                        NA's   :10440    NA's   :13664                      
##    Agency13         AgencyDept13        AgencyID13        AnnualSalary13     GrossPay13    
##  Length:29425       Length:29425       Length:29425       Min.   :     0   Min.   :     0  
##  Class :character   Class :character   Class :character   1st Qu.: 11310   1st Qu.: 24536  
##  Mode  :character   Mode  :character   Mode  :character   Median : 35947   Median : 44001  
##                                                           Mean   : 38293   Mean   : 44313  
##                                                           3rd Qu.: 57000   3rd Qu.: 65615  
##                                                           Max.   :238772   Max.   :238681  
##                                                           NA's   :10285    NA's   :13504   
##   JobTitle12          Agency12         AgencyDept12        AgencyID12        AnnualSalary12  
##  Length:29425       Length:29425       Length:29425       Length:29425       Min.   :   377  
##  Class :character   Class :character   Class :character   Class :character   1st Qu.: 28726  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character   Median : 42853  
##                                                                              Mean   : 43402  
##                                                                              3rd Qu.: 59933  
##                                                                              Max.   :238772  
##                                                                              NA's   :13373   
##    GrossPay12      JobTitle11          Agency11         AgencyDept11        AgencyID11       
##  Min.   :   -77   Length:29425       Length:29425       Length:29425       Length:29425      
##  1st Qu.: 23871   Class :character   Class :character   Class :character   Class :character  
##  Median : 43061   Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 43176                                                                              
##  3rd Qu.: 63551                                                                              
##  Max.   :245353                                                                              
##  NA's   :13373                                                                               
##  AnnualSalary11     GrossPay11    
##  Min.   :   377   Min.   :     7  
##  1st Qu.: 31609   1st Qu.: 27609  
##  Median : 43614   Median : 41100  
##  Mean   : 46009   Mean   : 42323  
##  3rd Qu.: 59916   3rd Qu.: 58322  
##  Max.   :238772   Max.   :170423  
##  NA's   :14968    NA's   :14968

2.6 Anomalies in Data Entries

2.6.1 Negative gross pay in 2012

It appears that in 2012 some employees received negative pay. Let’s take a closer look.

sum(BmoreSalaries1115$GrossPay12<0, na.rm=T)
## [1] 19
idx12<-which(BmoreSalaries1115$GrossPay12 < 0)
negative_pay12 <- BmoreSalaries1115[idx12,] %>% 
  select(JobTitle12, Agency12, GrossPay12)
negative_pay12$JobTitle12 <- as.factor(negative_pay12$JobTitle12)
negative_pay12$Agency12 <- as.factor(negative_pay12$Agency12)
summary(negative_pay12)
##                      JobTitle12           Agency12    GrossPay12 
##  AIDE BLUE CHIP           : 2   Dept of Health:17   Min.   :-77  
##  JTPA TRAINEE A MOMR      :14   Youth Summer  : 2   1st Qu.:-56  
##  SR COMPANION STIPEND HLTH: 3                       Median :-49  
##                                                     Mean   :-45  
##                                                     3rd Qu.:-28  
##                                                     Max.   : -7

There were 19 people who received negative pay in 2012. Two of them were under “Youth Summer”, and the other 17 were under the “Dept of Health.” Those who served in “Youth Summer” had “AIDE BLUE CHIP” as JobTitle, while those who worked for the the “Dept of Health” either served as “JTPA TRAINEE A MOMR” or as “SR COMPANION STIPEND HLTH.” All 19 people have been with the City for less than a year, and their negative pays were less than $100 each.

We can rule out the possibility that these negative gross pays were the results of overpay from the previous year since all 19 individuals had been working for the City for less than a year. Given that no negative pay records were found in years before and after 2012, it is likely that these are typos. A less likely but possible explanation is that the City offered benefits which required a “co-pay”" from the employees. Given that this is an exploratory project to practice using R and not an investigative report, I will leave the numbers as they are. Those who are interested in finding the exact answer may download the processed data files to continue with their investigations.

2.6.2 Unnamed employees at the police department

numeric_names <- function(cstr) {
  str_detect(cstr, "^[:digit:]|[:digit:]$")
}

AnyNumericName <- unlist(lapply(BmoreSalaries1115$Name, numeric_names))
sum(AnyNumericName)    # [1] 96
## [1] 96
NumericNameIndv <- subset(BmoreSalaries1115, AnyNumericName)
head(NumericNameIndv)
##        Name   HireDate JobTitle15          Agency15      AgencyDept15 AgencyID15 AnnualSalary15 GrossPay15
## 2656  BPD 1 2200-01-01      BPD 1 Police Department Police Department     A99200          74517      82988
## 2657 BPD 10 2200-01-01     BPD 10 Police Department Police Department     A99200          66086     104206
## 2658 BPD 11 2200-01-01       <NA>              <NA>              <NA>       <NA>             NA         NA
## 2659 BPD 12 2200-01-01       <NA>              <NA>              <NA>       <NA>             NA         NA
## 2660 BPD 13 2200-01-01       <NA>              <NA>              <NA>       <NA>             NA         NA
## 2661 BPD 14 2200-01-01       <NA>              <NA>              <NA>       <NA>             NA         NA
##      JobTitle14          Agency14      AgencyDept14 AgencyID14 AnnualSalary14 GrossPay14 JobTitle13
## 2656        BPD Police Department Police Department     A99200          60598      74946        BPD
## 2657        BPD Police Department Police Department     A99200          56312      55117        BPD
## 2658        BPD Police Department Police Department     A99200          65351      67579        BPD
## 2659        BPD Police Department Police Department     A99200          60598      62915        BPD
## 2660        BPD Police Department Police Department     A99200          64163      66005        BPD
## 2661        BPD Police Department Police Department     A99200          64163      65174        BPD
##               Agency13      AgencyDept13 AgencyID13 AnnualSalary13 GrossPay13 JobTitle12          Agency12
## 2656 Police Department Police Department     A99334          61157      92411      BPD 1 Police Department
## 2657 Police Department Police Department     A99334          65020      94672     BPD 10 Police Department
## 2658 Police Department Police Department     A99334          69893      85584     BPD 11 Police Department
## 2659 Police Department Police Department     A99347          65816      77536     BPD 12 Police Department
## 2660 Police Department Police Department     A99295          62324      65310     BPD 13 Police Department
## 2661 Police Department Police Department     A99347          62324      72900     BPD 14 Police Department
##           AgencyDept12 AgencyID12 AnnualSalary12 GrossPay12 JobTitle11          Agency11      AgencyDept11
## 2656 Police Department        A99          88565      93671        BPD Police Department Police Department
## 2657 Police Department        A99          69893      77961        BPD Police Department Police Department
## 2658 Police Department        A99          62905      74054        BPD Police Department Police Department
## 2659 Police Department        A99          66982      81539        BPD Police Department Police Department
## 2660 Police Department        A99          61157     104831        BPD Police Department Police Department
## 2661 Police Department        A99          69893      79995        BPD Police Department Police Department
##      AgencyID11 AnnualSalary11 GrossPay11
## 2656        A99          85372      89290
## 2657        A99          68523      85543
## 2658        A99          61672      64022
## 2659        A99          66810     103011
## 2660        A99          59388      95306
## 2661        A99          65669      85301

Over the year, there were 96 individuals whose names were “BPD”+digits. These individuals were all employees of the Police Department. These are the only records where individuals were represented under generic names in these data files. While we don’t know who these individuals are, we can still ask questions such as how many unnamed individuals were employed each year and how much they were paid each year. The following is a summary of answers to those questions. Again, I will leave the records as they are in the data set since they seem to represent real people and real salary expense.

Salary Spending on Unnamed Employees in the Police Department 2011-2015
  Number of
Employees
(n)
Minimum
Gross Pay
($)
1st Quarter
Gross Pay
($)
Median
Gross Pay
($)
Mean
Gross Pay
($)
3rd Quarter
Gross Pay
($)
Maximum
Gross Pay
($)
Total
Gross Pay
($)
2011 75 43,200 63,600 72,100 74,500 85,000 127,000 5,591,031
2012 96 54,000 68,000 76,500 78,400 85,500 111,000 7,522,470
2013 52 59,300 73,100 82,200 84,700 95,400 113,000 4,402,887
2014 70 10,100 67,500 77,100 79,700 91,900 135,000 5,576,180
2015 10 59,300 73,100 75,600 79,700 92,000 104,000 797,282

In the boxplot above, each box presents the middle 50% of gross pays received by unnamed employees in a given year. The horizontal cut inside the box represents the median pay. The width of each box varies with respect to the number of unnamed employees serving in that year. That is, a wider box indicates a larger number of unnamed workers. The dots above and below each box represent cases of gross pays that were beyond 1.5 times the interquartile range (the middle 50% pay range) of gross pay of that particular year.

3 Analysis & Results

Our master data set shows that the City’s workforce had 14,019 employees in 2015, 18,985 employees in 2014, 19,140 employees in 2013, 16,052 employees in 2012, and 14,457 employees in 2011. So how was the workforce distributed among different City agencies?

3.1 Distribution of Workforce among Agencies

We can answer this question by counting the number of employees listed under different agencies. For ease of analysis I will first slice the relevant data out of the master data set.

Agency1115 <- select(BmoreSalaries1115, Name, HireDate, contains("Agency1"), contains("GrossPay"))

sum_agency15 = as.data.frame(table(Agency1115$Agency15), responseName = "2015", stringsAsFactors = FALSE)
sum_agency14 = as.data.frame(table(Agency1115$Agency14), responseName = "2014", stringsAsFactors = FALSE)
sum_agency13 = as.data.frame(table(Agency1115$Agency13), responseName = "2013", stringsAsFactors = FALSE)
sum_agency12 = as.data.frame(table(Agency1115$Agency12), responseName = "2012", stringsAsFactors = FALSE)
sum_agency11 = as.data.frame(table(Agency1115$Agency11), responseName = "2011", stringsAsFactors = FALSE)

TotAgency1115 = merge(sum_agency11,sum_agency12,by="Var1",all = TRUE)
TotAgency1115 = merge(TotAgency1115,sum_agency13,by="Var1",all = TRUE)
TotAgency1115 = merge(TotAgency1115,sum_agency14,by="Var1",all = TRUE)
TotAgency1115 = merge(TotAgency1115,sum_agency15,by="Var1",all = TRUE)

names(TotAgency1115)[1]="Agency"

# Convert all NA's into 0
for (i in 2:6) {
  idx <- which(is.na(TotAgency1115[[i]]))
  TotAgency1115[[i]][idx] <- 0
}

The above plot shows that most agencies had a steady workforce over the years. One striking feature is the spike of “Youth Summer” employees in 2013 (4,725 participants) and 2014 (4,819 participants). It appears that the summer program, which gave participants work experience and a small stipend, had a modest enrollment in 2012 ( 1,755 participants) and flourished in 2013 and 2014. We know that the summer programs expanded in 2015, hosting more than 8000 youths. However, the 2015 program participants were either no longer considered as city employees or somehow their data were not incorporated into the data set like in the previous years.

Another notable feature of the plot is that the Police Department has consistently been the largest agencies in the City. Indeed, it had the largest city workforce in 2011 (3,428 employees), 2012 (3,475 employees), and 2015 (3,097 employees). Its head counts in 2013 and 2014 were only surpassed by the Summer Youth Program.

While maintaining its dominance in the City’s workforce, the police workforce has been on the decline since 2012. And it is one of the few agencies what were doing so in recent years. The readers may be interested in comparing this plot with the plot of salary spending, which will be examined in a later section.

Between 2011 and 2015 only the “Mayor’s Office” and “Recreation & Parks” experienced a steady increase in the workforce. Between the two agencies, “Recreation & Parks” was about one-and-a-half the size of the “Mayor’s Office”. Nonetheless, the “Mayor’s Office” had a salary budget twice that of “Recreation & Parks” (see salary analysis). This likely means that majority of the “Recreation & Parks” employees were part-time, and that the positions at the “Mayor’s Office” were mostly full-time.

Finally, you may wonder if “Youth Cust”, “Youth Summer”, and “Youth Temp Adult” should be grouped together since their names imply that these are programs for youths. Based on the data in the salary data files, it is not a clear call. First of all, each of these designations has a distinct three-digit AgencyID. Secondly, they co-existed in four of the five years–2015 is the only year where all three programs had zero participant. Based on these observations, I decided to leave them as separate Agency’s in the analysis.

Here are the numbers that generated the plot above.

Number of Positions by Agency 2011-2015
  2011
(n)
2012
(n)
2013
(n)
2014
(n)
2015
(n)
Circuit Court 154 149 139 140 141
City Council 88 90 87 93 96
Civil Rights & Wage Enforce 0 0 12 12 11
Community Relations 10 12 0 0 0
Comptroller’s Office 91 86 95 90 91
Council Services 5 6 6 6 6
Dept of Finance 265 281 286 288 290
Dept of Health 941 870 957 915 958
Dept of Public Works 2,356 2,327 2,332 2,337 2,369
Dept of Transportation 1,490 1,535 1,576 1,423 1,368
Elections 6 11 8 8 8
Employee’s Retirement Sys 27 28 29 33 32
Enoch Pratt Free Library 442 439 475 463 473
Fire & Police Retirement Sys Admin 21 22 22 25 27
Fire Department 1,726 1,730 1,640 1,576 1,615
General Services 396 411 400 389 357
Housing & Community Dev 513 533 433 446 384
Human Resourcese 52 55 59 66 66
Law Department 97 105 107 110 108
Legislative Reference 6 6 5 5 5
Liquor License Board 30 36 29 20 24
Mayor’s Office 424 401 672 687 724
Municipal & Zoning Appeals 8 9 9 9 9
Office of Employment Dev 224 208 195 165 162
Orphan’s Court 10 11 12 11 10
Planning Department 47 45 47 48 50
Police Department 3,428 3,475 3,278 3,211 3,097
Recreation & Parks 650 756 874 953 966
Sheriff’s Office 175 182 191 206 210
Special City Services 2 2 3 4 4
States Attorneys Office 375 362 364 358 358
Temp Assist for Needy Families 45 50 0 0 0
Wage Commissioner 3 0 0 0 0
War Memorial Commission 4 4 0 0 0
Youth Cust 328 29 6 0 0
Youth Summer 0 1,755 4,725 4,819 0
Youth Temp Adult 18 31 67 69 0

3.2 Retention of City Employees

Employee attrition is costly to an organization. Here I examined the retention rates of Baltimore City agencies. I first look at the retention of employees across all agencies as a whole, and then break down the numbers by agency.

I decided to count an employee as retained only if he/she remained working for the same agency from one year to the next. This means that if an employee left the agency either by transferring to another agency within the City government or by dropping out of the City workforce altogether, he/she would be counted toward the attrition.

First, let’s calculate the size of the workforce for each year of the data we have.

# Workforce in '15
Workforce15 = sum(!is.na(Agency1115$Agency15))    # [1] 14019

# Workforce in '14
Workforce14 = sum(!is.na(Agency1115$Agency14))    # [1] 18985

# Workforce in '13
Workforce13 = sum(!is.na(Agency1115$Agency13))    # [1] 19140

# Workforce in '12
Workforce12 = sum(!is.na(Agency1115$Agency12))    # [1] 16052

# Workforce in '11
Workforce11 = sum(!is.na(Agency1115$Agency11))    # [1] 14457

Next, I am going to calculate, with each year’s passing,

  • number of City workers dropping out of the system,
  • number of new hires into the system,
  • number of City workers who switched agencies,
  • number of City workers who stayed with the same agencies, and
  • retention rate as a function the number of City workers who stayed with the same agencies.
AgencyOnly1115 <- Agency1115 %>% select(Name, HireDate, contains("Agency1"))

AgencyOnly1115 <- AgencyOnly1115 %>% mutate(same_agency1112=Agency11==Agency12)
AgencyOnly1115 = AgencyOnly1115 %>% mutate(same_agency1213=Agency12==Agency13)
AgencyOnly1115 = AgencyOnly1115 %>% mutate(same_agency1314=Agency13==Agency14)
AgencyOnly1115 = AgencyOnly1115 %>% mutate(same_agency1415=Agency14==Agency15)


# Drop out of City workforce '11 to '12
NLeftTot11 = sum(!is.na(AgencyOnly1115$Agency11) & is.na(AgencyOnly1115$Agency12))    # [1] 1845
LeftTot11 = NLeftTot11/Workforce11 # [1] 0.13

# New hires to the City workforce `11 to `12
NNewHires12 = sum(!is.na(AgencyOnly1115$Agency12) & is.na(AgencyOnly1115$Agency11))     # [1] 3440
NewHires12 = NNewHires12/Workforce12 # [1] 0.21

# Switched agencies from '11 - '12
NSwitchAgency1112 = sum(!AgencyOnly1115$same_agency1112, na.rm=T)   # [1] 84
SwitchAgency1112 = NSwitchAgency1112/Workforce11  # [1] 0.0058

# Stayed with the same agencies from '11 - '12
NSameAgency1112 = sum(AgencyOnly1115$same_agency1112, na.rm=T)    # [1] 12528

# Employee retention rate after one year (ref. 2011)
Retain1112 = NSameAgency1112/Workforce11    # [1] 0.87

# Let's check the numbers:
Workforce11 - NLeftTot11 + NNewHires12
## [1] 16052
Workforce12
## [1] 16052
Workforce11 - NLeftTot11 - NSwitchAgency1112
## [1] 12528
NSameAgency1112
## [1] 12528
# ------------------------------------------------
# Drop off from '12 to '13
NLeftTot12 = sum(!is.na(AgencyOnly1115$Agency12) & is.na(AgencyOnly1115$Agency13))    # [1] 2754
LeftTot12 = NLeftTot12/Workforce12    # [1] 0.17

# New hires to the City workforce `12 to `13
NNewHires13 = sum(!is.na(AgencyOnly1115$Agency13) & is.na(AgencyOnly1115$Agency12))   # [1] 5842
NewHires13 = NNewHires13/Workforce13    # [1] 0.31

# Switched agencies from '12 - '13
NSwitchAgency1213 = sum(!AgencyOnly1115$same_agency1213, na.rm=T)   # [1] 337
SwitchAgency1213 = NSwitchAgency1213/Workforce12    # [1] 0.021

# Stayed with the same agencies from '12 - '13
NSameAgency1213 = sum(AgencyOnly1115$same_agency1213, na.rm=T)    # [1] 12961

# City worker retention rate after one year (ref. 2012)
Retain1213 = NSameAgency1213/Workforce12    # [1] 0.81

# Let's check the numbers:
Workforce12 - NLeftTot12 + NNewHires13
## [1] 19140
Workforce13
## [1] 19140
Workforce12 - NLeftTot12 - NSwitchAgency1213
## [1] 12961
NSameAgency1213
## [1] 12961
# Stayed with same agency from '11 - '13
NSameAgency1113 = sum(AgencyOnly1115$same_agency1213 & AgencyOnly1115$same_agency1112, na.rm=T)   # [1] 11070

# City worker retention rate after two year (ref. 2011)
Retain1113 = NSameAgency1113/Workforce11    # [1] 0.77

# ------------------------------------------------
# Drop off from '13 to '14
NLeftTot13 = sum(!is.na(AgencyOnly1115$Agency13) & is.na(AgencyOnly1115$Agency14))   # [1] 4799
LeftTot13 = NLeftTot13/Workforce13    # [1] 0.25

# New hires to the City workforce `13 to `14
NNewHires14 = sum(!is.na(AgencyOnly1115$Agency14) & is.na(AgencyOnly1115$Agency13))  # [1] 4644
NewHires14 = NNewHires14/Workforce14    # [1] 0.24

# Switched agencies from '13 - '14
NSwitchAgency1314 = sum(!AgencyOnly1115$same_agency1314, na.rm=T)   # [1] 86
SwitchAgency1314 = NSwitchAgency1314/Workforce13    # [1] 0.0045

# Stayed with same agency from '13 - '14
NSameAgency1314 = sum(AgencyOnly1115$same_agency1314, na.rm=T)    # [1] 14255

# City worker retention rate after one year (ref. 2013)
Retain1314 = NSameAgency1314/Workforce13    # [1] 0.74

# Let's check the numbers:
Workforce13 - NLeftTot13 + NNewHires14
## [1] 18985
Workforce14
## [1] 18985
Workforce13 - NLeftTot13 - NSwitchAgency1314
## [1] 14255
NSameAgency1314
## [1] 14255
# Stayed with same agency from '12 - '14
NSameAgency1214 = sum(AgencyOnly1115$same_agency1213 & AgencyOnly1115$same_agency1314, na.rm=T)   # [1] 11295

# City worker retention rate after two year (ref. 2012)
Retain1214 = NSameAgency1214/Workforce12    # [1] 0.7


# Stayed with same agency from '11 - '14
NSameAgency1114 = sum(AgencyOnly1115$same_agency1112 & AgencyOnly1115$same_agency1213 
                      & AgencyOnly1115$same_agency1314, na.rm=T)    # [1] 10076

# City worker retention rate after three year (ref. 2011)
Retain1114 = NSameAgency1114/Workforce11    # [1] 0.7

# ------------------------------------------------
# Drop off from '14 to '15
NLeftTot14 = sum(!is.na(AgencyOnly1115$Agency14) & is.na(AgencyOnly1115$Agency15))    # [1] 6491
LeftTot14 = NLeftTot14/Workforce14    # [1] 0.34

NNewHires15 = sum(!is.na(AgencyOnly1115$Agency15) & is.na(AgencyOnly1115$Agency14))   # [1] 1525
NewHires15 = NNewHires15/Workforce15  # [1] 0.11

# Switched agencies from '14 - '15
NSwitchAgency1415 = sum(!AgencyOnly1115$same_agency1415, na.rm=T)   # [1] 191
SwitchAgency1415 = NSwitchAgency1415/Workforce14    # [1] 0.01

# Stayed with same agency from '14 - '15
NSameAgency1415 = sum(AgencyOnly1115$same_agency1415, na.rm=T)    # [1] 12303

# City worker retention rate after one year (ref. 2014)
Retain1415 = NSameAgency1415/Workforce14    # [1] 0.65

# Let's check the numbers:
Workforce14 - NLeftTot14 + NNewHires15
## [1] 14019
Workforce15
## [1] 14019
Workforce14 - NLeftTot14 - NSwitchAgency1415
## [1] 12303
NSameAgency1415
## [1] 12303
# Stayed with same agency from '13 - '15
NSameAgency1315 = sum(AgencyOnly1115$same_agency1314 & AgencyOnly1115$same_agency1415, na.rm=T)   # [1] 11143

# City worker retention rate after two years (ref. 2013)
Retain1315 = NSameAgency1315/Workforce13   # [1] 0.58


# Stayed with same agency from '12 - '15
NSameAgency1215 = sum(AgencyOnly1115$same_agency1213 & AgencyOnly1115$same_agency1314 
                      & AgencyOnly1115$same_agency1415, na.rm=T)    # [1] 9956

# City worker retention rate after three years (ref. 2012)
Retain1215 = NSameAgency1215/Workforce12    # [1] 0.62


# Stayed with same agency from '11 - '15
NSameAgency1115 = sum(AgencyOnly1115$same_agency1213 & AgencyOnly1115$same_agency1112 
                      & AgencyOnly1115$same_agency1314 & AgencyOnly1115$same_agency1415, na.rm=T)    # [1] 9167

# City worker retention rate after four years (ref. 2011)
Retain1115 = NSameAgency1115/Workforce11    # [1] 0.63

Let’s organize the numbers into tables and visualize them.

City Employee Retention 2011-2015
  Beginning
Workforce
(n)
New
Hires
(n)
Left City
After One Yr
(n)
Switched
Agencies
(n)
Same Agencies
After One Yr
(n)
Same Agencies
After Two Yrs
(n)
Same Agencies
After Three Yrs
(n)
Same Agencies
After Four Yrs
(n)
2011 14,457 NA 1,845 84 12,528 11,070 10,076 9,167
2012 16,052 3,440 2,754 337 12,961 11,295 9,956 NA
2013 19,140 5,842 4,799 86 14,255 11,143 NA NA
2014 18,985 4,644 6,491 191 12,303 NA NA NA
2015 14,019 1,525 NA NA NA NA NA NA
City Employee Retention Rates 2011-2015
  Beginning
Workforce
(%)
Hiring
Rates
(%)
Leaving
Rates
(%)
Switching
Rates
(%)
Remained
After One Yr
(%)
Remained
After Two Yrs
(%)
Remained
After Three Yrs
(%)
Remained
After Four Yrs
(%)
2011 100 NA 12.76 0.58 86.66 76.57 69.7 63.41
2012 100 21.43 17.16 2.1 80.74 70.37 62.02 NA
2013 100 30.52 25.07 0.45 74.48 58.22 NA NA
2014 100 24.46 34.19 1.01 64.8 NA NA NA
2015 100 10.88 NA NA NA NA NA NA

At first glance, one may be surprised by the appearance of progressively worsening retention rates of the City workforce over the years. Indeed, you may even question whether these numbers are believable. I had the same question, and here is what I found: The retention rate of 65% from 2014 to 2015 seems really low. However, that’s because we did not account for the fact that youth programs accounted for 4,888 employees in 2014 and none in 2015. In fact, those 4,888 participants amounted to 75% of the workforce reduction in 2015. So, much of the workforce fluctuation we see in this plot were related to the three youth programs the City hosted between 2011 and 2014.

Knowing that the above statistics are biased by the youth programs, which by design were not meant to keep participants on long-term basis, I decided to recalculate the statistics for City employees who were not participants of any of the three youth programs over the years.

Before we leave the summary table and plot though, I want to point out that over the years the number of employees who switched jobs among different City agencies remained small, between 0.45% and 2.1%. This result carries two pieces of information: First, it shows that the majority of employee attrition was due to employees leaving the City workforce all together. Secondly, it shows that City workers, if they chose to stay working for the City, liked to stay with the same agencies.

In the following, let’s isolate a data set to include only City workers who were not participants of any of the three youth programs.

youth_idx = grepl("Youth", Agency1115$Agency14) | grepl("Youth", Agency1115$Agency13) | 
            grepl("Youth", Agency1115$Agency12) | grepl("Youth", Agency1115$Agency11)

AgencyNoYouth1115 = AgencyOnly1115[!youth_idx,]

Removing youth program participants reduces the number of records from 29,425 to 20,134 in the years between 2011 and 2015. Let’s run the same analysis as before and look at the numbers.

City Employee Retention 2011-2015
(Not Including Youth Programs Participants)
  Beginning
Workforce
(n)
New
Hires
(n)
Left City
After One Yr
(n)
Switched
Agencies
(n)
Same Agencies
After One Yr
(n)
Same Agencies
After Two Yrs
(n)
Same Agencies
After Three Yrs
(n)
Same Agencies
After Four Yrs
(n)
2011 14,108 NA 1,511 78 12,519 11,063 10,069 9,167
2012 14,235 1,638 1,640 321 12,274 11,012 9,956 NA
2013 14,325 1,730 1,737 71 12,517 11,136 NA NA
2014 14,067 1,479 1,612 171 12,284 NA NA NA
2015 13,960 1,505 NA NA NA NA NA NA
City Employee Retention Rates 2011-2015
(Not Including Youth Programs Participants)
  Beginning
Workforce
(%)
Hiring
Rates
(%)
Leaving
Rates
(%)
Switching
Rates
(%)
Remained
After One Yr
(%)
Remained
After Two Yrs
(%)
Remained
After Three Yrs
(%)
Remained
After Four Yrs
(%)
2011 100 NA 10.71 0.55 88.74 78.42 71.37 64.98
2012 100 11.51 11.52 2.26 86.22 77.36 69.94 NA
2013 100 12.08 12.13 0.5 87.38 77.74 NA NA
2014 100 10.51 11.46 1.22 87.32 NA NA NA
2015 100 10.78 NA NA NA NA NA NA

This plot shows a stable City workforce, which is what most of us expect. Specifically, it shows that the City has a one-year retention rate of 87%, a two-year retention rate of 78%, a three-year retention rate of 71%, and a four-year retention rate of 65%. That the attrition rate decreases over the year shows that the workforce stabilizes as worker tenure increases.

3.2.1 Retention Rates by Agency

Let’s break down the aggregate numbers to get a clearer picture of how the City’s workforce changed over the years. Note, I included Youth Programs participants in the following analysis since it is useful to know whether a Youth Program participant stayed with the same program or switched to work for a different City agency.

same_Agency1112 = as.data.frame(table(AgencyOnly1115$Agency12[AgencyOnly1115$same_agency1112==T]),
                                responseName = "11-12", stringsAsFactors = FALSE)
same_Agency1113 = as.data.frame(table(AgencyOnly1115$Agency13[AgencyOnly1115$same_agency1112==T 
                                                              & AgencyOnly1115$same_agency1213==T]),
                                responseName = "11-13", stringsAsFactors = FALSE)
same_Agency1114 = as.data.frame(table(AgencyOnly1115$Agency14[AgencyOnly1115$same_agency1112==T 
                                                              & AgencyOnly1115$same_agency1213==T 
                                                              & AgencyOnly1115$same_agency1314==T]),
                                responseName = "11-14", stringsAsFactors = FALSE)
same_Agency1115 = as.data.frame(table(AgencyOnly1115$Agency15[AgencyOnly1115$same_agency1112==T 
                                                              & AgencyOnly1115$same_agency1213==T 
                                                              & AgencyOnly1115$same_agency1314==T 
                                                              & AgencyOnly1115$same_agency1415==T]),
                                responseName = "11-15", stringsAsFactors = FALSE)

same_Agency1213 = as.data.frame(table(AgencyOnly1115$Agency13[AgencyOnly1115$same_agency1213==T]),
                                responseName = "12-13", stringsAsFactors = FALSE)
same_Agency1214 = as.data.frame(table(AgencyOnly1115$Agency14[AgencyOnly1115$same_agency1213==T 
                                                              & AgencyOnly1115$same_agency1314==T]),
                                responseName = "12-14", stringsAsFactors = FALSE)
same_Agency1215 = as.data.frame(table(AgencyOnly1115$Agency15[AgencyOnly1115$same_agency1213==T 
                                                              & AgencyOnly1115$same_agency1314==T 
                                                              & AgencyOnly1115$same_agency1415==T]),
                                responseName = "12-15", stringsAsFactors = FALSE)

same_Agency1314 = as.data.frame(table(AgencyOnly1115$Agency14[AgencyOnly1115$same_agency1314==T]),
                                responseName = "13-14", stringsAsFactors = FALSE)
same_Agency1315 = as.data.frame(table(AgencyOnly1115$Agency15[AgencyOnly1115$same_agency1314==T 
                                                              & AgencyOnly1115$same_agency1415==T]),
                                responseName = "13-15", stringsAsFactors = FALSE)

same_Agency1415 = as.data.frame(table(AgencyOnly1115$Agency15[AgencyOnly1115$same_agency1415==T]),
                                responseName = "14-15", stringsAsFactors = FALSE)


TotSameAgency1115 = merge(same_Agency1112,same_Agency1113,by="Var1",all = TRUE)
TotSameAgency1115 = merge(TotSameAgency1115,same_Agency1114,by="Var1",all = TRUE)
TotSameAgency1115 = merge(TotSameAgency1115,same_Agency1115,by="Var1",all = TRUE)
TotSameAgency1115 = merge(TotSameAgency1115,same_Agency1213,by="Var1",all = TRUE)
TotSameAgency1115 = merge(TotSameAgency1115,same_Agency1214,by="Var1",all = TRUE)
TotSameAgency1115 = merge(TotSameAgency1115,same_Agency1215,by="Var1",all = TRUE)
TotSameAgency1115 = merge(TotSameAgency1115,same_Agency1314,by="Var1",all = TRUE)
TotSameAgency1115 = merge(TotSameAgency1115,same_Agency1315,by="Var1",all = TRUE)
TotSameAgency1115 = merge(TotSameAgency1115,same_Agency1415,by="Var1",all = TRUE)


names(TotSameAgency1115)[1]="Agency"

# Convert all NA's into 0
for (i in 2:11) {
  idx <- which(is.na(TotSameAgency1115[[i]]))
  TotSameAgency1115[[i]][idx] <- 0
}

# dim(TotSameAgency1115)

# The Agencies "Wage Commissioner" and "Youth Cust" did not retain any employees over the years
RetentionRateByAgency = rbind(TotSameAgency1115, c("Wage Commissioner", numeric(10)), c("Youth Cust", numeric(10))) %>%
  arrange(Agency) 

for (i in 2:11) {
  RetentionRateByAgency[[i]] <- as.numeric(RetentionRateByAgency[[i]])
}


RetentionRateByAgency$X11.12=RetentionRateByAgency$X11.12/TotAgency1115$X2011*100
RetentionRateByAgency$X11.13=RetentionRateByAgency$X11.13/TotAgency1115$X2011*100
RetentionRateByAgency$X11.14=RetentionRateByAgency$X11.14/TotAgency1115$X2011*100
RetentionRateByAgency$X11.15=RetentionRateByAgency$X11.15/TotAgency1115$X2011*100

RetentionRateByAgency$X12.13=RetentionRateByAgency$X12.13/TotAgency1115$X2012*100
RetentionRateByAgency$X12.14=RetentionRateByAgency$X12.14/TotAgency1115$X2012*100
RetentionRateByAgency$X12.15=RetentionRateByAgency$X12.15/TotAgency1115$X2012*100

RetentionRateByAgency$X13.14=RetentionRateByAgency$X13.14/TotAgency1115$X2013*100
RetentionRateByAgency$X13.15=RetentionRateByAgency$X13.15/TotAgency1115$X2013*100

RetentionRateByAgency$X14.15=RetentionRateByAgency$X14.15/TotAgency1115$X2014*100

The following set of plots shows the retention rates of each of the 37 City agencies over one-, two-, three-, and four-year periods.

The above one-year retention rates plot shows some extreme cases. For example, Community Relations, Council Services, Fire and Police retirement Sys Admin, Legislative Reference, Municipal and Zoning Appeals, Special City Services, and War Memorial Commission had been able to retain 100% of their workforce in some years. At the same time, the six agencies on the right hand side of the plot had experienced a complete loss of their workforce at least once over the five years.

Clearly, those agencies that achieved 100% retention rate were doing a good job in retaining their employees. At the same time, these agencies were relatively small, most of them had less than 10 employees (see workforce plot), so it is not entirely fair to compare them with agencies that had thousands of employees, such as the Police Department, the Department of Public Works, or the Fire Department.

Those who experience 0% retention rate can be separated into two groups: In one group the agencies (e.g., Temp Assist for Needy Families) simply no longer existed; in the other group the agencies (e.g., the youth programs) simply did not mean to keep employees on long term basis. As a result, the 0% retention rate was not too surprising.

In this four-year retention rates plot, Legislative Reference stands out to have the highest retention rate across all City agencies. There are at least two factors contributing to this high retention rate: First, Legislative Reference is a relatively small agency, having 6 people in two of the years and 5 in the other three years. Secondly, employee tenure tend to be high at Legislative Reference, as we found from the in-depth analysis of the 2015 data.

The Fire Department had the second highest retention rate over a four-year period. Given that it had over 1500 employees over the years, this is indeed an achievement. The third highest retention rate belongs to the Dept of Public Works, which had over 2300 employees and among whom was the longest-tenure employee in all City agencies.

3.3 Distribution of Gross Pay

First let’s take a look at the pay distribution over the years.

gross_pay1115 <- Agency1115 %>% 
  select(GrossPay11, GrossPay12, GrossPay13, GrossPay14, GrossPay15) %>% 
  gather(Yr, GrossPay, na.rm=TRUE)
lut <- c("GrossPay15"="2015","GrossPay14"="2014","GrossPay13"="2013",
         "GrossPay12"="2012","GrossPay11"="2011")
gross_pay1115$Yr <- lut[gross_pay1115$Yr]

The distributions for all five years clearly skew to the right (the end for higher gross pay), suggesting that a small number of employees received much higher pay than others–we expect them to be the managers, heads of departments/agencies.

In addition, the distributions are bimodal and possibly trimodal. The peaks on the left in 2012 - 2014’s distributions are most likely due to the large number of participants of the youth programs. The antimode/valley at ~$20,000 might be a natural breaking point between part-time and full-time employees. Among those who were paid ~$20,000 and above, the distributions shows two peaks: The main peak with lower pay corresponds to the rank-and-file employees, while the other peak with higher pay represents those in leadership positions. Given that a position’s pay also depends on the nature of the work and tenure of the employee, the division between the two groups is not likely to be clear-cut.

Let’s look at a numerical summary of the pay distributions.

GrossPay1115 <- select(Agency1115, GrossPay11, GrossPay12, GrossPay13, GrossPay14, GrossPay15)
summary(GrossPay1115)
##    GrossPay11       GrossPay12       GrossPay13       GrossPay14       GrossPay15    
##  Min.   :     7   Min.   :   -77   Min.   :     0   Min.   :     0   Min.   :     0  
##  1st Qu.: 27609   1st Qu.: 23871   1st Qu.: 24536   1st Qu.: 22065   1st Qu.: 34666  
##  Median : 41100   Median : 43061   Median : 44001   Median : 45267   Median : 52753  
##  Mean   : 42323   Mean   : 43176   Mean   : 44313   Mean   : 45634   Mean   : 55782  
##  3rd Qu.: 58322   3rd Qu.: 63551   3rd Qu.: 65615   3rd Qu.: 67816   3rd Qu.: 75768  
##  Max.   :170423   Max.   :245353   Max.   :238681   Max.   :238772   Max.   :235768  
##  NA's   :14968    NA's   :13373    NA's   :13504    NA's   :13664    NA's   :15675

You may notice the negative pays in the 2012 data. We have discussed them in the anomalies section of this report, so I will not repeat it here. The NA counts in each year’s summary are due to combining all five year’s data into one master data set–not all employees worked for the City for all five years.

The above summary shows that median salary of City employees rose above $50,000 for the first time in 2015. At first, I did not think that this was due to a pay raise given to City employees in 2015. Instead, remembering that 2015 data included zero youth program participant instead of the thousands we saw in previous years, I assumed that removing thousands of lowest pay entries from the 2015 distribution would naturally resulted in a higher median value for the 2015 pay distribution.

To check this hypothesis, I look at the pay summary of City employees who were not youth program participants.

NYGrossPay1115 <- Agency1115[!youth_idx,] %>% select(GrossPay11, GrossPay12, GrossPay13, 
                                                     GrossPay14, GrossPay15)
summary(NYGrossPay1115)
##    GrossPay11       GrossPay12       GrossPay13       GrossPay14       GrossPay15    
##  Min.   :    60   Min.   :   -77   Min.   :     0   Min.   :     0   Min.   :     0  
##  1st Qu.: 28406   1st Qu.: 31899   1st Qu.: 32082   1st Qu.: 33075   1st Qu.: 34862  
##  Median : 41821   Median : 47224   Median : 47859   Median : 50031   Median : 52943  
##  Mean   : 43316   Mean   : 48544   Mean   : 49483   Mean   : 51866   Mean   : 55983  
##  3rd Qu.: 58715   3rd Qu.: 65584   3rd Qu.: 67802   3rd Qu.: 70674   3rd Qu.: 75866  
##  Max.   :170423   Max.   :245353   Max.   :238681   Max.   :238772   Max.   :235768  
##  NA's   :6026     NA's   :5899     NA's   :5916     NA's   :6307     NA's   :6440

Clearly, my assumption was not correct! On the contrary, after removing youth programs participants, the remaining distributions suggest a steady increase in salary for City employees every year since 2011!

Let’s look at the overall pay distributions without youth programs participants. In the following histograms, the solid vertical line in the middle marks the median pay, and the dashed vertical lines mark the 25 and 75 percentiles of pay respectively.

3.4 Salary Spending by Agency

In the following, I aggregate salary data based on the agency of the employee, and examine the relation between number of positions and total salary spending by agency.

SalaryExpend15 <- Agency1115 %>% group_by(Agency15) %>% summarise(TotExpend15 = sum(GrossPay15, na.rm=TRUE)) %>% na.omit()
SalaryExpend14 <- Agency1115 %>% group_by(Agency14) %>% summarise(TotExpend14 = sum(GrossPay14, na.rm=TRUE)) %>% na.omit()
SalaryExpend13 <- Agency1115 %>% group_by(Agency13) %>% summarise(TotExpend13 = sum(GrossPay13, na.rm=TRUE)) %>% na.omit()
SalaryExpend12 <- Agency1115 %>% group_by(Agency12) %>% summarise(TotExpend12 = sum(GrossPay12, na.rm=TRUE)) %>% na.omit()
SalaryExpend11 <- Agency1115 %>% group_by(Agency11) %>% summarise(TotExpend11 = sum(GrossPay11, na.rm=TRUE)) %>% na.omit()

TotSalaryExpend1115 = merge(SalaryExpend11,SalaryExpend12,by.x="Agency11", by.y="Agency12",all = TRUE)
TotSalaryExpend1115 = merge(TotSalaryExpend1115,SalaryExpend13,by.x="Agency11", by.y="Agency13",all = TRUE)
TotSalaryExpend1115 = merge(TotSalaryExpend1115,SalaryExpend14,by.x="Agency11", by.y="Agency14",all = TRUE)
TotSalaryExpend1115 = merge(TotSalaryExpend1115,SalaryExpend15,by.x="Agency11", by.y="Agency15",all = TRUE)

names(TotSalaryExpend1115)[1]="Agency"

# Convert all NA's into 0
for (i in 2:6) {
  idx <- which(is.na(TotSalaryExpend1115[[i]]))
  TotSalaryExpend1115[[i]][idx] <- 0
}

Salary spending of the different departments had been steady, and a few departments even saw a moderate increase over the years. One striking feature, if you recall the number of positions by agency plot earlier, is the almost none-existent spending on the youth programs in comparison with other City agencies. This reinforce my belief that participants of the youth programs were not the traditional employees in other City agencies.

While most of the increase were relatively modest and gradual, two features jump out of the plot. First, spending of the Mayor’s Office was double from 2012 to 2013 and had stayed at that level since. Secondly, spending at the Police Department experienced two stages of large increase: one from 2011 to 2012, and the other from 2014 to 2015. Each of these increases was about $30,000,000.

In comparison with the number of positions by agency plot, which shows a steady decrease of the police workforce since 2012, the rather large salary spending increase by the Police Department in 2015 is even more intriguing. Did the increase impact majority of the personnel in the police workforce? Or, was it the result of hiring a few top managers? I will answer this question in a later section.

Here are the numbers that generated the plot above.

Salary Spending by Agency 2011-2015
  2011
($)
2012
($)
2013
($)
2014
($)
2015
($)
Circuit Court 6,992,230 7,418,225 6,905,444 6,930,433 7,600,118
City Council 2,930,632 3,220,308 3,486,672 3,642,744 3,699,890
Civil Rights & Wage Enforce 0 0 641,071 568,643 501,455
Community Relations 431,928 609,175 0 0 0
Comptroller’s Office 4,064,869 4,602,130 4,743,322 4,665,648 5,098,746
Council Services 321,864 382,368 443,624 365,916 388,532
Dept of Finance 10,617,647 12,368,165 12,860,218 13,376,656 14,156,029
Dept of Health 29,572,629 32,256,450 31,131,996 30,085,148 31,313,971
Dept of Public Works 84,146,174 94,566,693 95,628,158 99,802,845 106,487,490
Dept of Transportation 42,344,106 48,484,935 47,636,882 48,785,120 51,615,941
Elections 57,217 143,771 96,500 87,038 74,679
Employee’s Retirement Sys 1,224,019 1,277,099 1,310,384 1,575,017 1,757,640
Enoch Pratt Free Library 14,084,339 16,056,586 16,891,040 17,076,783 17,344,375
Fire & Police Retirement Sys Admin 1,019,820 1,200,031 1,224,966 1,332,487 1,581,667
Fire Department 101,253,987 113,689,987 113,686,926 121,142,250 125,612,177
General Services 16,069,391 18,692,706 19,551,601 19,402,148 17,874,549
Housing & Community Dev 20,302,024 23,216,475 18,515,544 18,877,111 18,598,793
Human Resourcese 2,108,471 2,271,436 2,625,640 3,109,768 3,062,501
Law Department 5,720,770 6,609,865 6,776,727 6,963,661 7,566,482
Legislative Reference 320,193 358,489 274,506 292,461 326,889
Liquor License Board 1,191,696 1,414,959 1,304,304 863,551 1,037,420
Mayor’s Office 14,657,869 15,676,067 29,291,244 31,350,202 34,096,255
Municipal & Zoning Appeals 206,074 290,158 268,809 240,884 245,083
Office of Employment Dev 8,737,769 9,126,557 9,241,597 8,113,639 7,988,847
Orphan’s Court 228,469 317,985 342,348 311,950 328,993
Planning Department 2,110,358 2,513,763 2,470,085 2,444,078 2,919,818
Police Department 199,573,347 229,243,749 230,392,751 228,004,357 256,359,188
Recreation & Parks 13,458,586 15,113,724 15,048,382 16,059,910 17,008,779
Sheriff’s Office 7,325,127 8,561,306 9,110,996 10,199,283 10,839,221
Special City Services 22,831 31,774 41,331 21,215 35,563
States Attorneys Office 19,729,108 21,037,262 21,653,893 21,593,103 21,486,610
Temp Assist for Needy Families 75,982 106,872 0 0 0
Wage Commissioner 154,204 0 0 0 0
War Memorial Commission 144,577 166,438 0 0 0
Youth Cust 502,534 52,877 3,640 0 0
Youth Summer 0 1,749,083 1,635,391 1,719,443 0
Youth Temp Adult 167,336 227,664 269,957 235,644 0

3.5 A Power-Law Relation between The Size of An Agency And Salary Spending

Combining results from the employment and salary analyses above, the following plot explores the relation between the City’s workforce and total salary spending.

In the plot above, the colors of the dots are different for different years, and the sizes of the dots are proportional to the sizes of the workforce. The visualization reinforces some of the results we obtained earlier: The years 2013 and 2014 had particularly large workforce because of the addition of youth programs participants. At the same time, these were not traditional employees since they certainly weren’t paid much–here we see that the total salary spending didn’t increase much from 2012 to 2014 despite the fact that thousands of people were added to the payroll. In 2015, youth programs participants weren’t included in the salary data file, and so the City’s workforce appears to have shrunk considerably from 2014 to 2015. Nonetheless, the overall increase in salary spending from 2014 to 2015 is one of the largest over the years. Here are the numbers.

Employment & Salary Spending 2011-2015
  Size of Workforce
(n)
Salary Spending
($)
2011 14,457 611,868,176
2012 16,052 693,055,130
2013 19,140 705,505,949
2014 18,985 719,239,135
2015 14,019 767,007,701

Let’s look at the numbers at a more granular level and examine salary spending vs. number of employees by agency.

The above is a log-log plot. It shows each year’s data in a different color. Visually most of the data points lie along a straight line, so I chose to fit a power law to the data. The best fit lines are displayed in the same colors as the data that were used to derive them. One may notice that the power law describes the 2015 data particularly well because there are few outlier points. The best fit parameter values, RMSE, and the \(R^2\) of the fits are given in the following table.

Best-Fitting Parameters
  Coefficient Exponent RMSE R-sq
2011 24.82 1.04 0.85 0.86
2012 27.52 1.02 1.01 0.8
2013 28.26 1.01 1.2 0.75
2014 39.1 0.96 1.05 0.77
2015 29.14 1.08 0.49 0.95

3.5.1 Case study: Workforce reduction that coincided with increased salary spending

The increase in the Police Department’s salary spending from 2014 to 2015 is most intriguing, so I chose it as the subject of my focus study. The aim was to find out where the extra spending went into during this time period. First, let’s extract a subset of the data for those who were in the police workforce in 2014 or 2015 and calculate the relevant statistics.

police1415 = BmoreSalaries1115 %>%
  subset(BmoreSalaries1115$Agency15=="Police Department" 
         | BmoreSalaries1115$Agency14=="Police Department") %>%
  select(Name, HireDate, JobTitle15, JobTitle14, Agency15, Agency14, 
         GrossPay15, GrossPay14, AnnualSalary15, AnnualSalary14)

policeWF14 = sum(police1415$Agency14=="Police Department", na.rm=T)    # [1] 3211
policeWF15 = sum(police1415$Agency15=="Police Department", na.rm=T)    # [1] 3097

LeftIdx = (police1415$Agency14=="Police Department") & ((police1415$Agency15!="Police Department") 
                                                        | is.na(police1415$Agency15))
NLeft = sum(LeftIdx, na.rm=T)    # [1] 328

RetainIdx = (police1415$Agency14=="Police Department") & (police1415$Agency15=="Police Department")
NRetain = sum(RetainIdx, na.rm=T)    # [1] 2883
RetainRate = NRetain/policeWF14    # [1] 0.9

NewHireIdx = (police1415$Agency15=="Police Department") & (is.na(police1415$Agency14) 
                                                           | (police1415$Agency14!="Police Department"))
NNewHire = sum(NewHireIdx)    # [1] 214

The police workforce in 2014 and 2015 were 3,211 and 3,097 respectively. 328 individuals left the department in 2014, and 214 were hired into the department in 2015. Overall, the workforce retention rate from 2014 to 2015 was 90%. The following plot shows the department’s salary distributions.

Clearly, both years’ distributions skew to the right, meaning that a few individuals received much higher paid than most of the other employees. In general, the 2015 distribution appears to have shifted to the right, which indicates that the increase spending had a broad impact. Let’s look more specifically at the distribution of the pay difference in the plot below.

The above plot shows that for those who remained in the police workforce from 2014 to 2015, the distribution of pay difference was nearly normally distributed. The average of the pay difference was $13,700, and the median of the pay difference was $12,000. The 25 and 75 percentiles were at $6,740 and $18,700. Again, this plot reinforces our earlier conclusion that the increase in salary spending had impact on majority of the personnel in the Police Department.

Finally, let’s take a look at the data of those who garnered the largest pay increase from 2014 to 2015.

police1415 <- police1415 %>%
  mutate(PayInc=police1415$GrossPay15-police1415$GrossPay14) %>%
  arrange(desc(PayInc))

police1415 %>%
  select(Name, HireDate, JobTitle15, JobTitle14, GrossPay15, GrossPay14, 
         AnnualSalary15, AnnualSalary14, PayInc) %>%
  head(n=10)
##                  Name   HireDate             JobTitle15             JobTitle14 GrossPay15 GrossPay14
## 1       Banach,Paul F 2014-05-29 Operations Officer III POLICE COMMAND STAFF I     105820       3923
## 2    Swinton,Dwayne L 1990-10-23        POLICE SERGEANT        POLICE SERGEANT     134567      40409
## 3    De Jesus,Michael 1994-02-09         POLICE OFFICER         POLICE OFFICER     100900       8049
## 4       Adams,Damon R 2003-10-23         POLICE OFFICER         POLICE OFFICER      97765       7688
## 5   Gladstone,Keith A 1992-11-20        POLICE SERGEANT        POLICE SERGEANT     132188      52263
## 6  Fleurimond,Richard 1999-10-12   POLICE OFFICER (EID)   POLICE OFFICER (EID)     166056      86326
## 7      Walrath,Mark A 1992-05-05  POLICE LIEUTENANT EID  POLICE LIEUTENANT EID     194656     115639
## 8     Camarote,Marc J 1996-10-03    POLICE SERGEANT EID    POLICE SERGEANT EID     210509     136940
## 9     Paradise,John J 1994-03-28  POLICE LIEUTENANT EID  POLICE LIEUTENANT EID     161310      92518
## 10      Leitch,Eric R 2000-02-10        POLICE SERGEANT        POLICE SERGEANT     182621     119683
##    AnnualSalary15 AnnualSalary14 PayInc
## 1          101200         102000 101897
## 2           90273          81327  94158
## 3           76496          68322  92852
## 4           68013          62380  90078
## 5           88682          79249  79925
## 6           74290          66320  79730
## 7          103808          92706  79017
## 8           86874          77564  73568
## 9          101999          91156  68793
## 10          84130          75288  62938

It appears that those who received the largest pay increase were not new hires. Instead, for some of them, for some reason, their were paid well below the projected salaries in 2014, and in 2015 they received the normal pay for their positions. At the same time, there were also others who received close to $80,000 of pay increase on top of their normal salary, resulting in double the amount of salary for their positions. Could those be overtime pay?

Whatever the reason behind the pay increase, it appears that the pay scale of the Police Department is higher than that of other agencies. The two boxplots below compare the salary distributions of the Police Department and all City agencies in 2014 and 2015. It shows that Police Department had a higher median pay than the rest of the City and, indeed, it got a larger pay increase from 2014 to 2015 than the rest of the City.

4 Conclusion

So far in this project, I have examined the employment and salary statistics of Baltimore City employees based on the 2011-2015 data from the Baltimore Open Data Portal.

In terms of employment statistics, I found that the Police Department had the largest number of employees among all City agencies over the 2011-2015 time period. The size of the police workforce, nonetheless, has declined modestly since 2012. I then examined the City-wide employee retention rates and the retention rates of each City agencies. Overall, Baltimore City had a one-year employee retention rate of 87%, a two-year retention rate of 78%, a three-year retention rate of 71%, and a four-year retention rate of 65%. The number of employees who switched jobs among different City agencies were small, about 1.1%. The result indicates that the majority of employee attrition was due to employees leaving the City workforce all together. It also means that City workers, if they chose to stay working for the City, liked to stay with the same agencies. Analysis at a more granular scale shows that smaller agencies, if they continue to exist over the years, had higher retention rates than larger agencies. Nonetheless, over long-term periods (four years), larger departments such as the Fire Department and the Dept of Public Works have been able to achieve some of the highest retention rates among all City agencies.

The salary distributions of City employees show that every year since 2011 there has been a steady, albeit modest, pay increase. One interesting observation was the Police Department’s workforce reduction that coincided with its increase in salary spending. Upon further examination of the data, I found that the extra spending had impact on majority of the personnel in the Police Department, and was not the results of hiring a few high paying individuals. The data also showed that the Police Department had a higher pay scale than all the other City agencies.

The data products from this analysis are available for download here.