Employee Attrition Modeling -- Part 1

Introduction to Attrition Analysis

This is the first in a series of blog posts on analyzing employee attrition.

It is often difficult for an HR department to identify which employees are most likely to leave the company and for what reasons; this blog aims to give you a way to leverage machine learning to better understand your employees. We have built an R Shiny app for this project, and all of our analyses are visualized in it: https://sflscientific.shinyapps.io/employee_attrition_app/ (please contact mluk@sflscientific.com if you have trouble viewing this page).

Part 1:

  1. The Data Problem
  2. Exploratory Data Analysis

Part 2:

  1. Data Preprocessing
  2. Data Modeling with Machine Learning

Part 3:

  1. Fine-Tuning the Results

Part 4:

  1. Construction of Shiny App
  2. Little Guidance

Part 1:

The Data Problem

Employee attrition is the rate at which employees leave a company. The goal of this analysis is to model employee attrition and determine the dominant factors that drive this turnover. Through this kind of analysis, we can estimate how many employees are likely to leave, while also determining which employees are at the highest risk and for what reasons.

Exploratory Data Analysis

The dataset used in this analysis is provided by IBM HR for studying employee attrition; it can be found here.

Before any machine learning analysis, the data must first be cleaned. The goal is to make the data as consistent and relevant as possible, which helps the accuracy of the final model once we start modeling.

As a first step, we import the data as d:

# Importing data
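The import code itself is not shown above, so the following is only a minimal sketch of one way to do it, assuming the dataset was downloaded as a CSV file. The helper name load_attrition and the toy file are hypothetical and exist only to make the example runnable on its own.

```r
# Hypothetical helper: read the attrition CSV into a data frame.
# The file name used in the original analysis is not given in the post,
# so `path` is wherever you saved your copy of the data.
load_attrition <- function(path) {
  # Keep text columns as character so they can be encoded explicitly later.
  read.csv(path, stringsAsFactors = FALSE)
}

# Self-contained demonstration on a two-row toy file (made-up values).
toy_path <- tempfile(fileext = ".csv")
writeLines(c("Age,Attrition,Department",
             "41,Yes,Sales",
             "49,No,Research & Development"), toy_path)
d_toy <- load_attrition(toy_path)
str(d_toy)
```

With the real file, the call would simply be d <- load_attrition("path/to/your/file.csv").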

Next, we convert the categorical variables to numbers as follows:

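# Encode categorical columns as integers (factor levels are sorted alphabetically)
d$Attrition <- as.integer(as.factor(d$Attrition)) - 1   # "No" -> 0, "Yes" -> 1
d$BusinessTravel <- as.integer(as.factor(d$BusinessTravel))
d$Department <- as.integer(as.factor(d$Department))
d$JobRole <- as.integer(as.factor(d$JobRole))
# Any other non-numeric columns need the same treatment before calling cor()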
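To see what this encoding does, here is a small sketch on a made-up data frame. The helper encode_categoricals is hypothetical (it is not part of the original analysis); it applies the same as.integer(as.factor(...)) idiom to every character or factor column at once.

```r
# Toy data frame with made-up values, standing in for the real data.
toy <- data.frame(
  Attrition = c("Yes", "No", "No", "Yes"),
  Department = c("Sales", "Human Resources", "Sales", "Research & Development"),
  Age = c(41L, 49L, 37L, 33L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: integer-encode every character or factor column.
encode_categoricals <- function(df) {
  is_cat <- vapply(df, function(col) is.character(col) || is.factor(col),
                   logical(1))
  df[is_cat] <- lapply(df[is_cat], function(col) as.integer(as.factor(col)))
  df
}

encoded <- encode_categoricals(toy)
# Shift Attrition from 1/2 to 0/1 so that "No" = 0 and "Yes" = 1.
encoded$Attrition <- encoded$Attrition - 1L
print(encoded)
```

Note that the integers only reflect the alphabetical order of the labels. That is enough for a quick correlation plot, but the numeric distances between categories carry no real meaning.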

A correlation matrix illustrates the relationship between every pair of variables. The code is as follows:

# draw the correlation matrix plot (requires the corrplot package)
library(corrplot)
corrplot(cor(d), method = "circle", tl.col = "#3982B7",
         mar = c(2, 0, 0, 0), tl.cex = 0.8)

The above correlation matrix displays the linear correlation between every pair of features in the form of dots of varying colors and sizes. A larger dot indicates a stronger correlation between the two features, whereas the color shows whether the correlation coefficient is positive (blue) or negative (red).
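As a quick illustration of what each dot encodes, the sketch below computes a Pearson correlation coefficient by hand and compares it with base R's cor(), which uses the Pearson method by default. The two vectors are made-up values, not taken from the dataset.

```r
# Made-up toy vectors for illustration only.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Pearson correlation written out from its definition:
# covariance of x and y divided by the product of their spreads.
pearson_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Base R's cor() uses method = "pearson" unless told otherwise.
pearson_builtin <- cor(x, y)

print(c(manual = pearson_manual, builtin = pearson_builtin))
```

Calling cor() on a whole data frame simply repeats this calculation for every pair of columns, which is the matrix that the plot draws.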

The Shiny app adds further functionality: clicking any element in the correlation matrix displays a 2D histogram, making the correlation between the two selected features easier to inspect.

The code is as follows:

# construct a 2D histogram using ggplot2 stat_bin2d
library(ggplot2)
ggplot(d, aes(YearsInCurrentRole, YearsWithCurrManager)) +
  stat_bin2d(bins = c(15, 10)) +            # set the number of bins in x and y
  guides(colour = guide_legend(override.aes = list(alpha = 1)),
         fill = guide_legend(override.aes = list(alpha = 1))) +
  theme(legend.position = "bottom")         # place the legend below the plot

[Figure: 2D histogram of YearsInCurrentRole against YearsWithCurrManager]
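For readers unfamiliar with 2D histograms, the following base R sketch shows the idea behind the binning: each variable is cut into intervals, and the plot colors each cell by how many rows fall into it. The values and the two-bin breaks are made up for illustration and are far coarser than the bins used in the app.

```r
# Made-up tenure values (in years) for eight hypothetical employees.
years_in_role  <- c(0, 1, 2, 2, 4, 7, 7, 9)
years_with_mgr <- c(0, 1, 2, 3, 3, 7, 8, 8)

# Cut each variable into two equal-width bins: [0, 5] and (5, 10].
role_bin <- cut(years_in_role, breaks = c(0, 5, 10), include.lowest = TRUE)
mgr_bin  <- cut(years_with_mgr, breaks = c(0, 5, 10), include.lowest = TRUE)

# Cross-tabulate the bins; each cell is the count a 2D histogram would color.
counts <- table(role_bin, mgr_bin)
print(counts)
```

When two features are strongly correlated, most of the counts pile up along the diagonal of this table, which is what the heat pattern in the 2D histogram makes visible.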

Alternatively, clicking the elements along the leading diagonal will output violin plots of the selected feature, grouped by the true underlying attrition value (1 indicating employees who left, and 0 indicating those who remain).

The code is as follows:

# construct a violin plot using ggplot2 geom_violin
ggplot(d, aes(factor(Attrition), WorkLifeBalance)) +
  geom_violin(alpha = 0.2, aes(fill = factor(Attrition))) +   # one violin per attrition group
  theme_bw() +                                                # set the theme
  theme(legend.title = element_text(size = 14),
        legend.position = "bottom")                           # style and place the legend
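A violin plot draws a kernel density estimate of the feature within each group. The base R sketch below mirrors that idea without any plotting: it splits a feature by attrition value and computes a density estimate and a median per group. The data frame is made up for illustration, and the bandwidth is whatever density() picks by default, which need not match what ggplot2 draws.

```r
# Made-up toy data: attrition flag and a 1-4 work-life balance score.
toy <- data.frame(
  Attrition = c(0, 0, 0, 0, 1, 1, 1, 1),
  WorkLifeBalance = c(3, 3, 2, 4, 1, 2, 3, 1)
)

# Split the feature by attrition group.
by_group <- split(toy$WorkLifeBalance, toy$Attrition)

# Kernel density estimate per group (the outline of each violin).
densities <- lapply(by_group, density)

# A simple per-group summary to read alongside the shapes.
medians <- vapply(by_group, median, numeric(1))
print(medians)
```

If the two groups have clearly different shapes or medians, the feature is a candidate predictor of attrition; if they look alike, it is probably less informative on its own.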

For more details on this or any potential analyses, please visit us at http://sflscientific.com or contact mluk@sflscientific.com.


Contributors: Michael Luk, Zijian Han, Jinru Xue, Han Lin [SFL Scientific]