Part 2:

This is the second post in a series of blogs on analyzing employee attrition.

In this blog entry, we discuss the use of several algorithms to model employee attrition in R and RShiny: extreme gradient boosting (XGBoost), support vector machines (SVM), and logistic regression.

• XGBoost is a decision-tree-based algorithm. Multiple trees are ensembled to improve the predictive power of the model.
• SVM is a discriminative classifier that takes labeled training data and constructs a hyperplane to categorize new examples.
• Logistic regression is a simple classifier used to estimate the probability of a binary outcome based on several predictors and a logit function.

Data Preprocessing

There are several steps to take before any predictions can be made. First, load all the machine learning packages we need in R.

# load packages
library("xgboost"); library("e1071"); library("MASS"); library("xtable")


The data is partitioned into three sets: training, validation and testing:

• The training set is responsible for initially teaching the model the relationship between the predictors and the attrition probability.
• The validation set is then used to estimate how well the model has been trained and to fine-tune the model's parameters.
• Once those two steps are complete, the finished model is applied to the testing set to get an accurate estimate of how the model would perform on real-world, unseen data.
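As a sketch, a three-way split along these lines could be created in base R as follows. The 60/20/20 proportions and the stand-in data frame are illustrative assumptions; the code later in this post uses a simpler two-way train/test split.

```r
# Illustrative 60/20/20 train/validation/test split (proportions are assumptions)
set.seed(0)
d <- data.frame(x = rnorm(100), y = rbinom(100, 1, 0.5))  # stand-in data

n <- nrow(d)
idx <- sample(n)                                   # shuffle row indices
train_idx <- idx[1:floor(0.6 * n)]
valid_idx <- idx[(floor(0.6 * n) + 1):floor(0.8 * n)]
test_idx  <- idx[(floor(0.8 * n) + 1):n]

train_set <- d[train_idx, ]
valid_set <- d[valid_idx, ]
test_set  <- d[test_idx, ]
```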

The code to split the data is shown as follows:

# Create data for training and test
set.seed(0)
tr.number<-sample(nrow(d),nrow(d)*2/3)
# we split whole dataset into 2/3 training data and 1/3 testing data
train<-d[tr.number,]
test<-d[-tr.number,]

column_names = names(test)

# split dataset
train_Y = as.numeric(train$Attrition)
train$Attrition <- NULL
test_Y = test$Attrition
test$Attrition <- NULL

# numericize training and testing data
train[] <- lapply(train, as.numeric)
test[] <- lapply(test, as.numeric)


Data Modeling with Machine Learning

XGBoost model

The first model we fit is an extreme gradient boosting (XGBoost) model. For more detail, see http://xgboost.readthedocs.io/en/latest/parameter.html.

Here, we choose a maximum tree depth of 3 and a step-size shrinkage (eta) of 0.3, and we use logistic regression for binary classification with the area under the curve (AUC) as the evaluation metric for validation data. These default values can be tuned to optimize for accuracy in a true analysis.

# Construct xgb.DMatrix object from training matrix
dtrain <- xgb.DMatrix(as.matrix(train), label = train_Y)

# Create a list of parameters for XGBoost model
param <- list(max_depth = 3,
              silent = 1,
              eta = 0.3,
              objective = 'binary:logistic',
              eval_metric = 'auc')

# Train an XGBoost model using the training dataset and chosen parameters
bst <- xgb.train(param, nrounds = 82, dtrain)

# Predicting the results using testing dataset
pred.xgb <- predict(bst, as.matrix(test))

# Create a table indicating the most important features of XGBoost model
importance <- xgb.importance(feature_names = column_names, model = bst)


The XGBoost object, bst, created from training the model is a list containing basic information about the training run, e.g. the parameter settings. The predict step allows us to make predictions using the separate test dataset.
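The number of boosting rounds (nrounds = 82 above) can itself be chosen by cross-validation. A minimal sketch using xgb.cv is shown below; the synthetic data matrix is an illustrative assumption standing in for the real training matrix.

```r
library(xgboost)

# synthetic stand-in data (assumption; replace with the real training matrix)
set.seed(0)
X <- matrix(rnorm(500 * 4), ncol = 4)
y <- as.numeric(X[, 1] + rnorm(500) > 0)

# 3-fold cross-validation over up to 100 boosting rounds
cv <- xgb.cv(params = list(max_depth = 3, eta = 0.3,
                           objective = "binary:logistic",
                           eval_metric = "auc"),
             data = xgb.DMatrix(X, label = y),
             nrounds = 100, nfold = 3, verbose = 0)

# round with the best held-out AUC
best_nrounds <- which.max(cv$evaluation_log$test_auc_mean)
```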

We can show the first six predictions here:

head(pred.xgb)
## [1] 0.61055940 0.59795433 0.01765974 0.03849135 0.91354728 0.06819534


The result shows us the first six predicted probabilities for the test dataset. For example, the first observation's prediction is 0.61: based on our model, that employee has a 61% chance of attriting.
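To turn these probabilities into hard attrite/stay labels, one can threshold them, conventionally at 0.5 (the cutoff is an assumption and could be tuned against the validation set):

```r
# predicted probabilities from the model (values copied from the output above)
pred.xgb <- c(0.61055940, 0.59795433, 0.01765974, 0.03849135, 0.91354728, 0.06819534)

# classify as "will attrite" when the predicted probability exceeds 0.5
pred.label <- as.numeric(pred.xgb > 0.5)
pred.label
## [1] 1 1 0 0 1 0
```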

Feature Importance

A great advantage of the XGBoost model is its built-in ability to produce a feature importance table. The importance metric provides a score indicating how valuable each factor was in the construction of the boosted decision trees. Higher relative importance indicates a larger impact on the algorithm and the final prediction.

To actively improve overall employee retention, we can use this table to look more closely at the most important features driving attrition.

In the figure below, we visualize the top 10 most important features as a bar chart using the Gain metric.

xgb.plot.importance(importance_matrix = importance, top_n = 10)


When the model is run on the entire dataset, the results show that Marital Status, Number of Companies Worked For, and Age are the dominant drivers of employee attrition for this dataset. By looking at the plots from Part 1, we can determine how these features impact attrition directly.

In terms of HR adjustable parameters, we note that adjustments to Job Involvement, Stock Option and Monthly Income might be used as incentives for high value employees.

SVM model

The second model we fit is a support vector machine (SVM). An SVM constructs a hyperplane, or a set of hyperplanes, with the largest distance to the nearest training data points of the other classes.

We choose a radial kernel here, with gamma and cost values set to optimize the performance of the SVM. Again, these should be tuned in a full analysis.

# reattach the response variable for the formula interface
train$Attrition <- train_Y

# Training a SVM
svm_model <- svm(Attrition ~ .,               #set model formula
                 type = "C-classification",   #set classification machine
                 gamma = 0.001668101,         #set gamma parameter
                 cost = 35.93814,             #set cost parameter
                 data = train,
                 cross = 3,                   #3-fold cross validation
                 probability = TRUE           #allow for probability prediction
                 )

# Predicting the results using testing dataset
# Obtain the predicted class 0/1
svm_model.predict<-predict(svm_model, test, probability=TRUE)
# Obtain the predicted probability for class 0/1
svm_model.prob <-attr(svm_model.predict,"probabilities")
svm_model
##
## Call:
## svm(formula = Attrition ~ ., data = train, type = "C-classification",
##     gamma = 0.001668101, cost = 35.93814, cross = 3, probability = TRUE)
##
##
## Parameters:
##    SVM-Type:  C-classification
##        cost:  35.93814
##       gamma:  0.001668101
##
## Number of Support Vectors:  339


The SVM model object is a list presenting basic information about the parameters, number of support vectors, etc. The number of support vectors depends on how much slack we allow when training the model. If we allow a large amount of flexibility, we will have a large number of support vectors.
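The very specific gamma and cost values used above would, in practice, come from a parameter search. A minimal sketch of such a grid search with e1071's tune.svm is shown below; the synthetic data frame and the grid ranges are illustrative assumptions.

```r
library(e1071)

# synthetic binary-classification data (assumption; replace with the real training set)
set.seed(0)
df <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
df$Attrition <- factor(as.numeric(df$x1 + df$x2 + rnorm(200) > 0))

# grid search over gamma and cost with internal cross-validation
tuned <- tune.svm(Attrition ~ ., data = df,
                  gamma = 10^(-3:-1), cost = 10^(0:2))

tuned$best.parameters   # gamma/cost pair with the lowest CV error
```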

Logistic regression model

The third model we fit is a logistic regression model. It uses a logit link to connect a linear combination of the predictors to the class probability: the inverse logit (logistic) function always produces a value between 0 and 1 that can be interpreted as a probability.
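This squashing behavior can be seen directly with base R's plogis, the inverse logit:

```r
# the inverse logit maps any real-valued linear predictor into (0, 1)
plogis(c(-5, 0, 5))
## [1] 0.006692851 0.500000000 0.993307149
```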

# Training a logistic regression model
# (the glm call was omitted from the original post; a binomial family gives the logit link)
LR_model <- glm(Attrition ~ ., data = train, family = binomial(link = "logit"))

# Predicting the results using testing dataset
LR_model.predict <- predict(LR_model, test, type = "response")
coef(LR_model)[2]
##        Age
## -0.0298873


For example, the coefficient for the variable "Age" is around -0.03, which is interpreted as the expected change in log odds for a one-unit increase in the employee's age. The odds ratio can be calculated by exponentiating this value; this gives 0.97, implying that we expect to see about a 3% decrease in the odds of attrition for a one-unit increase in the employee's age. In other words, if all other variables are held constant, the expected probability of attrition is lower for older employees.
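The arithmetic behind that interpretation can be checked directly:

```r
beta_age <- -0.0298873          # fitted coefficient for Age (log-odds scale)

odds_ratio <- exp(beta_age)     # multiplicative change in odds per year of age
round(odds_ratio, 3)
## [1] 0.971

# percent change in the odds of attrition per one-year increase in age
round(100 * (odds_ratio - 1), 1)
## [1] -2.9
```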

For more details on this or any potential analyses, please visit us at http://sflscientific.com or contact mluk@sflscientific.com.

--

Contributors: Michael Luk, Zijian Han, Jinru Xue, Han Lin [SFL Scientific]