The live interactive app can be found here: https://sflscientific.shinyapps.io/employee_attrition_app/ (Please contact firstname.lastname@example.org if you have trouble viewing this page).
This is the second in a series of blog posts on analyzing employee attrition.
In this blog entry, we discuss the use of several algorithms to model employee attrition in R and RShiny: extreme gradient boosting (XGBoost), support vector machines (SVM), and logistic regression.
- XGBoost is a decision-tree-based algorithm. Multiple trees are combined into an ensemble to improve the predictive power of the model.
- SVM is a discriminative classifier that takes labeled training data and constructs a hyperplane to categorize new examples.
- Logistic regression is a simple classifier used to estimate the probability of a binary outcome based on several predictors and a logit function.
There are several steps to take before any predictions can be made. First, load all the machine learning packages we need in R.
# load packages
library("xgboost")
library("e1071")
library("MASS")
library("xtable")
The data is partitioned into three sets: training, validation and testing:
- The training set is responsible for initially teaching the model the relationship between the input features and the attrition probability.
- The validation set is then used to estimate how well the model has been trained and to fine-tune the parameters to develop the best model.
- Once those two steps have been completed, the final model is applied to the testing set to estimate how it would perform on real-world data.
The code below splits the data into a 2/3 training set and a 1/3 testing set; no separate validation set is held out here:
# Create data for training and test
set.seed(0)
# split the whole dataset into 2/3 training data and 1/3 testing data
tr.number <- sample(nrow(d), nrow(d) * 2 / 3)
train <- d[tr.number, ]
test <- d[-tr.number, ]
# separate the label from the features
train_Y <- as.numeric(train$Attrition)
train$Attrition <- NULL
test_Y <- test$Attrition
test$Attrition <- NULL
# record feature names after dropping the label, so they match the model's columns
column_names <- names(test)
# convert training and testing data to numeric, keeping them as data frames
train[] <- lapply(train, as.numeric)
test[] <- lapply(test, as.numeric)
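The three-way partition described above (training, validation, testing) could be sketched in base R as follows. This is only an illustration under stated assumptions: the data frame `toy` is made-up data standing in for `d`, and the 60/20/20 proportions are a hypothetical choice, not the split used in this post.

```r
# Hypothetical toy data standing in for the attrition data frame `d`
set.seed(0)
toy <- data.frame(Age = sample(20:60, 100, replace = TRUE),
                  Attrition = rbinom(100, 1, 0.2))

# Shuffle the row indices once, then cut them into 60% / 20% / 20%
# (these proportions are an assumption for illustration)
n <- nrow(toy)
idx <- sample(n)
n_train <- round(0.6 * n)
n_valid <- round(0.2 * n)

train_idx <- idx[seq_len(n_train)]
valid_idx <- idx[n_train + seq_len(n_valid)]
test_idx  <- idx[(n_train + n_valid + 1):n]

toy_train <- toy[train_idx, ]
toy_valid <- toy[valid_idx, ]
toy_test  <- toy[test_idx, ]

# Each row lands in exactly one partition
c(train = nrow(toy_train), valid = nrow(toy_valid), test = nrow(toy_test))
```

Shuffling the indices once and slicing them guarantees that the three sets do not overlap, which is what keeps the testing set an honest estimate of real-world performance.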
Data Modeling with Machine Learning
The first model we fit is an extreme gradient boosting (XGBoost) model. For more detail on its parameters, see http://xgboost.readthedocs.io/en/latest/parameter.html.
Here, we set the maximum tree depth to 3 and the step size shrinkage (eta) to 0.3, and use the logistic objective for binary classification with area under the curve (AUC) as the evaluation metric. These initial values can be tuned to optimize performance in a full analysis.
# Construct xgb.DMatrix object from training matrix
dtrain <- xgb.DMatrix(as.matrix(train), label = train_Y)
# Create a list of parameters for XGBoost model
param <- list(max_depth = 3, silent = 1, eta = 0.3, objective = 'binary:logistic', eval_metric = 'auc')
# Training a XGBoost model using training dataset and chosen parameters
bst <- xgb.train(params = param, data = dtrain, nrounds = 82)
# Predicting the results using testing dataset
pred.xgb <- predict(bst, as.matrix(test))
# Create a table indicating the most important features of XGBoost model
importance <- xgb.importance(feature_names = column_names, model = bst)
The XGBoost object, bst, created by training the model is a list that contains basic information about the training, such as the parameter settings. The predict step lets us make predictions on the separate test dataset.
We can show the first six predictions here:
head(pred.xgb)
## [1] 0.61055940 0.59795433 0.01765974 0.03849135 0.91354728 0.06819534
The result shows the first six predicted probabilities for the test dataset. For example, the first observation’s prediction is 0.61: according to our model, that employee has a 61% chance of leaving.
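Predicted probabilities like these are scored against the true labels with AUC, the evaluation metric named earlier. The following base-R sketch computes AUC through the rank-sum identity and then converts probabilities into class labels. The scores are the rounded predictions shown above, but the labels are invented for illustration, and both the helper `auc` and the 0.5 cut-off are assumptions rather than part of the original analysis.

```r
# Rounded scores from the output above; the labels are hypothetical
scores <- c(0.61, 0.60, 0.02, 0.04, 0.91, 0.07)
labels <- c(0, 1, 0, 0, 1, 0)

# AUC via the rank-sum (Mann-Whitney) identity: the probability that a
# randomly chosen positive is scored above a randomly chosen negative
auc <- function(labels, scores) {
  r <- rank(scores)                 # average ranks handle ties
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc(labels, scores)

# Turn probabilities into class predictions with a 0.5 cut-off
# (the cut-off is an assumption and can be moved to suit the use case)
predicted_class <- as.integer(scores > 0.5)
table(actual = labels, predicted = predicted_class)
```

Because AUC depends only on how the scores are ranked, it is unaffected by the choice of cut-off, whereas the confusion table changes as the cut-off moves.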
A great advantage of using an XGBoost model is its built-in ability to show us a feature importance table. The importance metric provides a score indicating how valuable each factor was in the construction of the boosted decision trees. Higher relative importance indicates a larger impact on the algorithm and final prediction.
To actively improve employee retention, we can use this table to look more closely at the most important features that determine attrition.
In the figure below, we visualize the 10 most important features as a bar chart of the importance metric.
xgb.plot.importance(importance_matrix = importance, top_n = 10)
When the model is run on the entire dataset, the results show that Marital Status, Number of Companies Worked For, and Age are the dominant drivers of employee attrition for this dataset. By looking at the plots from Part 1, we can determine how these features impact attrition directly.
In terms of HR-adjustable parameters, we note that adjustments to Job Involvement, Stock Option and Monthly Income might be used as incentives for high-value employees.
The second model we fit is a support vector machine (SVM) model. An SVM constructs a hyperplane, or a set of hyperplanes, that has the largest distance to the nearest training data points of any class.
We use a radial kernel, the default in e1071's svm(), with gamma and cost values chosen to improve the performance of the SVM. Again, these should be tuned in a full analysis.
train$Attrition <- train_Y
# Training a SVM
svm_model <- svm(Attrition ~ .,            # set model formula
                 type = "C-classification", # set classification machine
                 gamma = 0.001668101,       # set gamma parameter
                 cost = 35.93814,           # set cost parameter
                 data = train,
                 cross = 3,                 # 3-fold cross validation
                 probability = TRUE         # allow for probability prediction
)
# Predicting the results using testing dataset
# Obtain the predicted class 0/1
svm_model.predict <- predict(svm_model, test, probability = TRUE)
# Obtain the predicted probability for class 0/1
svm_model.prob <- attr(svm_model.predict, "probabilities")
svm_model
##
## Call:
## svm(formula = Attrition ~ ., data = train, type = "C-classification",
##     gamma = 0.001668101, cost = 35.93814, cross = 3, probability = TRUE)
##
##
## Parameters:
##    SVM-Type:  C-classification
##  SVM-Kernel:  radial
##        cost:  35.93814
##       gamma:  0.001668101
##
## Number of Support Vectors:  339
The SVM model object is a list presenting basic information about the parameters, the number of support vectors, and so on. The number of support vectors depends on how much slack we allow when training the model: the more slack we allow (that is, the smaller the cost), the more training points end up on or inside the margin, and the larger the number of support vectors.
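To make the role of gamma more concrete, the radial kernel in e1071 has the form exp(-gamma * |u - v|^2). The base-R sketch below evaluates it by hand; the helper `rbf_kernel` and the two feature vectors are hypothetical and only illustrate how gamma controls how quickly similarity decays with distance.

```r
# Radial (RBF) kernel: similarity decays with squared Euclidean distance
rbf_kernel <- function(u, v, gamma) {
  exp(-gamma * sum((u - v)^2))
}

# Two hypothetical employees described by two made-up numeric features
u <- c(30, 5)
v <- c(40, 8)

# The small gamma used in the model keeps distant points fairly similar ...
rbf_kernel(u, v, gamma = 0.001668101)
# ... while a larger gamma makes the kernel much more local
rbf_kernel(u, v, gamma = 0.1)
```

A small gamma gives a smooth decision boundary, while a large gamma lets the boundary follow individual training points more closely, which is why gamma and cost are usually tuned together.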
Logistic regression model
The third model we fit is a logistic regression model. It uses a logit link to relate a linear combination of the predictors to the class probability. The inverse of the logit (the logistic function) always produces a value between 0 and 1, so the prediction can be interpreted as a probability.
# Training a logistic regression model
LR_model <- glm(Attrition ~ ., family = binomial(link = 'logit'), data = train)
# Predicting the results using testing dataset
LR_model.predict <- predict(LR_model, test, type = "response")
coef(LR_model)
##        Age
## -0.0298873
For example, the coefficient for the variable "Age" is around -0.03, which is interpreted as the expected change in the log odds for a one-unit increase in the employee's age. The odds ratio is obtained by exponentiating this value; this gives about 0.97, implying that we expect roughly a 3% decrease in the odds of attrition for a one-unit increase in the employee's age. In other words, if the other variables are kept unchanged, the expected probability of attrition is lower for older employees.
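The arithmetic behind this interpretation can be checked directly in base R. The sketch below uses the Age coefficient reported above; the baseline log-odds of -1 is a hypothetical value chosen only to show how a shift in log odds translates into a shift in probability.

```r
# Coefficient for Age as reported above
beta_age <- -0.0298873

# Odds ratio for a one-unit increase in Age
odds_ratio <- exp(beta_age)
round(odds_ratio, 2)

# Percentage change in the odds per one-unit increase
round(100 * (odds_ratio - 1), 1)

# The inverse logit (plogis in base R) maps any log-odds value into (0, 1)
plogis(c(-5, 0, 5))

# Hypothetical baseline: an employee whose log-odds of attrition are -1.
# Ten extra units of Age shift the log-odds by 10 * beta_age.
baseline <- -1
plogis(baseline)
plogis(baseline + 10 * beta_age)
```

Note that the coefficient acts multiplicatively on the odds, not on the probability itself, so the change in probability for a given change in age depends on the baseline.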
For more details on this or any potential analyses, please visit us at http://sflscientific.com or contact email@example.com.
Contributors: Michael Luk, Zijian Han, Jinru Xue, Han Lin [SFL Scientific]