# Part 3: Model Evaluation

This is the third post in a series of blogs discussing our RShiny Attrition App. In this part, we use the pre-trained models from Part 2 to evaluate the models and discuss the results.

We will use several machine learning evaluation criteria, such as the receiver operating characteristic (ROC) curve and the precision-recall curve. All of them are calculated from the predicted attrition class and the true attrition class in the testing dataset.

In RShiny, we first load the required libraries that we’ll be using:

# load packages
library(ggplot2)
library(corrplot)
library(ROCR)
library(caret)


## Fine-Tuning the Results

Remember that each algorithm gives a confidence score (a probability) between 0 and 1 for each employee, indicating that the individual is somewhere between 0% and 100% likely to attrite.

By setting the confidence score threshold above which we predict an employee will leave, we gain control over the precision and recall statistics. The cutoff can be adjusted in real time in the RShiny app to optimize the model for the needs of the business.
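As a minimal sketch with made-up probabilities (not our actual model output), the snippet below shows how raising or lowering the cutoff trades precision against recall:

```r
# Made-up confidence scores for six employees and their true attrition labels
probs <- c(0.92, 0.75, 0.55, 0.40, 0.20, 0.05)
truth <- c(1, 1, 0, 1, 0, 0)

for (cutoff in c(0.7, 0.3)) {
  pred <- ifelse(probs > cutoff, 1, 0)        # classify at this cutoff
  tp <- sum(pred == 1 & truth == 1)           # true positives
  precision <- tp / sum(pred == 1)            # of those flagged, how many leave
  recall <- tp / sum(truth == 1)              # of those who leave, how many flagged
  cat(sprintf("cutoff=%.1f precision=%.2f recall=%.2f\n", cutoff, precision, recall))
}
# cutoff=0.7 precision=1.00 recall=0.67
# cutoff=0.3 precision=0.75 recall=1.00
```

Lowering the cutoff flags more employees as likely leavers, raising recall at the cost of precision.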

## ROC Curve

A receiver operating characteristic (ROC) curve is the result of plotting the true positive rate against the false positive rate. The closer the ROC curve is to the top left corner, the greater the accuracy of the test.
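As a self-contained illustration with made-up scores and labels, ROCR's prediction and performance functions generate the TPR/FPR pairs behind such a curve:

```r
library(ROCR)

# Toy scores and labels, purely for illustration
scores <- c(0.9, 0.8, 0.6, 0.4, 0.3, 0.1)
labels <- c(1, 1, 0, 1, 0, 0)

pred <- prediction(scores, labels)
perf <- performance(pred, "tpr", "fpr")
# perf@x.values holds the FPR and perf@y.values the TPR
# at every possible score cutoff; plotting y against x gives the ROC curve
```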

Let us create prediction objects and use them to build the ROC plot data.

# Create a prediction object using previously saved results
ROCRpred_xgb <- prediction(pred.xgb, test_Y)
ROCRpred_svm <- prediction(svm_model.prob[,2], test_Y)
ROCRpred_lr <- prediction(LR_model.predict, test_Y)

#XGBoost roc data
perf_xgb <- performance(ROCRpred_xgb, 'tpr','fpr')
roc_xgb.data <- data.frame(fpr=unlist(perf_xgb@x.values),
                           tpr=unlist(perf_xgb@y.values), model="XGBoost")

#SVM roc data
perf_svm <- performance(ROCRpred_svm, 'tpr','fpr')
roc_svm.data <- data.frame(fpr=unlist(perf_svm@x.values),
                           tpr=unlist(perf_svm@y.values), model="SVM")

#Logistic Regression roc data
perf_lr <- performance(ROCRpred_lr, 'tpr','fpr')
roc_lr.data <- data.frame(fpr=unlist(perf_lr@x.values),
                          tpr=unlist(perf_lr@y.values), model="LR")


Everything is set up, and we can now draw the ROC plot.

# Define colors for roc plot
cols <- c("XGBoost" = "#3DB7E4", "SVM" = "#FF8849", "Logistic Regression" = "#69BE28")

# Create roc plot
ggplot() +
  geom_line(data = roc_xgb.data, aes(x = fpr, y = tpr, colour = "XGBoost")) + #set XGBoost roc curve
  geom_line(data = roc_svm.data, aes(x = fpr, y = tpr, colour = "SVM")) + #set SVM roc curve
  geom_line(data = roc_lr.data, aes(x = fpr, y = tpr, colour = "Logistic Regression")) + #set LR roc curve
  geom_vline(xintercept = 0.5, color = "red", linetype = 2) + theme_bw() + #set themes
  scale_colour_manual(name = "Models", values = cols) +
  xlab("False Positive Rate") +
  ylab("True Positive Rate") +
  theme(legend.position = c(0.8, 0.2),
        legend.text = element_text(size = 15),
        legend.title = element_text(size = 15))


The plot above indicates that the performance of the three machine learning models is roughly the same. Slight variations show that when the false positive rate is above 0.6, SVM and logistic regression seem marginally better than the XGBoost model.

We have also added a slider that adjusts the vertical red line in the Shiny app. The slider allows the user to change the operating point of the algorithm by setting the false positive rate. Changes made to this cut-off are reflected in the confusion matrices.
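In Shiny, such a slider might look like the following minimal sketch; the input ID and wiring here are illustrative assumptions, not the app's actual code:

```r
# ui.R: hypothetical slider controlling the false positive rate cut-off
sliderInput("fpr_threshold", "False positive rate cut-off:",
            min = 0, max = 1, value = 0.5, step = 0.01)

# server.R: downstream plots and confusion matrices read
# input$fpr_threshold reactively and recompute the cutoff from it
```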

## Confusion Matrix

Here is the logic and the code for how we draw the confusion matrix:

1. Obtain the AUC, FPR, and TPR using the prediction function.
2. Define a get_cutoff_point function to obtain the cutoff probability given a fixed FPR. Any predicted probability greater than the cutoff will be classified as attrition, and vice versa.
3. Define a draw_confusion_matrix function to draw a confusion matrix plot given the calculated confusion table, the AUC, and a chosen color.
4. Compare the three confusion matrices from the three models, along with their AUC, FPR, TPR, and accuracy.

# Define a function to obtain the cutoff probability
# @perf is an S4 object obtained from the @performance function
# @threshold is the targeted fpr
# In the ShinyApp, users can adjust the threshold by themselves and
# obtain different confusion matrix accordingly. Here, we always set
# threshold = 0.5 just for illustration.
get_cutoff_point <- function(perf, threshold)
{
  cutoffs <- data.frame(cut=perf@alpha.values[[1]], fpr=perf@x.values[[1]], tpr=perf@y.values[[1]])
  cutoffs <- cutoffs[order(cutoffs$tpr, decreasing=TRUE),]
  cutoffs <- subset(cutoffs, fpr <= threshold)
  if(nrow(cutoffs) == 0){
    return(1.0)
  } else {
    return(cutoffs[1, 1])
  }
}

# Define a function to draw a confusion matrix plot
# @cm is a confusion matrix obtained from @confusionMatrix function
# @auc is the auc value obtained from @performance function
# @color is the kind of color you want for the true positive and true negative areas
# In this function, we also add in accuracy information, which summarizes the
# overall performance of the model
draw_confusion_matrix <- function(cm, auc, color) {
  layout(matrix(c(1,1,2)))
  par(mar=c(0,0.1,1,0.1))
  plot(c(125, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')

  # create the matrix
  rect(150, 430, 240, 370, col=color)
  text(195, 435, '0', cex=1.2)
  rect(250, 430, 340, 370, col='white')
  text(295, 435, '1', cex=1.2)
  text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
  text(245, 450, 'Actual', cex=1.3, font=2)
  rect(150, 305, 240, 365, col='white')
  rect(250, 305, 340, 365, col=color)
  text(140, 400, '0', cex=1.2, srt=90)
  text(140, 335, '1', cex=1.2, srt=90)

  # add in the cm results
  res <- as.numeric(cm$table)
  text(195, 400, res[1], cex=1.6, font=2, col='white')
  text(195, 335, res[2], cex=1.6, font=2, col='black')
  text(295, 400, res[3], cex=1.6, font=2, col='black')
  text(295, 335, res[4], cex=1.6, font=2, col='white')

  plot(c(0, 100), c(0, 50), type = "n", xlab="", ylab="", main = "", xaxt='n', yaxt='n')

  # add in the accuracy information
  text(25, 30, "AUC", cex=1.8, font=2)
  text(25, 20, round(as.numeric(auc), 3), cex=1.8)
  text(75, 30, names(cm$overall[1]), cex=1.8, font=2)
  text(75, 20, round(as.numeric(cm$overall[1]), 3), cex=1.8)
}

# draw XGBoost confusion matrix
auc_xgb <- performance(ROCRpred_xgb, measure = "auc")  #obtain auc from @performance
perf_xgb <- performance(ROCRpred_xgb, 'tpr','fpr')  #obtain tpr and fpr from @performance
cut <- get_cutoff_point(perf_xgb, 0.5) #obtain the cutoff probability
pred_values_xgb <- ifelse(pred.xgb > cut,1,0) #classify using cutoff probability
cm_xgb <- confusionMatrix(data = pred_values_xgb, reference = test_Y) #obtain confusion matrix
draw_confusion_matrix(cm_xgb, auc_xgb@y.values, "#3DB7E4")  #Draw confusion matrix plot
# draw SVM confusion matrix
auc_svm <- performance(ROCRpred_svm, measure = "auc")
perf_svm <- performance(ROCRpred_svm, 'tpr','fpr')
cut <- get_cutoff_point(perf_svm, 0.5)
pred_values_svm <- ifelse(svm_model.prob[,2] > cut,1,0)
cm_svm <- confusionMatrix(data = pred_values_svm, reference = test_Y)
draw_confusion_matrix(cm_svm, auc_svm@y.values, "#FF8849")
# draw Logistic regression confusion matrix
auc_lr <- performance(ROCRpred_lr, measure = "auc")
perf_lr <- performance(ROCRpred_lr, 'tpr','fpr')
cut <- get_cutoff_point(perf_lr, 0.5)
pred_values_lr <- ifelse(LR_model.predict > cut,1,0)
cm_lr <- confusionMatrix(data = pred_values_lr, reference = test_Y)
draw_confusion_matrix(cm_lr, auc_lr@y.values, "#69BE28")


The confusion matrices show the predicted and true attrition counts. For example, the XGBoost matrix shows that of the 410 employees who do not attrite, our model predicts 227 true negatives and 183 false positives; similarly, of the 80 employees who do attrite, 69 are correctly labelled as attriters, with 11 false negatives.
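These counts can be sanity-checked directly. Using the XGBoost numbers above:

```r
# XGBoost confusion matrix counts from the plot above
tn <- 227; fp <- 183; fn <- 11; tp <- 69

accuracy  <- (tp + tn) / (tp + tn + fp + fn)  # 296/490 ~ 0.604
precision <- tp / (tp + fp)                   # 69/252  ~ 0.274
recall    <- tp / (tp + fn)                   # 69/80   ~ 0.863
fpr       <- fp / (fp + tn)                   # 183/410 ~ 0.446
```

Note that the false positive rate of about 0.45 is consistent with the 0.5 cut-off passed to get_cutoff_point.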

## Precision and Recall

Another way to visualize this result is to look at precision and recall. Again, by controlling the cutoff, we can compare precision and recall values among different models in this plot.

The following code draws the precision plot and the recall plot, which are very similar to the ROC plots:

#Create precision plot
#XGBoost
perf_xgb <- performance(ROCRpred_xgb, 'prec', 'cutoff') #use 'prec' and 'cutoff' as measurements
xgb.data <- data.frame(x=unlist(perf_xgb@x.values), y=unlist(perf_xgb@y.values),
                       model="XGBoost")

#SVM
perf_svm <- performance(ROCRpred_svm, 'prec', 'cutoff')
svm.data <- data.frame(x=unlist(perf_svm@x.values), y=unlist(perf_svm@y.values),
                       model="SVM")

#Logistic Regression
perf_lr <- performance(ROCRpred_lr, 'prec', 'cutoff')
lr.data <- data.frame(x=unlist(perf_lr@x.values), y=unlist(perf_lr@y.values),
                      model="LR")

cols <- c("XGBoost" = "#3DB7E4", "SVM" = "#FF8849", "Logistic Regression" = "#69BE28")

ggplot() +
  geom_line(data = xgb.data, aes(x = x, y = y, colour = "XGBoost")) +
  geom_line(data = svm.data, aes(x = x, y = y, colour = "SVM")) +
  geom_line(data = lr.data, aes(x = x, y = y, colour = "Logistic Regression")) +
  scale_colour_manual(name = "Models", values = cols) +
  xlab("Cutoff") +
  ylab("Precision") +
  geom_vline(xintercept = 0.5, color = "red", linetype = 2) + theme_bw() +
  theme(legend.position = c(0.8, 0.2),
        legend.text = element_text(size = 15),
        legend.title = element_text(size = 15))


The plot shows how precision changes as you increase the cut-off.

#Create recall plot
#XGBoost
perf_xgb <- performance(ROCRpred_xgb,'rec', 'cutoff')
xgb.data <- data.frame(x=unlist(perf_xgb@x.values), y=unlist(perf_xgb@y.values), model="XGBoost")

#SVM
perf_svm <- performance(ROCRpred_svm,'rec', 'cutoff')
svm.data <- data.frame(x=unlist(perf_svm@x.values), y=unlist(perf_svm@y.values), model="SVM")

#Logistic Regression
perf_lr <- performance(ROCRpred_lr,'rec', 'cutoff')
lr.data <- data.frame(x=unlist(perf_lr@x.values), y=unlist(perf_lr@y.values), model="LR")

cols <- c("XGBoost" = "#3DB7E4", "SVM" = "#FF8849", "Logistic Regression" = "#69BE28")

ggplot() +
  geom_line(data = xgb.data, aes(x = x, y = y, colour = "XGBoost")) +
  geom_line(data = svm.data, aes(x = x, y = y, colour = "SVM")) +
  geom_line(data = lr.data, aes(x = x, y = y, colour = "Logistic Regression")) +
  scale_colour_manual(name = "Models", values = cols) +
  xlab("Cutoff") +
  ylab("Recall") +
  geom_vline(xintercept = 0.5, color = "red", linetype = 2) + theme_bw() +
  theme(legend.position = c(0.8, 0.8),
        legend.text = element_text(size = 15),
        legend.title = element_text(size = 15))


The plot shows how recall changes as you increase the cut-off.

The figures above show the precision and recall curves for the three models and illustrate how the cut-off affects precision and recall.

Here, we use cutoff = 0.5 as the default in our Shiny app; the user can adjust the cutoff slider themselves. As with the ROC curve, the confusion matrices for each algorithm update with changes to the slider location.

With these results, we can give the HR department a list of the employees who are most likely to leave, along with the confidence score returned by the model. Further, the confidence score can be combined with any HR metrics, which themselves can be modelled algorithmically if need be, to give an expected value lost per individual.
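Such a ranking might be assembled as in the sketch below, where employee_id, test_ids, and replacement_cost are illustrative placeholders (not columns or objects from our actual app):

```r
# Hypothetical sketch: rank employees by confidence score and estimate
# the expected value lost per individual
risk <- data.frame(employee_id    = test_ids,          # assumed identifier vector
                   attrition_prob = pred.xgb,          # model confidence scores
                   value_at_risk  = replacement_cost)  # assumed HR cost metric

# expected loss = probability of leaving x value lost if they leave
risk$expected_loss <- risk$attrition_prob * risk$value_at_risk
head(risk[order(-risk$attrition_prob), ])  # most likely leavers first
```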

For more details on this or any potential analyses, please visit us at http://sflscientific.com or contact mluk@sflscientific.com.

--

Contributors: Michael Luk, Zijian Han, Jinru Xue, Han Lin [SFL Scientific]