# Anomaly Detection: Network Intrusion Detector

Anomaly detection is a common problem that can be solved using machine learning techniques. Simple density-based algorithms provide a good baseline for such projects and can be applied to a variety of problems, from defect detection in manufacturing to detecting network attacks in IT.

# Document Summarisation

Automated text summarization through machine learning can be an extremely valuable tool to increase efficiency in both our everyday life and professional endeavors if the important information in a document can be extracted and accurately summarized.

# Predicting Flu Outbreaks with Twitter

The use of Twitter and natural language processing opens up a promising new approach to flu surveillance. Such data-driven methods produce encouraging results and provide a faster way to identify flu surges.

Further, these Twitter-based methods can readily be applied to other domains: for example, in marketing, to identify geospatial trends in brand image, and in urban planning, to analyze public attitudes towards various spaces and landmarks.

# Predicting Hospital Wait Times

Both patients and hospitals need to effectively predict wait times, whether for psychological benefits or schedule optimization needs. In this post, we will explore some of the main ways that officials predict hospital wait times and assess how successful they are at doing so.

# Machine Learning in Transport

In this blog post we talk about 5 aspects of machine learning that can be applied to transportation.

1. Self-driving cars
2. Congestion Prediction
3. Infrastructure maintenance
4. Predicting vehicle maintenance
5. Public transport optimizations

# Examining the digital transformation in agriculture

To respond to global challenges, agriculture must improve on all fronts: smarter resource use, higher yields, greater operational efficiency, and sustainable land use. Big data is expected to have a large impact on "smart farming" and involves the whole supply chain, from biotechnology and plant development to individual farmers and the companies that support them.

# Predicting Hospital Readmissions with Machine Learning

Governments in the US and around the world have introduced a variety of financial penalties to hospitals with excess early readmissions. But how can hospitals predict which patients are likely to be readmitted early, so they can help these patients avoid readmittance? In this post, we explore some machine learning methods for predicting early readmissions.

# Employee Attrition Modeling - Part 4


## Construction of the R Shiny App

R Shiny is a powerful yet intuitive tool for creating interactive web applications in R. The application itself has two main components: the user interface and the server. The user interface controls the appearance and layout of the app. The server is responsible for performing the calculations and contains the instructions for building the app.

The following is a code snippet used to set up a section of the user interface. In this case, we created a simple sidebar to allow the user to select from certain sub-groups of employees.

shinyUI(fluidPage(
  theme = shinytheme("cerulean"),
  column(width = 12,
         titlePanel("Employee Attrition")),
  sidebarLayout(
    sidebarPanel(
      helpText("We recommend 3 or fewer options at a time."),
      width = 2,
      selectInput("Age", label = "Select an age range",
                  choices = c("< 35", "35-45", "> 45", "all"), selected = "all"),
      selectInput("Gender", label = "Select a gender",
                  choices = c("Female", "Male", "all"), selected = "all"),
      selectInput("Education", label = "Select an education level",
                  choices = c("1", "2", "3", "4", "all"), selected = "all"),
      selectInput("MonthlyIncome", label = "Select a monthly income range",
                  choices = c("< 2500", "2500-5000", "5001-7500", "7501-10000", "> 10001", "all"),
                  selected = "all"),
      selectInput("MaritalStatus", label = "Select a marital status",
                  choices = c("Single", "Married", "Divorced", "all"), selected = "all")
    )
    # excerpt ends here: the main panel and the remaining closing parentheses are not shown


On the server side, the options selected in the sidebar are stored in variables that can then be applied to the model. The two components work as a unit: the user interface receives changes from the user and passes them to the server to execute.

When dealing with employee attrition in particular, it is useful to look at specific groups of employees, so we added several options that let the user subset the data accordingly.
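As a rough sketch of how the server might apply these choices, the helper below filters a data frame by the selected gender and marital status, treating "all" as no filter. The function name `subset_employees` and the toy data are hypothetical and only illustrate the idea; the column names follow the sidebar inputs shown above.

```r
# Hypothetical helper: apply sidebar selections to an employee data frame.
# A selection of "all" leaves that column unfiltered.
subset_employees <- function(d, gender = "all", marital_status = "all") {
  keep <- rep(TRUE, nrow(d))
  if (gender != "all") keep <- keep & d$Gender == gender
  if (marital_status != "all") keep <- keep & d$MaritalStatus == marital_status
  d[keep, , drop = FALSE]
}

# Toy data for illustration only
employees <- data.frame(
  Gender = c("Female", "Male", "Female", "Male"),
  MaritalStatus = c("Single", "Married", "Married", "Single"),
  stringsAsFactors = FALSE
)

single_women <- subset_employees(employees, gender = "Female", marital_status = "Single")
print(nrow(single_women))
```

Inside a Shiny server, a helper like this could be called with `input$Gender` and `input$MaritalStatus` from within a reactive expression, so the subset updates whenever the user changes a selection.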

We have already walked through how the learning process works for the three machine learning algorithms. In this part, we compare the performance of these three models on the dataset in the Shiny app using various curves.

The first tab describes the problem we want to solve, discussed in Part 1. After users choose the subset of data they want to look at, the second tab shows the correlation matrix between all the variables in the dataset.

tabPanel("Data Exploration", width = 12,
column(width = 8, class = "well", h4("Correlation Matrix"),
plotOutput("corr_plot", click = "corr_plot_click"),
div(style = "height:110px;background-color: white;"),
style = "background-color:white;"
),

column(width = 4, class = "well", h4("Correlation plot of chosen variables:"),
plotOutput("corr_click_info", height="280px"),
style = "background-color:white;"
)
)


For the next three tabs, the Shiny app shows visualizations comparing three machine learning models: SVM, XGBoost, and logistic regression.

We include the ROC curve, the precision curve, and the recall curve:

tabPanel("Training (ROC)",
column(width = 8, class = "well", h4("ROC Curve"),
plotOutput("plot"),
style = "background-color:white;",
sliderInput("thresh", label = "", min = 0, max = 1, value = c(0.5))
),
column(width = 4, class = "well",
tabsetPanel(
tabPanel("XGBoost", h4("Confusion Matrix (XGBoost)"), plotOutput("confusionMatrix"), style = "background-color:white;"),
tabPanel("SVM", h4("Confusion Matrix (SVM)"), plotOutput("confusionMatrix_svm"), style = "background-color:white;"),
tabPanel("Logistic Regression", h4("Confusion Matrix (Logistic Regression)"),
plotOutput("confusionMatrix_lr"), style = "background-color:white;")
), style = "background-color:white;"
)
),
tabPanel("Training (Precision)",
column(width = 8, class = "well", h4("Precision vs Cutoff Curve"),
plotOutput("plot_precision"),
style = "background-color:white;",
sliderInput("thresh_precision", label = "", min = 0, max = 1, value = c(0.5))
),
column(width = 4, class = "well",
tabsetPanel(
tabPanel("XGBoost", h4("Confusion Matrix (XGBoost)"), plotOutput("confusionMatrix_precision"), style = "background-color:white;"),
tabPanel("SVM", h4("Confusion Matrix (SVM)"), plotOutput("confusionMatrix_svm_precision"), style = "background-color:white;"),
tabPanel("Logistic Regression", h4("Confusion Matrix (Logistic Regression)"),
plotOutput("confusionMatrix_lr_precision"), style = "background-color:white;")
), style = "background-color:white;"
)
),
tabPanel("Training (Recall)",
column(width = 8, class = "well", h4("Recall vs Cutoff Curve"),
plotOutput("plot_recall"),
style = "background-color:white;",
sliderInput("thresh_recall", label = "", min = 0, max = 1, value = c(0.5))
),
column(width = 4, class = "well",
tabsetPanel(
tabPanel("XGBoost", h4("Confusion Matrix (XGBoost)"), plotOutput("confusionMatrix_recall"), style = "background-color:white;"),
tabPanel("SVM", h4("Confusion Matrix (SVM)"), plotOutput("confusionMatrix_svm_recall"), style = "background-color:white;"),
tabPanel("Logistic Regression", h4("Confusion Matrix (Logistic Regression)"),
plotOutput("confusionMatrix_lr_recall"), style = "background-color:white;")
), style = "background-color:white;"
)

)


## A Little Guidance

In the app, we limited the subsetting options to five factors: age, gender, education level, monthly income, and marital status. This allows the user to see the details for a subset of the dataset. We suggest using three or fewer variables at a time because of the size of the dataset.

After choosing features, we can go to the ROC, precision, and recall tabs to change the attrition threshold. For example, an aggressive HR department can lower the threshold above which remedial action is required. This will produce more false positives, but it also ensures that we catch the majority of employees who actually attrite.

After adjusting those parameters, users can go to the prediction tab by clicking XGBoost. There, users can see the distribution of the predicted probabilities as well as the most important features that drive employee attrition. In a similar manner, you can also check the SVM and logistic regression distributions.

Finally, a 3D distribution can be shown by clicking on the desired variable. Users can go to ‘Explanation of Results’ for a deeper understanding of the results.

For more details on this or any potential analyses, please visit us at http://sflscientific.com or contact mluk@sflscientific.com.

--

Contributors: Michael Luk, Zijian Han, Jinru Xue, Han Lin

# Part 3:

This is the third in a series of blog posts that discuss the results of our RShiny Attrition App. In this part, we use the results from the models trained in Part 2 to evaluate the models and discuss the results.

We will use several machine learning criteria, such as the receiver operating characteristic (ROC) curve and the precision and recall curves. All of them are calculated from the predicted attrition class and the true attrition class in the testing dataset.

In RShiny, we first load the required libraries that we’ll be using:

# load packages
library("ggplot2");
library("corrplot");
library("ROCR");
library("caret")


## Fine-Tuning the Results

Remember that each algorithm gives a confidence score (probability) between 0 and 1 for each employee, indicating that the individual is somewhere between 0% and 100% likely to attrite.

By setting the confidence score threshold above which we predict an employee will leave, we gain control over the precision and recall statistics. The cutoff can be adjusted in real time in the RShiny app to optimize the model based on the needs of the business.
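To make the effect of the threshold concrete, here is a minimal base R sketch that computes precision and recall at two different cutoffs. The scores, labels, and the helper name `precision_recall` are made up for illustration and are not taken from the app.

```r
# Toy scores and true labels (1 = employee left); illustrative only
scores <- c(0.9, 0.8, 0.6, 0.4, 0.3, 0.1)
truth  <- c(1,   1,   0,   1,   0,   0)

# Hypothetical helper: classify with a cutoff, then count the outcomes
precision_recall <- function(scores, truth, cutoff) {
  predicted <- as.integer(scores > cutoff)
  tp <- sum(predicted == 1 & truth == 1)
  fp <- sum(predicted == 1 & truth == 0)
  fn <- sum(predicted == 0 & truth == 1)
  c(precision = tp / (tp + fp), recall = tp / (tp + fn))
}

high <- precision_recall(scores, truth, cutoff = 0.7)  # strict cutoff
low  <- precision_recall(scores, truth, cutoff = 0.2)  # lenient cutoff
print(high)
print(low)
```

In this toy example the strict cutoff flags fewer employees, all of them correctly, but misses one leaver; the lenient cutoff catches every leaver at the price of more false alarms.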

## ROC Curve

A receiver operating characteristic (ROC) curve is the result of plotting the true positive rate against the false positive rate. The closer the ROC curve is to the top left corner, the greater the accuracy of the test.
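The definition can be checked by hand with a small base R sketch that sweeps the cutoff over a handful of made-up scores and computes the two rates at each step. The data are illustrative only, and the trapezoidal AUC is a simple approximation rather than what ROCR computes internally.

```r
# Toy scores and true labels (1 = employee left); illustrative only
scores <- c(0.9, 0.8, 0.6, 0.4, 0.3, 0.1)
truth  <- c(1, 1, 0, 1, 0, 0)

# Sweep the cutoff from strictest (nothing flagged) to most lenient
cutoffs <- c(Inf, sort(unique(scores), decreasing = TRUE))
tpr <- sapply(cutoffs, function(cut) sum(scores >= cut & truth == 1) / sum(truth == 1))
fpr <- sapply(cutoffs, function(cut) sum(scores >= cut & truth == 0) / sum(truth == 0))

# Area under the curve via the trapezoidal rule
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

print(data.frame(cutoff = cutoffs, fpr = fpr, tpr = tpr))
print(auc)
```

Each row of the printed table is one point on the ROC curve; plotting `tpr` against `fpr` traces the curve from (0, 0) to (1, 1).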

Let us create prediction objects and use them to create the ROC plot.

# Create a prediction object using previously saved results
ROCRpred_xgb <- prediction(pred.xgb, test_Y)
ROCRpred_svm <- prediction(svm_model.prob[,2], test_Y)
ROCRpred_lr <- prediction(LR_model.predict, test_Y)

#XGBoost roc data
perf_xgb <- performance(ROCRpred_xgb, 'tpr','fpr')
roc_xgb.data <- data.frame(fpr=unlist(perf_xgb@x.values),
tpr=unlist(perf_xgb@y.values), model="XGBoost")

#SVM roc data
perf_svm <- performance(ROCRpred_svm, 'tpr','fpr')
roc_svm.data <- data.frame(fpr=unlist(perf_svm@x.values),
tpr=unlist(perf_svm@y.values), model="SVM")

#Logistic Regression roc data
perf_lr <- performance(ROCRpred_lr, 'tpr','fpr')
roc_lr.data <- data.frame(fpr=unlist(perf_lr@x.values),
tpr=unlist(perf_lr@y.values), model="LR")


Everything is now set up, so we can draw the ROC plot.

# Define colors for roc plot
cols <- c("XGBoost" = "#3DB7E4", "SVM" = "#FF8849", "Logistic Regression" = "#69BE28")

# Create roc plot
ggplot() +
geom_line(data = roc_xgb.data, aes(x=fpr, y=tpr, colour = "XGBoost")) + #set XGBoost roc curve
geom_line(data = roc_svm.data, aes(x = fpr, y=tpr, colour = "SVM")) + #set SVM roc curve
geom_line(data = roc_lr.data, aes(x = fpr, y=tpr, colour = "Logistic Regression")) + #set LR roc curve
geom_vline(xintercept = 0.5, color = "red", linetype=2) + #mark the chosen false positive rate
theme_bw() + #set theme
scale_colour_manual(name = "Models", values = cols) +
xlab("False Positive Rate") +
ylab("True Positive Rate") +
theme(legend.position = c(0.8, 0.2),
legend.text = element_text(size = 15),
legend.title = element_text(size = 15))


The plot above indicates that the performance of the three machine learning models is roughly the same. Slight variations show that when the false positive rate is above 0.6, SVM and logistic regression seem marginally better than the XGBoost model.

We have also added a slider to adjust the vertical red line in the Shiny app. The slider allows the user to change the operating point of the algorithm by setting the false positive rate. The changes made to this cutoff are reflected in the confusion matrices.

## Confusion Matrix

Here is the logic and the code for how we draw the confusion matrix:

1. Obtain the auc, fpr and tpr using the prediction function.
2. Define the get_cutoff_point function to obtain the cutoff probability given a fixed fpr. Any predicted probability that is greater than the cutoff will be classified as attrition, and vice versa.
3. Define the draw_confusion_matrix function to draw a confusion matrix plot given the calculated confusion table, auc and chosen color.
4. Take a look at the three confusion matrices from the three models and compare their auc, fpr, tpr and accuracy.

# Define a function to obtain the cutoff probability
# @perf is an S4 object returned by the @performance function
# @threshold is the targeted fpr
# In the ShinyApp, users can adjust the threshold by themselves and
# obtain a different confusion matrix accordingly. Here, we always set
# threshold = 0.5 just for illustration.
get_cutoff_point <- function(perf, threshold)
{
  cutoffs <- data.frame(cut=perf@alpha.values[[1]], fpr=perf@x.values[[1]], tpr=perf@y.values[[1]])
  cutoffs <- cutoffs[order(cutoffs$tpr, decreasing=TRUE),]
  cutoffs <- subset(cutoffs, fpr <= threshold)
  if(nrow(cutoffs) == 0){ return(1.0) } else { return(cutoffs[1, 1]) }
}

# Define a function to draw a confusion matrix plot
# @cm is a confusion matrix obtained from @confusionMatrix function
# @auc is the auc value obtained from @performance function
# @color is the kind of color you want for true positive and true negative areas
# In this function, we also add in accuracy information which calculates the
# overall performance of model
draw_confusion_matrix <- function(cm, auc, color) {
  layout(matrix(c(1,1,2)))
  par(mar=c(0,0.1,1,0.1))
  plot(c(125, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')

  # create the matrix
  rect(150, 430, 240, 370, col=color)
  text(195, 435, '0', cex=1.2)
  rect(250, 430, 340, 370, col='white')
  text(295, 435, '1', cex=1.2)
  text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
  text(245, 450, 'Actual', cex=1.3, font=2)
  rect(150, 305, 240, 365, col='white')
  rect(250, 305, 340, 365, col=color)
  text(140, 400, '0', cex=1.2, srt=90)
  text(140, 335, '1', cex=1.2, srt=90)

  # add in the cm results
  res <- as.numeric(cm$table)
  text(195, 400, res[1], cex=1.6, font=2, col='white')
  text(195, 335, res[2], cex=1.6, font=2, col='black')
  text(295, 400, res[3], cex=1.6, font=2, col='black')
  text(295, 335, res[4], cex=1.6, font=2, col='white')

  plot(c(0, 100), c(0, 50), type = "n", xlab="", ylab="", main = "", xaxt='n', yaxt='n')

  # add in the accuracy information
  text(25, 30, "AUC", cex=1.8, font=2)
  text(25, 20, round(as.numeric(auc), 3), cex=1.8)
  text(75, 30, names(cm$overall[1]), cex=1.8, font=2)
  text(75, 20, round(as.numeric(cm$overall[1]), 3), cex=1.8)
}

# draw XGBoost confusion matrix
auc_xgb <- performance(ROCRpred_xgb, measure = "auc")  #obtain auc from @performance
perf_xgb <- performance(ROCRpred_xgb, 'tpr','fpr')  #obtain tpr and fpr from @performance
cut <- get_cutoff_point(perf_xgb, 0.5) #obtain the cutoff probability for a target fpr of 0.5
pred_values_xgb <- ifelse(pred.xgb > cut,1,0) #classify using cutoff probability
cm_xgb <- confusionMatrix(data = pred_values_xgb, reference = test_Y) #obtain confusion matrix
draw_confusion_matrix(cm_xgb, auc_xgb@y.values, "#3DB7E4")  #Draw confusion matrix plot

# draw SVM confusion matrix
auc_svm <- performance(ROCRpred_svm, measure = "auc")
perf_svm <- performance(ROCRpred_svm, 'tpr','fpr')
cut <- get_cutoff_point(perf_svm, 0.5)
pred_values_svm <- ifelse(svm_model.prob[,2] > cut,1,0)
cm_svm <- confusionMatrix(data = pred_values_svm, reference = test_Y)
draw_confusion_matrix(cm_svm, auc_svm@y.values, "#FF8849")

# draw Logistic regression confusion matrix
auc_lr <- performance(ROCRpred_lr, measure = "auc")
perf_lr <- performance(ROCRpred_lr, 'tpr','fpr')
cut <- get_cutoff_point(perf_lr, 0.5)
pred_values_lr <- ifelse(LR_model.predict > cut,1,0)
cm_lr <- confusionMatrix(data = pred_values_lr, reference = test_Y)
draw_confusion_matrix(cm_lr, auc_lr@y.values, "#69BE28")


The confusion matrix shows the predicted and true attrition numbers of employees. For example, the XGBoost matrix shows that of the 410 employees who do not attrite, our model predicts 227 true negatives and 183 false positives. Similarly, out of the 80 employees who do attrite, 69 are correctly labelled as attriters, with 11 false negatives.
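As a worked example, the summary statistics can be derived from the counts quoted above, with the 183 misclassified non-leavers counted as false positives. This is plain arithmetic in base R, not output from the app.

```r
# Counts quoted in the text for the XGBoost model
tn <- 227; fp <- 183   # 410 employees who did not leave
tp <- 69;  fn <- 11    # 80 employees who left

accuracy  <- (tp + tn) / (tp + tn + fp + fn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)   # true positive rate
fpr       <- fp / (fp + tn)   # false positive rate

print(c(accuracy = accuracy, precision = precision, recall = recall, fpr = fpr))
```

The false positive rate comes out just under 0.5, which matches the operating point chosen for the illustration, and the low precision reflects how many non-leavers are flagged at that setting.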

## Precision and Recall

Another way to visualize this result is to look at precision and recall. Again, by controlling the cutoff, we can compare precision and recall values among different models in this plot.

The following code draws the precision plot and the recall plot, which are very similar to the ROC plot:

#Create precision plot
#XGBoost
perf_xgb <- performance(ROCRpred_xgb,'prec', 'cutoff') #use 'prec' and 'cutoff' as measurements
xgb.data <- data.frame(x=unlist(perf_xgb@x.values), y=unlist(perf_xgb@y.values),
model="XGBoost")

#SVM
perf_svm <- performance(ROCRpred_svm,'prec', 'cutoff')
svm.data <- data.frame(x=unlist(perf_svm@x.values), y=unlist(perf_svm@y.values),
model="SVM")

#Logistic Regression
perf_lr <- performance(ROCRpred_lr,'prec', 'cutoff')
lr.data <- data.frame(x=unlist(perf_lr@x.values), y=unlist(perf_lr@y.values),
model="LR")

cols <- c("XGBoost" = "#3DB7E4", "SVM" = "#FF8849", "Logistic Regression" = "#69BE28")

ggplot() +
geom_line(data = xgb.data, aes(x=x, y=y, colour = "XGBoost")) +
geom_line(data = svm.data, aes(x =x, y=y, colour = "SVM")) +
geom_line(data = lr.data, aes(x =x, y=y, colour = "Logistic Regression")) +
scale_colour_manual(name = "Models", values = cols) +
xlab("Cutoff") +
ylab("Precision") +
geom_vline(xintercept = 0.5, color = "red", linetype=2) + theme_bw() +
theme(legend.position = c(0.8, 0.2),
legend.text = element_text(size = 15),
legend.title = element_text(size = 15))


The plot shows how precision changes as the cutoff increases.

#Create recall plot
#XGBoost
perf_xgb <- performance(ROCRpred_xgb,'rec', 'cutoff')
xgb.data <- data.frame(x=unlist(perf_xgb@x.values), y=unlist(perf_xgb@y.values), model="XGBoost")

#SVM
perf_svm <- performance(ROCRpred_svm,'rec', 'cutoff')
svm.data <- data.frame(x=unlist(perf_svm@x.values), y=unlist(perf_svm@y.values), model="SVM")

#Logistic Regression
perf_lr <- performance(ROCRpred_lr,'rec', 'cutoff')
lr.data <- data.frame(x=unlist(perf_lr@x.values), y=unlist(perf_lr@y.values), model="LR")

cols <- c("XGBoost" = "#3DB7E4", "SVM" = "#FF8849", "Logistic Regression" = "#69BE28")

ggplot() +
geom_line(data = xgb.data, aes(x=x, y=y, colour = "XGBoost")) +
geom_line(data = svm.data, aes(x=x, y=y, colour = "SVM")) +
geom_line(data = lr.data, aes(x=x, y=y, colour = "Logistic Regression")) +
scale_colour_manual(name = "Models", values = cols) +
xlab("Cutoff") +
ylab("Recall") +
geom_vline(xintercept = 0.5, color = "red", linetype=2) + theme_bw() +
theme(legend.position = c(0.8, 0.8),
legend.text = element_text(size = 15),
legend.title = element_text(size = 15))


The plot shows how recall changes as the cutoff increases.

The above figures show the Precision and Recall curves for the three models and illustrate how the cut-off will affect the precision and recall.

Here, we use cutoff = 0.5 as the default in our Shiny app. The user can adjust the cutoff with the slider. As with the ROC curve, the confusion matrices for each algorithm are updated with changes to the slider location.

With these results, we can give the HR department a list of the employees who are most likely to leave, along with the confidence score returned by the model. Further, the confidence score can be combined with HR metrics, which themselves can be modelled algorithmically if need be, to give an expected value lost per individual.
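A minimal sketch of that last idea, assuming a hypothetical per-employee replacement cost as the HR metric: multiplying it by the model's confidence score gives an expected loss that can be used to rank employees. All names and numbers below are invented for illustration.

```r
# Hypothetical confidence scores and per-employee replacement costs
at_risk <- data.frame(
  employee = c("A", "B", "C"),
  prob_leave = c(0.9, 0.5, 0.2),
  replacement_cost = c(10000, 40000, 30000),
  stringsAsFactors = FALSE
)

# Expected value lost = probability of leaving x cost if they leave
at_risk$expected_loss <- at_risk$prob_leave * at_risk$replacement_cost

# Rank employees from highest to lowest expected loss
ranked <- at_risk[order(at_risk$expected_loss, decreasing = TRUE), ]
print(ranked)
```

Note that the employee with the highest probability of leaving is not necessarily the one with the highest expected loss, which is why combining the two quantities can change HR's priorities.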


--

Contributors: Michael Luk, Zijian Han, Jinru Xue, Han Lin [SFL Scientific]

This project is an analysis of IBM Watson data on the sales pipeline and lead conversion outcomes of an auto parts store, which can be found in a CSV file here. The goal of this analysis is to: “understand [the] sales pipeline and uncover what can lead to successful sales opportunities and better anticipate performance gaps.”

## Exploratory Data Analysis

This data set has eighteen features, including the category of the product being sold, the region, the route to market, various statistics describing the sales pipeline, and client information. The features are stored as floats, integers, and objects. A further description of the data set variables can be found here.

The data set has a total of 78,025 rows and two outcome classes, Win and Loss, indicating whether or not a lead was converted. The Loss class is 3.43 times larger than the Win class, as shown in Figure 1 below. Because the majority of leads do not convert, the business needs to better understand the lead conversion pipeline.

Figure 1: Win and Loss Outcome Visualization

## Training the Classification Model

For this classification problem, I chose an XGBoost classifier, an implementation of gradient boosted regression trees that is often used to win Kaggle competitions. To prepare the data for the model, I one-hot encoded the string variables, resulting in forty features, and created a separate data frame for the outcome column. I then split the data into training and validation sets before training the classifier.
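For readers unfamiliar with one-hot encoding, the following base R sketch shows the idea on a toy data frame. It is written in R for consistency with the other examples here and is not the code used in this analysis; the column names `region` and `amount` are invented.

```r
# Toy leads table with one string (factor) column and one numeric column
leads <- data.frame(
  region = factor(c("West", "East", "West", "North")),
  amount = c(100, 250, 80, 40)
)

# "- 1" drops the intercept so every level of the factor gets its own 0/1 column
region_onehot <- model.matrix(~ region - 1, data = leads)

# Combine the indicator columns with the untouched numeric column
encoded <- cbind(as.data.frame(region_onehot), amount = leads$amount)
print(encoded)
```

Each string column expands into one indicator column per distinct value, which is how eighteen original features can grow to forty after encoding.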

## Technical Machine Learning Outcomes

### Feature Importances

These are the top five features on which the decision trees are split in the XGBoost model:

• opportunity amount in USD

• elapsed days in sales stage

• total days identified through qualified

• sales stage change count

• revenue from client the past two years

Figure 2:  Feature importances

Figure 3: Log Scaled Box Plots of the Top 5 Feature Importances

Figure 4: Confusion Matrices for the Negative and Positive Classes

There is a disproportionately high number of false negatives, perhaps due to the unbalanced classes. To correct for this, I tuned the scale_pos_weight parameter, which changes the weight of the Win class from 1 to sum(negative cases) / sum(positive cases). Changing this parameter reduced the false negatives such that only 18% of converted leads are misclassified as losses.
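The ratio itself is straightforward to compute. The sketch below uses a toy outcome vector rather than the real data and is written in R for consistency with the other examples here; only the formula comes from the text.

```r
# Toy outcome vector: 1 = Win, 0 = Loss (illustrative, not the real data)
outcome <- c(1, 0, 0, 0, 1, 0, 0, 0, 0)

# Weight for the positive class: number of negatives per positive
scale_pos_weight <- sum(outcome == 0) / sum(outcome == 1)
print(scale_pos_weight)
```

For the data set described here, where the Loss class is 3.43 times larger than the Win class, this ratio would be about 3.43.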

Figure 5: Histograms of Predicted Probabilities of the True Positives and True Negatives of the Model

The above histograms of the model's predicted probabilities show that the model is more decisive when predicting the positive class when the classes are more balanced.

The histogram below shows the predicted probabilities of the true and false values by the model. Lowering the cutoff point below 0.5 could increase the number of true positives returned. However, it would also increase the number of false positives.

Figure 6: True and False Values Predicted Probabilities Histogram

Figure 7: Precision vs Recall Plot of the Model

The area under the precision-recall curve for the positive class is less than that of the negative class. Perhaps having more data on the positive class would help the model learn to predict winning leads.

## Results Analysis

Using the XGBoost model, it is possible to predict the leads that are most likely to convert. Predicting these leads can be helpful in a few ways, such as focusing limited sales resources on those leads. It is also an opportunity to identify the features most correlated with conversion and replicate those practices across the sales pipeline.

An alternative way to assess the performance of the model, which I wasn’t able to do within the limits of this data set, would be to consider the cost-benefit matrix. How much does converting a lead earn the business? How much does it cost to follow up a lead? A few false positives may be acceptable in the final model if the sales process isn’t too costly for a company, or unacceptable if the sales costs are really high. Constructing a cost-benefit matrix can help identify the best classification algorithm for optimizing the sales pipeline of this auto parts store.
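A cost-benefit calculation of this kind could be sketched as follows. Every number here is hypothetical, since the data set contains no cost information: the counts stand in for a confusion matrix, and the values for the profit of a converted lead and the cost of a wasted follow-up.

```r
# Hypothetical confusion-matrix counts for a candidate model
counts <- c(tp = 120, fp = 60, fn = 30, tn = 290)

# Hypothetical value of each outcome in USD:
# profit per pursued win, cost per wasted follow-up, nothing for leads not pursued
value <- c(tp = 500, fp = -50, fn = 0, tn = 0)

# Expected profit per lead = sum over outcomes of (count x value) / total leads
expected_profit_per_lead <- sum(counts * value) / sum(counts)
print(expected_profit_per_lead)
```

Repeating the calculation for each candidate model, or for each cutoff of a single model, would give a business-oriented way to choose between them.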

# Part 2:

This is the second in a series of blog posts on analyzing employee attrition.

In this blog entry, we discuss the use of several algorithms to model employee attrition in R and RShiny: extreme gradient boosting (XGBoost), support vector machines (SVM), and logistic regression.

• XGBoost is a decision tree based algorithm. Multiple trees are ensembled to improve the predictive power of the model.
• SVM is a discriminative classifier that takes labeled training data and constructs a hyperplane to categorize new examples.
• Logistic regression is a simple classifier used to estimate the probability of a binary outcome based on several predictors and a logit function.

## Data Preprocessing

There are several steps to take before any predictions can be made. First, load all the machine learning packages we need in R.

# load packages
library("xgboost"); library("e1071"); library("MASS"); library("xtable")


The data is partitioned into three sets: training, validation and testing:

• The training set is responsible for initially teaching the model the relationship between the input features and the attrition probability.
• The validation set is then used to estimate how well the model has been trained and to fine-tune the parameters to develop the best model.
• Once those two steps have been completed, the final model is applied to the testing set to estimate how it would perform on real-world data.

The code to split the data is shown as follows:

# Create data for training and test
set.seed(0)
tr.number<-sample(nrow(d),nrow(d)*2/3)
# we split whole dataset into 2/3 training data and 1/3 testing data
train<-d[tr.number,]
test<-d[-tr.number,]

column_names = names(test)

# split dataset
train_Y = as.numeric(train$Attrition)
train$Attrition<-NULL
test_Y = test$Attrition
test$Attrition<-NULL

# numericize training and testing data
train[] <- lapply(train, as.numeric)
test[] <- lapply(test, as.numeric)
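The code above creates only training and testing sets. A three-way partition like the one described could be sketched as follows; the 60/20/20 proportions and the stand-in data frame `d_toy` are assumptions for illustration, not the split used in this series.

```r
set.seed(0)

# Stand-in for the attrition data frame `d` used above
d_toy <- data.frame(id = 1:100, Attrition = rbinom(100, 1, 0.2))

# Shuffle the row indices, then cut them into 60% / 20% / 20%
idx <- sample(nrow(d_toy))
n_train <- floor(0.6 * nrow(d_toy))
n_valid <- floor(0.2 * nrow(d_toy))

train_toy <- d_toy[idx[1:n_train], ]
valid_toy <- d_toy[idx[(n_train + 1):(n_train + n_valid)], ]
test_toy  <- d_toy[idx[(n_train + n_valid + 1):nrow(d_toy)], ]

print(c(train = nrow(train_toy), valid = nrow(valid_toy), test = nrow(test_toy)))
```

Because the three index ranges are disjoint slices of one shuffled vector, no row can appear in more than one set.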


## Data Modeling with Machine Learning

### XGBoost model

The first model we fit is an extreme gradient boosting (XGBoost) model. For more detail on its parameters, see http://xgboost.readthedocs.io/en/latest/parameter.html.

Here, we set the maximum tree depth to 3 and the step size shrinkage parameter (eta) to 0.3, and use a logistic objective for binary classification with area under the curve (AUC) as the evaluation metric for validation data. These values can be tuned to optimize accuracy in a full analysis.

# Construct xgb.DMatrix object from training matrix
dtrain <- xgb.DMatrix(as.matrix(train), label = train_Y)

# Create a list of parameters for XGBoost model
param <- list(max_depth=3,
silent=1,
eta = 0.3,
objective='binary:logistic',
eval_metric = 'auc')

# Training a XGBoost model using training dataset and chosen parameters
bst <- xgb.train(param, nrounds = 82, dtrain)

# Predicting the results using testing dataset
pred.xgb <- predict(bst, as.matrix(test))

# Create a table indicating the most important features of XGBoost model
importance <- xgb.importance(feature_names = column_names, model = bst)


The XGBoost object, bst, created from training the model is a list that contains basic information for the model training, e.g. the parameters settings, etc. The predict step allows us to make a prediction using the separate test dataset.

We can show the first six predictions here:

head(pred.xgb)
## [1] 0.61055940 0.59795433 0.01765974 0.03849135 0.91354728 0.06819534


The result shows the first six predicted probabilities for the test dataset. For example, the first observation’s prediction is 0.61: according to our model, that employee has a 61% chance of attriting.

### Feature Importance

A great advantage of using an XGBoost model is its built-in ability to produce a feature importance table. The importance metric provides a score indicating how valuable each factor was in the construction of the boosted decision trees. Higher relative importance indicates a larger impact on the algorithm and the final prediction.

To actively improve overall employee retention issues, we can use this to look more closely at the most important features that determine the attrition.

In the figure below, we plot the 10 most important features as a bar chart, ranked by the importance metric.

xgb.plot.importance(importance_matrix = importance, top_n = 10)


When the model is run on the entire dataset, the results show that Marital Status, Number of Companies Worked For, and Age are the dominant drivers of employee attrition for this dataset. By looking at the plots from Part 1, we can determine how these features impact attrition directly.

In terms of HR adjustable parameters, we note that adjustments to Job Involvement, Stock Option and Monthly Income might be used as incentives for high value employees.

### SVM model

The second model we fit is a support vector machine (SVM) model. An SVM constructs a hyperplane, or a set of hyperplanes, that has the largest distance to the nearest training data points of any class.

We choose a radial kernel with proper gamma and cost values here to optimize the performance of SVM. Again, these should be tuned in a full analysis.

train$Attrition<-train_Y

# Training a SVM
svm_model<-svm(Attrition~.,                #set model formula
type="C-classification",   #set classification machine
gamma=0.001668101,         #set gamma parameter
cost=35.93814,             #set cost parameter
data=train,
cross=3,                   #3-fold cross validation
probability = TRUE        #allow for probability prediction
)

# Predicting the results using testing dataset
# Obtain the predicted class 0/1
svm_model.predict<-predict(svm_model, test, probability=TRUE)
# Obtain the predicted probability for class 0/1
svm_model.prob <-attr(svm_model.predict,"probabilities")
svm_model
##
## Call:
## svm(formula = Attrition ~ ., data = train, type = "C-classification",
##     gamma = 0.001668101, cost = 35.93814, cross = 3, probability = TRUE)
##
##
## Parameters:
##    SVM-Type:  C-classification
##        cost:  35.93814
##       gamma:  0.001668101
##
## Number of Support Vectors:  339


The SVM model object is a list presenting basic information about the parameters, number of support vectors, etc. The number of support vectors depends on how much slack we allow when training the model. If we allow a large amount of flexibility, we will have a large number of support vectors.

### Logistic regression model

The third model we fit is a logistic regression model. It uses a logit link to relate the data matrix to the class probability. The inverse of the logit, the logistic function, always produces a value between 0 and 1 that can be interpreted as a probability.
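To illustrate the mapping between the linear predictor and a probability, the short base R sketch below uses `plogis()` and `qlogis()`, the logistic and logit functions from the stats package. The log-odds values are arbitrary examples.

```r
# Linear predictor values (log-odds) can be any real number
log_odds <- c(-5, 0, 2)

# plogis() is the logistic function, 1 / (1 + exp(-x)), the inverse of the logit
prob <- plogis(log_odds)
print(prob)

# qlogis() is the logit, log(p / (1 - p)), and recovers the original log-odds
print(qlogis(prob))
```

A log-odds of 0 corresponds to a probability of exactly 0.5, and large negative or positive values approach 0 and 1 without ever reaching them.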

# Training a logistic regression model