This project analyzes IBM Watson data on the sales pipeline and lead-conversion outcomes for an auto parts store, available as a CSV file here. The goal of this analysis is to: “understand [the] sales pipeline and uncover what can lead to successful sales opportunities and better anticipate performance gaps.”

**Exploratory Data Analysis**

This data set has eighteen features, including the category of the product being sold, the region, the route to market, various statistics describing the sales pipeline, and client information. The features are of type float, object, and integer. A further description of the data set variables can be found here.

There are a total of 78,025 rows in the data set. There are two outcome classes, Win and Loss, i.e. whether or not a lead was converted. The Loss class is 3.43 times larger than the Win class, as shown in Figure 1 below. Since the majority of leads do not convert, there is a clear business need to better understand the lead conversion pipeline.

**Figure 1: Win and Loss Outcome Visualization**

**Training the Classification Model**

For this classification problem, I chose to use an XGBoost classifier, an implementation of gradient-boosted regression trees often used to win Kaggle competitions. To prepare the data for the model, I one-hot encoded the string variables, resulting in forty features, and created a separate data frame for the outcome column. I then split the data into training and validation sets before training the classifier.
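The preparation steps above can be sketched as follows. This is a minimal illustration on made-up data (the column names and values are hypothetical stand-ins for the Watson data set, not the real columns), showing one-hot encoding of the string variables, separating the outcome, and holding out a validation set:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature stand-in for the sales-pipeline data:
# one categorical (object) column, one numeric column, and the outcome.
df = pd.DataFrame({
    "route_to_market": ["Reseller", "Field Sales", "Reseller", "Other",
                        "Field Sales", "Reseller", "Other", "Field Sales"],
    "opportunity_amount_usd": [50000, 120000, 8000, 30000,
                               75000, 15000, 90000, 42000],
    "opportunity_result": ["Loss", "Win", "Loss", "Win",
                           "Loss", "Loss", "Win", "Win"],
})

# One-hot encode the string columns; numeric columns pass through unchanged.
X = pd.get_dummies(df.drop(columns="opportunity_result"))

# Separate outcome, encoded 1 = Win, 0 = Loss.
y = (df["opportunity_result"] == "Win").astype(int)

# Hold out a validation set before training the classifier.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)
```

The resulting `X_train` / `y_train` arrays can then be passed to `xgboost.XGBClassifier().fit(...)`.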

**Technical Machine Learning Outcomes**

**Feature Importances**

These are the top five features on which the decision trees are split in the XGBoost model:

- opportunity amount in USD
- elapsed days in sales stage
- total days identified through qualified
- sales stage change count
- revenue from client the past two years
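A ranking like this can be read directly off a fitted gradient-boosting model. The sketch below uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost (both expose a `feature_importances_` attribute) on synthetic data, where the target is deliberately constructed from the opportunity amount so that feature dominates:

```python
import numpy as np
import pandas as pd
# Stand-in for xgboost.XGBClassifier; same feature_importances_ interface.
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
# Hypothetical numeric features mirroring the pipeline statistics above.
X = pd.DataFrame({
    "opportunity_amount_usd": rng.lognormal(10, 1, 200),
    "elapsed_days_in_sales_stage": rng.integers(1, 120, 200),
    "sales_stage_change_count": rng.integers(0, 10, 200),
})
# Synthetic outcome that depends on the amount, so importances are non-trivial.
y = (X["opportunity_amount_usd"] > X["opportunity_amount_usd"].median()).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Rank features by importance, as in Figure 2.
ranked = (pd.Series(model.feature_importances_, index=X.columns)
            .sort_values(ascending=False))
print(ranked)
```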

**Figure 2: Feature Importances**

**Figure 3: Log Scaled Box Plots of the Top 5 Feature Importances**

**Figure 4: Confusion Matrices for the Negative and Positive Classes**

There is a disproportionately high number of false negatives, perhaps due to the unbalanced classes. To correct for this, I tuned the scale_pos_weight parameter, which changes the weight of the Win class from 1 to sum(negative cases) / sum(positive cases). Changing this parameter reduced the false negatives such that only 18% of converted leads are misclassified as losses.

**Figure 5: Histograms of Predicted Probabilities of the True Positives and True Negatives of the Model**

The above histograms of the model's predicted probabilities show that the model is more decisive when predicting the positive class once the classes are more balanced.

The histogram below shows the predicted probabilities of the true and false values from the model. Lowering the cut-off point below 0.5 could increase the number of true positives returned; however, it would also increase the number of false positives.
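The threshold trade-off is easy to see in code. A small sketch on hypothetical predicted probabilities (the values and the 0.35 cut-off are illustrative assumptions, not tuned on the real data):

```python
import numpy as np

# Hypothetical predicted probabilities for the Win class.
proba = np.array([0.15, 0.35, 0.45, 0.55, 0.80])

default_preds = (proba >= 0.5).astype(int)   # standard 0.5 cut-off
lowered_preds = (proba >= 0.35).astype(int)  # lowered cut-off flags more leads

# Lowering the threshold flags more leads as Wins: some will be true
# positives recovered, others new false positives.
print(default_preds.sum(), lowered_preds.sum())  # 2 4
```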

**Figure 6: True and False Values Predicted Probabilities Histogram**

**Figure 7: Precision vs Recall Plot of the Model**

The area under the precision-recall curve for the positive class is less than that for the negative class. Having more data on the positive class might help the model better learn to predict winning leads.
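A precision-recall curve and its area-style summary can be computed with scikit-learn. This sketch uses hypothetical labels and probabilities (average precision is used here as the summary statistic; it is a close relative of, but not identical to, trapezoidal area under the PR curve):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

# Hypothetical true labels and predicted Win probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
proba  = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9])

# Points for a plot like Figure 7.
precision, recall, thresholds = precision_recall_curve(y_true, proba)

# Scalar summary of the curve for comparing the two classes.
ap = average_precision_score(y_true, proba)
print(round(ap, 3))
```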

**Results Analysis**

Using the XGBoost model, it is possible to predict the leads that are most likely to convert. Predicting these leads can be helpful in a few ways, such as focusing limited sales resources on them. It is also an opportunity to identify the features most correlated with conversion and replicate those practices across the sales pipeline.

An alternative way to assess the performance of the model, which I wasn’t able to do within the limits of this data set, would be to consider a cost-benefit matrix. How much does converting a lead earn the business? How much does it cost to follow up on a lead? A few false positives may be acceptable in the final model if the sales process isn’t too costly for a company, or unacceptable if sales costs are very high. Constructing a cost-benefit matrix can help identify the best classification algorithm for optimizing the sales pipeline for this auto parts store.
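To make the cost-benefit idea concrete, here is a minimal expected-profit calculation. All figures (benefit per converted lead, follow-up cost, confusion counts) are invented for illustration, since this information is not in the data set:

```python
# Hypothetical cost-benefit figures (illustrative assumptions only).
benefit_per_win = 1000.0  # revenue from a converted lead
followup_cost = 50.0      # cost of pursuing any predicted-positive lead

# Hypothetical confusion counts at some classification threshold.
tp, fp = 80, 40

# Expected profit: true positives earn the benefit minus the follow-up
# cost; false positives only incur the follow-up cost.
profit = tp * (benefit_per_win - followup_cost) - fp * followup_cost
print(profit)  # 74000.0
```

Computing this quantity across thresholds (or across candidate classifiers) would let the store pick the model that maximizes expected profit rather than raw accuracy.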