Anomaly detection is a typical problem that can be solved using machine learning. In principle, identifying anomalous data is very similar to ordinary classification, except that the anomalous class represents a much smaller fraction of the data. Further, the anomalous class need not be homogeneous: outliers must be flagged even though they may not be markedly similar to other outliers.
Canonical classification algorithms can still be used to great effect, especially density-based ones, where new elements that are not close to any of the original normal data are marked as anomalous.
Let us look at a Network Intrusion Detector example.
Network Intrusion Detector
Detecting network intrusions is vital to protecting computer networks from unauthorized users. The goal here is to detect intrusions on a simulated U.S. Air Force local-area network (LAN). The data was processed from nine weeks' worth of network traffic, with four dozen features and a label of normal request or some form of abnormal request for each data point (i.e. each request to connect). The data set used for this analysis is the NSL-KDD data set, whose features are described below.
The features include duration, protocol type, urgency of requests, server error rate, whether the login is a guest login, root access, etc. A subset of the features and their descriptions is listed below:
Further, the data is split into 10 classes: a normal category corresponding to roughly 55% of the entire data set, and 9 other classes corresponding to different types of anomalous data.
Since we are not concerned with the exact class of anomalies but merely interested in detecting which are anomalous, we can simply relabel all 9 classes as “anomaly”.
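The relabelling step can be sketched in a few lines of Python. The class names below are placeholders chosen for illustration, not the actual NSL-KDD label names, and `relabel` is a hypothetical helper:

```python
# Placeholder class names, used only to illustrate the relabelling step.
raw_labels = ["normal", "attack_a", "normal", "attack_b", "attack_c", "normal"]


def relabel(label: str) -> str:
    """Keep 'normal' as it is and collapse every other class into 'anomaly'."""
    return "normal" if label == "normal" else "anomaly"


labels = [relabel(label) for label in raw_labels]
print(labels)
```

Any class that is not "normal" ends up in the single "anomaly" category, whichever of the 9 anomalous classes it started in.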
Below we show a few examples of the data split into normal and anomalous categories.
Now that we have a rough sense of the variables in the data set, we can look at how canonical machine learning algorithms are applied.
Let’s first try a density-based algorithm, k-Nearest Neighbours (kNN). The idea is that the normal data should be clustered tightly together while the anomalies are scattered further away, so we expect reasonable results if the model is tuned properly.
We train on the training set kddtrain, with 'label' being normal or anomalous. We convert the label to a binary value (using the function binarize) and search over a grid of parameter values.
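A minimal sketch of such a grid search, assuming scikit-learn, is given below. The data is a synthetic stand-in for kddtrain, and the grid values, the F1 scoring and the 5-fold cross-validation are assumptions made for illustration rather than the settings used in this analysis:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the kddtrain features: a tight "normal" cluster
# and widely scattered "anomalous" points.
rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0.0, scale=0.5, size=(200, 4))
X_anomalous = rng.uniform(low=-6.0, high=6.0, size=(40, 4))
X_train = np.vstack([X_normal, X_anomalous])
y_train = np.array([0] * 200 + [1] * 40)  # 0 = normal, 1 = anomaly

# Illustrative grid (an assumption, not the grid from the analysis).
param_grid = {
    "n_neighbors": [1, 3, 5, 11],
    "weights": ["uniform", "distance"],
}

# Cross-validated search over every combination in the grid.
search = GridSearchCV(KNeighborsClassifier(), param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_params_` holds the best combination found and `search.best_estimator_` is a model refitted on the full training set with those parameters.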
At this point, we've run over the grid and found the optimal parameters for our kNN algorithm: k=1 with uniform weights. This is not too surprising if the different instances of anomalies are not very densely clustered.
From here, we predict on a hold-out set using the following code:
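A minimal sketch of a hold-out evaluation, assuming scikit-learn and a synthetic stand-in for the data, might look like this. The 25% split and the random seeds are assumptions; only k=1 with uniform weights comes from the grid search above:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 0 = normal, 1 = anomaly.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(200, 4)),
    rng.uniform(low=-6.0, high=6.0, size=(40, 4)),
])
y = np.array([0] * 200 + [1] * 40)

# Keep 25% of the data aside as a hold-out set (an assumed split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# k=1 with uniform weights, as selected by the grid search.
model = KNeighborsClassifier(n_neighbors=1, weights="uniform")
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)
print(classification_report(y_test, y_pred, labels=[0, 1]))
```

The report lists precision, recall and F1 per class, which are the statistics discussed next.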
We have hit reasonable scores for precision and recall. The results do indicate, however, that many of the normal instances (labelled 0) are being incorrectly tagged as anomalous, i.e. they are false positives.
This is shown both in the confusion matrix and in the relatively low recall for normal instances.
Conversely, there is relatively low precision for the anomalies (labelled 1). This occurs when the model over-fits to the normal data: it then flags a lot of data as anomalous, when only 66% of those flagged instances are true anomalies.
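To make the link between the two statistics concrete, here is a small worked example in plain Python. The counts are hypothetical, chosen only so that the anomaly precision comes out at 0.66; they are not the counts from this analysis:

```python
# Hypothetical confusion-matrix counts (0 = normal, 1 = anomaly).
true_normal_pred_normal = 460    # true negatives
true_normal_pred_anomaly = 340   # false positives
true_anomaly_pred_normal = 40    # false negatives
true_anomaly_pred_anomaly = 660  # true positives

# Recall for the normal class: share of normal instances kept as normal.
normal_recall = true_normal_pred_normal / (
    true_normal_pred_normal + true_normal_pred_anomaly
)

# Precision for the anomaly class: share of flagged instances that are real.
anomaly_precision = true_anomaly_pred_anomaly / (
    true_anomaly_pred_anomaly + true_normal_pred_anomaly
)

print(f"normal recall:     {normal_recall:.3f}")
print(f"anomaly precision: {anomaly_precision:.3f}")
```

The same false positives appear in both denominators, which is why a low recall for normal instances and a low precision for anomalies show up together.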
Nearest Centroid Classification
We can also use a centroid-based classification procedure to check our findings. Nearest centroid is a slightly more robust relative of nearest-neighbour classification: it is a more global model of the data and is therefore less susceptible to noise and outliers.
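The mechanism is simple enough to sketch in plain Python: each class is summarised by the mean of its training points, and a new point takes the label of the closest mean. The helper names and the tiny data set below are made up for illustration; in practice scikit-learn's `NearestCentroid` class provides the same kind of classifier:

```python
from math import dist
from statistics import fmean


def fit_centroids(X, y):
    """Return {label: centroid}, the per-class mean of each feature."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [fmean(column) for column in zip(*rows)]
    return centroids


def predict(centroids, X):
    """Assign each point the label of its closest centroid (Euclidean)."""
    return [min(centroids, key=lambda lab: dist(x, centroids[lab])) for x in X]


# Tiny made-up data set: 0 = normal, 1 = anomaly.
X_train = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0], [4.0, 5.0], [6.0, 3.0]]
y_train = [0, 0, 0, 1, 1]

centroids = fit_centroids(X_train, y_train)
predictions = predict(centroids, [[0.1, 0.2], [5.0, 5.0]])
print(predictions)
```

Because only one mean per class is stored, a single noisy training point shifts the decision far less than it would for kNN with k=1.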
The nearest centroid results are unsurprisingly consistent with those of the kNN algorithm. They are marginally better, as may be expected from a more robust algorithm.
Finally, let's implement logistic regression. We will do a little parameter tuning to determine the optimal settings and fit a model using these best parameters on the training set kddtrain. 'true_label' is again an array with the true anomalous or normal indication for that instance of the training data.
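A minimal sketch of this tuning step, assuming scikit-learn, is shown below. The data is a synthetic stand-in for kddtrain, and the choice to tune only the regularisation strength `C`, the grid values, the F1 scoring and the 5-fold cross-validation are all assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for kddtrain: 0 = normal, 1 = anomaly.
rng = np.random.default_rng(0)
X_train = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(200, 4)),
    rng.normal(loc=3.0, scale=1.0, size=(40, 4)),
])
true_label = np.array([0] * 200 + [1] * 40)

# Illustrative grid over the inverse regularisation strength C.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, scoring="f1", cv=5
)
search.fit(X_train, true_label)

# best_estimator_ is refitted on the whole training set with the best C.
best_model = search.best_estimator_
print(search.best_params_)
print(best_model.predict(X_train[:5]))
```

The fitted `best_model` can then be evaluated on the hold-out set in the same way as the kNN model.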
As you can see, a simple logistic regression also does quite well on the hold-out set, giving similar F1 scores but from slightly different statistics: it trades some recall for precision in the normal class and vice versa for the anomalous class. We leave it to the reader to think about whether or not this is expected, based on how logistic regression differs from density-based approaches.
Simple machine learning algorithms can predict anomalous data well out of the box. Density-based methods and very simple logistic regression have shown great performance in a whole slew of industry applications, even in more data-imbalanced regimes.
Further, even before considering ensembling models, more complicated supervised learning algorithms such as SVMs, BDTs and NNs can provide drastic improvements on the results shown here.