Today we shall talk about time-series analysis. We have anonymized EEG data recorded from 36 channels (electrodes) that measure electrical signals from a volunteer's brain. Volunteers were directed to think about one of three things: a specific word, moving their left hand, or moving their right hand, and to keep doing so for 15 seconds. After 15 seconds, they were directed to switch to another of the three, following a predefined sequence. This continued for 5 minutes and the sequence was noted. The electrical signals were recorded at a sampling rate of 8000 Hz.
The goal for this project was to determine from raw neural activity what the volunteer was thinking at any given time.
This form of analysis has many direct applications, for example: controlling prostheses, computer games that interface with brain activity, and even piloting vehicles [or objects beyond integrated eye movement] without the need for physical interaction.
Further, the methods that we talk about are entirely general. We can apply this type of analysis to anything from stock-market forecasting to artificial intelligence.
Time Segmentation and Down-Sampling
There is a natural segmentation of time into 15-second intervals in this case. In principle, the discretization of time is arbitrary: we can divide the period into overlapping windows, or even windows of varying length (major peak to major peak, say), and classify each one.
To give finer resolution, we eventually settled on an interval of 2 seconds that seemed to give the optimum trade-off between sensitivity and computational intensity.
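The windowing step can be sketched as follows. This is a minimal illustration, assuming each channel is a 1-D NumPy array; the helper name `segment_windows` and its `step_s` parameter are hypothetical and not taken from the original analysis. Choosing a step shorter than the window gives the overlapping windows mentioned above.

```python
import numpy as np

def segment_windows(signal, fs, window_s=2.0, step_s=None):
    """Split a 1-D signal into fixed-length windows (one window per row).

    fs is the sampling rate in Hz. If step_s is None the windows do not
    overlap; a step shorter than window_s produces overlapping windows.
    """
    x = np.asarray(signal)
    win = int(round(window_s * fs))
    step = win if step_s is None else int(round(step_s * fs))
    if win <= 0 or step <= 0:
        raise ValueError("window and step must cover at least one sample")
    n_windows = (len(x) - win) // step + 1
    if n_windows <= 0:
        return np.empty((0, win), dtype=x.dtype)
    # Build an index matrix: row i holds the sample indices of window i.
    starts = np.arange(n_windows) * step
    idx = starts[:, None] + np.arange(win)[None, :]
    return x[idx]
```

For a 5 minute channel sampled at 8 Hz (2400 points), non-overlapping 2 second windows give 150 windows of 16 samples each. Note that 2 seconds does not divide 15 seconds evenly, so some windows straddle a change of task; how those are labeled is a choice this sketch leaves open.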
Another important question that needs to be addressed is how much down-sampling to use. At 8000 Hz, a period of 5 minutes represents 2,400,000 (8000*5*60) data points for each channel! This is typically too much information; in particular, we risk overfitting to the underlying noise.
After some optimization using a validation set, we down-sampled all channels to 8 Hz (i.e. 2400 data points per channel in 5 minutes).
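One way to go from 8000 Hz to 8 Hz is sketched below. The original analysis does not say how the down-sampling was done, so this is an assumption: it uses `scipy.signal.decimate`, which low-pass filters before discarding samples, and applies it in three stages of 10 rather than one stage of 1000. The function name `downsample_by_1000` is hypothetical.

```python
import numpy as np
from scipy.signal import decimate

def downsample_by_1000(signal):
    """Reduce the sampling rate 1000-fold (e.g. 8000 Hz -> 8 Hz).

    Each of the three stages applies an anti-aliasing FIR filter and then
    keeps every 10th sample, so frequencies above the new Nyquist limit
    are attenuated instead of being folded back into the signal.
    """
    out = np.asarray(signal, dtype=float)
    for _ in range(3):
        out = decimate(out, 10, ftype="fir", zero_phase=True)
    return out
```

Simply taking every 1000th sample would be faster, but anything above 4 Hz in the raw recording would then alias into the down-sampled signal.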
The raw data from one of the channels for the 5 minute period looks like this:
Looking at the data in the above format, the signal is not immediately apparent. Furthermore, while it is plausible to step through every channel in this case and look at each one manually, it is often not possible, or even useful, to look through all the raw data by hand.
The idea instead is to generate some set of representative features that we can feed into a classification algorithm that can automatically classify time windows.
At this point we use some intuition and design hand-crafted features that we think might be important: say, the absolute size of the largest peak in a 15-second window, or the time between the largest and second-largest peaks.
The only drawback of this approach is that it is relatively expensive to develop these features by hand; at the same time, well-crafted features can add a great deal of predictive power.
Let's cheat a little and show you what I know to be an excellent feature for this dataset. We define the normalized peak count in each 2-second window as the number of peaks in that window that are above 80% of the window's maximum. This figure is shown for the most important channel:
Notice the good separation between thinking about a word and thinking of a movement. This feature is less useful for separating the left hand from the right.
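The normalized peak count described above can be sketched in a few lines. This is one reading of the definition, not necessarily the exact code used: a peak is taken to be a local maximum as found by `scipy.signal.find_peaks`, and the function name and `frac` parameter are hypothetical.

```python
import numpy as np
from scipy.signal import find_peaks

def normalized_peak_count(window, frac=0.8):
    """Count local maxima that reach at least `frac` of the window maximum.

    Assumes the window's maximum is positive; for a window that is
    entirely negative the threshold lies above every sample and the
    count is zero.
    """
    w = np.asarray(window, dtype=float)
    if w.size == 0:
        return 0
    threshold = frac * w.max()
    peaks, _ = find_peaks(w, height=threshold)
    return len(peaks)
```

For example, the window `[0, 1, 0, 0.9, 0, 0.5, 0]` has three local maxima, of which two (1 and 0.9) clear the 80% threshold. Applying the function to every 2 second window of a channel gives one feature value per window.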
How did we notice this feature was important without having to go through each of the 36 channels or each of the features? We simply generated dozens of such features that we thought might be useful and fed them into our favorite feature-selection algorithm; in our case, XGBoost has a nice importance ranker when you use it for classification.
Plotting importance on the default normalized scale, I found this:
Feature 37 corresponded to the normalized peak count.
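A ranking of this kind can be sketched as follows. To keep the example light on dependencies it uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost; `xgboost.XGBClassifier` follows the same fit-then-read-`feature_importances_` pattern. The function name and the hyperparameters are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def rank_features(X, y, random_state=0):
    """Fit a boosted-tree classifier and rank features by importance.

    X has one row per time window and one column per feature; y holds
    the class of each window. Returns the column indices sorted from
    most to least important, plus the importances (which sum to 1).
    """
    model = GradientBoostingClassifier(
        n_estimators=50, max_depth=3, random_state=random_state
    )
    model.fit(X, y)
    importances = model.feature_importances_
    order = np.argsort(importances)[::-1]
    return order, importances
```

The ranking only says which features the trees found useful for splitting; correlated features can share or steal importance from each other, so it is a guide for where to look rather than a proof.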
Feeding our features into a classification algorithm with their known classes, we find:
The green line represents the algorithm's prediction and the red line the actual classes. Feature_1 is actually the raw signal from one of the channels sampled at 8 Hz, which was fed into the algorithm (and pops up as f1 on the feature importance list above).
Interestingly, the raw feature here already shows visible differences between classes (which is presumably why the algorithm pulled it out as important).
The confusion matrix on a hold-out set of new patients was computed:
Pulling out thoughts of a word is evidently quite easy, which suggests that the mental processes of movement and language are possibly separate.
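For reference, a confusion matrix can be computed with a few lines of NumPy. This is a generic sketch; the integer encoding of the classes (0 = word, 1 = left hand, 2 = right hand) is an assumption made for the example, and the labels in the usage note are made up rather than taken from the study.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=3):
    """Return an (n_classes x n_classes) count matrix.

    Rows are the actual classes and columns the predicted classes, so
    entry [i, j] counts windows of class i that were predicted as j.
    """
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def per_class_recall(cm):
    """Fraction of each actual class that was predicted correctly."""
    return np.diag(cm) / cm.sum(axis=1)
```

With the made-up labels `y_true = [0, 0, 1, 1, 2, 2]` and `y_pred = [0, 0, 1, 2, 2, 1]`, class 0 is always recovered while classes 1 and 2 are confused with each other half the time, which is the kind of pattern described above.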
Generating features needs some intelligent intuition. In our case, where changes in activity happen at 15-second intervals, it was always likely that macroscopic features would be more important. By macroscopic, I mean things like the number of peaks, the time between major and minor peaks, etc. For quickly varying time series, you would expect features that capture fine structure to be more important (e.g. the speed at which peaks decay).
More Precision with Machine Vision
There are machine vision methods that can improve results further by doing something more complicated. The catch is that this requires deep learning and a lot of training time, and therefore costs some flexibility. If these are constraints we can live with, we can go even further.
We can convert the time-series data into a spectrogram. This is quite straightforward, and there are many built-in libraries in Python to do so. A spectrogram is basically a Fourier transform of the data computed over successive short time windows, visualized as an image:
The patterns in the images represent frequency structure in the data that is not noticeable in the time domain. The heatmap indicates how strongly each frequency is present over time. In this case, you can already see possible striations at roughly 15-second intervals as well!
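A spectrogram of one channel can be computed with `scipy.signal.spectrogram`, as sketched below. The window length and overlap are assumptions for illustration; the post does not state which settings or which sampling rate were used for the figure.

```python
import numpy as np
from scipy.signal import spectrogram

def channel_spectrogram(signal, fs, window_s=2.0, overlap=0.5):
    """Short-time Fourier power of a 1-D signal.

    Returns (freqs, times, power), where power[i, j] is the spectral
    density at frequency freqs[i] in the window centered at times[j].
    """
    nperseg = int(round(window_s * fs))
    noverlap = int(round(overlap * nperseg))
    freqs, times, power = spectrogram(
        np.asarray(signal, dtype=float),
        fs=fs,
        nperseg=nperseg,
        noverlap=noverlap,
    )
    return freqs, times, power
```

The `power` array is the image: plotting it (for instance with matplotlib's `pcolormesh(times, freqs, power)`) gives the heatmap, and the same array can be passed to an image model as a 2-D input.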
We then use traditional machine vision techniques to digest this data and add the features into the classifier.
Finally, aside from the time and frequency domains, there are also hybrid time-frequency domain methods that can be exploited. The benefit of such methods is that we can remove any notion of windowing.
For example, a common transform known as the continuous wavelet transform (CWT) will translate time-series data into the time-frequency domain.
The convolution of a predefined 'wavelet' (in EEG a common one is the Ricker wavelet) with the time series is taken at each moment in time.
A range of scales is used to stretch or shrink the wavelet, and the convolution is repeated at each scale. The result is an image that looks like:
The structure of the time-series again can be seen graphically and canonical machine vision techniques can be used.
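The CWT described above can be sketched directly in NumPy: build a Ricker wavelet at each scale and convolve it with the signal. This is a bare-bones illustration written from the standard Ricker formula rather than the code behind the figure; the function names and the choice of truncating the wavelet at five widths on each side are assumptions.

```python
import numpy as np

def ricker(points, a):
    """Ricker ("Mexican hat") wavelet of width a, sampled at `points` samples."""
    t = np.arange(points) - (points - 1) / 2.0
    amp = 2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25)
    return amp * (1.0 - (t / a) ** 2) * np.exp(-(t ** 2) / (2.0 * a ** 2))

def cwt_ricker(signal, widths):
    """Continuous wavelet transform with a Ricker wavelet.

    Returns an array of shape (len(widths), len(signal)): row i is the
    signal convolved with the wavelet at scale widths[i].
    """
    x = np.asarray(signal, dtype=float)
    out = np.empty((len(widths), len(x)))
    for i, a in enumerate(widths):
        # Odd-length kernel, about 5 widths per side, never longer than x.
        half = min(int(5 * a), (len(x) - 1) // 2)
        kernel = ricker(2 * half + 1, a)
        out[i] = np.convolve(x, kernel, mode="same")
    return out
```

Calling `cwt_ricker(channel, np.arange(1, 31))` gives a 30-row image with time along one axis and scale along the other; small widths respond to sharp features and large widths to slow ones.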
Turning a time-series dataset into high-level features, frequency-domain or even time-frequency domain information, can yield very powerful and accurate predictions.
Window optimization, feature crafting, and the basic analysis flow would have to be tweaked, but with these few modifications, depending on the exact problem, any time-series data can be analyzed in this manner.
Transforms such as spectrograms and the CWT turn the signal into images, so machine vision techniques can also be added to increase our predictive power and help classify complex datasets.
In our next blog entry we shall extend this example with a less common method known as Dynamic Time Warping (DTW), which has already yielded very convincing results in a wide variety of fields.