Understanding Speech Recognition: An Introduction

Spoken Arabic Digits

- Ted Sand and Diego Sosa-Coba [SFL Interns]

This blog post focuses on developing an algorithm to understand spoken Arabic digits. Such algorithms are a first step toward computers that can understand language, with applications ranging from text-to-speech, voice recognition, and translation to the modern AI assistants that are becoming widely available. The practical implications are wide in scope, from voice assistants to automation environments that rely on specific command recognition. These algorithms churn in the background of our daily lives whenever commands and spoken telephone numbers are recognized by Siri or Ok Google, or during a call with a voice directory or mobile banking application.

For this analysis, we use a publicly available dataset from UCI's Machine Learning Repository. The data consists of 88 native Arabic speakers (44 males and 44 females), each of whom spoke each digit [0-9] ten times. The audio was sampled at 11025 Hz with 16 bits per sample.

The raw time-series data is first preprocessed into 13 Mel Frequency Cepstral Coefficients (MFCCs) [1]. MFCCs are derived from the Fourier transform of the signal, with the spectrum grouped into bands that are spaced to model the way people hear rather than spaced linearly. By converting the data into phonetically important information, we give the machine learning algorithm features that make it easier to tell apart how different words are spoken.


Transforming the Dataset

The dataset from UCI consisted of 8,800 time series of 13 MFCCs taken from the 88 individuals. The MFCCs were divided into blocks of frames which, when taken as a whole, represented the spoken digit. The shortest digit was 4 frames and the longest digit was 93 frames. In order to fit it all neatly in one data frame we use 1209 (13x93) columns, one for each possible MFCC value, with missing values encoded as zeroes. Note that this encoding does not preserve all of the information given by the MFCCs. The algorithm now views the MFCCs as effectively unrelated and does not know that they go in sets of thirteen, or that the 93 observations of the first MFCC are all the same function of the underlying signal.
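The flattening step described above can be sketched as follows. This is a minimal illustration rather than the code used for the analysis: the function name `flatten_utterance` and the frame-by-frame column order are assumptions, and the sample utterance is made up.

```python
N_MFCC = 13      # coefficients per frame
MAX_FRAMES = 93  # longest utterance in the dataset


def flatten_utterance(frames, max_frames=MAX_FRAMES, n_mfcc=N_MFCC):
    """Flatten a list of frames (each a list of n_mfcc floats) into one
    fixed-length row, padding the missing frames with zeroes."""
    if len(frames) > max_frames:
        raise ValueError("utterance longer than max_frames")
    row = []
    for frame in frames:
        if len(frame) != n_mfcc:
            raise ValueError("each frame must hold n_mfcc coefficients")
        row.extend(frame)
    # Zero-fill the columns for frames this utterance does not have.
    row.extend([0.0] * ((max_frames - len(frames)) * n_mfcc))
    return row


# Hypothetical 4-frame utterance (the shortest length seen in the data).
short = [[float(i + j) for j in range(N_MFCC)] for i in range(4)]
row = flatten_utterance(short)
print(len(row))  # 1209
```

With this layout, a 4-frame digit fills only the first 52 columns and leaves the remaining 1157 at zero, which is why the classifier has no way of knowing that columns come in groups of thirteen.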


Arabic Digits

The Arabic numeral system is the same as the one used in most Western countries; however, the digits are pronounced differently because they are spoken in Arabic. The numeral representations in Arabic, along with their phonetic pronunciation, are as follows:


0.   ٠ | sifr

1.   ١ | wahed

2.   ٢ | ithnan

3.   ٣ | thalathah

4.   ٤ | arba'eh

5.   ٥ | khamsah

6.   ٦ | setah

7.   ٧ | sab'ah

8.   ٨ | thamaneyah

9.   ٩ | tes'ah



Each digit has at least two and at most four syllables. We did not include any feature concerning the syllables in our analysis, but this could be used to further improve accuracy in the future. In particular, we are able to visually distinguish the longer spoken digits from the shorter ones as follows.



We first visualize the simplest possible aspect of the data: how long each digit takes to say.

Visualization of speech length by digit

The above (normalized) histogram shows that the digit five is quick to say and eight is long relative to everything else. This is reflected in the number of syllables for five (2 syllables) and eight (4 syllables).

Let’s take a look at the MFCCs. The following boxplots show the distributions of the averaged MFCC values across all ten digits for four of the coefficients. Each box and whiskers shows the 0th, 25th, 50th, 75th, and 100th percentiles, discounting any outliers.

Mel Frequency Cepstral Coefficients Boxplot

We have handpicked a handful of plots that correspond to MFCCs 2, 3, 5, and 8, since these were listed amongst the most important by standard feature importance tools.

This is a great visualization of the differences in distributions between digits for any particular MFCC. We noticed that MFCC 2 had a significantly larger but consistent portion of outliers. On the other hand, MFCC 5 had a much narrower, more concentrated distribution across all digits than the other MFCCs.

These differences between digits allow the machine learning algorithm to perform better in terms of accuracy, since it is easier to distinguish the individual digits.


Machine Learning Techniques

After trying a variety of methods, including boosted decision trees and SVMs, we found that the simple k-nearest neighbors (kNN) algorithm performed best.

kNN works by classifying a point based on a majority vote of its k nearest neighbors (here determined by Euclidean distance). The number of neighbors k is typically set using a validation set.
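As a rough illustration of that voting rule, here is a minimal pure-Python sketch. The function name `knn_predict`, the tie-breaking behavior, and the tiny two-dimensional training set are all assumptions made for the example; the analysis itself worked on the 1209-column rows.

```python
import math
from collections import Counter


def knn_predict(train_rows, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training rows,
    using Euclidean distance."""
    neighbors = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    # On a tied vote, most_common returns the label seen first,
    # which here is the label of the nearer neighbor.
    return votes.most_common(1)[0][0]


# Made-up 2-D points standing in for flattened MFCC rows.
train_rows = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]]
train_labels = [0, 0, 7, 7]

print(knn_predict(train_rows, train_labels, [0.2, 0.1], k=3))  # 0
print(knn_predict(train_rows, train_labels, [4.9, 5.1], k=3))  # 7
```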

Validation Set and Parameter Tuning

An important consideration when partitioning data like this is to separate the sets by speaker. If samples from the same speaker appeared in different sets, we would see an artificial improvement in performance, as the same person tends to say the same digit in the same way. Thus our data ended up partitioned into 44 speakers in the training set, 22 in the validation set, and 22 in the test set.
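A speaker-wise partition along these lines might look like the sketch below. How speakers were actually assigned to each set is not stated here, so the seeded random shuffle, the function name `split_by_speaker`, and the integer speaker IDs are assumptions.

```python
import random


def split_by_speaker(speaker_ids, n_train=44, n_val=22, n_test=22, seed=0):
    """Assign whole speakers to train/val/test so that no speaker
    appears in more than one partition. Returns row indices per set."""
    speakers = sorted(set(speaker_ids))
    if len(speakers) != n_train + n_val + n_test:
        raise ValueError("partition sizes must sum to the number of speakers")
    random.Random(seed).shuffle(speakers)  # assumed: random assignment
    train = set(speakers[:n_train])
    val = set(speakers[n_train:n_train + n_val])
    parts = {"train": [], "val": [], "test": []}
    for index, speaker in enumerate(speaker_ids):
        if speaker in train:
            parts["train"].append(index)
        elif speaker in val:
            parts["val"].append(index)
        else:
            parts["test"].append(index)
    return parts


# Hypothetical layout: 88 speakers x 10 digits x 10 repetitions = 8,800 rows.
speaker_ids = [s for s in range(88) for _ in range(100)]
parts = split_by_speaker(speaker_ids)
print({name: len(rows) for name, rows in parts.items()})
# {'train': 4400, 'val': 2200, 'test': 2200}
```

Splitting on the speaker ID rather than on individual rows is what keeps one person's repetitions of a digit from landing on both sides of the evaluation.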

Below is the performance (higher is better) of the kNN algorithm on our validation set as a function of the number of neighbors used. We see a maximum at k=2, which is the value we use for the final model.

kNN algorithm performance on validation set
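The sweep over k can be sketched as below. This is an illustrative sketch only: the helper names, the candidate values, and the one-dimensional toy data (including one deliberately mislabeled training point) are assumptions and do not reproduce the curve in the figure.

```python
import math
from collections import Counter


def knn_predict(train_rows, train_labels, query, k):
    """Majority vote among the k nearest training rows (Euclidean)."""
    nearest = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def validation_accuracy(train_rows, train_labels, val_rows, val_labels, k):
    """Fraction of validation rows classified correctly for a given k."""
    hits = sum(
        knn_predict(train_rows, train_labels, row, k) == label
        for row, label in zip(val_rows, val_labels)
    )
    return hits / len(val_rows)


def select_k(train_rows, train_labels, val_rows, val_labels, candidates):
    """Score every candidate k on the validation set and return the best.
    Ties go to the candidate listed first."""
    scores = {
        k: validation_accuracy(train_rows, train_labels, val_rows, val_labels, k)
        for k in candidates
    }
    return max(candidates, key=scores.get), scores


# Toy data: the training point at 5.5 acts as a noisy label.
train_rows = [[0.0], [1.0], [2.0], [5.5], [10.0], [11.0], [12.0]]
train_labels = [0, 0, 0, 7, 7, 7, 7]
val_rows = [[1.5], [10.5], [5.0]]
val_labels = [0, 7, 0]

best, scores = select_k(train_rows, train_labels, val_rows, val_labels, [1, 3, 5])
print(best, scores)
```

In this toy case k=1 is fooled by the noisy point while larger values of k outvote it, which is the kind of trade-off the validation curve is meant to expose.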

Model Performance

Once we have selected the number of neighbors, we add our validation set back into our training set and fit the final model. This allows us to make a final prediction on the test set, which gives us the “real world” performance expectations.

Below is the confusion matrix for our model, which shows that on average 97% of the test data is classified correctly. The one large error is zeroes that get misclassified as sevens 7% of the time.

Confusion Matrix of kNN algorithm for spoken language

This level of accuracy from such a simple analysis is a testament to the robustness of MFCCs as features for language recognition. Having clean data and generating good features are key to model performance.
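For readers who want to build the same kind of summary, a row-normalized confusion matrix and an overall accuracy can be computed along these lines. The function names are assumptions, and the labels below are made up to show the mechanics; they are not the results reported above.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes=10):
    """counts[t][p] = number of samples of true digit t predicted as p."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    return counts


def row_normalize(counts):
    """Turn each row into fractions of that true digit's samples."""
    rates = []
    for row in counts:
        total = sum(row)
        rates.append([c / total if total else 0.0 for c in row])
    return rates


def overall_accuracy(counts):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(counts[i][i] for i in range(len(counts)))
    total = sum(sum(row) for row in counts)
    return correct / total


# Made-up example: one of four zeroes is mistaken for a seven.
true_labels = [0, 0, 0, 0, 7, 7, 7, 7]
predicted = [0, 0, 0, 7, 7, 7, 7, 7]

counts = confusion_matrix(true_labels, predicted)
rates = row_normalize(counts)
print(rates[0][7])               # 0.25
print(overall_accuracy(counts))  # 0.875
```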


Final Thoughts

The next steps for expanding this analysis would be to train the algorithm to identify, transcribe, and even understand full speech. A simple idea would be to start by identifying pauses in speech, guessing the word splits, and applying an analogous supervised classification model to each candidate word.

This type of analysis can start to model human speech patterns and can be used in applications from medicine to the military, from personal assistants to automated phone services.


[1] Mel Frequency Cepstral Coefficient

MFCCs are derived by first taking the Fourier transform of a windowed signal, in this case using a Hamming window. A window function is zero-valued outside of some specific interval, so multiplying the signal by it isolates one short frame at a time. We then map the powers of the spectrum onto the mel scale and take the log of those powers. Finally, we apply a second transform (typically a discrete cosine transform) to the log mel powers and take the resulting amplitudes as our MFCCs. The sampling rate serves as one of the parameters for calculating the MFCCs. Note that this entire procedure typically needs to be performed by the data scientist, but the UCI dataset came already preprocessed.
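To make the pipeline concrete, here is a minimal single-frame sketch in pure Python. It is not the preprocessing that produced the dataset: the frame length of 256 samples, the 26 triangular filters, the common 2595 * log10(1 + f/700) mel formula, and all function names are assumptions chosen for illustration. Only the Hamming window, the 11025 Hz sampling rate, and the 13 coefficients come from the description above.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Apply a Hamming window, then return |DFT|^2 for bins 0..N/2."""
    n = len(frame)
    windowed = [
        x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]
    powers = []
    for k in range(n // 2 + 1):
        total = sum(
            w * cmath.exp(-2j * math.pi * k * i / n)
            for i, w in enumerate(windowed)
        )
        powers.append(abs(total) ** 2)
    return powers


def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    # Express each edge as a (fractional) position on the DFT bin axis.
    edges = [hz * (n_bins - 1) / nyquist for hz in edges_hz]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        weights = []
        for b in range(n_bins):
            if lo < b <= mid:
                weights.append((b - lo) / (mid - lo))
            elif mid < b < hi:
                weights.append((hi - b) / (hi - mid))
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def mfcc(frame, sample_rate=11025, n_filters=26, n_coeffs=13):
    powers = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(powers), sample_rate)
    # Log of the mel-band energies; the small floor avoids log(0).
    log_energies = [
        math.log(max(sum(w * p for w, p in zip(weights, powers)), 1e-10))
        for weights in bank
    ]
    # DCT-II of the log energies; keep the first n_coeffs amplitudes.
    return [
        sum(
            e * math.cos(math.pi * c * (m + 0.5) / n_filters)
            for m, e in enumerate(log_energies)
        )
        for c in range(n_coeffs)
    ]


# Synthetic 1 kHz tone standing in for one frame of speech.
frame = [math.sin(2.0 * math.pi * 1000.0 * i / 11025) for i in range(256)]
coeffs = mfcc(frame)
print(len(coeffs))  # 13
```

A real pipeline would slide this over overlapping frames and would normally rely on an FFT rather than the direct DFT used here for readability.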