Introduction
This was my first project and my first real encounter with Machine Learning. At the time I did not fully understand how the models actually worked, but the project gave me the chance to experiment and got me excited about the field.
In this project I trained and tested a simple Neural Network for speech recognition. The dataset is composed of 1600 training samples, 200 for each word (“down”, “go”,…), and 109 validation samples.
To extract the features used for training, the audio was processed and reshaped. Sound is a sequence of vibrations, and we can represent it as a spectrogram: a graph with time on the X axis and frequency on the Y axis. It is created by applying the Fourier transform to a small window of the audio, which converts that window from the time domain to the frequency domain. In my case I used a window of 1 second. The Mel scale, which is a transformation of the frequency scale, is then applied to the spectrogram. It spaces frequencies so that equal distances on the scale correspond to roughly equal perceived differences in pitch, from low to high frequencies. Taking the logarithm of the Mel spectrogram gives the log-Mel spectrum. For the data in this project, this produces a two-dimensional representation of shape [128,32], which is resized into a [32,32] matrix and then flattened into a vector of 1024 features.
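The last steps of this pipeline (Mel mapping, log compression, resizing and flattening) can be sketched as below. This is only a minimal illustration using the Python standard library, not the project's actual code. The Mel formula is the common HTK-style mapping, the resize is assumed to average groups of neighbouring Mel bands (the method really used is not stated here), the input is dummy data, and all function names are hypothetical.

```python
import math

def hz_to_mel(freq_hz):
    """HTK-style Mel mapping: near-linear at low frequencies, logarithmic higher up."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def log_compress(mel_spec, eps=1e-10):
    """Element-wise natural log of a Mel power spectrogram (eps avoids log(0))."""
    return [[math.log(value + eps) for value in row] for row in mel_spec]

def resize_rows(matrix, out_rows):
    """Shrink the first axis by averaging equal-sized groups of rows.

    Assumed resize method; requires len(matrix) to be a multiple of out_rows.
    """
    group = len(matrix) // out_rows
    n_cols = len(matrix[0])
    resized = []
    for r in range(out_rows):
        block = matrix[r * group:(r + 1) * group]
        resized.append([sum(row[c] for row in block) / group for c in range(n_cols)])
    return resized

def flatten(matrix):
    """Concatenate the rows of a 2-D list into one feature vector."""
    return [value for row in matrix for value in row]

# Dummy Mel power spectrogram: 128 Mel bands x 32 time frames (not real audio).
mel_spec = [[1.0 + band + frame for frame in range(32)] for band in range(128)]

log_mel = log_compress(mel_spec)      # shape [128, 32]
small = resize_rows(log_mel, 32)      # shape [32, 32]
features = flatten(small)             # 32 * 32 values
print(len(features))                  # 1024
```

By construction, `hz_to_mel(1000.0)` comes out close to 1000, since the Mel scale is anchored so that 1000 Hz corresponds to about 1000 mel. In practice an audio library would normally compute the spectrogram and the Mel filter bank; the sketch only makes the shapes [128,32], [32,32] and 1024 concrete.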
Model
All the effort in this project went into a manual search over the hyperparameters, trying many different configurations. It was essentially a brute-force approach, aimed at finding the parameters that give the best final accuracy. All the graphs and details are in the report.
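A brute-force search of this kind can be sketched as a loop over every combination of candidate values. The hyperparameter names, the values in the grid and the `toy_evaluate` function below are all hypothetical placeholders. In the real project the evaluation step would train the network and return its validation accuracy, and the numbers here are not results from the report.

```python
from itertools import product

def grid_search(grid, evaluate):
    """Try every combination in `grid` and return the best (config, score)."""
    names = list(grid)
    best_config, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(config)
        # Strict comparison: on ties, the first configuration found is kept.
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

# Hypothetical search space; the values are illustrative only.
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "hidden_units": [32, 64, 128],
    "batch_size": [16, 32],
}

def toy_evaluate(config):
    """Placeholder for 'train the network and return validation accuracy'."""
    return (-abs(config["learning_rate"] - 0.01)
            - abs(config["hidden_units"] - 64) / 1000)

best_config, best_score = grid_search(grid, toy_evaluate)
print(best_config)
```

The number of runs grows multiplicatively with each hyperparameter (3 x 3 x 2 = 18 here), which is why an exhaustive search only stays practical for small networks and small grids.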