Structure
- Introduction
- Data Processing
- Augmentation Techniques
- Model Development and Evaluation
- Conclusion
- Future Work
Introduction
“Eminem makes pop music” were his last words in our friend group. But what exactly makes a pop song different from a hip-hop song, and a hip-hop song different from a rock song?
A genre is a classification that groups songs based on shared features like structure, instrumentation, vibe, lyrics, and cultural context. Genres may seem obvious to us, but what information about genre could we actually extract from a mere audio file? It comes down to the specific features of the song. What exactly are these features, and how are they represented statistically? Let’s find out.
Data Processing
Data
The GTZAN dataset comprises 10 genres, with 100 audio files per genre. It is commonly used for training and evaluating music genre classification models. MFCCs, zero-crossing rates, and other spectral features are extracted from these files and stored.
MFCCs and Mel spectrograms
Humans perceive sound logarithmically, which explains why notes sung an octave apart sound similar even though their frequencies double. The piano notes A1, A2, and A3 (55 Hz, 110 Hz, and 220 Hz) sound alike despite each frequency being double the previous one.
Mel spectrograms are visual representations of the time-varying frequency content of an audio file, mapped onto the Mel scale, which reflects human auditory perception better than a linear frequency scale. In a Mel spectrogram, the x-axis represents time and the y-axis represents frequency on the Mel scale, allowing us to observe how pitch content evolves.
They are calculated by applying a short-time Fourier transform (STFT), i.e. Fast Fourier Transforms (FFTs) over short overlapping windows, which converts the time-domain signal into the frequency domain, and then mapping the resulting spectrum onto the Mel scale with a bank of Mel filters.
MFCCs (Mel-frequency cepstral coefficients) are derived from the Mel spectrogram by taking its logarithm and applying a discrete cosine transform (DCT), which compresses the data into a smaller set of coefficients that capture the most relevant audio features.
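As a rough sketch of how this looks in code with the Librosa library (the file path and parameter values such as n_mfcc=13 are illustrative, not the project’s exact settings):

```python
import librosa
import numpy as np

# Illustrative path to one GTZAN clip
audio_path = "genres/blues/blues.00000.wav"

# Load ~30 s of audio at librosa's default sampling rate (22050 Hz)
y, sr = librosa.load(audio_path, duration=30)

# Mel spectrogram: short-time FFTs -> Mel filterbank -> decibel scaling
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

# MFCCs: log-Mel spectrogram followed by a discrete cosine transform (DCT)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print(mel_db.shape)  # (n_mels, n_frames)
print(mfcc.shape)    # (n_mfcc, n_frames)
```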
Other Features
More features can be extracted from the audio data to improve accuracy, and the Librosa library makes this task easy. A few other useful features are listed below (a short extraction sketch follows the list):
- Zero-Crossing Rate: Measures the rate at which the audio signal crosses the zero-amplitude line.
- Spectral Centroid: Represents the “centre of mass” of the spectrum, often associated with the brightness of the sound.
- Chroma Features: Provide insights into the harmonic content of the audio regardless of its octave.
- Spectral Rolloff: Measures the frequency below which a specified percentage of the total spectral energy is contained.
- Spectral Contrast: Indicates the difference in amplitude between peaks and valleys in the sound spectrum.
- Tonal Centroid Features (Tonnetz): Capture the harmonic relationships between the tones present in the audio.
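As referenced above, here is a minimal sketch of extracting these features with Librosa; averaging each feature over time is an illustrative aggregation choice, not necessarily how the project stores them:

```python
import librosa
import numpy as np

def extract_features(y, sr):
    """Return a dictionary of features, each averaged over time frames."""
    return {
        "zero_crossing_rate": np.mean(librosa.feature.zero_crossing_rate(y)),
        "spectral_centroid": np.mean(librosa.feature.spectral_centroid(y=y, sr=sr)),
        "chroma": np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1),
        "spectral_rolloff": np.mean(librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.85)),
        "spectral_contrast": np.mean(librosa.feature.spectral_contrast(y=y, sr=sr), axis=1),
        "tonnetz": np.mean(librosa.feature.tonnetz(y=y, sr=sr), axis=1),
    }

# Usage: features = extract_features(*librosa.load("genres/blues/blues.00000.wav", duration=30))
```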
Augmentation Techniques
Data augmentation refers to modifying existing data to increase the size of our dataset. It is particularly useful for making the model robust and addressing overfitting (note that it does not solve underfitting, as providing more data to a model that cannot fit the training data does not help). Audio augmentation techniques include the following:
- Add Gaussian Noise: Adds random noise to the audio signal.
- Time Stretch: Alters the speed of the audio.
- Pitch Shift: Modifies the pitch of the audio by a given number of semitones.
- Shift: Shifts the audio along the time axis.
- Gain: Adjusts the volume of the audio by a specified amount in decibels.
The Audiomentations library makes it easy to compose many different augmented files from a single one by randomly varying the parameters of these techniques.
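A minimal sketch of such a pipeline is shown below; the probability values and ranges follow common Audiomentations examples and defaults rather than the project’s actual configuration:

```python
import librosa
from audiomentations import AddGaussianNoise, Compose, Gain, PitchShift, Shift, TimeStretch

# Each transform fires with probability p, and its parameters are sampled randomly,
# so applying the pipeline repeatedly to the same clip yields different outputs.
augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    PitchShift(min_semitones=-4, max_semitones=4, p=0.5),
    Shift(p=0.5),  # shift along the time axis (library defaults)
    Gain(p=0.5),   # volume change in decibels (library defaults)
])

y, sr = librosa.load("genres/blues/blues.00000.wav", duration=30)  # illustrative path
augmented_y = augment(samples=y, sample_rate=sr)
```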
Model Development and Evaluation
Note – The project is uploaded on my GitHub, where you can find details about the model architecture, training process, optimizer, number of epochs, etc.
ANNs
I tried different ANN models by varying the use of augmentation techniques and including other features like spectral centroid along with MFCCs.
Note – Augmented data must not be included in the testing and validation sets. Likewise, augmented copies of test-set audio must not end up in the training data, or else the model would effectively be trained on the test set. Thus, the test data was set aside first, and only the remaining data was augmented to form the training set.
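A sketch of this workflow, assuming file_paths and labels hold the GTZAN audio paths and genre labels, and reusing the augment pipeline from the earlier sketch:

```python
import librosa
import numpy as np
from sklearn.model_selection import train_test_split

# Split file paths first, so no augmented copy of a test clip can leak into training
train_files, test_files, train_labels, test_labels = train_test_split(
    file_paths, labels, test_size=0.2, stratify=labels, random_state=42)

def mfcc_vector(y, sr, n_mfcc=13):
    # Time-averaged MFCCs (an illustrative feature representation)
    return np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc), axis=1)

X_train, y_train = [], []
for path, label in zip(train_files, train_labels):
    audio, sr = librosa.load(path, duration=30)
    # Keep the original clip plus one augmented copy; only training data is augmented
    for clip in (audio, augment(samples=audio, sample_rate=sr)):
        X_train.append(mfcc_vector(clip, sr))
        y_train.append(label)

# The test set is built from the original, unaugmented audio only
X_test = [mfcc_vector(*librosa.load(path, duration=30)) for path in test_files]
y_test = list(test_labels)
```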
However, augmenting such a small dataset did not noticeably improve model accuracy. Adding other features alongside the MFCCs improved accuracy by about 1%, but we ignore that for this project. With hyperparameter tuning, the ANN reached an accuracy of 68.5% at best.
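For illustration, a minimal Keras ANN of this kind could look as follows; the input representation, layer sizes, and dropout rate are assumptions, and the actual architecture lives in the GitHub repository:

```python
import tensorflow as tf

num_classes = 10  # GTZAN genres

# Illustrative dense network over a fixed-length MFCC feature vector
ann = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(13,)),  # e.g. 13 time-averaged MFCCs (assumed input)
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
ann.compile(optimizer="adam",
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
```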
The disadvantage of using an ANN is that it does not capture musical dynamics and local patterns effectively, which can be crucial for image-like data such as MFCC matrices.
CNNs
The limitations of ANNs are overcome by using Convolutional Neural Networks (CNNs), which excel at extracting features from images, including visual representations like MFCCs.
The GTZAN dataset provides pre-extracted MFCCs, but the model performed poorly on this resized, blurry data. To address this, MFCCs were extracted manually and stored, leading to a significant improvement. After proper hyperparameter tuning, the CNN model achieved a maximum accuracy of 75.6%.
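A rough sketch of such a CNN in Keras; the (time_frames, n_mfcc, 1) input shape, filter counts, and pooling layers are illustrative assumptions rather than the project’s exact values:

```python
import tensorflow as tf

cnn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(130, 13, 1)),  # MFCC matrix treated as a 1-channel image
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
cnn.compile(optimizer="adam",
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
```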
However, a disadvantage of using CNNs is their inability to capture long-term dependencies, which might be important for classifying music genres accurately.
LSTMs
LSTMs are better at capturing sequential data, like the kind found in music. Thus, to capture patterns across time, I decided to implement an LSTM model. The architecture was simple: 2 LSTM layers, followed by 2 dense layers, and finally a softmax layer for classification, achieving an accuracy of 83.1%. Not bad for such a straightforward setup.
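A Keras sketch of the described layout (2 LSTM layers, 2 dense layers, softmax output); the unit counts and the (time_frames, n_mfcc) input shape are assumptions:

```python
import tensorflow as tf

lstm_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(130, 13)),           # sequence of MFCC frames over time
    tf.keras.layers.LSTM(64, return_sequences=True),  # pass the full sequence to the next LSTM
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
lstm_model.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
```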
However, this took the MFCCs as input directly. While the LSTMs were able to capture sequential dependencies, the model might have performed even better if I had first used CNN layers to pick up on the local patterns within the MFCCs.
CNN-LSTM Hybrid Model
The motivation to use a hybrid approach was to first identify local patterns and dependencies using CNN layers and then use LSTM layers to remember long-term dependencies better.
To implement this, I added 3 convolutional layers followed by a flattening layer to prepare the data for the LSTMs. After that, 2 LSTM layers and a dense layer were used. The hybrid model’s accuracy came in at around 89.3%, the highest among the models I tried.
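One way to realize this pipeline in Keras is sketched below; the pooling layers, the Reshape standing in for the flattening step (so the LSTMs still receive a sequence), and all layer sizes are illustrative rather than the exact configuration from the repo:

```python
import tensorflow as tf

hybrid = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(130, 13, 1)),  # (time_frames, n_mfcc, 1)
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    # Flatten the frequency and channel axes per (downsampled) time step
    tf.keras.layers.Reshape((32, 3 * 64)),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(10, activation="softmax"),
])
hybrid.compile(optimizer="adam",
               loss="sparse_categorical_crossentropy",
               metrics=["accuracy"])
```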
Conclusion
In this project, we explored various approaches to music genre classification using the GTZAN dataset, focusing on MFCCs as the primary features. Starting with traditional ANNs, we gradually moved to more advanced architectures like CNNs and LSTMs, which helped us capture different aspects of the audio data. While ANNs offered a baseline accuracy of around 68.5%, CNNs improved this significantly to 75.6% by extracting local patterns from the MFCCs. Finally, the LSTM and CNN-LSTM hybrid models pushed the accuracy further to 83.1% and 89.3% respectively, demonstrating the importance of capturing both sequential information and local patterns.
Future Work
To further improve the model’s accuracy and robustness, several strategies can be employed. One promising direction is to experiment with Parallel Recurrent Convolutional Neural Networks (PRCNNs), which could combine both convolutional and recurrent layers in parallel, allowing the model to capture both local and sequential patterns simultaneously. This could help reduce the limitations of the sequential pipeline and improve performance. Attention mechanisms could also be integrated, enabling the model to focus on the most relevant parts of the audio sequence, improving its understanding of long-term dependencies in the audio signal. Combining CNNs or GRUs with attention-based mechanisms in a hybrid model might offer an even more powerful architecture for genre classification.