Comparison Of Time-Frequency Representations for Environmental Sound Classification Using Convolutional Neural Networks
This essay reviews the paper by Muhammad Huzaifah, "Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks". The paper motivates the importance of environmental sound classification and shows how it can be performed using convolutional neural networks (CNNs).
Although CNNs are better suited to image and video applications, recent advances in transforming 1-D audio into 2-D spectrogram images have enabled the use of CNNs for audio processing. The author applies several signal-processing methods, namely the short-time Fourier transform (STFT) with linear and Mel scales, the constant-Q transform (CQT), and the continuous wavelet transform (CWT), to observe their impact on classification performance on environmental sound datasets.
The paper supports the hypothesis that the features vital for sound classification depend strongly on the choice of time-frequency representation. Moreover, the appropriate STFT window length depends on the characteristics of the audio signal, and 2-D convolution performs better than 1-D convolution.
Since the author uses CNNs for classification, conventional features such as Mel-frequency cepstral coefficients (MFCCs) or Perceptual Linear Prediction (PLP) coefficients, previously the basic building blocks of Gaussian mixture model (GMM)-based Hidden Markov Models (HMMs), become largely redundant. Those features were engineered to produce de-correlated inputs for GMMs, whereas deep learning models can learn useful feature maps directly from less-processed time-frequency representations.
This paper builds on previous studies in which the performance of the short-time Fourier transform (STFT), the fast wavelet transform (FWT), and the continuous wavelet transform (CWT) was compared using the conventional machine learning techniques mentioned above, and it dives deeper into the specifics of a CNN model.
The author applied the STFT on both linear and Mel scales, the CQT, and the CWT to assess the impact of each representation, relative to baseline MFCC features, on two publicly available environmental sound datasets (ESC-50, UrbanSound8K), as measured by the classification performance of several CNN variants.
The ESC-50 and UrbanSound8K datasets are collections of short environmental recordings spanning distinct classes such as animal sounds, human non-speech sounds, car horns, and drilling. In the pre-processing part of the experiment, four time-frequency representations were extracted in addition to the MFCCs: the linear-scaled STFT spectrogram, the Mel-scaled STFT spectrogram, the CQT spectrogram, and the CWT scalogram, with the MFCC cepstrogram serving as the baseline.
The procedure for the other transforms was similar. The CQT can be thought of as a series of logarithmically spaced filters f_k, with the k-th filter having a spectral width δf_k equal to a multiple of the previous filter's width:

δf_k = 2^(1/n) · δf_(k−1) = (2^(1/n))^k · δf_min

where δf_k is the bandwidth of the k-th filter, f_min is the central frequency of the lowest filter, and n is the number of filters per octave.
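As a quick numerical illustration (not from the paper), the constant-Q relation above can be checked for hypothetical values of f_min and n: centre frequencies spaced as f_k = f_min · 2^(k/n) give bandwidths that each grow by a factor of 2^(1/n), so the quality factor Q = f_k / δf_k is the same for every filter.

```python
f_min = 32.7   # hypothetical centre frequency of the lowest filter (Hz)
n = 12         # hypothetical number of filters per octave
K = 48         # hypothetical total number of filters

# Logarithmically spaced centre frequencies: f_k = f_min * 2^(k/n).
freqs = [f_min * 2 ** (k / n) for k in range(K)]
# Bandwidth of each filter: delta_f_k = f_k * (2^(1/n) - 1).
widths = [f * (2 ** (1 / n) - 1) for f in freqs]

# Constant-Q property: f_k / delta_f_k is identical for all filters,
# and each bandwidth is 2^(1/n) times the previous one.
Q = freqs[0] / widths[0]
assert all(abs(f / w - Q) < 1e-9 for f, w in zip(freqs, widths))
assert all(abs(widths[k] / widths[k - 1] - 2 ** (1 / n)) < 1e-9
           for k in range(1, K))
print(round(Q, 3))
```

This geometric spacing is what gives the CQT its logarithmic frequency axis, in contrast to the linear bins of the STFT.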
As with the STFT, wideband and narrowband versions of the CQT were extracted. Rather than decomposing the signal into sinusoids, the CWT was specified with 256 frequency bins and a Morlet mother wavelet, a choice used in previous audio recognition studies. Finally, MFCCs were computed and arranged as a cepstrogram, and the coefficients were normalized without taking the logarithm.
To keep the input feature maps consistent, all images were further downscaled with PIL using Lanczos resampling, which also helped achieve higher processing speeds. The Python libraries librosa and PyWavelets were used for audio processing. On the neural network side, two types of convolutional filters were considered: a 3×3 square filter and an M×3 rectangular filter that implements 1-D convolution over time.
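The downscaling step could look like the following Pillow sketch (the input and target sizes here are assumptions for illustration, not the paper's exact values):

```python
import numpy as np
from PIL import Image

# Pretend this is a spectrogram already scaled to 8-bit grey values.
spec = (np.random.rand(513, 431) * 255).astype(np.uint8)

img = Image.fromarray(spec)
# Lanczos resampling preserves detail better than nearest-neighbour or
# bilinear filtering when shrinking; the smaller fixed-size input also
# speeds up CNN training.
small = img.resize((128, 128), Image.LANCZOS)

feature_map = np.asarray(small)
print(feature_map.shape)
```

Resizing every representation to the same shape is what lets one network architecture be trained on all of them interchangeably.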
The convolutional layers were interleaved with rectified linear unit (ReLU) activations and max pooling layers. Since overfitting hinders model performance, dropout was applied during training after the first convolutional layer and after the fully connected layers.
Training was performed using Adam optimization with a batch size of 100 and cross-entropy as the loss function. Models were trained for 200 epochs on ESC-50 and 100 epochs on UrbanSound8K. The order of samples in the training and test sets was randomly shuffled after each training epoch. The network was implemented in Python with TensorFlow.
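Putting the architectural and training details together, a Keras sketch of such a CNN might look like the following. The layer widths, dropout rates, and input shape are assumptions for illustration; only the ingredients (3×3 convolutions, ReLU, max pooling, dropout after the first convolutional and the fully connected layers, Adam, cross-entropy) come from the text.

```python
import tensorflow as tf

def build_cnn(input_shape=(128, 128, 1), n_classes=50):
    """Illustrative CNN in the spirit of the paper; sizes are hypothetical."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        # 3x3 square filters with ReLU, as one of the two filter types studied.
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Dropout(0.25),  # dropout after the first conv layer
        tf.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dropout(0.5),   # dropout after the fully connected layer
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    # Adam optimizer and cross-entropy loss, as described in the paper;
    # training would use batch_size=100 in model.fit(...).
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn()
print(model.output_shape)
```

An M×3 rectangular filter spanning the full frequency axis in the first layer would instead reduce the convolution to 1-D over time, which is the variant the paper found to perform worse.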