Pre-trained Deep Neural Network using Sparse Autoencoders and Scattering Wavelet Transform for Musical Genre Recognition

Mariusz Kleć, Danijel Koržinek


The research described in this paper combines Deep Neural Networks (DNN) with novel audio features extracted using the Scattering Wavelet Transform (SWT) to classify musical genres. The SWT uses a sequence of Wavelet Transforms to compute modulation spectrum coefficients of multiple orders, which have already been shown to be promising for this task. The layers of the DNN in this work are pre-trained using Sparse Autoencoders (SAE). Data obtained from the Creative Commons website is used to supplement the well-known GTZAN database, which is a standard benchmark for this task. The final classifier is tested using 10-fold cross-validation and achieves results similar to other state-of-the-art approaches.


Sparse Autoencoders; deep learning; genre recognition; Scattering Wavelet Transform
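
As an illustration of the sparsity idea behind SAE pre-training, the following minimal sketch computes a KL-divergence sparsity penalty of the kind described in Andrew Ng's CS294A lecture notes on sparse autoencoders. The abstract does not state which penalty the authors used, so the KL form, the function name kl_sparsity_penalty, and the target sparsity rho=0.05 are assumptions made for illustration only.

```python
import math


def kl_sparsity_penalty(mean_activations, rho=0.05, eps=1e-12):
    """Sum of KL(rho || rho_hat_j) over hidden units.

    mean_activations: average activation rho_hat_j of each hidden unit,
        taken over the training examples (values in the open interval (0, 1)).
    rho: target average activation (hypothetical value, not from the paper).
    eps: small constant that keeps the logarithms finite.
    """
    total = 0.0
    for rho_hat in mean_activations:
        # Clamp to avoid log(0) when a unit is always off or always on.
        rho_hat = min(max(rho_hat, eps), 1.0 - eps)
        total += rho * math.log(rho / rho_hat)
        total += (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat))
    return total


if __name__ == "__main__":
    # Units whose mean activation matches the target incur no penalty;
    # units that are active too often are penalised.
    print(kl_sparsity_penalty([0.05, 0.05]))
    print(kl_sparsity_penalty([0.5, 0.3]))
```

In a sparse autoencoder this term would typically be added, with a weighting factor, to the reconstruction error, so that each hidden unit is encouraged to stay inactive for most inputs.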





