Author: Sachin Chachada and C.-C. Jay Kuo

Research Problem

Research on Environmental Sound Recognition (ESR) has grown significantly in the past decade. With the growing demand for example-based search, such as content-based image and video search, ESR can be instrumental in efficient audio search applications. ESR is also useful for automatic tagging of audio files with descriptors for keyword-based audio retrieval, robot navigation, home-monitoring systems, surveillance, recognition of animal and bird species, and more. Among the various types of audio signals, speech and music are the two categories that have been studied most extensively. In their infancy, ESR algorithms were a mere reflection of speech and music recognition paradigms. However, because environmental sounds are considerably non-stationary, these algorithms proved ineffective on large-scale databases. Recent publications have introduced a wealth of new features and algorithms tailored to ESR, yet the problem remains largely unsolved.

Main Ideas

Owing to the non-stationary characteristics of environmental sounds, recent works have focused on time-frequency features [1-3]. Efforts have also been made to incorporate non-linear classifiers for ESR [4]. A comprehensive coverage of recent developments can be found in [5]. These recently developed features perform well on sounds that exhibit non-stationarity, but for other sounds they must compete with conventional features such as Mel-Frequency Cepstral Coefficients (MFCC). A set of features combining the simplicity of stationary methods with the accuracy of non-stationary methods remains elusive. Moreover, given the numerous types of environmental sounds, it is hard to imagine a single feature set suitable for all of them. Another problem with using a single feature set is that different features require different processing schemes; consequently, many combinations of features that would otherwise be functionally complementary are incompatible in practice. This line of reasoning, together with the success stories of [6-7], leads us directly to ensemble learning methods. Instead of training one classifier on a single feature set, we can use multiple classifiers (experts) targeting different aspects of the signal characteristics through a set of complementary features. Unfortunately, there is no single best way to design an ensemble framework. Hence, in this work, we focus on developing a robust ensemble model that outperforms any single algorithm and, at the same time, scales to large datasets.
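As an illustrative sketch only (with a deliberately simple nearest-centroid expert and synthetic feature sets standing in for, say, MFCC and time-frequency representations; this is not the classifier design we ultimately propose), the expert-per-feature-set idea with majority voting can be written as:

```python
import numpy as np

class NearestCentroidExpert:
    """Toy expert: classifies by distance to per-class feature centroids."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Distance from each sample to each class centroid.
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[d.argmin(axis=1)]

def ensemble_predict(experts, feature_sets):
    """Majority vote over experts, each seeing its own feature representation."""
    votes = np.stack([e.predict(F) for e, F in zip(experts, feature_sets)])
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

The point of the structure is that each expert sees only the representation it was designed for, so feature sets with incompatible processing chains never have to be merged into a single vector before classification.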


There are very few publications that use ensemble-learning-based models to solve the ESR problem, and they are somewhat ad hoc in their approach. For example, Wang et al. [8] propose a hybrid SVM/KNN classifier without motivating the choice of the two experts or exploring other options. In this work, we focus not only on solving the ESR problem, but also on the process of designing an efficient ensemble framework. To our knowledge, no similar prior work exists.
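For concreteness, one common way to build such a hybrid (a sketch only; [8] should be consulted for the exact scheme, and the simple centroid-based linear rule below merely stands in for a trained SVM) is to fall back to a KNN expert whenever the first classifier's decision margin is small:

```python
import numpy as np

def knn1_predict(Xtr, ytr, X):
    """Labels from the single nearest training sample."""
    d = np.linalg.norm(X[:, None, :] - Xtr[None, :, :], axis=2)
    return ytr[d.argmin(axis=1)]

class HybridMarginKNN:
    """Linear rule (stand-in for an SVM) with a 1-NN fallback on low-margin samples."""
    def __init__(self, margin=0.5):
        self.margin = margin

    def fit(self, X, y):
        self.Xtr, self.ytr = X, y
        m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
        # Hyperplane through the midpoint of the two class means.
        self.w = m1 - m0
        self.b = -self.w @ (m0 + m1) / 2
        return self

    def predict(self, X):
        score = X @ self.w + self.b
        pred = (score > 0).astype(int)
        unsure = np.abs(score) < self.margin
        if unsure.any():
            # Defer low-confidence decisions to the nearest-neighbour expert.
            pred[unsure] = knn1_predict(self.Xtr, self.ytr, X[unsure])
        return pred
```

The design choice being illustrated is the division of labour: the global (linear) expert handles confident regions cheaply, while the local (nearest-neighbour) expert is consulted only near the decision boundary.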


  • [1] S. Chu, S. Narayanan, and C.-C. J. Kuo, “Environmental sound recognition with time–frequency audio features,” IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 6, pp. 1142–1158, 2009.
  • [2] B. Ghoraani and S. Krishnan, “Time–frequency matrix feature extraction and classification of environmental audio signals,” IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 7, pp. 2197–2209, 2011.
  • [3] P. Khunarsal, C. Lursinsap, and T. Raicharoen, “Very short time environmental sound classification based on spectrogram pattern matching,” 2013, in press.
  • [4] B. Ghoraani and S. Krishnan, “Discriminant non-stationary signal features’ clustering using hard and fuzzy cluster labeling,” EURASIP Journal on Advances in Signal Processing, vol. 2012, no. 1, p. 250, 2012.
  • [5] S. Chachada and C.-C. J. Kuo, “Environmental Sound Recognition: A Survey”, Asia Pacific Signal and Information Processing Association Annual Summit and Conference, 2013 (accepted).
  • [6] R. M. Bell, Y. Koren, and C. Volinsky, “The BellKor solution to the Netflix Prize,” KorBell Team's Report to Netflix, 2007.
  • [7] A. Töscher, M. Jahrer, and R. M. Bell, “The BigChaos solution to the Netflix Grand Prize,” Netflix Prize documentation, 2009.
  • [8] J.-C. Wang, J.-F. Wang, K. W. He, and C.-S. Hsu, “Environmental sound classification using hybrid SVM/KNN classifier and MPEG-7 audio low-level descriptor,” in Proc. Int. Joint Conf. Neural Networks (IJCNN), IEEE, 2006, pp. 1731–1735.