Authors: Sachin Chachada and C.-C. Jay Kuo

Research on Environmental Sound Recognition (ESR) has increased significantly in the past decade. With a growing demand for example-based search, such as content-based image and video search, ESR can be instrumental in efficient audio search applications. ESR can also be useful for automatic tagging of audio files with descriptors for keyword-based audio retrieval, robot navigation, home-monitoring systems, surveillance, recognition of animal and bird species, etc. Among the various types of audio signals, speech and music are the two categories that have been studied most extensively. In their infancy, ESR algorithms were a mere reflection of speech and music recognition paradigms. However, on account of the considerably non-stationary characteristics of environmental sounds, these algorithms proved ineffective for large-scale databases. Recent publications have seen an influx of substantial new features and algorithms catering to ESR. However, the problem largely remains unsolved.

Owing to the non-stationary characteristics of environmental sounds, recent works have focused on time-frequency features [1-3]. Efforts have also been made to incorporate non-linear classifiers for ESR [4]. A comprehensive coverage of recent developments can be found in [5]. These recently developed features perform well on sounds that exhibit non-stationarity but must compete with conventional features such as Mel-Frequency Cepstral Coefficients (MFCC) for other sounds. A set of features that combines the simplicity of stationary methods with the accuracy of non-stationary methods remains elusive. Moreover, considering the numerous types of environmental sounds, it is hard to conceive of a single set of features suitable for all sounds. Another problem with using a single set of features is that different features require different processing schemes; hence, several meaningful combinations of features that would otherwise be functionally complementary are incompatible in practice. [...]
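As a concrete point of reference, the conventional MFCC pipeline mentioned above can be sketched as follows. This is a minimal illustration with assumed parameter values (16 kHz sampling, 25 ms frames with a 10 ms hop, 26 mel filters, 13 coefficients), not an implementation from the survey; the key point is that each frame is treated as locally stationary, which is exactly the assumption that non-stationary environmental sounds violate.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel-spaced filterbank over the FFT bins."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):           # rising slope
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):          # falling slope
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(signal, sr=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_ceps=13):
    """Frame -> window -> power spectrum -> mel energies -> log -> DCT."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    fbank = mel_filterbank(n_filters, n_fft, sr)
    energies = np.maximum(power @ fbank.T, 1e-10)  # floor to avoid log(0)
    return dct(np.log(energies), type=2, axis=1, norm='ortho')[:, :n_ceps]
```

For example, one second of audio at 16 kHz yields 98 frames of 13 coefficients each; the short-time analysis window is what makes MFCC a "stationary" feature in the sense used above.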