Learning audio and image representations with bio-inspired trainable feature extractors
From a very young age, we can quickly learn new concepts and distinguish between different kinds of objects or sounds. Once we have seen a single object or heard a particular sound, we are able to recognize that sample, or even different versions of it, in other scenarios. For example, someone who sees an iron chair and associates it with the general concept of “chairs” will also be able to detect and recognize wooden or wicker chairs. Similarly, when we hear the sound of a particular event, such as a scream, we are then able to recognize other kinds of scream that occur in different environments. We continuously learn representations of the real world, which we then use to understand new and changing environments.
In the field of pattern recognition, traditional methods typically require a careful design of data representations (i.e. features), which involves considerable domain knowledge and effort by experts. Recently, approaches for the automated learning of representations from training data have been introduced, based on popular deep learning techniques and convolutional neural networks (CNNs). Representation learning aims to avoid the engineering of hand-crafted features and to provide automatically learned features suitable for the recognition task. In this work, we proposed novel trainable filters for representation learning in audio and image processing. The structure of these filters is not fixed in the implementation but rather configured directly from single prototype patterns of interest [4].
In the context of audio processing, we focused on the problem of audio event detection and classification in noisy environments, including cases where the signal-to-noise ratio (SNR) is zero or negative. We released two data sets, namely the MIVIA audio events and the MIVIA road events data sets, and obtained baseline results (recognition rate of about 85%) with a real-time method for event detection based on the bag-of-features classification scheme [3, 2].
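To make the notion of zero or negative SNR concrete, the following minimal sketch uses the standard decibel definition, SNR = 10 log10(P_signal / P_noise). The function name `snr_db` and the power values are illustrative assumptions, not taken from the released data sets.

```python
import math


def snr_db(signal_power: float, noise_power: float) -> float:
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    if signal_power <= 0 or noise_power <= 0:
        raise ValueError("powers must be positive")
    return 10.0 * math.log10(signal_power / noise_power)


# Equal power: the event is exactly as strong as the background (0 dB).
print(snr_db(1.0, 1.0))
# Background four times stronger than the event: the SNR is negative.
print(snr_db(1.0, 4.0))
```

An SNR of 0 dB therefore means that the event of interest and the background noise carry the same power, while a negative value means that the background dominates.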
We designed novel trainable feature extractors, which we call COPE (Combination of Peaks of Energy), that are able to detect specific constellations of energy peak points in time-frequency representations of input audio signals [8]. The particular constellation of energy peaks to be detected by a COPE feature extractor is determined in an automatic configuration process performed on a given prototype sound. The design of COPE feature extractors was inspired by some functions of the cochlea membrane and the inner hair cells in the inner auditory system, which convert the sound pressure waves into neural stimuli on the auditory nerve.
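The idea of configuring a detector from the energy peaks of one prototype sound can be illustrated with a simplified sketch. It is not the released implementation: the peak-picking rule, the tolerance window, the use of a geometric mean to combine the peaks, and all names (`find_peaks_2d`, `configure`, `response`) are assumptions made for illustration, and the time-frequency map is a toy array rather than a real spectrogram or cochleagram.

```python
import numpy as np


def find_peaks_2d(E, rel_threshold=0.5):
    """Return (freq, time) indices of strict local maxima above rel_threshold * max(E)."""
    E = np.asarray(E, dtype=float)
    padded = np.pad(E, 1, mode="constant", constant_values=-np.inf)
    is_peak = np.ones(E.shape, dtype=bool)
    for df in (-1, 0, 1):
        for dt in (-1, 0, 1):
            if df == 0 and dt == 0:
                continue
            neighbour = padded[1 + df:1 + df + E.shape[0], 1 + dt:1 + dt + E.shape[1]]
            is_peak &= E > neighbour
    is_peak &= E >= rel_threshold * E.max()
    return np.argwhere(is_peak)


def configure(E_proto, t_ref, rel_threshold=0.5):
    """Store each prototype peak as (time offset from t_ref, frequency bin)."""
    return [(int(t) - t_ref, int(f)) for f, t in find_peaks_2d(E_proto, rel_threshold)]


def response(E, constellation, tol=1):
    """Per time frame: geometric mean of the peak energies found near each expected position."""
    E = np.asarray(E, dtype=float)
    n_f, n_t = E.shape
    out = np.zeros(n_t)
    for t in range(n_t):
        values = []
        for dt, f in constellation:
            tt = t + dt
            if tt < 0 or tt >= n_t:
                values = []  # constellation does not fit at this frame
                break
            window = E[max(0, f - tol):min(n_f, f + tol + 1),
                       max(0, tt - tol):min(n_t, tt + tol + 1)]
            values.append(window.max())
        if values:
            out[t] = np.prod(values) ** (1.0 / len(values))
    return out


# Toy prototype with two energy peaks (rows: frequency bins, columns: time frames).
proto = np.zeros((8, 10))
proto[2, 3] = 1.0
proto[5, 6] = 0.8
constellation = configure(proto, t_ref=3)

# Same constellation occurring later in a longer signal.
scene = np.zeros((8, 20))
scene[2, 7] = 1.0
scene[5, 10] = 0.8
r_full = response(scene, constellation)

# Only one of the two peaks is present: the geometric mean drops to zero.
partial = np.zeros((8, 20))
partial[2, 7] = 1.0
r_partial = response(partial, constellation)

print(constellation)
print(r_full.argmax(), r_partial.max())
```

Because the peaks are combined multiplicatively, the sketch responds only when the whole constellation is present, while the tolerance window allows small deviations in time and frequency.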
We proposed a method that uses COPE feature extractors together with a classification system to perform real-time audio event detection and classification, including cases where sounds have zero or negative SNR. The performance results (recognition rate over 90%) that we obtained on several benchmarking data sets for audio event detection in different contexts are higher than those of state-of-the-art approaches.
In the second part of the work, we introduced B-COSFIRE filters for the detection of elongated and curvilinear patterns in images and applied them to the delineation of blood vessels in retinal images [1, 6]. The B-COSFIRE filters are trainable, that is, their structure is automatically configured from prototype elongated patterns. The design of the B-COSFIRE filters is inspired by the functions of some neurons, called simple cells, in area V1 of the visual system, which fire when presented with line or contour stimuli. A B-COSFIRE filter achieves orientation selectivity by computing the weighted geometric mean of the outputs of a pool of Difference-of-Gaussians (DoG) filters, whose supports are aligned in a collinear manner. Rotation invariance is efficiently obtained by appropriate shifting of the DoG filter responses.
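The combination of collinear DoG responses by a weighted geometric mean, and the use of shifts to obtain rotation invariance, can be sketched as follows. This is a simplified illustration under stated assumptions, not the released MATLAB implementation: the DoG definition (inner Gaussian of half the outer standard deviation), the Gaussian weighting by distance from the centre, the integer shifts with wrap-around via `np.roll`, the omission of any blurring of the DoG responses, and all function names and parameter values are choices made here for brevity.

```python
import math
import numpy as np


def gaussian_blur(img, sigma):
    """Separable Gaussian blur with a kernel truncated at 3 * sigma."""
    radius = int(math.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, out)


def dog_response(img, sigma):
    """Half-wave rectified centre-on DoG (assumed form: G(0.5 * sigma) - G(sigma))."""
    return np.maximum(gaussian_blur(img, 0.5 * sigma) - gaussian_blur(img, sigma), 0.0)


def make_tuples(rhos, theta):
    """Collinear support points (rho, phi) on both sides of the centre along orientation theta."""
    tuples = [(0, 0.0)]
    for rho in rhos:
        tuples += [(rho, theta), (rho, theta + math.pi)]
    return tuples


def bcosfire_response(dog, tuples, sigma_w):
    """Weighted geometric mean of DoG responses shifted to the filter centre."""
    weights = [math.exp(-rho**2 / (2 * sigma_w**2)) for rho, _ in tuples]
    result = np.ones_like(dog)
    for (rho, phi), w in zip(tuples, weights):
        dx = int(round(rho * math.cos(phi)))
        dy = -int(round(rho * math.sin(phi)))  # image rows grow downwards
        shifted = np.roll(dog, shift=(-dy, -dx), axis=(0, 1))  # wraps at the borders
        result *= shifted ** w
    return result ** (1.0 / sum(weights))


def rotation_invariant_response(dog, rhos, sigma_w, n_orient=12):
    """Reuse one DoG response; only the shifts change with the orientation."""
    thetas = [k * math.pi / n_orient for k in range(n_orient)]
    return np.max([bcosfire_response(dog, make_tuples(rhos, t), sigma_w) for t in thetas], axis=0)


# Toy image: a bright horizontal line on a dark background.
img = np.zeros((41, 41))
img[20, 5:36] = 1.0
dog = dog_response(img, sigma=2.0)
rhos = [2, 4, 6]

horizontal = bcosfire_response(dog, make_tuples(rhos, 0.0), sigma_w=4.0)
vertical = bcosfire_response(dog, make_tuples(rhos, math.pi / 2), sigma_w=4.0)
invariant = rotation_invariant_response(dog, rhos, sigma_w=4.0)
invariant_t = rotation_invariant_response(dog_response(img.T, sigma=2.0), rhos, sigma_w=4.0)

print(horizontal[20, 20] > vertical[20, 20])
print(abs(invariant[20, 20] - invariant_t[20, 20]) < 1e-9)
```

In this sketch the DoG response is computed once per image, and each orientation only changes the shift vectors, which is why rotation invariance comes at low additional cost. The multiplicative combination means that a single missing support point suppresses the response, which gives the filter its selectivity for elongated structures.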
After configuring a large bank of B-COSFIRE filters selective for vessels (i.e. lines) and vessel-endings (i.e. line-endings) of various thicknesses (i.e. scales), we employed different techniques based on information theory and machine learning to select an optimal subset of B-COSFIRE filters for the vessel delineation task [5, 7]. We used the selected filters as feature extractors to construct a pixel-wise feature vector, which we combined with a classifier to label the pixels of the test image as vessel or non-vessel pixels. We carried out experiments on public benchmarking data sets (DRIVE, STARE, CHASE_DB1 and HRF) and the results that we achieved are higher than those of many existing methods.
We studied the computational requirements of the proposed algorithms in order to evaluate their applicability in real-world applications and their fulfillment of the real-time constraints imposed by the considered problems. The MATLAB implementations of the proposed algorithms are publicly released for research purposes.
This work contributes to the development of algorithms for representation learning in audio and image processing and promotes their use in higher-level pattern recognition systems.