
180 Park Ave - Building 103
Florham Park, NJ
Connecting Your World,
The need to be connected is greater than ever, and AT&T Researchers are creating new ways for people to connect with one another and with their environments, whether it's their home, office, or car.
INVESTIGATING DEEP NEURAL NETWORK BASED TRANSFORMS OF ROBUST AUDIO FEATURES FOR LVCSR
Enrico Bocchieri, Dimitrios Dimitriadis
38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE,
2013.
[PDF]
[BIB]
IEEE Copyright
This version of the work is reprinted here with permission of IEEE for your personal use. Not for redistribution. The definitive version was published in 2012, 2013-05-26, http://www.icassp2013.com/
{Although micro-modulation components such as the formant frequencies are very important characteristics of speech, and large performance improvements have been reported in small-vocabulary ASR tasks, they have seen limited use in large-vocabulary ASR applications. To successfully use these frequency measures in real-life tasks, we study linear (e.g., HDA) and non-linear (bottleneck MLP) feature transforms for combining cepstral features, such as MFCCs and PLPs, with formant frequency-related coefficients. Our experiments show that integrating micro-modulation and cepstral features through non-linear MLP-based transforms greatly improves ASR performance relative to the cepstral features alone. We have applied this novel feature extraction scheme to two very different tasks, a clean-speech task (DARPA-WSJ) and a real-life, open-vocabulary, mobile search task (Speak4it℠), and report improved performance on both. The relative error rate reduction is 15% for the Speak4it℠ task, with similar improvements, up to 21%, for the WSJ task.}

INSTANTANEOUS FREQUENCY AND BANDWIDTH ESTIMATION USING FILTERBANK ARRAYS
Dimitrios Dimitriadis, Pirros Tsiakoulis (University of Cambridge), Alexandros Potamianos (Department of ECE)
38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE,
2013.
[PDF]
[BIB]
{Accurate estimation of the instantaneous frequency of speech resonances is a hard problem, mainly due to phase discontinuities in the speech signal associated with excitation instants. We review a variety of approaches for enhanced frequency and bandwidth estimation in the time domain and propose a new cognitively motivated approach using filterbank arrays. We show that by filtering speech resonances using filters of different center frequency, bandwidth and shape, the ambiguity in instantaneous frequency estimation associated with amplitude envelope minima and phase discontinuities can be significantly reduced. The novel estimators are shown to perform well on synthetic speech signals with frequency and bandwidth micro-modulations (i.e., modulations within a pitch period), as well as on real speech signals. Filterbank arrays, when applied to frequency and bandwidth modulation index estimation, are shown to reduce the estimation error variance by 85% and 70%, respectively.}

Living rooms getting smarter with multimodal and multichannel signal processing
Dimitrios Dimitriadis, Horst Schroeter
IEEE SLTC newsletter,
2011.
[PDF]
[BIB]
IEEE Copyright
This version of the work is reprinted here with permission of IEEE for your personal use. Not for redistribution. The definitive version was published in IEEE SLTC newsletter, 2011-07-27
Combining Frame and Segment Level Processing via Temporal Pooling for Phonetic Classification
Sumit Chopra, Patrick Haffner, Dimitrios Dimitriadis
12th Annual Conference of the International Speech Communication Association,
2011.
[PDF]
[BIB]
International Speech Communication Association Copyright
The definitive version was published in 12th Annual Conference of the International Speech Communication Association, 2011-08-27
{We propose a simple yet novel multi-layer model for the problem of phonetic classification. Our model combines a frame-level transformation of the acoustic signal with a segment-level transformation via a temporal pooling architecture to compute class-conditional probabilities of phones. Without the use of any phonetic knowledge, our model achieves state-of-the-art performance on the TIMIT phone classification task. The flexibility of our model allows us to mix a variety of pooling architectures, leading to further significant performance improvements.}
An Alternative Frontend for the AT&T WATSON LV-CSR System
Dimitrios Dimitriadis, Enrico Bocchieri, Diamantino Caseiro
International Conference on Acoustics, Speech and Signal Processing,
2011.
[BIB]
{In previously published work, we proposed a novel feature extraction algorithm approximating some of the human auditory characteristics and the robustness of an alternative energy estimation scheme. Herein, we examine the performance of the proposed features under additive noise and suggest how to predict the noisy cepstral coefficient deviations by estimating the subband SNR values. Then, we examine the efficiency of the proposed features in the framework of a state-of-the-art LV-CSR system, namely the AT&T WATSON system. The features are evaluated on a mobile voice-search task, namely the Speak4It application. The proposed feature extraction scheme yields a 6% relative improvement in overall performance, with the AM and LM training left fixed. Additional improvements have been reported when this frontend is combined with advanced training techniques.}

Speech Recognition Modeling Advances For Mobile Voice Search
Enrico Bocchieri, Diamantino Caseiro, Dimitrios Dimitriadis
International Conference On Acoustics, Speech and Signal Processing,
2010.
[BIB]
{This paper reports on the development of, and advances in, automatic speech recognition for the AT&T Speak4It voice-search application. With Speak4It as a real-life example, we show the effectiveness of acoustic model (AM) and language model (LM) estimation (adaptation and training) on relatively small amounts of field data. We then introduce algorithmic improvements concerning the use of sentence length in the LM, of non-contextual features in AM decision trees, and of the Teager energy in the acoustic front-end. The combination of these algorithms yields substantial accuracy improvements. LM and AM estimation on samples of field data increases the word accuracy from 66.4% to 77.1%, a relative word error reduction of 32%. The algorithmic improvements increase the accuracy to 79.7%, an additional 11.3% relative error reduction.}
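
The relative error reductions quoted in the abstract above can be checked from the reported word accuracies. The sketch below is illustrative only and is not taken from the paper: it assumes word error is simply 100% minus word accuracy, and the function name `relative_error_reduction` is hypothetical.

```python
def relative_error_reduction(acc_before, acc_after):
    """Relative word error reduction (in %) between two word accuracies (in %).

    Assumption: word error = 100 - word accuracy.
    """
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return 100.0 * (err_before - err_after) / err_before


# Word accuracies quoted in the abstract.
field_data_gain = relative_error_reduction(66.4, 77.1)   # AM/LM estimation on field data
algorithmic_gain = relative_error_reduction(77.1, 79.7)  # additional algorithmic improvements

print(f"field data:  {field_data_gain:.1f}% relative error reduction")
print(f"algorithmic: {algorithmic_gain:.1f}% relative error reduction")
```

Under this assumption the two values come out close to the quoted 32% and 11.3%. Small differences are to be expected, since the abstract reports accuracies rounded to one decimal place.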