
180 Park Ave - Building 103
Florham Park, NJ
Research and application of algorithms in the area of speech recognition, signal processing and pattern recognition.
INVESTIGATING DEEP NEURAL NETWORK BASED TRANSFORMS OF ROBUST AUDIO FEATURES FOR LVCSR
Enrico Bocchieri, Dimitrios Dimitriadis
38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE,
2013.
IEEE Copyright
This version of the work is reprinted here with permission of IEEE for your personal use. Not for redistribution. The definitive version was published in 2012 , 2013-05-26, http://www.icassp2013.com/
Although micro-modulation components such as the formant frequencies are very important characteristics of speech, and large performance improvements have been reported in small-vocabulary ASR tasks, they have seen limited use in large-vocabulary ASR applications. To use these frequency measures successfully in real-life tasks, we study linear (e.g., HDA) and non-linear (bottleneck MLP) feature transforms for combining features such as MFCCs and PLPs with the formant frequency-related coefficients. Our experiments show that integrating micro-modulation and cepstral features through non-linear MLP-based transforms greatly improves ASR performance relative to the cepstral features alone. We have applied this novel feature extraction scheme to two very different tasks, a clean speech task (DARPA-WSJ) and a real-life, open-vocabulary, mobile search task (Speak4itSM), and report improved performance on both. We report a relative error rate reduction of 15% for the Speak4itSM task, and similar improvements, up to 21%, for the WSJ task.
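
The data flow of a bottleneck MLP transform, as named in the abstract above, can be illustrated with a minimal sketch. It is not the paper's architecture: the layer sizes, the feature dimensions, the helper names (`dense`, `init_layer`, `bottleneck_features`) and the random, untrained weights are all assumptions made for illustration. In practice such a network is trained with further layers after the bottleneck to predict classification targets, and only the narrow layer's activations are kept as features.

```python
import math
import random

def dense(x, weights, biases):
    """Affine layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-a)) for a in values]

def init_layer(n_in, n_out, rng):
    """Random, untrained weights (illustration only) and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def bottleneck_features(cepstral, modulation, layers):
    """Concatenate both feature streams and return the bottleneck outputs."""
    x = list(cepstral) + list(modulation)
    (w1, b1), (w2, b2) = layers
    hidden = sigmoid(dense(x, w1, b1))
    return dense(hidden, w2, b2)  # linear activations of the narrow layer

rng = random.Random(0)
# Hypothetical sizes: cepstral dims, modulation dims, hidden units, bottleneck.
N_CEP, N_MOD, N_HID, N_BN = 13, 6, 32, 8
layers = [init_layer(N_CEP + N_MOD, N_HID, rng),
          init_layer(N_HID, N_BN, rng)]

frame_cep = [rng.gauss(0, 1) for _ in range(N_CEP)]  # stand-in cepstral frame
frame_mod = [rng.gauss(0, 1) for _ in range(N_MOD)]  # stand-in modulation frame
features = bottleneck_features(frame_cep, frame_mod, layers)
print(len(features))
```

The output vector has the bottleneck width regardless of the input dimensions, which is what lets the combined feature streams feed a conventional acoustic model.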

An Alternative Frontend for the AT&T WATSON LV-CSR System
Dimitrios Dimitriadis, Enrico Bocchieri, Diamantino Caseiro
International Conference on Acoustics, Speech and Signal Processing,
2011.
In previously published work, we proposed a novel feature extraction algorithm that approximates some of the human auditory characteristics and exploits the robustness of an alternative energy estimation scheme. Herein, we examine the performance of the proposed features under additive noise and suggest how to predict the noisy cepstral coefficient deviations by estimating the subband SNR values. Then, we examine the efficiency of the proposed features in the framework of a state-of-the-art LV-CSR system, namely the AT&T WATSON system. The features are evaluated on a mobile voice search task, namely the Speak4It application. The proposed feature extraction scheme improves overall performance by 6% relative, with the AM and LM training left fixed. Additional improvements have been reported when this frontend is combined with advanced training techniques.

Speech Recognition Modeling Advances For Mobile Voice Search
Enrico Bocchieri, Diamantino Caseiro, Dimitrios Dimitriadis
International Conference On Acoustics, Speech and Signal Processing,
2010.
This paper reports on the development and advances in automatic speech recognition for the AT&T Speak4It voice-search application. With Speak4It as a real-life example, we show the effectiveness of acoustic model (AM) and language model (LM) estimation (adaptation and training) on relatively small amounts of field data. We then introduce algorithmic improvements concerning the use of sentence length in the LM, of non-contextual features in the AM decision trees, and of the Teager energy in the acoustic front-end. The combination of these algorithms yields substantial accuracy improvements. LM and AM estimation on samples of field data increases the word accuracy from 66.4% to 77.1%, a relative word error reduction of 32%. The algorithmic improvements increase the accuracy to 79.7%, an additional 11.3% relative error reduction.
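
The relative error reductions quoted in this abstract follow from the word accuracies. The short check below assumes that word error is simply 100% minus word accuracy; the function name is ours, not the paper's.

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Relative word-error reduction (%) implied by two word accuracies (%)."""
    err_before = 100.0 - acc_before  # assumed: error = 100 - accuracy
    err_after = 100.0 - acc_after
    return 100.0 * (err_before - err_after) / err_before

# Accuracy figures quoted in the abstract.
adaptation = relative_error_reduction(66.4, 77.1)  # field-data estimation
algorithms = relative_error_reduction(77.1, 79.7)  # algorithmic improvements
print(f"{adaptation:.1f}% {algorithms:.1f}%")
```

This gives roughly 31.8% and 11.4%, which agrees with the quoted 32% and 11.3% to within the rounding of the published accuracies.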
System And Method For Providing Large Vocabulary Speech Processing Based On Fixed-Point Arithmetic
Tue Jun 05 16:10:39 EDT 2012
Disclosed herein is a system, method and computer-readable medium storing instructions for controlling a computing device according to the method. As an example embodiment, the method uses a speech recognition decoder that operates using fixed-point arithmetic. The exemplary method comprises representing arc costs associated with at least one finite state transducer (FST) in fixed point, representing parameters associated with a hidden Markov model (HMM) in fixed point, and processing speech data in the speech recognition decoder using fixed-point arithmetic for the fixed-point FST arc costs and the fixed-point HMM parameters. The method may also include computing sentence hypothesis probabilities at the decoder with fixed-point arithmetic as type Q-2e numbers.
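
The general idea of accumulating decoder costs in fixed point can be sketched as follows. This is a generic illustration, not the patented method: the 8 fractional bits, the helper names and the example probabilities are assumptions, and the sketch does not attempt to model the Q-2e format mentioned in the abstract.

```python
import math

FRAC_BITS = 8  # assumed number of fractional bits; not taken from the patent

def to_fixed(x: float, frac_bits: int = FRAC_BITS) -> int:
    """Quantize a real-valued cost to a fixed-point integer."""
    return round(x * (1 << frac_bits))

def to_float(q: int, frac_bits: int = FRAC_BITS) -> float:
    """Convert a fixed-point integer back to a real value."""
    return q / (1 << frac_bits)

# Hypothetical negative-log-probability costs along one decoding path.
arc_costs = [-math.log(p) for p in (0.5, 0.25, 0.125)]  # FST arc costs
hmm_costs = [-math.log(p) for p in (0.9, 0.8, 0.7)]     # HMM state scores

# The path score is accumulated with integer additions only.
fixed_total = sum(to_fixed(c) for c in arc_costs + hmm_costs)
float_total = sum(arc_costs + hmm_costs)

print(fixed_total, to_float(fixed_total), float_total)
```

Each quantized cost is off by at most half a quantization step, so the accumulated error along a path grows at most linearly with the number of terms. More fractional bits tighten that bound at the cost of a smaller representable range for a given word size.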
Automatic speech recognizer
Tue Jul 12 18:05:00 EDT 1994
Apparatus and method for recording data in a speech recognition system and recognizing spoken data corresponding to the recorded data. The apparatus and method respond to entered data by generating a string of phonetic transcriptions from the entered data. The data and the generated phonetic transcription string associated with it are recorded in a vocabulary lexicon of the speech recognition system. The apparatus and method respond to receipt of spoken data by constructing a model of subwords characteristic of the spoken data and comparing the constructed subword model with the phonetic transcription strings recorded in the lexicon vocabulary, so as to recognize the spoken data as the data identified by and associated with a phonetic transcription string matching the constructed subword string.
IEEE Fellow, 2013.
For contributions to computational models for speech recognition.
AT&T Labs V.P. Award, 2012.
AT&T CTO Award, 2011.
AT&T Intellectual Property Award, 2010.
IEEE Signal Processing Society Best Paper Award, 2005.
Best Paper Award, IEEE (IEEE Acoustic Speech and Signal Processing Society), 2004.
For "Subspace distribution clustering hidden Markov model".