
180 Park Ave - Building 103
Florham Park, NJ
Predicting Human Perceived Accuracy of ASR Systems
Taniya Mishra, Andrej Ljolje, Mazin Gilbert
Interspeech, 2011.
ISCA Copyright. The definitive version was published in Interspeech, 2011-08-28.
Word error rate (WER), the most commonly used measure of automatic speech recognition (ASR) accuracy, penalizes all ASR errors (insertions, deletions, substitutions) equally. However, humans weigh different types of ASR errors differentially: they judge errors that distort the meaning of the spoken message more harshly than those that do not. Following this central idea of differential weighting, we developed a new metric, HPA (Human Perceived Accuracy), that aims to align more closely with human perception of ASR errors. Applied to the task of automatically recognizing voicemails, we found that the correlation between HPA and human perception of ASR accuracy was significantly higher (r-value=0.91) than the correlation between WER and human judgement (r-value=0.65).
Speech Recognition Based On Pronunciation Modeling,
Tue Jul 03 12:52:37 EDT 2012
A system and method for performing speech recognition are disclosed. The method comprises receiving an utterance, applying the utterance to a recognizer with a language model having pronunciation probabilities associated with unique word identifiers for words given their pronunciations, and presenting a recognition result for the utterance. Recognition improvement is found by moving a pronunciation model from a dictionary to the language model.
Systems And Methods Of Providing Modified Media Content,
Tue Jun 19 12:52:32 EDT 2012
In an embodiment, a method of providing modified media content is disclosed and includes receiving media content that includes audio data and video data having a first number of video frames. The method also includes generating abstracted media content that includes portions of the video data and audio elements of the audio data, where the abstracted media content includes less than all of the video data and includes fewer video frames than the first number of video frames.
System And Method For Increasing Recognition Rates Of In-Vocabulary Words By Improving Pronunciation Modeling,
Tue Jan 10 12:50:04 EST 2012
The present disclosure relates to systems, methods, and computer-readable media for generating a lexicon for use with speech recognition. The method includes receiving symbolic input as labeled speech data, overgenerating potential pronunciations based on the symbolic input, identifying potential pronunciations in a speech recognition context, and storing the identified potential pronunciations in a lexicon. Overgenerating potential pronunciations can include establishing a set of conversion rules for short sequences of letters, converting portions of the symbolic input into a number of possible lexical pronunciation variants based on the set of conversion rules, modeling the possible lexical pronunciation variants in one of a weighted network and a list of phoneme lists, and iteratively retraining the set of conversion rules based on improved pronunciations. Symbolic input can include multiple examples of a same spoken word. Speech data can be labeled explicitly or implicitly and can include words as text and recorded audio.
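The overgeneration step described above can be sketched in a few lines: short letter sequences are mapped to candidate phoneme strings by conversion rules, and the cross-product of the alternatives yields every rule-licensed pronunciation variant. The rules and phoneme symbols below are invented for illustration; a real system would train the rules from labeled speech data and prune the variants in a recognition context.

```python
from itertools import product

# Hypothetical letter-sequence -> phoneme-string conversion rules.
RULES = {
    "t": ["t"],
    "o": ["ow", "aa"],
    "ma": ["m ey", "m ah"],
}

def overgenerate(segments):
    """Expand a segmentation of a word into all rule-licensed
    pronunciation variants (the overgeneration step)."""
    alternatives = [RULES[seg] for seg in segments]
    return [" ".join(choice) for choice in product(*alternatives)]

# A segmentation of "tomato" produces 1*2*2*1*2 = 8 candidate pronunciations.
print(overgenerate(["t", "o", "ma", "t", "o"]))
```

In the patented method these candidates would then be scored in a speech recognition context, and the surviving pronunciations stored in the lexicon while the conversion rules are iteratively retrained.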
System And Method For Pronunciation Modeling,
Tue Dec 06 16:06:42 EST 2011
Systems, computer-implemented methods, and tangible computer-readable media for generating a pronunciation model. The method includes identifying a generic model of speech composed of phonemes, identifying a family of interchangeable phonemic alternatives for a phoneme in the generic model of speech, labeling the family of interchangeable phonemic alternatives as referring to the same phoneme, and generating a pronunciation model which substitutes each family for each respective phoneme. In one aspect, the generic model of speech is a vocal tract length normalized acoustic model. Interchangeable phonemic alternatives can represent a same phoneme for different dialectal classes. An interchangeable phonemic alternative can include a string of phonemes.
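The family-substitution step above admits a very small sketch: each phoneme in a generic pronunciation is replaced by its family of interchangeable alternatives (for example, dialectal variants). The families and phoneme symbols here are invented for illustration, not taken from the patent.

```python
# Hypothetical families of interchangeable phonemic alternatives;
# phonemes without a family map to themselves.
FAMILIES = {"aa": ["aa", "ao"], "t": ["t", "dx"]}

def substitute_families(pronunciation):
    """Replace each phoneme in a pronunciation with its family of
    interchangeable alternatives, yielding the substituted model."""
    return [FAMILIES.get(p, [p]) for p in pronunciation]

print(substitute_families(["w", "aa", "t", "er"]))
# [['w'], ['aa', 'ao'], ['t', 'dx'], ['er']]
```

Each position now accepts any member of its family, so a single lexicon entry covers the dialectal classes that the family was labeled to represent.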
Multi-State Barge-In Models For Spoken Dialog Systems,
Tue Oct 25 16:06:20 EDT 2011
Disclosed are systems, methods and computer readable media for applying a multi-state barge-in acoustic model in a spoken dialogue system, comprising the steps of (1) presenting a prompt to a user from the spoken dialog system, (2) receiving an audio speech input from the user during the presentation of the prompt, (3) accumulating the audio speech input from the user, (4) applying a non-speech component having at least two one-state Hidden Markov Models (HMMs) to the audio speech input from the user, (5) applying a speech component having at least five three-state HMMs to the audio speech input from the user, in which each of the five three-state HMMs represents a different phonetic category, (6) determining whether the audio speech input is a barge-in speech input from the user, and (7) if the audio speech input is determined to be the barge-in speech input from the user, terminating the presentation of the prompt.
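The decision loop in steps (2)-(7) can be sketched with the HMM scoring stubbed out. The `speech_ll` and `nonspeech_ll` callables below are hypothetical stand-ins for the per-frame log-likelihoods under the speech and non-speech components; a real system would evaluate the one-state non-speech HMMs and the five three-state phonetic-category HMMs instead.

```python
def detect_barge_in(frames, speech_ll, nonspeech_ll, min_frames=3):
    """Accumulate audio frames while a prompt plays; declare barge-in once
    the speech component out-scores the non-speech component on
    `min_frames` consecutive frames. Returns the frame index at which the
    prompt would be terminated, or None if no barge-in is detected."""
    run = 0
    for i, frame in enumerate(frames):
        if speech_ll(frame) > nonspeech_ll(frame):
            run += 1
            if run >= min_frames:
                return i  # step (7): terminate the prompt here
        else:
            run = 0  # non-speech frame resets the run
    return None

# Hypothetical scores: frames are plain energy values in this toy example.
frames = [0.1, 0.2, 0.1, 0.9, 0.8, 0.95, 0.9]
speech = lambda f: f          # stand-in score under the speech HMMs
nonspeech = lambda f: 0.5     # stand-in score under the non-speech HMMs
print(detect_barge_in(frames, speech, nonspeech))  # barge-in at frame 5
```

Requiring several consecutive speech-favoring frames before terminating the prompt is one simple way to avoid reacting to isolated noise bursts.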
System And Method Of Word Lattice Augmentation Using A Pre/Post Vocalic Consonant Distinction,
Tue Sep 20 16:06:10 EDT 2011
Systems and methods are provided for recognizing speech in a spoken dialogue system. The method includes receiving input speech having a pre-vocalic consonant or a post-vocalic consonant, generating at least one output lattice that calculates a first score by comparing the input speech to a training model to provide a result, and distinguishing between the pre-vocalic consonant and the post-vocalic consonant in the input speech. A second score is calculated by measuring the similarity between the pre-vocalic consonant or the post-vocalic consonant in the input speech and the first score. At least one category is determined for the pre-vocalic match or mismatch or the post-vocalic match or mismatch by using the second score, and the results of an automated speech recognition (ASR) system are refined by using the at least one category for the pre-vocalic match or mismatch or the post-vocalic match or mismatch.
System And Method Of Using Acoustic Models For Automatic Speech Recognition Which Distinguish Pre- And Post-Vocalic Consonants,
Tue Sep 06 16:06:07 EDT 2011
Disclosed are systems, methods and computer readable media for training acoustic models for an automatic speech recognition (ASR) system. The method includes receiving a speech signal, defining at least one syllable boundary position in the received speech signal, generating, based on the at least one syllable boundary position, a pre-vocalic position label and a post-vocalic position label for each consonant in a consonant phoneme inventory to expand the consonant phoneme inventory, reformulating a lexicon to reflect the expanded consonant phoneme inventory, and training a language model for the ASR system based on the reformulated lexicon.
Discriminative Training Of Multi-State Barge-In Models For Speech Processing,
Tue Aug 16 16:05:56 EDT 2011
Disclosed are systems and methods for training a barge-in model for speech processing in a spoken dialogue system comprising the steps of (1) receiving an input having at least one speech segment and at least one non-speech segment, (2) establishing a restriction of recognizing only speech states during speech segments of the input and non-speech states during non-speech segments of the input, (3) generating a hypothesis lattice by allowing any sequence of speech Hidden Markov Models (HMMs) and non-speech HMMs, (4) generating a reference lattice by only allowing speech HMMs for at least one speech segment and non-speech HMMs for at least one non-speech segment, wherein different iterations of training generate at least one different reference lattice and at least one reference transcription, and (5) employing the generated reference lattice as the barge-in model for speech processing.
Low Latency Real-Time Vocal Tract Length Normalization,
Tue Jul 28 16:07:45 EDT 2009
A method and apparatus for performing speech recognition are provided. A Vocal Tract Length Normalized acoustic model for a speaker is generated from training data. Speech recognition is performed on a first recognition input to determine a first best hypothesis. A first Vocal Tract Length Normalization factor is estimated based on the first best hypothesis. Speech recognition is performed on a second recognition input using the Vocal Tract Length Normalized acoustic model to determine another best hypothesis. Another Vocal Tract Length Normalization factor is estimated based on that best hypothesis and at least one previous best hypothesis.
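The incremental re-estimation described above can be sketched as a grid search over warp factors that is repeated after every utterance, using all hypotheses accumulated for the speaker so far. The `likelihood` function here is a hypothetical stand-in for scoring warped features against the VTLN acoustic model; the grid range and step are also assumptions for illustration.

```python
# Candidate warp factors, 0.88 .. 1.12 in steps of 0.02 (illustrative grid).
WARP_GRID = [0.88 + 0.02 * k for k in range(13)]

def estimate_warp(hypotheses, likelihood):
    """Pick the warp factor maximizing the total likelihood over all best
    hypotheses accumulated for this speaker so far."""
    return max(WARP_GRID, key=lambda a: sum(likelihood(h, a) for h in hypotheses))

# Hypothetical likelihood, peaked at a "true" speaker warp of 1.04.
likelihood = lambda hyp, a: -(a - 1.04) ** 2

# After each utterance, re-estimate from the growing hypothesis history:
# the estimate refines without waiting for all of the speaker's audio.
history = []
for utterance_hyp in ["first best hypothesis", "second best hypothesis"]:
    history.append(utterance_hyp)
    warp = estimate_warp(history, likelihood)
print(round(warp, 2))  # converges on 1.04
```

Re-estimating from the running history rather than from all audio at once is what keeps the latency low: each new best hypothesis only adds one term per grid point to the search.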
Science & Technology Medal, 2006.
Honored for technical leadership and innovative contributions in Automatic Speech Recognition.