
180 Park Ave - Building 103
Florham Park, NJ
AT&T Natural Voices™ Text-to-Speech,
Natural Voices is AT&T's state-of-the-art text-to-speech product, which takes text and produces natural-sounding synthesized speech in a variety of voices and languages.
MIRACLE and the Content Analysis Engine (CAE),
The MIRACLE project develops media processing technologies that enable content-based retrieval and presentation of multimedia data across a range of devices and available bandwidths.
Speech Mashup,
The AT&T speech mashup is a web service that implements speech tasks for web applications, enabling users of smartphones and other devices to interact with applications by voice.
Automatic learning in content indexing service using phonetic alignment
Yeon Kim, David Gibbon
ISCA Interspeech 2011,
2011.
[BIB]
{Content indexing has become a necessity, not just an option, in an era where broadcast, cable and the Internet produce huge amounts of media daily. Text information from spoken audio remains a key feature for understanding content, along with other metadata and video features.
In this paper, a new method is introduced to improve transcription quality, which allows more accurate content indexing. Our method finds phonetic similarities between two imperfect sources, closed captions and ASR output, and aligns them to produce quality transcriptions. In the process, even out-of-vocabulary words can be learned automatically. Given broadcast news audio and closed captions, our experimental results show that the proposed method improves word correct rates by 11% on average over the ASR output using the baseline language model, and by 6% over the output using the adapted language model.}
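The paper's implementation is not reproduced here; the Python sketch below is only a toy illustration of the core idea under simplifying assumptions. Both the closed-caption text and the ASR hypothesis are mapped to phoneme strings through a small, invented pronunciation table and aligned with a standard sequence matcher, and caption words whose phonemes are at least partly confirmed by the ASR output are kept.

# Toy sketch of caption/ASR phonetic alignment (not the authors' algorithm).
# The pronunciation table below is invented; a real system would use a
# pronunciation lexicon or a grapheme-to-phoneme model.
from difflib import SequenceMatcher

PHONES = {
    "stocks": ["S", "T", "AA", "K", "S"],
    "stock": ["S", "T", "AA", "K"],
    "fell": ["F", "EH", "L"],
    "sharply": ["SH", "AA", "R", "P", "L", "IY"],
    "sharp": ["SH", "AA", "R", "P"],
    "lee": ["L", "IY"],
}

def to_phones(words):
    """Flatten a word sequence into phonemes, remembering which word owns each."""
    phones, owners = [], []
    for i, word in enumerate(words):
        for p in PHONES.get(word.lower(), []):
            phones.append(p)
            owners.append(i)
    return phones, owners

def align_caption_with_asr(caption_words, asr_words):
    """Keep caption words whose phonemes are matched somewhere in the ASR output."""
    cap_phones, cap_owner = to_phones(caption_words)
    asr_phones, _ = to_phones(asr_words)
    matched = set()
    matcher = SequenceMatcher(a=cap_phones, b=asr_phones, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            matched.add(cap_owner[block.a + k])
    return [w for i, w in enumerate(caption_words) if i in matched]

if __name__ == "__main__":
    caption = ["Stocks", "fell", "sharply"]
    asr_output = ["stock", "fell", "sharp", "lee"]
    print(align_caption_with_asr(caption, asr_output))  # ['Stocks', 'fell', 'sharply']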

Automatic Assessment of American English Lexical Stress using Machine Learning Algorithms
Yeon Kim, Mark Beutnagel
SLaTE-2011 workshop (Speech and Language Technology in Education),
2011.
[BIB]
{This paper introduces a method to assess lexical stress patterns in American English words automatically using machine learning algorithms, which could be used in a computer-assisted language learning (CALL) system. We aim to model human perception of lexical stress patterns by training on stress patterns in a native speaker's utterances and using the resulting models to detect erroneous stress patterns from a trainee.
In this paper, all the possible lexical stress patterns in 3- and 4-syllable American English words are presented, and four machine learning algorithms, CART, AdaBoost+CART, SVM and MaxEnt, are trained on acoustic measurements from a native speaker's utterances and the corresponding stress patterns. Our experimental results show that MaxEnt classified stress patterns best, correctly identifying 83.3% of them for 3-syllable words and 88.7% for 4-syllable words.}
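As a rough illustration of the classification setup (not the paper's experiments), the sketch below trains a maximum-entropy-style classifier, here scikit-learn's multinomial logistic regression, on per-syllable acoustic measurements to predict which syllable carries the main stress. The feature layout and the data are synthetic stand-ins for the native-speaker measurements used in the paper.

# Toy sketch of lexical stress classification (not the paper's experiments).
# Per-syllable [duration, mean F0, energy] features are synthesized so that
# the stressed syllable is longer, higher-pitched and louder.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def synth_word_features(stressed_syllable, n_syllables=3):
    """Fake acoustic measurements for one word, one row per syllable."""
    feats = rng.normal(loc=[0.12, 180.0, 60.0], scale=[0.02, 15.0, 5.0],
                       size=(n_syllables, 3))
    feats[stressed_syllable] += [0.06, 40.0, 10.0]   # stress correlates
    return feats.ravel()                             # flat vector per word

# Build a toy corpus of 3-syllable words labeled with the stressed position.
X, y = [], []
for _ in range(600):
    label = int(rng.integers(0, 3))   # which of the three syllables is stressed
    X.append(synth_word_features(label))
    y.append(label)
X, y = np.array(X), np.array(y)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000)   # MaxEnt-style multinomial classifier
clf.fit(X_tr, y_tr)
print(f"held-out accuracy on toy data: {clf.score(X_te, y_te):.3f}")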

AT&T VoiceBuilder: A Cloud-based Text-to-Speech Voice Builder Tool
Yeon Kim, Thomas Okken, Alistair Conkie, Giuseppe Di Fabbrizio
ISCA Interspeech 2011,
2011.
[BIB]
{The AT&T VoiceBuilder provides a new service to customers who want to have their own voices on the text-to-speech (TTS) system. It is implemented as a set of web interfaces on the AT&T Speech Mashup Portal. The proposed system receives customers' speech recordings, processes them, and returns plug-in TTS voices accessible through a web interface. All procedures are fully automated to avoid human intervention.}
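The portal's actual programming interface is not described in this abstract; the sketch below is purely illustrative, with a hypothetical endpoint, credential and response fields, of how a client might submit recordings to such a cloud voice-building service and poll for the finished plug-in voice.

# Purely illustrative client for a cloud voice-building web service.
# The endpoint URL, parameter names and response fields are hypothetical;
# the real AT&T Speech Mashup Portal interface is not documented here.
import time
import requests

BASE_URL = "https://example.com/voicebuilder"   # hypothetical endpoint
API_KEY = "YOUR_API_KEY"                        # hypothetical credential

def submit_recordings(wav_paths, voice_name):
    """Upload recorded prompts and request a new TTS voice build."""
    files = [("audio", open(path, "rb")) for path in wav_paths]
    resp = requests.post(f"{BASE_URL}/build",
                         params={"apikey": API_KEY, "voice": voice_name},
                         files=files, timeout=60)
    resp.raise_for_status()
    return resp.json()["job_id"]                # hypothetical response field

def wait_for_voice(job_id, poll_seconds=30):
    """Poll the build job until the plug-in voice is ready."""
    while True:
        status = requests.get(f"{BASE_URL}/status/{job_id}",
                              params={"apikey": API_KEY}, timeout=30).json()
        if status["state"] == "done":           # hypothetical status field
            return status["voice_id"]
        time.sleep(poll_seconds)
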
Speech acts and dialog TTS
Ann Syrdal, Alistair Conkie, Yeon Kim, Mark Beutnagel
Seventh ISCA Speech Synthesis Workshop,
2010.
[BIB]
System And Method Of Word Lattice Augmentation Using A Pre/Post Vocalic Consonant Distinction,
September 20, 2011
Systems and methods are provided for recognizing speech in a spoken dialogue system. The method includes receiving input speech having a pre-vocalic consonant or a post-vocalic consonant, generating at least one output lattice that calculates a first score by comparing the input speech to a training model to provide a result, and distinguishing between the pre-vocalic consonant and the post-vocalic consonant in the input speech. A second score is calculated by measuring a similarity between the pre-vocalic consonant or the post-vocalic consonant in the input speech and the first score. At least one category is determined for the pre-vocalic match or mismatch or the post-vocalic match or mismatch by using the second score, and the results of an automated speech recognition (ASR) system are refined by using the at least one category for the pre-vocalic match or mismatch or the post-vocalic match or mismatch.
System And Method Of Using Acoustic Models For Automatic Speech Recognition Which Distinguish Pre- And Post-Vocalic Consonants,
September 6, 2011
Disclosed are systems, methods and computer-readable media for training acoustic models for an automatic speech recognition (ASR) system. The method includes receiving a speech signal, defining at least one syllable boundary position in the received speech signal, generating, based on the at least one syllable boundary position, a pre-vocalic position label and a post-vocalic position label for each consonant in a consonant phoneme inventory to expand the consonant phoneme inventory, reformulating a lexicon to reflect the expanded consonant phoneme inventory, and training a language model for the ASR system based on the reformulated lexicon.
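Neither patent's implementation is given here, but the lexicon reformulation step, which relabels each consonant as pre-vocalic (syllable onset) or post-vocalic (syllable coda) and thereby expands the phoneme inventory, can be sketched as follows. The phone set and the tiny syllabified lexicon are invented for illustration.

# Minimal sketch (not the patented implementation) of expanding a consonant
# inventory with pre-/post-vocalic position labels. The vowel set and the
# tiny syllabified lexicon are invented for illustration.
VOWELS = {"AA", "AE", "AH", "AO", "EH", "ER", "EY", "IH", "IY", "OW", "UW"}

def relabel_syllable(syllable):
    """Tag consonants in one syllable as _pre or _post relative to its vowel."""
    vowel_idx = next(i for i, phone in enumerate(syllable) if phone in VOWELS)
    relabeled = []
    for i, phone in enumerate(syllable):
        if phone in VOWELS:
            relabeled.append(phone)
        elif i < vowel_idx:
            relabeled.append(phone + "_pre")    # syllable-onset consonant
        else:
            relabeled.append(phone + "_post")   # syllable-coda consonant
    return relabeled

def reformulate_lexicon(lexicon):
    """Relabel every entry of a lexicon mapping word -> list of syllables."""
    return {word: [relabel_syllable(syl) for syl in syllables]
            for word, syllables in lexicon.items()}

if __name__ == "__main__":
    toy_lexicon = {
        "cat":    [["K", "AE", "T"]],
        "beacon": [["B", "IY"], ["K", "AH", "N"]],
    }
    for word, syls in reformulate_lexicon(toy_lexicon).items():
        print(word, " ".join(p for syl in syls for p in syl))
    # cat K_pre AE T_post
    # beacon B_pre IY K_pre AH N_post
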
Automatic Segmentation In Speech Synthesis,
September 8, 2009
Systems and methods for automatically segmenting speech inventories. A set of Hidden Markov Models (HMMs) are initialized using bootstrap data. The HMMs are next re-estimated and aligned to produce phone labels. The phone boundaries of the phone labels are then corrected using spectral boundary correction. Optionally, this process of using the spectral-boundary-corrected phone labels as input instead of the bootstrap data is performed iteratively in order to further reduce mismatches between manual labels and phone labels assigned by the HMM approach.
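The iterative HMM re-estimation is standard forced-alignment practice, but the spectral boundary correction step can be sketched concretely: each HMM-assigned boundary is snapped to the nearby frame with the largest spectral change. The code below is a toy illustration of that idea on synthetic features, not the patented method.

# Toy illustration of spectral boundary correction (not the patented method):
# each HMM-assigned phone boundary is moved to the nearby frame pair showing
# the largest spectral change. Features and boundaries below are synthetic.
import numpy as np

def spectral_boundary_correction(features, boundaries, window=5):
    """Snap phone boundaries to local maxima of frame-to-frame spectral change.

    features   -- (n_frames, n_dims) array, e.g. MFCC vectors
    boundaries -- frame indices where the HMM alignment placed phone starts
    window     -- frames searched on either side of each boundary
    """
    # delta[i] is the spectral change between frame i and frame i + 1,
    # i.e. the evidence for a boundary at frame i + 1.
    delta = np.linalg.norm(np.diff(features, axis=0), axis=1)
    corrected = []
    for b in boundaries:
        lo = max(1, b - window)
        hi = min(len(delta), b + window)
        best = max(range(lo, hi + 1), key=lambda c: delta[c - 1])
        corrected.append(best)
    return corrected

if __name__ == "__main__":
    # Three steady synthetic "phones" with true boundaries at frames 30 and 70.
    rng = np.random.default_rng(0)
    feats = np.vstack([rng.normal(0.0, 0.1, size=(30, 13)),
                       rng.normal(3.0, 0.1, size=(40, 13)),
                       rng.normal(-2.0, 0.1, size=(30, 13))])
    hmm_boundaries = [27, 73]   # slightly misplaced HMM boundaries
    print(spectral_boundary_correction(feats, hmm_boundaries))  # [30, 70]
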
Automatic segmentation in speech synthesis,
September 4, 2007
Systems and methods for automatically segmenting speech inventories. A set of Hidden Markov Models (HMMs) are initialized using bootstrap data. The HMMs are next re-estimated and aligned to produce phone labels. The phone boundaries of the phone labels are then corrected using spectral boundary correction. Optionally, this process of using the spectral-boundary-corrected phone labels as input instead of the bootstrap data is performed iteratively in order to further reduce mismatches between manual labels and phone labels assigned by the HMM approach.