180 Park Ave - Building 103
Florham Park, NJ
At AT&T Labs - Research, we apply our speech, language and media technologies to give people with disabilities more independence, privacy and autonomy.
AT&T Natural Voices™ Text-to-Speech
Natural Voices is AT&T's state-of-the-art text-to-speech product, taking text and producing natural-sounding, synthesized speech in a variety of voices and languages.
System And Method For Configuring Voice Synthesis,
December 27, 2011
Systems and methods for providing synthesized speech in a manner that takes into account the environment where the speech is presented. A method embodiment includes: based on a listening environment and at least one other parameter, selecting an approach from a plurality of approaches for presenting synthesized speech in the listening environment; presenting synthesized speech according to the selected approach; and, based on natural language input received from a user indicating an inability to understand the presented synthesized speech, selecting a second approach from the plurality of approaches and presenting subsequent synthesized speech using the second approach.
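The selection-and-fallback loop in this abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and does not come from the patent: the three approaches, the 65 dB noise threshold, the `hearing_impaired` flag standing in for the "other parameter", and the keyword list used to detect that the listener did not understand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approach:
    name: str
    rate: float      # speaking-rate multiplier (hypothetical setting)
    gain_db: float   # output gain in dB (hypothetical setting)


# Hypothetical plurality of approaches, ordered from least to most emphatic.
APPROACHES = [
    Approach("normal", rate=1.0, gain_db=0.0),
    Approach("louder", rate=1.0, gain_db=6.0),
    Approach("louder_and_slower", rate=0.8, gain_db=6.0),
]

# Hypothetical phrases taken to mean "I could not understand that".
NOT_UNDERSTOOD = ("pardon", "say that again", "can't understand", "didn't catch")


def select_approach(noise_db: float, hearing_impaired: bool) -> int:
    """Pick an index into APPROACHES from the environment and one other parameter."""
    if hearing_impaired:
        return 2
    return 1 if noise_db >= 65.0 else 0


def indicates_not_understood(user_text: str) -> bool:
    """Crude keyword check standing in for natural language understanding."""
    text = user_text.lower()
    return any(phrase in text for phrase in NOT_UNDERSTOOD)


def next_approach(current: int, user_text: str) -> int:
    """Move to the next approach when the user signals trouble, if one remains."""
    if indicates_not_understood(user_text) and current + 1 < len(APPROACHES):
        return current + 1
    return current


current = select_approach(noise_db=80.0, hearing_impaired=False)
current = next_approach(current, "Sorry, I didn't catch that")
print(APPROACHES[current].name)
```

A real system would replace the keyword check with an actual language-understanding component and would measure the listening environment rather than take it as an argument.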
System And Method For Pronunciation Modeling,
December 6, 2011
Systems, computer-implemented methods, and tangible computer-readable media for generating a pronunciation model. The method includes identifying a generic model of speech composed of phonemes, identifying a family of interchangeable phonemic alternatives for a phoneme in the generic model of speech, labeling the family of interchangeable phonemic alternatives as referring to the same phoneme, and generating a pronunciation model which substitutes each family for each respective phoneme. In one aspect, the generic model of speech is a vocal tract length normalized acoustic model. Interchangeable phonemic alternatives can represent a same phoneme for different dialectal classes. An interchangeable phonemic alternative can include a string of phonemes.
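As a rough illustration of substituting a family of interchangeable alternatives for each phoneme, the Python sketch below expands one generic pronunciation into its variants. The `FAMILIES` table, its ARPAbet-like symbols, and the function names are invented for the example and are not taken from the patent; note that one alternative is a string of two phonemes, as the abstract allows.

```python
from itertools import product

# Hypothetical families: generic phoneme -> interchangeable alternatives.
# Each alternative is a tuple so that it can hold a string of phonemes.
FAMILIES = {
    "ay": [("ay",), ("aa",)],
    "er": [("er",), ("ah", "r")],
}


def substitute_families(pronunciation):
    """Replace each phoneme with its family; unlisted phonemes form a family of one."""
    return [FAMILIES.get(p, [(p,)]) for p in pronunciation]


def variants(pronunciation):
    """Enumerate every pronunciation obtained by picking one alternative per family."""
    result = []
    for choice in product(*substitute_families(pronunciation)):
        result.append([phoneme for alternative in choice for phoneme in alternative])
    return result


for v in variants(["b", "er", "d"]):
    print(" ".join(v))
```

Because every variant is generated from the same generic pronunciation, all of them stay labeled as the same underlying phoneme sequence, which is the point of grouping alternatives into families.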
System And Method Of Word Lattice Augmentation Using A Pre/Post Vocalic Consonant Distinction,
September 20, 2011
Systems and methods are provided for recognizing speech in a spoken dialogue system. The method includes receiving input speech having a pre-vocalic consonant or a post-vocalic consonant, generating at least one output lattice that calculates a first score by comparing the input speech to a training model to provide a result, and distinguishing between the pre-vocalic consonant and the post-vocalic consonant in the input speech. A second score is calculated by measuring a similarity between the pre-vocalic consonant or the post-vocalic consonant in the input speech and the first score. At least one category is determined for the pre-vocalic match or mismatch or the post-vocalic match or mismatch by using the second score, and the results of an automated speech recognition (ASR) system are refined by using the at least one category for the pre-vocalic match or mismatch or the post-vocalic match or mismatch.
System And Method Of Using Acoustic Models For Automatic Speech Recognition Which Distinguish Pre- And Post-Vocalic Consonants,
September 6, 2011
Disclosed are systems, methods and computer-readable media for training acoustic models for an automatic speech recognition (ASR) system. The method includes receiving a speech signal; defining at least one syllable boundary position in the received speech signal; based on the at least one syllable boundary position, generating for each consonant in a consonant phoneme inventory a pre-vocalic position label and a post-vocalic position label to expand the consonant phoneme inventory; reformulating a lexicon to reflect the expanded consonant phoneme inventory; and training a language model for the ASR system based on the reformulated lexicon.
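The inventory expansion and lexicon reformulation steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the syllable boundaries are taken as already given in the lexicon (the abstract derives them from the speech signal), and the vowel set, the `_pre`/`_post` suffixes, and the function names are hypothetical.

```python
# Hypothetical vowel set using ARPAbet-like symbols.
VOWELS = {"aa", "ae", "ah", "eh", "ih", "iy", "ow", "uw"}


def expand_inventory(consonants):
    """Give every consonant a pre-vocalic and a post-vocalic variant."""
    return sorted(c + suffix for c in consonants for suffix in ("_pre", "_post"))


def label_syllable(syllable):
    """Tag consonants by position relative to the vowel(s) of one syllable."""
    vowel_positions = [i for i, p in enumerate(syllable) if p in VOWELS]
    if not vowel_positions:
        return list(syllable)  # no vowel: leave the phonemes unlabeled
    first, last = vowel_positions[0], vowel_positions[-1]
    labeled = []
    for i, phoneme in enumerate(syllable):
        if phoneme in VOWELS:
            labeled.append(phoneme)
        elif i < first:
            labeled.append(phoneme + "_pre")
        elif i > last:
            labeled.append(phoneme + "_post")
        else:
            labeled.append(phoneme)  # consonant between two vowels: unlabeled
    return labeled


def reformulate_lexicon(lexicon):
    """Rewrite each syllabified entry using the expanded consonant inventory."""
    return {
        word: [p for syllable in syllables for p in label_syllable(syllable)]
        for word, syllables in lexicon.items()
    }


lexicon = {"tap": [["t", "ae", "p"]], "pat": [["p", "ae", "t"]]}
print(reformulate_lexicon(lexicon))
```

In this toy lexicon the words "tap" and "pat" no longer share any consonant symbols after relabeling, which shows how the expanded inventory lets models treat the same consonant differently by position.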
Method And System For Enhancing A Speech Database,
March 22, 2011
A system, method and computer-readable medium that enhance a speech database for speech synthesis are disclosed. The method may include labeling audio files in a primary speech database and a secondary speech database, enhancing the primary speech database by placing the labeled audio files from the secondary speech database into the primary speech database, and storing the enhanced primary speech database for use in speech synthesis.
System And Method For Configuring Voice Synthesis,
December 4, 2007
Systems and methods for providing synthesized speech in a manner that may take into account the environment where the speech is presented. In certain cases, the manner in which speech is presented can take into consideration ambient noise and/or can seek to optimize speech audibility.
Employing Speech Models In Concatenative Speech Synthesis,
September 27, 2005
A text-to-speech synthesizer employs a database that includes units. For each unit there is a collection of unit selection parameters and a plurality of frames. Each frame has a set of model parameters derived from a base speech frame, and a speech frame synthesized from the frame's model parameters. A text to be synthesized is converted to a sequence of desired unit feature sets, and for each such set the database is perused to retrieve a best-matching unit. An assessment is made whether modifications to the frames are needed, because of discontinuities in the model parameters at unit boundaries, or because of differences between the desired and selected unit features. When modifications are necessary, the model parameters of frames that need to be altered are modified, and new frames are synthesized from the modified model parameters and concatenated to the output. Otherwise, the speech frames previously stored in the database are retrieved and concatenated to the output.
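The reuse-or-resynthesize decision described above can be sketched in Python. This is a simplified illustration, not the patented method: it handles only discontinuities at unit boundaries (not differences between desired and selected features), the Euclidean distance, the averaging used as the "modification", the threshold, and the placeholder `synthesize_frame` are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    params: List[float]   # model parameters derived from a base speech frame
    speech: List[float]   # samples pre-synthesized from those parameters


@dataclass
class Unit:
    features: List[float]  # unit selection parameters
    frames: List[Frame]


def distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def select_unit(database: List[Unit], desired: List[float]) -> Unit:
    """Retrieve the unit whose features best match the desired feature set."""
    return min(database, key=lambda unit: distance(unit.features, desired))


def synthesize_frame(params: List[float]) -> List[float]:
    """Placeholder synthesizer (hypothetical): one sample per frame."""
    return [sum(params)]


def concatenate(database: List[Unit], targets: List[List[float]],
                threshold: float = 0.5) -> List[float]:
    """Reuse stored speech frames unless a unit boundary is discontinuous."""
    output: List[float] = []
    previous = None  # model parameters of the last frame appended so far
    for desired in targets:
        unit = select_unit(database, desired)
        for index, frame in enumerate(unit.frames):
            at_boundary = index == 0 and previous is not None
            if at_boundary and distance(previous, frame.params) > threshold:
                # Modify the parameters (here: average across the boundary)
                # and synthesize a new frame from the modified parameters.
                smoothed = [(p + q) / 2 for p, q in zip(previous, frame.params)]
                output.extend(synthesize_frame(smoothed))
            else:
                # No modification needed: reuse the stored speech frame.
                output.extend(frame.speech)
            previous = frame.params
    return output


database = [
    Unit(features=[0.0], frames=[Frame(params=[1.0], speech=[0.1])]),
    Unit(features=[1.0], frames=[Frame(params=[3.0], speech=[0.3])]),
]
print(concatenate(database, targets=[[0.0], [1.0]]))
```

The design point the sketch tries to capture is that resynthesis happens only where it is needed, so most of the output comes straight from stored speech frames.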
Method And System For Recorded Word Concatenation,
July 29, 2003
A method and system are provided for performing recorded word concatenation to create a natural-sounding sequence of words, numbers, phrases, sounds, etc. The method and system may include a tonal pattern identification unit that identifies tonal patterns, such as pitch accents, phrase accents and boundary tones, for utterances in a particular domain, such as telephone numbers, credit card numbers, the spelling of words, etc.; a script designer that designs a script for recording a string of words, numbers, sounds, etc., based on an appropriate rhythm and pitch range in order to obtain natural prosody for utterances in the particular domain and with minimum coarticulation between concatenative units; a script recorder that records a speaker's utterances of the domain strings; a recording editor that edits the recorded strings by marking the beginning and end of each word, number, etc. in the string and including or inserting pauses according to the tonal patterns; and a concatenation unit that concatenates the edited recording into a smooth and natural-sounding string of words, numbers, letters of the alphabet, etc., for audio output.
Fellow of the Acoustical Society of America, 2008.
National Academy of Engineering Gallery of Women Engineers, 2000.