
180 Park Ave - Building 103
Florham Park, NJ
Assistive Technology,
At AT&T Labs - Research, we apply our speech, language and media technologies to give people with disabilities more independence, privacy and autonomy.
AT&T Natural Voices™ Text-to-Speech,
Natural Voices is AT&T's state-of-the-art text-to-speech product, converting text into natural-sounding synthesized speech in a variety of voices and languages.
AT&T WATSON (SM) Speech Technologies,
AT&T WATSON (SM) integrates several speech technologies, including speech recognition. Its tools allow for tuning recognition, adapting language and acoustic models, and adding custom extensions.
Connecting Your World,
The need to be connected is greater than ever, and AT&T Researchers are creating new ways for people to connect with one another and with their environments, whether it's their home, office, or car.
Living rooms getting smarter with multimodal and multichannel signal processing
Dimitrios Dimitriadis, Horst Schroeter
IEEE SLTC Newsletter, 2011.
IEEE Copyright
This version of the work is reprinted here with permission of IEEE for your personal use; it is not for redistribution. The definitive version was published in the IEEE SLTC Newsletter, July 27, 2011.
System And Method To Search A Media Content Database Based On Voice Input Data,
January 22, 2013
A computer-implemented method includes initiating a call from an interactive voice response (IVR) system to a first device associated with a user in response to a user request. The computer-implemented method includes receiving voice input data at the IVR system via the call. The computer-implemented method also includes performing a search of a media content database based at least partially on the voice input data. The computer-implemented method further includes sending search results identifying media content items, based on the search of the media content database, to a second device associated with the user.
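The claimed flow lends itself to a short sketch. Below is a minimal Python illustration with hypothetical stand-ins for the recognizer, catalog, and device delivery (none of these names come from the patent): the IVR receives voice input over a call, searches a media catalog, and pushes results to a second device.

```python
# Hypothetical sketch of the claimed flow; all names are illustrative.
from dataclasses import dataclass

@dataclass
class MediaItem:
    title: str
    genre: str

CATALOG = [
    MediaItem("Nature Documentary", "documentary"),
    MediaItem("Evening News", "news"),
    MediaItem("Space Documentary", "documentary"),
]

def recognize_speech(audio: bytes) -> str:
    """Stand-in for the IVR's speech recognizer."""
    return "documentary"  # pretend the caller asked for documentaries

def search_catalog(query: str) -> list[MediaItem]:
    """Match the recognized query against titles and genres."""
    q = query.lower()
    return [m for m in CATALOG if q in m.title.lower() or q in m.genre]

def send_to_device(device_id: str, results: list[MediaItem]) -> None:
    """Stand-in for delivering results to the user's second device."""
    for item in results:
        print(f"[{device_id}] {item.title}")

def handle_user_request(audio_from_call: bytes, second_device_id: str) -> None:
    query = recognize_speech(audio_from_call)  # voice input via the call
    results = search_catalog(query)            # search the media database
    send_to_device(second_device_id, results)  # deliver to a second device

handle_user_request(b"<audio>", "living-room-stb")
```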
System And Method Of Dynamically Modifying A Spoken Dialog System To Reduce Hardware Requirements,
December 11, 2012
A system and method for providing a scalable spoken dialog system are disclosed. The method comprises receiving information which may be internal to the system or external to the system and dynamically modifying at least one module within a spoken dialog system according to the received information. The modules may be one or more of an automatic speech recognition, natural language understanding, dialog management and text-to-speech module or engine. Dynamically modifying the module may improve hardware performance or improve a specific caller's speech processing accuracy, for example. The modification of the modules or hardware may also be based on an application or a task, or based on a current portion of a dialog.
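As one way to picture the dynamic modification the abstract describes, here is a hedged Python sketch in which the "received information" is an internal CPU-load report and the modified module is the ASR engine; the classes and the threshold are illustrative assumptions, not the patented implementation.

```python
# Illustrative only: swap in a lighter ASR module when load crosses a
# threshold, one possible reading of "dynamically modifying" a module.
class FullASR:
    def recognize(self, audio):
        return "large-model transcript"

class CompactASR:
    def recognize(self, audio):
        return "small-model transcript"

class DialogSystem:
    def __init__(self):
        self.asr = FullASR()

    def on_load_report(self, cpu_load: float) -> None:
        # "Received information" here is internal (CPU load); it could
        # equally be external, e.g. per-caller accuracy statistics.
        self.asr = CompactASR() if cpu_load > 0.8 else FullASR()

    def handle_turn(self, audio) -> str:
        return self.asr.recognize(audio)

system = DialogSystem()
system.on_load_report(cpu_load=0.9)    # hardware under pressure
print(system.handle_turn(b"<audio>"))  # -> "small-model transcript"
```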
System And Method For A Communication Exchange With An Avatar In A Media Communication System,
November 20, 2012
A system that incorporates teachings of the present disclosure may include, for example, an Internet Protocol Television (IPTV) system having a controller to retrieve a user profile, cause a set-top box (STB) to present an avatar having characteristics that correlate to the user profile, receive from the STB one or more responses of the user, detect from the one or more responses a change in an emotional state of the user, adapt a search for media content according to the user profile and the detected change in the emotional state of the user, adapt a portion of the characteristics of the avatar relating to emotional feedback according to the user profile and the detected change in the emotional state of the user, and cause the STB to present the adapted avatar presenting content from a media content source identified from the adapted search for media content. Other embodiments are disclosed.
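The control flow can be sketched compactly. The following Python toy, with an assumed keyword-based emotion "detector" and made-up profile fields, shows the shape of the idea: detect a change in emotional state from a response, then adapt both the content search and the avatar's feedback.

```python
# Hedged sketch of the described control flow; the keyword detector and
# all field names are assumptions for illustration only.
def detect_emotion(response: str) -> str:
    negative = {"bored", "tired", "annoyed"}
    return "negative" if any(w in response.lower() for w in negative) else "positive"

def adapt_search(profile: dict, emotion: str) -> str:
    # Bias the media search toward the profile's comfort genre when the
    # detected emotional state turns negative.
    return profile["comfort_genre"] if emotion == "negative" else profile["default_genre"]

def adapt_avatar(avatar: dict, emotion: str) -> dict:
    # Adapt the avatar's emotional-feedback characteristics.
    avatar["expression"] = "sympathetic" if emotion == "negative" else "cheerful"
    return avatar

profile = {"default_genre": "news", "comfort_genre": "comedy"}
avatar = {"expression": "neutral"}

emotion = detect_emotion("I'm bored with this")
print(adapt_search(profile, emotion))  # -> comedy
print(adapt_avatar(avatar, emotion))   # -> {'expression': 'sympathetic'}
```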
System And Method For Audibly Presenting Selected Text,
August 7, 2012
Disclosed herein are methods for presenting speech from text selected on a computing device. The method includes presenting text on a touch-sensitive display at a size above a threshold level, so that the computing device can accurately determine the user's intent when the user touches the screen. Once a touch is received, the computing device identifies the portion of text to be selected and then presents that text audibly to the user.
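A minimal Python sketch of that idea, assuming a touch event with (x, y) coordinates, a page laid out as word boxes, and a TTS stub; the threshold value and all names are assumptions:

```python
MIN_POINT_SIZE = 14  # assumed threshold for reliable touch targeting

WORDS = [  # (text, x0, y0, x1, y1, point_size)
    ("Hello", 0, 0, 60, 20, 16),
    ("world", 70, 0, 130, 20, 16),
]

def word_at(x: int, y: int):
    """Map a touch point to the word box that contains it, if any."""
    for text, x0, y0, x1, y1, size in WORDS:
        if x0 <= x <= x1 and y0 <= y <= y1:
            return text, size
    return None

def speak(text: str) -> None:
    """Stand-in for a TTS engine."""
    print(f"TTS: {text}")

def on_touch(x: int, y: int) -> None:
    hit = word_at(x, y)
    if hit is None:
        return
    text, size = hit
    if size >= MIN_POINT_SIZE:  # text large enough to trust the touch
        speak(text)

on_touch(75, 10)  # -> TTS: world
```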
Methods And Apparatus To Present A Video Program To A Visually Impaired Person,
July 24, 2012
Methods and apparatus to present a video program to a visually impaired person are disclosed. An example method comprises receiving a video stream and an associated audio stream of a video program, detecting a portion of the video program that is not readily consumable by a visually impaired person, obtaining text associated with the portion of the video program, converting the text to a second audio stream, and combining the second audio stream with the associated audio stream.
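A rough sketch of that pipeline in Python; the silence-energy detector and the tone-generating TTS stub are assumptions for illustration, not the patented detection method:

```python
import numpy as np

def detect_quiet_span(audio: np.ndarray, thresh: float = 0.01) -> slice:
    """Return the first contiguous low-energy span (stand-in detector)."""
    quiet = np.abs(audio) < thresh
    start = int(np.argmax(quiet))
    end = start + int(np.argmax(~quiet[start:]) or len(audio) - start)
    return slice(start, end)

def synthesize(text: str, length: int) -> np.ndarray:
    """Stub TTS: returns a placeholder tone of the requested length."""
    t = np.arange(length)
    return 0.1 * np.sin(2 * np.pi * 220 * t / 16000)

def add_description(audio: np.ndarray, text: str) -> np.ndarray:
    span = detect_quiet_span(audio)              # portion not consumable by ear
    desc = synthesize(text, span.stop - span.start)
    mixed = audio.copy()
    mixed[span] += desc                          # combine the two audio streams
    return mixed

track = np.concatenate([np.zeros(8000), 0.5 * np.ones(8000)])
out = add_description(track, "A ship sails into the harbor.")
```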
System And Method For Presenting An Avatar,
April 17, 2012
A system that incorporates teachings of the present disclosure may include, for example, an avatar engine having a controller to retrieve a user profile, present a user an avatar having characteristics that correlate to the user profile, detect a change in a developmental growth of the user, adapt a portion of the characteristics of the avatar responsive to the detected change, and present the user the adapted avatar. Other embodiments are disclosed.
Method Of Providing Dynamic Speech Processing Services During Variable Network Connectivity,
April 3, 2012
A device for providing dynamic speech processing services during variable network connectivity with a network server includes a connection determiner that determines the level of network connectivity between the client device and the network server, and a simplified speech processor that processes speech data and is initiated based on the determination from the connection determiner that the network connectivity is impaired or unavailable. The device further includes a speech data storage that stores processed speech data from the simplified speech processor, and a transition unit that determines when to transmit the stored speech data and connects with the network server, based on the determination of the connection determiner.
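The four claimed components map naturally onto a small class. A minimal Python sketch, under assumed names:

```python
class Device:
    def __init__(self):
        self.pending = []            # speech data storage
        self._online = False

    def network_ok(self) -> bool:    # connection determiner (stubbed)
        return self._online

    def process(self, audio, online: bool) -> str:
        self._online = online
        if self.network_ok():
            self.flush()                       # transition unit: send backlog
            return self.server_recognize(audio)
        result = self.simple_recognize(audio)  # simplified local processor
        self.pending.append(result)            # store for later upload
        return result

    def flush(self) -> None:
        for item in self.pending:
            self.server_send(item)
        self.pending.clear()

    def simple_recognize(self, audio) -> str:
        return "coarse transcript"

    def server_recognize(self, audio) -> str:
        return "full transcript"

    def server_send(self, item) -> None:
        print(f"uploading: {item}")

d = Device()
print(d.process(b"a", online=False))  # coarse transcript, queued locally
print(d.process(b"b", online=True))   # uploads backlog, then full transcript
```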
Method And System For Training A Text-To-Speech Synthesis System Using A Specific Domain Speech Database,
March 13, 2012
A method and system are disclosed that train a text-to-speech synthesis system for use in speech synthesis. The method includes generating a speech database of audio files comprising domain-specific voices having various prosodies, and training a text-to-speech synthesis system using the speech database by selecting audio segments having a prosody based on at least one dialog state. The system includes a processor, a speech database of audio files, and modules for implementing the method.
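One way to read the selection step: the database stores the same phrase recorded with several prosodies, and the current dialog state picks among them. A toy Python sketch, where the state-to-prosody mapping and file names are illustrative assumptions:

```python
SPEECH_DB = {
    ("your flight is confirmed", "apologetic"): "confirm_apologetic.wav",
    ("your flight is confirmed", "neutral"): "confirm_neutral.wav",
    ("your flight is confirmed", "cheerful"): "confirm_cheerful.wav",
}

DIALOG_STATE_TO_PROSODY = {
    "after_error": "apologetic",  # user just hit a recognition failure
    "greeting": "cheerful",
    "default": "neutral",
}

def select_segment(text: str, dialog_state: str) -> str:
    """Pick the audio segment whose prosody matches the dialog state."""
    prosody = DIALOG_STATE_TO_PROSODY.get(dialog_state, "neutral")
    return SPEECH_DB[(text, prosody)]

print(select_segment("your flight is confirmed", "after_error"))
# -> confirm_apologetic.wav
```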
Coarticulation Method For Audio-Visual Text-To-Speech Synthesis,
December 13, 2011
A method for generating animated sequences of talking heads in text-to-speech applications wherein a processor samples a plurality of frames comprising image samples. The processor reads first data comprising one or more parameters associated with noise-producing orifice images of sequences of at least three concatenated phonemes which correspond to an input stimulus. The processor reads, based on the first data, second data comprising images of a noise-producing entity. The processor generates an animated sequence of the noise-producing entity.
Audio-Visual Selection Process For The Synthesis Of Photo-Realistic Talking-Head Animations,
August 2, 2011
A system and method for generating photo-realistic talking-head animation from a text input utilizes an audio-visual unit selection process. The lip-synchronization is obtained by optimally selecting and concatenating variable-length video units of the mouth area. The unit selection process utilizes the acoustic data to determine the target costs for the candidate images and utilizes the visual data to determine the concatenation costs. The image database is prepared in a hierarchical fashion, including high-level features (such as a full 3D modeling of the head, geometric size and position of elements) and pixel-based, low-level features (such as a PCA-based metric for labeling the various feature bitmaps).
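The two-cost selection is essentially dynamic programming over candidate units. Below is a hedged Python sketch with made-up 2-D feature vectors: target cost compares a candidate's acoustic features to the target frame, concatenation cost compares adjacent candidates' visual features, and a Viterbi-style search picks the cheapest path.

```python
import numpy as np

def unit_select(targets, candidates, acoustic, visual, w=1.0):
    """targets: acoustic feature vector per output frame.
    candidates: candidate unit ids per frame.
    acoustic[i], visual[i]: features of database unit i."""
    n = len(targets)
    cost = [{} for _ in range(n)]   # cost[t][c]: best cumulative cost
    back = [{} for _ in range(n)]   # backpointers for traceback
    for c in candidates[0]:
        cost[0][c] = np.linalg.norm(acoustic[c] - targets[0])  # target cost
    for t in range(1, n):
        for c in candidates[t]:
            tgt = np.linalg.norm(acoustic[c] - targets[t])
            best_prev, best = None, np.inf
            for p in candidates[t - 1]:
                join = w * np.linalg.norm(visual[c] - visual[p])  # concat cost
                if cost[t - 1][p] + join < best:
                    best, best_prev = cost[t - 1][p] + join, p
            cost[t][c] = best + tgt
            back[t][c] = best_prev
    last = min(cost[-1], key=cost[-1].get)   # trace back cheapest sequence
    path = [last]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# toy data: 3 frames, 4 database units with 2-D features
acoustic = {i: np.array(v, float) for i, v in enumerate([[0, 0], [1, 1], [2, 2], [3, 3]])}
visual = {i: np.array(v, float) for i, v in enumerate([[0, 0], [1, 0], [2, 0], [3, 0]])}
targets = [np.array([0.1, 0.1]), np.array([1.2, 0.9]), np.array([2.1, 2.0])]
cands = [[0, 1], [1, 2], [2, 3]]
print(unit_select(targets, cands, acoustic, visual))  # -> [0, 1, 2]
```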
System And Method For Blending Synthetic Voices,
June 21, 2011
A system and method for generating a synthetic text-to-speech (TTS) voice are disclosed. A user is presented with at least one TTS voice and at least one voice characteristic. A new synthetic TTS voice is generated by blending a plurality of existing TTS voices according to the selected voice characteristics. The blending of voices involves interpolating segmented parameters of each TTS voice. Segmented parameters may be, for example, prosodic characteristics of the speech such as pitch, volume, phone durations, accents, stress, mispronunciations, and emotion.
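Interpolating segment parameters is simple to illustrate. A minimal Python sketch, assuming per-segment pitch and duration values have already been extracted for each voice:

```python
def blend_voices(voice_a, voice_b, weight_a: float):
    """Linear interpolation of segment parameters; weight_a in [0, 1]."""
    blended = []
    for seg_a, seg_b in zip(voice_a, voice_b):
        blended.append({
            "phone": seg_a["phone"],
            "pitch": weight_a * seg_a["pitch"] + (1 - weight_a) * seg_b["pitch"],
            "duration": weight_a * seg_a["duration"] + (1 - weight_a) * seg_b["duration"],
        })
    return blended

voice_a = [{"phone": "aa", "pitch": 110.0, "duration": 90.0}]
voice_b = [{"phone": "aa", "pitch": 220.0, "duration": 60.0}]
print(blend_voices(voice_a, voice_b, weight_a=0.25))
# -> pitch 192.5 Hz, duration 67.5 ms: closer to voice B, as the weight says
```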
Voice-Enabled Dialog System,
January 11, 2011
A voice-enabled help desk service is disclosed. The service comprises an automatic speech recognition module for recognizing speech from a user, a spoken language understanding module for understanding the output from the automatic speech recognition module, a dialog management module for generating a response to speech from the user, a natural voices text-to-speech synthesis module for synthesizing speech to generate the response to the user, and a frequently asked questions module. The frequently asked questions module handles frequently asked questions from the user by changing voices and providing predetermined prompts to answer them.
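The module wiring can be sketched in a few lines of Python; the prompt table, keyword intent matcher, and voice labels below are assumptions, not the patented components.

```python
# Illustrative pipeline: ASR -> SLU -> either the FAQ module, which answers
# with a canned prompt in a different TTS voice, or the dialog manager.
FAQ_PROMPTS = {"reset password": "To reset your password, visit the account page."}

def asr(audio) -> str:
    return "how do I reset password"  # stubbed recognizer output

def slu(text: str) -> str:
    return next((k for k in FAQ_PROMPTS if k in text), "other")

def respond(audio) -> tuple[str, str]:
    intent = slu(asr(audio))
    if intent in FAQ_PROMPTS:
        return FAQ_PROMPTS[intent], "voice_b"  # FAQ module switches voice
    return "Let me connect you with an agent.", "voice_a"

text, voice = respond(b"<audio>")
print(f"[{voice}] {text}")
```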
Coarticulation Method For Audio-Visual Text-To-Speech Synthesis,
December 8, 2009
A method for generating animated sequences of talking heads in text-to-speech applications wherein a processor samples a plurality of frames comprising image samples. The processor reads first data comprising one or more parameters associated with noise-producing orifice images of sequences of at least three concatenated phonemes which correspond to an input stimulus. The processor reads, based on the first data, second data comprising images of a noise-producing entity. The processor generates an animated sequence of the noise-producing entity.
Service-Quality Text-To-Speech Synthesis System,
September 1, 2009
A system, method, and computer-readable medium that train a text-to-speech synthesis system for use in speech synthesis are disclosed. The method may include recording audio files of one or more live voices speaking language used in a specific domain, the audio files being recorded using various prosodies; storing the recorded audio files in a speech database; and training a text-to-speech synthesis system using the speech database, wherein the text-to-speech synthesis system selects audio segments having a prosody based on at least one dialog state and one speech act.
System and method for blending synthetic voices,
November 18, 2008
A system and method for generating a synthetic text-to-speech (TTS) voice are disclosed. A user is presented with at least one TTS voice and at least one voice characteristic. A new synthetic TTS voice is generated by blending a plurality of existing TTS voices according to the selected voice characteristics. The blending of voices involves interpolating segmented parameters of each TTS voice. Segmented parameters may be, for example, prosodic characteristics of the speech such as pitch, volume, phone durations, accents, stress, mispronunciations, and emotion.
Coarticulation method for audio-visual text-to-speech synthesis,
June 24, 2008
A method for generating animated sequences of talking heads in text-to-speech applications wherein a processor samples a plurality of frames comprising image samples. The processor reads first data comprising one or more parameters associated with noise-producing orifice images of sequences of at least three concatenated phonemes which correspond to an input stimulus. The processor reads, based on the first data, second data comprising images of a noise-producing entity. The processor generates an animated sequence of the noise-producing entity.
Coarticulation method for audio-visual text-to-speech synthesis,
October 3, 2006
A method for generating animated sequences of talking heads in text-to-speech applications wherein a processor samples a plurality of frames comprising image samples. Representative parameters are extracted from the image samples and stored in an animation library. The processor also samples a plurality of multiphones comprising images together with their associated sounds. The processor extracts parameters from these images comprising data characterizing mouth shapes, maps, rules, or equations, and stores the resulting parameters and sound information in a coarticulation library. The animated sequence begins with the processor considering an input phoneme sequence, recalling from the coarticulation library parameters associated with that sequence, and selecting appropriate image samples from the animation library based on that sequence. The image samples are concatenated together, and the corresponding sound is output, to form the animated synthesis.
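Condensed into code, the synthesis loop looks roughly like the following Python toy, whose library contents are invented for illustration: look up coarticulation parameters for each phoneme in context, map them to image samples in the animation library, and concatenate.

```python
# Invented data: phoneme-in-context -> mouth-shape parameter, and
# parameter -> image sample. The real libraries hold richer content
# (maps, rules, equations, sound information).
COARTICULATION_LIB = {
    ("sil", "h", "e"): 0.2,
    ("h", "e", "l"): 0.6,
    ("e", "l", "sil"): 0.4,
}
ANIMATION_LIB = {0.2: "frame_closed.png", 0.4: "frame_mid.png",
                 0.6: "frame_open.png"}

def synthesize(phonemes):
    frames, padded = [], ["sil"] + phonemes + ["sil"]
    for left, center, right in zip(padded, padded[1:], padded[2:]):
        param = COARTICULATION_LIB[(left, center, right)]  # recall parameters
        frames.append(ANIMATION_LIB[param])  # select image sample for context
    return frames  # concatenated with the corresponding audio in the patent

print(synthesize(["h", "e", "l"]))
```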
Synthesis-based pre-selection of suitable units for concatenative speech,
March 14, 2006
A method for generating concatenative speech uses a speech synthesis input to populate a triphone-indexed database that is later used for searching and retrieval to create a phoneme string acceptable for a text-to-speech operation. Prior to initiating the real time synthesis process, a database is created of all possible triphone contexts by inputting a continuous stream of speech. The speech data is then analyzed to identify all possible triphone sequences in the stream, and the various units chosen for each context. During a later text-to-speech operation, the triphone contexts in the text are identified and the triphone-indexed phonemes in the database are searched to retrieve the best-matched candidates.
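Both phases are easy to sketch. In the toy Python below (data invented for illustration), the offline pass indexes every triphone context seen in the speech stream, and the runtime pass retrieves the pre-selected candidates for each triphone context in the input text.

```python
from collections import defaultdict

def build_triphone_index(phoneme_stream):
    """Offline pass: index each unit by its triphone context."""
    index = defaultdict(list)
    for i in range(1, len(phoneme_stream) - 1):
        tri = tuple(phoneme_stream[i - 1:i + 2])
        index[tri].append(i)  # unit id chosen for this context
    return index

def preselect(text_phonemes, index):
    """Runtime pass: retrieve pre-selected candidates per triphone."""
    candidates = []
    for i in range(1, len(text_phonemes) - 1):
        tri = tuple(text_phonemes[i - 1:i + 2])
        candidates.append(index.get(tri, []))  # best-matched candidates
    return candidates

stream = ["sil", "k", "ae", "t", "sil", "k", "ae", "n", "sil"]
index = build_triphone_index(stream)
print(preselect(["sil", "k", "ae", "t", "sil"], index))
# -> [[1, 5], [2], [3]]: two candidates exist for the (sil, k, ae) context
```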
Coarticulation method for audio-visual text-to-speech synthesis,
December 9, 2003
A method for generating animated sequences of talking heads in text-to-speech applications wherein a processor samples a plurality of frames comprising image samples. Representative parameters are extracted from the image samples and stored in an animation library. The processor also samples a plurality of multiphones comprising images together with their associated sounds. The processor extracts parameters from these images comprising data characterizing mouth shapes, maps, rules, or equations, and stores the resulting parameters and sound information in a coarticulation library. The animated sequence begins with the processor considering an input phoneme sequence, recalling from the coarticulation library parameters associated with that sequence, and selecting appropriate image samples from the animation library based on that sequence. The image samples are concatenated together, and the corresponding sound is output, to form the animated synthesis.
Audio-visual selection process for the synthesis of photo-realistic talking-head animations,
November 25, 2003
A system and method for generating photo-realistic talking-head animation from a text input utilizes an audio-visual unit selection process. The lip-synchronization is obtained by optimally selecting and concatenating variable-length video units of the mouth area. The unit selection process utilizes the acoustic data to determine the target costs for the candidate images and utilizes the visual data to determine the concatenation costs. The image database is prepared in a hierarchical fashion, including high-level features (such as a full 3D modeling of the head, geometric size and position of elements) and pixel-based, low-level features (such as a PCA-based metric for labeling the various feature bitmaps).
Automatic Detection Of Non-Stationarity In Speech Signals,
March 18, 2003
When it is necessary to time-scale a speech signal, it is advantageous to do so under the influence of a signal that measures the small-window non-stationarity of the speech signal. Three measures of stationarity are disclosed: one based on time-domain analysis, one based on frequency-domain analysis, and one based on both time- and frequency-domain analysis.
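The abstract does not give the measures themselves, so the Python sketch below is an assumed time-domain formulation only: compare short-term energy in adjacent half-windows, so steady segments score near zero and abrupt transitions score near one.

```python
import numpy as np

def nonstationarity(x: np.ndarray, win: int = 256) -> np.ndarray:
    """Small-window non-stationarity score per hop (assumed formulation)."""
    scores = []
    for start in range(0, len(x) - win, win // 2):
        w = x[start:start + win]
        e1 = np.mean(w[:win // 2] ** 2)  # energy, first half-window
        e2 = np.mean(w[win // 2:] ** 2)  # energy, second half-window
        scores.append(abs(e1 - e2) / (e1 + e2 + 1e-12))
    return np.array(scores)

# steady tone vs. an abrupt onset
t = np.arange(2048) / 8000.0
steady = np.sin(2 * np.pi * 200 * t)
onset = np.concatenate([np.zeros(1024), steady[:1024]])
print(nonstationarity(steady).max())  # near 0: safe to time-scale
print(nonstationarity(onset).max())   # near 1: avoid scaling here
```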
Signal dependent speech modifications,
November 27, 2001
Speech signals, and similar one-dimensional signals, are time-scaled, interpolated, and/or smoothed, when necessary, under the influence of a signal that is sensitive to the small-window stationarity of the signal being modified. Three measures of stationarity are disclosed: one based on time-domain analysis, one based on frequency-domain analysis, and one based on both time- and frequency-domain analysis.
IEEE Fellow, 2002.
For contributions to text-to-speech synthesis technology.
Science & Technology Medal, 2001.
Honored for significant contributions to the creation and development of a text-to-speech (TTS) synthesis system.