AT&T WATSON (SM) Speech Technologies

Natural Voices text to speech

AT&T WATSON (SM) takes input, analyzes it, performs one or more services, and returns a result, all in real time.
Input can be audio files, speech, gestures, facial images, and text.


AT&T WATSON (SM) is now available as an API. Go to the AT&T Developer Program site and register an account for the Speech API.
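As a rough illustration, a speech-to-text request to a REST speech API of this kind might be assembled as below. The endpoint URL, audio format, and helper function are placeholders for illustration, not the documented AT&T Speech API values; consult the Developer Program documentation for the real ones.

```python
import urllib.request

# Placeholder endpoint, not the actual AT&T Speech API URL.
API_URL = "https://api.example.com/speech/v1/speechToText"

def build_transcription_request(token, audio_bytes):
    """Build an HTTP request that submits audio for transcription.

    `token` is an OAuth access token obtained after registering for the
    Speech API; `audio_bytes` is the raw contents of an audio file.
    """
    return urllib.request.Request(
        API_URL,
        data=audio_bytes,
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "audio/wav",
            "Accept": "application/json",
        },
        method="POST",
    )
```

The request would then be sent with `urllib.request.urlopen()` and the JSON response parsed for the recognized text.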


AT&T WATSON (SM) converts between different communication modalities, allowing humans and devices to interact more readily. It consists of a general-purpose engine and a collection of plugins, each of which performs a conversion or analysis task. These tasks, many involving speech and language, can be combined in various ways, depending on what information is being communicated.

One common use of WATSON is to convert human speech to text that can be readily interpreted by a device or other machine. In this case, the output might be simple text, or WATSON can perform the additional step of parsing the text so the human’s intent can be determined and communicated to the device. It works the other way, too; WATSON can take content generated by a machine and convert it to speech or text for humans to understand.


WATSON not only converts speech to text but can also combine speech with other modalities, such as a touch-screen tap (“show me the closest Starbucks, here”) or other gesture, and send the combined information to a device. WATSON also performs speech-to-speech translation, even across multiple languages: speech input in one language is converted to text in real time, the text is translated with little delay, and the translated sentence is spoken once the sentence ends.
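The translation flow described above can be pictured as a chain of stages run once per completed sentence. In this sketch the recognizer, translator, and synthesizer are trivial stand-ins for the real WATSON plugins, and all interfaces are invented for illustration.

```python
def recognize(audio_sentence):
    """Stand-in ASR: pretend the audio is already the source-language text."""
    return audio_sentence

def translate(text, src, dst):
    """Stand-in MT: a tiny phrase table replaces a real translation model."""
    phrase_table = {("hello", "en", "es"): "hola",
                    ("goodbye", "en", "es"): "adios"}
    return phrase_table.get((text, src, dst), text)

def synthesize(text):
    """Stand-in TTS: tag the text instead of producing audio."""
    return "<audio:%s>" % text

def speech_to_speech(audio_sentences, src="en", dst="es"):
    """Run each completed sentence through ASR -> MT -> TTS in order."""
    for audio in audio_sentences:
        text = recognize(audio)                 # speech to text, in real time
        translated = translate(text, src, dst)  # text translation
        yield synthesize(translated)            # spoken output at sentence end

print(list(speech_to_speech(["hello", "goodbye"])))
# -> ['<audio:hola>', '<audio:adios>']
```

Because the pipeline is a generator, each translated sentence can be delivered as soon as it is complete rather than after the whole input ends, mirroring the low-delay behavior described above.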

The diversity of possibilities on a single platform is due to a plugin architecture where each subtask is contained in its own plugin. Depending on the task to be performed, WATSON selects the right plugins at run time, assembles them into a working engine, and coordinates the information exchange between the plugins. It also takes care of feeding the input media into the engine and forwarding partial or final results to the end device.
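A plugin architecture of this kind can be sketched as a registry from which an engine is assembled at run time. The plugin names and interfaces below are invented for illustration, not WATSON's actual API.

```python
PLUGINS = {}

def plugin(name):
    """Register a processing function under a plugin name."""
    def register(fn):
        PLUGINS[name] = fn
        return fn
    return register

@plugin("asr")
def asr(data):
    return data.upper()          # stand-in for speech recognition

@plugin("parse")
def parse(data):
    return {"intent": data}      # stand-in for language understanding

def build_engine(task_plugins):
    """Select the right plugins for the task and chain them into an engine."""
    stages = [PLUGINS[name] for name in task_plugins]
    def engine(media):
        for stage in stages:     # each stage's output feeds the next
            media = stage(media)
        return media
    return engine

engine = build_engine(["asr", "parse"])
print(engine("book a table"))    # -> {'intent': 'BOOK A TABLE'}
```

Swapping a plugin, or building an engine for a different task, only changes the list of names passed to the assembler; the coordination logic stays the same.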

While the architecture makes it easy to swap out plugins, the plugins themselves can contain extremely sophisticated “subtasks,” including highly accurate, speaker-independent automatic speech recognition (ASR). WATSON also integrates AT&T Labs Natural Voices® text-to-speech conversion, natural language understanding (which includes machine learning), and dialog management tasks.

WATSON has been used within AT&T for IVR customers, including AT&T's VoiceTone® service, for over 20 years, during which time the ASR algorithms, tools, and plugin architecture have been refined for greater accuracy, convenience, and ease of integration. Besides customer-care IVR, AT&T WATSON (SM) has been used for speech analytics, speech translation (including the AT&T Translator app), mobile voice search of multimedia data, video search, voice remote, voice mail to text, web search, and SMS.

Increasingly, AT&T WATSON (SM) is being integrated into web-based, speech-enabled devices and services under development in Research, including speech mashups and the AT&T WATSON Mobile Speech Services Platform (WMSSP), which currently supports Speak4it (local business search) and will support future production applications.

For licensing information, send an email to watsonip at research dot att dot com.  

Technical Documents

Speech technologies for interactive mobile applications – a primer
Jason Williams
Joint Workshop of the Association for Voice Interaction Design (AVIxD) and the Interaction Design Association (IxDA), New York City,  2011.  [PDF]  [BIB]

An Alternative Frontend for the AT&T WATSON LV-CSR System
Dimitrios Dimitriadis, Enrico Bocchieri, Diamantino Caseiro
International Conference on Acoustics, Speech and Signal Processing,  2011.  [BIB]

Speech Recognition Modeling Advances For Mobile Voice Search
Enrico Bocchieri, Diamantino Caseiro, Dimitrios Dimitriadis
International Conference on Acoustics, Speech and Signal Processing,  2010.  [BIB]

The AT&T Statistical Dialog Toolkit V1.0
Jason Williams
IEEE Speech and Language Processing Technical Committee Newsletter,  2010.  [BIB]

Project Members

Mazin Gilbert

Jay Wilpon

Horst Schroeter

Vincent Goffin

Iker Arizmendi


Related Projects

Project Space

AT&T Application Resource Optimizer (ARO) - For energy-efficient apps

Assistive Technology

CHI Scan (Computer Human Interaction Scan)

CoCITe – Coordinating Changes in Text

Connecting Your World



E4SS - ECharts for SIP Servlets

Scalable Ad Hoc Wireless Geocast

AT&T 3D Lab

Graphviz System for Network Visualization

Information Visualization Research - Prototypes and Systems

Swift - Visualization of Communication Services at Scale

Smart Grid

Speech Mashup

Omni Channel Analytics

Speech translation

StratoSIP: SIP at a Very High Level


Content Augmenting Media (CAM)

Content-Based Copy Detection

Content Acquisition Processing, Monitoring, and Forensics for AT&T Services (CONSENT)

Content Analytics - distill content into visual and statistical representations

MIRACLE and the Content Analysis Engine (CAE)

Social TV - View and Contribute to Public Opinions about Your Content Live

Visual API - Visual Intelligence for your Applications

Enhanced Indexing and Representation with Vision-Based Biometrics

Visual Semantics for Intuitive Mid-Level Representations

eClips - Personalized Content Clip Retrieval and Delivery

iMIRACLE - Content Retrieval on Mobile Devices with Speech

Wireless Demand Forecasting, Network Capacity Analysis, and Performance Optimization