AT&T WATSON℠ takes input, analyzes it, performs one or more services, and returns a result, all in real time.
Input can be audio files, speech, gestures, face recognition, and text.
AT&T WATSON℠ is now available as an API. Go to the AT&T Developer program site and register for an account to use the Speech API.
AT&T WATSON℠ converts between different communication modalities, allowing humans and devices to interact more readily. It consists of a general-purpose engine and a collection of plugins, each of which performs a conversion or analysis task. These tasks, many involving speech and language, can be combined in various ways, depending on what information is being communicated.
One common use of WATSON is to convert human speech to text that can be readily interpreted by a device or other machine. In this case, the output might be simple text, or WATSON can perform the additional step of parsing the text so the human’s intent can be determined and communicated to the device. It works the other way, too; WATSON can take content generated by a machine and convert it to speech or text for humans to understand.
Essentially WATSON takes some input, analyzes it, performs one or more services, and returns a result, all in real time.
WATSON can not only convert speech to text but can also combine speech with other modalities, such as a touch-screen tap (“show me the closest Starbucks, here”) or other gesture, and send the information to a device. WATSON also performs speech-to-speech translation, including across multiple languages: speech input in one language is converted to text in real time, the text is then translated (with little delay), and the translated sentence is spoken at sentence end.
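The speech-to-speech flow described above can be sketched as a small streaming loop. This is only an illustration under stated assumptions: the function names, the toy word-for-word lexicon, the "." sentence-end token, and the string stand-in for synthesized audio are all hypothetical and are not part of the WATSON API.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical word-for-word lexicon standing in for a real translation plugin.
TOY_LEXICON = {"hello": "hola", "world": "mundo"}


def translate(sentence: str) -> str:
    """Toy translation: look up each word, keep unknown words unchanged."""
    return " ".join(TOY_LEXICON.get(word, word) for word in sentence.split())


def synthesize(text: str) -> str:
    """Stand-in for text-to-speech: returns a tagged string, not audio."""
    return f"<audio:{text}>"


def speech_to_speech(words: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Emit partial text as words arrive; emit speech at each sentence end.

    `words` stands in for a recognizer's output stream. A "." token is
    assumed to mark the end of a sentence.
    """
    buffer: List[str] = []
    for word in words:
        if word == ".":
            if buffer:
                sentence = " ".join(buffer)
                yield ("speech", synthesize(translate(sentence)))
                buffer = []
        else:
            buffer.append(word)
            yield ("partial_text", " ".join(buffer))


events = list(speech_to_speech(["hello", "world", "."]))
for kind, value in events:
    print(kind, value)
```

Partial text is emitted as each word arrives, while the translated sentence is only "spoken" once the sentence-end token is seen, mirroring the ordering described in the paragraph.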
The diversity of possibilities on a single platform is due to a plugin architecture where each subtask is contained in its own plugin. Depending on the task to be performed, WATSON selects the right plugins at run time, assembles them into a working engine, and coordinates the information exchange between the plugins. It also takes care of feeding the input media into the engine and forwarding partial or final results to the end device.
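As a rough illustration of this plugin idea, the sketch below selects plugins by task name at run time, chains them into an engine, and passes a shared payload from one plugin to the next. Every name here (`Engine`, `REGISTRY`, `TASKS`, `fake_asr`, `fake_nlu`) is hypothetical, and the recognizer and intent parser are trivial stand-ins; this is not how WATSON itself is implemented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical plugin type: takes a payload dict and returns an updated copy.
Plugin = Callable[[dict], dict]


def fake_asr(payload: dict) -> dict:
    """Stand-in recognizer: treats the audio bytes as the UTF-8 transcript."""
    return {**payload, "text": payload["audio"].decode("utf-8")}


def fake_nlu(payload: dict) -> dict:
    """Stand-in intent parser based on a single keyword."""
    text = payload["text"].lower()
    intent = "local_search" if "closest" in text else "unknown"
    return {**payload, "intent": intent}


# Assumed registry of available plugins and the tasks built from them.
REGISTRY: Dict[str, Plugin] = {"asr": fake_asr, "nlu": fake_nlu}
TASKS: Dict[str, List[str]] = {
    "speech_to_text": ["asr"],
    "speech_to_intent": ["asr", "nlu"],
}


@dataclass
class Engine:
    plugins: List[Plugin]

    def run(self, payload: dict) -> dict:
        # Feed the input through each plugin in order, passing results along.
        for plugin in self.plugins:
            payload = plugin(payload)
        return payload


def build_engine(task: str) -> Engine:
    """Select the plugins for a task at run time and assemble an engine."""
    return Engine([REGISTRY[name] for name in TASKS[task]])


result = build_engine("speech_to_intent").run(
    {"audio": b"show me the closest Starbucks"}
)
print(result["text"], "->", result["intent"])
```

Swapping a plugin in this sketch only means changing an entry in the registry; the engine and the other plugins stay untouched.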
While the architecture makes it easy to swap out plugins, the plugins themselves can contain extremely sophisticated “subtasks” including highly accurate, speaker-independent automatic speech recognition (ASR). WATSON also integrates AT&T Labs Natural Voices® text-to-speech conversion, natural language understanding (which includes machine learning), and dialog management tasks.
WATSON has been used within AT&T for IVR customers, including AT&T's VoiceTone® service, for over 20 years, during which time the ASR algorithms, tools, and plugin architecture have been refined to increase accuracy, convenience, and integration. Besides customer care IVR, AT&T WATSON℠ has been used for speech analytics, speech translation (including the AT&T Translator app), mobile voice search of multimedia data, video search, voice remote, voice mail to text, web search, and SMS.
Increasingly, AT&T WATSON℠ is being integrated into web-based, speech-enabled devices and services being developed in Research, including the speech mashups and the AT&T WMSSP (WATSON Mobile Speech Services Platform), which currently supports Speak4it (local business search) and will support future production applications.
For licensing information, send an email to .