Building Speech Applications Just Got Easier: AT&T Releases Speech API
Speech interfaces are easy to use, don’t require typing on small, cumbersome keyboards, and allow people to multitask. No wonder they are popular with users.
But for developers, speech technologies are complex and computationally intensive. They require huge volumes of representative data and a mix of specialized expertise in signal processing, acoustic modeling, and advanced statistical methods. Traditionally, building a speech interface has been a major undertaking.
No longer. AT&T, long a pioneer in speech technology, has released the Speech API, which is based on the AT&T WATSON℠ speech recognition capability. Now any developer who wants to create voice-enabled apps and interfaces has an easy way to incorporate accurate, fast speech recognition.
AT&T WATSON℠ represents AT&T’s core speech technology that has been at the heart of the company’s own dialog systems, empowering directed and natural-language dialog IVR applications through VoiceTone℠. The technology has also been licensed to third parties and is currently used in all Vlingo products and on the Samsung Galaxy S II smartphone. AT&T WATSON℠ is also increasingly being deployed on cloud-based servers to perform speech recognition for dozens of apps, including Speak4It (location-based business search) and AT&T Translator, with more apps coming soon.
Decades of research have gone into creating and refining AT&T WATSON℠, resulting in a speech recognition capability that is both fast and accurate. Independent third parties that have evaluated and benchmarked AT&T WATSON℠ consistently find it to be more accurate than other speech engines. (Vlingo was one of the companies evaluating AT&T WATSON℠, and in this interview, the founder of Vlingo, Mike Phillips, discusses why he chose AT&T WATSON℠ for Vlingo’s speech-enabled interfaces on mobile devices.)
For the first time, AT&T is making WATSON speech recognition available to outside developers who want to build voice-enabled apps and services, including Siri-like virtual-assistant interfaces. By performing fast, accurate speech recognition, AT&T WATSON℠ makes building voice interfaces easier for developers.
“Our goal,” said Mazin Gilbert, assistant vice-president of research, “is to make our speech and language technologies available to the masses so that every application or service can be empowered with AT&T WATSON℠ with minimal effort.”
The API provides seven speech contexts, each optimized for a specific task: web search, local business search, question-and-answer, voice mail to text, SMS, AT&T’s U-verse® video programming guide, and general-purpose dictation trained on more than 1 billion words.
Each context is implemented as a simple HTTP URL, and each comes equipped with a vocabulary specialized for a task (smaller, more focused vocabularies increase the recognition accuracy and speed by reducing the search space).
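As a rough illustration of the one-URL-per-context idea, the Python sketch below maps the seven contexts named above to endpoint URLs. The base URL and the URL slugs are hypothetical placeholders chosen for this example, not the actual endpoints; the real URLs are listed in the API documentation.

```python
from urllib.parse import quote

# Hypothetical base URL (an assumption for illustration only).
BASE_URL = "https://speech.example.com/contexts"

# The seven contexts named in the article. The slugs on the right are
# illustrative assumptions, not the API's actual path names.
CONTEXTS = {
    "web search": "web-search",
    "local business search": "business-search",
    "question-and-answer": "question-answer",
    "voice mail to text": "voicemail",
    "SMS": "sms",
    "U-verse programming guide": "uverse-guide",
    "general-purpose dictation": "generic",
}


def context_url(context: str) -> str:
    """Return the (hypothetical) URL for a named speech context."""
    try:
        slug = CONTEXTS[context]
    except KeyError:
        raise ValueError(f"unknown speech context: {context!r}") from None
    return f"{BASE_URL}/{quote(slug)}"
```

An app that only handles text messages would, under these assumptions, call `context_url("SMS")` and send all of its audio to that single URL, benefiting from the smaller task-specific vocabulary.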
Also being released is the Speech Kit SDK, which automates the capture of audio from iOS and Android devices, thus making it unnecessary to implement the low-level details of how the device encodes audio.
This is how it works: once a developer has incorporated the API function calls, the app transmits audio to cloud-based WATSON servers, which convert the speech to text, parse the meaning of the text, and assemble queries to search the appropriate database. The recognized text, as spoken by the user, is then returned to the app, in some cases with additional results.
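The round trip described above can be sketched in Python using only the standard library. This is a minimal sketch under stated assumptions: the request headers, the JSON response shape, and the `"text"` field name are hypothetical and chosen for illustration; the actual request and response formats are defined in the API documentation.

```python
import json
import urllib.request


def build_recognition_request(
    url: str, audio: bytes, content_type: str = "audio/wav"
) -> urllib.request.Request:
    """Build (but do not send) an HTTP POST carrying raw audio.

    The header choices here are assumptions for illustration.
    """
    return urllib.request.Request(
        url,
        data=audio,
        headers={"Content-Type": content_type, "Accept": "application/json"},
        method="POST",
    )


def extract_text(response_body: bytes) -> str:
    """Pull the recognized text out of a response body.

    Assumes a JSON object with a "text" field (hypothetical format).
    """
    payload = json.loads(response_body)
    return payload["text"]
```

Sending the request would be a single call such as `urllib.request.urlopen(request)`, after which the response bytes could be passed to `extract_text`. Authentication is omitted here because the article does not describe it.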
More API releases are planned, including gaming and social media contexts. Enhancements and improvements will continually be added and made available.
To get started: Go to the AT&T Developer Program site and register an account to access all seven speech contexts and the API documentation.
For technical information about the WATSON technology, email .