Speech Mashup


Speech mashup = Web app + speech

Speech mashups give web developers an easy way to add a speech interface to their web apps, so that users can issue voice commands and receive spoken responses. All speech and language processing, including automatic speech recognition (ASR), text-to-speech (TTS) conversion, and natural language processing, is performed on the AT&T network, where servers run the AT&T WATSON (SM) ASR and the AT&T Natural Voices (TM) TTS, the same speech technology used by AT&T's enterprise customers.

Speech mashups work as follows. Audio or text from a mobile device or a web browser is relayed over the cell network to the speech mashup manager, which manages the entire process. The manager accesses the AT&T servers where the speech and language processing takes place, and then relays the result (interpreted into a programming-language representation) to the web application. If the application's result is to be spoken, the speech mashup manager sends it for TTS conversion before relaying the spoken response back to the user.

All processing steps are tightly integrated to minimize the number of round trips in the mobile network, which reduces latency and improves the user experience.
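The flow described above can be sketched as a small simulation. This is only an illustration under stated assumptions: the class `SpeechMashupManager`, the `MashupResult` type, and the stub functions `fake_asr`, `fake_nlu`, `fake_tts`, and `demo_app` are hypothetical names invented here, and the stubs stand in for the AT&T servers rather than reproducing how they work.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass
class MashupResult:
    text: str                      # reply produced by the web application
    audio: Optional[bytes] = None  # synthesized speech, only if requested


class SpeechMashupManager:
    """Hypothetical stand-in for the manager: it only orchestrates calls."""

    def __init__(self,
                 asr: Callable[[bytes], str],
                 nlu: Callable[[str], dict],
                 tts: Callable[[str], bytes]) -> None:
        self.asr, self.nlu, self.tts = asr, nlu, tts

    def handle(self,
               payload: Union[bytes, str],
               web_app: Callable[[dict], str],
               speak: bool = False) -> MashupResult:
        # Audio input goes through recognition; text input skips it.
        transcript = self.asr(payload) if isinstance(payload, bytes) else payload
        # The interpreted result is handed to the web application.
        reply = web_app(self.nlu(transcript))
        # A spoken reply is synthesized only when the caller asks for one.
        return MashupResult(reply, self.tts(reply) if speak else None)


# Stub services (assumptions for illustration, not real speech processing).
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the bytes are the spoken words


def fake_nlu(text: str) -> dict:
    words = text.lower().split()
    return {"intent": words[0] if words else "", "args": words[1:]}


def fake_tts(text: str) -> bytes:
    return b"AUDIO:" + text.encode("utf-8")


def demo_app(interpretation: dict) -> str:
    return "Searching for " + " ".join(interpretation["args"])


manager = SpeechMashupManager(fake_asr, fake_nlu, fake_tts)
spoken = manager.handle(b"find pizza nearby", demo_app, speak=True)
typed = manager.handle("find coffee", demo_app)
```

In this sketch the client makes a single call to the manager, and the manager chains recognition, interpretation, the web application, and synthesis on its side. That mirrors the stated goal of keeping round trips over the mobile network to a minimum.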

Building a speech mashup for a mobile device (any network-enabled device with audio input) requires the following:

1. Registering at the speech mashup portal (http://service.research.att.com/smm/) for an account on AT&T servers, and creating a directory for the web app and related files (grammars, log files, etc.).

2. Creating and uploading grammars or using a built-in or shared grammar (ASR applications only).

3. Building a speech mashup client in any suitable programming language (Java, JavaScript, etc.). Three sample clients are available to download and modify.

A developer's guide with instructions and examples is available from the portal.
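As a rough idea of what step 3 might involve, the sketch below builds (but does not send) an HTTP request that carries recorded audio to a speech mashup manager. Everything specific in it is an assumption made for illustration: the endpoint `MANAGER_URL`, the query parameter names `account` and `grammar`, the `audio/wav` content type, and the function `build_asr_request` are hypothetical. The actual interface is defined in the developer's guide.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint; the real URL and parameters come from the
# developer's guide, not from this sketch.
MANAGER_URL = "https://example.invalid/smm/recognize"


def build_asr_request(audio: bytes,
                      account_id: str,
                      grammar: str,
                      content_type: str = "audio/wav") -> Request:
    """Build a POST request carrying audio plus the account and grammar names."""
    # Parameter names are assumed for illustration only.
    query = urlencode({"account": account_id, "grammar": grammar})
    return Request(
        url=f"{MANAGER_URL}?{query}",
        data=audio,                              # raw audio bytes as the body
        headers={"Content-Type": content_type},  # tells the server the audio format
        method="POST",
    )


request = build_asr_request(b"\x00\x01", "demo-user", "pizza")
```

A real client would pass the request to `urllib.request.urlopen` and parse the response. That step is left out here because the response format is not described on this page.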



Project Members

Giuseppe Di Fabbrizio

Jay Wilpon

Danilo Giulianelli

Yeon-jun Kim

Amanda Stent
