Speech translation


Language barriers inhibit the free exchange of information and ideas and restrict travel and mobility. Fast, accurate speech-to-speech translation can bridge language gaps, allowing tourists to communicate easily when traveling abroad and enabling healthcare and legal professionals to communicate with people who speak a different language.

Speech-to-speech translation has almost unlimited market potential, especially if translation is as smooth and as natural as speech itself. With this goal, AT&T Research is developing a real-time speech-to-speech translation technology that begins translating as soon as it detects speech, avoiding the latency incurred by waiting for an utterance to complete before translating. The speed gain comes from combining the underlying technologies (automatic speech recognition, natural language understanding, machine learning, and speech synthesis) into a single step, omitting the usual intermediate step of translating source-language phonemes to target-language text.

This tight coupling is possible because AT&T Research can integrate its own highly accurate and sophisticated automatic speech recognition (AT&T WATSON™ ASR) and natural-sounding text-to-speech (Natural Voices) at a low level, so that the output of one becomes the input of the next.
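The latency argument can be illustrated with a minimal Python sketch. It is not AT&T's implementation: the toy lexicon, the fixed chunk size, and the function names (`translate_chunk`, `translate_incrementally`, `translate_after_utterance`) are all assumptions made for illustration, and word-by-word lookup stands in for the real statistical models. The point is only that an incremental translator can emit output before the speaker has finished.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical toy lexicon; a real system would use statistical models.
TOY_LEXICON: Dict[str, str] = {"hello": "hola", "my": "mi", "friend": "amigo"}


def translate_chunk(words: List[str]) -> str:
    """Translate a chunk word by word; unknown words pass through unchanged."""
    return " ".join(TOY_LEXICON.get(w, w) for w in words)


def translate_incrementally(stream: Iterable[str], chunk_size: int = 2) -> Iterator[str]:
    """Yield a translated chunk as soon as chunk_size words have arrived."""
    buffer: List[str] = []
    for word in stream:
        buffer.append(word)
        if len(buffer) == chunk_size:
            yield translate_chunk(buffer)
            buffer = []
    if buffer:  # flush the remainder once the utterance ends
        yield translate_chunk(buffer)


def translate_after_utterance(stream: Iterable[str]) -> str:
    """Baseline: wait for the whole utterance before translating anything."""
    return translate_chunk(list(stream))
```

With the input words `hello`, `my`, `friend`, the incremental version produces its first chunk after only two words have been consumed, while the baseline cannot produce anything until the whole utterance is available.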

At the same time, accuracy is maintained in several ways. First, AT&T translation does not rely on a single recognition. Instead, it considers all possible recognitions (and the additional information they contain) and draws on other information sources, including contextual knowledge from a natural language understanding module, to decide which recognition is correct for the current context.

By not committing early to a single recognition that might be wrong (even if it scored the highest), the AT&T system avoids a common translation problem, error magnification, in which an early error such as a wrong recognition propagates through all subsequent steps: matching the wrong phonemes, translating the wrong words, and misidentifying the context.
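The idea of deferring the choice among recognitions can be sketched as rescoring a list of hypotheses. The following Python sketch is an assumption-laden illustration, not the AT&T system: the `Hypothesis` type, the additive scoring, the `weight` parameter, and the hand-written context bonuses are all hypothetical stand-ins for whatever the recognizer and the natural language understanding module actually provide.

```python
from typing import Dict, List, Tuple

# (recognized text, recognizer log-score); both fields are illustrative.
Hypothesis = Tuple[str, float]


def context_score(text: str, context_words: Dict[str, float]) -> float:
    """Sum the bonuses of context words that occur in the hypothesis."""
    return sum(context_words.get(w, 0.0) for w in text.split())


def rescore(nbest: List[Hypothesis],
            context_words: Dict[str, float],
            weight: float = 1.0) -> str:
    """Return the hypothesis with the best combined recognizer + context score."""
    best_text, _ = max(
        nbest,
        key=lambda h: h[1] + weight * context_score(h[0], context_words),
    )
    return best_text


# Two competing recognitions; the first has the higher recognizer score.
nbest = [("I need a room with a few", -1.0),
         ("I need a room with a view", -1.2)]
# Hypothetical bonuses for a hotel-domain context.
hotel_context = {"room": 0.3, "view": 0.5}
```

Here `rescore(nbest, hotel_context)` selects "I need a room with a view", even though the recognizer alone ranked "a few" higher. Committing to the top-scoring recognition up front would have passed the wrong words on to every later step.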

Accuracy also depends on having a large, complete corpus. To build one, the AT&T translation technology uses statistical methods to extract acoustic, lexical, and translation knowledge from traditional sources (large datasets and existing corpora) as well as nontraditional ones, such as web pages mined together with their versions in other languages. This allows the system to be continually and automatically enlarged to cover more domains (health, hotel, entertainment, among others) and to keep current with new words and expressions.

For the actual translation, the system supports multiple translation methods and automatically chooses the best one for the two languages being translated. Thus the system can take advantage of methods that work very fast to convert between two related languages, such as English and Spanish, as well as methods specifically designed for translating between languages that are syntactically dissimilar, such as English and Japanese.
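The text does not say how the system decides which method suits a language pair. As one hypothetical illustration, the Python sketch below chooses by comparing basic word order (English and Spanish are subject-verb-object, Japanese is subject-object-verb). The `WORD_ORDER` table, the method labels, and `choose_method` are assumptions for this example only.

```python
from typing import Dict

# Dominant word order per language code (illustrative, three languages only).
WORD_ORDER: Dict[str, str] = {"en": "SVO", "es": "SVO", "ja": "SOV"}


def choose_method(source: str, target: str) -> str:
    """Pick a translation method label for a language pair.

    Pairs that share a word order get the fast, order-preserving method;
    pairs that differ get a method that reorders words.
    """
    for lang in (source, target):
        if lang not in WORD_ORDER:
            raise ValueError(f"unsupported language: {lang}")
    if WORD_ORDER[source] == WORD_ORDER[target]:
        return "monotone"
    return "reordering"
```

Under these assumptions, `choose_method("en", "es")` returns `"monotone"` and `choose_method("en", "ja")` returns `"reordering"`. A production system would presumably base the choice on richer evidence than a single word-order label.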

AT&T translation technology is adapted to run both over a network connection (in the cloud) and on the device. When accessed over the network, the system automatically identifies the two languages and uses the most appropriate translation method for the language pair. The on-device version requires the appropriate ASR and models to be installed.

Project Members

Srinivas Bangalore

Mazin Gilbert

Vivek Kumar Rangarajan Sridhar
