Speech translation


Language barriers inhibit the free exchange of information and ideas and restrict travel and mobility. Fast, accurate speech-to-speech translation can bridge language gaps, allowing tourists to communicate easily when traveling in foreign countries and enabling healthcare and legal professionals to communicate with people who speak a different language.

Speech-to-speech translation has almost unlimited market potential, especially if translation is as smooth and as natural as speech itself. With this goal, AT&T Research is developing a real-time speech-to-speech translation technology that begins translating as soon as it detects speech, without the latency incurred by waiting for an utterance to complete before translating. The increase in speed is achieved by combining the underlying technologies (automatic speech recognition, natural language understanding, machine learning, and speech synthesis) into a single step, omitting the usual intermediate step of translating source-language phonemes to target-language text.

This tight coupling is possible because AT&T Research can integrate its own highly accurate and sophisticated automatic speech recognition (AT&T WATSON™ ASR) and natural-sounding text-to-speech (Natural Voices) at a low level, so that the output of one becomes the input of the next.
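The idea that each stage starts work as soon as the previous one produces output can be illustrated with chained generators. This is only a minimal sketch under stated assumptions, not the AT&T implementation: the function names, the word-level "audio chunks", and the toy lexicon are all hypothetical stand-ins for real recognition, translation, and synthesis models.

```python
from typing import Iterable, Iterator

# Hypothetical toy lexicon; a real system would use trained translation models.
TOY_LEXICON = {"hello": "hola", "my": "mi", "friend": "amigo"}


def recognize(audio_chunks: Iterable[str]) -> Iterator[str]:
    """Stand-in recognizer: each 'audio chunk' is assumed to be one word."""
    for chunk in audio_chunks:
        yield chunk.lower()


def translate(words: Iterable[str]) -> Iterator[str]:
    """Translate each word as soon as it arrives; unknown words pass through."""
    for word in words:
        yield TOY_LEXICON.get(word, word)


def synthesize(words: Iterable[str]) -> Iterator[str]:
    """Stand-in synthesizer: wrap each word to mark it as spoken output."""
    for word in words:
        yield f"<speak>{word}</speak>"


def speech_to_speech(audio_chunks: Iterable[str]) -> Iterator[str]:
    """Chain the stages so the output of one becomes the input of the next."""
    return synthesize(translate(recognize(audio_chunks)))


if __name__ == "__main__":
    for spoken in speech_to_speech(["Hello", "my", "friend"]):
        print(spoken)
```

Because every stage is a generator, the first translated word is available after only the first chunk has been consumed, rather than after the whole utterance. Word-by-word translation is a deliberate oversimplification used only to keep the sketch short.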

At the same time, accuracy is protected in several ways. First, AT&T translation does not rely on a single recognition. Instead, it considers all possible recognitions (and the additional information they contain) and uses other information sources, including contextual knowledge provided by a natural language understanding module, to better decide which recognition is the correct one for the current context.

By not committing early to a single recognition that might be wrong (even if it scored the highest), the AT&T system avoids a common translation problem: error magnification, in which an early error, such as a wrong recognition, propagates through all subsequent steps (matching the wrong phonemes, translating the wrong words, misidentifying the context).
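Deferring the choice among candidate recognitions can be sketched as rescoring a list of hypotheses with a context score. The example below is an illustration under assumptions, not the AT&T system: the `Hypothesis` type, the scoring formula, the weight, and the keyword-based context model are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass
class Hypothesis:
    text: str
    asr_score: float  # assumed recognizer log-score; higher is better


def context_score(text: str, context_words: Set[str]) -> float:
    """Hypothetical context model: count words that fit the current domain."""
    return float(sum(1 for w in text.lower().split() if w in context_words))


def pick_recognition(
    nbest: Iterable[Hypothesis],
    context_words: Set[str],
    context_weight: float = 1.0,
) -> Hypothesis:
    """Choose the hypothesis with the best combined recognizer and context score."""
    return max(
        nbest,
        key=lambda h: h.asr_score
        + context_weight * context_score(h.text, context_words),
    )


if __name__ == "__main__":
    candidates = [
        Hypothesis("I need a dock", -1.0),    # top recognizer score
        Hypothesis("I need a doctor", -1.2),  # slightly lower recognizer score
    ]
    health_context = {"doctor", "hospital", "pain"}
    print(pick_recognition(candidates, health_context).text)
```

In this toy case the recognizer alone prefers "I need a dock", but the health-domain context raises "I need a doctor" above it, so the error never reaches the translation step.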

Accuracy also depends on having a large, complete corpus. To build one, the AT&T translation technology uses statistical methods to extract acoustic, lexical, and translation knowledge from traditional sources (large datasets and existing corpora) as well as nontraditional ones, such as data mined from web pages and their different language versions. This allows the system to be continually and automatically enlarged to cover more domains (health, hotel, entertainment, among others) and to keep current with new words and expressions.

For the actual translation, the system includes multiple translation methods and automatically chooses the best one for the two languages being translated. Thus the system can take advantage of methods that work very fast to convert between two related languages, such as English and Spanish, as well as methods specifically designed for translating between languages that are syntactically dissimilar, such as English and Japanese.
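Selecting a method per language pair amounts to a dispatch on the pair. The following is a minimal sketch, assuming a hand-written table of related pairs and two placeholder methods. The names `RELATED_PAIRS`, `fast_related_language_method`, and `dissimilar_language_method` are hypothetical, and the source does not say how the real system makes this decision.

```python
from typing import Callable, FrozenSet, Set

# Hypothetical table of closely related language pairs (ISO 639-1 codes).
RELATED_PAIRS: Set[FrozenSet[str]] = {frozenset({"en", "es"})}


def fast_related_language_method(text: str) -> str:
    """Placeholder for a fast method suited to related languages."""
    return f"[fast] {text}"


def dissimilar_language_method(text: str) -> str:
    """Placeholder for a method designed for syntactically dissimilar languages."""
    return f"[dissimilar] {text}"


def choose_method(source: str, target: str) -> Callable[[str], str]:
    """Pick a translation method for the language pair, in either direction."""
    if frozenset({source, target}) in RELATED_PAIRS:
        return fast_related_language_method
    return dissimilar_language_method


if __name__ == "__main__":
    print(choose_method("en", "es")("where is the hotel"))
    print(choose_method("en", "ja")("where is the hotel"))
```

Using a `frozenset` as the key makes the lookup direction-independent, so English to Spanish and Spanish to English select the same method.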

AT&T translation technology is adapted to run both over network connections (in the cloud) and on the device. When accessed over the network, the system automatically identifies the two languages and uses the most appropriate translation method for the language pair. The on-device version relies on installing the appropriate ASR and models.
