
180 Park Ave - Building 103
Florham Park, NJ
Word Prominence Detection using Robust yet Simple Prosodic Features
Taniya Mishra, Vivek Rangarajan Sridhar, Alistair Conkie
Proceedings of Interspeech 2012,
2012.
[PDF]
[BIB]
ISCA Copyright
The definitive version was published in Interspeech 2012, 2012-09-09.
Automatic detection of word prominence can provide valuable information for downstream applications such as spoken language understanding. Prior work on automatic word prominence detection exploits a variety of lexical, syntactic, and prosodic features and models the task as a sequence of local classifications (independently or using history). While lexical and syntactic features are highly correlated with the notion of word prominence, the output of speech recognition is typically noisy, and hence these features are less reliable than the acoustic-prosodic feature stream. In this work, we address automatic detection of word prominence through novel prosodic features that capture changes in F0 curve shape and magnitude along with duration and energy. We contrast the utility of these features with the aggregate statistics of F0, duration, and energy used in prior work. Our features are simple to compute yet robust to the inherent difficulties of identifying salient points (such as F0 peaks, valleys, onsets, and offsets) within the F0 contour. Through feature analysis, we demonstrate that these novel features are substantially more predictive than standard aggregation-based prosodic features. Experimental results on a corpus of spontaneous speech indicate that the accuracy obtained using only the prosodic features is better than that obtained using both lexical and syntactic features.
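A minimal sketch, assuming a Python implementation, of the kind of contrast drawn above: aggregate F0 statistics versus simple shape and magnitude-change features computed directly over a word's F0 samples. The specific feature definitions and toy values are illustrative assumptions, not the paper's feature set.

# Illustrative only: contrasts aggregate F0 statistics with simple
# shape/magnitude-change features over a word's voiced F0 samples (Hz).
# The exact features used in the paper are not reproduced here.
from statistics import mean, stdev
from typing import Dict, List

def aggregate_f0_features(f0: List[float]) -> Dict[str, float]:
    """Aggregate statistics of the kind used in much prior work."""
    return {
        "f0_mean": mean(f0),
        "f0_max": max(f0),
        "f0_min": min(f0),
        "f0_range": max(f0) - min(f0),
        "f0_std": stdev(f0) if len(f0) > 1 else 0.0,
    }

def shape_change_features(f0: List[float]) -> Dict[str, float]:
    """Simple curve-shape features that avoid locating peaks, valleys, onsets,
    or offsets explicitly: first differences summarize rise/fall behaviour and
    the half-to-half difference captures gross magnitude change."""
    diffs = [b - a for a, b in zip(f0, f0[1:])] or [0.0]
    half = max(1, len(f0) // 2)
    second_half = f0[half:] if len(f0) > 1 else f0
    return {
        "mean_delta": mean(diffs),                          # overall slope tendency
        "rise_fraction": sum(d > 0 for d in diffs) / len(diffs),
        "halves_delta": mean(second_half) - mean(f0[:half]),
    }

if __name__ == "__main__":
    word_f0 = [180.0, 185.0, 195.0, 210.0, 205.0, 190.0]    # toy F0 track for one word
    print(aggregate_f0_features(word_f0))
    print(shape_change_features(word_f0))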

Predicting Relative Prominence in Noun-Noun Compounds
Taniya Mishra, Srinivas Bangalore
Proceedings of ACL-HLT 2011,
2011.
[PDF]
[BIB]
ACL Copyright
The definitive version was published in Proceedings of ACL-HLT 2011, 2011-06-19.
There are several theories regarding what influences prominence assignment in English noun-noun compounds. We developed corpus-driven models for automatically predicting prominence assignment in noun-noun compounds using feature sets based on two such theories: the informativeness theory and the semantic composition theory. The evaluation of the prediction models indicates that though both of these theories are relevant, they account for different types of variability in prominence assignment.
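As a concrete, deliberately simplified illustration of the informativeness idea, the sketch below assigns prominence to the noun that is less predictable under made-up corpus unigram counts. Everything here, including the counts and the surprisal heuristic, is a hypothetical stand-in rather than the paper's corpus-driven model.

# Illustrative baseline only (not the paper's model): an informativeness-style
# heuristic that assigns prominence to the less predictable noun, approximated
# here by corpus unigram counts. Counts are made up for the demo.
from collections import Counter
import math

# Hypothetical unigram counts from some corpus.
UNIGRAMS = Counter({"apple": 5000, "pie": 8000, "madison": 300, "avenue": 4000})

def informativeness(word: str, total: int) -> float:
    """Surprisal-style score: rarer words are more informative."""
    count = UNIGRAMS.get(word, 1)  # add-one style floor for unseen words
    return -math.log(count / total)

def predict_prominence(noun1: str, noun2: str) -> str:
    """Return which noun the heuristic predicts as more prominent."""
    total = sum(UNIGRAMS.values())
    s1, s2 = informativeness(noun1, total), informativeness(noun2, total)
    return noun1 if s1 >= s2 else noun2

if __name__ == "__main__":
    print(predict_prominence("apple", "pie"))        # the rarer noun wins here
    print(predict_prominence("madison", "avenue"))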
Predicting Human Perceived Accuracy of ASR Systems
Taniya Mishra, Andrej Ljolje, Mazin Gilbert
Interspeech,
2011.
[PDF]
[BIB]
ISCA Copyright
The definitive version was published in Interspeech 2011, 2011-08-28.
Word error rate (WER), the most commonly used method of measuring automatic speech recognition (ASR) accuracy, penalizes all ASR errors (insertions, deletions, substitutions) equally. However, humans weigh different types of ASR errors differently: they judge ASR errors that distort the meaning of the spoken message more harshly than those that do not. Following this central idea of differential weighting of ASR errors, we developed a new metric, HPA (Human Perceived Accuracy), that aims to align more closely with human perception of ASR errors. Applied to the task of automatically recognizing voicemails, we found that the correlation between HPA and the human perception of ASR accuracy was significantly higher (r-value=0.91) than the correlation between WER and human judgement (r-value=0.65).
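WER itself is the standard edit-distance rate, WER = (S + D + I) / N, for substitutions, deletions, insertions, and N reference words. The abstract does not give HPA's formula, so the weighted variant sketched below only illustrates the differential-weighting idea with made-up weights; it is not the published HPA metric.

# WER versus a hypothetical differentially weighted error rate. The weights
# below illustrate the idea of penalizing meaning-distorting errors more
# heavily; they are NOT the published HPA formulation.
from typing import List, Tuple

def edit_ops(ref: List[str], hyp: List[str]) -> Tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) from a Levenshtein alignment."""
    # dp[i][j] = (cost, subs, dels, ins) for ref[:i] vs hyp[:j]
    dp = [[(j, 0, 0, j) for j in range(len(hyp) + 1)]]
    for i in range(1, len(ref) + 1):
        row = [(i, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                row.append(dp[i - 1][j - 1])
                continue
            sub, dele, ins = dp[i - 1][j - 1], dp[i - 1][j], row[j - 1]
            if sub[0] <= dele[0] and sub[0] <= ins[0]:
                row.append((sub[0] + 1, sub[1] + 1, sub[2], sub[3]))
            elif dele[0] <= ins[0]:
                row.append((dele[0] + 1, dele[1], dele[2] + 1, dele[3]))
            else:
                row.append((ins[0] + 1, ins[1], ins[2], ins[3] + 1))
        dp.append(row)
    _, s, d, i = dp[-1][-1]
    return s, d, i

def wer(ref: List[str], hyp: List[str]) -> float:
    s, d, i = edit_ops(ref, hyp)
    return (s + d + i) / max(1, len(ref))

def weighted_error_rate(ref: List[str], hyp: List[str],
                        w_sub: float = 1.5, w_del: float = 1.0, w_ins: float = 0.5) -> float:
    """Hypothetical weighting: substitutions treated as most meaning-distorting."""
    s, d, i = edit_ops(ref, hyp)
    return (w_sub * s + w_del * d + w_ins * i) / max(1, len(ref))

if __name__ == "__main__":
    reference = "please call me back about the invoice".split()
    hypothesis = "please call me about the the invoice".split()
    print(round(wer(reference, hypothesis), 3),
          round(weighted_error_rate(reference, hypothesis), 3))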
On the Intelligibility of Fast Synthesized Speech for Individuals with Early-Onset Blindness
Amanda Stent, Ann Syrdal, Taniya Mishra
ACM ASSETS,
2011.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM ASSETS, 2011-10-15.
People with visual disabilities increasingly use text-to-speech synthesis as a primary output modality for interaction with computers. Surprisingly, there have been no systematic comparisons of the performance of different text-to-speech systems for this user population. In this paper we report the results of a pilot experiment on the intelligibility of fast synthesized speech for individuals with early-onset blindness. Using an open-response recall task, we collected data on four synthesis systems representing two major approaches to text-to-speech synthesis: formant-based synthesis and concatenative unit selection synthesis. We found a significant effect of speaking rate on the intelligibility of synthesized speech, and a trend towards significance for synthesizer type. In post-hoc analyses, we found that participant-related factors, including age and familiarity with a synthesizer and voice, also affect the intelligibility of fast synthesized speech.
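Open-response recall tasks are commonly scored as the proportion of reference words a listener reproduces; since the abstract does not specify the paper's scoring scheme, the snippet below is only a generic illustration of that kind of measure.

# Generic illustration of scoring an open-response recall task as the fraction
# of reference words the listener reproduced; this is an assumption, not the
# paper's documented scoring procedure.
from collections import Counter

def word_recall(reference: str, response: str) -> float:
    ref_counts = Counter(reference.lower().split())
    resp_counts = Counter(response.lower().split())
    recalled = sum(min(count, resp_counts[word]) for word, count in ref_counts.items())
    return recalled / max(1, sum(ref_counts.values()))

if __name__ == "__main__":
    stimulus = "the train to the city leaves at noon"
    typed_response = "the train to the city leaves at night"
    print(round(word_recall(stimulus, typed_response), 2))   # 7 of 8 words -> ~0.88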

Finite-state models for Speech-based Search on Mobile Devices
Taniya Mishra, Srinivas Bangalore
Journal of Natural Language Engineering,
2010.
[PDF]
[BIB]
Cambridge University Press Copyright
The definitive version was published in Journal of Natural Language Engineering, 2010-11-01, http://journals.cambridge.org/action/displayMoreInfo?jid=NLE&type=tcr
In this paper, we present techniques that exploit finite-state models for voice search applications. In particular, we illustrate the use of finite-state models for encoding the search index in order to tightly integrate the speech recognition and search components of a voice search system. We show that this tight integration benefits both automatic speech recognition and search. In the second part of the paper, we discuss the use of finite-state techniques for spoken language understanding, in particular to segment an input query into its component semantic fields, so as to improve search as well as to extend the functionality of the system and be able to execute the user's request against a backend database.
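A plain-Python sketch of the integration idea, with a dictionary standing in for the weighted finite-state index and an n-best list standing in for the recognition lattice; the listings and scores are made up, and a real system would compose the full lattice with the index transducer rather than filter an n-best list.

# Sketch of the tight-integration idea only (not the paper's system): search
# accepts only hypotheses the index can generate, so recognition errors that
# fall outside the index are pruned away.
from typing import Dict, Optional, Tuple

SEARCH_INDEX: Dict[str, str] = {          # stand-in for the index transducer
    "thai restaurant florham park": "LISTING:thai_garden_nj",
    "pizza florham park": "LISTING:rays_pizzeria_nj",
    "pharmacy madison avenue": "LISTING:madison_ave_pharmacy",
}

def integrated_search(nbest: Dict[str, float]) -> Optional[Tuple[str, str]]:
    """Return (hypothesis, listing) for the best-scoring hypothesis accepted by
    the index; in the real system this is a composition of the recognition
    lattice with the index FST rather than an n-best filter."""
    accepted = [(score, hyp) for hyp, score in nbest.items() if hyp in SEARCH_INDEX]
    if not accepted:
        return None
    _, best = max(accepted)
    return best, SEARCH_INDEX[best]

if __name__ == "__main__":
    # ASR n-best with log-probability-like scores; the top hypothesis contains a
    # recognition error ("peter" for "pizza") that the index cannot accept.
    nbest = {
        "peter florham park": -1.2,
        "pizza florham park": -1.5,
        "pizza florida park": -2.3,
    }
    print(integrated_search(nbest))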
System And Method For Enhancing Voice-Enabled Search Based On Automated Demographic Identification,
2013-03-19
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for approximating responses to a user speech query in voice-enabled search based on metadata that include demographic features of the speaker. A system practicing the method recognizes received speech from a speaker to generate recognized speech, identifies metadata about the speaker from the received speech, and feeds the recognized speech and the metadata to a question-answering engine. Identifying the metadata about the speaker is based on voice characteristics of the received speech. The demographic features can include age, gender, socio-economic group, nationality, and/or region. The metadata identified about the speaker from the received speech can be combined with or override self-reported speaker demographic information.
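A schematic sketch of the pipeline the abstract describes; every function and value below is a hypothetical placeholder chosen for illustration, not AT&T's implementation.

# Schematic sketch of the described pipeline; all functions are hypothetical
# placeholders standing in for ASR, demographic identification, and the
# question-answering engine.
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class SpeakerMetadata:
    age_group: Optional[str] = None
    gender: Optional[str] = None
    region: Optional[str] = None

def recognize_speech(audio: bytes) -> str:
    """Placeholder ASR step producing recognized speech."""
    return "find a good jazz club nearby"

def identify_metadata(audio: bytes) -> SpeakerMetadata:
    """Placeholder classifier inferring demographics from voice characteristics."""
    return SpeakerMetadata(age_group="25-34", gender="female", region="northeast_us")

def merge_metadata(inferred: SpeakerMetadata,
                   self_reported: Dict[str, str],
                   override: bool = True) -> SpeakerMetadata:
    """Combine inferred metadata with self-reported values; per the patent text,
    inferred values may be combined with or override what the speaker reported."""
    merged = SpeakerMetadata(**self_reported)
    for field in ("age_group", "gender", "region"):
        value = getattr(inferred, field)
        if value and (override or getattr(merged, field) is None):
            setattr(merged, field, value)
    return merged

def answer_query(query: str, metadata: SpeakerMetadata) -> str:
    """Placeholder question-answering engine that conditions on demographics."""
    return f"Results for '{query}' ranked for {metadata.age_group} listeners in {metadata.region}"

if __name__ == "__main__":
    audio = b"\x00\x01"                       # stand-in for captured audio
    query = recognize_speech(audio)
    inferred = identify_metadata(audio)
    merged = merge_metadata(inferred, self_reported={"age_group": "45-54"})
    print(answer_query(query, merged))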