
180 Park Ave - Building 103
Florham Park, NJ
http://www.research.att.com/~lbarbosa
Voice-Enabled Social TV
Bernard Renger, Junlan Feng, Ovidiu Dan, Hisao Chang, Luciano Barbosa
WWW2011,
2011.
[PDF]
[BIB]
ACM Copyright
The definitive version was published in WWW 2011, 2011-03-28
{Until now, the TV viewing experience has been far less social than the World Wide Web. In this demo, we present a Voice-enabled Social TV system, VoiSTV, which allows users to access Twitter through the TV using voice. With this application, a user can receive and send Twitter messages (tweets) while watching TV, and can compose outgoing tweets using spoken language. Beyond accessibility, VoiSTV also provides users with metadata about TV shows, such as trends, hot topics, and show popularity, as well as the aggregated sentiment of show-related tweets.}
SpeechForms: From Web to Speech and Back
Luciano Barbosa, Diamantino Caseiro, Giuseppe Di , Amanda Stent
Interspeech 2011,
2011.
[PDF]
[BIB]
ISCA Copyright
The definitive version was published in Interspeech 2011, 2011-08-27
{This paper describes SpeechForms, a system that uses novel
techniques to automatically identify form element semantics
(element type) and form element content (element values),
and to semi-automatically generate language models that allow
users to fill out each web form element by voice. Preliminary
experimental results show that simple per-element language
models are faster and may be more accurate than statistical
n-gram language models from large amounts of web text
data.}
Focusing on Novelty: A Crawling Strategy to Build Diverse Language Models
Luciano Barbosa, Srinivas Bangalore
20th ACM International Conference on Information and Knowledge Management,
2011.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in 20th ACM International Conference on Information and Knowledge Management, 2011-10-30.
Crawling Back and Forth: Using Back and Out Links to Locate Bilingual Sites
Luciano Barbosa, Srinivas Bangalore, Vivek Kumar
IJCNLP,
2011.
[PDF]
[BIB]
AFNLP Copyright
The definitive version was published in IJCNLP, 2011-11-15
The definitive version was published in Very Large Databases, 2011, 2011-11-15
{Recently, there has been increased interest in Web parallel
text for tasks such as machine translation and cross-language information
retrieval. Although previous
works have addressed many aspects of it, including
document pair selection and sentence and word alignment, the
problem of discovering bilingual data sources on a large
scale has been largely overlooked.
In this paper, we propose a novel crawling strategy to locate
bilingual sites, which aims to balance the
two conflicting requirements of this problem: the need to perform
a broad search while avoiding crawling
unproductive Web regions. Our solution does so by focusing on
the graph neighborhood of bilingual sites and exploring
the link patterns in this region to guide its visitation policy.
To detect such sites, we introduce a two-step strategy that first relies on common patterns
found in the internal links of these sites to build a classifier
that identifies candidate pages as entry points to parallel data in these sites,
and then verifies whether these pages are in fact in the languages
of interest. Our experimental evaluation shows that our crawler outperforms previous
crawling approaches for this task and produces a
high-quality collection of bilingual sites.
}

A Scalable Approach to Building a Parallel Corpus from the Web
Vivek Kumar, Luciano Barbosa, Srinivas Bangalore
INTERSPEECH,
2011.
[PDF]
[BIB]
ACL Copyright
The definitive version was published in EMNLP, 2011-08-27
{Parallel text acquisition from the Web is an attractive way to
augment statistical models (e.g., machine translation, cross-lingual
document retrieval) with domain-representative data.
The basis for obtaining such data is a collection of pairs of bilingual
Web sites or pages. In this work, we propose a crawling
strategy that locates bilingual Web sites by constraining the visitation
policy of the crawler to the graph neighborhood of bilingual
sites on the Web. Subsequently, we use a novel
mining technique that recursively extracts text and links from
the collection of bilingual Web sites obtained from the crawling.
Our method does not suffer from the computationally prohibitive
combinatorial matching typically used in previous work
that relies on document retrieval techniques to match a collection of
bilingual webpages. We demonstrate the efficacy of our approach
in the context of machine translation in the tourism and
hospitality domain. The parallel text obtained using our novel
crawling strategy results in a relative improvement of 21% in
BLEU score (English-to-Spanish) over an out-of-domain seed
translation model trained on the European parliamentary proceedings.}