
200 S Laurel Ave - Bldg A
Middletown, NJ
Subject matter expert in multimedia content processing, multimedia databases, video search, pattern recognition, machine learning, data mining, and natural language understanding.
I received the B.S. and M.S. degrees in Electronic Engineering from Tsinghua University, Beijing, China, in 1994 and 1996, respectively, and the Ph.D. degree in Electrical Engineering from Polytechnic University, Brooklyn, NY, in 2001. I am on the editorial board of the IEEE Transactions on Multimedia and the Peer-to-Peer Networking and Applications journal. I am a senior member of IEEE, and a member of ACM and Tau Beta Pi.
Connecting Your World,
The need to be connected is greater than ever, and AT&T Researchers are creating new ways for people to connect with one another and with their environments, whether it's their home, office, or car.
Content Acquisition Processing, Monitoring, and Forensics for AT&T Services (CONSENT),
CONSENT provides a platform for detection and analysis of content for security and quality assurance.
Content Augmenting Media (CAM),
Leverage multimedia metadata to provide live alerts and intelligent content consumption.
Content-Based Copy Detection,
Content-based Copy Detection is an enabling technology to discover repeated content and events in a large-scale content database.
Enhanced Indexing and Representation with Vision-Based Biometrics,
Leveraging visual biometrics for indexing and representations of content for retrieval and verification.
iMIRACLE - Content Retrieval on Mobile Devices with Speech,
iMIRACLE uses large vocabulary speech recognition for content retrieval with metadata words (titles, genre, channels, etc.) and content words that occur in recorded programs.
MIRACLE and the Content Analysis Engine (CAE),
The Multimedia Information Retrieval by Content (MIRACLE) project encompasses the technologies for video indexing, analysis, and retrieval with audio, textual, and visual content information.
VidCat - Simplified Personal Photo and Video Management,
VidCat permits simplified personal photo and video management (i.e. a Video Catalog) from a webpage or your favorite mobile device.
Video - Content Delivery and Consumption,
A background on the delivery and consumption of video and multimedia and references to projects within the AT&T Video and Multimedia Technologies and Services Research Department.
Video - Indexing and Representation (Metadata),
Video and multimedia indexing and representations (i.e. metadata), their production, and use. Links to projects within the AT&T Video and Multimedia Technologies and Services Research Department.
Video and Multimedia Technologies Research,
The AT&T Video and Multimedia Technologies Research Department strives to acquire multimedia and video for indexing, retrieval, and consumption with textual, semantic, and visual modalities.
Visual Semantics for Intuitive Mid-Level Representations,
Represent content with mid-level visual semantics for retrieval, filtering, and tagging.

Large-Scale Analysis for Interactive Media Consumption
David Gibbon, Andrea Basso, Lee Begeja, Zhu Liu, Bernard Renger, Behzad Shahraray, Eric Zavesky
TV Content Analysis,
CRC Press,
2012.
CRC Press, Taylor Francis LLC Copyright
The definitive version was published in proceedings of TV Content Analysis (CRC Press/Taylor & Francis), 2012-02-22, http://mklab.iti.gr/tvca/
Over the years the fidelity and quantity of TV content have steadily increased, but consumers
still experience considerable difficulty in finding content that matches their personal
interests. New mobile and IP consumption environments have emerged with the promise
of ubiquitous delivery of desired content, but in many cases available content descriptions
in the form of electronic program guides lack sufficient detail, and cumbersome human
interfaces yield a less than positive user experience. Creating metadata through detailed
manual annotation of TV content is costly and, in many cases, this metadata may be lost
in the content life-cycle as assets are repurposed for multiple distribution channels. Content
organization can be daunting when considering domains ranging from breaking news contributions,
local or government channels, live sports, music videos, and documentaries up through dramatic
series and feature films. As the line between TV content and Internet content continues
to blur, more and more long-tail content will appear on TV, and the ability to
automatically generate metadata for it becomes paramount. Research results from several
disciplines must be brought together to address the complex challenge of cost-effectively
augmenting existing content descriptions to facilitate content personalization and adaptation for
users given today's range of content consumption contexts. This chapter presents system
architectures for processing large volumes of video efficiently; practical, state-of-the-art
solutions for TV content analysis and metadata generation; and potential applications that
utilize this metadata in effective and enabling ways.

Combining Content Analysis of Television Programs with Audience Measurement
David Gibbon, Zhu Liu, Eric Zavesky, DeDe Paul, Deborah Swayne, Rittwik Jana, Behzad Shahraray
IEEE Consumer Communications and Networking Conference (CCNC),
2012.
IEEE Copyright
This version of the work is reprinted here with permission of IEEE for your personal use. Not for redistribution. The definitive version was published in IEEE Consumer Communications and Networking Conference, 2012-01-15
The definitive version was published in Advertising Research Foundation, 2012-01-15
Combining content analysis of television programs with quantitative audience measurement can provide insights
into customer reactions to advertisements and program content. This work introduces a system architecture that
incorporates anonymous audience metrics from an operational IPTV environment with metadata from a content-based
analysis of recorded programs. Evaluated on a collection of news programs, the system verifies that events derived
from the audience metrics data stream correspond to media segmentation boundaries such as commercial breaks and
topic changes. An automated system for executing multimodal media segmentation algorithms for commercial break and
topic change detection is also discussed. Better understanding of audience reaction can help IPTV service providers
plan infrastructure investments and help in managing multimedia content delivery networks.

Automated Content Metadata Extraction Services based on MPEG Standards
David Gibbon, Zhu Liu, Andrea Basso, Behzad Shahraray
The Computer Journal, Special Issue on MPEG Applications and Services,
2012.
Oxford University Press Copyright
The definitive version was published in 2012, 2012-12-06, http://www.oxfordjournals.org
This paper is concerned with the generation, acquisition, standardized representation, and transport of video metadata.
The use of MPEG standards in the design and development of interoperable media architectures and Web services is discussed.
A high-level discussion of several algorithms for metadata extraction is presented.
Some architectural and algorithmic issues encountered when designing services for real-time processing of video streams,
as opposed to traditional offline media processing, are addressed.
A prototype real-time video analysis system for generating MPEG-7 Audiovisual Description Profile (AVDP) from MPEG-2
transport stream encapsulated video is presented.
Such a capability can enable a range of new services such as content-based personalization of live broadcasts given that the
MPEG-7 based data models fit in well with specifications for advanced television services such as
TV-Anytime and the Alliance for Telecommunications Industry Solutions (ATIS) IPTV Interoperability Forum.

AT&T Research at TRECVID 2011
Eric Zavesky, Zhu Liu, Behzad Shahraray, Ning Zhou
TRECVID Workshop,
2011.
NIST Copyright
The definitive version was published in TRECVID Workshop, 2011-12-07, http://www-nlpir.nist.gov/projects/tv2011/notebookpapers.html
AT&T participated in two tasks at TRECVID 2011: content-based copy detection (CCD) and instance-based search (INS). The CCD system developed for TRECVID 2010 was enhanced for speed and augmented with an additional picture-in-picture detector and alternative audio features. As a pilot task, participation in INS evaluated object-level content-based copy detection and created a basis for integer-score result reranking. This paper reports the enhancements of the CCD system and briefly describes its application to INS for object-level copy detection.

AT&T Research at TRECVID 2010
Eric Zavesky, Behzad Shahraray, Zhu Liu, Neela Sawant
TRECVID 2010 Workshop,
2010.
NIST Copyright
The definitive version was published in TRECVID 2010, 2010-11-15
AT&T participated in two tasks at TRECVID 2010: content-based copy detection (CCD) and instance-based search (INS). The CCD system developed for TRECVID 2009 was enhanced for efficiency and scale and was augmented by audio features. As a pilot task, participation in INS was meant to evaluate a number of algorithms traditionally used for search in a fully automated setting. In this paper, we report the enhancement of our CCD system and propose a system for INS that attempts to leverage retrieval techniques from different audio, video, and textual cues.

System And Method To Assign A Digital Image To A Face Cluster,
Tue Jan 08 14:43:48 EST 2013
A computer implemented method includes accessing a digital image including a plurality of faces including a first face and a second face. The computer implemented method includes identifying a plurality of identification regions of the digital image including a first identification region associated with the first face and a second identification region associated with the second face. The computer implemented method also includes assigning the digital image to a first face cluster of a plurality of face clusters when a difference between data descriptive of the first identification region and data descriptive of a face cluster identification region of the first face cluster satisfies a threshold. The computer implemented method further includes assigning the digital image to a second face cluster of the plurality of face clusters based at least partially on a probability of the second face and the first face appearing together in an image.
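The two assignment rules in this abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that each identification region is reduced to a numeric descriptor, that "satisfies a threshold" means a Euclidean distance at or below a cutoff, and that co-occurrence probabilities are available as a lookup table. The names `nearest_cluster`, `cooccurring_cluster`, and all sample values are hypothetical and do not come from the patent.

```python
from math import dist

def nearest_cluster(descriptor, centroids, threshold):
    """Return the id of the closest cluster if its distance is within threshold, else None."""
    best_id, best_distance = None, float("inf")
    for cluster_id, centroid in centroids.items():
        d = dist(descriptor, centroid)  # Euclidean distance (assumed metric)
        if d < best_distance:
            best_id, best_distance = cluster_id, d
    return best_id if best_distance <= threshold else None

def cooccurring_cluster(first_id, cooccurrence, min_prob):
    """Pick the cluster most likely to appear together with first_id, if likely enough."""
    candidates = cooccurrence.get(first_id, {})
    if not candidates:
        return None
    best_id = max(candidates, key=candidates.get)
    return best_id if candidates[best_id] >= min_prob else None

# Hypothetical data: two known face clusters and a co-occurrence table.
centroids = {"alice": (0.0, 0.0), "bob": (1.0, 1.0)}
cooccurrence = {"alice": {"bob": 0.6, "carol": 0.1}}

first_face = (0.1, 0.0)
first_cluster = nearest_cluster(first_face, centroids, threshold=0.3)
second_cluster = cooccurring_cluster(first_cluster, cooccurrence, min_prob=0.5)
print(first_cluster, second_cluster)
```

In this sketch the first face is matched by descriptor distance alone, while the second face is assigned only from the co-occurrence table; a real system would presumably combine both kinds of evidence.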
System And Method For Adaptive Media Playback Based On Destination,
Tue Aug 07 16:11:22 EDT 2012
Disclosed herein are systems, methods, and computer-readable media for adaptive media playback based on destination. The method for adaptive media playback comprises determining one or more destinations, collecting media content that is relevant to or describes the one or more destinations, assembling the media content into a program, and outputting the program. In various embodiments, media content may be advertising, consumer-generated, based on real-time events, based on a schedule, or assembled to fit within an estimated available time. Media content may be assembled using an adaptation engine that selects a plurality of media segments that fit in the estimated available time, orders the plurality of media segments, alters at least one of the plurality of media segments to fit the estimated available time, if necessary, and creates a playlist of selected media content containing the plurality of media segments.
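The select, order, alter, and playlist steps of the adaptation engine can be sketched as a simple greedy fill. This is only one plausible reading: the relevance score, the ordering by relevance, and trimming the last segment are assumptions made for illustration, and `build_playlist` is a hypothetical name.

```python
def build_playlist(segments, available_seconds):
    """Greedy sketch: segments is a list of (name, duration_seconds, relevance)."""
    playlist = []
    remaining = available_seconds
    # Order segments by descending relevance (assumed ordering criterion).
    for name, duration, _relevance in sorted(segments, key=lambda s: s[2], reverse=True):
        if remaining <= 0:
            break
        play_for = min(duration, remaining)  # alter (trim) a segment to fit the time left
        playlist.append((name, play_for))
        remaining -= play_for
    return playlist

# Hypothetical segments about a destination, with 150 seconds available.
segments = [("museum", 120, 0.9), ("cafe_ad", 30, 0.4), ("traffic", 60, 0.7)]
playlist = build_playlist(segments, available_seconds=150)
print(playlist)
```

Here the least relevant segment is dropped and the second segment is shortened so that the total running time equals the estimated available time.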
Brief And High-Interest Video Summary Generation,
Tue Jun 05 16:10:38 EDT 2012
A video is summarized by determining if a video contains one or more junk frames, modifying one or more boundaries of shots of the video based at least in part on the determination of if the video contains one or more junk frames, sampling a plurality of the shots of the video into a plurality of subshots, clustering the plurality of subshots with a multiple step k-means clustering, and creating a video summary based at least in part on the clustered plurality of subshots. The video is segmented into a plurality of shots and a keyframe from each of the plurality of shots is extracted. A video summary is created based on a determined importance of the subshots in a clustered plurality of subshots and a time budget. The created video summary is rendered by displaying playback rate information for the rendered video summary, displaying a currently playing subshot marker with the rendered video summary, and displaying an indication of similar content in the rendered video summary.
System And Method For Automated Multimedia Content Indexing And Retrieval,
Tue Mar 06 16:09:28 EST 2012
The invention provides a system and method for automatically indexing and retrieving multimedia content. The method may include separating a multimedia data stream into audio, visual and text components, segmenting the audio, visual and text components based on semantic differences, identifying at least one target speaker using the audio and visual components, identifying a topic of the multimedia event using the segmented text and topic category models, generating a summary of the multimedia event based on the audio, visual and text components, the identified topic and the identified target speaker, and generating a multimedia description of the multimedia event based on the identified target speaker, the identified topic, and the generated summary.
System And Method For Identifying Contact Information,
Tue Nov 15 16:06:32 EST 2011
A system and method for identifying contact information is provided. A system to identify contact information may include an input to receive a data stream. The data stream may include audio content, video content or both. The system may also include an analysis module to detect contact information within the data stream. The system may also include a memory to store a record of the contact information.
System And Method For Automatically Authoring Interactive Television Content,
Tue Oct 11 16:06:17 EDT 2011
A system and method is provided to automatically generate content for ITV products and services by processing primary media sources. In one embodiment of the invention, keywords are automatically extracted from the primary media sources using one or more of a variety of techniques directed to video, audio and/or textual content of the multimodal source. In some embodiments, keywords are then processed according to one or more disclosed algorithms to narrow the quantity of downstream processing that is necessary to associate secondary sources (reference items) with the primary video source. Embodiments of the invention also provide automatic searching methods for the identification of reference items based on the processed keywords in order to maximize the value added by the association of reference items to the video source.
Unsupervised Speaker Segmentation Of Multi-Speaker Speech Data,
Tue Apr 19 16:04:55 EDT 2011
Systems and methods for unsupervised segmentation of multi-speaker speech or audio data by speaker. A front-end analysis is applied to input speech data to obtain feature vectors. The speech data is initially segmented and then clustered into groups of segments that correspond to different speakers. The clusters are iteratively modeled and resegmented to obtain stable speaker segmentations. The overlap between segmentation sets is checked to ensure successful speaker segmentation. Overlapping segments are combined and remodeled and resegmented. Optionally, the speech data is processed to produce a segmentation lattice to maximize the overall segmentation likelihood.
On-Demand Language Translation For Television Programs,
Tue Oct 05 15:04:54 EDT 2010
In an embodiment, a method of providing an on demand translation service is provided. A subscriber may be charged a reduced fee or no fee for use of the on demand translation service in exchange for displaying commercial messages to the subscriber, the commercial messages being selected based on subscriber information. A multimedia signal including information in a source language may be received. The information may be obtained as text in the source language from the multimedia signal. The text may be translated from the source language to a target language. Translated information, based on the translated text, may be transmitted to a processing device for presentation to the subscriber. The received multimedia signal may be sent to a multimedia device for viewing.
System And Method For Adaptive Content Rendition,
Tue Sep 14 15:04:37 EDT 2010
Disclosed herein are systems, methods, and computer-readable media for adaptive content rendition, the method comprising receiving media content for playback to a user, adapting the media content for playback on a first device in the user's first location, receiving a notification when the user changes to a second location, adapting the media content for playback on a second device in the second location, and transitioning media content playback from the first device to the second device. One aspect conserves energy by optionally turning off the first device after transitioning to the second device. Another aspect includes playback devices that are "dumb devices" which receive media content already prepared for playback, "smart devices" which receive media content in a less than ready form and prepare the media content for playback, or hybrid smart and dumb devices. A single device may be substituted by a plurality of devices. Adapting the media content for playback is based on a user profile storing user preferences and/or usage history in one aspect.
Systems And Methods For Monitoring Speech Data Labelers,
Tue May 04 15:03:49 EDT 2010
Systems and methods for using an annotation guide to label utterances and speech data with a call type. A method embodiment monitors labelers of speech data by presenting via a processor a test utterance to a labeler, receiving input from the labeler that selects a particular call type from a list of call types and determining via the processor if the labeler labeled the test utterance correctly. Based on the determining step, the method performs at least one of the following: revising the annotation guide, retraining the labeler or altering the test utterance.
On-Demand Language Translation For Television Programs,
Tue May 04 15:03:47 EDT 2010
A method, a system and a machine-readable medium are provided for an on demand translation service. A translation module including at least one language pair module for translating a source language to a target language may be made available for use by a subscriber. The subscriber may be charged a fee for use of the requested on demand translation service or may be provided use of the on demand translation service for free in exchange for displaying commercial messages to the subscriber. A video signal may be received including information in the source language, which may be obtained as text from the video signal and may be translated from the source language to the target language by use of the translation module. Translated information, based on the translated text, may be added into the received video signal. The video signal including the translated information in the target language may be sent to a display device.
Method and apparatus for segmenting a multi-media program based upon audio events,
Tue Jan 15 18:12:34 EST 2008
The present invention provides a method and apparatus for segmenting a multi-media program based upon audio events. In an embodiment, a method of classifying an audio stream is provided. This method includes receiving an audio stream, sampling the audio stream at a predetermined rate, and then combining a predetermined number of samples into a clip. A plurality of features is then determined for the clip and analyzed using a linear approximation algorithm. The clip is then characterized based upon the results of the analysis conducted with the linear approximation algorithm.
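The clip-based pipeline described here (group samples into clips, compute features per clip, apply a linear analysis) might look roughly like the sketch below. The abstract does not name the features or the linear approximation algorithm, so short-time energy, zero-crossing rate, and a plain linear decision function are stand-ins chosen for illustration; the weights, bias, and class labels are hypothetical.

```python
def make_clips(samples, clip_len):
    """Group a sampled stream into consecutive, non-overlapping clips of clip_len samples."""
    return [samples[i:i + clip_len] for i in range(0, len(samples) - clip_len + 1, clip_len)]

def clip_features(clip):
    """Two illustrative features: mean energy and zero-crossing rate."""
    energy = sum(s * s for s in clip) / len(clip)
    crossings = sum(1 for a, b in zip(clip, clip[1:]) if (a >= 0) != (b >= 0))
    return [energy, crossings / (len(clip) - 1)]

def classify(clip, weights, bias):
    """Linear decision function over the clip features (an assumed stand-in)."""
    score = sum(w * f for w, f in zip(weights, clip_features(clip))) + bias
    return "high_activity" if score >= 0 else "low_activity"

# Hypothetical stream: 8 rapidly alternating samples followed by 8 quiet ones.
samples = [1.0, -1.0] * 4 + [0.1] * 8
clips = make_clips(samples, clip_len=8)
labels = [classify(c, weights=[1.0, 1.0], bias=-1.0) for c in clips]
print(labels)
```

In practice the weights would be learned from labeled audio rather than set by hand as they are here.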
Unsupervised speaker segmentation of multi-speaker speech data,
Tue Nov 13 18:12:24 EST 2007
Systems and methods for unsupervised segmentation of multi-speaker speech or audio data by speaker. A front-end analysis is applied to input speech data to obtain feature vectors. The speech data is initially segmented and then clustered into groups of segments that correspond to different speakers. The clusters are iteratively modeled and resegmented to obtain stable speaker segmentations. The overlap between segmentation sets is checked to ensure successful speaker segmentation. Overlapping segments are combined and remodeled and resegmented. Optionally, the speech data is processed to produce a segmentation lattice to maximize the overall segmentation likelihood.
Systems and methods for monitoring speech data labelers,
Tue Oct 09 18:12:17 EDT 2007
Systems and methods for monitoring labelers of speech data. To test or train labelers, a labeler is presented with utterances that have already been identified as belonging to a particular class or call type. The labeler is asked to assign a call type to the utterances. The performance of the labeler is measured by comparing the call types assigned by the labeler with the existing call types of the utterances. The performance of a labeler can also be monitored as the labeler labels speech data by occasionally having the labeler label an utterance that is already labeled and by storing the results.
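The performance measure described above, comparing the call types a labeler assigns with the existing call types, can be sketched as a simple agreement rate. The function name `labeler_accuracy` and the sample call types are hypothetical; the patent may use a different measure.

```python
def labeler_accuracy(assigned, reference):
    """Fraction of utterances where the labeler's call type matches the existing one."""
    if len(assigned) != len(reference):
        raise ValueError("assigned and reference must have the same length")
    if not reference:
        return 0.0
    correct = sum(a == r for a, r in zip(assigned, reference))
    return correct / len(reference)

# Hypothetical test set: existing call types versus one labeler's choices.
reference = ["billing", "cancel", "billing", "support"]
assigned = ["billing", "cancel", "support", "support"]
accuracy = labeler_accuracy(assigned, reference)
print(accuracy)
```

The same function could be applied to the occasional already-labeled utterances mixed into a labeler's regular work, with the results stored over time.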
Systems and methods for generating an annotation guide,
Tue May 15 18:12:01 EDT 2007
Systems and methods for generating an annotation guide. Speech data is organized and presented to a user. After the user selects some of the utterances in the speech data, the selected utterances are included in a class and/or call type. Additional utterances that belong to the class and/or call type can be found in the speech data using relevance feedback, data mining, data clustering, support vector machines, and the like. After a call type is complete, it is committed to the annotation guide. After all call types are completed, the annotation guide is generated.
System and method for automated multimedia content indexing and retrieval,
Tue Feb 27 18:11:55 EST 2007
The invention provides a system and method for automatically indexing and retrieving multimedia content. The method may include separating a multimedia data stream into audio, visual and text components, segmenting the audio, visual and text components based on semantic differences, identifying at least one target speaker using the audio and visual components, identifying a topic of the multimedia event using the segmented text and topic category models, generating a summary of the multimedia event based on the audio, visual and text components, the identified topic and the identified target speaker, and generating a multimedia description of the multimedia event based on the identified target speaker, the identified topic, and the generated summary.
Distance measure for probability distribution function of mixture type,
Tue Jan 31 18:10:50 EST 2006
In accordance with our invention, for two mixture-type probability distribution functions (PDFs) $G$ and $H$,
$$G(x) = \sum_{i=1}^{N} \mu_i\, g_i(x), \qquad H(x) = \sum_{k=1}^{K} \gamma_k\, h_k(x),$$
where $G$ is a mixture of $N$ component PDFs $g_i(x)$, $H$ is a mixture of $K$ component PDFs $h_k(x)$, and $\mu_i$ and $\gamma_k$ are corresponding weights that satisfy
$$\sum_{i=1}^{N} \mu_i = 1, \qquad \sum_{k=1}^{K} \gamma_k = 1,$$
we define their distance, $D_M(G, H)$, as
$$D_M(G, H) = \min_{w = [\omega_{ik}]} \sum_{i=1}^{N} \sum_{k=1}^{K} \omega_{ik}\, d(g_i, h_k),$$
where $d(g_i, h_k)$ is the element distance between component PDFs $g_i$ and $h_k$, and $w$ satisfies $\omega_{ik} \ge 0$, $1 \le i \le N$, $1 \le k \le K$; and
$$\sum_{k=1}^{K} \omega_{ik} = \mu_i,\ 1 \le i \le N; \qquad \sum_{i=1}^{N} \omega_{ik} = \gamma_k,\ 1 \le k \le K.$$
The application of this definition of distance to various sets of real world data is demonstrated.
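As a rough illustration, the sketch below reads the definition as a minimum, over non-negative weight matrices whose row sums equal the weights of one mixture and whose column sums equal the weights of the other, of the weighted sum of element distances. It does not solve that minimization. It only checks whether a candidate weight matrix is feasible and evaluates its cost, which gives an upper bound on the distance. All function names and numbers are hypothetical.

```python
from math import isclose

def is_feasible(w, mu, gamma, tol=1e-9):
    """Check w >= 0, row sums equal to mu, and column sums equal to gamma."""
    n, k = len(mu), len(gamma)
    if any(w[i][j] < -tol for i in range(n) for j in range(k)):
        return False
    rows_ok = all(isclose(sum(w[i]), mu[i], abs_tol=tol) for i in range(n))
    cols_ok = all(isclose(sum(w[i][j] for i in range(n)), gamma[j], abs_tol=tol)
                  for j in range(k))
    return rows_ok and cols_ok

def weighted_cost(w, d):
    """Objective value: sum over i, k of w[i][k] * d[i][k]."""
    return sum(w[i][j] * d[i][j] for i in range(len(d)) for j in range(len(d[0])))

def product_weights(mu, gamma):
    """The choice w[i][k] = mu[i] * gamma[k] is always feasible, though rarely optimal."""
    return [[m * g for g in gamma] for m in mu]

# Hypothetical mixture weights and element distances d[i][k].
mu = [0.5, 0.5]
gamma = [0.25, 0.75]
d = [[0.0, 2.0], [1.0, 0.0]]

w_product = product_weights(mu, gamma)
upper_bound = weighted_cost(w_product, d)   # a feasible cost, not the minimum
w_better = [[0.25, 0.25], [0.0, 0.5]]       # another feasible matrix with lower cost
better_cost = weighted_cost(w_better, d)
print(upper_bound, better_cost)
```

Because the objective and the constraints are linear in the weights, the actual minimum could be found with any linear programming solver; the two feasible matrices above simply show that different weight choices give different costs.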
Method And Apparatus For Segmenting A Multi-Media Program Based Upon Audio Events,
Tue Oct 05 01:05:25 EDT 2004
The present invention provides a method and apparatus for segmenting a multi-media program based upon audio events. In an embodiment, a method of classifying an audio stream is provided. This method includes receiving an audio stream, sampling the audio stream at a predetermined rate, and then combining a predetermined number of samples into a clip. A plurality of features is then determined for the clip and analyzed using a linear approximation algorithm. The clip is then characterized based upon the results of the analysis conducted with the linear approximation algorithm.
System and method for automated multimedia content indexing and retrieval,
Tue Mar 30 18:09:42 EST 2004
The invention provides a system and method for automatically indexing and retrieving multimedia content. The method may include separating a multimedia data stream into audio, visual and text components, segmenting the audio, visual and text components based on semantic differences, identifying at least one target speaker using the audio and visual components, identifying a topic of the multimedia event using the segmented text and topic category models, generating a summary of the multimedia event based on the audio, visual and text components, the identified topic and the identified target speaker, and generating a multimedia description of the multimedia event based on the identified target speaker, the identified topic, and the generated summary.