
200 S Laurel Ave - Bldg A
Middletown, NJ
Subject matter expert in video metadata, multimedia, and video processing
David Gibbon is Lead Member of Technical Staff in the Video and Multimedia Technologies and Services Research Department at AT&T Labs - Research. His current research focus includes multimedia processing for automated metadata extraction, with applications in media and entertainment services including video retrieval and content adaptation. In 2007, David received the AT&T Science and Technology Medal for outstanding technical leadership and innovation in the field of Video and Multimedia Processing and Digital Content Management, and in 2001, the AT&T Sparks Award for Video Indexing Technology Commercialization. David contributes to standards efforts through the Metadata Committee of the ATIS IPTV Interoperability Forum. He serves on the Editorial Board of the Journal of Multimedia Tools and Applications, is a member of the ACM, and is a senior member of the IEEE. He joined AT&T Bell Labs in 1985, has over 50 U.S. patent filings, and holds 16 U.S. patents in the areas of multimedia indexing, streaming, and video analysis. He has written a book on video search, several book chapters and encyclopedia articles, and numerous technical papers.
Content Augmenting Media (CAM),
Leverages multimedia metadata to provide live alerts and intelligent content consumption.
eClips - Personalized Content Clip Retrieval and Delivery,
The eClips project delivers customized video content based upon user profiles, and is built upon the MIRACLE platform.
Enhanced Indexing and Representation with Vision-Based Biometrics,
Leverages visual biometrics for indexing and representation of content for retrieval and verification.
iMIRACLE - Content Retrieval on Mobile Devices with Speech,
iMIRACLE uses large vocabulary speech recognition for content retrieval with metadata words (titles, genre, channels, etc.) and content words that occur in recorded programs.
MIRACLE and the Content Analysis Engine (CAE),
The Multimedia Information Retrieval by Content (MIRACLE) project encompasses the technologies for video indexing, analysis, and retrieval with audio, textual, and visual content information.
VidCat - Simplified Personal Photo and Video Management,
VidCat permits simplified personal photo and video management (i.e. a Video Catalog) from a webpage or your favorite mobile device.
Video - Content Delivery and Consumption,
A background on the delivery and consumption of video and multimedia and references to projects within the AT&T Video and Multimedia Technologies and Services Research Department.
Video - Indexing and Representation (Metadata),
Video and multimedia indexing and representations (i.e. metadata), their production, and use. Links to projects within the AT&T Video and Multimedia Technologies and Services Research Department.
Video and Multimedia Technologies Research,
The AT&T Video and Multimedia Technologies Research Department strives to acquire multimedia and video for indexing, retrieval, and consumption with textual, semantic, and visual modalities.
Science & Technology Medal, 2007.
Honored for outstanding technical leadership and innovation in the field of Video and Multimedia Processing and Digital Content Management.
System And Method For Representing Media Assets,
January 22, 2013
Disclosed herein are systems, computer-implemented methods, and tangible computer-readable media for representing media assets. The method includes receiving an original media asset and derivative versions of the original media asset and associated descriptors, determining a lineage to each derivative version that traces to the original media asset, generating a version history tree of the original media asset representing the lineage to each derivative version and associated descriptors from the original media asset, and presenting at least part of the version history tree to a user. In one aspect, the method further includes receiving a modification to one associated descriptor and updating associated descriptors for related derivative versions with the received modification. The original media asset and the derivative versions of the original media asset can share a common identifying mark. Descriptors can include legal documentation, licensing information, creation time, creation date, actors' names, director, producer, lens aperture, and position data.
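For illustration only, the version history tree this patent describes can be modeled as a small recursive data structure. The Python sketch below uses hypothetical names (none are from the patent) and shows lineage tracing and descriptor propagation to derivative versions:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AssetVersion:
    asset_id: str
    descriptors: dict                             # e.g. licensing, creation date, actors
    children: list = field(default_factory=list)  # derivative versions

    def add_derivative(self, version: "AssetVersion") -> None:
        # Attach a derivative version, establishing its lineage to this asset.
        self.children.append(version)

    def lineage(self, asset_id: str) -> Optional[list]:
        # Trace the path from this (original) asset down to a derivative.
        if self.asset_id == asset_id:
            return [self.asset_id]
        for child in self.children:
            path = child.lineage(asset_id)
            if path is not None:
                return [self.asset_id] + path
        return None

    def update_descriptor(self, key: str, value) -> None:
        # Propagate a descriptor modification to all derivative versions.
        self.descriptors[key] = value
        for child in self.children:
            child.update_descriptor(key, value)

In this toy model, editing a licensing descriptor on the original propagates to every derivative in the tree, as the abstract describes.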
Metadata Repository And Methods Thereof,
January 22, 2013
A repository receives metadata from databases associated with different service providers. The repository converts the received metadata to a common format, such as MPEG7, and stores the converted metadata in a central database. The repository can also receive a query from a client device. The repository retrieves metadata associated with the query from the central database and provides it to the requesting client device. The repository can also convert the provided metadata to an appropriate format for the requesting device. Because the metadata is stored at a common location in a common format, content from different providers can be efficiently identified.
Comprehensive Information Market Exchange,
January 1, 2013
Systems and techniques for collecting information as authorized by information providers and sharing the information with information recipients according to criteria specified by the information providers. Information is collected from one or more of a variety of sources and stored in a provider profile, with the provider profile also specifying criteria for sharing the information, including payment required for sharing the information with particular categories of recipients. An exchange system is maintained allowing recipients to request or otherwise specify needs for particular categories of information and the payments to be provided for the information. Needs or requests of recipients for information are matched with criteria specified by providers, with information being transferred or used to provide results for a recipient, and payment being transferred from the recipient to a provider or providers, when a match between information needs and criteria for sharing information is identified.
Method And Apparatus For Automatically Converting Source Video Into Electronic Mail Messages,
October 23, 2012
The invention relates to a method and system for automatically identifying video content within source video and transmitting the video content to an electronic mail client. The transmitted video content can be streaming video, video files, and/or other medium derived from the source video. An enhanced electronic mail client is also disclosed.
Multimodal Portable Communication Interface For Accessing Video Content,
September 4, 2012
A portable communication device has a touch screen display that receives tactile input and a microphone that receives audio input. The portable communication device initiates a query for media based at least in part on tactile input and audio input. The touch screen display is a multi-touch screen. The portable communication device sends an initiated query and receives a text response indicative of a speech-to-text conversion of the query. The portable communication device then displays video in response to tactile input and audio input.
System And Method For Creating And Manipulating Synthetic Environments,
September 4, 2012
Disclosed herein are systems, computer-implemented methods, and tangible computer-readable media for synthesizing a virtual window. The method includes receiving an environment feed, selecting video elements of the environment feed, displaying the selected video elements on a virtual window in a window casing, selecting non-video elements of the environment feed, and outputting the selected non-video elements coordinated with the displayed video elements. Environment feeds can include synthetic and natural elements. The method can further toggle the virtual window between displaying the selected elements and being transparent. The method can track user motion and adapt the displayed selected elements on the virtual window based on the tracked user motion. The method can further detect a user in close proximity to the virtual window, receive an interaction from the detected user, and adapt the displayed selected elements on the virtual window based on the received interaction.
System And Method For Categorizing Long Documents,
August 28, 2012
A system, a method, an apparatus, and a computer-readable medium are provided. Each of a group of documents is segmented. Categories are assigned to each segment of the group of documents. A categorization series for each one of the group of documents is formed based, at least in part, on the categories assigned to each of the segments of the respective documents. A pattern is found based, at least in part, on the plurality of categorization series corresponding to the plurality of documents. Each of the group of documents is categorized based, at least in part, on the pattern.
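A minimal sketch of this segment-series approach, assuming a naive paragraph segmenter and a caller-supplied classifier (both stand-ins, not the patented methods):

def categorize_document(document: str, classify, patterns: dict) -> str:
    # Segment the long document (a naive paragraph split stands in for
    # the patented segmenter).
    segments = [s for s in document.split("\n\n") if s.strip()]
    # Assign a category to each segment, forming the categorization series.
    series = tuple(classify(segment) for segment in segments)
    # Look for a known pattern in the series and categorize accordingly.
    for pattern, category in patterns.items():
        if any(series[i:i + len(pattern)] == pattern
               for i in range(len(series) - len(pattern) + 1)):
            return category
    return "uncategorized"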
System And Method For Adaptive Media Playback Based On Destination,
August 7, 2012
Disclosed herein are systems, methods, and computer-readable media for adaptive media playback based on destination. The method for adaptive media playback comprises determining one or more destinations, collecting media content that is relevant to or describes the one or more destinations, assembling the media content into a program, and outputting the program. In various embodiments, media content may be advertising, consumer-generated, based on real-time events, based on a schedule, or assembled to fit within an estimated available time. Media content may be assembled using an adaptation engine that selects a plurality of media segments that fit in the estimated available time, orders the plurality of media segments, alters at least one of the plurality of media segments to fit the estimated available time, if necessary, and creates a playlist of selected media content containing the plurality of media segments.
Brief And High-Interest Video Summary Generation,
June 5, 2012
A video is summarized by determining whether a video contains one or more junk frames, modifying one or more boundaries of shots of the video based at least in part on the determination of whether the video contains one or more junk frames, sampling a plurality of the shots of the video into a plurality of subshots, clustering the plurality of subshots with multiple-step k-means clustering, and creating a video summary based at least in part on the clustered plurality of subshots. The video is segmented into a plurality of shots and a keyframe from each of the plurality of shots is extracted. A video summary is created based on a determined importance of the subshots in a clustered plurality of subshots and a time budget. The created video summary is rendered by displaying playback rate information for the rendered video summary, displaying a currently playing subshot marker with the rendered video summary, and displaying an indication of similar content in the rendered video summary.
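The multiple-step k-means clustering can be sketched as repeated clustering over cluster centroids. A rough Python illustration, assuming each subshot is already summarized as a feature vector (e.g., a color histogram); the step sizes are arbitrary choices:

import numpy as np
from sklearn.cluster import KMeans

def cluster_subshots(features: np.ndarray, steps=(32, 8)) -> np.ndarray:
    # features: one feature vector per subshot, shape (n_subshots, dim).
    labels = np.arange(len(features))      # each subshot starts alone
    data = features
    for k in steps:
        km = KMeans(n_clusters=min(k, len(data)), n_init=10).fit(data)
        labels = km.labels_[labels]        # re-map subshots to the new clusters
        data = km.cluster_centers_         # the next step clusters the centroids
    return labels                          # final cluster id per subshot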
Systems And Methods For Monitoring Speech Data Labelers,
May 1, 2012
Systems and methods herein use an annotation guide to label utterances and speech data with a call type. A system practicing the method embodiment monitors labelers of speech data by presenting via a processor a test utterance to a labeler, receiving input from the labeler that selects a particular call type from a list of call types and determining via the processor if the labeler labeled the test utterance correctly. Based on the determining step, the system revises the annotation guide, retrains the labeler, and/or alters the test utterance.
System And Method For Automated Multimedia Content Indexing And Retrieval,
March 6, 2012
The invention provides a system and method for automatically indexing and retrieving multimedia content. The method may include separating a multimedia data stream into audio, visual and text components, segmenting the audio, visual and text components based on semantic differences, identifying at least one target speaker using the audio and visual components, identifying a topic of the multimedia event using the segmented text and topic category models, generating a summary of the multimedia event based on the audio, visual and text components, the identified topic and the identified target speaker, and generating a multimedia description of the multimedia event based on the identified target speaker, the identified topic, and the generated summary.
System And Method For Dynamically Constructing Audio In A Video Program,
February 21, 2012
Disclosed herein are systems, methods, and computer-readable media for dynamically constructing audio in a video program. The method includes extracting video metadata from a video program displayed on a playback device to a viewer, extracting component metadata from a plurality of audio components stored in a media object library, extracting viewer preferences from a viewer profile, receiving synchronization information about the video program, identifying a segment of the video program susceptible to inserting an audio component based on extracted video metadata, component metadata, and viewer preferences, transmitting the audio component to the playback device and a set of instructions detailing how to insert the audio component in real time in the segment of the video program, and constructing audio in the video program at the playback device using the audio component and the set of instructions.
Digitally-Generated Lighting For Video Conferencing Applications,
December 27, 2011
A method of improving the lighting conditions of a real scene or video sequence. Digitally generated light is added to a scene for video conferencing over telecommunication networks. A virtual illumination equation takes into account light attenuation and Lambertian and specular reflection. An image of an object is captured, and a virtual light source illuminates the object within the image. In addition, the object can be the head of the user. The position of the head of the user is dynamically tracked so that a three-dimensional model is generated which is representative of the head of the user. Synthetic light is applied to a position on the model to form an illuminated model.
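One plausible concrete form of such a virtual illumination equation, combining distance attenuation with Lambertian and specular (Phong-style) reflection terms; this is an illustrative assumption, not necessarily the patent's exact formulation:

I(\mathbf{p}) = \frac{1}{k_c + k_l d + k_q d^2}
  \left[ k_d \, \max(\mathbf{N} \cdot \mathbf{L},\, 0)
       + k_s \, \max(\mathbf{R} \cdot \mathbf{V},\, 0)^{n} \right] I_{\mathrm{light}}

Here d is the distance from the surface point p to the virtual light source, N is the surface normal, L is the light direction, R is the reflection vector, V is the view direction, n is a shininess exponent, and k_c, k_l, k_q, k_d, k_s are adjustable attenuation and reflection coefficients.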
Method And Apparatus For Interactively Retrieving Content Related To Previous Query Results,
November 15, 2011
The invention relates to a method and system for automatically identifying and presenting video clips or other media to a user at a client device. One embodiment of the invention provides a method for updating a user profile or other persistent data store based on user feedback to improve the identification of video clips or other media content responsive to the user's profile. Embodiments of the invention also provide methods for processing user feedback. Related architectures are also disclosed.
Customized Interface Based On Viewed Programming,
November 8, 2011
In one embodiment, a system generates a customized interface based on viewed programming. The system stores a program that a user viewed through a media device; searches through a network for information related to the viewed program; and extracts data associated with the information related to the viewed program. A custom interface is generated based substantially on the data associated with the information related to the viewed program.
System And Method For Generating Media Bookmarks,
November 1, 2011
Disclosed herein are systems, methods, and computer-readable media for transmedia video bookmarks, the method comprising receiving a first place marker and a second place marker for a segment of video media, extracting metadata from the video media between the first and second place markers, normalizing the extracted metadata, storing the normalized metadata, first place marker, and second place marker as a video bookmark, and retrieving the media represented by the video bookmark upon request from a user. One aspect further aggregates video bookmarks from multiple sources and refines the first place marker and second place marker based on the aggregated video bookmarks. Metadata can be extracted by analyzing text or audio annotations. Another aspect of normalizing the extracted metadata includes generating a video thumbnail representing the video media between the first place marker and the second place marker. Multiple video bookmarks may be searchable by metadata or visually by the video thumbnail. In one aspect, a user profile stores video bookmarks on a per-media and per-user basis.
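As a rough illustration, a video bookmark of this kind reduces to a small record plus an aggregation step. The Python sketch below uses hypothetical field names (not the patent's) and a median to refine markers across users:

from dataclasses import dataclass
from statistics import median
from typing import Optional

@dataclass
class VideoBookmark:
    media_id: str
    first_marker: float                # start of the segment, in seconds
    second_marker: float               # end of the segment
    metadata: dict                     # normalized metadata between the markers
    thumbnail: Optional[bytes] = None  # optional visual representation

def refine_markers(bookmarks: list) -> tuple:
    # Aggregate bookmarks for the same segment from multiple sources and
    # refine the place markers (here: the median of each marker).
    return (median(b.first_marker for b in bookmarks),
            median(b.second_marker for b in bookmarks))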
System And Method For Automatically Authoring Interactive Television Content,
October 11, 2011
A system and method is provided to automatically generate content for ITV products and services by processing primary media sources. In one embodiment of the invention, keywords are automatically extracted from the primary media sources using one or more of a variety of techniques directed to video, audio and/or textual content of the multimodal source. In some embodiments, keywords are then processed according to one or more disclosed algorithms to narrow the quantity of downstream processing that is necessary to associate secondary sources (reference items) with the primary video source. Embodiments of the invention also provide automatic searching methods for the identification of reference items based on the processed keywords in order to maximize the value added by the association of reference items to the video source.
System And Method For Adaptive Media Playback Based On Destination,
August 9, 2011
Disclosed herein are systems, methods, and computer-readable media for adaptive media playback based on destination. The method for adaptive media playback comprises determining one or more destinations, collecting media content that is relevant to or describes the one or more destinations, assembling the media content into a program, and outputting the program. In various embodiments, media content may be advertising, consumer-generated, based on real-time events, based on a schedule, or assembled to fit within an estimated available time. Media content may be assembled using an adaptation engine that selects a plurality of media segments that fit in the estimated available time, orders the plurality of media segments, alters at least one of the plurality of media segments to fit the estimated available time, if necessary, and creates a playlist of selected media content containing the plurality of media segments.
Internet Security Updates Via Mobile Phone Videos,
May 31, 2011
Information relevant to internet security is received at a data center server. Such information, for example, a network intrusion alert or details on a recent outbreak of a network virus, may be examined to determine the nature and scope of the security-related information. A security alert is promptly generated in response to the information, using previously stored multimedia content divided into categories of security alerts and/or multimedia content generated at the data center shortly after receiving the security information. The security alert is then disseminated to a plurality of mobile device users. The alerts may be disseminated only to the users associated with a certain security event category, or may be sent to different groups of users depending on other relevant criteria.
Browsing And Retrieval Of Full Broadcast-Quality Video,
January 25, 2011
A method includes steps of indexing a media collection, searching an indexed library, and browsing a set of candidate program segments. The step of indexing a media collection creates the indexed library based on the content of the media collection. The step of searching the indexed library identifies the set of candidate program segments based on search criteria. The step of browsing the set of candidate program segments selects a segment for viewing.
On-Demand Language Translation For Television Programs,
October 5, 2010
In an embodiment, a method of providing an on demand translation service is provided. A subscriber may be charged a reduced fee or no fee for use of the on demand translation service in exchange for displaying commercial messages to the subscriber, the commercial messages being selected based on subscriber information. A multimedia signal including information in a source language may be received. The information may be obtained as text in the source language from the multimedia signal. The text may be translated from the source language to a target language. Translated information, based on the translated text, may be transmitted to a processing device for presentation to the subscriber. The received multimedia signal may be sent to a multimedia device for viewing.
Digitally-Generated Lighting For Video Conferencing Applications,
September 28, 2010
A method of improving the lighting conditions of a real scene or video sequence. Digitally generated light is added to a scene for video conferencing over telecommunication networks. A virtual illumination equation takes into account light attenuation and Lambertian and specular reflection. An image of an object is captured, and a virtual light source illuminates the object within the image. In addition, the object can be the head of the user. The position of the head of the user is dynamically tracked so that a three-dimensional model is generated which is representative of the head of the user. Synthetic light is applied to a position on the model to form an illuminated model.
System And Method For Adaptive Content Rendition,
September 14, 2010
Disclosed herein are systems, methods, and computer-readable media for adaptive content rendition, the method comprising receiving media content for playback to a user, adapting the media content for playback on a first device in the user's first location, receiving a notification when the user changes to a second location, adapting the media content for playback on a second device in the second location, and transitioning media content playback from the first device to second device. One aspect conserves energy by optionally turning off the first device after transitioning to the second device. Another aspect includes playback devices that are "dumb devices" which receive media content already prepared for playback, "smart devices" which receive media content in a less than ready form and prepare the media content for playback, or hybrid smart and dumb devices. A single device may be substituted by a plurality of devices. Adapting the media content for playback is based on a user profile storing user preferences and/or usage history in one aspect.
Systems And Methods For Monitoring Speech Data Labelers,
May 4, 2010
Systems and methods for using an annotation guide to label utterances and speech data with a call type. A method embodiment monitors labelers of speech data by presenting via a processor a test utterance to a labeler, receiving input from the labeler that selects a particular call type from a list of call types and determining via the processor if the labeler labeled the test utterance correctly. Based on the determining step, the method performs at least one of the following: revising the annotation guide, retraining the labeler or altering the test utterance.
On-Demand Language Translation For Television Programs,
May 4, 2010
A method, a system and a machine-readable medium are provided for an on demand translation service. A translation module including at least one language pair module for translating a source language to a target language may be made available for use by a subscriber. The subscriber may be charged a fee for use of the requested on demand translation service or may be provided use of the on demand translation service for free in exchange for displaying commercial messages to the subscriber. A video signal may be received including information in the source language, which may be obtained as text from the video signal and may be translated from the source language to the target language by use of the translation module. Translated information, based on the translated text, may be added into the received video signal. The video signal including the translated information in the target language may be sent to a display device.
Method For Providing A Compressed Rendition Of A Video Program In A Format Suitable For Electronic Searching And Retrieval,
February 2, 2010
A compressed rendition of a video program is provided in a format suitable for electronic searching and retrieval. An electronic pictorial transcript representation of the video program is initially received. The video program has a video component and a second information-bearing media component associated therewith. The pictorial transcript representation includes a representative frame from each segment of the video component of the video program and a portion of the second media component associated with the segment. The electronic pictorial transcript is transformed into a hypertext format to form a hypertext pictorial transcript. The hypertext pictorial transcript is subsequently recorded in an electronic medium.
Systems and methods for monitoring speech data labelers,
October 9, 2007
Systems and methods for monitoring labelers of speech data. To test or train labelers, a labeler is presented with utterances that have already been identified as belonging to a particular class or call type. The labeler is asked to assign a call type to the utterances. The performance of the labeler is measured by comparing the call types assigned by the labeler with the existing call types of the utterances. The performance of a labeler can also be monitored as the labeler labels speech data by occasionally having the labeler label an utterance that is already labeled and by storing the results.
Digitally-generated lighting for video conferencing applications,
June 12, 2007
A method of improving the lighting conditions of a real scene or video sequence. Digitally generated light is added to a scene for video conferencing over telecommunication networks. A virtual illumination equation takes into account light attenuation and Lambertian and specular reflection. An image of an object is captured, and a virtual light source illuminates the object within the image. In addition, the object can be the head of the user. The position of the head of the user is dynamically tracked so that a three-dimensional model is generated which is representative of the head of the user. Synthetic light is applied to a position on the model to form an illuminated model.
Systems and methods for generating an annotation guide,
May 15, 2007
Systems and methods for generating an annotation guide. Speech data is organized and presented to a user. After the user selects some of the utterances in the speech data, the selected utterances are included in a class and/or call type. Additional utterances that belong to the class and/or call type can be found in the speech data using relevance feedback, data mining, data clustering, support vector machines, and the like. After a call type is complete, it is committed to the annotation guide. After all call types are completed, the annotation guide is generated.
System and method for automated multimedia content indexing and retrieval,
February 27, 2007
The invention provides a system and method for automatically indexing and retrieving multimedia content. The method may include separating a multimedia data stream into audio, visual and text components, segmenting the audio, visual and text components based on semantic differences, identifying at least one target speaker using the audio and visual components, identifying a topic of the multimedia event using the segmented text and topic category models, generating a summary of the multimedia event based on the audio, visual and text components, the identified topic and the identified target speaker, and generating a multimedia description of the multimedia event based on the identified target speaker, the identified topic, and the generated summary.
Digitally-generated lighting for video conferencing applications,
December 27, 2005
A method of improving the lighting conditions of a real scene or video sequence. Digitally generated light is added to a scene for video conferencing over telecommunication networks. A virtual illumination equation takes into account light attenuation and Lambertian and specular reflection. An image of an object is captured, and a virtual light source illuminates the object within the image. In addition, the object can be the head of the user. The position of the head of the user is dynamically tracked so that a three-dimensional model is generated which is representative of the head of the user. Synthetic light is applied to a position on the model to form an illuminated model.
System and method for automated multimedia content indexing and retrieval,
March 30, 2004
The invention provides a system and method for automatically indexing and retrieving multimedia content. The method may include separating a multimedia data stream into audio, visual and text components, segmenting the audio, visual and text components based on semantic differences, identifying at least one target speaker using the audio and visual components, identifying a topic of the multimedia event using the segmented text and topic category models, generating a summary of the multimedia event based on the audio, visual and text components, the identified topic and the identified target speaker, and generating a multimedia description of the multimedia event based on the identified target speaker, the identified topic, and the generated summary.
Method For Providing A Compressed Rendition Of A Video Program In A Format Suitable For Electronic Searching And Retrieval,
June 17, 2003
A compressed rendition of a video program is provided in a format suitable for electronic searching and retrieval. An electronic pictorial transcript representation of the video program is initially received. The video program has a video component and a second information-bearing media component associated therewith. The pictorial transcript representation includes a representative frame from each segment of the video component of the video program and a portion of the second media component associated with the segment. The electronic pictorial transcript is transformed into a hypertext format to form a hypertext pictorial transcript. The hypertext pictorial transcript is subsequently recorded in an electronic medium.
Generating hypermedia documents from transcriptions of television programs using parallel text alignment,
October 29, 2002
An apparatus, method, and computer program product for generating a hypermedia document from a transcript of a closed-captioned television program using parallel text alignment. The method includes the steps of receiving the closed-captioned text stream, with its associated frame counts, and the transcript; aligning the text of the closed-captioned text stream and the transcript; transferring the frame counts from the closed-captioned text stream to the transcript; extracting video frames from the television program; and linking the frames to the frame-referenced transcript using the frame counts to produce the hypermedia document. The present invention produces other types of hypermedia products as well.
Method and apparatus for compressing a sequence of information-bearing frames having at least two media,
August 7, 2001
An apparatus and method for compressing a sequence of frames having at least first and second information-bearing media components selects a plurality of representative frames from among the sequence of frames. The representative frames represent information contained in the first information-bearing media component. A correspondence is then formed between each of the representative frames and one of a plurality of segments of the second information-bearing media component. The representative frames, the plurality of segments of the second information-bearing media component and the correspondence between them are recorded for subsequent retrieval. If the first information-bearing media component is a video component composed of a plurality of scenes, a representative frame may be selected from each scene. Additionally, if the second information-bearing media component is a closed-caption component, a printed rendition of the representative frames and the closed-caption component may be provided. The printed rendition constitutes a pictorial transcript in which each representative frame is printed with a caption containing the closed-caption text associated therewith.
Method for automatically providing a compressed rendition of a video program in a format suitable for electronic searching and retrieval,
August 1, 2000
A compressed rendition of a video program is provided in a format suitable for electronic searching and retrieval. An electronic pictorial transcript representation of the video program is initially received. The video program has a video component and a second information-bearing media component associated therewith. The pictorial transcript representation includes a representative frame from each segment of the video component of the video program and a portion of the second media component associated with the segment. The electronic pictorial transcript is transformed into a hypertext format to form a hypertext pictorial transcript. The hypertext pictorial transcript is subsequently recorded in an electronic medium.
Method and means for detecting people in image sequences,
November 16, 1999
The head in a series of video images is identified by digitizing sequential images, subtracting a previous image from an input image to determine moving objects, calculating boundary curvature extremes of regions in the subtracted image, comparing the extremes with a stored model of a human head to find regions shaped like a human head, and identifying the head with a surrounding shape.
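A simplified sketch of this motion-based pipeline, using OpenCV's frame differencing and contour extraction as stand-ins for the patent's custom boundary-curvature matching against a head model:

import cv2

def find_head_regions(prev_frame, curr_frame, min_area=500):
    # Subtract the previous image from the input image to find moving objects.
    prev = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(curr, prev)
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    heads = []
    for contour in contours:
        if cv2.contourArea(contour) < min_area:
            continue
        x, y, w, h = cv2.boundingRect(contour)
        # Crude stand-in for comparing boundary curvature extremes with a
        # stored head model: accept roughly head-shaped aspect ratios.
        if 0.6 < w / float(h) < 1.2:
            heads.append((x, y, w, h))
    return heads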
Method For Communicating Audiovisual Programs Over A Communication Network,
February 23, 1999
This patent relates to on-demand streaming of media over IP networks. The patent discloses a method that supports visual browsing of media streams and is primarily intended for video and illustrated audio stream types. There is an increasing amount of rich media on the web, and broadband access is making it available to larger numbers of people. New methods for searching and browsing rich media over IP networks are required in order to fully exploit these trends. In streaming media systems, a prefetch buffer is maintained on the client to compensate for network jitter. Navigating around a stored media clip is difficult due to the time required to refill the buffer after seek operations. With the current invention, the network bandwidth is managed not only to send the data streams for basic buffering, but also to transmit additional information needed for stream navigation. This additional information is loaded non-sequentially using either UDP or TCP for data transport. In addition to this concept, the patent further discloses an optimal buffer control algorithm that selects the best representative image set for any given time during the streaming session. In comparison with previously existing methods, the new method offers a much more interactive environment for simultaneous streaming and browsing of visual media.
Method and apparatus for recording and indexing an audio and multimedia conference,
January 20, 1998
A method and apparatus for recording and indexing audio information exchanged during an audio conference call, or video, audio and data information exchanged during a multimedia conference. For a multimedia conference, the method and apparatus utilize the voice activated switching functionality of a multipoint control unit (MCU) to provide a video signal, which is input to the MCU from a workstation from which an audio signal is detected, to each of the other workstations participating in the conference. A workstation and/or participant-identifying signal generated by the multipoint control unit is stored, together or in correspondence with the audio signal and video information, for subsequent ready retrieval of the stored multimedia information. For an audio conference, a computer is connected to an audio bridge for recording the audio information along with an identification signal for correlating each conference participant with that participant's statements.
Apparatus for matching colors in image signals,
October 17, 1995
An inexpensive, yet robust color matcher is achieved by an apparatus which includes a compare device coupled to receive, as a first input, an input signal representing a combination of a luminance component Y and a chrominance component, and further coupled to receive, as a second input, a threshold T, for comparing the input signal to the threshold T and outputting a color match signal if the input signal falls within a range defined by T and -T; and a threshold supply device coupled to the second input for supplying the threshold T, wherein the threshold T is a function of Y and at least two adjustable parameters.
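In essence, the matcher tests whether the combined signal falls within [-T, T] for a threshold T derived from luminance. A minimal sketch, assuming (purely for illustration) a linear threshold function of Y with two adjustable parameters:

def color_match(signal: float, y: float, p1: float, p2: float) -> bool:
    # T is "a function of Y and at least two adjustable parameters";
    # a linear form is assumed here purely for illustration.
    t = p1 * y + p2
    # Output a match if the input signal falls within the range [-T, T].
    return -t <= signal <= t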
iMiracle: Multimodal Speech-Enabled Mobile Video Search
Bernard Renger, Andrea Basso, David Gibbon, Michael Johnson, Zhu Liu, Behzad Shahraray
Spoken Query Voice Search Workshop 2010 (SQ2010),
2010.
[PDF]
[BIB]
This paper describes iMIRACLE, a system designed to support multimodal speech-enabled mobile video search of previously recorded broadcast TV programs.
It allows the user to search for video content using an interface combining spoken and graphical interaction.
Users can search for video based both on metadata and the content itself.
The principal technologies used include large vocabulary speech recognition, natural language understanding, and video search.
As new TV programs are processed and indexed by a content-based video search engine,
new speech models are created daily to handle the constantly changing metadata and vocabulary.
While the methods presented can be used with a wide range of mobile devices,
the current implementation of the system supports video playback on the iPhone or on another display such as a TV controlled via the iPhone.

Using MPEG Standards for Content Based Indexing of Broadcast Television, Web, and Enterprise Content
David Gibbon, Zhu Liu, Andrea Basso, Behzad Shahraray
Handbook of MPEG Applications, M. Angelides and H. Agius, Ed.,
John Wiley and Sons,
2010.
[BIB]
This chapter describes the use of MPEG-7 and MPEG-21 for television program descriptions and user preferences in the existing TVAnytime and DLNA specifications and in the emerging ATIS IPTV specifications.
MPEG-7 defines a rich language for describing media at a multitude of levels, from low-level audio and video features up through higher-level semantics and global metadata.
We further describe a large-scale system for metadata augmentation via content processing, including video segmentation, face detection, image and image-region similarity measures, automatic speech recognition, speaker segmentation, and multimodal processing.
MPEG-7 and MPEG-21 are used for content ingestion and representation of processing results.
The ingested content sources include ATSC MPEG-2, IPTV H.264/MPEG-4 HD and SD transport streams, as well as MPEG-4 encoded video files from Web sources.
Illustrative examples of generated metadata are provided.

Effective and Scalable Video Copy Detection
Zhu Liu, Tao Liu, David Gibbon, Behzad Shahraray
ACM SIGMM International Conference on Multimedia Information Retrieval, MIR'10,
2010.
[PDF]
[BIB]
Video copy detection techniques are essential for a number of applications, including discovering copyright infringement of multimedia content, monitoring commercial air time, and querying videos by example.
Over the last decade, video copy detection has received rapidly growing attention from the multimedia research community.
To encourage more innovative technology and benchmark state-of-the-art approaches in this field, the TRECVID conference series, sponsored by NIST, initiated an evaluation task on content-based copy detection in 2008.
In this paper, we describe the content-based video copy detection framework developed at AT&T Labs – Research.
We employed local visual features to match the video content and adopted hashing techniques to maintain the scalability and robustness of our approach.
Experimental results on TRECVID 2008 data show that our approach is effective and efficient.
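To illustrate the general idea of hashing local features for scalability (the grid quantizer and voting scheme below are simple stand-ins, not the AT&T system's actual methods):

import numpy as np
from collections import defaultdict

def build_index(reference_features: dict, cell: float = 0.25) -> dict:
    # Hash each local feature vector to a coarse grid cell and remember
    # which reference video it came from.
    index = defaultdict(set)
    for video_id, features in reference_features.items():  # (n, d) arrays
        for f in features:
            index[tuple(np.floor(f / cell).astype(int))].add(video_id)
    return index

def detect_copies(index: dict, query_features, cell: float = 0.25) -> list:
    # Vote for reference videos that share hashed local features with the query.
    votes = defaultdict(int)
    for f in query_features:
        for video_id in index.get(tuple(np.floor(f / cell).astype(int)), ()):
            votes[video_id] += 1
    return sorted(votes.items(), key=lambda item: -item[1])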

A novel architecture for content and delivery validation for IPTV systems
Andrea Basso, David Gibbon, Zhu Liu, Bernard Renger, Behzad Shahraray, U. Muller
IEEE Consumer Communications and Networking Conference (CCNC),
2010.
[PDF]
[BIB]
In this paper, we describe a novel architecture for content and delivery validation for IPTV systems.
The system generates multimedia events derived from IPTV content analysis and its delivery.
Events are aggregated, filtered and, through a rules-based system, processed and prioritized to provide monitoring and alerting capabilities.
Offline event processing along with a video archiving capability allows for detailed forensic analysis.
The novelty of the approach includes the correlation and combination of content-specific and network-specific events in a normalized representation to allow for complex forensic monitoring and alerting.
Applications include content security assurance and metadata verification.

Project GeoTV - A Three-Screen Service
Yih-Farn Chen, David Gibbon, Rittwik Jana, Bernard Renger, Bin Wei, Hailong Sun, Ping-Fai Yang
IEEE Consumer Communications and Networking Conference (CCNC),
2009.
[PDF]
[BIB]
GeoTV is a project that explores seamless integration of mobile phones, HDTV sets, and computers in the living room to enrich the user experience of existing services.
The three-screen service allows a user to navigate a world map on a smart phone to track geo-located media RSS content that matches her personal interests.
The user can show a matching video clip on her phone or direct a nearby HDTV set to play the video.
In addition, the user can bring up a world map on a nearby computer screen to navigate areas of interest related to the video clip.
GeoTV allows all three screens to be used for what they are best for: HDTV for high-resolution video, the computer screen for browsing a world map, and a smart phone for personalized control at hand to select media of interest.
Large Scale Content Analysis Engine
David Gibbon, Zhu Liu
ACM Workshop on Large-Scale Multimedia Retrieval and Mining, LS-MMRM’09,
2009.
[PDF]
[BIB]
The evolution of IP video systems has resulted in unprecedented access to a wide range of video material for consumers via IPTV and Web delivery.
Retrieval technologies help users find relevant content, but suffer from a paucity of reliable content descriptions.
In this paper, we describe the large scale content analysis engine (CAE), which is designed to facilitate content-based video retrieval, as well as a number of other applications such as content repurposing, video browsing, discovering content relationships, and fine-grained content personalization.
A scalable system is presented to handle video from broadcast, enterprise, and web sources for applications including IPTV and mobile video.
Media processing includes shot boundary detection, automatic speech recognition, face detection, clustering of shots and faces, speaker segmentation, closed caption alignment, concept detection, and transcoding.
Metadata ingest, augmentation, and delivery using a standards-based approach enable the CAE to serve a range of content processing applications.

IPTV Terminal Metadata Specification
David Gibbon
Alliance for Telecommunications Industry Solutions,
2009.
[BIB]
This document specifies a logical data model to address the requirements related to IPTV services in the Consumer Domain, one of the metadata areas defined in the ATIS/IIF architecture. The data model is specified using XML schemas and facilitates the exchange of data related to users and the devices with which users consume IPTV services. Specifically, the following areas are addressed: user preferences for content consumption and accessibility, services to which users are subscribed, and recording of content consumption and user interaction.
Uninterrupted Recording and Real Time Content-based Indexing Service for IPTV Systems
Zhu Liu, David Gibbon, Behzad Shahraray
AT&T Technical Document,
2008.
[PDF]
[BIB]
Internet Protocol TV (IPTV) is a system for delivering TV content to customers over IP.
It has been deployed worldwide by many IPTV service providers, including AT&T and BT.
In addition to the core video services provided by traditional cable TV and satellite TV systems, IPTV provides much richer user interaction and enables additional value-added services.
This paper presents a prototype of one such service: uninterrupted recording and real-time media content indexing for IPTV platforms.
While electronic program guides (EPG) list the TV programs that will be broadcast in the future and allow users to schedule recordings of shows they are interested in, our prototype empowers users to search and browse any moment of all recent broadcast content for certain channels.
Depending on business interests and requirements, this service can be realized in either a centralized architecture (on the network side) or a distributed one (on the individual client side).
The developed user interface can be seamlessly integrated on IPTV platforms, and it has been successfully demonstrated on the set-top box (STB) adopted in the AT&T U-verse IPTV service.
Multimedia Content Adaptation
David Gibbon
Encyclopedia of Multimedia, 2nd Ed.,
Springer,
2008.
[BIB]
Introduction to Video Search Engines
David Gibbon, Zhu Liu
Introduction to Video Search Engines,
Springer,
2008.
[BIB]
IPTV Emergency Alert System Metadata Specification
H. Bassali
Alliance for Telecommunications Industry Solutions,
2008.
[BIB]
Building upon the system requirements given in ATIS-0800010, Emergency Alert Service Provisioning Specifications, the IPTV Emergency Alert System Metadata Specification in this document defines an XML schema used for delivery of emergency alert signaling and information to the IPTV service provider’s EAS Ingestion System (EIS), and for delivery of alert information and signaling to the IPTV Terminal Function on the consumer premises. In addition, the document specifies the methods used to authenticate EAS data and audio files.
Content Personalization and Adaptation for Three Screen Services
Zhu Liu, David Gibbon, Harris Drucker, Andrea Basso
ACM International Conference on Image and Video Retrieval,
2008.
[PDF]
[BIB]
Three screen services provide the right solution for consumers to access rich multimedia resources from any device, anytime and anywhere.
In this paper, we describe a prototype system of content personalization and adaptation for three screen services.
The system continuously acquires content from TV broadcast feeds, then indexes and adapts the content for users according to their interests as defined in preference profiles.
Automatically compiled segments of content can be rendered on a variety of devices that the customers prefer, to facilitate a smoother video consuming experience.
Simulation results show that the proposed content analysis modules, including shot boundary detection, anchorperson detection, and multimodal story segmentation, are effective.
The resulting personalized content is suitable for consumption on devices with limited input capabilities.

Brief and High-Interest Summary Generation: Evaluating the AT&T Labs BBC Summarizations
Zhu Liu, Eric Zavesky, Behzad Shahraray, David Gibbon, Andrea Basso
ACM International Conference on Multimedia,
2008.
[BIB]
Video summarization is essential for the user to understand the main theme of video sequences in a short period, especially when the volume of the video is huge and the content is highly redundant.
In this paper, we present a video summarization system built for the rushes summarization task in TRECVID 2008.
The goal is to create a video excerpt including objects and events in the video with minimum redundancy and duration (up to 2% of the original video).
We first segment a video into shots and then apply a multi-stage clustering algorithm to eliminate similar shots.
Frame importance values that depend on both the temporal content variation and the spatial image salience are used to select the most interesting video clips as part of the summarization.
We test our system with two output configurations - a dynamic playback rate and the native playback rate - as a tradeoff between ground truth inclusion rate and ease of browsing.
TRECVID evaluation results show that our system achieves a good inclusion rate and verify that the created video summarization is easy to understand.
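The frame-importance idea can be sketched as a weighted blend of temporal variation and spatial salience; both measures and the weighting below are assumed simplifications, not the paper's exact features:

import numpy as np

def frame_importance(frames: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    # frames: grayscale video as an (n_frames, height, width) array.
    frames = frames.astype(float)
    # Temporal content variation: mean absolute difference between frames.
    diffs = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))
    temporal = np.concatenate([[0.0], diffs])
    # Spatial image salience, approximated here by per-frame intensity variance.
    spatial = frames.reshape(len(frames), -1).var(axis=1)
    normalize = lambda x: x / (x.max() + 1e-9)
    return alpha * normalize(temporal) + (1.0 - alpha) * normalize(spatial)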

Bridging Communication Services – Connecting Virtual Worlds to Real World: a Service Provider’s Perspective
Rittwik Jana, Andrea Basso, Yih-Farn Chen, Giuseppe Di Fabbrizio, David Gibbon, Bernard Renger, Bin Wei
MobEA VI, WWW 2008,
2008.
[PDF]
[BIB]
In this paper, we develop the concept of service bridging which allows services in virtual worlds (VW) to be connected to services in the real world (RW). In particular, we show through examples in a representative VW, Second Life (SL), various bidirectional connectivity modes that exist via messaging, web publishing and real time multimedia (voice, video etc.). We advocate the use of middleware proxies and synthetic content based sampling techniques to provide some of these services. Specifically, we construct a conference site for WWW 2008 in SL to experiment with some of these concepts to demonstrate how users can interact with authors, content, and other attendees through our bridging services in ways that are not available today to conference attendees.
Video Content Personalization for IPTV Services
David Gibbon, Zhu Liu, Harris Drucker, Bernard Renger, Lee Begeja, Behzad Shahraray
IEEE Broadband Technology Symposium (BTS),
2007.
[PDF]
[BIB]
IPTV customers will have access to thousands of video content sources and will require powerful yet intuitive tools to locate desired content.
We propose a solution based on stored user interest profiles and multimodal processing for content segmentation to produce manageable content subsets for users.
Segmentation involves part-of-speech tagging on extracted closed-caption text combined with video shot boundary detection.
Query relevance ranking with temporal and other metadata constraints is used to form timely, focused content sets for users.
We present a fully automated end-to-end system including standard-definition video acquisition, media processing for segmentation, a content storage and retrieval subsystem, and content presentation via a user interface designed for set-tops requiring only a simple remote control for browsing.
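A hedged sketch of query relevance ranking under a temporal constraint, roughly in the spirit described above; the keyword-overlap score and exponential recency decay are illustrative assumptions:

import math
import time

def rank_segments(segments: list, query_terms: set,
                  half_life_days: float = 7.0) -> list:
    # segments: dicts like {"text": ..., "timestamp": ...} (epoch seconds).
    now = time.time()
    terms = {t.lower() for t in query_terms}

    def score(segment: dict) -> float:
        # Keyword overlap stands in for the full relevance model...
        overlap = len(set(segment["text"].lower().split()) & terms)
        # ...decayed by age so that timely content ranks first.
        age_days = (now - segment["timestamp"]) / 86400.0
        return overlap * math.exp(-math.log(2.0) * age_days / half_life_days)

    return sorted(segments, key=score, reverse=True)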

Searching Visual Semantic Spaces with Concept Filters
Eric Zavesky, Zhu Liu, David Gibbon, Behzad Shahraray
IEEE International Conference on Semantic Computing, ICSC 2007,
pp 329-336,
2007.
[PDF]
[BIB]
Semantic concepts cement the ability to correlate visual information to higher-level semantic concepts.
Traditional image search leverages text associated with images, a low-level content-based matching, or a combination of the two.
We propose a new system that uses 374 semantic concepts (derived from the LSCOM lexicon) to semantically facilitate fast exploration of a large set of video data.
This new system, when coupled with traditional image search techniques, produces a very intuitive and fruitful design for targeted user interaction.
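Conceptually, a concept filter reduces to thresholding precomputed detector scores per shot. A minimal sketch (the score layout and threshold semantics are assumptions, not the paper's interface):

def filter_shots(shot_scores: dict, concept_filters: dict) -> list:
    # shot_scores: {shot_id: {concept_name: detector_score}}.
    # concept_filters: {concept_name: minimum_score} chosen by the user.
    # Keep only shots whose detector scores satisfy every active filter.
    return [shot_id
            for shot_id, scores in shot_scores.items()
            if all(scores.get(concept, 0.0) >= threshold
                   for concept, threshold in concept_filters.items())]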
Prototype Demonstration: Video Content Personalization for IPTV Services
David Gibbon, Zhu Liu, Harris Drucker, Bernard Renger, Lee Begeja, Behzad Shahraray
IEEE Consumer Communications and Networking Conference (CCNC),
2007.
[PDF]
[BIB]
IPTV customers will have access to thousands of video content sources and will require powerful yet intuitive tools to locate desired content.
We propose a solution based on stored user interest profiles and multimodal processing for content segmentation to produce manageable content subsets for users.
Segmentation involves linguistic processing of extracted closed-caption text combined with video shot boundary detection as well as audio processing.
Query relevance ranking with temporal and other metadata constraints is used to form timely, focused content sets for users.
We will demonstrate a prototype for browsing automatically personalized video content in a set-top usage context, which meets the challenging requirements imposed by the low-resolution TV display and the limited input capabilities of a simple infrared remote control.
The client runs on Microsoft Windows Media Center Edition (MCE), and the video content is derived from MPEG-2 DVR content that has been indexed and transcoded to WM9 to facilitate streaming from a Windows Media Server.
The MCE client application interfaces to the AT&T Labs – Research MIRACLE platform, which creates the personalized content from a 36,000 program archive that is updated on a daily basis from a wide range of content sources.

GeoTV: Navigating Geocoded RSS to Create an IPTV Experience
Yih-Farn Chen, Giuseppe Di Fabbrizio, David Gibbon, Rittwik Jana, Serban Jora, Bernard Renger, Bin Wei
International World Wide Web Conference, WWW2007,
2007.
[PDF]
[BIB]
The Web is rapidly moving towards a platform for mass collaboration in content production and consumption from three screens: computers, mobile phones, and TVs.
While there has been a surge of interest in making Web content accessible from mobile devices, there is a significant lack of progress when it comes to making the web experience suitable for viewing on a television.
Towards this end, we describe a novel concept, namely GeoTV, where we explore a framework by which web content can be presented or pushed in a meaningful manner to create an entertainment experience for the TV audience.
Fresh content on a variety of topics, people, and places is being created and made available on the Web at breathtaking speed.
Navigating fresh content effectively on TV demands a new browsing paradigm that requires few mouse clicks or user interactions from the remote control.
Novel geospatial and temporal browsing techniques are provided in GeoTV that allow users to aggregate and navigate RSS-enabled content in a timely, personalized, and automatic manner for viewing in an IPTV environment.
This poster is an extension of our previous work on GeoTracker, which utilizes both a geospatial representation and a temporal (chronological) presentation to help users spot the most relevant updates quickly within the context of a Web-enabled environment.
We demonstrate 1) the usability of such a tool, which greatly enhances a user's ability to locate and browse videos based on his or her geographical interests, and 2) various innovative interface designs for showing RSS-enabled information in an IPTV environment.

Clicker - An IPTV Remote Control in Your Cell Phone
Rittwik Jana, Yih-Farn Chen, David Gibbon, Yennun Huang, Serban Jora, John Murray, Bin Wei
IEEE International Conference on Multimedia & Expo, ICME'07,
2007.
[PDF]
[BIB]
This paper investigates a novel concept of providing seamless control and portability of an IPTV viewing session.
A solution employing a middleware system, a secure hardware token, and a cell phone is used to demonstrate how an IPTV session can be securely controlled remotely and moved between multiple viewing stations.
We build a prototype of the system and demonstrate its flexible features.
Depending on the user's protocol of choice, most remote control operations from a mobile device took less than 5 seconds to execute.
An interesting capability of previewing content from other channels via the user's device, while still continuing to watch the program on the viewing station, shows a difference from today's IPTV offerings.
Finally, for mobile content delivery, we address the problem of dynamic device profile selection and content adaptation, using a classification algorithm to match the best content alternative destined for a mobile device.
Advanced Content Management and Distribution
D. Parisi, M. Weinstein, J. Daugherty, R. Klimovich, Behzad Shahraray, David Gibbon
SMPTE and VSF 2007 Joint Conference,
2007.
[PDF]
[BIB]
The Miracle Video Search Engine
David Gibbon, Zhu Liu, Behzad Shahraray
IEEE Consumer Communications and Networking Conference (CCNC),
2006.
[PDF]
[BIB]
This paper presents some of the searching and browsing features of the MIRACLE Video Search Engine. The rapid
increase in the generation and dissemination of information and entertainment in video form has created a need for
video search engines that facilitate finding and browsing relevant information. MIRACLE is an ongoing project at
AT&T Labs aimed at addressing this need. This video search engine combines existing metadata with content-based
information that is automatically extracted from the media components or obtained from other sources. A
web-based user interface provides browsing mechanisms that take advantage of the user’s perceptual
abilities to refine the search results. The MIRACLE search engine currently operates on an archive of more than
32,000 hours of video that have been collected and automatically indexed over a ten year period.

Personal Media Alert Systems: Personalization and Dissemination of Broadcast Content with a P2P Micropayment Scheme
Yih-Farn Chen, David Gibbon, Zhu Liu, Behzad Shahraray, B. Wei
IEEE ICME,
2006.
[PDF]
[BIB]
Media consumers who are overwhelmed by multitudinous
news content demand a system that would sift through tens
or hundreds of broadcast TV channels on a daily basis to
capture the most important clips that match users’ interests
and deliver these personalized clips for easy viewing on a
typical TV or home PC. In this paper, we present a personal
media alert system that extracts video segments of
interest to users. In addition, alerts with pointers to content
stored on the home server can be sent to mobile users who
are authorized to access the content at home. To make
content sharing feasible and scalable, we also propose a P2P
payment scheme that would allow a consumer to have
access to media clips aggregated on other consumers'
personal media systems.
Multimedia Content Acquisition and Processing in the MIRACLE System
Zhu Liu, David Gibbon, Behzad Shahraray
IEEE Consumer Communications and Networking Conference (CCNC),
2006.
[PDF]
[BIB]
This paper describes the content acquisition,
indexing and repurposing components of the
MIRACLE system. MIRACLE is an ongoing research
project at AT&T Labs aimed at creating automated
content-based media processing algorithms and
systems to collect, organize, index, and repurpose
video and multimedia information. While the retrieval
engine and the user interface are the most visible part
of the query process, performing the required
processing to extract the relevant information and
generate the indices for the retrieval engine is the most
crucial and challenging component of the overall
system. Moreover, the utility of the system is highly
dependent on the quality and quantity of the acquired
data. We describe an acquisition system that takes
advantage of the Personal Video Recorder (PVR)
functionality of the Windows OS to acquire high-quality
video and the associated metadata. We discuss several
processing algorithms for extracting information that
enable efficient searching and browsing of more than
32,000 video programs.

Introduction to Video Search Engines
David Gibbon
International World Wide Web Conference, WWW2006,
2006.
[PDF]
[BIB]
The emergence of video search engines on major web search portals,
market forces such as IPTV and mobile video service deployments, and the
growing acceptance of digital rights management technologies are enabling new
application domains for multimedia search. This tutorial will give participants a
more complete understanding of the development, current state of the art and
future trends of multimedia search technologies in general, and video search
engines in particular. Participants will learn the relationships between multimedia
search and conventional web search and the capabilities and limitations of
current multimedia retrieval systems.
AT&T Research at TRECVID 2006
Zhu Liu, David Gibbon, Eric Zavesky, Behzad Shahraray, Patrick Haffner
NIST TREC Video Retrieval Evaluation,
2006.
[PDF]
[BIB]
TRECVID (TREC Video Retrieval Evaluation) is sponsored by
NIST to encourage research in digital video indexing and retrieval.
It was initiated in 2001 as a “video track” of TREC and became an
independent evaluation in 2003. AT&T participated in three tasks
in TRECVID 2006: shot boundary determination (SBD), search,
and rushes exploitation. The proposed SBD algorithm contains a
set of finite state machine (FSM) based detectors for pure cut, fast
dissolve, fade in, fade out, dissolve, and wipe. A support vector
machine (SVM) is applied to the cut and dissolve detectors to further
boost SBD performance. AT&T collaborated with Columbia
University in the search and rushes exploitation tasks. In this
paper, we mainly focus on the SBD system and briefly introduce
our efforts on the search and rushes exploitation. The AT&T
SBD system is highly effective and its evaluation results are
among the best.
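As background, a finite state machine detector of this kind can be sketched in a few lines: the machine watches a stream of inter-frame difference values and declares a pure cut when a single large spike is followed by renewed stability. The sketch below is a minimal illustration under assumed thresholds (T_HIGH, T_LOW) and a hypothetical detect_cuts helper; it does not reproduce the AT&T SBD detectors.

    # Minimal FSM cut-detector sketch; the states, thresholds, and the
    # detect_cuts helper are illustrative assumptions, not the AT&T
    # SBD implementation.
    from enum import Enum

    class State(Enum):
        STEADY = 0      # within a shot
        CANDIDATE = 1   # one large inter-frame difference seen

    T_HIGH = 0.5  # hypothetical: a difference this large suggests a cut
    T_LOW = 0.1   # hypothetical: differences this small mean a stable shot

    def detect_cuts(frame_diffs):
        """Return frame indices where a pure cut is declared."""
        cuts, state, candidate = [], State.STEADY, None
        for i, d in enumerate(frame_diffs):
            if state is State.STEADY and d > T_HIGH:
                state, candidate = State.CANDIDATE, i  # possible cut at frame i
            elif state is State.CANDIDATE:
                if d < T_LOW:         # activity settled: accept the cut
                    cuts.append(candidate)
                state = State.STEADY  # either way, leave the candidate state
        return cuts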

Semantic Data Mining of Short Utterances
Lee Begeja, Harris Drucker, David Gibbon, Patrick Haffner, Zhu Liu, Bernard Renger, Behzad Shahraray
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP),
2005.
[PDF]
[BIB]
This paper introduces a methodology for speech
data mining along with the tools that the methodology requires.
We show how they increase the productivity of the analyst who
seeks relationships among the contents of multiple utterances and
ultimately must link some newly discovered context into testable
hypotheses about new information.
While in its simplest form one can extend text data mining to
speech data mining by using text tools on the output of a speech
recognizer, we have found that it is not optimal. We show how
data mining techniques that are typically applied to text should be
modified to enable an analyst to do effective semantic data mining
on a large collection of short speech utterances.
Semantic Data Mining of Short Utterances
Lee Begeja, Harris Drucker, David Gibbon, Patrick Haffner, Zhu Liu, Bernard Renger, Behzad Shahraray
IEEE Transactions On Speech & Audio Processing: Special Issue on Data Mining of Speech, Audio and Dialog,
2005.
[PDF]
[BIB]
This paper introduces a methodology for speech data mining along with the tools that
the methodology requires. We show how they increase the productivity of the analyst who seeks
relationships among the contents of multiple utterances and ultimately must link some newly
discovered context into testable hypotheses about new information.
While in its simplest form one can extend text data mining to speech data mining by using text
tools on the output of a speech recognizer, we have found that it is not optimal. We show how data
mining techniques that are typically applied to text should be modified to enable an analyst to do
effective semantic data mining on a large collection of short speech utterances.
For the purposes of this paper we examine semantic data mining in the context of semantic
parsing and analysis in a specific situation involving the solution of a business problem that is
known to the analyst. We are not attempting a generic semantic analysis of a set of speech utterances. Our
tools and methods allow the analyst to mine the speech data to discover the semantics that best
cover the desired solution. The coverage, in this case, yields a set of Natural Language
Understanding (NLU) classifiers that serve as testable hypotheses.
Multimedia Content Adaptation
David Gibbon
Encyclopedia of Multimedia,
Springer,
2005.
[PDF]
[BIB]
MediaAlert - A Broadcast Video Monitoring and Alerting System for Mobile Users
B. Wei, David Gibbon, Bernard Renger, Zhu Liu, Yih-Farn Chen, Behzad Shahraray, Rittwik Jana, H. Huang, Lee Begeja
MobiSys,
2005.
[PDF]
[BIB]
We present a system for automatic monitoring
and timely dissemination of multimedia information to a range of
mobile information appliances based on each user’s interest
profile. Multimedia processing algorithms detect and isolate
relevant video segments from over twenty television broadcast
programs based on a collection of words and phrases specified by
the user. Content repurposing techniques are then used to
convert the information into a form that is suitable for delivery to
the user’s mobile devices. Alerts are sent using a number of
application messaging and network access protocols including
email, short message service (SMS), multimedia messaging
service (MMS), voice, session initiation protocol (SIP), fax, and
pager protocols. The MediaAlert system provides an effective
and low-cost solution for the timely generation of alerts
containing personal, business, and security information.

Multimedia Processing for Enhanced Information Delivery on Mobile Devices
David Gibbon, Lee Begeja, Zhu Liu, Bernard Renger, Behzad Shahraray
Emerging Applications for Wireless and Mobile Access, MobEA II,
2004.
[PDF]
[BIB]
Handheld mobile devices have created new possibilities for accessing information.
The limited power, storage capacity, communications bandwidth, and user interface capabilities of these devices,
however, present challenges for effective presentation of multimedia data. In this paper, we discuss how automated multimedia processing techniques
can be used to address some of these challenges. Media conversion is used to generate presentations that fit the device and/or user capabilities.
Content-based video sampling is used to generate a compact presentation of visual information using a small set of still images.
Multimodal content processing is employed to extract relevant video clips based on a user profile of interests. The combination of these techniques
creates personalized information delivery systems that reduce the storage, bandwidth, and processing power requirements, and simplify user interaction.
A prototype of such a system for personalized delivery of video information on mobile devices is presented.

Interactive Machine Learning Techniques for Improving SLU Models
Lee Begeja, David Gibbon, Zhu Liu, Bernard Renger, Behzad Shahraray
HLT/NAACL Workshop on Spoken Language Understanding for Conversational Systems,
2004.
[PDF]
[BIB]
Spoken language understanding is a critical
component of automated customer service applications.
Creating effective SLU models is
inherently a data driven process and requires
considerable human intervention. We describe
an interactive system for speech data
mining. Using data visualization and interactive
speech analysis, our system allows a User
Experience (UE) expert to browse and understand
data variability quickly. Supervised
machine learning techniques are used to capture
knowledge from the UE expert. This captured
knowledge is used to build an initial
SLU model, an annotation guide, and a training
and testing system for the labelers. Our
goal is to shorten the time to market by increasing
the efficiency of the process and to
improve the quality of the call types, the call
routing, and the overall application.

A System for Searching and Browsing Spoken Communications
Lee Begeja, Bernard Renger, M. Saraclar, David Gibbon, Zhu Liu, Behzad Shahraray
HLT/NAACL Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval,
2004.
[PDF]
[BIB]
As the amount of spoken communications accessible
by computers increases, searching and
browsing are becoming crucial for utilizing such
material to gather information. It is desirable
for multimedia content analysis systems
to handle various formats of data and to serve
varying user needs while presenting a simple
and consistent user interface. In this paper,
we present a research system for searching and
browsing spoken communications. The system
uses core technologies such as speaker segmentation,
automatic speech recognition, transcription
alignment, keyword extraction and speech
indexing and retrieval to make spoken communications
easy to navigate. The main focus is
on telephone conversations and teleconferences
with comparisons to broadcast news.
Panel of Experts: The Future of Video Databases
A. Bovik, A. Del Bimbo, N. Dimitrova, S. Ghandeharizadeh, David Gibbon, F. Golshani, T. Huang, R. Jain, D. Petkovic, R. Picard
Handbook of Multimedia Databases,
CRC Press,
pp 1173-1197,
2003.
[BIB]
Creating Personalized Video Presentations using Multimodal Processing
David Gibbon, Lee Begeja, Zhu Liu, Bernard Renger, Behzad Shahraray
Handbook of Multimedia Databases,
CRC Press,
pp 1107-1131,
2003.
[PDF]
[BIB]
This chapter will focus on using multimodal processing in the domain of
broadcast television content with the goal of automatically producing
customized video content for individual users. Multimodal topic
segmentation is used to extract video clips that are of interest to users, as
indicated by their profiles, from a wide range of content sources. We
will primarily use closed captions as the text source, although the textual
information may come from a variety of sources including post-production
scripts, very large vocabulary automatic speech recognition, and transcripts
which are aligned with audio using speech processing methods.
Support vector machines: relevance feedback and information retrieval
Harris Drucker, Behzad Shahraray, David Gibbon
Information Processing and Management,
v38,
#3,
pp 305-323,
2002.
[PDF]
[BIB]
We show that support vector machines (SVMs) are much better than conventional
algorithms in a relevancy feedback environment in information retrieval (IR) of text
documents. We track performance as a function of feedback iteration and show that while
the conventional algorithms do very well in the initial feedback iteration if the topic searched
for has high visibility in the database, they do very poorly if the relevant documents are a
small percentage of the database. SVMs, however, do very well when the number of
documents returned in the preliminary search is low and the number of relevant documents is
small. The competitive algorithms examined are Rocchio, Ide regular, and Ide dec-hi.
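For readers unfamiliar with the baselines, the classic Rocchio update moves the query vector toward the mean of the relevant documents and away from the mean of the non-relevant ones. The sketch below is the textbook formulation with illustrative alpha/beta/gamma weights and a hypothetical rocchio helper; it is not the implementation evaluated in the paper.

    # Textbook Rocchio relevance-feedback update; the weights and the
    # dict-based term vectors are illustrative assumptions.
    def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
        """Return the updated query vector (term -> weight) after one iteration."""
        new_q = {t: alpha * w for t, w in query.items()}
        for docs, sign, coeff in ((relevant, +1, beta), (nonrelevant, -1, gamma)):
            if not docs:
                continue
            scale = coeff / len(docs)
            for doc in docs:
                for t, w in doc.items():
                    new_q[t] = new_q.get(t, 0.0) + sign * scale * w
        # negative term weights are conventionally clipped to zero
        return {t: w for t, w in new_q.items() if w > 0.0}
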
eClips: A New Personalized Multimedia Delivery Service
Lee Begeja, Bernard Renger, David Gibbon, K. Huber, Zhu Liu, Behzad Shahraray, R. Markowitz, P. Stuntebeck
Alliance Engineering Symposium,
2001.
[PDF]
[BIB]
eClips: A New Personalized Multimedia Delivery Service
Lee Begeja, Bernard Renger, David Gibbon, K. Huber, Zhu Liu, Behzad Shahraray, R. Markowitz, P. Stuntebeck
Journal of the Institution of British Telecommunications Engineers (IBTE),
2001.
[PDF]
[BIB]
This article presents a new multimedia personalisation
service that automatically extracts multimedia content
segments (electronic clips—called eClips), based on individual
preferences (key terms/words, content source, etc.)
—which the user identifies in a profile. User profiles are
stored in the service platform and continually checked
against new content in the system. When matches are
found between a user profile and the content, the service
platform alerts the user that segments have been identified
and extracted. The user may then view/play these automatically
provided segments (eClips). In addition, the
eClips service stitches clips from diverse sources together,
providing an automatically-generated multimedia experience
that revolves around the user’s provided profile.

Virtual Light: Digitally-Generated Lighting for Video Conferencing Applications
Andrea Basso, E. Cosatto, H. P. Graf, David Gibbon, S. Liu
IEEE Signal Processing Society, 2001 International Conference on Image Processing (ICIP-2001),
2001.
[PDF]
[BIB]
In this paper we discuss a simple method to improve the
lighting conditions of a real scene or video sequence. In
particular we concentrate on modifying the intensities of real
light sources and inserting virtual lights into a real scene
viewed from a fixed viewpoint (i.e., a fixed camera). We
target video conferencing applications where typical
lighting conditions (i.e. the average office or home
environment) are poor. Our model is much simpler and is
a departure from approaches such as relighting, developed
in computer augmented reality (CAR). In such methods
the scene is first reconstructed geometrically by means of
computer vision techniques, light exchanges among
objects in the scene are computed and illumination
textures coming from real and virtual lights are modeled
and reintroduced in the scene. We rely on a virtual
illumination equation that takes into account light
attenuation, Lambertian and specular reflections. Our first
scenario models a 3D space in which the image or video
frame lies on a 2D plane. Virtual lights are placed in the
3D space and illuminate the 2D image. We then propose a
refined model for talking head video sequences in which
the talking head surface is modeled with an ellipsoid.
Results show that convincing lighting conditions for the
scene can be achieved in both scenarios.
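As a rough illustration of the kind of equation involved, the sketch below combines distance attenuation with Lambertian and Blinn-style specular terms to compute a per-pixel intensity gain for one virtual light. All constants and the virtual_light helper are assumptions for illustration; the paper's actual model may differ.

    # Illustrative per-pixel virtual-light equation: attenuation times
    # (Lambertian + specular). All constants are hypothetical.
    import math

    def virtual_light(normal, light_dir, view_dir, distance,
                      kd=0.8, ks=0.2, shininess=16.0,
                      a0=1.0, a1=0.05, a2=0.01):
        """normal, light_dir, view_dir: unit 3-vectors (tuples)."""
        dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
        atten = 1.0 / (a0 + a1 * distance + a2 * distance ** 2)
        lambertian = max(dot(normal, light_dir), 0.0)
        # Blinn-style half vector for the specular highlight
        half = tuple(l + v for l, v in zip(light_dir, view_dir))
        norm = math.sqrt(dot(half, half)) or 1.0
        half = tuple(h / norm for h in half)
        specular = max(dot(normal, half), 0.0) ** shininess
        return atten * (kd * lambertian + ks * specular)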

Relevance Feedback Using Support Vector Machines
Harris Drucker, Behzad Shahraray, David Gibbon
Machine Learning: Proceedings of the Eighteenth International Conference, 2001, (ICML-2001),
pp 122-129,
2001.
[PDF]
[BIB]
We show that support vector machines
(SVMs) are much better than conventional
algorithms in a relevancy feedback (RF)
environment in information retrieval (IR) of
text documents. We track performance as a
function of feedback iteration and show that
while the conventional algorithms do very well
in the initial feedback iteration if the topic
searched for has high visibility in the database,
they do very poorly if the relevant
documents are a small percentage of the
database. SVMs, however, do very well when
the number of documents returned in the
preliminary search is low and the number of
relevant documents is small. The competitive
algorithms examined are Rocchio, Ide regular,
and Ide dec-hi.
VTonDemand: A Framework for Indexing, Searching and On-Demand Playback of RTP-Based Multimedia Conferences
Baldine Paul, David Gibbon, Glenn Cash, M. Reha Civanlar
1999 IEEE 3rd International Workshop on Multimedia Signal Processing,
pp 59-64,
1999.
[PDF]
[BIB]
In this paper, we describe the implementation of a system for reliable recording and
on-demand indexed playback of multimedia conferences using the services provided by
the RTP/RTCP protocols over an Intranet. Implementation issues are discussed regarding
the file structure to support random access during playback, indexing techniques, and
database searching capabilities.
Bullseye: A Compact, Camera-Based, Human-Machine Interface
R. Andersson, David Gibbon, R. Lyons
Presence,
v8,
#1,
pp 65-85,
1999.
[PDF]
[BIB]
BULLSEYE is a flexible computer input device that operates by generating and analyzing
the image of known "props" at 60 Hz to determine position, orientation, and user-controlled
internal state variables. The compact device, containing both the camera
and processing engine, has been built and fielded. This paper provides an overview of
the concept and hardware, then examines the problems and design strategies associated
with the interrelated areas of the environment, prop design, and color detection.
Browsing and Retrieval of Full Broadcast-Quality Video
David Gibbon, Andrea Basso, Reha Civanlar, Qian Huang, Esther Levin, Roberto Pieraccini
Proc. 10th International Workshop on Packet Video (Packet Video '99),
1999.
[PDF]
[BIB]
In this paper we describe a system we have developed for automatic broadcast-quality video indexing that
successfully combines results from the fields of speaker verification, acoustic analysis, very large
vocabulary speech recognition, content based sampling of video, information retrieval, natural language
processing, dialogue systems, and MPEG2 delivery over IP. Our audio classification and anchorperson
detection components (in the case of news material) classify video into news versus commercials using acoustic
features and can reach 97% accuracy on our test data set. The processing includes very large vocabulary
speech recognition (over 230K-word vocabulary) for synchronizing the closed caption stream with the
audio stream. Broadcast news corpora are used to generate language models and acoustic models for speaker
identification. Compared with conventional discourse segmentation algorithms based on only text
information, our integrated method operates more efficiently with more accurate results (> 90%) on a test
database of 17 one half-hour broadcast news programs. We have developed a natural language dialogue
system for navigating large multimedia databases and tested it on our database of over 4000 hours of
television broadcast material. Story rendering and browsing techniques are employed once the user has
restricted the search to a small subset of the database that can be efficiently represented in a few video screens.
We focus on the advanced home television as the target appliance and we describe a flexible rendering
engine that maps the user-selected story data through application-specific templates to generate suitable user
interfaces. Error resilient IP/RTP/RTSP MPEG-2 media control and streaming is included in the system to
allow the user to view the selected video material.
Automated Semantic Structure Reconstruction and Representation Generation for Broadcast News
Qian Huang, Zhu Liu, Aaron Rosenberg, David Gibbon, Behzad Shahraray
SPIE,
1999.
[BIB]
Automated Generation of News Content Hierarchy By Integrating Audio, Video, and Text Information
Qian Huang, Zhu Liu, Aaron Rosenberg, David Gibbon, Behzad Shahraray
IEEE International Conference On Acoustics, Speech, and Signal Processing (ICASSP),
v6,
pp 3025-3028,
1999.
[PDF]
[BIB]
Generating Hypermedia Documents from Transcriptions of Television Programs Using Parallel Text Alignment
David Gibbon
IEEE Eighth International Workshop on Research Issues in Data Engineering (RIDE),
pp 26-33,
1998.
[PDF]
[BIB]
This paper presents a method of automatically
creating hypermedia documents from conventional
transcriptions of television programs. Using parallel text
alignment techniques, the temporal information derived
from the closed caption signal is exploited to convert the
transcription into a synchronized text stream. Given this
text stream, we can create links between the transcription
and the image and audio media streams. We describe a
two-pass method for aligning parallel texts that first uses
dynamic programming techniques to maximize the
number of corresponding words (by minimizing the word
edit distance). The second stage converts the word
alignment into a sentence alignment, taking into account
the cases of sentence split and merge. We present results
of text alignment on a database of 610 programs
(including three television news programs over a one-year
period) for which we have closed caption, transcript,
audio and image streams. The techniques presented here
can produce high quality hypermedia documents of video
programs with little or no additional manual effort.
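To make the first pass concrete, word-level alignment by dynamic programming can be sketched as below: fill the standard edit-distance table, then trace back one optimal path to recover which caption/transcript word pairs correspond. The align_words helper is a textbook illustration under these assumptions, not the paper's implementation.

    # Textbook word-alignment sketch via edit-distance dynamic programming.
    def align_words(caption, transcript):
        """Return (i, j) index pairs of words that match in an optimal alignment."""
        m, n = len(caption), len(transcript)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if caption[i - 1] == transcript[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # delete
                              d[i][j - 1] + 1,         # insert
                              d[i - 1][j - 1] + cost)  # match/substitute
        pairs, i, j = [], m, n
        while i > 0 and j > 0:  # trace back one optimal path
            if caption[i - 1] == transcript[j - 1] and d[i][j] == d[i - 1][j - 1]:
                pairs.append((i - 1, j - 1))
                i, j = i - 1, j - 1
            elif d[i][j] == d[i - 1][j - 1] + 1:  # substitution
                i, j = i - 1, j - 1
            elif d[i][j] == d[i - 1][j] + 1:      # deletion
                i -= 1
            else:                                 # insertion
                j -= 1
        return pairs[::-1]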

Generating Hypermedia Documents from Transcriptions of Television Programs Using Parallel Text Alignment
David Gibbon
Handbook of Internet and Multimedia Systems and Applications,
CRC Press,
1998.
[PDF]
[BIB]
This chapter presents an automatic method for creating hypermedia documents
from conventional transcriptions of television programs. Using parallel text alignment
techniques, the temporal information derived from the closed caption signal is exploited to
convert the transcription into a synchronized text stream. Given this text stream, we can
create links between the transcription and the image and audio media streams. We first
describe an appropriate method for aligning texts based on dynamic programming
techniques, then present results of text alignment on a database of 610 broadcasts (including
three television news programs over a one-year period) for which we have caption,
transcript, audio and image streams. We have found correspondences for 77% to 92% of
the transcript sentences, depending on the program set. Approximately 12% of the
correspondences involve sentence splitting or merging. We describe a system that generates
several different HTML representations of television programs given the closed captioned
video and a transcription. The techniques presented here can produce high quality
hypermedia documents of video programs with little or no additional manual effort.

Pictorial Transcripts: Multimedia Processing Applied to Digital Library Creation
Behzad Shahraray, David Gibbon
IEEE First Workshop on Multimedia Signal Processing,
pp 581-586,
1997.
[PDF]
[BIB]
This paper describes a working system for the automated archiving and
selective retrieval of textual, pictorial and auditory information contained in video
programs. Video processing performs the task of representing the visual information
using a small subset of the video frames. Linguistic processing refines the closed
caption text, generates a table of contents, and creates links to relevant multimedia
documents. Audio and video information are compressed and indexed based on their
temporal association with the selected video frames and processed text. The derived
information is used to automatically generate a hypermedia rendition of the program
contents. This provides a compact representation of the information contained in the
video program. It also serves as a textual and pictorial index for selective retrieval of
the full-motion video program. This fully automatic system generates HyperText
Markup Language (HTML) renditions of television programs, and makes them
available for access over the Internet within seconds of their broadcast. This digital
library currently contains over 2200 hours of television programs.

Efficient Archiving and Content-based Retrieval of Video Information on the Web
Behzad Shahraray, David Gibbon
AAAI Symposium on Intelligent Integration and Use of Text, Image, Video, and Audio Corpora,
pp 133-136,
1997.
[PDF]
[BIB]
This paper summarizes an ongoing work in
multimedia processing aimed at the automated
archiving and selective retrieval of textual, pictorial
and auditory information contained in video programs.
Video processing performs the task of representing
the visual information using a small subset of the
video frames. Linguistic processing refines the closed
caption text, generates a table of contents, and creates
links to relevant multimedia documents. Audio and
video information are compressed, and indexed based
on their temporal association with the selected video
frames and processed text. The derived information is
used to automatically generate a hypermedia rendition
of the program contents. This provides a compact
representation of the information contained in the
video program. It also serves as a textual and pictorial
index for selective retrieval of the full-motion video
program. A fully automatic system has been set up
that generates HyperText Markup Language (HTML)
renditions of television programs, and makes them
available for access over the Internet within seconds
of their broadcast.

Multi-Modal System for Locating Heads and Faces
H. P. Graf, E. Cosatto, David Gibbon, M. Kocheisen, E. Petajan
Proc. Second International Conference on Automatic Face and Gesture Recognition,
IEEE Computer Soc. Press,
pp 88 - 93,
1996.
[PDF]
[BIB]
We designed a modular system using a
combination of shape analysis, color segmentation,
and motion information for reliably locating heads
and faces of different sizes and orientations in
complex images. The first of the system’s three
channels does a shape analysis on gray-level images
to determine the location of individual facial features
as well as the outlines of heads. In the second channel
the color space is analyzed with a clustering algorithm
to find areas of skin colors. The color space is first
calibrated, using the results from the other channels.
In the third channel motion information is extracted
from frame differences. Head outlines are determined
by analyzing the shapes of areas with large motion
vectors. All three channels produce lists of shapes,
each marking an area of the image where a facial
feature or a part of the outline of a head may be
present. Combinations of such shapes are evaluated
with n-gram searches to produce a list of likely head
positions and the locations of facial features. We
tested the system for tracking faces of people sitting in
front of terminals and video phones and used it to
track people entering through a doorway.
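As an illustration of the second (color) channel, pixels can be classified as skin by their distance to a cluster center in a brightness-normalized color space, as sketched below. The center and radius are hypothetical stand-ins for values that the system would calibrate from the other channels' results.

    # Illustrative skin-color test in normalized (r, g) chromaticity space;
    # the center and radius are hypothetical and would be recalibrated.
    def skin_mask(image, center=(0.45, 0.30), radius=0.06):
        """image: rows of (R, G, B) tuples; returns rows of booleans."""
        mask = []
        for row in image:
            out = []
            for (R, G, B) in row:
                s = R + G + B
                if s == 0:
                    out.append(False)
                    continue
                r, g = R / s, G / s  # brightness-invariant chromaticity
                dist2 = (r - center[0]) ** 2 + (g - center[1]) ** 2
                out.append(dist2 < radius ** 2)
            mask.append(out)
        return mask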

Color Signal Processing for a 256x256 Active Pixel Sensor
David Gibbon, K. Azadet, D. Inglis, S. Mendis, A. Dickinson
IEEE Solid-State Technology Workshop on CMOS Imaging Technology,
1996.
[LINK]
[BIB]
High-quality color reproduction has been obtained from a 256x256 CMOS APS with red, green, and blue polyimide filters using a 3x3 color correction matrix.
The matrix coefficients are determined by an error minimization procedure using a color test pattern.
The camera signal processing operations of white and black balance, color matrixing, gamma correction,
and reducing the effects of color aliasing have been implemented using real-time image processing software.
The 0.9 µm CMOS imager has a 20 µm pixel pitch, which yields a 5.1 mm square active area.
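As an illustration of the error-minimization step, a 3x3 correction matrix can be fit by least squares from the camera's responses to a test chart and the chart's reference values, as sketched below. The fit_color_matrix and correct helpers are assumptions for illustration, not the procedure reported in the paper.

    # Illustrative least-squares fit of a 3x3 color correction matrix.
    import numpy as np

    def fit_color_matrix(measured, reference):
        """Solve for M (3x3) minimizing ||measured @ M.T - reference||."""
        measured = np.asarray(measured, dtype=float)    # (n_patches, 3) camera RGB
        reference = np.asarray(reference, dtype=float)  # (n_patches, 3) true RGB
        M_T, *_ = np.linalg.lstsq(measured, reference, rcond=None)
        return M_T.T

    def correct(pixels, M):
        """Apply the matrix to an (n, 3) array of RGB pixels."""
        return np.clip(np.asarray(pixels, dtype=float) @ M.T, 0.0, 255.0)
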
Automatic Generation of Pictorial Transcripts of Video Programs
Behzad Shahraray, David Gibbon
Proc. SPIE Conf. Multimedia Computing and Networking 1995, SPIE 2417,
1995.
[PDF]
[BIB]
An automatic authoring system for the generation of pictorial transcripts of video programs which are accompanied by
closed caption information is presented. A number of key frames, each of which represents the visual information in a
segment of the video (i.e., a scene), are selected automatically by performing a content-based sampling of the video
program. The textual information is recovered from the closed caption signal and is initially segmented based on its
implied temporal relationship with the video segments. The text segmentation boundaries are then adjusted, based on
lexical analysis and/or caption control information, to account for synchronization errors due to possible delays in the
detection of scene boundaries or the transmission of the caption information. The closed caption text is further refined
through linguistic processing for conversion to lower-case with correct capitalization. The key frames and the related text
generate a compact multimedia presentation of the contents of the video program which lends itself to efficient storage
and transmission. This compact representation can be viewed on a computer screen, or used to generate the input to a
commercial text processing package to generate a printed version of the program.
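In its simplest form, content-based sampling of this kind can be sketched as below: keep a new key frame whenever the histogram distance from the last key frame exceeds a threshold. The threshold and the sample_key_frames helper are illustrative assumptions, not the system's actual sampling criteria.

    # Illustrative content-based sampling by histogram distance.
    def sample_key_frames(histograms, threshold=0.4):
        """histograms: one normalized histogram (list of floats) per frame.
        Returns indices of the selected key frames."""
        if not histograms:
            return []
        keys = [0]  # the first frame opens the first segment
        for i in range(1, len(histograms)):
            ref = histograms[keys[-1]]
            # L1 distance between this frame and the last key frame
            diff = sum(abs(a - b) for a, b in zip(histograms[i], ref))
            if diff > threshold:
                keys.append(i)  # scene change: start a new segment
        return keys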

Automated Authoring of Hypermedia Documents of Video Programs
Behzad Shahraray, David Gibbon
Proc. Third Int. Conf. on Multimedia (ACM Multimedia'95),
1995.
[PDF]
[BIB]
This paper describes an approach to the automatic
generation of hypermedia documents from video
programs that are accompanied by closed caption
information. It takes advantage of the already existing
structure of the program, and extracts all the information
about the media components and their relations
automatically by processing the media. Content-based
sampling is used to select a number of key frames each
of which represents the visual information in a segment
of the video program. Closed caption information is
recovered from the video signal and is processed to
extract the raw textual information. Initial
synchronization between the pictorial and textual
components is established by segmenting the text based
on its implied temporal relationship with the video
segments. Lexical and linguistic processing of text is
used to refine the synchronization, restore the correct
capitalization, extract keywords, and spot topics for
which the information content of the program can be
augmented. Scene comparison and matching are used to
group similar scenes and identify repetitive scenes. The
key frames, the related text, and the auxiliary information
constitute a compact multimedia information source
representing the contents of the video program which
lends itself to efficient storage, search, and transmission.
This information is used to automatically generate a
hypermedia representation of the video program. Spotted
words and phrases in the text are used to provide
information, not presented in the program, through links
to other documents. A pictorial index into the contents
of the program is generated automatically. The key
frames in the generated document are used as indices into
other media types such as digitized audio and video
obtained from the program. A fully automatic system
has been implemented that generates and maintains
hypermedia documents of television news programs. The
documents are made available on an experimental HTTP
server within a few minutes of the end of the program.

Video Based Detection of People from Motion
David Gibbon, J. Segen
Virtual Reality Systems '93 Conference,
1993.
[PDF]
[BIB]
The ability to locate people in video images has applications in the areas of automated camera panning,
human/machine interfaces, security, human traffic monitoring, and image compression. We will describe a
person detection algorithm that may be sufficient for some applications. The methods proposed by Sexton
and Nagashima et al. are based on finding the centroid of a frame-difference image. These
methods fail to differentiate head-shaped moving objects from other moving objects such as a person’s hands.
Azarbayejani et al. describe a head tracking method which is superior in that it yields 3D position and
orientation of the head, but they do not discuss the problem of object/background segmentation. Our method
locates the 2D position of the head image in normal scenes (with cluttered backgrounds).
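For reference, the centroid-of-frame-difference baseline discussed above can be sketched as follows; the threshold and the motion_centroid helper are illustrative assumptions.

    # Illustrative centroid of a thresholded frame-difference image.
    def motion_centroid(prev, curr, threshold=20):
        """prev, curr: grayscale frames as rows of ints.
        Returns the (x, y) centroid of changed pixels, or None if static."""
        sx = sy = count = 0
        for y, (row_p, row_c) in enumerate(zip(prev, curr)):
            for x, (p, c) in enumerate(zip(row_p, row_c)):
                if abs(c - p) > threshold:  # pixel changed between frames
                    sx += x
                    sy += y
                    count += 1
        if count == 0:
            return None
        return (sx / count, sy / count)
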
Improving the Color Fidelity of Cameras for Advanced Television Systems
R. V. Kollarits, David Gibbon
High-Resolution Sensors and Hybrid Systems, Proc. SPIE 1656,
pp 19-29,
1992.
[PDF]
[BIB]
Electronic Color Printing with a Liquid-Crystal Light Modulator on Low Sensitivity Photographic Materials
R. V. Kollarits, David Gibbon, W. H. Ninke
Society for Information Display International (SID) Symposium Digest of Technical Papers XIX,
1988.
[BIB]