Enhanced Indexing and Representation with Vision-Based Biometrics

What are biometrics?

Some applications require representations that uniquely identify an object, a scene, or even a face during content acquisition. When these representations are used to establish a person's identity, they are generally called biometrics, whereas more general representations are often referred to as metadata.

Visual biometrics like faces, fingerprints, and iris patterns require visual processing techniques (i.e. computer vision) to compute and verify, whereas other biometrics like voice-print identification (used in speaker recognition) or DNA analysis draw on non-visual information sources. Aside from the ability to quickly recall personal information from a few biometrics, the ability to index and retrieve content with these cues can greatly improve a user's experience with his or her content.

Clustering and Indexing with Faces

Have you ever wanted to scan all of your personal photos and videos to find all of the pictures of a friend? Have you tried to find all of the recent speeches given by a political figure or aspiring actor? If so, the additional metadata provided by visual biometrics like faces would be an ideal way to organize, search, and filter your personal content.

Face and Torso Region Identification

While exact face recognition is an active research topic that poses many challenges (lighting changes, database scale, accuracy, etc.), face clustering takes a more gradual approach: it requires no training stage, is applicable to large-scale databases, and can easily be improved with user feedback or a secondary analysis by a more rigorous recognition algorithm. In prior work, a person is represented by two image regions: the face and the torso. Given the output of a face detector, commonly the Viola-Jones boosted cascade of simple features, the torso region can be approximated from the size and position of the detected face region, as illustrated in the figure on the right. Low-level features like color, texture, and edge information are concatenated and analyzed in an agglomerative clustering routine that repeatedly iterates over the content until it reaches a pre-determined stopping condition.
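As a sketch of the clustering step described above, the following Python implements a greedy, centroid-merging agglomerative routine with a distance threshold as the stopping condition. The feature vectors, Euclidean metric, and threshold value here are illustrative stand-ins for the concatenated color/texture/edge descriptors used in the actual system.

```python
import numpy as np

def agglomerative_cluster(features, stop_distance):
    """Greedy agglomerative clustering: repeatedly merge the two closest
    clusters (by centroid distance) until the closest remaining pair is
    farther apart than stop_distance (the stopping condition)."""
    clusters = [[i] for i in range(len(features))]
    centroids = [features[i].astype(float) for i in range(len(features))]
    while len(clusters) > 1:
        # find the closest pair of cluster centroids
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(centroids[a] - centroids[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > stop_distance:
            break  # no pair is close enough to merge
        clusters[a].extend(clusters.pop(b))
        centroids[a] = features[clusters[a]].mean(axis=0)
        centroids.pop(b)
    return clusters
```

With two tight groups of points and a threshold between the within-group and between-group distances, the routine returns exactly two clusters.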

Face-Enhanced Indexing Capabilities

Experimentally, the torso region not only aids the clustering algorithm in disambiguating people's faces, but it can also serve as an index for people wearing similar clothing across different non-frontal views of a face or, conversely, for the different clothing one person wears throughout a piece of content, as illustrated above.

[Figures: Indexing by Face Example; Indexing for Clothing Change]

Continuing research in this area focuses on alternative representations (i.e. 3D, semantic, etc.) and higher-precision features for face recognition, such as those used in the content-based copy detection framework. Further, as mobile technology and capabilities continue to advance, we are also investigating methods for acquisition and analysis of biometric data on mobile devices, such as the LipActs project.


Passive Content Demographics

Increasingly, businesses and consumers rely on passive video feeds, like fixed security cameras, to provide peace of mind for their stores and homes. In a business's public locations (e.g. retail stores), it is often helpful to understand who the customers may be at certain times of day or after large promotional campaigns. Similarly, in a personal environment, a home owner may have more peace of mind if there is a visual record of known visitors or solicitors.

While it is unreasonable or too costly to ask each visitor to a home or business to identify himself or herself, passive analysis can provide an estimate of information about these visitors across many different categories. Stemming from similar techniques utilized in facial biometrics, this passive and anonymous information is often referred to as Content Analytics and is commonly used in aggregate to spot new trends or singular anomalies.

LipActs: Lip-based Visual Speaker Verification

As a service provider, AT&T is uniquely positioned to offer identity authentication for both online transactions (e.g. a banking website, hotel check-in) and in-person transactions (e.g. buying coffee or a book, or using a vending machine) using tokens that are unique to a person, like his or her phone, a PIN code, or a fingerprint. Biometrics like fingerprints offer personalized tokens that are hard to emulate. Now that many mobile devices have at least one forward-facing camera, it is possible to leverage a person's facial activity as one of the strongest tokens of identity. While prior face recognition systems can be fooled with a color printout, it is much harder to emulate the facial mannerisms and lip movements of a person's speech. Here we discuss the discovery of optimal settings for recognizing a person's lip-based actions, or LipActs, for use in verification and retrieval scenarios.

Activity and Lip Recognition Background

In the field of computer vision, activity recognition (waving hands, jumping, running, etc.) and lip recognition (to improve speech recognition with visual cues) have been studied independently for quite some time. Innovations in these two fields have led to the creation of recognition systems that can read lips and others that detect suspicious activity in public places. Like the image-based scale invariant feature transform (SIFT) representation used in content-based copy detection, the histogram of oriented gradients (HOG) feature representation is increasingly popular for human activity detection. Often referred to as local features, as opposed to global features like color and texture, these representations work well because they capture information from a small neighborhood in an image or video keyframe in a highly efficient way. For example, HOG features can describe the content of a 6x6 pixel square (36 pixels x 3 colors) with only 9 real values, such that it is highly separable from other squares in the same image.
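To make the HOG idea concrete, here is a minimal Python sketch (a single grayscale cell rather than a full detector pipeline, with simple central-difference gradients) that reduces one small pixel square to the 9-bin orientation histogram mentioned above:

```python
import numpy as np

def hog_cell(cell):
    """9-bin histogram of oriented gradients for one grayscale cell
    (e.g. 6x6 pixels): a compact local descriptor with 9 real values."""
    gx = np.zeros_like(cell, dtype=float)
    gy = np.zeros_like(cell, dtype=float)
    gx[:, 1:-1] = cell[:, 2:] - cell[:, :-2]  # horizontal central differences
    gy[1:-1, :] = cell[2:, :] - cell[:-2, :]  # vertical central differences
    mag = np.hypot(gx, gy)                    # gradient magnitude
    ang = np.degrees(np.arctan2(gy, gx)) % 180  # unsigned orientation
    # magnitude-weighted vote into 9 orientation bins over [0, 180)
    hist, _ = np.histogram(ang, bins=9, range=(0, 180), weights=mag)
    return hist / (hist.sum() + 1e-8)
```

A cell containing a vertical edge produces a purely horizontal gradient, so all of the histogram mass lands in the first (near-0°) bin.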

Optimal Features for Lip Activity

Overview of LipActs Feature Extraction

For LipActs, HOG features were analyzed with different time and space settings over both a personal (i.e. mobile phone) dataset and a debates (i.e. public, kiosk-like) dataset. For temporal settings, both the sampling rate (τ, tau) from the video and the number of frames (t) to be pooled for feature extraction were varied. For spatial settings, HOG descriptors are first quantized into intermediate word features within a representative vocabulary of different sizes (N). Next, using one of these vocabularies, the word features are aggregated into different regions by their location in a frame to create a probabilistic histogram of words. The figure below provides a high-level overview of the feature creation process used in the LipActs work. Through an iteration over the different time and space settings above, equal-error rate (EER) performance was improved by almost 50% over unoptimized LipAct features for both datasets. Continued research focuses on dimensionality reduction, synchronization of LipActs features with audio features, and opportunities for deployment on mobile platforms for augmented speaker verification.
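The quantize-then-pool stage described above can be sketched as follows. Note that the vocabulary, the horizontal-strip region layout, and the frame width used here are hypothetical placeholders for illustration, not the settings used in the LipActs experiments.

```python
import numpy as np

def spatial_bow(descriptors, positions, vocab, n_regions=4, frame_width=640):
    """Quantize each local descriptor to its nearest vocabulary word,
    then pool the words into per-region probabilistic histograms based
    on each descriptor's horizontal position in the frame."""
    hists = np.zeros((n_regions, len(vocab)))
    # nearest-word assignment (vector quantization against the vocabulary)
    d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    # assign each descriptor to a vertical strip by its x position
    regions = np.minimum((positions / (frame_width / n_regions)).astype(int),
                         n_regions - 1)
    for w, r in zip(words, regions):
        hists[r, w] += 1
    # normalize each region's counts to a probability distribution
    sums = hists.sum(axis=1, keepdims=True)
    return hists / np.where(sums == 0, 1, sums)
```

With a toy two-word vocabulary and one descriptor on each side of the frame, each region's histogram concentrates on its own word.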

The LipActs Experimental Datasets

In the LipActs experiments, two datasets were collected in the fall of 2010 and are described below.

  • The personal dataset was acquired as high-quality (an average bitrate around 3.7Mb/s), high-resolution (640x480) H.264 videos of a single, cooperative user trying to verify his or her identity with a mobile phone.
  • The debates dataset was acquired as lower-quality (an average bitrate of 2.1Mb/s), high-resolution (960x720 for the entire scene) H.264 videos with the speaker focused on his or her discussion, not on a retrieval task. The original videos were created by individuals as part of a one-minute interactive video debate, sponsored in part by the World Economic Forum and found using the textual keywords "Davos Debates". This dataset emulates a public, kiosk-like setting that included potential difficulties from background variance, speaker movements, and only a small portion of the frame devoted to the speaker's face.

[Figures: Debates Face Montage, Parts 1-4]


To facilitate experiments and extensions of the LipActs work outside of AT&T, each video in the debates dataset and its MD5 hash are recorded in this text file. The format of this text file is simply two columns, MD5 hash and filename, which can be verified in any unix-like environment with the command below.

md5sum --check <text file>

Project Members

Lee Begeja

David Gibbon

Raghuraman Gopalan

Zhu Liu

Behzad Shahraray

Eric Zavesky

Related Projects

Assistive Technology

Smart Grid


Content-Based Copy Detection
