Some applications require representations that uniquely identify an object, a scene, or even a face during content acquisition. When these representations are used to identify a person, they are generally called biometrics, whereas more general representations are often referred to as metadata.
Visual biometrics like faces, fingerprints, and iris patterns require visual processing techniques (i.e., computer vision) to compute and verify a biometric, whereas other biometrics like voice-print identification (used in speaker recognition) or DNA analysis use other information sources. Aside from the ability to quickly recall personal information from a few biometrics, the ability to index and retrieve content with these cues can greatly improve a user's experience with his or her content.
Have you ever wanted to scan all of your personal photos and videos to find all of the pictures of a friend? Have you tried to find all of the recent speeches given by a political figure or aspiring actor? If so, the additional metadata provided by visual biometrics like faces would be an ideal way to organize, search, and filter your personal content.
While exact face recognition is quite an active research topic that poses many challenges (lighting changes, database scale, accuracy, etc.), face clustering embraces a more gradual approach that requires no training stage, is applicable to large-scale databases, and can easily be improved with user feedback or a secondary analysis with a more rigorous recognition algorithm. In prior work, a person is represented by two image regions: the face and torso. Given the output of a face detector, commonly the Viola-Jones boosted cascade of simple features, the torso region can be approximated from the size and position of the detected face region, as illustrated in the figure on the right. Low-level features like color, texture, and edge information are concatenated and analyzed in an agglomerative clustering routine that repeatedly iterates over content until it reaches a pre-determined stopping condition.
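The two steps described above, approximating a torso region from a detected face and grouping feature vectors by agglomerative clustering, can be sketched roughly as follows. This is a minimal illustration and not the implementation from the prior work: the function names, the torso scale factors, the use of average linkage, and the distance-threshold stopping condition are all assumptions chosen for the example.

```python
from math import dist


def torso_box(face, img_w, img_h, width_scale=2.0, height_scale=2.5):
    """Approximate a torso box (x, y, w, h) directly below a face box.

    The scale factors are illustrative assumptions, not values from the source.
    """
    x, y, w, h = face
    torso_w, torso_h = w * width_scale, h * height_scale
    center_x = x + w / 2.0
    # Clamp the box to the image bounds.
    left = max(0.0, center_x - torso_w / 2.0)
    right = min(float(img_w), center_x + torso_w / 2.0)
    top = min(float(img_h), y + h)
    bottom = min(float(img_h), top + torso_h)
    return (left, top, right - left, bottom - top)


def agglomerative_clusters(features, max_distance):
    """Average-linkage agglomerative clustering over feature vectors.

    Merging stops once the closest pair of clusters is farther apart than
    max_distance (an assumed stopping condition). Returns lists of indices.
    """
    clusters = [[i] for i in range(len(features))]

    def linkage(a, b):
        total = sum(dist(features[i], features[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

In practice each feature vector would be the concatenated color, texture, and edge features of one face and torso pair. User feedback could then be applied by merging or splitting the resulting index lists.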
Experimentally, the torso region not only aids the clustering algorithm in disambiguating people's faces, but it can also be used as an index for people wearing similar clothing with different non-frontal views of a face or, conversely, for the different clothing that one person wears throughout a piece of content, as illustrated above.
Continuing research in this area focuses on alternative representations (e.g., 3D, semantic, etc.) and higher-precision features for face recognition, such as those used in the content-based copy detection framework. Further, as mobile technology and capabilities continue to advance, we are also investigating methods for acquisition and analysis of biometric data on mobile devices, such as the LipActs project.
Increasingly, businesses and consumers rely on passive video feeds, like fixed security cameras, to provide peace of mind for their stores and homes. In a business's public locations (e.g., retail stores), it is often helpful to understand who customers may be at certain times of day or after large promotional campaigns. Similarly, in a personal environment, a homeowner may have more peace of mind if there is a visual record of known visitors or solicitors.
While it is unreasonable or too costly to ask each visitor to a home or business to identify himself or herself, passive analysis can provide an estimate of information about these visitors over many different categories. Stemming from similar techniques utilized in facial biometrics, this passive and anonymous information is often referred to as Content Analytics and is commonly used in aggregate to spot new trends or singular anomalies.
As a service provider, AT&T is uniquely poised to offer identity authentication for both online transactions (e.g., a banking website, hotel check-in, etc.) and in-person transactions (e.g., buying coffee or a book, or using a vending machine) using tokens that are unique to a person, like his or her phone, a PIN code, or a fingerprint. Biometrics like fingerprints offer personalized tokens that are hard to emulate. Now that many mobile devices have at least one front-facing camera, it is possible to leverage a person's facial activity as one of the strongest tokens of identity. While prior face recognition systems can be fooled with a color printout, it is much harder to emulate the facial mannerisms and lip movements of a person's speech. Here, we discuss the discovery of optimal settings for recognizing a person's lip-based actions, or LipActs, for use in verification and retrieval scenarios.
In the field of computer vision, activity recognition (waving hands, jumping, running, etc.) and lip recognition (to improve speech recognition with visual cues) have been studied independently for quite some time. Independent innovations in these two fields have led to the creation of recognition systems that can read lips and systems that detect suspicious activity in public places. Like the image-based scale invariant feature transform (SIFT) representation for content-based copy detection, the histogram of oriented gradients (HOG) feature representation is increasingly popular for human activity detection. Often referred to as local features, as opposed to global features like color and texture, these representations work so well because they capture information from a single point in an image or video keyframe in a highly efficient way. For example, HOG features can describe content in a 6x6 pixel square (36 pixels x 3 colors) with only 9 real values such that they are highly separable from other image squares in an image!
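To make the 9-value description of a pixel square concrete, the sketch below computes a magnitude-weighted histogram of gradient orientations for a single cell. It is a simplified illustration under stated assumptions: the input is a grayscale cell given as a list of rows, orientations are unsigned (0 to 180 degrees) and split into 9 bins, and the bin interpolation and block normalization used by full HOG implementations are omitted. The function name is hypothetical.

```python
import math


def hog_cell_histogram(cell, bins=9):
    """Return an L2-normalized orientation histogram for one grayscale cell.

    cell is a list of rows of intensity values (for example 6 rows of 6).
    Simplified sketch: hard bin assignment, no block normalization.
    """
    rows, cols = len(cell), len(cell[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for r in range(rows):
        for c in range(cols):
            # Central differences, clamped at the cell border.
            gx = cell[r][min(c + 1, cols - 1)] - cell[r][max(c - 1, 0)]
            gy = cell[min(r + 1, rows - 1)][c] - cell[max(r - 1, 0)][c]
            magnitude = math.hypot(gx, gy)
            # Unsigned orientation in degrees, folded into [0, 180).
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else hist
```

A cell containing a single vertical edge, for instance, places all of its gradient energy in the first orientation bin, so the 36 pixel values collapse to a 9-value vector dominated by one entry.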
For LipActs, HOG features were analyzed with different time and space settings over both a personal (i.e., mobile phone) dataset and a debates (i.e., public, kiosk-like) dataset. For temporal settings, both the sampling rate (τ - tau) from the video and the number of frames (t) to be pooled for feature extraction were varied. For spatial settings, HOG descriptors are first quantized into intermediate word features within a representative vocabulary of different sizes (N). Next, using one of these vocabularies, the word features are aggregated into different regions by their location in a frame to create a probabilistic histogram of words. The figure below provides a high-level overview of the feature creation process used in the LipActs work. Through an iteration over the different time and space settings above, equal error rate (EER) performance was improved by almost 50% over unoptimized LipActs features for both datasets. Continued research focuses on dimensionality reduction, synchronization of LipActs features and audio features, and opportunities for deployment on mobile platforms for augmented speaker verification.
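The spatial part of this pipeline, quantizing descriptors against a vocabulary and then building a normalized histogram of words per frame region, might look roughly like the sketch below. It is not the LipActs implementation: the function names, the nearest-codeword assignment by Euclidean distance, and the regular grid of regions are assumptions for illustration. Temporal pooling is represented only by the caller passing in the descriptors from all t pooled frames.

```python
from math import dist


def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary entry closest to the descriptor (assumed rule)."""
    return min(range(len(vocabulary)), key=lambda k: dist(descriptor, vocabulary[k]))


def region_word_histograms(points, vocabulary, frame_w, frame_h, grid=(2, 2)):
    """Build one normalized word histogram per grid region.

    points is an iterable of (x, y, descriptor) tuples pooled over frames.
    grid is (columns, rows); the regular grid layout is an assumption.
    Returns a list of histograms in row-major order, each of length N.
    """
    cols, rows = grid
    n_words = len(vocabulary)
    hists = [[0.0] * n_words for _ in range(cols * rows)]
    for x, y, descriptor in points:
        col = min(int(x * cols / frame_w), cols - 1)
        row = min(int(y * rows / frame_h), rows - 1)
        hists[row * cols + col][nearest_word(descriptor, vocabulary)] += 1.0
    for hist in hists:
        total = sum(hist)
        if total > 0:
            # Normalize counts so each non-empty region sums to 1.
            hist[:] = [v / total for v in hist]
    return hists
```

Concatenating the per-region histograms yields a feature vector whose length is N times the number of regions, which is one reason the vocabulary size N is worth tuning alongside the temporal settings.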
In the LipActs experiments, two datasets were collected in the fall of 2010 and are described below.
To facilitate experiments and extensions of the LipActs work outside of AT&T, each video in the debates dataset is recorded with its MD5 hash in this text file. The text file has a simple two-column format, MD5 hash followed by filename, which can be verified in any Unix-like environment with the command below.
md5sum --check text file
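Where md5sum is not available, the same check can be approximated with a short script. The sketch below is a minimal illustration, assuming the two-column format described above (hash, whitespace, filename) and assuming that filenames are resolved relative to the directory containing the manifest. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path):
    """Check each 'hash  filename' line; return {filename: matches}.

    Assumption: filenames are relative to the manifest's own directory.
    Missing files are reported as False.
    """
    manifest = Path(manifest_path)
    results = {}
    for line in manifest.read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # md5sum marks binary mode with '*'
        target = manifest.parent / name
        results[name] = target.is_file() and md5_of_file(target) == expected.lower()
    return results
```

Calling verify_manifest with the path to the downloaded text file returns a dictionary that maps each listed video to True when its hash matches and False otherwise.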