Content-Based Copy Detection

What is Content-Based Copy Detection?

Content-based copy detection is a set of techniques that matches duplicate (i.e., exact copies) or near-duplicate (i.e., copies with some noise or a few changes) pairs of content. While copy detection may seem straightforward given how easily video and audio files can be copied digitally, this project aims to match content pairs that have undergone severe distortions or, in the case of pictures of real-world objects, may not have been identical to begin with.
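To see why distortion makes this hard, consider the naive approach of comparing cryptographic file hashes. The sketch below (a hypothetical illustration using byte strings in place of real media files) shows that exact copies match trivially, while even a single-bit change defeats the comparison, which is why near-duplicate detection needs more robust, content-aware techniques:

```python
import hashlib

# Hypothetical byte strings standing in for media files.
original = b"\x00\x10\x20\x30" * 256
exact_copy = bytes(original)
near_copy = bytearray(original)
near_copy[0] ^= 1  # a single-bit "distortion"

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Exact copies hash identically...
print(digest(original) == digest(exact_copy))        # True
# ...but one changed bit breaks the match, so cryptographic
# hashing alone cannot find near-duplicates.
print(digest(original) == digest(bytes(near_copy)))  # False
```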

What is a near-duplicate?

Although the difference between duplicate and near-duplicate content is slight, there is a simple way to make the classification. In video and multimedia, a duplicate pair exists when the pixels of the image you see are identical in two sources. Real-world examples of duplicate content include newspapers, books, and even television broadcasts: if you purchased two copies of any of these from different locations, the content (i.e., images, audio, and text) would be exactly the same. A near-duplicate pair exists when the subject matter of the content is the same, but it was captured differently or has been significantly altered by some processing step. One common real-world example of near-duplicate content is the multiple camera viewpoints one sees in television coverage of a public speech at a single event.
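The classification above can be sketched with a simple perceptual technique. The following is a minimal average-hash (aHash) illustration, not the project's actual method: images are modeled as plain 8x8 grayscale grids (a real system would first resize and gray-convert with an imaging library), and a small Hamming distance between hashes signals a near-duplicate even when pixels are no longer identical:

```python
# Minimal average-hash (aHash) sketch for near-duplicate detection.
# Images are 8x8 grids of 0-255 grayscale intensities.

def average_hash(img):
    """One bit per pixel: brighter than the image's mean, or not."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    return [int(p > mean) for p in pixels]

def hamming(h1, h2):
    """Number of differing bits between two hashes."""
    return sum(a != b for a, b in zip(h1, h2))

# A synthetic 8x8 gradient image (hypothetical test data).
base = [[(r * 8 + c) * 4 % 256 for c in range(8)] for r in range(8)]
# A "near-duplicate": same scene, globally brightened by 10 levels.
brightened = [[min(p + 10, 255) for p in row] for row in base]
# An unrelated image: inverted brightness pattern.
unrelated = [[255 - p for p in row] for row in base]

print(hamming(average_hash(base), average_hash(brightened)))  # 0
print(hamming(average_hash(base), average_hash(unrelated)))   # 64
```

Because the hash encodes only brightness relative to the image's own mean, uniform changes such as brightening leave it unchanged, while genuinely different content flips most bits.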

Near-Duplicate Content Examples

This example demonstrates two possible near-duplicate pairs. The top pair arose from natural scene differences due to the cameras' points of view. The bottom pair was created by intentional processing and editing manipulations. For a content-based copy detection system to work in real-world conditions, both cases must be handled.

More information is coming soon; thanks for your patience!

Project Members

Zhu Liu

Yadong Mu

Behzad Shahraray

Eric Zavesky

Related Projects

Assistive Technology

Smart Grid


Enhanced Indexing and Representation with Vision-Based Biometrics
