att_abstract={{Data glitches are unusual observations that do not conform to 
data quality expectations, be they logical, semantic or statistical. 
When data integrity constraints are applied, potentially large sections of data could be flagged as noncompliant. 
Ignoring or repairing significant sections of the data could fundamentally bias the results 
and conclusions drawn from analyses. In the particular context of Big Data, where feeds from
disparate sources are integrated in large numbers and volumes, it is likely that significant portions of seemingly noncompliant data are actually legitimate, usable data. 

In this paper, we introduce the notion of Empirical Glitch Explanations -- 
concise, multi-dimensional descriptions of subsets of potentially dirty data --  and propose a scalable method for empirically generating such
explanatory characterizations. The explanations could serve two valuable functions:
(1) identify legitimate data and release it back into the
pool of clean data, thereby reducing cleaning-related statistical distortion of the data; 
(2) refine existing data quality constraints, and generate and formalize domain knowledge. 

We conduct experiments using real and simulated data to demonstrate the 
scalability of our method and the robustness of the explanations. In addition, we use two
real-world examples to demonstrate the utility of the explanations, in which we 
reclaim over 99% of the suspicious data while keeping data-repair-related statistical distortion close to 0.}},
	att_authors={td3863, ds8961},
	att_categories={C_BB.1, C_NSS.2, C_IIS.6},
	att_copyright_notice={{(c) ACM, 2014. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in 2014 (2014-08-23).}},
	author={Tamraparni Dasu and Divesh Srivastava and Ji Meng Loh},
	institution={{ACM SIGKDD Conference on Knowledge Discovery and Data Mining}},
	title={{Empirical Glitch Explanations}},