att_abstract={{Data-intensive applications require extreme scaling of their
underlying storage systems.  Such scaling, together with the fact that
storage systems must be implemented in actual data centers, increases
the risk of data loss from failures of underlying
components. Accurate engineering requires
quantitatively predicting reliability, but this remains challenging because of the
need to account for extreme scale,
redundancy scheme type and strength, distribution
architecture, component dependencies, and failure and repair
rates.  This paper introduces CQSim-R, a tool suite for predicting
the reliability of large-scale storage system designs and deployments.
CQSim-R includes direct calculations based on a drives-only failure
model and an event-based simulator for detailed prediction that
handles arbitrary component failure models and dependencies.
CQSim-R's framework is general and handles a spectrum of
redundancy and share placement designs. The paper
demonstrates CQSim-R using models of common storage systems, such as
$n$-replicated (e.g., RAID-1 and HDFS) and erasure-coded designs
(e.g., RAID-6 and the Quantcast Filesystem). Several new
results, such as the poor reliability scaling behavior of
spread-placed systems, demonstrate the
usefulness and generality of the tools. Analysis and empirical data
show their soundness, generality, performance, and scalability.}},
	att_copyright_notice={{(c) ACM, 2015. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in 2015, Volume 12, Issue 4, 2016-07-01.}},
	att_tags={cloud,  storage,  system,  reliability,  Monte Carlo simulation,  modeling,  simulation},
	author={Robert Hall},
	institution={{ACM Transactions on Storage}},
	title={{Tools for Predicting the Reliability of Large Scale Storage Systems}},