Software-Defined Storage

by: Yih-Farn Robin Chen, Tue Sep 08 22:09:40 EDT 2015

1 Growth and Challenges of Cloud Storage

Cloud storage for both consumers and enterprises is growing at a phenomenal rate, fueled by the growth of Big Data, mobile devices, and social networks. It has become a major cost component of cloud infrastructure and of any Web-scale cloud service today. While raw storage is cheap, the performance and data-durability requirements of cloud storage frequently dictate sophisticated, multi-tier, geo-distributed, managed storage solutions.

Traditional storage systems use dedicated storage hardware and networking to guarantee that storage QoS requirements such as throughput, latency, IOPS, and reliability are met. Unfortunately, these dedicated resources are frequently underutilized. Cloud computing promises efficient resource utilization by allowing multiple tenants to share the underlying networking, computing, and storage infrastructure, but the noisy-neighbor problem makes it difficult to provide end-to-end storage QoS guarantees to individual tenants.

In a cloud environment like OpenStack, the backend block storage (such as LVM, the Ceph RADOS block device, or vendor storage appliances) is shared by multiple tenants through a storage virtualization layer like Cinder, which attaches virtual machines to individual storage volumes. It is difficult to provide customized storage QoS to meet different tenant needs with a fixed backend, where important design decisions such as the replication level, compression, de-duplication, and encryption have already been made.

Finally, many cloud infrastructure service providers are moving to scale-out solutions based on commodity hardware instead of storage appliances, which are frequently more expensive and harder to adapt to changing workloads or specific QoS requirements. Any cloud solution architect must understand the tradeoffs among the performance, reliability, and cost of cloud storage to provide an effective overall solution.

2 Software-Defined Storage

Software-Defined Networking (SDN) aims to virtualize networking resources and separate the control plane from the data plane. Similarly, most Software-Defined Storage (SDS) solutions aim to separate the storage hardware from the storage-management software, which retains the intelligence and can be dynamically reconfigured to adapt to changing and growing storage needs.

Unfortunately, unlike SDN, there is no clear definition of what the core functionalities of software-defined storage really are, even though many storage vendors claim to have SDS solutions. Here we summarize the key principles that we believe are pertinent to multi-tenant cloud storage solutions; we call them the C.A.M.P.S. principles of SDS:

  • Customizable: SDS should allow customization to meet the specific storage QoS requirements of different tenants. Excessive over-engineering should be avoided through efficient use of cloud resources; tenants should pay only for what they need.
  • Automation: SDS should automate the complete provisioning and deployment process of a storage system without human intervention. The SDS process should explore a large design space to determine a storage configuration that meets the QoS requirements at the lowest cost.
  • Masking: SDS may mask the underlying storage hardware, the software system implementation, and the distributed-system complexity, as long as it presents a common storage API (block, file-system, object, etc.) and meets the QoS requirements. This also gives infrastructure providers more flexibility to adjust cloud resources and placement strategies.
  • Policy Management: The SDS software must manage the storage system according to the specified policies, including QoS, security, and backup/recovery requirements, and continue to meet them despite potential interference from other tenants' workloads. It also needs to monitor and handle failures, and auto-scale the system when necessary to adapt to each tenant's changing workload.
  • Scale-out: SDS should enable a scale-out (vs. scale-up) storage solution as the workload grows or changes over time, preferably on commodity hardware to reduce cost.
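To make the Customizable principle concrete, a tenant-facing QoS request might look like the following sketch. The field names and values here are hypothetical illustrations, not an actual SDS API:

```python
from dataclasses import dataclass

@dataclass
class StorageQoSSpec:
    """Illustrative tenant storage QoS specification (fields are hypothetical)."""
    throughput_mbps: int      # minimum sustained throughput
    iops: int                 # minimum random I/O operations per second
    latency_ms: float         # maximum acceptable latency
    durability: float         # e.g. 0.999999 for "six nines" of data durability
    capacity_tb: int          # provisioned capacity
    encrypted: bool = False   # policy choice: encrypt data at rest

# A tenant running a sequential Big Data workload might request high
# throughput but tolerate relaxed latency:
bigdata = StorageQoSSpec(throughput_mbps=500, iops=1000,
                         latency_ms=50.0, durability=0.999999,
                         capacity_tb=100)
```

A spec like this is what the tenant would pay for; the SDS layers below it are free to satisfy the spec with whatever backend configuration is cheapest.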

We can now combine these principles and give our own definition of SDS: SDS automatically maps customized and evolving storage service requirements to a scalable, elastic, and policy-managed cloud storage service, with abstractions that mask the underlying storage hardware and software complexities.

We believe that an SDS solution based on these principles can meet many of the current cloud storage challenges. It is also in line with the AT&T Domain 2.0 principle, which is to use real-time orchestration of cloud resources to meet customer needs at low cost. We in the Cloud Technologies and Services Research organization have built an SDS solution designed to meet the above SDS principles.

The SDS solution consists of three layers, as shown in Figure 1. The bottom layer consists of raw compute, networking, and storage resources that can be scaled out easily as needed; the storage can be locally attached or accessed over the network, as long as it provides block storage. The second layer, CloudQoS, provides network bandwidth reservation through Tegu, storage bandwidth reservation through IOArbiter, CPU capacity reservation through Bora, and resource placement optimization through Ostro. The third layer is the SDS storage layer, which automates storage engineering and builds specific storage systems to meet different tenants' storage QoS needs. The SDS process described below allows tenants with very diverse workloads, such as a random workload using Ceph object storage and a sequential Big Data workload using HDFS or QFS, to co-exist with efficient cloud resource utilization.

Figure 1: SDS Architecture Layers in an OpenStack Cloud

Figure 2 shows a high-level view of our SDS automation process. The tenant specifies the storage QoS requirements through a Web interface. The SDS planner for each specific storage system takes the requirements, searches a large design space for the best performance, reliability, and cost tradeoffs, and then generates an OpenStack Heat template to orchestrate the storage system deployment. The whole process takes only about 10 minutes for a typical TB-level storage system deployment in an OpenStack cloud. Finally, an SDS monitor and visualizer continues to watch the storage system; it handles failures and reconfigures the storage system if needed to meet growing or shrinking storage demands.
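The planner's design-space search can be sketched as a cost minimization over benchmarked candidate configurations. The configuration names, numbers, and costs below are illustrative placeholders, not the planner's actual data:

```python
# Hypothetical sketch of the planner's search: among candidate storage
# configurations with benchmarked properties, pick the cheapest one that
# satisfies the tenant's QoS requirements.

candidates = [
    # (name, throughput_mbps, durability, monthly_cost)
    ("3x-replication-hdd", 400, 0.99999,   900),
    ("ec-10-4-hdd",        350, 0.999999,  500),
    ("3x-replication-ssd", 900, 0.99999,  2400),
    ("ec-6-3-ssd",         800, 0.999999, 1600),
]

def plan(min_throughput, min_durability):
    """Return the cheapest feasible configuration, or raise if none exists."""
    feasible = [c for c in candidates
                if c[1] >= min_throughput and c[2] >= min_durability]
    if not feasible:
        raise ValueError("no configuration meets the QoS requirements")
    return min(feasible, key=lambda c: c[3])  # minimize cost

print(plan(300, 0.999999)[0])  # -> ec-10-4-hdd
```

In the real system the output of this step would be rendered into a Heat template for deployment; here the point is only that the QoS spec, not a human, drives the configuration choice.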

Figure 2: SDS Automates the Storage Engineering Process in an OpenStack Cloud

Incidentally, erasure coding is a crucial technology that can be used to meet the SDS customization requirements. Erasure coding divides a file into k data chunks and then expands it into n chunks, where n = k + m and m is the number of parity chunks. Any k out of the n chunks are sufficient to reconstruct the file. For a fixed k, increasing m, the number of parity chunks, increases the reliability and the replication factor (and hence the storage cost). At the same time, it increases the overall encoding/decoding time, and hence the required computation capacity, and may reduce performance. Erasure-coded storage opens up a large design space for performance, reliability, and cost tradeoffs that was not possible with the traditional triple-replication storage commonly used in HDFS and Swift Object Storage. The SDS planner takes the storage QoS requirements, consults performance benchmarks of different erasure-code choices if needed, and then picks particular erasure-code parameters (k and m) that meet the minimal reliability and performance requirements with the least storage overhead.

3 Current SDS Plans

There are several ongoing SDS initiatives, outlined below, to bring our SDS approach to the level demanded by large-scale enterprise storage:

  • PB-scale: We are currently deploying our SDS solution in a large OpenStack cluster (100 server nodes and 4 PB of storage) to understand the potential scalability, performance, reliability, and operational issues associated with PB-scale storage systems.
  • Multi-tier storage: We have done extensive experiments to understand the performance and cost tradeoffs of using SSDs, locally attached HDDs, and iSCSI storage arrays in different tiers of the storage architecture. We are currently encoding these storage engineering rules in our SDS planner, and extending our research into the use of NVMe (Non-Volatile Memory Express) devices as another storage tier.
  • Multi-site storage: Geo-redundancy is a common requirement for maintaining reliability in the midst of disasters that can bring down data centers in one or more regions. In addition, geographic storage access patterns may dictate intelligent storage placement to meet latency and cost requirements. Erasure-coded multi-site storage for tiered-storage applications with hot, warm, and cold storage poses other interesting technical and research challenges.

4 Conclusion

The rapid growth of cloud storage has created challenges for storage architects in meeting the diverse performance and reliability requirements of different customers, while controlling cost, in a multi-tenant cloud environment. We have been working on a software-defined storage solution that has the potential to address these challenges and open up new opportunities for innovation. This SDS solution is in line with the Domain 2.0 principle: real-time orchestration of cloud resources to meet customer demands at low cost.