An application or operating system typically requires a component or layer that provides the ability to store data. This can be implemented as local or internal storage in a server system, as a group of disks referred to as a JBOD (just a bunch of disks), or as an external storage system or disk array. External storage systems are usually shared between different hosts or server systems and can provide, on top of the persistence layer, enhanced functionality (so-called storage services), while allowing stored data to be replicated or mirrored to further storage arrays, establishing a metro- or geo-redundant solution.
An external storage system or disk array integrates a large number of hard disk drives, solid-state drives, or flash memory devices to store the data of an application. Hosts or servers access the external storage system through different access protocols (e.g. SAN, NAS or HTTP REST-based protocols). The storage system typically has a central control unit that manages the I/O traffic and the optional storage services. The optional storage services include the ability to replicate data, from the volume level up to the entire storage array, to further storage arrays. Replication can be performed at many levels: within a site, within a metro area or as geo replication.
Storage tiers are used to distinguish different levels of service that can be provided where application data is to be stored. The business application requirements should include provision as to the storage capacity required, expected level of growth, access response times, and protection level. These factors can then be used to determine the needs of the storage.
Originally the concept of storage tiering was to define or classify data requirements into one of three tiers; today we see additional tiers and definitions due to high data growth rates, cost optimisation and retention requirements.
Here we define storage in four tiers, as we are thinking of live data that is utilised as part of system operation. It is a requirement to provide the facility to back up system or application data, and this can be considered as tier level 5. This is covered in further detail in the Backup and Recovery section.
These types of environments enable services that are in high demand and cost the end user significant loss of revenue when an outage occurs. Online transaction processing (OLTP), batch transaction processing, and some virtualization/cloud environments are examples of environments that fit into this tier of data availability.
These types of environments are likely subject to compliance requirements, and although maintaining client access to the storage system is important, the loss of data would be severely detrimental to the end user.
Because tier 1 or 2 storage systems require the highest amount of resiliency and are critical for a ZO infrastructure design, the remainder of this chapter focuses on tier 1 and 2 storage architecture and design.
Repository environments are used to store collaborative data or user data that is noncritical to business operations. Scientific and engineering compute data, workgroup collaboration, and user home directories are examples of environments that fit into this tier of data availability.
This type of environment is subject to a large initial ingest of data (writes), which is then seldom accessed. System utilization on average is not expected to be very significant. Because the data is seldom accessed, it is important to fully leverage subsystem features that exercise that data for continued integrity.
The ability of a storage system to host multiple storage tiers in the same system can be an advantage. However, it might be economically more feasible to implement lower storage tiers on separate storage systems that do not provide the same level of resiliency as tier 1 or 2 systems.
Storage systems or disk arrays with a “Zero Outage” compliant design should fulfil the following best practices:
- Zero Outage mode of operation, ideally 365×24 (e.g. non-disruptive software or firmware upgrades).
- Data availability of a minimum of 99.999% with the goal of 100% (Zero Outage).
- Data redundancy options to protect from single or double drive failures like RAID-1/RAID-10, RAID-5, RAID-6 or Erasure Coding.
- Enhanced data availability across disk arrays within one datacentre or across datacentres with optional cluster-like functionality of arrays.
- Enterprise class disk drives (e.g. enterprise class MTBF specifications) for HDDs and SSDs.
- Enterprise class endurance metrics for flash storage. (e.g. Drive Writes per Day, TB Written)
- In-drive functionality to sustain consistent I/O rates (e.g. compression, wear levelling, data refresh management, endurance management).
- Full write protection for all volatile memory components (e.g. array cache components, drive-level cache components) by the use of battery backup or other means.
- Predictable performance in number of I/O operations and/or latency, even under high load and independent of the read/write ratio.
- Provisions to provide Quality of Service (QoS) or resource partitioning to provide minimum levels of performance between different hosts groups, hosts or up to a LUN level in SAN or Share level in NAS.
- Scalability to provide predictable, consistent performance independent of provisioned capacity.
- Non-disruptive data migration capabilities, making the migration of storage between different arrays possible without any service interruption (e.g. via storage virtualization functionality).
- Design for serviceability: maintenance on one part of the storage system, including firmware, software and/or hardware upgrades, must not impact any application using the storage system.
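The endurance metrics mentioned above (Drive Writes per Day and TB Written) are directly related. A minimal sketch with illustrative, assumed drive figures:

```python
def tbw(dwpd: float, capacity_tb: float, warranty_years: float) -> float:
    """Total TB Written implied by a DWPD rating over the warranty period."""
    return dwpd * capacity_tb * 365 * warranty_years

# Example: a hypothetical 3.84 TB enterprise SSD rated at 1 DWPD for 5 years
print(tbw(1.0, 3.84, 5))  # → 7008.0 TB written
```

A consumer drive of the same capacity is often rated well below 1 DWPD, which is one reason the endurance requirement is listed as an enterprise-class criterion.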
Enterprise-critical applications for customers in different geographical locations, running in shared service environments, require nonstop availability of the underlying infrastructure.
From a platform perspective, ‘Zero Outage’ requires a minimum set of functions that all storage equipment needs to support. A storage array should be able to provide 365×24 operations support, allowing virtually no downtime for administrative activities – whether planned, proactive or part of troubleshooting.
All maintenance and management activities must be performed online, without any outage. Maintenance activities like microcode or operating system upgrades and capacity and controller upgrades should not impact system performance. Non-stop service must be guaranteed at all times, even, for example, for a server connected via only a single FC port. These requirements lead to the following common best practices of controller functionality regarding high availability:
- Non-disruptive microcode and hardware upgrades.
- Automatic failover architecture with redundant, hot-swappable components.
- Dual data paths and dual control paths connecting every component.
- Active-active dual-ported disk drives.
- Mirrored cache for all write data.
- Non-volatile backup of cache using a combination of battery and flash drives.
A ‘Zero Outage’ storage design requires a data availability of at least five nines (99.999%) or better, with the goal of 100% data availability.
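Five nines translates into a concrete downtime budget. A small worked example:

```python
def max_downtime_minutes_per_year(availability_pct: float) -> float:
    """Maximum unplanned downtime per year implied by an availability figure."""
    return (1 - availability_pct / 100) * 365 * 24 * 60

# 99.999% availability leaves only about 5.26 minutes of downtime per year:
print(round(max_downtime_minutes_per_year(99.999), 2))  # → 5.26
```

This is why even short maintenance windows are incompatible with a five-nines target and all service actions must be performed online.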
Data availability is achieved on a storage drive level by data redundancy technologies like RAID or Erasure Coding.
RAID is a technology that is used to increase the performance and/or reliability of data storage. The abbreviation stands for Redundant Array of Inexpensive Disks. A RAID system consists of two or more drives working in parallel. These disks can be hard disk drives, but can also be based on flash technology like SSDs (solid state drives) or other implementations of flash memory drives. There are different RAID levels, each optimized for a specific situation. These are not standardized by an industry group or standardization committee. The typical RAID levels a ‘Zero Outage’ compliant storage array needs to support are:
- RAID 0 – striping
- RAID 1 – mirroring
- RAID 5 – striping with parity
- RAID 6 – striping with double parity
- RAID 10 – combining mirroring and striping
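The usable capacity implied by each of these levels can be sketched as follows. This is a simplification that ignores formatting and spare overhead:

```python
def usable_capacity(level: str, drives: int, drive_tb: float) -> float:
    """Usable capacity for the common RAID levels (simplified, no spare/format overhead)."""
    if level == "RAID0":
        return drives * drive_tb            # striping only, no redundancy
    if level in ("RAID1", "RAID10"):
        return drives * drive_tb / 2        # everything is written twice
    if level == "RAID5":
        return (drives - 1) * drive_tb      # one drive's worth of parity
    if level == "RAID6":
        return (drives - 2) * drive_tb      # two drives' worth of parity
    raise ValueError(f"unknown RAID level: {level}")

# Eight 4 TB drives under each level:
for lvl in ("RAID0", "RAID10", "RAID5", "RAID6"):
    print(lvl, usable_capacity(lvl, 8, 4.0))
```

The trade-off is visible directly: RAID 0 gives the full 32 TB with no protection, RAID 10 halves it to 16 TB, while RAID 5 and RAID 6 give 28 TB and 24 TB respectively.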
To perform the RAID functionality, typically a combination of storage array microcode/firmware and hardware-based acceleration is used to provide the best possible performance, especially for more demanding RAID levels like RAID 5 and RAID 6.
In a RAID 0 system, data is split up into blocks that are written across all the drives in the array. By using multiple disks (at least two) at the same time, this offers superior read I/O performance. Striping across different independent components in the array can further improve I/O performance.
- RAID 0 offers great performance, both in read and write operations. There is no overhead caused by parity controls.
- All storage capacity is used, there is no overhead.
- The technology is easy to implement.
- RAID 0 is not fault-tolerant. RAID 0 cannot be used in a Zero Outage environment unless it is combined with other RAID levels (e.g. RAID 1+0, a combination of striping and mirroring).
Data is stored twice by writing the data to the data drive (or set of data drives) and a mirror drive (or set of drives). If a drive fails, the controller uses either the data drive or the mirror drive for data recovery and continues operation. You need a minimum of two drives for a RAID 1 array.
- RAID 1 offers excellent read speed and a write-speed that is comparable to that of a single drive.
- In case of a drive failure, data does not have to be rebuilt; it just has to be copied to the replacement drive.
- RAID 1 is a very simple technology.
- The main disadvantage is that the effective storage capacity is only half of the total drive capacity because all data get written twice.
RAID 5 is a common secure RAID level. It requires a minimum of 3 drives but can work in modern disk arrays with RAID groups consisting of up to 16 drives. Data blocks are striped across the drives, and on one drive a parity checksum of all the block data is written. The parity data is not written to a fixed drive; it is spread across all drives. Using the parity data, the array controller can recalculate the data of one of the other data blocks, should those data blocks no longer be available. That means a RAID 5 array can withstand a single drive failure without losing data or access to data.
- Read data transactions are very fast as data is striped across the RAID group.
- If a drive fails, you still have access to all data, even while the failed drive is being replaced and the storage controller rebuilds the data on the new drive.
- Write data transactions are slower, due to the parity that has to be calculated and then written.
- Drive failures will have an effect on throughput due to parity rebuild.
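The parity in RAID 5 is typically a simple XOR across the data blocks of a stripe. The following minimal sketch shows how a lost block is rebuilt from the surviving blocks and the parity:

```python
from functools import reduce

def parity(blocks: list[bytes]) -> bytes:
    """XOR parity across equal-sized blocks, as used for the RAID 5 checksum."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]   # three data blocks of one stripe
p = parity(data)                     # parity block written to the stripe

# Simulate losing the second block and rebuilding it from survivors + parity:
rebuilt = parity([data[0], data[2], p])
print(rebuilt == data[1])  # → True
```

Because XOR is its own inverse, the same operation serves both parity generation and reconstruction, which is why rebuilds require reading every surviving drive in the group.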
If one of the disks in an array using large NL-SAS disks (e.g. 10 TB drives) and large RAID-5 groups (e.g. 15+1) fails and gets replaced/spared in, restoring the lost disk's data to a spare disk (the rebuild time) can take an extended period. If another disk fails during that time, the data is lost forever (a double drive failure).
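A rough back-of-the-envelope estimate of that rebuild window; the sustained rebuild rate used here is an assumption and varies widely with array load:

```python
def rebuild_hours(capacity_tb: float, rebuild_mb_per_s: float) -> float:
    """Best-case time to rebuild one failed drive at a sustained rebuild rate."""
    return capacity_tb * 1_000_000 / rebuild_mb_per_s / 3600

# A 10 TB NL-SAS drive rebuilt at an assumed sustained 50 MB/s:
print(round(rebuild_hours(10, 50), 1))  # → 55.6 hours
```

More than two days of exposure to a second drive failure is the reason large, high-capacity RAID groups push designs towards RAID 6 or triple parity.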
RAID 6 is similar to RAID 5, but the parity data is written to two drives. That means it requires a minimum of 4 drives and can withstand 2 drive failures simultaneously.
- Like with RAID 5, read data transactions are very fast.
- If two drives fail, you still have access to all data, even while the failed drives are being replaced.
- Write data transactions are slower than with RAID 5 due to the additional parity data that has to be calculated.
- Drive failures have an effect on throughput. Rebuilding a large RAID-6 group (e.g. 14+2) with high capacity drives in which one drive failed can take a long time and will impact performance of the RAID group during this period.
Today, triple-parity RAID is also becoming an option. The addition of another level of parity mitigates increasing RAID rebuild times and the occurrence of latent data errors.
Erasure coding (EC) is a method of data protection in which data is broken into fragments, expanded and encoded with redundant data pieces and stored across a set of different locations or storage media.
Erasure coding creates a mathematical function to describe a set of numbers so that they can be checked for accuracy and recovered if one is lost. This concept, referred to as polynomial interpolation or oversampling, is the key idea behind erasure codes.
The goal of erasure coding is to enable data that becomes corrupted at some point in the disk storage process to be reconstructed by using information about the data that’s stored elsewhere in the array. Erasure codes are often used instead of traditional RAID because of their ability to reduce the time and overhead required to reconstruct data. The drawback of erasure coding is that it can be more CPU-intensive for the array controller, and that can translate into increased latency.
Erasure coding can be useful with large quantities of data and any applications or systems that need to tolerate failures, such as high capacity disk array systems, data grids, distributed storage applications, object stores and archival storage. One common current use case for erasure coding is object-based cloud storage.
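The polynomial-interpolation idea can be illustrated with a toy example over exact fractions. Production systems use Reed-Solomon codes over Galois fields, but the recovery principle, that any k of n fragments suffice, is the same:

```python
from fractions import Fraction

def encode(data: list[int], n: int) -> list[tuple[int, Fraction]]:
    """Treat the k data values as polynomial coefficients; emit n > k points."""
    return [(x, sum(Fraction(c) * x**i for i, c in enumerate(data)))
            for x in range(1, n + 1)]

def decode(points: list[tuple[int, Fraction]], k: int) -> list[int]:
    """Recover the k data values from any k surviving points (Lagrange interpolation)."""
    pts = points[:k]
    coeffs = [Fraction(0)] * k
    for j, (xj, yj) in enumerate(pts):
        basis, denom = [Fraction(1)], Fraction(1)
        for m, (xm, _) in enumerate(pts):
            if m == j:
                continue
            denom *= xj - xm
            new = [Fraction(0)] * (len(basis) + 1)
            for i, c in enumerate(basis):
                new[i + 1] += c       # multiply basis polynomial by x
                new[i] -= xm * c      # ... and by -xm
            basis = new
        for i in range(k):
            coeffs[i] += yj * basis[i] / denom
    return [int(c) for c in coeffs]

data = [7, 3, 5]                # k = 3 data fragments
fragments = encode(data, 5)     # n = 5 stored fragments (tolerates 2 losses)
survivors = [fragments[0], fragments[2], fragments[4]]  # any 3 suffice
print(decode(survivors, 3))     # → [7, 3, 5]
```

The 3-of-5 scheme above survives any two fragment losses, analogous to RAID 6, but the k and n parameters can be chosen freely, which is the flexibility the text refers to.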
Although RAID-6 has the highest resiliency among RAID-1+0, RAID-5 and RAID-6, it is not per se the only acceptable RAID level in a “Zero Outage” design. It must be used in the case of large RAID groups with high-capacity drives to eliminate the risk of a double drive failure. But with very fast and/or low-capacity drives, a RAID-1+0 or RAID-5 design can still be valid and can achieve a 99.999%+ availability figure.
Compared to RAID, erasure coding allows very flexible and very resilient storage architectures, but it is better suited to high-capacity than to high-performance environments because it adds additional latency.
In general terms, maximum data redundancy on a drive level typically has an impact on performance. There is no single concept for an optimal data redundancy design on a drive level, because many factors determine the optimal design, as discussed before. A carefully designed storage solution can, for example, use RAID levels with lower resilience (e.g. RAID-10) without compromising overall data availability and still be “Zero Outage” compliant.
With regard to the types of drives that make up the storage layer of a ZO platform design, there is no general recommendation; the choice is mainly based on performance considerations. Hard disk drive (HDD), solid-state drive (SSD) and flash storage vendors have different types of specifications for the enterprise and consumer markets.
Typical consumer-type HDDs offer a substantially reduced MTBF.
Consumer-type SSD/flash drives offer a substantially reduced number of full write cycles compared to enterprise-class flash, mainly because the flash modules used are usually not designed for 365×24 operation. Enterprise-class flash storage offers in addition several features to ensure data integrity across the life cycle:
- End-to-End ECC Protection
- Power Fail Protection for SRAM/DRAM Buffers
- Temperature throttling to prevent overheating under heavy load
Enterprise class flash devices contain specialized controller hardware providing more sophisticated wear levelling functionality and hardware offload engines e.g. for compression. Hot-sparing of single flash modules on the flash devices instead of hot-sparing full devices is another example of integrated controller functionality in enterprise flash drives to ensure zero-downtime operation.
In summary, these specification differences between consumer- and enterprise-class drives lead to a substantially higher number of drive failures for consumer drives, which must be avoided even where sophisticated data redundancy technologies are in place to ensure zero-downtime operation.
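The MTBF difference translates into a markedly different annualized failure rate (AFR). The MTBF figures below are illustrative assumptions, not vendor specifications:

```python
import math

def annualized_failure_rate(mtbf_hours: float) -> float:
    """AFR (%) implied by an MTBF figure, assuming a constant (exponential) failure rate."""
    return (1 - math.exp(-8760 / mtbf_hours)) * 100   # 8760 hours per year

# Illustrative: 2.5M hours (enterprise-class) vs 600k hours (consumer-class)
print(round(annualized_failure_rate(2_500_000), 2))  # → 0.35
print(round(annualized_failure_rate(600_000), 2))    # → 1.45
```

Across hundreds of drives in an array, a fourfold AFR difference means several extra failures and rebuilds per year, each one a window of reduced redundancy.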
A SAN is about two things, Storage and Network. We decided to document the SAN technology in the networking area.
Storage array vendors often offer additional services that can be enabled at the array level to provide extra functionality.
Data availability and data consistency are the key requirements for a ZO-compliant infrastructure design. In case the datacentre power infrastructure is not fully reliable, it is crucial for data consistency that all written data acknowledged to a host is indeed securely written to persistent storage or to a cache subsystem designed to withstand power outages. Such cache systems should have battery-backed cache and implement an automatic destaging of data to persistent storage in case of power failure, to make sure data is never compromised in a power-down situation.
In addition, the complete data path from the controller connecting the host down to the disk drives must be ECC-protected or protected by other means of bit-failure detection and correction.
Although storage efficiency services are not by default a requirement in a ZO storage design, they are often desirable features from an application-level perspective in use cases like VDI, databases, etc. Typical storage efficiency services are:
- Thin Provisioning,
- Deduplication, and
- Compression.
If an array or another component in the data path of a ZO-compliant storage architecture provides these types of efficiency services, they must not impact overall system performance or availability.
To provide the highest levels of availability and disaster tolerance in a ZO design the storage layer should provide features and functions for local data protection (cloning and snapshot functionality) as well as synchronous and asynchronous remote replication of data to remote or alternative sites. Data replicated to remote locations can also be considered as part of a disaster recovery facility.
The storage layer should provide:
- Local replication (Clone, Snap, Snap on Snap, Clone from Snap) and
- Remote Replication (synchronous with at least 100km distance and asynchronous with no distance limitation).
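The 100 km figure matters because synchronous replication adds at least the fibre round-trip time to every write acknowledgement. A minimal estimate:

```python
def round_trip_latency_ms(distance_km: float) -> float:
    """Minimum light-in-fibre round-trip delay; real links add switching overhead."""
    speed_km_per_ms = 200.0   # roughly 2/3 of the speed of light in optical fibre
    return 2 * distance_km / speed_km_per_ms

# Every synchronous write at 100 km waits at least this long for the remote commit:
print(round_trip_latency_ms(100))  # → 1.0 (ms)
```

One added millisecond per write is noticeable for latency-sensitive workloads, which is why distances well beyond 100 km are normally served with asynchronous replication instead.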
Array-based protection and DR capabilities on the storage layer are not required if data protection and DR are handled at a higher level of the platform design; in some circumstances, however, the storage layer may be the only option for implementing those services.
In this context, an Active-Active cluster functionality on the storage layer will also help to build a disaster-tolerant and highly available ZO infrastructure design by providing continuous data services across different locations and enabling highly available, geographically dispersed data volumes. The service should provide read/write copies of the same data in two places at the same time, with immediate data availability at the second site through a continuous data mirroring capability.
As consolidation in a ZO environment leads to a shared-services platform, a storage array needs to provide isolation of different user environments on a logical instead of a physical level. Therefore, a software functionality is required that enables administrative partitioning of a storage array. This software component should make it possible to assign individual resources (e.g. cache, capacities and frontend ports) to different, isolated array partitions.
In addition the array must be able to provide predefined performance classes and it must be possible to select multiple service classes, even for a single server. To guarantee those service classes a storage array must be capable of granularly defining QoS on an array, port and if necessary down to an individual storage capacity provisioned to a host.
Monitoring and reporting of the array is of vital importance to a ZO design, and it must be recognised that different vendors provide different options for monitoring and reporting.
Monitoring and reporting of critical situations regarding performance or failures are a key requirement for any component in a ZO design. Provisioning and removal of storage capacity for hosts should align with established processes and procedures, and centralized management of all storage components in one location via a single pane of glass is recommended. Call-home and remote-support functionality should be made available and should comply with industry-wide and customer-specific security standards; however, it should be noted that some customers do not allow call-home functions to be enabled, for compliance with their security practices.
Management capabilities on the storage layer include, but are not limited to:
- Centralized Management
- Call Home Support
- Remote Support
- REST API Interface
Automation and programmatic interfaces to control the storage management layer are the basis to implement strict rule based management for tasks like provisioning and de-provisioning. A fully automated storage design should be the ultimate goal of a ZO platform to prevent any errors or inconsistencies due to manual intervention. Programmatic interfaces should adhere to the REST architectural constraints and should be RESTful APIs.
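As a sketch of what such rule-based provisioning might look like, the following builds a RESTful request payload with basic validation. The endpoint path, field names and service classes are purely illustrative assumptions, not any vendor's actual API:

```python
import json

def provision_request(host: str, size_gb: int, service_class: str) -> dict:
    """Build a REST provisioning request; all names here are hypothetical."""
    allowed_classes = {"gold", "silver", "bronze"}   # assumed QoS classes
    if service_class not in allowed_classes:
        raise ValueError(f"unknown service class: {service_class}")
    if size_gb <= 0:
        raise ValueError("size must be positive")
    return {
        "method": "POST",
        "path": "/api/v1/volumes",          # illustrative RESTful resource
        "body": json.dumps({"host": host,
                            "sizeGB": size_gb,
                            "serviceClass": service_class}),
    }

req = provision_request("app-server-01", 500, "gold")
print(req["path"])  # → /api/v1/volumes
```

The point of the validation step is exactly the rule-based management mentioned above: requests that violate policy are rejected programmatically before they ever reach the array, rather than relying on an administrator to catch them.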
As a general concept, limiting the number of storage systems deployed in a ZO environment is desirable. Although failure domains could be kept as small as possible with non-shared, dedicated storage elements per application/service element, the management effort increases exponentially with the number of systems, and with it the risk of human error.
Thus scalability requirements for a storage subsystem are no longer defined by the performance requirements of a single application. Instead, the maximum possible consolidation of disparate storage systems into a single array defines the performance potential required of that array, specifically in terms of high vertical scalability. In this context, scalability not only refers to maximum array capacity but, equally important, also encompasses connectivity and I/O performance – and performance scaling with increased capacity.
It is important to correctly size the storage subsystem for the current workload and also the growth over the lifecycle of the equipment. Factors that need to be considered are the number of hosts or servers to be attached, replication, size of disks/LUNs to be allocated, estimated I/O in terms of reads and writes, and quantity of changed data per day.
It is important to understand the rate of changed data where synchronous replication has been established between two storage systems. The rate of data change needs to be factored into the sizing, as it puts an overhead on the CPUs keeping the data in sync, but it also provides an indication of the bandwidth required to maintain the synchronous copy without impacting the host systems. Where this replication is cross-site, the bandwidth and latency of the link become a further concern, to ensure that host I/O performance is not impacted by delays in committing both transactions.
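A simple sizing sketch for the replication link bandwidth. The peak-to-average factor of 3 below is an assumption to be replaced by measured workload data:

```python
def required_mbit_per_s(changed_gb_per_day: float, peak_factor: float = 3.0) -> float:
    """Link bandwidth needed to keep a synchronous copy current.

    Writes are rarely spread evenly over the day, so the daily average is
    scaled by an assumed peak-to-average factor.
    """
    avg = changed_gb_per_day * 8 * 1000 / 86_400   # GB/day → Mbit/s average
    return avg * peak_factor

# 2 TB of changed data per day, assuming peaks at 3x the daily average:
print(round(required_mbit_per_s(2000), 1))  # → 555.6
```

Sizing only for the daily average (here about 185 Mbit/s) would cause write queuing at peak times, which with synchronous replication directly stalls host I/O.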
It is critical to factor the estimated growth into the requirements for the lifecycle of the equipment. (Although it is possible to add capacity during the lifecycle, it is not always possible to increase the number of I/Os that the storage system is capable of processing.)
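Growth over the lifecycle can be projected with a simple compound-growth estimate; the growth rate below is an assumed example figure:

```python
def capacity_at_year(initial_tb: float, annual_growth_pct: float, years: int) -> float:
    """Projected capacity need assuming compound annual growth."""
    return initial_tb * (1 + annual_growth_pct / 100) ** years

# 100 TB today at 30% annual growth over a 5-year equipment lifecycle:
print(round(capacity_at_year(100, 30, 5), 1))  # → 371.3
```

A system sized only for today's 100 TB would need nearly a fourfold capacity (and, ideally, I/O) headroom by end of life, which is why growth belongs in the initial sizing rather than in later upgrades.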
To achieve maximum utilization of an array, it may be necessary to consolidate capacity from existing infrastructure. Another reason to consider storage migration capabilities in a ZO design is the possibility of technology refreshes. Support for migrations from existing arrays to a new storage system, which should be as seamless as possible, is therefore an important requirement in a ZO design. Depending on the access protocol, this should be completely non-disruptive (e.g. FC) or nearly non-disruptive (e.g. NFS/CIFS). Virtualization technologies that first introduce a virtualization layer between source and destination system, and subsequently provide the capability to move data in the background without further downtime, are a key concept for providing (nearly) zero-downtime migration capabilities.
Storage systems should be accessible via a broad range of access protocols (e.g. FC, iSCSI, NAS, CIFS). Ideally the storage architecture provides access to the same data through different access protocols and follows a unified design, which means that it does not impose a gateway solution to access data volumes through another protocol.