Failure Case Study

High-impact, IT-related business outages occur in many different industries, disrupting the company’s business and its customers. Company reputation can be damaged, monetary damages may be incurred, and lost revenue and lost business are further consequences of unplanned outages. The Zero Outage Industry Standard Association (ZOISA – https://www.zero-outage.com), founded in October 2016, is putting forth best practices, striving for Zero Business Outages in Multi-Vendor IT environments.

The team has looked at “real failure scenarios” – actual high-impact outage events – in order to derive specific lessons learned and countermeasures, and to test the assumptions of the ZOIS framework and design guidelines. Assessing what went wrong is important in determining guidelines and best practices to limit risk.

In this use case analysis, we look at the events that caused a cloud provider’s service to go down, affecting various applications and a significant number of users for several hours.

The problem started with an operator error: a console command took down more servers for maintenance than originally intended. This initial event spread quickly and, in a cascading fashion, impacted multiple devices and applications that depended on the shut-down servers for their operation – in short, a system outage caused by operator error. As more services were impacted, the monitoring and reporting system was also affected, which slowed both the understanding of the issue and the subsequent resolution.

Various lessons can be drawn from this event, and various countermeasures appear relevant to avoid, or reduce the impact of, similar events.

The initial trigger of the event was a human error: an employee who was debugging issues reported in a system decided, while doing so, to shut down some servers. Without understanding the amount of resources being taken offline, the operator degraded the overall system resources below the minimum level required for operation.

From this, a first area of learning is the training of employees, an area well covered in the People Workstream.

A possible additional improvement is process, by including, at least for a subset of critical commands, a peer-review principle (like a “4 Eyes” review principle for complex or potentially impactful changes). This also implies that a process exists to define what “impactful changes” are.
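As a purely illustrative sketch of how such a gate could be enforced in tooling rather than by policy alone, the following assumes a hypothetical command wrapper; the classification list and function names are invented for this example and are not part of the system in the case study.

```python
# Hypothetical "4 Eyes" gate: commands classified as impactful are only
# executed once a second, distinct person has approved them.

IMPACTFUL_ACTIONS = {"shutdown", "decommission", "resize_pool"}


def is_impactful(command):
    """Placeholder classification; a real process would define this list."""
    return command["action"] in IMPACTFUL_ACTIONS


def execute(command, requested_by, approved_by, run):
    """Run the command only if it is harmless or approved by a second person."""
    if is_impactful(command):
        if approved_by is None or approved_by == requested_by:
            raise PermissionError(
                "Impactful command requires approval by a second person"
            )
    return run(command)
```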

However, these would be very basic learnings.

A next step in the learning would be to intervene on the offending command itself through a set of enhancements, among which we can report (a sketch combining them follows the list):

  • Commands should not be allowed to reduce the computing power below the minimum threshold defined for operation; a syntax-level protection should be introduced.
  • Command operation was instantaneous: it may be advisable to introduce a delay period before execution starts, allowing the operator to abort the command before any impact once the mistake is realized.
  • Command action was fast: it may be good to introduce a slow-down / delay mechanism during execution, giving the operator time to abort the command while it is still running and before all resources have actually been impacted.
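As an illustration only, the following is a minimal sketch of how these safeguards could be combined in tooling. The threshold, the batch sizes, and names such as request_shutdown and abort_requested are hypothetical and not taken from the system in the case study.

```python
import time

# Hypothetical values; real thresholds would come from capacity planning.
MIN_ACTIVE_SERVERS = 50          # minimum capacity required for operation
ABORT_GRACE_SECONDS = 30         # delay before the command starts acting
BATCH_SIZE = 5                   # servers taken down per step, not all at once
BATCH_PAUSE_SECONDS = 10         # pause between batches, leaving room to abort


def request_shutdown(active_servers, servers_to_stop, abort_requested):
    """Guarded shutdown sketch: refuses unsafe requests, delays execution,
    and proceeds in small batches so the operator can still abort."""
    remaining = len(active_servers) - len(servers_to_stop)

    # Syntax-level protection: never go below the minimum threshold.
    if remaining < MIN_ACTIVE_SERVERS:
        raise ValueError(
            f"Refused: would leave {remaining} servers, "
            f"minimum is {MIN_ACTIVE_SERVERS}"
        )

    # Grace period: the command is accepted but not yet executed.
    print(f"Shutdown of {len(servers_to_stop)} servers starts in "
          f"{ABORT_GRACE_SECONDS}s; abort now if this is a mistake.")
    time.sleep(ABORT_GRACE_SECONDS)

    # Slow, batched execution: an abort can still take effect mid-command.
    stopped = []
    for i in range(0, len(servers_to_stop), BATCH_SIZE):
        if abort_requested():
            print(f"Aborted after stopping {len(stopped)} servers.")
            return stopped
        for server in servers_to_stop[i:i + BATCH_SIZE]:
            server.stop()            # hypothetical server handle
            stopped.append(server)
        time.sleep(BATCH_PAUSE_SECONDS)
    return stopped
```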

All of these activities fall under the category of creating robustness against errors, upfront in the design and later in the testing.

It is also important to map the required improvements to the different phases of the Value Map; doing so reveals that all of the phases are impacted by the suggested list of improvements. In particular, introducing error-protection methods in the command structure is a requirement that needs to be analysed and implemented in the Plan and Build phases.

An additional area to consider is the fact that the event also ended up impacting the monitoring and reporting system. Here, it is key to implement a completely separate design of the management system, so that a failure in the managed platform cannot take down the tools needed to diagnose it.

A further item to consider is reducing the impact of a failure, often referred to as reducing the “blast radius”. Here we touch the Build and Deploy phases, where a proper segmentation of the implementation is needed, separating the co-dependent elements of the solution.
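As a simplified illustration of such segmentation, the sketch below assumes a hypothetical cell-based layout in which resources are spread across independent cells, so that losing one cell bounds the share of capacity affected; the helper names and numbers are invented for this example.

```python
# Hypothetical cell-based segmentation: servers are assigned to independent
# cells, so a failure (or a bad command) confined to one cell cannot take
# capacity away from the others.

def assign_to_cells(servers, number_of_cells):
    """Spread servers round-robin across cells of the platform."""
    cells = {i: [] for i in range(number_of_cells)}
    for index, server in enumerate(servers):
        cells[index % number_of_cells].append(server)
    return cells


def blast_radius(cells, failed_cell, total):
    """Fraction of total capacity lost if one whole cell is lost."""
    return len(cells[failed_cell]) / total


servers = [f"srv-{n}" for n in range(100)]
cells = assign_to_cells(servers, number_of_cells=4)
print(f"Worst-case blast radius: {blast_radius(cells, 0, len(servers)):.0%}")
```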

In order to achieve this goal, though, it is important to understand how the different elements of the solution are connected with each other. Here the layer model helps, by exposing each building block and identifying its relations to the others. By treating each of these relations as a “connection” with a set of “specifications” (much like an API) around the technical elements that need to be considered and maintained to achieve “compatibility” between building blocks, it becomes possible to highlight critical dependencies and manage them properly, including proper validation of all changes and the tracking of compatibility tables.
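A minimal sketch of how such connections and their specifications could be tracked and validated is shown below; the building blocks, version sets, and function names are assumptions for illustration, not part of the ZOIS framework itself.

```python
from dataclasses import dataclass, field


@dataclass
class BuildingBlock:
    """A layer-model element, e.g. 'compute API' or 'monitoring'."""
    name: str
    version: str


@dataclass
class Connection:
    """A relation between two blocks, with the specification that must hold
    for them to remain compatible (API version, protocol, quota, ...)."""
    consumer: BuildingBlock
    provider: BuildingBlock
    compatible_provider_versions: set = field(default_factory=set)

    def is_compatible(self) -> bool:
        return self.provider.version in self.compatible_provider_versions


def validate_change(connections, changed_block, new_version):
    """Return the connections that would break if changed_block moved to
    new_version; a non-empty result should block or flag the change."""
    changed_block.version = new_version
    return [c for c in connections if not c.is_compatible()]


# Usage: monitoring depends on the compute API staying at v1 or v2.
compute = BuildingBlock("compute-api", "v1")
monitoring = BuildingBlock("monitoring", "3.4")
link = Connection(monitoring, compute, {"v1", "v2"})

for c in validate_change([link], compute, "v3"):
    print(f"Change breaks {c.consumer.name} -> {c.provider.name}")
```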

As many “connections” remain hidden even after the most accurate design review, it is recommended to introduce a testing culture that helps expose unplanned dependencies. A random, but structured and controlled, process of introducing failures into the platform can help expose such hidden and unexpected dependencies.
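The sketch below illustrates, under assumed names and components, what one iteration of such a controlled fault-injection exercise could look like: a candidate component is disabled, service health is checked, the component is restored, and any degradation outside the declared dependencies is reported.

```python
import random

# Hypothetical fault-injection exercise: pick a random component from a
# controlled candidate list, disable it, and check whether any service
# that was not declared as dependent degrades unexpectedly.

CANDIDATES = ["cache-node-3", "queue-worker-7", "dns-secondary"]
DECLARED_DEPENDENTS = {
    "cache-node-3": {"web-frontend"},
    "queue-worker-7": {"billing"},
    "dns-secondary": set(),
}


def run_experiment(disable, restore, health_check, services):
    """Inject one failure and report dependencies missing from the design."""
    target = random.choice(CANDIDATES)
    disable(target)
    try:
        degraded = {s for s in services if not health_check(s)}
    finally:
        restore(target)                      # always restore the platform

    hidden = degraded - DECLARED_DEPENDENTS[target]
    if hidden:
        print(f"Hidden dependency on {target}: {sorted(hidden)}")
    return hidden
```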

Finally, it is worth mentioning that systems grow and change over time, and what works perfectly at a given moment may no longer be suitably compliant with Zero Outage after a while. For this reason, it is important to identify the critical elements that will be subject to growth and to define thresholds in advance; crossing a threshold should trigger a specific revision process. An example of such a process, in the context of the case study we discussed, would have been to revise the resource segmentation policies periodically.
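As a small, hypothetical illustration of threshold-triggered revisions, the sketch below checks assumed growth metrics against predefined limits and flags the policies that are due for review; the metric names, limits, and review cadence are invented for this example.

```python
from datetime import date

# Hypothetical growth thresholds; crossing one should trigger a review
# of the corresponding policy (here, resource segmentation).
THRESHOLDS = {
    "servers_per_segment": 200,
    "tenants_per_segment": 50,
}


def reviews_due(current_metrics, last_review, max_age_days=180):
    """Return the policies needing revision, either because a growth
    threshold was crossed or because the periodic review is overdue."""
    due = [name for name, limit in THRESHOLDS.items()
           if current_metrics.get(name, 0) >= limit]
    if (date.today() - last_review).days >= max_age_days:
        due.append("periodic_segmentation_review")
    return due


print(reviews_due({"servers_per_segment": 230, "tenants_per_segment": 12},
                  last_review=date(2024, 1, 15)))
```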