Traditional IT Operations is not good enough …
Motivation
When the Zero Outage initiative started, the focus was simply on making things better: reaching a higher level of maturity in delivering and managing services, specifically in a multi-supplier environment, which is the norm today for large enterprises. The results of that thinking are demonstrated by the content of the first release, which is a practical and valuable start.
However, zero business outage, as stated by the value proposition of ZOIS, is an explicit and distinct goal, and a lofty one considering the reality customers are facing. When thinking about it, it became clear to me that it will be hard or even impossible to reach this goal by only improving the current way of operating.
At the very least, it seems reasonable to take a step back and ask whether the traditional processes, tools and behaviours are still valid. Today’s methodology is very much about optimizing the procedure of finding and fixing problems as quickly as possible. Zero Outage, when taken literally, requires preventing issues before they happen, or at least before they impact the business.
Proactive operations is not necessarily a new goal; however, it was always seen as an ambition rather than a guarantee. And that changes the game. Operations can’t be an after-the-fact activity, and incident, problem and change management need to be redefined, which shifts the paradigm of how IT operates.
An analogy
In order to better understand the gravity of a complex problem, it always helps me to draw an analogy. The one that helped me in this case was to think of the human being as a service, with the doctor, the hospital and ultimately the OR representing the “Run” part of the value chain. It might not be the most accurate analogy, but it helped me.
The human body has all the things a service solution requires: infrastructure, sub-systems, plumbing and wiring, power, computing, interfaces etc. And things can go wrong, so you need check-ups: you see the doctor, have your blood pressure and pulse measured, and get updates, so to speak. You can also fall seriously ill and recover in the hospital, maybe even requiring surgery to fix fundamental problems. That sounds much like how IT operates today, business as usual.
But then I noticed a fundamental difference when asking myself how and when I decide to see the doctor. Yes, I do regular, proactive check-ups, and I certainly go when I know that I have a specific problem, something I experienced earlier. But sometimes I go because I simply don’t feel well. I do not know what it is; I can’t even explain it clearly. That happened to a friend of mine, and it turned out to be an early form of cancer, which luckily could be healed completely. But had he not gone, he’d probably be dead.
The evolutionary difference
In my opinion, two things set humans apart: first, the autonomic nervous system, which acts as an early warning system issuing tons of feelings; and second, the cognitive skills to make sense of those feelings, seemingly by intuition rather than through a clear thought process.
It seems a fair assumption that without that difference the death rate from cancer in general would be much higher, so it could be seen as the evolutionary difference that makes us fit to survive.
The reality of service systems
Service systems tend to be stupid things on their own; by and large, you need to put all the control around them. Even though the digital transformation and the Internet of Things drive more intelligence into the components themselves, this typically doesn’t cover intelligence about how to manage, troubleshoot and maintain them. Operating services requires significant management tooling, like monitoring, analysis, knowledge, process automation etc.
Going back to the analogy, with such a system we can do regular check-ups, examine organs and check whatever is measurable, and the doctor makes sense of the numbers. Plus, he asks how I feel, and that is exactly what I cannot really ask the service.
The tooling is focused on what we know, the sub-systems, their up-time, performance metrics etc., but it doesn’t allow us to figure out what we don’t know. And that is typically what creates surprises, the unforeseen heart attack, since our nervous system is not perfect either. Now, the question is, what can we do to know what we don’t know, so we can prevent surprises?
The lesson here is that it is still valuable to use the traditional ways of managing services; however, it needs to be done in a fully integrated end-to-end fashion, on top of a well-designed and modelled service. The IT4IT Reference Architecture is a great prescription and guidance on how to get there. That alone is still not sufficient to reach Zero Outage, but the change required is less revolutionary and more evolutionary than I thought.
Overcoming the Difference, within IT Operations
Going back to the analogy for a second, the nervous system is such a complex, maybe not even fully understood system that we can’t mimic it inside a service. However, let’s try to find the bottom line of it. In my opinion, the key elements are:
- Tons of data from internal and external sources
- A way to find out what is abnormal (not feeling well)
The indication of abnormal behaviour avoids surprises and provides the trigger and a data basis to guide further preventive investigation.
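To make the second element a little more tangible, here is a minimal sketch of what "finding out what is abnormal" could look like for a single numeric metric. It is only an illustration under stated assumptions: the function name `find_abnormal`, the `window` and `threshold` parameters and the sample response times are all hypothetical, and the method is a deliberately simple rolling z-score baseline rather than a prescribed technique.

```python
from collections import deque
from statistics import mean, stdev


def find_abnormal(samples, window=20, threshold=3.0):
    """Return the indices of samples that deviate strongly from the recent baseline.

    Hypothetical helper: the baseline is the mean and standard deviation of the
    last `window` samples that were considered normal.
    """
    history = deque(maxlen=window)  # rolling baseline of "normal" values
    abnormal = []
    for i, value in enumerate(samples):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # Flag the sample if it is more than `threshold` deviations away.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                abnormal.append(i)
                continue  # keep the outlier out of the baseline
        history.append(value)
    return abnormal


# Made-up response times in milliseconds with one sudden spike at index 30.
response_ms = [100 + (i % 5) for i in range(40)]
response_ms[30] = 180

print(find_abnormal(response_ms))
```

The point of the sketch is that nobody told the function what a "bad" response time is. It only learns what is usual from the data and reports when the service does not "feel well", which is the trigger for further investigation.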
This is not a new problem; it has already emerged, especially in the security management area. Due to the increased complexity of different, dynamic deployment models, the traditional way of securing the perimeter is no longer sufficient. The enemy is already in, hence the task is to find out what the enemy does, which is not a distinct thing, but a pattern of different things. In essence, it is the same problem as described above: to know what I don’t know, by understanding the meaning and impact of the observable patterns.
The other good news is that enabling technologies continuously evolve, allowing practices that were impossible only a few years ago. In particular, I’m referring to big data and artificial neural networks, which essentially make it possible to mine and analyse very large amounts of structured and unstructured data in order to learn and pinpoint abnormal behaviour.
In my opinion we should not throw away traditional operations and re-invent it, but rather complement traditional structured monitoring with information about the behaviour of services.
Hence, I propose an evolution by adding complementary capabilities, like preventive behavioural analysis, to the traditional ones (in the picture I suggest additional components to the IT4IT Reference Architecture as a consequence).
As a consequence, the traditional capabilities (incident, problem and change management) should then focus more on the tracking, harmonization, documentation and effective communication of the related data with all involved parties, which is a very important and often overlooked task. Data integrity and traceability, the ability to understand the state of service delivery at any given point in the IT value chain, is the key foundation on which advanced, innovative management capabilities can be added. This is also a wonderful example of two standards activities (IT4IT and ZOIS) working in a complementary way towards the same vision.
Disclaimer
The information contained in this document is contributed and shared as thought leadership in order to evolve the Zero Outage Best Practices. It represents the personal view of the author and not the view of the Zero Outage Industry Standard Association.