Event Monitoring

The topic of event monitoring is a broad one; in the Zero Outage Industry Standard, the question of what is monitored, and how, is covered in the Platform section.

This workstream addresses what is done with the monitoring data, since in a Zero Outage environment the appropriate monitoring and handling of events is key to success and underpins the continuous service improvement process.

As the diagram shows, in its earliest forms event monitoring was largely manual – based on previous experience, operational history and fixed thresholds. This evolved to more dynamic models where detection and resolution of some incidents is automated. In future, event monitoring systems will be able to analyse large amounts of data from the IT environment and look for specific alerts and patterns which could influence service availability or performance, and then proactively address these potential issues before the business is impacted.

As an example, a traditional vehicle monitors several attributes, such as oil pressure or temperature, and displays a warning when acceptable thresholds are breached. The notification often comes too late, and the result is a breakdown (an outage). Traditionally, servicing for a vehicle is arranged according to mileage, or the date of the last service, but these parameters don’t allow for variables such as individual driver characteristics or road conditions, which can accelerate wear and reduce the time to failure.

Now imagine a vehicle which monitors all components, driver behavior and road conditions holistically, then analyzes this data and identifies patterns to predict breakdowns before they occur. Such a vehicle could proactively book a service before a threshold is breached, even arranging a replacement vehicle for when the maintenance is planned. This would effectively result in near zero outage for vehicle availability. This scenario describes, in a simple way, the future of event monitoring.

The monitoring approach discussed here reflects the key principles of the Zero Outage Standard, namely to:

  • Recognise conditions that may affect the availability of the service now and in the future.
  • Identify potential problems as early as possible so that corrective action can be taken before users are affected.
  • Recognise known error patterns or anomalies via data analytics to prevent problems before they appear.
  • Reduce false alarms by ensuring that every new alert or alarm is thoroughly investigated and understood.

To accomplish these goals, the use of analytics is key so that the comparison and measurement of improvements, or setbacks, is possible.

From a Zero Outage perspective, there are several data sources that can be employed to contribute to successful event management. To effectively leverage all the data received, an inclusive end-to-end service monitoring model across all platform layers, which gathers and analyses data, is required.

The growing multi-vendor and distributed nature of many environments precludes a monolithic monitoring and event management model, so a centralized aggregation layer is required into which all individual monitoring and alerting services feed. Within this layer, emerging techniques including artificial intelligence (AI), machine learning (ML) and predictive analytics can be deployed most effectively, adding the vital proactive element to event management in a Zero Outage environment.
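
As an illustration only, the sketch below shows how events from separate monitoring services might be normalised into one common format within such an aggregation layer; the field names and source payloads are assumptions for the example, not prescribed by the Standard.

```python
# Minimal sketch of a central aggregation layer: events from individual
# monitoring services are normalised into one common format so that
# analytics (AI/ML, predictive models) can work on a single stream.
# Field names and source payloads are illustrative assumptions.
from datetime import datetime, timezone

def normalise(source: str, payload: dict) -> dict:
    """Map a source-specific payload onto a common event schema."""
    return {
        "received": datetime.now(timezone.utc).isoformat(),
        "source": source,
        "resource": payload.get("host") or payload.get("service", "unknown"),
        "severity": payload.get("severity", "info"),
        "message": payload.get("message", ""),
    }

central_stream = []  # in practice: a message bus or event store
central_stream.append(normalise("hardware-monitor", {"host": "db-01", "severity": "warning", "message": "fan speed degraded"}))
central_stream.append(normalise("app-monitor", {"service": "checkout", "severity": "critical", "message": "latency above SLA"}))

for event in central_stream:
    print(event)
```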

For example, it is becoming common to employ tools (e.g. “ChatOps”) which capture monitoring data in real time, allowing operators, through real-time collaboration, to react before an event becomes an outage.
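
A minimal, purely illustrative sketch of this idea follows; the webhook URL and message format are assumptions, since real ChatOps tools each define their own integrations.

```python
# Minimal ChatOps-style sketch: push an alert into a team chat channel via an
# incoming webhook so operators can collaborate on it in real time.
# The webhook URL and message payload are assumptions for the example.
import json
import urllib.request

WEBHOOK_URL = "https://chat.example.com/hooks/ops-channel"  # hypothetical endpoint

def post_alert(text: str) -> None:
    """Send an alert message to the operations channel."""
    body = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(WEBHOOK_URL, data=body,
                                     headers={"Content-Type": "application/json"})
    urllib.request.urlopen(request)  # fire-and-forget for the sketch

post_alert("Load on checkout service is 20% above prediction - please investigate.")
```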

A key element of any monitoring solution is having a holistic view of the IT environment and how each element relates to the others. In ITIL terms this is known as a Configuration Management Database (CMDB). The content of this database is defined in the configuration management process and should show not only the various hardware and software elements, but also the relationships between them, so that the business/service impact of any event affecting one of these is clearly understood, ensuring the most efficient and focused approach to resolution.
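
The sketch below illustrates this idea with an assumed set of configuration items and “depends on” relationships; given an event on one item, the services it ultimately supports can be identified by walking the graph.

```python
# Minimal sketch of CMDB-style impact analysis: configuration items and the
# "depends on" relationships between them are held as a graph, so an event on
# one item can be traced to the services it ultimately supports.
# The items and relationships shown are illustrative assumptions.
DEPENDS_ON = {
    "online-shop": ["web-server-01", "app-server-01"],
    "app-server-01": ["db-cluster"],
    "web-server-01": ["db-cluster"],
    "db-cluster": ["storage-array-07"],
}

def impacted_items(failed_item: str) -> set:
    """Return every item that directly or indirectly depends on failed_item."""
    impacted = set()
    for item, dependencies in DEPENDS_ON.items():
        if failed_item in dependencies:
            impacted.add(item)
            impacted |= impacted_items(item)
    return impacted

# An event on the storage array is immediately understood in business terms:
print(sorted(impacted_items("storage-array-07")))
# -> ['app-server-01', 'db-cluster', 'online-shop', 'web-server-01']
```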

The core data sources for event management are usually instrumentation information, alerts, logs and user experience (in interpreting visualizations of these data, such as graphs). These should work together to assist with the rapid and accurate identification and troubleshooting of issues.

  • Instrumentation information from both hardware and software
  • Alerts generated based on data analytics where a prediction of the expected customer experience is calculated. Once conditions are detected which could impact the user experience, measures should be started automatically to avoid a negative impact on the customer/user. An alert is therefore the beginning of an analysis sequence: it signals that further scrutiny is needed to address the situation and to resolve any issues in a controlled way with minimal user impact.
  • Logs hold the raw operational data, recording events in the operating system or application. Whereas graphs can summarize an overall trend, logs have the finest level of detail and facilitate pinpointing the root cause of an alert (a simple sketch of such alert-to-log correlation follows this list).
  • User experience in interpreting the data presented, including logfile analysis and visualization using various graphing tools.
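
As a minimal illustration of how these sources can work together, the sketch below takes the time an alert fired and returns the log entries recorded around it; the timestamp format and log layout are assumptions for the example.

```python
# Minimal sketch of alert-to-log correlation: given an alert, pull the log
# lines recorded around the time it fired so an operator can pinpoint the
# root cause. Timestamp format and log layout are illustrative assumptions.
from datetime import datetime, timedelta

LOG_LINES = [
    ("2023-06-01 14:01:02", "INFO  request served in 120 ms"),
    ("2023-06-01 14:03:40", "WARN  connection pool 90% utilised"),
    ("2023-06-01 14:04:05", "ERROR timeout talking to database"),
]

def logs_around(alert_time: datetime, window_minutes: int = 5):
    """Yield log entries within +/- window_minutes of the alert."""
    window = timedelta(minutes=window_minutes)
    for stamp, message in LOG_LINES:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if abs(when - alert_time) <= window:
            yield stamp, message

alert_fired_at = datetime(2023, 6, 1, 14, 4)   # e.g. "response time above prediction"
for entry in logs_around(alert_fired_at):
    print(entry)
```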

To best deploy event monitoring tools, it helps to view an environment as a multi-layer stack, with the application on top, and the hardware and services that support it forming the layers below. By monitoring the right metrics and behavioural patterns in each layer, a detailed view of the status of the complete environment can be obtained. It is worth noting that in a Zero Outage model multiple (at least two) separate monitoring systems are recommended for redundancy and data verification.
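
A minimal sketch of such a layered view follows; the layers and health checks are illustrative assumptions rather than a prescribed model.

```python
# Minimal sketch of a layered status view: each layer of the stack reports its
# own health, and the overall picture is derived from all layers together.
# Layer names and check results are illustrative assumptions.
def check_hardware():    return "ok"
def check_os():          return "ok"
def check_middleware():  return "degraded"   # e.g. queue depth above baseline
def check_application(): return "ok"

STACK = {
    "hardware":    check_hardware,
    "os":          check_os,
    "middleware":  check_middleware,
    "application": check_application,
}

statuses = {layer: check() for layer, check in STACK.items()}
overall = "ok" if all(s == "ok" for s in statuses.values()) else "attention required"

print(statuses)
print("overall:", overall)
```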

The Use of Patterns & Predictions

As we have said, in today’s world events generally trigger predefined actions. In the future, reactions will be based on identified patterns that are analysed for predicted impacts. A pattern contains discernible elements that repeat in a predictable manner. A prediction is a statement about an uncertain event; it is often, but not always, based upon experience or knowledge gained from observing patterns and their correlating events.

The historical data and patterns allow us to produce a baseline of “normal” behaviour, which can then be monitored to report deviations and generate alerts. All this data can help to predict how a system will behave under specific conditions and in a specific timeframe. For example, high load on a platform can be acceptable if it has been predicted and prepared for in that timeframe, but if the load is significantly higher than predicted and outside the expected timeframe, immediate action is required.
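
A minimal sketch of this baselining idea, using assumed sample data and a two-standard-deviation tolerance, might look as follows.

```python
# Minimal sketch of baselining: derive "normal" behaviour from historical
# samples, then flag measurements that deviate from it by more than an
# allowed margin. The data and the two-standard-deviation tolerance are
# illustrative assumptions.
from statistics import mean, stdev

history = [102, 98, 105, 99, 101, 97, 103, 100]   # e.g. requests/sec at this hour on previous days
baseline = mean(history)
tolerance = 2 * stdev(history)                     # allow normal fluctuation

def check(sample: float) -> str:
    if abs(sample - baseline) > tolerance:
        return f"ALERT: {sample} deviates from baseline {baseline:.1f} (+/- {tolerance:.1f})"
    return f"OK: {sample} within expected range"

print(check(104))   # within normal fluctuation
print(check(140))   # significant deviation -> raise an event
```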

The following examples demonstrate how the use of predictions & patterns in monitoring can be employed to prevent an outage.

  1. The green line depicts the prediction of the expected load over a future timeframe; this prediction is based on previous patterns and other data and is effectively a dynamic threshold.
  2. The red line depicts the current load limit for this timeframe indicating that a serious incident will result if this is exceeded.
  3. The blue line depicts the actual load increasing above the predicted pattern (green line). As this exceeds the predicted pattern by a certain % (dynamic threshold), an alert would be generated by the monitoring system, which would in turn generate an event (a simple sketch of this check follows the list).
  4. Through event management and follow-on processes such as problem management, knowledge management & automation, action can be taken to address the condition, in this case, by either raising the load limit or reducing the actual load.
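
A minimal sketch of the check described above might look as follows; the load values and the 15% margin are assumptions for the example.

```python
# Minimal sketch of the dynamic-threshold check: compare the actual load with
# the predicted load (green line) and the hard limit (red line), and raise an
# alert when the prediction is exceeded by more than an allowed percentage.
# The values and the 15% margin are illustrative assumptions.
def evaluate(actual: float, predicted: float, limit: float, margin: float = 0.15) -> str:
    if actual >= limit:
        return "MAJOR INCIDENT: load limit exceeded"
    if actual > predicted * (1 + margin):
        # Event management can now raise the limit or reduce the load
        # before the red line is ever reached.
        return "ALERT: load exceeds prediction - trigger event management"
    return "OK: load within predicted pattern"

print(evaluate(actual=620, predicted=500, limit=800))   # -> ALERT
print(evaluate(actual=510, predicted=500, limit=800))   # -> OK
```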

  1. The graph for Day 1 depicts a specific behavior that resulted in an incident.
  2. The exact same behavior seven days later also resulted in an incident; we have started to see a pattern.
  3. It is safe to predict that the onset of the same pattern would have the same outcome as before, i.e. an incident will occur.
  4. As shown in the second diagram (Day x + 10), the use of pattern prediction can be employed to generate alerts when this pattern re-occurs, so that action can be taken to prevent an incident (a simple sketch of such a check follows).
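
A minimal sketch of such pattern matching follows; the stored signature, similarity measure and threshold are illustrative assumptions.

```python
# Minimal sketch of pattern recognition across days: a metric window that led
# to an incident is stored as a signature; when a new window looks sufficiently
# similar, an alert is raised before the incident repeats.
# The signature, samples and similarity threshold are illustrative assumptions.
def similarity(window_a, window_b) -> float:
    """Return a 0..1 score based on the mean absolute difference of two equal-length windows."""
    diffs = [abs(a - b) for a, b in zip(window_a, window_b)]
    scale = max(max(window_a), max(window_b), 1)
    return 1 - (sum(diffs) / len(diffs)) / scale

incident_signature = [40, 55, 75, 95, 120]     # load shape seen on Day 1 and Day 8

def check_window(current_window) -> str:
    if similarity(current_window, incident_signature) > 0.9:
        return "ALERT: known pre-incident pattern detected - act now"
    return "OK: no known pattern matched"

print(check_window([42, 57, 73, 97, 118]))   # Day x + 10: the pattern re-occurs
print(check_window([40, 41, 42, 43, 44]))    # normal, flat behaviour
```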

Event Monitoring & Security

In the past, there may have been significant differences in the ways security and operations teams monitored the IT environment. The traditional goal of operational monitoring was to generate alerts which could be addressed to prevent an outage of a service or process, whereas the goal of security was to prevent breaches, protect data and maintain the integrity of the IT environment. As operational monitoring has evolved and security has become an increasingly critical priority, we have started to see alignment between these two areas. In practice, this means that effective monitoring tools for Zero Outage should be designed to comply with specific security policies and to detect security threats, as well as the usual events that occur daily.
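
As a simple illustration, the sketch below applies an operational rule and a security rule to the same event stream and raises alerts through one channel; the log format, rules and thresholds are assumptions for the example.

```python
# Minimal sketch of operations and security sharing one event pipeline: the
# same monitoring stream that detects operational issues also applies a simple
# security rule (repeated failed logins) and raises alerts through one channel.
# Log format, rules and thresholds are illustrative assumptions.
from collections import Counter

log_events = [
    {"type": "login_failed", "user": "admin"},
    {"type": "cpu_high", "host": "app-01"},
    {"type": "login_failed", "user": "admin"},
    {"type": "login_failed", "user": "admin"},
]

def raise_alert(message: str) -> None:
    print("ALERT:", message)          # same channel for operational and security alerts

# Operational rule
for event in log_events:
    if event["type"] == "cpu_high":
        raise_alert(f"high CPU on {event['host']}")

# Security rule: more than two failed logins for the same account
failures = Counter(e["user"] for e in log_events if e["type"] == "login_failed")
for user, count in failures.items():
    if count > 2:
        raise_alert(f"{count} failed logins for account '{user}' - possible breach attempt")
```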

For more information on the topic of secure monitoring in a Zero Outage environment, follow the link to the Security section (see “Logging, Monitoring & Security Reporting”).