An “Event” in the context of both ITIL and the Zero Outage Industry Standard can be defined as a detectable occurrence that is significant in relation to the effective management of an IT infrastructure or the availability of an IT Service. Events are created by a Configuration Item (CI), monitoring tool, IT Service, or processing activity. The process by which these are categorized and filtered to determine the most appropriate action is known as Event Management.
Event Management is the process which detects, correlates and responds to normal & abnormal events within an IT infrastructure or service, & is the entry point for many service operations processes. It provides a means of comparing actual performance against Operational & Service Level Agreements, and standards which may have been set during the design phase.
Event Management can be either reactive, for events which could lead or have led to a service impact, or proactive, by capturing patterns which later feed into other processes, for example capacity management. Therefore, the objectives of effective event management can be described as follows:
- Detect & interpret events and initiate appropriate actions to avoid the business impact of such events.
- Proactively reduce risk & maximize the availability of IT infrastructure & services.
- Provide operational information to aid automation.
- Support continual service improvement activities.
Responses to significant events are becoming increasingly automated, compared to standard event and incident management processes. The process described in the following sections is specific to the Zero Outage Industry Standard and assumes an increased level of automation, compared to the ITIL process.
In future releases, the differences between the process described here & that described by ITIL will be explored further, specifically to highlight those elements which enhance the ability of a vendor, service provider or end-customer to achieve zero outage.
It is important that the people involved in designing, building and running an IT infrastructure and/or service and/or process have a thorough understanding of which events need to be captured by the monitoring & alerting systems to effectively manage an IT infrastructure, service or process. A Zero Outage posture requires implementation of a highly automated end-to-end monitoring and alerting system. Traditional siloed monitoring environments in which, for example, the network is monitored separately from storage, are inadequate for separating true root cause from incremental symptoms.
Event Notification, as described in the ITIL documentation, occurs when a Configuration Item (CI) issues a notification (e.g. using SNMP), or when a monitoring system creates a notification by querying a CI on its status and/or availability. Most CIs are designed to communicate certain information about themselves in one of two ways:
- A device is interrogated by a management tool, which collects certain targeted data. This is often referred to as polling.
- The CI itself generates a notification when certain conditions are met.
Once an Event notification has been generated, it is received by the relevant tool, which reads and interprets it. Timely notification of critical events is essential to the Zero Outage model to minimize (preferably eliminate) impact to business outcomes. Design and implementation of the underlying notification and detection tools should receive high priority within the organization to meet this goal.
The ordering of these activities indicates that the classification of the event is recorded at the time of logging. This is normally an automated/pre-defined process & is generally a technical classification which may be needed to determine the expected Service Restriction in follow-on processes.
Typically, intelligent monitoring components can assign Event Severity immediately after detection, filtering and event correlation have occurred within the system.
While every organisation will have its own classification of the significance of an event, it is suggested that at least these three broad categories be represented: informational, warning and exception.
An informational event can take one of two forms: an event within normal operating boundaries, or an event indicating an error that is not deemed a risky or noteworthy problem.
Examples of informational events include:
- Within normal operating boundaries:
  - A device has come online.
  - A transaction is completed successfully.
- Indicating an error that is deemed not a risky or noteworthy problem:
  - A threshold of 50% has been reached when 75% is deemed in general as a warning level.
  - An event log entry has been created indicating one failed user attempt to log in.
  - The monitoring agent lost a few seconds of connectivity to the monitoring system.
A warning describes an event that is normally generated when a service or device is approaching a threshold. Warnings are intended to notify the appropriate person, process or tool so that the situation can be checked and appropriate action taken to avoid an exception.
Examples of warning events are:
- Memory utilisation on a server is currently at 65% and increasing.
- A storage volume is nearing the recommended maximum capacity. Should this be exceeded, performance of the storage device could be impacted.
An exception describes an event in which a service or device is currently operating abnormally. Generally, this means that an Operational or Service Level Agreement has been breached and the business has been impacted. Exceptions could represent a total failure, impaired functionality or degraded performance.
Examples of exception events include:
- Memory utilization has exceeded the defined 75% threshold and the performance of the system has slowed to such an extent that it is unusable.
- Response time for a standard transaction across the network has exceeded the defined thresholds.
- A server is no longer responding.
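The three categories can be illustrated with a small threshold-based classifier. This is only a sketch under stated assumptions: the `Severity` enum, the function `classify_utilisation` and the 60% warning level are hypothetical, and the 75% exception level is borrowed from the memory example above. Real monitoring tools usually also consider trends and service context.

```python
from enum import Enum


class Severity(Enum):
    INFORMATIONAL = "informational"
    WARNING = "warning"
    EXCEPTION = "exception"


def classify_utilisation(percent: float,
                         warning_at: float = 60.0,
                         exception_at: float = 75.0) -> Severity:
    """Map a utilisation reading (0-100) to one of the three broad categories.

    Both thresholds are illustrative defaults, not values mandated by the standard.
    """
    if not 0.0 <= percent <= 100.0:
        raise ValueError(f"utilisation out of range: {percent}")
    if percent > exception_at:
        # The defined threshold has been exceeded: abnormal operation.
        return Severity.EXCEPTION
    if percent >= warning_at:
        # Approaching the threshold: someone or something should check.
        return Severity.WARNING
    return Severity.INFORMATIONAL
```

With these defaults, a reading of 50% is informational, 65% is a warning and 80% is an exception.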
The “Significance” decision box from the ITIL workflow has been replaced by a simpler view in which, at the Correlation step, a decision is taken as to whether action is required or not, i.e. the event is purely informational and does not require any action from either the vendor or the customer. The impact of the event will determine what action is taken.
During event correlation, a 360-degree view is required to determine whether there are existing events which may be related to the present item. For example, duplicate/multiple events should be flagged and archived. For events which are linked/referenced, e.g. separate events for a hardware failure and a system crash for the same CI, the link should be recorded.
It should be noted that during follow-on processes, an additional and broader correlation may need to be done (manually, semi- or fully automated). This step is not to be confused with event management correlation.
The correlation of multiple events to identify a common CI failure may result in the creation of a new event from which an action is triggered. For example, a server may show as offline due to a switch failure while an event for the switch failure itself also exists. In that case, the focus should be on the switch failure and not on the server being unavailable: only one of the events should be actioned, preferably the one representing the root cause.
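The switch/server case can be sketched as a dependency lookup. The example below assumes a hypothetical `depends_on` map (each CI mapped to the CI it relies on, as might be derived from a CMDB) and a hypothetical `Event` record. It keeps only those events whose CI has no upstream CI that is also reporting an event.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    ci: str    # configuration item that raised the event
    kind: str  # e.g. "offline", "failure"


def has_failing_upstream(ci: str, failing: set[str],
                         depends_on: dict[str, str]) -> bool:
    """Walk up the dependency chain looking for a CI that also has an event."""
    seen = {ci}
    parent = depends_on.get(ci)
    while parent is not None and parent not in seen:  # 'seen' guards against cycles
        if parent in failing:
            return True
        seen.add(parent)
        parent = depends_on.get(parent)
    return False


def suppress_symptoms(events: list[Event],
                      depends_on: dict[str, str]) -> list[Event]:
    """Return only the events that are likely root causes."""
    failing = {e.ci for e in events}
    return [e for e in events
            if not has_failing_upstream(e.ci, failing, depends_on)]
```

Given events for `server-01` (offline) and `switch-a` (failure), with `server-01` depending on `switch-a`, only the switch event remains to be actioned. The suppressed server event would normally be linked to it rather than discarded.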
There are multiple correlation methods:
- Time based correlation
- Environment based correlation
- Event class based correlation
Time based correlation:
- Within a specified timeframe, a given number of events of a given type occur and trigger a corresponding action. For example:
  - If a port fails fewer than 5 times in 24 hours: do nothing.
  - If a port fails 5 times in 24 hours: reboot the port at 2 AM.
  - If a port fails 5 times in 2 hours: reboot the port immediately.
Environment based correlation:
- Multiple events per CI or CI group appear at a similar time, indicating a likely bigger issue.
Event class based correlation:
- Events of a critical nature happen and require action…
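The port example for time based correlation can be expressed as a sliding-window count. This is a minimal sketch: the function name `port_action` and the returned action labels are hypothetical, and "5 times" is interpreted as "5 or more", which is an assumption.

```python
from datetime import datetime, timedelta


def port_action(failures: list[datetime], now: datetime) -> str:
    """Choose an action for a port from its failure timestamps."""
    def count_within(hours: int) -> int:
        start = now - timedelta(hours=hours)
        return sum(1 for t in failures if start <= t <= now)

    # Check the narrower window first, because it demands the stronger action.
    if count_within(2) >= 5:
        return "reboot_now"
    if count_within(24) >= 5:
        return "reboot_at_02:00"
    return "none"
```

Five failures spread across 20 hours would schedule the 2 AM reboot, while five failures within one hour would trigger the immediate reboot.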
After correlation, it is clear whether an event is to be actioned or not. Those requiring no further action are closed, while the remainder either trigger an automated response or flow through the Change, Incident or Problem management processes, as appropriate.
Now that we have identified that an event is to be actioned, the next step is to determine who is responsible for resolution, i.e. whether the vendor is empowered to perform the resolution action, or if the customer needs to trigger the resolution activity.
There may be rules defined by the customer that prevent a service provider from resolving the issue themselves. Some examples are described below:
- The resolution action requires a change that is not pre-approved by the customer.
- The resolution action could impact a critical service, so customer approval is always required.
- It is an event for which the customer has specifically instructed that they need to be involved, or it is an event for which no instructions have been given, i.e. it falls outside of the defined “rules”.
Such “rules” are usually defined at the contractual level.
The following examples indicate situations where the vendor/service provider may be empowered to perform the activities without customer inclusion:
- The event has a resolution path agreed with the customer.
- The resolution can be performed automatically and this has been agreed for similar cases with the customer.
- The vendor/service provider includes resolutions of this nature as part of their service and the customer will not incur an outage.
- The resolution is performed during a maintenance window.
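The ownership decision can be sketched as a rule check. The fields of `ResolutionContext` below are hypothetical names for the conditions listed above. In practice these rules come from the contract and are usually richer than four flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResolutionContext:
    covered_by_rules: bool            # agreed instructions exist for this kind of event
    change_pre_approved: bool         # the required change is pre-approved by the customer
    impacts_critical_service: bool    # the resolution could impact a critical service
    customer_wants_involvement: bool  # the customer asked to be involved


def resolution_owner(ctx: ResolutionContext) -> str:
    """Return "customer" or "provider" for the party that triggers the resolution."""
    if not ctx.covered_by_rules:
        # No instructions given: the event falls outside the defined rules.
        return "customer"
    if ctx.customer_wants_involvement or ctx.impacts_critical_service:
        return "customer"
    if not ctx.change_pre_approved:
        return "customer"
    return "provider"
```

The sketch defaults to the customer whenever any rule is unmet, so that the provider acts alone only on an explicit agreement.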
Automated responses to known events are becoming increasingly common in the service provider/vendor industry.
Today the IT industry aims to move to more automated environments utilizing machine learning or similar technologies, but currently only certain activities are generally automated. For an automated response, no incident ticket needs to be created but it is mandatory to:
- Log all activities performed by automation to remediate the situation (chronological steps with time-stamps).
- Log all required process information, e.g. priority, affected CI/Service and Customer, responsible party/owner, etc.
- Raise an Incident/Problem/Change ticket (depending on use case scenario) to a responsible party/owner in cases where the automated response fails & human intervention is required to finalize the processing, e.g. a technician replacing a hardware component in the data center.
- Prior to the execution of an automated action, a check must be performed for conflicts (e.g. planned/running changes, existing Incidents, other running automated responses). If a conflict exists the automation cannot be performed and an incident must be created to resolve the situation.
- If certain automated actions are known to affect the end-user, then they should only be executed at certain times, or in special circumstances.
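The logging and conflict-check requirements above can be combined in a small wrapper around the automated action. This is a sketch only: `run_automated_response`, its parameters and the returned status strings are hypothetical, and the audit log is a plain in-memory list standing in for whatever logging or ticketing backend is actually used.

```python
from datetime import datetime, timezone
from typing import Callable


def run_automated_response(action: Callable[[], None],
                           conflicts: list[str],
                           audit_log: list[dict],
                           ci: str,
                           priority: str) -> str:
    """Run `action` only if no conflicts exist, logging every step with a timestamp."""
    def record(step: str) -> None:
        audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "ci": ci,
            "priority": priority,
            "step": step,
        })

    if conflicts:
        # e.g. planned/running changes, open incidents, other running automations
        record("conflict detected: " + ", ".join(conflicts))
        return "incident_raised"

    record("automation started")
    try:
        action()
    except Exception as exc:
        # Human intervention is needed; hand over via a ticket.
        record(f"automation failed: {exc}")
        return "ticket_raised"
    record("automation completed")
    return "resolved"
```

A caller would gather `conflicts` from the change calendar and ticketing system before invoking the wrapper. An empty list means the automation may proceed.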
In some cases, while an event may be resolvable by an automated response, the risk to the service is deemed too high for the action to be carried out during peak usage periods. In these cases, such actions must be carried out at pre-defined times, when the risk of service impact is lowest. Examples include:
- Critical systems, where the level of risk is too high during peak hours.
- Automation that is known to cause performance impact.
- Change Freeze implementation, where permission for an emergency response must first be sought.
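Deferring a risky automated action to a pre-defined time can be sketched as a window calculation. The function `next_execution_time` and the 02:00 to 04:00 window are assumptions for illustration. The sketch also assumes the window does not cross midnight.

```python
from datetime import datetime, time, timedelta


def next_execution_time(now: datetime,
                        window_start: time = time(2, 0),
                        window_end: time = time(4, 0)) -> datetime:
    """Return `now` if it lies inside the low-risk window, else the next window start."""
    if window_start <= now.time() < window_end:
        return now
    # Preserve the timezone of `now` (if any) on the computed start time.
    start_today = datetime.combine(now.date(), window_start, tzinfo=now.tzinfo)
    if now < start_today:
        return start_today
    return start_today + timedelta(days=1)
```

An action requested at midday would therefore be scheduled for 02:00 the following day, while one requested at 03:00 would run immediately.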
The best way to describe the interface is with an example:
Vendor hardware typically has onsite monitoring which will report directly to the vendor if a hardware issue occurs. With many vendors, this will result in an automated ticket being raised within the vendor’s ticketing system, and for simple pre-approved fixes the ticket will be assigned to a customer engineer.
Many customers have automated, standard, pre-approved change management for non-disruptive repairs, which means that these can be approved automatically. Where possible it is desirable for the vendor & customer to automate responses to such events. This does not bypass the change management process; it automates the process while keeping it auditable.
In cases where the automated response actions are successful the event can be closed, and details of such events included in a cadenced report.
Automation scripts should be designed to fail with an error code when the changes or actions cannot be completed successfully. In these instances, it will be necessary to:
- Revert to a manual response as described below.
- Initiate an Incident ticket for the automation failure.
- Update the cadenced event report.
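The exit-code convention can be sketched with the standard `subprocess` module. The function name `follow_up_steps` and the step labels are hypothetical; they simply mirror the actions listed above.

```python
import subprocess
import sys


def follow_up_steps(command: list[str]) -> list[str]:
    """Run an automation script and derive the follow-up steps from its exit code."""
    completed = subprocess.run(command, capture_output=True, text=True)
    if completed.returncode == 0:
        return ["close_event", "update_cadenced_report"]
    return [
        "revert_to_manual_response",
        f"raise_incident(exit_code={completed.returncode})",
        "update_cadenced_report",
    ]


if __name__ == "__main__":
    # A stand-in "script" that fails with exit code 3.
    print(follow_up_steps([sys.executable, "-c", "raise SystemExit(3)"]))
```

Carrying the exit code into the incident gives the engineer a starting point for diagnosing why the automation failed.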
In many cases manual intervention is required. This is true for situations where there are no automated processes, where a response requires a normal or emergency change approval from the customer, and for automated responses which have returned an error, as previously described.
When a manual response is required, an incident, problem or change ticket must be raised to trigger further actions. For example:
- An incident ticket is raised for engineer review.
- The change response can be automated but requires manual approval through the customer’s change management system. In this case, the incident and subsequent change tickets need to be raised and reviewed.
Following the completion of the actions identified by the event management process, there needs to be verification that the steps taken have resolved the event. Should this verification process show that the event is not resolved, further manual intervention is required.
If the event has been resolved the event details should appear in the internal event report.
In cases where the resolution falls outside of the vendor/service provider’s scope it is the responsibility of the customer to further address the issue. The vendor/service provider must provide the customer with all available details to assist them with their follow-on actions.
Provision of this information can be done in several ways and must be agreed with the customer in advance. Standard methods for such communications include notification via a ticketing tool, an email or a phone call, depending on the urgency of the situation.
Once the necessary information has been passed to the customer, the responsibility for resolution has been handed over. In cases where the customer disputes their ownership of the issue, or decides not to act, a record of such events should be provided in a cadenced report which makes clear to the customer the possible outcomes associated with their decision. Situations like this are rare, but may be avoided by:
- Ensuring that the service contract clearly describes the areas of responsibility for the vendor/service provider and the customer.
- Agreeing a process for dealing with any “ownership” disputes which may fall outside those described in the contract.
Should such gaps be identified, steps can be taken to implement the appropriate contract amendments.
This refers to the customer-specific process used to evaluate information received from the service provider or vendor and to trigger actions based on the notified event.
This may mean that the customer triggers actions which revert to the service provider or vendor via the usual, agreed Incident/Problem/Change procedures.
Our Association recommends that service providers/vendors produce a cadenced customer report outlining open issues & identified risks. A well-designed report should outline the risks in the customer’s environment which could affect the quality of service. Such risks should be logically grouped into categories, for example high, medium and low, to help the customer prioritize, and the report should, where possible, clearly describe the actions required to mitigate them.
Multiple report formats exist in the field. Some Zero Outage Industry Standard companies provide excellent examples of reports which help customers understand the risks that could lead to an outage or performance issue, and which actively support them in addressing the risks highlighted.
We recommend the following focus items in the event report:
- Number of events which were resolved before they caused business impact.
- Number of events which have occurred and resulted in business impact.
- Number of events successfully resolved through automation.
- Success vs failure rate of automated responses.
- Number of events by type, to identify trends. This is very useful for identifying improvement potential in monitoring and event processing.
- Number of events resolved via a knowledge base entry.
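The focus items can be computed from a list of event records. The sketch below is illustrative only: `EventRecord`, its fields and `report_summary` are hypothetical, and it assumes every record describes an event that has already been closed.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EventRecord:
    event_type: str
    business_impact: bool       # the event resulted in business impact
    automated: bool             # an automated response was attempted
    automation_succeeded: bool  # the automated response resolved the event
    resolved_via_kb: bool       # resolved using a knowledge base entry


def report_summary(records: list[EventRecord]) -> dict:
    """Aggregate the recommended focus items for a cadenced event report."""
    attempted = [r for r in records if r.automated]
    succeeded = sum(r.automation_succeeded for r in attempted)
    return {
        "resolved_without_impact": sum(not r.business_impact for r in records),
        "with_business_impact": sum(r.business_impact for r in records),
        "resolved_by_automation": succeeded,
        # None when no automation was attempted, to avoid dividing by zero.
        "automation_success_rate": succeeded / len(attempted) if attempted else None,
        "events_by_type": dict(Counter(r.event_type for r in records)),
        "resolved_via_kb": sum(r.resolved_via_kb for r in records),
    }
```

Comparing `events_by_type` across reporting periods is one simple way to surface the trends mentioned above.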
There is no requirement for a full-time Event Manager; instead, the responsibility comes under the Availability Management role. Roles vary between companies; the generic descriptions below should help with interpretation of the matrix, which includes customer and service provider.
Service Provider (SP) Event Monitoring Team
The team within the service provider responsible for classification and correlation of events.
SP Technical Operations Team
The operations teams responsible for configuring the monitoring & automation, for process improvement, & for day-to-day management of the IT environment.
Service Delivery Manager
The person ultimately responsible for the delivery of the contracted service to the customer.
Legend: R – Responsible | A – Accountable | C – Consulted | I – Informed
| Activity | SP Event Monitoring Team | SP Technical Operations Team | SP Service Delivery | Customer Technical Operations |
| --- | --- | --- | --- | --- |
| Event detection, classification and logging | | | | |
| Determining action required | | | | |
| Customer action required | | | | |
| Triggering of automated actions | | | | |
| Manual intervention required by SP | | | | |
| Production of reports | | | | |
| Communication of cadenced customer reports | | | | |