The Standard

Incident Management

Quote from ITIL:
“The purpose of incident management is to restore normal service operation as quickly as possible and minimize the adverse impact on business operations, thus ensuring that agreed levels of service quality are maintained. ‘Normal service operation’ is defined as an operational state where services and CIs are performing within their agreed service and operational levels.”

To support the goal of Incident Management, below are some process and technology considerations from our association. Where we suggest that data be available in a database, this is a target requirement, but having such data in spreadsheet format is also an acceptable solution.

General Recommendations

  1. Make it easy to access required data for all participants of the incident process. Document and share high level architectures of infrastructures and services and ensure these are available to all participants in the incident process.
  2. Have a Configuration Database available, documenting relevant components for the delivery of services. This database should also show contracted supplier services for components or service offerings.
  3. Have a definition in your configuration database regarding the criticality of a service or device. This will significantly speed up detection of high or critical incidents and therefore also speed up the alerting chain and resolution.
  4. Have predefined data collection mechanisms defined with your suppliers, which are automatically triggered internally to collect logs or other data the suppliers need to provide rapid and targeted support.
  5. Suppliers and providers should agree beforehand on the balance between security and the access necessary to address incidents promptly. To that end, there must be documented and tested procedures with each key supplier for obtaining incident analysis data from affected components. These procedures must specify which data is to be collected, from which location, and via which method it is submitted, to enable faster resolution. A solution for retrieving potentially large log files (gigabytes in size) must also be defined and tested with suppliers.
  6. In larger organizations, ensure a split between management and technical communication during incident resolution. Technicians are often faster at resolving an incident if they can exchange information without management involvement.

There are three basic acknowledgments:

  1. The Service Provider decides the priority of an incident for all activities around the incident.
  2. The Service Provider decides the pace via the incident priority.
  3. Suppliers need to understand the provider priority if it deviates from their standard.

A matrix is usually used for the prioritisation of incidents. While ITIL uses a 3×3 matrix resulting in five priority codes, we suggest using four priority codes. Some organisations have more than ten priorities defined. Additional priorities (above four) can make things more difficult and do not result in noteworthy handling improvements.

We recommend the following naming conventions:

Priority Code | Classification            | Procedure
1             | Major Incident (Critical) | Top Management engagement
2             | High                      | Management engagement
3             | Medium                    | standard incident procedure
4             | Low                       | standard incident procedure

Incident:
A non-critical service is down or has performance problems. Low and medium ranked incidents should follow the standard incident procedure. They are resolved within the responsible service unit via the ticketing system.

High Incident:
A critical service chain is partly out of service, has performance problems or has lost its redundancy. These incidents are resolved according to the high incident procedure, which requires additional communication to high-level management and, where necessary, to top management. The procedure is similar to, but not the same as, the major incident procedure.

Critical or Major Incident (MI):
A critical or major incident has the highest resolution priority. A critical service chain is completely out of service with no possible workaround. Usually the whole company or its most important service(s) are impacted.

Many suppliers align with this minimal specification. Therefore, in case incident tickets need to be exchanged, the mapping of priorities is straightforward. If the categories do not match up, mapping is required by both partners resulting in additional effort, delay and cost.
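The priority mapping described above can be sketched in a few lines of Python. This is only an illustration: the supplier scale ("P1" to "P5") and every name in the block are hypothetical, and the actual mapping would have to be agreed with each supplier.

```python
from enum import IntEnum


class Priority(IntEnum):
    """The four recommended priority codes (1 is the most urgent)."""
    MAJOR = 1   # Major Incident (Critical)
    HIGH = 2
    MEDIUM = 3
    LOW = 4


# Hypothetical five-level supplier scale. The mapping is an assumption made
# for illustration; it is not taken from any real supplier contract.
SUPPLIER_TO_PROVIDER = {
    "P1": Priority.MAJOR,
    "P2": Priority.HIGH,
    "P3": Priority.MEDIUM,
    "P4": Priority.LOW,
    "P5": Priority.LOW,
}


def map_supplier_priority(code: str) -> Priority:
    """Translate a supplier priority code into the provider's priority code."""
    try:
        return SUPPLIER_TO_PROVIDER[code]
    except KeyError:
        # An unmapped code means the mapping was never agreed; fail loudly.
        raise ValueError(f"no agreed mapping for supplier priority {code!r}") from None
```

When both partners use the same four codes the table becomes an identity mapping, which is the "straightforward" case; every extra supplier level adds an entry that someone has to agree on and maintain.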

Recommendations:
High and Critical incident definitions, which include business impact, should be agreed beforehand between service providers and their customers. These definitions should be stored in a CMDB and applied automatically when an incident is created.
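As a rough illustration of applying the definitions automatically, the following Python sketch selects a procedure from three facts about an incident. The function and parameter names are hypothetical. Treating a complete outage of a critical service that still has a workaround as a High Incident is an assumption, since the definitions above do not cover that case explicitly.

```python
from enum import Enum


class Procedure(Enum):
    STANDARD = "standard incident procedure"  # priority 3 or 4
    HIGH = "high incident procedure"          # priority 2
    MAJOR = "major incident procedure"        # priority 1


def select_procedure(critical_service: bool,
                     complete_outage: bool,
                     workaround_available: bool) -> Procedure:
    """Pick the procedure from the incident definitions above."""
    if not critical_service:
        # Non-critical service down or degraded: low or medium incident.
        return Procedure.STANDARD
    if complete_outage and not workaround_available:
        # Critical service chain completely out, no possible workaround.
        return Procedure.MAJOR
    # Critical service chain partly out, degraded or without redundancy.
    # Assumption: a complete outage with a workaround also lands here.
    return Procedure.HIGH
```

In practice the `critical_service` flag would come from the criticality stored in the CMDB, so that the Service Desk does not have to decide it under time pressure.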

Critical Landscapes, summaries of all critical components in a given environment, should be updated at least quarterly and maintained in the CMDB. A Critical Landscape usually contains a list of the components, the time frame when the service becomes critical (if applicable), and contact information. This document provides the basis for performing incident verification and prioritization.

For disaster cases, an up to date extract must be available for the Manager on Duty.

It must contain the following information:

  • Most important components
  • Timeframe for criticality, i.e. the daily, monthly or annual timeframe when the service becomes critical. For example, financial systems can have different critical timeframes, such as year-end closing (critical) or mid-month (not critical).
  • Contact persons for each system.

Communication to all those involved should use practical examples which fully illustrate the impact on the customer’s business. For example, an out-of-service printer at a goods shipping center could result in a halt to the delivery of all goods, or an inability to withdraw goods from the warehouse could stop the whole production line.
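The criticality timeframe listed above lends itself to a small data model. The sketch below is one possible shape in Python, assuming annual windows only. The class names, the example component and the contact address are all hypothetical, and an entry without windows is assumed to be critical at all times.

```python
from dataclasses import dataclass
from datetime import date
from typing import Tuple


@dataclass(frozen=True)
class CriticalWindow:
    """Annual window (inclusive) during which a service is critical."""
    start_month: int
    start_day: int
    end_month: int
    end_day: int

    def contains(self, day: date) -> bool:
        start = (self.start_month, self.start_day)
        end = (self.end_month, self.end_day)
        current = (day.month, day.day)
        if start <= end:
            return start <= current <= end
        # The window wraps over the turn of the year, e.g. 20 Dec to 10 Jan.
        return current >= start or current <= end


@dataclass(frozen=True)
class LandscapeEntry:
    component: str
    contact: str
    # Assumption: an empty tuple means the component is always critical.
    windows: Tuple[CriticalWindow, ...] = ()

    def is_critical_on(self, day: date) -> bool:
        return not self.windows or any(w.contains(day) for w in self.windows)
```

A finance system with a year-end closing window could then be declared as `LandscapeEntry("finance-erp", "finance.oncall@example.com", (CriticalWindow(12, 20, 1, 10),))` and queried with `is_critical_on(date.today())` during incident verification.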

It should also be noted that clear definitions, agreed between service providers and customers, are required to distinguish outages (production downtime) from degradation (the system continues to run, but in a degraded condition).

Major & High Incident Procedure

Once the Service Desk has specified the priority, major and high incidents are immediately handed over to the corresponding procedure. Due to the significant business impact, time is of the essence and the focus must be on acting fast without further impacting other services. Many organizations will also have faced situations where a major downtime was used to perform long-outstanding patches or other software upgrades. Such actions often increase resolution time and, in addition, make troubleshooting much more difficult.

Major incidents are rare and usually have a dramatic business impact. They are usually complex in nature and often cannot be resolved by simple actions such as a reboot.

It is important that Major and High Incidents are detected quickly and handed over to a specified expert organization for evaluation and resolution. Therefore, local organizations should spend a maximum of 45 minutes to identify, verify, and decide if an incident needs to be handed over to the centralized organization via the “Red Phone”. The verification should be done by a Customer Service Delivery Manager or local Lead Incident Manager (LIM).

The 45-minute limit is intended to prevent situations in which local teams attempt to solve an incident for an extended period of time without involving the additional experts or suppliers required. The consequence of such attempts is that the overall resolution time is extended. Many will know from experience statements such as “Only 10 more minutes and we will have it solved” or “Just one reboot and the problem is solved”.
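The 45-minute rule is simple enough to enforce in tooling rather than leave to judgement during an incident. A minimal Python sketch, with hypothetical function names and the assumption that the clock starts when the incident is detected:

```python
from datetime import datetime, timedelta

# Maximum time a local organization spends on identifying and verifying.
VERIFICATION_LIMIT = timedelta(minutes=45)


def handover_deadline(detected_at: datetime) -> datetime:
    """Latest time for the decision to hand over via the Red Phone."""
    return detected_at + VERIFICATION_LIMIT


def must_hand_over(detected_at: datetime, now: datetime) -> bool:
    """True once the local team has used up its 45 minutes."""
    return now >= handover_deadline(detected_at)
```

A ticketing system could evaluate `must_hand_over` periodically and alert the Customer Service Delivery Manager or local LIM, which removes the temptation of "only 10 more minutes".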

Characteristics of the major incident procedure:

  1. Managed by the “Red Phone” as a central authority with skilled employees, 24/7 availability and the required expertise to address incidents with the highest priority.
  2. Involvement of Partners and suppliers in regular status calls to make use of their expertise.
  3. A mandatory check of all changes during the last 7 days should be performed.
  4. A full layer check should be performed, i.e. all configuration items and related components of the affected service are checked using checklists and instructions to ensure that all technical issues are detected.
  5. Involvement of Senior Management as well as the Manager on Duty.
  6. Continuous customer communication must be provided.
  7. Scheduling of regular update calls on all countermeasures and results.

Major Incident Workflow

For incidents verified as major/critical the following workflow is recommended.

The Red Phone is a cross-country and cross-organizational authority which is in charge of handling all high and major/critical incidents. It starts the corresponding procedure and manages the process from involving the necessary resources to alerting Executive Management.

High Incidents Workflow

The High Incident workflow is similar to the Major Incident Workflow, but most of the incident is handled by a Lead Incident Manager. The Manager on Duty is usually not involved, or is less involved. There are usually also no management status calls.

 

Combined technical and management conference call

The combined call for management and technicians, initiated by the Red Phone, is the initial and only platform to coordinate and assign all technical measures into work streams, including sharing the results. The call is initially steered by the LIM, who evaluates the situation, sets up further activities and assigns the responsible technical lead(s).

In cases where the business line, account or customer has already established a technical call, the further procedure will be aligned in the combined technical/ management call.

Immediately after the management team leaves the combined call, the technicians will continue in the same, or a separate, permanently open bridge to work as a team on a predefined work stream to perform a “layer check” and a “check of changes”.

Further work streams will be defined, depending on the first findings and the affected technology from the technical drill down. During the whole incident, the assigned LIM ensures continuous communication with structured list and status emails.

All required documents, call setups or collaboration platforms etc. will be provided by the MoD or Lead Incident Manager.

Participants:

  • Lead Incident Manager / Global Incident Control
  • Manager on Duty
  • Technical experts
  • Manager on Duty (customer related)
  • Service Owner & Operation responsible
  • Supplier
  • Key roles (all needed Managers on Duty)
  • Top Management team

Management Status Conference Call

This conference call acts as the main communication platform to steer incident resolution. It is meant to efficiently exchange status information and align necessary decisions with the involvement of the relevant suppliers. The call is scheduled for a short management update – maximum duration should be 30 min.

The call is steered and moderated by the assigned LIM.

It is especially important that the SDM(s) deliver status information from the customer perspective and, after the call, provide feedback to the customer(s).

The assigned technical lead presents progress on the analysis and next steps.

Involved suppliers deliver up-to-date information about the progress of their activities. All actions, decisions, work streams, supplier activities and current customer status will be summarized and distributed afterwards in the Incident Report by the assigned LIM.

Participants:

  • Lead Incident Manager
  • Manager on Duty
  • Service Delivery Manager
  • Technical Lead
  • Where present, a customer(s) related Manager on Duty
  • Management representatives
  • Supplier’s management

Red Phone Information Mail

A short, high-level summary describing the incident, which is updated after each call.

This information mail is mandatory for every case and is steered by a LIM. It does not replace the official documentation (incident report). It is always based on the previous status mail and must be written and sent out as soon as the LIM has enough details about the ongoing incident. The first mail must be sent no later than 15 minutes after the combined technical and management call has started. The email is sent by the LIM to a pre-defined distribution list. The first mail contains information such as: customer, priority, business impact to customer, initial actions, and timing of the next update.

The update mail contains information such as: any changes to the business impact, changes to the priority, updates on or results of actions (including supplier escalations), next steps, and timing of the next update.

The mail subject should reflect the priority of the situation from a customer perspective and should always stay the same until the priority is lowered or the incident is resolved. If the incident is closed and the service is still in safeguarding mode, a final mail with the subject line “FINAL: customer – service restored” should be sent out.
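The timing rule and the contents of the first information mail can be captured in a small template. The Python sketch below is illustrative only: the class and function names and the body layout are hypothetical, and it assumes that "customer" in the final subject line is a placeholder for the customer's name.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# The first mail is due at most 15 minutes after the combined call starts.
FIRST_MAIL_LIMIT = timedelta(minutes=15)


def first_mail_due(call_started: datetime) -> datetime:
    """Latest time at which the first information mail may be sent."""
    return call_started + FIRST_MAIL_LIMIT


@dataclass(frozen=True)
class FirstInfoMail:
    customer: str
    priority: int
    business_impact: str
    initial_actions: str
    next_update: datetime

    def body(self) -> str:
        # Hypothetical layout; only the list of fields comes from the text.
        return "\n".join([
            f"Customer: {self.customer}",
            f"Priority: {self.priority}",
            f"Business impact: {self.business_impact}",
            f"Initial actions: {self.initial_actions}",
            f"Next update: {self.next_update:%Y-%m-%d %H:%M}",
        ])


def final_subject(customer: str) -> str:
    """Subject of the closing mail while the service is in safeguarding."""
    return f"FINAL: {customer} – service restored"
```

Keeping the mail as a fixed structure makes it harder to forget the next update time, which is the field recipients tend to rely on most.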

Our recommended best practice is to involve suppliers immediately in Major Incidents and quickly, where required, in High Incidents. Usually Major Incidents are complex in nature as almost all customers’ critical systems run on clustered infrastructures, where failover should work and, most of the time, enough capacity is available. As a result, resolutions are often not simple and require the additional expertise of the relevant suppliers. Immediately involving suppliers for Major Incidents results in a significant acceleration in the resolution of the incident.

Supplier Involvement Starting Point:

In order to make use of the supplier’s knowledge and expertise during an incident, the following basics for supplier involvement need to be defined and frequently updated:

  • The relevant key suppliers for each service.
  • How to engage a key supplier to immediately receive a qualified expert (Level 3 or higher).
  • Details of the SLA and requirements for the activation of each supplier.
  • Contact list stored in a central storage, accessible by the MoD or LIM.

Escalations with key suppliers may be performed only by the Manager on Duty or the Lead Incident Management team. Escalations to other (non-key) suppliers, without predefined engagement models, must be done by the responsible Line Manager on Duty.

Red Phone

The Red Phone consists of the Incident Control and the Lead Incident Management. The Incident Control is the single point of contact (SPOC) for the major incident procedure and has the following responsibilities:

  • Operating as SPOC.
  • Providing complete communication infrastructure (conference calls, desktop sharing).
  • Providing internal documentation regarding the teams involved or contacts including their handover timings.
  • Attending and taking notes on the technical and management bridge calls.
  • Performing supplier involvement and determining whether escalations are necessary.
  • Preparing Major Incident notifications.
  • Distributing officially aligned communications (Incident Report, Info email, other email communication).

The Lead Incident Manager (LIM) is responsible for managing conference calls and ensures that the procedure is followed. The responsibilities of the LIM are as follows:

  • Moderate and structure the combined technical and management calls.
  • Steer and moderate management status calls.
  • Produce & distribute continuous incident documentation (information emails, incident reports or action item lists).
  • Facilitate the ongoing technical analysis, ensuring that the incident is handled according to the Major Incident guidelines.
  • Set up proper safeguarding.
  • Ensure smooth handover to Problem Management (PRM), i.e. preparing the root cause analysis (RCA) template. For details, refer to the Problem Management section.

Yellow Phone

There might be several Yellow Phones available in the different regions, clusters or service lines to structure, moderate and document the resolution process of high incidents with the following responsibilities:

  • Operating as SPOC.
  • Providing complete communication infrastructure (conference calls, desktop sharing).
  • Providing internal documentation regarding the teams or contacts involved including their handover timings.
  • Attending and making notes on the technical bridge.

Technical Lead

  • Is named in the combined technical and management call during the start-up phase (usually a dedicated customer LIM or a technical MoD).
  • Coordinates all activities of the technical or operational teams (internal + external) and suppliers involved.
  • Drives the identification of an adequate solution in a reasonable timeframe, from a technical point of view.
  • Documents and reports on the findings and implemented measures (including their results and/or effects).

Manager on Duty (MoD)

  • Covers dedicated work streams.
  • Supports incident solution with dedicated customer or service line knowledge.
  • Operates or steers dedicated technical conference calls on request.
  • Ensures supplier escalation.
  • Responsible for people management.

Service Delivery Manager (SDM)
  • Covers dedicated work streams.
  • Evaluates the priority and establishes a continuous line of customer communication.
  • Acts as the interface to the customer.

 

 


Before starting any steps to resolve the incident, a brief overview of the situation should be established. According to the ITIL framework the following 5W questions should be answered:

  • What happened?
  • Who did that?
  • When did it take place?
  • Where did it take place?
  • Why did that happen?

 

Full Technical Layer Check
  • Always check all layers unless a layer definitely does not apply.
  • This check is a quick, rough check to identify in which layers issues may exist.
  • After the check, begin to investigate the layers with issues.

Check of Changes

A review of the changes made in the 7 days prior to the High or Major Incident is important, as incidents are often caused by a change in the infrastructure. The MoD will organize the overall change list and provide it to the assigned lead technician for inclusion in the technical call. The Technical Lead carries out a relevance check of all changes against the current incident. The results of the change verification will be shared in the second management call.

The start and end time, as well as status for the “check of changes” activity must be documented in the trigger of events in the Incident Report.
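Assembling the change list for this check amounts to a date filter over the change records. A minimal Python sketch, assuming change records can be exported with an implementation timestamp; the `Change` fields and the function name are hypothetical:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List

# Review window: changes made in the 7 days before the incident.
CHANGE_WINDOW = timedelta(days=7)


@dataclass(frozen=True)
class Change:
    change_id: str
    component: str
    implemented_at: datetime


def changes_to_review(changes: Iterable[Change],
                      incident_start: datetime) -> List[Change]:
    """Return changes from the 7 days before the incident, newest first."""
    window_start = incident_start - CHANGE_WINDOW
    recent = [c for c in changes
              if window_start <= c.implemented_at <= incident_start]
    return sorted(recent, key=lambda c: c.implemented_at, reverse=True)
```

The relevance check itself stays a human task for the Technical Lead. The filter only ensures that no change inside the window is overlooked and that the most recent ones are looked at first.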

During or after incident diagnosis, changes to the infrastructure might be required. We strongly recommend against performing quick updates, patches, reboots or similar changes to the infrastructure without review and approval, as these often reduce the chance of identifying the real root cause of the issue. Our recommendation is therefore to use an emergency change procedure in which all changes to the infrastructure are reviewed and authorized by the MoD or LIM prior to implementation. For this authorization, a change record must exist in the regular change management system (if it is accessible at all), together with brief documentation describing what will be changed. The MoD or LIM approves the change in writing.

The change must be fully documented in the change system within a specified timeframe after the incident is resolved (the recommendation is a maximum of 1 working day). This includes documenting the MoD or LIM approval.

Changes resulting from incidents are always Emergency Changes.

Generally, two short-term change types are recommended:

  1. Emergency Changes
  2. Short Lead Time Changes

These are applied in the following ways:

For further details regarding the Change Management process please refer to the Change Management section.

The priority of an incident can change, for example from critical to low within one business day. However, the highest verified priority within the lifetime of an incident must be documented before closure. Any other practice, such as recording only the downgraded priority, could mean that incident statistics are falsified or manipulated.
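One way to make this rule hard to bypass is to keep the full priority history on the incident record and derive the reported value from it. A minimal Python sketch with hypothetical names, using the four priority codes where 1 is the highest:

```python
from typing import List


class IncidentPriorityLog:
    """Keeps every verified priority of an incident (1 = highest, 4 = lowest)."""

    def __init__(self, initial: int) -> None:
        self._history: List[int] = [self._validated(initial)]

    @staticmethod
    def _validated(priority: int) -> int:
        if priority not in (1, 2, 3, 4):
            raise ValueError(f"unknown priority code: {priority}")
        return priority

    def change(self, new_priority: int) -> None:
        """Record a verified priority change; earlier values are never overwritten."""
        self._history.append(self._validated(new_priority))

    @property
    def current(self) -> int:
        return self._history[-1]

    @property
    def highest(self) -> int:
        # The numerically smallest code is the highest priority.
        return min(self._history)
```

At closure, reporting would read `highest` rather than `current`, so an incident that was downgraded during its lifetime still counts at its peak priority in the statistics.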

After resolving the incident, services are restored.

Due to the high impact of the incident we recommend that a Safeguarding phase is implemented. In this phase disrupted services are specifically monitored, with high attention, for a specified period. Safeguarding measures and follow-up activities are defined by the LIM in collaboration with the participants of the Management conference call. The Safeguarding method (extended or standard/default) depends on the below risk assessment:

Major Incident Review

A Major Incident review is necessary after every Major Incident or on special management request.

The review normally takes place after the last call, together with the key technical and management participants. It is hosted by the MoD and moderated and structured by a LIM. The basis for the MI review is the “Final Incident Report”.

The aim is to review the complete incident history, focusing on the following questions:

  • Is the documented trigger of events correct and complete?
  • What led to the solution of the incident?
  • What information for a preliminary root cause or additional relevant information was provided to Problem Management?
  • What were the focus areas for problem management?
  • What weak points were identified during the major incident process?
  • Were the correct people/teams engaged for problem management?

All topics will be recorded in a document/ticket for problem management. The Major Incident review must take place during office hours, Monday to Friday, and the assigned Problem Manager must attend.