The Standard

Problem Management

“The purpose of problem management is to manage the lifecycle of all problems from first identification through further investigation, documentation and eventual removal. Problem management seeks to minimize the adverse impact of incidents and problems on the business that are caused by underlying errors within the IT Infrastructure, and to proactively prevent recurrence of incidents related to these errors. In order to achieve this, problem management seeks to get to the root cause of incidents, document and communicate known errors and initiate actions to improve or correct the situation.” – Quote from ITIL

Like ITIL, we distinguish between reactive and proactive Problem Management. We observed that Events usually result in Incidents, which may then enter the reactive Problem Management process. However, if Events are reviewed for regular recurrence, they become an input to proactive Problem Management. Events rarely move directly into Problem Management.

Recurring Events and Incidents can represent more than 50% of all Incidents. It is therefore important to identify similar Incidents that might share a root cause, find that cause and remove it, so that the Incidents do not recur.

Multiple Incidents can occur across several Config items … (we can search for similar symptoms, e.g. by comparing the brief descriptions of the Incidents)

… or on a single Config item (the group of Incidents may indicate a malfunction of that Config item)

Recurring Event and Incident Analysis

  1. Service providers should ensure they receive a report with event & incident data, as well as system health data from vendors.
  2. Group incident records (INMs) in homogeneous groups based on their description.
    2a. Group events & incidents according to their configuration items (e.g. by using a pivot table).
  3. Open and process a problem management (PRM) ticket for each identified group.
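
The grouping in steps 2 and 2a can be sketched in Python. This is a minimal illustration, not a prescribed tool: the record fields (`id`, `ci`, `description`), the sample tickets and the normalization rule (lowercase, digits and punctuation stripped) are assumptions made for the example.

```python
from collections import defaultdict
import re


def normalize(description):
    """Reduce a brief description to a crude comparison key (assumed rule)."""
    text = description.lower()
    text = re.sub(r"\d+", "", text)        # drop ticket-specific numbers
    text = re.sub(r"[^a-z ]+", " ", text)  # drop punctuation
    return " ".join(text.split())


def group_incidents(incidents, min_size=2):
    """Group incident ids by description key and by configuration item.

    Only groups with at least `min_size` members are kept, since a single
    incident is not a recurring pattern.
    """
    by_description = defaultdict(list)
    by_ci = defaultdict(list)
    for inc in incidents:
        by_description[normalize(inc["description"])].append(inc["id"])
        by_ci[inc["ci"]].append(inc["id"])

    def keep(groups):
        return {key: ids for key, ids in groups.items() if len(ids) >= min_size}

    return keep(by_description), keep(by_ci)


# Hypothetical sample data
incidents = [
    {"id": "INM-1", "ci": "db-01", "description": "Disk full on /var 95%"},
    {"id": "INM-2", "ci": "db-02", "description": "Disk full on /var 97%"},
    {"id": "INM-3", "ci": "db-01", "description": "Backup job failed"},
    {"id": "INM-4", "ci": "web-07", "description": "Login page timeout"},
]

same_symptom, same_ci = group_incidents(incidents)
print(same_symptom)  # same symptom on several Config items
print(same_ci)       # several Incidents on a single Config item
```

Each resulting group is a candidate for one PRM ticket. A real analysis would likely need a more robust similarity measure than this exact-match key.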

Handover to Problem Management shall happen after the incident is solved or the service is stabilized with a workaround. Problem Management is mandatory for all critical incidents (MI) and high incidents. We recommend performing a “warm” handover into Problem Management, at least for major incidents. Incident Management owns the handover responsibility. The handover must include a time log (event triggers) outlining the Incident, with the time zone stated for each time entry.
A warm handover takes place during an incident review call, which is necessary after a critical or high incident upon special request.

The review normally takes place after the last call, together with all key technical and management players of the related incident, and is hosted by the MoD or LIM who managed the Incident. The basis for the MI review is the “Final Incident Report”, which needs to be shared with all participants.

  • The target is to review the complete incident history, focusing on:
    • What happened and when did it happen?
    • Are the event triggers correct and complete?
    • What led to the solution of the incident, and at which exact time?
    • Which points does Problem Management have to focus on?
    • Did we identify weak points during the major incident process?
    • Which people are necessary for problem management?
  • All topics will be recorded in the “Incident Report”.
  • The Major Incident review has to be done during office hours, Monday to Friday. Define which time zone is to be displayed and used.
    Recommendation is to use UTC.

Example:

08:21 Incident occurs
08:59 Incident recorded at service desk
10:45 Major incident procedure initiated
11:45 First technical call
12:00 Layer Check initiated to identify the issue
14:57 First Management call
… etc …
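
A time log like the one above can be kept with an explicit time zone on every entry, which also makes delays between steps easy to compute. The following is a minimal sketch: the calendar date is hypothetical, and reading the example times as UTC is an assumption based on the UTC recommendation.

```python
from datetime import datetime, timezone

# Hypothetical date; the times come from the example and are read as UTC.
DAY = (2024, 1, 15)


def utc(hhmm):
    """Build a timezone-aware UTC timestamp from an 'HH:MM' string."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(*DAY, hour, minute, tzinfo=timezone.utc)


time_log = [
    (utc("08:21"), "Incident occurs"),
    (utc("08:59"), "Incident recorded at service desk"),
    (utc("10:45"), "Major incident procedure initiated"),
    (utc("11:45"), "First technical call"),
    (utc("12:00"), "Layer Check initiated to identify the issue"),
    (utc("14:57"), "First Management call"),
]


def minutes_between(log, first, second):
    """Whole minutes elapsed between two labelled log entries."""
    times = {label: ts for ts, label in log}
    return int((times[second] - times[first]).total_seconds() // 60)


for ts, label in time_log:
    print(ts.strftime("%H:%M %Z"), label)

detection_delay = minutes_between(
    time_log, "Incident occurs", "Incident recorded at service desk"
)
escalation_delay = minutes_between(
    time_log, "Incident recorded at service desk", "Major incident procedure initiated"
)
print(detection_delay, escalation_delay)
```

Because every timestamp carries its zone, entries supplied by teams in different locations can be compared without ambiguity.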

Problems are prioritized by “low”, “medium”, “high” and “major/critical” using the same structure and matrix as in Incident Management.

Details:

For correct problem ticket prioritization, the following rules must apply:

  • The incident priority is the input parameter from INM. The event risk must be evaluated within Problem Management. For example, for issues triggered by a major incident, make sure that the priority of the problem is “Major Problem” (i.e. Priority 1).

The risk of incident reoccurrence must be evaluated within PRM.

If the risk of incident reoccurrence is not known, use the risk level ‘normal’. If it’s possible to make an estimation, use ‘critical’ for problems with a high risk that an incident may occur for the same or other related CIs.

  • In cases where the problem ticket is opened as a proactive problem, based on incident management or event data analysis, it is recommended that problem priority “Medium” or “Low” is selected, unless there is a special reason to rate it higher.
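
The rules above can be expressed as a small decision sketch. It is illustrative only: the function names, the lowercase priority labels, the choice of “medium” as the proactive default and the handling of an estimated but not high risk are assumptions, since the text leaves them open.

```python
# Priority labels as named in the text; lowercase spelling is an assumption.
PRIORITIES = ("low", "medium", "high", "major")


def problem_priority(incident_priority=None, proactive=False, special_reason=None):
    """Suggest a problem priority.

    incident_priority: priority of the triggering incident (reactive case).
    proactive: True if the ticket stems from incident or event data analysis.
    special_reason: optional higher priority that overrides the proactive default.
    """
    if proactive:
        # The text recommends "medium" or "low"; "medium" as default is assumed.
        chosen = special_reason or "medium"
    else:
        if incident_priority is None:
            raise ValueError("a reactive problem needs the incident priority")
        # The incident priority is the input parameter from INM.
        chosen = incident_priority
    if chosen not in PRIORITIES:
        raise ValueError(f"unknown priority: {chosen}")
    return chosen


def reoccurrence_risk(high_risk=None):
    """Map a risk estimate to a risk level.

    None: risk not known, which the text maps to 'normal'.
    True: high risk estimated, which the text maps to 'critical'.
    False: estimated but not high; mapped to 'normal' here as an assumption.
    """
    return "critical" if high_risk else "normal"


print(problem_priority(incident_priority="major"))  # problem from a major incident
print(problem_priority(proactive=True))             # proactive default
print(reoccurrence_risk(None))                      # risk not known
print(reoccurrence_risk(True))                      # high risk estimated
```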

Identifying the Event Risk

The following instructions are given as a guideline to find the right event risk level. For this you need at least the following information:

  • Customer
  • CI
  • Related services that can be disrupted if the relevant CI crashes or is damaged (for one or more customers).
  • Security information (if available).
  • Predicted work load or other information (e.g. external request) relevant to the CI or system.
  • Current maintenance information.

Major Problems require a Root Cause Analysis (RCA) to be presented in a report to senior management of the service provider. Our recommendation is to do this weekly until no RCA is outstanding. After the final RCA has been identified and accepted, a formal sign-off is conducted.

Tracking:

During the RCA investigation, the status is provided in a frequent report (2-5 times per week). It contains important and current information about ongoing RCAs, including:

  • Root Causes found
  • Root Causes still under investigation incl. status and issues
  • Identified Risks
  • Scheduled/planned De-Briefings / Sign off
  • Detailed streams and status of Root Cause Analysis

Workflow Root Cause Analysis

Root Cause Analysis Checklist:

Check Alarming Chain:

  • When was MoD service informed? Explanation for delay of MoD service activation.

  • Did a monitoring system detect the impact/did the customer detect the impact?

Check Incident Process:

  • Was a standard checklist used to solve the Incident?

  • Was the Customer Business Impact clear during the whole Incident process?

  • Has the responsible SDM been involved in clarification?

  • Evaluation of downtime, time to repair for this case; further explanation (optional)

  • Quality of incident details?

  • Was critical landscape used and sufficient?

  • Has Configuration management data been sufficient?

  • Was a transition or transformation activity causing the Incident?

  • What went well / wrong?

  • Has a Change caused the Incident? (based on “Check of Changes”)

Key Players during Incident:

  • Were all key players during Incident Management Process available in time? Which roles were missing?

  • Have required 3rd parties joined in incident resolution?

  • Which ones?

  • Did the 3rd party react within SLA / OLA?

Incident caused by Change (Deep dive with Change):

  • Was Incident really caused by a conducted change?

  • Was the change tested before implementation?

  • Change type (= change classification: Major, Significant, Minor, Standard)

  • Was the change discussed in the appropriate Change Advisory Board (CAB)? Date of change discussion at CAB?

  • Was change approved by CAB?

  • Was a backout method defined?

  • Did the defined backout method work as planned? If not, why not?

  • Was it a customer driven change?

  • Planned change start/end time? - Actual change start/end time?

  • Was the run book followed, or where were there deviations?

  • Was the run book reasonable and feasible?

Supplier Involvement:

  • Which 3rd party suppliers were involved?

  • Do suppliers agree with the RCA hypothesis identified so far?

Identify Root Cause:

  • Was the Root Cause found?

  • Detailed Root Cause description

  • Use some of the techniques to identify the root cause as outlined in section 4.4.4.3 (ITIL Service Operation Book).

Root Cause Classification:

  • Classify the root cause (e.g. Hardware, Software, Application, etc.)

Responsibility for Incident:

  • Identify responsibility for the incident that occurred, based on facts (no finger-pointing)

Reoccurring Incident:

  • Was this a reoccurring incident?

  • Is this incident relevant for other systems of the same customer?

  • Is this incident relevant for other customers/environments?

Fill Known Error Database:

  • Check Known Error Database for entry

  • Create/update entry in Known Error Data Base

Conclude final business impact and Final Downtime:

  • Determine final business impact

  • Determine final start/end of time of impact and final downtime

Define Solutions to avoid Reoccurrence:

  • The description of each measure should include the deliverable(s), the responsible owner and the due date

Get RCA approval:

  • Get agreement from responsible resolution measure owners

  • Gain acceptance for the final RCA from those involved:

    • Service Delivery Manager

    • Production Responsible

    • Customer

    • Involved suppliers

Sign off RCA:

  • At least for Major Incidents and important High Incidents, introduce the RCA to senior management in a Sign off call to get the final approval

The goal of Problem Management is to find the root cause(s) of an incident, and all contributing factors, to avoid the incident or similar incidents in the future. It is mandatory for the problem manager to categorize all root causes and all contributing factors. The main purposes & benefits of classification are:

  • Building groups of similar root cause
  • Basis for analyzing the main issues in the department/sub-unit /unit /company
  • Set up overarching measures/initiatives to avoid similar incidents in the future

For example:
Human error issues, or lack of skill in a specific area → training measures
Issues with partners/third parties → set up an initiative with the partner
Recurring hardware issues → exchange a specific hardware component across the relevant installed base.
Process issues → adapt/change the process

The diagram below shows how problems can be split into eight main categories

  • Software
  • Hardware
  • Security
  • Environment
  • Process
  • Human Error
  • Tools
  • Third Party

Note: this diagram is a simplification. It is often difficult to assign a problem to a single root cause category. It is more often a combination of categories that contribute to a problem. From a zero-outage perspective, it is important to understand the interdependencies to ensure that the root causes are addressed.
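
Because a problem often falls into more than one category, the classification can be recorded as a list of categories per problem and then tallied to find the main issues. The sketch below is a minimal illustration; the problem ticket ids and their classifications are hypothetical, and only the eight category names come from the text.

```python
from collections import Counter
from enum import Enum


class RootCauseCategory(Enum):
    SOFTWARE = "Software"
    HARDWARE = "Hardware"
    SECURITY = "Security"
    ENVIRONMENT = "Environment"
    PROCESS = "Process"
    HUMAN_ERROR = "Human Error"
    TOOLS = "Tools"
    THIRD_PARTY = "Third Party"


# Hypothetical classifications: one problem may carry several categories.
classified = {
    "PRM-101": [RootCauseCategory.HARDWARE, RootCauseCategory.PROCESS],
    "PRM-102": [RootCauseCategory.HUMAN_ERROR],
    "PRM-103": [RootCauseCategory.HARDWARE, RootCauseCategory.THIRD_PARTY],
}


def category_counts(classified):
    """Count how many problems involve each category."""
    counts = Counter()
    for categories in classified.values():
        counts.update(set(categories))  # count a category once per problem
    return counts


counts = category_counts(classified)
for category, n in counts.most_common():
    print(f"{category.value}: {n}")
```

Categories that dominate such a tally are natural candidates for the overarching measures described above, such as a hardware exchange across the installed base.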

Diagram 1: Root Cause Categories

Configuration Error

Issues within the configuration of the software such as:

  • Incorrect parameters in a database.
  • Incorrect protocol settings in a router.
  • Misconfigured VLAN.
  • Incorrect parameters in SAP.

Software Bug - Known Error

The problem occurred due to a bug within the software or firmware. The bug is already listed in the vendor's known error database, and a patch or workaround is available and described, for example in the Oracle Technology Network or Microsoft TechNet.

Software Bug – Unknown Error

The problem occurred due to a bug within the software or firmware. This bug is not known to the responsible vendor, so no recommended patch or workaround is currently available.

Software Version

The issue occurred due to an incorrect software version, for example:

  • The version doesn’t fit with the usage requirements.
  • The version is too old and doesn’t support the hardware being used.
  • The version is out of support / end of life.

Internal Programming Issue

There is a problem within the software itself:

  • Failure within business logic.
  • Incorrect types of variables.

Redundancy Failed

There is a redundancy option available for the application, but it didn’t work, for example:

  • Due to performance issues as the backup system(s) couldn’t manage the load.
  • The backup systems needed too long for the take over.
  • Manual take over failed for a different reason.
  • Backup failed due to a different patch level.

Configuration Error

During the implementation of the hardware, or during a change to the hardware, some settings were made which immediately, or at a later point in time, led to a failure or service restriction on the hardware or elsewhere in the environment.

Faulty Hardware

A defect in a component or equipment, which may lead to a failure on this hardware. For example, a defective motherboard or line card on a server, storage device, router, etc.

Design Error

An issue has occurred as a result of an error in the design, or an error in the construction of an environment compared to the design.

Capacity Issue

This indicates that an overload situation has occurred as a result of a hardware failure or capacity growth failure. For example:

  • Too little memory or space in the environment.
  • Too few CPUs or too little bandwidth on the connections.

Redundancy Failed

Following a failure of a component, failover to the backup device did not work for a number of reasons:

  • Redundancy was not active due to a configuration error.
  • A backup hardware component was also defective and was not previously detected.
  • The failover had not been tested before.

Unauthorized Access

Unauthorized access to a system (see hacking) or to a building, a location or data center.

Hacking

Attacks like “URL interpretation”, “Cross Site Scripting”, “SQL Injection” or “Buffer Overflow Attacks” are categorized here. The aim of these activities is to gain unauthorized access to websites, databases and systems.

Hardware Theft

Any kind of stolen hardware.

Software Vulnerabilities

The exploitation of flaws in architecture, implementation and configuration that make it possible to compromise system integrity against the will of the user.

Data/Identity Theft

In the context of the internet, identity is often reduced to identification and authorization data: mainly combinations of username and password, bank and credit card information, and email addresses. If someone gains unauthorized access to this data, the term “identity theft” is commonly used. Any kind of valuable company data (customer data, product data, statistics) could also be the target of theft, for sale to competitors.

Social Engineering and Spear Phishing

With social engineering, the vulnerabilities of a person are exploited. Attackers mislead their victims into bypassing security mechanisms or unknowingly installing malware, in order to get hold of protected data and information. Unwanted emails are commonly called “SPAM”. These mails can be categorized into classical spam, malware spam and phishing messages. With the latter, the targeted user is lured onto malicious websites or tricked into installing malware.

Malware

Malicious programs, or malware, often perform unwanted or harmful functions on the infected computer.

Denial-of-Service

Denial-of-Service (DoS) or Distributed Denial-of-Service (DDoS) attacks, which render a system unusable through repeated spurious connections, are mainly performed by attackers motivated by blackmail or hacktivism.

Natural Disaster/Deliberate Attack

Force majeure incidents (like earthquakes, flooding or terror attacks) affecting, for example, data center locations or other infrastructure like data networks.

Housekeeping

Housekeeping activities like vacuum cleaning or dusting can result in accidental unplugging or disturbance of sensitive equipment.

Physical Configuration

Physical interfaces can cause errors due to loose connections or damaged cabling.

Water/Leak

Flooding due to broken water pipes or leaking cooling systems can cause damage to electrical components.

Power Supply/UPS (Uninterruptible Power Supply)

  • A defective or unreliable power supply delivering voltage peaks or overvoltage can cause hardware damage.
  • Faulty or offline uninterruptible power supplies can also result in hardware damage.

Escalation Chain Issues

The escalation chain is the internal process describing how critical incidents are escalated internally.

Potential issues that may arise are:

  • An incident ticket was opened too late.
  • The responsible line management was not informed in a timely manner.
  • The involvement of appropriate experts was late or incorrect.
  • The manager on duty could not be reached when needed.

Error in Process/Procedure Execution

The process or procedure does exist and is appropriate, but it was not followed properly. For example:

  • Knowledge of the existing process or procedure was absent or incomplete.
  • The process was known, but not followed due to time constraints or other issues.
  • The existing control mechanisms, e.g. double control principle, were not followed.

Missing Process/Procedure

The required process or procedure was either partly or completely missing.

Error in Process Design/Implementation

The process or procedure does exist, but it is flawed or is outdated, and its execution did not have the desired outcome.

Process/Procedure Violation

An individual did not follow the defined process/procedure. For example, there may be a clear, defined procedure to re-boot a device, but the responsible individual chose not to follow the documented process, in order to complete the procedure more quickly. In cases like this the violation is usually intentional, but the consequences probably were not.

Missing/Misleading Information

Incorrect information is delivered by a contributor (C) to the responsible person (R). For example:

  • There is an incorrect instruction in a change.
  • There is an incorrect parameter setting by a vendor.

Performance/Capability Issues

A responsible individual did not have the right knowledge, experience or skill set to successfully complete the required task. For example, the individual attempted to reboot a device, with which he had no experience or knowledge, and consequently made a mistake.

Incorrect Ordering/Delivery

An incorrect order was placed or the incorrect item was delivered. For example:

  • An incorrect connection, without full redundancy, was ordered by the customer or service provider.
  • While a fully redundant connection was correctly ordered, it was not correctly implemented by the line provider.

Insufficient Resources

The responsible individuals were not able to complete all of their tasks properly. For example, one of two individuals on a shift is not available due to sickness, and there is no replacement on short notice, which means the remaining individual could be overloaded, and the four-eyes principle is not possible.

Accidental Error

An unintentional mistake. There is no knowledge gap in this case. For example, the responsible person pushes the wrong button, or uses the wrong script by mistake.

Configuration Error

A tool which supports the processes is incorrectly configured. For example, an incorrectly set parameter in a monitoring tool leads to performance issues, as the tool itself then consumes too many resources.

Missing Tool

There is no tool to support the necessary processes. For example:

  • The ticketing tool is not yet available for a new customer, which leads to a longer resolution time.
  • A tool is missing and tasks have to be done manually, which leads to a mistake.

Tool Failure

The tool did not function according to its specifications, so the process could not be supported by the tool. For example:

  • The ticketing tool has an error and cannot be used. So the incident management process cannot be supported by the tool and the ticket has to be recorded on paper. This could lead to a longer resolution time.

Missing Tool Functionality

An important functionality of a tool is missing, which can lead to longer resolution time. For example:

  • Due to missing functionality of a tool, actions had to be done manually which led to an error or delayed resolution.

Tool Not Fit for Purpose

The tool was not able to support the process in the required manner. For example:

  • The performance of the tool was poor, which led to a longer resolution time.

Vendor Performance/Contribution

The performance or contribution of a vendor did not meet requirements. For example:

  • The skill level of their technician was insufficient.
  • Investigations could not be finalized in a timely manner.
  • The contracted service level was not met.

Vendor Quality

A quality issue in the product or service of the vendor was apparent. For example, a high percentage of a specific hardware component supplied by a single vendor becomes faulty after one year of use.

Vendor Contract

The contractual agreement with the vendor has weaknesses, or omissions. For example:

  • Service levels are not agreed, or are inappropriate for the criticality of a specific service.
  • The vendor deliverables are not contractually defined.

Customer Contract

The contractual agreement with the customer has weaknesses, or omissions. For example:

  • The agreed service levels are too aggressive or are dependent on factors which cannot be influenced.
  • The expected customer contribution, such as the customer’s reaction times or the number, skill level and availability of key resources, is not clearly defined in the contract.

Customer Obligation/Contribution

The customer is contractually obliged to provide parts of the service or prerequisites (e.g. independent power circuits), but did not fulfil this obligation.

  • After Problem Management has an agreed RCA, the identified measures are taken over into Solution/Measure Tracking.
  • Problem Management has to organize weekly tracking to ensure that measures are performed as expected in scope and time.
  • The Problem Manager ensures that each measure owner reports the current status and flags potentially overdue measures.
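
The weekly tracking can be supported by a simple overdue check. This is a minimal sketch under stated assumptions: the `Measure` fields, the sample measures and the dates are hypothetical, and a real tracking list would normally live in the ticketing tool.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Measure:
    problem_id: str
    deliverable: str
    owner: str
    due: date
    done: bool = False


def overdue(measures, today):
    """Return open measures whose due date has passed, oldest first."""
    late = [m for m in measures if not m.done and m.due < today]
    return sorted(late, key=lambda m: m.due)


# Hypothetical measures taken over from an agreed RCA
measures = [
    Measure("PRM-101", "Replace faulty line card", "network team", date(2024, 3, 1)),
    Measure("PRM-101", "Update failover run book", "operations", date(2024, 3, 20)),
    Measure("PRM-102", "Operator training", "team lead", date(2024, 2, 15), done=True),
]

weekly_review = overdue(measures, today=date(2024, 3, 8))
for m in weekly_review:
    print(f"{m.problem_id}: '{m.deliverable}' ({m.owner}) was due {m.due.isoformat()}")
```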

Responsibility:

  • Responsibility cannot be outsourced!
  • The Service Provider owns the decision to implement resolution measures.
  • If it has been formally decided not to perform recommended resolution measures, this decision should be documented in the corresponding Known Errors.