“The purpose of problem management is to manage the lifecycle of all problems from first identification through further investigation, documentation and eventual removal. Problem management seeks to minimize the adverse impact of incidents and problems on the business that are caused by underlying errors within the IT Infrastructure, and to proactively prevent recurrence of incidents related to these errors. In order to achieve this, problem management seeks to get to the root cause of incidents, document and communicate known errors and initiate actions to improve or correct the situation.” – Quote from ITIL
Like ITIL, we distinguish between reactive and proactive Problem Management. We have observed that Events usually result in Incidents and from there may end up in the reactive Problem Management process. However, if Events are reviewed for regular recurrence, they form an input to proactive Problem Management. Events rarely move directly into Problem Management.
Recurring Events & Incidents can represent more than 50% of all Incidents. It is therefore important to identify similar Incidents that might share a root cause, find that cause and remove it, so that the Incidents do not occur again.
Multiple Incidents can occur on several Config items … (we can search for similar symptoms, e.g. by comparing the brief descriptions of the Incidents)
… or on a single Config item (the group of Incidents may indicate a malfunction of that Config item)
Recurring Event and Incident Analysis
1. Service providers should ensure they receive a report with event & incident data, as well as system health data from vendors.
2. Group incident records (INMs) in homogeneous groups based on their description.
2a. Group events & incidents according to their configuration items (e.g. by using a pivot table).
3. Open and process a problem management (PRM) ticket for each identified group.
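The grouping steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the record fields (`id`, `ci`, `description`), the sample records and the similarity threshold of 0.8 are all assumptions made for this example. Descriptions are compared with the standard-library `difflib.SequenceMatcher`, which stands in for whatever similarity measure your ticketing data supports.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical incident records; field names and values are assumptions.
incidents = [
    {"id": "INM-1", "ci": "router-01", "description": "Packet loss on uplink"},
    {"id": "INM-2", "ci": "router-01", "description": "Packet loss on uplink port"},
    {"id": "INM-3", "ci": "db-07", "description": "Database connection timeout"},
    {"id": "INM-4", "ci": "app-03", "description": "Database connection timeouts"},
    {"id": "INM-5", "ci": "app-09", "description": "Disk full on /var"},
]


def group_by_ci(records):
    """Group incident IDs by configuration item (the pivot-table step)."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["ci"]].append(rec["id"])
    return dict(groups)


def group_by_description(records, threshold=0.8):
    """Greedy grouping: a record joins the first group whose first
    description is at least `threshold` similar, else starts a new group."""
    groups = []
    for rec in records:
        text = rec["description"].lower()
        for group in groups:
            reference = group[0]["description"].lower()
            if SequenceMatcher(None, reference, text).ratio() >= threshold:
                group.append(rec)
                break
        else:
            groups.append([rec])
    return [[r["id"] for r in g] for g in groups]


def recurring_groups(groups, min_size=2):
    """Keep only groups large enough to be candidates for a PRM ticket."""
    return [g for g in groups if len(g) >= min_size]


by_ci = group_by_ci(incidents)
by_description = group_by_description(incidents)
candidates = recurring_groups(by_description)
```

Grouping by configuration item surfaces a single faulty item, while grouping by description surfaces the same symptom spread across several items. Each entry in `candidates` would then be reviewed by a person before a PRM ticket is opened.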
Handover to Problem Management shall happen after the incident is solved or the service is stabilized with a workaround. Problem Management is mandatory for all critical incidents (MI) and high incidents. We recommend performing a “warm” handover into Problem Management, at least for major incidents. Incident Management owns the handover responsibility. The handover must include a time log of the incident (event triggers), with the time zone stated for each time entry.
A warm handover takes place during an incident review call, which is necessary after a critical or high incident upon special request.
The review normally takes place after the last call, together with all technical and management key players of the related incident, and is hosted by the MoD or LIM who managed the incident. The basis for the MI review is the “Final Incident Report”, which needs to be shared with all participants.
- The target is to review the complete incident history, focusing on:
- What happened and when did it happen?
- Are the event triggers correct and complete?
- What led to the solution of the incident, and at which exact time?
- Which points does Problem Management have to focus on?
- Did we identify weak points during the major incident process?
- Which people are needed for Problem Management?
- All topics will be recorded in the “Incident Report”.
- The major incident review has to be done during office hours, Monday to Friday. Define which time zone is to be displayed and used.
The recommendation is to use UTC.
Example time log entries:
- Incident recorded at service desk
- Major incident procedure initiated
- First technical call
- Layer Check initiated to identify the issue
- First Management call
- … etc …
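As a minimal sketch of the UTC recommendation, the time log can be normalized before the review so that entries recorded in different time zones line up in one chronological sequence. The timestamps, the two fixed offsets and the function names below are illustrative assumptions; only the event texts come from the log above.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fixed offsets for two reporting sites (assumptions).
CET = timezone(timedelta(hours=1), "CET")
IST = timezone(timedelta(hours=5, minutes=30), "IST")

# Illustrative entries: (timestamp with time zone, event trigger).
raw_log = [
    (datetime(2024, 1, 15, 9, 5, tzinfo=CET), "Incident recorded at service desk"),
    (datetime(2024, 1, 15, 13, 50, tzinfo=IST), "First technical call"),
    (datetime(2024, 1, 15, 9, 12, tzinfo=CET), "Major incident procedure initiated"),
]


def to_utc_log(entries):
    """Convert every entry to UTC and sort chronologically.

    Entries without a time zone are rejected, because the handover
    requires a time zone for each time entry.
    """
    converted = []
    for ts, text in entries:
        if ts.tzinfo is None or ts.utcoffset() is None:
            raise ValueError(f"time entry without time zone: {text!r}")
        converted.append((ts.astimezone(timezone.utc), text))
    return sorted(converted, key=lambda entry: entry[0])


def format_log(entries):
    """Render the log as one line per entry, with UTC stated explicitly."""
    return [f"{ts:%Y-%m-%d %H:%M} UTC | {text}" for ts, text in entries]


utc_lines = format_log(to_utc_log(raw_log))
```

Rejecting naive timestamps, rather than guessing a local zone, keeps ambiguous entries out of the review.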
Problems are prioritized by “low”, “medium”, “high” and “major/critical” using the same structure and matrix as in Incident Management.
For correct problem ticket prioritization, the following rules must apply:
- The incident priority is the input parameter from INM. The event risk must be evaluated within Problem Management. For example, for problems triggered by a major incident, make sure that the priority of the problem is “Major Problem” (i.e. Priority 1).
The risk of incident reoccurrence must be evaluated within PRM.
If the risk of incident reoccurrence is not known, use the risk level ‘normal’. If an estimate is possible, use ‘critical’ for problems with a high risk that an incident may occur for the same or other related CIs.
- For problem tickets opened proactively, based on incident management or event data analysis, we recommend selecting problem priority “Medium” or “Low”, unless there is a special reason to rate it higher.
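The prioritization rules above can be written down as a small sketch. The function names, the lowercase labels and the choice of “medium” as the default for proactive problems are assumptions for illustration. The text also does not say which risk level applies when the estimated risk is not high, so the sketch assumes ‘normal’ in that case.

```python
from typing import Optional

# Labels follow the text: low, medium, high, major (Priority 1).
PRIORITIES = ("low", "medium", "high", "major")


def problem_priority(
    incident_priority: Optional[str] = None,
    proactive: bool = False,
    special_reason_priority: Optional[str] = None,
) -> str:
    """Derive the problem priority.

    Reactive problems inherit the incident priority from INM, so a major
    incident yields a major problem. Proactive problems default to
    "medium" (assumption: "low" would be equally valid) unless a special
    reason justifies another rating.
    """
    if proactive:
        chosen = special_reason_priority or "medium"
    else:
        chosen = incident_priority
    if chosen not in PRIORITIES:
        raise ValueError(f"unknown priority: {chosen!r}")
    return chosen


def reoccurrence_risk(high_risk: Optional[bool]) -> str:
    """Map the reoccurrence estimate to a risk level.

    None means the risk is not known and maps to 'normal'. True maps to
    'critical'. False mapping to 'normal' is an assumption of this sketch.
    """
    return "critical" if high_risk else "normal"
```

For example, `problem_priority("major")` returns `"major"`, while `problem_priority(proactive=True)` returns `"medium"`.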
Identifying the Event Risk
The following instructions are given as a guideline to find the right event risk level. For this you need at least the following information:
- Related services (for one or more customers) which can be disrupted if the relevant CI crashes or is damaged.
- Security information (if available).
- Predicted work load or other information (e.g. external request) relevant to the CI or system.
- Current maintenance information.
Major Problems require a Root Cause Analysis (RCA) to be presented in a report to senior management of the service provider. We recommend doing this weekly until no RCA is outstanding. After the final RCA has been identified and accepted, a formal sign-off is conducted.
During the RCA investigation, the status is provided in a frequent report (2-5 times per week). It contains important, up-to-date information about ongoing RCAs, including:
- Root Causes found
- Root Causes still under investigation incl. status and issues
- Identified Risks
- Scheduled/planned De-Briefings / Sign off
- Detailed streams and status of Root Cause Analysis
Workflow Root Cause Analysis
Root Cause Analysis Checklist:
- Check Alarming Chain
- Check Incident Process
- Key Players during Incident
- Incident caused by Change (Deep dive with Change)
- Identify Root Cause
- Root Cause Classification
- Responsibility for Incident
- Fill Known Error Database
- Conclude final business impact and Final Downtime
- Define Solutions to avoid Reoccurrence
- Get RCA approval
- Sign off RCA
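One way to keep track of the checklist during an RCA is to hold it as an ordered list and report what is still open. The sketch below is illustrative only: the function names are hypothetical, and the rule that sign-off requires every other item to be complete is an assumption, not something the checklist itself states.

```python
RCA_CHECKLIST = [
    "Check Alarming Chain",
    "Check Incident Process",
    "Key Players during Incident",
    "Incident caused by Change (Deep dive with Change)",
    "Identify Root Cause",
    "Root Cause Classification",
    "Responsibility for Incident",
    "Fill Known Error Database",
    "Conclude final business impact and Final Downtime",
    "Define Solutions to avoid Reoccurrence",
    "Get RCA approval",
    "Sign off RCA",
]


def outstanding(done):
    """Return the checklist items not yet completed, in checklist order."""
    unknown = set(done) - set(RCA_CHECKLIST)
    if unknown:
        raise ValueError(f"not on the checklist: {sorted(unknown)}")
    return [item for item in RCA_CHECKLIST if item not in done]


def ready_for_sign_off(done):
    """Assumption: sign-off is possible once it is the only open item."""
    return outstanding(done) == ["Sign off RCA"]
```

The list returned by `outstanding` can feed the frequent RCA status report directly.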
The goal of Problem Management is to find the root cause(s) of an incident, and all contributing factors, to avoid the incident or similar incidents in the future. It is mandatory for the problem manager to categorize all root causes and all contributing factors. The main purposes & benefits of classification are:
- Building groups of similar root causes
- Providing a basis for analyzing the main issues in the department/sub-unit/unit/company
- Setting up overarching measures/initiatives to avoid similar incidents in the future
Human error issues, or lack of skill in a specific area → training measures
Issues with partners/third parties → set up an initiative with the partner
Recurring hardware issues → exchange a specific hardware component across the relevant installed base.
Process issues → adapt/change the process
The diagram below shows how problems can be split into eight main categories:
- Process
- Human Error
- Third Party
Note: this diagram is a simplification. It is often difficult to assign a problem to a single root cause category. It is more often a combination of categories that contribute to a problem. From a zero-outage perspective, it is important to understand the interdependencies to ensure that the root causes are addressed.
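Because a problem often falls into a combination of categories, a classification record should allow several categories per problem. The sketch below counts how often each category contributes and suggests the overarching measures named earlier. The problem records, the `min_count` threshold and the function names are hypothetical.

```python
from collections import Counter

# Category-to-measure mapping taken from the examples in the text.
MEASURES = {
    "Human Error": "training measures",
    "Third Party": "set up an initiative with the partner",
    "Hardware": "exchange the hardware component across the installed base",
    "Process": "adapt/change the process",
}

# Hypothetical classified problems; each may carry several categories.
problems = [
    {"id": "PRM-1", "categories": ["Human Error", "Process"]},
    {"id": "PRM-2", "categories": ["Hardware"]},
    {"id": "PRM-3", "categories": ["Human Error"]},
    {"id": "PRM-4", "categories": ["Third Party", "Process"]},
    {"id": "PRM-5", "categories": ["Human Error", "Third Party"]},
]


def category_counts(records):
    """Count how many problems each category contributed to."""
    counts = Counter()
    for rec in records:
        # set() guards against a category listed twice on one problem.
        counts.update(set(rec["categories"]))
    return counts


def suggested_measures(records, min_count=2):
    """Suggest a measure for every category seen at least min_count times."""
    counts = category_counts(records)
    return {
        category: MEASURES[category]
        for category, n in counts.most_common()
        if n >= min_count and category in MEASURES
    }


counts = category_counts(problems)
measures = suggested_measures(problems)
```

Counting each category once per problem keeps a single heavily tagged ticket from dominating the statistics.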
Issues within the configuration of the software such as:
- Incorrect parameters in a database.
- Incorrect protocol settings in a router.
- Misconfigured VLAN.
- Incorrect parameters in SAP.
Software Bug - Known Error
The problem occurred due to a bug within the software or firmware. The bug is already listed in the vendor's known error database (for example, the Oracle Technology Network or Microsoft TechNet), and a patch or workaround is available and described.
Software Bug – Unknown Error
The problem occurred due to a bug within the software or firmware. This bug is not known to the responsible vendor, so no recommended patch or workaround is currently available.
The issue occurred due to an incorrect software version, for example:
- The version doesn’t fit with the usage requirements.
- The version is too old and doesn’t support the hardware being used.
- The version is out of support / end of life.
Internal Programming Issue
There is a problem within the software itself:
- Failure within business logic.
- Incorrect types of variables.
There is a redundancy option available for the application, but it didn’t work, for example:
- Due to performance issues, as the backup system(s) couldn’t manage the load.
- The backup systems took too long to take over.
- Manual take over failed for a different reason.
- Backup failed due to a different patch level.
During the implementation of the hardware, or during a change to the hardware, some settings were made which immediately, or at a later point in time, led to a failure or service restriction on the hardware or elsewhere in the environment.
A defect in a component or equipment, which may lead to a failure on this hardware. For example, a defective motherboard or line card on a server, storage device, router, etc.
An issue has occurred as a result of an error in the design, or an error in the construction of an environment compared to the design.
This indicates that an overload situation has occurred as a result of a hardware failure or capacity growth failure. For example:
- Too little memory or space in the environment.
- Too few CPUs or too little bandwidth on the connections.
Following a failure of a component, failover to the backup device did not work for a number of reasons:
- Redundancy was not active due to a configuration error.
- A backup hardware component was also defective and was not previously detected.
- The failover had not been tested before.
Unauthorized access to a system (see hacking) or to a building, a location or data center.
Attacks like “URL interpretation”, “Cross Site Scripting”, “SQL Injection” or “Buffer Overflow Attacks” are categorized here. The aim of these activities is to gain unauthorized access to websites, databases and systems.
Any kind of stolen hardware.
The exploitation of flaws in architecture, implementation and configuration that make it possible to compromise system integrity against the intended will of the user.
In the context of the internet, identity is often reduced to identification and authorization data: mainly combinations of username and password, bank and credit card information, and email addresses. If someone gains unauthorized access to this data, the term “identity theft” is commonly used. Any kind of valuable company data (customer data, product data, statistics) can also be the target of theft, for example to sell it to competitors.
Social Engineering and Spear Phishing
With social engineering, the vulnerabilities of a person are exploited. Attackers mislead their victims into bypassing security mechanisms or unknowingly installing malware, in order to get hold of protected data and information. Unwanted emails are commonly called “SPAM”. These mails can be categorized into classical spam, malware spam and phishing messages. With the latter, the targeted user is lured onto malicious websites or tricked into installing malware.
Malicious programs, or malware, often perform unwanted or harmful functions on the infected computer.
Denial-of-Service (DoS) or Distributed Denial-of-Service (DDoS) attacks, which render a system unusable through repeated spurious connections, are mainly performed by attackers motivated by blackmail or hacktivism.
Natural Disaster/Deliberate Attack
Force majeure incidents (like earthquakes, flooding or terror attacks) affecting, for example, data center locations or other infrastructure like data networks.
Housekeeping activities like vacuum cleaning or dusting can result in accidental unplugging or disturbance of sensitive equipment.
Physical interfaces can cause errors due to loose connections or pinched cabling.
Flooding due to broken water pipes or leaking cooling systems can cause damage to electrical components.
Power Supply/UPS (Uninterruptible Power Supply)
- A defective/unreliable power supply delivering voltage spikes or overvoltage can cause hardware damage.
- Faulty or offline uninterruptible power supplies can also result in hardware damage.
Escalation Chain Issues
The escalation chain is the internal process describing how critical incidents are escalated internally.
Potential issues that may arise are:
- An incident ticket was opened too late.
- The responsible line management was not informed in a timely manner.
- The involvement of appropriate experts was late or incorrect.
- The manager on duty could not be reached when needed.
Error in Process/Procedure Execution
The process or procedure does exist and is appropriate, but it was not followed properly. For example:
- Knowledge of the existing process or procedure was absent or incomplete.
- The process was known, but not followed due to time constraints or other issues.
- The existing control mechanisms, e.g. double control principle, were not followed.
The required process or procedure was either partly or completely missing.
Error in Process Design/Implementation
The process or procedure does exist, but it is flawed or is outdated, and its execution did not have the desired outcome.
An individual did not follow the defined process/procedure. For example, there may be a clear, defined procedure to re-boot a device, but the responsible individual chose not to follow the documented process, in order to complete the procedure more quickly. In cases like this the violation is usually intentional, but the consequences probably were not.
Incorrect information is delivered by a contributor (C) to the responsible person (R). For example:
- There is an incorrect instruction in a change.
- There is an incorrect parameter setting by a vendor.
A responsible individual did not have the right knowledge, experience or skill set to successfully complete the required task. For example, the individual attempted to reboot a device with which they had no experience or knowledge, and consequently made a mistake.
An incorrect order was placed or the incorrect item was delivered. For example:
- An incorrect connection, without full redundancy, was ordered by the customer or service provider.
- While a fully redundant connection was correctly ordered, it was not correctly implemented by the line provider.
The responsible individuals were not able to complete all of their tasks properly. For example, one of two individuals on a shift is not available due to sickness, and there is no replacement on short notice, which means the remaining individual could be overloaded, and the four-eyes principle is not possible.
An unintentional mistake. There is no knowledge gap in this case. For example, the responsible person pushes the wrong button, or uses the wrong script by mistake.
A tool which supports the processes is incorrectly configured. For example, an incorrectly set parameter in a monitoring tool leads to performance issues, as the tool itself then consumes too many resources.
There is no tool to support the necessary processes. For example:
- The ticketing tool is not yet available for a new customer, which leads to a longer resolution time.
- A tool is missing and tasks have to be done manually, which leads to a mistake.
The tool did not function according to its specifications, so the process could not be supported by the tool. For example:
- The ticketing tool has an error and cannot be used, so the incident management process cannot be supported by the tool and the ticket has to be recorded on paper. This could lead to a longer resolution time.
Missing Tool Functionality
An important functionality of a tool is missing, which can lead to longer resolution time. For example:
- Due to missing functionality of a tool, actions had to be done manually which led to an error or delayed resolution.
Tool Not Fit for Purpose
The tool was not able to support the process in the required manner. For example:
- The performance of the tool was poor, which led to a longer resolution time.
The performance or contribution of a vendor did not meet requirements. For example:
- The skill level of their technician was insufficient.
- Investigations could not be finalized in a timely manner.
- The contracted service level was not met.
A quality issue in the product or service of the vendor was apparent. For example, a high percentage of a specific hardware component supplied by a single vendor becomes faulty after one year of use.
The contractual agreement with the vendor has weaknesses, or omissions. For example:
- Service levels are not agreed, or are inappropriate for the criticality of a specific service.
- The vendor deliverables are not contractually defined.
The contractual agreement with the customer has weaknesses, or omissions. For example:
- The agreed service levels are too aggressive or are dependent on factors which cannot be influenced.
- The expected customer contribution is not clearly defined in the contract, such as the customer’s reaction times or the number, skill level and availability of key resources.
The customer is contractually obliged to provide parts of the service or prerequisites (e.g. independent power circuits), but did not fulfil this obligation.
- After Problem Management has an agreed RCA, the identified measures are taken over into Solution/Measure Tracking.
- A weekly tracking has to be organized by Problem Management to ensure that measures are performed as expected in scope and time.
- The Problem Manager ensures that each measure owner reports the actual status and informs about potentially overdue measures.
- Responsibility cannot be outsourced!
- The Service Provider owns the decision to implement resolution measures.
- If it has formally been decided not to perform recommended resolution measures, this decision should be documented in the corresponding Known Errors.
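The weekly tracking described above can be supported by a simple overdue check. The following is a minimal sketch: the `Measure` fields, the sample data and the seven-day window for “potentially overdue” are assumptions, not part of any specific tracking tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Measure:
    """Hypothetical record for one measure agreed in the RCA."""
    measure_id: str
    owner: str
    due: date
    done: bool = False


def overdue(measures, today):
    """Open measures whose due date has already passed."""
    return [m for m in measures if not m.done and m.due < today]


def potentially_overdue(measures, today, days=7):
    """Open measures due within the next `days` days (assumed window)."""
    limit = today + timedelta(days=days)
    return [m for m in measures if not m.done and today <= m.due <= limit]


# Illustrative data for one weekly tracking run.
today = date(2024, 3, 11)
measures = [
    Measure("M-1", "network team", date(2024, 3, 8)),
    Measure("M-2", "database team", date(2024, 3, 14)),
    Measure("M-3", "service desk", date(2024, 3, 1), done=True),
    Measure("M-4", "vendor manager", date(2024, 4, 30)),
]

late = [m.measure_id for m in overdue(measures, today)]
at_risk = [m.measure_id for m in potentially_overdue(measures, today)]
```

The Problem Manager would follow up with the owners of the measures in `late` and `at_risk` during the weekly tracking.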