"The purpose of problem management is to manage the lifecycle of all problems from first identification through further investigation, documentation and eventual removal. Problem management seeks to minimize the adverse impact of incidents and problems on the business that are caused by underlying errors within the IT Infrastructure, and to proactively prevent recurrence of incidents related to these errors. In order to achieve this, problem management seeks to get to the root cause of incidents, document and communicate known errors and initiate actions to improve or correct the situation." - Quote from ITIL
Like ITIL, we distinguish between reactive and proactive Problem Management. We have observed that Events usually result in Incidents and from there may enter the reactive Problem Management process. However, if Events are reviewed for regular recurrence, they become an input to proactive Problem Management. Events moving directly into Problem Management appear to be fairly rare.
Recurring Events & Incidents can represent more than 50% of the total Incident volume. It is therefore important to identify similar Incidents that might share the same root cause, find that cause and remove it, so that the Incidents do not appear again.
Multiple Incidents can occur across several configuration items … (similar symptoms can be found, for example, by comparing the brief descriptions of the Incidents)
… or on a single configuration item (the group of Incidents may indicate a malfunction of that configuration item)
Recurring Event and Incident Analysis
1. Service providers should ensure they receive a report with event & incident data, as well as system health data, from vendors.
2. Group incident records (INMs) into homogeneous groups based on their description.
2a. Group events & incidents according to their configuration items (e.g. by using a pivot table).
3. Open and process a problem management (PRM) ticket for each identified group.
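The grouping steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the record fields (`id`, `ci`, `description`), the sample data, and the helper names `normalize` and `recurring_groups` are all assumptions made for the example, and the description matching is deliberately naive (exact match after lowercasing and removing punctuation).

```python
import re
from collections import defaultdict

# Hypothetical incident records; IDs, CI names, and field names are illustrative.
incidents = [
    {"id": "INM-1", "ci": "db-01", "description": "Database connection timeout"},
    {"id": "INM-2", "ci": "db-02", "description": "database connection timeout!"},
    {"id": "INM-3", "ci": "web-01", "description": "Disk full on /var"},
    {"id": "INM-4", "ci": "web-01", "description": "High CPU load"},
    {"id": "INM-5", "ci": "web-01", "description": "Disk full on /var"},
]


def normalize(description):
    """Lowercase the text and drop punctuation so trivially different wordings match."""
    return " ".join(re.findall(r"[a-z0-9]+", description.lower()))


def recurring_groups(records, key, min_size=2):
    """Group incident IDs by key(record); keep groups with at least min_size members."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record["id"])
    return {k: ids for k, ids in groups.items() if len(ids) >= min_size}


# Similar symptoms across several configuration items.
by_description = recurring_groups(incidents, lambda r: normalize(r["description"]))
# Several incidents on a single configuration item.
by_ci = recurring_groups(incidents, lambda r: r["ci"])
```

Each key in `by_description` or `by_ci` is a candidate for one PRM ticket. In practice a fuzzier similarity measure than exact matching would probably be needed for free-text descriptions.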
Handover to Problem Management shall happen after the incident is solved or the service is stabilized with a workaround. Problem Management is mandatory for all critical incidents (MI) and high incidents. We recommend performing a “warm” handover into Problem Management at least for major incidents. Incident Management owns the handover responsibility. The handover must include a time log (event triggers) outlining the incident, with the time zone stated for each time entry.
A warm handover takes place during an incident review call, which is necessary after a critical or high incident upon special request.
The review normally takes place after the last call, together with all technical and management key players of the related incident, and is hosted by the MoD or LIM who managed the Incident. The basis for the MI review is the “Final Incident Report”, which needs to be shared with all participants.
- The target is to review the complete incident history, focusing on:
- What happened, and when did it happen?
- Are the event triggers correct and complete?
- What led to the solution of the incident, and at which exact time?
- Which points does Problem Management have to focus on?
- Did we identify weak points during the major incident process?
- Which people are necessary for Problem Management?
- All topics will be recorded in the “Incident Report”.
- The Major Incident review has to be done during office hours, Monday to Friday. Define which time zone is to be displayed and used for all times.
The recommendation is to use UTC.
|Time|Event trigger|
|08:59|Incident recorded at service desk|
|10:45|Major incident procedure initiated|
|11:45|First technical call|
|12:00|Layer Check initiated to identify the issue|
|14:57|First Management call|
… etc …
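One way to keep such a time log unambiguous is to store every entry with an explicit time zone and render it in UTC. The sketch below uses only Python's standard `datetime` module; the function name `time_log_entry`, the date, and the UTC+1 offset are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def time_log_entry(local_time, event):
    """Format one time log row in UTC; refuse timestamps without a time zone."""
    if local_time.tzinfo is None or local_time.utcoffset() is None:
        raise ValueError("time log entries must carry an explicit time zone")
    utc_time = local_time.astimezone(timezone.utc)
    return f"|{utc_time.strftime('%Y-%m-%d %H:%M')} UTC|{event}|"


# Hypothetical example: an entry recorded at 09:59 local time in a UTC+1 zone.
utc_plus_one = timezone(timedelta(hours=1))
recorded = datetime(2024, 1, 15, 9, 59, tzinfo=utc_plus_one)
row = time_log_entry(recorded, "Incident recorded at service desk")
```

Rejecting naive timestamps at entry time avoids having to guess the time zone later, during the incident review.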
Problems are prioritized as "low", "medium", "high" and "major/critical", using the same structure and matrix as in Incident Management.
For correct problem ticket prioritization, the following rules must apply:
- The incident priority is the input parameter from INM. The event risk must be evaluated within Problem Management. For example, for problems triggered by a major incident, make sure that the priority of the problem is "Major Problem" (i.e. Priority 1).
The risk of incident reoccurrence must be evaluated within PRM.
If the risk of incident reoccurrence is not known, use the risk level ‘normal’. If an estimate is possible, use ‘critical’ for problems with a high risk that an incident may occur for the same or other related CIs.
- In cases where the problem ticket is opened as a proactive problem, based on incident management or event data analysis, it is recommended that problem priority "Medium" or "Low" is selected, unless there is a special reason to rate it higher.
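The prioritization rules above can be written down as a small decision function. This is a sketch under stated assumptions: the function and parameter names are hypothetical, a reactive problem is assumed to inherit the incident priority unchanged (the text only says the same matrix is used), proactive problems default to "medium" although "low" is equally allowed, and any estimated risk that is not high is assumed to map to 'normal'.

```python
PRIORITIES = ("low", "medium", "high", "major/critical")


def problem_priority(incident_priority=None, proactive=False, special_reason=None):
    """Pick the problem priority from the incident priority or the proactive default."""
    if proactive:
        # Recommended "medium" or "low" unless there is a special reason to rate higher.
        chosen = special_reason or "medium"
    else:
        # Assumption: a reactive problem inherits the priority of its incident.
        chosen = incident_priority
    if chosen not in PRIORITIES:
        raise ValueError(f"unknown priority: {chosen!r}")
    return chosen


def reoccurrence_risk(estimate=None):
    """Return 'normal' when the risk is unknown, 'critical' when estimated as high."""
    if estimate is None:
        return "normal"
    # Assumption: estimates other than "high" are treated as 'normal'.
    return "critical" if estimate == "high" else "normal"
```

For example, `problem_priority("major/critical")` yields a major problem (Priority 1), while `problem_priority(proactive=True)` yields "medium".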
Identifying the Event Risk
The following instructions are given as a guideline to find the right event risk level. For this you need at least the following information:
- Related services that can be disrupted if the relevant CI crashes or is damaged (for one or more customers).
- Security information (if available).
- Predicted workload or other information (e.g. external requests) relevant to the CI or system.
- Current maintenance information.
Major Problems require a Root Cause Analysis (RCA), presented in a report to the senior management of the service provider. We recommend doing this weekly until no RCA is outstanding. After the final RCA has been identified and accepted, a formal sign-off is conducted.
During the RCA investigation, the status is provided in a frequent report (2-5 times per week). It contains important, current information about ongoing RCAs, including:
- Root Causes found
- Root Causes still under investigation incl. status and issues
- Identified Risks
- Scheduled/planned debriefings and sign-offs
- Detailed streams and status of Root Cause Analysis
Workflow Root Cause Analysis
Root Cause Analysis Checklist:
|Check Alarming Chain:|
|Check Incident Process:|
|Key Players during Incident:|
|Incident caused by Change (Deep dive with Change):|
|Identify Root Cause:|
|Root Cause Classification:|
|Responsibility for Incident:|
|Fill Known Error Database:|
|Conclude final business impact and Final Downtime:|
|Define Solutions to avoid Reoccurrence:|
|Get RCA approval:|
|Sign off RCA:|
The goal of Problem Management is to find the root cause(s) of an incident, and all contributing factors, to avoid the incident or similar incidents in the future. It is mandatory for the problem manager to categorize all root causes and all contributing factors. The main purposes & benefits of classification are:
- Building groups of similar root causes
- Providing a basis for analyzing the main issues in the department/sub-unit/unit/company
- Setting up overarching measures/initiatives to avoid similar incidents in the future, for example:
- Human error issues, or lack of skill in a specific area → training measures
- Issues with partners/third parties → set up an initiative with the partner
- Recurring hardware issues → exchange a specific hardware component across the relevant installed base
- Process issues → adapt/change the process
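Once root causes and contributing factors are categorized, counting them across problem tickets shows where an overarching measure would pay off most. The following is a minimal sketch: the problem records, the lowercase category labels, and the function names are assumptions for illustration, and the measure texts are taken from the examples above.

```python
from collections import Counter

# Category -> overarching measure, following the examples in the text.
MEASURES = {
    "human error": "training measures",
    "third party": "set up an initiative with the partner",
    "hardware": "exchange the component across the relevant installed base",
    "process": "adapt/change the process",
}

# Hypothetical problem tickets; a problem may have several contributing categories.
problems = [
    {"id": "PRM-1", "categories": ["human error", "process"]},
    {"id": "PRM-2", "categories": ["hardware"]},
    {"id": "PRM-3", "categories": ["process"]},
]


def category_counts(records):
    """Count how often each root cause category appears across all problems."""
    return Counter(cat for record in records for cat in record["categories"])


def top_measures(records, n=1):
    """Return (category, count, suggested measure) for the n most frequent categories."""
    counts = category_counts(records)
    return [
        (cat, count, MEASURES.get(cat, "no measure defined"))
        for cat, count in counts.most_common(n)
    ]


main_issue = top_measures(problems)
```

Allowing a list of categories per problem reflects that a problem is often caused by a combination of factors rather than a single category.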
The diagram below shows how problems can be split into eight main categories.
- Process
- Human Error
- Third Party
Note: this diagram is a simplification. It is often difficult to assign a problem to a single root cause category. More often, a combination of categories contributes to a problem. From a zero-outage perspective, it is important to understand the interdependencies to ensure that the root causes are addressed.
Diagram 1: Root Cause Categories
- After Problem Management has an agreed RCA, the identified measures are taken over into Solution/Measure Tracking.
- Weekly tracking must be organized by Problem Management to ensure that measures are carried out as expected in scope and time.
- The Problem Manager ensures that each measure owner reports the current status and flags potentially overdue measures.
- Responsibility cannot be outsourced!
- The Service Provider owns the decision to implement resolution measures.
- If it has been formally decided not to perform recommended resolution measures, this decision should be documented in the corresponding Known Errors.
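The weekly tracking described above could be supported by a simple check like the following. It is only a sketch: the measure records, the status values (`open`, `done`, `rejected`), the `known_error` field, and the dates are hypothetical and not part of any specific tool.

```python
from datetime import date

# Hypothetical measure records from Solution/Measure Tracking.
measures = [
    {"id": "M-1", "owner": "A", "due": date(2024, 3, 1), "status": "done", "known_error": None},
    {"id": "M-2", "owner": "B", "due": date(2024, 3, 8), "status": "open", "known_error": None},
    {"id": "M-3", "owner": "C", "due": date(2024, 4, 1), "status": "open", "known_error": None},
    {"id": "M-4", "owner": "D", "due": date(2024, 3, 1), "status": "rejected", "known_error": None},
    {"id": "M-5", "owner": "E", "due": date(2024, 3, 1), "status": "rejected", "known_error": "KE-7"},
]


def overdue(records, today):
    """IDs of open measures whose due date has already passed."""
    return [m["id"] for m in records if m["status"] == "open" and m["due"] < today]


def undocumented_rejections(records):
    """IDs of rejected measures that lack a Known Error reference for the decision."""
    return [m["id"] for m in records if m["status"] == "rejected" and not m["known_error"]]


review_date = date(2024, 3, 15)
overdue_ids = overdue(measures, review_date)
missing_docs = undocumented_rejections(measures)
```

Running both checks in the weekly review gives the Problem Manager a short list of measures to follow up with their owners.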