Are you frustrated about recurring Incidents? How much time do you spend fixing the same things over and over again? These are the Incidents users regularly contact the Service Desk about. If it bothers you, what do you think your users think about your IT organization? What is an inconvenience to you means lost productivity to users, and lost productivity equals lost revenue.
Incident vs. Problem
What does Incident mean in best practice? ITIL defines an Incident as "an unplanned interruption to a service or reduction in the quality of a service."1
In contrast, the ITIL Problem definition is "a Cause, or potential Cause, of one or more Incidents."2
As you can see from these definitions, Incidents are not Problems, and they do not become Problems:
- Incidents are work disruptions.
- Problems are about the investigation of Incidents to find the Root Cause as to why they happen, and to eliminate that Cause so that the Incidents never happen again.
- Root Causes (identified by Known Errors; see below) can include any of the following:
  - Hardware failure
  - Coding error
  - Implementation issues
  - Incorrect documentation
  - Ignoring preventative maintenance
  - Software failure
  - Vendor/supplier issues
  - Infrastructure failures
  - Design errors (Do you know if you are designing services to fail?)
Incident Management vs. Problem Management
Because Incidents and Problems are different things, the Incident Management and Problem Management process flows differ as well:
- Problem Management eliminates the Cause of lost productivity. It brings significant value to the business because it keeps users productive through the elimination of Incidents. It is an ITIL practice: "The purpose of Problem Management practice is to reduce the likelihood and impact of Incidents by identifying actual and potential Causes of Incidents and managing workarounds and Known Errors."3
- Incident Management enables Problem Management. Some small user productivity gains are Incident Management's value, but Incident Management's main contribution is as a Problem Management enabler.
Matching Incidents to Problems is essential for Problem Management
Let us begin. As the saying goes, "A picture is worth a thousand words."
Between Incident Management and Problem Management, there are three databases.
- Incident Database (DB). Storage for all Incident records.
- Problem Database (DB). One Problem may have one or many Incidents linked to it. The arrows on the above diagram represent digital links.
- Known Error Database (DB). When the Problem Management practice begins to investigate the Root Cause of a Problem, it documents the investigation in the Known Error DB. One Known Error may have one or many Problems linked to it.
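The three stores and their links can be pictured as a small data model. The following is only a sketch under stated assumptions: the class names, field names, and record IDs are hypothetical and do not come from any particular ITSM product. The only things taken from the text above are the one-to-many links (one Problem to many Incidents, one Known Error to many Problems) and the Service recorded on every record.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    id: str
    service: str                      # e.g. "MyAPP: PURCHASING"
    description: str
    status: str = "Open"
    problem_id: Optional[str] = None  # an Incident links to at most one Problem


@dataclass
class Problem:
    id: str
    service: str
    description: str
    status: str = "Open"
    incident_ids: List[str] = field(default_factory=list)  # one or many Incidents
    known_error_id: Optional[str] = None


@dataclass
class KnownError:
    id: str
    service: str
    description: str
    status: str = "Open"
    problem_ids: List[str] = field(default_factory=list)   # one or many Problems


def link_incident(incident: Incident, problem: Problem) -> None:
    """Create the two-way link between an Incident and a Problem."""
    incident.problem_id = problem.id
    if incident.id not in problem.incident_ids:
        problem.incident_ids.append(incident.id)


def link_problem(problem: Problem, known_error: KnownError) -> None:
    """Create the two-way link between a Problem and a Known Error."""
    problem.known_error_id = known_error.id
    if problem.id not in known_error.problem_ids:
        known_error.problem_ids.append(problem.id)


# Hypothetical example: two Incidents on one Problem, under one Known Error.
prb = Problem("PRB-1", "MyAPP: PURCHASING", "Display extension error")
for n in (1, 2):
    link_incident(Incident(f"INC-{n}", prb.service, prb.description), prb)
ke = KnownError("KE-1", prb.service, "Extension calculation defect")
link_problem(prb, ke)
print(prb.incident_ids, ke.problem_ids)
```

Storing the link on both sides is a design choice made here so that either record can be used as the starting point for a lookup; a real tool would more likely keep the links in a relational table.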
Problem Management supports Incident Management in the present and reduces future Incidents
With the basics out of the way, what is the Incident-to-Problem Management flow? The following is the real story behind the above picture, using an example:
The story begins, Part One:
A purchasing department user sees something strange on their monitor while completing a purchase order. The extension is wrong (e.g., 10 units x $1.99/unit = $20.00). The user calls the Service Desk to explain what they see. The Service Desk agent has never seen the MyAPP purchasing application before, which is perfectly fine, as they follow the Incident procedure for all Incidents:
- First, the agent creates a new Incident by designating the Service. This is crucial because best practice is emphatic that every data record shows the Service impacted. In this case, the agent chooses from a menu "MyAPP: PURCHASING".
- Second, the agent enters a short description (e.g., what the user said) "Display extension error." This does not have to be perfect.
- The agent now wants to know if anyone has ever encountered this before, so they click a button on the Incident to "Check for similar", which does the following:
  - Check the Known Error DB. Find all "Open" Known Errors where the Service is "MyAPP: PURCHASING" and the brief description has anything to do with "Display", "Extension", or "Error".
  - If nothing is found in the Known Error DB, check the Problem DB and perform the same search.
- The first time, there will be no match. The Service Desk software then allows the agent to create a Problem, and it automatically loads Incident data into the Problem and establishes a digital link.
- The agent escalates the Incident in the normal way to the level 2 or 3 MyAPP Application Team.
Note that at this point, the Incident Management processes have not changed except to check to see if a Problem record exists.
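The "Check for similar" button described above could work roughly as follows. This is a minimal sketch, not how any specific Service Desk product implements it: the `Record` class, the function names, and the naive keyword-overlap matching are all assumptions made for illustration. What it takes from the story is the search order (Known Error DB first, then Problem DB) and the filter on "Open" status and matching Service.

```python
import re
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Record:
    """A Known Error or Problem record (hypothetical minimal shape)."""
    id: str
    service: str
    description: str
    status: str = "Open"


def keywords(text: str) -> Set[str]:
    """Lower-cased words of a short description."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def find_matches(records: List[Record], service: str, description: str) -> List[Record]:
    """Open records for the same Service that share at least one keyword."""
    wanted = keywords(description)
    return [
        r for r in records
        if r.status == "Open" and r.service == service and wanted & keywords(r.description)
    ]


def check_for_similar(service: str, description: str,
                      known_errors: List[Record],
                      problems: List[Record]) -> Tuple[str, List[Record]]:
    """Search the Known Error DB first, then fall back to the Problem DB."""
    hits = find_matches(known_errors, service, description)
    if hits:
        return "known_error", hits
    hits = find_matches(problems, service, description)
    if hits:
        return "problem", hits
    return "none", []


SERVICE = "MyAPP: PURCHASING"
known_errors: List[Record] = []
problems: List[Record] = []

# First call: nothing exists yet, so the agent creates a Problem.
first = check_for_similar(SERVICE, "Display extension error", known_errors, problems)
problems.append(Record("PRB-1", SERVICE, "Display extension error"))

# A later call with different wording still finds the Problem via "extension".
second = check_for_similar(SERVICE, "Extension looks wrong on screen", known_errors, problems)
print(first[0], second[0])
```

Matching on any shared word is deliberately loose, in keeping with the point that the short description does not have to be perfect. A common word such as "error" will produce many hits, so a production tool would need some form of ranking.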
In the MyAPP Applications Team, a team member, who is currently concentrating on a difficult programming feature, stops and puts on their Incident cap and starts to research the Incident because Incident Management is all about getting users back up and running as quickly as possible:
- They cannot quickly find a way to fix it, but they do find that the data in the MyAPP database is correct (i.e., no extension error is found there). They document this in the Problem record. Since the Problem and the Incident are linked, the note is visible in both records.
- They call the user and explain what they have found so far. They also tell the user it is okay to get back to work, as the data in the MyAPP database is good. This is what ITIL calls a Workaround: "A solution that reduces or eliminates the impact of an Incident or Problem for which a full resolution is not yet available."4 Workarounds are a fantastic tool to increase productivity! They also notify the user that they are going to close the Incident but that the Problem will remain open until they find a resolution (i.e., a Root Cause/Known Error and a fix to that).
- Total time from opening the Incident until it is closed is, in this example, 30 minutes.
Problem Management's sensational productivity gift to all of IT
The story continues, Part Two:
A different MyAPP Purchasing user contacts the Service Desk and gets a different agent.
- The agent logs the call and runs the same query the first agent did. This time the query finds the Problem record.
- They read what the MyAPP applications specialist wrote, repeat it to the user as if they had written it themselves, and close the Incident without any escalation or additional cost. This time, the transaction took only 3 minutes. Most significant, Level 1 (the Service Desk) resolved the Incident without escalation, saving Level 2 and 3 productivity and thus money:
  - A 90% reduction in the time the user loses to the Incident (3 minutes instead of 30)
  - Increased customer satisfaction
  - Decreased IT costs
Adding a little variety to this story: When another MyAPP Purchasing user contacts the Service Desk about an end-of-the-month report that is showing extension errors when printing, the Service Desk agent creates a new Problem because printing is different from display. In this case, there is no workaround. The Incident stays open.
So many ITIL Practices are there to help other Practices achieve what they could not do on their own
The MyAPP Applications Team uses the ITIL Monitoring and Event Management Practice to automate their processes. Monitoring detects that there are three low priority Incidents linked to one low priority Problem. It automatically increases the priority and notifies the MyAPP Applications Team that it is time to work on this Problem. The MyAPP Applications expert now puts on their Problem Management cap and opens a Known Error record to document the investigation while searching for a Root Cause. With their advanced Problem Management training, they may apply one or several Root Cause analysis techniques:
- Ishikawa Diagrams
- Pareto Analysis
- Pain Value Analysis
- Chronological Analysis
- Technical Observation Post
- Affinity Mapping
- Fault Tree Analysis
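The automatic escalation that triggered this investigation (three low priority Incidents linked to one low priority Problem) can be expressed as a simple rule. The sketch below is an assumption-laden illustration: the priority names, the threshold constant, the `Problem` fields, and the `notify` callback are hypothetical, and for brevity the rule counts linked Incidents without checking each Incident's own priority.

```python
from dataclasses import dataclass, field
from typing import Callable, List

PRIORITIES = ["low", "medium", "high"]   # assumed ordering, lowest first
ESCALATION_THRESHOLD = 3                 # linked Incidents that trigger escalation


@dataclass
class Problem:
    id: str
    service: str
    priority: str = "low"
    incident_ids: List[str] = field(default_factory=list)


def escalate_if_needed(problem: Problem,
                       notify: Callable[[str], None],
                       threshold: int = ESCALATION_THRESHOLD) -> bool:
    """Raise a low priority Problem one level once enough Incidents are linked."""
    if problem.priority != "low" or len(problem.incident_ids) < threshold:
        return False
    problem.priority = PRIORITIES[PRIORITIES.index("low") + 1]
    notify(f"{problem.id} ({problem.service}) raised to {problem.priority}: "
           f"{len(problem.incident_ids)} linked Incidents")
    return True


# Hypothetical run: the third linked Incident tips the Problem over the threshold.
messages: List[str] = []
prb = Problem("PRB-1", "MyAPP: PURCHASING", incident_ids=["INC-1", "INC-2"])
before = escalate_if_needed(prb, messages.append)
prb.incident_ids.append("INC-4")
after = escalate_if_needed(prb, messages.append)
print(before, after, prb.priority)
```

Passing the notification in as a callback keeps the rule independent of how the team is actually told (email, chat, a queue in the ITSM tool).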
After researching the issue, they notice another Problem that describes an extension error when printing a report (from Part Two above). This is an important coincidence. Using the configuration management database (CMDB) from the ITIL Configuration Management Practice, they find that a common program calculates extensions and that line #725 is the Root Cause of these extension errors, both for display and printing.
However, even though the Root Cause has been discovered, this is not the last step for Problem Management. The next step is to notify Change Management of a defect in a live application. When the MyAPP Applications Team finds the Root Cause, they open a Request for Change (RFC) and link the Known Error to it. When they get permission to change the application, they follow change procedures. It is very likely that the MyAPP Applications programmer who found the Root Cause has all the skills to fix it. But now, they would put on their Change Management cap and follow Change Management procedures. You can see in the diagram that this one person plays four roles (Incident, Problem, Known Error, and Change), and the ITSM software linked the four data records together by Service + symptom detail. Because of the links, when Change Management completes the change, the software closes all open Incidents, Problems, and Known Errors associated with the change, and can even send a courtesy notification to every user associated with the Incidents that the MyAPP Applications Team has corrected the error.
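The closing cascade described above can be sketched in a few lines. This is an illustration only: the `Record` and `Change` classes, the record IDs, and the example addresses are hypothetical, and the linked records are flattened into one list on the Change for brevity rather than walked through the Known Error and Problem links.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Record:
    id: str
    kind: str             # "incident", "problem" or "known_error"
    status: str = "Open"
    user: str = ""        # contact for Incidents; empty for other kinds


@dataclass
class Change:
    id: str
    linked: List[Record] = field(default_factory=list)
    status: str = "Open"


def complete_change(change: Change) -> List[str]:
    """Close the Change and every linked record; return the users to notify."""
    change.status = "Closed"
    users: List[str] = []
    for record in change.linked:
        record.status = "Closed"
        # Notify each Incident's user once, including users whose Incident
        # was already closed earlier with a workaround.
        if record.kind == "incident" and record.user and record.user not in users:
            users.append(record.user)
    return users


rfc = Change("RFC-1", [
    Record("INC-1", "incident", status="Closed", user="user1@example.com"),
    Record("INC-3", "incident", user="user3@example.com"),
    Record("PRB-1", "problem"),
    Record("PRB-2", "problem"),
    Record("KE-1", "known_error"),
])
to_notify = complete_change(rfc)
print(to_notify)
```

The function returns the list of users instead of sending messages itself, so the courtesy notification stays optional, as in the story.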
End of story.
- There is astonishing power in Incident matching in ITIL Problem Management. Without Incident matching, Problem Management's job is impossible. The more Incidents linked, the greater the likelihood of finding the correct Root Cause (i.e., whatever the individual Incidents have in common).
- The key linking field is the Service. The Service drives Incidents, Problems, Known Errors, Changes, Service Requests, Configuration Management, Service Level Management, and more. That is why all of this is called IT Service Management (ITSM). Every database uses Services to track and link everything.
- Accurate Incident documentation is critical, which suggests that Incident Management should institute Incident audits and training until every Incident record is created correctly every time.
- Problem Management techniques are very different from Incident techniques. Everyone involved in investigating Root Causes should have Problem Management training such as:
  - HDI PM Professional certification
  - Kepner-Tregoe's Problem Management
  - Lean Six Sigma Root Cause Analysis Certification
  - ITIL Problem Management certification
- Problem Management requires ITSM applications with Problem Management enabled and Incident/Problem/Known Error linking.
1. Axelos® Global Best Practice, ITIL Foundation, ITIL 4 Edition, p. 121
2. Ibid., p. 130
3. Ibid., p. 130
4. Ibid., p. 132