What is Problem management in ITSM?

When there is an ‘unknown underlying cause’ for ‘one or more incidents’, the Problem management process is used to find the ‘underlying cause’, record it as a ‘Known Error’ and ‘resolve it permanently’.

Problem management can be reactive (when it is used to analyze and permanently fix the cause of an incident that has just happened) or proactive, where areas requiring root cause analysis are identified in advance through trend analysis, early identification of potential issues, etc.

All of this is fine, but it has a very basic ‘problem’: the whole set of explanations gives the impression that it is all about correcting a ‘problem’ and identifying and fixing an ‘error’.

Why should this process address only ‘Problems’ and ‘Errors’? Wouldn’t the same process be useful for ‘cause analysis’ and ‘management’ of any ‘unknown’ element in Service management?

In a broader sense, we are discussing a process that aims at:

  • Identifying an area that requires detailed cause analysis
  • Deriving and documenting the underlying cause
  • Addressing the cause fully (as required) and taking further action on it

Take the following scenarios as examples:

1. After a particular patch/bug-fix is implemented on a Server, the performance of the Server has improved significantly – It is an unintended outcome from the patch update. The concerned technical team is not sure ‘why’ this has happened.

It might be a good idea to do a ‘cause analysis’ in this case to identify the root cause of this positive change in Server performance and address questions like:

  • Is this positive performance change really due to the patch update?
  • If so, which element of the update has caused the improvement?
  • Can this element be used on other (similar) servers to improve the performance?
  • Can this element be included as part of the design document for such servers?

2. At specific times during the week, network performance is notably better than the baseline/standard performance. There is no obvious or known change in workload or similar characteristics that correlates with these ‘good’ periods.

As in the previous case, the situation might call for a ‘cause analysis’ to assess what causes the improvement in performance during those periods and to address questions like:

  • Is there any drop in workload or usage at those particular times that is unnoticed by or unknown to IT?
  • Are there any external factors causing the change in performance?
  • Can the cause of the change in performance be replicated/simulated at other times?

If you look at the scenarios above, the high-level steps required are exactly the same as the broader ones we came up with for Problem management above.
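
To make this concrete, here is a minimal Python sketch of the first of those steps (identifying an area that requires cause analysis), applied to deviations in both directions. Everything in it is an assumption made for illustration: the function name `flag_for_cause_analysis`, the z-score threshold of 2.0, and the convention that a higher metric value (say, throughput) means better performance. It is not taken from ITIL or any other framework.

```python
from statistics import mean, stdev


def flag_for_cause_analysis(samples, baseline, threshold=2.0):
    """Flag samples that deviate strongly from the baseline, in either direction.

    Assumption: higher values mean better performance (e.g. throughput).
    Returns a list of (index, value, direction) tuples.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline gives no scale to measure deviation against.
        return []

    flagged = []
    for i, value in enumerate(samples):
        z = (value - mu) / sigma
        if abs(z) >= threshold:
            # Positive and negative deviations are treated the same way:
            # both are candidates for cause analysis.
            direction = "better" if z > 0 else "worse"
            flagged.append((i, value, direction))
    return flagged


# Hypothetical throughput readings, for illustration only.
baseline = [100, 102, 98, 101, 99]
print(flag_for_cause_analysis([100, 110, 90, 101], baseline))
# prints [(1, 110, 'better'), (2, 90, 'worse')]
```

The design choice that matters here is the use of `abs(z)`: a ‘better’ reading is queued for cause analysis in exactly the same way as a ‘worse’ one, and only the label differs.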

The ITIL® Service Operation publication states that there is a close relationship between Proactive Problem management and Continual Service Improvement (CSI). So if Proactive Problem management is a major input to continual improvement, then the above scenarios are ideal examples of it. However, the scenarios above (specifically scenario 1) are not exactly proactive cause analysis: something has already happened in Operation, and we are reacting to it!

Yes, there are key differences in these scenarios: you are not starting with a ‘Problem’, and the cause identified is not an ‘Error’! Since it is not an error, you will not be ‘resolving’ it; rather, you will be acting on it for proactive improvement.

CMMI calls a comparable process ‘Causal Analysis and Resolution’. But the purpose statement restricts the scope here as well: ‘to identify causes of defects and problems and take action to prevent them from occurring in the future’.

COBIT also covers Problem management and handles it similarly to other ITSM frameworks.

The Wikipedia article on root cause analysis also links it to problems, or harmful outcomes.

A counter-argument I usually hear on this point is that analyzing positive events is simply improvement. But why would we use two different processes for an identical set of objectives and steps?

The points I am trying to drive home here are:

  • The ‘Problem management process’ is a well-defined process that can deliver the holistic set of objectives expected of it. However, its name and the terminology used to describe it tend to scope it narrowly, so that it addresses only issues (or the underlying errors/causes of incidents).
  • The terms ‘Problem’ and ‘Error’ are misleading: they make the process appear applicable only to incidents/errors/issues.
  • The need is for a structured process for “Cause analysis and management” in any area, negative or positive. Such a view would make the process a much more powerful one with a distinct identity, freeing it from the age-old clutches of Incident management 🙂

To summarize,

There is a need for a process that performs effective and efficient cause analysis, identifies and records the underlying cause, and acts on the identified cause as necessary. BUT that process should not be limited to ‘Problems’, ‘Errors’ or ‘defects’; it can be used for positive events as well, if the cause of those positive events is not known or clear. It could even lead to innovations!

However, the naming and terms used for the process in global frameworks and standards are often misleading on this front.

Any thoughts on this are welcome!