A frequent discussion in most of our training programs and consulting engagements is how to handle “Potential Incidents”. On my previous post about Event vs Incidents, JJ commented with a similar query:
What about those events which are likely to become an incident?
For instance, Internet link utilization increased from 60% to 70%. Say my thresholds are 75% (Warning) and 80% (Critical), and the utilization increases to 70% on Tuesdays and Wednesdays. Shouldn't an incident be created? After all, an Incident also includes degradation of service.
Since this is a common discussion, I thought I would address it in a new post.
First of all, let us divide the context into two parts:
- How will this be treated within the event management process?
- What response from event management will be appropriate (in other words, which process or action will the event management process trigger to handle this event)?
Since the utilization is only approaching (not equalling or exceeding) the warning and critical thresholds, it is still a “Warning” event (unusual operation) for the event management process, and not an exception. As it has reached neither the warning threshold nor the critical level of utilization, we can safely assume that there is currently no ‘degradation of service’. Hence it is not yet an Incident. Yes, it can become an Incident later if it is not handled.
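To make this classification concrete, here is a minimal sketch in Python. It is only an illustration of the reasoning above, not how any particular monitoring tool works. The function name, the event type names and the 60% "normal" baseline are my assumptions (the baseline is borrowed from JJ's example); the 80% critical limit also comes from his example.

```python
from enum import Enum


class EventType(Enum):
    INFORMATIONAL = "informational"  # normal operation
    WARNING = "warning"              # unusual operation, no degradation yet
    EXCEPTION = "exception"          # critical limit reached or crossed


def classify_link_event(utilization_pct, baseline_pct=60.0, critical_pct=80.0):
    """Classify a link-utilization reading (all values in percent).

    baseline_pct is an assumed 'normal' level taken from JJ's example;
    critical_pct is the Critical threshold from the same example.
    """
    if utilization_pct >= critical_pct:
        # Only here do we treat the event as an exception (an Incident candidate).
        return EventType.EXCEPTION
    if utilization_pct > baseline_pct:
        # Above normal but below critical: unusual operation, a Warning.
        return EventType.WARNING
    return EventType.INFORMATIONAL


print(classify_link_event(70.0).value)
```

With JJ's numbers, a reading of 70% is classified as a warning, and it only becomes an exception once it reaches 80%.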
This is my take on such scenarios: this event is actually detected NOT as part of ‘Incident detection’, but as part of ‘Capacity Monitoring’ (one of the iterative activities of capacity management as per ITIL). The thresholds and guidelines for these events should be established by the Capacity management process. Capacity management can give an instruction to create an incident ticket if the utilization reaches or crosses a critical limit at any point. However, the first objective of capacity management is to identify capacity issues before they start impacting the business. That is why the thresholds are set adequately below the critical limits: so that proactive action can be taken before the service or the business is affected.
Hence, if this is an anticipated situation as per the design of the service, the event should ideally trigger Capacity management. That process should initiate tuning or other corrective actions to bring the utilization back within acceptable limits.
However, if this situation is not anticipated as per the design of the service (in other words, if the cause of the utilization fluctuation is “unknown”), a problem ticket could also be created. This triggers the problem management process to identify the root cause and fix the issue permanently, thus assisting the capacity management process and preventing further such events and Incidents.
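The routing described in the last few paragraphs can be sketched as a small decision function. Again, this is only my illustration under stated assumptions: the function name, its parameters and the process labels are hypothetical and do not come from ITIL or from any tool.

```python
def processes_to_trigger(utilization_pct, critical_pct, anticipated_by_design):
    """Return the processes that event management should trigger.

    All names here are illustrative. anticipated_by_design is True when
    the utilization pattern is expected as per the design of the service.
    """
    if utilization_pct >= critical_pct:
        # Critical limit reached or crossed: raise an incident ticket.
        return ["incident_management"]
    if anticipated_by_design:
        # Expected behaviour: capacity management tunes or corrects.
        return ["capacity_management"]
    # Cause unknown: capacity management, plus a problem ticket for root cause.
    return ["capacity_management", "problem_management"]


# JJ's example: 70% utilization, 80% critical limit, cause not yet known.
print(processes_to_trigger(70.0, 80.0, anticipated_by_design=False))
```

Note that Incident management appears only in the branch where the critical limit has actually been reached, which is the whole point of the argument above.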
To summarize, a reactive process like Incident management should be used to handle Incidents that have already occurred. For Potential Incidents, proactive processes like Capacity management and Availability management should be used, along with Problem management where root cause analysis and a permanent solution are needed.
Having said this, the context is slightly different in the case of “Security Incidents”. Any potential security breach should ideally be logged proactively as a Security Incident.
Any further/different thoughts on this? Would like to hear…