Two of the main activities of Availability Management are monitoring the availability of the service and preparing the relevant reports.
From the moment the service is interrupted until it is restored (the downtime), the incident passes through several phases, each of which must be analysed individually:
- Detection time: the time that elapses between the fault occurring and the IT organisation becoming aware of it.
- Response time: the time that elapses between the detection of the problem and the incident being logged and diagnosed.
- Repair/recovery time: the time taken to repair the fault or find a workaround and restore the system to the same state as before the service was interrupted.
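The three phases above can be derived from four timestamps recorded for each incident. The following Python sketch is only an illustration: the `Incident` class and its field names (`occurred_at`, `detected_at`, `diagnosed_at`, `restored_at`) are hypothetical and do not come from any particular service desk tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical record of one service interruption (field names are assumptions)."""
    occurred_at: datetime   # the fault occurs
    detected_at: datetime   # the IT organisation becomes aware of it
    diagnosed_at: datetime  # the incident has been logged and diagnosed
    restored_at: datetime   # the service is back in its previous state

    @property
    def detection_time(self) -> timedelta:
        return self.detected_at - self.occurred_at

    @property
    def response_time(self) -> timedelta:
        return self.diagnosed_at - self.detected_at

    @property
    def repair_time(self) -> timedelta:
        return self.restored_at - self.diagnosed_at

    @property
    def downtime(self) -> timedelta:
        # Total downtime spans all three phases.
        return self.restored_at - self.occurred_at


incident = Incident(
    occurred_at=datetime(2024, 1, 1, 9, 0),
    detected_at=datetime(2024, 1, 1, 9, 10),
    diagnosed_at=datetime(2024, 1, 1, 9, 40),
    restored_at=datetime(2024, 1, 1, 11, 0),
)
print(incident.detection_time, incident.response_time, incident.repair_time)
print(incident.downtime)  # 2:00:00
```

Recording the phases separately, rather than only the total downtime, makes it possible to see whether delays come from slow detection, slow diagnosis or slow repair.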
It is important to define metrics that allow each phase in the life cycle of a service interruption to be measured precisely. The customer must be informed of these metrics and agree to them in order to avoid misunderstandings. In some cases it is hard to determine whether the system is down or still running, and interpretations may differ between service provider and customer. It must therefore be possible to express these metrics in terms the customer can understand.
Some of the parameters commonly used by Availability Management, which should be provided to customers in the relevant availability reports, include:
- Average downtime: the average duration of a service interruption, including the time taken to detect, respond to and resolve the incident.
- Average uptime: the average length of time the service is available without interruptions.
- Average time between incidents: the average time between one incident and the next. This is equal to the sum of the average downtime and average uptime. The average time between incidents is a measure of the reliability of the system.
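As a rough illustration, the three averages above could be computed from a log of interruptions. This Python sketch assumes the log is a list of `(start, end)` timestamp pairs, sorted by start time, non-overlapping and containing at least two interruptions; the function name `availability_metrics` and the dictionary keys are hypothetical. Following the definition above, the average time between incidents is taken as the sum of the other two averages.

```python
from datetime import datetime, timedelta
from statistics import mean


def availability_metrics(interruptions):
    """Compute average downtime, uptime and time between incidents.

    `interruptions` is an assumed input format: a list of (start, end)
    datetime pairs, sorted by start and non-overlapping, with at least
    two entries so that at least one uptime interval exists.
    """
    # Duration of each interruption, in seconds.
    downtimes = [(end - start).total_seconds() for start, end in interruptions]
    # Uptime is the gap between the end of one interruption and the start of the next.
    uptimes = [
        (next_start - prev_end).total_seconds()
        for (_, prev_end), (next_start, _) in zip(interruptions, interruptions[1:])
    ]
    avg_down = mean(downtimes)
    avg_up = mean(uptimes)
    return {
        "average_downtime": timedelta(seconds=avg_down),
        "average_uptime": timedelta(seconds=avg_up),
        # Defined in the text as the sum of average downtime and average uptime.
        "average_time_between_incidents": timedelta(seconds=avg_down + avg_up),
    }


log = [
    (datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 2, 0)),    # 2 h down
    (datetime(2024, 1, 11, 2, 0), datetime(2024, 1, 11, 6, 0)),  # 4 h down
    (datetime(2024, 1, 21, 6, 0), datetime(2024, 1, 21, 9, 0)),  # 3 h down
]
metrics = availability_metrics(log)
for name, value in metrics.items():
    print(name, value)
```

In this sample log the service is down for 3 hours on average and up for 240 hours (10 days) between interruptions, giving an average time between incidents of 243 hours.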