Critical Examination of a Common Assumption in System Availability Computations
In general, availability is defined as the probability that an item will be able to function (i.e., the item is not failed or undergoing maintenance) when called upon to do so. The availability definition takes into account an item’s reliability (how long it operates before failing) and its maintainability (how quickly it can be restored to operation), and it is therefore a key metric in repairable systems analysis. Because reliability and availability are both probabilities, it is commonly assumed that the method used to compute system reliability from component data can also be used for system availability calculations.
This article provides a critical examination of this assumption and concludes that it is justified only under certain conditions. In most cases, another factor must be taken into account when computing system availability.
Comparing System Reliability Computations and System Availability Computations
Like reliability, availability is a probability. Thus, one might assume, as many people do, that the same technique of multiplying probabilities could be applied to estimate system availability. For example, in the two-component system described above, suppose that the point availabilities at time t for Component 1 and Component 2 are 80% and 90%, respectively. If the system availability is simply the product of the component availabilities, then the system availability at time t would be AS = A1 • A2 = 0.80 • 0.90 = 0.72, or 72%.
This method can be justified from a probabilistic perspective because both items need to be available when called upon in order for the system to be available. However, the method does not take into account the effect, if any, that operating together in a system configuration has on component availability. If a component continues to operate even when the system is down due to the failure of another component, then its availability is the same whether calculated individually or with respect to its behavior within the system. However, if the component does not continue to operate when the system is down, then its availability within the system differs from its availability calculated individually.

This is because, while the system is down due to the failure of another component, the given component does not accumulate age. Although the “system clock” advances by the amount of time that it takes to restore the other component, the “component clock” for the given component does not, and the component will therefore tend to fail at a later time (on the “system clock”) than it would have if the system had not been down for the maintenance of another component. This effect of system operation is not captured when the availability of the individual component is estimated in isolation, and yet it is quite relevant to the availability of the system.
Simple Example to Demonstrate the Effect of System Operation on Component Availability
Figure 1: Up and downtimes for Unit 1 individually
Figure 2: Up and downtimes for Unit 2 individually
However, when we analyze the components operating together in a system, we see that Unit 2 fails first, at 75 hours, causing the system to fail. The system then undergoes maintenance for 25 hours and is operational again at 100 hours. At 125 hours, the system fails again, this time due to Unit 1. This is because Unit 1 fails after 100 hours of operation, having accumulated 75 hours before the first system failure and another 25 hours after the system was restored. The same process can be repeated, yielding the system results shown in Figure 3.
Figure 3: Up and downtimes for the system
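The timeline above can be reproduced with a short discrete-event sketch. The times-to-failure (100 hours for Unit 1, 75 hours for Unit 2) and Unit 2’s 25-hour repair come from the example; Unit 1’s repair duration is not stated here, so the sketch stops at the second system failure, where no such assumption is needed.

```python
# Discrete-event sketch of the two-unit series system described above.
# Component ages ("component clocks") freeze while the system is down,
# because non-failed units cease to operate during system repair.

def first_two_system_failures(ttf1=100.0, ttf2=75.0, ttr2=25.0):
    """Return the system-clock times of the first two system failures."""
    t = 0.0            # system clock
    age1 = age2 = 0.0  # component clocks (accumulated operating age)
    failures = []

    # First failure: whichever unit exhausts its time-to-failure first.
    dt = min(ttf1 - age1, ttf2 - age2)
    t += dt; age1 += dt; age2 += dt
    failures.append(t)        # Unit 2 fails at t = 75

    t += ttr2                 # system down; Unit 1's age is frozen at 75
    age2 = 0.0                # Unit 2 restored good-as-new

    # Second failure: Unit 1 resumes aging from 75 operating hours.
    dt = ttf1 - age1
    t += dt; age1 += dt; age2 += dt
    failures.append(t)        # Unit 1 fails at t = 125
    return failures

print(first_two_system_failures())  # → [75.0, 125.0]
```

Note that if Unit 1’s age were not frozen during the 25-hour repair, it would fail at 100 hours on the system clock rather than 125, which is exactly the discrepancy the article is describing.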
As the example demonstrates, whether a component continues to operate when the system has failed, or can fail while the system is undergoing repair, is very important and needs to be taken into account when performing such analyses. Likewise, the computation used to estimate system availability must take into account the availabilities of the components with respect to their operation within the system. The component availabilities determined individually and with respect to system operation will be the same only when the component is not affected by the failure of other components and/or the system (or when no other components fail within the given time).
Consider a very simple system with two identical components configured reliability-wise in series, and with constant failure and repair rates. Both components cease to operate when the system is down. (Note: We assume the exponential distribution for both times-to-failure and times-to-repair in order to simplify the calculations, while recognizing that the exponential distribution is rarely appropriate to describe failure behavior in the real world. The term “repair” is used to describe the maintenance action required to restore the component to operation and we assume that the component is “good as new” as a result of the maintenance action.)
Since both the failure and repair distributions are exponential, a closed-form solution can be easily obtained for the point availability of each component. This is given by:

A(t) = μ/(λ + μ) + [λ/(λ + μ)] • e^(−(λ + μ)t)

where λ is the constant failure rate and μ is the constant repair rate.
If we assume that the point availability for the system is equal to the product of the component availabilities, then:

AS(t) = A1(t) • A2(t)

or, for this example of two identical components:

AS(t) = [μ/(λ + μ) + (λ/(λ + μ)) • e^(−(λ + μ)t)]²    (1)
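As a numerical sketch, the closed-form component availability for exponential failures (rate λ) and repairs (rate μ), A(t) = μ/(λ+μ) + [λ/(λ+μ)]·e^(−(λ+μ)t), and its square (the product assumption for two identical components) can be evaluated directly. The rate values used below are illustrative assumptions, not from the article.

```python
import math

def component_availability(lmbda, mu, t):
    """Point availability of one component with exponential times-to-failure
    (rate lmbda) and times-to-repair (rate mu), starting in the up state."""
    s = lmbda + mu
    return mu / s + (lmbda / s) * math.exp(-s * t)

def product_availability(lmbda, mu, t):
    """Eqn (1): system availability taken as the product of the two
    identical components' individual availabilities."""
    return component_availability(lmbda, mu, t) ** 2

# At t = 0 the component is up with certainty; as t grows, the availability
# decays toward the steady-state value mu / (lmbda + mu).
print(round(component_availability(0.01, 0.1, 0.0), 6))  # → 1.0
```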
Since we are using a simple exponential distribution for both the failure and repair distributions, it is easy to determine the system availability using another methodology such as Markov analysis. By comparison with the Markov result, we can evaluate the assumption that the system availability is the product of the component availabilities. The first step in the Markov approach is to determine the possible system states. These are:

1. System is operating (both Unit 1 and Unit 2 are operating).
2. System is down due to the failure of Unit 1.
3. System is down due to the failure of Unit 2.
It is possible to make the argument that a fourth state exists: “System is down due to the failure of both Unit 1 and Unit 2.” However, for this fourth state to exist, both components must continuously operate regardless of the system status. If a system failure causes the system to shut down and subsequently stops the operation of the non-failed component, then this state is not possible (since both components would have to fail at the same infinitesimal instant dt). Because the components in this example do not continue to operate when the system is down due to the failure of another component, we can disregard this fourth possibility.
Based on the three possible states, the Markov diagram is as shown in Figure 4. From this, the point availability is given by:

AS(t) = μ/(2λ + μ) + [2λ/(2λ + μ)] • e^(−(2λ + μ)t)    (2)
As can be seen, the equations from multiplying the availabilities, Eqn (1), and from the Markov analysis, Eqn (2), are different. This implies that the assumption that the system availability is the product of the availabilities of the components is not appropriate under the circumstances of this example. Further examination of the two equations reveals that if Eqn (1) is used, we actually underestimate the system availability. This is because Eqn (1) provides the availability of the system when components continue to operate even if the system is not operating. In other words, Eqn (1) allows for component failures while the system is down and undergoing repair, whereas Eqn (2) takes into account the fact that if the system is not operating then non-failed units become idle (and therefore do not continue to accumulate age) while the system is undergoing repair.
Figure 4: Markov diagram for the simple system
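The comparison can be checked numerically. The sketch below evaluates both forms, written from the standard closed-form expressions for exponential failure and repair: the product of the two identical component availabilities (Eqn 1) and the Markov "both units operating" probability (Eqn 2). The rates λ and μ are illustrative assumptions; at every sampled time the product form falls below the Markov result, i.e., it underestimates system availability.

```python
import math

def product_availability(lmbda, mu, t):
    # Eqn (1): square of the single-component point availability.
    s = lmbda + mu
    a = mu / s + (lmbda / s) * math.exp(-s * t)
    return a * a

def markov_availability(lmbda, mu, t):
    # Eqn (2): probability of the "both units operating" Markov state,
    # where idle (non-failed) units do not accumulate age.
    s = 2 * lmbda + mu
    return mu / s + (2 * lmbda / s) * math.exp(-s * t)

# Illustrative rates (assumed, not from the article): on average one failure
# per 100 hours and one repair per 10 hours.
lmbda, mu = 0.01, 0.1
for t in (10.0, 100.0, 1000.0):
    assert product_availability(lmbda, mu, t) <= markov_availability(lmbda, mu, t)

print(round(product_availability(lmbda, mu, 1000.0), 4),
      round(markov_availability(lmbda, mu, 1000.0), 4))  # → 0.8264 0.8333
```

In steady state the gap is [μ/(λ+μ)]² versus μ/(2λ+μ); cross-multiplying shows the product form is smaller by a term proportional to λ², which vanishes only as the failure rate goes to zero.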
Figure 5 illustrates the comparison between the two methods of computing availability. In this plot, the red line represents the mean availability using Eqn (1) and the blue line represents the mean availability using Eqn (2). The mean availability, which is the average of the point availability over the operating interval, will also differ between the two methods. It is given by:

Ā(t) = (1/t) • ∫ from 0 to t of AS(u) du

with AS(u) taken from Eqn (1) or Eqn (2), respectively.
Figure 5: Plot of mean availability vs. time
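Curves like those in Figure 5 can be reproduced numerically. The sketch below averages the point availability over [0, t] with the composite trapezoidal rule; the Markov point availability used here follows the standard closed form for exponential failure and repair, and the rates are illustrative assumptions rather than values from the article.

```python
import math

def markov_availability(lmbda, mu, t):
    # Point availability when idle units do not accumulate age (Eqn 2).
    s = 2 * lmbda + mu
    return mu / s + (2 * lmbda / s) * math.exp(-s * t)

def mean_availability(avail, lmbda, mu, t, n=10_000):
    """Mean availability (1/t) * integral of A(u) du over [0, t],
    approximated with the composite trapezoidal rule."""
    if t == 0.0:
        return avail(lmbda, mu, 0.0)
    h = t / n
    total = 0.5 * (avail(lmbda, mu, 0.0) + avail(lmbda, mu, t))
    total += sum(avail(lmbda, mu, i * h) for i in range(1, n))
    return total * h / t
```

Because the point availability decays monotonically toward its steady-state value, the mean availability lies above the point availability for every t > 0, which is why the mean-availability curves settle more slowly than the point-availability curves.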