Reliability Edge

Critical Examination of a Common Assumption in System Availability Computations

In general, availability is defined as the probability that an item will be able to function (i.e., the item is neither failed nor undergoing maintenance) when called upon to do so. The definition takes into account both an item’s reliability (how long it operates before failing) and its maintainability (how quickly it can be restored to operation), which makes availability a key metric in repairable systems analysis. Because reliability and availability are both probabilities, it is commonly assumed that the method used to compute system reliability from component data can also be used for system availability calculations.

This article critically examines that assumption and concludes that it is justified only under certain conditions. In most cases, an additional factor must be taken into account when computing system availability.

Comparing System Reliability Computations and System Availability Computations
To examine whether it is appropriate to compute system availability with the same method used to compute system reliability, we must first briefly review the computation method for system reliability. For a system made up of components configured reliability-wise in series (i.e., if any one of the components fails, the entire system fails), the estimated system reliability is simply the product of the component reliabilities. For example, consider a simple system with two components configured reliability-wise in series. At some time t, Component 1 has a reliability of 85% and Component 2 has a reliability of 95%. The system reliability at time t is therefore the product of these two reliabilities: RS = R1 • R2 = (.85 • .95) = .8075, or about 81%. This simple computation method is based on the fact that both values are probabilities and both events must occur (i.e., both components must operate without failure until time t) for system success at time t.

Note: Components can, of course, also be configured in parallel and in other reliability-wise configurations. To simplify the discussion, however, only series configurations are considered in this article.

Like reliability, availability is a probability. Thus, one might assume, as many people do, that this same technique of multiplying probabilities could be applied to estimate system availability. For example, in the two-component system described above, suppose that the point availabilities at time t for Component 1 and Component 2 are 80% and 90%, respectively. If the system availability is simply the product of the component availabilities, then the system availability at time t would be AS = A1 • A2 = (.80 • .90) = .72 or 72%.
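Both products above are easy to script. The following Python sketch is our own illustration (the function name `series_product` is hypothetical, not from any particular library):

```python
def series_product(values):
    """Multiply per-component probabilities for a reliability-wise
    series configuration. Always valid for reliability; for availability
    only under the conditions examined in this article."""
    result = 1.0
    for v in values:
        result *= v
    return result

# System reliability at time t from the article's example: 0.85 * 0.95
print(round(series_product([0.85, 0.95]), 4))  # 0.8075

# Naive system availability from the product assumption: 0.80 * 0.90
print(round(series_product([0.80, 0.90]), 4))  # 0.72
```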

This method can be justified from a probabilistic perspective because both items need to be available when called upon in order for the system to be available. However, it does not take into account the effect, if any, that operating together in a system configuration has on each component’s availability. If a component continues to operate even when the system is down due to the failure of another component, then its availability is the same whether it is calculated individually or with respect to its behavior within the system. If the component does not continue to operate when the system is down, however, then its availability within the system differs from its availability calculated individually. This is because when the system is down due to the failure of another component, the given component does not accumulate age. Therefore, although the “system clock” advances by the amount of time it takes to restore the other component, the “component clock” for the given component does not, and the component is likely to fail at a later time (on the “system clock”) than it would have if the system had not been down for the maintenance of another component. This effect of system operation is not captured in the estimation of the availability of the individual component, and yet it is quite relevant to the availability of the system.

Simple Example to Demonstrate the Effect of System Operation on Component Availability
The following deterministic scenario demonstrates the effect of system operation on component availability. Consider a system with two units configured reliability-wise in series, where Unit 1 fails every 100 hours of operation and takes 20 hours to restore, and Unit 2 fails every 75 hours of operation and takes 25 hours to restore. Furthermore, neither component continues to accumulate age when the system is down. Considered individually over 300 hours, the availabilities of the two units are 86.7% and 75%, respectively, as shown in Figures 1 and 2.

Figure 1: Up and downtimes for Unit 1 individually

Figure 2: Up and downtimes for Unit 2 individually

However, when we analyze the components operating together in a system, we see that Unit 2 fails first, at 75 hours, causing the system to fail. The system then undergoes maintenance for 25 hours and is operational again at 100 hours. At 125 hours, the system fails again, this time due to Unit 1: Unit 1 fails after 100 hours of operation, and it accumulated 75 hours of operation before the system failure and another 25 hours after the system was restored. The same process can be repeated, yielding the system results shown in Figure 3.

Figure 3: Up and downtimes for the system
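The timeline above can be reproduced with a short deterministic simulation. The following Python sketch is our own illustration (the function and variable names are hypothetical); it assumes, as in the example, that a non-failed unit stops accumulating age while the system is down and that repair restores a unit to good-as-new condition:

```python
def series_availability(ttf, ttr, horizon):
    """Deterministic series system: ttf[i] is unit i's operating hours
    between failures, ttr[i] its repair time. Units age only while the
    system is up; a repaired unit is good as new."""
    ages = [0.0] * len(ttf)
    t = uptime = 0.0
    while t < horizon:
        remaining = [ttf[i] - ages[i] for i in range(len(ttf))]
        dt = min(remaining)                  # operating hours to next failure
        failed = remaining.index(dt)
        if t + dt >= horizon:                # horizon reached while running
            uptime += horizon - t
            break
        uptime += dt
        ages = [a + dt for a in ages]        # every unit ages while running
        ages[failed] = 0.0                   # failed unit restored good as new
        t += dt + ttr[failed]                # system is down during the repair
    return uptime / horizon

print(series_availability([100], [20], 300))      # Unit 1 alone: about 0.867
print(series_availability([75], [25], 300))       # Unit 2 alone: 0.75
print(series_availability([100, 75], [20, 25], 300))  # system: 0.70
```

With these inputs, the simulated system availability over 300 hours is 70%, while the product of the individual availabilities (86.7% × 75%) is only about 65%: the product underestimates the system availability here because idle units do not age.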

As the example demonstrates, the distinction of whether a component continues to operate when the system has failed, or whether a component can fail while the system is undergoing repair, is very important and must be taken into account when performing such analyses. Likewise, the computation used to estimate system availability must take into account the availabilities of the components with respect to their operation within the system. The component availabilities determined individually and with respect to system operation will be the same only when the component is not affected by the failure of other components and/or the system (or when no other components fail within the given time).

Mathematical Demonstration
The following example mathematically demonstrates the consequences of assuming that the system availability is always simply the product of the component availabilities.

Consider a very simple system with two identical components configured reliability-wise in series, and with constant failure and repair rates. Both components cease to operate when the system is down. (Note: We assume the exponential distribution for both times-to-failure and times-to-repair in order to simplify the calculations, while recognizing that the exponential distribution is rarely appropriate to describe failure behavior in the real world. The term “repair” is used to describe the maintenance action required to restore the component to operation and we assume that the component is “good as new” as a result of the maintenance action.)

Since both the failure and repair distributions are exponential, a closed-form solution can easily be obtained for the point availability of each component. With failure rate λ and repair rate μ, this is given by:

A(t) = \frac{\mu}{\lambda+\mu} + \frac{\lambda}{\lambda+\mu} e^{-(\lambda+\mu)t}

If we assume that the point availability for the system is equal to the product of the component availabilities, then:

A_S(t) = A_1(t) \cdot A_2(t)

or, for this example of two identical components:

A_S(t) = \left[\frac{\mu}{\lambda+\mu} + \frac{\lambda}{\lambda+\mu} e^{-(\lambda+\mu)t}\right]^2    Eqn (1)

Since we are using a simple exponential distribution for both the failure and repair distributions, it is easy to determine the system availability using another methodology such as Markov analysis. By comparison with the Markov result, we can evaluate the assumption that the system availability is the product of the component availabilities. The first step in the Markov approach is to determine the possible system states. These are:

• System is up (both units operate).
• System is down (Unit 1 failed).
• System is down (Unit 2 failed).

It is possible to make the argument that a fourth state exists: “System is down due to the failure of both Unit 1 and Unit 2.” However, for this fourth state to exist, both components must continuously operate regardless of the system status. If a system failure causes the system to shut down and subsequently stops the operation of the non-failed component, then this state is not possible (since both components would have to fail at the same infinitesimal instant dt). Because the components in this example do not continue to operate when the system is down due to the failure of another component, we can disregard this fourth possibility.

Now, based on the three possible states, the Markov diagram is as shown in Figure 4. From this, the point availability is given by:

A_S(t) = \frac{\mu}{2\lambda+\mu} + \frac{2\lambda}{2\lambda+\mu} e^{-(2\lambda+\mu)t}    Eqn (2)

As can be seen, the equation obtained by multiplying the availabilities, Eqn (1), and the equation obtained from the Markov analysis, Eqn (2), are different. This implies that the assumption that the system availability is the product of the component availabilities is not appropriate under the circumstances of this example. Further examination of the two equations reveals that if Eqn (1) is used, we actually underestimate the system availability. This is because Eqn (1) gives the availability of a system whose components continue to operate even when the system is not operating; in other words, Eqn (1) allows for component failures while the system is down and undergoing repair. Eqn (2), on the other hand, takes into account the fact that if the system is not operating, then the non-failed units sit idle (and therefore do not continue to accumulate age) while the system is undergoing repair.

Figure 4: Markov diagram for the simple system
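The difference between Eqn (1) and Eqn (2) can be checked numerically. The following Python sketch uses assumed example rates (λ = 0.01 failures/hr, μ = 0.1 repairs/hr; these numbers are ours, chosen only for illustration):

```python
import math

lam, mu = 0.01, 0.1   # assumed example rates, not from the article

def point_avail(t):
    """Point availability of one exponential unit."""
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)

def eqn1(t):
    """Product assumption for two identical units in series, Eqn (1)."""
    return point_avail(t) ** 2

def eqn2(t):
    """Markov result when idle units do not age, Eqn (2)."""
    s = 2 * lam + mu
    return mu / s + (2 * lam / s) * math.exp(-s * t)

for t in (10, 100, 1000):
    print(t, round(eqn1(t), 4), round(eqn2(t), 4))
# For every t > 0 with these rates, eqn1(t) < eqn2(t): the product
# assumption underestimates the availability of this system.
```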

Figure 5 illustrates the comparison between the two methods of computing availability. In this plot, the red line represents the mean availability using Eqn (1) and the blue line represents the mean availability using Eqn (2). The mean availability, which also differs between the two methods, is the time average of the point availability:

\bar{A}(t) = \frac{1}{t}\int_0^t A(u)\,du

Figure 5: Plot of mean availability vs. time
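The mean (time-averaged) availabilities can likewise be compared numerically. A minimal sketch, again with assumed rates λ = 0.01 and μ = 0.1 (our numbers, chosen only for illustration):

```python
import math

lam, mu = 0.01, 0.1   # assumed illustration rates

# Point availabilities: product assumption (Eqn 1) vs. Markov (Eqn 2)
def a1(t):
    s = lam + mu
    return (mu / s + (lam / s) * math.exp(-s * t)) ** 2

def a2(t):
    s = 2 * lam + mu
    return mu / s + (2 * lam / s) * math.exp(-s * t)

def mean_avail(a, t, n=20000):
    """Time average (1/t) * integral_0^t a(u) du via the trapezoidal rule."""
    h = t / n
    return (0.5 * (a(0.0) + a(t)) + sum(a(k * h) for k in range(1, n))) * h / t

print(mean_avail(a1, 500.0), mean_avail(a2, 500.0))
# The Eqn (1) mean stays below the Eqn (2) mean, matching the
# separation between the two curves described for Figure 5.
```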

Conclusion
The examples in this article demonstrate that the simple computation method of multiplying component availabilities is appropriate only if all of the components in the system continue to operate even when the system is down due to the failure of another component (i.e., when the operation of the overall system has no effect on the operation of the components). If this is not the case, then the effect of system operation must be taken into account when computing system availability; otherwise, the computation is likely to underestimate the availability of the system.
