Reliability Growth Test Planning and Management
An effective reliability growth test planning and management strategy can contribute greatly to successful product design and development by improving the design/development team's ability to meet desired reliability goals on time and within the project budget. An effective reliability growth management program both produces and uses important information about the reliability of the product design: the MTBF demonstrated through testing, the growth in MTBF achieved through implementation of corrective actions, the maximum potential MTBF that can likely be achieved for the design, and estimates regarding latent failure modes that have not yet been uncovered through testing.
This article presents a brief conceptual overview of a reliability growth test planning/management strategy and data analysis methodology that provide information that can be instrumental to various management decisions for product design/development. Dr. Larry H. Crow, a leading practitioner in the field of Reliability Growth Analysis for over 30 years, developed the approach described in this article and has cooperated with design/development teams in both the military and the private sector to implement, validate and refine the relevant techniques. This article has been written with cooperation from Dr. Crow based on his lectures on the subject and published standards for reliability growth analysis.
Also, as noted in previous articles by Dr. Crow, a comprehensive reliability growth program actually begins in early design, where identified potential failure modes are mitigated before formal testing. This potential failure mode mitigation in design is highly productive when managed with Failure Mode and Effects Analysis (FMEA), System Reliability Block Diagram (RBD) Analysis and/or Fault Tree Analysis (FTA). The objective of these analyses is to increase the reliability before testing begins.
Based on the results of each reliability growth testing phase and the subsequent analysis, the project manager may wish to make changes to the design/development approach. Specifically, the manager may choose to revise the program schedule; change the number of products tested or the duration of the test; or increase, decrease or reallocate the program budget and resources. In addition, the design/development team may reevaluate the criteria used to determine which failure modes will receive corrective actions and institute any necessary changes. That is, it may be appropriate to change the management strategy.
Reliability Growth Testing: Test a sample of units according to the test plan that has been established and record failure information for the units under test. In practice, the units may start the test at different times, but it is generally assumed that the test units have the same design configurations at any point in the testing. The methods also apply to discrete (one-shot) success/failure events.
Categorize Observed Failures: Categorize each observed failure according to whether corrective action will be performed to address the problem that caused the failure. In a "Test-Find-Test" scenario, one of two categories is assigned to each failure mode: Category A (no corrective action will be taken) or Category B (a corrective action will be implemented as a delayed fix).
Characterize Category B Failure Modes: Identify and characterize the failure mode for each Category B failure. The failure mode description typically provides information about the specific physical cause of the problem. For example, "leaking actuator, worn seal" and "leaking actuator, flange radius crack from fatigue" are two unique failure modes. In this case, the phrase "leaking actuator" is not sufficiently descriptive of the failure mode because there is more than one physical cause that can result in the failure of the item via a leaking actuator.
For bookkeeping purposes, it can be helpful to assign an alphanumeric code to each Category B failure mode according to the sequence in which unique modes are identified. For example, the first Category B failure mode can be identified as B1, the second as B2, and so on. When another failure occurs due to a failure mode that has already been identified, it is assigned the same code as the first instance of that mode.
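The bookkeeping scheme above can be sketched in a few lines; the helper name and the mode descriptions are illustrative only, not from any specific tool:

```python
# Illustrative sketch: assign sequential codes (B1, B2, ...) to unique
# Category B failure modes, reusing a code when a mode repeats.

def assign_mode_codes(observed_modes):
    """Return the code for each observation, in order of occurrence."""
    codes = {}        # mode description -> assigned code
    assigned = []
    for mode in observed_modes:
        if mode not in codes:
            codes[mode] = f"B{len(codes) + 1}"   # next unseen mode
        assigned.append(codes[mode])
    return assigned

observations = [
    "leaking actuator, worn seal",
    "leaking actuator, flange radius crack from fatigue",
    "leaking actuator, worn seal",   # repeat of the first mode
]
print(assign_mode_codes(observations))  # ['B1', 'B2', 'B1']
```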
Effectiveness of Corrective Actions: For each unique Category B failure mode, examine the likely effectiveness of the corrective action. The effectiveness factor is a number between 0 and 1 that represents the fractional decrease in the failure mode's failure rate due to the corrective action. For example, if the corrective action is expected to reduce the failure rate due to a given mode by 75%, then the effectiveness factor for the corrective action is 0.75. If this mode is expected to be responsible for 8 failures before the fix has been implemented, then after the corrective action has been performed, we would expect to observe 2 failures due to the given mode. Numerically, this would be 8 × (1 − 0.75) = 2.
Effectiveness factors are assigned based on engineering judgment, and the quality of this assessment affects the predictions that depend on them. Based on past experience with reliability growth testing, the average effectiveness factor across all modes is likely to be in the range of 0.65 to 0.75. An individual effectiveness factor may be smaller or larger than this average, but the average over a large number of effectiveness factors during a test typically falls in this range.
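The arithmetic behind the effectiveness factor can be captured in a short sketch; the function name is illustrative:

```python
# Sketch of the effectiveness-factor arithmetic described in the text:
# if a mode caused n failures before the fix, roughly n * (1 - d)
# failures are expected afterward, where d is the effectiveness factor.

def expected_failures_after_fix(failures_before, effectiveness):
    return failures_before * (1.0 - effectiveness)

# Example from the text: 8 failures, effectiveness factor 0.75.
print(expected_failures_after_fix(8, 0.75))  # 2.0
```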
Statistical Model: The Crow (AMSAA) projection model uses a nonhomogeneous
Poisson process (N.H.P.P.) statistical model to analyze reliability growth
data and incorporate the failure classifications and effectiveness factors.
This model can be used to obtain a variety of plots and results, including the
reliability that has been demonstrated during the test and the expected
reliability of the design after the delayed fixes for Category B failure modes
have been implemented. These results are presented graphically in Figure 1,
which shows the demonstrated MTBF of the current design as a straight line at
9.55 and the projection for the new design (which incorporates the delayed
fixes) as a point at 15.13 MTBF. The projection of 15.13 estimates the impact
of the proposed delayed corrective actions and effectiveness factors on the
MTBF of the design.
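As a rough illustration of where a demonstrated MTBF figure comes from, the standard time-terminated Crow (AMSAA) maximum-likelihood estimates can be sketched as follows; the failure times below are made up for the example and are not the article's data:

```python
import math

# Hedged sketch of the standard Crow (AMSAA) MLEs for a time-terminated
# test. The NHPP intensity is lambda * beta * t**(beta - 1); beta < 1
# indicates reliability growth. Failure times are synthetic.

def crow_amsaa_mle(failure_times, total_time):
    n = len(failure_times)
    beta = n / sum(math.log(total_time / t) for t in failure_times)
    lam = n / total_time ** beta
    intensity = lam * beta * total_time ** (beta - 1)  # equals n * beta / T
    return beta, lam, 1.0 / intensity                  # demonstrated MTBF

times = [2.0, 10.0, 30.0, 60.0, 100.0]   # cumulative failure times (hours)
beta, lam, mtbf = crow_amsaa_mle(times, total_time=100.0)
print(round(beta, 3), round(mtbf, 1))    # 0.631 31.7
```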
Evaluate and Adjust Management Strategy: In addition to the demonstrated and projected MTBF results, the Crow (AMSAA) projection model supports the generation of other results and plots that can be invaluable for evaluating the current design/development management strategy and making any necessary adjustments. The growth potential metric and the analysis of unseen failure modes are important metrics for this purpose.
The growth potential is an estimate of the maximum system MTBF that can be attained with
the product design and reliability growth management strategy. This can be
displayed with a straight line on the MTBF vs. Test Time plot, as shown in
Figure 2 where the growth potential is identified at 22.45 MTBF. This metric
can help to confirm the manager's expectation that the ultimate reliability
goal for the design is feasible, but it can also provide a clear warning if
the reliability goal cannot be achieved for the current design under the given
conditions. Management can then respond to this warning by making changes to
the management strategy, such as converting some Category A failure modes to
Category B failure modes and/or changing the criteria for the classification
of new modes that are uncovered or adding redundancy.
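For illustration, the growth potential failure intensity is commonly computed by retaining the full Category A failure rate and reducing each Category B mode's rate by its effectiveness factor; all of the rates and factors below are hypothetical:

```python
# Hedged sketch of a growth potential calculation: the limiting failure
# intensity keeps the Category A rate in full and each Category B mode's
# rate reduced by its effectiveness factor. Numbers are hypothetical.

def growth_potential_mtbf(lambda_A, b_mode_rates, effectiveness):
    lam_gp = lambda_A + sum((1 - d) * lam
                            for lam, d in zip(b_mode_rates, effectiveness))
    return 1.0 / lam_gp

lambda_A = 0.010                  # failures/hour from modes left unfixed
b_rates = [0.020, 0.015, 0.010]   # per-mode Category B failure rates
effs = [0.75, 0.70, 0.65]         # assigned effectiveness factors
print(round(growth_potential_mtbf(lambda_A, b_rates, effs), 1))  # 43.5
```

Converting a Category A mode to Category B moves part of `lambda_A` into the reduced sum, which is why the reclassification mentioned above raises the growth potential.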
Analysis of the
unseen failure modes provides another important set of metrics for evaluating
the product design and the reliability growth management strategy. Based on
the failure modes that have been uncovered during the test, the Crow (AMSAA)
projection model can be used to provide estimates about the failure modes that
have not yet occurred. Such metrics include the current rate of uncovering new
Category B failure modes, the estimated number of unseen Category B failure
modes and the estimated failure rate for unseen Category B failure modes. This
analysis can provide an indicator of how many problems are yet to be
discovered in the design and how much test time will be required to identify
and correct those latent causes of failure. The pie chart in Figure 3
represents one method to display this information graphically. The pie chart
illustrates the quantity and ratio of seen and unseen failure modes after the
completion of a particular phase of testing.
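One of these metrics, the current rate of uncovering new Category B failure modes, can be sketched by fitting the same power-law model to the first occurrence time of each distinct mode; the times below are synthetic:

```python
import math

# Hedged sketch: fit the NHPP to the *first* occurrence time of each
# distinct Category B mode; h(T) = lambda * beta * T**(beta - 1) then
# estimates how quickly new modes are still appearing at time T.
# First-occurrence times are synthetic.

def new_mode_discovery_rate(first_occurrence_times, total_time):
    m = len(first_occurrence_times)
    beta = m / sum(math.log(total_time / t) for t in first_occurrence_times)
    lam = m / total_time ** beta
    return lam * beta * total_time ** (beta - 1)   # = m * beta / T

first_seen = [5.0, 15.0, 40.0, 80.0]   # first occurrence of each B mode
rate = new_mode_discovery_rate(first_seen, total_time=100.0)
print(round(rate, 4))   # 0.0265 new B modes per hour at end of test
```

A rate that is still high at the end of the test phase suggests many latent modes remain; a rate near zero suggests the seen modes dominate.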
Category C Failures
Some failure modes may receive corrective actions while the test is still in progress, rather than delayed fixes at the end of the test phase; these can be classified as Category C. Fixes implemented during the test produce a gradual increase in the observed reliability as testing proceeds. If the test also
includes Category B failure modes, then this gradual increase will also be
accompanied by a jump in reliability when the Category B corrective actions
are implemented at the end of the test phase. The Generalized Crow Projection
model accommodates Category A, B and C failure modes, and Figure 4 displays
the MTBF vs. Time plot for such analyses. This plot is similar to the ones
shown in Figures 1 and 2, except that it includes a gradual increase in the
reliability observed during the test, due to the implementation of fixes for
some failure modes while the test was in progress.
References
Department of Defense. MIL-HDBK-189: Reliability Growth Management.
International Electrotechnical Commission. IEC 61164: Reliability Growth - Statistical Test and Estimation Methods.
NOTE: Two IEC publications on reliability growth, IEC 61164 and IEC 61014, are currently undergoing revision. For more information, search for works in progress at http://www.iec.ch.