Reliability Edge Newsletter

Volume 3, Issue 3


Reliability Growth Test Planning and Management

An effective reliability growth test planning and management strategy can contribute greatly to successful product design and development through its impact on the design/development team's ability to meet reliability goals on time and within the project budget. An effective reliability growth management program both produces and utilizes important information about the reliability of the product design, such as: the MTBF demonstrated through testing, the growth in MTBF achieved through the implementation of corrective actions, the maximum potential MTBF that can likely be achieved for the product design, and estimates regarding latent failure modes that have not yet been uncovered through testing. 

This article presents a brief conceptual overview of a reliability growth test planning/management strategy and data analysis methodology that provide information that can be instrumental to various management decisions for product design/development. Dr. Larry H. Crow, a leading practitioner in the field of Reliability Growth Analysis for over 30 years, developed the approach described in this article and has cooperated with design/development teams in both the military and the private sector to implement, validate and refine the relevant techniques. This article has been written with cooperation from Dr. Crow based on his lectures on the subject and published standards for reliability growth analysis.

Also, as noted in previous articles by Dr. Crow, a comprehensive reliability growth program actually begins early in design, when identified potential failure modes are mitigated before formal testing starts. This mitigation of potential failure modes in design is highly productive when managed with Failure Mode and Effects Analysis (FMEA), System Reliability Block Diagram (RBD) analysis and/or Fault Tree Analysis (FTA). The objective of these analyses is to increase the reliability before testing. 

Background and Assumptions
The reliability growth test planning and management strategy described in this article assumes that, as the product design matures, the design/development team identifies potential failure modes for the product through controlled testing in a series of phases. The design/development team then decides to implement corrective measures or "fixes" for some or all of the identified failure modes in order to reduce the likelihood that the revised product design will fail due to those modes. Which corrective actions are actually implemented, how effective those corrective actions are and when they are implemented together determine the reliability growth management strategy. There are three basic approaches for implementing corrective actions into the design, and the approach used will affect the analysis and decision-making process. These three approaches are: 

  • Test-Find-Test: Failure modes are identified but the fixes are not implemented until after the completion of the testing phase. In this case, the reliability growth due to the implementation of fixes takes place after the completion of a given testing phase and the improved product design is in place for the beginning of the next testing phase. 
  • Test-Fix-Test: Fixes are implemented during the test, after the failure modes have been identified and the corrective actions have been determined. The testing may be stopped until a corrective action is implemented, but this is not necessary. The testing then continues with the revised product design. In this case, the reliability growth is due to the implementation of the fixes during the given testing phase. 
  • Test-Fix-Test with Delayed Fixes: Some fixes are implemented during the test while other corrective actions are delayed until the completion of the test phase. In this case, the reliability growth due to the implementation of fixes takes place both during and after the completion of the given testing phase. 

Based on the results of each reliability growth testing phase and the subsequent analysis, the project manager may wish to make changes to the design/development approach. Specifically, he/she may choose to revise the program schedule, change the number of units tested or the duration of the test, and/or increase, decrease or reallocate the program budget and resources. In addition, the design/development team may reevaluate the criteria used to determine which failure modes will receive corrective actions and institute any necessary changes. That is, it may be appropriate to change the management strategy. 

Analysis Procedure 
The analysis and management approach described here is an iterative process that begins with the data capture from the first testing phase and continues through subsequent testing phases until reliability goals have been achieved and the product is released. Before the first testing phase begins, the design/development team will have completed a number of important steps to prepare the groundwork for subsequent analysis and decision-making. These important activities include the analysis of previous programs to identify any relevant reliability growth patterns that are likely to appear for the new design, the development of a reliability growth testing plan (including decisions as to the duration of the test, sample size, policies for implementing fixes, etc.) and the creation of a planned reliability growth curve to provide the team with a general outline of what they can expect over the course of each testing phase. Once this has been completed, the following analysis procedures can be implemented. 
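
As a hedged illustration of how a planned growth curve might be laid out, the sketch below (in Python) uses the classic power-law (Duane) relationship, in which the cumulative MTBF grows as a power of test time and the instantaneous MTBF is the cumulative value divided by (1 - alpha). The initial MTBF, initial test period and growth rate are hypothetical planning values, not from the article:

    # Planned growth curve from the power-law (Duane) relationship.
    M1, t1 = 8.0, 100.0   # hypothetical initial MTBF over the first 100 hours
    alpha = 0.35          # hypothetical planned growth rate

    def planned_mtbf(t):
        """Instantaneous MTBF on the planned curve at cumulative time t."""
        cumulative = M1 * (t / t1) ** alpha
        return cumulative / (1 - alpha)

    for t in (100, 400, 1000, 2000):
        print(f"t = {t:5d} h   planned MTBF = {planned_mtbf(t):.1f}")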

Reliability Growth Testing: Test a sample of units according to the test plan that has been established and record failure information for the units under test. In practice, the units may start the test at different times, but it is generally assumed that the test units have the same design configurations at any point in the testing. The methods also apply to discrete (one-shot) success/failure events. 

Categorize Observed Failures: Categorize each observed failure according to whether corrective action will be performed to address the problem that caused the failure. In a "Test-Find-Test" scenario, one of two categories can be assigned to each failure mode: Category A or Category B. 

  • Category A: Corrective actions will not be performed. A failure mode may be assigned to Category A for a variety of reasons, including but not limited to: 
      - The failure mode occurs in existing technology for which re-design is not possible or cost-effective.
      - The likelihood of occurrence for the failure mode is not large enough to justify the cost of corrective action.
      - The severity of the potential effect of failure is not serious enough to justify the cost of corrective action.
      - The budget for corrective actions does not permit corrective actions to be performed.
      - Other reasons to be determined based on the particular situation and the organization's design/development and reliability growth management strategy. 
  • Category B: Corrective actions to eliminate or mitigate the cause of failure will be performed after the current test phase has been completed. Corrective actions for Category B failure modes are often called "delayed corrective actions" or "delayed fixes." 

Characterize Category B Failure Modes: Identify and characterize the failure mode for each Category B failure. The failure mode description typically provides information about the specific physical cause of the problem. For example, "leaking actuator, worn seal" and "leaking actuator, flange radius crack from fatigue" are two unique failure modes. In this case, the phrase "leaking actuator" is not sufficiently descriptive of the failure mode because there is more than one physical cause that can result in the failure of the item via a leaking actuator. 

For bookkeeping purposes, it can be helpful to assign an alphanumeric code to each unique Category B failure mode according to the sequence in which the modes are identified. For example, the first Category B failure mode can be identified as B1, the second as B2, and so on. When/if another failure occurs due to a mode that has already been identified, it is assigned the same code as the first occurrence of that mode. 
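
A minimal sketch of this bookkeeping scheme follows (Python is used here for illustration; the mode descriptions and data structure are hypothetical, not from any particular tool):

    # Assign sequential codes (B1, B2, ...) to unique Category B failure
    # modes; repeat occurrences of a known mode reuse the existing code.
    mode_codes = {}   # failure mode description -> assigned code

    def assign_code(mode_description):
        """Return the B-code for a failure mode, creating one if it is new."""
        if mode_description not in mode_codes:
            mode_codes[mode_description] = f"B{len(mode_codes) + 1}"
        return mode_codes[mode_description]

    assign_code("leaking actuator, worn seal")                     # -> 'B1'
    assign_code("leaking actuator, flange radius crack, fatigue")  # -> 'B2'
    assign_code("leaking actuator, worn seal")                     # -> 'B1' again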

Quantify Effectiveness of Corrective Actions: For each unique Category B failure mode, assess the likely effectiveness of the corrective action. The effectiveness factor is a number between 0 and 1 that represents the fractional decrease in the failure mode's failure rate due to the corrective action. For example, if the corrective action is expected to reduce the failure rate due to a given mode by 75%, then the effectiveness factor for the corrective action is 0.75. If this mode is expected to be responsible for 8 failures before the fix has been implemented, then after the corrective action has been performed we would expect to observe 8 * (1 - 0.75) = 2 failures due to the given mode. 

Effectiveness factors are assigned based on engineering judgment, and the quality of this judgment affects the accuracy of the resulting projections. Based on past experience with reliability growth testing, the average effectiveness factor over all modes is likely to be in the range of 0.65 to 0.75. An individual effectiveness factor may be smaller or larger than this average, but historical data indicate that the average over a large number of effectiveness factors during a test is likely to fall in this range. 
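
The following short sketch illustrates this arithmetic for a set of hypothetical Category B modes (the failure counts and effectiveness factors below are invented for illustration):

    # A fix with effectiveness factor d removes the fraction d of a mode's
    # failures, so N observed failures project to N * (1 - d) after the fix.
    b_modes = {"B1": {"failures": 8, "d": 0.75},   # fix expected 75% effective
               "B2": {"failures": 3, "d": 0.70}}   # fix expected 70% effective

    for code, mode in b_modes.items():
        remaining = mode["failures"] * (1 - mode["d"])
        print(f"{code}: {mode['failures']} observed -> "
              f"{remaining:.1f} expected after fix")
    # B1: 8 observed -> 2.0 expected after fix
    # B2: 3 observed -> 0.9 expected after fix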

Apply Statistical Model: The Crow (AMSAA) projection model uses a nonhomogeneous Poisson process (NHPP) statistical model to analyze reliability growth data and to incorporate the failure classifications and effectiveness factors. This model can be used to obtain a variety of plots and results, including the reliability that has been demonstrated during the test and the expected reliability of the design after the delayed fixes for Category B failure modes have been implemented. These results are presented graphically in Figure 1, which shows the demonstrated MTBF of the current design as a straight line at 9.55 and the projection for the new design (which incorporates the delayed fixes) as a point at 15.13 MTBF. The projection of 15.13 estimates the impact of the proposed delayed corrective actions and their effectiveness factors on the system reliability. 

Figure 1: Demonstrated and projected MTBF
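
To make the projection arithmetic concrete, the sketch below follows the Crow projection formulation as it is commonly presented: the projected failure intensity equals the Category A failure rate, plus the Category B failure rates reduced by their effectiveness factors, plus an allowance for unseen B modes equal to the average effectiveness factor times h(T), the estimated rate at which new B modes are being discovered. All of the inputs below are hypothetical and will not reproduce the numbers in Figure 1:

    import math

    T = 400.0                 # total test time (hours); hypothetical
    n_A = 10                  # Category A failures observed; hypothetical
    # code -> (failure count, effectiveness factor); hypothetical values
    b_modes = {"B1": (8, 0.75), "B2": (3, 0.70), "B3": (5, 0.60)}
    first_times = [25.0, 90.0, 310.0]   # first occurrence of each B mode

    # Demonstrated intensity for the current design: all failures over T.
    n_B = sum(n for n, _ in b_modes.values())
    lambda_demo = (n_A + n_B) / T

    # Rate of discovery of new B modes, h(T), from a power-law (NHPP) fit
    # to the first-occurrence times of the distinct B modes.
    m = len(first_times)
    beta = m / sum(math.log(T / t) for t in first_times)
    h_T = m * beta / T

    # Projected intensity after delayed fixes: A modes unchanged, each seen
    # B mode reduced by its effectiveness factor, plus the unseen-mode
    # allowance discounted by the average effectiveness factor.
    d_bar = sum(d for _, d in b_modes.values()) / len(b_modes)
    lambda_proj = (n_A / T
                   + sum(n * (1 - d) for n, d in b_modes.values()) / T
                   + d_bar * h_T)

    print(f"demonstrated MTBF: {1 / lambda_demo:.2f}")
    print(f"projected MTBF:    {1 / lambda_proj:.2f}")

The projected MTBF exceeds the demonstrated MTBF here because most of the observed failure intensity belongs to B modes that are scheduled to receive delayed fixes.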

Evaluate and Adjust Management Strategy: In addition to the demonstrated and projected MTBF results, the Crow (AMSAA) projection model supports the generation of other results and plots that can be invaluable for evaluating the current design/development management strategy and making any necessary adjustments. The growth potential and the analysis of unseen failure modes are two important metrics for this purpose. 

The growth potential is an estimate of the maximum system MTBF that can be attained with the given product design and reliability growth management strategy. It can be displayed as a straight line on the MTBF vs. Test Time plot, as shown in Figure 2, where the growth potential is identified at 22.45 MTBF. This metric can help to confirm the manager's expectation that the ultimate reliability goal for the design is feasible, but it can also provide a clear warning if the reliability goal cannot be achieved for the current design under the given conditions. Management can then respond to this warning by making changes to the management strategy, such as converting some Category A failure modes to Category B, changing the criteria used to classify new modes as they are uncovered and/or adding redundancy.

Figure 2: MTBF vs. Time with growth potential line
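
One common way to compute the growth potential, sketched here under the same hypothetical inputs as above, is to apply the projection formula without the unseen-mode allowance: the Category A rate persists, while every B mode is assumed to be eventually surfaced and fixed at its assigned effectiveness:

    T = 400.0
    n_A = 10                                     # Category A failures
    b_modes = {"B1": (8, 0.75), "B2": (3, 0.70), "B3": (5, 0.60)}

    # Growth potential intensity: A-mode rate plus the residual (unfixed)
    # fraction of each B-mode rate.
    lambda_gp = n_A / T + sum(n * (1 - d) for n, d in b_modes.values()) / T
    print(f"growth potential MTBF: {1 / lambda_gp:.2f}")   # about 26.85 here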

Analysis of the unseen failure modes provides another important set of metrics for evaluating the product design and the reliability growth management strategy. Based on the failure modes that have been uncovered during the test, the Crow (AMSAA) projection model can be used to provide estimates about the failure modes that have not yet occurred. Such metrics include the current rate of uncovering new Category B failure modes, the estimated number of unseen Category B failure modes and the estimated failure rate for unseen Category B failure modes. This analysis can provide an indicator of how many problems are yet to be discovered in the design and how much test time will be required to identify and correct those latent causes of failure. The pie chart in Figure 3 represents one method to display this information graphically. The pie chart illustrates the quantity and ratio of seen and unseen failure modes after the completion of a particular phase of testing. 

Figure 3: Seen and unseen failure modes
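
A hedged sketch of two of these metrics follows, again with hypothetical inputs: fitting a power-law (NHPP) process to the first-occurrence times of the distinct B modes gives the current rate of discovering new modes, and integrating that rate over additional test time gives the expected number of new modes that further testing would surface:

    import math

    T = 400.0
    first_times = [25.0, 90.0, 310.0]   # first occurrence of each B mode

    m = len(first_times)
    beta = m / sum(math.log(T / t) for t in first_times)   # MLE shape
    lam = m / T ** beta                                    # MLE scale

    h_T = lam * beta * T ** (beta - 1)   # current new-mode discovery rate
    extra = 200.0                        # additional test time considered
    expected_new = lam * ((T + extra) ** beta - T ** beta)

    print(f"new-mode discovery rate at T: {h_T:.4f} per hour")
    print(f"expected new B modes in the next {extra:.0f} hours: "
          f"{expected_new:.1f}")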

Incorporating Category C Failures 
Although the previous discussion assumed that failure modes would not be corrected (Category A) or that corrective actions would be performed at the end of the testing phase (Category B), it is also possible to implement some corrective actions during the test and then continue testing with the corrective action in place (i.e., a Test-Fix-Test or Test-Fix-Test with Delayed Fixes approach). These failure modes are classified as Category C. Because it is assumed that the effect of the corrective action will be demonstrated empirically as the corrected units continue to operate in the test, there is no need to assign an effectiveness factor to Category C failure modes. The Crow (AMSAA) model (MIL-HDBK-189) is widely used to evaluate the reliability growth in the presence of Category A and Category C failure modes. This approach will likely result in a gradual increase in the reliability of the product during the test time. 
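
As a hedged illustration of the Crow (AMSAA) model fit for Test-Fix-Test data, the sketch below applies the standard maximum likelihood estimates for a single system observed to time T (a time-terminated test), with hypothetical failure times. The expected cumulative number of failures is modeled as lam * t**beta, and beta < 1 indicates reliability growth:

    import math

    failure_times = [12.0, 40.0, 95.0, 180.0, 310.0]   # cumulative hours
    T = 400.0                                          # total test time

    n = len(failure_times)
    beta = n / sum(math.log(T / t) for t in failure_times)   # MLE of beta
    lam = n / T ** beta                                      # MLE of lam

    # Instantaneous failure intensity and demonstrated MTBF at time T.
    rho = lam * beta * T ** (beta - 1)
    print(f"beta = {beta:.3f} (beta < 1 implies growth)")
    print(f"instantaneous MTBF at T: {1 / rho:.1f} hours")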

If the test also includes Category B failure modes, then this gradual increase will also be accompanied by a jump in reliability when the Category B corrective actions are implemented at the end of the test phase. The Generalized Crow Projection model accommodates Category A, B and C failure modes, and Figure 4 displays the MTBF vs. Time plot for such analyses. This plot is similar to the ones shown in Figures 1 and 2, except that it includes a gradual increase in the reliability observed during the test, due to the implementation of fixes for some failure modes while the test was in progress. 

Figure 4: Incorporating Category C failure modes

Conclusion
The reliability growth planning/management strategy and data analysis methodology described in this article will be supported by the next version of ReliaSoft's reliability growth analysis software, RGA++. This software is currently under development, with cooperation from Dr. Crow and other partners from the military and commercial sectors to determine the functional requirements. The RGA++ software is anticipated for release in 2Q 2003 and will provide a complete array of analysis options for both continuous (time-to-failure) and discrete (one-shot, success/failure) data sets, including the incorporation of the Crow (AMSAA) projection model and related analyses described in this article. 

References
Dr. Larry H. Crow has developed and implemented the management and analysis approach described in this article and the article has been written with his cooperation and review. This general presentation of the basic concepts of the approach is based largely on lecture notes, discussions and other information provided by Dr. Crow. In addition, the following documents are also relevant to this discussion: 

United States Department of Defense, MIL-HDBK-189: Reliability Growth Management, February 13, 1981.

International Electrotechnical Commission, IEC 61164: Reliability Growth - Statistical Test and Estimation Methods, June 1995.

NOTE: Two IEC publications on reliability growth, IEC 61164 and IEC 61014, are currently undergoing revision. For more information, search for works in progress at http://www.iec.ch.


Dr. Larry H. Crow

Larry H. Crow is Vice President, Reliability and Sustainment Programs at Alion Science and Technology, Huntsville, Alabama. He held this position at IIT Research Institute before Alion was established in 2002 by 1600 former IITRI employees. Previously, Dr. Crow was Director, Reliability at General Dynamics Advanced Technology Systems (formerly Bell Laboratories ATS). Before joining Bell Laboratories in 1985, Dr. Crow was chief of the Reliability Methodology Office at the US Army Materiel Systems Analysis Activity (AMSAA). He developed the Crow (AMSAA) model and the Crow Projection model, which have been incorporated into US DoD military handbooks as well as national and international standards and service regulations on reliability. Dr. Crow chaired the Tri-Service Committee to develop US MIL-HDBK-189, Reliability Growth Management, and is the principal author of that document. He is also the principal author of IEC 61164, Reliability Growth - Statistical Test and Estimation Methods. He developed the widely used NHPP Power Law model for analyzing the reliability of repairable systems, which is featured in the new IEC 61710, Goodness-of-Fit and Estimation Methods for the Power Law Model. Dr. Crow is an elected Fellow of the American Statistical Association and the Institute of Environmental Sciences and Technology and is on the Board of Directors of the Annual Reliability and Maintainability Symposium (RAMS). He is the recipient of The Florida State University "Grad Made Good" Award for the Year 2000, the highest honor given to a graduate by Florida State University.