Avoiding Common Mistakes and Misapplications in Design for Reliability (DFR)
Design for Reliability (DFR) is a process that describes the entire set of tools that support the effort to increase a product’s reliability. These methodologies are applied from the early concept stage of a design all the way through to product obsolescence. The success of a DFR process is directly related to the selection of the appropriate reliability tools for each phase of the product development life cycle and the correct implementation of those tools.
This article has been adapted from a paper delivered by ReliaSoft engineers at the 2011 Annual Reliability and Maintainability Symposium (RAMS). It examines certain areas of the DFR process where mistakes are common due to misunderstood "common practices" or due to attempts to either oversimplify the process or introduce unnecessary complexity. The observations presented here are based on the authors' collective experience from interactions with customers during consulting projects, training seminars and reliability engineering software development.
Reliability Requirements
Setting well-defined and meaningful reliability requirements is one of the most important steps in a DFR process. Mean Time Between Failures (MTBF) is an example of a reliability requirement that is a standard in many industries, yet is often misused and is inappropriate in most cases. (Note that when referring to non-repairable systems, the more appropriate term is Mean Time To Failure (MTTF), but the two terms are often used interchangeably in practice. Here we will use the more common term MTBF.)
The use of an MTBF as a sole reliability requirement implies a constant failure rate, which is not the case for most systems or components. Even if the assumption of a constant failure rate is not made, the MTBF does not tell the whole story about reliability. For example, consider three components whose lifetimes are modeled with the Weibull distribution, each of which has the exact same MTBF. The fact that the MTBF of the three components is the same does not necessarily imply that the reliability at any given time will be the same. Figure 1 shows the reliability functions of three such components. Although the MTBF is the same for all three, the corresponding reliabilities at different times vary considerably. Therefore, it is clear that the use of a reliability requirement at a specific time (e.g., 95% at 1,000 hours) instead of simply an MTBF will be more descriptive of the expected life of a component.
Figure 1: Reliability function for three components with the same MTBF but different reliabilities at different points in time
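The point behind Figure 1 can be checked numerically. The following Python sketch uses an assumed MTBF of 10,000 hours and three assumed shape parameters (illustrative values, not necessarily those plotted in Figure 1). It relies on the Weibull mean, MTBF = η·Γ(1 + 1/β), to pick a scale parameter η that gives each component the same MTBF, and then evaluates R(t) = exp(-(t/η)^β).

```python
import math

MTBF_HOURS = 10_000.0      # assumed common MTBF, for illustration only
BETAS = (0.5, 1.0, 3.0)    # assumed Weibull shape parameters


def weibull_eta_for_mean(mean, beta):
    """Scale parameter eta such that eta * Gamma(1 + 1/beta) equals `mean`."""
    return mean / math.gamma(1.0 + 1.0 / beta)


def weibull_reliability(t, beta, eta):
    """Two-parameter Weibull reliability, R(t) = exp(-(t/eta)^beta)."""
    return math.exp(-((t / eta) ** beta))


def reliability_table(times=(1_000.0, 5_000.0, 10_000.0)):
    """Map each beta to the list of reliabilities at the requested times."""
    table = {}
    for beta in BETAS:
        eta = weibull_eta_for_mean(MTBF_HOURS, beta)
        table[beta] = [weibull_reliability(t, beta, eta) for t in times]
    return table


if __name__ == "__main__":
    for beta, values in reliability_table().items():
        formatted = ", ".join(f"{r:.4f}" for r in values)
        print(f"beta={beta}: R(1000h, 5000h, 10000h) = {formatted}")
```

With these assumed values, the reliability at 1,000 hours ranges from roughly 64% (β = 0.5) to above 99.9% (β = 3), even though the MTBF is identical in all three cases.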
Another example of an inadequate reliability requirement is a point estimate stated without confidence bounds. Confidence bounds are especially crucial when setting requirements for suppliers, because they ensure that the claimed value has been demonstrated properly. For example, consider a case where two vendors are evaluated in terms of the reliability of their components, and suppose that they both claim that their reliability is 95% at 1,000 hours. Given that information, one would expect that the decision should be made solely based on price because the reliabilities are the same. However, suppose that further investigation revealed that one vendor tested 100 components while the other tested only 5. Given that, it can be determined that the 90% lower confidence bound on reliability for the first vendor is 93%, while for the second vendor it is 78%. Thus, the basis for choosing the best vendor has changed. This simple example demonstrates that unless there is a lower confidence bound attached to the specification, there is no way to evaluate the validity of supplier claims and make valid comparisons.
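To illustrate why sample size matters, here is a minimal sketch of a one-sided lower confidence bound on reliability from pass/fail data, based on the cumulative binomial (Clopper-Pearson) relationship. It is not the calculation behind the 93% and 78% figures above, which depend on data and an analysis method that are not given in this article. The sample sizes and failure counts below are hypothetical, chosen so that both vendors have the same 95% point estimate.

```python
from math import comb


def prob_at_most_f_failures(n, f, reliability):
    """Cumulative binomial: P(at most f failures among n units)."""
    q = 1.0 - reliability
    return sum(comb(n, i) * q**i * reliability ** (n - i) for i in range(f + 1))


def lower_reliability_bound(n, f, confidence, tol=1e-9):
    """One-sided lower bound R_L solving P(failures <= f | n, R_L) = 1 - confidence."""
    if f >= n:
        return 0.0
    target = 1.0 - confidence
    lo, hi = 0.0, 1.0
    # The cumulative probability increases with reliability, so bisect on R.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if prob_at_most_f_failures(n, f, mid) > target:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Hypothetical vendors, both with a 95% point estimate (survivors / tested).
    vendor_a = lower_reliability_bound(n=100, f=5, confidence=0.90)
    vendor_b = lower_reliability_bound(n=20, f=1, confidence=0.90)
    print(f"Vendor A (100 tested, 5 failed): 90% lower bound = {vendor_a:.3f}")
    print(f"Vendor B (20 tested, 1 failed):  90% lower bound = {vendor_b:.3f}")
```

Under these assumptions the vendor with the larger sample demonstrates a noticeably higher lower bound, although both report the same point estimate.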
Usage and Environmental Conditions
The core definition of reliability (the probability that the item will perform its intended function for a specific mission duration under specified conditions) suggests the need to clearly map out and understand the expected use conditions. Nevertheless, we have observed that many of the DFR challenges that organizations face are due to lack of consideration or poor understanding of usage and environmental factors.
The manner in which the product will be used in the hands of the customer should be given sufficient consideration in the design phase. Understanding what constitutes normal use or abuse can help with making the right design choices and selecting the right tests to simulate actual usage conditions during in-house testing. Technical directions in a user’s manual are, in most cases, not enough to protect a product from aggressive usage. Guidance such as "wait until the engine has warmed up to operate," "switch from 4x4 to 2x4 when driving above 55 mph," etc. may be frequently ignored in actual usage. If the tests do not account for aggressive usage, then the gap between in-house test reliability and field reliability is usually very large.
Usage can also be expressed in terms of understanding the proportion of customers who will reach a certain design life. For example, suppose a printer is designed to last five years, but it is also designed to last one million pages. Understanding the distribution of pages printed during five years can make a big difference in design decisions. Any reliability requirement is complete only when it has an associated percentile user. For example, a requirement to prove a certain level of printer component reliability for a 99th percentile user would mean proving that reliability for a specific number of pages printed.
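As a rough illustration of tying a requirement to a percentile user, the sketch below assumes that the number of pages printed in five years follows a lognormal distribution. The median and the log-scale standard deviation are hypothetical values chosen only for this example; in practice they would come from field usage data.

```python
import math
from statistics import NormalDist

MEDIAN_PAGES = 200_000.0   # assumed median pages printed in five years
SIGMA_LOG = 0.8            # assumed standard deviation of ln(pages)


def pages_at_percentile(p, median=MEDIAN_PAGES, sigma=SIGMA_LOG):
    """Pages printed in five years by the p-th percentile user (0 < p < 1)."""
    z = NormalDist().inv_cdf(p)
    return median * math.exp(z * sigma)


def fraction_exceeding(pages, median=MEDIAN_PAGES, sigma=SIGMA_LOG):
    """Fraction of users who print more than `pages` in five years."""
    z = math.log(pages / median) / sigma
    return 1.0 - NormalDist().cdf(z)


if __name__ == "__main__":
    for p in (0.50, 0.95, 0.99):
        print(f"{p:.0%} user prints about {pages_at_percentile(p):,.0f} pages")
    share = fraction_exceeding(1_000_000)
    print(f"Share of users above 1,000,000 pages: {share:.2%}")
```

Under these assumptions, a requirement stated for the 99th percentile user translates into a page count well above the one-million-page design life, which is exactly the kind of mismatch that a usage distribution helps to reveal.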
Environmental conditions are also critical in making the right design choices to support reliability. Often products are tested in a very narrow range of environmental conditions, but they are used in a broader spectrum of conditions in the field. Using the same example of a printer, one can design a printer that will work without failures and paper jams in Arizona, but will fail to operate properly in Florida where the relative humidity is so high that it introduces new failure modes not observed in the tested environment. In this case, the paper in the trays curls from the high humidity environment and causes the printer to jam frequently. Up-front work needs to be done to clearly define the range of environmental conditions for which the product will operate, and testing needs to simulate various profiles of those environmental conditions in order to expose failure modes that would not show up in ambient conditions.
Temperature and humidity are only two of the major environmental conditions that can significantly affect product reliability. Depending on the specific application, solar load, water quality, rain, wind, snow, sand, dust, mud, hail, thermal cycling, voltage input stability and many other environmental conditions can be key factors that affect the product’s life. Major effort should be made up-front in the DFR process in order to identify realistic environmental conditions. The goal is to incorporate environmental concerns into the design early on, and also to be able to design tests that will reflect the actual environmental conditions to which the product will be exposed.
Preparation Before Testing
Reliability tests play an integral role in a DFR process because they provide the means to quantify reliability. In general, a lot of money, effort and resources can be saved through better preparation before reliability testing. For example, accelerated tests are often performed with a large number of stresses. This can result in very expensive tests or even unsuccessful tests that yield failure modes that are not expected to be seen in practice. However, if sufficient effort is spent in planning the test, the process will yield the same results in a much more efficient way. A best practice in this case would be to use Design of Experiments (DOE) methodologies to identify the most significant stresses that affect the life of the product. By performing small experiments to determine the few most important stresses, the overall cost of the test will be decreased significantly. Furthermore, the data collected from those experiments can also be used for planning the test so that the available samples are optimally allocated across the different stress levels.
Execution and Analysis of Accelerated Tests
Accelerated testing provides the benefit of uncovering failures in a short time for components that have very high life expectancy. However, it is important to ensure that the principles of accelerated testing and the assumptions of the statistical models used are not violated. When deciding on the levels of the stresses to be applied during the test, one must consider the design limits of the component. When the applied stress goes beyond those limits, then new failure modes that will not be seen in the field are introduced and the results of the test are not valid.
Besides engineering judgment, contour plots can be a useful tool in determining whether the component is failing the same way across all stress levels. For example, consider the contour plots shown in Figure 2, which represent data sets obtained at three different stress levels. The contour on the far left represents the highest stress, the contour in the middle represents the middle stress and the contour on the far right represents the lowest stress level. The Weibull distribution was used to analyze the data at each level. The assumption is that the component will show the same failure behavior across all stress levels and therefore the beta parameter remains the same. As shown in the figure, the beta parameter is significantly different for the data set obtained from the highest stress level at the 90% confidence level, indicating that the failure behavior has changed at this stress level. Moreover, even if the stress levels are not beyond the design limits, keep in mind that the farther the stress level is from usage conditions, the more error is introduced into the results of the analysis. In many cases, in an effort to minimize the duration and cost of a test, engineers fail to consider those principles, which results in tests that offer little or no value for reliability analysis.
Figure 2: Contour plots to test the assumption of a consistent beta parameter across stress levels
Another area that requires careful attention is the analysis of the data obtained from the accelerated test. The most important part of the analysis, with the greatest effect on the results, is choosing the appropriate model to describe the life-stress relationship. There are a number of models that have been suggested to describe life-stress relationships, and the appropriateness of each model depends on the applied stress. For example, the Arrhenius model is commonly used when the stress is temperature, while the inverse power law model is often used when the stress is mechanical. Therefore, when choosing a model, the practitioner should carefully consider the applied stress and the physics of failure of the component.
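For reference, the two life-stress relationships mentioned above can be written as simple functions. This is a sketch of their commonly used forms, L(T) = C·exp(B/T) for Arrhenius (T in kelvin) and L(S) = 1/(K·S^n) for the inverse power law. All parameter values below are hypothetical and serve only to show the shape of each relationship.

```python
import math


def arrhenius_life(temp_kelvin, b, c):
    """Arrhenius life-stress relationship: L(T) = C * exp(B / T), T in kelvin."""
    return c * math.exp(b / temp_kelvin)


def inverse_power_law_life(stress, k, n):
    """Inverse power law life-stress relationship: L(S) = 1 / (K * S**n)."""
    return 1.0 / (k * stress**n)


if __name__ == "__main__":
    # Hypothetical model parameters, for illustration only.
    B, C = 8_000.0, 1.0e-6     # Arrhenius parameters (B in kelvin, C in hours)
    K, N = 1.0e-9, 2.5         # inverse power law parameters

    for temp_c in (40.0, 60.0, 85.0):
        life = arrhenius_life(temp_c + 273.15, B, C)
        print(f"Arrhenius life at {temp_c:.0f} C: {life:,.0f} h")

    for load in (100.0, 200.0, 400.0):
        life = inverse_power_law_life(load, K, N)
        print(f"Inverse power law life at stress {load:.0f}: {life:,.0f} h")
```

Both functions give a life that decreases steadily as the stress increases, so neither can represent a life-stress trend that reverses direction within the tested range.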
A common characteristic of most of the available life-stress relationship models is that they are monotonic, which means that the life decreases as the stress increases. In practice, we have observed many cases where this assumption is violated. For example, in one such scenario, product life improved as temperature increased to a certain point; but once the upper limit of that particular temperature range was reached, a chemical reaction kicked in to reverse the trend and product life then began to decrease as temperature rose. This is illustrated in Figure 3, where two different Arrhenius models were applied to the data. We can see that moving from the lowest stress level to the middle stress level, life is increasing when stress is increasing. On the other hand, moving from the middle stress level to the highest stress level, life is decreasing when stress is increasing. This type of life-stress relationship is non-monotonic, and models such as the Arrhenius cannot be used to predict life at usage conditions. When this situation occurs, the engineer may need to redesign the accelerated test so that the applied stress levels are in the monotonic region of the life vs. stress relationship. Alternatively, if the data set collected from the original test is adequate, the analysis can consider only those data points that are in the monotonic region.
Figure 3: Non-monotonic life-stress relationship
Understanding Failure Rate Behavior
One of the most important aspects of reliability engineering practice is to be able to characterize the failure rate behavior of components and systems. Specifically, DFR activities are typically used to 1) design out decreasing failure rate behavior at the start of the life of the product (often called infant mortality), 2) increase margins/robustness and reduce variation during the useful life and 3) extend wearout beyond the designed life (or institute preventive maintenance before wearout if it makes economic sense).
Using inappropriate models or assumptions can lead to erroneous conclusions about the failure rate behavior of a component or system. As mentioned previously in the discussion of defining appropriate reliability requirements, one of the most common missteps is the assumption of an exponential distribution as the underlying lifetime model for reliability data. The exponential distribution implies a constant failure rate and superimposes that behavior on the modeled data. As a result, the analyst misses the signals that can indicate infant mortality or wearout mechanisms (or a mix of these as subpopulations in the data). This very often leads to poor design decisions concerning the reliability behavior of products, inaccurate estimation of warranty costs and many surprises after the product is fielded.
We have frequently observed the following scenario when working with customers in a variety of industries: the design team requests component reliability information from the suppliers, and many of the suppliers provide a single estimate as MTBF/MTTF without associated confidence bounds. In most cases, these numbers are generated by testing units for a short time, so the parameters are estimated based on a lot of suspensions and very few failure points. At the same time, an exponential distribution is often assumed. This can create overly optimistic reliability estimates. If the same test were continued longer and the analysis performed with a flexible distribution such as a two-parameter Weibull, the same test would provide a more pessimistic estimation of reliability.
A similar situation occurs in accelerated test data analysis when key parameters are assumed instead of being calculated from the data. A good example is the decision to fix the activation energy in an Arrhenius model at an assumed value. That value can drastically affect the resulting acceleration factors. As a result, the extrapolation back to use conditions can easily be adjusted to produce a wide range of values.
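The sensitivity to an assumed activation energy is easy to see from the Arrhenius acceleration factor, AF = exp[(Ea/k)(1/T_use - 1/T_test)], where k is Boltzmann's constant and temperatures are in kelvin. The sketch below evaluates it for a few assumed activation energies; the use and test temperatures and the Ea values are hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5   # Boltzmann constant in eV/K


def arrhenius_acceleration_factor(ea_ev, use_temp_c, test_temp_c):
    """Arrhenius acceleration factor between a use and a test temperature."""
    t_use = use_temp_c + 273.15
    t_test = test_temp_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use - 1.0 / t_test))


if __name__ == "__main__":
    USE_C, TEST_C = 40.0, 85.0          # hypothetical use and test temperatures
    for ea in (0.5, 0.7, 1.0):          # hypothetical assumed activation energies
        af = arrhenius_acceleration_factor(ea, USE_C, TEST_C)
        print(f"Ea = {ea:.1f} eV -> acceleration factor = {af:.1f}")
```

With these assumed temperatures, moving the assumed activation energy from 0.5 eV to 1.0 eV changes the acceleration factor from about 10 to about 105, and the life extrapolated to use conditions scales by the same ratio.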
Reliability Demonstration Tests
Reliability demonstration tests are a common practice right before a product goes into mass production in order to assure that the reliability target has been met. The importance of these tests is that they provide a final validation of the redesigns and the reliability tests that took place during the design phase. Therefore, the proper application of those tests is critical in the DFR process before a product goes into production and out into the field where it is more costly to deal with reliability issues. However, missteps can take place both when planning the demonstration test and during the execution of these tests.
In terms of planning, the binomial equation can be an effective tool for reliability demonstration test design [4,5]. With this approach, the analyst specifies a reliability goal at a given confidence level and the number of failures that are "allowed" during testing. Then he/she uses the model to compute the required number of units to be tested for a given test time, or the test time required for the given number of units. However, as we have observed in many cases, the design of such tests may be driven solely by the resource constraints of available time or samples without considering the statistical theory. As a result, the tests provide no statistical significance in terms of achieved reliability. In other words, time and money are spent without really proving that the reliability goals have been met, and the reliability engineer would have been better off not performing the demonstration test at all!
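A minimal sketch of the binomial approach follows. It covers only the nonparametric case in which each unit is tested for the same duration as the time at which reliability is to be demonstrated; trading test time against sample size requires an additional life distribution assumption that is not shown here. The function names are illustrative. The governing relationship is 1 - CL = Σ_{i=0}^{f} C(n, i)(1 - R)^i R^(n - i), where R is the reliability to demonstrate, CL the confidence level, f the allowed number of failures and n the sample size.

```python
from math import comb


def prob_pass(n, allowed_failures, reliability):
    """P(at most `allowed_failures` failures among n units), cumulative binomial."""
    q = 1.0 - reliability
    return sum(
        comb(n, i) * q**i * reliability ** (n - i)
        for i in range(allowed_failures + 1)
    )


def required_sample_size(reliability, confidence, allowed_failures=0, max_n=100_000):
    """Smallest n that demonstrates `reliability` at `confidence`."""
    for n in range(allowed_failures + 1, max_n + 1):
        if prob_pass(n, allowed_failures, reliability) <= 1.0 - confidence:
            return n
    raise ValueError("no sample size up to max_n satisfies the requirement")


def demonstrated_confidence(n, failures, reliability):
    """Confidence with which `reliability` is demonstrated by n units and `failures` failures."""
    return 1.0 - prob_pass(n, failures, reliability)


if __name__ == "__main__":
    for f in (0, 1, 2):
        n = required_sample_size(0.90, 0.90, allowed_failures=f)
        print(f"R=90%, CL=90%, {f} failure(s) allowed -> test {n} units")
    print(f"22 units, 1 failure: confidence = {demonstrated_confidence(22, 1, 0.90):.3f}")
```

For a 90% reliability target at 90% confidence, this gives 22 units with no failures allowed, 38 with one and 52 with two. With 22 units and one observed failure, the demonstrated confidence drops to about 66%, which shows how a plan sized only by available samples can fall well short of the stated goal.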
In terms of execution, another common misstep occurs when the engineers fail to reevaluate the test design after a failure occurs during a "zero failure test." If the test has been designed as a "zero failure test," it means that no failures should be observed in the sample for the given test duration in order to demonstrate the specified reliability for the specified time. However, in practice, when the first failure occurs, engineers sometimes mistakenly recalculate the test time for a "one failure test." Then when the second failure occurs, they recalculate the time for a "two failure test," and so on. Of course, this practice usually leads to a situation where the engineer is essentially chasing his tail, with no useful results. Instead of spending valuable resources to try to force the original test plan to succeed, the engineers should either go back and reevaluate the test design from a reliability perspective or run the rest of the test units to failure and calculate the reliability using traditional life data analysis methods.
Effects of Manufacturing
Well-done DFR still needs to be supported by manufacturing reliability efforts to ensure that the inherent design reliability is not degraded or unstable. Manufacturing may introduce variations in material, processes, manufacturing sites, human operators, contamination, etc. A common mistake is to assume that the reliability of a unit coming off the manufacturing line will automatically match the test results of the fine-tuned prototype units. In fact, Meeker and Escobar have identified differences between prototype and production units in cleanliness and care, materials and parts, and highly trained technicians vs. factory workers.
Not paying attention to the effects of manufacturing on product reliability can result in units showing infant mortality, which is usually the result of misassembly, inappropriate transfer or storage, the line being unable to conform to designed specifications, or uncontrolled key product characteristics. The best approach is to identify and address manufacturing issues through activities such as Process FMEA (PFMEA), manufacturing control and screening/monitoring plans. Burn-in testing can be used to address infant mortality, but this is not the preferred choice because the ideal goal is to design out infant mortality [8,9].
It is also important to note that variation in manufacturing is not resolved as soon as the first product out the door is reliable. Variation in terms of supplier material, processes, machinery, personnel skills and other factors can influence the reliability characteristics of the units produced. A common misstep is to assume that verifying the reliability of the products through a single demonstration test is adequate. Instead, a thorough DFR process examines the reliability characteristics of the produced units at regular intervals over time to understand what changed. The solution here is two-fold. It involves mostly quality control approaches (such as statistical process control for key characteristics that can influence product reliability), but it can also include "ongoing reliability tests." These tests can also address the impact of current product engineering changes on product reliability. For example, a firmware upgrade can cause unexpected interactions that can lead to altered reliability characteristics. This and similar issues can be identified through an ongoing reliability test initiative.
Warranty Data Analysis
Warranty data analysis is another key step in a DFR process, which helps to assure reliability monitoring throughout the product’s life cycle. DFR does not stop when the product ships. Instead, a warranty tracking and analysis process should be built into the plan in order to assure field reliability.
One seemingly trivial but very common misstep during this stage is failing to implement the infrastructure that allows the capture of time-to-failure data in the field. A simple Nevada chart warranty data analysis requires knowledge of when units were shipped and when they were returned in order to apply Weibull analysis methodologies. However, it is very common that failure time information is not available for the product due to lack of infrastructure, rush to market or overly complex supply chain processes. In such cases, warranty analysis either cannot be thoroughly conducted or the data set contains too much noise.
The next misstep during the warranty stage is to ignore "suspended" units (i.e., units that have not failed) when analyzing data from the field. But the suspension data is of as much value as the failure times, and ignoring it will lead to erroneous analysis results. Consider this example: a company has just launched its new product and started to track reliability-related failures in the field. Every month, 500 new products enter service. Table 1 shows the Nevada chart of returns during the first six months of service.
Table 1: Warranty return data (row = month in service, column = month returned)
This chart indicates, for example, that a total of five units were returned in August. Two of those units had been in service since May, two since June and one since July. In other words, one can read the Nevada chart diagonally in order to count the number of units that failed after a specific period of time in service. Table 2 shows the time-to-failure data compiled for reliability analysis.
Table 2: Time-to-failure data
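The diagonal reading of a Nevada chart can be sketched in a few lines of Python. The chart below is hypothetical: only the August column (two units shipped in May, two in June and one in July) follows the example in the text, and the remaining counts are made up for illustration. Each row is a ship month, each column a return month, and the time in service is the difference between the two. Units not returned by the end of the last return month are counted as suspensions at their current age.

```python
from collections import Counter

SHIPPED_PER_MONTH = 500

# Rows: ship month (May..Oct). Columns: return month (Jun..Nov).
# Hypothetical counts, except that the Aug column matches the text's example.
CHART = [
    [1, 1, 2, 2, 3, 3],   # shipped May
    [0, 1, 2, 1, 2, 3],   # shipped Jun
    [0, 0, 1, 1, 2, 2],   # shipped Jul
    [0, 0, 0, 1, 1, 2],   # shipped Aug
    [0, 0, 0, 0, 0, 1],   # shipped Sep
    [0, 0, 0, 0, 0, 1],   # shipped Oct
]


def nevada_to_life_data(chart, shipped_per_month):
    """Return (failures, suspensions) as Counters keyed by months in service."""
    failures, suspensions = Counter(), Counter()
    last_return_period = len(chart[0])
    for ship_idx, row in enumerate(chart):
        for col_idx, count in enumerate(row):
            if count:
                # Column 0 is one month after the first ship month.
                failures[col_idx + 1 - ship_idx] += count
        survivors = shipped_per_month - sum(row)
        suspensions[last_return_period - ship_idx] += survivors
    return failures, suspensions


if __name__ == "__main__":
    fails, susp = nevada_to_life_data(CHART, SHIPPED_PER_MONTH)
    for age in sorted(fails):
        print(f"{age} month(s) in service: {fails[age]} failed, {susp[age]} suspended")
```

Keeping the suspension counts alongside the failure counts at this stage makes it straightforward to include them in the life data analysis later.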
If the analyst uses maximum likelihood estimation (MLE) and fits a two-parameter Weibull distribution, the calculated parameters are β = 1.6851 and η = 2.7986 months. The prediction for the reliability at the warranty time of 36 months is almost zero.
However, the correct approach to predict reliability is to also consider all the suspended units as shown in Table 3.
Table 3: Data set with failures and suspensions
With this data set, when the analyst again uses MLE and fits a two-parameter Weibull distribution, the calculated parameters are β = 1.3621 and η = 130.3 months. The prediction for the reliability at the warranty time of 36 months is now R(36) = 84%.
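The effect of suspensions can be reproduced in outline with a small maximum likelihood fit. The sketch below fits a two-parameter Weibull distribution to grouped failure and suspension data by solving the standard right-censored MLE equation for β by bisection, then computing η from β. The counts are hypothetical and are not the values from Tables 1 to 3, so the fitted parameters will not match those quoted above; monthly returns are also treated as exact failure times, which is a simplification. Only the qualitative effect is the point.

```python
import math

# Hypothetical grouped data: (months in service, number of units).
FAILURES = [(1, 5), (2, 6), (3, 7), (4, 6), (5, 6), (6, 3)]
SUSPENSIONS = [(1, 499), (2, 499), (3, 496), (4, 494), (5, 491), (6, 488)]


def fit_weibull_mle(failures, suspensions=()):
    """Return (beta, eta) maximizing the right-censored Weibull likelihood."""
    units = list(failures) + list(suspensions)
    n_fail = sum(count for _, count in failures)
    mean_log_fail = sum(c * math.log(t) for t, c in failures) / n_fail

    def score(beta):
        # Profile likelihood equation for beta; its root is the MLE.
        s0 = sum(c * t**beta for t, c in units)
        s1 = sum(c * t**beta * math.log(t) for t, c in units)
        return s1 / s0 - 1.0 / beta - mean_log_fail

    lo, hi = 0.01, 20.0   # assumed bracket for beta
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if score(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    eta = (sum(c * t**beta for t, c in units) / n_fail) ** (1.0 / beta)
    return beta, eta


def reliability(t, beta, eta):
    """Two-parameter Weibull reliability at time t."""
    return math.exp(-((t / eta) ** beta))


if __name__ == "__main__":
    for label, susp in (("failures only", ()), ("with suspensions", SUSPENSIONS)):
        beta, eta = fit_weibull_mle(FAILURES, susp)
        r36 = reliability(36.0, beta, eta)
        print(f"{label}: beta={beta:.3f}, eta={eta:.2f} months, R(36)={r36:.4f}")
```

With this hypothetical data, ignoring the suspensions yields an η of only a few months and an R(36) of essentially zero, while including them gives an η several times larger and a far less pessimistic prediction.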
The previous example clearly illustrates the vast difference in reliability estimation during the warranty period when considering or ignoring suspensions, and suggests that suspensions should always be included for accurate warranty analysis. However, another common misstep is the extreme application of this rule, that is, counting units as suspensions when in reality there is no information at all on whether or not they have survived. This is a common scenario in fielded systems analysis beyond warranty life. In this case, there is no reliable way to track whether a unit is still operating beyond the warranty time, but the analyst assumes that it is unless there is definite knowledge that it failed. Without a robust way to know the state of the fielded units, the assumption that the units are still operating leads to overly optimistic reliability estimates.
As both of these examples illustrate, warranty analysis needs to rely on good data and true knowledge of the state of fielded units, which can be very challenging.
Conclusion
In this article, we have outlined some special topics for consideration during the DFR process. Extra care and focus need to be used when deploying the DFR process so that these issues do not hinder the progress of aiming for, growing, achieving and sustaining high reliability. Although the list of caveats provided here is by no means comprehensive, the underlying idea is that some DFR activities require special attention for successful and meaningful execution. Otherwise, they become a check mark in a long list of activities but do not truly contribute to the reliability efforts. We hope that practitioners can benefit from this material and focus on improving the quality of key DFR activities in their organizations. With product development becoming more dynamic and complex every year, designing for reliability is becoming the only way to quickly meet reliability goals and the new extreme time-to-market requirements.
References
[1] G. Sarakakis, A. Gerokostopoulos and A. Mettas, "Special Topics for Consideration in a Design for Reliability Process," in the 2011 Proceedings of the Annual Reliability and Maintainability Symposium, 2011.
[2] W. Nelson, Accelerated Testing: Statistical Models, Test Plans, and Data Analysis, New York, NY: Wiley, 1990.
[3] D. J. Groebel, A. Mettas and F. B. Sun, "Determination and Interpretation of Activation Energy Using Accelerated-Test Data," in the 2001 Proceedings of the Annual Reliability and Maintainability Symposium, 2001.
[4] D. Kececioglu, Reliability & Life Testing Handbook, vol. 2, New Jersey: Prentice-Hall, Inc., 1994.
[5] L. M. Leemis, "Lower System Reliability Bounds From Binary Failure Data Using Bootstrapping," in Journal of Quality Technology, vol. 38 (1), pp. 2-13, 2006.
[6] C. Carlson, G. Sarakakis, D. Groebel and A. Mettas, "Best Practices for Effective Reliability Program Plans," in the 2010 Proceedings of the Annual Reliability and Maintainability Symposium, 2010.
[7] W. Q. Meeker and L. A. Escobar, "Pitfalls of Accelerated Testing," in IEEE Trans. Reliability, vol. 47 (June), pp. 114-118, 1998.
[8] W. Q. Meeker and L. A. Escobar, Statistical Methods for Reliability Data, New York: John Wiley & Sons, Inc., 1998, pp. 520-522.
[9] W. Q. Meeker and M. Hamada, "Statistical Tools for the Rapid Development and Evaluation of High-Reliability Products," in IEEE Trans. Reliability, vol. 44 (June), pp. 187-198, 1995.