Mean Time Between Failure (MTBF) is an important metric for estimating the lifetime of embedded system components, but it is difficult to calculate accurately. The resulting figures are often unreliable, which has generated a backlash against such calculations.
Reliability is typically defined as the probability that a device will perform as required for a specified period of time. This seems simple, and to a degree it can be, given a large enough sample size and a long enough period of time. The main issue with deriving such figures is that they are required at a product’s release – not at the end of its lifetime, when actual reliability can be determined.
Retrospectively calculating the reliability a component or device delivered over its lifetime is fairly rudimentary math – total time/total failures. This is all well and good for near-obsolete products, and potentially useful as evidence of the reliability of a typical product, but new integrators want to know how reliable this product is, not the previous incarnation.
Increasingly often, beyond a general acceptance of the estimated lifetime of industrial electronics, reliability is specified upfront at the earliest specification stage. Whilst this is more than logical – for instance, one must decide a warranty period based on an estimated lifetime – we now need to quantify that reliability. This is where our old friend, or increasingly, enemy, Mean Time Between Failure (MTBF) comes in. MTBF and “asset life” increasingly go hand in hand, but how accurate are any of these figures and what do they actually tell us?
It’s also worth pointing out MTBF’s less common cousin, Mean Time To Failure (MTTF). MTTF is generally used for a non-repairable product, so it is applied more often to atomic components than to an assembled product. MTTF is calculated as total time to failure/number of units.
Both have gaping holes in their accuracy; predicting the reliability of any individual unit is a hugely complex calculation. To provide an example of this minefield, a client recently asked whether the bespoke product we manufacture for them is suitable for a 10-year asset life. In asking this, they wanted us to provide evidence of a 10-year MTBF.
Interestingly, what would seem the most logical way to calculate MTBF gave the most bizarre result! Because the product has been in manufacture and deployed for more than 5 years, we had the gift of substantial historical data, unlike with a new product. Unfortunately, that data (approximately 5,000 units, deployed for an average of 3 years, with around 14 failures) gives an MTBF of more than 1,000 years: 5,000 × 3 / 14 is roughly 1,071!
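To make that arithmetic concrete, here is a minimal Python sketch using the approximate field figures quoted above. The function names are purely illustrative, and the survival calculation assumes a constant failure rate (the exponential model), which is an assumption added here rather than something the field data establishes.

```python
import math


def fleet_mtbf_years(units, avg_years_deployed, failures):
    """Retrospective MTBF: total operating time divided by total failures."""
    if failures <= 0:
        raise ValueError("at least one failure is needed for a finite MTBF")
    return units * avg_years_deployed / failures


def survival_probability(mtbf_years, mission_years):
    """Chance of no failure over mission_years.

    ASSUMPTION: constant failure rate (exponential model). Real products
    rarely behave this way across their whole life.
    """
    return math.exp(-mission_years / mtbf_years)


# Approximate figures from the text: 5,000 units, 3 years average, 14 failures.
mtbf = fleet_mtbf_years(units=5000, avg_years_deployed=3, failures=14)
print(f"MTBF: {mtbf:.0f} years")  # MTBF: 1071 years
print(f"P(no failure in 10 years): {survival_probability(mtbf, 10):.4f}")  # 0.9907
```

Read this way, the 1,071-year figure is a statement about the failure rate of the fleet during the years observed, not a prediction that any single unit will last a millennium.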
As much as I’d like to gloat about our bespoke product’s reliability, and although my figures entirely support this as a true MTBF figure, no one could realistically believe that even the materials the product is constructed from will survive this length of time – though that could well be true of the plastic enclosure!
The second, perhaps more realistic, method considers only one component: the weakest link. By definition, the weakest link is the component most likely to fail, and thus the most likely to fail first. So should no calculation exist at all, and that component’s figure just be passed through to the final product?
I liken the way in which MTBF is presented to how automobile manufacturers declare fuel consumption figures. The real-world MPG achieved in a vehicle rarely, if ever, matches the extravagant claims of the manufacturer, as the claimed figure was obtained in a far from real-world test with vents sealed, no wind, etc. Likewise, a component manufacturer’s MTBF is unlikely to encompass all, or even any, of the extraneous factors that will affect it – be that humidity, temperature, vibration, or shock. What these conditions were during testing is almost never documented, so any particular MTBF figure is rarely comparable to the next. Unfortunately, the same problem carries through to the final product; MTBF simply doesn’t cover the expected usage conditions or what the product lifetime should be.
The calculation of reliability and likelihood of failure has been studied in depth. Observable phenomena such as the “bathtub” effect are well documented but very difficult to capture in a single “hours” integer. Weibull analysis, which determines where a population of product currently lies in the bathtub, is well worth researching further – alongside Accelerated Life Testing, which tries to compress an individual unit’s passage of time, though not quite a millennium!
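As a rough illustration of why the bathtub resists a single number, the Weibull hazard (instantaneous failure rate) can be sketched in Python. A shape parameter below 1 gives a falling failure rate (early-life failures), a shape of exactly 1 gives a constant rate (the flat bottom of the bathtub, where a single MTBF figure is meaningful), and a shape above 1 gives a rising rate (wear-out). The shape and scale values below are hypothetical, chosen only to show the three regimes; they are not fitted to any real product data.

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate h(t) = (shape/scale) * (t/scale)**(shape - 1)."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must all be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


# Hypothetical parameters: one scale (in years) and three illustrative shapes.
SCALE_YEARS = 10.0
REGIMES = [("early-life", 0.5), ("useful-life", 1.0), ("wear-out", 3.0)]

for label, shape in REGIMES:
    rates = [weibull_hazard(t, shape, SCALE_YEARS) for t in (1.0, 5.0, 10.0)]
    print(f"{label:12s} shape={shape}: " + ", ".join(f"{r:.4f}" for r in rates))
```

Only the middle regime has a failure rate that stays the same from year to year, which is the condition a single MTBF figure quietly relies on.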
An increasingly popular website, www.nomtbf.com, is very much worth a read; it is pushing a backlash against this age-old quantification method. The reality, though, is that there is nothing even close to a “right answer” for truly calculating reliability.