MTBF help please

Rob Lister · Feb 21, 2006

I'm not an engineer but, as a technical writer in our proposal shop, I argue with them all the time. I often play one at customer presentations and such.

We have a 'box' that contains 20 (critical) cards.
Each critical card has a Mean Time Between Failures (MTBF) of 80,000 hours.

Engineer states that the 'box' MTBF is 80,000 hours, since no card within it has an MTBF below 80,000 hours.

Proposal-writer/technical-writer Rob states that mtbf is
Box mtbf = 1 / [( 1/80000 ) * number of cards] = 4000 hours

Who's right?

Him,
Me,
Neither

eri · Feb 21, 2006

Hmm. If one card fails, does the box stop working?

brodski · Feb 21, 2006

I am not a mathematician, but your colleague seems to be arguing that the more cards you have in a box, the more likely each individual one is to fail early, as if the very presence of the card will lead to failure.
Ask him this: if you put 80,000 cards in a box, would it fail within an hour?
I think you are right, and he is wrong.

Rob Lister · Feb 21, 2006

eri said:
Hmm. If one card fails, does the box stop working?

hence the word (critical)

eri · Feb 21, 2006

The way I understood it was, you have a box with 20 equal components that all need to be functional for the box to work. If one goes, they all go. If the average lifetime of a component is 80,000 hours, how long can you expect the box to last? So what's the spread around the average lifetime? If the standard deviation around the 80,000 is, say, 10,000 hours, then from a normal (Gaussian) distribution you would expect about 68% of the components to last 70,000 - 90,000 hours, another 27% to fall in the rest of the 60,000 - 100,000 range, most of the remaining ~5% to fall in the rest of the 50,000 - 110,000 range, and so on. So if you have 20 components, I would say that you have a decent chance of having one that will last 50,000 hours, so that would be the shortest expected lifetime of the box. Assuming a standard deviation of 10,000 hours, that is.

Can anyone else see something wrong with this?

Rob Lister · Feb 21, 2006

eri said:
The way I understood it was, you have a box with 20 equal components that all need to be functional for the box to work. If one goes, they all go. If the average lifetime of a component is 80,000 hours, how long can you expect the box to last? So what's the spread around the average lifetime? If the standard deviation around the 80,000 is, say, 10,000 hours, then from a normal (Gaussian) distribution you would expect about 68% of the components to last 70,000 - 90,000 hours, another 27% to fall in the rest of the 60,000 - 100,000 range, most of the remaining ~5% to fall in the rest of the 50,000 - 110,000 range, and so on. So if you have 20 components, I would say that you have a decent chance of having one that will last 50,000 hours, so that would be the shortest expected lifetime of the box. Assuming a standard deviation of 10,000 hours, that is.

Can anyone else see something wrong with this?

Hmmm. I don't know the SD around the mean. Basically what I do is grab the spec sheets that list...well, they list the specs. One spec on almost every spec sheet is MTBF (another is Mean Time To Repair, but I've been trained (ordered by higher authority) to ignore MTTR). IOW, I don't know the SD around the mean.

GodMark2 · Feb 21, 2006

eri said:
The way I understood it was, you have a box with 20 equal components that all need to be functional for the box to work. If one goes, they all go. If the average lifetime of a component is 80,000 hours, how long can you expect the box to last? So what's the spread around the average lifetime? If the standard deviation around the 80,000 is, say, 10,000 hours, then from a normal (Gaussian) distribution you would expect about 68% of the components to last 70,000 - 90,000 hours, another 27% to fall in the rest of the 60,000 - 100,000 range, most of the remaining ~5% to fall in the rest of the 50,000 - 110,000 range, and so on. So if you have 20 components, I would say that you have a decent chance of having one that will last 50,000 hours, so that would be the shortest expected lifetime of the box. Assuming a standard deviation of 10,000 hours, that is.

Can anyone else see something wrong with this?

Nope, you're right on the money. Part of my job is determining the MTBF for complicated systems. You need to know more than just the MTBF for each part to determine this. The distribution is very important.

If you have a simple system with two parts, each with an MTBF of 1 hour, you know only that the average time at which one will fail is 1 hour from start. One could fail immediately, and one two hours later, and the average would still be 1 hour. In that case, the MTBF of the AND-combined system (both must be working) could be as low as one minute.

On the other hand, if the components all work for at least 59 minutes, with the average failure of 1 hour, then you can be quite confident that the system as a whole will work for at least 59 minutes.

Zep · Feb 21, 2006

The calculation does involve SD's and such. Generally speaking, the more components involved, the flatter the reliability curve, and the bigger the SD of the probability of failure in a given time period.

Incidentally, this stuff was first encountered when they started using radio valves to build the first computers. With many thousands of valves required to be working simultaneously, it was thought that it would be lucky if only a few minutes at a time of useful work could be obtained before something blew. However it was found that by increasing the MTBF of all components, the overall MTBF did increase to more useful timespans (they ran all the valves at lower voltages to reduce burnouts).

VPescado · Feb 22, 2006

If we assume that the MTBF for a card is independent of how long it has been working (an assumption usually used for lightbulbs and atomic decay - since you didn't specify the nature of the cards, I am not above assuming each card is a [highly critical] lightbulb), then the 4,000 hour figure is correct.

Here is why (it's been 13 years since I last studied stochastics, so I apologize if I'm a bit rusty on terminology):

Based on our assumption, the probability that any given card will fail within some short time span (call it t) is a constant (call it p). Which means that the probability of one of the 20 cards failing in time t is 20 p (for very small t, the probability of multiple failures occurring is negligible). For this reason we can think of the box as a super-card that is 20 times more likely to fail than one of the original cards - or, put another way, it has an MTBF that is 1/20th of the original card's.
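To make the constant-failure-rate argument concrete, here is a minimal Python sketch. It assumes every card has an exponentially distributed life (the lightbulb assumption above) and that the box dies with its first card failure. The function names `series_mtbf` and `simulate_box_mtbf` are made up for illustration.

```python
import random

def series_mtbf(card_mtbfs):
    """MTBF of a box that fails when any card fails, assuming constant failure rates."""
    total_rate = sum(1.0 / m for m in card_mtbfs)  # failure rates add in series
    return 1.0 / total_rate

def simulate_box_mtbf(card_mtbf, n_cards, trials, seed=0):
    """Monte Carlo check: box life is the earliest of n exponential card lives."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += min(rng.expovariate(1.0 / card_mtbf) for _ in range(n_cards))
    return total / trials

analytic = series_mtbf([80_000] * 20)
simulated = simulate_box_mtbf(80_000, 20, trials=20_000)
print(f"analytic:  {analytic:.0f} hours")   # 4000
print(f"simulated: {simulated:.0f} hours")  # close to 4000, varies with the seed
```

The simulation is only a sanity check on the rate-adding shortcut. It says nothing about whether real cards actually have a constant failure rate.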

jj · Feb 22, 2006

It all depends on the failure distribution. Without knowing a whole lot more than just the MTBF mean (that is why it's "mean time between failure") you really can't say.

If each card fails randomly once per 80000 hours, with uniform distribution between 0 and 160000 hours, ...

Vs. if each fails at 80000 hours with a Gaussian with sigma .001 hours ...

Very different answers.
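A rough way to see how much the distribution matters is to simulate a 20-card box under a few card-life models that all share the same 80,000 hour mean. This is only a sketch: the three models (exponential, uniform on 0 to 160,000, and a very narrow Gaussian) are illustrative assumptions, not data about any real card, and `mean_box_life` is a hypothetical helper name.

```python
import random

def mean_box_life(draw_card_life, n_cards=20, trials=20_000, seed=1):
    """Average life of a box that fails as soon as its first card fails."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += min(draw_card_life(rng) for _ in range(n_cards))
    return total / trials

MEAN = 80_000  # every model below has this mean card life, in hours

models = {
    "exponential (constant rate)": lambda rng: rng.expovariate(1.0 / MEAN),
    "uniform on 0..160,000": lambda rng: rng.uniform(0, 2 * MEAN),
    "Gaussian, sigma 0.001": lambda rng: rng.gauss(MEAN, 0.001),
}

results = {name: mean_box_life(draw) for name, draw in models.items()}
for name, hours in results.items():
    print(f"{name:30s} {hours:10.0f} hours")
```

Under these assumptions the box MTBF comes out near 4,000 hours for the exponential model, near 160,000/21 (about 7,600) hours for the uniform model, and essentially 80,000 hours for the narrow Gaussian.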

Rob Lister · Feb 22, 2006

Thanks all...I think.

In other words, not enough data.

Problem is...now I have to research each card and find out how they each established mtbf.

Still...thanks.

chance · Feb 22, 2006

Rob Lister said:
I'm not an engineer but, as a technical writer in our proposal shop I argue with them all the time. I often play one at customer presentations and such.

We have a 'box' that contains 20 (critical) cards.
Each critical card has an Mean Time Between Failure of 80,000 hours

Engineer states that 'box' mtbf is 80,000 hours since no card within it is less than 80,000 hours

Proposal-writer/technical-writer Rob states that mtbf is
Box mtbf = 1 / [( 1/80000 ) * number of cards] = 4000 hours

Who's right.

Him,
Me,
Neither

You are correct. The MTBF can be converted to a failure rate of 12.5 (per million hrs), and 1000000 / (12.5 * 20) = 4000 hrs MTBF for the box.

ChristineR · Feb 22, 2006

Neither. The answer is basically "it depends." Say each card lasts exactly 80,000 hours then fails. Then the MTBF for the box is the same, 80,000 hours.

But it's more likely that it looks something like this imaginary card with a life up to 9 hours:

1 hour -- 10% .1
2 hours--15% .15/.9 = .167 failure rate
3 hours--20% .2/(1-.1-.15) = .267 failure rate
4 hours--15%
5 hours--10%
6 hours--10%
7 hours--10%
8 hours--5%
9 hours--5%

The MTBF for this card is (.1 + .3 + .6 + .6 + .5 + .6 + .7 + .4 + .45) = 4.25 hours. The midway point is somewhere between 3 & 4.

But make it two identical cards, and the chance that at least one will fail in the first hour is (1-.81) = .19. In the second hour, the chance that a box from the remaining 81% will fail is about 30%.

Failure rates per hour
1 hour -- 1 - .9 * .9 = .19
2 hours -- 1 - (1-.167) * (1-.167) = .306

19% of boxes fail in the first hour.
About 30.6% of the remaining boxes fail in the second hour.

This is messy, but you can see it basically isn't either of your answers.
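The hand arithmetic for this made-up 9-hour card can be checked with a short Python sketch. It assumes the percentages are the chance of a card failing during each hour, that cards fail independently, and that the box fails with its first card. The helper names are hypothetical.

```python
# Made-up card from the post: probability of failing during hour 1, 2, ..., 9.
pmf = [0.10, 0.15, 0.20, 0.15, 0.10, 0.10, 0.10, 0.05, 0.05]

def mean_life(pmf):
    """Mean failure time: sum of hour * probability of failing in that hour."""
    return sum(hour * p for hour, p in enumerate(pmf, start=1))

def hazards(pmf):
    """Chance of failing in each hour, given survival up to that hour."""
    surviving, out = 1.0, []
    for p in pmf:
        out.append(p / surviving)
        surviving -= p
    return out

def box_pmf(pmf, n_cards):
    """Failure-hour distribution of a box that dies with its first card."""
    out, cumulative, survive_prev = [], 0.0, 1.0
    for p in pmf:
        cumulative += p
        # All n cards must survive for the box to survive this hour.
        survive_now = max(0.0, 1.0 - cumulative) ** n_cards
        out.append(survive_prev - survive_now)
        survive_prev = survive_now
    return out

two_card_box = box_pmf(pmf, 2)
print("card mean life:    ", round(mean_life(pmf), 4))
print("card hazards:      ", [round(h, 3) for h in hazards(pmf)])
print("box hazards (2):   ", [round(h, 3) for h in hazards(two_card_box)][:3])
print("box mean life (2): ", round(mean_life(two_card_box), 4))
```

For this particular distribution the two-card box mean lands between "half the card mean" and "the full card mean", which is the point being made: neither shortcut is exact once the failure rate changes with age.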

ChristineR · Feb 22, 2006

After a good nights sleep, maybe I can cook up a simpler example.

bpesta22 · Feb 22, 2006

ChristineR said:
Neither. The answer is basically "it depends." Say each card lasts exactly 80,000 hours then fails. Then the MTBF for the box is the same, 80,000 hours.

But it's more likely that it looks something like this imaginary card with a life up to 9 hours:

1 hour -- 10% .1
2 hours--15% .15/.9 = .167 failure rate
3 hours--20% .2/(1-.1-.15) = .267 failure rate
4 hours--15%
5 hours--10%
6 hours--10%
7 hours--10%
8 hours--5%
9 hours--5%

The MTBF for this card is (.1 + .3 + .6 + .6 + .5 + .6 + .7 + .4 + .45) = 4.25 hours. The midway point is somewhere between 3 & 4.

But make it two identical cards, and the chance that at least one will fail in the first hour is (1-.81) = .19. In the second hour, the chance that a box from the remaining 81% will fail is about 30%.

Failure rates per hour
1 hour -- 1 - .9 * .9 = .19
2 hours -- 1 - (1-.167) * (1-.167) = .306

19% of boxes fail in the first hour.
About 30.6% of the remaining boxes fail in the second hour.

This is messy, but you can see it basically isn't either of your answers.

I disagree with this assessment, because it says "mean time before failure", which to me doesn't indicate the maximum likely life of the unit. I think your analysis works only if we assume the 80,000 value is the maximum (or near the maximum) time until failure (MTUF?).

In your first example, the standard deviation would be zero, and all you'd need to know is the MTBF. So, I agree.

In any other scenario, you'd need to know the variance to figure it out (as people above suggested). My guess is it's a bell curve (isn't that the law of large numbers thingy?), so you could then calculate z scores and figure out the odds.

Two examples:

Mean 80,000 hours, standard deviation 10,000 hours.

What's the probability any one card would fail by 60,000 hours or sooner?

That would be a z score of -2.0, which has a (one-tailed) probability of .0228.

So, the odds that any one card in a box of 20 would fail within 60,000 hours or sooner would be 1 - .9772 to the 20th power, which if I did it right is .369.

So, with a large SD, there's a fairly high chance one-- meaning the whole thing-- would fail within just 60,000 hours.

Example 2:

mean 80,000 hours, SD 5000 hours.

The same question produces a z score of -4, which has a probability of only .00003.

The probability that one will fail within 60,000 hours or sooner is now: .0006!

The P one would fail within 65,000 hours or sooner would be (z = -3.0) = .027

Within 70,000 hours or sooner, p = .369

Within 75,000 hours or sooner, p = .968!

Much different results across the 2 examples, with everything depending on the SD, and assuming the distribution is normal.
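For anyone who wants to rerun these numbers, here is a small Python sketch using the standard library's `statistics.NormalDist`. It assumes card lives are independent and normally distributed, uses the one-tailed probability of a card failing by the given time, and treats the box as failed once any one of the 20 cards has failed. The function name `p_box_fails_by` is made up for this example.

```python
from statistics import NormalDist

def p_box_fails_by(hours, mean, sd, n_cards=20):
    """Chance that at least one of n independent normal-life cards fails by `hours`."""
    p_card = NormalDist(mean, sd).cdf(hours)  # one-tailed: P(card life <= hours)
    return 1.0 - (1.0 - p_card) ** n_cards

for sd in (10_000, 5_000):
    print(f"mean 80,000 hours, SD {sd:,} hours")
    for hours in (60_000, 65_000, 70_000, 75_000):
        p = p_box_fails_by(hours, mean=80_000, sd=sd)
        print(f"  box fails within {hours:,} hours: p = {p:.4f}")
```

The normal model is only a guess at the shape. A card with a constant failure rate would follow an exponential curve instead and give very different odds.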

Interested in seeing how and if I f'd this up.

I wonder too, MTBF??? Is there any kinda calculation called MILF?

bpesta22 · Feb 22, 2006

It also occurred to me: if there's no variance, then the mean time before failure would also be the maximum (and the minimum) time before failure...

CurtC · Feb 22, 2006

jj said:
If each card fails randomly once per 80000 hours, with uniform distribution between 0 and 160000 hours

But with these calculations, aren't they always done assuming that the probability of failure within any given hour is 1/80,000 ? This kind of calculation describes an exponential decay, like nuclear decay.

I guess you could try to figure a different "bathtub curve" kind of distribution, with more infant mortality, a fairly low failure mid-life, then higher rates as the part ages, but I haven't seen these done.

Many companies figure the MTBF by looking at how many repairs are done in the first year. If you have 1000 products out there in the market, and for your market you assume 2,000 working hours per year, and 50 of them (5%) fail in the first year, you publish an MTBF of 1000*2000/50 = 40,000 hours.
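That field-return estimate is just total operating hours divided by the number of failures. A minimal sketch in Python, where the function name `field_mtbf` and the assumption that every unit runs the full 2,000 hours are illustrative:

```python
def field_mtbf(units_in_service, hours_per_unit, failures):
    """Estimate MTBF as total operating hours divided by observed failures."""
    if failures <= 0:
        raise ValueError("need at least one observed failure to estimate MTBF")
    return units_in_service * hours_per_unit / failures

estimate = field_mtbf(units_in_service=1000, hours_per_unit=2000, failures=50)
print(f"published MTBF: {estimate:,.0f} hours")  # 40,000 hours
```

Note that this estimate only reflects the first year of service, so it carries no information about wear-out later in life.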

The way I've always seen it done to combine MTBFs is exactly the way Rob is doing it.

Zep · Feb 22, 2006

The MTBF, as I understand it, is actually the time at which the highest percentage of identical parts are likely to expire, or the most likely time that one part is likely to expire. Thing is, if the build quality is low then the variance from this expected time is fairly wide.

As mentioned above, you can have an 80,000 hour MTBF, but it may also be, for example, +/-30,000 hours, or +/-3,000 hours. The first instance is far more likely to fail sooner than the second instance, even though they both have the same MTBF.

Also, an MTBF is not a guarantee that the part WILL fail at that point in time. Just that it MAY fail at about that time, and probabilities calculations work around bell-curves with SD's...

And be aware also that some manufacturers give MTBFs that exceed the practical lifetime of the unit anyway. HDD manufacturers often do this - make the MTBF something like 3 years continuous use, knowing full well that by the time it does fail, they can easily sell a new, improved product. In short, it's a fairly meaningless number in that case.

Soapy Sam · Feb 22, 2006

Rob- You are asking the wrong question.

How long does the guarantee last? Call it G hours.

The first component failure will occur after G + T hours, where T is a random number between -20,000 and about 8.

chance · Feb 22, 2006

ChristineR, the 4000 MTBF for the box is correct. What you seem to be calculating is the reliability. Reliability is an exponential function of time, e.g. what is the chance that the box will not break down in its first year, yes?

In which case R = e^(-t/MTBF)

For 1 year there is an 11% chance of surviving, where t = 8760 (hours in one year) and MTBF = 4000 hrs.

CurtC is correct: the reliability is exponential.

All the software (and calculations) that I use assume the MTBF, failure rate etc. to be steady state; no allowance is made for bathtub curves, burn-in, infant mortality, or wear-out phases.
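The one-year figure can be reproduced with a few lines of Python. This is a sketch under the same steady-state (constant failure rate) assumption, and the function name `reliability` is chosen here just for illustration.

```python
import math

HOURS_PER_YEAR = 8760

def reliability(hours, mtbf):
    """Chance of running `hours` without failure, assuming a constant failure rate."""
    return math.exp(-hours / mtbf)

box = reliability(HOURS_PER_YEAR, mtbf=4_000)
card = reliability(HOURS_PER_YEAR, mtbf=80_000)
print(f"box  (MTBF  4,000 h): {box:.1%} chance of surviving one year")   # about 11%
print(f"card (MTBF 80,000 h): {card:.1%} chance of surviving one year")  # about 90%
```

Because the 20 cards are in series, the box figure is simply the single-card figure raised to the 20th power.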
