Alex said:
[snip]
Whether any error is critical depends on how the erroneous
location is used, which you normally have no practical control
over.
Indeed. But what I can't see, for some reason, is how relatively
common errors (e.g. "every two days" on average) have an apparently
low probability of being noticed (which I take as equivalent to
being "critical").
Put simply, something doesn't seem to add up. Any thoughts on
what I'm missing?
No, it's not equivalent. Being noticed immediately makes it
relatively harmless. Not affecting anything else, such as file
storage, or a program run, makes it definitely non-critical. Most
of the memory system is busy holding something that will either
be swapped out while still appearing non-dirty (so the bad copy is
simply discarded), or be overwritten, or never be read again.
However, imagine this scenario. A database system, maybe based
on a B-tree. Normal operation accesses a record, loading it into
a stack buffer, and a cosmic ray corrupts a field. The routine extracts
that field (erroneous) and exits. Nothing gets written back, and
the bad value is in relinquished stack AND the returned result.
The report being generated uses the wrong value. Enter the
paychecks for an extra million dollars. An ECC machine would
never have glitched, and the lawyers trying to recover from the
employee (who fled the country for the Bahamas) would be
considerably poorer. It doesn't take a high probability of such an
event for the cost of the ECC system to be inconsequential by comparison.
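
To make the scenario concrete, here is a rough sketch of a single flipped
bit in a pay field, with and without error correction. The Hamming(7,4)
code is only a toy stand-in that I am assuming for illustration; real ECC
memory typically uses a wider SECDED code over whole words. The salary
figure, the bit positions, and the helper names (protect, recover) are
all made up for the example.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits (0..15) into a 7-bit Hamming codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]          # covers positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]          # covers positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]          # covers positions 4, 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]   # positions 1..7
    return sum(b << i for i, b in enumerate(bits))


def hamming74_decode(code):
    """Correct at most one flipped bit, then return the 4 data bits."""
    bits = [(code >> i) & 1 for i in range(7)]
    syndrome = 0
    for pos in range(1, 8):          # XOR the 1-based positions of set bits
        if bits[pos - 1]:
            syndrome ^= pos
    if syndrome:                     # nonzero syndrome names the bad position
        bits[syndrome - 1] ^= 1
    data = [bits[2], bits[4], bits[5], bits[6]]
    return sum(b << i for i, b in enumerate(data))


def protect(value, nibbles=8):
    """Split a 32-bit value into nibbles and encode each one (hypothetical helper)."""
    return [hamming74_encode((value >> (4 * i)) & 0xF) for i in range(nibbles)]


def recover(codewords):
    """Decode each codeword and reassemble the value (hypothetical helper)."""
    return sum(hamming74_decode(c) << (4 * i) for i, c in enumerate(codewords))


salary = 1500                        # made-up field value

# No ECC: one flipped bit in the raw field silently changes the value.
unprotected = salary ^ (1 << 20)

# With the toy code: flip one bit in the codeword holding bits 20..23.
stored = protect(salary)
stored[5] ^= 1 << 2
corrected = recover(stored)

print(unprotected, corrected)
```

The unprotected copy comes back over a million too high and nothing
complains, while the encoded copy decodes to the original value. Note
that this toy code only corrects one bad bit per codeword; two flips in
the same codeword would be miscorrected.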
There is, to my mind, a good argument for making the hardware
vendors responsible because they knew, or should have known, of
this possibility, and how to prevent it. But by the very nature
of the glitch it is non-repeatable, and proving the case would be
nearly impossible. I suspect a good deal of the contumely heaped
on Windoze crashes is due to this very problem, combined with the
fact that Windoze is a house of cards in the first place. If Bill
and crew were smarter they would be pushing for universal ECC and
maybe would get away from the Yugo reputation.
This link, which someone put up a short while ago, has a good
discussion of that very aspect, backed by some experiments. They
do tend to waste paper, though.
<http://www.eecg.toronto.edu/~lie/papers/hp-softerrors-ieeetocs.pdf>