Utility to test IDE cable connections?

  • Thread starter David R
David Maynard said:
Alex said:
[snip]

Using the numbers above, (one error every 2 days etc.) it is
obvious that the odds of that error being critical must be fairly
small, else practically nobody would have a functional system.

Are most hard errors non-critical?

A hard error means it's broke.

What I mean is, for example, a stuck bit in a DRAM. That is, a hard memory
error as opposed to a soft memory error.

Well it's an error either way, so in the singular sense the error
is no more critical than a soft error, but given that this error
will be reproduced every time, the frequency of the error makes
it more critical.
 
Alex said:
Alex said:
[snip]


Using the numbers above, (one error every 2 days etc.) it is
obvious that the odds of that error being critical must be fairly
small, else practically nobody would have a functional system.

Are most hard errors non-critical?

A hard error means it's broke.


What I mean is, for example, a stuck bit in a DRAM. That is, a hard memory
error as opposed to a soft memory error.

Alex

While possible, that's not a common failure mode for dynamic RAM. What's
more likely is a dead address or I/O line (or the whole thing dead). In
which case you have catastrophic errors.
 
Alex said:
David Maynard said:
Alex said:
[snip]

Using the numbers above, (one error every 2 days etc.) it is
obvious that the odds of that error being critical must be fairly
small, else practically nobody would have a functional system.

Are most hard errors non-critical?

A hard error means it's broke.

What I mean is, for example, a stuck bit in a DRAM. That is, a
hard memory error as opposed to a soft memory error.

Such an error is easily exposed by such things as MEMTST. It will
repeat. Soft errors are due to other external events and will not
repeat. ECC will handle both kinds, and a good ECC system will be
able to report corrections and their locations, so the hard
error will stand out. I believe Windoze has no ability to
collect these reports. Better OSs do.
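
As a concrete illustration of why a hard error "will repeat", here is a toy sketch (my own, purely illustrative) of the write-pattern/read-back loop a memory tester performs. It only exercises whatever pages the OS happens to hand this one process; real testers such as memtest86 run outside the OS and walk patterns over all of physical memory.

```c
/* Toy repeating memory test: fill a buffer with all-zeros, read it back,
 * then all-ones, read it back.  A hard error (stuck bit) fails on one of
 * the two passes on every single run; a soft error, by definition, does
 * not repeat.  The buffer size is arbitrary. */
#include <stdio.h>
#include <stdlib.h>

/* volatile so the compiler actually performs every write and read */
static size_t check(volatile unsigned long *buf, size_t n, unsigned long pattern)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; i++)
        buf[i] = pattern;                        /* write the pattern ... */
    for (size_t i = 0; i < n; i++)
        if (buf[i] != pattern) {                 /* ... and read it back */
            printf("mismatch at word %zu: %#lx != %#lx\n", i, buf[i], pattern);
            bad++;
        }
    return bad;
}

int main(void)
{
    size_t n = 1u << 20;                         /* 1M words */
    unsigned long *buf = malloc(n * sizeof *buf);
    if (buf == NULL)
        return 1;

    size_t errors = check(buf, n, 0UL) + check(buf, n, ~0UL);
    printf("%zu error(s)\n", errors);

    free(buf);
    return 0;
}
```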

Whether any error is critical depends on how the erroneous
location is used, which you normally have no practical control
over.
 
[snip]
Whether any error is critical depends on how the erroneous
location is used, which you normally have no practical control
over.

Indeed. But what I can't see, for some reason, is how relatively common
errors (eg "every two days" on average) have an apparently low probability
of being noticed (which I take as equivalent to being "critical").

Put simply, something doesn't seem to add up. Any thoughts on what I'm
missing?

Alex
 
While I agree with another post concerning physical memory
Alex Fraser said:
You seem to be missing the point ;). I think the point is that
ECC is not necessary for desktop systems if you use good
quality non-ECC memory, such as Crucial.

The fact that the majority of chipsets on motherboards used
for desktop systems can't take advantage of ECC makes it
largely a moot point anyway.


Is it possible to use ECC memory in a "standard" mobo which takes non-ECC
memory?
 
Alex said:
[snip]
Whether any error is critical depends on how the erroneous
location is used, which you normally have no practical control
over.

Indeed. But what I can't see, for some reason, is how relatively
common errors (eg "every two days" on average) have an apparently
low probability of being noticed (which I take as equivalent to
being "critical").

Put simply, something doesn't seem to add up. Any thoughts on
what I'm missing?

No, it's not equivalent. Being noticed immediately makes it
relatively harmless. Not affecting anything else, such as file
storage, or a program run, makes it definitely non-critical. Most
of the memory system is busy holding something that will either
get swapped out and appears to be non-dirty, or that will get
overwritten, or that will never get read again.

However, imagine this scenario. A data base system, maybe based
on a B-tree. Normal operation accesses a record, loading it into
a stack buffer, and a cosmic ray kills a field. The routine extracts
that field (erroneous) and exits. Nothing gets written back, and
the bad value is in relinquished stack AND the returned result.
The report being generated uses the wrong value. Enter the
paychecks for an extra million dollars. An ECC machine would
never have glitched, and the lawyers trying to recover from the
employee (who fled the country for the Bahamas) would be
considerably poorer. It doesn't take a high probability for the
cost of the ECC system to be inconsequential.

There is, to my mind, a good argument for making the hardware
vendors responsible because they knew, or should have known, of
this possibility, and how to prevent it. But by the very nature
of the glitch it is non-repeatable, and proving the case would be
nearly impossible. I suspect a good deal of the contumely heaped
on Windoze crashes is due to this very problem, combined with the
fact that Windoze is a house of cards in the first place. If Bill
and crew were smarter they would be pushing for universal ECC and
maybe would get away from the Yugo reputation.

This link, which someone put up a short while ago, has a good
discussion of that very aspect, backed by some experiments. They
do tend to waste paper, though.

<http://www.eecg.toronto.edu/~lie/papers/hp-softerrors-ieeetocs.pdf>
 
Troy said:
.... snip ...

Is it possible to use ECC memory in a "standard" mobo which takes
non-ECC memory?

Yes, in general. It just means there are extra data lines which
are neither read nor written. If the mobo designer had any sense
those lines are pulled up through something like 10k ohms,
avoiding any extra power dissipation. I have a machine doing
exactly that now and for the past 3 years.
 
CBFalconer said:
Alex said:
[snip]
Whether any error is critical depends on how the erroneous
location is used, which you normally have no practical control
over.

Indeed. But what I can't see, for some reason, is how relatively
common errors (eg "every two days" on average) have an apparently
low probability of being noticed (which I take as equivalent to
being "critical").

Put simply, something doesn't seem to add up. Any thoughts on
what I'm missing?

No, it's not equivalent. Being noticed immediately makes it
relatively harmless. Not affecting anything else, such as file
storage, or a program run, makes it definitely non-critical.

OK, but that does not alter what I said: I can't see how relatively common
errors have an apparently low probability of being noticed. I don't deny
that it's largely irrelevant in practical terms (on the basis of cost vs
benefit), but the reasons are nonetheless of interest to me.

Alex
 
Alex said:
Alex said:
[snip]


Whether any error is critical depends on how the erroneous
location is used, which you normally have no practical control
over.

Indeed. But what I can't see, for some reason, is how relatively
common errors (eg "every two days" on average) have an apparently
low probability of being noticed (which I take as equivalent to
being "critical").

Put simply, something doesn't seem to add up. Any thoughts on
what I'm missing?

No, it's not equivalent. Being noticed immediately makes it
relatively harmless. Not affecting anything else, such as file
storage, or a program run, makes it definitely non-critical.


OK, but that does not alter what I said: I can't see how relatively common
errors have an apparently low probability of being noticed.

Because, for it to be 'noticed' it has to be not only altered but used
by something in the altered form.

For example, who cares what the state of a bit in disk cache that's about
to be overwritten by new data is? Whatever it was, altered or not, is about
to get replaced anyway. It'll never be 'noticed'.
 
Because, for it to be 'noticed' it has to be not only altered but used
by something in the altered form.

We don't have to leave this to speculation.

An interesting experiment to do on a machine which will later be
reformatted: write a program which runs in the background and at
pseudorandom times alters a bit, or several contiguous or non-contiguous
bits in memory, logging what it does. Use the machine for normal (but
non-critical) work.

Compare the number of problems _noticed_, and the number of error
messages and crashes, with the number of corruption events.

The program will of course have to be written so that it is allowed to
write outside its own memory space.
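
Purely as an illustration of the kind of program described above (my own sketch, not part of the proposal): on Linux, one user-space way to alter bits in another process's memory is ptrace(). The target PID, base address, region size and the 30-second interval below are all assumptions; running it needs the privilege to trace the target (root or CAP_SYS_PTRACE, and a permissive ptrace_scope), and it reaches only that one process's virtual memory, not RAM at large.

```c
/* Hypothetical fault-injection sketch (Linux): periodically flip one bit
 * in a target process's memory via ptrace() and log what was done. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <errno.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s <pid> <hex-base-addr> <region-size>\n", argv[0]);
        return 1;
    }
    pid_t pid = (pid_t)atoi(argv[1]);
    unsigned long base = strtoul(argv[2], NULL, 16);
    unsigned long size = strtoul(argv[3], NULL, 0);
    srand((unsigned)time(NULL));

    for (;;) {
        sleep(30);                               /* "every N seconds" */

        /* pick a word-aligned address in the region and a bit within it */
        unsigned long off  = ((unsigned long)rand() % size) & ~(sizeof(long) - 1);
        unsigned long addr = base + off;
        int bit = rand() % (int)(8 * sizeof(long));

        if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1) {
            perror("ptrace attach");
            return 1;
        }
        waitpid(pid, NULL, 0);                   /* wait for the target to stop */

        errno = 0;
        long word = ptrace(PTRACE_PEEKDATA, pid, (void *)addr, NULL);
        if (!(word == -1 && errno != 0)) {
            long flipped = word ^ (1L << bit);   /* flip one bit */
            ptrace(PTRACE_POKEDATA, pid, (void *)addr, (void *)flipped);
            printf("%ld: flipped bit %d at %#lx (%#lx -> %#lx)\n",
                   (long)time(NULL), bit, addr,
                   (unsigned long)word, (unsigned long)flipped);
            fflush(stdout);                      /* keep the log up to date */
        }
        ptrace(PTRACE_DETACH, pid, NULL, NULL);  /* let the target run on */
    }
}
```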

This experiment is as important as the experiments in cosmic-ray free
caves and similar.

Some errors that are not noticed will have no consequences (corruption
of a bit in memory that is not used before it is next written to -- I
would expect this to be by far the most common case). Others may simply
be missed (change one character in a report which has a few typos
anyway).

Perhaps try this with different operating systems.

Report the results.

For this experiment the type of memory used will make no difference, as
we are writing to memory by "proper" methods, not changing single bits
outside the system, though with actual memory errors ECC will correct
single-bit errors and detect (but not correct) double-bit errors.

To be able to make sensible decisions on the need for memory error
detection and correction we need experimental information on the
frequency of corruption events that do have consequences.

The information from carefully controlled and documented experiments to
count the frequency of corruption events and experiments on their
consequences could lead to changes in the industry.

Best wishes,
 
Michael said:
We don't have to leave this to speculation.

[snip]

That's a great idea. Why don't you do that?
 
I suggested:

David said:
That's a great idea. Why don't you do that?

Thanks. At the moment I don't have the time to spare, nor a machine or
space which I could dedicate to the experiment.

It's also best done in an academic environment where the results can be
published and be respectable, rather than by a lone worker posting his own
claimed results! This is perhaps the most important reason for me not to
do it. The program itself is a doddle, if one can work around operating
systems' protection of memory not allocated to the program.

Best wishes,
 
Michael Salem said:
We don't have to leave this to speculation.

An interesting experiment to do on a machine which will later be
reformatted: write a program which runs in the background and at
pseudorandom times alters a bit, or several contiguous or non-contiguous
bits in memory, logging what it does. Use the machine for normal (but
non-critical) work.

No need for random timing. In fact, "every N seconds" would probably be
preferable.

[snip]
Some errors that are not noticed will have no consequences (corruption
of a bit in memory that is not used before it is next written to -- I
would expect this to be by far the most common case).

This is just what I don't see.

Isn't expecting that "by far the most common case" is that a flipped bit is
written to before it is next read equivalent to saying that most bits are
written multiple times without being read in between?

Hmm. Maybe I just worked out the answer: processor caches result in exactly
that effect on main memory.

Alex
 
Alex Fraser wrote:

Isn't expecting that "by far the most common case" is that a flipped bit is
written to before it is next read equivalent to saying that most bits are
written multiple times without being read in between?

Hmm. Maybe I just worked out the answer: processor caches result in exactly
that effect on main memory.

The experiment I propose would test this, speculation won't. The test
could be made with cache enabled and disabled, if possible.

Best wishes,
 
Michael Salem said:
The experiment I propose would test this, speculation won't. The test
could be made with cache enabled and disabled, if possible.

The only matter of speculation here is over the odds
in a game of Russian roulette.

-- Bob Day
 
Michael Salem said:
Alex Fraser wrote:

The experiment I propose would test this, speculation won't. The test
could be made with cache enabled and disabled, if possible.

Shit, what is it with people evading my questions, using them as an excuse
to repeat themselves? :)

Your experiment is an excellent idea. Sadly, I wouldn't know where to start
to implement it. Well, almost, anyway. Probably the most accessible method
would be to do it in the Linux kernel.

Alex
 
Alex said:
Shit, what is it with people evading my questions, using them as an excuse
to repeat themselves? :)

With this sort of question, waffling away with speculation is pointless
when a fairly simple experiment can be made. I'm reminded of all the
experts who agreed that denser things fell faster than less dense things
until Galileo took the couple of minutes to drop balls of different
density from the Tower of Pisa (or so it is said).

So I can't answer your question authoritatively (rather than evading
it). I did actually start drafting a response, but decided that it was a
waste of time.

Your experiment is an excellent idea. Sadly, I wouldn't know where to start
to implement it. Well, almost, anyway. Probably the most accessible method
would be to do it in the Linux kernel.

It would need to be done with each operating system for which we want to
know the answer.

If somebody knows how to enable a program to write to any location in
memory in Windows XP I might write and make available a program to
corrupt memory in the way I discussed. The answer is probably simple;
I've just never done anything like this since the days when the
operating system didn't protect memory.
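
For what it's worth, user-mode code on Windows cannot write to arbitrary physical memory, but it can corrupt another process's virtual memory with the documented OpenProcess()/ReadProcessMemory()/WriteProcessMemory() calls, given sufficient access rights on the target. A rough, hypothetical sketch of that approach (the PID and address are placeholders; which physical page backs the address is up to the OS):

```c
/* Rough sketch (Win32): flip one bit in another process's memory.
 * The caller needs PROCESS_VM_READ | PROCESS_VM_WRITE | PROCESS_VM_OPERATION
 * rights on the target process. */
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <hex-address>\n", argv[0]);
        return 1;
    }
    DWORD pid = (DWORD)strtoul(argv[1], NULL, 10);
    LPVOID addr = (LPVOID)(ULONG_PTR)strtoul(argv[2], NULL, 16);

    HANDLE h = OpenProcess(PROCESS_VM_READ | PROCESS_VM_WRITE | PROCESS_VM_OPERATION,
                           FALSE, pid);
    if (h == NULL) {
        fprintf(stderr, "OpenProcess failed: %lu\n", GetLastError());
        return 1;
    }

    unsigned char byte;
    SIZE_T n;
    if (ReadProcessMemory(h, addr, &byte, 1, &n) && n == 1) {
        unsigned char flipped = byte ^ 0x01;     /* flip the low bit */
        if (WriteProcessMemory(h, addr, &flipped, 1, &n) && n == 1)
            printf("flipped bit 0 at %p: 0x%02x -> 0x%02x\n", addr, byte, flipped);
        else
            fprintf(stderr, "WriteProcessMemory failed: %lu\n", GetLastError());
    } else {
        fprintf(stderr, "ReadProcessMemory failed: %lu\n", GetLastError());
    }

    CloseHandle(h);
    return 0;
}
```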

Best wishes,
 
Michael said:
Alex Fraser wrote:
.... snip ...


It would need to be done with each operating system for which we
want to know the answer.

If somebody knows how to enable a program to write to any location
in memory in Windows XP I might write and make available a program
to corrupt memory in the way I discussed. The answer is probably
simple; I've just never done anything like this since the days
when the operating system didn't protect memory.

This link, which someone put up a short while ago, has a good
discussion of that very aspect, backed by some experiments. They
do tend to waste paper, though.

<http://www.eecg.toronto.edu/~lie/papers/hp-softerrors-ieeetocs.pdf>
 