When is ECC not ECC?

Gianni Mariani

My homebuilt dual Athlon 2400 system, built on an MSI K7D Master, has
become unstable. I run Linux on the box, and I started getting kernel
panics.

The system has two 512MB sticks of Samsung ECC memory and the BIOS
setting is "Error Correct", so I thought that this would be adequate to
deal with memory issues.

The failures started a while ago with a program I wrote called "cpulat",
which measures the CPU->CPU memory latency. It would crash in
unexpected, nonsensical places, and after trying to debug it for a
while I just gave up. Then a few months later the machine mysteriously
hung, followed by a succession of kernel panics, always with the same
error message.

In the process of trying to diagnose the problem, the mainboard stopped
responding to the keyboard and mouse, which led to swapping out the
mainboard. The kernel panics still persisted, so for a joke I swapped
the CPUs. Finally I pulled one of the memory sticks and bingo, the
machine was stable. I picked up a replacement stick and now it is
working properly, and I can't reproduce the cpulat errors either.

I've built probably 30+ PCs in my time and I have never seen this kind
of behaviour.

So the question that still remains for me is: why didn't the ECC error
recovery/check pick this up?
 
Bob Day

Despite the BIOS setting, perhaps the chipset or the mainboard does
not, in fact, support ECC memory.
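
Since the box runs Linux, one way to test this is to look at the kernel's EDAC error counters, if an EDAC driver for the chipset is loaded. The following is only a sketch under that assumption: it reads the `ce_count` (corrected errors) and `ue_count` (uncorrected errors) files under `/sys/devices/system/edac/mc`. The function name `read_edac_counts` is made up for illustration, and an empty result only means no EDAC memory controller is registered, not that the hardware lacks ECC.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}.

    Assumes the Linux EDAC sysfs layout; returns {} if it is absent.
    """
    base = Path(root)
    if not base.is_dir():
        return {}
    counts = {}
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # file missing or unreadable
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("No EDAC memory controllers found")
    for controller, entry in found.items():
        print(controller, entry)
```

A `ce_count` that climbs while the machine is under load would suggest ECC is active and quietly correcting errors; counters stuck at zero on a machine with a known bad stick would point the other way.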

-- Bob Day
 
Erez Volach

ECC can detect many errors, but it can probably correct only single-bit
errors. A typical scheme corrects one flipped bit per memory word and
detects two; if the data is more corrupt than that, the limited
information provided by the check bits is not enough to recover it, and
the error may even slip through undetected. I don't know what your
board would do in such extreme situations. Perhaps that one stick of
memory was not seated well, and contact / impedance / resistance issues
caused errors in data transmission. Also, as far as I know the module
only stores the extra check bits; generating and checking them is
handled by the north bridge of the motherboard, so errors on the address
or control lines of the memory bus would not be detected.
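
To illustrate the limits of correction, here is a minimal sketch of a single-error-correcting, double-error-detecting (SECDED) extended Hamming code. It is a toy over 8 data bits; ECC memory commonly uses 8 check bits per 64 data bits, and the exact code used by this chipset is not known here. The function names `encode`, `check` and `extract` are made up for illustration.

```python
def encode(data_bits):
    """Build a SECDED codeword: index 0 is the overall parity bit,
    power-of-two indices hold Hamming parity bits, the rest hold data."""
    m = len(data_bits)
    r = 0
    while (1 << r) < m + r + 1:  # number of Hamming parity bits needed
        r += 1
    n = m + r
    code = [0] * (n + 1)
    it = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):  # not a power of two: data position
            code[pos] = next(it)
    for i in range(r):
        p = 1 << i
        code[p] = sum(code[pos] for pos in range(1, n + 1) if pos & p) % 2
    code[0] = sum(code[1:]) % 2  # make the total parity even
    return code


def check(code):
    """Return (status, word) with status 'ok', 'corrected' or 'uncorrectable'."""
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid word
    odd = sum(code) % 2
    if syndrome == 0 and not odd:
        return "ok", list(code)
    if odd and syndrome <= n:
        fixed = list(code)
        fixed[syndrome] ^= 1  # assumes exactly one bit flipped
        return "corrected", fixed
    return "uncorrectable", list(code)


def extract(code):
    """Pull the data bits back out of a codeword."""
    return [code[pos] for pos in range(1, len(code)) if pos & (pos - 1)]


def flip(code, positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(code)
    for pos in positions:
        out[pos] ^= 1
    return out


if __name__ == "__main__":
    data = [1, 0, 1, 1, 0, 0, 1, 0]
    word = encode(data)
    for bad in ([5], [3, 9], [1, 2, 4]):
        status, fixed = check(flip(word, bad))
        print(bad, status, extract(fixed) == data)
```

With one flipped bit the word is repaired, and with two the error is flagged as uncorrectable. With three flipped bits (positions 1, 2 and 4 here) the checker reports "corrected" but returns the wrong data, which is the kind of silent failure a badly faulty stick could produce.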
 
