ECC Errors

jtrooney

I have an Intel server board running 4x1GB sticks of ECC memory with 2
Intel Xeon processors. The system randomly dies every couple of
months. The odd part is that it happens when the system's load is at a
minimum. I have gone through and tested each stick of memory in each DIMM
slot on the board, and was led to believe that one of the DIMM slots
was bad. I replaced the motherboard and am still receiving errors
with memtest. The errors that I receive are ECC uncorrected errors.
Any idea as to where to go from here would be a great help.
Thanks in advance
 
Paul

As I understand it, ordinary ECC is SECDED (single error correction,
double error detection). If the BIOS has an option to enable "background
scrubbing", then the system will methodically go through main
memory when it is not under load, and do test read cycles
on the memory. If a correctable error is found, the hardware will
try to repair it. The benefit of scrubbing is that if there is
some background level of single-bit errors showing in the memory,
they won't "accumulate" into multi-bit errors that can no longer be
corrected. (Memory errors aren't always single-bit errors, though.
Corruption of entire words in memory is possible if there is an
electrical disturbance during a write operation, like an electric
floor polisher bumping the equipment rack in the server room.)
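
To make the SECDED idea concrete, here is a toy sketch in Python. It uses an extended Hamming(8,4) code on a 4-bit value, which is only an illustration: real ECC DIMMs protect 64 data bits with 8 check bits, and the function names encode and decode are my own, not anything from a real memory controller.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED codeword (list of 0/1)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 = overall parity bit
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4,5,6,7
    code[0] = sum(code[1:]) % 2             # makes total parity even
    return code


def decode(code):
    """Return (nibble, status); nibble is None if uncorrectable."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of set bits
    overall = sum(code) % 2
    fixed = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd overall parity means exactly one flipped bit; the syndrome
        # names its position (0 means the overall parity bit itself).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Even overall parity but a non-zero syndrome: two bits flipped.
        return None, "uncorrectable"
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return nibble, status
```

Flip one bit of encode(0b1011) and decode still hands back 0b1011 with status "corrected". Flip two bits and it reports "uncorrectable", which is the same class of error memtest is showing here.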

A second form of error detection/correction is "chipkill", but
I don't know if chipkill is a candidate for scrubbing or not.
Chipkill focuses on groups of data at the nibble level (4 bits),
and since x4 width memory chips are popular on registered DIMMs,
it is a good fit for such memory modules. If you have a registered
DIMM with x4 chips on it, it is even possible for the computer
to keep running if an entire chip dies.

Now, that being said, if you can run memtest on one stick of memory
and still get errors, then scrubbing is not going to fix it. Your
problem is too severe.

RAM stability can be influenced by memory timing, memory clock
rate, memory voltage, and temperature. If you get a copy of
CPU-Z (www.cpuid.com), you can see what timing the computer is
currently using. CPU-Z also has a "report" generator, and it
can dump the contents of the SPD chip on the DIMM, if you need
to see what memory timings have been set in the DIMM's SPD
chip to be used as defaults.

If this server is using registered memory, operation should be
"bulletproof". The electrical performance of registered buses
is so much better than how desktops work that you shouldn't be
seeing anything like this. And with only one DIMM installed
for testing purposes, there is no excuse for errors.

If memtest86+ is always returning errors at the same memory
addresses, then the RAM could be bad. I've had memory with
stuck bits before, so it does happen.
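
If you jot down the failing address, expected value and actual value from each memtest86+ error line, a few lines of Python will show whether the same addresses and bits keep coming up. This is only a sketch: the (address, expected, actual) tuple layout and the function names are my own assumptions, since the values have to be transcribed from the screen by hand.

```python
from collections import defaultdict


def summarize_failures(failures):
    """Group (address, expected, actual) records by address.

    Returns {address: (hit_count, mask_of_bits_seen_failing)}.
    """
    summary = defaultdict(lambda: [0, 0])
    for address, expected, actual in failures:
        entry = summary[address]
        entry[0] += 1
        entry[1] |= expected ^ actual   # XOR marks the bits that differ
    return {addr: (count, mask) for addr, (count, mask) in summary.items()}


def repeat_offenders(failures, min_hits=2):
    """Keep only addresses that failed at least min_hits times."""
    return {
        addr: info
        for addr, info in summarize_failures(failures).items()
        if info[0] >= min_hits
    }
```

An address that shows up on every pass with the same one-bit mask looks like a stuck bit in the RAM. Errors scattered over different addresses on each pass point more toward timing, voltage or power.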

On "enthusiast" boards, there would be the option to increase
the memory voltage. For example, DDR chips rated at DDR333 or
slower have an industry standard of 2.5V. At DDR400, the
spec used industry-wide is 2.6V. If the board supports it,
setting the voltage on the DIMMs to 2.7 or 2.75V will sometimes
improve a background error rate. If the board runs DDR400, the
Intel board should already be using at least 2.6V, if the design
staff had half a brain. (You didn't mention the technology used
on the board, so this could be 3.3V for SDRAM, 2.5V for DDR,
1.8V for DDR2 and so on.)

In eons past, engineers used to design boards without the
benefit of simulation tools. Such boards would crash once
a day, and the engineers were powerless to improve them.
Design tools have improved a lot since then, as has the understanding
of how this stuff should work. Certainly an Intel-designed
board should be well clear of those bad design methods. That
leaves bad (budget) memory, a problem with the power supply,
or some other environmental factor as possible contributors.

Can you "borrow" some sticks from another computer?
I'd be curious whether every module you stuff in the system
fails memtest86+.

Does the server board have a "hardware monitor"? That is
the ability to monitor key voltages on the system. An
example of a freeware tool for accessing the hardware monitor
would be MBM5 from mbm.livewiredev.com. But for a server
board, you are more likely to need whatever tool
was bundled with the motherboard when you bought it. That is
because the hardware monitor chip would likely not be a
mainstream implementation, and will not be similar to the ones on
desktop boards that MBM5 supports.

With the hardware monitor, even at the BIOS level, you can
look to see if the +3.3V, +5V, +12V and so on are within
5% of the nominal value. If your 5V was below 4.75V,
you might want to get a multimeter and verify by hand the
quality of the power delivered to the board. Same for the
other voltages. Power supplies and disks are the two
weakest links in a computer, followed by flaky unbranded
memory chips that die a year after you buy them...
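
The 5% check is simple arithmetic, and if you are logging readings by hand it can be sketched like this in Python. The check_rails name and the dictionary of nominal-to-measured voltages are just my illustration; the readings would come from whatever the BIOS screen or a multimeter shows.

```python
def check_rails(readings, tolerance=0.05):
    """Return the rails whose measured voltage is out of tolerance.

    readings maps nominal voltage to measured voltage, e.g. {5.0: 4.70}.
    Each result is (nominal, measured, fractional_deviation).
    """
    out_of_spec = []
    for nominal, measured in readings.items():
        deviation = (measured - nominal) / nominal
        if abs(deviation) > tolerance:
            out_of_spec.append((nominal, measured, deviation))
    return out_of_spec
```

For example, check_rails({3.3: 3.28, 5.0: 4.70, 12.0: 12.1}) flags only the 5V rail, since 4.70V is 6% low while the other two are well inside 5%.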

Paul