IBM x345 Server goes black during memory test of Samsung DIMMs

Phil

During MemTest86+ v1.70 (latest, run from a Win98SE boot floppy)
reliability testing of used RAM being installed as an upgrade in a
popular, redeployed IBM xSeries 345 (dual Xeon 2.8GHz, 8670-61X, 100MHz
FSB, 2002 era with latest v1.21 BIOS, v1.09 ISMP) 2U rack server, the
screen hung about 30 min into the test. After that the box would not
boot: dark screen, no BIOS. The box powers on with the green Power-on
LED lit on the front panel but otherwise appears totally dead. It has
dual IBM 350 watt power supplies, both with green LEDs lit in the rear.

I have not seen this IBM xSeries server issue discussed when googling
the newsgroups and Tek-Tips, so this long solution is described here,
along with additional questions on enhancing the reliability of legacy
IBM servers. Details follow. Regards, Phil

-----

The RAM totals 4GB (8GB max with 2GB DIMMs and dual PS): 4 pieces of
1GB IBM FRU 09N4308 / 38L4031 184-pin double-sided DIMMs (DDR 266MHz,
PC2100 CL2.5 registered ECC, spec 100MHz 2.5v) with Samsung SDRAM
memory chips (K4H510638D-TC80), which got quite hot to the touch during
this strenuous testing. This was most evident with the older chips
carrying 2002 date codes, compared to 2003.

The previous memory totalled 1GB: 4 pieces of 256MB registered ECC
DIMMs, FRU 09N4306 / 38L4029, with double-sided Micron MT46V32M4-75A
chips (512MB, i.e. 2x256MB, is the minimum config). One pair had older
late-02 and one pair mid-03 chip date codes. The Micron -75 chips are
rated at 7.5ns vs the Samsung parts' 8ns rating, but both appear to
have sufficient design margin against the 10ns cycle time of the 100MHz
operational spec. The memory sticks are installed in matched pairs for
2-way interleaved operation; factory labels face inwards.

Have any SysEs added gamer-style RAM chip coolers to servers in heavy
production? There is only 3/16 in (0.1875 in) between the chips of
DIMMs in adjacent slots, the chip array is 4 3/4 in long, and a
double-sided DIMM is 1/4 in (0.25 in) thick (1/8 in, 0.125 in, for the
256MB sticks). So any metal heat sinks, fins included, can be only
about 1/8 in thick. Does one remove the labels for better
plastic-to-metal heat transfer?

When booting the box from cold, the computer powered on but there was
no video from the integrated ATI RAGE XL chipset and no BIOS beeps.
The Light Path Diagnostics panel showed nothing (latest Integrated Sys
Mgmt Processor firmware). The blinking green LED next to the CMOS
battery (ISMP activity) just shows that AC is connected to the box.
The LEDs next to the DIMM slots showed nothing. Testing was done in
pairs in DIMM slots 1 and 2, which are closest to the edge of the
mainboard and case (HW manual p57-8).

IBM's Hardware Maintenance Manual (48P9718, 11th Ed, Feb 04, latest)
showed no such diagnosis and no such remedy; see the Chap 6 Symptom-
to-FRU index, p83-113. The closest symptom would be BIOS beep code
1-1-3 (CMOS write/read test failed, p83), but there were no beeps. The
"No-beep symptoms" table on p86 was no help. Same with the other
manuals, including Options p21-3 (48P9719, 1st Ed, 7/2002) and the
User's memory spec p3 and reliability p5-6 (48P9717, 1st Ed, 7/2002).
The Installation Guide has Chap 2 Installing Options: Memory p9-11, and
Chap 5 Solving Problems p27, where Table 3, for Boot Code = No beep,
says to call IBM Service (48P9714, 2nd Ed, 7/2002).

The only real clues to solving the problem were in the HW Maint Manual
"Undetermined problems" section, p112, near the end of the chapter,
which has Notes 1 and 2 on damaged data in CMOS and BIOS.

The undocumented fix or remedy was to remove, short the leads of, and
replace the CMOS CR2032 3v lithium battery (FRU 33F8354) to reset the
BIOS to defaults. Since this systemboard has an upright CMOS battery
retainer, one needs to use insulated forceps (or a napkin) to remove it
with one hand, with the other hand's fingernail on the retaining clip
(the flat positive + side with the mfgr name faces backwards). I advise
removing any ServeRAID board in Slot 2 (PCI-X 100MHz) too, for ease of
access. Use a screwdriver in the black plastic clip to ease open the
Adapter Retainer without breaking the blue plastic pronged snaps. The
manual procedures "Replacing the battery", p69-71, and "Installing a
ServeRAID-5i adapter", p54-5, have diagrams of the details.

The box will then boot with a "161 Bad CMOS battery" error (p102), and
the BIOS needs the time and date reset and the system-error logs
cleared.

After several repetitions of this scenario, I was able to get both
pairs of Samsung memory to pass at least one complete cycle of the test
(about 45 min), but not much further than that, before the box hung
again and required another R&R of the CMOS battery. The newer pair had
less trouble clearing the test hurdle and seemed to run cooler.

Even with the chassis cover on and the full array of 8 chassis fans
running, the rear-most memory chips were noticeably hotter than the
chips closer to the dual fans and CPUs. The DIMMs closer to the edge of
the case also ran hotter, so the newer pair's final destination was
DIMM slots 1 and 2. The BIOS did spin the fans down after the initial
startup, when all 8 fans were roaring away. During the hour-long memory
testing, the fans never spun up, with room ambient at about 60F.

We now have 4GB of iffy RAM... any comments from fellow IBM SysEs? Did
Samsung change (shrink) their process technology for fabricating
half-gigabit (512Mb) DDR DRAM chips in the late-2002 timeframe? I used
to think that Samsung set the world standard in memory chips, back in
the late-90s era when Intel/Rambus RDRAM RIMMs were battling the
SDRAM/DDR world.

Are there more reliable memory chip mfgrs that IBM OEMs, such as Hynix
(another Korean firm), Elpida Opt 33L5039 (Japanese), Infineon
(higher-rated IBM FRU 09N4308 33L5039 CL2 PC2100) (German), or Micron
(unbranded IBM-compatible to FRU 10K0071) (USA)? Should we be looking
at enhanced-memory specialists such as Corsair, OCZ, Patriot, etc.? Or
is the best solution to the overheating memory chip issue IBM's
ChipKill technology? Has anyone installed this more expensive memory in
xSeries servers?

BTW, this issue was also posted in , and in Tek-Tips.com's IBM Server
discussion group, on 27Apr07.
 
Robert Redelmeier

Phil said:
During MemTest86+ v1.70 (latest with Win98SE boot floppy) for
reliabilty testing in upgrading used RAM memory in a popular,
redeployed IBM xSeries 345 (dual Xeon 2.8GHz, 8670-61X, 100MHz
FSB, 2002 era with latest v1.21 BIOS, v1.09 ISMP) 2U rack server,
the screen hung about 30 min into the test. Then the box would

No surprise. Please remember we are talking servers here, where
reliability (correct answers) is often considered more important
than uptime. And maintenance is assumed to be available.

What happened is memtest stressed the memory, it failed but
the ECC caught it (so memtest never saw it) and after enough
of these [silently filling logs], the machine shut itself
down and wouldn't repower until fixed (clear CMOS).
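To make this hypothesis concrete, here is a minimal sketch in C of the
kind of policy Robert is describing: corrected-ECC events fill a
fixed-size log, and once the log is full the machine refuses to come up
until the log is cleared. The log size and thresholds are invented for
illustration; this is not IBM's actual ISMP firmware logic.

#include <stdio.h>
#include <stdbool.h>

#define LOG_CAPACITY 64              /* assumed size of the error log */

struct error_log {
    unsigned count;                  /* corrected-ECC events recorded */
};

static bool log_record(struct error_log *log)
{
    if (log->count >= LOG_CAPACITY)
        return false;                /* log full: event is dropped */
    log->count++;
    return true;
}

static bool boot_allowed(const struct error_log *log)
{
    /* Invented policy: warn at 75% full (the x345 does have a
     * "log more than 75% full" diagnostic LED), refuse to come
     * up once the log is completely full. */
    if (log->count >= LOG_CAPACITY)
        return false;
    if (log->count * 4 >= LOG_CAPACITY * 3)
        fprintf(stderr, "warning: error log more than 75%% full\n");
    return true;
}

int main(void)
{
    struct error_log log = { 0 };

    /* A long memtest run keeps generating corrected single-bit errors. */
    for (int event = 0; event < 100; event++)
        log_record(&log);

    if (!boot_allowed(&log))
        printf("no POST/video until the log is cleared (e.g. CMOS reset)\n");
    return 0;
}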
Phil said:
The RAM totals 4GB (8GB max with 2GB DIMMs and dual PS);
4 pieces of 1GB IBM FRU 09N4308 / 38L4031 184-pin double-sided
DIMMs (DDR 266MHz, PC2100 CL2.5 registered ECC, spec 100MHz 2.5v)
with Samsung SDRAM memory chips (K4H510638D-TC80) which got quite
hot-to-the-touch during this strenuous testing. This was quite
evident with older chips with 2002 date codes, compared to 2003.

Are you sure you have clean filters, dusted PSUs, and good clearances
around the ducts? You may have some work to do to duplicate the design
conditions if there was an auxiliary air mover or cabinet.

I don't know the failure mechanism of RAM, but I was surprised
at the number of failures reported recently here. Second
after HDs, and not by much.

-- Robert
 
Franc Zabkar

Robert Redelmeier said:
What happened is memtest stressed the memory, it failed but
the ECC caught it (so memtest never saw it) and after enough
of these [silently filling logs], the machine shut itself
down and wouldn't repower until fixed (clear CMOS).

I can't imagine that the paltry few bytes of CMOS RAM, most of which
are already in use, would be enough to store more than a handful of
such errors. In any case, what is the point of ECC if a system dies
when its error log becomes full?

My experience of ECC memory in mainframes is that a computer can run
forever, albeit with a minor performance penalty, if RAM errors are
limited to a single data bit per word. I can't see why PCs would be
any different. In fact the OP can easily test your hypothesis by
placing insulation tape over one data bit of each RAM stick's edge
connector and then subjecting his machine to normal everyday use.
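For readers who haven't seen how a single-error-correcting code does
this, here is a toy Hamming(7,4) demo in C: one flipped bit in a
codeword is located by the parity syndrome and flipped back. Real
server ECC uses a wider (72,64) SEC-DED code over each 64-bit word, but
the principle described above is the same.

#include <stdio.h>

/* Encode 4 data bits (d3..d0) into a 7-bit codeword, with parity bits
 * at positions 1, 2 and 4 and data bits at positions 3, 5, 6 and 7. */
static unsigned hamming74_encode(unsigned d)
{
    unsigned d0 = (d >> 0) & 1, d1 = (d >> 1) & 1;
    unsigned d2 = (d >> 2) & 1, d3 = (d >> 3) & 1;
    unsigned p1 = d0 ^ d1 ^ d3;      /* covers positions 3,5,7 */
    unsigned p2 = d0 ^ d2 ^ d3;      /* covers positions 3,6,7 */
    unsigned p4 = d1 ^ d2 ^ d3;      /* covers positions 5,6,7 */
    /* bit i of the return value is codeword position i+1 */
    return p1 | (p2 << 1) | (d0 << 2) | (p4 << 3) |
           (d1 << 4) | (d2 << 5) | (d3 << 6);
}

/* Recompute the parities; a non-zero syndrome is the 1-based position
 * of the single flipped bit, which is then corrected in place. */
static unsigned hamming74_correct(unsigned cw)
{
    unsigned syndrome = 0;
    for (unsigned p = 1; p <= 4; p <<= 1) {
        unsigned parity = 0;
        for (unsigned pos = 1; pos <= 7; pos++)
            if (pos & p)
                parity ^= (cw >> (pos - 1)) & 1;
        if (parity)
            syndrome |= p;
    }
    if (syndrome)
        cw ^= 1u << (syndrome - 1);  /* flip the bad bit back */
    return cw;
}

int main(void)
{
    unsigned data = 0xB;                      /* 4-bit pattern 1011 */
    unsigned cw   = hamming74_encode(data);
    unsigned bad  = cw ^ (1u << 4);           /* one stuck/flipped bit */
    unsigned good = hamming74_correct(bad);
    printf("sent %X, corrupted %X, corrected %X\n", cw, bad, good);
    return 0;
}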

- Franc Zabkar
 
Franc Zabkar

Phil said:
During MemTest86+ v1.70 (latest with Win98SE boot floppy) for
reliabilty testing in upgrading used RAM memory in a popular,
redeployed IBM xSeries 345 (dual Xeon 2.8GHz, 8670-61X, 100MHz FSB,
2002 era with latest v1.21 BIOS, v1.09 ISMP) 2U rack server, the
screen hung about 30 min into the test. Then the box would not boot,
dark screen, no BIOS. Box powers on with Power-on green LED on front
panel but otherwise appears totally dead. The box has dual IBM 350
watt power supplies with both green LEDs on in rear.

Have not seen this IBM xSeries server issue discussed when googling
the newsgroups and Tek-tips, so this long solution is described here
with additional questions for enhancing reliability of legacy IBM
servers. Details follow, regard, Phil

-----

The RAM totals 4GB (8GB max with 2GB DIMMs and dual PS); 4 pieces of
1GB IBM FRU 09N4308 / 38L4031 184-pin double-sided DIMMs (DDR 266MHz,
PC2100 CL2.5 registered ECC, spec 100MHz 2.5v) ...

Are you perhaps unintentionally overclocking your RAM? According to
the datasheet, 100MHz is the rated CL2 speed. To achieve 133MHz at CL2
you need "A2" parts.

See pages 3 and 4 of this document:
http://www.datasheetarchive.com/datasheet.php?article=1875962
Phil said:
... with Samsung SDRAM
memory chips (K4H510638D-TC80) ...

I believe that should be "TCB0" which codes for TSOP package,
commercial temperature, normal power, and a speed of 7.5ns@CL2.5.

The "D" indicates a 5th generation part.
Phil said:
... which got quite hot-to-the-touch during
this strenuous testing. This was quite evident with older chips with
2002 date codes, compared to 2003.

The datasheet stipulates a maximum power dissipation of 1.5W per chip.
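As a back-of-the-envelope check on why these DIMMs run hot, the worst
case implied by that figure can be tallied up. Note that 1.5W is the
datasheet maximum; real dissipation sits well below it, since only the
actively cycling banks draw near-peak current.

#include <stdio.h>

int main(void)
{
    const double watts_per_chip = 1.5;   /* datasheet maximum          */
    const int chips_per_dimm    = 18;    /* 72-bit wide, x4 devices    */
    const int dimms_installed   = 4;

    double per_dimm = watts_per_chip * chips_per_dimm;
    double total    = per_dimm * dimms_installed;

    printf("worst case: %.1f W per DIMM, %.1f W for %d DIMMs\n",
           per_dimm, total, dimms_installed);
    return 0;
}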

- Franc Zabkar
 
Del Cecchi

Franc Zabkar said:
What happened is memtest stressed the memory, it failed but
the ECC caught it (so memtest never saw it) and after enough
of these [silently filling logs], the machine shut itself
down and wouldn't repower until fixed (clear CMOS).

I can't imagine that the paltry few bytes of CMOS RAM, most of which
are already in use, would be enough to store more than a handful of
such errors. In any case, what is the point of ECC if a system dies
when its error log becomes full?

My experience of ECC memory in mainframes is that a computer can run
forever, albeit with a minor performance penalty, if RAM errors are
limited to a single data bit per word. I can't see why PCs would be
any different. In fact the OP can easily test your hypothesis by
placing insulation tape over one data bit of each RAM stick's edge
connector and then subjecting his machine to normal everyday use.

- Franc Zabkar

Except that it is undesirable to do so. If there is a persistent hard
correctable error, those words are running with their protection
against soft errors pretty much gone. (There are tricks.)

So if one has a hard error, at least a block of memory ought to be
deallocated. If the hard-error rate exceeds a threshold, then all
blocks with errors should be deallocated.
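A rough sketch in C of the retirement policy being suggested here, with
invented block counts and thresholds; this is not how any particular
BIOS or service processor actually implements page/bank deallocation.

#include <stdio.h>
#include <stdbool.h>

#define NBLOCKS       16
#define HARD_REPEATS   3    /* same block erring this often = hard error */
#define RATE_THRESHOLD 4    /* this many bad blocks -> retire them all   */

struct block {
    unsigned corrected;     /* corrected-error count for this block */
    bool     deallocated;
};

static void report_corrected_error(struct block blocks[], int idx)
{
    blocks[idx].corrected++;

    /* Hard error in one block: retire just that block. */
    if (blocks[idx].corrected >= HARD_REPEATS)
        blocks[idx].deallocated = true;

    /* Too many distinct blocks reporting errors: retire all of them. */
    int bad = 0;
    for (int i = 0; i < NBLOCKS; i++)
        if (blocks[i].corrected > 0)
            bad++;
    if (bad >= RATE_THRESHOLD)
        for (int i = 0; i < NBLOCKS; i++)
            if (blocks[i].corrected > 0)
                blocks[i].deallocated = true;
}

int main(void)
{
    struct block blocks[NBLOCKS] = { 0 };

    /* Block 5 has a stuck bit; blocks 1, 2 and 9 see occasional soft errors. */
    for (int i = 0; i < 3; i++) report_corrected_error(blocks, 5);
    report_corrected_error(blocks, 1);
    report_corrected_error(blocks, 2);
    report_corrected_error(blocks, 9);

    for (int i = 0; i < NBLOCKS; i++)
        if (blocks[i].deallocated)
            printf("block %d deallocated\n", i);
    return 0;
}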
 
Franc Zabkar

Franc Zabkar said:
Phil wrote in part:
During MemTest86+ v1.70 (latest with Win98SE boot floppy) for
reliabilty testing in upgrading used RAM memory in a popular,
redeployed IBM xSeries 345 (dual Xeon 2.8GHz, 8670-61X, 100MHz
FSB, 2002 era with latest v1.21 BIOS, v1.09 ISMP) 2U rack server,
the screen hung about 30 min into the test.
What happened is memtest stressed the memory, it failed but
the ECC caught it (so memtest never saw it) and after enough
of these [silently filling logs], the machine shut itself
down and wouldn't repower until fixed (clear CMOS).

I can't imagine that the paltry few bytes of CMOS RAM, most of which
are already in use, would be enough to store more than a handful of
such errors. In any case, what is the point of ECC if a system dies
when its error log becomes full?

My experience of ECC memory in mainframes is that a computer can run
forever, albeit with a minor performance penalty, if RAM errors are
limited to a single data bit per word. I can't see why PCs would be
any different. In fact the OP can easily test your hypothesis by
placing insulation tape over one data bit of each RAM stick's edge
connector and then subjecting his machine to normal everyday use.

- Franc Zabkar

Except that it is undesirable so to do. If there is a pesistent hard
correctable error, those words are running with their protection against
soft errors gone, pretty much anyway. (there are tricks)

So if one has a hard error, at least a block of memory ought to be
deallocated. if one has a hard error or rate exceeds threshold, then all
blocks with errors should be deallocated.

This appears to be the HW Maintenance Manual for the OP's machine:

ftp://ftp.software.ibm.com/systems/support/system_x_pdf/48p9718.pdf

AIUI, the Integrated System Management processor ("service processor")
is able to deallocate faulty DIMM banks on the fly (see page 94).

Page 6 of the user manual ...

ftp://ftp.software.ibm.com/systems/support/system_x_pdf/88p9189.pdf

... confirms that the server incorporates "memory scrubbing and
Predictive Failure Analysis".
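A toy model in C of what that scrubbing buys you: a background pass
visits every word, so single-bit soft errors get corrected (and written
back) before a second flip can land in the same word and become
uncorrectable. The error_mask field simply stands in for whatever bits
a real ECC word has absorbed; the word and error counts are invented.

#include <stdio.h>
#include <stdint.h>

#define NWORDS 8

struct word {
    uint64_t value;       /* stored data                     */
    uint64_t error_mask;  /* bits currently flipped in DRAM  */
};

/* Count set bits (portable popcount). */
static int popcount64(uint64_t x)
{
    int n = 0;
    for (; x; x &= x - 1)
        n++;
    return n;
}

/* One scrub pass: single-bit errors are corrected in place,
 * multi-bit errors are reported as uncorrectable. */
static void scrub(struct word mem[], int nwords)
{
    for (int i = 0; i < nwords; i++) {
        int flips = popcount64(mem[i].error_mask);
        if (flips == 1) {
            mem[i].error_mask = 0;           /* SEC ECC fixes it */
            printf("word %d: corrected single-bit error\n", i);
        } else if (flips > 1) {
            printf("word %d: UNCORRECTABLE (%d bad bits)\n", i, flips);
        }
    }
}

int main(void)
{
    struct word mem[NWORDS] = { 0 };

    mem[2].error_mask = 1ull << 17;                 /* one soft error    */
    mem[5].error_mask = (1ull << 3) | (1ull << 40); /* two flips, too late */

    scrub(mem, NWORDS);
    return 0;
}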

Furthermore, page 5 of the same manual states that "the memory
controller also provides Chipkill™ memory protection if all DIMMs are
of the type x4. Chipkill memory protection is a technology that
protects the system from a single chip failure on a DIMM."

I notice also that there is a diagnostic LED which reports when the
error log is more than 75% full (see page 34 of the HW manual). Is RR
onto something?

- Franc Zabkar
 
Robert Redelmeier

Franc Zabkar said:
I can't imagine that the paltry few bytes of CMOS RAM, most

I wasn't aware of any limit to CMOS RAM. Most systems have
little, but a very low-level designer could put in more,
probably at a different port address. BIOS isn't fixed.
Franc Zabkar said:
... of which are already in use, would be enough to store more
than a handful of such errors. In any case, what is the point
of ECC if a system dies when its error log becomes full?

Avoiding error! In many business apps, errors are worse
than downtime. Keeping a suspect machine up that could be
propagating errors and enshrining them in a database is a
DB admin's worst nightmare.
Franc Zabkar said:
My experience of ECC memory in mainframes is that a computer
can run forever, albeit with a minor performance penalty,

So long as the errors are rare and not localized. It also
depends very much on the calcs. A scientific machine doing
iterative calcs could probably tolerate/heal errors much
better than an accounting package running integers.
Franc Zabkar said:
... if RAM errors are limited to a single data bit per word. I
can't see why PCs would be any different. In fact the OP
can easily test your hypothesis by placing insulation tape
over one data bit of each RAM stick's edge connector and
then subjecting his machine to normal everyday use.

Oh, that'd be messy. The connectors have too much pressure
and too little clearance. I'd expect connector damage unless
just the right [Mylar?] tape was used.

-- Robert
 
Franc Zabkar

Robert Redelmeier said:
I wasn't aware of any limit to CMOS RAM. Most systems have
little, but a very low-level designer could put in more,
probably at a different port address. BIOS isn't fixed.

Many (most?) systems now have 256 bytes of CMOS RAM. AFAIK, the first
128 bytes are accessed via ports 70/71h, and the next 128 bytes via
ports 72/73h.
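For anyone curious what is actually in there, here is a small Linux/x86
sketch that dumps those 256 bytes through the index/data port pairs
just listed (70h/71h for the lower bank, 72h/73h for the upper). It
needs root for ioperm(); reading /dev/nvram is the gentler alternative.
Whether the x345's ISMP keeps its error log here or in separate
NVRAM/EEPROM is exactly the open question.

#include <stdio.h>
#include <stdlib.h>
#include <sys/io.h>

static unsigned char cmos_read(unsigned index)
{
    if (index < 128) {
        outb(index & 0x7f, 0x70);   /* keep bit 7 (NMI disable) clear */
        return inb(0x71);
    }
    outb(index & 0x7f, 0x72);       /* upper bank via 72h/73h */
    return inb(0x73);
}

int main(void)
{
    /* Request access to I/O ports 0x70..0x73. */
    if (ioperm(0x70, 4, 1) != 0) {
        perror("ioperm (are you root?)");
        return EXIT_FAILURE;
    }

    for (unsigned i = 0; i < 256; i++) {
        if (i % 16 == 0)
            printf("\n%02X:", i);
        printf(" %02X", cmos_read(i));
    }
    putchar('\n');
    return 0;
}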

I suppose it's possible to have more CMOS RAM, but it could also be
that the Integrated System Management Processor has its own RAM or
EEPROM. FWIW, other IBM server products appear to write their error
logs to "NVRAM", which in PC terms usually refers to an EEPROM.
Robert Redelmeier said:
Avoiding error! In many business apps, errors are worse
than downtime. Keeping a suspect machine up that could be
propagating errors and enshrining them in a database is a
DB admins worst nightmare.

Not in my experience. The performance penalty of a faulty memory bit
usually amounted to no more than an extra clock cycle. Taking a
mainframe out of service for a non-fatal error would have meant that
up to a dozen workstations would have been idle. Furthermore, many
servers run 24/7 doing batch jobs.

The whole point of ECC, especially in servers, is to provide a fault
tolerant system. If the error log is full, then the machine should
alert the operator, but that's all. In fact the OP's machine does
indicate when the log is 75% full.
Robert Redelmeier said:
So long as the errors are rare and not localized.

Not true. With ECC you can have a dead bit at *every* address in
*every* memory module and still have a functioning system. It's only
when you have a multi-bit error that the system can break down.

See the references to ChipKill and "memory scrubbing" in IBM's
documentation.
Robert Redelmeier said:
It also
depends very much on the calcs. A scientific machine doing
interative calcs could probably tolerate/heal error much
better than an accounting package running integers.

I don't see it.

It may help to know which chipset is detected by memtest86+. I found
one URL which suggests that the OP's chipset may be the Serverworks
Serverset CNB20-HE.

FWIW, the following URL describes a problem with memtest86+ v1.65:

Support for Serverworks Serverset (CNB20HE)?
http://forum.x86-secret.com/archive/index.php/t-4459.html

The author writes:

"This [test failure] seems to happen only with 2x1GB memory strips.
.... If I test with 2x512MB everything works fine."

- Franc Zabkar
 
Phil

Franc Zabkar said:
I wasn't aware of any limit to CMOS RAM. Most systems have
little, but a very low-level designer could put in more,
probably at a different port address. BIOS isn't fixed.

Many (most?) systems now have 256 bytes of CMOS RAM. AFAIK, the first
128 bytes are accessed via ports 70/71h, and the next 128 bytes via
ports 72/73.

I suppose it's possible to have more CMOS RAM, but it could also be
that the Integrated System Management Processor has its own RAM or
EEPROM. FWIW, other IBM server products appear to write their error
logs to "NVRAM", which in PC terms usually refers to an EEPROM.
Avoiding error! In many business apps, errors are worse
than downtime. Keeping a suspect machine up that could be
propagating errors and enshrining them in a database is a
DB admins worst nightmare.

Not in my experience. The performance penalty of a faulty memory bit
usually amounted to no more than an extra clock cycle. Taking a
mainframe out of service for a non-fatal error would have meant that
up to a dozen workstations would have been idle. Furthermore, many
servers run 24/7 doing batch jobs.

The whole point of ECC, especially in servers, is to provide a fault
tolerant system. If the error log is full, then the machine should
alert the operator, but that's all. In fact the OP's machine does
indicate when the log is 75% full.
So long as the errors are rare and not localized.

Not true. With ECC you can have a dead bit at *every* address in
*every* memory module and still have a functioning system. It's only
when you have a multi-bit error that the system can break down.

See the references to ChipKill and "memory scrubbing" in IBM's
documentation.
It also
depends very much on the calcs. A scientific machine doing
interative calcs could probably tolerate/heal error much
better than an accounting package running integers.

I don't see it.

It may help to know which chipset is detected by memtest86+. I found
one URL which suggests that the OP's chipset may be the Serverworks
Serverset CNB20-HE.

FWIW, the following URL describes a problem with memtest86+ v1.65:

Support for Serverworks Serverset (CNB20HE)?
http://forum.x86-secret.com/archive/index.php/t-4459.html

The author writes:

"This [test failure] seems to happen only with 2x1GB memory strips.
... If I test with 2x512MB everything works fine."

- Franc Zabkar

I tested with a slightly older version, MemTest86+ v1.65, on the
original IBM x345 server OEM 256MB DDR SDRAM DIMM sticks in the
server's slots 1 and 2. The results were again similar, with the box
hanging after about 45 minutes of testing and again requiring an R&R of
the CMOS battery before it would boot.

The memory is mfgr'd by Micron Tech, with two different date codes.
Again my conclusion is that there is a thermally related failure mode,
with the older 2002 date codes failing first, presumably because they
were fabricated with an older process technology that results in higher
power consumption.

My conclusion is that gamer-type RAM coolers (convection heat sinks)
are required to reduce memory reliability issues with IBM OEM DIMM
memory in their legacy 2002 xSeries servers, even though the 2U
servers are quite well designed with dual redundant banks of 4 fans
across the cross-section of the chassis (wind-tunnel type design). Any
SEs concur? Regards, Phil

Details follow:
IBM memory P/N 38L4029, FRU 09N4306: 2 sticks of 256MB PC2100 CL2.5
2.5v registered ECC, double-sided, organized 32M x 72.

The older Micron Tech PC2100A-25330-M1 DIMMs, with 18 MT46V32M4-75A
chips each, date code late 2002. This pair hung 32 min into the testing
cycle, with each pass taking 11 min for 512MB, or 2 1/2 passes.

The newer Micron Tech PC2100A-25331-Z DIMMs, with 18 MT46V32M4-75B
chips each, date code mid 2003. This pair hung 49 min into the testing
cycle (similar to the newer 1GB DIMMs), or just over 4 passes, at
Test 5, Block move. The chips were almost too hot to touch with the
pinky finger.


Thanks guys for your comments; my replies to your questions are:
1) RR: the computer was internally very clean, as if this unit had been
a backup kept on the shelf. Nowhere do any manuals state that filled
BIOS System-Error logs prevent the machine from booting.
2) FZ#2: the computer operation is factory stock; none of the LEDs
light up in IBM's Light Path Diagnostics panel or on the mainboard /
planar. Thanks for the correction; memory operation has a 10ns cycle
time at a 100MHz clock. Samsung parts indeed are K4510638D-TB80, 8ns
parts, with sufficient design bandwidth margin. DDR clocking gives the
266MHz operation to the Xeon processors with their 533MHz FSB (see the
bandwidth sketch after this list). My feeling on memory chip power
consumption is that it is more than the 1.5 watts spec.
3) DC: not using IBM's ChipKill technology DIMMs, so deallocation of
a block of memory space is not effected.
4) FZ#4: again not using IBM ChipKill DIMMs. I'll again refer you to
IBM's White paper on ChipKill.
5) FZ, RR, the HW Manual references p34, 94 do not pertain to problem
at hand. Same with User Manual, p5, 6.
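The bandwidth sketch referred to in item 2 above, using nominal PC2100
and FSB numbers only (133MHz clock, double data rate, 64-bit paths); it
shows why the matched-pair 2-way interleaving roughly lines up with the
533 MT/s front-side bus.

#include <stdio.h>

int main(void)
{
    const double mem_clock_mhz = 133.0;   /* PC2100 nominal clock      */
    const double ddr_factor    = 2.0;     /* transfers per clock       */
    const double bus_bytes     = 8.0;     /* 64-bit data path          */
    const double fsb_mts       = 533.0;   /* Xeon front-side bus, MT/s */

    double per_channel = mem_clock_mhz * ddr_factor * bus_bytes; /* MB/s */
    double interleaved = per_channel * 2.0;   /* matched pair, 2-way    */
    double fsb_bw      = fsb_mts * bus_bytes;

    printf("one DIMM channel : %.0f MB/s (PC2100)\n", per_channel);
    printf("2-way interleaved: %.0f MB/s\n", interleaved);
    printf("533 MT/s FSB     : %.0f MB/s\n", fsb_bw);
    return 0;
}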
 
Franc Zabkar

Phil said:
Tested with a slightly older version of MemTest86+ v1.65 on the
original IBM x345 Server OEM 256MB DDR SDRAM memory DIMM sticks in the
server's slot 1 and 2. The results were again similar, with the
computer box hanging after about 45 minutes of testing and again
requiring a R&R of the CMOS battery before it would boot again.

The memory is mfgr'd by Micron Tech with two different date codes.
Again my conclusion is that there is a thermally related failure mode;
the older 2002 date codes failing first, presumably fabricated with
older process technology that results in higher power consumption.

The "D" in the part number (K4H510638D) indicates the "generation" of
manufacture.

http://www.samsung.com/Products/Sem...omponent/512Mbit/K4H510638B/ds_k4h510638b.pdf
http://www.samsung.com/Products/Sem...omponent/512Mbit/K4H510638C/ds_k4h510638c.pdf

8. Version
M : 1st Generation
A : 2nd Generation
B : 3rd Generation
C : 4th Generation
* D : 5th Generation
E : 6th Generation

I would think that a newer process technology would have a higher
version number.
Phil said:
My conclusion is that gamer-type RAM coolers (convection heat sinks)
are required to reduce memory reliability issues with IBM OEM DIMM
memory in their legacy 2002 xSeries servers, even though the 2U
servers are quite well designed with dual redundant banks of 4 fans
across the cross-section of the chassis (wind-tunnel type design). Any
SEs concur? Regards, Phil

Details follow:
IBM memory P/N 38L4029 FRU 09N4306, 2 sticks of 256MB PC2100 CL2.5
2.5v registered ECC, double sided, organized 32Mb x 72

The older Micron Tech PC2100A-25330-M1 DIMM with 18 chips 46V32M4-75A
date code late 2002. This pair hung 32 min into testing cycle with
each pass taking 11 min for 512MB, or 2 1/2 passes.

The newer Micron Tech PC2100A-25331-Z DIMM with 18 chips 46V32M4-75B
date code mid 2003. This pair hung 49 min (similar to newer 1GB DIMM)
into testing cycle, or just over 4 passes. Hung at Test5, Block move.
The chips were almost-too-hot-to-touch with the pinky finger.
Samsung parts indeed are K4510638D-TB80, 8ns parts,

According to the datasheet, the correct suffix is TCB0 which makes
them 7.5ns parts.

9. Package
* T : TSOP2 (400mil x 875mil)

10. Temperature & Power
* C : (Commercial, Normal)
L : (Commercial, Low)

11. Speed
A0 : 10ns@CL2
A2 : 7.5ns@CL2
* B0 : 7.5ns@CL2.5
Phil said:
with sufficient design bandwidth margin. DDR clocking gives the 266MHz
operation to the Xeon processors with their 533MHz FSB. My feeling on
memory chip power consumption is that it is more than the 1.5 watts
spec.
3) DC: not using IBM's ChipKill technology DIMMs, so deallocation of
a block of memory space is not effected.
4) FZ#4: again not using IBM ChipKill DIMMs. I'll again refer you to
IBM's White paper on ChipKill.

It's the memory controller that provides the ChipKill functionality,
not the DIMM.

According to IBM's white paper ...

http://www-03.ibm.com/servers/eserver/pseries/campaigns/chipkill.pdf

"The memory subsystem design is such that a single chip, no matter
what its data width, would not affect more than one bit in any given
ECC word. For example, if x4 DRAMs were in use, each of the 4 DQs
would feed a different ECC word, that is, a different address of the
memory space. Thus even in the case of an entire chipkill, no single
ECC word will experience more than one bit of bad data -- which is
fixable by the SEC ECC -- and thereby the fault-tolerance of the
memory subsystem is maintained."

Furthermore, your user manual ...

ftp://ftp.software.ibm.com/systems/support/system_x_pdf/88p9189.pdf

... states that "the memory controller also provides Chipkill™ memory
protection if all DIMMs are of the type x4."

Therefore, as Samsung's datasheet states that the organisation of your
DRAMs is "stacked x4" (128Mx4), this suggests to me that your system
*does* support ChipKill.

AIUI your motherboard's memory controller spreads each DRAM chip's
four data bits over four distinct addresses, which means that in the
worst case a faulty chip will give rise to four correctable single-bit
errors rather than a single uncorrectable 4-bit error.
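A small sketch of that bit-steering idea in C, with made-up chip and
word counts rather than the x345's real layout: because each x4 chip
contributes one data lane to four different ECC words, even a
completely dead chip shows up as one correctable bit per word instead
of a 4-bit error concentrated in a single word.

#include <stdio.h>

#define NCHIPS 18      /* x4 devices across a 72-bit rank       */
#define NWORDS  4      /* ECC words an access is spread across  */

int main(void)
{
    int bad_bits_per_word[NWORDS] = { 0 };
    int dead_chip = 7;                     /* suppose this chip fails */

    /* Spread layout: DQ lane w of every chip feeds ECC word w. */
    for (int chip = 0; chip < NCHIPS; chip++)
        for (int dq = 0; dq < 4; dq++)
            if (chip == dead_chip)
                bad_bits_per_word[dq]++;   /* one bad bit per word */

    for (int w = 0; w < NWORDS; w++)
        printf("ECC word %d: %d bad bit(s) -> %s\n", w,
               bad_bits_per_word[w],
               bad_bits_per_word[w] <= 1 ? "correctable by SEC ECC"
                                         : "uncorrectable");
    return 0;
}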
Phil said:
5) FZ, RR, the HW Manual references p34, 94 do not pertain to problem
at hand. Same with User Manual, p5, 6.

I'm wondering whether clearing your CMOS RAM is a red herring. If you
allow sufficient time for your machine to cool, does this achieve the
same end?

- Franc Zabkar
 
