My SATA transfer corruption issue is back !

  • Thread starter: Castor Nageur
I doubt if the test software that RAM makers use runs on standard PCs,
since PCs likely can't generate worst-case test patterns on RAMs while
executing out of those same RAMs.

Test patterns are not the issue, generally. These are for obscure
bus problems and with the very uniform burst transfer model of
modern RAM these have become rare. The problem is the cells and
row/column amplifiers and these can be tested in PCs. Just
lower the voltage a bit and make the timing a bit tighter
than expected. Also decrease refresh frequency. The safety
margin you put in will decide the final quality. True, this
is a consumer-grade approach. But prices are consumer-grade as
well. If you want industrial RAM, expect higher prices. You
will get the same RAMs, but selected with a professional
tester, ideally with different temperatures in addition.

Arno
 
As to identifying the module: Use bisection.

First swap: 2 1 3 4 If error moves, it is in 1 or 2
Second swap: 1 2 4 3 If error moves, it is in 3 or 4.
Last swap for error in 1,2: 3 2 1 4 If error moves then it is in 1, else 2
Last swap for error in 3,4: 1 4 3 2 If error moves then it is in 4, else 3

No good module required, just some patience.
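
The swap sequence above can be written down as a small decision procedure. The following is a minimal Python sketch, not something posted in the thread. It assumes exactly one faulty module, an error that reproduces on every run, and an error that follows the module rather than the slot. `make_observer` is a hypothetical stand-in for a real MemTest86+ run: it simply reports the slot in which the error shows up.

```python
# Modules 1..4 sitting in slots A, B, C, D (the starting arrangement).
BASELINE = (1, 2, 3, 4)


def make_observer(faulty_module):
    """Simulate a test run: return the slot index where the error appears.

    Assumption: the error always follows the faulty module.
    """
    def observe(arrangement):
        return arrangement.index(faulty_module)
    return observe


def error_moved(observe, arrangement):
    """True if the error lands in a different slot than in the baseline."""
    return observe(arrangement) != observe(BASELINE)


def locate_faulty_module(observe):
    """Apply the swaps from the post and return the suspected module."""
    if error_moved(observe, (2, 1, 3, 4)):          # first swap: 1 <-> 2
        # Error is in 1 or 2; swap 1 with 3 to tell them apart.
        return 1 if error_moved(observe, (3, 2, 1, 4)) else 2
    if error_moved(observe, (1, 2, 4, 3)):          # second swap: 3 <-> 4
        # Error is in 3 or 4; swap 4 with 2 to tell them apart.
        return 4 if error_moved(observe, (1, 4, 3, 2)) else 3
    return None  # error never moved: the single-faulty-module assumption fails


for module in BASELINE:
    print(module, locate_faulty_module(make_observer(module)))
```

On a real machine, `observe` is you running the tester and noting where the error lands. A result of `None` means the error did not move in either swap, which would suggest a slot problem or an intermittent fault rather than a single bad module.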

Hi,

Here are my first results.
I named my memory slots starting from the CPU socket side A, B, C, D.
I numbered the DIMMs 1, 2, 3, 4.

* My initial configuration which produced the detected error is A1,
B2, C3, D4.
Here are the results:

A1 B2 C3 D4 => error detected at address 0X001BF979C00 at pass 3
(http://www.cijoint.fr/cj201109/cijIqjVMpU.jpg)

* I then tested the DIMMs one at a time (note that MemTest86+ detects
my DDR2 as DDR3 when plugging only 1 DIMM):

Ax Bx Cx D2 => no error (12 passes)
(http://www.cijoint.fr/cj201109/cijZZSejAG.jpg)

Ax Bx Cx D1 => no error (8 passes)
(http://www.cijoint.fr/cj201109/cijMjR2lNE.jpg)

Ax Bx Cx D3 => no error (8 passes)
(http://www.cijoint.fr/cj201109/cij1mqUEL9.jpg)

Ax Bx Cx D4 => no error (9 passes)
(http://www.cijoint.fr/cj201109/cijGAGjkop.jpg)

* Then I started the bisection:

A2 B1 C3 D4 => no error (3 passes)
(http://www.cijoint.fr/cj201109/cijQHoAdmW.jpg)

It seems that I can not reproduce the error anymore.
My 700 GB file copy test was also fine whereas I always had corrupted
files in the initial configuration.
Of course, I am not happy with this because the problem will certainly
occur again and more frequently.

Here is my hypothesis:

By moving the RAM, the bug occurrence was reduced in some way.

Of course, I could let it run for days but I can not test more than 8
consecutive hours at night because I need to work with my computer at
daytime.
I thought it could be conductive dust which went away when I plugged and
unplugged the modules, but I doubt it: my motherboard is clean and I am very
careful about dust when I plug in a new memory module or card.
Moreover, I suppose I would have got a lot of errors instead of only
one faulty byte (or even non-functional memory).

According to my motherboard manual, 1 was working with 3 and 2 with 4
in the initial state.
Now I moved them, 2 is working with 3 and 1 with 4.
Perhaps the modules work better together in this new configuration.
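
To make the pairing concrete, here is a small Python sketch. The pairing of slot A with C and slot B with D is an assumption inferred from the pairs listed above for this board; other motherboards may pair their slots differently.

```python
# Assumed slot pairing for this board, inferred from the post: A with C, B with D.
SLOT_PAIRS = (("A", "C"), ("B", "D"))


def paired_modules(layout):
    """Return the module pairs for a slot -> module layout."""
    return [(layout[first], layout[second]) for first, second in SLOT_PAIRS]


initial = {"A": 1, "B": 2, "C": 3, "D": 4}
after_swap = {"A": 2, "B": 1, "C": 3, "D": 4}

print(paired_modules(initial))     # [(1, 3), (2, 4)]
print(paired_modules(after_swap))  # [(2, 3), (1, 4)]
```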

The problem is that I cannot undervolt my RAM because my BIOS only allows
me to overvolt, so I do not know how I can increase the error
frequency.
 
Castor Nageur said:
Here are my first results.
I named my memory slots starting from the CPU socket side A, B, C, D.
I numbered the DIMMs 1, 2, 3, 4.
* My initial configuration which produced the detected error is A1,
B2, C3, D4.
Here are the results:
A1 B2 C3 D4 => error detected at address 0X001BF979C00 at pass 3
(http://www.cijoint.fr/cj201109/cijIqjVMpU.jpg)
* I then tested the DIMMs one at a time (note that MemTest86+ detects
my DDR2 as DDR3 when plugging only 1 DIMM):
* Then I started the bisection:
It seems that I can not reproduce the error anymore.
My 700 GB file copy test was also fine whereas I always had corrupted
files in the initial configuration.
Of course, I am not happy with this because the problem will certainly
occur again and more frequently.

Not necessarily, but likely.
Here is my hypothesis:
By moving the RAM, the bug occurrence was reduced in some way.
Of course, I could let it run for days but I can not test more than 8
consecutive hours at night because I need to work with my computer at
daytime.
I thought it could be conductive dust which went away when I plugged and
unplugged the modules, but I doubt it: my motherboard is clean and I am very
careful about dust when I plug in a new memory module or card.
Moreover, I suppose I would have got a lot of errors instead of only
one faulty byte (or even non-functional memory).

I agree. This is unfortunate. Maybe some external factor
like heat had something to do with it?
According to my motherboard manual, 1 was working with 3 and 2 with 4
in the initial state.
Now I moved them, 2 is working with 3 and 1 with 4.
Perhaps the modules work better together in this new configuration.

Can you move them back to the original configuration and test again?
There is a possibility that the weak bit is just a bit slower.
This would explain why it works with 1 module. It could also
explain why the different 4 module configuration works. The
problem is that the bus does not put the same load on the
output drivers of the RAM in each position. Load slows them
down. Unfortunately, closer to the CPU is _not_ necessarily
better. However, as A2 B1 C3 D4 does not give you errors,
this casts suspicion on modules 1 and 2, them being the
moved ones.
The problem is that I cannot undervolt my RAM because my BIOS only allows
me to overvolt, so I do not know how I can increase the error
frequency.

Does it allow you to speed up RAM timing? That may also
work.

Arno
 
Castor said:
In fact, it is not finished yet: I am unable to find the faulty
module from the MemTest86+ reported address. I did not find any help
in my mobo manual and people say that the use of the memory modules
depends on the motherboard implementation.

They also say that the DIMM slot can also be faulty so I first have
to find a good DIMM then try it on all the slots and finally check
all the DIMMs independently. I still have hours of testing !

My source: http://www.overclockers.com/forums/showthread.php?t=409152
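
As a rough illustration of why the reported address alone is not enough, the following Python sketch converts the address reported earlier in the thread (0X001BF979C00) into an offset in MiB and a naive 2 GB region index. The idea that each 2 GB module occupies one contiguous address range is a hypothetical simplification and is often wrong: dual-channel interleaving and memory remapping by the chipset can spread addresses across modules, which is why the mapping depends on the motherboard.

```python
ERROR_ADDRESS = 0x001BF979C00  # address reported by MemTest86+ in this thread
MODULE_SIZE = 2 * 1024**3      # 2 GB per DIMM (4x2GB configuration)


def offset_mib(address):
    """Offset of the address from the start of memory, in whole MiB."""
    return address // 1024**2


def naive_region(address, module_size=MODULE_SIZE):
    """Index of the 2 GB region containing the address.

    Assumption: one contiguous region per module, with no interleaving
    and no remapping. Real boards usually do not work this way.
    """
    return address // module_size


print(offset_mib(ERROR_ADDRESS))    # 7161
print(naive_region(ERROR_ADDRESS))  # 3
```

So the error sits about 7161 MiB into the 8 GB address space, in the last 2 GB region. Which physical DIMM backs that region still has to be found by swapping modules.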


Yes, I will buy Kingston ECC.


I realized that finding ECC RAM is not a problem.

Finding a CPU that handles ECC is also fine: I plan to buy the Xeon
E3-1275 (I do not want AMD because Intel CPUs seem to be
faster).

The problem is finding a motherboard that correctly supports ECC:
I plan to buy the Tyan S5512GM2NR, but it is nearly impossible to
find in France.


Tyan is noted for its stable hardware...even, perhaps, sacrificing
some performance, in favor of dependability.

My own Tyan S1830S (AT mainboard) has contained 1GB of ECC memory,
since May 2000; no RAM problems for me, whatsoever.

Good luck and happy hunting, for the S5512GM2NR!
 
I agree. This is unfortunate. Maybe some external factor
like heat had something to do with it?

If you mean external heat, the outside temperature over the last month did
not vary a lot (between 20°C and 30°C).
Inside the computer, I installed a huge CPU heatsink which keeps my
CPU temperature (and motherboard :-)) at 35°C at regular load and
never more than 46°C at full load. At any load, the motherboard
temperature never exceeds 45°C.
Can you move them back to the original configuration and test again?
There is a possibility that the weak bit is just a bit slower.

Yes, I can do this.
For the record, I let my A2 B1 C3 D4 configuration run longer
today (6 passes) and did not get any errors.
However, as A2 B1 C3 D4 does not give you errors,
this casts suspicion on modules 1 and 2, them being the
moved ones.
Absolutely.

Does it allow you to speed up RAM timing? That may also
work.

Yes, I can, but I would not like to burn my motherboard.
 
My own Tyan S1830S (AT mainboard) has contained 1GB of ECC memory,
since May 2000; no RAM problems for me, whatsoever.

Good luck and happy hunting, for the S5512GM2NR!

Thank you for your opinion.
Fortunately, I found a computer shop which can sell me the Tyan
S5512GM2NR.
 
Can you move them back to the original configuration and test again?

I restored the original configuration and did not detect any errors
after more than 9 hours of MemTest86+ testing (nearly 4 passes):

http://www.cijoint.fr/cj201109/cij1OelqJp.jpg

The 700 GB file copy test did not produce any error so the error
occurrence seems to have decreased.

I still don't know why I got the error.
I took a photo of my DIMM slot 2 (just before I restored the
configuration to 1234) and noticed a fairly large piece of dust compared to
the DIMM connector size (I had to use a flashlight to see it):

Dust in DIMM slot:
http://www.cijoint.fr/cj201109/cij4Ms4U27.jpg

Zoom on dust 1:
http://www.cijoint.fr/cj201109/cij23x2ayn.jpg

Zoom on dust 2:
http://www.cijoint.fr/cj201109/cijJ8NAKnP.jpg

Perhaps the dust prevented the electric current from flowing correctly
through the DIMM, but of course I cannot be sure.

I am going to frequently run MemTest86+ keeping this config and see
what happens ...
 
I restored the original configuration and did not detect any errors
after more than 9 hours of MemTest86+ testing (nearly 4 passes):

The 700 GB file copy test did not produce any error so the error
occurrence seems to have decreased.
I still don't know why I got the error.
I took a photo of my DIMM slot 2 (just before I restored the
configuration to 1234) and noticed a fairly large piece of dust compared to
the DIMM connector size (I had to use a flashlight to see it):
Perhaps the dust prevented the electric current from flowing correctly
through the DIMM, but of course I cannot be sure.
I am going to frequently run MemTest86+ keeping this config and see
what happens ...

While unlikely, it _is_ possible that some of the power
contacts had problems and this decreased stability. Moving
DIMMs around can correct that.

Arno
 
While unlikely, it _is_ possible that some of the power
contacts had problems and this decreased stability. Moving
DIMMs around can correct that.

I tested again and everything seems to work correctly now.
I cannot say for sure this was a dust problem but I am sure of one
thing :

A/ I had a systematic and reproducible faulty bit problem.
B/ After I moved the DIMMs and then re-plugged them in the same place, the
problem disappeared.

Anyway, I am currently building my new ECC-memory based computer ;-)
 
I tested again and everything seems to work correctly now.
I cannot say for sure this was a dust

This is not dust, but surface corrosion. It even happens
to gold-plated contacts sometimes. I have had one
USB stick just today with that. The solution is to
plug in and remove a few times.
problem but I am sure of one
thing :
A/ I had a systematic and reproducible faulty bit problem.
B/ After I moved the DIMMs and then re-plugged them in the same place, the
problem disappeared.

Hardware "Heisenbug". Unfortunately, they are not that rare.
Anyway, I am currently building my new ECC-memory based computer ;-)

;-)
 
This is not dust, but surface corrosion. It even happens
to gold-plated contacts sometimes. I have had one
USB stick just today with that. The solution is to
plug in and remove a few times.

Thanks Arno, we finally found it !
I learnt a lot of things by investigating this bug.
 
> > Mobo: Gigabyte GA-P35-DS4 (rev 1.1)
> > CPU: Intel Quad Core Q6600
> > Disks: WD Caviar Green (EADS serie) SATA2 (1TB, 1.5TB)
> > Seagate Barracuda Green SATA3 (2TB) for my W7 partition
> > Memory: 4x2GB Corsair l DDR2-800 memory
> > Chipset: Intel ICH9R configured in AHCI mode

> > Hi all,
> > In a previous post, I explained that when copying some data from disk
> > to disk, the destination data was corrupted (~ destination different
> > than source).
> > I replaced my 8 GB GSkill memory with some Corsair
> > memory. I did the copy test twice and the problem seemed to be solved.
> > I recently copied 670 GB of big files from one disk to another and
> > found that 10 files out of 359 were corrupted.
> > My previous tests proved that my RAM and disks were fine.

Not really. Testing RAM under normal conditions only indicates
it may be fine. For real testing you need to get it into an
extreme state (temperature, voltages and timing) and test
there. This is generally not feasible to do at home.

> > I plan to burn a Linux/Ubuntu install disk image from a clean computer
> > then install it on a clean reformatted hard disk on my corrupted
> > computer.
> > I will then copy/check the same files from Linux so I can exclude (or
> > not) an OS dependent problem.
> > If I also have errors under Linux, I will compare the files, determine
> > the difference pattern and post it here before I go to buy a new
> > computer !
> > * Is there a system read/write/verify feature I could enable under
> > Linux so it will stop at the first error so I do not have to copy all
> > each time ?

Not to my knowledge. And if the corruption happens before the file
goes into the write-buffer, it would not help anyways.
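
Lacking a built-in feature, a small script can copy file by file and stop at the first mismatch. The following is a minimal Python sketch under stated assumptions: the function names are made up for illustration, and re-reading a freshly written file may be served from the OS page cache rather than from the disk, so a clean comparison does not prove that the data on disk is good. It also cannot catch corruption that is already present when the source is first read.

```python
import shutil
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # compare 1 MiB at a time


def files_identical(path_a, path_b, chunk_size=CHUNK_SIZE):
    """Compare two files byte for byte, chunk by chunk."""
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files ended at the same point
                return True


def copy_tree_verified(src_dir, dst_dir):
    """Copy every file under src_dir and verify it right after the copy.

    Returns None if all files match, else the first source path whose
    copy differs, so the run stops at the first error.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    for src in sorted(src_dir.rglob("*")):
        if not src.is_file():
            continue
        dst = dst_dir / src.relative_to(src_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dst)
        if not files_identical(src, dst):
            return src
    return None
```

Calling `copy_tree_verified("/mnt/source", "/mnt/target")` (both paths are placeholders) returns the first file that failed verification, or `None` when every copy matched.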

Apart from that, your approach is sound. It may just be that
you have (had) more than one source of corruption. PC hardware
has gotten more reliable, but not at the same speed as memory and
disks have gotten larger.

Arno
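
The "difference pattern" mentioned in the quoted post can be extracted with a few lines of Python. This is a minimal sketch with hypothetical function names; it works on in-memory byte strings, so large files would have to be read in chunks. A difference of a single bit in a single byte would be consistent with the weak-bit theory discussed in this thread, while larger garbled runs might suggest a different cause.

```python
def diff_pattern(source, copy):
    """Return (offset, xor) for every byte position where the data differs.

    The XOR value has a 1 in each bit position that was flipped.
    Only the common length of the two inputs is compared.
    """
    return [
        (offset, a ^ b)
        for offset, (a, b) in enumerate(zip(source, copy))
        if a != b
    ]


def is_single_bit_flip(xor_byte):
    """True if exactly one bit is set in the XOR value."""
    return xor_byte != 0 and xor_byte & (xor_byte - 1) == 0


# Example with made-up data: flip bit 4 of the byte at offset 5.
original = bytes(range(16))
corrupted = bytearray(original)
corrupted[5] ^= 0x10

for offset, xor in diff_pattern(original, bytes(corrupted)):
    print(hex(offset), bin(xor), is_single_bit_flip(xor))
```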

Thanks for sharing the post with us.
 