File corruption after SATA2 disk-to-disk transfer

Castor Nageur

Hi all,

Here is my "home-made" system:

Mobo: Gigabyte GA-P35-DS4 (rev 1.1)
CPU: Intel Quad Core Q6600
Disks: 2 Western Digital Caviar Green SATA2 (1TB, 1.5TB) + 1 brand
new Seagate Barracuda Green SATA3 (2TB) for W7
Memory: 4x2GB of brand new GSkill DDR2-1066 memory (in replacement
of my old 2x1GB DDR2-800 Corsair memory).
Chipset: Intel ICH9R configured in AHCI in BIOS

I upgraded from Windows XP SP2 to Windows 7 Ultimate.
Yesterday, I copied some important files for which I already had a
CHECKSUM.md5 file (generated with MD5Checker).
When I checked the destination, I noticed that many files were
corrupted (source and target MD5 did not match), which means that I have
an unstable system that corrupts my data during transfers.
The weird thing is that the Windows 7 DVD installer did not complain about
this => I suppose the problem only occurs with SATA transfers, and my DVD
drive is IDE.
I cannot say for sure that the problem did not occur with Windows XP,
because (unfortunately :-() it has been a long time since I last
performed any MD5 checking.

Consequently, I checked:

* the memory with Memtest86+ : I let it run for 5 passes (more than 8
hours) and it did not detect any error.
* the disks : the surface analysis detected no error.
* I did a clean W7 install => same problem
* I down-clocked my memory from 1066 to 800 => same problem
* I flashed my Mobo BIOS with many different versions (found on
Gigabyte web site) => same problem each time

I conclude my motherboard chipset is probably faulty (5 years old and
intensively used) so I am going to change it otherwise it will corrupt
all my data.

My big concern is that my system did not warn me about any problem: it
silently copied the files without detecting any CRC errors.
I suppose ECC memory would not help because my memory tests detected
no error.

* Because I do not want this to happen to me again: do you know if there
is an easy way to set up my system so that it warns me of any file
corruption during a disk transfer?
* Can RAID handle that kind of error?

Thanks in advance for helping me.
 
Paul

Try a different SATA cable ? If the SATA cable is crushed or kinked,
it can affect the impedance of the transmission lines inside the
cable. And that can lead to errors in the packet data. At least,
there have been reports of corrupt data, that traced to a bad cable.

In terms of error checking -

CPU --- Memory (ECC)
|
NB
|
SB ---- SATA (CRC on packet data) ----- Disk (Error check on sector)

What is missing, is an end to end check. There are some point to point
checks in the system, but the coverage may not be as complete as it could be.

For example, many Intel desktop boards don't support ECC, so ECC
is not an option. AMD motherboards may have more options for
that kind of support. ECC memory support must be tested (by someone)
to prove it actually works, because there have been cases where
all the necessary BIOS settings aren't available.

Have you considered downloading a file with a known MD5 sum (such as
a 700MB Linux distro), and transferring that to the hard drive, and
seeing if that is corrupted or not. Perhaps the originally
generated MD5 sums are bad ? I would want to make a few more attempts
to test it.
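
A minimal sketch of the check suggested above, in Python. It assumes the checksum file uses the common md5sum layout (one `<hex digest> *<relative path>` entry per line); the exact format MD5Checker writes is not stated in this thread, and the names `md5_of_file` and `verify_manifest` are made up for illustration.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path, root):
    """Return a list of (name, reason) pairs for entries that fail the check.

    Assumed manifest layout (md5sum style): "<hex digest> *<relative path>".
    """
    failures = []
    text = Path(manifest_path).read_text(encoding="utf-8")
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        expected, _, name = line.partition(" ")
        name = name.strip().lstrip("*")  # drop the binary-mode marker
        target = Path(root) / name
        if not target.is_file():
            failures.append((name, "missing"))
        elif md5_of_file(target) != expected.lower():
            failures.append((name, "mismatch"))
    return failures
```

Running `verify_manifest` once against the source folder and once against the copy shows whether the stored sums or the copied data are at fault: an empty list for the source and mismatches for the destination point at the transfer.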

To verify the memory, I would use a copy of Prime95 (stress test option)
from mersenne.org/freesoft . Prime95 stress test, does a math calculation
with a known answer, and holds the calculation in system memory. As a
stress test, it is more successful at detecting dynamic memory errors,
than memtest86+ is. The program stops and a GUI element turns red,
if even a single error is detected.

I would also run the manufacturer's diagnostic (like download a disk
tester from Western Digital) and test the three drives. There may be
a test for example, which verifies the cache memory on each disk drive
controller board.

Paul
 
Castor Nageur

> Try a different SATA cable ? If the SATA cable is crushed or kinked,
> it can affect the impedance of the transmission lines inside the
> cable. And that can lead to errors in the packet data. At least,
> there have been reports of corrupt data, that traced to a bad cable.

I already did that test and changing the cables did not change
anything.

> Have you considered downloading a file with a known MD5 sum (such as
> a 700MB Linux distro), and transferring that to the hard drive, and
> seeing if that is corrupted or not. Perhaps the originally
> generated MD5 sums are bad ? I would want to make a few more attempts
> to test it.

I already did that test: I transferred 500 GB of data files for which
I had the original MD5 checksums (700 MB is not enough to reproduce
the problem).
If I run MD5Checksum from the source disk, all the calculated MD5
sums match the original ones.
Moreover, this tends to show that the memory is fine (otherwise I would
get checksum errors even on the source).

> To verify the memory, I would use a copy of Prime95 (stress test option)
> from mersenne.org/freesoft . Prime95 stress test, does a math calculation
> with a known answer, and holds the calculation in system memory. As a
> stress test, it is more successful at detecting dynamic memory errors,
> than memtest86+ is. The program stops and a GUI element turns red,
> if even a single error is detected.

I do not believe the memory is faulty. First of all, it is brand new;
secondly, the memtest86+ runs were fine; and finally, data corruption only
occurs after a data transfer.
Paradoxically, my system is very stable.

> I would also run the manufacturer's diagnostic (like download a disk
> tester from Western Digital) and test the three drives. There may be
> a test for example, which verifies the cache memory on each disk drive
> controller board.

I already did this and the WD diag tools did not report any error.
Note that the same corruption problem occurs on my new Seagate
Barracuda Green 2TB disk.
I even installed a copy of Windows Server 2008 and the same problem
occurs.

Anyway, I could spend (and lose) days testing and testing again, so I
decided to rebuild my config from scratch, keeping only my hard drives.
In the past, I did a lot of x264 encoding work and I had a heat issue
because my CPU was not correctly cooled.
So it is highly probable that the heat damaged the motherboard.

* In fact, do you know any way to automatically detect that a
motherboard component is faulty and may corrupt your data?

* Why did the system not check the source disk CRC against the target
one? Could this be enabled?

* Is there any motherboard self-test? (Neither Windows XP, Windows 7,
Windows Server 2008, AIDA64 nor Everest detected anything wrong.)
 
Paul

The only way I know of, for an operating system to do what you'd
like, is to operate the file system in "write-verify" mode. I
don't think Windows supports that. Basically, that cuts disk
performance in half, as it writes the file, then reads it back
immediately to check the data.
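
As a rough user-level stand-in for write-verify, a copy can be followed by a read-back comparison. The sketch below is in Python; `file_digest` and `copy_with_verify` are hypothetical names, and MD5 is used only because it is the hash already in use in this thread. One caveat that is an assumption about typical OS behaviour rather than something tested here: the read-back may be served from the operating system's file cache instead of the disk, so for small files it may not exercise the SATA path at all. Transfers much larger than installed RAM are more likely to be read back from the disk itself.

```python
import hashlib
import os
import shutil


def file_digest(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_with_verify(source, destination):
    """Copy a file, then re-read both sides and compare their digests.

    Raises OSError if the destination does not match the source.
    """
    shutil.copyfile(source, destination)

    # Ask the OS to push the written data out to the device before reading
    # it back. This does not guarantee the read-back bypasses the cache.
    with open(destination, "rb+") as handle:
        os.fsync(handle.fileno())

    if file_digest(source) != file_digest(destination):
        raise OSError(f"verification failed: {destination} differs from {source}")
    return destination
```

Like the write-verify mode described above, this roughly doubles the I/O per file, since every byte written is read again.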

(At one time, when a new disk shipped from the manufacturer, the
hard drive itself was set up in a form of write verify, but that
isn't the same thing. A new disk would immediately read the data,
for the first ten power cycles or so. And then after that, it would
stop doing that. I presume that had something to do with detecting
errored blocks and sparing them out or something.)

So what you seem to be telling me, is you believe your Southbridge
(disk controller interface) is making this error. As you've removed
other things as issues. I've heard of a SATA port dying completely
on an Intel Southbridge. But I don't recollect an issue with
an Intel port operating degraded. It's possible it could do that,
if the amplitude on the SATA signals was weaker than it was
supposed to be.

In theory, the errors could be characterized by error counters or
loopback tests over the SATA. But I'm unaware of how you access
things like that.

http://www.freshpatents.com/Detecti...-loopback-modes-dt20080124ptan20080019280.php

"Loopback testing is one of the easiest and quickest ways to
conduct diagnostic tests for the purpose of debugging design or
connectivity bugs. This methodology has been widely used in the
electronics industry for a long time, such as for remote fault
isolation in the field. It has also resulted in significant
reduction of product or equipment downtime. The latest SATA
specification (Revision 2.5 and dated Oct. 27, 2005) specifies
three loopback modes for a SATA device (host, device, tester),
namely far-end retimed loopback, far-end analog, and near-end analog."

If we as users had access to stuff like that, it would be easier
to determine when a chip, device, or cable were bad. (We designed
loopback, into a lot of equipment at work, which is why I might
find it useful if it was available.) I've just never heard of
it being in any SATA-related software.

Paul
 
Loren Pechtel

> So what you seem to be telling me, is you believe your Southbridge
> (disk controller interface) is making this error. As you've removed
> other things as issues. I've heard of a SATA port dying completely
> on an Intel Southbridge. But I don't recollect an issue with
> an Intel port operating degraded. It's possible it could do that,
> if the amplitude on the SATA signals was weaker than it was
> supposed to be.

I've had multiple SATA ports fail so as to throw repeated disk errors.
My solution has always been to simply quit using the port. I've never
tried to use the port anyway.
 
