My disk failure adventure (longish)


Bob S

This is an account of a disk problem and recovery. It includes
preliminary history, failure symptoms, recovery actions and results,
and questions.

The "bad" disk is an 80GB Seagate Barracuda ATA IV ST380021A,
formatted as one large NTFS partition with 512-byte cluster size. The
system is a Dell Dimension 8200 running Windows 2000.

Possibly Interesting Events Prior to Failure

The power supply failed in a way that caused it to crowbar every time
the machine was turned on. Nothing had been added to the machine
lately that might increase the load. I replaced the original Dell
supply with a PC Power and Cooling supply and the machine started
working fine again. (For some reason the PCPC people insisted that
even though the Dell supply was 250 watts, their replacement had to be
400 watts, at extra cost.)

A month or so later, the whole town had a power failure; apparently
someone knocked down a pole. The UPS kept the machine going until I
shut down Windows 2000 and powered it down. It appeared to shut down
normally.

When I powered the machine up the next morning, it wouldn't boot.

Immediate Symptoms

An error message appeared, something along the lines of "couldn't find
the operating system, if this is the first time this has happened just
try it again". I tried it again, and it got as far as the white
"starting Windows 2000" screen and hung when the thirteenth blue
square appeared in the progress bar. After that it hung at the same
point every time.

I put the disk into another machine as a second disk to see what I
could see. It made the machine very slow to boot, and once the machine
came up it had a lot of trouble opening Windows Explorer for the bad
disk. It finally stopped rattling and opened the window, but
Properties claimed that the disk was size zero with lots of free space
and no used space. I decided that I didn't want to risk the stability
of the second machine any further.

Recovery Attempts

Most things were available from backup, but I decided to see whether I
could get the very latest version of everything.

I obtained a new blank disk, put it in the original machine, loaded
the OS onto it, and tried it out. It worked.

I put the "bad" disk into the machine as a second disk, and tried to
boot. The boot failed! I pulled the "bad" disk, but the machine still
failed to boot!

I ran repair twice on the "good" disk. The first time it said it was
fixing the structure; the second time it claimed to be checking files.
After that the machine booted again (still without the "bad" disk) and
behaved well.

I got a USB external drive box, put the "bad" disk in that, and
attached it to the machine. The machine booted OK.

I then used a program called GetDataBack to recover files from the
"bad" disk. This program scanned through the "bad" disk and found both
Master File Tables with no trouble. It found vast numbers of files and
directories. (I have no way of knowing whether it found absolutely
everything that had once been there, but it certainly appeared to.)
The only oddity during the scan was that the cluster size was
reported as some six-digit number, with the parenthetical note "(must
be 512 bytes)", which was correct.

The scan took a couple of days. This was partly because the software just
plain takes a while, and partly because when it put up the occasional
error message about bad blocks in the middle of the night, there was
nobody there to say "OK".
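
For anyone curious what an unattended read might look like, here is a
minimal sketch. It is not how GetDataBack works; the function name
copy_skipping_bad_blocks and the read_block callback are hypothetical,
made up for illustration. The idea is only that an unreadable block is
logged and zero-filled instead of stopping to wait for someone to say
"OK".

```python
def copy_skipping_bad_blocks(read_block, total_size, block_size=512):
    """Read total_size bytes one block at a time.

    read_block(offset, size) is a caller-supplied (hypothetical) function
    that returns the bytes at that offset or raises OSError if the block
    cannot be read. Unreadable blocks are zero-filled and their offsets
    are returned so they can be reviewed later.
    """
    data = bytearray()
    bad_offsets = []
    for offset in range(0, total_size, block_size):
        # The last block may be shorter than block_size.
        size = min(block_size, total_size - offset)
        try:
            chunk = read_block(offset, size)
        except OSError:
            # Log and carry on instead of waiting for an operator.
            bad_offsets.append(offset)
            chunk = bytes(size)
        data += chunk
    return bytes(data), bad_offsets
```

The list of bad offsets plays the role of the error dialogs: it can be
read in the morning rather than acknowledged one at a time overnight.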

The program recovered all but two of the 200,000 or so user files that
it found, copying them to the "good" disk. These two files had
blocks that could not be read. Both were available from backup.

The only other problem was with a few files with extremely long names.
These files were being recovered to a directory on the "good" disk
that was deeper in the tree than their original location on the "bad"
disk, so apparently the total path name length on the new disk exceeded
some limit, and they could not be copied. I recovered them separately to
a higher-level directory, renamed them with shorter names, and
copied them to the tree location where they belonged. The GetDataBack
program has a way to rename files before copying, but I couldn't get
it to work.
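
A check along these lines would flag the troublesome files before
starting the copy. This is only a sketch: it assumes the limit involved
was the classic Windows MAX_PATH of 260 characters (the recovery program
never said which limit it hit), and the function name paths_too_long is
made up for illustration.

```python
import ntpath

# Assumption: the classic Windows MAX_PATH of 260 characters, which
# includes the terminating NUL. The actual limit hit is not known.
MAX_PATH = 260

def paths_too_long(relative_paths, dest_root, limit=MAX_PATH):
    """Return the relative paths that would not fit under dest_root."""
    too_long = []
    for rel in relative_paths:
        # ntpath joins with Windows rules regardless of the host OS.
        full = ntpath.join(dest_root, rel)
        # One character of the limit is reserved for the terminating NUL.
        if len(full) > limit - 1:
            too_long.append(rel)
    return too_long
```

Running it once with the deep recovery directory and once with a
directory near the root shows which files need the recover-high,
rename, then move treatment.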

I did not try recovering system files since they wouldn't be much use
to me, so there might well have been a problem with one of them.

The GetDataBack program also found many deleted files. Just for fun I
tried recovering a few of them. Almost all of them turned out to be
files some of whose blocks had already been used by other files, so
the recovered bits were not useful.

All in all, the GetDataBack program did what it said it would do.

Questions for Speculation

Might any of the power supply difficulties have caused the problem?

Is there anything particular that would cause the boot to fail after
the thirteenth progress marker?

How did the bad disk screw up the structure of the good disk before
the system could even finish booting?

Does anyone have an idea just what is wrong with the disk? It seems to
be mostly all right mechanically, at least, and the MFTs were apparently
readable.

Do the unreadable blocks suggest that the surface is failing, or is
this just some software problem that could be cured by formatting the
disk and starting over?

Is there some other recovery action that might have worked?
 

Eric Gisin

Most likely the voltages to the disk surged or did not drop cleanly when
the power failed, and random bits were written to one or more tracks.

You can run SeaTools to get an idea of how bad the damage is.
 
