philip.moore
About a month ago, my Windows XP Pro installation started to experience
random Blue Screens of Death.
One minute you'd be happily using your PC, and then, from nowhere, the
blue screen would be upon you.
It was extremely frustrating, and I had to jump through a lot of hoops
before I eventually tracked it down to a faulty RAM stick (which had
been fine for the previous seven months).
This summary brings together much of the advice I followed, and will
hopefully help others who are suffering from the terrible woe of random
Blue Screens of Death (BSODs).
BSODs
-----
When my PC started to fail, the symptoms were as follows:
+ PC would go to BSOD, either at random (overnight say) or whilst
loading/stopping an application.
+ The BSOD stop errors would vary (a lot of articles assume you always
get the same error).
+ I had *not* recently installed any new software or hardware (a lot
of articles assume you have).
Let's take a look at some of the errors which affected my PC, due to a
dodgy RAM stick:
+ STOP 0x0000000A IRQL_NOT_LESS_OR_EQUAL
+ STOP 0x0000008E KERNEL_MODE_EXCEPTION_NOT_HANDLED
+ STOP 0x1000008E KERNEL_MODE_EXCEPTION_NOT_HANDLED
+ STOP 0x0000004E PFN_LIST_CORRUPT
+ STOP 0x00000024 NTFS_FILE_SYSTEM
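If you keep a note of each stop error as it happens, a small lookup makes it easy to see that the codes vary from crash to crash. This is only a minimal sketch in Python: the table holds just the five codes listed above (it is not a complete list), and the name parse_stop_line is made up for illustration.

```python
import re

# Only the stop codes listed above; anything else is reported as UNKNOWN.
STOP_CODES = {
    0x0000000A: "IRQL_NOT_LESS_OR_EQUAL",
    0x0000008E: "KERNEL_MODE_EXCEPTION_NOT_HANDLED",
    0x1000008E: "KERNEL_MODE_EXCEPTION_NOT_HANDLED",
    0x0000004E: "PFN_LIST_CORRUPT",
    0x00000024: "NTFS_FILE_SYSTEM",
}

# Matches "STOP 0x0000004E" as well as "STOP: 0x0000004E".
STOP_RE = re.compile(r"STOP:?\s+(0x[0-9A-Fa-f]{1,8})")


def parse_stop_line(line):
    """Return (code, name) for a line containing a stop code, else None."""
    match = STOP_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    return code, STOP_CODES.get(code, "UNKNOWN")
```

Feeding it the lines you copied down from each blue screen gives you a list of (code, name) pairs that you can compare at a glance.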
I used this excellent website (http://www.aumha.org/win5/kbestop.htm)
to get the official MS information on these errors. I have to say,
these were not wildly helpful - apparently most errors can be
attributed to "faulty hardware, bad drivers, or incompatible hardware
or software" ... but you knew that already.
Incidentally, if your computer is restarting immediately when you get a
BSOD, you want to uncheck the "automatically restart" option under
"Start | Control Panel | System | Advanced | [Start Up & Recovery]
Settings"... now you'll be able to get more information from the Blue
Screen.
WINDBG
------
When Windows has a BSOD, it writes out a minidump, containing info on
the state of the machine at the time of the crash (type of dump, and
location, can be configured via Start Up & Recovery settings - see the
paragraph above).
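To see which dumps you have and which is the most recent, something like the following can help. It is a rough sketch, assuming the dumps are in the default Minidump folder under the Windows directory (your Start Up & Recovery settings may point somewhere else) and that you have Python to hand; list_minidumps is a made-up name.

```python
import os
from pathlib import Path


def list_minidumps(dump_dir=None):
    """Return the .dmp files in dump_dir, newest first."""
    if dump_dir is None:
        # Assumed default location; check Start Up & Recovery if it is empty.
        system_root = os.environ.get("SystemRoot", r"C:\Windows")
        dump_dir = Path(system_root) / "Minidump"
    dump_dir = Path(dump_dir)
    if not dump_dir.is_dir():
        return []
    dumps = [p for p in dump_dir.iterdir() if p.suffix.lower() == ".dmp"]
    # Sort by modification time so the latest crash comes first.
    return sorted(dumps, key=lambda p: p.stat().st_mtime, reverse=True)
```

The first entry in the returned list is the dump from the latest crash, which is usually the one to open first.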
Reading crash dumps is not straightforward, and requires geekery. In
this case, it was not actually that helpful (although it did suggest
memory corruption as a possible issue), so feel free to skip to the
next section if you want.
For the rest of us, you need WinDbg to analyse those minidumps - it's
the kind of tool that serious computer users will want to have around,
and you can get it here:
http://www.microsoft.com/whdc/devtools/debugging/default.mspx
I used these instructions to get started with the debugger:
http://forum.lowyat.net/lofiversion/index.php/t84632.html
In the end (after using !analyze -v), the best I got was: "Probably
caused by : memory_corruption"
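If you have several dumps, it can be worth saving the !analyze -v output for each one to a text file and counting how often each "Probably caused by" value turns up. The sketch below assumes the output contains a line in the form quoted above; the function names are made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as "Probably caused by : memory_corruption".
CAUSE_RE = re.compile(r"^Probably caused by[ \t]*:[ \t]*(.+?)\s*$", re.MULTILINE)


def probable_causes(analyze_output):
    """Return every 'Probably caused by' value found in the text, in order."""
    return CAUSE_RE.findall(analyze_output)


def tally_causes(outputs):
    """Count the probable causes across several saved debugger outputs."""
    counts = Counter()
    for text in outputs:
        counts.update(probable_causes(text))
    return counts
```

If one driver name dominates the tally, that driver is a good suspect; if the answer is mostly memory_corruption, as it was for me, it points more towards the RAM.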
SAFEMODE
--------
Next, I followed the advice of PCStats in their great article "Crash
Recovery - Dealing with the Blue Screen of Death"
(http://www.pcstats.com/articleview.cfm?articleID=1647).
I booted the PC into Windows Safe Mode, and tried to run a virus check.
Sure enough, my PC bombed out again. Safe Mode loads only a minimal
set of drivers, so this pretty much confirmed it was either a hardware
problem, a fault with the virus checker, or perhaps a corruption in one
of Windows system files.
CHKDSK
------
As I'd been having NTFS problems, I thought maybe the disc was at risk.
So I ran CHKDSK. However, this did not reveal anything other than
minor inconsistencies (in particular there were no new bad sectors,
which are indicative of impending disc failure). As far as I can tell,
the NTFS errors were a side effect caused by crashing during disc
operations, or by NTFS.sys doing the wrong thing due to bad values from
RAM. Since pulling the dodgy RAM, these problems have gone away.
By the way, with CHKDSK, you sometimes see this error:
"Correcting errors in the Volume Bitmap.
Windows found problems with the file system."
Apparently you can ignore this - it happens when you run CHKDSK in
readonly mode from the command prompt, because Windows is updating the
status of the volume bitmap whilst you are checking it.
MEMTEST
-------
The next step I took was to run memtest86 from www.memtest86.com - and
I'm glad I did! Straight away it flagged my upper memory (512MB-1GB)
as failing pattern read/write tests. The lower range (0-512MB) was
completely clear.
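To give a feel for what a pattern read/write test does, here is a toy model in Python. It is not how memtest86 itself is written (the real tool boots on its own and tests the physical RAM directly); the FaultyMemory class, the stuck bit and the choice of patterns are all assumptions made purely for illustration.

```python
class FaultyMemory:
    """Toy model of RAM in which selected addresses have bits stuck at 0."""

    def __init__(self, size, stuck_addresses=(), stuck_mask=0x01):
        self._cells = bytearray(size)
        self._stuck = set(stuck_addresses)
        self._mask = stuck_mask

    def __len__(self):
        return len(self._cells)

    def write(self, address, value):
        if address in self._stuck:
            value &= ~self._mask & 0xFF  # the stuck bit can never be set
        self._cells[address] = value

    def read(self, address):
        return self._cells[address]


def pattern_test(memory, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Write each pattern everywhere, read it back, return failing addresses."""
    failures = set()
    for pattern in patterns:
        for address in range(len(memory)):
            memory.write(address, pattern)
        for address in range(len(memory)):
            if memory.read(address) != pattern:
                failures.add(address)
    return sorted(failures)
```

With a 1024-cell model whose faults all sit at address 512 or above, pattern_test reports only addresses in the upper half, which mirrors what I saw: every failure was in the upper stick and the lower range was clean.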
I pulled out the upper RAM stick, and ran memtest again ... this time
everything looked good. I've tried moving the faulty RAM around into
different slots (I should probably try cleaning the contacts too) but
the errors still occur - my RAM is still under warranty so it's going
back to the shop!
PLEASE do use this tool! It is excellent and will save you lots of
pain if RAM is your problem. You do need to find a floppy or blank CD
to boot memtest from, but don't be lazy - it is *well worth it* - and
once you've got it, store it somewhere safe for next time!
WHAT IF?
--------
What if the RAM test had come back clean? Well, I'm quite glad it
did not, as otherwise I might still be hunting the bug! However, the
next steps might have been (not necessarily in order):
1) Download the hard disc manufacturer's diagnostic tools and run these
in safe mode, to make sure the disc doesn't have some underlying
faults.
2) Try using the PC "normally" in safe mode without the virus checker -
if it still crashed, that would rule out the antivirus software as the
cause.
3) Run the System File Checker (sfc /scannow) to look for damaged
system files.
4) Attempt to roll back to a System Restore point.
The PCStats article has even more good ideas, particularly if you find
that you have a software or driver problem (i.e. if the BSODs go away
in safe mode, then you probably have a driver problem).
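The reasoning in this write-up can be summed up as a small decision helper. This is only a sketch of my own rules of thumb, not a definitive diagnostic; the function name and the wording of the suggestions are made up for illustration.

```python
def suggest_next_steps(crashes_in_safe_mode, memtest_passed=None):
    """Suggest what to try next, based on the checks described above.

    memtest_passed is None if memtest has not been run yet.
    """
    if memtest_passed is False:
        return ["Pull or replace the failing RAM stick, then re-run memtest"]
    if not crashes_in_safe_mode:
        # BSODs that go away in safe mode point towards a driver.
        return ["Suspect a driver or software problem; see the PCStats article"]
    steps = []
    if memtest_passed is None:
        steps.append("Run memtest")
    steps += [
        "Run the hard disc manufacturer's diagnostic tools",
        "Use the PC in safe mode without the virus checker",
        "Run the System File Checker (sfc /scannow)",
        "Try rolling back to a System Restore point",
    ]
    return steps
```

For my PC the call would have been suggest_next_steps(True), which puts memtest at the top of the list, and once memtest failed there was no need to go any further.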
CONCLUSIONS
-----------
Firstly, we conclude that debugging BSODs can be hard, and that the
error messages you get do not always tell you very much about what is
causing the error. With hindsight, it is easy to see that the bad
memory was causing invalid instructions and bad values to crop up, and
confusing the hell out of the kernel - just about any error could have
been caused by this, as the RAM is effectively scrambling both program
and data at random.
Secondly, BSODs can start at any time - not just because you've added
software or hardware. However, in the case of recurring random BSODs,
there will be a real cause!
Thirdly, a systematic approach is best - try the quick win options
(such as memtest and chkdsk) first before moving on to painstaking
options such as trying each driver individually. Obviously also take low
risk options before you decide to roll back configs or replace system
files!
Most importantly - the Blue Screen can be beaten! Be patient, read
plenty, and best of luck!
Phil Moore
www.GreatEastLondon.com