Problem isolating the cause of a blue screen of death

pcoveney

I can't find an appropriate place to post questions about this, hope
someone can give me a pointer.

Our company makes a device that collects image data, a LOT of it. I won't
give the complete spec, but our computers are HP QuadCore xw8400
workstations, running XP (SP2 or SP3 doesn't affect problem behavior). We
have been getting field reports of BSOD problems. We have been able to
reproduce the failure in a test app in our lab. This app does nothing more
than allocate as much memory as it can, then write to it and read it back,
which, at a grossly simple level, is what our app does. The failure happens
about once every month or so in the field, but with this approach we can
reproduce it within an hour or so on a test system. By using WinDbg, we can
see that a driver is causing an IRQL_NOT_LESS_OR_EQUAL fault, but we can't
tell which one, and removing the likeliest suspects had no effect. We have
updated all the drivers, to the best of our knowledge.
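
For readers who want to picture the kind of test described above, here is a minimal sketch in Python. It is not the actual MemTest.exe from this thread; the function name, chunk size and fill pattern are all assumptions chosen for illustration. It allocates buffers until a cap or a MemoryError is reached, writes a pattern to each, then reads everything back.

```python
def stress_pass(chunk_size, max_chunks, pattern=0xA5):
    """Allocate up to max_chunks buffers, fill each with a byte pattern,
    then read them back and count the buffers that no longer match.

    All names and defaults here are hypothetical, for illustration only.
    """
    # Build the reference pattern first, while memory is still available.
    expected = bytes([pattern]) * chunk_size
    chunks = []
    try:
        for _ in range(max_chunks):
            chunks.append(bytearray(chunk_size))  # zero-filled allocation
    except MemoryError:
        pass  # keep whatever we managed to allocate
    for buf in chunks:
        buf[:] = expected  # write pass: touches every page of the buffer
    # Read-back pass: compare each buffer against the reference pattern.
    mismatches = sum(1 for buf in chunks if buf != expected)
    return len(chunks), mismatches


if __name__ == "__main__":
    allocated, bad = stress_pass(chunk_size=1 << 20, max_chunks=64)
    print(f"allocated {allocated} buffers, {bad} mismatched")
```

Running such a pass in a loop keeps the memory manager busy paging, which is the condition under which the crash in this thread shows up.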

I would like advice on how to determine which driver is causing the error.
Does anyone have advice on where I should go to get such information? Or,
better, does anyone have such advice?

Thanks,
 
philo

Are all drivers "signed"?

http://support.microsoft.com/kb/308514
 
Gerry

Background information on Stop Error message
http://msdn2.microsoft.com/en-us/library/ms793589.aspx

0x0000000A: IRQL_NOT_LESS_OR_EQUAL
Typically due to a bad driver, or faulty or incompatible hardware or
software. Use the General Troubleshooting of STOP Messages checklist
above. Technically, this error condition means that a kernel-mode
process or driver tried to access a memory location to which it did not
have permission, or at a kernel Interrupt ReQuest Level (IRQL) that was
too high. (A kernel-mode process can access only other processes that
have an IRQL lower than, or equal to, its own.)
Source: http://aumha.org/a/stop.htm

You receive a "Stop 0x0000000A" error message in Windows XP
http://support.microsoft.com/kb/314063/


--



Hope this helps.

Gerry
~~~~
FCA
Stourport, England
Enquire, plan and execute
~~~~~~~~~~~~~~~~~~~
 
pcoveney

Gerry,

Thanks for the feedback. We had already seen the information you're
showing, so we knew we were probably looking at a faulty driver. In fact, we
have seen threads elsewhere that indicate this is a not-uncommon problem for
drivers running on SMP systems, which describes us.
My question is, how do I narrow down which driver is at fault, so I can
make an appropriate substitution/contact the manufacturer/whatever? Our dumps
show 118 drivers active at the time of the crash, the vast majority of which
are Microsoft's. Many of them underpin fundamental OS operation, so I can't
just do a remove-one-at-a-time-until-the-problem-goes-away.
I was looking for advice on how to narrow the list of culprits down, and
I could not even find an appropriate place to ask for assistance. If you have
some thoughts, I would be grateful.

Phil C
 
Gerry

Phil

A copy of the Stop Error report is needed if you want targeted help.

Disable automatic restart on system failure. This should help by
allowing time to write down the STOP code properly. Right click on
the My Computer icon on the Desktop and select Properties, Advanced,
Startup and Recovery, System Failure and uncheck the box before
Automatically Restart.

Do not re-enable automatic restart on system failure until you have
resolved the problem. Check for variants of the Stop Error message.

The inference from what you have written is that the Errors are not
occurring during the boot process. Is this correct? This means that you
can start to eliminate things that load when you boot.

Have you tried to reproduce the error in safe mode? If you cannot, that
means it is something that is only used in normal mode. However, I would
not spend time on this one as it is likely to be a driver for something
you run after booting the computer.

Are there any yellow question marks in Device Manager? Right click on
the My Computer icon on your Desktop and select Properties,
Hardware, Device Manager. If yes, what is the Device Error code?

What errors are appearing in Event Viewer?

Have a look in the System and Application logs in Event Viewer for
Errors and Warnings and post copies here. Don't post anything more than
48 hours old.

You can access Event Viewer by selecting Start, Control Panel,
Administrative Tools, and Event Viewer. When researching the meaning
of the error, information regarding Event ID, Source and Description
are important.

HOW TO: View and Manage Event Logs in Event Viewer in Windows XP
http://support.microsoft.com/kb/308427/en-us

A tip for posting copies of Error Reports! Run Event Viewer and double
click on the error you want to copy. In the window which appears there
is a button resembling two pages. Click the button and close Event
Viewer. Now start your message (email) and do a paste into the body of
the message. Make sure this is the first paste after exiting from
Event Viewer.

--



Hope this helps.

Gerry
 
pcoveney

Gerry,

Thanks again for taking the time.
> A copy of the Stop Error report is needed if you want targeted help.
I'm not sure what 'the Stop Error report' is. If it's the details from
the kernel dump, here they are:
=======================================================
Loading Dump File [C:\Bioptigen\Kernel Dumps\MEMORY122208A.DMP]
Kernel Summary Dump File: Only kernel address space is available

Symbol search path is: SRV*c:\symbols*http://msdl.microsoft.com/download/symbols
Executable search path is:
Windows XP Kernel Version 2600 (Service Pack 2) MP (4 procs) Free x86 compatible
Product: WinNt, suite: TerminalServer SingleUserTS
Built by: 2600.xpsp.051011-1528
Kernel base = 0x804d7000 PsLoadedModuleList = 0x8055c700
Debug session time: Mon Dec 22 15:08:06.625 2008 (GMT-5)
System Uptime: 0 days 0:45:25.579
Loading Kernel Symbols
...................................................................................................................................................
Loading User Symbols
PEB is paged out (Peb.Ldr = 7ffd900c). Type ".hh dbgerr001" for details
Loading unloaded module list
...........
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

Use !analyze -v to get detailed debugging information.

BugCheck A, {c0605000, 2, 1, 805043d1}

Probably caused by : memory_corruption ( nt!MiAddWorkingSetPage+cf )

Followup: MachineOwner
---------

1: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

IRQL_NOT_LESS_OR_EQUAL (a)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high. This is usually
caused by drivers using improper addresses.
If a kernel debugger is available get the stack backtrace.
Arguments:
Arg1: c0605000, memory referenced
Arg2: 00000002, IRQL
Arg3: 00000001, bitfield :
bit 0 : value 0 = read operation, 1 = write operation
bit 3 : value 0 = not an execute operation, 1 = execute operation (only on chips which support this level of status)
Arg4: 805043d1, address which referenced memory

Debugging Details:
------------------

WRITE_ADDRESS: c0605000

CURRENT_IRQL: 2

FAULTING_IP:
nt!MiAddWorkingSetPage+cf
805043d1 c70680000000 mov dword ptr [esi],80h

DEFAULT_BUCKET_ID: DRIVER_FAULT

BUGCHECK_STR: 0xA

PROCESS_NAME: MemTest.exe

TRAP_FRAME: b404ac44 -- (.trap 0xffffffffb404ac44)
ErrCode = 00000002
eax=0007a4cf ebx=0007a4cf ecx=00000041 edx=89714902 esi=c0605000 edi=c0883000
eip=805043d1 esp=b404acb8 ebp=b404acdc iopl=0 nv up ei pl zr na pe nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010246
nt!MiAddWorkingSetPage+0xcf:
805043d1 c70680000000 mov dword ptr [esi],80h ds:0023:c0605000=????????
Resetting default scope

LAST_CONTROL_TRANSFER: from 805043d1 to 805437d0

STACK_TEXT:
b404ac44 805043d1 badb0d00 89714902 81de66a4 nt!KiTrap0E+0x238
b404acdc 805051fb 89d56588 89d56588 c03f7620 nt!MiAddWorkingSetPage+0xcf
b404acf4 8051fbc1 c0883cfc 7eec4000 0012e81c nt!MiLocateAndReserveWsle+0xc1
b404ad4c 80543668 81de6d88 7eec4000 80000000 nt!MmAccessFault+0xfb5
b404ad4c 004020a1 81de6d88 7eec4000 80000000 nt!KiTrap0E+0xd0
WARNING: Frame IP not in any known module. Following frames may be wrong.
000000a8 00000000 00000000 00000000 00000000 0x4020a1


STACK_COMMAND: kb

FOLLOWUP_IP:
nt!MiAddWorkingSetPage+cf
805043d1 c70680000000 mov dword ptr [esi],80h

SYMBOL_STACK_INDEX: 1

FOLLOWUP_NAME: MachineOwner

MODULE_NAME: nt

DEBUG_FLR_IMAGE_TIMESTAMP: 434c50c7

SYMBOL_NAME: nt!MiAddWorkingSetPage+cf

IMAGE_NAME: memory_corruption

FAILURE_BUCKET_ID: 0xA_W_nt!MiAddWorkingSetPage+cf

BUCKET_ID: 0xA_W_nt!MiAddWorkingSetPage+cf

Followup: MachineOwner
=======================================================
The Bug Check code, and Args 1, 2, and 3 never vary; that is: always
IRQL_NOT_LESS_OR_EQUAL, always to address 0xC0605000, always at IRQL level 2,
always a write operation. For the first several weeks of testing, Arg4 never
varied either (0x805043d1); in the last couple of days, we've seen other
addresses there. A few days ago, I removed some drivers of which we were
suspicious (the National Instruments drivers we use to acquire images) and
some we weren't using (Roxio DVD burning). My speculation is that the address
change is related to that, that I changed the driver load order in some way.
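
To make the meaning of those four arguments concrete, here is a small sketch in Python that parses the `BugCheck A, {...}` summary line from the dump above and decodes Arg3 using the bit meanings printed by `!analyze -v`. The function names are made up for illustration; nothing here is part of WinDbg.

```python
import re

# Matches a summary line such as: BugCheck A, {c0605000, 2, 1, 805043d1}
BUGCHECK_RE = re.compile(r"BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}")


def parse_bugcheck(line):
    """Return (code, [arg1, arg2, arg3, arg4]) parsed as hex integers."""
    m = BUGCHECK_RE.search(line)
    if m is None:
        raise ValueError("no BugCheck line found")
    code = int(m.group(1), 16)
    args = [int(a.strip(), 16) for a in m.group(2).split(",")]
    return code, args


def describe_0xa(args):
    """Label the arguments of bug check 0xA as the dump text describes them."""
    address, irql, access, instruction = args
    return {
        "memory_referenced": hex(address),
        "irql": irql,
        # bit 0: 0 = read operation, 1 = write operation
        "operation": "write" if access & 0x1 else "read",
        # bit 3: 1 = execute operation (only on chips that report it)
        "execute": bool(access & 0x8),
        "instruction_address": hex(instruction),
    }


if __name__ == "__main__":
    code, args = parse_bugcheck("BugCheck A, {c0605000, 2, 1, 805043d1}")
    print(hex(code), describe_0xa(args))
```

Running the same parser over every dump collected makes it easy to confirm which arguments really are constant and when Arg4 started to change.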
> Disable automatic restart on system failure
We have done so--there's plenty of time to look at that screen.

> The inference from what you have written is that the Errors are not
> occurring during the boot process. Is this correct?
Yes, that is correct.

> This means that you can start to eliminate things that load when you boot.
Can I ask you to elaborate on this a little? Do you mean, use MSConfig
and disable stuff on the Startup and Services tabs?

> Have you tried to reproduce the error in safe mode?
No. It has been on our list of things to try, but we hadn't gotten to it
yet. I will investigate.

> Are there any yellow question marks in Device Manager?
No.

> What errors are appearing in Event Viewer?
There are Warning level messages that appear to be related temporally.
Each time a crash occurs, we see 15-25 instances (it varies) of a message "An
error was detected on device\HardDisk0\D during a paging operation." (There
is no page file on drive D:, although we are running on that disk.) The
messages, in fact, reinforce our working hypothesis: that a driver
inappropriately raised the IRQL level, and that a paging operation happened
to occur at the right time.
There are no other Warnings or Errors in the System area.
There are a few error messages in the Application area, but none that
occur regularly--they appear to be side-effects of the fact that the OS is
crashing. Things like explorer.exe or spoolsv.exe faulting, and we might
have 1 or 2 of each, spread over the time period when we've seen 40-50
crashes.
The system on which we are crashing does not have network access--we're
trying to emulate our field conditions, and our product is a medical device
which would be operated in this way. If you think it is critical, I can get a
copy of one of these Error messages to you using a thumb drive, but I think I
have copied it faithfully.

Again, thank you for your efforts

PC
 
Gerry

PC

A Stop Error looks like this:

IRQL_NOT_LESS_OR_EQUAL
STOP: 0x0000000A (0x0000001C, 0x00000002, 0x00000001, 0x8000443A)
*** Address 8000443A base at 80001000, DateStamp 3e7a733a - hal.dll

The third line is not always there.

Parameter 4 is the address of the instruction that referenced the bad
memory. It often, but not always, lies inside the driver causing the
problem. However, in many cases it is not practical to find out what is
at the address, unless it is pinpointed as in the error above.
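
As a rough illustration of how an address is matched to a module, the sketch below (Python) looks an address up in a list of loaded modules, each given as a name, base address and size. In practice such a list would come from the debugger's loaded-module listing (for example WinDbg's `lm` command). The function name is hypothetical, and the size used in the example is a made-up placeholder, because the Stop screen reports only the base.

```python
def find_module(address, modules):
    """Return (name, offset) for the module whose range contains address.

    modules is an iterable of (name, base, size) tuples. Returns None when
    the address does not fall inside any listed module.
    """
    for name, base, size in modules:
        if base <= address < base + size:
            return name, address - base
    return None


if __name__ == "__main__":
    # Base address taken from the example Stop message above; the size
    # 0x20000 is an assumed value, used only to make the sketch runnable.
    modules = [("hal.dll", 0x80001000, 0x20000)]
    hit = find_module(0x8000443A, modules)
    if hit is None:
        print("address is not inside any listed module")
    else:
        print(f"{hit[0]}+{hit[1]:#x}")
```

Note that a hit only says where the faulting instruction lives. If another driver corrupted memory or raised the IRQL earlier, the module found this way may be the victim rather than the culprit.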

Given the circumstances you outlined in your first post the problem lies
most likely in your specialist software. Have you discussed the problem
with whoever wrote the software?

Event Viewer Reports. Unless you post copies in the format I suggested
it is difficult to make objective comments. You should note the time an
error occurs and post copies from 2 minutes before. Disregard
Information Reports.
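
If the log has been exported to a file, the selection described above can be done mechanically. The following Python sketch assumes each event has already been read into a tuple of (timestamp, level, source, event ID, text); that record layout and the function name are assumptions for illustration, not an Event Viewer format.

```python
from datetime import datetime, timedelta


def events_before_crash(events, crash_time, window=timedelta(minutes=2)):
    """Keep Errors and Warnings logged within `window` before the crash.

    events: iterable of (timestamp, level, source, event_id, text) tuples.
    Records whose level is "Information" are discarded.
    """
    start = crash_time - window
    return [
        e for e in events
        if start <= e[0] <= crash_time and e[1] != "Information"
    ]


if __name__ == "__main__":
    crash = datetime(2008, 12, 22, 15, 8, 6)
    # Invented sample records, for illustration only.
    sample = [
        (crash - timedelta(minutes=10), "Warning", "Disk", 51, "too early"),
        (crash - timedelta(seconds=30), "Warning", "Disk", 51, "in window"),
        (crash - timedelta(seconds=5), "Information", "Service", 7036, "noise"),
    ]
    for event in events_before_crash(sample, crash):
        print(event)
```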

A paging operation is not a reference to a pagefile.

When your computer pages information to or from the disk, if a generic
error occurs, it logs an Event ID 51 event message. In a paging
operation, the operating system either swaps a page of memory from
memory to disk or retrieves a page of memory from disk to memory. It is
part of the memory management of Microsoft Windows.
Quote from a Microsoft Knowledge Base Article.

You might try Driver Verifier; it does stress the system
http://www.microsoft.com/whdc/devtools/tools/verifier.mspx

The problem could be a driver with a memory leak. Press Ctrl+Alt+Delete,
select Task Manager and click the Performance tab. Under Commit Charge
what is the Total, the Limit and the Peak? How much RAM does the test
machine have?

A memory leak manifests itself by an ever increasing demand for memory.
Task Manager will show an ever increasing Total figure if there is a
memory leak. Is the figure reaching the Limit? If the computer runs out
of memory it will crash.
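
The check can be made less subjective by writing the Total figure down at regular intervals and testing the series. The Python sketch below is one simple way to do that; the function name and the threshold are assumptions, and a real leak can of course show small dips that this strict test would miss.

```python
def looks_like_leak(samples, min_growth):
    """Return True if the readings never drop and grow by at least min_growth.

    samples: Commit Charge Total readings (for example in KB), noted down
    at regular intervals while the test is running.
    """
    if len(samples) < 2:
        return False
    # Every reading must be at least as large as the one before it.
    never_drops = all(later >= earlier
                      for earlier, later in zip(samples, samples[1:]))
    return never_drops and (samples[-1] - samples[0]) >= min_growth


if __name__ == "__main__":
    readings = [410000, 455000, 512000, 590000]  # invented figures
    print(looks_like_leak(readings, min_growth=100000))
```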

You should be able to gather more information from Task Manager. With
the Processes tab open select View, Select Columns and check the boxes
before Peak Memory Usage and Virtual Memory Size. What are the figures
for the 6 processes using the largest amounts?

--



Hope this helps.

Gerry

