Access violation in C++ postamble

paul-shed

Hello,

I have rather a nasty problem with my code which occasionally generates an
access violation when reading from memory. The address of the access
violation is not 0, it's different every time, but the address is outside of
the process memory space. So clearly there is a problem.

I have a full memory dump of the process and when I do a post mortem debug
in windbg I get the following disassembly where the access violation
occurred, in function postamble.

10074c02 8b4c2428 mov ecx,dword ptr [esp+28h]
10074c06 5f pop edi
10074c07 5e pop esi <-- access violation in this instruction.
10074c08 64890d00000000 mov dword ptr fs:[0],ecx
10074c0f 83c42c add esp,2Ch
10074c12 c20800 ret 8

So the access violation occurred when executing the pop esi instruction. Now
what I can't work out is why a pop esi instruction would cause an access
violation. The stack pointer is ok and pointing to valid memory, the stack
frames look ok and I can get a good stack dump with symbol files loaded in
the debugger.

My understanding of pop esi is that it should pop a 32 bit value off the
stack and place into esi. So the only memory access would be at esp, the
stack pointer. Is this correct?

The only other possibility is that the access violation occurred in another
instruction around the pop esi and the address reported in the dump is not
correct.

Anyway I'm stumped so any help would be appreciated. I want to resolve the
issue of how a pop esi instruction causes an access violation.

My gut feeling is that I have some stack corruption going on.

Regards,

paul-shed
 
Jialiang Ge [MSFT]

Hello paul-shed

Here are my initial thoughts on the AV issue, for your reference. I will do
more research and report back.

When the pop instruction is processed by the processor, these things happen
in sequence:

1. The item pointed to by the Stack Pointer (esp) is retrieved from the
stack
2. The Stack Pointer (esp) is incremented by four bytes.
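
The two steps above can be sketched as a tiny model, which also shows why a
correctly executed pop can only fault on the address held in esp. This is an
illustrative Python sketch, not debugger code: the `Cpu` class, its
dictionary-backed memory, and the addresses are all hypothetical.

```python
class AccessViolation(Exception):
    """Raised when the model reads an address that is not mapped."""


class Cpu:
    def __init__(self, esp, memory):
        self.esp = esp
        self.memory = memory  # hypothetical: maps address -> 32-bit value
        self.regs = {}

    def pop(self, reg):
        # Step 1: read the dword that esp points to (the only memory access).
        if self.esp not in self.memory:
            raise AccessViolation(hex(self.esp))
        self.regs[reg] = self.memory[self.esp]
        # Step 2: move esp up by four bytes.
        self.esp = (self.esp + 4) & 0xFFFFFFFF


# Hypothetical stack contents: two saved registers.
cpu = Cpu(esp=0x01ACFB40, memory={0x01ACFB40: 0x11111111, 0x01ACFB44: 0x22222222})
cpu.pop("edi")
cpu.pop("esi")
print(hex(cpu.esp), cpu.regs)
```

In this model a fault address always equals esp at the time of the pop, so a
reported read address far away from esp points at something other than the
pop itself.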

A possible reason for the AV exception in a pop instruction is that the
memory pointed to by esp is not valid. To verify it, you may run the command
'r esp' in windbg to see the memory property, or dump both esp and ebp to
see whether the stack frame has a plausible length. It seems that the esp
register was reset to an invalid location for some reason (we would need to
trace esp to see why it was reset), which causes the stack corruption.

By the way, the instruction list you posted is an epilog of a function.

pop edi
pop esi
restore the edi and esi registers (they are saved at the beginning of the
function, aka the prolog).

add esp,2Ch
releases the local variables.

ret 8
pops the return address and removes 8 bytes of parameters from the stack.
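
As a rough illustration of the epilog walk-through above, the following
Python sketch adds up how far each instruction from the posted disassembly
moves esp. The starting esp value and the helper name `esp_after` are
hypothetical; the per-instruction deltas follow the standard x86 behaviour
of pop, add and ret imm16.

```python
# (instruction, number of bytes by which esp grows)
EPILOG = [
    ("pop edi", 4),
    ("pop esi", 4),
    ("mov dword ptr fs:[0],ecx", 0),  # does not touch esp
    ("add esp,2Ch", 0x2C),            # releases the locals
    ("ret 8", 4 + 8),                 # return address plus 8 bytes of parameters
]


def esp_after(esp, steps):
    """Apply each esp delta in turn, wrapping at 32 bits."""
    for _, delta in steps:
        esp = (esp + delta) & 0xFFFFFFFF
    return esp


start = 0x01ACFB40  # hypothetical esp on entry to the epilog
print(hex(esp_after(start, EPILOG) - start))
```

Under these assumptions the whole epilog releases 0x40 bytes of stack.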

If my rough analysis is right, our next action is to figure out why the esp
register is reset improperly.

Regards,
Jialiang Ge ([email protected], remove 'online.')
Microsoft Online Community Support

=================================================
Delighting our customers is our #1 priority. We welcome your comments and
suggestions about how we can improve the support we provide to you. Please
feel free to let my manager know what you think of the level of service
provided. You can send feedback directly to my manager at:
(e-mail address removed).

This posting is provided "AS IS" with no warranties, and confers no rights.
=================================================
 
paul-shed

Jialiang,

Yes indeed, that's what I thought.

And the stack pointer, esp, is pointing to valid memory. The stack backtrace
looks good. Also the read address in the AV is nowhere near the esp.

So either

1) The faulting instruction in the report is not correct. (more likely)
2) pop esi has other side effects (apparently not)
3) The esp was altered fleetingly by some unknown mechanism

But I'm not too sure which.

Regards,

paul-shed
 
paul-shed

Also ...

If the exception occurred in pop esi, why didn't the access violation
occur in pop edi if esp was incorrect?

The only reason I can come up with is that control branched directly to the
pop esi instruction.

Regards,

paul-shed


 
Jialiang Ge [MSFT]

Hello paul-shed

I'm doing further research on this AV behavior. Regarding your first
possibility:
1) The faulting instruction in the report is not correct. (more likely)

Please issue the command 'lm' in windbg to see whether the symbols of your
modules are loaded correctly. It might be helpful if you can share the dump
with me. If you agree to share the file, please let me know. I will create
a workspace for you to upload it.

P.S.

A correction to a small typo in my last reply:
"To verify it, you may run the command 'r esp' in windbg to see the memory
property"

I intended to say
"To verify it, you may run the command '!address esp' in windbg to see the
memory property"

If you have not used !address, please try it.

Regards,
Jialiang Ge ([email protected], remove 'online.')
Microsoft Online Community Support

 
paul-shed

Jialiang,

When I issue the !address esp command I get the following error. Any clues?
I've got the MS symbol server configured with the _NT_SYMBOL_PATH
environment variable.
Also, ntdll has the correct pdb loaded, ntdll.pdb.

*************************************************************************
*** ***
*** ***
*** Your debugger is not using the correct symbols ***
*** ***
*** In order for this command to work properly, your symbol path ***
*** must point to .pdb files that have full type information. ***
*** ***
*** Certain .pdb files (such as the public OS symbols) do not ***
*** contain the required information. Contact the group that ***
*** provided you with these symbols if you need this command to ***
*** work. ***
*** ***
*** Type referenced: ntdll!_NT_TIB ***
*** ***
*************************************************************************
ReadField(StackLimit) error at 7ffde000

Regards,

paul-shed
 
Jialiang Ge [MSFT]

Hello paul-shed

I'm Jialiang. I'm replying to you at home.

The error "Your debugger is not using the correct symbols" is thrown by
windbg because some types (ntdll!_NT_TIB) are not present in the public OS
symbols. As far as I know, this symbol issue was present in some older OSs
like Win2003, and is fixed in Vista and Win2008. So that this side issue
does not interfere with our troubleshooting, I suggest that you run the
command

dd esp

If it outputs question marks like this:

0:000> dd esp
0002f3c0 ???????? ???????? ???????? ????????
0002f3d0 ???????? ???????? ???????? ????????
0002f3e0 ???????? ???????? ???????? ????????
0002f3f0 ???????? ???????? ???????? ????????
0002f400 ???????? ???????? ???????? ????????
0002f410 ???????? ???????? ???????? ????????
0002f420 ???????? ???????? ???????? ????????
0002f430 ???????? ???????? ???????? ????????

It generally means that the memory pointed to by esp is free or
inaccessible, which would explain the AV exception.

In addition, please run the command 'r' to dump all the register values, and
paste them here.

Another quick way to check whether the esp register is pointing to the
right stack memory is to calculate (ebp - esp). The default stack
reservation size used by the linker is 1 MB (see
http://msdn.microsoft.com/en-us/library/ms686774(VS.85).aspx). If (ebp - esp)
returns a very large value or a negative value, it generally means that esp
is corrupted.

?(ebp - esp)
Evaluate expression: 48
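
The (ebp - esp) check can be written down as a small heuristic. The sketch
below is only an illustration in Python; the function name
`frame_distance_plausible` and the register values are hypothetical, and the
1 MB bound is the linker default mentioned above.

```python
DEFAULT_STACK_RESERVE = 1024 * 1024  # 1 MB linker default


def frame_distance_plausible(ebp, esp, reserve=DEFAULT_STACK_RESERVE):
    """Heuristic: ebp should sit at or above esp, within one stack reservation."""
    distance = ebp - esp
    return 0 <= distance < reserve


# Hypothetical register values: a frame 48 bytes deep, then an ebp that
# clearly does not point into the stack.
print(frame_distance_plausible(0x0012FF70, 0x0012FF40))
print(frame_distance_plausible(0x00000002, 0x01ACFB48))
```

A failed check is only a hint. If the compiler omits frame pointers, ebp can
be used as a general-purpose register and may hold an unrelated value even
though esp is perfectly valid.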

Regards,
Jialiang Ge ([email protected], remove 'online.')
Microsoft Online Community Support

 
paul-shed

Jialiang,

In all cases the stack pointer is pointing to valid memory.

The only slight issue is that in some of the crash dumps ebp has the value
2, which doesn't seem right, unless of course ebp may be used as a general
purpose register by the compiler in some cases.

How do you explain the exception going off in pop esi, and not pop edi?

The only reason that I can think of for this is that control is jumping
directly to the pop esi instruction. Would you agree?

Regards,

paul-shed


 
Jialiang Ge [MSFT]

Hello paul-shed

Thank you for the dump file. I have made good progress in researching the
issue:

I ran the command !analyze -v to display verbose information about the
exception, and got this output:

FAULTING_IP:
PBSutils!CPBSTimecode::CPBSTimecode+57
10073717 5e pop esi
......
READ_ADDRESS: 0860ecad
BUGCHECK_STR: ACCESS_VIOLATION
EXCEPTION_DOESNOT_MATCH_CODE: This indicates a hardware error.
Instruction at 10073717 does not read/write to 0860ecad
LAST_CONTROL_TRANSFER: from 005a30b9 to 10073717
STACK_TEXT:
WARNING: Stack unwind information not available. Following frames may be wrong.
01acfb48 005a30b9 7c4eaa7d 00000000 00000002 PBSutils!CPBSTimecode::CPBSTimecode+0x57
00000000 00000000 00000000 00000000 00000000 PBSList2!CShuffleConfig::~CShuffleConfig+0xe1f9
......
FAILURE_BUCKET_ID: ACCESS_VIOLATION_CODE_ADDRESS_MISMATCH_PBSutils!CPBSTimecode::CPBSTimecode+57
BUCKET_ID: ACCESS_VIOLATION_CODE_ADDRESS_MISMATCH_PBSutils!CPBSTimecode::CPBSTimecode+57

The output says that the exception (AV) does not match the code (pop esi),
which suggests that it might be a hardware error. You may want to read this
article:

http://www.dumpanalysis.org/blog/index.php/2008/04/03/crash-dump-analysis-patterns-part-57/
(If the long URL is truncated by the newsgroup system, please manually
connect the parts)

This reminds me of a case I once handled. Here, if bit 7 of the instruction
byte at 10073717 were flipped by a hardware error, the byte would change
from 0x5e to 0xde, and the processor would decode it together with the
bytes that follow as:

10073717 de64890d fisub word ptr [ecx+ecx*4+0Dh]

whose effective address, computed from the current value of ecx, matches the
exception address:
0:005> ? ecx+ecx*4+0Dh
Evaluate expression: 140569773 = 0860ecad
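
To make the decoding above concrete, here is a small Python sketch that
decodes the three bytes following the flipped opcode (64 89 0d) as a ModRM
byte, a SIB byte and an 8-bit displacement, and then evaluates the resulting
address. It is a simplified illustration that handles only this one
addressing form; the helper names are hypothetical. The ecx value is not
quoted anywhere in the thread, so it is derived here by inverting
ecx+ecx*4+0Dh for the reported READ_ADDRESS.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_sib_disp8(modrm, sib, disp8):
    """Decode a ModRM byte with mod=01, rm=100 (SIB byte plus 8-bit displacement)."""
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0b01 or rm != 0b100:
        raise ValueError("sketch only handles the SIB + disp8 form")
    scale = 1 << (sib >> 6)
    index, base = (sib >> 3) & 7, sib & 7
    if index == 0b100:
        raise ValueError("no index register; not handled by this sketch")
    disp = disp8 - 0x100 if disp8 & 0x80 else disp8  # sign-extend disp8
    return reg, REG32[base], REG32[index], scale, disp


def effective_address(regs, base, index, scale, disp):
    return (regs[base] + regs[index] * scale + disp) & 0xFFFFFFFF


# reg field 4 with opcode DE selects FISUB on a 16-bit integer operand.
reg, base, index, scale, disp = decode_sib_disp8(0x64, 0x89, 0x0D)
print(reg, base, index, scale, hex(disp))

# Derived, not read from the dump: the ecx that yields READ_ADDRESS 0860ecad.
ecx = (0x0860ECAD - 0x0D) // 5
print(hex(ecx), hex(effective_address({"ecx": ecx}, base, index, scale, disp)))
```

The derived ecx lands close to the stack addresses shown in STACK_TEXT,
which fits the epilog in the original post, where ecx is loaded from
[esp+28h] just before the pops.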

In general to troubleshoot this pattern you have to try each possible bit
flip in the original instruction code and unassemble, in this case:
eb 10073717 0y11011110
u 10073717
eb 10073717 0y00011110
u 10073717
eb 10073717 0y01111110
u 10073717
eb 10073717 0y01001110
u 10073717
eb 10073717 0y01010110
u 10073717
eb 10073717 0y01011010
u 10073717
eb 10073717 0y01011100
u 10073717
eb 10073717 0y01011111
u 10073717
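
The eight candidate bytes in the list above are simply 0x5e with one bit
inverted at a time. As a convenience, the following Python sketch generates
the same eb/u pairs for any opcode byte; the function name
`single_bit_flips` is hypothetical, and the address and byte are the ones
from the !analyze output.

```python
def single_bit_flips(byte):
    """Return the eight values that differ from `byte` in exactly one bit, bit 7 first."""
    return [byte ^ (1 << bit) for bit in range(7, -1, -1)]


ADDRESS = 0x10073717  # faulting address from the !analyze output
ORIGINAL = 0x5E       # pop esi

for value in single_bit_flips(ORIGINAL):
    # Same eb/u pairs as in the list above, using the 0y binary prefix.
    print(f"eb {ADDRESS:08x} 0y{value:08b}")
    print(f"u {ADDRESS:08x}")
```

After trying the candidates, it is worth writing the original byte (0x5e)
back so that later disassembly of the dump is not misleading.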

In the scenario I considered, the dump always reports the correct
instruction because the bit flip occurs when the data travels over the bus,
from memory to processor, and a full dump is saved from memory. In the case
of minidumps with images loaded from a symbol server you'll still find the
correct instruction, but for a different reason.

Overclocking is a known cause of this kind of problem. (Please check
whether your machine is overclocked.) Overclocking is not something the
application controls; if it is in place, it was configured in the BIOS or
hardware. Specifically, the processor receives a clock signal from the
mainboard at a frequency higher than it was designed for, which can cause
hardware errors.

Additional reading:

There's an awful lot of overclocking out there
http://blogs.msdn.com/oldnewthing/archive/2005/04/12/407562.aspx

Have a nice day!


Regards,
Jialiang Ge ([email protected], remove 'online.')
Microsoft Online Community Support

 
paul-shed

Jialiang,

Thank you very much for this, excellent theory.

Now I've got lots of dumps and I've tried this theory with each. In all
cases it looks as if bit 7 is being flipped, as the address being accessed
matches ecx+ecx*4+0d. So I can now make sense of the dumps.

Now can you explain this to me? If this is indeed the case, why doesn't the
machine "blue screen" all the time? Why is it only my software that triggers
this problem, given that this postamble code must be common to literally
thousands of methods/functions on the machine?

I'm changing the hardware as we speak BTW.

Regards,

paul-shed
 
Jialiang Ge [MSFT]

Dear paul-shed

Does it go away if you stop overclocking the machine?

Overclocking can expose electrical instability, the specifics of which may
vary from part to part and depend on the instruction stream, memory access
patterns, and almost anything else thrown into the mix, i.e., the behavior
of a specific application. For example, around the instruction that throws
the AV in this case, I notice that there are a lot of calls to the system
clock. This pattern might be part of the reason, but it might not.
Overclocked systems are unsupportable.

Please kindly let me know whether fixing the hardware can resolve the
problem.

Thanks,
Jialiang Ge ([email protected], remove 'online.')
Microsoft Online Community Support
