Fatal Execution Engine Error using unmanaged obj from 2 appdomains

Guest

I originally posted this in dotnet.languages.vc, and it was suggested that I
post it in here as well...so here goes

I believe I've found a bug in either the compiler or in the runtime....for
some reason when accessing a specific unmanaged object from 2 appdomains it
causes a fatal execution engine error. It took me over a week to narrow it
down, but below I've attached a nice simple snippet of code that showcases
this issue:

Anyone have any idea what exactly is the root cause of this or how to fix it
properly? It seems that if I change the code much (such as remove the
'virtual' from either of the methods, or change the return types, or change
the dictionary to something else) the problem goes away, but I'm more worried
about where else in my code this affects and I'm hoping for a more viable
solution than "search through 500K lines of code and look for that
pattern"....not realistic

If anyone wants more detail I've put a rar up which contains the source, a
debug build with symbols, the eventlog entry, and an adplus dump with
corresponding reports:
http://www.virgeweb.com/redec/crap/RuntimeFailureTest.rar


<snippet>

#include <windows.h>          // WinMain types (HINSTANCE, LPSTR, WINAPI)
#include <msclr/appdomain.h>  // msclr::call_in_appdomain
using namespace System;
using namespace System::Collections::Generic;
using namespace msclr;

// Compile as C++/CLI (/clr).
ref class MyManagedClass
{
};

// Native class whose virtual methods return managed handles.
class MyUnmanagedClass
{
public:
    virtual MyManagedClass^ Foo() { return nullptr; }
    virtual Object^ CrashyCrashy()
    {
        Dictionary<String^, MyManagedClass^> ^bar =
            gcnew Dictionary<String^, MyManagedClass^>();
        bar->Add("", nullptr);
        return nullptr;
    }
};

// Runs inside the second appdomain and uses the object created in the first.
void Test(MyUnmanagedClass *foo)
{
    foo->CrashyCrashy();
}

int WINAPI WinMain(HINSTANCE, HINSTANCE, LPSTR, int)
{
    MyUnmanagedClass foo;  // created in the default appdomain
    AppDomain ^domain1 = AppDomain::CreateDomain("TestDomainOfGoodness");
    call_in_appdomain<MyUnmanagedClass*>(domain1->Id, &Test, &foo);
    return 0;
}

</snippet>
 
Peter Duniho

> I originally posted this in dotnet.languages.vc, and it was suggested that I
> post it in here as well...so here goes
>
> I believe I've found a bug in either the compiler or in the runtime....for
> some reason when accessing a specific unmanaged object from 2 appdomains it
> causes a fatal execution engine error. It took me over a week to narrow it
> down, but below I've attached a nice simple snippet of code that showcases
> this issue:
>
> Anyone have any idea what exactly is the root cause of this or how to fix it
> properly?

I don't. I admit, I know very little about this sort of thing.
However, figuring I might learn something I tried your code example.
One thing I noticed: if you call foo.CrashyCrashy() in the default app
domain (that is, just call it directly), then the subsequent attempt to
call it in the other app domain succeeds.

That suggests to me that there is some sort of deferred initialization
that happens, and which isn't happening if the first call on the object
happens in the other app domain. I don't know enough about app domains
to know why this would be, and you may be right that it's a bug in the
CLR. Or it could be some defined behavior for app domains. I don't
really know.

I didn't bother trying to look at the vtable for the object, but I'd
guess that the state of the vtable is somehow related to this.
Especially since I was unable to catch any exception or have the
debugger show me an exception: on attempting to execute the call, the
process simply exits without any notification or opportunity to look at
it in the debugger. :(

I suppose a temporary work-around might be to create a dummy virtual
function that you call in the default app domain first, and then
hopefully the other calls would work. I admit, that's not what I'd
call a "good" or "robust" solution. But it might be useful for now.

Pete
 
Peter Duniho

> I don't. I admit, I know very little about this sort of thing.
> However, figuring I might learn something I tried your code example.
> One thing I noticed: if you call foo.CrashyCrashy() in the default app
> domain (that is, just call it directly), then the subsequent attempt to
> call it in the other app domain succeeds.

Another observation:

If the "MyUnmanagedClass" instance is initialized in the other app
domain, the call also succeeds.

So, obviously there is some initialization of the object that is
specific to the app domain in which the object itself is created.
Accessing the object in a different app domain before the creating app
domain has an opportunity to fully initialize the object causes
problems.

Again, I don't have enough knowledge to say whether this is "by design"
or an oversight within the CLR. However, I wouldn't be surprised if
it's "by design", or at least a "we can't fix this" sort of thing. The
initialization we're talking about is specific to the C++ features of
the compiler, and there may be some reason that the compiler doesn't
take into account app domains when managing that initialization.

As a completely uninformed hypothesis: perhaps the vtable is
initialized in a static constructor of the class, so until the first
actual use of some instance of the class the vtable hasn't been
initialized. Further, perhaps this "first actual use" for the class is
tracked on a per-app-domain basis, so an object created in one app
domain must be first used in that same app domain in order for the
static constructor to be called.

No, I have no idea if this is actually what's going on. But it would
fit the symptoms. :)

Pete
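
The two observations above can be sketched in code. This is only an illustration built on the original snippet: WarmUpThenCall and CreateInTargetDomain are hypothetical helper names that do not appear in the thread, and reading "initialized in the other app domain" as "constructed inside the function that runs there" is an assumption. Whether either variant really avoids the crash rests on the reports above; nothing here explains the root cause.

```cpp
// Compile as C++/CLI (/clr). Sketch only; helper names are hypothetical.
#include <windows.h>
#include <msclr/appdomain.h>
using namespace System;
using namespace System::Collections::Generic;
using namespace msclr;

ref class MyManagedClass {};

class MyUnmanagedClass
{
public:
    virtual MyManagedClass^ Foo() { return nullptr; }
    virtual Object^ CrashyCrashy()
    {
        Dictionary<String^, MyManagedClass^>^ bar =
            gcnew Dictionary<String^, MyManagedClass^>();
        bar->Add("", nullptr);
        return nullptr;
    }
};

// Same as Test in the original snippet: uses an object created elsewhere.
void Test(MyUnmanagedClass* foo)
{
    foo->CrashyCrashy();
}

// Variant 2 helper: the object is constructed inside the target appdomain.
// The int parameter is unused; it only mirrors the one-argument
// call_in_appdomain form used in the original snippet.
void CreateInTargetDomain(int)
{
    MyUnmanagedClass foo;
    foo.CrashyCrashy();
}

// Variant 1: call the virtual method once in the default appdomain
// before crossing into the other one.
void WarmUpThenCall(AppDomain^ domain)
{
    MyUnmanagedClass foo;
    foo.CrashyCrashy();  // first call happens in the creating appdomain
    call_in_appdomain<MyUnmanagedClass*>(domain->Id, &Test, &foo);
}

int WINAPI WinMain(HINSTANCE, HINSTANCE, LPSTR, int)
{
    AppDomain^ domain1 = AppDomain::CreateDomain("TestDomainOfGoodness");

    // Try one variant per process run; the warm-up call in variant 1
    // could mask the behavior of variant 2.
    WarmUpThenCall(domain1);
    // call_in_appdomain<int>(domain1->Id, &CreateInTargetDomain, 0);
    return 0;
}
```

Neither variant helps with finding other affected call sites in a large codebase, so at best this is a stopgap for the one reproduced case.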
 
Guest

hehehe....yeah, I'm sure it has something to do with deferred
initialization...I've inspected the vtables and they look fine. and yeah I
know changing pretty much anything in that code snippet will make it work
properly....I'm not terribly worried about getting a work-around to fix this
specific case, I'm more worried about the other, harder to find, instances of
this in my existing code...

this snippet of code, as it existed in the original code, spanned 5 or 6
different objects, across 3 different assemblies (1 C++/CLI and 2 C#)...I was
able to narrow it down only because it happened to be frequently executed,
and I was able to reproduce it reliably....I'm worried about the infrequently
executed code paths where this problem may also crop up...I'd rather not rely
on QA to find them all, if you know what I mean :)
 
Peter Duniho

> hehehe....yeah, I'm sure it has something to do with deferred
> initialization...I've inspected the vtables and they look fine. and yeah I
> know changing pretty much anything in that code snippet will make it work
> properly....I'm not terribly worried about getting a work-around to fix this
> specific case, I'm more worried about the other, harder to find, instances of
> this in my existing code...

Well, it seems that by narrowing it down, you've identified the
fundamental issue: calling a virtual function on an object passed
across an app domain boundary.

For what it's worth (and maybe that's not much), I thought I'd try to
read up a little more on app domains, why they exist, what they do, etc.

I found the docs actually kind of sparse, considering the potential
complexity it seems like an app domain would introduce. But one thing
they do discuss is that one main reason for having an app domain is to
be able to introduce a process-boundary-like separation between
executing code, without all of the overhead of a process.

In particular, they make it pretty clear that data _isn't_ supposed to
be able to easily get from one app domain to another. It's either
copied or proxied as near as I can tell, without allowing code
executing in one app domain to directly access data from another app
domain.

How this applies specifically to your scenario I'm not entirely sure.
Taken as simplistically as I've described it above, it's hard to see
how your code would _ever_ work, assuming that the data referenced by
the "&foo" is simply copied. Since it does work most of the time, that
suggests that "call_in_appdomain" is supposed to handle either copying
or proxying the object correctly to allow this cross-app-domain
execution to take place, and that the case where it doesn't is in fact
a bug.

If it is a bug, you may have some luck by filing a support request with
Microsoft. I have found them moderately responsive to that sort of
thing. They don't always solve my problem, but they at least do
generally wind up confirming that the behavior is in fact a bug (good
for one's sanity, if nothing else :) ).

There's clearly a lot I don't understand about app domains still, but
you may be able to make more sense of the documents I did run across.
Though you may well already be familiar with them, just in case I will
offer those links here:

http://msdn2.microsoft.com/en-us/library/2bh4z9hs.aspx
http://msdn2.microsoft.com/en-us/library/system.marshalbyrefobject.aspx
http://msdn2.microsoft.com/en-us/library/x0w2664k(VS.80).aspx

I'm a little surprised you haven't gotten a reply from someone more
knowledgeable. Obviously such people exist :), and hopefully they'll
see this thread and offer their own insight.

Pete
 
Guest

Yeah....I've read a lot about them....and I *think* I understand them quite
well. You're mostly correct re: your description of appdomains, however the
one thing it seems you don't quite understand is that appdomains are a
managed-only concept....they only affect managed objects. Unmanaged objects
are (supposed to be) 100% appdomain neutral/ignorant....unaffected by
appdomains. Now the docs for call_in_appdomain say that the
parameters/return type "must not be clr types". I took that to mean that
they should be unmanaged types....which seems to be the correct assumption 99%
of the time....but this specific situation makes one think that maybe said
unmanaged types can't even reference any managed types (or reference any
other unmanaged types which reference managed types)....I really can't see
this being the case cuz it seems like a HUGE restriction, and you'd think it
would be mentioned somewhere.....but I'm no expert so I don't know
 
Peter Duniho

> Yeah....I've read a lot about them....and I *think* I understand them quite
> well. You're mostly correct re: your description of appdomains, however the
> one thing it seems you don't quite understand is that appdomains are a
> managed-only concept....they only affect managed objects. Unmanaged objects
> are (supposed to be) 100% appdomain neutral/ignorant....unaffected by
> appdomains.

I didn't find anything that made that clear. You may be right about
that, but if so it seems like there's some inconsistency here. In
particular:

> Now the docs for call_in_appdomain say that the
> parameters/return type "must not be clr types". I took that to mean that
> they should be unmanaged types....which seems to be the correct assumption 99%
> of the time....but this specific situation makes one think that maybe said
> unmanaged types can't even reference any managed types (or reference any
> other unmanaged types which reference managed types)....I really can't see
> this being the case cuz it seems like a HUGE restriction, and you'd think it
> would be mentioned somewhere.....but I'm no expert so I don't know

First, in your example code the MyUnmanagedClass doesn't really
reference any managed types per se, until one of the methods actually
gets a chance to execute. I suppose the mere fact that the return type
of the method is a managed type is sufficient to violate the rule, but
like you say, you'd think they'd be more clear about that.

Secondly (and maybe more germane), if when using call_in_appdomain
you're not supposed to use managed types in any way, and if it's also
true that for unmanaged data app domains are irrelevant, then I'm at a
loss as to why the call_in_appdomain API exists at all. To me, the
latter seems more likely to be true, and being in conflict with the
former it suggests that the former is what's not true.

Which is a long way of saying that not only do I agree that it seems like
a big restriction that you can't use managed types when using
call_in_appdomain, but such a restriction would also be logically
inconsistent with the behavior of app domains generally.

I guess in the end, I'm left thinking that you may have simply found a
bug in the call_in_appdomain API, and that whatever is accomplished by
calling a virtual method on your unmanaged class prior to using
call_in_appdomain is something that call_in_appdomain _ought_ to be
doing for itself.

.NET bugs, at least those that affect typical use of the framework, are
fairly rare but they definitely aren't impossible. I hope you can get
some confirmation from Microsoft via the usual support channels that
this is in fact a bug, or at least some explanation for what's going on.

Pete
 
