Bern, you're seeing what looks like a manifestation of the "double thunk"
(aka "double p/invoke") problem. When your managed code calls the managed
code in the DLL through the standard Win32 DLL mechanism, the call first
goes through a native stub, so you end up transitioning from managed to
native and then back to managed.
Try #using the DLL that you have compiled as managed, rather than going
through the standard Win32 DLL mechanism. Let us know whether that helps,
or if this makes no sense.
Thanks,
Kang Su Gatlin
Visual C++ Program Manager
--------------------
| From: "Bern McCarty" <(E-Mail Removed)>
| References: <(E-Mail Removed)> <(E-Mail Removed)>
| Subject: Re: /CLR floating point performance, inter-assembly function call performance
| Date: Thu, 6 May 2004 08:59:11 -0400
|
| From reading various things I had already recognized the things that you
| state as the current conventional wisdom. I went to the trouble of posting
| my results in the hope of getting some feedback on why my results run so
| strongly against that conventional wisdom. Please consider:
|
| 1) Floating point performance of managed code. At least in this little
| test scenario, floating point performance of managed code doesn't seem to
| be a problem at all. In the first call out of the 8 in a test run, the
| DMatrix3d_multiplyDPoint3dArray function is asked to apply the matrix to a
| whopping 5,000,000 3D points per call. So it is just sitting there doing
| floating point operations in a 5,000,000 iteration loop, and there are no
| function calls in that loop at all. The managed version took only 3%
| longer in that case than the all-native version. It seems logical, then,
| to rule out floating point performance as the culprit when things quickly
| change for the worse in the later calls, where the call granularity to
| DMatrix3d_multiplyDPoint3dArray becomes very fine. It makes more sense to
| attribute the slowdown observed in the fine-grained call cases to function
| call overhead, not to floating point performance.
|
| 2) The expense of transitions. What am I doing wrong? The version of my
| test program that involves a transition in the call from
| test_applyMatrixToDPoints->DMatrix3d_multiplyDPoint3dArray is actually
| FASTER than the all-managed version (true for both the intra-assembly and
| inter-assembly call cases). Furthermore, the more finely grained the calls
| are, the more the native->managed version outperforms the managed->managed
| versions. Since we already established that the raw floating point
| performance of the loop inside the DMatrix3d_multiplyDPoint3dArray
| function is nearly equivalent between the managed and native versions, and
| the conventional wisdom is that native->managed transitions are expensive
| and bad, what is to blame for the poor relative performance of the
| managed->managed versions? The managed->managed version is flat-out beaten
| by the version that does a transition for each and every call. It would
| seem that there is some serious penalty associated with making regular
| managed->managed function calls, not managed->native calls. What might be
| responsible for it, and is it something I have any control over?
|
| 3) The surprising difference in cost between inter-assembly and
| intra-assembly managed->managed calls. Can someone explain this
| difference, and is there anything that can be done about it besides making
| my program one enormous executable?
|
| 4) How can I step through JIT-compiled code in assembly language in a
| debugger for a release executable so that I can see what is going on? I
| want the JIT to produce "non debug" x86 instructions, and yet I want to
| step through them to see what they do. Tips appreciated. Can I do this
| with the VS.NET debugger? Windbg? How?
|
| "Yan-Hong Huang[MSFT]" <(E-Mail Removed)> wrote in message
| news:(E-Mail Removed)...
| > Hello Bern,
| >
| > Generally speaking, the v1 JIT does not currently perform all the
| > FP-specific optimizations that the VC++ backend does, making floating
| > point operations more expensive for now. That may be why
| > managed->managed is more expensive than managed->unmanaged in your test.
| >
| > So for areas which make heavy use of floating point arithmetic, please
| > use profilers to pick the fragments where the overhead is costing you
| > most, and keep the whole fragment in unmanaged space.
| >
| > Also, work to minimize the number of transitions you make. If you have
| > some unmanaged code or an interop call sitting in a loop, make the
| > entire loop unmanaged. That way you'll only pay the transition cost
| > twice, rather than for each iteration of the loop.
| >
| > Looking at the IL code, we can see that interop calls involve some
| > extra IL instructions. So minimizing the number of transitions can save
| > many IL instructions and improve performance.
| >
| > For some more information, you can refer to this chapter online:
| > "Chapter 7: Improving Interop Performance"
| >
| > http://msdn.microsoft.com/library/en...pt07.asp?frame=true#scalenetchapt07_topic12
| >
| > Hope that helps.
| >
| > Best regards,
| > Yanhong Huang
| > Microsoft Community Support
| >
| > Get Secure! - www.microsoft.com/security
| > This posting is provided "AS IS" with no warranties, and confers no
| > rights.
| >