Marcus,
Careful: while it's true that the JIT does not translate IL into x86 code
that takes advantage of SIMD, the 'managed math' library (really just a
thunk) calls directly into the CRT (MSVCR80). No JIT compilation is involved
there, so it's up to the native math library to take advantage of SIMD, NOT
the JIT.
That means that if, for instance, you call Math.Log(), the JIT-compiled code
calls the 'Log' thunk in Mscorlib, which calls into MSVCR80. If you run this
under the debugger on a system that supports the media instruction sets
(SSE, SSE2, SSE3, 3DNow, etc.), you might see a call to:
MSVCR80!_log_pentium4:
7818cf68 660f12442404 movlpd xmm0,qword ptr [esp+0x4]
7818cf6e ba00000000 mov edx,0x0
7818cf73 660f28e8 movapd xmm5,xmm0
7818cf77 660f14c0 unpcklpd xmm0,xmm0
7818cf7b 660f73d534 psrlq xmm5,0x34
7818cf80 660fc5cd00 pextrw ecx,xmm5,0x0
7818cf85 660f280db0b81a78 movapd xmm1,oword ptr [MSVCR80!_pi_by_2_to_61+0x2756 (781ab8b0)]
7818cf8d 660f281d10b91a78 movapd xmm3,oword ptr [MSVCR80!_pi_by_2_to_61+0x27b6 (781ab910)]
7818cf95 660f2825c0b81a78 movapd xmm4,oword ptr [MSVCR80!_pi_by_2_to_61+0x2766 (781ab8c0)]
7818cf9d 660f2835d0b81a78 movapd xmm6,oword ptr [MSVCR80!_pi_by_2_to_61+0x2776 (781ab8d0)]
7818cfa5 660f54c1 andpd xmm0,xmm1
7818cfa9 660f56c3 orpd xmm0,xmm3
7818cfad 660f58e0 addpd xmm4,xmm0
7818cfb1 660fc5c400 pextrw eax,xmm4,0x0
7818cfb6 25f0070000 and eax,0x7f0
7818cfbb 660f28a090b91a78 movapd xmm4,oword ptr [eax+0x781ab990]
7818cfc3 660f28b8a0bd1a78 movapd xmm7,oword ptr [eax+0x781abda0]
Note the 128-bit media (XMM) operands being used!
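
For illustration only (this is NOT the CRT's code), here is a minimal sketch
of what a packed instruction such as the addpd in the listing does: one
128-bit operation works on two doubles at a time. The helper name
add_doubles and the SKETCH_HAVE_SSE2 macro are made up for this example, and
the scalar fallback is only there so it also builds on compilers that do not
target SSE2.

```c
#include <assert.h>
#include <stddef.h>

/* Use SSE2 intrinsics when the compiler targets SSE2; otherwise fall back
   to plain scalar code so the sketch still builds everywhere. */
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

/* Hypothetical helper (not part of any library): dst[i] = a[i] + b[i]. */
void add_doubles(const double *a, const double *b, double *dst, size_t n)
{
    size_t i = 0;
    assert(a != NULL && b != NULL && dst != NULL);
#if SKETCH_HAVE_SSE2
    /* Each XMM register holds two doubles, so one addpd does two additions. */
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);   /* unaligned 128-bit load */
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(dst + i, _mm_add_pd(va, vb));
    }
#endif
    /* Scalar tail (or the whole array when SSE2 is unavailable). */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Calling add_doubles(a, b, d, 5) on five-element arrays runs two packed
iterations plus one scalar step for the odd element.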
Note also that the managed DirectX package does the same, so if you want to
take advantage of this, take a look at the DirectX Matrix and Vector
classes.
Willy.
| > Is that really true? I don't recall any C runtime numerical routines
| > that operate on arrays. SIMD math extensions are only a gain when
| > operating on vectors of data (single instruction multiple data).
| oops, I was thinking of the standard C library, not the Microsoft C
| runtime library. Regardless, I don't think the CLR/JIT takes advantage of
| SIMD extensions.
|
| Marcus Cuda wrote:
| >> I'm also not clear on what you are doing exactly in your FFT algorithm
| >> and what version of the framework you are using, but, as the CLR sits
| >> on top of the exact same C runtime when doing float maths (using SIMD
| >> when available), the performance figures should be comparable.
| > Is that really true? I don't recall any C runtime numerical routines
| > that operate on arrays. SIMD math extensions are only a gain when
| > operating on vectors of data (single instruction multiple data).
| >
| > Also, from David Notario's MSDN Blog on the JIT compiler[1]:
| > Note that we don't use SSE2 for floating point code. The reason for this
| > is that we don't vectorize code (which is the real win with SSE2).
| >
| > I agree that .NET is on par with non-SIMD-optimized C code. But I've
| > done benchmarks, and seen others (such as [2]), showing that a numerical
| > library optimized with SIMD extensions (called via P/Invoke) easily
| > outperforms .NET code (up to 10x on large sets of data).
| > Marcus
| >
| > [1] http://blogs.msdn.com/davidnotario/archive/2005/08/15/451845.aspx
| > [2] http://www.centerspace.net/doc/NMath/Core/whitepapers/NMath.Core.Benchmarks.pdf
| >
| > Willy Denoyette [MVP] wrote:
| >> Sorry, but you got an AV (access violation) exception, which means you
| >> are reading from or writing to a protected piece of memory. You also
| >> said that the C DLL works correctly when called from C, which would
| >> mean that the AV is due either to the managed/unmanaged interop or to
| >> the piece of code that passes the float[] elements from unmanaged code
| >> to managed code. I would suggest you run this in the (unmanaged)
| >> debugger.
| >> I'm also not clear on what you are doing exactly in your FFT algorithm
| >> and what version of the framework you are using, but, as the CLR sits
| >> on top of the exact same C runtime when doing float maths (using SIMD
| >> when available), the performance figures should be comparable.
| >>
| >>
| >> Willy.
| >>
| >>
| >> | i am not using the float array directly. in the C DLL i am allocating
| >> | aligned space using _aligned_malloc for 3 arrays, and the float[]
| >> | input is being copied to one of these arrays. all the SSE work is then
| >> | performed on the 3 aligned arrays, and this is what causes the error
| >> | for some reason.
| >> |
| >> | > I'm not entirely clear why you want to use this from managed code
| >> | > anyway. You won't realize any performance gain by marshaling from
| >> | > managed to unmanaged; you'd better use only unmanaged or only
| >> | > managed code for this.
| >> |
| >> | i was also under the impression that the performance overhead would
| >> | be minimal for marshalling the float[]. if i could get the thing to
| >> | run i would give you some numbers
| >> |
| >> | i am doing this using managed code because i am trying to create a
| >> | simple app for recording hydrophone data, and the only demanding work
| >> | it does is a bunch of FFTs. i have written an FFT algorithm in managed
| >> | code but the performance hit is quite large: a 16k FFT in C# takes
| >> | something on the order of 5-8ms, whereas the C DLL using SIMD
| >> | instructions rarely takes more than 1ms. i thought it would be easy
| >> | to call it from C#, thus speeding up the app considerably (each
| >> | iteration performs roughly 12 FFTs) and saving me the hassle of
| >> | writing the rest of the app in C++.
| >> |
| >>
| >>
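
The DLL's code is not shown anywhere in this thread, so the following is
only a sketch of the copy-into-aligned-buffer step described in the oldest
quoted message, under stated assumptions: 16-byte alignment (what aligned
SSE loads and stores require), and made-up helper names (copy_to_aligned,
free_aligned_floats). It uses _aligned_malloc on MSVC and C11 aligned_alloc
elsewhere so that it builds on more than one compiler.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#ifdef _MSC_VER
#include <malloc.h>   /* _aligned_malloc, _aligned_free */
#endif

/* Hypothetical helper: release a buffer obtained from copy_to_aligned.
   Memory from _aligned_malloc must go to _aligned_free, never free(). */
void free_aligned_floats(float *p)
{
#ifdef _MSC_VER
    _aligned_free(p);
#else
    free(p);
#endif
}

/* Hypothetical helper: copy n floats into a new 16-byte-aligned buffer.
   Returns NULL when n is 0 or the allocation fails. */
float *copy_to_aligned(const float *src, size_t n)
{
    float *dst;
    size_t bytes;

    if (src == NULL || n == 0)
        return NULL;

    /* Round the size up to a multiple of 16, as aligned_alloc expects. */
    bytes = (n * sizeof(float) + 15) & ~(size_t)15;
#ifdef _MSC_VER
    dst = (float *)_aligned_malloc(bytes, 16);
#else
    dst = (float *)aligned_alloc(16, bytes);
#endif
    if (dst == NULL)
        return NULL;

    /* Sanity check: the address really is 16-byte aligned. */
    assert(((uintptr_t)dst & 15) == 0);
    memcpy(dst, src, n * sizeof(float));
    return dst;
}
```

Only n floats are copied; the rounded-up padding at the end of the buffer is
left uninitialized, so code working on the buffer in 4-float steps should
not rely on its contents.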