passing array to C DLL


mr.resistor

hey

i am having a few problems calling a C DLL from C#. i am using a simple
function that takes an array of floats and an integer as input, but
i cannot seem to get it to work. when i try to run it i get the
following error:

Attempted to read or write protected memory

the C function should not be manipulating the input array, only reading
it. can anyone help? code is below:

using System.Runtime.InteropServices;

namespace DLLTest
{
    class HellDLL
    {
        [DllImport("FFT.DLL")]
        public static extern unsafe void FFT(float[] x, int n);
    }

    class Program
    {
        static void Main(string[] args)
        {
            int i, N = 4096;
            float[] noise = new float[N];

            // fill the input with a simple ramp
            for (i = 0; i < N; i++)
            {
                noise[i] = (float)i;
            }

            unsafe
            {
                HellDLL.FFT(noise, N);
            }
        }
    }
}
 

Nicholas Paldino [.NET/C# MVP]

Can you provide the declaration of the function in C? It's impossible
to say what is wrong without knowing that first.
 

mr.resistor

hey

here is the C function declaration:

DLLIMPORT void FFT(float x[], int n);

the function is a high performance FFT algorithm, it mallocs a bunch of
arrays, copies the input array x into one of them, does a load of maths
using SSE intrinsics, then copies the result back into x.

is this a suitable way to go about this? for some strange reason the
program worked the first time i ran it, but since then it gives me a
corrupt memory error.

can anyone help?
 

Marcus Cuda

> public static extern unsafe void FFT(float[] x, int n);

You might want to try

public static extern unsafe void FFT([In,Out]float[] x, int n);

I think you might need the Out attribute, since the native method is updating the array.

Also, do you really need to mark it as unsafe?

Marcus
 

mr.resistor

ok, have removed unsafe, currently calling it as:

[DllImport("FFT.dll")]
public static extern void FFT([Out]float[] x, int n);

but with the same problem. i have worked out that the error only occurs
when the C DLL tries to use the SSE commands such as:

_mm_load_ps
_mm_store_ps

etc. can you think why this error is occurring? i can call the DLL just
fine from a C program, so it is definitely functioning.
 

Willy Denoyette [MVP]

Note that you can't pass the C# float array to aligned SSE instructions directly.
There is no guarantee that the managed array is 16-byte aligned, so you'll
have to copy the array to a 16-byte aligned buffer before passing it to SSE.
I'm not entirely clear why you want to use this from managed code anyway,
you won't realize any performance gain by marshaling from managed to
unmanaged, you better use unmanaged or managed code only for this.

Willy.
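
To illustrate the alignment point, here is a minimal C sketch of checking a pointer for 16-byte alignment and copying a caller's array into an aligned scratch buffer. The names is_aligned16, copy_to_aligned, ALLOC16 and FREE16 are hypothetical and are not taken from the FFT DLL discussed in this thread; the sketch assumes _aligned_malloc on Windows and C11 aligned_alloc elsewhere.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: _aligned_malloc/_aligned_free on Windows, C11 aligned_alloc elsewhere. */
#if defined(_WIN32)
#include <malloc.h>
#define ALLOC16(bytes) _aligned_malloc((bytes), 16)
#define FREE16(p)      _aligned_free(p)
#else
#define ALLOC16(bytes) aligned_alloc(16, (bytes))
#define FREE16(p)      free(p)
#endif

/* Returns nonzero when p sits on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Copies n floats from x (any alignment) into a new 16-byte aligned buffer.
   Returns NULL when n <= 0 or the allocation fails.
   The caller releases the buffer with FREE16. */
float *copy_to_aligned(const float *x, int n)
{
    size_t bytes, padded;
    float *buf;

    if (n <= 0)
        return NULL;
    assert(x != NULL);

    bytes = (size_t)n * sizeof(float);
    padded = (bytes + 15u) & ~(size_t)15u;  /* round up to a multiple of 16 */

    buf = (float *)ALLOC16(padded);
    if (buf != NULL)
        memcpy(buf, x, bytes);
    return buf;
}
```

Aligned loads and stores such as _mm_load_ps and _mm_store_ps are then only ever applied to the returned buffer, never to the pointer that came across the interop boundary.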


Marcus Cuda

Willy,

> I'm not entirely clear why you want to use this from managed code anyway,
> you won't realize any performance gain by marshaling from managed to
> unmanaged, you better use unmanaged or managed code only for this.

For my own understanding, are you only referring to marshalling into 16-byte aligned arrays here? For a non-aligned
array, my understanding (from [1]) is that the marshalling overhead is minimal when marshalling a float/double array.
The runtime just pins the array and passes a pointer to it on to the unmanaged function.

I'm curious if using _mm_loadu_ps and _mm_storeu_ps instead of _mm_load_ps and _mm_store_ps would solve the problem.

Thanks,
Marcus
[1] http://msdn2.microsoft.com/en-us/library/75dwhxf7.aspx
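
As a rough sketch of the unaligned variant suggested above: the C function below scales an array in place using _mm_loadu_ps and _mm_storeu_ps, which accept a pointer of any alignment. The function name scale_unaligned and the scaling operation are assumptions chosen for illustration only; the actual FFT code was never posted in this thread. It requires an x86 target with SSE.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Multiplies every element of x by gain, in place.
   x may have any alignment because only unaligned loads/stores are used. */
void scale_unaligned(float *x, size_t n, float gain)
{
    __m128 g;
    size_t i = 0;

    assert(x != NULL || n == 0);
    g = _mm_set1_ps(gain);

    /* four floats per iteration */
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(x + i);
        _mm_storeu_ps(x + i, _mm_mul_ps(v, g));
    }

    /* scalar tail for the remaining 0..3 elements */
    for (; i < n; i++)
        x[i] *= gain;
}
```

Swapping in _mm_load_ps or _mm_store_ps here would fault whenever x is not 16-byte aligned, which is the failure mode being discussed.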


mr.resistor

i am not using the float array directly, in the C DLL i am assigning
aligned space using _aligned_malloc for 3 arrays and the float[] input
is being copied to one of these arrays. all the SSE work is then
performed on the 3 aligned arrays, and this is what causes the error
for some reason.

> I'm not entirely clear why you want to use this from managed code anyway,
> you won't realize any performance gain by marshaling from managed to
> unmanaged, you better use unmanaged or managed code only for this.

i was also under the impression that the performance overhead would be
minimal for marshalling the float[]. if i could get the thing to run i
would give you some numbers :)

i am doing this using managed code because i am trying to create a
simple app for recording hydrophone data, and the only demanding work
it does is a bunch of FFTs. i have written an FFT algorithm in managed
code but the performance hit is quite large, a 16k FFT in C# takes
something on the order of 5-8ms whereas the C DLL using SIMD commands
rarely takes more than 1ms. i thought it would be easy to call it from
C# thus speeding up the app considerably (each iteration performs 12
FFTs roughly) and saving me the hassle of writing the rest of the app
in C++.
 

Willy Denoyette [MVP]

Marcus,

Yes, it should, but you will lose another couple of cycles, as these
intrinsics are a lot slower than the aligned ones.

Willy.


Willy Denoyette [MVP]

Sorry, but you got an AV (access violation) exception, which means you are
reading from or writing to a protected piece of memory. You also said that the
C DLL is working correctly when called from C, so the AV is due either to the
managed/unmanaged interop or to the piece of code that passes the float[]
elements from unmanaged code to managed code. I would suggest you run this in
the (unmanaged) debugger.
I'm also not clear on what you are doing exactly in your FFT algorithm and
what version of the framework you are using, but, as the CLR sits on top of the
exact same C runtime when doing float maths (using SIMD when available), the
performance figures should be comparable.


Willy.



Marcus Cuda

> I'm also not clear on what you are doing exactly in your FFT algorithm and
> what version of the framework you are using, but, as CLR sits on top of the
> exact same C runtime when doing float maths (using SIMD when available), the
> performance figures should be comparable.

Is that really true? I don't recall any C runtime numerical routines that operate on arrays. SIMD math extensions are
only a gain when operating on vectors of data (single instruction multiple data).

Also, from David Notario's MSDN Blog on the JIT compiler [1]:

> Note that we don’t use SSE2 for floating point code. The reason for this is that we don’t vectorize code (which is the
> real win with SSE2).

I agree that .NET is on par with non-SIMD optimized C code. But I've done benchmarks, and seen others (such as [2]),
showing that a numerical library (via P/Invoke) optimized with SIMD extensions easily outperforms .NET code (up to 10x
on large sets of data).
Marcus

[1] http://blogs.msdn.com/davidnotario/archive/2005/08/15/451845.aspx
[2] http://www.centerspace.net/doc/NMath/Core/whitepapers/NMath.Core.Benchmarks.pdf

Marcus Cuda

> Is that really true? I don't recall any C runtime numerical routines that operate on arrays. SIMD math extensions are
> only a gain when operating on vectors of data (single instruction multiple data).

Oops, I was thinking of the standard C library, not the Microsoft C runtime library. Regardless, I don't think the
CLR/JIT takes advantage of SIMD extensions.


Willy Denoyette [MVP]

Marcus,
Take care: while it's true that the JIT will not translate IL to x86 code that
takes advantage of SIMD, the 'managed math' library (well, this is just a
thunk) calls directly into the CRT (MSVCR80). No JIT compiling is
involved here, so it's up to the native math library to take advantage of
SIMD, NOT the JIT.

That means that if, for instance, you call Math.Log(), the JIT will call the
thunk 'Log' in Mscorlib, which calls into MSVCR80. If you run this under the
debugger on a system that supports media code (SSE, SSE2, SSE3, 3DNow etc.),
you might see a call to:

MSVCR80!_log_pentium4:
7818cf68 660f12442404     movlpd   xmm0,qword ptr [esp+0x4]
7818cf6e ba00000000       mov      edx,0x0
7818cf73 660f28e8         movapd   xmm5,xmm0
7818cf77 660f14c0         unpcklpd xmm0,xmm0
7818cf7b 660f73d534       psrlq    xmm5,0x34
7818cf80 660fc5cd00       pextrw   ecx,xmm5,0x0
7818cf85 660f280db0b81a78 movapd   xmm1,oword ptr [MSVCR80!_pi_by_2_to_61+0x2756 (781ab8b0)]
7818cf8d 660f281d10b91a78 movapd   xmm3,oword ptr [MSVCR80!_pi_by_2_to_61+0x27b6 (781ab910)]
7818cf95 660f2825c0b81a78 movapd   xmm4,oword ptr [MSVCR80!_pi_by_2_to_61+0x2766 (781ab8c0)]
7818cf9d 660f2835d0b81a78 movapd   xmm6,oword ptr [MSVCR80!_pi_by_2_to_61+0x2776 (781ab8d0)]
7818cfa5 660f54c1         andpd    xmm0,xmm1
7818cfa9 660f56c3         orpd     xmm0,xmm3
7818cfad 660f58e0         addpd    xmm4,xmm0
7818cfb1 660fc5c400       pextrw   eax,xmm4,0x0
7818cfb6 25f0070000       and      eax,0x7f0
7818cfbb 660f28a090b91a78 movapd   xmm4,oword ptr [eax+0x781ab990]
7818cfc3 660f28b8a0bd1a78 movapd   xmm7,oword ptr [eax+0x781abda0]


See the 128-bit media code operands (the xmm registers) being used!
Note also that the same is done by the managed DirectX package, so if you
want to take advantage of this, take a look at the DirectX Matrix and Vector
classes.

Willy.


Marcus Cuda

Willy,

Thank you for your response. I don't think I was clear in my previous post, so I'll just clarify one thing and then I'll
let this die.

> no JIT compiling is
> involved here, so it's up to the native math library to take advantage of
> SIMD, NOT the JIT.

Right, that is the problem. Math.Log may call into Mscorlib, which in turn uses SIMD to calculate the log value, but that
really isn't taking advantage of SIMD. For an overly simplified, unrealistic example, say you wanted to multiply the
values of two arrays together:

double[] a = ...;
double[] b = ...;
double[] c = ...;
for( int i ....){
    c[i] = a[i] * b[i];
}

It would be great if the JIT compiler would use a SIMD multiply here, but it doesn't. But I can P/Invoke Intel's MKL,
ACML, ATLAS, or DirectX and they will. For small arrays it wouldn't help, but for large arrays it would [1]. So my point
is that the JIT compiler does not exploit SIMD on array operations. Wish it did :)


Regards,
Marcus

[1] When your arrays are larger than (about) 100 elements, otherwise the overhead of using P/Invoke cancels out any
performance gains.
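
For comparison, here is a hedged sketch of what the native side of such an array multiply might look like when written by hand with SSE2 intrinsics. The function name multiply_arrays is hypothetical; it is not the API of MKL, ACML, ATLAS or DirectX, and it only illustrates the kind of vectorized loop those libraries provide. It assumes an x86 target with SSE2 and uses unaligned loads so that the arrays may come from anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics (double precision) */

/* Computes c[i] = a[i] * b[i] for i in [0, n), two doubles per SIMD step. */
void multiply_arrays(const double *a, const double *b, double *c, size_t n)
{
    size_t i = 0;

    assert(n == 0 || (a != NULL && b != NULL && c != NULL));

    /* vector part: one 128-bit register holds two doubles */
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(c + i, _mm_mul_pd(va, vb));
    }

    /* scalar tail when n is odd */
    for (; i < n; i++)
        c[i] = a[i] * b[i];
}
```

From C#, a function like this would be reached through a DllImport declaration taking double[] parameters, in the same way as the FFT call at the top of the thread.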


Marcus Cuda

If anyone has actually read down this far, here is a plug for an open source numerical library for .NET that I manage.
The unique feature of this library is that it can optionally use MKL, ACML, or ATLAS for BLAS and LAPACK operations and
it works with Mono on Linux.
see http://www.dnanalytics.net/numerical/


Marcus said:
Willy,

Thank you for your response. I don't think I was clear in my previous post, I'll just clarify one thing and then I'll
let this die.
no JIT compiling is
involved here, so it's up to the native math library to take advantage of
SIMD, NOT the JIT.
Right, that is the problem. Math.Log may call Mscorlib that in turn uses SIMDs to calculate the log value, but that
really isn't taking advantage of SIMDs. For an overly simplified, unrealistic example, say you wanted to multiple the
values of two arrays together.
double[] a = ...;
double[] b = ...;
double[] c = ...;
for( int i ....){
c = a * b;
}

It would be great if the JIT compiler would use a SIMD multiply here, but it doesn't. But I can P/Invoke Intel's MKL,
ACML, ATLAS, or DirectX and they will. For small arrays it wouldn't help but for large arrays it would [1]. So my point
is that the JIT compiler does not exploit SIMDs on array operations. Wish it did :)


Regards,
Marcus

[1] When your arrays are larger than (about) 100 elements, otherwise the overhead of using P/Invoke cancels out any
performance gains.

Marcus,
Take care, while it's true that the JIT will not translate IL to X86 code to
take advantage of SIMD, the 'managed math' library (well this is just a
thunk) is directly calling into the CRT (MSVCR80), no JIT compiling is
involved here, so it's up to the native math library to take advantage of
SIMD, NOT the JIT.

That means that, if for instance you call Math.Log(), the JIT will call the
thunk 'Log' in Mscorlib, which calls into MSVCR80. If you run this under the
debugger on a system that supports media code (SSE, SSE2 SSE3, 3DNow etc)
you might see a call to:

MSVCR80!_log_pentium4:
7818cf68 660f12442404 movlpd xmm0,qword ptr [esp+0x4]
7818cf6e ba00000000 mov edx,0x0
7818cf73 660f28e8 movapd xmm5,xmm0
7818cf77 660f14c0 unpcklpd xmm0,xmm0
7818cf7b 660f73d534 psrlq xmm5,0x34
7818cf80 660fc5cd00 pextrw ecx,xmm5,0x0
7818cf85 660f280db0b81a78 movapd xmm1,oword ptr
[MSVCR80!_pi_by_2_to_61+0x2756 (781ab8b0)]
7818cf8d 660f281d10b91a78 movapd xmm3,oword ptr
[MSVCR80!_pi_by_2_to_61+0x27b6 (781ab910)]
7818cf95 660f2825c0b81a78 movapd xmm4,oword ptr
[MSVCR80!_pi_by_2_to_61+0x2766 (781ab8c0)]
7818cf9d 660f2835d0b81a78 movapd xmm6,oword ptr
[MSVCR80!_pi_by_2_to_61+0x2776 (781ab8d0)]
7818cfa5 660f54c1 andpd xmm0,xmm1
7818cfa9 660f56c3 orpd xmm0,xmm3
7818cfad 660f58e0 addpd xmm4,xmm0
7818cfb1 660fc5c400 pextrw eax,xmm4,0x0
7818cfb6 25f0070000 and eax,0x7f0
7818cfbb 660f28a090b91a78 movapd xmm4,oword ptr [eax+0x781ab990]
7818cfc3 660f28b8a0bd1a78 movapd xmm7,oword ptr [eax+0x781abda0]


See: the 128 bit Media code operands used!
Note also that the same is done by the managed DirectX package, so if you
want to take advantage of this take a look at the DirectX Matrix and Vector
classes.

Willy.

| > Is that really true? I don't recall any C runtime numerical routines that
| > operate on arrays. SIMD math extensions are only a gain when operating on
| > vectors of data (single instruction multiple data).
| oops, I was thinking of the standard C library, not the Microsoft C runtime
| library. Regardless, I don't think the CLR/JIT takes advantage of SIMD
| extensions.
|
| Marcus Cuda wrote:
| >> I'm also not clear on what you are doing exactly in your FFT algorithm and
| >> what version of the framework you are using, but, as CLR sits on top of the
| >> exact same C runtime when doing float maths (using SIMD when available), the
| >> performance figures should be comparable.
| > Is that really true? I don't recall any C runtime numerical routines that
| > operate on arrays. SIMD math extensions are only a gain when operating on
| > vectors of data (single instruction multiple data).
| >
| > Also, from David Notario's MSDN Blog on the JIT compiler[1]:
| > Note that we don't use SSE2 for floating point code. The reason for this is
| > that we don't vectorize code (which is the real win with SSE2).
| >
| > I agree that .NET is on par with non-SIMD optimized C code. But I've done
| > benchmarks and seen others (such as [2]) that using a numerical library (via
| > p/invoke) optimized with SIMD extension easily outperforms .NET code (up to
| > 10x on large sets of data).
| > Marcus
| >
| > [1] http://blogs.msdn.com/davidnotario/archive/2005/08/15/451845.aspx
| > [2] http://www.centerspace.net/doc/NMath/Core/whitepapers/NMath.Core.Benchmarks.pdf
| >
| > Willy Denoyette [MVP] wrote:
| >> Sorry, but you got an AV exception, which means you are reading/writing
| >> from/to a read/write protected piece of memory. You also said that the C DLL
| >> is working correctly when called from C, which would mean that the AV is
| >> either due to the managed/unmanaged interop, or due to the piece of code that
| >> passes the float[] elements from unmanaged code to managed code. I would
| >> suggest you run this in the (unmanaged) debugger.
| >> I'm also not clear on what you are doing exactly in your FFT algorithm and
| >> what version of the framework you are using, but, as CLR sits on top of the
| >> exact same C runtime when doing float maths (using SIMD when available), the
| >> performance figures should be comparable.
| >>
| >>
| >> Willy.
| >>
| >>
| >> | i am not using the float array directly, in the C DLL i am assigning
| >> | aligned space using _aligned_malloc for 3 arrays and the float[] input
| >> | is being copied to one of these arrays. all the SSE work is then
| >> | performed on the 3 aligned arrays, and this is what causes the error
| >> | for some reason.
| >> |
| >> | > I'm not entirely clear why you want to use this from managed code anyway,
| >> | > you won't realize any performance gain by marshaling from managed to
| >> | > unmanaged, you better use unmanaged or managed code only for this.
| >> |
| >> | i was also under the impression that the performance overhead would be
| >> | minimal for marshalling the float[]. if i could get the thing to run i
| >> | would give you some numbers :)
| >> |
| >> | i am doing this using managed code because i am trying to create a
| >> | simple app for recording hydrophone data, and the only demanding work
| >> | it does is a bunch of FFTs. i have written an FFT algorithm in managed
| >> | code but the performance hit is quite large, a 16k FFT in C# takes
| >> | something on the order of 5-8ms whereas the C DLL using SIMD commands
| >> | rarely takes more than 1ms. i thought it would be easy to call it from
| >> | C# thus speeding up the app considerably (each iteration performs 12
| >> | FFTs roughly) and saving me the hassle of writing the rest of the app
| >> | in C++.
| >> |
| >>
| >>
 
Willy Denoyette [MVP]

| Willy,
|
| Thank you for your response. I don't think I was clear in my previous post,
| I'll just clarify one thing and then I'll let this die.
|
Not at all, you were very clear. And as I said, you are right: the JIT
doesn't take advantage of SIMD for simple math operations.

| > no JIT compiling is
| > involved here, so it's up to the native math library to take advantage of
| > SIMD, NOT the JIT.
| Right, that is the problem. Math.Log may call Mscorlib that in turn uses
| SIMDs to calculate the log value, but that really isn't taking advantage of
| SIMDs.

Hmm... do you mean Math.Log doesn't take advantage of SSE and SSE2? Take a
look at the "MSVCR80!_log_pentium4:" function in my previous post: all of its
instructions are SSE and SSE2.


| For an overly simplified, unrealistic example, say you wanted to multiply
| the values of two arrays together.
| double[] a = ...;
| double[] b = ...;
| double[] c = ...;
| for( int i ....){
|     c[i] = a[i] * b[i];
| }
|

I don't know whether you would gain anything from SSE here; it all depends on
the size of the vectors or matrices. Honestly, I don't believe this is
something that needs to be supported at the managed language level (JIT
compiled). I would prefer to see an extended math, vector and matrix class
library (a la DirectX) supporting doubles, implemented as a thunk calling into
native optimized code. IMO the 32-bit JIT is not able to generate
processor-dependent, optimized 64/128-bit media code streams in a way that the
overhead doesn't defeat the purpose.

| It would be great if the JIT compiler would use a SIMD multiply here, but it
| doesn't. But I can P/Invoke Intel's MKL, ACML, ATLAS, or DirectX and they
| will. For small arrays it wouldn't help but for large arrays it would [1].
| So my point is that the JIT compiler does not exploit SIMDs on array
| operations. Wish it did :)
|

I'm pretty sure MSFT has already explored this subject, and as far as I know
they are staying away from it; they may well have pretty good reasons.

|
| Regards,
| Marcus
|
| [1] When your arrays are larger than (about) 100 elements, otherwise the
| overhead of using P/Invoke cancels out any performance gains.
|

Very true.

Regards,
Willy.
 
Marcus Cuda

> Hmm... do you mean Math.Log doesn't take advantage of SSE and SSE2? Take a
> look at the "MSVCR80!_log_pentium4:" function in my previous post all
> instructions are SSE and SSE2.
It does. I'm just saying SIMD really pays off when working on arrays of data,
not scalars (which a call to Math.Log(value) is).

> Don't know if you would take any advantage of SSE for this, all depends on
> the size of the vectors or matrixes.
You would, since a SIMD multiply would multiply multiple elements
simultaneously, not element by element like the example code would. This is
where SIMD would really make a difference. The JIT compiler would have to be
smart enough to recognize what is going on and vectorize the loop (breaking it
up into chunks and calling a SIMD routine on each chunk). As you said, this
probably isn't going to happen, so users should consider using an optimized
library.
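
To make the "break the loop into chunks" idea concrete, here is a minimal sketch in C of what a hand-vectorized version of the array multiply could look like on the native side. It is only an illustration, not code from the FFT DLL or from any of the libraries mentioned: the name multiply_arrays is made up, it assumes an x86 compiler that provides the SSE2 intrinsics in <emmintrin.h>, and it falls back to the plain scalar loop when SSE2 is not available at compile time.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   /* SSE2 intrinsics (__m128d, _mm_mul_pd, ...) */
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Hypothetical helper: c[i] = a[i] * b[i] for every i in [0, n). */
static void multiply_arrays(const double *a, const double *b, double *c, size_t n)
{
    size_t i = 0;

    assert(a != NULL && b != NULL && c != NULL);

#if HAVE_SSE2
    /* Vectorized part: each iteration multiplies one chunk of two doubles.
       The unaligned load/store variants are used so the caller does not
       have to provide 16-byte aligned buffers. */
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(c + i, _mm_mul_pd(va, vb));
    }
#endif

    /* Scalar tail: the leftover element when n is odd, or the whole
       array when SSE2 is not available. */
    for (; i < n; i++) {
        c[i] = a[i] * b[i];
    }
}
```

With doubles, each SSE2 instruction only handles two elements at a time, so the gain over the scalar loop is modest; with floats (four per register, as in the FFT DLL's _mm_load_ps / _mm_store_ps calls) the chunks are twice as wide. Note that the aligned variants such as _mm_load_ps require 16-byte aligned addresses, which the unaligned loads used here do not.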

This gets back to your original point:
> I'm also not clear on what you are doing exactly in your FFT algorithm and
> what version of the framework you are using, but, as CLR sits on top of the
> exact same C runtime when doing float maths (using SIMD when available), the
> performance figures should be comparable.
I am saying that this is not really the case.

Marcus
 
Willy Denoyette [MVP]

| > Hmm... do you mean Math.Log doesn't take advantage of SSE and SSE2? Take a
| > look at the "MSVCR80!_log_pentium4:" function in my previous post all
| > instructions are SSE and SSE2.
| It does. I'm just saying SIMDs really payoff when working on arrays of data,
| not scalars (which a call to Math.Log(value) is).
|

Agreed, but 'Log' and some other math operations can take advantage of SIMD
even for scalars. That's why the CLR calls into the native CRT math library,
which uses the cpuid instruction to check whether SSE/SSE2 is available.
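
For readers who want to see what such a cpuid check looks like, below is a rough sketch in C. It is not the CRT's actual code: the helper names are invented for illustration, it assumes GCC or Clang on x86 (where <cpuid.h> provides __get_cpuid), and on any other compiler or architecture it simply reports the feature as absent. The bit positions are the documented ones for CPUID leaf 1: EDX bit 25 is SSE and EDX bit 26 is SSE2.

```c
#include <assert.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>       /* __get_cpuid (GCC/Clang, x86 only) */
#define HAVE_CPUID 1
#else
#define HAVE_CPUID 0
#endif

/* Hypothetical helper: returns 1 if the given EDX feature bit of
   CPUID leaf 1 is set, 0 otherwise (or if cpuid is unavailable). */
static int cpu_has_edx_feature(unsigned int bit)
{
    assert(bit < 32);
#if HAVE_CPUID
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                       /* leaf 1 not supported */
    return (int)((edx >> bit) & 1u);
#else
    return 0;                           /* assumption: treat as "no SIMD" */
#endif
}

static int cpu_has_sse(void)  { return cpu_has_edx_feature(25); }
static int cpu_has_sse2(void) { return cpu_has_edx_feature(26); }

/* Pick a code path once, in the spirit of a library choosing between a
   SIMD routine and a plain x87 one. The returned names are placeholders. */
static const char *pick_log_variant(void)
{
    return cpu_has_sse2() ? "sse2" : "x87";
}
```

A library would typically run a check like this once at startup and store a function pointer to the chosen routine, so that each later call pays no detection cost.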

| > Don't know if you would take any advantage of SSE for this, all depends on
| > the size of the vectors or matrixes.
| You would since a SIMD multiply would multiply multiple elements
| simultaneously, not element by element like the example code would. This is
| where SIMDs would really make a difference. The JIT compiler would have to
| smart enough to recognize what is going on and vectorize the loop (breaking
| it up in to chunks and calling a SIMD routine on each chunk). As you said,
| this probably isn't going to happen, so users should consider using an
| optimized library.

Agreed.


Willy.
 
