Why is C# 450% slower than C++ on nested loops?


Christoph Nahr

[1] This is the correct code which is still ~50% slower than the (corrected)
native C++ code ( __int64 x=0;).

Great analysis! This 50% slowdown is also what I've observed in other
(non-broken...) mathematical benchmarks of C# vs native C++.

For the record, that's also the maximum performance loss I've ever
observed; in tasks that aren't as well suited to modern C++
optimizers the speed difference is only about 10-20%.
 

Guest

This is all very interesting (yawn), but Mr. Olcott seems more intent on
self-promotion than question-answering. Moving on...

Peter Olcott said:
Nicholas Paldino said:
Joanna,

Which one, alt.inventors? That's a non-tech group I saw...

misc.int-property is much more useful. There are real patent agents and at least
one patent attorney who provide good advice for free. It can be verified as
good advice by cross-validation.
--
- Nicholas Paldino [.NET/C# MVP]
- (e-mail address removed)

Joanna Carter said:
"Nicholas Paldino [.NET/C# MVP]" <[email protected]> a écrit
dans le message de eFs0Pb%[email protected]...

| Now I'm really done feeding the troll.

Heh, you ought to come and see what we have had to deal with in our non-tech
group :))

Joanna
 

Willy Denoyette [MVP]

Christoph Nahr said:
[1] This is the correct code which is still ~50% slower than the
(corrected)
native C++ code ( __int64 x=0;).

Great analysis! This 50% slowdown is also what I've observed in other
(non-broken...) mathematical benchmarks of C# vs native C++.

For the record, that's also the maximum performance loss I've ever
observed; in tasks that aren't as well suited to modern C++
optimizers the speed difference is only about 10-20%.
--

In this "particular" case, one could conclude that the C# team could have
done a better job. However, when looking at the IL produced by C++/CLI and C#,
you'll see that all three produce identical IL (for the loop construct), so
there is no reason to blame C#.
However, despite the fact that all managed compilers produce the same IL
they perform differently:

Same IL
C#: 100% (verifiable)
C++/CLI /clr: 64% (non-verifiable)
C++/CLI /clr:safe : 120% (verifiable)

native C++ : 62%

So you see that in this 'particular case' managed code can perform like
native code, and C# performs slightly better than managed safe C++, but is
this true in all cases? No, surely not. And that's why you should be careful
when interpreting benchmark figures; most of the time they have no real
value, and sometimes they are (in a subtle way) incorrect.


Willy.
 

Peter Olcott

Willy Denoyette said:
Ever heard of integer overflow?
x (an integer type) cannot hold the value of x += a+b+c+d+e+f;
What's the value of a benchmark that produces wrong results?

If the only purpose of the benchmark is to see how long an arbitrary construct
takes to execute, arithmetic overflow should not be much of an issue. There are
some cases (such as computing a checksum) where arithmetic overflow is the
desired result.
 

Peter Olcott

Willy Denoyette said:
Christoph Nahr said:
[1] This is the correct code which is still ~50% slower than the (corrected)
native C++ code ( __int64 x=0;).

Great analysis! This 50% slowdown is also what I've observed in other
(non-broken...) mathematical benchmarks of C# vs native C++.

For the record, that's also the maximum performance loss I've ever
observed; in tasks that aren't as well suited to modern C++
optimizers the speed difference is only about 10-20%.
--

In this "particular" case, one could conclude that the C# team could have done
a better job. However, when looking at the IL produced by C++/CLI and C#, you'll
see that all three produce identical IL (for the loop construct), so there is
no reason to blame C#.
However, despite the fact that all managed compilers produce the same IL they
perform differently:

According to MS technical support this is incorrect. The only reason that one
compiler produces faster code than another is that they generate different CIL.
 

Christoph Nahr

In this "particular" case, one could conclude that the C# team could have
done a better job. However, when looking at the IL produced by C++/CLI and C#,
you'll see that all three produce identical IL (for the loop construct), so
there is no reason to blame C#.

Now you've lost me. Marcus Cuda already posted the IL generated for
C++/CLI and C#, and they are clearly different -- the C# version is
much longer. How could it even be possible that the same IL performed
differently when generated by different compilers?
 

Peter Olcott

Christoph Nahr said:
Now you've lost me. Marcus Cuda already posted the IL generated for
C++/CLI and C#, and they are clearly different -- the C# version is
much longer. How could it even be possible that the same IL performed
differently when generated by different compilers?

My point exactly!
 

Christoph Nahr

Sorry, the above is not completely true; the C# IL may differ, as I only compared
the IL of both C++ /clr and C++ /clr:safe, and despite the fact that they are
identical (for the loop part only), the results are quite different (100%).
That means that even with identical IL the performance level may differ, and
different IL may produce equal results. The secret (especially in this
case!) lies in the metadata, which 'tells' the JIT that one is verifiable
code while the other is not, so it's allowed to generate different machine
code for the same IL.

That's an interesting idea but it must be something else (see below).

The unsafe method modifier allows the C# compiler to generate
unverifiable code, too, but a quick test showed absolutely no
difference between two benchmark runs -- one unchanged, one with the
"unsafe" modifier on the nested loop test. So apparently the machine
code generated for either flag is identical.
Marcus' code posted in this thread is wrong; he changed:

x += a+b+c+d+e+f;

into:

x++;

Please re-read his post and the follow-up.

Right, I saw his post now.
PS. find the IL for both in attachment.

You've lost me again. I've compared both IL sequences, and they are
quite different! The unverifiable version has a different calling
convention, a smaller local stack, calls to C++ library methods, and
overall a smaller IL instruction count.

My diff program can't find the identical loop part that you claim
exists because the two IL sequences are so different. Evidently the
/clr:safe switch causes the C++ compiler to generate different IL, not
just to set a different JIT flag.
 

Anders Borum

Hello!
It must have been after the JIT, because C# beat C++ on some things, just
not the nested loops. The author of the link would know relatively obvious
things such as this.

Unless somebody documents what methods have been used to measure the
benchmarks, I am disposing of the data altogether. Benchmarks should be
verifiable ..
 

Anders Borum

I think Mr. Olcott was busy filing patent requests instead of reading the
excellent posting by Mr. Denoyette. I must admit that I have learned some
great things about IL and optimization from this thread, but I am still
wondering why some people post here in the first place ..
 

Willy Denoyette [MVP]

Christoph Nahr said:
That's an interesting idea but it must be something else (see below).

The unsafe method modifier allows the C# compiler to generate
unverifiable code, too, but a quick test showed absolutely no
difference between two benchmark runs -- one unchanged, one with the
"unsafe" modifier on the nested loop test. So apparently the machine
code generated for either flag is identical.


Right, I saw his post now.


You've lost me again. I've compared both IL sequences, and they are
quite different! The unverifiable version has a different calling
convention, a smaller local stack, calls to C++ library methods, and
overall a smaller IL instruction count.

My diff program can't find the identical loop part that you claim
exists because the two IL sequences are so different. Evidently the
/clr:safe switch causes the C++ compiler to generate different IL, not
just to set a different JIT flag.



Chris,

I guess you think both IL functions come from the same C++ code, which
isn't the case. The first comes from the original benchmark (modified to
take a __int64 for x); the second is the same function, modified so that it
can be compiled in safe mode. For completeness I included both functions [1].

I never said the IL for the whole function was the same; I said identical
IL for the "loop part only". Note that I did include the loop start and end
address labels in the attached file; just compare what's in between these
and you will see that the IL is the same (but don't use diff, the address
labels and local variable labels are different).

I know that the calling convention is different (cdecl/clrcall) and I know
that there are calls into unmanaged code like clock() and printf(), but
these don't account for the time it takes to run the loop.

And, applying the 'unsafe' modifier on the loop is a nop; there is nothing
"unsafe" happening inside the loop (unless you modified the code and used
pointer arithmetic), and the JIT is smarter than that. If you really want to
know what's happening, you'll have to attach a native debugger, set a
breakpoint before the call of the nl function, and single-step through the
JIT code (make sure you have the right symbols loaded for mscorwks.dll);
there you will see that a different path is taken for the "mixed mode"
assembly (the one compiled with /clr) based on some metadata flags.

PS. I think we really have to let this thread die, or we must move this to a
private conversation.

Willy.

[C++/CLI /clr]
__int64 nl(int n) {
    double elapsedTime;
    clock_t stopTime;
    clock_t startTime = clock();

    int a, b, c, d, e, f;
    __int64 x = 0;
    //--- start of the loop
    for (a=0; a<n; a++)
        for (b=0; b<n; b++)
            for (c=0; c<n; c++)
                for (d=0; d<n; d++)
                    for (e=0; e<n; e++)
                        for (f=0; f<n; f++)
                            x += a+b+c+d+e+f;
    //--- end of the loop
    stopTime = clock();
    elapsedTime = (stopTime - startTime) / (CLOCKS_PER_SEC / (double) 1000.0);
    printf("Nested Loop elapsed time: %1.0f ms %I64d\n", elapsedTime, x);

    return (__int64)elapsedTime;
}

[C++/CLI /clr:safe]
...
static void nl(int n) {
    int a, b, c, d, e, f;
    __int64 x = 0;
    // use System::Diagnostics::Stopwatch to time the loop
    Stopwatch^ stopWatch = gcnew Stopwatch;
    stopWatch->Start();
    //--- start of the loop
    for (a=0; a<n; a++)
        for (b=0; b<n; b++)
            for (c=0; c<n; c++)
                for (d=0; d<n; d++)
                    for (e=0; e<n; e++)
                        for (f=0; f<n; f++)
                            x += a+b+c+d+e+f;
    //--- end of the loop
    stopWatch->Stop();
    TimeSpan ts = stopWatch->Elapsed;

    System::Console::WriteLine("Nested Loop elapsed time: {0} sec.{1} msec. {2}\n",
        ts.Seconds, ts.Milliseconds, x);
}
 

Christoph Nahr

Willy,

I hang my head in shame -- once I had edited out the different time
measuring prologue/epilogue and deleted the labels, the remaining IL
code was indeed exactly the same. I should have looked more closely,
but my diff was thrown off by the different line labels and marked so
many differences that I thought the IL was significantly different.
And, applying the 'unsafe' modifier on the loop is a nop; there is nothing
"unsafe" happening inside the loop (unless you modified the code and used
pointer arithmetic), and the JIT is smarter than that.

Or dumber! If a simple metadata flag can create faster code then it
sure would be nice if the C# /unsafe option would set that flag...
If you really want to
know what's happening, you'll have to attach a native debugger, set a
breakpoint before the call of the nl function, and single-step through the
JIT code (make sure you have the right symbols loaded for mscorwks.dll);
there you will see that a different path is taken for the "mixed mode"
assembly (the one compiled with /clr) based on some metadata flags.

Very interesting. I think I'll have to experiment a bit and see if
there's a way to force C# to emit those flags.
PS. I think we really have to let this thread die, or we must move this to a
private conversation.

No problem, I believe you now. I'm just stunned that the JIT compiler
would create better code based on some secret flags that C# simply
doesn't emit. :-(
 

Peter Olcott

Anders Borum said:
Hello!


Unless somebody documents what methods have been used to measure the
benchmarks, I am disposing of the data altogether. Benchmarks should be
verifiable ..

I searched high and low and finally found some stuff on unverified code. The
MSDN documentation mentions it yet does not discuss what it means. It could be
anything from security checks to tests for memory leaks. It doesn't say what its
purpose is.
 

Peter Olcott

http://msdn.microsoft.com/msdnmag/issues/04/05/VisualC2005/
The above is an interesting link that deals with the latest 2005 optimizations.
If it is something more than marketing hype my problems may be solved. I already
mentioned why I am posting here. I must find out in advance whether or not my
crucial 100-line function will run fast enough in .NET, and what I can do about
it if it doesn't. The above link makes it very clear which optimizations are
performed, and when they are performed. Like I already said most of them are
performed before the MSIL is generated. Of course the crucial optimization of
register allocation must wait until after the specific machine architecture is
known. In other words it must be done by the JIT.
 

Peter Olcott

Willy Denoyette said:
Christoph Nahr said:
That's an interesting idea but it must be something else (see below).

The unsafe method modifier allows the C# compiler to generate
unverifiable code, too, but a quick test showed absolutely no
difference between two benchmark runs -- one unchanged, one with the
"unsafe" modifier on the nested loop test. So apparently the machine
code generated for either flag is identical.


Right, I saw his post now.


You've lost me again. I've compared both IL sequences, and they are
quite different! The unverifiable version has a different calling
convention, a smaller local stack, calls to C++ library methods, and
overall a smaller IL instruction count.

My diff program can't find the identical loop part that you claim
exists because the two IL sequences are so different. Evidently the
/clr:safe switch causes the C++ compiler to generate different IL, not
just to set a different JIT flag.



Chris,

I guess you think both IL functions come from the same C++ code, which isn't
the case. The first comes from the original benchmark (modified to take a
__int64 for x); the second is the same function, modified so that it can be
compiled in safe mode. For completeness I included both functions [1].

That would explain the 450% performance difference on a machine with a 32-bit
int. It must do the same sort of thing as the old 16-bit machines emulating a
32-bit int: handle it as two pieces.
I never said the IL for the whole function was the same, I said - identical IL
for the "loop part only", Note that I did include the loop start and end
address labels in the attached file, just compare what's in between these and
you will see that the IL is the same (but don't use diff, the address labels
and local variable labels are different).

I know that the calling convention is different (cdecl/clrcall) and I know
that there are calls into unmanaged code like clock() and printf(), but these
don't account for the time it takes to run the loop.

And, applying the 'unsafe' modifier on the loop is a nop; there is nothing
"unsafe" happening inside the loop (unless you modified the code and used
pointer arithmetic), and the JIT is smarter than that. If you really want to know
what's happening, you'll have to attach a native debugger, set a breakpoint
before the call of the nl function, and single-step through the JIT code (make
sure you have the right symbols loaded for mscorwks.dll); there you will see
that a different path is taken for the "mixed mode" assembly (the one compiled
with /clr) based on some metadata flags.

PS. I think we really have to let this thread die, or we must move this to a
private conversation.

Willy.

[C++/CLI /clr]
__int64 nl(int n) {
double elapsedTime;
clock_t stopTime;
clock_t startTime = clock();

int a, b, c, d, e, f;
__int64 x=0;
//--- start of the loop
for (a=0; a<n; a++)
for (b=0; b<n; b++)
for (c=0; c<n; c++)
for (d=0; d<n; d++)
for (e=0; e<n; e++)
for (f=0; f<n; f++)
x+=a+b+c+d+e+f;
// end of the loop
stopTime = clock();
elapsedTime = (stopTime - startTime) / (CLOCKS_PER_SEC / (double) 1000.0);
printf("Nested Loop elapsed time: %1.0f ms %I64d\n", elapsedTime, x);

return (__int64)elapsedTime;
}

[C++/CLI /clr:safe]
..
static void nl(int n) {
int a, b, c, d, e, f;
__int64 x = 0;
// use System::Diagnostics::StopWatch to time the loop
Stopwatch^ stopWatch = gcnew Stopwatch;
stopWatch->Start();
//--- start of the loop
for (a=0; a<n; a++)
for (b=0; b<n; b++)
for (c=0; c<n; c++)
for (d=0; d<n; d++)
for (e=0; e<n; e++)
for (f=0; f<n; f++)
x+=a+b+c+d+e+f;
// end of the loop
stopWatch->Stop();
TimeSpan ts = stopWatch->Elapsed;

System::Console::WriteLine("Nested Loop elapsed time: {0} sec.{1} msec.
{2}\n", ts.Seconds, ts.Milliseconds, x);

}
 

Peter Olcott

Christoph Nahr said:
Willy,

I hang my head in shame -- once I had edited out the different time
measuring prologue/epilogue and deleted the labels, the remaining IL
code was indeed exactly the same. I should have looked more closely,
but my diff was thrown off by the different line labels and marked so
many differences that I thought the IL was significantly different.


Or dumber! If a simple metadata flag can create faster code then it
sure would be nice if the C# /unsafe option would set that flag...


Very interesting. I think I'll have to experiment a bit and see if
there's a way to force C# to emit those flags.


No problem, I believe you now. I'm just stunned that the JIT compiler
would create better code based on some secret flags that C# simply
doesn't emit. :-(

/clr:safe prohibits some of the optimizations that /clr:pure allows.
 

Stefan Simek

Peter said:
I searched high and low and finally found some stuff on unverified code. The
MSDN documentation mentions it yet does not discuss what it means. It could be
anything from security checks to tests for memory leaks. It doesn't say what its
purpose is.
I guess you should google for 'define:verify'. Verifiable benchmarks
have nothing to do with verifiable code. Anders just wanted the
benchmarks to be verifiably *valid*, which they aren't by any means. A
nested loop written the way it is in the benchmark is measuring nothing
but a compiler's ability to optimize nested loops that do more or less
nothing. Verifiable code means (in the .NET vocabulary) that the CLR can
statically verify the code to ensure it will do nothing it is not
expected to do. But enough of that, let's see what native assembly the
compilers generate for the loop and get it over with.

<...>

Ha! I won't post the complete disassemblies (the C++ one is terribly
cryptic, so it would be no help), but I've found the following. The C++
compiler (8.0 in my case, but I suspect 6.0 is doing the same) manages
to pre-cache the additions in the outer loops, which the C# compiler
doesn't. Thus, in the C# code all the additions that happen at

x += a+b+c+d+e+f;

are evaluated for every single innermost loop, while C++ does something
like the following:

for (a = 0; a < n; a++)
{
    for (b = 0, tmp_ab = a; b < n; b++, tmp_ab++)
    {
        for (c = 0, tmp_abc = tmp_ab; c < n; c++, tmp_abc++)
        {
            for (d = 0, tmp_abcd = tmp_abc; d < n; d++, tmp_abcd++)
            {
                for (e = 0, tmp_abcde = tmp_abcd; e < n; e++, tmp_abcde++)
                {
                    for (f = 0, tmp = tmp_abcde; f < n; f++, tmp++)
                        x += tmp;
                }
            }
        }
    }
}

If you compile this code in C#, the execution time is 6734 ms for .NET
2.0 and 8593 ms for .NET 1.1 versus 3656 ms for unmanaged C++ (8.0) on
my machine. But note that the innermost loop in the C++ is only four
instructions. This can hardly be matched to any real-life algorithm, and
I can assure you that if there was anything more than 'nothing' in the
inner loop, the performance difference would be much less than 80% (I
expect it to drop far below 10% for most real-life algorithms).

That said, I wouldn't expect C# (or .NET) to match pure native code
performance for purely computational tasks like the one you describe.
C# can win easily in cases where a lot of dynamic allocation is
involved, etc., but it will probably never outperform optimized native
C++ (or handwritten assembly) for computational tasks. If you need to
squeeze every clock of your CPU, you will probably get best results
using an optimizing compiler targeted for exactly the CPU the code will
run on.

I didn't want to write all this, but after reading through the long
discussions on this matter that never touched the important points, I
just decided to do so :)

Just my 2c.

Stefan
 

Peter Olcott

Stefan Simek said:
I guess you should google for 'define:verify'. Verifiable benchmarks have
nothing to do with verifiable code. Anders just wanted the benchmarks to be
verifiably *valid*, which they aren't by any means. A nested loop written the
way it is in the benchmark is measuring nothing but a compiler's ability to
optimize nested loops that do more or less nothing. Verifiable code means (in
the .NET vocabulary) that the CLR can statically verify the code to ensure it
will do nothing it is not expected to do. But enough of that, let's see what
native assembly the compilers generate for the loop and get it over with.

Yet I just found a Microsoft article that says that this sort of technology is
between 20 and 50 years away. Code is not expected to be incorrect. Proving that
code is not incorrect is technology that does not yet exist. I don't know what it
is verifying, but correct code is not it. The posted benchmark was crucial to
my needs because my critical 100-line function is mostly loops.

<...>

ha! I won't post the complete disassemblies (the C++ one is terribly cryptic
so it would do no help), but I've found the following. The C++ compiler (8.0
in my case, but I suspect 6.0 is doing the same) manages to pre-cache the
additions in the outer loops, which the C# compiler doesn't. Thus, in the C#
code all the additions that happen at

x+=a+b+c+d+e+f;

are evaluated for every single innermost loop, while c++ does something like
the following:

for (a = 0; a < n; a++)
{
for (b = 0, tmp_ab = a; b < n; b++, tmp_ab++)
{
for (c = 0, tmp_abc = tmp_ab; c < n; c++, tmp_abc++)
{
for (d = 0, tmp_abcd = tmp_abc; d < n; d++, tmp_abcd++)
{
for (e = 0, tmp_abcde = tmp_abcd; e < n; e++, tmp_abcde++)
{
for (f = 0, tmp = tmp_abcde; f < n; f++, tmp++)
x += tmp;
}
}
}
}
}

If you compile this code in C#, the execution time is 6734 ms for .NET 2.0 and
8593 ms for .NET 1.1 versus 3656 ms for unmanaged C++ (8.0) on my machine. But
note that the innermost loop in the C++ is only four instructions. This can
hardly be matched to any real-life algorithm, and I can assure you that if
there was anything more than 'nothing' in the inner loop, the performance
difference would be much less than 80% (I expect it to drop far below 10% for
most real-life algorithms).

Since this is not the same code that was posted in the benchmark, it is not a
valid test. If you are worried about arithmetic overflow, then define x as a
double. I want to know why there was a 450% difference in the original
benchmark. When you change the code, this avoids rather than answers the
question. It is reported that 2005 does a much better job of optimizing
.NET code, yet only with the C++ compiler, not the C# compiler.
That said, I wouldn't expect C# (or .NET) to match pure native code
performance for purely computational tasks like the one you describe. C#
can win easily in cases where a lot of dynamic allocation is involved, etc.,
but it will probably never outperform optimized native C++ (or handwritten
assembly) for computational tasks. If you need to squeeze every clock of your
CPU, you will probably get best results using an optimizing compiler targeted
for exactly the CPU the code will run on.

There is no reason (in theory) why .NET code would need to be any slower than
native code. The reason in practice seems to be that Microsoft just hasn't
gotten around to implementing every possible optimization in every language
yet. The one tricky thing about optimization in .NET is that it is divided into
two different processes: some of the optimizations at the compile step can
interfere with some of the optimizations at the JIT step. The problem is that
the compile step does not know how many registers the target architecture has.
 

Willy Denoyette [MVP]

Stefan Simek said:
I guess you should google for 'define:verify'. Verifiable benchmarks have
nothing to do with verifiable code. Anders just wanted the benchmarks to
be verifiably *valid*, which they aren't by any means. A nested loop
written the way it is in the benchmark is measuring nothing but a
compiler's ability to optimize nested loops that do more or less nothing.
Verifiable code means (in the .NET vocabulary) that the CLR can statically
verify the code to ensure it will do nothing it is not expected to do. But
enough of that, let's see what native assembly the compilers generate for
the loop and get over with it.

<...>

ha! I won't post the complete disassemblies (the C++ one is terribly
cryptic so it would do no help), but I've found the following. The C++
compiler (8.0 in my case, but I suspect 6.0 is doing the same) manages to
pre-cache the additions in the outer loops, which the C# compiler doesn't.
Thus, in the C# code all the additions that happen at

x+=a+b+c+d+e+f;

are evaluated for every single innermost loop, while c++ does something
like the following:

for (a = 0; a < n; a++)
{
for (b = 0, tmp_ab = a; b < n; b++, tmp_ab++)
{
for (c = 0, tmp_abc = tmp_ab; c < n; c++, tmp_abc++)
{
for (d = 0, tmp_abcd = tmp_abc; d < n; d++, tmp_abcd++)
{
for (e = 0, tmp_abcde = tmp_abcd; e < n; e++, tmp_abcde++)
{
for (f = 0, tmp = tmp_abcde; f < n; f++, tmp++)
x += tmp;
}
}
}
}
}

If you compile this code in C#, the execution time is 6734 ms for .NET 2.0
and 8593 ms for .NET 1.1 versus 3656 ms for unmanaged C++ (8.0) on my
machine. But note that the innermost loop in the C++ is only four
instructions. This can hardly be matched to any real-life algorithm, and I
can assure you that if there was anything more than 'nothing' in the inner
loop, the performance difference would be much less than 80% (I expect it
to drop far below 10% for most real-life algorithms).



The snip you posted is meaningless without the value of n and the types of
the local variables, so please post the whole function.


Willy.
 
