The original benchmark included the overflows. Your modification to
prevent the overflow makes it invalid, IMO. Overflows can sometimes be a
desired effect, for example when calculating checksums. Changing the type
of x to long introduces another problem - 64-bit addition, which is
non-trivial on 32-bit platforms.
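To make the checked/unchecked distinction concrete, here is a minimal C# sketch (not taken from the benchmark) showing both behaviors side by side -- the wrap-around that a checksum relies on, and the exception you get under checked arithmetic:

```csharp
using System;

class OverflowDemo
{
    static void Main()
    {
        int big = int.MaxValue;

        // Unchecked: the add wraps around (two's complement), which is
        // exactly the behavior a wrapping checksum depends on.
        int wrapped = unchecked(big + 1);
        Console.WriteLine(wrapped);          // prints -2147483648

        // Checked: the very same add raises OverflowException instead.
        try
        {
            int boom = checked(big + 1);
            Console.WriteLine(boom);
        }
        catch (OverflowException)
        {
            Console.WriteLine("overflow detected");
        }
    }
}
```

By default (without /checked+) C# integer arithmetic is unchecked, which is why the original benchmark overflows silently instead of throwing.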
A desired effect, right, but not in a benchmark that was meant to measure
nested loop performance. The hit taken by the 64-bit add is largely
compensated for by the CPU's microcode no longer having to handle the
overflow that occurs in the original code.
Note also that the function using a long has 3 more instructions than the
int version (11 vs. 14), but this doesn't say that much about the
performance deltas on super-scalar CPUs like the P4 and AMD.
The result of the calculation is not important in the benchmark, IMO. What
is important is the time the calculation takes to complete.
Without correct results? Why perform a calculation that yields a plainly
wrong result? I'm sorry, but I disagree.
The performance hit of adding to a 64-bit integer should be much higher than
any hit caused by an overflow that happens only once in a while.
Do you have any idea of the number of overflows? Did you run the code with
/checked while counting the number of exceptions? I guess not; it would take
a considerable time (minutes? hours?) to do so.
Not sure of that, but not sure it isn't true either... But arithmetic
overflows are desirable in many cases anyway.
I still think that the overflow is desired in this benchmark, so I would
wrap the *whole* benchmarking code in an unchecked { } block. /checked+
hits performance quite a lot.
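A minimal sketch of what wrapping the benchmarking code in an unchecked block could look like. The loop bounds and body here are made up for illustration; the thread's actual benchmark source is not shown:

```csharp
using System;
using System.Diagnostics;

class Bench
{
    static void Main()
    {
        Stopwatch sw = Stopwatch.StartNew();

        int x = 0;
        // The unchecked statement overrides /checked+ for everything inside
        // it, so the adds wrap silently instead of raising exceptions.
        unchecked
        {
            for (int a = 0; a < 1000; a++)
                for (int b = 0; b < 1000; b++)
                    x += a * b;   // the running sum may wrap; that wrap is
                                  // treated as part of the workload
        }

        sw.Stop();
        Console.WriteLine("Nested Loop elapsed time: {0} ms - {1}",
                          sw.ElapsedMilliseconds, x);
    }
}
```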
If it were desired, it should have been mentioned in the benchmark text, and
as far as I can see, it's not.
IMO it's just an oversight.
Interesting, when I change the x to long, I get 25437 ms execution time.
This clearly shows the difference between 32 bit and 64 bit arithmetic.
Are you sure you don't have an x64 machine? That would explain the
difference between our results perfectly.
No, I'm running on 32-bit HW. Why would a 32-bit ADD be slower than a 64-bit
ADD on x64 anyway?
I ran the benchmark on PIII, P4 (above results) Pentium M and to please you
also on AMD64.
To illustrate what happens, consider the following results, run on a Pentium
M 1.6GHz, XP SP2 - Framework 2.0.50727. I took a slower machine to better
illustrate the differences; note that running on a PIII, while slower, shows
the same pattern.
---------- original code --------
int o-
Nested Loop elapsed time: 32136 ms - -1804337152 108%
int o+
Nested Loop elapsed time: 29772 ms - -1804337152 reference value 100%
long o-
Nested Loop elapsed time: 24595 ms - 479232000000 83%
long o+
Nested Loop elapsed time: 20789 ms - 479232000000 70%
--------- optimized loop ---------
int o-
Nested Loop elapsed time: 17204 ms - -1804337152 58% 107%
int o+
Nested Loop elapsed time: 16073 ms - -1804337152 54% 100%
long o-
Nested Loop elapsed time: 21230 ms - 479232000000 71% 132%
long o+
Nested Loop elapsed time: 20219 ms - 479232000000 68% 126%
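As a sanity check on the tables above (a small C# sketch, not part of the benchmark): the two result columns are mutually consistent, because the int figure is simply the long figure truncated to its low 32 bits:

```csharp
using System;

class WrapCheck
{
    static void Main()
    {
        long fullSum = 479232000000L;             // the long result column

        // Keep only the low 32 bits and reinterpret them as a signed int;
        // this reproduces the overflowed int result column.
        int truncated = unchecked((int)fullSum);

        Console.WriteLine(truncated);             // prints -1804337152
    }
}
```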
As you can see, the original benchmark (see the reference value) is slower
than the same code using a long result. That means that using a long in the
"original code" loop compensates for the overflow hit.
Inspecting the JIT-generated code for the inner loop gives 11 instructions
for the int result version vs. 14 instructions for the long result version,
so theoretically the long version should be slower (but it isn't). NOTE:
your mileage may vary depending on HW and operational conditions (beware of
CPU clock throttling!).
Your hand-"optimized loop" code is in general faster than the original code,
but it slows down when using a long as the result.
Your modified loop compensates for the overflow hit by using a better
algorithm at the source code level; that is, THE NUMBER of X86 ADDs
resulting from your algorithm is lower than in the original code's inner
loop. Because of this, the result field can be register-allocated, while in
the original code, the result must be loaded from the stack before the
addition (x += ....) and saved back to the stack when done. This takes a
huge hit. (Note that your inner loop is 12 instructions long!)
Your code using a long takes a small hit (2 X86 instructions) because of the
64-bit integer addition on 32-bit HW, but this hit is much smaller than what
the figures suggest. What happens is that the JIT generates less efficient
code for the whole function (especially the inner loop) when using a long
result (note that this isn't the case with the original code). The register
(eax) can no longer be used to store and hold the result field, so now your
code takes the load/store from the stack just like the original code (both
for int and long).
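For illustration only -- the thread's actual benchmark source is not shown, so the following C# sketch is a hypothetical reconstruction of the two loop shapes being discussed: one that updates the accumulator in the innermost loop on every iteration, and one that hoists the inner sum out, so far fewer ADDs touch the accumulator and the JIT has an easier time keeping it in a register:

```csharp
using System;

class LoopShapes
{
    // Hypothetical "original" shape: the accumulator is written on every
    // innermost iteration, i.e. n^3 times.
    static int Original(int n)
    {
        int x = 0;
        for (int a = 0; a < n; a++)
            for (int b = 0; b < n; b++)
                for (int c = 0; c < n; c++)
                    x += c;
        return x;
    }

    // Hypothetical "optimized" shape: the inner sum is computed once, so
    // the accumulator is written only n^2 times, yet the result is the same.
    static int Optimized(int n)
    {
        int inner = 0;
        for (int c = 0; c < n; c++)
            inner += c;

        int x = 0;
        for (int a = 0; a < n; a++)
            for (int b = 0; b < n; b++)
                x += inner;
        return x;
    }

    static void Main()
    {
        Console.WriteLine(Original(10));    // prints 4500
        Console.WriteLine(Optimized(10));   // prints 4500
    }
}
```

Both shapes produce the same value; only the number of accumulator updates differs, which is the point being made about register allocation above.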
I hope this clears things up a bit. Anyway, I have no ambition to spend more
time on this, nor to start another argument. If you really want to know
what's happening, run this silly thing under a native debugger and watch
the JIT-created code, and you will understand.
PS. I ran this on X64 AMD64 VISTA build 5270 with v2.0.50727.52 (note 52 is
the latest framework build for Vista)
Original code using int resp. long as result
Nested Loop elapsed time: 16370 ms - -1804337152
Nested Loop elapsed time: 10053 ms - 479232000000
Hand-tweaked code using int resp. long as result
Nested Loop elapsed time: 16292 ms - -1804337152
Nested Loop elapsed time: 10194 ms - 479232000000
Notice that this is beta code, so this further reduces the value of the
results obtained.
Willy.