C++/CLI the fastest compiler? Yes, at least for me. :-)


Don Kim

Ok, so I posted a rant earlier about the lack of marketing for C++/CLI,
and it forked over into another rant about which was the faster
compiler. Some said C# was just as fast as C++/CLI, whereas others said
C++/CLI was more optimized.

Anyway, I wrote up some very simple test code, and at least on my
computer C++/CLI came out the fastest. Here's the sample code, and just
for good measure I wrote one in Java, and it was the slowest! ;-) Also,
I used no optimizing compiler switches, and I compiled the C++/CLI with
/clr:safe only, to compile to pure verifiable .NET.

//C++/CLI code
using namespace System;

int main()
{
    long start = Environment::TickCount;
    for (int i = 0; i < 10000000; ++i) {}
    long end = Environment::TickCount;
    Console::WriteLine(end - start);
}


//C# code
using System;

public class ForLoopTest
{
    public static void Main(string[] args)
    {
        long start = Environment.TickCount;
        for (int i = 0; i < 10000000; ++i) {}
        long end = Environment.TickCount;
        Console.WriteLine(end - start);
    }
}

//Java code
public class Performance
{
    public static void main(String[] args)
    {
        long start = System.currentTimeMillis();
        for (int i = 0; i < 10000000; ++i) {}
        long end = System.currentTimeMillis();
        System.out.println(end - start);
    }
}

Results:

C++/CLI -> 15-18 secs
C# -> 31-48 secs
Java -> 65-72 secs

I know, I know, these kinds of tests are not always foolproof, and results
can vary from computer to computer, but at least on my system, C++/CLI had
the fastest results.

Maybe C++/CLI is the most optimized compiler?

-Don Kim
 

Carl Daniel [VC++ MVP]

Don said:
C++/CLI -> 15-18 secs
C# -> 31-48 secs
Java -> 65-72 secs

I know, I know, these kinds of tests are not always foolproof, and
results can vary from computer to computer, but at least on my system,
C++/CLI had the fastest results.

Maybe C++/CLI is the most optimized compiler?

After increasing the length of the loops by a factor of 100, I see about a
2X speed advantage for C++/CLI as well. Looking at the IL produced by the
two compilers for the respective main functions:

C++:

.method assembly static int32 main() cil managed
{
// Code size 40 (0x28)
.maxstack 2
.locals (int32 V_0,
int32 V_1,
int32 V_2)
IL_0000: call int32 [mscorlib]System.Environment::get_TickCount()
IL_0005: stloc.2
IL_0006: ldc.i4.0
IL_0007: stloc.0
IL_0008: br.s IL_000e
// start of loop
IL_000a: ldloc.0
IL_000b: ldc.i4.1
IL_000c: add
IL_000d: stloc.0
IL_000e: ldloc.0
IL_000f: ldc.i4 0x3b9aca00
IL_0014: bge.s IL_0018
IL_0016: br.s IL_000a
// end of loop
IL_0018: call int32 [mscorlib]System.Environment::get_TickCount()
IL_001d: stloc.1
IL_001e: ldloc.1
IL_001f: ldloc.2
IL_0020: sub
IL_0021: call void [mscorlib]System.Console::WriteLine(int32)
IL_0026: ldc.i4.0
IL_0027: ret
} // end of method 'Global Functions'::main

C#:

.method public hidebysig static void Main(string[] args) cil managed
{
.entrypoint
// Code size 47 (0x2f)
.maxstack 2
.locals init (int64 V_0,
int32 V_1,
int64 V_2,
bool V_3)
IL_0000: nop
IL_0001: call int32 [mscorlib]System.Environment::get_TickCount()
IL_0006: conv.i8
IL_0007: stloc.0
IL_0008: ldc.i4.0
IL_0009: stloc.1
IL_000a: br.s IL_0012
// start of loop
IL_000c: nop
IL_000d: nop
IL_000e: ldloc.1
IL_000f: ldc.i4.1
IL_0010: add
IL_0011: stloc.1
IL_0012: ldloc.1
IL_0013: ldc.i4 0x3b9aca00
IL_0018: clt
IL_001a: stloc.3
IL_001b: ldloc.3
IL_001c: brtrue.s IL_000c
// end of loop
IL_001e: call int32 [mscorlib]System.Environment::get_TickCount()
IL_0023: conv.i8
IL_0024: stloc.2
IL_0025: ldloc.2
IL_0026: ldloc.0
IL_0027: sub
IL_0028: call void [mscorlib]System.Console::WriteLine(int64)
IL_002d: nop
IL_002e: ret
} // end of method ForLoopTest::Main


The C++ compiler did generate more optimized IL. It's surprising to me that
the JIT didn't do a better job of optimizing the C#-produced code.

Note that the C# code converted the time to a 64 bit value (C#'s long is 64
bits, while C++'s long is 32 bits), but that occurred outside the loop so it
should have next to no impact on the overall speed of the code.
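
(An aside, not from the thread: Environment.TickCount is an Int32, so declaring the C# locals as int would avoid the conv.i8 entirely and keep the two listings type-for-type comparable. A minimal sketch:)

// C# sketch - assumes int locals instead of long
int start = Environment.TickCount;
for (int i = 0; i < 10000000; ++i) {}
int end = Environment.TickCount;
Console.WriteLine(end - start);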

-cd
 

Jochen Kalmbach [MVP]

Hi Carl!
The C++ compiler did generate more optimized IL. It's surprising to me that
the JIT didn't do a better job of optimizing the C#-produced code.

Wasn't there a statement that the JIT for .NET 2.0 doesn't do
optimizations (only simple optimizations)?
I just remember a blog entry from someone at blogs.msdn.com... but I
couldn't find it anymore...
--
Greetings
Jochen

My blog about Win32 and .NET
http://blog.kalmbachnet.de/
 

Jochen Kalmbach [MVP]

The C++ compiler did generate more optimized IL. It's surprising to me that
the JIT didn't do a better job of optimizing the C#-produced code.
Wasn't there a statement that the JIT for .NET 2.0 doesn't do
optimizations (only simple optimizations)?
I just remember a blog entry from someone at blogs.msdn.com... but I
couldn't find it anymore...

Currently I could only find confirmation of the "missing"
optimization for the CF. But I thought the same was true for the
"desktop" framework...

http://blogs.msdn.com/stevenpr/archive/2005/12/12/502978.aspx

<quote>
Because the CLR can throw away native code under memory pressure or when
an application moves to the background, it is quite possible that the
same IL code may need to be jit compiled again when the application
continues running. This fact leads to our second major jit compiler
design decision: the time it takes to compile IL code often takes
precedence over the quality of the resulting native code. As with all
good compilers, the Compact Framework jit compiler does some basic
optimizations, but because of the need to regenerate code quickly in
order for applications to remain responsive, more extensive
optimizations generally take a back seat to sheer compilation speed.
</quote>

--
Greetings
Jochen

My blog about Win32 and .NET
http://blog.kalmbachnet.de/
 

Andre Kaufmann

Jochen said:
Currently I could only find confirmation of the "missing"
optimization for the CF. But I thought the same was true for the
"desktop" framework...

http://blogs.msdn.com/stevenpr/archive/2005/12/12/502978.aspx

<quote>
Because the CLR can throw away native code under memory pressure or when
an application moves to the background, it is quite possible that the
same IL code may need to be jit compiled again when the application
continues running. This fact leads to our second major jit compiler
design decision: the time it takes to compile IL code often takes
precedence over the quality of the resulting native code. As with all
good compilers, the Compact Framework jit compiler does some basic
optimizations, but because of the need to regenerate code quickly in
order for applications to remain responsive, more extensive
optimizations generally take a back seat to sheer compilation speed.
</quote>

That may be true. But I wonder why there can't be both:
a fast IL compiler and one that is slower but optimizes much better. E.g.
"ngen" could have a command-line switch to generate more optimized code.

Andre
 

Eugene Gershnik

Don said:
I did no optimizing compiler switches

[...]

Then the test is meaningless. If you don't ask the compiler to optimize, why
should it spend any effort on making your code fast?

[I don't have any stake in C++/CLI, C# or Java -- they can all die as far as
I am concerned -- my objection as an outsider is only about how you tested.]
 

Shawn B.

Wasn't there a statement that the JIT for .NET 2.0 doesn't do
optimizations (only simple optimizations)?
I just remember a blog entry from someone at blogs.msdn.com... but I
couldn't find it anymore...

No, but there is a recent thread in this group where some MVPs insist that
the C++ compiler doesn't produce optimized IL and generates roughly what the
C# compiler does, despite the fact that your test, some VC++ devs,
publications, and my own internal software production have shown that the
C++/CLI compiler produces the best-optimized IL of the MS stack. That said,
the same MVP insists that some MS employees have stated that the C++/CLI
compiler leaves all the optimization to the JIT rather than optimizing in
the front end.


Thanks,
Shawn
 

Don Kim

Eugene said:
Then the test is meaningless. If you don't ask the compiler to optimize, why
should it spend any effort on making your code fast?

That was the whole point. If I were to use optimizing options, there
would invariably be arguments that I did not use the correct ones, did
not use them in the proper order, that certain compiler switches are not
equivalent, etc., etc. Therefore, I compiled as is, without any options, to
see how each compiler would compile on its own. I also made the test as
simple as possible so as to time how each compiler internally optimizes
a straight iteration of a common for loop.

In this case, it seems C++/CLI is the fastest in the managed Windows
environment.

-Don Kim
 

Jochen Kalmbach [MVP]

Hi Shawn!
No, but there is a recent thread in this group where some MVPs insist that
the C++ compiler doesn't produce optimized IL and generates roughly what the
C# compiler does, despite the fact that your test, some VC++ devs,
publications, and my own internal software production have shown that the
C++/CLI compiler produces the best-optimized IL of the MS stack. That said,
the same MVP insists that some MS employees have stated that the C++/CLI
compiler leaves all the optimization to the JIT rather than optimizing in
the front end.

Really?
I thought the C++/CLI compiler does not care what code it is generating.
It always tries to optimize the "pseudo-code".

Nevertheless... I have found neither documentation saying that the JIT
compiler does optimization, nor documentation saying that it does not...

--
Greetings
Jochen

My blog about Win32 and .NET
http://blog.kalmbachnet.de/
 

Andre Kaufmann

Shawn said:
No, but there is a recent thread in this group where some MVPs insist that
the C++ compiler doesn't produce optimized IL and generates roughly what the

You mean the sample where W.D. [MVP] gives an example claiming that the
C++/CLI compiler doesn't do global optimization on IL code?

It does. IMHO the example is wrong. If I interpret the given example
correctly, it's based on a call to an external DLL. So the C++/CLI
compiler must do an optimization over DLL boundaries?! Since the DLL is
loaded dynamically, how should the C++/CLI compiler do any optimization?

Why should the C++/CLI compiler not optimize the code? I don't know how
the C++/CLI compiler is implemented, but I assume that the code
generation of native or CLI code is done by optimizing the generated
intermediate code, before native or managed code is generated. So that
(nearly) the same optimizer is used for "native code compiled to IL
code" and "native x86 code". If my assumption is true, it would be plain
nonsense to revert an optimization that has already been done.
C# compiler does, despite the fact that your test, some VC++ devs,
publications, and my own internal software production have shown that the
C++/CLI compiler produces the best-optimized IL of the MS stack. That said,
the same MVP insists that some MS employees have stated that the C++/CLI
compiler leaves all the optimization to the JIT rather than optimizing in
the front end.

If he gives a valid link to the statements, I will believe it. Which
doesn't mean that the statements are true.
Thanks,
Shawn

Andre
 

Carl Daniel [VC++ MVP]

Andre said:
Why should the C++/CLI compiler not optimize the code? I don't know
how the C++/CLI compiler is implemented, but I assume that the code
generation of native or CLI code is done by optimizing the generated
intermediate code, before native or managed code is generated. So that
(nearly) the same optimizer is used for "native code compiled to IL
code" and "native x86 code". If my assumption is true, it would be
plain nonsense to revert an optimization that has already been done.

That is indeed the case. There's a single front end for both native and
managed code. That front end produces CIL ('C' Intermediate Language), which
is then fed to the back end. The back end consists of target-independent
parts (e.g. CIL optimizations) and target-dependent parts (e.g. code
generation).

-cd
 

Willy Denoyette [MVP]

| Ok, so I posted a rant earlier about the lack of marketing for C++/CLI,
| and it forked over into another rant about which was the faster
| compiler. Some said C# was just as fast as C++/CLI, whereas others said
| C++/CLI was more optimized.
|
| Anyway, I wrote up some very simple test code, and at least on my
| computer C++/CLI came out the fastest. Here's the sample code, and just
| for good measure I wrote one in java, and it was the slowest! ;-) Also,
| I did no optimizing compiler switches and compiled the C++/CLI with
| /clr:safe only to compile to pure verifiable .net.
|
| //C++/CLI code
| using namespace System;
|
| int main()
| {
| long start = Environment::TickCount;
| for (int i = 0; i < 10000000; ++i) {}
| long end = Environment::TickCount;
| Console::WriteLine(end - start);
| }
|
|
| //C# code
| using System;
|
| public class ForLoopTest
| {
| public static void Main(string[] args)
| {
| long start = Environment.TickCount;
| for (int i =0;i < 10000000; ++i) {}
| long end = Environment.TickCount;
| Console.WriteLine((end-start));
| }
| }
|
| //Java code
| public class Performance
| {
| public static void main(String args[])
| {
| long start = System.currentTimeMillis();
| for (int i=0; i < 10000000; ++i) {}
| long end = System.currentTimeMillis();
| System.out.println(end-start);
| }
| }
|
| Results:
|
| C++/CLI -> 15-18 secs
| C# -> 31-48 secs
| Java -> 65-72 secs
|
| I know, I know, these kind of test are not always foolproof, and results
| can vary by computer to computer, but at least on my system, C++/CLI had
| the fastest results.
|
| Maybe C++/CLI is the most optimized compiler?
|
| -Don Kim

Such a micro benchmark has little value; an empty loop will be hoisted in
optimized builds (you ain't gonna do this in real code, are you?).
More important is the way you measure execution time here: it is wrong. The
reason for this is that Environment.TickCount is updated with the real-time
clock tick. That is every 10 msec or 15.6 msec or higher, depending on the
CPU type (Intel, AMD, variants...). For instance, an AMD64 ticks at an
interval of 15.5 msec, most Intel-based systems have an interval of 10 msec,
and most SMP systems tick at 20 msec or higher.
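
(An aside, not from the thread: the effective TickCount update interval on a given machine can be observed directly with a few lines of C# that spin until the value changes - a rough sketch:)

// C# sketch: measure the Environment.TickCount update interval
int t0 = Environment.TickCount;
while (Environment.TickCount == t0) { }   // wait for the first transition
int t1 = Environment.TickCount;
while (Environment.TickCount == t1) { }   // wait for the next transition
int t2 = Environment.TickCount;
Console.WriteLine("TickCount interval ~ {0} msec", t2 - t1);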

To get accurate results you need to use the high performance counters or the
Stopwatch class in V2.
Here is the adapted code:

// C# code
// csc /o- bcs.cs
using System;
using System.Diagnostics;

public class ForLoopTest
{
    public static void Main(string[] args)
    {
        long nanosecPerTick = (1000L * 1000L * 1000L) / Stopwatch.Frequency;

        Stopwatch sw = new Stopwatch();
        sw.Start();
        for (int i = 0; i < 10000000; ++i) {}
        sw.Stop();
        long ticks = sw.Elapsed.Ticks;
        Console.WriteLine("{0} nanoseconds", ticks * nanosecPerTick);
    }
}

// C++/CLI code
// cl /CLR:safe /Od bcc.cpp
#using <System.dll>
using namespace System;
using namespace System::Diagnostics;

int main()
{
    Int64 nanosecPerTick = (1000L * 1000L * 1000L) /
        System::Diagnostics::Stopwatch::Frequency;

    Stopwatch^ sw = gcnew Stopwatch;
    sw->Start();
    for (int i = 0; i < 10000000; ++i) {}
    sw->Stop();
    Int64 ticks = sw->Elapsed.Ticks;
    Console::WriteLine("{0} nanoseconds", ticks * nanosecPerTick);
}


On my system, the above code, built with the command lines given in the
source comments (both non-optimized), shows the following results:

C#
37714104 nanoseconds
C++/CLI
37389069 nanoseconds

That means both are equally fast, but again this means nothing; such micro
benchmarks have no value.

Note that an optimized C++ build will hoist the empty loop (it removes it
completely from the IL). This kind of hoisting is not done by the C# compiler,
and there is a reason for it.
That doesn't mean there is no loop optimization; it's just done at the JIT
level!
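
(An aside, not from the thread: if the goal is a loop that neither the compilers nor the JIT may legally remove, give it an observable side effect - a minimal C# sketch:)

// C# sketch: the loop feeds a result that is used afterwards,
// so it cannot be hoisted away even in optimized builds
long sum = 0;
for (int i = 0; i < 10000000; ++i)
{
    sum += i;
}
Console.WriteLine(sum);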

Willy.
 

Carl Daniel [VC++ MVP]

Willy said:
wrong. The reason for this is that Environment.TickCount is updated
with the real-time clock tick. That is every 10 msec or 15.6 msec or
higher, depending on the CPU type (Intel, AMD, variants...). For
instance, an AMD64 ticks at an interval of 15.5 msec, most Intel-based
systems have an interval of 10 msec, and most SMP systems tick at
20 msec or higher.

To get accurate results you need to use the high performance counters
or the Stopwatch class in V2.

On my system using above code and the command line arguments as
specified in the source (both non optimized) show following results:

C#
37714104 nanoseconds
C++/CLI
37389069 nanoseconds

Interesting. I took Don's sample and increased the loop count by a factor of
100 and consistently got execution times of about 530ms for the C++ code and
1200ms for the C# code.

Granted, the resolution of GetTickCount is poor - but that's a large enough
difference to be significant.

Your results are actually much closer to what I expected - nearly identical
performance, but I can't see why replacing GetTickCount with StopWatch would
have any effect other than to increase the resolution of the time
measurement.

But... here's what I found with your examples: First, I changed both to
calculate nanosecPerTick as a double instead of a long - on a system with a
tick rate higher than 1 GHz, your calculation results in 0 all the time.
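
(Reconstructing the change described above - an assumption, not the exact code: with Stopwatch.Frequency above 1 GHz, the integer division (1000L * 1000L * 1000L) / Frequency truncates to 0, so the ratio has to be computed in floating point:)

// C# sketch of the modified calculation
double nanosecPerTick = (1000.0 * 1000.0 * 1000.0) / Stopwatch.Frequency;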

With that change, I get a time of 15.8us for the C++ code and 42.3us for
the C# code - about the same difference I saw with GetTickCount.

It seems that there's something significantly different about your machine
as compared to mine & Don's when it comes to the performance of this code -
and that is very interesting!

What's your machine hardware? I'm running on a 3Ghz P4 with 1GB of RAM
under XP SP2. I'm suspicious of your times (and mine as well) as I doubt my
machine is 2000 times faster than yours.

-cd
 

Willy Denoyette [MVP]

"Carl Daniel [VC++ MVP]" <[email protected]>
wrote in message | Willy Denoyette [MVP] wrote:
| > wrong. The reason or this is that Environment.TickCount is updated
| > with the real time clock tick. That is every 10 msec or 15,6 msec or
| > higher, depending on the CPU type (Intel AMD, variants...). For
| > instance an AMD 64 ticks at an interval of 15.5 msec, most intel
| > based systems have an interval of 10msec, most SMP systems tick at
| > 20msec or higher.
| >
| > To get accurate results you need to use the high performance counters
| > or the Stopwatch class in V2.
| >
| > On my system using above code and the command line arguments as
| > specified in the source (both non optimized) show following results:
| >
| > C#
| > 37714104 nanoseconds
| > C++/CLI
| > 37389069 nanoseconds
|
| Interesting. I took Don' sample and increased the loop count by a factor
of
| 100 and consistently got execution times of about 530ms for the C++ code
and
| 1200ms for the C# code.
|

Are you sure you compiled non-optimized (/Od and /o-)? As I said, the C++
compiler will hoist the loop when optimization is on (O1, O2 or whatever).

| Granted, the resolution of GetTickCount is poor - but that's a large
enough
| difference to be significant.
|

It's not the resolution, it's the interval that is the culprit.

| Your results are actually much closer to what I expected - nearly
identical
| performance, but I can't see why replacing GetTickCount with StopWatch
would
| have any effect other than to increase the resolution of the time
| measurement.
|

Stopwatch uses the QueryPerformanceCounter and QueryPerformanceFrequency
high resolution counters of the OS.

| But... here's what I found with your examples: First, I changed both to
| calculate nanosecPerTick as a double instead of a long - on a system with
a
| tick rate higher than 1Ghz, your calcuation results in 0 all the time.
|

That's very surprising; QueryPerformanceFrequency (Stopwatch.Frequency)
should not be that high. Notice that this Frequency is not the CPU clock
frequency, it's the output of a CPU clock divider; its frequency is much
lower, on my system it's 3579545MHz (try with:
Console::WriteLine(System::Diagnostics::Stopwatch::Frequency);)
If on your system it's much higher than 1GHz, you might have an issue with
your system.


| With that change, I get a time of 15.8us for the C++ code and 42.3us for
| the C# code - about the same difference I saw with GetTickCount.
|

Hmmm, 15.8 µsec. for 10000000 loops in which you execute 6 instructions
[1] per loop, that would mean 60000000 instructions in 15.8 µsec or 0.000263
nanosecs/instruction, or ~4.000.000.000.000 instructions/sec. - not possible
really, looks like the loop is hoisted or your clock is broken ;-).


| It seems that there's something significantly different about your machine
| as compared to mine & Don's when it comes to the performance of this
code -
| and that is very interesting!

Looks like you have to investigate the Frequency value returned first, and
inspect your code.

|
| What's your machine hardware? I'm running on a 3Ghz P4 with 1GB of RAM
| under XP SP2. I'm suspicious of your times (and mine as well) as I doubt
my
| machine is 2000 times faster than yours.
|

I have it running on an AMD64 Athlon 3500+, 2GB, XP SP2, with CPU clock
throttling disabled.
Increasing the loop count by a factor of 100 gives me:

3737032857 nanoseconds

or 3.7 seconds.
or 3737032857/1000000000 = 3.737032857 nsec/loop or ~0.63 nsec. per
instruction (avg.)

| -cd
|
|

[1]
00d100d2 83c201 add edx,0x1
00d100d5 81fa80969800 cmp edx,0x989680
00d100db 0f9cc0 setl al
00d100de 0fb6c0 movzx eax,al
00d100e1 85c0 test eax,eax
00d100e3 75ed jnz 00d100d2

notes:
- 0x989680 = 10.000.000 decimal
- this is native code, generated by the JIT in a non-optimized build.

Willy.
 

Eugene Gershnik

[Rearranging your post a little]
That was the whole point.
[...]

In this case, it seems C++/CLI is the fastest in the managed Windows
environment.

Let me see. Take a world record holder for the 100m dash and take me. Put us
both at the start of a 100m track and ask us to get to the end at whatever
pace we want. He walks. I run. I get there before him. Your conclusion seems
to be that I am a faster runner.
If I were to use optimizing options, there
would invariably be arguments that either I did not use the correct
ones, not in the proper order, that certain compiler switches are not
equivalent, etc., etc.

Yes, measuring compiler performance is hard. If you want to get meaningful
results, you will need to study each one's options in detail, determine what
people usually set in their optimized builds, create a meaningful test set,
etc., etc. If you don't do all this, announcing to the world that compiler X
is faster is a waste of electrons.
I also made the
test as simple as possible so as to time how each compiler internally
optimizes a straight iteration of a common for loop.

You didn't ask them to optimize the loop.
 

Carl Daniel [VC++ MVP]

Willy Denoyette said:
"Carl Daniel [VC++ MVP]" <[email protected]>
Are you sure you compiled non optimized (/Od and /o-)? As I said, the C++
compiler will hoist the loop when optimization is on (O&, O2 or whatever).

Quite certain - I used the exact command lines given in your posting
(optimization is off by default as well, so specifying nothing is equivalent
to /Od).
It's not the resolution, it's the interval that is the culprit.

We're talking about the same thing - 15ms precision is quite sufficient for
measuring intervals of 500ms or more and certainly won't account for a 50%
measurement error for such intervals - only 3% or so.
That's very surprising; QueryPerformanceFrequency (Stopwatch.Frequency)
should not be that high. Notice that this Frequency is not the CPU clock
frequency, it's the output of a CPU clock divider; its frequency is much
lower, on my system it's 3579545MHz (try with:
Console::WriteLine(System::Diagnostics::Stopwatch::Frequency);)
If on your system it's much higher than 1GHz, you might have an issue with
your system.

(You made a typo - on your system it's 3579545Hz, not MHz)

If your machine uses the MP HAL (which mine does), then QPC uses the RDTSC
instruction which does report actual CPU core clocks. If your system
doesn't use the MP HAL, then QPC uses the system board timer, which
generally has a clock speed of 1X or 0.5X the NTSC color burst frequency of
3.57954545 Mhz. Note that this rate has absolutely nothing to do with your
CPU clock - it's a completely independent crystal oscillator on the MB.
| With that change, I get a time of 15.8us for the C++ code and 42.3us
for
| the C# code - about the same difference I saw with GetTickCount.
|

Hmmm, 15.8 µsec. for 10000000 loops in which you execute 6 instructions
[1] per loop, that would mean 60000000 instructions in 15.8 µsec or 0.000263
nanosecs/instruction, or ~4.000.000.000.000 instructions/sec. - not
possible
really, looks like the loop is hoisted or your clock is broken ;-).

I agree - it doesn't add up. I'm quite sure that I did unoptimized builds,
and the results are 100% reproducible. But see below.
| It seems that there's something significantly different about your
machine
| as compared to mine & Don's when it comes to the performance of this
code -
| and that is very interesting!

Looks like you have to investigate the Frequency value returned first, and
inspect your code.

Well, it's your code - not mine. The Frequency value is right on for this
machine.

I'm at my office right now, on a different computer. This one's a 3GHz
Pentium D. I modified the samples as before to make nanosecPerTick double
instead of Int64 and added code to print the value of Stopwatch.Frequency
and the raw Ticks and nanosecPerTick. Here are the results:

C:\Dev\Misc\fortest>fortest0312cs
Stopwatch frequency=3052420000
0.327608913583321 ns/tick
240117 ticks
78664.4695028862 nanoseconds

C:\Dev\Misc\fortest>fortest0312cpp
Stopwatch frequency=3052420000
0.327608913583321 ns/tick
49225 ticks
16126.548771139 nanoseconds

Increasing the loop count by a factor of 10 increases the times by a factor
of 10. Decreasing by a factor of 10 decreases the times by a factor of 10.
Clearly the loop has not been optimized out, but that still doesn't explain
the apparent execution speed of more than 200 adds per clock cycle (I know
modern CPUs are somewhat super-scalar, but 200 adds/clock? I don't think
so!)

I don't know what's going on here, but two things seem to be true:

1. The C++ code is faster on these machines. If I increase the loop count
to 1,000,000,000 I can clearly see the difference in execution time with my
eyes.
2. The Stopwatch class doesn't appear to work correctly on these machines -
it's measuring times that are orders of magnitude too short, yet still
proportional to the actual time spent.

Working on the assumption that #2 is true, I modified the code to call
QueryPerformanceCounter/QueryPerformanceFrequency directly. Here are the
results:

C:\Dev\Misc\fortest>fortest0312cpp
QPC frequency=3052420000
0.327608913583321 ns/tick
22388910 ticks
7334806.48141475 nanoseconds

C:\Dev\Misc\fortest>fortest0312cs
QPC frequency=3052420000
0.327608913583321 ns/tick
58980368 ticks
19322494.2832245 nanoseconds
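
(A reconstruction, not the exact code used: calling QueryPerformanceCounter and QueryPerformanceFrequency directly from C# via P/Invoke looks roughly like this:)

// C# sketch: direct use of the Win32 high-resolution counter
using System;
using System.Runtime.InteropServices;

class QpcTimer
{
    [DllImport("kernel32.dll")]
    static extern bool QueryPerformanceCounter(out long count);

    [DllImport("kernel32.dll")]
    static extern bool QueryPerformanceFrequency(out long frequency);

    static void Main()
    {
        long freq, start, end;
        QueryPerformanceFrequency(out freq);
        double nsPerTick = 1000000000.0 / freq;   // double, so frequencies above 1 GHz still work

        QueryPerformanceCounter(out start);
        for (int i = 0; i < 10000000; ++i) {}
        QueryPerformanceCounter(out end);

        Console.WriteLine("QPC frequency={0}", freq);
        Console.WriteLine("{0} ticks", end - start);
        Console.WriteLine("{0} nanoseconds", (end - start) * nsPerTick);
    }
}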

The times are now much more reasonable - Stopwatch apparently doesn't work
correctly with such a high value from QPF (it's apparently off by a factor
of 1000). The ratio of times remains about equal though- the C++ code is
still nearly 2X faster on this machine (despite the fact that that makes no
sense at all, it seems to be true).

-cd
 

Carl Daniel [VC++ MVP]

Carl Daniel said:
The times are now much more reasonable - Stopwatch apparently doesn't work
correctly with such a high value from QPF (it's apparently off by a factor
of 1000). The ratio of times remains about equal though- the C++ code is
still nearly 2X faster on this machine (despite the fact that that makes
no sense at all, it seems to be true).

Follow-up -

It appears that Stopwatch scales the QPF/QPC values internally if the
frequency is "high", causing Stopwatch.ElapsedTicks to report a scaled
value, but Stopwatch.Frequency still reports the full resolution value
returned by QPF.

Stopwatch.ElapsedMilliseconds and Stopwatch.Elapsed both return correctly
scaled values.

This is clearly a bug in the Stopwatch class.
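
(An aside, not from the thread: a measurement that relies only on the members that do return correctly scaled values sidesteps the problem entirely - a minimal C# sketch, assuming using System.Diagnostics:)

// C# sketch: let Stopwatch do the frequency arithmetic itself
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < 10000000; ++i) {}
sw.Stop();
Console.WriteLine("{0} ms", sw.ElapsedMilliseconds);
Console.WriteLine("{0} ns", sw.Elapsed.TotalMilliseconds * 1000000.0);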

-cd
 

Willy Denoyette [MVP]

"Carl Daniel [VC++ MVP]" <[email protected]>
wrote in message | | > "Carl Daniel [VC++ MVP]"
<[email protected]>
| > Are you sure you compiled non optimized (/Od and /o-)? As I said, the
C++
| > compiler will hoist the loop when optimization is on (O&, O2 or
whatever).
|
| Quite certain - I used the exact command lines given in your posting
| (optimization if off by default as well, so specifying nothing is
equivalent
| to /Od).
|
| > It's not the resolution it's the interval which is the cullprit.
|
| We're talking about the same thing - 15ms precision is quite sufficient
for
| measuring intervals of 500ms or more and certainly won't account for a 50%
| measurement error for such intervals - only 3% or so.

Yes, but not for a loop of 10.000.000 (as in Don's code), which only
takes 37 msecs. to complete. And as I said, on SMP systems this interval can
be as large as 60 msecs. (as I have measured here on a Compaq Proliant 8-way
system).

|
| > That's very surprising, QueryPerformanceFrequency (StopWatch.Frequency)
| > should not be that high, notice that this Frequency is not the CPU clock
| > frequency, it's the output of a CPU clock divider, it's frequency is
much
| > lower, on my System it's 3579545MHz (try with:
| > Console::WriteLine(System::Diagnostics::Stopwatch::Frequency);)
| > If on your system it's much higher than 1GHz, you might have an issue
with
| > your system.
|
| (You made a typo - on your system it's 3579545Hz, not MHz)
|

Right, sorry for that.

| If your machine uses the MP HAL (which mine does), then QPC uses the RDTSC
| instruction which does report actual CPU core clocks. If your system
| doesn't use the MP HAL, then QPC uses the system board timer, which
| generally has a clock speed of 1X or 0.5X the NTSC color burst frequency
of
| 3.57954545 Mhz. Note that this rate has absolutely nothing to do with
your
| CPU clock - it's a completely independent crystal oscillator on the MB.
|
True, the MP HAL uses the external CPU clock (yours runs at 3.052420000 GHz),
but the 3.57954545 MHz clock is derived from a divider; or, otherwise stated,
the CPU clock (internal) is always a multiple of this 3.57954545 MHz, for
instance an Intel PIII 1GHz stepping 5 clocks at 3.57954545 MHz * 278 =
995MHz. The stepping number is important here, as it may change the divider's
value.

No, my current test machine is not MP or HT, so it doesn't use an MP HAL,
and you didn't specify that either in your previous reply; it's quite
important, as I know about the MP HAL.

| > | With that change, I get a time of 15.8us for the C++ code and 42.3us
| > for
| > | the C# code - about the same difference I saw with GetTickCount.
| > |
| >
| > Hmmm, 15.8 µsec. for 10000000 loops in which you execute 6 instructions
| > [1] per loop, that would mean 60000000 instructions in 15.8 µsec or
0.000263
| > nanosecs/instruction, or ~4.000.000.000.000 instructions/sec. - not
| > possible
| > really, looks like the loop is hoisted or your clock is broken ;-).
|
| I agree - it doesn't add up. I'm quite sure that I did unoptimized
builds,
| and the results are 100% reproducible. But see below.
|
| > | It seems that there's something significantly different about your
| > machine
| > | as compared to mine & Don's when it comes to the performance of this
| > code -
| > | and that is very interesting!
| >
| > Looks like you have to investigate the Frequency value returned first,
and
| > inspect your code.
|
| Well, it's your code - not mine. The Frequency value is right on for this
| machine.
|

Well..., it's Don's code. What do you mean, the Frequency value is
right? The Frequency is also right on mine :).

| I'm at my office right now, on a different computer. This one's a 3GHz
| Pentium D. I modified the samples as before to make nanosecPerTick double
| instead of Int64 and added code to print the value of Stopwatch.Frequency
| and the raw Ticks and nanosecPerTick. Here are the results:
|

| C:\Dev\Misc\fortest>fortest0312cs
| Stopwatch frequency=3052420000
| 0.327608913583321 ns/tick
| 240117 ticks
| 78664.4695028862 nanoseconds
|
| C:\Dev\Misc\fortest>fortest0312cpp
| Stopwatch frequency=3052420000
| 0.327608913583321 ns/tick
| 49225 ticks
| 16126.548771139 nanoseconds
|

That's for 10000000 loops I assume.

| Increasing the loop count by a factor of 10 increases the times by a
factor
| of 10. Decreasing by a factor of 10 decreases the times by a factor of
10.
| Clearly the loop has not been optimized out, but that still doesn't
explain
| the apparent execution speed of more than 200 adds per clock cycle (I know
| modern CPUs are somewhat super-scalar, but 200 adds/clock? I don't think
| so!)
|

That's not possible; Intel Pentium IV CPUs fetch and execute 2
instructions per cycle.
The AMD Athlon 64 fetches and executes a max. of 3 instructions per cycle
(mine clocks at 2.2GHz).

These are the results on PIV 3GHz not HT running W2K3 R2.
C#
Frequency = 3579545
46632867 nanoseconds

C++
Frequency = 3579545
40659177 nanoseconds

Notice the difference between C++ and C#; it looks like the X86 JIT'd code is
not exactly the same - I have to check this.
Remember the results on AMD 64-bit (XP SP2) - 37368702 nanoseconds - that
means that the AMD and the Intel 3GHz show comparable results, as expected.


| I don't know what's going on here, but two things seem to be true:
|
| 1. The C++ code is faster on these machines. If I increase the loop count
| to 1,000,000,000 I can clearly see the difference in execution time with
my
| eyes.

Assuming the timings are correct, it's simply not possible to execute that
number of instructions during that time, so there must be something going on
here.

| 2. The Stopwatch class doesn't appear to work correctly on these
machines -
| it's measuring times that are orders of magnitude too short, yet still
| proportional to the actual time spent.
|
| Working on the assumpting that #2 is true, I modified the code to call
| QueryPerformanceCounter/QueryPerformanceFrequency directly. Here are the
| results:
|
| C:\Dev\Misc\fortest>fortest0312cpp
| QPC frequency=3052420000
| 0.327608913583321 ns/tick
| 22388910 ticks
| 7334806.48141475 nanoseconds
|
| C:\Dev\Misc\fortest>fortest0312cs
| QPC frequency=3052420000
| 0.327608913583321 ns/tick
| 58980368 ticks
| 19322494.2832245 nanoseconds
|

How many loops here?


| The times are now much more reasonable - Stopwatch apparently doesn't work
| correctly with such a high value from QPF (it's apparently off by a factor
| of 1000).


This is really strange as Stopwatch uses the same QueryPerformanceCounter
and Frequency under the hood.

The ratio of times remains about equal though- the C++ code is
| still nearly 2X faster on this machine (despite the fact that that makes
no
| sense at all, it seems to be true).
|
Time to inspect the Stopwatch code, and I'll try to prepare a multicore or HT
box to do some more tests.

wd.
 

Willy Denoyette [MVP]

"Carl Daniel [VC++ MVP]" <[email protected]>
wrote in message | "Carl Daniel [VC++ MVP]" <[email protected]>
| wrote in message
| > The times are now much more reasonable - Stopwatch apparently doesn't
work
| > correctly with such a high value from QPF (it's apparently off by a
factor
| > of 1000). The ratio of times remains about equal though- the C++ code
is
| > still nearly 2X faster on this machine (despite the fact that that makes
| > no sense at all, it seems to be true).
|
| Follow-up -
|
| It appears that Stopwatch scales the QPF/QPC values internally if the
| frequency is "high", causing Stopwatch.ElapsedTicks to report a scaled
| value, but Stopwatch.Frequency still reports the full resolution value
| returned by QPF.
|
| Stopwatch.ElapsedMilliseconds and Stopwatch.Elapsed both return correctly
| scaled values.
|
| This is clearly a bug in the Stopwatch class.
|
| -cd
|
|

I see, but it still doesn't explain this:

| C:\Dev\Misc\fortest>fortest0312cpp
| QPC frequency=3052420000
| 0.327608913583321 ns/tick
| 22388910 ticks
| 7334806.48141475 nanoseconds
|
| C:\Dev\Misc\fortest>fortest0312cs
| QPC frequency=3052420000
| 0.327608913583321 ns/tick
| 58980368 ticks
| 19322494.2832245 nanoseconds
|

Why is C++ almost 3 times faster than C#? Are we sure the ticks are
accurate? Are we sure the OS counter is updated for every tick? Are we sure
the OS goes to the HAL to read the HW clock tick value at each call of
QueryPerformanceCounter (this must be quite expensive)?

And why is it 2 and 5 times faster than on my AMD box, while the results are
comparable (AMD a little faster) when I run it on an Intel 3GHz non-HT (see my
previous post)?

That means that the native code must be different, while on my AMD box it is
the same (despite the fact that the IL is different):
add edx,0x1
cmp edx,0x989680
setl al
movzx eax,al
test eax,eax
jnz 00d100d2

Which is not really the best algorithm for X86; I wonder how it looks on
Intel. Grr.. micro benchmarks, what a mess ;-)

Willy.
 

Carl Daniel [VC++ MVP]

Willy Denoyette said:
"Carl Daniel [VC++ MVP]" <[email protected]>
| If your machine uses the MP HAL (which mine does), then QPC uses the
RDTSC
| instruction which does report actual CPU core clocks. If your system
| doesn't use the MP HAL, then QPC uses the system board timer, which
| generally has a clock speed of 1X or 0.5X the NTSC color burst frequency
of
| 3.57954545 Mhz. Note that this rate has absolutely nothing to do with
your
| CPU clock - it's a completely independent crystal oscillator on the MB.
|
True, the MP HAL uses the external CPU clock (yours runs at 3.052420000 GHz),
but the 3.57954545 MHz clock is derived from a divider; or, otherwise stated,
the CPU clock (internal) is always a multiple of this 3.57954545 MHz, for
instance an Intel PIII 1GHz stepping 5 clocks at 3.57954545 MHz * 278 =
995MHz. The stepping number is important here, as it may change the divider's
value.

Not (necessarily) true. For example, this Pentium D machine uses a BCLK
frequency of 200 MHz with a multiplier of 15. There's no requirement
(imposed by the CPU or MCH) that the CPU clock be related to the color burst
frequency at all.

Now, it's entirely possible that the motherboard generates that 200 MHz BCLK
by multiplying a color burst crystal by 56 (200.45 MHz), but that's a
motherboard detail that's unrelated to the CPU. Without really digging,
there's no way I can tell one way or another - just looking at the MB, I see
at least 4 different crystal oscillators of unknown frequency. Historically,
the only reason color burst crystals are used is that they're cheap -
they're manufactured by the gazillion for NTSC televisions.
| Working on the assumpting that #2 is true, I modified the code to call
| QueryPerformanceCounter/QueryPerformanceFrequency directly. Here are
the
| results:
|
| C:\Dev\Misc\fortest>fortest0312cpp
| QPC frequency=3052420000
| 0.327608913583321 ns/tick
| 22388910 ticks
| 7334806.48141475 nanoseconds
|
| C:\Dev\Misc\fortest>fortest0312cs
| QPC frequency=3052420000
| 0.327608913583321 ns/tick
| 58980368 ticks
| 19322494.2832245 nanoseconds
|

How many loops here?

That's 10,000,000 loops - 2.2 clock cycles per loop sounds like a pretty
reasonable rate to me - certainly not off by orders of magnitude.
| I don't know what's going on here, but two things seem to be true:
|
| 1. The C++ code is faster on these machines. If I increase the loop
count
| to 1,000,000,000 I can clearly see the difference in execution time with
my
| eyes.

Assumed the timings are correct, it's simply not possible to execute that
number instructions during that time, so there must be something going on
here.

It's completely reasonable based on the times reported directly by QPC, not
the bogus values from Stopwatch, which is off by a factor of 1000 on these
machines.

So, any theory why the C++ code consistently runs faster than the C# code on
both of my machines? I can't think of any reasonable argument why having a
dual core or HT CPU would make the C++ code run faster. Clearly the JIT'd
code is different for the two loops - maybe there's some pathological code
in the C# case that the P4 executes much more slowly than AMD, or some
optimal code in the C++ case that the P4 executes much more quickly than
AMD. I'd be curious to hear the details of Don's machine - Intel/AMD,
Single/HT/Dual, etc.

-cd
 
