C++/CLI the fastest compiler? Yes, at least for me. :-)


Carl Daniel [VC++ MVP]

Willy Denoyette said:
That means that the native code must be different; this is what it is on
my AMD box (despite the fact that the IL is different):
add edx,0x1
cmp edx,0x989680
setl al
movzx eax,al
test eax,eax
jnz 00d100d2

Which is not really the best code for X86; I wonder what it looks like on
Intel. Grr.. micro benchmarks, what a mess ;-)

Here's what I see (loops going 1 billion times):

The JIT'd C++ code:
// ---------------------------------------------------
for (int i = 0; i < 1000000000; ++i) {}
00000077 xor edx,edx
00000079 mov dword ptr [esp],edx
0000007c nop
0000007d jmp 00000082
// start of loop
0000007f inc dword ptr [esp]
00000082 cmp dword ptr [esp],3B9ACA00h
00000089 jge 0000008E
0000008b nop
0000008c jmp 0000007F
// end of loop

The JIT'd C# code:
// ---------------------------------------------------
for (int i =0;i < 1000000000; ++i) {}
00000098 xor ebx,ebx
0000009a nop
0000009b jmp 000000A0
// start of loop
0000009d nop
0000009e nop
0000009f inc ebx
000000a0 cmp ebx,3B9ACA00h
000000a6 setl al
000000a9 movzx eax,al
000000ac mov dword ptr [ebp-6Ch],eax
000000af cmp dword ptr [ebp-6Ch],0
000000b3 jne 0000009D
// end of loop

Neither of these represents ideal code by any stretch of the imagination -
but instruction count alone probably accounts for the bulk of the
difference between the two programs on this machine. Why the results are
so different from what you see on your AMD machine, I can't even guess.
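
For anyone who wants to reproduce these raw numbers, here is a minimal
sketch of timing the loop against QueryPerformanceCounter directly via
P/Invoke (the two kernel32 entry points are the real Win32 API; the
program structure and names are just illustrative):

// QpcTimer.cs - time the empty loop with raw QPC ticks (sketch)
using System;
using System.Runtime.InteropServices;

class QpcTimer
{
    [DllImport("kernel32.dll")]
    static extern bool QueryPerformanceCounter(out long count);

    [DllImport("kernel32.dll")]
    static extern bool QueryPerformanceFrequency(out long frequency);

    static void Main()
    {
        long freq, start, stop;
        QueryPerformanceFrequency(out freq);
        double nsPerTick = 1000.0 * 1000 * 1000 / freq;
        Console.WriteLine("QPC frequency={0}", freq);
        Console.WriteLine("{0} ns/tick", nsPerTick);

        QueryPerformanceCounter(out start);
        for (int i = 0; i < 1000000000; ++i) { }   // the loop under test
        QueryPerformanceCounter(out stop);

        long ticks = stop - start;
        Console.WriteLine("{0} ticks", ticks);
        Console.WriteLine("{0} nanoseconds", ticks * nsPerTick);
    }
}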

-cd
 

Don Kim

Carl said:
So, any theory why the C++ code consistently runs faster than the C# code on
both of my machines? I can't think of any reasonable argument why having a
dual core or HT CPU would make the C++ code run faster. Clearly the JIT'd
code is different for the two loops - maybe there's some pathological code
in the C# case that the P4 executes much more slowly than AMD, or some
optimal code in the C++ case that the P4 executes much more quickly than
AMD. I'd be curious to hear the details of Don's machine - Intel/AMD,
Single/HT/Dual, etc.

Wow, this is becoming interesting. We're getting down to discussing
CPU architecture and instruction sets. Talk about getting down to the
metal!

Anyway, I just reran my test code with larger loop factors, as well as
the other code with my original and larger loop factors, and C++/CLI
still came out around 2X faster.

I ran these both on my laptop and desktop. Here's the configuration:

Laptop: Pentium Centrino 1.86 GHz, 1 GB Ram, Windows XP Pro, SP 2
Desktop: Pentium 4, 2.8 GHz, 1 GB RAM, Windows XP Pro, SP2

I know someone who has an AMD computer, and I'm going to run my programs
on that computer to see if there's something in the CPU that's causing
the discrepancies.

-Don Kim
 

Willy Denoyette [MVP]

"Carl Daniel [VC++ MVP]" <[email protected]>
wrote in message | | > "Carl Daniel [VC++ MVP]"
<[email protected]>
| > | If your machine uses the MP HAL (which mine does), then QPC uses the
| > RDTSC
| > | instruction which does report actual CPU core clocks. If your system
| > | doesn't use the MP HAL, then QPC uses the system board timer, which
| > | generally has a clock speed of 1X or 0.5X the NTSC color burst
frequency
| > of
| > | 3.57954545 Mhz. Note that this rate has absolutely nothing to do with
| > your
| > | CPU clock - it's a completely independent crystal oscillator on the
MB.
| > |
| > True MP HAL uses the externam CPU clock (yours runs at 3.052420000 GHz),
| > but
| > the 3.57954545 Mhz clock is derived from a divider or otherwise stated,
| > the
| > CPU clock (internal) is always a multiple of this 3.57954545 MHz, for
| > instance an Intel PIII 1GHz steping 5 clocks at 3.57954545 Mhz * 278 =
| > 995MHz. The stepping number is important here, as it may change the
| > dividers
| > value.
|
| Not (necessarily) true. For example, this Pentium D machine uses a BCLK
| frequency of 200Mhz with a multiplier of 15. There's no requirement
| (imposed by the CPU or MCH) that the CPU clock be related to color burst
| frequency at all.
|
Carl, I'm not saying this is the case for all type of CPU's and mother
boards, I only say that it's true for Pentiums up to III, things are
different for other type of CPU's. See, AMD clocks at 200MHz with a
multiplier of 11 or 12 depending on the type (and CPU id), this 200MHz clock
can be adjusted (overclocked or underclocked), the Frequency returned by
QueryPerformanceFrequency stays the same, the same is true for recent PIV's
Pentium M and D. So here it's true that both aren't related, and the
3.57954545MHz clock is derived from the on baord Graphics controller or an
external clock source (on mobo or not) when no on board graphics controller,
but the value remains the same 3.57954545MHz unless you are using a MP HAL.

| Now, it's entirely possible that the motherboard generates that 200Mhz
BCLK
| by multipliying a color burst crystal by 56 (200.45Mhz), but that's a
| motherboard detail that's unrelated to the CPU. Without really digging,
| there's no way I can tell one way or another - just looking at the MB, I
see
| at least 4 different crystal oscillators of unknown frequency.
Historically,
| the only reason color burst crystals are used is that they're cheap -
| they're manufactured by the gazillion for NTSC televisions.
|

I know,carl, I've been working for IHV's (HP before Compac, before DEC ...)
I know what you are talking about. Even on DEC Alpha (AXP) systems, the
QueryPerformance frequency was 3.57954545MHz using the mono CPU HAL, while
on SMP boxes like the Alpha 8400 (with the MP HAL) range it was also not the
case, Jeez, what a bunch of problems did we have when porting W2K (never
released for well known reasons) from intel code to AXP, just because some
drivers and core OS components did not expect QueryPerformanceCounter speeds
higher that 1GHz (that is when we overclocked an 800MHz CPU).

| > | Working on the assumpting that #2 is true, I modified the code to call
| > | QueryPerformanceCounter/QueryPerformanceFrequency directly. Here are
| > the
| > | results:
| > |
| > | C:\Dev\Misc\fortest>fortest0312cpp
| > | QPC frequency=3052420000
| > | 0.327608913583321 ns/tick
| > | 22388910 ticks
| > | 7334806.48141475 nanoseconds
| > |
| > | C:\Dev\Misc\fortest>fortest0312cs
| > | QPC frequency=3052420000
| > | 0.327608913583321 ns/tick
| > | 58980368 ticks
| > | 19322494.2832245 nanoseconds
| > |
| >
| > How many loops here?
|
| That's 10,000,000 loops - 2.2 clock cycles per loop sounds like a pretty
| resonable rate to me - certainly not off by orders of magnitude.
|

Sure it is, I was wrong when reading the tick values (largely over midnight
here, time to go to bed).

| > | I don't know what's going on here, but two things seem to be true:
| > |
| > | 1. The C++ code is faster on these machines. If I increase the loop
| > count
| > | to 1,000,000,000 I can clearly see the difference in execution time
with
| > my
| > | eyes.
| >
| > Assumed the timings are correct, it's simply not possible to execute
that
| > number instructions during that time, so there must be something going
on
| > here.
|
| It's completely reasonable based on the times reported directly by QPC,
not
| the bogus values from Stopwatch, which is off by a factor of 1000 on these
| machines.
|
| So, any theory why the C++ code consistently runs faster than the C# code
on
| both of my machines? I can't think of any reasonable argument why having
a
| dual core or HT CPU would make the C++ code run faster. Clearly the JIT'd
| code is different for the two loops - maybe there's some pathological code
| in the C# case that the P4 executes much more slowly than AMD, or some
| optimal code in the C++ case that the P4 executes much more quickly than
| AMD. I'd be curious to hear the details of Don's machine - Intel/AMD,
| Single/HT/Dual, etc.
|
| -cd
|

Well I have investigated the native code generated on the Intel PIV (see
previous .
Here is (part of) the disassembly (VS2005)for C++:
....
0000001f 46 inc esi
00000020 81 FE 80 96 98 00 cmp esi,989680h
00000026 7D 03 jge 0000002B
00000028 90 nop ---> not sure why this one is good for, it's ignored by the
CPU anyway
00000029 EB F4 jmp 0000001F
....

That means 4 instructions per loop compared to 6 on AMD.
And the results are comparable to yours (for C++).
Did not look at the C# code and it's result, but above shows that the JIT
compiler generates (better?) code for PIV (don't know what the __cpuid call
returns, but I know the CLR checks it when booting). Again, notice this is
an unoptimized code build (/Od flag set), optimized code is a totally
different story.

Willy.
 

Willy Denoyette [MVP]



Last follow-up (before my spouse pulls the plug).
Here is the X86 output of a C# release build on both AMD and Intel PIV:
[1]
0000001c 46                 inc esi
0000001d 81 FE 80 96 98 00  cmp esi,989680h
00000023 7C F7              jl 0000001C

This results in 6.235684 msec on AMD and 7.023547 msec on PIV (10.000.000
loops).

while this is the debug build on Intel:

00000030 90 nop
00000031 90 nop
00000032 46 inc esi
00000033 81 FE 80 96 98 00 cmp esi,989680h
00000039 0F 9C C0 setl al
0000003c 0F B6 C0 movzx eax,al
0000003f 8B F8 mov edi,eax
00000041 85 FF test edi,edi
00000043 75 EB jne 00000030

See that the release build is the most optimal X86 code possible for the
loop. The C++/CLI compiler in an optimized build hoists the loop
completely, so I can't compare.
Carl, could you look at the disassembly on your box? Not a problem if you
can't (it doesn't mean that much anyway); it looks like on your box the
C++/CLI output looks more like [1] above.

Willy.
 

Carl Daniel [VC++ MVP]

Willy Denoyette said:
Pentium M and D. So here it's true that both aren't related, and the
3.57954545MHz clock is derived from the on-board graphics controller, or
from an external clock source when there is no on-board graphics
controller, but the value remains the same 3.57954545MHz unless you are
using an MP HAL.

I'm 99.99% sure that my old P-II machine produced a QPC frequency of 1/2
color burst, or 1.7897727Mhz. But this particular branch has drifted far
from the real point of this thread - interesting though (made me go look
at the Pentium D data sheet, after all!)

-cd
 

Willy Denoyette [MVP]

"Carl Daniel [VC++ MVP]" <[email protected]>
wrote in message | | > That means that the native code must be different, while it is on my AMD
| > box
| > (dispite the fact that the IL is different).
| > add edx,0x1
| > cmp edx,0x989680
| > setl al
| > movzx eax,al
| > test eax,eax
| > jnz 00d100d2
| >
| > Which is not realy the best algorithm for X86, wonder how it looks like
on
| > Intel. Grr.. micro benchmarks, what a mess ;-)
|
| Here's what I see (loops going 1 billion times):
|
| The JIT'd C++ code:
| // ---------------------------------------------------
| for (int i = 0; i < 1000000000; ++i) {}
| 00000077 xor edx,edx
| 00000079 mov dword ptr [esp],edx
| 0000007c nop
| 0000007d jmp 00000082
| // start of loop
| 0000007f inc dword ptr [esp]
| 00000082 cmp dword ptr [esp],3B9ACA00h
| 00000089 jge 0000008E
| 0000008b nop
| 0000008c jmp 0000007F
| // end of loop
|
| The JIT'd C# code:
| // ---------------------------------------------------
| for (int i =0;i < 1000000000; ++i) {}
| 00000098 xor ebx,ebx
| 0000009a nop
| 0000009b jmp 000000A0
| // start of loop
| 0000009d nop
| 0000009e nop
| 0000009f inc ebx
| 000000a0 cmp ebx,3B9ACA00h
| 000000a6 setl al
| 000000a9 movzx eax,al
| 000000ac mov dword ptr [ebp-6Ch],eax
| 000000af cmp dword ptr [ebp-6Ch],0
| 000000b3 jne 0000009D
| // end of loop
|
| Neither of these represent ideal code by any stretch of the imagination -
| but instruction count alone probably accounts for the bulk of the
difference
| between the two programs on this machine. Why the results are so
different
| from what you see on your AMD machine I can't even guess.
|
| -cd
|

Thanks, that's almost exactly what I've noticed; see my previous reply.

C# Intel...
00000030 90 nop
00000031 90 nop
00000032 46 inc esi
00000033 81 FE 80 96 98 00 cmp esi,989680h
00000039 0F 9C C0 setl al
0000003c 0F B6 C0 movzx eax,al
0000003f 8B F8 mov edi,eax
00000041 85 FF test edi,edi
00000043 75 EB jne 00000030

C# AMD...
add edx,0x1
cmp edx,0x989680
setl al
movzx eax,al
test eax,eax
jnz 00d100d2

Conclusion: the JIT takes care of the CPU type even in debug builds! So it
generates different X86 even from the same IL.
This is extremely weird; for instance the inc esi used on Intel is an
add edx,1 on AMD - so different register allocations and a different
instruction. Well, I know add on AMD is preferred over an inc (according
to their "Optimization Guide for AMD64 Processors"), but can you believe
MSFT went that far with the JIT (in debug builds)?

Willy.
 

Willy Denoyette [MVP]

"Carl Daniel [VC++ MVP]" <[email protected]>
wrote in message | | > Pentium M and D. So here it's true that both aren't related, and the
| > 3.57954545MHz clock is derived from the on baord Graphics controller or
an
| > external clock source (on mobo or not) when no on board graphics
| > controller,
| > but the value remains the same 3.57954545MHz unless you are using a MP
| > HAL.
|
| I'm 99.99% sure that my old P-II machine produced a QPC frequency of 1/2
| color burst, or 1.7897727Mhz. But this particular branch has drifted far
| from the real point of this thread - interesting though (made me go look
at
| the Pentium D data sheet, afterall!)
|

Can't remember this, but I guess you are right; much depends on the
chipset used. I was on the Alpha team by that time (where we built the AXP
HALs and drivers); I moved to Intel architectures after the Compaq merge
;-). Digital had their own chipsets for Alpha systems (that's why they
were too expensive, right?), nothing commodity like there is available
now.

Willy.
 

Carl Daniel [VC++ MVP]

Willy said:
"Optimization guide for AMD64 Processors"), can you believe MSFT went
that far with the JIT (in debug builds)?

Well, yeah. Maybe. I'm under the (possibly misguided) impression that
debug primarily stops the JIT from inlining and hoisting - things that
change the relative order of the native code compared to the IL code.
Within those guidelines, I guess it still picks the best codegen it can
based on the machine.

My belief is that there are multiple full-time Intel and AMD employees at
MSFT that do nothing but work on the compiler back-ends, including the CLR
JIT.

-cd
 

Tim Roberts

r"Carl Daniel said:
I'm 99.99% sure that my old P-II machine produced a QPC frequency of 1/2
color burst, or 1.7897727Mhz.

Nope, it was actually 1/3 of the color burst, 1.193182 MHz. The original
PC had a 14.31818 MHz crystal (4x the color burst), and they divided it by
12 for the counter.
 

Carl Daniel [VC++ MVP]

Tim said:
r"Carl Daniel [VC++ MVP]"
I'm 99.99% sure that my old P-II machine produced a QPC frequency of
1/2 color burst, or 1.7897727Mhz.

Nope, it was actually 1/3 of the color burst, 1.193182 MHz. The
original PC had a 14.31818 MHz crystal (4x the color burst), and they
divided it by 12 for the counter.

Yep. That sounds right - 1.789 just didn't feel quite right :)

-cd
 

Willy Denoyette [MVP]

"Carl Daniel [VC++ MVP]" <[email protected]>
wrote in message | Willy Denoyette [MVP] wrote:
| > "Optimization guide for AMD64 Processors"), can you believe MSFT went
| > that far with the JIT (in debug builds)?
|
| Well, yeah. Maybe. I'm under the (possibly misguided) impression that
| debug primarily stops the JIT from inlining and hoisting - things that
| change the relative order of the native code compared to the IL code.
| Within those guidelines, I guess it still picks the best codegen it can
| based on the machine.
|
| My belief is that there are multiple full-time Intel and AMD employees at
| MSFT that do nothing but work on the compiler back-ends, including the CLR
| JIT.
|

Well, I would expect this for the C++ compiler back-end, but not directly
for the JIT compiler, which is more time constrained - but I guess I'm
wrong.

Willy.
 

Willy Denoyette [MVP]

| r"Carl Daniel [VC++ MVP]"
<[email protected]>
| wrote:
| >
| >I'm 99.99% sure that my old P-II machine produced a QPC frequency of 1/2
| >color burst, or 1.7897727Mhz.
|
| Nope, it was actually 1/3 of the color burst, 1.193182 MHz. The original
| PC had a 14.31818 MHz crystal (4x the color burst), and they divided it by
| 12 for the counter.
| --
| - Tim Roberts, (e-mail address removed)
| Providenza & Boekelheide, Inc.

Yep, an old 200MHz (199.261) P6 "Model 1, Stepping 7" of mine gives a QPC
frequency of 1.193182 MHz, that is CPU clock/167.

Willy.
 

Willy Denoyette [MVP]

|
| Wow, this is becoming interesting. We're getting down to discussing
| CPU architecture and instruction sets. Talk about getting down to the
| metal!
|

That's true; if you are running empty loops, you are not only comparing
compiler optimizations, you are measuring architectural differences at the
CPU, L1/L2 cache & memory controller level. That's also why such
micro-benchmarks have little or no value.


| Anyway, I just reran my test code with larger loop factors, as well as
| the other code with my original and larger loop factors, and C++/CLI
| still came out around 2X faster.
|
| I ran these both on my laptop and desktop. Here's the configuration:
|
| Laptop: Pentium Centrino 1.86 GHz, 1 GB Ram, Windows XP Pro, SP 2
| Desktop: Pentium 4, 2.8 GHz, 1 GB RAM, Windows XP Pro, SP2
|

Just curious what the QPC frequency is on the Centrino.

| I know someone who has an AMD computer, and I'm going to run my programs
| on that computer to see if there's something in the CPU that's causing
| the discrepancies.

Well, I noticed that for debug builds C++/CLI produces smaller IL, and the
JIT produces different X86 code for both C# and C++/CLI; here are the for
loops...

X86 for C# (debug)
...
00000030 90 nop
00000031 90 nop
00000032 46 inc esi
00000033 81 FE 80 96 98 00 cmp esi,989680h
00000039 0F 9C C0 setl al
0000003c 0F B6 C0 movzx eax,al
0000003f 8B F8 mov edi,eax
00000041 85 FF test edi,edi
00000043 75 EB jne 00000030
....

X86 for C++/CLI (debug)

0000001f 46 inc esi
00000020 81 FE 80 96 98 00 cmp esi,989680h
00000026 7D 03 jge 0000002B
00000028 90 nop
00000029 EB F4 jmp 0000001F

An optimized C# build produces an even shorter code path:
...
0000001c 46 inc esi
0000001d 81 FE 80 96 98 00 cmp esi,989680h
00000023 7C F7 jl 0000001C
...

Now, while one would think that the run times would be better, they are
not; all take the same time to finish.

The reason for this (AFAIK) is that superscalars like AMD prefer longer
code paths (longer than a cache line) in order to feed the instruction
pipeline with longer bursts. I don't know how this behaves on the Intel
Centrino and PIV HT, but it looks like they behave differently. (I'll try
this with an assembly code program.)

Anyway, I don't care that much about this; empty loops are not that common
I guess (and C++ will hoist them anyway). Once you start something
reasonable inside the loop, the loop overhead is reduced to dust and the
pipeline gets filled in a more optimal way.
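
To make that concrete, here is a small C# sketch (the loop body and all
names here are illustrative, not from this thread) comparing the empty
loop against one that does some real work per iteration:

// LoopBody.cs - empty loop vs. loop with a body (illustrative sketch)
using System;
using System.Diagnostics;

class LoopBody
{
    static void Main()
    {
        const int N = 10000000;

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < N; ++i) { }                  // loop overhead only
        sw.Stop();
        Console.WriteLine("empty: {0} msec.", sw.ElapsedMilliseconds);

        long sum = 0;
        sw = Stopwatch.StartNew();
        for (int i = 0; i < N; ++i) { sum += i % 7; }    // some real work
        sw.Stop();
        // print sum so the work can't be optimized away
        Console.WriteLine("body : {0} msec. (sum={1})",
                          sw.ElapsedMilliseconds, sum);
    }
}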

Willy.
 

Willy Denoyette [MVP]


Some more fun.

Consider this program:

//C++/CLI code
// File : EmptyLoop.cpp
#using <System.dll>
using namespace System;
using namespace System::Diagnostics;

#pragma unmanaged
void ForLoopTest( void )
{
    __asm {
        xor esi,esi        ; 0 -> esi
        jmp begin
    iter:
        inc esi            ; i++
    begin:
        cmp esi,989680h    ; i < 10000000?
        jl iter            ; yes -> loop again
    }
    return;
}

#pragma managed
int main()
{
    Int64 nanosecPerTick = (1000L * 1000L * 1000L) /
                           System::Diagnostics::Stopwatch::Frequency;
    Stopwatch^ sw = gcnew Stopwatch;
    sw->Start();
    ForLoopTest();
    sw->Stop();
    Int64 ticks = sw->Elapsed.Ticks;
    Console::WriteLine("{0} nanoseconds", ticks * nanosecPerTick);
}

Compiled with:
cl /clr /O2 EmptyLoop.cpp
output:
24935346 nanoseconds

cl /clr /Od EmptyLoop.cpp
output:
37636821 nanoseconds

See, the loop is in assembly - pure unmanaged X86 code - and the code
produced by the C++ compiler [1] is the same except for the function
prolog and epilog, although the results are different. Any takers?

[1]
/Od build

void ForLoopTest( void )
{
00401000 55 push ebp
00401001 8B EC mov ebp,esp
00401003 56 push esi
__asm {
xor esi,esi; 0 -> esi
00401004 33 F6 xor esi,esi
jmp begin;
00401006 EB 01 jmp begin (401009h)
iter:;
inc esi; i++
00401008 46 inc esi
begin:;
cmp esi,989680h ; < 10000000?
00401009 81 FE 80 96 98 00 cmp esi,989680h
jl iter; no
0040100F 7C F7 jl iter (401008h)
}
return;
}
00401011 5E pop esi
00401012 5D pop ebp
00401013 C3 ret

/O2 build

void ForLoopTest( void )
{
00401000 56 push esi
xor esi,esi; 0 -> esi
00401001 33 F6 xor esi,esi
jmp begin;
00401003 EB 01 jmp begin (401006h)
iter:;
inc esi; i++
00401005 46 inc esi
begin:;
cmp esi,989680h ; < 10000000?
00401006 81 FE 80 96 98 00 cmp esi,989680h
jl iter; no
0040100C 7C F7 jl iter (401005h)
__asm {
0040100E 5E pop esi
}
return;
}
0040100F C3 ret

Willy.
 

Willy Denoyette [MVP]

Ok, final update.
Stopwatch.Ticks is broken, so the calculated nanoseconds are incorrect on
all platforms.

Using Stopwatch.Elapsed.Milliseconds gives the following results.

Values are averages of 10 runs.

Debug build:

C# ~12.8 msec. for 10.000.000 loops
C++/CLI ~9.1 msec.

Release build:

C# ~9.1 msec.
C++/CLI - loop hoisted by the C++/CLI compiler (no IL body)

The X86 code for the loop in a C++/CLI /Od build and a C# optimized build
is nearly the same (different registers allocated, and inc instead of
add).

Now this:

#using <System.dll>
using namespace System;
using namespace System::Diagnostics;

#pragma unmanaged
void ForLoopTest( void )
{
    __asm {
        xor esi,esi          ; 0 -> esi
        jmp begin
    iter:
        inc esi              ; i++
    begin:
        cmp esi,100000000    ; i < 100000000?
        jl iter              ; yes -> loop again
    }
    return;
}

#pragma managed
int main()
{
    Stopwatch^ sw = gcnew Stopwatch;
    sw->Reset();
    sw->Start();
    ForLoopTest();
    sw->Stop();

    Int64 ms = sw->Elapsed.Milliseconds;
    Console::WriteLine("{0} msec.", ms);
}

compiled with:
cl /clr /Od bcca.cpp
output: for 100.000.000 loops!!
avg. 135 msec.

cl /clr /O2 bcca.cpp
output: for 100.000.000 loops!!
avg. 91 msec.


Notice the same result for the C# optimized build as for the C++/CLI
optimized build with the loop in assembly.
The question remains why the debug build is that much slower; I guess this
is due to the CLR starting some actions when running debug builds. IMO
there is a GC/finalizer run after the call to Stopwatch.Start and before
running the loop. That would explain the different behavior (better
results) on an HT CPU, as the finalizer runs on a second CPU and so
doesn't disturb the user thread, which runs on another core or logical
CPU; on a single CPU core the finalizer pre-empts the user thread.
I'll try to get a HW analyzer from the lab to check this; it is simply not
possible to check this by SW tools alone.

Willy.
 

Willy Denoyette [MVP]

| Ok, final update.
| The Stopwatch.Ticks is broken, so the calculated nanoseconds are incorrect
| on all platforms.
|

Followup.
!!! Stopwatch.Elapsed.Ticks != Stopwatch.ElapsedTicks !!!

One should not use Elapsed.Ticks to calculate the elapsed time in
nanoseconds.
The only correct way to get this high precision count is by using
Stopwatch.ElapsedTicks like this:

long nanosecPerTick = (1000L*1000L*1000L) / Stopwatch.Frequency;
....
long ticks = sw.ElapsedTicks;
Console.WriteLine("{0} nanoseconds", ticks * nanosecPerTick);

or use Stopwatch.ElapsedMilliseconds.

Note that the Stopwatch code is not broken; the code I posted used
Stopwatch.Elapsed.Ticks, which is wrong in this context.
Sorry for all the confusion.


Willy.
 

Willy Denoyette [MVP]


Mystery solved, finally :).

A C++/CLI debug build (/Od flag - the default) does not generate sequence
points in the IL; however, it does generate optimized IL.
A sequence point is used to mark a spot in the IL code that corresponds to
a specific location in the original source. If you look at the IL
generated by C# when compiled with /o-, you'll notice the nop's inserted
in the stream; these nop's are used by the JIT to produce sequence points,
but the /o- flag doesn't produce optimized IL. To get the same behavior in
C# as /Od gives in C++/CLI, you need to set /debug+ /o+. This generates
debug builds without the nop's that trigger the sequence points, just like
C++/CLI does.
The "empty loop" C# sample compiled with /debug+ /o+ runs just as fast as
the C++/CLI sample built with /Od. The IL produced is identical.
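
A quick way to verify this yourself (the csc and ildasm switches are the
documented ones; the file name is just an example):

// EmptyLoop.cs - the empty-loop sample
class EmptyLoop
{
    static void Main()
    {
        for (int i = 0; i < 10000000; ++i) { }
    }
}

csc /o- /debug+ EmptyLoop.cs   (non-optimized IL: nop's for sequence points)
csc /o+ /debug+ EmptyLoop.cs   (optimized IL, no nop's - like C++/CLI /Od)
ildasm /text EmptyLoop.exe     (dump the IL to compare the two builds)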

Willy.
 

Carl Daniel [VC++ MVP]

Willy Denoyette said:
Followup.
!!! Stopwatch.Elapsed.Ticks != Stopwatch.ElapsedTicks !!!

A ha! I obviously hadn't looked at the code closely enough to realize that
it was using Elapsed.Ticks and not ElapsedTicks.
One should not use Elapsed.Ticks to calculate the elapsed time in
nanoseconds.

True - one should use it to calculate the elapsed time in 0.1us units,
since that's the unit TimeSpan.Ticks is expressed in.
The only correct way to get this high precision count is by using
Stopwatch.ElapsedTicks like this:

long nanosecPerTick = (1000L*1000L*1000L) / Stopwatch.Frequency;

but make this a double. Stopwatch.Frequency is more than 1E9 on modern
machines using the MP HAL.

double nanosecPerTick = 1000.0 * 1000L * 1000L / Stopwatch.Frequency;
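
Putting the corrected pieces together, a minimal C# version might look
like this (a sketch using only the Stopwatch members discussed above):

// LoopTimer.cs - corrected timing with Stopwatch (sketch)
using System;
using System.Diagnostics;

class LoopTimer
{
    static void Main()
    {
        // double, since Stopwatch.Frequency can exceed 1E9 with the MP HAL
        double nanosecPerTick = 1000.0 * 1000L * 1000L / Stopwatch.Frequency;

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < 10000000; ++i) { }
        sw.Stop();

        // ElapsedTicks = raw QPC ticks; Elapsed.Ticks = 100ns TimeSpan units
        Console.WriteLine("{0} nanoseconds", sw.ElapsedTicks * nanosecPerTick);
        Console.WriteLine("{0} milliseconds", sw.ElapsedMilliseconds);
    }
}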

-cd
 

Carl Daniel [VC++ MVP]

Willy Denoyette said:
Mystery solved, finally :).


Good sleuthing! In the end, they really ought to be about the same -
having the C++ code execute 2x faster just didn't make sense.

-cd
 
