The original benchmark included the overflows. Your modification to
prevent the overflow makes it invalid, IMO. Overflows can sometimes be a
desired effect, for example when calculating checksums. Changing the type
of x to long introduces another problem - 64-bit addition, which is
non-trivial on 32-bit platforms.
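To make the checked/unchecked distinction concrete, here is a minimal C# sketch (not taken from the benchmark) showing both behaviors side by side -- the wrap-around that a checksum relies on, and the exception you get under checked arithmetic:

```csharp
using System;

class OverflowDemo
{
    static void Main()
    {
        int big = int.MaxValue;

        // Unchecked: the add wraps around (two's complement), which is
        // exactly the behavior a wrapping checksum depends on.
        int wrapped = unchecked(big + 1);
        Console.WriteLine(wrapped);          // prints -2147483648

        // Checked: the very same add raises OverflowException instead.
        try
        {
            int boom = checked(big + 1);
            Console.WriteLine(boom);
        }
        catch (OverflowException)
        {
            Console.WriteLine("overflow detected");
        }
    }
}
```

By default (without /checked+) C# integer arithmetic is unchecked, which is why the original benchmark overflows silently instead of throwing.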
A desired effect, right, but not in a benchmark that was meant to measure
nested loop performance. The hit taken by the 64-bit add is largely
compensated for by the CPU's microcode no longer having to handle the
overflow that occurs in the original code.
Note also that the function using a long has 3 more instructions than the
int version (11 vs. 14), but this doesn't say that much about the
performance deltas on super-scalar CPUs like the P4 and AMD.
The result of the calculation is not important in the benchmark, IMO. What
is important is the time the calculation takes to complete.
Without correct results? Why perform a calculation that yields a plainly
wrong result? I'm sorry, but I disagree.
The performance hit of adding to a 64-bit integer should be much higher than
any hit caused by an overflow that happens only once in a while.
Do you have any idea of the number of overflows? Did you run the code with
/checked while counting the number of exceptions? I guess not; it would take
a considerable time (minutes? hours?) to do so.
Not sure of that, but not sure it isn't true either... But arithmetic
overflows are desirable in many cases anyway.
I still think that the overflow is desired in this benchmark, so I would
wrap the *whole* benchmarking code in an unchecked { } block. /checked+
hits performance quite a lot.
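A minimal sketch of what wrapping the benchmarking code in an unchecked block could look like. The loop bounds and body here are made up for illustration; the thread's actual benchmark source is not shown:

```csharp
using System;
using System.Diagnostics;

class Bench
{
    static void Main()
    {
        Stopwatch sw = Stopwatch.StartNew();

        int x = 0;
        // The unchecked statement overrides /checked+ for everything inside
        // it, so the adds wrap silently instead of raising exceptions.
        unchecked
        {
            for (int a = 0; a < 1000; a++)
                for (int b = 0; b < 1000; b++)
                    x += a * b;   // the running sum may wrap; that wrap is
                                  // treated as part of the workload
        }

        sw.Stop();
        Console.WriteLine("Nested Loop elapsed time: {0} ms - {1}",
                          sw.ElapsedMilliseconds, x);
    }
}
```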
If it were desired, it should have been mentioned in the benchmark text, and
as far as I can see, it's not.
IMO it's just an oversight.
Interesting, when I change the x to long, I get 25437 ms execution time.
This clearly shows the difference between 32 bit and 64 bit arithmetic.
Are you sure you don't have an x64 machine? That would explain the
difference between our results perfectly.
No, I'm running on 32-bit HW. Why would a 32-bit ADD be slower than a 64-bit
ADD on x64 anyway?
I ran the benchmark on PIII, P4 (above results) Pentium M and to please you
also on AMD64.
To illustrate what happens, consider the following results, run on a Pentium
M 1.6GHz, XP SP2 - Framework 2.0.50727. I took a slower machine to better
illustrate the differences; note that running on a PIII, while slower, shows
the same pattern.
---------- original code --------
int o-
Nested Loop elapsed time: 32136 ms - -1804337152 108%
int o+
Nested Loop elapsed time: 29772 ms - -1804337152 reference value 100%
long o-
Nested Loop elapsed time: 24595 ms - 479232000000 83%
long o+
Nested Loop elapsed time: 20789 ms - 479232000000 70%
--------- optimized loop ---------
int o-
Nested Loop elapsed time: 17204 ms - -1804337152 58% 107%
int o+
Nested Loop elapsed time: 16073 ms - -1804337152 54% 100%
long o-
Nested Loop elapsed time: 21230 ms - 479232000000 71% 132%
long o+
Nested Loop elapsed time: 20219 ms - 479232000000 68% 126%
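As a sanity check on the tables above (a small C# sketch, not part of the benchmark): the two result columns are mutually consistent, because the int figure is simply the long figure truncated to its low 32 bits:

```csharp
using System;

class WrapCheck
{
    static void Main()
    {
        long fullSum = 479232000000L;             // the long result column

        // Keep only the low 32 bits and reinterpret them as a signed int;
        // this reproduces the overflowed int result column.
        int truncated = unchecked((int)fullSum);

        Console.WriteLine(truncated);             // prints -1804337152
    }
}
```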
As you can see, the original benchmark (see the reference value) is slower
than the same code using a long result. That means that using a long in the
"original code" loop compensates for the overflow hit.
Inspecting the JIT-generated code for the inner loop gives 11 instructions
for the int result version vs. 14 instructions for the long result version,
so theoretically the long version should be slower (but it isn't). NOTE:
your mileage may vary depending on HW and operational conditions (beware of
CPU clock throttling!).
Your hand-"optimized loop" code is in general faster than the original code,
but it slows down when using a long as the result.
Your modified loop compensates for the overflow hit by using a better
algorithm at the source code level; that is, THE NUMBER of X86 ADDs
resulting from your algorithm is lower than in the original code's inner
loop. Because of this, the result field can be register-allocated, while in
the original code, the result must be loaded from the stack before the
addition (x += ....) and saved back to the stack when done. This takes a
huge hit. (Note that your inner loop is 12 instructions long!)
Your code using a long takes a small hit (2 X86 instructions) because of the
64-bit integer addition on 32-bit HW, but this hit is much smaller than what
the figures suggest. What happens is that the JIT generates less efficient
code for the whole function (especially the inner loop) when using a long
result (note that this isn't the case with the original code). The register
(eax) can no longer be used to store and hold the result field, so now your
code takes the load/store from the stack just like the original code (both
for int and long).
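For illustration only -- the thread's actual benchmark source is not shown, so the following C# sketch is a hypothetical reconstruction of the two loop shapes being discussed: one that updates the accumulator in the innermost loop on every iteration, and one that hoists the inner sum out, so far fewer ADDs touch the accumulator and the JIT has an easier time keeping it in a register:

```csharp
using System;

class LoopShapes
{
    // Hypothetical "original" shape: the accumulator is written on every
    // innermost iteration, i.e. n^3 times.
    static int Original(int n)
    {
        int x = 0;
        for (int a = 0; a < n; a++)
            for (int b = 0; b < n; b++)
                for (int c = 0; c < n; c++)
                    x += c;
        return x;
    }

    // Hypothetical "optimized" shape: the inner sum is computed once, so
    // the accumulator is written only n^2 times, yet the result is the same.
    static int Optimized(int n)
    {
        int inner = 0;
        for (int c = 0; c < n; c++)
            inner += c;

        int x = 0;
        for (int a = 0; a < n; a++)
            for (int b = 0; b < n; b++)
                x += inner;
        return x;
    }

    static void Main()
    {
        Console.WriteLine(Original(10));    // prints 4500
        Console.WriteLine(Optimized(10));   // prints 4500
    }
}
```

Both shapes produce the same value; only the number of accumulator updates differs, which is the point being made about register allocation above.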
I hope this clears things up a bit. Anyway, I have no ambition to spend more
time on this, nor to start another argument. If you really want to know
what's happening, run this silly thing under a native debugger and watch
the JIT-created code, and you will understand.
PS. I ran this on X64 AMD64 VISTA build 5270 with v2.0.50727.52 (note 52 is
the latest framework build for Vista)
Original code using int resp. long as result
Nested Loop elapsed time: 16370 ms - -1804337152
Nested Loop elapsed time: 10053 ms - 479232000000
Hand-tweaked code using int resp. long as result
Nested Loop elapsed time: 16292 ms - -1804337152
Nested Loop elapsed time: 10194 ms - 479232000000
Notice that this is beta code, so this further reduces the value of the
results obtained.
Willy.