Can you write code directly in CIL?

Peter Olcott

I have to screen out the good advice from the advice that does not apply to my
needs. With 16,000 hours of development in the current project, and the speed of
a single 100 line function making or breaking the success of this project, many
of the typical rules would not apply. One poster said that hand tweaked CIL
doubled the speed, thus confirming my estimations.

Nicholas Paldino said:
I wouldn't worry about it, since you are not worried by the multiple posts
by multiple people in this thread telling you things that you don't want to
hear.

--
- Nicholas Paldino [.NET/C# MVP]
- (e-mail address removed)

Peter Olcott said:
Nicholas Paldino said:
I will second that the C++ compiler is better at optimizing IL output
than the C# compiler. However, as Willy stated, it will not always produce
verifiable code... I believe the article you were looking for is in MSDN
magazine.

What do you mean by verifiable code?
But as a general statement, the C++ compiler generally has the best
optimizations (and for unmanaged code, with the new profile-guided
optimization, it's even cooler).

--
- Nicholas Paldino [.NET/C# MVP]
- (e-mail address removed)

It's a shame I can't find the newsgroup thread I'm thinking about - I
could give it a try with C# 2.0...

btw what search engine are you using? It may be shocking for some people, but
a few days back I was searching for a thread, and for the same keywords Google
could not find the post while MSN Search (search.msn.com) did :)

groups.google.com, which is what I always use for newsgroup posts.
(Note: not just "web google".) It'll be there somewhere, but I just
can't find it at the moment...
 
Jon Skeet [C# MVP]

Peter Olcott said:
I have to screen out the good advice from the advice that does not apply to my
needs. With 16,000 hours of development in the current project, and the speed of
a single 100 line function making or breaking the success of this project, many
of the typical rules would not apply. One poster said that hand tweaked CIL
doubled the speed, thus confirming my estimations.

I said it doubled the speed for one particular case, which was only
about four instructions. I wouldn't expect there to be much difference
(if any) normally.

How much more performance do you need? Have you tried doing the
conversion and seeing how it performs *without* tweaking?
 
Abubakar

I have to screen out the good advice from the advice that does not apply to my
needs. With 16,000 hours of development in the current project, and the speed of
a single 100 line function making or breaking the success of this project, many
of the typical rules would not apply. One poster said that hand tweaked CIL
doubled the speed, thus confirming my estimations.

I want to advise that you first study the framework details. You want
to write IL yourself that would be better than the C# compiler's
generated output, and you don't yet know about verified code, details of GC,
etc. I think you would first have to go through the complete CLR instruction
set in order to pick the best instructions to optimise the code. By now
you must have got the idea that in the .NET world (as opposed to C++) it's going
to be extremely difficult to find people who even occasionally hand-code *.il
files to achieve better performance.

Ab.
 
Peter Olcott

Jon Skeet said:
I said it doubled the speed for one particular case, which was only
about four instructions. I wouldn't expect there to be much difference
(if any) normally.

How much more performance do you need? Have you tried doing the
conversion and seeing how it performs *without* tweaking?

The fastest algorithm with the best compilation just barely meets my target.
This is with MS Visual C++ 6.0. The project requirements call for a .NET
component. If I could double the speed of this I would be very pleased. In any
case more recent compilers do not meet my target even with the best algorithm,
so I must do at least as well as the best compiler. This should only be a matter
of translating the generated assembly language from the best unmanaged code into
CIL.
 
Peter Olcott

Abubakar said:
I want to advise that you first study the framework details. You want
to write IL yourself that would be better than the C# compiler's
generated output, and you don't yet know about verified code, details of GC,
etc. I think you would first have to go through the complete CLR instruction
set in order to pick the best instructions to optimise the code. By now
you must have got the idea that in the .NET world (as opposed to C++) it's going
to be extremely difficult to find people who even occasionally hand-code *.il
files to achieve better performance.

Yet when a project with 16,000 hours invested can be made or broken by the speed
of a single 100 line function, this kind of optimization would be completely
appropriate. I am only at the feasibility study stage now. It definitely looks
feasible.
 
Jon Skeet [C# MVP]

Peter Olcott said:
The fastest algorithm with the best compilation just barely meets my target.
This is with MS Visual C++ 6.0. The project requirements call for a .NET
component. If I could double the speed of this I would be very pleased. In any
case more recent compilers do not meet my target even with the best algorithm,
so I must do at least as well as the best compiler. This should only be a matter
of translating the generated assembly language from the best unmanaged code into
CIL.

No - because CIL looks pretty different from assembly language, and
even if you generated similar-looking CIL somehow, there's no guarantee
that would then be JITted to the same assembly code.

The performance improvement you'll get from this step, if any, is
likely to be tiny (don't hang on to the idea of doubling the speed -
that was a very particular case) and an awful lot of work. I'd do the
initial conversion to C# and benchmark that *first*.
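
The "benchmark first" step Jon describes can be sketched roughly as follows. This is only an illustration, assuming a .NET runtime with `System.Diagnostics.Stopwatch`; `BenchSketch`, `CountAbove` and `MeasureMilliseconds` are hypothetical names, and `CountAbove` is a made-up stand-in for the unposted 100-line function (comparisons, branches and integer moves only).

```csharp
using System;
using System.Diagnostics;

static class BenchSketch
{
    // Hypothetical stand-in for the critical function: only comparisons,
    // branches and integer moves, with no allocation and no OS calls.
    public static int CountAbove(int[] data, int threshold)
    {
        int count = 0;
        for (int i = 0; i < data.Length; i++)
        {
            if (data[i] > threshold) count++;
        }
        return count;
    }

    // Times `iterations` calls. The first call is made outside the timed
    // region so that JIT compilation is not included in the measurement.
    public static double MeasureMilliseconds(int[] data, int threshold,
                                             int iterations, out int checksum)
    {
        CountAbove(data, threshold); // warm-up: forces the JIT to compile the method
        checksum = 0;
        Stopwatch sw = Stopwatch.StartNew();
        for (int n = 0; n < iterations; n++)
        {
            checksum += CountAbove(data, threshold);
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }

    static void Main()
    {
        int[] data = new int[1000];
        for (int i = 0; i < data.Length; i++) data[i] = i;

        int checksum;
        double ms = MeasureMilliseconds(data, 499, 100000, out checksum);
        // The checksum is printed so the work cannot be optimized away.
        Console.WriteLine("{0:F1} ms for 100000 calls (checksum {1})", ms, checksum);
    }
}
```

Running the same harness against the plain C# port before any IL tweaking gives a baseline to compare against the unmanaged build; the numbers will depend on the machine and the JIT version.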
 
Willy Denoyette [MVP]

Peter Olcott said:
I wouldn't think that this would be the case for two reasons:
(1) CIL (for the most part) forms a one-to-one mapping with assembly
language

Not true. IL is a fairly high-level language compared to x86 assembly; a
single IL instruction translates to x assembly-level instructions, where x is
certainly not 1.

(2) End users are waiting on the JIT to complete, no time to waste doing
optimizations that could have been done before the software shipped.

Wrong again. IL is not optimized that much; THE optimizer is the JIT. It's
the JIT that knows at run-time what kind of optimizations can be performed
depending on the characteristics of the hardware, such as CPU type (32-bit/64-bit),
number of registers, L1 and L2 cache sizes, MMX/SSE enabled, etc.
The CLR is a run-time optimizing execution engine, whether you believe it or
not.


Willy.
 
Willy Denoyette [MVP]

Peter Olcott said:
Willy Denoyette said:
Peter Olcott said:
Peter Olcott wrote:

Nicholas Paldino wrote:
Peter,

I highly recommend that you read up on how Garbage Collection
works exactly.


I already know this.

No, you obviously don't. The problem is not that the GC will take any
memory from your method, but that it is running in a separate thread.
So let's say your method executes in 1/10 s. When garbage collection
occurs, it will take way longer, simply because your method will halt
in the middle of something and resume when the GC is done. So from
inside your method, the execution time was still 1/10 s, but from the
outside the execution time is way longer (a little bit like the theory of
relativity :) ).

Relying on the GC to do or not to do something is a capital sin in
.NET.

One thing that it can not do is to reclaim memory that is still in
use.

Correct, it will not try to take your memory; however, it will stop your
thread if it wants to.

I remember reading the algorithm. It is some sort of aging system. In
any case even if my memory needs to be constantly checked to see if
it is still in use, I only need a single monolithic large block.

Yes, and while it checks and sees that it is not allowed to touch your
monolithic large block, your method will pause and take longer than
1/10 s.

One thing that I do know about GC, is that it is ONLY invoked when
memory runs out, and is needed, otherwise it is never invoked.

One thing that you must know when developing managed code is that you
*never* know when the GC is invoked.
Watch your performance counters for garbage collection. You'll be
surprised how busy that area is :)

[snip]

HTH,
Andy
--
To email me directly, please remove the *NO*SPAM* parts below:
*NO*SPAM*xmen40@*NO*SPAM*gmx.net

Andreas,

The only time the GC runs (un-forced) is when the creation of an object
on the GC heap would overrun the gen0 heap threshold. When this happens,
the GC runs on the same thread as the object creator. That means that
the GC won't run as long as you don't create object instances.
Note that this assumes there is no external memory pressure while extra
GC heap segments are allocated; that would force the CLR to start
a full collection.

Willy.


So it is like I said. My program will not implicitly invoke a garbage
collection cycle after it begins executing, if it needs a single fixed
block of memory the whole time that it is executing.

Be careful, if your function allocates that "block of memory" from the GC
heap, it may get pre-empted by the CLR to perform a GC. You could force a
GC run by calling GC.Collect() before you call the method, but this won't
necessarily prevent a GC when you allocate a very large object in that
method.

I don't care about a GC before I begin running, or any other GC that I did
not invoke.

You don't care! I see. I wonder why you started this thread if you don't
care about most of the good advice we gave.

Willy.
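
Willy's point that an un-forced collection is triggered by allocation can be checked empirically rather than argued. A minimal sketch, assuming a .NET runtime with `Action` and lambdas: `GcSketch` and `Gen0CollectionsDuring` are hypothetical names, while `GC.CollectionCount` and `GC.Collect` are the real framework calls. It counts gen0 collections around a loop that only reads a block allocated up front.

```csharp
using System;

static class GcSketch
{
    // Returns how many gen0 collections happened while `work` ran.
    public static int Gen0CollectionsDuring(Action work)
    {
        int before = GC.CollectionCount(0);
        work();
        return GC.CollectionCount(0) - before;
    }

    static void Main()
    {
        // The single large block is allocated once, before measurement starts.
        int[] block = new int[1000000];

        int quiet = Gen0CollectionsDuring(() =>
        {
            long sum = 0;
            // Only reads and integer arithmetic: nothing is allocated here.
            for (int i = 0; i < block.Length; i++) sum += block[i];
        });

        int forced = Gen0CollectionsDuring(() => GC.Collect(0));

        Console.WriteLine("gen0 collections during non-allocating loop: " + quiet);
        Console.WriteLine("gen0 collections during forced collect: " + forced);
    }
}
```

The first count is usually zero for a loop like this, but it is not guaranteed: other threads in the process can allocate and trigger a collection, which is the caveat raised in the thread.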
 
Willy Denoyette [MVP]

Peter Olcott said:
The fastest algorithm with the best compilation just barely meets my
target. This is with MS Visual C++ 6.0. The project requirements call for
a .NET component. If I could double the speed of this I would be very
pleased. In any case more recent compilers do not meet my target even with
the best algorithm, so I must do at least as well as the best compiler.
This should only be a matter of translating the generated assembly
language from the best unmanaged code into CIL.

If the fastest algorithm using C++ barely meets your target, you won't do any
better using IL, whether or not you "translate" x86 assembly to IL (which
isn't possible, as there is no direct mapping). The IL will be JIT
compiled at run-time, but don't expect it to translate to the same x86
machine code.

Willy.
 
Willy Denoyette [MVP]

Peter Olcott said:
I found it on the web, some of the differences were several-fold. I don't
know which versions.

Well, they were wrong, for sure. Please post the URLs where you found this
kind of nonsense.

Willy.
 
Nicholas Paldino [.NET/C# MVP]

And yet, you STILL WON'T POST THE CODE!

Again, I ask of you, post the 100 line function, or let the thread die.
The fact that you have some 16,000 hours of development means absolutely
NOTHING to the CLR.

And yes, I am trying to say, in the nicest possible way, put up, or shut
up.
 
Peter Olcott

Jon Skeet said:
No - because CIL looks pretty different from assembly language, and
even if you generated similar-looking CIL somehow, there's no guarantee
that would then be JITted to the same assembly code.

The performance improvement you'll get from this step, if any, is
likely to be tiny (don't hang on to the idea of doubling the speed -
that was a very particular case) and an awful lot of work. I'd do the
initial conversion to C# and benchmark that *first*.

Well of course I would do that first. The only reason that I am considering this
step is that, with the unmanaged C++ compilers, the change from 6.0 to 7.0
resulted in a doubling of the time. In other words, the newer compiler produces
code that is only half as fast as the older compiler's.

The code that I am talking about doesn't have any OS calls or memory management.
It is just comparisons, branches and movement of integers. The tweaking that I
am talking about would merely be to shorten the lengths of the execution paths.
Fewer instructions in the execution paths would have to result in at least
somewhat faster code. If I could cut the weighted average length of the
execution paths in half, this would result in a doubling of the speed. Almost
every simple instruction takes only a single clock. Some instructions can now be
paired to execute concurrently. I will probably look into this sort of
optimization as well. This would only involve changing the order of some
instructions.

I figure that in the worst-case scenario I will be able to achieve the same
speed as the best compiler. For example, if the best compiler is VC++ 6.0,
unmanaged code, and if the 2005 C# compiler produces code that takes 500% longer
to execute, I can force the .NET code to execute just as fast as the VC++ 6.0
unmanaged code. I would only bother to do this for this critical 100-line
function. I actually expect to be able to do better than this. I would expect to
improve performance by at least 50%.
 
Peter Olcott

Willy Denoyette said:
Not true, IL is kind of high level language compared to X86 assembly, one
single IL instruction translates to x assembly level instructions where x is
certainly not 1.

Many of the instructions (all the ones in my critical 100-line function) would
map one-to-one with assembly language. All of the code in this critical 100-line
function is comparisons, branches, and the data movement of single integers.

Wrong again, IL is not optimized that much, THE optimizer is the JIT. It's

The JIT probably does all the processor-specific optimizations. These don't
affect performance nearly as much as the ones that are not processor-specific.
 
Peter Olcott

Willy Denoyette said:
Peter Olcott said:
Willy Denoyette said:
Peter Olcott wrote:

Nicholas Paldino wrote:
Peter,

I highly recommend that you read up on how Garbage Collection works
exactly.


I already know this.

No, you obviously don't. The problem is not that the GC will take any
memory from your method, but that it is running in a separate thread. So
let's say your method executes in 1/10 s. When garbage collection occurs,
it will take way longer, simply because your method will halt in the
middle of something and resume when the GC is done. So from inside your
method, the execution time was still 1/10 s, but from the outside the
execution time is way longer (a little bit like the theory of relativity
:) ).

Relying on the GC to do or not to do something is a capital sin in .NET.

One thing that it can not do is to reclaim memory that is still in use.

Correct, it will not try to take your memory; however, it will stop your
thread if it wants to.

I remember reading the algorithm. It is some sort of aging system. In
any case even if my memory needs to be constantly checked to see if it
is still in use, I only need a single monolithic large block.

Yes, and while it checks and sees that it is not allowed to touch your
monolithic large block, your method will pause and take longer than 1/10 s.

One thing that I do know about GC, is that it is ONLY invoked when
memory runs out, and is needed, otherwise it is never invoked.

One thing that you must know when developing managed code is that you
*never* know when the GC is invoked.
Watch your performance counters for garbage collection. You'll be
surprised how busy that area is :)

[snip]

HTH,
Andy
--
To email me directly, please remove the *NO*SPAM* parts below:
*NO*SPAM*xmen40@*NO*SPAM*gmx.net

Andreas,

The only time the GC runs (un-forced) is when the creation of an object on
the GC heap would overrun the gen0 heap threshold. When this happens, the
GC runs on the same thread as the object creator. That means that the GC
won't run as long as you don't create object instances.
Note that this assumes there is no external memory pressure while extra GC
heap segments are allocated; that would force the CLR to start a full
collection.

Willy.


So it is like I said. My program will not implicitly invoke a garbage
collection cycle after it begins executing, if it needs a single fixed
block of memory the whole time that it is executing.


Be careful, if your function allocates that "block of memory" from the GC
heap, it may get pre-empted by the CLR to perform a GC. You could force a GC
run by calling GC.Collect() before you call the method, but this won't
necessarily prevent a GC when you allocate a very large object in that
method.

I don't care about a GC before I begin running, or any other GC that I did
not invoke.

You don't care! I see. I wonder why you started this thread if you don't care
about most of the good advice we gave.

Because much of this advice does not apply to my case. I do care about the CPU
time it takes this critical 100-line function to execute. If another process
interrupts this so that the real time is much longer than the CPU time, I don't
care. I used the term real-time somewhat misleadingly.
 
Peter Olcott

Willy Denoyette said:
If the fastest algorithm using C++ barely meets your target, you won't do any
better using IL, whether or not you "translate" x86 assembly to IL (which isn't
possible, as there is no direct mapping). The IL will be JIT compiled at
run-time, but don't expect it to translate to the same x86 machine code.

My goal is to at least match this fastest time. The project has a design
requirement to be implemented as a .NET component. I will probably also hand
tweak the assembly language from this fastest compiler VC++ 6.0, and then
attempt to match this performance in CIL. From all of this effort I expect to
improve the performance of the fastest compiler by at least 50%. Since this
critical function will be executed several million times every second, it will
be worth the cost of this extra effort at optimization.
 
Peter Olcott

Nicholas Paldino said:
And yet, you STILL WON'T POST THE CODE!

Again, I ask of you, post the 100 line function, or let the thread die. The
fact that you have some 16,000 hours of development means absolutely NOTHING
to the CLR.

And yes, I am trying to say, in the nicest possible way, put up, or shut
up.

This code has an expected value in the millions of dollars. The idea behind what
this code implements was just approved for patent protection. I am not going to
provide this code.

http://www.tommti-systems.de/go.htm...Dateien/reviews/languages/benchmarks.html

This does show that C# is about 500% slower than C++ on something as simple as a
nested loop.
 
Nicholas Paldino [.NET/C# MVP]

ROFL, that's hilarious.

--
- Nicholas Paldino [.NET/C# MVP]
- (e-mail address removed)

Peter Olcott said:
Nicholas Paldino said:
And yet, you STILL WON'T POST THE CODE!

Again, I ask of you, post the 100 line function, or let the thread
die. The fact that you have some 16,000 hours of development means
absolutely NOTHING to the CLR.

And yes, I am trying to say, in the nicest possible way, put up, or
shut up.
This code has an expected value in the millions of dollars. The idea
behind what this code implements was just approved for patent protection.
I am not going to provide this code.
http://www.tommti-systems.de/go.htm...Dateien/reviews/languages/benchmarks.html

This does show that C# is about 500% slower than C++ on something as simple as
a nested loop.
 
Willy Denoyette [MVP]

Peter Olcott said:
Many of the instructions (all the ones in my critical 100 line function)
would map one-to-one with assembly language. All of the code in this
critical 100 line function is comparisons, branches, and the data movement
of single integers.

No, they do not. IL is based on a pure stack-based virtual machine execution
environment: it has no such thing as registers, it has no notion of a
real memory location, and it has no access to the runtime stack.

Just to give you an idea of what I'm trying to explain, consider the following
C# method and its compiler-generated IL.

[C#]
static void Foo()
{
    int v = 0;
    int[] ar = new int[5] {0, 1, 2, 3, 4};
    for (int i = 0; i != 5; i++)
    {
        v += ar[i];
    }
}
//

[compiler generated IL]
.method private hidebysig static void Foo() cil managed
{
// Code size 39 (0x27)
.maxstack 3
.locals init (int32 V_0,
int32[] V_1,
int32 V_2)
IL_0000: ldc.i4.0
IL_0001: stloc.0
IL_0002: ldc.i4.5
IL_0003: newarr [mscorlib]System.Int32
IL_0008: dup
IL_0009: ldtoken field valuetype '<PrivateImplementationDetails>{E21D91A1-F27C-4190-94E3-4FB17E12D29A}'/'__StaticArrayInitTypeSize=20' '<PrivateImplementationDetails>{E21D91A1-F27C-4190-94E3-4FB17E12D29A}'::'$$method0x6000002-1'
IL_000e: call void [mscorlib]System.Runtime.CompilerServices.RuntimeHelpers::InitializeArray(class [mscorlib]System.Array,
                   valuetype [mscorlib]System.RuntimeFieldHandle)
IL_0013: stloc.1
IL_0014: ldc.i4.0
IL_0015: stloc.2
IL_0016: br.s IL_0022

IL_0018: ldloc.0
IL_0019: ldloc.1
IL_001a: ldloc.2
IL_001b: ldelem.i4
IL_001c: add
IL_001d: stloc.0
IL_001e: ldloc.2
IL_001f: ldc.i4.1
IL_0020: add
IL_0021: stloc.2
IL_0022: ldloc.2
IL_0023: ldc.i4.5
IL_0024: bne.un.s IL_0018

IL_0026: ret
} // end of method Tester::Foo

and here is what the JIT compiler actually generated from this (!! CPU
specific !!)

00cb0098 57 push edi
00cb0099 56 push esi
00cb009a ba05000000 mov edx,0x5
00cb009f b92a981579 mov ecx,0x7915982a
00cb00a4 e86b21c5ff call 00902214
00cb00a9 8d7808 lea edi,[eax+0x8]
00cb00ac be68204000 mov esi,0x402068
00cb00b1 f30f7e06 movq xmm0,qword ptr [esi]
00cb00b5 660fd607 movq qword ptr [edi],xmm0
00cb00b9 f30f7e4608 movq xmm0,qword ptr [esi+0x8]
00cb00be 660fd64708 movq qword ptr [edi+0x8],xmm0
00cb00c3 83c610 add esi,0x10
00cb00c6 83c710 add edi,0x10
00cb00c9 a5 movsd
00cb00ca 33d2 xor edx,edx
00cb00cc 8b4804 mov ecx,[eax+0x4]
00cb00cf 3bd1 cmp edx,ecx
00cb00d1 730b jnb 00cb00de
00cb00d3 83c201 add edx,0x1
00cb00d6 83fa05 cmp edx,0x5
00cb00d9 75f4 jnz 00cb00cf
00cb00db 5e pop esi
00cb00dc 5f pop edi
00cb00dd c3 ret
00cb00de e8fe453e79 call mscorwks!JIT_RngChkFail (7a0946e1)
00cb00e3 cc int 3

Now try for yourself to build an IL module from the assembly code, and
please make sure it compiles, is verifiable and runs as fast as the C#
generated IL above. Or try to tweak the IL so it translates into better
(faster) X86 code.
The JIT probably does all the processor specific optimizations. These
don't affect performance nearly as much as the ones that are not processor
specific.

Apart from the processor-specific optimizations (which are significant), it
performs most of the optimizations performed by a C/C++ compiler back-end
optimizer (both the C++ back-end optimizer and the JIT optimizer were
written by the same team). The only difference is that it happens at run-time,
so it is somewhat constrained by time, but this is largely compensated by
the processor/memory-specific optimizations.
Check this link and see how managed code compares to unmanaged code at the
performance level.
http://www.grimes.demon.co.uk/dotnet/man_unman.htm


Willy.
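
On the thread's original question, one way to write IL by hand without a separate *.il file is to emit it at run-time. The sketch below assumes a .NET runtime that provides `System.Reflection.Emit.DynamicMethod` and `Func<,>`; `IlSketch` and `BuildSum` are hypothetical names. It hand-writes the same kind of stack-based loop as the compiler-generated IL above (load, `ldelem.i4`, `add`, store), summing an array passed as an argument. What machine code the JIT produces from it is still entirely up to the JIT.

```csharp
using System;
using System.Reflection.Emit;

static class IlSketch
{
    // Builds a delegate whose body is hand-written IL that sums an int[].
    public static Func<int[], int> BuildSum()
    {
        var method = new DynamicMethod("SumArray", typeof(int), new[] { typeof(int[]) });
        ILGenerator il = method.GetILGenerator();

        LocalBuilder sum = il.DeclareLocal(typeof(int));
        LocalBuilder i = il.DeclareLocal(typeof(int));
        Label body = il.DefineLabel();
        Label check = il.DefineLabel();

        // sum = 0; i = 0; goto check;
        il.Emit(OpCodes.Ldc_I4_0);
        il.Emit(OpCodes.Stloc, sum);
        il.Emit(OpCodes.Ldc_I4_0);
        il.Emit(OpCodes.Stloc, i);
        il.Emit(OpCodes.Br, check);

        // body: sum = sum + array[i];
        il.MarkLabel(body);
        il.Emit(OpCodes.Ldloc, sum);
        il.Emit(OpCodes.Ldarg_0);
        il.Emit(OpCodes.Ldloc, i);
        il.Emit(OpCodes.Ldelem_I4);
        il.Emit(OpCodes.Add);
        il.Emit(OpCodes.Stloc, sum);

        // i = i + 1;
        il.Emit(OpCodes.Ldloc, i);
        il.Emit(OpCodes.Ldc_I4_1);
        il.Emit(OpCodes.Add);
        il.Emit(OpCodes.Stloc, i);

        // check: if (i < array.Length) goto body;
        il.MarkLabel(check);
        il.Emit(OpCodes.Ldloc, i);
        il.Emit(OpCodes.Ldarg_0);
        il.Emit(OpCodes.Ldlen);
        il.Emit(OpCodes.Conv_I4);
        il.Emit(OpCodes.Blt, body);

        // return sum;
        il.Emit(OpCodes.Ldloc, sum);
        il.Emit(OpCodes.Ret);

        return (Func<int[], int>)method.CreateDelegate(typeof(Func<int[], int>));
    }

    static void Main()
    {
        Func<int[], int> sumArray = BuildSum();
        Console.WriteLine(sumArray(new[] { 0, 1, 2, 3, 4 })); // prints 10
    }
}
```

Note that every operand goes through the evaluation stack; there is no way to name a register or a machine address here, which is the point Willy makes about IL not mapping one-to-one onto x86.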
 
Willy Denoyette [MVP]

Peter Olcott said:
My goal is to at least match this fastest time. The project has a design
requirement to be implemented as a .NET component. I will probably also
hand tweak the assembly language from this fastest compiler VC++ 6.0, and
then attempt to match this performance in CIL. From all of this effort I
expect to improve the performance of the fastest compiler by at least 50%.
Since this critical function will be executed several million times every
second, it will be worth the cost of this extra effort at optimization.

Don't expect you can make it execute (50%) faster than unmanaged C++ code, and
don't expect to hand-tweak ASM, translate that to IL, and have the JIT
compiler produce the same machine code - IT WON'T. Also, I'm not clear on
what you mean by executing several million times per second; in another
reply you said the function takes 10 msec to finish using VC6, and now you
are expecting this to execute millions of times per second.

Willy.
 
