C# is a good way to learn C

Olaf Baeyens

Hi Olaf, Are you looking for my original code, or just code that fails ?
I am trying to look at the code that you tried to make work in theory.

I had prepared a big mail with disassembler stuff. But I did not have enough
time at my company, so I wanted to complete it at home. :) But I see that
you already discovered some of my points.

First of all, the code assumes Intel byte order; mine did not, and would
work in C# and C++, including on non-Intel machines.
Second, this --P thing was moving in the wrong direction. This way you
unnecessarily slow down memory operations, since memory tends to
preload the next bytes in bursts, and you use the previous ones instead,
thus stalling the memory cycle.
On most processors and memories, reads are optimized for sequential
access from low to high addresses.

Interesting is that you also found the way I would have done it, which indeed
works on VC++ 2003.
But then again, the Intel byte order dependency is not gone, and you had to
use a global variable in your case to optimize it.
The problem with a global is that it might not reside in one of the
32-byte cache lines of your processor cache, so you might lose time
loading the memory into the processor cache. And you lose an additional 32 bytes
of processor cache for that one global variable, again slowing down
something else.
__inline int Large_is_First_32 ( int X ) {
uint_8_P P = ( uint_8_P ) & X ;
return * P << 24 | P [ 1 ] << 16 | P [ 2 ] << 8 | P [ 3 ] ; }

But your biggest bottleneck is this thing:
0C movzx eax,byte ptr [rv+2 (403012h)]
....
15 movzx edx,byte ptr [rv+3 (403013h)]

Compared to my code, you need additional memory accesses and, worse, you only
read one byte at a time, while the complete original 32-bit variable
would have been aligned on a 4-byte boundary and thus loaded in one pass from
memory.
Only your compiler's optimizer discovered that it could save one byte read
in its first load.
Both of these assembler instructions indicate that you need additional reads,
and on memory locations that are not divisible by 4, giving your
processor another stall.

You should really try my way and check the optimized results. You will lose
at least the 2 memory reads.

I don't think I need to post the message I was planning to. ;-)
Everything here has been said. :)
 
Bill

Tim said:
In C and C++, precedence and associativity basically determine the shape
of the parse tree. They do *NOT* determine the order of evaluation of
things, other than indirectly (in that an operation can not be done
until the value of both operands is known).

In the case of (&&, ||) the order of operation IS defined as left to
right and STOPS once the outcome is known (this is called short
circuiting)
SO for example

if ( (F(x) < 0) || (G(x) <0) )
Blah();

F(x) will be executed first.
If (F(x) < 0) is true we NEVER execute the second test

Thus G(x) is not guaranteed to run.

-----------------------------------
Unfortunately for Relf, he didn't use these operators.
He used the bitwise operators, and they DON'T short circuit.

Had he used the boolean operators his code MIGHT have worked.
Or at the least it would have defined behavior.
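
A minimal sketch of the difference described above, in C++. The names F, G, calls_f and calls_g are made up for illustration and are not taken from Relf's code.

```cpp
#include <cassert>

// Hypothetical call counters, used only to observe which functions ran.
int calls_f = 0;
int calls_g = 0;

int F(int x) { ++calls_f; return x - 1; }
int G(int x) { ++calls_g; return x - 2; }

// Logical OR: F runs first, and G runs only if (F(x) < 0) is false.
bool logical_or(int x) { return (F(x) < 0) || (G(x) < 0); }

// Bitwise OR: both F and G always run, and the language does not say
// which of the two is called first.
bool bitwise_or(int x) { return (int(F(x) < 0) | int(G(x) < 0)) != 0; }
```

With x = 0, logical_or calls only F, because F(0) < 0 already decides the result; bitwise_or calls both functions.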

Hope this helps
Bill
 
Bill

Relf said:
Howdy there Bill, Re: * ++ P << 16 | * ++ P << 8...,
That code Should work. Why... you ask ?
Because it's much more readable than this: P [ 1 ] << 16 | P [ 2 ] << 8

Huh????
What does supposed READABILITY have to do with Code Correctness or
Undefined Behavior???
So it's a bug in both MS_CPP_7_1 and ISO_C.

That does not logically follow

....
P.S. This iterates MS_CPP's sequence points:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclang98/html/_pluslang_c.2b2b_.sequence_points.asp

I carefully looked over the list of sequence points and don't see one
that applies to your case
Which one do YOU think applies?

Bill
 
Jeff_Relf

Hi Olaf, I used a global instead of a constant or a local,
so that it didn't totally optimize away my code.
I've since found a better way to do that, this is the disassembly:

rv = Large_is_First_32( X );
* P << 24 | P [ 1 ] << 16 | P [ 2 ] << 8 | P [ 3 ] ;

1D xor ecx,ecx
1F mov ch,al
21 movzx edx,ah
24 mov dword ptr [esp+0Ch],eax ; Make the pointer
28 shr eax,10h
2B or ecx,edx
2D movzx edx,al
30 shl ecx,8
33 or ecx,edx
35 movzx eax,ah
38 shl ecx,8
3B or eax,ecx
3D mov dword ptr [rv (403018h)],eax ; Store rv, the Global int

vs. Your:

return (((byte) iOriginal)) << 24 | ((byte) (iOriginal >> 8)) << 16 |
((byte) (iOriginal >> 16)) << 8 | ((byte) (iOriginal >> 24));

90 mov eax,esi // eax=iOriginal
92 and eax,0FFh // (byte) iOriginal ( converting int to byte )
97 shl eax,18h // (byte) iOriginal<< 24
9a mov edx,esi // edx=iOriginal
9c sar edx,8 // iOriginal >> 8
9f and edx,0FFh // ((byte) (iOriginal >> 8))
a5 shl edx,10h // ((byte) (iOriginal >> 8)) << 16
a8 or eax,edx // ((byte) iOriginal)) << 24
// | ((byte) (iOriginal >> 8)) << 16
aa mov edx,esi // edx=iOriginal
ac sar edx,10h // iOriginal >> 16
af and edx,0FFh // ((byte) (iOriginal >> 16))
b5 shl edx,8 // ((byte) (iOriginal >> 16)) << 8
b8 or eax,edx // ((byte) iOriginal)) << 24
// | ((byte) (iOriginal >> 8)) << 16
// | ((byte) (iOriginal >> 16)) << 8
ba mov edx,esi // edx=iOriginal
bc sar edx,18h // iOriginal >> 24
bf and edx,0FFh // ((byte) (iOriginal >> 24)
c5 or eax,edx // ((byte) iOriginal)) << 24
// | ((byte) (iOriginal >> 8)) << 16
// | ((byte) (iOriginal >> 16)) << 8
// | ((byte) (iOriginal >> 24)
c7 mov ebx,eax // copy to the swap variable.

My code is faster, and more readable... sorry Olaf !
( And readability was my Only goal here... not speed )

This was the code I used:

#include <StdLib.H>
typedef unsigned char * uint_8_P ; int rv ;

__inline int Large_is_First_32 ( int X ) {
uint_8_P P = ( uint_8_P ) & X ;
return * P << 24 | P [ 1 ] << 16 | P [ 2 ] << 8 | P [ 3 ] ; }

int main(){ char * P = "84838281" ; register int X = strtoul( P, & P, 16 );
rv = Large_is_First_32( X );
return rv ;
}

The rv is there so that I can confirm that it works.
The strtoul() is there so that it doesn't optimize away my code.
 
The Ghost In The Machine

In comp.os.linux.advocacy, Jeff_Relf wrote:
Hi Spooky, Re: My Swap_32(), You wrote: <<
Not sure if it's a bug or not. On GCC/x86 it would return 32. >>

Don't you guys believe in hex ? !

0x84838281 is a much Much better test,
as each byte is labeled and has its high bit set.

Well, ideally one would use several such...but for a single test
that's not a bad start.
By the way, on a x86, I should've reversed the order, like this:

typedef unsigned char * uint_8_P ;

int Swap_32 ( int X ) { uint_8_P P = ( uint_8_P ) & X ;
return * P << 24 | * ++ P << 16 | * ++ P << 8 | * ++ P ; }
main() {
// prints 0x84848484 ! ! MS_CPP_7_1 BUG
printf( "%x", Swap_32( 0x84838281 ) ); }

And you were going to encapsulate that in some sort of #ifdef, right?
Assigning each byte to a global makes it work, like this:

typedef unsigned char * uint_8_P ; int _1, _2, _3, _4 ;

int Swap_32 ( int X ) { uint_8_P P = ( uint_8_P ) & X ;
return ( _1 = * P << 24 ) | ( _2 = * ++ P << 16 )
| ( _3 = * ++ P << 8 ) | ( _4 = * ++ P ); }

main() {
// prints 0x81828384 as it should.
printf( "%x", Swap_32( 0x84838281 ) ); }

OK, who else wants to vote for this being the Ugliest Workaround
For A Possible Compiler Bug This Week? :)
Removing the assignments, but keeping the parens, fails:
return ( * P << 24 ) | ( * ++ P << 16 ) | ( * ++ P << 8 ) | ( * ++ P );

As I said before, MS_CPP fails with or without the optimizer,
and MicroSoft claims that the | operator is evaluated left to right:
http://msdn.microsoft.com/library/d...s/vclang/html/_pluslang_c.2b2b_.operators.asp

Operator |
Name Bitwise inclusive OR
Associativity Left to right

Or just do the obvious fix:

return (P[0] << 24) | (P[1] << 16) | (P[2] << 8) | (P[3]);

How well does that one work for you?
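
One detail worth noting about the indexed form: each P[i] is an unsigned char that is promoted to a signed int before the shift, and moving a bit into the sign position of an int is not portable under older C and C++ standards. A minimal sketch that widens each byte to an unsigned type first; the function names and the use of std::uint32_t are assumptions, not code from the thread.

```cpp
#include <cassert>
#include <cstdint>

// Reads the four bytes at p as a big-endian ("large is first") value.
// Each byte is converted to an unsigned 32-bit type before it is shifted.
std::uint32_t big_first_32_bytes(const unsigned char* p) {
    return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16)
         | (std::uint32_t(p[2]) <<  8) |  std::uint32_t(p[3]);
}

// Mirrors the thread's signature: looks at the bytes of x as they sit in
// memory, so the result depends on the byte order of the host.
std::uint32_t big_first_32(std::uint32_t x) {
    return big_first_32_bytes(reinterpret_cast<const unsigned char*>(&x));
}
```

On a little-endian x86 machine big_first_32 reverses the bytes of x; on a big-endian machine it returns x unchanged.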
 
Tim Smith

What if a static global is set in a() that needs to be called in b(),
c() to get a correct calculation?

Then you have to write it like this:

int temp = a();
whatever = temp + b()*c();
 
The Ghost In The Machine

In comp.os.linux.advocacy, Hurray for Peter Pumpkinhead wrote:
What if a static global is set in a() that needs to be called in b(),
c() to get a correct calculation?

Then one has unpredictable behavior and will be lucky to
get a result even remotely reproducible, and probably one
will get some rather subtle and oddball symptoms to hash
through before one gets at the root cause of the problem,
during porting of the application from a platform that
works to another platform essential to one's business plan.

In short: problem-in-waiting. :)

Of course the workaround isn't too bad; either factor out
the global and have a setup() routine prior to evaluating
the expression in question, or put the result of a()
into a variable, and force the issue:

double aVal = a();
/* ... use ... */ aVal+b()*c(); /* ... */

I could see all of a(), b() and c() calling a setup()
routine if they all needed access to the same pointer
(setup() would create and/or return it to them), though;
so long as setup() does the same thing every time it won't
really matter *who* gets called first:

***

Something * staticValue;

void setup()
{
/* ... blah blah whatever blah blah ... */
staticValue = new Something(...);
/* ... blah blah whatever blah blah ... */
}

double a()
{
if(staticValue == 0) setup();
return staticValue->a();
}

double b()
{
if(staticValue == 0) setup();
return staticValue->b();
}

double c()
{
if(staticValue == 0) setup();
return staticValue->c();
}

***

This is admittedly not the best of code (thread problem?
*What* thread problem? :) ) but at least it wouldn't depend
on an explicit order of evaluation.
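
For comparison, a sketch of the same idea that swaps the explicit null check for a function-local static, which C++11 and later initialize exactly once even with several threads. This is a different technique from the setup() code above, and the Something class here is a hypothetical stand-in with made-up return values.

```cpp
#include <cassert>

// Hypothetical stand-in for the Something class in the post above.
struct Something {
    double a() const { return 1.0; }
    double b() const { return 2.0; }
    double c() const { return 3.0; }
};

// The instance is constructed the first time shared() is called, no matter
// which of a(), b() or c() gets there first.
Something& shared() {
    static Something instance;
    return instance;
}

double a() { return shared().a(); }
double b() { return shared().b(); }
double c() { return shared().c(); }
```

An expression such as a() + b() * c() then gives the same result whichever call the compiler evaluates first.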
 
The Ghost In The Machine

In comp.os.linux.advocacy, Bill wrote:
Relf said:
Howdy there Bill, Re: * ++ P << 16 | * ++ P << 8...,
That code Should work. Why... you ask ?
Because it's much more readable than this: P [ 1 ] << 16 | P [ 2 ] << 8

Huh????
What does supposed READABILITY have to do with Code Correctness or
Undefined Behavior???

Ah, remember -- it's the "Jeff-Relf-Readability Test". If he can
read it, no problem. If he can't, it's all your fault. :)
That does not logically follow

Make that "Jeff-Relf-Logic" as well.

[URL repair]
I carefully looked over the list of sequence points and don't see one
that applies to your case
Which one do YOU think applies?

Interesting -- "An expression can modify an objects [sic] value only
once between consecutive sequence points". I think this explains
the behavior in part that he saw earlier, in that the declaration

unsigned char * P = (unsigned char *) &X + 4;
return *--P << 24 | *--P << 16 | *--P << 8 | *--P;

was for some reason returning 0x20202020 for input value 0x00000020;
apparently the four decrements were being collapsed.

However, that does not explain why the modified form

unsigned char * P = (unsigned char *) &X;
return *P++ << 24 | *P++ << 16 | *P++ << 8 | *P;

actually *works*.
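
Both forms modify P several times within one expression with no sequence point in between, so the behavior is undefined and either result is permitted; that one of them "works" is luck. A minimal sketch that keeps the increments but gives each one its own full expression. The function name and the use of std::uint32_t are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Reads the bytes of x in memory order, most significant result byte first.
std::uint32_t big_first_32_stepwise(std::uint32_t x) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&x);
    std::uint32_t result = 0;
    for (int i = 0; i < 4; ++i) {
        // p is modified exactly once in this full expression.
        result = (result << 8) | *p++;
    }
    return result;
}
```

Like the pointer versions in the thread, this reads the in-memory layout of x, so the value returned depends on the host byte order.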

My brain is starting to hurt.
 
Olaf Baeyens

Hi Olaf, I used a global instead of a constant or a local,
so that it didn't totally optimize away my code.
I've since found a better way to do that, this is the disassembly:

rv = Large_is_First_32( X );
* P << 24 | P [ 1 ] << 16 | P [ 2 ] << 8 | P [ 3 ] ;

1D xor ecx,ecx
1F mov ch,al
21 movzx edx,ah
24 mov dword ptr [esp+0Ch],eax ; Make the pointer
28 shr eax,10h
2B or ecx,edx
2D movzx edx,al
30 shl ecx,8
33 or ecx,edx
35 movzx eax,ah
38 shl ecx,8
3B or eax,ecx
3D mov dword ptr [rv (403018h)],eax ; Store rv, the Global int
This is very close to something that I would have created if I would have
made it in assembler directly.
I assume that 'eax' starts with the X variable?
This is clearly created by an optimizer compiler.

I did check your code by hand, it works. :)

Only the line:
24 mov dword ptr [esp+0Ch],eax ; Make the pointer
is obsolete and in fact it is not storing the pointer it is the actual X
value, maybe for future reference?

Try this = P[0] << 24 | P [ 1 ] << 16 | P [ 2 ] << 8 | P [ 3 ] ;
instead of this * P << 24 | P [ 1 ] << 16 | P [ 2 ] << 8 | P [ 3 ] ;
You might lose that additional instruction.
My code is faster, and more readable... sorry Olaf !
( And readability was my Only goal here... not speed )
Why sorry? If it works, then it is a good job. :)
And the fact that you adapt your coding style and prove that you are willing
to learn and adapt is a very good thing.

But the code I showed compiles in C# and, with a minor change (BYTE instead of
byte), also in C++.
But regarding speed, it is really compiler dependent. I do not have an
optimizing compiler for C++, since I use VC++ 2003 Standard, so my generated
assembler code will be bigger.
I do not know if VC# 2003 Professional has a compiler optimizer.

Also note that the C# generated assembler and the unmanaged C++ generated
assembler were nearly exactly the same. My original point in the discussion
was to prove that C# is not at all slow compared to unmanaged C++ when you
look at the assembler level. I cannot compare with an optimizing compiler,
since I don't have one available.

So I would be interested to look at the optimized code from my function. Can
you give me that?
I mean, how did you determine that my code is slower? Did you test this with
your compiler?
 
Jeff_Relf

Hi Spooky, I'm saying that the following code should work, but doesn't:

Large_is_First_32 ( int X ) {
unsigned char * P = ( unsigned char * ) & X ;
return * P ++ << 24 | * P ++ << 16 | * P ++ << 8 | * P ; }

C_99 and MS_CPP should both be updated so that it will work, in my opinion,
because P [ 1 ] << 16 | P [ 2 ] << 8
is too ugly, too weird, and too unnecessary.

This is the code that works:

#include <StdLib.H>
typedef char * int_8_P ; typedef unsigned char * uint_8_P ; int rv ;

int Large_is_First_32 ( int X ) {
uint_8_P P = ( uint_8_P ) & X ;
return * P << 24 | P [ 1 ] << 16 | P [ 2 ] << 8 | P [ 3 ] ; }

int main(){ int_8_P P = "84838281" ;
int X = strtoul( P, & P, 16 ); rv = Large_is_First_32( X );
return rv ; } // rv is now 0x81828384

The optimizer inlines the code as follows:
rv = * P << 24 | P [ 1 ] << 16 | P [ 2 ] << 8 | P [ 3 ] ;

1D xor ecx,ecx
1F mov ch,al
21 movzx edx,ah
24 mov dword ptr [ esp + 0Ch ],eax
28 shr eax,10h
2B or ecx,edx
2D movzx edx,al
30 shl ecx,8
33 or ecx,edx
35 movzx eax,ah
38 shl ecx,8
3B or eax,ecx
3D mov dword ptr [ rv ( 403018h ) ],eax

Such fast and pretty code !

The strtoul() must be there or my code gets optimized away.
rv, a global, must be there to confirm that it works.
 
Jeff_Relf

Hi Olaf, Re: This code of mine:

uint_32 Big_First_32 ( int & X ) { uint_8_P B = ( uint_8_P ) & X ;
return * B << 24 | B [ 1 ] << 16 | B [ 2 ] << 8 | B [ 3 ]; }

00 xor edx,edx
02 mov dh,al
04 movzx ecx,ah
07 or edx,ecx
09 mov ecx,dword ptr [esp+18h]
0D shr ecx,10h
10 shl edx,8
13 movzx ebx,cl
16 or edx,ebx
18 shl edx,8
1B movzx ecx,ch
1E or edx,ecx

You told me: << This is very close to something that I would have created
if I would have made it in assembler directly.
I assume that 'eax' starts with the X variable ?
This is clearly created by an optimizer compiler. >>

Yes, hovering the mouse cursor over eax shows 0x84838281,
which is what I set X to at the beginning ( in that test ).

And Yes, it was optimized with MS_CPP's /Og switch, Global optimizations:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vccore/html/_core_.2f.og.asp

You added: << I did check your code by hand, it works. :)
Only the line:
24 mov dword ptr [esp+0Ch],eax ; Make the pointer
is obsolete and in fact it is not storing the pointer
it is the actual X value, maybe for future reference ? >>

Right, it's being placed on the stack, I should've realized that.
I'm not really into the disassembly, as you can tell.

I don't know why it was there,
because my latest code does it the other way around, like this:
09 mov ecx,dword ptr [esp+18h]

I had to add the & here: Big_First_32 ( int & X )
to stop the /Og switch from breaking my code in certain Bizarre cases,
....another bug with MS_CPP ?

The /Og switch is notorious for breaking my code like that,
I've stopped using it in my professional code.

Re: This idea of yours: P[0] << 24 instead of * P << 24,

That produces the exact same disassembly.

Re: If VC# 2003 Professional has an optimizer,

I don't know... I only use the CPP part.
I got my copy of VS_2003_Pro from a friend, it cost me nothing.
( Universities are like that )

You wrote: << Also note that the C# generated assembler
and the unmanaged C++ generated assembler was nearly exactly the same.
My original point in the discussion was
to prove that C# was totally not slow
compared to unmanaged C++ when you look at the assembler level.
I cannot compare with a compiler optimizer,
since I don't have one available.

So I would be interested to look at the optimized code from my function.
Can you give me that ?
I mean how did you determine that my code is slower ?
Did you test this with your compiler ? >>

Your code works, and it becomes:

uint_32 Swap_32 ( int X ) {
return uint_8( X ) << 24 | uint_8( X >> 8 ) << 16
| uint_8( X >> 16 ) << 8 | uint_8( X >> 24 ); }

1B movzx ecx,ah
1E mov ch,al
20 mov edx,eax
22 sar edx,10h
25 movzx edx,dl
28 sar eax,18h
2B shl ecx,8
2E or ecx,edx
30 movzx eax,al
33 shl ecx,8
36 or eax,ecx
38 mov dword ptr [rv (403018h)],eax

Adding Winsock2.H's htonl() to the mix,
Your C++ code is the fastest, this is the output:
.00401 Seconds, Sum 2147244176424960, Swap_32( 0 - 999,999 ).
.00469 Seconds, Sum 2147244176424960, Big_First_32( 0 - 999,999 ).
.00789 Seconds, Sum 2147244176424960, htonl( 0 - 999,999 ).

But, as I said before, readability was my only goal, not speed,
and I find my code to be more readable.

Here's the code that produced the above output:
http://www.Cotse.NET/users/jeffrelf/Kelsey.CPP
http://www.Cotse.NET/users/jeffrelf/Kelsey.VCPROJ

#pragma warning( disable: 4127 4244 4706 )
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <StdLib.H>
#include <stdio.h>
#include <IO.H>
#include <Winsock2.H>
#pragma comment( lib, "Ws2_32.LIB")

#define Loop( N ) int J = - 1, LLL = N ; while ( ++ J < LLL )
#define Tics ( QueryPerformanceCounter( ( Quad * ) & _Tics ), _Tics )
#define Secs ( _Secs = Tics / Secnd_Dub )

typedef char * int_8_P ;
typedef unsigned char uint_8 ; typedef uint_8 * uint_8_P ;
typedef unsigned __int32 uint_32 ;
typedef LARGE_INTEGER Quad ;

double Secnd_Dub, _Secs, Mark ; __int64 _Tics, Secnd ;

uint_32 Big_First_32 ( int & X ) { uint_8_P B = ( uint_8_P ) & X ;
return * B << 24 | B [ 1 ] << 16 | B [ 2 ] << 8 | B [ 3 ]; }

uint_32 Swap_32 ( int X ) {
return uint_8( X ) << 24 | uint_8( X >> 8 ) << 16
| uint_8( X >> 16 ) << 8 | uint_8( X >> 24 ); }

main() {
QueryPerformanceFrequency( ( Quad * ) & Secnd ); Secnd_Dub = Secnd ;
FILE * fp = fopen( "AA.TXT", "w" );
const int Times = 1000 * 1000 ;

__int64 X = 0 ; Mark = Secs ;
{ Loop( Times ) X += Swap_32( J ); } double Dur = Secs - Mark ;

__int64 X2 = 0 ; Mark = Secs ;
{ Loop( Times ) X2 += Big_First_32( J ); } double Dur2 = Secs - Mark ;

__int64 X3 = 0 ; Mark = Secs ;
Loop( Times ) X3 += htonl( J ); double Dur3 = Secs - Mark ;

char SecStr [ 99 ] ;
sprintf( SecStr, "%1.5f" , Dur );
fprintf( fp, "%s Seconds, Sum %I64d, Swap_32( 0 - 999,999 ).\n"
, SecStr + ( * SecStr == '0' ), X );

sprintf( SecStr, "%1.5f" , Dur2 );
fprintf( fp, "%s Seconds, Sum %I64d, Big_First_32( 0 - 999,999 ).\n"
, SecStr + ( * SecStr == '0' ), X2 );

sprintf( SecStr, "%1.5f" , Dur3 );
fprintf( fp, "%s Seconds, Sum %I64d, htonl( 0 - 999,999 ).\n"
, SecStr + ( * SecStr == '0' ), X3 ); fclose( fp ); }
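
The same kind of measurement can be sketched portably with std::chrono instead of QueryPerformanceCounter. This is only an outline under assumptions: swap_32, Timing and time_loop are names made up here, and a run of a few milliseconds is too short to rank the functions reliably.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Value-based byte reversal, the same idea as Swap_32 above.
std::uint32_t swap_32(std::uint32_t x) {
    return (x << 24) | ((x & 0xFF00u) << 8) | ((x >> 8) & 0xFF00u) | (x >> 24);
}

struct Timing {
    double seconds;     // elapsed time measured with a monotonic clock
    std::uint64_t sum;  // checksum, returned so the loop has an observable result
};

// Calls fn(0), fn(1), ..., fn(times - 1) and reports elapsed time and checksum.
template <typename Fn>
Timing time_loop(Fn fn, std::uint32_t times) {
    const auto start = std::chrono::steady_clock::now();
    std::uint64_t sum = 0;
    for (std::uint32_t j = 0; j < times; ++j) sum += fn(j);
    const auto stop = std::chrono::steady_clock::now();
    return { std::chrono::duration<double>(stop - start).count(), sum };
}
```

A call such as time_loop(swap_32, 1000 * 1000) corresponds to one line of the output above; repeating each measurement several times and comparing the checksums guards against a mistimed or miscompiled run.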
 
Olaf Baeyens

Nice test. :)

Right now I am checking out this IL assembler.
It appears that the managed/unmanaged thing used in C++ is not native to C++,
but could be implemented in C# too if they changed the C# compiler.
It is actually built into the JIT itself.

The IL assembler is an object-oriented assembler based on a stack,
similar to Forth (if I remember correctly),
or to the way HP scientific calculators work. You push 2 variables on the
stack and then say ADD, and the result is stored on that stack, ready to be
used for the next operation.
It takes some getting used to. :)

But VC# 2003 clearly does not optimize that IL code, in either release or
debug builds, when I look at it with ILDASM.
Maybe the Professional version does that? So it is logical that the
executable code compiled by the JIT is also an almost one-to-one translation
of the C# instructions.

Now rumors are that the 64-bit JIT of .NET Framework 2.0 has more time to do
things, so it creates much more optimized code, but the 32-bit version does
not.
The reason why the JIT is not optimizing that much is that it would take
a very long time to load the .NET program. So they chose less optimization
to speed up the loading process.

I also discovered that there exists an NGEN program that pre-compiles a .NET
program, so it produces a more optimized result that also runs faster. But it
has some issues, so it is not used that much. But rumors also tell me that
this NGEN is actually used to compile the .NET Framework assemblies during
the install procedure so that they execute faster.

So a lot of interesting things to discover. :)
 
Jeff_Relf

Hi Olaf, You told me: << Nice test. :) >>

Thanks, and now that I've thought about it some more,
I think your method, given current constraints,
is the most readable, as well as the fastest:

uint_32 Swap_32 ( int X ) {
return uint_8( X ) << 24 | uint_8( X >> 8 ) << 16
| uint_8( X >> 16 ) << 8 | uint_8( X >> 24 ); }
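
One property of this shift-based form is that it works on the value of X rather than on its bytes in memory, so it reverses the bytes on any host, whatever its byte order. A sketch with unsigned types throughout, which also avoids arithmetic right shifts of negative values; the name swap_32_by_value is an assumption for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Reverses the four bytes of x using only shifts and narrowing casts.
constexpr std::uint32_t swap_32_by_value(std::uint32_t x) {
    return (std::uint32_t(std::uint8_t(x))       << 24)
         | (std::uint32_t(std::uint8_t(x >> 8))  << 16)
         | (std::uint32_t(std::uint8_t(x >> 16)) <<  8)
         |  std::uint32_t(std::uint8_t(x >> 24));
}

// Checked at compile time with the test value used in the thread.
static_assert(swap_32_by_value(0x84838281u) == 0x81828384u,
              "bytes should come out in reverse order");
```

Applying the function twice gives back the original value.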

But I still think the following code should be legal,
as it would be even more readable and would be quite optimizable.

uint_32 Big_First_32 ( int X ) { uint_8_P P = ( uint_8_P ) & X ;
return * P << 24 | * ++ P << 16 | * ++ P << 8 | * ++ P ; }

Re: Adding the IL Assembler to C#'s compiler
and possibly optimizing managed code on install,

That would certainly speed things up.

As long as you're making changes to the compiler like that,
I vote for #define and more sequence_points.
 
The Ghost In The Machine

In comp.os.linux.advocacy, Jeff_Relf wrote:
Hi Spooky, I'm saying that the following code should work, but doesn't:

Large_is_First_32 ( int X ) {
unsigned char * P = ( unsigned char * ) & X ;
return * P ++ << 24 | * P ++ << 16 | * P ++ << 8 | * P ; }

C_99 and MS_CPP should both be updated so that it will work, in my opinion,
because P [ 1 ] << 16 | P [ 2 ] << 8

For GCC/x86 and X = 0x84838281,

return * P ++ << 24 | * P ++ << 16 | * P ++ << 8 | * P ; }

returns 0x81818181.

Looks like GCC has a "bug" as well.

An alternative formulation:

return * P << 24 | * ++ P << 16 | * ++ P << 8 | * ++ P ; }

does work correctly in GCC.


is too ugly, too weird, and too unnecessary.

This is the code that works:

#include <StdLib.H>
typedef char * int_8_P ; typedef unsigned char * uint_8_P ; int rv ;

int Large_is_First_32 ( int X ) {
uint_8_P P = ( uint_8_P ) & X ;
return * P << 24 | P [ 1 ] << 16 | P [ 2 ] << 8 | P [ 3 ] ; }

int main(){ int_8_P P = "84838281" ;
int X = strtoul( P, & P, 16 ); rv = Large_is_First_32( X );
return rv ; } // rv is now 0x81828384

The optimizer inlines the code as follows:
rv = * P << 24 | P [ 1 ] << 16 | P [ 2 ] << 8 | P [ 3 ] ;

1D xor ecx,ecx
1F mov ch,al
21 movzx edx,ah
24 mov dword ptr [ esp + 0Ch ],eax
28 shr eax,10h
2B or ecx,edx
2D movzx edx,al
30 shl ecx,8
33 or ecx,edx
35 movzx eax,ah
38 shl ecx,8
3B or eax,ecx
3D mov dword ptr [ rv ( 403018h ) ],eax

Such fast and pretty code !

Feh.

Try this code for pretty. This is in MASM/Intel syntax:

mov edx, eax ; edx=0x84838281
shr eax, 16 ; eax=0x00008483
xchg eax, edx ; edx=0x00008483,eax=0x84838281
xchg al, ah ; eax=0x84838182
shl eax, 16 ; eax=0x81820000
xchg dl, dh ; edx=0x00008384
or eax, edx ; eax=0x81828384

There's probably better methods but I've already reduced it to 7
instruction lines by simply coding it by hand. Works like a champ.

I could reduce it to 3 if there's a method to exchange
the 16-bit register ax with the high 16 bits of eax.
However, the obvious choices 'xchg eax,ax' and 'xchg eah,ax'
both fell flat.
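
The three-instruction goal looks reachable with rotates instead of xchg: rol ax, 8 / rol eax, 16 / rol ax, 8 (and x86 processors from the 486 on also have a single bswap instruction). The sketch below only emulates both sequences in C++ so the register comments can be checked step by step; the function names are made up, and no claim is made about which sequence runs faster.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Emulates "xchg al, ah" (equivalently "rol ax, 8"): swaps the two low
// bytes and leaves the upper 16 bits untouched.
std::uint32_t swap_al_ah(std::uint32_t r) {
    return (r & 0xFFFF0000u) | ((r & 0x00FFu) << 8) | ((r & 0xFF00u) >> 8);
}

// Emulates "rol eax, 16": exchanges the high and low 16-bit halves.
std::uint32_t rol_16(std::uint32_t r) { return (r << 16) | (r >> 16); }

// The seven-instruction sequence from the post, one statement per instruction.
std::uint32_t seven_step_swap(std::uint32_t eax) {
    std::uint32_t edx = eax;   // mov  edx, eax
    eax >>= 16;                // shr  eax, 16
    std::swap(eax, edx);       // xchg eax, edx
    eax = swap_al_ah(eax);     // xchg al, ah
    eax <<= 16;                // shl  eax, 16
    edx = swap_al_ah(edx);     // xchg dl, dh
    return eax | edx;          // or   eax, edx
}

// The rotate-based alternative.
std::uint32_t three_rotate_swap(std::uint32_t eax) {
    eax = swap_al_ah(eax);     // rol ax, 8
    eax = rol_16(eax);         // rol eax, 16
    return swap_al_ah(eax);    // rol ax, 8
}
```

Both functions map 0x84838281 to 0x81828384, matching the register comments in the listing above.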

(I don't know if GCC is up to producing this quality of code,
or not. It would take quite some doing -- one could call it
"grokking" -- the subtleties of the machine architecture.)
 
Tukla Ratte

Thanks. Hate giving up on a cause no matter how hopeless, but there's a
limit y'know? Used to spend ages bashing my head against a twit called
Boatwright

Eww! You let your head touch Boaty?!

How long did you have to stay in decontamination?
 
Jeff_Relf

Hi Spooky, Re: Your mod of the code I showed, You told me: <<
return * P << 24 | * ++ P << 16 | * ++ P << 8 | * ++ P ; }
does work correctly in GCC. >>

But that's a quirk, of course.
The solution would be to modify the MS_CPP and gcc compilers themselves,
as well as the C_2006 standard, to introduce Many more sequence points.

For example,
Each of the following should have its left_to_right'ness guaranteed,
complete with so_called sequence_points in the obvious places:
1. char Buffer [] = { func(), /* Seq_Pnt */ func(), /* Seq_Pnt */ func() };
2. func( func(), /* Seq_Pnt */ func(), /* Seq_Pnt */ func() );
3. return * ++ P << 16 | /* Seq_Pnt */ * ++ P << 8 ;

At the very Very least, a compiler warning should be thrown !
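
Part of this wish list exists in later C++ standards, which the compilers discussed in this thread predate: since C++11 the initializers inside a braced list (case 1) are evaluated in the order written. The relative order of function arguments (case 2) is still unspecified, and the operands of | (case 3) remain unsequenced. A minimal sketch of case 1, assuming a C++11 or later compiler; counter and next_value are hypothetical names.

```cpp
#include <cassert>

int counter = 0;

// Hypothetical function with a visible side effect: returns 1, 2, 3, ...
int next_value() { return ++counter; }

// Returns true when the three initializers ran left to right.
bool braced_init_is_ordered() {
    counter = 0;
    int buffer[3] = { next_value(), next_value(), next_value() };
    return buffer[0] == 1 && buffer[1] == 2 && buffer[2] == 3;
}
```

The same three calls written as function arguments, f( next_value(), next_value(), next_value() ), carry no such guarantee.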

Re: How I liked the disassembly of the working C++ code I showed,

You replied: << Feh. Try this code for pretty.
This is in MASM/Intel syntax:

mov edx, eax ; edx=0x84838281
shr eax, 16 ; eax=0x00008483
xchg eax, edx ; edx=0x00008483,eax=0x84838281
xchg al, ah ; eax=0x84838182
shl eax, 16 ; eax=0x81820000
xchg dl, dh ; edx=0x00008384
or eax, edx ; eax=0x81828384

There's probably better methods but I've already reduced it to 7
instruction lines by simply coding it by hand. Works like a champ. >>

Wow, I'm impressed ! So much for my claim that you don't do assembly anymore.

You added: << I could reduce it to 3 if there's a method to exchange
the 16-bit register ax with the high 16 bits of eax. However,
the obvious choices xchg eax,ax and xchg eah,ax both fell flat. >>

Wow again, You just taught me some x86 assembly:
1. ax is a short.
2. al is ax's low byte.
3. ah is ax's high byte.
4. eax is a long.

Olaf taught me that this is moving an int off the stack to ecx

mov ecx, dword ptr [ esp + 18h ]

You concluded: <<
I don't know if GCC is up to producing this quality of code, or not.
It would take quite some doing -- one could call it grokking
-- the subtleties of the machine architecture. >>

It'd be easier to just inline the assembly.
To time it, you could add it to something like my Kelsey.CPP
http://www.Cotse.NET/users/jeffrelf/Kelsey.CPP
http://www.Cotse.NET/users/jeffrelf/Kelsey.VCPROJ

But, as I keep repeating, readability is my only goal, not speed.
I say the following code is more readable,
....and it Should be legal and very optimizable... but it's neither.

uint_32 Big_First_32 ( int X ) { uint_8_P P = ( uint_8_P ) & X ;
return * P << 24 | * ++ P << 16 | * ++ P << 8 | * ++ P ; }
 
Olaf Baeyens

Feh.
Try this code for pretty. This is in MASM/Intel syntax:

mov edx, eax ; edx=0x84838281
shr eax, 16 ; eax=0x00008483
xchg eax, edx ; edx=0x00008483,eax=0x84838281
xchg al, ah ; eax=0x84838182
shl eax, 16 ; eax=0x81820000
xchg dl, dh ; edx=0x00008384
or eax, edx ; eax=0x81828384

There's probably better methods but I've already reduced it to 7
instruction lines by simply coding it by hand. Works like a champ.

I could reduce it to 3 if there's a method to exchange
the 16-bit register ax with the high 16 bits of eax.
However, the obvious choices 'xchg eax,ax' and 'xchg eah,ax'
both fell flat.
Very nice. :)

But the big question now, on modern processors, is whether yours is faster.
I do know that simple instructions that are used often tend to be
faster than instructions that have a single opcode but are not used that much.
And another thing is that modern processors execute instructions out of order.
So fewer instructions does not necessarily mean faster performance, but it
could, and in your solution it might.

The only way to determine what is faster is to actually measure it.
And then again, it might be processor dependent.
 
Olaf Baeyens

return * P << 24 | * ++ P << 16 | * ++ P << 8 | * ++ P ; }
does work correctly in GCC. >>

But that's a quirk, of course.
Yes, it should have worked, in my opinion too.
But this is life, nothing is perfect.

I also discovered one time that if you use events in a C++ class in VC++ 2002,
you get an access violation if the class happens not to have a
constructor defined and implemented in the header file.
So for a week I struggled with that event thing, and funnily enough the examples
worked, but mine failed. Then I started to realize, looking at the assembly
code, that the variables of those events didn't get initialized. Clearly a
bug in the C++ compiler. Now my code works fine, since I now put a complete
constructor in the header file. So the code generated is now correct.
At the very Very least, a compiler warning should be thrown !
I fully agree.
Wow, I'm impressed ! So much for my claim that you don't do assembly anymore.
In my case it is 15-year-old knowledge. I only use it to look at the
generated code to find bugs in my code and to optimize my functions to speed
them up without resorting to assembler.
Or to learn a new language, because I can compare it to something I already
know.

And now I do this with the IL assembler generated by .NET. One thing I discovered
is that properties are not optimized, not inlined, so that could explain why
some C# code could be slower. But then again, if I look at my C++ code from
VC++ 2003 Standard, none of my properties get inlined either. And this
explains why my C++ code and C# code run at almost the same speed on the same
computer and the same OS and compiled with the same compiler environment. I
hope that VS 2005 gets a better optimizer for that.

You added: << I could reduce it to 3 if there's a method to exchange
the 16-bit register ax with the high 32-bits of eax. However,
the obvious choices xchg eax,ax and xchg eah,ax both fell flat. >>

Wow again, You just taught me some x86 assembly:
1. ax is a short.
2. al is ax's low byte.
3. ah is ax's high byte.
4. eax is a long.
In the case of Intel-like processors, the ax/eax register is specialized
for processing things. It is another name for the accumulator.
You will discover that many operations are done on that register, so a lot of
code copies registers to eax and then moves the result back.

Another thing to know is that you cannot access the upper word part of eax.
(or the ebx, ecx, edx)
Only the lower word part that could be split into a high byte and a low
byte.

eax is the 32 bit register
ax is the same as LOWORD(eax)
and al would be like LOBYTE(ax)
and ah would be like HIBYTE(ax)

So to get the HIWORD(eax), you must shift it right by 16 (>> 16) so that it
ends up in the ax part. Then you can access it.
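
The register views described above can be sketched in C++ with masks and shifts. The function names are modeled on the LOWORD/HIWORD/LOBYTE/HIBYTE macros mentioned in the post but are written out here as plain functions, so they are illustrative rather than the Windows definitions.

```cpp
#include <cassert>
#include <cstdint>

// ax: the low 16 bits of eax.
std::uint16_t lo_word(std::uint32_t eax) { return static_cast<std::uint16_t>(eax & 0xFFFFu); }

// The high 16 bits of eax, which have no register name of their own;
// the shift moves them down into the position of ax.
std::uint16_t hi_word(std::uint32_t eax) { return static_cast<std::uint16_t>(eax >> 16); }

// al: the low byte of ax.
std::uint8_t lo_byte(std::uint16_t ax) { return static_cast<std::uint8_t>(ax & 0xFFu); }

// ah: the high byte of ax.
std::uint8_t hi_byte(std::uint16_t ax) { return static_cast<std::uint8_t>(ax >> 8); }
```

With eax = 0x84838281, ax is 0x8281, al is 0x81, ah is 0x82, and the upper half 0x8483 is only reachable after the shift.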

Typically eax is used for calculating things.
ebx is typically used as a base pointer,
ecx is typically used as a counter,
edx is typically used as a data register, and esi and edi as the source and
destination index pointers.
But in the case of an optimizing compiler you might lose that relationship.

Another thing: something like "xor ecx,ecx" is actually saying set
ecx to zero.
This encoding is only two bytes and very fast compared to loading the
register with an actual value.

And the reason why the P++ and P-- tends to be faster in C++ is because it
translates to an assembler instruction
inc ebx // P++
dec ebx // P--

While P=P+1 would probably translate to something like this (just
guessing)
//P++
mov eax,ebx
mov ebx,1
add eax,ebx

//P--
mov eax,ebx
mov ebx,1
sub eax,ebx
Olaf taught me that this is moving an int off the stack to ecx

mov ecx, dword ptr [ esp + 18h ]
Yes, a value on the stack, 18h bytes above the current stack pointer. :)
With esp-relative addressing like this it can be either a local variable or
a parameter passed in from outside your function. With a frame pointer you
would typically see locals as [ ebp - n ] and parameters as [ ebp + n ].
 
Raging Lunatic

Jeff_Relf said:
uint_32 Big_First_32 ( int X ) { uint_8_P P = ( uint_8_P ) & X ;
return * P << 24 | * ++ P << 16 | * ++ P << 8 | * ++ P ; }

I thought it was pretty cool that Ghost basically doused your code with
some shit that was totally Elegant and that makes Ghost a Mogul and you
a Serf.

Prolly if he took the time to read your 'c' code...well let's just say
the Machine Got Gaaaame.
 
