2 or more Processors Support

Guest · Oct 4, 2003

Hi,

I'm aware that the Windows 2000 family supports 2 or more processors. Does that mean any program running on Windows 2000 will be able to take advantage of the additional processing power if I add more processors? Do I need to specifically design my program in order to exploit the additional processing power?

Thanks
Chew

dcdon · Oct 4, 2003

No, not all programs take advantage of hyperthreading or multiple processors.
Only 32-bit or 64-bit programs can make use of multiple processors. I suggest you
design CPU-intensive code using MMX technology, based on 32-bit or 64-bit
processing.

In general, you don't learn scalar (non-MMX) 80x86 assembly language programming
and then use that same mindset when writing programs using the MMX instruction
set. While it is possible to directly use various MMX instructions the same way
you would the general purpose integer instructions, one phrase comes to mind when
working with MMX: think parallel. This text has spent many hundreds of pages up to
this point attempting to get you to think in assembly language; to think that this
small section can teach you how to design optimal MMX sequences would be ludicrous.
Nonetheless, a few simple examples are useful to help start you thinking about how
to use the MMX instructions to your benefit in your programs. This section will
begin by presenting some fairly obvious uses for the MMX instruction set, and then
it will attempt to present some examples that exploit the inherent parallelism of
the MMX instructions.

Since the MMX registers are 64 bits wide, you can double the speed of certain data
movement operations by using MMX registers rather than the 32-bit general purpose
registers. For example, consider the following code from the HLA Standard Library
that copies one character set object to another.

(This example is lifted straight from the source below, but it may show what to do.)

Source Site:
http://webster.cs.ucr.edu/Page_AoALinux/HTML/TheMMXInstructionSeta3.html

procedure cs.cpy( src:cset; var dest:cset ); nodisplay;
begin cpy;

    push( eax );
    push( ebx );
    mov( dest, ebx );
    mov( (type dword src), eax );
    mov( eax, [ebx] );
    mov( (type dword src[4]), eax );
    mov( eax, [ebx+4] );
    mov( (type dword src[8]), eax );
    mov( eax, [ebx+8] );
    mov( (type dword src[12]), eax );
    mov( eax, [ebx+12] );
    pop( ebx );
    pop( eax );

end cpy;

Program 11.2 HLA Standard Library cs.cpy Routine

This is a relatively simple code sequence. Indeed, a fair amount of the execution
time is spent copying the parameters (20 bytes) onto the stack, calling the
routine, and returning from the routine. This entire sequence can be reduced to
the following four MMX instructions:

    movq( (type qword src), mm0 );
    movq( (type qword src[8]), mm1 );
    movq( mm0, (type qword dest));
    movq( mm1, (type qword dest[8]));

Of course, this sequence assumes two things: (1) it's okay to wipe out the values
in MM0 and MM1, and (2) you'll execute the EMMS instruction a little later on
after the execution of some other MMX instructions. If either, or both, of these
assumptions is incorrect, the performance of this sequence won't be quite as good
(though probably still better than the cs.cpy routine). However, if these two
assumptions do hold, then it's relatively easy to implement the cs.cpy routine as
an in-line function (i.e., a macro) and have it run much faster. If you really
need this operation to occur inside a procedure and you need to preserve the MMX
registers, and you don't know if any MMX instructions will execute shortly
thereafter (i.e., you'll need to execute EMMS), then it's doubtful that using the
MMX instructions will help here. However, in those cases when you can put the code
in-line, using the MMX instructions will be faster.
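For readers who would rather stay in C++ than HLA, the same idea (moving the 16-byte set as two 64-bit words instead of four 32-bit ones) can be sketched as follows. This is only an illustration: the `CharSet` struct and `cset_copy` function are hypothetical names, not part of the HLA Standard Library, and whether the compiler emits 64-bit or wider moves is up to the compiler.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for an HLA cset: 16 bytes, one bit per character code.
struct CharSet {
    std::uint8_t bits[16];
};

// Copy the set as two 64-bit words, mirroring the two MOVQ load/store pairs.
// std::memcpy keeps the accesses well-defined regardless of alignment.
inline void cset_copy(const CharSet& src, CharSet& dest) {
    std::uint64_t lo;
    std::uint64_t hi;
    std::memcpy(&lo, src.bits, 8);       // like movq( src, mm0 )
    std::memcpy(&hi, src.bits + 8, 8);   // like movq( src[8], mm1 )
    std::memcpy(dest.bits, &lo, 8);      // like movq( mm0, dest )
    std::memcpy(dest.bits + 8, &hi, 8);  // like movq( mm1, dest[8] )
}
```

Because the function is declared inline and takes references, there is no 20-byte parameter copy of the kind the text describes for the HLA procedure call.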

Warning: don't get too carried away with the MMX MOVQ instruction. Several
programmers have gone to great extremes to use this instruction as part of a high
performance MOVSD replacement. However, except in very special cases on very well
designed systems, the limiting factor for a block move is the speed of memory.
Since Intel has optimized the operation of the MOVSD instruction, you're best off
using the MOVSD instructions when moving blocks of memory around.

Earlier, this chapter used the cs.difference function as an example when
discussing the PANDN instruction. Here's the original HLA Standard Library
implementation of this function:

procedure cs.difference( src:cset; var dest:cset ); nodisplay;
begin difference;

    push( eax );
    push( ebx );
    mov( dest, ebx );
    mov( (type dword src), eax );
    not( eax );
    and( eax, [ebx] );
    mov( (type dword src[4]), eax );
    not( eax );
    and( eax, [ebx+4] );
    mov( (type dword src[8]), eax );
    not( eax );
    and( eax, [ebx+8] );
    mov( (type dword src[12]), eax );
    not( eax );
    and( eax, [ebx+12] );
    pop( ebx );
    pop( eax );

end difference;

Program 11.3 HLA Standard Library cs.difference Routine

Once again, the high-level nature of HLA is hiding the fact that calling this
function is somewhat expensive. A typical call to cs.difference emits five or more
instructions just to push the parameters (it takes four 32-bit PUSH instructions
to pass the src character set because it is a value parameter). If you're willing
to wipe out the values in MM0 and MM1, and you don't need to execute an EMMS
instruction right away, it's possible to compute the set difference with only six
instructions - that's about the same number of instructions (and often fewer) than
are needed to call this routine, much less do the actual work. Here are those six
instructions:

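    // PANDN inverts its destination operand and then ANDs in the source,
    // so src must be loaded into the registers to get dest AND (NOT src).
    movq( src, mm0 );
    movq( src[8], mm1 );
    pandn( dest, mm0 );     // mm0 := (not src) and dest
    pandn( dest[8], mm1 );
    movq( mm0, dest );
    movq( mm1, dest[8] );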

These six instructions replace 12 of the instructions in the body of the function.
The sequence is sufficiently short that it's reasonable to code it in-line rather
than in a function. However, if you were to bury this code in the cs.difference
routine, and you needed to preserve MM0 and MM1 and execute EMMS
afterwards, it would cost more than it's worth. As an in-line macro, however, it
is going to be significantly faster since it avoids passing parameters and the
call/return sequence.

If you want to compute the intersection of two character sets, the instruction
sequence is identical to the above except you substitute PAND for PANDN.
Similarly, if you want to compute the union of two character sets, use the code
sequence above substituting POR for PANDN. Again, both approaches pay off
handsomely if you insert the code in-line rather than burying it in a procedure
and you don't need to preserve MMX registers or execute EMMS afterwards.
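As a rough C++ counterpart to the three set operations just described, the sketch below treats a character set as two 64-bit words and applies one bitwise operation per word. The `CharSet128` type and the function names are hypothetical, chosen only for this illustration.

```cpp
#include <cstdint>

// Hypothetical 128-bit character set stored as two 64-bit words.
struct CharSet128 {
    std::uint64_t w[2];
};

// dest := dest - src (remove every member of src from dest).
inline void cset_difference(const CharSet128& src, CharSet128& dest) {
    dest.w[0] &= ~src.w[0];
    dest.w[1] &= ~src.w[1];
}

// dest := dest * src (keep only members present in both sets).
inline void cset_intersection(const CharSet128& src, CharSet128& dest) {
    dest.w[0] &= src.w[0];
    dest.w[1] &= src.w[1];
}

// dest := dest + src (add every member of src to dest).
inline void cset_union(const CharSet128& src, CharSet128& dest) {
    dest.w[0] |= src.w[0];
    dest.w[1] |= src.w[1];
}
```

As in the assembly version, each operation touches only half of the set at a time, so this is a wider scalar operation rather than real parallelism.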

We can continue with this exercise of working our way through the HLA Standard
Library character set (and other) routines substituting MMX instructions in place
of standard integer instructions. As long as we don't need to preserve the MMX
machine state (i.e., registers) and we don't have to execute EMMS, most of the
character set operations will be short enough to code in-line. Unfortunately,
we're not buying that much over coding the standard implementations of these
functions in-line from a performance point of view (though the code would be quite
a bit shorter). The problem here is that we're not "thinking in MMX." We're still
thinking in scalar (non-parallel) mode, and the fact that the MMX instruction set
requires a lot of set-up (well, "tear-down" actually) negates many of the
advantages of using MMX instructions in our programs.

The MMX instructions are most appropriate when you compute multiple results in
parallel. The problem with the character set examples above is that we're not even
processing a whole data object with a single instruction; we're actually only
processing half of a character set with a sequence of three MMX instructions
(i.e., it requires six instructions to compute the intersection, union, or
difference of two character sets). At best, we can only expect the code to run
about twice as fast since we're processing 64 bits at a time instead of 32 bits.
Executing EMMS (and, God help us, having to preserve MMX registers) negates much
of what we might gain by using the MMX instructions. Again, we're only going to
see a speed improvement if we process multiple objects with a single MMX
instruction. We're not going to do that manipulating large objects like character
sets.

One data type that will let us easily manipulate up to eight objects at one time
is a character string. We can speed up many character string operations by
operating on eight characters in the string at one time. Consider the HLA Standard
Library str.uppercase procedure. This function steps through each character of a
string, tests to see if it's a lower case character, and if so, converts the lower
case character to upper case. A good question to ask is "can we process eight
characters at a time using the MMX instructions?" The answer turns out to be yes
and the MMX implementation of this function provides an interesting perspective on
writing MMX code.

At first glance it might seem impractical to use the MMX instructions to test for
lower case characters and convert them to upper case. Consider the typical scalar
approach that tests and converts a single character at a time:

    << Get character to convert into the AL register >>

    cmp( al, 'a' );
    jb noConversion;
    cmp( al, 'z' );
    ja noConversion;
    sub( $20, al ); // Could also use AND($5f, al); here.

noConversion:

This code first checks the value in AL to see if it's actually a lower case
character (that's the CMP and Jcc instructions in the code above). If the
character is outside the range 'a'..'z' then this code skips over the conversion
(the SUB instruction); however, if the code is in the specified range, then the
sequence above drops through to the SUB instruction and converts the lower case
character to upper case by subtracting $20 from the lower case character's ASCII
code (since lower case characters always have bit #5 set, subtracting $20 always
clears this bit).

Any attempt to convert this code directly to an MMX sequence is going to fail.
Comparing and branching around the conversion instruction only works if you're
converting one value at a time. When operating on eight characters simultaneously,
any mixture of the eight characters may or may not require conversion from lower
case to upper case. Hence, we need to be able to perform some calculation that is
benign if the character is not lower case (i.e., doesn't affect the character's
value) while converting the character to upper case if it was lower case to begin
with. Worse, we have to do this with pure computation since flow of control isn't
going to be particularly effective here (if we test each individual result in our
MMX register we won't really save anything over the scalar approach). To save you
some suspense, yes, such a calculation does exist.

Consider the following algorithm that converts lower case characters to upper
case:

    << Get character to test into AL >>

    cmp( al, 'a' );
    setae( bl );      // bl := al >= 'a'
    cmp( al, 'z' );
    setbe( bh );      // bh := al <= 'z'
    and( bh, bl );    // bl := (al >= 'a') && (al <= 'z');
    dec( bl );        // bl := $FF/$00 if false/true.
    not( bl );        // bl := $FF/$00 if true/false.
    and( $20, bl );   // bl := $20/$00 if true/false.
    sub( bl, al );    // subtract $20 if al was lowercase.

This code sequence is fairly straightforward up until the DEC instruction above.
It computes true/false in BL depending on whether AL is in the range 'a'..'z'. At
the point of the DEC instruction, BL contains one if AL is a lower case character,
it contains zero if AL's value is not lower case. After the DEC instruction, BL
contains $FF for false (AL is not lower case) and $00 for true (AL is lowercase).
The code is going to use this as a mask a little later, but it really needs true
to be $FF and false $00, hence the NOT instruction that follows. The (second) AND
instruction above converts true to $20 and false to $00 and the final SUB
instruction subtracts $20 if AL contained lower case, it subtracts $00 from AL if
AL did not contain a lower case character (subtracting $20 from a lower case
character will convert it to upper case).
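The same mask-building steps can be written in C++, which may be easier to follow than the register shuffling. This is a minimal sketch: `to_upper_branchless` is a hypothetical name, and a compiler is free to generate code that looks quite different from the HLA sequence.

```cpp
#include <cstdint>

// Convert one ASCII character to upper case using only arithmetic,
// following the SETcc / AND / DEC / NOT / AND / SUB steps above.
inline std::uint8_t to_upper_branchless(std::uint8_t c) {
    std::uint8_t ge_a = static_cast<std::uint8_t>(c >= 'a');  // like SETAE: 1 or 0
    std::uint8_t le_z = static_cast<std::uint8_t>(c <= 'z');  // like SETBE: 1 or 0
    std::uint8_t mask = static_cast<std::uint8_t>(ge_a & le_z);  // 1 if lower case
    mask = static_cast<std::uint8_t>(mask - 1);  // DEC: $00 if lower case, else $FF
    mask = static_cast<std::uint8_t>(~mask);     // NOT: $FF if lower case, else $00
    mask &= 0x20;                                // $20 if lower case, else $00
    return static_cast<std::uint8_t>(c - mask);  // subtract $20 or $00
}
```

Every input takes the same path through the function, which is the property that makes the calculation suitable for processing several characters at once.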

Whew! This sequence probably isn't very efficient when compared to the simpler
code given previously. Certainly there are more instructions in this version
(nearly twice as many). Whether this code without any branches runs faster or
slower than the earlier code with two branches is a good question. The important
thing to note here, though, is that we converted the lower case characters to
upper case (leaving other characters unchanged) using only a calculation; no
program flow logic is necessary. This means that the code sequence above is a good
candidate for conversion to MMX. Even if the code sequence above is slower than
the previous algorithm when converting one character at a time to upper case, it's
positively going to scream when it converts eight characters at a shot (since
you'll only need to execute the sequence one-eighth as many times).

The following is the code sequence that will convert the eight characters starting
at location [EDI] in memory to upper case:

static

    A:qword; @nostorage;
        byte $60, $60, $60, $60, $60, $60, $60, $60; // Note: $60 = 'a' - 1.

    Z:qword; @nostorage;
        byte $7B, $7B, $7B, $7B, $7B, $7B, $7B, $7B; // Note: $7B = 'z' + 1.

    ConvFactor:qword; @nostorage;
        byte $20, $20, $20, $20, $20, $20, $20, $20; // Magic value for lc->UC.
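To show how the three constants above could be used, here is a C++ sketch that applies the same calculation to eight characters, one byte lane at a time. It is an illustration of the idea rather than the MMX instruction sequence itself: `lane_gt` and `upper8` are hypothetical names, and the use of a signed greater-than comparison per lane (as the MMX packed byte compare does) is an assumption about how the constants are meant to be applied.

```cpp
#include <cstdint>
#include <cstring>

// One lane of a packed signed byte compare: $FF if a > b, else $00.
inline std::uint8_t lane_gt(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::int8_t>(a) > static_cast<std::int8_t>(b) ? 0xFF : 0x00;
}

// Convert the eight characters starting at p to upper case, in place.
inline void upper8(char* p) {
    std::uint8_t c[8];
    std::memcpy(c, p, 8);
    for (int i = 0; i < 8; ++i) {
        std::uint8_t above = lane_gt(c[i], 0x60);  // c > 'a' - 1  (constant A)
        std::uint8_t below = lane_gt(0x7B, c[i]);  // 'z' + 1 > c  (constant Z)
        std::uint8_t delta = static_cast<std::uint8_t>(above & below & 0x20);  // ConvFactor
        c[i] = static_cast<std::uint8_t>(c[i] - delta);
    }
    std::memcpy(p, c, 8);
}
```

The loop body contains no data-dependent branch that skips the conversion, so every lane does identical work; that is what allows a packed instruction set to handle all eight lanes in one pass.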

Mark · Oct 4, 2003

Applications that support Multithreading will be able to
make use of the extra processor.
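
As a rough illustration of what "supporting multithreading" means in practice, the C++ sketch below splits a simple summation into two halves and runs one half on a second thread. The function names are hypothetical, and it assumes a compiler with C++11 `std::thread` support, which is newer than the tools available for Windows 2000. A program written as a single loop on a single thread runs on only one processor at a time, however many are installed.

```cpp
#include <cstdint>
#include <thread>

// Sum the integers in [first, last) on the calling thread.
inline std::uint64_t sum_range(std::uint64_t first, std::uint64_t last) {
    std::uint64_t total = 0;
    for (std::uint64_t i = first; i < last; ++i) {
        total += i;
    }
    return total;
}

// Split [0, n) into two halves; a worker thread sums the lower half
// while the calling thread sums the upper half.
inline std::uint64_t sum_two_threads(std::uint64_t n) {
    std::uint64_t lo = 0;
    std::uint64_t hi = 0;
    const std::uint64_t mid = n / 2;
    std::thread worker([&lo, mid] { lo = sum_range(0, mid); });
    hi = sum_range(mid, n);
    worker.join();  // wait for the worker before reading lo
    return lo + hi;
}
```

On a machine with two processors the operating system can schedule the two threads on different processors; on a single-processor machine the same code still works but gains nothing.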

Guest · Oct 6, 2003

I don't think I would like to deal with low-level programming such as assembly language. If I've developed a simple program using Visual C++, say to calculate the factorial of a number, will it run faster on a machine with 2 or more processors than on a machine with only one processor (on the Win2k platform, of course)?
