Here's a Dell story you don't see too often


Yousuf Khan

Rupert said:
I've used (early) Solaris boxes that had 50+ students banging away
with C++ compilers that remained very responsive... I have even used
a Pentium Pro 200 on WinNT 3.51 that had one CPU maxed out, and yet
it remained responsive although it was a bit slow on the screen
repaint I guess. ;)

This is why I suspect his methodology is broken...

Solaris is actually an interesting case. Solaris, unlike most other
multiprocessor OSes, has only a single run queue for all of the processors.
Therefore Solaris takes a global view of overall system performance and
responsiveness. You can have 1 processor or 100 processors, but a single
Solaris image will still have only one overall run queue. It just allocates
threads to processors as it sees resources becoming free (although it will
attempt to preserve affinity between processes and processors too). So in a
system such as this, you can't get away with adding additional run queues
just by adding virtual processors.
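The contrast can be sketched with a toy scheduler (a hypothetical simulation, not Solaris code): with one global run queue, any processor that becomes free takes the next runnable thread, so no thread waits while a processor sits idle; with per-processor queues, a thread pinned to a busy queue waits even while another processor idles.

```python
from collections import deque

def run_global(num_cpus, tasks):
    """Single global run queue: each free CPU pulls the next task.
    tasks is a list of burst lengths; returns finish tick per task."""
    queue = deque(enumerate(tasks))
    cpu_free_at = [0] * num_cpus
    finish = {}
    while queue:
        # The CPU that frees up earliest takes the next runnable task.
        cpu = min(range(num_cpus), key=lambda c: cpu_free_at[c])
        tid, burst = queue.popleft()
        finish[tid] = cpu_free_at[cpu] + burst
        cpu_free_at[cpu] = finish[tid]
    return finish

def run_per_cpu(assignments, tasks):
    """Per-CPU queues: assignments[tid] pins each task to one CPU,
    so a task waits behind its queue even if another CPU is idle."""
    cpu_free_at = {}
    finish = {}
    for tid, burst in enumerate(tasks):
        cpu = assignments[tid]
        start = cpu_free_at.get(cpu, 0)
        finish[tid] = start + burst
        cpu_free_at[cpu] = finish[tid]
    return finish
```

With 2 CPUs and three equal tasks, the global queue finishes everything by tick 10, while pinning all three tasks to one CPU drags the last finish out to tick 15.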

Yousuf Khan
 

Rupert Pigott

Yousuf said:
Solaris is actually an interesting case. Solaris, unlike most other
multiprocessor OSes, has only a single run queue for all of the processors.
Therefore Solaris takes a global view of overall system performance and
responsiveness. You can have 1 processor or 100 processors, but a single
Solaris image will still have only one overall run queue. It just allocates
threads to processors as it sees resources becoming free (although it will
attempt to preserve affinity between processes and processors too). So in a
system such as this, you can't get away with adding additional run queues
just by adding virtual processors.

I had a little dig at the MS website; it looks like Processor Queue
Length is actually for all the processors, so my idea about the difference
being a question of the number of logical processors could be
groundless. That said, I note that MS recommends adding more
processors as one of the solutions to that problem. ;)

Still, those articles are shockingly poor... I found these quotes
to be contrary to my experience:

"That doesn't mean it doesn't have a place in the enterprise, though; an
Opteron-based system would be a good choice for tasks such as CAD, which
is basically a single-task, high-performance-requiring process."

Many of the VLSI CAD types I've seen pipeline their workflow...
I feel that the assertion that CAD "is basically a single-task"
is way too strong.

"Xeon's speed is good news for financial services companies such as
Morgan Stanley, Goldman Sachs, and Credit Suisse First Boston, which
have long used workstations to deliver the massive computing power
required to drive their trading operations (a single active trader can
easily bury a top-of-the-line PC). In an environment where time
literally is money,"

Strange, because all the trading gear I saw was not really compute-bound;
it was I/O-bound: DB access, network, disk access, etc. When it came to
compute, the machines were largely doing integer-type stuff like string
bashing. The hardcore crunching was done on servers...


Cheers,
Rupert
 

Rob Stow

No. The two physical Opteron processors appear to
every x86 OS as two physical processors.

Maybe. It depends on whether the OS knows how to
tell the difference between physical and logical
processors.

NT4 and W2K can't distinguish between physical and
logical processors, so those OSes will think each
HT-capable CPU (assuming HT is enabled in the BIOS)
is really two physical processors - for a total
of four physical processors. They will be blissfully
unaware that in reality you merely have four logical
processors running on two physical processors.

XP (with SP1) and W2K3 Server *can* distinguish between
physical and logical processors, so they will correctly
identify that you have two physical processors but
those OSes will still identify and use all four logical
processors.
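A rough Linux-side sketch of the same distinction (a hypothetical helper, not from the thread, and assuming a /proc/cpuinfo layout that exposes "siblings" and "cpu cores" per package): HT is active when a package reports more logical siblings than physical cores.

```python
import re

def hyperthreading_active(path="/proc/cpuinfo"):
    """Return True/False when detectable, None when the info is
    unavailable (non-Linux host, or fields missing from cpuinfo)."""
    try:
        text = open(path).read()
    except OSError:
        return None
    # "siblings" = logical CPUs per package; "cpu cores" = physical cores.
    siblings = re.search(r"^siblings\s*:\s*(\d+)", text, re.M)
    cores = re.search(r"^cpu cores\s*:\s*(\d+)", text, re.M)
    if not (siblings and cores):
        return None
    return int(siblings.group(1)) > int(cores.group(1))
```

On Windows, the equivalent distinction is what XP SP1 and W2K3 make internally; NT4 and W2K simply count logical processors as physical ones.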

Somebody else will have to comment on what Linux
and other OSes do with HT capable Xeons - I have
never tried Linux with HT capable P4s or Xeons.
 

Robert Myers

Rupert said:
I've used (early) Solaris boxes that had 50+ students banging away
with C++ compilers that remained very responsive... I have even used
a Pentium Pro 200 on WinNT 3.51 that had one CPU maxed out, and yet
it remained responsive although it was a bit slow on the screen
repaint I guess. ;)

Aren't a bunch of users remotely logged in somewhat analogous to the
nicely partitioned and/or embarrassingly parallel problems that lend
themselves so nicely to clusters? ;-).

Deal with one at a time: Application 1 on Display x doesn't interact at
all with Application 2 on Display y. Any single user can bring his own
display to its knees? Sure. They soon learn not to do that.

I can learn not to do that to my display when I'm the only user at the
console, too, but the whole point is that I'd like a box powerful enough
that I don't have to think about such things.

RM
 

alexi

Yousuf Khan said:
It doesn't matter if the Opteron system was also a multiprocessor system; it
still has half as many run queues to work with as the HT system. If you
group all of the background processes onto the virtual processors through
processor affinity, those background processes will only fight it out for
timeslices amongst themselves, leaving the foreground process free to occupy
its own private run queue. It doesn't matter if there were one processor, or
two processors, or 4 or 8; in each case the SMT system will have twice as
many run queues.

Yousuf Khan

Let me try again. The Xeon system has 4 logical processors, and the OS
presumably forms 4 queues. One queue is privately occupied by a foreground
process which is timed and reported as the benchmark result, right? The
other three queues are time-sliced across all other background tasks, but
they are not measured, so who cares how fast they run, right? (I know some
people care, but let's leave that issue aside for a moment.)

Now, the second system has 2 processors, and the system apparently forms 2
queues. Again, as you say, the cheat in that benchmark is that one queue is
dedicated to the timed foreground process, and all other background tasks
share resources of the second processor.

So, the question: where do you see a disadvantage of the Opteron system in
this setup? One queue is dedicated to the timed foreground task in both
cases; the rest of the load fights for timeslices amongst themselves in the
remaining queues. Since the background tasks are not measured, it shouldn't
matter how many queues are left, one or three. What am I missing again?

- Alexei
 

alexi

Scott Alfter said:
I don't think the OP was saying that the benchmark was forcing Windows
itself to run on one processor and apps to run on another. Instead, the
benchmark's main thread was set to run on one processor and its various
background threads were set to run on the other. The SetProcessAffinityMask
system call lets you restrict a process and its subthreads to run on the
processor(s) you specify:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/setprocessaffinitymask.asp

Setting the process affinity mask does allow you to restrict a process and
its subthreads to run on a specified processor, true. It allows for more
efficient use of the cache on the specified processor for that particular
process. However, it does not seem to prevent other processes from getting
time slices on the specified processor unless all the other processes are
restricted to the other processor. I still do not see any rationale in
Yousuf's original critique of the InfoWorld benchmark methodology and the
associated conspiracy:

http://groups.google.com/groups?q=g:thl3084042023d&dq=&hl=en&lr=&ie=UTF-8&selm=ByhUc.635%24E7T1.234%40news04.bloor.is.net.cable.rogers.com
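Alexei's point, that affinity restricts a process but reserves nothing, can be sketched on Linux, whose os.sched_setaffinity is a rough analogue of Windows' SetProcessAffinityMask (this is a hypothetical illustration assuming a Linux host, not the benchmark's actual setup):

```python
import os

def pin_to_cpu(cpu):
    """Restrict the CURRENT process (pid 0) to one CPU and return the
    resulting affinity mask. Note what this does NOT do: other, unpinned
    processes may still be scheduled on that same CPU, so pinning the
    foreground task does not give it a CPU to itself."""
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)
```

To actually isolate a timed foreground task, every other process would have to be pinned away from its CPU as well, which is exactly the objection raised above.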

- Alexi
 

Rupert Pigott

Robert said:
Aren't a bunch of users remotely logged in somewhat analogous to the
nicely partitioned and/or embarrassingly parallel problems that lend
themselves so nicely to clusters? ;-).

Erm, yeah, but this is about that InfoWorld article. I don't think
those guys would know a replicated PAR construct if it landed 999
punches on their nose simultaneously. ;)
Deal with one at a time: Application 1 on Display x doesn't interact at
all with Application 2 on Display y. Any single user can bring his own
display to its knees? Sure. They soon learn not to do that.

I can learn not to do that to my display when I'm the only user at the
console, too, but the whole point is that I'd like a box powerful enough
that I don't have to think about such things.

Well... The fact is: both those boxes were heavily laden, both
remained responsive while under extreme load, and neither of them
ran Xeons with SMT. Not only that, but they were considerably slower
anyway. The point being that I figure they must have done something
*really* stupid to make a dual Opteron "tank". The methodology link
is a joke. If I had submitted that as an undergraduate I would have
been escorted off-campus and shot.

Cheers,
Rupert
 

Robert Myers

Rupert Pigott wrote:

The point being that I figure they must have done something
*really* stupid to make a dual Opteron "tank". The methodology link
is a joke.

Okay. What _did_ he do to the Opteron system, and, for that matter, how
slow did it become? We don't know, and that's the real problem. No
argument from me on that point.

I'll risk guessing that, if I wanted to badly enough, I could design a
test that would cause a system with two physical processors and two
active threads to tank before a system with two physical processors and
four active threads.

Would such a test have any bearing at all on what an actual user would
experience in practice? I'm sure that, even if all the details were
available, we'd still have a lengthy discussion, and I have this
suspicion that the players would mostly be lined up the same way. :).

RM
 

chrisv

Rob Stow said:
Somebody else will have to comment on what Linux
and other OSes do with HT capable Xeons - I have
never tried Linux with HT capable P4s or Xeons.

The newer versions of Linux take advantage of HT. They detect it
during installation and make an SMP-capable version of the kernel the
boot default.
 

Rupert Pigott

Robert said:
Rupert Pigott wrote:
Okay. What _did_ he do to the Opteron system, and, for that matter, how
slow did it become? We don't know, and that's the real problem. No
argument from me on that point.

I'll risk guessing that, if I wanted to badly enough, I could design a
test that would cause a system with two physical processors and two
active threads to tank before a system with two physical processors and
four active threads.

Would such a test have any bearing at all on what an actual user would
experience in practice? I'm sure that, even if all the details were
available, we'd still have a lengthy discussion, and I have this
suspicion that the players would mostly be lined up the same way. :).

If he's really soooooooooo clued up about the workload and requirements
of the traders at investment banks, surely he could have rigged up some
kind of workload that vaguely approximates that application... He could
then have measured what the system throughput was and I'll bet that with
a bit of cunning he could have measured the screen repaint time as well.

Publishing hard figures would have been nice too.

Cheers,
Rupert
 

Yousuf Khan

alexi said:
So, the question: where do you see a disadvantage of the Opteron
system in this setup? One queue is dedicated to the timed foreground
task in both cases; the rest of load fights for timeslices amongst
themselves in remaining queues. Since the background tasks are not
measured, it shouldn't matter how many queues are left,
one or three. What am I missing again?

True, the effect would lessen with more physical processors, or at least it
_should_ lessen. But I think with just two physical processors you'll still
see the effect if they group the physical processors into their own
processor group, and the virtual processors into their own processor group.
This way you can still divide it up so that the timed and untimed streams
run in their own separate processor groups.

Processor grouping is available in most Unixes, so I would assume it's at
least a possibility in Windows too. Even if there isn't an explicit
processor grouping scheme available in Windows, it's pretty easy to figure
out which processors are physical and which ones are virtual anyway.
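That "easy to figure out" split could look like the following sketch, under one common enumeration where logical CPUs 2k and 2k+1 are HT siblings on the same physical core (this enumeration varies between BIOSes and OSes, so treat it as an assumption, not a rule):

```python
def split_siblings(num_logical):
    """Partition logical CPU ids into two sets, one sibling per core for
    the timed foreground stream and the other siblings for background
    load, assuming siblings are numbered (2k, 2k+1)."""
    timed = {cpu for cpu in range(num_logical) if cpu % 2 == 0}
    background = set(range(num_logical)) - timed
    return timed, background
```

On a 2-package, 4-logical-CPU Xeon box this yields {0, 2} for the timed stream and {1, 3} for the background group, i.e. the physical/virtual grouping described above.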

Yousuf Khan
 
