Itanium finally passes Alpha at HP


Steve Greatbanks

Alex Johnson said:
I believe you have misinterpreted the "16 processor" POWER5. IBM
actually refers to chips. "16 processor" as reported is 16 POWER5 chips,
comprised of 32 cores, allowing 64 threads of execution. So the 64-thread
Madison vs the 64-thread POWER5 having similar performance is just a sign
that things are about equal. I'm stunned by how good POWER5 is. But I
know that next year Montecito will go from 1 thread per package to 4
threads per package. Itanium will be down to a 16P system to compete with
IBM's 16P system.

Bill is right. The p5 570, as benchmarked for TPC-C [1], has 4 "building
blocks", each of which is a 4-way machine. Reading the relevant redpaper [2],
each of these building blocks has two processor slots, and each of the
processor cards contains a single DCM (dual-chip module). The DCM is
comprised of a dual-core POWER5 and the off-chip L3 cache. That means the
16-way box mentioned has 4 building blocks, each with two chips, each with
two cores (and each of the cores is SMT-capable), so the "16 processor"
POWER5 box is really "16 cores on 8 chips".

[1] http://www.tpc.org/tpcc/results/tpcc_result_detail.asp?id=104071202
[2] http://www.redbooks.ibm.com/redpapers/pdfs/redp9117.pdf
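
Spelling the arithmetic out (a trivial sketch, using only the counts from
the redpaper cited above):

#include <stdio.h>

int main(void)
{
    int building_blocks  = 4;  /* 4-way building blocks in the benchmarked config */
    int chips_per_block  = 2;  /* two processor cards, one DCM each */
    int cores_per_chip   = 2;  /* POWER5 is dual-core */
    int threads_per_core = 2;  /* each core is SMT-capable */

    int chips   = building_blocks * chips_per_block;  /* 8  */
    int cores   = chips * cores_per_chip;             /* 16 */
    int threads = cores * threads_per_core;           /* 32 */

    printf("%d chips, %d cores, %d hardware threads\n", chips, cores, threads);
    return 0;
}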
 

Rupert Pigott

Alex Johnson wrote:

[SNIP]
As explained above, if you compare per thread, these machines are
equivalent in size (64P Madison, 32P * 2 cores POWER4+, 16P * 2 cores *
2 threads POWER5).

Can you explain to me why you think 2-way SMT is equivalent
to 2 processors ?

I can't see how it can be in terms of transistor count or
performance characteristics (think about contention). Just
to add to my confusion, the consensus is that SMT gives < 30%
more oomph at best (depending on workload of course)...

I can see how you could claim that a 64 *package* Madison box
was analogous to a 32 *package* dual-core box though.

Cheers,
Rupert
 

Bill Todd

Rupert Pigott said:

Just to add to my confusion, the consensus is that SMT gives < 30%
more oomph at best (depending on workload of course)...

While this indeed seems to be about the limit (and in fact seems quite a bit
too generous on average) for the throughput boost that existing SMT
implementations can provide (though I may have encountered a claim of 40%
for one outlier application somewhere), the simulations performed for EV8
seemed to indicate that a single core with far more resources (in terms of
execution units, number of in-flight instructions supported, etc.) than the
current SMT cores provide could obtain significantly higher percentage
throughput boosts from SMT.

- bill
 

Nick Maclaren

While this indeed seems to be about the limit (and in fact seems quite a bit
too generous on average) for the throughput boost that existing SMT
implementations can provide (though I may have encountered a claim of 40%
for one outlier application somewhere), the simulations performed for EV8
seemed to indicate that a single core with far more resources (in terms of
execution units, number of in-flight instructions supported, etc.) than the
current SMT cores provide could obtain significantly higher percentage
throughput boosts from SMT.

As did Eggers' simulation, which was based on MIPS. My guess is that
much of the problem of Hyperthreading was that it had to start from
the x86 architecture, which was already too complex and messy. But
it could have been other factors as well.

However, the flaw in Eggers' work (I have not seen DEC's) is that it
did not compare SMT with CMP using the same number of transistors.
Yes, the latter would have been slower for serial code, and perhaps
even for a small number of threads, but is MUCH more scalable. Even
Eggers' model ran out of steam at 4 cores (just arguably 8). But, as
I have posted, all these TLAs and other acronyms are intended to give
a veneer of importance to minor variants of a general model.

There is a continuum of shared memory designs, according to how close
the sharing is to the core. SMT is nearly as close as it is possible
to get (and certainly as close as it is sane to go), but you can move
that out to L1, L2, L3 or main memory. And, of course, you can share
different resources at different levels - e.g. compare the memory
bandwidth handling between the Opteron, MIPS/SPARC and POWERx, all
of which do it differently, and at a different level from any on-chip
multi-threading.


Regards,
Nick Maclaren.
 

Paul Repacholi

However, the flaw in Eggers' work (I have not seen DEC's) is that it
did not compare SMT with CMP using the same number of transistors.

So what is the performance boost you get from CMP with ~110% of a single
core?

--
Paul Repacholi 1 Crescent Rd.,
+61 (08) 9257-1001 Kalamunda.
West Australia 6076
comp.os.vms,- The Older, Grumpier Slashdot
Raw, Cooked or Well-done, it's all half baked.
EPIC, The Architecture of the future, always has been, always will be.
 

Robert Myers

Bill said:

While this indeed seems to be about the limit (and in fact seems quite a bit
too generous on average) for the throughput boost that existing SMT
implementations can provide (though I may have encountered a claim of 40%
for one outlier application somewhere),

http://www-106.ibm.com/developerworks/linux/library/l-htl/

Table 7. 45% geometric mean improvement for handling chat rooms on a
linux kernel tweaked for Hyperthreading.

RM
 

Bill Todd

Robert Myers said:
http://www-106.ibm.com/developerworks/linux/library/l-htl/

Table 7. 45% geometric mean improvement for handling chat rooms on a
linux kernel tweaked for Hyperthreading.

Well, I suppose that could have been what I remembered - and it does appear
to be the outlier in the article. Still, somewhat better than I'd expect:
I wonder exactly what it is about that workload that's so much more
HT-friendly than the others (unless it's something dead-simple like a
ridiculously short time quantum per thread, such that context-switching
overheads dominate the workload and halving them helps a *lot*).

- bill
 

Nick Maclaren

So what is the performance boost you get from CMP with ~110% of a single
core?

Sigh. I said that the flaw in her work is that it did not provide
that information. No, I don't know. What I do know is that the
claimed benefits of SMT are dubious without that comparison.

OK?


Regards,
Nick Maclaren.
 

Robert Redelmeier

I wonder exactly what it is about that workload that's so
much more HT-friendly than the others (unless it's something
dead-simple like a ridiculously short time quantum per
thread, such that context-switching overheads dominate the
workload and halving them helps a *lot*).

Well, IRC does mean ridiculously short timeslices (little work)
before a blocking syscall that yields the CPU. Just shovelling
data from one port to another.

Also important in this case will be doing useful work while waiting on
memory fetches. The busmaster ethernet devices will
drop data into RAM, and some code (probably the kernel TCP/IP
stack) will stall loading it into cache.
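
To make that concrete, here's a minimal sketch (hypothetical, not taken
from any real server or from the benchmark) of the kind of per-connection
"shovel" loop being described: a few instructions of work, then block again.

#include <unistd.h>

/* Hypothetical relay loop: almost no work per wakeup, so the thread
 * spends most of its life blocked in read(), yielding the CPU. */
void relay(int from_fd, int to_fd)
{
    char buf[4096];
    ssize_t n;

    /* Each read() blocks; the handful of instructions between
     * syscalls is the entire "timeslice". */
    while ((n = read(from_fd, buf, sizeof buf)) > 0) {
        if (write(to_fd, buf, (size_t)n) != n)
            break;
    }
}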

-- Robert in Houston
 

Maynard Handley

Bill Todd said:
Well, I suppose that could have been what I remembered - and it does appear
to be the outlier in the article. Still, somewhat better than I'd expect:
I wonder exactly what it is about that workload that's so much more
HT-friendly than the others (unless it's something dead-simple like a
ridiculously short time quantum per thread, such that context-switching
overheads dominate the workload and halving them helps a *lot*).

- bill


For Christ's sake. Why do we have to keep going through this?
Surely it's really simple.
SMT will perform useful work if execution slots are available, and not
otherwise. So if the code running has properties like
- it frequently misses in L1 (either I or D)
- it frequently mispredicts branches
- it consists of long streams of sequentially dependent instructions
(say integer ops) on a machine that has two or three integer exec units
then SMT will work wonderfully.
If these properties don't hold, then it won't.
(And of course, there is the issue of locks and so on shared in L1 which
may help certain types of code.)
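
As a concrete (and purely hypothetical) illustration of the first and
third properties, a linked-list walk like the one below leaves most issue
slots empty, which is exactly the slack a second SMT thread can soak up:

/* Hypothetical SMT-friendly code: a linked-list walk.  Every iteration
 * depends on the previous load, and a long list won't fit in L1, so the
 * core mostly sits waiting on cache misses -- slots another hardware
 * thread could use. */
struct node { struct node *next; long payload; };

long sum_list(const struct node *p)
{
    long sum = 0;
    while (p) {
        sum += p->payload;   /* trivial work per load */
        p = p->next;         /* serially dependent, likely a cache miss */
    }
    return sum;
}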

BUT of course these are properties of an optimal SMT system. If the
particular IMPLEMENTATION of an SMT system is poorly designed, for
example the number of rename registers or completion buffers, when
shared across both threads, falls below the knee of the curve; or if the
system allows one blocked thread to block execution for both threads
(for example miss to main memory of thread 0, thread 0 keeps executing,
along with thread 1, until all the completion buffers are full, then
both threads block), then obviously the results may be far far more
disappointing than the above analysis would suggest.

As such, complaining about "SMT" being this or that is like complaining
that "RISC" is this or that; it's a complete waste of time for most
people. How about we establish a rule from now on that any discussions
about SMT start along the lines of
"SMT on Prescott sucks because of ..." or
"SMT on Power5 only gets 15% performance boost on my code, clearly
everyone at IBM is an idiot..."
Even more useful would be criticisms of specific SMT implementations
that actually tell us what went wrong --- not enough resources,
resources are statically not dynamically partitioned, even if one thread
is blocked, the second thread only gets half the fetch slots from the I1
cache, completion buffer/ROB fills up on L2 miss like I describe above
or whatever.


Maynard
 

David Schwartz

In comp.sys.ibm.pc.hardware.chips Bill Todd <[email protected]>
wrote:

The workload is bogus, deliberately designed to inflate the numbers.

Robert Redelmeier said:
Well, IRC does mean ridiculously short timeslices (little work)
before a blocking syscall that yields the CPU. Just shovelling
data from one port to another.
Also important in this case will be doing useful work during
memory latency fetches. The busmaster ethernet devices will
drop data into RAM, and some code (probably the kernel TCP/IP
stack) will stall loading it into cache.

If you look at the way they created the test, the 'chat' test is really
just a measure of how fast you can do context switches. With HT (and this
ridiculously unrealistic type of workload), you need half as many context
switches. Only an idiot would design a chat application such that a context
switch would be needed every time the server wanted to change which client
it was working on behalf of.

DS
 

Bill Todd

Maynard Handley said:
For Christ's sake. Why do we have to keep going through this?
Surely it's really simple.
SMT will perform useful work if execution slots are available, and not
otherwise. So if the code running has properties like
- it frequently misses in L1 (either I or D)
- it frequently mispredicts branches
- it consists of long streams of sequentially dependent instructions
(say integer ops) on a machine that has two or three integer exec units
then SMT will work wonderfully.
If these properties don't hold, then it won't.
(And of course, there is the issue of locks and so on shared in L1 which
may help certain types of code.)

BUT of course these are properties of an optimal SMT system. If the
particular IMPLEMENTATION of an SMT system is poorly designed, for
example the number of rename registers or completion buffers, when
shared across both threads, falls below the knee of the curve; or if the
system allows one blocked thread to block execution for both threads
(for example miss to main memory of thread 0, thread 0 keeps executing,
along with thread 1, until all the completion buffers are full, then
both threads block), then obviously the results may be far far more
disappointing than the above analysis would suggest.

As such, complaining about "SMT" being this or that is like complaining
that "RISC" is this or that; it's a complete waste of time for most
people. How about we establish a rule from now on that any discussions
about SMT start along the lines of
"SMT on Prescott sucks because of ..." or
"SMT on Power5 only gets 15% performance boost on my code, clearly
everyone at IBM is an idiot..."
Even more useful would be criticisms of specific SMT implementations
that actually tell us what went wrong --- not enough resources,
resources are statically not dynamically partitioned, even if one thread
is blocked, the second thread only gets half the fetch slots from the I1
cache, completion buffer/ROB fills up on L2 miss like I describe above
or whatever.


Maynard

Since the discussion prior to your rant above was quite explicit in its
differentiation among the various flavors of SMT on POWER5, EV8, Montecito,
and P4/Xeon, plus noting the differences in chip resources where applicable
that could affect the utility of the specific SMT implementation, I'm afraid
whatever point you thought you were making is unclear.

- bill
 

Nick Maclaren

If you look at the way they created the test, the 'chat' test is really
just a measure of how fast you can do context switches. With HT (and this
ridiculously unrealistic type of workload), you need half as many context
switches. Only an idiot would design a chat application such that a context
switch would be needed every time the server wanted to change which client
it was working on behalf of.

Eh? Why? That is precisely what you want to do to get security,
without having to be very clever. I agree that this is an unusual
requirement, but it is not unreasonable.


Regards,
Nick Maclaren.
 

Bill Todd

Nick Maclaren said:
Eh? Why? That is precisely what you want to do to get security,
without having to be very clever.

While there may be a legitimate differentiation between an 'idiot' and
someone who is merely not 'very clever', the underlying sentiments do not
seem all that different.

I'm not in the habit of completely ignoring performance in favor of
dirt-simple coding when I create software, unless performance is truly
unimportant. And that's especially true for production software, where any
effort expended may be repaid by benefits for literally millions of users.

File servers, which have far more stringent security requirements than chat
rooms, often not only eschew per-request context-switching but may run
entirely in the kernel to achieve optimal performance, even given the
resulting need to roll their own security mechanisms. So embedding
relatively simple security mechanisms in a chat-room server to achieve
significantly better performance hardly seems impractical.

- bill
 

Alex Johnson

Thu said:
When IBM says "16 processors" it means 16 cores. So a maximum p570 is
8 POWER5 dual-core chips, 16 cores, and with SMT 32 threads.

Here's their spec submission for a fully loaded p570:
http://www.spec.org/cpu2000/results/res2004q3/cpu2000-20040712-03234.html

Of course! How did I miss that?
Here are the best non-clustered results in terms of performance for
Itanium and POWER5/POWER4+ on a per-core basis.

That's a good read. Thanks for the research you put in.
A few things to note:
Montecito's dual-thread implementation is not SMT; it is the much
simpler HMT. The performance increase expected from this is much less
than from SMT.

Montecito's implementation is SoEMT. Maybe HMT means the same thing,
but I am unfamiliar with the abbreviation. What is "H" MT?
I'll do another comparison using specjbb2000

That's another interesting read. Unlike the TPC numbers, the ordering
of the competitors is not the same at each number of cores. At one
scale POWER4+ is leading by a mile, at the next SPARC64V by a mile, and
at one point Itanium 2 is ahead. Quite strange results. But then the
larger systems use lower speed parts. Another oddity.

Alex
 

Stephen Sprunk

Nick Maclaren said:
Eh? Why? That is precisely what you want to do to get security,
without having to be very clever. I agree that this is an unusual
requirement, but it is not unreasonable.

ircd, the oldest chat system still running with half a million or so current
users, does all operations in a single thread because that removes the need
for context switches, synchronization, and message sequence tracking. AFAIK
there's not been a security breach in over a decade.

I have no idea how IM systems operate, since (with the exception of Jabber)
they're not open source. However, I can't imagine that AIM servers have 100
million threads, one per user. That is clearly unreasonable with current OS
designs.

S
 

Greg Lindahl

Robert Redelmeier said:
Well, IRC does mean ridiculously short timeslices (little work)
before a blocking syscall that yields the CPU. Just shovelling
data from one port to another.

I don't know which IRC server you've worked with, but that's not how
it works; when it wakes up, it does as much as it can (non-blocking)
before it sleeps in select(). I believe the chat benchmark in question
is more aimed at a multi-threaded chat server, which does a lot less
work at a time.
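
For anyone who hasn't seen the pattern, a minimal sketch (a toy, not
ircd's actual code) of that wake, drain, sleep-in-select() loop looks
something like this:

#include <sys/select.h>
#include <unistd.h>

/* Toy single-threaded loop: sleep in select(), then drain whatever is
 * ready without blocking (descriptors assumed to be O_NONBLOCK). */
void serve(int *fds, int nclients, int maxfd)
{
    char buf[4096];

    for (;;) {
        fd_set rset;
        FD_ZERO(&rset);
        for (int i = 0; i < nclients; i++)
            FD_SET(fds[i], &rset);

        if (select(maxfd + 1, &rset, NULL, NULL, NULL) < 0)
            continue;                    /* e.g. interrupted by a signal */

        for (int i = 0; i < nclients; i++) {
            if (!FD_ISSET(fds[i], &rset))
                continue;
            ssize_t n;
            while ((n = read(fds[i], buf, sizeof buf)) > 0) {
                /* do as much work as possible here before sleeping again */
            }
            /* n == 0: peer closed; n < 0 with EAGAIN: drained for now */
        }
    }
}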

-- greg
 

Nick Maclaren

|>
|> ircd, the oldest chat system still running with half a million or so current
|> users, does all operations in a single thread because that removes the need
|> for context switches, synchronization, and message sequence tracking. AFAIK
|> there's not been a security breach in over a decade.

And CICS did the same. But, in order to deliver that security, they
have to impose a lot of constraints. One slip, and you have a
security breach - been there, seen that :-(

|> I have no idea how IM systems operate, since (with the exception of Jabber)
|> they're not open source. However, I can't imagine that AIM servers have 100
|> million threads, one per user. That is clearly unreasonable with current OS
|> designs.

What is unreasonable is that there are no current designs that can
handle it. Nobody is claiming that every system should be able to
work that way, but the fact that none can is not good.


Regards,
Nick Maclaren.
 

Bill Davidsen

David said:
The workload is bogus, deliberately designed to inflate the numbers.




If you look at the way they created the test, the 'chat' test is really
just a measure of how fast you can do context switches. With HT (and this
ridiculously unrealistic type of workload), you need half as many context
switches. Only an idiot would design a chat application such that a context
switch would be needed every time the server wanted to change which client
it was working on behalf of.

Okay, enlighten me, how do you handle client requests without a context
switch? Having a thread do a blocking read on a socket actually scales
better than select() on many systems (certainly Linux). Let's skip the
writes for a little while until I think about the issues. Having a mix
of slow and fast connections and comments going to multiple clients
makes it worthy of some thought.

Unless I misremember, both Apache and Sendmail use threads just because
of the select() scaling issues.
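
For contrast with the select() loop earlier in the thread, the
thread-per-connection approach looks roughly like this (a hypothetical
sketch, with the broadcast-to-other-clients logic left out):

#include <pthread.h>
#include <unistd.h>

/* Hypothetical thread-per-connection reader: the kernel scheduler
 * replaces select() as the multiplexer.  One thread per socket, each
 * blocked in read() until its client sends something. */
static void *client_thread(void *arg)
{
    int fd = (int)(long)arg;
    char buf[4096];
    ssize_t n;

    while ((n = read(fd, buf, sizeof buf)) > 0) {
        /* broadcast buf[0..n) to the other clients here */
    }
    close(fd);
    return NULL;
}

/* For each accepted connection: */
void spawn_client(int fd)
{
    pthread_t tid;
    if (pthread_create(&tid, NULL, client_thread, (void *)(long)fd) == 0)
        pthread_detach(tid);
    else
        close(fd);
}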
 
