Intel might revive Hyperthreading with Nehalem

David Kanter · Nov 30, 2006

krw said:
I believe the IBM 360/195 and 370/95 both had SMT. It was called
"dual I-stream". Two "threads" were executed (I believe)
simultaneously.

Keith, if you mean the RS64-IV, that was not SMT; it was
switch-on-event multithreading, i.e. one thread would continue to
execute until a stall condition occurred, and then it would switch
threads.

[snip]

I don't believe memory latency is the issue at all, at least not
directly. Even SMT can't solve a 500 instruction "hole". SMT is
supposed to cover pipe flushes caused by branches (and
mispredicts). The other thread can still execute (utilize
execution units) while the first flushes and refills.

500 instructions might be feasible. Assume that 1/3 of instructions
generate memory requests, and the on-die caches have a hit rate of
99.5% --> 1/200 memory refs miss in cache --> 1/600 instructions cause
an off-die memory reference. Even if you assume 2/5 of instructions
are mem refs, you still end up with 1/500.
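
A quick way to check the arithmetic above, as a minimal sketch. The function name is made up for illustration; the inputs are the ones stated in the post (1/3 or 2/5 of instructions referencing memory, 99.5% on-die hit rate). Exact fractions are used so the results come out as whole numbers.

```python
from fractions import Fraction


def instructions_per_offdie_miss(mem_ref_fraction, cache_hit_rate):
    """Average number of instructions between off-die memory references.

    mem_ref_fraction: fraction of instructions that reference memory
    cache_hit_rate:   combined hit rate of the on-die caches
    """
    miss_rate = 1 - cache_hit_rate
    return 1 / (mem_ref_fraction * miss_rate)


hit_rate = Fraction(995, 1000)  # 99.5% on-die hit rate, as assumed in the post

print(instructions_per_offdie_miss(Fraction(1, 3), hit_rate))  # 600
print(instructions_per_offdie_miss(Fraction(2, 5), hit_rate))  # 500
```

So under these assumptions one off-die reference every 500 to 600 instructions is what you would expect.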

Sure, but they're inevitable.

Yes, but they should be minimized.

I'm not buying it. As I said, the primary problem that SMT is
trying to solve is pipe bubbles caused by branch mispredicts.

I agree that it's a benefit, but I have a hard time seeing that as a
bigger motivator than cache misses. If you look at any realistic CPI
breakdown, memory is always the biggest component, by a long shot.

Here's a paper on the subject, identifying branch misprediction as a
minor problem for OLTP workloads:

http://www.cs.cmu.edu/~damon2006/pdf/saylor06oltp.pdf

Here is an analysis of the POWER5:
http://www-128.ibm.com/developerworks/power/library/pa-cpipower2/?ca=dgr-lnxwCPIP2

In each case, the branch prediction penalty is minuscule compared to
cache misses.

X86, perhaps. When you're register poor and there is a branch every
five instructions this isn't surprising.

Actually x86 has the highest IPC chip for SPECint:
Chip - SPECint2000 score
P5+ 2.3GHz - 1820
Woodcrest 3GHz - 3089
Opteron 3GHz - 1942
P4 Xeon 3.8GHz - 1854
Itanium 1.6GHz - 1590

Converting that into SPECint/GHz you get:
P5+ - 791
Woodcrest - 1029
Opteron - 647
P4 Xeon - 487
Itanium - 993

SPECint/GHz will be proportional to IPC for these processors. So
actually x86 has the highest IPC processor, followed by IPF, then PPC.
Note that this comparison is only using server processors, while
desktop processors are slightly faster due to lower memory latency.
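
The conversion is just score divided by clock, and can be reproduced with a short sketch like the one below. The scores and clocks are the ones listed above; the dictionary layout and function name are only for illustration. The per-GHz figures quoted in the post appear to be truncated rather than rounded (e.g. the P4 Xeon works out to about 487.9).

```python
# (SPECint2000 score, clock in GHz) as listed in the post
scores = {
    "P5+": (1820, 2.3),
    "Woodcrest": (3089, 3.0),
    "Opteron": (1942, 3.0),
    "P4 Xeon": (1854, 3.8),
    "Itanium": (1590, 1.6),
}


def specint_per_ghz(score, ghz):
    """SPECint2000 score normalised by clock frequency."""
    return score / ghz


# Rank the chips from highest to lowest score per GHz
ranked = sorted(scores, key=lambda chip: specint_per_ghz(*scores[chip]), reverse=True)

for chip in ranked:
    print(f"{chip:10s} {specint_per_ghz(*scores[chip]):7.1f}")
```

The resulting order is Woodcrest, Itanium, P5+, Opteron, P4 Xeon.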

DK

David Kanter · Nov 30, 2006

Sebastian said:
Um... K7/K8 can do 6. It's limited by 3 macro ops which can be pairs of
uops. Typically on average a K8 mop is ~1.5 uops so in reality those 6 will
be achieved infrequently.

Ah, my apologies.

Well, Netburst is rather thin -- only 3 uops throughput (both at the
front and back (retirement) of the pipeline). As you probably know, typical
application execution is bursty -- bursts of high IPC streams (talk about IPC
around 1.5~2.0) interleaved with dead periods of a few hundred cycles spent
waiting for a cache miss.

Do you have any data to back this up? I have IPC for p4, but nothing
that looks at it over time.

When both threads are in their execution bursts they are congesting
each other (as one x86 instruction translates into 1.4~1.5 P4 uops, P4 can
sustain no more than 2.0~2.1 IPC in execution bursts).

I'm not so sure.

But the bottlenecks are many -- the small L1 cache is one of them.

Small, and only 1 read port. Small trace cache, overly sensitive
scheduler, etc. etc.

DK

Sebastian Kaliszewski · Dec 1, 2006

David said:
Do you have any data to back this up? I have IPC for p4, but nothing
that looks at it over time.

Sorry, no hard data. It's based on various stuff read in various places and
then some simple calculations. See below.

I'm not so sure.

This is the top realistic speed: 3 uops/cycle throughput, and more execution
resources (on the EU side you have 4 uops/cycle). One x86 instruction on
average translates into ~1.4-1.5 uops. So no more than 2.0-2.1 x86
instructions per cycle.
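
The bound is simply the uop throughput divided by the uops-per-instruction expansion. A minimal sketch, using the figures from the post (the function name is only illustrative):

```python
def max_x86_ipc(uops_per_cycle, uops_per_instruction):
    """Upper bound on x86 instructions per cycle given a uop throughput limit."""
    return uops_per_cycle / uops_per_instruction


for expansion in (1.4, 1.5):
    bound = max_x86_ipc(3, expansion)
    print(f"{expansion} uops/instruction -> at most {bound:.2f} x86 instructions/cycle")
```

This gives roughly 2.14 and 2.00, which is the 2.0-2.1 range quoted.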

The question remains whether that result is achievable in reality. With a
good enough compiler P4 gets an IPC of about 1.0, but then there are long
misprediction penalties, worth about 40 instructions (or ~60 on Prescott),
L1 miss penalties (somewhat smaller but not much smaller) and those huge L2
miss penalties (500 instructions' worth). Take a typical P4 branch
miss rate (~4%), recognise that typically 1 instruction in 6 is a branch and
you have a miss every 125 instructions (with a 40-instruction hole which could
be somehow filled by the scheduler only if all the nearby branches are
properly predicted -- but one should not assume a flat distribution of
predictable and unpredictable branches). So mispredictions cost a significant
amount of work, then cache misses cost another big piece -- bring those two
together and you might see that to achieve 1.0 average IPC the CPU needs about
twice that in its bursts of useful work.
And that's pretty close to the practical top limit for P4. So getting two
threads bursting at once means they will fight for resources.
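
The estimate above can be written down as a back-of-envelope model. This is only a sketch under stated assumptions: penalties are counted as lost instruction slots at the burst rate, they never overlap, L1 misses are ignored (no number is given for them), and the L2 miss interval of one per 600 instructions is borrowed from earlier in the thread rather than from this post. All function and parameter names are made up for illustration.

```python
def instructions_per_mispredict(branch_every, mispredict_rate):
    """Average number of instructions between branch mispredictions."""
    return branch_every / mispredict_rate


def required_burst_ipc(average_ipc, branch_every, mispredict_rate,
                       mispredict_hole, l2_miss_every, l2_hole):
    """Burst IPC needed to reach average_ipc when stalls do not overlap.

    Holes are expressed in instruction slots lost at the burst rate, so
    executing N instructions takes the time of N + lost slots.
    """
    lost_per_instruction = (
        mispredict_hole * mispredict_rate / branch_every  # misprediction holes
        + l2_hole / l2_miss_every                         # L2 miss holes
    )
    return average_ipc * (1 + lost_per_instruction)


# 1 branch per 6 instructions, 4% of branches mispredicted
print(round(instructions_per_mispredict(6, 0.04)))

# 40-slot misprediction hole, 500-slot L2 hole, one L2 miss per 600 instructions
# (the L2 interval is an assumption taken from earlier in the thread)
print(round(required_burst_ipc(1.0, 6, 0.04, 40, 600, 500), 2))
```

With exactly these inputs the misprediction interval comes out at 150 instructions (125 would correspond to one branch in five), and the required burst IPC is about 2.1, i.e. roughly twice the average.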

Small, and only 1 read port. Small trace cache, overly sensitive
scheduler, etc. etc.

Yup.

rgds

David Kanter · Dec 1, 2006

Sebastian said:
Sorry, no hard data. It's based on various stuff read in various places and
then some simple calculations. See below.

That doesn't sound like a very reliable source of information.

This is the top realistic speed: 3 uops/cycle throughput, and more execution
resources (on the EU side you have 4 uops/cycle). One x86 instruction on
average translates into ~1.4-1.5 uops. So no more than 2.0-2.1 x86
instructions per cycle.

That sounds reasonable.

The question remains whether that result is achievable in reality. With a
good enough compiler P4 gets an IPC of about 1.0, but then there are long
misprediction penalties, worth about 40 instructions (or ~60 on Prescott),
L1 miss penalties (somewhat smaller but not much smaller) and those huge L2
miss penalties (500 instructions' worth). Take a typical P4 branch
miss rate (~4%),

That doesn't line up with what I've measured.

recognise that typically 1 instruction in 6 is a branch and
you have a miss every 125 instructions (with a 40-instruction hole which could
be somehow filled by the scheduler only if all the nearby branches are
properly predicted -- but one should not assume a flat distribution of
predictable and unpredictable branches). So mispredictions cost a significant
amount of work, then cache misses cost another big piece -- bring those two
together and you might see that to achieve 1.0 average IPC the CPU needs about
twice that in its bursts of useful work.

That's an interesting analysis, but you completely forgot interaction
effects. To what extent are the two penalties overlapping? You cannot
say anything interesting without that information.
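
To illustrate why the overlap matters, here is a rough sketch that sweeps a hypothetical overlap factor between the two penalties. Everything in it is an assumption for illustration: the overlap parameter itself, the one-L2-miss-per-600-instructions interval (taken from earlier in the thread), and the simplification that the smaller penalty is partially hidden under the larger one. It is not a measurement.

```python
def burst_to_average_ratio(mispredict_lost, l2_lost, overlap):
    """Ratio of burst IPC to average IPC for a given degree of overlap.

    mispredict_lost, l2_lost: instruction slots lost per useful instruction
    overlap: 0.0 means the penalties fully serialize; 1.0 means the smaller
             penalty is hidden entirely under the larger one (hypothetical).
    """
    hidden = overlap * min(mispredict_lost, l2_lost)
    return 1 + mispredict_lost + l2_lost - hidden


# Figures quoted in the thread: 40-slot hole, 4% miss rate, 1 branch per 6
mispredict_lost = 40 * 0.04 / 6
# 500-slot hole, one L2 miss per 600 instructions (assumed interval)
l2_lost = 500 / 600

for overlap in (0.0, 0.5, 1.0):
    ratio = burst_to_average_ratio(mispredict_lost, l2_lost, overlap)
    print(f"overlap {overlap:.1f}: burst IPC must be {ratio:.2f}x the average")
```

Under these assumptions the required burst rate ranges from about 2.1x the average (no overlap) down to about 1.83x (full overlap), so the conclusion depends on how much the penalties actually overlap.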

And that's pretty close to the practical top limit for P4. So getting two
threads bursting at once means they will fight for resources.

Perhaps, perhaps not. It seems to me you're forgetting about the trace
cache hit rate entirely.

DK
