Intel might revive Hyperthreading with Nehalem


lyon_wonder

http://www.vr-zone.com/?i=4322

The next generation Intel processor based on the Nehalem architecture
is clearly exciting as VR-Zone has learned. Successor to quad core
Yorkfield which forms part of the 45nm Penryn architecture, Bloomfield
will come along and sit right on top of the 45nm Nehalem desktop
processors in mid 2008. Bloomfield will have 4 cores and is capable of
8 threads, like the old Hyper-Threading technology, only more
advanced. Bloomfield will contain an integrated memory controller that
requires a new socket refresh called Socket B with 1366 contact pads.
 

Yousuf Khan

lyon_wonder said:
http://www.vr-zone.com/?i=4322

The next generation Intel processor based on the Nehalem architecture
is clearly exciting as VR-Zone has learned. Successor to quad core
Yorkfield which forms part of the 45nm Penryn architecture, Bloomfield
will come along and sit right on top of the 45nm Nehalem desktop
processors in mid 2008. Bloomfield will have 4 cores and is capable of
8 threads, like the old Hyper-Threading technology, only more
advanced. Bloomfield will contain an integrated memory controller that
requires a new socket refresh called Socket B with 1366 contact pads.

I'm guessing that this will be true hardware multithreading as opposed
to the "exploit the time between our various inefficiencies" type of
multithreading that was Hyperthreading.

Yousuf Khan
 

David Kanter

Yousuf said:
lyon_wonder wrote:
I'm guessing that this will be true hardware multithreading as opposed
to the "exploit the time between our various inefficiencies" type of
multithreading that was Hyperthreading.

Perhaps you'd care to explain the difference between the two?

AFAIK, all multithreading relies on exploiting the time between our
various inefficiencies to improve performance...

DK
 

krw

Perhaps you'd care to explain the difference between the two?

AFAIK, all multithreading relies on exploiting the time between our
various inefficiencies to improve performance...

Can two threads be dispatched/completed simultaneously? Does one
thread have to wait for the other to flush? These are two
improvements on the P4's implementation I can imagine.
 

joshk18

Can two threads be dispatched/completed simultaneously?

AFAIK, yes. The only hardware structures that cannot be used by both
threads simultaneously are the trace cache and decoder (see Tullsen's
PACT03 paper).
Does one
thread have to wait for the other to flush? These are two
improvements on the P4's implementation I can imagine.

I don't know the answer to that, however, I suspect only certain types
of flushes impact both threads. For instance, a TC flush would hit
both threads. However, flushing the ROB and RS should only impact one
thread, since they are statically partitioned.

Again, I'll ask what is this 'true' multithreading that Yousuf
mentions? The P4 uses simultaneous multithreading and it's just as
real as the POWER5/6, or the SoEMT used in Montecito, Niagara and the
older IBM systems (northstar or pulsar) and Tera's systems.
Multithreading fundamentally relies on exploiting the difference
between average IPC and peak IPC, i.e. making up for inefficiencies in
a design. Where's the beef?
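To put rough, made-up numbers on that gap (purely illustrative, in Python):

    # Illustrative only: the gap between peak and average IPC is the
    # slack a second thread can try to fill.
    peak_ipc = 3.0   # uops/cycle the core could issue/retire at best
    avg_ipc = 1.0    # what a single thread typically sustains

    idle_fraction = 1 - avg_ipc / peak_ipc
    print(f"{idle_fraction:.0%} of issue slots go unused")  # ~67%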

DK
 

krw

AFAIK, yes. The only hardware structures that cannot be used by both
threads simultaneously are the trace cache and decoder (see Tullsen's
PACT03 paper).


I don't know the answer to that, however, I suspect only certain types
of flushes impact both threads. For instance, a TC flush would hit
both threads. However, flushing the ROB and RS should only impact one
thread, since they are statically partitioned.

Again, I'll ask what is this 'true' multithreading that Yousuf
mentions? The P4 uses simultaneous multithreading and it's just as
real as the POWER5/6, or the SoEMT used in Montecito, Niagara and the
older IBM systems (northstar or pulsar) and Tera's systems.
Multithreading fundamentally relies on exploiting the difference
between average IPC and peak IPC, i.e. making up for inefficiencies in
a design. Where's the beef?

The P4 cannot dispatch or complete two instructions from opposite
threads in the same cycle (I thought *star could). This may be
caused by the trace cache limitation you mention. I'm not sure
this is all that important, and it certainly adds complication.
 

David Kanter

krw said:
The P4 cannot dispatch or complete two instructions from opposite
threads in the same cycle (I thought *star could). This may be
caused by the trace cache limitation you mention. I'm not sure
this is all that important, and it certainly adds complication.

So you're claiming that the P4 SMT is not *real* because it does round
robin retirement?

Again, this discussion is in the context of Yousuf's ridiculous claim
(which I'd note he is too chicken, or unable to come out and back up).
I don't consider this evidence that hyperthreading is some sort of
half-assed or less than 'true' multithreading.

I'd also point out that by that criterion, the POWER5 is also not
*real* SMT; it can only dispatch instructions from one thread at a time
to form a group....

I guess according to Yousuf this means that the POWER5 doesn't have
*real* multithreading either.

DK
 

Robert Redelmeier

David Kanter said:
So you're claiming that the P4 SMT is not *real* because
it does round robin retirement?

Sorry to butt in, but strict RR would kill the main SMT
advantage: running another thread during the 200-300 clocks
that one is waiting for RAM fetch. RR would force all threads
to block because one couldn't retire.

This is why SMT isn't a universal win: it is highly app dependent.
If the app is reasonably optimized and doesn't suffer too many
memory stalls, there aren't many scraps left for a second thread.
Particularly not on a dispatch-limited arch like Pentium4.
I'd also point out that by that criterion, the POWER5 is
also not *real* SMT; it can only dispatch instructions from
one thread at a time to form a group....

"Chunking" is not a problem so long as the CPU doesn't stall.

-- Robert
 

krw

So you're claiming that the P4 SMT is not *real* because it does round
robin retirement?

Cool your jets, Dave. Do you have to make *every* thread personal?
There is a reason I've been trying to ignore you, but I thought
this was a chance to discuss. I guess not.

It is not *I* who is claiming a DAMNED THING! No sense in
continuing...
 

Yousuf Khan

David said:
Again, this discussion is in the context of Yousuf's ridiculous claim
(which I'd note he is too chicken, or unable to come out and back up).
I don't consider this evidence that hyperthreading is some sort of
half-assed or less than 'true' multithreading.

Whoa there buddy-boy, I'm just seeing this thread for the first time,
didn't even know there was a reply to my reply, until today. I just
don't have time to go through every thread and follow it up diligently.

Now I'll go ignore my sleep and read through this thread just to
continue this pointless debate. :)

Yousuf Khan
 

Yousuf Khan

David said:
Perhaps you'd care to explain the difference between the two?

AFAIK, all multithreading relies on exploiting the time between our
various inefficiencies to improve performance...

Not necessarily, the multithreading that I had known about all along --
prior to Intel showing up with Hyperthreading -- has been about a
processor with twice (or more) the execution units than its
single-threaded counterpart. Basically the instruction streams in each
thread can be brought in and retired independently, without ever waiting
for the other thread's stream to give up its place in the queue, because
each thread has its own separate queue. You could say this is one step
removed from full multi-cores, except sharing even more resources than
multi-cores. Intel's Hyperthreading was the first I'd heard of where,
instead of doubling the number of execution units, you just exploit the
inefficiencies of the existing execution units.

Yousuf Khan
 

David Kanter

Robert said:
Sorry to butt in, but strict RR would kill the main SMT
advantage: running another thread during the 200-300 clocks
that one is waiting for RAM fetch. RR would force all threads
to block because one couldn't retire.

It isn't strict RR...obviously a thread with nothing to retire yields
to a productive thread.

DK
 

David Kanter

krw said:
Cool your jets, Dave. Do you have to make *every* thread personal?
There is a reason I've been trying to ignore you, but I thought
this was a chance to discuss. I guess not.

It is not *I* who is claiming a DAMNED THING! No sense in
continuing...

Fair enough, my apologies for jumping on you.

DK
 

David Kanter

Yousuf said:
Not necessarily, the multithreading that I had known about all along --
prior to Intel showing up with Hyperthreading -- has been about a
processor with twice (or more) the execution units than its
single-threaded counterpart.

In what context? The EV8?

The oldest form of MT is the time slice multithreading that is used to
hide memory latency (IBM and Tera, probably some others as well, have
done this).
Basically the instruction streams in each
thread can be brought in and retired independently, without ever waiting
for the other thread's stream to give up its place in the queue, because
each thread has its own separate queue.

That only really matters if you are bottlenecked by retirement
capabilities for a substantial amount of the time, which I doubt is the
case of the P4. The bottleneck is the memory subsystem.
You could say this is one step
removed from full multi-cores, except sharing even more resources than
multi-cores. Intel's Hyperthreading was the first I'd heard of where,
instead of doubling the number of execution units, you just exploit the
inefficiencies of the existing execution units.

Almost all the prior work actually focused on attacking inefficiencies
in the memory system, rather than execution units. For instance, Tera
interleaved ~100 threads to hide main memory latency...of course, that
didn't exactly work for the real world.

The IBM Pulsar or Northstar had a very short pipeline with switch-on-event
multithreading, primarily to hide memory latency.

I agree that retiring from multiple threads is better, but ultimately,
you are still going to have to have some sort of mechanism to choose
the retiring instructions. It could be RR, it could be greedy WRT one
thread (and switching which thread), it could take equal instructions
from each thread.
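A toy sketch of those three policies (my own illustration, not how any real
core implements retirement), in Python:

    def pick(queues, width, policy, cycle=0):
        """Choose which thread's retirable uops leave the machine this cycle."""
        if policy == "round_robin":
            t = cycle % 2                       # alternate threads each cycle
            return queues[t][:width] or queues[1 - t][:width]
        if policy == "greedy":
            # greedy WRT one thread: here, give all slots to the fuller queue
            t = 0 if len(queues[0]) >= len(queues[1]) else 1
            return queues[t][:width]
        # "equal": split the retirement slots evenly between the threads
        return queues[0][:width // 2] + queues[1][:width - width // 2]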

I guess I don't see what sort of selection policy you have for
retirement as being fundamental to SMT; the underlying goal is to
eliminate inefficiencies in the pipeline and increase utilization.
This may require increasing retirement, and it may not...

DK
 

Yousuf Khan

David said:
In what context? The EV8?

Probably even before that, when it was still just a theory being
bandied about. I guess EV8 was the first product they talked about
having it, that I can recall.
The oldest form of MT is the time slice multithreading that is used to
hide memory latency (IBM and Tera, probably some others as well, have
done this).

Yeah, ancestral SMT, about as relevant as dinosaurs.
That only really matters if you are bottlenecked by retirement
capabilities for a substantial amount of the time, which I doubt is the
case of the P4. The bottleneck is the memory subsystem.

The retirement capabilities of the P4 *were* a bottleneck for P4, in
addition to memory latency. The retirement capabilities of all other
architectures were much higher than P4's, x86 or not.
Almost all the prior work actually focused on attacking inefficiencies
in the memory system, rather than execution units. For instance, Tera
interleaved ~100 threads to hide main memory latency...of course, that
didn't exactly work for the real world.

Yeah, and then they invented caches and all of a sudden memory latency
is not that much of an issue, and other parts of the processor do become
an issue.
I agree that retiring from multiple threads is better, but ultimately,
you are still going to have to have some sort of mechanism to choose
the retiring instructions. It could be RR, it could be greedy WRT one
thread (and switching which thread), it could take equal instructions
from each thread.

Yeah, or it could just take instruction streams from independent
processes. That's what Hyperthreading was doing most of the time.
I guess I don't see what sort of selection policy you have for
retirement as being fundamental to SMT; the underlying goal is to
eliminate inefficiencies in the pipeline and increase utilization.
This may require increasing retirement, and it may not...

Who cares about memory bottlenecks? Stop diverting the subject, David.
You're the only one bringing up memory bottlenecks as the reason for
SMT. Nobody else will agree. That may have been the case in ancient
times, but now we have caches and inboard memory controllers. These
days, SMT is used to increase IPC, which means increasing the
instruction retirement rate.

Yousuf Khan
 

David Kanter

Yousuf said:
Probably even before that, when it was still just a theory being
bandied about. I guess EV8 was the first product they talked about
having it, that I can recall.

Indeed. The EV8 was designed to be wide, but I've heard rumors that it
was planned to be 8 wide before they settled on 4 way SMT.
Yeah, ancestral SMT, about as relevant as dinosaurs.

It's not simultaneous, i.e. it's just time slice MT.
The retirement capabilities of the P4 *were* a bottleneck for P4, in
addition to memory latency. The retirement capabilities of all other
architectures were much higher than P4's, x86 or not.

Um...can you provide evidence to back those statements up? K7/K8 and
P3 can only retire 3 micro operations/cycle.

I have yet to see any conclusive proof that the P4 was retirement
limited. Can you cite any serious studies which show retirement as a
bottleneck?

I have done some performance analysis for a broad spectrum of
benchmarks on the P4, and I see very little which indicates retirement
is an issue. In fact, if anything, I see evidence that the bottleneck
lies with other elements of the design.
Yeah, and then they invented caches and all of a sudden memory latency
is not that much of an issue, and other parts of the processor do become
an issue.

You do realize that caches have been around long before MTA was
designed, right? In fact, not only were caches around, but caches had
been integrated into CPUs. The Tera and Cray folks really didn't
believe in caches, because for some applications they are useless.

You also realize that even with caches, multithreading is required to
tolerate cache miss latency, which substantially contributes to CPI.
Yeah, or it could just take instruction streams from independent
processes. That's what Hyperthreading was doing most of the time.

That's not exactly the processor's fault, that problem lies with the
OS...
Who cares about memory bottlenecks?

I dunno, why don't you ask someone who designs MPUs for a living,
they'd probably tell you almost everyone.
Stop diverting the subject, David.
You're the only one bringing up memory bottlenecks as the reason for
SMT.

It is one reason, there are others. However, the biggest benefit of
multithreading is to alleviate memory bottlenecks. Look at the
performance gain that SoEMT provides on Northstar versus SMT on the
POWER5. SoEMT provides most of the benefits of SMT...and all it does
is switch on memory stalls.
Nobody else will agree.

Really? Let me quote for you what the creators of SMT said:

"The objective of SMT is to substantially increase processor
utilization in the face of
both long memory latencies and limited available parallelism per
thread."

Gosh it sounds to me like the folks who devised SMT thought that long
memory latencies were an issue that was important to address.
Ironically enough, the folks at UW were working closely with DEC, which
designed the EV7....which had both an integrated memory controller and
truly glueless and scalable MP.
That may have been the case in ancient
times, but now we have caches and inboard memory controllers. These
days, SMT is used to increase IPC, which means increasing the
instruction retirement rate.

Yousuf, I think we are talking past each other. You're saying that SMT
should increase retirement rate, which is true. It is trivial that any
change in architecture which improves performance, while leaving path
length and frequency unchanged must improve IPC...which must improve
the retirement rate.

I'm saying that SMT improves performance (i.e. improves IPC) because it
enables you to extract more memory parallelism and overlap many cache
accesses. i.e. it alleviates memory bottlenecks.

However, your assertion that the P4 is retirement bound is simply
wrong. AFAIK, all server or desktop MPUs, ignoring Niagara, achieve
less than half their peak retirement rate. IOW, there is no way that 2
way multithreading could be a problem.
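A quick sanity check of that claim with assumed numbers (not measurements):

    # If each thread retires well under half of peak, two threads still
    # fit under the retirement bandwidth -- retirement is not the wall.
    peak_retire = 3.0    # uops/cycle the back end can retire
    per_thread = 1.4     # assumed average uops/cycle for one thread

    print(2 * per_thread <= peak_retire)   # True: headroom remains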

DK
 

Sebastian Kaliszewski

Yousuf said:
Yeah, ancestral SMT, about as relevant as dinosaurs.

That's not *S*MT to begin with...

The retirement capabilities of the P4 *were* a bottleneck for P4, in
addition to memory latency. The retirement capabilities of all other
architectures were much higher than P4's, x86 or not.

Really? P3 is no better, nor is PM. You must go up to C2 to see improvement
there. K7 was better (6 uops) but it was surprisingly not faster. It was K8
which significantly improved on the real bottleneck (memory) and which was
faster than P4.

Yeah, and then they invented caches and all of a sudden memory latency
is not that much of an issue, and other parts of the processor do become
an issue.

The memory latency *is* the issue! And caches have been known for a looooong time.
But even with a 99.5% cache hit rate, latency is a bottleneck -- when a cache
miss costs ~500 not-executed instructions (typical for a >3GHz P4), then with a 0.5%
chance of a miss and a sparse memory access rate of only once per 5
instructions, your code spends about 1/3 of the time waiting on memory.

And none of today's OoO chips can reorder around 500 instructions.
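The arithmetic behind that 1/3, spelled out with the same assumed numbers:

    miss_rate = 0.005       # 0.5% of accesses miss the cache
    miss_penalty = 500      # cycles lost per miss on a >3GHz P4
    access_rate = 1 / 5     # one memory access per 5 instructions

    stall_per_instr = miss_rate * miss_penalty * access_rate  # 0.5 cycles
    busy_per_instr = 1.0                                      # assume ~1 IPC otherwise
    waiting = stall_per_instr / (stall_per_instr + busy_per_instr)
    print(f"{waiting:.0%} of time waiting on memory")         # ~33%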

Who cares about memory bottlenecks?

Everyone who designs this stuff?
Stop diverting the subject, David.
You're the only one bringing up memory bottlenecks as the reason for
SMT.

As they're the primary problem!
Nobody else will agree.

Everybody knowledgeable will agree.
That may have been the case in ancient
times, but now we have caches and inboard memory controllers.

And those are not enough.
These
days, SMT is used to increase IPC, which means increasing the
instruction retirement rate.

IPC is the net effect of all the parts of the pipe working. Do you know what the
typical IPC is in a typical SPECINT-like workload on current processors, capable
of dispatching, executing, and retiring 3 or 4 instructions at once? It's
around 1.0!!!

rgds
 

Sebastian Kaliszewski

David said:
Um...can you provide evidence to back those statements up? K7/K8 and
P3 can only retire 3 micro operations/cycle.

Um... K7/K8 can do 6. It's limited to 3 macro-ops, which can be pairs of
uops. Typically, on average a K8 macro-op is ~1.5 uops, so in reality those 6 will be
achieved infrequently.
I have yet to see any conclusive proof that the P4 was retirement
limited. Can you cite any serious studies which show retirement as a
bottleneck?

I have done some performance analysis for a broad spectrum of
benchmarks on the P4, and I see very little which indicates retirement
is an issue. In fact, if anything, I see evidence that the bottleneck
lies with other elements of the design.

Well, Netburst is rather thin -- only 3 uops of throughput at both the
front and back (retirement) of the pipeline. As you probably know, typical
application execution is bursty -- bursts of high-IPC streams (IPC
around 1.5~2.0) interleaved with few-hundred-cycle dead periods of waiting for
a cache miss. When both threads are in their execution bursts they are
congesting (as one x86 instruction translates into 1.4~1.5 P4 uops, the P4 can
sustain no more than 2.0~2.1 IPC in execution bursts).
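Where the 2.0~2.1 figure comes from (just the division, with the assumed
expansion ratios):

    retire_width = 3.0               # uops/cycle through the P4 back end
    for uops_per_x86 in (1.4, 1.5):  # assumed x86 -> uop expansion
        print(round(retire_width / uops_per_x86, 2))   # 2.14 and 2.0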

But the bottlenecks are many -- the small L1 cache is one of them.

rgds
 

krw

That's not *S*MT to begin with...

I believe the IBM 360/195 and 370/95 both had SMT. It was called
"dual I-stream". Two "threads" were executed (I believe)
simultaneously.
The memory latency *is* the issue! And caches have been known for a looooong time.
But even with a 99.5% cache hit rate, latency is a bottleneck -- when a cache
miss costs ~500 not-executed instructions (typical for a >3GHz P4), then with a 0.5%
chance of a miss and a sparse memory access rate of only once per 5
instructions, your code spends about 1/3 of the time waiting on memory.

And none of today's OoO chips can reorder around 500 instructions.

I don't believe memory latency is the issue at all, at least not
directly. Even SMT can't solve a 500 instruction "hole". SMT is
supposed to cover pipe flushes caused by branches (and
mispredicts). The other thread can still execute (utilize
execution units) while the first flushes and refills.
Everyone who designs this stuff?

Sure, but they're inevitable.
As they're the promary problem!

I'm not buying it. As I said, the primary problem that SMT is
trying to solve is pipe bubbles caused by branch mispredicts.
Everybody knowledgable will agree

Perhaps not. ;-)
And those are not enough.


IPC is the net effect of all the parts of the pipe working. Do you know what the
typical IPC is in a typical SPECINT-like workload on current processors, capable
of dispatching, executing, and retiring 3 or 4 instructions at once? It's
around 1.0!!!

X86, perhaps. When you're register poor and there is a branch every
five instructions this isn't surprising.
 

Sebastian Kaliszewski

krw said:
I don't believe memory latency is the issue at all, at least not
directly. Even SMT can't solve a 500 instruction "hole". SMT is
supposed to cover pipe flushes caused by branches (and
mispredicts). The other thread can still execute (utilize
execution units) while the first flushes and refills.

Well, in the case of one thread the CPU spends 30-50% of its time waiting. With 2 threads
it gets down to 9-25%. This is quite a significant gain.
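Those numbers fall out if you assume the two threads stall independently (a
rough model, not a measurement):

    # The core only sits idle when *both* threads are waiting at once.
    for stall in (0.30, 0.50):        # single-thread fraction spent waiting
        both = stall * stall          # chance both wait simultaneously
        print(f"{stall:.0%} alone -> {both:.0%} with two threads")
    # 30% -> 9%, 50% -> 25%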

Sure, but they're inevitable.

Yes, but if one finds a way to make the hardware do useful work while some
instruction is waiting for its data, it's a useful optimisation.
I'm not buying it. As I said, the primary problem that SMT is
trying to solve is pipe bubbles caused by branch mispredicts.

Well, if your jump is mispredicted it's too late -- the resources got
wasted. But the predictor could detect that a jump is hard to predict and
instruct the scheduler to dispatch from the other thread.

I think SMT reduces both jump costs and cache miss costs. In fact, from
the SMT PoV both are quite similar.

Perhaps not. ;-)

Oh. But perhaps one could get convinced ;-)

X86, perhaps. When you're register poor and there is a branch every
five instructions this isn't surprising.

Well, on x86 it's often even worse. Besides, on RISCs there are simply more
instructions (about 1.5x as many). And those temporary-variable accesses in the
case of x86 are very cache friendly. The memory accesses causing problems
are more or less the same on RISC and on CISC.

rgds
 
