How AMD will take on Intel Woodcrest: twice the FPUs


George Macdonald

Changing the number of reg ports by one or two is minor. Changing your
L1D cache porting is a pretty major undertaking, especially if you only
had a single port before. Ask Mitch Alsup or someone who does this for
a living.

Err, what do you think Keith does for a "living"?
That's not a red herring if it relates to reality, which it does. AMD
cannot design 3 new architectures. They have said that they have a new
mobile and a new server uarch in the pipeline...combining those
statements results in a particular conclusion.

Where did you see this? All I've seen is that they are planning
"variants", some details of which could be "planned" and some of which
could fall out of binning. Calling them different "architectures" seems a
bit of a stretch... to get to your "conclusion".
That's right. I'm pointing out that having an FMA is a huge
performance boost. Since x86 doesn't have one, I have to use other
things to show this. Ask anyone who designs chips if FMAs are a good
idea...

It doubles your FLOPs and, if you have the memory to support it, is a
huge boost.

How about you ask the folks who program them and write compilers! FMAs are
often found to be somewhat less useful than anticipated - no bloody use at
all for the matrix stuff I'm interested in. When you need the precision
and have to cope with rounding errors, you lose much of any advantage.
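
(For concreteness, a minimal C99 sketch of both halves of this argument. The fma() call is the standard <math.h> routine, not any x86 instruction, and the operands are contrived purely to expose the single-rounding difference that makes an FMA's result differ from what multiply-then-add code expects.)

#include <math.h>
#include <stdio.h>

/* Stop the compiler from contracting a*b + c into an FMA on its own,
 * so the two-rounding version below really is multiply-then-add.
 * Build with a C99 compiler and link with -lm.                       */
#pragma STDC FP_CONTRACT OFF

int main(void)
{
    /* Contrived operands: the exact product a*b is 1 - 2^-54, which a
     * separate multiply rounds back up to 1.0, while an FMA keeps it. */
    double a = 1.0 + 0x1p-27;
    double b = 1.0 - 0x1p-27;
    double c = -1.0;

    double two_step = a * b + c;     /* two roundings: prints 0            */
    double one_step = fma(a, b, c);  /* one rounding:  prints -5.55112e-17 */

    printf("mul+add: %g\n", two_step);
    printf("fma    : %g\n", one_step);

    /* Throughput side: one FMA counts as 2 flops, hence "twice the
     * FLOPs" -- but only if memory can keep the unit fed.            */
    return 0;
}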
 

David Kanter

George said:
Err, what do you think Keith does for a "living"?

Not entirely sure; however, he sure doesn't strike me as a computer
architect. If he does design MPUs, I'd want to hear that from him,
along with a few more details.

Anyone on usenet should be capable of defending themselves, as I'm sure
Keith is.
Where did you see this? All I've seen is that they are planning
"variants", some details of which could be "planned" and some of which
could fall out of binning. Calling them different "architectures" seems a
bit of a stretch... to get to your "conclusion".
From EETimes:

Hester: We are evolving to what I'd say are a minimum of two
brand-new core design points, new microarchitectures from the ground
up. One is aimed at mobile computers and the very low-power space.
Another is optimized for the high-end server space. The question we
have now is: Can you pull down the server space, or pull up the mobile,
enough to cover the desktop?

<end quote>

Let me emphasize the "minimum of two brand-new core design points, new
uarchs from the ground up" part. First of all, I think this is a good
move by AMD; the server, mobile and desktop worlds are rather
different. With a mobile or desktop CPU, you can have (or expect) sort
of bursty operation/performance. Sometimes you get the least
power/heat on a piece of code by executing it very quickly at high
performance, and then sending the chip to sleep. Obviously servers are
a little different.
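
(A back-of-the-envelope sketch of that "run fast, then sleep" argument; every wattage, frequency and duration below is invented purely for illustration, not taken from any real part.)

#include <stdio.h>

/* Race-to-idle arithmetic with made-up numbers: the same amount of work
 * done fast-then-sleep vs. slowly with no idle time. Energy = power * time. */
int main(void)
{
    double work_cycles = 2e9;                    /* hypothetical job size */

    /* Fast mode: 2 GHz at 30 W, then idle at 1 W for the rest of a 2 s window. */
    double t_fast = work_cycles / 2e9;                       /* 1.0 s busy */
    double e_fast = 30.0 * t_fast + 1.0 * (2.0 - t_fast);    /* 31 J       */

    /* Slow mode: 1 GHz at 18 W, busy for the whole 2 s window. */
    double t_slow = work_cycles / 1e9;                       /* 2.0 s busy */
    double e_slow = 18.0 * t_slow;                           /* 36 J       */

    printf("race-to-idle: %.1f J   slow-and-steady: %.1f J\n", e_fast, e_slow);
    /* Which side wins in practice depends on how power really scales with
     * frequency and voltage; these numbers only fake that relationship.   */
    return 0;
}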

I suppose this discussion is rather speculative in one regard. We
don't know when these two brand new designs were started, and when they
are slated to hit the market. If they are aimed for 2011 or so,
perhaps the team from the K8L will be doing some of the work on one of
the later chips.
How about you ask the folks who program them and write compilers! FMAs are
often found to be somewhat less useful than anticipated - no bloody use at
all for the matrix stuff I'm interested in. When you need the precision
and have to cope with rounding errors, you lose much of any advantage.

LOL. If you are worried about rounding errors and precision on FP,
then you REALLY shouldn't be using x86. That's rich!

Out of curiosity, what kind of matrix stuff are you doing?

DK
 

Keith

Not entirely sure; however, he sure doesn't strike me as a computer
architect. If he does design MPUs, I'd want to hear that from him,
along with a few more details.

What makes you think I owe you anything? Many long-timers here know what
I do. Perhaps you might want to hang around here for more than a few
months; you might get a few more hints? I'd hate to end your suspense. ;-)
Anyone on usenet should be capable of defending themselves, as I'm sure
Keith is.

Against stupid? I've never been very good at defending myself against the
terminally stupid. But in your case: one *cannot* process more
information (per unit time/clock) than you can dispatch or complete. I
don't care *how* many execution units you have at your disposal.
 

David Kanter

Err, what do you think Keith does for a "living"?
What makes you think I owe you anything? Many long-timers here know what
I do. Perhaps you might want to hang around here for more than a few
months; you might get a few more hints? I'd hate to end your suspense. ;-)

I've been around for more than that, but sure, suspense is fine.
Against stupid? I've never been very good at defending myself against the
terminally stupid. But in your case: one *cannot* process more
information (per unit time/clock) than you can dispatch or complete. I
don't care *how* many execution units you have at your disposal.

I'm not disagreeing with that; it's obvious that peak execution rate is
limited by the narrowest of your pipeline stages. My point was that
there are REAL LIFE situations where it makes sense to have unbalanced
pipe width (if you have OOO).

Let's look at two MPUs, the EV6 and POWER4/5 (and only three pipe
stages):

         EV6    POWER4
fetch:     4         8
issue:     6         8
retire:   11       4+1

See, they aren't fully balanced. Why would the EV6 or POWER4/5
architects bother wasting all those extra transistors? They aren't
stupid; they have extra room in there for a reason.

So, back to my original assertion (that having some extra execution
resources might make sense). My point is not that on average it would
be great, but that I can see some situations where having more
resources in a stage is useful, particularly when clearing a
backlog/queue of work. Worst case, as you noted, you just burn a
little extra power or put the extra units to sleep with clock gating.
There's no way that it will hurt (assuming you can spare the power,
heat and time to create said resources), and there are some situations
where it could help.
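
(A toy sketch of that claim, with invented widths and queue sizes; it models nothing about any real core, just a burst of already-queued work draining through a varying number of execution units of one type while new work keeps arriving at a fixed dispatch rate.)

#include <stdio.h>

/* All numbers invented: up to DISPATCH ops of one type enter a queue each
 * cycle, and up to 'units' execution units pull work out of it. A burst of
 * 'backlog' ops is already queued, say after a cache-miss stall. Units
 * beyond the dispatch width can't raise sustained throughput, but they do
 * shorten how long the burst lingers. Real cores have finite queues and
 * stall dispatch when they fill; this ignores that entirely.              */
#define DISPATCH 4

static int cycles_to_drain(int units, int backlog)
{
    int queue = backlog, cycles = 0;
    while (queue > 0 && cycles < 1000) {
        queue += DISPATCH;                        /* new work keeps arriving */
        queue -= (queue < units) ? queue : units; /* execute what we can     */
        cycles++;
    }
    return (queue > 0) ? -1 : cycles;             /* -1: backlog never drains */
}

int main(void)
{
    for (int units = 4; units <= 8; units++) {
        int c = cycles_to_drain(units, 32);
        if (c < 0)
            printf("%d units: backlog never drains\n", units);
        else
            printf("%d units: backlog gone after %d cycles\n", units, c);
    }
    return 0;
}

With these made-up widths, 5 units clear the burst in 32 cycles and 8 units in 8; whether that transient is worth paying for is exactly what's being argued here.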

Now you could object and say: "If you get backlog and queues that
aren't draining properly, you probably screwed up your design." That
may be true, but my point was that you cannot say "having an unbalanced
pipeline is useless" in all cases. It's worth looking into...

DK
 

George Macdonald

Not entirely sure; however, he sure doesn't strike me as a computer
architect. If he does design MPUs, I'd want to hear that from him,
along with a few more details.

Anyone on usenet should be capable of defending themselves, as I'm sure
Keith is.

Why should he be forced to blatantly tout his own wares - most of us know
what he does and acknowledge that he might prefer to maintain some degree of
discretion when it comes to his employer and exact duties. You can look in
the archives for hints.
Hester: We are evolving to what I'd say are a minimum of two
brand-new core design points, new microarchitectures from the ground
up. One is aimed at mobile computers and the very low-power space.
Another is optimized for the high-end server space. The question we
have now is: Can you pull down the server space, or pull up the mobile,
enough to cover the desktop?

<end quote>

Let me emphasize the "minimum of two brand-new core design points, new
uarchs from the ground up" part. First of all, I think this is a good
move by AMD; the server, mobile and desktop worlds are rather
different.

Maybe with the systems you work with - I don't see it.
With a mobile or desktop CPU, you can have (or expect) sort
of bursty operation/performance. Sometimes you get the least
power/heat on a piece of code by executing it very quickly at high
performance, and then sending the chip to sleep. Obviously servers are
a little different.

Let them sleep!:) Not with my/our desktops/laptops - if a system is not
capable of running flat out for hours at a time, it's of no interest to me.
Design that thing and there are a lot of workstation and gaming users who
are going to be pissed... "sorry but your game is going to slow down now
'cos I'm a little tired/weary and I need a rest"?:)
I suppose this discussion is rather speculative in one regard. We
don't know when these two brand new designs were started, and when they
are slated to hit the market. If they are aimed for 2011 or so,
perhaps the team from the K8L will be doing some of the work on one of
the later chips.

Seems like "evolving" is the key word.
LOL. If you are worried about rounding errors and precision on FP,
then you REALLY shouldn't be using x86. That's rich!

In that respect x86 is no different from any other general purpose
computer... which people have been using for high precision work for
decades. What would *you* suggest? Rich is not appropriate at all - if a
special machine was required, the work would not have been done!
Out of curiosity, what kind of matrix stuff are you doing?

Linear Programming, though I don't see that the precision aspects require
that narrow a definition.
 

Keith

I've been around for more than that, but sure suspense is fine.


I'm not disagreeing with that; it's obvious that peak execution rate is
limited by the narrowest of your pipeline stages. My point was that
there are REAL LIFE situations where it makes sense to have unbalanced
pipe width (if you have OOO).

I never said any differently. I raised the issue that throwing in more
execute elements (OF A TYPE) than instructions you can dispatch or
complete per cycle is stupid.
Let's look at two MPUs, the EV6 and POWER4/5 (and only three pipe
stages):

         EV6    POWER4
fetch:     4         8
issue:     6         8
retire:   11       4+1

Maybe you need to study these again? The number of issue slots is a
meaningless number for the POWER4. There is an issue queue for each
execution unit (2xFXU, 2xFPU, 2xLSU, VMP, and VMA). There are only five
dispatch slots (4+1) feeding these queues, so the number of issue queues
is meaningless WRT IPC.
See, they aren't fully balanced. Why would the EV6 or POWER4/5
architects bother wasting all those extra transistors? They aren't
stupid; they have extra room in there for a reason.

Because you only have half the story right? The PowerPC can dispatch five
(4+1) instructions per cycle and complete five (4+1) instructions per
cycle. Kinda balanced, no?
So, back to my original assertion (that having some extra execution
resources might make sense). My point is not that on average it would
be great, but that I can see some situations where having more resources
in a stage is useful, particularly when clearing a backlog/queue of
work. Worst case, as you noted, you just burn a little extra power or
put the extra units to sleep with clock gating. There's no way that it
will hurt (assuming you can spare the power, heat and time to create
said resources), and there are some situations where it could help.

I stand by my assertion (if you actually read it, you might agree): more
execute units of a type than instructions that can be dispatched or
completed per cycle is a waste. Is that so hard to understand?
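
(Putting that bound in arithmetic form, a minimal sketch: sustained IPC can't exceed the narrowest of fetch, dispatch and completion, no matter how many execution units sit in between. The stage widths are just the ones quoted earlier in this thread, not a claim about any other detail of these designs.)

#include <stdio.h>

/* Sustained IPC is capped by the narrowest pipeline stage. */
static int sustained_ipc_bound(int fetch, int dispatch, int complete)
{
    int m = fetch;
    if (dispatch < m) m = dispatch;
    if (complete < m) m = complete;
    return m;
}

int main(void)
{
    /* EV6:    fetch 4, issue 6, retire 11        -> bounded at 4 per cycle */
    /* POWER4: fetch 8, dispatch 5 (4+1), complete 5 (4+1) -> 5 per cycle   */
    printf("EV6    bound: %d IPC\n", sustained_ipc_bound(4, 6, 11));
    printf("POWER4 bound: %d IPC\n", sustained_ipc_bound(8, 5, 5));
    return 0;
}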

Now you could object and say: "If you get backlog and queues that aren't
draining properly, you probably screwed up your design." That may be
true, but my point was that you cannot say "having an unbalanced
pipeline is useless" in all cases. It's worth looking into...

I never said anything of the kind. Maybe your strawman is talking to you
again?
 
