65nm news from Intel


The little lost angel

Umm, did you catch the link here earlier today, comparing the 90nm A64,
130nm A64, and 90nm P4? A P4 at >230W! Yeow! I passed that one around
the office. ;-)

But those were system figures, not just the processors, no? The last time
I looked, the average Opteron/A64 topped out at around 60W, so a good
100W of those figures is likely from the other components. That would
make the P4 burn around 130~140W. So those 200W heatsinks wouldn't
quite be needed just yet, right? :pppP

--
L.Angel: I'm looking for web design work.
If you need basic to med complexity webpages at affordable rates, email me :)
Standard HTML, SHTML, MySQL + PHP or ASP, Javascript.
If you really want, FrontPage & DreamWeaver too.
But keep in mind you pay extra bandwidth for their bloated code
 

Stephen Fuld

Nick Maclaren said:
In the above, you mean SMT, I assume.

Yes, sorry. :-(
It's been possible for at least 5 years, probably 10. Yes, the cores
of a CMP system would necessarily be simpler, but it becomes possible
as soon as the transistor count of the latest and greatest model in
the range exceeds double that of the simplest. Well, roughly, and
allowing for the difference between code and data transistors.

But if you compare different cores, the more complex ones for the SMT
(excluding the extra complexity of the SMT) versus a simpler one for the
CMP, then you complicate the comparison by not comparing apples to apples.
How much of the difference is the SMT vs CMP and how much is the difference
in cores? One presumes the more complex core performs better than the
simpler one (or why build the complex one?). Besides, if the SMT die area
penalty is in the 10% range that many have been quoting - i.e. the SMT core
occupies 1.1x the base area, so an equal-area two-core CMP gets 0.55x per
core - can you really do the "simpler" core in almost exactly 55% of the
die area of the complex one? Once you change the core, you change the
comparison such that I maintain that it isn't the same comparison any more
and my original comment holds.

Yes, a comparison with different cores could be done, but I can see why no
one is very interested in doing it.
 

Peter Boyle

Logic. When else can SMT really do net increased work?
If you want to test, run some pointer-chasers.

Ah, I was objecting to what I read as your claim that SMT is the best
way to deal with 300-cycle latencies. Switch-on-event multithreading may
help equally well, and chip multiprocessing may help more.
From your point above, I suspect I misread, and that you are merely
pointing out that latency tolerance is the best use for SMT.
This is getting more and more true as caches grow, but only
from an areal perspective. A multiplier still sucks back a
huge amount of power and tosses it out as heat.
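For anyone who wants to run Robert's experiment, a minimal pointer-chaser
is only a few lines of C. A sketch, with arbitrary sizes and a fixed
stride (a random permutation would defeat the hardware prefetcher more
thoroughly); run two copies at once on an SMT machine and compare against
running them back to back:

    /* Every load depends on the previous one, so each iteration
       costs roughly a full memory latency. */
    #include <stdio.h>
    #include <stdlib.h>

    #define NODES (1 << 22)     /* ~32 MB of nodes with 64-bit pointers */

    struct node { struct node *next; };

    int main(void)
    {
        struct node *pool = malloc(NODES * sizeof *pool);
        struct node *p;
        long i;

        if (!pool)
            return 1;

        /* Link with a large odd stride so successive loads miss the
           cache; the stride is coprime to NODES, so this is one big
           cycle visiting every node. */
        for (i = 0; i < NODES; i++)
            pool[i].next = &pool[(i + 100003) % NODES];

        p = pool;
        for (i = 0; i < 50000000L; i++)  /* the chase */
            p = p->next;

        printf("%p\n", (void *)p);       /* keep the compiler honest */
        free(pool);
        return 0;
    }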

There is also the Piranha concept of making the multiple cores
simpler so they don't suck so much heat. I think the jury is still out
on CMP vs SMT.

BTW, anyone see the Broadcom BCM1480 announcement? Four 1.2GHz quad-issue
in-order MIPS cores with 3 HT ports in 0.09um, drawing only 23W.
Nope, not without a second memory bus and all those pins.

I misread your original post - now that I've parsed it correctly, I agree
completely on this point!

Sorry for jumping in too hastily,

Peter

Peter Boyle (e-mail address removed)
 

Jan Vorbrüggen

You haven't allowed for the problem of access.

Access to what, please?
Look at the performance counters, think of floating-point modes
(in SMT, they may need to change for each operation),

All thread-specific state information flows together with the instruction
it belongs to through the pipeline. Yes, the amount of information you
are sending along has increased - however, access to global state (e.g.,
FP mode flags) is costly as well, and I believe there have been
implementations that have taken the route sketched above for performance
reasons without SMT.
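To make that concrete, a hypothetical pipeline-latch record might look
like the struct below (the field names are invented for illustration).
The point is that the FP unit takes its mode bits from the micro-op
itself, not from a global mode register:

    #include <stdint.h>
    #include <stdio.h>

    /* Each in-flight micro-op carries its own copy of the per-thread
       state, so two SMT threads with different rounding modes can be
       in flight at the same time. */
    struct uop {
        uint8_t  thread_id;   /* one bit suffices for 2-way SMT */
        uint8_t  fp_round;    /* rounding mode, captured at decode */
        uint8_t  fp_flush;    /* flush-to-zero / denormal handling */
        uint16_t opcode;
        uint8_t  src1, src2, dst;  /* renamed physical registers */
    };

    int main(void)
    {
        printf("%zu bytes per in-flight uop\n", sizeof(struct uop));
        return 0;
    }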
think of quiescing the other CPU (needed for single to dual thread
switching), think of interrupts (machine check needs one logic, and
underflow another). In ALL cases, on two CPUs, each can operate
independently, but SMT threads can't.

So you make SMT a little asymmetric: you stop decoding/issuing instructions
for all threads but one, and when they have drained from the pipeline, you
are back in the single-thread situation, and continue from there. This is
at least correct behaviour, and if its performance impact is too great, you
look at those subsets of situations where you can relax the constraints
this imposes.
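In pseudo-hardware terms that switch is a small state machine; a toy model
(invented names) of the drain just described:

    #include <stdio.h>

    enum core_mode { DUAL, DRAINING, SINGLE };

    /* Requesting single-thread mode blocks issue for all threads but
       one; the core drops to SINGLE only once the other thread's
       in-flight instructions have drained. */
    static enum core_mode step(enum core_mode m, int other_in_flight)
    {
        switch (m) {
        case DUAL:     return DRAINING;  /* block issue for thread 1 */
        case DRAINING: return other_in_flight ? DRAINING : SINGLE;
        case SINGLE:   return SINGLE;
        }
        return m;
    }

    int main(void)
    {
        enum core_mode m = DUAL;
        int in_flight = 3;              /* uops still in the pipeline */

        while (m != SINGLE) {
            m = step(m, in_flight);
            if (in_flight > 0)
                in_flight--;            /* pipeline drains over time */
        }
        puts("single-thread mode reached");
        return 0;
    }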

Jan
 

Jan Vorbrüggen

code could be autoparallelized by an autoparallelizing compiler.
Yeah? Like whose?

Cray, Sun, IBM, DEC, ...

Oh, you mean performance is worse than for your hand-tuned MPI program?
Yeah, but that _is_ the state of the art.

Jan
 

Jan Vorbrüggen

Between the register file and the execution units, and between
execution units. The point is the days when 'wiring' was cheap
are no more - at least according to every source I have heard!

While the latter is true, with the former you are comparing apples
and oranges - the starting point is adding SMT-like capability to a
processor with a given set of resources (FUs, registers, ...). The wiring
mentioned above does not change substantially - in the minimal case of
2-thread SMT, all it must carry is one additional bit/wire to distinguish
the two threads.
No, they don't. Take performance counters. [...] The
Pentium 4 kludges this horribly.

Yeah, it seems they didn't completely think this through on the first
round. So one imperfect implementation damns the concept? Methinks not.

Jan
 

Jan Vorbrüggen

"Make" is run on workstations. It is not a legacy application for
personal computers.

Ah, blech. I'm mostly in the PC category - mail, editing, Excel & Co.
But fairly regularly I run a compute-intensive program - it might be an
applet running in my browser - and Winwoes' broken scheduler gets me.
Same when paging is occurring (another broken piece of software, the
pager/swapper in Winwoes). In these cases, a second processor would
help my productivity a lot. Less often, I even use make (in the form
of pressing the Build button in an MSDS project, for instance). And
the guys in our software development team use the same type of system
as I am using - does that turn their machines from PCs into workstations?

I think that distinction is dead, nowadays.

Jan
 

Andrew Reilly

Peter said:
BTW, anyone see the Broadcom BCM1480 announcement? Four 1.2GHz quad-issue
in-order MIPS cores with 3 HT ports in 0.09um, drawing only 23W.

Or the recent Freescale MPC8641D announcement? Also 90nm, 15W(?):
dual-core 1.5GHz PPC G4 (4-issue: 3 + branch), with dual 64-bit
DDR2 memory interfaces on chip, 1MB of L2 cache per core, and RapidIO
and GigE ports for fabric.

I know that Altivec doesn't excite the double-precision-only guys
in comp.arch (followups set -- sorry, ibm-pc.h.c folk), but the
density at which you could build an array of these things
would be pretty wicked (tessellation of chip+DRAM, basically).
And they do do double precision at some speed.

Cheers,
 

Jouni Osmala

"Make" is run on workstations. It is not a legacy application for
personal computers.

Ah, blech. I'm mostly in the PC category - mail, editing, Excel & Co.
But fairly regularly I run a compute-intensive program - it might be an
applet running in my browser - and Winwoes' broken scheduler gets me.
Same when paging is occurring (another broken piece of software, the
pager/swapper in Winwoes). In these cases, a second processor would
help my productivity a lot. Less often, I even use make (in the form
of pressing the Build button in an MSDS project, for instance). And
the guys in our software development team use the same type of system
as I am using - does that turn their machines from PCs into workstations?

I think that distinction is dead, nowadays.

Jan

Legacy performance improvements mean LITTLE when considering new CPUs.
Think: does your Word need better performance, or Excel, or PowerPoint?
Where do you need more performance than the current CPUs give?
A) Games.
B) Video/GFX editing.
C) Compilation. [Not for Joe MSuser.]
D) Running more tasks simultaneously.

A) has quite some potential for coarse-grain parallelism: running separate
threads for AI, physics, networking and graphics, plus the OS and video
drivers etc. (a sketch follows this list).
B) Parallelization is already done in many software products.
C) GCC is already parallelized.
D) This is the interesting one at the moment: Windows users have plenty of
tasks - firewall, MP3 player, P2P application, virus scanner - running in
the background, and having one CPU for the foreground and the other for
ALL the background tasks does speed up the foreground task and gets rid
of annoying pauses.
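A toy sketch of the coarse-grain decomposition mentioned under A), using
POSIX threads; real engines also synchronise the subsystems every frame,
which is the hard part this omits:

    /* One thread per game subsystem; rendering stays on the main
       thread. Compile with: cc game.c -lpthread */
    #include <pthread.h>
    #include <stdio.h>

    static void *ai_thread(void *arg)      { puts("AI tick");      return arg; }
    static void *physics_thread(void *arg) { puts("physics tick"); return arg; }
    static void *net_thread(void *arg)     { puts("network tick"); return arg; }

    int main(void)
    {
        pthread_t t[3];

        pthread_create(&t[0], NULL, ai_thread, NULL);
        pthread_create(&t[1], NULL, physics_thread, NULL);
        pthread_create(&t[2], NULL, net_thread, NULL);

        puts("render tick");        /* main thread does the graphics */

        pthread_join(t[0], NULL);
        pthread_join(t[1], NULL);
        pthread_join(t[2], NULL);
        return 0;
    }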

Also remember that plenty of transistors are available, and that hunting
ILP and frequency with extra transistors has reached the point where
doubling the transistor budget won't buy much any more, at least for x86.
Remember that the P4 was double the size of the P3 when it first came out;
at that point, taking the P4's bus and putting two P3s on one die as a CMP
would have been equal in die area. That's why CMP is the way to go. Hunting
higher frequencies and more ILP isn't going to work like it used to, so
they need to find other uses for the additional transistors, and a second
core is the obvious choice.

Jouni Osmala
 

Joe Seigh

Felger said:
We have long had desktop SMP available. Question: what legacy
software runs faster on two cores (whether on one or two chips) than
on one? Answer: none.

What about IDE? That seems to be rather CPU-intensive when you're doing
a lot of I/O. The economics are a little strange, though. While a SCSI
processor is simpler and would offload the I/O processing somewhat, it's
more expensive than a second CPU core.

Joe Seigh
 

Peter Dickerson

Nick Maclaren said:
|> >
|> > You haven't allowed for the problem of access. A simple (CMP)
|> > duplication doesn't increase the connectivity, and can be done
|> > more-or-less by replicating a single core; SMT does, and may need
|> > the linkages redesigning. This might be a fairly simple task for
|> > 2-way SMT, though there have been reports that it isn't even for
|> > that, but consider it for 8-way.
|>
|> I think we must be talking at cross purposes because to me an 8-way SMT is
|> very little different from a 2-way. Bigger register files for architected
|> state and a few more bits into the renamer. I don't know what you mean by
|> linkages in this context. Linkages between what and what?

Between the register file and the execution units, and between
execution units. The point is the days when 'wiring' was cheap
are no more - at least according to every source I have heard!

We've been talking at cross purposes. I'm saying what needs to be added to
a CPU to give it SMT capabilities, extracting extra work from the otherwise
idle execution units. You seem to be saying what needs to be added to a CPU
to give it the performance of two CMP cores. I don't think that SMT should
be viewed that way, at least not for current implementations. I see it as
something that can be squeezed out for a little bit of extra silicon - the
register file and execution units wouldn't change, in my view.
|> > Look at the performance counters, think of floating-point modes
|> > (in SMT, they may need to change for each operation), think of
|> > quiescing the other CPU (needed for single to dual thread switching),
|> > think of interrupts (machine check needs one logic, and underflow
|> > another). In ALL cases, on two CPUs, each can operate independently,
|> > but SMT threads can't.
|>
|> I don't see this at all. I'm not saying these things are trivial, I'm saying
|> that most of it has to be done for a single threaded OoO CPU too.

No, they don't. Take performance counters. In an OoO CPU, you have
a single process and single core, so you accumulate the counter and,
at context switch, update the process state. With SMT, you have
multiple processes and multiple cores - where does the time taken
(or events occurring) in an execution unit get assigned to? The
Pentium 4 kludges this horribly.

Here, you would need two (or whatever) copies of the performance counters.
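The single-core scheme Nick describes is simple enough to sketch
(hypothetical names, with a dummy counter read so it stands alone); the
SMT headache is that events in the shared execution units don't map
cleanly onto either logical CPU's counter set:

    #include <stdint.h>
    #include <stdio.h>

    struct process { uint64_t cycles; };

    /* one saved sample per logical CPU (2-way SMT assumed) */
    static uint64_t last_sample[2];

    /* stand-in for an RDPMC-style read of a cycle counter; returns
       made-up values here so the sketch compiles and runs */
    static uint64_t read_hw_counter(int logical_cpu)
    {
        static uint64_t fake[2];
        return fake[logical_cpu] += 1000;
    }

    /* at each context switch, charge the delta since the last switch
       to the outgoing process on that logical CPU */
    static void context_switch(int logical_cpu, struct process *out)
    {
        uint64_t now = read_hw_counter(logical_cpu);
        out->cycles += now - last_sample[logical_cpu];
        last_sample[logical_cpu] = now;
    }

    int main(void)
    {
        struct process a = { 0 };
        context_switch(0, &a);
        printf("charged %llu cycles\n", (unsigned long long)a.cycles);
        return 0;
    }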
Consider mode switching. In an OoO CPU, a typical mode switch is
a synchronisation point, and is reset on a context switch. With
SMT, a mode must be per-thread (which was said by hardware people
to be impossible a decade ago).

But that is exactly what the Pentium 4 does. Each virtual CPU does its own
thing. You use the word thread here, which I don't use in this context. It
is utterly trivial for an OoO design.
Consider interrupt handling. Underflow etc. had better be handled
within its thread, because the other might be non-interruptible
(and think scalability). But you had BETTER not handle all machine
checks like that (such as ones that disable an execution unit, in
a high-RAS design), as the execution units are in common.

Interrupts and exceptions are handled in exactly the same way as two cores
or chips would handle them. Exceptions are taken by the virtual processor
that triggered them. Hardware interrupts are taken by whichever (virtual)
processor they are assigned to by the interrupt controller (the APIC, in
PC-style designs).

Machine checks are taken on each virtual CPU as and when that CPU detects
them. If the machine state is architecturally visible, then it is
duplicated.
Consider quiescing the other CPU to switch between single and dual
thread mode, to handle a machine check or whatever. You had BETTER
ensure that both CPUs don't do it at once ....

I don't understand this. If you are switching from a single virtual CPU to
two, how can the second be doing anything until it exists?

Peter
 

Robert Redelmeier

In comp.sys.ibm.pc.hardware.chips Peter Boyle said:
pointing out that latency tolerance is the best use for SMT.

This is correct. I don't see SMT as a quick route to
maximum performance, but rather a cheap route to somewhat
better performance. SMT could be done in combination with
SMP or even CMP. Do something during the stalls.
Sorry for jumping in too hastily,

Britain and America. Divided by a common language :)

-- Robert
 

Robert Redelmeier

In comp.sys.ibm.pc.hardware.chips Joe Seigh said:
What about IDE? That seems to be rather cpu intensive
when you're doing a lot of i/o.

Not anymore. Modern IDE chipsets do busmaster DMA with fairly low CPU
overhead. AFAIK, IDE still doesn't have a multi-command bus, which SCSI
has always had (no need to wait for seeks to complete). So SCSI is still
preferable when there is more than one heavily used device per bus.

-- Robert
 

Bernd Paysan

Robert said:
AFAIK, IDE still doesn't have a
multicommand bus which SCSI has always had (don't need to
wait for seeks to complete).

SATA-II has native command queuing, NCQ. AFAIK, you could use SCSI command
queuing on SATA if the device understood it, since SATA (like ATAPI) allows
transferring SCSI commands, and therefore using SCSI's command queuing
mechanism ("tagged command queuing", TCQ).
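Schematically, tagged queuing amounts to the host issuing several tagged
commands without waiting for completions; the device reorders them to
minimise head movement and reports completions by tag. A toy sketch, not
a real driver API (NCQ's tag field is 5 bits, hence 32 outstanding
commands):

    #include <stdint.h>
    #include <stdio.h>

    #define QUEUE_DEPTH 32      /* NCQ: 5-bit tag, up to 32 in flight */

    struct tagged_cmd {
        uint8_t  tag;           /* completion is matched by tag */
        uint64_t lba;           /* starting sector */
    };

    int main(void)
    {
        /* Four reads issued back to back, with no waiting in between.
           The drive is free to service them as 12, 13, 450000, 900000 -
           one elevator sweep instead of four long seeks. */
        struct tagged_cmd q[4] = {
            { 0, 900000 }, { 1, 12 }, { 2, 450000 }, { 3, 13 },
        };
        int i;

        for (i = 0; i < 4; i++)
            printf("issued tag %u for LBA %llu\n",
                   (unsigned)q[i].tag, (unsigned long long)q[i].lba);
        return 0;
    }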
 

Stephen Sprunk

Stephen Fuld said:
But if you compare different cores, the more complex ones for the SMT
(excluding the extra complexity of the SMT) versus a simpler one for the
CMP, then you complicate the comparison by not comparing apples to apples.
How much of the difference is the SMT vs CMP and how much is the
difference in cores? One presumes the more complex core performs better
than the simpler one (or why build the complex one?). Besides, if the SMT die
area penalty is in the 10% range that many have been quoting, can you do
the "simpler" core in almost exactly 55% of the die area of the complex
one? Once you change the core, you change the comparison such that I
maintain that it isn't the same comparison any more and my original
comment holds.

IBM claimed 25% die size increase to add SMT to Power5, and a -5% to +24%
performance gain, i.e. the performance increase _never_ matches up to the
size increase.

Traditional SMP costs 100% more in die size and, in the case of Opteron,
never exceeds a 100% performance gain except in corner cases. CMP will
require slightly less die area (and significantly less in system cost),
but that will likely be offset by the memory contention of having half
the number of memory channels.

Looks like apples and apples so far.

S
 

Scott Alfter


When I looked (a few months ago), a decent dual Athlon MP board was around
$400, with the processors at a premium too. I was considering a dual K7 at
the time, rather than a single K8. The duals lost because of the cost: it
would have been cheaper to upgrade the second system than to go SMP.

I've been running a Tyan S2466N-4M with a pair of Athlon MP 2100s at home
now for a couple of years or so (have been thinking of a pair of faster
processors as a cheap upgrade lately). Pricewatch puts this board at about
$190, with tray processors at about $90 each. IIRC, the board didn't cost
much more back when I bought it than it does now. (The processors were a
fair bit more expensive, but they were only one or two steps down from the
fastest-available speed at the time.)

_/_
/ v \ Scott Alfter (remove the obvious to send mail)
(IIGS( http://alfter.us/ Top-posting!
\_^_/ rm -rf /bin/laden >What's the most annoying thing on Usenet?

 

Sander Vesik

In comp.arch Robert Redelmeier said:
Logic. When else can SMT really do net increased work?

Results derived from logic that are not backed up with
evidence usually turn out to be wrong.
If you want to test, run some pointer-chasers.

So why do you claim logic and not post your own results?

[snip]
CMP will also help the former [bandwidth].

Nope, not without a second memory bus and all those pins.

Provided they share at least one level of cache, what you
just said is completely false.
 

Sander Vesik

In comp.arch Felger Carbon said:
Bzzt! This is an IBM.PC NG.

comp.arch is not an ibm.pc NG, so bzzt yourself, please.
"Make" is run on workstations. It is not a legacy application for
personal computers.

You are wrong here again - make even used to come with DOS, and many
legacy copies from various DOS development packages abound.
 

Stephen Fuld

Stephen Sprunk said:
IBM claimed 25% die size increase to add SMT to Power5, and a -5% to +24%
performance gain, i.e. the performance increase _never_ matches up to the
size increase.

I don't remember those figures from IBM. I am not doubting you at all -
just my memory. But a 25% die area penalty does seem high to me, based
on what others have said.

Also, it isn't clear that there is some version of the Power core that
takes about 63% of the size of the non-SMT one - so that two of them would
fit in the area of the SMT-capable core - with which one could do Nick's
comparison. I certainly agree that if you have twice the silicon area, CMP
may outperform SMT; but if you can only "afford" less than a 100% area
penalty, then SMT may make sense.

Traditional SMP costs 100% more in die size and, in the case of Opteron,
never exceeds a 100% performance gain except in corner cases. CMP will
require slightly less die area (and significantly less in system cost),
but that will likely be offset by the memory contention of having half
the number of memory channels.

Yes - agreed.
Looks like apples and apples so far.

But Nick specified equal die area for his comparison.
 

Thu

Stephen Sprunk said:
IBM claimed 25% die size increase to add SMT to Power5, and a -5% to +24%
performance gain, i.e. the performance increase _never_ matches up to the
size increase.

Where did you get those figures? Either your memory is playing
tricks on you or IBM were very conservative with their estimates. If
IBM originally claimed those numbers, they are well below what it is
now getting in benchmarks.

Check: http://www.redbooks.ibm.com/redpieces/pdfs/sg245768.pdf
p. 128: an increase of 10-50% with SMT on some industry-standard
benchmarks, with an average of around 30-40%.
 
