65nm news from Intel


Nick Maclaren

Who's "we" ?

A good question. But note that "by '08" includes "in 2005".
I have read that there will be ~1.7e9 transistors in Montecito.
Cache (2*1 MB L2 + 2*12 MB L3) probably accounts for ~90% of the
transistor count. Montecito is expected next year.

By whom is it expected? And how is it expected to appear? Yes,
someone will wave a chip at IDF and claim that it is a Montecito,
but are you expecting it to be available for internal testing,
to all OEMs, to special customers, or on the open market?
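As an aside, a rough check of the ~90% cache figure quoted above; this is
only a back-of-envelope sketch, assuming plain 6-transistor SRAM cells and
ignoring tags, ECC and the smaller arrays:

#include <cstdio>

int main() {
    // 2*1 MB L2 + 2*12 MB L3 = 26 MB of cache data arrays
    const double bits        = 26.0 * 1024 * 1024 * 8;
    const double transistors = bits * 6.0;   // one 6T SRAM cell per bit
    std::printf("%.2e transistors = %.0f%% of 1.7e9\n",
                transistors, 100.0 * transistors / 1.7e9);
    return 0;    // prints roughly 1.31e+09 transistors = 77% of 1.7e9
}

Tags, ECC and the other on-chip arrays would push the cache share up from
that ~77% towards the ~90% quoted.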


Regards,
Nick Maclaren.
 

Nick Maclaren

At least as far as your typical spaghetti C++ is concerned, yeah, not
going to happen anytime in the near future.

Sigh. You are STILL missing the point. Spaghetti C++ may be about
as bad as it gets, but the SAME applies to the cleanest of Fortran,
if it is using the same programming paradigms. I can't get excited
over factors of 5-10 difference in optimisability, when we are
talking about improvements over decades.
And yet, by that argument there should be no market for the big
parallel servers and supercomputers; yet there is. The solution is
that for things that need the speed, people just write the parallel
code by hand.

Sigh. Look, I am in that area. If it were only so simple :-(
If what's on the desktop when Doom X, Half-Life Y and Unreal Z come
out is a chip with 1024 individually slow cores, then those games will
be written to use 1024-way parallelism, just as weather forecasting
and quantum chemistry programs are today. Ditto for Photoshop, 3D
modelling, movie editing, speech recognition etc. There's certainly no
shortage of parallelism in the problem domains. The reason things like
games don't use parallel code today whereas weather forecasting does
isn't because of any software issue, it's because gamers don't have
the money to buy massively parallel supercomputers whereas
organizations doing weather forecasting do. When that changes, so will
the software.

Oh, yeah. Ha, ha. I have been told that more-or-less continually
since about 1970. Except for the first two thirds of your first
sentence, it is nonsense.

Not merely do people sweat blood to get such parallelism, they
often have to change their algorithms (sometimes to ones that are
less desirable, such as being less accurate), and even then only
SOME problems can be parallelised.


Regards,
Nick Maclaren.
 

Grumble

Nick said:
A good question. But note that "by '08" includes "in 2005".

I took "by 2008" to mean "sometime in 2008". Otherwise he would have
said "by 2005" or "by 2006", don't you think?
By whom is it expected? And how is it expected to appear? Yes,
someone will wave a chip at IDF and claim that it is a Montecito,
but are you expecting it to be available for internal testing,
to all OEMs, to special customers, or on the open market?

In November 2003, Intel's roadmap claimed Montecito would appear in
2005. 6 months later, Otellini mentioned 2005 again. In June 2004, Intel
supposedly showcased Montecito dies, and claimed that testing had begun.

http://www.theinquirer.net/?article=15917
http://www.xbitlabs.com/news/cpu/display/20040219125800.html
http://www.xbitlabs.com/news/cpu/display/20040619180753.html

Perhaps Intel is being overoptimistic, but, as far as I understand, they
claim Montecito will be ready in 2005.
 

Nick Maclaren

|> >
|> > By whom is it expected? And how is it expected to appear? Yes,
|> > someone will wave a chip at IDF and claim that it is a Montecito,
|> > but are you expecting it to be available for internal testing,
|> > to all OEMs, to special customers, or on the open market?
|>
|> In November 2003, Intel's roadmap claimed Montecito would appear in
|> 2005. 6 months later, Otellini mentioned 2005 again. In June 2004, Intel
|> supposedly showcased Montecito dies, and claimed that testing had begun.
|>
|> Perhaps Intel is being overoptimistic, but, as far as I understand, they
|> claim Montecito will be ready in 2005.

I am aware of that. Given that Intel failed to reduce the power when
going to 90 nm for the Pentium 4, that implies it will need 200
watts. Given that HP have already produced a dual-CPU package,
they will have boards rated for that. Just how many other vendors
will have?

Note that Intel will lose more face if they produce the Montecito
and OEMs respond by dropping their IA64 lines than if they make
it available only on request to specially favoured OEMs.


Regards,
Nick Maclaren.
 

Alex Johnson

Nick said:
By whom is it expected? And how is it expected to appear? Yes,
someone will wave a chip at IDF and claim that it is a Montecito,
but are you expecting it to be available for internal testing,
to all OEMs, to special customers, or on the open market?

By Intel and everyone who has been believing their repeated, unwavering
claims that mid-2005 will see commercial revenue shipments of Montecito.
Based on all the past releases in IPF, I expect a "launch" in June '05
and customers will have systems running in their environments around
August. There should be Montecito demonstrations at this coming IDF.
There were wafers shown at the last IDF. If my anticipated schedule is
correct, OEMs will have test chips soon.

Alex
 

Russell Wallace

Sigh. You are STILL missing the point. Spaghetti C++ may be about
as bad as it gets, but the SAME applies to the cleanest of Fortran,
if it is using the same programming paradigms. I can't get excited
over factors of 5-10 difference in optimisability, when we are
talking about improvements over decades.

"Cleanest of Fortran" usually means vector-style code, which is a
reasonable target for autoparallelization. I'll grant you if you took
a pile of spaghetti C++ and translated line-for-line to Fortran, the
result wouldn't autoparallelize with near-future technology any more
than the original did.
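To make the distinction concrete, here is a minimal sketch (C++ rather than
Fortran, purely illustrative and not anyone's actual code). The first loop
is the vector-style case: every iteration is independent, so an
autoparallelising compiler can split it across processors. The second
carries a dependence from one iteration to the next, and a line-for-line
translation of it will not autoparallelise in any language.

#include <cstddef>
#include <vector>

// Vector-style: y[i] touches only x[i] and y[i], so iterations can run
// in any order, or all at once.
void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t i = 0; i < y.size(); ++i)
        y[i] += a * x[i];
}

// Loop-carried dependence: iteration i needs the result of iteration i-1,
// so the loop as written is serial.  (A parallel prefix-scan exists, but
// that is a change of algorithm, not something a naive autoparalleliser
// will find.)
void running_sum(std::vector<double>& v) {
    for (std::size_t i = 1; i < v.size(); ++i)
        v[i] += v[i - 1];
}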
Sigh. Look, I am in that area. If it were only so simple :-(

I didn't claim it was simple. I claimed that, even though it's
complicated, it still happens.
Oh, yeah. Ha, ha. I have been told that more-or-less continually
since about 1970. Except for the first two thirds of your first
sentence, it is nonsense.

So you claim weather forecasting and quantum chemistry _don't_ use
parallel processing today? Or that gamers would be buying 1024-CPU
machines today if Id would only get around to shipping parallel code?
Not merely do people sweat blood to get such parallelism, they
often have to change their algorithms (sometimes to ones that are
less desirable, such as being less accurate), and even then only
SOME problems can be parallelised.

I didn't claim sweating blood and changing algorithms weren't
required. However, I'm not aware of any CPU-intensive problems of
practical importance that _can't_ be parallelized; do you have any
examples of such?
 

Nick Maclaren

|>
|> "Cleanest of Fortran" usually means vector-style code, which is a
|> reasonable target for autoparallelization. ...

Not in my world, it doesn't. There are lots of other extremely
clean codes.

|> >Oh, yeah. Ha, ha. I have been told that more-or-less continually
|> >since about 1970. Except for the first two thirds of your first
|> >sentence, it is nonsense.
|>
|> So you claim weather forecasting and quantum chemistry _don't_ use
|> parallel processing today? Or that gamers would be buying 1024-CPU
|> machines today if Id would only get around to shipping parallel code?

I am claiming that a significant proportion of the programs don't.
In a great many cases, people have simply given up attempting the
analyses, and have moved to less satisfactory ones that can be
parallelised. In some cases, they have abandoned whole lines of
research! Your statement was that the existing programs would
be parallelised:

then those games will be written to use 1024-way parallelism,
just as weather forecasting and quantum chemistry programs are
today

|> >Not merely do people sweat blood to get such parallelism, they
|> >often have to change their algorithms (sometimes to ones that are
|> >less desirable, such as being less accurate), and even then only
|> >SOME problems can be parallelised.
|>
|> I didn't claim sweating blood and changing algorithms weren't
|> required. However, I'm not aware of any CPU-intensive problems of
|> practical importance that _can't_ be parallelized; do you have any
|> examples of such?

Yes. Look at ODEs for one example that is very hard to parallelise.
Anything involving sorting is also hard to parallelise, as are many
graph-theoretic algorithms. Ones that are completely hopeless are
rarer, but exist - take a look at the "Spectral Test" in Knuth for
a possible candidate.

The characteristic of the most common class of unparallelisable
algorithms is that they are iterative: each step is small (i.e.
effectively scalar), yet it makes global changes, and the cost of
doing so is very small. This means that steps are never
independent, and are therefore serialised.
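A minimal sketch of that shape (my own illustration, not code from any of
the fields mentioned): forward Euler time-stepping of an ODE. Each step is
a handful of arithmetic operations, yet step n+1 cannot begin until step n
has produced its value, so a million steps stay serial however many
processors are available.

#include <cmath>
#include <cstdio>

int main() {
    const int    steps = 1000000;
    const double h     = 1e-6;
    double t = 0.0, y = 1.0;              // initial condition y(0) = 1

    for (int n = 0; n < steps; ++n) {     // loop-carried dependence on y
        y += h * std::cos(t * y);         // tiny, effectively scalar step
        t += h;
    }
    std::printf("y(%g) = %.6f\n", t, y);
    return 0;
}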

What I can't say is how many CPU-intensive problems of practical
importance are intrinsically unparallelisable - i.e. they CAN'T
be converted to a parallelisable form by changing the algorithms.
But that is not what I claimed.


Regards,
Nick Maclaren.
 

Russell Wallace

Your statement was that the existing programs would
be parallelised:

then those games will be written to use 1024-way parallelism,
just as weather forecasting and quantum chemistry programs are
today

Oh! I think we've been talking at cross purposes then.

I'm not at all talking about taking existing code and tweaking it to
run in parallel. I agree that isn't always feasible. I'm talking about
taking an existing problem domain and writing new code to solve it
with parallel algorithms.
What I can't say is how many CPU-intensive problems of practical
importance are intrinsically unparallelisable - i.e. they CAN'T
be converted to a parallelisable form by changing the algorithms.
But that is not what I claimed.

Okay, I'm specifically talking about using different algorithms where
necessary.
 

Scott Moore

Russell said:
At least as far as your typical spaghetti C++ is concerned, yeah, not
going to happen anytime in the near future.

The statement is wrong in any case. C can be translated to hardware
(which is de facto parallelism) by "constraints", i.e., refusing to
translate its worst features (look up SystemC, C-to-hardware and
similar). Other languages can do it without constraints. Finally,
any code, no matter how bad, could be so translated by executing it
(simulating it), and then translating what it does dynamically and
not statically. This simulation can then give the programmer a report
of what was not executed, and the programmer modifies the test cases
until all code has been so translated.
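A minimal sketch of the "constraints" idea (my own example, not taken from
SystemC or any particular tool): the first routine stays inside the subset
such translators typically accept, fixed trip counts and static arrays, and
unrolls naturally into parallel hardware; the second chases pointers through
dynamically built structures, the kind of feature a constrained translator
simply refuses.

// Accepted subset: fixed bounds, plain arrays, no heap, no recursion.
void fir4(const int x[16], const int h[4], int y[16]) {
    for (int n = 3; n < 16; ++n) {        // known trip count
        int acc = 0;
        for (int k = 0; k < 4; ++k)       // fully unrollable inner loop
            acc += h[k] * x[n - k];
        y[n] = acc;
    }
}

// Refused: unbounded pointer chasing; there is no static structure for
// the translator to map onto gates.
struct Node { int value; Node* next; };

int sum_list(const Node* p) {
    int s = 0;
    while (p) { s += p->value; p = p->next; }
    return s;
}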
And yet, by that argument there should be no market for the big
parallel servers and supercomputers; yet there is. The solution is
that for things that need the speed, people just write the parallel
code by hand.

If what's on the desktop when Doom X, Half-Life Y and Unreal Z come
out is a chip with 1024 individually slow cores, then those games will
be written to use 1024-way parallelism, just as weather forecasting
and quantum chemistry programs are today. Ditto for Photoshop, 3D
modelling, movie editing, speech recognition etc. There's certainly no
shortage of parallelism in the problem domains. The reason things like
games don't use parallel code today whereas weather forecasting does
isn't because of any software issue, it's because gamers don't have
the money to buy massively parallel supercomputers whereas
organizations doing weather forecasting do. When that changes, so will
the software.


--
Samiam is Scott A. Moore

Personal web site: http:/www.moorecad.com/scott
My electronics engineering consulting site: http://www.moorecad.com
ISO 7185 Standard Pascal web site: http://www.moorecad.com/standardpascal
Classic Basic Games web site: http://www.moorecad.com/classicbasic
The IP Pascal web site, a high performance, highly portable ISO 7185 Pascal
compiler system: http://www.moorecad.com/ippas

Being right is more powerful than large corporations or governments.
The right argument may not be pervasive, but the facts eventually are.
 

David Gay

And yet, by that argument there should be no market for the big
parallel servers and supercomputers; yet there is. The solution is
that for things that need the speed, people just write the parallel
code by hand.

More accurately, they try to. Whether they succeed is a different question
(it's clear that they do succeed sometimes, but there's no reason to
believe that just because you'd like it, you'll succeed).
 

Nick Maclaren

More accurately, they try to. Whether they succeed is a different question
(it's clear that they do succeed sometimes, but there's no reason to
believe that just because you'd like it, you'll succeed).

Precisely. As far as the easiness of doing it is concerned, the
question to ask is how the proportion of systems/money/effort/etc.
spent on large scale parallel applications is varying over time,
relative to that on all performance-limited applications.

If we exclude the modern equivalents of the Manhattan project,
and include the traditional vector systems as parallel (as they
were), my guess is that it has remained pretty constant for the
past 30 or 40 years. The number of performance-limited tasks that
can be parallelised is continually (if slowly) increasing, but
probably no faster than the number of tasks people would like to
do that are limited by performance.

Highly parallel systems were specialist in 1974, and they are STILL
specialist. We know how to do a LOT more in parallel than we
did then, but it is still a small proportion of what we would like
to do. Still, it keeps people like me off the streets :)


Regards,
Nick Maclaren.
 

Stefan Monnier

Precisely. As far as the easiness of doing it is concerned, the
question to ask is how the proportion of systems/money/effort/etc.
spent on large scale parallel applications is varying over time,
relative to that on all performance-limited applications.

Getting back to the issue of multiprocessors for "desktops" or even
laptops: I agree that parallelizing Emacs is going to be excruciatingly
painful, so I don't see it happening any time soon. But that's not really
the question.

I think that as SMP and SMT progress on those machines (first as
bi-processors), you'll see more applications use *very* coarse-grain
parallelism. It won't make much difference performance-wise: the extra
processor will be used for unrelated tasks like "background foo", which isn't
done now because it would slow things down too much on a uniprocessor.
Existing things mostly won't be parallelized, but the extra CPU will be used
for new things of dubious value.

Your second CPU will be mostly idle, of course, but so is the first CPU
anyway ;-)
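A minimal sketch of that coarse-grain pattern, in modern C++ and purely
illustrative (nothing here comes from the original post): the interactive
work keeps one core, and an unrelated, low-priority "background foo" soaks
up the other.

#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> quit{false};

// "Background foo": indexing, spell-checking, prefetching; work of
// dubious value that would hurt interactivity on a uniprocessor.
void background_foo() {
    while (!quit.load()) {
        // do a small slice of low-priority work, then back off
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }
}

int main() {
    std::thread worker(background_foo);   // lands on the spare CPU, if any
    // ... ordinary single-threaded interactive loop runs here ...
    quit.store(true);
    worker.join();
    return 0;
}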



Stefan
 

Nick Maclaren

I think that as SMP and SMT progress on those machines (first as
bi-processors), you'll see more applications use *very* coarse-grain
parallelism. It won't make much difference performance-wise: the extra
processor will be used for unrelated tasks like "background foo", which isn't
done now because it would slow things down too much on a uniprocessor.
Existing things mostly won't be parallelized, but the extra CPU will be used
for new things of dubious value.

I regret to say that I agree with you :-(


Regards,
Nick Maclaren.
 

Robert Myers

Stefan said:
Getting back to the issue of multiprocessors for "desktops" or even
laptops: I agree that parallelizing Emacs is going to be excruciatingly
painful, so I don't see it happening any time soon. But that's not really
the question.

I think that as SMP and SMT progress on those machines (first as
bi-processors), you'll see more applications use *very* coarse-grain
parallelism. It won't make much difference performance-wise: the extra
processor will be used for unrelated tasks like "background foo", which isn't
done now because it would slow things down too much on a uniprocessor.
Existing things mostly won't be parallelized, but the extra CPU will be used
for new things of dubious value.

Your second CPU will be mostly idle, of course, but so is the first CPU
anyway ;-)

I sometimes think: no one experienced the microprocessor revolution. Or
perhaps: everyone has adjusted his recollection so that he thinks he saw
things much more clearly than he did. Or perhaps: the world is divided
between those whose world-view was built before the revolution and are
never going to acknowledge exactly what they missed, anyway, and those
whose world-view was built too late to have enough perspective to see
just how badly everybody missed it.

The world of programming is about to change in ways that no big-iron or
cluster megaspending program ever could accomplish. I'm tempted to say:
get used to it, but it would be socially unacceptable and we're going to
have a repeat of what happened with the microprocessor revolution:
almost no one is going to put his hand to his forehead and say, "I
should have seen that coming, but I didn't."

RM
 

Rupert Pigott

Robert Myers wrote:

[SNIP]
The world of programming is about to change in ways that no big-iron or
cluster megaspending program ever could accomplish. I'm tempted to say:
get used to it, but it would be socially unacceptable and we're going to
have a repeat of what happened with the microprocessor revolution:
almost no one is going to put his hand to his forehead and say, "I
should have seen that coming, but I didn't."

More CPUs per chunk of memory ?

Back in 1990 as a PFY at INMOS I asked about why they took the
approach they did (OCCAM/CSP/Transputers). I was given an explanation
that included trends in heat dissipation, memory latency, clock rates,
leakage etc. By and large it's panning out as predicted, although the
timescales have proven to be a little longer (kudos to the guys doing
the chip design and silicon physics).


Cheers,
Rupert
 

Robert Myers

Rupert said:
Robert Myers wrote:

[SNIP]
The world of programming is about to change in ways that no big-iron
or cluster megaspending program ever could accomplish. I'm tempted to
say: get used to it, but it would be socially unacceptable and we're
going to have a repeat of what happened with the microprocessor
revolution: almost no one is going to put his hand to his forehead and
say, "I should have seen that coming, but I didn't."


More CPUs per chunk of memory ?

Back in 1990 as a PFY at INMOS I asked about why they took the
approach they did (OCCAM/CSP/Transputers). I was given an explanation
that included trends in heat dissipation, memory latency, clock rates,
leakage etc. By and large it's panning out as predicted, although the
timescales have proven to be a little longer (kudos to the guys doing
the chip design and silicon physics).

Yes, indeed.

That's a powerful insight, but I would characterize it as the hardware
driver for what I see as a more profound revolution in software. Who
knows, maybe the day of Occam is at hand. :).

The smallest unit that anyone will ever program for non-embedded
applications will support (I hesitate to guess how many) execution pipes,
but certainly more than one. Single-pipe programming, using tools
appropriate for single-pipe programming, will come to seem just as
natural as doing physics without vectors and tensors.

The fact that this reality is finally percolating into the lowly but
ubiquitous PC is what I'm counting on for magic.

RM
 

Russell Wallace

Just like programming in general, really :)
Highly parallel systems were specialist in 1974, and they are STILL
specialist. We know how to do a LOT more in parallel than we
did then, but it is still a small proportion of what we would like
to do. Still, it keeps people like me off the streets :)

What are some examples of important and performance-limited
computation tasks that aren't run in parallel?
 

Andrew Reilly

Russell said:
What are some examples of important and performance-limited
computation tasks that aren't run in parallel?

I.e., that run fastest on a one-processor Itanium or Opteron or
Xeon workstation...

On the other hand, who isn't drooling over these:

http://www.orionmulti.com/products/

Quoting the press release on Transmeta's web site:

"The specifications for Orion's DS-96 deskside Cluster Workstation
include 96 nodes with 300 Gflops peak performance (150 sustained),
up to 192 gigabytes of memory and up to 9.6 terabytes of storage.
It consumes less than 1500 watts and fits unobtrusively under a
desk. Orion's DT-12 desktop Cluster Workstation has 12 nodes with
36 Gflops peak performance (18 sustained), up to 24 gigabytes of
DDR SDRAM memory and up to 1 terabyte of internal disk storage.
The DT-12 consumes less than 220 watts and is scalable to 48 nodes
by stacking up to four systems.

"Orion's desktop model will be available in October 2004, and the
deskside model will be available during the latter part of Q4. For
more information about Orion Multisystems and its products, visit
www.orionmultisystems.com.

Have to wonder why all of those nodes are hooked together (inside
the box, presumably on the motherboard) with gigabit ethernet,
rather than something like the Horus chipset that's been spoken
about here recently, given that the processors have HyperChannel
interfaces. My guess is that it let them offload system software
development onto the open source cluster community, without having
to even do device drivers. I guess that the HyperChannel is for
peripherals, and doesn't do interprocessor cache coherency anyway.
Still, you'd think that they could have come up with something
lighter-weight than gigabit ethernet, switched or not.

R Clint Whaley and others have been playing with Atlas on Efficeons
recently, too. They don't look to be too bad, although there seem to
be some code-vs-data cache pressure issues. Two flops/clock peak
(2GFlop at 1GHz) realizing between 90% and 60% of peak on various
atlas kernels.

Cheers,
 

Rupert Pigott

Robert Myers wrote:

[SNIP]
The smallest unit that anyone will ever program for non-embedded
applications will support (I hesitate to guess how many) execution pipes,
but certainly more than one. Single-pipe programming, using tools
appropriate for single-pipe programming, will come to seem just as
natural as doing physics without vectors and tensors.

The fact that this reality is finally percolating into the lowly but
ubiquitous PC is what I'm counting on for magic.

I really wouldn't hold your breath. Look how long it took for SMP to
become ubiquitous with major league UNIXen ... Has it had much of an
impact on the code base at large ? IMO : It hasn't.

UNIX had three stumbling blocks :

1) UNIX does let you make use of multiple CPUs at a coarse-grained level
with stuff like pipes (i.e.: good enough).

2) The predominance of single-threaded languages that promote
single-threaded thinking.

3) Libraries designed for single-threaded, non-reentrant usage.
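Point 3 is easy to underestimate. A minimal sketch of the problem (my own
example, not any particular library):

#include <cstdio>

// Classic non-reentrant interface: the result lives in one static buffer
// shared by every caller.  Two threads calling this concurrently trample
// each other's result, so in practice the whole library ends up behind a
// single lock and the second CPU simply waits.
const char* format_id(int id) {
    static char buf[32];
    std::snprintf(buf, sizeof buf, "ID-%06d", id);
    return buf;
}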

By all accounts Windows NT suffers from the same, but to be fair it
has supported threading for a very long time and MS has been pushing
it very hard too. The codebase is positively riddled with threads by
comparison to UNIX, but I haven't seen much that is genuinely scalable.

I don't believe that some kid will have a stunning insight as a result
of having a 2 or a 4P NT/Linux box sat on their desk either. Such boxes
have been around a *long* time and in the hands of some very clever
people who have already cleaned out the low-hanging fruit and are about
1/3rd the way up the tree at the moment.

I think hard graft is needed; perhaps having more boxes in more hands
will help increase the volume of hard graft, and in turn that might get
us a result.


Cheers,
Rupert
 

Nick Maclaren

|>
|> >Highly parallel systems were specialist in 1974, and they are STILL
|> >specialist. We know how to do a LOT more in parallel than we
|> >did then, but it is still a small proportion of what we would like
|> >to do. Still, it keeps people like me off the streets :)
|>
|> What are some examples of important and performance-limited
|> computation tasks that aren't run in parallel?

ODEs, to a great extent.

A great deal of transaction processing.

A great deal of I/O.

Event handling in GUIs.


Regards,
Nick Maclaren.
 
