Another AMD supercomputer, 13,000 quad-core


Robert Myers

George said:
Dell compared with Cray? What are you smoking?

Now an Itanium cluster is compared with a HPC high-capability system? I
must get some of that stuff you have! If Cray is in trouble, SGI is in
(extended) death rattle.
SGI's real value-added seems to be their ability to run a very large
number of processors under a single system image. In my perception,
Cray vector processors definitely influenced the design of Itanium, and
Itanium shines for the same kind of problems as the classic Crays.
Whether it was actually wise for SGI to use Itanium is a separate
question entirely, but anyone who thinks that AMD is the future of
supercomputing is out of touch with reality.

Folding@home is reporting a 20-40x speedup using GPUs as opposed to
COTS processors. That's not a theoretical advantage; it's one
actually achieved in practice. I haven't been following the GPGPU
(General Purpose GPU) field closely, but if I had a bundle of money to
spend on R&D, that's where it would be going, not into a new high-end
processor. I do know the strengths and limitations of vector
processors very well, and the new stream processors (including Cell and
GPUs) seem like more than worthy successors--at a much lower price.
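
To make the point concrete, here is a toy sketch (plain numpy, my own
construction, nothing like Folding@home's actual kernels): the same
pairwise-interaction sum written first as a scalar loop and then as one bulk
data-parallel expression. The second form -- one instruction stream applied
to many independent data elements -- is the shape of work that GPUs and
Cell-style stream processors accelerate.

import numpy as np

rng = np.random.default_rng(0)
pos = rng.random((512, 3))             # hypothetical particle positions

def energy_scalar(p):
    """One pair at a time -- the conventional CPU style."""
    e = 0.0
    for i in range(len(p)):
        for j in range(i + 1, len(p)):
            r = np.sqrt(((p[i] - p[j]) ** 2).sum())
            e += 1.0 / r
    return e

def energy_streamed(p):
    """All pair distances as one bulk array operation -- the stream style."""
    d = p[:, None, :] - p[None, :, :]      # (N, N, 3) displacement array
    r = np.sqrt((d ** 2).sum(axis=-1))
    iu = np.triu_indices(len(p), k=1)      # each unordered pair once
    return (1.0 / r[iu]).sum()

The two functions agree to round-off; the difference is purely in how much
parallelism the hardware can see.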

Robert.
 

George Macdonald

SGI's real value-added seems to be their ability to run a very large
number of processors under a single system image. In my perception,
Cray vector processors definitely influenced the design of Itanium, and
Itanium shines for the same kind of problems as the classic Crays.
Whether it was actually wise for SGI to use Itanium is a separate
question entirely, but anyone who thinks that AMD is the future of
supercomputing is out of touch with reality.

That's a cheap shot Robert - taking an adversarial stance to something I
never said in the first place - strawman! The facts have already been
discussed here - the opinions differ but the fact is that an AMD, or Xeon I
suppose, cluster satisfies a largish portion of super-computing needs; that
is to say what is known as "high capacity" systems.

In the face of that, the market for "high capability" systems is now so
meagre that nobody really wants to touch it: the (potential) customers,
govt. & commercial will not commit: "oh sure, we need/want one *but* only
if the price is right". How the hell is a high capability OEM to target
this umm, market without impaling themselves on the dilemma of how to
actually survive with any reasonable expectation?

The inspiration for Itanium notwithstanding, an SGI cluster is *not* a high
capability supercomputer - just read what the customer base thinks on that
one. Now that the expected volume for Itanium is fairly well established
it becomes clear that it is not even classifiable as a merchant CPU. It's
dead!
Folding@home is reporting a 20-40x speedup using GPUs as opposed to
COTS processors. That's not a theoretical advantage; it's one
actually achieved in practice. I haven't been following the GPGPU
(General Purpose GPU) field closely, but if I had a bundle of money to
spend on R&D, that's where it would be going, not into a new high-end
processor. I do know the strengths and limitations of vector
processors very well, and the new stream processors (including Cell and
GPUs) seem like more than worthy successors--at a much lower price.

I don't know enough about Folding@home to know where it fits in the
supercomputing spectrum between high capability/capacity. If you have
expectations of progress in GPUs, it'd be worthwhile to take note of
comments by the major manufacturers on the future of GPUs: there isn't
one - according to their experts the wall has been hit on
pixel/vertex/triangle processing... and the future of GPUs is in more
custom logic and a progression to a more CPU-like framework. Maybe Cell
is the answer, but it does not escape the programming difficulty problem,
and if it's going to fit supercomputing, its double-precision
floating-point performance will certainly have to be extended.
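
To see what the precision issue means in practice, here is a tiny, generic
illustration (plain numpy, nothing Cell-specific): single precision cannot
resolve the small relative contributions that long scientific reductions
accumulate, which is why double-precision throughput matters so much for
this market.

import numpy as np

# Machine epsilon: the smallest relative step each format can resolve.
print(np.finfo(np.float32).eps)   # ~1.19e-07
print(np.finfo(np.float64).eps)   # ~2.22e-16

# A contribution smaller than eps simply vanishes in single precision.
print(np.float32(1.0) + np.float32(1e-8) == np.float32(1.0))   # True
print(np.float64(1.0) + np.float64(1e-8) == np.float64(1.0))   # False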
 

Robert Myers

George said:
That's a cheap shot Robert - taking an adversarial stance to something I
never said in the first place - strawman! The facts have already been
discussed here - the opinions differ but the fact is that an AMD, or Xeon I
suppose, cluster satisfies a largish portion of super-computing needs; that
is to say what is known as "high capacity" systems.
It wasn't intended as a shot, cheap or otherwise. I was musing. AMD
has a definite advantage at the moment because of hypertransport, but
it is a temporary advantage at best. Power consumption being a
decisive consideration, the future looks to be Intel x86 chips, unless
AMD has something in the wings I don't know about.
In the face of that, the market for "high capability" systems is now so
meagre that nobody really wants to touch it: the (potential) customers,
govt. & commercial will not commit: "oh sure, we need/want one *but* only
if the price is right". How the hell is a high capability OEM to target
this umm, market without impaling themselves on the dilemma of how to
actually survive with any reasonable expectation?
I think Del's announcement on comp.arch pretty well describes the
future: pick a processor, pick an interconnect, pick a system
integrator. As Del announced on comp.arch, IBM would be pleased to be
your system integrator for just about anything you could dream up
(except Itanium).
The inspiration for Itanium notwithstanding, an SGI cluster is *not* a high
capability supercomputer - just read what the customer base thinks on that
one. Now that the expected volume for Itanium is fairly well established
it becomes clear that it is not even classifiable as a merchant CPU. It's
dead!

It's a reasonable certainty that Itanium is not going away in the
foreseeable future. There isn't an RAS x86 chip to compete with it.
Intel isn't going to build one and AMD can't afford to. Eugene Miya
isn't going to explain why NASA Ames keeps buying hi-processor count
single-system image boxes, but they do. They buy them from SGI, and
those boxes contain Itanium.

In the compute-intensive market, though, it's hard to imagine how
Itanium is going to keep pace as more and more power-efficient x86
cores are crammed onto a die, and I haven't heard anything from Intel
to indicate that Itanium is committed to that market, anyway.
I don't know enough about Folding@home to know where it fits in the
supercomputing spectrum between high capability/capacity. If you have
expectations of progress in GPUs, it'd be worthwhile to take note of
comments by the major manufacturers on the future of GPUs: there isn't
one - according to their experts the wall has been hit on
pixel/vertex/triangle processing... and the future of GPUs is in more
custom logic and a progression to a more CPU-like framework. Maybe Cell
is the answer, but it does not escape the programming difficulty problem,
and if it's going to fit supercomputing, its double-precision
floating-point performance will certainly have to be extended.

The future is streaming architectures for compute-intensive tasks.
Even though GPU's may have maxed out for graphics, we're a long way
from seeing the full potential of streaming architectures exploited for
number-crunching. My bet: the supercomputer of the future will be
power-efficient x86 with streaming coprocessors on the same local bus
or point-to-point interconnect.

Robert.
 

Del Cecchi

Robert Myers said:
It wasn't intended as a shot, cheap or otherwise. I was musing. AMD
has a definite advantage at the moment because of hypertransport, but
it is a temporary advantage at best. Power consumption being a
decisive consideration, the future looks to be Intel x86 chips, unless
AMD has something in the wings I don't know about.

I think Del's announcement on comp.arch pretty well describes the
future: pick a processor, pick an interconnect, pick a system
integrator. As Del announced on comp.arch, IBM would be pleased to be
your system integrator for just about anything you could dream up
(except Itanium).


It's a reasonable certainty that Itanium is not going away in the
foreseeable future. There isn't an RAS x86 chip to compete with it.
Intel isn't going to build one and AMD can't afford to. Eugene Miya
isn't going to explain why NASA Ames keeps buying hi-processor count
single-system image boxes, but they do. They buy them from SGI, and
those boxes contain Itanium.

In the compute-intensive market, though, it's hard to imagine how
Itanium is going to keep pace as more and more power-efficient x86
cores are crammed onto a die, and I haven't heard anything from Intel
to indicate that Itanium is committed to that market, anyway.


The future is streaming architectures for compute-intensive tasks.
Even though GPU's may have maxed out for graphics, we're a long way
from seeing the full potential of streaming architectures exploited for
number-crunching. My bet: the supercomputer of the future will be
power-efficient x86 with streaming coprocessors on the same local bus
or point-to-point interconnect.

Robert.

Actually if you want Itanium IBM will do that too. It isn't a standard
offering, but that isn't really a problem if (enough) money is involved.

You talk about Single System Image as if it were synonymous with NUMA,
which I don't believe to be true.

Whither Itanium is an interesting question to speculate on.

del
 

Robert Myers

Del said:
You talk about Single System Image as if it were synonymous with NUMA,
which I don't believe to be true.
How did I ever give that impression? Apples and oranges, of course.
George referred derisively to SGI clusters, and I responded that NASA
Ames has been buying hi-processor count single system image boxes
(which are not clusters) from SGI for a while now and that the customer
in that case isn't some clueless Federal bureaucrat. I don't know what
I'm missing.

Robert.
 

Del Cecchi

Robert Myers said:
How did I ever give that impression? Apples and oranges, of course.
George referred derisively to SGI clusters, and I responded that NASA
Ames has been buying hi-processor count single system image boxes
(which are not clusters) from SGI for a while now and that the customer
in that case isn't some clueless Federal bureaucrat. I don't know what
I'm missing.

Robert.

Right, the SGI claim to fame is the NUMA box. I guess I read more into
your referring to SGI and Single System Image than you intended. Sorry.

del
 

George Macdonald

It wasn't intended as a shot, cheap or otherwise. I was musing. AMD
has a definite advantage at the moment because of hypertransport, but
it is a temporary advantage at best. Power consumption being a
decisive consideration, the future looks to be Intel x86 chips, unless
AMD has something in the wings I don't know about.

Your "musing" had a barb to it... that I thought I recognized. Where do
you get the idea that Intel has a lead in power consumption?.. just tain't
so and you only need to run an AMD64 CPU to see that... however much that
might stick in the craw... and AMD is still on 90nm! AMD does *not* need
anything in the wings. AMD64 has been streets ahead of Intel for years now
on power consumption and Intel just kinda caught up and maybe took a very
slight lead with their 65nm chips... but the "future" is Intel?? You need
to get out more!

As for Hypertransport's advantage, as long as Intel keeps shying off on
CSI... self-inflicted wounds.

Robert Myers said:
I think Del's announcement on comp.arch pretty well describes the
future: pick a processor, pick an interconnect, pick a system
integrator. As Del announced on comp.arch, IBM would be pleased to be
your system integrator for just about anything you could dream up
(except Itanium).

Did you not notice "high capability"? "Pick a processor" is not going to
get you that. I haven't seen Del's announcement since I don't take
comp.arch.
It's a reasonable certainty that Itanium is not going away in the
foreseeable future. There isn't an RAS x86 chip to compete with it.
Intel isn't going to build one and AMD can't afford to. Eugene Miya
isn't going to explain why NASA Ames keeps buying hi-processor count
single-system image boxes, but they do. They buy them from SGI, and
those boxes contain Itanium.

Not going away... I suppose if Intel is content to devote fab space to a
non-mass-market chip... but at what price to Intel *and* its OEM
customers... talking of which, what a bunch of wannabees and usetabees.

RAS can mean too many different things now but assuming you mean
Reliability, Accessability, Serviceability, I see no reason why that has to
be a feature of the CPU.

As for NASA I believe their umm, accountability is coming up for a err,
review & overhaul.;-)
In the compute-intensive market, though, it's hard to imagine how
Itanium is going to keep pace as more and more power-efficient x86
cores are crammed onto a die, and I haven't heard anything from Intel
to indicate that Itanium is committed to that market, anyway.


The future is streaming architectures for compute-intensive tasks.
Even though GPU's may have maxed out for graphics, we're a long way
from seeing the full potential of streaming architectures exploited for
number-crunching. My bet: the supercomputer of the future will be
power-efficient x86 with streaming coprocessors on the same local bus
or point-to-point interconnect.

Ya mean like Torrenza?:) There's still a lot of work to be done.:) I
don't think, however, it'll hit "high capability".
 

krw

George Macdonald said:
RAS can mean too many different things now but assuming you mean
Reliability, Accessability, Serviceability, I see no reason why that has to
             ^^^^^^^^^^^^^ Availability
be a feature of the CPU.

Error checkers pretty much are a function of the CPU. There are
systems built out of COTS microprocessors that run two CPUs in
lock-step for RAS reasons and TMR is also a possibility, but RAS is
certainly part of the CPU. There is a ton of logic in the 370-to-zSeries
line dedicated to RAS.

When I was working on the ES9000 (H2) series there was a numerical
error detected, system checkstopped, system state scanned out to
the service processor, error corrected, state scanned back in, and
system restarted. The logic error was only found because a test
engineer was poring through the logs and saw the unexpected event.
It's pretty hard to do this sort of thing without RAS designed into
the processor.
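
For what it's worth, the TMR idea is simple enough to show as a sketch.
This is only the voting logic, in Python for illustration; real machines
implement it in hardware with lock-stepped cores or voted latches.

from collections import Counter

def tmr_vote(replica_outputs):
    """Majority-vote the results of three redundant replicas.

    A single faulty replica is outvoted; if no two replicas agree, the
    error is uncorrectable and has to be signalled (e.g. a checkstop).
    """
    value, votes = Counter(replica_outputs).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no two replicas agree -- uncorrectable error")
    return value

# One replica suffers a transient fault; the vote still returns 42.
print(tmr_vote([42, 42, 43]))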
 

Del Cecchi

George said:
Your "musing" had a barb to it... that I thought I recognized. Where do
you get the idea that Intel has a lead in power consumption?.. just tain't
so and you only need to run an AMD64 CPU to see that... however much that
might stick in the craw... and AMD is still on 90nm! AMD does *not* need
anything in the wings. AMD64 has been streets ahead of Intel for years now
on power consumption and Intel just kinda caught up and maybe took a very
slight lead with their 65nm chips... but the "future" is Intel?? You need
to get out more!

As for Hypertransport's advantage, as long as Intel keeps shying off on
CSI... self-inflicted wounds.



Did you not notice "high capability"? "Pick a processor" is not going to
get you that. I haven't seen Del's announcement since I don't take
comp.arch.

You could check for "IBM System Cluster 1350" on IBM's web site
http://www-03.ibm.com/systems/clusters/hardware/1350.html
if you are interested.

I guess I don't understand what you mean by "high capability".
 

Del Cecchi

George Macdonald said:
Not sure where I got the "high" from:) but "capability" and "capacity"
seem to be used to contrast the two (extreme point) types of
supercomputers.

I didn't see those terms in a quick scan but presumably capability refers
to "big uniprocessors" like Cray vector machines (I know they aren't
really uniprocessors these days). I think this niche has largely been
filled by machines like Blue Gene or other clusters. Capacity machines
are just things like what Google or Yahoo have--a warehouse full of
servers. So in fact the Cluster 1350 is a capability machine. SETI@home
is a capacity machine.

del.
 

George Macdonald

I didn't see those terms in a quick scan but presumably capability refers
to "big uniprocessors" like Cray vector machines (I know they aren't
really uniprocessors these days). I think this niche has largely been
filled by machines like Blue Gene or other clusters. Capacity machines
are just things like what Google or Yahoo have--a warehouse full of
servers. So in fact the Cluster 1350 is a capability machine. SETI@home
is a capacity machine.

From the last useful Meeting Bulletin, admittedly a while back in April
2001, the sense I get is that anything built out of COTS, tightly coupled
or not, is/was considered "capacity" when compared with Crays and others
with specialized processors. Interestingly, the guy from Ford was one
pushing the need for "capability".
 

Del Cecchi

George said:
From the last useful Meeting Bulletin, admittedly a while back in April
2001, the sense I get is that anything built out of COTS, tightly coupled
or not, is/was considered "capacity" when compared with Crays and others
with specialized processors. Interestingly, the guy from Ford was one
pushing the need for "capability".
Cool. Are there any "capability" machines left in the top500?
 

George Macdonald

Cool. Are there any "capability" machines left in the top500?

Is that like the Billboard "Hot 100"... but for computers?:) Yeah it's
true that much progress has been made in COTS since 2001 so maybe that is
the future?
 

Del Cecchi

George Macdonald said:
Is that like the Billboard "Hot 100"... but for computers?:) Yeah it's
true that much progress has been made in COTS since 2001 so maybe that is
the future?

Blue Gene is a network of processors, but not exactly COTS. 240
Teraflops. Number one. That a capacity machine?
 

Robert Myers

Del said:
Blue Gene is a network of processors, but not exactly COTS. 240
Teraflops. Number one. That a capacity machine?

Linpack flops isn't the only measure of performance that matters. It's
not sensitive to bisection bandwidth, and low bisection bandwidth
forces a particular approach to numerical analysis.

What the whiz kids at LLNL don't seem to get is that localized
approximations will _always_ get the problem wrong for strongly
nonlinear problems, because localized differencing invariably
introduces an artificial renormalization: very good for getting
nice-looking but incorrect answers.

I've discussed this extensively with the one poster to comp.arch who
seems to understand strongly nonlinear systems and he knows exactly
what I'm saying. He won't go public because the IBM/National Labs
juggernaut represents a fair slice of the non-academic jobs that might
be open to him.

The limitations of localized differencing may not be an issue for the
class of problem that LLNL needs to do, but ultimately, you can't fool
mother nature. The bisection bandwidth problem shows up in the poor
performance of Blue Gene on FFT's. My fear about Blue Gene is that it
will perpetuate a kind of analysis that works well for (say) routine
structural analysis, but very poorly for the grand problems of physics
(for example, turbulence and strongly-interacting systems).
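
To put a number on the local-versus-spectral gap, here is a toy experiment
(plain numpy, my own construction, nothing to do with LLNL's codes):
differentiate sin(4x) on a periodic grid with a second-order centered
difference and with an FFT-based spectral derivative.

import numpy as np

n = 64
x = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
u = np.sin(4.0 * x)
exact = 4.0 * np.cos(4.0 * x)

dx = x[1] - x[0]
d_local = (np.roll(u, -1) - np.roll(u, 1)) / (2.0 * dx)    # local differencing

k = 2.0 * np.pi * np.fft.fftfreq(n, d=dx)                  # integer wavenumbers
d_spectral = np.real(np.fft.ifft(1j * k * np.fft.fft(u)))  # global (FFT) derivative

print(np.max(np.abs(d_local - exact)))      # ~1e-1 at this resolution
print(np.max(np.abs(d_spectral - exact)))   # ~1e-13, essentially round-off

The spectral derivative is exact for every resolved mode, but it needs an
FFT over the whole domain -- exactly the global, bisection-bandwidth-hungry
communication pattern at issue here.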

As I'm sure you will say, if you've got enough bucks, you can buy all
the bisection bandwidth you need. As it is, though, all the money
right now is going into linpack-capable machines that will never make
progress on the interesting problems of physics. It's a grand exercise
in self-deception.

Robert.
 

Del Cecchi

Robert Myers said:
Linpack flops isn't the only measure of performance that matters. It's
not sensitive to bisection bandwidth, and low bisection bandwidth
forces a particular approach to numerical analysis.

What the whiz kids at LLNL don't seem to get is that localized
approximations will _always_ get the problem wrong for strongly
nonlinear problems, because localized differencing invariably
introduces an artificial renormalization: very good for getting
nice-looking but incorrect answers.

I've discussed this extensively with the one poster to comp.arch who
seems to understand strongly nonlinear systems and he knows exactly
what I'm saying. He won't go public because the IBM/National Labs
juggernaut represents a fair slice of the non-academic jobs that might
be open to him.

The limitations of localized differencing may not be an issue for the
class of problem that LLNL needs to do, but ultimately, you can't fool
mother nature. The bisection bandwidth problem shows up in the poor
performance of Blue Gene on FFT's. My fear about Blue Gene is that it
will perpetuate a kind of analysis that works well for (say) routine
structural analysis, but very poorly for the grand problems of physics
(for example, turbulence and strongly-interacting systems).

As I'm sure you will say, if you've got enough bucks, you can buy all
the bisection bandwidth you need. As it is, though, all the money
right now is going into linpack-capable machines that will never make
progress on the interesting problems of physics. It's a grand exercise
in self-deception.

Robert.

Well the Cluster 1350 has a pretty good network available, if the Blue
Gene one isn't good enough. And Blue Gene was really designed for a few
particular problems, not just Linpack. But the range of problems it is
applicable to seems to be reasonably wide.

And are the "interesting problems in Physics" something that folks are
willing to spend reasonable amounts of money on, like the money spent on
accelerators and neutrino detectors etc? And do they agree as to the kind
of computer needed?

Do you like the new Opteron/Cell Hybrid better? Throwing rocks is easy.
How about specific suggestions?

del
 

Robert Myers

Del said:
Do you like the new Opteron/Cell Hybrid better? Throwing rocks is easy.
How about specific suggestions?

I had really hoped to get out of the rock-throwing business.

My criticism really isn't of IBM, which is apparently only giving the
most important customer what it wants. The most important customer
lost interest in science a long time ago, so maybe it doesn't matter
that the machines it buys aren't good science machines.

I'm sure that a good science machine can be built within the parameters
of cluster 1350, and asking how you might go about that would be an
interesting exercise. Sure. The Opteron/Coprocessor hybrid sounds good.
All that's left to engineer is the network. Were it up to me, I'd
optimize it to do FFT and Matrix transpose. If you can do those two
operations efficiently, you can do an awful lot of very interesting
physics.
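
The reason FFT and matrix transpose go together: a multidimensional FFT on
a distributed grid is normally done as 1-D FFTs along the locally held
rows, a global transpose, then 1-D FFTs along the new rows, and that
transpose is an all-to-all exchange that stresses bisection bandwidth. A
single-node numpy sketch of the same decomposition (illustrative only):

import numpy as np

a = np.random.default_rng(1).random((256, 256))

step1 = np.fft.fft(a, axis=1)            # FFTs along locally contiguous rows
step2 = step1.T                          # on a cluster: the all-to-all transpose
step3 = np.fft.fft(step2, axis=1).T      # FFTs along what used to be columns

print(np.allclose(step3, np.fft.fft2(a)))   # True: same result as a direct 2-D FFT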

The money just isn't there for basic science right now. It isn't IBM's
job to underwrite science or to try to get the government to buy
machines that it apparently doesn't want. The bisection bandwidth of
Blue Gene is millibytes per flop. That's apparently not a problem for
some customers, but there is a big slice of important physics that you
can't do correctly or efficiently with a machine like that.

Robert.
 

Del Cecchi

Robert said:
I had really hoped to get out of the rock-throwing business.

My criticism really isn't of IBM, which is apparently only giving the
most important customer what it wants. The most important customer
lost interest in science a long time ago, so maybe it doesn't matter
that the machines it buys aren't good science machines.

I'm sure that a good science machine can be built within the parameters
of cluster 1350, and asking how you might go about that would be an
interesting exercise. Sure. The Opteron/Coprocessor hybrid sounds good.
All that's left to engineer is the network. Were it up to me, I'd
optimize it to do FFT and Matrix transpose. If you can do those two
operations efficiently, you can do an awful lot of very interesting
physics.

The money just isn't there for basic science right now. It isn't IBM's
job to underwrite science or to try to get the government to buy
machines that it apparently doesn't want. The bisection bandwidth of
Blue Gene is millibytes per flop. That's apparently not a problem for
some customers, but there is a big slice of important physics that you
can't do correctly or efficiently with a machine like that.

Robert.
Is bisection bandwidth really a valid metric for very large clusters?
It seems to me that it can be made arbitrarily small by configuring a
large enough group of processors, since each processor has a finite
number of links. For example a 2D mesh with nearest neighbor
connectivity has a bisection bandwidth that grows as the square root of
the number of processors. But the flops grow as the number of
processors. So the bandwidth per flop decreases with the square root
of the number of processors.

I can't think of why this wouldn't apply in general but don't claim that
it is true. It just seems so to me (although the rate of decrease
wouldn't necessarily be square root)
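
To put the square-root argument in numbers, here is a throwaway sketch with
made-up per-link and per-node figures (nothing measured from any real
machine): for a p x p nearest-neighbour mesh, cutting the machine in half
severs about p links, so bisection bandwidth grows as sqrt(N) while peak
flops grow as N.

LINK_BW = 1.0e9      # bytes/s per mesh link (assumed figure)
NODE_FLOPS = 5.0e9   # flop/s per node (assumed figure)

for p in (32, 64, 128, 256):
    n = p * p                                # processors in a p x p mesh
    bisection_bw = p * LINK_BW               # links crossing the bisection
    ratio = bisection_bw / (n * NODE_FLOPS)  # bytes/flop across the bisection
    print(f"{n:6d} nodes: {ratio:.1e} bytes/flop")

Each quadrupling of the node count halves the bytes available per flop
across the bisection, which is the decline described above.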

Apparently no one with money is interested in solving these special
problems for which clusters are not good enough. See SSI and Steve
Chen, history of.
 

Robert Myers

Del said:
Is bisection bandwidth really a valid metric for very large clusters?

Yes, if you want to do FFT's, or, indeed, any kind of non-local
differencing.
It seems to me that it can be made arbitrarily small by configuring a
large enough group of processors, since each processor has a finite
number of links. For example a 2D mesh with nearest neighbor
connectivity has a bisection bandwidth that grows as the square root of
the number of processors. But the flops grow as the number of
processors. So the bandwidth per flop decreases with the square root
of the number of processors.

That's the problem with the architecture and why I howled so loudly
when it came out. Naturally, I was ridiculed by people whose entire
knowledge of computer architecture is nearest neighbor clusters.

Someone in New Mexico (LANL or Sandia, I don't want to dredge up the
presentation again) understands the numbers as well as I do. The
bisection bandwidth is a problem for a place like NCAR, which uses
pseudospectral techniques, as do most global atmospheric simulations.
The projected efficiency of Red Storm for FFT's was 25%. The
efficiency of Japan's Earth Simulator is at least several times that
for FFT's. No big deal: it was designed for geophysical simulations.
Blue Gene at Livermore was bought to produce the plots the Lab needed
to justify its own existence (and not to do science). As you have
correctly inferred, the more processors you hang off the
nearest-neighbor network, the worse the situation becomes.
I can't think of why this wouldn't apply in general but don't claim that
it is true. It just seems so to me (although the rate of decrease
wouldn't necessarily be square root)
Unless you increase the aggregate bandwidth, you reach a point of
diminishing returns. The special nature of Linpack has allowed
unimaginative bureaucrats to make a career out of buying and touting
very limited machines that are the very opposite of being scalable.
"Scalability" does not mean more processors or real estate. It means
the ability to use the millionth processor as effectively as you use
the 65th. Genuine scalability is hard, which is why no one is really
bothering with it.
Apparently no one with money is interested in solving these special
problems for which clusters are not good enough. See SSI and Steve
Chen, history of.

The problems aren't as special as you think. In fact, the glaring
problem that I've pointed out with machines that rely on local
differencing isn't agenda or marketing driven, it's an unavoidable
mathematical fact. As things stand now, we will have ever more
transistors chuffing away on generating ever-less reliable results.

The problem is this: if you use a sufficiently low-order differencing
scheme, you can do most of the problems of mathematical physics on a
box like Blue Gene. Low order schemes are easy to code, undemanding
with regard to non-local bandwidth, and usually much more stable than
very high-order schemes. If you want to figure out how to place an
air-conditioner, they're just fine. If you're trying to do physics,
the plots you produce will be plausible and beautiful, but very often
wrong.

There is an out that, in fairness, I should mention. If you have
processors to burn, you can always overresolve the problem to the point
where the renormalization problem I've mentioned, while still there,
becomes unimportant. Early results by the biggest ego in the field at
the time suggested that it takes about ten times the resolution to do
fluid mechanics with local differencing as accurately as you can do it
with a pseudospectral scheme. In 3-D, that's a thousand times more
processors. For fair comparison, the number of processors in the Livermore
box would be divided by 1000 to get equivalent performance to a box
that could do a decent FFT.

I should be posting this to comp.arch so people there can switch from being
experts on computer architecture to being experts on numerical analysis
and mathematical physics.

Robert.
 
