Yamhill out in open


Felger Carbon

From CNET:

Intel plans to demonstrate a 64-bit revamp of its Xeon and Pentium
processors in mid-February--an endorsement of a major rival's strategy
and a troubling development for Intel's Itanium chip.

The demo, which follows the AMD64 approach of Intel foe Advanced Micro
Devices, is expected at the Intel developer conference, Feb. 17 through
19 in San Francisco, according to sources familiar with the plan. Intel
had code-named the technology Yamhill but now calls it CT, sources said.

Adding 64-bit features would let "x86" chips such as Intel's Xeon and
Pentium overcome today's 4GB memory limit but would undermine the hope
that Intel's current 64-bit chip, Itanium, will ever ship in large
quantities. A CT demonstration would send the message that prospective
Itanium customers should put Itanium purchases on hold, said Peter
Glaskowsky, editor in chief of In-Stat/MDR's Microprocessor Report.

http://zdnet.com.com/2100-1103_2-5150336.html
 

Yousuf Khan

Robert Myers said:
Same article refers to the fact that IBM is planning a 64-way x86
server...with Xeon, if you follow the link.

I'm wondering which Tier-1 server company is going to try the same thing
with Opteron? Well, Cray is already trying it with the Strider system
employing Black Widow interconnects.

Has anyone figured out if Black Widow is cache-coherent or not? And if it
is, is it directory-based or broadcast-based cache coherent? Can't find any
detailed info about it on the web.

Yousuf Khan
 

Robert Myers

Yousuf Khan said:
I'm wondering which Tier-1 server company is going to try the same thing
with Opteron? Well, Cray is already trying it with the Strider system
employing Black Widow interconnects.

Has anyone figured out if Black Widow is cache-coherent or not? And if it
is, is it directory-based or broadcast-based cache coherent? Can't find any
detailed info about it on the web.

This question came up on comp.arch. Actually, I asked it, and what I
got back was mush. Some briefing somewhere refers to Red Storm as
cache coherent. It's 2:00am, so if someone walked in here and
threatened violence if I didn't find it, I could probably find it.
But anything short of that...Oh, wait, there's google:

http://www.hotchips.org/archive/hc15/pdf/1.cray.pdf

Slide 13 of 18:

DMA between Opteron(TM) memory and network for high bandwidth (cache
coherent)

No one seems to know what that means. I cannot conceive that it means
what you want it to mean.

Take a gander at that mesh on slide 4, and imagine just ordinary
programming traffic on that mesh, never mind cache snoop from every
damn processor, _every_one_of_which_ is connected to the mesh
_separately_ through its own router chip (no exploitation of the local
HyperTransport links, despite the fact that Opteri are four to a board).

Some snot-headed know-it-all on comp.arch wants me to look at page 2
of an elementary programming book to figure out how to program one of
those things, and I'll tell you right now there's no damn book
anywhere that will tell you how to program one of those things in the
general (not nearly embarrassingly-parallel) case.

Every single processor has its own operating system image, so Red
Storm is not a competitor for the kind of box that IBM wants to build
with its (very impressive) Summit chipset. All due well-publicized
disrespect for the well-meaning people at Cray and the DoE, but Red
Storm is a very expensive white elephant.

Well, that's an extreme overstatement. Red Storm will do well where
data are physically localized, pretty much stay put, and only have to
communicate at the boundaries of well-defined regions. That describes
an awful lot of problems that the nice people at LANL and Sandia
really have to do, but it does not describe a general purpose
computer.
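
To make that concrete: a minimal sketch of the boundary-exchange
pattern, as a hypothetical 1-D domain decomposition in C with MPI
(illustrative only, not anything from Cray's slides). Each rank keeps
its data local and trades only the cells at the region boundary each
step:

    /* Hypothetical 1-D decomposition: each rank owns n interior cells
       plus one ghost cell at each end; only the boundary cells cross
       the network each time step. */
    #include <mpi.h>

    void exchange_halo(double *u, int n, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        /* first interior cell goes left; right ghost filled from right */
        MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, left,  0,
                     &u[n + 1], 1, MPI_DOUBLE, right, 0,
                     comm, MPI_STATUS_IGNORE);
        /* last interior cell goes right; left ghost filled from left */
        MPI_Sendrecv(&u[n],     1, MPI_DOUBLE, right, 0,
                     &u[0],     1, MPI_DOUBLE, left,  0,
                     comm, MPI_STATUS_IGNORE);
    }

Contrast that with the global communication in the next paragraph,
where every rank has to talk to every other rank at every step.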

In particular, I do time-dependent problems that require multiple
global communications at every time step, and the kind of work I do
isn't all that unusual. I don't think Red Storm is
going to be well-suited to the kinds of problems I do, and I think if
Red Storm attempted cache coherency it would go up in a puff of smoke.

I would be very happy to learn that I am dead wrong, but so far no one
has pointed me to a document that gives any evidence that I would be,
and I think Red Storm is going to be another DoE special that operates
in the 5-10% of peak flops range. Tolerable linpack score, pathetic
everything else.

RM
 

Yousuf Khan

Robert Myers said:
This question came up on comp.arch. Actually, I asked it, and what I
got back was mush. Some briefing somewhere refers to Red Storm as
cache coherent. It's 2:00am, so if someone walked in here and
threatened violence if I didn't find it, I could probably find it.
But anything short of that...Oh, wait, there's google:

http://www.hotchips.org/archive/hc15/pdf/1.cray.pdf

Slide 13 of 18:

DMA between Opteron(TM) memory and network for high bandwidth (cache
coherent)

No one seems to know what that means. I cannot conceive that it means
what you want it to mean.

Yeah, that's the problem I'm having: I cannot find any specific
information about its cache coherency. There's an extreme lack of
detail about Black Widow.

Take a gander at that mesh on slide 4, and imagine just ordinary
programming traffic on that mesh, never mind cache snoop from every
damn processor, _every_one_of_which_ is connected to the mesh
_separately_ through its own router chip (no exploitation of the local
HyperTransport links, despite the fact that Opteri are four to a board).

If it's directory-based cache coherency, it might be doable. There's no
cache-snoop broadcast traffic on directory-based interconnects.

Every single processor has its own operating system image, so Red
Storm is not a competitor for the kind of box that IBM wants to build
with its (very impressive) Summit chipset. All due well-publicized
disrespect for the well-meaning people at Cray and the DoE, but Red
Storm is a very expensive white elephant.

I don't think it is one operating system image per processor. I think it's
one operating system image per compute node (i.e. 4 processors/node?).

Cray is using the very same Black Widow interconnects in its Black Widow
line of computers too (X1 is the first from the Black Widow lineup, using
Cray's vector processors). It's likely that it is cache-coherent in the
Black Widows, so it should be the same in the Striders (the Striders are
Cray's Opteron-based computers, of which Red Storm is the first model in
that lineup).

I would be very happy to learn that I am dead wrong, but so far no one
has pointed me to a document that gives any evidence that I would be,
and I think Red Storm is going to be another DoE special that operates
in the 5-10% of peak flops range. Tolerable linpack score, pathetic
everything else.

One of their PDFs was showing over 80% efficiency.

Yousuf Khan
 

Felger Carbon

Yousuf Khan said:
If it's a directory-based cache coherency, it might be doable. There's no
cache snoop traffic on directory-based interlinks.

Sorry, Robert. I agreed with you on this, but decided to review the
.pdf first. Good thing I did, or I'd have embarrassed myself. Again.

Slide 13: "Message based. DMA between Opteron memory and network for
high bandwidth (cache coherent)."

What this means is that each Opteron maintains cache coherency with
its own local DRAM, as per usual in all our single-processor desktop
machines. Communications from the net to each processor are via DMA
message passing from the net directly to the memory. This is snooped
by the CPU, exactly as our desktop machine routinely snoops DMA by
(for example) our hard disk.

The Red Storm network can broadcast a given message simultaneously to
multiple Opteron memories. Each Opteron snoops its own memory. By
this mechanism, cache coherency is maintained (on passed messages) by
each CPU with its own local memory, which is the only memory it can
directly access.

Damn! I was _sure_ the Red Storm wasn't cache coherent! ;-(

Keep in mind that the Red Storm is NUMA. Changes to one CPU's local
memory by the CPU are _not_ snooped by other CPUs. Inter-CPU
coherency _only_ exists on passed DMA messages.
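
In software terms, the payoff of that snooping is just this: a receiver
can wait on a completion flag and then read the freshly DMA'd buffer
straight out of its cache, with no explicit invalidate step. A minimal
sketch in C (hypothetical names; C11 atomics standing in for whatever
the real NIC interface does):

    #include <stdatomic.h>
    #include <stdint.h>

    /* Hypothetical receive buffer that the NIC fills by DMA. */
    struct msg_buf {
        _Atomic uint32_t ready;   /* NIC sets this last, after the payload */
        uint8_t payload[4096];
    };

    const uint8_t *wait_for_message(struct msg_buf *m)
    {
        /* Spin until the DMA engine publishes the completion flag.
           Because the CPU snoops the DMA writes, any stale cached
           copies of 'ready' and 'payload' are fixed up in hardware --
           no cache flush or invalidate instructions needed here. */
        while (atomic_load_explicit(&m->ready, memory_order_acquire) == 0)
            ;
        return m->payload;   /* safe to read directly */
    }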

I don't think it is one operating system image per processor. I think it's
one operating system image per compute node (i.e. 4
processors/node?).

Sorry, Yousuf, Robert is correct on this one. There is absolutely no
electrical interconnection between the four CPUs on each
node (one board per node).

Slide 6: "Single (non-SMP) processor."
Slide 7 reveals that there's no interconnection between the four CPUs
on a node board.
Slide 10: "Supplies boot code to [each] processor."
Slide 9 reveals that each system chip (one per Opteron) contains an
embedded PPC CPU for booting and system monitoring.

For those keeping score, Robert, Yousuf, and I each won one and lost
one. Well, that's what this NG is all about. ;-)

Courtesy of David Wang last Oct:
http://www.hotchips.org/archive/hc15/pdf/1.cray.pdf
 

Robert Myers

Felger Carbon said:
Sorry, Robert. I agreed with you on this, but decided to review the
.pdf first. Good thing I did, or I'd have embarrassed myself. Again.

Slide 13: "Message based. DMA between Opteron memory and network for
high bandwidth (cache coherent)."

What this means is that each Opteron maintains cache coherency with
its own local DRAM, as per usual in all our single-processor desktop
machines. Communications from the net to each processor are via DMA
message passing from the net directly to the memory. This is snooped
by the CPU, exactly as our desktop machine routinely snoops DMA by
(for example) our hard disk.

The Red Storm network can broadcast a given message simultaneously to
multiple Opteron memories. Each Opteron snoops its own memory. By
this mechanism, cache coherency is maintained (on passed messages) by
each CPU with its own local memory, which is the only memory it can
directly access.

Damn! I was _sure_ the Red Storm wasn't cache coherent! ;-(

Keep in mind that the Red Storm is NUMA. Changes to one CPU's local
memory by the CPU are _not_ snooped by other CPUs. Inter-CPU
coherency _only_ exists on passed DMA messages.

Hmmm. This brings a whole new meaning to the term cache-coherent to
my spongiform brain (not the result of misfolded proteins, just the
neurons that say, "Yeah, whatever" when the discussion sounds boring).

If a processor changes a memory location *and* remembers to broadcast a
message to interested parties that the memory location has been
changed, then the system stays cache-coherent in the sense that
cache-coherent seems to imply.

If a processor changes a memory location and fails to broadcast a
message, different processors have different versions of the same
data, possibly in their cache, but that's okay because this is a NUMA
machine and we still get to call it cache coherent?

Felger, are you a part of this conspiracy? NUMA "cache-coherency"
sounds like a much worse abuse of language than a "salesman's
gigabyte."

The 64-processor x86 box that IBM will build with its Summit chipset
will be cache-coherent with no asterisk.

RM
 

Felger Carbon

Robert Myers said:
Hmmm. This brings a whole new meaning to the term cache-coherent to
my spongiform brain (not the result of misfolded proteins, just the
neurons that say, "Yeah, whatever" when the discussion sounds boring).

If a processor changes a memory location *and* remembers to broadcast a
message to interested parties that the memory location has been
changed, then the system stays cache-coherent in the sense that
cache-coherent seems to imply.

If a processor changes a memory location and fails to broadcast a
message, different processors have different versions of the same
data, possibly in their cache, but that's okay because this is a NUMA
machine and we still get to call it cache coherent?

Felger, are you a part of this conspiracy? NUMA "cache-coherency"
sounds like a much worse abuse of language than a "salesman's
gigabyte."

No, I just finally figgered out what "ccNUMA" means. Since each CPU
has its own, unshared memory, writes to one memory do not have to be
snooped by the other 10K+ CPUs. Only when a message is passed (via
DMA in this case) does the new data have to be snooped by the
receiving CPU - and it is!

There is no asterisk on Red Storm's cache coherency. Since data can
be exchanged _only_ by message passing, the system is fully cache
coherent at all times.

The limitation of the system (and it's a whopper) is unrelated to
cache coherency. It is the fact that only one CPU gets to _send_ a
message at one time. Any number of CPUs can receive the message
simultaneously. In Red Storm's case, because of the Red/Black
partitioning, _two_ CPUs get to send at one time, one in each
partition.
 

Robert Myers

Felger Carbon said:
No, I just finally figgered out what "ccNUMA" means. Since each CPU
has its own, unshared memory, writes to one memory do not have to be
snooped by the other 10K+ CPUs. Only when a message is passed (via
DMA in this case) does the new data have to be snooped by the
receiving CPU - and it is!

There is no asterisk on Red Storm's cache coherency. Since data can
be exchanged _only_ by message passing, the system is fully cache
coherent at all times.

Mmmph? Changed data comes in from another processor via a message.
The processor snoops the message and says "Ohmagod, the data has
changed from what I have in my cache. Better straighten that out."

*What* in tarnation was the processor doing with the data in its cache
in the first place if somebody else who might change it had it at the
same time? It could still be using the data while a message that the
data has changed is on its way (possibly thousands of processor
cycles). That is, *if* a message was sent at all; the responsibility is
entirely on the programmer to see to it that that happens, as far as I
can tell.

You can get into the same kind of trouble on an SMP shared-memory box,
and people have argued to me recently that it's a *virtue* of NUMA
systems that all this message-passing has to go on. Balderdash.

The only situation in which cache-coherency makes any sense is when
the snoop time is less than the data-fetch time. Otherwise, if you've
been keeping your books correctly and you know that data has been
handled by someone else, you might just as well routinely invalidate
data that happen still to be in your cache and get a fresh copy.

If the snoop time is significantly less than the data fetch time, the
circumstance can easily arise that hotly-contested data might still be
in-cache when you learn that another processor is no longer liable to
change it, and it is worth your while to arrange things so that you
don't have to go out to memory to fetch a new copy if you don't have
to.

The usefulness of cache-coherency isn't even a matter of shared memory
vs. NUMA. In a four-way Opteron system, snoop times are less than
fetch times, and ccNUMA is a term with real significance.

In the case of Red Storm, cache-coherence is a salesman's gigabyte.

The limitation of the system (and it's a whopper) is unrelated to
cache coherency. It is the fact that only one CPU gets to _send_ a
message at one time. Any number of CPUs can receive the message
simultaneously. In Red Storm's case, because of the Red/Black
partitioning, _two_ CPUs get to send at one time, one in each
partition.

Ermph. That is not a problem at all. The Red and the Black
partitions of Red Storm are *physically* disconnected. That is the
only way you can use part for classified and part for unclassified,
which is why the Red and the Black partitions exist at all.

While I am on my soap box, and to keep someone from gloatingly
pointing the obvious out to me at some later date, the fact that all
those processors are hooked together is more useful from the point of
view of trying to do several small jobs at once than it is from the
point of view of really attempting to use a mesh of that size with
one-hop routing for a single problem.

From this particular point of view, Red Storm is *not* an expensive
white elephant. It shares with a zSeries mainframe the property that
a large quantity of memory and a large number of processors can be
reconfigured as different computers almost at a moment's notice. It
is also much less expensive in that role (than a zSeries mainframe),
and a Red Storm box might make good sense for a company that has
multiple server farms dividing resources up in arbitrary ways that are
hard to reconfigure.

When you get down to most jobs occupying just a dozen or so nodes,
then many things about the box that seem silly when you consider the
mesh as a whole, like "cache-coherency", no longer seem quite so
ridiculous, because data go no further than well-defined boundaries,
and you don't wind up with traffic jams.

Don't look for a slide that highlights what I just said, though,
because it really highlights the fact that these huge boxes don't
really work for all but the most embarrassingly parallel of problems,
and that the DoE has spent over a decade doing nothing more than
funding high school shop class projects that emphasize running wire,
bolting things together, and plugging things in. That Virginia Tech
could do the same thing much more cheaply by paying undergraduates
with pizza and football tickets doesn't come as a big surprise.

RM
 

Felger Carbon

Robert Myers said:
*What* in tarnation was the processor doing with the data in its cache
in the first place if somebody else who might change it had it at the
same time?


First, an obligatory disclaimer: I'm still learning about Red Storm.
The following statements are based on the current state of my
knowledge. I'll try real hard to make some good guesses. ;-)

The fact that DMA message passing into the local DRAM is used
necessitates that the data being overwritten is inconsequential.
Therefore, if any of the data being overwritten is in cache, replacing
it via snooping is also inconsequential.

It could still be using the data while a message that the data has
changed is on its way (possibly thousands of processor cycles).

The data being overwritten _must_ be inconsequential. It is the
programmers' task to make certain this is the case. Nobody ever said
programming a message-passing 10K+ CPU MPP was easy.

You can get into the same kind of trouble on an SMP shared-memory box

This is not my understanding, Robert. I'll try to keep this on-topic
about ccNUMA and not pursue this further.

The only situation in which cache-coherency makes any sense is when
the snoop time is less than the data-fetch time. Otherwise, if you've
been keeping your books correctly and you know that data has been
handled by someone else, you might just as well routinely invalidate
data that happen still to be in your cache and get a fresh copy.

Snooping the passed message *does in fact* invalidate the
(inconsequential) data in your cache and updates it with a fresh copy.
So?

If the snoop time is significantly less than the data fetch time, the
circumstance can easily arise that hotly-contested data might still be
in-cache when you learn that another processor is no longer liable to
change it, and it is worth your while to arrange things so that you
don't have to go out to memory to fetch a new copy if you don't have
to.

I read the above several times, Robert, and I still don't understand
what you're saying. This is probably my limitation.

The usefulness of cache-coherency isn't even a matter of shared memory
vs. NUMA. In a four-way Opteron system, snoop times are less than
fetch times, and ccNUMA is a term with real significance.

You have just opened a brand-new can of worms. There are several
forms of NUMA. One is the Red Storm version, where each CPU has a
totally independent memory, accessible by other CPUs only via message
passing. The 4-way Opteron system is a completely different type of
NUMA since each CPU can address the other CPUs' memory. However, it
addresses the other memory at a different address. This means each
CPU's cache must snoop the other 3's memory, as well as its own.
Thus, the largish number of high-speed links.

In the case of Red Storm, cache-coherence is a salesman's gigabyte.

Wrong. Red Storm is absolutely perfectly cache coherent. There are
no corners or special cases where this is not true. 100% perfection.
The penalty is that only one CPU gets to send messages at a time. And
the programmer must avoid overwriting valid data when passing a
message.

Ermph. That is not a problem at all. The Red and the Black
partitions of Red Storm are *physically* disconnected.

Whoa, Nellie! You mean, when the partition is moved, a swarm of
technicians physically removes or installs wiring? Huh??

While I am on my soap box, and to keep someone from gloatingly
pointing the obvious out to me at some later date, the fact that all
those processors are hooked together is more useful from the point of
view of trying to do several small jobs at once than it is from the
point of view of really attempting to use a mesh of that size with
one-hop routing for a single problem.

Absolutely correct. This is the problem with a message-passing MPP.
The unfortunate fact is, there is no practical way around the problem
of interconnecting 10K+ CPUs. Otherwise, everybody would use that
practical way, hmm?

From this particular point of view, Red Storm is *not* an expensive
white elephant. It shares with a zSeries mainframe the property that
a large quantity of memory and a large number of processors can be
reconfigured as different computers almost at a moment's notice.

By swarms of technicians physically installing/removing wiring?
(Sorry, Robert, that was a cheap shot that I just couldn't resist. I
ain't perfect. :)

these huge boxes don't
really work for all but the most embarrassingly parallel of problems

They're the only game in town. We'd all love to have equivalent
performance in a really fast one-CPU supercomputer, but that just
ain't possible. Alas.

For one specific algorithm, it is sometimes *possible* (in principle)
to design the algorithm flow into hardware. There isn't enough money
in the world to pay for this for lotsa algorithms. Double alas. ;-)
 

Rob Stow

Felger said:
No, I just finally figgered out what "ccNUMA" means. Since each CPU
has its own, unshared memory, writes to one memory do not have to be
snooped by the other 10K+ CPUs. Only when a message is passed (via
DMA in this case) does the new data have to be snooped by the
receiving CPU - and it is!

There is no asterisk on Red Storm's cache coherency. Since data can
be exchanged _only_ by message passing, the system is fully cache
coherent at all times.

The limitation of the system (and it's a whopper) is unrelated to
cache coherency. It is the fact that only one CPU gets to _send_ a
message at one time. Any number of CPUs can receive the message
simultaneously. In Red Storm's case, because of the Red/Black
partitioning, _two_ CPUs get to send at one time, one in each
partition.

Damn! So now I'm a small step closer to understanding just what the
bleep that red/black partitioning is all about.
 

Robert Myers

Felger Carbon said:
First, an obligatory disclaimer: I'm still learning about Red Storm.
The following statements are based on the current state of my
knowledge. I'll try real hard to make some good guesses. ;-)

I don't post on Usenet for the pleasure of catching other people in
mistakes, and I don't think you do, either, so we can just have a
conversation.

The fact that DMA message passing into the local DRAM is used
necessitates that the data being overwritten is inconsequential.

Let me try a translation. If the programmer has not done something
dangerous, it should not, indeed, cannot matter that data are being
overwritten in local memory by DMA. Otherwise, running the code with
different, unrelated activity on the system could produce different
results. <end attempted translation>

The only thing that is accomplished by "cache-coherency" in Red Storm
is that, should the processor, by some chance, be holding in cache a
piece of data being overwritten in main memory by DMA, the cache will
be updated at the same time.

Therefore, if any of the data being overwritten is in cache, replacing
it via snooping is also inconsequential.

Red alert! Whoop! Whoop! Translator banks drawing inconsistent
conclusions from external stimuli. I thought the whole point is that
the processor does snoop the DMA write and does update the cache.

The data being overwritten _must_ be inconsequential. It is the
programmers' task to make certain this is the case. Nobody ever said
programming a message-passing 10K+ CPU MPP was easy.

Well, not if you go about it the way most people do these days.
(What, he says, you think you know a better way? Yes, I think I do).

You can get into the same kind of trouble on an SMP shared-memory box

This is not my understanding, Robert. I'll try to keep this on-topic
about ccNUMA and not pursue this further.

Processor A and B have a copy of the same memory location. Processor
A and B by an unfortunate coincidence (and through the incompetence of
the programmer who allowed the situation to arise) decide to use the
data at the same time. Processor A uses the value to produce some
other result. Processor B changes the value. Processor A snoops the
change and corrects the value it has in cache, but a result from
Processor A is on its way elsewhere that would have been different had
the timing been just a little different. No different from what
happens in the ccNUMA case, as far as I can see.
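
The same scenario in miniature -- a hypothetical two-thread sketch in
C (my illustration, nothing Red Storm-specific): coherence fixes A's
cached copy, but not a result A has already computed from the old
value.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static _Atomic int shared = 1;   /* the location A and B both hold */

    static void *proc_a(void *arg)
    {
        (void)arg;
        int seen = atomic_load(&shared);  /* old or new, depending on timing */
        printf("A computed %d from %d\n", seen * 2, seen); /* result escapes */
        return NULL;
    }

    static void *proc_b(void *arg)
    {
        (void)arg;
        atomic_store(&shared, 5);         /* B changes the value */
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, proc_a, NULL);
        pthread_create(&b, NULL, proc_b, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;  /* prints 2 or 10; coherence never made that deterministic */
    }
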
Snooping the passed message *does in fact* invalidate the
(inconsequential) data in your cache and updates it with a fresh copy.
So?

This conversation may not be about much more than:

ccNUMA for Red Storm is almost free.

It also doesn't really buy you much of anything, but since it's almost
free, it doesn't matter very much that it's almost worthless.

I read the above several times, Robert, and I still don't understand
what you're saying. This is probably my limitation.

Megacorp International has a credit line of x. Processor A and
processor B are each handling transactions for Megacorp International.
Processor B gets there first and puts a lock on the value of memory
location x. Processor B changes the value of x and releases the lock
on that memory location. Processor A learns that the data can be
used, and does so without worrying about having to go out to memory to
fetch a fresh copy because the change was snooped and the value in its
cache updated in less time than it would take to do a memory fetch.
On Red Storm, the minimum time for a lock that has to go onto the
network is 4 µs, compared to a memory fetch round-trip of under 200
ns. Whether you snoop the DMA or just fetch a fresh copy makes an
insignificant difference in the amount of time you can't use that data
on Red Storm. The one real payoff that I can see is that the lock
itself is a data item, and having the processor snoop the changed lock
as it arrives saves the processor from having to poll the lock to see
if the data can be used.

You have just opened a brand-new can of worms. There are several
forms of NUMA. One is the Red Storm version, where each CPU has a
totally independent memory, accessible by other CPUs only via message
passing. The 4-way Opteron system is a completely different type of
NUMA since each CPU can address the other CPUs' memory. However, it
addresses the other memory at a different address. This means each
CPU's cache must snoop the other 3's memory, as well as its own.
Thus, the largish number of high-speed links.

Right, but the fact that the processors snoop one another's cache
makes a big difference _proportionally_ in how long data that have
been locked by one processor are unavailable for use by another
processor.

Wrong. Red Storm is absolutely perfectly cache coherent. There are
no corners or special cases where this is not true. 100% perfection.
The penalty is that only one CPU gets to send messages at a time. And
the programmer must avoid overwriting valid data when passing a
message.

Let's put it this way: I think Red Storm's cache-coherence has about as
much value as P4 detractors think hyperthreading has.

Whoa, Nellie! You mean, when the partition is moved, a swarm of
technicians physically removes or installs wiring? Huh??

When I was in the business (and I no longer am) a computer doing
classified work could have no network connections to any unclassified
environment. It isn't necessary for me to guess at how the
corresponding requirement can be met under current regulations, but
you may safely count on the impossibility of any message getting from
Red to Black or vice versa.

Absolutely correct. This is the problem with a message-passing MPP.
The unfortunate fact is, there is no practical way around the problem
of interconnecting 10K+ CPUs. Otherwise, everybody would use that
practical way, hmm?

There are these things called _switches_. The cost of _just_ the
switch for a Beowulf cluster, even one with fairly high-end compute
nodes, changes the total significantly when you go from a switchless
mesh to a switched network with a low-latency, high-bandwidth
interconnect.

Now the problem with _switches_ is that they significantly raise the
cost of the installation without significantly raising your Top 500
ranking.

Want to experience sticker shock? Price out an SGI Altix 3300 (ring)
vs Altix 3700 (switched).

Best Top 500 per dollar? Leave out the switches. Also has the neat
effect of maximizing IT staff at National Laboratories, because they
are working with an RSA (Really Stupid Architecture). You've heard me
talk about this before in a group larded with DoE vassals and
retainers. Boy do I get an unfriendly reception. Smarter
architecture, faster development, less money to vassals and retainers.

By swarms of technicians physically installing/removing wiring?
(Sorry, Robert, that was a cheap shot that I just couldn't resist. I
ain't perfect. :)

If they could call them IT professionals and use it to inflate their
budgets and their staffs, they probably would. One cheap shot
deserves another, although this cheap shot was definitely not aimed at
you. ;-)

They're the only game in town. We'd all love to have equivalent
performance in a really fast one-CPU supercomputer, but that just
ain't possible. Alas.

_Not_ the only game in town. NASA doesn't buy boxes like that. NRL
doesn't buy boxes like that. NSA doesn't buy boxes like that. Only
the DoE, with its heavy thumb on national policy and tons of vassals
and retainers to justify big salaries for half-wit muckity-mucks buys
boxes like that. Oh, yes, and the DoE has an unseemly relationship
with IBM. How did Cray get into it? They had to do _something_ to
show apparent support for supposedly real HPC, as opposed to high
school shop projects and national subsidies to IBM.

SGI is working on boxes that have both vector and scalar processors.
Now _there's_ a thought. SGI, in all likelihood, will go safely out
of business before they can interfere with the favorites chosen by the
DoE. Wonder what SGI did wrong?

For one specific algorithm, it is sometimes *possible* (in principle)
to design the algorithm flow into hardware. There isn't enough money
in the world to pay for this for lotsa algorithms. Double alas. ;-)

No, but you can do one helluva lot better than racking up as many COTS
processors with as much cable as you can afford to buy.

RM
 

Yousuf Khan

Felger Carbon said:
The limitation of the system (and it's a whopper) is unrelated to
cache coherency. It is the fact that only one CPU gets to _send_ a
message at one time. Any number of CPUs can receive the message
simultaneously. In Red Storm's case, because of the Red/Black
partitioning, _two_ CPUs get to send at one time, one in each
partition.

Why do you think only one CPU will get to send a message at one time? It's
entirely likely that the Black Widow interconnect can receive messages from
multiple CPU sources and update its internal directory to reflect changed
memory locations.

Yousuf Khan
 

Rob Stow

Yousuf said:
Why do you think only one CPU will get to send a message at one time? It's
entirely likely that the Black Widow interconnect can receive messages from
multiple CPU sources and update its internal directory to reflect changed
memory locations.

Does anyone have a link that explains what the red/black partitioning
in Red Storm is, and why?

I have no *need* for this info, but I am curious as hell.
I am coming from an unrelated background where the closest
I have come to something like Red Storm is 8P Xeon servers.
 

Felger Carbon

Yousuf Khan said:
Why do you think only one CPU will get to send a message at one time? It's
entirely likely that the Black Widow interconnects can receive messages from
multiple CPU sources and update its internal directory to reflect changed
memory locations.

Yousuf, I've been thinking about the above, and I think my original
reply was wrong. While only _one_ message at a time can be received
by a CPU, it's possible (with some stringent limitations) for more
than one CPU at a time to be sending messages on Red Storm.

Suppose the CPU at X Y Z address 1, 1, 1 is sending a message. Then
no other CPU can be broadcasting a message on X = 1 or Y = 1 or Z = 1.
However, the CPU at X Y Z address 2, 2, 2 can (as far as I can tell)
be simultaneously broadcasting. Since Red Storm uses a 27 x 16 x 24
mesh, in the limit, it's possible that up to 16 CPUs can be
simultaneously broadcasting.

However, it's obvious that there are very stringent limitations on
_which_ CPUs subsequent broadcasters (beyond broadcaster #1) can
address. I'm not sure I'd want to try to program around that
limitation.

Nevertheless, I now believe you are right. More than one message can
be sent by _some_ CPUs at one time, with the limitation that each X
address can only be used once (and Y, and Z). Whoops!
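
In other words, it's the 3-D version of the non-attacking-rooks rule:
no two simultaneous broadcasters may share an X, a Y, or a Z
coordinate, which caps the count at the smallest mesh dimension,
min(27, 16, 24) = 16. A toy checker in C, just to make the rule
concrete (hypothetical, obviously not Cray's routing code):

    #include <stdbool.h>

    struct node { int x, y, z; };

    /* Conflict-free iff every pair of broadcasters differs in X and in
       Y and in Z; at most min(27,16,24) = 16 such nodes can exist. */
    bool can_broadcast_together(const struct node *n, int count)
    {
        for (int i = 0; i < count; i++)
            for (int j = i + 1; j < count; j++)
                if (n[i].x == n[j].x || n[i].y == n[j].y || n[i].z == n[j].z)
                    return false;
        return true;
    }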
 

Yousuf Khan

Felger Carbon said:
Yousuf, I've been thinking about the above, and I think my original
reply was wrong. While only _one_ message at a time can be received
by a CPU, it's possible (with some stringent limitations) for more
than one CPU at a time to be sending messages on Red Storm.

Suppose the CPU at X Y Z address 1, 1, 1 is sending a message. Then
no other CPU can be broadcasting a message on X = 1 or Y = 1 or Z = 1.
However, the CPU at X Y Z address 2, 2, 2 can (as far as I can tell)
be simultaneously broadcasting. Since Red Storm uses a 27 x 16 x 24
mesh, in the limit, it's possible that up to 16 CPUs can be
simultaneously broadcasting.

However, it's obvious that there are very stringent limitations on
_which_ CPUs subsequent broadcasters (beyond broadcaster #1) can
address. I'm not sure I'd want to try to program around that
limitation.

Nevertheless, I now believe you are right. More than one message can
be sent by _some_ CPUs at one time, with the limitation that each X
address can only be used once (and Y, and Z). Whoops!

Actually, I still don't get it: why do you think there should be any
restrictions at all? The CPUs don't need to broadcast any messages to
other CPUs at all if the cache coherency is directory-based. The CPU
would talk only to its own local Black Widow NIC, and the Black Widow
network would take care of switching and routing. The Black Widow would
maintain an internal directory of changed memory locations. If any
particular CPU needed to access a memory location outside its own, it
would check through the Black Widow. The BW would then take care of
informing the CPU whether its data is stale or fresh. The Black Widow,
much like a network switch, would direct traffic only between nodes that
are relevant to each other, without disturbing CPUs that are not.
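
If that speculation were right, the bookkeeping would look something
like a textbook directory protocol. A schematic sketch in C (entirely
hypothetical -- nothing like this has been published about Black
Widow):

    #include <stdint.h>

    #define NODES 10368   /* the 27 x 16 x 24 mesh from the Hot Chips slides */

    enum state { INVALID, SHARED, EXCLUSIVE };

    /* One directory entry per memory block: who owns it, and a bit per
       node recording who holds a cached copy.  On a miss, the network
       consults the entry and forwards traffic only to those sharers,
       instead of broadcasting a snoop to all 10K+ CPUs. */
    struct dir_entry {
        enum state state;
        uint16_t   owner;                       /* node with the dirty copy */
        uint64_t   sharers[(NODES + 63) / 64];  /* bitmap of caching nodes */
    };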

In fact, these Striders/Red Storms are using Opteron 1xx processors, which
don't have any ccHTT interfaces at all, since they are meant for use only in
single-processor systems. Any data outside each Opteron's own local memory
controller comes over a regular HTT link, as if the system had an
external memory controller. This external memory's cache coherency is
maintained with DMA coherency protocols rather than multiprocessor coherency
protocols.

This is all assuming that Black Widow employs a directory-based cache
coherency protocol, of course.

Yousuf Khan
 

Robert Myers

Yousuf Khan said:
Actually, I still don't get it: why do you think there should be any
restrictions at all? The CPUs don't need to broadcast any messages to
other CPUs at all if the cache coherency is directory-based. The CPU
would talk only to its own local Black Widow NIC, and the Black Widow
network would take care of switching and routing. The Black Widow would
maintain an internal directory of changed memory locations. If any
particular CPU needed to access a memory location outside its own, it
would check through the Black Widow. The BW would then take care of
informing the CPU whether its data is stale or fresh. The Black Widow,
much like a network switch, would direct traffic only between nodes that
are relevant to each other, without disturbing CPUs that are not.

You are correct in your analysis of what is *theoretically* possible.
Any node (within the red or black domain) can transmit to any other
node in the same color domain at any time. You *could* use a
predetermined shortest-path algorithm to route the message, but the
likelihood is that you will wind up with traffic jams. You could even
wind up with the mesh flooded with NACKs as some node tries to tell
other nodes trying to send messages that it can't handle any more
incoming traffic. The whole thing comes to a standstill: Manhattan
Routing.

Without diverting attention I do not have to spare to the notion of
directory-based cache coherence, to see if there is more to the notion
than the term implies: there is simply no way that there is enough
bandwidth there to support keeping all interested parties apprised of
changes to data that a node does not have an exclusive lock on. The
responsibility is entirely on the programmer to see to it that data
are not used in an inconsistent way. The "cache coherence" doesn't
help in any but the most trivial of ways.

In fact, these Striders/Red Storms are using Opteron 1xx processors, which
don't have any ccHTT interfaces at all, since they are meant for use only in
single-processor systems. Any data outside each Opteron's own local memory
controller comes over a regular HTT link, as if the system had an
external memory controller. This external memory's cache coherency is
maintained with DMA coherency protocols rather than multiprocessor coherency
protocols.

Felger has made it clear that he understands that.

This is all assuming that Black Widow employs a directory-based cache
coherency protocol, of course.

Very big assumption.

RM
 

Yousuf Khan

Robert Myers said:
Very big assumption.

The first Strider systems are due to come out in H2 of this year, so we'll
probably be told more about the interconnects by then. In the meantime, the
design seems to be guarded like a trade secret within Cray.

Yousuf Khan
 
