"Robert Myers" wrote:
> On Fri, 30 Jan 2004 21:55:04 GMT, "Felger Carbon" wrote:
>
> >No, I just finally figgered out what "ccNUMA" means. Since each CPU
> >has its own, unshared memory, writes to one memory do not have to be
> >snooped by the other 10K+ CPUs. Only when a message is passed (via
> >DMA in this case) does the new data have to be snooped by the
> >receiving CPU - and it is!
> >
> >There is no asterisk on Red Storm's cache coherency. Since data can
> >be exchanged _only_ by message passing, the system is fully cache
> >coherent at all times.
> >
> *What* in tarnation was the processor doing with the data in its cache
> in the first place if somebody else who might change it had it at the
> same time?
First, an obligatory disclaimer: I'm still learning about Red Storm.
The following statements are based on the current state of my
knowledge. I'll try real hard to make some good guesses. ;-)

Because messages are passed by DMA into the local DRAM, the data
being overwritten must be inconsequential. Therefore, if any of the
data being overwritten is in cache, replacing it via snooping is also
inconsequential.
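
To make the snooping argument concrete, here is a toy Python model of one node. It is only a sketch under my own assumptions: the `Node` class, its `load` and `dma_write` methods, and the dictionary standing in for a cache are all made up for illustration and say nothing about how Red Storm's hardware actually works. The model drops the stale cached copy on a DMA write, so the next load fetches the fresh value from DRAM.

```python
class Node:
    """Toy model of one CPU with private DRAM and a cache.

    Hypothetical illustration only, not Red Storm's actual design.
    """

    def __init__(self, size):
        self.dram = [0] * size
        self.cache = {}  # address -> cached value

    def load(self, addr):
        # CPU read: fill the cache on a miss, then serve from the cache.
        if addr not in self.cache:
            self.cache[addr] = self.dram[addr]
        return self.cache[addr]

    def dma_write(self, base, payload):
        # Incoming message: DMA writes DRAM, and the cache snoops each
        # write, dropping any stale copy of the overwritten address.
        for offset, value in enumerate(payload):
            addr = base + offset
            self.dram[addr] = value
            self.cache.pop(addr, None)


node = Node(8)
node.dram[2] = 11
before = node.load(2)          # caches the old value
node.dma_write(2, [99, 100])   # a message lands on top of it
after = node.load(2)           # cache miss, refetched from DRAM
```

The old value in `before` was only safe to lose because the program had already finished with it; the model does nothing to check that.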
> It could still use the data while a message that the data has been
> changed is on its way (possibly thousands of processor cycles).
The data being overwritten _must_ be inconsequential. It is the
programmers' task to make certain this is the case. Nobody ever said
programming a message-passing 10K+ CPU MPU was easy.
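
One way a programmer might keep that promise is to mark a receive area as expendable before any message is allowed to land in it. The Python sketch below is an assumption of mine, not anything documented for Red Storm: `ReceiveSlot`, `post`, and `deliver` are hypothetical names, and real DMA hardware would not raise an error, it would simply overwrite the data.

```python
class ReceiveSlot:
    """Hypothetical receive area that tracks whether its data is expendable."""

    def __init__(self, size):
        self.data = [0] * size
        self.posted = False  # True only while the contents may be overwritten

    def post(self):
        # The program declares it is finished with the current contents.
        self.posted = True

    def deliver(self, payload):
        # Stand-in for the DMA engine writing an incoming message.
        if not self.posted:
            raise RuntimeError("incoming message would overwrite live data")
        self.data[:len(payload)] = payload
        self.posted = False  # contents are live again until re-posted


slot = ReceiveSlot(4)
slot.post()
slot.deliver([1, 2, 3])        # fine: the slot was posted

try:
    slot.deliver([7, 8, 9])    # not re-posted, so this is refused
    clobbered = True
except RuntimeError:
    clobbered = False
```

The check exists only in this software model; on the real machine the discipline has to live in the program itself.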
> You can get into the same kind of trouble on an SMP shared-memory box
This is not my understanding, Robert. I'll try to keep this on-topic
about ccNUMA and not pursue this further.
> The only situation in which cache-coherency makes any sense is when
> the snoop time is less than the data-fetch time. Otherwise, if you've
> been keeping your books correctly and you know that data has been
> handled by someone else, you might just as well routinely invalidate
> data that happen still to be in your cache and get a fresh copy.
Snooping the passed message *does in fact* invalidate the
(inconsequential) data in your cache and update it with a fresh copy.
So?
> If the snoop time is significantly less than the data fetch time, the
> circumstance can easily arise that hotly-contested data might still be
> in-cache when you learn that another processor is no longer liable to
> change it, and it is worth your while to arrange things so that you
> don't have to go out to memory to fetch a new copy if you don't have
> to.
I read the above several times, Robert, and I still don't understand
what you're saying. This is probably my limitation.
> The usefulness of cache-coherency isn't even a matter of shared memory
> vs. NUMA. In a four-way Opteron system, snoop times are less than
> fetch times, and ccNUMA is a term with real significance.
You have just opened a brand-new can of worms. There are several
forms of NUMA. One is the Red Storm version, where each CPU has a
totally independent memory, accessible by other CPUs only via message
passing. The 4-way Opteron system is a completely different type of
NUMA since each CPU can address the other CPUs' memory. However, it
addresses the other memory at a different address. This means each
CPU's cache must snoop the other 3's memory, as well as its own.
Thus, the largish number of high-speed links.
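
The "different address" point can be sketched in a few lines of Python. This assumes, purely for illustration, that each of the four CPUs owns one equal, contiguous block of a single shared address space; `NODE_BYTES`, `NUM_NODES`, and `home_node` are names I made up, and real systems set up the map in firmware and need not use equal-sized blocks.

```python
NODE_BYTES = 1 << 30   # assumed: 1 GiB of DRAM per CPU (illustrative)
NUM_NODES = 4          # the 4-way system discussed above


def home_node(addr):
    """Split a global physical address into (owning CPU, local offset)."""
    node, offset = divmod(addr, NODE_BYTES)
    if not 0 <= node < NUM_NODES:
        raise ValueError("address outside installed memory")
    return node, offset


local = home_node(0x100)                     # lives in CPU 0's memory
remote = home_node(3 * NODE_BYTES + 0x100)   # same offset, CPU 3's memory
```

In the Red Storm style of NUMA there is no such shared map at all: the only way to reach another CPU's memory is to send it a message.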
> In the case of RedStorm, cache-coherence is a salesman's gigabyte.
Wrong. Red Storm is absolutely perfectly cache coherent. There are
no corners or special cases where this is not true. 100% perfection.
The penalty is that only one CPU gets to send messages at a time. And
the programmer must avoid overwriting valid data when passing a
message.
> Ermph. That is not a problem at all. The Red and the Black
> partitions of RedStorm are *physically* disconnected.
Whoa, Nellie! You mean, when the partition is moved that a swarm of
technicians physically remove or install wiring? Huh??
> While I am on my soap box, and to keep someone from gloatingly
> pointing the obvious out to me at some later date, the fact that all
> those processors are hooked together is more useful from the point of
> view of trying to do several small jobs at once than it is from the
> point of view of really attempting to use a mesh of that size with
> one-hop routing for a single problem.
Absolutely correct. This is the problem with a message-passing MPU.
The unfortunate fact is, there is no practical way around the problem
of interconnecting 10K+ CPUs. Otherwise, everybody would use that
practical way, hmm?
> From this particular point of view, Red Storm is *not* an expensive
> white elephant. It shares with a z-series mainframe the property that
> a large quantity of memory and a large number of processors can be
> reconfigured as different computers almost at a moment's notice.
By swarms of technicians physically installing/removing wiring?
(Sorry, Robert, that was a cheap shot that I just couldn't resist. I
ain't perfect.)
> these huge boxes don't
> really work for all but the most embarrassingly parallel of problems
They're the only game in town. We'd all love to have equivalent
performance in a really fast one-CPU supercomputer, but that just
ain't possible. Alas.
For one specific algorithm, it is sometimes *possible* (in principle)
to design the algorithm flow into hardware. There isn't enough money
in the world to pay for this for lotsa algorithms. Double alas. ;-)