PC Review
Forums
Newsgroups
Hardware
Processors
Yamhill out in open
#1 |
|
Guest
Posts: n/a
|
From CNET:
Intel plans to demonstrate a 64-bit revamp of its Xeon and Pentium processors in mid-February--an endorsement of a major rival's strategy and a troubling development for Intel's Itanium chip.

The demo, which follows the AMD64 approach of Intel foe Advanced Micro Devices, is expected at the Intel developer conference, Feb. 17 through 19 in San Francisco, according to sources familiar with the plan. Intel had code-named the technology Yamhill but now calls it CT, sources said.

Adding 64-bit features would let "x86" chips such as Intel's Xeon and Pentium overcome today's 4GB memory limit but would undermine the hope that Intel's current 64-bit chip, Itanium, will ever ship in large quantities.

A CT demonstration would send the message that prospective Itanium customers should put Itanium purchases on hold, said Peter Glaskowsky, editor in chief of In-Stat/MDR's Microprocessor Report.

http://zdnet.com.com/2100-1103_2-5150336.html
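For a sense of scale behind the article's "4GB memory limit", here is a minimal arithmetic sketch. It assumes a flat, byte-addressed space where the pointer width alone sets the limit; the function name is made up for illustration, and real chips typically wire up fewer address bits than the full pointer width.

```python
# Address-space arithmetic behind the "4GB limit" mentioned in the article.
# Assumption: flat byte addressing, one address per byte; extensions and
# implementation-specific address widths are deliberately not modeled.

GIB = 2 ** 30  # bytes in one gibibyte


def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses reachable with the given pointer width."""
    return 2 ** address_bits


limit_32 = addressable_bytes(32)
limit_64 = addressable_bytes(64)

print(f"32-bit pointers reach {limit_32 // GIB} GiB")
print(f"64-bit pointers reach {limit_64 // limit_32} times as much")
```

The point is only that widening the pointer from 32 to 64 bits multiplies the reachable space by 2**32, which is why the memory ceiling stops being a practical concern.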
|
|
|
#2 |
|
Guest
Posts: n/a
|
On Thu, 29 Jan 2004 23:33:06 GMT, "Felger Carbon" <fmsfnf@jfoops.net> wrote:

>From CNET:
<snip>
>
>http://zdnet.com.com/2100-1103_2-5150336.html

Same article refers to the fact that IBM is planning a 64-way x86 server...with Xeon, if you follow the link.

RM
|
|
|
#3 |
|
Guest
Posts: n/a
|
"Robert Myers" <rmyers@rustuck.com> wrote in message news:hdcj10hr6r4tm0fok80rtcq1r1ak315mla@4ax.com...

> Same article refers to the fact that IBM is planning a 64-way x86 server...with Xeon, if you follow the link.

I'm wondering which Tier-1 server company is going to try the same thing with Opteron? Well, Cray is already trying it with the Strider system employing Black Widow interconnects.

Has anyone figured out if Black Widow is cache-coherent or not? And if it is, is it directory-based or broadcast-based cache coherent? Can't find any detailed info about it on the web.

Yousuf Khan
|
|
|
#4 |
|
Guest
Posts: n/a
|
On Fri, 30 Jan 2004 06:27:47 GMT, "Yousuf Khan" <ABCbjsk90DEF@GHIhotmailJKL.com> wrote:

>"Robert Myers" <rmyers@rustuck.com> wrote in message news:hdcj10hr6r4tm0fok80rtcq1r1ak315mla@4ax.com...
>> Same article refers to the fact that IBM is planning a 64-way x86 server...with Xeon, if you follow the link.
>
>I'm wondering which Tier-1 server company is going to try the same thing with Opteron? Well, Cray is already trying it with the Strider system employing Black Widow interconnects.
>
>Has anyone figured out if Black Widow is cache-coherent or not? And if it is, is it directory-based or broadcast-based cache coherent? Can't find any detailed info about it on the web.

This question came up on comp.arch. Actually, I asked it, and what I got back was mush. Some briefing somewhere refers to Red Storm as cache coherent. It's 2:00am, so if someone walked in here and threatened violence if I didn't find it, I could probably find it. But anything short of that...Oh, wait, there's google:

http://www.hotchips.org/archive/hc15/pdf/1.cray.pdf

Slide 13 of 18:

DMA between Opteron(TM) memory and network for high bandwidth (cache coherent)

No one seems to know what that means. I cannot conceive that it means what you want it to mean.

Take a gander at that mesh on slide 4, and imagine just ordinary programming traffic on that mesh, never mind cache snoop from every damn processor, _every_one_of_which_ is connected to the mesh _separately_ through its own router chip (no exploitation of local hyperlink, despite the fact that Opteri are four to a board).

Some snot-headed know-it-all on comp.arch wants me to look at page 2 of an elementary programming book to figure out how to program one of those things, and I'll tell you right now there's no damn book anywhere that will tell you how to program one of those things in the general (not nearly embarrassingly-parallel) case.

Every single processor has its own operating system image, so Red Storm is not a competitor for the kind of box that IBM wants to build with its (very impressive) Summit chip. All due well-publicized disrespect for the well-meaning people at Cray and the DoE, but Red Storm is a very expensive white elephant.

Well, that's an extreme overstatement. Red Storm will do well where data are physically localized, pretty much stay put, and only have to communicate at the boundaries of well-defined regions. That describes an awful lot of problems that the nice people at LANL and Sandia really have to do, but it does not describe a general purpose computer.

In particular, I do time-dependent problems that require multiple global communication at every time step, and the kind of work I do isn't all that extraordinarily unusual. I don't think Red Storm is going to be well-suited to the kinds of problems I do, and I think if Red Storm attempted cache coherency it would go up in a puff of smoke.

I would be very happy to learn that I am dead wrong, but so far no one has pointed me to a document that gives any evidence that I would be, and I think Red Storm is going to be another DoE special that operates in the 5-10% of peak flops range. Tolerable linpack score, pathetic everything else.

RM
|
|
|
#5 |
|
Guest
Posts: n/a
|
"Robert Myers" <rmyers@rustuck.com> wrote in message news:n50k1055roreq1d8f76f9opcdlgnsubur9@4ax.com...

> >Has anyone figured out if Black Widow is cache-coherent or not? And if it is, is it directory-based or broadcast-based cache coherent? Can't find any detailed info about it on the web.
>
> This question came up on comp.arch. Actually, I asked it, and what I got back was mush. Some briefing somewhere refers to Red Storm as cache coherent. It's 2:00am, so if someone walked in here and threatened violence if I didn't find it, I could probably find it. But anything short of that...Oh, wait, there's google:
>
> http://www.hotchips.org/archive/hc15/pdf/1.cray.pdf
>
> Slide 13 of 18:
>
> DMA between Opteron(TM) memory and network for high bandwidth (cache coherent)
>
> No one seems to know what that means. I cannot conceive that it means what you want it to mean.

Yeah, that's the problem I'm having, I cannot find any specific information about its cache-coherency. An extreme lack of details about Black Widow.

> Take a gander at that mesh on slide 4, and imagine just ordinary programming traffic on that mesh, never mind cache snoop from every damn processor, _every_one_of_which_ is connected to the mesh _separately_ through its own router chip (no exploitation of local hyperlink, despite the fact that Opteri are four to a board).

If it's a directory-based cache coherency, it might be doable. There's no cache snoop traffic on directory-based interlinks.

> Every single processor has its own operating system image, so Red Storm is not a competitor for the kind of box that IBM wants to build with its (very impressive) Summit chip. All due well-publicized disrespect for the well-meaning people at Cray and the DoE, but Red Storm is a very expensive white elephant.

I don't think it is one operating system image per processor. I think it's one operating system image per compute node (i.e. 4 processors/node?).

Cray is using the very same Black Widow interlinks in its Black Widow line of computers too (X1 is the first from the Black Widow lineup, using Cray's vector processors). It's likely that it is cache-coherent in the Black Widows, so it should be the same in Striders (Striders are Cray's Opteron-based computers, of which the Red Storm is the first model in that lineup).

> I would be very happy to learn that I am dead wrong, but so far no one has pointed me to a document that gives any evidence that I would be, and I think Red Storm is going to be another DoE special that operates in the 5-10% of peak flops range. Tolerable linpack score, pathetic everything else.

One of their PDFs was showing over 80% efficiency.

Yousuf Khan
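The directory-versus-broadcast distinction raised in this post can be made concrete with a toy message count. This is only a sketch under invented assumptions (one network message per invalidation, one request to a home directory); it says nothing about how Black Widow or Red Storm actually work, and the function names are hypothetical.

```python
# Toy count of network messages needed for one write to a shared cache line.
# Assumptions (not from the thread): every other copy must be invalidated,
# and each invalidation or directory request costs exactly one message.


def broadcast_messages(num_caches: int) -> int:
    """Broadcast snooping: every other cache is asked, whether or not it holds the line."""
    return num_caches - 1


def directory_messages(sharers: set, writer: int) -> int:
    """Directory scheme: one request to the home directory,
    plus one invalidation per cache that actually holds a copy."""
    return 1 + len(sharers - {writer})


num_caches = 10_000          # roughly the CPU count discussed in the thread
sharers = {3, 17, 42}        # hypothetical caches currently holding the line
print(broadcast_messages(num_caches))           # grows with machine size
print(directory_messages(sharers, writer=3))    # grows with number of sharers
```

In this model a directory does not remove coherency traffic altogether; it replaces machine-wide broadcasts with a few targeted messages, which is what could make it "doable" on a large mesh.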
|
|
|
#6 |
|
Guest
Posts: n/a
|
"Yousuf Khan" <ABCbbbl67DEF@GHIyahooJKL.MNOcomPQR> wrote in message news:lJpSb.20027$ef.12682@twister01.bloor.is.net.cable.rogers.com...

> "Robert Myers" <rmyers@rustuck.com> wrote in message news:n50k1055roreq1d8f76f9opcdlgnsubur9@4ax.com...
>
> > Take a gander at that mesh on slide 4, and imagine just ordinary programming traffic on that mesh, never mind cache snoop from every damn processor, _every_one_of_which_ is connected to the mesh _separately_ through its own router chip (no exploitation of local hyperlink, despite the fact that Opteri are four to a board).
>
> If it's a directory-based cache coherency, it might be doable. There's no cache snoop traffic on directory-based interlinks.

Sorry, Robert. I agreed with you on this, but decided to review the .pdf first. Good thing I did, or I'd have embarrassed myself. Again.

Slide 13: "Message based. DMA between Opteron memory and network for high bandwidth (cache coherent)."

What this means is that each Opteron maintains cache coherency with its own local Dram, as per usual in all our single-processor desktop machines. Communications from the net to each processor are via DMA message passing from the net directly to the memory. This is snooped by the CPU, exactly as our desktop machine routinely snoops DMA by (for example) our hard disk.

The Red Storm network can broadcast a given message simultaneously to multiple Opteron memories. Each Opteron snoops its own memory. By this mechanism, cache coherency is maintained (on passed messages) by each CPU with its own local memory, which is the only memory it can directly access.

Damn! I was _sure_ the Red Storm wasn't cache coherent! ;-(

Keep in mind that the Red Storm is NUMA. Changes to one CPU's local memory by the CPU are _not_ snooped by other CPUs. Inter-CPU coherency _only_ exists on passed DMA messages.

> > Every single processor has its own operating system image.
>
> I don't think it is one operating system image per processor. I think it's one operating system image per compute node (i.e. 4 processors/node?).

Sorry, Yousuf, Robert is correct on this one. There is absolutely no electrical interconnection between the four CPUs on each node (one board per node).

Slide 6: "Single (non-SMP) processor." Slide 7 reveals that there's no interconnection between the four CPUs on a node board.

Slide 10: "Supplies boot code to [each] processor." Slide 9 reveals that each system chip (one per Opteron) contains an embedded PPC CPU for booting and system monitoring.

For those keeping score, Me, Robert, and Yousuf each won one and lost one. Well, that's what this NG is all about. ;-)

Courtesy of David Wang last Oct:

http://www.hotchips.org/archive/hc15/pdf/1.cray.pdf
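The mechanism described in this post (private memory per CPU, incoming DMA snooped by the local cache, no visibility into other CPUs' stores) can be modeled in a few lines. This is a toy sketch, not Cray's design: the `Node` class, its write-through cache, and the `broadcast` helper are all hypothetical simplifications.

```python
# Toy model: each node has private memory plus a cache in front of it.
# Only an incoming DMA message (snooped locally) can change what a node sees.


class Node:
    """One CPU with its own memory and cache (hypothetical model)."""

    def __init__(self):
        self.memory = {}
        self.cache = {}

    def read(self, addr):
        if addr not in self.cache:
            # Cache miss: fill the line from local memory (default 0).
            self.cache[addr] = self.memory.get(addr, 0)
        return self.cache[addr]

    def write(self, addr, value):
        # Local store, write-through for simplicity. No other node sees it.
        self.cache[addr] = value
        self.memory[addr] = value

    def dma_receive(self, addr, value):
        # Network DMA writes local memory; the CPU snoops the write
        # and drops its stale cached line, so the next read refills it.
        self.memory[addr] = value
        self.cache.pop(addr, None)


def broadcast(receivers, addr, value):
    """Deliver one message to several nodes' memories."""
    for node in receivers:
        node.dma_receive(addr, value)


sender, receiver = Node(), Node()
stale = receiver.read(0x10)       # receiver caches its own (old) value
sender.write(0x10, 7)             # purely local to the sender
unseen = receiver.read(0x10)      # receiver still sees its old value
broadcast([receiver], 0x10, 7)    # explicit message, snooped on arrival
fresh = receiver.read(0x10)       # now the receiver sees the new value
```

In this model the receiver's view only changes when a message arrives, so whether two nodes agree on a value depends entirely on the program sending that message.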
|
|
|
#7 |
|
Guest
Posts: n/a
|
On Fri, 30 Jan 2004 14:35:31 GMT, "Felger Carbon" <fmsfnf@jfoops.net> wrote:

<snip>
>
>Sorry, Robert. I agreed with you on this, but decided to review the .pdf first. Good thing I did, or I'd have embarrassed myself. Again.
>
>Slide 13: "Message based. DMA between Opteron memory and network for high bandwidth (cache coherent)."
>
>What this means is that each Opteron maintains cache coherency with its own local Dram, as per usual in all our single-processor desktop machines. Communications from the net to each processor are via DMA message passing from the net directly to the memory. This is snooped by the CPU, exactly as our desktop machine routinely snoops DMA by (for example) our hard disk.
>
>The Red Storm network can broadcast a given message simultaneously to multiple Opteron memories. Each Opteron snoops its own memory. By this mechanism, cache coherency is maintained (on passed messages) by each CPU with its own local memory, which is the only memory it can directly access.
>
>Damn! I was _sure_ the Red Storm wasn't cache coherent! ;-(
>
>Keep in mind that the Red Storm is NUMA. Changes to one CPU's local memory by the CPU are _not_ snooped by other CPUs. Inter-CPU coherency _only_ exists on passed DMA messages.

Hmmm. This brings a whole new meaning to the term cache-coherent to my spongiform brain (not the result of misfolded proteins, just the neurons that say, "Yeah, whatever" when the discussion sounds boring).

If a processor changes a memory location *and* remembers to broadcast a message to interested parties that the memory location has been changed, then the system stays cache-coherent in the sense that cache-coherent seems to imply.

If a processor changes a memory location and fails to broadcast a message, different processors have different versions of the same data, possibly in their cache, but that's okay because this is a NUMA machine and we still get to call it cache coherent?

Felger, are you a part of this conspiracy? NUMA "cache-coherency" sounds like a much worse abuse of language than a "salesman's gigabyte."

The 64-processor x86 box that IBM will build with its Summit chipset will be cache-coherent with no asterisk.

RM
|
|
|
#8 |
|
Guest
Posts: n/a
|
"Robert Myers" <rmyers@rustuck.com> wrote in message news:fedl10hembfvpb93krje5tvoo44sksloan@4ax.com...

> >Keep in mind that the Red Storm is NUMA. Changes to one CPU's local memory by the CPU are _not_ snooped by other CPUs. Inter-CPU coherency _only_ exists on passed DMA messages.
>
> Hmmm. This brings a whole new meaning to the term cache-coherent to my spongiform brain (not the result of misfolded proteins, just the neurons that say, "Yeah, whatever" when the discussion sounds boring).
>
> If a processor changes a memory location *and* remembers to broadcast a message to interested parties that the memory location has been changed, then the system stays cache-coherent in the sense that cache-coherent seems to imply.
>
> If a processor changes a memory location and fails to broadcast a message, different processors have different versions of the same data, possibly in their cache, but that's okay because this is a NUMA machine and we still get to call it cache coherent?
>
> Felger, are you a part of this conspiracy? NUMA "cache-coherency" sounds like a much worse abuse of language than a "salesman's gigabyte."

No, I just finally figgered out what "ccNUMA" means. Since each CPU has its own, unshared memory, writes to one memory do not have to be snooped by the other 10K+ CPUs. Only when a message is passed (via DMA in this case) does the new data have to be snooped by the receiving CPU - and it is!

There is no asterisk on Red Storm's cache coherency. Since data can be exchanged _only_ by message passing, the system is fully cache coherent at all times.

The limitation of the system (and it's a whopper) is unrelated to cache coherency. It is the fact that only one CPU gets to _send_ a message at one time. Any number of CPUs can receive the message simultaneously. In Red Storm's case, because of the Red/Black partitioning, _two_ CPUs get to send at one time, one in each partition.
|
|
|
#9 |
|
Guest
Posts: n/a
|
On Fri, 30 Jan 2004 21:55:04 GMT, "Felger Carbon" <fmsfnf@jfoops.net> wrote:

>"Robert Myers" <rmyers@rustuck.com> wrote in message news:fedl10hembfvpb93krje5tvoo44sksloan@4ax.com...
>>
>> >Keep in mind that the Red Storm is NUMA. Changes to one CPU's local memory by the CPU are _not_ snooped by other CPUs. Inter-CPU coherency _only_ exists on passed DMA messages.
>>
>> Hmmm. This brings a whole new meaning to the term cache-coherent to my spongiform brain (not the result of misfolded proteins, just the neurons that say, "Yeah, whatever" when the discussion sounds boring).
>>
>> If a processor changes a memory location *and* remembers to broadcast a message to interested parties that the memory location has been changed, then the system stays cache-coherent in the sense that cache-coherent seems to imply.
>>
>> If a processor changes a memory location and fails to broadcast a message, different processors have different versions of the same data, possibly in their cache, but that's okay because this is a NUMA machine and we still get to call it cache coherent?
>>
>> Felger, are you a part of this conspiracy? NUMA "cache-coherency" sounds like a much worse abuse of language than a "salesman's gigabyte."
>
>No, I just finally figgered out what "ccNUMA" means. Since each CPU has its own, unshared memory, writes to one memory do not have to be snooped by the other 10K+ CPUs. Only when a message is passed (via DMA in this case) does the new data have to be snooped by the receiving CPU - and it is!
>
>There is no asterisk on Red Storm's cache coherency. Since data can be exchanged _only_ by message passing, the system is fully cache coherent at all times.

Mmmph? Changed data comes in from another processor via a message. The processor snoops the message and says "Ohmagod, the data has changed from what I have in my cache. Better straighten that out."

*What* in tarnation was the processor doing with the data in its cache in the first place if somebody else who might change it had it at the same time? It could still use the data while a message that the data has been changed is on its way (possibly thousands of processor cycles). That is, *if* a message was sent at all, responsibility entirely on the programmer to see to it that it happens, as far as I can tell.

You can get into the same kind of trouble on an SMP shared-memory box, and people have argued to me recently that it's a *virtue* of NUMA systems that all this message-passing has to go on. Balderdash.

The only situation in which cache-coherency makes any sense is when the snoop time is less than the data-fetch time. Otherwise, if you've been keeping your books correctly and you know that data has been handled by someone else, you might just as well routinely invalidate data that happen still to be in your cache and get a fresh copy.

If the snoop time is significantly less than the data fetch time, the circumstance can easily arise that hotly-contested data might still be in-cache when you learn that another processor is no longer liable to change it, and it is worth your while to arrange things so that you don't have to go out to memory to fetch a new copy if you don't have to.

The usefulness of cache-coherency isn't even a matter of shared memory vs. NUMA. In a four-way Opteron system, snoop times are less than fetch times, and ccNUMA is a term with real significance. In the case of RedStorm, cache-coherence is a salesman's gigabyte.

>The limitation of the system (and it's a whopper) is unrelated to cache coherency. It is the fact that only one CPU gets to _send_ a message at one time. Any number of CPUs can receive the message simultaneously. In Red Storm's case, because of the Red/Black partitioning, _two_ CPUs get to send at one time, one in each partition.

Ermph. That is not a problem at all. The Red and the Black partitions of RedStorm are *physically* disconnected. That is the only way you can use part for classified and part for unclassified, which is why the Red and the Black partitions exist at all.

While I am on my soap box, and to keep someone from gloatingly pointing the obvious out to me at some later date, the fact that all those processors are hooked together is more useful from the point of view of trying to do several small jobs at once than it is from the point of view of really attempting to use a mesh of that size with one-hop routing for a single problem.

From this particular point of view, Red Storm is *not* an expensive white elephant. It shares with a z-series mainframe the property that a large quantity of memory and a large number of processors can be reconfigured as different computers almost at a moment's notice. It is also much less expensive in that role (than a z-Series mainframe), and a RedStorm box might make good sense for a company that has multiple server farms dividing resources up in arbitrary ways that are hard to reconfigure.

When you get down to most jobs occupying just a dozen or so nodes, then many things about the box that seem silly when you consider the mesh as a whole, like "cache-coherency", no longer seem quite so ridiculous, because data go no further than well-defined boundaries, and you don't wind up with traffic jams.

Don't look for a slide that highlights what I just said, though, because it really highlights the fact that these huge boxes don't really work for all but the most embarrassingly parallel of problems, and that the DoE has spent over a decade doing nothing more than funding high school shop class projects that emphasize running wire, bolting things together, and plugging things in. That Virginia Tech could do the same thing much more cheaply by paying undergraduates with pizza and football tickets doesn't come as a big surprise.

RM
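The snoop-time versus fetch-time argument in this post can be written down as a small cost model. This is a deliberately crude sketch: the function names, the single "probability the line changed" parameter, and all of the numbers are invented for illustration and are not measurements of any machine.

```python
# Crude cost model for "snoop, then refetch only if needed" versus
# "always invalidate and refetch". Time units are arbitrary.


def cost_with_coherency(snoop_time, fetch_time, p_changed):
    """Pay for the snoop every time; pay for a fetch only when the line changed."""
    return snoop_time + p_changed * fetch_time


def cost_always_refetch(fetch_time):
    """Skip coherency entirely and fetch a fresh copy unconditionally."""
    return fetch_time


def coherency_pays_off(snoop_time, fetch_time, p_changed):
    """True when snooping is cheaper on average than always refetching."""
    return cost_with_coherency(snoop_time, fetch_time, p_changed) < cost_always_refetch(fetch_time)


# Hypothetical small SMP-like case: snoop is cheaper than a fetch.
print(coherency_pays_off(snoop_time=50, fetch_time=100, p_changed=0.2))

# Hypothetical huge-mesh case: a snoop costs far more than a local fetch.
print(coherency_pays_off(snoop_time=2000, fetch_time=100, p_changed=0.2))
```

Under this model coherency wins only when `snoop_time < (1 - p_changed) * fetch_time`, which is a slightly stricter version of the condition stated in the post.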
|
|
|
#10 |
|
Guest
Posts: n/a
|
"Robert Myers" <rmyers@rustuck.com> wrote in message news:maql10p3dvpck0gp2t7h99g2an0vko9k76@4ax.com...

> On Fri, 30 Jan 2004 21:55:04 GMT, "Felger Carbon" <fmsfnf@jfoops.net> wrote:
>
> >No, I just finally figgered out what "ccNUMA" means. Since each CPU has its own, unshared memory, writes to one memory do not have to be snooped by the other 10K+ CPUs. Only when a message is passed (via DMA in this case) does the new data have to be snooped by the receiving CPU - and it is!
> >
> >There is no asterisk on Red Storm's cache coherency. Since data can be exchanged _only_ by message passing, the system is fully cache coherent at all times.
>
> *What* in tarnation was the processor doing with the data in its cache in the first place if somebody else who might change it had it at the same time?

First, an obligatory disclaimer: I'm still learning about Red Storm. The following statements are based on the current state of my knowledge. I'll try real hard to make some good guesses. ;-)

The fact that DMA message passing into the local Dram is used necessitates that the data being overwritten is inconsequential. Therefore, if any of the data being overwritten is in cache, replacing it via snooping is also inconsequential.

> It could still use the data while a message that the data has been changed is on its way (possibly thousands of processor cycles).

The data being overwritten _must_ be inconsequential. It is the programmers' task to make certain this is the case. Nobody ever said programming a message-passing 10K+ CPU MPU was easy.

> You can get into the same kind of trouble on an SMP shared-memory box

This is not my understanding, Robert. I'll try to keep this on-topic about ccNUMA and not pursue this further.

> The only situation in which cache-coherency makes any sense is when the snoop time is less than the data-fetch time. Otherwise, if you've been keeping your books correctly and you know that data has been handled by someone else, you might just as well routinely invalidate data that happen still to be in your cache and get a fresh copy.

Snooping the passed message *does in fact* invalidate the (inconsequential) data in your cache and updates it with a fresh copy. So?

> If the snoop time is significantly less than the data fetch time, the circumstance can easily arise that hotly-contested data might still be in-cache when you learn that another processor is no longer liable to change it, and it is worth your while to arrange things so that you don't have to go out to memory to fetch a new copy if you don't have to.

I read the above several times, Robert, and I still don't understand what you're saying. This is probably my limitation.

> The usefulness of cache-coherency isn't even a matter of shared memory vs. NUMA. In a four-way Opteron system, snoop times are less than fetch times, and ccNUMA is a term with real significance.

You have just opened a brand-new can of worms. There are several forms of NUMA. One is the Red Storm version, where each CPU has a totally independent memory, accessible by other CPUs only via message passing.

The 4-way Opteron system is a completely different type of NUMA since each CPU can address the other CPUs' memory. However, it addresses the other memory at a different address. This means each CPU's cache must snoop the other 3's memory, as well as its own. Thus, the largish number of high-speed links.

> In the case of RedStorm, cache-coherence is a salesman's gigabyte.

Wrong. Red Storm is absolutely perfectly cache coherent. There are no corners or special cases where this is not true. 100% perfection. The penalty is that only one CPU gets to send messages at a time. And the programmer must avoid overwriting valid data when passing a message.

> Ermph. That is not a problem at all. The Red and the Black partitions of RedStorm are *physically* disconnected.

Whoa, Nellie! You mean, when the partition is moved that a swarm of technicians physically remove or install wiring? Huh??

> While I am on my soap box, and to keep someone from gloatingly pointing the obvious out to me at some later date, the fact that all those processors are hooked together is more useful from the point of view of trying to do several small jobs at once than it is from the point of view of really attempting to use a mesh of that size with one-hop routing for a single problem.

Absolutely correct. This is the problem with a message-passing MPU. The unfortunate fact is, there is no practical way around the problem of interconnecting 10K+ CPUs. Otherwise, everybody would use that practical way, hmm?

> From this particular point of view, Red Storm is *not* an expensive white elephant. It shares with a z-series mainframe the property that a large quantity of memory and a large number of processors can be reconfigured as different computers almost at a moment's notice.

By swarms of technicians physically installing/removing wiring? (Sorry, Robert, that was a cheap shot that I just couldn't resist. I ain't perfect.)

> these huge boxes don't really work for all but the most embarrassingly parallel of problems

They're the only game in town. We'd all love to have equivalent performance in a really fast one-CPU supercomputer, but that just ain't possible. Alas.

For one specific algorithm, it is sometimes *possible* (in principle) to design the algorithm flow into hardware. There isn't enough money in the world to pay for this for lotsa algorithms. Double alas. ;-)
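The shared-address-space flavour of NUMA described in this post, where one CPU reaches another's memory "at a different address", can be sketched as a mapping from a global physical address to a home node. The layout here is an assumption chosen for simplicity (four nodes, equal contiguous ranges, 2 GiB each); real systems may size or interleave memory differently, and the names are hypothetical.

```python
# Sketch of a 4-node shared-address-space NUMA layout.
# Assumption: each node owns one contiguous, equally sized range.

NUM_NODES = 4
NODE_MEM_BYTES = 2 * 2**30   # hypothetical: 2 GiB of local memory per node


def home_node(phys_addr):
    """Return (node that owns the address, offset inside that node's memory)."""
    node, offset = divmod(phys_addr, NODE_MEM_BYTES)
    if node >= NUM_NODES:
        raise ValueError("address beyond installed memory")
    return node, offset


def is_local(cpu, phys_addr):
    """True if the access stays on the CPU's own memory controller."""
    return home_node(phys_addr)[0] == cpu


addr = 3 * NODE_MEM_BYTES + 5
print(home_node(addr))       # owned by node 3, offset 5
print(is_local(0, addr))     # remote for CPU 0: must cross an inter-CPU link
print(is_local(3, addr))     # local for CPU 3
```

Every CPU can name every byte here, so a store by one CPU can hit a line cached by another; that is why this kind of NUMA needs hardware coherency between CPUs, while a message-passing design with private memories does not.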
|