PC to Get PS3-Class Computing Power? AMD Planning CELL-Like Co-Processor Developed with Clearspeed

video-game dude

http://www.electronicsweekly.com/Articles/2006/03/15/37936/ClearspeedplansAMDco-processorlinkup.htm

Clearspeed in talks with AMD about co-processor
by David Manners
Wednesday 15 March 2006

Clearspeed, the Bristol parallel processor company, is in talks with
AMD about developing a closely-coupled co-processor to soup up the AMD
multi-core family of x86 processors.

"Clearspeed is talking to AMD about producing a closely-coupled
co-processor," said Jeff Underhill, business development manager for
64-bit embedded applications at AMD.

The talks may be the result of an awareness of the stunning performance
of the IBM/Sony/Toshiba microprocessor Cell.

A future product for AMD is QuadCore, due out next year, a four-core
microprocessor delivering double the performance of AMD's current
two-core product, DualCore.

[Photo caption, EW.com: Dual Clearspeed processors on a PCI card - 50Gflops
of extra performance]


However, Cell uses 8 processors around a PowerPC core to deliver a
quarter of a Teraflops or 256 billion flops.

Of course, x86 is CISC and Cell is RISC, and so can run faster;
nonetheless, the difference in performance between Cell and QuadCore is
such that AMD may consider a co-processor option.

Clearspeed has experience of producing ultra-high-speed accelerator
chips to boost computing power, while maximising performance per Watt.

The Bristol-based firm's CSX600 chip has 96 processor cores, runs at
250MHz, dissipates 10W, and adds 25Gflops to a computer's
performance.

The problems in implementing multi-core processors in practice,
according to Chris Rowen, CEO of Tensilica, which automates
microprocessor generation, are: "A need for new tools and software
for application-centric energy management; a need for new tools,
software and training for multi-processor programming; significant
silicon improvements for lower voltage and capacitance, and automation
in multi-processor partitioning and interconnect."

____________________

http://arstechnica.com/news.ars/post/20060315-6392.html

AMD considers Clearspeed math co-processor

by Hannibal

It has been over a year since I last reported on Clearspeed, the
company that made waves on the hardware scene in 2003 with a massively
parallel floating-point processor aimed at accelerating math-intensive
simulations. Now AMD wants to tap Clearspeed to provide a math
co-processor for their forthcoming quad-core Opteron, a combination
that would provide some serious compute bandwidth for scientific
computing.

The last time I talked about Clearspeed I described their latest
product, the CSX600, which was codenamed Avebury at the time. The
CSX600 has 96 small processing elements (PEs), each of which contains
two ALUs, a register file, and a 6K data cache. The CSX600's PEs are
much simpler than, say, Cell's SPEs, and they can't execute programs
independently but require a general-purpose host to pass them commands
and data. Clearspeed claims 25 GFLOPs at 10W max power dissipation,
which is pretty impressive and makes the CSX600 an attractive
co-processor.

One of the points I've consistently made in my Clearspeed coverage is
that the PC doesn't have the kinds of buses that can really do justice
to an add-on like Clearspeed. PCIe helps, but an even better fit is
AMD's HyperTransport.

In a recent post on AMD's intentions to position HyperTransport in the
gap left by Intel's much delayed Common Systems Interconnect (CSI), I
talked about how AMD would like to leverage HT by having third parties
offer an attractive array of HT-compatible co-processors for different
kinds of applications (e.g. a hypothetical server-side Java + XML
accelerator was mentioned at one point). Cultivating an HT- and
Opteron-compatible co-processor market will help AMD differentiate
their server offerings from Intel in a time period where Intel will
have a leg up in raw CPU performance (via Core) while AMD will have the
advantage in interconnect technology. The Clearspeed co-processor
that's being investigated by AMD falls squarely within this "let's
leverage our interconnect" strategy.

I'm sure that AMD is investigating any number of potential co-processor
makers, and they certainly should be if Conroe's numbers are even close
to what's being advertised. I could easily see an Opteron vendor like
Sun, for instance, doing some nice and very competitive things with an
Opteron + CSX600 combination.




______________


http://www.dailytech.com/article.aspx?newsid=1276&ref=y

AMD Interested In Reviving the Math Co-processor

[Image: ClearSpeed Advance board]

[Image: The CSX600's design: 96 dedicated math units, each with its own 6KB
cache]

Providing nearly 5X the math processing power of an AMD FX60 CPU, AMD
may use Clearspeed's processors

While current generation processors are good at arithmetic in general,
their design leaves many things to be desired. For example, while an
Opteron CPU is more than capable of rendering 3D graphics, special
purpose processors from ATI and NVIDIA are infinitely better at this
task due to their specialized designs. Likewise, ATI and NVIDIA GPUs
would not make very good general purpose processors. Both Intel and AMD
have worked very hard on making the latest generation processors very
good at general purpose computations -- multiple cores, branch
predictions, deep pipelines, operation fusion, etc. Unfortunately,
none of these trends are particularly necessary for heavy mathematical
computations, and in many cases these advancements have proven
detrimental.

While desktops generally do not rely on heavy math operations,
workstations and servers are demanding more and more out of CPUs that
have drifted further and further away from mathematical computations.
Enter Clearspeed. The company has been in talks with AMD for some
time now over the use of its dedicated math co-processors. AMD is
looking at Clearspeed's products and possibly integrating Clearspeed's
silicon into future Opteron quad-core processors.

Clearspeed's current flagship processor, the CSX600, is dedicated
solely to complex mathematics processing. Used in applications such as
medical research, CAD, space research, data mining, and other
math-intensive work, Clearspeed's products are far superior to
general purpose processors alone for such workloads. Clearspeed claims
that its CSX600 co-processor can reach 25Gflops under the right
conditions. AMD's Opteron can
handle approximately 5.7Gflops, though the operations on the Opteron
are significantly more complex than the ones carried out on the CSX600.
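
As a rough check on the ratio implied by the two figures above, here is a minimal Python sketch. The variable names are illustrative only, and the numbers are the article's quoted claims rather than measurements.

```python
# Figures as quoted in the article (vendor claims, not benchmarks).
csx600_gflops = 25.0   # Clearspeed's claimed CSX600 throughput
opteron_gflops = 5.7   # article's approximate figure for an Opteron

speedup = csx600_gflops / opteron_gflops
print(f"CSX600 / Opteron: about {speedup:.1f}x")
```

That works out to roughly 4.4x against the quoted Opteron figure. The "nearly 5X" in the summary refers to the FX60, for which the article gives no separate number.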

Currently, Clearspeed's products are available as individual
co-processor units or as add-in PCIe boards. The company says that while
PCI Express does offer a significant speed advantage over using PCI-X
or PCI interfaces, HyperTransport interconnects offer the best
performance. Clearspeed also mentions that Intel's latest platform
performs well with its co-processors, but due to delays with Common
System Interconnect (Intel's answer to HyperTransport) performance
gains are limited.

As of right now, there doesn't seem to be a firm deal between AMD and
Clearspeed, but things are shaping up nicely for Clearspeed,
which is based in the UK. Intel also uses Clearspeed's technology in
some of its line of processors.

_________________

http://www.theinquirer.net/?article=30318

AMD considers using Clearspeed co-processor

To sauce multicores up a bit

By INQUIRER staff:

ELECTRONICS WEEKLY said that UK based parallel chip firm Clearspeed is
chatting to AMD about producing a co-processor.

According to the report, the Clearspeed co-processor will help speed
AMD multicores on their way amidst concerns that the forthcoming Cell
chip will fly along like the proverbial off a hot shovel.

Electronics Weekly quotes an AMD exec as saying the Cell chip was a
wake up call. µ


______________________
 
Jac

<snip spam>

Changed your handle from "gOJDO" to "video-game dude", didn't ya, punk?
PLONK!

j.
 
johns

They don't need it. I've got single core AMD CPUs running
calcs in Solidworks that used to take hours ... now taking
minutes. Guys can't even go for coffee. Also have a Fluent
Engineering design lab with X2s cutting those rendering
projects down to seconds ??? Big deal. Where are the apps
needing this stuff?

johns
 
Stephen Sprunk

johns said:
They don't need it. I've got single core AMD CPUs running
calcs in Solidworks that used to take hours ... now taking
minutes. Guys can't even go for coffee. Also have a Fluent
Engineering design lab with X2s cutting those rendering
projects down to seconds ??? Big deal. Where are the apps
needing this stuff?

HPCC folks have calculations that are measured in days, not seconds. With
that kind of number crunching, the extra GFLOPS will be a huge benefit,
provided you can muster enough memory bandwidth to keep them fed. Also,
having more cores per system may improve the percent of peak performance
actually achieved on some workloads, since the average latency between cores
will be much lower.

However, let's look at the numbers. According to that article, you'd need
ten Clearspeed co-procs in a box to (almost) match a single Cell. The
low-power angle certainly doesn't play out since you need ~100W to get what
the Cell does with ~30W. The space and boards for all those chips (not to
mention the cooling) certainly isn't competitive.

Sure, an Opteron only gets a few GFLOPS, so this is an improvement if you
want to stick to PC-based clusters, but is it compelling when stacked up
against Cell for custom systems?
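
The "ten co-procs" and "~100W" figures follow from simple division. Here is a minimal Python sketch of that arithmetic, using the numbers quoted in the Electronics Weekly article and the ~30W Cell estimate from this post (a poster's estimate, not an official figure). All variable names are illustrative.

```python
import math

cell_sp_gflops = 256.0   # Cell peak quoted in the Electronics Weekly article
csx600_gflops = 25.0     # CSX600 figure quoted in the same article
csx600_watts = 10.0      # CSX600 power quoted in the article
cell_watts = 30.0        # "~30W" estimate from this post, not an official figure

chips_exact = cell_sp_gflops / csx600_gflops   # chips needed to equal one Cell
chips_needed = math.ceil(chips_exact)          # whole chips to match or exceed it
ten_chip_gflops = 10 * csx600_gflops           # what ten chips deliver
ten_chip_watts = 10 * csx600_watts             # what ten chips dissipate

print(f"chips to match Cell exactly: {chips_exact:.2f} (round up to {chips_needed})")
print(f"ten chips: {ten_chip_gflops:.0f} GFLOPS at {ten_chip_watts:.0f} W")
print(f"power vs Cell estimate: {ten_chip_watts / cell_watts:.1f}x")
```

Ten chips give 250 GFLOPS, which is the "(almost)" in the comparison, at about 100W before counting the host CPU.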

S

--
Stephen Sprunk "Stupid people surround themselves with smart
CCIE #3723 people. Smart people surround themselves with
K5SSS smart people who disagree with them." --Aaron Sorkin

*** Free account sponsored by SecureIX.com ***
*** Encrypt your Internet usage with a free VPN account from http://www.SecureIX.com ***
 
Andrew Reilly

Stephen Sprunk said:
However, let's look at the numbers. According to that article, you'd need
ten Clearspeed co-procs in a box to (almost) match a single Cell. The
low-power angle certainly doesn't play out since you need ~100W to get what
the Cell does with ~30W. The space and boards for all those chips (not to
mention the cooling) certainly isn't competitive.

Sure, an Opteron only gets a few GFLOPS, so this is an improvement if you
want to stick to PC-based clusters, but is it compelling when stacked up
against Cell for custom systems?

Clearspeed's FMAC is a 64-bit (double precision) one, I thought. Cell
slows down by approximately a factor of ten on double precision. If
that's what you want (and it is for HPCC), then doesn't that make them
comparable?

I'd like to know how one keeps a 96-processor Clearspeed CSX600 chip
even *close* to busy with a single DDR2-DRAM port and only 756 words
(double precision) of processor-local scratch-pad memory. Anyone know?

Cheers,
 
Stephen Sprunk

Andrew Reilly said:
Clearspeed's FMAC is a 64-bit (double precision) one, I thought. Cell
slows down by approximately a factor of ten on double precision. If
that's what you want (and it is for HPCC), then doesn't that make them
comparable?

Ah, the first few pages I found Googling just listed the Cell at 256GFLOPS
and said it's 64-bit. I just found another one that said it's only 25GFLOPS
with DP (all assuming 4GHz core speed).

Clearspeed's chip is apparently 25GFLOPS (sustained, 50G peak) for both SP
and DP.

So, Cell is a clear winner if you can get away with SP, and Clearspeed has
the same performance at 1/3 the power for DP. That's a totally different
argument if you assume DP is needed.
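
A quick way to see the "1/3 the power" point is to compute DP GFLOPS per watt for each chip. This is a minimal Python sketch using only the figures mentioned in this thread. The ~30W Cell number is the earlier estimate, and host CPU power is deliberately left out.

```python
# name: (DP GFLOPS, watts), all taken from figures quoted in this thread
chips = {
    "Cell": (25.0, 30.0),     # DP figure from this post; ~30W is an estimate
    "CSX600": (25.0, 10.0),   # sustained figure and power from the article
}

gflops_per_watt = {name: g / w for name, (g, w) in chips.items()}

for name, eff in gflops_per_watt.items():
    print(f"{name}: {eff:.2f} DP GFLOPS/W")

ratio = gflops_per_watt["CSX600"] / gflops_per_watt["Cell"]
print(f"CSX600 advantage: {ratio:.1f}x")
```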

Still, power isn't comparable when you consider the Clearspeed will need a
separate CPU to control it, but I'd assume the ability to use PC-based
designs will provide enough cost savings in other areas to make up for it.

Andrew Reilly said:
I'd like to know how one keeps a 96-processor Clearspeed CSX600 chip
even *close* to busy with a single DDR2-DRAM port and only 756 words
(double precision) of processor-local scratch-pad memory. Anyone know?

The Clearspeed data sheet lists 96GB/s to internal memory, 3.2GB/s to
external memory, and 2x3.2GB/s to other chips.

The external numbers are a far cry from Cell's 25.6GB/s, and the local
storage is also a lot smaller. I don't know enough about HPCC codes to say
if this will be a big stumbling block or not, but it's definitely an
improvement over using just commodity CPUs.
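
One way to frame the bandwidth question is to ask how many flops a code must perform per double fetched from external memory before the chip stops being bandwidth-limited. The Python sketch below does that division for the figures quoted above. It assumes 8-byte doubles, uses peak numbers throughout, and the function name is made up for illustration.

```python
BYTES_PER_DOUBLE = 8

def flops_per_double(gflops, gbytes_per_s):
    """Flops needed per double streamed from external memory for the
    chip to be limited by compute rather than by memory bandwidth."""
    gdoubles_per_s = gbytes_per_s / BYTES_PER_DOUBLE   # 1e9 doubles per second
    return gflops / gdoubles_per_s

csx600 = flops_per_double(25.0, 3.2)   # data-sheet external bandwidth quoted above
cell = flops_per_double(25.0, 25.6)    # DP figure and bandwidth quoted above

print(f"CSX600: {csx600:.1f} flops per double")
print(f"Cell:   {cell:.1f} flops per double")
```

Under these assumptions the CSX600 needs on the order of 60 flops per off-chip operand to stay busy, against roughly 8 for Cell in DP, so only codes with heavy reuse of local data would be expected to approach peak.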

The first folks to build a cluster with these things will have a lot of
visitors.

S

--
Stephen Sprunk "Stupid people surround themselves with smart
CCIE #3723 people. Smart people surround themselves with
K5SSS smart people who disagree with them." --Aaron Sorkin

 
Del Cecchi

Stephen said:
Ah, the first few pages I found Googling just listed the Cell at
256GFLOPS and said it's 64-bit. I just found another one that said it's
only 25GFLOPS with DP (all assuming 4GHz core speed).

Clearspeed's chip is apparently 25GFLOPS (sustained, 50G peak) for both
SP and DP.

So, Cell is a clear winner if you can get away with SP, and Clearspeed
has the same performance at 1/3 the power for DP. That's a totally
different argument if you assume DP is needed.

Still, power isn't comparable when you consider the Clearspeed will need
a separate CPU to control it, but I'd assume the ability to use PC-based
designs will provide enough cost savings in other areas to make up for it.



The Clearspeed data sheet lists 96GB/s to internal memory, 3.2GB/s to
external memory, and 2x3.2GB/s to other chips.

The external numbers are a far cry from Cell's 25.6GB/s, and the local
storage is also a lot smaller. I don't know enough about HPCC codes to
say if this will be a big stumbling block or not, but it's definitely an
improvement over using just commodity CPUs.

The first folks to build a cluster with these things will have a lot of
visitors.

S

You mean a cluster of Cells? How about a nice blade server?
(followups trimmed)
 
David Wang

Stephen Sprunk said:
Ah, the first few pages I found Googling just listed the Cell at 256GFLOPS
and said it's 64-bit. I just found another one that said it's only 25GFLOPS
with DP (all assuming 4GHz core speed).

The Cell processor presently consists of 1 PPE and 8 (or 7) SPE's.

The PPE can theoretically do 11 SP Flops per cycle, (although I still
don't know how exactly) so that's 44 SP GFlops @ 4 GHz.
The PPE can do 1 DP FMAC per cycle, so that's 8 DP GFlops @ 4 GHz.

Each SPE can do 4 SP FMAC per cycle, so that's 8 * 8 * 4 = 256 GFlops
@ 4 GHz - assuming 8 SPE's.

The DP units aren't pipelined, so the max throughput is 2 DP FMAC per
7 cycles. Take 256 SP Gflops, divide by 14, and we get 18.3 DP GFlops
@ 4 GHz.

For an 8 SPE CELL processor running @ 4 GHz, the max DP throughput is
thus 8 + 18.3 = 26.3 GFlops, and the max SP throughput is 44 + 256 =
300 GFlops.
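
The peak figures above can be reproduced in a few lines. This Python sketch simply restates the per-cycle rates given in this post, with a 4 GHz clock and 8 SPEs as assumed there, and counts each FMAC as two flops. The constant and variable names are illustrative.

```python
CLOCK_GHZ = 4.0
NUM_SPE = 8
FLOPS_PER_FMAC = 2   # a fused multiply-add counts as two flops

# PPE rates as stated in the post
ppe_sp = 11 * CLOCK_GHZ                       # 11 SP flops per cycle
ppe_dp = 1 * FLOPS_PER_FMAC * CLOCK_GHZ       # 1 DP FMAC per cycle

# SPE rates: 4 SP FMACs per cycle; 2 DP FMACs every 7 cycles
spe_sp = NUM_SPE * 4 * FLOPS_PER_FMAC * CLOCK_GHZ
spe_dp = NUM_SPE * 2 * FLOPS_PER_FMAC * CLOCK_GHZ / 7

total_sp = ppe_sp + spe_sp
total_dp = ppe_dp + spe_dp

print(f"SP peak: {ppe_sp:.0f} + {spe_sp:.0f} = {total_sp:.0f} GFlops")
print(f"DP peak: {ppe_dp:.0f} + {spe_dp:.1f} = {total_dp:.1f} GFlops")
```

The DP rate for the SPEs is the SP rate divided by 14, as in the text, because each SPE goes from 8 flops per cycle to 4 flops per 7 cycles.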

The asymmetry and the specialized SP Flops out of the PPE make it
difficult to get close to max, but impressive max SP flops they are.
 
Andrew Reilly

Stephen Sprunk said:
The first folks to build a cluster with these things will have a lot of
visitors.

A few of the hits on my google for this pointed to a pretty large Sun
Opteron cluster going into Japan, augmented by several hundred Clearspeed
cards. I don't think that it's fully commissioned, yet, but yes, it'll be
pretty interesting to see how much it helps on their code.
 
Yousuf Khan

video-game dude said:
http://www.electronicsweekly.com/Articles/2006/03/15/37936/ClearspeedplansAMDco-processorlinkup.htm

Clearspeed in talks with AMD about co-processor
by David Manners
Wednesday 15 March 2006

The Clearspeed copro already exists, I assume in the form of a PCI or
PCI-e plug-in card. All this announcement does is allow the
Clearspeed copro to exist alongside an Opteron on a specially designed
socket and directly interfaced to the Opteron via Hypertransport links,
most likely *Coherent* HT links.

Yousuf Khan
 
Zak

Andrew said:
I'd like to know how one keeps a 96-processor Clearspeed CSX600 chip
even *close* to busy with a single DDR2-DRAM port and only 756 words
(double precision) of processor-local scratch-pad memory. Anyone know?

I suppose that an Opteron can do quite some stuff at full speed with the
memory bandwidth that is available. 10 or 30 float ops per value or so?
Thus, the Clearspeed would need to do a many-step thing to add any
speed, but 768 registers is more than what Opteron has (though Opteron's
cache is larger... hmm).

Filling in the steps is hard.
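
For reference, the 768 figure is what you get if the "6K" of memory per PE mentioned in the Ars Technica piece is read as 6 KiB holding 8-byte doubles. That reading is an assumption, and the names in this Python sketch are illustrative.

```python
KIB = 1024
pe_memory_bytes = 6 * KIB   # "6K" per PE, read here as 6 KiB (assumption)
bytes_per_double = 8
num_pes = 96                # PEs on a CSX600, per the articles

doubles_per_pe = pe_memory_bytes // bytes_per_double
total_on_chip_kib = num_pes * pe_memory_bytes // KIB

print(f"{doubles_per_pe} doubles per PE, {total_on_chip_kib} KiB across the chip")
```

Under that reading each PE holds 768 doubles, so the 756 quoted above may simply be a slip.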


Thomas
 
