Understanding the Cell Microprocessor [VERY LONG]




Three very interesting things happened over the past couple of weeks here at
1. Intel's Spring IDF 2005 turned out to be a multi-core CPU festival,
with Intel being more open than ever before about future plans for
their multi-core microprocessor architectures. Intel has over 10
multi-core CPU designs in the works, and they made that very clear at IDF.
2. At GDC 2005, AGEIA announced that they had developed a Physics
Processing Unit (PPU) that could be used to enable extremely realistic
physics and artificial intelligence models.
3. Johan De Gelas went one step further in his quest for more processing
power earlier this week, finding that there's quite a lot of potential for
multi-core CPUs in the gaming market, at the expense of increased
development times.

So, what do these three things have in common? Taken together, the three
basically summarize what we've come to know as the Cell microprocessor - a
multi-core CPU, part of which is designed for parallel physics/AI
processing, and one that will be quite difficult to program.

Cell, at a high level, isn't too difficult to understand; it's how the
designers got there that is most intriguing. It's the design decisions and
building blocks of Cell that we'll focus on here in this article, with an
end goal of understanding why Cell was designed the way it was.

A joint venture between IBM, Sony and Toshiba, the Cell microprocessor is
the heart and soul of Sony's upcoming Playstation 3. However, this time
around, Sony and Toshiba are planning to use Cell (or parts of it) in
everything from consumer electronics to servers and workstations. If you
don't already have the impression, Cell has publicly been given some very
high aspirations as a microprocessor, especially for a non-x86 microprocessor.

Usage Patterns

Before getting into the architecture of Cell, let's talk a bit about the
types of workloads for which Cell and other microprocessors are currently
being built.

In the past, office application performance was a driving factor behind
microprocessor development. Before multitasking and before email, there
was single application performance and for the most part, we were talking
about office applications, word processors, spreadsheets, etc. Thus, most
microprocessors were designed toward incredible single application, single
task performance.

As microprocessors became more powerful, the software followed -
multitasking environments were born. The vast majority of computer users,
however, were still focused on single application usage, so microprocessor
development continued to focus on single-threaded performance (single
application, single task performance).

Over the years, the single-threaded performance demands grew. Microsoft
Word was no longer the defining application, but things like games, media
processing and dynamic content creation became the applications that ate up
the most CPU cycles. This is where we are today with workloads being a mix
of office, 3D games, 3D content creation and media
encoding/decoding/transcoding that consume our CPU cycles. But in order to
understand the creation of a new architecture like Cell, you have to
understand where these workloads are headed. Just as the types of
applications demanding performance today are much different than those run
10 years ago, the same will apply to applications in the next decade. And
given that a new microprocessor architecture takes about 5 years to develop,
it is feasible to introduce a new architecture geared towards these new
usage models now.

Intel spoke a lot about future usage models at their most recent IDF, things
like real time voice recognition (and even translation), unstructured search
(e.g. Google image search), even better physics and AI models in games, more
feature-rich user interfaces (e.g. hand gesture recognition), etc. These
are the usage models of the future, and as such, they have a different set
of demands on microprocessors and their associated architectures.

The type of performance required to enable these types of usage models is
significantly higher than what we have available to us today.
Conventionally, performance increases from one microprocessor generation to
the next by optimizing single thread performance. There are a number of
ways of improving single thread performance, either by driving up the clock
speed or by increasing the instructions executed per clock (IPC). Taking
it one step further, the more parallelism you can extract from a single
thread, the better your performance will be - this type of parallelism is
known as instruction level parallelism (ILP) as it involves executing as
many instructions out of a thread as possible at the same time.

The problem with improving performance through increasing ILP is that from
one generation to the next, you're only talking about a 10% - 20% increase
in performance. Yet, the usage models that we're talking about for the
future require significantly more than the type of gains that we've been
getting in the past. With power limitations preventing clock speeds from
scaling too high, it's clear that there needs to be another way of improving
performance.

The major players in the microprocessor industry have all pretty much agreed
that the only way to get the type of performance gains that are necessary is
by moving towards multi-core architectures. Through a combination of
multithreaded applications and multi-core processors, you can get the types
of performance increases that should allow for these types of applications
to be developed. Instead of focusing on extracting ILP to improve
performance, these multi-core processors extract parallelism on a thread
level to improve performance (thread level parallelism - TLP).

It's not as straightforward as that, however. There are a handful of
decisions that need to be made. How powerful do you make each core in your
multi-core microprocessor? Do you have a small array of powerful
processors or a larger array of simpler processors? How do they
communicate with one another? How do you deal with feeding a multi-core
processor with enough memory bandwidth?

The Cell implementation is just one solution to the problem...

High Level Overview of Cell

Cell is just as much of a multi-core processor as the upcoming multi-core
CPUs from AMD and Intel, the only difference being that Cell's architecture
doesn't have an entirely homogeneous set of cores.

Cell's Execution Cores
The Cell architecture debuted in a configuration of 9 independent cores: one
PowerPC Processing Element (PPE) and eight Synergistic Processing Elements
(SPEs). The PPE and SPEs are obviously different, but all eight SPEs are
identical to one another.

The PPE is IBM's major contribution to the Cell project; it also appears to
be very similar to the core being used in the next Xbox console.

The PPE is a new core unlike any other PowerPC core made by IBM. The PPE
is kept simple purposefully, although it has the base functionality of any
modern day, general purpose microprocessor. The role of the PPE in Cell is
to handle the tasks that any general purpose microprocessor would run;
basically, anything that you could run on your Athlon 64 would be run on the
PPE.

The PPE has a 64KB L1 cache and a 512KB L2 cache, and supports SMT,
similar to Intel's Hyper-Threading. The PPE is built around a strictly in-order
core, which the desktop x86 market hasn't seen since the death of the
original Pentium (the Pentium Pro brought out-of-order execution to the x86
market), so the move for an in-order core is an interesting one. The PPE
is also only a 2-issue core, meaning that, at best, it can execute two
instructions simultaneously. For comparison, the Athlon 64 is a 3-issue
core, so immediately, you get the sense that the PPE is a much simpler core
than anything that we have on the desktop. IBM's VMX instruction set (aka
Altivec) is also supported by the PPE. Much like the rest of the Cell
processor, the PPE is designed to run at very high clock speeds.

There's not much that's impressive about the PPE, other than it's a small,
fast, efficient core. Put up against a Pentium 4 or an Athlon 64, the PPE
would lose undoubtedly, but the PPE's architecture is one answer to a shift
in the performance paradigm. Performance in business/office applications
requires a very powerful, very fast general purpose microprocessor, but
performance in a game console, for example, does not. The original Xbox
used a modified Intel Celeron processor running at 733MHz, while the fastest
desktops had 2.0GHz Pentium 4s and 1.60GHz Athlon XPs. Given that the
first implementation of Cell is supposed to be Sony's Playstation 3, the
simplicity of the PPE is not surprising. Should Cell ever make its way
into a PC, the PPE would definitely have to be beefed up, or at least paired
with multiple other PPEs.

The majority of the Cell's die is composed of the eight Synergistic
Processing Elements (SPEs). If you consider the PPE to be a general
purpose microprocessor, think of the SPEs as general purpose processors with
a slightly more specific focus.

Each SPE is a fully functioning independent microprocessor, but greatly
simplified and not as general purpose as the PPE. The SPEs have no cache,
but each SPE does have 256KB of local memory (we will discuss the difference
between local memory and cache later). Each SPE also has a total of 7
execution units, including one integer unit, so the SPEs can perform integer
math as well as SIMD floating point arithmetic. The SPEs are dual issue,
meaning that they can execute a maximum of 2 instructions in parallel.
Keeping both the SPEs and the PPE dual issue indicates a concern over Cell's
transistor count and chip size, as increasing issue width is directly
connected to both of these key items.

The SPEs have no branch predictor, meaning that they rely solely on software
branch prediction. There are ways that the compiler can avoid branches,
and the SPE architecture lends itself very well to things like loop
unrolling. Any elementary programmer is familiar with a loop, where one or
more lines of code are repeated until a certain condition is met. The
checking of that condition (e.g. i < 100) often results in a branch, so one
way of removing that branch is simply to unroll the loop. If you have a
statement in a loop that is supposed to execute 100 times, you could either
keep it in the loop and execute it that way, or you could remove the loop
and simply copy the statement 100 times. The end result is the same - the
only difference is that in one case, you have a branch condition while the
other case results in more lines of code to execute.
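
To make the trade-off concrete, here is a minimal sketch of the transformation. It is written in Python purely for illustration (on an SPE the compiler would do this to machine code), and it unrolls by a factor of 4 rather than copying the statement 100 times, which is our own choice of a middle ground. The function names and the `checks` counter are hypothetical; the counter simply tallies how often the loop condition, i.e. the branch, is evaluated.

```python
def scale_rolled(values, factor):
    """Multiply every element by `factor`, one element per loop iteration."""
    out = []
    checks = 0  # number of times the loop condition (a branch) is evaluated
    i = 0
    while True:
        checks += 1
        if not i < len(values):
            break
        out.append(values[i] * factor)
        i += 1
    return out, checks


def scale_unrolled_by_4(values, factor):
    """Same result, but four elements are handled per loop iteration."""
    if len(values) % 4 != 0:
        raise ValueError("this sketch assumes a length that is a multiple of 4")
    out = []
    checks = 0
    i = 0
    while True:
        checks += 1
        if not i < len(values):
            break
        # Four copies of the loop body, with no condition check between them.
        out.append(values[i] * factor)
        out.append(values[i + 1] * factor)
        out.append(values[i + 2] * factor)
        out.append(values[i + 3] * factor)
        i += 4
    return out, checks


data = list(range(100))
rolled, rolled_checks = scale_rolled(data, 2)
unrolled, unrolled_checks = scale_unrolled_by_4(data, 2)
print(rolled == unrolled)              # True
print(rolled_checks, unrolled_checks)  # 101 26
```

Both versions produce the same list, but the unrolled one evaluates the condition roughly a quarter as often, at the cost of a longer loop body.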

The problem with loop unrolling is that you need a large number of registers
to unroll some loops, which is one reason that each SPE has 128 registers.
Originally, the SPEs were supposed to use the VMX (Altivec) ISA, but because
of a need for more than 32 architectural registers, the SPEs implemented a
new ISA with support for 128 registers.

Each SPE is only capable of issuing two instructions per clock, meaning that
at best, each SPE can execute two instructions at the same time. The issue
width of a microprocessor can determine a big part of how large the
microprocessor will be; for example, the Itanium 2 is a 6-issue core, so
being a 2-issue core makes each SPE significantly smaller than most general
purpose microprocessors.

In the end, what we see with the SPEs is that they sacrifice some of the
normal tricks to improve ILP in favor of being able to cram more SPEs onto a
single die, effectively sacrificing some ILP for greater TLP. Given the
direction that the industry is headed, a move to a very TLP centric design
makes a lot of sense, but at the same time, it will be quite dependent on
developers adhering to very specific development models.

Clearly, the architects of Cell saw the SPEs as being used to run a highly
parallelizable workload, and as Derek Wilson mentioned in his article about

"One of the properties of graphics that made the feature a good fit for a
specialized processor inside a PC is the fact that the task is infinitely
parallelizable. Hundreds of thousands, and even millions of pixels, need to
be processed every frame. The more detailed a rendering needs to be, the
more parallel the task becomes. The same is true with physics. As with the
visual world, the physical world is continuous rather than discrete. The
more processing power we have, the more things we can simulate at once, and
the more realistically we can approximate the real world."

With NVIDIA supplying some form of a GPU for Playstation 3, Cell's array of
SPEs has one definite purpose in a gaming console - physics and AI
processing. Many have argued that the array of SPEs seems capable of
taking over the pixel processing workload of a GPU, but for a high
performance console, that's not much of an option. The SPE array could
offer better CPU-based 3D rendering, but it would be a tough sell (no pun
intended) for this array of SPEs to be the end of dedicated GPU hardware.

Cell's On-Die Memory Controller
For years, we've known that Rambus' memory and interface technology is well
ahead of the competition. The problem is that it has never been implemented
well on a PC before. The Rambus brand received a fairly negative
connotation during the early days of RDRAM on the PC, and things worsened
even more for the company's brand with the Rambus vs. the DDR world

Rambus has had success in a lot of consumer electronics devices, such as
HDTVs and the Playstation 2, so when Cell was announced to make heavy use of
Rambus technologies, it wasn't too surprising. As we've reported before,
Rambus technology is used in about 90% of the signaling pins on Cell. The
remaining 10% are mostly test pins, so basically, Rambus handles all data
going in and out of the Cell processor. They do so in two ways:

First off, Cell includes an on-die dual channel XDR memory controller, each
channel being 36 bits wide (32 data bits plus ECC). Cell's XDR memory bus runs
at 400MHz, but XDR memory transfers data at 8 times the memory bus clock -
meaning that you get 3.2GHz data signaling rates. The end result is
GPU-like memory bandwidth of 25.6GB/s. As we've mentioned in our coverage
of this year's Spring IDF, memory bandwidth requirements increase
tremendously as you increase the number of processor cores - with 9 total in
Cell, XDR is the perfect fit. Note that the GeForce 6800GT offers 32GB/s
of memory bandwidth just to its GPU, so it would not be too surprising to
see the Playstation 3's GPU paired up with its own local memory as well as
being able to share system memory and bandwidth.
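
The 25.6GB/s figure can be checked with a few lines of arithmetic. The sketch below assumes that only the 32 data bits of each channel carry payload (the ECC bits are excluded) and that 1GB/s means 10^9 bytes per second; the function name is ours, not anything from Rambus or IBM.

```python
def peak_bandwidth_gb_per_s(channels, data_bits_per_channel,
                            bus_clock_mhz, transfers_per_clock):
    """Peak bandwidth in GB/s (10^9 bytes/s) of a multi-channel memory bus."""
    megabits_per_second = (channels * data_bits_per_channel
                           * bus_clock_mhz * transfers_per_clock)
    return megabits_per_second / 8 / 1000  # bits -> bytes, MB/s -> GB/s


# 400MHz bus clock, 8 transfers per clock -> 3.2GHz data signaling rate
signaling_rate_ghz = 400 * 8 / 1000
xdr_bandwidth = peak_bandwidth_gb_per_s(
    channels=2, data_bits_per_channel=32,
    bus_clock_mhz=400, transfers_per_clock=8)

print(signaling_rate_ghz)  # 3.2
print(xdr_bandwidth)       # 25.6
```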

The block labeled MIC is the XDR memory controller, and the XIO block is the
physical layer - all of the input receivers and output drivers are in the
XIO block. Data pipelines are also present in the XIO block.

As we've seen from AMD's Athlon 64, having a memory controller on-die
significantly reduces memory latencies, which applies to Cell as well.

Cell's On-Die FlexIO Interface
The other important I/O aspect of Cell is also controlled by Rambus - the
FlexIO interface. Cell features two configurable FlexIO interfaces, each
being 48-bits wide with 6.4GHz data signaling rates.

The BEI block is effectively the North Bridge interface, while the FlexIO
block is the physical FlexIO layer.

The word "configurable" is particularly important as it means that you don't
need to connect every wire. Taking this notion one step further, don't
look at the FlexIO interfaces as being able to connect to one chip, but
rather multiple chips with different width FlexIO interfaces.

One potential implementation of Cell's configurable FlexIO interface.

While Cell's XDR interface offers over 2x the memory bandwidth of any
PC-based microprocessor, Cell's FlexIO interface weighs in at 76.8GB/s -
almost 10x the chip-to-chip bandwidth of AMD's Athlon 64.
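
The FlexIO number follows from the same width-times-rate arithmetic. In the sketch below, the Cell figures come from the text above; the Athlon 64 figure is our assumption of a 16-bit HyperTransport link in each direction at 2000 mega-transfers per second (8GB/s aggregate), which is what makes the "almost 10x" comparison work out. The function name is hypothetical.

```python
def link_bandwidth_gb_per_s(total_width_bits, mega_transfers_per_s):
    """Peak bandwidth in GB/s (10^9 bytes/s) for a link of the given width."""
    return total_width_bits * mega_transfers_per_s / 8 / 1000


# Two 48-bit FlexIO interfaces at a 6.4GHz data signaling rate (from the text).
flexio = link_bandwidth_gb_per_s(2 * 48, 6400)

# ASSUMPTION: Athlon 64 HyperTransport, 16 bits per direction, 2000 MT/s.
athlon64_ht = link_bandwidth_gb_per_s(2 * 16, 2000)

print(flexio)                          # 76.8
print(athlon64_ht)                     # 8.0
print(round(flexio / athlon64_ht, 1))  # 9.6
```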

In Playstation 3, you can pretty much expect a good hunk of this bandwidth
to be between NVIDIA's GPU and the Cell processor, but it also can be used
for some pretty heavy I/O interfaces.

One of the major requirements in any high performance game console is
bandwidth, and thanks to Rambus, Cell has plenty of it.

Cell's In-Order Architecture
We have mentioned that both the PPE and SPEs are in-order cores, but in
order to understand the impact of an in-order core on performance, there's a
bit of background knowledge that we have to go over first.

Dependencies, Instruction Ordering and Parallelism
What are Dependencies?

In many of our past CPU articles, we've brought up this idea of dependencies
as seen by the CPU. At the very basic level, a CPU is fed a stream of
instructions that are generally of the form:

OP destination, source1, source2, ... , source n

The instruction format will vary from one CPU ISA to the next, but the
general idea is that the CPU is sent an operation (OP), a destination to
store the result of the operation, and one or more sources on which to get
data to perform the operation. Depending on the architecture, the
destination and sources can be memory locations or registers. For the sake
of simplicity, let's just assume that for now, all destinations and sources
are registers.

Let's take a look at an example with some data filled in:

ADD R10, R1, R2

The above line of assembly would be sent to the CPU, telling it to add the
values stored in R1 (Register #1) and R2 and store the result in R10.
Simple enough. Now, let's give the CPU another operation to crunch on:

MUL R11, R10, R3

This time, we're multiplying the values stored in R10 and R3, and storing
the result in R11. As a single line of assembly, the above code is easily
executed, but when placed directly after our first example, we've created a
bit of a problem:

1. ADD R10, R1, R2
2. MUL R11, R10, R3
3. ADD R9, R11, R4

Line 1 writes to R10, while Line 2 reads from R10. Under no circumstances
can the CPU begin executing line 2 before line 1 completes - the same goes
for lines 3 and 2. What we've created here is what is known as a RAW
dependency, Read After Write. There are many more types of dependencies,
but understanding this basic example is more than enough to take us to the
next topic at hand - the impact of such dependencies.
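
As a rough illustration of what "detecting a RAW dependency" means, the sketch below scans the three-line example and reports every case where a line reads a register that an earlier line wrote. The tuple encoding of instructions and the function name are our own invention for this example, not a real assembler format.

```python
def find_raw_dependencies(program):
    """Return (writer_line, reader_line, register) triples, lines 1-based.

    `program` is a list of (op, destination, sources) tuples.
    """
    last_writer = {}  # register -> line number that most recently wrote it
    dependencies = []
    for line, (op, dest, sources) in enumerate(program, start=1):
        for reg in dict.fromkeys(sources):  # de-duplicate, keep order
            if reg in last_writer:
                dependencies.append((last_writer[reg], line, reg))
        last_writer[dest] = line
    return dependencies


program = [
    ("ADD", "R10", ("R1", "R2")),
    ("MUL", "R11", ("R10", "R3")),
    ("ADD", "R9", ("R11", "R4")),
]
print(find_raw_dependencies(program))
# [(1, 2, 'R10'), (2, 3, 'R11')]
```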

The problem with a dependency is that it limits what can be executed in
parallel. Take the Athlon 64, for example. It has three integer execution
units, all of which are equally capable of executing the code (in a slightly
revised, x86 assembly format, of course) that we used above. In theory,
the Athlon 64 could execute three integer operations in parallel,
assuming that no dependencies existed between the
operations. In executing the above code, two of the Athlon 64's integer
execution units would go idle until the first line of code was executed.

Dependencies, such as the simple one that we talked about above, hinder the
ability of modern day microprocessors to function to the best of their
abilities. It's like having three hands, but only being able to clean your
room by picking up one item at a time; frustratingly inefficient.
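
The cost of that dependency chain can be put into numbers with a toy model. The sketch below assumes every operation takes exactly one cycle, that a result can be used in the cycle after it is produced, and that the only limits are true (RAW) dependencies and the number of execution units. These are simplifications of ours and not a model of the real Athlon 64.

```python
def min_cycles(program, units):
    """Cycles needed to run `program` on `units` identical 1-cycle units.

    `program` is a list of (destination, sources) tuples in program order.
    """
    ready_at = {}  # register -> first cycle in which its value can be read
    used = {}      # cycle -> number of units already busy in that cycle
    total = 0
    for dest, sources in program:
        cycle = max([ready_at.get(reg, 0) for reg in sources] + [0])
        while used.get(cycle, 0) >= units:  # all units busy, try next cycle
            cycle += 1
        used[cycle] = used.get(cycle, 0) + 1
        ready_at[dest] = cycle + 1
        total = max(total, cycle + 1)
    return total


dependent = [("R10", ("R1", "R2")), ("R11", ("R10", "R3")), ("R9", ("R11", "R4"))]
independent = [("R10", ("R1", "R2")), ("R11", ("R5", "R3")), ("R9", ("R6", "R4"))]

print(min_cycles(dependent, units=3))    # 3
print(min_cycles(independent, units=3))  # 1
```

With three units available, the dependent chain still needs three cycles, so two of the three units sit idle in every one of them.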

Ordering Instructions around Dependencies

Luckily, there are solutions to the problem of dependencies in code; one
tackles the problem in hardware, the other tackles the problem in software.

The software compiler is responsible for producing the assembly code that is
sent to the CPU for execution. Thus, with an intimate knowledge of the
inner workings of the CPU, the compiler can, generally speaking, produce
code that minimizes data dependencies.
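
A toy version of that compiler job might look like the sketch below. It reorders a short instruction list so that an instruction does not directly follow the one it depends on, whenever an independent instruction is available to fill the gap. It is only a greedy illustration under our own instruction encoding; a real compiler scheduler works from actual instruction latencies and is far more sophisticated.

```python
def depends_on(later, earlier):
    """True if `later` must stay after `earlier` (RAW, WAR or WAW hazard)."""
    _, later_dest, later_sources = later
    _, earlier_dest, earlier_sources = earlier
    return (earlier_dest in later_sources      # read after write
            or later_dest in earlier_sources   # write after read
            or later_dest == earlier_dest)     # write after write


def schedule(program):
    """Greedily reorder `program` without violating any dependency."""
    remaining = list(range(len(program)))
    order = []
    while remaining:
        # An instruction may be placed once nothing it depends on is pending.
        candidates = [
            i for i in remaining
            if not any(depends_on(program[i], program[j])
                       for j in remaining if j < i)
        ]

        def stalls(i):
            # Would instruction i wait on the instruction placed just before it?
            return bool(order) and depends_on(program[i], program[order[-1]])

        pick = min(candidates, key=lambda i: (stalls(i), i))
        order.append(pick)
        remaining.remove(pick)
    return [program[i] for i in order]


program = [
    ("LD", "R10", ("R11",)),        # load R10 from the address in R11
    ("ADD", "R5", ("R10", "R10")),  # needs the loaded value
    ("ADD", "R9", ("R9",)),         # independent increment of R9
]
for op, dest, sources in schedule(program):
    print(op, dest, sources)
# LD R10 ('R11',)
# ADD R9 ('R9',)
# ADD R5 ('R10', 'R10')
```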

There are microprocessor architectures that are dependent entirely on the
compiler to extract parallelism on the instruction level while avoiding
dependencies as much as possible. These architectures are known as in-order
architectures.

In-Order Architectures
As the name implies, an in-order microprocessor can only execute
instructions in the order that they are sent to the CPU. At best, the CPU
can execute multiple instructions in parallel, but it has no ability to
reorder the instructions to suit its needs better.

If you have a good enough compiler, then an in-order microprocessor should
be just fine. There are a couple of key limitations, however:

1. Binaries compiled for in-order architectures are very architecture-dependent

Although both the Athlon 64 and the Pentium 4 are fully able to run x86
code, they contain vastly different microarchitectures, with different
execution units and very different things that they are "good" at. If both
of the aforementioned chips depended entirely on the compiler to extract
parallelism and maximize performance, one would most definitely suffer.
You could always have two versions of every program, but that tends to get
large and messy - especially from an update/patches standpoint. The
compiler has to be intimately aware of the architecture that it's compiling
for, which works in cases like a game console where you don't have multiple
vendors providing differently architected CPUs with a common ISA, yet not so
well when you look at something like the desktop x86 market.

2. Unpredictable memory latencies

Cache is a good thing, most of the time. Cache on a microprocessor does
its best to keep frequently used data at hand, so it can be made available
to the CPU at very low latencies. The problem is that cache adds a level
of unpredictability to how long it will take to get data from memory. A
cache hit could mean that your data will be ready in 10 - 20 cycles. A
cache miss could mean that it'll be hundreds of cycles. With an in-order
microprocessor, you can't reorder instructions based on data availability,
so if data isn't available in cache and the CPU has to wait longer to pull
it from main memory, the entire CPU has to sit and wait until that data is
brought in from main memory. Even if other instructions could be executed,
an in-order microprocessor has no logic to effectively handle the on-the-fly
reordering of instructions to get around unpredictable memory latencies.

If you can find a way around the limitations of an in-order architecture,
there are some very tangible benefits:

1. A much simplified microprocessor

Out-of-Order microprocessors have a significant amount of complexity added
to them in order to deal with on-the-fly reordering of instructions. We
will talk about them in greater detail in the next section. By moving this
complexity to the software/compiler side, you greatly reduce the complexity
of your microprocessor and save your transistor budget for other things that
can yield better performance benefits. Less complexity also means less
power consumed and heat dissipated.

2. Shorter pipeline

In order to deal with the reordering of instructions, generally speaking, a
number of pipeline stages have to be added to the architecture, resulting in
higher power consumption and demands for a more accurate branch predictor
(thanks to an even higher branch prediction penalty). While the impact on
pipeline depth isn't as big of a deal for longer pipelined designs, for
shorter designs, the increase can be 40% or more.

Historically, the idea of a simple in-order core has been one that's been
abandoned in favor of the obvious alternative: an out-of-order architecture.

Out-of-Order Architectures

In contrast to in-order architectures, there are out-of-order architectures.
Out-of-order architectures still decode instructions in the original order
of the program, and still retire the instructions in order, but the actual
issue/execution of the instructions can be done out of order.

Let's talk a bit about what all of this means. A CPU is useless if it
changes the intent of the code fed to it. Frankly speaking, if you
double-click on a file, your CPU would be rather useless if it executed a
bunch of format commands instead. Although that's an extreme example, in
order to ensure that things like that don't happen, a CPU must adhere to two
rules:

1. Instructions must be decoded (i.e. interpreted by the CPU to find out
what they are asking it to do) in the original order of the program, and
2. Instructions must retire in the original order of the program (i.e.
the result of each operation must be written to memory/disk in the same
order as it was sent to the CPU).

Both in-order and out-of-order architectures adhere to those two rules; it's
in what happens between those two stages that out-of-order architectures
differ. We mentioned earlier that in-order architectures can't reorder
instructions on the fly. Let's say that we have an in-order CPU
with one adder and one load/store unit that is fed the following code (for
the sake of simplicity, we'll leave a forwarding network out of this
example):

1. LD R10, R11
2. ADD R5, R10, R10
3. ADD R9, R9, #1
4. ...

In the first instruction, we're loading data from a memory address stored in
R11 into R10. Then, we're adding the value that we just obtained from
memory to itself and storing it in R5. The third and final line in the
snippet increments the value stored in R9 by 1 and stores it in R9.
Quickly looking at the code, you see that line 2 can't execute before line
1. Doing so would alter the intent of the code (if you want to add
something to itself, you need to make sure you have that something first).
Line 3, however, is completely independent of lines 1 and 2.

With an in-order microprocessor, if the data being loaded in line 1 is
contained within cache, then that instruction will take around 1 - 30 clock
cycles to complete (varying depending on the architecture and which level of
cache it is in). Line 2 would have to simply wait those 1 - 30 cycles
before executing and then after it executed, line 3 could have its turn.
If the requested data isn't stored in cache (maybe it's the first time that
we're asking for that value and we haven't asked for anything near it in
memory), then we have a problem. All of a sudden, line 1 doesn't take
around 1 - 30 cycles to complete; now, it's going to take 200+ clock cycles
to complete. For line 2, that's not such a big deal, since it can't
execute until line 1 completes anyway, but for line 3, it could just as
easily execute during the time that the CPU is waiting to get that load from
memory. Any independent instructions following line 3 are also at the
mercy of the cache miss.

With an out-of-order microprocessor, however, the situation of a cache miss
isn't nearly as dramatic. The code is still decoded in order, meaning that
it comes across instructions 1, 2 and 3 in the same order as the in-order
CPU, but this time, we have the ability to execute line 3 ahead of lines 1
and 2 instead of idly waiting for line 1 to complete. In the event of a
cache miss, this gives the out-of-order microprocessor a pretty big
performance advantage, as it isn't sitting there burning away clock cycles
while nothing gets done. So, how does the out-of-order CPU work?
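
Before answering that, the cache-miss scenario can be sketched as a small timing model. It is a deliberate simplification of ours: the load is assumed to miss and take 200 cycles, each add takes 1 cycle, one instruction is decoded per cycle, and only true (RAW) dependencies are tracked. Register renaming, execution unit limits and in-order retirement are left out, and the function names are hypothetical.

```python
def in_order_finish_times(program):
    """Completion cycle of each instruction when issue is strictly in order."""
    ready = {}          # register -> cycle at which its value is available
    next_issue = 0      # earliest cycle the next instruction may issue
    finish = []
    for dest, sources, latency in program:
        # Wait both for our turn and for our operands; a stalled
        # instruction also holds up everything behind it.
        issue = max([next_issue] + [ready.get(reg, 0) for reg in sources])
        done = issue + latency
        ready[dest] = done
        finish.append(done)
        next_issue = issue + 1
    return finish


def out_of_order_finish_times(program):
    """Completion cycles when an instruction issues as soon as it is ready."""
    ready = {}
    finish = []
    for index, (dest, sources, latency) in enumerate(program):
        # Decoded in order (one per cycle), but only its own operands
        # can delay it, not an older stalled instruction.
        issue = max([index] + [ready.get(reg, 0) for reg in sources])
        done = issue + latency
        ready[dest] = done
        finish.append(done)
    return finish


program = [
    ("R10", ("R11",), 200),      # 1. LD  R10, R11      (cache miss, assumed)
    ("R5", ("R10", "R10"), 1),   # 2. ADD R5, R10, R10
    ("R9", ("R9",), 1),          # 3. ADD R9, R9, #1
]
print(in_order_finish_times(program))      # [200, 201, 202]
print(out_of_order_finish_times(program))  # [200, 201, 3]
```

In this model line 3 completes at cycle 3 on the out-of-order machine, instead of sitting behind the load until cycle 202.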

If someone told you a list of things to do in any order that you wanted,
you'd simply take in the list and get to it. But if they told you to report
back the things that you've completed in the order in which they were told
to you, you'd have to grumble and write them down first before reorganizing
them to fit your needs.

An out-of-order CPU works pretty much the same way, except instead of a
to-do list, it has an instruction window. The instruction window functions
similarly to a to-do list - it has all of the decoded instructions in their
original order and is kept as a record to make sure that those instructions
retire in the order that they were decoded.

Alongside the instruction window, an out-of-order CPU also has a scheduling
window - it is in this "window" where all of the reordering of instructions
takes place. The scheduling window contains logic to mark dependent and
independent instructions and send all independent ones to execution units
while waiting for dependent instructions to become ready for execution.

As previously dependent instructions (e.g. instructions waiting on data from
main memory or instructions waiting for other instructions to complete)
become independent, they are then able to be executed, once again, in any
order.

Right off the bat, you can tell that the addition of an instruction window,
a scheduling window and all of the associated logic to detect independent
instructions, not to mention the logic to handle out-of-order execution but
in order retirement, all makes for a more complex microprocessor. But
there is one other significant problem with out-of-order microprocessors -
the increase in performance and instruction level parallelism is greatly
dependent upon the size of the instruction window.
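
The effect of window size can be sketched with another toy model. Here the instruction window and the scheduling window are merged into one for brevity: the CPU may only look at the `window_size` oldest instructions that have not yet retired, issues any of them whose inputs are ready, and retires strictly in order. The 20-cycle load latency, the unlimited execution units and the function name are assumptions of ours.

```python
def cycles_to_finish(program, window_size):
    """Cycles until every instruction has retired, for a given window size.

    `program` is a list of (destination, sources, latency) tuples.
    """
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    n = len(program)

    # For each instruction, find the earlier instructions producing its inputs.
    producers = []
    last_writer = {}
    for index, (dest, sources, latency) in enumerate(program):
        producers.append([last_writer[r] for r in sources if r in last_writer])
        last_writer[dest] = index

    finish = [None] * n  # completion cycle, set when an instruction issues
    head = 0             # oldest instruction that has not retired yet
    cycle = 0
    while True:
        # Retire completed instructions, strictly in program order.
        while head < n and finish[head] is not None and finish[head] <= cycle:
            head += 1
        if head == n:
            return cycle
        # Issue every ready instruction that is visible inside the window.
        for i in range(head, min(head + window_size, n)):
            inputs_ready = all(
                finish[p] is not None and finish[p] <= cycle
                for p in producers[i]
            )
            if finish[i] is None and inputs_ready:
                finish[i] = cycle + program[i][2]
        cycle += 1


# One slow load, one add that needs it, then six independent adds.
program = [("R10", ("R11",), 20), ("R5", ("R10",), 1)]
program += [(f"R{20 + k}", (f"R{40 + k}",), 1) for k in range(6)]

for size in (1, 2, 8):
    print(size, cycles_to_finish(program, size))
# 1 27
# 2 24
# 8 21
```

A window of 1 behaves like an in-order core, while the window of 8 can see all six independent adds and tuck them in underneath the load.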

The larger you make this window, the more parallelism that can be extracted
simply because the CPU is looking at a wider set of instructions from which
to select independent ones. At the same time, the larger you make the
window, the lower your clock speed can be. For example, Itanium has an
extremely large instruction window to feed its execution units, while the
Pentium 4 has a significantly smaller one in order to hit higher clock
speeds.

Despite the downsides, all modern day x86 microprocessors are out-of-order
cores, as keeping a single core simple isn't the top priority given advances
in manufacturing processes. The benefits of an out-of-order architecture
are two-fold:

1. Dynamic reordering of instructions lets the CPU hide memory latencies,
allowing for even higher clock speeds. For every cache miss, a Pentium 4
3.6GHz has to wait around 230 clock cycles to get data from main memory,
which is a lot of idle time in the eyes of the CPU. Being able to make use
of that idle time by executing other independent instructions in the
meantime is one way in which architectures like the Pentium 4 and Athlon 64
get away with running at such high multiples of their memory frequency.
2. Incremental increase in instruction level parallelism - by reordering
instructions on the fly, out-of-order architectures can improve ILP as best
as possible in areas where the compiler fails to.

So, it's obvious that both AMD and Intel have figured out that for a general
purpose x86 microprocessor, out-of-order makes the most sense. Then, why
is it that the architects of Cell, when starting with a clean slate,
outfitted the processor with 9 independent in-order cores?

The first thing to remember is that you can get pretty solid performance
from an in-order architecture. The Itanium is an in-order microprocessor,
based on a premise similar to Cell's, by which the compiler should be able to
extract the sort of parallelism that an out-of-order core would. Current
generation Itanium cores run at half the speed of modern day x86 cores, yet
the CPU is able to execute around 2x the instructions per clock as the
fastest x86 CPUs. To quote Intel's Justin Rattner in reference to Itanium,
"an appropriately designed instruction set should lend itself to an in-order
architecture without any problems." So, it's quite possible that the same
could apply to Cell...

Cell's Approach - In Order with no Cache

Remember that the Cell's architects designed the processor while evaluating
the incremental performance that each transistor delivered (a slight
exaggeration: they didn't count every last one of the 234 million
transistors, but they evaluated each architectural decision very closely).
In doing so, the idea of in-order vs. out-of-order must have raised a huge
debate, given the increased complexity that an out-of-order core would add.

With the major benefit of out-of-order being a decrease in susceptibility to
memory latencies, the Cell architects proposed another option - what about
an in-order core with controllable (read: predictable) memory latencies?

In-order microprocessors suffer because as soon as you introduce a cache
into the equation, you no longer have control over memory latencies. Most
of the time, a well-designed cache is going to give you low latency access
to the data that you need. But look at the type of applications that Cell
is targeted at (at least initially) - 3D rendering, games, physics, media
encoding etc. - all applications that aren't dependent on massive caches.
Look at any one of Intel's numerous CPUs with enlarged caches and note that
3D rendering, gaming and encoding performance usually doesn't benefit much
beyond a certain amount of cache. For example, the Pentium 4 660 (3.60GHz - 2MB
L2) offered a 13% increase in Business Winstone 2004 over the Pentium 4 560
(3.60GHz - 1MB L2), but less than a 2% average performance increase in 3D
games. In 3dsmax, there was absolutely no performance gain due to the
extra cache. A similar lack of performance improvement can be seen in our
media encoding tests. The usage model of the Playstation 3 isn't going to
be running Microsoft Office; it's going to be a lot of these "media rich"
types of applications like 3D gaming and media encoding. For these types
of applications, a large cache isn't totally necessary - low latency memory
access is necessary, and lots of memory bandwidth is important, but you can
get both of those things without a cache. How? Cell shows you how.

Each SPE features 256KB of local memory; more specifically, it is not cache. The
local memory doesn't work on its own. If you want to put something in it,
you need to send the SPE a store instruction. Cache works automatically;
it uses hard-wired algorithms to make good guesses at what it should store.
The SPE's local memory is the size of a cache, but works just like a main
memory. The other important thing is that the local memory is SRAM based,
not DRAM based, so you get cache-like access times (6 cycles for the SPE)
instead of main memory access times (e.g. 100s of cycles).
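
The difference between a software-managed local memory and a hardware-managed cache can be sketched in a few lines of Python. This is only a toy model, not Cell's actual interface: the class names, the LRU replacement policy, the 200-cycle main memory figure and the choice to charge a cache hit the same 6 cycles as the local memory are all assumptions made for illustration.

```python
from collections import OrderedDict

LOCAL_STORE_LATENCY = 6    # cycles, the SPE figure quoted above
MAIN_MEMORY_LATENCY = 200  # cycles, assumed stand-in for "100s of cycles"


class LocalStore:
    """Software-managed memory: it only holds what the program put there."""

    def __init__(self):
        self.data = {}

    def store(self, addr, value):
        self.data[addr] = value

    def load(self, addr):
        # Every load costs the same; there is no hit/miss distinction.
        return self.data[addr], LOCAL_STORE_LATENCY


class Cache:
    """Hardware-managed LRU cache sitting in front of main memory."""

    def __init__(self, capacity, memory):
        self.capacity = capacity
        self.memory = memory
        self.lines = OrderedDict()

    def load(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)      # mark as most recently used
            return self.lines[addr], LOCAL_STORE_LATENCY
        value = self.memory[addr]             # miss: go out to main memory
        self.lines[addr] = value
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)    # evict least recently used
        return value, MAIN_MEMORY_LATENCY


memory = {addr: addr * 10 for addr in range(4)}
pattern = [0, 1, 0, 2, 1]

cache = Cache(capacity=2, memory=memory)
cache_latencies = [cache.load(addr)[1] for addr in pattern]

local = LocalStore()
for addr in set(pattern):
    local.store(addr, memory[addr])           # explicit, program-controlled copy
local_latencies = [local.load(addr)[1] for addr in pattern]

print(cache_latencies)  # [200, 200, 6, 200, 200]
print(local_latencies)  # [6, 6, 6, 6, 6]
```

The cache's latency depends on what happened to be accessed earlier, while the local memory's latency is the same on every access, which is the property the rest of this section leans on.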

What's the big deal then? With the absence of cache, but the presence of a
very low latency memory, each SPE effectively has controllable, predictable
memory latencies. This means that a smart developer, or smart compiler,
could schedule instructions for each SPE extremely granularly. The
compiler would know exactly when data would be ready from the local memory,
and thus, could schedule instructions and work around memory latencies just
as well as an out-of-order microprocessor, but without the additional
hardware complexity. If the SPE needs data that's stored in the main
memory attached to the Cell, the latencies are just as predictable, since
once again, there's no cache to worry about mucking things up.
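
To see why a known latency is so valuable to a compiler, consider a minimal sketch of a single-issue in-order core. The model, the register names and the 1-cycle ALU latency are assumptions for illustration; only the 6-cycle load latency comes from the text above. Because the load latency is fixed, the better instruction order can be found ahead of time rather than discovered by out-of-order hardware.

```python
LOAD_LATENCY = 6  # local memory access time quoted above
ALU_LATENCY = 1   # assumed for illustration


def total_cycles(program):
    """Cycles for a single-issue in-order core to finish `program`.

    Each instruction is (dest, sources, latency). An instruction issues no
    earlier than one cycle after the previous one, and it stalls (along with
    everything behind it) until all of its sources are ready.
    """
    ready = {}   # register name -> cycle at which its value is available
    issue = 0    # earliest cycle at which the next instruction may issue
    finish = 0
    for dest, sources, latency in program:
        start = max([issue] + [ready[s] for s in sources])
        ready[dest] = start + latency
        finish = max(finish, ready[dest])
        issue = start + 1
    return finish


# Naive order: each use immediately follows its load and stalls on it.
naive = [
    ("a", [], LOAD_LATENCY),
    ("x", ["a"], ALU_LATENCY),
    ("b", [], LOAD_LATENCY),
    ("y", ["b"], ALU_LATENCY),
]

# Scheduled order: both loads are issued first so their latencies overlap.
scheduled = [
    ("a", [], LOAD_LATENCY),
    ("b", [], LOAD_LATENCY),
    ("x", ["a"], ALU_LATENCY),
    ("y", ["b"], ALU_LATENCY),
]

print(total_cycles(naive))      # 14
print(total_cycles(scheduled))  # 8
```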

Making the SPEs in-order cores made a lot of sense for their tasks.
However, the PPE being in-order is more for space/complexity constraints
than anything else. While the SPEs handle more specialized tasks, the PPE's
role in Cell is to handle all of the general purpose tasks that are not best
executed on the array of SPEs. The problem with this approach is that in
order to function as a relatively solid performing general purpose
processor, it needs a cache - and we've already explained how cache can hurt
in-order cores. If there's a weak element of the Cell architecture it's
the PPE, but then again, Cell isn't targeted at general purpose computing,
despite what some may like to spin it as.

The downsides of an in-order PPE are minimized as much as possible by making
the core only 2-issue, meaning that at best, it could execute two operations
in parallel. So, execution potential lost to in-order inefficiencies is
minimized in the sense that at least there aren't a lot of transistors wasted
on making the PPE an extremely wide core. A good compiler should be able
to make sure that both issue ports are populated as frequently as possible,
despite the fact that the microprocessor is in-order. The PPE is also
capable of working on two threads at a time, also designed to mask the
inefficiencies of an in-order core for general purpose code.

Architecturally, if anything will keep Cell from being used in a PC
environment, it's the PPE. A new Cell with a stronger PPE or an array of
PPEs could change that, however.

Manufacturing, Die Size and Clock Speed

Intel's superiority in manufacturing is responsible for the majority of
their technological advances in microprocessors over the past decade, and
it's often argued that there isn't a company around that could come close to
matching Intel's manufacturing abilities - with the exception of IBM.

The Cell prototype boasts some pretty major manufacturing specs:

- 90nm SOI manufacturing process
- 221 mm2 die area
- 234M transistors
- > 4GHz observed clock speed

When it was first announced, the chip sounded massive, but its
specifications compare extremely well to Intel's upcoming Pentium D
processor; let's take a look at its vitals:

Intel Pentium D Processor

- 90nm strained silicon manufacturing process
- 206 mm2 die area
- 230M transistors
- 2.8GHz - 3.2GHz clock speed

With a slightly larger chip and a few million transistors more, Cell is
supposed to be able to run at a minimum of 25% higher clock frequency than
Intel's forthcoming Pentium D. We'll let that sink in for a moment...
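
The comparison is easy to check with the figures from the two lists above. The sketch below takes 4.0GHz as Cell's clock (the list only says "> 4GHz", so this is a floor) and 3.2GHz as the fastest Pentium D; the dictionary keys and the helper name are ours.

```python
cell = {"die_mm2": 221, "transistors_m": 234, "clock_ghz": 4.0}
pentium_d = {"die_mm2": 206, "transistors_m": 230, "clock_ghz": 3.2}


def percent_more(a, b):
    """How much larger a is than b, in percent."""
    return (a / b - 1) * 100


for key in cell:
    print(f"{key}: Cell is {percent_more(cell[key], pentium_d[key]):.1f}% higher")
# die_mm2: Cell is 7.3% higher
# transistors_m: Cell is 1.7% higher
# clock_ghz: Cell is 25.0% higher
```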

Dynamic Logic

At first glance, a 90nm SOI Cell running at between 3 and 5GHz looks extremely
impressive. After all, the fastest 90nm CPU IBM currently produces runs at
2.5GHz, not to mention that even Intel, the king of clock speed, can't mass
produce anything faster than 3.8GHz on their 90nm process. But let's dig a
little deeper.

The Pentium 4 has two ALUs that run at twice its internal clock speed - so
in the case of a Pentium 4 660, that means that two of the more frequently
used execution units operate at 7.2GHz - on a 90nm process. So, it's
possible to get circuits to run at higher clock speeds, even in the 3 - 5GHz
range, on current 90nm processes - it just takes a little bit of creative
logic design.

It's been confirmed that Cell is using some sort of dynamic logic as opposed
to static CMOS in order to control transistor counts and improve operating
frequencies. Intel uses a number of specialized logic techniques in their
double-pumped ALUs to reach their 7GHz+ operating frequencies, and Intel has
discussed techniques that are similar to the dynamic logic used in Cell.

The diagram on the right of a "sleep transistor" should look very familiar
by the end of this article

To understand how dynamic logic works, you have to understand how
transistors are implemented on a chip.

Transistors and You

Just about any AnandTech reader who has followed our CPU articles has heard
us count transistors before, but understanding how transistors work is quite
critical to understanding how IBM can talk about 3 - 5GHz clock speeds for
Cell.

We'll spare you the details about how transistors are made and the physics
behind them in an attempt to keep this section as brief, but as informative,
as possible. It's quite common to refer to a transistor as a "switch" much
like a light switch, so how does a transistor function like a switch?
Below, we have a representation of a p-type transistor:

There are three points labeled on the transistor: the drain (D), gate (G),
and source (S). In a microprocessor, you generally have the source
connected to Vcc (think of Vcc as a line carrying the CPU's core voltage),
and the drain connected to ground - often times indirectly (e.g. 10
transistors will be connected in series, with the top-most source connecting
to Vcc and the bottom-most drain connecting to ground).

The input to the gate of the p-type transistor is what makes it function as
a switch. If you apply the right voltage to the gate, thus making it a
logical "1" or high, current doesn't flow in the transistor. If you don't
apply any voltage to the gate, current can flow. Just like a light switch,
flip it one way and the light turns on; flip it another and you're in the
dark.

There's another type of transistor that we'll be talking about here: the
n-type transistor:

The three points on the transistor are the same, but the drain and source
switch places. The functionality of the n-type transistor is different as
well. Here, if you apply the appropriate voltage to the gate, a current
will flow; if you apply no voltage, no current will flow through the
transistor.

CMOS circuits work by using pairs of n- and p-type transistors (that's where
the Complementary element of CMOS comes from). CMOS circuits are by far
the most predominant in modern day microprocessors, but as you will soon
see, that doesn't mean that they are without flaws.

Understanding Gates

The fundamental building blocks of any microprocessor are gates. Gates are
collections of transistors that electrically mimic a particular logic
function. For example, a 2-input AND gate will take two input signals and
output a 1 only if the two inputs are both 1s. An XOR gate will output a 1
only if the two inputs are different. A NOR gate will output a 1 only if
all inputs are 0s.
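
The three gates just described can be written down as a truth table. The following is a purely logical sketch in Python (the function names are ours) and says nothing yet about how the gates are built from transistors.

```python
from itertools import product


def and_gate(a, b):
    return a & b


def xor_gate(a, b):
    return a ^ b


def nor_gate(*inputs):
    # 1 only if every input is 0; works for any number of inputs.
    return int(not any(inputs))


print("A B | AND XOR NOR")
for a, b in product((0, 1), repeat=2):
    print(f"{a} {b} |  {and_gate(a, b)}   {xor_gate(a, b)}   {nor_gate(a, b)}")
```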

Combinations of these gates are used to implement everything in a
microprocessor, including functional units like adders, multipliers, etc.

Here, we have a 1-bit carry adder implemented using logic gates. It will
add any two 1-bit numbers and produce a result.
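
As a rough sketch of how such an adder is assembled from gates, here is a conventional 1-bit full adder (two XORs, two ANDs and an OR) and a chain of them that adds wider numbers. We are assuming the adder in question has the usual carry input and carry output; the function names are ours.

```python
def full_adder(a, b, carry_in):
    """Add two bits plus an incoming carry; return (sum_bit, carry_out)."""
    partial = a ^ b                             # XOR gate
    total = partial ^ carry_in                  # XOR gate
    carry_out = (a & b) | (partial & carry_in)  # two AND gates and an OR gate
    return total, carry_out


def ripple_add(x, y, width):
    """Add two unsigned integers by chaining `width` 1-bit adders."""
    result, carry = 0, 0
    for bit in range(width):
        s, carry = full_adder((x >> bit) & 1, (y >> bit) & 1, carry)
        result |= s << bit
    return result, carry


print(full_adder(1, 1, 0))   # (0, 1)
print(ripple_add(5, 3, 4))   # (8, 0)
print(ripple_add(15, 1, 4))  # (0, 1), the 4-bit result overflows
```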

However, there are many ways to implement each gate, as long as the behavior
of the gate remains the same independent of the implementation. It's just
like doing a math problem; there are multiple ways to find the solution -
some just may be more efficient than others.

A very popular way of designing gate logic is using what is known as static
CMOS. Static CMOS designs are relatively easy to implement and there are
tons of libraries available for automated (e.g. computer driven) static CMOS
design. There are a couple of problems with static CMOS design:
1.. Static CMOS circuits aren't the fastest circuits possible, which is
why they aren't used in things like Intel's double-pumped ALUs where high
clock speeds are necessary.
2.. Static CMOS designs use quite a few transistors. For each m-input
gate, you need 2 * m transistors (m PMOS and m NMOS transistors), which
drives transistor counts up considerably for high fan-in gates (gates with
lots of inputs). As is the case with any heavily SIMD architecture, high
fan-in gates are commonplace.

Let's take a look at a static CMOS NOR gate:

First thing to note is that for every NMOS transistor we add, there's a
complementary PMOS transistor. With each additional input to the NOR gate,
we have to add two transistors - one PMOS and one NMOS - hence the 2*m
transistors from before.

There's something else to note here - the NOR gate isn't clocked. Normally, large
collections of gates are assembled and put behind an element called a latch,
which is clocked. One type of large collection would be all of the
circuitry used for a pipeline stage. This isn't really a problem for
static CMOS gates, as it greatly simplifies the distribution of clocks to
the chip (since you don't have to route a clock signal to every gate, just
every latch, and there are far more general gates than there are latches).

Designing and implementing static CMOS gates is extremely easy. Hardware
Description Languages (HDLs), programming languages in which chips are
"written", have widespread static CMOS libraries, meaning that a chip
designer can focus on writing code to crank out a chip without having to
hand design its circuits. But as success is usually proportional to
difficulty, static CMOS designs aren't the fastest things in the world.
Things like Intel's 7.2GHz ALUs aren't designed using static CMOS logic,
and neither is Cell.

Cell's Dynamic Logic

Although it's beyond the scope of this article, one of the major problems
with static CMOS circuits is the p-type transistors, and the fact that for
every n-type transistor, you also must use a p-type transistor.

There is an alternative known as dynamic or pseudo-NMOS logic, which gets
around the problems of static CMOS while achieving the same functionality.
Let's take a look at that static CMOS NOR gate again:

The two transistors at the top of the diagram are p-type transistors. When
either A or B is high (i.e. has a logical 1 value), the corresponding p-type
transistor's gate remains open, with no current flowing. In that case, the
output of the circuit is ground, or 0, since the complementary n-type
transistors at the bottom function oppositely from their p-type counterparts
(i.e. current can flow when the input is high).

Thus, the NOR gate outputs a 1 only if all inputs are 0, which is exactly
how a NOR gate should function.
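
The same behavior can be modeled at the switch level. In a static CMOS NOR gate, the p-type transistors sit in series between Vcc and the output, and the n-type transistors sit in parallel between the output and ground. The Python below is a simplified sketch of that arrangement (the function names are ours), treating each transistor as an ideal switch.

```python
def p_type_conducts(gate_input):
    # A p-type transistor conducts when its gate is low (0).
    return gate_input == 0


def n_type_conducts(gate_input):
    # An n-type transistor conducts when its gate is high (1).
    return gate_input == 1


def static_cmos_nor(*inputs):
    # Series p-types: the path to Vcc exists only if ALL of them conduct.
    pull_up = all(p_type_conducts(x) for x in inputs)
    # Parallel n-types: the path to ground exists if ANY of them conducts.
    pull_down = any(n_type_conducts(x) for x in inputs)
    # The two networks are complementary: exactly one path is ever closed.
    assert pull_up != pull_down
    return 1 if pull_up else 0


for a in (0, 1):
    for b in (0, 1):
        print(a, b, static_cmos_nor(a, b))
```

Each input drives one p-type and one n-type transistor, which is where the 2 * m transistor count comes from.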

Now, let's take a look at a pseudo-NMOS implementation of the same NOR gate:

There are a few things to notice here. First and foremost, the clock signal
is tied to two transistors (a p-type at the top, and an n-type at the
bottom), whereas there was no clock signal fed directly to the NOR gate in
our static CMOS example.

Cell's implementation goes one step further. The p-type transistor at the
top of the circuit and the n-type transistor at the bottom are clocked on
non-overlapping phases, meaning that the two clocks aren't high/low at the
same time.

The way in which the gate here works is as follows: inputs are first applied
to the logic in between the clock fed transistors. The top transistor's
gate is closed allowing the logic transistors to charge up. The gate is
then opened and the lower transistor's gate is closed to drain the logic
transistors to ground. The charge that remains is the output of the
gate.

What's important about this is that since power is only consumed during two
non-overlapping phases, overall power consumption is lower than static CMOS.
The downside is that clock signal routing becomes much more difficult.
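
The two-phase operation described above can be sketched as a small state machine. This is a simplified behavioral model in Python, not a description of Cell's actual circuits: the class and method names are ours, charge is reduced to a 0 or 1 on the output node, and effects such as leakage and timing are ignored.

```python
class ClockedNor:
    """Behavioral model of a clocked (precharge/evaluate) NOR gate."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.node = 0  # charge on the output node, 0 or 1

    def precharge(self):
        # Phase 1: the clocked p-type at the top conducts and charges the node.
        self.node = 1

    def evaluate(self, *inputs):
        # Phase 2: the clocked n-type at the bottom conducts. If any n-type
        # logic transistor is switched on, the node drains to ground.
        if len(inputs) != self.num_inputs:
            raise ValueError("wrong number of inputs")
        if any(inputs):
            self.node = 0
        return self.node  # whatever charge remains is the output

    def cycle(self, *inputs):
        self.precharge()
        return self.evaluate(*inputs)


gate = ClockedNor(2)
for a in (0, 1):
    for b in (0, 1):
        print(a, b, gate.cycle(a, b))
```

The logic itself needs only the n-type transistors; the two clocked transistors replace the whole p-type network of the static CMOS version.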

The other benefit is lower transistor count. In the example of the 2-input
NOR gate, our static CMOS design used 4 transistors, while our pseudo-NMOS
implementation used 4 transistors as well. But for a 3-input NOR gate, the
static CMOS implementation requires 6 transistors, while the pseudo-NMOS
implementation requires 5. The reasoning is that for a CMOS circuit, you
have 1 p-type transistor for every n-type, while in a pseudo-NMOS circuit
you only have two additional transistors beyond the bare minimum required to
implement the logic function. For a 100-input NOR gate (unrealistic, but a
good example), a static CMOS implementation would require 200 transistors,
while a pseudo-NMOS implementation would only require 102.
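
The counts above follow two simple formulas: 2 * m transistors for an m-input static CMOS gate, and m + 2 for the clocked design (m logic transistors plus the two clock-fed ones). A short Python check, with function names of our own choosing:

```python
def static_cmos_transistors(m):
    # One p-type and one n-type transistor per input.
    return 2 * m


def clocked_gate_transistors(m):
    # One n-type logic transistor per input, plus the two clocked transistors.
    return m + 2


print("inputs  static CMOS  clocked")
for m in (2, 3, 100):
    print(f"{m:>6}  {static_cmos_transistors(m):>11}  {clocked_gate_transistors(m):>7}")
```

The two designs tie at 2 inputs, and the savings approach 50% as the fan-in grows.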

By making more efficient use of transistors and lowering power consumption,
Cell's pseudo-NMOS logic design enables higher clock frequencies. The
added cost is in the manufacturing and design stages:
1.. As we mentioned before, clock routing becomes increasingly difficult
with pseudo-NMOS designs similar to that used in Cell. The clock trees
required for Cell are probably fairly complex, but given IBM's expertise in
the field, it's not an insurmountable problem.
2.. Designing pseudo-NMOS logic isn't easy, and there are no widely
available libraries from which to pull circuit designs. Once again, given
IBM's size and expertise, this isn't much of an issue, but it does act as a
barrier for entry of smaller chip manufacturers.
3.. Manufacturing such high speed dynamic logic circuits often requires
techniques like SOI, but once again, not a problem for IBM given that they
have been working on SOI for quite some time now. There's no surprise that
Cell is manufactured on a 90nm SOI process.

On the gate level, just like at the logic level, Cell is architected for
high speeds and efficient use of transistors.

Blueprint for a High Performance per Transistor CPU

Given that Cell was designed with a high performance per transistor metric
in mind, its architecture does serve as somewhat of a blueprint for the
technologies that result in the biggest performance gains, at the lowest
transistor counts. Now that we've gone through a lot of the Cell
architecture, let's take a look back at what some of those architectural
decisions are:

1. On-die memory controller

We've seen this with the Athlon 64, but an on-die memory controller appears
to be one of the best ways to improve overall performance, at minimal
transistor expenditure. Furthermore, we also see the use of Rambus' XDR
memory instead of conventional DDR, as the memory of choice for Cell. High
frequencies and high bandwidth are what Cell thrives on, and for that,
there's no substitute for Rambus' technology.

2. SMT

On-die multithreading has also been proven to be a good way of extracting
performance at minimal transistor impact. Introducing Hyper Threading to
the Pentium 4's core required a die increase of less than 5%, just to give
you an idea of the scale of things. The performance benefits to SMT will
obviously vary depending on the architecture of the CPU. In the case of
the Pentium 4, performance gains ranged from 0 - 20%. In the case of the
in-order PPE core of Cell, the performance gains could be even more.
Needless to say, if implemented well, and if proper OS/software support is
there, SMT is a feature that makes sense and doesn't strain the transistor
budget.

3. Simpler, in-order, narrow-issue core - but lots of them

This next design decision is more controversial than the first two, simply
because it goes against the design strategies of most current generation
desktop microprocessors that we're familiar with. By making the PPE and
SPEs 2-issue only, each individual core still remains a manageable size.
Narrower cores obviously sacrifice the ability to extract ILP, but doing so
allows you to cram more cores onto a single die - highlighting the ILP for
TLP sacrifice that the Cell architects have made.

Getting rid of the additional logic and windows needed for an out-of-order
core helps further reduce transistor count, but at the expense of making
sure that you have a solid compiler and/or developers that are willing to
deal with more of the architecture's intricacies to achieve good
performance.

Looking at Intel's roadmap for Platform 2015, the type of microprocessors
that they're talking about are eerily Cell-like - a handful of strong
general purpose cores surrounded by smaller cores, some of which are more
specialized hardware.

However, the time frame that Intel is talking about to introduce those
Cell-like architectures is much further away than today; the first Cell-like
architectures don't appear on roadmaps until the 2009 - 2015 range.

In the past, when Intel dictated a major architectural shift, it didn't
happen until they said so. It is yet to be seen if Cell is an exception to
the rule or if it will be another architecture well before its time.

Final Words

Concluding anything about Cell requires a multifaceted look at the
architecture and the platform as a whole.

First from the perspective of the game industry, more specifically
Playstation 3:

Cell's architecture is similar to the next version of Microsoft's Xbox and
upcoming PC microprocessors in that it is heavily multithreaded. The next
Xbox will execute between 3 and 6 threads simultaneously, while desktop PC
microprocessors will execute between 2 and 4. The problem is that while Xbox
2/360/Next and the PC will be using multiple general purpose cores, Cell
relies on more specialized hardware to achieve its peak performance. Cell's
SPEs being Altivec/VMX derived is a benefit, which should mean that the
ISA is more familiar to developers working on any POWER based architecture,
but the approach to development on Cell vs. development on the PC will
literally be on opposite ends of the spectrum, with the new Xbox somewhere
in between.

The problem here is that big game development houses often develop and
optimize for the least common denominator when it comes to hardware, and
offer ports with minor improvements to other platforms. Given Cell's
architecture, it hardly looks like a suitable "base" platform to develop
for. We'd venture to say that a game developed for and ported from the PC
or Xbox Next would be under-utilizing Cell's performance potential unless
significant code re-write time was spent.

Console-only development houses, especially those with close ties to Sony,
may find themselves able to harness the power of Cell much more efficiently
than developers who subscribe to the write-once, port-many process of
cross-platform development. Given EA's recent acquisition and
licensing-spree, this is a very valid concern.

With Cell, Sony has effectively traded hardware complexity for programmer
burden, but if anyone is willing to bear the burden of a complicated
architecture, it is a game developer. The problem grows in complexity once
you start factoring in porting to multiple platforms in a timely manner
while still attempting to achieve maximum performance.

As a potential contender in the PC market, Cell has a very tall ladder to
climb before even remotely appearing on the AMD/Intel radars. The biggest
strength that the x86 market has is backwards compatibility, which is the
main thing that has kept alternative ISAs out of the PC business.
Regardless of how much hype is drummed up around Cell, the processor is not
immune to the same laws of other contenders in the x86 market - a compatible
ISA is a must. And as Intel's Justin Rattner put it, "if there are good
ideas in that architecture, PC architecture is very valuable and it will
move to incorporate those ideas."

Once again, what's most intriguing is the similarity, at a high level, of
Intel's far future multi-core designs to Cell today. The main difference
is that while Intel's Cell-like designs will be built on 32nm or smaller
processes, Cell is being introduced at 90nm - meaning that Intel is
envisioning many more complex cores on a single die than Cell. Intel can
make that kind of migration to a Cell-like design because their
microprocessors already have a very large user base. IBM, Sony and Toshiba
can't however - Cell must achieve a very large user base initially in order
to be competitive down the road. Unfortunately, seeing a future for Cell
far outside of Playstation 3 and Sony/Toshiba CE devices is difficult at

The first thing you have to keep in mind is that Cell's architecture is
nothing revolutionary, it's been done before. TI's MVP 320C8X is a
multi-processor DSP that sounds a lot like Cell:
So, while Cell is the best mass-market attempt at a design approach that has
been tried before, it doesn't have history on its side for success beyond a
limited number of applications.

Regardless of what gaming platform you're talking about, Cell's ability to
offer an array of cores to handle sophisticated physics and AI processing is
the future. AGEIA's announcement of the PhysX PPU (and the fact that it's
been given the "thumbs up" by Ubisoft and Epic Games) lends further
credibility to Cell's feasibility as a high performance gaming CPU.

The need for more realistic physics environments and AI in games is no
illusion; the question is will Intel's forthcoming dual and multi-core CPUs
(with further optimized SIMD units) offer enough parallelism and performance
for game developers, or will the PPU bring Cell-like architecture to the
desktop PC well ahead of schedule? The answer to that question could very
well shape the future of desktop PCs even more so than the advent of the

Alex Johnson

Thanks. We can click on links here. Don't waste our space by posting
the entire contents of web sites.



Alex Johnson said:
Thanks. We can click on links here. Don't waste our space by posting the
entire contents of web sites.


58 KB slowing down the computer?

Arthur Hagen

W????n said:
58 KB slowing down the computer?

58 kB to a bunch of newsgroups, sent to tens of thousands news servers
around the globe, and downloaded by millions -- many of whom pay per bit.
Multiply the 58 kB with the number of total feeds, sucks and downloads, and
you've wasted a LOT of bandwidth. Usenet ain't one-to-one email.

No regards,

David Magda

Arthur Hagen said:
58 kB to a bunch of newsgroups, sent to tens of thousands news
servers around the globe, and downloaded by millions -- many of
whom pay per bit. Multiply the 58 kB with the number of total
feeds, sucks and downloads, and you've wasted a LOT of bandwidth.
Usenet ain't one-to-one email.

Don't necessarily disagree with you, but it's a molecule in the
bucket when compared to alt.binaries.* :)

David Schwartz

Don't necessarily disagree with you, but it's a molecule in the
bucket when compared to alt.binaries.* :)

Right, but people who care about bandwidth don't carry those groups.


David Schwartz

58 KB slowing down the computer?
58 kB to a bunch of newsgroups [...]
The point of cross-posting is to send to multiple newsgroups without
sending multiple copies, so this is a red herring.

-- Richard

Not really. The more newsgroups a post is sent to, the more systems the
post will wind up on, so the size is magnified. Most systems (both servers
and clients) only subscribe to a subset of all the available newsgroups.


Arthur Hagen

David Magda said:
Don't necessarily disagree with you, but it's a molecule in the
bucket when compared to alt.binaries.* :)

Thing is, people *expect* it there, and can choose not to feed, suck or
download *.binaries.*



God-damned nym-shifting spammer. *PLONK* again.

Use a better filter.
Since this particular crossposting moron always has "xbox" somewhere in
his name, just flag it on Author: {xbox} and set it to ignore thread.
Presto every thread started by anyone with "xbox" in their name gets

You're using agent so you can do that.

