AMD R600 Architecture and GPU Analysis (long read)

  • Thread starter: RadeonR600

RadeonR600

this is only like HALF the article: the entire thing is here
http://www.beyond3d.com/content/reviews/16/1



AMD R600 Architecture and GPU Analysis

Published on 14th May 2007, written by Rys for Consumer Graphics -
Last updated: 14th May 2007

Introduction


Given everything surrounding the current graphics world at the time of
writing -- with big highlights that include the recent AMD/ATI merger,
the introduction of a new programming shading model via DirectX,
NVIDIA's introduction of G80, real-time graphics hardware in the new
generation of consoles, and Intel's intent to come back to discrete --
the speculation and anticipation for AMD's next generation of Radeon
hardware has reached levels never seen before. 4 years in the making
by a team of some 300 engineers, the chip takes the best bits of R5
and Xenos, along with new technology, to create their next
architecture.

How it performs, and how it slides in to the big graphics picture,
means the base architecture and its derivative implementations will
have an impact on the industry that will be felt a long time from
launch day. If you're not excited about what we're about to explain
and go over in the following pages, you haven't been paying attention
to the state of the GPU union over the last year and a half, since we
live in the most exciting graphics-related times since Voodoo Graphics
blazed its real-time, mass-market consumer trail.

The engineers at the new AMD Graphics Products Group have been
beavering away for the last few years on what they call their 2nd
generation unified shader architecture. Based in part on what you can
find today in the Xenos GPU inside the Xbox 360 console, AMD's D3D10
compliant hardware has been a long time coming. Obviously delayed and
with product family teething troubles, R600, RV610 and RV630 -- the
first implementations of the new architecture -- break cover today for
the first time, at least officially!

We'll let you know the architecture basics first, before diving in for
closer looks at some of the bigger things the architecture and the
implementing GPUs do. As with our G80 analysis, we split things in to
three, covering architecture in this piece, before looking at image
quality and performance in subsequent articles, to divide things up
into manageable chunks for us to create and you to consume.

AMD have embargoed performance analysis of RV610 and RV630 until next
month, but we're allowed to talk about those GPUs in terms of
architecture and their board-level implementations, so we'll do that
later today, after our look at R600, the father of the family. Tank
once said, "Damn, it's a very exciting time". Too true, Tank, too
true. He also said shortly afterwards, "We got a lot to do, let's get
to it". We'll heed his sage advice.


The Chip - R600

R600 is the father of the family, outgunning G80 as the biggest piece
of mass market PC silicon ever created, in terms of its transistor
count. It's not feature identical to the other variations, either, so
we'll cover the differences as we get to them.

ATI R600 details
Foundry and process     80nm @ TSMC, 720M transistors
Die Size                420mm² (20mm x 21mm)
Chip Package            Flipchip
Basic Pipeline Config   16 / 16 / 32 (Textures / Pixels / Z)
Memory Config           512-bit (8 x 64-bit)
API Compliance          DX10.0
System Interconnect     PCI Express x16
Display Pipeline        Dual dual-link DVI, HDCP, HDMI

TSMC are AMD's foundry partner for this round of graphics processors
once again, with R600 built on their 80HS node at 80nm. 720M
transistors comprise the huge 20x21mm die, which contains all of the
logic, including display and signal I/O. R600 is an implementation of
AMD's 2nd generation unified shading architecture, fully threaded for
computation, data sampling and filtering, and supporting Shader Model
4.0 as set out by Direct3D 10. R600 also sports a hardware
tessellation unit for programmable surface subdivision and certain
high order surfaces. The unit sits outside of any major existing API,
although it's programmable using them.

R600 sports a 512-bit external memory bus, interfacing with an
internal, bi-directional 1024-bit ring bus memory controller, with
support for dozens of internal memory clients and GDDR3 or GDDR4
memories for the external store. Sticking with memory, the R600
architecture is cache-heavy internally, with SRAM logic a significant
portion of the die area. The external bus interface to the PC host is
PCI Express, with access to that coming via a dedicated stop on the
internal rings.


R600 sports AMD's next generation video decoding core, called the
Unified Video Decoder or UVD for short. The UVD is designed to
handle full H.264 AVC decode processing offload at maximum bitrates
for both Blu-Ray and HD-DVD video at maximum resolution. In terms of
power management, the chip supports clock throttling, voltage adjust
for p-states and entire unit shutdown depending on workload, combined
by marketing under the umbrella of PowerPlay 7.

We'll cover all of those things and more as the article(s) progress.
The initial R600-based SKU is called Radeon HD 2900 XT, so we'll take
a look at the reference board before we move on to the architecture
discussion for the chip that powers it. We'll cover RV610 and RV630
separately later today, as mentioned.

The Radeon HD 2900 XT Reference Board

The only R600-based launch product is Radeon HD 2900 XT. To signal the
intent of the products in terms of how they're able to process screen
pixels, be that in motion video or real-time 3D graphics, AMD are
dropping the X from the product family name in favour of HD. The familiar
XT, XTX (maybe) and PRO monikers for inter-family designation will
stay, however. Radeon HD 2900 XT sports two clock profiles, the 3D one
seeing R600 clocked at 742MHz (the chip has a single master clock
domain, across many slightly asynchronous subdomains), with 512MiB of
GDDR3 memory clocked at 825MHz. The 2D profile sees things drop to
507MHz for the chip, with the memory run at 514MHz. Windowed
applications are run in the 2D clock profile.

ATI Radeon HD 2900 XT 512MiB Details
Board Name              Radeon HD 2900 XT
Memory Quantity         512MiB
Chip                    ATI R600
Core Frequency          742MHz
Memory Frequency        825MHz

Theoretical Performance @ 742/825
Pixel Fillrate          11872 Mpixels/sec
Texture Fillrate        11872 Mtexels/sec
Z sample rate           23744 Msamples/sec
AA sample rate          47488 Msamples/sec
Geometry rate           742 Mtris/sec
Memory Bandwidth        105.6 GB/sec

The table above will start to give some of the story away if you're
willing to take the base clock and do some basic arithmetic. By virtue
of the 512-bit external memory bus and Hynix HY5RS573225A FP-1 DRAM
devices clocked at 825MHz (1.65GHz effective rate), HD 2900 XT has a
peak theoretical bandwidth of 105.6GB/sec. One hundred and five point
six, for those still blinking at the figure in numerical form. And
while AMD aren't the first company to deliver a single board consumer
solution with a peak bandwidth higher than 100GB/sec, given NVIDIA's
soft launch of GeForce 8800 Ultra and its 103.6GB/sec peak via a
384-bit bus, they're definitely the first to try a 512-bit external
bus on a consumer product.
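
For anyone who wants to do that basic arithmetic, here is a minimal
Python sketch that rebuilds the peak figures from the clocks and unit
counts quoted above. The function and constant names are made up for
illustration, and GB is taken to mean 10^9 bytes. The table's AA sample
rate is simply twice the Z rate; the article doesn't say why at this
point, so the sketch leaves it out.

```python
# Peak theoretical figures for Radeon HD 2900 XT, rebuilt from the
# numbers quoted in the tables. All names here are illustrative.

CORE_MHZ = 742       # 3D profile core clock
MEM_MHZ = 825        # GDDR3 clock; double data rate gives 1.65GHz effective
BUS_BITS = 512       # external memory bus width

TEXTURES_PER_CLOCK = 16   # "16 / 16 / 32" = Textures / Pixels / Z
PIXELS_PER_CLOCK = 16
Z_PER_CLOCK = 32


def bandwidth_gb_per_sec(bus_bits, mem_mhz, transfers_per_clock=2):
    """Peak bandwidth in GB/sec (1 GB = 10^9 bytes)."""
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * mem_mhz * transfers_per_clock / 1000


def rate_millions_per_sec(units_per_clock, core_mhz):
    """Peak rate in millions of items per second."""
    return units_per_clock * core_mhz


if __name__ == "__main__":
    print("Bandwidth:", bandwidth_gb_per_sec(BUS_BITS, MEM_MHZ), "GB/sec")
    print("Pixel fill:", rate_millions_per_sec(PIXELS_PER_CLOCK, CORE_MHZ), "Mpixels/sec")
    print("Texel fill:", rate_millions_per_sec(TEXTURES_PER_CLOCK, CORE_MHZ), "Mtexels/sec")
    print("Z rate:", rate_millions_per_sec(Z_PER_CLOCK, CORE_MHZ), "Msamples/sec")
```

The same bandwidth formula can be pointed at any other bus width and
memory clock to compare boards on paper.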



Retail editions of the board are barely larger than AMD's outgoing
Radeon X1950 XTX, PCB size wise, although the cooler is a good bit
beefier. While it looks like your average double slot cooler for a
high end board, replete with blower fan and rear exit for the
exchanged heat, the mass of the sink attached to the board and the
components it's cooling is significant. In fact, the board is the
heaviest 'reference' hardware yet created, with reference used to
define the cooler that the IHV officially specifies.

It's a somewhat sad fact, for AMD and the consumer, that TSMC's 80HS
process has pretty horrible static power leakage properties. Given the
process properties, the die area, clocks for GPU and mem, and the
memories themselves, it's no honest surprise to see the fastest launch
board based on R600 coming with such a cooler. It's also no real
surprise to see it come with a pair of external power input
connectors, and one of those is the new 8-pin variant as well. That
version of the connector gives the board another +12V input source at
just over 6A max current.



The board doesn't require you to fill that connector block up with
power-giving pins, though, since it runs happily with a regular pair
of 6-pin inputs. Look closely and you'll see the 8-pin block holds the
6-pin block, and there's only one orientation that can make it happen,
so you don't connect the wrong pins and have things go horribly wrong.

A pair of dual-link DVI connectors dominates the backplane space that
the first slot occupies, with the grille for the heat output
dominating the second. The DVI ports (although it's unclear if it's
just one of them or both) support HDMI output for video and audio via
a supplied active convertor. It appears that audio is either muxed
into the video bitstream pushed out over the pins, or the spare pins
on the dual-link connector (the version of HDMI that the hardware
supports is analogous to single-link DVI for the video portion) are
used for that, but it's an implementation detail at best.

Physicals



In terms of the board's physical properties when being used in anger,
we've run our test sample just fine using a PSU that provides only
6-pin connectors. Calibrating against other boards on the same test
platform, we estimate (because we're not 100% certain about other
boards' peak power either) peak load power from the board, at least
using our load condition of 3DMark05's GT3 @ 1920x1200 w/4xAA/16xAF,
of less than 200W at stock frequencies. We haven't tried overclocking
enough to be sure of a peak power draw using the Overdrive feature.

Using AMD's GPU overclocking tool, which is able to read the GPU's
thermal diode, load temperatures for the GPU approach 90°C with our
sample. Under the load condition generating that heat, the fan speed
is such that board volume is higher than any of the other boards on
test, and the pitch of the noise is annoyingly high with a
whistling property to it, at least in our test system and to our ears.
Added to that, the speed steppings of the fan make pitch changes very
noticeable, and the board constantly alters fan speed in our test
system, even just drawing the (admittedly Vista Aero Glass) desktop.
So there's no static 2D fan speed that we can see, and the overall
noise profile of the cooling solution is subjectively much worse than
Radeon X1950 XTX, NVIDIA GeForce 8800 GTX or GeForce 8800 GTS.



So while the board's form factor is pleasing in the face of NVIDIA's
highest-end products, simply because it's not as long and will
therefore fit into more cases than its competitor, the cooler is a
disappointment in terms of noise. It certainly seems effective when
dealing with heat, though, and we're sure that was AMD's primary
concern when engineering it. It's possible that things can get better
for the product here without any hardware changes, via software. The
fan's got more variable speed control than the driver seems to make
use of, and a more gradual stepping function to change fan speed based
on GPU load and temperature can likely be introduced. Here's hoping!

Time to check out the test system that we used for our architecture
and performance investigations.

AMD R600 Overview

We use an image display overlay for image enlargement at Beyond3D (if
your browser supports it), but it might be worth having the
enlargement open in a new browser window or a new tab, so you can refer
to it as the article goes by. The image represents a (fairly heavily
in places) simplified overview of R600. Datapaths aren't complete by
any means, but they serve to show the usual flow of data from the
front of the chip to its back end when final pixels are output to the
framebuffer and drawn on your screen. Hopefully all major processing
blocks and on-chip memories are shown.

R600 is a unified, fully-threaded, self load-balancing shading
architecture, that complies with and exceeds the specification for
DirectX Shader Model 4.0. The major design goals of the chip are high
ALU throughput and maximum latency hiding, achieved via the shader
core, the threading model and distributed memory accesses via the
chip's memory controller. A quick glance at the architecture,
comparing it to their previous generation flagship most of all, shows
that the emphasis is on the shader core and maximising available
memory bandwidth for 3D rendering (and non-3D applications too).


The shader core features ALUs that are single precision and IEEE754
compliant in terms of rounding and precision for all math ops, with
integer processing ability combined. Not all R600 SPUs are created
equal, with a 5th more able ALU per SPU group that handles special
function and some extra integer processing ops. R600, and the other
GPUs in the same architecture family, also sports a programmable
tessellation unit, very similar to the one found in the Xbox 360. While
DirectX doesn't support it in any of its render stages, it's
nonetheless programmable using that API with minimal extra code. The
timeframe for that happening is unclear, though.

That's the basics of the processor, so we'll start to look at the
details, starting with the front end of the chip where the action
starts.

Shader Core

Ah, the processing guts of the beast. With four such clusters, R600
sports a full 320 separate processing ALUs dedicated to shading, with
the units arranged as follows. Each cluster contains sixteen shader
units, each containing five sub scalar ALUs that perform the actual
shading ops. Each ALU can run a separate op per clock, R600 exploiting
instruction-level parallelism via VLIW. Having the compiler/assembler
do a bunch of the heavy lifting in terms of instruction order and
packing arguably reduces overall efficiency compared to something
like G80's scalar architecture that can run full speed with dependent
scalar ops. Not all the ALUs are equal, either, with the 5th in the
group able to do a bit more than the other four, and independently of
those too.
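
As a quick sanity check on those unit counts, here is a small Python
sketch. The ALU total follows directly from the arrangement described
above. The peak floating point figure is my own back-of-envelope
number rather than one quoted in the article: it assumes all five ALUs
in every shader unit retire one MAD per clock at the 742MHz 3D clock,
and that a MAD counts as two floating point ops.

```python
# Shader core arithmetic. The unit counts come from the article; the
# flops-per-MAD convention and the names are assumptions.

CLUSTERS = 4
SHADER_UNITS_PER_CLUSTER = 16
ALUS_PER_SHADER_UNIT = 5
CORE_MHZ = 742


def total_alus():
    """Number of scalar ALUs in the shader core."""
    return CLUSTERS * SHADER_UNITS_PER_CLUSTER * ALUS_PER_SHADER_UNIT


def peak_gflops(alus, core_mhz, flops_per_alu_per_clock=2):
    """Peak GFLOPS, assuming every ALU retires one MAD (2 flops) per clock."""
    return alus * flops_per_alu_per_clock * core_mhz / 1000


if __name__ == "__main__":
    alus = total_alus()
    print(alus, "ALUs")
    print(peak_gflops(alus, CORE_MHZ), "GFLOPS peak (MAD only)")
```

That number is only reachable when the compiler fills every slot on
every clock, which is exactly the catch with VLIW.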

Architecture Summary

Well well, graphics fans, it's finally here! Years in the making for
AMD, via the hands of 300 or so engineers, hundreds of millions of
dollars in expenditure, and unfathomable engineering experience from
the contributing design teams at AMD, R600 finally officially breaks
cover. We've been thinking about the architecture and GPU
implementations for nearly a year now in a serious fashion, piecing
together the first batches of information sieved from yon GPU
information stream. As graphics enthusiasts, it's been a great
experience to finally get our hands on it and put it through the mill
of an arch analysis, after all those brain cycles spent thinking about
it before samples were plugged in and drivers installed.



So what do we think, after our initial fumblings with the shader core,
texture filter hardware and ROPs? Well, arguably the most interesting
bits and pieces that the GPU and the boards that hold it provide are
things we've not been able to look at, either for time reasons,
resource reasons, or because they simply fall outside this article's
remit! That's not to say
things like the UVD, HDMI implementation and the tessellator overshadow
the rest of the chip and architecture, but they're significant
possible selling points that'll have to await our judgement a little
while longer.

What remains is a pretty slick engineering effort from the guys and
gals at AMD's Graphics Products Group, via its birth at the former
ATI. What you have is evolution rather than revolution in the shader
core, AMD taking the last steps to fully superscalar with independent
5-way ALU blocks and a register file with seemingly no real-world
penalty for scalar access. That's backed up by sampler hardware with
new abilities and formats supported to chew on, with good throughput
for common multi-channel formats. Both the threaded sampler and
shader blocks are fed and watered by an evolution of their ring-bus
memory controller. We've sadly not been able to go into too much
detail on the MC, but mad props to AMD for building a 1024-bit
bi-directional bus internally, fed by a 16-piece DRAM setup on the
512-bit external bus.

Who said the main IHVs would never go to 512? AMD have built that
controller in the same area as the old one (whoa, although that's
helped by the process change), too. Using stacked pads and an increase
in wire density, affording them the use of slower memory (which is
more efficient due to clock delays when running at higher speeds),
R600 in HD 2900 XT form gets to sucking over 100GB/sec peak
theoretical bandwidth from the memories. That's worth a tip of an
engineer's hat any day of the week.

Then we come to the ROP hardware, designed for high performance AA
with high precision surface formats, at high resolution, with an
increase in the basic MSAA ability to 8x. It's here that we see the
lustre start to peel away slightly in terms of IQ and performance,
with no fast hardware resolve for tiles that aren't fully compressed,
and a first line of custom filters that can have a propensity to blur
more than not. Edge detect is honestly sweet, but the CFAA package
feels like something tacked on recently to paper over the cracks,
rather than something forward-looking (we'll end up at the point of
fully-programmable MSAA one day in all GPUs) to pair with speedy
hardware resolve and the usual base filters. AMD didn't move the game
on in terms of absolute image quality when texture filtering, either.
They're no longer leaders in the field of IQ, overtaken by
NVIDIA's GeForce 8-series hardware.

Coming back to the front of the chip, the setup stage is where we find
the tessellator. Not part of a formal DirectX spec until next time with
DX11, it exists outside of the main 3D graphics API of our time, and
we hope the ability to program it reliably comes sooner rather than
later since it's a key part of the architecture and didn't cost AMD
much area. We'll have a good look at the tessellator pretty soon,
working with AMD to delve deep into what the unit's capable of.

With a harder-to-compile-for shader core (although one with monstrous
floating point peak figures), less per-clock sampler ability for
almost all formats and channel widths, and a potential performance
bottleneck with the current ROP setup, R600 has heavy competition in
HD 2900 XT form. AMD pitch the SKU not at (or higher than) the GeForce
8800 GTX as many would have hoped, but at the $399 (and that's being
generous at the time of writing) GeForce 8800 GTS 640MiB. And that
wasn't on purpose, we reckon. If you asked ATI a year ago what they
were aiming for with R600, the answer was a simple domination over
NVIDIA at the high end, as always.

While we take it slow with our analysis -- and it's one where we've
yet to heavily visit real world game scenarios, DX10 and GPGPU
performance, video acceleration performance and quality, and the
cooler side facets like the HDMI solution -- the Beyond3D crystal ball
doesn't predict the domination that ATI would have expected a year or more
ago. Early word from colleagues at HEXUS, The Tech Report and
Hardware.fr in that respect is one of mixed early performance that's
8800 GTS-esque or thereabouts overall, but also sometimes less than
Radeon X1950 XTX in places. Our own early figures there show promise
for AMD's new graphics baby, but not everywhere.

It's been a long time since that's been something anyone's been able
to say about a leading ATI, now AMD, graphics part. We'll know a
fuller story as we move on to looking at IQ and performance a bit
closer, with satellite pieces to take in the UVD and HDMI solution and
the tessellator to come as well. However, after our look at the base
architecture, we know that R600 has to work hard for its high-quality,
high-resolution frames per second, but we also know AMD are going to
work hard to make sure it gets there. We really look forward to the
continued analysis of a sweet and sour graphics architecture in the
face of stiff competition, and we'll have image quality for you in a
day or two to keep things rolling. RV610 and RV630 details will follow
later today.

The VLIW design packs a possible 6 instructions per-clock, per-shader
unit (5 shading plus 1 branch) into the complete instructions it
issues to the shader units, and those possible instruction slots have
to match the capabilities of the hardware underneath. Each of the
first 4 sub ALUs is able to retire a finished single precision
floating point MAD (or ADD or MUL) per clock, dot product (dp, and
special cased by combining ALUs), and integer ADD. In terms of float
precision, the ALUs are 1 ULP for MAD, and 1/2 ULP for MUL and ADD.
The ALUs are split in terms of gates for float and int logic, too.
There's no 32-bit mantissa in the ALU to support both, but only one
datapath in and out of the sub-ALU, so no parallel processing there.
Denorms are clamped to 0 for both D3D9 and D3D10, but the hardware
supports inf. and NaN to IEEE754 spec.

The fifth fatter unit (let's egotistically call it the RysUnit, since
it shares my proportions compared to normal people, and I can be
'special' too) can't do dp ops, but is capable of integer division,
multiply and bit shifting, and it also takes care of transcendental
'special' functions (like sin, cos, log, pow, exp, rcp, etc), at a
rate of one retired instruction per clock (for most specials at
least). It's also responsible for float<->integer conversion. Unlike
the other units, this one is actually FP40 internally (32-bit
mantissa, 8-bit exponent). This allows for single-cycle MUL/MAD
operations on INT32 operands under D3D10, which G80 needs 4 cycles
for. It's certainly an advantage of having a VLIW architecture and
multiple kinds of units. If you didn't follow that, the following
should help.
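
One way to see why the wider mantissa matters is this Python sketch.
It assumes the four regular ALUs carry a standard single precision
significand (24 bits counting the implicit leading one), which is my
reading of the text rather than something AMD have confirmed in detail,
and it treats the fat unit's 32-bit mantissa as 32 usable bits. The
helper names are made up.

```python
import struct


def round_to_float32(x):
    """Round a Python float to the nearest IEEE754 single precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fits_exactly(value, significand_bits):
    """True if the integer can be held exactly in that many significand bits."""
    if value == 0:
        return True
    v = abs(value)
    while v % 2 == 0:   # trailing zero bits are absorbed by the exponent
        v //= 2
    return v.bit_length() <= significand_bits


if __name__ == "__main__":
    n = 2**24 + 1   # smallest positive integer that single precision can't hold
    print(n, "->", round_to_float32(float(n)))
    print("fits in 24 bits:", fits_exactly(n, 24))
    print("fits in 32 bits:", fits_exactly(n, 32))
    print("largest uint32 fits in 32 bits:", fits_exactly(2**32 - 1, 32))
```

A datapath that can only hold 24 significant bits has to break an INT32
multiply into pieces, which is where extra cycles come from on hardware
without the wider unit.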

Each cluster runs thread pairs with the same type in any given cycle,
but each of those four clusters can run a different thread type if it
needs to. The front-end of the chip handles the thread load balancing
across the core as mentioned, and there's nothing stopping all running
threads in a given cycle being all pixel, all vertex, or even... you
guessed it: all geometry, although that might not be the case
currently. More on that later.

For local memory access, the shader core can load/store from a huge
register file that takes up more area on the die than the ALUs for the
shader core that uses it. Accesses can happen in 'scalar' fashion, one
32-bit word at a time from the application writer's point of view,
which along with the capability of co-issuing 5 completely random
instructions (we tested using truly terrifying auto-generated shaders)
makes ATI's claims of a superscalar architecture perfectly legit.
Shading performance with more registers is also very good, indeed
we've been able to measure that explicitly with shaders using variable
numbers of registers, where there's no speed penalty for increasing
them or using odd numbers. It's arguably one of the highlights of the
design so far, and likely a significant contributor to R600's
potential GPGPU performance as well.

Access to the register file is also cached, read and write, by an 8KiB
multi-port cache. The cache lets the hardware virtualise the register
file, effectively presenting any entry in the cache as any entry in
the larger register file. It's unclear which miss/evict scheme they
use, or if there's prefetching, but they'll want to maximise hits for
the running threads of course.

It seems the hardware will also use it for streamout to memory,
letting the shader core bypass the colour buffer and ROPs on the way
out to board memory, and the chip will also use it for R2VB and
overflow storage for GS amplification, making it quite the useful
little piece of on-chip memory.

While going 5-way scalar has allowed AMD more flexibility in
instruction scheduling compared to their previous hardware, that
flexibility arguably makes your compiler harder to write, not easier.
So while as a driver writer you have more packing opportunities -- and I
like to think of it almost like a game of Tetris when it comes to a
GPU, but only with the thin blocks and with those being variable
lengths, and you can sometimes break them up! -- those opportunities
need handling in code and your corner cases get harder to find.
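
To make the Tetris picture a bit more concrete, here is a toy packer
in Python. It is emphatically not AMD's compiler, just a greedy sketch
under simplified assumptions: each bundle has five slots, an op can
only go into a bundle once everything it depends on sits in an earlier
bundle, and at most one 'special' op (the transcendental kind that
only the fifth ALU handles) fits per bundle. The branch slot, dot
products and op splitting are ignored, and the Op type is invented for
the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    deps: tuple = ()        # names of ops whose results this op reads
    special: bool = False   # needs the fifth, more capable ALU


def pack(ops, width=5):
    """Greedily pack ops into bundles of up to `width` independent ops."""
    done = set()
    remaining = list(ops)
    bundles = []
    while remaining:
        bundle = []
        special_used = False
        for op in remaining:
            if len(bundle) == width:
                break
            if any(dep not in done for dep in op.deps):
                continue    # an input isn't ready yet
            if op.special and special_used:
                continue    # only one special slot per bundle
            special_used = special_used or op.special
            bundle.append(op)
        if not bundle:
            raise ValueError("dependency cycle or missing producer")
        bundles.append([op.name for op in bundle])
        done.update(op.name for op in bundle)
        remaining = [op for op in remaining if op.name not in done]
    return bundles


if __name__ == "__main__":
    independent = [Op(f"mad{i}") for i in range(5)]
    chain = [Op("a"), Op("b", ("a",)), Op("c", ("b",)), Op("d", ("c",))]
    print(pack(independent))   # everything fits in a single bundle
    print(pack(chain))         # one op per bundle, four slots idle each time
```

The dependent chain is the bad case: every bundle carries one useful
op and four empty slots, which is the efficiency gap against a scalar
design that the article describes.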

The end result here is a shader core with fairly monstrous peak
floating point numbers, by virtue of the unit count in R600, its core
clock and the register file of doom, but one where software will have
a harder time driving it close to peak. That's not to say it's
impossible, and indeed we've managed to write in-house shaders, short
and long and with mixtures of channels, register counts and what have
you, that run close to max theoretical throughput. However it's a more
difficult proposition for the driver tech team to take care of over
the lifetime of the architecture, we argue, than their previous
architecture.

In terms of memory access from the sampler hardware, sampler units
aren't tied to certain clusters as such, rather certain positions
inside the cluster. If you visualise the 16 shader units in a cluster
as being four quads of units, each of the four samplers in R600 is
tied to one of those quads, and then across the whole shader core.
 
[lots of stuff clipped]

Alas, it appears a rushed chipset product has resulted in a power-hungry
beast that had to be underclocked before it could be released, and also
prevented the release of a contender for fastest videocard.

Perhaps when it's retooled on a smaller process.

rms
 


there's a 65nm version in the works, possibly called R650. but I don't
expect the R6xx architecture to shine until it's been refreshed /
improved, in the form of an R680 or R700 - which will most likely share
the same basic architecture but revamped.
 
It's a crap card and doesn't hold a candle to the G80. nVidia is king
of the hill for the next year or so at least.
 
Bababooey said:
It's a crap card and doesn't hold a candle to the G80. nVidia is king
of the hill for the next year or so at least.

yeah...it's a great card for dx9 games :)
 
Thanks for that RadeonR600 ...My brain hurts !
Why such an obsession with quiet fans ? I'm fighting my Geforce to turn its
fan up.
I want the thing to last as reliably as possible ...every graphics card I
have had that has died has done so because of some fault in the cooling.
I'd go for noisy & cool every time.
All those enthusiastic words !! but no mention at all of reliability / long
term life.
When spending say £200+ BPS / $400+ USD a person should know the thing is
good for at least a couple of years of heavy use.

All the words don't really tell me anything in terms of what performance /
cost is.
After years with ATI cards & now recently to NVIDIA for the 8800GTX I'm in
2 minds ...wanting to hear my Geforce is the better ...& wanting to hear
about a really good new DX10 card ..I would not recommend the 8800 to anyone
because of the dreadful control panel & silly little driver bugs.
The article should try to impress us with how wonderful the ATI drivers will
be.
(\__/)
(='.'=)
(")_(") mouse
 