http://www.realworldtech.com/page.cfm?ArticleID=RWT120307033606&p=1
Rambus Sets the Bandwidth Bar at a Terabyte/Second
Last week at the annual Developer's Forum in Japan, Rambus announced
an ambitious technology initiative that aims to create a 16 gigabit-
per-second memory signaling layer that can sustain 1TB/s of bandwidth
to a single memory controller by 2010. The Terabyte Bandwidth
Initiative is still in development, so there are no shipping products
yet, but the goals are now public and Rambus has demonstrated a
test board that achieves 1TB/s of bandwidth over this signaling
technology. This article will provide an in-depth look at the history,
target market, technical innovations and test vehicle for Rambus'
Terabyte Bandwidth Initiative (TBI).
Target Market
The target markets for Rambus are the segments of the DRAM market that
prefer high bandwidth and are willing to sacrifice capacity to achieve
that bandwidth: graphics, consoles and possibly networking. Graphics
almost universally uses GDDR3 or GDDR4, with GDDR5 slated for 2H08.
Consoles use GDDR and XDR (in the case of the PS3), while networking
applications use DDR, SRAM and RLDRAM (Reduced Latency DRAM).
Motivation
Several trends within the computing industry have driven a tremendous
increase in the need for high bandwidth memory systems. The
exponential increases in graphics performance and display capabilities
require exponentially faster memory. The fierce competition between
NVIDIA and ATI for graphics performance is typified by extremely fast
product cycles. Each new product from either contender contains more
programmable pipelines, running at higher frequencies as well.
Graphics memory bandwidth must increase proportionally to feed these
highly parallel processors. On the display side, resolutions increase to match the
capabilities of graphics processors, and the internal frame buffers
must be fast enough to transfer 30-60 frames per second.
Multi-core processors have had a similar (although less extreme)
impact on the general purpose market. In theory, a dual core processor
needs twice the memory bandwidth of a single core processor; in
reality, processor architects typically use caches to reduce the demand on the
memory subsystem. While processors do not require quite the same
bandwidth as graphics applications, the mismatch between execution
capabilities and memory bandwidth (which is often referred to as the
Memory Wall) is growing quite fast. In 1989, the 486 ran at 16-25MHz,
with 8KB of on-die cache and used 33MHz memory. In 2007, a Core 2
based processor features 2-4 cores running at up to 3GHz, with 6-12MB
of cache, and uses two channels of DDR3-1333 memory.
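To put rough numbers on the mismatch, the sketch below compares peak compute throughput against peak memory bandwidth for those two data points. The bus widths are assumptions made purely for illustration (a 32-bit memory bus for the 486, two 64-bit channels for the Core 2 system), not vendor-quoted figures.

    # Rough illustration of the Memory Wall, using the two systems above.
    # Bus widths (4B for the 486, two 8B DDR3 channels) are assumptions
    # made for illustration only.
    cpu_1989_hz = 25e6              # 486 at 25MHz, single core
    mem_1989_Bps = 33e6 * 4         # 33MHz x 4B bus ~= 132MB/s peak

    cpu_2007_hz = 3e9 * 4           # four cores at 3GHz (aggregate cycles/s)
    mem_2007_Bps = 1.333e9 * 8 * 2  # dual-channel DDR3-1333 ~= 21.3GB/s peak

    print(f"compute grew ~{cpu_2007_hz / cpu_1989_hz:,.0f}x")      # ~480x
    print(f"bandwidth grew ~{mem_2007_Bps / mem_1989_Bps:,.0f}x")  # ~162x

Even with these crude assumptions, raw compute capability has grown roughly 3X faster than memory bandwidth over that period.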
These trends are also found in the gaming console market. Consoles
require both general purpose and graphics processors, with matching
high performance memory hierarchies. Figure 1 below shows the
bandwidth used by various gaming consoles from 1985 to 2006, and
Rambus' target: 1TB/s in 2010.
http://www.realworldtech.com/includes/images/articles/Rambus-TBI-1.jpg
The 1TB/s target was calculated by looking at the overall trend (a
10X increase every 5 years, reaching 50GB/s in 2006), and then
doubling the extrapolation for 2010. For products to ship in that
2010 timeframe, Rambus' IP must be fully validated and verified for
high volume 45nm and 32nm processes well in advance of that date,
since system designers will require time to integrate third-party IP.
The Rambus signaling technology operates at 16Gbps, and it is
envisioned that a single memory controller could connect to 16 DRAMs,
with each DRAM providing 4 bytes of data per bit interval (1TB/s =
16Gbps x 4B x 16 DRAMs). To reach the 1TB/s target, Rambus is relying
on three key techniques to increase bandwidth: 32X data rates, full
speed command and addressing, and a differential memory architecture.
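As a sanity check on that arithmetic, the aggregate bandwidth falls out directly from the per-pin rate, the 4-byte data path per DRAM, and the 16-DRAM configuration (a minimal sketch; the variable names are mine):

    # Back-of-the-envelope check of the 1TB/s figure: 16 DRAMs, each
    # delivering 4 bytes per bit interval at a 16Gbps signaling rate.
    signaling_rate_bps = 16e9  # per differential link
    bytes_per_interval = 4     # per DRAM
    num_drams = 16             # per memory controller

    aggregate = signaling_rate_bps * bytes_per_interval * num_drams
    print(f"{aggregate / 1e12:.3f} TB/s")  # -> 1.024 TB/s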
Data Tweaks and a Command and Addressing Overhaul
In general, the Terabyte Bandwidth Initiative is best viewed as the
logical successor to the XDR2 memory interface, since in many regards
it builds on that foundation but goes further towards being a narrow,
high speed signaling interface.
32X Data Rate
The memory interface operates at 16Gbps, 32X the 500MHz reference
clock. This requires an extremely accurate PLL designed specifically
for this purpose. The 500MHz reference clock was chosen to reuse the
infrastructure for the XDR, XDR2 and FlexIO interfaces, all of which
use 500MHz input clocks. The 32X data rate is an evolutionary and
predictable change to the data interface. With each generation,
Rambus has consistently increased the ratio between the data
interface and the reference clock. The original Direct Rambus
interface operated at twice the reference clock and later evolved to
four times the reference clock. The first generation of XDR
transferred data at 8X the reference clock, and XDR2 increased the
ratio to 16X. Hence it should come as no surprise that this new
interface runs at 32X the reference clock.
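Laid out against the common 500MHz reference clock, the progression is straightforward; the sketch below simply tabulates the multipliers described above for the three XDR-family generations:

    # Per-pin data rate = clock ratio x 500MHz reference clock,
    # using the multipliers described above for each generation.
    REF_CLOCK_GHZ = 0.5  # 500MHz input clock (XDR, XDR2, FlexIO, TBI)

    for name, ratio in [("XDR", 8), ("XDR2", 16), ("TBI", 32)]:
        print(f"{name:4s} {ratio:2d}X -> {ratio * REF_CLOCK_GHZ:4.1f} Gbps per pin")
    # XDR 8X -> 4.0, XDR2 16X -> 8.0, TBI 32X -> 16.0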
FlexLink and Differential Signaling
To reduce area and increase performance, Rambus completely redesigned
the command and addressing (C/A) link. Traditionally, command and
address information is sent over a parallel, multi-drop bus, with a
drop for each DRAM - this is how XDR, DDR and GDDR all function. For
instance, the XDR in the CELL processor used a 12 bit wide C/A
interface that operated at 800Mbps, a quarter of the data rate.
However, as data rates increase it becomes more and more difficult to
synchronize multi-drop buses. To avoid this problem, Rambus moved
away from the old model of a 12 pin shared, parallel bus with slower
single-ended signaling, and instead applied the same techniques to
both the data and C/A links. The new C/A link uses 2 pins, is
point-to-point, employs differential signaling, operates at the full
16Gbps data rate, and builds on all of Rambus' previous techniques
for high performance (such as FlexPhase, a technique to compensate
for skew). Rambus refers to the narrow, full speed C/A link as
FlexLink, and to the differential signaling for the C/A link as a
Fully Differential Memory Architecture (fully differential because
all three components - clocks, data and C/A - will be differential).
The two technologies are extremely complementary. Differential
signaling avoids the capacitance and inductance problems with single-
ended signals, so that the C/A link can operate at 16Gbps and achieve
the desired power profile at the transmitter. In turn, this frequency
headroom enables fewer pins for the link. In general, high performance
implementations of alternative technologies (XDR or GDDR) tend to use
one C/A bus (12 or ~20 pins) for 1-2 DRAMs. Rambus' new interface has
4-16 fewer C/A pins per DRAM than XDR and 16-32 fewer C/A pins than
GDDR.
Both of these changes are very consistent with the overall trends for
interconnect architectures in the semiconductor industry. Almost
universally, interconnects have shifted away from the slow and wide
buses that were favored in earlier days, such as SCSI or the front
side bus. Instead, physics and economics tend to favor interfaces
with the fewest pins, where more bandwidth comes from faster
signaling rather than from additional data pins. Rambus is somewhat
ahead of the curve, as their memory interface is the first to use
fully differential signaling and narrow point-to-point links. Since Rambus
focuses primarily on the high bandwidth market, it is very likely that
these architectural changes will be mirrored by more mainstream
standards over the course of the next 2-4 years.
Rambus' Test Vehicle
Along with declaring their intentions to provide 1TB/s of bandwidth to
ASICs, GPUs or MPUs in the future, Rambus also demonstrated a test
vehicle for their future interconnect, shown below in Figure 2. The
data eye at the memory controller (with equalization) and the data
eye at the DRAM (without equalization) are both at 16Gbps.
http://www.realworldtech.com/includes/images/articles/Rambus-TBI-2.jpg
Rambus' test system is manufactured in TSMC's 65nm ASIC process, and
two DRAM emulators were manufactured using a 65nm DRAM process. The
ASIC is flip-chip packaged, while the DRAM emulators use wire bonding
- which is consistent with overall industry practice. The demo system
does not use transmit equalization, which most high speed memory
interfaces would employ in a real-world design.
At this point, Rambus declined to discuss the power efficiency (as
measured in GB/s per mW) of this first implementation, or specific
targets for the initiative. However, they did state that they do not
believe the thermal envelope for memory interconnects has changed
significantly since the time when XDR or XDR2 debuted. This implies
that the power efficiency should increase by roughly the same factor
as the bandwidth of an individual link relative to XDR2 or XDR. One of
the advantages of setting a performance target twice as high as the
estimated needs of the target systems is that a system designer can
easily trade that extra performance for even lower power consumption.
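That relationship is easy to make concrete: holding per-link power fixed while the per-link data rate doubles from XDR2's 8Gbps to 16Gbps requires the efficiency, in GB/s per mW, to double as well. A minimal sketch, with an arbitrary placeholder power budget since Rambus disclosed no absolute figures:

    # At a fixed per-link thermal envelope, efficiency (GB/s per mW)
    # must scale with per-link bandwidth.  The 100mW budget is a
    # placeholder, not a disclosed figure.
    LINK_POWER_MW = 100.0

    for name, gbps in [("XDR", 4.0), ("XDR2", 8.0), ("TBI", 16.0)]:
        efficiency = (gbps / 8) / LINK_POWER_MW  # GB/s per mW per link
        print(f"{name:4s} requires {efficiency:.4f} GB/s per mW")
    # Doubling the data rate at constant power doubles the required efficiency.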
Rambus did not quantitatively describe the project targets or
implementation-specific bit-error rates (BERs), other than to say
that they will provide "commercially viable BERs". The achievable BER
is to a large extent implementation-specific (depending on board
quality, etc.), and is not defined wholly by Rambus' IP. One
challenge that they are cognizant of is that maintaining acceptable
failure rates requires lower and lower BERs as bandwidth increases.
While a given error rate may be acceptable at 4Gbps and yield a 3-5
year expected lifetime, at 16Gbps that same error rate will produce
an expected lifetime that is one fourth as long - below most
commercial requirements. Consequently, Rambus will design for lower
BERs and help customers achieve the desired level of reliability.
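The lifetime arithmetic works out as follows: the mean time between bit errors is 1/(BER x bit rate), so quadrupling the rate at a fixed BER quarters the expected time to first error. A minimal sketch, with an illustrative BER chosen only to land the 4Gbps case in the 3-5 year range:

    # Mean time between bit errors = 1 / (BER x bit rate).
    # The BER value is illustrative, not a Rambus figure.
    SECONDS_PER_YEAR = 3600 * 24 * 365
    ber = 2e-18

    for gbps in (4, 16):
        mtbe = 1.0 / (ber * gbps * 1e9)  # seconds between expected errors
        print(f"{gbps:2d}Gbps -> ~{mtbe / SECONDS_PER_YEAR:.1f} years between errors")
    # 4Gbps -> ~4.0 years; 16Gbps -> ~1.0 year (one quarter the lifetime)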
Conclusion
As Rambus was very clear to point out, this announcement isn't about a
currently shipping product - it is about setting goals. These goals
are certainly aggressive, but Rambus has made substantial progress
already and there is little risk that they will fall short of their
targets. Rambus' work in this area is extremely relevant for the
console and graphics markets. When this technology is finally
productized, the initial design wins are most likely to be in next
generation consoles, particularly designs from Sony, given their
previous experience with XDR. The most interesting aspect of the
technology that Rambus has discussed is its implications for other
interconnects. It will be very interesting to see when other
interconnects follow suit and move command and address communication
to narrow, high speed differential links instead of slower and wider
single-ended buses.