Memory Performance - Random vs. Sequential


mike

Regarding the trade-off between random & sequential data access for SDR,
DDR & DDR2, I would like to know whether the information I found at the
link below is correct (I've summarized it into the table below):

For a base clock rate of 200MHz (5ns):

        Initial Random    Subsequent Sequential
        Data Access       Data Access
        --------------    ---------------------
SDR     5 ns              5 ns
DDR     10 ns             2.5 ns
DDR2    20 ns             1.25 ns

The reason I'm asking is that I believe this shows that the best
performance for an application with a very high proportion of random
memory accesses will be achieved with DDR-400 memory as opposed
to DDR2-800. True? By this logic, SDR-200 would be even better, although
I don't think such a thing ever existed. For the purposes of this
comparison, assume processor speed & cache sizes are the same.


Source:

http://archives.postgresql.org/pgsql-performance/2006-04/msg00601.php

"Note also what happens when transferring the first datum after a lull
period. For purposes of example, let's pretend that we are talking about a
base clock rate of 200MHz = 5ns.

The SDR still transfers data every 5ns no matter what. The DDR transfers
the 1st datum in 10ns and then assuming there are at least 2 sequential
datums to be transferred will transfer the 2nd and subsequent sequential
pieces of data every 2.5ns. The DDR2 transfers the 1st datum in 20ns and
then assuming there are at least 4 sequential datums to be transferred
will transfer the 2nd and subsequent sequential pieces of data every
1.25ns.

Thus we can see that randomly accessing RAM degrades performance
significantly for DDR and DDR2. We can also see that the conditions for
optimal RAM performance become more restrictive as we go from SDR to DDR to
DDR2. The reason DDR2 with a low base clock rate excelled at tasks like
streaming multimedia and stank at things like small transaction OLTP DB
applications is now apparent."
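
To make the model concrete, here is a small C sketch of that arithmetic
(using the illustrative figures from the quote, not datasheet values). It
shows the average cost per access converging toward the per-beat time as
sequential runs get longer, with DDR2 going from worst to best:

    #include <stdio.h>

    /* Simplified model from the quoted post: the first access pays an
     * initial latency, each further sequential beat pays the beat time. */
    struct mem { const char *name; double first_ns; double beat_ns; };

    int main(void)
    {
        struct mem mems[] = {
            { "SDR",  5.0,  5.0  },
            { "DDR",  10.0, 2.5  },
            { "DDR2", 20.0, 1.25 },
        };
        int runs[] = { 1, 2, 4, 8, 64 };  /* sequential run lengths */

        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 5; j++) {
                int n = runs[j];
                /* average cost per datum over a run of n accesses */
                double avg = (mems[i].first_ns + (n - 1) * mems[i].beat_ns) / n;
                printf("%-5s run=%-3d avg=%5.2f ns/access\n",
                       mems[i].name, n, avg);
            }
        return 0;
    }

At a run length of 1, DDR2 is four times slower than SDR; at a run
length of 64, it is more than three times faster.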
 

Paul

mike said:
Regarding the trade-off between random & sequential data access for SDR,
DDR & DDR2, I would like to know whether the information I found at the
link below is correct [snip table and source quote]

The reason I'm asking is that I believe this shows that the best
performance for an application with a very high proportion of random
memory accesses will be achieved with DDR-400 memory as opposed
to DDR2-800. True? By this logic, SDR-200 would be even better, although
I don't think such a thing ever existed. For the purposes of this
comparison, assume processor speed & cache sizes are the same.

While shooting from the hip is fun (and Lord knows I've done it enough myself),
a latency analysis requires more than looking at the RAM interface. The
Northbridge, or memory interface, also plays a part. There can be latency
in the Northbridge itself, and differences between Northbridges. Northbridges
can run sync or async, and that can add an extra cycle or two to the path.

There are other effects which make measuring the latency harder. There
is pre-fetching activity on some of the latest Intel hardware, and the
tools used to measure latency have to disable the prefetching, as best
they can, to make a measurement. Pre-fetching could shoot you in the foot
if the app does nothing but random access.
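
If you want to see this for yourself, the usual trick is a pointer chase
through a randomly shuffled chain: every load depends on the previous
one, which defeats most prefetchers. A minimal C sketch (the buffer and
loop sizes here are arbitrary assumptions; make the buffer much larger
than your caches):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define NODES  (1 << 20)   /* 1M nodes * 64 B = 64 MB, well past cache */
    #define CHASES (1 << 24)   /* dependent loads to time                  */

    /* One node per cache line, so every hop is a fresh line fill. */
    struct node { struct node *next; char pad[64 - sizeof(struct node *)]; };

    int main(void)
    {
        struct node *a = malloc(sizeof(struct node) * (size_t)NODES);
        size_t *idx = malloc(sizeof(size_t) * NODES);
        if (!a || !idx) return 1;

        /* Shuffle a visiting order (Fisher-Yates), then link the nodes
         * in that order so the chain is one big unpredictable cycle. */
        for (size_t i = 0; i < NODES; i++) idx[i] = i;
        srand(12345);
        for (size_t i = NODES - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
        for (size_t i = 0; i + 1 < NODES; i++) a[idx[i]].next = &a[idx[i + 1]];
        a[idx[NODES - 1]].next = &a[idx[0]];

        /* Chase the chain; each load address depends on the last load. */
        struct node *p = &a[idx[0]];
        clock_t t0 = clock();
        for (long i = 0; i < CHASES; i++) p = p->next;
        clock_t t1 = clock();

        printf("%.1f ns per dependent load (p=%p)\n",
               1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / CHASES,
               (void *)p);  /* printing p keeps the loop from optimizing out */
        free(idx); free(a);
        return 0;
    }

The number you get is the full round trip (controller, precharge and
activate, plus CAS), not just CAS, which is rather the point.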

For SDRAM, yes, someone in fact did make "SDR200". You aren't likely to
find sticks of memory with these chips on them (they might be used in
embedded applications, or perhaps the cache on a hard drive controller
uses them). But these run with a 200MHz clock.

http://www.micron.com/products/partdetail?part=MT48LC2M32B2P-5

The CAS Latency spec for those is CAS3. The first-cycle latency is
(3 * 5ns) = 15ns.

At DDR400, the best memories were CAS2 (I have some of those).
CAS2 times 5ns gives 10ns, a little faster first cycle than the best SDRAM.

At DDR2-800, a CAS4 stick has the same latency as a CAS2 DDR400. If
you can find some CAS3 DDR2-800, then again, you are ahead by a little
bit.
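
Putting those first-word numbers side by side (a sketch that counts CAS
only, in bus-clock cycles, and ignores tRCD and the rest of the access):

    #include <stdio.h>

    int main(void)
    {
        /* First word = CAS cycles * bus-clock period. SDR-200 and DDR-400
         * clock at 200MHz (5ns); DDR2-800 clocks at 400MHz (2.5ns). */
        struct { const char *name; int cas; double clk_ns; } m[] = {
            { "SDR-200  CAS3", 3, 5.0 },
            { "DDR-400  CAS2", 2, 5.0 },
            { "DDR2-800 CAS4", 4, 2.5 },
            { "DDR2-800 CAS3", 3, 2.5 },
        };
        for (int i = 0; i < 4; i++)
            printf("%s: %4.1f ns to first word\n",
                   m[i].name, m[i].cas * m[i].clk_ns);
        return 0;
    }

That gives 15, 10, 10 and 7.5 ns respectively: the faster clock cancels
out the higher CAS count.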

On Newegg, I see three products at the DDR2-800 CAS3 level. This is the
cheapest of them. Timings 3-4-4-15 (first digit is CAS and is the most
important).

http://www.newegg.com/Product/Product.aspx?Item=N82E16820227190

In the chipset itself, there are subtle differences between single
channel and dual channel operation. Some chipsets handle things differently
than others. For example, there were claims that the first-cycle latency
of a P4PE single-channel board was better than a P4P800 dual-channel
board run in dual-channel mode. There have been review articles
comparing such things, but that would take hours of searches to dig up.

But really, you have to measure this, rather than looking at the memory by
itself.

And comparing an Athlon64, with its built-in AM2 DDR2 memory controller,
is a different situation from the Intel approach. Since the memory
interface is right on the processor, there is an opportunity to shave off
some latency over and above the memory itself. The AMD processor has a
limit as to how low the latency can be set, which is a slight impediment.

So this is not really an easy question to answer at all.

The best way to answer it, is to benchmark representative systems,
and pick the winner that way. When enough money is involved, a
vendor will provide loaner systems for a couple days, so you can
test.

Also, this latency analysis (looking at CAS only) ignores the
rest of the memory cycle, and what mode the transaction runs in.
I expect most of the time the controller is doing a burst. I'm
not even sure any more whether you can do a single cycle on a
memory subsystem. You may always be paying for a burst transfer
and throwing away the unused bits. Systems now tend to do things
in cache-line-sized chunks. So, in terms of "random access
transfers per second", you need to examine how many full memory
transactions fit per second. (This assumes the processor creates
random requests faster than the memory subsystem can satisfy
them, and thus you are waiting for the memory to become ready again
for the next request.)
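
As a rough worked example of that, using the 3-4-4-15 DDR2-800 timings
mentioned above (a sketch that ignores refresh, bank interleaving and
controller overhead, so treat it as a best case):

    #include <stdio.h>

    int main(void)
    {
        /* DDR2-800: 400MHz bus clock, so one timing cycle = 2.5ns.
         * 3-4-4-15 -> CL=3, tRCD=4, tRP=4 (in bus clocks). A random
         * access that hits the wrong open row costs roughly
         * tRP (precharge) + tRCD (activate) + CL (read latency),
         * plus the burst itself: 8 beats = 4 bus clocks. */
        double clk_ns   = 2.5;
        double cycle_ns = (4 + 4 + 3 + 4) * clk_ns;  /* ~37.5 ns */
        double per_sec  = 1e9 / cycle_ns;            /* ~26.7M/s */

        printf("~%.1f ns per random transaction\n", cycle_ns);
        printf("~%.1fM random transactions/s\n", per_sec / 1e6);
        printf("~%.2f GB/s effective at 64 B/line (peak is 6.4 GB/s)\n",
               per_sec * 64.0 / 1e9);
        return 0;
    }

That works out to roughly a quarter of peak bandwidth, before any
controller or Northbridge overhead is counted.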

This document shows sample timing diagrams. Figure 41, on
PDF page 32, shows a burst write. Notice how the end of the complete
memory transaction chews up as much time as the data transfer.
The inverse of the time period for a complete transfer determines
how many of these random accesses you can do per second. Most people
are shocked by just how low this number can be.

http://www.hynix.com/datasheet/Timing_Device/DDR2_device_operation&timing_diagram(Rev.0.1).pdf

Another thing to note is that modern memory controllers do not use
all the features shown in that document. I presume the reason for
this is that chipset designers have done the analysis, and decided which
features are a win and which ones aren't. So not all the crazy
DDR2 timing diagrams in that document would be applicable.

Paul
 

Bob Day

mike said:
Regarding the trade-off between random & sequential data access for SDR,
DDR & DDR2, I would like to know whether the information I found at the
link below is correct [snip]

A free software application to measure real-world random
access latency of blocks of memory in sizes from 4 bytes
on up is available on my website. The source code is there
also, and is also free: http://bobday.vze.com

-- Bob Day
 

mike

Paul said:
[snip]

The CAS Latency spec for those [the SDR200 chips] is CAS3. The
first-cycle latency is (3 * 5ns) = 15ns.

At DDR400, the best memories were CAS2 (I have some of those).
CAS2 times 5ns gives 10ns, a little faster first cycle than the best SDRAM.

At DDR2-800, a CAS4 stick has the same latency as a CAS2 DDR400. If
you can find some CAS3 DDR2-800, then again, you are ahead by a little
bit.

Was the evolution of DDR2 such that its latency was initially double
that of DDR, and improved over time until it is now better than DDR?
That would go a long way toward explaining a lot of the seemingly
conflicting information I've read on the web.



[snip]

Also, this latency analysis (looking at CAS only) ignores the rest of
the memory cycle, and what mode the transaction runs in. I expect most
of the time the controller is doing a burst. I'm not even sure any
more whether you can do a single cycle on a memory subsystem. You may
always be paying for a burst transfer and throwing away the unused bits.

If so, then that would make the time to do a random access for DDR2-800
(CAS4) slightly higher than for DDR-400 (CAS2), wouldn't it? (Although
CAS3 DDR2-800 may still be an improvement over DDR-400 in this regard.)



Systems now tend to do things in cache-line-sized chunks. So, in terms
of "random access transfers per second", you need to examine how many
full memory transactions fit per second. (This assumes the processor
creates random requests faster than the memory subsystem can satisfy
them, and thus you are waiting for the memory to become ready again
for the next request.)

This document shows sample timing diagrams. Figure 41, on PDF page 32,
shows a burst write. Notice how the end of the complete memory
transaction chews up as much time as the data transfer. The inverse of
the time period for a complete transfer determines how many of these
random accesses you can do per second. Most people are shocked by just
how low this number can be.

http://www.hynix.com/datasheet/Timing_Device/DDR2_device_operation&timing_diagram(Rev.0.1).pdf

I'll have to study that for a while. Thanks for the info! :)
 
