HP's 2-way Opteron server


Mitch Alsup

Tony Hill said:
Err, no. The coherency checks and memory access are done
concurrently. Since the cache checks are WAY faster than a read from
main memory they always return first.

Er, no. Coherence checks run concurrently, but the latency to the
remote cache is longer than the latency to local DRAM under open
page access scenarios.

Open page hit access:

The probe is launched from the memory controller at the same time that
the request is sent to the local DRAM controller. If the request hits
on an open page, the data arrives at the pins of the consuming processor
before the probe response has a chance to return. The L2 miss buffer
absorbs the data and waits for the probe responses to arrive before
feeding data to the core. If the remote cache also delivers data, that
data supersedes the DRAM data, and then the data can be consumed by the
core. Much of the time, the latency ends up being probe bound, not
data bound.
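
As a rough illustration of that fill/probe race, here is a small C
sketch; the structure and names are mine, not AMD's actual logic, and
the 64-byte line size is just the Opteron cache line.

/* Toy model of an L2 miss buffer entry racing a speculative DRAM fill
 * against outstanding coherence probes.  Names and structure are
 * illustrative only. */
#include <stdbool.h>
#include <stdint.h>

struct miss_buffer_entry {
    uint8_t line[64];            /* cache line being filled            */
    bool    dram_data_valid;     /* speculative fill from local DRAM   */
    int     probes_outstanding;  /* probe responses still in flight    */
    bool    probe_had_dirty;     /* a remote cache supplied newer data */
};

/* Open-page hit: the DRAM fill arrives first.  Buffer it, but do not
 * forward it to the core until the probes have been accounted for.    */
void on_dram_fill(struct miss_buffer_entry *e, const uint8_t data[64])
{
    for (int i = 0; i < 64; i++)
        e->line[i] = data[i];
    e->dram_data_valid = true;
}

/* A probe response either carries dirty data, which supersedes the
 * buffered DRAM copy, or is just an acknowledgement.  Returns true
 * when the line may finally be fed to the core.                       */
bool on_probe_response(struct miss_buffer_entry *e, const uint8_t *dirty)
{
    if (dirty) {
        for (int i = 0; i < 64; i++)
            e->line[i] = dirty[i];
        e->probe_had_dirty = true;
    }
    e->probes_outstanding--;
    /* The miss is probe bound: completion waits on the last response. */
    return e->probes_outstanding == 0 &&
           (e->dram_data_valid || e->probe_had_dirty);
}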

Path for DRAM:      MemCtl->DRAMCtl->DRAMpins->DRAM->DRAMpins->DRAMrdbuf
                    ->CrossBar->L2response
Path for Coherence: MemCtl->CrossBar->HtSend->HtLink->HtReceive
                    ->CrossBar->L2Ctl->L2access->L2Ctl->CrossBar
                    ->HtSend->HtLink->HtReceive->CrossBar->L2response

The path HtSend->HtReceive is about 12ns, so the round trip is 24ns
on the HT without any time at all for accessing the remote L2 cache.
This puts the minimum time for a negative L2 response in the 30 ns range.
A positive response has to access the data from the L2 and then send
it over HT, adding another 8ns to the first data word: 38 ns, while
the minimum time for local DRAM is in the 26 ns range.
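
A quick back-of-the-envelope check of those figures; the ~6 ns
crossbar/tag-lookup term is inferred here so the totals match, the
other numbers are the ones quoted above.

#include <stdio.h>

int main(void)
{
    const int ht_one_way_ns  = 12;                 /* HtSend -> HtReceive     */
    const int ht_round_trip  = 2 * ht_one_way_ns;  /* 24 ns                   */
    const int l2_tag_xbar_ns = 6;                  /* inferred, not quoted    */
    const int l2_data_ns     = 8;                  /* pull data, send over HT */

    printf("negative probe response: ~%d ns\n", ht_round_trip + l2_tag_xbar_ns);
    printf("positive probe response: ~%d ns\n",
           ht_round_trip + l2_tag_xbar_ns + l2_data_ns);
    printf("local open-page DRAM:    ~26 ns\n");
    return 0;
}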

Closed page access: most of the time the L2 responses arrive before data
Wrong page access: almost all the time the L2 responses arrive before data

Tony Hill said:
If a remote cache has a newer copy of the data, it is used and the
memory read is canceled. If not,

Seldom happens in practice:

The vast majority of the time that one receives data from a remote
cache when accessing local DRAM, the local DRAM data arrives before
the response from the remote L2. The only real chance to cancel a
DRAM access is when the remote L2 cache that contains the most recent
copy of the data is on the same node as the remote DRAM that contains
the stale data. Even when the L2 cache response arrives first, the DRAM
controller has already started or finished the request, and the cancel
does nothing.

The other real chance occurs when there is a great memory load on the
system and the DRAM controller queues become flooded with requests,
adding request latency through the serially reusable resource (DRAM).
In this case there is time for the cancel messages to reach the DRAM
controller before the request has been processed.
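
A minimal sketch of that cancel window, assuming a simple three-state
request queue of my own invention:

#include <stdbool.h>

enum req_state { QUEUED, ISSUED_TO_DRAM, COMPLETED };

struct dram_request {
    unsigned long long addr;
    enum req_state     state;
};

/* A cancel from the coherence fabric only helps while the request is
 * still queued; once the DRAM access has started (or finished), the
 * cancel is ignored and the fetched data is simply discarded later.  */
bool try_cancel(struct dram_request *r)
{
    if (r->state == QUEUED)
        return true;    /* deep queues under heavy load widen this window */
    return false;       /* too late: the cancel does nothing              */
}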

Tony Hill said:
the memory read continues as normal. There might be an extra ns or
two of latency, but nothing significant.

What IS significant is that without any NUMA optimizations a
dual-processor Opteron system accesses 50% of its data from a remote
memory controller. Even with NUMA optimizations this number is still
pretty high. AMD estimates the latency penalty for remote memory as
being 35ns for one hop and another 40ns for two hops.
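
For a sense of scale, a toy average using those figures; the 60 ns
local baseline is an assumption of mine, only the 50% remote fraction
and the 35 ns one-hop penalty come from the post.

#include <stdio.h>

int main(void)
{
    const double local_ns      = 60.0;  /* assumed local latency baseline */
    const double one_hop_extra = 35.0;  /* quoted one-hop penalty         */
    const double remote_frac   = 0.50;  /* no NUMA-aware placement        */

    double avg = (1.0 - remote_frac) * local_ns
               + remote_frac * (local_ns + one_hop_extra);
    printf("average latency: %.1f ns (vs %.1f ns all-local)\n", avg, local_ns);
    return 0;
}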

Tony

Mitch
 
