When will the Intel "real" quad-core processor come out?

phuile

I've read threads here about both AMD and Intel bringing out new "real"
quad-core processors for 4+ socket servers in a few months. I am
looking at a machine with two real quad-core Xeon processors. Does anyone
know approximately how long I'm looking at - March? June? Fall?
December of 2007?

Apart from this forum, where can I find more information on this
timeframe issue?

Thanks for any reply.
 
David Kanter

phuile said:
I've read threads here about both AMD and Intel bringing out new "real"
quad-core processors for 4+ socket servers in a few months.

AMD won't have quad cores till the second half of this year AFAIK.
Intel already offers quad Xeon DP systems.
I am
looking at a machine with two real quad-core Xeon processors. Does anyone
know approximately how long I'm looking at - March? June? Fall?
December of 2007?

How about they have been available since November?

DK
 
Derek Baker

* David Kanter:
AMD won't have quad cores till the second half of this year AFAIK.
Intel already offers quad Xeon DP systems.


How about they have been available since November?

DK

OP: If you mean a single die - i.e. not two dual-cores put together - the
answer seems to be not until next year. Though as DK indicates, just
because the current ones are not single die, doesn't mean that they're
inferior.
 
phuile

Yes, I am talking about the "real" quad, not 2 dual core put together.
I am asking because I was in a discussion on another thread in this
forum and happened to read that some people are prepared to wait for
the "real" quad from Intel. The reason being that AMD will have them
coming "soon" and Intel shouldn't be far off if they want to compete. I
am just considering whether I should wait or just go ahead with the
current quad Xeon. That's why I am wondering whether anybody knows
about the time frame.
 
Yousuf Khan

phuile said:
Yes, I am talking about the "real" quad, not 2 dual core put together.
I am asking because I was in a discussion on another thread in this
forum and happened to read that some people are prepared to wait for
the "real" quad from Intel. The reason being that AMD will have them
coming "soon" and Intel shouldn't be far off if they want to compete. I
am just considering whether I should wait or just go ahead with the
current quad Xeon. That's why I am wondering whether anybody knows
about the time frame.

Well, the answer seems to have been updated recently: not till sometime
in 2008, _after_ Intel has converted to 45nm!

http://www.tgdaily.com/2007/01/27/intel_45nm_penryn_details/

For Intel with a shared L2 cache, it may not be as easy to redesign it
to accommodate 4 cores rather than just 2. AMD will only have a shared
L3 cache, which is not as performance critical as an L2 cache, so some
design flexibility might be available there.

As for the advantages of a real quad-core vs. dual-dual-cores, we really
won't know the answer to that until AMD launches its real quad-core. So
far people think it won't make a difference, but AMD is claiming that
Barcelona will be up to 40% faster than Clovertown. Pretty much what Intel
claimed Conroe would be over Athlon 64 before it got launched; back then
people were skeptical, but it turned out to be true. AMD might hold
similar aces up its sleeve. We can assume that AMD will implement all of
the same architectural improvements to its cores that Intel did to make
Core 2 so good, so at the very least it will equal Core 2. Then AMD will
have a shared L3 cache between the 4 cores, which should pool common
data among all 4 cores rather than 2; the shared L2 cache worked wonders
for Core 2 over Athlon 64; it was probably worth over 50% of the overall
improvement by itself. Also, although Core 2 is a superb computational
engine, it's definitely not state-of-the-art at I/O throughput (i.e.
HyperTransport vs. front-side bus). It's masking its I/O deficiencies
with big caches at the moment. The I/O throughput equation also includes
communications between processors in a multiprocessor system. When they
scale up over two processors, the FSB is a bottleneck.

Yousuf Khan
 
David Kanter

Yes, I am talking about the "real" quad, not 2 dual core put together.

Can you explain to me exactly what the difference is? It sure seems
like most software doesn't know the difference.
I am asking because I was in a discussion on another thread in this
forum and happened to read that some people are prepared to wait for
the "real" quad from Intel.

Yes, well, some people also stored up tons of canned food for Y2K.
That just made them crazy...
The reason being that AMD will have them
coming "soon" and Intel shouldn't be far off if they want to compete.

What makes you think that Intel "needs" an integrated solution? The
only questions that a user should care about are:
1. What is the performance for the applications I care about, or
performance generally?
2. What is the power dissipation?
3. How much does it cost?

Having four cores on a single die is a way to improve performance.
However, it has drawbacks. You cannot bin the parts to match on
frequency and power dissipation. It is inherently more expensive to
produce, because larger dice have lower yields.
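
For a rough feel for the yield argument, here is a minimal sketch using the
textbook Poisson defect model; the defect density and die areas below are
made-up illustrative numbers, not actual process data.

# Why a large monolithic quad-core die costs more silicon per good part
# than two smaller dual-core dice in a multichip package.
import math

D0 = 0.5                     # assumed defects per cm^2
A_DUAL = 1.5                 # assumed dual-core die area, cm^2
A_QUAD = 2 * A_DUAL          # monolithic quad is roughly twice the area

def poisson_yield(area, d0):
    """Fraction of dice with zero defects: Y = exp(-D0 * A)."""
    return math.exp(-d0 * area)

y_dual = poisson_yield(A_DUAL, D0)   # ~47% with these numbers
y_quad = poisson_yield(A_QUAD, D0)   # ~22% with these numbers

# Wafer area consumed per *good* quad-core product:
silicon_monolithic = A_QUAD / y_quad      # one big die / its low yield
silicon_mcm = 2 * A_DUAL / y_dual         # two small dice / higher yield

print(f"dual-core die yield:                    {y_dual:.1%}")
print(f"monolithic quad-core die yield:         {y_quad:.1%}")
print(f"cm^2 of wafer per good monolithic quad: {silicon_monolithic:.1f}")
print(f"cm^2 of wafer per good MCM quad:        {silicon_mcm:.1f}")
# The MCM also lets you pair two dice that bin to the same frequency and
# power, which a single large die cannot do.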

Using a multichip package has some advantages and disadvantages as
well. It is lower performance for multithreaded applications, but can
be higher performance for some single threaded applications, since the
frequencies will be higher. The reason that frequency will be higher
is that you can choose the two dice that go into the MCP, and pick so
that they are both really fast, or both slow. It's also probably 10x
cheaper to develop a new CPU with a multichip package than to do an
all-new single-die design.

Only zealots really want to portray it as a black and white issue.
AMD doesn't have the technology to do a multichip package so they have
been spending a lot of effort to get people to believe that "only
integrated quad cores are real quad cores", which is bunk. AMD's
performance will scale better because of the choices they made, but it
is balanced out by being somewhat more expensive in other ways.

If your workload isn't multithreaded, then it won't even matter
anyway.
I am just considering whether I should wait or just go ahead with the
current quad Xeon. That's why I am wondering whether anybody knows
about the time frame.

Intel won't do four cores on a die for another 1.5-2 years. It really
depends on what your application is; most applications simply don't
have 4 threads, so you're better off buying the fastest dual core you
can get.

DK
 
David Kanter

in 2008, _after_ Intel has converted to 45nm!

http://www.tgdaily.com/2007/01/27/intel_45nm_penryn_details/

For Intel with a shared L2 cache, it may not be as easy to redesign it
to accommodate 4 cores rather than just 2.
Absolutely.

AMD will only have a shared
L3 cache, which is not as performance critical as an L2 cache, so some
design flexibility might be available there.

This sentence really doesn't make much sense. How is there design
flexibility from having a shared L3 versus a shared L2 cache? It's
really all the same.
As for the advantages of a real quad-core vs. dual-dual-cores, we really
won't know the answer to that until AMD launches its real quad-core. So
far people think it won't make a difference, but AMD is claiming that
Barcelona will be up to 40% faster than Clovertown.

AMD is claiming that it will be 40% faster in SPECfp_rate than
EXISTING clovertown processors. Between now and then clovertown could
increase in clockspeed...imagine that. I bet on average the two
products will be about even, with AMD winning on FP and numerical
workloads, and Intel winning on more integer stuff.
Pretty much what Intel
claimed Conroe would be over Athlon 64 before it got launched; back then
people were skeptical, but it turned out to be true. AMD might hold
similar aces up its sleeve. We can assume that AMD will implement all of
the same architectural improvements to its cores that Intel did to make
Core 2 so good, so at the very least it will equal Core 2.

You could assume that, and you'd be wrong. AMD already stated that
they are not doing full LD/ST reordering, and they are only reordering
loads around other loads. That's much easier to do, and provides less
of a performance benefit.

Besides, any changes that AMD made to Barcelona were set in stone
around 1-2 years ago.
Then AMD will
have a shared L3 cache between the 4 cores, which should pool common
data among all 4 cores rather than 2;

That's really easy to model though. The problem with Intel's quad
core is that there is duplication between the different L2 caches, but
it probably isn't that bad.
the shared L2 cache worked wonders
for Core 2 over Athlon 64, it was probably worth over 50% of the overall
improvement by itself.

Can you back that statement up by data? Those numbers seem
ridiculously high.

DK
 
The Kat

Yes, I am talking about the "real" quad, not 2 dual core put together.

Can you explain to me exactly what the difference is? It sure seems
like most software doesn't know the difference.

I could see, in any case when the same data was being processed
by multiple cores, that there would be a benefit to a shared cache.
But that's NOT what most quad-core chips will be doing, I think.




--

Lumber Cartel (tinlc) #2063. Spam this account at your own risk.

This sig censored by the Office of Home, Land & Planet Insecurity...

Remove XYZ to email me
 
David Kanter

Yes, I am talking about the "real" quad, not 2 dual core put together.
Can you explain to me exactly what the difference is? It sure seems
like most software doesn't know the difference.
I could see, in any case when the same data was being processed
by multiple cores, that there would be a benefit to a shared cache.
But that's NOT what most quad-core chips will be doing, I think.

Yup. It becomes an issue when you have one CPU in the package trying
to write, while the other CPU is trying to read or write.

It's not ideal, but it is a lot cheaper and easier to do, and I think
both Intel and AMD make the appropriate choices for their respective
situations.

DK
 
Sebastian Kaliszewski

David said:
This sentence really doesn't make much sense. How is there design
flexibility from having a shared L3 versus a shared L2 cache? It's
really all the same.

But the L3 can be:
a) slower (latency-wise)
b) less concurrent (the L1/L2 handle much more traffic than the L3 does)

IOW -- it's easier to share a slower L3 than a faster L2.

You could assume that, and you'd be wrong. AMD already stated that
they are not doing full LD/ST reordering, and they are only reordering
loads around other loads. That's much easier to do, and provides less
of a performance benefit.

But up to now AMD has done virtually none of that, while Intel has done
the LD/LD part since the P6. So AMD is finally picking that low-hanging
fruit, with the performance improvement that goes with it. This alone
should bring performance up by about two speed grades.

Besides, any changes that AMD made to Barcelona were set in stone
around 1-2 years ago.


That's really easy to model though. The problem with Intel's quad
core is that there is duplication between the different L2 caches, but
it probably isn't that bad.

The problem is bus load. Coherency traffic is exposed on the CPU's FSB and
occurs at FSB speed.


rgds
 
Yousuf Khan

David said:
This sentence really doesn't make much sense. How is there design
flexibility from having a shared L3 versus a shared L2 cache? It's
really all the same.


L3 doesn't need to have as low a latency as the lower-level caches. It's
also probably not as heavily accessed as the lower-level caches. Since
four cores accessing the same memory requires more traffic management,
which adds latency, the fact that the cores rely much less on the L3 than
on the L2 means the weighted-average latency increase is smaller.

In fact, L3 might be ideal for the experimental caches like AMD's ZRAM,
or Intel's FB-RAM. Neither of those has the low latency of SRAM, but
SRAM's latency increases the larger it gets, so at sufficiently large
sizes the two technologies' latencies might equal out. Intel is talking
about a 16MB SRAM L3 cache in some future
quad-core processor (assume Nehalem), at 45nm. If ZRAM comes online for
AMD soon enough, then a 16MB ZRAM L3 can be substituted for a 2MB SRAM
L3, without much increase in real-estate.

Yousuf Khan
 
Yousuf Khan

David said:
Yup. It becomes an issue when you have one CPU in the package trying
to write, while the other CPU is trying to read or write.

It's not ideal, but it is a lot cheaper and easier to do, and I think
both Intel and AMD make the appropriate choices for their respective
situations.


In Intel's case, the main advantage of the shared L2 is not so much that
the shared cache allows two cores to share each other's data, but that it
gives a single core the flexibility to take over the whole L2 cache for
itself on an as-needed basis. This gives a large
advantage in single-threaded performance. I don't think data sharing
between cores is that much of a big deal yet for anybody.

AMD's shared L3 cache may not be able to act in the same way as Intel's
shared L2 to increase single-thread performance, since it will likely be
a slower path to the core. However, the shared L3 will likely reduce
cache-coherency inter-processor traffic, as cores will search their own
L3 first before sending out a message to another processor over the HT
link. This is good for server applications, if not so much for PC
applications.

Yousuf Khan
 
David Kanter

In Intel's case, the main advantage of the shared L2 is not so much that
the shared cache allows two cores to share each other's data, but
that it allows them the flexibility to allow a single core to take over
the whole L2 cache for itself on an as-needed basis.

I think in the *real* world, that's less important. For single
threaded benchmarks, it is an advantage. But sharing your cache is a
huge advantage for stuff like specjbb, tpcc, or real world stuff where
you have data sharing between processors.

If you look at the time it takes to acquire a lock from another
processor across the FSB (or even HT) versus from a shared cache,
you're talking an order of magnitude or more quicker.
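
As a back-of-the-envelope illustration of that gap, here is a tiny sketch;
the cycle counts are assumed ballpark figures for parts of that era, not
measurements.

# Rough lock hand-off cost for a ~2.66 GHz core. All latencies below are
# assumed round numbers for illustration only.
cpu_ghz = 2.66

latency_cycles = {
    "lock line hits in a shared on-die cache":     15,   # assumed
    "lock line pulled from other socket over HT":  200,  # assumed
    "lock line pulled from other socket over FSB": 300,  # assumed
}

for where, cycles in latency_cycles.items():
    print(f"{where:46s} ~{cycles:3d} cycles (~{cycles / cpu_ghz:.0f} ns)")

# 300 vs. 15 cycles is a ~20x difference, i.e. the "order of magnitude or
# more" gap between a shared-cache hit and a cross-socket transfer.
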
This gives a large
advantage in single-threaded performance. I don't think data sharing
between cores is that much of a big deal yet for anybody.

Not really. There are quite a few people who think otherwise...
AMD's shared L3 cache may not be able to act in the same way as Intel's
shared L2 to increase single-thread performance

Sure it will. It will decrease HT/memory accesses. That's a big win.
, since it will likely be
a slower path to the core. However, the shared L3 will likely reduce
cache-coherency inter-processor traffic, as cores will search their own
L3 first before sending out a message to another processor over the HT
link. This is good for server applications, if not so much for PC
applications.

I fail to see the distinction between the two situations (C2D and
Barcelona).

In each case, the last level of on-die cache is shared between
multiple processors, providing a performance advantage (usually).

DK
 
Sebastian Kaliszewski

David said:
I fail to see the distinction between the two situations (C2D and
Barcelona).

In each case, the last level of on-die cache is shared between
multiple processors, providing a performance advantage (usually).

The distinction is that Barcelona should do that at the 4-core level and
C2D at the 2-core level.


rgds
\SK
 
David Kanter

L3 doesn't need to have as low latency as lower-level caches. It's also
probably not as heavily accessed as the lower-level caches.

I think I need to be more clear. What do you mean by 'design
flexibility'? And what do you mean by less performance critical?
Since 4
cores accessing the same memory would require more traffic management,
which adds latency, the fact that the cores are relying much less on the
L3 than L2 will result in less weighted-average latency increase.

In fact, L3 might be ideal for the experimental caches like AMD's ZRAM,
or Intel's FB-RAM.

ZRAM was developed by ISi. What is FB-RAM? Is that Intel's version?
Neither of those have the low-latency of SRAM, but
SRAM's latency increases the larger and larger it gets, so at
sufficiently large enough sizes, the two technologies' latencies might
equal out.

There certainly will be a point where a denser cache is faster,
despite using 'slower' technology. It all really depends on your
interconnects.
Intel is talking about a 16MB SRAM L3 cache in some future
quad-core processor (assume Nehalem), at 45nm.

That's pretty reasonable. They already have a 16MB L3 for Tulsa.
If ZRAM comes online for
AMD soon enough, then a 16MB ZRAM L3 can be substituted for a 2MB SRAM
L3, without much increase in real-estate.

Your numbers are a little off. The ZRAM folks claim that really at
best you can get a 5x improvement in density, not 8x. Secondly, I
have yet to see any actual proof of this, especially for high
performance caches. Thirdly, it's not clear how flexible AMD's L3
controller and interface is; there's a limit to how much it can
manage. Also, AMD would probably want to redo several aspects of the
uarch to support a larger L3 cache. I suspect that AMD might do some
in-house trials of ZRAM using Barcelona, but I think if they were going
to change, they'd do it during a die shrink or uarch change.

DK
 
Yousuf Khan

I think I need to be more clear. What do you mean by 'design
flexibility'? And what do you mean by less performance critical?

As I just said above, the L3 is less latency-sensitive, so you can place
it farther away from the cores without much increase in latency.

As for "performance critical", again it just means that it has higher
latency therefore it's likely going to be accessed less often. Anything
found in L3 will likely be copied directly into L1 first, which will
then eventually migrate back down to L2 again. So they're not going to
need to access the L3 that often once it's found in L1 or L2.
ZRAM was developed by ISi. What is FB-RAM? Is that Intel's version?

Yes, that's what I wrote about in the thread, "Intel is doing ZRAM too".
It means Floating-Body RAM. It's a form of RAM that uses the capacitive
properties of SOI. Of course in Intel's case, since they don't use SOI,
they will need to convert bulk wafers into SOI in certain areas first.
There certainly will be a point where a denser cache is faster,
despite using 'slower' technology. It all really depends on your
interconnects.

Yeah, at some point a high latency, high capacity cache might still be
useful simply because a lot of data will be found within it, even if
it's not the fastest cache you could possibly have.
Your numbers are a little off. The ZRAM folks claim that really at
best you can get a 5x improvement in density, not 8x. Secondly, I
have yet to see any actual proof of this, especially for high
performance caches. Thirdly, it's not clear how flexible AMD's L3
controller and interface is; there's a limit to how much it can
manage. Also, AMD would probably want to redo several aspects of the
uarch to support a larger L3 cache. I suspect that AMD might do some
inhouse trials of ZRAM using barcelona, but I think if they were going
to change, they'd do it during a die shrink or uarch change.

I never said that 16MB ZRAM is identical in size to 2MB SRAM, I just
said, "without much increase in real-estate".

Anyways, I'm not saying we'll see a ZRAM L3 in the upcoming Barcelona
generation of Opteron. I think there will likely be a few more
generations yet before we see it.

Also recently, ISi made a new breakthrough in ZRAM which they call ZRAM2.
This new version is much easier to deal with, because it's much easier
to read its state than the ZRAM1. AMD also has a license for this form,
so I don't know if they're going to go with ZRAM1 or 2, but if they go
with ZRAM2, it might require more validation. But ZRAM2 is apparently
also able to scale better towards smaller processes, so it provides a
bit of future-proofing.

Yousuf Khan
 
Yousuf Khan

I think in the *real* world, that's less important. For single
threaded benchmarks, it is an advantage. But sharing your cache is a
huge advantage for stuff like specjbb, tpcc, or real world stuff where
you have data sharing between processors.

Depends on whose "real world" you're talking about. In PC applications,
the ability to share data is mainly useless, single-threaded performance
still rules the day. In servers, sure, shared data between threads is
useful.

If you look at the time it takes to acquire a lock from another
processor across the FSB (or even HT) versus from a shared cache,
you're talking an order of magnitude or more quicker.

In the case of HT, it's not so much locking the data that takes time as
it is copying the data from the remote processor over the links into
your own local processor's cache.

Not really. There's quite a few people who think otherwise...

Different work definitions.

Sure it will. It will decrease HT/memory accesses. That's a big win.

Which is what I think I had said just below that:

I fail to see the distinction between the two situations (C2D and
Barcelona).

The Barcelona's shared L3 will be used mainly for "shared data between
cores" purposes, but not for "single-threaded overdrive" purposes. The
C2D having a shared L2, each of the cores accesses that L2 directly, so
therefore any single core can take over larger portions of the L2 as the
need arises, thus increasing single-thread performance, overdriving it
if you will.

The cores in Barcelona will not be reading anything directly in from the
L3; it will all be first read from L3 into L1. The cores only ever read
directly from their own L1 or L2 in the AMD64 architecture. So in
Barcelona you can't simply have a core allocate a large portion of the
L3 if it needs to increase its single-thread performance. Granted,
something like that effect can occur but only in slow domino-effect
stages, as data overflows each lower-level cache and gets ejected into
the next higher level cache. Eventually after enough overflowage, a
really busy AMD64 core might be able to take over the entire L3 for
itself, just like a really busy C2D core could take over the full L2 for
itself.

In each case, the last level of on-die cache is shared between
multiple processors, providing a performance advantage (usually).



I don't think in AMD64 there is a direct linkage between the cores and
the L3 cache. The cores only directly control L1 and L2, but L3 might be
an automatic catch basin for data that's overflowed those first two
caches. I could be wrong on this, and there might be a direct read pipe
from L3 to each core. In current AMD64, there is a similar mechanism
with the onboard memory controller. Data that has to be fetched from
system RAM is never read directly from the memory controller into the
core; instead it is read into the memory controller, which then passes it
on to the L1, which then passes it to the core. The core may also
occasionally read directly out of the L2, if data is not found in L1.

Yousuf Khan
 
David Kanter

As I just said above, L3 is less latency sensitive, therefore you can
place it in places far away from the cores without much increase in
latency.

As for "performance critical", again it just means that it has higher
latency therefore it's likely going to be accessed less often. Anything
found in L3 will likely be copied directly into L1 first, which will
then eventually migrate back down to L2 again. So they're not going to
need to access the L3 that often once it's found in L1 or L2.

I don't think you understand the cache architecture for Barcelona.
The L2 and L3 are victim caches. That means they are all exclusive.
If data is in one level it cannot be in another.
Yes, that's what I wrote about in the thread, "Intel is doing ZRAM too".
It means Floating-Body RAM. It's a form of RAM that uses the capacitive
properties of SOI. Of course in Intel's case, since they don't use SOI,
they will need to convert bulk wafers into SOI in certain areas first.

I suspect Intel would be more likely to take an MCM approach and just
fab a huge external, but on-package cache.
Yeah, at some point a high latency, high capacity cache might still be
useful simply because a lot of data will be found within it, even if
it's not the fastest cache you could possibly have.

What I meant was that because ZRAM is denser, you use less wire to
route between stuff inside. At a certain point that advantage will
outweigh the faster cells in SRAM; it's a question of storage
retrieval time (where ZRAM is inferior) versus interconnect time
(where ZRAM could be superior). The problem is that everyone already
has REALLY fast SRAMs.
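
That crossover can be shown with a toy latency model: cell access time plus
a wire-delay term that grows with the array's linear dimension (the square
root of its area). Every number below is invented purely for illustration.

import math

SRAM_CELL_NS, ZRAM_CELL_NS = 0.5, 2.0   # assumed cell access times
ZRAM_DENSITY = 5.0                      # assumed density advantage over SRAM
WIRE_NS_PER_MM = 0.25                   # assumed wire delay per mm
SRAM_MM2_PER_MB = 4.0                   # assumed SRAM array area per MB

def access_ns(size_mb, cell_ns, mm2_per_mb):
    """Total access time = cell time + wire delay across the array."""
    return cell_ns + WIRE_NS_PER_MM * math.sqrt(size_mb * mm2_per_mb)

for size in (2, 8, 16, 32, 64):
    sram = access_ns(size, SRAM_CELL_NS, SRAM_MM2_PER_MB)
    zram = access_ns(size, ZRAM_CELL_NS, SRAM_MM2_PER_MB / ZRAM_DENSITY)
    faster = "ZRAM" if zram < sram else "SRAM"
    print(f"{size:3d} MB: SRAM {sram:4.2f} ns  ZRAM {zram:4.2f} ns  -> {faster}")

# With these made-up numbers the slower-but-denser cell only wins somewhere
# past ~32 MB; with real, very fast SRAMs the crossover sits further out.
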
I never said that 16MB ZRAM is identical in size to 2MB SRAM, I just
said, "without much increase in real-estate".

Uh. You'd be increasing the size by 50% for what is already a rather
large die. I consider a 1.5x increase in real-estate to be 'a lot'.
Anyways, I'm not saying we'll see a ZRAM L3 in the upcoming Barcelona
generation of Opteron. I think there will likely be a few more
generations yet before we see it.

I'm glad we can agree on this...ISTR when I made that claim a while
ago some folks in this NG jumped on me for it.
Also recently, ISi made a new breakthrough in ZRAM which they call ZRAM2.
This new version is much easier to deal with, because it's much easier
to read its state than the ZRAM1.

AMD also has a license for this form,
so I don't know if they're going to go with ZRAM1 or 2, but if they go
with ZRAM2, it might require more validation. But ZRAM2 is apparently
also able to scale better towards smaller processes, so it provides a
bit of future-proofing.

ZRAM2 sounded usable, ZRAM1 did not. I heard a lot of folks at ISSCC
express skepticism about the R/W margins on ZRAM v1, which is the
problem they fixed.

DK
 
David Kanter

Depends on whose "real world" you're talking about. In PC applications,
the ability to share data is mainly useless, single-threaded performance
still rules the day. In servers, sure, shared data between threads is
useful.

So you don't think applications like Office might be multithreaded?
Excel certainly is in newer versions.
In the case of HT, it's not so much locking the data that takes time, as
it is in copying the data from the remote processor over the links to
your own local processor cache that matters most.

That depends on how frequently you encounter locks.
Different work definitions.


Which is what I think I had said just below that:

All shared caches reduce CC traffic. In fact all caches do. A cache
miss always results in a snoop, a cache hit never does.
The Barcelona's shared L3 will be used mainly for "shared data between
cores" purposes

Just like C2D's L2? You know what happens when there is no other
process to share with? It all gets used by one process...just like
C2D.
, but not for "single-threaded overdrive" purposes.

No offense, but judging by your understanding of Barcelona's cache
architecture, I'm a little skeptical of your opinion on this subject.
Shared caches can always be used by a single thread, or perhaps more
importantly, they can be split unevenly by different threads. That is
an inherent advantage.
The
C2D having a shared L2, each of the cores accesses that L2 directly, so
therefore any single core can take over larger portions of the L2 as the
need arises, thus increasing single-thread performance, overdriving it
if you will.

So are you claiming that barcelona only lets a core take over a
particular portion of the cache? Maybe 1/4?
The cores in Barcelona will not be reading anything directly in from L3,
it will all be first read from L3 to L1.

No. All reads in any K8 based architecture first go into L1, then are
evicted to L2, then would be evicted to the L3. Go read the MPF
presentation.
The cores only ever read
directly from their own L1 or L2 in the AMD64 architecture.

No, the cores only ever read directly from their own L1. When you
miss in L1 and hit in L2, you have to swap the cache line in, and then
send something out to the L2. When you hit in L3, you would swap into
L1 as well. When the L2 gets full it evicts to the L3. When you
fetch from system memory that goes directly into the L1. IOW, the L3
functions like a big victim buffer for the L2, which acts like a
victim buffer for the L1. That's a "victim buffer architecture".
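
A toy sketch of that victim-buffer behavior: a line lives in exactly one
level, fills always land in the L1, and evictions cascade L1 -> L2 -> L3.
The capacities are tiny invented numbers, and per-core L1/L2 versus the
shared L3, associativity, etc. are all ignored.

from collections import OrderedDict

class VictimHierarchy:
    def __init__(self, l1_lines=2, l2_lines=4, l3_lines=8):
        self.caps = [l1_lines, l2_lines, l3_lines]
        self.levels = [OrderedDict(), OrderedDict(), OrderedDict()]  # L1, L2, L3

    def access(self, line):
        for i, level in enumerate(self.levels):
            if line in level:
                del level[line]                # exclusive: line leaves this level
                result = f"hit in L{i + 1}"
                break
        else:
            result = "miss (filled from memory)"
        self._insert(0, line)                  # data always lands in the L1
        return result

    def _insert(self, i, line):
        level = self.levels[i]
        if len(level) >= self.caps[i]:
            victim, _ = level.popitem(last=False)    # evict the oldest line...
            if i + 1 < len(self.levels):
                self._insert(i + 1, victim)          # ...cascading it outward
        level[line] = True

h = VictimHierarchy()
for addr in ["A", "B", "C", "D", "A", "E", "F", "A"]:
    result = h.access(addr)
    print(f"access {addr}: {result:26s} "
          f"L1={list(h.levels[0])} L2={list(h.levels[1])} L3={list(h.levels[2])}")
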
So in
Barcelona you can't simply have a core allocate a large portion of the
L3 if it needs to increase its single-thread performance.

That doesn't follow at all, and you really don't seem to understand
how the caches actually work.
Granted,
something like that effect can occur but only in slow domino-effect
stages, as data overflows each lower-level cache and gets ejected into
the next higher level cache. Eventually after enough overflowage, a
really busy AMD64 core might be able to take over the entire L3 for
itself, just like a really busy C2D core could take over the full L2 for
itself.

I don't think in AMD64 there is a direct linkage between the cores and
the L3 cache. The cores only directly control L1 and L2, but L3 might be
an automatic catch basin for data that's overflowed those first two
caches. I could be wrong on this, and there might be a direct read pipe
from L3 to each core.

Yes, it's called a load-store unit. What you are thinking about is an
architecture where the L3 is inclusive of L1, but L2 is exclusive of
both. That has the advantage that snoop probes don't disturb the L1,
but AMD probably already handles this by replicating the tags (I'd
wager). In the K8L all three levels are exclusive of one another.
In current AMD64, there is a similar mechanism
with the onboard memory controller. Data that has to be fetched from
system RAM is never read directly from the memory controller into the
core; instead it is read into the memory controller which then passes it
onto the L1 which then passes it to the core. The core may also
occasionally read directly out of the L2, if data is not found in L1.

No, any memory request which misses in the L1 will fill the L1,
although this probably occurs simultaneous to providing the data on a
forwarding bus.

DK
 
Yousuf Khan

David said:
I don't think you understand the cache architecture for Barcelona.
The L2 and L3 are victim caches. That means they are all exclusive.
If data is in one level it cannot be in another.

The L2 is, but not the L3. The L3 clearly can't be mutually exclusive
since it is a shared cache, and there can be stuff in each core's private
caches that can also be in the shared cache.
I suspect Intel would be more likely to take an MCM approach and just
fab a huge external, but on-package cache.

It's an interesting idea; you could possibly outfit a processor's entire
system RAM in ZRAM if you put these things off-die.
What I meant was that because ZRAM is denser, you use less wire to
route between stuff inside. At a certain point that advantage will
outweigh the faster cells in SRAM; it's a question of storage
retrieval time (where ZRAM is inferior) versus interconnect time
(where ZRAM could be superior). The problem is that everyone already
has REALLY fast SRAMs.

Yes, that too.
Uh. You'd be increasing the size by 50% for what is already a rather
large die. I consider a 1.5x increase in real-estate to be 'a lot'.

Still clearly not as much of an increase as if you had to do 16MB of
cache in SRAM rather than ZRAM.
ZRAM2 sounded usable, ZRAM1 did not. I heard a lot of folks at ISSCC
express skepticism about the R/W margins on ZRAM v1, which is the
problem they fixed.


For ZRAM1, they actually required comparator cells to distinguish
between a logical 0 and 1, since the voltage differentials were so
minuscule.

Yousuf Khan
 
