Macro-Op fusion does not work in 64-bit mode

YKhan · Aug 1, 2006

"The thing is though, while MOF may be touted as the best thing since
sliced bread, it does not cause many performance problems when it is
off. It appears that the bottleneck in the CPU is not in that aspect of
the pipeline, so its loss has little speed impact. More on this when
the testing is complete."
http://www.theinquirer.net/default.aspx?article=33347

Macro-op Fusion was one of the big hype items of
Conroe/Merom/Woodcrest. This feature is supposed to be one of the
things giving Intel its edge over AMD in the performance wars. Now it
turns out that it doesn't even work in 64-bit mode. But apparently it's
no big deal. Most of us have already figured out that the real secret
behind CMW is its big L2 cache, but Intel downplayed that. So Intel
can't have it both ways: either MOF is important, and Intel will have
to explain why it isn't available in 64-bit mode and why CMW is
crippled in that mode, or MOF isn't important, and Intel has to admit
that it's all due to the cache.

Yousuf Khan

The little lost angel · Aug 1, 2006

and Intel has to admit that it's all due to the cache.

Is it really just the cache and nothing else?

David Kanter · Aug 1, 2006

[snip]

Intel has to admit
that it's all due to the cache.

Or how about "None of the above".

DK

Tony Hill · Aug 1, 2006

"The thing is though, while MOF may be touted as the best thing since
sliced bread, it does not cause many performance problems when it is
off. It appears that the bottleneck in the CPU is not in that aspect of
the pipeline, so its loss has little speed impact. More on this when
the testing is complete."
http://www.theinquirer.net/default.aspx?article=33347

Macro-op Fusion was one of the big hype items of the
Conroe/Merom/Woodcrest. This feature is supposed to be one of the
things giving Intel its edge over AMD in the performance wars. Now it
turns out that it doesn't even work in 64-bit mode. But apparently it's
no big deal. Most of us have already figured out that the real secret
behind CMW is its big L2 cache, but Intel downplayed that. So Intel

Actually I've been rather adamant that there are a LOT of factors that
are affecting performance in the Core architecture. Sure, the extra
cache helps. Faster bus speed helps too, and more pipelines, better
decoders, an excellent branch predictor, improved TLB and hey, even
Macro-Op Fusion, just to name a few. Take away any one of these and
you are going to lose some performance. Going from 4MB to 2MB of
cache costs about 3.5% performance (see:
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2795&p=4), while
1MB of L2 would probably drop performance further. Substantial, yes,
but not nearly enough to account for the improvements over either the
Athlon64 X2 or the Core Duo (Yonah) chips before it.

can't have it both ways: either MOF is important, and Intel will have
to explain why it isn't available in 64-bit mode and why CMW is
crippled in that mode, or MOF isn't important, and Intel has to admit
that it's all due to the cache.

Or they just tell the truth that Macro-Op Fusion is just one of many
features that helps performance. It's also supposed to reduce power
consumption slightly. In all it's damn near impossible to predict
just how much the loss of this one feature will really change things
though, since there are many other variables that come into play here.
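
For anyone unsure what the feature actually does, here is a rough sketch, in
Python, of the idea behind macro-op fusion: the decoder treats a compare
(cmp or test) followed immediately by a conditional jump as a single op.
The function names and the instruction list are made up for illustration,
and the pairing rule is simplified (the real hardware has more restrictions
on which pairs qualify).

```python
# Toy model only: mnemonics are plain strings, operands are ignored.
COMPARE_OPS = {"cmp", "test"}


def is_cond_jump(mnemonic):
    """Simplified assumption: any j* mnemonic other than jmp is conditional."""
    return mnemonic.startswith("j") and mnemonic != "jmp"


def count_decoded_ops(mnemonics, fusion_enabled):
    """Count ops leaving the decoders, fusing compare+branch pairs if enabled."""
    count = 0
    i = 0
    while i < len(mnemonics):
        can_fuse = (
            fusion_enabled
            and mnemonics[i] in COMPARE_OPS
            and i + 1 < len(mnemonics)
            and is_cond_jump(mnemonics[i + 1])
        )
        # A fused pair consumes two instructions but counts as one op.
        i += 2 if can_fuse else 1
        count += 1
    return count


# Hypothetical loop body: load, accumulate, bump counter, compare, branch.
loop_body = ["mov", "add", "inc", "cmp", "jne"]
ops_without_fusion = count_decoded_ops(loop_body, fusion_enabled=False)
ops_with_fusion = count_decoded_ops(loop_body, fusion_enabled=True)
```

On this toy loop body, fusion saves one decoded op out of five per
iteration, which by itself says nothing about whether the decoders were the
bottleneck in the first place.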

Mark · Aug 1, 2006

YKhan said:
"The thing is though, while MOF may be touted as the best thing since
sliced bread, it does not cause many performance problems when it is
off. It appears that the bottleneck in the CPU is not in that aspect of
the pipeline, so its loss has little speed impact. More on this when
the testing is complete."
http://www.theinquirer.net/default.aspx?article=33347

Macro-op Fusion was one of the big hype items of the
Conroe/Merom/Woodcrest. This feature is supposed to be one of the
things giving Intel its edge over AMD in the performance wars. Now it
turns out that it doesn't even work in 64-bit mode. But apparently it's
no big deal. Most of us have already figured out that the real secret
behind CMW is its big L2 cache, but Intel downplayed that. So Intel
can't have it both ways: either MOF is important, and Intel will have
to explain why it isn't available in 64-bit mode and why CMW is
crippled in that mode, or MOF isn't important, and Intel has to admit
that it's all due to the cache.

Yousuf Khan

If you actually looked at the benchmarks, you would realize that the
improved performance cannot be attributed to the cache alone.

Yousuf Khan · Aug 6, 2006

The little lost angel said:
Is it really just the cache and nothing else?

Well, it might also be the predictive algorithms for populating the
cache, but that's really part of the cache.

Yousuf Khan

Yousuf Khan · Aug 6, 2006

Mark said:
If you actually looked at the benchmarks, you would realize that the
improved performance cannot be attributed to the cache alone.

The cache is 4 times bigger than anything AMD has. What else would it
be? We've already shown it's not macro-op fusion.

Yousuf Khan

Seraphim · Aug 6, 2006

Yousuf said:
Well, it might also be the predictive algorithms for populating the
cache, but that's really part of the cache.

What about other things like the out-of-order load/store? That's memory
and not cache. It seems that everything just adds a small percentage, and
it all adds up, while individually the large cache or whatever does not
appear to be the "key" component.

George Macdonald · Aug 6, 2006

What about other things like the out-of-order load/store? That's memory
and not cache. It seems that everything just adds a small percentage, and
it all adds up, while individually the large cache or whatever does not
appear to be the "key" component.

The out-of-order load/store *is* predictive, in particular the
disambiguation, which was said to include speculative components, without
further elucidation by Intel. The large cache is an important part of such
a strategy to avoid/minimize negative effects. It's quite rare for
microarchitecture tweaks like op-fusion, or additional pipeline paths, to
yield benefits which are consistently measurable.

I *do* wish that the benchmarkers would quit quoting "latency" performance
using a program which is now clearly insufficient for the job.

krw · Aug 6, 2006

Well, it might also be the predictive algorithms for populating the
cache, but that's really part of the cache.

Predictive algorithms are part of the load/store or fetch units,
of which the dcache and icache are part, but I wouldn't say any
prefetching was part of the cache, per se. Caches are pretty dumb.

Sorta like saying the multiply algorithm is part of the register
file...

Carlo Razzeto · Aug 6, 2006

Yousuf Khan said:
The cache is 4 times bigger than anything AMD has. What else would it be?
We've already shown it's not macro-op fusion.

Yousuf Khan

How much impact would something like a wider execution path make? This is
coming from someone who is more of a layman than anything else when it comes
to the specifics of how CPUs actually perform their duties, so I'm asking
out of curiosity. Having read an analysis on the AnandTech website, one
of the key architectural changes they point out is how much wider the Core 2
is compared to a PIII/P4/Athlon64. Core 2, for instance, is the only core
among those that can execute 128-bit SSE instructions in a single cycle. Is
this the type of thing that might add up to create a real impact?

Carlo

Yousuf Khan · Aug 6, 2006

Carlo said:
How much impact would something like a wider execution path make? This is
coming from someone who is more of a layman than anything else when it comes
to the specifics of how CPUs actually perform their duties, so I'm asking
out of curiosity. Having read an analysis on the AnandTech website, one
of the key architectural changes they point out is how much wider the Core 2
is compared to a PIII/P4/Athlon64. Core 2, for instance, is the only core
among those that can execute 128-bit SSE instructions in a single cycle. Is
this the type of thing that might add up to create a real impact?

I'm sure it helps during SSE instructions. Can't see it being a big part
of the equation though, just like SSE itself isn't a big part of programs.

Yousuf Khan

Tony Hill · Aug 9, 2006

The cache is 4 times bigger than anything AMD has.

AMD has chips with 2MB of cache (2 x 1MB) and so does Intel. Intel
chips are MUCH faster, clock for clock, when compared with equal
quantities of cache.

What else would it
be? We've already shown it's not macro-op fusion.

How about the fact that Intel has 4 instruction decoders to AMD's 3,
an extra LOAD/STORE unit, 3 fully pipelined SSE units vs. K8's 2
partially pipelined, more and better branch predictors, much larger
TLBs, larger OoO reorder buffer, more advanced scheduler... to name a
few. And that's entirely separate from the better data prefetching
and greater cache bandwidth that, as you mentioned in another message,
are all related to cache.

Besides, we don't really know how much macro-op fusion really is
helping since we haven't seen any apples to apples comparison. 32-bit
with macro-op fusion vs. 64-bit without it doesn't really help, even
if only relative to AMD's 32-bit vs. 64-bit numbers. Intel might have
just done a better implementation of 64-bit x86 (AMD's K8 does have a
compromise or two in 64-bit mode as well) and that made up for the
loss in performance from Macro-op Fusion.

Long story short, there is a LOT more to the Core architecture than
just cache. Other than the integrated memory controller, Core is a
more advanced chip start to finish when compared to AMD's K8.
Fortunately for AMD, most of these advantages are incremental in
nature and their more modular K8L design could theoretically allow
them to phase such features into future processors.

George Macdonald · Aug 10, 2006

AMD has chips with 2MB of cache (2 x 1MB) and so does Intel. Intel
chips are MUCH faster, clock for clock, when compared with equal
quantities of cache.

I think what Yousuf is getting at is that in a single task benchmark
situation, you have 4MB of L2 cache for that single task, multithreaded or
not.

How about the fact that Intel has 4 instruction decoders to AMD's 3,
an extra LOAD/STORE unit, 3 fully pipelined SSE units vs. K8's 2
partially pipelined, more and better branch predictors, much larger
TLBs, larger OoO reorder buffer, more advanced scheduler... to name a
few. And that's entirely separate from the better data prefetching
and greater cache bandwidth that, as you mentioned in another message,
are all related to cache.

Looking back, it's not often that inner core microarchitecture tweaks have
yielded that much performance benefit. To me there are two clues here:

1) The fact that there are benchmarks where C2D shows near-zero benefit vs.
AMD64 points to the memory/cache subsystem and how it's manipulated as the
important provider of performance in the other benchmarks where C2D wins
handily. In particular, when disambiguation "hits", it hits *big*; when it
"misses", the penalty drags performance back down. When it "hits", it
depends heavily on the large cache and associativity to avoid thrashing.

2) The ridiculous C2D "latency" measurements being published, all using the
same chipset where a P4 is a latency dog, are an indication that
speculation on stride size and Load/Store re-ordering make a *huge*
contribution to performance. Of course what this really means is that the
current latency benchmark is obsolete but it makes no sense that a system
with FSB, where the real round-trip latency is illustrated by the P4
measurements, can beat a system with an on-board memory controller. Again,
without the large L2 cache, the strategy would fall down.
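
To make the stride point concrete, here is a minimal sketch of the two
access orders a latency test can walk. It is in Python for readability; a
real latency test would be written in C or assembly, and all the names here
are hypothetical. A fixed-stride walk is exactly what a stride predictor can
guess, while a random single-cycle walk (built with Sattolo's shuffle) gives
it no fixed stride to lock onto, and each load depends on the previous one.

```python
import random


def stride_cycle(n, stride=1):
    """next_idx[i] is the slot visited after slot i, at a fixed stride."""
    return [(i + stride) % n for i in range(n)]


def sattolo_cycle(n, seed=0):
    """Random permutation forming ONE cycle of length n (Sattolo's algorithm)."""
    rng = random.Random(seed)
    next_idx = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        next_idx[i], next_idx[j] = next_idx[j], next_idx[i]
    return next_idx


def chase(next_idx, steps, start=0):
    """Follow the chain; each step needs the result of the previous load."""
    i = start
    for _ in range(steps):
        i = next_idx[i]
    return i


def slots_visited(next_idx, start=0):
    """Number of distinct slots reached before the walk returns to start."""
    seen = set()
    i = start
    while i not in seen:
        seen.add(i)
        i = next_idx[i]
    return len(seen)
```

A latency test would time `chase` over a buffer much larger than the cache,
once with each ordering. If the two numbers differ a lot, the strided figure
is measuring the prefetcher rather than the round trip to memory.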

Besides, we don't really know how much macro-op fusion really is
helping since we haven't seen any apples to apples comparison. 32-bit
with macro-op fusion vs. 64-bit without it doesn't really help, even
if only relative to AMD's 32-bit vs. 64-bit numbers. Intel might have
just done a better implementation of 64-bit x86 (AMD's K8 does have a
compromise or two in 64-bit mode as well) and that made up for the
loss in performance from Macro-op Fusion.

Long story short, there is a LOT more to the Core architecture than
just cache. Other than the integrated memory controller, Core is a
more advanced chip start to finish when compared to AMD's K8.
Fortunately for AMD, most of these advantages are incremental in
nature and their more modular K8L design could theoretically allow
them to phase such features into future processors.

"Incremental" is correct.;-)
