Um; now I'm really confused. I thought that the P4EE had a 64 bit
data bus, compared with 32 bit on the regular P, P2, P3, P4,
Athlon, and Duron. Is that not correct?
Nope!
All CPUs since the original Pentium have a 64-bit data bus. I think what
you're looking for is dual channel or 128-bit buses.
Socket 939 and 940 Athlon64s have a 128-bit bus.
No P4 has a dual channel memory bus itself. But some mobos/chipsets have
a dual channel bus. The bus from the memory controller (Northbridge) to
the CPU, the FSB, is still 64 bit. Same thing with AthlonXPs. But the P4C
at 800FSB can use the dual channel better than AthlonXPs. The P4EE is
nothing but a P4 with a large L3 cache. Not an L2 cache! So it doesn't
benefit quite that much from it.
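To put rough numbers on the FSB argument above, here is a small sketch of
the theoretical peak figures. It assumes a 64-bit (8 byte) bus per channel,
400 million transfers per second for DDR400 and for the fastest AthlonXP
FSB, and 800 million for the P4C FSB. The function name is made up for
illustration, and real sustained bandwidth will be lower than these peaks.

```python
def peak_bandwidth_gb_s(bus_width_bits, mega_transfers_per_sec, channels=1):
    """Theoretical peak bandwidth in GB/s, with 1 GB taken as 10**9 bytes."""
    bytes_per_transfer = bus_width_bits // 8
    bytes_per_sec = bytes_per_transfer * mega_transfers_per_sec * 1e6 * channels
    return bytes_per_sec / 1e9

# Memory side: two 64-bit DDR400 channels.
dual_ddr400 = peak_bandwidth_gb_s(64, 400, channels=2)

# CPU side: a single 64-bit front side bus in both cases.
p4c_fsb800 = peak_bandwidth_gb_s(64, 800)
athlonxp_fsb400 = peak_bandwidth_gb_s(64, 400)  # assumed 400 MT/s FSB

print(f"dual channel DDR400: {dual_ddr400:.1f} GB/s")
print(f"P4C 800 FSB:         {p4c_fsb800:.1f} GB/s")
print(f"AthlonXP 400 FSB:    {athlonxp_fsb400:.1f} GB/s")
```

Under these assumptions the 800 FSB can carry everything two DDR400
channels deliver, while the AthlonXP FSB tops out at what a single channel
already provides.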
Provided you use a dual channel DDR400 mobo, and not more than two
memory sticks, the P4C's memory bandwidth is much better. L2 cache
latency is also much better on the P4. That's the easy part of the
answer. Unfortunately, the P4 often seems to have problems translating
those advantages into better real world performance.
As long as it's huge sequential blocks of data being moved about, or
having fairly simple operations done on them, the P4 does very well
with its memory bandwidth.
But I can't answer your question regarding DDR400 AthlonXP vs 800MHz
P4. The Athlons have lower bandwidth, but also a very big L1 cache and
vastly superior branch handling and out of order execution.
The Athlon64's memory latency, in turn, is much superior. Memory
bandwidth of the socket 939 and 940 AMD x86-64 CPUs should also be
better.
Some additional information: AMD Opteron, Athlon64 and AthlonFX are
64-bit CPUs. All others are 32-bit. The significance of these bits is
the address and general-purpose register width of the CPU instructions,
not the width of any data bus. Plus, the 64-bit instruction set is
extended to use more registers, and in a more rational manner. In all
discussions and benchmarks so far, these Athlon64s are treated and used
as 32-bit CPUs, using a 32-bit OS and 32-bit software. Even so, they
still kick ass. With 64-bit software, they should really start to look
interesting.
I read that the A64 has a 64 bit data bus and a "single data
channel", while the A64FX has a 64 bit data bus and a "double data
channel". I'm not sure what is meant by a "data channel" in this
context, but is that what YOU mean?
http://www.nordichardware.com/reviews/cpu/2004/Athlon64_2.2GHz/index.php
My data sets are between 256 Kwords and 512 Kwords, where a word
is a 64 bit double precision float. So it looks like I fall
within that 1-5MB range that you mention.
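As a quick check of that arithmetic, here is a minimal sketch. It assumes
one Kword means 1024 words and one MB means 1024 * 1024 bytes; the helper
name is only for illustration.

```python
BYTES_PER_WORD = 8  # one 64-bit double precision float

def working_set_mb(kwords, bytes_per_word=BYTES_PER_WORD):
    """Size in MB of a data set of `kwords` * 1024 words."""
    total_bytes = kwords * 1024 * bytes_per_word
    return total_bytes / (1024 * 1024)

smallest = working_set_mb(256)
largest = working_set_mb(512)

print(f"256 Kwords = {smallest:.1f} MB")
print(f"512 Kwords = {largest:.1f} MB")
```

So the data sets come to 2 MB and 4 MB, inside the 1-5MB range.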
So, since only about 10% of my operations include branches, it
looks to me like the P4 might be the better choice. Right?
OTOH, I understand that the Athlon has a faster FPU ...
I'm so confused.
Well, from what I've seen, 7% div is enough to break the P4. Even when
the P4 uses vectorized SSE2 optimization, the AthlonXP sails past it
while running old '387 code.
AMD and Intel (post PentiumIII) architectures are wildly different. It
seems to me extremely hard to make comparisons that correlate with
real application performance. I've also come to realize that most
(all?) synthetic benchmarks are useless as well.
Bottom line is, run the application and see. Some broad guesses can be
made, and that is what I've tried to do in this thread.
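"Run the application and see" can be as simple as timing the same job on
each candidate machine. The sketch below is only an illustration:
`best_time` and `example_workload` are hypothetical names, and the
workload is a stand-in that should be replaced with a call into the real
program, since an interpreted loop says little about raw FPU speed.

```python
import time

def best_time(workload, repeats=5):
    """Run workload() `repeats` times; return the fastest run in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    # The minimum is the run least disturbed by other system activity.
    return min(timings)

def example_workload(n=200_000):
    """Hypothetical stand-in: a multiply/divide mix over floats."""
    total = 0.0
    for i in range(1, n + 1):
        total += (i * 1.5) / (i + 0.5)
    return total

if __name__ == "__main__":
    print(f"best of 5 runs: {best_time(example_workload):.4f} s")
```

Taking the best of several runs is a design choice to reduce noise; for a
job that runs for minutes, a single timed run per machine may be enough.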
Much can be done with optimization for the P4. But my take is that the
Northwood/Prescott cores are better geared for media than
math/science/engineering. Sure, a lot of things are just matrix mul,
and P4s can be made to do that blazing fast. So if your code spends
most of the time doing things like that, SSE2 should make a hell of a
difference.
But the AMDs don't have any weak spots. They just crunch away when
a P4 grinds to a halt.
All benchmarks are optimized for the P4. But among real software, only
mainstream applications seem to be.
I've had two disappointing P4 experiences, and I think I'm firmly in
the AMD camp now. I recommend that you not invest any money in any P4
system before trying out your software on one.
You can go that route: Change only motherboard + memory, so you literally
open up a major bottleneck. [...] If you are still not satisfied with the
improvement, go and buy whatever model of Athlon XP (Barton) you can afford:
2500+ to 3200+.
Yes; interestingly enough, after I posted my message I found
http://www.xbitlabs.com/articles/cpu/display/athlon64-3200_14.html,
where a standard XP3200+ did remarkably well against the A64FX and
the P4EE in mathematical analysis benchmarks -- exactly the sorts
of things that I am doing.
It partly depends on the code. The AthlonXP does indeed have the most
powerful '387 FPU in existence. Even more powerful than the K8s'.
But K8s' (Opteron, Athlon64, AthlonFX) vector math unit is even more
powerful, even on scalar FP. So the AMD game plan is that even scalar
math should be compiled for that instead.
Intel's P4 plan is similar: even scalar ops are redirected to SSE2 by
their compiler. But the P4 doesn't shine on scalar FP.
The AthlonXP can also do better than '387 for vectorized
operations.
You have the interesting possibility of optimizing your code for
'enhanced 3DNow'. I don't know how to do that, I'm lazy and use old
and cheap tools. But check AMD's web site for developer information.
This 'enhanced 3DNow' supposedly reaches something like 80% of the P4's
SSE2 max performance, but is supposed to not have the same sensitivity
to fp-mix and branches.
Even though the AthlonXP supports SSE, enhanced 3DNow should be
better. SSE makes the Athlon look better on PIII optimized code, but
isn't the optimum.
(I think some big corp, Lockheed or Boeing, built a supercomputer from
AthlonXPs, for the sole purpose of using enhanced 3DNow for
aerodynamic calculations.)
But if I go with the XP3200+, what do I look for on the
motherboard in terms of "DDR" vs. "dual DDR"? I have looked at
motherboard specs, and "dual DDR" capability seldom seems to be
mentioned. Is it even a concern in my situation?
Dual channel actually is slightly, slightly faster, even on the
AthlonXPs. But they don't make the same use of it as 800MHz FSB P4s do.
It is often regarded as insignificant (for Athlons), particularly in
comparison with later single channel chipsets, like KT600.
Actually, mine is a streaming application. When the program is
running, the hard drive spins down due to inactivity!
Either you or I am confused here, because that is not what I
understand by 'streaming'. I think what is meant by 'streaming' is
that input comes directly from the output of the preceding op. In the
case of the P4, I interpret it as generalized to: when you have 'next
input and op ready at hand'. Basically, that there are no conditional
statements, and that everything to be done, for very large continuous
segments of processing, is fixed, and the data is continuous. Like
moving/factoring/adding/transforming large data blocks.
P.S.
There have been repeated references to P4 Extreme here. I want to warn
against the P4EE (Extreme Edition). It costs around $1000, and while
it does do 15% better on some benchmarks, it doesn't average more than
3% better than a vanilla 3.2P4C. (I guess that'll be something
like 1% on actual applications...). In my mind, if you're that
desperate, it's much more tempting to spend all that money on
cpu-freezing and serious overclocking. The sole reason for the P4EE's
existence is an Intel marketing plan to confuse the market about
AMD's Athlon64.
There is also the P4E. Don't confuse them. This is the new 'Prescott'
core. Unfortunately, it's something like 4-9% slower than the P4C per
clock. It's engineered for higher clockrates, but it's even less
efficient than the Northwood. The P4 of choice, IMO, and I'm much
surer of that than anything else, remains the P4C for now. Even more
so with price cuts. It may all have changed when we reach 4GHz, but
early Prescott buyers are suckers.
Final words: If I dared recommend anything at all, it would probably
be the new Athlon64s. If memory speed is important, socket 939
(currently still unavailable); otherwise socket 754 seems to be doing
well enough.
ancra