Pretty good explanation of x86-64 by HP

Tony Hill · Dec 6, 2004

FWIW, Dell are shipping EM64T-equipped non-Xeon P4 workstations (the
Precision 370).

Ahh, thanks. When I first wrote the above I had actually included
Dell's name as well, but then removed it when I couldn't find any
EM64T P4 processors in any of their servers (didn't think to check
workstations first). I figured that if anyone was selling 64-bit P4s
it would be Dell!

Tony Hill · Dec 6, 2004

It's a bit of a crap argument isn't it? Even if the latency is small,
the fact that it's a NUMA system impacts performance (potentially by a
lot) as the available memory bandwidth is coupled to where you place
your data.

It does, but the difference is small, usually less than 10% and often
much closer to 0%. When well over 90% of your memory access is coming
from cache anyway and (assuming a totally random distribution in a
strictly UMA setup) 50% of your memory access is going to be local,
most of the performance difference is lost in the noise.
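
The reasoning above can be turned into a back-of-the-envelope latency model. This is only a sketch: the function name and every latency figure (5 ns cache, 80 ns local, 120 ns remote) are illustrative assumptions, not measurements from this thread, and the model covers latency only, not bandwidth.

```python
def avg_access_ns(hit_rate, local_fraction, cache_ns, local_ns, remote_ns):
    """Average memory access time for a simple one-level cache model."""
    miss_rate = 1.0 - hit_rate
    # Misses are split between local and remote memory.
    dram_ns = local_fraction * local_ns + (1.0 - local_fraction) * remote_ns
    return hit_rate * cache_ns + miss_rate * dram_ns


# Hypothetical latencies, chosen only to make the arithmetic concrete.
CACHE_NS, LOCAL_NS, REMOTE_NS = 5.0, 80.0, 120.0

for hit_rate in (0.90, 0.99):
    mixed = avg_access_ns(hit_rate, 0.5, CACHE_NS, LOCAL_NS, REMOTE_NS)
    all_local = avg_access_ns(hit_rate, 1.0, CACHE_NS, LOCAL_NS, REMOTE_NS)
    print(f"hit rate {hit_rate:.2f}: {mixed:.2f} ns mixed, {all_local:.2f} ns all-local")
```

The gap between the two figures depends heavily on the assumed hit rate and latencies, so it is worth plugging in numbers for the workload in question.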

Besides, remember that even in a classic UMA environment (ie a 2P or
4P Xeon server... or even a single-processor system) you STILL have
differences in latency depending on where in memory your data resides
due to open vs. closed pages, TLB misses, etc.

Classic example is OpenMP parallelized STREAM. Parallelize all the
loops except the data initialization loop on a system with hard memory
affinity (such as Linux), then parallelize _all_ the loops and explain
how the difference is "not worth headaching over".

Most users don't use their computer to run STREAM though. Even in the
HPC community where memory bandwidth is king, STREAM is still a rather
extreme case.

Bottom line IMO is that pretending that the system isn't NUMA is doing
customers a disservice.

I've said it before and I'll say it again: Hardware is cheap,
software is expensive. It would be a true disservice to your
customers to tell them to spend thousands upon thousands of dollars
changing all their software for the small improvement in performance
equal to a few hundred dollars of hardware costs.

They should know that treating the system as a
UMA one is a bad idea.

Spending lots of money to make all your software NUMA is a bad idea
when treating it as UMA and throwing a tiny amount of extra hardware
at the job will do the trick. That's all that AMD is getting at.

Besides, they do recognize that it is NUMA, just that they are saying
you don't NEED to worry about that if you don't want to because for
the vast majority of times the performance difference is lost in the
noise.

Greg Lindahl · Dec 6, 2004

Tony Hill said:
It does, but the difference is small, usually less than 10% and often
much closer to 0%.

No, it's not. The Opteron builds the best 4-cpu SMP system out there
according to the SPECrate2000 cpu benchmark, but in order to get that
best result, you need to pin the individual processes to cpus and
memory using a utility. Without it, the performance is no longer the
best. So people really care about that last bit of performance.

Now I don't have the direct comparison for that, but here's a
comparison on some benchmarks for a recent competitive bid. "Slow" is
a system without the processor binding and with "node interleave"
turned on. "Fast" is with processor binding and node interleave off,
which lets the processor binding have the best benefit. Note that it's
only a trivial amount of work to get this improvement for a serial
code, so this is a common situation, although these benchmarks are, of
course, particular to this scientific-computing customer. In these
results, the comparison is scaling for 4 processes on a 4 cpu machine.
4.0 would be a perfect score.

              fast   slow   difference
benchmark 1   3.71   3.03   +22%
benchmark 2   3.76   3.29   +14%
benchmark 3   3.78   3.26   +16%
benchmark 4   3.79   3.45   +10%
benchmark 5   3.92   3.89   + 1%
benchmark 6   3.88   3.71   + 5%
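
The difference column in the table above appears to be the gain of "fast" relative to "slow". A short sketch that recomputes it from the posted scores, and also shows each score as a fraction of the perfect 4.0; the function names are made up for this illustration.

```python
# (fast, slow) scaling scores copied from the table; 4.0 would be perfect.
RESULTS = {
    "benchmark 1": (3.71, 3.03),
    "benchmark 2": (3.76, 3.29),
    "benchmark 3": (3.78, 3.26),
    "benchmark 4": (3.79, 3.45),
    "benchmark 5": (3.92, 3.89),
    "benchmark 6": (3.88, 3.71),
}


def gain_percent(fast, slow):
    """Improvement of fast over slow, as a percentage of slow."""
    return (fast / slow - 1.0) * 100.0


def efficiency(score, cpus=4):
    """Fraction of perfect scaling achieved on a machine with `cpus` CPUs."""
    return score / cpus


for name, (fast, slow) in RESULTS.items():
    print(f"{name}: +{gain_percent(fast, slow):.0f}%  "
          f"efficiency {efficiency(fast):.0%} fast, {efficiency(slow):.0%} slow")
```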

These benchmarks were run with the best Opteron compiler, so this
scaling improvement was very good to see. And it's bigger than
"usually less than 10%".

When well over 90% of your memory access is coming
from cache anyway and (assuming a totally random distribution in a
strictly UMA setup) 50% of your memory access is going to be local,
most of the performance difference is lost in the noise.

Handwaving is a bad way to evaluate effects like this.

I've said it before and I'll say it again: Hardware is cheap,
software is expensive. It would be a true disservice to your
customers to tell them to spend thousands upon thousands of dollars
changing all their software for the small improvement in performance
equal to a few hundred dollars of hardware costs.

Customers know what 10% or 20% more performance means, as do vendors
who are doing competitive bidding. The fact that I care a lot about
this should give you a clue. And in some cases, such as serial codes,
the benefits are easy to achieve. It took only a moderate amount of
work in our OpenMP compiler and runtime to get these benefits for some
parallel programs, too. Well worth it to our customers.

-- greg
speaking for myself, not PathScale

Grumble · Dec 6, 2004

Greg said:
These benchmarks were run with the best Opteron compiler [...]

Which compiler would that be? PathScale?

:-)

Per Ekman · Dec 6, 2004

Tony Hill said:
It does, but the difference is small, usually less than 10% and often
much closer to 0%.

And sometimes 50%...

When well over 90% of your memory access is coming from cache anyway
and (assuming a totally random distribution in a strictly UMA setup)
50% of your memory access is going to be local, most of the
performance difference is lost in the noise.

What you are saying is that if the application runs well on a NUMA
machine it will run well on the Opteron. The problem is if the
application _doesn't_ run well on a NUMA box, then it becomes
important to know that this machine is in fact a NUMA machine.

Besides, remember that even in a classic UMA environment (ie a 2P or
4P Xeon server... or even a single-processor system) you STILL have
differences in latency depending on where in memory your data resides
due to open vs. closed pages, TLB misses, etc.

I'm not talking about latency here, I'm talking about bandwidth. Are
your main memory references all going to the same memory controller or
are they shared among the controllers in the system? That's the crux
of the matter.
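
A crude way to see the bandwidth point is a bottleneck model: if each memory controller serves a fixed share of the traffic, the busiest controller saturates first and caps the total. The sketch below assumes identical controllers and ignores HyperTransport link limits and coherency traffic; the 5.0 GB/s per-controller figure is a placeholder, not a measured value.

```python
def sustained_bandwidth(traffic_shares, controller_gb_s):
    """Total GB/s when traffic is split across controllers in fixed shares.

    The controller with the largest share hits its limit first, so the
    total is that controller's bandwidth divided by its share.
    """
    total = sum(traffic_shares)
    busiest_share = max(traffic_shares) / total
    return controller_gb_s / busiest_share


CONTROLLER_GB_S = 5.0  # hypothetical per-controller bandwidth

# All data placed on one node (e.g. a serial initialization loop).
print(sustained_bandwidth([1, 0, 0, 0], CONTROLLER_GB_S))
# Data spread evenly over four nodes.
print(sustained_bandwidth([1, 1, 1, 1], CONTROLLER_GB_S))
```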

Most users don't use their computer to run STREAM though. Even in the
HPC community where memory bandwidth is king, STREAM is still a rather
extreme case.

I admit I'm from the HPC-sector and memory bandwidth is very important
to many applications here. STREAM is an extreme case in some sense,
but the performance differences it can uncover are by no means
insignificant for real codes. Shared memory MPI is one example,
clearly not uncommon software in HPC.

I've said it before and I'll say it again: Hardware is cheap,
software is expensive. It would be a true disservice to your
customers to tell them to spend thousands upon thousands of dollars
changing all their software for the small improvement in performance
equal to a few hundred dollars of hardware costs.

It's a disservice not to let the customers decide themselves. If their
code is memory bandwidth limited and runs poorly on NUMA systems then
you're doing them a disservice by selling them a system where they
_have_ to change their software to get it to run well without telling
them about it.

Spending lots of money to make all your software NUMA is a bad idea
when treating it as UMA and throwing a tiny amount of extra hardware
at the job will do the trick. That's all that AMD is getting at.

Besides, they do recognize that it is NUMA, just that they are saying
you don't NEED to worry about that if you don't want to because for
the vast majority of times the performance difference is lost in the
noise.

It's a pretty strange argument in my eyes, "If you ignore the
applications that run poorly because of property X, then it makes
sense to downplay property X." True, but not helpful if you have such
an application.

*p

Janne Blomqvist · Dec 6, 2004

It could
be that Intel still has a reasonable amount of inventory of their old
"Northwood" P4 chips and they want to clear those out first, but that
certainly doesn't seem to be the case looking at Intel's pricing
structure and what is being sold by the major OEMs (Intel seems to be
pushing Prescott VERY hard here).

A friend recently (1 month ago IIRC) wanted a Northwood for his DIY
computer, but he found that none of the usual suspects around here had
them in stock. Eventually he called the importer, who said that
they're out of stock and they're not getting any more either, buy a
Prescott instead.

Long story short, I'm not quite sure what the actual answer is, but
excessive inventory of 32-bit chips doesn't seem to make sense from
what I've seen.

Considering the rate chips depreciate I guess manufacturers think
pretty hard about what they can do to minimize inventory.

Eugene Nalimov · Dec 6, 2004

Greg Lindahl said:
...
These benchmarks were run with the best Opteron compiler
...

Visual C?

Thanks,
Eugene

Rob Stow · Dec 6, 2004

George said:
Thanks for the data but no I guess I should have highlighted better what I
was getting at: "the memory controller is integrated into and operates at
the core speed of the processor", which is what was being
discussed/disputed in another thread.

I haven't been able to find any hard data from AMD on where the clock
domain boundaries are in the Opteron/Athlon64 but if the memory controller
is not operating at "core speed" it's now at the stage of Internet
Folklore.

Ah, that one is much easier to answer. ;-)

Straight from the horse's mouth:
http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_4699_7981^7983,00.html

"By running at the processor’s core frequency, an integrated
memory controller greatly increases bandwidth directly available
to the processor at significantly reduced latencies."

Yousuf Khan · Dec 6, 2004

keith said:
I'd say that because in small systems (less than 8 CPUs), Opterons are
coherent in hardware thus sufficiently tightly coupled to be called UMA,
as far as the user is concerned.

Yes, exactly my point, it's more or less UMA in the up to 8 processor
range. After that, then you can start thinking of it as NUMA. But having
up to 8 processors being treated as UMA is quite a lot.

Yousuf Khan

Yousuf Khan · Dec 6, 2004

Per said:
It's a bit of a crap argument isn't it? Even if the latency is small,
the fact that it's a NUMA system impacts performance (potentially by a
lot) as the available memory bandwidth is coupled to where you place
your data.

Actually, there was a story here not so long ago where one of the Linux
distros had been optimized up with NUMA assumptions, and it actually ran
/slower/ than a non-NUMA kernel. In other words the Linux kernel might
have spent more time making complex decisions about memory placement
than it was actually going to save from the latencies.

Accessing memory through the Hypertransport links should not be any
worse than the traditional front-side bus arrangement like in Intel
processors. So it will match the Intel architectures that way, at the
very least. And whenever it goes through its own local memory
controllers, it blows the Intel architectures away.

Yousuf Khan

George Macdonald · Dec 7, 2004

Ah, that one is much easier to answer. ;-)

Straight from the horse's mouth:
http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_4699_7981^7983,00.html

"By running at the processor’s core frequency, an integrated
memory controller greatly increases bandwidth directly available
to the processor at significantly reduced latencies."

Ah so there we have it... assuming this has been approved by the technical
folks. :-)

BTW I notice that AMD seems to be cutting back on the depth of info
in their technical docs - the Product Data Sheets now consist of one
page... a far cry from the excruciating detail on cache operation etc. we
used to get.

Rgds, George Macdonald

"Just because they're paranoid doesn't mean you're not psychotic" - Who, me??

keith · Dec 7, 2004

Note that the STREAM bandwidth and lmbench latency change with every
cpu speed bump. So clearly part of the memory controller is at the cpu
core frequency, or a related frequency, and not at the HT frequency,
or the SDRAM external bus frequency.

That does *not* mean that the memory controller runs at the core speed.

It would be nuts to assume such. Would you assume the caches of the
PII run at the I/O bus speed?

Please reduce the cross-post. Followups set to a group I read.

Isn't this a rather egotistical statement? "I don't read other
groups, so no one else matters!" Hint: Others are reading this thread
from other groups! It's posted to *three* related groups (hardly a breach
of USENET protocol).

John Savard · Dec 7, 2004

I found this whitepaper from HP to be pretty good, it is surprisingly
candid, considering HP was the coinventor of the Itanium. It does a
pretty good job of explaining and summarizing the similarities and
differences between AMD64 and EM64T, and their comparison to the
Itanium's IA64 instruction set. AMD64 and EM64T are "broadly
compatible", but IA64 is a different animal altogether.

http://h200001.www2.hp.com/bc/docs/support/SupportManual/c00238028/c00238028.pdf

I would have preferred if you had given the URL of a page with a *link*
on it to this manual. That would make it easier to back-navigate for
other items of related interest, and it would have meant that the manual
could be downloaded with a right-click without waiting for the browser
plug-in to display the whole manual.

On page 13, under the heading "Power Considerations", I noticed a real
whopper. Or, at least, what _seemed_ to me to be a real whopper
initially.

It is true that for a given implementation, a higher clock speed means
more power consumption. It takes more power to make gates switch faster.

However, if a higher clock speed is obtained by splitting the pipeline
into more itty-bitty pieces, for the same level of instruction latency,
then one still has the same number of gates, each consuming the same
amount of power. (Except for the overhead of the pipelining process...
and one more thing to be noted later.)

What is the point of splitting up a pipeline into smaller pieces? Is it
to put more megahertz in the ad copy? No, it is so that more
instructions can be executing, in different stages, at once. (Which
means that a Pentium IV ought to have explicit vector instructions. Yes,
it has a separate instruction cache and data cache, but there's still
only one bus to *main memory*, and caches do have to get filled from
somewhere.)

Since CMOS gates only consume power when they are changing state, unused
elements of a non-pipelined ALU are not consuming power, so it may well
be that a 14-stage pipelined ALU can consume twice as much power as a
7-stage pipelined ALU.

But that will be because twice as much of it is in use, not because it
is going "twice as fast".

Since they are still sort of right, even if for the wrong reason,
perhaps all I am criticizing is an oversimplification here. But I think
that this can lead to a profound misconception of how microprocessors
work.

John Savard
http://home.ecn.ab.ca/~jsavard/index.html

Rob Stow · Dec 7, 2004

George said:
Ah so there we have it... assuming this has been approved by the technical
folks. BTW I notice that AMD seems to be cutting back on the depth of info
in their technical docs - the Product Data Sheets now consist of one
page... a far cry from the excruciating detail on cache operation etc. we
used to get.

The "Product Data Sheets" are indeed so brief as to be
virtually useless, but there is still a wealth of PDFs
that provide details about just about everything.

The useless Product Data Sheet heads the list of
"AMD Opteron™ Processor Tech Docs" at
http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_739_9003,00.html
but the other PDFs there have mind numbing details about
every little thing that does not give away trade secrets.
For example, read the "BIOS and Kernel Developer's Guide
for AMD Athlon™ 64 and AMD Opteron™ Processors".

del cecchi · Dec 7, 2004

John Savard said:
in part:

OK, I won't trim the wonderful newsgroup list, all of whose readers are
breathlessly awaiting my immortal prose....

38028.pdf

I would have preferred if you had given the URL of a page with a *link*
on it to this manual. That would make it easier to back-navigate for
other items of related interest, and it would have meant that the manual
could be downloaded with a right-click without waiting for the browser
plug-in to display the whole manual.

What braindamaged newsreader are you using that won't let you right
click the link in the newsreader? Even OE does that. So quit whining
and switch to a decent newsreader.

On page 13, under the heading "Power Considerations", I noticed a real
whopper. Or, at least, what _seemed_ to me to be a real whopper
initially.

It is true that for a given implementation, a higher clock speed means
more power consumption. It takes more power to make gates switch faster.

Probably referring to that esoteric equation P= (sf)*.5*C*V**2 which you
may have encountered. Or perhaps I=Cdv/dt.

However, if a higher clock speed is obtained by splitting the pipeline
into more itty-bitty pieces, for the same level of instruction latency,
then one still has the same number of gates, each consuming the same
amount of power. (Except for the overhead of the pipelining process...
and one more thing to be noted later.)

If one adds pipe stages one has more gates and more latches and more
clock drivers. And the power per gate goes up because of the higher
frequency.

What is the point of splitting up a pipeline into smaller pieces? Is it
to put more megahertz in the ad copy? No, it is so that more
instructions can be executing, in different stages, at once. (Which
means that a Pentium IV ought to have explicit vector instructions. Yes,
it has a separate instruction cache and data cache, but there's still
only one bus to *main memory*, and caches do have to get filled from
somewhere.)

Actually one reason for Intel to "superpipeline" was to jack up the freq
for the ad copy.
You lost me with the "Pentium IV ought to have explicit vector
instructions" leap.

Since CMOS gates only consume power when they are changing state, unused
elements of a non-pipelined ALU are not consuming power, so it may well
be that a 14-stage pipelined ALU can consume twice as much power as a
7-stage pipelined ALU.

Or maybe 4 times, if the freq is double.
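
The "4 times" figure follows from the usual CMOS dynamic power relation if both the switching activity and the clock frequency double. A minimal sketch, assuming "sf" in the formula above stands for switching activity times clock frequency; the capacitance, voltage, activity, and frequency values are placeholders rather than real Pentium 4 data.

```python
def dynamic_power_watts(activity, capacitance_f, voltage_v, freq_hz):
    """Dynamic power: activity * 0.5 * C * V^2 * f."""
    return activity * 0.5 * capacitance_f * voltage_v ** 2 * freq_hz


# Placeholder values, chosen only to make the ratio concrete.
C_FARADS = 1e-9
V_VOLTS = 1.4

# 7-stage design: baseline activity and clock.
shallow = dynamic_power_watts(0.1, C_FARADS, V_VOLTS, 1.5e9)
# 14-stage design: twice as much logic busy each cycle, at twice the clock.
deep = dynamic_power_watts(0.2, C_FARADS, V_VOLTS, 3.0e9)

print(f"power ratio deep/shallow: {deep / shallow:.2f}")
```

This leaves out the extra latches and clock drivers that additional pipe stages bring, which would push the ratio higher.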

But that will be because twice as much of it is in use, not because it
is going "twice as fast".

Clearly they are using "twice as fast" to mean "double the frequency".
Why do you find that so hard to understand?

Since they are still sort of right, even if for the wrong reason,
perhaps all I am criticizing is an oversimplification here. But I think
that this can lead to a profound misconception of how microprocessors
work.

What ARE you talking about?

John Savard
http://home.ecn.ab.ca/~jsavard/index.html

Del Cecchi.

Per Ekman · Dec 7, 2004

Yousuf Khan said:
Actually, there was a story here not so long ago where one of the Linux
distros had been optimized up with NUMA assumptions, and it actually ran
/slower/ than a non-NUMA kernel. In other words the Linux kernel might
have spent more time making complex decisions about memory placement
than it was actually going to save from the latencies.

And the conclusion was that a multi-CPU Opteron system must then be
UMA, rather than that the NUMA "optimizations" were crap?

Accessing memory through the Hypertransport links should not be any
worse than the traditional front-side bus arrangement like in Intel
processors. So it will match the Intel architectures that way, at the
very least.

If your 4-way system with 4 separate memory controllers matches the
memory bandwidth on a shared-bus 1P system that's good enough??
*boggle*

And whenever it goes through its own local memory controllers, it
blows the Intel architectures away.

Latency-wise perhaps, bandwidth-wise I doubt it but feel free to prove
me wrong.

You can't tout the scaling advantage of a NUMA approach while
pretending that it isn't NUMA. Well you can, but IMO it's dishonest
and self-defeating.

I _like_ the Opteron systems, they have great performance and I work
with them daily. However I would hate them if I didn't know that they
were NUMA because important codes would run terribly slowly under UMA
assumptions.

NUMA is _NOT_ only about latency!

*p

Michael Woodacre · Dec 7, 2004

Yousuf said:
Actually, there was a story here not so long ago where one of the Linux
distros had been optimized up with NUMA assumptions, and it actually ran
/slower/ than a non-NUMA kernel. In other words the Linux kernel might
have spent more time making complex decisions about memory placement
than it was actually going to save from the latencies.

Accessing memory through the Hypertransport links should not be any
worse than the traditional front-side bus arrangement like in Intel
processors. So it will match the Intel architectures that way, at the
very least. And whenever it goes through its own local memory
controllers, it blows the Intel architectures away.

As Per correctly points out, it's not just latency, but also bandwidth
that is key in NUMA.

For example, if I have a process on Opteron A, that is reading data from
Opteron B, then I can only read at Hypertransport rate, which is
something like 2.7GB/s of data in one direction (that's the best data rate
number I have for HT - if someone has a better one I'd be happy to use
that - note that this is user data, not protocol overhead). This is
quite different to the local bandwidth of memory directly connected to
Opteron A. Hence NUMA awareness is important.
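
To put numbers on that gap, here is a small sketch comparing the time to stream a buffer over HyperTransport with the time to stream it from local memory. The 2.7 GB/s figure is the one quoted above; the 6.4 GB/s local figure is an assumption (the theoretical peak of dual-channel DDR400, with sustained rates lower in practice), and the function name is made up for this example.

```python
HT_GB_S = 2.7      # remote read rate quoted in the post
LOCAL_GB_S = 6.4   # assumed local peak (dual-channel DDR400), not measured


def seconds_to_read(gigabytes, gb_per_s):
    """Time to stream `gigabytes` of data at a steady `gb_per_s`."""
    return gigabytes / gb_per_s


size_gb = 8.0
remote = seconds_to_read(size_gb, HT_GB_S)
local = seconds_to_read(size_gb, LOCAL_GB_S)
print(f"remote: {remote:.2f} s, local: {local:.2f} s, ratio {remote / local:.2f}x")
```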

This is different for example to the SGI Altix Bx2 system where an
Itanium can read at the full bus rate (local or remote). Clearly the
Altix architecture had a different set of tradeoffs made, to match the
requirements of SGI's customer base.

BTW, I think Opterons have some nice characteristics and for certain
workloads are attractive, but they are not the best solution to all
problems as some people seem to want to claim.

Another example would be making sure that people understand that when
Opteron goes dual core, unless you double the memory bandwidth
available, you effectively cut the bandwidth per core in half. This will
impact some workloads quite dramatically. Has AMD made public statements
about supporting higher local bandwidth for the dual core chip?

Cheers,
Mike

Grumble · Dec 7, 2004

Eugene said:
Visual C?

Maybe he meant GCC!

;-)

Grumble · Dec 7, 2004

Del said:
What braindamaged newsreader are you using that won't let you right
click the link in the newsreader? Even OE does that. So quit whining
and switch to a decent newsreader.

Speaking of brain-damaged newsreaders, take a look at the mess yours
did when you quoted John's message. I rest my case.

Del Cecchi · Dec 7, 2004

Grumble said:
Speaking of brain-damaged newsreaders, take a look at the mess yours
did when you quoted John's message. I rest my case.

A few lines got wrapped. That what you are talking about?

del
