Pretty good explanation of x86-64 by HP

Yousuf Khan · Dec 5, 2004

I found this whitepaper from HP to be pretty good. It is surprisingly
candid, considering HP was the co-inventor of the Itanium. It does a
pretty good job of explaining and summarizing the similarities and
differences between AMD64 and EM64T, and their comparison to the
Itanium's IA64 instruction set. AMD64 and EM64T are "broadly
compatible", but IA64 is a different animal altogether.

Yousuf Khan

http://h200001.www2.hp.com/bc/docs/support/SupportManual/c00238028/c00238028.pdf

Bill Bradley · Dec 5, 2004

Yousuf said:
I found this whitepaper from HP to be pretty good. It is surprisingly
candid, considering HP was the co-inventor of the Itanium. It does a
pretty good job of explaining and summarizing the similarities and
differences between AMD64 and EM64T, and their comparison to the
Itanium's IA64 instruction set. AMD64 and EM64T are "broadly
compatible", but IA64 is a different animal altogether.

Yousuf Khan

http://h200001.www2.hp.com/bc/docs/support/SupportManual/c00238028/c00238028.pdf

When did the non-Xeon Prescott P4s start offering EM64T as listed in
the paper? News to me. Does HP know something the rest of the world
doesn't?

Bill

George Macdonald · Dec 5, 2004

I found this whitepaper from HP to be pretty good. It is surprisingly
candid, considering HP was the co-inventor of the Itanium. It does a
pretty good job of explaining and summarizing the similarities and
differences between AMD64 and EM64T, and their comparison to the
Itanium's IA64 instruction set. AMD64 and EM64T are "broadly
compatible", but IA64 is a different animal altogether.

Yousuf Khan

http://h200001.www2.hp.com/bc/docs/support/SupportManual/c00238028/c00238028.pdf

Hmm and the following quote: "However, the latency difference between local
and remote accesses is actually very small because the memory controller is
integrated into and operates at the core speed of the processor, and
because of the fast interconnect between processors." is relevant to
another discussion here. I wish we could get a firm answer on this one.

Rgds, George Macdonald

"Just because they're paranoid doesn't mean you're not psychotic" - Who, me??

Rob Stow · Dec 5, 2004

George said:
Hmm and the following quote: "However, the latency difference between local
and remote accesses is actually very small because the memory controller is
integrated into and operates at the core speed of the processor, and
because of the fast interconnect between processors." is relevant to
another discussion here. I wish we could get a firm answer on this one.

Not sure if this is exactly what you are looking for in the
way of a "firm answer", but the latencies in an Opteron system are:

Hops   Latency   Notes
0      80 ns     uniprocessor (local access)
0      100 ns    multiprocessor (local access, with cache snooping on other processors)
1      115 ns
2      150 ns
3      190 ns

I couldn't find my original source for those numbers, and
the two and three hop numbers above are a little higher
than I remembered them as being. This time around I got
them from this thread:
http://www.aceshardware.com/forum?read=80030960

That thread refers to this article:
http://www.digit-life.com/articles2/amd-hammer-family/
which gives slightly different numbers for a 2 GHz Opteron
with DDR333:
Uni-processor system: 45 ns
Dual-processor system: 0-hop - 69 ns, 1-hop - 117 ns.
Four-processor system: 0-hop - 100 ns, 1-hop - 118 ns, 2-hop - 136 ns.

I don't know if any of the numbers above are for cache misses
or if they are averages that include both hits and misses.
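
To put the figures above in proportion, here is a minimal sketch that expresses each remote access as a multiple of the local (0-hop) latency. The numbers are copied from the lists above; the dictionary and function names are made up for illustration, and the sketch says nothing about whether the figures are cache-miss latencies or averages.

```python
# Latencies in ns, keyed by hop count, as quoted above.
aces_thread_mp = {0: 100, 1: 115, 2: 150, 3: 190}   # first list, multiprocessor rows
digit_life_2p = {0: 69, 1: 117}                     # dual-processor system
digit_life_4p = {0: 100, 1: 118, 2: 136}            # four-processor system


def remote_penalty(latencies_ns):
    """Return {hops: remote latency / local latency} for every hop count above 0."""
    local = latencies_ns[0]
    return {hops: round(ns / local, 2)
            for hops, ns in latencies_ns.items() if hops > 0}


for name, table in [("aces thread", aces_thread_mp),
                    ("digit-life 2P", digit_life_2p),
                    ("digit-life 4P", digit_life_4p)]:
    print(name, remote_penalty(table))
```

By these figures a one-hop access costs roughly 1.15 to 1.7 times a local one, depending on which source and which system size you believe.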

Hamman · Dec 5, 2004

Bill Bradley said:
When did the non-Xeon Prescott P4s start offering EM64T as listed in the
paper? News to me. Does HP know something the rest of the world
doesn't?

Bill

www.overclockers.co.uk had some a few weeks back, and they sold very quickly.
I think there are a few more in now.

hamman

Tony Hill · Dec 5, 2004

When did the non-Xeon Prescott P4s start offering EM64T as listed in
the paper? News to me. Does HP know something the rest of the world
doesn't?

Not that they know something the rest of the world doesn't, just that
they have access to processors that most of us do not. IBM sells them
as well, but for the time being Intel will ONLY sell them for use in
servers. Why? I really don't know. Maybe it's just a bit too much
crow for them to eat after saying (only a bit over a year ago) that
64-bit wouldn't be useful for the desktop until the end of the year?

Patrick Schaaf · Dec 5, 2004

Tony Hill said:
Not that they know something the rest of the world doesn't, just that
they have access to processors that most of us do not. IBM sells them
as well, but for the time being Intel will ONLY sell them for use in
servers. Why? I really don't know. Maybe it's just a bit too much
crow for them to eat after saying (only a bit over a year ago) that
64-bit wouldn't be useful for the desktop until the end of the year?

How much does Intel stockpile? Could it be that they have warehouses
full of already produced non-64-bit processors, and those want to be
sold at the projected prices, not thrown away?

best regards
Patrick

Yousuf Khan · Dec 5, 2004

Bill said:
When did the non-Xeon Prescott P4s start offering EM64T as listed in
the paper? News to me. Does HP know something the rest of the world
doesn't?

It must have been at least two or three months now, I posted a message
about it in one of these newsgroups.

Google Search: g:thl403337196d
http://groups.google.ca/groups?q=g:[email protected]

or,

http://tinyurl.com/6tnjy

Yousuf Khan

Yousuf Khan · Dec 5, 2004

George said:
Hmm and the following quote: "However, the latency difference between local
and remote accesses is actually very small because the memory controller is
integrated into and operates at the core speed of the processor, and
because of the fast interconnect between processors." is relevant to
another discussion here. I wish we could get a firm answer on this one.

Yeah, but that's why I think AMD insists on calling their multiprocessor
connection scheme SUMO (Sufficiently Uniform Memory Organization),
rather than NUMA. It's not worth headaching over such small differences
in latency, is basically what they're saying.

Yousuf Khan

Bob Niland · Dec 5, 2004

Patrick Schaaf said:
How much does Intel stockpile?

Well, according to the Reg,
<http://www.theregister.co.uk/2004/12/03/intel_eol_p2/>
they just finally announced EOL for the Pentium-II.

"The Register reveals that you'll be able to continue
ordering the part for a year, with the last trays
leaving the chip giant's Pentium II warehouse on
1 June 2006."

Could it be that they have warehouses full of already
produced non-64-bit processors, and those want to be
sold at the projected prices, not thrown away?

Whether there is any connection between your hypothesis
and the Reg news is left as an exercise for the reader.

ammonton · Dec 6, 2004

In comp.arch Tony Hill said:
Not that they know something the rest of the world doesn't, just that
they have access to processors that most of us do not. IBM sells them
as well, but for the time being Intel will ONLY sell them for use in
servers. Why? I really don't know.

FWIW, Dell are shipping EM64T-equipped non-Xeon P4 workstations (the
Precision 370).

-a

David Schwartz · Dec 6, 2004

Hmm and the following quote: "However, the latency difference between local
and remote accesses is actually very small because the memory controller is
integrated into and operates at the core speed of the processor, and
because of the fast interconnect between processors." is relevant to
another discussion here. I wish we could get a firm answer on this one.

In typical Opteron setups (2-8 CPUs, using the Opteron's built-in SMP
hardware), the latency difference between local and remote memory accesses
is so small that the benefits of treating it as NUMA are typically
outweighed by the costs. Generally, you just distribute the memory evenly
and interleaved on the nodes (if you can) to avoid overloading one memory
controller channel.
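
As a rough illustration of what even interleaving costs in latency terms, the sketch below averages the per-hop latencies over all nodes. It borrows the four-processor figures Rob Stow quoted earlier in the thread and assumes a square topology (from any node: one local node, two at one hop, one at two hops). That topology, and all the names in the code, are assumptions for illustration only.

```python
def mean_latency_ns(latency_by_hops, nodes_by_hops):
    """Average latency when memory is spread evenly over all nodes.

    latency_by_hops: {hops: latency in ns}
    nodes_by_hops:   {hops: number of nodes at that distance}
    """
    total_nodes = sum(nodes_by_hops.values())
    weighted = sum(latency_by_hops[h] * n for h, n in nodes_by_hops.items())
    return weighted / total_nodes


# Four-processor figures quoted earlier in the thread (ns).
four_way_latency = {0: 100, 1: 118, 2: 136}
# Assumed square layout: 1 local node, 2 one-hop neighbours, 1 two-hop node.
square_topology = {0: 1, 1: 2, 2: 1}

interleaved = mean_latency_ns(four_way_latency, square_topology)
all_local = four_way_latency[0]
print(f"interleaved: {interleaved:.0f} ns, all local: {all_local} ns")
```

Under those assumptions the interleaved average comes out at 118 ns against 100 ns for purely local accesses. This covers latency only and says nothing about bandwidth or controller contention.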

DS

keith · Dec 6, 2004

Yeah, but that's why I think AMD insists on calling their multiprocessor
connection scheme SUMO (Sufficiently Uniform Memory Organization),
rather than NUMA. It's not worth headaching over such small differences
in latency, is basically what they're saying.

I'd say that's because in small systems (fewer than 8 CPUs), Opterons are
coherent in hardware and thus sufficiently tightly coupled to be called UMA,
as far as the user is concerned.

keith · Dec 6, 2004

How much does Intel stockpile? Could it be that they have warehouses
full of already produced non-64-bit processors, and those want to be
sold at the projected prices, not thrown away?

Unsold inventory is a very bad thing indeed. The tax man isn't happy.
Stockholders aren't happy. Executives shiver.

keith · Dec 6, 2004

However, it's not hard to show with benchmarks that paying attention
to the NUMA nature of the Opteron is a significant win. So you can
call it what you want, but...

Point well taken. So we have a dessert topping and a floor wax. ;-)

Newsgroups trimmed.

..chips added back in.

George Macdonald · Dec 6, 2004

Not sure if this is exactly what you are looking for in the
way of a "firm answer", but the latencies in an Opteron system are:

Hops   Latency   Notes
0      80 ns     uniprocessor (local access)
0      100 ns    multiprocessor (local access, with cache snooping on other processors)
1      115 ns
2      150 ns
3      190 ns

I couldn't find my original source for those numbers, and
the two and three hop numbers above are a little higher
than I remembered them as being. This time around I got
them from this thread:
http://www.aceshardware.com/forum?read=80030960

That thread refers to this article:
http://www.digit-life.com/articles2/amd-hammer-family/
which gives slightly different numbers for a 2 GHz Opteron
with DDR333:
Uni-processor system: 45 ns
Dual-processor system: 0-hop - 69 ns, 1-hop - 117 ns.
Four-processor system: 0-hop - 100 ns, 1-hop - 118 ns, 2-hop - 136 ns.

I don't know if any of the numbers above are for cache misses
or if they are averages that include both hits and misses.

Thanks for the data, but no. I guess I should have highlighted better what I
was getting at: "the memory controller is integrated into and operates at
the core speed of the processor", which is what was being
discussed/disputed in another thread.

I haven't been able to find any hard data from AMD on where the clock
domain boundaries are in the Opteron/Athlon64 but if the memory controller
is not operating at "core speed" it's now at the stage of Internet
Folklore.

Rgds, George Macdonald

"Just because they're paranoid doesn't mean you're not psychotic" - Who, me??

George Macdonald · Dec 6, 2004

Yeah, but that's why I think AMD insists on calling their multiprocessor
connection scheme SUMO (Sufficiently Uniform Memory Organization),
rather than NUMA. It's not worth headaching over such small differences
in latency, is basically what they're saying.

See my reply to Rob Stow.

Rgds, George Macdonald

"Just because they're paranoid doesn't mean you're not psychotic" - Who, me??

Greg Lindahl · Dec 6, 2004

George Macdonald said:
I haven't been able to find any hard data from AMD on where the clock
domain boundaries are in the Opteron/Athlon64 but if the memory controller
is not operating at "core speed" it's now at the stage of Internet
Folklore.

Note that the STREAM bandwidth and lmbench latency change with every
CPU speed bump. So clearly part of the memory controller is at the CPU
core frequency, or a related frequency, and not at the HT frequency
or the SDRAM external bus frequency.

Please reduce the cross-post. Followups set to a group I read.

-- greg

Per Ekman · Dec 6, 2004

Yousuf Khan said:
Yeah, but that's why I think AMD insists on calling their multiprocessor
connection scheme SUMO (Sufficiently Uniform Memory Organization),
rather than NUMA. It's not worth headaching over such small differences
in latency, is basically what they're saying.

It's a bit of a crap argument isn't it? Even if the latency is small,
the fact that it's a NUMA system impacts performance (potentially by a
lot) as the available memory bandwidth is coupled to where you place
your data.

Classic example is OpenMP parallelized STREAM. Parallelize all the
loops except the data initialization loop on a system with hard memory
affinity (such as Linux), then parallelize _all_ the loops and explain
how the difference is "not worth headaching over".

Bottom line IMO is that pretending that the system isn't NUMA is doing
customers a disservice. They should know that treating the system as a
UMA one is a bad idea.
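
The STREAM point can be sketched with a toy model. It is not a benchmark: it only tracks which node ends up owning each page under a first-touch policy and puts an upper bound on the memory bandwidth the threads can draw. The 5.0 GB/s per-controller figure and all the names are hypothetical, and HyperTransport link limits are ignored.

```python
def first_touch_placement(n_pages, n_nodes, parallel_init):
    """Return the owning node of each page under a first-touch policy.

    Serial init: one thread on node 0 touches every page first.
    Parallel init: pages are split into contiguous chunks, one per node,
    as a static loop schedule would do.
    """
    if not parallel_init:
        return [0] * n_pages
    chunk = -(-n_pages // n_nodes)  # ceiling division
    return [page // chunk for page in range(n_pages)]


def bandwidth_bound_gbs(placement, per_controller_gbs):
    """Upper bound: only memory controllers that hold data can serve it."""
    return len(set(placement)) * per_controller_gbs


PER_CONTROLLER_GBS = 5.0  # hypothetical figure, for illustration only

serial = first_touch_placement(n_pages=8, n_nodes=4, parallel_init=False)
parallel = first_touch_placement(n_pages=8, n_nodes=4, parallel_init=True)

print("serial init:  ", serial, bandwidth_bound_gbs(serial, PER_CONTROLLER_GBS))
print("parallel init:", parallel, bandwidth_bound_gbs(parallel, PER_CONTROLLER_GBS))
```

With a serial initialization loop every page lands on node 0, so all the compute threads queue up behind a single memory controller. Parallelizing the initialization spreads the pages, and with them the bandwidth, across every node.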

*p

Tony Hill · Dec 6, 2004

How much does Intel stockpile? Could it be that they have warehouses
full of already produced non-64-bit processors, and those want to be
sold at the projected prices, not thrown away?

ALL of the "Prescott" and "Nocona" cores are 64-bit capable excluding
those that would pass a validation as 32-bit chips but fail as 64-bit
chips, but such chips would be rather few and far between. It could
be that Intel still has a reasonable amount of inventory of their old
"Northwood" P4 chips and they want to clear those out first, but that
certainly doesn't seem to be the case looking at Intel's pricing
structure and what is being sold by the major OEMs (Intel seems to be
pushing Prescott VERY hard here).

Long story short, I'm not quite sure what the actual answer is, but
excessive inventory of 32-bit chips doesn't seem to make sense from
what I've seen.
