Pentium M to become THE CPU

MitchAlsup · Oct 12, 2005

Oliver said:
It has about four times the store-bandwidth - but not the load-bandwidth
due to speculative snoops.

The 4 processors in a 4-Node fabric system can make 4 read accesses (L2
miss) to 4 memory controllers. Say one isolated access takes 120ns with
Probe delays, all 4 of these accesses can complete in only 130ns when
each request is serviced by a different memory controller! Including
Probe delays! In addition, a single CPU can make 4 overlapped requests
to 4 different memory controllers and achieve all of its data and
probes back in only 130ns.
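
A back-of-the-envelope sketch of the arithmetic above, in Python. The 120 ns and 130 ns figures are the illustrative ones from the post. The function names and the fully serialized baseline are assumptions added here for comparison, not measurements.

```python
# Illustrative latency model; the numbers come from the post, the names are hypothetical.
ISOLATED_NS = 120.0    # one L2 miss, including probe delays
OVERLAPPED_NS = 130.0  # four misses in flight to four different memory controllers


def serialized_ns(n_requests, per_request_ns=ISOLATED_NS):
    """Total time if each miss had to finish before the next one started."""
    return n_requests * per_request_ns


def effective_ns_per_miss(n_requests, total_ns=OVERLAPPED_NS):
    """Average cost per miss when all of them overlap."""
    return total_ns / n_requests


if __name__ == "__main__":
    print(serialized_ns(4))          # 480.0
    print(effective_ns_per_miss(4))  # 32.5
```

Under these assumed numbers, overlapping buys roughly a 3.7x reduction in total time for four misses, although no single miss completes any sooner than before.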

And while all of the above is going on, other processors can be using
the remaining fabric link bandwidths to transmit store-data to the
various memory controllers, ...

Mitch
<Note: memory delays are illustrative and vary from system to system>

EdG · Oct 12, 2005

Pentium M has all the right ingredients for total world domination:
low power consumption, short pipeline stages, hi-performance.

Are they x86-64?

Oliver S. · Oct 13, 2005

Pentium M has all the right ingredients for total world domination:

Are they x86-64?

Even the upcoming dual-core incarnation of the Pentium-M won't
be capable of x86-64, and it seems that this version will even lack VT
technology; but neither is a strong argument for notebook systems.

Oliver S. · Oct 13, 2005

The 4 processors in a 4-Node fabric system can make 4 read
accesses (L2 miss) to 4 memory controllers.

OK, you're right here. But feeding threads with physical memory pages from
the current CPU's memory controller is rather complicated when threads can
migrate between cores and memory pages are shared between cores. To handle
this efficiently, each page-table entry (PTE) would need counters that
record the read and write accesses made by the CPU currently using that
page table; the OS could then use them to migrate logical pages from the
physical pages of one CPU to those of another to maximize performance.
Unfortunately, AMD did not add this feature to the AMD64 page tables, so
you usually get a lot of interconnect traffic on loads. For
high-performance applications you would therefore have to pin a thread to a
certain CPU and use an API that gives you memory pages that physically map
to the memory attached to that CPU. But I'm not aware of any OS that
supports such processor-affine allocations. And with a further improved PTE
carrying a flag that suppresses snoop broadcasts, those broadcasts could be
dropped. Unfortunately, AMD64 doesn't include this feature either.
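
A minimal sketch of the migration decision such counters could drive, in Python. Everything here is hypothetical: AMD64 PTEs have no such counters (that is the complaint above), and the `PageStats` class, the thresholds, and the per-node bookkeeping are assumptions chosen only to illustrate the idea.

```python
from collections import Counter


class PageStats:
    """Hypothetical per-page access counters, one count per NUMA node."""

    def __init__(self, home_node):
        self.home_node = home_node  # node whose memory currently backs the page
        self.accesses = Counter()   # node id -> number of recorded accesses

    def record(self, node, count=1):
        self.accesses[node] += count


def pick_target_node(page, min_accesses=64, ratio=2.0):
    """Return the node the page should move to, or None to leave it in place.

    Migrate only if some remote node has touched the page at least
    `min_accesses` times and at least `ratio` times as often as the home node.
    """
    if not page.accesses:
        return None
    node, hits = page.accesses.most_common(1)[0]
    if node == page.home_node or hits < min_accesses:
        return None
    if hits < ratio * page.accesses[page.home_node]:
        return None
    return node
```

The ratio test is there so that a page shared roughly evenly between two nodes stays put instead of bouncing back and forth.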

Jason Ozolins · Oct 13, 2005

Oliver S. wrote:

[>Mitch Alsup wrote:] <- added by J.O.

Yes, but you have to consider the speculative snoops to other CPUs in
the ccNUMA domain also!

If a node has an L2 miss on memory which is on its local DRAM controller,
can't it just speculatively initiate the load from local DRAM and then drop
the results if the snoop issued in parallel indicates that the line is owned
by another node? If the snoop latency is less than the memory latency, then
you would at most wait for the memory load in this case.

It has about four times the store-bandwidth - but not the load-bandwidth
due to speculative snoops.

The load bandwidth is indeed limited by the snoop interconnect. If the loads
are overlapped with snoops as I suggested above, the load latency to local
memory is not, unless the interconnect is so saturated that a snoop takes
longer than a memory read.
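
The overlap argument can be written down as a tiny Python model. This is only a sketch of the reasoning in this post: the function name and the example latencies are assumptions, and it covers only the case where no other node owns the line.

```python
def local_load_latency_ns(dram_ns, snoop_ns, speculative=True):
    """Latency of an L2 miss to local DRAM when no other node owns the line."""
    if speculative:
        # DRAM read and snoop are issued together; the data is usable
        # once both have completed.
        return max(dram_ns, snoop_ns)
    # Without speculation the DRAM read waits for the snoop response first.
    return snoop_ns + dram_ns


if __name__ == "__main__":
    print(local_load_latency_ns(100, 60))          # snoop hidden behind DRAM
    print(local_load_latency_ns(100, 140))         # saturated interconnect dominates
    print(local_load_latency_ns(100, 60, False))   # serialized snoop, then read
```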

On a big 24 processor UltraSPARC III/IV system (E6900), using the Fireplane
interconnect, you will see nasty limits on the overall sustained load
bandwidth (9.6GB/sec) due to the snoop interconnect maxing out. AFAIK the
bigger Sun systems use cache directories to partition their snoop broadcast
domains.

-Jason

Oliver S. · Oct 13, 2005

If a node has an L2 miss on memory which is on its local DRAM
controller, can't it just speculatively initiate the load from
local DRAM and then drop the results if the snoop issued in parallel
indicates that the line is owned by another node?

Of course it can; and didn't I mention that somewhere in this
thread? I thought I named this kind of load a "speculative load".

If the snoop latency is less than the memory latency, then
you would at most wait for the memory load in this case.

I'll bet that the snoop latency is higher, because the snoop travels over
rather narrow HyperTransport links and, on large machines with four or
eight CPUs, even through some other CPUs.

On a big 24 processor UltraSPARC III/IV system (E6900), using the
Fireplane interconnect, you will see nasty limits on the overall
sustained load bandwidth (9.6GB/sec) due to the snoop interconnect
maxing out.

Doesn't this architecture have duplicate tags in the chipset? Those large
machines are very expensive anyway, so the duplicate tags wouldn't
increase the overall cost significantly.

Jan Vorbrüggen · Oct 13, 2005

So for high-performance-applications you would have to nail a thread
to a certain CPU and use an API that would give you memory-pages that
physically map to the memory attached to the CPU. But I'm not aware of
any OS that supports such processor-affine allocations.

VMS on Alpha and IA-64 does it. IIRC, DEC's Unix for Alpha and IA-64
(whatever marketing might be calling it right now) does too. I strongly
suspect SGI's OSes for their MIPS- and IA-64-based systems do as well.

Jan

Anton Ertl · Oct 13, 2005

Andi Kleen said:
[1] Actually, with lmbench a newer Intel dual-core system reports a lower
memory latency to me than an A64, but I suspect their prefetch
algorithms became so good they broke lmbench ;-)

Well, for the original lmbench that's not very hard.

We had a discussion about that last year (and some of the results are
reflected in <[email protected]> ff.), and I
then got a message from Carl Staelin (one of the lmbench people) saying that
lmbench-3.0-a4 (and presumably later versions) has two memory access
patterns: sequential strided and random. So, with the random access
pattern you should be able to defeat the stream buffers of the CPUs.
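
To make the difference between the two patterns concrete, here is a small Python sketch of how a dependent-load ("pointer chasing") chain can be laid out either way. It is not lmbench's code. The function names are made up for illustration, and Python itself is far too slow to measure memory latency, so this only shows how the access order is constructed.

```python
import random


def strided_chain(n, stride):
    """chain[i] is the index visited after i, stepping by a fixed stride."""
    return [(i + stride) % n for i in range(n)]


def random_chain(n, seed=0):
    """Random permutation forming one single cycle (Sattolo's algorithm)."""
    rng = random.Random(seed)
    chain = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        chain[i], chain[j] = chain[j], chain[i]
    return chain


def walk(chain, steps, start=0):
    """Follow the chain; each load depends on the result of the previous one."""
    i = start
    for _ in range(steps):
        i = chain[i]
    return i
```

A fixed stride is exactly what a stream buffer is built to detect. The single random cycle visits every element once before repeating and gives a stride-based prefetcher no pattern to lock onto.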

Followups set to comp.arch.

- anton

Oliver S. · Oct 13, 2005

VMS on Alpha and IA-64 does it. IIRC, DEC's Unix for Alpha and IA-64
(whatever marketing might be calling it right now) does too. I strongly
suspect SGI's OSes for their MIPS- and IA-64-based systems do as well.

Cool! So number-crunching apps would simply pin their
threads to the cores and allocate processor-affine memory.

Andi Kleen · Oct 13, 2005

Oliver S. said:
you memory-pages that physically map to the memory attached to the CPU.
But I'm not aware of any OS that supports such processor-affine allocations.

It's pretty much standard on any NUMA-aware OS (Linux, Solaris, Windows Server 2k3)

-Andi
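
As a rough illustration of the Linux case, the Python sketch below pins the calling thread to one CPU and then touches a freshly allocated buffer. `os.sched_setaffinity` and `os.sched_getaffinity` are real but Linux-only. The helper names are made up, and the code assumes both that the kernel's default policy places a page on the node of the CPU that first touches it and that the Python allocator hands back untouched pages. Explicit placement would go through the libnuma interfaces instead.

```python
import os


def pin_to_cpu(cpu):
    """Restrict the calling thread to one CPU (Linux only); return the new mask."""
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)


def allocate_and_touch(n_bytes, page_size=4096):
    """Allocate a buffer and write one byte per page, so that each page is
    faulted in while we are running on the pinned CPU."""
    buf = bytearray(n_bytes)
    for offset in range(0, n_bytes, page_size):
        buf[offset] = 1
    return buf


if __name__ == "__main__":
    cpu = min(os.sched_getaffinity(0))  # pick any CPU we are allowed to run on
    print(pin_to_cpu(cpu))
    data = allocate_and_touch(1 << 20)
    print(len(data))
```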

Trent · Oct 13, 2005

(One of these days, I'll build a controller for my beer fridges so I can
free up the Apple IIs that are currently running them (a IIGS on one and a
IIe on the other). To simplify the software-porting effort, it'll most
likely be built around a 6502, or something compatible with it. It's not
like monitoring the temperature and switching the compressor on and off
requires dual Opterons or something insane like that.)

Even a 6502 is overkill. Why not use a simple 8 pin device like this?
http://www.maxim-ic.com/quick_view2.cfm/qv_pk/2735

The only additional thing you'd need aside from a power source is a
transistor, a diode, and a relay. You can program the device with a
parallel port.

dannysdailys · Oct 13, 2005

Nathan Bates wrote:
Pentium M has all the right ingredients for total world domination:

low power consumption, short pipeline stages, hi-performance.

Pentium M will kill its brother Pentium 4 and its bastard cousin
Athlon.
PowerPC is a Neanderthal that's nearing its end (Jobs figured that
out).
But ARM will survive due to its ultra-low power consumption and
elegance.

This should tell you more about your beloved Intel than anything else.
When an old-tech P-3/M can blow the doors off a P-4 for gaming, one
must wonder where Intel is headed. The dual cores they're building
will be orphaned in about a year. Keep in mind, you already have to
scrap your mobo now to even use one. Then, when they finally copy
AMD and do away with the FSB, you'll be scrapping your mobo all over
again.

This is good? I hardly think so...

Yes, there is a company that is poised to take over the world, but it
certainly isn't Intel or Dell.

I'm reminded of a conversation I heard years ago at my local pub. A
salesman for a micro brew was there, as was a salesman for a major
brewer. The major salesman scoffed at the micro brew guy. "We spill
more beer in one day than you even make." The other guy said, "Yeah?
Maybe that just shows what you think of your own beer."

Words to live by...

Cheers

nobody · Oct 13, 2005

Even the upcoming dual-core incarnation of the Pentium-M won't
be capable of x86-64, and it seems that this version will even lack VT
technology; but neither is a strong argument for notebook systems.

The original post was about P-M dominating the _world_, not just
thin-n-lite. And even there, Turion is gaining market share,
apparently not at the expense of VIA, Transmeta, and Apple, but rather
Intel.

Scott Alfter · Oct 13, 2005

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Even a 6502 is overkill. Why not use a simple 8 pin device like this?
http://www.maxim-ic.com/quick_view2.cfm/qv_pk/2735

The only additional thing you'd need aside from a power source is a
transistor, a diode, and a relay. You can program the device with a
parallel port.

1) It doesn't appear to have an ability to enforce a minimum off-time. If
you repeatedly turn the compressor back on too soon after it has shut off,
that'll shorten its life. Choosing setpoints that are far enough apart
might minimize this, but that would result in the temperature not being
as tightly regulated as it could be.
2) It doesn't appear to have a way to slowly ramp the temperature up/down.
If the fridge is at 50 degrees and you want it to go up to 70 for a
diacetyl rest and then down to 35 for lagering, you want those
temperature changes to be made slowly (at a rate of maybe 1 degree per
hour).

These are things for which some sort of microprocessor control is needed. A
6502 might be overkill, but it's what I know, so there's less time getting
up to speed with an unfamiliar instruction set. I'm currently using DS18B20
sensors controlled by Apple IIs through a little bit of custom hardware, so
I already have software for the 6502 that talks to 1-Wire devices. Porting
that to another 6502-based machine would take minimal effort.
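
For anyone who wants to see the two requirements spelled out, here is a rough Python sketch of the control logic: a hysteresis thermostat with an enforced minimum compressor off-time and a setpoint that ramps slowly toward a target. It is not the 6502 code. The class name, the 1-degree band, and the 5-minute rest period are assumptions picked for illustration, and temperatures are in whatever unit the sensor reports.

```python
class FridgeController:
    """Hysteresis thermostat with a minimum off-time and a ramped setpoint."""

    def __init__(self, setpoint, band=1.0, min_off_s=300.0, ramp_per_hour=1.0):
        self.setpoint = setpoint      # current (ramping) setpoint
        self.target = setpoint        # where the setpoint is heading
        self.band = band              # half-width of the hysteresis band
        self.min_off_s = min_off_s    # required rest after the compressor stops
        self.ramp_per_s = ramp_per_hour / 3600.0
        self.compressor_on = False
        self.off_since = None         # None: has not been switched off yet
        self.last_t = None

    def update(self, now_s, temp):
        """Advance to time now_s (seconds) with the measured temperature;
        return True if the compressor should be running."""
        if self.last_t is not None:
            # Move the setpoint toward the target, limited by the ramp rate.
            step = self.ramp_per_s * (now_s - self.last_t)
            delta = self.target - self.setpoint
            self.setpoint += max(-step, min(step, delta))
        self.last_t = now_s

        if self.compressor_on:
            if temp <= self.setpoint - self.band:
                self.compressor_on = False
                self.off_since = now_s
        elif temp >= self.setpoint + self.band:
            rested = (self.off_since is None
                      or now_s - self.off_since >= self.min_off_s)
            if rested:
                self.compressor_on = True
        return self.compressor_on
```

Setting `target` to a new value (say, 35 for lagering) and calling `update()` periodically walks the setpoint there at the configured rate. A real controller would probably also treat power-up as "just switched off", since it cannot know how recently the compressor ran.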

(If anyone's interested, the 1-Wire software is the first link at
http://alfter.us/a2soft.shtml.)

I also have the IIs graphing the temperature for the past ~4 hours; this
functionality would most likely go away, as it's (more or less) a curiosity
that was relatively easy to implement.

_/_
/ v \ Scott Alfter (remove the obvious to send mail)
(IIGS( http://alfter.us/ Top-posting!
\_^_/ rm -rf /bin/laden >What's the most annoying thing on Usenet?

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)

iD8DBQFDTqibVgTKos01OwkRAib5AKCtqvU9XJIhNcX/y2KxRhFTJ3mYLQCg6O2H
IwAKYv22azX/G20ePQ+6RpI=
=WzMq
-----END PGP SIGNATURE-----

Oliver S. · Oct 13, 2005

you memory-pages that physically map to the memory attached to the CPU.

It's pretty much standard on any NUMA aware OS (Linux, Solaris, Windows Server 2k3)

It isn't as easy as one might think:
- At least for Windows Server 2003, there's no API to allocate processor-local
memory. And on other systems you'd have to use such an API instead of plain
malloc()ing.
- A thread will usually get migrated from one CPU to another sooner or later (if
it has no fixed CPU affinity), and the OS might choose to reschedule a thread on
the CPU the thread ran on the last time to try to help to (to^3 *g*) recycle the
working set of this thread in the cache hierarchy of that CPU. So it won't help
much to allocate a thread's memory on the CPU the thread was running on when it
requested the memory.
- Memory allocated by one thread might be used more by another thread running on
another CPU.

Bill Davidsen · Oct 14, 2005

Andi Kleen wrote:

If the processor waits at any point because DRAM data has not arrived
and the CPU has nothing left to try to do, then you are in a latency
bound situation and the FSB loses. More bandwidth does not speed up
latency bound problems.

In addition the on-die approach with the HyperTransport fabric
interconnect gives you the property that as you add CPUs, you also add
DRAM bandwidth and bisection bandwidth. A 4 Node Opteron system has ~4
times as much DRAM bandwidth as a 4 node Pentium (single) FSB system
and plenty of chip-to-chip bandwidth to route the data to where it is
needed.

But you said latency was the issue, not bandwidth. Or do you believe
that having to get data out of memory on another CPU introduces no latency?

Bill Davidsen · Oct 14, 2005

Ketil said:
Mediocre FP performance, few available motherboards, high price, no
SMP support?

For the price of a high end Pentium M, I can get a dual core AMD where
each core has equivalent integer performance and much better FP. Sure
Pentium M is attractive for some purposes, but total world domination
is still a way off, IMO.

And you can heat your house in the winter. The P-M is really low power
compared to Opteron, and of course P4 is in a class by itself for heat.
The SMP and FP issues are supposedly being addressed soon, as will
EM64T; the current advantage is power. AMD realized this and recently
offered a mobile chip to be more competitive, so I guess AMD saw the need.

I don't think the P-M is going to make everything else go away, but at
the moment the play is in low power, no matter what the O.P. thinks.

I'd like to get a dual dual-core system, but my next will be a single
chip Intel-DC to be compatible with other things I support.

Bill Davidsen · Oct 14, 2005

The original post was about P-M dominating the _world_, not just
thin-n-lite. And even there, Turion is gaining market share,
apparently not at the expense of VIA, Transmeta, and Apple, but rather
Intel.

With the exception of VIA, my perception is that there isn't a hell of a
lot of market share to take other than Intel's. I'm considering the
low-power market; some "laptops" are just portable desktops, and you can't
use them on your lap if you ever plan on having children.

Bill Todd · Oct 14, 2005

Bill Davidsen wrote:

....

The P-M is really low power compared to Opteron

Perhaps you haven't been paying attention recently: AMD just introduced
a 2.4 GHz mobile chip with a 35W power envelope (and given the slope of
their performance/W curve lately there's little reason to expect that to
be the best they can do this year, let alone next).

- bill

keith · Oct 14, 2005

And you can heat your house in the winter. The P-M is really low power
compared to Opteron, and of course P4 is in a class by itself for heat.

Don't I wish I could heat my house with the 150W my 18-month-old Opteron
144 (complete system, sans monitor) draws. Please!

The SMP and FP issues are supposedly being addressed soon, as will
EM64T; the current advantage is power. AMD realized this and recently
offered a mobile chip to be more competitive, so I guess AMD saw the
need.

Sure, and met it. Intel hasn't met AMD64. Next!

I don't think the P-M is going to make everything else go away, but at
the moment the play is in low power, no matter what the O.P. thinks.

That's the current marketeering, sure. A couple of years ago it was GHz.
What benchmarket will they lose next?

I'd like to get a dual dual-core system, but my next will be a single
chip Intel-DC to be compatible with other things I support.

The Intel marketeering department thanks you.
