Intel COO signals willingness to go with AMD64!!


George Macdonald

They were for desktop micros. ;-)

Hmmm, call it 1988 for 1st product?

Rgds, George Macdonald

"Just because they're paranoid doesn't mean you're not psychotic" - Who, me??
 

Jim Hull

Grumble said:
Intel did write their own compiler for IPF, and, unsurprisingly, it is
the best compiler available for that platform, as far as I know.

HP also has their own IPF compiler. Depending on how you define "best",
some would consider it to be the best IPF compiler.

For example, using published performance data on the SPEC CPU2000 integer
and FP benchmarks from spec.org, we find that the HP compiler gives the
highest CINT2000 results, while the Intel compiler gives the highest
CFP2000 results.

Of course, one measure of "best" is whether the compiler is available for
the platform you're interested in. For now, HP's compiler is only
available on HP-UX, so if you only care about Windows and/or Linux, it's
not the compiler for you.

-- Jim
HP Itanium Processor Architect
 

Robert Myers

As mentioned previously, I keep hearing about this "feedback loop" and,
while its importance to VLIW/EPIC seems obvious, have trouble seeing how it
fits into the model for delivery of commercial software. Is every client
supplied with a compiler "free" or does the price have to be included in
the software?... or is Intel going to give compilers away to sell CPU
chips?... or any other of the various permutations for supplying the
capability? BTW I am looking at a future world where Open Source will not
displace paid-for software, especially in the domain of "difficult
problems".
I'm not sure what you mean by "difficult problems." One of the most
active participants and a reasonably prolific publisher in this
particular area of research is Microsoft. I'm not certain, but I
have a feeling that they aren't looking at a world in which Open
Source will replace paid-for software, either. :).

There are two problems that I know of with distribution of commercial
software. One problem is that the available cache may be nowhere near
to the amount of cache for which the binary was optimized, and the
other is that programs are no longer in general statically linked.

The first of the two problems (available cache size is not
predictable) is why Intel has very little choice but to take actual
run-time conditions into account in one way or another. The
easiest way for them to do that, and the one I expect Intel to
implement first (I have no inside information whatsoever), is to use
speculative threading as a prefetch mechanism. More generally, it is
no secret that Intel is working on making Itanium out-of-order, albeit
in a limited way.

The most obvious way to get around the dll problem is static linking
of binaries. You can provide some degree of installation choice in
the same way that auto-install programs work now.
From a practical standpoint, are we to believe that a re-train has to be
done for every variation on the "dataset"?
No.

How many (near) repetitions on
a given "dataset" make it worthwhile to do the re-train? Can that even be
defined?

I'm not sure, but I think you are imagining that the binary you
produce would vary tremendously with input conditions, so that one
customer allowed to continue to train and recompile the code would end
up with a very different binary from another customer that was also
allowed to continue to train and recompile the code. Were that so,
the whole concept would make no sense at all, and the evidence is
abundant that such is not the case. Even the most unlikely of
software, like Microsoft Word, shows an incredible level of
predictability.

I haven't seen anyone talk about it in the literature, but there is no
reason I can see why a program cannot allow a certain amount of tuning
on-site with a limited optimization space. The current assumption, as
far as I know, is that commercial software will be delivered
"fully-trained."
Are you familiar with the term "perfect future technology"? :)


Have you considered that as the complexity of the solution exceeds that of
the problem we have an umm, enigma?

I don't know about everyone working on this problem, but I know that
at least some people are looking beyond the single-processor problem.
If we can't manage to cope with the dataflow problem for a single
processor, how in heaven's name are we going to get decent performance
out of thousands or hundreds of thousands of processors?

We already have supercomputers that occupy a large amount of physical
space. Speed of light limitations will mean that autonomous entities
will have to be able to figure out what to do next without waiting for
guidance from a central scheduler. It is very easy to come up with
foreseeable problems that warrant the effort that is being expended.

RM
 

Tony Hill

They were for desktop micros. ;-)

Stuff like caches were invented first for mainframes, then for
minicomputers, and recently (historically speaking) for desktop
microprocessors.

Quick memory refresh needed here.. Was it the 386 or the 486 that
first introduced L1 caches into Intel's line of processors?
Question: when will the first smoke-detector micro to use an L1 cache
be introduced?

Most microcontrollers of the sort used in smoke detectors have RAM
embedded right on the chip, so it's sorta like they have cache now.
Considering it's the first level of cache, does that RAM count as L1
cache? :>
 

Felger Carbon

Tony Hill said:
Quick memory refresh needed here.. Was it the 386 or the 486 that
first introduced L1 caches into Intel's line of processors?

Late in the 386 generation, Intel introduced a _separate_ cache
system-on-a-chip. The 486 was the first x86 CPU to have an on-die
cache. In 1984, the Motorola 68020 had a very small cache - 256
bytes - on-die, but the CPU often ran faster when that small cache
was disabled (it was mainly a loop buffer). Nonetheless, Motorola
did beat Intel by having the first CPU with an on-die cache, and beat
them by at least 4 years.
 

Keith R. Williams

They were for desktop micros. ;-)

Stuff like caches were invented first for mainframes, then for
minicomputers, and recently (historically speaking) for desktop
microprocessors.

Mainframes had the same memory-wall issues (their memory was in a
different frame, meters away), only a decade or two earlier.
....same problems, same solutions. Smaller doesn't make them any
more clever.
Question: when will the first smoke-detector micro to use an L1 cache
be introduced?

It's likely already out there, except that there is no "main
memory". ;-)
 

Robert Myers

Mainframes had the same memory-wall issues (their memory was in a
different frame, meters away), only a decade or two earlier.
...same problems, same solutions. Smaller doesn't make them any
more clever.

I've done an awful lot of work coming up with evidence that casual
claims that you and others have made are wrong. What you have, and
not even in direct response, are more casual claims. All in the same
vein: seen it all, done it all, forget it, everybody knew everything
long ago.

Don't know how it worked out for mainframes, but OoO doesn't help with
the transaction processing problem for microprocessors. I've already
posted the citation several times, and I'm not going to go dig it up
again. It has Patterson's name on it, published about 1996. Yes,
that Patterson.

Unless I'm mistaken, mainframes always have been designed for
transaction processing. In any case, maybe you should dig up the
Patterson citation and send him an e-mail telling *him* that he's just
publishing stuff that you knew before you even started your education.

RM
 

Keith R. Williams

I've done an awful lot of work coming up with evidence that casual
claims that you and others have made are wrong. What you have, and
not even in direct response, are more casual claims. All in the same
vein: seen it all, done it all, forget it, everybody knew everything
long ago.

Well, that's what 30 years working on processors will do fer ya!
You tend to see the same, but different, problems solved the same
way they were before. The scale changes; the answers don't change
nearly as often.
Don't know how it worked out for mainframes, but OoO doesn't help with
the transaction processing problem for microprocessors. I've already
posted the citation several times, and I'm not going to go dig it up
again. It has Patterson's name on it, published about 1996. Yes,
that Patterson.

Unless I'm mistaken, mainframes always have been designed for
transaction processing. In any case, maybe you should dig up the
Patterson citation and send him an e-mail telling *him* that he's just
publishing stuff that you knew before you even started your education.

As soon as one does pipelining OoO makes sense. Mainframes have
been pipelined since forever (I'm not sure about OoO since it is
hardware expensive).

BTW, I've lost track. Are we talking about the memory wall, OoO,
or transaction processing?
 

Robert Myers

(e-mail address removed) says...


As soon as one does pipelining OoO makes sense. Mainframes have
been pipelined since forever (I'm not sure about OoO since it is
hardware expensive).

Patterson's paper showed that a P6 core was stalled 60% of the time in
on-line transaction processing. I actually exaggerated that OoO
didn't help (actually, I just repeated the exact wording of the
paper). An in-order Alpha was stalled 80% of the time.
BTW, I've lost track. Are we talking about the memory wall, OoO,
or transaction processing?

"same problems, same solutions. Smaller doesn't make them any more
clever."

Problem: memory wall
Solution: OoO
Old evidence that should exist if your logic (same problems, same
solutions) has any substance: OoO wouldn't have been much help in
transaction processing on mainframes, either. Patterson is so
desperate to publish that he attaches his name to old news?

You implied it had all been learned long ago on mainframes. Since
mainframes are used for transaction processing, the fact that OoO
wouldn't help for transaction processing should have been old news by
the time Patterson and his student got around to the 1996 paper.

The fact that OoO isn't all that big a help for transaction processing
may be one reason why Intel stuck with an in-order Itanium. Maybe
transaction processing was one of the applications George had in mind
when he referred to Itanium applications that are "embarrassingly
appropriate."

RM
 

RusH

Keith R. Williams said:
Gee, even the much maligned Cyrix 6X86 was an OoO processor, sold
in what, 1996? Evidently Cyrix thought it was a winner, and they
weren't wrong.

some made it by mistake :

[quote from linux sources i386/io.h]

* Cache management
*
* This needed for two cases
* 1. Out of order aware processors
* 2. Accidentally out of order processors (PPro errata #51)

accidentally? that's nice

Regards.
 

Nate Edel

In comp.sys.ibm.pc.hardware.chips Tony Hill said:
Quick memory refresh needed here.. Was it the 386 or the 486 that
first introduced L1 caches into Intel's line of processors?

486s were the first Intel x86 chips with an on-chip cache. (And I don't
think any of the earlier non-Intel 386s had one.)

Many 386s had an off-chip cache by the time the 486 became reasonably
common.

It was initially a single level cache on the 386, so wouldn't that still be
an L1 cache? It was only once you had the on-chip 486 cache and an off-chip
cache that the L1 and L2 distinction made sense for PCs.
 

Rob Stow

Nate said:
486s were the first Intel x86 chips with an on-chip cache. (And I don't
think any of the earlier non-Intel 386s had one.)

Many 386s had an off-chip cache by the time the 486 became reasonably
common.

My vague recollection of some ASM programming I did umpteen million
years ago tells me that making some things work in protected mode
required me to fiddle with the "segment descriptor cache". IIRC,
that was introduced with the 80286 but it might not have been until
the 80386. This wasn't exactly a RAM cache - it had more to do with
RAM management than with caching the actual contents of the RAM.

The 80386 had no on-chip L1 cache. Cheap motherboards had no off-chip
cache, but many had anywhere from 32 KB to 256 KB. Enthusiasts could and
frequently did buy lightning-fast 30 or 35 ns chips to upgrade and/or
max out their on-board cache. I desoldered small chips and replaced
them with larger ones on several occasions, but in most cases it was
just a matter of prying a chip out of a socket and pushing in the
replacement.

I also recall one system where the manual said I could install
L1 cache *or* use a Weitek/80387 math coprocessor, but not both :-D

The 80486 had an 8 KB on-chip L1 cache. I /think/ it was unified.
Because of the much lower latency, the 8 KB on-chip L1 in an 80486 was
supposed to have been as good as 64 KB of off-chip L1 with an 80386.
I /think/ one of the things done to cripple a 486DX to get a 486SX
was to permanently disable the L1 cache.

80486 motherboards commonly had up to 256 KB of L2 - but there were a
few enthusiast boards with up to 1 MB and I vaguely recall reading
about a motherboard with 2 MB. Real performance nuts occasionally
also used an expensive type of SIMM that put 4 KB or 8 KB of cache
on each SIMM - wish I could remember what the heck that kind of SIMM
was called.
 

Nate Edel

In comp.sys.intel Rob Stow said:
My vague recollection of some ASM programming I did umpteen million
years ago tells me that making some things work in protected mode
required me to fiddle with the "segment descriptor cache". IIRC,
that was introduced with the 80286 but it might not have been until
the 80386. This wasn't exactly a RAM cache - it had more to do with
RAM management than with caching the actual contents of the RAM.

My own recollection is hazy, but the article at:

http://x86.ddj.com/ddj/aug98/aug98.htm

confirms it... the segment descriptor cache only became something like a
real cache on the Pentium and Pentium II/III (and I'd imagine the P4, but
don't know for sure.) On the 286-486 and Pentium Pro, the segment
descriptor cache is just a set of shadow registers, one per segment
register.

For that matter, how big was the TLB on the 386/486?
The 80386 had no on-chip L1 cache. Cheap motherboards had no off-chip
cache, but many had anywhere from 32 KB to 256 KB.

Cheap/early motherboards; by the time the later 386 chips were selling, at
least a 32k or 64k cache on the motherboard was pretty ubiquitous.
I also recall one system where the manual said I could install
L1 cache *or* use a Weitek/80387 math coprocessor, but not both :-D

That's one I never saw.

Off the top of my head, I can't recall whether I ever saw a 386SX system
with a cache.
The 80486 had an 8 KB on-chip L1 cache. I /think/ it was unified.

My own recollection is that it was unified.
Because of the much lower latency, the 8 KB on-chip L1 in an 80486 was
supposed to have been as good as 64 KB of off-chip L1 with an 80386.

Or better; between the improved cache latency and some instruction speedups,
the 25 MHz 80486 was supposed to beat the fastest (AMD-made) 40 MHz 80386. My
own impression with DOS games at the time was that they were pretty closely
comparable.
I /think/ one of the things done to cripple a 486DX to get a 486SX
was to permanently disable the L1 cache.

I think the disabled FPU was the only difference between the 486DX and
486SX; I don't think it had the cache disabled.
80486 motherboards commonly had up to 256 KB of L2 - but there were a
few enthusiast boards with up to 1 MB and I vaguely recall reading
about a motherboard with 2 MB. Real performance nuts occasionally
also used an expensive type of SIMM that put 4 KB or 8 KB of cache
on each SIMM - wish I could remember what the heck that kind of SIMM
was called.

There was a later type of cache SIMM called COAST (cache on a stick) that
was used with second-generation Pentium boards, and may have been used with
some very very late 486 boards as well, but I doubt that's what you're
thinking of ... these ranged, IIRC, between 256 KB and 1 MB (2 MB?) and I don't
recall ever seeing a system with more than one socket.
 

Keith R. Williams

Keith R. Williams said:
Gee, even the much maligned Cyrix 6X86 was an OoO processor, sold
in what, 1996? Evidently Cyrix thought it was a winner, and they
weren't wrong.

some made it by mistake :

[quote from linux sources i386/io.h]

Is there a date on that?
* Cache management
*
* This needed for two cases
* 1. Out of order aware processors
* 2. Accidentally out of order processors (PPro errata #51)

accidentally? that's nice

ROTFLOL! Any specifics?
 

George Macdonald

I'm not sure what you mean by "difficult problems." One of the most
active participants and a reasonably prolific publisher in this
particular area of research is Microsoft. I'm not certain, but I
have a feeling that they aren't looking at a world in which Open
Source will replace paid-for software, either. :).

I don't mean basic Web Browsing, Word etc., but a humongous spreadsheet
with a Solve might qualify - basically anything that has some compute
complexity and can benefit significantly in performance from
feedback/retrain. Academic and semi-academic institutions are often good
at algorithm theory and expression; where it's complex/difficult to turn
into code, it very often needs a commercial implementation to get the best
out of it.
There are two problems that I know of with distribution of commercial
software. One problem is that the available cache may be nowhere near
to the amount of cache for which the binary was optimized, and the
other is that programs are no longer in general statically linked.

The first of the two problems (available cache size is not
predictable) is why Intel has very little choice but to take actual
run-time conditions into account in one way or another. The
easiest way for them to do that, and the one I expect Intel to
implement first (I have no inside information whatsoever), is to use
speculative threading as a prefetch mechanism. More generally, it is
no secret that Intel is working on making Itanium out-of-order, albeit
in a limited way.

I've certainly seen software that, even on x86, adapts to cache sizes
where it helps, e.g. stuff with matrix block diagonal decomposition, and it has
been done without special compiler aid. My gut feel is that trying to get
a compiler to handle such stuff automatically can never yield optimal
results.
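
As a rough sketch of that kind of adaptation - not taken from any particular
package, and with the function name and the way cache_bytes is obtained made
up for illustration - a blocked matrix multiply can pick its tile size at run
time from the cache size, with no compiler feedback involved:

/* Blocked matrix multiply that adapts its tile size to the cache at run
 * time.  cache_bytes would come from CPUID or a config file in a real
 * package; here it is just a parameter.  Caller must zero-initialize c. */
#include <stddef.h>

void blocked_matmul(size_t n, const double *a, const double *b, double *c,
                    size_t cache_bytes)
{
    /* Grow the tile while three blocks still fit: 3*B*B*sizeof(double) <= cache. */
    size_t blk = 16;
    while (3 * (blk * 2) * (blk * 2) * sizeof(double) <= cache_bytes)
        blk *= 2;

    for (size_t ii = 0; ii < n; ii += blk)
        for (size_t kk = 0; kk < n; kk += blk)
            for (size_t jj = 0; jj < n; jj += blk)
                for (size_t i = ii; i < n && i < ii + blk; i++)
                    for (size_t k = kk; k < n && k < kk + blk; k++)
                        for (size_t j = jj; j < n && j < jj + blk; j++)
                            c[i * n + j] += a[i * n + k] * b[k * n + j];
}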
The most obvious way to get around the dll problem is static linking
of binaries. You can provide some degree of installation choice in
the same way that auto-install programs work now.


I'm not sure, but I think you are imagining that the binary you
produce would vary tremendously with input conditions, so that one
customer allowed to continue to train and recompile the code would end
up with a very different binary from another customer that was also
allowed to continue to train and recompile the code. Were that so,
the whole concept would make no sense at all, and the evidence is
abundant that such is not the case. Even the most unlikely of
software, like Microsoft Word, shows an incredible level of
predictability.

What I'm thinking of is a general purpose package like a Mathematical
Programming system (Extended LP if you like), where a client might have
several different problems to solve which have completely different
characteristics in terms of matrix sparsity and compute complexity. I see
a different dataset giving completely different feedback to a retrain here
and therefore different binary versions for each dataset.
I haven't seen anyone talk about it in the literature, but there is no
reason I can see why a program cannot allow a certain amount of tuning
on-site with a limited optimization space. The current assumption, as
far as I know, is that commercial software will be delivered
"fully-trained."

This is where I disagree... based on my personal experience. IOW this is
the flaw from my POV - can't be done for everything and certainly not for
the stuff I'm familiar with.

I don't know about everyone working on this problem, but I know that
at least some people are looking beyond the single-processor problem.
If we can't manage to cope with the dataflow problem for a single
processor, how in heaven's name are we going to get decent performance
out of thousands or hundreds of thousands of processors?

We already have supercomputers that occupy a large amount of physical
space. Speed of light limitations will mean that autonomous entities
will have to be able to figure out what to do next without waiting for
guidance from a central scheduler. It is very easy to come up with
foreseeable problems that warrant the effort that is being expended.

This seems to go way beyond any notion of statically trained commercial
software.

Rgds, George Macdonald

"Just because they're paranoid doesn't mean you're not psychotic" - Who, me??
 

George Macdonald

I also recall one system where the manual said I could install
L1 cache *or* use a Weitek/80387 math coprocessor, but not both :-D

Not sure what you mean but the Weitek was not 80387 compatible - it was a
different coprocessor with a different instruction set. In that timeframe,
an 80x87-compatible coprocessor was Cyrix's first product.

Rgds, George Macdonald

"Just because they're paranoid doesn't mean you're not psychotic" - Who, me??
 

George Macdonald

Quick memory refresh needed here.. Was it the 386 or the 486 that
first introduced L1 caches into Intel's line of processors?

With the 80386 there was one level of (off-chip) cache, using the 82385 cache
controller, which IIRC sat between the CPU and the "chipset" as it existed
at the time (1987/88) and could handle up to 32KB of cache memory directly.
There was also an 82395DX Smart Cache controller in Intel's lineup for the
80386DX which postdates the intro of the 80486. Intel called those "First
Level cache".

There were also some non-Intel caches such as ALR's "Powercache4" and
probably some others.

Rgds, George Macdonald

"Just because they're paranoid doesn't mean you're not psychotic" - Who, me??
 

Robert Myers

I don't mean basic Web Browsing, Word etc., but a humongous spreadsheet
with a Solve might qualify - basically anything that has some compute
complexity and can benefit significantly in performance from
feedback/retrain. Academic and semi-academic institutions are often good
at algorithm theory and expression; where it's complex/difficult to turn
into code, it very often needs a commercial implementation to get the best
out of it.
A spreadsheet, of course, is nothing more than an interpreted
functional programming language. I have *zero* knowledge of how well
Itanium does with interpreters, but you just added to my list of
things to do. :-/.

I think that Intel was hoping that DynamoRio might help in situations
like that. I haven't heard much about DynamoRio recently. Intel has
lost its enthusiasm, things haven't gone as well as hoped, I haven't
been paying close enough attention... I don't know. In general, I
think interpreters will need the help of a run-time supervisor and
optimizer.

*Compiled* compute-intensive applications can benefit spectacularly
from training and re-compilation. That's not a big market, but it
happens to be the business I'm in. Thus, my interest, at least
initially. People who want to do that kind of work are going to have
to own a compiler.

You're not optimistic about open source, but I think putting the
Chinese Academy of Sciences on ORC was a really smart move. China
wants to develop its scientific infrastructure and is going to put
serious effort behind open source because it has very little choice,
other than continued and expanding rampant piracy.

I've certainly seen software that, even on x86, adapts to cache sizes
where it helps, e.g. stuff with matrix block diagonal decomposition, and it has
been done without special compiler aid. My gut feel is that trying to get
a compiler to handle such stuff automatically can never yield optimal
results.
FFTW seems to do pretty well without help from a compiler.

Intel and Microsoft both have teams of compiler experts working on
this problem full time. My read is that Microsoft loves Itanium in a
way they are never going to love x86-64, and they are certainly not
going to let Intel allow the whole world to write better software with
a compiler they don't own and control. Itanium has done wonders for
the world of compiler development, as I read it. In some ways, the
compiler horse race is at least as interesting as the processor horse
race.

What I'm thinking of is a general purpose package like a Mathematical
Programming system (Extended LP if you like), where a client might have
several different problems to solve which have completely different
characteristics in terms of matrix sparsity and compute complexity. I see
a different dataset giving completely different feedback to a retrain here
and therefore different binary versions for each dataset.
I think the world is moving toward BLAS as a lingua franca for
compute-intensive applications written by non-specialists, and it's
not hard to imagine building a pretty smart BLAS and LAPACK for
Itanium. Two FFT packages out there, FFTW and UHFFT, have a lot of
adaptability built in, so you don't have to do a lot of weenie work when
you move your FFT-based application to a new machine. I can see that
general approach working pretty well for Itanium and frequently-used
compute-intensive software.
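
To illustrate that built-in adaptability, here is a minimal FFTW3-style
sketch: with FFTW_MEASURE the planner times candidate algorithms on the
machine at hand and keeps the fastest, so the same source retunes itself when
it moves to new hardware. The transform size and input values are arbitrary.

/* Minimal FFTW3 plan-then-execute example.  Compile/link with -lfftw3. */
#include <fftw3.h>

int main(void)
{
    const int n = 4096;
    fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

    /* Planning is where the per-machine tuning happens; it can be slow,
     * which is why plans are reused (or saved as "wisdom"). */
    fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_MEASURE);

    for (int i = 0; i < n; i++) { in[i][0] = i % 7; in[i][1] = 0.0; }

    fftw_execute(p);          /* the tuned transform itself */

    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    return 0;
}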
This is where I disagree... based on my personal experience. IOW this is
the flaw from my POV - can't be done for everything and certainly not for
the stuff I'm familiar with.
People who have bought Itania for FP applications after looking at Itanium
SPEC numbers have generally been pretty disappointed, and I'm not
surprised. If Itanium survives (and I think it will) I think
mathematical programming platforms like Mathematica and Matlab
will adapt to it pretty well over time. The very fact that you can
build a Mathematica and a Matlab and have such a large customer base
with such a diverse array of applications tells you that, short of
people working at the hammer and tongs level, the world of numerical
programming is pretty repetitive.

This seems to go way beyond any notion of statically trained commercial
software.

Well, sure. I've been pretty open about the reasons for my interest
in this problem. I can't include a disclaimer with every post.

RM
 

Rob Stow

George said:
Not sure what you mean but the Weitek was not 80387 compatible - it was a
different coprocessor with different instruction set. In that timeframe,
the 80x87 was Cyrix's first product.

In this case I meant to convey uncertainty about whether that
particular system had a socket for a Weitek or an 80387.
 

Keith R. Williams

Patterson's paper showed that a P6 core was stalled 60% of the time in
on-line transaction processing. I actually exaggerated that OoO
didn't help (actually, I just repeated the exact wording of the
paper). An in-order Alpha was stalled 80% of the time.

Of course there are no other differences between a P6 and Alpha.
"same problems, same solutions. Smaller doesn't make them any more
clever."

Ah, we shouldn't be using one argument to obfuscate another then,
eh?
Problem: memory wall
Solution: OoO


Nope.

Problem: Memory wall (latency)
Solution: Caches (many of 'em), Branch prediction, speculative
execution, speculative loads, swamp with bandwidth, pray.

Problem: How to get 1 IPC
Solution: Pipeline

Problem: How to reduce pipeline stalls
Solution: OoO
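
A rough toy contrasting the first and third problems above: the pointer chase
is one long dependent chain, so every miss is a full memory round trip that
neither caches nor a realistic OoO window can hide, while the plain array sum
exposes independent loads that prefetching and OoO overlap easily. The
structure layout and helper functions are made up for illustration.

/* Latency (memory wall) vs. overlappable work -- purely a toy fragment. */
struct node { struct node *next; long pad[7]; };  /* roughly one cache line per node */

long chase(struct node *p, long steps)
{
    long n = 0;
    while (p && n < steps) { p = p->next; n++; }  /* serialized: each load waits for the last */
    return n;
}

long stream(const long *a, long count)
{
    long sum = 0;
    for (long i = 0; i < count; i++)
        sum += a[i];          /* independent loads: easy to prefetch and overlap */
    return sum;
}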
Old evidence that should exist if your logic (same problems, same
solutions) has any substance: OoO wouldn't have been much help in
transaction processing on mainframes, either. Patterson is so
desperate to publish that he attaches his name to old news?

The "existence theorem" is in my favor. It was, thus that is the
way it is.
You implied it had all been learned long ago on mainframes. Since
mainframes are used for transaction processing, the fact that OoO
wouldn't help for transaction processing should have been old news by
the time Patterson and his student got around to the 1996 paper.

TP is all mainframes do? I don't *think* so. When your
theory starts out with a false premise it's time to junk the theory.
The fact that OoO isn't all that big a help for transaction processing
may be one reason why Intel stuck with an in-order Itanium. Maybe
transaction processing was one of the applications George had in mind
when he referred to Itanium applications that are "embarrassingly
appropriate."

Having been around the industry for some time, I prefer the
"NIH" theory. ;-)
 
