Is Itanium the first 64-bit casualty?

Rupert Pigott · Jul 1, 2004

Stephen said:
By both measures, a Mac is a workstation and not a desktop.

I never heard of anyone referring to an Apollo or a Sun as a
Desktop, always a Workstation. Yet I noticed that Mac II's of
similar configuration had higher list prices over here when
they first came out...

Cheers,
Rupert

Stephen Sprunk · Jul 1, 2004

Rupert Pigott said:
I never heard of anyone referring to an Apollo or a Sun as a
Desktop, always a Workstation. Yet I noticed that Mac II's of
similar configuration had higher list prices over here when
they first came out...

The distinction I've always used is that a workstation is a high-performance
server platform that was later scaled down to single-user performance,
whereas a desktop was designed originally with single user in mind and might
be later scaled up to make mid-range servers (often requiring extensive
kludges).

I've also heard the term "workstation" applied to x86 Linux boxes, so some
people seem to associate the term with running a real multiuser OS as
opposed to a consumer OS, not the hardware that is in use.

S

Scott Moore · Jul 1, 2004

Nick said:
Using multiple hardware segments to create a software segment that
is larger than the indexing size is, I agree, an abomination. It
has been done many times and has never worked well.

But a single flat, linear address space is almost equally ghastly,
for different reasons. It is one of the surest ways to ensure
mind-bogglingly revolting and insecure designs.

What is actually wanted is the ability to have multiple segments,
with application-specified properties, where each application
segment is inherently separate and integral. That is how some
systems (especially capability machines) have worked.

That's what paging is for, and, IMHO, a vastly superior system that
gives you memory attributing while still resulting in a linear
address space.
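
To make the "memory attributing" point concrete, here is a minimal sketch, assuming a POSIX system with mmap() and mprotect(). The function name map_two_pages is made up for illustration. It maps two adjacent pages as one linear range and then gives each page its own protection:

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map two adjacent pages in one linear range, fill them, then make the
 * first page read-only while the second stays read-write.
 * Returns the base address, or NULL on failure.
 * map_two_pages is a hypothetical helper, not an existing API. */
static unsigned char *map_two_pages(size_t *page_size_out)
{
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0)
        return NULL;
    size_t page = (size_t)ps;

    void *base = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    unsigned char *p = base;
    memset(p, 0xAB, 2 * page);

    /* Attributes are per page: only the first page changes here. */
    if (mprotect(p, page, PROT_READ) != 0) {
        munmap(p, 2 * page);
        return NULL;
    }

    *page_size_out = page;
    return p;
}
```

After the call, p[0] can still be read and p[page] can still be written, but a write to p[0] would fault. The pointer arithmetic stays purely linear throughout.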

--
Samiam is Scott A. Moore

Personal web site: http://www.moorecad.com/scott
My electronics engineering consulting site: http://www.moorecad.com
ISO 7185 Standard Pascal web site: http://www.moorecad.com/standardpascal
Classic Basic Games web site: http://www.moorecad.com/classicbasic
The IP Pascal web site, a high performance, highly portable ISO 7185 Pascal
compiler system: http://www.moorecad.com/ippas

Being right is more powerful than large corporations or governments.
The right argument may not be pervasive, but the facts eventually are.

Scott Moore · Jul 1, 2004

Rupert said:
I think we should play the Marketoids at their own game : Let's start
referring to IA-64 as "Legacy" now that we have a dual-sourced 64bit
architecture in the x86 world.

It didn't get widespread enough to be a legacy.

Scott Moore · Jul 1, 2004

Nick said:
If you don't care about robustness and security, then obviously a
single flat address space is best. But even current systems are

If you don't care about pimento stuffed olives, you obviously don't
care about world peace.

now trying to separate the executable segments from the writable
ones, which is moving away from a single flat address space. If
you think about it:

Why shouldn't I be able to load a module onto the stack, or use
part of an executable as a temporary stack segment? We used to be
able to do that, after all, and a genuinely integral flat address
space would allow it.

The reason is that it is generally accepted to be sacrificing too
much robustness and security for flexibility.

Having segmentation return would be to me like seeing the Third
Reich make a comeback. Segmentation was a horrible, destructive
design atrocity that was inflicted on x86 users because it locked
x86 users into the architecture.

All I can do is hope the next generation does not ignore the
past, so that the nightmare of segmentation does not
happen again.

Never again !

David Schwartz · Jul 1, 2004

File size does not relate to pointer size in any way.

You want to point inside the file. The only reason people don't think
this way is because they've always had limited address spaces, so they can't
think of files as 'just some memory that happens to be backed to a
semi-permanent medium'.
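
As a rough illustration of "pointing inside the file", here is a minimal sketch assuming a POSIX system. The helper name map_file is invented for this example. It maps a whole file read-only and hands back an ordinary pointer:

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only. On success returns a pointer to its first
 * byte and stores the length in *len_out. Returns NULL on failure or for
 * an empty file. map_file is a hypothetical helper, not an existing API. */
static const unsigned char *map_file(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return NULL;

    *len_out = (size_t)st.st_size;
    return p;
}
```

With a 32-bit address space the mmap() call simply fails once the file is larger than the free address range, which is exactly the limitation being argued about. The caller releases the mapping with munmap().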

If it's under 2GB, you have some serious questions to ask of your OS
developer.

But that's just the point, it's all tradeoffs. You only have 4GB to play
with. If you have 4GB of RAM, then you need to play games. All of this
inconvenience is stuff we're so used to, we don't even realize it's an
issue.

Oh, and just what would that do to the caches?

It would have no effect. Caches don't care if the data is sparse or
dense. They basically just care how much information you're moving. There is
no difference between accessing 512 32KB objects that are addressed one
megabyte apart or one gigabyte apart.

DS

Rupert Pigott · Jul 2, 2004

Stephen Sprunk wrote:

[SNIP]

The distinction I've always used is that a workstation is a high-performance
server platform that was later scaled down to single-user performance,
whereas a desktop was designed originally with single user in mind and might
be later scaled up to make mid-range servers (often requiring extensive
kludges).

Seems like a better way of drawing the line.

Despite the different definition it still places Alphas in the
workstation slot and PCs in the desktop slot.

Mac IIs were schizophrenic by that definition, they could run
MacOS or A/UX.

Given the price and the spec I'd still call the Mac II a workstation
because they were too pricey to find their way onto every desk
in an office. I'm making it sound like the universal truth, but
it's what I saw in practice. :/

The guys doing the heavy work (+PHB) would get a Mac II and the
rest would get something more modest like a Mac SE... At least
that is how it panned out at the few Mac sites I visited.

I've also heard the term "workstation" applied to x86 Linux boxes, so some
people seem to associate the term with running a real multiuser OS as
opposed to a consumer OS, not the hardware that is in use.

I think it's a fair cop in that regard.

Cheers,
Rupert

Rupert Pigott · Jul 2, 2004

Scott said:
It didn't get widespread enough to be a legacy.

I think it's old enough and dead enough and has had enough money thrown
at it to be called legacy though.

Cheers,
Rupert

Dale Pontius · Jul 2, 2004

Marketing and the perceptions of PHBs with the chequebooks.

Not my call. Reminds me a little of the thread about some Intel dude
calling SPARC "proprietary".

I think we should play the Marketoids at their own game : Let's start
referring to IA-64 as "Legacy" now that we have a dual-sourced 64bit
architecture in the x86 world.

Easier to stick to absolute truths, and highlight just how proprietary
IA-64 is. (all the third-party IP holding, etc)

Dale Pontius

Judd · Jul 2, 2004

Tony Hill said:
Smart business move for who?!? For Intel maybe, but how exactly does
it help MS' cause? It's not like they're selling more by not having a
product now and I can't see any way that it would help them long-term.
Not releasing the product until the end of this year or early next
year (it looks like 64-bit Windows is being delayed *again*) is only
going to hurt Microsoft relative to Linux.

It isn't hurting them at all. Development costs $$$... In today's world,
you don't develop unless the $$$ is there. The $$$ isn't there until Intel
is OEM'ing large quantities of 64-bit hardware. Linux isn't gaining
anything at all from this.

Yousuf Khan · Jul 2, 2004

Tony Hill said:
Smart business move for who?!? For Intel maybe, but how exactly does
it help MS' cause? It's not like they're selling more by not having a
product now and I can't see any way that it would help them long-term.
Not releasing the product until the end of this year or early next
year (it looks like 64-bit Windows is being delayed *again*) is only
going to hurt Microsoft relative to Linux.

I wonder if Intel's lack of IOMMU support beyond 4GB is going to result in
device drivers having to be revalidated under 64-bit Windows? If
manufacturers were using Opteron and A64 hardware to do device driver
validation, this incompatibility by Intel might require them to redo their
drivers to take into account the Intel discrepancy. I know that MS is now
having to redo its 64-bit beta just for the EM64T, since it breaks
compatibility in this way. Hopefully, the Windows code will have to bear the
brunt of taking care of Intel's oversight, and not the device drivers.

As it turns out Intel's 64-bit doesn't even support DMA beyond 32-bit memory
address boundary.

http://www.theinquirer.net/?article=16879

Yousuf Khan

Yousuf Khan · Jul 2, 2004

Stephen Sprunk said:
I've never run across a desktop app that needs more than 2GB of
address space; for that matter my 512MB machine (with no VM) handles
anything I throw at it, though I'm sure it'd be a bit faster if the
motherboard accepted more than that.

Many news readers are running into the file size limitation when downloading
from binary newsgroups.

Yousuf Khan

Yousuf Khan · Jul 2, 2004

Zalman Stern said:
I suggested using Boyer-Moore-Gosper (BMG) since the search string is
applied to a very large amount of text. A fairly straight forward BMG
implementation (including case insensitivity) in C is ~3 times faster
than strstr on PowerPC and SPARC for the test cases they use. On PIII
class hardware it is faster than strstr by maybe 50%. On a P4 it is a
little slower than strstr.

Typical story, the P4 seems to fall down anytime there is non-linear data
thrown at it.

AMD64 is a well executed piece of practical computer architecture
work. Much more in touch with market issues than IPF ever has been or
ever will be.

Even without the extra registers, AMD64 is achieving big performance gains
just with bog-standard unoptimized 386 code. Mostly due to the integrated
memory controller, I would gather, but possibly also due in small part to
the HyperTransport I/O.

Yousuf Khan

Yousuf Khan · Jul 2, 2004

Stephen Sprunk said:
There was a year or so when an Alpha running x86 binaries on FX!32 did
indeed outpace the fastest x86 machines available, though by less
than a 50% margin. I believe it was right before the P6 core came
out.

As far as I recall, FX!32 came out a long time after the P6 core was
introduced. P6's first generation, PPro, was already obsolete, and they were
already into the second generation, PII. PPro was introduced in late 1995.
I don't think FX!32 came out till sometime in 1997.

I'm sure FX!32 could smoke an x86 core a couple of generations old, but it
never came close to any of the modern x86 cores it was competing against at
the time.

Yousuf Khan

Alexander Grigoriev · Jul 2, 2004

Some ancient software (Premiere v4 IIRC) could happily write AVI files
beyond 2 GB, but such files were unusable.

Now, ODML AVI files are virtually unlimited in size.

There is no need to memmap a file to work on it. And it actually may be
slower than using explicit read.

Zak · Jul 2, 2004

Yousuf said:
I wonder if Intel's lack of IOMMU support beyond 4GB is going to result in
device drivers having to be revalidated under 64-bit Windows? If
manufacturers were using Opteron and A64 hardware to do device driver
validation, this incompatibility by Intel might require them to redo their
drivers to take into account the Intel discrepancy.

It seems the IOMMU is broken for Linux on some chipsets. If it is broken
hard on those, 64 bit windows should know how to work around that already.

I know that MS is now
having to redo its 64-bit beta just for the EM64T, since it breaks
compatibility in this way. Hopefully, the Windows code will have to bear the
brunt of taking care of Intel's oversight, and not the device drivers.

Probably also quite some assumptions in that code (no support for
certain support chips for example).

As it turns out Intel's 64-bit doesn't even support DMA beyond 32-bit memory
address boundary.

That would be very strange as it is supported (using 64 bit PCI
addressing) on older Xeons. But stranger things have happened... hmm...
switch CPU to 32 bit mode, do your 64 bit DMA, then switch back? Hehe...

Thomas

Nick Maclaren · Jul 2, 2004

|>
|> > What is actually wanted is the ability to have multiple segments,
|> > with application-specified properties, where each application
|> > segment is inherently separate and integral. That is how some
|> > systems (especially capability machines) have worked.
|>
|> Thats what paging is for, and, IMHO, a vastly superior system that
|> gives you memory attributing while still resulting in a linear
|> address space.
|>
|> Having segmentation return would be to me like seeing the Third
|> Reich make a comeback. Segmentation was a horrible, destructive
|> design atrocity that was inflicted on x86 users because it locked
|> x86 users into the architecture.
|>
|> All I can do is hope the next generation does not ignore the
|> past to the point where the nightmare of segmentation does not
|> happen again.
|>
|> Never again !

I suggest that you read my postings before responding. It is very
clear that you do not understand the issues. I suggest that you
read up about capability machines before continuing.

You even missed my point about the read-only and no-execute bits,
which are in common use today. Modern address spaces ARE segmented,
but only slightly.

Regards,
Nick Maclaren.

Terje Mathisen · Jul 2, 2004

Zalman said:
I recently helped a friend with performance optimization of some string
searching code. They were using strstr, which was acceptably fast, but
needed a case insensitive version and the case insensitive versions of
strstr were much slower. The platform where this was discovered was
Win32 using VC++ libraries, but Linux, Solaris, AIX, HP-UX, etc. are
also targets for the product.

I suggested using Boyer-Moore-Gosper (BMG) since the search string is
applied to a very large amount of text. A fairly straight forward BMG
implementation (including case insensitivity) in C is ~3 times faster
than strstr on PowerPC and SPARC for the test cases they use. On PIII
class hardware it is faster than strstr by maybe 50%. On a P4 it is a
little slower than strstr.

The Microsoft strstr is a tight hand coded implementation of a fairly
simple algorithm: fast loop to find the first character, quick test
for the second character jumping to strcmp if success, back to first
loop if failure.
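
For readers who want to see that structure spelled out, here is a minimal C sketch of the strategy just described. It is not Microsoft's actual code, and the name simple_strstr is made up; it only mirrors the three steps (first-character scan, second-character test, full compare):

```c
#include <stddef.h>
#include <string.h>

/* Sketch of the described strategy, not the real library routine. */
static const char *simple_strstr(const char *hay, const char *needle)
{
    if (needle[0] == '\0')
        return hay;

    for (;;) {
        /* Fast loop: find the next occurrence of the first character. */
        hay = strchr(hay, needle[0]);
        if (hay == NULL)
            return NULL;

        /* Quick test on the second character, then compare the rest. */
        if (needle[1] == '\0' ||
            (hay[1] == needle[1] &&
             strncmp(hay + 2, needle + 2, strlen(needle + 2)) == 0))
            return hay;

        hay++;      /* mismatch: go back to the first-character scan */
    }
}
```

The cost is dominated by the first-character scan, which touches memory strictly sequentially. That is the part a deep pipeline handles well.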

The BMG code had ridiculous average stride. Like 17 characters for
this test corpus. But the inner loop did not fit in an x86 register
file. (It looked like I might be able to make it fit if I hand coded
it and used the 8-bit half registers, and took over BP and SP, etc.
But it wasn't obvious it could be done and at that point, they were
happy enough with the speedup it did give so...)

OK, this is fun!

I wrote an 8-bit, case-insensitive, BM search many years ago, and I
found it relatively easy to make it fit in 6-7 regs, even in the 16-bit
days.

Let's see... darn, I've lost the source code. :-(

It was something like this (with 486 timings):

; SI -> source string
; DI -> Search record, containing among other things the
;       Skip distance and case conversion lookup tables, as well
;       as a (possibly) monocased target string
; BH must stay zero so that BX is a valid table index

next4:
    mov bl,[si]       ; 1         Load next char to test
    mov bl,[di+bx]    ; 1+1(AGI)  Convert to skip distance
    add si,bx         ; 1         Increment source ptr

    mov bl,[si]       ; 2         Load next char to test
    mov bl,[di+bx]    ; 1+1(AGI)  Convert to skip distance
    add si,bx         ; 1         Increment source ptr

    mov bl,[si]       ; 2         Load next char to test
    mov bl,[di+bx]    ; 1+1(AGI)  Convert to skip distance
    add si,bx         ; 1         Increment source ptr

    mov bl,[si]       ; 2         Load next char to test
    mov bl,[di+bx]    ; 1+1(AGI)  Convert to skip distance
    add si,bx         ; 1         Increment source ptr

    test bx,bx        ; 1         Did we get a match?
    jnz next4         ; 3         No, so try again

; Possible match, check remaining chars...

At this point I did have to spill/fill one register; it might have been
faster to do another explicit test against the second-to-last target
character first.

Something like 10 years ago I figured out how to make it work for
case-insensitive 16-bit strings. There are two key ideas to make this work:

1) Use a _very_ simple hash table for the skip distance lookup, i.e. XOR
the top 6 bits into the lower 10, and use the resulting 0..1023 value as
an index into the skip table. For each location in that table, there
will be 64 collisions. Record the shortest possible skip distance for
any of these. This is normally still very close to the maximum skip
distance!

2) During setup, generate two versions of the target string, one
lowercased and the other uppercased. During a check for a possible match
you should, for each iteration of the match verification loop, test
against both versions of the string.

This avoids a possibly _very_ expensive 16-bit toupper() call for each
char loaded!
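
The two ideas above can be sketched in C roughly as follows. This is a reconstruction under assumptions, not the lost original: the names hash16, bm16_t, bm16_init and bm16_search are invented, the search is the Horspool-style variant (skip table indexed by the last character of the window), and it uses explicit bounds checks instead of a sentinel.

```c
#include <stddef.h>
#include <stdint.h>

#define HASH_SIZE 1024u

/* Idea 1: XOR the top 6 bits of a 16-bit char into the low 10 bits. */
static unsigned hash16(uint16_t c)
{
    return (unsigned)(c ^ (c >> 10)) & (HASH_SIZE - 1u);
}

typedef struct {
    const uint16_t *lower;   /* Idea 2: lowercased copy of the target */
    const uint16_t *upper;   /*         uppercased copy of the target */
    size_t len;              /* target length, must be > 0            */
    size_t skip[HASH_SIZE];  /* shortest skip for each hash bucket    */
} bm16_t;

static void bm16_init(bm16_t *bm, const uint16_t *lower,
                      const uint16_t *upper, size_t len)
{
    bm->lower = lower;
    bm->upper = upper;
    bm->len = len;
    for (size_t i = 0; i < HASH_SIZE; i++)
        bm->skip[i] = len;              /* char not in target: full skip */

    for (size_t i = 0; i < len; i++) {
        size_t d = len - 1 - i;         /* distance to the last position */
        unsigned hl = hash16(lower[i]);
        unsigned hu = hash16(upper[i]);
        if (d < bm->skip[hl]) bm->skip[hl] = d;   /* keep the shortest */
        if (d < bm->skip[hu]) bm->skip[hu] = d;
    }
}

/* Returns a pointer to the first match in text[0..n), or NULL. */
static const uint16_t *bm16_search(const bm16_t *bm,
                                   const uint16_t *text, size_t n)
{
    size_t pos = bm->len - 1;           /* index of the window's last char */
    while (pos < n) {
        size_t s = bm->skip[hash16(text[pos])];
        if (s != 0) {
            pos += s;
            continue;
        }
        /* skip == 0 can be a hash collision, so verify the whole window,
         * testing each char against both case variants. */
        const uint16_t *w = text + (pos + 1 - bm->len);
        size_t i = 0;
        while (i < bm->len && (w[i] == bm->lower[i] || w[i] == bm->upper[i]))
            i++;
        if (i == bm->len)
            return w;
        pos++;
    }
    return NULL;
}
```

Because each bucket keeps the shortest skip of all colliding characters, a collision can only make the search slower, never make it miss a match.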

It turns out that the P4 core can run the first character search loop
very fast. (I recall two characters per cycle, but I could be
misremembering. It was definitely 1/cycle.) The BMG code, despite

1/cycle? Hmmm!

With one char/iteration, that's impossible:

next:
    mov al,[esi]      ; 1
    inc esi           ; 1
    cmp al,bl         ; 2
    jne next          ; 2

Unrolling makes it better:

next4:
    movzx eax,word ptr [esi]      ; Load 16 bits
    cmp bl,al
    je  first_match
    cmp bl,ah
    je  second_match
    movzx eax,word ptr [esi+2]    ; Load next 16 bits
    add esi,4
    cmp bl,al
    je  third_match
    cmp bl,ah
    jne next4
fourth_match:

Even better would be to load 32 bits and check all four characters at
the same time:

next4:
    mov eax,[esi]
    add esi,4
    cmp bl,al
    je  first_match
    cmp bl,ah
    je  second_match
    shr eax,16
    cmp bl,al
    je  third_match
    cmp bl,ah
    jne next4
fourth_match:

A case-insensitive version of the same kind of loop could also be written:

next:
    mov al,[esi]
    inc esi
    cmp al,bl         ; tolower(first_char)
    je  match
    cmp al,bh         ; toupper(first_char)
    jne next
match:
    mov al,[esi]
    cmp al,dl         ; tolower(second_char)
    je  match_pair
    cmp al,dh         ; toupper(second_char)
    jne next
match_pair:

I have a feeling that a regular, case-sensitive, strstr() could work
quite well on x86 by using/abusing the ability to do misaligned loads
quite fast. I.e. only when straddling cache lines do you get a real penalty:

next4:
    cmp ebx,[esi]
    je  match4_0
    cmp ebx,[esi+1]
    je  match4_1
    cmp ebx,[esi+2]
    je  match4_2
    cmp ebx,[esi+3]
    lea esi,[esi+4]   ; LEA leaves the flags alone; ADD would clobber ZF
    jne next4

match4_3:
    sub esi,3
match4_2:
    inc esi
match4_1:
    inc esi
match4_0:
    ; ESI -> place where the first four characters all match,
    ; check the rest!
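
The same misaligned-load idea can be expressed in portable C, as a rough sketch. The function name find4 is made up, and it is an assumption (not a guarantee) that the compiler turns each memcpy() into a single unaligned 32-bit load on x86:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Find the first position in buf[0..n) where the next four bytes equal
 * first4[0..3]. Returns NULL if there is none. find4 is a hypothetical
 * helper for illustration only. */
static const char *find4(const char *buf, size_t n, const char first4[4])
{
    uint32_t want, have;
    memcpy(&want, first4, 4);

    for (size_t i = 0; i + 4 <= n; i++) {
        /* memcpy avoids undefined behaviour from a misaligned pointer
         * cast; typical x86 compilers emit one plain load here. */
        memcpy(&have, buf + i, 4);
        if (have == want)
            return buf + i;     /* first four chars match, check the rest */
    }
    return NULL;
}
```

Comparing the raw 32-bit values is endian-neutral here because both sides are loaded the same way.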

theoretically doing a lot less work, stalls a bunch. A good part of
this is that it has a lot of memory accesses that are dependent on
loads of values being kept in memory because there are not enough
registers. 16 registers is enough to hold the whole thing. (I plan on
redoing the performance comparison on AMD64 soon.)

The inner loop will always fit in registers, but the data-dependent skip
distances mean that all the loads suffer AGI stalls.

The point is this: having "enough" registers provides a certain
robustness to an architecture. It allows register based calling
conventions and avoids a certain amount of "strange" inversions in the
performance of algorithms. (Where the constant factors are way out of
whack compared to other systems.) As a practical day to day matter, it
makes working with the architecture easier. (Low-level debugging is
easier when items are in registers as well.)

I expect there are power savings to be had avoiding spill/fill
traffic, but I'll leave that to someone with more knowledge of the
hardware issues.

AMD64 is a well executed piece of practical computer architecture
work. Much more in touch with market issues than IPF ever has been or
ever will be.

That is something we can agree on!

(Or as I told our Dell representative this spring, by coincidence the
day before Intel caved in and announced their x86-64 version: "Dell has
to deliver 64-bit x86 product within six months or so, or you'll be
history.")

A C(++) version of the 16-bit (possibly) case-insensitive BM:

do {
    do {
        c = *src;
        c = hash(c);        // #define hash(c) ((c) & 1023)
        skip = skiptable[c];
        src += skip;
    } while (skip);

    // Assume we placed a sentinel at the end!
    if (src >= source_end) return NULL;

    // Since the skip table test can have false positives, we
    // must check the entire string here!

    s = src;
    l = target_len;
    tu = target_upper_end;
    tl = target_lower_end;
    do {
        c = *s--;
        if ((c != *tu) && (c != *tl)) break;
        tu--; tl--;
    } while (--l);
    if (!l) return (s + 1);

    src++;
} while (src < source_end);

When doing a case-sensitive search, you simply allocate a single buffer
for the target string, and set both tu and tl to point to the last
character in it.

Terje

PS. When the buffer to be searched can be shared or read-only, then it
is faster to make two versions of the code above, with the first using
an unrolled inner loop that checks that it stays within (UNROLL_FACTOR *
MAX_STRIDE) of the end, and the second version doing an explicit check
on each iteration.

Peter Dickerson · Jul 2, 2004

Yousuf Khan said:
Many news readers are running into the file size limitation when downloading
from binary newsgroups.

Yousuf Khan

File sizes and physical or virtual addressability are not related. It's a
rare app that *needs* to have a whole file mapped into the address space, and
if it does then the app isn't intended to handle large files (i.e. more than
tens of megs). It would be folly for a video-stream editor to have to fit
the file into memory.
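
As a small illustration of working on a file without fitting it into memory, here is a sketch using only standard C stdio. The function name sum_file_bytes and the 64 KB buffer size are arbitrary choices for the example:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Add up every byte of a file using one fixed 64 KB buffer, so memory
 * use does not depend on the file size. Returns 0 on success, -1 on
 * error. sum_file_bytes is a hypothetical helper for illustration. */
static int sum_file_bytes(const char *path, uint64_t *sum_out)
{
    static unsigned char buf[64 * 1024];
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;

    uint64_t sum = 0;
    size_t got;
    while ((got = fread(buf, 1, sizeof buf, f)) > 0) {
        for (size_t i = 0; i < got; i++)
            sum += buf[i];
    }

    int failed = ferror(f);
    fclose(f);
    if (failed)
        return -1;

    *sum_out = sum;
    return 0;
}
```

The loop never holds more than one buffer of data, whatever the pointer size. Whether files past 2GB can be opened at all on a 32-bit system is a separate question of the file-offset type the C library was built with.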

Peter

Carlo Razzeto · Jul 2, 2004

Peter Dickerson said:
File sizes and physical or virtual addressability are not related. It's a
rare app that *needs* to have a whole file mapped into the address space, and
if it does then the app isn't intended to handle large files (i.e. more than
tens of megs). It would be folly for a video-stream editor to have to fit
the file into memory.

Peter

You may very well need to if you're doing a combine and decode in a news
reader. Not sure how specific news readers do this but I could see OE for
example keeping the entire file in memory until it's fully decoded.

Carlo
