On Thu, 10 Nov 2005 23:30:05 -0500, Tony Hill <(E-Mail Removed)>
wrote:
>On Thu, 10 Nov 2005 17:09:35 -0500, George Macdonald
><fammacd=!SPAM^(E-Mail Removed)> wrote:
>
>>On Thu, 10 Nov 2005 04:57:40 -0500, Tony Hill <(E-Mail Removed)>
>>wrote:
>>>Linpack doesn't touch the cache much at all. The real advantage Intel
>>>has in that test is SSE2. Linpack uses almost exclusively SSE2
>>>operations, and in both the P4 and the Opteron the performance of
>>>SSE2, so long as it can get data, is directly proportional to the
>>>clock speed. In this case a 2.4GHz Opteron is going to perform almost
>>>exactly like a 2.4GHz P4.
>>
>>Is Linpack really that dumb?... in which case it's not a very useful
>>benchmark.
>
>Not so much dumb as HUGELY dated. The benchmark was first written in
>the late 1970's and hasn't really changed any in the past 20 years. I
>doubt that it's used much in the real world of HPC stuff; there are
>better algorithms out there today. Here's a quick overview of it:
>
>http://www.top500.org/lists/linpack.php

Yeah I'd seen that before... and it's not Linpack that's dumb in itself,
rather the use of it for evaluation of CPU system competence. The trouble
seems to be that it was originally intended for relatively small matrices:
100x100 and 1000x1000 are mentioned, which are both fractions of the sizes
being applied here.
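For a rough sense of scale, here is a minimal sketch of the storage those two sizes need, assuming dense square matrices of 8-byte double-precision values (the function name is mine, purely for illustration):

```python
def dense_matrix_bytes(n, bytes_per_element=8):
    """Bytes needed to store a dense n x n matrix (assumes 8-byte doubles)."""
    return n * n * bytes_per_element


if __name__ == "__main__":
    for n in (100, 1000):
        size = dense_matrix_bytes(n)
        print(f"{n}x{n}: {size} bytes (~{size / 1e6:.2f} MB)")
```

Under those assumptions that is about 80KB for 100x100 and about 8MB for 1000x1000.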
>> I haven't looked at Linpack source code but there are certainly
>>things which can be done to benefit from cache... e.g. beyond the usual
>>matrix arrangements, surely a "simple" decomposition would be possible and
>>desirable... even based on cache size... not unusual in the real world.
>
>As you can see above, Linpack isn't really designed to be the fastest
>way to solve the problem, but rather a standard way of comparing MANY
>different computer architectures.

But it's only one - reminds me of academics who would propose turning off
the cache(s) to evaluate the "efficiency" of their algorithm... because the
cache was "interfering" with that measure.

> That standard was also largely
>chosen a LONG time ago. It has its uses and can provide a reasonable
>guess as to how good a system will be at solving matrices, but it's
>definitely not going to give you an exact indication of how your
>system will perform on real-world code, even if that code is linear
>algebra.

My own interest is in sparse matrices so it's not a good "guess" for me. :-)

>>>Note that Linpack results do not necessarily reflect real-world
>>>performance, even for applications that primarily revolve around
>>>solving matrices. Linpack doesn't do much to stress out the
>>>memory subsystem on most modern desktops, while real-world linear
>>>algebra often does.
>>
>>Yeah, one of my beefs about many of the linear algebra "benchmarks": they
>>don't do what real-world code does. Again, Linpack is a dumb benchmark if
>>it doesn't stress the memory.
>
>Not only does it not stress memory much, it also doesn't stress
>internode communication much either and yet it is used to determine
>what the "fastest supercomputer" in the world is.

If it "doesn't touch the cache much" and you're moving huge amounts of data
in and out between memory & registers, it *has* to be stressing memory;
even where the access patterns can't always be arranged to benefit from
long sequential, contiguous address bursts... you may not be wringing the
maximum bandwidth from the memory channel but you're still stressing memory
with page switching from pseudo-random accesses.

>Linpack does have its uses, but it's hardly the end-all, be-all of
>benchmarks.
>
>>>Linpack operates in such a way that it's very easy to keep your
>>>pipeline filled at all times which really defeats the purpose of
>>>Hyperthreading.
>>
>>Yeah well I thought it was worth highlighting in view of the HT hype we've
>>seen here just recently... from someone who's never seen it do worse. The
>>fact that performance with HT drops significantly would indicate to me that
>>there is indeed quite good use of cache, the drop being partly due to the
>>expected cache collisions between the threads, with possibly some TLB
>>degradation too. How else do you keep a pipeline filled?
>
>Prefetch everything into your L1 data cache in blocks and run through
>that entire block before moving on to the next chunk.

It's not clear what Linpack allows officially in that respect, if you mean
software manipulation over and above hardware prefetch. So what you really
meant above was "doesn't touch the (*L2*) cache much".;-)... and it depends
on what you mean by "touch" - throw more cache at it and you increase the
size of the last part of the elimination step that's going to fly - with
4MB L2 that'd be, very approximately, the last 500-1000 rows.
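As a back-of-envelope check on that figure, here is a small sketch, assuming 8-byte doubles, a dense square trailing submatrix, and the whole L2 available for matrix data (all of which are my simplifications, not anything Linpack specifies):

```python
import math


def rows_fitting_in_cache(cache_bytes, bytes_per_element=8):
    """Largest n such that a dense n x n block of elements fits in cache_bytes.

    Assumes the entire cache holds matrix data, which overstates what a
    real run would get since code and other data share the cache.
    """
    elements = cache_bytes // bytes_per_element
    return math.isqrt(elements)


if __name__ == "__main__":
    l2_bytes = 4 * 1024 * 1024  # the 4MB L2 mentioned above
    print(rows_fitting_in_cache(l2_bytes))
```

With 4MB this comes to 724 rows, inside the 500-1000 range; a real run would see somewhat fewer since the cache is never entirely free for the matrix.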
>>Perhaps I should not have singled out Linpack - the point was to highlight
>>the difference between the GamePC "desktop" oriented approach and the
>>techchannel workstation set. It's also difficult to know where those
>>results may have been distorted by Intel optimizations - as pointed out by
>>Derek Baker, quoting "Intel provided" results is umm, suspicious enough...
>
>Extremely suspicious given that the results they achieved themselves
>on the same test (for both Paxville and Opteron, but especially
>Opteron) are *SIGNIFICANTLY* lower than published results using the
>same chips, OS and same compiler. When you compare the numbers that
>AMD were able to achieve vs. what Intel achieved on the SPEC scores it
>tells a VERY different story, with the AMD chips coming out on top.

One does wonder what "AMDs Opteron processors with SSE3 co-operate with the
Intel compiled Linpack version likewise problem-free" means?... err,
"problem free"??... and would AMD agree with such a statement???:-) BTW I
don't see a mention of the specific compiler used in the Google translation
other than that hint.
--
Rgds, George Macdonald