IBM to build Opteron-Cell hybrid supercomputer with 1 PetaFlop of performance


The little lost angel

You're gonna have to wait for a whole new generation
of compiler writers, which is gonna be tricky since
practically every university computer science program
is now nothing but web design and JavaScript :).

That's bull, it's web design and JAVA ;)
 

Scott Michel

Sander said:
Or maybe what happens is what has happened time and again, and the magic
compilers fail to show up. Especially more so compilers that can work
their magic on bad old code.

gcc doesn't really help you if you don't know what you're doing. Loop
unrolling comes to mind: can't tell you how many times I've had to
forcibly do loop unrolling where one would have expected gcc to do it
with "-O3 -funroll-loops".

There is some hope on the horizon, like LLVM from UIUC, which you'll
see under the hood in OS X "Leopard". I'm not sure if I'd expect to
see Cell SPU support in Java, although IBM will likely make that
happen. Sure, compilers can take hints, but it seems to me that it
takes an interpretive system, like LLVM or Python, to take the
"bird's eye" view and dispatch tasks to the SPUs. Simple loop-level
parallelism, while common, is likely the wrong level of granularity.
 

Robert Redelmeier

In comp.sys.ibm.pc.hardware.chips Scott Michel said:
gcc doesn't really help you if you don't know what you're doing.

Agreed, `gcc` can be cantankerous.
Loop unrolling comes to mind: can't tell you how many times
I've had to forcibly do loop unrolling where one would have
expected gcc to do it with "-O3 -funroll-loops".

Loop unrolling is not as useful on modern processors (I do not
consider the Pentium4 "modern") as it used to be: It dilutes the
I-cache and forces more fetches, and the cost of branching/looping
is relatively low with decent branch prediction and parallel
OoO exec. An unroll of 2x or 4x should be more than enough for
the ROB to chew on.

-- Robert
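
A minimal C sketch of the modest unroll Robert describes; the 4x
factor, the four independent accumulators, and the function name are
illustrative assumptions, not code from the thread:

    /* Sum an array with a 4x unroll and four independent accumulators:
     * enough in-flight work for the ROB and branch predictor, without
     * diluting the I-cache the way a large unroll would. */
    float sum4(const float *a, int n)
    {
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
        int i;
        for (i = 0; i + 3 < n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++)      /* remainder iterations */
            s0 += a[i];
        return (s0 + s1) + (s2 + s3);
    }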
 

Scott Michel

Robert said:
Agreed, `gcc` can be cantankerous.


Loop unrolling is not as useful on modern processors (I do not
consider the Pentium4 "modern") as it used to be: It dilutes the
I-cache and forces more fetches, and the cost of branching/looping
is relatively low with decent branch prediction and parallel
OoO exec. An unroll of 2x or 4x should be more than enough for
the ROB to chew on.

I still find it useful. I was doing some basic performance measurements
on saxpy to compare an AMD-64 to a GPU and found I had to unroll the
"y_new = y_old + alpha * x" equation 16x to get around a GFLOP on
single-precision numbers. By contrast, "-O3 -funroll-loops" and plain
"-O3" were very disappointing at around 40 MFLOPs or so (although it
did show that a GPU can far outperform the AMD-64 and gcc).
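
For reference, a hand-unrolled saxpy of the kind Scott describes might
look like the C sketch below; the function name and the remainder
handling are assumptions, since the actual code isn't in the thread:

    /* y = y + alpha * x, unrolled 16x by hand. */
    void saxpy16(int n, float alpha, const float *x, float *y)
    {
        int i;
        for (i = 0; i + 15 < n; i += 16) {
            y[i]      += alpha * x[i];
            y[i + 1]  += alpha * x[i + 1];
            y[i + 2]  += alpha * x[i + 2];
            y[i + 3]  += alpha * x[i + 3];
            y[i + 4]  += alpha * x[i + 4];
            y[i + 5]  += alpha * x[i + 5];
            y[i + 6]  += alpha * x[i + 6];
            y[i + 7]  += alpha * x[i + 7];
            y[i + 8]  += alpha * x[i + 8];
            y[i + 9]  += alpha * x[i + 9];
            y[i + 10] += alpha * x[i + 10];
            y[i + 11] += alpha * x[i + 11];
            y[i + 12] += alpha * x[i + 12];
            y[i + 13] += alpha * x[i + 13];
            y[i + 14] += alpha * x[i + 14];
            y[i + 15] += alpha * x[i + 15];
        }
        for (; i < n; i++)      /* remainder iterations */
            y[i] += alpha * x[i];
    }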
 

Robert Redelmeier

In comp.sys.ibm.pc.hardware.chips Scott Michel said:
Robert said:
Loop unrolling is not as useful on modern processors (I do not
consider the Pentium4 "modern") as it used to be: It dilutes the
I-cache and forces more fetches, and the cost of branching/looping
is relatively low with decent branch prediction and parallel
OoO exec. An unroll of 2x or 4x should be more than enough for
the ROB to chew on.

I still find it useful. I was doing some basic performance measurements
on saxpy to compare an AMD-64 to a GPU and found I had to unroll the
"y_new = y_old + alpha * x" equation 16x to get around a GFLOP on
single-precision numbers. By contrast, "-O3 -funroll-loops" and plain
"-O3" were very disappointing at around 40 MFLOPs or so (although it
did show that a GPU can far outperform the AMD-64 and gcc).


If I understand you correctly, the GPU benefitted from
the unrolling. I'm hardly surprised. But are you sure you
weren't comparing memory speeds more than processing speeds?
Try it on a working set size that fits inside L1.

40 MFLOPS corresponds to about 480 Mbyte/s, which might be
all that system can sustain for interleaved read-read-write.
GPUs (graphics processing units, I assume) have _much_ higher
bandwidth, at least to local memory.

-- Robert
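
Robert's 480 Mbyte/s figure works out if each y update is counted as
one op touching three single-precision words; a back-of-envelope C
sketch (the accounting is an assumption about how he counted, not
something stated in the thread):

    /* 40M updates/s, each reading x[i] and y[i] and writing y[i],
     * all 4-byte floats: 40e6 * 12 bytes = 480 Mbyte/s. */
    #include <stdio.h>

    int main(void)
    {
        double updates_per_sec = 40e6;       /* the observed ~40 MFLOPs  */
        double bytes_per_update = 4 + 4 + 4; /* read x, read y, write y */
        printf("%.0f Mbyte/s\n", updates_per_sec * bytes_per_update / 1e6);
        return 0;
    }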
 

Scott Michel

Robert said:
In comp.sys.ibm.pc.hardware.chips Scott Michel said:
I still find it useful. I was doing some basic performance measurements
on saxpy to compare an AMD-64 to a GPU and found I had to unroll the
"y_new = y_old + alpha * x" equation 16x to get around a GFLOP on
single-precision numbers. By contrast, "-O3 -funroll-loops" and plain
"-O3" were very disappointing at around 40 MFLOPs or so (although it
did show that a GPU can far outperform the AMD-64 and gcc).


If I understand you correctly, the GPU benefitted from
the unrolling. I'm hardly surprised. But are you sure you
weren't comparing memory speeds more than processing speeds?
Try it on a working set size that fits inside L1.

40 MFLOPS corresponds to about 480 Mbyte/s, which might be
all that system can sustain for interleaved read-read-write.
GPUs (graphics processing units, I assume) have _much_ higher
bandwidth, at least to local memory.


The reverse. The GPU can't do loop unrolling, since it controls the
entire iteration through the matrix being processed (it's implied
looping, to be precise). It was the AMD-64 for which I had to do the
manual unrolling.

gcc is not your friend.
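
The distinction Scott draws can be sketched in C: on the CPU the
programmer writes (and may unroll) the loop, while a GPU of that era
supplies the iteration itself and you write only the per-element body.
Both functions below are illustrative assumptions, not code from the
thread:

    /* CPU: the loop is explicit, so manual unrolling is possible. */
    void saxpy_cpu(int n, float alpha, const float *x, float *y)
    {
        for (int i = 0; i < n; i++)
            y[i] += alpha * x[i];
    }

    /* GPU-style: the "kernel" is just the loop body; the hardware
     * iterates it over the whole stream (implied looping). */
    float saxpy_kernel(float alpha, float xi, float yi)
    {
        return yi + alpha * xi;
    }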
 

Phil Armstrong

Scott Michel said:
In comp.sys.ibm.pc.hardware.chips Scott Michel said:
I still find it useful. I was doing some basic performance measurements
on saxpy to compare an AMD-64 to a GPU and found I had to unroll the
"y_new = y_old + alpha * x" equation 16x to get around a GFLOP on
single-precision numbers. By contrast, "-O3 -funroll-loops" and plain
"-O3" were very disappointing at around 40 MFLOPs or so (although it
did show that a GPU can far outperform the AMD-64 and gcc).
[snip]
gcc is not your friend.


Was the loop not being unrolled at all by gcc? Did -funroll-all-loops
help?

Phil
 

Bernd Paysan

Scott said:
The reverse. The GPU can't do loop unrolling, since it controls the
entire iteration through the matrix being processed (it's implied
looping, to be precise). It was the AMD-64 for which I had to do the
manual unrolling.

gcc is not your friend.

More than 10 years ago, when I was still a student, one of the PhD
students in the numerics faculty held a matrix-multiply competition for
the HP PA-RISC CPUs we had on our workstations. He estimated that
30 MFLOPs would be possible, even though a naive C loop got less than
1 MFLOP, and the HP Fortran compiler with a built-in "extremely fast"
matrix multiplication got no more than 10 MFLOPs.

After doing some experiments, I did indeed get 30 MFLOPs out of the
thing, by doing several levels of blocking. The inner loop kept a small
submatrix accumulator (as much as would fit; I think I got 5x5 into the
registers), so that several rows and columns could be multiplied
together in one go (saving a lot of loads and stores). The next
blocking level was the (quite large) cache of the PA-RISC machine,
i.e. subareas of both matrices were multiplied together.

I never got around to making the matrix multiplication routine general
purpose (the benchmark one could only multiply 512x512 matrices), but
today this sort of blocking is state of the art in high-performance
numerical libraries. GCC isn't your friend, because loop unrolling here
is really the wrong approach. The inner loop I used just did all the
multiplications for the 5x5 submatrix, and no further unrolling was
necessary.
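
A C sketch of the register-blocking Bernd describes; the dimensions,
names, and single blocking level are assumptions (his benchmark was
512x512 and also blocked for the cache):

    /* Keep a 5x5 submatrix of C in locals (registers) and stream A and
     * B through it, so each loaded element is reused five times.
     * Assumes N is a multiple of 5; a real version needs edge handling
     * and a second blocking level sized to the cache. */
    #define N 500

    void matmul_block5(const double A[N][N], const double B[N][N],
                       double C[N][N])
    {
        for (int i = 0; i < N; i += 5)
            for (int j = 0; j < N; j += 5) {
                double c[5][5] = { { 0.0 } };    /* register accumulator */
                for (int k = 0; k < N; k++)
                    for (int bi = 0; bi < 5; bi++) {
                        double a = A[i + bi][k]; /* one load, five uses */
                        for (int bj = 0; bj < 5; bj++)
                            c[bi][bj] += a * B[k][j + bj];
                    }
                for (int bi = 0; bi < 5; bi++)
                    for (int bj = 0; bj < 5; bj++)
                        C[i + bi][j + bj] = c[bi][bj];
            }
    }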
 
