Diminishing bandwidth performance with multiple quad core X5355s

  • Thread starter CharlesBlackstone

CharlesBlackstone

Hello, I'm considering getting a second X5355 in my new machine. I've
heard bandwidth gets bad above 4 cores with the Woodcrests.

We do large calculations on multi-gig datasets held in 16 gigs of RAM,
so our performance is bandwidth limited.

How much faster would two quad core X5355 chips be compared to one?

Thanks.
 
kony

CharlesBlackstone said:
Hello, I'm considering getting a second X5355 in my new machine. I've
heard bandwidth gets bad above 4 cores with the Woodcrests.

We do large calculations on multi-gig datasets held in 16 gigs of RAM,
so our performance is bandwidth limited.

How much faster would two quad core X5355 chips be compared to one?

Thanks.


There are too many variables involved to answer your
question with any reasonable degree of accuracy. The
obvious answer is that if you can use a 2nd system, you have
that bandwidth increase.

If you can find someone doing same calcs, same app (?) on
same/similar config who has done this upgrade, only then
would you approach some data that might be extrapolated to
your situation. Maybe more info about these jobs would
help. Maybe not.
 
CharlesBlackstone

kony said:
There are too many variables involved to answer your
question with any reasonable degree of accuracy. The
obvious answer is that if you can use a 2nd system, you have
that bandwidth increase.

If you can find someone doing same calcs, same app (?) on
same/similar config who has done this upgrade, only then
would you approach some data that might be extrapolated to
your situation. Maybe more info about these jobs would
help. Maybe not.





I think a lot of people are aware that an Opteron system has fewer
bandwidth restrictions with a lot of processors, but that Woodcrests
don't have as good a memory controller and fall behind Opterons after
4 cores or so. I'm asking how severe this is. Heavy number crunching of
huge data sets in RAM is a bandwidth-intensive operation. So I'm
asking how badly Woodcrests are impacted above 4 cores (for example, 8
cores vs 4 cores) on bandwidth performance. I didn't think this was
that vague. Is there anything else I can tell you that will make the
question less difficult to answer?

I need to crunch a chunk of data 8 gigs large. It can't be split into
two chunks and the crunching is not a parallel operation, so having
two computers is not helpful.
 
Niels Jørgen Kruse

CharlesBlackstone said:
Hello, I'm considering getting a second X5355 in my new machine. I've
heard bandwidth gets bad above 4 cores with the Woodcrests.

We do large calculations on multi-gig datasets held in 16 gigs of RAM,
so our performance is bandwidth limited.

How much faster would two quad core X5355 chips be compared to one?

You mean you are populating only a single slot on a dual slot board? You
should use both slots to get the most bandwidth out of the system.
 
Paul

CharlesBlackstone said:
I think a lot of people are aware that an Opteron system has less
bandwidth restrictions with a lot of processors, but that woodcrests
don't have as good a memory controller and fall behind opterons after
4 cores or so. I'm asking how severe this is. Heavy number crunching of
huge data sets in RAM is a bandwidth intensive operation. So, I'm
asking how badly woodcrests are impacted above 4 cores, for example, 8
cores vs 4 cores, on bandwidth performance. I didn't think this was
that vague, is there anything else I can tell you that will make the
question less difficult to answer?

I need to crunch a chunk of data 8 gigs large. It can't be split into
two chunks and the crunching is not a parallel operation, so having
two computers is not helpful.

There was a time when, looking at spec.org, the answer to these questions
was easy to see. But the last time I looked there, I wasn't sure I was
even reading the results right. Perhaps you can find evidence for your
choking hypothesis here.

http://www.spec.org/cpu2006/results/

There are sites that have run consumer-oriented benchmarks on such
platforms, but there the danger is that the application is not
using the available resources properly or well. The site linked below
doesn't do a good job of listing the hardware particulars. (I believe
it is possible the computer in question uses two 3GHz X5365's, which
are not listed on processorfinder.intel.com.) This benchmark is not
directly applicable, because it compares dual dual-cores to
dual quad-cores, whereas you want to compare one quad-core to two
quad-cores. Still, I think you can see that the speedup is not
linear, for the quality of applications and testing techniques
they are using.

http://www.barefeats.com/octopro1.html

For the $1200 or so you are going to spend finding out, I think
you'll get some benefit. But it won't be a linear speedup. And
if you test your existing platform and setup, with 1, 2, or 4
cores enabled, I think you may already be able to show what
kind of impact your particular memory access pattern is
having on the platform. If you are seeing pretty close to
linear speedup right now, then chances are you'll get some
benefit from an extra 4 cores. (Enough to justify the $1200.)
If, on the other hand, the box is already collapsing under the
access pattern (say purely random access, some kind of cache
busting pattern), you may already have evidence that the extra
4 cores would be wasted.
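
As a rough illustration of the 1, 2, 4 core comparison described above,
here is a minimal Python sketch that turns wall-clock timings of the same
job into speedup and efficiency figures. The function name scaling_table
and the timings are made up for illustration; substitute your own
measurements.

```python
def scaling_table(timings):
    """Return (cores, speedup, efficiency) rows.

    timings maps the number of cores enabled to the wall-clock seconds
    taken by the same job. An entry for 1 core is required as the baseline.
    """
    base = timings[1]
    rows = []
    for cores in sorted(timings):
        speedup = base / timings[cores]   # how much faster than 1 core
        efficiency = speedup / cores      # 1.0 would be perfectly linear
        rows.append((cores, speedup, efficiency))
    return rows


# Hypothetical timings in seconds, NOT measurements from any real machine.
measured = {1: 400.0, 2: 210.0, 4: 125.0}

for cores, speedup, efficiency in scaling_table(measured):
    print(f"{cores} cores: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

If the efficiency column is already dropping sharply between 2 and 4
cores, that would be the kind of evidence that cores 5 to 8 will not buy
much.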

(Followup arbitrarily set, because my news server won't let me
post without it.)

Paul
 
sndive

CharlesBlackstone said:
Hello, I'm considering getting a second X5355 in my new machine. I've
heard bandwidth gets bad above 4 cores with the Woodcrests.

We do large calculations on multi-gig datasets held in 16 gigs of RAM,
so our performance is bandwidth limited.

How much faster would two quad core X5355 chips be compared to one?

See if running

../sys_basher -mbandwidth

on an unloaded and a loaded system sheds any light on your question.
If not, vary the memory sizes used by sys_basher.
 
jlmarin

CharlesBlackstone said:
I think a lot of people are aware that an Opteron system has less
bandwidth restrictions with a lot of processors, but that woodcrests
don't have as good a memory controller and fall behind opterons after
4 cores or so. I'm asking how severe this is. Heavy number crunching of
huge data sets in RAM is a bandwidth intensive operation. So, I'm
asking how badly woodcrests are impacted above 4 cores, for example, 8
cores vs 4 cores, on bandwidth performance. I didn't think this was
that vague, is there anything else I can tell you that will make the
question less difficult to answer?


Your question is difficult to answer because you'd first need to know
(at least approximately) the ratio of FLOPS to memory accesses, and the
pattern of those accesses. It all boils down to that. If your program
can keep the CPU busy for "long" stretches of time without needing to
access the memory bus, then it will definitely benefit from more
cpus/cores. If, on the other hand, your program needs to go to main RAM
(i.e. loads/stores that miss the cache) very frequently, then you will
have contention on the memory bus and your performance per cpu will
degrade.
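
One common way to put numbers on that ratio is the "roofline" estimate:
attainable throughput is the smaller of the peak compute rate and
(memory bandwidth x FLOPs per byte moved). The sketch below is only that
textbook formula in Python; the function name and every figure in it are
placeholders, not measurements of an X5355 or an Opteron, and it assumes
(pessimistically) that adding a second socket adds no memory bandwidth.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline bound: limited either by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


# Placeholder figures for illustration only.
# A memory-hungry kernel: 0.25 FLOPs for every byte fetched from RAM.
one_socket = attainable_gflops(peak_gflops=40.0, bandwidth_gbs=10.0,
                               flops_per_byte=0.25)
# Doubling the cores doubles the peak, but the bandwidth is assumed unchanged.
two_sockets = attainable_gflops(peak_gflops=80.0, bandwidth_gbs=10.0,
                                flops_per_byte=0.25)

print("one socket :", one_socket, "GFLOPS")
print("two sockets:", two_sockets, "GFLOPS")
```

With numbers like these both cases hit the same bandwidth ceiling, which
is the "contention on the memory bus" situation. A kernel with a much
higher FLOPs-per-byte ratio would instead be limited by the peak and
would scale with the extra cores.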

You ask "how badly" your app will degrade; well, the proper way to
model and predict that would be to use the hardware performance
counters (OProfile under Linux, cputrack on Solaris, etc), and then
you'd get an idea of the rate of instructions vs everything else
(loads/stores to RAM, retired FLOPS, cache misses, TLB misses, etc).
But of course the best way is to measure your program on the real thing.

I wanted to post this even if it's a bit late on the thread because
right now I have exactly this kind of problem.
We're trying to figure out if a dual-Quadcore (Xeon) will be better
(cost/benefit wise) than a 4-way Opteron dualcore, for *our* program.

SPEC CPU 2006 can give you some pretty good insights on this: go to
the advanced query option and list all available results, but filter
by "number of total cores" equal to 8. Go straight to the int_rate and
fp_rate figures, and you'll be able to see how 4-way dual-core Opterons
compare to dual quad-core Xeons. That is at least the picture on the
SPEC 2006 suite, whose programs have quite big working sets, although
they may not be as RAM-bottlenecked as your particular program.

As you say, Opterons do definitely have a much better memory system.
But then a 4-way mobo is WAY more expensive than a dual-socket one...

And btw, if you want to benchmark just memory bandwidth/latency
performance, STREAM (http://www.cs.virginia.edu/stream/)
is the way to go.
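
If you just want a quick sanity check before building STREAM, a very
rough stand-in is to time a large memory copy. The Python sketch below is
not STREAM and will not match its figures; it only times a bytearray
copy (a memcpy under the hood) and counts bytes read plus bytes written,
the way STREAM's Copy kernel does. The function name and the 64 MiB
buffer size are my own choices; pick a size well above your cache.

```python
import time


def copy_bandwidth_gbs(nbytes, repeats=5):
    """Best observed copy bandwidth in GB/s over several repeats."""
    src = bytearray(nbytes)
    dst = bytearray(nbytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src                      # bulk copy of the whole buffer
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    best = max(best, 1e-9)                # guard against a zero timer reading
    # Each copy reads nbytes and writes nbytes.
    return 2 * nbytes / best / 1e9


# 64 MiB per buffer is an arbitrary choice for illustration.
print(f"copy bandwidth: {copy_bandwidth_gbs(64 * 1024 * 1024):.2f} GB/s")
```

Running one copy of this per core at the same time and adding up the
results gives a crude idea of how the total flattens out as cores are
added, but treat STREAM as the reference.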

Cheers,

JL
 
