Parallel.Invoke

Clive Tooth

I have an array of Actions which I kick off with a Parallel Invoke...

Action[] workers = new Action[numThreads];
...
Parallel.Invoke(workers);

Can I have an array of Action<int>s which I somehow kick off with an
array of parameters?

Action<int>[] workers = new Action<int>[numThreads];
...
Parallel.Invoke(workers, arrayOfParameters); // ... or somesuch

?

TIA

Marcel Müller

I have an array of Actions which I kick off with a Parallel Invoke...

Action[] workers = new Action[numThreads];
...
Parallel.Invoke(workers);

Can I have an array of Action<int>s which I somehow kick off with an
array of parameters?

Action<int>[] workers = new Action<int>[numThreads];
...
Parallel.Invoke(workers, arrayOfParameters); // ... or somesuch

Feel free to inject your parameters into the workers.

Action<int>[] workers = new Action<int>[numThreads];
...
Action[] workers2 = new Action[numThreads];
for (int i = 0; i < numThreads; ++i)
    workers2[i] = () => workers[i](arrayOfParameters[i]);

Parallel.Invoke(workers2);


Marcel

Marcel Müller

If you're going to do it that way (it's equivalent to my second suggestion:
"[embed] the correct argument for each instance in the delegate instance
itself (i.e. through variable capturing)"), you need to write correct code:

for (int i = 0; i < numThreads; i++)
{
    int j = i;

    workers2[i] = () => workers[j](arrayOfParameters[j]);
}

Then each delegate instance captures a new "j" instance, which only ever
has a single value: the value that "i" had at the time of capture.


You are right! I fell into that common pitfall.

Usually I do not use for (int i ...) for this purpose but a generalized
sequence join as a LINQ expression. But that is not part of the standard
runtime.
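
For what it is worth, a restricted form of such a join is in the framework as
Enumerable.Zip (since .NET 4). A minimal sketch - the names below are just
made up to match the snippets above - could pair each worker with its
parameter like this:

using System;
using System.Linq;
using System.Threading.Tasks;

class ZipSketch
{
    static void Main()
    {
        int numThreads = 4;
        int[] arrayOfParameters = { 10, 20, 30, 40 };

        Action<int>[] workers = new Action<int>[numThreads];
        for (int i = 0; i < numThreads; i++)
        {
            int j = i;                        // fresh variable per delegate
            workers[j] = n => Console.WriteLine("worker " + j + " got " + n);
        }

        // Zip pairs each worker with its parameter and bakes the pair into a
        // parameterless closure, which Parallel.Invoke can then run directly.
        Action[] workers2 = workers
            .Zip(arrayOfParameters, (w, p) => (Action)(() => w(p)))
            .ToArray();

        Parallel.Invoke(workers2);
    }
}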


Marcel

Clive Tooth

If you're going to do it that way (it's equivalent to my second suggestion:
"[embed] the correct argument for each instance in the delegate instance
itself (i.e. through variable capturing)"), you need to write correct code:

for (int i = 0; i < numThreads; i++)
{
    int j = i;

    workers2[i] = () => workers[j](arrayOfParameters[j]);
}

Then each delegate instance captures a new "j" instance, which only ever
has a single value: the value that "i" had at the time of capture.


You are right! I fell into that common pitfall.

Usually I do not use for (int i ...) for this purpose but a generalized
sequence join as a LINQ expression. But that is not part of the standard
runtime.


Marcel


Thanks for that, Peter and Marcel. I have changed the program slightly
to use variable capture.

Here is some background to what I am actually doing...

Over 40 years ago I became interested in the problem of finding powers
of two which have runs of several consecutive zeros in their decimal
representation. I had access to a Burroughs B6500 computer and I wrote
a rather complicated Algol program to look for a run of 9 consecutive
zeros. I found them in 2^100,823. Many further runs of the program
failed to get as far as 2^559,940 which contains 10 consecutive zeros.
Over the decades I have revisited the problem a few times.

See
https://oeis.org/A006889

Anyway, a month or so ago I got a new computer... Hmmm... what can I
do with it? :)

Currently it is known that 2^918,583,174 has a run of 17 consecutive
zeros. So I thought I would have a look for 18 zeros.

My program stores the current power of two in base one billion - with
each base-one-billion "digit" being a 32-bit integer. I will refer to
these base-billion digits as ints. Each int is equivalent to nine
base-ten digits. Currently the program is up to about 2^922,000,000 -
which requires about 31 million ints. The basic loop is to double the
current number and then look to see if there is a run of 18 decimal
zeros. This process is facilitated by the fact that if there is a run
of 18 (or even 17) decimal zeros then there must be a zero int: two
partially covered ints can contribute at most 8 + 8 = 16 zeros between
them, so a run of 17 or more must completely cover at least one
aligned block of nine decimal digits.
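
Just as a sketch (assuming the ints are stored little-endian, so a[i + 1] is
more significant than a[i], and with made-up names), the run of decimal zeros
around a zero int can be measured roughly like this:

// Sketch only: count the run of decimal zeros around the zero int at a[i].
// Ignores edge effects at the very top of the number.
static int ZeroRunAround(uint[] a, int i)
{
    int lo = i, hi = i;
    while (lo > 0 && a[lo - 1] == 0) lo--;            // absorb neighbouring zero ints
    while (hi < a.Length - 1 && a[hi + 1] == 0) hi++;
    int run = 9 * (hi - lo + 1);

    if (hi + 1 < a.Length)                            // more significant neighbour:
    {                                                 // its trailing decimal zeros
        uint v = a[hi + 1];                           // sit just left of the run
        while (v % 10 == 0) { run++; v /= 10; }
    }

    if (lo > 0)                                       // less significant neighbour:
    {                                                 // the leading zeros of its
        uint v = a[lo - 1];                           // nine-digit field sit just
        int digits = 0;                               // right of the run
        while (v > 0) { digits++; v /= 10; }
        run += 9 - digits;
    }

    return run;                                       // a hit if run >= 18
}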

The program partitions the 31 million ints into n (nearly) equal
chunks and Parallel.Invokes a worker for each chunk. [n is, in fact,
eight. The computer has 8 logical processors (thanks to
hyper-threading) but I arrived at this value for n by
experimentation.] Each worker doubles its chunk and looks out for zero
ints. The main program then resolves the issue of the carry out of the
top of one chunk into the bottom of the next chunk and has a closer
look around the zero ints to see if we have our 18 zero decimal digits
yet. Every 20 minutes the program produces a restart file (using
serialization).
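
A rough sketch of that chunked scheme - made-up names, little-endian storage,
and with the zero-int scanning left out - might look something like this:

using System;
using System.Threading.Tasks;

class ChunkedDoubling
{
    const uint MyBase = 1000000000;

    // Double the base-10^9, little-endian number in digits[] in place, using
    // one worker per chunk. Carry out of the topmost int is ignored here; the
    // real program would have to grow the array.
    static void DoubleAll(uint[] digits, int numChunks)
    {
        int chunkSize = (digits.Length + numChunks - 1) / numChunks;
        uint[] carryOut = new uint[numChunks];
        Action[] workers = new Action[numChunks];

        for (int c = 0; c < numChunks; c++)
        {
            int chunk = c;                     // fresh variable captured per worker
            workers[c] = () =>
            {
                int start = chunk * chunkSize;
                int stop = Math.Min(start + chunkSize, digits.Length);
                uint carry = 0;
                for (int i = start; i < stop; i++)
                {
                    uint d = digits[i] * 2 + carry;    // below 2e9, fits in a uint
                    if (d >= MyBase) { digits[i] = d - MyBase; carry = 1; }
                    else { digits[i] = d; carry = 0; }
                }
                carryOut[chunk] = carry;       // carry leaving the top of this chunk
            };
        }

        Parallel.Invoke(workers);

        // Resolve the carry out of the top of each chunk into the bottom of
        // the next chunk, sequentially.
        for (int c = 0; c < numChunks - 1; c++)
        {
            uint carry = carryOut[c];
            for (int i = (c + 1) * chunkSize; carry != 0 && i < digits.Length; i++)
            {
                uint d = digits[i] + carry;
                if (d >= MyBase) { digits[i] = d - MyBase; carry = 1; }
                else { digits[i] = d; carry = 0; }
            }
        }
    }

    static void Main()
    {
        uint[] n = new uint[4];                // room for 36 decimal digits
        n[0] = 1;
        for (int p = 0; p < 100; p++)
            DoubleAll(n, 2);
        Console.WriteLine(string.Join(" ", n)); // 2^100, little-endian, base 10^9
    }
}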

Before producing a restart file, and after restarting from one, the
program divides 2^currentpower by a ten-digit prime, P, and stores the
remainder. It then works out what 2^currentpower mod P should be [this
can be done quickly] and compares this answer with the stored remainder. If
those two numbers are different we have a problem.
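
Just to illustrate the idea (made-up names; the real program may do it
differently): the remainder of the stored number can be found by folding its
base-billion ints with Horner's rule, and 2^currentpower mod P can be computed
quickly by modular exponentiation.

using System;
using System.Numerics;   // BigInteger.ModPow

static class ModPCheck
{
    // N mod p, where digits[] holds N little-endian in base 10^9. Because p
    // has at most ten decimal digits, r * 10^9 + digit stays below 2^64.
    static ulong RemainderOfStoredNumber(uint[] digits, ulong p)
    {
        ulong r = 0;
        for (int i = digits.Length - 1; i >= 0; i--)   // most significant int first
            r = (r * 1000000000UL + digits[i]) % p;
        return r;
    }

    // 2^power mod p, computed quickly by square-and-multiply.
    static ulong RemainderOfPowerOfTwo(int power, ulong p)
    {
        return (ulong)BigInteger.ModPow(2, power, p);
    }

    static void Main()
    {
        // 2^100 = 1267650600228229401496703205376, little-endian in base 10^9.
        uint[] digits = { 703205376, 229401496, 650600228, 1267 };
        ulong p = 2147483647;                  // 2^31 - 1, a ten-digit prime

        if (RemainderOfStoredNumber(digits, p) != RemainderOfPowerOfTwo(100, p))
            Console.WriteLine("we have a problem");
        else
            Console.WriteLine("mod-P check passed");
    }
}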

At the moment the program is running at about 28 powers of two per
second. I will probably need a few years to find 18 zeros.

Clive Tooth

As long as you've empirically determined that your concurrent performance
appears to be a rough scaling of the serial performance (i.e. with 8
threads, you're getting something reasonably close to 8x throughput), then
you've avoided this issue. In fact, while I haven't studied the newer HT
processors closely, I have read that Intel actually has worked to side-step
the issue or at least mitigate it. It might be harder to even reproduce on
the modern HT processors. When I ran into the issue, it was on a PC with
one of the original HT CPUs.

With 8 threads the program handles about 30 powers of two per second,
and each power of two would contain about 280 million decimal digits.
So, the metric that I use is "picoseconds per digit". The program
computes and displays various statistics every three seconds.

Here is a table of threads against ps/digit:

Th. ps/digit
 1     633
 2     318
 4     193
 7     125
 8     114
 9     162

So, using 8 threads is about 5.6 times faster than using 1 thread.

I now have a compile-time option to use a traditional threading
approach or the Parallel.Invoke method. In terms of speed they are
virtually identical. The whole program runs at "Below Normal"
priority.

Clive Tooth

[...]
Here is a table of threads against ps/digit:

Th. ps/digit
 1     633
 2     318
 4     193
 7     125
 8     114
 9     162

So, using 8 threads is about 5.6 times faster than using 1 thread.

What happened to threads 3 and 5?

Also, based on those numbers it appears that at least two threads are
lagging behind. I suppose in one of those threads, it might be because
that's the one that is inspecting a value more closely, when it's flagged
by another thread as a potential candidate. But that still leaves the
other thread.

I note in your previous description that you have partitioned the work by
digits. It seems to me that would be a reasonable design if you just
wanted to process a single number as quickly as possible. But in this
case, it seems to me that it needlessly complicates the logic and adds
inter-thread overhead.

I would expect throughput to improve noticeably if you just keep all the
processing for any given number (power of 2) in a single thread. Then each
thread can basically proceed in complete isolation practically indefinitely
(interrupted only by housekeeping duties such as saving state and printing
statistics), rather than having to synchronize with the other threads
dozens of times each second.

Pete

Hi Pete,

Here is a more complete table of threads/ps-per-digit...

Th. ps/digit
 1     633
 2     318
 3     223
 4     193
 5     165
 6     142
 7     125
 8     114
 9     162

Thanks *very* much for your idea of keeping all the processing for a
given power of two in just one thread - I will certainly pursue this.
That idea just had not occurred to me :(

When I thought about your idea my first notion was to have 8 threads,
with each thread multiplying its own power of 2 by 2^8 (so each thread
advances its exponent by eight at each step). Each base-billion digit
would just be bit-shifted to do the multiplication, but then I realised
that a (64-bit) div and mod would be required to compute each carry -
so that is probably not the way to go.
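
For illustration (again just a sketch with made-up names), the per-int step of
that multiply-by-2^8 idea would be something like this, which is where the
64-bit div and mod come in:

using System;

static class MulBy256Sketch
{
    const ulong MyBase = 1000000000UL;

    // Multiply one base-10^9 digit by 256. Unlike doubling, where at most one
    // subtraction of the base is needed, here computing the carry needs a
    // genuine 64-bit div and mod.
    static uint MulDigitBy256(ref uint digit, uint carryIn)
    {
        ulong t = ((ulong)digit << 8) + carryIn;   // up to about 2.56e11, needs 64 bits
        uint carryOut = (uint)(t / MyBase);        // at most 255
        digit = (uint)(t % MyBase);
        return carryOut;
    }

    static void Main()
    {
        uint digit = 999999999;
        uint carry = MulDigitBy256(ref digit, 0);
        Console.WriteLine(digit + " carry " + carry);   // 999999744 carry 255
    }
}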

A better idea is probably for each thread to work on, for example, one
million consecutive powers of two. When it had finished its range
(which would probably take 2 or 3 days) it would be told the starting
point for its next million. To work out the base-billion value of its
starting power of two the thread will communicate with the Mathematica
kernel running on my computer. That computation only takes Mathematica
about 5 minutes, so it is not too great an overhead.

The controlling thread will still compute an overall
picoseconds-per-digit figure, so we can see if the new method is an
improvement :)

Clive Tooth

A better idea is probably for each thread to work on, for example, one
million consecutive powers of two. When it had finished its range
(which would probably take 2 or 3 days) it would be told the starting
point for its next million. To work out the base-billion value of its
starting power of two the thread will communicate with the Mathematica
kernel running on my computer. That computation only takes Mathematica
about 5 minutes, so it is not too great an overhead.

The controlling thread will still compute an overall
picoseconds-per-digit figure, so we can see if the new method is an
improvement :)

I tried giving out distinct ranges of powers of two to different
threads but the timings were spookily similar to my original
approach...

8 threads: originally 114 ps/digit - new method: 113 ps/dig
7 threads: originally 125 ps/digit - new method: 124 ps/dig

So, I don't think I will use this new method. [And keeping track of
which chunks have been processed would need some careful book-keeping.
We don't want to miss a chunk :) ]

However, I have managed to shave about 10ps off the 114ps by doing
something else...

Here is the inner loop of the program:

unchecked
{
    for (int i = startI; i < stopI; i++)
    {
        if (a[i] >= half)
        {
            a[i] = (a[i] << 1) - myBase | carry;
            carry = 1;
        }
        else
        {
            a[i] = (a[i] << 1) | carry;
            carry = 0;
        }
        if (a[i] == 0)
        {
            if (zeroCount >= HelpInfo.zerosMax)
            {
                U.GiveUp("Scanner problem: zeroCount >= HelpInfo.zerosMax");
            }
            zeros[zeroCount] = i;
            ++zeroCount;
        }
    } // i loop
} // unchecked

It struck me that if I moved the "if (a[i] == 0) ..." to just after
the "a[i] = ..." the compiler might be able to do an optimisation
involving not re-fetching a[i]. And that does indeed seem to be the
case.

So the inner loop now reads...

unchecked
{
    for (int i = startI; i < stopI; i++)
    {
        if (a[i] >= half)
        {
            a[i] = (a[i] << 1) - myBase | carry;
            if (a[i] == 0)
            {
                if (zeroCount >= HelpInfo.zerosMax)
                {
                    U.GiveUp("Scanner problem: zeroCount >= HelpInfo.zerosMax");
                }
                zeros[zeroCount] = i;
                ++zeroCount;
            }
            carry = 1;
        }
        else
        {
            a[i] = (a[i] << 1) | carry;
            if (a[i] == 0)
            {
                if (zeroCount >= HelpInfo.zerosMax)
                {
                    U.GiveUp("Scanner problem: zeroCount >= HelpInfo.zerosMax");
                }
                zeros[zeroCount] = i;
                ++zeroCount;
            }
            carry = 0;
        }
    } // i loop
} // unchecked

And the time is down to about 104 picoseconds per digit.
 
