Parallel.Invoke

Clive Tooth

I have an array of Actions which I kick off with a Parallel Invoke...

Action[] workers = new Action[numThreads];
...
Parallel.Invoke(workers);

Can I have an array of Action<int>s which I somehow kick off with an
array of parameters?

Action<int>[] workers = new Action<int>[numThreads];
...
Parallel.Invoke(workers, arrayOfParameters); // ... or somesuch

?

TIA

Marcel Müller

I have an array of Actions which I kick off with a Parallel Invoke...

Action[] workers = new Action[numThreads];
...
Parallel.Invoke(workers);

Can I have an array of Action<int>s which I somehow kick off with an
array of parameters?

Action<int>[] workers = new Action<int>[numThreads];
...
Parallel.Invoke(workers, arrayOfParameters); // ... or somesuch

Feel free to inject your parameters into the workers.

Action<int>[] workers = new Action<int>[numThreads];
...
Action[] workers2 = new Action[numThreads];
for (int i = 0; i < numThreads; ++i)
    workers2[i] = () => workers[i](arrayOfParameters[i]);

Parallel.Invoke(workers2);


Marcel

Marcel Müller

If you're going to do it that way (it's equivalent to my second suggestion:
"[embed] the correct argument for each instance in the delegate instance
itself (i.e. through variable capturing)"), you need to write correct code:

for (int i = 0; i < numThreads; i++)
{
    int j = i;

    workers2[i] = () => workers[j](arrayOfParameters[j]);
}

Then each delegate instance captures a new "j" instance, which only ever
has a single value: the value that "i" had at the time of capture.


You are right! I fell into that common pitfall.

Usually I do not use for (int i ...) for this purpose but a generalized
sequence join as a LINQ expression. But that is not part of the standard
runtime.
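
For what it is worth, a restricted form of such a join is in the framework as
Enumerable.Zip (since .NET 4). A minimal sketch - the names below are just
made up to match the snippets above - could pair each worker with its
parameter like this:

using System;
using System.Linq;
using System.Threading.Tasks;

class ZipSketch
{
    static void Main()
    {
        int numThreads = 4;
        int[] arrayOfParameters = { 10, 20, 30, 40 };

        Action<int>[] workers = new Action<int>[numThreads];
        for (int i = 0; i < numThreads; i++)
        {
            int j = i;                        // fresh variable per delegate
            workers[j] = n => Console.WriteLine("worker " + j + " got " + n);
        }

        // Zip pairs each worker with its parameter and bakes the pair into a
        // parameterless closure, which Parallel.Invoke can then run directly.
        Action[] workers2 = workers
            .Zip(arrayOfParameters, (w, p) => (Action)(() => w(p)))
            .ToArray();

        Parallel.Invoke(workers2);
    }
}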


Marcel

Clive Tooth

If you're going to do it that way (it's equivalent to my second suggestion:
"[embed] the correct argument for each instance in the delegate instance
itself (i.e. through variable capturing)"), you need to write correct code:

for (int i = 0; i < numThreads; i++)
{
    int j = i;

    workers2[i] = () => workers[j](arrayOfParameters[j]);
}

Then each delegate instance captures a new "j" instance, which only ever
has a single value: the value that "i" had at the time of capture.


You are right! I fell into that common pitfall.

Usually I do not use for (int i ...) for this purpose but a generalized
sequence join as a LINQ expression. But that is not part of the standard
runtime.


Marcel


Thanks for that, Peter and Marcel. I have changed the program slightly
to use variable capture.

Here is some background to what I am actually doing...

Over 40 years ago I became interested in the problem of finding powers
of two which have runs of several consecutive zeros in their decimal
representation. I had access to a Burroughs B6500 computer and I wrote
a rather complicated Algol program to look for a run of 9 consecutive
zeros. I found them in 2^100,823. Many further runs of the program
failed to get as far as 2^559,940 which contains 10 consecutive zeros.
Over the decades I have revisited the problem a few times.

See
https://oeis.org/A006889

Anyway, a month or so ago I got a new computer... Hmmm... what can I
do with it? :)

Currently it is known that 2^918,583,174 has a run of 17 consecutive
zeros. So I thought I would have a look for 18 zeros.

My program stores the current power of two in base one billion - with
each base-one-billion "digit" being a 32-bit integer. I will refer to
these base-billion digits as ints. Each int is equivalent to nine
base-ten digits. Currently the program is up to about 2^922,000,000 -
which requires about 31 million ints. The basic loop is to double the
current number and then look to see if there is a run of 18 decimal
zeros. This process is facilitated by the fact that if there is a run
of 18 (or even 17) decimal zeros then there must be a zero int: two
partially covered ints can contribute at most 8 + 8 = 16 zeros between
them, so a run of 17 or more must completely cover at least one
aligned block of nine decimal digits.
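
Just as a sketch (assuming the ints are stored little-endian, so a[i + 1] is
more significant than a[i], and with made-up names), the run of decimal zeros
around a zero int can be measured roughly like this:

// Sketch only: count the run of decimal zeros around the zero int at a[i].
// Ignores edge effects at the very top of the number.
static int ZeroRunAround(uint[] a, int i)
{
    int lo = i, hi = i;
    while (lo > 0 && a[lo - 1] == 0) lo--;            // absorb neighbouring zero ints
    while (hi < a.Length - 1 && a[hi + 1] == 0) hi++;
    int run = 9 * (hi - lo + 1);

    if (hi + 1 < a.Length)                            // more significant neighbour:
    {                                                 // its trailing decimal zeros
        uint v = a[hi + 1];                           // sit just left of the run
        while (v % 10 == 0) { run++; v /= 10; }
    }

    if (lo > 0)                                       // less significant neighbour:
    {                                                 // the leading zeros of its
        uint v = a[lo - 1];                           // nine-digit field sit just
        int digits = 0;                               // right of the run
        while (v > 0) { digits++; v /= 10; }
        run += 9 - digits;
    }

    return run;                                       // a hit if run >= 18
}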

The program partitions the 31 million ints into n (nearly) equal
chunks and Parallel.Invokes a worker for each chunk. [n is, in fact,
eight. The computer has 8 logical processors (thanks to
hyper-threading) but I arrived at this value for n by
experimentation.] Each worker doubles its chunk and looks out for zero
ints. The main program then resolves the issue of the carry out of the
top of one chunk into the bottom of the next chunk and has a closer
look around the zero ints to see if we have our 18 zero decimal digits
yet. Every 20 minutes the program produces a restart file (using
serialization).
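
A rough sketch of that chunked scheme - made-up names, little-endian storage,
and with the zero-int scanning left out - might look something like this:

using System;
using System.Threading.Tasks;

class ChunkedDoubling
{
    const uint MyBase = 1000000000;

    // Double the base-10^9, little-endian number in digits[] in place, using
    // one worker per chunk. Carry out of the topmost int is ignored here; the
    // real program would have to grow the array.
    static void DoubleAll(uint[] digits, int numChunks)
    {
        int chunkSize = (digits.Length + numChunks - 1) / numChunks;
        uint[] carryOut = new uint[numChunks];
        Action[] workers = new Action[numChunks];

        for (int c = 0; c < numChunks; c++)
        {
            int chunk = c;                     // fresh variable captured per worker
            workers[c] = () =>
            {
                int start = chunk * chunkSize;
                int stop = Math.Min(start + chunkSize, digits.Length);
                uint carry = 0;
                for (int i = start; i < stop; i++)
                {
                    uint d = digits[i] * 2 + carry;    // below 2e9, fits in a uint
                    if (d >= MyBase) { digits[i] = d - MyBase; carry = 1; }
                    else { digits[i] = d; carry = 0; }
                }
                carryOut[chunk] = carry;       // carry leaving the top of this chunk
            };
        }

        Parallel.Invoke(workers);

        // Resolve the carry out of the top of each chunk into the bottom of
        // the next chunk, sequentially.
        for (int c = 0; c < numChunks - 1; c++)
        {
            uint carry = carryOut[c];
            for (int i = (c + 1) * chunkSize; carry != 0 && i < digits.Length; i++)
            {
                uint d = digits[i] + carry;
                if (d >= MyBase) { digits[i] = d - MyBase; carry = 1; }
                else { digits[i] = d; carry = 0; }
            }
        }
    }

    static void Main()
    {
        uint[] n = new uint[4];                // room for 36 decimal digits
        n[0] = 1;
        for (int p = 0; p < 100; p++)
            DoubleAll(n, 2);
        Console.WriteLine(string.Join(" ", n)); // 2^100, little-endian, base 10^9
    }
}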

Before producing a restart file, and after restarting from one, the
program divides 2^currentpower by a ten-digit prime, P, and stores the
remainder. It then works out what 2^currentpower mod P should be [this
can be done quickly] and compares this answer with the stored remainder. If
those two numbers are different we have a problem.
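
Just to illustrate the idea (made-up names; the real program may do it
differently): the remainder of the stored number can be found by folding its
base-billion ints with Horner's rule, and 2^currentpower mod P can be computed
quickly by modular exponentiation.

using System;
using System.Numerics;   // BigInteger.ModPow

static class ModPCheck
{
    // N mod p, where digits[] holds N little-endian in base 10^9. Because p
    // has at most ten decimal digits, r * 10^9 + digit stays below 2^64.
    static ulong RemainderOfStoredNumber(uint[] digits, ulong p)
    {
        ulong r = 0;
        for (int i = digits.Length - 1; i >= 0; i--)   // most significant int first
            r = (r * 1000000000UL + digits[i]) % p;
        return r;
    }

    // 2^power mod p, computed quickly by square-and-multiply.
    static ulong RemainderOfPowerOfTwo(int power, ulong p)
    {
        return (ulong)BigInteger.ModPow(2, power, p);
    }

    static void Main()
    {
        // 2^100 = 1267650600228229401496703205376, little-endian in base 10^9.
        uint[] digits = { 703205376, 229401496, 650600228, 1267 };
        ulong p = 2147483647;                  // 2^31 - 1, a ten-digit prime

        if (RemainderOfStoredNumber(digits, p) != RemainderOfPowerOfTwo(100, p))
            Console.WriteLine("we have a problem");
        else
            Console.WriteLine("mod-P check passed");
    }
}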

At the moment the program is running at about 28 powers of two per
second. I will probably need a few years to find 18 zeros.

Clive Tooth

As long as you've empirically determined that your concurrent performance
appears to be a rough scaling of the serial performance (i.e. with 8
threads, you're getting something reasonably close to 8x throughput), then
you've avoided this issue. In fact, while I haven't studied the newer HT
processors closely, I have read that Intel actually has worked to side-step
the issue or at least mitigate it. It might be harder to even reproduce on
the modern HT processors. When I ran into the issue, it was on a PC with
one of the original HT CPUs.

With 8 threads the program handles about 30 powers of two per second,
and each power of two would contain about 280 million decimal digits.
So, the metric that I use is "picoseconds per digit". The program
computes and displays various statistics every three seconds.

Here is a table of threads against ps/digit:

Th. ps/digit
 1     633
 2     318
 4     193
 7     125
 8     114
 9     162

So, using 8 threads is about 5.6 times faster than using 1 thread.

I now have a compile-time option to use a traditional threading
approach or the Parallel.Invoke method. In terms of speed they are
virtually identical. The whole program runs at "Below Normal"
priority.

Clive Tooth

[...]
Here is a table of threads against ps/digit:

Th. ps/digit
 1     633
 2     318
 4     193
 7     125
 8     114
 9     162

So, using 8 threads is about 5.6 times faster than using 1 thread.

What happened to threads 3 and 5?

Also, based on those numbers it appears that at least two threads are
lagging behind. I suppose in one of those threads, it might be because
that's the one that is inspecting a value more closely, when it's flagged
by another thread as a potential candidate. But that still leaves the
other thread.

I note in your previous description that you have partitioned the work by
digits. It seems to me that would be a reasonable design if you just
wanted to process a single number as quickly as possible. But in this
case, it seems to me that it needlessly complicates the logic and adds
inter-thread overhead.

I would expect throughput to improve noticeably if you just keep all the
processing for any given number (power of 2) in a single thread. Then each
thread can basically proceed in complete isolation practically indefinitely
(interrupted only by housekeeping duties such as saving state and printing
statistics), rather than having to synchronize with the other threads
dozens of times each second.

Pete

Hi Pete,

Here is a more complete table of threads/ps-per-digit...

Th. ps/digit
 1     633
 2     318
 3     223
 4     193
 5     165
 6     142
 7     125
 8     114
 9     162

Thanks *very* much for your idea of keeping all the processing for a
given power of two in just one thread - I will certainly pursue this.
That idea just had not occurred to me :(

When I thought about your idea my first notion was to have 8 threads,
with each thread multiplying its own power of 2 by 2^8 (so each thread
advances its exponent by eight at each step). Each base-billion digit
would just be bit-shifted to do the multiplication, but then I realised
that a (64-bit) div and mod would be required to compute each carry -
so that is probably not the way to go.
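
For illustration (again just a sketch with made-up names), the per-int step of
that multiply-by-2^8 idea would be something like this, which is where the
64-bit div and mod come in:

using System;

static class MulBy256Sketch
{
    const ulong MyBase = 1000000000UL;

    // Multiply one base-10^9 digit by 256. Unlike doubling, where at most one
    // subtraction of the base is needed, here computing the carry needs a
    // genuine 64-bit div and mod.
    static uint MulDigitBy256(ref uint digit, uint carryIn)
    {
        ulong t = ((ulong)digit << 8) + carryIn;   // up to about 2.56e11, needs 64 bits
        uint carryOut = (uint)(t / MyBase);        // at most 255
        digit = (uint)(t % MyBase);
        return carryOut;
    }

    static void Main()
    {
        uint digit = 999999999;
        uint carry = MulDigitBy256(ref digit, 0);
        Console.WriteLine(digit + " carry " + carry);   // 999999744 carry 255
    }
}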

A better idea is probably for each thread to work on, for example, one
million consecutive powers of two. When it had finished its range
(which would probably take 2 or 3 days) it would be told the starting
point for its next million. To work out the base-billion value of its
starting power of two the thread will communicate with the Mathematica
kernel running on my computer. That computation only takes Mathematica
about 5 minutes, so it is not too great an overhead.

The controlling thread will still compute an overall
picoseconds-per-digit figure, so we can see if the new method is an
improvement :)

Clive Tooth

A better idea is probably for each thread to work on, for example, one
million consecutive powers of two. When it had finished its range
(which would probably take 2 or 3 days) it would be told the starting
point for its next million. To work out the base-billion value of its
starting power of two the thread will communicate with the Mathematica
kernel running on my computer. That computation only takes Mathematica
about 5 minutes, so it is not too great an overhead.

The controlling thread will still compute an overall
picoseconds-per-digit figure, so we can see if the new method is an
improvement :)

I tried giving out distinct ranges of powers of two to different
threads but the timings were spookily similar to my original
approach...

8 threads: originally 114 ps/digit - new method: 113 ps/dig
7 threads: originally 125 ps/digit - new method: 124 ps/dig

So, I don't think I will use this new method. [And keeping track of
which chunks have been processed would need some careful book-keeping.
We don't want to miss a chunk :) ]

However, I have managed to shave about 10ps off the 114ps by doing
something else...

Here is the inner loop of the program:

unchecked
{
    for (int i = startI; i < stopI; i++)
    {
        if (a[i] >= half)
        {
            a[i] = (a[i] << 1) - myBase | carry;
            carry = 1;
        }
        else
        {
            a[i] = (a[i] << 1) | carry;
            carry = 0;
        }
        if (a[i] == 0)
        {
            if (zeroCount >= HelpInfo.zerosMax)
            {
                U.GiveUp("Scanner problem: zeroCount >= HelpInfo.zerosMax");
            }
            zeros[zeroCount] = i;
            ++zeroCount;
        }
    } // i loop
} // unchecked

It struck me that if I moved the "if (a[i] == 0) ..." to just after
the "a[i] = ..." the compiler might be able to do an optimisation
involving not re-fetching a[i]. And that does indeed seem to be the
case.

So the inner loop now reads...

unchecked
{
    for (int i = startI; i < stopI; i++)
    {
        if (a[i] >= half)
        {
            a[i] = (a[i] << 1) - myBase | carry;
            if (a[i] == 0)
            {
                if (zeroCount >= HelpInfo.zerosMax)
                {
                    U.GiveUp("Scanner problem: zeroCount >= HelpInfo.zerosMax");
                }
                zeros[zeroCount] = i;
                ++zeroCount;
            }
            carry = 1;
        }
        else
        {
            a[i] = (a[i] << 1) | carry;
            if (a[i] == 0)
            {
                if (zeroCount >= HelpInfo.zerosMax)
                {
                    U.GiveUp("Scanner problem: zeroCount >= HelpInfo.zerosMax");
                }
                zeros[zeroCount] = i;
                ++zeroCount;
            }
            carry = 0;
        }
    } // i loop
} // unchecked

And the time is down to about 104 picoseconds per digit.
 
