Most efficient way to process thousands of files using multiple threads (dealing with thread handling)


Frans Bouma


I wonder if the threads are even the bottleneck: accessing the file
system from 2 or more threads simultaneously slows down read processing
terribly, because of head-stepping on your hard drive.

What I would do is use a queue-like mechanism for file requests, with one
thread that reads all the files in sequential order and returns them to
the processing threads. This way your disk activity is streamlined.

So you can, for example, start 4 or 5 threads which request files; these
are loaded, and your loader thread goes to sleep until another thread
requests a file and puts a new request in the queue. Around 20 threads
per process on a single CPU is probably the maximum, since IIS is
optimized for 20 threads per CPU.
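
A minimal sketch of that kind of single-reader setup, in C#. The FileLoader,
FileRequest and FileLoadedHandler names (and the callback-based hand-off) are
purely illustrative, not from the post:

using System.Collections;
using System.IO;
using System.Threading;

public delegate void FileLoadedHandler(string path, byte[] data);

public class FileLoader
{
    private readonly Queue requests = new Queue();   // holds FileRequest items

    private class FileRequest
    {
        public string Path;
        public FileLoadedHandler Callback;
    }

    // Worker threads call this instead of touching the disk themselves.
    public void RequestFile(string path, FileLoadedHandler callback)
    {
        FileRequest req = new FileRequest();
        req.Path = path;
        req.Callback = callback;
        lock (requests)
        {
            requests.Enqueue(req);
            Monitor.Pulse(requests);   // wake the loader if it is sleeping
        }
    }

    // Runs on the single loader thread: all disk reads happen here, one at a time.
    public void LoaderLoop()
    {
        while (true)
        {
            FileRequest req;
            lock (requests)
            {
                while (requests.Count == 0)
                    Monitor.Wait(requests);          // sleep until a request arrives
                req = (FileRequest)requests.Dequeue();
            }
            byte[] data = File.ReadAllBytes(req.Path);  // sequential disk access
            req.Callback(req.Path, data);               // invoked on the loader thread
        }
    }
}

Note that the callback here runs on the loader thread; the requesting thread
could instead block on a per-request event until its bytes arrive.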

FB
 

Frank Drebin

First thing that comes to mind: instead of polling to see what the
threads are doing, you should have each thread update a static member
that tracks its status.

For example, have a static array that keeps track of threads 0 to 11. When a
thread is created, it's given one of the empty spots. The thread executes
(and it knows it's "4", for example); when it's done, it updates that array
to say "4" is done, then kills itself. That way, all threads clean up
after themselves cleanly.
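
A rough sketch of that idea; the ThreadStatus name, the 12 slots and the
locking are just one way to do it:

using System.Threading;

public class ThreadStatus
{
    // One slot per worker; true means "this slot's thread has finished".
    private static readonly bool[] done = new bool[12];
    private static readonly object sync = new object();

    public static void StartWorker(int slot, WaitCallback work, object arg)
    {
        lock (sync) { done[slot] = false; }
        Thread t = new Thread(delegate()
        {
            try { work(arg); }
            finally
            {
                lock (sync) { done[slot] = true; }  // the thread reports its own completion
            }
        });
        t.Start();
    }

    public static bool IsDone(int slot)
    {
        lock (sync) { return done[slot]; }
    }
}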
 

codewriter

Keep in mind that if it is a single-CPU machine, creating multiple threads
can be even slower than processing everything in one thread.
 

Fergus Cooney

Hi m, (are you related to q?)

If I understand correctly, the processing is continuous and each thread
is processing a file and then dying - only to be immediately replaced by a
new thread for the next file.

I wonder why you don't just have your threads stay alive and grab a new
file each time they've finished. They don't need to die at all, do they?
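
A minimal sketch of that kind of long-lived worker, pulling items off the
shared stack until it is empty (the WorkerPool name and the Process
placeholder are made up for illustration, and it assumes the work items are
already on the stack before the workers start):

using System.Collections;
using System.Threading;

public class WorkerPool
{
    private readonly Stack work;   // the stack of work-item objects described above

    public WorkerPool(Stack work) { this.work = work; }

    public void Run(int threadCount)
    {
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            threads[i] = new Thread(new ThreadStart(WorkerLoop));
            threads[i].Start();
        }
        foreach (Thread t in threads) t.Join();   // wait for everything to finish
    }

    private void WorkerLoop()
    {
        while (true)
        {
            object item;
            lock (work)                        // the Stack itself is not thread-safe
            {
                if (work.Count == 0) return;   // nothing left: this thread simply exits
                item = work.Pop();
            }
            Process(item);                     // whatever per-file processing the app does
        }
    }

    private void Process(object item) { /* ... */ }
}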

And I would certainly investigate Frans' idea of disk-access being a
bottleneck.

Regards,
Fergus
 

AlexS

I do something similar, but in a more asynchronous way.

I use a queue - but it could be a list or a collection like in your case.

The main thread, which is getting the data to process (files), puts every
new item into the queue and uses Monitor.PulseAll to signal to all threads
that there is an item to process.

All worker threads are created during app startup; they lock on the queue
and sit in Monitor.Wait. When a Pulse arrives, the first thread (which one
exactly I never know - but from tests it looks like MS does this in
round-robin fashion) gets woken up, locks the queue, takes out an item,
releases the queue and starts processing. Sometime later it is interrupted
(usually because of async calls inside the thread) and the next waiting
thread is woken up. It locks the queue and gets the next item. If there are
no items yet, it goes back into Monitor.Wait immediately.
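
A minimal sketch of that Wait/Pulse arrangement (the PulsedQueue name is made
up; a null item is used here as an end-of-run marker, which is one way of
doing what Alex suggests further down):

using System.Collections;
using System.Threading;

public class PulsedQueue
{
    private readonly Queue items = new Queue();

    // Called by the main thread for every new work item.
    public void Enqueue(object item)
    {
        lock (items)
        {
            items.Enqueue(item);
            Monitor.PulseAll(items);   // wake the waiting workers
        }
    }

    // Each worker thread runs this loop for the lifetime of the app.
    public void WorkerLoop()
    {
        while (true)
        {
            object item;
            lock (items)
            {
                while (items.Count == 0)
                    Monitor.Wait(items);   // sleep until the main thread pulses
                item = items.Dequeue();
            }
            if (item == null)              // end-of-run marker
            {
                Enqueue(null);             // re-post it so the other workers see it too
                return;
            }
            Process(item);
        }
    }

    private void Process(object item) { /* per-item work goes here */ }
}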

As you see, there is no polling, no busy loops and no Thread.Sleep calls.
Sure, I also sit at nearly 100% when several threads are running together;
however, this is split between my threads, the OS (5%) and MS SQL (40-60%),
which is used from the threads. Because the threads are created during
startup, there is no overhead from repeated thread creation and the
corresponding clogging of memory and GC kicking in.

About callbacks - I use callbacks from threads only for updating UI visual
feedback elements, never to start new threads. And I wouldn't recommend it
in your case either - in essence, if your thread has finished processing, it
should check for the next work item immediately; there is no sense in going
back to some other thread just to start a new thread.

Btw, I started with a scheme like yours - polling and attempts to create
additional threads for each work item. However, with Monitor.Pulse/Wait and
my own fixed thread pool, the final implementation is much faster and more
stable. My app can process up to around a thousand items per hour on a 1GHz
PC - every item is a file (1-500K) and causes several MS SQL operations. The
app usually runs for several hours.

The only other thing you might need to consider, apart from
Monitor.Pulse/Wait, is how to signal the end of the run to all threads. It
could be a special item, or some static flag field, which is less
preferable. Of course, each work item is (or should be) processed completely
independently from any other; put all parameters controlling the processing
into the item object.

Also, take a look at whether you really need 12 threads. E.g. on my machine
there is no sense in trying to run more than 4-5 simultaneously; above that
there is no gain at all and degradation starts, even when all connections
are available. If you have a multiprocessor machine and low disk activity
during processing - maybe 12. But I would measure before drawing
conclusions.

And one last thing - take a look at how you process your files in the
threads. Asynchronous IO can help to speed up processing too.
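
For example, a read can be issued with FileStream.BeginRead/EndRead so the
worker thread is not blocked while the disk does its work. A bare-bones
sketch, with an illustrative filename and no error handling:

using System;
using System.IO;
using System.Threading;

public class AsyncReadExample
{
    public static void Main()
    {
        // The last constructor argument (true) asks for asynchronous (overlapped) IO.
        FileStream fs = new FileStream("somefile.dat", FileMode.Open,
                                       FileAccess.Read, FileShare.Read, 4096, true);
        byte[] buffer = new byte[fs.Length];
        ManualResetEvent done = new ManualResetEvent(false);

        fs.BeginRead(buffer, 0, buffer.Length, delegate(IAsyncResult ar)
        {
            int bytesRead = fs.EndRead(ar);   // completes the read
            Console.WriteLine("Read {0} bytes", bytesRead);
            fs.Close();
            done.Set();
        }, null);

        // The calling thread is free to do other work here while the read is in flight.
        done.WaitOne();
    }
}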

HTH
Alex
 

Jim H

A couple of suggestions...

1. Be careful of running too many threads. You may end up slowing yourself
down because of the time and overhead involved in context switching from
one thread to another. If your threads do a lot of work with blocking calls
like file IO, network communication, or database requests, having extra
threads is a definite benefit, but you can go too far. If your threads are
doing steady, intense processing without much blocking IO, you may actually
be slowing your app down. A lot of people get thread-happy like it's the
answer to everything. A single processor can only do one thing at a time.

2. Consider using a thread-safe queue to hand out the work. Rather than
having a thread terminate and clean up, just to create a new thread that
will perform the same type of work, have your threads grab their work
requests from a queue. This way you won't have to poll. Basically, I
created a thread-safe queue with a Get(ref object) that retrieves an item
from the queue or waits on an event if the queue is empty. When an item is
added to the queue, an event is signaled. The Get has a timeout: if the
timeout is reached the Get returns 0, otherwise it returns 1. If the Get
returns 0, I check a global variable to see if the app is closing;
otherwise I go back into the Get. When I want to close the app, I signal a
closing event. That's a rough idea of how it works anyway.
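
A sketch along those lines (the TimedQueue name is made up; it uses
Monitor.Wait with a timeout rather than a separate event object, and returns
a bool instead of 0/1):

using System.Collections;
using System.Threading;

public class TimedQueue
{
    private readonly Queue items = new Queue();

    public void Add(object item)
    {
        lock (items)
        {
            items.Enqueue(item);
            Monitor.Pulse(items);    // signal one waiting Get
        }
    }

    // Returns true and an item, or false if nothing arrived within the timeout.
    public bool Get(ref object item, int timeoutMs)
    {
        lock (items)
        {
            if (items.Count == 0 && !Monitor.Wait(items, timeoutMs))
                return false;        // timed out: caller can now check its "closing" flag
            if (items.Count == 0)
                return false;        // woken up, but another caller took the item
            item = items.Dequeue();
            return true;
        }
    }
}

A worker loop on top of it would then look something like:
object item = null; while (!closing) { if (queue.Get(ref item, 1000)) Process(item); }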

What Frans said about hard drive access makes sense too and is something to
check, but I would definitely try reusing the worker threads. If you are
processing thousands of files and creating and destroying thousands of
threads, the cost of that constant allocation and destruction should make a
noticeable difference. You can also play with the number of threads in your
pool to see what performs best.

Good luck!

jim
 

m

Hello,

I have an application that processes thousands of files each day. The
filenames and various related file information are retrieved, related
filenames are associated and placed in a linked list within a single object,
which is then placed on a stack (this cuts down thread creations and
deletions roughly by a factor of 4). I create up to 12 threads, each of
which then processes a single object off of the stack. I use a loop with a
boolean condition, stack.Count > 0. Then I check each thread to see if it is
alive; if it is not, I create a new thread with a new object off of the
stack, which is passed as the constructor parameter for a new threaded
object. If the thread is alive, it merely goes on to check the status of the
next thread in line. This is a big process, and running the CPU at 100% is
not an issue; I would just like to optimize my threading code in order to
make my application faster and more efficient. The ThreadPool class does not
seem like a good option for my needs, as my threads will be constantly
processing throughout their lifetime. I think that my constant polling of
threads could definitely be replaced with something like a thread callback
upon completion of its processing. How can I further reduce the threading
overhead? Would it be better to just reset all the variables in a thread and
pass it a new stack object, without creating a new thread to overwrite the
dead one? My code, while reliable so far, could easily be simplified and
improved upon.

Thanks for any and all input :)
 
