NOT using the System.ThreadPool in server-class applications


Jon Skeet [C# MVP]

Chris Mullins said:
Quite often of late I've been asked about when to use the System Threadpool.

I finally sat down and wrote a blog detailing the issues in the ThreadPool
and why it's unsuitable for use in production grade server applications.

Any feedback on this is appreciated!

http://www.coversant.net/dotnetnuke/Coversant/Blogs/tabid/88/EntryID/8/Default.aspx

My only feedback is that this is something I've been aware of for a
while. When I tried to make the same point on the CLR mailing list, I
was variously accused of being naive or finding problems when there
really aren't any.

Half the problem is that *anything* can use the system threadpool -
there's no way of isolating different areas of code to use different
pools.

Oh well - at least I'm not alone in considering it a problem.
 

Chris Mullins

Vadym Stetsyak said:
Hello, Chris!

CM> I finally sat down and wrote a blog detailing the issues in the
CM> ThreadPool and why it's unsuitable for use in production grade server
CM> applications.

CM> Any feedback on this is appreciated!

CM> http://www.coversant.net/dotnetnuke/Coversant/Blogs/tabid/88/EntryID/8/Default.aspx

Nice post :cool:

Thanks. I'm always worried about a post like that, as it goes against what
so many other people preach.
I have some questions on how you handled the issues with the ThreadPool.
Did you consider increasing the number of worker threads in the threadpool?

The problem isn't the size of the threadpool. I can use one of the ICor
interfaces (I don't remember which one) to manipulate the size of the thread
pool, and have in fact done this. I've also run into this problem on
machines that have 400 threads (by default) in the threadpool.

The problem stems from the fact that no matter how many threads are in
there, I can exhaust them all. Because the CLR itself needs threadpool
threads, this means the whole system then deadlocks.

What's needed are two threadpools: One for my application to use, and one
for the CLR to use. There's just no getting around this.
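The exhaustion Chris describes is easy to demonstrate. The sketch below is not
from the post (names and parameters are hypothetical): it caps the pool, fills
it with work items that block until a release item runs - but the release item
itself needs a pool thread, so everything stalls until the blocked items give
up. Note that on modern runtimes SetMaxThreads will not accept a value below
the processor count.

```csharp
using System;
using System.Threading;

// Demonstrates thread pool exhaustion: every worker thread blocks waiting
// for a work item that can only run on a worker thread. Returns true if
// the release item got to run in time, false if the pool deadlocked.
public static class PoolExhaustionDemo
{
    public static bool TryRun(int workers, int timeoutMs)
    {
        // Cap the pool. SetMinThreads first so the min never exceeds the max.
        ThreadPool.SetMinThreads(workers, workers);
        ThreadPool.SetMaxThreads(workers, workers);

        ManualResetEvent unblock = new ManualResetEvent(false);
        ManualResetEvent done = new ManualResetEvent(false);
        int timedOut = 0;
        int remaining = workers;

        for (int i = 0; i < workers; i++)
        {
            ThreadPool.QueueUserWorkItem(delegate
            {
                // Occupy a pool thread until 'unblock' is set (or we give up).
                if (!unblock.WaitOne(timeoutMs))
                    Interlocked.Increment(ref timedOut);
                if (Interlocked.Decrement(ref remaining) == 0)
                    done.Set();
            });
        }

        // This item would release the others - but it needs a pool thread,
        // and all of them are blocked above. It cannot run until one of the
        // blocked items times out and frees a thread.
        ThreadPool.QueueUserWorkItem(delegate { unblock.Set(); });

        done.WaitOne();
        return timedOut == 0;
    }
}
```

The "release" item here stands in for the CLR's own need for pool threads: once
your code has every worker occupied, whatever the runtime queues behind you
starves too.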
 

Chris Mullins


[Jon has seen this too]
My only feedback is that this is something I've been aware of for a
while. When I tried to make the same point on the CLR mailing list, I
was variously accused of being naive or finding problems when there
really aren't any.

That's what I've been bumping into as well. I dismiss a lot of it, as it's
from people who write proof-of-concept applications for a living, rather
than production-grade applications. As we all know, there's a world of
difference between them.
Half the problem is that *anything* can use the system threadpool -
there's no way of isolating different areas of code to use different
pools.

Yeah, I agree. This was the big issue when we ran into this 3 years or so
ago. We needed the ability to say, "the CLR gets these threads, SoapBox
Server gets these threads". Without that ability, the threadpool runs out
of threads and the whole appdomain hangs.
 

Vadym Stetsyak

Hello, Chris!
You wrote on Tue, 27 Jun 2006 10:12:56 -0700:

CM> The problem stems from the fact that no matter how many threads are in
CM> there, I can exhaust them all. Because the CLR itself needs threadpool
CM> threads, this means the whole system then deadlocks.


A strange workaround just came to mind. It can solve the problem with the
single thread pool, but it's not optimal for the software that you
described in your post.

The solution is multiple AppDomains. Every AppDomain has its own
ThreadPool, so we can create something like a "worker" AppDomain.

This approach is overkill due to the marshalling, but nevertheless the
solution exists, and an application can be built using it.

CM> What's needed are two threadpools: one for my application to use, and
CM> one for the CLR to use. There's just no getting around this.


Yeah, that would simplify a lot of problems with parallel execution in
high-performance systems.
 

William Stacey [MVP]

It might be easier to just use your own thread pool. There are multiple
implementations out there.
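For illustration, here is a minimal dedicated pool of the kind being
suggested: a handful of worker threads draining a Monitor-guarded queue,
completely isolated from the system ThreadPool. It's a sketch, not production
code - no dynamic sizing, affinity, or watchdog (features debated later in
this thread).

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// A minimal dedicated thread pool: fixed worker threads pull work items
// from a Monitor-guarded queue, isolated from the system ThreadPool.
public class SimpleThreadPool : IDisposable
{
    private readonly Queue<Action> _queue = new Queue<Action>();
    private readonly List<Thread> _workers = new List<Thread>();
    private bool _shutdown;

    public SimpleThreadPool(int workerCount)
    {
        for (int i = 0; i < workerCount; i++)
        {
            var t = new Thread(Worker) { IsBackground = true };
            t.Start();
            _workers.Add(t);
        }
    }

    public void QueueWorkItem(Action work)
    {
        lock (_queue)
        {
            _queue.Enqueue(work);
            // Every waiter checks its condition in a loop, and every enqueue
            // pulses, so a single Pulse suffices here. (Pulse vs PulseAll is
            // debated further down this thread.)
            Monitor.Pulse(_queue);
        }
    }

    private void Worker()
    {
        while (true)
        {
            Action work;
            lock (_queue)
            {
                while (_queue.Count == 0 && !_shutdown)
                    Monitor.Wait(_queue);
                if (_shutdown && _queue.Count == 0)
                    return;                      // drained; exit the worker
                work = _queue.Dequeue();
            }
            try { work(); } catch { /* log; keep the worker alive */ }
        }
    }

    public void Dispose()
    {
        lock (_queue)
        {
            _shutdown = true;
            Monitor.PulseAll(_queue); // wake everyone so they can exit
        }
        foreach (var t in _workers) t.Join();
    }
}
```

Because Dispose drains the queue before the workers exit, callers get a simple
"flush and stop" semantic for free.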

--
William Stacey [MVP]

| Hello, Chris!
| You wrote on Tue, 27 Jun 2006 10:12:56 -0700:
|
| CM> The problem stems from the fact that no matter how many threads
| CM> are in there, I can exhaust them all. Because the CLR itself needs
| CM> threadpool threads, this means the whole system then deadlocks.
|
| A strange workaround just came to mind. It can solve the problem with
| the single thread pool, but it's not optimal for the software that you
| described in your post.
|
| The solution is multiple AppDomains. Every AppDomain has its own
| ThreadPool, so we can create something like a "worker" AppDomain.
|
| This approach is overkill due to the marshalling, but nevertheless the
| solution exists, and an application can be built using it.
|
| CM> What's needed are two threadpools: one for my application to use,
| CM> and one for the CLR to use. There's just no getting around this.
|
| Yeah, that would simplify a lot of problems with parallel execution in
| high-performance systems.
|
| --
| Regards,
| Vadym Stetsyak.
| Blog: http://vadmyst.blogspot.com
|
|
 

William Stacey [MVP]

Here is rough pseudocode for a server that uses a threadpool (the CLR's or
another) to handle N client sockets. We break a TP rule here because we use
a threadpool thread for a long-running operation (i.e. the socket read).
However, it is our server, so we understand and manage that issue.
Basically, we add all client sockets to a list - for management, etc. We
then kick off the first read (and future reads) on a thread pool thread.
We set a max of 100 threads in this example. So at any one instant, we
could have 100 pending reads. Other client reads will just queue up in the
TP FIFO queue.

Issues:
1) 100 slow clients (in this example) can stall the TP. This could be an
attack, or just 100 slow clients. We can mitigate an attack by only allowing
so many connections from the same IP. Other methods, such as an async
server, can also stall after many pending reads, so I'm not sure how much of
an issue this is in reality compared to other methods. After a client does a
read/reply, it frees that thread and posts another read to the queue so
other clients get time. The next client in the queue gets to read, so the
server will keep running as long as client reads eventually happen.
2) Writes are done sync on the same thread. I'm not sure this is much of an
issue, however. The write will be done quickly with no wait. Not sure we
gain anything by doing an async write.

Advantages:
1) Built-in fairness. Client requests are handled FIFO.
2) Dynamic throttle. We can throttle and apply "back-pressure" just by
changing the number of MaxThreads in the TP. This will limit the number of
pending sync reads/read processing on the server at any one time. New
clients just queue up.
3) Logic is sync. However, we don't assign 1 thread per client for the whole
client session. Each request/reply will be handled by another TP thread
(possibly the same one if the queue is empty). This makes it easy to code
and find bugs. We can also use the socket timeout feature on reads and
writes. Slow clients that time out can just be dropped (or use some
algorithm). After a timeout, we free that thread for another client read.
4) Easy to replace the system TP with a custom TP if needed.
5) Can handle many thousands of clients with a small number of threads. Only
by testing do we find the optimal number of threads needed for the app.


PseudoCode (just the basics; locks, exception handling, etc. are not shown
for simplicity):
---------------------------------------------------------------------------------------
ThreadPool.SetMinThreads(25, 25);
ThreadPool.SetMaxThreads(100, 100);

while (started)
{
    Socket sock = listener.Accept();
    sock.ReceiveTimeout = 5000;
    if (userList.Count > 5000) // Set to how many concurrent clients you want.
    {
        sock.Close(); // A hard close - too many clients. Can do this better.
        continue;
    }
    userList.Add(sock);

    // Kick off the request/response chain for the client.
    ThreadPool.QueueUserWorkItem(ReadMethod, sock);
}

private void ReadMethod(object state)
{
    // This runs on a threadpool thread.
    Socket sock = (Socket)state;
    byte[] buf = new byte[1024];
    try
    {
        int read = sock.Receive(buf);

        // Handle request processing.

        // Write the reply back to the client.
        sock.Send(writeBuf);
        if (done)
        {
            sock.Close();
            userList.Remove(sock);
            return; // Client done.
        }

        // Release this thread to the pool, but queue another read before
        // we leave. Note: if there are no other queued work items,
        // the TP will most likely just reuse the same thread.
        ThreadPool.QueueUserWorkItem(ReadMethod, sock);
    }
    catch (SocketException se) // a receive timeout surfaces as a SocketException
    {
        // Handle timeout.
    }
    catch (Exception ex)
    {
        // Handle other exceptions.
    }
}

--
William Stacey [MVP]

| Quite often of late I've been asked about when to use the System Threadpool.
|
| I finally sat down and wrote a blog detailing the issues in the ThreadPool
| and why it's unsuitable for use in production grade server applications.
|
| Any feedback on this is appreciated!
|
| http://www.coversant.net/dotnetnuke/Coversant/Blogs/tabid/88/EntryID/8/Default.aspx
|
| --
| Chris Mullins
|
|
 

Chris Mullins

William Stacey said:
It might be easier to just use your own thread pool. There are multiple
implementations out there.

I've looked all over the place for a good threadpool implementation, and
have yet to find one.

My ideal thread pool would have the following features:
1 - Threads are dynamically added and removed from the pool depending on
load.
2 - Threads have some degree of processor affinity in order to minimize
cache issues.
3 - The scheduler is smart about assigning work items to threads based on
processor load.
4 - A built-in watchdog. If a work item is exceeding some timeout, a
mechanism exists to abort it.
5 - Intelligent about locking on the work item queue.

I've built a number of thread pools, and looked at several that I've found
on the Internet. None of them has ever had all these qualities.
 

Barry Kelly

Chris Mullins said:
4 - A built-in watchdog. If a work item is exceeding some timeout, a
mechanism exists to abort it.

There's no good way to abort a thread, so I wouldn't be too optimistic
about finding this feature. Thread.Abort is dangerous and evil, and any
shared structures being currently manipulated by the thread will likely
be in an inconsistent state. The best you can do is tear down the
AppDomain.
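A common hedge here is to make the watchdog cooperative instead of abortive:
the work item polls a cancellation flag and stays responsible for leaving
shared state consistent. A sketch (all names hypothetical, not an API from
the thread):

```csharp
using System;
using System.Threading;

// Cooperative watchdog: instead of Thread.Abort, a timer flips a flag
// when the deadline passes, and well-behaved work items poll it.
public class WatchdogToken
{
    private volatile bool _cancelled;
    public bool IsCancelled { get { return _cancelled; } }
    public void Cancel() { _cancelled = true; }
}

public static class Watchdog
{
    // Runs 'work' on the calling thread, arming a one-shot timer that
    // cancels the token if 'timeout' elapses first. Returns true if the
    // work finished without being cancelled.
    public static bool Run(Action<WatchdogToken> work, TimeSpan timeout)
    {
        var token = new WatchdogToken();
        // Period of -1 ms = fire once, then never again.
        using (var timer = new Timer(_ => token.Cancel(), null,
                                     timeout, TimeSpan.FromMilliseconds(-1)))
        {
            work(token);
        }
        return !token.IsCancelled;
    }
}
```

The trade-off Barry describes still holds: a work item that never checks the
flag can't be stopped this way, so the flag only helps code that cooperates.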

-- Barry
 

William Stacey [MVP]

I'll play. I wrote a couple before, so I have a general interest - maybe
I'll blow the dust off one and update it.
In general, some of these things may have more cost than payoff for a
general-purpose thread pool, but let me ask.

| 1 - Threads are dynamically added and removed from the pool depending on
| load.

What defines "load" here? Is it queue size or CPU load? CPU-load logic may
compete against min/max thread parameters. So what is the algorithm here?

| 2 - Threads have some degree of processor affinity in order to minimize
| cache issues

Windows uses soft affinity automatically. It prefers to run a task on the
same processor as last time, but will switch if it has to, based on its
algorithm. IMO, the system will do a better job than we can do here. The
successful usage cases for changing processor affinity are rare. However, I
agree there may be special cases - but I wonder if that will be abused if
available. What are your thoughts here?

| 3 - The scheduler is smart about assigning work items to threads based on
| processor load.

What do you mean? TIA

| 4 - A built-in watchdog. If a work item is exceeding some timeout, a
| mechanism exists to abort it.

That could be a good addition. It is overhead, but probably not too bad.

| 5 - Intelligent about locking on the work item queue

At a minimum, you need to lock on add and remove. Are you thinking of
something specific, or have you just seen poor locking in the TPs you
reviewed?
 

William Stacey [MVP]

That is true. The best place for this is in your own code (running on a TP
thread). I guess an optional callback could be a safety net for things user
code does not check for, like timeouts on locks or sends/receives. But then
again, the callback has to interrupt and/or abort the thread - but at least
you get the option to do some last-second cleanup before tearing down the
app. Not sure a great solution exists, as you said.

--
William Stacey [MVP]

|
| > 4 - A built-in watchdog. If a work item is exceeding some timeout, a
| > mechanism exists to abort it.
|
| There's no good way to abort a thread, so I wouldn't be too optimistic
| about finding this feature. Thread.Abort is dangerous and evil, and any
| shared structures being currently manipulated by the thread will likely
| be in an inconsistent state. The best you can do is tear down the
| AppDomain.
|
| -- Barry
|
| --
| http://barrkel.blogspot.com/
 

Chris Mullins

[Features a threadpool should have]
| 1 - Threads are dynamically added and removed from the pool depending on
| load.

What defines "load" here? Is it queue size or CPU load? CPU-load logic may
compete against min/max thread parameters. So what is the algorithm here?

I think it would be more CPU load than anything else. I'm not sold, however,
on this being the correct metric.

The behavior that I explicitly want to avoid is dumping 100 items into a
Queue, and having the thread pool immediately spin up 100 threads. Likewise,
if 100 threads are spun up, I wouldn't want to see them all terminated
immediately should there be no items in the pool.

The System thread pool seems to have a timeout of some sort before spawning
new threads. This seems to work fairly well.

| 2 - Threads have some degree of processor affinity in order to minimize
| cache issues

[Thread Affinity]
IMO, the system will do a better job than we can do here. The successful
usage cases for changing processor affinity are rare. However, I agree there
may be special cases - but I wonder if that will be abused if available.
What are your thoughts here?


I tend to agree with you here, and as a result I have never myself written
code that has processor affinity. Windows has always done a good job of
things.

However, I've been running a lot of tests lately on this machine:
http://www.coversant.net/bigiron.png

That particular machine has 16 physical Itanium2 processors, and some of its
behavior has been unexpected. We consistently see certain processors load
significantly higher than others, and it's tough to tell quite why. Down
deep in the network stack, there are some issues (which are solved by the
new Chimney features in the Scalable Networking Pack) with interrupts and
processor affinity, so this may explain some of it. On the other hand, there
seems to be room for improvement. This is more of a feeling than a fact,
I'll readily admit. It's also compounded by the lack of profiling tools that
run on Itanium (there are a few, but they're not nearly as useful for .NET
as the Compuware stuff is).

I would really like to have a way to figure out what processor cache
(especially on these Itanium2 machines with their HUGE caches) my data
structure is already in, have a thread on that processor wake up, process
the item from the cache, and write the results to main memory. I don't think
I'm going to see this any time in the future, but it's on my wish list. I
have that feeling (I know, I know) that this will give a fairly significant
boost to performance as the load increases.

| 3 - The scheduler is smart about assigning work items to threads based on
| processor load.

What do you mean? TIA

Many of the threadpools that I have seen use this algorithm:
1 - thread wakes up
2 - thread checks for shutdown
3 - thread checks for a work item
4 - thread processes work item
5 - thread goes back to sleep
6 - goto 1

Some use variations on this, and fancy means of putting threads to sleep -
I've seen a few algorithms:
1 - Keep all threads blocked in a Monitor. Whenever a work item comes in, do
a Pulse (or, god help us, a PulseAll with a "first one wins" algorithm)
2 - Use Events and Waits to test for shutdown, items in the queue, etc.
3 - Items are posted to a Queue; a control thread wakes up and gives the
work item to a particular worker thread.

I would really like to see an algorithm that has, in the "busy" case, as few
context switches as possible. When I post an item to a queue, there should
be exactly 1 context switch needed to process my item.

| 5 - Intelligent about locking on the work item queue

At a minimum, you need to lock on add and remove. Are you thinking of
something specific, or have you just seen poor locking in the TPs you
reviewed?

I am thinking of some specific stuff. That 16-processor Itanium box has
significantly influenced the way I think about locking and contention.

Typically a queue has two locks - one on Enqueue, one on Dequeue. This is
how I've been coding it for years. However, after seeing some performance
that really puzzled me, I spent some time looking into the contention in our
system. While doing that, I came across this:
http://makeashorterlink.com/?Q4132185D

I played with this test for a while, ran it on single, dual, quad, and
16-processor boxes, and got some shocking results. After looking into this
more and more, I became quite alarmed at just how much contention we were
seeing. It was... shocking. I have all the numbers recorded, and will be
doing a blog entry on my findings in the not-too-distant future.

Anyway, after much wailing and gnashing of teeth, I stumbled across
thread-safe, lock-free data structures:
http://www.boyet.com/Articles/LockfreeQueue.html and have been educating
myself.
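For readers new to the pattern: the simplest lock-free structure is a Treiber
stack, sketched below. This is not the queue from the linked article, just an
illustration of the CompareExchange retry loop such structures share. (In
.NET, the garbage collector conveniently sidesteps the classic ABA node-reuse
hazard that plagues native implementations.)

```csharp
using System.Threading;

// Lock-free Treiber stack: Push/Pop retry until a CompareExchange on the
// head pointer succeeds, so no thread ever blocks on a lock.
public class LockFreeStack<T>
{
    private class Node
    {
        public readonly T Value;
        public Node Next;
        public Node(T value) { Value = value; }
    }

    private Node _head;

    public void Push(T value)
    {
        var node = new Node(value);
        while (true)
        {
            Node head = _head;
            node.Next = head;
            // Succeeds only if nobody changed _head since we read it.
            if (Interlocked.CompareExchange(ref _head, node, head) == head)
                return;
        }
    }

    public bool TryPop(out T value)
    {
        while (true)
        {
            Node head = _head;
            if (head == null) { value = default(T); return false; }
            if (Interlocked.CompareExchange(ref _head, head.Next, head) == head)
            {
                value = head.Value;
                return true;
            }
        }
    }
}
```

Under contention, a failed CompareExchange just costs a retry rather than a
context switch - which is exactly why these structures pull ahead as the
processor count grows.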

I added these into the test suites, and got more interesting results. On a
single or dual processor machine they're significantly slower. On a 16-way
machine, they're waaaaaay faster.

The end result is that on different processor architectures, and different
numbers of processors, the locking mechanisms need to change. The difference
in some cases was a full 4 orders of magnitude (in one very dramatic case,
what took 85 milliseconds on a single-processor x86 machine took 200
milliseconds on a dual-processor machine, 2 minutes on a quad-processor
machine, and more than 30 minutes on a 16-way machine). Granted, these cases
were contrived and almost nothing but thread contention, but they
illustrated the case and let me pick the right data structures.

My hypothetical thread pool would take this into account and have a variety
of locking mechanisms. It would be especially nice if, in the general case,
a work item could be handed directly to a thread with no lock required at
all. In the "lots of threads awake" case, it seems as if there should be a
way to do this.

Now, I'll be the first to admit that locking and unlocking on the queue for
work items shouldn't be too big a hit, especially when compared to parsing
some XML, string manipulation, etc. - but these locks are hit at least twice
per operation (one in, one out), and at 10k operations per second, that adds
up quickly.
 

Carl Daniel [VC++ MVP]

Chris said:
[Features a threadpool should have]
What defines "load" here? Is it queue size or CPU load? CPU-load logic
may compete against min/max thread parameters. So what is the algorithm
here?

I think it would be more CPU load than anything else. I'm not sold,
however, on this being the correct metric.

The behavior that I explicitly want to avoid is dumping 100 items
into a Queue, and having the thread pool immediately spin up 100
threads. Likewise, if 100 threads are spun up, I wouldn't want to see
them all terminated immediately should there be no items in the pool.

[lots more great discussion about thread pools and lock-free data
structures]

It seems to me that there's a great model out there already of how to make a
high-performance server make the most effective use of multiple CPUs: SQL
Server (and, I would presume, Oracle and DB2 must use similar techniques).

Within SQL Server, they go to extraordinary lengths to try to have exactly 1
unblocked thread per CPU - almost never more, and fewer only when there are
fewer work items than CPUs. Further, they go to extraordinary lengths to
ensure that context switches never occur and that worker threads never
block. Considerable writing has been done on the inner workings of the User
Mode Scheduler that SQL Server uses. For example,

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsqldev/html/sqldev_02252004.asp

How does that mesh with your wish-list for the ultimate thread pool? Of
course, it's more than a thread pool - it is, in effect, an entire operating
system built on top of the OS, requiring that code running atop it use its
I/O and memory allocation facilities exclusively.

-cd
 

William Stacey [MVP]

| The behavior that I explicitly want to avoid is dumping 100 items into a
| Queue, and having the thread pool immediately spin up 100 threads.
| Likewise, if 100 threads are spun up, I wouldn't want to see them all
| terminated immediately should there be no items in the pool.

Here is where trade-offs come into play, and where you need a specific
behavior instead of something general. A simple way (that I used) is to have
each thread check, after it is done with the delegate, whether new threads
need to be created (i.e. Q.Count > 1 and TPThreads.Count < maxThreads). This
avoids needing a separate manager thread running. This, in fact, may be the
required behavior for the app (i.e. don't wait, but bound the max). Having
to walk the active threads and check whether they are in fact blocked, and
only then spin up more, can induce delay - and I'm not sure there is a
managed way to do this.

| The System thread pool seems to have a timeout of some sort before
| spawning new threads. This seems to work fairly well.

Not sure of the exact way they do it either. But it has something to do with
checking whether threads are blocked, then spinning up 1 more, and so on, up
to the max.

| > | 2 - Threads have some degree of processor affinity in order to
| > | minimize cache issues
|
| [Thread Affinity]

Not sure how much thread affinity will help here; it is more likely to slow
things down. You never want to wait, so if there is a ready, willing, and
able thread ready to pop the queue and start work, that is better than
trying to wait for the same thread used last time, or spending cycles trying
to figure that out when the work could have already started or completed. So
I'm not sure we get a payoff here in general. Naturally, there are probably
use cases that prove my feeling wrong.

| Some use variations on this, and fancy means of putting threads to sleep -
| I've seen a few algorithms:
| 1 - Keep all threads blocked in a Monitor. Whenever a work item comes in,
| do a Pulse (or, god help us, a PulseAll with a "first one wins" algorithm)

Need to look very hard at this one. I use PulseAll in my blocking-Q because
it is the right thing to do in that case. I have seen many incorrect
implementations that use Pulse(). It may look right at first, but there are
a few funky things that can happen that can create a live-lock when using
just Pulse(). In most cases, if the loop is tight, a PulseAll is not going
to create a herding issue and may, in fact, be the only way to ensure
correctness. Sure, you may get a couple of threads here and there that wake
up only to sleep again, but that should be the exception.
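A bounded blocking queue of the kind William describes makes the
Pulse-vs-PulseAll point concrete: producers and consumers wait on the *same*
monitor for *different* conditions (not-full vs not-empty), so a single Pulse
can wake the wrong kind of waiter, while PulseAll guarantees the right thread
eventually sees its condition. A minimal sketch (not William's actual code):

```csharp
using System.Collections.Generic;
using System.Threading;

// Bounded blocking queue: Enqueue blocks when full, Dequeue blocks when
// empty. Both kinds of waiter share one monitor, so PulseAll is used to
// make sure the waiters whose condition just became true get a chance.
public class BlockingQueue<T>
{
    private readonly Queue<T> _items = new Queue<T>();
    private readonly int _capacity;

    public BlockingQueue(int capacity) { _capacity = capacity; }

    public void Enqueue(T item)
    {
        lock (_items)
        {
            while (_items.Count >= _capacity)
                Monitor.Wait(_items);      // wait until not full
            _items.Enqueue(item);
            Monitor.PulseAll(_items);      // an item arrived: wake consumers
        }
    }

    public T Dequeue()
    {
        lock (_items)
        {
            while (_items.Count == 0)
                Monitor.Wait(_items);      // wait until not empty
            T item = _items.Dequeue();
            Monitor.PulseAll(_items);      // a slot freed: wake producers
            return item;
        }
    }
}
```

The wake-then-resleep cost William mentions is visible here: every PulseAll
wakes both producers and consumers, and the ones whose condition is still
false just go back to waiting.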

| I would really like to see an algorithm that has, in the "busy" case, as
| few context switches as possible. When I post an item to a queue, there
| should be exactly 1 context switch needed to process my item.

That is the way my implementation works. I admit to more locking than I
would like, but I can't figure out a way around it while keeping everything
correct and not getting into the lock-free stuff. I'll post it here, so
please take shots at it.
 

Chris Mullins

Carl Daniel said:
Chris said:
[Features a threadpool should have]
It seems to me that there's a great model of how to make a high
performance server make the most effective use of multiple CPUs out there
already: SQL Server (and, I would presume, Oracle and DB2 must use
similar techniques).

The SQL Server infrastructure is great, but I'm afraid it doesn't really do
me any good. Now, as soon as Microsoft releases SQL Server 2000 or 2005
under a GPL or BSD license, I'll use that code :)

I'm not so desperate yet as to use Fibers and write my own scheduler. Next
year, perhaps, but not yet!
 

Chris Mullins

William Stacey said:
| 1 - Keep all threads blocked in a Monitor. Whenever a work item comes in,
| do a Pulse (or, god help us, a PulseAll with a "first one wins" algorithm)

Need to look very hard at this one. I use PulseAll in my blocking-Q because
it is the right thing to do in that case. I have seen many incorrect
implementations that use Pulse(). It may look right at first, but there are
a few funky things that can happen that can create a live-lock when using
just Pulse(). In most cases, if the loop is tight, a PulseAll is not going
to create a herding issue and may, in fact, be the only way to ensure
correctness. Sure, you may get a couple of threads here and there that wake
up only to sleep again, but that should be the exception.

I used the word dreaded because the normal case (where only some threads
should be active) has Windows waking up all the threads, then putting most
of them right back to sleep. This can't be a good idea... Especially with
dozens (or hundreds) of threads in the pool.

I don't know the right answer to this, but there's gotta be a better way.
 

Joe Seigh

Chris said:
I used the word dreaded because the normal case (where only some threads
should be active) has Windows waking up all the threads, then putting most
of them right back to sleep. This can't be a good idea... Especially with
dozens (or hundreds) of threads in the pool.

I don't know the right answer to this, but there's gotta be a better way.

I'm not sure how you're using the queue here. If it's just a simple queue,
use a semaphore to control blocking on the queue. You'll need a lock as well
if you have more than one queueing and one dequeueing thread - unless it's a
lock-free queue, of course.
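The semaphore-controlled queue suggested here can be sketched like this (a
minimal illustration, not Joe's actual code): the semaphore counts items, so
consumers block in WaitOne() with no Pulse at all, and a plain lock guards
the queue itself for multiple producers and consumers.

```csharp
using System.Collections.Generic;
using System.Threading;

// Semaphore-counted queue: the semaphore's count always equals the number
// of items available, so a successful WaitOne reserves exactly one item.
public class SemaphoreQueue<T>
{
    private readonly Queue<T> _items = new Queue<T>();
    private readonly Semaphore _available = new Semaphore(0, int.MaxValue);

    public void Enqueue(T item)
    {
        lock (_items) _items.Enqueue(item);
        _available.Release(); // one more item is available
    }

    public T Dequeue()
    {
        _available.WaitOne(); // blocks until at least one item exists
        lock (_items) return _items.Dequeue();
    }
}
```

Because the semaphore does the blocking, there is no thundering herd on
enqueue: Release wakes at most one waiting consumer.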

If you want a fast-pathed semaphore which avoids system calls, much like
what critical sections do for mutexes, you can check here:
http://groups.google.com/[email protected]
Note that if you use this on Xbox 360, the interlocked functions don't
have memory barriers, so you'll have to provide your own.

For things like timer queues, I use a gating mutex or semaphore to avoid the
thundering herd problem caused by having the entire thread pool wait on the
same timeout. The gating mutex ensures only one thread is waiting on an
autoreset event - with an infinite timeout if there's no work, and a
specific timeout if there is a scheduled piece of work. So in pseudocode:

gate.lock();
queue.lock();
for (;;) {
    if (queue.empty())
        timeout = INFINITE;
    else
        timeout = queue.peek().deadline - now;

    if (timeout < 0)
        break;

    queue.unlock();
    event.wait();   // autoreset event
    queue.lock();
}

item = queue.pop();
queue.unlock();
gate.unlock();


Any queueing of work that changes the head of the queue (the next item to be
scheduled) signals the autoreset event.
 

Carl Daniel [VC++ MVP]

Chris said:
"Carl Daniel [VC++ MVP]"
Chris said:
[Features a threadpool should have]
It seems to me that there's a great model of how to make a high
performance server make the most effective use of multiple CPUs out
there already: SQL Server (and, I would presume, Oracle and DB2
must use similar techniques).

The SQL Server infrastructure is great, but I'm afraid it doesn't
really do me any good. Now, as soon as Microsoft releases SQL Server
2000 or 2005 under a GPL or BSD license, I'll use that code :)

I'm not so desperate yet as to use Fibers and write my own scheduler.
Next year, perhaps, but not yet!

I have, a couple of years back. It would be interesting to dust that code
off, make it .NET-friendly (it's native C++), and use it to build a UMS-like
framework for .NET. That'd be a powerful tool to have in the .NET toolbox.

-cd
 

William Stacey [MVP]

| gate.lock();
| queue.lock();
| for (;;) {
| if queue.empty()
| timeout = INFINITE;
| else
| timeout = queue.peek().deadline - now;
|
|
| if (timeout < 0)
| break;
|
| queue.unlock();
| event.wait(); // autoreset event
| queue.lock()
| }
|
| item = queue.pop();
| queue.unlock();
| gate.unlock();
|

I am not following that totally.
1) If the queue is empty, timeout is set to -1 (i.e. INFINITE). The next
"if" is then true and does a break. Then you try to pop() an empty queue?
This does not seem right. Please help.
2) Where else do you wait on the timeout? Should it be event.wait(timeout)
instead of event.wait()?

TIA
 
