Multithreaded GC!


Atmapuri

Hi!

The GC should be multithreaded, but on my quad core I still see
only 25% CPU max usage when running a .NET test app that uses
50% of one core for the GC. It is as though GC.Collect were called
from the same main thread as the main app.

Have I missed some setting? I was expecting that the main
application would be able to get 100% of the main thread, and that
the GC, using 10-30%, would run on a second thread and core
without affecting the performance of the main thread.

Thanks!
Atmapuri
 

Nicholas Paldino [.NET/C# MVP]

Atmapuri,

The default configuration settings (according to the documentation) have the
concurrent GC enabled. In your app.config file the gcConcurrent element
should either be absent (assuming the other config files, user and machine,
haven't been changed) or set to true:

<configuration>
  <runtime>
    <gcConcurrent enabled="true"/>
  </runtime>
</configuration>

Additionally, you might want to set the runtime to use the server GC,
since you have a quad core machine:

<configuration>
  <runtime>
    <gcServer enabled="true"/>
  </runtime>
</configuration>
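
If it helps, you can also confirm at run time which mode actually took
effect. A minimal C# sketch (GCSettings lives in System.Runtime and should
be available on .NET 2.0 and later):

```csharp
using System;
using System.Runtime;

class GcModeCheck
{
    static void Main()
    {
        // True only when <gcServer enabled="true"/> was honored; the
        // server GC requires a multi-processor or multi-core machine.
        Console.WriteLine("Server GC: " + GCSettings.IsServerGC);
    }
}
```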
 

Willy Denoyette [MVP]

Atmapuri said:
Hi!

The GC should be multithreaded, but on my quad core
I still see only 25% CPU max usage, when running a .NET test app which
uses 50% of 1 core for the GC.
It is as though GC.Collect were called from the same main
thread as the main app.
Have I missed some settings? I was expecting that
the main application will be able to get 100% of the main
thread and GC with 10-30% would be running on the
second thread and core not affecting the performance
of the main thread.

Thanks!
Atmapuri



Not sure what you mean by "GC should be multithreaded". The GC is not
multithreaded; the GC runs on a single thread. What you are expecting is
that the GC "task" is parallelized, but that is not what is done in the
current version of the CLR.
Note that each thread that triggers a collection gets suspended for the
whole duration of the collection (non-concurrent GC) or for a number of
short intervals during the collection (concurrent GC).
When you are running the server GC (as specified in your config file), you
will get multiple GC threads (one per CPU/core) which may run
simultaneously; however, none of these threads can run simultaneously with
the thread that triggered the collection, and that for the duration of the
collection (no concurrent GC here!).

Willy.
 

Jon Harrop

Atmapuri said:
The GC should be multithreaded, but on my quad core
I still see only 25% CPU max usage, when running a
.NET test app which uses 50% of 1 core for the GC.
It is as though GC.Collect were called from the same main
thread as the main app.

Have I missed some settings? I was expecting that
the main application will be able to get 100% of the main
thread and GC with 10-30% would be running on the
second thread and core not affecting the performance
of the main thread.

That is exactly what I see on my dual core and it is the expected result.

Even though the current garbage collector implementation in .NET is
concurrent, it is still closely tied to the running threads, so the
performance of your program's threads and of the GC are strongly related
as a consequence. Specifically, every allocation by your main thread must
be synchronized with the garbage collector for it to work properly (e.g. to
avoid traversing a data structure while it is being initialized, or to avoid
deallocating a value while it is temporarily reachable only from a
register on another CPU). So a main thread that is doing a lot of
allocation can waste time waiting for the garbage collector to catch
up.

This is why allocation is so slow on .NET compared to language
implementations with non-concurrent GCs, e.g. OCaml is 5x faster at
allocation than any .NET language.
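
For reference, the kind of micro-benchmark behind such claims is usually a
tight loop of small, short-lived allocations. A rough C# sketch, whose
numbers vary widely with CLR version, GC mode and hardware:

```csharp
using System;
using System.Diagnostics;

class AllocBench
{
    static void Main()
    {
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < 10000000; i++)
        {
            // Small, short-lived object: dies in gen0 almost immediately.
            double[] tmp = new double[4];
            tmp[0] = i;
        }
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds + " ms for 10M allocations");
    }
}
```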
 

Jon Harrop

Willy said:
Not sure what you mean with "GC should be multithreaded". The GC is not
multithreaded, the GC runs on a single thread. What you are expecting is
that the GC "task" is parallelized, but this is not what is done in the
current version of the CLR.

If that is still true then Microsoft had better fix it quickly. Many of my
programs consume 33% CPU on the GC thread for each of their threads so they
will scale badly beyond quad cores until Microsoft make a parallel GC.

Are you sure the current .NET GC is not already parallel? Have Microsoft
hinted when a parallel one will be introduced?
 

Willy Denoyette [MVP]

Jon Harrop said:
If that is still true then Microsoft had better fix it quickly. Many of my
programs consume 33% CPU on the GC thread for each of their threads so
they
will scale badly beyond quad cores until Microsoft make a parallel GC.

Are you sure the current .NET GC is not already parallel? Have Microsoft
hinted when a parallel one will be introduced?



Yes, I'm pretty sure it's like that; what makes you think this doesn't
scale?
In "serverGC" mode, each GC thread (one per CPU) is assigned its own GC heap,
and each user thread will allocate from one of these GC heaps (in a round-robin
fashion). If you have only one user thread in the process actively
allocating objects, then the GC thread will be awoken when reaching its own
heap's Gen0 allocation "budget"; further allocation will be done from the
other heaps while the GC collects the first heap. When you have more than one
thread actively allocating, each GC thread will collect its own heap
when reaching its Gen0 threshold. That means you can have multiple GC
threads running at the same time, but each is only collecting its own
heap.

Willy.
 

Ben Voigt [C++ MVP]

Is it the GC or the finalizer processing you are referring to? As far as I
know, GC runs on an application thread, but the finalizer queue is processed
by one or more separate threads.
That is exactly what I see on my dual core and it is the expected
result.

Even though the current garbage collector implementation in .NET is
concurrent it is still closely tied to the running threads and the
performance of your program's threads and the GC is strongly related
as a consequence. Specifically, every allocation by your main thread
must be synchronized with the garbage collector for it to work
properly (e.g. to avoid traversing a data structure while it is being
initialized or to avoid deallocating a value while it is temporarily
available only from the register on another CPU). So a main thread
that is doing a lot of allocation can be wasting time waiting for the
garbage collector to catch up.

No locks are taken to achieve this synchronization. Rather the GC suspends
all other threads before it collects.

You can see that in a garbage-collected environment, not only must allocation
be synchronized, but also any instruction that affects reachability.
This is why allocation is so slow on .NET compared to language
implementations with non-concurrent GCs, e.g. OCaml is 5x faster at
allocation than any .NET language.

Show the code for your tests. I suspect you're badly misusing .NET in order
to get those numbers.
 

Willy Denoyette [MVP]

Ben Voigt said:
Is it the GC or the finalizer processing you are referring to? As far as
I know, GC runs on an application thread, but the finalizer queue is
processed by one or more separate threads.

The GC runs on the user thread that triggered a collection when
running the workstation GC (whether concurrent or not); the server GC
(available on multi-core and SMP machines only) runs on separate threads.
No locks are taken to achieve this synchronization. Rather the GC
suspends all other threads before it collects.

Same here: this is true for "workstation GC" mode; the server GC uses
another scheme.

You can see that in a garbage collected environment, not only allocation
must be synchronized, but any instruction that affects reachability.


Show the code for your tests. I suspect you're badly misusing .NET in
order to get those numbers.

Willy.
 

Jon Harrop

Ben said:
Show the code for your tests.

Logic and symbolic programs (e.g. theorem provers and computer algebra
systems) highlight this. For example, the following purely functional
constraint-based Sudoku solver:

let unsolved =
  [for j in 1 .. size2
     for i in 1 .. size2 ->
       (i, j), [1 .. size2]];;

let invalid (i, j) (i', j') =
  i=i' || j=j' || (i-1)/3=(i'-1)/3 && (j-1)/3=(j'-1)/3;;

let select p n p' ns =
  if p=p' then [n] else
  if invalid p p' then List.filter ((<>) n) ns else
  ns;;

let add p n sols =
  List.map (fun (p', ns) -> p', select p n p' ns) sols;;

let rec search f sol = function
  | [] -> f sol
  | (p, ns)::sols ->
    let sols = List.sort (fun (_, ns1) (_, ns2) -> compare (List.length ns1) (List.length ns2)) sols
    List.iter (fun n -> search f (Map.add p n sol) (add p n sols)) ns;;

In this case you can work around the CLR's performance woes by replacing the
purely functional implementation with a heavily imperative implementation
that amortizes allocations and then replacing the inherent parallelism with
explicit locks.

However, the program will become several times larger in this case and these
manual optimizations are simply intractable for most real programs.
I suspect you're badly misusing .NET in order to get those numbers.

We are working with Microsoft to remedy this problem but there is no easy
solution.
 

Jon Harrop

Here, you say that the "GC runs on a single thread".
Yes, I'm pretty sure it's like that, what makes you think this doesn't
scale?

Say each user thread requires 33% CPU time from the GC. Then a
single-threaded GC can satisfy the allocation needs of up to 3 such
threads. If you have 1-3 such user threads then performance will scale.
But if you have 4 or more user threads then a single-threaded GC will
saturate and the user threads will be allocation-limited.
In "serverGc" mode, each GC thread (one per CPU)

Here, you say that there is one GC thread for each CPU. Is that not contrary
to your original statement that the "GC runs on a single thread"?
is assigned his own GC
heap and each user thread will allocate from one of these GC heaps (in a
round robin fashion).

If there are more user threads than GC heaps then two or more user threads
must share a single GC heap. So they must synchronize their allocations?
If you have only one user thread in the process
actively allocating objects, then the GC thread will be awoken when
reaching it's own heap Gen0 allocation "budget", further allocation will
be done from the other heap while the GC collects the first heap. When you
have more than one thread actively allocating, then each GC thread will
collect it's own heap when reaching its Gen 0 threshold. That means you
can have multiple GC threads running at the same time, but they are only
collecting their own heap.

That makes sense. Who is responsible for collecting the main heap? Is there
a server GC thread that suspends all user threads and then traverses the
whole heap? What are the stalling implications of this for soft real-time
applications?
 

Ben Voigt [C++ MVP]

Jon said:
Ben said:
Show the code for your tests.

Logic and symbolic programs (e.g. theorem provers and computer algebra
systems) highlight this. For example, the following purely functional
constraint-based Sudoku solver:

let unsolved =
  [for j in 1 .. size2
     for i in 1 .. size2 ->
       (i, j), [1 .. size2]];;

let invalid (i, j) (i', j') =
  i=i' || j=j' || (i-1)/3=(i'-1)/3 && (j-1)/3=(j'-1)/3;;

let select p n p' ns =
  if p=p' then [n] else
  if invalid p p' then List.filter ((<>) n) ns else
  ns;;

let add p n sols =
  List.map (fun (p', ns) -> p', select p n p' ns) sols;;

let rec search f sol = function
  | [] -> f sol
  | (p, ns)::sols ->
    let sols = List.sort (fun (_, ns1) (_, ns2) -> compare (List.length ns1) (List.length ns2)) sols
    List.iter (fun n -> search f (Map.add p n sol) (add p n sols)) ns;;

In this case you can work around the CLR's performance woes by
replacing the purely functional implementation with a heavily
imperative implementation that amortizes allocations and then
replacing the inherent parallelism with explicit locks.

Are you comparing the performance of OCaml to F#... an unreleased research
language? Or do you have an implementation in some other language to
substantiate the claim "5x faster at allocation than any .NET language"?
However, the program will become several times larger in this case
and these manual optimizations are simply intractable for most real
programs.

I understand that. A problem is solved in totally different ways in
functional vs imperative programming.
We are working with Microsoft to remedy this problem but there is no
easy solution.

What class does "List" refer to when your code is compiled to .NET? I'd
imagine the first step is to have a List class optimized for growth and
iteration in the ways common in functional code, not
System.Collections.Generic.List<>. I suspect that the problem is not that
allocations are slow in .NET, but that the MSIL generated from the code
performs many more allocations and copies than necessary.
 

Willy Denoyette [MVP]

Jon Harrop said:
Here, you say that the "GC runs on a single thread".

Yep. One GC thread per GC heap; you can't have multiple threads collecting a
single GC heap (no parallelism).

Workstation GC: one (user) thread per process collecting the (single) heap.
Server GC: one dedicated GC thread per GC heap (one GC heap per processor).
Say each user thread requires 33% CPU time from the GC. Then a
single-threaded GC can satisfy the allocation needs of up to 3 such
threads. If you have 1..3 such user threads then performance will scale.
But if you have 4... user threads then a single-threaded GC will saturate
and the user threads will be allocation-limited.


Here, you say that there is one GC thread for each CPU. Is that not
contrary
to your original statement "GC runs on a single thread"?

Each GC thread collects a single heap; note that this is only valid for
serverGC-enabled processes. The workstation GC runs on the user
thread that triggered a collection; this is the single thread in the process
actually collecting the one and only GC heap.

For this you need:
1) multiple CPUs (or cores);
2) the following in your config file:
<configuration>
  <runtime>
    <gcServer enabled="true"/>
  </runtime>
</configuration>
If there are more user threads than GC heaps then two or more user threads
must share a single GC heap. So they must synchronize their allocations?

True, if you have more threads running than actual processors, you will
share a heap across multiple threads; not sure what you mean by
"synchronize their allocations" though.
Say you have 6 threads running on a quad core with 'gcServer' enabled: you
may notice the first 4 threads allocating from different heaps, while the two
remaining threads share a heap with one of the first four threads.
However, this all depends on the allocation scheme, the CLR version, and a
number of (undocumented) heuristics.
The CLR keeps track of the allocation frequency, the allocation size, and the
number of managed threads in the process. User threads with a "low allocation
frequency" tend to share a single heap, while threads with a high
allocation frequency *may* get their own heap.
Say you start 4 threads on a quad box, and each of these threads allocates
fewer than a few hundred small objects per second; then all four threads
will probably share a single heap, while 4 threads with a much higher
allocation frequency may occupy 4 different heaps.

That makes sense. Who is responsible for collecting the main heap? Is
there
a server GC thread that suspends all user threads and then traverses the
whole heap? What are the stalling implications of this for soft real-time
applications?

There is no such thing as a "main heap"; each GC heap has its own GC
thread, which will/can only collect its own heap.
I guess you are referring to something different.

Willy.
 

Jon Harrop

Ben said:
Are you comparing the performance of OCaml to F#...
Yes.

an unreleased

F# was first released in 2005.
research language?

F#'s productization was announced last October.
Or do you have an implementation in some other language to substantiate
the claim "5x faster at allocation than any .NET language"?

Allocation is a function of the CLR and not of the language.
What class does "List" refer to when your code is compiled to .NET?

A data structure provided by the F# standard library: an immutable singly
linked list implementation.
I'd
imagine the first step is to have a List class optimized for growth and
iteration in the ways common in functional code, not
System.Collections.Generic.List<>.

Exactly. Given enough time, you can solve the problem by hand. The
difficulty is getting the compiler to automate it.
I suspect that the problem is not that
allocations are slow in .NET, but that the MSIL generated from the code
performs many more allocations and copies than necessary.

Trying to amortize allocations by hand in all .NET languages is the best
workaround available at the moment.
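
One common shape of that workaround is a hand-rolled pool that recycles
buffers instead of allocating fresh ones. A hypothetical C# sketch (the
names are illustrative, not from any library or from this thread):

```csharp
using System;
using System.Collections.Generic;

// Reusing large arrays instead of re-allocating them on every pass
// amortizes the allocation cost and keeps pressure off the GC.
class DoubleArrayPool
{
    private readonly Stack<double[]> _free = new Stack<double[]>();
    private readonly int _length;

    public DoubleArrayPool(int length) { _length = length; }

    public double[] Rent()
    {
        // Reuse a returned array when one is available, else allocate.
        return _free.Count > 0 ? _free.Pop() : new double[_length];
    }

    public void Return(double[] array)
    {
        Array.Clear(array, 0, array.Length); // restore the zeroed state
        _free.Push(array);
    }
}
```

The cost, as noted above, is that the code becomes imperative and the
programmer takes over lifetime management that the GC would otherwise do.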
 

Jon Harrop

Willy said:
Yep. One GC thread per GC heap, you can't have multiple threads collecting
a single GC heap (no parallelism).

Workstation GC: 1 (user) thread per process collecting the (single) heap.
Server GC: one dedicated GC thread per GC heap (one GC heap per
processor).

So the server GC scales better but probably gives a worse constant-factor
performance degradation?
True, if you have more threads running than actual processors, you will
share a heap across multiple threads, not sure what you mean with
"synchronize their allocations" though.

If multiple user threads can share a single heap then they cannot allocate
from it independently: their allocations must be synchronized using a lock.
But Ben Voigt said that no locks were used during allocation.
Say you have 6 threads running on a quad core, with 'gcServer' enabled,
you may notice the first 4 thread allocating from different heap, while
the two remaining threads are sharing a heap with one of the first four
threads.
Right.

However, this all depends on the allocation scheme, the CLR
version and a number of (undocumented) heuristics.
The CLR keeps track of the allocation frequency, the allocation size, and the
number of managed threads in the process. User threads with a "low allocation
frequency" tend to share a single heap, while threads with a high
allocation frequency *may* get their own heap.
Nice.


There is no such thing as a "main heap"; each GC heap has its own GC
thread, which will/can only collect its own heap.
I guess you are referring to something different.

If one heap contains a reference to a value in another heap, how does the GC
thread for that heap know that it is referenced and cannot be deallocated?
Is it not necessary to have some kind of global traversal as well?
 

Atmapuri

Hi!
If one heap contains a reference to a value in another heap, how does the
GC
thread for that heap know that it is referenced and cannot be deallocated?
Is it not necessary to have some kind of global traversal as well?

My question also. However, there is one part of the GC that could be
threaded properly even in the workstation GC mode: all memory
returned by the GC is first initialized to zero. That "zero setter"
could be threaded.

On the other hand, according to my benchmarks, this would only
speed up the GC by about 20-30% on machines with more than one core.

Thanks!
Atmapuri
 

Willy Denoyette [MVP]

Jon Harrop said:
So the server GC scales better but probably gives a worse constant-factor
performance degradation?

Yes, the server GC is tuned for maximum throughput in server scenarios where
many threads may allocate simultaneously.
The main advantage of the server GC is that not all threads are suspended
during a GC run; the drawback is that a GC run may suspend a thread for a
longer period than the workstation GC in "concurrent" mode. As a rule of
thumb, the workstation concurrent GC is best for interactive applications,
while the server GC is better for ASP.NET-like application scenarios, but as
always, you'll have to measure to know exactly which GC mode is best for your
scenario.


If multiple user threads can share a single heap then they cannot allocate
from it independently: their allocations must be synchronized using a
lock.
But Ben Voigt said that no locks were used during allocation.

"object allocation" is nothing more than bumping a pointer in an atomic
operation.
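
Illustratively, that "pointer bump" can be sketched in C# like this. This is
a toy model only: the real allocator bumps a raw pointer inside the runtime,
and per-thread allocation contexts usually avoid even the interlocked
operation shown here.

```csharp
using System.Threading;

// Toy bump-pointer allocator: each request just advances an offset
// into a pre-reserved region. Interlocked.Add makes the bump atomic
// when several threads share one segment.
class BumpRegion
{
    private readonly byte[] _region = new byte[64 * 1024]; // one 64 KB segment
    private int _next;

    // Returns the start offset of the reserved block, or -1 when the
    // segment is exhausted (when the GC would have to collect or grow).
    public int Allocate(int size)
    {
        int end = Interlocked.Add(ref _next, size);
        return end <= _region.Length ? end - size : -1;
    }
}
```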

If one heap contains a reference to a value in another heap, how does the
GC
thread for that heap know that it is referenced and cannot be deallocated?
Is it not necessary to have some kind of global traversal as well?

I haven't looked at the details of the server GC, but IMO the GC must
scan all heaps in order to mark all objects in the current heap that have
live references from the other heaps.

Willy.
 

Willy Denoyette [MVP]

Atmapuri said:
Hi!


My question also. However, there is one part of the GC that could be
threaded properly even for the workstation GC mode. All memory
before returned by the GC is initialized to zero. That "zero setter"
could be threaded.

On the other hand, according to my benchmarks, this would only
speed up GC by about 20-30% on machines with more than 1 core.


This is not done in the current CLR, so how did you measure this?

Willy.
 

Atmapuri

Hi!
This is not done in the current CLR, so how did you measure this?

You haven't answered:

About zero-setting, it is very simple: first you measure the time to
allocate and set to zero, and then you measure the time needed only
to set to zero. Subtract the two and you get the difference, which is
the GC time if it were not initializing to zero.

double[] testArray;
double[] resArray = new double[res.Length];

for (int i = 0; i < Iterations; i++) {
    for (int k = 0; k < GCIterCount; k++) // GC loop
    {
        testArray = new double[res.Length];
        testArray[1] = resArray[1];
    }
}

And then the timing of this loop:

for (int i = 0; i < Iterations; i++) {
    for (int k = 0; k < GCIterCount; k++) // GC loop
    {
        pinnedArray_Set_To_Zero_WithIntelIPPSetValUsingSSE(res.Length);
    }
}

The timing results are then normalized to a per-byte cost.
These timings show that when memory allocation exceeds 8 MByte (large
blocks), the zero-setting cost is 30%.

res.Length goes in steps of 2 from 2 up to 2^18.

Regards!
Atmapuri
 

Willy Denoyette [MVP]

Atmapuri said:
Hu!
This is not done in the current CLR, so how did you measure this?

You havent answered:

About zero setting. Very simple: First you measure the time to
allocate and set to zero and then you measure the time needed only
to set to zero. Subtract the two and you get the difference, which is
the time of GC if it would not be initializing to zero.

for (int i=0;i<Iterations;i++) {
for (int k = 0; k < GCIterCount; k++) //GC Loop
{
testArray = new double[res.Length];
testArray[1] = resArray[1];
}
}

And then of this loop:

for (int i=0;i<Iterations;i++) {
for (int k = 0; k < GCIterCount; k++) //GC Loop
{
pinnedArray_Set_To_Zero_WithIntelIPPSetValUsingSSE(res.Length);
}
}

The timing results are then normalized on a per byte cost.
This timings show that when memory allocation exceeds 8MByte large
blocks, zero settings cost is 30%.

res.Length goes in steps of 2 from 2 up to 2^18.
Regards!
Atmapuri



Well, actually you didn't measure how long the CLR needs to *zero* the
memory, so you are comparing apples to oranges. Also keep in mind that the
CLR doesn't zero the freed memory after the GC finishes; it allocates/zeroes
the memory when a *soft page fault* occurs while allocating a *new* object
from the heap.
For this, the CLR's memory allocator calls Win32's VirtualAlloc, which
automatically zeroes the allocated block. The size of the allocated block is
64 KByte by default (implementation dependent!) for allocations from the
gen0, gen1 and gen2 heaps; when allocating from the LOH, this size is a
multiple of 64 KByte, depending on the size of the object.

From the (incomplete) code posted above I see that you call into unmanaged
code (pinnedArray_Set_To_Zero_WithIntelIPPSetValUsingSSE); this involves a
transition from managed to unmanaged code, which takes time when not done
with the "unmanaged code security" stack walk disabled, and the pinning takes
time too. Also, you didn't consider the time needed for a thread switch to
zero the memory from another thread.
Zeroing the memory as done by the CLR's memory manager needs no transition
from managed to unmanaged code, and there is no pinning involved either.
Allocating 64 KByte with VirtualAlloc is extremely fast; there is absolutely
no reason to burn a separate thread for this, as a thread switch may actually
be much slower than the actual VirtualAlloc call.
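
To see that zeroing guarantee directly, here is a small Windows-only C#
sketch that commits a 64 KByte block with VirtualAlloc and reads it back
(the P/Invoke declaration follows the documented Win32 signature):

```csharp
using System;
using System.Runtime.InteropServices;

class VirtualAllocDemo
{
    const uint MEM_COMMIT = 0x1000, MEM_RESERVE = 0x2000, PAGE_READWRITE = 0x04;

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr VirtualAlloc(IntPtr lpAddress, UIntPtr dwSize,
                                      uint flAllocationType, uint flProtect);

    static void Main()
    {
        // Win32 documents that freshly committed pages are zero-initialized.
        IntPtr p = VirtualAlloc(IntPtr.Zero, (UIntPtr)(64 * 1024),
                                MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
        Console.WriteLine(Marshal.ReadInt32(p)); // prints 0
    }
}
```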

Willy.
 

Ben Voigt [C++ MVP]

"object allocation" is nothing more than bumping a pointer in an
atomic operation.

Right. Because the Gen0/Gen1/Gen2 heaps... aren't. They are really stacks.
 
