Multithreaded GC!

Jon Harrop

Jeroen said:
The number of threads and processors involved in this exercise is also
highly relevant. The fact that you're just bumping a counter is likewise.
What you'll see here is a rough estimate of how much overhead the locking
mechanisms have relative to each other, when not considering the actual
work. It says far less about the performance of any actual multithreaded
scenario using those mechanisms.

That's exactly what I'm trying to benchmark, yes.
For example, in most scenarios spinning interlocked operations are more
efficient than locks, but if contention gets very high, the overhead of
threads going to sleep and switching context becomes less than the time
wasted on spinning. Caching issues are also a big hairy ball of
analysis-ruining nasties.

Hopefully cache effects won't be too bad here because this is about gen0
allocations, which are contiguous.
Micro-benchmarks like these are usually worse than useless; they're
actively misleading. If you want to discuss the performance of real
algorithms, profile those algorithms.

I already did. Read my previous posts in this thread for the context of this
benchmark.
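
For reference, the kind of counter micro-benchmark under discussion looks
roughly like the C# sketch below. This is a minimal illustration, not the
exact code posted earlier in the thread; the thread count and iteration
count are arbitrary choices.

// Times three variants of bumping a shared counter from four threads:
// no synchronization (racy by design, as a baseline), Interlocked, and lock.
using System;
using System.Diagnostics;
using System.Threading;

class CounterBench
{
    const int Iterations = 10_000_000;
    static long counter;
    static readonly object gate = new object();

    static void Run(string name, Action body)
    {
        counter = 0;
        var sw = Stopwatch.StartNew();
        var threads = new Thread[4];
        for (int i = 0; i < threads.Length; i++)
        {
            threads[i] = new Thread(() => body());
            threads[i].Start();
        }
        foreach (var t in threads) t.Join();
        Console.WriteLine($"{name}: {sw.ElapsedMilliseconds} ms");
    }

    static void Main()
    {
        Run("unsynchronized", () => { for (int i = 0; i < Iterations; i++) counter++; });
        Run("interlocked", () => { for (int i = 0; i < Iterations; i++) Interlocked.Increment(ref counter); });
        Run("lock", () => { for (int i = 0; i < Iterations; i++) { lock (gate) counter++; } });
    }
}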
 
Jeroen Mostert

Jon said:
I already did. Read my previous posts in this thread for the context of this
benchmark.
If you wrote a concurrent garbage collector where you've tested the various
locking mechanisms for their performance, or adapted the Microsoft GC this
way, I find no mention of that in your posts. I'm sure you've profiled your
own algorithms that *use* the GC, but that's not what I mean here. We were
talking about the code benchmarking the locking; I'm pointing out that your
conclusions mean little if those locking benchmarks are not taken in the
context of an actual GC.

As far as I'm aware, there's no such thing as a garbage collector that never
synchronizes with allocation threads in some way (however indirectly and
unobtrusively it wants to do this), so simply going "locks are bad because
slow" is a shallow conclusion. It's the amount and method of synchronization
that matters here. As mentioned by others, the CLR GC actually uses a
combination of interlocked operations and suspending threads to achieve
synchronization. Other schemes are certainly possible, possibly more
efficient, and possibly even more efficient when averaged over all
allocation patterns, but just testing which locking mechanism is fastest
tells you little about that.
 
Jon Harrop

Jeroen said:
If you wrote a concurrent garbage collector where you've tested the
various locking mechanisms for their performance, or adapted the Microsoft
GC this way, I find no mention of that in your posts.

Yes. That is exactly what my OCaml vs F# comparisons achieved.

There are other factors at play, like the quality of code generation, but
they are dwarfed by allocation performance in this case (even in OCaml,
where allocation is much faster). Specifically, a variety of other
benchmarks indicate that the differences due to those other factors are
well within a factor of two in general, so I'm confident these results are
representative.
I'm sure you've
profiled your own algorithms that *use* the GC, but that's not what I mean
here. We were talking about the code benchmarking the locking; I'm
pointing out that your conclusions mean little if those locking benchmarks
are not taken in the context of an actual GC.

That was not the purpose of those lock microbenchmarks.

I had previously observed up to 5x slowdowns moving allocation-intensive
OCaml code to .NET (F# and C#) and wanted to know what the cause was. Now I
know that a direct pointer bump is 5x faster than a synchronized one and
that .NET currently does a synchronized pointer bump in every allocation,
which explains the slowdown I see.
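
To make that concrete, here is a rough illustration (a toy model for the
purposes of this thread, not the CLR's actual allocation path) of a direct
pointer bump versus a bump that has to be an atomic compare-and-swap so
that several threads can share the same nursery:

// Bounds checks and GC triggering are omitted; only the bump itself is shown.
using System.Threading;

class BumpAllocator
{
    long next; // next free offset into some nursery region

    // Single-threaded allocation: one add, no synchronization.
    public long AllocDirect(int size)
    {
        long offset = next;
        next = offset + size;
        return offset;
    }

    // Shared allocation: the bump becomes a CAS loop, retried on contention,
    // which is the extra per-allocation cost being measured.
    public long AllocSynchronized(int size)
    {
        while (true)
        {
            long offset = Volatile.Read(ref next);
            if (Interlocked.CompareExchange(ref next, offset + size, offset) == offset)
                return offset;
        }
    }
}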

So the results of the lock microbenchmarks are not related to a specific GC
but, rather, to any GC that requires synchronization in each and every
allocation (as the .NET implementation currently does).
As far as I'm aware, there's no such thing as a garbage collector that
never synchronizes with allocation threads in some way (however indirectly
and unobtrusively it wants to do this),

Actually, that is very common for languages that make extensive use of
purely functional data structures. Look at Haskell and Erlang
implementations, for example.
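
The idea, sketched below assuming thread-local allocation buffers (real
Haskell and Erlang runtimes differ in the details), is that each thread
bumps a pointer in its own buffer, so ordinary allocations need no
cross-thread synchronization at all:

using System;

class ThreadLocalNursery
{
    const int BufferSize = 1 << 20;

    [ThreadStatic] static byte[] buffer;
    [ThreadStatic] static int next;

    public static int Alloc(int size)
    {
        if (buffer == null || next + size > BufferSize)
            Refill(); // the only point that may need to synchronize with a collector
        int offset = next;
        next += size; // plain, unsynchronized pointer bump
        return offset;
    }

    static void Refill()
    {
        buffer = new byte[BufferSize]; // in a real runtime this would come from a shared pool
        next = 0;
    }
}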
so simply going "locks are bad because slow" is a shallow conclusion.

The lock in the current .NET allocator is my performance bottleneck and I
need to find a workaround to avoid that problem and recover the lost
performance.
It's the amount and method of synchronization that matters here.

Yes, exactly. That is the same thing.
As mentioned by others, the CLR GC
actually uses a combination of interlocked operations and suspending
threads to achieve synchronization. Other schemes are certainly possible,
possibly more efficient, and possibly even more efficient when averaged
over all allocation patterns, but just testing which locking mechanism is
fastest tells you little about that.

That is why I did not just test the performance of different locking
mechanisms.

The next question for me is: how can I amortize those locks to recover the
lost performance?
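
For example, one possible direction (a sketch under assumptions, not
something validated in this thread) is to replace many small allocations
with one large allocation carved up locally, such as an array of value-type
nodes indexed by integer instead of individually allocated objects:

struct Node
{
    public int Value;
    public int Next; // index of the next node, or -1 for none
}

class NodeArena
{
    readonly Node[] nodes;
    int used;

    public NodeArena(int capacity)
    {
        nodes = new Node[capacity]; // one synchronized allocation pays for many nodes
    }

    public int NewNode(int value, int next)
    {
        int i = used++; // plain increment; each arena is used by a single thread
        nodes[i].Value = value;
        nodes[i].Next = next;
        return i;
    }

    public ref Node Get(int index) => ref nodes[index];
}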
 
