Jeroen said:
If you wrote a concurrent garbage collector where you've tested the
various locking mechanisms for their performance, or adapted the Microsoft
GC this way, I find no mention of that in your posts.
Yes. That is exactly what my OCaml vs F# comparisons achieved.
There are other factors at play, like the quality of code generation, but
they are dwarfed by allocation performance in this case (even in OCaml,
where allocation is much faster). Specifically, a variety of other
benchmarks indicate that the performance difference is well within a factor
of two in general. So I'm confident these results are representative.
I'm sure you've
profiled your own algorithms that *use* the GC, but that's not what I mean
here. We were talking about the code benchmarking the locking; I'm
pointing out that your conclusions mean little if those locking benchmarks
are not taken in the context of an actual GC.
That was not the purpose of those lock microbenchmarks.
I had previously observed up to 5x slowdowns when moving allocation-intensive
OCaml code to .NET (F# and C#) and wanted to know the cause. Now I
know that a direct pointer bump is 5x faster than a synchronized one and
that .NET currently performs a synchronized pointer bump on every
allocation, which explains the slowdown I see.
So the results of the lock microbenchmarks are not tied to a specific GC
but, rather, apply to any GC that requires synchronization on each and
every allocation (as the .NET implementation currently does).
As far as I'm aware, there's no such thing as a garbage collector that
never synchronizes with allocation threads in some way (however indirectly
and unobtrusively it wants to do this),
Actually that is very common in the context of languages that make extensive
use of purely functional data structures. Look at the Haskell and Erlang
implementations, for example.
so simply going "locks are bad because slow" is a shallow conclusion.
The lock in the current .NET allocator is my performance bottleneck and I
need to find a workaround to avoid that problem and recover the lost
performance.
It's the amount and method of synchronization that matters here.
Yes, exactly. That is the same thing.
As mentioned by others, the CLR GC
actually uses a combination of interlocked operations and suspending
threads to achieve synchronization. Other schemes are certainly possible,
possibly more efficient, perhaps even when averaged over all allocation
patterns, but just testing which locking mechanism is fastest tells you
little about that.
That is why I did not just test the performance of different locking
mechanisms.
The next question for me is: how can I amortize those locks to recover the
lost performance?
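One standard way to amortize the synchronization is per-thread allocation
buffers: each thread claims a large chunk from the shared heap with a single
atomic operation and then satisfies individual allocations with an ordinary,
unsynchronized pointer bump inside that chunk. A minimal sketch, with
hypothetical names (`tlab`, `tlab_alloc`) and no overflow or GC handling:

```c
#include <stdatomic.h>
#include <stddef.h>

#define HEAP_SIZE (1 << 24)
#define CHUNK     (1 << 16)   /* bytes claimed from the shared heap at a time */

static unsigned char shared_heap[HEAP_SIZE];
static _Atomic size_t shared_top = 0;

/* Per-thread allocation buffer: one atomic operation per CHUNK bytes
   instead of one per object, amortizing the synchronization cost over
   roughly CHUNK / object_size allocations. */
typedef struct { unsigned char *ptr, *end; } tlab;

void *tlab_alloc(tlab *t, size_t n) {
    if ((size_t)(t->end - t->ptr) < n) {   /* buffer exhausted: refill */
        size_t base = atomic_fetch_add(&shared_top, (size_t)CHUNK);
        t->ptr = shared_heap + base;
        t->end = t->ptr + CHUNK;
    }
    void *p = t->ptr;
    t->ptr += n;                           /* unsynchronized pointer bump */
    return p;
}
```

With 16-byte objects and 64 KB chunks, the interlocked operation is paid
once per ~4096 allocations rather than once per allocation, which is
precisely the kind of amortization asked about above.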