Multithreaded GC!


Jon Harrop

Ben said:
Right. Because the Gen0/Gen1/Gen2 heaps... aren't. They are really
stacks.

Sure, but bumping a pointer means reading it, incrementing it and writing
it. To make that atomic you must wrap it in a lock, right?
 

Jon Skeet [C# MVP]

Jon Harrop said:
Sure, but bumping a pointer means reading it, incrementing it and writing
it. To make that atomic you must wrap it in a lock, right?

Would you consider Interlocked.Increment to be taking out a lock?
 

Atmapuri

Hi!
From the (incomplete) code posted above I see that you call into unmanaged
code (pinnedArray_Set_To_Zero_WithIntelIPPSetValUsingSSE); this involves
transitioning from managed to unmanaged code, which takes time when the
"unmanaged code security" stack walk is not disabled,

Security was disabled on that call.
and the pinning takes time too.

Not when pinned outside of the timed loop.
Also, you didn't consider the time needed for a thread switch to zero the
memory from another thread.
Zeroing the memory as done by the CLR's memory manager needs no
transition from managed to unmanaged code, and there is no pinning
involved either. Allocating 64 KB using VirtualAlloc is extremely fast;
there is absolutely no reason to burn a separate thread for this, and a
thread switch may actually be much slower than the actual VirtualAlloc call.

Zeroing by VirtualAlloc is not faster than the fastest zeroing you can do,
especially since VirtualAlloc cannot make use of SSE2/SSE3: it has to run
code that works on all CPUs. The zeroing I timed is close to the fastest
zeroing you can do, "ever".

Posting the code is not a problem, but I was hoping that you would take the
suggestion as well meant and try to do something about it rather than
denying that an issue exists.

Thanks!
Atmapuri
 

Jon Skeet [C# MVP]

Jon Harrop said:
I assume that is a pointer bump wrapped in a lock, yes.

The point of the Interlocked class is to avoid requiring a lock - at
least a user-mode lock. I don't know the details of the implementation,
but I believe it uses "native" (i.e. CPU level) atomic operations.
They're likely to be significantly faster than CLR-level locking.

Yes, there will still be overhead compared with incrementing in a non-
threadsafe way - but it's not at the same level as most people mean
when they talk about acquiring a lock.
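
For example, the two approaches look like this in C# (a minimal sketch;
the class and field names are made up for illustration):

using System.Threading;

class Counters
{
    private int _value;
    private readonly object _sync = new object();

    // CPU-level atomic increment: on x86 this typically becomes a single
    // LOCK-prefixed instruction, with no user-mode lock object involved.
    public int IncrementInterlocked()
    {
        return Interlocked.Increment(ref _value);
    }

    // Monitor-based increment: acquires and releases a CLR lock around a
    // plain read-modify-write, which is considerably more expensive.
    public int IncrementLocked()
    {
        lock (_sync)
        {
            return ++_value;
        }
    }
}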
 

Ben Voigt [C++ MVP]

Jon said:
Sure, but bumping a pointer means reading it, incrementing it and
writing it. To make that atomic you must wrap it in a lock, right?

No. See the x86 instruction set, which provides atomic bus operations.
They still use the term "LOCK prefix", but it is not a lock in the software
sense of the word; it uses bus arbitration instead.
 

Ben Voigt [C++ MVP]

Atmapuri said:
Hi!


Security was disabled on that call.


Not when pinned outside of the timed loop.


Zeroing by VirtualAlloc is not faster than the fastest zeroing you can
do, especially since VirtualAlloc cannot make use of SSE2/SSE3: it has
to run code that works on all CPUs. The zeroing I timed is close to the
fastest zeroing you can do, "ever".

Posting the code is not a problem, but I was hoping that you would take
the suggestion as well meant and try to do something about it rather
than denying that an issue exists.

Why not just compare to Array.Clear, which would zero using the best method
known to the framework?
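
Something like the following would do it (a sketch; the buffer size and
iteration count are arbitrary):

using System;
using System.Diagnostics;

class ClearBenchmark
{
    static void Main()
    {
        double[] data = new double[8 * 1024]; // 8K doubles = 64 KB
        const int iterations = 100000;

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            // Lets the framework pick the best zeroing method it knows.
            Array.Clear(data, 0, data.Length);
        }
        sw.Stop();
        Console.WriteLine("Array.Clear: {0} ms", sw.ElapsedMilliseconds);
    }
}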
 

Jon Harrop

Jon said:
The point of the Interlocked class is to avoid requiring a lock - at
least a user-mode lock. I don't know the details of the implementation,
but I believe it uses "native" (i.e. CPU level) atomic operations.
They're likely to be significantly faster than CLR-level locking.

Yes, there will still be overhead compared with incrementing in a non-
threadsafe way - but it's not at the same level as most people mean
when they talk about acquiring a lock.

Indeed, I just benchmarked it here and the lock used by the Interlocked.Add
member is ~5x slower than direct access but still 2x faster than using
Monitor directly from F# code.

This raises another interesting question: if only a single thread is bumping
pointers in a GC heap and the GC stays out until the minor heap is
exhausted, then the GC could let the user thread know that the lock isn't
required and make the pointer bump 5x faster. It looks like the current CLR
implementation doesn't do that, but it would give a huge performance boost
to many heavily-allocating programs...
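
To make the idea concrete, here is a rough C# sketch of the two bump
strategies (the BumpAllocator class is hypothetical, purely for
illustration; a real GC does this in native code):

using System.Threading;

class BumpAllocator
{
    private int _next;           // offset of the next free byte
    private readonly int _limit; // end of the minor heap

    public BumpAllocator(int size) { _limit = size; }

    // Thread-safe bump: pays for a LOCK-prefixed add on every allocation.
    public int AllocShared(int bytes)
    {
        int end = Interlocked.Add(ref _next, bytes);
        // (sketch: a failed bump is not rolled back; a real GC collects here)
        return end <= _limit ? end - bytes : -1; // -1 => heap exhausted
    }

    // Single-owner bump: a plain read/add/write, only safe while the GC
    // has told this thread it is the sole allocator in this heap.
    public int AllocPrivate(int bytes)
    {
        int start = _next;
        if (start + bytes > _limit) return -1; // -1 => heap exhausted
        _next = start + bytes;
        return start;
    }
}

The difference between AllocShared and AllocPrivate is exactly the
Interlocked overhead measured above.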
 

Jon Harrop

Ben said:
No. See the x86 instruction set, which provides atomic bus operations.

That's a lock...
They still use the term "LOCK prefix",

See, they even call it a lock. :)
but it is not a lock in the software sense of the word, it uses bus
arbitration instead.

Whether or not you regard this inherent mutual exclusion as a lock, it is
the source of .NET's slow allocation.

I'm actually less concerned about the current 5x performance hit and more
concerned about how that performance hit will change in the future, as the
memory gap continues to expand.
 

Jon Skeet [C# MVP]

Jon Harrop said:
Indeed, I just benchmarked it here and the lock used by the Interlocked.Add
member is ~5x slower than direct access but still 2x faster than using
Monitor directly from F# code.

This raises another interesting question: if only a single thread is bumping
pointers in a GC heap and the GC stays out until the minor heap is
exhausted, then the GC could let the user thread know that the lock isn't
required and make the pointer bump 5x faster. It looks like the current CLR
implementation doesn't do that, but it would give a huge performance boost
to many heavily-allocating programs...

Potentially, yes. It would be interesting to have, say, 4 threads on a
4-proc box, each allocating from its own GC heap. So long as each
thread stays on its own processor, you could do wonderful things.

It would certainly make an interesting research project.
 

Jon Harrop

Jon said:
Potentially, yes. It would be interesting to have, say, 4 threads on a
4-proc box, each allocating from its own GC heap. So long as each
thread stays on its own processor, you could do wonderful things.

Even if the user thread is moved to a different CPU there is still no
contention for its heap.
It would certainly make an interesting research project.

This approach is already commonplace in other systems.
 

Jon Skeet [C# MVP]

Jon Harrop said:
Even if the user thread is moved to a different CPU there is still no
contention for its heap.

True - although it would make it slower in terms of caching etc.
This approach is already commonplace in other systems.

I suspect the difficulty is knowing when to apply it. It's no doubt
highly suitable for some situations, but probably awful for others.
 

Willy Denoyette [MVP]

Jon Harrop said:
Ben said:
Jon said:
Ben Voigt [C++ MVP] wrote:
"object allocation" is nothing more than bumping a pointer in an
atomic operation.

Right. Because the Gen0/Gen1/Gen2 heaps... aren't. They are really
stacks.

Sure, but bumping a pointer means reading it, incrementing it and
writing it. To make that atomic you must wrap it in a lock, right?

No. See the x86 instruction set, which provides atomic bus operations.

That's a lock...
They still use the term "LOCK prefix",

See, they even call it a lock. :)

Yes, and they will keep calling it that (in x86), but modern processors
do get smarter and no longer need to lock the bus.
Please consult the appropriate technical literature from Intel or others
for details on the differences between processor types and families.
Whether or not you regard this inherent mutual exclusion as a lock, it is
the source of .NET's slow allocation.

I'm actually less concerned about the current 5x performance hit and more
concerned about how that performance hit will change in the future, as the
memory gap continues to expand.

What 5x performance hit are you talking about, what are you comparing, and
how did you measure?


Willy.
 

Willy Denoyette [MVP]

Ben Voigt said:
Right. Because the Gen0/Gen1/Gen2 heaps... aren't. They are really
stacks.

Yep, the back-end is, but with a complex front-end.

Willy.
 

Jon Harrop

Jon said:
True - although it would make it slower in terms of caching etc.

Very true.
I suspect the difficulty is knowing when to apply it. It's no doubt
highly suitable for some situations, but probably awful for others.

I've been trying to think about that, but I don't think it's possible to
estimate the performance you'd get from such tweaks; or at least, you'd
need a lot more experience with parallel GC than I have...

I think the fundamental problem is that optimizing for the immutable case
will degrade the performance of the mutable case. On .NET, the vast
majority of money is invested in mutable code, of course, so Microsoft
aren't likely to switch any time soon. However, that will almost certainly
change as programmers start to appreciate how much easier immutable data
structures are in the context of parallelism...
 

Jon Harrop

Willy said:
What 5x performance hit are you talking about, what are you comparing, and
how did you measure?

Time taken to increment a counter 100M times using each of four different
methods:

. Direct access
. Interlocked.Add
. Monitor.Enter and Exit
. F#'s higher-order "lock" function

I can post the code if you like.
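
A C# sketch of that kind of measurement might look like this (not the
actual F# code; F#'s "lock" is just a higher-order wrapper over
Monitor.Enter/Exit, so it is omitted here). Printing the counter also
keeps the loops from being optimized away:

using System;
using System.Diagnostics;
using System.Threading;

class IncrementBenchmark
{
    const int N = 100000000; // 100M increments
    static int counter;
    static readonly object sync = new object();

    static void Time(string name, Action body)
    {
        counter = 0;
        Stopwatch sw = Stopwatch.StartNew();
        body();
        sw.Stop();
        Console.WriteLine("{0}: {1} ms (counter = {2})",
                          name, sw.ElapsedMilliseconds, counter);
    }

    static void Main()
    {
        Time("direct", delegate
        {
            for (int i = 0; i < N; i++) counter++;
        });
        Time("Interlocked.Add", delegate
        {
            for (int i = 0; i < N; i++) Interlocked.Add(ref counter, 1);
        });
        Time("Monitor.Enter/Exit", delegate
        {
            for (int i = 0; i < N; i++)
            {
                Monitor.Enter(sync);
                counter++;
                Monitor.Exit(sync);
            }
        });
    }
}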
 

Jeroen Mostert

Jon said:
Time taken to increment a counter 100M times using each of four different
methods:

. Direct access
. Interlocked.Add
. Monitor.Enter and Exit
. F#'s higher-order "lock" function

I can post the code if you like.

The number of threads and processors involved in this exercise is also
highly relevant. The fact that you're just bumping a counter is likewise.
What you'll see here is a rough estimate of how much overhead the locking
mechanisms have relative to each other, when not considering the actual
work. It says far less about the performance of any actual multithreaded
scenario using those mechanisms.

For example, in most scenarios spinning interlocked operations are more
efficient than locks, but if contention gets very high, the overhead of
threads going to sleep and switching context becomes less than the time
wasted on spinning. Caching issues are also a big hairy ball of
analysis-ruining nasties.

Micro-benchmarks like these are usually worse than useless: they're actively
misleading. If you want to discuss the performance of real algorithms,
profile those algorithms.
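
"Spinning interlocked operations" usually means a compare-and-swap retry
loop, along these lines (a generic sketch, not a framework API):

using System;
using System.Threading;

static class Atomics
{
    // Applies an arbitrary read-modify-write atomically by spinning on
    // CompareExchange until no other thread has raced with us in between.
    public static int Transform(ref int target, Func<int, int> f)
    {
        while (true)
        {
            int seen = target;   // snapshot the current value
            int want = f(seen);  // compute the desired new value
            // Publish only if target still holds the snapshot, else retry.
            if (Interlocked.CompareExchange(ref target, want, seen) == seen)
                return want;
        }
    }
}

Under heavy contention every failed CompareExchange is a wasted pass
through the loop, which is exactly the crossover point where sleeping
locks start to win.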
 

Willy Denoyette [MVP]

Jon Harrop said:
Time taken to increment a counter 100M times using each of four different
methods:

. Direct access
. Interlocked.Add
. Monitor.Enter and Exit
. F#'s higher-order "lock" function

I can post the code if you like.

That would be nice. I have access to some great HW (for the next few
days) running WS2008. I really like what they have done at the kernel
level.

Willy.
 

Ben Voigt [C++ MVP]

Jon said:
Time taken to increment a counter 100M times using each of four
different methods:

. Direct access
. Interlocked.Add
. Monitor.Enter and Exit
. F#'s higher-order "lock" function

I can post the code if you like.

In C++ the advantage of direct access to a non-volatile variable over an
atomic operation would approach a factor of 100M in the case you described
(since the optimizer will replace the whole loop with a single increment).

I hope you're doing something else in that loop to guarantee the variable
actually is incremented.
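
In C#, a defensive version of that loop marks the variable volatile and
consumes the final value (a sketch; whether a particular JIT would perform
the collapse at all is implementation-dependent):

class LoopGuard
{
    // volatile forces each iteration's store to actually happen, and
    // printing the final value keeps the loop from being eliminated as
    // dead code.
    static volatile int counter;

    static void Main()
    {
        for (int i = 0; i < 100000000; i++)
            counter = counter + 1;

        System.Console.WriteLine(counter);
    }
}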
 

Willy Denoyette [MVP]

Atmapuri said:
Hi!


Security was disabled on that call.

Next time post the whole code, and don't let us guess.
Not when pinned outside of the timed loop.


Zeroing by VirtualAlloc is not faster than the fastest zeroing you can do,
especially since VirtualAlloc cannot make use of SSE2/SSE3: it has to run
code that works on all CPUs. The zeroing I timed is close to the fastest
zeroing you can do, "ever".

VirtualAlloc is how the CLR allocates memory from the OS, and as
VirtualAlloc clears the allocated memory, I can't see why you think that
burning another thread to clear it once more would speed up the overall
object allocation.
Also, you are claiming that VirtualAlloc's zeroing of the memory does not
use SSE2/SSE3. Well, the kernel knows what CPU it is running on, and takes
the most appropriate path for what the underlying architecture provides
when it sees fit (which doesn't mean SSE is always the most appropriate
path!). Note also that VirtualAlloc incurs a kernel-mode switch, which
takes a hit of roughly 5,000 cycles; the zeroing of 64 KB of memory takes
less than 1% of this. How many cycles do you think you can save by using
SSE2/SSE3 for this? (Please post the whole code if you ever come up with
something.)
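
For anyone who wants to verify the demand-zeroing directly, here is a
P/Invoke sketch (constants from winnt.h; error handling omitted):

using System;
using System.Runtime.InteropServices;

class VirtualAllocDemo
{
    const uint MEM_COMMIT = 0x1000;
    const uint MEM_RESERVE = 0x2000;
    const uint PAGE_READWRITE = 0x04;

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr VirtualAlloc(IntPtr lpAddress, UIntPtr dwSize,
                                      uint flAllocationType, uint flProtect);

    static void Main()
    {
        // Commit 64 KB; Windows hands back demand-zeroed pages, so no
        // explicit memset/SSE loop is needed to obtain cleared memory.
        IntPtr block = VirtualAlloc(IntPtr.Zero, (UIntPtr)(64 * 1024),
                                    MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
        Console.WriteLine("first byte = {0}", Marshal.ReadByte(block));
    }
}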

Posting the code is not a problem, but I was hoping that you would take the
suggestion as well meant and try to do something about it rather than
denying that an issue exists.

What issue? And what suggestion? Suggestions should be posted to Microsoft's
Connect site at <http://connect.microsoft.com>, not to a NG. Also FYI,
MVPs are not MS employees and are not affiliated whatsoever with MS;
basically we are here to help people with real issues, but we cannot help
you with your suggestions about changes to the CLR to suit your needs.

Willy.
 
