atomic operations...

Chris M. Thomasson · Aug 3, 2009

I was wondering if there was an equivalent of Java's
`AtomicStampedReference':

http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/atomic/AtomicStampedReference.html

in C Sharp? I tried to hack something together:

http://pastebin.com/m16a32c47

But as you can probably see, it's not working. The DWCAS should have failed,
but it erroneously succeeded anyway. Am I making any obvious mistakes? I
need to be able to atomically mutate an object reference along with an
integer. Please help as I am not an expert in C#!

Thank you all for your valuable time.

:^)

Chris M. Thomasson · Aug 3, 2009

Chris M. Thomasson said:
I was wondering if there was an equivalent of Java's
`AtomicStampedReference':

http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/atomic/AtomicStampedReference.html

in C Sharp? I tried to hack something together:

http://pastebin.com/m16a32c47

But as you can probably see, it's not working. The DWCAS should have
failed, but it erroneously succeeded anyway. Am I making any obvious
mistakes? I need to be able to atomically mutate an object reference along
with an integer. Please help as I am not an expert in C#!

I even tried:

http://pastebin.com/m1c214a67

The DWCAS returns true, however, no actual update was made to the target. I
have a feeling that this is not going to work out in C#.

Peter Duniho · Aug 4, 2009

I was wondering if there was an equivalent of Java's
`AtomicStampedReference':

http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/atomic/AtomicStampedReference.html

There is no built-in class that I'm aware of which exactly that
functionality.

in C Sharp? I tried to hack something together:

http://pastebin.com/m16a32c47

But as you can probably see, it's not working. The DWCAS should have
failed, but it erroneously succeeded anyway.

Why should it have failed? You are passing exactly the same object for
the original and the comparand, so of course they will test equal and the
original will be replaced by your "xchg" object.

The implementation of the "DWCAS()" method seems little more to me than an
inefficient, potentially buggy version of the Interlock.CompareExchange()
method it uses. Inefficient, because the if() statement after the
exchange is redundant; first you check for object equality, then you check
for equality of a member in the two objects you already verified as
equal. The latter test, when executed at all, will always evaluate as
"true".

Buggy, because the method is also not thread-safe, in the sense that the
passed-in variables, if accessed by multiple threads, could wind up being
modified by one thread during the execution of the method in another
thread (possibly even by the same method). In fact, there's not any way
for the method itself to be inherently thread-safe; it can only be made
that way through changes to the callers. The method itself doesn't have
enough information about the operands to be made thread-safe. This sort
of "interlocked" operation is all about thread safety, so that's an
obvious problem.

The usefulness of modifying the comparand to be the original value is also
lost on me.

Am I making any obvious mistakes? I need to be able to atomically mutate
an object reference along with an integer. Please help as I am not an
expert in C#!

The biggest issue I see is that the class you created isn't semantically
anything like the Java AtomicStampedReference class. Your class is
basically just a "holder" for data, and you've tried to implement the Java
functionality in a single method elsewhere.

As a first attempt, I would definitely just duplicate the Java class,
rather than change the API at the same time. It's much harder to try to
do two new things at the same time than just one new thing.

Also, you've stated a problem goal that is precisely the behavior of the
Java class. But, it's entirely possible that whatever algorithmic goal
you have is achievable through other means, rather than literally via the
implementation provided using the Java API. Creating new classes to do
exactly what you could do with built-in classes is something one ought to
avoid.

All that said, here's a simple implementation of the Java class, in C#,
that might be suitable for your needs:

class AtomicStampedReference<T>
{
private T _t;
private int _i;
private object _objLock = new object();

public AtomicStampedReference(T tInitial, int iInitial)
{
_t = tInitial;
_i = iInitial;
}

public T Value { get { return _t; } }
public int Stamp { get { return _i; } }

public void GetValueAndStamp(out T tCurrent, out int iCurrent)
{
lock (_objLock)
{
tCurrent = _t;
iCurrent = _i;
}
}

public void SetValueAndStamp(T tNew, int iNew)
{
lock (_objLock)
{
_t = tNew;
_i = iNew;
}
}

public bool AttemptStamp(T tExpected, int iNew)
{
lock (_objLock)
{
if (_t.Equals(tExpected))
{
_i = iNew;
return true;
}
}

return false;
}

public bool CompareAndSet(T tExpected, T tNew, int iExpected, int
iNew)
{
lock (_objLock)
{
if (_t.Equals(tExpected) && _i == iExpected)
{
_t = tNew;
_i = iNew;
return true;
}
}

return false;
}

public bool WeakCompareAndSet(T tExpected, T tNew, int iExpected,
int iNew)
{
if (Monitor.TryEnter(_objLock))
{
try
{
if (_t.Equals(tExpected) && _i == iExpected)
{
_t = tNew;
_i = iNew;
return true;
}
}
finally
{
Monitor.Exit(_objLock);
}
}

return false;
}
}

Important caveats: I have not compiled, never mind tested, the above
code. Inspect and test it carefully to be sure it's suitable for your
needs. Also, in C# be very wary of the difference between
Object.Equals(Object), T.Equals(T), and the == operator. Ideally, these
will be implemented identically (and in most cases for reference types,
the default implementation using reference equality is what's being used),
and so it shouldn't be an issue. But if you start using classes with
custom implementations of those methods and operator, if it's not done
right you can get some weirdly incorrect results.

Pete

Chris M. Thomasson · Aug 4, 2009

Peter Duniho said:
There is no built-in class that I'm aware of which exactly that
functionality.

Why should it have failed? You are passing exactly the same object for
the original and the comparand, so of course they will test equal and the
original will be replaced by your "xchg" object.

The implementation of the "DWCAS()" method seems little more to me than an
inefficient, potentially buggy version of the Interlock.CompareExchange()
method it uses. Inefficient, because the if() statement after the
exchange is redundant; first you check for object equality, then you check
for equality of a member in the two objects you already verified as
equal. The latter test, when executed at all, will always evaluate as
"true".

[...]

Here is full self-contained compliable example code in C and IA-32 asm
(MSVC 6.0 and higher) that I am trying to accomplish/mimic in C#:
___________________________________________________________________________
#include <stdio.h>

__declspec(naked) int
IA32_DWCAS(
void volatile* const _pthis,
void* const pcmp,
void const* const pxhcg
) {
_asm {
PUSH ESI
PUSH EBX
MOV ESI, [ESP + 16]
MOV EAX, [ESI]
MOV EDX, [ESI + 4]
MOV ESI, [ESP + 20]
MOV EBX, [ESI]
MOV ECX, [ESI + 4]
MOV ESI, [ESP + 12]
LOCK CMPXCHG8B QWORD PTR [ESI]
JNE x86_DWCAS_failed
MOV EAX, 1
POP EBX
POP ESI
RET

x86_DWCAS_failed:
MOV ESI, [ESP + 16]
MOV [ESI], EAX
MOV [ESI + 4], EDX
MOV EAX, 0
POP EBX
POP ESI
RET
}
}

struct DWCAS_Anchor {
void* ptr;
unsigned __int32 counter;
};

static struct DWCAS_Anchor g_anchor = { NULL };

int main(void) {
struct DWCAS_Anchor cmp = g_anchor;
struct DWCAS_Anchor xchg = { NULL, 666 };

printf("Before DWCAS:\nDWCAS_Anchor.ptr = %p\nDWCAS_Anchor.counter ==
%lu\n\n",
g_anchor.ptr, g_anchor.counter);

if (IA32_DWCAS(&g_anchor, &cmp, &xchg)) {
printf("After DWCAS:\nDWCAS_Anchor.ptr = %p\nDWCAS_Anchor.counter ==
%lu\n\n",
g_anchor.ptr, g_anchor.counter);
}

cmp.counter = 666;
xchg.counter = 555;

if (IA32_DWCAS(&g_anchor, &cmp, &xchg)) {
printf("After DWCAS:\nDWCAS_Anchor.ptr = %p\nDWCAS_Anchor.counter ==
%lu\n\n",
g_anchor.ptr, g_anchor.counter);
}

cmp.counter = 123;
xchg.counter = 999;

if (IA32_DWCAS(&g_anchor, &cmp, &xchg)) {
printf("After DWCAS:\nDWCAS_Anchor.ptr = %p\nDWCAS_Anchor.counter ==
%lu\n\n",
g_anchor.ptr, g_anchor.counter);
} else {
printf("After DWCAS FAILED:\nDWCAS_Anchor.ptr = %p\nDWCAS_Anchor.counter
== %lu\n\n",
g_anchor.ptr, g_anchor.counter);
}

return 0;
}
___________________________________________________________________________

No MUTEXS! Java can accomplish this in a 100% non-blocking manner on
platform which support it. It seems like your saying that I have to use
locks and monitors in C# in order to get similar functionality. That's fine.
I was just wondering if I can get C# to create code that does not use
expensive mutexs, I don't think its possible in 100% pure C# managed code. I
can create a library in which C# can get non-blocking DWCAS. But, that's not
ideal.

Anyway, you totally answered my question Peter, Thank you!

:^)

BTW, can you think of a way that C# can do this? I would be GRATEFUL if you
can perhaps HACK something together without resorting to unsafe unmanaged
code!

Help me!

;^/

Chris M. Thomasson · Aug 4, 2009

Chris M. Thomasson said:
Peter Duniho said:

There is no built-in class that I'm aware of which exactly that
functionality.

Why should it have failed? You are passing exactly the same object for
the original and the comparand, so of course they will test equal and the
original will be replaced by your "xchg" object.

The implementation of the "DWCAS()" method seems little more to me than
an inefficient, potentially buggy version of the
Interlock.CompareExchange() method it uses. Inefficient, because the
if() statement after the exchange is redundant; first you check for
object equality, then you check for equality of a member in the two
objects you already verified as equal. The latter test, when executed at
all, will always evaluate as "true".

Click to expand...

[...]

Here is full self-contained compliable example code in C and IA-32 asm
(MSVC 6.0 and higher) that I am trying to accomplish/mimic in C#:

[...]

I know my C# skills are fairly limited as of now, so, please have some
patience? I really do want to learn, and I appreciate all of your valuable
help. Your example code will totally work in principle because of the locks
and monitors and that's great! I was just wondering if C# can do what I want
it to do without using locks ...

=^/

BTW, here is some of my work in C# that you can take a look at in the form
of 100% self contained (crude) example applications:

http://cpt.pastebin.com/f72cc3cc1
(multi-producer/consumer queue by me /w eventcount by me)

http://cpt.pastebin.com/f52dcc0fc
(single-producer/consumer queue by me /w eventcount by me)

http://heev.pastebin.com/f70a73f00
(multi-producer/single-consumer by Dmitriy V'jukov /w eventcount by me)

http://heev.pastebin.com/f270184be
(multi-producer/single-consumer by Dmitriy V'jukov /w eventcount by me /w
node cache by me)

http://heev.pastebin.com/f10399a09
(multi-producer/single-consumer by Dmitriy V'jukov /w eventcount by me /w
node cache by Dmitriy V'jukov)

I have all of these algorithms in C and ASM, the languages I am most
comfortable with, but I wanted to expose my work to different "unknown"
languages. All of the algorithms I provided links to are 100% compatible
with C# as-is, no jumping through hoops; C# is a pleasure to work with!
However, I have another entire class of algorithms which unfortunately
depend on DWCAS (Double Width Compare-and-Swap). I can port these to Java
via `AtomicStampedReference' and I was wondering if I can port them to C#
without using locks and monitors. That's all.

Anyway, I really thank you for you're time Peter! Seriously.

:^)

Chris M. Thomasson · Aug 4, 2009

Ummm, my newsreader totally mangled the code I posted. Here it is in a
format that should retain its integrity:

http://pastebin.com/f5c3e52a7

Sorry about that non-sense!

;^(...

Ben Voigt [C++ MVP] · Aug 4, 2009

Java can accomplish this in a 100% non-blocking manner on platform which
support it.

Click to expand...

Not exactly. The Java implementation trades synchronization in the class
itself for synchronization in the memory manager. The implementation I
showed doesn't have to perform any allocations, nor does it cause work for
the GC as part of its operation. [snip]
Please note that you seem to be asking two different, contradictory
questions here: you initially wanted an implementation of an existing Java
class. Then you said you wanted an implementation that's equivalent to
the ASM code you posted. Those two are not the same, so it's hard to
understand exactly what you do want.

Great, now he'll want a Java solution as well.

And having read some of Chris's thoughts on parallel access to data
structures, I'm pretty comfortable that he fully understands the performance
impact of CAS and wants to use it here for a good reason.

Chris M. Thomasson · Aug 4, 2009

[...]
No MUTEXS!

Click to expand...

Is that a hard design requirement?

Absoultly not.

It seems arbitrary to me.
Interesting.

Besides, the Monitor class in .NET isn't strictly speaking a Windows
mutex (as in the unmanaged CreateMutex() function).

Youâ€™re 100% correct.

Until you've measured and confirmed a genuine performance problem, IMHO
you should not care so much about the specific synchronization mechanism
used.
Agreed.

That said...

Okay. For the record, this `DWCASâ€™ function will be operating on the anchors
of non-blocking data-structures, sometimes these can be pefromance senetive
in the sense that they might be frequently accessed under periods of load.
Think 64-bit word in a 32-bit environment mutated by a 64-bit CAS (e.g.,
Double-Width Compare-And-Swap) instruction. This is not an uncommon (LL/SC
aside for a moment) instruction indeed. Please read this text from an IBM
principles of operation document:

http://publibz.boulder.ibm.com/epubs/pdf/dz9zr002.pdf
(Appendex A/Multi-Programming/Processing Examples/Conditional Swapping
Instructions)

also read:

(Appendex A/Multi-Programming/Processing Examples/Free-Pool Manipulation)

I can do this in Java, I want to be able to do this in C# without using
unmanaged code. Can it be done?

Not exactly. The Java implementation trades synchronization in the class
itself for synchronization in the memory manager. The implementation I
showed doesn't have to perform any allocations, nor does it cause work for
the GC as part of its operation.

What JVM are you referring to; need info? IMVHO, there is no reason why the
JVM cannot use `CMPXCHG8Bâ€™ on a 32-bit IA-32 system. It can also use `CASXâ€™
on a `SPARCâ€™. `LL/SCâ€™ on others, ectâ€¦

Also, careful with that word "blocking". Even an interlocked CPU
instruction can block; it has to, just as atomic writes to memory
locations have to block at some level.

Please note that I am referring to IA-32. The bus-lock can actually be
avoided if the requested data in in the cache of the calling processor.
Otherwise, the bus is indeed locked, resulting in an expensive operation...

It's just that CPU-level synchronization is generally less expensive than
OS-level synchronization mechanisms.

Think about it for a momentâ€¦ OS-level synchronization can have a fast-path
in which only a single atomic RMW instruction and/or memory barrier is
invoked. IMO, itâ€™s a QOI issueâ€¦ Therefore, I say that OS-level
synchronization can be equilivatnt to CPU-level synchronization on the
fast-paths.

Of course, if for some reason you prefer the performance penalty of object
allocation and collection over the performance penalty of a monitor, you
could simply adjust my implementation to match the Java implementation

Why would I not want to use a lock-based solution in order to maintain
portability? I was wondering if C# can support a non-blocking version.
That's all.

This would in fact be similar to what it _looks_ like you were trying to
do originally, but unsuccessfully.

Yes, I tried something in C#, but it did not work, thatâ€™s why I am here.

;^)

Then you can use the Interlocked methods to perform the exchange on the
value/stamp pair object references, rather than locking around the store
values in a single class. This is done as easily in C# as in Java.

Please show me some sample code that can use Interlocked methods that can
emulate the C/IA-32 `DWCASâ€™ , code I posted, and at least keep its
non-blocking properties, I will be very happy.

No, you don't have to. It's just that given your original question, it
wasn't clear at all that was something you are concerned about.

Note: there's no "locks and monitors" in C#.

They are implied.

The "lock()" statement is simply a shorthand way of using the Monitor
class. It's the same thing.
Yes.

Note also: using the Interlocked class is not a performance panacea.

Yes. I try and avoid them all the time.

It still involves hardware level synchronization, and that's every bit as
much a performance risk as monitors, depending on usage.

I am assuming you are referring to a traditional monitor (e.g., condvar +
mutex enpasulation with list of predicates). If so, you fail to consider an
entire class of non-blocking data-structures and usage patterns. Who says I
am going to be calling `DWCASâ€™ in a rapid succession? There are ways to
amortize the frequency of accesses.

If a CPU core stalls somewhere because some other core was doing an
atomic compare-and-exchange, that can cause delays or other problems that
have noticable effects on performance, sometimes dramatically so.

Indeed. Atomic RMW instructions are expensive.

Again, "expensive" compared to what?

Expensive compared to scaleable wait-free algorithms. (`CASâ€™ can sometimes
be wait-free if one passes pre-determined comparands).

Until you've identified synchronization as a specific bottleneck in your
code, trying to avoid a specific synchronization mechanism is a waste of
your valuable time.

I know I can use a mutex/monitor-based version; I can make very good use of
locks. In fact, if I know the properties of the lock, I can use them for
some really neat things. I am not interested in that right now. I want to
know if I can use an atomic RMW operation in C# to mutate a data-structure
that is comprised of a object reference and an integer.

In pure C# you can implement a class that has exactly the same performance
characteristics as the Java class you're talking about.

Please, show me some quick code that can mutate a reference to an object
along with an integer in a non-blocking manner. Thatâ€™s exactly what I am
looking for!

You can use "unsafe" and get a 32-bit pointer to an object instead of a
reference, and then proceed to pack that into an Int64 with your Int32
"stamp". Then you can use the Interlocked.CompareExchange(Int64, Int64,
Int64) overload.

This is not all that portable now is it? Jesting of course! ;^)

However, portability was what attracted me to C# in the first place.
`AtomicStampedReferenceâ€™ is portable in Java. However, I did say I would
accept a hack. Anyway, I already have code that implements `DWCASâ€™ in
assembly language with `Câ€™ calling convention (e.g., â€œextern `Câ€™â€ in C++):

http://webpages.charter.net/appcore/appcore/src/cpu/i686/ac_i686_gcc_asm.html

Itâ€™s the first function. I want to see if I can use Interlocked function in
C# to mimic that without â€œunsafe/unmanagedâ€ code. Thanks allâ€¦.

;^)

Other than that, AFAIK a library (e.g. managed C++/CLI or plain C called
via p/invoke) is the only way you'd be able to use the CMPXCHG8
instruction from C#, and using that is the only way to atomically modify 8
bytes at once.

Damn! Thatâ€™s what I was afraid of. Okay, well I have all the work already
done. So, itâ€™s not that bad. Thanks.

Java has basically the same limitations; you'd have to use an unsafe
mechanism (e.g. JNI) to convert a Java reference to a pointer that can be
packed into a 64-bit word, which can then be compared and exchanged
atomically.

No. `AtomicStampedReferenceâ€™ on a 32-bit Intel system can use `CMPXCHG8Bâ€™ to
mutate the object reference along with the remaining 32-bits.

Please note that you seem to be asking two different, contradictory
questions here: you initially wanted an implementation of an existing Java
class.
Yes.

Then you said you wanted an implementation that's equivalent to the ASM
code you posted.

I posted the ASM code as an example of the overall process. Itâ€™s only one
way to implement what I want. You posted lock-based code that does what I
want. So be it.

Those two are not the same, so it's hard to understand exactly what you
do want.

I want to get `AtomicStampedReferenceâ€™ in C# using Interlocked operations.

[...]
BTW, can you think of a way that C# can do this? I would be GRATEFUL if
you can perhaps HACK something together without resorting to unsafe
unmanaged code!

Click to expand...

Sure. Just port the Java implementation to C#. I wouldn't bother, but
you can do it if you want. Just remember to use "out" arguments instead
of the silly "pass a single-item array" workaround needed in Java.

I guess I am going to have to use unmanaged code in an managed environment.
Not very Kosher not is it?

Oh well. At least I know it can be done.

;^)

Chris M. Thomasson · Aug 4, 2009

[...]

What JVM are you referring to; need info? IMVHO, there is no reason why
the JVM cannot use `CMPXCHG8Bâ€™ on a 32-bit IA-32 system. It can also use
`CASXâ€™ on a `SPARCâ€™. `LL/SCâ€™ on others, ectâ€¦

I misunderstood you. Anyway, allocation overhead in a garbage collected
environment can boil down to a fetch-and-add instruction. In fact, GC
allocation can be completely atomic RMW instruction free, and even memory
barrier free. Think of an allocator based on per-thread heap instances fed
from a global scaleable allocator instance...

[...]

Ben Voigt [C++ MVP] · Aug 4, 2009

Chris M. Thomasson said:
[...]

What JVM are you referring to; need info? IMVHO, there is no reason why
the JVM cannot use `CMPXCHG8Bâ€™ on a 32-bit IA-32 system. It can also use
`CASXâ€™ on a `SPARCâ€™. `LL/SCâ€™ on others, ectâ€¦

Click to expand...

I misunderstood you. Anyway, allocation overhead in a garbage collected
environment can boil down to a fetch-and-add instruction. In fact, GC
allocation can be completely atomic RMW instruction free, and even memory
barrier free. Think of an allocator based on per-thread heap instances fed
from a global scaleable allocator instance...

Do you know of a Java implementation that uses such a scalable allocator?
If not, then Peter is correct that Java's AtomicStampedReference is not
lockless.

Peter Duniho · Aug 4, 2009

[...]

Not exactly. The Java implementation trades synchronization in the
class itself for synchronization in the memory manager. The
implementation I showed doesn't have to perform any allocations, nor
does it cause work for the GC as part of its operation.

Click to expand...

What JVM are you referring to; need info? IMVHO, there is no reason why
the JVM cannot use `CMPXCHG8Bâ€™ on a 32-bit IA-32 system. It can also use
`CASXâ€™ on a `SPARCâ€™. `LL/SCâ€™ on others, ectâ€¦

What evidence do you have that it _does_ use that?

The class you are talking about, AtomicStampedReference, has published
source code (I provided a link earlier) and nowhere in there is there
anything that would take advantage of the CMPXCHG8 instruction. It's pure
Java code.

It certainly is theoretically possible that someone would deliver a JVM
with a framework that implements that class natively, but I'm not aware of
that existing. If you have specific information that the Intel-based JVMs
available all do this, that would be interesting and useful information to
share.

If you don't, then I think you're on the wrong track in your comparison
with the AtomicStampedReference class and the CMPXCHG8 instruction.

[...]

Of course, if for some reason you prefer the performance penalty of
object allocation and collection over the performance penalty of a
monitor, you could simply adjust my implementation to match the Java
implementation

Click to expand...

Why would I not want to use a lock-based solution in order to maintain
portability? I was wondering if C# can support a non-blocking version.
That's all.

You can easily write C# code that does exactly the same thing the Java
code I've shown does. Whether you consider that "non-blocking" or not, I
don't know. I wouldn't, but I admit the term is not well-defined and if
you focus on only one tiny aspect of the overall implementation, it's
true...it could be considered non-blocking.

[...]

Then you can use the Interlocked methods to perform the exchange on
the value/stamp pair object references, rather than locking around the
store values in a single class. This is done as easily in C# as in
Java.

Click to expand...

Please show me some sample code that can use Interlocked methods that
can emulate the C/IA-32 `DWCASâ€™ , code I posted, and at least keep its
non-blocking properties, I will be very happy.

I already provided a link to the Java implementation of
AtomicStampedReference. And that does _emulate_ the use of the CMPXCHG8
instruction. Emphasis on "emulate". It does not actually use that
instruction, and will have performance implications that the use of that
instruction wouldn't. But it can do the job.

[...]
[...]

In pure C# you can implement a class that has exactly the same
performance characteristics as the Java class you're talking about.

Click to expand...

Please, show me some quick code that can mutate a reference to an object
along with an integer in a non-blocking manner. Thatâ€™s exactly what I am
looking for!

Again, that's a confusing request in context. The Java class
AtomicStampedReference does not, according to the published source code,
mutate a reference to an object along with an integer in a non-blocking
manner. It does use an interlocked exchange to swap references of a
container object that includes both your original object reference and the
integer "stamp". But even if we consider a CAS instruction as
"non-blocking", and even if we assume a memory allocator that uses only
CAS to perform allocations, that all ignores the eventual cost of the
garbage collection, which will necessarily involve some kind of blocking.

The Java implementation of AtomicStampedReference is simply not equivalent
to using a simple CMPXCHG8 instruction to perform your operation.

This is not all that portable now is it? Jesting of course! ;^)

Well, joking or not, it's more portable than using the CMPXCHG8
instruction. I agree it's not optimal; but I don't really see how you can
use CMPXCHG8 and be portable at the same time. The instruction is
platform-specific, and it seems to me using it would necessarily preclude
portability.

However, portability was what attracted me to C# in the first place.
`AtomicStampedReferenceâ€™ is portable in Java.

But it's not identical to using CMPXCHG8.

[...]

Java has basically the same limitations; you'd have to use an unsafe
mechanism (e.g. JNI) to convert a Java reference to a pointer that can
be packed into a 64-bit word, which can then be compared and exchanged
atomically.

Click to expand...

No. `AtomicStampedReferenceâ€™ on a 32-bit Intel system can use
`CMPXCHG8Bâ€™ to mutate the object reference along with the remaining
32-bits.

How does it do that? Do you have the source code or some other
documentation for an implementation of AtomicStampedReference that does
that?

[...]

Those two are not the same, so it's hard to understand exactly what
you do want.

Click to expand...

I want to get `AtomicStampedReferenceâ€™ in C# using Interlocked
operations.

That you can do. Just port the Java code I provide a link for.

[...]

if > you can perhaps HACK something together without resorting to
unsafe > unmanaged code!

Click to expand...

Sure. Just port the Java implementation to C#. I wouldn't bother, but
you can do it if you want. Just remember to use "out" arguments
instead of the silly "pass a single-item array" workaround needed in
Java.

Click to expand...

I guess I am going to have to use unmanaged code in an managed
environment.

You can port the Java code without using any unmanaged code.

Pete

Chris M. Thomasson · Aug 4, 2009

Ben Voigt said:
Chris M. Thomasson said:

[...]

Java can accomplish this in a 100% non-blocking manner on platform
which support it.

Not exactly. The Java implementation trades synchronization in the
class itself for synchronization in the memory manager. The
implementation I showed doesn't have to perform any allocations, nor
does it cause work for the GC as part of its operation.

What JVM are you referring to; need info? IMVHO, there is no reason why
the JVM cannot use `CMPXCHG8Bâ€™ on a 32-bit IA-32 system. It can also use
`CASXâ€™ on a `SPARCâ€™. `LL/SCâ€™ on others, ectâ€¦

Click to expand...

I misunderstood you. Anyway, allocation overhead in a garbage collected
environment can boil down to a fetch-and-add instruction. In fact, GC
allocation can be completely atomic RMW instruction free, and even memory
barrier free. Think of an allocator based on per-thread heap instances
fed from a global scaleable allocator instance...

Click to expand...

Do you know of a Java implementation that uses such a scalable allocator?

http://java.sun.com/j2se/reference/whitepapers/memorymanagement_whitepaper.pdf
(page 8)

Is that what your looking for?

Peter Duniho · Aug 4, 2009

[...]
I misunderstood you. Anyway, allocation overhead in a garbage collected
environment can boil down to a fetch-and-add instruction. In fact, GC
allocation can be completely atomic RMW instruction free, and even
memory barrier free. Think of an allocator based on per-thread heap
instances fed from a global scaleable allocator instance...

First, I don't think you can always assume a per-thread heap in .NET.
Second, allocations may require an OS-level allocation to expand the
heap. I don't disagree that it's theoretically possible for an
_allocation_ to be free of synchronization, but I have no reason to
believe that's something you can depend on in .NET.

Beyond all that, every allocation eventually involves a collection. So
even if we assume that every single allocation is always free of
synchronization, the act of allocating memory causes a synchronization
requirement downstream, as well as operations that are more costly than
the synchronization itself would be.

I know that if I had to choose between writing synchronization code and
creating a situation where the GC has to spend more time collecting and
compacting the heap, I would pick synchronization, on the basis of
maintainability alone.

Of course, if there seemed to be some performance problem, I might try
other alternatives, but in that case I would be measuring specific
implementations and choosing one with the specific knowledge of a proven
performance advantage. And I certainly wouldn't bother with that effort
until I knew I had a performance problem in the first place. And I've no
reason to believe without measurement that the synchronization is more
expensive than memory management.

Pete

Chris M. Thomasson · Aug 4, 2009

Peter Duniho said:
[...]

Not exactly. The Java implementation trades synchronization in the
class itself for synchronization in the memory manager. The
implementation I showed doesn't have to perform any allocations, nor
does it cause work for the GC as part of its operation.

Click to expand...

What JVM are you referring to; need info? IMVHO, there is no reason why
the JVM cannot use `CMPXCHG8Bâ€™ on a 32-bit IA-32 system. It can also use
`CASXâ€™ on a `SPARCâ€™. `LL/SCâ€™ on others, ectâ€¦

Click to expand...

What evidence do you have that it _does_ use that?

The class you are talking about, AtomicStampedReference, has published
source code (I provided a link earlier) and nowhere in there is there
anything that would take advantage of the CMPXCHG8 instruction. It's pure
Java code.

[...]

Okay. You have totally answered my question Peter. AFAICT, the source-code
of `AtomicStampedReference' in conjunction with the allocator that Sun's
Hotspot JVM does indicate that it can be non-blocking. I have two choices.
One is use an unmanaged library in a 32-bit environment and invoke the
64-bit overload of the Interlocked functions in order to mutate object
pointer and integer, or port source-code of `AtomicStampedReference' to C#.
Need to think. Anyway, I thank you Peter for all your time and effort.
Seriously!

:^)

Chris M. Thomasson · Aug 4, 2009

Chris M. Thomasson said:
Peter Duniho said:

[...]
Not exactly. The Java implementation trades synchronization in the
class itself for synchronization in the memory manager. The
implementation I showed doesn't have to perform any allocations, nor
does it cause work for the GC as part of its operation.

What JVM are you referring to; need info? IMVHO, there is no reason why
the JVM cannot use `CMPXCHG8Bâ€™ on a 32-bit IA-32 system. It can also use
`CASXâ€™ on a `SPARCâ€™. `LL/SCâ€™ on others, ectâ€¦

Click to expand...

What evidence do you have that it _does_ use that?

The class you are talking about, AtomicStampedReference, has published
source code (I provided a link earlier) and nowhere in there is there
anything that would take advantage of the CMPXCHG8 instruction. It's
pure Java code.

Click to expand...

[...]

Okay. You have totally answered my question Peter. AFAICT, the source-code
of `AtomicStampedReference' in conjunction with the allocator that Sun's
Hotspot JVM does indicate that it can be non-blocking. I have two choices.
One is use an unmanaged library in a 32-bit environment and invoke the
64-bit overload of the Interlocked functions in order to mutate object
pointer and integer, or port source-code of `AtomicStampedReference' to
C#. Need to think. Anyway, I thank you Peter for all your time and effort.
Seriously!

:^)

One more thought... I have several algorithms that depend on `DWCAS' and I
can port them to Java using `AtomicStampedReference', however the way in
which `AtomicStampedReference' is actually implemented, it will be creating
new objects on every damn atomic operation (e.g., weakCompareAndSet,
compareAndSet, attemptStamp). This is going to be unacceptable in some of
those algorithms, even if the act of creating an object frequently his a
fast-path in the allocation and is rendered into a non-blocking operation.
It might not be work porting these algorithms to Java and C#. Anyway, for
you're entertainment, here are some fully-contained example source-code of
high-performance queues in C# in the form of a (crude) little
producer/consumer environment:

http://cpt.pastebin.com/f72cc3cc1
(multi-producer/consumer queue by me /w eventcount by me)

http://cpt.pastebin.com/f52dcc0fc
(single-producer/consumer queue by me /w eventcount by me)

http://heev.pastebin.com/f70a73f00
(multi-producer/single-consumer by Dmitriy V'jukov /w eventcount by me)

http://heev.pastebin.com/f270184be
(multi-producer/single-consumer by Dmitriy V'jukov /w eventcount by me /w
node cache by me)

http://heev.pastebin.com/f10399a09
(multi-producer/single-consumer by Dmitriy V'jukov /w eventcount by me /w
node cache by Dmitriy V'jukov)

Enjoy!

BTW, if you have any questions on any of the algorithms above, feel free to
interrogate me!

;^)

Final thought, I see no reason why a JVM for IA-32 cannot be constructed to
make use of `CMPXCHG8B'. IMVHO, its a QOI issue... Whether or not this is
the case in actual implementations is perhaps besides the point?

Peter Duniho · Aug 4, 2009

One more thought... I have several algorithms that depend on `DWCAS' and
I can port them to Java using `AtomicStampedReference', however the way
in which `AtomicStampedReference' is actually implemented, it will be
creating new objects on every damn atomic operation (e.g.,
weakCompareAndSet, compareAndSet, attemptStamp).

That is what I've been saying all along.

This is going to be unacceptable in some of those algorithms, even if
the act of creating an object frequently his a fast-path in the
allocation and is rendered into a non-blocking operation. It might not
be work porting these algorithms to Java and C#.

It might or it might not. As I mentioned, I personally wouldn't use an
approach that uses allocations to simulate CMPXCHG8, but that doesn't mean
that allocations are actually a problem (if you don't measure it, there's
no way to know) nor that there's no performant way to implement your
desired design in Java or C#. It could be that plain old synchronization
is fine, or it could be that you need to learn more about high-performance
synchronization objects in Java or .NET (for example, if you're dealing
with producer/consumer queues, the ReaderWriterLockSlim class might be
useful).

Part of the problem here is that your question has been solely about
implementing a specific synchronization technique, whereas most likely
what you really have is a higher-level problem that needs solving and
which in some specific implementation you're familiar with uses some
specific synchronization technique that may or may not be applicable in a
managed environment like Java or .NET. Just because a particular
implementation doesn't translate directly into a different environment,
that doesn't mean that the environment doesn't provide a completely
different way to arrive at the same results.

If you ask a different question, you might get a more useful answer.

(Not necessarily from me, but perhaps from someone who has more first-hand
experience writing high-throughput .NET applications; I dabble, but have
had no real professional need to worry about that sort of thing, so my
knowledge is limited to what I happen to come across in my wanderings)

[...]
Final thought, I see no reason why a JVM for IA-32 cannot be constructed
to make use of `CMPXCHG8B'. IMVHO, its a QOI issue... Whether or not
this is the case in actual implementations is perhaps besides the point?

Besides what point?

The implementors or .NET or Java can easily write whatever they need to in
order to support whatever features they really want to support. Classes
in the Java library and .NET framework need not be implemented as Java or
some .NET managed code; they can be written natively, and in some very
specific cases they are.

But, every native implementation requires porting to other platforms for
portability. For .NET this isn't a huge problem, but it's still a
relevant one. And for Java, which has a large number of platforms it
supports, it can be an enormous task. Also, because of the "write once,
run everywhere" mentality (not a bad thing, IMHO), often the decision
comes down to the least-common-denominator. It could be catastrophic to
provide a high-performance implementation of some class on some platforms
but not others, if programmers use performance on one platform to try to
predict (even if just roughly) the performance of their application on
other platforms.

Either there would be a very large performance difference, in which case
it could be quite a struggle for Java developers to provide consistent
experiences across all supported platforms, or there wouldn't be a large
performance difference in which case providing custom native
implementations for each platform would be a waste of effort.

Pete

Chris M. Thomasson · Aug 4, 2009

Peter Duniho said:
[...]
Okay. You have totally answered my question Peter. AFAICT, the
source-code of `AtomicStampedReference' in conjunction with the
allocator that Sun's Hotspot JVM does indicate that it can be
non-blocking.

Click to expand...

That's an inaccurate conclusion. If you care at all about performance
(and it seems you do, very much), you cannot ignore the broader picture.
Even in the most highly-optimized GC that Java provides, you cannot avoid
synchronization and the overhead of the GC. All you can do is change
where and when it happens.

I can do everything I need in C# if I use an offset trick common to
non-blocking algorithms. There is a caveat, it usually puts a upper bound on
the numbers of nodes that can be "live" at any one time. Here is quick
sketch in C pseudo-code:
________________________________________________________________________
#define DEPTH 10U
#define END 666U
#define INDEX(p) ((p) - (g_nodes))
#define TOPTR(p) ((p) != END ? ((g_nodes) + (p)) : NULL)
#define TOIDX(p) ((p) ? INDEX(p) : END)

struct node {
struct node* next;
};

static struct node g_nodes[DEPTH] = { NULL };

extern struct node* node_cache_pop();
extern void node_cache_push(struct node*);

struct CAS64 {
uint32_t aba;
uint32_t idx;
};

static struct CAS64 g_anchor = { 0, END };

void push(struct node* n) {
uint32_t idx = INDEX(n);
uint32_t cmp = g_anchor.idx;
do {
n->next = TOPTR(cmp);
} while (! CAS32(&g_anchor.idx, &cmp, idx));
}

struct node* flush() {
return TOPTR(SWAP32(&g_anchor.idx, END));
}

struct node* pop() {
struct node* n;
struct CAS64 xchg, cmp;
cmp.aba = g_anchor.aba;
cmp.idx = g_anchor.idx;
do {
n = TOPTR(cmp.idx);
if (! n) return NULL;
struct node* nx = n->next;
xchg.aba = cmp.aba + 1;
xchg.idx = TOIDX(nx);
} while (! CAS64(&g_anchor, &cmp, &xchg));
return n;
}
________________________________________________________________________

Please note I am assuming that CAS functions above will automatically update
the comparand on failure.

There, no DWCAS! ;^)

Something like that would work fine in C# using Interlocked 64-bit
overloads...

I have not seen any recent articles discussing the .NET GC implementation;
the references I'm familiar with are from the .NET 2.0 era or before. But
I would be surprised if the GC in .NET has not kept pace with technical
advancements found in the Java GC. I think it's safe to say that if there
are performance characteristics in the Java GC that you find desirable,
similar characteristics are available in .NET.

So if the Java emulation is suitable for you, than the same emulation in
C# should be as well.
Agreed.

You should do all three (including the implementation I showed), measure
the performance of each, and then decide. There is absolutely no way to
make a determination as to which approach will perform best without
measuring them in the real world.

Agreed. I am also going to code the pseudo-code above in C# and see how it
compares to Java emulation of AtomicStampedReference.

Chris M. Thomasson · Aug 4, 2009

Peter Duniho said:
That is what I've been saying all along.

It might or it might not. As I mentioned, I personally wouldn't use an
approach that uses allocations to simulate CMPXCHG8, but that doesn't mean
that allocations are actually a problem (if you don't measure it, there's
no way to know) nor that there's no performant way to implement your
desired design in Java or C#. It could be that plain old synchronization
is fine, or it could be that you need to learn more about high-performance
synchronization objects in Java or .NET (for example, if you're dealing
with producer/consumer queues, the ReaderWriterLockSlim class might be
useful).

Have you examined the queue algorithms I posted? I see no real need for a
lock wrt implementing simple producer/consumer pattern except on the
slow-path. Take a look at my eventcount object and how its used. The
primitive basically allows one to apply conditional waiting to existing
non-blocking algorithms, or even lock-based algorithms.

No time to make a further response now, I need to go out to lunch.

Peter Duniho · Aug 4, 2009

Have you examined the queue algorithms I posted? I see no real need for
a lock wrt implementing simple producer/consumer pattern except on the
slow-path. Take a look at my eventcount object and how its used. The
primitive basically allows one to apply conditional waiting to existing
non-blocking algorithms, or even lock-based algorithms.

If you haven't already, you should read Joe Duffy's blog:
http://www.bluebytesoftware.com/blog/Default.aspx

All of it, but especially his discussions on lock-free algorithms.

The short answer: lock-free is much harder to implement and maintain, and
does not always improve performance.

Whether you see a "real need for a lock", is up to you. The word "need"
is subjective, lacking precision required for a technical discussion. But
one's "need" is not necessarily considered strictly in terms of what's
technically feasible; you have to consider "need" in terms of actual
performance and maintenance costs as well.

It may be that you "need" a lock simply because going without doesn't
produce enough of a performance improvement to justify the maintenance
costs. It's true that you can write lock-free algorithms, and can even do
them in pure C# (Joe has examples on his blog site), but just because you
can doesn't necessarily mean you don't "need" a lock. There are other
reasons you may want to consider using one.

For what it's worth, I understand where you're coming from. From the late
90's when it first appeared, I was dismissive of .NET on the whole,
because of garbage collection, JITting, etc. I didn't like the idea of
turning over aspects of my code such as performance and correctness to an
external library.

But it turns out that some smart people have been involved in it, and
about five years ago I grudgingly "came around". I worry less about
performance these days, because .NET does a "good enough" job in that
area, and because using the framework allows me to spend less time dealing
with issues not directly connected to whatever coding problem I'm actually
trying to solve.

There will always be exceptions, of course. Sometimes nothing but the
fastest possible execution is good enough. But even in that case, it's
not a foregone conclusion that using .NET's built-in features won't
provide the fastest possible execution (you have to measure to know), and
those situations are extremely rare. In many cases, the best you can do
is squeeze an extra 5-10% performance out of your program by being clever,
which is often not enough of an improvement to justify the potential for
added bugs and maintenance costs.

Pete

Ben Voigt [C++ MVP] · Aug 5, 2009

Peter said:
If you haven't already, you should read Joe Duffy's blog:
http://www.bluebytesoftware.com/blog/Default.aspx

All of it, but especially his discussions on lock-free algorithms.

The short answer: lock-free is much harder to implement and maintain,
and does not always improve performance.

Both these claims are true in isolation, but I don't think the combination
is true. I contend that high performance lock-based synchronization is at
least as hard to implement and maintain as high performance lock-free.

In either case, the ease of use is far more important than the ease of
implementation, and while generics have helped considerably I think that
..NET containers still fall far short of making high performance
synchronization accessible to the majority of programmers.

Want to switch from J# to C# - need help	4	Apr 29, 2005
Java 1.5.0_07 released	1	May 25, 2006
Download the JAVA , .NET and SQL Server interview with answers	2	Sep 14, 2006
CAS - common access security causes add-ins deployment problems?	1	Aug 14, 2007
virus/spyware found (i did, not the program) error sending report	2	Apr 28, 2005
"WindUpdates (Browser Plugin and Media Access" pain	0	Jul 14, 2005
CSharp Coding Standards	18	Mar 29, 2007
Microsoft .NET	1	Feb 11, 2004

atomic operations...

Chris M. Thomasson

Chris M. Thomasson

Peter Duniho

Chris M. Thomasson

Chris M. Thomasson

Chris M. Thomasson

Ben Voigt [C++ MVP]

Chris M. Thomasson

Chris M. Thomasson

Ben Voigt [C++ MVP]

Peter Duniho

Chris M. Thomasson

Peter Duniho

Chris M. Thomasson

Chris M. Thomasson

Peter Duniho

Chris M. Thomasson

Chris M. Thomasson

Peter Duniho

Ben Voigt [C++ MVP]

Ask a Question

Similar Threads