When is "volatile" used instead of "lock" ?


Jon Skeet [C# MVP]

I guess we'll just have to disagree on a few things, for the reasons I've
already stated. I don't see much point in going back and forth saying the
same things...

I should say (and I've only just remembered) that a few years ago I
was unsure where the safety came from, and I mailed someone (Vance
Morrison? Chris Brumme?) who gave me the explanation I've been giving
you.
With regard to runtime volatile read/writes and acquire/release semantics of
Monitor.Enter and Monitor.Exit we can agree.

I don't agree that anything specified in either 334 or 335 covers all levels
of potential compile-time class member JIT/IL compiler optimizations.

It specifies how the system as a whole must behave: given a certain
piece of IL, there are
I don't agree that "int number; void UpdateNumber(){lock(locker){
number++;}}" is equally as safe as "volatile int number; void UpdateNumber(){
number++; }"

I agree - the version without the lock is *unsafe*. Two threads could
both read, then both increment, then both store in the latter case.
With the lock, everything is guaranteed to work.
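To make that concrete, here's a minimal sketch (my own field and method names, not anything from your snippet):

class Counter
{
    private readonly object locker = new object();
    private int number;

    // Unsafe: two threads can both read the same value, both increment it,
    // and both store it back - one increment is lost.
    public void UnsafeIncrement()
    {
        number++;
    }

    // Safe: the lock serializes the read-modify-write, and the
    // Monitor.Enter/Exit pair that 'lock' expands to provides the
    // acquire/release semantics that make the new value visible.
    public void SafeIncrement()
    {
        lock (locker)
        {
            number++;
        }
    }
}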
With the following Monitor.Enter/Exit IL, for example:

...what part of that IL tells the JIT/IL compiler that Tester.number
specifically should be treated differently--where lines commented // * are
the only lines distinct to usage of Monitor.Enter/Exit?

The fact that it knows Monitor.Enter is called, so the load (in the
logical memory model) cannot occur before Monitor.Enter. Likewise it
knows that Monitor.Exit is called, so the store can't occur after
Monitor.Exit. If it calls another method which *might* call
Monitor.Enter/Exit, it likewise can't move the reads/writes as that
would violate the spec.
...where an IL compiler is given ample amounts of information that
Tester.number should be treated differently.

It's being given ample
I don't think it's safe, readable, or future-friendly to utilize syntax
strictly for its secondary consequences (using Monitor.Enter/Exit not for
synchronization but for acquire/release semantics, as in the above line
where modification of an int is already atomic and "synchronization" is
irrelevant), even if it were effectively identical to another syntax. Yes,
if you've got a non-atomic invariant you still have to synchronize (with
lock, etc.)... but volatility is different and needs to be accounted for
equally as much as thread-safety.

Again you're treating atomicity as almost interchangeable with
volatility, when they're certainly not. Synchronization is certainly
relevant whether or not writes are atomic. Atomicity just states that
you won't see a "half way" state; volatility state that you will see
the "most recent" value. That's a huge difference.

The volatility is certainly not just a "secondary consequence" - it's
vital to the usefulness of locking.

Consider a type which isn't thread-aware - in other words, nothing is
marked as volatile, but it also has no thread-affinity. That should be
the most common kind of type, IMO. You can't retrospectively mark the
fields as being volatile, but you *do* want to ensure that if you use
objects of the type carefully (i.e. always within a consistent lock)
you won't get any unexpected behaviour. Due to the guarantees of
locking, you're safe. Otherwise, you wouldn't be. Without that
guarantee, you'd be entirely at the mercy of type authors for *all*
types that *might* be used in a multi-threaded environment making all
their fields volatile.

Further evidence that it's not just a secondary effect, but one which
certainly *can* be relied on: there's no other thread-safe way of
using doubles. They *can't* be marked as volatile - do you really
believe that MS would build .NET in such a way that wouldn't let you
write correct code to guarantee that you see the most recent value of
a double, rather than one cached in a register somewhere?
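For instance, the only portable idiom looks something like this sketch (the type and member names are mine; the commented-out line is the one the compiler rejects):

class Sensor
{
    private readonly object locker = new object();
    // private volatile double lastReading;  // error: volatile cannot be applied to double
    private double lastReading;

    public void Update(double value)
    {
        lock (locker) { lastReading = value; }  // release on exit publishes the write
    }

    public double Read()
    {
        lock (locker) { return lastReading; }   // acquire on entry prevents a stale read
    }
}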

This *is* guaranteed, it's the normal way of working in the framework
(as Willy said, look for volatile fields in the framework itself) and
it's perfectly fine to rely on it.

Jon
 

Jon Skeet [C# MVP]

It specifies how the system as a whole must behave: given a certain
piece of IL, there are

It specifies how the system as a whole must behave: given a certain
piece of IL, there are valid behaviours and invalid behaviours. If you
can observe that a variable has been read before a lock has been
acquired and that value has then been used (without rereading) after
the lock has been acquired, then the CLR has a bug, pure and simple.
It violates the spec in a pretty clear-cut manner.

Jon
 

Ben Voigt [C++ MVP]

Consider a type which isn't thread-aware - in other words, nothing is
marked as volatile, but it also has no thread-affinity. That should be
the most common kind of type, IMO. You can't retrospectively mark the
fields as being volatile, but you *do* want to ensure that if you use

You don't need to modify the type definition, you would need a volatile
variable of that type.
 

Guest

It specifies how the system as a whole must behave: given a certain
piece of IL, there are valid behaviours and invalid behaviours. If you
can observe that a variable has been read before a lock has been
acquired and that value has then been used (without rereading) after
the lock has been acquired, then the CLR has a bug, pure and simple.
It violates the spec in a pretty clear-cut manner.

That's not the same thing as saying use of Monitor.Enter and Monitor.Exit
are what are used to maintain that behaviour.

In 335, section 12.6.5's "[calling Monitor.Enter]...shall implicitly
perform a volatile read operation..." says to me that one volatile operation
is performed. And "[calling Monitor.Exit]...shall implicitly perform a
volatile write operation." A write to what? As in this snippet:
Monitor.Enter(this.locker);
Trace.WriteLine(this.number);
Monitor.Exit(this.locker);

It only casually mentions "See [section] 12.6.7", which discusses acquire
and release semantics in the context of the volatile. prefix (assuming the C#
volatile keyword is what causes generation of this prefix). 12.6.7 only
mentions "the read" or "the write"; it does not mention anything about a set
or block of reads/writes. I think you've made quite a leap getting to: code
between Monitor.Enter and Monitor.Exit has volatility guarantees.

Writing a sample "that works" is meaningless to me. I've dealt with
thousands of snippets of code "that worked" in certain circumstances (usually
resulting in me fixing them to "really work").

You're free to interpret the spec any way you want, and if you've gotten
information from Chris or Vance, you've got their interpretation of the spec.
and, best case, you've got information specific to Microsoft's JIT/IL
Compilers.

Based upon the spec, I *know* that this is safe code:
public volatile int number;
public void DoSomething() {
    this.number = 1;
}

This is equally as safe:
public volatile int number;
public void DoSomething() {
    lock(locker) {
        this.number = 1;
    }
}

I think it's open to interpretation of the spec whether this is safe:
public int number;
public void DoSomething() {
    lock(locker) {
        this.number = 1;
    }
}

...it might be safe in Microsoft's implementations; but that's not open
information and I don't think it's due to Monitor.Enter/Monitor.Exit.

I don't see what the issue with volatile is, if you're not using "volatile"
for synchronization. Worst case with this:
public volatile int number;
public void DoSomething() {
    this.number = 1;
}
you've explicitly stated your volatility usage/expectation: more readable,
makes no assumptions...

Whereas:
public int number;
public void DoSomething() {
    lock(locker) {
        this.number = 1;
    }
}

...best case, this isn't as readable because it relies on implicit
volatility side-effects.

What happens with the following code?
internal class Tester {
    private Object locker = new Object();
    private Random random = new Random();
    public int number;

    public Tester()
    {
        DoWork(false);
    }

    public void UpdateNumber() {
        Monitor.Enter(locker);
        DoWork(true);
    }

    private void DoWork(Boolean doOut) {
        this.number = random.Next();
        if(doOut)
        {
            switch(random.Next(2))
            {
                case 0:
                    Out1();
                    break;
                case 1:
                    Out2();
                    break;
            }
        }
    }

    private void Out1() {
        Monitor.Exit(this.locker);
    }

    private void Out2() {
        Monitor.Exit(this.locker);
    }
}

...clearly there isn't enough information merely from the existence of
Monitor.Enter and Monitor.Exit to maintain those guarantees.

Again you're treating atomicity as almost interchangeable with
volatility,
<snip>
No, I'm not. I said you don't need to synchronize an atomic invariant but
you still need to account for its volatility (by declaring it volatile). I
didn't say volatility was a secondary concern, I said it needs to be
accounted for equally. I was implying that using the "lock" keyword is not
as clear in terms of volatility assumptions/needs as is the "volatile"
keyword. If I read some code that uses "lock", I can't assume the author
did that for volatility reasons and not just synchronization reasons; whereas
if she had put "volatile" on a field, I know for sure why she put that there.
This *is* guaranteed, it's the normal way of working in the framework
(as Willy said, look for volatile fields in the framework itself)

Which ones? Like Hashtable.version or StringBuilder.m_StringValue?
 

Jon Skeet [C# MVP]

Ben Voigt said:
You don't need to modify the type definition, you would need a volatile
variable of that type.

Just because the variable itself is volatile doesn't mean every access
would be volatile in the appropriate way. Consider:

public class Foo
{
    public string bar; // No, I'd never use a public field really...
}

public class AnotherClass
{
    volatile Foo x;

    void SomeMethod()
    {
        x.bar = "hello";
    }
}

Now, you've got a volatile *read* but not a volatile *write* - and a
volatile *write* is what you really need to make sure that the write is
visible to other threads.
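
A sketch of the sort of thing you'd need instead - take a lock around the access, so both the read of x and the write to x.bar get the right semantics (reusing the names above):

public class AnotherClass
{
    private readonly object locker = new object();
    Foo x = new Foo();

    void SomeMethod()
    {
        lock (locker)
        {
            // This write can't move (logically) past the Monitor.Exit,
            // so any thread that later takes the same lock will see it.
            x.bar = "hello";
        }
    }
}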
 

Jon Skeet [C# MVP]

Peter Ritchie said:
That's not the same thing as saying use of Monitor.Enter and Monitor.Exit
are what are used to maintain that behaviour.

Well, without that guarantee for Monitor.Enter/Monitor.Exit I don't
believe it would be possible to write thread-safe code.
In 335, section 12.6.5's "[calling Monitor.Enter]...shall implicitly
perform a volatile read operation..." says to me that one volatile operation
is performed. And "[calling Monitor.Exit]...shall implicitly perform a
volatile write operation." A write to what? As in this snippet:
Monitor.Enter(this.locker);
Trace.WriteLine(this.number);
Monitor.Exit(this.locker);

It doesn't matter what the volatile write is to - it's the location in
the CIL that matters. No other writes can be moved (logically) past
that write, no matter what they're writing to.
It only casually mentions "See [section] 12.6.7", which discusses acquire
and release semantics in the context of the volatile. prefix (assuming the C#
volatile keyword is what causes generation of this prefix).

I don't see what's "casual" about it, nor why you should believe that
12.6.7 should only apply to instructions with the "volatile." prefix.
The section starts off by mentioning the prefix, but then talks in
terms of volatile reads and volatile writes - which is the same terms
as 12.6.5 talks in.
12.6.7 only
mentions "the read" or "the write"; it does not mention anything about a set
or block of reads/writes. I think you've made quite a leap getting to: code
between Monitor.Enter and Monitor.Exit has volatility guarantees.

I really, really haven't. I think the problem is the one I talk about
above - you're assuming that *what* is written to matters, rather than
just the location of a volatile write in the CIL stream. Look at the
guarantee provided by the spec:

<quote>
A volatile read has "acquire semantics" meaning that the read is
guaranteed to occur prior to any references to memory that occur after
the read instruction in the CIL instruction sequence. A volatile write
has "release semantics" meaning that the write is guaranteed to happen
after any memory references prior to the write instruction in the CIL
instruction sequence.
</quote>

Where does that say anything about it being dependent on what is being
written or what is being read? It just talks about reads and writes
being moved in terms of their position in the CIL sequence.

So, no write that occurs before the call to Monitor.Exit in the IL can
be moved beyond the call to Monitor.Exit in the memory model, and no
read that occurs after Monitor.Enter in the IL can be moved to earlier
than Monitor.Enter in the memory model. That's all that's required for
thread safety.
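
As a sketch in C# terms ('shared' and 'locker' are placeholder names of mine):

class Example
{
    private readonly object locker = new object();
    private int shared;

    public void Increment()
    {
        lock (locker)          // Monitor.Enter: implicit volatile read (acquire);
        {                      // the read of 'shared' can't be moved above it
            int tmp = shared;  // guaranteed not to be a stale, pre-Enter value
            shared = tmp + 1;  // this store can't be moved below the Exit
        }                      // Monitor.Exit: implicit volatile write (release)
    }
}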
Writing a sample "that works" is meaningless to me. I've dealt with
thousands of snippets of code "that worked" in certain circumstances (usually
resulting in me fixing them to "really work").

I'm not talking about certain circumstances - I'm talking about
*guarantees* provided by the CLI spec.

I'm saying that I can write code which doesn't use volatile but which
is *guaranteed* to work. I believe you won't be able to provide any
example of how it could fail without the CLI spec itself being
violated.
You're free to interpret the spec any way you want, and if you've gotten
information from Chris or Vance, you've got their interpretation of the spec.
and, best case, you've got information specific to Microsoft's JIT/IL
Compilers.

Well, I've got information specific to the .NET 2.0 memory model (which
is stronger than the CLI specified memory model) elsewhere.

However, I feel pretty comfortable having the interpretation of experts
who possibly contributed to the spec, or at least have direct contact
with those who wrote it.
Based upon the spec, I *know* that this is safe code:
public volatile int number;
public void DoSomething() {
    this.number = 1;
}

This is equally as safe:
public volatile int number;
public void DoSomething() {
    lock(locker) {
        this.number = 1;
    }
}

I think it's open to interpretation of the spec whether this is safe:
public int number;
public void DoSomething() {
    lock(locker) {
        this.number = 1;
    }
}

Well, this is why I suggested that I post a complete program - then you
could suggest ways in which it could go wrong, and I think I'd be able
to defend it in fairly clear-cut terms.
...it might be safe in Microsoft's implementations; but that's not open
information and I don't think it's due to Monitor.Enter/Monitor.Exit.

I *hope* we won't just have to agree to disagree, but I realise that
may be the outcome :(
I don't see what the issue with volatile is, if you're not using "volatile"
for synchronization. Worst case with this:
public volatile int number;
public void DoSomething() {
    this.number = 1;
}
you've explicitly stated your volatility usage/expectation: more readable,
makes no assumptions...

It implies that without volatility you've got problems - which you
haven't (provided you use locking correctly). This means you can use a
single way of working for *all* types, regardless of whether you can
use the volatile modifier on them.
Whereas:
public int number;
public void DoSomething() {
    lock(locker) {
        this.number = 1;
    }
}

...best case, this isn't as readable because it relies on implicit
volatility side-effects.

If you're not used to that being the idiom, you're right. However, if
I'm writing thread-safe code (most types don't need to be thread-safe)
I document what lock any shared data comes under. I can rarely get away
with a single operation anyway.
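
For instance, a sketch of the sort of thing I mean (the type is invented for illustration):

public class Account
{
    // All access to 'balance' must be made while holding 'balanceLock' -
    // documented here and enforced by code review, not by the compiler.
    private readonly object balanceLock = new object();
    private decimal balance;

    public void Deposit(decimal amount)
    {
        lock (balanceLock) { balance += amount; }
    }

    public decimal Balance
    {
        get { lock (balanceLock) { return balance; } }
    }
}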

Consider the simple change from this:

this.number = 1;

to this:

this.number++;

With volatile, your code is now broken - and it's not obvious, and
probably won't show up in testing. With lock, it's not broken.
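
A quick sketch of the breakage (my own test harness, not code from this thread) - with a volatile field and no lock, increments get lost:

using System;
using System.Threading;

class VolatileRace
{
    private volatile int number;

    public void Run()
    {
        ThreadStart work = delegate
        {
            for (int i = 0; i < 100000; i++)
            {
                number++; // read-modify-write: not atomic, volatile or not
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.Start(); t2.Start();
        t1.Join(); t2.Join();
        Console.WriteLine(number); // frequently prints less than 200000
    }
}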
What happens with the following code?
internal class Tester {
    private Object locker = new Object();
    private Random random = new Random();
    public int number;

    public Tester()
    {
        DoWork(false);
    }

    public void UpdateNumber() {
        Monitor.Enter(locker);
        DoWork(true);
    }

What happens here is that I don't let this method go through code
review. There have to be *very* good reasons not to use lock{}, and in
those cases there would almost always still be a try/finally.

I wouldn't consider using volatile just to avoid the possibility of
code like this (which I've never seen in production, btw).

    private void DoWork(Boolean doOut) {
        this.number = random.Next();
        if(doOut)
        {
            switch(random.Next(2))
            {
                case 0:
                    Out1();
                    break;
                case 1:
                    Out2();
                    break;
            }
        }
    }

    private void Out1() {
        Monitor.Exit(this.locker);
    }

    private void Out2() {
        Monitor.Exit(this.locker);
    }
}

...clearly there isn't enough information merely from the existence of
Monitor.Enter and Monitor.Exit to maintain those guarantees.

It's the other way round - the JIT compiler doesn't have enough
information to perform certain optimisations, simply because it can't
know whether or not Monitor.Exit will be called.

Assuming the CLR follows the spec, it can't move the write to number to
after the call to random.Next() - because that call to random.Next()
may involve releasing a lock, and it may involve a write.

Now, I agree that it really limits the scope of optimisation for the
JIT - but that's what the CLI spec says.
<snip>
No, I'm not. I said you don't need to synchronize an atomic invariant but
you still need to account for its volatility (by declaring it volatile). I
didn't say volatility was a secondary concern, I said it needs to be
accounted for equally. I was implying that using the "lock" keyword is not
as clear in terms of volatility assumptions/needs as is the "volatile"
keyword. If I read some code that uses "lock", I can't assume the author
did that for volatility reasons and not just synchronization reasons; whereas
if she had put "volatile" on a field, I know for sure why she put that there.

I use lock when I'm going to use shared data. When I use shared data, I
want to make sure I don't ignore previous changes - hence it needs to
be volatile.

Volatility is a natural consequence of wanting exclusive access to a
shared variable - which is why exactly the same strategy works in Java,
by the way (which has a slightly different memory model). Without the
guarantees given by the CLI spec, having a lock would be pretty much
useless.
Which ones? Like Hashtable.version or StringBuilder.m_StringValue?

Yup, there are a few - but I believe there are far more places which
use the natural (IMO) way of sharing data via exclusive access, and
taking account the volatility that naturally provides.
 

Willy Denoyette [MVP]

Jon Skeet said:
<snip>
This *is* guaranteed, it's the normal way of working in the framework
(as Willy said, look for volatile fields in the framework itself) and
it's perfectly fine to rely on it.


I see that my remark about the FCL was too strongly worded; I didn't mean to
say that "volatile" fields are not used at all in the FCL. Sure, they are
used, but only in a context where the author wanted to guarantee that a
field (most often a bool) access had acquire/release semantics and would not
be reordered, not in the context of a locked region. Also note that a large
part of the FCL was written against v1.0 (targeting X86 only), at a time
when there was no VolatileRead and long before the Interlocked class was
introduced.
The latest bits in the FCL more often use Interlocked and VolatileXXX
operations than the volatile modifier.
Also note that volatile does not imply a memory barrier, while lock,
Interlocked ops. and VolatileXXX do effectively imply a MemoryBarrier. The
way the barrier is implemented is platform specific: on X86 and X64 a full
barrier is raised, while on IA64 it depends on the operation.
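
For illustration, a small sketch of those alternatives (the member names are mine):

using System.Threading;

class Worker
{
    private int counter;
    private int done;   // plain int flag, no 'volatile' modifier

    public void Finish()
    {
        // Atomic increment with a full fence - no lock or volatile needed.
        Interlocked.Increment(ref counter);

        // Explicit volatile write, with the barrier semantics described above.
        Thread.VolatileWrite(ref done, 1);
    }

    public bool IsDone()
    {
        // Explicit volatile read - won't be satisfied from a stale cached value.
        return Thread.VolatileRead(ref done) == 1;
    }
}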


Willy.
 

Guest

Jon Skeet said:
I'm saying that I can write code which doesn't use volatile but which
is *guaranteed* to work. I believe you won't be able to provide any
example of how it could fail without the CLI spec itself being
violated.
Actually, I'm having a hard time getting the JIT to optimize *any* member
fields, even with a lack of locking. Local variables seem to be optimized
into registers easily, but not member fields...

If I could get an optimization of a member field I believe I would be able
to show an example.

For example:
private Random random = new Random();
public int Method()
{
    int result = 0;
    for(int i = 0; i < this.random.Next(); ++i)
    {
        result += 10;
    }
    return result;
}

ebx is used for result (and edi for i) while in the loop; but with:
private Random random = new Random();
private int number;
public int Method()
{
    for(int i = 0; i < this.random.Next(); ++i)
    {
        this.number += 10;
    }
    return this.number;
}

...number is always accessed directly and never optimized into a register. I
think I'd find the same thing with re-ordering.
 

Jon Skeet [C# MVP]

Actually, I'm having a hard time getting the JIT to optimize *any* member
fields, even with a lack of locking. Local variables seem to be optimized
into registers easily, but not member fields...

I can well believe that, just as an easy way of fulfilling the spec.
If I could get an optimization of a member field I believe I would be able
to show an example.

Well, rather than arguing from a particular implementation (which, as
you've said before, may be rather stricter than the spec requires) I'd
be perfectly happy arguing from the spec itself. Then at least if there
are precise examples where I interpret the spec to say one thing and
you interpret it a different way, we'll know exactly where our
disagreement is.

<snip code>
 

Willy Denoyette [MVP]

Peter Ritchie said:
Actually, I'm having a hard time getting the JIT to optimize *any* member
fields, even with a lack of locking. Local variables seem to be optimized
into registers easily, but not member fields...

If I could get an optimization of a member field I believe I would be able
to show an example.

For example:
private Random random = new Random();
public int Method()
{
    int result = 0;
    for(int i = 0; i < this.random.Next(); ++i)
    {
        result += 10;
    }
    return result;
}

ebx is used for result (and edi for i) while in the loop; but with:
private Random random = new Random();
private int number;
public int Method()
{
    for(int i = 0; i < this.random.Next(); ++i)
    {
        this.number += 10;
    }
    return this.number;
}

...number is always accessed directly and never optimized into a register.
I think I'd find the same thing with re-ordering.




In your sample, the member field has to be read from the object location in
the GC heap, and after the addition it has to be written back to the same
location.
The write "this.number +=.... "must be a "store acquire" to fulfill the
rules imposed by the CLR memory model. Note that this model derives from the
ECMA model!

The assembly code of the core part of the loop looks something like this
(your mileage may vary):

mov eax,dword ptr [ebp-10h]
add dword ptr [eax+8],0Ah

here the object reference of the current instance (this) is loaded from
[ebp-10h] and stored in eax, after which 0Ah is added to the location of the
'number' field [eax+8].

Question is what else do you expect to optimize any further, and what are
you expecting to illustrate?

Willy.
 

Ben Voigt [C++ MVP]

Willy Denoyette said:
<snip>
Question is what else do you expect to optimize any further, and what are
you expecting to illustrate?

I know what just *did* get illustrated -- that the .NET JIT doesn't optimize
nearly as well as the C++ optimizing compiler.
 

Guest

Willy, I'm not following where you are going with your comment. As I've
said, my example should have been number = 10 (or something similar) to
capture CLI atomicity guarantees; but either operation is optimized to a
single instruction on x86:

x86 for number += 10:
int count = random.Next();
00000000 56 push esi
00000001 8B F1 mov esi,ecx
00000003 8B 4E 04 mov ecx,dword ptr [esi+4]
00000006 8B 01 mov eax,dword ptr [ecx]
00000008 FF 50 3C call dword ptr [eax+3Ch]
0000000b 8B D0 mov edx,eax
for(int i = 0; i > count; ++i)
0000000d 33 C0 xor eax,eax
0000000f 85 D2 test edx,edx
00000011 7D 0B jge 0000001E
{
number += 10;
00000013 83 46 08 0A add dword ptr [esi+8],0Ah
for(int i = 0; i > count; ++i)
00000017 83 C0 01 add eax,1
0000001a 3B C2 cmp eax,edx
0000001c 7F F5 jg 00000013

and x86 for number = 10:
int count = random.Next();
00000000 56 push esi
00000001 8B F1 mov esi,ecx
00000003 8B 4E 04 mov ecx,dword ptr [esi+4]
00000006 8B 01 mov eax,dword ptr [ecx]
00000008 FF 50 3C call dword ptr [eax+3Ch]
0000000b 8B D0 mov edx,eax
for(int i = 0; i > count; ++i)
0000000d 33 C0 xor eax,eax
0000000f 85 D2 test edx,edx
00000011 7D 0E jge 00000021
{
number = 10;
00000013 C7 46 08 0A 00 00 00 mov dword ptr [esi+8],0Ah
for(int i = 0; i > count; ++i)
0000001a 83 C0 01 add eax,1
0000001d 3B C2 cmp eax,edx
0000001f 7F F2 jg 00000013

I don't know if your comment was supposed to show the adjacency of the object
reference load to the increment; but it's clearly not doing that (the object
reference load is hoisted out of the loop, before the Next() call, which is
where it's first needed). If that wasn't your point, pardon my ramblings;
but they do provide a basis for what follows...

But, the difference here is irrelevant. It's the difference between a
member field and a local variable. x86 for the same code with a local
variable instead of a member field:
int count = random.Next();
00000000 8B C1 mov eax,ecx
00000002 8B 48 04 mov ecx,dword ptr [eax+4]
00000005 8B 01 mov eax,dword ptr [ecx]
00000007 FF 50 3C call dword ptr [eax+3Ch]
0000000a 8B D0 mov edx,eax
int result = 0;
0000000c 33 C9 xor ecx,ecx
for (int i = 0; i > count; ++i)
0000000e 33 C0 xor eax,eax
00000010 85 D2 test edx,edx
00000012 7D 0A jge 0000001E
{
result += 10;
00000014 83 C1 0A add ecx,0Ah
for (int i = 0; i > count; ++i)
00000017 83 C0 01 add eax,1
0000001a 3B C2 cmp eax,edx
0000001c 7F F6 jg 00000014

I was expecting the JIT to do much better optimizations (looping x times
assigning the same value to number) than it had. Sure, the difference
between a single add with a register and one with a memory location is
small... In the increment case, I was expecting something similar to the
local variable: using a register for the duration of the loop. If Next()
returned 10, the loop would effectively be:

number += 10; number += 10; number += 10; number += 10;
number += 10; number += 10; number += 10; number += 10;
number += 10; number += 10; // joined for brevity

All adjacent writes on the same thread, where optimizing to a register
means removing writes...

And, in fact, if you do 10 increments instead of a loop, the JIT *still*
won't optimize any writes away. I know it knows how, because it will do it
with local variables.

Which leads me to believe that the JIT implementation is giving people a
false sense of security by not introducing optimizations clearly allowed for
in the specs. Not to mention, it makes the whole discussion somewhat moot.
 

Willy Denoyette [MVP]

Peter Ritchie said:
<snip>

I don't know if your comment was supposed to show the adjacency of the
object reference load to the increment; but it's clearly not doing that
(the object reference load is hoisted out of the loop, before the Next()
call, which is where it's first needed). If that wasn't your point, pardon
my ramblings; but they do provide a basis for what follows...


This is because you ran this code under a "managed debugger"; the JIT
produces different code from what is produced when no managed debugger is
attached! You need to run the code (release version) under a native debugger
to see what the JIT really produces for 'release' mode code.
As unmanaged debugger you can use any of the debuggers from the Debugging
Tools for Windows like windbg, sdb etc... (which is what I prefer, because
it's more powerful than the VS2005 debugger).
You can also use VS2005 as an unmanaged debugger, but you need to make sure
you break into an unmanaged debugging session. That means you cannot use
System.Diagnostics.Debugger.Break(); you have to call Kernel32.dll's
DebugBreak().

Here is the PInvoke signature:
[DllImport("kernel32"), SuppressUnmanagedCodeSecurity] static extern void
DebugBreak();

Add a call to DebugBreak() in your code, run the program without debugging
(CTRL+F5) from within VS and wait until the break is hit, select 'Debug',
and select the current VS instance from the list in the "VS JIT Debugger"
dialog. Wait until the break message is hit, press 'Break' and in the
following dialog press 'Show Disassembly'.
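
Put together, a minimal harness might look like this (my own scaffolding around the signature above; the probe method mirrors Peter's example):

using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Security;

class JitProbe
{
    [DllImport("kernel32"), SuppressUnmanagedCodeSecurity]
    static extern void DebugBreak();

    private Random random = new Random();
    private int number;

    [MethodImpl(MethodImplOptions.NoInlining)]
    int Method()
    {
        for (int i = 0; i < this.random.Next(); ++i)
        {
            this.number += 10;
        }
        return this.number;
    }

    static void Main()
    {
        JitProbe p = new JitProbe();
        p.Method();   // force the method to be JIT compiled first
        DebugBreak(); // break into the *native* debugger, then disassemble Method
        p.Method();
    }
}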

When you hit this point you'll see this (partly stripped):

X86, member variable number = 10;
....
0032013A mov eax,dword ptr [ebp-10h]
0032013D mov dword ptr [eax+8],0Ah
00320144 add edx,1
00320147 cmp edx,esi
00320149 jl 0032013A
0032014B mov eax,dword ptr [ebp-10h]
0032014E mov eax,dword ptr [eax+8]
.......


The first two instructions are the load of the 'this' instance pointer into
eax, and the store of '0Ah' into the member field 'number' of 'this'.
This sequence is repeated until the loop counter reaches the count value.
But, the difference here is irrelevant. It's the difference between a
member field and a local variable. x86 for the same code with a local
variable instead of a member field:
int count = random.Next();
00000000 8B C1 mov eax,ecx
00000002 8B 48 04 mov ecx,dword ptr [eax+4]
00000005 8B 01 mov eax,dword ptr [ecx]
00000007 FF 50 3C call dword ptr [eax+3Ch]
0000000a 8B D0 mov edx,eax
int result = 0;
0000000c 33 C9 xor ecx,ecx
for (int i = 0; i > count; ++i)
0000000e 33 C0 xor eax,eax
00000010 85 D2 test edx,edx
00000012 7D 0A jge 0000001E
{
result += 10;
00000014 83 C1 0A add ecx,0Ah
for (int i = 0; i > count; ++i)
00000017 83 C0 01 add eax,1
0000001a 3B C2 cmp eax,edx
0000001c 7F F6 jg 00000014

I was expecting the JIT to do much better optimizations (looping x times
assigning the same value to number) than it had.

That's right, the JIT optimizer is quite conservative when optimizing loops.
However, I don't know who writes code like this:
for(int i = 0; i < count; ++i)
    result = 10;

Sure, the difference between a single add with a register and one with a
memory location is small... In the increment case, I was expecting
something similar to the local variable: using a register for the duration
of the loop. If Next() returned 10, the loop would effectively be:

number += 10; number += 10; number += 10; number += 10;
number += 10; number += 10; number += 10; number += 10;
number += 10; number += 10; // joined for brevity

Same here, this can be optimized by storing the number in a local before
running the loop and, once done, moving the local to the field variable.
This is something you should do yourself whenever you are dealing with a
field variable in long-running algorithms.

Granted, in the sample above, the loop won't be optimized as aggressively as
a native compiler would do (a C compiler will hoist the loop completely),
but again, I don't know if one writes code like this.
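
That is, something like this sketch (reusing the names from Peter's example):

private int number;

public void Accumulate(int count)
{
    // Copy the field into a local so the JIT is free to enregister it...
    int local = this.number;
    for (int i = 0; i < count; ++i)
    {
        local += 10;
    }
    // ...and publish the result with a single store at the end.
    this.number = local;
}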
All adjacent writes on the same thread, where optimizing to a register
means removing writes...

And, in fact, if you do 10 increments instead of a loop, the JIT *still*
won't optimize any writes away. I know it knows how, because it will do it
with local variables.

Which leads me to believe that the JIT implementation is giving people a
false sense of security by not introducing optimizations clearly allowed for
in the specs. Not to mention, it makes the whole discussion somewhat moot.

Don't know what this has to do with security and the specs; this is about
loop optimizing, right?

Willy.
 

Guest

Willy Denoyette said:
This is because you ran this code under a "managed debugger"; the JIT
produces different code from what is produced when no managed debugger is
attached! You need to run the code (release version) under a native debugger
to see what the JIT really produces for 'release' mode code.
As unmanaged debugger you can use any of the debuggers from the Debugging
Tools for Windows like windbg, sdb etc... (which is what I prefer, because
it's more powerful than the VS2005 debugger).
You can also use VS2005 as an unmanaged debugger, but you need to make sure
you break into an unmanaged debugging session. That means you cannot use
System.Diagnostics.Debugger.Break(); you have to call Kernel32.dll's
DebugBreak().

I've been using Vance Morrison's guide to observing optimized managed code [1].
That's right, the JIT optimizer is quite conservative when optimizing loops.

As I already pointed out, it optimized identical loops--not using member
fields.
However, I don't know who writes code like this:
for(int i = 0; i < count; ++i)
    result = 10;

That's irrelevant, the optimizer doesn't know who's likely to write what
code. The exercise is to show optimized code.
Same here, this can be optimized by storing the number in a local before
running the loop and, once done, moving the local to the field variable.
This is something you should do yourself whenever you are dealing with a
field variable in long-running algorithms.

Granted, in the sample above, the loop won't be optimized as aggressively as
a native compiler would do (a C compiler will hoist the loop completely),

Huh? In the post you replied to I showed an example where the JIT *did*
hoist the loop completely, just not with member fields.
Don't know what this has to do with security and the specs; this is about
loop optimizing, right?

No, as I pointed out, it's about getting an example of JIT optimization of
member fields. It doesn't have to be a loop; it's just that loop
optimization is easy to get generated.

[1] http://blogs.msdn.com/vancem/archive/2006/02/20/535807.aspx
 

Willy Denoyette [MVP]

Peter Ritchie said:
Willy Denoyette said:
This is because you ran this code under a "managed debugger"; the JIT
produces different code from what is produced when no managed debugger is
attached! You need to run the code (release version) under a native debugger
to see what the JIT really produces for 'release' mode code.
As unmanaged debugger you can use any of the debuggers from the Debugging
Tools for Windows like windbg, sdb etc... (which is what I prefer, because
it's more powerful than the VS2005 debugger).
You can also use VS2005 as an unmanaged debugger, but you need to make sure
you break into an unmanaged debugging session. That means you cannot use
System.Diagnostics.Debugger.Break(); you have to call Kernel32.dll's
DebugBreak().

I've been using Vance Morrison's guide to observing optimized managed code [1].

Which is wrong; it doesn't show the machine code that would be produced when
you don't run under the managed debugger (the VS debugger or mdbg). The CLR
knows that it runs under a managed debugger, using the "managed debugger
interfaces" (ICorDebug - COM interfaces), and forces the JIT to produce
different code from what it would produce when no managed debugger is
attached! What Vance calls optimized code is not what the JIT produces when
run outside of the debugger. That's why I always use windbg to analyze
assembly code.

You don't have to believe me, just do as I said and try to run the code in
windbg (you can download the latest builds for free from
http://www.microsoft.com/whdc/devtools/debugging/default.mspx), or, as I
have explained in my previous post, use VS2005, but take care not to run in
the VS Debugger!
If you still don't believe me, you can ngen your code and run "dumpbin
/rawdata program.ni.exe", where "program" is the name of the assembly. The
ngen'd image can be found in:
C:\Windows\assembly\NativeImages_v2.0.50727_32\blable....

The output should contain something like:

30002640: 8B 45 F0 C7 40 08 0A 00 00 00 83 C2 01 3B D1 7C .E­Ã@......┬.;Ã|
30002650: EF 8B 45 F0 8B 40 08 8B 7D CC 89 7E 0C 8D 65 F4 ´.E­.@..}╠.~..e¶
30002660: 5B 5E 5F 5D C3 CC CC CC BF 27 00 30 6A 46 00 30 [^_]├╠╠╠â”'.0jF.0
.....

To find the exact addresses you will have to look at the unmanaged debugger
output...
Following is how it looks when I ran this in windbg:

vols2_ni!Willys.Test.Method2()+0x7c:
30002640 8b45f0 mov eax,dword ptr [ebp-10h]
30002643 c740080a000000 mov dword ptr [eax+8],0Ah
3000264a 83c201 add edx,1
3000264d 3bd1 cmp edx,ecx
3000264f 7cef jl vols2_ni!Willys.Test.Method2()+0x7c
(30002640)

Now you just have to compare the sequences of bytes.
What I see when running in windbg or sdb or VS2005's VSJIT debugger and what
I see produced by ngen are exactly the same; are you telling me that what I
see is not correct?
As I already pointed out, it optimized identical loops--not using member
fields.


That's irrelevant, the optimizer doesn't know who's likely to write what
code. The exercise is to show optimized code.


Huh? In the post you replied to I showed an example where the JIT *did*
hoist the loop completely, just not with member fields.

Yes, but again you were running in the managed debugger!
Don't know what this has to do with security and the specs; this is about
loop optimizing, right?

No, as I pointed out, it's about getting an example of JIT optimization of
member fields. It doesn't have to be a loop; it's just that loop
optimization is easy to get generated.

[1] http://blogs.msdn.com/vancem/archive/2006/02/20/535807.aspx

Willy.
 

Guest

It appears the x86 JIT's (or its design team's) interpretation is somewhat
similar to my interpretation, in that compile-time optimization restrictions
and run-time acquire/release semantics are separate considerations.

The JIT is using the protected region as the guard in which not to optimize,
not Monitor.Enter/Monitor.Exit. Anything within a try block will be treated
as a volatile operation by the compile-time optimizer; it has nothing to do
with Enter/Exit or directly with "lock" (other than that lock is implemented
with a try block). Secondary to that, and outside of any documentation I've
been able to find, it appears that all member access (I only tried instance
members, not class members) is considered a volatile operation by the JIT in
terms of optimizations, regardless of being in or out of a protected region
(obviously acquire/release semantics are the responsibility of Enter/Exit,
MemoryBarrier, etc. and are not implicitly obtained without them). Therefore
acquire/release semantics guarantees do not directly affect what the JIT
decides not to optimize.

For example:
int result = 0;
Monitor.Enter(locker);
result += 10;
result += 10;
Monitor.Exit(locker);
result += 10;
result += 10;
result += 10;
return result;

...is compile-time optimized by the JIT to the equivalent of:
Monitor.Enter(locker);
Monitor.Exit(locker);
return 50;

...and you get acquire/release semantics on nothing in the current thread
(other than "locker" in the call to Exit).

And:
int result = 0;
try
{
    result += 10;
    result += 10;
}
finally {
    result += 10;
    result += 10;
    result += 10;
}
return result;

...is compile-time optimized by the JIT to the equivalent of:
int result = 0;
try
{
    result += 10;
    result += 10;
}
finally {
    result += 30;
}
return result;

...but I do not get acquire/release semantics within the try block.

And finally:

int result = 0;
Monitor.Enter(locker);
try {
    result += 10;
    result += 10;
} finally {
    Monitor.Exit(locker);
    result += 10;
    result += 10;
    result += 10;
}
return result;

...is compile-time optimized by the JIT to the equivalent of:

int result = 0;
Monitor.Enter(locker);
try {
    result += 10;
    result += 10;
} finally {
    Monitor.Exit(locker);
    result += 30;
}
return result;

...and this is the only example where I get both the JIT optimization AND
the acquire/release semantics guarantees you've been talking about.

I don’t believe this compile-time optimization behaviour is covered clearly,
if at all, in 335.
 

Ben Voigt [C++ MVP]

This is because you ran this code under a "managed debugger"; the JIT
produces different code from what is produced when no managed debugger is
attached!

The JIT produces different code when *started* in a debugger. When a
debugger is attached later, managed or not, the optimized code is already
generated.
 

Willy Denoyette [MVP]

Ben Voigt said:
The JIT produces different code when *started* in a debugger. When a
debugger is attached later, managed or not, the optimized code is already
generated.

Yep, but this is not the case when an "unmanaged debugger" is attached.
Using an unmanaged debugger like sdb, you can break into the debugger before
the CLR is even loaded; after JITing you will get full-fidelity code.
Unmanaged debuggers don't use the ICorDebug COM interfaces to interact with
the CLR (via the CLR's debugger thread, as present in any managed-code
process).
Note that when running in the VS debugger you can get the same behavior; you
only need to take care not to break using
System.Diagnostics.Debugger.Break(), else you will get ICorDebug as the
interface and the CLR will signal the presence of a managed debugger to the
JIT.

Willy.
 

Jon Skeet [C# MVP]

Peter Ritchie said:
It appears the x86 JIT's (or its design team's) interpretation is somewhat
similar to my interpretation, in that compile-time optimization restrictions
and run-time acquire/release semantics are separate considerations.

They can be implemented separately without the team having decided that
our reading of the spec is incorrect.
The JIT is using the protected region as the guard in which not to optimize,
not Monitor.Enter/Monitor.Exit. Anything within a try block will be treated
as a volatile operation by the compile-time optimizer; it has nothing to do
with Enter/Exit or directly with "lock" (other than that lock is implemented
with a try block).

That may be *a* type of optimisation blocking - it doesn't mean it's
the only one.
Secondary to that, and outside of any documentation I've been able to find,
it appears that all member access (I only tried instance members, not class
members) is considered a volatile operation by the JIT in terms of
optimizations, regardless of being in or out of a protected region
(obviously acquire/release semantics are the responsibility of Enter/Exit,
MemoryBarrier, etc. and are not implicitly obtained without them). Therefore
acquire/release semantics guarantees do not directly affect what the JIT
decides not to optimize.

Unless the reason the JIT decided not to optimise *anything* for member
variables is because it's simpler than trying to work out exactly where
it can and can't optimise due to Monitor.Enter/Exit.

By not reordering member access, the JIT is automatically complying
with the spec without having to do any extra checking. That doesn't
mean that the guarantees given by the spec don't apply - just that the
JIT is being stricter than it needs to.
For example:
int result = 0;
Monitor.Enter(locker);
result += 10;
result += 10;
Monitor.Exit(locker);
result += 10;
result += 10;
result += 10;
return result;

...is compile-time optimized by the JIT to the equivalent of:
Monitor.Enter(locker);
Monitor.Exit(locker);
return 50;

That's certainly interesting - but the difference can't be *observed*
because no other thread has access to the value on that thread's stack.

I'll readily confess that I can't see where that's made clear in the
spec, unless it's the section about 12.6.4. It certainly makes sense
though - optimising within a stack can be done easily without
introducing bugs.

Another argument in favour of this is that the volatile prefix can't be
applied to the ldloc instruction - it's only applicable for potentially
shared data:

<quote>
The volatile. prefix specifies that addr is a volatile address (i.e.,
it can be referenced externally to the current thread of execution) and
the results of reading that location cannot be cached or that multiple
stores to that location cannot be suppressed.
</quote>
...and you get acquire/release semantics on nothing in the current thread
(other than "locker" in the call to Exit).

Again you're talking about acquire/release semantics *on* something -
which is something the spec doesn't talk about. It talks about
acquire/release semantics at a particular point in time.

I don't believe this compile-time optimization behaviour is covered clearly,
if at all, in 335.

The spec doesn't talk about compile-time vs run-time optimisation
though - it talks about observable behaviour. As a developer trying to
write code which is guaranteed to work against the spec, I don't care
whether the JIT has to do more or less work depending on the CPU it's
on - I just care that my code works in all situations.

I still believe the spec guarantees that for the situation I've
specified.
 

Peter Ritchie [C#MVP]

It's somewhat moot, I feel, at this point to discuss it much further, other
than to continue to say each other's interpretation is different. But, if
you're interested in arguing from the spec itself... If you want to take it
offline, just reply to the email address you have for me.

The behaviour I've observed proves nothing about anyone's interpretation of
the spec; it merely speaks to what appears to be the JIT's opinion of what a
volatile operation is, despite what the spec says. What I've observed shows
optimization blocking and acquire/release semantics are at least considered
independently (I'll admit it's not proof that the JIT does or does not take
into account Enter/Exit calls, but it can't use that to decide how it should
generate assembler for another method, regardless of whether a JIT allows
any optimization of observable read/writes). The combination of behaviour
I've observed may suggest the JIT is attempting to guarantee observable
read/writes can't be reordered and therefore must be flushed between Enter
and Exit (which is good); but that neither proves your interpretation of the
spec nor disproves mine.

If the MS x86 JIT does not in fact optimize member fields to fulfil
that/those particular guarantee(s), that really just substantiates my
assertion that the spec is unclear and that your rebuttal is interpretive.
It's also a bit contradictory to information you've said you received from
Vance and Chris. No offence or implication that you didn't receive that
information; just that it seems contradictory, if it's indeed true that the
JIT doesn't optimize member fields and therefore does not need to look for
Enter/Exit...

You're readily confessing that the reasons for the side-effects you're
relying upon are not clear in the spec, yet you're still advocating reliance
upon the side-effects (arguing from the spec itself)?

To reduce typing:

"Conforming implementations of the CLI are free to execute programs using
any technology that guarantees, within a single thread of execution, that
side-effects and exceptions generated by a thread are visible in the order
specified by the CIL. For this purpose only volatile operations (including
volatile reads) constitute visible side-effects."

I'll call that statement optimization allowance 1 (OA1).

"Acquiring a lock (System.Threading.Monitor.Enter or entering a synchronized
method) shall implicitly perform a volatile read operation, and releasing a
lock (System.Threading.Monitor.Exit or leaving a synchronized method) shall
implicitly perform a volatile write operation."

I'll call that statement locking rule 1 (LR1).
Again you're talking about acquire/release semantics *on* something -
which is something the spec doesn't talk about. It talks about
acquire/release semantics at a particular point in time.

Semantics. In that context the acquire/release semantics flush nothing to
memory: no values are read after the Enter and before the Exit other than
"locker" (the only thing read merely supports the existence of Enter/Exit),
and there are no observable writes. It's an example that shows the lack of
clarity of LR1 (i.e. "implicitly perform[ing] a volatile write" of what?),
the only association between Enter/Exit and acquire/release. Yes, it has the
side effect of having flushed values to memory for subsequent reads (the
acquire semantics), but that makes no guarantees for the instructions
immediately following Exit and therefore no guarantees on any reads.

What LR1 implies for acquire/release semantics hinges on whether you believe
the implication that everything on and after a call to Enter and before a
call to Exit constitutes one volatile read and that the call to Exit
constitutes one volatile write. No matter how you interpret that paragraph
it's unclear. Regardless of interpretation, it still leaves the code between
related Enter and Exit calls in a black hole (ignoring the fact there is no
syntax ensuring related Enter/Exit calls occur in the same block, the same
method, or even the same assembly). With your interpretation the release
semantics for the block occur at the call to Exit; which leaves any writes
within the block without release semantics until the end of the block and
therefore makes no guarantees any writes are visible to other threads until
Exit.

Without clarity of LR1, you can neither make the connection between
acquire/release semantics and Enter/Exit nor, therefore, the connection
to observable side-effect guarantees.
The spec doesn't talk about compile-time vs run-time optimisation
though - it talks about observable behaviour.

And that has been my point. Without taking into account what the JIT *is*
doing, your interpretation of the guarantee(s) means the following is safe:
//thread A:
instance.firstIntMember = 1;
instance.firstIntMember = 2;

//thread B:
instance.secondIntMember = 3;
instance.secondIntMember = 4;

//thread C:
Monitor.Enter(locker);
instance.otherMember = instance.firstIntMember;
instance.anotherMember = instance.secondIntMember;
Monitor.Exit(locker);

...including atomicity rules: the assignment to otherMember in C is
"guaranteed" to see any observable side-effects made to firstIntMember, and
the assignment to anotherMember in C is "guaranteed" to see any observable
side-effects made to secondIntMember. And yet, nowhere in that code is
there enough information for a JIT to make any decisions about what and what
not to optimize in A and B, especially considering thread A code and thread
B code are likely in different methods than C and could be in different
assemblies, JITted independently. OA1 suggests it could optimize away the
assignment of 1 in A and 3 in B because that isn't observable "within
[that] single thread of execution."

Is it good code? Of course not. Should it pass code review? Of course not.
What is and isn't sanctioned code is outside the domain of a C# compiler,
the JIT, or the CLI. The point is the spec is unclear in this area, and to
use it as a crutch to support using syntax because of its side-effects is,
in my opinion, not a good practice. Using observed behaviour as a crutch is
better; but if the behaviour doesn't match the spec it's subject to change
and, again, not a good practice.
 
