How does "new" work in a loop?

Barry Kelly

Göran Andersson said:
How would the GC know the scope of the variable, when the scope is
something that only the compiler is aware of?

One possible implementation: the C# compiler compiles to IL, and the JIT
produces the actual code. The IL contains ldloc and stloc for locals,
and thus the JIT can make a note of where the last use of a variable
occurs for each basic block. Hence it can produce tables which indicate
which stack locations / registers are valid roots for given instruction
pointer ranges.

It needs to do analysis like this anyway to make sensible decisions on
enregistering. Without knowing when a variable is no longer needed, and
thus a register freed up for use by another variable, code wouldn't be
as performant as it could be.
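
To illustrate with a minimal sketch (LoadData and Process are just
placeholder names):

---8<---
static void Example()
{
    byte[] data = LoadData();   // data becomes live here
    Process(data);              // last use of data

    // data is still lexically in scope here, but an optimizing JIT's
    // liveness tables need not report its slot/register as a GC root
    // any more, so the array may already be collectable.
    for (int i = 0; i < 1000; i++)
    {
        // long-running work that never touches data
    }
}
--->8---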

-- Barry
 
Jon Skeet [C# MVP]

Göran Andersson said:
How would the GC know the scope of the variable, when the scope is
something that only the compiler is aware of?

I can't remember (and can't find in the specs) whether the compiler
adds some information in the IL or whether the JIT works it out. Either
way, it clearly happens :)
 
Barry Kelly

Jon Skeet said:
I can't remember (and can't find in the specs) whether the compiler
adds some information in the IL or whether the JIT works it out. Either
way, it clearly happens :)

The information isn't in the IL. It may be in the PDB. Quote from CLI
specs (v1.1):

"ilasm allows nested local variable scopes to be provided and allows
locals in nested scopes to share the same location as those in the outer
scope. The information about local names, scoping, and overlapping of
scoped locals is persisted to the PDB (debugger symbol) file rather than
the PE file itself."

In Java class files, there is a mechanism for specifying the extent of a
local variable. I've never used it when writing compilers targeting the
JVM, so I don't know if it actually gets used by the JIT, or whether it
does its own work. I would expect it to do its own work, since that kind
of work (def-use / use-def chains / SSA graph) is needed to implement
other compiler optimizations.

-- Barry
 
Göran Andersson

Barry said:
One possible implementation: the C# compiler compiles to IL, and the JIT
produces the actual code. The IL contains ldloc and stloc for locals,
and thus the JIT can make a note of where the last use of a variable
occurs for each basic block. Hence it can produce tables which indicate
which stack locations / registers are valid roots for given instruction
pointer ranges.

It needs to do analysis like this anyway to make sensible decisions on
enregistering. Without knowing when a variable is no longer needed, and
thus a register freed up for use by another variable, code wouldn't be
as performant as it could be.

-- Barry

But you yourself said that:

"Scope is a lexical concept that exists only at compile time."

I guess that's not really so, then.
 
Barry Kelly

Göran Andersson said:
But you yourself said that:

"Scope is a lexical concept that exists only at compile time."

I guess that's not really so, then.

I think you're confusing scope with GC reachability.

In compiler theory, the word "scope" is overloaded. It can either refer
to (i) the extent of source code for which the identifier is valid ("the
scope of a variable") or (ii) the set of identifiers which are valid for
the current position while parsing the source code ("the variable isn't
in scope").

It is implemented with the compiler's symbol table. After the compiler
has finished parsing and has resolved identifiers, scope no longer
exists. The information may be carried forward to the PDB for debugging,
but that's the end of it.

GC reachability is the set of rules by which objects in a graph are
determined to be alive or eligible for collection. Reachability is
typically defined by (i) a set of object roots and (ii) the transitive
closure of objects referenced by these roots.

The point is that the set of object roots at a particular location in
compiled code does not necessarily correspond exactly with the variables
which are lexically in scope at that location in the original source
code.

A variable being lexically "in scope" does not imply that it is GC
reachable.
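
To make that concrete (a minimal sketch):

---8<---
object obj = new object();
Console.WriteLine(obj);   // last use of obj

// obj is still lexically in scope, but the object it referred to is
// no longer reachable; in an optimized build a collection here is
// free to reclaim it.
GC.Collect();
--->8---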

-- Barry
 
Barry Kelly

Barry Kelly said:
One possible implementation: the C# compiler compiles to IL, and the JIT
produces the actual code. The IL contains ldloc and stloc for locals,
and thus the JIT can make a note of where the last use of a variable
occurs for each basic block. Hence it can produce tables which indicate
which stack locations / registers are valid roots for given instruction
pointer ranges.

There's another reason why it needs this info: so it can adjust all
pointers to relocated objects after a GC has just finished.

-- Barry
 
Göran Andersson

Barry said:
I think you're confusing scope with GC reachability.

In compiler theory, the word "scope" is overloaded. It can either refer
to (i) the extent of source code for which the identifier is valid ("the
scope of a variable") or (ii) the set of identifiers which are valid for
the current position while parsing the source code ("the variable isn't
in scope").

It is implemented with the compiler's symbol table. After the compiler
has finished parsing and has resolved identifiers, scope no longer
exists. The information may be carried forward to the PDB for debugging,
but that's the end of it.

GC reachability is the set of rules by which objects in a graph are
determined to be alive or eligible for collection. Reachability is
typically defined by (i) a set of object roots and (ii) the transitive
closure of objects referenced by these roots.

The point is that the set of object roots at a particular location in
compiled code does not necessarily correspond exactly with the variables
which are lexically in scope at that location in the original source
code.

A variable being lexically "in scope" does not imply that it is GC
reachable.

-- Barry

No, I'm not at all confused about what scope is. It's a bit surprising
how much the GC knows about it, though. Even if it doesn't know the
"available" scope of the variables, it seems to know the "utilized"
scope, or the active lifetime of the variables (which may be shorter
than the physical lifetime).

Is there any information that supports the theory that the GC knows when
a reference is no longer reachable? Can we trust that it will always be
able to collect objects that won't be used?

Does the scope matter? Will there be a difference between:

for (int i=0; i<1000; i++) {
    byte[] buffer = new byte[10000];
}

and:

byte[] buffer;
for (int i=0; i<1000; i++) {
    buffer = new byte[10000];
}

Will it always, with certainty, be able to collect the previous buffer?
Will it never behave differently from this?

byte[] buffer;
for (int i=0; i<1000; i++) {
    buffer = new byte[10000];
    buffer = null;
}
 
Barry Kelly

Göran Andersson said:
No, I'm not at all confused about what scope is.

I don't understand how you could have quoted me in this context without
being mistaken.
It's a bit surprising
how much the GC knows about it, though.

The GC doesn't know anything about scope. That's what I've been trying
to explain to you. The scope information is *LOST* after compile time.
The thing that the GC knows about IS NOT SCOPE.
it seems to know the "utilized"
scope, or the active lifetime of the variables (which may be shorter
than the physical lifetime).

Like I said in the other messages, the JIT needs this info (variable
lifetime - not scope) for enregistering and stack reuse, and so it
calculates it, and the GC needs this data for adjusting pointers after a
collection.
Is there any information that supports the theory that the GC knows when
a reference is no longer reachable?

The JIT can only detect the last use of a given variable definition (in
the Static Single Assignment (SSA) model of "variable definition"). It's
the JIT compiler that is doing the analysis, not the GC.

I recommend that you Google up on:

* Use-Definition chain, Definition-Use chain (ud-chain, du-chain)
* Static Single Assignment (SSA - this is a more modern approach)

Alternatively, you can look up use-def / def-use chains in the Dragon
book (Compilers: Principles, Techniques and Tools, by Aho, Sethi &
Ullman).
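
Roughly, a du-chain links each definition of a variable to the uses it
can reach. In this sketch (ReadA and ReadB are placeholder names), the
single local has two independent definitions with disjoint live ranges,
which SSA would split into two versions:

---8<---
int x = ReadA();        // definition 1 (SSA: x1)
Console.WriteLine(x);   // last use of definition 1; x1 is dead below

x = ReadB();            // definition 2 (SSA: x2)
Console.WriteLine(x);   // use of definition 2
--->8---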
Can we trust that it will always be
able to collect objects that won't be used?

No, it can only collect objects which aren't used. For example:

---8<---
object x = new object();
Halting_Problem(); // might not return
Console.WriteLine(x);
--->8---

The JIT clearly can't determine that x is dead at the point of calling
Halting_Problem(), so the GC can't collect x.
Does the scope matter?

THE SCOPE DOESN'T EXIST IN IL. The scope is GONE, GONE, GONE, ALL GONE,
after the C# compiler has produced IL. Use ILDASM to decompile an
assembly some time. You will notice that THERE IS NO SCOPE INFORMATION
in the dump. There is only a list of local variables per method.
Will there be a difference between:

for (int i=0; i<1000; i++) {
    byte[] buffer = new byte[10000];
}

and:

byte[] buffer;
for (int i=0; i<1000; i++) {
    buffer = new byte[10000];
}

Will it always, with certainty, be able to collect the previous buffer?
Will it never behave differently from this?

It's entirely implementation defined, based on how smart the JIT is at
recognizing that variables are no longer needed. It's a function of the
sophistication of the compiler. It's in the JIT's interest to discover
when variables are no longer needed, because that creates room for other
variables to be enregistered, or stack space minimized.
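
The flip side is worth knowing too: when you need to *guarantee* that an
object stays reachable past its last ordinary use, GC.KeepAlive acts as
an artificial "last use". A minimal sketch (Worker and DoNativeWork are
placeholder names):

---8<---
Worker w = new Worker();      // holds a native handle, has a finalizer
IntPtr handle = w.Handle;     // last ordinary use of w
DoNativeWork(handle);         // without KeepAlive, w could be finalized
                              // while this call is still running
GC.KeepAlive(w);              // keeps w reachable until this point
--->8---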

-- Barry
 
Göran Andersson

Barry said:
I don't understand how you could have quoted me in this context without
being mistaken.

No, I can see that. Hopefully it will dawn on you.
The GC doesn't know anything about scope. That's what I've been trying
to explain to you. The scope information is *LOST* after compile time.
The thing that the GC knows about IS NOT SCOPE.

If you read more than just one sentence at a time, perhaps you would
understand what I am saying, instead of getting stuck on a single word.
Like I said in the other messages, the JIT needs this info (variable
lifetime - not scope) for enregistering and stack reuse, and so it
calculates it, and the GC needs this data for adjusting pointers after a
collection.

No, it doesn't. It uses the same information for that as it did to
determine which objects can be collected. If it used different
information in the phases, it would mess up the references.
No, it can only collect objects which aren't used. For example:

---8<---
object x = new object();
Halting_Problem(); // might not return
Console.WriteLine(x);
--->8---

The JIT clearly can't determine that x is dead at the point of calling
Halting_Problem(), so the GC can't collect x.

Well, that is obvious, isn't it?

Ok, let me rephrase the question a bit more precisely:

Can we trust that it will always be able to collect objects that
possibly can't be used later in the execution?
THE SCOPE DOESN'T EXIST IN IL. The scope is GONE, GONE, GONE, ALL GONE,
after the C# compiler has produced IL. Use ILDASM to decompile an
assembly some time. You will notice that THERE IS NO SCOPE INFORMATION
in the dump. There is only a list of local variables per method.

Will you PLEASE STOP SHOUTING!

I wasn't asking whether the scope exists in the IL code. I was asking
whether the scope matters.

If you try to look beyond your hangup on this word, maybe you could
understand the question?
 
Jon Skeet [C# MVP]

Ok, let me rephrase the question a bit more precisely:

Can we trust that it will always be able to collect objects that
possibly can't be used later in the execution?

On the current implementations, in release mode? I believe so, in
simple cases. The JIT doesn't do complex analysis, so if you had:

bool first = true;

object bigObject = (...);

for (int i=0; i < 100000; i++)
{
    if (first)
    {
        useObject(bigObject);
        first = false;
    }

    // Code not using bigObject
}

then the JIT wouldn't work out that first could never be true again
after the first iteration, and that bigObject would therefore never be
used after that point. That's one of the few situations where it might
make sense to set a local variable to null.
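
The null assignment would look something like this (CreateBigObject is
just a placeholder name):

---8<---
object bigObject = CreateBigObject();

useObject(bigObject);
bigObject = null;   // explicit hint: the JIT can't prove the reference
                    // is dead, so we kill it ourselves

for (int i = 0; i < 100000; i++)
{
    // long-running code not using bigObject; the big object is
    // collectable during this loop
}
--->8---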


I looked at the CLR spec and I found *very* little about garbage
collection. No guarantees about this kind of thing at all. Normally I'm
a spec hound in terms of only coding to the spec, but the ramifications
of only trusting to the spec in this case are so horrible that I
believe it makes more sense to go with what happens in reality.
 
John J. Hughes II

Jon,

Thanks for the debate; at this point I have not changed my mind, but it
does give me food for thought. The next time I have some slow time I
will research the matter further, taking your points and other points
in this thread into consideration.

But as a last comment: In one of your other messages in this thread you
made the following comment:
then the JIT wouldn't work out that first could never be true again
after the first iteration, and that bigObject would therefore never be
used after that point. That's one of the few situations where it might
make sense to set a local variable to null.

Since I have some rather long-running threads which create some
long-lived variables, it's possible that setting some of them to null
causes memory to be returned that was previously being held.

Regards,
John

Jon Skeet said:
John J. Hughes II said:
I do agree the memory is not marked... poor verbiage on my part.

I don't think your example really proves anything since you are calling
garbage collection.

Well, I can make an example which ends up garbage collecting due to
other activity if you want. It'll do the same thing. Just change the
call to GC.Collect() to

for (int i=0; i < 10000000; i++)
{
    byte[] b = new byte[1000];
}

and you'll see the same thing.
I have no argument that when GC runs it will clean up memory that is
not being used. I personally believe that all references to a variable
are not removed in a timely fashion unless you tell them to be. The key
here is timely.

It's not a matter of the reference being removed. It's a case of the
release-mode garbage collector ignoring variables which are no longer
relevant.
Again, as I have said, I had a problem with memory creep. The only
change I made was to add using statements; the problem slowed down but
was not eliminated.

And *that* can have a significant impact - because many classes which
implement IDisposable also have finalizers which are suppressed when
you call Dispose. That really *does* affect when the memory can be
freed, and can make a big difference.
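
A minimal sketch of that pattern (Resource is a placeholder class):

---8<---
class Resource : IDisposable
{
    ~Resource() { /* fallback cleanup of unmanaged state */ }

    public void Dispose()
    {
        // release unmanaged state here, then:
        GC.SuppressFinalize(this);  // no finalizer run needed, so the
                                    // memory can be reclaimed sooner
    }
}

// 'using' guarantees Dispose runs, even if the body throws:
using (Resource r = new Resource())
{
    // work with r
}
--->8---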
The second change was to add value=null statements (shotgun blast
style) and the problem went away. Since it was a production system I
used great care to change as little as possible, so I really don't
think I fixed any other problems.

I'm afraid I still don't believe you saw what you claimed to be seeing
- not on a production system. You *would* see improvements in a
debugger, but that's a different matter.
If at some point in the near future I can give you code which proves my
point I will be happy to, but the last time I had the problem it
required a system running full blown for 14 days on average.

That being said, I may have gotten my head wet and decided it was
raining when it was snowing. I decided to use an umbrella, and my head
is not wet now.

I really suspect you were mistaken, I'm afraid.
 
Tony Sinclair

My sincere gratitude to everyone who responded.

I'm afraid the debate on my question sailed over my head quite some
time ago, but in case anyone is interested, I can give you the actual
results of my program.

I used essentially the same code as in my OP on an internet file that I
downloaded, which comprised 50+ segments of 24 MB each, so each time
through my loop, I was allocating a 24 MB buffer as "new". (I do
intend to incorporate the improvements suggested, especially the
using statement, but I haven't gotten to it yet.) I watched the
memory data with the MS Task manager as I started and ran my program.

When it started, the program quickly grabbed an extra 24MB from the
memory pool. It never went more than 1MB above that for the rest of
the run, and the file assembled perfectly. There was no shortage of
memory at the time (I have 1GB of physical memory, and Task Manager
showed about 2500MB of virtual memory available. When I started my
program, about 700MB of this was committed).

I conclude that even in a short loop, with no shortage of memory, and
with no hints from me that might help speed it up, the GC acts quickly
enough to dispose of the old buffer as soon as it's unneeded. I note
that even though the loop is short in lines, the CPU probably has a
lot of time on its hands while I am writing to the output buffer. I
might try this test again with a high-CPU task running in the
background and see what happens.

Thanks again to everyone for their help.
 
