Array.Clear vs List<>.Clear

  • Thread starter: Lee Crabtree
Jon Skeet said:
Nope, it doubles the capacity each time, as Doug said. Here's code to show
that:

Yes, you are right of course. I was thinking of something else. Oh well...
it happens.
using System;
using System.Collections.Generic;

public class ConvertFile
{
    public static void Main(string[] args)
    {
        List<string> list = new List<string>();

        int previousCapacity = -1;
        for (int i = 0; i < 100000; i++)
        {
            list.Add("");
            if (list.Capacity != previousCapacity)
            {
                Console.WriteLine(list.Capacity);
                previousCapacity = list.Capacity;
            }
        }
    }
}

Output:

4
8
16
32
64
128
256
512
1024
2048
4096
8192
16384
32768
65536
131072

It would perhaps be nice to be able to specify the scaling factor, but 2
is reasonable for most situations.
Reasonable and fast, yes. A hack but reasonable one. It's still strange, and
maybe unfortunate, that the doubling behavior does not seem to be
documented.
The current wording (http://msdn2.microsoft.com/en-us/library/y52x03h2.aspx)
"If Count exceeds Capacity while adding elements, the capacity is increased
by automatically reallocating the internal array before copying the old
elements and adding the new elements." is misleading.

Still, it's not that hard to roll your own capacity calculator by deriving a
new class from List<T> and replacing the needed methods.
It's not beautiful because it needs to hide methods (new), but it can be
done (see the sketch below). For some applications, more linear allocation
helps performance.
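
For illustration, here is a minimal sketch of that idea, assuming a fixed
growth chunk (the class name LinearGrowthList and the chunk size are
invented, not taken from any framework type):

using System;
using System.Collections.Generic;

// Hypothetical sketch: grow the backing array by a fixed chunk instead of
// letting List<T> double its capacity.
public class LinearGrowthList<T> : List<T>
{
    private readonly int growthChunk;

    public LinearGrowthList(int growthChunk)
    {
        this.growthChunk = growthChunk;
    }

    // List<T>.Add is not virtual, so it can only be hidden with 'new'.
    // Callers that hold a plain List<T> reference will bypass this method.
    public new void Add(T item)
    {
        if (Count == Capacity)
        {
            // Reserve a fixed amount of extra room rather than doubling.
            Capacity += growthChunk;
        }
        base.Add(item);
    }
}

The obvious caveat is exactly the "hide with new" problem mentioned above:
anything that adds through a List<T>-typed reference (or via AddRange) still
gets the default doubling behavior.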
 
Well... List.Clear actually calls Array.Clear internally, which is an
internal framework function. I wouldn't be surprised if it eventually
just calls some sort of memset variant (which is really
fast).

I suppose not, because List.Clear doesn't have to clear the items. It only
sets the length to zero.
I would say that calling List.Clear() is going to be slightly faster
than an iterative version (but since her test didn't measure a
List<T>.Clear call it is unclear... pardon the pun).

After List.Clear the list would have to be refilled, because it results in a
list with length zero, not a list with the same length and all elements zero.
Really, I think the main advantage to throwing out the old list is the
fact that you can't shrink a list's size once it has been grown... and
that every time it grows the array size doubles. So if you have a
list that is 4 megabytes, the next growth will be to 8 megabytes, even if
you only want 4 MB + 2 items...

This also doesn't apply if the List keeps its size.

Christof
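
A small sketch of the behavior discussed above (the numbers are only
illustrative): Clear takes Count back to zero while the already-grown
Capacity is kept, and presizing with the capacity constructor avoids the
doubling jump:

using System;
using System.Collections.Generic;

public class ClearDemo
{
    public static void Main()
    {
        List<int> list = new List<int>();
        for (int i = 0; i < 5000; i++)
            list.Add(i);

        // After 5000 Adds the capacity has doubled its way up to 8192.
        Console.WriteLine("Count={0}, Capacity={1}", list.Count, list.Capacity);

        list.Clear();
        // Count is back to zero, but the backing array (Capacity) is kept.
        Console.WriteLine("Count={0}, Capacity={1}", list.Count, list.Capacity);

        // Presizing avoids the "grow to 8 MB when you only need 4 MB + 2 items" jump.
        List<int> presized = new List<int>(5002);
        Console.WriteLine("Capacity={0}", presized.Capacity);
    }
}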
 
Laura T. said:
<snip>

It would perhaps be nice to be able to specify the scaling factor, but 2
is reasonable for most situations.
Reasonable and fast, yes. A hack but reasonable one. It's still strange,
and maybe unfortunate, that the doubling behavior does not seem to be
documented.
The current wording
(http://msdn2.microsoft.com/en-us/library/y52x03h2.aspx) "If Count exceeds
Capacity while adding elements, the capacity is increased by automatically
reallocating the internal array before copying the old elements and adding
the new elements." is misleading.

Still, it's not that hard to roll your own capacity calculator by deriving
a new class from List<T> and overriding the needed methods.
It's not beautiful because it needs to hide methods (new) but it can be
done. For some applications, more linear allocation helps
performance.


The scaling factor is not documented because it's an implementation detail,
MSFT reserves the right to change the factor statically and dynamically,
something they were thinking of (if not yet done in V3.5) for MSIL code
that runs on X64/IA64.

Willy.
 
Willy Denoyette said:
Laura T. said:
Jon Skeet said:
<snip>

The unfortunate thing with List<T> is that the capacity grows one by one.
I'd like to see something like "expand it by 10%", so that the next Add
would not incur reallocations.

Nope, it doubles the capacity each time, as Doug said. Here's code to
show that:

<snip>

It would perhaps be nice to be able to specify the scaling factor, but 2
is reasonable for most situations.
Reasonable and fast, yes. A hack but reasonable one. It's still strange,
and maybe unfortunate, that the doubling behavior does not seem to be
documented.
The current wording
(http://msdn2.microsoft.com/en-us/library/y52x03h2.aspx) "If Count
exceeds Capacity while adding elements, the capacity is increased by
automatically reallocating the internal array before copying the old
elements and adding the new elements." is misleading.

Still, it's not that hard to roll your own capacity calculator by
deriving a new class from List<T> and overriding the needed methods.
It's not beautiful because it needs to hide methods (new) but it can be
done. For some applications, more linear allocation helps
performance.


The scaling factor is not documented because it's an implementation
detail, MSFT reserves the right to change the factor statically and
dynamically, something they were thinking of (if not yet done in V3.5)
for MSIL code that runs on X64/IA64.

Willy.

Let me disagree.
IMHO, any behavior that may have an impact on the resource consumption model
of my application deserves to be documented.
Microsoft may well write "This behavior may change in the future". There is
nothing wrong with it. Good. Go and optimize it. Make it better.

Knowing that the array doubles is a key parameter for understanding the
memory/performance behavior of my code.

And Microsoft already documents implementation details with O(1), O(N),
etc., even though they too can change in the future.

But as I said, it's my opinion.
 
Laura said:
[...]
The scaling factor is not documented because it's an implementation
detail, MSFT reserves the right to change the factor statically and
dynamically, something they were thinking of (if not yet done in V3.5)
for MSIL code that runs on X64/IA64.

Willy.

Let me disagree.

And let me disagree. And agree. With both of you. :)
IMHO, any behavior that may have an impact on the resource consumption
model of my application deserves to be documented.
Microsoft may well write "This behavior may change in the future". There
is nothing wrong with it. Good. Go and optimize it. Make it better.

Knowing that the array doubles is a key parameter for understanding the
memory/performance behavior of my code.

And Microsoft already documents implementation details with O(1), O(N),
etc., even though they too can change in the future.

IMHO, the fact that Microsoft documents the algorithm order is a
commitment to preserve that _behavior_ detail. Yes, it's an outcome of
the implementation, but the exposed behavioral contract for the class,
by virtue of the documentation, includes the algorithm order and since
that's part of the contract it should _not_ change in the future.

On the other hand, obviously the management of the internal storage for
a List<> is an implementation detail. A user of the class should expect
that the class not do anything stupid, like reallocating the array every
single time the capacity changes. But otherwise, the exact scaling
factor at which the storage is reallocated is not really important to
the user of the class. Unlike the algorithm order, assuming any
"non-stupid" scaling of the capacity on reallocation, the overall
performance is actually not going to be affected that much by the
specific choice.

So, that's why I disagree with you and agree with Willy.

On the other hand, I generally prefer more details to fewer. Just
because something is an implementation detail, that doesn't mean that as
the user of a class I am entirely uninterested in the detail. I don't
really expect a complete description of the implementation, and of
course it makes the doc writer's job hard if that implementation should
change in the future.

But the mere fact that something might change in the future shouldn't
necessarily exclude it from being documented. It's a question of
manpower, not technical feasibility. It would be nice if Microsoft's
technical writing staff could come up with some kind of process for
identifying implementation details that can be usefully documented while
still being easily tracked and updated should they change in the future.

The implementation details that are so documented would be very few and
far between. Generally speaking you _don't_ want to document
implementation details. But I'd agree that this particular
implementation detail is likely to be of greater-than-average interest
to the typical .NET programmer.

So, that's why I agree with you and disagree with Willy. :)

Pete
 
Christof said:
I suppose not, because List.Clear doesn't have to clear the items. It
only sets the length to zero.

Why doesn't List.Clear() have to clear the items? If it doesn't,
reference types will still be referenced and can't be garbage collected.

I know that if I clear my List<>, I surely expect it to actually let go
of any references it's kept. It has to actually clear the internal
storage for this to happen. Either with explicit code to clear the
existing storage, or simply reallocating a new version of the storage.
Either way though, it has to do more than just reset the length.
After List.Clear the list would have to be refilled, because it results
in a list with length zero. Not a list with same length and all Elements
zero.

But the length is only "zero" to the client code using the List<> class.
As far as the underlying implementation goes, it's still a valid array.

Pete
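
For what it's worth, here is a quick way to check that expectation
empirically (a throwaway sketch; the names are invented and GC timing is
not strictly guaranteed):

using System;
using System.Collections.Generic;

public class ClearReferencesDemo
{
    public static void Main()
    {
        List<object> list = new List<object>();
        object item = new object();
        WeakReference weak = new WeakReference(item);

        list.Add(item);
        item = null;   // the list now holds the only strong reference

        list.Clear();  // if this really wipes the backing array, the object
                       // becomes unreachable
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        // If Clear only reset the length without clearing the slots, the
        // object would still be reachable and this would print True.
        Console.WriteLine("Still alive: " + weak.IsAlive);
    }
}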
 
Peter Duniho said:
<snip>

IMHO, the fact that Microsoft documents the algorithm order is a
commitment to preserve that _behavior_ detail. Yes, it's an outcome of
the implementation, but the exposed behavioral contract for the class, by
virtue of the documentation, includes the algorithm order and since that's
part of the contract it should _not_ change in the future.

On the other hand, obviously the management of the internal storage for a
List<> is an implementation detail. A user of the class should expect
that the class not do anything stupid, like reallocating the array every
single time the capacity changes. But otherwise, the exact scaling factor
at which the storage is reallocated is not really important to the user of
the class. Unlike the algorithm order, assuming any "non-stupid" scaling
of the capacity on reallocation, the overall performance is actually not
going to be affected that much by the specific choice.

So, that's why I disagree with you and agree with Willy.

On the other hand, I generally prefer more details to fewer. Just because
something is an implementation detail, that doesn't mean that as the user
of a class I am entirely uninterested in the detail. I don't really
expect a complete description of the implementation, and of course it
makes the doc writer's job hard if that implementation should change in
the future.

The point is, where do you draw the line? What details are worthwhile to
describe in an API reference document, and more importantly, what questions
do they try to answer? Don't they belong in some kind of guidelines/best
practices?
Here in this particular case, one could say OK, "the underlying array is
doubled in size when ...", but what if the algorithm turns out to be more
complex or dynamically adjusted, depending on run-time attributes like
memory pressure and usage patterns?
But the mere fact that something might change in the future shouldn't
necessarily exclude it from being documented. It's a question of
manpower, not technical feasibility. It would be nice if Microsoft's
technical writing staff could come up with some kind of process for
identifying implementation details that can be usefully documented while
still being easily tracked and updated should they change in the future.
This is what the MSDN Community Content is meant to solve. Personally, I
think this is a great initiative; the community as a whole can fill in the
holes left by the documentation engine (such things are no longer the
job of technical writers).
The implementation details that are so documented would be very few and
far between. Generally speaking you _don't_ want to document
implementation details. But I'd agree that this particular implementation
detail is likely to be of greater-than-average interest to the typical
.NET programmer.
Not exactly my opinion. I guess that all typical programmers, except a real
newbie, know that most containers (whatever the language/library) have their
hidden costs. If you want to know what they are, start *thinking* about
them; if you want to know the exact resource consumption/performance
impact, go and measure. But *thinking* is the most important thing here:
don't start using container classes blindly (oh, and don't get me started
about generics), select the right containers for the job, think about their
benefits and drawbacks, and do it early in the (learning) process; don't
wait until accidents happen.

Willy.
 
Willy said:
[...]
The point is, where do you draw the line? What details are worthwhile to
describe in an API reference document, and more importantly, what
questions do they try to answer? Don't they belong in some kind of
guidelines/best practices?

Yes, of course that is the point. But where to draw the line is fairly
subjective, and I don't think Laura's desire that in this case the line
be drawn so that the docs are at least more explicit about what
reallocation strategy is used is an unreasonable one.

Wasn't I clear enough that I agree to some extent with what _both_ of
you wrote?

Being a subjective issue, I don't really care to debate it much. It's
mainly a matter of opinion. I just don't think it's necessarily fair to
Laura to dismiss her desire outright. Of all the things that are in
fact obviously outside the behavior contract the class publishes (and
thus would not justify mandatory inclusion in the docs), this seems like
one of the few that would still be of benefit.

If nothing else, it would avoid having people worry about using the
class for fear that it would spend too much time reallocating itself. :)

Pete
 
Peter Duniho said:
Willy said:
[...]
The point is, where do you draw the line? What details are worthwhile to
describe in an API reference document, and more importantly, what questions
do they try to answer? Don't they belong in some kind of guidelines/best
practices?

Yes, of course that is the point. But where to draw the line is fairly
subjective, and I don't think Laura's desire that in this case the line be
drawn so that the docs are at least more explicit about what reallocation
strategy is used is an unreasonable one.

Honestly, I don't really know. True, if this "strategy" were fixed and
"simple", it should have been mentioned, but what if that isn't true for all
implementations, or what if the algorithm were more dynamic or subject to
change in the near future? Note that I don't say it couldn't be
documented somewhere; the question is where and at what level of detail.
Look, there are a lot of classes that do not document their implementation
details. Take for instance the impact of security on certain class methods:
what's the required security context to call some methods? These are things
that become apparent when you run on Vista and higher. Should these details
be documented, and if yes, at what level? "You need administrative privileges
to run ... " is not what we need; more, it's misleading. But what exactly do
we need here?

Wasn't I clear enough that I agree to some extent with what _both_ of you
wrote?

Sure you were, no problem ;-)
Being a subjective issue, I don't really care to debate it much.

I don't think it's a purely subjective issue, but I won't debate it either.
It's mainly a matter of opinion. I just don't think it's necessarily fair to
Laura to dismiss her desire outright. Of all the things that are in fact
obviously outside the behavior contract the class publishes (and thus
would not justify mandatory inclusion in the docs), this seems like one of
the few that would still be of benefit.

If nothing else, it would avoid having people worry about using the class
for fear that it would spend too much time reallocating itself. :)

Well, the point is not (only) the time to reallocate that matters, it's the
memory resource consumption, and this (I think) is why Laura would love
to see it (the algorithm) documented. But a simple notice like "it doubles
the ..." won't answer the real question about resource consumption. It might
even give a wrong impression; for instance, when the array reallocates you
need to consider the size of the old array and the impact of heap
fragmentation. Can/should this all be documented? I guess not. Should we
talk about it? Sure we do.

Willy.
 
Christof Nordiek said:
I suppose not, because List.Clear doesn't have to clear the items. It only
sets the length to zero.


But List.Clear doesn't do that. All it does is dereference whatever is held
in the list.

It's actually more important on 64 bit machines, where references are twice
as long as on 32 bit... wouldn't a List<object> throw an OOM exception twice
as fast on that platform as on 32 bit, just because of the "doubling of size
2GB memory allocation" issue? I'll have to test this Monday...

--
Doug Semler, MCPD
 
Doug Semler said:
But List.Clear doesn't do that. All it does is dereference whatever is
held in the list.

It's actually more important on 64 bit machines, where references are twice
as long as on 32 bit... wouldn't a List<object> throw an OOM exception
twice as fast on that platform as on 32 bit, just because of the "doubling
of size 2GB memory allocation" issue? I'll have to test this Monday...

On 64 bit Windows this throws twice as fast; references are 8 bytes now,
which means that you can't have more than ~268 million references in the
List.
Note that you may have run out of physical memory long before you've reached
this number; the minimum object size on 64 bit is 24 bytes, which means that
a minimum of 6GB is needed to store the 268M objects.

Willy.
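
A back-of-the-envelope check of those numbers (a sketch that simply assumes
the ~2GB single-array limit and the sizes quoted above):

using System;

public class ReferenceLimitEstimate
{
    public static void Main()
    {
        // Rough numbers from the discussion: a single array is limited to
        // ~2GB, and on 64 bit each reference in a List<object> costs 8 bytes.
        long maxArrayBytes = 2L * 1024 * 1024 * 1024;   // ~2GB array size limit
        long bytesPerReference = 8;                     // 64 bit reference
        long minObjectSize = 24;                        // minimum object size on 64 bit

        long maxReferences = maxArrayBytes / bytesPerReference;
        Console.WriteLine("Max references in the backing array: ~{0:N0}", maxReferences);

        // If every reference points to a distinct (minimal) object:
        Console.WriteLine("Memory for that many distinct objects: ~{0:N1} GB",
            maxReferences * minObjectSize / (1024.0 * 1024 * 1024));
    }
}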
 
Note that you may have run out of physical memory long before you've reached
this number, the minimum object size on 64 bit is 24 bytes, which means that
a minimum of 6GB is needed to store the 268M objects.

Only if each reference is to a different object. You may well have a
list which only has references to 10 *distinct* objects, but still has
a length in the millions.

I'm not saying this is a common scenario, of course :)
 
Jon Skeet said:
Only if each reference is to a different object. You may well have a
list which only has references to 10 *distinct* objects, but still has
a length in the millions.
Absolutely right.
I'm not saying this is a common scenario, of course :)

I would suggest a design revision if I had to deal with this :)

Willy.
 
Peter Duniho said:
Why doesn't List.Clear() have to clear the items? If it doesn't,
reference types will still be referenced and can't be garbage collected.

It doesn't need to zero each pointer, just make the whole mess unreachable.
It does that by clearing its reference to the entire array of pointers --
O(1), not O(N).
 
Jon Skeet said:
Only if each reference is to a different object. You may well have a
list which only has references to 10 *distinct* objects, but still has
a length in the millions.

I'm not saying this is a common scenario, of course :)

Sounds like Flyweight pattern to me, so maybe not all that uncommon.
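
For instance, a tiny sketch of that kind of flyweight-style list, where
millions of entries all reference the same handful of distinct objects (the
names and counts are invented for illustration):

using System;
using System.Collections.Generic;

public class FlyweightListSketch
{
    public static void Main()
    {
        // Only 10 distinct strings are ever allocated...
        string[] distinct = new string[10];
        for (int i = 0; i < distinct.Length; i++)
            distinct[i] = "shared value " + i;

        // ...but the list holds millions of references to them.
        List<string> list = new List<string>(5000000);
        for (int i = 0; i < 5000000; i++)
            list.Add(distinct[i % distinct.Length]);

        // Memory cost is dominated by the 5 million references in the
        // backing array, not by the 10 shared string instances.
        Console.WriteLine("Count={0}, distinct objects={1}", list.Count, distinct.Length);
    }
}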
 
But List.Clear doesn't do that. All it does is dereference whatever is
held in the list.

I would hope it doesn't *dereference* any list entries. I think I know what
you meant (cease referencing), but that isn't what "dereference" means.
 
It doesn't need to zero each pointer, just make the whole mess unreachable.
It does that by clearing its reference to the entire array of pointers --
O(1), not O(N).

It doesn't do that though - it clears the existing array (and keeps
it), rather than just "forgetting" about the array.
It *could* either create a new array with the same capacity or
possibly just reset the capacity to 0, either way forgetting about the
old array, but it happens not to: it calls Array.Clear (specifying its
current size as the number of items to wipe out - so calling
List<T>.Clear twice in a row is quicker the second time).

Jon
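
Based on that description, the essential shape of Clear is roughly as
follows. This is a simplified sketch of the idea, not the actual framework
source; the field names (_items, _size) are invented:

using System;

// Simplified sketch of the behavior described above.
public class SimplifiedList<T>
{
    private T[] _items = new T[4];
    private int _size;

    public int Count { get { return _size; } }
    public int Capacity { get { return _items.Length; } }

    public void Clear()
    {
        // Wipe the slots that were in use so the old references become
        // unreachable (the Array.Clear call mentioned above)...
        Array.Clear(_items, 0, _size);

        // ...but keep the backing array itself, so Capacity is unchanged
        // and a second Clear in a row has nothing left to wipe.
        _size = 0;
    }
}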
 
Sounds like Flyweight pattern to me, so maybe not all that uncommon.

Using flyweights isn't uncommon - I'd suggest that having lists with
millions of such entries is reasonably uncommon though.

Jon
 