WeakHashSet?

Marcel Müller

While it is quite trivial to have a dictionary where the values are weak
references (OK, one has to deal with the orphaned keys), it is not that
easy for a hash set, because the keys must not be mutable.

Does anyone have information on how to get a set where the keys are weak
references?


Marcel
 
I'm not really sure I understand your question. Keys cannot be mutable
in a dictionary either, in the sense that mutable data must not
contribute to the comparison of keys. The same rule applies to any other
hash-table data structure.

But WeakReference uses the default object comparison, which doesn't take
into account any of the contents of the object, so mutability shouldn't
be an issue in a dictionary or any other hash-table data structure (e.g.
HashSet<T>).
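
For illustration (just a throwaway sketch, the names are arbitrary):

using System;

static class WeakReferenceEqualityDemo
{
    static void Main()
    {
        string target = new string('x', 3);
        WeakReference a = new WeakReference(target);
        WeakReference b = new WeakReference(target);

        // WeakReference does not override Equals/GetHashCode, so two wrappers
        // around the very same target still compare as different objects.
        Console.WriteLine(a.Equals(b));           // False
        Console.WriteLine(ReferenceEquals(a, b)); // False
    }
}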

How are you trying to use the weak references such that they don't cause
a problem in a dictionary, but do in a hash set?

Well, a hash set consists only of keys. WeakReferences cannot reasonably
be used as keys because they are mutable, i.e. they can lose their
target. OK, I could write an EqualityComparer that compares false for
any non-existing target, but what about GetHashCode? I can't provide a
reasonable hash code for the dead Target of a WeakReference. And since
the hash code of a HashSet item must not change, I have a problem once
the target gets collected.

Using WeakReference instances themselves as keys should work fine.

Of course, but instance equality of WeakReferences is most likely not
very useful.
That
may not necessarily solve your problem, but if not you should post a
clearer question.

I have an application with a lot of memory objects. There are many
duplicates by value. This is a result of deserialization of XML data, so
they do not share the same instance, because the redundancy is already
part of the XML data. To conserve memory (in practice about a factor of
20) I want to convert value equality into instance equality.

For this purpose I have something like a HashSet. It differs from the
built-in HashSet only in that I can extract instance references to the
stored objects.
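
Roughly like the following sketch, for the moment with strong references
(the class name is made up; the weak-reference variant is exactly the
open question):

using System.Collections.Generic;

// Hypothetical sketch of such a table, still with strong references.
// A Dictionary<T, T> is used because, unlike the built-in HashSet<T>,
// it can hand back the stored instance for a value-equal key.
public sealed class InternTable<T> where T : class
{
    private readonly Dictionary<T, T> _items;

    public InternTable(IEqualityComparer<T> comparer = null)
    {
        _items = new Dictionary<T, T>(comparer ?? EqualityComparer<T>.Default);
    }

    // Returns the canonical instance for a value-equal key, storing
    // the given value if no such instance exists yet.
    public T Intern(T value)
    {
        if (value == null)
            return null;
        T existing;
        if (_items.TryGetValue(value, out existing))
            return existing;        // reuse the already stored instance
        _items.Add(value, value);   // first occurrence becomes the canonical one
        return value;
    }
}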

Mostly I am talking about strings, but sometimes also about larger
subtrees of data. In fact it is something like string.Intern. But since
the application runs 24/7 and occasionally singular strings may get into
the set temporarily, string.Intern is not an option; it would cause a
slowly growing memory balloon. Secondly, I have to deal with other
reference types (generic type parameter).

So I need to handle a set of weak references to immutable objects.


Marcel
 
All of the above applies both to dictionaries and plain hash sets.

I ask again: how are you trying to use the weak references such that
they don't cause a problem in a dictionary, but do in a hash set?

Weak references work reasonably as /values/ of a dictionary. I have used
hash tables with strong keys and weak values so far. I have never needed
weak keys in the past, so I am looking for ideas on how to deal with the
weak keys.

[...]
I have an application with a lot of memory objects. There are many
duplicates by value. This is a result of deserialization of XML data, so
they do not share the same instance, because the redundancy is already
part of the XML data. To conserve memory (in practice about a factor of
20) I want to convert value equality into instance equality.

It seems to me as though you are overengineering. From the above
description, I understand that the duplicates occur during
deserialization. It seems to me that you need the dictionary to track
instances to exist only long enough to completely deserialize the graph
found in the XML. There's no need for weak references; just discard the
dictionary after deserialization is done.

Deserialization takes place at any time. There is not just one XML
document; there are several thousand, each of them in several versions.
They are deserialized on demand and discarded after some time without
use - in fact they are held in an ordinary cache. To make the cache much
more effective, I need to compact its content. Reusing strings and other
objects significantly extends the effective cache size. The temporaries
of the deserialization are discarded quickly by the generation 0 GC, so
I see only a short spike in memory after deserialization of some
thousand complex objects. But currently I have strong references in the
dictionary, so the dictionary grows slowly and constantly. No problem
for a few hours, but for uptimes of some weeks this will be an issue.


Marcel
 
So rebuild the dictionary periodically, based on objects still currently
in use. Or just always rebuild it before deserialization; simpler to do
that, and in many scenarios would be "efficient enough".

The bottom line: objects used as keys in a hash table necessarily need
to remain in memory. The hash table cannot possibly work any other way.
So a hash set — that is, a hash table containing only the weak
references and using data in the target object as the key — is impossible.

Are you sure about that point?

AFAIK a hash table uses GetHashCode only when inserting elements, when
looking up elements and, of course, when the number of buckets changes.
In the first two cases I still have a strong reference to the key. In
the third case there is no need to keep the orphaned weak reference.
Additionally, a hash table uses equality comparison to locate elements
in a chosen bucket. For this purpose I could provide a comparer that
always compares false against dead references.
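
A sketch of what such an entry plus comparer could look like (all names
are invented, and it only works if the hash code is captured while the
target is still strongly reachable):

using System;
using System.Collections.Generic;

// The hash code of the target is captured while a strong reference still
// exists, so a dead target never needs to be re-hashed.
public sealed class WeakEntry<T> where T : class
{
    private readonly WeakReference _weak;
    private readonly int _hash;

    public WeakEntry(T target)
    {
        _weak = new WeakReference(target);
        _hash = EqualityComparer<T>.Default.GetHashCode(target);
    }

    public T Target { get { return (T)_weak.Target; } }   // null once collected
    public int CachedHash { get { return _hash; } }
}

// Comparer that treats any entry with a dead target as equal to nothing.
// Note that this deliberately bends the reflexivity contract of equality,
// which is exactly the risk being discussed here.
public sealed class WeakEntryComparer<T> : IEqualityComparer<WeakEntry<T>>
    where T : class
{
    public bool Equals(WeakEntry<T> x, WeakEntry<T> y)
    {
        T tx = x == null ? null : x.Target;
        T ty = y == null ? null : y.Target;
        if (tx == null || ty == null)
            return false;                       // dead entries never match anything
        return EqualityComparer<T>.Default.Equals(tx, ty);
    }

    public int GetHashCode(WeakEntry<T> entry)
    {
        return entry.CachedHash;                // stable even after collection
    }
}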

You can maintain a dictionary with values as weak references, as you
seem possibly to be doing now, but the cached objects would have to be
quite a bit larger than the basic keys themselves to make it worth the
added layer of complexity and even to improve on memory usage.

I do not agree. You could improve memory usage solely by turning value
equality of immutable objects into instance equality. OK, the built-in
hash table classes do not provide access to the keys, but this is no
showstopper.

Even if the dictionary of keys/weak reference values was a win in that
analysis, you still have the periodically required compaction where you
search the dictionary for dead weak references and remove them.

This part of the job is shared by all .NET collections containing weak
references. At least the alternative, using objects with a finalizer
that removes them from the collection, is even worse in general.
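
For the existing weak-value dictionaries that purge pass is at least
straightforward; something along these lines (names invented):

using System;
using System.Collections.Generic;
using System.Linq;

static class WeakValueCleanup
{
    // Removes every entry whose WeakReference value no longer has a live
    // target. The dead keys are collected first because the dictionary
    // must not be modified while it is being enumerated.
    public static int PurgeDead<TKey>(IDictionary<TKey, WeakReference> table)
    {
        List<TKey> dead = table
            .Where(kv => !kv.Value.IsAlive)
            .Select(kv => kv.Key)
            .ToList();

        foreach (TKey key in dead)
            table.Remove(key);

        return dead.Count;
    }
}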

So for simpler, easier-to-maintain code where the performance
characteristics and break-even points are easier to define, just forget
about the weak references and rebuild your dictionary periodically.

Hmm, maybe this is an option. But how do I rebuild the index? Looking
at the old index, I cannot tell which keys are no longer used by
anything other than the index itself.

Do I need to traverse all references in the entire object tree? That's
expensive, and I am unsure whether this is possible without incredibly
slow access by reflection.


Marcel
 
(And note that at least for the built-in types, where you aren't offered
the opportunity to discard entries during reallocation: while you can
provide an IEqualityComparer<T> implementation that supplies custom hash
codes and returns something different for dead targets, changing the
hash code even during reallocation assumes a) that the hash code is
retrieved only once during reallocation, making your code very
implementation-dependent even assuming that's a valid assumption today,
and b) that you don't mind interfering with the distribution heuristics
inherent in the hash table implementation, potentially deoptimizing its
performance strategies.)

I prefer not to use this hack with changing hash codes.

In any case, a custom hash table implementation raises the design,
implementation, and maintenance costs significantly over an
implementation that uses the built-in types. I see that as a major
drawback, and IMHO you should too.

Well, it will not be the first custom container. But you are right, a
hash table is in another weight category. Usually I start with the
decompiled code of the framework.

Btw., I just had a look at the HashSet implementation. The current
implementation never calls GetHashCode twice for the same node. It calls
it once on add and stores the result together with the value, even in
the case of a reallocation. Of course, there is no guarantee for that.

I will also point out that even in a custom implementation, there are
certain potential unintended consequences. For example, even a custom
implementation such as you seem to anticipate will necessarily have to
take a strong reference for objects during comparisons.

Assigning from WeakReference.Target always returns a strong reference.

If GC happens at
just the wrong moment (and rest assured, if something can go wrong, it
will), an object that was nominally unreachable but which became
reachable during the moment that GC was occurring will wind up not only
not collected when it needed to be, but promoted to an older generation,
further delaying its collection and causing greater overhead.

You are talking about resurrection? AFAIK this could only happen with
long weak references.
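
For reference, the difference between the two kinds is just the second
constructor argument (a tiny illustration):

using System;

static class WeakReferenceKinds
{
    static void Demo(object obj)
    {
        // Short weak reference (the default): Target becomes null as soon
        // as the object has been collected.
        WeakReference shortRef = new WeakReference(obj);

        // Long weak reference: tracks the object even through finalization,
        // which is where resurrection comes into play.
        WeakReference longRef = new WeakReference(obj, trackResurrection: true);
    }
}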

Sorry, not sure what you mean here. My comment is assuming the built-in
types, which I think are not safe to use with mutable keys, even in a
restrained way.

Yes, the built-in types can't handle weak references to the keys. But
with strong references they already save a lot of memory now.

The GC traverses _all_ references in your _process_ every time it
collects objects. It can't be _that_ bad to do, at least only
periodically, and any overhead is significantly mitigated by the fact
that you're doing it much less frequently than the GC, and on fewer
objects.

I guess the GC has significantly more efficient ways to access all
fields than user code has. Most likely it is not even implemented in
.NET but as unsafe external C++ code instead.

Do it once a day, and you'll probably never know the difference,
especially if you have some way of identifying low-demand times when
such an operation is less likely to interfere with user experience. You
stated that the memory consumption becomes an issue only after a few
weeks, so surely cleaning up once a day is fine…

It was only a rough guess. I have no production environment with
reliable statistics so far. But maybe once a day is sufficient.

As for "incredibly slow access by reflections", there's nothing in your
question so far that suggests reflection is needed at all. The code to
build the cache needs specific knowledge of the type of object involved,
but that's assumed anyway.

The core engine does not have much knowledge about the structure of the
entities used. In fact they are in another assembly that is not
referenced, so the engine cannot access property values directly.
The deduplication is provided as a service. The deserializer is the only
point where this service is consumed. This is some ugly T4 code because
of some missing features in the C# language.

I do not want to implement the necessary code to traverse the property
values by hand. It is very unlikely that all developers will remember to
add new properties to the required code. Some more ugly T4 code might
work around this, but T4 is always a maintenance nightmare.


Marcel
 
That's my point. The code that is retrieving key data from the target
object is necessarily going to have a strong reference during that period.


No, I am not talking about resurrection. I am talking about what happens
when the GC winds up happening just at the same moment your code is
holding a strong reference to the target object of a weak reference (due
to the need to compare for equality the key data of an object that is
still live).

This cannot happen; it would be a race condition in the CLR.

Either the GC is first, which implies that the Target is already null,
or my code is first, which implies that the GC sees my strong reference
and will keep the object alive.
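
That is the usual pattern of copying Target into a local strong
reference before testing it; a minimal sketch:

using System;

static class WeakReferenceUsage
{
    // Copy Target into a local strong reference first; checking IsAlive
    // and then reading Target in a second step would be the actual race.
    static void TryUse(WeakReference weak, Action<object> use)
    {
        object strong = weak.Target;   // null (already collected) or a strong reference
        if (strong != null)
            use(strong);               // the GC now sees the local and keeps the object
    }
}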

Then the stored type must have a significant amount of data besides that
used for the hash key. That's all I'm saying.

No, it has no additional data at all. Just the key.

The trick is to reduce redundant keys in memory. From deserialization I
may get thousands of identical strings, most of them rather short. Each
of them has its own char[] array, but their values compare equal. With
the global key table I replace all of these strings by a single
instance, discarding all the others. Now the amount of memory used by
these thousands of objects is reduced to one object plus thousands of
references to this single instance. It's like a multiton.
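
In code terms, reusing the hypothetical InternTable<T> sketched earlier
in the thread:

using System;

static class DeduplicationDemo
{
    static void Demo()
    {
        var strings = new InternTable<string>(StringComparer.Ordinal);

        // Every deserialized string runs through the table; value-equal
        // duplicates collapse onto one canonical instance.
        string a = strings.Intern(new string('x', 8));
        string b = strings.Intern(new string('x', 8));
        Console.WriteLine(ReferenceEquals(a, b));   // True: one instance survives
    }
}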

A similar argument applies to other immutable objects, not just
strings. And since almost all of my business objects in the application
are immutable, this is often useful.

I doubt traversing your graph yourself is significantly costlier
performance-wise than the cost the GC incurs traversing the graph of
_all_ reachable objects, and I think it's quite plausible it's less
costly, at least if you can implement it without reflection.

In any case, if you care about performance, you can't guess. You need to
implement and measure.

That's true.
But as a rule of thumb I keep in mind that calling .Invoke on a
MethodInfo or .GetValue on a PropertyInfo should not be done frequently.

Even if you have the need to generalize as the GC does, you can cache
the slow parts of reflection to get good performance. For example,
maintain a dictionary of PropertyInfo (or FieldInfo) objects as values,
with the type itself as the key. Then you can quickly retrieve the
relevant data from an object during traversal.
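
A sketch of that kind of cache (names are invented):

using System;
using System.Collections.Concurrent;
using System.Reflection;

static class PropertyCache
{
    // One reflection pass per type; afterwards the PropertyInfo array is reused.
    private static readonly ConcurrentDictionary<Type, PropertyInfo[]> Cache =
        new ConcurrentDictionary<Type, PropertyInfo[]>();

    public static PropertyInfo[] GetProperties(Type type)
    {
        return Cache.GetOrAdd(
            type,
            t => t.GetProperties(BindingFlags.Public | BindingFlags.Instance));
    }

    // Reading the values during traversal still goes through GetValue,
    // which is the expensive part discussed below.
    public static object[] ReadAll(object instance)
    {
        PropertyInfo[] props = GetProperties(instance.GetType());
        object[] values = new object[props.Length];
        for (int i = 0; i < props.Length; i++)
            values[i] = props[i].GetValue(instance, null);
        return values;
    }
}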

I already have this kind of information. But it is not sufficient,
because I still have to call GetValue, which is quite expensive.

To be fast I need to use the delegates of the getters directly, without
.Invoke. This is usually a problem because of the lack of compile-time
type information. Maybe in this special case I can get around this,
because I only have to deal with a strictly limited number of types,
which are generic type parameters of the deduplication helper class.
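
One way that can work when the types are available as generic parameters
is to bind the getter to a typed delegate once; a sketch with invented
names:

using System;
using System.Reflection;

static class GetterFactory
{
    // Binds the property getter to a strongly typed open delegate once,
    // so later reads bypass PropertyInfo.GetValue / MethodInfo.Invoke.
    public static Func<TOwner, TProperty> CreateGetter<TOwner, TProperty>(string propertyName)
    {
        PropertyInfo property = typeof(TOwner).GetProperty(propertyName);
        MethodInfo getter = property.GetGetMethod();
        return (Func<TOwner, TProperty>)Delegate.CreateDelegate(
            typeof(Func<TOwner, TProperty>), getter);
    }
}

// Usage: create the delegate once, then call it like any other method.
//   Func<string, int> getLength = GetterFactory.CreateGetter<string, int>("Length");
//   int n = getLength("hello");   // no reflection on this call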

Note also that as an alternative to reflection, you can in fact require
maintainers of client code to provide a traversal helper (as an
interface implementation, delegate callback, whatever), and then in a
debug-only build, maintain a reflection-based dictionary that can be
used to compare the results of the traversal helper with your own
reflection-based code (i.e. double-check that the set of graph nodes
returned by the client implementation is exactly the same as those
returned by your own reflection-based implementation).

I do not want to go this way. These kinds of checks are really
expensive, and the delay in DEBUG builds would keep the developers from
getting their work done.

I actually doubt that the performance will be bad enough with a
cached-reflection implementation, especially for a process run only
periodically, to justify the added complexity of having two different
implementations (i.e. the client-provided and the reflection-based). But
if it actually is, that could be an appropriate way to address both the
performance and correctness questions at once.

As long as the assumption holds that one run per day is sufficient, you
will most likely be right, because this application is mainly designed
for 10/6 operation (not 24/7). It may also force a generation 2 GC after
completion.
But if I need to do this cleanup more frequently, e.g. once an hour, I
may run into trouble, because it might not be that easy to implement
this as a concurrent job without a global lock and, first of all,
without a race condition.

I think I would prefer a custom weak hash table class in this case,
since tracking down race conditions is trickier than testing a hash
table.


Marcel
 