That's my point. The code that is retrieving key data from the target
object is necessarily going to have a strong reference during that period.
No, I am not talking about resurrection. I am talking about what happens
when the GC winds up happening just at the same moment your code is
holding a strong reference to the target object of a weak reference (due
to the need to compare for equality the key data of an object that is
still live).
This cannot happen. It would be a race condition in the CLR.
Either the GC runs first, in which case the target is already null; or
my code runs first, in which case the GC sees my strong reference and
keeps the object alive.
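A minimal sketch of that ordering guarantee in C# (the `WeakCacheEntry` class and `TryMatch` name are illustrative, not from the code under discussion):

```csharp
using System;

// Illustrative helper: holds a weak reference and compares its target
// against a candidate key.
class WeakCacheEntry
{
    private readonly WeakReference _target;

    public WeakCacheEntry(object target)
    {
        _target = new WeakReference(target);
    }

    public bool TryMatch(object key, out object match)
    {
        // Reading Target yields either null (already collected) or a
        // strong reference. From this point on the GC must keep
        // 'candidate' alive, so the equality comparison below can never
        // race with collection.
        object candidate = _target.Target;
        if (candidate != null && candidate.Equals(key))
        {
            match = candidate;
            return true;
        }
        match = null;
        return false;
    }
}
```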
Then the stored type must have a significant amount of data besides that
used for the hash key. That's all I'm saying.
No, it has no additional data at all. Just the key.
The trick is to reduce redundant keys in memory. From deserialization I
may get thousands of identical strings, most of them rather short. Each
of them has its own char[] array, but their values compare equal. With
the global key table I replace all of these strings by a single
instance, discarding all the others. The memory used by these thousands
of objects then shrinks to one object plus thousands of references to
that single instance. It's like a multiton.
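The interning scheme described above can be sketched like this (names are hypothetical; the real table uses weak references so unused keys can be collected, which is omitted here for brevity):

```csharp
using System.Collections.Generic;

// The first instance of each distinct value becomes the canonical one;
// later equal instances are discarded in favor of it.
class KeyTable<T> where T : class
{
    private readonly Dictionary<T, T> _canonical = new Dictionary<T, T>();

    public T Intern(T value)
    {
        if (_canonical.TryGetValue(value, out T existing))
            return existing;           // duplicate: reuse the single instance
        _canonical[value] = value;     // first occurrence becomes canonical
        return value;
    }
}
```

After interning, thousands of equal strings collapse into references to one instance, so only one char[] buffer stays live.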
Similar argumentation can be applied to other immutable objects, not
just strings. And since almost all of my business objects in the
application are immutable, this is often useful.
I doubt traversing your graph yourself is significantly costlier
performance-wise than the cost the GC incurs traversing the graph of
_all_ reachable objects, and I think it's quite plausible it's less
costly, at least if you can implement it without reflection.
In any case, if you care about performance, you can't guess. You need to
implement and measure.
That's true.
But as a rule of thumb I keep in mind that calling .Invoke on a
MethodInfo or .GetValue on a PropertyInfo should not be done frequently.
Even if you have the need to generalize as the GC does, you can cache
the slow parts of reflection to get good performance. For example,
maintain a dictionary of PropertyInfo (or FieldInfo) objects as values,
with the type itself as the key. Then you can quickly retrieve the
relevant data from an object during traversal.
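A sketch of that caching scheme (the class and method names are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Caches the reflection metadata once per type, so traversal pays the
// GetProperties lookup cost only on first contact with each type.
static class TraversalCache
{
    private static readonly Dictionary<Type, PropertyInfo[]> _byType =
        new Dictionary<Type, PropertyInfo[]>();

    public static PropertyInfo[] PropertiesOf(Type type)
    {
        if (!_byType.TryGetValue(type, out PropertyInfo[] props))
        {
            props = type.GetProperties(
                BindingFlags.Public | BindingFlags.Instance);
            _byType[type] = props;
        }
        return props;
    }
}
```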
I already have this kind of information. But it is not sufficient,
because I still have to call GetValue, which is quite expensive.
To be fast I need to call the getter delegates directly, without
.Invoke. This is usually a problem because of the lack of compile-time
type information. Maybe in this special case I can get around it,
because I only have to deal with a strictly limited number of types,
which are the generic type parameters of the deduplication helper class.
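One way to build such a delegate — possible here precisely because the element type is available as a generic parameter — is Delegate.CreateDelegate; the Person class below is only an example:

```csharp
using System;
using System.Reflection;

class Person
{
    public string Name { get; set; }
}

static class Getters
{
    // Builds a strongly typed getter delegate once; calling it afterwards
    // avoids the per-call overhead of PropertyInfo.GetValue / .Invoke.
    public static Func<T, TProp> For<T, TProp>(string propertyName)
    {
        MethodInfo getter = typeof(T).GetProperty(propertyName).GetGetMethod();
        return (Func<T, TProp>)Delegate.CreateDelegate(
            typeof(Func<T, TProp>), getter);
    }
}
```

Usage: `var getName = Getters.For<Person, string>("Name");` — afterwards `getName(p)` runs at close to direct-call speed.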
Note also that as an alternative to reflection, you can in fact require
maintainers of client code to provide a traversal helper (as an
interface implementation, delegate callback, whatever), and then in a
debug-only build, maintain a reflection-based dictionary that can be
used to compare the results of the traversal helper with your own
reflection-based code (i.e. double-check that the set of graph nodes
returned by the client implementation is exactly the same as those
returned by your own reflection-based implementation).
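A sketch of that debug-only cross-check (the interface and method names are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Client code supplies the fast traversal; the reflection-based
// reference implementation is passed in as a delegate.
interface IGraphTraverser
{
    IEnumerable<object> GetChildren(object node);
}

static class TraversalVerifier
{
    // Compiled away entirely in release builds.
    [Conditional("DEBUG")]
    public static void CrossCheck(IGraphTraverser client, object node,
        Func<object, IEnumerable<object>> reflectionBased)
    {
        var expected = new HashSet<object>(reflectionBased(node));
        var actual = new HashSet<object>(client.GetChildren(node));
        Debug.Assert(expected.SetEquals(actual),
            "Client traversal returned a different node set than reflection");
    }
}
```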
I do not want to go that way. These kinds of checks are really
expensive, and the delay in DEBUG builds would keep the developers from
doing their work.
I actually doubt that the performance will be bad enough with a
cached-reflection implementation, especially for a process run only
periodically, to justify the added complexity of having two different
implementations (i.e. the client-provided and the reflection-based). But
if it actually is, that could be an appropriate way to address both the
performance and correctness questions at once.
As long as the assumption holds that one run per day is sufficient, you
are most likely right, because this application is mainly designed for
10/6 operation (not 24/7). It may also force a generation 2 GC after
completion.
But if I need to do this cleanup more frequently, e.g. once an hour, I
may run into trouble, because it might not be easy to implement it as a
concurrent job without a global lock and, above all, without race
conditions.
I think I would prefer a custom weak hash table class in this case,
since tracking down race conditions is trickier than testing a hash
table.
Marcel