Hashing and comparing of arrays

Ole Nielsby

How does the GetHashCode() of an array object behave?

Does it combine the GetHashCode() of its elements, or does
it create a sync block for the object?

I want to use readonly arrays as dictionary keys, based on
their content, not their identity. Is this feasible using the
arrays directly, or do I need to wrap them in a struct that
handles GetHashCode and Equals? If so, is such a wrapper
present in the standard class library?

My app will have hundreds or maybe thousands of these
dictionaries, some of which will contain only a single
array-key/value entry, while others will have thousands
of entries. The number of entries in a particular dictionary
cannot be predicted or estimated reliably in advance. The
values associated with the arrays will be WeakReference
objects, each pointing to an object with a finalizer that will
remove the entry from the containing dictionary.

Would the Dictionary<SomeClass[], WeakReference>
generic collection be suitable for this, or do I need to
spice it up?
 
Jon Skeet [C# MVP]

Ole Nielsby said:
How does the GetHashCode() of an array object behave?

Does it combine the GetHashCode() of its elements, or does
it create a sync block for the object?

Neither, as far as I'm aware. I believe it doesn't override GetHashCode
at all, so the implementation in Object is used.
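A quick way to check this (my own snippet, not from the posts): two arrays with identical contents behave as completely unrelated objects under the inherited Object members:

```csharp
using System;

class ArrayHashDemo
{
    static void Main()
    {
        int[] a = { 1, 2, 3 };
        int[] b = { 1, 2, 3 };
        // Arrays inherit Object's identity-based Equals/GetHashCode,
        // so identical contents do not make them compare equal:
        Console.WriteLine(a.Equals(b));                        // False
        Console.WriteLine(a.GetHashCode() == b.GetHashCode()); // almost certainly False
    }
}
```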
I want to use readonly arrays as dictionary keys, based on
their content, not their identity. Is this feasible using the
arrays directly, or do I need to wrap them in a struct that
handles GetHashCode and Equals? If so, is such a wrapper
present in the standard class library?

You'll need to wrap them, I believe. (Any reason for using a struct
rather than a class though?)
My app will have hundreds or maybe thousands of these
dictionaries, some of which will contain only a single
arraykey-value entry, while others will have thousands
of entries. The number of entries in a particular dictionary
cannot be predicted or estimated reliably in advance. The
values associated with the arrays will be WeakReference
objects, each pointing to an object with a finalizer that will
remove the entry from the containing dictionary.

Would the Dictionary<SomeClass[], WeakReference>
generic collection be suitable for this, or do I need to
spice it up?

I'd consider using a HybridDictionary - it's likely to be much more
efficient for the smaller ones. Unfortunately, however, I don't believe
there's a generic version :(
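For reference, a sketch of what using the non-generic HybridDictionary (from System.Collections.Specialized) looks like; it stores plain objects, so casts are needed on retrieval:

```csharp
using System;
using System.Collections.Specialized;

class HybridDemo
{
    static void Main()
    {
        // HybridDictionary uses a ListDictionary while the collection is
        // small and switches to a Hashtable once it grows.
        HybridDictionary d = new HybridDictionary();
        d["one"] = 1;
        d["two"] = 2;
        int v = (int)d["two"]; // values come back as object, hence the cast
        Console.WriteLine(v);  // prints 2
    }
}
```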
 
Ole Nielsby

Jon Skeet said:
Neither, as far as I'm aware. I believe it doesn't override
GetHashCode at all, so the implementation in Object is used.

I may have got this wrong but I think the Object::GetHashCode
implementation will create a sync block if there isn't already
one.
You'll need to wrap them, I believe. (Any reason for using a
struct rather than a class though?)

There will be large numbers of these arrays, and wrappers
would be unnecessary memory bloat. A generic collection
used with a struct will, if I understand generics right, be able
to store the array pointers as efficiently as if I used them
directly, while using the overrides defined by my struct when
hashing and comparing.
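A minimal sketch of such a struct wrapper (the name ArrayKey and the hash combiner below are mine, not from the standard library; it assumes the wrapped arrays are never mutated):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical struct wrapper: holds only the array reference, so a
// generic Dictionary<ArrayKey<T>, V> stores one pointer per key while
// hashing and comparing by content.
struct ArrayKey<T> : IEquatable<ArrayKey<T>>
{
    public readonly T[] Array;
    public ArrayKey(T[] array) { Array = array; }

    public bool Equals(ArrayKey<T> other)
    {
        if (Array.Length != other.Array.Length) return false;
        for (int i = 0; i < Array.Length; i++)
            if (!EqualityComparer<T>.Default.Equals(Array[i], other.Array[i]))
                return false;
        return true;
    }

    public override bool Equals(object obj)
    {
        return obj is ArrayKey<T> && Equals((ArrayKey<T>)obj);
    }

    public override int GetHashCode()
    {
        int h = 17; // arbitrary seed; any multiply-and-add combiner works here
        foreach (T t in Array)
            h = h * 31 + EqualityComparer<T>.Default.GetHashCode(t);
        return h;
    }
}
```

A Dictionary<ArrayKey<SomeClass>, WeakReference> then stores one reference per entry while the generic collection calls the struct's overrides directly, without boxing.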
I'd consider using a HybridDictionary - it's likely to be much
more efficient for the smaller ones. Unfortunately, however,
I don't believe there's a generic version :(

Well then, I guess I will google up an open source generic
HybridDictionary-like thing - I'm pretty sure it exists somewhere,
I just hoped I could make do with the standard library.

Regards/Ole N.
 
Jon Skeet [C# MVP]

Ole Nielsby said:
I may have got this wrong but I think the Object::GetHashCode
implementation will create a sync block if there isn't already
one.

Sync blocks and GetHashCode are unrelated. The first is to do with
locking; the second is to do with hashing. What makes you think they're
related?
There will be large numbers of these arrays, and wrappers
would be unnecessary memory bloat. A generic collection
used with a struct will, if I understand generics right, be able
to store the array pointers as efficiently as if I used them
directly, while using the overrides defined by my struct when
hashing and comparing.

Yes, if you use generics you should be okay.
Well then, I guess I will google up an open source generic
HybridDictionary-like thing - I'm pretty sure it exists somewhere,
I just hoped I could make do with the standard library.

It sounds like you do need generics.
 
Ole Nielsby

Jon Skeet said:
Having just googled this, it appears that under v1.0 and v1.1,
Object.GetHashCode does indeed create a sync block. It's incidental,
however - and the blog linked below suggests that v2.0 may not
behave in the same way.

http://blogs.msdn.com/brada/archive/2003/09/30/50396.aspx

I think what the blog suggests is, you cannot assume that different
objects with non-overridden GetHashCode() always return different
hash codes even though this may be the case under v1.x and some
MS developers were caught using it.

I'll bet a beer that Object.GetHashCode still creates a sync block in v2.
There is really nowhere else to store the hash code, unless they'd take
the trouble of storing it in the sync lock slot in a form distinguishable
from a real sync lock, then, in case a sync block is created later, move
the hash code to a slot in the sync block. This somehow seems bad
style though, and I don't think the clr forgers will (or should) do
this to facilitate some exotic uses of Object.GetHashCode() which is
supposed to be overridden anyway.

Regards/Ole N.
 
Lucian Wischik

Ole Nielsby said:
I want to use readonly arrays as dictionary keys, based on
their content, not their identity. Is this feasible using the
arrays directly, or do I need to wrap them in a struct that
handles GetHashCode and Equals? If so, is such a wrapper
present in the standard class library?

Incidentally, I was also looking to hash a list based on contents
recently. I found "lookup2" by Bob Jenkins, and translated it into C#.
Maybe this will be helpful. (or maybe there already exist better C#
hash implementations elsewhere that I hadn't found...)


/// <summary>
/// Produces a hash out of a list of hash-values.
/// This is a C# implementation of the "lookup2" hash of Bob Jenkins (1996).
/// Licensing conditions for lookup2: Bob Jenkins writes, "You may use this
/// code any way you wish, private, educational, or commercial. It's free."
/// </summary>
/// <param name="keys">An enumerable list of items; we will call GetHashCode()
/// on each of them</param>
/// <param name="previousHash">If you call hash more than once, feed the previous
/// hash result into the next call to hash. Otherwise, if you only call it once,
/// previousHash can be an arbitrary value.</param>
/// <returns>A 32-bit signed integer hash code.</returns>
private static int lookup2_hash(System.Collections.IEnumerable keys, int previousHash)
{
    System.Collections.IEnumerator keyi = keys.GetEnumerator();
    uint a = 0x9e3779b9; uint b = a; uint c = (uint)previousHash;
    uint[] k = new uint[12]; uint length = 0; uint sublen;
    while (true)
    {
        sublen = 0;
        // NB: test sublen before MoveNext, or an element is consumed
        // and lost every time a block of 12 fills up.
        while (sublen < 12 && keyi.MoveNext())
        {
            k[sublen] = (uint)keyi.Current.GetHashCode();
            sublen++; length++;
        }
        if (sublen == 12)
        {
            a += k[0] + (k[1] << 8) + (k[2] << 16) + (k[3] << 24);
            b += k[4] + (k[5] << 8) + (k[6] << 16) + (k[7] << 24);
            c += k[8] + (k[9] << 8) + (k[10] << 16) + (k[11] << 24);

            a -= b; a -= c; a ^= (c >> 13); b -= c; b -= a; b ^= (a << 8); c -= a; c -= b; c ^= (b >> 13);
            a -= b; a -= c; a ^= (c >> 12); b -= c; b -= a; b ^= (a << 16); c -= a; c -= b; c ^= (b >> 5);
            a -= b; a -= c; a ^= (c >> 3); b -= c; b -= a; b ^= (a << 10); c -= a; c -= b; c ^= (b >> 15);
        }
        else
        {
            c += length;
            if (sublen >= 11) c += (k[10] << 24);
            if (sublen >= 10) c += (k[9] << 16);
            if (sublen >= 9) c += (k[8] << 8);
            // the first byte of c is reserved for the sublength
            if (sublen >= 8) b += (k[7] << 24);
            if (sublen >= 7) b += (k[6] << 16);
            if (sublen >= 6) b += (k[5] << 8);
            if (sublen >= 5) b += (k[4]);
            if (sublen >= 4) a += (k[3] << 24);
            if (sublen >= 3) a += (k[2] << 16);
            if (sublen >= 2) a += (k[1] << 8);
            if (sublen >= 1) a += (k[0]);

            a -= b; a -= c; a ^= (c >> 13); b -= c; b -= a; b ^= (a << 8); c -= a; c -= b; c ^= (b >> 13);
            a -= b; a -= c; a ^= (c >> 12); b -= c; b -= a; b ^= (a << 16); c -= a; c -= b; c ^= (b >> 5);
            a -= b; a -= c; a ^= (c >> 3); b -= c; b -= a; b ^= (a << 10); c -= a; c -= b; c ^= (b >> 15);
            return (int)c;
        }
    }
}
 
Helge Jensen

Ole said:
How does the GetHashCode() of an array object behave?

In .NET 1 at least it returns an identity of the array, so:

byte[] b1 = {};
byte[] b2 = {};

b1.GetHashCode() != b2.GetHashCode()
I want to use readonly arrays as dictionary keys, based on
their content, not their identity. Is this feasible using the
arrays directly, or do I need to wrap them in a struct that
handles GetHashCode and Equals? If so, is such a wrapper
present in the standard class library?

I would suggest using an IDictionary with keys something like:

public class ComparableArray<T> :
    IEquatable<ComparableArray<T>>, // if you decide to use hashing
    IComparable<ComparableArray<T>>
{
    int? hash; // if you decide to do hashing
    public readonly T[] Array;
    public ComparableArray(T[] array, int? hash)
    {
        this.Array = array;
        this.hash = hash;
    }
    public ComparableArray(T[] array) : this(array, null) { }
    public static int CalculateHash(T[] array)
    {
        // do your stuff with Knuth, or whatever. perhaps something like:
        uint x = 0;
        foreach (T t in array)
        {
            x ^= (uint)t.GetHashCode(); // combine hashes
            // rotate to make ordering mod 32 significant; Knuth can do wonders here
            x = (x >> 1) | ((x & 0x1) << 31);
        }
        return (int)x;
    }
    public override int GetHashCode()
    {
        if (hash == null)
            hash = CalculateHash(Array);
        return (int)hash;
    }
    public override bool Equals(object obj)
    {
        return obj is ComparableArray<T> && Equals((ComparableArray<T>)obj);
    }
    public bool Equals(ComparableArray<T> other)
    {
        return
            Array.Length == other.Array.Length
            && GetHashCode() == other.GetHashCode()
            && compareItemByItem(Array, other.Array);
    }
    public int CompareTo(ComparableArray<T> other)
    {
        T[] a1 = Array, a2 = other.Array;
        int diff = a2.Length - a1.Length;
        if (diff != 0)
            return diff;
        IComparer<T> comp = Comparer<T>.Default;
        for (int i = 0; i < a1.Length; ++i)
        {
            diff = comp.Compare(a2[i], a1[i]);
            if (diff != 0)
                return diff;
        }
        return 0;
    }
    public static bool compareItemByItem(T[] a1, T[] a2)
    {
        if (a1.Length != a2.Length)
            throw new ArgumentException("Unequal lengths", "a1 and a2");
        // you could just compare some small subset of actual values
        // iff your hash is really good
        for (int i = 0; i < a1.Length; ++i)
            if (!a1[i].Equals(a2[i]))
                return false;
        return true;
    }
    // Make interchanging T[] and ComparableArray<T> easy
    public static explicit operator T[](ComparableArray<T> a)
    { return a.Array; }
    public static implicit operator ComparableArray<T>(T[] a)
    { return new ComparableArray<T>(a); }
    // decorate with operators == != <, <=, ... as you like
}

And use IDictionary<ComparableArray<T>, V> as the interface for storage.

You can use any implementation of IDictionary using ComparableArray<T>
as the key-type, since it supports both hashing and alphabetic sorting.

The above Key-type implementation will cache the hashing of keys, so you
won't have to rehash the keys in the dictionary when doing lookup. It's
not that much performance gained, but since you need a wrapper Key-type
for GetHashCode and Equals anyway...

If you choose your hash-function well you might just want to sample a
few items when comparing item-by-item.

The best data-structure for this kind of operation is known as a
dictionary (don't confuse that with the C# Dictionary :) and is related
to prefix-trees. Have a look at those if you are screaming for
performance (they are pretty cute too :)
My app will have hundreds or maybe thousands of these
dictionaries, some of which will contain only a single
array-key/value entry, while others will have thousands
of entries. The number of entries in a particular dictionary
cannot be predicted or estimated reliably in advance. The
values associated with the arrays will be WeakReference
objects, each pointing to an object with a finalizer that will
remove the entry from the containing dictionary.

Try just using Dictionary<ComparableArray<T>, WeakReference> or
SortedDictionary<ComparableArray<T>, WeakReference>, it will have pretty
OK performance. You can always complicate things by using the
proxy-pattern to dispatch to different implementations of IDictionary
depending on dictionary-size.
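A sketch of that proxy idea (the name AdaptiveMap and the threshold of 8 are invented for illustration; a real cutoff should come from profiling): start with a small list scanned linearly, and switch to a Dictionary once the map grows.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical size-adaptive map: linear scan while small,
// switch to a Dictionary once a threshold is crossed.
class AdaptiveMap<K, V>
{
    const int Threshold = 8;             // assumed cutoff, tune by profiling
    List<KeyValuePair<K, V>> small = new List<KeyValuePair<K, V>>();
    Dictionary<K, V> big;                // null until we grow

    public void Add(K key, V value)
    {
        if (big != null) { big.Add(key, value); return; }
        small.Add(new KeyValuePair<K, V>(key, value));
        if (small.Count > Threshold)
        {
            big = new Dictionary<K, V>();
            foreach (KeyValuePair<K, V> p in small) big.Add(p.Key, p.Value);
            small = null;
        }
    }

    public bool TryGetValue(K key, out V value)
    {
        if (big != null) return big.TryGetValue(key, out value);
        EqualityComparer<K> cmp = EqualityComparer<K>.Default;
        foreach (KeyValuePair<K, V> p in small)
            if (cmp.Equals(p.Key, key)) { value = p.Value; return true; }
        value = default(V);
        return false;
    }
}
```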
Would the Dictionary<SomeClass[], WeakReference>
generic collection be suitable for this, or do I need to
spice it up?

No, SomeClass[].GetHashCode screws it up.
 
Ole Nielsby

Helge Jensen said:
I would suggest using an IDictionary with keys something like:

public class ComparableArray<T>:
[...]

The above Key-type implementation will cache the hashing of
keys [...]

I'll leave that part out... the arrays will never be hashed more than
once or twice. (They will be hashed and unified on creation). And
I'll do a struct instead of a class if possible, since large amounts of
them will be created.
[...]

Try just using Dictionary<ComparableArray<T>, WeakReference>
or SortedDictionary<ComparableArray<T>, WeakReference>, it
will have pretty OK performance.

I'll stick with the Dictionary<> then, and change it only if
profiling identifies it as a hot thing.

(The app is an interpreter for a homebrew programming language
that I use for some rather complex tasks, the meanest of which
is as a parser generator system that can produce parsers for
ill-structured languages like VB6. The data model of my language
implies that constant trees must be unified, like intern'ed strings
but deep structures, rather like attribute based XML. This
'cascading unification' doesn't really fit well with .NET; it's much
faster when objects aren't compacted, allowing hash functions
to operate directly on already unified pointers. Anyway, since
I'm doing a .NET port, I might as well go all the way and use
managed objects, interoperability being more important than
raw speed. It's a pity, though, that the framework doesn't have
some support for cascading unification, like a generalized Intern,
perhaps controlled by an interface, like serialization is controlled
by ISerializable.)

Anyway, thanks for clarifying the array hashing issue.
 
Lucian Wischik

Ole Nielsby said:
(The app is an interpreter for a homebrew programming language
that I use for some rather complex tasks, the meanest of which
is as a parser generator system that can produce parsers for
ill-structured languages like VB6.

I've been doing work on parsing, interpretation and model-checking of
languages. I did my first prototypes in F#, which is a functional
language that sits on top of .NET. Then I ported over to C#.

I have to say that F# was much much faster. I think it's the natural
language for many of these problems. The ability to return tuples does
make a difference, and the fact that it has built-in value-equality
for lists. What was most important, though, was immutability of
data-structures. Made it much easier to write and debug. Compact and
readable, too: translating from F# to C# took about ten times as many
lines of code. Another important thing is that F# is a perfect .NET
citizen: it exposes its stuff as CLR objects, and can operate upon CLR
objects. I completed (from scratch) a parser, interpreter and abstract
model-checker for a Pascal-calibre language in just four weeks.
 
Bruce Wood

And I'll do a struct instead of a class if possible, since large amounts of them will be created.

BTW... that's not a good reason to use a struct. In fact, if you use
struct instead of class because "large amounts of them will be created"
then you may find that your structs take up _more_ space than classes
would have, or it may be the reverse. "It all depends."
 
Ole Nielsby

Lucian Wischik said:
I've been doing work on parsing, interpretation and model-checking of
languages. I did my first prototypes in F#, which is a functional
language that sits on top of .NET. Then I ported over to C#.

I have to say that F# was much much faster. I think it's the natural
language for many of these problems.

By faster, do you mean faster programming or faster execution?
(I assume the former but I'd still like to know how the performance
compares.)
[...] What was most important, though, was immutability of
data-structures. Made it much easier to write and debug.

I second that. Far too many bugs come from messing up
data modifications.

I've considered F# (didn't investigate it thoroughly but assume
it's pretty much the same as OCaml) but, excuse me for putting
false modesty aside, my own design is better, though possibly
a bit harder to integrate with .NET.

Anyway, this is a C# NG and I'm getting off-topic...

BTW, when porting F# to C#, did you rewrite into "procedural"
style, or did you do it the functional way? Do you find C# useable
for functional programming?

(Years ago, I involved myself in a rather heated discussion
about tail call flattening in this NG. The C# compiler's lack
of support for this killed my first attempt of porting my
language to .NET. When I uttered my dissatisfaction with
this, I was flooded with arguments against supporting it, and
when I tore some of these arguments apart, things got hot.
Anyway, I think a better functional programming support
in C# might help bridge the gap between procedural and
functional programming.)
 
Ole Nielsby

Bruce Wood said:
BTW... that's not a good reason to use a struct. In fact, if you use
struct instead of class because "large amounts of them will be created"
then you may find that your structs take up _more_ space than classes
would have, or it may be the reverse. "It all depends."

The only situations I can think of where structs would take up more
space than classes is when many copies are made of a struct that is
either 1) larger than a reference, or 2) getting boxed. This is why
C# tutorials recommend that structs be kept small and unboxed.

Since the struct in question consists of a single reference and
the collections are generic and won't box it, I see no reason to
use a class.

Regards/Ole Nielsby
 
Bruce Wood

There are other reasons why structs may take up more space.

If the "same" information appears in multiple collections, then each
collection will contain its own copy of the struct rather than a
reference to a common class instance.

If the structs are passed or stored in local variables then they will
again be copied, rather than copying references (although from the
sounds of it stack is not your big problem).

Also be aware that if you refer to structs in collections then you will
be returned a copy, not the data in the collection, and so to change
the collection entry you'll have to copy the entire struct back into
the collection. Changing to structs does change your programming model
significantly.
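A small demonstration of that copy-back behaviour (my own snippet):

```csharp
using System;
using System.Collections.Generic;

struct Point { public int X; }

class CopyBackDemo
{
    static void Main()
    {
        List<Point> list = new List<Point>();
        list.Add(new Point { X = 1 });

        Point p = list[0]; // a copy, not the stored element
        p.X = 42;
        Console.WriteLine(list[0].X); // prints 1: the list is untouched

        // (list[0].X = 42 would not even compile: CS1612.)
        // To change the entry, copy the whole struct back:
        list[0] = p;
        Console.WriteLine(list[0].X); // prints 42
    }
}
```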
 
Lucian Wischik

Ole Nielsby said:
By faster, do you mean faster programming or faster execution?

Faster programming. (As for execution speed, the difference between F#
and C# has been insignificant compared to the speed difference between my
choices of algorithms and datastructures.)

BTW, when porting F# to C#, did you rewrite into "procedural"
style, or did you do it the functional way? Do you find C# useable
for functional programming?

I rewrote it in a procedural style. Mainly that involved turning
recursive "maps" into nested "foreaches". I preferred the foreaches. I
didn't use delegates. The use of C# "yield return" made some things
much nicer. The lack of anonymous tuples in C# made other things
uglier. The biggest unexpected cost was that of developing immutable
data-structures for C# that were natural to use from the C# side, but
also offered compile-time support and run-time guarantees that they
were immutable. ReadOnlyCollection<T> didn't cut it, and nor did
"immutable types"...
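For concreteness, a minimal sketch of the kind of immutable type meant here (readonly fields, with "modification" returning a fresh instance; the Pair type is invented for illustration):

```csharp
using System;

// Minimal immutable pair: fields are readonly, so instances can be
// shared freely; "changing" a field produces a new instance.
sealed class Pair
{
    public readonly int First, Second;
    public Pair(int first, int second) { First = first; Second = second; }
    public Pair WithFirst(int first) { return new Pair(first, Second); }
}

class ImmutableDemo
{
    static void Main()
    {
        Pair p = new Pair(1, 2);
        Pair q = p.WithFirst(10);
        Console.WriteLine(p.First); // prints 1: p is unchanged
        Console.WriteLine(q.First); // prints 10
    }
}
```

Note the compile-time guarantee is only shallow: readonly stops reassignment of the fields, not mutation of whatever they reference, which is part of why ReadOnlyCollection<T> alone doesn't cut it.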
 
Helge Jensen

Ole said:
Anyway, this is a C# NG and I'm getting off-topic...

taking it a little further OT ;)
BTW, when porting F# to C#, did you rewrite into "procedural"
style, or did you do it the functional way? Do you find C# useable
for functional programming?

(Years ago, I involved myself in a rather heated discussion
about tail call flattening in this NG. The C# compiler's lack
of support for this killed my first attempt of porting my
language to .NET. When I uttered my dissatisfaction with
this, I was flooded with arguments against supporting it, and
when I tore some of these arguments apart, things got hot.
Anyway, I think a better functional programming support
in C# might help bridge the gap between procedural and
functional programming.)

MS C++ 7.0 has *excellent* tail call optimizations. During a course on
partial evaluation a while back I did three implementations of an
interpreter for a simple virtual machine:

- procedural, while ( true ) process next instruction
- tail-recursive via an accumulator
- CPS-style

The compiled code for the three ran at *almost* the same speed! I was
especially impressed with the CPS-rewrites done by the optimizer.

My hand-in is at http://slog.dk/svn/home/jensen/PEIOOL/ex4/ex4.pdf, and
you can browse around http://slog.dk/svn/home/jensen/PEIOOL/ex4/ if you
like.
 
Ole Nielsby

Helge Jensen said:
MS C++ 7.0 has *excellent* tail call optimizations. During
a course on partial evaluation a while back I did three
implementations of an interpreter for a simple virtual
machine:

- procedural, while ( true ) process next instruction
- tail-recursive via an accumulator
- CPS-style

The compiled code for the three ran at *almost* the same
speed! I was especially impressed with the CPS-rewrites
done by the optimizer.

Sounds interesting. My interpreter port will have to be done
in Continuation Passing Style, as the current version (written
in NASM) does things to the stack that simply won't work
with clr stack frames (like piped list processing, and a construct
that works like a mix of structured exception handling and
explicit tail call - this may sound obscure but the construct is
quite central to the language and models the concept of "taking
responsibility", which is of great help when debugging.)
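For readers unfamiliar with the style, a trampoline is one way to run continuation-passing code on the CLR without growing the stack even though C# emits no tail calls. This is a generic sketch (all names invented), not the poster's interpreter:

```csharp
using System;

// Each step returns the next step to run, or null to stop; the
// trampoline loop keeps the CLR stack flat regardless of call depth.
delegate Step Step();

static class Trampoline
{
    public static void Run(Step step)
    {
        while (step != null) step = step();
    }
}

class CpsDemo
{
    static long acc;

    // Hypothetical example: sum 1..n in continuation-passing style.
    static Step Sum(long n)
    {
        if (n == 0) return null;
        acc += n;
        return delegate { return Sum(n - 1); }; // "tail call" as a thunk
    }

    public static long SumTo(long n)
    {
        acc = 0;
        Trampoline.Run(delegate { return Sum(n); });
        return acc;
    }

    static void Main()
    {
        Console.WriteLine(SumTo(1000000)); // 500000500000, no stack overflow
    }
}
```

The cost is one delegate allocation per step, which is why optimizer-level support (as in the C++ compiler mentioned below) is attractive.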

My main reason for picking C# for the project is, I like the
idea of getting verifiable code "by default" - though I'm
not sure this is possible when WeakReference is involved.
However, if the C++ compiler can produce better optimized
and still verifiable code for the style I will be using (or if I
can't get verifiable code with WeakReference in it), I might
change the horse.

BTW I explored C# generics and found that there were things
I'd like to do but couldn't, some of which I think would be
possible with C++ templates, though I'm not too sharp with
templates (no pun intended) - actually, I shocked a C++ expert
the other day by casually referring to C# generics as "templates".
(Won't do it again!!!). On the other hand, the strictness of
generics has something going for it... yet I'd like to know more
precisely how C++ templates relate to C# generics. Do they
merge, in the sense that the C++ compiler will use the clr
generic whenever possible and resort to code bloat when
not... or are they rather separate concepts?

Regards/Ole N.
 
Helge Jensen

Ole said:
Sounds interesting. My interpreter port will have to be done
in Continuation Passing Style, as the current version (written
in NASM) does things to the stack that simply won't work
with clr stack frames (like piped list processing, and a construct
that works like a mix of structured exception handling and
explicit tail call - this may sound obscure but the construct is
quite central to the language and models the concept of "taking
responsibility", which is of great help when debugging.)

Interesting, have you got any intros/docs/references on any of this?
My main reason for picking C# for the project is, I like the
idea of getting verifiable code "by default" - though I'm
not sure this is possible when WeakReference is involved.
However, if the C++ compiler can produce better optimized
and still verifiable code for the style I will be using (or if I
can't get verifiable code with WeakReference in it), I might
change the horse.

Which definition of "verifiable" are you referring to? verifiable, as in
"no-null-pointer derefs" or "no-undefined-behaviour", or ...
BTW I explored C# generics and found that there were things
I'd like to do but couldn't, some of which I think would be

You can't really do meta-programming with C# generics -- they are not
Turing-complete like C++ templates ;) At one point I was quite fascinated
with meta-programming using templates. In the end I realized that it's
just building another *very* poor language, and that I was hard put to
identify problems in my problem-domains which had significant
performance-benefits from it.

Many of the tricks done with templates in C++ are all about letting the
compiler know as much as possible about the types of arguments to allow
more precise analysis and inlining, which may cause increased
performance and almost certainly bloat, and often long-lasting
sessions of "spoon-feeding the compiler".

Many of the things I did with static dispatch a few years ago I do with
dynamic dispatch (virtual functions or dictionary-lookup) now.
possible with C++ templates, though I'm not too sharp with
templates (no pun intended) - actually, I shocked a C++ expert
the other day by casually referring to C# generics as "templates".

Generics are certainly not templates. Templates are "type-parameters by
macro". Generics are "limited type-parameters, mostly at compile-time".
precisely how C++ templates relate to C# generics. Do they
merge, in the sense that the C++ compiler will use the clr
generic whenever possible and resort to code bloat when
not... or are they rather separate concepts?

I don't have any answer to that, but I can't imagine how the C++
compiler could share code between all instances of classes or functions
that are parametrised with reference-types, like C# does.
 
Ole Nielsby

Helge Jensen said:
Ole said:
[...] My interpreter port will have to be done in Continuation
Passing Style, as the current version (written in NASM) does
things [...] models the concept of "taking responsibility",
which is of great help when debugging.

Interesting, have you got any intros/docs/references on any
of this?

I have an unfinished sketch of a paper in English that
describes this feature. At present, the language as such
is documented in Danish only, mail me in our native
tongue at (e-mail address removed) but without the slow
slimy animal (spam protection) if you want to read the
Danish docs and/or give the language a try.
My main reason for picking C# for the project is, I like the
idea of getting verifiable code "by default" - though I'm
not sure this is possible when WeakReference is involved.
[...]

Which definition of "verifiable" are you referring to? verifiable,
as in "no-null-pointer derefs" or "no-undefined-behaviour", or ...

I mean: verifiable by the clr so that it can be safely run in a
sandbox, like Java applets, without the need for the user to
explicitly trust the code... so that I might promote the language
by writing applets...
 
