Compaing two arrays when element order does not matter

  • Thread starter Thread starter ssg31415926
  • Start date Start date
Simple solution that involves no sorting. Just add all the hash codes
from both arrays and compare the two sums. This O(n) solution has an
extremely slight possibility of a false positive.

What if you're coding a mission critical routine? Is extremely slight
a good safety margin? I would hate to have to chase this down if it
was one of those intermittent bugs you can get in a system. (Think
shuttle o-rings, it can happen with software too).
// first check length of two arrays, if not equal then done
if( servers.Length != monitors.Length )
return false;

// add all hashcode values of each element in first array
int serversSumOfHashCodes = 0;
for( int index=0; index<servers.Length; index++ )
{
serversSumOfHashCodes += servers[index].GetHashCode();
}

// add all hashcode values of each element in second array
int monitorsSumOfHashCodes = 0;
for( int index=0; index<monitors.Length; index++ )
{
monitorsSumOfHashCodes += monitors[index].GetHashCode();
}

// compare first sumOfHashCodes with second sumOfHashCodes
// if sums are == then the two arrays contain the same of elements
if( serversSumOfHashCodes == monitorsSumOfHashCodes )
return true;

return false;

Ken Wilson
Seeking viable employment in Victoria, BC
 
Ken Wilson said:
I think there is a bug in this code. If the very first item passes
won't the found variable be irreversibly changed to true?

found is reset to false at the start of each outside loop-through.
 
Jon said:
(To check the hashtable stuff, I had a look at the docs for Hashtable
which claim that retrieving a value is an O(1) operation. I find that
hard to believe - if every key in the table has the same hashcode, it
has to walk through all of them, making it an O(n) operation. That's
not mentioned anywhere...)

Well, yes, but that would have to be a spectacularly badly designed
Hashtable, a spectacularly bad hash key algorithm, or very, very bad
luck. Generally speaking, hash storage and retrieval is O(k), which I
suppose using a different notation could be called O(1). Anyway, it's
typically a heck of a lot faster even than binary search.
 
You're wrong. :)

Your inner loop will continue until it finds a monitor that does _not_
equal the current server. In other words, for arrays of greater than
one item each, it will always return false.

My original code was correct, apart from the silly mistake of not
bailing out of the inner loop when I found a match.
If the very first item passes won't the found variable be irreversibly changed to true?

Well, yes, but that's the point. A match was found; it doesn't matter
if all of the other items in the array _don't_ match... all that
matters is that one _did_. In fact, once "found" is set to true,
according to the OP's description of the problem, it's guaranteed that
none of the other entries will match.

Your inner loop does the inverse: it returns true only if _every_
monitor matches the current server, which will never be true if there
is more than one monitor in the array.
 
You're wrong. :)

Your inner loop will continue until it finds a monitor that does _not_
equal the current server. In other words, for arrays of greater than
one item each, it will always return false.

My original code was correct, apart from the silly mistake of not
bailing out of the inner loop when I found a match.


Well, yes, but that's the point. A match was found; it doesn't matter
if all of the other items in the array _don't_ match... all that
matters is that one _did_. In fact, once "found" is set to true,
according to the OP's description of the problem, it's guaranteed that
none of the other entries will match.

Your inner loop does the inverse: it returns true only if _every_
monitor matches the current server, which will never be true if there
is more than one monitor in the array.

Respectfully I beg to differ. Here is the opening paragraph from the
OP's post.

---

I need to compare two string arrays defined as string[] such that the
two arrays are equal if the contents of the two are the same, where
order doesn't matter and every element must be unique.

E.g. these two arrays would test as equal:

servers[0] = "Admin"
servers[1] = "Finance"
servers[2] = "Payroll"
servers[3] = "Sales"

monitors[0] = "Sales"
monitors[1] = "Payroll"
monitors[2] = "Admin"
monitors[3] = "Finance"

---

As you can see, according to the original problem statement, there has
to be a one-to-one match for each element in both arrays. Otherwise,
the length of the second array is irrelevant if all you need it one
match because you might find that one match in a shorter or a longer
array if it's only one you are looking for.

Ken Wilson
Seeking viable employment in Victoria, BC
 
Yes, but that's why there are two loops. The inner loop is only asking
the question, "Is there any element in the second array that matches
the current element in the first array?" The outer loop asks the
question, "Does the inner loop test pass for every element in the first
array?"

Let's take the OP's example and run it through the code you posted:

servers[0] = "Admin"
servers[1] = "Finance"
servers[2] = "Payroll"
servers[3] = "Sales"

monitors[0] = "Sales"
monitors[1] = "Payroll"
monitors[2] = "Admin"
monitors[3] = "Finance"

For i = 0 and j = 0, "Sales" does not equal "Admin", so found will be
set to false and we break out of the "j" loop. Since found is now
false, we execute the contents of the "if (!found)" test, and break out
of the outer loop as well. The method returns false.

Now, the same example using my posted code, with Otis' adjustment to
improve the efficiency by breaking out of the inner loop when a match
is found.

i = 0
found is set to false
j = 0
"Sales" does not equal "Admin", so found remains false
j = 1
"Sales" does not equal "Payroll", so found remains false
j = 2
"Sales" equals "Sales", so found is set to true, and we exit the inner
loop
found is true, so we do not return yet
i = 1
found is set to false
j = 0
"Finance" does not equal "Sales", so found remains false
j = 1
"Finance" does not equal "Payroll", so found remains false
j = 2
"Finance" does not equal "Admin", so found remains false
j = 3
"Finance" equals "Finance", so found is set to true and we exit the
inner loop
found is true, so we do not return yet
i = 2
found is set to false
j = 0
"Payroll" does not equal "Sales", so found remains false
j = 1
"Payroll" equals "Payroll", so found is set to true and we exit the
inner loop
found is true, so we do not return yet
i = 3
found is set to false
j = 0
"Sales" equals "Sales", so found is set to true and we exit the inner
loop
found is true, so we do not return yet
The outer loop is done, the method returns true
 
Bruce Wood said:
Well, yes, but that would have to be a spectacularly badly designed
Hashtable, a spectacularly bad hash key algorithm, or very, very bad
luck. Generally speaking, hash storage and retrieval is O(k), which I
suppose using a different notation could be called O(1). Anyway, it's
typically a heck of a lot faster even than binary search.

*Typically* is fine - but I thought the point of the O notation was
that it was an upper bound... Maybe I'm misremembering though :)
 
Computing all the hash codes is the slowest part of storing all the
strings in a dictionary and then looking them up, so I don't think this
will save much over just copying one list to a dictionary and then
looking up each element of the other list.
 
This is a perfect illustration of why the faster and simpler dictionary
code is preferable. Maybe the nested loop solution is easier for a
freshman programming student to understand, but it's about 100 times
more likely to be wrong. Anyone who knows about the Dictionary
collection could understand and verify the dictionary solution in a
minute or two.
 
Yes, and no.

Notice that the dictionary solution has the subtle distinction that you
have to remove the dictionary entries as you find them, and then ensure
that the dictionary is empty at the end to guard against duplicates in
the second list giving a false positive.

Nonetheless, overall I agree with your assessment. I would choose the
dictionary solution, myself.
 
Yes, but that's why there are two loops. The inner loop is only asking
the question, "Is there any element in the second array that matches
the current element in the first array?" The outer loop asks the
question, "Does the inner loop test pass for every element in the first
array?"

<snip long analysis>

With respect, I'm a bit of a dog with a bone on issues sometimes so I
will apologize right now. [8-)

I ran both of our algorithms. The error I thought I spotted in your
algorithm still exists in that as long as one item tests right you
will never report a failure of the whole because you never set found
back to false. I did add the break in the inner loop if server and
monitor match. On the upside, mine is buggy both ways. Don't know
where I went wrong but it's a good lesson on why you don't start
winging code in the early AM when you should be in bed (referring to
myself of course.) I'm going to chase this further in my own code
because, as another poster said, a college entrant should be able to
code a brute force one.

Ken Wilson
Seeking viable employment in Victoria, BC
 
Just for fun, I cut and pasted my code into VS and ran it. I did find
one bug, but not the one you reported. The inner loop contains two
typos. I wrote:

for (int j = 0; i < monitors.Length j++)

which is, of course, at least missing a semicolon. However, I also
mistyped "i <" instead of "j <". The correct line is

for (int j = 0; j < monitors.Length; j++)

to which I added Otis' test to get:

for (int j = 0; j < monitors.Length && !found; j++)

With that change, it works just fine.

Anyway, I think we've beaten this one to death. :)
 
Just for fun, I cut and pasted my code into VS and ran it. I did find
one bug, but not the one you reported. The inner loop contains two
typos. I wrote:

for (int j = 0; i < monitors.Length j++)

which is, of course, at least missing a semicolon. However, I also
mistyped "i <" instead of "j <". The correct line is

for (int j = 0; j < monitors.Length; j++)

to which I added Otis' test to get:

for (int j = 0; j < monitors.Length && !found; j++)

With that change, it works just fine.

Anyway, I think we've beaten this one to death. :)

I believe so. At least now I am more sure I would rather just use a
collection of some sort where all the headaches have already been had
by someone other than me. :-)

Ken Wilson
Seeking viable employment in Victoria, BC
 
I don't think all that much subtlety is required. Really, beginning
programmers should be taught about associative stores in about the
second week of class. It would save the world millions of coding
hours, and billions of billions of wasted computing cycles.

bool UnorderedListContentsAreEqual<T>(List<T> a, List<T> b)
{
if (a.Count != b.Count) return false;
Dictionary<T,int> d = new Dictionary<T,int>();
foreach (T t in a)
{
++d[t];
}
foreach (T t in b)
{
if (--d[t] < 0) return false;
}
return true;
}
 
Step 1. Check Length for both array if not equal then return false.

Step 2. Calculate Hash for each element in Array 1 and add them (for
string values you can use hashing function like :
SIGMA( ascii value of A * i)) for i = 0 to len of array -1

Step 3 Do the same for array 2

If the sum are same them both array are equal as per defn else false.
 
Rana said:
Step 2. Calculate Hash for each element in Array 1 and add them (for
string values you can use hashing function like :
SIGMA( ascii value of A * i)) for i = 0 to len of array -1

Step 3 Do the same for array 2

If the sum are same them both array are equal as per defn else false.


Not really, hashes are not guaranteed to be unique. what hash'es should
fulfill is the implication:

a.Equals(b) => a.GetHashCode() == b.GetHashCode()

that's not the same as the implication:

a.GetHashCode() == b.GetHashCode() => a.Equals(b)

which is required for your implementation to be correct. *And* your
solution have overflow problems, even *if*:

a.Equals(b) <=> a.GetHashCode() == b.GetHashCode()

the sum is will most likely produce overflows (it almost certainly will
if a good hashfunction is used) and invalidly declare lists equal.

Actually, when I combine hash'es I usually prefer xor'ing (^=) hash'es
together, it doesn't produce any overflow (won't fail in "checked" code)
and is associative but it also allows removal of elements by the same
operation. It doesn't work well on multisets though :)

What you *do* however know after hashing all elements is:

hashmap(lista) != hashmap(listb) => lista != listb

and if the lists are almost never equals (if for example this is an
assertion that the lists are differenct) this may be faster than using
other comparisons. But I'd take a look at it with a
performance/memory-profiler before writing code that complex for the
situation.

Probably the simplest ways to compare the lists are:
1. Use the dictionary solution proposed
2. Sort them and compare item by item.

If this is too inefficient for the concrete scenario perhaps a some
reqirement on the data-structures should be made?
 
I was not talking of the hashcode given by object.GetHashCode().
I agree as you said, a.Equals(b) does not imply a.GetHashCode() ==
b.GetHashCode()

Rather we need to calculate hash values using some functions on the
elements(values) of the array

if the elements are string(considering them to be char array) we can
use the function
say HASH(arr) -> sigma( ascii value of arr * i + 1) for i = 0 to
arr.Length - 1

for example
"AB" => 1*65 + 2*66
"BC" => 1*66 + 2*67 etc.

unless the strings are extremely large the summation will not overflow.

and also HASH(strA) == HASH(StrB) => StrA==StrB (the fn above works
ascii strings)

Please correct me if I am wrong
 
Rana said:
I was not talking of the hashcode given by object.GetHashCode().
I agree as you said, a.Equals(b) does not imply a.GetHashCode() ==
b.GetHashCode()

Rather we need to calculate hash values using some functions on the
elements(values) of the array

But the point of *hash* values is that they're still not unique. They
manage to give an absolute "these values *aren't* equal" answer in most
cases while taking *less information size* than the original. If the
information size is the same as the original, you're no better off than
just comparing for equality the long way - and if the information size
*is* less than the original, then there must be possible duplicates by
the pigeon-hole principle.
if the elements are string(considering them to be char array) we can
use the function
say HASH(arr) -> sigma( ascii value of arr * i + 1) for i = 0 to
arr.Length - 1

for example
"AB" => 1*65 + 2*66
"BC" => 1*66 + 2*67 etc.

unless the strings are extremely large the summation will not overflow.


The values won't be unique though. Two spaces (1*32+2*32) would end up
with the same hash code as '`' (1*96) for example.
and also HASH(strA) == HASH(StrB) => StrA==StrB (the fn above works
ascii strings)

No, as shown by the counterexample above.
 
Rana said:
I was not talking of the hashcode given by object.GetHashCode().

OK. The same reasoning applies to any hash-code, it's not really
important whether it's object.GetHashCode().
I agree as you said, a.Equals(b) does not imply a.GetHashCode() ==
b.GetHashCode()

Is there a typing error here? I said the opposite: "does imply" :)

For a function, h, to be a hash-function (w.r.t. two specific
equality-relations of the domain and codomain of h). it must fullfill:

Domain equality preservation
a == b => h(a) == h(b)

Codomain inequality preservation
h(a) != h(b) => a != b

Although this definition allows i.e. the constant-function

c(x) = 0

as a hashing function for strings to integers, it is usually implicit
that h is should map elements in its domain into it's codomain "with a
good amount of spreading" -- which determines the "quality" of the
hashing done by h. For a better definition of that look at
"perfect-hashing" and "cryptographic-hashing".

If the codomain of h is smaller than the domain of h, the pigeon-hole
principle demonstrates that h cannot be one-to-one, and in that case:

h(a) == h(b) =/=> a == b
if the elements are string(considering them to be char array) we can
use the function
say HASH(arr) -> sigma( ascii value of arr * i + 1) for i = 0 to
arr.Length - 1

for example
"AB" => 1*65 + 2*66
"BC" => 1*66 + 2*67 etc.

unless the strings are extremely large the summation will not overflow.


It does have other problems though, most notably that it breaks codomain
inquality preservation from above, as demonstrated by Jon Skeet, which
you *need* to correctly decide permutational equality.

Also, the hash-function you propose does not help you to decide
equality-by-permutation, since it explicitly encodes the position of
elements.

Lets get back to the original problem, comparing arrays unsorted -- the
permutation-equality problem: can one array be permuted into the other.
Lets assume that the arrays have equal length, N, otherwise the problem
is simple :)

Both the dictionary and the sorting approach transforms the input into a
canonical form, which can then be more efficiently compared.

The dictionary-approach canonically represents the set of unordered
permutations of a squence as a counter for each occurence.
It requires the elements to be effectively hashable,and allows
permutational-equality to be decided in O(N) memory and O(N) time by
comparing the sizes of the keys-sets and traversing one hash-table
looking up counts in the other.

The sorting solution canonically represents the set of unordered
permutations of a sequence by ordering the elements. This places the
stronger, compared to hashability, demand of an ordering on the
elements. After bringing both arrays on a canonical form, the arrays can
be compared item by item, deciding permutational-equality on O(N) memory
and O(N log N) time, depending on the choice of sorting algorithm.

What *hashing* can do for you (in this case) is that it can with O(1)
memory and O(N) time decide "roughly" if it is possible for the two
arrays to be equals. That can be done by using a (preferably surjective)
function, s, on each element (possibly identity, as in your example),
and combining these via any associative function, c, for example: xor,
times, sum or whatever. The associative property *exactly* removes the
ordering. If the hashes of two arrays matches you still have to do a
real comparison, but if they dont, you have proved they cannot be
permuted into each other.

If you expect to do many comparisons of arrays to other arrays, it *may*
be a good idea to use this approach before doing a full comparison. You
can amortize the effort spent calculation the hash over the number of
times you do comparisons.

Depending on the way you hash you may get quite good results with this
technuqie, but I doubt it would really be practial. I would look into
hashing one of the canonicalized representations instead and storing
that along with the array. I suspect the canonical representation will
allow you to achieve a better utilization of the codomain of the hashing.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Back
Top