set matching

solandre · Mar 30, 2006

hi there,

i have to find matches in two sorted arrays. there are two good
possibilities as a temporary solution, performing different, depending
on the inner structure of the data. but i try to find a better one.

..possibility one ...
...is running through the arrays in a zig-zag, always taking the an
element from array A and searching the remaining elements of array B
sequentially for an equal or a higher match.
...if i find an equal i iterate in both arrays, ...
...if i find a higher, i flip sides take the found element from array B
and search array A from the position where the last element from array
A was taken +1 sequentially for an equal or a higher.
...this is called "Merged join", i read.
...it performs well if the matching elements are "close" together.

..possibility two...
...i enhanced this method with binarysearch, as the arrays are several
10 million elements big.
...binarysearch is used instead of sequentially searching the remainder
of an array.
...this performs better if the matching elements are far appart, but
incredibly poor, if the matches are close together.

any suggestions for a still faster solution to find matches in two
sorted arrays?

thanks

J.Marsch · Mar 30, 2006

Solandre:

Do you have a requirement (other than searching) to maintain your lists in
sorted order?
What are you storing in the arrays? (is it just a primitive, like a number,
or is it an array of data structures)?

Maybe you could get better performance by switching to a different data
structure (hash tables, binary tree, etc).

For example, have you looked at Hash tables?
Pro: Lookup time in a hash table is logarithmic (O(logN)), so even a very
large hash table will yield extremely fast lookups
Con: A hash table is not sorted, so you won't have a list in order, also,
each key (lookup value) must be unique.

There could be a few different ways to go, but which is the best depends
upon your specific situation. You can find a pretty good discussion of data
structures here:
http://msdn.microsoft.com/vcsharp/programming/datastructures/

Helge Jensen · Mar 30, 2006

solandre said:
..this is called "Merged join", i read.
..it performs well if the matching elements are "close" together.

I don't really understand this. Are your input lists sorted or not?

if they are, and they are reasonable equally sized (within a few orders
of magnitude), you can iterate them simultaneously, by maintaining an
index into them seperatly and move the index pointing to the smallest
element forward, perhaps this is what you described above?

Using binary-search to find the next matching pair isn't really likely
to be good. Although an approach where you jump in steps with increasing
length by squaring and then afterwards search linearly may work well if
the lists have very different sizes.

solandre · Mar 30, 2006

thanks for the replies.

yes the arrays are sorted. merged join might be the wron name for it,
is it called transversal searching?
yes we hold two pointers one for each array.
jumping with increasing steps instead binarysearching the rest is
interesting and i will evaluate it.
i think it really depends on the structure of the data, where the net
match or next higher possibly is (in our case it is assumingly far
away).

we have sorted out binary trees, as trees always need to hold pointers
to the next or previouse node. which we dont want to afford now for
reason of size.

the arrays hold primitive ulongs (is it a primitve type?)

Guest · Mar 30, 2006

Matching sorted lists... sounds like what I heard called a "master file
update" waaaay back in college (a COBOL class).

If the referenced objects match, do what you do with them and advance both
references
Else advance the reference to the lesser object

Brian Gideon · Mar 30, 2006

J.Marsch said:
For example, have you looked at Hash tables?
Pro: Lookup time in a hash table is logarithmic (O(logN)), so even a very
large hash table will yield extremely fast lookups
Con: A hash table is not sorted, so you won't have a list in order, also,
each key (lookup value) must be unique.

Hash tables have an average case runtime of O(1). One con I would add
to the list is that hash tables consume more memory.

Brian Gideon · Mar 30, 2006

Solandre,

I'm still a little confused. Can you provide an example of what the
operation in question would produce. Consider two sets.

A = { 2, 3, 5, 7, 11, 13, 17, 19 }
B = { 1, 1, 2, 3, 5, 8, 13, 21 }

Would the result be the intersection of the two? Result = { 2, 3, 5, 13
}

Brian

Helge Jensen · Mar 31, 2006

Brian said:
B = { 1, 1, 2, 3, 5, 8, 13, 21 }

B is not actually a set, unless the two 1's are somehow different

solandre · Mar 31, 2006

..there would be no duplicates in the arrays
..considering these arrays, with no dup. 1 in arrB, the result would be
your mentioned { 2, 3, 5, 13}

Brian Gideon · Mar 31, 2006

Yeah, I was just trying to be cute and make B the Fibonacci numbers

Brian Gideon · Mar 31, 2006

Okay. Yeah, you want to intersect two sets. Try the following code.
The ICollection that is returned will be in sorted order if the two
input collections are sorted. You said the collections would not have
duplicates, but this code can handle that case anyway. It runs in
linear time so it should be pretty fast.

public static ICollection Intersect(ICollection lhs, ICollection rhs)
{
ICollection larger = lhs.Count <= rhs.Count ? lhs : rhs;
ICollection smaller = lhs.Count <= rhs.Count ? rhs : lhs;

Hashtable table = new Hashtable();
foreach (object element in larger)
{
if (!table.Contains(element))
{
table.Add(element, element);
}
}

ArrayList result = new ArrayList();
foreach (object element in smaller)
{
if (table.Contains(element))
{
table.Remove(element);
result.Add(element);
}
}

return result;
}

Brian

Bill Butler · Apr 2, 2006

solandre said:
hi there,

i have to find matches in two sorted arrays. there are two good
possibilities as a temporary solution, performing different, depending
on the inner structure of the data. but i try to find a better one.

.possibility one ...
..is running through the arrays in a zig-zag, always taking the an
element from array A and searching the remaining elements of array B
sequentially for an equal or a higher match.
..if i find an equal i iterate in both arrays, ...
..if i find a higher, i flip sides take the found element from array B
and search array A from the position where the last element from array
A was taken +1 sequentially for an equal or a higher.
..this is called "Merged join", i read.
..it performs well if the matching elements are "close" together.

Basically this sounds like a single pass of the Merge Sort algorithm (without the actual merge that
is)

.possibility two...
..i enhanced this method with binarysearch, as the arrays are several
10 million elements big.

Are Both arrays of similar Size?

..binarysearch is used instead of sequentially searching the remainder
of an array.
..this performs better if the matching elements are far appart, but
incredibly poor, if the matches are close together.

Define "Incredibly"
If you "Know" that a section of your data is highly localized you can narrow the scope of the binary
search to attempt to speed up the performance in the localized section.

any suggestions for a still faster solution to find matches in two
sorted arrays?

The first variant (Merge sort) should perform fairly well in the general case, however, if you have
knowledge about your data and how it is likely to be distributed you can possibly improve the
performance further. Let me ask you a few questions about the usage patterns.

After you intersect 2 Arrays, What then?
Intersect 2 new Arrays?
Intersect the first array with a whole bunch of secondary arrays?
The reason that I ask is that if you are repeatedly searching the same array, you could optimize the
search for THAT particular data set. A hashtable could improve your performance, but only if you are
searching the same Array more than once.

Are both Arrays of similar size?
Does the data have any known clumping or is it fairly uniform?
Do both Arrays, in general span the same range of possible values or is one of the arrays more
localized or clumpy?

If you "Know" you are in a localized/clumpy section of data....use the standard Merge sort method
If you "Know" you are likely going to have to go a long way to find the next match ...Use a binary
search.

If you can develop a heuristic to determine locality, you can alter your search method accordingly.
Unfortunately, adding the calculation of a heuristic ADDS to the overhead. So you will have to
decide if it is worth it.

Good luck
Bill

Compaing two arrays when element order does not matter	38	Feb 2, 2006
Get/Set Property for Array	13	May 22, 2009
Help with multi-dimensional arrays	1	May 5, 2010
Conceptual and speed question	2	Nov 18, 2005
Array - delete and replace elements	2	Apr 17, 2011
C# and array in a dll	5	Jan 28, 2005
Best ways to search an array/collection for an element	8	Feb 12, 2010
Need some fast code for some string array logic	3	Mar 18, 2005

set matching

solandre

J.Marsch

Helge Jensen

solandre

Guest

Brian Gideon

Brian Gideon

Helge Jensen

solandre

Brian Gideon

Brian Gideon

Bill Butler

Ask a Question

Similar Threads