set matching

S

solandre

hi there,

i have to find matches in two sorted arrays. there are two good
possibilities as a temporary solution, performing different, depending
on the inner structure of the data. but i try to find a better one.

..possibility one ...
...is running through the arrays in a zig-zag, always taking the an
element from array A and searching the remaining elements of array B
sequentially for an equal or a higher match.
...if i find an equal i iterate in both arrays, ...
...if i find a higher, i flip sides take the found element from array B
and search array A from the position where the last element from array
A was taken +1 sequentially for an equal or a higher.
...this is called "Merged join", i read.
...it performs well if the matching elements are "close" together.

..possibility two...
...i enhanced this method with binarysearch, as the arrays are several
10 million elements big.
...binarysearch is used instead of sequentially searching the remainder
of an array.
...this performs better if the matching elements are far appart, but
incredibly poor, if the matches are close together.

any suggestions for a still faster solution to find matches in two
sorted arrays?

thanks
 
J

J.Marsch

Solandre:

Do you have a requirement (other than searching) to maintain your lists in
sorted order?
What are you storing in the arrays? (is it just a primitive, like a number,
or is it an array of data structures)?

Maybe you could get better performance by switching to a different data
structure (hash tables, binary tree, etc).

For example, have you looked at Hash tables?
Pro: Lookup time in a hash table is logarithmic (O(logN)), so even a very
large hash table will yield extremely fast lookups
Con: A hash table is not sorted, so you won't have a list in order, also,
each key (lookup value) must be unique.

There could be a few different ways to go, but which is the best depends
upon your specific situation. You can find a pretty good discussion of data
structures here:
http://msdn.microsoft.com/vcsharp/programming/datastructures/
 
H

Helge Jensen

solandre said:
..this is called "Merged join", i read.
..it performs well if the matching elements are "close" together.

I don't really understand this. Are your input lists sorted or not?

if they are, and they are reasonable equally sized (within a few orders
of magnitude), you can iterate them simultaneously, by maintaining an
index into them seperatly and move the index pointing to the smallest
element forward, perhaps this is what you described above?

Using binary-search to find the next matching pair isn't really likely
to be good. Although an approach where you jump in steps with increasing
length by squaring and then afterwards search linearly may work well if
the lists have very different sizes.
 
S

solandre

thanks for the replies.

yes the arrays are sorted. merged join might be the wron name for it,
is it called transversal searching?
yes we hold two pointers one for each array.
jumping with increasing steps instead binarysearching the rest is
interesting and i will evaluate it.
i think it really depends on the structure of the data, where the net
match or next higher possibly is (in our case it is assumingly far
away).

we have sorted out binary trees, as trees always need to hold pointers
to the next or previouse node. which we dont want to afford now for
reason of size.

the arrays hold primitive ulongs (is it a primitve type?)
 
G

Guest

Matching sorted lists... sounds like what I heard called a "master file
update" waaaay back in college (a COBOL class).

If the referenced objects match, do what you do with them and advance both
references
Else advance the reference to the lesser object
 
B

Brian Gideon

J.Marsch said:
For example, have you looked at Hash tables?
Pro: Lookup time in a hash table is logarithmic (O(logN)), so even a very
large hash table will yield extremely fast lookups
Con: A hash table is not sorted, so you won't have a list in order, also,
each key (lookup value) must be unique.

Hash tables have an average case runtime of O(1). One con I would add
to the list is that hash tables consume more memory.
 
B

Brian Gideon

Solandre,

I'm still a little confused. Can you provide an example of what the
operation in question would produce. Consider two sets.

A = { 2, 3, 5, 7, 11, 13, 17, 19 }
B = { 1, 1, 2, 3, 5, 8, 13, 21 }

Would the result be the intersection of the two? Result = { 2, 3, 5, 13
}

Brian
 
S

solandre

..there would be no duplicates in the arrays
..considering these arrays, with no dup. 1 in arrB, the result would be
your mentioned { 2, 3, 5, 13}
 
B

Brian Gideon

Okay. Yeah, you want to intersect two sets. Try the following code.
The ICollection that is returned will be in sorted order if the two
input collections are sorted. You said the collections would not have
duplicates, but this code can handle that case anyway. It runs in
linear time so it should be pretty fast.

public static ICollection Intersect(ICollection lhs, ICollection rhs)
{
ICollection larger = lhs.Count <= rhs.Count ? lhs : rhs;
ICollection smaller = lhs.Count <= rhs.Count ? rhs : lhs;

Hashtable table = new Hashtable();
foreach (object element in larger)
{
if (!table.Contains(element))
{
table.Add(element, element);
}
}

ArrayList result = new ArrayList();
foreach (object element in smaller)
{
if (table.Contains(element))
{
table.Remove(element);
result.Add(element);
}
}

return result;
}

Brian
 
B

Bill Butler

solandre said:
hi there,

i have to find matches in two sorted arrays. there are two good
possibilities as a temporary solution, performing different, depending
on the inner structure of the data. but i try to find a better one.

.possibility one ...
..is running through the arrays in a zig-zag, always taking the an
element from array A and searching the remaining elements of array B
sequentially for an equal or a higher match.
..if i find an equal i iterate in both arrays, ...
..if i find a higher, i flip sides take the found element from array B
and search array A from the position where the last element from array
A was taken +1 sequentially for an equal or a higher.
..this is called "Merged join", i read.
..it performs well if the matching elements are "close" together.

Basically this sounds like a single pass of the Merge Sort algorithm (without the actual merge that
is)
.possibility two...
..i enhanced this method with binarysearch, as the arrays are several
10 million elements big.

Are Both arrays of similar Size?
..binarysearch is used instead of sequentially searching the remainder
of an array.
..this performs better if the matching elements are far appart, but
incredibly poor, if the matches are close together.

Define "Incredibly"
If you "Know" that a section of your data is highly localized you can narrow the scope of the binary
search to attempt to speed up the performance in the localized section.
any suggestions for a still faster solution to find matches in two
sorted arrays?

The first variant (Merge sort) should perform fairly well in the general case, however, if you have
knowledge about your data and how it is likely to be distributed you can possibly improve the
performance further. Let me ask you a few questions about the usage patterns.

After you intersect 2 Arrays, What then?
Intersect 2 new Arrays?
Intersect the first array with a whole bunch of secondary arrays?
The reason that I ask is that if you are repeatedly searching the same array, you could optimize the
search for THAT particular data set. A hashtable could improve your performance, but only if you are
searching the same Array more than once.

Are both Arrays of similar size?
Does the data have any known clumping or is it fairly uniform?
Do both Arrays, in general span the same range of possible values or is one of the arrays more
localized or clumpy?

If you "Know" you are in a localized/clumpy section of data....use the standard Merge sort method
If you "Know" you are likely going to have to go a long way to find the next match ...Use a binary
search.

If you can develop a heuristic to determine locality, you can alter your search method accordingly.
Unfortunately, adding the calculation of a heuristic ADDS to the overhead. So you will have to
decide if it is worth it.

Good luck
Bill
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top