String array intersections

K

kevinmr

To all -

I have two arrays of strings one of length M and the other of length
N; N > M. Ther arrays will be on the order of 1 million for N array
and ~50000 for the M array. Can someone describe to me a fast
algorithm to find the intersection of the arrays? Currently I am using
ArrayList in VC.NET.

Thanks in advance
 
G

Gilles Kohl [MVP]

To all -

I have two arrays of strings one of length M and the other of length
N; N > M. Ther arrays will be on the order of 1 million for N array
and ~50000 for the M array. Can someone describe to me a fast
algorithm to find the intersection of the arrays? Currently I am using
ArrayList in VC.NET.

Hmm, are your arrays sorted?

Regards,
Gilles.
 
G

Gilles Kohl [MVP]

As Gilles alludes to, if your arrays are already sorted, then you can do a
sort of "merge sort" algorithm in which rather than creating a new output
array, you're simply identifying matches. That should be very fast,
basically O(n).

That was the plan :)
If the arrays aren't sorted, then I think that putting all the elements of
one into a Dictionary<> instance and then looking up all the other array's
elements in that Dictionary<> instance would work well. This would be
somewhat slower than if the data were sorted to start with, but usually
faster than actually sorting the data just for this one purpose.

Mind you, depending on your definition of "fast", processing 1 million of
anything isn't necessarily ever going to actually be fast. :)

But I think a dictionary-based solution would perform reasonably well for
unsorted data.

Ditto - the new HashSet<T> could come in handy here to play the role
of a "lightweight Dictionary". Basically, with smallArray and
largeArray being string arrays:

HashSet<string> set = new HashSet<string>(smallArray);
var intersection = largeArray.Where(name => set.Contains(name));

(although the OPs mention of ArrayList might be indicative of target
Framework 1.1 rather than 3.5, ahem)

Regards,
Gilles.
 
J

Jon Skeet [C# MVP]

On Apr 17, 1:55 pm, "Peter Duniho" <[email protected]>
wrote:

Mind you, depending on your definition of "fast", processing 1 million of
anything isn't necessarily ever going to actually be fast. :)

Ooh, I don't know about that. A little test app I wrote using
HashSet<string> takes about 100ms on my laptop with N=1000000 and
M=50000.

Not the sort of thing you want to be doing *really* regularly (like on
every web request to a server) but still pretty fast :)

Jon
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top