Quickly finding duplicates in an ArrayList

P

paradox

Basically I have an ArrayList of strings; I need a fast way of
searching through and comparing the elements of the ArrayList and the
return an ArrayList of items that have 3 Duplicates.

For example, if the ArrayList has the following elements:

elems[0] = "blue"
elems[1] = "red"
elems[2] = "blue"
elems[3] = "green"
elems[4] = "red"
elems[5] = "red"

I want to have a function that goes through the list and then returns a
new arraylist like:

newElems[0] = "red"
newElems[1] = "red"
newElems[2] = "red"

In actuality my arraylist is an array of objects, which have a string
property and I'm referencing like: elems[0].filename = "red"; but the
above example is easier to follow.

I would like to be able to perform this search as fast as possible. I
personally don't have any real experience or familiarity with topics
such as binary trees, etc; would something like that be the route to
take? Thanks.
 
P

Philipp Schumann

I know of two routes that you can follow.

The first - look at the ArrayList.IndexOf method, which is probably the
quickest way to find a certain item in the underlying array. I believe there
is an overload to specify at which index to start the search.

Secondly, if you create the arraylist yourself (not some other 3rd-party
code) you could use ArrayList.Contains() when populating the list to ensure
the duplicates don't get into the list in the first place.

My "two" euro cents. :)
 
P

paradox

Well... actually the point of the code is to find the duplicates and
then display them to users. To give you a much better understanding,
the program I am creating is for determining conflicts in the pk3 (zip)
files used in Quake 3. As people develop new maps, they create and use
existing map elements (textures, shaders, etc) and they are all stored
in these pk3 files. The problem that occurs is that they create their
own version of certain shaders or textures without renaming or changing
the filename directory structure; and this causes the game to fail to
load properly. Specifically the problem seems to be associated to
occasions when there are 3 pk3 files with a shader element that has the
same name but different file size.

I've gone through the pk3 (zip) files and populated an arraylist called
conlficts with a object of type element (a class I've defined).

I'm trying to provide a report that will go thru this list and return
these elements (and their associated pk3 files). So that the end user
can easily determine which files that they need to remove in order to
be able to connect successfully.

Using IndexOf isn't much different than the slow 3 nested loop approach
I'm trying to avoid.
 
K

Kenny

I would iterate through the ArrayList and for each item, add it to a
HashTable if it does not already exist in the HashTable and set its
value to 1. If it already exists, increment the value of that item in
the HashTable.

Once you've gone through the ArrayList, iterate through the HashTable
to see which items have a value greater than 1.

It's heavy on memory, but runs in linear time. Should be much better
than 3 nested loops.
 
B

Brian Gideon

Walk through the array list and add each item to a hashtable that
stores the number of occurrences of that item. Since hashtable inserts
and searches have O(1) time complexity the entire operation would be
O(n), which is good.

Hashtable table = new Hashtable();
foreach (string item in elems)
{
if (!table.Contains(item))
{
table.Add(item, 0);
}

int count = (int)table[item] + 1;
table[item] = count;
if (count == 2)
{
// Duplicate items are identified here.
}
}

Brian
 
J

Jason Black [MSFT]

This is almost what I was going to suggest, but with the difference that
since you need to track the original ArrayList indexes of the duplicate
items, you should create new ArrayLists and store them in the hashtable.
Something like:

Hashtable h = new Hashtable();
for(i=0; i < MainArrayList.Count; i++) {
// get a key (name, whatever) in some way that's appropriate for your
objects:
someKeyType key = GetKeyFromObject(MainArrayList);
// make a new arraylist if necessary
if(h[key] == null) {
h[key] = new ArrayList();
}
// store the index of this item in the arraylist
((ArrayList)h[key]).Add(i);
}
foreach(someKeyType key in h.Keys) {
if( ((ArrayList)h[key]).Count > 1) {
// do whatever you need to do to show the duplicates to the user.
// ((ArrayList)h[key]) contains the indexes of the duplicate items.
}
}

Again, harsh on memory, but fast.
 
B

Brian Gideon

Google for it first. There's a lot of information available that does
a better job of explaining it than I could do. If there is still
something you don't quite understand go ahead and post back and we'll
be more than happy to clarify something.

Brian
 
P

paradox

Yeah, I did google it immediately after posting the question and was
able to find sufficient information. If I could of deleted the post, I
would have. ;)

I do have a small question though. Does the .NET Hashtable work in
exactly the same manner as a general Hashtable? Also, I tried using
reflector to track down the GetHashCode function to see how it worked;
I ended up at the String class (taking into consideration the type of
key I was using was a string), and the code I found for the GetHashCode
function was:

[MethodImpl(MethodImplOptions.InternalCall)]
public override extern int GetHashCode();

I couldn't figure out where to find anymore info...
 
B

Brian Gideon

The GetHashCode method is a virtual member of an Object so everything
has a default implementation for it. The string's version of the
method is internal to the .NET framework. That's why you can't see it.
One of the most frequently used ways of getting the hash code of an
object is to XOR all of the object's fields. In the case of a string
you might XOR all characters together. I seriously doubt that's what
the framework is doing though. XOR works well because it is fast and
produces a very random distribution of values. That's important when
it comes time to map the hash code to a slot in the memory array. A
good distribution of hash codes lowers the probability that two keys
will get mapped to the same slot in memory. This is called a
collision. There are many strategies for handling collisions. I
believe the Hashtable in dotnet uses quadratic probing. This strategy
involves number theory and the use of prime numbers.
 
Joined
May 16, 2011
Messages
1
Reaction score
0
Jason Black [MSFT];4159626 said:
This is almost what I was going to suggest, but with the difference that
since you need to track the original ArrayList indexes of the duplicate
items, you should create new ArrayLists and store them in the hashtable.
Something like:

Hashtable h = new Hashtable();
for(i=0; i < MainArrayList.Count; i++) {
// get a key (name, whatever) in some way that's appropriate for your
objects:
someKeyType key = GetKeyFromObject(MainArrayList);
// make a new arraylist if necessary
if(h[key] == null) {
h[key] = new ArrayList();
}
// store the index of this item in the arraylist
((ArrayList)h[key]).Add(i);
}
foreach(someKeyType key in h.Keys) {
if( ((ArrayList)h[key]).Count > 1) {
// do whatever you need to do to show the duplicates to the user.
// ((ArrayList)h[key]) contains the indexes of the duplicate items.
}
}

Again, harsh on memory, but fast.


"Kenny" <[email protected]> wrote in message
news:[email protected]...
>I would iterate through the ArrayList and for each item, add it to a
> HashTable if it does not already exist in the HashTable and set its
> value to 1. If it already exists, increment the value of that item in
> the HashTable.
>
> Once you've gone through the ArrayList, iterate through the HashTable
> to see which items have a value greater than 1.
>
> It's heavy on memory, but runs in linear time. Should be much better
> than 3 nested loops.
>


Using sorting mechanism is better than introducing HashSet , its not good practice to create another object.:cheers:
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top