How to best extract a list of identical keys in a sorted ArrayList with BinarySearch?

Guy · Jun 1, 2007

OK, I hit the wrong key. Here is the full question:

Is there a better way to find identical elements in a sorted array
list than the following:

iIndex = m_Array.BinarySearch( aSearchedObject );

if ( iIndex >= 0 )
{
    // BinarySearch found a match; keep it.
    m_ResultArray.Add( m_Array[iIndex] );

    // Collect the equal items that follow it, checking the bound first.
    iIndex++;
    while ( ( iIndex < m_Array.Count ) &&
            ( ( (IComparable) m_Array[iIndex] ).CompareTo( aSearchedObject ) == 0 ) )
    {
        m_ResultArray.Add( m_Array[iIndex] );
        iIndex++;
    }
}

Is there a guarantee that the first BinarySearch will land on the first
of the matching items in the sorted list?

For example, the search object is:
ID, Make, Model, Year (e.g. 1234, Buick, Regal, 1998)
The array contains 5 Buick, Regal, 1998 entries with 5 distinct IDs (they
have different engine configurations). I assume they are consecutive in
the sorted array.
My IComparable compares two objects using this:
(Make+Model+Year).GetHashCode().CompareTo( the other object's hash code )
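One caution about a comparison like the one above, offered as a sketch and not as a diagnosis of the actual code: ordering by hash code only keeps equal keys together reliably if no two different keys share a hash code, and the resulting order is otherwise arbitrary. Comparing the fields directly avoids both issues. The Car class and its member names below are hypothetical, based only on the example fields.

```csharp
using System;

// Hypothetical item type based on the example fields (ID, Make, Model, Year).
public class Car : IComparable
{
    public int Id;
    public string Make;
    public string Model;
    public int Year;

    public Car(int id, string make, string model, int year)
    {
        Id = id;
        Make = make;
        Model = model;
        Year = year;
    }

    // Orders by Make, then Model, then Year. Id is deliberately ignored,
    // so cars that differ only by Id compare as equal.
    public int CompareTo(object obj)
    {
        Car other = (Car)obj;

        int result = string.CompareOrdinal(Make, other.Make);
        if (result != 0) return result;

        result = string.CompareOrdinal(Model, other.Model);
        if (result != 0) return result;

        return Year.CompareTo(other.Year);
    }
}
```

With this ordering, the five Buick Regal 1998 entries compare as equal to each other and sit next to each other after a sort, whatever their IDs.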

I need to extract the IDs of the 5 Buicks from the list as fast as possible.

Thanks.

Ben Voigt [C++ MVP] · Jun 1, 2007

Guy said:
Is there a guarantee that the first BinarySearch will land on the first
of the matching items in the sorted list?

Probably not... checking Reflector.

No, there is no such guarantee. BinarySearch will return the index for the
first item it encounters which compares equal, and since it doesn't make a
sequential pass through the array, that is not necessarily the lowest index
of all matching items.

Bruce Wood · Jun 1, 2007

Use BinarySearch to find an entry. Work backward from the entry to
find the first entry. Work forward from the entry that BinarySearch
found to find the last entry. Done.
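A minimal sketch of that backward/forward walk, assuming the list is already sorted by the items' own IComparable ordering. RunFinder and FindAllEqual are illustrative names, not anything from the framework.

```csharp
using System;
using System.Collections;

public static class RunFinder
{
    // Returns every item in 'sorted' that compares equal to 'key'.
    // 'sorted' must already be sorted using the items' IComparable order.
    public static ArrayList FindAllEqual(ArrayList sorted, IComparable key)
    {
        ArrayList result = new ArrayList();

        // BinarySearch returns a negative number when nothing matches.
        int hit = sorted.BinarySearch(key);
        if (hit < 0) return result;

        // Work backward from the hit to the first equal item.
        int first = hit;
        while (first > 0 && key.CompareTo(sorted[first - 1]) == 0) first--;

        // Work forward from the hit to the last equal item.
        int last = hit;
        while (last < sorted.Count - 1 && key.CompareTo(sorted[last + 1]) == 0) last++;

        result.AddRange(sorted.GetRange(first, last - first + 1));
        return result;
    }
}
```

The two walks cost time proportional to the number of equal items, which is fine for a handful of duplicates per key.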

There is no guarantee that a binary search will find the first entry,
since it doesn't examine all of the elements, it can't possibly. Only
a linear search will guarantee to find the first matching entry.

How long is this ArrayList, by the way?

If the ArrayList is truly huge, and this search is a common operation,
consider the following: trade memory for speed. Use a SortedList of
objects, where each object in the main list contains the key and a
list of items with that key. When you need all items with a certain
key, hash into the SortedList (no searching required) and grab the
list of equivalent items. Lookups will be as fast as they can possibly
be.

You can also create an enumeration over the SortedList that returns a
deep traversal of all items in the collection, in order.

However, the whole thing will use significantly more memory.
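For illustration, the "key plus list of items" layout might look something like this, assuming the generic SortedList<> and a string key built from make, model and year. GroupedIndex and its method names are made up for the example.

```csharp
using System;
using System.Collections.Generic;

public static class GroupedIndex
{
    // Adds 'id' to the list stored under 'key', creating the list on first use.
    public static void Add(SortedList<string, List<int>> index, string key, int id)
    {
        List<int> ids;
        if (!index.TryGetValue(key, out ids))
        {
            ids = new List<int>();
            index.Add(key, ids);
        }
        ids.Add(id);
    }

    // Returns the IDs stored under 'key', or an empty list if there are none.
    public static List<int> Lookup(SortedList<string, List<int>> index, string key)
    {
        List<int> ids;
        return index.TryGetValue(key, out ids) ? ids : new List<int>();
    }
}
```

Each key appears only once in the SortedList, so a single lookup returns all of its items and no scan over neighbouring entries is needed.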

Peter Duniho · Jun 1, 2007

Bruce Wood said:
Use BinarySearch to find an entry. Work backward from the entry to
find the first entry. Work forward from the entry that BinarySearch
found to find the last entry. Done.

There is no guarantee that a binary search will find the first entry,
since it doesn't examine all of the elements, it can't possibly. Only
a linear search will guarantee to find the first matching entry.

Well, sure a binary search could find the first entry. It could do
exactly what you propose the OP do, which is to upon finding a matching
element, work its way back until it knows where the first matching one
is. It all just depends on how the behavior is defined. Microsoft
decided to go with the faster-but-ambiguous "first match found" design,
but they could just as easily have defined the behavior to always return
the "first match in array" result instead.
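As a sketch of what a "first match in array" search could look like: the version below does not walk back from a hit but swaps in the usual lower-bound variant of binary search, which keeps narrowing the range until it reaches the first equal item. It is not how the framework's BinarySearch behaves, and LowerBound and FirstIndexOf are made-up names.

```csharp
using System;
using System.Collections;

public static class LowerBound
{
    // Returns the index of the first item equal to 'key', or -1 if none.
    // Needs about log2(n) comparisons even when many items are equal.
    public static int FirstIndexOf(ArrayList sorted, IComparable key)
    {
        int lo = 0;
        int hi = sorted.Count;          // search the half-open range [lo, hi)

        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (key.CompareTo(sorted[mid]) > 0)
                lo = mid + 1;           // sorted[mid] < key: first match is to the right
            else
                hi = mid;               // sorted[mid] >= key: first match is at mid or to the left
        }

        return (lo < sorted.Count && key.CompareTo(sorted[lo]) == 0) ? lo : -1;
    }
}
```

From the returned index, a forward walk collects the rest of the equal items.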

Bruce Wood said:
How long is this ArrayList, by the way?

If the ArrayList is truly huge, and this search is a common operation,
consider the following: trade memory for speed. Use a SortedList of
objects, where each object in the main list contains the key and a
list of items with that key. When you need all items with a certain
key, hash into the SortedList (no searching required) and grab the
list of equivalent items. Lookups will be as fast as they can possibly
be.

I'm not exactly clear on what you propose here. First of all, according
to the docs, SortedList does not allow duplicate keys. So the question of
finding the first of a given key is moot.

Second, hashing only works if you've stored the items according to the
hash value in the first place. I see no advantage in applying some sort
of hash value to a SortedList. Now, a Hashtable would be different and of
course much faster. But then that's true generally and is a completely
different issue.

Finally, as near as I can tell the implementation of SortedList involves a
separate array containing just the keys and indexes into the array
containing the actual value data. This means that every time something is
added to, or removed from the collection, the key index array has to be
adjusted, shifting all the items after the insertion or removal. This is
probably okay if the array is built once and then persists, but it could
get very expensive if there's a lot of data and the array is in constant
flux. Note that this issue is worst in the very situation in which you're
suggesting one use SortedList: "If the ArrayList is truly huge".

And of course, given that you can't just "hash into the SortedList",
searching a SortedList is going to take the same time as searching a
sorted ArrayList, since both would require a binary search for optimal
results.

All of the above is with respect to the non-generic SortedList. There is
also a generic SortedList<> that would be useful in some situations, but
which has similar problems (in particular, the requirement that keys be
unique). It's not clear from the documentation whether it's a balanced
binary tree or not, but if it's not then it would also have the problem
that the worst-case scenario for searching is considerably worse than for
a binary search.

If constant-order searching is desired, then a hashtable-based data
structure needs to be used, like Hashtable or the generic Dictionary<>.
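A rough sketch of the hashtable-based alternative, assuming string keys and integer IDs and no need to keep the keys in order. HashedIndex and Build are illustrative names only.

```csharp
using System;
using System.Collections.Generic;

public static class HashedIndex
{
    // Groups (key, id) pairs so that all IDs for a key come back from one lookup.
    // keys[i] is the key for ids[i]; both arrays must have the same length.
    public static Dictionary<string, List<int>> Build(string[] keys, int[] ids)
    {
        Dictionary<string, List<int>> index = new Dictionary<string, List<int>>();
        for (int i = 0; i < keys.Length; i++)
        {
            List<int> bucket;
            if (!index.TryGetValue(keys[i], out bucket))
            {
                bucket = new List<int>();
                index[keys[i]] = bucket;
            }
            bucket.Add(ids[i]);
        }
        return index;
    }
}
```

Enumerating a Dictionary<> does not return keys in sorted order, so this only fits when ordered traversal is not required.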

Pete

Bruce Wood · Jun 2, 2007

Peter Duniho said:
Well, sure a binary search could find the first entry. It could do
exactly what you propose the OP do, which is to upon finding a matching
element, work its way back until it knows where the first matching one
is. It all just depends on how the behavior is defined. Microsoft
decided to go with the faster-but-ambiguous "first match found" design,
but they could just as easily have defined the behavior to always return
the "first match in array" result instead.

True.

Peter Duniho said:
I'm not exactly clear on what you propose here. First of all, according
to the docs, SortedList does not allow duplicate keys. So the question of
finding the first of a given key is moot.

You stopped reading at "SortedList". I'm proposing the classic "list
of lists" solution: the sorted list contains one entry for each key,
where that entry contains all of the items that match that key.

Peter Duniho said:
Second, hashing only works if you've stored the items according to the
hash value in the first place. I see no advantage in applying some sort
of hash value to a SortedList. Now, a Hashtable would be different and of
course much faster. But then that's true generally and is a completely
different issue.

Finally, as near as I can tell the implementation of SortedList involves a
separate array containing just the keys and indexes into the array
containing the actual value data. This means that every time something is
added to, or removed from the collection, the key index array has to be
adjusted, shifting all the items after the insertion or removal. This is
probably okay if the array is built once and then persists, but it could
get very expensive if there's a lot of data and the array is in constant
flux. Note that this issue is worst in the very situation in which you're
suggesting one use SortedList: "If the ArrayList is truly huge".

And of course, given that you can't just "hash into the SortedList",
searching a SortedList is going to take the same time as searching a
sorted ArrayList, since both would require a binary search for optimal
results.

I have never looked at the underlying implementation for a SortedList.
If you are correct, then I'm appalled that it does not combine a list
and a hash table. If the indexer on key has to do a binary search then
that's a terrible waste of time. I'd rather trade memory for more
speed, personally.

Peter Duniho said:
All of the above is with respect to the non-generic SortedList. There is
also a generic SortedList<> that would be useful in some situations, but
which has similar problems (in particular, the requirement that keys be
unique).

...which isn't a problem at all: you misunderstood my post.

Peter Duniho said:
It's not clear from the documentation whether it's a balanced
binary tree or not, but if it's not then it would also have the problem
that the worst-case scenario for searching is considerably worse than for
a binary search.

If constant-order searching is desired, then a hashtable-based data
structure needs to be used, like Hashtable or the generic Dictionary<>.

Yes, and if SortedList truly is simply an ordered list with a key
indexer that searches by binary search or some other such method, then
I wouldn't propose it.

In that case the OP should use a hash table and, if ordered access is
required for some reason, either maintain a parallel sorted list or
sort it on an as-needed basis.

I still can't believe that SortedList doesn't use a dual structure
internally. Yuck.

Peter Duniho · Jun 2, 2007

Bruce Wood said:
You stopped reading at "SortedList". I'm proposing the classic "list
of lists" solution: the sorted list contains one entry for each key,
where that entry contains all of the items that match that key.

I didn't stop reading. I simply got distracted by the implication that
you could use some sort of hashed access to the SortedList.

Bruce Wood said:
[...]
I have never looked at the underlying implementation for a SortedList.
If you are correct, then I'm appalled that it does not combine a list
and a hash table. If the indexer on key has to do a binary search then
that's a terrible waste of time. I'd rather trade memory for more
speed, personally.

Well, there's a Hashtable class. I presume that SortedList doesn't do
hashing, because someone who wanted hashing would just use the Hashtable
class instead, or possibly would use the two classes together to achieve
hashing with sorting. Hashing and sorting are so fundamentally different
that I just don't see a simple collection class actually implementing
both. Not that it couldn't, just that it would make the interface so
complicated, merging two completely different behaviors into a single
class.

[...]

Peter Duniho said:
If constant-order searching is desired, then a hashtable-based data
structure needs to be used, like Hashtable or the generic Dictionary<>.


Bruce Wood said:
Yes, and if SortedList truly is simply an ordered list with a key
indexer that searches by binary search or some other such method, then
I wouldn't propose it.

From the documentation of SortedList.Item:

Retrieving the value of this property is an O(log n)
operation, where n is Count

Sure looks like a binary search to me.

Bruce Wood said:
In that case the OP should use a hash table and, if ordered access is
required for some reason, either maintain a parallel sorted list or
sort it on an as-needed basis.

Well, IMHO it really depends on the performance needs. A binary search is
actually pretty fast, even on very large data. The main problem with it
is that it requires that the data be kept in sorted order in the first
place, which can be expensive. But if you already have the requirement
that the data be sorted, I see no need to switch to hashing to find data.
Even if you assume that the binary search is slower than a lookup by hash
value, it's not going to be *much* slower (certainly not an order of
magnitude or anything like that).

In addition, hashing of course has the trade-off of memory requirements
versus collisions. With any large data set, unless you have a really huge
hash table, you're going to have a fair number of collisions, which of
course you have to search linearly through before getting the item you
really want. That linear search can easily consume the same time as a
simple binary search on sorted data would, depending on the number of
collisions for that hashed value.

Hashing works well when the data isn't sorted in the first place and you
have no other need to sort it, but if the data needs to be sorted anyway,
there's really no reason to not just use a binary search.

Bruce Wood said:
I still can't believe that SortedList doesn't use a dual structure
internally. Yuck.

Well, it does. It has an index array and the value array. It's just
that neither of those data structures is a hash table.

Pete

Bruce Wood · Jun 2, 2007

Peter Duniho said:
Well, there's a Hashtable class. I presume that SortedList doesn't do
hashing, because someone who wanted hashing would just use the Hashtable
class instead, or possibly would use the two classes together to achieve
hashing with sorting. Hashing and sorting are so fundamentally different
that I just don't see a simple collection class actually implementing
both. Not that it couldn't, just that it would make the interface so
complicated, merging two completely different behaviors into a single
class.

Oh, I don't know. I think the interface would look exactly like
SortedList, except that it would find things faster.

Peter Duniho said:
From the documentation of SortedList.Item:

Retrieving the value of this property is an O(log n)
operation, where n is Count

Sure looks like a binary search to me.

Yup. That's pretty conclusive. I should read the doc more closely,
huh?

Peter Duniho said:
Well, IMHO it really depends on the performance needs. A binary search is
actually pretty fast, even on very large data. The main problem with it
is that it requires that the data be kept in sorted order in the first
place, which can be expensive. But if you already have the requirement
that the data be sorted, I see no need to switch to hashing to find data.
Even if you assume that the binary search is slower than a lookup by hash
value, it's not going to be *much* slower (certainly not an order of
magnitude or anything like that).

Agreed.

Peter Duniho said:
In addition, hashing of course has the trade-off of memory requirements
versus collisions. With any large data set, unless you have a really huge
hash table, you're going to have a fair number of collisions, which of
course you have to search linearly through before getting the item you
really want. That linear search can easily consume the same time as a
simple binary search on sorted data would, depending on the number of
collisions for that hashed value.

Hashing works well when the data isn't sorted in the first place and you
have no other need to sort it, but if the data needs to be sorted anyway,
there's really no reason to not just use a binary search.

Agreed.

I had just assumed that SortedList provided the fastest access
possible but, as you pointed out, the gain in speed wouldn't be that
much, and the amount of memory consumed wouldn't be worth it.

So, we agree: the OP should use a Hashtable (or Dictionary<>) if
sorting isn't important, and a SortedList if sorting is required.

Peter Duniho · Jun 2, 2007

Bruce Wood said:
[...]
So, we agree: the OP should use a Hashtable (or Dictionary<>) if
sorting isn't important, and a SortedList if sorting is required.

Yes, vehemently agree.

(Personally, I'm in love with the generics, so
I would recommend SortedList<> if the data needs to be ordered generally,
and Dictionary<> if not.)
