ArrayList BinarySearch vs Contains

tshad

Which is better to use with an ArrayList: BinarySearch or Contains?

The list is only going to have strings in it and it will be sorted.

Thanks,

Tom
 
tshad said:
Which is better to use with an ArrayList: BinarySearch or Contains?

The list is only going to have strings in it and it will be sorted.

Then BinarySearch is better. Contains will be an O(n) operation always
(since it doesn't know whether the data is sorted), while BinarySearch is an
O(lg(n)) operation. Just make sure that the data really is sorted!
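
For what it's worth, a minimal sketch of the two calls (assuming the non-generic ArrayList holding strings, as described above; note that BinarySearch reports "found" with a non-negative index):

using System;
using System.Collections;

class SearchDemo
{
    static void Main()
    {
        ArrayList names = new ArrayList();
        names.Add("carol");
        names.Add("alice");
        names.Add("bob");
        names.Sort();                        // BinarySearch requires a sorted list

        // Contains: O(n) linear scan, works whether or not the list is sorted
        bool found = names.Contains("bob");

        // BinarySearch: O(lg(n)), but only meaningful on a sorted list;
        // a negative return value means the item was not found
        int index = names.BinarySearch("bob");

        Console.WriteLine("Contains: {0}, BinarySearch index: {1}", found, index);
    }
}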

-cd
 
tshad said:
Which is better to use with an ArrayList: BinarySearch or Contains?

The list is only going to have strings in it and it will be sorted.

Contains is O(n)
BinarySearch is O(log(n))

For small collections Contains is probably faster.
For large collections BinarySearch is faster.

If you are frequently looking for data near the front of the collection,
Contains might have an edge.

In short, it depends on the number of items and the normal usage pattern.
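
If you want to see where the crossover falls for your own data, a rough harness along these lines can help (the sizes, padding scheme and loop counts here are just illustrative assumptions, and it assumes .NET 2.0's Stopwatch is available):

using System;
using System.Collections;
using System.Diagnostics;

class CrossoverTest
{
    static void Main()
    {
        int count = 100000;                  // vary this to find the crossover point
        ArrayList list = new ArrayList();
        for (int i = 0; i < count; i++)
            list.Add(i.ToString("D8"));      // zero-padded so string order matches numeric order
        list.Sort();

        // Searching for the last item is the worst case for Contains
        string target = (count - 1).ToString("D8");

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < 1000; i++) list.Contains(target);
        Console.WriteLine("Contains:     {0} ms", sw.ElapsedMilliseconds);

        sw = Stopwatch.StartNew();
        for (int i = 0; i < 1000; i++) list.BinarySearch(target);
        Console.WriteLine("BinarySearch: {0} ms", sw.ElapsedMilliseconds);
    }
}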

Hope this helps
Bill
 
Carl Daniel said:
Then BinarySearch is better. Contains will be an O(n) operation always
(since it doesn't know whether the data is sorted), while BinarySearch is
an O(lg(n)) operation. Just make sure that the data really is sorted!

What does O(n) and O(lg(n)) mean?

Thanks,

Tom
 
tshad said:
What does O(n) and O(lg(n)) mean?

This question popped up just this week.
Here was my answer


If memory serves correctly the O stands for 'Order'.
Thus O(n) would be called "Order n" and O(n^2) would be "Order n
squared".
This is a measure of how efficiently a given algorithm performs.
Technically it is a measure of the asymptotic behavior as the number of
elements gets very large, but it is often appropriate even for fairly
small collections.

O(1) means that the operation takes a constant amount of time,
independent of the number of items involved. A hashtable retrieval is an
O(1) operation.

O(n) means that the operation takes time proportional to the number of
elements (n). If a collection has twice as many elements, the operation
takes twice as long. A linear search is an O(n) operation.
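
As a quick illustration of that difference (a sketch of my own; the key names are made up):

using System;
using System.Collections;

class LookupDemo
{
    static void Main()
    {
        ArrayList list = new ArrayList();
        Hashtable table = new Hashtable();
        for (int i = 0; i < 1000; i++)
        {
            string key = "item" + i;
            list.Add(key);
            table[key] = i;
        }

        // O(n): Contains walks the list element by element
        Console.WriteLine(list.Contains("item999"));

        // O(1) on average: ContainsKey hashes the key and probes one bucket
        Console.WriteLine(table.ContainsKey("item999"));
    }
}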

Others include:
O(n^2): goes as n*n (not very efficient for large collections)
O(log(n)): goes as the logarithm of the number of elements
O(n*log(n)): you get the idea.

Hope this helps
Bill
 
tshad said:
What does O(n) and O(lg(n)) mean?

It's a Computer Science way of describing the complexity or cost of an
algorithm. The "n" relates to the size of the data, the "O" is read "order",
and the function inside the parentheses describes how the running time of
the operation varies with the size of the data.

In this particular case, O(n) refers to the fact that a straight linear
search through the data will always take an amount of time that is directly
proportional to the size of the data, and increases linearly as the size of
the data increases. On the other hand, the BinarySearch method takes
advantage of the fact that the data is sorted, allowing a search for a
particular data item to increase in time proportional to log(n). Since
log(n) grows far more slowly than n, this means BinarySearch is much
more efficient, on average.

If you intend to do any serious programming, especially where performance is
an issue, you would do well to find a good textbook or other reference on
algorithms and learn not only about the concept of "order", but also what
common algorithms have what order and how the order is affected by the data
(for example, the average case for a binary search is O(log(n)), but
depending on how the data is stored the worst case might wind up being O(n)
anyway).

Pete
 
Hi Tom,

"Big O Notation"
http://en.wikipedia.org/wiki/Big_o_notation

Basically you can think of it as the "n"umber of "O"perations that is the
worst-case scenario to find the specified item in the list.

e.g., if you have 10 items in the list then the worst case in a linear search
would be O(n). This means that the worst case scenario would be examining
every item in the list (10 operations). The worst-case in a binary search
would be O(lg(n)): roughly the binary logarithm of 10 comparisons. That
holds because a binary search cuts the list in half, then performs the same
operation again on the half that contains the search item. The pattern
continues until only one or zero items remain. If the list isn't sorted
properly then the search may fail, because the algorithm might choose the
"wrong" half in which to continue searching after comparing the current
item to the search item.
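
A hand-rolled version of that halving loop over a sorted string array, just to make the step concrete (purely illustrative; ArrayList.BinarySearch already does this for you):

using System;

class HalvingDemo
{
    static int Search(string[] sorted, string target)
    {
        int low = 0, high = sorted.Length - 1;
        while (low <= high)
        {
            int mid = low + (high - low) / 2;      // middle of the remaining range
            int cmp = string.CompareOrdinal(sorted[mid], target);
            if (cmp == 0) return mid;              // found it
            if (cmp < 0) low = mid + 1;            // keep searching the upper half
            else high = mid - 1;                   // keep searching the lower half
        }
        return -1;                                 // not found
    }

    static void Main()
    {
        string[] names = { "alice", "bob", "carol", "dave" };
        Console.WriteLine(Search(names, "carol"));   // 2
        Console.WriteLine(Search(names, "zoe"));     // -1
    }
}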

"Binary Logarithm"
http://en.wikipedia.org/wiki/Binary_logarithm

"Binary Search"
http://en.wikipedia.org/wiki/Binary_search
 
Dave Sexton said:
e.g., if you have 10 items in the list then the worst case in a linear
search would be O(n). This means that the worst case scenario would be
examining every item in the list (10 operations). The worst-case in a
binary search would be O(lg(n)).

Note:

"Order" stated by itself is almost always for *average* performance, not
worst-case. One would normally specify explicitly if describing the
worst-case order for an algorithm.

For example, it is NOT true that "the worst-case in a binary search would be
O(log(n))" for all implementations of a binary search. If you have a sorted
array, it's true that the worst case is O(log(n)). However, if you have a
binary tree, a search on that tree could be as bad as O(n).
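
To make that concrete, here is a sketch of my own (not anything from this thread) of a naive, unbalanced binary search tree: insert keys that are already sorted and every node hangs off the right child, so the tree degenerates into a linked list and a search can touch all n nodes.

using System;

// Naive, unbalanced binary search tree (illustrative only).
class Node
{
    public string Key;
    public Node Left, Right;
    public Node(string key) { Key = key; }
}

class NaiveBst
{
    public Node Root;

    public void Add(string key)
    {
        Root = Insert(Root, key);
    }

    static Node Insert(Node node, string key)
    {
        if (node == null) return new Node(key);
        if (string.CompareOrdinal(key, node.Key) < 0)
            node.Left = Insert(node.Left, key);      // smaller keys go left
        else
            node.Right = Insert(node.Right, key);    // larger or equal keys go right
        return node;
    }

    static int Depth(Node node)
    {
        if (node == null) return 0;
        return 1 + Math.Max(Depth(node.Left), Depth(node.Right));
    }

    static void Main()
    {
        NaiveBst tree = new NaiveBst();
        foreach (string key in new string[] { "a", "b", "c", "d", "e" })
            tree.Add(key);                           // already-sorted input

        // Every node hangs off the right child, so depth == item count:
        // a search here is O(n), not O(log n).
        Console.WriteLine(Depth(tree.Root));         // prints 5
    }
}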

Pete
 
Hi Peter,

My interpretation of the wikipedia articles is obviously off a little. I
thought O notation always represented worst-case, so thanks for correcting me.
I guess it's just a fluke that I analyzed the two cases that actually are
expressing the worst-cases :)
 
Dave Sexton said:
My interpretation of the wikipedia articles is obviously off a little. I
thought O notation always represented worst-case, so thanks for correcting
me.

I just took a look at the article you referenced. Wow. They sure didn't
cover it in that kind of detail in my CS classes way back when. :) Anyway, I
can see where the confusion comes from, since the phrase "asymptotic upper
bound" could be read to imply "worst case" (the "upper bound" certainly has
that implication in daily usage). I can't say as I blame you for reading it
that way.

That's what we lowly programmers get for using the same concept that those
high-falutin' "algorithm scientists" use. :)
I guess it's just a fluke that I analyzed the two cases that actually are
expressing the worst-cases :)

I guess. Though, to be fair, in those two cases the order of the average
case is the same as the order of the worst case (if one assumes a binary
search on an array, which seems reasonable given this thread). So the
distinction is entirely academic. :)

Pete
 
Hi Peter,

I must admit that my incomplete understanding of the O notation wasn't based
solely on wikipedia. I've always believed it to represent "worst-case", not
average, for many years now. It's the math that makes it confusing, yet I
thought I had it right :p

Luckily, I've never been tasked with the responsibility of authoring
high-performance search algorithms on complex data structures. Although if I
was, I probably would have learned the correct meaning of the notation ;)

Thanks again.
 
Dave Sexton said:
I must admit that my incomplete understanding of the O notation wasn't based
solely on wikipedia. I've always believed it to represent "worst-case", not
average, for many years now. It's the math that makes it confusing, yet I
thought I had it right :p

I think that when the two are significantly different, it's best to
give both. For instance, QuickSort has an average O(n log n)
complexity, but its worst case is O(n^2) IIRC.
 
Hi Jon,
I think that when the two are significantly different, it's best to
give both. For instance, QuickSort has an average O(n log n)
complexity, but its worst case is O(n^2) IIRC.

That's a good idea to cover all bases.

Then the difficulty comes in comparing two algorithms that have asymmetric
average to worst case ratios. How can the algorithms be compared when they
don't have any common denominator?

What I mean is that it's much easier to choose an algorithm based on either
average time or worst-case time but it's hard for me to compare both, so which
aspect has more relevance? Of course, I'd agree that it's better to have all
the information on hand when making the decision (even though ignorance is
bliss, as they say ;)

BTW, wikipedia agrees with your math and so does the Array.Sort doc on MSDN,
so it seems you did remember correctly.

"Quicksort"
http://en.wikipedia.org/wiki/QuickSort

(The Quicksort article is particularly interesting with its visual
representations of the sort in action)

"Array.Sort Method"
http://msdn.microsoft.com/library/d...cpref/html/frlrfsystemarrayclasssorttopic.asp
 
Contains is an O(n) operation while BinarySearch is an O(log n) operation. With
big lists and under intense load, this will make a huge difference, so use
BinarySearch to locate items in sorted lists.
 
Dave Sexton said:
That's a good idea to cover all bases.

Then the difficulty comes in comparing two algorithms that have asymmetric
average to worst case ratios. How can the algorithms be compared when they
don't have any common denominator?

They can only be compared with as much data as is available.
What I mean is that it's much easier to choose an algorithm based on either
average time or worst-case time but it's hard for me to compare both, so which
aspect has more relevance? Of course, I'd agree that it's better to have all
the information on hand when making the decision (even though ignorance is
bliss, as they say ;)

You'd need to have more information such as how likely the worst case
scenario is, and how easy it might be to avoid.

Note that just knowing the order isn't always enough either - two algorithms
can both be the same order, but with a large constant factor involved, e.g. one
will always be five times slower than the other. IIRC (again!) a merge
sort can *always* be O(n log n) but it has other costs both in memory
and time factors which are higher than QuickSort, which is why
QuickSort is used more often.
BTW, wikipedia agrees with your math and so does the Array.Sort doc on MSDN,
so it seems you did remember correctly.

And to think people claim miracles don't happen in the 21st century ;)
 
If you have a sorted array, it's true that the worst case is
O(log(n)).

Everyone is making wonderful points on this topic, but I wanted to
underline a few key points.
As Peter said, a BinarySearch on a sorted ArrayList (or a balanced
binary tree) will have a worst-case performance of O(log(n)).

Another important point about Big O notation is that it describes
asymptotic behavior. Thus, for sufficiently large n, an O(log(n))
algorithm will outperform an O(n) algorithm. The problem is determining
what constitutes "sufficiently large n". For small values of n,
O(log(n)) algorithms often perform WORSE than linear algorithms due to
the additional setup complexity. Of course, for small collections, ANY
search will be fast, so it doesn't really matter.

Bill
 
Hi Jon,

So it seems that choosing an appropriate algorithm depends not only on what
the O notation says about performance, but also on other factors that the
notation doesn't account for, such as the scale, depth and distribution of
the data, which may also affect the performance of the algorithm.

I guess the best way to choose an appropriate algorithm for an application
when dealing with complex data structures, sorts and searches, is to pick one
and see if it performs better than any other algorithm that I decide to test.
I can choose the algorithms to be initially tested based on all of the
information that is available, ahead of time, about the data and its
structure, the noted performance of the algorithms and the acceptable criteria
for performance and memory consumption.

I'll be outsourcing any project in the future where this is required ;)

Thanks.
 
Hi Bill,

It seems you are suggesting that the scale of the data can affect whether a
required sorting operation will actually reduce the overall performance of an
algorithm so that it becomes worse than a linear search. Can this be true
even when the data is sorted as it's being entered into the structure?

Since I'm not one for micro-optimization either I'd rather just choose an
algorithm with an average performance of O(lg(n)) if I'm sure that there may
be substantial amounts of data at times and just swallow the performance hit
on those smaller collections. I guess if the data swings between those two
extremes, then maybe switching between linear and binary algorithms at runtime
might be acceptable. I've never had a project that required this much thought
on search algorithms, however :)

Thanks.
 
Dave Sexton said:
Hi Jon,

So it seems that choosing an appropriate algorithm depends not only on
what the O notation says about performance, but also on other factors that
the notation doesn't account for, such as the scale, depth and distribution
of the data, which may also affect the performance of the algorithm.

Yup. Welcome to the world of optimization. :) The "order" of an algorithm
is an important analytical aspect, but as you've noted one should not
consider it to be the last word on which algorithm is "best". There are
so many other factors that come into play, the order by itself just doesn't
tell you everything you need to know.
I guess the best way to choose an appropriate algorithm for an application
when dealing with complex data structures, sorts and searches, is to pick
one and see if it performs better than any other algorithm that I decide
to test.

Well, yes and no. I mean, it's useful to use the order as a starting point.
After all, you can pretty much rule out even testing a bubble sort when
you've got a Quicksort implementation to use. You don't literally need to
test all possible solutions, thank goodness.

Also note that testing your potential solutions does not necessarily tell
you worst-case performance. It depends on how you pick your data for the
tests, and different algorithms may have different data that results in the
worst-case performance. So, testing can be useful, but it's not necessarily
an improvement over just looking at the order. To improve upon the order
analysis, you have to do more analysis. Testing alone won't cut it.
I can choose the algorithms to be initially tested based on all of the
information that is available, ahead of time, about the data and its
structure, the noted performance of the algorithms and the acceptable
criteria for performance and memory consumption.

I'll be outsourcing any project in the future where this is required ;)

That works. :)

That said, I'll note that much of the time, absolutely *optimal* performance
is not required, while at the same time the very worst cases are not
necessarily that hard to avoid. For example, even in the average case,
Quicksort is dependent on the distribution of the data and how you pick the
pivot element for each iteration. Even when the magnitude is reasonably
proportional to n log(n), there can be wide variations in real-time
performance.

This can be addressed either by playing some tricks when picking the pivot,
or by simply ignoring the problem and accepting that occasionally you'll get
data organized in a way that hinders the sorting performance. The very
worst case comes when the data is already sorted, and this condition can be
detected fairly easily (and of course, if you detect it, you need not do any
more work :) ).
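
For reference, a bare-bones Quicksort sketch (Lomuto partition with the last element as the pivot; this is my own illustration, not the implementation the framework uses). With already-sorted input every partition is maximally lopsided, which is exactly the worst case mentioned above:

using System;

class QuickSortSketch
{
    // Minimal Quicksort (Lomuto partition, last element as pivot) - illustrative only.
    static void QuickSort(int[] a, int low, int high)
    {
        if (low >= high) return;

        int pivot = a[high];                 // naive pivot choice
        int i = low;
        for (int j = low; j < high; j++)
        {
            if (a[j] < pivot)
            {
                int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
                i++;
            }
        }
        int t = a[i]; a[i] = a[high]; a[high] = t;

        // On already-sorted input one side of the split is always empty, so the
        // work degrades from O(n log n) on average to O(n^2) in the worst case.
        QuickSort(a, low, i - 1);
        QuickSort(a, i + 1, high);
    }

    static void Main()
    {
        int[] data = { 5, 3, 8, 1, 9, 2 };
        QuickSort(data, 0, data.Length - 1);
        foreach (int x in data) Console.Write(x + " ");   // 1 2 3 5 8 9
        Console.WriteLine();
    }
}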

I'm just using Quicksort as an example. The same ideas apply to pretty much
any algorithm. Knowing the order is useful, but it's not the end-all,
be-all for picking an algorithm or implementation. You can use the order to
make some broad generalizations, and then refine the implementation to
address specifics that apply in one's particular situation. And in the end,
most coders simply accept that sometimes the universe will throw some really
hard data at the algorithm, and it won't get the very best performance it
might have gotten. :)

Pete
 
Dave Sexton said:
It seems you are suggesting that the scale of the data can affect whether
a required sorting operation will actually reduce the overall performance
of an algorithm so that it becomes worse than a linear search.

His point (I think) is that some algorithms only work after some initial
setup, and that setup can be costly. I'm not sure that all sorting and
searching algorithms are a good example of this. For example, depending on
the implementation of a Quicksort, it can perform just as well on small data
sets as large ones. Likewise binary search (though for very small data
sets, the difference between n and log n is very small).

But it is certainly true that the order of an algorithm considers only a
data set that's basically infinite. There can be wide variations in
performance, well away from the theoretical performance, when the data set
is very small. To make matters worse, what constitutes "very small" or
"large enough" varies depending on the algorithm.

I think this is what Bill was saying...I don't think I'm adding anything
new, but hopefully am just restating it in a slightly different way, which
might make it clearer to some people (just as it may obfuscate it to others
:) ).
Can this be true even when the data is sorted as it's being entered into
the structure?

Ack! And I mean that in the Bill the Cat way, not the i/o handshaking way.
:)

The analysis becomes very different when you've got an incremental situation
like that. The order analysis for these different algorithms assumes you
start with all the data and run through it once to produce the end results.
What is best in that case may very well not be best for a case where you
want to perform the same operation after each data element is added.

For example, resorting an array would be a silly way to sort data as it's
being added to a data structure. After all, if the data is already sorted,
sorting it again just to add one element is very inefficient. Even worse, as
I noted elsewhere, the worst-case performance for some sorting algorithms
comes when the data starts out sorted. This is a classic example of why
knowing the exact scenario in which the algorithm is to be used is at least as
important as knowing the order of the algorithm. :)

I would never use Quicksort to sort data as it's being entered into a data
structure. It's one of the worst choices in that case, and you practically
guarantee minimal performance, even though the order of the algorithm is
superior on average.
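
For the incremental case, here is a small sketch of keeping an ArrayList sorted as items arrive, using the bitwise-complement return value of BinarySearch as the insertion point instead of re-sorting after every add:

using System;
using System.Collections;

class SortedInsertDemo
{
    // Insert 'value' at its sorted position; no full re-sort after each add.
    static void InsertSorted(ArrayList list, string value)
    {
        int index = list.BinarySearch(value);
        if (index < 0)
            index = ~index;   // complement of the index of the next larger element
        list.Insert(index, value);
    }

    static void Main()
    {
        ArrayList names = new ArrayList();
        InsertSorted(names, "carol");
        InsertSorted(names, "alice");
        InsertSorted(names, "bob");
        foreach (string name in names)
            Console.WriteLine(name);   // alice, bob, carol
    }
}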
Since I'm not one for micro-optimization either I'd rather just choose an
algorithm with an average performance of O(lg(n)) if I'm sure that there
may be substantial amounts of data at times and just swallow the
performance hit on those smaller collections.

As Bill notes, when the data size is small, the order of the algorithm is
usually not relevant since any algorithm will be fast. So yes, swallowing
the performance hit is fine in that case.

However, that's not the same as ignoring the exact scenario being solved.
You need to take into account how the algorithm will be used, and not create
a situation where you choose an algorithm that has a good average case
order, but will always be used where the worst case order applies.

I hope that makes sense...I'm typing quickly because I'm hungry and want to
eat lunch. :)

Pete
 
