ArrayList BinarySearch vs Contains

  • Thread starter: tshad
Hi Peter,

I have a lot of experience optimizing code, just not for search and sorting
algorithms. They appear now to require similar amounts of knowledge and
effort. The main difference appears to be that search and sort algorithms
have a mathematical base, but general performance tweaking in code just
requires an understanding of the coding language, the framework and expected
use of the application, without the need for any math. Testing for
performance and sampling data has always been good enough on my projects so
far. The mathematical aspects of the algorithms are confusing so I just
always leaned on my original understanding of the O notation, expecting it to
tell the whole story.

Thanks for your comprehensive response.
 
Hi Peter,

Informative and entertaining ;)

I thought that "setup" might have been referring to the prerequisite that some
algorithms have for sorted data. If not, what exactly does "setup" mean?

I'm stepping out to eat too (and go see Borat). We can reconvene at 8:30 :p
 
Dave said:
Hi Bill,

It seems you are suggesting that the scale of the data can affect whether a
required sorting operation will actually reduce the overall performance of an
algorithm so that it becomes worse than a linear search. Can this be true
even when the data is sorted as it's being entered into the structure?

Yep, that's basically what he's saying. A good example of this is the
HybridDictionary. The HybridDictionary starts out as a ListDictionary
and then converts to a Hashtable when the number of items is large
enough. This works well despite the fact that the ListDictionary uses
an O(n) implementation as opposed to the Hashtable's O(1)
implementation. It would seem, based on the Big-Oh notation anyway,
that a Hashtable would always be faster, but what Big-Oh doesn't reveal
is the constant time factor which is significant in the case of a
Hashtable. Typically, a ListDictionary will outperform a Hashtable if
the number of items is less than 10.
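
For anyone who wants to see that crossover first hand, a rough sketch along
these lines will show it. The sizes, keys and timing approach below are
arbitrary, and the exact break-even point depends on the runtime version and
the key type, so treat "less than 10" as a ballpark rather than a rule.

using System;
using System.Collections;
using System.Collections.Specialized;
using System.Diagnostics;

class ListVsHash
{
    static void Main()
    {
        foreach (int count in new int[] { 5, 10, 100, 1000 })
        {
            ListDictionary list = new ListDictionary();
            Hashtable hash = new Hashtable();
            for (int i = 0; i < count; i++)
            {
                list.Add("key" + i, i);
                hash.Add("key" + i, i);
            }

            // Look up the last key repeatedly (the worst case for the linear list).
            string key = "key" + (count - 1);
            const int reps = 100000;

            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < reps; i++) { object o = list[key]; }
            long listTicks = sw.ElapsedTicks;

            sw = Stopwatch.StartNew();
            for (int i = 0; i < reps; i++) { object o = hash[key]; }
            long hashTicks = sw.ElapsedTicks;

            Console.WriteLine("{0,5} items: ListDictionary={1,9} ticks  Hashtable={2,9} ticks",
                count, listTicks, hashTicks);
        }
    }
}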
 
Dave Sexton said:
Hi Peter,

I have a lot of experience optimizing code, just not for search and
sorting algorithms. They appear now to require similar amounts of
knowledge and effort. The main difference appears to be that search and
sort algorithms have a mathematical base, but general performance tweaking
in code just requires an understanding of the coding language, the
framework and expected use of the application, without the need for any
math.

Yeah, I was being a bit careless. :) I actually would have been surprised,
given other posts you've written, had you NOT ever had to optimize code. My
"welcome" was more a general topic-related comment, than specifically
directed at you.

I guess the point is (as you say) simply that there are two parts to
optimization. One has to do with making sure a given implementation
performs well, and the other has to do with making sure the algorithm itself
performs well.

The latter is usually the first step, since if you have a bad algorithm, no
amount of code-level optimization can fix that. However, I also admit that
in many situations, it makes more sense to code *some* algorithm, *any*
algorithm, and then to look harder at the algorithm if it turns out that
that portion of the code is a bottleneck. As with other aspects of
theoretical coding practices, in reality there's a cycle in which the
results from a first pass through the theoretical process are used to
improve on subsequent passes through that process. :)

Pete
 
Dave Sexton said:
Hi Peter,

Informative and entertaining ;)

I thought that "setup" might have been referring to the prerequisite that
some algorithms have for sorted data. If not, what exactly does "setup"
mean?

It just depends on the algorithm. It could refer to a prerequisite, or it
could be some state initialization. Furthermore, this isn't the *only*
reason an algorithm with better order would perform worse on smaller data
sets. It's just an example. Another example would be an algorithm in which
each individual step takes twice as long for the better-order algorithm.
For smaller data sets, where the worse-order step count is still less than
twice the better-order step count, the worse-order algorithm will still win.
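
To put some made-up numbers on that second example: suppose each step of the
worse-order (linear) algorithm costs 1 unit and each step of the better-order
(logarithmic) one costs 2 units. The per-step costs below are invented purely
to show the shape of the crossover, not measured from any real implementation.

using System;

class StepCostSketch
{
    static void Main()
    {
        const double worseOrderStepCost = 1.0;   // hypothetical: cheap steps, O(n) of them
        const double betterOrderStepCost = 2.0;  // hypothetical: expensive steps, O(log n) of them

        foreach (int n in new int[] { 2, 4, 8, 16, 64, 1024 })
        {
            double worseOrderTotal = worseOrderStepCost * n;
            double betterOrderTotal = betterOrderStepCost * Math.Log(n, 2.0);
            Console.WriteLine("n={0,5}: worse-order ~{1,8:F1} units, better-order ~{2,8:F1} units",
                n, worseOrderTotal, betterOrderTotal);
        }
    }
}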

The take-away isn't so much the specific reasons that a better order
algorithm might be slower, but just that you can't rely on the order when
the data set is small. You have to have large data sets for the order to
come into play, because order is *defined* according to what are essentially
infinitely large data sets.

Pete
 
Brian Gideon said:
Yep, that's basically what he's saying. A good example of this is the
HybridDictionary. The HybridDictionary starts out as a ListDictionary
and then converts to a Hashtable when the number of items is large
enough. [...] Typically, a ListDictionary will outperform a Hashtable if
the number of items is less than 10.

Interestingly, that's also a good example of the other comment Bill made.
That is, if the break-even point is just 10 elements, there's not really any
point in bothering with the better-performing-but-worse-order algorithm,
since even the poor performance of the better-order algorithm is going to be
practically instantaneous as far as any user is concerned. There just
aren't enough data elements for ANY algorithm to take any significant amount
of time.

Assuming 10 is truly the break-even point, then I question the need for the
HybridDictionary class at all, if the only difference is that it uses one
method for data sets fewer than 10 and another for data sets larger than
that.

Pete
 
Dave Sexton said:
Hi Bill,

It seems you are suggesting that the scale of the data can affect
whether a required sorting operation will actually reduce the overall
performance of an algorithm so that it becomes worse than a linear
search. Can this be true even when the data is sorted as it's being
entered into the structure?

Absolutely.
Search algorithms have two distinct pieces to them:
- Comparisons: Is this the right element? Is it greater than this?
Less than this?
- Iteration: What is the next element in the collection that I should
compare against?

A Linear search does a Fast Compare and a fast iteration to the next
element, but it doesn't apply any logic to reduce the number of
comparisons required. Therefore, for small data sets, a linear search
can be the best option. For large sets, you pay the price of having to
compare against each and every element until you find the correct one.

A Binary search against a sorted dataset has a fairly fast compare with
a slower iteration to the next element, since it needs to calculate
where it should do the next compare. This additional logic in the
algorithm means that each step in the algorithm is slower than the
linear algorithm, but it drastically reduces the number of steps
involved. Therefore the Binary search is faster when the number of
elements becomes the dominant factor.


The original question asked about the best way to find elements in a
sorted ArrayList, but I would like to bring up an alternative. Try
using a Hashtable instead. It has a higher memory footprint, but the
performance is exceptional. For completeness in this discussion I would
like to point out that Hashtable retrieval is a constant-time operation,
O(1). I ran a quick test comparing Hashtable vs linear search vs
BinarySearch and the results are impressive. Hashtable wins hands down
when more than a handful of elements are involved.
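
Here is a rough sketch of that kind of test for anyone who wants to try it.
The collection size and lookup count are arbitrary, and one-shot Stopwatch
timings are noisy, so read the output as a trend rather than as precise
numbers.

using System;
using System.Collections;
using System.Diagnostics;

class SearchComparison
{
    static void Main()
    {
        const int count = 100000;     // arbitrary collection size
        const int lookups = 1000;     // arbitrary number of searches

        ArrayList sorted = new ArrayList();
        Hashtable table = new Hashtable();
        for (int i = 0; i < count; i++)
        {
            sorted.Add(i);            // added in order, so the ArrayList stays sorted
            table.Add(i, i);
        }

        Random rng = new Random(42);
        int[] targets = new int[lookups];
        for (int i = 0; i < lookups; i++) targets[i] = rng.Next(count);

        Stopwatch sw = Stopwatch.StartNew();
        foreach (int t in targets) { bool found = sorted.Contains(t); }    // linear scan
        Console.WriteLine("ArrayList.Contains     : {0} ms", sw.ElapsedMilliseconds);

        sw = Stopwatch.StartNew();
        foreach (int t in targets) { int index = sorted.BinarySearch(t); } // needs sorted data
        Console.WriteLine("ArrayList.BinarySearch : {0} ms", sw.ElapsedMilliseconds);

        sw = Stopwatch.StartNew();
        foreach (int t in targets) { bool found = table.ContainsKey(t); }  // hash lookup
        Console.WriteLine("Hashtable.ContainsKey  : {0} ms", sw.ElapsedMilliseconds);
    }
}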


Bill
 
Hi Peter,
Yeah, I was being a bit careless. :) I actually would have been surprised,
given other posts you've written, had you NOT ever had to optimize code. My
"welcome" was more a general topic-related comment, than specifically
directed at you.

No need to cater to my narcissism - I understood what you meant and I
appreciate the help ;)
I guess the point is (as you say) simply that there are two parts to
optimization. One has to do with making sure a given implementation
performs well, and the other has to do with making sure the algorithm itself
performs well.

The latter is usually the first step, since if you have a bad algorithm, no
amount of code-level optimization can fix that. However, I also admit that
in many situations, it makes more sense to code *some* algorithm, *any*
algorithm, and then to look harder at the algorithm if it turns out that
that portion of the code is a bottleneck. As with other aspects of
theoretical coding practices, in reality there's a cycle in which the
results from a first pass through the theoretical process are used to
improve on subsequent passes through that process. :)

Yes, I agree that refinement is critical in applications that rely on
top-notch performance. It's rare to get it right the first time. I wouldn't
doubt that it's the same when coding search and sort algorithms. It does make
sense to me as well that being able to choose an appropriate algorithm before
implementing the framework in which it's executed is ideal, but that's not
always possible as you mentioned. I guess a lot of that depends on how many
of the variables that govern which algorithm is appropriate will continue to
change, and by how much. That is, the less things change, the easier it is to
choose an appropriate algorithm before committing to the code that implements it.
 
Hi Peter,
It just depends on the algorithm. It could refer to a prerequisite, or it
could be some state initialization. Furthermore, this isn't the *only*
reason an algorithm with better order would perform worse on smaller data
sets. It's just an example. Another example would be an algorithm in which
each individual step takes twice as long for the better-order algorithm. For
smaller data sets, where the worse order is still not twice the better
order, the worse order algorithm will still perform better.

I was just speaking about this thread to a friend of mine who was quick to
jump in and say, "Use linear searches with 30 items or less and BinarySearch
otherwise", immediately after I said the words, "newsgroup thread on the big O
notation and sorting algorithms". I think this type of thought process
relating to search and sort algorithms is quite common among programmers.
It's really easy to understand, and people just seem to love "rules" since
they make things easy to remember.

From what has been discussed so far it sounds like this line of thought is
incorrect - there are just too many variables. Choosing an appropriate pivot
element in a quicksort based on the actual data being sorted, for example,
seems to prevent any constant threshold such as "30" from being appropriate,
in general.

So is there some practical guideline such as 30 or is this truly an arbitrary
number?
The take-away isn't so much the specific reasons that a better order
algorithm might be slower, but just that you can't rely on the order when
the data set is small. You have to have large data sets for the order to
come into play, because order is *defined* according to what are essentially
infinitely large data sets.

Big O notation seems to be like expressing algorithms in terms of probability,
which becomes more accurate as the number of "attempts" increases, so to
speak. When a collection is small, there is more of a chance that the data
will coalesce in a way that undermines the effectiveness of an algorithm.
Larger collections may have a better distribution of data, which can increase
the effectiveness of an algorithm much like how larger data sets provide more
accurate results, probabilistically.

Have I understood you correctly?
 
Hi Brian,

Thanks for the reply.

This works well despite the fact that the ListDictionary uses
an O(n) implementation as opposed to the Hashtable's O(1)
implementation. It would seem, based on the Big-Oh notation anyway,
that a Hashtable would always be faster, but what Big-Oh doesn't reveal
is the constant time factor which is significant in the case of a
Hashtable

Doesn't O(1) indicate the constant factor?
Typically, a ListDictionary will outperform a Hashtable if
the number of items is less than 10.

I'd debate that if I had more knowledge on the subject ;)
 
Hi Peter,
Yep, that's basically what he's saying. A good example of this is the
HybridDictionary. The HybridDictionary starts out as a ListDictionary
and then converts to a Hashtable when the number of items is large
enough. [...] Typically, a ListDictionary will outperform a Hashtable if
the number of items is less than 10.

Interestingly, that's also a good example of the other comment Bill made.
That is, if the break-even point is just 10 elements, there's not really any
point in bothering with the better-performing-but-worse-order algorithm,
since even the poor performance of the better-order algorithm is going to be
practically instantaneous as far as any user is concerned. There just
aren't enough data elements for ANY algorithm to take any significant amount
of time.

Assuming 10 is truly the break-even point, then I question the need for the
HybridDictionary class at all, if the only difference is that it uses one
method for data sets fewer than 10 and another for data sets larger than
that.

I was thinking the same thing (in my last post to you which was on a related
topic), but I think there is still use for the HybridDictionary:

If an application requires many dictionaries at once, each containing a
small number of distinct elements to begin with, growing over time, and
searched constantly and asynchronously, and if the best possible performance
is required, then HybridDictionary might make sense.

I'm not sure that it deserved to be in the FCL before LinkedList, however ;)
 
Hi Bill,

The results of your test are interesting, but I'm sure it depends on the
distribution of data in the Hashtable, does it not?

If all of the data is in a single bucket doesn't a search operation become
O(n)?
 
Hi Brian,

Actually, I thought more about what you said and it's clear to me now that you
meant O(1) doesn't reveal the actual time it takes, it just shows that the
time is constant, whatever value it may be.

10 still sounds arbitrary to me, however :)
 
Dave Sexton said:
[...]
From what has been discussed so far it sounds like this line of thought is
incorrect - there's just too many variables. Choosing an appropriate
pivot element in a quicksort based on the actual data being sorted, for
example, seems to prevent any constant threshold such as "30" from being
appropriate, in general.

So is there some practical guideline such as 30 or is this truly an
arbitrary number?

As far as I know, your understanding is correct. That is, I don't have any
particular expertise in the area, but I would agree with you that the number
30 (or the number 10, as stated in a different message) is pretty much
arbitrary.

I believe that if you know the exact algorithms, implementation details, and
have some representative data to consider, you could come up with an exact
number for the break-even point. But since that number does depend on the
exact implementation as well as the nature of the data (affecting whether
actual performance is closer to best, average, or worst case), I can't
imagine that one could come up with a single number that applies in all
situations.
Big O notation seems to be like expressing algorithms in terms of
probability, which becomes more accurate as the number of "attempts"
increases, so to speak.

I understand why it looks like that, but that's not my understanding of the
nature of the question.
When a collection is small, there is more of a chance that the data will
coalesce in a way that undermines the effectiveness of an algorithm.
Larger collections may have a better distribution of data, which can
increase the effectiveness of an algorithm much like how larger data sets
provide more accurate results, probabilistically.

Have I understood you correctly?

I don't think so. :) I don't recall writing anything that tried to address
*why* the order of an algorithm is accurate only for large data sets. Only
that it is.

As for the whys, I admit once again to not being an expert on the topic.
However, if I recall my studies correctly the reason that the order applies
to infinitely large data sets is not because of a statistical convergence.
It's because when you analyze the cost of an algorithm, there are a variety
of components to that cost, but as the data set gets infinitely large, the
contributions of those components increase unevenly. One eventually dominates,
and that's the one that is expressed as the order of the algorithm.

If it were an issue of probabilities, then (for example) one would not talk
about the "average" or "worst-case" order of an algorithm. The "worst-case"
wouldn't exist, since you'd be considering only random distribution of the
input data, and an infinitely large data set. But as we've already seen,
for Quicksort even with an infinitely large data set, one can still have an
order that is different from a statistically random and large sample.

I'm pretty sure that's right. But don't take my word for it. :)

Pete
 
Hi Peter,

I don't think so. :) I don't recall writing anything that tried to address
*why* the order of an algorithm is accurate only for large data sets. Only
that it is.

Yes, I see that.

I jumped to conclusions when I read "you can't rely on the order when the data
set is small".

That sounded to me just like how probability works (expressed inversely) - the
larger the sample, the more accurate the prediction. I saw the big O notation
as a prediction of an algorithm's effectiveness on any given set when I wrote
that response.

After much mulling, I see now (again) that the notation describes the number
of steps that an algorithm might require in a given case, but not the chance
of that case actually happening. Cases such as "best", "worst" and "average"
can be expressed by the big O notation. The chances that any of those cases
will occur depends on other factors that simply aren't expressed by the
notation, such as the amount of data, its structure or its distribution within
the set.
As for the whys, I admit once again to not being an expert on the topic.
However, if I recall my studies correctly the reason that the order applies
to infinitely large data sets is not because of a statistical convergence.
It's because when you analyze the cost of an algorithm, there are a variety
of components to that cost, but as the data set gets infinitely large, the
contribution of those costs increase unevenly. One eventually dominates,
and that's the one that is expressed as the order of the algorithm.

Great response, thanks.
If it were an issue of probabilities, then (for example) one would not talk
about the "average" or "worst-case" order of an algorithm. The "worst-case"
wouldn't exist, since you'd be considering only random distribution of the
input data, and an infinitely large data set. But as we've already seen,
for Quicksort even with an infinitely large data set, one can still have an
order that is different from a statistically random and large sample.

Got it.

<snip>
 
Dave Sexton said:
Hi Bill,

The results of your test are interesting, but I'm sure it depends on
the distribution of data in the Hashtable, does it not?

If all of the data is in a single bucket doesn't a search operation
become O(n)?

A Hash Table can indeed have linear performance if your hashing function
does not suit your data. There is a pretty good explanation of this
topic at http://en.wikipedia.org/wiki/Hash_function

I don't know enough about the implementation behind the .NET Hashtable
class to know if it actively avoids long chains in a single bucket.
Theoretically, a hash table implementation can alter its algorithm to
suit the data and minimize collisions.

Assuming that you have a quality Hashing function, it can be a
frustrating experience just attempting to come up with a data set that
will "Defeat" the hash (make it linear). The reason for this is that
good hash functions tend to be "One Way". By that I mean that it is hard
to find a valid input that will produce a given output except through
trial and error.
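
The easiest way to see the degenerate case is not to hunt for colliding
inputs but to sabotage the hash yourself, for example with a key type whose
GetHashCode always returns the same value. The sketch below is purely a
demonstration of the worst case, not something you would ever ship.

using System;
using System.Collections;
using System.Diagnostics;

// A deliberately terrible key: every instance reports the same hash code,
// so every key collides and lookups degrade toward a linear scan.
class BadKey
{
    private readonly int value;
    public BadKey(int value) { this.value = value; }
    public override int GetHashCode() { return 1; }
    public override bool Equals(object obj)
    {
        BadKey other = obj as BadKey;
        return other != null && other.value == value;
    }
}

class DefeatTheHash
{
    static void Main()
    {
        const int count = 5000;   // arbitrary; big enough to make the difference obvious

        Hashtable good = new Hashtable();
        Hashtable bad = new Hashtable();
        for (int i = 0; i < count; i++)
        {
            good.Add(i, i);               // int keys hash well
            bad.Add(new BadKey(i), i);    // every key collides with every other key
        }

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < count; i++) { bool f = good.ContainsKey(i); }
        Console.WriteLine("Well-hashed keys : {0} ms", sw.ElapsedMilliseconds);

        sw = Stopwatch.StartNew();
        for (int i = 0; i < count; i++) { bool f = bad.ContainsKey(new BadKey(i)); }
        Console.WriteLine("Constant hash    : {0} ms", sw.ElapsedMilliseconds);
    }
}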

Bill
 
Big O notation seems to be like expressing algorithms in terms of
probability, which becomes more accurate as the number of "attempts"
increases, so to speak. When a collection is small, there is more of
a chance that the data will coalesce in a way that undermines the
effectiveness of an algorithm. Larger collections may have a better
distribution of data, which can increase the effectiveness of an
algorithm much like how larger data sets provide more accurate
results, probabilistically.

Have I understood you correctly?

It is not a probabilistic issue. It is an asymptotic issue. Let me give
a simple example to demonstrate.

Suppose I have some algorithm that performs as follows:

T = A + B*N

where
T = total time
A = (constant) Initialization
B = (constant) The time to test one element in a collection
N = (Variable) The number of elements in a collection

The average time per element is then

t = T/N = A/N + B

As N gets large, the first term goes to zero since the cost is averaged
across all elements in the collection. The net effect is that for large
N you can simply ignore the A term altogether and simply assume that T =
B*N.

Here is an example with more factors

T = A + B*N + C*N*N (Maybe an inefficient sorting algorithm)

For large values of N the third term dominates the equation and
therefore this algorithm is O(N^2).
For small values of N it is quite likely that the N^2 term is
negligible.

Let's pick some real values to demonstrate.

Say A=200, B=20, C=1

N=4: The constant term dominates
T = 200 + 80 + 16

N=10: The constant term and Linear term dominate but the N^2 term is
growing
T = 200 + 200 + 100

N= 20 : The constant term is not a big factor anymore, but the N^2 term
is now important
T = 200 + 400 + 400

N = 40 : The N^2 term is starting to dominate
T = 200 + 800 + 1600

N = 100 : No contest
T = 200 + 2000 + 10000

As you can see, the N^2 behavior is only important as the number of
elements grows. For smaller collections, the lower order terms can
dominate the total time. This is the reason why a linear search can
actually be faster than a binary search for small collections. Once your
collection reaches a critical mass however, it is no contest.
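
If you want to play with those numbers yourself, a few lines will tabulate
the three terms. A, B and C below are the same illustrative constants as
above, not measured costs.

using System;

class TermGrowth
{
    static void Main()
    {
        const double A = 200;   // constant setup cost
        const double B = 20;    // cost of the linear term, per element
        const double C = 1;     // cost of the quadratic term, per element squared

        foreach (int n in new int[] { 4, 10, 20, 40, 100, 1000 })
        {
            Console.WriteLine("N={0,5}: A={1,6}  B*N={2,8}  C*N^2={3,10}  total={4,10}",
                n, A, B * n, C * n * n, A + B * n + C * n * n);
        }
    }
}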


Hope this helps
Bill
 
Hi Bill,
A Hash Table can indeed have linear performance if your hashing function
does not suit your data. There is a pretty good explanation of this topic at
http://en.wikipedia.org/wiki/Hash_function

I don't know enough about the implementation behind the dotnet Hashtable
class to know if it actively avoids long chains in a single bucket.
Theoretically, a hash table implementation can alter it's algorithm to suit
the data and minimize collisions.

Well, I was asking because your test assumes that the keys were all hashed
properly, and it made me realize that there must be a worst case too. So
I understood O(1) to be best-case and O(n) to be worst-case for hash tables.

I'm aware of hash functions and that a hash table works only as well as the
hashing function used to generate its keys. I just wanted to be absolutely
certain that my understanding of what the O(n) notation represents was
accurate, in the context of a hash table.

I found an article that verifies O(n) to be worst-case:

"Hash Table"
http://en.wikipedia.org/wiki/Hash_table
Assuming that you have a quality Hashing function, it can be a frustrating
experience just attempting to come up with a data set that will "Defeat" the
hash (make it linear). The reason for this is that good hash functions tend
to be "One Way". By that I mean that it is hard to find a valid input that
will produce a given output except through trial and error.

You make a good point, but there is still the possibility for worst-case no
matter how well the hashing function performs because there is a finite number
of valid hashes. I understand that worst-case might be difficult to produce
and is probably not realistic in most real-world scenarios.

So do you think it's better to just ignore the worst-case when speaking of
hash tables?

(I'm leaning towards yes myself, although Jon's point in another post about
supplying as much information about an algorithm as possible seems to apply
here as well)
 
Hi Bill,

Suppose I have some algorithm that performs as follows:

T = A + B*N

where
T = total time
A = (constant) Initialization
B = (constant) The time to test one element in a collection
N = (Variable) The number of elements in a collection

The average time per element is then

t = T/N = A/N + B

As N gets large, the first term goes to zero since the cost is averaged
across all elements in the collection. The net effect is that for large N
you can simply ignore the A term altogether and simply assume that T = B*N.

Very nice proof, thank you.

Algebra sure does bring back memories ;)
Here is an example with more factors

T = A + B*N + C*N*N (Maybe an inefficient sorting algorithm)

for large values of N the third term dominates the equation and therefore
this algorithm is O(N^2).
For small values of N it is quite likely that the N^2 term is negligible.
Interesting.

Lets pick some real value to demonstrate.

Say A=200, B=20, C=1

N=4: The constant term dominates
T = 200 + 80 + 16

N=10: The constant term and Linear term dominate but the N^2 term is growing
T = 200 + 200 + 100

N= 20 : The constant term is not a big factor anymore, but the N^2 term is
now important
T = 200 + 400 + 400

N = 40 : The N^2 term is starting to dominate
T = 200 + 800 + 1600

N = 100 : No contest
T = 200 + 2000 + 10000

As you can see, the N^2 behavior is only important as the number of elements
grows. For smaller collections, the lower order terms can dominate the total
time. This is the reason why a linear search can actually be faster than a
binary search for small collections. Once your collection reaches a critical
mass however, it is no contest.

Great, thank you.

I believe you've illustrated what Peter has written about choosing the
dominant factor to be represented by the O notation and how each factor
changes unevenly as the number of elements increases.

Here, O(N^2) notates the worst case?

O(A) is the best case (only one element)?

I assume the average case is closer to O(N^2), but how can it be notated?
 
Dave Sexton said:
[...]
Here, O(N^2) notates the worst case?

Not necessarily. It only represents worst-case if you are specifically
looking at the algorithm's behavior in the worst-case. Here, O(N^2) could
very well be the average case. Bill's example didn't actually specify the
algorithm, so there's no way to know in this particular example. But when
calculating order for an actual algorithm, you would have that information
and would be able to say whether the calculated order represented a best,
average, or worst case scenario.

Pete
 
