Search a list/array of objects for specified criteria?

R

Rick

I have a large list of objects where each object has a unique
(non-overlapping) date range. The list is sorted chronologically. What is
the most efficient way to search this list for a single object that spans a
specified date?
 
O

Oliver Sturm

Hello Rick,
I have a large list of objects where each object has a unique
(non-overlapping) date range. The list is sorted chronologically. What is
the most efficient way to search this list for a single object that spans
a specified date?

My suggestion below... you mention that the list is sorted
chronologically. I don't think that information is of any use here, as
long as you don't have any good information about the distribution of the
elements in the list, or the lengths of the spans covered by a typical
element. Of course you could implement an algorithm that looks at a first
element and tries to heuristically guess where the element you're looking
for may be in the list - but without knowing the total range of elements,
the number, the distribution and the lengths of the spans, I find it very
hard to say whether the algorithm would be any more efficient than the
basic one below.

If you want efficiency without having good estimates of these details,
maybe you should organize your data in a different way - using
dictionaries for certain ranges of data, for instance. Once more it's hard
to say what that structure should best look like, without knowing details
about the data you're expecting.

And finally, don't forget: it's usually better to write a reasonably good
algorithm and to optimize it only when it proves to be a problem.



[TestFixture]
public class Tests {
[Test]
public void Test( ) {
List<ClassValidWithinRange> list = new List<ClassValidWithinRange>(new
ClassValidWithinRange[] {
new ClassValidWithinRange(new DateTime(2007, 1, 1), new DateTime(2007,
1,15)),
new ClassValidWithinRange(new DateTime(2007, 1, 16), new
DateTime(2007, 1,31)),
new ClassValidWithinRange(new DateTime(2007, 2, 1), new DateTime(2007,
2,28))
});
ClassValidWithinRange checkVal = new ClassValidWithinRange(new
DateTime(2007, 3, 1), new DateTime(2007, 3, 31));
list.Add(checkVal);

ClassValidWithinRange foundVal =
list.Find(delegate(ClassValidWithinRange entry) {
return entry.IsValidOn(new DateTime(2007, 3, 15));
});

Assert.AreEqual(checkVal, foundVal);
}
}

public class ClassValidWithinRange {
public ClassValidWithinRange(DateTime from, DateTime to) {
this.from = from;
this.to = to;
}
public bool IsValidOn(DateTime date) {
// Implement with your choice of <, <=, > and >=
return from < date && to > date;
}
private DateTime from;
public DateTime From {
get {
return from;
}
set {
from = value;
}
}
private DateTime to;
public DateTime To {
get {
return to;
}
set {
to = value;
}
}
}


Oliver Sturm
 
R

Rick

Thanks Oliver.

In short, you recommend using the Find method of the List<T> class. This
performs a linear search, which is basically what I'm doing now. It's
acceptable but hardly optimal. I believe that a binary search could be more
efficient, even where the distribution is uneven. The problem is that the
BinarySearch methods provided by .NET don't seem to support searching using
an IComparer alone.


Oliver Sturm said:
Hello Rick,
I have a large list of objects where each object has a unique
(non-overlapping) date range. The list is sorted chronologically. What is
the most efficient way to search this list for a single object that spans
a specified date?

My suggestion below... you mention that the list is sorted
chronologically. I don't think that information is of any use here, as
long as you don't have any good information about the distribution of the
elements in the list, or the lengths of the spans covered by a typical
element. Of course you could implement an algorithm that looks at a first
element and tries to heuristically guess where the element you're looking
for may be in the list - but without knowing the total range of elements,
the number, the distribution and the lengths of the spans, I find it very
hard to say whether the algorithm would be any more efficient than the
basic one below.

If you want efficiency without having good estimates of these details,
maybe you should organize your data in a different way - using
dictionaries for certain ranges of data, for instance. Once more it's hard
to say what that structure should best look like, without knowing details
about the data you're expecting.

And finally, don't forget: it's usually better to write a reasonably good
algorithm and to optimize it only when it proves to be a problem.



[TestFixture]
public class Tests {
[Test]
public void Test( ) {
List<ClassValidWithinRange> list = new List<ClassValidWithinRange>(new
ClassValidWithinRange[] {
new ClassValidWithinRange(new DateTime(2007, 1, 1), new DateTime(2007,
1,15)),
new ClassValidWithinRange(new DateTime(2007, 1, 16), new DateTime(2007,
1,31)),
new ClassValidWithinRange(new DateTime(2007, 2, 1), new DateTime(2007,
2,28))
});
ClassValidWithinRange checkVal = new ClassValidWithinRange(new
DateTime(2007, 3, 1), new DateTime(2007, 3, 31));
list.Add(checkVal);

ClassValidWithinRange foundVal = list.Find(delegate(ClassValidWithinRange
entry) {
return entry.IsValidOn(new DateTime(2007, 3, 15));
});

Assert.AreEqual(checkVal, foundVal);
}
}

public class ClassValidWithinRange {
public ClassValidWithinRange(DateTime from, DateTime to) {
this.from = from;
this.to = to;
}
public bool IsValidOn(DateTime date) {
// Implement with your choice of <, <=, > and >=
return from < date && to > date;
}
private DateTime from;
public DateTime From {
get {
return from;
}
set {
from = value;
}
}
private DateTime to;
public DateTime To {
get {
return to;
}
set {
to = value;
}
}
}


Oliver Sturm
 
R

Rick

I think I found an answer. I could create an IComparer that compares the
upperbound of each date range. I can then use a BinarySearch to look for an
object with a specific upperbound (equal to my desired date). It will return
either the index of an exact match, or the bitwise complement of the index
next highest object, or the bitwise complement of the list count.

Rick said:
Thanks Oliver.

In short, you recommend using the Find method of the List<T> class. This
performs a linear search, which is basically what I'm doing now. It's
acceptable but hardly optimal. I believe that a binary search could be
more efficient, even where the distribution is uneven. The problem is that
the BinarySearch methods provided by .NET don't seem to support searching
using an IComparer alone.


Oliver Sturm said:
Hello Rick,
I have a large list of objects where each object has a unique
(non-overlapping) date range. The list is sorted chronologically. What is
the most efficient way to search this list for a single object that spans
a specified date?

My suggestion below... you mention that the list is sorted
chronologically. I don't think that information is of any use here, as
long as you don't have any good information about the distribution of the
elements in the list, or the lengths of the spans covered by a typical
element. Of course you could implement an algorithm that looks at a first
element and tries to heuristically guess where the element you're looking
for may be in the list - but without knowing the total range of elements,
the number, the distribution and the lengths of the spans, I find it very
hard to say whether the algorithm would be any more efficient than the
basic one below.

If you want efficiency without having good estimates of these details,
maybe you should organize your data in a different way - using
dictionaries for certain ranges of data, for instance. Once more it's
hard to say what that structure should best look like, without knowing
details about the data you're expecting.

And finally, don't forget: it's usually better to write a reasonably good
algorithm and to optimize it only when it proves to be a problem.



[TestFixture]
public class Tests {
[Test]
public void Test( ) {
List<ClassValidWithinRange> list = new List<ClassValidWithinRange>(new
ClassValidWithinRange[] {
new ClassValidWithinRange(new DateTime(2007, 1, 1), new DateTime(2007,
1,15)),
new ClassValidWithinRange(new DateTime(2007, 1, 16), new DateTime(2007,
1,31)),
new ClassValidWithinRange(new DateTime(2007, 2, 1), new DateTime(2007,
2,28))
});
ClassValidWithinRange checkVal = new ClassValidWithinRange(new
DateTime(2007, 3, 1), new DateTime(2007, 3, 31));
list.Add(checkVal);

ClassValidWithinRange foundVal = list.Find(delegate(ClassValidWithinRange
entry) {
return entry.IsValidOn(new DateTime(2007, 3, 15));
});

Assert.AreEqual(checkVal, foundVal);
}
}

public class ClassValidWithinRange {
public ClassValidWithinRange(DateTime from, DateTime to) {
this.from = from;
this.to = to;
}
public bool IsValidOn(DateTime date) {
// Implement with your choice of <, <=, > and >=
return from < date && to > date;
}
private DateTime from;
public DateTime From {
get {
return from;
}
set {
from = value;
}
}
private DateTime to;
public DateTime To {
get {
return to;
}
set {
to = value;
}
}
}


Oliver Sturm
 
O

Oliver Sturm

Hello Rick,
In short, you recommend using the Find method of the List<T> class. This
performs a linear search, which is basically what I'm doing now. It's
acceptable but hardly optimal. I believe that a binary search could be
more efficient, even where the distribution is uneven.

Hm... I'm not sure really - it's a mathematical exercise to prove whether
it would be more efficient under the circumstances or not. There are two
problems:

* the distribution may be seriously uneven, we (I at least) don't have any information on that
* the decision of whether or not a date falls in a given range is not only dependent on one distributed value, but on two different ones.

The second problem brings the length of ranges into the equation - if
there is some regularity to this, the task of estimating efficiency may
become easier.

In the end I think that theoretically a binary search could improve
efficiency, if done right. But as we seem to know practically nothing
about the content of the data list, it would be rather hard to implement
the binary search in a way that is more efficient on average than a linear
search. And of course there's always the storage to think about... linear
search could be made more efficient very easily by splitting the list
based on years, or something similar.

Other things I would think about... I think a linear search algorithm
should probably be efficient enough for a large majority of use cases. In
those cases where the number of objects would be too large to use linear
searching efficiently, it seems like a good question to ask why these huge
numbers of objects are being handled in-memory in the first place - a
relational database could make the search operation much easier and a lot
faster in these particular cases.


Oliver Sturm
 
O

Oliver Sturm

Hello Rick,
I think I found an answer. I could create an IComparer that compares the
upperbound of each date range. I can then use a BinarySearch to look for
an object with a specific upperbound (equal to my desired date). It will
return either the index of an exact match, or the bitwise complement of
the index next highest object, or the bitwise complement of the list count.

Have fun :)


Oliver Sturm
 
B

Bill Butler

Oliver Sturm said:
Hello Rick,


Hm... I'm not sure really - it's a mathematical exercise to prove
whether it would be more efficient under the circumstances or not.
There are two problems:

* the distribution may be seriously uneven, we (I at least) don't
have any information on that
* the decision of whether or not a date falls in a given range is not
only dependent on one distributed value, but on two different ones.

Neither of these points will effect the speed of the search in the
slightest.
Remember, we are not talking about a (possibly imbalanced) binary tree.
We are talking about using a binary search algorithm over a sorted list.
That algorithm will give you the same performance as a perfectly
balanced tree. Uneven distributions of the data will not impact the
search in the slightest.

Also remember that Rick said:
<quote>
I have a large list of objects where each object has a unique
(non-overlapping) date range. The list is sorted chronologically
</quote>

Since the date ranges do NOT overlap there is a clear sequence of data
points
obj1.Start
obj1.End
obj2.Start
obj2.End
....

So, You perform a Binary search using either Start or end (the choice is
irrelevant). This seach will go as O(logN). If you were lucky enough to
hit the date on the money, then you are done. If you did not find the
exact date (far more likely), the algorithm will return a negative
number, which is the bitwise complement of the index of the next
element. In other words, the algorithm has in effect told you which item
in the Array you need to examine (off by one is more like it). Then it
is a simple matter of seeing if your date falls within the Start and End
dates for the object in question.

Of course, unless the number of data points is substantial, the total
time would be negligable in either case.

Hope this helps
Bill




<snip>
 
O

Oliver Sturm

Hello Bill,

I understand I was probably mixing up some things here - my ideas were
targeted at a more efficient algorithm that would take the structure into
account, based on the thought that the amount of data in question is so
vast that we start running into problems with the linear search to begin
with.


Oliver Sturm
 
B

Bill Butler

Oliver Sturm said:
Hello Bill,

I understand I was probably mixing up some things here - my ideas were
targeted at a more efficient algorithm that would take the structure
into account, based on the thought that the amount of data in question
is so vast that we start running into problems with the linear search
to begin with.


You do bring up a good point.
Knowledge of your data, and how it is likely to be searched can go a
LONG way towards picking the best approach.

If most of the time you are accessing data near the ends of the range
the linear search may in fact outperform the binary search which does
better for random access.

If the number of data points is not large, it doesn't matter how you
search, since the time is negligible in either case. Unless you are
sitting in a tight loop processing requests continuously that is.

Bill
 
O

Oliver Sturm

Hello Bill,
Knowledge of your data, and how it is likely to be searched can go a LONG
way towards picking the best approach.

That's what I had in mind - maybe too much so, granted.
If the number of data points is not large, it doesn't matter how you
search, since the time is negligible in either case. Unless you are
sitting in a tight loop processing requests continuously that is.

Oh, of course. I didn't mean my comments somewhere on the thread to say
that it's meaningless to search for efficiency in this regard. But I find
it significant to think about the best way to achieve efficiency - for
instance by changing the storage structure instead of optimizing searches
-, and this depends on the use case a lot.


Oliver Sturm
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top