need superfast binarysearch, with superfast icomparer

Thread starter: solandre

my motivation is as follows:

i have to binary-search a long[] that is several million items big,
several million times. this costs time, even though i use
Array.BinarySearch<long>. so i try to keep the Compare method as lean
as possible, as it is called several tens of millions of times.

in the case of an int, i think the fastest way would be this:

public int Compare(int a, int b)
{
    return a - b;
}

the problem is the return value needs to be an int, because the
binary-search method works with an int as far as i know. if i write...

public int Compare(long a, long b)
{
    return a - b;
}

that's a compile error (no implicit long-to-int conversion), or i write...

public int Compare(long a, long b)
{
    return (int)(a - b);
}

then i lose information, likely leading to incorrect results.

any idea for a superfast way for long types, or does anybody know a
binary-search method that takes a long type as input?

thanks
 
Why do you need a compare method at all? The default comparison for long
already does the right thing (and note that return a-b can overflow and
give the wrong sign even for ints).

If there's something I'm missing though, a logical comparison might be
faster than subtracting the numbers:

int Compare(long a, long b)
{
    if (a > b) return 1;
    else if (a < b) return -1;
    else return 0;
}
 
I expect the best performance would be achieved by implementing the search in
unsafe code (perhaps with assembler).

The IComparable interface and the like make implementing searching and
sorting _complex_ structures very easy to code (perhaps at the expense of
performance). But you don't seem to be using a complex structure, so just use
straightforward code and call it good.
 
solandre said:
my motivation is as follows:

i have to binary-search a long[] that is several million items big,
several million times. this costs time, even though i use
Array.BinarySearch<long>. so i try to keep the Compare method as lean
as possible, as it is called several tens of millions of times.

1. Are the items you are looking for coming in as one big sorted
collection? If they are, you can do your lookup by linearly traversing
both collections at the same time.

2. If your input is coming in as an unsorted collection, it may be
faster to sort it and go to 1.

3. Perhaps you could use a dictionary for the items instead, if you only
need to decide membership? Hash sets with longs can be extremely fast.
If push comes to shove you can implement your own hash set for the longs
and get *extremely* fast performance.
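To illustrate point 3, here is a minimal open-addressing hash set for longs. This is a hedged sketch, not code from the thread: linear probing, fixed power-of-two capacity, 0 reserved as the empty sentinel (so 0 cannot be stored), no resizing, and the load must stay well below capacity.

```csharp
using System;

// Membership-only hash set for non-zero longs, open addressing with
// linear probing. Capacity must be a power of two and comfortably
// larger than the number of elements inserted.
class LongSet
{
    private readonly long[] slots;
    private readonly int mask;

    public LongSet(int capacityPow2)
    {
        slots = new long[capacityPow2];
        mask = capacityPow2 - 1;
    }

    private int IndexOf(long value)
    {
        // Fibonacci-style multiplicative hash, then wrap into the table.
        int i = (int)(((ulong)value * 0x9E3779B97F4A7C15UL) >> 32) & mask;
        while (slots[i] != 0 && slots[i] != value)
            i = (i + 1) & mask;   // probe linearly until hit or empty slot
        return i;
    }

    public void Add(long value)
    {
        if (value == 0) throw new ArgumentException("0 is reserved as empty");
        slots[IndexOf(value)] = value;
    }

    public bool Contains(long value)
    {
        return value != 0 && slots[IndexOf(value)] == value;
    }
}
```

With the interface dispatch, boxing, and resizing of the general-purpose collections stripped away, a table like this is about as cheap as a membership test for longs gets in safe code.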
public int Compare(long a, long b)
{
    return (int)(a - b);
}

i lose information, likely leading to incorrect results.

Yes, that would certainly be very bad.
any idea for a superfast way for long-types or anybody who knows a
binarysearch-method that takes a long-type as input?

1. try benchmarking using:

public int Compare(long a, long b)
{
    if (a == b)
        return 0;
    else if (a < b)
        return -1;
    else
        return 1;
}

is it fast enough?

2. you can write your own binary-search.

3. if the code is *still* not fast enough, you could step into the realm
of unsafe code and/or write your own hash-set.
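A sketch of point 2, assuming a sorted long[]: comparing the elements inline removes the IComparer interface call from the hot loop entirely. Unlike Array.BinarySearch, this simplified version just returns -1 for a miss instead of the bitwise complement of the insertion point.

```csharp
class LongSearch
{
    // Hand-rolled binary search over a sorted long[].
    // Returns the index of a match, or -1 if the target is absent.
    public static int BinarySearchLong(long[] a, long target)
    {
        int lo = 0, hi = a.Length - 1;
        while (lo <= hi)
        {
            int mid = lo + ((hi - lo) >> 1);   // avoids (lo + hi) overflow
            long v = a[mid];
            if (v == target) return mid;
            if (v < target) lo = mid + 1;
            else hi = mid - 1;
        }
        return -1;
    }
}
```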
 
well, something i thought i didn't need to mention is that i use the
long type, which actually is a ulong, as a container for 3+35+26 bits
of information. for binary-searching the array i need just the 35-bit
field, which is already wider than an int.

but this already means i cannot simply subtract the values, and i have
to write a comparer which does the masking. the latest code looks like
this:

public int Compare(ulong PDA, ulong PDB)
// compares ulong PDs by their 35-bit pID
// <0 ... PDA is lower than PDB
//  0 ... equal
// >0 ... PDA is higher than PDB
{
    // pID 35-bit masking
    PDA &= 0x1FFFFFFFFC000000;
    PDB &= 0x1FFFFFFFFC000000;
    if (PDA > PDB) return 1;
    else if (PDA < PDB) return -1;
    else return 0;
}

faster would be return (int)((PDA & 0x1FFFFFFFFC000000) - (PDB &
0x1FFFFFFFFC000000)), which is the barrier i am not able to cross
because of the return type.

do you have any further hints on that?
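One idea not raised in the thread: the comparer masks both operands on every probe, but that work can be hoisted out of the search altogether by extracting the 35-bit pID (bits 26..60) into a parallel key array once, then binary-searching plain ulongs with no comparer and no masking per step. A sketch under that assumption; `Pid`, `ExtractKeys`, and `Search` are hypothetical helper names:

```csharp
class PidIndex
{
    // Extract the 35-bit pID: drop the 26 low bits, mask off the 3 high bits.
    public static ulong Pid(ulong pd) => (pd >> 26) & 0x7FFFFFFFFUL;

    // Build a parallel key array from the (pID-sorted) packed array, once.
    public static ulong[] ExtractKeys(ulong[] packed)
    {
        ulong[] keys = new ulong[packed.Length];
        for (int i = 0; i < packed.Length; i++)
            keys[i] = Pid(packed[i]);
        return keys;
    }

    // Plain binary search over the pre-extracted keys; no IComparer call,
    // no masking inside the loop. Returns the index of a match, or -1.
    public static int Search(ulong[] keys, ulong pid)
    {
        int lo = 0, hi = keys.Length - 1;
        while (lo <= hi)
        {
            int mid = lo + ((hi - lo) >> 1);
            ulong k = keys[mid];
            if (k == pid) return mid;
            if (k < pid) lo = mid + 1;
            else hi = mid - 1;
        }
        return -1;
    }
}
```

The cost is one extra ulong[] of the same length, in exchange for removing two masks and an interface call from every one of the tens of millions of comparisons.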
 
thanks for your replies. i want to report back some information i have
already collected:

1. linear traversing:
actually this binary-search question is part (as you guessed right) of
a matching-two-arrays question which i posted some minutes later, as it
is the wrapping topic.

here i/we enhanced linear traversing with binary search, which is an
improvement when the matches are far apart (see there).

2. on hash tables: looking up a dictionary (in safe code) with
10 million items 10 million times unluckily takes longer than
binary-searching two arrays of the same size. the dictionary is resized
automatically by .net (i guess doubled if the load exceeds 70%), so
there should not be too many collisions. but i did not research this
further.

3. i wrote a binary search (in safe code) but it performed poorly, so i
didn't develop it further, but might pick this up again later ...

4. unsafe code .... whoa .... :-) something i pushed far away, but that
keeps coming closer the more it becomes "every tick counts". .... we
might soon start to do some binary search in c++.
 
solandre said:
my motivation is as follows:

i have to binary-search a long[] that is several million items big,
several million times.

Binary searching through "several million items" should still only take
20 iterations or so, shouldn't it? Are you sure your array is sorted to
start with? If it's not, binary searching won't work and you may see
odd results.

You shouldn't need to specify an IComparer, however, as Int64
implements IComparable<Int64>.

Having said all this, here's a pretty fast comparison implementation:

public int Compare (long a, long b)
{
    if (a == b)
    {
        return 0;
    }
    return a < b ? -1 : 1;
}

If none of that helps, could you post a short but complete program
which demonstrates the problem?

See http://www.pobox.com/~skeet/csharp/complete.html for details of
what I mean by that.
 
public int Compare(ulong PDA, ulong PDB)
// compares ulong PDs by their 35-bit pID
// <0 ... PDA is lower than PDB
//  0 ... equal
// >0 ... PDA is higher than PDB
{
    // pID 35-bit masking
    PDA &= 0x1FFFFFFFFC000000;
    PDB &= 0x1FFFFFFFFC000000;
    if (PDA > PDB) return 1;
    else if (PDA < PDB) return -1;
    else return 0;
}

Ah, hmmm, looks good.
 
solandre said:
here i/we enhanced linear traversing with binary search, which is an
improvement when the matches are far apart (see there).

Binary searching costs log(remaining elements), which you will pay many
times. A better approach is stepping forward from the last match with
larger and larger steps (for example doubling the step each time, as in
galloping/exponential search) and afterwards binary-searching the
bracketed interval. I *guess* this would bring your set intersection
into complexity O(Min(N,M) + Log(Max(N,M))).
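The stepping idea above can be sketched as a galloping (exponential) search, assuming a sorted long[] and a known starting index from the previous match:

```csharp
class Gallop
{
    // Double the step until we overshoot the target, then binary-search
    // only the bracketed interval [lo, hi]. Returns the index of a match
    // at or after 'start', or -1 if the target is absent.
    public static int GallopSearch(long[] a, long target, int start)
    {
        int step = 1;
        int lo = start, hi = start;
        while (hi < a.Length && a[hi] < target)
        {
            lo = hi;
            hi += step;
            step <<= 1;               // grow the step geometrically
        }
        if (hi >= a.Length) hi = a.Length - 1;

        // ordinary binary search inside the bracket
        while (lo <= hi)
        {
            int mid = lo + ((hi - lo) >> 1);
            if (a[mid] == target) return mid;
            if (a[mid] < target) lo = mid + 1;
            else hi = mid - 1;
        }
        return -1;
    }
}
```

When matches are close together the bracket stays tiny, so each lookup costs far less than a full log(N) search from scratch.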
2. on hash tables: looking up a dictionary (in safe code) with
10 million items 10 million times unluckily takes longer than
binary-searching two arrays of the same size. the dictionary is resized
automatically by .net (i guess doubled if the load exceeds 70%), so
there should not be too many collisions. but i did not research this
further.

If your input is already sorted you should go with the method from
above, which is a refinement of merge sort; dictionaries will *probably*
not be faster, especially since the merge linearly traverses memory,
which avoids CPU cache thrashing.

If you are using System.Collections.Hashtable you are paying for boxing
and unboxing, and that's *really* expensive. If you have access to
.NET 2.0 you can use System.Collections.Generic.Dictionary<ulong,object>
and escape the boxing costs. However, since your input is already
sorted, I'm guessing this won't beat the merge-sort variants.
3. i wrote a binary search (in safe code) but it performed poorly, so i
didn't develop it further, but might pick this up again later ...

Be sure to write the code so the compiler can see that indexes remain
within bounds; this way you can possibly avoid paying for bounds
checking. I have saved roughly 15% runtime in a hash-set implementation
by rewriting the code so the compiler could skip bounds checks.
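A minimal illustration of that pattern, assuming the usual .NET JIT range-check elimination: when the loop condition tests the index directly against arr.Length, the JIT can prove every arr[i] access is in bounds and skip the per-access check.

```csharp
class BoundsDemo
{
    // Looping straight against arr.Length lets the JIT elide the bounds
    // check on arr[i]; indexing with a separately computed offset, or a
    // length cached before the loop in some JIT versions, can defeat
    // that proof and reintroduce a check on every access.
    public static long Sum(long[] arr)
    {
        long sum = 0;
        for (int i = 0; i < arr.Length; i++)
            sum += arr[i];
        return sum;
    }
}
```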
4. unsafe code .... whoa .... :-) something i pushed far away, but that
keeps coming closer the more it becomes "every tick counts". .... we
might soon start to do some binary search in c++.

Do you have a profiler running on your code? Maybe something is going
on that you don't realize.

I'm not too sure you will gain much by switching to C++, but maybe your
future experiments will show :)
 
any idea for a superfast way for long-types or anybody who knows a
binarysearch-method that takes a long-type as input?

thanks

class Comparer3 : System.Collections.Generic.IComparer<long>
{
    public int Compare(long a1, long b1)
    {
        return a1.CompareTo(b1);
    }
}

That is as fast as it gets I think. You could possibly beat the built
in compare method if you use binary operations, but I doubt it.

The most important part is to use generics since boxing is very slow.
If you aren't using .Net 2.0 you could probably improve the speed by
writing your own binary search, but with .Net 2.0 it most likely isn't
worth it.
 
solandre said:
well, something i thought i didn't need to mention is that i use the
long type, which actually is a ulong, as a container for 3+35+26 bits
of information. for binary-searching the array i need just the 35-bit
field, which is already wider than an int.
A thought: does the 3-bit field have anything to do with the search? If
you split your large array into eight smaller arrays, selected by the
value in the 3-bit field, would that help?

rossum
 
Marcus Andrén said:
class Comparer3 : System.Collections.Generic.IComparer<long>
{
    public int Compare(long a1, long b1)
    {
        return a1.CompareTo(b1);
    }
}

That is as fast as it gets I think. You could possibly beat the built
in compare method if you use binary operations, but I doubt it.

You can, actually. Try this:

using System;
using System.Collections.Generic;

public class Test
{
    const int Iterations = 1000000000;

    static void Main()
    {
        // swap in new DelegatingComparer() here to time the alternative
        IComparer<long> comp = new DirectComparer();

        DateTime start = DateTime.Now;
        for (long x = 0; x < Iterations; x++)
        {
            comp.Compare (x, 0);
        }
        DateTime end = DateTime.Now;

        Console.WriteLine (end - start);
    }
}

class DelegatingComparer : IComparer<long>
{
    public int Compare (long x, long y)
    {
        return x.CompareTo(y);
    }
}

class DirectComparer : IComparer<long>
{
    public int Compare (long x, long y)
    {
        if (x == y)
        {
            return 0;
        }
        return x < y ? -1 : 1;
    }
}

On my box, DirectComparer is about 30% faster than DelegatingComparer.
I don't know whether DelegatingComparer isn't being inlined properly or
what, but DirectComparer is definitely faster. A similar test comparing
x with Iterations (i.e. reversing the code flow) makes
DelegatingComparer slightly faster than it was before, but still slower
than DirectComparer.
 