Searching 18,000 strings of an average size of 10Kb

A

Allan Ebdrup

What would be the fastest way to search 18,000 strings of an average size of
10Kb, I can have all the strings in memory, should I simply do a instr on
all of the strings? Or is there a faster way? I would like to have a kind of
search like google where you can enter several words to search for, guess
that calls for a regular expression "word1|word2|word3|...".
Is there any kind of indexing tools available for this kind of thing, I have
my data in an MSSQL 2000 database, and it's not updated very often, some
hundred new records added a day spread over the whole day. and some hundred
records removed a day in one go.

I don't mind that I'm using 200MB memory to hold everything in memory at all
times. But if there is a better more efficient way to do it I would like som
pointers.

Kind Regards,
Allan Ebdrup.
 
D

Deckarep

If you can sort the list of strings...use a Binary Search. A Binary
Search is capable of finding for instance the string required in about
16 steps of say 500,000 strings. Don't quote me on that but if you
doubled your size to 1,000,000 it would only require 1 more step to
find it.

Essentially a binary search halves the data it needs to search but
again this requires you to work from an array that you can garuntee to
be sorted.

..net has this functionality built in and it's wicked fast.

-R
 
D

_DD

What would be the fastest way to search 18,000 strings of an average size of
10Kb, I can have all the strings in memory, should I simply do a instr on
all of the strings? Or is there a faster way? I would like to have a kind of
search like google where you can enter several words to search for, guess
that calls for a regular expression "word1|word2|word3|...".
Is there any kind of indexing tools available for this kind of thing, I have
my data in an MSSQL 2000 database, and it's not updated very often, some
hundred new records added a day spread over the whole day. and some hundred
records removed a day in one go.

I don't mind that I'm using 200MB memory to hold everything in memory at all
times. But if there is a better more efficient way to do it I would like som
pointers.

Kind Regards,
Allan Ebdrup.

Tough to say from your description whether you are looking for
something fairly simple, fast, and direct, or whether you need to
search on multiple fields.

If the latter, and if the data is already in a SQL db, then a SQL
query may be the best way to do that, no? People spend years trying to
optimize db queries, so it would be hard to get something running
faster, especially as you are probably doing a query to get the data
in the first place. You would have no appreciable memory hit. Of
course you would not be asking if it were that simple.

So...if it's the former, and you need blazing speed at the expense of
loading everything into ram, then check into Boyer-Moore. I believe
that is still the basis of the fastest linear string search
algorithms.

Boyer Moore:
http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/index.html

That should get you started.
 
A

Allan Ebdrup

Deckarep said:
If you can sort the list of strings...use a Binary Search. A Binary
Search is capable of finding for instance the string required in about
16 steps of say 500,000 strings. Don't quote me on that but if you
doubled your size to 1,000,000 it would only require 1 more step to
find it.

Essentially a binary search halves the data it needs to search but
again this requires you to work from an array that you can garuntee to
be sorted.

.net has this functionality built in and it's wicked fast.

I know about binary searches.
However I'm searching for the sting to be somewhere inside the 10Kb of the
string in 18.000 instances, so I can't sort the data, I need some kind of
indexing.

Kind Regards,
Allan Ebdrup
 
A

Allan Ebdrup

_DD said:
On Thu, 11 Jan 2007 13:31:16 +0100, "Allan Ebdrup"
Tough to say from your description whether you are looking for
something fairly simple, fast, and direct, or whether you need to
search on multiple fields.

Simple fast and direct, just one field.
If the latter, and if the data is already in a SQL db, then a SQL
query may be the best way to do that, no? People spend years trying to
optimize db queries, so it would be hard to get something running
faster, especially as you are probably doing a query to get the data
in the first place. You would have no appreciable memory hit. Of
course you would not be asking if it were that simple.

LIKE runs to slow, and I can't use a fulltext index because it splits up
into whole words, and I want a search where searching for "pædagog" also
retuns results with "børnehavepædagog" or "pædagogmedhjælper"
So...if it's the former, and you need blazing speed at the expense of
loading everything into ram, then check into Boyer-Moore. I believe
that is still the basis of the fastest linear string search
algorithms.

Boyer Moore:
http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/index.html

This is probably the kind of thing used in RegEx, it's blasing fast when
compared to .IndexOf, witch has a surprisingly poor performance.

I found this plugin for MSSQL:
http://www.surfray.com/en/index.php/products/overloook
Seems to be just the kind of thing I want. Ad an index and have fast
searching. I'll try it out and let you know.

Kind Regards,
Allan Ebdrup
 
D

_DD

.. I want a search where searching for "pædagog" also
retuns results with "børnehavepædagog" or "pædagogmedhjælper"

Gotta appreciate anyone who knows how to do that combined 'ae' thing!
This is probably the kind of thing used in RegEx, it's blasing fast when
compared to .IndexOf, witch has a surprisingly poor performance.

String functions in general do not seem to be optimal performers. The
reason that I suggested a raw Boyer-Moore algorithm is that if you are
indeed doing a linear search as you describe, then it will probably be
the most direct method. You may be able to optimize for your
particular data format.

I don't doubt that Regex often uses Boyer-Moore, but I'm not sure how
much overhead a Regex implementation would add due to the fact that
they are more generalized.
I found this plugin for MSSQL:
http://www.surfray.com/en/index.php/products/overloook
Seems to be just the kind of thing I want. Ad an index and have fast
searching. I'll try it out and let you know.

1800 times faster than an MS SQL 2005 search? I take back what I said
about the SQL DB team working around the clock to optimize queries.
Seriously, I guess I'd be skeptical about that, but who knows? Please
do report back if you use it.
 
E

Ethan Strauss

I don't have a specific answer, but you might want to take a look at
Bioinformatics code or ask a Bioinformatics group. This sort of search is
pretty common and I am sure it has been optimized to death.
Ethan
 
D

_DD

I don't have a specific answer, but you might want to take a look at
Bioinformatics code or ask a Bioinformatics group. This sort of search is
pretty common and I am sure it has been optimized to death.
Ethan

Gene-matching algorithms--That's certainly an interesting thought. Not
only optimized; they build custom hardware for that. But I thought
there would be a fuzzy aspect to most of their matching algorithms. Is
that not the case?
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top