Wanted: strign searching algorithm <--- suggestions appreciated...

almurph · Jun 19, 2008

Hi,

Wondering if you can you help me here. I'm looking for suggestions
essentially (though reasoning would be appreciated too i.e. the why).
I'm looking for a good search algorithm that can handle (i.e. scale
well with) a gradually increasing (very long) string.
For example, say the search pattern is: "ADGH" and the corpus is
"AFSDWFSDTERTDFVDFGDFHHJCFHDFGWEWEGDFGDGSDGSDGDFGHDFDFGDADGHWETERGDFGEDHDFH"

Now, I know that there are many search algorithms out there can can
find this pattern. But I have a number of other requirements:

1. The string is gradually getting longer (from the right) - hence if
building an intermediate object (like a suffix tree) for example - it
has to be able to scale easily and in linear time (like, Ukkonen's
algorithm - which build from the left i.e. using prefixes in near
linear time)
2. There could be multiple occurances of the search pattern in the
corpus
3. I'm also looking for the ability to spot *close* matches, like, say
the string "ADFGH" would be suggested for the search pattern "ADGH" as
it's close...
4. If building any intermediate object the ability to store easily n a
database _may_ be useful, hence the ability of the object to be
serialisable would be handy.

A tall order I know, it may require more than 1 type of algorithm.
So, I would greatly appreciate any suggestions/comments/ideas that you
may want to make.

Thanking you
Al.

Lasse Vågsæther Karlsen · Jun 19, 2008

Hi,

Wondering if you can you help me here. I'm looking for suggestions
essentially (though reasoning would be appreciated too i.e. the why).
I'm looking for a good search algorithm that can handle (i.e. scale
well with) a gradually increasing (very long) string.
For example, say the search pattern is: "ADGH" and the corpus is
"AFSDWFSDTERTDFVDFGDFHHJCFHDFGWEWEGDFGDGSDGSDGDFGHDFDFGDADGHWETERGDFGEDHDFH"

Now, I know that there are many search algorithms out there can can
find this pattern. But I have a number of other requirements:

1. The string is gradually getting longer (from the right) - hence if
building an intermediate object (like a suffix tree) for example - it
has to be able to scale easily and in linear time (like, Ukkonen's
algorithm - which build from the left i.e. using prefixes in near
linear time)

When you say that the string is getting longer, does this mean that you
do a search against the "AFSDWFSD..." string above, and then later add
more to the end of it and want a re-search?

Or do you mean different strings of different lengths?

2. There could be multiple occurances of the search pattern in the
corpus

Do you want first or all?

3. I'm also looking for the ability to spot *close* matches, like, say
the string "ADFGH" would be suggested for the search pattern "ADGH" as
it's close...

Define "close".

4. If building any intermediate object the ability to store easily n a
database _may_ be useful, hence the ability of the object to be
serialisable would be handy.

A tall order I know, it may require more than 1 type of algorithm.
So, I would greatly appreciate any suggestions/comments/ideas that you
may want to make.

Thanking you
Al.

Unfortunately I can't point you to a specific algorithm that you
probably haven't looked at, like Knuth-Morris-Pratt algorithm
(http://en.wikipedia.org/wiki/Knuth–Morris–Pratt_algorithm)
or the Boyer-Moore algorithm
(http://en.wikipedia.org/wiki/Boyer-Moore_string_search_algorithm).

Unfortunately neither of those will handle close matches.

If you're searching for words, there are probably some indexing
algorithms that can handle misspellings, but if you're searching for
non-textual content (ie. characters but not really readable text) you
might need to define what "close" means more carefully.

A regular expression could help you, if you for instance say that you
can at most have 1 character between each other character:

A.?D.?G.?H

Or, if you tolerate at most one extra character inbetween the others you
could generate a search expression like this:

(ADGH|A.DGH|AD.GH|ADG.H)

However, it probably won't be very fast for really big texts.

almurph · Jun 19, 2008

Lasse, my comments are inline below:

When you say that the string is getting longer, does this mean that you
do a search against the "AFSDWFSD..." string above, and then later add
more to the end of it and want a re-search?

AL: Yes, the big corpus "AFSDWFSD..." gets longer by 1 letter every
day, which gets appendixed on the right ie.
AFSDWFSD..." <--- + "A" here

Therefore any intermediate objects must be easily scalable.

Or do you mean different strings of different lengths?

AL: No.

Do you want first or all?

AL: Location of all would be great

Define "close".

Good question. ts a matter of definition of course but am looking at
say within a Levinstein distance of 3 but well open to suggestion
here, it does not have to be Levenshtein distance, there are many
other edit distances...

Unfortunately I can't point you to a specific algorithm that you
probably haven't looked at, like Knuth-Morris-Pratt algorithm
(http://en.wikipedia.org/wiki/Knuth–Morris–Pratt_algorithm)
or the Boyer-Moore algorithm
(http://en.wikipedia.org/wiki/Boyer-Moore_string_search_algorithm).

Unfortunately neither of those will handle close matches.

Yes, I aggree Lasse but thanks for the suggestions. I know these
algorithms - they have to preprocess ie. build tables before the main
search and use these tables to make jumps over characters in the
corpus. Nice technique but I don't know how long it takes to build
these tables and if they scale well i.e. as new characters are added
the tabels have to be rebuilt.
Plus as you already mentioned - they dont do close matches. Thanks
though.

If you're searching for words, there are probably some indexing
algorithms that can handle misspellings, but if you're searching for
non-textual content (ie. characters but not really readable text) you
might need to define what "close" means more carefully.

A regular expression could help you, if you for instance say that you
can at most have 1 character between each other character:

A.?D.?G.?H

Or, if you tolerate at most one extra character inbetween the others you
could generate a search expression like this:

(ADGH|A.DGH|AD.GH|ADG.H)

However, it probably won't be very fast for really big texts.

A good suggestion and a fair point. Thank you.

I am now looking at suffix arrays. Appears to be good and faster and
not so "big" as suffix trees. The close match stuill eludes me with
this technique though.

almurph · Jun 20, 2008

If an 'A' is added then you only need to search for patterns that
either end in an 'A' or that have a "close match" ending in 'A'.
Searching from right to left is also a good idea as then you can cut
short the search once the pattern has moved away from the changed end.
All subsequent results will be the saem as the previous day's run,
which would indicate storing a list of the matches found for each
target. Since you want multiple matches, you are going to need some
sort of structure to store all the matches for a given pattern anyway.
Just store the match structure overnight and reuse it for the bulk of
the work the next day.

rossum

- Show quoted text -- Hide quoted text -

- Show quoted text -

rossum,

thanks for you feedback. will do.

al.

Algorithm Suggestion/Help (buiding strings from a graph)	1	Jun 23, 2008
How to sort List<...> -- created using LinQ?	11	Mar 11, 2010
Data structure suggestion wanted...	2	Sep 17, 2006
Web Crawler Architecture	3	Jun 25, 2008
Unit Testing Model-View-Presenter	10	Oct 3, 2006
Generic Search Capability Using Filters	7	Jan 14, 2009
String Searching a Check box in Word...	1	Apr 6, 2008
Convert.ToString( double ) + xsd:pattern + RegEx == ?	4	Dec 22, 2004

Wanted: strign searching algorithm <--- suggestions appreciated...

almurph

Lasse Vågsæther Karlsen

almurph

almurph

Ask a Question

Similar Threads