Wanted: strign searching algorithm <--- suggestions appreciated...

A

almurph

Hi,

Wondering if you can you help me here. I'm looking for suggestions
essentially (though reasoning would be appreciated too i.e. the why).
I'm looking for a good search algorithm that can handle (i.e. scale
well with) a gradually increasing (very long) string.
For example, say the search pattern is: "ADGH" and the corpus is
"AFSDWFSDTERTDFVDFGDFHHJCFHDFGWEWEGDFGDGSDGSDGDFGHDFDFGDADGHWETERGDFGEDHDFH"

Now, I know that there are many search algorithms out there can can
find this pattern. But I have a number of other requirements:

1. The string is gradually getting longer (from the right) - hence if
building an intermediate object (like a suffix tree) for example - it
has to be able to scale easily and in linear time (like, Ukkonen's
algorithm - which build from the left i.e. using prefixes in near
linear time)
2. There could be multiple occurances of the search pattern in the
corpus
3. I'm also looking for the ability to spot *close* matches, like, say
the string "ADFGH" would be suggested for the search pattern "ADGH" as
it's close...
4. If building any intermediate object the ability to store easily n a
database _may_ be useful, hence the ability of the object to be
serialisable would be handy.

A tall order I know, it may require more than 1 type of algorithm.
So, I would greatly appreciate any suggestions/comments/ideas that you
may want to make.

Thanking you
Al.
 
L

Lasse Vågsæther Karlsen

Hi,

Wondering if you can you help me here. I'm looking for suggestions
essentially (though reasoning would be appreciated too i.e. the why).
I'm looking for a good search algorithm that can handle (i.e. scale
well with) a gradually increasing (very long) string.
For example, say the search pattern is: "ADGH" and the corpus is
"AFSDWFSDTERTDFVDFGDFHHJCFHDFGWEWEGDFGDGSDGSDGDFGHDFDFGDADGHWETERGDFGEDHDFH"

Now, I know that there are many search algorithms out there can can
find this pattern. But I have a number of other requirements:

1. The string is gradually getting longer (from the right) - hence if
building an intermediate object (like a suffix tree) for example - it
has to be able to scale easily and in linear time (like, Ukkonen's
algorithm - which build from the left i.e. using prefixes in near
linear time)

When you say that the string is getting longer, does this mean that you
do a search against the "AFSDWFSD..." string above, and then later add
more to the end of it and want a re-search?

Or do you mean different strings of different lengths?
2. There could be multiple occurances of the search pattern in the
corpus

Do you want first or all?
3. I'm also looking for the ability to spot *close* matches, like, say
the string "ADFGH" would be suggested for the search pattern "ADGH" as
it's close...

Define "close".
4. If building any intermediate object the ability to store easily n a
database _may_ be useful, hence the ability of the object to be
serialisable would be handy.

A tall order I know, it may require more than 1 type of algorithm.
So, I would greatly appreciate any suggestions/comments/ideas that you
may want to make.

Thanking you
Al.

Unfortunately I can't point you to a specific algorithm that you
probably haven't looked at, like Knuth-Morris-Pratt algorithm
(http://en.wikipedia.org/wiki/Knuth–Morris–Pratt_algorithm)
or the Boyer-Moore algorithm
(http://en.wikipedia.org/wiki/Boyer-Moore_string_search_algorithm).

Unfortunately neither of those will handle close matches.

If you're searching for words, there are probably some indexing
algorithms that can handle misspellings, but if you're searching for
non-textual content (ie. characters but not really readable text) you
might need to define what "close" means more carefully.

A regular expression could help you, if you for instance say that you
can at most have 1 character between each other character:

A.?D.?G.?H

Or, if you tolerate at most one extra character inbetween the others you
could generate a search expression like this:

(ADGH|A.DGH|AD.GH|ADG.H)

However, it probably won't be very fast for really big texts.
 
A

almurph

Lasse, my comments are inline below:
When you say that the string is getting longer, does this mean that you
do a search against the "AFSDWFSD..." string above, and then later add
more to the end of it and want a re-search?

AL: Yes, the big corpus "AFSDWFSD..." gets longer by 1 letter every
day, which gets appendixed on the right ie.
AFSDWFSD..." <--- + "A" here

Therefore any intermediate objects must be easily scalable.
Or do you mean different strings of different lengths?

AL: No.
Do you want first or all?

AL: Location of all would be great
Define "close".

Good question. ts a matter of definition of course but am looking at
say within a Levinstein distance of 3 but well open to suggestion
here, it does not have to be Levenshtein distance, there are many
other edit distances...
Unfortunately I can't point you to a specific algorithm that you
probably haven't looked at, like Knuth-Morris-Pratt algorithm
(http://en.wikipedia.org/wiki/Knuth–Morris–Pratt_algorithm)
or the Boyer-Moore algorithm
(http://en.wikipedia.org/wiki/Boyer-Moore_string_search_algorithm).

Unfortunately neither of those will handle close matches.

Yes, I aggree Lasse but thanks for the suggestions. I know these
algorithms - they have to preprocess ie. build tables before the main
search and use these tables to make jumps over characters in the
corpus. Nice technique but I don't know how long it takes to build
these tables and if they scale well i.e. as new characters are added
the tabels have to be rebuilt.
Plus as you already mentioned - they dont do close matches. Thanks
though.
If you're searching for words, there are probably some indexing
algorithms that can handle misspellings, but if you're searching for
non-textual content (ie. characters but not really readable text) you
might need to define what "close" means more carefully.

A regular expression could help you, if you for instance say that you
can at most have 1 character between each other character:

A.?D.?G.?H

Or, if you tolerate at most one extra character inbetween the others you
could generate a search expression like this:

(ADGH|A.DGH|AD.GH|ADG.H)

However, it probably won't be very fast for really big texts.

A good suggestion and a fair point. Thank you.

I am now looking at suffix arrays. Appears to be good and faster and
not so "big" as suffix trees. The close match stuill eludes me with
this technique though.
 
A

almurph

If an 'A' is added then you only need to search for patterns that
either end in an 'A' or that have a "close match" ending in 'A'.
Searching from right to left is also a good idea as then you can cut
short the search once the pattern has moved away from the changed end.
All subsequent results will be the saem as the previous day's run,
which would indicate storing a list of the matches found for each
target.  Since you want multiple matches, you are going to need some
sort of structure to store all the matches for a given pattern anyway.
Just store the match structure overnight and reuse it for the bulk of
the work the next day.

rossum












- Show quoted text -- Hide quoted text -

- Show quoted text -

rossum,

thanks for you feedback. will do.

al.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top