Fast way to determine if a string contains a member of a list of strings

Karch

I need to find the fastest way, in terms of storage and searching, to
determine if a given string contains a member of a list of strings. So,
think of it in terms of this: I have a string such as the following:

"Smith said the dog belongs (or did belong) to the secretary, an employee of
the company."

Then the list contains the following (and this list is MUCH larger in the
real situation):

Adams
Alan
Jones
Jacobs
Smith
Thaxton
Vela
Zulu

I would need to stop the processing and return (true) as soon as Smith was
found. On the other hand, if the string was changed to the following, there
would be no match and I would return (false):

"Smitherson said the dog belongs (or did belong) to the secretary, an
employee of the company."

In the given string, I do know that the matches should begin at a given
point (position zero), but I need to keep processing until I have exhausted
the candidate string in the list - as shown above - to prevent a false
match.

I have played around with some different data structures, such as prefix
and suffix trees, and these work in the case where you have a string that
you are trying to match against a list, not vice versa. The approach is
required to be very performant because it will be evaluated millions of
times. I am also okay with an unsafe code approach that works. I just need
the evaluations to terminate as soon as possible rather than looping
through every single item in the list. Even an IndexOf is too slow.
 
amdrit

I think RegEx will help you out here. As for not iterating over the entire
list, though, I have no idea how you could avoid falling through the whole
list in the cases where the only match is the last item or where there is
no match at all.
 
John Daragon

Karch said:
I need to find the fastest way, in terms of storage and searching, to
determine if a given string contains a member of a list of strings.
<snip>

Could you give us some idea of the likely length of the list? And whether
it's static or dynamic?

jd
 
Peter Duniho

I need to find the fastest way, in terms of storage and searching, to
determine if a given string contains a member of a list of strings.

I don't know if RegEx has optimizations for dealing with this sort of
thing. It could, but given how long a search string for RegEx could be if
you concatenate all of your possible matches, maybe it doesn't.

That'd be my first try though. It seems like the actual RegEx search
string would be simple (just "or" all of your possible matches together).
If it performs well enough, there you go.

If not, I'd guess there's a fair amount of research into algorithms like
this, but a fairly basic approach is a state graph. It's memory
intensive, but rewards you with good performance. This came up a while ago
and I posted some sample code to get someone else started. You can find
that post here:
http://groups.google.com/group/microsoft.public.dotnet.languages.csharp/msg/0f06f696d4500b77

I've referred to this post a couple of other times, and while I've never
had anyone say it actually turned out to be useful, they've never said it
wasn't either. :)

I suppose if I'm going to keep referring people to it, maybe I ought to
clean it up a little bit more. But what's there does work and should be
enough to get you pointed in the right direction.

Pete
 
John Daragon

Karch said:
I need to find the fastest way, in terms of storage and searching, to
determine if a given string contains a member of a list of strings.
<snip>

If the list is static, which means that the overhead of hashing it can be
paid just once, I'd be inclined to use a Hashtable.

Tokenise the string using String.Split(), then iterate over the resulting
array, using each token to access the hashtable, which will return null if
the token is not found. For large list lengths this should be really quite
efficient (assuming a decent hash algorithm for your data type and
distribution).
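A minimal sketch of that approach (the names and the delimiter set here are
illustrative; on .NET 3.5 or later a HashSet<string> plays the role of the
Hashtable, assuming the whole list fits in memory):

```csharp
using System;
using System.Collections.Generic;

static class ListMatcher
{
    // Built once at initialization; the list is static.
    static readonly HashSet<string> Names =
        new HashSet<string>(StringComparer.Ordinal)
        {
            "Adams", "Alan", "Jones", "Jacobs",
            "Smith", "Thaxton", "Vela", "Zulu"
        };

    public static bool ContainsListMember(string sentence)
    {
        // Tokenize on whitespace and punctuation so "Smith," still
        // matches but "Smitherson" does not.
        string[] tokens = sentence.Split(
            new[] { ' ', ',', '.', '(', ')', ';' },
            StringSplitOptions.RemoveEmptyEntries);

        foreach (string token in tokens)
        {
            if (Names.Contains(token))
                return true; // stop at the first hit
        }
        return false;
    }
}
```

With the example sentences from the original post, "Smith said the dog
belongs ..." matches and "Smitherson said ..." does not, since only whole
tokens are looked up.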

jd
 
Ben Voigt [C++ MVP]

Karch said:
I need to find the fastest way, in terms of storage and searching, to
determine if a given string contains a member of a list of strings.
<snip>

By your example, you are assuming a delimiter (so that a string containing
Smitherson does not match Smith).

Store your list in a Dictionary.
Extract the word (or each word) from the string by splitting on the
delimiter, then look each one up in the dictionary.
 
Ben Voigt [C++ MVP]

Ben Voigt said:
By your example, you are assuming a delimiter (so that a string containing
Smitherson does not match Smith).

Store your list in a Dictionary.
Extract the word (or each word) from the string by splitting with the
delimiter, look for it in the dictionary.

OTOH, if the items can contain the delimiter, as in "Van Buren", things get
somewhat more complex.

Build a trie from your list and match against it using a maximal-length
match algorithm. So far Smitherson will still match Smith, so you need one
more step: after you make a match, check that the subsequent character is a
delimiter or that you have reached the end of the string.
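A sketch of that idea, under the assumption that "delimiter" means any
non-letter character. This version reports a match at any terminal trie
node whose following character is a delimiter or end-of-string, which is
enough for a yes/no answer even without a strictly maximal-length match:

```csharp
using System;
using System.Collections.Generic;

class TrieNode
{
    public readonly Dictionary<char, TrieNode> Children =
        new Dictionary<char, TrieNode>();
    public bool IsTerminal; // a list entry ends at this node
}

class Trie
{
    readonly TrieNode _root = new TrieNode();

    public void Add(string word)
    {
        TrieNode node = _root;
        foreach (char c in word)
        {
            TrieNode next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new TrieNode();
                node.Children.Add(c, next);
            }
            node = next;
        }
        node.IsTerminal = true;
    }

    // Walks the trie from position 'start'; returns true only if a list
    // entry matches AND the next character is a non-letter (or the end of
    // the string) - so "Smith" does not fire inside "Smitherson".
    public bool MatchesAt(string text, int start)
    {
        TrieNode node = _root;
        int i = start;
        TrieNode next;
        while (i < text.Length && node.Children.TryGetValue(text[i], out next))
        {
            node = next;
            i++;
            if (node.IsTerminal &&
                (i == text.Length || !char.IsLetter(text[i])))
                return true;
        }
        return false;
    }

    // Try every word start in the text; patterns may contain delimiters.
    public bool ContainsMatch(string text)
    {
        for (int s = 0; s < text.Length; s++)
        {
            bool atWordStart = s == 0 || !char.IsLetter(text[s - 1]);
            if (atWordStart && MatchesAt(text, s))
                return true;
        }
        return false;
    }
}
```

Because the trie stores whole entries character by character, "Van Buren"
(including its embedded space) matches in one walk without any tokenizing.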
 
Barry Kelly

(Followups set to microsoft.public.dotnet.languages.csharp)

This sounds awfully like a college problem or pre-job technical
assignment.
I need to find the fastest way, in terms of storage and searching, to
determine if a given string contains a member of a list of strings.

"The fastest way in terms of storage" is a contradictory constraint.
Fastest implies time; storage implies space. Optimizing for one usually
involves a tradeoff against the other, though fastest usually requires
keeping storage down because of cache effects and the like. Cache effects
may be relevant for this problem; see below.

However, even taking this into account, your problem is underspecified.
See below for details.
<snip>
I would need to stop the processing and return (true) as soon as Smith was
found. On the other hand, if the string was changed to the following, there
would be no match and I would return (false):

"Smitherson said the dog belongs (or did belong) to the secretary, an
employee of the company."

In order for 'Smith' in the list to not match against 'Smith' (followed
by 'erson') in this example, there needs to be either (1) an implied
"word-break" zero-width assertion after each element in the list
(similar to regex '$', '\W', '\>', etc.), or (2) an implied convention
for splitting up the sentence into fragments that can be matched against
the list - tokenizing, in a word. The missing information (1) can be
converted into (2) and vice versa. You need to preprocess - implicitly
via algorithm or explicitly via data munging - either the list or the
sentence to reflect the hidden constraints you haven't described.

However, once you've done that, the solution falls out fairly easily. A
prefix trie which includes an edge for the assertion along the lines of
(1) above should solve the problem trivially, or alternatively
extracting the appropriate token using the appropriate rules of (2) and
using simple hash lookup should also work. Matches should cost
proportional to the length of the matched string in both cases, while
mismatches might be discarded more cheaply using (1); however the
constant factors involved, and in particular cache locality loss from a
trie, may make (2) cheaper even for mismatches, especially if you have a
lot of spurious prefixes. You'll need to measure.

-- Barry
 
Jesse Houwing

Hello Peter,
I don't know if RegEx has optimizations for dealing with this sort of
thing. It could, but given how long a search string for RegEx could
be if you concatenate all of your possible matches, maybe it doesn't.

The .NET regex engine does not factor common prefixes out of alternations
for you, but you can do it yourself, so that:

New York|New Zealand
becomes (New (York|Zealand))

which would perform better. I know of certain tools online, used by the
SpamAssassin project, which can do this for you. It shouldn't be too hard
to build by hand yourself though.

Also, if New Zealand statistically occurs more often, you can optimize the
regex by putting that alternative first, so that:

New York|New Zealand
becomes (New (Zealand|York))

Normally a () will result in a capture group being created. This eats
performance. You have two ways to suppress this.
First:
Add the option RegexOptions.ExplicitCapture to the regex object's
constructor.
Second:
Add ?: in all the (), like this: (?:New (?:Zealand|York))

If you have divided your substrings into the smallest overlapping
subgroups, you can also use atomic groups - the (?>...) construct, which
gives up backtracking positions - which can improve performance.

If you know the casing beforehand, it's probably faster not to specify
RegexOptions.IgnoreCase, but you'd have to test that; I never tried. As in:
[Nn]ew [Yy]ork
is probably faster than
New York + RegexOptions.IgnoreCase

If you refrain from certain regex tricks, you could also try whether
RegexOptions.ECMAScript is faster than a normal regex; I haven't tried
that either.

Lastly, use the RegexOptions.Compiled option to make your regex faster. The
first call will suffer a severe hit, but the rest become much faster
(usually). Store the Regex in a static readonly field and you'll only pay
that cost once.
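Putting those tips together, a sketch of what the combined regex might look
like (the word list here is illustrative, and \b stands in for the
word-break requirement implied by the original post):

```csharp
using System;
using System.Text.RegularExpressions;

static class NameRegex
{
    // One alternation over the whole list. The (?:...) groups are
    // non-capturing, ExplicitCapture suppresses any stray captures, and
    // Compiled trades a slow first call for faster matching afterwards.
    // \b asserts a word boundary so "Smith" won't fire in "Smitherson".
    static readonly Regex Matcher = new Regex(
        @"\b(?:Smith|Jones|Jacobs|Thaxton)\b",
        RegexOptions.Compiled | RegexOptions.ExplicitCapture);

    public static bool ContainsListMember(string text)
    {
        return Matcher.IsMatch(text); // stops at the first match
    }
}
```

Ordering the alternatives by expected frequency, as suggested above, only
changes how fast IsMatch gets to a hit, not the boolean answer.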
That'd be my first try though. It seems like the actual RegEx search
string would be simple (just "or" all of your possible matches
together). If it performs well enough, there you go.

And if not, there is room to optimize ;). I'm willing to help on that if
needed.

In the end the regex solution will never be the fastest solution possible.
Using unsafe string manipulation will be much faster, but also a lot harder
to implement. If you can reach the required performance using Regex, then I
say go for it. But that's your prerogative.
 
Jack Jackson

One problem I see with this is that you apparently are not doing
simply a string search, but some kind of word search, without defining
what a word is.

You say that "Smitherson said ..." should not produce any matches, so
while the string Smith is a substring, it does not count as a match.
Would it match:
(Smith
(Smith)
Smith,
Smith;
Smith-
smith

etc.

The answer to this will probably constrain your possible algorithms. If it
doesn't produce too many false hits, doing a plain string search first and
then re-checking whether the matches are valid word matches might be OK.
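That two-phase idea might look like this (the boundary test here
arbitrarily treats any non-letter as a word break; adjust it to whatever
the real definition of "word" turns out to be):

```csharp
using System;

static class WordSearch
{
    // IndexOf finds candidates quickly; the boundary re-check rejects
    // false hits like "Smith" inside "Smitherson".
    public static bool HasWordMatch(string text, string word)
    {
        int i = text.IndexOf(word, StringComparison.Ordinal);
        while (i >= 0)
        {
            bool leftOk = i == 0 || !char.IsLetter(text[i - 1]);
            int end = i + word.Length;
            bool rightOk = end == text.Length || !char.IsLetter(text[end]);
            if (leftOk && rightOk)
                return true;
            i = text.IndexOf(word, i + 1, StringComparison.Ordinal);
        }
        return false;
    }
}
```

Under that definition, "(Smith", "(Smith)", "Smith," and "Smith;" all
match while "Smitherson" does not; case sensitivity is a separate choice
via the StringComparison argument.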
 
Karch

The list is static (read into memory during initialization) and could
contain up to approximately 1,000 items (hence the need for a fast method).
 
Karch

I think the StateGraph approach is going to work best. I have an
implementation up and running and should be able to get some timings done
today. I don't think Regex is an option for me because of performance and
the fact that, although the list itself is static, it will be loaded into
memory at run-time. So, the expression building would be complex, not to
mention eating cycles just to prep.
 
Karch

These are novel approaches, but in my case - after looking into it further -
I don't think they would be a good option because of how I need to load the
list data initially and then the actual expression building step required.
Then, of course, the regex evaluation itself.

Jesse Houwing said:
<snip>
 
Karch

The Hashtable would work great alongside a tokenized string, but the problem
is that I don't have a distinct delimiter since the string to be matched can
span multiple words. It's really a question of "do any of these strings
appear in this source string", irrespective of whitespace.
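To make that requirement concrete, here is a deliberately naive sketch; a
state graph or trie collapses the inner loop over patterns into a single
walk, but the contract is the same (the pattern list is illustrative):

```csharp
using System;
using System.Collections.Generic;

// Scans every starting offset of the text against a set of patterns that
// may themselves contain spaces and punctuation ("Joe Smith (", ...).
static class SpanMatcher
{
    public static bool ContainsAny(string text, IList<string> patterns)
    {
        for (int start = 0; start < text.Length; start++)
        {
            foreach (string p in patterns)
            {
                // Ordinal comparison of p against text[start..]
                if (p.Length <= text.Length - start &&
                    string.CompareOrdinal(text, start, p, 0, p.Length) == 0)
                {
                    return true; // first hit terminates the whole search
                }
            }
        }
        return false;
    }
}
```

This is the O(text x patterns) baseline that the state-graph approach is
meant to beat; it is mainly useful as a correctness reference when timing
the faster implementations.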
 
Karch

Barry Kelly said:
(Followups set to microsoft.public.dotnet.languages.csharp)

This sounds awfully like a college problem or pre-job technical
assignment.

Nope. If it were, I would probably have a nice fresh memory of my data
structures and algorithms class and wouldn't be posting here.
In order for 'Smith' in the list to not match against 'Smith' (followed
by 'erson') in this example, there needs to be either (1) an implied
"word-break" zero-width assertion after each element in the list
(similar to regex '$', '\W', '\>', etc.), or (2) an implied convention
for splitting up the sentence into fragments that can be matched against
the list - tokenizing, in a word. The missing information (1) can be
converted into (2) and vice versa. You need to preprocess - implicitly
via algorithm or explicitly via data munging - either the list or the
sentence to reflect the hidden constraints you haven't described.

However, once you've done that, the solution falls out fairly easily. A
prefix trie which includes an edge for the assertion along the lines of
(1) above should solve the problem trivially, or alternatively
extracting the appropriate token using the appropriate rules of (2) and
using simple hash lookup should also work. Matches should cost
proportional to the length of the matched string in both cases, while
mismatches might be discarded more cheaply using (1); however the
constant factors involved, and in particular cache locality loss from a
trie, may make (2) cheaper even for mismatches, especially if you have a
lot of spurious prefixes. You'll need to measure.

Yeah, this would work really well - but one point that I didn't mention is
that I can't really tokenize the string, since the strings to match could
span multiple words and whitespace. I looked at the data and I couldn't
find a reliable way to tokenize.
 
Peter Duniho

I think the StateGraph approach is going to work best. I have an
implementation up and running and should be able to get some timings done
today.

Have you tried the dictionary approach suggested by a couple of others?

I wasn't paying attention when I first read your question and failed to
note that your search is delimited by spaces. Or, at least in the
examples you provided it was.

I think that if your data is actually restricted like that, the dictionary
lookup might be as fast as a state graph and it would be a lot simpler in
terms of implementation.

Just a thought. Obviously I really like state graphs :), but when there
is information you know about the input data that allows you to constrain
the problem a bit (e.g. by using spaces to identify the beginning and
ending of any possible match), it's often possible to solve the problem
efficiently without using a state graph.

Pete
 
Peter Duniho

The Hashtable would work great alongside a tokenized string, but the
problem is that I don't have a distinct delimiter since the string to be
matched can span multiple words. It's really a question of "do any of these
strings appear in this source string", irrespective of whitespace.

Then how do you distinguish "Smith" from "Smitherson", as in your
example? Can you at least make a requirement that the search pattern will
always begin and end on a space, even if it includes spaces within?

Pete
 
Jesse Houwing

Hello Karch,
These are novel approaches, but in my case - after looking into it
further - I don't think they would be a good option because of how I
need to load the list data initially and then the actual expression
building step required.

Keep in mind that if the list is static after load, building the regex
would be a one-time thing. You could even use an event to trigger a reload
of the regex if needed. But I suspect that a dedicated string algorithm is
the way to go here.
Then, of course, the regex evaluation itself.

Only timing will tell, but my guess is that a dedicated algorithm in the
end will be faster.

Jesse

 
Karch

As I mentioned, I do have the ability to reorder the data, so I'll have to
make sure that sub-patterns occur after search patterns. Or separate the
searches into two passes. I don't really care so much if I find both "Smith"
AND "Smitherson" - as long as I find at least one of them in the source
string. As far as your second question, I'm not sure. For example, I am
going to have things like this in my search list:

"Joe Smith ("
"Jones"
"Jane Jones"
"Jim Stewart/"

The source string (the one to be searched) could be anything. All I care
about is whether one of those strings in the list appears in the source
string. As soon as I find one, I can stop searching and return "true". I
don't need to verify all the places where any of the strings occur. Does
that make more sense?
 
Mufaka

Karch said:
The Hashtable would work great alongside a tokenized string, but the problem
is that I don't have a distinct delimiter since the string to be matched can
span multiple words. It's really a question of "do any of these strings
appear in this source string", irrespective of whitespace.
<snip>
This is a little different from your original post, but if you can't
delimit the search string, you might try something like the following:

Hash the static list of words
Keep a list of the unique lengths of those words
For each unique length, iterate over the string pulling out substrings of
that length
See if each substring is in your hashed list of words

It at least reduces the number of IndexOf-style searches you'd have to do.

I'd be interested in seeing how this performs against large strings.
Here's my test code:

public class WordMatch
{
    private Dictionary<string, string> _words =
        new Dictionary<string, string>();
    private Dictionary<int, int> _lengths = new Dictionary<int, int>();

    public void SetWords(List<string> words)
    {
        foreach (string word in words)
        {
            string lowerWord = word.ToLower();
            if (!_words.ContainsKey(lowerWord))
            {
                _words.Add(lowerWord, lowerWord);
            }

            if (!_lengths.ContainsKey(word.Length))
            {
                _lengths.Add(word.Length, word.Length);
            }
        }
    }

    public bool StringHasMatch(string text)
    {
        string lowerText = text.ToLower();
        foreach (int length in _lengths.Keys)
        {
            // Dictionary key order is unspecified, so skip (rather than
            // break on) lengths that are longer than the text.
            if (text.Length < length) continue;

            int ndx = 0;

            while (ndx + length <= text.Length)
            {
                if (_words.ContainsKey(lowerText.Substring(ndx, length)))
                {
                    return true;
                }
                ndx++;
            }
        }
        return false;
    }
}
 
