Fast way to determine if a string contains a member of a list of strings

Mufaka · Mar 6, 2008

Mufaka said:
<snip>
This is a little different than your original post, but if you can't
delimit search string, you might try something like the following:

Hash the static list of words
Keep a list of unique lengths for those words
for each unique length, iterate the string pulling out substrings of
that length
see if that substring is in your hashed list of words

It at least reduces the amount of InString's you'd have to do.

I'd be interested in seeing how this performs against large strings.
Here's my test code:

if (text.Length < length) break;

<snip>
That should be continue instead of break.

Karch · Mar 7, 2008

WOW! Really disappointing timings for the state machine method. Here are the
results:

Method: Looping through 359 strings doing an if (IndexOf > 0) and a source
string of length = 47
Execution time: 0.055008 seconds.

Method: Using the StateMachine method on 359 strings and a source string of
length = 47
Execution time: 2.359223 seconds.

Peter Duniho · Mar 7, 2008

WOW! Really disappointing timings for the state machine method. Here are
the
results:

Method: Looping through 359 strings doing an if (IndexOf > 0) and a
source
string of length = 47
Execution time: 0.055008 seconds.

Method: Using the StateMachine method on 359 strings and a source string
of
length = 47
Execution time: 2.359223 seconds.

Are you counting the time it takes to create the state graph? How many
trials is that? Just one? What is the test string like? Have you tried
testing the worst-case scenario, which is a source string that has zero
matches?

Basically, those times sound wrong to me. Or, at least the second time
(the first seems plausible). The only way I could see the state graph
taking almost 2.5 seconds to process a 47 character string is if you are
including the time it takes to build the state graph (which you shouldn't,
since you only have to do that once).

Even then, with only 359 strings it seems odd. Even assuming 10-character
strings on average with _no_ overlapping characters (worst-case for
allocations), that's only 35900 iterations (3590 characters, with a
10-node traversal in the graph) to build the graph. I'm not used to
seeing .NET only being able to handle ~10K of "something" per
second...that's pretty slow.

Regardless of how slow building the graph is, once that's done processing
an actual string should be extremely quick.

Can you post some sample code and test data? A simple
concise-but-complete console application that demonstrates the two
different algorithms should be sufficient, along with whatever input data
you're testing of course.

By the way, if you _are_ building the graph each time and for some reason
this isn't something you can fix, then no...a state graph might not help..
It depends a lot on how large your input data is, but if you're only
processing ~50 characters every time you start with a fresh search list,
then even if a state graph or other solution is faster than brute force,
it probably wouldn't be by much (and while a 50X disparity seems wrong to
me, it actually wouldn't surprise me for brute force to be faster than a
state graph for such short source strings if you have to rebuild the state
graph for each search).

Pete

Peter Duniho · Mar 7, 2008

WOW! Really disappointing timings for the state machine method.

Well, I was right that it shouldn't have been that bad. But it was my
fault that it was that bad.

I wrote that code a long time ago, before I understood just how costly
throwing and catching an exception can be. Practically all of the time
spent in processing your string is likely to be in a single method in the
code I posted, where I use an exception to detect non-existing keys. That
was a very bad choice on my part. I've since learned my lesson, but I
didn't ever update that code.

Anyway, you should be able to dramatically improve the performance of the
state graph implementation by replacing the StateNode.NextState() method
with this code:

public StateNode NextState(TransitionKey tk, bool fAdd)
{
StateNode state = null;

if (_mpchstate.ContainsKey(tk))
{
state = _mpchstate[tk];
}
else if (fAdd)
{
state = new StateNode();
_mpchstate.Add(tk, state);
}

return state;
}

Sorry about that. I _did_ say that the code was in need of review.

There are other things I would do differently were I writing that code
today, but I think that's the most serious problem (most of the other
issues are stylistic in nature).

With that change on my own machine, I went from taking 1.5 seconds to
process a 50-character string (with no nodes in the graph, which is the
worst-case scenario) to taking 0.005 seconds. I think that should make it
more competitive with the brute force approach.

Pete

John Daragon · Mar 7, 2008

Karch wrote:
Snip 8<

Yeah, this would work really good - but one point that I didn't mention is
that I can't really tokenize the string since the strings to match could
span multiple words and whitepace. I looked at the data and I couldn't find
a realiable way to tokenize.

Snip 8<

Is the data in your list arbitrarily complex? I guess what I'm getting
at is: if the majority of the strings to match are simple strings with
no embedded spaces or other special characters, it may well be possible
to build a hashtable (fixated, me?) that caters for the simplest
(tokenised) case, but which returns an object identifying whether this
string can be standalone or is a prefix for a more complex one.
Everything apart from the simple case would have to be handled
exceptionally, but if they're few and far between you may well still be
a long way ahead of the game adopting a technique that is based on hashing.

jd

Jesse Houwing · Mar 7, 2008

Hello Karch,

WOW! Really disappointing timings for the state machine method. Here
are the results:

Method: Looping through 359 strings doing an if (IndexOf > 0) and a
source
string of length = 47
Execution time: 0.055008 seconds.
Method: Using the StateMachine method on 359 strings and a source
string of
length = 47
Execution time: 2.359223 seconds.

Could you post the code for both? This I'd like to see why it's so much slower...

My guess is it somehow creates massive amoutns of strings to allocate...
but I could be wrong...

Jesse

Peter Duniho · Mar 7, 2008

Could you post the code for both? This I'd like to see why it's so much
slower...

My guess is it somehow creates massive amoutns of strings to allocate...
but I could be wrong...

You are.

See my other replies...the code I posted earlier and
referred to in this thread was actually kind of stupid, as it used an
exception as a normal part of processing. Very bad idea.

Hopefully the OP will actually come back here and see my fix for the
problem. Sometimes people start threads and then once they feel they've
reached a conclusion, whether useful or not, they just abandon the
newsgroup until they need help again.

Now, if only I could figure out a way to retroactively fix the post I made
before.

I'm not entirely proud of that code for a variety of reasons,
but I'm especially embarassed that it has that serious performance bug in
it. It makes the class entirely useless for the intended purpose.

Pete

Karch · Mar 13, 2008

WOW! Really disappointing timings for the state machine method. Here are the
results:

Method: Looping through 359 strings doing an if (IndexOf > 0) and a source
string of length = 47
Execution time: 0.055008 seconds.

Method: Using the StateMachine method on 359 strings and a source string of
length = 47
Execution time: 2.359223 seconds.

In my case, I can terminate the search as soon as ONE of the strings is
found because the list is mutually exclusive. In the sample code that you
provided, at what point do I know I have a match and can terminate the
traversal?

Also, to answer your question, I do not have space delimiters - the strings
can span multiple words, etc.

Peter Duniho · Mar 13, 2008

WOW! Really disappointing timings for the state machine method.

Did you really just post this message? Have you read my messages in reply
to your previous (very similar) post? The state machine will work very
well, once you incorporate the bug-fix I provided.

[...]
In my case, I can terminate the search as soon as ONE of the strings is
found because the list is mutually exclusive. In the sample code that you
provided, at what point do I know I have a match and can terminate the
traversal?

The code I posted doesn't provide that functionality. However, it's not
hard to add. You could add a property to the StateNode class that tests
to see if it's at a "terminal" state (in the code I posted, this would be
true when the private field "_lobj" is a non-zero-length list...that is,
"_lobj.Count > 0").

Note, of course, that this graph supports non-leaf nodes with
terminations. That's so you can have search strings that overlap each
other, like "Smith" and "Smitherson". You'll need to decide for yourself
what an appropriate action to take is in that situation and implement it
accordingly.

Also, to answer your question, I do not have space delimiters - the
strings
can span multiple words, etc.

Okay...in that case, I think the state graph may well be one of the better
options for you.

Pete

Peter Duniho · Mar 13, 2008

Sorry, some follow-up points (it's late, I'm getting sloppy)...

[...]
In my case, I can terminate the search as soon as ONE of the strings is
found because the list is mutually exclusive. In the sample code that
you
provided, at what point do I know I have a match and can terminate the
traversal?

Click to expand...

The code I posted doesn't provide that functionality. However, it's not
hard to add. You could add a property to the StateNode class that tests
to see if it's at a "terminal" state (in the code I posted, this would
be true when the private field "_lobj" is a non-zero-length list...that
is, "_lobj.Count > 0").

Of course, you could also just note the return value for the
"RgobjFromNextState()" method, and if it's a non-zero-length list, that's
a terminal node. No need to modify the StateNode class at all in that
case.

Note, of course, that this graph supports non-leaf nodes with
terminations. That's so you can have search strings that overlap each
other, like "Smith" and "Smitherson". You'll need to decide for
yourself what an appropriate action to take is in that situation and
implement it accordingly.

Okay, so...since you did point out that your search strings are "mutually
exclusive" -- I assume this means that you know for sure your source text
will only ever include one of the strings that's in the list of search
strings -- then obviously you can terminate the traversal of the state
graph as soon as you reach a state where the list of objects for a node
has a non-zero length.

Pete

Karch · Mar 13, 2008

Sorry, I must have missed both my earlier post and your bug-fix post. Could
you please repost the bug-fix provided? I would like to try the timings
again with the additional changes.

Peter Duniho said:
WOW! Really disappointing timings for the state machine method.

Click to expand...

Did you really just post this message? Have you read my messages in reply
to your previous (very similar) post? The state machine will work very
well, once you incorporate the bug-fix I provided.

[...]
In my case, I can terminate the search as soon as ONE of the strings is
found because the list is mutually exclusive. In the sample code that you
provided, at what point do I know I have a match and can terminate the
traversal?

Click to expand...

The code I posted doesn't provide that functionality. However, it's not
hard to add. You could add a property to the StateNode class that tests
to see if it's at a "terminal" state (in the code I posted, this would be
true when the private field "_lobj" is a non-zero-length list...that is,
"_lobj.Count > 0").

Note, of course, that this graph supports non-leaf nodes with
terminations. That's so you can have search strings that overlap each
other, like "Smith" and "Smitherson". You'll need to decide for yourself
what an appropriate action to take is in that situation and implement it
accordingly.

Also, to answer your question, I do not have space delimiters - the
strings
can span multiple words, etc.

Click to expand...

Okay...in that case, I think the state graph may well be one of the better
options for you.

Pete

Peter Duniho · Mar 13, 2008

Sorry, I must have missed both my earlier post and your bug-fix post.
Could
you please repost the bug-fix provided? I would like to try the timings
again with the additional changes.

I would have just pointed you to the Google Groups archive, with a link to
the message. But for some reason, Google is having trouble today. WhenI
first looked at this thread on their archives, I could not see the most
recent messages (Google claimed to only have the first 16). Then when I
looked again, Google sent me back an error message saying it couldn't even
get at the archive for this newsgroup.

Oh well.

In the meantime, here's my previous post with the fix, copied and pasted:

WOW! Really disappointing timings for the state machine method.

Well, I was right that it shouldn't have been that bad. But it was my
fault that it was that bad.

I wrote that code a long time ago, before I understood just how costly
throwing and catching an exception can be. Practically all of the time
spent in processing your string is likely to be in a single method in the
code I posted, where I use an exception to detect non-existing keys. That
was a very bad choice on my part. I've since learned my lesson, but I
didn't ever update that code.

Anyway, you should be able to dramatically improve the performance of the
state graph implementation by replacing the StateNode.NextState() method
with this code:

public StateNode NextState(TransitionKey tk, bool fAdd)
{
StateNode state = null;

if (_mpchstate.ContainsKey(tk))
{
state = _mpchstate[tk];
}
else if (fAdd)
{
state = new StateNode();
_mpchstate.Add(tk, state);
}

return state;
}

Sorry about that. I _did_ say that the code was in need of review.
There are other things I would do differently were I writing that code
today, but I think that's the most serious problem (most of the other
issues are stylistic in nature).

With that change on my own machine, I went from taking 1.5 seconds to
process a 50-character string (with no nodes in the graph, which is the
worst-case scenario) to taking 0.005 seconds. I think that should make it
more competitive with the brute force approach.

Pete

Karch · Mar 13, 2008

New timings including changes are MUCH better. And the state machine wins:

Method: Looping through 359 strings doing an if (IndexOf > 0) and a source
string of length = 47
Execution time: 0.533385 seconds.

Method: Using the StateMachine method on 359 strings and a source string of
length = 47
Execution time: 0.351235 seconds.

34% improvement!

Thanks.

Sorry, I must have missed both my earlier post and your bug-fix post.
Could
you please repost the bug-fix provided? I would like to try the timings
again with the additional changes.

I would have just pointed you to the Google Groups archive, with a link to
the message. But for some reason, Google is having trouble today. When I
first looked at this thread on their archives, I could not see the most
recent messages (Google claimed to only have the first 16). Then when I
looked again, Google sent me back an error message saying it couldn't even
get at the archive for this newsgroup.

Oh well.

In the meantime, here's my previous post with the fix, copied and pasted:

WOW! Really disappointing timings for the state machine method.

Well, I was right that it shouldn't have been that bad. But it was my
fault that it was that bad.

I wrote that code a long time ago, before I understood just how costly
throwing and catching an exception can be. Practically all of the time
spent in processing your string is likely to be in a single method in the
code I posted, where I use an exception to detect non-existing keys. That
was a very bad choice on my part. I've since learned my lesson, but I
didn't ever update that code.

Anyway, you should be able to dramatically improve the performance of the
state graph implementation by replacing the StateNode.NextState() method
with this code:

public StateNode NextState(TransitionKey tk, bool fAdd)
{
StateNode state = null;

if (_mpchstate.ContainsKey(tk))
{
state = _mpchstate[tk];
}
else if (fAdd)
{
state = new StateNode();
_mpchstate.Add(tk, state);
}

return state;
}

Sorry about that. I _did_ say that the code was in need of review.
There are other things I would do differently were I writing that code
today, but I think that's the most serious problem (most of the other
issues are stylistic in nature).

With that change on my own machine, I went from taking 1.5 seconds to
process a 50-character string (with no nodes in the graph, which is the
worst-case scenario) to taking 0.005 seconds. I think that should make it
more competitive with the brute force approach.

Pete

Peter Duniho · Mar 13, 2008

New timings including changes are MUCH better. And the state machine
wins:

I thought it might.

Keep in mind also that it appears you're roughly around the break-even
point between the two implementations. Depending on the actual input data
and how you're using it, the state graph can potentially be much better
than the IndexOf() implementation.

In fact, based on the numbers you posted, I suspect that you're including
the cost of creating the state graph in the total time. For best results,
you would actually just create the state graph once and then reuse the
same graph each time you need to do a search. In that case, the cost of
creating the state graph should not be included in the performance
comparison.

With 359 search strings, the IndexOf() implementation would scan the input
string 359 times as compared to having to scan it only 1 time for the
state graph (in the worst-case scenario, that is: when none of the search
strings are in the input string). Even in the best case -- when the very
first search string is in the input string -- the two would be
equivalent. The average case would have the IndexOf() implementation 180
times slower than the state graph.

In other words, a mere 34% improvement is remarkably tiny given the actual
difference in efficiency of the two algorithms. If you were to perform
the timing test using the same state graph (that is, creating it only
once) on a larger number of iterations, you'd likely find a difference of
100x or more between the two, with the state graph winning. And
presumably, if performance is actually important to you, the number of
iterations in the real-world situation would be high enough to realize
that kind of performance advantage, assuming you use the state graph
correctly by only creating it once.

Pete

Karch · Mar 13, 2008

I only create the state graph one time and I only populate the list of
strings one time. So, those are not interfering with proper timings; I am
only timing the searches. But, just for fun, here is the same test with 1M
iterations. The ratio is about the same.

Method: Looping with IndexOf
Execution time: 53.548416 seconds.

Method: State Machine
Execution time: 31.109533 seconds.

Ben Voigt [C++ MVP] · Mar 13, 2008

Peter said:
I thought it might.

Keep in mind also that it appears you're roughly around the break-even
point between the two implementations. Depending on the actual input
data and how you're using it, the state graph can potentially be much
better than the IndexOf() implementation.

In fact, based on the numbers you posted, I suspect that you're
including the cost of creating the state graph in the total time. For best
results, you would actually just create the state graph once
and then reuse the same graph each time you need to do a search. In
that case, the cost of creating the state graph should not be
included in the performance comparison.

With 359 search strings, the IndexOf() implementation would scan the
input string 359 times as compared to having to scan it only 1 time
for the state graph (in the worst-case scenario, that is: when none
of the search strings are in the input string). Even in the best
case -- when the very first search string is in the input string --
the two would be equivalent. The average case would have the
IndexOf() implementation 180 times slower than the state graph.

In other words, a mere 34% improvement is remarkably tiny given the
actual difference in efficiency of the two algorithms. If you were
to perform the timing test using the same state graph (that is,
creating it only once) on a larger number of iterations, you'd likely
find a difference of 100x or more between the two, with the state
graph winning. And presumably, if performance is actually important
to you, the number of iterations in the real-world situation would be
high enough to realize that kind of performance advantage, assuming
you use the state graph correctly by only creating it once.

It's now O(N) instead of O(N * M), but the constant has likely changed by a
factor of 5 or more. I'd guess the speedup would be closer to 20x than
180x.

Peter Duniho · Mar 13, 2008

I only create the state graph one time and I only populate the list of
strings one time. So, those are not interfering with proper timings; I am
only timing the searches. But, just for fun, here is the same test with
1M
iterations. The ratio is about the same.

What is your input data? Does the input string contain any of the search
strings? If so, where is that search string in the list of search
strings? Are you doing 1 million identical searches, or are you somehow
randomizing the input?

Basically, that's a fairly minimal improvement relative to the potential
of the state graph versus a brute force IndexOf() solution. So either
your test case is somehow favorable to the IndexOf(), or there's still
some remaining problem with the state graph implementation.

Pete

Karch · Mar 14, 2008

Yes, you are right; my input data is not as random as it could be. I just
wanted to get a quick snapshot, so it's likely that the improvement would be
greater over a larger and more varied set. I'll put together a more
realistic set of test data and report back.

Peter Duniho · Mar 14, 2008

It's now O(N) instead of O(N * M), but the constant has likely changed
by a
factor of 5 or more. I'd guess the speedup would be closer to 20x than
180x.

You're right that my analysis was too simplistic. However, I'm not sure
that the 5x ratio is necessarily a precise estimate either. The
performance of the IndexOf() implementation depends a _lot_ on the input
data. Which string matches and where it appears in the input string plays
a big part in performance (as noted already), but also how close the
search strings match to non-matching text in the input string. IndexOf()
can waste a lot of time trying to match strings that eventually fail and
doing that can bring the cost of testing any one string much closer to the
cost of testing any one node in the state graph (which is, as you surely
noted, is where the constant difference you're talking about comes from).

That said, yes...I'd agree that a more conservative estimate of potential
speed difference might be 20x. Whatever the estimate, it's at least an
order of magnitude difference and potentially much greater, rather than
something relative small like the 34% observed here. On average, that
is...obviously for at least one particular test case, the difference is in
fact 34%.

Pete

Fast way to determine if a string contains a member of a list of strings

Mufaka

Karch

Peter Duniho

Peter Duniho

John Daragon

Jesse Houwing

Peter Duniho

Karch

Peter Duniho

Peter Duniho

Karch

Peter Duniho

Karch

Peter Duniho

Karch

Ben Voigt [C++ MVP]

Peter Duniho

Karch

Peter Duniho