Fastest way to search a string for the occurrence of a word?


msnews.microsoft.com

I need to check the request header of every request that is hitting my web
server and route traffic accordingly. I can think of several ways to do
this, but I am looking for the FASTEST way. Your advice would be
appreciated. Thank you.
 

Peter Morris

I need to check the request header of every request that is hitting my web
server and route traffic accordingly. I can think of several ways to do
this, but I am looking for the FASTEST way. Your advice would be
appreciated. Thank you.

I would advise testing the several ways you can think of and timing them :)


--
Pete
====
Audio compression components, DIB graphics controls, FastStrings
http://www.droopyeyes.com

Read or write articles on just about anything
http://www.HowToDoThings.com
 

Niki Estner

Of course this depends on the length of your string, and to some extent on
its contents, but generally RegExes are pretty fast (faster than
String.IndexOf, for example).

I'd guess the network would be the main bottleneck in the application you
describe, but if matching performance is as critical as you say, I'd suggest
benchmarking the following alternatives:
- a "bare-hand" Boyer-Moore implementation
- C/C++'s strstr function (uses CPU-specific optimizations)
- RegEx.Match
using real test data.
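For concreteness, here's a minimal Boyer-Moore-Horspool sketch (the simplified bad-character variant of Boyer-Moore; the class and method names are mine, not a library API):

```csharp
using System;
using System.Collections.Generic;

static class Horspool
{
    // Bad-character search: on a mismatch, slide the window by the
    // shift associated with the haystack character under the window's
    // last position (or the full pattern length if it never occurs
    // in the pattern).
    public static int IndexOf(string haystack, string needle)
    {
        int n = haystack.Length, m = needle.Length;
        if (m == 0) return 0;
        if (m > n) return -1;

        var shift = new Dictionary<char, int>();
        for (int k = 0; k < m - 1; k++)
            shift[needle[k]] = m - 1 - k;

        int i = 0;
        while (i <= n - m)
        {
            // Compare the current window right-to-left.
            int j = m - 1;
            while (j >= 0 && haystack[i + j] == needle[j])
                j--;
            if (j < 0)
                return i;
            int s;
            i += shift.TryGetValue(haystack[i + m - 1], out s) ? s : m;
        }
        return -1;
    }
}
```

The shift table lets the search slide the window by up to the whole pattern length on a mismatch, which is where the speed-up over a naive scan comes from.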

Post your results, if you're done!

Niki
 

Jon Skeet [C# MVP]

Niki Estner said:
Of course this depends on the length of your string, and to some extent on
its contents, but generally RegExes are pretty fast (faster than
String.IndexOf, for example).

I usually find the exact opposite. Regular expressions are extremely
powerful, but for simple string testing, I usually find a hard-coded
test to be about an order of magnitude faster than using regular
expressions.

It's definitely worth the OP testing the actual situation they're
interested in.
 

Niki Estner

Jon Skeet said:
I usually find the exact opposite. Regular expressions are extremely
powerful, but for simple string testing, I usually find a hard-coded
test to be about an order of magnitude faster than using regular
expressions.

Really?

I've built a little test sample to find out what cases you mean; maybe
there's a mistake in it?

using System;
using System.Text;
using System.Text.RegularExpressions;

public class RegExTest
{
    public static void Main()
    {
        string searchString = "1234";
        string bigString;
        StringBuilder builder = new StringBuilder(1234);

        const int stringLength = 10000;
        const int repeatCount = 10000;

        Random rnd = new Random();

        for (int i = 0; i < stringLength; i++)
            builder.Append((char)(rnd.Next() % 10 + '0'));

        bigString = builder.ToString();

        {
            Console.WriteLine("Testing String.IndexOf:");
            long start_time = DateTime.Now.Ticks;
            for (int i = 0; i < repeatCount; i++)
            {
                int index = bigString.IndexOf(searchString);
                int x = 0;
            }
            long end_time = DateTime.Now.Ticks;
            Console.WriteLine(new TimeSpan(end_time - start_time));
        }

        {
            Console.WriteLine("Testing Regex.Match:");
            Regex regex = new Regex(searchString);
            long start_time = DateTime.Now.Ticks;
            for (int i = 0; i < repeatCount; i++)
            {
                Match m = regex.Match(bigString);
                int x = 0;
            }
            long end_time = DateTime.Now.Ticks;
            Console.WriteLine(new TimeSpan(end_time - start_time));
        }

        Console.ReadLine();
    }
}
It's definitely worth the OP testing the actual situation they're
interested in.

Yes, that was my advice.

Niki
 

Jon Skeet [C# MVP]

Niki Estner said:
Really?

I've built a little test sample to find out what cases you mean; maybe
there's a mistake in it?

Well, most of the kinds of tests I was talking about are doing more
than just IndexOf - things like checking for a number being (probably)
parseable, etc.

The test case you gave showed roughly the same results for both methods
- sometimes faster for IndexOf, sometimes faster for Regex. The way I
see it, String.IndexOf *should* be faster, as it's a more blunt
instrument - it doesn't need to work out anything clever about what the
regex is doing, etc.

Using RegexOptions.Compiled did (on a single-test basis :) seem a bit
faster, but that has implications in terms of memory use. (It's fine if
you've got a static search string, but no good otherwise.) Of course,
using regular expressions is likely to generate more garbage (if only
the Match instance) than String.IndexOf (which really shouldn't be
generating any garbage at all, I'd have thought).

Personally I'd use IndexOf just for clarity's sake, to be honest.
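To make that trade-off concrete, here is a sketch of both approaches for a fixed header check (the word "crawler" and the class name are invented for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

class HeaderCheck
{
    // Compiled once up front: this pays off only if the pattern is
    // fixed for the lifetime of the application.
    static readonly Regex BotPattern =
        new Regex("crawler", RegexOptions.Compiled);

    public static bool ViaRegex(string header)
    {
        return BotPattern.Match(header).Success;
    }

    public static bool ViaIndexOf(string header)
    {
        // No per-call allocation; IndexOf returns -1 when absent.
        return header.IndexOf("crawler") >= 0;
    }
}
```

With RegexOptions.Compiled the pattern is compiled once, so it only makes sense when the search string is static, as discussed above.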
 

Niki Estner

Jon Skeet said:
Well, most of the kinds of tests I was talking about are doing more
than just IndexOf - things like checking for a number being (probably)
parseable, etc.

Maybe, but the OP's question was what's the "Fastest way to search a string
for the occurrence of a word?".
The test case you gave showed roughly the same results for both methods
- sometimes faster for IndexOf, sometimes faster for Regex.

Really? in all the tests I did here, the Regex was by far superior.

Examples on my PC:
stringLength = 100; repeatCount = 100000;
Testing String.IndexOf: 00:00:00.3750000
Testing Regex.Match: 00:00:00.0625000

stringLength = 1000; repeatCount = 100000;
Testing String.IndexOf: 00:00:03.1406250
Testing Regex.Match: 00:00:00.5625000

stringLength = 10000; repeatCount = 100000;
Testing String.IndexOf: 00:00:02.3437500
Testing Regex.Match: 00:00:00.5312500

The Regex is about 5 times faster.

If the regex is compiled:
stringLength = 100; repeatCount = 100000;
Testing String.IndexOf: 00:00:00.3437500
Testing Regex.Match: 00:00:00.0625000

stringLength = 1000; repeatCount = 100000;
Testing String.IndexOf: 00:00:00.9375000
Testing Regex.Match: 00:00:00.1562500

stringLength = 10000; repeatCount = 100000;
Testing String.IndexOf: 00:00:10.6562500
Testing Regex.Match: 00:00:01.2031250

That's about 5-10 times faster.
The way I
see it, String.IndexOf *should* be faster, as it's a more blunt
instrument - it doesn't need to work out anything clever about what the
regex is doing, etc.

Wrong. RegExes use the Boyer-Moore algorithm, which does take a little
time and space for initialization, but is orders of magnitude faster in
matching. Also, they can be compiled, in which case they would have to be
compared to a matching routine specially built for finding one string.
Using RegexOptions.Compiled did (on a single-test basis :) seem a bit
faster, but that has implications in terms of memory use. (It's fine if
you've got a static search string, but no good otherwise.)

Yes, that's why it's an option. You're not supposed to use it if you got new
regular expressions all the time.
Of course,
using regular expressions is likely to generate more garbage (if only
the Match instance) than String.IndexOf (which really shouldn't be
generating any garbage at all, I'd have thought).

If the regex doesn't contain any capturing parentheses, it shouldn't
generate more than the Match object either (plus the initialization data
when the RegEx object is created).
Personally I'd use IndexOf just for clarity's sake, to be honest.

I do too, if speed doesn't matter.
But again, that wasn't the OP's question.

Niki
 

Jon Skeet [C# MVP]

Niki Estner said:
Maybe, but the OP's question was what's the "Fastest way to search a string
for the occurrence of a word?".

True - but you did use the word "generally" in your first post, which
was what I was replying to.
Really? in all the tests I did here, the Regex was by far superior.

<snip results>

Interesting - entirely different from my results. I would suggest that
such short tests aren't a very good measure, however. I used a string
length of 10000 and a repeat count of 300000 to get reasonable results.
(That's using a random seed of 10, btw - using a fixed seed ensures
reasonably repeatable results.)

Using only results of 5 seconds or more, I never saw either method get
more than about a 20% advantage over the other - and even that was only
after making the regular expression compiled. I wonder what the
difference is that's making your regex implementation so much faster?

One thing which makes a *huge* difference is if you make sure that
there's a match, and make it early in the string. At that stage,
String.IndexOf works very quickly compared with Regex. (Try appending
the search string to the StringBuilder before doing the rest of the
appends and you'll see what I mean.)
Wrong. RegExes use the Boyer-Moore algorithm, which does take a little
time and space for initialization, but is orders of magnitude faster in matching.

I haven't studied that for a while (and the web page I've found
describing it is too dense to take in during a lunch break :) but I
thought it only really helped if there were repeated characters within
the text being searched for?
Also, they can be compiled, in which case they would have to be compared to a
matching routine specially built for finding one string.

Yup (hence my comment in the previous post).
Yes, that's why it's an option. You're not supposed to use it if you got new
regular expressions all the time.

Indeed. However, of course, if you've got new regular expressions all
the time, you shouldn't be testing with a single fixed regular
expression which is precreated, which is what your test does. Depending
on what the OP's actual use will be, he should either specify
RegexOptions.Compiled, or move the Regex constructor call to the inside
of the loop when testing.
If the regex doesn't contain any capturing parentheses, it shouldn't
generate more than the Match object either (plus the initialization data
when the RegEx object is created).

Sure - but just that extra Match object could be relevant if the search
itself is fast (and generates a match, of course - otherwise
Match.Empty will be returned all the time).
I do too, if speed doesn't matter.
But again, that wasn't the OP's question.

I still would if speed was *generally* important but not the absolutely
dominating factor. Basically, I'd suggest that the OP uses IndexOf and
checks to see whether this is actually a significant part of the time
taken per request.
 

Jon Skeet [C# MVP]

Jon Skeet said:
I haven't studied that for a while (and the web page I've found
describing it is too dense to take in during a lunch break :) but I
thought it only really helped if there were repeated characters within
the text being searched for?

Just thinking about this, I'm talking rubbish. However, there's
something similar that String.IndexOf could do to improve things. If,
for instance, you're searching for "1234" and you find "123" and then
an 'x' you only need to start again from the 'x' rather than the '2'. I
would have thought *that* kind of optimisation would be fairly quick to
do at the start of a call to IndexOf - possibly with some sort of
heuristics to determine how far to go with the optimisation (e.g. only
checking the first five characters of the string to find patterns,
etc).
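That skip-ahead idea can be sketched for the special case where all the pattern's characters are distinct, as in "1234" (the class and method names are mine):

```csharp
using System;

static class SkipSearch
{
    // Valid only when every character of needle is distinct (as in
    // "1234"): after matching a prefix of length j and then failing,
    // no occurrence can start inside that prefix, so we can jump
    // straight past it instead of backing up by one position.
    public static int FindDistinct(string haystack, string needle)
    {
        int max = haystack.Length - needle.Length;
        int i = 0;
        while (i <= max)
        {
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j])
                j++;
            if (j == needle.Length)
                return i;
            // Jump past the matched prefix (at least one position).
            i += Math.Max(j, 1);
        }
        return -1;
    }
}
```

With repeated characters in the pattern this shortcut would skip valid matches, which is exactly the subtlety the heuristics mentioned above would have to handle.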
 

Niki Estner

Jon Skeet said:
..
Just thinking about this, I'm talking rubbish. However, there's
something similar that String.IndexOf could do to improve things. If,
for instance, you're searching for "1234" and you find "123" and then
an 'x' you only need to start again from the 'x' rather than the '2'. I
would have thought *that* kind of optimisation would be fairly quick to
do at the start of a call to IndexOf - possibly with some sort of
heuristics to determine how far to go with the optimisation (e.g. only
checking the first five characters of the string to find patterns,
etc).

This is the version of String.IndexOf in ROTOR (written in native code):

INT32 COMNlsInfo::FastIndexOfString(WCHAR *source, INT32 startIndex,
                                    INT32 endIndex, WCHAR *pattern,
                                    INT32 patternLength)
{
    int endPattern = endIndex - patternLength + 1;

    if (endPattern < 0) {
        return -1;
    }

    if (patternLength <= 0) {
        return startIndex;
    }

    WCHAR patternChar0 = pattern[0];
    for (int ctrSrc = startIndex; ctrSrc <= endPattern; ctrSrc++) {
        if (source[ctrSrc] != patternChar0)
            continue;
        int ctrPat;
        for (ctrPat = 1;
             (ctrPat < patternLength) && (source[ctrSrc + ctrPat] == pattern[ctrPat]);
             ctrPat++) {
            ;
        }
        if (ctrPat == patternLength) {
            return (ctrSrc);
        }
    }

    return (-1);
}

As you can see, there is no optimization in it.

Again, this was built for something like matching 5-character strings in
10-characters strings. Building Boyer-Moore-tables (even if it's only for
the first few characters) would blow it up too much. If you're matching big
strings, and matching performance does matter, use a RegEx.

Niki
 

Niki Estner

Jon Skeet said:
..
Interesting - entirely different from my results. I would suggest that
such short tests aren't a very good measure, however.

A 10000-digit string isn't really "short". And 5 seconds on a 3000-MHz box
isn't either...
Do you have any better tests?
I used a string
length of 10000 and a repeat count of 300000 to get reasonable results.
(That's using a random seed of 10, btw - using a fixed seed ensures
reasonably repeatable results.)

Really interesting. On my platform the RegEx is about 4 times faster using
these settings.
I'm using the 1.1 version of the framework, on an Intel P4 processor.
(Release build, of course).
Using only results of 5 seconds or more, I never saw either method get
more than about a 20% advantage over the other - and even that was only
after making the regular expression compiled. I wonder what the
difference is that's making your regex implementation so much faster?

Strange thing.
If I did get the same results as you, I would agree with your conclusion to
use String.IndexOf if in doubt. (But I don't...)
One thing which makes a *huge* difference is if you make sure that
there's a match, and make it early in the string. At that stage,
String.IndexOf works very quickly compared with Regex. (Try appending
the search string to the StringBuilder before doing the rest of the
appends and you'll see what I mean.)

:)
Yes, but usually "best-case performance" is of little interest...
I haven't studied that for a while (and the web page I've found
describing it is too dense to take in during a lunch break :) but I
thought it only really helped if there were repeated characters within
the text being searched for?

In the above example (matching for "123"), the algorithm would only look at
each 3rd character, test if it's a '1', a '2', or a '3'. If not, it skips
the 2 characters in between.
Of course the effect is bigger if the search string is longer.
Sure - but just that extra Match object could be relevant if the search
itself is fast (and generates a match, of course - otherwise
Match.Empty will be returned all the time).

Constants usually aren't considered (thinking in big-O-notation), but I
guess if you always look for a 3-digit string in a 4-digit string, this
might make a difference ;-)
I still would if speed was *generally* important but not the absolutely
dominating factor. Basically, I'd suggest that the OP uses IndexOf and
checks to see whether this is actually a significant part of the time
taken per request.

Knowing nothing about his problem, I don't think I would be a good advisor.
I merely tried to answer his question...

Niki
 

Jon Skeet [C# MVP]

Niki Estner said:
A 10000-digit string isn't really "short". And 5 seconds on a 3000-MHz box
isn't either...

5 seconds is reasonable - it was the early results which concerned me,
which were significantly sub-second. At that level, it's very easy to
get misleading results.
Do you have any better tests?

Running yours for longer :)
Really Interesting. On my platform the RegEx is about 4 times faster using
these settings.
I'm using the 1.1 version of the framework, on an Intel P4 processor.
(Release build, of course).

Hmm. I was compiling and running from the command line, but I did
disable debugging and enable optimisations...

1.1 as well, with a P4 as well (3.06GHz, not that that should matter).
I presume you don't have SP1 of 1.1 yet? (It's part of XP SP2, which I
don't have yet.)
Strange thing.
If I did get the same results as you, I would agree with your conclusion to
use String.IndexOf if in doubt. (But I don't...)

It's certainly very odd. Could you see if you could reproduce the
problem with a specific random seed, and quote the first (say) 20
characters of the random string to check the RNG is coming up with the
same strings on your computer as mine? I'm intrigued now.
:)
Yes, but usually "best-case performance" is of little interest...

I think it really depends on what the typical data is. "Worst-case"
performance isn't terribly interesting if it's not *too* bad and
doesn't come up very often.
In the above example (matching for "123"), the algorithm would only look at
each 3rd character, test if it's a '1', a '2', or a '3'. If not, it skips
the 2 characters in between.

Yup, that's pretty much how I remembered it.
Of course the effect is bigger if the search string is longer.

Yup - although it depends on what patterns are present, of course.
Constants usually aren't considered (thinking in big-O-notation), but I
guess if you always look for a 3-digit string in a 4-digit string, this
might make a difference ;-)

Well, given that the OP is looking in web request headers, I don't
think the strings are going to be that large, really.
Knowing nothing about his problem, I don't think I would be a good advisor.
I merely tried to answer his question...

I usually try to answer what I think someone's trying to do as well as
what they're asking. It often saves time in the long run... lots of
people ask performance questions about entirely the wrong pieces of
their application, or when they really don't need to worry about it at
all.
 

Niki Estner

Jon Skeet said:
...
It's certainly very odd. Could you see if you could reproduce the
problem with a specific random seed, and quote the first (say) 20
characters of the random string to check the RNG is coming up with the
same strings on your computer as mine? I'm intrigued now.

I've changed the relevant part this way:
Random rnd = new Random(10);
for (int i = 0; i < stringLength; i++)
    builder.Append((char)(rnd.Next() % 10 + '0'));
bigString = builder.ToString();
Console.WriteLine(bigString.Substring(0, 20));

Output is: 13771437564146648143.

And, I wouldn't call it a "problem" ;-)
...
I think it really depends on what the typical data is. "Worst-case"
performance isn't terribly interesting if it's not *too* bad and
doesn't come up very often.

We don't know the OP's typical data, so random data should give at least a
clue. And I'm using average-case performance.

...
I usually try to answer what I think someone's trying to do as well as
what they're asking. It often saves time in the long run... lots of
people ask performance questions about entirely the wrong pieces of
their application, or when they really don't need to worry about it at
all.

Yes, but in this case you didn't answer the OP's question, you instead
corrected mine, and I think that's inappropriate unless my answer is wrong.

I don't think my original reply was wrong: you might want to read it again -
I don't think you would have given that guy different advice (at least
considering the performance data I have).

Niki
 

Jon Skeet [C# MVP]

Niki Estner said:
I've changed the relevant part this way:
Random rnd = new Random(10);
for (int i = 0; i < stringLength; i++)
    builder.Append((char)(rnd.Next() % 10 + '0'));
bigString = builder.ToString();
Console.WriteLine(bigString.Substring(0, 20));

Output is: 13771437564146648143.

Well we're at least dealing with the same string then :)

That gives me roughly the same results for String.IndexOf and using a
regular expression.
And, I wouldn't call it a "problem" ;-)

We don't know the OP's typical data, so random data should give at least a
clue. And I'm using average-case performance.

Well, you're using "whatever case we happen to get" performance -
without knowing anything about the real world data, we don't know what
"average-case" performance is going to be at all. For instance, where
did you get "1234" from? The results change very dramatically if you
change it to "123" for instance, presumably because "123" is found
fairly early in the string, which is where String.IndexOf does better.

(Searching for "123" gives results of 2.89s for String.IndexOf, and
4.52s for regular expressions, for a million iterations (one test
only.))
Yes, but in this case you didn't answer the OP's question, you instead
corrected mine, and I think that's inappropriate unless my answer is wrong.

I'm not sure that "corrected" is the right word for what I did -
certainly "commented on".
I don't think my original reply was wrong: you might want to read it again -
I don't think you would have given that guy different advice (at least
considering the performance data I have).

If I'd had the same results you'd had, I would probably have
recommended REs, yes - but definitely with a caveat that it's worth
checking that this really is the performance bottleneck, as IndexOf is
more readable and in most cases has "good enough" performance. When
advocating any "more complex" code for performance reasons, I tend to
put such a caveat in. (I consider regular expressions more complex as
you need to be careful about the search string - if it contains any
characters that are "special" for regexes, you may need to escape them,
for instance.)
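For the escaping point, Regex.Escape covers it; here is a small sketch with an invented search word containing metacharacters:

```csharp
using System;
using System.Text.RegularExpressions;

class EscapeDemo
{
    static void Main()
    {
        // The search word contains '.' and '?', which are regex
        // metacharacters; Regex.Escape turns them into literals.
        string word = "1.2.3?";
        Regex literal = new Regex(Regex.Escape(word));

        Console.WriteLine(literal.IsMatch("found 1.2.3? here"));   // True
        Console.WriteLine(literal.IsMatch("found 1x2y3z here"));   // False
    }
}
```

Without the escape, the unescaped pattern "1.2.3?" would also match strings like "1x2y3", since '.' matches any character and '?' makes the '3' optional.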

Having looked at my reply to you, btw, I noticed I'd used the phrase
"hard-coded", and so I thought I'd apply that to our particular test. I
assume this is pretty much what Boyer-Moore does:

static int Find1234 (string haystack)
{
    int index = 0;
    int max = haystack.Length - 3;
    while (index < max)
    {
        if (haystack[index] == '1')
        {
            if (haystack[index + 1] == '2')
            {
                if (haystack[index + 2] == '3')
                {
                    if (haystack[index + 3] == '4')
                    {
                        return index;
                    }
                    else
                        index += 3;
                }
                else
                    index += 2;
            }
            else
                index++;
        }
        else
            index++;
    }
    return -1;
}

(I think that's correct, isn't it?)

I haven't tried to optimise it any further, but it already just about
out-performs the other ways in a few test cases. (That's not using
compiled reg-exes, admittedly - just your original test code.)
 

Niki Estner

Jon Skeet said:
...
Well, you're using "whatever case we happen to get" performance -


Performing thousands of tests over random data, adding up the results:
That's average performance in my understanding.
You may have read somewhere that quicksort has O(n*log n) performance on
average, O(n²) in the worst case. Average case = sort random data; worst
case = sort data that is already sorted.

without knowing anything about the real world data, we don't know what
"average-case" performance is going to be at all.

???
You think you can't say e.g. whether bubblesort on the average performs
better/equally/worse than quicksort without knowing what data will be
sorted?
They do teach that topic differently where I come from...
For instance, where
did you get "1234" from? The results change very dramatically if you
change it to "123" for instance, presumably because "123" is found
fairly early in the string, which is where String.IndexOf does better.
(Searching for "123" gives results of 2.89s for String.IndexOf, and
4.52s for regular expressions, for a million iterations (one test
only.))

Again, on my PC, Regex is 5 times faster.
BTW: I've tested that on my laptop, too. Same results.
...
I'm not sure that "corrected" is the right word for what I did -
certainly "commented on".

Then I must have misunderstood the sentence "I usually find the exact
opposite". To me that sounded a lot like a correction...
...
If I'd had the same results you'd had, I would probably have
recommended REs, yes - but definitely with a caveat that it's worth
checking that this really is the performance bottleneck, as IndexOf is
more readable and in most cases has "good enough" performance. When
advocating any "more complex" code for performance reasons, I tend to
put such a caveat in. (I consider regular expressions more complex as
you need to be careful about the search string - if it contains any
characters that are "special" for regexes, you may need to escape them,
for instance.)

That *was* my advice. I *did* tell him to find out what's the bottleneck of
his app, and do a few performance tests if string matching turns out to be
that bottleneck.
Having looked at my reply to you, btw, I noticed I'd used the phrase
"hard-coded", and so I thought I'd apply that to our particular test. I
assume this is pretty much what Boyer-Moore does:
...

(I think that's correct, isn't it?)

I think it should read:

static int Find1234 (string haystack)
{
    int index = 3;
    int max = haystack.Length - 3;
    while (index < max)
    {
        switch (haystack[index])
        {
            case '1':
                if ((haystack[index + 1] == '2') &&
                    (haystack[index + 2] == '3') &&
                    (haystack[index + 3] == '4'))
                    return index;
                break;
            case '2':
                if ((haystack[index - 1] == '1') &&
                    (haystack[index + 2] == '3') &&
                    (haystack[index + 3] == '4'))
                    return index;
                break;
            case '3':
                if ((haystack[index - 2] == '1') &&
                    (haystack[index - 1] == '2') &&
                    (haystack[index + 3] == '4'))
                    return index;
                break;
            case '4':
                if ((haystack[index - 3] == '1') &&
                    (haystack[index - 2] == '2') &&
                    (haystack[index - 1] == '3'))
                    return index;
                break;
        }
        index += 4;
    }
    return -1;
}

On my PC, that's about 20-30% faster than a compiled Regex.
But I guess if the string was variable, and the function had to use
table-lookups instead of constants, it would probably be slower.
I haven't tried to optimise it any further, but it already just about
out-performs the other ways in a few test cases. (That's not using
compiled reg-exes, admittedly - just your original test code.)

Yep, that was why I suggested that in my original reply.

Niki
 

Jon Skeet [C# MVP]

Niki Estner said:
Performing thousands of tests over random data, adding up the results:
That's average performance in my understanding.

You would need to do so over thousands of random strings rather than
just a few in order to get a meaningful result. Similarly, it's only
average against the parts of the test which are fixed - all the tests
are using a search string of "1234" against an alphabet of 0-9.

The important thing here is that we don't know what the average data
will be for the OP - he will have best cases and worst cases too, and
average cases - and his average case is unlikely to be the same as the
random average case.
You may have read somewhere that quicksort has O(n*log n) performance on
average, O(n²) in the worst case. Average case = sort random data; worst
case = sort data that is already sorted.


???
You think you can't say e.g. whether bubblesort on the average performs
better/equally/worse than quicksort without knowing what data will be
sorted?
They do teach that topic differently where I come from...

There's a difference between average over the real data and average
over theoretical data though. If you know that your real data will vary
but in almost all cases the search string occurs in the first few
positions, that gives a different average case from the theoretical
one, IMO.

The theoretical average is useless to the OP really - only the real
average is going to be of any use. Something which performs badly on
the theoretical average but well on the real data (on average) is still
more use to the OP than the reverse.

That's not to say that theoretical averages aren't interesting, just
not particularly useful in this case, IMO. (Especially theoretical
averages which still rely on other parts being fixed, such as the
alphabet and the length of search string.)
Again, on my PC, Regex is 5 times faster.
BTW: I've tested that on my laptop, too. Same results.

This really is most bizarre. Any suggestions as to how we could work
out what's going on a bit better? I'm really interested by this part.
Then I must have misunderstood the sentence "I usually find the exact
opposite". To me that sounded a lot like a correction...

I really don't think so - it's passing on what I've experienced, which
is apparently different to what you've experienced. I don't think it's
worth getting hung up about though.
That *was* my advice. I *did* tell him to find out what's the bottleneck of
his app, and do a few performance tests if string matching turns out to be
that bottleneck.

Yup, looking back, you're entirely right.
I think it should read:

<snip>

Ah, right. Mine was doing something completely different (and less
efficient in almost every case, I believe, although it still beat
IndexOf :)
On my PC, that's about 20-30% faster than a compiled Regex.
But I guess if the string was variable, and the function had to use
table-lookups instead of constants, it would probably be slower.

I don't doubt it.
Yep, that was why I suggested that in my original reply.

Right.
 
N

Niki Estner

Jon Skeet said:
There's a difference between average over the real data and average
over theoretical data though. If you know that your real data will vary
but in almost all cases the search string occurs in the first few
positions, that gives a different average case from the theoretical
one, IMO.

Yes, that's why I recommended making performance tests with 3 different
methods against real data, to decide which method is best for that data.
What's your point?
(Especially theoretical
averages which still rely on other parts being fixed, such as the
alphabet and the length of search string.)

No one says you're not allowed to change the search string the same way you
changed the length of the big string, or the repeat count.
I only fixed the alphabet to make debugging a little easier, but I do get
the same results with fully random search- and match-strings.
...
This really is most bizarre. Any suggestions as to how we could work
out what's going on a bit better? I'm really interested by this part.

Me too.
We are using the same source; maybe you could post your release-built
executable, to make sure we use the same executable, too?

This is my result for the 3 searching methods we had:
stringLength = 10000; repeatCount = 10000;
Testing String.IndexOf: 00:00:03.6093750
Testing Find1234: 00:00:00.2968750
Testing Regex.Match: 00:00:00.4218750

So, Regex and "Find1234" are in the same league, while String.IndexOf is
about 5-10 times slower. Could you do the same test on your PC to find out
whether "your" String.IndexOf is faster, or Regex.Match is slower on your
PC?
...
I really don't think so - it's passing on what I've experienced, which
is apparently different to what you've experienced. I don't think it's
worth getting hung up about though.

Ok.

Niki
 

Fred

I've always found regular expressions slower than an IndexOf, so I was
surprised by your claim. I tried to run your example, wondering what
I could have been doing wrong in the past. However, your example
consistently showed the RegEx taking ~70% longer to execute.
 

Jon Skeet [C# MVP]

Niki Estner said:
Yes, that's why I recommended making performance tests with 3 different
methods against real data, to decide which method is best for that data.
What's your point?

Only that the tests, while interesting, are very definitely of limited
value. More interesting to us than to the OP, I suspect.
No one says you're not allowed to change the search string the same way you
changed the length of the big string, or the repeat count.
I only fixed the alphabet to make debugging a little easier, but I do get
the same results with fully random search- and match-strings.

It's more the length of the search string and size of the search
string, just in terms of the probable first match - it looks to me like
if the match is fairly early, String.IndexOf works faster than a
regular expression, but if the match is late or there's no match at
all, regular expressions work faster. (All on my machine, of course.)
Now, as the lengths of both the needle and haystack strings are virtually
boundless, any "average" over them becomes fairly meaningless, IMO.
Me too.
We are using the same source; maybe you could post your release-built
executable, to make sure we use the same executable, too?

Good plan - but I suggest we take it offline to avoid annoying people
with attachments. We can try to work out what's going on and then post
the results here.
This is my result for the 3 searching methods we had:
stringLength = 10000; repeatCount = 10000;
Testing String.IndexOf: 00:00:03.6093750
Testing Find1234: 00:00:00.2968750
Testing Regex.Match: 00:00:00.4218750

So, Regex and "Find1234" are in the same league, while String.IndexOf is
about 5-10 times slower. Could you do the same test on your PC to find out
whether "your" String.IndexOf is faster, or Regex.Match is slower on your
PC?

Sure. I've split up Regex.Match into compiled and not compiled. By the
way, is there a reason for the "int x=0;" statements in your original
post? I've left them in the code I'm testing, but I suspect the
compiler is effectively removing them anyway.

I've used your (much better :) version of Find1234, with the slight
change that the return values are (reading down the code) index,
index-1, index-2, index-3, and -1. I doubt that a single subtraction
per iteration will affect things, but it's best to be sure...
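[Editor's note: the Find1234 code itself isn't quoted in this part of the thread, so the following is a hypothetical reconstruction of a hand-coded scan for the fixed pattern "1234", with the kind of skip logic that the return values mentioned above (index, index-1, index-2, index-3, and -1) suggest:]

```csharp
using System;

static class Finder
{
    // Hypothetical reconstruction; the thread's actual Find1234 is not
    // quoted in this excerpt. It inspects every 4th character and skips
    // ahead based on what it finds, in the spirit of Boyer-Moore's
    // bad-character rule.
    public static int Find1234(string s)
    {
        for (int i = 3; i < s.Length; )
        {
            switch (s[i])
            {
                case '4':
                    if (s[i - 1] == '3' && s[i - 2] == '2' && s[i - 3] == '1')
                        return i - 3; // full match ends at i
                    i += 4;           // no match can end in the next 3 positions
                    break;
                case '3': i += 1; break; // a match could end at i + 1
                case '2': i += 2; break; // ... at i + 2
                case '1': i += 3; break; // ... at i + 3
                default:  i += 4; break; // this char rules out the next 4 end positions
            }
        }
        return -1;
    }

    static void Main()
    {
        Console.WriteLine(Finder.Find1234("xx1234yy"));       // 2
        Console.WriteLine(Finder.Find1234("124123 1234"));    // 7
        Console.WriteLine(Finder.Find1234("no digits here")); // -1
    }
}
```

Because the pattern is fixed and known at compile time, the scanner can examine only every fourth character in the common case, which is why a hand-coded search like this can beat both String.IndexOf and a regex on long non-matching inputs.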

My results are:
Testing String.IndexOf:
00:00:00.4375000
Testing Find1234:
00:00:00.2031250
Testing Regex.Match (compiled):
00:00:00.2812500
Testing Regex.Match (not compiled):
00:00:00.3437500

That suggests that String.IndexOf is being very slow on your box,
rather than regular expressions being slow on mine, if you see what I
mean. (Is your box roughly 2GHz?)
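[Editor's note: neither poster's harness is quoted in this excerpt; the following is a minimal sketch of a comparable micro-benchmark in modern C# (using Stopwatch rather than the DateTime timing common in 2004). The pattern "1234", string length, and repeat count mirror the figures quoted above; the random digit alphabet is an assumption:]

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

class SearchBenchmark
{
    // Builds a random digit string; seed is fixed so runs are repeatable.
    public static string MakeHaystack(int length, int seed)
    {
        var rnd = new Random(seed);
        var sb = new StringBuilder(length);
        for (int i = 0; i < length; i++)
            sb.Append((char)('0' + rnd.Next(10)));
        return sb.ToString();
    }

    static void Main()
    {
        int stringLength = 10000, repeatCount = 10000;
        string haystack = MakeHaystack(stringLength, 42);

        Time("String.IndexOf",
             () => haystack.IndexOf("1234", StringComparison.Ordinal));

        var interpreted = new Regex("1234");
        Time("Regex.Match (not compiled)", () => interpreted.Match(haystack));

        var compiled = new Regex("1234", RegexOptions.Compiled);
        Time("Regex.Match (compiled)", () => compiled.Match(haystack));

        void Time(string name, Action body)
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < repeatCount; i++)
                body();
            sw.Stop();
            Console.WriteLine($"Testing {name}: {sw.Elapsed}");
        }
    }
}
```

As the thread's diverging numbers show, results from a harness like this depend heavily on the machine, the runtime version, and where (or whether) the first match occurs, so it is worth running against your own real data.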

Goodo. For what it's worth though, I *am* sorry if I came across as
more "corrective" than I intended to - both in the first post and
throughout this thread. I definitely focused too much on the first
paragraph of your post.
 

Niki Estner

Jon Skeet said:
...

Only that the tests, while interesting, are very definitely of limited
value. More interesting to us than to the OP, I suspect.

What? Testing against various algorithms with real data is of little value?
You need to explain that.
...
It's more the length of the search string and the size of the string being
searched, just in terms of the probable first match - it looks to me like
if the match is fairly early, String.IndexOf works faster than a
regular expression, but if the match is late or there's no match at
all, regular expressions work faster. (All on my machine, of course.)
Now, as the length of both needle and haystack strings are virtually
boundless, any "average" over them becomes fairly meaningless, IMO.

That's why big-O notation is commonly used. When I said "faster", I was
referring to that; I think that has been a source of misunderstanding.
And no, big-O notation is not meaningless. It's the best way to compare
algorithms.
Good plan - but I suggest we take it offline to avoid annoying people
with attachments. We can try to work out what's going on and then post
the results here.

Sure.
n dot estner at gmx dot de.
Sure. I've split up Regex.Match into compiled and not compiled. By the
way, is there a reason for the "int x=0;" statements in your original
post? I've left them in the code I'm testing, but I suspect the
compiler is effectively removing them anyway.

I forgot to delete those. I put them in there for debugging: the only way to
inspect the return values of String.IndexOf/Regex.Match.
I've used your (much better :) version of Find1234, with the slight
change that the return values are (reading down the code) index,
index-1, index-2, index-3, and -1. I doubt that a single subtraction
per iteration will affect things, but it's best to be sure...

I knew I missed something.
Another good argument for the regex...
My results are:
Testing String.IndexOf:
00:00:00.4375000
Testing Find1234:
00:00:00.2031250
Testing Regex.Match (compiled):
00:00:00.2812500
Testing Regex.Match (not compiled):
00:00:00.3437500

Really strange.
A guy in another thread says on his machine regexes take 70% longer than
String.IndexOf.
That suggests that String.IndexOf is being very slow on your box,
rather than regular expressions being slow on mine, if you see what I
mean. (Is your box roughly 2GHz?)

Yes (2.6 GHz, actually)

Niki
 
