Fastest way to search a string for the occurance of a word??

  • Thread starter msnews.microsoft.com
  • Start date
J

Jon Skeet [C# MVP]

Niki Estner said:
What? Testing against various algorithms with real data is of little value?
You need to explain that.

Sorry, I wasn't clear - testing with *random* data is of little value
to the OP - certainly compared with testing with real data. (I know you
have always agreed that the OP should test with real data.)
That's why big-O-notation is commonly used. Using the term "faster" I
refered to that. I think that has been a source of misunderstanding.
And, no big-O-notation is not meaningless. It's the best way to compare
algorithms.

It's a good way of comparing algorithms if you want to know how they'll
perform as the size of input (or whatever) increases. It's not
particularly useful if the size of input is known to only be likely to
go up to a certain level, which could well be smaller than the level at
which big-O notation becomes a good predictor of which algorithm is
faster.
Sure.
n dot estner at gmx dot de.

Whoops - I'd already sent it to niki dot estner at cube dot net - I'll
resend.
I forgot to delete those. I put them in there for debugging: the only
way to inpect the return values of String.IndexOf/Regex.Match.

Goodo - just checking I wasn't missing something :)
I knew I missed something.
Another good argument for the regex...

Or IndexOf :) (But yes, point very much taken.)
Really strange.
A guy in another thread says on his machine regexes take 70% longer than
String.IndexOf.

Odd indeed.
Yes (and 2.6)

Hmm. I suppose there are all kinds of other factors - especially
cache/memory speed - which have a big effect on this kind of test, but
I'm still baffled as to why IndexOf is *so* much slower.

Ah - I've just had a thought: which culture are you using?
String.IndexOf is culture-sensitive - it could be that your culture has
a particularly slow IndexOf.

I'm sure this is it - if I use:
CompareInfo info = CultureInfo.InvariantCulture.CompareInfo;

and then call info.IndexOf(bigString, searchString);

it takes nearly two and a half seconds for me, too.

If I then change it to:

info.IndexOf(bigString, searchString, CompareOptions.Ordinal)

then it goes back down. If I use the current culture rather than the
invariant culture, it's fast too.

What happens if you do the above on your box?
 
N

Niki Estner

Jon Skeet said:
...
Ah - I've just had a thought: which culture are you using?
String.IndexOf is culture-sensitive - it could be that your culture has
a particularly slow IndexOf.

Ahhhh, *that*'s it. Now I do get readings similar to yours: Regex is only
faster if the match is late in the string and if the searched string is
long.

You seem to live in a far simpler culture than I do ;-)

Seriously: I wonder what's the difference? I mean, String.IndexOf should be
case-insensitive. Why is a German case-insensitive-equality-test so
expensive?

Niki
 
J

Jon Skeet [C# MVP]

Niki Estner said:
Ahhhh, *that*'s it. Now I do get readings similar to yours: Regex is only
faster if the match is late in the string and if the searched string is
long.

You seem to live in a far simpler culture than I do ;-)

Seriously: I wonder what's the difference? I mean, String.IndexOf should be
case-insensitive. Why is a German case-insensitive-equality-test so
expensive?

Combining characters, possibly? For instance, o followed (or preceded,
I can never remember) by an umlaut combining character will probably
match an o-umlaut in German, but not in English. At a guess, anyway...

Arguably specifying CompareOptions.Ordinal is a good idea for many
applications. It's a shame you can't specify it on String.IndexOf :(
 
N

Niki Estner

Jon Skeet said:
...

Combining characters, possibly? For instance, o followed (or preceded,
I can never remember) by an umlaut combining character will probably
match an o-umlaut in German, but not in English. At a guess, anyway...

You mean "ö" = "oe". That's true, but IndexOf doesn't find that, I just
tried. But probably it is some character translation like that.

Know what's funny: according to the docs, Regex.Match should be
culture-sensitive, too. So, it seems to be the fastest "German" matching
method.
Arguably specifying CompareOptions.Ordinal is a good idea for many
applications. It's a shame you can't specify it on String.IndexOf :(

So, in the end, I agree with you that using String.IndexOf is fine for
clarity, but it's not to be used in a performance-critical section of an
application, unless you know for sure where you application will be used or
change the CultureInfo manually.

Niki
 
J

Jon Skeet [C# MVP]

Niki Estner said:
You mean "ö" = "oe".

No, not quite. There's an actual umlaut combining character (not sure
where exactly) rather than just "e".
That's true, but IndexOf doesn't find that, I just
tried. But probably it is some character translation like that.

Know what's funny: according to the docs, Regex.Match should be
culture-sensitive, too. So, it seems to be the fastest "German" matching
method.

Interesting. If we can work out how to make IndexOf match non-
obviously, we can try the same thing for regular expressions.
So, in the end, I agree with you that using String.IndexOf is fine for
clarity, but it's not to be used in a performance-critical section of an
application, unless you know for sure where you application will be used or
change the CultureInfo manually.

Yup - although I'd be tempted to use CompareInfo directly, specifying
CompareOptions.Ordinal. Then it's lost its clarity though :(
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top