High speed string processing

Alexander Muylaert · May 23, 2004

Hi

Does anyone know a good starting point about high speed string processing in
C#?

What I need is a very fast routine for a case insensitive "contains". e ==
E == é == ë == ...

Kind regards

Alexander

Mohamoss · May 23, 2004

hi Alexander
you need to be using the StringBuilder class , you can read more about it
on these links
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/
frlrfsystemtextstringbuilderclasstopic.asp
http://www.csharpfriends.com/Articles/getArticle.aspx?articleID=13
Mohamed Mahfouz
MEA Developer Support Center
ITworx on behalf of Microsoft EMEA GTSC

Vijaye Raji · May 23, 2004

Simplest is to use already available methods on string.

1. Do a ToLower
2. Replace all occurences of "E", "é", "ë" with "e" in both the base and
matching string.
3. Use the Contains method on the base string.

Or write yourself one. Convert the string to a character array and then
walk thru the characters one by one.

HTH

-vJ

Chad Z. Hower aka Kudzu · May 23, 2004

Vijaye Raji said:
1. Do a ToLower
2. Replace all occurences of "E", "?", "?" with "e" in both the base and
matching string.
3. Use the Contains method on the base string.

This causes string copies as string are immutable. Unless you are doing
single operations, its anything but "high speed" as he requested. And Im not
even sure StringBuilder will match his needs - it depends on what he means by
high speed.

For our needs we have a lot of ASCII text that we have to manage (its
sockets, thus its always ascii text, never will be other) so we have written
our own class. It's fully managed and quite fast. And if its not fast enough
we can compile it unmanaged (its Delphi, we can compile either way) and
export a managed interface.

--
Chad Z. Hower (a.k.a. Kudzu) - http://www.hower.org/Kudzu/
"Programming is an art form that fights back"

Empower ASP.NET with IntraWeb
http://www.atozed.com/IntraWeb/

Vijaye Raji · May 24, 2004

You're right. Using "string" for these would be horrible.

I was suggesting the algorithm to use, should have been more careful about
the use of the word "string"

-vJ

Alexander Muylaert · May 24, 2004

Hi

I'm a Delphi programmer myself and I would need the function in a very tight
loop. Called often while typing. It is a search function on +- 5000 -
500000 strings while typing. And it is not the first char, if it is inside
the string, we have a hit, if it is not inside the string, we don't have a
hit.

The Q_strings library from delphi is really great for me. This also doesn't
translate the string first. It works with a mapping table. MapTable[Char]
of char = ...

This is very fast. I need the same or better speed.

Kind regards

Alexander

Greg Miller · May 27, 2004

Does anyone know a good starting point about high speed string processing in
C#?

What I need is a very fast routine for a case insensitive "contains". e ==
E == é == ë == ...

There are some algorithms in Cormen's (et. al) "Algorithms",
and in Knuth's "The Art of Computer Programming: Searching and
Sorting". You can also find the same algorithms in books on compiler
design. But these deal specifically with matching the string, and not
the replacement and case insensitivity.
I'd recommend using a separate buffer for what the user is
typing in. Every time a character is pressed, convert it to lower
case, and change your funky characters to the regular ones (é to e).
Also I wouldn't try to find each string in what the user typed, rather
I'd try to find what the user typed in the list of strings starting at
different offsets. Having the list of strings sorted, and doing a
binary search will make that a manageable task.
Reverse all of the strings in your list, and search backwards
through what the user typed. E.G. when the user types 'r' you only
have to search through your list for strings that end with 'r'. This
way you never have to back up.
That should be fast enough, but if it's not, sort the list by
length first, then alphabetically. Keep a list of indexes into the
list as to where the first word of each length is, then in your binary
search only search the strings which match the length you're looking
for.
500000 is a lot of strings. But, in a binary search, at most
you have to look at 19 strings to find a match, or determine there is
no match. Which is very manageable for processing while a user is
typing.
If you want to go one step further, use a separate thread for
matching the string. People tend to pause and burst when typing.

High speed string processing

Alexander Muylaert

Mohamoss

Vijaye Raji

Chad Z. Hower aka Kudzu

Vijaye Raji

Alexander Muylaert

Greg Miller