Comparing strings and characters

PIEBALD · Aug 29, 2008

Not really a C#-specific comment, more general .NET observations.

1) A while back I found the need to determine whether or not a particular
StringComparer was case-insensitive. The best way I found was to use
Reflection to access the private _ignoreCase field, which I'd rather not do.
It seems to me that a public read-only property would be helpful. (Better yet,
it should probably expose a CompareOptions value.)

2) The CompareInfo.Compare method only works on strings (or substrings), not
individual characters. It seems that it must compare one character at a time,
so why not expose a method to allow that?

3) HashSet<char> has a constructor that can be passed an
IEqualityComparer<char>, but there's none built-in. I'm not complaining,
really, but it seems to me that if you're going to write a StringComparer it
should be based on a CharComparer.

So I spent this evening writing my own CharComparer, trying to stay close to
the pattern of StringComparer, but I added public readonly fields to hold the
CompareInfo and CompareOptions.
The main downside is that it has to wrap the characters in strings to use
CompareInfo.Compare, which seems needlessly wasteful.

Or am I missing something?
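
For readers following along, here is a minimal sketch of what such a CharComparer might look like. It is not the poster's actual code: the class shape is a guess based on the description above, and it assumes a runtime that provides CompareInfo.GetHashCode(string, CompareOptions) (.NET Core 2.0 or later, .NET Framework 4.7.1 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical CharComparer, loosely following the StringComparer pattern.
// This is an illustration, not a framework type.
public sealed class CharComparer : IComparer<char>, IEqualityComparer<char>
{
    // Public read-only fields, as described in the post.
    public readonly CompareInfo CompareInfo;
    public readonly CompareOptions CompareOptions;

    public CharComparer(CompareInfo compareInfo, CompareOptions compareOptions)
    {
        CompareInfo = compareInfo ?? throw new ArgumentNullException(nameof(compareInfo));
        CompareOptions = compareOptions;
    }

    public int Compare(char x, char y)
    {
        // CompareInfo.Compare only accepts strings, so each char is wrapped
        // in a one-character string. This is the allocation the post
        // describes as wasteful.
        return CompareInfo.Compare(x.ToString(), y.ToString(), CompareOptions);
    }

    public bool Equals(char x, char y)
    {
        return Compare(x, y) == 0;
    }

    public int GetHashCode(char c)
    {
        // Uses the same CompareInfo and options as Equals so that chars
        // which compare equal also hash equal.
        return CompareInfo.GetHashCode(c.ToString(), CompareOptions);
    }
}
```

With the invariant culture and CompareOptions.IgnoreCase, a HashSet<char> built on this comparer treats 'a' and 'A' as the same element.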

Pavel Minaev · Aug 29, 2008

> Not really a C#-specific comment, more general .NET observations.
>
> 1) A while back I found the need to determine whether or not a particular
> StringComparer was case-insensitive. The best way I found was to use
> Reflection to access the private _ignoreCase field, which I'd rather not do.
> It seems to me that a public read-only property would be helpful. (Better yet,
> it should probably expose a CompareOptions value.)

It is most likely a design issue in your own code. You should never
need to tell what a given StringComparer instance does, but rather
just use its results. For one thing, I can write my own StringComparer-
derived class that is case-insensitive, but implemented differently
from the stock one. Or I can write a StringComparer that ignores case
on the first letter, but not for the rest of them. And then your code
will most likely break - and it shouldn't.

> 2) The CompareInfo.Compare method only works on strings (or substrings), not
> individual characters. It seems that it must compare one character at a time,
> why not expose a method to allow that?

CompareInfo does culture-sensitive comparison of Unicode characters. A
single Unicode character is not necessarily a single char value - keep
in mind that .NET strings are UTF-16, which means that some Unicode
characters (code points) are represented by two code units (i.e., two
chars). Then there are also numerous ways to represent a single
character, such as combining symbols. And finally, not every char
value is, by itself, a valid Unicode character (it can be one half of
a surrogate pair). Therefore, proper culture-sensitive comparison on
individual chars is simply not possible. Code-unit comparison is, but
that's precisely what operator== does on char anyway.
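
The gap between a char and a character can be seen by counting the same string three ways. The sketch below is only an illustration; the Utf16Counts class and its method names are made up for this example, while the framework calls (char.IsSurrogatePair, StringInfo.LengthInTextElements) are standard.

```csharp
using System;
using System.Globalization;

// Hypothetical helper that counts a string in three different units.
public static class Utf16Counts
{
    // Number of UTF-16 code units, i.e. char values.
    public static int CodeUnits(string s)
    {
        return s.Length;
    }

    // Number of Unicode code points: a surrogate pair counts once.
    public static int CodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
            {
                i++; // skip the low surrogate of the pair
            }
            count++;
        }
        return count;
    }

    // Number of text elements: a base letter plus its combining marks
    // counts once.
    public static int TextElements(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }
}
```

For "\U0001D11E" (musical G clef, outside the Basic Multilingual Plane) this gives 2 code units, 1 code point and 1 text element. For "e\u0301" (e followed by a combining acute accent) it gives 2 code units, 2 code points and 1 text element. In neither case does a single char hold the whole character.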

> 3) HashSet<char> has a constructor that can be passed an
> IEqualityComparer<char>, but there's none built-in. I'm not complaining,
> really, but it seems to me that if you're going to write a StringComparer it
> should be based on a CharComparer.

See above, and the reasons are all the same. Also, for the same
reasons, StringComparer does not just compare individual chars
separately in a loop and combine the results - it's much more
complicated than that.

If you need a set of Unicode characters, and you want to have proper
internationalization, then you will have to use HashSet<string>.
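
One way to build such a set is to split the input into text elements and store each one as a short string. This is a sketch under assumptions: TextElementSet and FromText are hypothetical names, and the caller chooses the comparer.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper: a "set of characters" stored as a HashSet<string>,
// one entry per text element rather than per char.
public static class TextElementSet
{
    public static HashSet<string> FromText(string text, IEqualityComparer<string> comparer)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));

        var set = new HashSet<string>(comparer);

        // The enumerator keeps surrogate pairs and base-plus-combining-mark
        // sequences together as single elements.
        TextElementEnumerator elements = StringInfo.GetTextElementEnumerator(text);
        while (elements.MoveNext())
        {
            set.Add(elements.GetTextElement());
        }
        return set;
    }
}
```

For example, TextElementSet.FromText("aAb", StringComparer.OrdinalIgnoreCase) holds two entries, and a culture-aware comparer such as StringComparer.InvariantCultureIgnoreCase can be passed in the same way.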

Mihai N. · Aug 29, 2008

> CompareInfo does culture-sensitive comparison of Unicode characters. A
> single Unicode character is not necessarily a single char value - keep
> in mind that .NET strings are UTF-16, which means that some Unicode
> characters (code points) are represented by two code units (i.e., two
> chars). Then there are also numerous ways to represent a single
> character, such as combining symbols. And finally, not every char
> value is, by itself, a valid Unicode character (it can be one half of
> a surrogate pair). Therefore, proper culture-sensitive comparison on
> individual chars is simply not possible. Code-unit comparison is, but
> that's precisely what operator== does on char anyway.

Right on the mark!

Case conversion is a locale-sensitive operation, and so is
string comparison.

Comparing stand-alone characters is meaningless even without
implementation details (like surrogates, or combining characters).

For instance, in Spanish "ch" sorts like a single character between c and d.
In German "oe" is equivalent to o umlaut. And so on.
Also in German, the upper case of sharp s (one character) is "SS".

If you do anything on one character, you are guaranteed to be wrong.

PIEBALD · Aug 29, 2008

> wouldn't it be easier to have it compare two strings that differ only
> by case and see what the result is?

I use that as a fall-back position, but apparently even that isn't a sure
thing with other cultures and alphabets.
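
A minimal sketch of that fall-back, for reference. ComparerProbe and LooksCaseInsensitive are hypothetical names, and the result is only a heuristic: it says how the comparer treats "a" and "A", nothing more.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical probe: asks the comparer about one pair of strings that
// differ only by case, instead of reading a private field via Reflection.
public static class ComparerProbe
{
    public static bool LooksCaseInsensitive(IEqualityComparer<string> comparer)
    {
        if (comparer == null) throw new ArgumentNullException(nameof(comparer));

        // Heuristic only: a custom comparer may ignore case for some
        // letters or positions and not for others.
        return comparer.Equals("a", "A");
    }
}
```

ComparerProbe.LooksCaseInsensitive(StringComparer.OrdinalIgnoreCase) returns true and the same call with StringComparer.Ordinal returns false, but a custom or culture-specific comparer can answer differently for other letters.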

PIEBALD · Aug 29, 2008

Ah, thanks. So perhaps only the Ordinal and OrdinalIgnoreCase should be used
for individual characters.

> You should never
> need to tell what a given StringComparer instance does, but rather
> just use its results.

That's fine as far as the StringComparer goes, but I was also using
Enum.Parse, which can take an ignoreCase boolean; it seems best to pass the
same value the StringComparer is using.

There could also be an application that uses a StringComparer to search
based on user input, the developer may want to include an indication of
whether or not the search is case-sensitive (even when the user has the
ability to specify the comparer).
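
One possible way around inspecting the comparer at all is to carry the flag alongside it from the point where the choice is made. The sketch below is an assumption about how that could look, not something proposed in the thread; MatchingRules and ParseEnum are hypothetical names, and whoever constructs an instance is responsible for keeping the comparer and the flag consistent.

```csharp
using System;

// Hypothetical pairing of a StringComparer with the case-sensitivity flag
// it was chosen for, so neither Reflection nor probing is needed later.
public sealed class MatchingRules
{
    public static readonly MatchingRules CaseSensitive =
        new MatchingRules(StringComparer.Ordinal, false);

    public static readonly MatchingRules CaseInsensitive =
        new MatchingRules(StringComparer.OrdinalIgnoreCase, true);

    public MatchingRules(StringComparer comparer, bool ignoreCase)
    {
        Comparer = comparer ?? throw new ArgumentNullException(nameof(comparer));
        IgnoreCase = ignoreCase;
    }

    // Used for searching and for keyed collections.
    public StringComparer Comparer { get; }

    // Used for APIs that take a boolean, and for display in a UI.
    public bool IgnoreCase { get; }

    // Passes the stored flag to Enum.Parse(Type, string, bool).
    public TEnum ParseEnum<TEnum>(string value) where TEnum : struct
    {
        return (TEnum)Enum.Parse(typeof(TEnum), value, IgnoreCase);
    }
}
```

With this, MatchingRules.CaseInsensitive.ParseEnum<DayOfWeek>("monday") yields DayOfWeek.Monday, while the same call on MatchingRules.CaseSensitive throws an ArgumentException.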
