Comparing strings and characters

PIEBALD

Not really a C#-specific comment, more general .NET observations.

1) A while back I found the need to determine whether or not a particular
StringComparer was case-insensitive. The best way I found was to use
Reflection to access the private _ignoreCase field, which I'd rather not do.
It seems to me that a public readonly property would be helpful. (Better yet,
it should probably expose a CompareOptions value.)

2) The CompareInfo.Compare method only works on strings (or substrings), not
individual characters. It seems that it must compare one character at a time,
so why not expose a method to allow that?

3) HashSet<char> has a constructor that can be passed an
IEqualityComparer<char>, but there's none built-in. I'm not complaining,
really, but it seems to me that if you're going to write a StringComparer it
should be based on a CharComparer.

So I spent this evening writing my own CharComparer, trying to stay close to
the pattern of StringComparer, but I added public readonly fields to hold the
CompareInfo and CompareOptions.
The main downside is that it has to wrap the characters in strings to use
CompareInfo.Compare, which seems needlessly wasteful.

Or am I missing something?
 
Pavel Minaev

> Not really a C#-specific comment, more general .NET observations.
>
> 1) A while back I found the need to determine whether or not a particular
> StringComparer was case-insensitive. The best way I found was to use
> Reflection to access the private _ignoreCase field, which I'd rather not do.
> It seems to me that a public readonly property would be helpful. (Better yet,
> it should probably have a CompareOptions)

It is most likely a design issue in your own code. You should never
need to tell what a given StringComparer instance does, but rather
just use its results. For one thing, I can write my own StringComparer-
derived class that is case-insensitive, but implemented differently
from the stock one. Or I can write a StringComparer that ignores case
on the first letter, but not for the rest of them. And then your code
will most likely break - and it shouldn't.

> 2) The CompareInfo.Compare method only works on strings (or substrings), not
> individual characters. It seems that it must compare one character at a time,
> why not expose a method to allow that?

CompareInfo does culture-sensitive comparison of Unicode text. A
single Unicode character is not necessarily a single char value. Keep
in mind that .NET strings are UTF-16, which means that some Unicode
characters are represented by two code units (i.e., two chars forming
a surrogate pair). Then there are also numerous ways to represent a
single character, such as combining symbols. And finally, not every
char value is, by itself, a valid Unicode character (it can be one
half of a surrogate pair). Therefore, proper culture-sensitive
comparison on individual chars is simply not possible. Ordinal
(code unit) comparison is, but that's precisely what operator== does
on char anyway.
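To make the surrogate pair and combining symbol points concrete, here is a minimal sketch. The class name is made up for illustration, and the culture-sensitive result is printed rather than asserted, because it depends on how the runtime's globalization support is configured.

```csharp
using System;
using System.Globalization;

public static class CharVersusCharacter
{
    public static void Main()
    {
        // One Unicode character (U+1F600) stored as two UTF-16 code units.
        string emoji = "\U0001F600";
        Console.WriteLine(emoji.Length);                    // 2
        Console.WriteLine(char.IsHighSurrogate(emoji[0]));  // True
        Console.WriteLine(char.IsLowSurrogate(emoji[1]));   // True

        // The same visible character written two ways:
        string composed = "\u00E9";     // e with acute, single code point
        string decomposed = "e\u0301";  // e followed by a combining acute accent

        // Ordinal comparison looks at code units only, so these differ.
        Console.WriteLine(string.Equals(composed, decomposed, StringComparison.Ordinal)); // False

        // Culture-sensitive comparison works on whole strings. On a runtime
        // with full globalization data this is expected to report equality (0),
        // but the value is printed here rather than assumed.
        int result = CultureInfo.InvariantCulture.CompareInfo.Compare(
            composed, decomposed, CompareOptions.None);
        Console.WriteLine(result);
    }
}
```

Neither string can be handed to a char-based comparison without losing information, which is why CompareInfo only accepts strings and substrings.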

> 3) HashSet<char> has a constructor that can be passed an
> IEqualityComparer<char>, but there's none built-in. I'm not complaining,
> really, but it seems to me that if you're going to write a StringComparer it
> should be based on a CharComparer.

See above, and the reasons are all the same. Also, for the same
reasons, StringComparer does not just compare individual chars
separately in a loop and combine the results - it's much more
complicated than that.

If you need a set of Unicode characters, and you want to have proper
internationalization, then you will have to use HashSet<string>.
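A rough sketch of what the HashSet<string> approach could look like. The FromText helper and its class are hypothetical names, and splitting on text elements via StringInfo is one possible choice, not something the post prescribes.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementSet
{
    // Hypothetical helper: splits the text into text elements (a base
    // character plus any combining marks, or a surrogate pair) and stores
    // each one as a string under the supplied comparer.
    public static HashSet<string> FromText(string text, StringComparer comparer)
    {
        var set = new HashSet<string>(comparer);
        TextElementEnumerator elements = StringInfo.GetTextElementEnumerator(text);
        while (elements.MoveNext())
        {
            set.Add(elements.GetTextElement());
        }
        return set;
    }

    public static void Main()
    {
        HashSet<string> set = FromText("Abc", StringComparer.InvariantCultureIgnoreCase);
        Console.WriteLine(set.Contains("a")); // True
        Console.WriteLine(set.Contains("d")); // False
    }
}
```

Because every element is a string, any stock StringComparer can be passed in and no characters need to be compared in isolation.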
 
Mihai N.

> CompareInfo does culture-sensitive comparison of Unicode characters. A
> single Unicode character is not necessarily a single char value - keep
> in mind that .NET strings are UTF-16, which means that some Unicode
> characters are represented by two code units (i.e., two chars). Then
> there are also numerous ways to represent a single character, such as
> combining symbols. And finally, not every char value is, by itself, a
> valid Unicode character (it can be one half of a surrogate pair).
> Therefore, proper culture-sensitive comparison on
> individual chars is simply not possible. Codepoint comparison is, but
> that's precisely what operator== does on char anyway.

Right on the mark!

Case conversion is a locale-sensitive operation, and so is
string comparison.

Comparing stand-alone characters is meaningless even without
implementation details (like surrogates or combining characters).

For instance, in Spanish "ch" sorts like a single character between c and d.
In German "oe" is equivalent to o umlaut. And so on.
Also in German, the upper case of sharp s (one character) is "SS".

If you do anything on one character, you are guaranteed to be wrong.
 
PIEBALD

> wouldn't it be easier to
> have it compare two strings that differ only by case and see what the
> result is?

I use that as a fall-back position, but apparently even that isn't a sure
thing with other cultures and alphabets.
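For reference, the fall-back probe might look roughly like this. ComparerProbe and TreatsAsEqual are hypothetical names, and the choice of "a"/"A" as the probe pair is an assumption.

```csharp
using System;
using System.Collections.Generic;

public static class ComparerProbe
{
    // Hypothetical helper: asks the comparer directly whether it considers
    // two sample strings equal, instead of reading private fields.
    public static bool TreatsAsEqual(IEqualityComparer<string> comparer, string a, string b)
    {
        return comparer.Equals(a, b);
    }

    public static void Main()
    {
        Console.WriteLine(TreatsAsEqual(StringComparer.OrdinalIgnoreCase, "a", "A")); // True
        Console.WriteLine(TreatsAsEqual(StringComparer.Ordinal, "a", "A"));           // False
    }
}
```

The answer only describes the sample pair. A culture-aware comparer can ignore case for "a"/"A" and still treat "i" and "I" as different letters (Turkish casing rules are the usual example), so one probe does not establish that a comparer is case-insensitive in general.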
 
PIEBALD

Ah, thanks. So perhaps only Ordinal and OrdinalIgnoreCase comparisons should
be used for individual characters.
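A minimal sketch of what such an ordinal, case-insensitive char comparer could look like. The class is hypothetical (nothing like it ships in the framework), and folding case with char.ToUpperInvariant is an assumption about what "OrdinalIgnoreCase for a char" should mean.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the .NET base class library.
public sealed class OrdinalIgnoreCaseCharComparer : IEqualityComparer<char>
{
    public static readonly OrdinalIgnoreCaseCharComparer Instance =
        new OrdinalIgnoreCaseCharComparer();

    // Compares single UTF-16 code units after invariant upper-casing.
    // No culture rules are applied and surrogate halves are not paired up.
    public bool Equals(char x, char y)
    {
        return char.ToUpperInvariant(x) == char.ToUpperInvariant(y);
    }

    // Must agree with Equals: chars that fold to the same value hash alike.
    public int GetHashCode(char c)
    {
        return char.ToUpperInvariant(c).GetHashCode();
    }
}

public static class CharSetDemo
{
    public static void Main()
    {
        var vowels = new HashSet<char>("aeiou", OrdinalIgnoreCaseCharComparer.Instance);
        Console.WriteLine(vowels.Contains('E')); // True
        Console.WriteLine(vowels.Contains('x')); // False
    }
}
```

This avoids wrapping each char in a string, at the cost of being correct only for ordinal semantics.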

> You should never
> need to tell what a given StringComparer instance does, but rather
> just use its results.

That's fine as far as the StringComparer goes, but I was also using
Enum.Parse, which can take an ignoreCase boolean; it seems best to pass the
same value the StringComparer is using.
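One way to keep the two in step without inspecting the comparer is to derive both from a single flag. This is only a sketch of that idea: the Color enum, the class, and both method names are invented for illustration.

```csharp
using System;

public static class CaseAwareParsing
{
    public enum Color { Red, Green, Blue }

    // Picks the stock ordinal comparer that matches the flag.
    public static StringComparer ComparerFor(bool ignoreCase)
    {
        return ignoreCase ? StringComparer.OrdinalIgnoreCase : StringComparer.Ordinal;
    }

    // Passes the same flag to Enum.Parse's ignoreCase parameter.
    public static Color ParseColor(string text, bool ignoreCase)
    {
        return (Color)Enum.Parse(typeof(Color), text, ignoreCase);
    }

    public static void Main()
    {
        bool ignoreCase = true;
        Console.WriteLine(ComparerFor(ignoreCase).Equals("red", "RED")); // True
        Console.WriteLine(ParseColor("red", ignoreCase));                // Red
    }
}
```

With ignoreCase set to false, ParseColor("red", false) throws an ArgumentException because no member is named exactly "red".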

There could also be an application that uses a StringComparer to search
based on user input. The developer may want to include an indication of
whether or not the search is case-sensitive (even when the user has the
ability to specify the comparer).
 
