string.Compare / OrdinalIgnoreCase

Rene

Hi,

It was my understanding that when comparing strings with
"OrdinalIgnoreCase", .NET first uppercases every character in both
strings and then does an ordinal comparison (a Unicode code point
comparison).

I guess I was wrong (or I am not getting something), because my experiments
suggest otherwise.

In the code below, compare1 returns zero, showing that if I manually
uppercase the strings and then compare them, .NET says they are the
same.

However, compare2 does not return zero, which means .NET is doing
something different from what I assumed.

Could someone please tell me why compare1 and compare2 are returning
different values?

Thank you.

---------------------------------------------------------

// LATIN CAPITAL LETTER I (U+0049)
string capitalLetterI = "I";

// LATIN SMALL LETTER DOTLESS I (U+0131)
string smallLetterDotlessI = "\u0131";

string upper1 = smallLetterDotlessI.ToUpper();
string upper2 = capitalLetterI.ToUpper();

// Uppercase manually, then compare ordinally: returns 0 for me.
int compare1 = string.Compare(upper1, upper2, StringComparison.Ordinal);

// Let OrdinalIgnoreCase handle the casing: does NOT return 0.
int compare2 = string.Compare(smallLetterDotlessI, capitalLetterI,
    StringComparison.OrdinalIgnoreCase);
 
Rene said:
It was my understanding that when comparing strings with
"OrdinalIgnoreCase", .NET first uppercases every character in both
strings and then does an ordinal comparison (a Unicode code point
comparison).

The process of capitalization itself is culture-sensitive, which is
what's tripping you up. Your parameterless calls to ToUpper return plain
"I" in both cases because they use the thread's current culture. If you
pass CultureInfo.InvariantCulture as the culture to use when upper
casing, the dotless i is left unchanged and both comparisons agree:
the strings are reported as different.

In this case, I believe that from a culture-neutral point of view,
they're different letters rather than just differently capitalised
letters. It's all a bit tricksy though, to be honest.

Hope this at least explains a bit of what's going on...
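
To make the suggestion above concrete, here is a minimal sketch that repeats the manual comparison with the invariant culture instead of the current one. The class and method names (CasingDemo, CompareAfterInvariantUpper, CompareIgnoreCase) are made up for illustration; only ToUpper(CultureInfo), string.Compare and StringComparison come from the framework. What the first line prints depends on your current culture, so no output is claimed for it.

```csharp
using System;
using System.Globalization;

static class CasingDemo
{
    // Uppercase both strings with the invariant culture, then compare ordinally.
    public static int CompareAfterInvariantUpper(string a, string b)
    {
        string upperA = a.ToUpper(CultureInfo.InvariantCulture);
        string upperB = b.ToUpper(CultureInfo.InvariantCulture);
        return string.Compare(upperA, upperB, StringComparison.Ordinal);
    }

    // Let the framework do the case-insensitive ordinal comparison.
    public static int CompareIgnoreCase(string a, string b)
    {
        return string.Compare(a, b, StringComparison.OrdinalIgnoreCase);
    }

    static void Main()
    {
        string dotlessI = "\u0131"; // LATIN SMALL LETTER DOTLESS I
        string capitalI = "I";      // LATIN CAPITAL LETTER I (U+0049)

        // Culture-sensitive: the result depends on the thread's current culture.
        string upperCurrent = dotlessI.ToUpper();
        Console.WriteLine("current culture:   U+" + ((int)upperCurrent[0]).ToString("X4"));

        // Culture-independent: the invariant culture leaves U+0131 unchanged.
        string upperInvariant = dotlessI.ToUpper(CultureInfo.InvariantCulture);
        Console.WriteLine("invariant culture: U+" + ((int)upperInvariant[0]).ToString("X4"));

        // Both of these should be non-zero, i.e. the two approaches agree.
        Console.WriteLine(CompareAfterInvariantUpper(dotlessI, capitalI) != 0);
        Console.WriteLine(CompareIgnoreCase(dotlessI, capitalI) != 0);
    }
}
```

Passing the culture explicitly (or calling ToUpperInvariant) avoids changing the culture of the whole thread just to get a predictable casing result.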
 
Sure enough, I added

System.Threading.Thread.CurrentThread.CurrentCulture =
System.Globalization.CultureInfo.InvariantCulture;

before doing any comparing and voilà, I got the same answer from both
comparisons this time (both show the strings as not being equal).

Looks like I have to do some more reading on "string.Compare". I didn't
think that learning about strings/Unicode/cultures/etc. would take me as
long as it has. The more I research, the more new stuff I keep bumping
into... damn it!

Thanks.
 
OK, I did some more digging around. According to the following site:

http://www.fileformat.info/info/unicode/char/0131/index.htm

The *Unicode* uppercase equivalent for 'LATIN SMALL LETTER DOTLESS I'
(U+0131) is 'LATIN CAPITAL LETTER I' (U+0049).

Having said that, I was under the impression that the OrdinalIgnoreCase flag
would use the *Unicode conversion tables* (no culture involved) to convert
the characters in the string to uppercase. That would mean the uppercase
conversion is always the same, no matter what culture is being used.

If the above is true, the result for "compare2" should be zero because:

The "smallLetterDotlessI" variable, uppercased using the Unicode tables,
should become (U+0049).
The "capitalLetterI" variable is already uppercase, so uppercasing it
using the Unicode tables should leave it as (U+0049).

So you may think that the line of code below should return zero:

int compare2 = string.Compare(smallLetterDotlessI, capitalLetterI,
StringComparison.OrdinalIgnoreCase);

But it does not. So what's going on? What logic is .NET using when
comparing with the OrdinalIgnoreCase flag? Is it not uppercasing all
characters using the Unicode conversion tables?

Thanks.
 
Well, I think I found the answer here:

http://blogs.msdn.com/michkap/archive/2005/03/10/391564.aspx

Basically the page says:

"Windows and the .NET Framework mainly support simple, reversible casing --
which is to say single code point casing that have ToUpper() and ToLower()
as inverse operations that can "undo" each other."

So in my example, 'LATIN SMALL LETTER DOTLESS I' (U+0131) would uppercase
to 'LATIN CAPITAL LETTER I' (U+0049). For that mapping to be reversible,
U+0049 would then have to lowercase back to U+0131, but it doesn't: it
lowercases to 'LATIN SMALL LETTER I' (U+0069). Since the conversion is not
reversible, OrdinalIgnoreCase does not uppercase U+0131 at all, and that is
why "compare2" does not return zero.

At least that's what I think is going on.
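
The reversibility idea can be checked with a small sketch like the one below. IsReversiblePair is a hypothetical helper written only for this post; it is not how the framework implements OrdinalIgnoreCase internally. It just tests whether the invariant upper and lower mappings undo each other for a given pair of characters.

```csharp
using System;

static class RoundTripDemo
{
    // True when 'lower' uppercases to 'upper' AND 'upper' lowercases back to
    // 'lower' under the invariant casing rules (hypothetical helper).
    public static bool IsReversiblePair(char lower, char upper)
    {
        return char.ToUpperInvariant(lower) == upper
            && char.ToLowerInvariant(upper) == lower;
    }

    static void Main()
    {
        // i <-> I round-trips, so the pair is treated as a case pair.
        Console.WriteLine(IsReversiblePair('i', 'I'));      // True

        // Dotless i and I do not round-trip: I lowercases to U+0069, not U+0131.
        Console.WriteLine(IsReversiblePair('\u0131', 'I')); // False
    }
}
```

If the pair does not round-trip, the invariant rules leave the dotless i alone, which matches the non-zero result of "compare2".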
 
