string.CompareTo(string) oddity/bug/misconception?

Helge Jensen

In my locale (US English/DK, I think; I don't know how to state it
properly :), on .NET 1.1 SP1, I observe the following:

"A".CompareTo("B") == -1; // A < B
"B".CompareTo("F") == -1; // B < F
"A".CompareTo("F") == -1; // A < F

Seems like lexicographic order, and also:

"BB".CompareTo("FF") == 1; // BB < FF

So that looks like the usual lexicographic order, but, much to my surprise:

"AA".CompareTo("BB") == 1; // AA > BB
"AA".CompareTo("FF") == 1; // AA > FF

The documentation says that string.CompareTo returns: "A 32-bit signed
integer indicating the lexical relationship between the two comparands."

What's going on here, am I misinterpreting, or is this a bug?
 
Helge Jensen

Helge said:
"BB".CompareTo("FF") == 1; // BB < FF

Sorry, that should be:

"BB".CompareTo("FF") == -1; // BB < FF

Note the "-1" here... I thought I had read it through enough times!
 
Jon Skeet [C# MVP]

Helge Jensen said:
In my locale (US English/DK, I think; I don't know how to state it
properly :), on .NET 1.1 SP1, I observe the following:

"A".CompareTo("B") == -1; // A < B
"B".CompareTo("F") == -1; // B < F
"A".CompareTo("F") == -1; // A < F

Seems like lexicographic order, and also:

"BB".CompareTo("FF") == 1; // BB < FF

So that looks like the usual lexicographic order, but, much to my surprise:

"AA".CompareTo("BB") == 1; // AA > BB
"AA".CompareTo("FF") == 1; // AA > FF

The documentation says that string.CompareTo returns: "A 32-bit signed
integer indicating the lexical relationship between the two comparands."

What's going on here, am I misinterpreting, or is this a bug?

That certainly sounds very odd - could you give more details of your
locale? Try printing out CultureInfo.CurrentCulture.Name and post the
result - I'd like to try to reproduce it.
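
A minimal sketch of such a check, in case it helps (the class name
ShowCulture is only a placeholder). It prints the current culture name
followed by the raw return values of a few of the comparisons from the
original post, so the output can be pasted into a reply as-is:

```csharp
using System;
using System.Globalization;

class ShowCulture
{
    static void Main()
    {
        // string.CompareTo(string) compares using the current culture,
        // so print its name first.
        Console.WriteLine(CultureInfo.CurrentCulture.Name);

        // The comparisons from the original post; the values printed
        // depend on the culture reported above.
        Console.WriteLine("A".CompareTo("B"));
        Console.WriteLine("AA".CompareTo("BB"));
        Console.WriteLine("AA".CompareTo("FF"));
    }
}
```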
 
Helge Jensen

Okay, I now understand this, although I must say I don't like it much :)

In my country we have 3 special characters, æ (aelig), ø (oslash), and å
(aring). I will use the (...) notation from now on, just in case
mail-clients mess up the international characters.

The alphabet in "da-DK" ends with these three letters in sequence, so it's:

X Y Z (aelig) (oslash) (aring)

None of the three are considered ligatures.

When the 3 characters are not available, substitutions are used:

(aelig) : ae
(oslash): oe
(aring) : aa

What happens here is that string.CompareTo(string) takes the
substitution for (aring) into account, so it sorts "aa" as "(aring)".
Note that it does not do the same for (aelig) and (oslash).

While this sorting may *seem* attractive, it is inconsistent, for example:

"dataanl(aelig)g", "datamat"

will be sorted as "datamat" < "dataanl(aelig)g", while the correct
sorting is the reverse. More fun can be had with words like
"klimaanl(aelig)g", "kontraangreb", ...; try it yourself at
http://www.ddoo.dk/perbang.com.cms?aid=109&mode=3&w=aa .
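
The "datamat" example can be tried with a minimal sketch like the one
below. It assumes the "da-DK" culture data is available on the machine;
the class name DanishAaDemo and the helper CompareDanish are made up for
illustration. The culture-sensitive result depends on the framework
version and its collation data, so it is printed rather than stated
here; the ordinal result is shown for contrast.

```csharp
using System;
using System.Globalization;

class DanishAaDemo
{
    // Hypothetical helper: case-sensitive comparison using da-DK rules.
    public static int CompareDanish(string a, string b)
    {
        CultureInfo danish = new CultureInfo("da-DK");
        return string.Compare(a, b, false, danish);
    }

    static void Main()
    {
        // \u00E6 is (aelig), escaped so the file encoding does not matter.
        string anlaeg = "dataanl\u00E6g";
        string datamat = "datamat";

        // Positive if "aa" is treated as (aring) and sorted after "m";
        // the actual sign depends on the installed collation data.
        Console.WriteLine(Math.Sign(CompareDanish(anlaeg, datamat)));

        // Ordinal comparison only looks at UTF-16 code units:
        // 'a' < 'm' at index 4, so the result is negative.
        Console.WriteLine(Math.Sign(string.CompareOrdinal(anlaeg, datamat)));
    }
}
```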

Since 1948, the alternative spelling "aa" for (aring) has been correct
only in "proper names" (the dictionary translation of the word
"egennavn"; it means names of places, people, ...). Place names that use
the alternative representation are *not* correct, but may be used out of
respect for strong local traditions. [loosely translated from Nudansk
ordbog & Retskrivningsleksikon]

"Dansk Sprognævn" is the national committee that specifies the
language, and they advise that alphabetization should sort "aa" as
"(aring)" only when it is pronounced as (aring), which of course
string.CompareTo(string) cannot respect.

So string.CompareTo(string) is in a tight spot. However, the choice of
sorting "aa" as "(aring)" seems misguided, because:

1. it is inconsistent, sorting some words improperly
2. "aa" is not to be sorted as "(aring)" in the general case
3. (aring) can be encoded in a string, so the writer has a choice
between "aa" and "(aring)"
4. if "Aarhus" is explicitly chosen over "(Aring)rhus" it is because
of local traditions
5. people are used to looking for words with ambiguous spelling under
both "a" and "(aring)"

Oh, yes, and the "bastard" reason that brought it to my attention:

n. it sorts "0xaa" *after* "0xff" ;)

I guess Danish alphabetization sucks, but there really is no reason to
make it suck inconsistently :)
 
Helge Jensen

That certainly sounds very odd - could you give more details of your
locale? Try printing out CultureInfo.CurrentCulture.Name and post the
result - I'd like to try to reproduce it.

System.Globalization.CultureInfo.CurrentCulture.Name = "da-DK".

The culture where proper alphabetization requires speech recognition :(
 
cody

Oh, yes, and the "bastard" reason that brought it to my attention:
n. it sorts "0xaa" *after* "0xff" ;)

in that case you should use

String.Compare(string1, string2, false, CultureInfo.InvariantCulture);

or even

String.CompareOrdinal()
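
A minimal sketch of both calls applied to the "0xaa"/"0xff" case (the
class and helper names are placeholders, not anything from the
framework). Neither comparison consults the current culture, so both
should give the same answer on a da-DK machine as anywhere else:

```csharp
using System;
using System.Globalization;

class CultureIndependentCompare
{
    // Linguistic comparison rules, but pinned to the invariant culture.
    public static int CompareInvariant(string a, string b)
    {
        return String.Compare(a, b, false, CultureInfo.InvariantCulture);
    }

    // Comparison of the raw UTF-16 code units, with no linguistic rules.
    public static int CompareRaw(string a, string b)
    {
        return String.CompareOrdinal(a, b);
    }

    static void Main()
    {
        // Both results are negative: "0xaa" sorts before "0xff".
        Console.WriteLine(CompareInvariant("0xaa", "0xff") < 0); // True
        Console.WriteLine(CompareRaw("0xaa", "0xff") < 0);       // True
    }
}
```

The ordinal form is probably the better fit for hex strings and other
machine-generated identifiers, since no linguistic rules apply at all.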
 
Helge Jensen

cody said:
in that case you should use

String.Compare(string1, string2, false, CultureInfo.InvariantCulture);

Yes, I should.

However, that does not solve the problem with wrong sorting of certain
words.
 
