Unclear about string class

Pavils Jurjans · May 25, 2004

Hello,

Here's an excerpt from the MSDN online documentation:

An index is the position of a Char, not a Unicode character, in a String. An
index is a zero-based, nonnegative number starting from the first position
in the string, which is index position zero. Consecutive index values might
not correspond to consecutive Unicode characters because a Unicode character
might be encoded as more than one Char. To work with each Unicode character
instead of each Char, use the System.Globalization.StringInfo class.

I did some testing:

string test = "%BC";
Console.WriteLine((long) test[0]);
Console.WriteLine((long) test[1]);
Console.WriteLine((long) test[2]);
Console.WriteLine(test.IndexOf("%"));
Console.WriteLine(test.IndexOf("B"));
Console.WriteLine(test.IndexOf("C"));
// Where "%" is actually a Japanese character (code=38283); the .cs file is
// saved with UTF-8 encoding.

So, from the doc, I should get something like 0, 3, 4. But I actually get
the normal 0, 1, 2.

So, where's the explanation? Why is the documentation warning about a
difference between indexes and Unicode characters when I can't detect one?

-- Pavils

mikeb · May 25, 2004

Strings in .NET are composed of UTF-16 code units. For the
vast majority of characters, one character is encoded in 16 bits
(a single .NET Char). Some characters are encoded as a pair of
16-bit values (a surrogate pair), similar to the way MBCS works on
Win32 systems. These are pretty rare, and in my experience, I have not
seen any .NET code that even attempts to deal with them; most
.NET code I've seen treats a System.Char as a character.

So I guess what they're saying is that an index into a .NET String
points to a System.Char, but that it is not necessarily
pointing to a complete Unicode character, since some characters are
encoded in UTF-16 using more than one Char (code unit).
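
To make that concrete, here is a minimal sketch (not from the MSDN docs) that compares the Char count with the text element count reported by System.Globalization.StringInfo. The class and method names are made up for illustration, and U+1D11E (a musical symbol outside the 16-bit range) is just one convenient character that needs a surrogate pair. It assumes a reasonably modern C# compiler.

```csharp
using System;
using System.Globalization;

static class SurrogateDemo
{
    // Number of UTF-16 code units (Chars); this is what Length and the indexer see.
    public static int CharCount(string s) => s.Length;

    // Number of text elements as reported by StringInfo.
    public static int TextElementCount(string s) =>
        new StringInfo(s).LengthInTextElements;

    static void Main()
    {
        // U+958B (decimal 38283) fits in a single Char.
        string bmp = "\u958B" + "BC";
        // U+1D11E needs two Chars (a surrogate pair).
        string astral = "\U0001D11E" + "BC";

        Console.WriteLine(CharCount(bmp));           // 3
        Console.WriteLine(TextElementCount(bmp));    // 3
        Console.WriteLine(CharCount(astral));        // 4
        Console.WriteLine(TextElementCount(astral)); // 3

        // The first two Chars of the second string are the two halves of the pair.
        Console.WriteLine(char.IsHighSurrogate(astral[0])); // True
        Console.WriteLine(char.IsLowSurrogate(astral[1]));  // True
    }
}
```

The Japanese character in the original test behaves like the first string, which would explain why no difference showed up.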

See Jon Skeet's excellent FAQ on Unicode/Character Encoding issues:

http://www.yoda.arachsys.com/csharp/unicode.html

cody · May 26, 2004

Are you sure that your japanese character consists of more than one unicode
character?

Pavils Jurjans · May 27, 2004

cody said:
Are you sure that your Japanese character consists of more than one
Unicode character?

No, but I know that it is converted to a number of characters when encoded to
UTF-8, and the doc claims that when I look for an index, I will get the one
according to the UTF-8 encoded character string, not the Unicode character
position.

-- Pavils

mikeb · May 29, 2004

Pavils said:
No, but I know that it is converted to a number of characters when encoded to
UTF-8, and the doc claims that when I look for an index, I will get the one
according to the UTF-8 encoded character string, not the Unicode character
position.

Could you clarify where it says you'll get the index according to a
UTF-8 encoding? My reading of the String class docs indicates that it
will provide an index according to UTF-16 encoding.
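
For what it's worth, here is a rough sketch of what I'd expect IndexOf to report in the two cases. The helper name is hypothetical, and I'm using an ordinal comparison so culture rules stay out of the picture; U+1D11E is simply an example of a character that takes two Chars in UTF-16.

```csharp
using System;

static class IndexDemo
{
    // Returns the index of each needle in s, counted in Chars (UTF-16 code units).
    public static int[] Indexes(string s, params string[] needles)
    {
        var result = new int[needles.Length];
        for (int i = 0; i < needles.Length; i++)
            result[i] = s.IndexOf(needles[i], StringComparison.Ordinal);
        return result;
    }

    static void Main()
    {
        string kanji = "\u958B";     // one Char
        string clef = "\U0001D11E";  // two Chars (surrogate pair)

        int[] a = Indexes(kanji + "BC", kanji, "B", "C");
        Console.WriteLine("{0}, {1}, {2}", a[0], a[1], a[2]); // 0, 1, 2

        int[] b = Indexes(clef + "BC", clef, "B", "C");
        Console.WriteLine("{0}, {1}, {2}", b[0], b[1], b[2]); // 0, 2, 3
    }
}
```

The second result is the kind of gap the documentation is warning about: "B" is the second Unicode character, but it sits at index 2. The encoding the .cs file is saved in does not enter into it.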

Jon Skeet [C# MVP] · May 29, 2004

A character isn't converted into *characters* when it's encoded - it's
converted into *bytes*. There's a big difference.
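
As a quick illustration of that difference, the sketch below (class and method names are hypothetical) takes the character from the original test, U+958B, and compares its length as a string with its size in bytes under two encodings.

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    // Size of the string in bytes once encoded as UTF-8.
    public static int Utf8ByteCount(string s) => Encoding.UTF8.GetByteCount(s);

    // Size of the string in bytes once encoded as UTF-16 (Encoding.Unicode).
    public static int Utf16ByteCount(string s) => Encoding.Unicode.GetByteCount(s);

    static void Main()
    {
        string kanji = "\u958B"; // decimal 38283

        Console.WriteLine(kanji.Length);          // 1 (one Char)
        Console.WriteLine(Utf8ByteCount(kanji));  // 3 (three bytes)
        Console.WriteLine(Utf16ByteCount(kanji)); // 2 (two bytes)

        // The bytes only exist after encoding; the string itself never holds them.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(kanji))); // E9-96-8B
    }
}
```

The three UTF-8 bytes only matter when the text is written to or read from a file or stream; string indexes never count them.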

mikeb said:
Could you clarify where it says you'll get the index according to a
UTF-8 encoding? My reading of the String class docs indicates that it
will provide an index according to UTF-16 encoding.

Yup, that's absolutely correct. I'm mystified as to where this doc is
too...
