Unclear about string class

Pavils Jurjans

Hello,

Here's an excerpt from the MSDN online documentation:

An index is the position of a Char, not a Unicode character, in a String. An
index is a zero-based, nonnegative number starting from the first position
in the string, which is index position zero. Consecutive index values might
not correspond to consecutive Unicode characters because a Unicode character
might be encoded as more than one Char. To work with each Unicode character
instead of each Char, use the System.Globalization.StringInfo class.

I did some testing:

string test = "%BC";
Console.WriteLine((long) test[0]);
Console.WriteLine((long) test[1]);
Console.WriteLine((long) test[2]);
Console.WriteLine(test.IndexOf("%"));
Console.WriteLine(test.IndexOf("B"));
Console.WriteLine(test.IndexOf("C"));
// Where "%" is actually a Japanese character (code=38283); the .cs file is
// saved with UTF-8 encoding.

So, going by the doc, I should get something like 0, 3, 4. But I actually get
the normal 0, 1, 2.

So, where's the explanation? Why is the documentation warning about a difference
between indexes and Unicode characters when I can't detect one?

-- Pavils
 
mikeb

Strings in .NET are made up of UTF-16 code units. For the
vast majority of characters, one character is encoded in 16 bits
(a single .NET Char). Some characters are encoded as a pair of
16-bit values (a surrogate pair), similar to the way MBCS works on
Win32 systems. These are pretty rare, and in my experience, I have not
seen any .NET code that even makes an attempt to deal with them; most
.NET code I've seen treats a System.Char as a character.

So I guess what they're saying is that an index into a .NET String
points to a System.Char, but not necessarily to a whole Unicode
character, since some Unicode characters are encoded in UTF-16 using
more than one code unit.
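
A minimal sketch of that difference, assuming .NET 2.0 or later (for the surrogate helpers and StringInfo.LengthInTextElements). The class name and the choice of U+1D11E (a musical symbol outside the Basic Multilingual Plane) are only illustrative; U+958B is the code 38283 mentioned in the question.

```csharp
using System;
using System.Globalization;

class SurrogateDemo
{
    static void Main()
    {
        // U+958B fits in a single Char.
        string bmp = "\u958B" + "BC";
        // U+1D11E needs two Chars (a surrogate pair) in UTF-16.
        string astral = "\U0001D11E" + "BC";

        Console.WriteLine(bmp.Length);            // 3
        Console.WriteLine(bmp.IndexOf('B'));      // 1

        Console.WriteLine(astral.Length);         // 4
        Console.WriteLine(astral.IndexOf('B'));   // 2, not 1
        Console.WriteLine(char.IsHighSurrogate(astral[0])); // True
        Console.WriteLine(char.IsLowSurrogate(astral[1]));  // True

        // StringInfo counts text elements instead of Chars.
        Console.WriteLine(new StringInfo(astral).LengthInTextElements); // 3
    }
}
```

With the character from the question, the indexes come out as 0, 1, 2 because it occupies one Char; only a character stored as a surrogate pair shifts the later indexes.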

See Jon Skeet's excellent FAQ on Unicode/Character Encoding issues:

http://www.yoda.arachsys.com/csharp/unicode.html
 
cody

Are you sure that your Japanese character consists of more than one Unicode
character?
 
Pavils Jurjans

cody said:
Are you sure that your Japanese character consists of more than one Unicode
character?

No, but I know that it is converted to a number of characters when encoded to
UTF-8, and the doc claims that when I look for an index, I will get the one
according to the UTF-8 encoded character string, not the Unicode character
position.

-- Pavils
 
mikeb

Pavils said:
No, but I know that it is converted to a number of characters when encoded to
UTF-8, and the doc claims that when I look for an index, I will get the one
according to the UTF-8 encoded character string, not the Unicode character
position.

Could you clarify where it says you'll get the index according to a
UTF-8 encoding? My reading of the String class docs indicates that it
will provide an index according to UTF-16 encoding.
 
Jon Skeet [C# MVP]

A character isn't converted into *characters* when it's encoded - it's
converted into *bytes*. There's a big difference.
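
A short sketch of that distinction, assuming the character in question is U+958B (code 38283, as stated in the original post); the class name is only illustrative. The string's Length counts Chars, while the byte counts depend on which encoding is applied afterwards.

```csharp
using System;
using System.Text;

class EncodingDemo
{
    static void Main()
    {
        string test = "\u958B" + "BC";

        // Encoding turns the string into bytes; the string itself is unchanged.
        byte[] utf8 = Encoding.UTF8.GetBytes(test);
        byte[] utf16 = Encoding.Unicode.GetBytes(test);

        Console.WriteLine(test.Length);   // 3 Chars
        Console.WriteLine(utf8.Length);   // 5 bytes: 3 for U+958B, 1 each for B and C
        Console.WriteLine(utf16.Length);  // 6 bytes: 2 per Char

        // Indexing works on Chars, whatever encoding the source file used.
        Console.WriteLine(test.IndexOf('C')); // 2
    }
}
```

The encoding the .cs file is saved in only affects how the compiler reads the source; once compiled, the string holds UTF-16 Chars.
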
mikeb said:
Could you clarify where it says you'll get the index according to a
UTF-8 encoding? My reading of the String class docs indicates that it
will provide an index according to UTF-16 encoding.

Yup, that's absolutely correct. I'm mystified as to where this doc is
too...
 
