This Spanish character string "ñ" causes something that I don't understand

Tony Johansson

Hi!

Here I encode the Spanish character "ñ" to UTF-8, which gives two bytes with the values 195 and 177, which is understandable.
As we know, a char holds a UTF-16 code unit, an unsigned 16-bit integer.
Now to my question: when I run this program, use the debugger and hover over the ch variable, which is of type char, it shows 241.
Since a char is Unicode (UTF-16) and this value uses two bytes when UTF-8 is used, how can the debugger show 241 when I hover over this ch variable?

using System;
using System.Text;

static void Main(string[] args)
{
    UTF8Encoding utf8 = new UTF8Encoding();
    string chars = "ñ";
    char ch = 'ñ';                                // hovering over ch in the debugger shows 241
    byte[] byteArray = utf8.GetBytes(chars);      // GetBytes allocates the array; no need for GetByteCount here
    Console.WriteLine(utf8.GetString(byteArray)); // prints: ñ
}
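
For reference, two extra lines at the end of Main would print the same values the debugger shows, without hovering (just a small sketch):

Console.WriteLine((int)ch);                     // 241 - the numeric UTF-16 value of 'ñ'
Console.WriteLine(string.Join(" ", byteArray)); // 195 177 - the UTF-8 bytes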

//Tony
 
Mihai N.

Tony said:
Here I encode the Spanish character "ñ" to UTF-8, which gives two bytes with the values 195 and 177, which is understandable.
As we know, a char holds a UTF-16 code unit, an unsigned 16-bit integer.
Now to my question: when I run this program, use the debugger and hover over the ch variable, which is of type char, it shows 241.
Since a char is Unicode (UTF-16) and this value uses two bytes when UTF-8 is used, how can the debugger show 241 when I hover over this ch variable?


The code point of ñ is U+00F1
This is 0xF1 (or 241 decimal) in UTF-16 or UTF-32, and C3 B1
(195 177 decimal) as UTF-8.

You can have some fun starting with the table here:
http://en.wikipedia.org/wiki/UTF-8#Description

195 177 decimal = C3 B1 hex = 11000011 10110001 binary
Now you take the binary and compare it to the UTF-8 pattern:
11000011 10110001
110yyyxx 10xxxxxx (second line in the table)
So you extract the useful bits (the yyyxx and xxxxxx bits above) and get
00011 110001
Together that is 00011110001, or split into groups of 4 you
get 000.1111.0001. That is exactly F1 (241).
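
The same extraction can be done in code; a minimal sketch, assuming a well-formed 2-byte sequence and doing no validation:

byte b1 = 0xC3, b2 = 0xB1;            // the two UTF-8 bytes, 195 and 177
int codePoint = ((b1 & 0x1F) << 6)    // keep the low 5 payload bits of the lead byte (yyyxx)
              | (b2 & 0x3F);          // keep the low 6 payload bits of the continuation byte (xxxxxx)
Console.WriteLine(codePoint);         // 241 (0xF1)
Console.WriteLine((char)codePoint);   // ñ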
 
Tony Johansson

Mihai N. said:
The code point of ñ is U+00F1
This is 0xF1 (or 241 decimal) in UTF-16 or UTF-32, and C3 B1
(195 177 decimal) as UTF-8.

You can have some fun starting with the table here:
http://en.wikipedia.org/wiki/UTF-8#Description

195 177 decimal = C3 B1 hex = 11000011 10110001 binary
Now you take the binary and compare it to the UTF-8 pattern:
11000011 10110001
110yyyxx 10xxxxxx (second line in the table)
So you extract the useful bits (the yyyxx and xxxxxx bits above) and get
00011 110001
Together that is 00011110001, or split into groups of 4 you
get 000.1111.0001. That is exactly F1 (241).

When the UTF-8 encoding uses 2 bytes, is it then common that UTF-16 has zeros in the high-order byte, as in this case where 241 fits in one byte?

//Tony
 
Harlan Messinger

Tony said:
Hi!

Here I encode the Spanish character "ñ" to UTF-8, which gives two bytes with the values 195 and 177, which is understandable.
As we know, a char holds a UTF-16 code unit, an unsigned 16-bit integer.
Now to my question: when I run this program, use the debugger and hover over the ch variable, which is of type char, it shows 241.
Since a char is Unicode (UTF-16) and this value uses two bytes when UTF-8 is used, how can the debugger show 241 when I hover over this ch variable?

Since the character is represented in memory as UTF-16, why would the debugger show you what it would be in UTF-8?

The UTF-16 representation of every Unicode character with a code point below 65536 is simply the 16-bit integer value of that code point. This isn't the case in UTF-8.
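
A short illustration of the difference: the char value is the code point itself, while the UTF-8 bytes are a transformation of it.

char ch = 'ñ';
Console.WriteLine((int)ch);                                 // 241 - the UTF-16 value equals the code point U+00F1
byte[] utf8 = System.Text.Encoding.UTF8.GetBytes(ch.ToString());
Console.WriteLine(string.Join(" ", utf8));                  // 195 177 - UTF-8 re-encodes the same code point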
 
Arne Vajhøj

When the UTF-8 encoding uses 2 bytes, is it then common that UTF-16 has zeros in the high-order byte, as in this case where 241 fits in one byte?

See the answers in the other thread.

Arne
 
Mihai N.

When the UTF-8 encoding uses 2 bytes, is it then common that UTF-16 has zeros in the high-order byte, as in this case where 241 fits in one byte?

Not quite. Again, the Wikipedia table is just fine to explain the thing:
http://en.wikipedia.org/wiki/UTF-8#Description
1 UTF-8 byte  => the range is U+0000 - U+007F
2 UTF-8 bytes => the range is U+0080 - U+07FF
3 UTF-8 bytes => the range is U+0800 - U+FFFF
4 UTF-8 bytes => the range is U+10000 - U+10FFFF

So with 2 UTF-8 bytes you have 0x80 (U+0080 - U+00FF) out of 0x780 code points, roughly 1 in 15, where the UTF-16 value fits in one byte (and has a zero in the high-order byte).
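
The table is easy to check from C#; a small sketch (assuming using System and System.Text, with sample characters simply picked from each range):

Console.WriteLine(Encoding.UTF8.GetByteCount("A"));   // 1 byte  (U+0041,  range U+0000 - U+007F)
Console.WriteLine(Encoding.UTF8.GetByteCount("ñ"));   // 2 bytes (U+00F1,  range U+0080 - U+07FF)
Console.WriteLine(Encoding.UTF8.GetByteCount("€"));   // 3 bytes (U+20AC,  range U+0800 - U+FFFF)
Console.WriteLine(Encoding.UTF8.GetByteCount("𐍈"));   // 4 bytes (U+10348, range U+10000 - U+10FFFF)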
 
Arne Vajhøj

Not quite. Again, the Wikipedia table is just fine to explain the thing:
http://en.wikipedia.org/wiki/UTF-8#Description
1 UTF-8 byte  => the range is U+0000 - U+007F
2 UTF-8 bytes => the range is U+0080 - U+07FF
3 UTF-8 bytes => the range is U+0800 - U+FFFF
4 UTF-8 bytes => the range is U+10000 - U+10FFFF

So with 2 UTF-8 bytes you have 0x80 (U+0080 - U+00FF) out of 0x780 code points, roughly 1 in 15, where the UTF-16 value fits in one byte (and has a zero in the high-order byte).

That assumes that all characters have the same probability. They do not.

English text would be practically 100%. French, German, Swedish etc.
text would be higher than 95%. Chinese would be a big fat zero.
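
A rough sketch of that fraction, using tiny made-up sample strings rather than real text:

static double ZeroHighByteFraction(string s)
{
    int count = 0;
    foreach (char c in s)
        if (c <= 0xFF) count++;                   // UTF-16 high-order byte is zero
    return (double)count / s.Length;
}

// ZeroHighByteFraction("hello world")  -> 1.0  (English)
// ZeroHighByteFraction("smörgåsbord")  -> 1.0  (Swedish: ö and å are still <= U+00FF)
// ZeroHighByteFraction("你好世界")      -> 0.0  (Chinese: CJK ideographs start at U+4E00)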

Arne
 
Mihai N.

That assumes that all characters have the same probability. They do not.

English text would be practically 100%. French, German, Swedish etc.
text would be higher than 95%. Chinese would be a big fat zero.

Arne

You have a point. But not for this question :)
The initial question was not how many characters have a zero in the
high-order byte overall, but how many do among the characters that
take 2 bytes in UTF-8.

So the only characters in that class are:
2 bytes => U+0080 - U+07FF
Restrict them to the ones with a zero high-order byte and you get U+0080 - U+00FF.
That basically means the accented characters in Latin-1.
English does not count at all, because it is only 1 byte in UTF-8, and Chinese
does not count, because it is 3 (and sometimes 4) bytes in UTF-8.

But yes, it is not clean statistics, it is language dependent.
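
A brute-force sketch over the 2-byte range confirms the proportion (assuming using System and System.Text):

int twoByte = 0, zeroHigh = 0;
for (int cp = 0x80; cp <= 0x7FF; cp++)                           // U+0080 - U+07FF
{
    if (Encoding.UTF8.GetByteCount(((char)cp).ToString()) == 2)  // every code point here needs 2 UTF-8 bytes
    {
        twoByte++;
        if (cp <= 0xFF) zeroHigh++;                              // zero high-order byte in UTF-16
    }
}
Console.WriteLine(zeroHigh + " of " + twoByte);                  // 128 of 1920, roughly 1 in 15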
 
