This Spanish character string "ñ" causes something that I don't understand

Tony Johansson

Hi!

Here I encode the Spanish character "ñ" to UTF-8, which gives two bytes with the values 195 and 177, which is understandable.
As we know, a char holds a UTF-16 code unit, an unsigned 16-bit integer.
Now to my question: when I run this program, use the debugger and hover over the ch variable, which is of type char, it shows 241.
Since a char is Unicode (UTF-16) and this value uses two bytes when UTF-8 is used, how can the debugger show 241 when I hover over this ch variable?

using System;
using System.Text;

static void Main(string[] args)
{
    UTF8Encoding utf8 = new UTF8Encoding();
    string chars = "ñ";
    char ch = 'ñ';                                // hovering over ch in the debugger shows 241
    byte[] byteArray = utf8.GetBytes(chars);      // GetBytes allocates the array; no need for GetByteCount here
    Console.WriteLine(utf8.GetString(byteArray)); // prints: ñ
}
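
For reference, two extra lines at the end of Main would print the same values the debugger shows, without hovering (just a small sketch):

Console.WriteLine((int)ch);                     // 241 - the numeric UTF-16 value of 'ñ'
Console.WriteLine(string.Join(" ", byteArray)); // 195 177 - the UTF-8 bytes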

//Tony
 
Mihai N.

Tony said:
Here I encode the Spanish character "ñ" to UTF-8, which gives two bytes with the values 195 and 177, which is understandable.
As we know, a char holds a UTF-16 code unit, an unsigned 16-bit integer.
Now to my question: when I run this program, use the debugger and hover over the ch variable, which is of type char, it shows 241.
Since a char is Unicode (UTF-16) and this value uses two bytes when UTF-8 is used, how can the debugger show 241 when I hover over this ch variable?


The code point of ñ is U+00F1
This is 0xF1 (or 241 decimal) in UTF-16 or UTF-32, and C3 B1
(195 177 decimal) as UTF-8.

You can have some fun starting with the table here:
http://en.wikipedia.org/wiki/UTF-8#Description

195 177 decimal = C3 B1 hex = 11000011 10110001 binary
Now you take the binary and compare it to the UTF-8 pattern:
11000011 10110001
110yyyxx 10xxxxxx (second line in the table)
So you extract the useful bits (the yyyxx and xxxxxx bits above) and get
00011 110001
Together that is 00011110001, or split into groups of 4 you
get 000.1111.0001. That is exactly F1 (241).
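
The same extraction can be done in code; a minimal sketch, assuming a well-formed 2-byte sequence and doing no validation:

byte b1 = 0xC3, b2 = 0xB1;            // the two UTF-8 bytes, 195 and 177
int codePoint = ((b1 & 0x1F) << 6)    // keep the low 5 payload bits of the lead byte (yyyxx)
              | (b2 & 0x3F);          // keep the low 6 payload bits of the continuation byte (xxxxxx)
Console.WriteLine(codePoint);         // 241 (0xF1)
Console.WriteLine((char)codePoint);   // ñ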
 
Tony Johansson

Mihai N. said:
The code point of ñ is U+00F1
This is 0xF1 (or 241 decimal) in UTF-16 or UTF-32, and C3 B1
(195 177 decimal) as UTF-8.

You can have some fun starting with the table here:
http://en.wikipedia.org/wiki/UTF-8#Description

195 177 decimal = C3 B1 hex = 11000011 10110001 binary
Now you take the binary and compare it to the UTF-8 pattern:
11000011 10110001
110yyyxx 10xxxxxx (second line in the table)
So you extract the useful bits (the yyyxx and xxxxxx bits above) and get
00011 110001
Together that is 00011110001, or split into groups of 4 you
get 000.1111.0001. That is exactly F1 (241).

When the UTF-8 encoding uses 2 bytes, is it then common that UTF-16 has zeros in the high-order byte, as in this case where 241 fits in one byte?

//Tony
 
Harlan Messinger

Tony said:
Hi!

Here I encode the Spanish character "ñ" to UTF-8, which gives two bytes with the values 195 and 177, which is understandable.
As we know, a char holds a UTF-16 code unit, an unsigned 16-bit integer.
Now to my question: when I run this program, use the debugger and hover over the ch variable, which is of type char, it shows 241.
Since a char is Unicode (UTF-16) and this value uses two bytes when UTF-8 is used, how can the debugger show 241 when I hover over this ch variable?

Since the character is represented in memory as UTF-16, why would the debugger show you what it would be in UTF-8?

The UTF-16 representation of every Unicode character with a code point below 65536 is simply the 16-bit integer value of that code point. This isn't the case in UTF-8.
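
A short illustration of the difference: the char value is the code point itself, while the UTF-8 bytes are a transformation of it.

char ch = 'ñ';
Console.WriteLine((int)ch);                                 // 241 - the UTF-16 value equals the code point U+00F1
byte[] utf8 = System.Text.Encoding.UTF8.GetBytes(ch.ToString());
Console.WriteLine(string.Join(" ", utf8));                  // 195 177 - UTF-8 re-encodes the same code point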
 
Arne Vajhøj

When the UTF-8 encoding uses 2 bytes, is it then common that UTF-16 has zeros in the high-order byte, as in this case where 241 fits in one byte?

See the answers in the other thread.

Arne
 
Mihai N.

When the UTF-8 encoding uses 2 bytes, is it then common that UTF-16 has zeros in the high-order byte, as in this case where 241 fits in one byte?

Not quite. Again, the Wikipedia table is just fine to explain the thing:
http://en.wikipedia.org/wiki/UTF-8#Description
1 UTF-8 byte  => the range is U+0000 - U+007F
2 UTF-8 bytes => the range is U+0080 - U+07FF
3 UTF-8 bytes => the range is U+0800 - U+FFFF
4 UTF-8 bytes => the range is U+10000 - U+10FFFF

So with 2 UTF-8 bytes you have 0x80 (U+0080 - U+00FF) out of 0x780 code points, roughly 1 in 15, where the UTF-16 value fits in one byte (and has a zero in the high-order byte).
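
The table is easy to check from C#; a small sketch (assuming using System and System.Text, with sample characters simply picked from each range):

Console.WriteLine(Encoding.UTF8.GetByteCount("A"));   // 1 byte  (U+0041,  range U+0000 - U+007F)
Console.WriteLine(Encoding.UTF8.GetByteCount("ñ"));   // 2 bytes (U+00F1,  range U+0080 - U+07FF)
Console.WriteLine(Encoding.UTF8.GetByteCount("€"));   // 3 bytes (U+20AC,  range U+0800 - U+FFFF)
Console.WriteLine(Encoding.UTF8.GetByteCount("𐍈"));   // 4 bytes (U+10348, range U+10000 - U+10FFFF)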
 
Arne Vajhøj

Not quite. Again, the Wikipedia table is just fine to explain the thing:
http://en.wikipedia.org/wiki/UTF-8#Description
1 UTF-8 byte  => the range is U+0000 - U+007F
2 UTF-8 bytes => the range is U+0080 - U+07FF
3 UTF-8 bytes => the range is U+0800 - U+FFFF
4 UTF-8 bytes => the range is U+10000 - U+10FFFF

So with 2 UTF-8 bytes you have 0x80 (U+0080 - U+00FF) out of 0x780 code points, roughly 1 in 15, where the UTF-16 value fits in one byte (and has a zero in the high-order byte).

That assumes that all characters have the same probability. They do not.

English text would be practically 100%. French, German, Swedish etc.
text would be higher than 95%. Chinese would be a big fat zero.
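
A rough sketch of that fraction, using tiny made-up sample strings rather than real text:

static double ZeroHighByteFraction(string s)
{
    int count = 0;
    foreach (char c in s)
        if (c <= 0xFF) count++;                   // UTF-16 high-order byte is zero
    return (double)count / s.Length;
}

// ZeroHighByteFraction("hello world")  -> 1.0  (English)
// ZeroHighByteFraction("smörgåsbord")  -> 1.0  (Swedish: ö and å are still <= U+00FF)
// ZeroHighByteFraction("你好世界")      -> 0.0  (Chinese: CJK ideographs start at U+4E00)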

Arne
 
Mihai N.

That assumes that all characters have the same probability. They do not.

English text would be practically 100%. French, German, Swedish etc.
text would be higher than 95%. Chinese would be a big fat zero.

Arne

You have a point. But not for this question :)
The initial question was not how many characters have a zero in the
high-order byte overall, but how many do among the characters that
take 2 bytes in UTF-8.

So the only characters in that class are:
2 bytes => U+0080 - U+07FF
Restrict them to the ones with a zero high-order byte and you get U+0080 - U+00FF.
That basically means the accented characters in Latin-1.
English does not count at all, because it is only 1 byte in UTF-8, and Chinese
does not count, because it is 3 (and sometimes 4) bytes in UTF-8.

But yes, it is not clean statistics, it is language dependent.
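
A brute-force sketch over the 2-byte range confirms the proportion (assuming using System and System.Text):

int twoByte = 0, zeroHigh = 0;
for (int cp = 0x80; cp <= 0x7FF; cp++)                           // U+0080 - U+07FF
{
    if (Encoding.UTF8.GetByteCount(((char)cp).ToString()) == 2)  // every code point here needs 2 UTF-8 bytes
    {
        twoByte++;
        if (cp <= 0xFF) zeroHigh++;                              // zero high-order byte in UTF-16
    }
}
Console.WriteLine(zeroHigh + " of " + twoByte);                  // 128 of 1920, roughly 1 in 15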
 
