Unicode in .NET


Tony Johansson

Hi!

First of all this statement is from MSDN ".NET Framework uses Unicode UTF-16
to represent characters.
In some cases, the .NET Framework uses UTF-8 internally."

Unicode representation is used for the char data type; for example, if I check
the char data type with sizeof, it reports 2 bytes.
Unicode uses a signed 16-bit integer.

Now let us look at the string data type, which is a reference type.
Here I store the string literal "s" in a byte array.
The size of the byte array is 2 bytes, because Unicode uses 2 bytes:
byte[] myByte = Encoding.Unicode.GetBytes("s");
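For reference, here is a minimal console sketch of the two checks above (the class name is only illustrative):

using System;
using System.Text;

class SizeCheck
{
    static void Main()
    {
        // sizeof(char) is a compile-time constant: one UTF-16 code unit = 2 bytes.
        Console.WriteLine(sizeof(char));                 // prints 2

        // Encoding.Unicode is the UTF-16 (little-endian) encoding.
        byte[] myByte = Encoding.Unicode.GetBytes("s");
        Console.WriteLine(myByte.Length);                // prints 2
    }
}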


Now to the question that I have a hard time understanding. Remember the
statement that I quoted at the top of this mail:
".NET Framework uses Unicode UTF-16 to represent characters. In some cases,
the .NET Framework uses UTF-8 internally."
So if I write this simple statement
string myString = "s";
will Unicode be used to store the single string literal "s"?

If the answer is no, then I have a follow-up question: a string is a sequence
of char, and as we saw earlier a char uses Unicode with 2 bytes, so why would
each char in a string not use 2 bytes?


//Tony
 

Alberto Poblacion

Tony Johansson said:
So if I write this simple statement
string myString = "s";
will Unicode be used to store the single string literal "s"?

Yes, that is right. This simple "s" is stored as Unicode, and it will
be using two bytes in memory, since that is how the String stores it
internally.
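A quick sketch of how one might check this from the outside (it does not look at String internals, it just counts the UTF-16 code units; names are illustrative):

using System;

class StringBytes
{
    static void Main()
    {
        string myString = "s";
        // Each char of a .NET string is one UTF-16 code unit, i.e. 2 bytes.
        Console.WriteLine(myString.Length * sizeof(char)); // 2 bytes of character data
        Console.WriteLine((int)myString[0]);               // 115 = U+0073, fits in 16 bits
    }
}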
 

Harlan Messinger

Tony said:
Hi!

First of all this statement is from MSDN ".NET Framework uses Unicode UTF-16
to represent characters.
In some cases, the .NET Framework uses UTF-8 internally."

Unicode representation is used for the char data type; for example, if I check
the char data type with sizeof, it reports 2 bytes.
Unicode uses a signed 16-bit integer.
Unsigned.


Now let us look at the string data type, which is a reference type.
Here I store the string literal "s" in a byte array.
The size of the byte array is 2 bytes, because Unicode uses 2 bytes:
byte[] myByte = Encoding.Unicode.GetBytes("s");


Now to the question that I have a hard time understanding. Remember the
statement that I quoted at the top of this mail:
".NET Framework uses Unicode UTF-16 to represent characters. In some cases,
the .NET Framework uses UTF-8 internally."
So if I write this simple statement
string myString = "s";
will Unicode be used to store the single string literal "s"?

If the answer is no, then I have a follow-up question: a string is a sequence
of char, and as we saw earlier a char uses Unicode with 2 bytes, so why would
each char in a string not use 2 bytes?

It *is* two bytes, and it is Unicode. And everything you cited says it
would be. So I'm curious what is leading you to ask this question with
the expectation that you're going to be told it's only one byte?
 

Mihai N.

Unicode representation is used for the char data type; for example, if I check
the char data type with sizeof, it reports 2 bytes.
Unicode uses a signed 16-bit integer.

No, it is UTF-16 (the encoding used by .NET) that uses a 16-bit integer,
not Unicode itself.
Not very important here, but that is the correct statement.

Now let us look at the string data type, which is a reference type.
Here I store the string literal "s" in a byte array.
The size of the byte array is 2 bytes, because Unicode uses 2 bytes:
byte[] myByte = Encoding.Unicode.GetBytes("s");

No. The size of the byte array depends on the encoding you convert to.
It has nothing to do with the String representation anymore.
Convert with UTF32Encoding and it will take 4 bytes; convert with ASCIIEncoding
and it will be 1 byte.
It is just the confusion between "Unicode" and UTF-16.
A better name for that encoding would have been UTF16Encoding.
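A small sketch that shows how the byte count follows whatever encoding you pick, not the String (the expected output is in the comments):

using System;
using System.Text;

class EncodingSizes
{
    static void Main()
    {
        string s = "s";
        Console.WriteLine(Encoding.Unicode.GetBytes(s).Length); // 2 (UTF-16, despite the name "Unicode")
        Console.WriteLine(Encoding.UTF8.GetBytes(s).Length);    // 1
        Console.WriteLine(Encoding.UTF32.GetBytes(s).Length);   // 4
        Console.WriteLine(Encoding.ASCII.GetBytes(s).Length);   // 1
    }
}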

Now to the question that I have a hard time understanding. Remember the
statement that I quoted at the top of this mail:
".NET Framework uses Unicode UTF-16 to represent characters. In some cases,
the .NET Framework uses UTF-8 internally."
So if I write this simple statement
string myString = "s";
will Unicode be used to store the single string literal "s"?

Technically, you can't know, because that is an internal .NET implementation
detail. Nobody has explained when .NET uses UTF-8 and when it uses UTF-16
internally.
In theory all strings might be UTF-8, and then "s" would take one byte.
In reality it is very likely to be UTF-16, so 2 bytes.
But you can't know for sure, unless you dig into the .NET sources.
 

Tony Johansson

Mihai N. said:
Unicode representation is used for the char data type; for example, if I check
the char data type with sizeof, it reports 2 bytes.
Unicode uses a signed 16-bit integer.

No, it is UTF-16 (the encoding used by .NET) that uses a 16-bit integer,
not Unicode itself.
Not very important here, but that is the correct statement.

Now let us look at the string data type, which is a reference type.
Here I store the string literal "s" in a byte array.
The size of the byte array is 2 bytes, because Unicode uses 2 bytes:
byte[] myByte = Encoding.Unicode.GetBytes("s");

No. The size of the byte array depends on the encoding you convert to.
It has nothing to do with the String representation anymore.
Convert with UTF32Encoding and it will take 4 bytes; convert with ASCIIEncoding
and it will be 1 byte.
It is just the confusion between "Unicode" and UTF-16.
A better name for that encoding would have been UTF16Encoding.

Now to the question that I have a hard time understanding. Remember the
statement that I quoted at the top of this mail:
".NET Framework uses Unicode UTF-16 to represent characters. In some cases,
the .NET Framework uses UTF-8 internally."
So if I write this simple statement
string myString = "s";
will Unicode be used to store the single string literal "s"?

Technically, you can't know, because that is an internal .NET implementation
detail. Nobody has explained when .NET uses UTF-8 and when it uses UTF-16
internally.
In theory all strings might be UTF-8, and then "s" would take one byte.
In reality it is very likely to be UTF-16, so 2 bytes.
But you can't know for sure, unless you dig into the .NET sources.

So what is the difference between Unicode and UTF-16?
Both use 16 bits as far as I know.

//Tony
 

Alberto Poblacion

Tony Johansson said:
So what is the difference between Unicode and UTF-16?
Both use 16 bits as far as I know.

Unicode specifies the characters as "code points". There are more than a
million code points, in the range 0 to 10FFFF hex. The part of Unicode known
as the "Basic Multilingual Plane" (BMP) uses only four hex digits to specify
each code point. For instance, U+005A is a capital Z. These characters can
fit into 16 bits. The rest of the code points are arranged in supplementary
planes (up to 16 planes, not all of them currently assigned) and require a
surrogate pair to encode them in UTF-16 (and four bytes in UTF-8).

So, the "both use 16 bits" is not always true; only the characters in
the BMP can be encoded with 16 bits.
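A short sketch of the BMP versus supplementary-plane difference (the G clef character is just an example of a code point outside the BMP):

using System;
using System.Text;

class Planes
{
    static void Main()
    {
        string z = "\u005A";        // 'Z': a BMP code point, one 16-bit code unit
        string clef = "\U0001D11E"; // MUSICAL SYMBOL G CLEF: a supplementary-plane code point

        Console.WriteLine(z.Length);                            // 1
        Console.WriteLine(clef.Length);                         // 2 (a surrogate pair)
        Console.WriteLine(char.IsSurrogatePair(clef, 0));       // True
        Console.WriteLine(Encoding.UTF8.GetBytes(clef).Length); // 4
    }
}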
 

Harlan Messinger

Tony said:
Mihai N. said:
Unicode representation is used for the char data type; for example, if I check
the char data type with sizeof, it reports 2 bytes.
Unicode uses a signed 16-bit integer.

No, it is UTF-16 (the encoding used by .NET) that uses a 16-bit integer,
not Unicode itself.
Not very important here, but that is the correct statement.

Now let us look at the string data type, which is a reference type.
Here I store the string literal "s" in a byte array.
The size of the byte array is 2 bytes, because Unicode uses 2 bytes:
byte[] myByte = Encoding.Unicode.GetBytes("s");

No. The size of the byte array depends on the encoding you convert to.
It has nothing to do with the String representation anymore.
Convert with UTF32Encoding and it will take 4 bytes; convert with ASCIIEncoding
and it will be 1 byte.
It is just the confusion between "Unicode" and UTF-16.
A better name for that encoding would have been UTF16Encoding.

Now to the question that I have a hard time understanding. Remember the
statement that I quoted at the top of this mail:
".NET Framework uses Unicode UTF-16 to represent characters. In some cases,
the .NET Framework uses UTF-8 internally."
So if I write this simple statement
string myString = "s";
will Unicode be used to store the single string literal "s"?

Technically, you can't know, because that is an internal .NET implementation
detail. Nobody has explained when .NET uses UTF-8 and when it uses UTF-16
internally.
In theory all strings might be UTF-8, and then "s" would take one byte.
In reality it is very likely to be UTF-16, so 2 bytes.
But you can't know for sure, unless you dig into the .NET sources.

So what is the difference between Unicode and UTF-16?
Both use 16 bits as far as I know.

Unicode (by which I mean the character set, not any encoding of it that
one might also call "Unicode") doesn't use bits at all. It assigns
numbers to characters. An encoding is what defines a mapping between
those characters and a bit-based representation of them in memory or
storage. Since Unicode provides for the assignment of numbers to more
than 65,536 characters, it is automatically impossible to have an
encoding that covers all of Unicode and that encodes every character in
16 bits.
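As a small illustration (just a sketch), a code point above U+FFFF cannot fit in a single 16-bit char:

using System;

class BeyondSixteenBits
{
    static void Main()
    {
        int codePoint = 0x1D11E;                     // a code point above U+FFFF
        string s = char.ConvertFromUtf32(codePoint); // stored as a surrogate pair
        Console.WriteLine(s.Length);                 // 2 chars = 32 bits, not 16
        Console.WriteLine(char.ConvertToUtf32(s, 0) == codePoint); // True
    }
}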
 

Arne Vajhøj

So what is the difference between Unicode and UTF-16?
Both use 16 bits as far as I know.

We had this discussion a few weeks ago.

Unicode is a relationship between the characters in
all the world's languages and numbers called code
points.

Unicode can be encoded using different encodings like UTF-8 and UTF-16.

In UTF-16 a single number (code point) is encoded in 1 or 2
16-bit chars = 2 or 4 8-bit bytes.

For all Western languages it is 1 char and 2 bytes.
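For example, a quick check with Encoding.Unicode (just a sketch, expected output in the comments):

using System;
using System.Text;

class Utf16Sizes
{
    static void Main()
    {
        // Western-language characters: one 16-bit code unit = 2 bytes each.
        Console.WriteLine(Encoding.Unicode.GetByteCount("A"));          // 2
        Console.WriteLine(Encoding.Unicode.GetByteCount("é"));          // 2

        // A supplementary-plane character (an emoji): two code units = 4 bytes.
        Console.WriteLine(Encoding.Unicode.GetByteCount("\U0001F600")); // 4
    }
}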

Arne
 

Mihai N.

So what is the difference between Unicode and UTF-16?
Both use 16 bits as far as I know.

As others already answered, Unicode assigns characters to numbers.
How things get mapped to bytes is another story.
A bit about this here: http://mihai-nita.net/2006/08/06/basic-lingo/

And the first question here http://unicode.org/faq/utf_bom.html
is "Is Unicode a 16-bit encoding?" and the answer starts with "No."
:)

As such, Unicode proper is not represented in 16 bits (or 8, or 32).
Heck, 敥 is still Unicode :)

UTF-8, UTF-16 and UTF-32 (with little-endian and big-endian "flavors")
are all Unicode Transformation Formats, and they are all equivalent.


In some ways this is similar to the fact that 0xA (hex) and 10 (decimal) and
012 (octal) and 1010 (binary) and X (Roman) all represent the same
concept, that of ten, if you want :)
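A tiny sketch of that equivalence: the same text encoded three different ways decodes back to the identical string (byte counts assume the 7-character sample string):

using System;
using System.Text;

class RoundTrip
{
    static void Main()
    {
        string original = "ten: 10"; // 7 characters, all in the BMP

        byte[] utf8  = Encoding.UTF8.GetBytes(original);    //  7 bytes
        byte[] utf16 = Encoding.Unicode.GetBytes(original); // 14 bytes
        byte[] utf32 = Encoding.UTF32.GetBytes(original);   // 28 bytes

        Console.WriteLine(Encoding.UTF8.GetString(utf8) == original);     // True
        Console.WriteLine(Encoding.Unicode.GetString(utf16) == original); // True
        Console.WriteLine(Encoding.UTF32.GetString(utf32) == original);   // True
    }
}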
 
