Unicode in .NET


Tony Johansson

Hi!

First of all this statement is from MSDN ".NET Framework uses Unicode UTF-16
to represent characters.
In some cases, the .NET Framework uses UTF-8 internally."

Unicode representation is used for the char data type; for example, if I check
the char data type with sizeof, it reports 2 bytes.
Unicode uses a signed 16-bit integer.

Now let us look at the string data type, which is a reference type.
Here I store the string literal "s" in a byte array.
The size of the byte array is 2 bytes, because Unicode uses 2 bytes:
byte[] myByte = Encoding.Unicode.GetBytes("s");
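For reference, here is a minimal console sketch of the two checks above (the class name is only illustrative):

using System;
using System.Text;

class SizeCheck
{
    static void Main()
    {
        // sizeof(char) is a compile-time constant: one UTF-16 code unit = 2 bytes.
        Console.WriteLine(sizeof(char));                 // prints 2

        // Encoding.Unicode is the UTF-16 (little-endian) encoding.
        byte[] myByte = Encoding.Unicode.GetBytes("s");
        Console.WriteLine(myByte.Length);                // prints 2
    }
}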


Now to the question that I have a hard time understanding. Remember the
statement that I quoted at the top of this mail:
".NET Framework uses Unicode UTF-16 to represent characters. In some cases,
the .NET Framework uses UTF-8 internally."
So if I write this simple statement
string myString = "s";
will Unicode be used to store the single string literal "s"?

If the answer is no, then I have a follow-up question: a string is a sequence
of char, and as we saw earlier a char uses Unicode with 2 bytes, so why would
each char in a string not use 2 bytes?


//Tony
 

Alberto Poblacion

Tony Johansson said:
So if I write this simple statement
string myString = "s";
will Unicode be used to store the single string literal "s"?

Yes, that is right. This simple "s" is stored as Unicode, and it will
be using two bytes in memory, since that is how the String stores it
internally.
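A quick sketch of how one might check this from the outside (it does not look at String internals, it just counts the UTF-16 code units; names are illustrative):

using System;

class StringBytes
{
    static void Main()
    {
        string myString = "s";
        // Each char of a .NET string is one UTF-16 code unit, i.e. 2 bytes.
        Console.WriteLine(myString.Length * sizeof(char)); // 2 bytes of character data
        Console.WriteLine((int)myString[0]);               // 115 = U+0073, fits in 16 bits
    }
}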
 

Harlan Messinger

Tony said:
Hi!

First of all this statement is from MSDN ".NET Framework uses Unicode UTF-16
to represent characters.
In some cases, the .NET Framework uses UTF-8 internally."

Unicode representation is used for the char data type; for example, if I check
the char data type with sizeof, it reports 2 bytes.
Unicode uses a signed 16-bit integer.
Unsigned.


Now let us look at the string data type, which is a reference type.
Here I store the string literal "s" in a byte array.
The size of the byte array is 2 bytes, because Unicode uses 2 bytes:
byte[] myByte = Encoding.Unicode.GetBytes("s");


Now to the question that I have a hard time understanding. Remember the
statement that I quoted at the top of this mail:
".NET Framework uses Unicode UTF-16 to represent characters. In some cases,
the .NET Framework uses UTF-8 internally."
So if I write this simple statement
string myString = "s";
will Unicode be used to store the single string literal "s"?

If the answer is no, then I have a follow-up question: a string is a sequence
of char, and as we saw earlier a char uses Unicode with 2 bytes, so why would
each char in a string not use 2 bytes?

It *is* two bytes, and it is Unicode. And everything you cited says it
would be. So I'm curious what is leading you to ask this question with
the expectation that you're going to be told it's only one byte?
 

Mihai N.

Unicode representation is used for the char data type; for example, if I check
the char data type with sizeof, it reports 2 bytes.
Unicode uses a signed 16-bit integer.

No, it is UTF-16 (the encoding used by .NET) that uses a 16-bit integer,
not Unicode itself.
Not very important here, but that is the correct statement.

Now let us look at the string data type, which is a reference type.
Here I store the string literal "s" in a byte array.
The size of the byte array is 2 bytes, because Unicode uses 2 bytes:
byte[] myByte = Encoding.Unicode.GetBytes("s");

No. The size of the byte array depends on the encoding you convert to.
It has nothing to do with the String representation anymore.
Convert with UTF32Encoding and it will take 4 bytes; convert with ASCIIEncoding
and it will be 1 byte.
It is just the confusion between "Unicode" and UTF-16.
A better name for that encoding would have been UTF16Encoding.
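A small sketch that shows how the byte count follows whatever encoding you pick, not the String (the expected output is in the comments):

using System;
using System.Text;

class EncodingSizes
{
    static void Main()
    {
        string s = "s";
        Console.WriteLine(Encoding.Unicode.GetBytes(s).Length); // 2 (UTF-16, despite the name "Unicode")
        Console.WriteLine(Encoding.UTF8.GetBytes(s).Length);    // 1
        Console.WriteLine(Encoding.UTF32.GetBytes(s).Length);   // 4
        Console.WriteLine(Encoding.ASCII.GetBytes(s).Length);   // 1
    }
}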

Now to the question that I have a hard time understanding. Remember the
statement that I quoted at the top of this mail:
".NET Framework uses Unicode UTF-16 to represent characters. In some cases,
the .NET Framework uses UTF-8 internally."
So if I write this simple statement
string myString = "s";
will Unicode be used to store the single string literal "s"?

Technically, you can't know, because that is an internal .NET implementation
detail. Nobody has explained when .NET uses UTF-8 and when it uses UTF-16
internally.
In theory all strings might be UTF-8, and then "s" would take one byte.
In reality it is very likely to be UTF-16, so 2 bytes.
But you can't know for sure, unless you dig into the .NET sources.
 

Tony Johansson

Mihai N. said:
Unicode representation is used for the char data type; for example, if I check
the char data type with sizeof, it reports 2 bytes.
Unicode uses a signed 16-bit integer.

No, it is UTF-16 (the encoding used by .NET) that uses a 16-bit integer,
not Unicode itself.
Not very important here, but that is the correct statement.

Now let us look at the string data type, which is a reference type.
Here I store the string literal "s" in a byte array.
The size of the byte array is 2 bytes, because Unicode uses 2 bytes:
byte[] myByte = Encoding.Unicode.GetBytes("s");

No. The size of the byte array depends on the encoding you convert to.
It has nothing to do with the String representation anymore.
Convert with UTF32Encoding and it will take 4 bytes; convert with ASCIIEncoding
and it will be 1 byte.
It is just the confusion between "Unicode" and UTF-16.
A better name for that encoding would have been UTF16Encoding.

Now to the question that I have a hard time understanding. Remember the
statement that I quoted at the top of this mail:
".NET Framework uses Unicode UTF-16 to represent characters. In some cases,
the .NET Framework uses UTF-8 internally."
So if I write this simple statement
string myString = "s";
will Unicode be used to store the single string literal "s"?

Technically, you can't know, because that is an internal .NET implementation
detail. Nobody has explained when .NET uses UTF-8 and when it uses UTF-16
internally.
In theory all strings might be UTF-8, and then "s" would take one byte.
In reality it is very likely to be UTF-16, so 2 bytes.
But you can't know for sure, unless you dig into the .NET sources.

So what is the difference between Unicode and UTF-16?
Both use 16 bits as far as I know.

//Tony
 

Alberto Poblacion

Tony Johansson said:
So what is the difference between Unicode and UTF-16?
Both use 16 bits as far as I know.

Unicode specifies the characters as "code points". There are more than a
million code points, in the range 0 to 10FFFF hex. The part of Unicode known
as the "Basic Multilingual Plane" (BMP) uses only four hex digits to specify
each code point. For instance, U+005A is a capital Z. These characters can
fit into 16 bits. The rest of the code points are arranged in supplementary
planes (up to 16 planes, not all of them currently assigned) and require a
surrogate pair to encode them in UTF-16 (and four bytes in UTF-8).

So, the "both use 16 bits" is not always true; only the characters in
the BMP can be encoded with 16 bits.
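A short sketch of the BMP versus supplementary-plane difference (the G clef character is just an example of a code point outside the BMP):

using System;
using System.Text;

class Planes
{
    static void Main()
    {
        string z = "\u005A";        // 'Z': a BMP code point, one 16-bit code unit
        string clef = "\U0001D11E"; // MUSICAL SYMBOL G CLEF: a supplementary-plane code point

        Console.WriteLine(z.Length);                            // 1
        Console.WriteLine(clef.Length);                         // 2 (a surrogate pair)
        Console.WriteLine(char.IsSurrogatePair(clef, 0));       // True
        Console.WriteLine(Encoding.UTF8.GetBytes(clef).Length); // 4
    }
}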
 

Harlan Messinger

Tony said:
Mihai N. said:
Unicode representation is used for the char data type; for example, if I check
the char data type with sizeof, it reports 2 bytes.
Unicode uses a signed 16-bit integer.

No, it is UTF-16 (the encoding used by .NET) that uses a 16-bit integer,
not Unicode itself.
Not very important here, but that is the correct statement.

Now let us look at the string data type, which is a reference type.
Here I store the string literal "s" in a byte array.
The size of the byte array is 2 bytes, because Unicode uses 2 bytes:
byte[] myByte = Encoding.Unicode.GetBytes("s");

No. The size of the byte array depends on the encoding you convert to.
It has nothing to do with the String representation anymore.
Convert with UTF32Encoding and it will take 4 bytes; convert with ASCIIEncoding
and it will be 1 byte.
It is just the confusion between "Unicode" and UTF-16.
A better name for that encoding would have been UTF16Encoding.

Now to the question that I have a hard time understanding. Remember the
statement that I quoted at the top of this mail:
".NET Framework uses Unicode UTF-16 to represent characters. In some cases,
the .NET Framework uses UTF-8 internally."
So if I write this simple statement
string myString = "s";
will Unicode be used to store the single string literal "s"?

Technically, you can't know, because that is an internal .NET implementation
detail. Nobody has explained when .NET uses UTF-8 and when it uses UTF-16
internally.
In theory all strings might be UTF-8, and then "s" would take one byte.
In reality it is very likely to be UTF-16, so 2 bytes.
But you can't know for sure, unless you dig into the .NET sources.

So what is the difference between Unicode and UTF-16?
Both use 16 bits as far as I know.

Unicode (by which I mean the character set, not any encoding of it that
one might also call "Unicode") doesn't use bits at all. It assigns
numbers to characters. An encoding is what defines a mapping between
those characters and a bit-based representation of them in memory or
storage. Since Unicode provides for the assignment of numbers to more
than 65,536 characters, it is automatically impossible to have an
encoding that covers all of Unicode and that encodes every character in
16 bits.
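As a small illustration (just a sketch), a code point above U+FFFF cannot fit in a single 16-bit char:

using System;

class BeyondSixteenBits
{
    static void Main()
    {
        int codePoint = 0x1D11E;                     // a code point above U+FFFF
        string s = char.ConvertFromUtf32(codePoint); // stored as a surrogate pair
        Console.WriteLine(s.Length);                 // 2 chars = 32 bits, not 16
        Console.WriteLine(char.ConvertToUtf32(s, 0) == codePoint); // True
    }
}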
 

Arne Vajhøj

So what is the difference between Unicode and UTF-16?
Both use 16 bits as far as I know.

We had this discussion a few weeks ago.

Unicode is a relationship between the characters in
all the world's languages and numbers called code
points.

Unicode can be encoded using different encodings like UTF-8 and UTF-16.

In UTF-16 a single number (code point) is encoded in 1 or 2
16-bit chars = 2 or 4 8-bit bytes.

For all Western languages it is 1 char and 2 bytes.
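For example, a quick check with Encoding.Unicode (just a sketch, expected output in the comments):

using System;
using System.Text;

class Utf16Sizes
{
    static void Main()
    {
        // Western-language characters: one 16-bit code unit = 2 bytes each.
        Console.WriteLine(Encoding.Unicode.GetByteCount("A"));          // 2
        Console.WriteLine(Encoding.Unicode.GetByteCount("é"));          // 2

        // A supplementary-plane character (an emoji): two code units = 4 bytes.
        Console.WriteLine(Encoding.Unicode.GetByteCount("\U0001F600")); // 4
    }
}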

Arne
 

Mihai N.

So what is the difference between Unicode and UTF-16?
Both use 16 bits as far as I know.

As others already answered, Unicode assigns characters to numbers.
How things get mapped to bytes is another story.
A bit about this here: http://mihai-nita.net/2006/08/06/basic-lingo/

And the first question here http://unicode.org/faq/utf_bom.html
is "Is Unicode a 16-bit encoding?" and the answer starts with "No."
:)

As such, Unicode proper is not represented in 16 bits (or 8, or 32).
Heck, 敥 is still Unicode :)

UTF-8, UTF-16 and UTF-32 (with little-endian and big-endian "flavors")
are all Unicode Transformation Formats, and they are all equivalent.


In some ways this is similar to the fact that 0xA (hex) and 10 (decimal) and
012 (octal) and 1010 (binary) and X (Roman) all represent the same
concept, that of ten, if you want :)
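A tiny sketch of that equivalence: the same text encoded three different ways decodes back to the identical string (byte counts assume the 7-character sample string):

using System;
using System.Text;

class RoundTrip
{
    static void Main()
    {
        string original = "ten: 10"; // 7 characters, all in the BMP

        byte[] utf8  = Encoding.UTF8.GetBytes(original);    //  7 bytes
        byte[] utf16 = Encoding.Unicode.GetBytes(original); // 14 bytes
        byte[] utf32 = Encoding.UTF32.GetBytes(original);   // 28 bytes

        Console.WriteLine(Encoding.UTF8.GetString(utf8) == original);     // True
        Console.WriteLine(Encoding.Unicode.GetString(utf16) == original); // True
        Console.WriteLine(Encoding.UTF32.GetString(utf32) == original);   // True
    }
}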
 
