Unicode and UTF-8 / UTF-16

archana

Hi all,

Can someone tell me the difference between Unicode and UTF-8 or UTF-18, and
which one supports a larger character set?

Which one should I use to support UCS-2 characters?

I want to use UCS-2 characters with StreamReader and StreamWriter.

How are Unicode and UTF characters stored?

Please help me.

Thanks in advance.
 
Jon Skeet [C# MVP]

archana said:
Can someone tell me the difference between Unicode and UTF-8 or UTF-18, and
which one supports a larger character set?

Which one should I use to support UCS-2 characters?

I want to use UCS-2 characters with StreamReader and StreamWriter.

How are Unicode and UTF characters stored?

See http://www.pobox.com/~skeet/csharp/unicode.html

I'm always hazy about the difference between UCS-2 and UTF-16 - it's
almost certainly to do with surrogate pairs, if there is a difference -
but you can get a UTF-16 encoding with Encoding.Unicode.
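
For instance, a minimal console sketch (the sample string is arbitrary)
showing the UTF-16 bytes that Encoding.Unicode produces:

using System;
using System.Text;

class Utf16Bytes
{
    static void Main()
    {
        // Encoding.Unicode is little-endian UTF-16 in .NET.
        byte[] bytes = Encoding.Unicode.GetBytes("A€");
        // 'A' (U+0041) -> 41 00, '€' (U+20AC) -> AC 20
        Console.WriteLine(BitConverter.ToString(bytes)); // 41-00-AC-20
    }
}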

Jon
 
Göran Andersson

Unicode is a character set, just as UCS is.

UTF-8 and UTF-16 are UCS Transformation Formats. As Unicode and UCS are
effectively synonymous, UTF-8 and UTF-16 are used to encode Unicode strings.

In UTF-16, characters are encoded as 16-bit sequences (two bytes).
UTF-16 and UCS-2 are identical for all characters that UCS-2 handles.
You can treat UCS-2 data as UTF-16 without any problems.

In UTF-8, the most common characters are encoded as 8-bit sequences (one
byte); other characters take two or three bytes (or four, outside the BMP).

As the character type in .NET is a 16-bit Unicode character, it's
synonymous with the UCS BMP (Basic Multilingual Plane) that UCS-2 handles.

In conclusion, in .NET the Unicode and UCS BMP character sets are the
same, and UCS-2 and UTF-16 are the same.

There is no encoding in UCS that corresponds to UTF-8. If you export
data to something that only handles UCS, you have to use UTF-16.
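
A quick sketch to see those sizes (the sample characters are arbitrary;
the byte counts come straight from Encoding):

using System;
using System.Text;

class EncodingSizes
{
    static void Main()
    {
        string[] samples = { "A", "é", "€" }; // U+0041, U+00E9, U+20AC
        foreach (string s in samples)
        {
            // UTF-8: 1, 2 and 3 bytes; UTF-16: 2 bytes each (all are in the BMP).
            Console.WriteLine("{0}: UTF-8 = {1} byte(s), UTF-16 = {2} byte(s)",
                s, Encoding.UTF8.GetByteCount(s), Encoding.Unicode.GetByteCount(s));
        }
    }
}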
 
Göran Andersson

Jon said:
See http://www.pobox.com/~skeet/csharp/unicode.html

I'm always hazy about the difference between UCS-2 and UTF-16 - it's
almost certainly to do with surrogate pairs, if there is a difference -
but you can get a UTF-16 encoding with Encoding.Unicode.

Jon

From what I can gather, the only difference is that UTF-16 is capable
of encoding the full range of Unicode code points (up to U+10FFFF), while
UCS-2 only handles the 16-bit range specified as the UCS BMP (Basic
Multilingual Plane).

As the Char datatype in .NET is a 16-bit data type, it doesn't handle
any characters that UCS-2 doesn't handle. As I understand it, that would
make UTF-16 and UCS-2 synonymous in .NET.
 
Barry Kelly

Göran Andersson said:
From what I can gather, the only difference is that UTF-16 is capable
of encoding the full range of Unicode code points (up to U+10FFFF), while
UCS-2 only handles the 16-bit range specified as the UCS BMP (Basic
Multilingual Plane).

As the Char datatype in .NET is a 16-bit data type, it doesn't handle
any characters that UCS-2 doesn't handle. As I understand it, that would
make UTF-16 and UCS-2 synonymous in .NET.

.NET chars have surrogate pair forms (check out Char.IsHighSurrogate()
and Char.IsLowSurrogate()) combining two characters to form a single
abstract character. Thus, the number of physical characters in a .NET
string may be greater than the number of actual, abstract characters.
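
For example, a minimal sketch (U+1D11E, the musical G clef, is a handy
non-BMP test character):

using System;
using System.Globalization;

class SurrogateDemo
{
    static void Main()
    {
        string s = char.ConvertFromUtf32(0x1D11E); // one abstract character
        Console.WriteLine(s.Length);                               // 2 (two chars)
        Console.WriteLine(char.IsHighSurrogate(s[0]));             // True
        Console.WriteLine(char.IsLowSurrogate(s[1]));              // True
        // StringInfo counts text elements rather than chars.
        Console.WriteLine(new StringInfo(s).LengthInTextElements); // 1
    }
}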

-- Barry
 
Mihai N.

As the Char datatype in .NET is a 16-bit data type, it doesn't handle
any characters that UCS-2 doesn't handle. As I understand it, that would
make UTF-16 and UCS-2 synonymous in .NET.

No. UTF-16 is a superset of UCS-2. And .NET is UTF-16, not UCS-2.

Short example:
You decide initially that 10 digits are enough to encode a certain character
set.
So you can have
0 1 2 3 4 5 6 7 8 9

Later on, you discover this is not true, and you need a way to represent
more. But you have some areas that are not yet allocated in your encoding, so
you can reuse those:
0 1 [ 2 3 4 | 5 6 7 ] 8 9
Let's call the 2-4 range the "high surrogate" area and the 5-7 range the "low surrogate" area.

Then you can represent stuff like this:
0 1 8 9 = 4 values
(you are not allowed to use the surrogate area for real characters)
but you can also represent characters using two code units:
25 26 27 35 36 37 45 46 47 = 9 values
And you have a way to map 25 => 10, 26 => 11, ..., 47 => 18

So you end up being able to represent 13 values!

This is 10 + HighSurrogates * LowSurrogates =
= 10 + 3 * 3 = 10 + 9 = 19 = covered range
And the number of useful codes for encoding (you cannot use the surrogates
themselves):
= 10 + HighSurrogates * LowSurrogates - HighSurrogates - LowSurrogates
= 19 - 3 - 3 = 13 = number of characters that you can now encode
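
If you want to check that count by brute force, here is a tiny sketch of
the toy scheme (the names and layout are mine, not part of any standard):

using System;

class ToyScheme
{
    static void Main()
    {
        int count = 0;
        // Single digits outside the surrogate areas 2-4 and 5-7: 0 1 8 9.
        for (int d = 0; d <= 9; d++)
            if (d < 2 || d > 7) count++;             // 4 values
        // Pairs: a high surrogate (2-4) followed by a low surrogate (5-7).
        for (int hi = 2; hi <= 4; hi++)
            for (int lo = 5; lo <= 7; lo++) count++; // 9 values
        Console.WriteLine(count);                    // 13
    }
}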


Now, for Unicode before the surrogate introduction you had
0000 - FFFF
But when it proved that more than FFFF code points were needed,
the mechanism described above was created (at another scale):
0000 0001 0002 0003 ... D7FF [ D800 - DBFF | DC00 - DFFF ] E000 ... FFFF
D800 - DBFF = high surrogates
DC00 - DFFF = low surrogates

So what you can represent is:
0000 0001 0002 0003 ... D7FF E000 ... FFFF
and you add the stuff above the BMP with one high and one low surrogate:
D800 DC00  D800 DC01  ....  D800 DFFF
D801 DC00  D801 DC01  ....  D801 DFFF
...
DBFF DC00  DBFF DC01  ....  DBFF DFFF

Covered range (the highest code point):
FFFF + ( DBFF - D800 + 1 ) x ( DFFF - DC00 + 1 ) =
FFFF + 0400 x 0400 = 10FFFF
Wow! Exactly what is covered by UTF-16! Coincidence?

Number of code points available for encoding (take the 110000 code points
from 0000 to 10FFFF, then remove the 0800 surrogates):
110000 - 0400 - 0400 = 10F800 = 1112064 (decimal)
If you read http://www.unicode.org/book/uc20ch1.html you will find
that "more than 1 million characters can be encoded".
Well, the 1112064 value is the "technically possible" value, but you should
exclude reserved areas, private use areas and others.


Anyway, long story short: UCS-2 is what you had before the surrogate
mechanism was introduced.
When an application is surrogate-aware, you can say it is UTF-16.
If it is not surrogate-aware, then it is probably UCS-2.

And .NET is UTF-16.
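
A small sketch of the real-scale mapping (the formula is the standard
surrogate decomposition; char.ConvertToUtf32 does the same work for you):

using System;

class SurrogateMath
{
    static void Main()
    {
        string s = "\uD834\uDD1E"; // a high surrogate followed by a low surrogate
        char hi = s[0], lo = s[1];
        // code point = 0x10000 + (hi - 0xD800) * 0x400 + (lo - 0xDC00)
        int cp = 0x10000 + (hi - 0xD800) * 0x400 + (lo - 0xDC00);
        Console.WriteLine("U+{0:X}", cp);                          // U+1D11E
        Console.WriteLine("U+{0:X}", char.ConvertToUtf32(hi, lo)); // U+1D11E
    }
}
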
=================================================
To answer the original questions:
Can someone tell me the difference between Unicode and UTF-8 or UTF-18, and
which one supports a larger character set?

There is no UTF-18; you mean UTF-16.
Unicode is a "coded character set", basically a mapping of characters to
numbers (A=0x41, B=0x42 and so on).
UTF-8 and UTF-16 are different ways of representing this mapping.
And there is no coverage difference.

You can compare it (in a way) to numeral systems with various bases.
If you say A=0x41, B=0x42 in hex,
or if you say A=65, B=66 in decimal,
or if you say A=0101, B=0102 in octal,
it is the same thing.
So your UTF-8 vs. UTF-16 question is a bit like asking "hex or decimal, which
one can represent more numbers?" Answer: they are the same.
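
The same point in code, a minimal sketch (the sample text is arbitrary):
UTF-8 and UTF-16 give different bytes for the same string, but they decode
back to exactly the same text:

using System;
using System.Text;

class SameTextTwoEncodings
{
    static void Main()
    {
        string text = "Aé€";
        byte[] utf8  = Encoding.UTF8.GetBytes(text);    // 41-C3-A9-E2-82-AC
        byte[] utf16 = Encoding.Unicode.GetBytes(text); // 41-00-E9-00-AC-20
        Console.WriteLine(BitConverter.ToString(utf8));
        Console.WriteLine(BitConverter.ToString(utf16));
        // Different representations, same text:
        Console.WriteLine(Encoding.UTF8.GetString(utf8) ==
                          Encoding.Unicode.GetString(utf16)); // True
    }
}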


See the official standard here:
http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf#G13708
and here:
http://www.unicode.org/reports/tr17/index.html
or here
http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter04a#96f19a02

Which one should I use to support UCS-2 characters?
I want to use UCS-2 characters with StreamReader and StreamWriter.

Use UTF-16. It is a superset of UCS-2 and it is the one supported by all the
.NET API.
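
A minimal sketch (the file name is just an example):

using System;
using System.IO;
using System.Text;

class Utf16File
{
    static void Main()
    {
        // Encoding.Unicode = UTF-16 little-endian; StreamWriter emits a BOM by default.
        using (StreamWriter w = new StreamWriter("demo.txt", false, Encoding.Unicode))
        {
            w.WriteLine("Hello, UCS-2/UTF-16 world");
        }
        using (StreamReader r = new StreamReader("demo.txt", Encoding.Unicode))
        {
            Console.WriteLine(r.ReadLine());
        }
    }
}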

How are Unicode and UTF characters stored?
The story is long, but I would send you to the standard (free):
http://www.unicode.org/versions/Unicode4.0.0/bookmarks.html
And if you have to get deep into this, I would recommend
http://www.amazon.com/gp/product/0201700522
 
