As the Char datatype in .NET is a 16 bit data type, it doesn't handle
any characters that UCS-2 doesn't handle. As I understand it, that would
make UTF-16 and UCS-2 synonymous in .NET.
No. UTF-16 is a superset of UCS-2, and .NET strings are UTF-16, not UCS-2
(each Char is a UTF-16 code unit, not necessarily a whole character).
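To see the difference in practice, take a character outside the BMP, such as U+1D11E (MUSICAL SYMBOL G CLEF): a UCS-2 system cannot represent it at all, while UTF-16 stores it as two 16-bit code units (a surrogate pair). A quick sketch in Python (the same pair is what a .NET string would hold as two Char values):

```python
# U+1D11E (MUSICAL SYMBOL G CLEF) lies above the BMP,
# so UTF-16 needs a surrogate pair (two 16-bit code units) for it.
clef = "\U0001D11E"
utf16 = clef.encode("utf-16-be")

print(len(utf16))    # 4 bytes = two 16-bit code units
print(utf16.hex())   # d834dd1e -> high surrogate D834, low surrogate DD1E
```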
Short example
You decide initially that 10 digits is enough to encode a certain character
set.
So you can have
0 1 2 3 4 5 6 7 8 9
Later on, you discover this is not true, and you need a way to represent
more. But you have some areas that are not allocated yet in your encoding, so
you can reuse that:
0 1 [ 2 3 4 | 5 6 7 ] 8 9
Let's call the 2-4 range the "high surrogates" and the 5-7 range the "low surrogates".
Then you can represent stuff like this:
0 1 8 9 = 4 values
(you are not allowed to use the surrogate area for real characters)
but you can also represent characters using two code units:
25 26 27 35 36 37 45 46 47 = 9 values
And you have a way to map 25 => 10, 26 => 11, ..., 47 => 18
So you end up being able to represent 13 values!
Covered range:
10 + HighSurrogates * LowSurrogates = 10 + 3 * 3 = 19
And the number of useful codes for encoding (you cannot use the surrogates themselves):
10 + HighSurrogates * LowSurrogates - HighSurrogates - LowSurrogates
= 19 - 3 - 3 = 13 = the number of characters you can now encode
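The toy arithmetic above can be checked in a few lines (a sketch; the variable names are mine, the counts come from the example):

```python
digits = 10  # the original single-code-unit values 0..9
high = 3     # high surrogates: 2, 3, 4
low = 3      # low surrogates: 5, 6, 7

covered = digits + high * low   # 10 + 9 = 19 total code values
usable = covered - high - low   # surrogates themselves encode nothing

print(covered, usable)          # 19 13
```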
Now, for Unicode before the introduction of surrogates you had
0000 - FFFF
But when it became clear that more than FFFF code points were needed,
the mechanism described above was applied (at another scale):
0000 0001 0002 0003 ... D7FF [ D800 - DBFF | DC00 - DFFF ] E000 ... FFFF
D800 - DBFF = high surrogates
DC00 - DFFF = low surrogates
So what you can represent is:
0000 0001 0002 0003 ... D7FF E000 ... FFFF
and you add the characters above the BMP, each written with one high and one low surrogate:
D800 DC00, D800 DC01, ..., D800 DFFF
D801 DC00, D801 DC01, ..., D801 DFFF
...
DBFF DC00, DBFF DC01, ..., DBFF DFFF
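The mapping between a surrogate pair and a code point follows directly from this layout; here is a small sketch (the function names are mine, but the formulas are the standard UTF-16 ones):

```python
def to_surrogate_pair(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into hi/lo surrogates."""
    assert 0x10000 <= cp <= 0x10FFFF
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

def from_surrogate_pair(hi, lo):
    """Combine a high and a low surrogate back into one code point."""
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

print([hex(u) for u in to_surrogate_pair(0x1D11E)])  # ['0xd834', '0xdd1e']
print(hex(from_surrogate_pair(0xD834, 0xDD1E)))      # 0x1d11e
```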
Covered range:
FFFF + ( DBFF - D800 + 1 ) x (DFFF - DC00 + 1 ) =
FFFF + 0400 x 0400 = 10FFFF
Wow! Exactly what is covered by UTF-16! Coincidence?
Number of code points disponible for encoding:
FFFF + 0400 x 0400 - 0400 - 0400 = 10FFFF - 0400 - 0400 = 10F7FF =
1112063 (decimal)
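The same arithmetic, spelled out in code (just redoing the calculation above):

```python
high = 0xDBFF - 0xD800 + 1   # 1024 high surrogates
low = 0xDFFF - 0xDC00 + 1    # 1024 low surrogates

covered = 0xFFFF + high * low   # top of the UTF-16 range
usable = covered - high - low   # surrogates cannot encode characters

print(hex(covered), usable)     # 0x10ffff 1112063
```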
If you read http://www.unicode.org/book/uc20ch1.html you will find
that "more than 1 million characters can be encoded".
Well, the 1112063 value is the "technically possible" value, but you should
exclude reserved areas, private use areas and others.
Anyway, long story short: UCS-2 is what existed before the surrogate
mechanism was introduced.
If an application is surrogate aware, you can say it is UTF-16.
If it is not surrogate aware, then it is probably UCS-2.
And .NET is UTF-16.
=================================================
To answer the original questions:
can someone tell me difference between unicode and utf 8 or utf 18 and
which one is supporting more character set.
There is no UTF-18; you probably mean UTF-16.
Unicode is a "coded character set", basically a mapping from characters to numbers
(A = 0x41, B = 0x42 and so on).
UTF-8 and UTF-16 are different ways of representing this mapping in bytes.
And there is no coverage difference between them.
You can compare it (in a way) to different number bases:
if you say A = 0x41, B = 0x42 in hex,
or A = 65, B = 66 in decimal,
or A = 0101, B = 0102 in octal,
it is the same thing.
So your utf-8, utf-16 question is a bit like asking "hex or decimal, which
one can represent more numbers?" Answer: they are the same.
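You can verify that they cover the same characters: any string round-trips through either encoding, only the byte lengths differ. A quick sketch (the sample string is mine):

```python
s = "AB\u20AC\U0001D11E"  # ASCII, a BMP character (euro sign), a supplementary one

assert ord("A") == 0x41 == 65 == 0o101   # same mapping, different bases

for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    assert s.encode(enc).decode(enc) == s   # no coverage difference

# only the byte counts differ:
print(len(s.encode("utf-8")), len(s.encode("utf-16-le")))  # 9 10
```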
See some official standard here:
http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf#G13708
and here:
http://www.unicode.org/reports/tr17/index.html
or here
http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter04a#96f19a02
whic i should use to support character ucs-2.
I want to use ucs-2 character in streamreader and streamwriter.
Use UTF-16. It is a superset of UCS-2 and it is the encoding used throughout
the .NET API.
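In .NET that means passing Encoding.Unicode (the UTF-16 encoding) to StreamReader/StreamWriter. As an analogy only, here is the same round trip sketched in Python (the file name and sample text are made up):

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "ucs2_demo.txt")  # hypothetical file
text = "h\u00E9llo \U0001D11E"  # includes a supplementary character

# write UTF-16 (with a BOM), as StreamWriter with Encoding.Unicode would
with open(path, "w", encoding="utf-16") as f:
    f.write(text)

# read it back, as StreamReader would
with open(path, "r", encoding="utf-16") as f:
    assert f.read() == text

os.remove(path)
```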
How unicode and utf chacters are stored.
The story is long, but I would send you to the standard (free):
http://www.unicode.org/versions/Unicode4.0.0/bookmarks.html
And if you have to get deep into this, I would recommend
http://www.amazon.com/gp/product/0201700522