Universal String (4 Byte Canonical Encoding) and UTF-32


Jeffrey Walton

Hi All,

BMP Strings are a subset of Universal Strings. The BMP String uses
approximately 65,000 code points from the Universal String encoding.
BMP String: ISO/IEC 10646, 2-octet canonical form. Universal String:
ISO/IEC 10646, 4-octet canonical form.
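
To make the subset relationship concrete, here is a minimal sketch in Python (not .NET; it is only meant to show the byte layout). It assumes big-endian octet order for both forms, and the helper names two_octet_form and four_octet_form are made up for this illustration. For any BMP code point, the 4-octet form is just the 2-octet form with two leading zero octets.

```python
def two_octet_form(ch):
    """Big-endian 2-octet form; only defined for BMP code points."""
    cp = ord(ch)
    # Surrogate values and anything above U+FFFF are not single 2-octet characters.
    if cp > 0xFFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not representable as a single 2-octet value")
    return cp.to_bytes(2, "big")


def four_octet_form(ch):
    """Big-endian 4-octet form; defined for every code point."""
    return ord(ch).to_bytes(4, "big")


# For BMP characters, the 4-octet form is the 2-octet form zero-extended.
for ch in "A\u00e9\u20ac":
    assert four_octet_form(ch) == b"\x00\x00" + two_octet_form(ch)
    print(hex(ord(ch)), two_octet_form(ch).hex(), four_octet_form(ch).hex())
```

A character outside the BMP, such as U+1D11E, still has a 4-octet form, but two_octet_form raises ValueError for it.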

An excellent discussion occurred with respect to BMP Strings and .Net
(see http://groups.google.com/group/micr...csharp/browse_thread/thread/f18fcb62156a1a0c/).
The discussion ended with the statement, "UTF-16 is a superset of
UCS2."

Can we use UTF-32 for UCS4 [Universal String, 4-octet canonical form]
in the same manner as was justified in the previously mentioned thread
(UTF-16/UCS2)?

Thanks,
Jeff
 
Jeffrey Walton said:
BMP Strings are a subset of Universal Strings. The BMP String uses
approximately 65,000 code points from the Universal String encoding.
BMP String: ISO/IEC 10646, 2-octet canonical form. Universal String:
ISO/IEC 10646, 4-octet canonical form.

An excellent discussion occurred with respect to BMP Strings and .Net
(see http://groups.google.com/group/microsoft.public.dotnet.languages.csharp/browse_thread/thread/f18fcb62156a1a0c/).
The discussion ended with the statement, "UTF-16 is a superset of
UCS2."

Can we use UTF-32 for UCS4 [Universal String, 4-octet canonical form]
in the same manner as was justified in the previously mentioned thread
(UTF-16/UCS2)?

It's not quite clear to me how you want to use UTF-32. I have a
Utf32String class which is probably full of bugs (I've never really
used it) but you're welcome to it - it's part of the library at
http://pobox.com/~skeet/csharp/miscutil

You can use UTF-16 to cover the same range of values, however, using
surrogate pairs. The System.String class doesn't have a *lot* of
support for this though - it's not exactly easy to work with things
outside the BMP.
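
For anyone unfamiliar with surrogate pairs, the arithmetic can be sketched as below. This is Python rather than C#, purely to show the calculation; the function name to_surrogate_pair is made up for the example and is not part of any library. The result is cross-checked against Python's built-in UTF-16 encoder.

```python
def to_surrogate_pair(code_point):
    """Split a code point above the BMP into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF use surrogate pairs")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
high, low = to_surrogate_pair(0x1D11E)

# Cross-check against the built-in encoder: two 16-bit code units.
encoded = chr(0x1D11E).encode("utf-16-be")
units = (int.from_bytes(encoded[0:2], "big"), int.from_bytes(encoded[2:4], "big"))
assert (high, low) == units
print(hex(high), hex(low))
```

The awkward part is that one character now occupies two 16-bit units, so any code that indexes or counts by 16-bit unit has to be aware of pairs.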

Are you doing a lot of work requiring non-BMP characters?
 
An excellent discussion occurred with respect to BMP Strings and .Net
(see http://groups.google.com/group/microsoft.public.dotnet.languages.csharp/browse_thread/thread/f18fcb62156a1a0c/).
The discussion ended with the statement, "UTF-16 is a superset of
UCS2."

This part did not change since July :-)

Can we use UTF-32 for UCS4 [Universal String, 4-octet canonical form]
in the same manner as was justified in the previously mentioned thread
(UTF-16/UCS2)?

You can consider UTF-32 to be the same thing as UCS4
(while UTF-16 is a superset of UCS2).
There are no surrogates and nothing tricky in UTF-32.
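
A small sketch of what "nothing tricky" means in practice, written in Python only for illustration: every code point is exactly one 4-octet unit, so decoding is a plain unpack of 32-bit integers. The helper decode_four_octet is hypothetical, and big-endian order is assumed.

```python
import struct


def decode_four_octet(data):
    """Decode big-endian 4-octet units, one code point per unit."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4 octets")
    code_points = struct.unpack(">%dI" % (len(data) // 4), data)
    return "".join(chr(cp) for cp in code_points)


# Three characters: two in the BMP, one outside it (U+1D11E).
sample = "A\u20ac\U0001d11e"
utf32 = sample.encode("utf-32-be")
utf16 = sample.encode("utf-16-be")

utf32_units = len(utf32) // 4   # one unit per character
utf16_units = len(utf16) // 2   # the non-BMP character needs a surrogate pair

# The manual unpack gives back the original text, so unit count equals character count.
assert decode_four_octet(utf32) == sample
print(utf32_units, utf16_units)
```

In UTF-32 the unit count matches the number of code points, while the UTF-16 count is one higher because of the surrogate pair.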

In general, UCS is the term used by ISO/IEC 10646, while UTF is Unicode lingo.

My personal rule: when in doubt, I go to the official source:
http://www.unicode.org/versions/Unicode5.0.0/appC.pdf
"As a consequence, UCS-4 can now be taken effectively as an alias
for the Unicode encoding form UTF-32, except that UTF-32 has the
extra requirement that additional Unicode semantics be observed
for all characters."

And a bit further down (C.6):
"In the framework of the Unicode Standard, character semantics
are indicated via character properties, functional specifications,
usage annotations, and name aliases;"

In fact, the whole C.4-C.7 range is interesting for this topic.
 