About encoding UTF-8 and UTF-16

Tony Johansson

Hi!

This character "ñ" is represented as 241 in UTF-16.
Its code point is U+00F1.
This is 0xF1 (or 241 decimal) in UTF-16 or UTF-32, and C3 B1
(195 177 decimal) as UTF-8.

My first question.
When the UTF-8 encoding uses 2 bytes, is it then common that UTF-16 has zeros
in the high-order byte, as in this case where 241 fits in one byte?

My second question: does a code page include all the Unicode standards UTF-8,
UTF-16 and UTF-32? If not, where is, for example, this character "ñ" defined
for the different Unicode standards?

//Tony
 
Peter Duniho

Tony said:
Hi!

This character "ñ" is represented as 241 in UTF-16.
Its code point is U+00F1.
This is 0xF1 (or 241 decimal) in UTF-16 or UTF-32, and C3 B1
(195 177 decimal) as UTF-8.

My first question.
When the UTF-8 encoding uses 2 bytes, is it then common that UTF-16 has zeros
in the high-order byte, as in this case where 241 fits in one byte?

Define "common". But absent a specific definition from you, I'd say
"no". UTF-8 can easily represent far more than 256 different characters
using two bytes, while 256 is the absolute theoretical maximum that
UTF-16 could represent with a value having the form 00xx, where "x" is a
hexadecimal digit.
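
For example, a quick C# sketch (the Cyrillic character is just an
illustrative pick, not something from the thread):

using System;
using System.Text;

class Demo
{
    static void Main()
    {
        // 'ñ' (U+00F1): two bytes in UTF-8, and a zero high-order byte in UTF-16.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("ñ")));             // C3-B1
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes("ñ"))); // 00-F1

        // 'Ч' (U+0427): also two bytes in UTF-8, but a non-zero high-order byte in UTF-16.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("Ч")));             // D0-A7
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes("Ч"))); // 04-27
    }
}

Both characters need two UTF-8 bytes, but only the first has the 00xx form
in UTF-16.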
My second question: does a code page include all the Unicode standards UTF-8,
UTF-16 and UTF-32? If not, where is, for example, this character "ñ" defined
for the different Unicode standards?

In general, Unicode is a superset of each of the various code pages.
So, no…a given code page is not going to be able to include all of the
Unicode characters.

Finally, note as has been mentioned before: UTF-8, -16, and -32 are
_encodings_ for Unicode, while Unicode is the character set. Each of
the encodings can represent all of the characters in Unicode, and the
actual code point within Unicode for any given character is always the
same. Only the value in a specific encoding changes, and you can find
ALL of this information on the http://www.unicode.org/ web site
(including, for example, code point and encoding values for a given
character, such as 'ñ').
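
To see that in code, a minimal C# sketch using the 'ñ' from the question:

using System;
using System.Text;

class Demo
{
    static void Main()
    {
        string s = "ñ";

        // The code point is a property of the character set, not of any encoding.
        Console.WriteLine("U+{0:X4}", char.ConvertToUtf32(s, 0)); // U+00F1

        // Only the encoded byte sequences differ.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(s)));             // C3-B1
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(s))); // 00-F1
    }
}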

Pete
 
Jeff Johnson

This character "ñ" is represented as 241 in UTF-16.
Its code point is U+00F1.
This is 0xF1 (or 241 decimal) in UTF-16 or UTF-32, and C3 B1
(195 177 decimal) as UTF-8.

My first question.
When the UTF-8 encoding uses 2 bytes, is it then common that UTF-16 has zeros
in the high-order byte, as in this case where 241 fits in one byte?

Honestly Tony, in this case, who cares? I realize that you're trying to
learn things but I think you need to pick and choose what you want to dive
deeply into, and in my opinion the internal workings of UTF-8 and UTF-16
shouldn't concern you. UTF-8 is a middleman. It exists to bridge the gap
between single-byte code pages and the new, global world of Unicode. Data
stored in UTF-8 is almost always translated into something else (like .NET
translates everything to UTF-16) so you should really only know how to USE
UTF-8 without worrying about its guts. (Unless you're trying to write your
own UTF-8 encoder/decoder, of course.)
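
In that spirit, here is a minimal sketch of "using UTF-8 without worrying
about its guts" in .NET (the file name is arbitrary):

using System.IO;
using System.Text;

class Demo
{
    static void Main()
    {
        // .NET converts between UTF-8 bytes on disk and UTF-16 strings
        // in memory; you never touch the UTF-8 byte patterns yourself.
        File.WriteAllText("demo.txt", "año: ñ", Encoding.UTF8);
        string text = File.ReadAllText("demo.txt", Encoding.UTF8);
    }
}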
My second question: does a code page include all the Unicode standards UTF-8,
UTF-16 and UTF-32? If not, where is, for example, this character "ñ" defined
for the different Unicode standards?

Code pages do not "include Unicode standards." Let me see if I can come up
with a good analogy.

If you know anything about bitmaps, you know that there are indexed bitmaps
and true-color bitmaps. An indexed bitmap is like a paint-by-number set
(assuming you're old enough to remember those things and they existed where
you grew up). You have a limited supply of colors and each color is mapped
to a number (the index). Perhaps 0 = Red, 1 = White, 2 = Purple, etc. You
cannot use any color outside the range of your given color palette. Let's
say this color palette has 256 entries for this example, and therefore each
index value fits nicely into one byte. You define your bitmap by specifying
a bunch of indexes (bytes) that indicate which color is to be applied to
each pixel. So your bitmap data might contain 0 0 0 2 2 1, meaning three
pixels of red, two pixels of purple, and one pixel of white. Six pixels, six
bytes. Moderately compact.

On the other hand, in a true color image each pixel is represented by three
(or four, if you want transparency) bytes. There is no color palette because
those bytes can represent any color. So now your six pixels look like this:
0xFF0000 0xFF0000 0xFF0000 0xFF00FF 0xFF00FF 0xFFFFFF. Six pixels, 18 bytes.
Big, but flexible as far as colors go.

A code page is like an indexed image. Single-byte code pages contain 256
"slots," each of which can represent a character (a glyph). Each code page
has a table somewhere which tells it how to map each index (0 - 255) to a
specific Unicode character (identified by its "code point").

Unicode itself is like the entire color spectrum (or at least it's pretty
close).

The Windows-1252 code page (Latin 1 or something like that) maps 65 -> A
(U+0041), 34 -> " (U+0022), 42 -> * (U+002A), and so on. Many other code
pages have similar mappings for indexes 0 - 127, but when you get to 128 -
255 you tend to see more variation. For example, and I'm totally making this
up, a Russian code page might map 165 to U+0427 whereas a Spanish code page
might map it to your ñ, U+00F1.
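
That mapping is easy to poke at from C# (a sketch; note that on .NET Core/5+
the 1252 code page additionally requires registering the provider from the
System.Text.Encoding.CodePages package):

using System;
using System.Text;

class Demo
{
    static void Main()
    {
        // On .NET Core/5+ first: Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding cp1252 = Encoding.GetEncoding(1252);

        // Index 241 in the Windows-1252 table maps to code point U+00F1.
        Console.WriteLine(cp1252.GetString(new byte[] { 241 }));        // ñ
        Console.WriteLine(BitConverter.ToString(cp1252.GetBytes("ñ"))); // F1
    }
}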

UTF-8, on the other hand, is not a mapping but rather an encoding, which
takes a Unicode code point and stores it in 1 to 4 bytes (encoding), or
takes 1 to 4 bytes and translates that into a Unicode code point (decoding).

Unicode is like the center of a wheel (the hub), and code pages are the
spokes. Everything ultimately goes through the hub. UTF-8 and friends are
not spokes; they are more like "transport mechanisms" and are not directly
related to code pages.
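
The hub-and-spokes picture shows up in .NET's Encoding.Convert, which in
effect re-encodes by going through Unicode in the middle (a sketch):

using System;
using System.Text;

class Demo
{
    static void Main()
    {
        // Windows-1252 bytes -> (through Unicode) -> UTF-8 bytes.
        byte[] cp1252Bytes = { 0xF1 }; // 'ñ' in Windows-1252
        byte[] utf8Bytes = Encoding.Convert(
            Encoding.GetEncoding(1252), Encoding.UTF8, cp1252Bytes);
        Console.WriteLine(BitConverter.ToString(utf8Bytes)); // C3-B1
    }
}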
 
Arne Vajhøj

This character "ñ" is represented as 241 in UTF-16.

It is a 16 bit integer with the value 241.
Its code point is U+00F1.

That may be the common notation.
This is 0xF1 (or 241 decimal) in UTF-16 or UTF-32, and C3 B1
(195 177 decimal) as UTF-8.

It must be:

0x00 0xF1 for UTF-16 bytes
0x00 0x00 0x00 0xF1 for UTF-32 bytes
0xC3 0xB1 for UTF-8 bytes
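
Those sequences are written big-endian; a quick C# check (note that .NET's
default Encoding.Unicode is the little-endian flavor):

using System;
using System.Text;

class Demo
{
    static void Main()
    {
        string s = "ñ";

        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(s)));      // 00-F1
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(s)));               // F1-00 (little endian)
        Console.WriteLine(BitConverter.ToString(new UTF32Encoding(true, false).GetBytes(s))); // 00-00-00-F1
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(s)));                  // C3-B1
    }
}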
My first question.
When the UTF-8 encoding uses 2 bytes, is it then common that UTF-16 has zeros
in the high-order byte, as in this case where 241 fits in one byte?

It is the case for characters that are also in ISO-8859-1.

So yes - it is common for western texts.
My second question: does a code page include all the Unicode standards UTF-8,
UTF-16 and UTF-32?

CP 1200 and 1201 = UTF-16 (little and big endian) [well - actually
UCS-2, but let us ignore that difference ...]

CP 65000 = UTF-7 [nobody uses that]

CP 65001 = UTF-8
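
In .NET those numbers show up as the CodePage property (a quick check):

using System;
using System.Text;

class Demo
{
    static void Main()
    {
        Console.WriteLine(Encoding.Unicode.CodePage);          // 1200 (UTF-16 little endian)
        Console.WriteLine(Encoding.BigEndianUnicode.CodePage); // 1201 (UTF-16 big endian)
        Console.WriteLine(Encoding.UTF8.CodePage);             // 65001
    }
}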
If not, where is, for example, this character "ñ" defined for the different
Unicode standards?

In CP 1252 (which is approx. ISO-8859-1) it is a single byte 241 (0xF1).

Arne
 
Tim Roberts

Tony Johansson said:
This character "ñ" is represented as 241 in UTF-16.
Its code point is U+00F1.
This is 0xF1 (or 241 decimal) in UTF-16 or UTF-32, and C3 B1
(195 177 decimal) as UTF-8.

My first question.
When the UTF-8 encoding uses 2 bytes, is it then common that UTF-16 has zeros
in the high-order byte, as in this case where 241 fits in one byte?

You can look all of this up. These are international standards, and there
are very good reasons for this design.

The lowest 128 Unicode code points map to one-byte encodings in UTF-8. The
next 1,920 code points map to two-byte encodings. The next 63,488 code
points map to three-byte encodings. Anything from U+10000 up requires four
bytes.

So, yes, the Unicode code points from U+0080 to U+00FF always take two
bytes in UTF-8.
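
Those boundaries are easy to verify in C# (a sketch; the loop just probes
the code points at each edge):

using System;
using System.Text;

class Demo
{
    static void Main()
    {
        foreach (int cp in new[] { 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000 })
        {
            string s = char.ConvertFromUtf32(cp);
            Console.WriteLine("U+{0:X4}: {1} byte(s)", cp, Encoding.UTF8.GetByteCount(s));
        }
        // U+007F: 1, U+0080: 2, U+07FF: 2, U+0800: 3, U+FFFF: 3, U+10000: 4
    }
}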
 
Jeff Johnson

So, yes, the Unicode code points from U+0080 to U+00FF always take two
bytes in UTF-8.

But the "opposite" is not true! That is, just because the UTF-8 encoding
yields 2 bytes does not suggest that the UTF-16 encoding will "likely" have
0 in the MSB. If there are 1920 possible 2-byte UTF-8 sequences and only 128
of them represent U+0080 - U+00FF, then that accounts for only 6.667% of the
possible 2-byte sequences. So back to Tony's question:

I would say "Don't count on it."
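
The arithmetic, spelled out (a trivial C# check):

using System;

class Demo
{
    static void Main()
    {
        // Two-byte UTF-8 covers U+0080..U+07FF; only U+0080..U+00FF
        // have a zero high-order byte in UTF-16.
        int twoByte  = 0x7FF - 0x80 + 1; // 1920
        int zeroHigh = 0xFF  - 0x80 + 1; // 128
        Console.WriteLine((double)zeroHigh / twoByte); // 0.0666..., about 6.667%
    }
}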
 
Tim Roberts

Jeff Johnson said:
But the "opposite" is not true! That is, just because the UTF-8 encoding
yields 2 bytes does not suggest that the UTF-16 encoding will "likely" have
0 in the MSB. If there are 1920 possible 2-byte UTF-8 sequences and only 128
of them represent U+0080 - U+00FF, then that accounts for only 6.667% of the
possible 2-byte sequences. So back to Tony's question:


I would say "Don't count on it."

You're right. The question I read was not the question he really asked.
 
