This character "ñ" is represented as 241 in UTF-16. Its code point is
U+00F1. This is 0xF1 (241 decimal) in UTF-16 or UTF-32, and C3 B1
(195 177 decimal) in UTF-8.
My first question: when the UTF-8 encoding uses 2 bytes, is it then common
for UTF-16 to have zeros in the high-order byte, as in this case where 241
fits in one byte?
Honestly, Tony, in this case, who cares? I realize that you're trying to
learn things but I think you need to pick and choose what you want to dive
deeply into, and in my opinion the internal workings of UTF-8 and UTF-16
shouldn't concern you. UTF-8 is a middleman. It exists to bridge the gap
between single-byte code pages and the new, global world of Unicode. Data
stored in UTF-8 is almost always translated into something else (like .NET
translates everything to UTF-16) so you should really only know how to USE
UTF-8 without worrying about its guts. (Unless you're trying to write your
own UTF-8 encoder/decoder, of course.)
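(If you ever do want to peek at the guts, though, the 2-byte case is small
enough to sketch. Here's a minimal illustration in Python, which I'm using
purely for the demo, of how U+00F1 becomes the bytes C3 B1, checked against
the built-in encoder:

    # Hand-encode a code point in the 2-byte UTF-8 range (U+0080 - U+07FF):
    # the bit layout is 110xxxxx 10xxxxxx.
    def utf8_two_bytes(cp: int) -> bytes:
        assert 0x80 <= cp <= 0x7FF, "only the 2-byte range is handled here"
        return bytes([0xC0 | (cp >> 6),     # top 5 bits -> lead byte
                      0x80 | (cp & 0x3F)])  # low 6 bits -> continuation byte

    cp = 0x00F1                              # code point of ñ
    print(utf8_two_bytes(cp).hex(" "))       # c3 b1
    print("ñ".encode("utf-8").hex(" "))      # c3 b1 (built-in agrees)
    print("ñ".encode("utf-16-le").hex(" "))  # f1 00 (high byte is zero here,
                                             # though not for every character
                                             # that takes two UTF-8 bytes)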
My second question: does a code page include all the Unicode standards,
UTF-8, UTF-16, and UTF-32? If not, where is this character "ñ" defined,
for example, in the different Unicode standards?
Code pages do not "include Unicode standards." Let me see if I can come up
with a good analogy.
If you know anything about bitmaps, you know that there are indexed bitmaps
and true-color bitmaps. An indexed bitmap is like a paint-by-number set
(assuming you're old enough to remember those things and they existed where
you grew up). You have a limited supply of colors and each color is mapped
to a number (the index). Perhaps 0 = Red, 1 = White, 2 = Purple, etc. You
cannot use any color outside the range of your given color palette. Let's
say this color palette has 256 entries for this example, and therefore each
index value fits nicely into one byte. You define your bitmap by specifying
a bunch of indexes (bytes) that indicate which color is to be applied to
each pixel. So your bitmap data might contain 0 0 0 2 2 1, meaning three
pixels of red, two pixels of purple, and one pixel of white. Six pixels, six
bytes. Moderately compact.
On the other hand, in a true-color image each pixel is represented by three
(or four, if you want transparency) bytes. There is no color palette because
those bytes can represent any color. So now your six pixels look like this:
0xFF0000 0xFF0000 0xFF0000 0xFF00FF 0xFF00FF 0xFFFFFF. Six pixels, 18 bytes.
Big, but flexible as far as colors go.
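If it helps to see that in code, here is the same six-pixel example both
ways, sketched in Python (the palette and colors are just the made-up ones
from above):

    # The same six pixels, stored indexed vs. true-color.
    palette = {0: 0xFF0000, 1: 0xFFFFFF, 2: 0xFF00FF}  # 0=Red, 1=White, 2=Purple
    indexed = bytes([0, 0, 0, 2, 2, 1])                # one byte per pixel
    true_color = b"".join(palette[i].to_bytes(3, "big") for i in indexed)
    print(len(indexed))     # 6 bytes
    print(len(true_color))  # 18 bytes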
A code page is like an indexed image. Single-byte code pages contain 256
"slots," each of which can represent a character (a glyph). Each code page
has a table somewhere that tells it how to map each index (0 - 255) to a
specific Unicode character (called a "code point").
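You can watch that lookup happen, for instance with the real Windows-1252
table that ships with Python's standard library:

    # A single-byte code page is just a 256-entry lookup table:
    # byte value in, Unicode code point out.
    b = bytes([0xF1])        # index 241 in the code page
    ch = b.decode("cp1252")  # look it up in Windows-1252
    print(ch, hex(ord(ch)))  # ñ 0xf1, i.e. code point U+00F1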
Unicode itself is like the entire color spectrum (or at least it's pretty
close).
The Windows-1252 code page (Latin 1 or something like that) maps 65 -> A
(U+0041), 34 -> " (U+0022), 42 -> * (U+002A), and so on. Many other code
pages have similar mappings for indexes 0 - 127, but when you get to 128 -
255 you tend to see more variation. For example, and I'm totally making this
up, a Russian code page might map 165 to U+0427 whereas a Spanish code page
might map it to your ñ, U+00F1.
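I made those exact numbers up, but the effect is real and easy to
demonstrate with two actual code pages; for example, the byte 0xF1 lands on
different characters under Windows-1252 and the Cyrillic Windows-1251:

    # One byte, two code pages, two different Unicode characters.
    b = bytes([0xF1])
    print(b.decode("cp1252"))  # ñ (U+00F1) under the Western code page
    print(b.decode("cp1251"))  # с (U+0441, CYRILLIC SMALL LETTER ES)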
UTF-8, on the other hand, is not a mapping but rather an encoding, which
takes a Unicode code point and stores it in 1 to 4 bytes (encoding), or
takes 1 to 4 bytes and translates that into a Unicode code point (decoding).
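In code, the 1-to-4-byte behavior is easy to see (Python again, purely as
illustration):

    # Encoding: code point -> 1 to 4 bytes, depending on the code point.
    for ch in ("A", "ñ", "€", "😀"):  # U+0041, U+00F1, U+20AC, U+1F600
        print(f"U+{ord(ch):04X} ->", ch.encode("utf-8").hex(" "))
    # Decoding reverses it: bytes back to a code point.
    assert b"\xc3\xb1".decode("utf-8") == "ñ"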
Unicode is like the center of a wheel (the hub), and code pages are the
spokes. Everything ultimately goes through the hub. UTF-8 and friends are
not spokes; they are more like "transport mechanisms" and are not directly
related to code pages.
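To make the hub-and-spoke picture concrete, here is one last sketch:
getting from one representation to another means decoding to Unicode (the
hub) first, then encoding back out:

    # Code page -> Unicode (hub) -> other representations.
    legacy = bytes([0xF1])                  # ñ as a Windows-1252 byte
    text = legacy.decode("cp1252")          # spoke -> hub: now a code point
    print(text.encode("utf-8").hex(" "))    # c3 b1 (transport mechanism)
    print(text.encode("latin-1").hex(" "))  # f1    (another spoke)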