C# and encodings


beginwithl

Hi

I sincerely apologize for asking so many questions in one thread, but
at least this way, if you decide not to answer, there won’t be 10
useless threads floating around. Anyway, I started learning about
file I/O, code tables, encodings etc., and man, this stuff is a bit
overwhelming





1)

a) With "Encoding.Default" you retrieve the system’s default code page.
But if Windows has numerous code pages, then what exactly would the
default page be, meaning where ( or in what apps ) does Windows use
this default page over other code pages?





b) Can a code page support the Unicode coded character set, but use a
different encoding than Unicode does ( the Unicode set uses three
encodings - UTF-8, UTF-16 and UTF-32 )?




c)

* Are there also 8-bit code pages which use Unicode character
encoding, and thus have only 255 code points matched to characters?

* Can these code pages also use UTF-16 or UTF-32 encoding?

* Are there also code pages that support more than 255, but less than
2^16 code points?





2)
From MSDN site:
“StreamWriter defaults to using an instance of UTF8Encoding unless
specified otherwise. This instance of UTF8Encoding is constructed such
that the Encoding.GetPreamble method returns the Unicode byte order
mark written in UTF-8. The preamble of the encoding is added to a
stream when you are not appending to an existing stream. This means
any text file you create with StreamWriter will have three byte order
marks at its beginning."

As far as I understand, the above text suggests that the preamble
should be added by default, but I’d say that’s not true?!
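For what it’s worth, this is easy to test; a minimal sketch (the file
names are arbitrary) that writes the same text with the default
constructor and with an explicit Encoding.UTF8, then dumps the raw
bytes:

    using System;
    using System.IO;
    using System.Text;

    class BomCheck
    {
        static void Main()
        {
            // Default constructor: the .NET Framework StreamWriter uses
            // a UTF-8 encoding whose GetPreamble() is empty, so no BOM
            // is written.
            using (StreamWriter w = new StreamWriter("default.txt"))
                w.Write("hello");

            // Explicit Encoding.UTF8: this instance does emit a preamble.
            using (StreamWriter w = new StreamWriter("explicit.txt",
                                                     false, Encoding.UTF8))
                w.Write("hello");

            Dump("default.txt");   // 68-65-6C-6C-6F
            Dump("explicit.txt");  // EF-BB-BF-68-65-6C-6C-6F
        }

        static void Dump(string path)
        {
            Console.WriteLine(path + ": " +
                BitConverter.ToString(File.ReadAllBytes(path)));
        }
    }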





3) I noticed there are only four classes derived from the Encoding class
( ASCIIEncoding, UTF8Encoding, UnicodeEncoding and UTF7Encoding ). What
if you want to use some other, non-Unicode encoding?






4)

a)
From MSDN site:
“StreamWriter Constructor (Stream, Encoding)

If you specify something other than Encoding.Default, the byte order
mark (BOM) is written to the file.”

But a BOM should only be added when using one of the Unicode encodings,
so why would a BOM be added if you specify a non-Unicode encoding?




b) “Since the Unicode byte order mark character is not found in any
code page, it disappears if data is converted to ANSI. Unlike other
Unicode characters, it is not replaced by a default character when it
is converted. If a byte order mark is found in the middle of a file,
it is not interpreted as a Unicode character and has no effect on text
output.”

Well, since at least some (ANSI) code pages do have glyphs for the
characters at code points FF and FE, I assume the above text implies
that apps ( using non-Unicode code pages ) reading such a file would
understand that the FE FF sequence represents a BOM and thus should
ignore it?
In other words, it is up to the app ( using a non-Unicode code page )
reading such a file to realize that the FE FF sequence should be
ignored?!





5)
a) “Internally, the .NET Framework stores text as Unicode UTF-16.”

I assume that the above quote is only referring to String objects and
char variables using UTF-16 encoding, or is there some other text
which is also stored as UTF-16?




b) Ignoring the fact that FE FF sequence identifies the type of
encoding, does U+FEFF also represent a character ( outside the context
of encoding )?






6)
Say app1 ( running on PC1 ) and app2 ( running on PC2 ) communicate
via network using TCP/IP protocol. PC1 uses little-endian order, while
PC2 uses big-endian order. Now, I know we send information over TCP/IP
( and networks in general ) using big-endian order, but:




a) But does only the data in the packet’s header use this byte order,
while application data is sent just as it is, without reversing its
byte order ( assuming this data is sent over the network by PC1 )?




b) If so, then if PC1 sends some .exe file to PC2, then how will PC2
know whether it came from a little-endian machine and thus whether it
should reverse the bytes before trying to load this .exe file?





Thank you
 

beginwithl

hi


Windows only has one current code page at a time.

Where exactly is that code page used, since as far as I know, apps
running on top of Windows can use whatever code page they choose?


That makes no sense. Unicode can't be represented in only 8 bits, so
there's no such thing as an 8-bit code page that uses Unicode character
encoding.

But Notepad supports Unicode and yet it only recognizes 255 characters,
thus it only has 255 code points – couldn’t we then say that Notepad
uses an 8-bit code page?



UTF-16 and UTF-32 are also Unicode. See above.

I realize that



You are welcome to say that.

Well, I created a file and wrote some text ( via StreamWriter ) and
then checked via a hex editor whether a BOM was also present, and it
wasn’t, so uhm ...

Three possibilities:

-- the documentation is wrong
-- the implementation is wrong
-- your assumption about when the BOM should be added is wrong

I don't have enough first-hand knowledge to choose among those three at
the moment.

Prob the last option :(

I believe the statement refers to converting from Unicode to an ANSI code
page, not the other way. The point is not whether 0xff or 0xfe can be
found on their own. The point is that the Unicode "character" 0xfeff is
not representable in any ANSI code page, and is treated specially by
stripping it from input rather than replacing it with the "default
character".


If you use Encoding to convert to an ANSI code page, it will be ignored
automatically.

While other characters not representable in an ANSI code page will be
replaced with the default character by the Encoding class?
It's the Unicode BOM character. It's an actual Unicode character.

But it has no other, non-computer-related meaning?


3)
You use them instead. See Encoding.GetEncoding().
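For example, a minimal sketch (lookup by code page number and by name
are both shown; the strings are arbitrary):

    using System;
    using System.Text;

    class Program
    {
        static void Main()
        {
            // Non-Unicode encodings are obtained by code page number
            // or by name rather than through a public Encoding subclass.
            Encoding cp1252 = Encoding.GetEncoding(1252);
            Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

            byte[] bytes = cp1252.GetBytes("café");           // é -> 0xE9
            Console.WriteLine(BitConverter.ToString(bytes));  // 63-61-66-E9
            Console.WriteLine(latin1.EncodingName);
        }
    }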

a) From MSDN:
"The GetEncoding method relies on the underlying platform to support
most code pages; however, the .NET Framework natively supports some
encodings."

So if .NET doesn’t support a particular encoding, it checks whether the
underlying OS supports that encoding, and if it does, it “borrows” the
OS’s code page and instructions on how to encode it?

b) From MSDN:
"For a list of code pages, see the Encoding class topic. Or use the
GetEncodings method to get a list of all encodings."

I assume GetEncodings also lists code pages which .NET may not support
out of the box?







That is what all the regular code pages do.

* My question was a bit off ... what I meant to ask was if there are
8-bit code pages that only have 255 code points defined, where all of
those 255 code points are assigned to the same characters as the first
255 code points in the Unicode coded character set?

* In any case, if the answer is yes, are any of such code pages then
encoded with either UTF-16 or UTF-8 encoding?



UTF-8 and ANSI are external formats.

I’m not sure I understand what you meant by that.

Internally all string and char use 16-bit Unicode.


I believe the 16-bit value exists as a Unicode code point but not as
UTF-8.

* But since UTF-8 can use up to 6 octets, wouldn’t that suggest that
it defines code point FEFF? Or is that code point skipped? For what
reason?





thank you guys
 

Göran Andersson

b) Can a code page support the Unicode coded character set, but use a
different encoding than Unicode does

It doesn't really work that way. Strings are Unicode, and they can be
encoded into a binary stream using an encoding that either supports the
full Unicode character set or an encoding that supports the subset that
a codepage represents.
( the Unicode set uses three
encodings - UTF-8, UTF-16 and UTF-32 )?

Four, there is UTF-7 also.
* Are there also 8-bit code pages which use Unicode character
encoding, and thus have only 255 code points matched to characters?

No, if they only have 256 code points they don't use Unicode encoding,
they map each character to a byte value.
* Can these code pages also use UTF-16 or UTF-32 encoding?

No, they already have an encoding, they can't use another encoding also.
* Are there also code pages that support more than 255, but less than
2^16 code points?

There are character sets that use double character combinations, like
Chinese, Japanese, Korean and Arabic, but the characters in a pair are
encoded as separate characters, not as a single entity like Unicode does.
2)
From MSDN site:
“StreamWriter defaults to using an instance of UTF8Encoding unless
specified otherwise. This instance of UTF8Encoding is constructed such
that the Encoding.GetPreamble method returns the Unicode byte order
mark written in UTF-8. The preamble of the encoding is added to a
stream when you are not appending to an existing stream. This means
any text file you create with StreamWriter will have three byte order
marks at its beginning."

As far as I understand, the above text suggests that the preamble
should be added by default, but I’d say that’s not true?!

Why would you say that? I think that it actually is true. Most
applications that read text files support Unicode, so you will never
notice the byte order mark.
3) I noticed there are only four classes derived from the Encoding class
( ASCIIEncoding, UTF8Encoding, UnicodeEncoding and UTF7Encoding ). What
if you want to use some other, non-Unicode encoding?

There are other classes that handle multiple encodings, like
SBCSCodePageEncoding, DBCSCodePageEncoding, ISO2022Encoding,
EUCJPEncoding, GB18030Encoding and ISCIIEncoding. They are marked as
internal and can only be created from the factory methods in Encoding
class, so you don't read about them in the documentation.
4)
a)
From MSDN site:
“StreamWriter Constructor (Stream, Encoding)

If you specify something other than Encoding.Default, the byte order
mark (BOM) is written to the file.”

But a BOM should only be added when using one of the Unicode encodings,
so why would a BOM be added if you specify a non-Unicode encoding?

Here the documentation is not correct. I tried some, and the BOM is
written for UTF-8, UTF-16 and UTF-32, but not for UTF-7, ASCII,
ISO-8859-1 or Windows-1252.
5)
a) “Internally, the .NET Framework stores text as Unicode UTF-16.”

I assume that the above quote is only referring to String objects and
char variables using UTF-16 encoding, or is there some other text
which is also stored as UTF-16?

Well, the Char structure is the only type that handles characters; all
other types that handle text (String, StringBuilder etc.) use the Char
structure.
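To make the "internally UTF-16" point concrete, a small sketch showing
that a code point outside the Basic Multilingual Plane (U+1D11E,
MUSICAL SYMBOL G CLEF, chosen arbitrarily) occupies two chars:

    using System;

    class Program
    {
        static void Main()
        {
            // Each C# char is one UTF-16 code unit, so a supplementary
            // code point is stored as a surrogate pair.
            string s = "\U0001D11E";
            Console.WriteLine(s.Length);                  // 2
            Console.WriteLine(((int)s[0]).ToString("X")); // D834 (high surrogate)
            Console.WriteLine(((int)s[1]).ToString("X")); // DD1E (low surrogate)
            Console.WriteLine(
                char.ConvertToUtf32(s, 0).ToString("X")); // 1D11E
        }
    }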
b) Ignoring the fact that FE FF sequence identifies the type of
encoding, does U+FEFF also represent a character ( outside the context
of encoding )?

Yes, in the Unicode character set it's the code for a zero width space
character. This means that if you for example concatenate two unicode
encoded text files so that you end up with a BOM in the middle of the
text, it will still be invisible.
6)
Say app1 ( running on PC1 ) and app2 ( running on PC2 ) communicate
via network using TCP/IP protocol. PC1 uses little-endian order, while
PC2 uses big-endian order. Now, I know we send information over TCP/IP
( and networks in general ) using big-endian order, but:

a) But does only the data in the packet’s header use this byte order,
while application data is sent just as it is, without reversing its
byte order ( assuming this data is sent over the network by PC1 )?

Yes. The network layer cannot change the endianness of the data, as it
doesn't know what kind of data it represents. It treats everything as bytes.
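A small illustration of that division of labour: the payload bytes are
never touched, and when an application protocol calls for big-endian
("network order") integers, the application converts them itself, e.g.
with IPAddress.HostToNetworkOrder:

    using System;
    using System.Net;

    class Program
    {
        static void Main()
        {
            int host = 0x12345678;

            // Reverses the byte order on little-endian machines;
            // a no-op on big-endian ones.
            int network = IPAddress.HostToNetworkOrder(host);

            Console.WriteLine(BitConverter.IsLittleEndian); // True on x86
            Console.WriteLine(network.ToString("X8"));      // 78563412 on x86
        }
    }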
b) If so, then if PC1 sends some .exe file to PC2, then how will PC2
know whether it came from a little-endian machine and thus whether it
should reverse the bytes before trying to load this .exe file?

The file formats that use little endian or big endian data are well
defined and don't change endianness just because the system natively
uses a different endianness.
 

Arne Vajhøj

Where exactly is that code page used, since as far as I know, apps
running on top of Windows can use whatever code page they choose?

I think the answer must lie somewhere in the difference between
explicit and default.
But Notepad supports Unicode and yet it only recognizes 255 characters,
thus it only has 255 code points – couldn’t we then say that Notepad
uses an 8-bit code page?

Notepad supports UTF-8.
a) From MSDN:
"The GetEncoding method relies on the underlying platform to support
most code pages; however, the .NET Framework natively supports some
encodings."

So if .NET doesn’t support a particular encoding, it checks whether the
underlying OS supports that encoding, and if it does, it “borrows” the
OS’s code page and instructions on how to encode it?

No.

If GetEncoding supports it, then .NET supports it.

The doc does not say how it provides the support.
b) From MSDN:
"For a list of code pages, see the Encoding class topic. Or use the
GetEncodings method to get a list of all encodings."

I assume GetEncodings also lists code pages which .NET may not support
out of the box?

No.

The docs say:

"This method returns a list of supported encodings"

Arne
 

Arne Vajhøj

(e-mail address removed) wrote:
* My question was a bit off ... what I meant to ask was if there are
8-bit code pages that only have 255 code points defined, where all of
those 255 code points are assigned to the same characters as the first
255 code points in the Unicode coded character set?
No.

* In any case, if the answer is yes, are any of such code pages then
encoded with either UTF-16 or UTF-8 encoding?

Encoding X is never encoded with Encoding Y.
I’m not sure I understand what you meant by that

They are used when reading/writing to files, sockets etc.

But:
* But since UTF-8 can use up to 6 octets, wouldn’t that suggest that
it defines code point FEFF?
No.

Or is that code point skipped?

I think so.

As I read http://en.wikipedia.org/wiki/UTF-8#Description, that
bit pattern will never be used for UTF-8.

For what reason?

Probably to distinguish between BOM and data.

:)
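For what it's worth, this can be checked against UTF8Encoding directly;
the following sketch prints EF-BB-BF twice, i.e. U+FEFF does get a
UTF-8 byte sequence, and it is the same three bytes GetPreamble()
returns:

    using System;
    using System.Text;

    class Program
    {
        static void Main()
        {
            byte[] encoded  = Encoding.UTF8.GetBytes("\uFEFF");
            byte[] preamble = Encoding.UTF8.GetPreamble();

            Console.WriteLine(BitConverter.ToString(encoded));  // EF-BB-BF
            Console.WriteLine(BitConverter.ToString(preamble)); // EF-BB-BF
        }
    }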

Arne
 

Pavel Minaev

It is usually code page 1252, which is very close to ISO-8859-1.

Ouch! Apart from being technically wrong, this answer is quite evil in
that it is precisely that kind of assumption that often leads to
software slightly or severely broken on non-Western European locales
(sorry for the rant, it's just a sore point here).

The default Windows code page depends on the regional settings, and
those default to whichever language the Windows itself is localized
in. For English Windows installs, and, I believe, all Western European
languages, it is indeed CP1252. However, for Eastern Europe already,
the codepage largely depends on the country, and that's doubly so for
the ex-USSR countries. And that's not even speaking of Arabic, Hebrew,
Chinese, Japanese, and many others!

So please, please don't make any assumptions about what it's going to
be. The only thing that's safe to assume is that the encoding is
always ASCII-compatible. Apart from that, all bets are off. For
example, if you assume CP1252, and use guillemet « » symbols that are
present in it, and then I run it on my Russian XP, I'm going to see
Russian letters instead - this obviously doesn't improve the
readability of the text.

Of course, this only applies to non-Unicode stuff. Thankfully with
Unicode we are past this mess.

As to the question of where one may encounter text in default Windows
encoding - a few examples might be plain text files created by the
user, MP3 tags, and in general output of any non-Unicode application.
There must be some Windows setting to change it, but I don't
know where.

It's "default Windows code page for non-Unicode programs", if I
remember correctly. It's on the "Advanced" tab of the language &
regional settings dialog accessible from Control Panel.
 

Pavel Minaev

Windows only has one current code page at a time.

Well, not quite - Windows (or rather, a specific user) has one locale
at a time, but two associated non-Unicode codepages - one for GUI (aka
"ANSI" in Win32 parlance - this is what Encoding.Default returns), one
for text mode ("OEM") - a legacy of DOS. If I remember correctly, the
latter one can actually be changed using "chcp" within the context of
a specific command line session - another DOS artifact.
The code page can be and often is not Unicode.  Any character encoding  
that is not Unicode by definition uses a different encoding than Unicode  
does.

I think we need to start distinguishing between character set and
encoding here :)

Character set - a set of valid characters (code points in Unicode
parlance). Unicode is a character set, but not an encoding. CP1250 is
a character set, which is always encoded as single 8-bit bytes.

Encoding - a way of encoding a specific character set as a sequence of
bytes. UTF-8, UTF-16, UCS4 are all encodings of Unicode.

A Windows code page is in fact an encoding, not a character set (as
evidenced by the fact that there's a specific codepage for UTF-8, and
one could technically add a separate codepage for UTF-7).
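A small sketch illustrating that point - Windows assigns code page
numbers even to the Unicode encodings:

    using System;
    using System.Text;

    class Program
    {
        static void Main()
        {
            Console.WriteLine(Encoding.UTF8.CodePage);     // 65001
            Console.WriteLine(Encoding.Unicode.CodePage);  // 1200 (UTF-16LE)

            // And an ordinary ANSI code page, looked up by number:
            Console.WriteLine(Encoding.GetEncoding(1250).WebName); // windows-1250
        }
    }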
 

Mihai N.

1)
a) With "Encoding.Default" you retrieve the system’s default code page.
But if Windows has numerous code pages, then what exactly would the
default page be, meaning where ( or in what apps ) does Windows use
this default page over other code pages?

This is controlled by the system locale (or "Language for non-Unicode
applications"). A user can change it, but it affects the whole system,
all users, and requires a reboot.
See http://www.mihai-nita.net/article.php?artID=20050611a
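A minimal sketch of what Encoding.Default reports (the exact values
depend on the machine's system locale):

    using System;
    using System.Text;

    class Program
    {
        static void Main()
        {
            // Reflects the machine-wide ANSI code page, not a
            // per-application choice.
            Encoding def = Encoding.Default;
            Console.WriteLine(def.CodePage);     // e.g. 1252 on Western European systems
            Console.WriteLine(def.EncodingName);
        }
    }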


b) Can a code page support the Unicode coded character set, but use a
different encoding than Unicode does ( the Unicode set uses three
encodings - UTF-8, UTF-16 and UTF-32 )?

No. Unicode is one code page (with several encodings).
If a code page "supports Unicode" then it is Unicode.
But (most) other code pages are a subset of Unicode
(not necessarily a contiguous subset).
Anyway, text in any code page can also be represented as Unicode.
But not the other way around.

c)

* Are there also 8-bit code pages which use Unicode character
encoding, and thus have only 255 code points matched to characters?

No. A code page does not use Unicode encodings.
* Can these code pages also use UTF-16 or UTF-32 encoding?
All UTF encodings are Unicode only
(as the name says: UTF = *Unicode* Transformation Form)

* Are there also code pages that support more than 255, but less than
2^16 code points?
Yes. Code pages designed for Japanese (cp932, or Shift-JIS), Chinese
Traditional (cp950, or Big-5), Chinese Simplified (cp936, or GB2312),
Korean (cp949). These are the only ones that can also be system code
pages. But there are other code pages with more than 255 characters.
You should consider these "legacy" and not use them, except for interchange
with old applications and files.

2)
As far as I understand, the above text suggests that the preamble
should be added by default, but I’d say that’s not true?!

Based on the doc, it should add the preamble.
Did you test it and find that it is not true?

3) I noticed there are only four classes derived from the Encoding class
( ASCIIEncoding, UTF8Encoding, UnicodeEncoding and UTF7Encoding ). What
if you want to use some other, non-Unicode encoding?

You use the Encoding class with a numeric code page identifier:
http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx

4)

a)
From MSDN site:
“StreamWriter Constructor (Stream, Encoding)

If you specify something other than Encoding.Default, the byte order
mark (BOM) is written to the file.”

But a BOM should only be added when using one of the Unicode encodings,
so why would a BOM be added if you specify a non-Unicode encoding?

Sounds like the doc is not correct.

b) “Since the Unicode byte order mark character is not found in any ....
Well, since at least some (ANSI) code pages do have glyphs for the
characters at code points FF and FE, I assume the above text implies
that apps ( using non-Unicode code pages ) reading such a file would
understand that the FE FF sequence represents a BOM and thus should
ignore it?
Nope. The BOM is not the code points FF and FE.
The code point is FEFF. And when you convert to a code page, there is
no equivalent for it.


5)
a) “Internally, the .NET Framework stores text as Unicode UTF-16.”

I assume that the above quote is only referring to String objects and
char variables using UTF-16 encoding, or is there some other text
which is also stored as UTF-16?

All text is UTF-16. If you can think of text other than String and char
(I cannot), it is also UTF-16.


b) Ignoring the fact that FE FF sequence identifies the type of
encoding, does U+FEFF also represent a character ( outside the context
of encoding )?

You are confusing things. FEFF is a character (BOM) at the beginning
of the file. In the middle, it means ZERO WIDTH NON-BREAKING SPACE.
It was used as a marker, then kind of transformed into a de facto
standard.
A bit like #! at the beginning of a Unix script meaning
"this file is a script and the thing after #! is the path to the
interpreter for this script".
It is a convention. It does not mean there are no scripts without
#! or that the existence of #! guarantees 100% that the file is a script.
Just that it is "very, very likely" a script.


a) But does only data in the packet’s header uses this byte order,
while application data is sent just as it is, without reversing its
byte order ( assuming this data is sent over the network by PC1 )?

The packet headers are always in the same byte order.
It is part of the specs.


Advice: these are things about which many people have opinions,
but relatively few know what they are talking about (including me).
Whenever something is not clear, don't ask on newsgroups, go to the
official sources:
- http://unicode.org/faq/utf_bom.html
- http://www.unicode.org/glossary/
 

Mihai N.

b) If so, then if PC1 sends some .exe file to PC2, then how will PC2
know whether it came from little endian-machine and thus should
reverse bytes before trying to load this .exe file?

Missed this one: the .exe file is the content.
The packets (with headers and all) are the envelope.
The envelope always uses the same endianness, so that any computer
can talk to any computer.

You should write the address on the envelope in a certain way if you
want it delivered.

What is inside is a different story. It is between the two parties
that communicate. It can be Russian, Chinese, pictures, music, a C program.

The spec of the .exe file has info on endianness.
(But what an "executable" means is again a different beast. In general
you cannot freely mix and match executables for Mac OS X, Linux,
Windows, Windows Mobile, 32 and 64 bits, Mac PPC vs Mac Intel, etc.)
 

Mihai N.

Character set - a set of valid characters (code points in Unicode
parlance). Unicode is a character set, but not an encoding. CP1250 is
a character set, which is always encoded as single 8-bit bytes.

Encoding - a way of encoding a specific character set as a sequence of
bytes. UTF-8, UTF-16, UCS4 are all encodings of Unicode.

A Windows code page is in fact an encoding, not a character set (as
evidenced by the fact that there's a specific codepage for UTF-8, and
one could technically add a separate codepage for UTF-7).


Please don't confuse things (and don't confuse others).

Character set has nothing to do with code points, and Unicode is not
a character set.
Unicode is a coded character set, or code page, or "charset" (unfortunate
name choice in the Unix/RFC world)

See also here: http://www.mihai-nita.net/article.php?artID=20060806a

Recommended for everybody: don't just believe everything you see on
the net (this includes my answers).
Go to the official source: http://www.unicode.org/glossary/
 

Mihai N.

There are character sets that use double character combinations, like
Chinese, Japanese, Korean and Arabic, but the characters in a pair are
encoded as separate characters, not as a single entity like Unicode does.

Can you give an example of an Arabic code page that requires double
character combinations?

This means that if you for example concatenate two unicode
encoded text files so that you end up with a BOM in the middle of the
text, it will still be invisible.

That is an incorrect thing to do.
In the middle of a string U+FEFF is not a BOM anymore.
The fact that you don't see it does not mean it is ok to have
"junk" in the middle of the string.
 

Göran Andersson

Mihai said:
Can you give an example of an Arabic code page that requires double
character combinations?



That is an incorrect thing to do.

Well, is it? As long as the encoding of the files is the same, it works
just fine.
In the middle of a string U+FEFF is not a BOM anymore.
The fact that you don't see it does not mean it is ok to have
"junk" in the middle of the string.

It's not junk, it's a normal character.

It actually makes sense to have a zero width space between the contents
of the files. That way they are kept separate without adding any visible
characters.
 

beginwithl

hi


It doesn't really work that way. Strings are Unicode, and they can be
encoded into a binary stream using an encoding that either supports the
full Unicode character set or an encoding that supports the subset that
a codepage represents.

* So if a code page supports only a subset of the Unicode character
set… what do we call it then? A Unicode-compliant code page, or…?

* if I create a code page with 30000 code points that map to the same
characters as those in the Unicode coded character set, and if I encode
this code page using UTF-16, it still won’t be considered Unicode?

* But that doesn’t explain why Notepad claims to support Unicode and
yet it only has 383 code points defined?!

And if you say we use UTF-16/UTF-8 encodings only for the Unicode coded
character set ( that is, for sets that support the full Unicode character
set ), then why does Notepad also use UTF-16 encoding? After all, its
code page supports only a subset of the Unicode character set?!



No, if they only have 256 code points they don't use Unicode encoding,
they map each character to a byte value.

But doesn’t that depend on the implementation – if I wanted to create
a code page that is a subset of the first 600 Unicode code points,
then why couldn’t I use UTF-16 encoding ( for whatever reason,
perhaps because I’m expecting that the files my app will read will only
contain the first 600 Unicode characters )?








Notepad supports UTF-8.

I checked it and as far as I can tell it only recognizes the first 383
or so characters. That would suggest it only has around 380 code points
defined and yet it still uses UTF-16 or UTF-8 encodings


Then what is meant by “relies on the underlying platform to support
most code pages”?


(e-mail address removed) wrote:



Encoding X is never encoded with Encoding Y.

Isn’t the term code page used for any coded character set? You imply
that the term code page already automatically suggests some non-Unicode
encoding? ( BTW – I realize that the term Unicode means a coded
character set and not an encoding. )











No. Unicode is one code page (with several encodings).
If a code page "supports Unicode" then it is Unicode.

By “code page supports Unicode” you mean that it has as many code
points ( and thus characters that map to those code points ) defined
as the full Unicode character set?

But (most) other code pages are a subset of Unicode
(not necessarily a contiguous subset).

* By subset you mean that the code points these code pages do have
defined map to the same characters as the equivalent Unicode code points?

* and yet even though these code pages are subsets of the Unicode
character set, we still don’t call them Unicode coded character sets?
Then what do we call them? Unicode-compliant code pages…?


No. A code page does not use Unicode encodings.


All UTF encodings are Unicode only
(as the name says: UTF = *Unicode* Transformation Form)

So code pages that are a subset of Unicode don’t use Unicode
encodings? But I thought that was a design choice… thus if someone
created a code page CP1 that is a subset of Unicode, they
could also decide to use UTF-16 … that way apps that understood CP1
would also be able to (partially) understand Unicode files?! What
reasoning would prevent a programmer from using a UTF to encode CP1?

Based on the doc, it should add the preamble.
Did you test it and find that it is not true?

I checked it with a hex editor and there was no BOM written.



You are confusing things. FEFF is a character (BOM) at the beginning
of the file. In the middle, it means ZERO WIDTH NON-BREAKING SPACE.
It was used as a marker, then kind of transformed into a de facto
standard.

I guess I’m a bit confused about what the definition of a character is.
‘\n’ or ‘\r’ are considered characters, but to me they are no more
character-like than ZERO WIDTH NON-BREAKING SPACE, and yet the latter
is not considered a character.


thank you all
 

Göran Andersson

hi




* So if a code page supports only a subset of the Unicode character
set… what do we call it then? A Unicode-compliant code page, or…?

No, they are just character sets. Unicode supports pretty much any
character that exists on earth, so you would have a hard time finding a
character set that isn't a subset of Unicode.
* if I create a code page with 30000 code points that map to the same
characters as those in the Unicode coded character set, and if I encode
this code page using UTF-16, it still won’t be considered Unicode?

I would call that a partial/broken implementation of Unicode.
* But that doesn’t explain why Notepad claims to support Unicode and
yet it only has 383 code points defined?!

Where have you read this? I pasted over 400 different characters into
Notepad, it accepted them, saved them and loaded them just fine...
But doesn’t that depend on the implementation – if I wanted to create
a code page that is a subset of the first 600 Unicode code points,
then why couldn’t I use UTF-16 encoding ( for whatever reason,
perhaps because I’m expecting that the files my app will read will only
contain the first 600 Unicode characters )?

You could do whatever you like, and we could discuss whether it's
Unicode or not, but that's just a theoretical discussion. It's not really
relevant to the existing character sets in actual use.
I guess I’m a bit confused about what the definition of a character is.
‘\n’ or ‘\r’ are considered characters, but to me they are no more
character-like than ZERO WIDTH NON-BREAKING SPACE, and yet the latter
is not considered a character.

Yes, it is.

http://www.fileformat.info/info/unicode/char/feff/index.htm
 

Arne Vajhøj

Pavel said:
Ouch! Apart from being technically wrong, this answer is quite evil in
that it is precisely that kind of assumption that often leads to
software slightly or severely broken on non-Western European locales
(sorry for the rant, it's just a sore point here).

The default Windows code page depends on the regional settings, and
those default to whichever language the Windows itself is localized
in. For English Windows installs, and, I believe, all Western European
languages, it is indeed CP1252. However, for Eastern Europe already,
the codepage largely depends on the country, and that's doubly so for
the ex-USSR countries. And that's not even speaking of Arabic, Hebrew,
Chinese, Japanese, and many others!

My apologies for making the classic "everyone is in Western Europe or
North America" mistake.

Arne
 

Mihai N.

That is an incorrect thing to do.
Well, is it? As long as the encoding of the files is the same, it works
just fine.

It "it works" and "it is correct" are the same thing for you,
then you have a problem.


It's not junk, it's a normal character.

It actually makes sense to have a zero width space between the contents
of the files. That way they are kept separate without adding any visible
characters.

Once you merge two streams of text, it is one.
There are no longer "files", just one "file".
And the zero width space was never intended to "separate the contents
of files".
You also have a BOM that becomes a zero width space.
Is that OK with you?

Check the official FAQ: http://unicode.org/faq/utf_bom.html#BOM
"A byte order mark (BOM) consists of the character code U+FEFF at the
beginning of a data stream". Note "beginning"

Also read the answer to "What should I do with U+FEFF in the middle of a
file"

If you don't agree with that page, please contact the Unicode Consortium
and tell them they are idiots, and that you know better than them
how Unicode works.
 

beginwithl

hi


No, they are just character sets. Unicode supports pretty much any
character that exists on earth, so you would have a hard time finding a
character set that isn't a subset of Unicode.


I would call that a partial/broken implementation of Unicode.


Where have you read this? I pasted over 400 different characters into
Notepad, it accepted them, saved them and loaded them just fine...

I tried it. I wrote some characters ( via a console app ) into a .txt
file and then opened the file via Notepad. Everything up to 380 or so
was displayed correctly, but from that point only a question mark ( or
whatever it uses to signal an unknown character … can’t really remember )
was displayed. I admit I only checked about 20 or so characters ( none
of which were displayed ), but based on those I assume it can’t display
anything above 385 or so.


If what I claim above is true, then Notepad "claiming" to support
Unicode encoding is false? Thus, all it does have is a code
page that is:
a) a subset of the Unicode coded character set and
b) encoded using UTF-8 or UTF-16 encoding

?!


thank you all for your kind help
 

Pavel Minaev

I tried it. I wrote some characters ( via a console app ) into a .txt
file and then opened the file via Notepad. Everything up to 380 or so
was displayed correctly, but from that point only a question mark ( or
whatever it uses to signal an unknown character … can’t really remember )
was displayed. I admit I only checked about 20 or so characters ( none
of which were displayed ), but based on those I assume it can’t display
anything above 385 or so.

A question mark in Windows doesn't mean "unknown character". It means
a "character for which this font doesn't have a glyph". This has
nothing to do with code pages, and everything to do with the font you're
using. Try changing Notepad's font to something like "Arial Unicode MS",
and see what it does then.
 

beginwithl

hi


A question mark in Windows doesn't mean "unknown character". It means
a "character for which this font doesn't have a glyph". This has
nothing to do with code pages, and everything to do with the font you're
using. Try changing Notepad's font to something like "Arial Unicode MS",
and see what it does then.

Will try it out. But how would/should an app behave if it encounters an
unknown character? I thought question marks were used for that?!


cheers
 
