encoding

Tony Johansson · Mar 18, 2010

Hi!

Here are some encoding standards:
1. ASCII
2. Unicode
3. UTF-7
4. UTF-8
5. UTF-32

Files encoded with Unicode, UTF-8 and UTF-32 have code markers at the
beginning, but files encoded with ASCII and UTF-7 do not contain any
code markers at all.
So why are there no code markers for these two?

//Tony

Harlan Messinger · Mar 18, 2010

Tony said:
Hi!

Here are some encoding standards:
1. ASCII
2. Unicode
3. UTF-7
4. UTF-8
5. UTF-32

Files encoded with Unicode, UTF-8 and UTF-32 have code markers at the
beginning, but files encoded with ASCII and UTF-7 do not contain any
code markers at all.
So why are there no code markers for these two?

The purpose of the marker is to indicate whether the data is stored in
"big-endian" or "little-endian" order--that is, whether multibyte
encodings are arranged high-order byte first or low-order byte first.
Therefore, the need for this marker only arose when multibyte encodings
were introduced.
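
For the UTF-16 case at least, the byte-order idea can be seen directly. This is a minimal sketch in Python (my choice of language for illustration, not something used in the thread), relying only on the standard `codecs` module:

```python
import codecs

text = "A"  # code point U+0041

# The same code point serialized as one 16-bit unit in both byte orders.
little = text.encode("utf-16-le")  # low-order byte first
big = text.encode("utf-16-be")     # high-order byte first

# The marker is U+FEFF written in the chosen byte order.
print(little.hex(), codecs.BOM_UTF16_LE.hex())  # 4100 fffe
print(big.hex(), codecs.BOM_UTF16_BE.hex())     # 0041 feff

# Single-byte ASCII has no byte order to record.
print(text.encode("ascii").hex())               # 41
```

This only covers the byte-order role of the marker. It does not explain why UTF-8, whose byte order is fixed, is sometimes written with one as well.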

Harlan Messinger · Mar 18, 2010

Harlan said:
The purpose of the marker is to indicate whether the data is stored in
"big-endian" or "little-endian" order--that is, whether multibyte
encodings are arranged high-order byte first or low-order byte first.
Therefore, the need for this marker only arose when multibyte encodings
were introduced.

I realized, as soon as I pressed Send, that there were some holes in my
own understanding of this, especially since you'd mentioned UTF-7 (which
is variable-byte). I just checked and found that my explanation wasn't
particularly helpful.

Peter Duniho · Mar 18, 2010

Tony said:
Hi!

Here are some encoding standards:
1. ASCII
2. Unicode
3. UTF-7
4. UTF-8
5. UTF-32

Files encoded with Unicode, UTF-8 and UTF-32 have code markers at the
beginning, but files encoded with ASCII and UTF-7 do not contain any
code markers at all.
So why are there no code markers for these two?

You are not guaranteed markers for the standard Unicode formats either.

ASCII was "designed" long before anyone was really thinking hard about
portable character encodings, so there was no chance it would support a
marker.

And UTF-7 is used in such specialized situations that there's no need
for a marker: anything that can use it will be doing so in a context
where there's some other way to specify the format.

In general, it's very difficult to identify encoding from the text file
itself. There are some exceptions (XML allows inclusion of the
encoding, for example, as part of the header), but most of the time
encoded text needs some external indicator as to what encoding is used.
Either some convention or some explicit statement to that effect.
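
To make the limits of marker-based detection concrete, here is a rough sketch in Python (chosen only for illustration). The function name `sniff_bom` and the table `BOMS` are hypothetical; the `codecs.BOM_*` constants are from the standard library.

```python
import codecs

# Check the 4-byte marks first: the UTF-32 LE mark (FF FE 00 00)
# begins with the UTF-16 LE mark (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return the encoding named by a leading marker, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom(codecs.BOM_UTF8 + b"hi"))  # utf-8
print(sniff_bom(b"hi"))                    # None
```

A result of None says nothing about the actual encoding: the bytes could be ASCII, UTF-8 without a marker, or any legacy code page. That is where the external indicator comes in.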

Pete

Jeff Johnson · Mar 18, 2010

In general, it's very difficult to identify encoding from the text file
itself.

Yup: http://blogs.msdn.com/michkap/archive/2006/07/11/662342.aspx

Arne Vajhøj · Mar 19, 2010

Here are some encoding standards:
1. ASCII
2. Unicode
3. UTF-7
4. UTF-8
5. UTF-32

Files encoded with Unicode, UTF-8 and UTF-32 have code markers at the
beginning, but files encoded with ASCII and UTF-7 do not contain any
code markers at all.
So why are there no code markers for these two?

I would not consider Unicode an encoding.

And the BOM is optional, not required, for UTF-8.

As for why: a BOM only makes sense for certain
encodings, but in the end it is a matter of
choice by whoever designed the encoding.
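
The optional nature of the UTF-8 marker can be illustrated with a small Python sketch (Python is used here only as an example language). It relies on the standard codec names "utf-8" and "utf-8-sig":

```python
text = "hi"

plain = text.encode("utf-8")       # no marker
signed = text.encode("utf-8-sig")  # same bytes preceded by EF BB BF

print(plain.hex())   # 6869
print(signed.hex())  # efbbbf6869

# "utf-8-sig" strips the marker when present and accepts input without it.
with_sig = signed.decode("utf-8-sig")
without_sig = plain.decode("utf-8-sig")

# Plain "utf-8" keeps the marker as a leading U+FEFF character.
kept = signed.decode("utf-8")
```

Both byte sequences are valid UTF-8. Whether the marker is written is up to the producer, and a consumer has to cope with either form.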

If you define a TonyEncoding to map between Unicode
and bytes, then you can put whatever headers you want
in front.

Arne

Jeff Johnson · Mar 19, 2010

I would not consider Unicode an encoding.

Uh, why? An encoding is simply a means of associating a set of bytes with
the characters they represent. That's what Unicode does.

Harlan Messinger · Mar 19, 2010

Jeff said:
Uh, why? An encoding is simply a means of associating a set of bytes with
the characters they represent. That's what Unicode does.

It isn't an encoding in the binary sense because it only assigns
characters to numbers, it doesn't specify a representation. It doesn't
specify, for example, whether "A" should be represented as 41 or 0041 or
00000041 (or something else), or whether an em-dash would be 2014 or
002014 or 00002014 (or something else).
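
A short Python sketch (illustrative only; the helper name `forms` is made up) shows the same two code points under three different representations:

```python
def forms(ch):
    """Return the code point and its bytes (as hex) in three encodings."""
    return (
        f"U+{ord(ch):04X}",
        ch.encode("utf-8").hex(),
        ch.encode("utf-16-be").hex(),
        ch.encode("utf-32-be").hex(),
    )

for ch in ("A", "\u2014"):  # "A" and the em-dash
    print(*forms(ch))
# U+0041 41 0041 00000041
# U+2014 e28094 2014 00002014
```

The number assigned to each character is the same in every row. Only the byte representation changes, and that part is what the encoding decides.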

Peter Duniho · Mar 19, 2010

Jeff said:
Uh, why? An encoding is simply a means of associating a set of bytes with
the characters they represent. That's what Unicode does.

I believe Arne's point is that "Unicode" by itself does not describe a
way to encode characters as bytes. There are specific encodings within
Unicode (as part of the standard): UTF-8, UTF-16, and UTF-32. But
Unicode by itself describes a collection of valid characters, not how
they are encoded as bytes.

Pete

Jeff Johnson · Mar 19, 2010

I believe Arne's point is that "Unicode" by itself does not describe a way
to encode characters as bytes. There are specific encodings within
Unicode (as part of the standard): UTF-8, UTF-16, and UTF-32. But Unicode
by itself describes a collection of valid characters, not how they are
encoded as bytes.

Ah. I just go with the convention that "Unicode" by itself, at least in the
.NET world, means UTF-16LE.

Arne Vajhøj · Mar 20, 2010

Uh, why? An encoding is simply a means of associating a set of bytes with
the characters they represent. That's what Unicode does.

No.

Unicode is a mapping between the various symbols and numbers.

An encoding is the mapping between a number and one or more bytes.

Arne

Arne Vajhøj · Mar 20, 2010

Ah. I just go with the convention that "Unicode" by itself, at least in the
.NET world, means UTF-16LE.

It is relatively common in a traditional Win32 C++ context.

I hope that it is not so common in a .NET context. The docs for
String and Char very specifically say that it is Unicode in
UTF-16 encoding.

But it may very well be the original poster's interpretation
as well, because he listed Unicode but not UTF-16.

Arne

Tony Johansson · Mar 20, 2010

Arne Vajhøj said:
It is relatively common in a traditional Win32 C++ context.

I hope that it is not so common in a .NET context. The docs for
String and Char very specifically say that it is Unicode in
UTF-16 encoding.

But it may very well be the original poster's interpretation
as well, because he listed Unicode but not UTF-16.

Arne

I listed the different encodings that the Encoding class exposes.
There, Unicode means UTF-16.

//Tony

Jeff Johnson · Mar 20, 2010

No.

Unicode is a mapping between the various symbols and numbers.

An encoding is the mapping between a number and one or more bytes.

Right, but consider this little gem:

===========
Encoding.Unicode Property

Gets an encoding for the UTF-16 format using the little-endian byte order.
===========

I think people can be forgiven for equating the two, especially in the
context of .NET code, since Microsoft plainly made it look that way.
