encoding

Tony Johansson

Hi!

Here are some encoding standards:
1. ASCII
2. Unicode
3. UTF-7
4. UTF-8
5. UTF-32

Files encoded with Unicode, UTF-8, or UTF-32 begin with a code marker,
but files encoded with ASCII or UTF-7 do not contain any code marker at all.
So why is there no code marker for these two?

//Tony
 
Harlan Messinger

The purpose of the marker is to indicate whether the data is stored in
"big-endian" or "little-endian" order--that is, whether multibyte
encodings are arranged high-order byte first or low-order byte first.
Therefore, the need for this marker only arose when multibyte encodings
were introduced.
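
A minimal sketch of the byte-order point, in Python rather than .NET (my choice for illustration only, using just the standard library). It serializes the marker code point U+FEFF in both byte orders, which is what lets a reader tell big-endian from little-endian data:

```python
import codecs

# U+FEFF is the code point used as the byte order mark (BOM).
bom_char = "\ufeff"

# The same code point, serialized in both byte orders.
utf16_be = bom_char.encode("utf-16-be")
utf16_le = bom_char.encode("utf-16-le")
utf32_be = bom_char.encode("utf-32-be")
utf32_le = bom_char.encode("utf-32-le")

for name, data in [("UTF-16 BE", utf16_be), ("UTF-16 LE", utf16_le),
                   ("UTF-32 BE", utf32_be), ("UTF-32 LE", utf32_le)]:
    print(f"{name}: {data.hex(' ')}")

# The codecs module exposes the same byte sequences as constants.
assert utf16_be == codecs.BOM_UTF16_BE
assert utf16_le == codecs.BOM_UTF16_LE
assert utf32_be == codecs.BOM_UTF32_BE
assert utf32_le == codecs.BOM_UTF32_LE
```

A reader that sees FE FF first can assume high-order byte first; FF FE means the opposite.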
 
Harlan Messinger

I realized, as soon as I pressed Send, that there were some holes in my
own understanding of this, especially when I realized you'd mentioned
UTF-7 (which is variable-byte), so I just checked and found my
explanation wasn't particularly helpful.
 
Peter Duniho

You are not guaranteed markers for the standard Unicode formats either.

ASCII was "designed" long before anyone was really thinking hard about
portable character encodings, so there was no chance it would support a
marker.

And UTF-7 is used in such specialized situations that there's no need for a
marker, because anything that can use it will be doing so in a context
where there's some other way to specify the format.

In general, it's very difficult to identify encoding from the text file
itself. There are some exceptions (XML allows inclusion of the
encoding, for example, as part of the header), but most of the time
encoded text needs some external indicator as to what encoding is used.
Either some convention or some explicit statement to that effect.
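
To make the "hard to identify from the file itself" point concrete, here is a rough Python sketch of BOM sniffing. The function name sniff_bom and the table are my own, not from any library; only the codecs constants are standard. It can only report an encoding when a marker happens to be present, and returns None otherwise:

```python
import codecs

# Check longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom("hi".encode("utf-8-sig")))  # BOM present
print(sniff_bom("hi".encode("utf-8")))      # no BOM, nothing to go on
print(sniff_bom("hi".encode("ascii")))      # same bytes as the line above
```

Note that the last two inputs are byte-for-byte identical, so no amount of sniffing can say which encoding the writer had in mind.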

Pete
 
Arne Vajhøj

I would not consider Unicode an encoding.

And the BOM is optional, not required, for UTF-8.
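
A small illustration of the optional UTF-8 BOM, sketched in Python since that is easy to run anywhere (the thread is about .NET, so treat this as an analogy). Python's "utf-8" codec writes no marker, while "utf-8-sig" writes one and accepts input with or without it:

```python
text = "abc"

without_bom = text.encode("utf-8")      # plain UTF-8, no marker
with_bom = text.encode("utf-8-sig")     # same bytes, prefixed with EF BB BF

print(without_bom.hex(" "))
print(with_bom.hex(" "))

# "utf-8-sig" strips the marker if present and tolerates its absence.
assert with_bom.decode("utf-8-sig") == text
assert without_bom.decode("utf-8-sig") == text

# Plain "utf-8" keeps the marker as an ordinary U+FEFF character.
assert with_bom.decode("utf-8") == "\ufeff" + text
```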

As for why: a BOM only makes sense for certain
encodings, but in the end it is a matter of
choice by whoever designed the encoding.

If you define a TonyEncoding to map between Unicode
and bytes, then you can put whatever headers you want
in front.

Arne
 
Jeff Johnson

Arne said:
I would not consider Unicode an encoding.

Uh, why? An encoding is simply a means of associating a set of bytes with
the characters they represent. That's what Unicode does.
 
Harlan Messinger

Jeff said:
Uh, why? An encoding is simply a means of associating a set of bytes with
the characters they represent. That's what Unicode does.

It isn't an encoding in the binary sense because it only assigns
characters to numbers, it doesn't specify a representation. It doesn't
specify, for example, whether "A" should be represented as 41 or 0041 or
00000041 (or something else), or whether an em-dash would be 2014 or
002014 or 00002014 (or something else).
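
The same two characters can be run through three real encodings to see the different representations. This is a quick Python sketch; the helper name representations is mine, and the big-endian codec variants are used so that no BOM is emitted:

```python
def representations(ch: str) -> dict:
    """Hex bytes of one character under three Unicode encoding forms."""
    return {enc: ch.encode(enc).hex()
            for enc in ("utf-8", "utf-16-be", "utf-32-be")}


for ch in ("A", "\u2014"):  # "A" and the em-dash
    print(f"U+{ord(ch):04X}")
    for enc, hex_bytes in representations(ch).items():
        print(f"  {enc:<10} {hex_bytes}")
```

The code point stays the same in every case (U+0041, U+2014); only the byte representation changes with the encoding chosen.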
 
Peter Duniho

Jeff said:
Uh, why? An encoding is simply a means of associating a set of bytes with
the characters they represent. That's what Unicode does.

I believe Arne's point is that "Unicode" by itself does not describe a
way to encode characters as bytes. There are specific encodings within
Unicode (as part of the standard): UTF-8, UTF-16, and UTF-32. But
Unicode by itself describes a collection of valid characters, not how
they are encoded as bytes.

Pete
 
Jeff Johnson

Peter said:
I believe Arne's point is that "Unicode" by itself does not describe a way
to encode characters as bytes. There are specific encodings within
Unicode (as part of the standard): UTF-8, UTF-16, and UTF-32. But Unicode
by itself describes a collection of valid characters, not how they are
encoded as bytes.

Ah. I just go with the convention that "Unicode" by itself, at least in the
.NET world, means UTF-16LE.
 
Arne Vajhøj

Jeff said:
Uh, why? An encoding is simply a means of associating a set of bytes with
the characters they represent. That's what Unicode does.

No.

Unicode is a mapping between the various symbols and a number.

Encoding is the mapping between the number and one or more bytes.
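
The two steps can be kept apart in code. In this Python sketch, ord() gives the number (the Unicode step), and a hand-written utf8_encode turns that number into one to four bytes (the encoding step). The function is my own illustration of the UTF-8 bit layout, not a library API; in practice str.encode does this for you:

```python
def utf8_encode(code_point: int) -> bytes:
    """Map a Unicode code point (a number) to its UTF-8 byte sequence."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if code_point < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for ch in ("A", "\u2014", "\U0001F600"):
    number = ord(ch)                 # symbol -> number (Unicode)
    data = utf8_encode(number)       # number -> bytes (encoding)
    assert data == ch.encode("utf-8")
    print(f"U+{number:04X} -> {data.hex(' ')}")
```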

Arne
 
Arne Vajhøj

Jeff said:
Ah. I just go with the convention that "Unicode" by itself, at least in the
.NET world, means UTF-16LE.

That is relatively common in a traditional Win32 C++ context.

I hope that it is not so common in a .NET context. The docs for
String and Char very specifically say that it is Unicode in
UTF-16 encoding.

But it may very well be the original poster's interpretation
as well, because he listed Unicode but not UTF-16.

Arne
 
Tony Johansson

I took the list from the different values that the Encoding class has.
There, Unicode is UTF-16.

//Tony
 
Jeff Johnson

Arne said:
No.

Unicode is a mapping between the various symbols and a number.

Encoding is the mapping between the number and 1-many bytes.

Right, but consider this little gem:

===========
Encoding.Unicode Property

Gets an encoding for the UTF-16 format using the little-endian byte order.
===========

I think people can be forgiven for equating the two, especially in the
context of .NET code, since Microsoft plainly made it look that way.
 
