BIG5 encoding and UTF8?


EmeraldShield

We have an application that uses UTF8 everywhere to load / save / process
documents.
One of our clients is having a problem with BIG5-encoded files being
trashed after running through our app.

Indeed, I have verified that if I go to a website in Taiwan, save the page
as BIG5, and then just load / save the file with a UTF8 text reader /
writer, some bytes are modified.

How can I correct this? It was my understanding that UTF8 was supposed to
be high-byte friendly.
 

Lau Lei Cheong

You're correct that UTF8 is a multibyte encoding, but Big5 is simply a
different coding scheme.

When you open the Big5 file in a UTF8 editor (I assume you mean the editor
uses UTF8 to handle its internal strings, like .NET applications do), some
conversion happens there so that it displays correctly. So basically it's
no longer Big5 once you get the file loaded.

How to "fix" this depends on what your "editor" is. In VS.NET you can
change the encoding scheme by choosing "Advanced Save Options..." from the
"File" menu. UltraEdit has a dropdown on its "Save" dialog. If you're
automating an editor from a COM+ application, I suppose you'll find some
exposed methods for this.

If the "editor" is something you wrote yourself in .NET, you should use
the Encoding class to do the conversion first (the code page for Big5 is
CP950). I prefer to call GetBytes() first and then write the bytes with a
BinaryWriter.
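
For example, a minimal sketch of that conversion (the file paths here are
placeholders, and it assumes you already know the input really is Big5):

using System.IO;
using System.Text;

class Big5ToUtf8
{
    static void Main()
    {
        // Decode the raw bytes as Big5 (code page 950)...
        byte[] big5Bytes = File.ReadAllBytes(@"c:\temp\big5.txt");
        string text = Encoding.GetEncoding(950).GetString(big5Bytes);

        // ...then re-encode the same characters as UTF-8 and write the
        // bytes out with a BinaryWriter, as described above.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        using (BinaryWriter bw = new BinaryWriter(File.Create(@"c:\temp\utf8.txt")))
        {
            bw.Write(utf8Bytes);
        }
    }
}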
 

Walter Wang [MSFT]

Hi,

Thank you for your post.

Based on my understanding, your question is how to process a BIG5 file in
.NET. If I've misunderstood anything, please feel free to post here.

In .NET, a string is a sequential collection of Unicode characters that is
used to represent text. Each Unicode character in a string is defined by a
Unicode scalar value, also called a Unicode code point or the ordinal
(numeric) value of the Unicode character. Each code point is encoded using
UTF-16 encoding.

Encoding is the process of transforming a set of Unicode characters into a
sequence of bytes. Decoding is the reverse. The Unicode Standard assigns a
code point (a number) to each character in every supported script. A
Unicode Transformation Format (UTF) is a way to encode that code point. The
Unicode Standard version 3.2 uses the following UTFs:

* UTF-8, which represents each code point as a sequence of one to four
bytes.
* UTF-16, which represents each code point as a sequence of one to two
16-bit integers.
* UTF-32, which represents each code point as a 32-bit integer.
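
As a small illustration of those sizes (my own example, not part of the
text above), the single character 中 (U+4E2D) takes three bytes in UTF-8,
one 16-bit unit in UTF-16, and one 32-bit unit in UTF-32:

using System;
using System.Text;

class UtfSizes
{
    static void Main()
    {
        string s = "\u4E2D"; // the character 中, code point U+4E2D

        Console.WriteLine(Encoding.UTF8.GetByteCount(s));    // 3 (three bytes)
        Console.WriteLine(Encoding.Unicode.GetByteCount(s)); // 2 (one 16-bit unit)
        Console.WriteLine(Encoding.UTF32.GetByteCount(s));   // 4 (one 32-bit unit)
    }
}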

For a list of .NET supported encodings, please refer to the following MSDN
Library article:

#Encoding Class (System.Text)
http://msdn2.microsoft.com/en-us/library/system.text.encoding.aspx

System.IO.StreamReader is designed for character input in a particular
encoding. StreamReader defaults to UTF-8 encoding unless specified
otherwise, instead of defaulting to the ANSI code page for the current
system.

To open a BIG5-encoded file, you must pass the correct encoding to
StreamReader, for example:

StreamReader sr = new StreamReader(@"c:\temp\1.txt",
Encoding.GetEncoding("big5"));

StreamWriter is like StreamReader: when no encoding is specified, it will
use an instance of UTF8Encoding to encode the text.
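
For example, a sketch of a load / save round trip that keeps a Big5 file
intact reads and writes with the same encoding (the paths are
placeholders):

using System.IO;
using System.Text;

class Big5RoundTrip
{
    static void Main()
    {
        Encoding big5 = Encoding.GetEncoding("big5");

        // Read the file as Big5 and write it back out as Big5; leaving
        // either side at the default UTF-8 would change the bytes.
        using (StreamReader sr = new StreamReader(@"c:\temp\1.txt", big5))
        using (StreamWriter sw = new StreamWriter(@"c:\temp\2.txt", false, big5))
        {
            sw.Write(sr.ReadToEnd());
        }
    }
}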

Hope this helps. Please feel free to post here if anything is unclear.

Regards,
Walter Wang ([email protected], remove 'online.')
Microsoft Online Community Support

==================================================
When responding to posts, please "Reply to Group" via your newsreader so
that others may learn and benefit from your issue.
==================================================

This posting is provided "AS IS" with no warranties, and confers no rights.
 

Walter Wang [MSFT]

Hi,

I am interested in this issue. Would you mind letting me know the result of
the suggestions? If you need further assistance, feel free to let me know.
I will be more than happy to be of assistance.

Have a great day!

Regards,
Walter Wang ([email protected], remove 'online.')
Microsoft Online Community Support

 

EmeraldShield

Hello,

Sorry, I got really sick last week and just got back to the office. I have
not yet tried any of your suggestions, but I will, and I'll post my
findings here.

My big problem right now is that if I use a UTF-8 encoded stream and just
load / save (using Dot Net - not some editor) the bytes are different.

I have tried different settings, but I cannot get Dot Net to process a
BIG5 file without corruption unless I specify BIG5 as the encoding type.
That does not work for us because I don't know whether a file is BIG5 or
not until after it is loaded...
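
Here is a small repro of what I mean (the paths are just placeholders, not
our real pipeline): read the Big5 bytes through the default UTF-8
StreamReader, write them back out, and the bytes no longer match, because
anything that isn't valid UTF-8 gets replaced on the way in.

using System;
using System.IO;

class LossyRepro
{
    static void Main()
    {
        byte[] original = File.ReadAllBytes(@"c:\temp\big5.txt");

        // Load / save through the default (UTF-8) reader and writer.
        string text;
        using (StreamReader sr = new StreamReader(@"c:\temp\big5.txt"))
        {
            text = sr.ReadToEnd();
        }
        using (StreamWriter sw = new StreamWriter(@"c:\temp\copy.txt"))
        {
            sw.Write(text);
        }

        byte[] copy = File.ReadAllBytes(@"c:\temp\copy.txt");

        // Prints False for a typical Big5 file: bytes that are not valid
        // UTF-8 were silently replaced with U+FFFD while reading.
        Console.WriteLine(Convert.ToBase64String(original) ==
                          Convert.ToBase64String(copy));
    }
}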

These are emails we process, but for my test I used IE to go to a Taiwanese
site and save the page as BIG5 encoding.

Jason
 

Walter Wang [MSFT]

Hi Jason,

Thank you for your update.

Unfortunately, StreamReader can only detect the encoding by looking at the
first three bytes of the stream. It can only recognize UTF-8,
little-endian Unicode, and big-endian Unicode text, and only if the file
starts with an appropriate byte order mark (BOM). Otherwise, the user must
provide the correct encoding. Also note this Caution in the documentation
of StreamReader:

When you compile a set of characters with a particular cultural setting and
retrieve those same characters with a different cultural setting, the
characters might not be interpretable, and could cause an exception to be
thrown.

For more information, please refer to MSDN Library:

#StreamReader Constructor (Stream, Encoding)
http://msdn2.microsoft.com/en-us/library/ms143456.aspx

Normally an HTML file will have metadata in it that specifies which
encoding it's using, so the browser can display it correctly. For example,
a UTF-8 encoded HTML file will have:

<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=utf-8">

Since you mentioned your program is processing email, depending on the
email format, it may have some meta data in its header which describes its
encoding.
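
As an illustration only (this is not a real MIME parser, and
DetectFromContentType is just a hypothetical helper name), something along
these lines could pull the charset out of such a header and map it to a
.NET Encoding:

using System;
using System.Text;
using System.Text.RegularExpressions;

class CharsetSniffer
{
    // Hypothetical helper: look for charset=... in a Content-Type header
    // line and fall back to UTF-8 when nothing usable is found.
    static Encoding DetectFromContentType(string headerLine)
    {
        Match m = Regex.Match(headerLine, "charset=\"?([A-Za-z0-9_-]+)\"?",
                              RegexOptions.IgnoreCase);
        if (m.Success)
        {
            try
            {
                return Encoding.GetEncoding(m.Groups[1].Value);
            }
            catch (ArgumentException)
            {
                // Unknown charset name; fall through to the default.
            }
        }
        return Encoding.UTF8;
    }

    static void Main()
    {
        Encoding enc = DetectFromContentType(
            "Content-Type: text/html; charset=big5");
        Console.WriteLine(enc.WebName); // big5
    }
}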

Hope this helps. Please feel free to post here if anything is unclear.

Regards,
Walter Wang ([email protected], remove 'online.')
Microsoft Online Community Support

 

EmeraldShield

Walter Wang [MSFT] wrote:
> Unfortunately, StreamReader can only detect the encoding by looking at the
> first three bytes of the stream. It can only recognize UTF-8,
> little-endian Unicode, and big-endian Unicode text, and only if the file
> starts with an appropriate byte order mark (BOM). Otherwise, the user must
> provide the correct encoding.

I thought that UTF-8 with byte order detection disabled would preserve the
original data. Is this not the case? How would we preserve the data? In
C++ you could always just read WORDs and write them back without fear of
corruption.

All email is supposed to be UTF8 friendly (you cannot send high-order
characters), but there is no BOM in the first 3 bytes. These are actual
SMTP communications. Some of the attachments or emails are trashed. The
email itself does not have an identifier to tell us it is BIG5 (I will
double check that today when I get to the office).

Walter Wang [MSFT] wrote:
> Since you mentioned your program is processing email, depending on the
> email format, it may have some meta data in its header which describes
> its encoding.

We are the email processor, so anything added in the header would have to
be done in our app. That is the problem. We have attachments in Outlook
from Chinese customers using BIG5 that are being trashed, and we cannot
figure out why, given that we use UTF8 everywhere.

Thanks,

Jason
 

Jon Skeet [C# MVP]

EmeraldShield said:
> I thought that UTF-8 with byte order detection disabled would preserve the
> original data. Is this not the case? How would we preserve the data? In
> C++ you could always just read WORDs and write them back without fear of
> corruption.

If you need to preserve text data without knowing the encoding, you
need to treat it as opaque binary data - just read it as bytes.

EmeraldShield said:
> All email is supposed to be UTF8 friendly (you cannot send high-order
> characters), but there is no BOM in the first 3 bytes. These are actual
> SMTP communications. Some of the attachments or emails are trashed. The
> email itself does not have an identifier to tell us it is BIG5 (I will
> double check that today when I get to the office).

It should have. There should be a header somewhere saying what encoding
to use - otherwise there are numerous possible ones.

EmeraldShield said:
> We are the email processor, so anything added in the header would have to
> be done in our app. That is the problem. We have attachments in Outlook
> from Chinese customers using BIG5 that are being trashed, and we cannot
> figure out why, given that we use UTF8 everywhere.

I think you're still trying to treat UTF-8 as a silver bullet, and it's
not. You can't treat arbitrary binary data (which is what the
BIG5-encoded text effectively is, as far as you're concerned) as a
valid UTF-8-encoded string and expect to write it out without losing
data.

If you don't want to lose any data and you don't know what the encoding
is, just read and write it as bytes instead of text.
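
For example, a byte-for-byte copy along these lines never applies an
encoding at all, so Big5 (or anything else) passes through untouched - the
file names here are just placeholders:

using System.IO;

class RawCopy
{
    static void Main()
    {
        // Treat the content as opaque binary data: no decoding and no
        // re-encoding, so the bytes come out exactly as they went in.
        byte[] data = File.ReadAllBytes(@"c:\temp\message.eml");
        File.WriteAllBytes(@"c:\temp\message-copy.eml", data);
    }
}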

Jon
 
