System.IO writers and the BOM getting in the way of integration pr

Guest

I'm working on an integration project to move data from my own XML format to
a legacy file format, so this file may be manually imported into the
downstream app.

The downstream app internally recognizes Unicode characters. However, the
import engine on the downstream app chokes on the byte-order mark (BOM) at
the start of any file I write with an encoding other than ASCII.

I have no means for upgrading or fixing this issue in the downstream app.

If I use Excel to save the Unicode / UTF-8 file as ANSI, the lone Unicode
character in the file is preserved but the BOM is removed. The ANSI file
imports without problem in the downstream app, and the Unicode character is
recognized and processed as expected.

In the System.IO classes, every file I create gets a BOM. How can I avoid
this, or, how can I strip off the BOM? If I write the file with the ASCII
encoding, I get ? instead of my unicode character. How does Excel preserve
the character but strip the BOM?

I realize this is not an elegant design ... but the legacy format is
critically important to my customer. Thanks to all in advance,

-David
 
Joerg Jooss

dbaldi said:
I'm working on an integration project to move data from my own XML
format to a legacy file format, so this file may be manually imported
into the downstream app.

The downstream app internally recognizes Unicode characters. However,
the import engine on the downstream app chokes on the byte-order mark
(BOM) at the start of any file I write with an encoding other than
ASCII.

I have no means for upgrading or fixing this issue in the downstream
app.

If I use Excel to save the Unicode / UTF8 file as ANSI, the lone
unicode character in the file is preserved but the BOM is removed.

There's no concept like BOMs in Windows-125x (aka ANSI) nor in ASCII,
so this isn't really surprising.

dbaldi said:
The ANSI file imports without problem in the downstream app, and the
unicode character is recognized and is processed as expected.
In the System.IO classes, every file I create gets a BOM.

What classes/methods do you use?

dbaldi said:
How can I avoid this, or, how can I strip off the BOM? If I write the
file with the ASCII encoding, I get ? instead of my unicode character.
How does Excel preserve the character but strip the BOM?

Writing a BOM is purely optional for UTF-8. You can suppress it by
creating your own UTF8Encoding object:

Encoding utf8 = new UTF8Encoding(false); // false suppresses the BOM

Note that the default UTF8Encoding instance exposed as Encoding.UTF8
*does* emit a BOM. That's probably the root cause for all your problems.
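
As a minimal sketch of that suggestion (the class and method names here are made up for illustration and are not part of System.IO), the BOM-free encoding can be handed straight to a StreamWriter, and the first bytes of the file can be checked afterwards to confirm that no EF BB BF preamble was written:

```csharp
using System.IO;
using System.Text;

// Hypothetical helper class, for illustration only.
static class BomFreeWriter
{
    // Writes text as UTF-8 without the EF BB BF preamble.
    public static void WriteUtf8NoBom(string path, string text)
    {
        // false = encoderShouldEmitUTF8Identifier, i.e. do not emit a BOM.
        Encoding utf8NoBom = new UTF8Encoding(false);
        using (var writer = new StreamWriter(path, false, utf8NoBom))
        {
            writer.Write(text);
        }
    }

    // Returns true if the file starts with the UTF-8 BOM (EF BB BF).
    public static bool StartsWithUtf8Bom(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);
        return bytes.Length >= 3
            && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
    }
}
```

Calling BomFreeWriter.WriteUtf8NoBom(path, text) and then BomFreeWriter.StartsWithUtf8Bom(path) should report false. Note that this only removes the BOM; non-ASCII characters are still written as multi-byte UTF-8 sequences, so whether the downstream importer accepts them is a separate question.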

Cheers,
 
Guest

Still having issues... not able to exactly reproduce what Excel does, which
suggests I misinterpreted the issue in the first place.

When I use this to create my StreamWriter

_writer = new System.IO.StreamWriter( file, false, new UnicodeEncoding());

Then my "bullet" character comes out OK. But, of course, it's got the BOM so
the file fails to import into the downstream app.

BUT, if I use this, to avoid the BOM:

_writer = new System.IO.StreamWriter( file, false, new UnicodeEncoding(false, false));

Then the bullet character is replaced by [space] in the output. I tried big
and little endian on a lark, same results.
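
One possible explanation, offered as a sketch and not a confirmed diagnosis: UnicodeEncoding is UTF-16, so even without a BOM every character is written as two bytes, and the bullet (U+2022) becomes 22 20 in little-endian order. A reader expecting single-byte text would see those two bytes as a double quote followed by a space. The snippet below prints the bytes each encoding produces for the bullet. It assumes that Excel's "ANSI" means Windows-1252 (code page 1252), which depends on the system locale; the helper names are hypothetical.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
static class BulletBytes
{
    // Returns the encoded bytes of text as dash-separated hex, e.g. "E2-80-A2".
    public static string Hex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    // Assumption: "ANSI" is code page 1252 on the machine in question.
    public static Encoding Windows1252()
    {
        // Required on .NET Core / .NET 5+. On .NET Framework, code page 1252
        // is built in and this registration line can be dropped.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }
}
```

For the string "\u2022", Hex gives 22-20 for new UnicodeEncoding(false, false), E2-80-A2 for new UTF8Encoding(false), 3F (a question mark) for Encoding.ASCII, and 95 for Windows1252(). If the downstream app really wants what Excel produces, passing the 1252 encoding to the StreamWriter may be worth trying, since single-byte code pages have no BOM at all.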
 
