Need to reliably detect a text file's encoding for XML deserialization

Marc Scheuner

Folks,

I have a text file which contains some XML. In its XML header, it
claims to be of UTF-8 encoding - however, it's really not, it's an ANSI
/ Windows-1252 / ISO-8859-1 encoding.

Trouble is: when I deserialize objects from that file, all the German
umlauts and other special characters get dropped, some even cause
deserialization errors.

When I open the file in a text editor and save it as a REAL UTF-8
file, everything works just fine as expected.

I then tried to make sure I open the text file with a StreamReader,
telling it to determine the encoding automatically, and I intended to
then store it as real UTF-8 in case it wasn't really in that encoding.

Trouble is: no matter what encoding the file is in, when I tell
StreamReader to auto-detect the encoding, it *ALWAYS* comes back with
UTF-8 and then my deserialization might fail.

I even tried to use the Platform SDK function "IsTextUnicode" on the
first 256 bytes I read from the file using a FileStream - no luck
either, IsTextUnicode always returns false.

How on earth can I *reliably* detect the encoding of a text file in a
C# app?

Thanks for any hints, pointers, and most notably, CODE SAMPLES !! ;-)

Marc
 
Jon Skeet [C# MVP]

Marc Scheuner said:
How on earth can I *reliably* detect the encoding of a text file in a
C# app?

You can't. Any Windows-1252 file, for instance, is an equally valid
file in other code pages which use all possible values.

However, there are probably ways of chaining together readers etc so
that you can sort out your XML problem if you know the correct
encoding. Of course, a better solution would be to ask whatever
produces the file to do the right thing in the first place, if possible
- where are you getting the file from?
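
A minimal sketch of the "chaining together readers" idea, assuming you already know the real encoding: decode the bytes yourself with a StreamReader and pass that reader to XmlSerializer. Since the serializer then receives already-decoded text, the encoding="utf-8" claim in the XML declaration is not used for decoding. The Customer type and the Latin1XmlLoader helper are hypothetical names made up for illustration. ISO-8859-1 is used here because it matches Windows-1252 for the German umlauts; if the file also contains characters such as the euro sign, Encoding.GetEncoding(1252) would be the better fit (on .NET Core and later this requires registering CodePagesEncodingProvider first).

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Serialization;

// Hypothetical payload type, only for illustration.
public class Customer
{
    public string Name { get; set; }
}

// Hypothetical helper, not part of any library.
public static class Latin1XmlLoader
{
    // Decode the stream with an explicitly chosen encoding, then hand the
    // resulting TextReader to XmlSerializer. The text is already decoded,
    // so the encoding named in the XML declaration is not used to decode it.
    public static T Load<T>(Stream stream, Encoding actualEncoding)
    {
        using (var reader = new StreamReader(stream, actualEncoding))
        {
            var serializer = new XmlSerializer(typeof(T));
            return (T)serializer.Deserialize(reader);
        }
    }
}
```

Usage would be along the lines of Latin1XmlLoader.Load<Customer>(File.OpenRead(path), Encoding.GetEncoding("iso-8859-1")), where path is whatever file you received.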
 
Marc Scheuner

Hi Jon

Jon Skeet said:
You can't. Any Windows-1252 file, for instance, is an equally valid
file in other code pages which use all possible values.

Drats..... I was afraid of that answer :)

Jon Skeet said:
Of course, a better solution would be to ask whatever
produces the file to do the right thing in the first place, if possible
- where are you getting the file from?

It's a file being exchanged between a host app and our app at a
customer's site - they *claim* it's UTF-8 and they even put that in the
XML header - yet, it's really an ANSI (Encoding.Default) file, and
that throws off the XML deserialization.

Thanks!
Marc
 
Jon Skeet [C# MVP]

Marc Scheuner said:
Drats..... I was afraid of that answer :)


It's a file being exchanged between a host app and our app at a
customer's site - they *claim* it's UTF-8 and they even put that in the
XML header - yet, it's really an ANSI (Encoding.Default) file, and
that throws off the XML deserialization.

So can you ask the authors of the "host app" to fix things?
 
Marc Scheuner

Jon Skeet said:
So can you ask the authors of the "host app" to fix things?

I doubt it - they *claim* they're delivering UTF-8, while really
they're sending me an ANSI / Windows-1252 file. Guess I'll just have to
find some technical way to make this configurable or something, since
the stupidity and ignorance on the other side can't be cured ;-)

Marc
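
One possible "technical way", sketched under assumptions: try a strict UTF-8 decode first and only fall back to a configured single-byte encoding when the bytes are not valid UTF-8. This is a heuristic, not the reliable detection asked for at the top of the thread. It cannot tell one single-byte code page from another, and pure-ASCII input decodes the same either way. It helps in this case because bytes such as a Latin-1 "ü" (0xFC) are not valid UTF-8. It also differs from StreamReader's auto-detection, which only looks for a byte order mark and otherwise keeps the encoding it was given (UTF-8 by default). EncodingGuess and DecodeUtf8OrFallback are hypothetical names.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of any library.
public static class EncodingGuess
{
    // Strict UTF-8 decoder: throws on invalid byte sequences instead of
    // silently substituting the replacement character U+FFFD.
    private static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                         throwOnInvalidBytes: true);

    // Returns the text decoded as UTF-8 when the bytes are valid UTF-8,
    // otherwise decoded with the caller-supplied fallback encoding
    // (the configurable part, e.g. ISO-8859-1 or Windows-1252).
    public static string DecodeUtf8OrFallback(byte[] bytes, Encoding fallback)
    {
        try
        {
            // GetString does not strip a UTF-8 byte order mark,
            // so remove a leading U+FEFF if one is present.
            return StrictUtf8.GetString(bytes).TrimStart('\uFEFF');
        }
        catch (DecoderFallbackException)
        {
            return fallback.GetString(bytes);
        }
    }
}
```

The resulting string can then be wrapped in a StringReader and passed to XmlSerializer.Deserialize, or written back out as real UTF-8.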
 
