Need to reliably detect a text file's encoding for XML deserialization

Marc Scheuner · Apr 6, 2006

Folks,

I have a text file which contains some XML. In its XML header, it
claims to be of UTF-8 encoding - however, it's really not, it's an ANSI
/ Windows-1252 / ISO-8859-1 encoding.

Trouble is: when I deserialize objects from that file, all the German
umlauts and other special characters get dropped, some even cause
deserialization errors.

When I open the file in a text editor and save it as a REAL UTF-8
file, everything works just fine as expected.

I then tried to make sure I open the text file with a StreamReader,
telling it to determine the encoding automatically, and I intended to
then store it as real UTF-8 in case it wasn't really in that encoding.

Trouble is: no matter what encoding the file is in, when I tell
StreamReader to auto-detect the encoding, it *ALWAYS* comes back with
UTF-8 and then my deserialization might fail......
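
A likely explanation for the behaviour described above: StreamReader's auto-detection only inspects a byte order mark at the start of the stream, and when none is present it keeps the encoding passed to the constructor. The following is a minimal sketch of that, not code from the thread; the class name BomDetectionDemo and the sample bytes are made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;

static class BomDetectionDemo
{
    // Reads the bytes with BOM auto-detection switched on and reports
    // which encoding StreamReader ended up using, plus the decoded text.
    public static (string EncodingName, string Text) ReadWithAutoDetect(byte[] bytes)
    {
        using (var reader = new StreamReader(new MemoryStream(bytes), Encoding.UTF8, true))
        {
            string text = reader.ReadToEnd();
            // CurrentEncoding is only meaningful after the first read.
            return (reader.CurrentEncoding.WebName, text);
        }
    }

    static void Main()
    {
        // "für" as Windows-1252 / ISO-8859-1 bytes, with no byte order mark.
        byte[] ansi = { 0x66, 0xFC, 0x72 };
        var result = ReadWithAutoDetect(ansi);

        // No BOM was found, so the reader stays on the UTF-8 default.
        Console.WriteLine(result.EncodingName);

        // 0xFC is not valid UTF-8, so the umlaut is replaced by U+FFFD.
        Console.WriteLine(result.Text.Contains("\uFFFD"));
    }
}
```

This would also account for the "dropped" umlauts: the ANSI bytes are not valid UTF-8 sequences, so the decoder substitutes the replacement character for them.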

I even tried to use the Platform SDK function "IsTextUnicode" on the
first 256 bytes I read from the file using a FileStream - no luck
either, IsTextUnicode always returns false ........

How on earth can I *reliably* detect the encoding of a text file in a
C# app?

Thanks for any hints, pointers, and most notably, CODE SAMPLES !! ;-)

Marc

Jon Skeet [C# MVP] · Apr 6, 2006

Marc Scheuner said:
How on earth can I *reliably* detect the encoding of a text file in a
C# app?

You can't. Any Windows-1252 file, for instance, is an equally valid
file in other code pages which use all possible values.

However, there are probably ways of chaining together readers etc so
that you can sort out your XML problem if you know the correct
encoding. Of course, a better solution would be to ask whatever
produces the file to do the right thing in the first place, if possible
- where are you getting the file from?
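
One way the "chaining together readers" idea could look, as a sketch under assumptions rather than a definitive implementation: decode the bytes with a StreamReader that is told the real encoding, then hand that reader to XmlSerializer. When the serializer is given a TextReader, the characters are already decoded, so the encoding named in the XML declaration is not used for decoding. The Customer type, the class name KnownEncodingDemo and the sample XML are hypothetical. ISO-8859-1 is used here because it is available everywhere; Encoding.GetEncoding(1252) can be swapped in, but on .NET Core and later that requires registering CodePagesEncodingProvider first.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Serialization;

// Hypothetical payload type, only here to make the sketch self-contained.
public class Customer
{
    public string Name;
}

public static class KnownEncodingDemo
{
    // Decodes the stream with the encoding the caller trusts, regardless of
    // what the XML declaration claims, and deserializes from the decoded text.
    public static T Deserialize<T>(Stream input, Encoding actualEncoding)
    {
        // BOM detection is off so the reader never overrides actualEncoding.
        using (var reader = new StreamReader(input, actualEncoding, false))
        {
            var serializer = new XmlSerializer(typeof(T));
            return (T)serializer.Deserialize(reader);
        }
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // The declaration says utf-8, but the bytes are written as ISO-8859-1.
        string xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?>" +
                     "<Customer><Name>M\u00FCller</Name></Customer>";
        byte[] mislabeled = latin1.GetBytes(xml);

        Customer c = Deserialize<Customer>(new MemoryStream(mislabeled), latin1);
        Console.WriteLine(c.Name == "M\u00FCller");
    }
}
```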

Marc Scheuner · Apr 7, 2006

Hi Jon

Jon Skeet said:
You can't. Any Windows-1252 file, for instance, is an equally valid
file in other code pages which use all possible values.

Drats..... I was afraid of that answer :-)

Jon Skeet said:
Of course, a better solution would be to ask whatever
produces the file to do the right thing in the first place, if possible
- where are you getting the file from?

It's a file being exchanged between a host app and our app at a
customer's site - they *claim* it's UTF-8 and they even put that in the
XML header - yet, it's really an ANSI (Encoding.Default) file, and
that throws off the XML deserialization.....

Thanks!
Marc

Jon Skeet [C# MVP] · Apr 7, 2006

Marc Scheuner said:
Drats..... I was afraid of that answer

It's a file being exchanged between a host app and our app at a
customer's site - they *claim* it's UTF-8 and they even put that in the
XML header - yet, it's really an ANSI (Encoding.Default) file, and
that throws off the XML deserialization.....

So can you ask the authors of the "host app" to fix things?

Marc Scheuner · Apr 9, 2006

Jon Skeet said:
So can you ask the authors of the "host app" to fix things?

I doubt it - they *claim* they're delivering UTF-8, while really
they're sending me an ANSI / Windows-1252 file. Guess I'll just have to
find some technical way to make this configurable or something, since
the stupidity and ignorance on the other side can't be cured ;-)

Marc
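
A "technical way to make this configurable" might be sketched like this. It is a heuristic, not the reliable detection that was ruled out earlier in the thread: try a strict UTF-8 decode first, and if the bytes are not valid UTF-8, fall back to an encoding supplied by configuration. The class name Utf8OrFallback and the sample bytes are invented for illustration. Pure ASCII input decodes identically either way, and ANSI text that happens to form valid UTF-8 sequences would be misread, though that is uncommon for ordinary German text.

```csharp
using System;
using System.Text;

public static class Utf8OrFallback
{
    // Returns the text decoded as UTF-8 when the bytes are valid UTF-8,
    // otherwise decoded with the configured fallback encoding.
    public static string Decode(byte[] bytes, Encoding fallback)
    {
        // Second argument: throw on invalid bytes instead of silently
        // substituting U+FFFD.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            string text = strictUtf8.GetString(bytes);
            // GetString keeps a leading BOM as U+FEFF; strip it if present.
            return text.Length > 0 && text[0] == '\uFEFF' ? text.Substring(1) : text;
        }
        catch (DecoderFallbackException)
        {
            return fallback.GetString(bytes);
        }
    }

    public static void Main()
    {
        // Assumption: the fallback would come from the app's configuration.
        // Encoding.GetEncoding(1252) also works, but on .NET Core and later
        // it requires registering CodePagesEncodingProvider first.
        Encoding fallback = Encoding.GetEncoding("iso-8859-1");

        byte[] ansi = { 0x66, 0xFC, 0x72 };        // "für" as ISO-8859-1
        byte[] utf8 = { 0x66, 0xC3, 0xBC, 0x72 };  // "für" as UTF-8

        Console.WriteLine(Decode(ansi, fallback) == Decode(utf8, fallback));
    }
}
```

The decoded string can then be passed to the deserializer through a StringReader, which avoids relying on the encoding named in the XML header.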
