Bug in CurrentEncoding.EncodingName?

Nick

Hi,

I have used the following code to detect the encoding of a file:


public string DetermineFileType(string aFileName)
{
    string sEncoding = string.Empty;

    // Passing true asks StreamReader to detect the encoding from byte order marks.
    using (StreamReader oSR = new StreamReader(aFileName, true))
    {
        oSR.ReadToEnd(); // Read the file so that detection has taken place.
        sEncoding = oSR.CurrentEncoding.EncodingName;
    }

    return sEncoding;
}


from:
http://groups.google.com.hk/groups?...%20encoding&hl=zh-TW&lr=&ie=UTF-8&sa=N&tab=wg

But the encoding always shows as Unicode. What's wrong?

Thanks

Nick
 
Jon Skeet [C# MVP]

Nick said:
I have used the following code to detect the encoding of a file:

public string DetermineFileType(string aFileName)
{
    string sEncoding = string.Empty;

    // Passing true asks StreamReader to detect the encoding from byte order marks.
    using (StreamReader oSR = new StreamReader(aFileName, true))
    {
        oSR.ReadToEnd(); // Read the file so that detection has taken place.
        sEncoding = oSR.CurrentEncoding.EncodingName;
    }

    return sEncoding;
}


from:
http://groups.google.com.hk/groups?hl=zh-TW&lr=&ie=UTF-8&threadm=258t005q76lpof86nsbqv4f0o2d66sba20%404ax.com&rnum=1&prev=/groups%3Fq%3Dc%2523%2520detect%2520file%2520encoding%26hl%3Dzh-TW%26lr%3D%26ie%3DUTF-8%26sa%3DN%26tab%3Dwg

But the encoding always shows as Unicode. What's wrong?

As I replied in the thread you quoted there, you shouldn't expect code
like that to correctly determine a file's encoding.

It may be able to work out byte order and encoding for Unicode/UTF-8
files which include byte order marks, but it's unlikely to work for
other files and other encodings.
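
To illustrate what byte order mark detection can and cannot tell you, here is a minimal sketch that compares the first bytes of a file against the preambles of the three encodings mentioned in this thread. The class name BomSniffer and the method DetectFromBom are made up for illustration; only Encoding.GetPreamble and the built-in Encoding instances come from the framework.

```csharp
using System;
using System.Text;

public static class BomSniffer
{
    // Hypothetical helper: returns the encoding whose byte order mark matches
    // the start of the data, or null when no known mark is present.
    public static Encoding DetectFromBom(byte[] head)
    {
        Encoding[] candidates =
        {
            Encoding.UTF8,             // EF BB BF
            Encoding.Unicode,          // FF FE (UTF-16 little-endian)
            Encoding.BigEndianUnicode  // FE FF (UTF-16 big-endian)
        };

        foreach (Encoding candidate in candidates)
        {
            byte[] bom = candidate.GetPreamble();
            if (bom.Length == 0 || head.Length < bom.Length)
            {
                continue;
            }

            bool match = true;
            for (int i = 0; i < bom.Length; i++)
            {
                if (head[i] != bom[i])
                {
                    match = false;
                    break;
                }
            }

            if (match)
            {
                return candidate;
            }
        }

        // No mark found: the bytes alone do not say which encoding was used.
        return null;
    }
}
```

A null result is the common case for files written in an 8-bit code page, which is why this approach says nothing useful about them.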
 
Jay B. Harlow [MVP - Outlook]

Nick,
In addition to the other comments:

Read the help for StreamReader(path, detectEncodingFromByteOrderMarks)
more closely. ;-)

"The detectEncodingFromByteOrderMarks parameter detects the
encoding by looking at the first three bytes of the stream. It
automatically recognizes UTF-8, little-endian Unicode, and
big-endian Unicode text if the file starts with the appropriate
byte order marks. Otherwise, the user-provided encoding is used.
See the Encoding.GetPreamble method for more information."

Remember that if you do not pass an Encoding object to the StreamReader
constructor, UTF8Encoding is used. I would expect the same rule to
apply here. In other words, Encoding.Default is not considered unless you
pass it to the constructor.

Ergo, your code always returns a Unicode encoding: either one detected
from a byte order mark or the UTF-8 fallback.

Have you tried the StreamReader(path, encoding,
detectEncodingFromByteOrderMarks) constructor?

Hope this helps
Jay
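
As a rough sketch of the three-argument constructor suggested above: the fallback encoding is reported whenever the file carries no byte order mark. The class EncodingProbe and the method DescribeEncoding are hypothetical names, and which fallback to pass (Encoding.Default or something else) is an assumption the caller has to make.

```csharp
using System.IO;
using System.Text;

public static class EncodingProbe
{
    // Hypothetical helper: reports the encoding StreamReader settles on,
    // using the caller's fallback when no byte order mark is found.
    public static string DescribeEncoding(string path, Encoding fallback)
    {
        using (StreamReader reader = new StreamReader(path, fallback, true))
        {
            // CurrentEncoding is only updated after the first read,
            // so peek once before asking for it.
            reader.Peek();
            return reader.CurrentEncoding.EncodingName;
        }
    }
}
```

A call such as EncodingProbe.DescribeEncoding(fileName, Encoding.Default) would then report the system default for files without a mark. Note that this still does not detect the encoding; it only reports the fallback you chose.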
 
Jon Skeet [C# MVP]

Nick said:
So any other method can do that?

You can't do it reliably - there's no way to tell (for instance)
whether something is using one 8-bit code page or another. The best you
can do is make heuristic guesses. For instance, if every other byte is 0
most of the time, that *probably* means it's a UTF-16 ("Unicode")
encoding. If the whole file is valid UTF-8, that suggests UTF-8 - but
it's still very dodgy, to be honest.
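
The two heuristics described above can be sketched as follows. This is only an illustration of the guesswork involved: the class EncodingGuesser, the method Guess, the returned strings and the 40% / 10% thresholds are all assumptions picked for the example, not anything the framework provides.

```csharp
using System;
using System.Text;

public static class EncodingGuesser
{
    // Hypothetical helper: returns a hedged, human-readable guess.
    public static string Guess(byte[] data)
    {
        if (data.Length == 0)
        {
            return "unknown";
        }

        // Heuristic 1: count zero bytes at even and odd offsets.
        // Mostly-Latin text in UTF-16 has a zero in every other byte.
        int zerosEven = 0;
        int zerosOdd = 0;
        for (int i = 0; i < data.Length; i++)
        {
            if (data[i] != 0) continue;
            if (i % 2 == 0) zerosEven++; else zerosOdd++;
        }

        int pairs = data.Length / 2;
        if (pairs > 0)
        {
            // Thresholds are arbitrary assumptions for this sketch.
            if (zerosOdd > pairs * 0.4 && zerosEven < pairs * 0.1)
                return "probably UTF-16 little-endian";
            if (zerosEven > pairs * 0.4 && zerosOdd < pairs * 0.1)
                return "probably UTF-16 big-endian";
        }

        // Heuristic 2: strict UTF-8 decoding throws on invalid byte sequences.
        try
        {
            new UTF8Encoding(false, true).GetString(data);
            return "possibly UTF-8 (or plain ASCII)";
        }
        catch (ArgumentException)
        {
            // Not valid UTF-8: some 8-bit code page, but no way to tell which.
            return "unknown 8-bit encoding";
        }
    }
}
```

Short inputs and text in non-Latin scripts can easily fool both checks, so treat the result as a hint to show the user rather than a fact.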
 
