Determine File Encoding

Marc Jennings

Hi there,

Can anyone point out any really obvious flaws in the methodology below
for determining the likely encoding of a file, please? I know the number
of encoding types is small, but that is only because the
possibilities I need to work with are a small list.

private string determineFileEncoding(FileStream strm)
{
    long originalSize = strm.Length;
    StreamReader rdr = new StreamReader(strm);

    strm.Position = 0;
    System.Text.UTF8Encoding unic = new System.Text.UTF8Encoding();
    byte[] inputFile = unic.GetBytes(rdr.ReadToEnd());
    if(inputFile.Length == originalSize)
    {
        return "UTF8";
    }

    strm.Position = 0;
    System.Text.UnicodeEncoding unic2 = new System.Text.UnicodeEncoding();
    byte[] inputFile2 = unic2.GetBytes(rdr.ReadToEnd());
    if(inputFile2.Length == originalSize)
    {
        return "Unicode";
    }

    strm.Position = 0;
    System.Text.UTF7Encoding unic3 = new System.Text.UTF7Encoding();
    byte[] inputFile3 = unic3.GetBytes(rdr.ReadToEnd());
    if(inputFile3.Length == originalSize)
    {
        return "UTF7";
    }

    System.Text.ASCIIEncoding unic4 = new System.Text.ASCIIEncoding();
    byte[] inputFile4 = unic3.GetBytes(rdr.ReadToEnd());
    if(inputFile4.Length == originalSize)
    {
        return "Ascii";
    }

    return "Not known";
}

Thanks in advance
Marc.
 
Nick Malik [Microsoft]

Why read the entire file to determine the encoding? Can't you tell from the
indicator bytes at the beginning?

Forgive me if I don't know much about encoding, but your algorithm appears
wildly inefficient on its face.

--
--- Nick Malik [Microsoft]
MCSD, CFPS, Certified Scrummaster
http://blogs.msdn.com/nickmalik

Disclaimer: Opinions expressed in this forum are my own, and not
representative of my employer.
I do not answer questions on behalf of my employer. I'm just a
programmer helping programmers.
--
 
Marc Jennings

I have to forgive you for not knowing too much about encoding. I know
even less. I agree that the algorithm *is* wildly inefficient, but
the fact is that I have not got a clue. :) Such are the joys of
learning from Google.
 
Guest

Check out the StreamReader constructors that take a bool argument to
determine the encoding from the byte order mark. Also check out the
Encoding.GetPreamble() method.
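
For illustration, here is a minimal sketch of the byte order mark check built on Encoding.GetPreamble(). The class name BomSniffer and the method DetectFromBom are made up for this example, and the candidate list is an assumption; it only covers the Unicode encodings that actually define a BOM.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomSniffer
{
    // Candidate encodings, constructed so that GetPreamble() returns a BOM.
    // UTF-32 LE comes before UTF-16 LE because its BOM (FF FE 00 00)
    // begins with the UTF-16 LE BOM (FF FE).
    private static readonly Encoding[] Candidates =
    {
        new UTF32Encoding(false, true),   // little endian, with BOM
        new UTF32Encoding(true, true),    // big endian, with BOM
        new UTF8Encoding(true),           // emits EF BB BF
        new UnicodeEncoding(false, true), // UTF-16 little endian, with BOM
        new UnicodeEncoding(true, true),  // UTF-16 big endian, with BOM
    };

    // Returns the candidate whose preamble matches the start of the data,
    // or null when no known BOM is present.
    public static Encoding DetectFromBom(byte[] head)
    {
        foreach (Encoding candidate in Candidates)
        {
            byte[] bom = candidate.GetPreamble();
            if (bom.Length == 0 || head.Length < bom.Length) continue;

            bool match = true;
            for (int i = 0; i < bom.Length; i++)
            {
                if (head[i] != bom[i]) { match = false; break; }
            }
            if (match) return candidate;
        }
        return null;
    }

    // Reads at most four bytes from a seekable stream, restores the
    // position, and checks those bytes for a BOM.
    public static Encoding DetectFromBom(Stream stream)
    {
        long start = stream.Position;
        byte[] head = new byte[4];
        int read = 0, n;
        while (read < head.Length &&
               (n = stream.Read(head, read, head.Length - read)) > 0)
        {
            read += n;
        }
        stream.Position = start;
        Array.Resize(ref head, read);
        return DetectFromBom(head);
    }
}
```

The StreamReader(Stream, Encoding, bool) constructor performs a similar check when its bool argument is true; CurrentEncoding reflects the result only after the first read. A null result from the sketch means "no BOM found", not "not Unicode".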
 
Joerg Jooss

The most obvious flaw would be that generally speaking this is
impossible to achieve ;-)

The second flaw is that your code is just plain wrong. You're using a
UTF-8 StreamReader regardless of the actual encoding. This object will
be able to read UTF-8 and ASCII, but UTF-16 will break for sure.

The third flaw is that you assume "the number of types of encoding is
small". I'd say
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_81rn.asp
is not really a short list, although many of these
encodings are not likely to be found in your typical American or
Western European PC environment.

Cheers,
 
Joerg Jooss

KH said:
Check out the StreamReader constructors that take a bool argument to
determine the encoding from the byte order mark. Also check out the
Encoding.GetPreamble() method.

That works only for certain UTFs and maybe some rather obscure stuff,
but today's popular 8-bit encodings like ISO-8859-x or Windows-125x
don't use preambles or BOMs.

Cheers,
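
A small sketch of why the preamble check cannot help with 8-bit encodings, using ISO-8859-1 as the example. The class and method names are made up for illustration. It shows that the encoding reports an empty preamble and that every possible byte value decodes to some character, so no file can ever be ruled out as "not ISO-8859-1".

```csharp
using System;
using System.Text;

public static class Latin1Check
{
    // True when the encoding writes no BOM, so there is nothing to sniff.
    public static bool HasNoPreamble(Encoding encoding)
    {
        return encoding.GetPreamble().Length == 0;
    }

    // Decodes all 256 byte values as ISO-8859-1 and encodes them again.
    // Returns true when the round trip is lossless, meaning every byte
    // value is a valid character in this encoding.
    public static bool EveryByteDecodes()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        byte[] all = new byte[256];
        for (int i = 0; i < all.Length; i++) all[i] = (byte)i;

        string text = latin1.GetString(all);
        byte[] back = latin1.GetBytes(text);

        if (back.Length != all.Length) return false;
        for (int i = 0; i < all.Length; i++)
        {
            if (back[i] != all[i]) return false;
        }
        return true;
    }
}
```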
 
Marc Jennings

On Wed, 01 Jun 2005 11:59:09 -0700, "Joerg Jooss" wrote:

**snip**
Agreed in the general case, but perhaps I should have made my
situation a little clearer. The files that I need to deal with will
only be one of a very small subset of all the possible encodings out
there.

At least now I know my thinking is more flawed than I thought it
was....
 
cody

There is no way to determine the encoding of a file unless you know
exactly what text to expect in the file, or there are marker bytes in
the file, or it has a special file extension.
But you can try a statistical approach. If the bytes at even positions
are mostly bigger than the bytes at odd positions (or was it the other way
around?), you have Unicode. If there are no null chars and no chars below ASCII
#32 except \r and \n, you certainly have ASCII encoding.
In all other cases you may have UTF-8.
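
A rough sketch of that kind of statistical guess, under stated assumptions. The class EncodingGuesser, the "more than half" threshold, and the returned labels are all made up for this example. It counts zero bytes rather than comparing magnitudes: in mostly-Latin UTF-16 LE text the bytes at odd offsets (counting from 0) are zero, and in UTF-16 BE the bytes at even offsets are. It also adds two checks not in the description above: ASCII requires that no byte exceeds 0x7F, and UTF-8 is tested with a strict decoder instead of being assumed.

```csharp
using System;
using System.Text;

public static class EncodingGuesser
{
    // Returns a best guess, never a proof. Text that is not mostly Latin
    // will defeat the UTF-16 check.
    public static string Guess(byte[] data)
    {
        if (data.Length == 0) return "Not known";

        int zerosEven = 0, zerosOdd = 0;
        bool allSevenBit = true;
        for (int i = 0; i < data.Length; i++)
        {
            if (data[i] == 0)
            {
                if (i % 2 == 0) zerosEven++; else zerosOdd++;
            }
            if (data[i] > 0x7F) allSevenBit = false;
        }

        // UTF-16 uses two bytes per code unit, so the length must be even.
        int pairs = data.Length / 2;
        if (pairs > 0 && data.Length % 2 == 0)
        {
            // Arbitrary threshold: more than half of the code units
            // have a zero byte on one side and none on the other.
            if (zerosOdd * 2 > pairs && zerosEven == 0) return "UTF-16 LE";
            if (zerosEven * 2 > pairs && zerosOdd == 0) return "UTF-16 BE";
        }

        // Pure 7-bit data without null bytes is treated as ASCII.
        if (allSevenBit && zerosEven + zerosOdd == 0) return "ASCII";

        // A strict decoder throws on any byte sequence that is not valid UTF-8.
        try
        {
            new UTF8Encoding(false, true).GetString(data);
            return "UTF-8";
        }
        catch (DecoderFallbackException)
        {
            return "Not known";
        }
    }
}
```

A file in an 8-bit encoding such as ISO-8859-1 will usually end up as "Not known" here, which fits the point made elsewhere in the thread that such files cannot be identified from their bytes alone.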
 
Joerg Jooss

The best approach is to have some kind of "protocol" that allows you to
transport metadata such as the character encoding. If this is not possible
(as in the case of plain files), let the user decide by allowing him or
her to select and switch between all supported encodings.

Cheers,
 
Guest

Hello Marc,

If you open a file using StreamReader it will load a CurrentEncoding
with the correct file encoding and convert the bytes to the correct
characters.
 
Jon Skeet [C# MVP]

Roby Eisenbraun Martins said:
If you open a file using StreamReader it will load a CurrentEncoding
with the correct file encoding and convert the bytes to the correct
characters.

Only if you're lucky. It won't be able to guess correctly between
different ANSI character sets, for instance.

It's definitely best to take the guesswork out, either by explicitly
stating the encoding, making sure there *is* only one encoding, or
allowing the user to override any guesswork which has been performed.
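
One way to sketch the "explicit encoding, with override" idea follows. The names TextLoader, userChoice and fallback are hypothetical. An explicitly chosen encoding always wins; only when none is given does the reader fall back to BOM detection, with a stated default for files that carry no BOM.

```csharp
using System;
using System.IO;
using System.Text;

public static class TextLoader
{
    // userChoice: the encoding the user selected, or null for "guess".
    // fallback:   the encoding to assume when guessing and no BOM is found.
    // Note: disposing the StreamReader also closes the underlying stream.
    public static string ReadAllText(Stream stream, Encoding userChoice, Encoding fallback)
    {
        Encoding encoding = userChoice ?? fallback;

        // Only sniff the byte order mark when the user has not decided.
        bool detectFromBom = userChoice == null;

        using (var reader = new StreamReader(stream, encoding, detectFromBom))
        {
            return reader.ReadToEnd();
        }
    }
}
```

A caller could show the decoded text, let the user pick a different encoding from a list, and call ReadAllText again on a fresh stream with that choice as userChoice.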
 
