Determine File Encoding

Marc Jennings · Jun 1, 2005

Hi there,

Can anyone point out any really obvious flaws in the methodology below
to determine the likely encoding of a file, please? I know the number
of encoding types covered is small, but that is only because the list of
possibilities I need to work with is small.

private string determineFileEncoding(FileStream strm)
{
    long originalSize = strm.Length;
    StreamReader rdr = new StreamReader(strm);

    strm.Position = 0;
    System.Text.UTF8Encoding unic = new System.Text.UTF8Encoding();
    byte[] inputFile = unic.GetBytes(rdr.ReadToEnd());
    if(inputFile.Length == originalSize)
    {
        return "UTF8";
    }

    strm.Position = 0;
    System.Text.UnicodeEncoding unic2 = new System.Text.UnicodeEncoding();
    byte[] inputFile2 = unic2.GetBytes(rdr.ReadToEnd());
    if(inputFile2.Length == originalSize)
    {
        return "Unicode";
    }

    strm.Position = 0;
    System.Text.UTF7Encoding unic3 = new System.Text.UTF7Encoding();
    byte[] inputFile3 = unic3.GetBytes(rdr.ReadToEnd());
    if(inputFile3.Length == originalSize)
    {
        return "UTF7";
    }

    System.Text.ASCIIEncoding unic4 = new System.Text.ASCIIEncoding();
    byte[] inputFile4 = unic3.GetBytes(rdr.ReadToEnd());
    if(inputFile4.Length == originalSize)
    {
        return "Ascii";
    }

    return "Not known";
}

Thanks in advance
Marc.

Nick Malik [Microsoft] · Jun 1, 2005

Why read the entire file to determine the encoding? Can't you tell from the
indicator bytes at the beginning?

Forgive me if I don't know much about encoding, but your algorithm appears
wildly inefficient on its face.

--
--- Nick Malik [Microsoft]
MCSD, CFPS, Certified Scrummaster
http://blogs.msdn.com/nickmalik

Disclaimer: Opinions expressed in this forum are my own, and not
representative of my employer.
I do not answer questions on behalf of my employer. I'm just a
programmer helping programmers.
--


Marc Jennings · Jun 1, 2005

I have to forgive you for not knowing too much about encoding. I know
even less. I agree that the algorithm *is* wildly inefficient, but
the fact is that I have not got a clue. :-)

Such are the joys of learning from Google.

Guest · Jun 1, 2005

Check out the StreamReader constructors that take a bool argument to
determine the encoding from the byte order mark. Also check out the
Encoding.GetPreamble() method.
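
As a rough illustration of the two suggestions above, here is a minimal sketch. The class and method names (BomSniffer, DetectFromBom, DetectWithReader) are made up for this example; only StreamReader, Encoding.GetPreamble() and the encoding classes are real framework types. It assumes that a missing BOM simply means "unknown", not any particular encoding.

```csharp
using System;
using System.IO;
using System.Text;

static class BomSniffer
{
    // Candidate encodings, longest preambles first so that
    // UTF-32 LE (FF FE 00 00) is not mistaken for UTF-16 LE (FF FE).
    static readonly Encoding[] Candidates =
    {
        new UTF32Encoding(false, true),   // FF FE 00 00
        new UTF32Encoding(true, true),    // 00 00 FE FF
        new UTF8Encoding(true),           // EF BB BF
        new UnicodeEncoding(false, true), // FF FE (UTF-16 LE)
        new UnicodeEncoding(true, true),  // FE FF (UTF-16 BE)
    };

    // Returns the encoding whose preamble matches the start of the data,
    // or null when no BOM is found (which says nothing about the encoding).
    public static Encoding DetectFromBom(byte[] head)
    {
        foreach (Encoding candidate in Candidates)
        {
            byte[] bom = candidate.GetPreamble();
            if (head.Length < bom.Length) continue;

            bool match = true;
            for (int i = 0; i < bom.Length; i++)
            {
                if (head[i] != bom[i]) { match = false; break; }
            }
            if (match) return candidate;
        }
        return null;
    }

    // Same idea, but letting StreamReader inspect the BOM itself.
    // Note: disposing the reader also closes the stream.
    public static Encoding DetectWithReader(Stream stream, Encoding fallback)
    {
        using (StreamReader reader = new StreamReader(stream, fallback, true))
        {
            reader.Peek(); // forces the reader to look at the first bytes
            return reader.CurrentEncoding; // the fallback if there was no BOM
        }
    }
}
```

Either way, only the first few bytes are examined, so there is no need to read the whole file. When no BOM is present, DetectWithReader just hands back whatever fallback it was given.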

Joerg Jooss · Jun 1, 2005


The most obvious flaw would be that generally speaking this is
impossible to achieve ;-)

The second flaw is that your code is just plain wrong. You're using a
UTF-8 StreamReader regardless of the actual encoding. This object will
be able to read UTF-8 and ASCII, but UTF-16 will break for sure.

The third flaw is that you assume "the number of types of encoding is
small". I'd say
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_81rn.asp
is not really a short list, although many of these
encodings are not likely to be found in your typical American or
Western European PC environment.

Cheers,

Joerg Jooss · Jun 1, 2005

KH said:
Check out the StreamReader constructors that take a bool argument to
determine the encoding from the byte order mark. Also check out the
Encoding.GetPreamble() method.

That works only for certain UTFs and maybe some rather obscure stuff,
but today's popular 8-bit encodings like ISO-8859-x or Windows-125x
don't use preambles or BOMs.
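
A small sketch of that point, using ISO-8859-1 because it is available in the framework without extra setup. The helper names (EightBitDemo, HasPreamble, DecodeLatin1) are hypothetical and exist only for this example.

```csharp
using System;
using System.Text;

static class EightBitDemo
{
    // True when the encoding defines a preamble (BOM) at all.
    public static bool HasPreamble(Encoding encoding)
    {
        return encoding.GetPreamble().Length > 0;
    }

    // Decodes bytes as ISO-8859-1. Every byte value maps to a character,
    // so decoding never fails and nothing in the data marks the encoding.
    public static string DecodeLatin1(byte[] data)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(data);
    }
}
```

HasPreamble is true for Encoding.UTF8 and Encoding.Unicode, but false for ISO-8859-1, so a preamble check has nothing to find in such a file.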

Cheers,

Marc Jennings · Jun 3, 2005

On Wed, 01 Jun 2005 11:59:09 -0700, "Joerg Jooss" wrote:

**snip**

The most obvious flaw would be that generally speaking this is
impossible to achieve ;-)

Agreed in the general case, but perhaps I should have made my
situation a little clearer. The files that I need to deal with will
only be one of a very small subset of all the possible encodings out
there.

At least now I know my thinking is more flawed than I thought it
was....

cody · Jun 3, 2005

There is no way to determine the encoding of a file unless you know
exactly what text to expect in it, or there are marker bytes in
the file, or it has a special file extension.
But you can try a statistical approach. If the bytes at even positions
are mostly bigger than the bytes at odd positions (or was it the other way
around?) you have Unicode. If there are no null chars and no chars below ASCII
#32 except \r and \n, you certainly have ASCII encoding.
In all other cases you may have UTF-8.
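
A rough sketch of such a statistical guess follows. The thresholds and names (EncodingGuesser, Guess, IsValidUtf8) are assumptions made for this example, not an established algorithm. It goes beyond the description above in three places: it counts zero bytes instead of comparing byte sizes, it treats bytes of 0x80 and above as ruling out ASCII, and it checks that the data is well-formed UTF-8 before suggesting UTF-8. It also only recognises UTF-16 when the text is mostly Latin characters.

```csharp
using System;
using System.Text;

static class EncodingGuesser
{
    // Returns a best guess as a label; this is a heuristic, not a proof.
    public static string Guess(byte[] data)
    {
        if (data.Length == 0) return "unknown";

        int zerosAtEven = 0, zerosAtOdd = 0, highBytes = 0, controls = 0;
        for (int i = 0; i < data.Length; i++)
        {
            byte b = data[i];
            if (b == 0)
            {
                if (i % 2 == 0) zerosAtEven++; else zerosAtOdd++;
            }
            else if (b >= 0x80) highBytes++;
            else if (b < 0x20 && b != '\r' && b != '\n' && b != '\t') controls++;
        }

        // Mostly-Latin UTF-16 text has a zero byte in nearly every 16-bit
        // unit: at odd offsets for little endian, at even offsets for big endian.
        int units = data.Length / 2;
        if (units > 0 && zerosAtOdd > units / 2 && zerosAtEven == 0) return "UTF-16 LE (likely)";
        if (units > 0 && zerosAtEven > units / 2 && zerosAtOdd == 0) return "UTF-16 BE (likely)";

        // Stray zero bytes or control characters: probably not plain text.
        if (zerosAtEven + zerosAtOdd > 0 || controls > 0) return "unknown";

        if (highBytes == 0) return "ASCII";
        return IsValidUtf8(data) ? "UTF-8 (likely)" : "unknown 8-bit";
    }

    // A strict decoder throws on malformed sequences instead of replacing them.
    static bool IsValidUtf8(byte[] data)
    {
        try
        {
            new UTF8Encoding(false, true).GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

Even when the UTF-8 check passes, short files in an 8-bit code page can happen to be valid UTF-8, so the result is still only a guess.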

Joerg Jooss · Jun 3, 2005

The best approach is to have some kind of "protocol" that allows you to
transport metadata like the character encoding. If this is not possible
(as in the case of plain files), let the user decide by allowing him or
her to select and switch between all supported encodings.

Cheers,

Guest · Jun 5, 2005

Hello Marc,

If you open a file using StreamReader it will load a CurrentEncoding
with the correct file encoding and convert the bytes to the correct
characters.

Jon Skeet [C# MVP] · Jun 5, 2005

Roby Eisenbraun Martins said:

If you open a file using StreamReader it will load a CurrentEncoding
with the correct file encoding and convert the bytes to the correct
characters.

Only if you're lucky. It won't be able to guess correctly between
different ANSI character sets, for instance.

It's definitely best to take the guesswork out, either by explicitly
stating the encoding, making sure there *is* only one encoding, or
allowing the user to override any guesswork which has been performed.
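
One way to read that advice in code is sketched below. TextLoader, ReadAll, userChoice and defaultEncoding are hypothetical names for this example. The assumption is that an explicit choice always wins, and that BOM detection is only used when no choice was made.

```csharp
using System;
using System.IO;
using System.Text;

static class TextLoader
{
    // userChoice: encoding explicitly selected by the caller or user, or null.
    // defaultEncoding: what to assume for files that carry no BOM.
    public static string ReadAll(Stream stream, Encoding userChoice, Encoding defaultEncoding)
    {
        // An explicit choice switches BOM sniffing off, so the reader
        // cannot silently override what the user asked for.
        bool sniffBom = userChoice == null;
        Encoding encoding = userChoice ?? defaultEncoding;

        // Note: disposing the reader also closes the stream.
        using (StreamReader reader = new StreamReader(stream, encoding, sniffBom))
        {
            return reader.ReadToEnd();
        }
    }
}
```

With this shape, a file that was decoded wrongly can simply be re-read with a different userChoice, which is the "allow the user to override" part.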
