Detect encoding of a text file

 
 
Marc Scheuner [MVP ADSI]
 
      21st Jan 2004
Folks,

I have a number of text files in a directory, and I'd like to know
what type of encoding they're in.

I thought I could just open a StreamReader for each file one at a
time, and let .NET determine the encoding (Default, UTF-8, Unicode) -
there's a nice constructor for StreamReader which takes a file name,
and a boolean "detectEncodingFromByteOrderMarks" which I figured would
do exactly what I want - open the file and see if it's a Default
(ISO-8859-1), UTF-8 or Unicode (UTF-16) file.

Here's my function:

public string DetermineFileType(string aFileName)
{
    string sEncoding = string.Empty;

    StreamReader oSR = new StreamReader(aFileName, true);
    sEncoding = oSR.CurrentEncoding.EncodingName;

    return sEncoding;
}

But that doesn't seem to work - in this case, *all* files are being
labelled as "UTF-8", which I *KNOW* is *NOT* true....

Is there any easy way in C# to let it determine the file's encoding
reliably? Do I really need to "manually" look at the first three bytes
of each file?

Any ideas??

Marc
================================================================
Marc Scheuner May The Source Be With You!
Bern, Switzerland m.scheuner(at)inova.ch
 
 
 
 
 
Jon Skeet [C# MVP]
 
      21st Jan 2004
Marc Scheuner [MVP ADSI] wrote:

<snip>

> Is there any easy way in C# to let it determine the file's encoding
> reliably? Do I really need to "manually" look at the first three bytes
> of each file?


There *is* no way of determining it reliably. Something that starts
with "abc" could be in UTF-8, ASCII, UCS-2 without any BOM, Cp1252,
etc...
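
To illustrate the ambiguity, here is a minimal sketch (the class and method names are made up for this example) that checks whether a string produces the same bytes under ASCII, UTF-8 and ISO-8859-1. Cp1252 is left out only because newer .NET runtimes may need an extra encoding provider registered for it; that omission is an assumption of this sketch, not part of the point being made.

```csharp
using System;
using System.Linq;
using System.Text;

public static class AmbiguityDemo
{
    // Hypothetical helper: true if the text encodes to identical bytes
    // in ASCII, UTF-8 and ISO-8859-1.
    public static bool EncodesIdentically(string text)
    {
        byte[] ascii = Encoding.ASCII.GetBytes(text);
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        byte[] latin1 = Encoding.GetEncoding("iso-8859-1").GetBytes(text);

        return ascii.SequenceEqual(utf8) && utf8.SequenceEqual(latin1);
    }
}
```

For "abc" the three byte sequences are identical, so nothing in the file itself says which encoding was used. Text containing a non-ASCII character such as "é" does produce different bytes, but a file is not guaranteed to contain any.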

If you know that all your files are going to be *either* UCS-2 with a
BOM or UTF-8, that makes things a lot simpler - but you'll still
basically have to look at the first few bytes.
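
Looking at the first few bytes could be sketched roughly as follows. This is only an illustration: the BomSniffer class and its labels are made up for this example, and it checks the UTF-32 marks as well, which goes beyond the two cases mentioned above. A null result means "no BOM found", not "the file is in some particular encoding".

```csharp
using System;
using System.IO;

public static class BomSniffer
{
    // Returns a label for the BOM at the start of the buffer, or null if none.
    // Only the first 'count' bytes of the buffer are considered valid.
    public static string DetectBom(byte[] b, int count)
    {
        // UTF-32 LE is tested before UTF-16 LE because FF FE 00 00 also
        // starts with the UTF-16 LE mark FF FE.
        if (count >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) return "UTF-32 LE";
        if (count >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) return "UTF-32 BE";
        if (count >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8";
        if (count >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16 LE";
        if (count >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16 BE";
        return null;
    }

    // Reads up to four bytes from the file and inspects them.
    public static string DetectBom(string path)
    {
        byte[] buffer = new byte[4];
        int count = 0;
        using (FileStream fs = File.OpenRead(path))
        {
            int read;
            // Read may return fewer bytes than requested, so loop until
            // the buffer is full or the file ends.
            while (count < buffer.Length &&
                   (read = fs.Read(buffer, count, buffer.Length - count)) > 0)
            {
                count += read;
            }
        }
        return DetectBom(buffer, count);
    }
}
```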

--
Jon Skeet - <(E-Mail Removed)>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
 
 
Tadao Machida [MS]
 
      22nd Jan 2004
Hi Marc,

You need to actually read the file to detect the encoding from the byte
order mark.
Your sample does not read the file, which means that the BOM has not been
read yet.

I modified your code as shown below.

public string DetermineFileType(string aFileName)
{
    // The using block closes the file handle once we are done.
    using (StreamReader oSR = new StreamReader(aFileName, true))
    {
        oSR.ReadToEnd(); // Add this line to read the file.
        return oSR.CurrentEncoding.EncodingName;
    }
}
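
As a variation on the ReadToEnd() call above, a single Peek() should be enough to make the reader fill its buffer and check for a BOM, without reading the whole file. The sketch below also passes an explicit fallback encoding; the EncodingProbe class and the fallback parameter are hypothetical names for this example. StreamReader starts out as UTF-8 when only a file name and the detect flag are given, so a result of "UTF-8" (or of whatever fallback is passed here) may simply mean that no BOM was found.

```csharp
using System.IO;
using System.Text;

public static class EncodingProbe
{
    // Returns the encoding indicated by the file's BOM, or the given
    // fallback when the file has no BOM.
    public static Encoding DetectFromBom(string path, Encoding fallback)
    {
        using (StreamReader reader = new StreamReader(path, fallback, true))
        {
            reader.Peek(); // triggers the first buffer fill, which runs BOM detection
            return reader.CurrentEncoding;
        }
    }
}
```

Passing a fallback that cannot plausibly be the real encoding of the files makes it easier to tell "no BOM found" apart from an actual detection.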



Thanks,
Tadao Machida [MS]



 