How to detect if a text file is ISO8859-1,ISO8859-15,UTF-8 or UniCode encoded

Karl Mondale · Jan 22, 2010

Assume I have a text file. How can I detect if the text inside is encoded in

ISO8859-1
ISO8859-15
UTF-8
UniCode

Karl

Pegasus [MVP] · Jan 22, 2010

Karl Mondale said:
Assume I have a text file. How can I detect if the text inside is encoded
in

ISO8859-1
ISO8859-15
UTF-8
UniCode

Karl

I would check Google or Wikipedia, e.g. here:
http://en.wikipedia.org/wiki/ISO/IEC_8859-1. It explains the whole code in
detail. To find out programmatically you need to read the first few bytes.
The exact method depends on the tool you wish to use.
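As one possible illustration of "reading the first few bytes", here is a minimal Python sketch. The function name `peek_bytes` and the default of four bytes are assumptions made for this example, not something from the thread.

```python
def peek_bytes(path, count=4):
    """Return the first `count` bytes of a file as spaced hex pairs."""
    # Open in binary mode so no decoding happens before we look at the bytes.
    with open(path, "rb") as handle:
        head = handle.read(count)
    return " ".join(f"{byte:02X}" for byte in head)
```

Calling `peek_bytes("somefile.txt")` on a file of your own returns a short hex string you can compare against known signatures.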

Tim Slattery · Jan 22, 2010

Assume I have a text file. How can I detect if the text inside is encoded in

ISO8859-1
ISO8859-15
UTF-8
UniCode

You can't really, except by minutely examining the contents and seeing
whether there's something that makes sense in one encoding but not in another.
Even then you may not be sure what the creator intended, and it might
not matter anyway.

In UTF-8, for example, characters in the 7-bit ASCII set are given in a
single byte. (The Unicode code points for those characters are the same as
the 7-bit ASCII codes, just with a bunch of zeroes in front.) Other
Unicode characters are expressed in two to four bytes. So if the
entire file consists of 7-bit ASCII characters, the file will be
exactly the same whether UTF-8 or ASCII was intended.
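The point about 7-bit ASCII can be checked with a short Python sketch. The sample strings and variable names are made up for illustration; the codec names are the ones Python's standard library uses.

```python
# Pure 7-bit ASCII text encodes to the same bytes in all four encodings.
text = "Plain English text, 7-bit only."
encodings = ["ascii", "utf-8", "iso-8859-1", "iso-8859-15"]
encoded = {name: text.encode(name) for name in encodings}
all_identical = len(set(encoded.values())) == 1

# As soon as a non-ASCII character appears, the byte sequences differ.
accented = "caf\u00e9"                          # "café"
utf8_bytes = accented.encode("utf-8")           # é becomes two bytes
latin1_bytes = accented.encode("iso-8859-1")    # é stays one byte
```

So a file with no bytes above 0x7F gives no clue at all about which of these encodings was intended.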

Paul Randall · Jan 22, 2010

The short answer is that you can't always determine the encoding from the
content of a file.

To see why, you can use Notepad to experiment with creating and saving text
as ANSI, Unicode, Unicode Big Endian, and UTF-8. Try pasting in some
text from foreign web pages, as well as plain English text. Looking at the
files in a hex editor, like XVI32, you will see that for all but ANSI,
Notepad prepends a few bytes (called a Byte Order Mark, or BOM) to indicate
the type of text file. For Unicode (UTF-16), it is the two-byte sequence
(hex) FF FE for little endian or FE FF for big endian; for UTF-8 it is the
three-byte sequence EF BB BF. Not all applications prepend a BOM. ANSI and
your two ISO encodings always use one byte per character. What Notepad calls
Unicode is UTF-16, which uses two bytes for most characters and four bytes
(a surrogate pair) for the rest; UTF-32 always uses four bytes per
character. UTF-8 uses a variable number of bytes per character (one to
four), and can encode every Unicode character. When saving as ANSI, Notepad
complains if some characters can't be saved as one-byte characters.

-Paul Randall
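A BOM check along these lines could be sketched in Python as follows. The function name `sniff_bom` and the returned labels are assumptions for this example; the BOM constants come from the standard `codecs` module. A missing BOM proves nothing, so `None` only means "no mark found".

```python
import codecs

# Order matters: the UTF-32 LE mark (FF FE 00 00) begins with the
# UTF-16 LE mark (FF FE), so the longer marks are tested first.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data):
    """Return an encoding label if `data` starts with a known BOM, else None."""
    for mark, name in BOMS:
        if data.startswith(mark):
            return name
    return None
```

For example, `sniff_bom(open(path, "rb").read(4))` inspects only the first four bytes of a file, which is enough for every mark in the table.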

Carmel · Jan 22, 2010

Karl Mondale said:
Assume I have a text file. How can I detect if the text inside is encoded in

ISO8859-1
ISO8859-15
UTF-8
UniCode

Karl

Microsoft doesn't distribute a utility that can accomplish that feat easily.
If you can get your file transferred to a FreeBSD or Linux system, you could
use either 'file' or 'enca' to determine its properties.

MAN pages:

http://unixhelp.ed.ac.uk/CGI/man-cgi?file
http://linux.die.net/man/1/enca

John Wunderlich · Jan 22, 2010

Karl Mondale said:

Assume I have a text file. How can I detect if the text inside is
encoded in

ISO8859-1
ISO8859-15
UTF-8
UniCode

Karl

That's rather difficult.

ISO8859-1 is almost identical to -15; -15 replaces eight of the symbol
codes with the Euro sign and a few letters used in French and Finnish. The
only way to tell them apart would be to look at the symbols in context.
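To see exactly which byte values are affected, a small Python sketch can decode every high byte both ways and keep the ones that disagree. The variable names are illustrative only.

```python
# Map each byte value that decodes differently to its two readings:
# (character in ISO 8859-1, character in ISO 8859-15).
differing = {}
for value in range(0xA0, 0x100):
    raw = bytes([value])
    as_latin1 = raw.decode("iso-8859-1")
    as_latin9 = raw.decode("iso-8859-15")
    if as_latin1 != as_latin9:
        differing[value] = (as_latin1, as_latin9)

for value, (old, new) in sorted(differing.items()):
    print(f"0x{value:02X}: {old!r} in -1, {new!r} in -15")
```

If none of those byte values occur in a file, the two ISO encodings give identical text and the question of which one was meant has no practical effect.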

UTF-8 is identical to ISO8859 for the first 128 characters (the ASCII set),
which include all the standard keyboard characters. After that, characters
are encoded as a multi-byte sequence of two to four bytes, each with its
high bit set.
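Because those multi-byte sequences follow strict rules, one common heuristic is simply to try a strict UTF-8 decode. The sketch below assumes a hypothetical helper name, `looks_like_utf8`. A successful decode is strong evidence but not proof, and pure ASCII passes trivially.

```python
def looks_like_utf8(data):
    """Return True if `data` decodes as strict UTF-8, else False."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

ISO 8859 text that contains accented letters usually fails this test, because a lone high byte such as E9 is not a complete UTF-8 sequence.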

On Windows, Unicode is usually encoded in UTF-16. If you're lucky, there
might be a BOM (Byte Order Mark) of FF FE (little endian) or FE FF (big
endian) as the first two bytes in the file. Otherwise, look for a 0x00
(null) byte as every other byte, which is what you get if the text file
contains basic 7-bit ASCII characters.
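The "null as every other byte" test might be sketched like this in Python. The function name `guess_utf16_order`, the 4096-byte sample size, and the 0.4 threshold are all arbitrary assumptions for illustration; the check only works for text that is mostly basic ASCII characters.

```python
def guess_utf16_order(data, threshold=0.4):
    """Guess UTF-16 byte order from where the null bytes fall, or None."""
    sample = data[:4096]
    pairs = len(sample) // 2
    if pairs == 0:
        return None
    even_nulls = sample[0::2].count(0)   # nulls at offsets 0, 2, 4, ...
    odd_nulls = sample[1::2].count(0)    # nulls at offsets 1, 3, 5, ...
    # Little endian stores the low byte first, so ASCII text has its
    # nulls at odd offsets; big endian has them at even offsets.
    if odd_nulls / pairs >= threshold and even_nulls == 0:
        return "utf-16-le"
    if even_nulls / pairs >= threshold and odd_nulls == 0:
        return "utf-16-be"
    return None
```

Text in scripts outside the ASCII range has few or no null bytes in UTF-16, so a result of `None` does not rule UTF-16 out.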

HTH,
John
