File Read Spanish characters

Chip · Dec 9, 2005

There is surprisingly little information on the various encoding options for
reading a text file. I have what seems to be a very basic issue: I'm reading
a text file that includes Spanish characters such as "ñ". When I read the
file into a string, that character is missing. Encoding seems to be the
culprit. File writers SHOULD begin a file with a BOM (Byte Order Mark) to
tell us which encoding to read the file with, but most software doesn't
do this, so we are left with BOM-less files. So how can we reliably read these
files without knowing which encoding they were written with?

Through trial and error I have found that using UTF-7 picks up these Spanish
characters, along with the English:

Dim Reader As New StreamReader(fs, System.Text.Encoding.UTF7)

Since I am clueless on matters of encoding, my question is: am I safe using
UTF-7 if I only care about English and Spanish? What is the downside? I
won't be able to read Romanian? Japanese?

Is there a way to programmatically find the correct encoding without the BOM?

Chip

Joerg Jooss · Dec 12, 2005

Chip said:
> There is surprisingly little information on the various encoding
> options for reading a text file. I have what seems to be a very basic
> issue: I'm reading a text file that includes Spanish characters such
> as "ñ". When I read the file into a string, that character is
> missing. Encoding seems to be the culprit. File writers SHOULD begin
> a file with the BOM (Byte Order Mark) to let us know what encoding to
> read the file with, but most software doesn't do this so we are left
> with BOMless files.

Remember that these are byte order marks, which are intended to be used
for identifying whether an encoding uses Big Endian or Little Endian
representation. The fact that some encodings can be identified by their
BOM is just a nice side effect.
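
For files that do carry a BOM, StreamReader can use it directly. The following is a minimal sketch, not a definitive implementation: the file name is hypothetical, and the program writes its own sample file so that it is self-contained.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module BomDemo
    Sub Main()
        ' Hypothetical file name, used only for this illustration.
        Dim fileName As String = "bom-sample.txt"

        ' Write a UTF-8 file that starts with a BOM (bytes EF BB BF).
        File.WriteAllText(fileName, "señor", New UTF8Encoding(True))

        ' Third argument True = look for a BOM at the start of the file.
        ' Encoding.Default is only the fallback when no BOM is found.
        Using reader As New StreamReader(fileName, Encoding.Default, True)
            Dim text As String = reader.ReadToEnd()
            ' CurrentEncoding is only meaningful after the first read.
            Console.WriteLine(reader.CurrentEncoding.WebName)
            Console.WriteLine(text)
        End Using
    End Sub
End Module
```

This only helps when a BOM is present. For a BOM-less file the reader silently uses the fallback encoding, whether or not it is the right one.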

> So how can we reliably read these files without
> knowing what encoding it was written with?

Only through application-specific metadata (like HTTP headers).
There's no grand universal scheme to tell a file's character encoding.

> Through trial and error I have found that using UTF-7 picks up these
> Spanish characters, along with the English.
>
> Dim Reader As New StreamReader(fs, System.Text.Encoding.UTF7)

That's quite likely not what you want. Try Encoding.Default.
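
Applied to the original line of code, that suggestion might look like the sketch below. The file name is a placeholder. On the .NET Framework, Encoding.Default is the system's ANSI code page (Windows-1252 on a typical Western European or US installation), so the result depends on the machine's regional settings.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module DefaultEncodingDemo
    Sub Main()
        ' Hypothetical path, for illustration only.
        Dim fileName As String = "spanish.txt"

        Using fs As New FileStream(fileName, FileMode.Open, FileAccess.Read)
            ' Same constructor as in the question, with UTF7 swapped for Default.
            Using reader As New StreamReader(fs, Encoding.Default)
                Console.WriteLine(reader.ReadToEnd())
            End Using
        End Using
    End Sub
End Module
```

Because the code page comes from the machine rather than the file, the same file can decode differently on a system configured for another locale.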

> Since I am clueless on matters of encoding, my question is: am I safe
> using UTF-7 if I only care about English and Spanish? What is the
> downside? I won't be able to read Romanian? Japanese?

Depends on the input. UTF-7 is used almost exclusively for e-mail, and
rarely even there. I'd guess the chance of finding a genuinely
UTF-7-encoded file is pretty much zero.

> Is there a way to programmatically find the correct encoding without
> the BOM?

As I said, in general no. If the range of possible encodings is
limited, you may be able to create a proper detection algorithm, though.
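
As one illustration of such a limited detection scheme, the sketch below assumes the file can only be UTF-8 or ISO-8859-1. It relies on the fact that ISO-8859-1 text containing accented characters is almost never valid UTF-8. The helper name DecodeUtf8OrLatin1 and the file name are hypothetical, and this is a heuristic, not a general solution.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module EncodingGuess
    ' Hypothetical helper: decodes the bytes as UTF-8 if they form valid
    ' UTF-8, otherwise falls back to ISO-8859-1.
    ' Assumes BOM-less input; a leading BOM would be kept as a character.
    Function DecodeUtf8OrLatin1(ByVal data As Byte()) As String
        ' Second argument True = throw on invalid byte sequences instead
        ' of silently substituting replacement characters.
        Dim strictUtf8 As New UTF8Encoding(False, True)
        Try
            Return strictUtf8.GetString(data)
        Catch ex As ArgumentException
            ' Not valid UTF-8. ISO-8859-1 maps every byte value to a
            ' character, so this fallback cannot fail.
            Return Encoding.GetEncoding("iso-8859-1").GetString(data)
        End Try
    End Function

    Sub Main()
        ' Hypothetical path, for illustration only.
        Dim data As Byte() = File.ReadAllBytes("spanish.txt")
        Console.WriteLine(DecodeUtf8OrLatin1(data))
    End Sub
End Module
```

Pure ASCII input decodes identically either way, so the order of the two attempts only matters for files that contain bytes above 0x7F.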

Cheers,

Chip · Dec 14, 2005

More info on this:
http://weblogs.asp.net/rosherove/archive/2003/05/15/7054.aspx

Juan T. Llibre · Dec 14, 2005

If you only care about English and Spanish,
you'll be safe using ISO-8859-1.
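
A minimal sketch of what that looks like in code, assuming the bytes really are ISO-8859-1 (the byte values are written out by hand for illustration):

```vbnet
Imports System
Imports System.Text

Module Latin1Demo
    Sub Main()
        Dim latin1 As Encoding = Encoding.GetEncoding("iso-8859-1")

        ' "año" as ISO-8859-1 bytes: a = 0x61, ñ = 0xF1, o = 0x6F.
        Dim data As Byte() = New Byte() {&H61, &HF1, &H6F}
        Dim text As String = latin1.GetString(data)

        ' The middle character is U+00F1 (ñ).
        Console.WriteLine(AscW(text.Chars(1)) = &HF1)
        Console.WriteLine(text)
    End Sub
End Module
```

The same Encoding object can be passed to the StreamReader constructor in place of Encoding.UTF7. If the file is actually UTF-8, "ñ" will come out as the two characters "Ã±" instead.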

Juan T. Llibre
ASP.NET MVP