Reading a text file with Spanish accents

Amy L. · Oct 12, 2007

I am at an absolute loss on what is going on here. I have a text file
with some Spanish writing. Some of the characters have accents. I have
not found any way to read this text file and echo the output to the
console showing the accents.

I have tried using UTF-8 but it does not like the accent characters.

It basically converts
Añoro esta situación

to
A?oro esta situaci?n

What am I missing?
Amy

Nicholas Paldino [.NET/C# MVP] · Oct 12, 2007

Amy,

Well, it's possible that you are reading the file correctly from UTF-8,
but the font for the console doesn't support those characters. What is the
font that you are using and does it support those characters?

Cor Ligthert[MVP] · Oct 12, 2007

Amy,

The Spanish characters are in the 1252 character set. In my opinion it is
a good idea to check that in the regional (country) settings. How to handle
this seems to differ in almost every Windows OS version, so I cannot tell
you exactly. I have had plenty of problems with this, where the characters
were not shown correctly in every application, although that was when
combining character sets 1250 and 1252.

http://msdn2.microsoft.com/en-us/library/aa912040.aspx

Cor

Amy L. · Oct 12, 2007

Nicholas said:
Amy,

Well, it's possible that you are reading the file correctly from
UTF-8, but the font for the console doesn't support those characters.
What is the font that you are using and does it support those characters?

In testing I decided to print each char to the screen along with its
byte value. The code is merely a (int)c where c is a char.

When using StreamReader with Encoding.UTF8 the ñ gets displayed as a ?
with a code of 65535

When using StreamReader with Encoding.Default the ñ gets displayed as a
ñ with a code of 241

When using FileStream with no encoding (don't believe you can set it)
and then printing the characters of the bytes ñ gets displayed as a ñ
with a code of 241.

When attempting to convert the byte array returned from the FileStream
to a String in UTF8 via the code below, the string does not convert
properly (I get the ? for the accented characters).

UTF8Encoding temp = new UTF8Encoding( true );
Console.WriteLine( temp.GetString( b ) );

However, if I do
Console.WriteLine( System.Text.Encoding.Default.GetString( b ) );

It prints correctly.

I have read that using "Encoding.Default" is not good - however it seems
to be the only thing that works. I know the characters are for the most
part being read in correctly especially with FileStream. It just seems
like I am lost on what to do about the encoding of them.

Thoughts?
Darrell

Jon Skeet [C# MVP] · Oct 12, 2007

However, if I do
Console.WriteLine( System.Text.Encoding.Default.GetString( b ) );

It prints correctly.

I have read that using "Encoding.Default" is not good - however it seems
to be the only thing that works. I know the characters are for the most
part being read in correctly especially with FileStream. It just seems
like I am lost on what to do about the encoding of them.

*Characters* are not read at all by a FileStream. Bytes are read by a
FileStream. An Encoding is the way of converting between bytes and
characters.

If your file is effectively encoded using Encoding.Default, that's what
you should use. It would be generally better if you were able to start
with a UTF-8 file, but if you can't control whatever produces the file,
then you need to follow its lead.

Picking an encoding is a bit like picking an image format - you might
prefer PNG to BMP, but if someone gives you a BMP file and you try to
read it as if it were a PNG, you won't get the right picture.
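
To make the bytes-versus-characters point concrete, here is a minimal sketch (not taken from the posts above). It decodes the same five bytes with two encodings and prints the numeric code of the second character. The byte values and the helper name CodeAt are assumptions chosen for illustration: they are "Añoro" as a single-byte Latin-1 file would store it, with ñ as the byte 0xF1.

```csharp
using System;
using System.Text;

class DecodeDemo
{
    // Decode the bytes with the given encoding and return the numeric
    // code of the character at position i in the resulting string.
    public static int CodeAt(byte[] bytes, Encoding encoding, int i)
    {
        return (int)encoding.GetString(bytes)[i];
    }

    static void Main()
    {
        // "Añoro" as stored by a single-byte Latin-1 writer (assumed sample data).
        byte[] bytes = { 0x41, 0xF1, 0x6F, 0x72, 0x6F };

        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        Console.WriteLine(CodeAt(bytes, latin1, 1));        // 241, i.e. ñ
        Console.WriteLine(CodeAt(bytes, Encoding.UTF8, 1)); // not 241: a lone 0xF1 is invalid UTF-8
    }
}
```

The bytes never change; only the interpretation does, which is why the encoding has to match whatever wrote the file.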

Christof Nordiek · Oct 12, 2007

Amy L. said:
In testing I decided to print each char to the screen along with its byte
value. The code is merely a (int)c where c is a char.

When using StreamReader with Encoding.UTF8 the ñ gets displayed as a ?
with a code of 65535

This is a non-character in Unicode. So the file does not seem to be
UTF-8 encoded.

When using StreamReader with Encoding.Default the ñ gets displayed as a ñ
with a code of 241

This is hexadecimal 00F1.
This is the right Unicode for ñ (LATIN SMALL LETTER N WITH TILDE).
So this seems to be the right encoding.

When using FileStream with no encoding (don't believe you can set it) and
then printing the characters of the bytes ñ gets displayed as a ñ with a
code of 241.

So this must be the byte stored in the file.
The ANSI encoding of your system seems to map Unicode 00F1 to byte F1.
In UTF-8 this would be the leading byte of a 4-byte character. Very
probably the next byte cannot be a continuation byte of a UTF-8 character.
So obviously the encoding uses FFFF as the substitution character for an
incorrect encoding.

I have read that using "Encoding.Default" is not good - however it seems
to be the only thing that works.

As Jon said, if this is how the file was encoded, this is the right
encoding to read the file with.
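
The lead-byte reasoning above can be checked with a small sketch (an illustration, not part of the original reply). The helper names ExpectedLength and IsContinuation are hypothetical, and the check looks only at the bit pattern of a byte; it does not validate full UTF-8 sequences.

```csharp
using System;
using System.Text;

class LeadByteDemo
{
    // Length a UTF-8 sequence should have, judged from the bit pattern of
    // its first byte; 0 means the byte cannot start a sequence.
    public static int ExpectedLength(byte first)
    {
        if ((first & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII
        if ((first & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((first & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((first & 0xF8) == 0xF0) return 4; // 11110xxx
        return 0;                             // continuation or invalid byte
    }

    // Continuation bytes always have the form 10xxxxxx.
    public static bool IsContinuation(byte b)
    {
        return (b & 0xC0) == 0x80;
    }

    static void Main()
    {
        Console.WriteLine(ExpectedLength(0xF1)); // 4
        Console.WriteLine(IsContinuation(0x6F)); // False: 'o' cannot continue the sequence

        // For comparison, the bytes UTF-8 actually uses for ñ (U+00F1):
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("\u00F1"))); // C3-B1
    }
}
```

So in "Añoro" stored as single bytes, F1 announces a 4-byte sequence but is followed by a plain ASCII 'o', and the decoder has to reject it.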

Christof

Ben Mc · Oct 12, 2007

Hi Amy,

Just a quick bit of info off the top of my head (I haven't read in
detail about your problem in the above discussion), but what first
comes to mind is why you are trying to use UTF-8 and NOT UTF-16.

The 8 stands for 8 bits, which can hold 0-255 decimal values (a la
ASCII character set). UTF-16 was introduced to handle international
character sets, as it is 16-bit, hence a capacity to hold 65536
different characters - from 0 - 65535 (64k)

Hope this helps.

Cheers,
Ben

Jon Skeet [C# MVP] · Oct 12, 2007

Just a quick bit of info off the top of my head (I haven't read in
detail about your problem in the above discussion), but what first
comes to mind is why you are trying to use UTF-8 and NOT UTF-16.

The 8 stands for 8 bits, which can hold 0-255 decimal values (a la
ASCII character set). UTF-16 was introduced to handle international
character sets, as it is 16-bit, hence a capacity to hold 65536
different characters - from 0 - 65535 (64k)

No, you've completely misunderstood UTF-8, as well as claiming that
ASCII has 256 values (it doesn't - it's only 7 bit).

UTF-8 is perfectly capable of encoding all Unicode characters. A
Unicode character is encoded in 1-4 bytes by UTF-8. UTF-8 is a
pleasantly compact format because it encodes ASCII characters (which
make up the majority of most documents in the Western world) as single
bytes, and is ASCII-compatible in that any valid ASCII document is
also a valid UTF-8 document with the same meaning.

See http://www.cl.cam.ac.uk/~mgk25/unicode.html for more information
(and ignore the fact that it says it's about Linux/Unix).
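
A minimal sketch of the 1-4 byte claim, added for illustration: it asks Encoding.UTF8 how many bytes a few sample characters need. The sample characters and the helper name Utf8Bytes are my own choices, not something from the posts above.

```csharp
using System;
using System.Text;

class Utf8LengthDemo
{
    // Number of bytes UTF-8 needs to encode the given string.
    public static int Utf8Bytes(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }

    static void Main()
    {
        Console.WriteLine(Utf8Bytes("A"));          // 1 (ASCII)
        Console.WriteLine(Utf8Bytes("\u00F1"));     // 2 (ñ)
        Console.WriteLine(Utf8Bytes("\u20AC"));     // 3 (euro sign)
        Console.WriteLine(Utf8Bytes("\U0001D11E")); // 4 (musical G clef, beyond the 16-bit range)

        // ASCII-only text yields identical bytes under ASCII and UTF-8.
        byte[] ascii = Encoding.ASCII.GetBytes("esta");
        byte[] utf8 = Encoding.UTF8.GetBytes("esta");
        Console.WriteLine(BitConverter.ToString(ascii) == BitConverter.ToString(utf8)); // True
    }
}
```

The last sample character also shows that 16 bits are not enough for all of Unicode: UTF-16 itself needs two 16-bit units for it.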

Jon

JC Francis · Mar 30, 2010

The solution is as follows:

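//---------------------------------------------------
// Requires: using System.IO; using System.Text;
string vFilePath = @"c:\bla.txt"; // your file path (verbatim string, so \b is not read as an escape)
string vLine;

// Decode the file's bytes as ISO-8859-1 and dispose the reader when done.
using (StreamReader vFile = new StreamReader (vFilePath, Encoding.GetEncoding ("iso-8859-1")))
{
    while ((vLine = vFile.ReadLine ()) != null) {
        // your process
    }
}
//---------------------------------------------------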

If you need another Encoding just check:
http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx

The iso-8859-1 encoding will solve your problem.
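
If you cannot be sure in advance whether a given file is UTF-8 or a single-byte encoding, one possible approach is to try strict UTF-8 first and fall back. This is only a sketch under assumptions: the helper name DecodeUtf8OrLatin1 is hypothetical, the fallback to iso-8859-1 is my choice for this thread's data, and the test is a heuristic, since some single-byte files happen to be valid UTF-8 as well.

```csharp
using System;
using System.Text;

class FallbackDemo
{
    // Hypothetical helper: decode as strict UTF-8, and if the bytes are
    // not valid UTF-8, decode them as ISO-8859-1 instead.
    public static string DecodeUtf8OrLatin1(byte[] data)
    {
        try
        {
            // Second argument true: throw on invalid bytes instead of
            // silently substituting a replacement character.
            return new UTF8Encoding(false, true).GetString(data);
        }
        catch (DecoderFallbackException)
        {
            return Encoding.GetEncoding("iso-8859-1").GetString(data);
        }
    }

    static void Main()
    {
        byte[] latin1File = { 0x41, 0xF1, 0x6F };     // "Año" saved as Latin-1
        byte[] utf8File = { 0x41, 0xC3, 0xB1, 0x6F }; // "Año" saved as UTF-8

        // Both byte sequences end up as the same three-character string.
        Console.WriteLine(DecodeUtf8OrLatin1(latin1File) == DecodeUtf8OrLatin1(utf8File)); // True
    }
}
```

When you do control the program that writes the file, having it write UTF-8 avoids the guessing altogether.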
