Strange character transformations

CyberSpyders · Aug 9, 2006

Hi,

I have an ASP.Net website, which allows users to upload a file which is
then inserted into a database.

This is all fine until it reads a line with the string +Anu in it.
It transforms this to this char ɻ (which, if Googled for, is
described as Unicode Character 'LATIN SMALL LETTER TURNED R WITH HOOK'
(U+027B) or, in Phonetics, as a 'Retroflex approximant'.)

Has anyone seen this behaviour before, and know how to stop it?
The code's simple - here's an example. The ɻ appears in the output
where the input is +Anu - it's transformed before I can touch it!

using (StreamReader sr = new StreamReader(strFile, System.Text.Encoding.UTF7)) {
    string line;
    // Read and display lines from the file until the end of the file is reached.
    while ((line = sr.ReadLine()) != null) {
        Response.Write(line);
    }
}

Regards

Adam

Marc Gravell · Aug 9, 2006

And is this file actually UTF7? Perhaps you should be doing raw binary
transfers?

Also - you are doing ReadLine and Write, so you may already be losing some
end-of-line characters... not 100% sure...

Marc

CyberSpy · Aug 12, 2006

Graven,

I'm not sure how a 4 letter string like this could be seen as an
encoding issue, but I will certainly give it a go. Thanks for the
suggestion.

Adam
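
A short sketch of why a four-letter string can be an encoding issue. In UTF-7, "+" opens a run of modified base64, so "Anu" is read as bits rather than letters: A = 000000, n = 100111, u = 101110, and the first 16 of those bits are 0000 0010 0111 1011, i.e. U+027B. The class and method names below are made up for illustration, and Encoding.UTF7 is marked obsolete in newer .NET versions, so expect a compiler warning there.

```csharp
using System;
using System.Text;

static class Utf7Demo
{
    // Decodes the ASCII bytes of 'text' as if they were UTF-7, which is
    // effectively what a StreamReader built with Encoding.UTF7 does.
    public static string DecodeAsUtf7(string text)
    {
        byte[] raw = Encoding.ASCII.GetBytes(text);
        return Encoding.UTF7.GetString(raw);
    }

    // Formats each UTF-16 code unit of 's' as U+XXXX, separated by spaces.
    public static string CodePoints(string s)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.AppendFormat("U+{0:X4}", (int)c);
        }
        return sb.ToString();
    }

    static void Main()
    {
        // The input from the original post.
        Console.WriteLine(CodePoints(DecodeAsUtf7("+Anu")));
    }
}
```

If the decoder behaves as described in the first post, this should report U+027B for "+Anu", which is the ɻ Adam is seeing.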

CyberSpy · Aug 16, 2006

Larry,

You were spot on - changing to UTF8 stopped this transformation. Thanks

It's not quite solved my problem though.
The file is a text file, each line being a series of fields delimited by
the ¦ character, as this was unlikely to ever appear in the actual
data.

Unfortunately, UTF8 encoding strips these characters completely. ASCII
encoding, on the other hand, replaces them with ?

Oh the joy of character encoding.

Regards

Adam

Marc Gravell · Aug 16, 2006

The reader should have no problem with this character - are you sure this is
the issue? And are you sure of the contents of the file? (perhaps read it in
a hex editor to see what is actually there).

If we are talking about the same "pipe" character, then both the ASCII and
UTF8 representations are hex 7C (single byte); UTF7 has this as hex 2B 41 48
77 2D; Unicode has 7C 00 or 00 7C (depending on endian-ness), and UTF32 has
7C 00 00 00

So what is actually there?

Marc
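
For anyone wanting to reproduce the byte values above, here is a minimal sketch that asks each Encoding for the bytes of the pipe character and prints them as hex. EncodingBytes and Hex are hypothetical names, not anything from the thread, and Encoding.UTF7 produces an obsolete warning on newer .NET versions.

```csharp
using System;
using System.Text;

static class EncodingBytes
{
    // Returns the bytes 'enc' produces for 'text' as space-separated hex.
    public static string Hex(Encoding enc, string text)
    {
        return BitConverter.ToString(enc.GetBytes(text)).Replace('-', ' ');
    }

    static void Main()
    {
        string pipe = "|"; // U+007C, the ordinary vertical bar
        Console.WriteLine("ASCII:      " + Hex(Encoding.ASCII, pipe));
        Console.WriteLine("UTF7:       " + Hex(Encoding.UTF7, pipe));
        Console.WriteLine("UTF8:       " + Hex(Encoding.UTF8, pipe));
        Console.WriteLine("Unicode LE: " + Hex(Encoding.Unicode, pipe));
        Console.WriteLine("Unicode BE: " + Hex(Encoding.BigEndianUnicode, pipe));
        Console.WriteLine("UTF32:      " + Hex(Encoding.UTF32, pipe));
    }
}
```

Swapping the string for another character (for example "\u00A6", the broken bar) shows how quickly the encodings diverge once you leave the ASCII range.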

Marc Gravell · Aug 16, 2006

Damnit; too many pipes! Is this "broken bar" ([Alt]+0166 on numeric)?

In which case yes, every encoding disagrees:

ASCII: 3F
UTF7: 2B 41 4B 59 2D
UTF8: C2 A6
Unicode: A6 00
UTF32: A6 00 00 00

Actually quite handy, as you can look at the file in hex and figure out
which encoding the file was written in!

Marc
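
The hex-editor check can also be roughed out in code. The sketch below searches the raw bytes of a file for the byte patterns in the table above and reports the first match. It is only a heuristic: DelimiterSniffer and Guess are hypothetical names, the file is assumed to contain at least one broken bar, and the ASCII case (3F) is left out because it cannot be told apart from a genuine question mark. A lone A6 byte, which is how single-byte code pages such as Latin-1 store the broken bar, is reported separately.

```csharp
using System;
using System.IO;

static class DelimiterSniffer
{
    // Returns the index of the first occurrence of 'pattern' in 'data', or -1.
    static int IndexOf(byte[] data, byte[] pattern)
    {
        for (int i = 0; i + pattern.Length <= data.Length; i++)
        {
            int j = 0;
            while (j < pattern.Length && data[i + j] == pattern[j]) j++;
            if (j == pattern.Length) return i;
        }
        return -1;
    }

    // Guesses the encoding from how the broken bar (U+00A6) appears in 'data'.
    // Longer patterns are tested first so that A6 00 00 00 is not mistaken
    // for A6 00, and C2 A6 is not mistaken for a lone A6.
    public static string Guess(byte[] data)
    {
        if (IndexOf(data, new byte[] { 0x2B, 0x41, 0x4B, 0x59, 0x2D }) >= 0) return "UTF-7";
        if (IndexOf(data, new byte[] { 0xA6, 0x00, 0x00, 0x00 }) >= 0) return "UTF-32 (little-endian)";
        if (IndexOf(data, new byte[] { 0xA6, 0x00 }) >= 0) return "UTF-16 (little-endian)";
        if (IndexOf(data, new byte[] { 0xC2, 0xA6 }) >= 0) return "UTF-8";
        if (IndexOf(data, new byte[] { 0xA6 }) >= 0) return "single-byte code page";
        return "unknown";
    }

    static void Main(string[] args)
    {
        if (args.Length == 0) return; // expects a file path as the first argument
        Console.WriteLine(Guess(File.ReadAllBytes(args[0])));
    }
}
```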

Jon Skeet [C# MVP] · Aug 16, 2006

Marc said:
Damnit; too many pipes! Is this "broken bar" ([Alt]+0166 on numeric)?

In which case yes, every encoding disagrees:

ASCII: 3F
UTF7: 2B 41 4B 59 2D
UTF8: C2 A6
Unicode: A6 00
UTF32: A6 00 00 00

Actually quite handy, as you can look at the file in hex and figure out
encoding the file was written in!

There's one more encoding which should be considered: Encoding.Default.
My guess is that that's the one it was *actually* saved in - if it's a
single byte, it can't be any of the UTF encodings, and it can't be
ASCII, so Encoding.Default is a likely culprit.

Jon
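
A minimal sketch of what reading the file with Encoding.Default and splitting on the broken bar might look like. BrokenBarReader, SplitLine and Dump are hypothetical names, and writing to the console stands in for the Response.Write and database insert in the original code. One assumption to flag: on the .NET Framework of this era Encoding.Default is the system ANSI code page, whereas on newer .NET versions it is UTF-8, so there it may be safer to pass the code page you actually expect.

```csharp
using System;
using System.IO;
using System.Text;

static class BrokenBarReader
{
    const char Delimiter = '\u00A6'; // broken bar, the field separator

    // Splits one decoded line into its fields.
    public static string[] SplitLine(string line)
    {
        return line.Split(Delimiter);
    }

    // Reads 'path' using 'encoding' and prints the fields of each line.
    public static void Dump(string path, Encoding encoding)
    {
        using (StreamReader sr = new StreamReader(path, encoding))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                Console.WriteLine(string.Join(" | ", SplitLine(line)));
            }
        }
    }

    static void Main(string[] args)
    {
        if (args.Length == 0) return; // expects a file path as the first argument
        Dump(args[0], Encoding.Default);
    }
}
```

If the delimiter still goes missing, the file was probably not written in the code page being used to read it, and the hex dump is the place to look.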
