Strange character transformations

CyberSpyders

Hi,

I have an ASP.NET website that lets users upload a file, which is
then inserted into a database.

This all works until it reads a line containing the string +Anu.
It transforms this into the character ɻ (which, if Googled for, is
described as Unicode Character 'LATIN SMALL LETTER TURNED R WITH HOOK'
(U+027B) or, in phonetics, as a 'retroflex approximant').

Has anyone seen this behaviour before, and does anyone know how to stop it?
The code's simple - here's an example. The ɻ appears in the output
where the input is +Anu - it's transformed before I can touch it!

string line;
using (StreamReader sr = new StreamReader(strFile, System.Text.Encoding.UTF7))
{
    // Read and display lines from the file until the end of the file is reached.
    while ((line = sr.ReadLine()) != null)
    {
        Response.Write(line);
    }
}

Regards

Adam
 
Marc Gravell

And is this file actually UTF7? Perhaps you should be doing raw binary
transfers?

Also - you are doing ReadLine and Write: ReadLine strips the line terminator
and Write does not add one, so you are already losing the end-of-line
characters.

Marc
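
For anyone wondering how a four-letter string can be an encoding issue: in UTF-7, a '+' starts a run of modified Base64 that encodes UTF-16 code units, so a UTF-7 reader treats "+Anu" as encoded data rather than literal text. The sketch below decodes such a run by hand to show where U+027B comes from. The names Utf7Sketch and DecodeShiftedRun are made up for illustration, and the sketch covers only the Base64 run itself, not the rest of UTF-7 (the "+-" escape, the terminating '-', directly encoded characters).

```csharp
using System;
using System.Text;

// Illustrative only: hypothetical helper, not part of the .NET class library.
public static class Utf7Sketch
{
    const string Base64Alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // Decodes the Base64 run that follows a '+' in UTF-7 into UTF-16 text.
    // Each Base64 character contributes 6 bits; every 16 bits form one char.
    // Leftover bits (fewer than 16) are dropped.
    public static string DecodeShiftedRun(string base64Run)
    {
        var result = new StringBuilder();
        int buffer = 0;    // bits collected so far
        int bitCount = 0;  // how many bits in buffer are valid

        foreach (char c in base64Run)
        {
            int value = Base64Alphabet.IndexOf(c);
            if (value < 0)
                throw new ArgumentException("Not a Base64 character: " + c);

            buffer = (buffer << 6) | value;
            bitCount += 6;

            if (bitCount >= 16)
            {
                bitCount -= 16;
                result.Append((char)((buffer >> bitCount) & 0xFFFF));
                buffer &= (1 << bitCount) - 1; // keep only the unused bits
            }
        }
        return result.ToString();
    }
}
```

Utf7Sketch.DecodeShiftedRun("Anu") returns the single character U+027B (ɻ): A, n and u carry the bits 000000 100111 101110, and the first 16 of those are 0x027B.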
 
CyberSpy

Graven,

I'm not sure how a 4 letter string like this could be seen as an
encoding issue, but I will certainly give it a go. Thanks for the
suggestion.

Adam
 
CyberSpy

Larry,

You were spot on - changing to UTF8 stopped this transformation. Thanks

It's not quite solved my problem though.
The file is a text file, each line being a series of fields delimited by
the ¦ character, as this was unlikely to ever appear in the actual
data.

Unfortunately, UTF8 encoding strips these characters completely. ASCII
encoding, on the other hand, replaces them with ?

Oh the joy of character encoding.

Regards

Adam
 
Marc Gravell

The reader should have no problem with this character - are you sure this is
the issue? And are you sure of the contents of the file? (perhaps read it in
a hex editor to see what is actually there).

If we are talking about the same "pipe" character, then the ASCII and
UTF8 representations are both hex 7C (single byte); UTF7 has this as hex
2B 41 48 77 2D; Unicode has 7C 00 or 00 7C (depending on endian-ness), and
UTF32 has 7C 00 00 00.

So what is actually there?

Marc
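
If no hex editor is to hand, a few lines of C# can show the raw bytes instead. This is only a minimal sketch: HexPeek and FirstBytesAsHex are hypothetical names, and it reads the whole file into memory, so it suits small uploads rather than large files.

```csharp
using System;
using System.IO;

// Illustrative only: hypothetical helper for inspecting an uploaded file.
public static class HexPeek
{
    // Returns the first maxBytes bytes of the file as hex pairs
    // separated by '-' (the format BitConverter.ToString produces).
    public static string FirstBytesAsHex(string path, int maxBytes)
    {
        byte[] all = File.ReadAllBytes(path);
        int count = Math.Min(maxBytes, all.Length);
        return BitConverter.ToString(all, 0, count);
    }
}
```

A file containing the three bytes 61 A6 62 comes back as "61-A6-62", which can then be compared against the byte values listed for each encoding.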
 
Marc Gravell

Damnit; too many pipes! Is this "broken bar" ([Alt]+0166 on numeric)?

In which case yes, every encoding disagrees:

ASCII: 3F
UTF7: 2B 41 4B 59 2D
UTF8: C2 A6
Unicode: A6 00
UTF32: A6 00 00 00

Actually quite handy, as you can look at the file in hex and figure out
which encoding the file was written in!

Marc
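
The byte values in the list above can be reproduced with Encoding.GetBytes. A small sketch follows; BrokenBarBytes and Hex are hypothetical names. UTF7 is left out on purpose, because Encoding.UTF7 is marked obsolete in .NET 5 and later and may not compile cleanly there.

```csharp
using System;
using System.Text;

// Illustrative only: hypothetical helper for comparing encodings.
public static class BrokenBarBytes
{
    // Encodes text with the given encoding and returns the bytes
    // as space-separated hex pairs, with no byte order mark.
    public static string Hex(Encoding encoding, string text)
    {
        byte[] bytes = encoding.GetBytes(text);
        return BitConverter.ToString(bytes).Replace("-", " ");
    }
}
```

With the broken bar written as "\u00A6", Hex(Encoding.ASCII, ...) gives "3F", Hex(Encoding.UTF8, ...) gives "C2 A6", Hex(Encoding.Unicode, ...) gives "A6 00" and Hex(Encoding.UTF32, ...) gives "A6 00 00 00".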
 
Jon Skeet [C# MVP]

Marc said:
Damnit; too many pipes! Is this "broken bar" ([Alt]+0166 on numeric)?

In which case yes, every encoding disagrees:

ASCII: 3F
UTF7: 2B 41 4B 59 2D
UTF8: C2 A6
Unicode: A6 00
UTF32: A6 00 00 00

Actually quite handy, as you can look at the file in hex and figure out
which encoding the file was written in!

There's one more encoding which should be considered: Encoding.Default.
My guess is that that's the one it was *actually* saved in - if it's a
single byte, it can't be any of the UTF encodings, and it can't be
ASCII, so Encoding.Default is a likely culprit.

Jon
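
To illustrate the single-byte case: if the delimiter is stored as the lone byte A6, decoding with a single-byte code page recovers ¦, whereas UTF8 sees an invalid sequence. The sketch below makes some assumptions. It uses ISO-8859-1 as a stand-in for the machine's ANSI code page (both ISO-8859-1 and Windows-1252 map byte A6 to U+00A6), because Encoding.Default is the ANSI code page on .NET Framework but is UTF-8 on .NET Core and later. SingleByteSketch and SplitFields are hypothetical names.

```csharp
using System;
using System.Text;

// Illustrative only: hypothetical helper, assuming a single-byte file encoding.
public static class SingleByteSketch
{
    // Decodes one line's raw bytes as ISO-8859-1 and splits it
    // on the broken bar character (U+00A6).
    public static string[] SplitFields(byte[] lineBytes)
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string line = latin1.GetString(lineBytes);
        return line.Split('\u00A6');
    }
}
```

For the bytes 61 A6 62 this returns the two fields "a" and "b". Decoding the same bytes with Encoding.UTF8 instead puts the replacement character U+FFFD where the delimiter should be, so the split finds nothing to split on.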
 
