something about Unicode

  • Thread starter: Jet Leung

Jet Leung

Hi all,
If I read a file with a StreamReader and the encoding is Unicode, then all the
content read from the file is turned into a Unicode string. How can
I make this Unicode string return the real text of that file?
 
Hi Jet,

Convert the string into a byte array and convert it back to a string using a new encoding.

string s = "能! 携帯"; // some random Japanese characters

byte[] b = new byte[s.Length*2]; // the string contains chars, which are two bytes each.

int j = System.Text.Encoding.Unicode.GetBytes(s, 0, s.Length, b, 0); // converts to byte array

string t = System.Text.Encoding.GetEncoding("ISO-8859-1").GetString(b, 0, b.Length);
// converts to 8-bit characters using the ISO-8859-1 character set

byte[] b1 = new byte[t.Length]; // new byte array

int k = System.Text.Encoding.GetEncoding("ISO-8859-1").GetBytes(t, 0, t.Length, b1, 0);
// convert the 8-bit text to byte array

string u = System.Text.Encoding.Unicode.GetString(b1, 0, b1.Length);
// converts the byte array back to a unicode string
 
Morten said:
Hi Jet,

Convert the string into a byte array and convert it back to a string
using a new encoding.

To achieve what? You start having a Unicode string and you end up having
one. There is no such thing as "encoding" for a String object. Applying
different encodings is dangerous as illustrated by your sample code...
string s = "能! 携帯"; // some random Japanese characters

because there's no way to represent these in ISO-8859-1.
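
One way to check whether a given string can be represented in ISO-8859-1 is to ask for a strict encoder that throws instead of silently substituting a replacement character. The following is only a minimal sketch, not code from this thread; the names `Latin1Check` and `FitsInLatin1` are made up for illustration.

```csharp
using System;
using System.Text;

public static class Latin1Check
{
    // Hypothetical helper: returns true when every character of text
    // can be encoded in ISO-8859-1 without loss.
    public static bool FitsInLatin1(string text)
    {
        // ExceptionFallback makes GetBytes throw on an unrepresentable
        // character instead of quietly writing a substitute such as '?'.
        Encoding strict = Encoding.GetEncoding(
            "ISO-8859-1",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }
}
```

Under these assumptions, `Latin1Check.FitsInLatin1("\u80FD")` (a CJK ideograph) returns false, while `Latin1Check.FitsInLatin1("caf\u00E9")` returns true.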

Cheers,
 
Jet Leung said:
If I read a file with a StreamReader and the encoding is Unicode, then all the
content read from the file is turned into a Unicode string. How can
I make this Unicode string return the real text of that file?

You'd be *much* better off using the appropriate encoding to start
with. What's the point of reading it with the Unicode encoding if it's
not in Unicode?
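
As a rough sketch of that advice: pass the file's actual encoding to the StreamReader instead of Encoding.Unicode. The thread never says which encoding the file really uses, so the encoding name is left as a parameter here, and the names `FileText` and `ReadAll` are hypothetical.

```csharp
using System.IO;
using System.Text;

public static class FileText
{
    // Hypothetical helper: reads the whole file, decoding its bytes
    // with the encoding named by the caller (for example "ISO-8859-1").
    public static string ReadAll(string path, string encodingName)
    {
        Encoding encoding = Encoding.GetEncoding(encodingName);
        using (var reader = new StreamReader(path, encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

A call such as `FileText.ReadAll(path, "ISO-8859-1")` then yields the real text directly, with no repair step afterwards, provided the name matches how the file was written.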
 
To achieve what? You start having a Unicode string and you end up having
one. There is no such thing as "encoding" for a String object. Applying
different encodings is dangerous as illustrated by your sample code...

You miss the point. All Strings are Unicode, but if you read text using the wrong encoding you might end up with the wrong text. My sample tells you how to change encoded text. And for good measure I tell you how to change it back.

because there's no way to represent these in ISO-8859-1.

Ever seen web pages using the wrong encoding? All the information is still there, but it is displayed wrong. s in my sample is just such a case: it looks wrong. It shouldn't be Japanese or anything of the sort; the characters were just taken out of nowhere, and as it happens the ISO-8859-1 version of the text is just garbage. But try converting "效汬\u206F潗汲d" instead (the \u206F stands for an invisible formatting character).
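
The repair described here can be sketched as one small function: re-encode the garbled string with the encoding that was wrongly used to read it, which recovers the original bytes, then decode those bytes with the intended encoding. This is an illustration only. The names `EncodingRepair`, `Redecode` and `Sample` are made up, and the sample assumes the suggested test string is the characters U+6548 U+6C6C U+206F U+6F57 U+6C72 followed by d, which may not be exactly what was posted.

```csharp
using System.Text;

public static class EncodingRepair
{
    // Hypothetical helper: turns the garbled string back into the bytes it
    // was read from, then decodes those bytes with the intended encoding.
    public static string Redecode(string garbled, Encoding wrong, Encoding intended)
    {
        byte[] raw = wrong.GetBytes(garbled);
        return intended.GetString(raw);
    }

    // Assumed form of the test string, written with escapes because
    // U+206F is an invisible formatting character.
    public const string Sample = "\u6548\u6C6C\u206F\u6F57\u6C72d";
}
```

With that assumption, `EncodingRepair.Redecode(EncodingRepair.Sample, Encoding.Unicode, Encoding.GetEncoding("ISO-8859-1"))` yields "Hello World" followed by a single NUL character, which comes from the high byte of the final d. The approach only works when the wrong decode lost nothing: bytes that were already replaced with U+FFFD during the first read cannot be recovered this way.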
 
Morten said:
You miss the point. All Strings are Unicode, but if you read text
using the wrong encoding you might end up with the wrong text. My
sample tells you how to change encoded text. And for good measure I
tell you how to change it back.

I probably do not qualify for "awake, yet" right now, but there's no way
back with Japanese characters and ISO-8859-x... I guess the sample was just
somewhat misleading (me).
Ever seen web pages using the wrong encoding?

Tons of 'em ;-)
 