Possible bug in UnicodeEncoding

KrippZ

Hello!

We here at the office have discovered something odd. Can somebody
please verify this potential bug for us?

This code generates a byte buffer and fills it with 256 bytes ranging
from 0 to 255. The bug appears when one Unicode encoder gets the bytes
of a string that another Unicode encoder produced from the byte buffer.

The byte buffers should not differ, but in .NET 2.0 they do.
We have run the test code in VS 2003 and VS 2005, and the results from
2003 don't differ.

Bytes 216, 217 and 222, 223 seem to go missing?!?

// Requires: using System.IO;

static void Main(string[] args)
{
    // Fill the buffer with the byte values 0 to 255.
    byte[] bytearrBuffer = new byte[256];
    for (int i = 0; i < 256; i++)
    {
        bytearrBuffer[i] = (byte)i;
    }
    WriteBuffer(bytearrBuffer, "Buffer.txt");

    // Round-trip the same bytes through UnicodeEncoding and write the result.
    WriteBuffer(new System.Text.UnicodeEncoding().GetBytes(
        new System.Text.UnicodeEncoding().GetString(bytearrBuffer)),
        "Buffer2.txt");
}


public static void WriteBuffer(byte[] arrbyteBuffer, string filename)
{
    try
    {
        string sLogFileName = Path.Combine("c:\\", filename);

        FileStream fs = new FileStream(sLogFileName, FileMode.Create,
            FileAccess.Write, FileShare.Write);
        BinaryWriter bw = new BinaryWriter(fs);

        // Write each byte value as its decimal string.
        for (int i = 0; i < arrbyteBuffer.Length; i++)
        {
            bw.Write(arrbyteBuffer[i].ToString());
        }

        bw.Flush();
        bw.Close();
    }
    catch
    {
        // Errors are silently ignored.
    }
}

Cheers
//KrippZ
 
Mattias Sjögren

We here at the office have discovered something odd. Can somebody
please verify this potential bug for us?

I wouldn't call it a bug. There's no guarantee that a random byte
array will come back the same after an
Encoding.GetString/Encoding.GetBytes roundtrip. Some byte values may
have special meaning or may be invalid according to that encoding. So
you can't take an arbitrary blob and decode it to a string like that.
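
To make the "invalid according to that encoding" point concrete, here is a minimal sketch that asks a strict UnicodeEncoding whether a buffer is well-formed UTF-16. The class and method names (StrictDecodeSketch, IsValidUtf16, MakeBuffer) are made up for illustration; the only library call relied on is the UnicodeEncoding constructor that takes a throwOnInvalidBytes flag.

```csharp
using System;
using System.Text;

static class StrictDecodeSketch
{
    // Fills a buffer with the values 0..255, as in the original test.
    public static byte[] MakeBuffer()
    {
        byte[] buffer = new byte[256];
        for (int i = 0; i < buffer.Length; i++)
        {
            buffer[i] = (byte)i;
        }
        return buffer;
    }

    // Returns true if the bytes decode as well-formed UTF-16, false otherwise.
    public static bool IsValidUtf16(byte[] data)
    {
        // Arguments: little-endian, no byte order mark, throw on invalid bytes.
        Encoding strict = new UnicodeEncoding(false, false, true);
        try
        {
            strict.GetString(data);
            return true;
        }
        catch (ArgumentException)
        {
            // The decoder rejected the input (for example an unpaired surrogate).
            return false;
        }
    }

    static void Main()
    {
        Console.WriteLine("0..255 buffer valid: " + IsValidUtf16(MakeBuffer()));
        Console.WriteLine("\"A\" valid: " + IsValidUtf16(new byte[] { 0x41, 0x00 }));
    }
}
```

A check like this only tells you whether the bytes happen to be valid text; it does not make it safe to treat an arbitrary blob as a string.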


Mattias
 
Jon Skeet [C# MVP]

We here at the office have discovered something odd. Can somebody
please verify this potential bug for us?

Not a bug, or at least not the bug you think it is.
This code generates a byte buffer and fills it with 256 bytes ranging
from 0 to 255. The bug appears when one Unicode encoder gets the bytes
of a string that another Unicode encoder produced from the byte buffer.

The byte buffers should not differ, but in .NET 2.0 they do.
We have run the test code in VS 2003 and VS 2005, and the results from
2003 don't differ.

Bytes 216, 217 and 222, 223 seem to go missing?!?

Your byte array isn't a valid Unicode-encoded string. 0xd800 to 0xdfff
are reserved for surrogate pairs - you need to have a value in
[0xd800-0xdbff] followed by [0xdc00-0xdfff]. So, 216/217 isn't valid,
and neither is 222/223.

In fact, 218/219 and 220/221 shouldn't be valid either, which just goes
to show: garbage in, garbage out.

The moral of the story is that you shouldn't treat arbitrary binary
data as text.
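
If the bytes really do need to travel as a string, one common alternative (not something the original poster used) is Base64, which is designed to carry arbitrary binary data as text. The sketch below compares the two round trips; the class and method names are made up for illustration.

```csharp
using System;
using System.Text;

static class RoundTripSketch
{
    // Fills a buffer with the values 0..255, as in the original test.
    public static byte[] MakeBuffer()
    {
        byte[] buffer = new byte[256];
        for (int i = 0; i < buffer.Length; i++)
        {
            buffer[i] = (byte)i;
        }
        return buffer;
    }

    // Base64 maps any byte sequence to text and back without loss.
    public static byte[] ViaBase64(byte[] data)
    {
        string text = Convert.ToBase64String(data);
        return Convert.FromBase64String(text);
    }

    // Decoding arbitrary bytes as UTF-16 and re-encoding them may change
    // sequences that are not valid text, such as unpaired surrogates.
    public static byte[] ViaUnicodeEncoding(byte[] data)
    {
        Encoding utf16 = new UnicodeEncoding();
        return utf16.GetBytes(utf16.GetString(data));
    }

    // Compares two byte arrays element by element.
    public static bool SameBytes(byte[] a, byte[] b)
    {
        if (a.Length != b.Length) return false;
        for (int i = 0; i < a.Length; i++)
        {
            if (a[i] != b[i]) return false;
        }
        return true;
    }

    static void Main()
    {
        byte[] original = MakeBuffer();
        Console.WriteLine("Base64 round trip intact: "
            + SameBytes(original, ViaBase64(original)));
        Console.WriteLine("UnicodeEncoding round trip intact: "
            + SameBytes(original, ViaUnicodeEncoding(original)));
    }
}
```

The Base64 round trip always returns the original bytes. What the UnicodeEncoding round trip does with the invalid surrogates depends on the framework version, so the sketch only reports whether the bytes survived.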
 
Jon Skeet [C# MVP]

Jon Skeet said:
The byte buffers should not differ, but in .NET 2.0 they do.
We have run the test code in VS 2003 and VS 2005, and the results from
2003 don't differ.

Bytes 216, 217 and 222, 223 seem to go missing?!?

Your byte array isn't a valid Unicode-encoded string. 0xd800 to 0xdfff
are reserved for surrogate pairs - you need to have a value in
[0xd800-0xdbff] followed by [0xdc00-0xdfff]. So, 216/217 isn't valid,
and neither is 222/223.

In fact, 218/219 and 220/221 shouldn't be valid either, which just goes
to show: garbage in, garbage out.

Sorry, I've realised what I'd done wrong in the above analysis. My
general principle was right (as was the conclusion that the byte array
didn't represent a valid Unicode string) but the logic was off.

This bit is right:
Your byte array isn't a valid Unicode-encoded string. 0xd800 to 0xdfff
are reserved for surrogate pairs - you need to have a value in
[0xd800-0xdbff] followed by [0xdc00-0xdfff].

and the bytes 216-225 end up being 16-bit values of:

0xd9d8 0xdbda 0xdddc 0xdfde 0xe1e0

Now, the Encoding looks at the first of those (0xd9d8, a high
surrogate) and expects a low surrogate to follow. One doesn't, so it
presumably ignores the character. It moves on to 0xdbda, which is
"correctly" followed by 0xdddc, so those end up forming a surrogate
pair. The 0xdfde should have been preceded by a high surrogate, so the
Encoding ignores it and moves on to the rest - which are valid in
themselves.
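
The walk-through above can be checked with a short sketch that builds the five 16-bit values from bytes 216-225 (low byte first, as a default UnicodeEncoding reads them) and classifies each one. The class and method names are made up for illustration; char.IsHighSurrogate and char.IsLowSurrogate are the framework methods available from .NET 2.0.

```csharp
using System;

static class SurrogateSketch
{
    // Combines two consecutive bytes into one UTF-16 code unit,
    // low byte first (little-endian).
    public static char CodeUnitAt(byte[] data, int offset)
    {
        return (char)(data[offset] | (data[offset + 1] << 8));
    }

    // Labels a code unit as a high surrogate, a low surrogate,
    // or an ordinary character.
    public static string Classify(char unit)
    {
        if (char.IsHighSurrogate(unit)) return "high surrogate";
        if (char.IsLowSurrogate(unit)) return "low surrogate";
        return "ordinary";
    }

    static void Main()
    {
        // Same contents as the original test buffer: the values 0..255.
        byte[] data = new byte[256];
        for (int i = 0; i < data.Length; i++)
        {
            data[i] = (byte)i;
        }

        // Bytes 216-225 form five code units.
        for (int offset = 216; offset < 226; offset += 2)
        {
            char unit = CodeUnitAt(data, offset);
            Console.WriteLine("0x{0:x4}: {1}", (int)unit, Classify(unit));
        }
    }
}
```

This reports 0xd9d8 and 0xdbda as high surrogates, 0xdddc and 0xdfde as low surrogates, and 0xe1e0 as ordinary, so only the middle two form a valid pair.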
 
