Possible bug in UnicodeEncoding

KrippZ · Sep 12, 2006

Hello!

We here at the office have discovered something odd. Can somebody
please verify this potential bug for us?

This code generates a byte buffer and fills it with 256 bytes ranging
from 0 to 255. The bug appears when one UnicodeEncoding decodes the
byte buffer to a string and another UnicodeEncoding encodes that string
back to bytes.

The byte buffers should not differ, but in .NET 2.0 they do.
We have run the test code in VS 2003 and VS 2005, and with 2003 the
results don't differ.

Bytes 216, 217 and 222, 223 seem to go missing?!?

using System.IO;

class Program
{
    static void Main(string[] args)
    {
        // Fill a buffer with every byte value from 0 to 255.
        byte[] bytearrBuffer = new byte[256];
        for (int i = 0; i < 256; i++)
        {
            bytearrBuffer[i] = (byte)i;
        }
        WriteBuffer(bytearrBuffer, "Buffer.txt");

        // Decode the buffer as UTF-16, re-encode it, and write the result.
        WriteBuffer(new System.Text.UnicodeEncoding().GetBytes(
            new System.Text.UnicodeEncoding().GetString(bytearrBuffer)),
            "Buffer2.txt");
    }

    public static void WriteBuffer(byte[] arrbyteBuffer, string filename)
    {
        try
        {
            string sLogFileName = Path.Combine("c:\\", filename);

            FileStream fs = new FileStream(sLogFileName, FileMode.Create,
                FileAccess.Write, FileShare.Write);
            BinaryWriter bw = new BinaryWriter(fs);

            // Write each byte value as a string.
            for (int i = 0; i < arrbyteBuffer.Length; i++)
            {
                bw.Write(arrbyteBuffer[i].ToString());
            }

            bw.Flush();
            bw.Close();
        }
        catch
        {
            // Errors are silently ignored.
        }
    }
}

Cheers
//KrippZ

Mattias Sjögren · Sep 13, 2006

KrippZ said:
We here at the office have discovered something odd. Can somebody
please verify this potential bug for us?

I wouldn't call it a bug. There's no guarantee that a random byte
array will come back the same after an
Encoding.GetString/Encoding.GetBytes roundtrip. Some byte values may
have special meaning or may be invalid according to that encoding. So
you can't take an arbitrary blob and decode it to a string like that.

Mattias
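
One way to see that the 0 to 255 buffer is not valid UTF-16 is to ask the encoding to throw on invalid input rather than substitute or drop data. The following is only a minimal sketch, assuming .NET 2.0 or later, where the three-argument UnicodeEncoding constructor takes a throwOnInvalidBytes flag. The class and method names (StrictDecodeDemo, IsValidUtf16) are made up for illustration.

```csharp
using System;
using System.Text;

static class StrictDecodeDemo
{
    // Hypothetical helper: returns true if the bytes decode cleanly
    // as little-endian UTF-16, false if the decoder rejects them.
    public static bool IsValidUtf16(byte[] data)
    {
        // Arguments: bigEndian = false, byteOrderMark = false,
        // throwOnInvalidBytes = true.
        UnicodeEncoding strict = new UnicodeEncoding(false, false, true);
        try
        {
            strict.GetString(data);
            return true;
        }
        catch (ArgumentException)
        {
            // DecoderFallbackException derives from ArgumentException.
            return false;
        }
    }

    static void Main()
    {
        byte[] all = new byte[256];
        for (int i = 0; i < all.Length; i++)
        {
            all[i] = (byte)i;
        }

        Console.WriteLine(IsValidUtf16(all));                                // False
        Console.WriteLine(IsValidUtf16(Encoding.Unicode.GetBytes("hello"))); // True
    }
}
```

With the default constructor no exception is thrown, so the invalid sequences are handled silently and the round trip no longer matches the input.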

Jon Skeet [C# MVP] · Sep 13, 2006

KrippZ said:
We here at the office have discovered something odd. Can somebody
please verify this potential bug for us?

Not a bug, or at least not the bug you think it is.

KrippZ said:
This code generates a byte buffer and fills it with 256 bytes ranging
from 0 to 255. The bug appears when one UnicodeEncoding decodes the
byte buffer to a string and another UnicodeEncoding encodes that string
back to bytes.

The byte buffers should not differ, but in .NET 2.0 they do.
We have run the test code in VS 2003 and VS 2005, and with 2003 the
results don't differ.

Bytes 216, 217 and 222, 223 seem to go missing?!?

Your byte array isn't a valid Unicode-encoded string. 0xd800 to 0xdfff
are reserved for surrogate pairs - you need to have a value in
[0xd800-0xdbff] followed by [0xdc00-0xdfff]. So, 216/217 isn't valid,
and neither is 222/223.

In fact, 218/219 and 220/221 shouldn't be valid either, which just goes
to show: garbage in, garbage out.

The moral of the story is that you shouldn't treat arbitrary binary
data as text.
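
If binary data really does have to travel as a string, a binary-to-text encoding such as Base64 round-trips any byte array. This is only a minimal sketch using Convert.ToBase64String and Convert.FromBase64String. The class and method names (Base64RoundTrip, RoundTrip) are hypothetical.

```csharp
using System;

static class Base64RoundTrip
{
    // Hypothetical helper: encodes the bytes as Base64 text and
    // decodes that text back into a new byte array.
    public static byte[] RoundTrip(byte[] data)
    {
        string text = Convert.ToBase64String(data);
        return Convert.FromBase64String(text);
    }

    static void Main()
    {
        byte[] original = new byte[256];
        for (int i = 0; i < original.Length; i++)
        {
            original[i] = (byte)i;
        }

        byte[] restored = RoundTrip(original);

        // Compare the two buffers byte by byte.
        bool same = original.Length == restored.Length;
        for (int i = 0; same && i < original.Length; i++)
        {
            same = original[i] == restored[i];
        }

        Console.WriteLine(same); // True
    }
}
```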

Jon Skeet [C# MVP] · Sep 13, 2006

Jon Skeet said:
The byte buffers should not differ, but in .NET 2.0 they do.
We have run the test code in VS 2003 and VS 2005, and with 2003 the
results don't differ.

bytes 216,217 and 222, 223 seem to go missing?!?

Your byte array isn't a valid Unicode-encoded string. 0xd800 to 0xdfff
are reserved for surrogate pairs - you need to have a value in
[0xd800-0xdbff] followed by [0xdc00-0xdfff]. So, 216/217 isn't valid,
and neither is 222/223.

In fact, 218/219 and 220/221 shouldn't be valid either, which just goes
to show: garbage in, garbage out.

Sorry, I've realised what I'd done wrong in the above analysis. My
general principle was right (as was the conclusion that the byte array
didn't represent a valid Unicode string) but the logic was off.

This bit is right:

Your byte array isn't a valid Unicode-encoded string. 0xd800 to 0xdfff
are reserved for surrogate pairs - you need to have a value in
[0xd800-0xdbff] followed by [0xdc00-0xdfff].

and the bytes 216-225 end up being 16-bit values of:

0xd9d8 0xdbda 0xdddc 0xdfde 0xe1e0

Now, the Encoding looks at the first of those (0xd9d8), which is a high
surrogate, and expects a low surrogate to follow. It doesn't, so it
presumably ignores the character. It moves on to 0xdbda, which is
"correctly" followed by 0xdddc, so those end up forming a surrogate
pair. The 0xdfde should have been preceded by a high surrogate, so it
ignores it and moves on to the rest - which are valid in themselves.
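
The classification of those 16-bit values can be checked with char.IsHighSurrogate and char.IsLowSurrogate (available from .NET 2.0). This is a small sketch, assuming little-endian byte order as used by the default UnicodeEncoding. The helper names (SurrogateWalk, ToCodeUnit, Classify) are hypothetical.

```csharp
using System;

static class SurrogateWalk
{
    // Combines two bytes into one UTF-16 code unit, low byte first
    // (little-endian).
    public static char ToCodeUnit(byte low, byte high)
    {
        return (char)(low | (high << 8));
    }

    // Names the surrogate class of a single UTF-16 code unit.
    public static string Classify(char c)
    {
        if (char.IsHighSurrogate(c)) return "high surrogate";
        if (char.IsLowSurrogate(c)) return "low surrogate";
        return "not a surrogate";
    }

    static void Main()
    {
        // Walk bytes 216..225 two at a time, as the decoder would.
        for (int b = 216; b <= 224; b += 2)
        {
            char unit = ToCodeUnit((byte)b, (byte)(b + 1));
            Console.WriteLine("0x{0:x4} {1}", (int)unit, Classify(unit));
        }
        // Prints:
        // 0xd9d8 high surrogate
        // 0xdbda high surrogate
        // 0xdddc low surrogate
        // 0xdfde low surrogate
        // 0xe1e0 not a surrogate
    }
}
```

Only the middle two units form a valid high-then-low pair, which matches bytes 218 to 221 surviving the round trip while 216/217 and 222/223 do not.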
