VS2005 broke my VS2003 Code

G

Guest

I just fired up the production version of VS 2005. My first activity was to
convert an application that I wrote in VS2003 / .NET 1.1. The new version
does not work the same as the old. I've found a single bit of code that
produces a different [and, from a .NET 1.1 point of view, incorrect] result.
The problem code is:

byte[] cipherText = ...;  // byte array with length = 48 --> 24 Unicode characters
byte[] plaintext = null;
byte[] plaintextBlock = null;
byte[] cipherBlock = null;
int indexToRead = 0;
int readLength = 0;
String cipherTextString = "";
String plainTextString = "";


cipherTextString = Encoding.Unicode.GetString(cipherText);  // <-- this does not work the same as before

Instead of a cipherTextString with a length of 24, I get a length of 23. {I
run the same code in VS2003 and get the correct 24-character length.}

Any help is greatly appreciated.
 
J

Jon Skeet [C# MVP]

Frank Dzaebel said:
byte[] cipherText = ...;  // byte array with length = 48 --> 24 Unicode characters

Give us the real string that is in cipherText.
Note, the length of cipherText need *not* be 48 for
every [24] Unicode character combination.
One Unicode character can, e.g., also consist of 3 bytes.
http://www.yoda.arachsys.com/csharp/unicode.html

Not in terms of the Unicode encoding, which is UTF-16. 48 bytes should
always give 24 UTF-16 code points, I believe.
 
J

Jon Skeet [C# MVP]

katzky said:
I just fired up the production version of VS 2005. My first activity was to
convert an application that I wrote in VS2003 / .NET 1.1. The new version
does not work the same as the old. I've found a single bit of code that
produces a different [and, from a .NET 1.1 point of view, incorrect] result.
The problem code is:

byte[] cipherText = ...;  // byte array with length = 48 --> 24 Unicode characters
byte[] plaintext = null;
byte[] plaintextBlock = null;
byte[] cipherBlock = null;
int indexToRead = 0;
int readLength = 0;
String cipherTextString = "";
String plainTextString = "";


cipherTextString = Encoding.Unicode.GetString(cipherText);  // <-- this does not work the same as before

Instead of a cipherTextString with a length of 24, I get a length of 23. {I
run the same code in VS2003 and get the correct 24-character length.}

Any help is greatly appreciated.

You've got a fundamental problem here, assuming that every valid byte
array is a valid UTF-16 codepoint sequence. It's always a sequence of
UTF-16 code points, but some of those code points may be invalid,
either not being defined at all, or being one half of a surrogate pair.

To represent binary data in text, you should use something like base16
(hex) or a base64 encoding.
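
For example, a minimal sketch of the Base64 round-trip (the byte values here
are made up purely for illustration):

using System;

class Base64Demo
{
    static void Main()
    {
        // Arbitrary binary data -- a stand-in for the real cipherText.
        byte[] cipherText = { 0x60, 0xDB, 0x01, 0xFF, 0x00, 0x42 };

        // Base64 can represent every byte value, so nothing is lost or replaced.
        string encoded = Convert.ToBase64String(cipherText);
        byte[] decoded = Convert.FromBase64String(encoded);

        Console.WriteLine(encoded);                              // YNsB/wBC
        Console.WriteLine(decoded.Length == cipherText.Length);  // True
    }
}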
 
G

Guest

Jon -

I was thinking that perhaps a two-byte sequence might produce an invalid
Unicode value. For instance, the pair [96, 219] does not translate into a
char in VS2005... but it does in VS2003 [char 56160].

My method is really quite a hack [at best]; what I am really trying to do is
take a byte[] and break it into blocks. Is there a better way of taking
the equivalent of a 'Substring' from a byte[]?
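
One way to do this is to copy a range of bytes into a new array, never going
through a string at all. A minimal sketch, assuming a copy is acceptable; the
Slice helper name and block sizes are made up for illustration:

using System;

class ByteSliceDemo
{
    // The byte[] equivalent of String.Substring: copy 'count' bytes
    // starting at 'offset' into a fresh array.
    static byte[] Slice(byte[] source, int offset, int count)
    {
        byte[] block = new byte[count];
        Buffer.BlockCopy(source, offset, block, 0, count);
        return block;
    }

    static void Main()
    {
        byte[] data = new byte[48];                 // stand-in for cipherText
        byte[] firstBlock = Slice(data, 0, 16);     // bytes 0..15
        byte[] secondBlock = Slice(data, 16, 16);   // bytes 16..31
        Console.WriteLine(firstBlock.Length + " " + secondBlock.Length);
    }
}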



Jon Skeet said:
katzky said:
[snip]

You've got a fundamental problem here, assuming that every valid byte
array is a valid UTF-16 codepoint sequence. It's always a sequence of
UTF-16 code points, but some of those code points may be invalid,
either not being defined at all, or being one half of a surrogate pair.

To represent binary data in text, you should use something like base16
(hex) or a base64 encoding.
 
G

Guest

Jon -

The char data type has a range of 0000 - FFFF. I always assumed that this
was continuous. What you are saying suggests that only a subset of 0000 -
FFFF represents a valid char.



Jon Skeet said:
katzky said:
[snip]

You've got a fundamental problem here, assuming that every valid byte
array is a valid UTF-16 codepoint sequence. It's always a sequence of
UTF-16 code points, but some of those code points may be invalid,
either not being defined at all, or being one half of a surrogate pair.

To represent binary data in text, you should use something like base16
(hex) or a base64 encoding.
 
J

Jon Skeet [C# MVP]

katzky said:
The char data type has a range of 0000 - FFFF. I always assumed that this
was continuous. What you are saying suggests that only a subset of 0000 -
FFFF represents a valid char.

The data type itself is continuous, but not every code point is a valid
Unicode character.

It's a bad idea to start producing strings which have undefined
characters in them :)
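
A quick way to see the difference from C# (a minimal sketch; exactly which
code points count as unassigned depends on the Unicode tables the framework
ships with):

using System;
using System.Globalization;

class CharValidityDemo
{
    static void Main()
    {
        // Every 16-bit value maps onto a char, but not every char is a
        // defined Unicode character.
        char surrogate = '\uDB60';   // one half of a surrogate pair
        char unassigned = '\u0378';  // code point with no assigned character (as of current Unicode tables)
        char letter = 'A';

        Console.WriteLine(char.IsSurrogate(surrogate));                     // True
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory(unassigned));  // OtherNotAssigned
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory(letter));      // UppercaseLetter
    }
}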

Jon
 
J

Jon Skeet [C# MVP]

Jon said:
Frank Dzaebel said:
byte[] cipherText = ...;  // byte array with length = 48 --> 24 Unicode characters

Give us the real string that is in cipherText.
Note, the length of cipherText need *not* be 48 for
every [24] Unicode character combination.
One Unicode character can, e.g., also consist of 3 bytes.
http://www.yoda.arachsys.com/csharp/unicode.html

Not in terms of the Unicode encoding, which is UTF-16. 48 bytes should
always give 24 UTF-16 code points, I believe.

Sorry - to be clearer on this: a UTF-16 encoding of a *valid* Unicode
string which contains 24 UTF-16 code-points (which could be fewer than
24 characters if it contains surrogates) will always be 48 bytes long
(without byte-ordering mark).
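
For instance, a minimal sketch of the byte-count arithmetic (the sample
strings are made up; the second one starts with a surrogate pair):

using System;
using System.Text;

class Utf16LengthDemo
{
    static void Main()
    {
        string plain = new string('x', 24);                           // 24 chars, no surrogates
        string withSurrogate = "\uD835\uDD04" + new string('x', 22);  // surrogate pair + 22 chars

        // UTF-16 uses exactly 2 bytes per char (UTF-16 code unit), so
        // 24 chars always encode to 48 bytes (no byte-order mark).
        Console.WriteLine(Encoding.Unicode.GetByteCount(plain));          // 48
        Console.WriteLine(Encoding.Unicode.GetByteCount(withSurrogate));  // 48
    }
}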

Jon
 
G

Guest

Jon -

Well, it worked without problems in the VS2003 version. Take for instance the
byte pair hex 60, DB --> char = \uDB60. This is defined in VS2003 but gives a
null value in VS2005. This does not make sense to me.
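
For reference, 0xDB60 sits in the high-surrogate range (D800 - DBFF), so on
its own it is not a complete character. A minimal sketch of the check (the
decoded length is deliberately not asserted, because the fallback behaviour
for invalid input differs between framework versions):

using System;
using System.Text;

class SurrogateDemo
{
    static void Main()
    {
        byte[] pair = { 0x60, 0xDB };   // little-endian UTF-16 for U+DB60

        // U+DB60 is a high surrogate: unpaired, it is not a valid character.
        Console.WriteLine(char.IsHighSurrogate('\uDB60'));   // True

        // How an unpaired surrogate is decoded depends on the runtime's
        // fallback behaviour, which is why .NET 1.1 and 2.0 differ here.
        string decoded = Encoding.Unicode.GetString(pair);
        Console.WriteLine(decoded.Length);
    }
}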
 
J

Jon Skeet [C# MVP]

katzky said:
Well, it worked without problems in the VS2003 version. Take for instance the
byte pair hex 60, DB --> char = \uDB60. This is defined in VS2003 but gives a
null value in VS2005. This does not make sense to me.

You were relying on unspecified behaviour, basically - and that's never
a good idea.

You should never treat arbitrary binary data as if it's encoded text
data. If you need a string representation of arbitrary binary data, use
something like Base64.

Jon
 
G

Guest

Jon -

Thanks for your help. I fixed the problem by getting rid of the conversion
to string [it was a stupid hack which should have been changed anyway].

I am still confused about why I cannot take any two bytes and produce a valid
char.
 
J

Jon Skeet [C# MVP]

katzky said:
Thanks for your help. I fixed the problem by getting rid of the conversion
to string [it was a stupid hack which should have been changed anyway].

I am still confused about why I cannot take any two bytes and produce a valid
char.

Because only certain values are defined in Unicode, and some other
values have special meanings - for instance, characters which are
"halves" of surrogate pairs. If they occur without their corresponding
other half, you don't have a valid string to start with.

You also need to be aware of the possibility of things like combining
characters - if you happened to end up with the bytes for an "e" and an
accenting combining character, it *may* be reasonable for
UnicodeEncoding to decode that to a single accented "e" character. (The
docs aren't very clear on this, from what I remember.) Of course,
that's less of a problem if you're dealing with text *as text*...
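
For reference, a minimal sketch of a combining sequence (it shows the
combining mark and an explicit normalization step, and makes no claim about
what UnicodeEncoding itself does with such bytes):

using System;
using System.Text;

class CombiningDemo
{
    static void Main()
    {
        // "e" followed by U+0301 COMBINING ACUTE ACCENT, as little-endian UTF-16.
        byte[] utf16 = { 0x65, 0x00, 0x01, 0x03 };

        string decomposed = Encoding.Unicode.GetString(utf16);
        Console.WriteLine(decomposed.Length);        // 2 chars: 'e' + combining accent

        // Normalization (Form C) composes the pair into a single U+00E9 'e-acute'.
        string composed = decomposed.Normalize(NormalizationForm.FormC);
        Console.WriteLine(composed.Length);          // 1
    }
}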

Jon
 
