BinaryWriter and string data type

John Aldrin

Hi,

I'm looking for info that explains the format of a string data type
when written to a stream using a BinaryWriter. I've looked all over
MSDN and the Internet and I cannot seem to find it.

I did some simple testing and it seems that the string data is
prefixed with a variable number of bytes that indicate the length.

1 byte if length <= 127

2 bytes if length > 127 - the 1st byte has its high-order bit set, its
remaining 7 bits hold the low part of the length, plus (2nd byte * 128).

I haven't explored any further.
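
Here's roughly the kind of test I ran (a quick sketch, not my exact code):

    using System;
    using System.IO;

    class DumpPrefix
    {
        static void Main()
        {
            string s = new string('x', 200);          // anything longer than 127 characters
            MemoryStream ms = new MemoryStream();
            BinaryWriter w = new BinaryWriter(ms);
            w.Write(s);                               // length prefix, then the string bytes
            w.Flush();
            byte[] bytes = ms.ToArray();
            // prints C8-01-78-78: 0xC8 0x01 is the two-byte prefix for length 200
            Console.WriteLine(BitConverter.ToString(bytes, 0, 4));
        }
    }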



Thanx

jra
 
Hi John,

UTF8Encoding is used by default.
There is an overloaded constructor that accepts a Stream and an Encoding if
you wish to change the encoding used.
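
For example (just a sketch, assuming a writable stream named stream):

    BinaryWriter w = new BinaryWriter(stream, System.Text.Encoding.Unicode);
    w.Write("hello");   // the length prefix counts the encoded bytes (10 for UTF-16), then those bytes follow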
 
Length is encoded on 1, 2, 3, 4 or 5 bytes as follows:

* The int (32 bits) is split into 7-bit chunks.
* The 8th bit is used to indicate whether the reader should read further (bit
set) or stop (bit clear).

So, if len < 0x7F, it is encoded on one byte as b0 = len
if len < 0x3FFF, it is encoded on 2 bytes as b0 = (len & 0x7F) | 0x80, b1 = len >> 7
if len < 0x1FFFFF, it is encoded on 3 bytes as b0 = (len & 0x7F) | 0x80, b1 = ((len >> 7) & 0x7F) | 0x80, b2 = len >> 14
etc.

len is the length of the UTF8 encoding and it is followed by the UTF8 byte
representation of the string.
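
In code, the prefix boils down to a loop like this (just a sketch of the rule
above; the helper name is mine, not the framework's, though the SSCLI source
has an equivalent protected Write7BitEncodedInt on BinaryWriter):

    // assumes: using System.IO;
    static void Write7BitLength(Stream output, int len)
    {
        uint v = (uint)len;
        while (v >= 0x80)
        {
            output.WriteByte((byte)(v | 0x80));   // low 7 bits, continuation bit set
            v >>= 7;
        }
        output.WriteByte((byte)v);                // last chunk, continuation bit clear
    }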

If you want the source code, I suggest that you download the Shared Source
Common Language Infrastructure from the Microsoft site (just search for
"Shared Source Common Language Infrastructure" on www.microsoft.com). The
source files for the binary reader/writer are in the sscli\clr\src\system\io
directory.

Bruno.
 
Hi John,

UTF8Encoding is used by default.
There is an overloaded constructor that accepts a Stream and an Encoding if
you wish to change the encoding used.

Thanx. I took a quick look at UTF8 encoding and I got the impression
that UTF8 specifies how the string data within a string looks in memory.
I didn't think it addressed the length part of a string.

When looking at examples for the UTF8Encoding class, it appeared that
when strings are converted to byte arrays the length part was not in
the byte array.

I'm looking for a definition of the length format when a string is
written to a stream using a BinaryWriter.

Thanx

jra
 
Oops: the tests should be done with <= rather than <. Here is a corrected
version.

if len <= 0x7F, len is encoded on one byte as b0 = len
if len <= 0x3FFF, it is encoded on 2 bytes as b0 = (len & 0x7F) | 0x80, b1 = len >> 7
if len <= 0x1FFFFF, it is encoded on 3 bytes as b0 = (len & 0x7F) | 0x80, b1 = ((len >> 7) & 0x7F) | 0x80, b2 = len >> 14
etc.
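
The matching read loop is something like this (again only a sketch of the
rule, with my own names):

    // assumes: using System.IO;
    static int Read7BitLength(Stream input)
    {
        int len = 0;
        int shift = 0;
        int b;
        do
        {
            b = input.ReadByte();
            if (b < 0) throw new EndOfStreamException();
            len |= (b & 0x7F) << shift;    // splice in the next 7-bit chunk
            shift += 7;
        } while ((b & 0x80) != 0);         // high bit set means more bytes follow
        return len;
    }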

Bruno
 
Many Thanx

 
Hi John,

Yes, you are right.
The length is added separately, in the form of a 7-bit encoded integer (each
byte carries only 7 bits of the value), little-endian (least significant
chunks first).
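
For example, a length of 300 (0x12C) splits into the 7-bit chunks 0x2C (low)
and 0x02 (high); written least-significant chunk first, with the continuation
bit set on every byte except the last, the prefix comes out as 0xAC 0x02.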
 