BinaryWriter and string data type

John Aldrin

Hi,

I'm looking for info that explains the format of a string data type
when written to a stream using a BinaryWriter. I've looked all over
MSDN and the Internet and I cannot seem to find it.

I did some simple testing and it seems that the string data is
prefixed with a variable number of bytes that indicate the length:

1 byte if length <= 127

2 bytes if length > 127 - the 1st byte has its high-order bit set, and the
remaining bits indicate the number of bytes + (2nd byte * 127).

I haven't explored any further.
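
In case it helps, this is roughly how I looked at the bytes (a quick sketch
using a MemoryStream; the 200-character string is just an arbitrary test value):

using System;
using System.IO;

class DumpPrefix
{
    static void Main()
    {
        // Write one string with BinaryWriter and hex-dump the start of the
        // stream, so the length prefix in front of the character data is visible.
        MemoryStream ms = new MemoryStream();
        BinaryWriter writer = new BinaryWriter(ms);
        writer.Write(new string('x', 200)); // 200 ASCII chars -> 200 UTF-8 bytes
        writer.Flush();

        byte[] bytes = ms.ToArray();
        Console.WriteLine(BitConverter.ToString(bytes, 0, 4)); // prints "C8-01-78-78"
    }
}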



Thanx

jra
 
Miha Markic

Hi John,

UTF8Encoding is used by default.
There is an overloaded constructor that accepts a Stream and an Encoding if
you wish to change the encoding used.
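
For example (a rough sketch; the file name is just for illustration):

using System.IO;
using System.Text;

class EncodingExample
{
    static void Main()
    {
        // new BinaryWriter(stream) uses UTF8Encoding by default; the
        // (Stream, Encoding) overload lets you choose a different encoding.
        FileStream fs = new FileStream("data.bin", FileMode.Create);
        BinaryWriter writer = new BinaryWriter(fs, Encoding.Unicode); // UTF-16 instead of UTF-8
        writer.Write("hello");
        writer.Close(); // closes the underlying stream as well
    }
}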
 
Bruno Jouhier [MVP]

Length is encoded on 1, 2, 3, 4 or 5 bytes as follows:

* The int (32 bits) is split into 7-bit chunks.
* The 8th bit of each byte indicates whether the reader should read further
(bit set) or stop (bit clear).

So, if len < 0x7F, it is encoded on one byte as b0 = len
if len < 0x3FFF, it is encoded on 2 bytes as b0 = (len & 0x7F) | 0x80, b1 =
len >> 7
if len < 0x1FFFFF, it is encoded on 3 bytes as b0 = (len & 0x7F) | 0x80, b1
= ((len >> 7) & 0x7F) | 0x80, b2 = len >> 14
etc.

len is the length of the UTF8 encoding and it is followed by the UTF8 byte
representation of the string.
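
In code, the scheme looks roughly like this (a sketch of the algorithm
described above, not the actual framework source; BinaryWriter does this in
its protected Write7BitEncodedInt helper):

using System.IO;

class SevenBit
{
    // Encode a non-negative length as a 7-bits-per-byte, little-endian sequence.
    static byte[] Encode7BitLength(int len)
    {
        MemoryStream ms = new MemoryStream();
        uint v = (uint)len;
        while (v >= 0x80)
        {
            ms.WriteByte((byte)(v | 0x80)); // low 7 bits, continuation bit set
            v >>= 7;
        }
        ms.WriteByte((byte)v);              // final byte, continuation bit clear
        return ms.ToArray();
    }
}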

If you want the source code, I suggest that you download the Shared Source
Common Language Infrastructure from the Microsoft site (just search for
"Shared Source Common Language Infrastructure" on www.microsoft.com). The
source files for the binary reader/writer are in the sscli\clr\src\system\io
directory.

Bruno.
 
John Aldrin

> Hi John,
>
> UTF8Encoding is used by default.
> There is an overloaded constructor that accepts a Stream and an Encoding if
> you wish to change the encoding used.

Thanx. I took a quick look at UTF8 encoding and I got the impression
that UTF8 specifies how the character data within a string looks in memory.
I didn't think it addressed the length part of a string.

When looking at examples for the UTF8Encoding class, it appeared that
when strings are converted to byte arrays the length part was not in
the byte array.

I'm looking for a definition of the length format when a string is
written to a stream using the BinaryWriter.
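
This is the kind of comparison I mean (a rough sketch):

using System;
using System.IO;
using System.Text;

class Compare
{
    static void Main()
    {
        string s = "hello";

        // UTF8Encoding gives only the character bytes - no length information.
        byte[] raw = Encoding.UTF8.GetBytes(s);  // 5 bytes

        // BinaryWriter writes a length prefix in front of those same bytes.
        MemoryStream ms = new MemoryStream();
        BinaryWriter writer = new BinaryWriter(ms);
        writer.Write(s);
        writer.Flush();
        byte[] written = ms.ToArray();           // 6 bytes: 0x05 followed by "hello"

        Console.WriteLine("{0} vs {1}", raw.Length, written.Length); // 5 vs 6
    }
}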

Thanx

jra
 
Bruno Jouhier [MVP]

Ooops: the tests should be done with <= rather than <. Here is a corrected
version.

if len <= 0x7F, len is encoded on one byte as b0 = len
if len <= 0x3FFF, it is encoded on 2 bytes as b0 = (len & 0x7F) | 0x80, b1 =
len >> 7
if len <= 0x1FFFFF, it is encoded on 3 bytes as b0 = (len & 0x7F) | 0x80,
b1 = ((len >> 7) & 0x7F) | 0x80, b2 = len >> 14
etc.
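
You can check the boundary quickly with something like this (a rough sketch;
127 and 128 are chosen to sit on either side of 0x7F):

using System;
using System.IO;

class Boundary
{
    static void Main()
    {
        MemoryStream ms = new MemoryStream();
        BinaryWriter w = new BinaryWriter(ms);

        w.Write(new string('a', 127)); // ASCII, so the UTF8 length is 127
        w.Flush();
        Console.WriteLine(ms.Length);        // 128 = 1 prefix byte + 127 payload bytes

        w.Write(new string('a', 128));
        w.Flush();
        Console.WriteLine(ms.Length - 128);  // 130 = 2 prefix bytes + 128 payload bytes
    }
}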

Bruno
 
John Aldrin

Many Thanx

 
Miha Markic

Hi John,

Yes, you are right.
The length is added separately, in the form of a 7-bit encoded integer (each
byte carries only 7 bits of the value), stored little endian (least significant
group first).
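
Reading it back is the mirror image (a sketch of the same scheme; I believe
BinaryReader.ReadString does this internally before reading the UTF8 payload):

using System.IO;

class SevenBitReader
{
    // Decode a 7-bits-per-byte, little-endian length from the stream.
    // (Error handling for end-of-stream is omitted in this sketch.)
    static int Read7BitLength(Stream s)
    {
        int result = 0;
        int shift = 0;
        int b;
        do
        {
            b = s.ReadByte();
            result |= (b & 0x7F) << shift; // low 7-bit groups come first
            shift += 7;
        } while ((b & 0x80) != 0);         // high bit set => another byte follows
        return result;
    }
}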
 
