String to byte[] reloaded


Jon Skeet [C# MVP]

nano2k said:
Thank you, Ben, your work helped me.
Anyway, I decided to redesign my method and somehow stream the result
directly into an archived buffer.
It's a hell of a lot of work, as the response method is very delicate,
but it's worth the effort.

Thanks to everyone who answered. Anyway, I'm still puzzled why
Microsoft has not (yet) implemented such a mechanism, which was very
useful in MFC.

Because strings are immutable, and because UTF-16 is rarely the
encoding you want when you're converting to a byte array anyway.
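As a minimal sketch of that point (mine, not from the thread): the
in-memory representation of a string is UTF-16, so a "zero-copy" byte
view would hand you UTF-16 bytes, and getting any other encoding has to
go through an Encoding, which always allocates a fresh array:

using System;
using System.Text;

class EncodingDemo
{
    static void Main()
    {
        string s = "hello";

        // Each GetBytes call allocates and fills a brand-new byte[].
        byte[] utf16 = Encoding.Unicode.GetBytes(s); // 10 bytes, every other one 0
        byte[] utf8 = Encoding.UTF8.GetBytes(s);     // 5 bytes for pure ASCII

        Console.WriteLine(utf16.Length); // 10
        Console.WriteLine(utf8.Length);  // 5
    }
}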
 

Willy Denoyette [MVP]

Ben Voigt said:
nano2k said:
Hi, thanks to all for your replies.
I will answer some of the ideas in this one place.
Indeed, most of them don't allocate themselves, but rely on you to
allocate. So, from my perspective, it's the same.
I am using .NET Framework v1.1 and I need to compress my string.
Unfortunately, the compression library (no sources available to me)
takes a byte[] as its input parameter and I have a string to compress.
It is frustrating that I have to allocate new memory to perform this
operation. This sometimes leads to a webservice crash, as many requests
simultaneously require this operation => not enough memory.
I do not intend to use the buffer for anything but strictly read-only
operations. I am aware that any "unmanaged" changes in such an intimate
buffer could cause unexpected behavior later.

Here's your loaded gun, plenty of rope to hang yourself (I looked at the C++/CLI function
PtrToStringChars):

string s = "This is a test string";

// Variant 1: an ordinary (unpinned) handle, plus OffsetToStringData to
// locate the character data behind the object reference.
GCHandle ptr = GCHandle.Alloc(s);
byte* pString = *(byte**)GCHandle.ToIntPtr(ptr).ToPointer()
    + System.Runtime.CompilerServices.RuntimeHelpers.OffsetToStringData;
char* c = (char*)pString;
ptr.Free();

// Variant 2: a pinned handle; AddrOfPinnedObject already points at the
// character data, so no offset is needed.
GCHandle pinptr = GCHandle.Alloc(s, GCHandleType.Pinned);
pString = (byte*)pinptr.AddrOfPinnedObject().ToPointer();
c = (char*)pString;
pinptr.Free();



Note that the Large Object Heap is a heap whose objects are not moved by the garbage
collector, in which case you ought not need to pin the object.

BTW, that OffsetToStringData is 12 (at least in .NET 2.0), but it's a real property, not a
constant, so the value is obtained from your actual runtime library, not fixed when you
compile. Of course the JIT will inline that property access to nothing anyway.
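A quick probe of that point (a sketch; the value is an implementation
detail and may differ between runtimes and bitnesses):

using System;
using System.Runtime.CompilerServices;

class OffsetProbe
{
    static void Main()
    {
        // Reads the property at runtime instead of hard-coding 12.
        Console.WriteLine(RuntimeHelpers.OffsetToStringData);
    }
}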

The PtrToStringChars code is the same in both .NET 1.1 and 2.0, so this might be almost
stable.

I don't know why the offset isn't needed when you use a pinned handle. I did notice
that the pinning action moves the string though... let me try with a larger string....

With a one-million-character string, there is no change in the pointer, and the
AddrOfPinnedObject call still includes the correct offset. Probably small objects get
moved to the Large Object Heap in order to pin them (and do they come back? Maybe once
pinned, always pinned, until all references disappear?).

So that's how to get a zero-copy pointer to the internal data of a large string.

Note that everything here is based on my quick tests and reading vcclr.h and I may just
have gotten lucky; pressure on the GC could move things around and mess things up, or
other bad things could happen.


The above assumes that the GC will never compact the LOH. IMO, no one has ever said that
future versions of the CLR won't attempt to compact the LOH, so I think it's dangerous to
assume no pinning is needed.

Anyway, why make it that complicated when you have the "fixed" statement in C#?

[DllImport("somedll")]
unsafe private static extern bool Foo(char* bytes);
...
string hugeString = ............
unsafe {
    fixed (char* phugeString = hugeString) {
        Foo(phugeString);
    }
}

Note that it's easy to corrupt the managed heap when passing raw pointers to unmanaged code....


Willy.
 

nano2k

Because strings are immutable, and because UTF-16 is rarely the
encoding you want when you're converting to a byte array anyway.

I like that strings are immutable and I don't want that to be changed.
Anyway, no one can ignore the performance issues that come with this
characteristic.
I only pointed out the differences between the old (good, IMO) features
of the CString class and the new (not completely good, IMO) String class.
And once again, I only need read-only access; I can't see any harm
here.
Of course, one may argue that CString's feature would make Strings
"less immutable". I can at least partially agree, but this is only a
theoretical point of view. In practice, my application is affected by
this - let me say it - limitation.
For that reason, I need to redesign a part of it just because of the
way Strings are seen as immutable. No problem, I will do it, accepting
that perhaps my initial analysis was a bit twisted.
As for the UTF-16 issue, in my case I have two possibilities:
1. To accept the extra processing cost of compressing the unneeded
zero bytes.
2. To skip each zero byte when compressing.
IMO, either method is welcome because in many cases it will avoid
the memory peak that triggers the out-of-memory exception.

Thanks again to you all. I will redesign the processing method in a
way that streams the response.
Any debate on this subject will always capture my attention.
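For illustration, a sketch of what that streaming could look like,
using .NET 2.0's GZipStream as a stand-in for the third-party
compressor (GZipStream doesn't exist in 1.1, where a stream-based
compression library would fill the same role); the point is that the
full encoded byte[] never has to exist at once:

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class StreamingCompress
{
    public static void Compress(string text, Stream output)
    {
        const int chunkChars = 4096;
        Encoding utf8 = Encoding.UTF8;
        byte[] buffer = new byte[utf8.GetMaxByteCount(chunkChars)];

        using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress, true))
        {
            for (int i = 0; i < text.Length; i += chunkChars)
            {
                int count = Math.Min(chunkChars, text.Length - i);
                // Encode this slice of the string straight into the
                // reusable buffer (an Encoder would be needed to carry
                // surrogate pairs across chunk boundaries correctly).
                int byteCount = utf8.GetBytes(text, i, count, buffer, 0);
                gzip.Write(buffer, 0, byteCount);
            }
        }
    }
}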
 

Jon Skeet [C# MVP]

nano2k said:
I like that strings are immutable and I don't want that to be changed.
Anyway, no one can ignore the performance issues that come with this
characteristic.

There are only issues *if* the format you want the string in is
UTF-16. As has been said before, that means that for many strings using
mostly ASCII characters, pretty much every other byte is going to be 0.
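A two-line check of that claim (my sketch, not from the thread):

using System;
using System.Text;

class ZeroBytes
{
    static void Main()
    {
        byte[] b = Encoding.Unicode.GetBytes("Hi");
        // Prints 48-00-69-00: the high byte of each UTF-16 code unit
        // is zero for characters below U+0100.
        Console.WriteLine(BitConverter.ToString(b));
    }
}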
I only pointed out the differences between the old (good, IMO) features
of the CString class and the new (not completely good, IMO) String class.
And once again, I only need read-only access; I can't see any harm
here.

Yes, you only need read-only access. How do you mark a byte array as
being read-only, though?
Of course, one may argue that CString's feature would make Strings
"less immutable". I can at least partially agree, but this is only a
theoretical point of view.

No, it's not. Making strings mutable would have a *huge* effect on any
number of things, not least security. Either something is mutable, or
it's not. If I can't rely on strings being immutable, then I need to
take a copy of every string I receive, etc.
In practice, my application is affected by
this - let me say it - limitation.

By the sounds of it, your application needed rearchitecting so as not to
hold huge chunks of data in memory at a time anyway.
For that reason, I need to redesign a part of it just because of the
way Strings are seen as immutable. No problem, I will do it, accepting
that perhaps my initial analysis was a bit twisted.
As for the UTF-16 issue, in my case I have two possibilities:
1. To accept the extra processing cost of compressing the unneeded
zero bytes.

That is likely to add to the compressed size, as well.
2. To skip each zero byte when compressing.

You won't know where to fill the zero bytes back in when decompressing, then.
IMO, either method is welcome because in many cases it will avoid
the memory peak that triggers the out-of-memory exception.

Other solutions that have been offered here are much better, IMO.
Thanks again to you all. I will redesign the processing method in a
way that streams the response.

And by doing that you'll not only avoid this "limitation", you'll
reduce the overall memory usage from what it would have been even if
you *could* have accessed the data in situ.
 

nano2k

I like that strings are immutable and I don't want that to be changed.
There are only issues *if* the format you want the string in is
UTF-16. As has been said before, that means that for many strings using
mostly ASCII characters, pretty much every other byte is going to be 0.
No, I was talking about the performance issues caused by the
proliferation of strings that is inherent in immutable strings.

Yes, you only need read-only access. How do you mark a byte array as
being read-only, though?
You can't; I was only talking about a hypothetical situation.
By the sounds of it, your application needed rearchitecting so as not to
hold huge chunks of data in memory at a time anyway.
100% correct; no problem with that. Still, I think that such a feature
could bring benefits. I can imagine it working just the way the three
APIs GlobalAlloc(), GlobalLock() and GlobalUnlock() work together.
 

Marc Gravell

Of course, if you want a mutable string, you might also consider
char[] (or possibly StringBuilder) as a "string-like" place-holder.
Encoding.GetBytes() and Encoding.GetChars() will work happily with
char[] (maybe not StringBuilder) for the purpose of passing byte[]
to/from compression routines, presumably using UTF-8 for the encoding
to get single-byte ASCII behavior while still supporting full
Unicode.

Just some thoughts, possibly already stated (it is a long chain to
re-read...)

Marc
 

Ben Voigt

By the sounds of it, your application needed rearchitecting so as not to
hold huge chunks of data in memory at a time anyway.
[snip]

That is likely to add to the compressed size, as well.

Not by much. This mainly just doubles the length of every dictionary entry
within the compressed data, but the back-references, which constitute the bulk
of the compressor output, shouldn't grow at all.

Also, compressors *need* to keep a large chunk (on the order of megabytes)
of data in memory at a time to get good ratios; if you use a streaming
interface, the compressor probably just ends up allocating buffers to
maintain the history itself.
 

Ben Voigt

The above assumes that the GC will never compact the LOH. IMO, no one has
ever said that future versions of the CLR won't attempt to compact the LOH,
so I think it's dangerous to assume no pinning is needed.

No, if you use GCHandle with the pinning option you have zero-copy, at least
in .NET 2.0, and GC safety in any version.
Anyway, why make it that complicated when you have the "fixed" statement
in C#?

[DllImport("somedll")]
unsafe private static extern bool Foo(char* bytes);
...
string hugeString = ............
unsafe {
    fixed (char* phugeString = hugeString) {
        Foo(phugeString);
    }
}

Because there's no documented explicit operator char* on System.String?

In fact, I don't think it's that easy, because Reflector reveals that .NET
doesn't take the address of a string, but of the m_firstChar member, which
isn't visible to clients (see System.String.ReplaceCharInPlace,
System.String.SmallCharToUpper).

Ah, but the C# specification, part 25.6, specifically mentions getting a
char* from a string. It looks like it will do the same as the C++
PtrToStringChars and the C# code I gave, only simpler. However, I'm
concerned about how the null-termination guarantee is provided without
incurring the overhead of a copy at least part of the time.
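A quick probe of that guarantee, as a sketch (compile with /unsafe; it
relies on the documented behavior that fixing a string yields a
null-terminated buffer):

using System;

class NullTermProbe
{
    unsafe static void Main()
    {
        string s = "abc";
        fixed (char* p = s)
        {
            // The character one past the end is the null terminator,
            // and no copy of the string was made to provide it.
            Console.WriteLine(p[s.Length] == '\0'); // True
        }
    }
}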
 

Ben Voigt

Ben Voigt said:
The above assumes that the GC will never compact the LOH. IMO, no one has
ever said that future versions of the CLR won't attempt to compact the
LOH, so I think it's dangerous to assume no pinning is needed.

No, if you use GCHandle with the pinning option you have zero-copy, at
least in .NET 2.0, and GC safety in any version.
Anyway, why make it that complicated when you have the "fixed" statement
in C#?

[DllImport("somedll")]
unsafe private static extern bool Foo(char* bytes);
...
string hugeString = ............
unsafe {
    fixed (char* phugeString = hugeString) {
        Foo(phugeString);
    }
}

Because there's no documented explicit operator char* on System.String?

In fact, I don't think it's that easy, because Reflector reveals that .NET
doesn't take the address of a string, but of the m_firstChar member, which
isn't visible to clients (see System.String.ReplaceCharInPlace,
System.String.SmallCharToUpper).

Ah, but the C# specification, part 25.6, specifically mentions getting a
char* from a string. It looks like it will do the same as the C++
PtrToStringChars and the C# code I gave, only simpler. However, I'm
concerned about how the null-termination guarantee is provided without
incurring the overhead of a copy at least part of the time.

It turns out the MSIL uses OffsetToStringData as well, along with an
unusual "string pinned" local variable. It doesn't look like anything that
could produce a copy, but I don't know what magic is invoked by that
"string pinned" type.
 

Willy Denoyette [MVP]

Ben Voigt said:
Ben Voigt said:
The above assumes that the GC will never compact the LOH. IMO, no one has ever said that
future versions of the CLR won't attempt to compact the LOH, so I think it's dangerous to
assume no pinning is needed.

No, if you use GCHandle with the pinning option you have zero-copy, at least in .NET 2.0,
and GC safety in any version.
Anyway, why make it that complicated when you have the "fixed" statement in C#?

[DllImport("somedll")]
unsafe private static extern bool Foo(char* bytes);
...
string hugeString = ............
unsafe {
    fixed (char* phugeString = hugeString) {
        Foo(phugeString);
    }
}

Because there's no documented explicit operator char* on System.String?

In fact, I don't think it's that easy, because Reflector reveals that .NET doesn't take
the address of a string, but of the m_firstChar member, which isn't visible to clients
(see System.String.ReplaceCharInPlace, System.String.SmallCharToUpper).

Ah, but the C# specification, part 25.6, specifically mentions getting a char* from a
string. It looks like it will do the same as the C++ PtrToStringChars and the C# code I
gave, only simpler. However, I'm concerned about how the null-termination guarantee is
provided without incurring the overhead of a copy at least part of the time.

It turns out the MSIL uses OffsetToStringData as well, along with an unusual "string
pinned" local variable. It doesn't look like anything that could produce a copy, but I
don't know what magic is invoked by that "string pinned" type.

The "string pinned" is not a local variable, this tells the JIT that it should reserve a
slot in the "global handle" table, this table holds references to objects that can't be
moved until they are removed from this table. This is quite handy, by this the Interop layer
knows that the parameter's object reference is all ready pinned, so he shouldn't care about
this.
If you look at the IL you'll see that the string 'reference' is copied to this location at
the start of the fixed block scope and set to null when the scope ends.
The Global Handle table is part of the LOH.

Willy.
 

Barry Kelly

nano2k said:
No, I was talking about the performance issues caused by the
proliferation of strings that is inherent in immutable strings.

Actually, since immutable strings can be freely shared, I believe
immutability leads to a reduction in pointless string duplication done
simply to establish a private copy.

-- Barry
 
