String to byte[] reloaded

nano2k

Hi
I need an efficient method to convert a string object to its byte[]
equivalent.
I know there are LOTS of methods, but they lack efficiency. All
methods allocate new memory to create the byte[] array. Of course,
when memory allocation occurs, extra processing power is naturally
needed.
To be more explicit, MFC introduced a super-efficient method of dealing
with this situation. As far as I remember (I switched from MFC to .NET
a few years ago), MFC's CString class has a method with the following
signature:

byte[] GetBuffer()

This method "blocks" the CString instance until the ReleaseBuffer() method
is called. Again, maybe the method names are not quite as I remember,
but the important thing is the principle.
The marvelous result is that you may freely iterate through the byte[]
array returned by the GetBuffer() method and even modify it (within
some limits, of course), and all this without allocating new
memory.
My question is: will the MemoryStream class do the job for me? I
mean, there is a method called GetBuffer(), but will it allocate new
memory or not? It is not stated in the MS documentation.

Thanks
 

Jon Skeet [C# MVP]

nano2k said:
I need an efficient method to convert a string object to its byte[]
equivalent.

*Which* byte[] equivalent? It depends on the encoding.
I know there are LOTS of methods, but they lack efficiency. All
methods allocate new memory to create the byte[] array.

No they don't. Use Encoding.GetBytes(string, int, int, byte[], int) to
copy the bytes into an existing byte array. Of course, you'll have to
allocate the array at some point first... I'm currently working on a
BufferManager class which allows buffers to be reused etc, but I'm not
sure it's really worth it here.
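
For illustration, here is a minimal sketch of that reuse pattern (the encoding choice, buffer size and variable names are my own assumptions, not anything from this thread):

// assumes: using System.Text;
Encoding enc = Encoding.UTF8;            // pick whatever encoding you actually need
byte[] buffer = new byte[4096];          // allocated once, reused for every string

foreach (string text in new string[] { "first request", "second request" })
{
    int needed = enc.GetByteCount(text);
    if (needed > buffer.Length)
        buffer = new byte[needed];       // grow only when a larger string shows up

    int written = enc.GetBytes(text, 0, text.Length, buffer, 0);
    // only buffer[0 .. written-1] is meaningful; hand (buffer, written) on to the consumer
}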

Have you actually proven (with profiling etc) that a normal
Encoding.GetBytes call is causing you a bottleneck?
 

David Browne

Jon Skeet said:
nano2k said:
I need an efficient method to convert a string object to its byte[]
equivalent.

*Which* byte[] equivalent? It depends on the encoding.
I know there are LOTS of methods, but they lack efficiency. All
methods allocate new memory to create the byte[] array.

No they don't. Use Encoding.GetBytes(string, int, int, byte[], int) to
copy the bytes into an existing byte array. Of course, you'll have to
allocate the array at some point first... I'm currently working on a
BufferManager class which allows buffers to be reused etc, but I'm not
sure it's really worth it here.

Have you seen System.ServiceModel.Channels.BufferManager in .NET 3.0?

David
 
Göran Andersson

nano2k said:
Hi
I need an efficient method to convert a string object to its byte[]
equivalent.
I know there are LOTS of methods, but they lack efficiency. All
methods allocate new memory to create the byte[] array. Of course,
when memory allocation occurs, extra processing power is naturally
needed.
To be more explicit, MFC introduced a super-efficient method of dealing
with this situation. As far as I remember (I switched from MFC to .NET
a few years ago), MFC's CString class has a method with the following
signature:

byte[] GetBuffer()

This method "blocks" the CString instance until the ReleaseBuffer() method
is called. Again, maybe the method names are not quite as I remember,
but the important thing is the principle.
The marvelous result is that you may freely iterate through the byte[]
array returned by the GetBuffer() method and even modify it (within
some limits, of course), and all this without allocating new
memory.
My question is: will the MemoryStream class do the job for me? I
mean, there is a method called GetBuffer(), but will it allocate new
memory or not? It is not stated in the MS documentation.

Thanks

Do you really need a byte array? A string can be indexed by its
characters, and you can cast each char to an int, so effectively you
have an int array already. If you need it as bytes, just split each int
into two bytes.
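
A rough sketch of that splitting idea (the method name is mine; it just copies out the raw UTF-16 code units, low byte first):

static void CopyCharsToBytes(string s, byte[] dest)
{
    // dest must hold at least 2 * s.Length bytes
    for (int i = 0; i < s.Length; i++)
    {
        int c = s[i];                              // one 16-bit code unit
        dest[2 * i] = (byte)(c & 0xFF);            // low byte
        dest[2 * i + 1] = (byte)((c >> 8) & 0xFF); // high byte
    }
}

For what it's worth, that produces the same bytes as Encoding.Unicode.GetBytes(s) would (little-endian UTF-16).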

If you want to access the string as bytes to modify it, that is a really
bad idea. Strings are immutable, and every method that uses strings relies
on that.
 

Barry Kelly

nano2k said:
I need an efficient method to convert a string object to its byte[]
equivalent.

There are many byte[] equivalents for a string, one for each encoding
and its options.
I know there are LOTS of methods, but they lack efficiency. All
methods allocate new memory to create the byte[] array.

I disagree on this point! From my perspective, most of them *don't*
allocate a byte array - you've got to do it yourself.
Of course,
when memory allocation occurs, extra processing power is naturally
needed.
To be more explicit, MFC introduced a super-efficient method of dealing
with this situation. As far as I remember (I switched from MFC to .NET
a few years ago), MFC's CString class has a method with the following
signature:

Create an Encoding descendant instance and pre-allocate the byte[] you
pass to it. Most of the Encoding.GetBytes() overloads don't allocate byte
arrays; they require the caller to allocate the array, and that way you
control the allocation strategy.
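
One possible allocation strategy, sketched below (the sizes are arbitrary assumptions): allocate a single buffer up front using GetMaxByteCount for the worst case, then encode every string into it.

// assumes: using System.Text;
Encoding enc = Encoding.UTF8;
int maxChars = 64 * 1024;                                 // largest string we expect (an assumption)
byte[] shared = new byte[enc.GetMaxByteCount(maxChars)];  // worst case, allocated once up front

string someString = "whatever needs converting";
int len = enc.GetBytes(someString, 0, someString.Length, shared, 0);
// the encoded bytes sit in shared[0 .. len-1]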

-- Barry
 

Jon Skeet [C# MVP]

No they don't. Use Encoding.GetBytes(string, int, int, byte[], int) to
copy the bytes into an existing byte array. Of course, you'll have to
allocate the array at some point first... I'm currently working on a
BufferManager class which allows buffers to be reused etc, but I'm not
sure it's really worth it here.

Have you seen System.ServiceModel.Channels.BufferManager in .NET 3.0?

I hadn't before, to be honest. Can't say I like the idea of having to
explicitly call ReturnBuffer - my buffers allow you access to the byte
array, but implement IDisposable so you can just do:

using (IBuffer buffer = manager.GetBuffer(...))
{
    byte[] bytes = buffer.Bytes;
    ...
}

I'm not entirely surprised that others have thought it would be useful
though :)
 

David Browne

Jon Skeet said:
No they don't. Use Encoding.GetBytes(string, int, int, byte[], int) to
copy the bytes into an existing byte array. Of course, you'll have to
allocate the array at some point first... I'm currently working on a
BufferManager class which allows buffers to be reused etc, but I'm not
sure it's really worth it here.

Have you seen System.ServiceModel.Channels.BufferManager in .NET 3.0?

I hadn't before, to be honest. Can't say I like the idea of having to
explicitly call ReturnBuffer - my buffers allow you access to the byte
array, but implement IDisposable so you can just do:

using (IBuffer buffer = manager.GetBuffer(...))
{
    byte[] bytes = buffer.Bytes;
    ...
}

That's handy. Though if it's a public library I would worry that it could
lead to inadvertent sharing of buffers.

David
 

Jon Skeet [C# MVP]

using (IBuffer buffer = manager.GetBuffer(...))
{
    byte[] bytes = buffer.Bytes;
    ...
}

That's handy. Though if it's a public library I would worry that it could
lead to inadvertent sharing of buffers.

It would depend on the scope of the manager. The BufferManager
*classes* are public, but how you share instances of them is up to you
:)
 

Lebesgue

You have not stated what your performance requirements are, or how you
measured that all other methods are not efficient enough. But as for the
second part of your question: no, MemoryStream.GetBuffer does not allocate
new memory.

From Reflector:

public virtual byte[] GetBuffer()
{
    if (!this._exposable)
    {
        throw new UnauthorizedAccessException(
            Environment.GetResourceString("UnauthorizedAccess_MemStreamBuffer"));
    }
    return this._buffer;
}
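
For reference, whether GetBuffer() is usable at all depends on how the MemoryStream was constructed; a small illustration (the variable names are mine):

// assumes: using System.IO;
// A MemoryStream that owns its own expandable buffer exposes it:
MemoryStream ms = new MemoryStream();
byte[] raw = ms.GetBuffer();              // no copy, no new allocation, but the array
                                          // may be longer than ms.Length

// A MemoryStream wrapped around an existing array only exposes it on request:
byte[] data = new byte[100];
MemoryStream wrapped = new MemoryStream(data, 0, data.Length, true, true);
byte[] raw2 = wrapped.GetBuffer();        // fine, publiclyVisible was passed as true

// new MemoryStream(data) alone leaves _exposable false, so GetBuffer()
// would throw UnauthorizedAccessException, as the code above shows.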
 

nano2k

Hi, thanks all for your replies.
I will answer several of the ideas in this one place.
Indeed, most of the methods don't allocate, but rely on you to allocate.
So, from my perspective, it's the same.
I am using .NET Framework v1.1 and I need to compress my string.
Unfortunately, the compression library (no sources available to me)
takes a byte[] as input parameter, and I have a string to compress.
It is frustrating that I have to allocate new memory to perform this
operation. This sometimes leads to a webservice crash, as many requests
simultaneously require this operation => not enough memory.
I do not intend to use the buffer for anything but strictly read-only
operations. I am aware that any "unmanaged" changes in such an intimate
buffer could cause unexpected behavior later.


 

Jon Skeet [C# MVP]

nano2k said:
Hi, thanks all for your replies.
I will answer several of the ideas in this one place.
Indeed, most of the methods don't allocate, but rely on you to allocate.
So, from my perspective, it's the same.

No, they're completely different - if you get to do the allocation, you
can reuse the same buffer several times.
I am using .NET Framework v1.1 and I need to compress my string.
Unfortunately, the compression library (no sources available to me)
takes a byte[] as input parameter, and I have a string to compress.
It is frustrating that I have to allocate new memory to perform this
operation. This sometimes leads to a webservice crash, as many requests
simultaneously require this operation => not enough memory.

Just how large are the strings you're compressing? Can you not
serialize (or pseudo-serialize) the operation so you're only
compressing a few strings at a time?
I do not intend to use the buffer for anything but strictly read-only
operations. I am aware that any "unmanaged" changes in such an intimate
buffer could cause unexpected behavior later.

You *may* be able to get at the UTF-16 encoded (internal) version with
unsafe code, but I'd strongly recommend against it.
 
Göran Andersson

nano2k said:
Hi, thanks all for your replies.
I will answer several of the ideas in this one place.
Indeed, most of the methods don't allocate, but rely on you to allocate.
So, from my perspective, it's the same.
I am using .NET Framework v1.1 and I need to compress my string.
Unfortunately, the compression library (no sources available to me)
takes a byte[] as input parameter, and I have a string to compress.
It is frustrating that I have to allocate new memory to perform this
operation. This sometimes leads to a webservice crash, as many requests
simultaneously require this operation => not enough memory.
I do not intend to use the buffer for anything but strictly read-only
operations. I am aware that any "unmanaged" changes in such an intimate
buffer could cause unexpected behavior later.

If you get the string data directly from memory, you will get it as
UTF-16, so almost every other byte will be zero. I think that you get
better compression if you encode the string first.

If you have huge strings to compress, you should not keep them in
memory. Save the string to a file, and read a chunk at a time from the
file and compress. This of course adds a little overhead, but it scales
far better, and from what you said, it's the scalability that is the
problem.
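
A bare-bones sketch of that chunking approach, assuming .NET 1.1 and using a hypothetical CompressChunk method as a stand-in for the real compression library (whose API I don't know):

// assumes: using System.IO; using System.Text;
static void CompressViaFile(string bigText, string tempPath)
{
    // 1. Spill the text to disk as UTF-8 (also more compact than raw UTF-16).
    using (StreamWriter w = new StreamWriter(tempPath, false, Encoding.UTF8))
    {
        w.Write(bigText);
    }

    // 2. Read it back a chunk at a time and hand each chunk to the compressor.
    byte[] chunk = new byte[64 * 1024];   // arbitrary chunk size
    using (FileStream fs = File.OpenRead(tempPath))
    {
        int read;
        while ((read = fs.Read(chunk, 0, chunk.Length)) > 0)
        {
            CompressChunk(chunk, read);   // hypothetical: the real library call goes here
        }
    }
}

static void CompressChunk(byte[] buffer, int count)
{
    // placeholder for the actual compression library
}

In practice you would want to write the data to the file as you produce it, rather than building the huge string first and then spilling it.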
 

nano2k

Thanks, Jon

No, they're completely different - if you get to do the allocation, you
can reuse the same buffer several times.
That's right. But in my particular case, I don't need to use the
buffer several times. I'm just forced to create an extra buffer just
to pass data to the compression method. That's my only problem.
Unfortunately, I cannot change the compress method's signature (e.g.
to accept a MemoryStream object as input data, etc.), so I thought I
would be able to adapt my own code, because this is where I'm 100% in
control.

Just how large are the strings you're compressing? Can you not
serialize (or pseudo-serialize) the operation so you're only
compressing a few strings at a time?
I cannot control the size of the request, because the request
is based on an SQL statement that could return tens of megs
of data (imagine, for example, a report on the annual activity of a
company). And because everything runs inside a webservice, there can be
multiple simultaneous requests of this type.


You *may* be able to get at the UTF-16 encoded (internal) version with
unsafe code, but I'd strongly recommend against it.
I don't know which is worse: making sure I handle such a buffer very
carefully, or risking the reliability of my webservice...
 

Jon Skeet [C# MVP]

That's right. But in my particular case, I don't need to use the
buffer several times. I'm just forced to create an extra buffer just
to pass data to the compression method. That's my only problem.
Unfortunately, I cannot change the compress method's signature (e.g.
to accept a MemoryStream object as input data, etc.), so I thought I
would be able to adapt my own code, because this is where I'm 100% in
control.

You wanted to reduce the overall memory consumption, right? So create a
buffer and reuse it, encoding different strings into the same byte
array.
I cannot control the size of the request, because the request
is based on an SQL statement that could return tens of megs
of data (imagine, for example, a report on the annual activity of a
company). And because everything runs inside a webservice, there can be
multiple simultaneous requests of this type.

If there are no bounds to the amount of data you need to compress, you
really, really should be streaming it. Grabbing into one lump is never
going to be a good option.
I don't know which is worse: making sure I handle such a buffer very
carefully, or risking the reliability of my webservice...

It sounds to me like you should look very carefully at the overall
architecture and your compression API.
 

nano2k

You wanted to reduce the overall memory consumption, right? So create a
buffer and reuse it, encoding different strings into the same byte
array.
In theory that's OK, but should I maintain a static buffer of several tens
of megabytes, locked between requests?
This means I would have to allocate the buffer according to the largest
request. Not only that, but I would need to protect the buffer against other
incoming requests that may need compression at the same time.
If there are no bounds to the amount of data you need to compress, you
really, really should be streaming it. Grabbing into one lump is never
going to be a good option.
Yes, I also think this is the way to go.
Thanks.
 

Marc Gravell

Maybe I missed something in an earlier post... but why do you want a
buffer that big? There are far better (more efficient) ways of reading
data using small buffers and chunking. I wasn't sure if your example
was dealing with database BLOBs - but if so you can do this using
small buffers too (even for big BLOBs).

Small (ish) good. Big bad.

Marc
 

Jon Skeet [C# MVP]

nano2k said:
In theory that's OK, but should I maintain a static buffer of several tens
of megabytes, locked between requests?

Tens of megabytes isn't too big if you've only got one of them. Better
than tens of megabytes being allocated and deallocated on a regular
basis.
This means I would have to allocate the buffer according to the largest
request. Not only that, but I would need to protect the buffer against other
incoming requests that may need compression at the same time.

Yes, you need to serialize the compression. That will help keep the
memory usage down a lot, even if it slows down the app overall.
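
A bare-bones sketch of that serialization around one shared buffer (the sizes, names and the Compress stand-in are all assumptions on my part):

// assumes: using System.Text;
class CompressionGate
{
    static readonly object gate = new object();
    static byte[] shared = new byte[10 * 1024 * 1024];   // sized for the largest expected request

    public static byte[] CompressString(string s)
    {
        lock (gate)   // only one request at a time touches the shared buffer
        {
            int needed = Encoding.UTF8.GetByteCount(s);
            if (needed > shared.Length)
                shared = new byte[needed];                // grow only for a bigger request

            int len = Encoding.UTF8.GetBytes(s, 0, s.Length, shared, 0);
            return Compress(shared, len);                 // hypothetical stand-in for the real library
        }
    }

    static byte[] Compress(byte[] buffer, int count)
    {
        // placeholder for the actual compression call
        return new byte[0];
    }
}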
Yes, I also think this is the way to go.

Goodo.
 

Ben Voigt

nano2k said:
Hi, thanks all for your replies.
I will answer several of the ideas in this one place.
Indeed, most of the methods don't allocate, but rely on you to allocate.
So, from my perspective, it's the same.
I am using .NET Framework v1.1 and I need to compress my string.
Unfortunately, the compression library (no sources available to me)
takes a byte[] as input parameter, and I have a string to compress.
It is frustrating that I have to allocate new memory to perform this
operation. This sometimes leads to a webservice crash, as many requests
simultaneously require this operation => not enough memory.
I do not intend to use the buffer for anything but strictly read-only
operations. I am aware that any "unmanaged" changes in such an intimate
buffer could cause unexpected behavior later.

Here's your loaded gun, plenty of rope to hang yourself (I looked at the
C++/CLI function PtrToStringChars):

// requires an unsafe context (/unsafe) to compile
// assumes: using System.Runtime.InteropServices;
string s = "This is a test string";

// Non-pinning handle: dereference the handle value to get the object's address,
// then skip ahead by OffsetToStringData to reach the character data.
GCHandle ptr = GCHandle.Alloc(s);
byte* pString = *(byte**)GCHandle.ToIntPtr(ptr).ToPointer() +
    System.Runtime.CompilerServices.RuntimeHelpers.OffsetToStringData;
char* c = (char*)pString;
ptr.Free();

// Pinning handle: AddrOfPinnedObject() already points at the character data.
GCHandle pinptr = GCHandle.Alloc(s, GCHandleType.Pinned);
pString = (byte*)pinptr.AddrOfPinnedObject().ToPointer();
c = (char*)pString;
pinptr.Free();



Note that the Large Object Heap is a heap, and not subject to garbage
collection, in which case you ought not need to pin the object.

BTW, that OffsetToStringData is 12 (at least in .NET 2.0) but it's a real
property, not a constant, so the value comes from your actual runtime
library rather than from compile time. Of course the JIT will inline that property
access to nothing anyway.

The PtrToStringChars code is the same in both .NET 1.1 and 2.0, so this
might be almost stable.

I don't know why the offset isn't needed when you use a pinning pointer. I
did notice that the pinning action moves the string though... let me try
with a larger string....

With a one million character string, there is no change in the pointer, and
the AddrOfPinnedObject call still includes the correct offset. Probably
small objects get moved to the Large Object Heap in order to pin them (and
do they come back, maybe once pinned, always pinned until all references
disappear?).

So that's how to get a zero-copy pointer to the internal data of a large
string.

Note that everything here is based on my quick tests and reading vcclr.h and
I may just have gotten lucky; pressure on the GC could move things around
and mess things up, or other bad things could happen.
 

Jon Skeet [C# MVP]

Note that the Large Object Heap is a heap, and not subject to garbage
collection, in which case you ought not need to pin the object.

Just to check what you mean - as far as I'm aware, the LOH *is* garbage
collected, but is *not* compacted.
 

nano2k

Thank you Ben, your work helped me.
Anyway, I decided to redesign my method and somehow stream the result
directly into an archived buffer.
It's a hell of a lot of work, as the response method is very delicate, but
it's worth the effort.

Thanks to everyone who answered. Anyway, I'm still puzzled why Microsoft
has not (yet) implemented such a mechanism, which was very useful in
MFC.

Thanks.


 
