String to byte[] reloaded

nano2k

Hi
I need an efficient method to convert a string object to its byte[]
equivalent.
I know there are LOTS of methods, but they lack efficiency. All
methods allocate new memory to create the byte[] array. Of course,
when memory allocation occurs, extra processing power is naturally
needed.
To be more explicit, MFC introduced a super-efficient method of dealing
with this situation. As far as I remember (I switched from MFC to .NET
a few years ago), MFC's CString class has a method with the following
signature:

byte[] GetBuffer()

This method "blocks" the CString instance until the ReleaseBuffer() method
is called. Again, maybe the method names are not quite as I remember,
but the important thing is the principle.
The marvelous result is that you may freely iterate through the byte[]
array returned by the GetBuffer() method and even modify it (within
some limits, of course), and all this without allocating new
memory.
My question is: will the MemoryStream class do the job for me? I
mean, there is a method called GetBuffer(), but will it allocate new
memory or not? It is not stated in the MS documentation.

Thanks
 

Jon Skeet [C# MVP]

nano2k said:
I need an efficient method to convert a string object to its byte[]
equivalent.

*Which* byte[] equivalent? It depends on the encoding.
I know there are LOTS of methods, but they lack efficiency. All
methods allocate new memory to create the byte[] array.

No they don't. Use Encoding.GetBytes(string, int, int, byte[], int) to
copy the bytes into an existing byte array. Of course, you'll have to
allocate the array at some point first... I'm currently working on a
BufferManager class which allows buffers to be reused etc, but I'm not
sure it's really worth it here.
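
For illustration, here is a minimal sketch of that reuse pattern (the encoding choice, buffer size and variable names are my own assumptions, not anything from this thread):

// assumes: using System.Text;
Encoding enc = Encoding.UTF8;            // pick whatever encoding you actually need
byte[] buffer = new byte[4096];          // allocated once, reused for every string

foreach (string text in new string[] { "first request", "second request" })
{
    int needed = enc.GetByteCount(text);
    if (needed > buffer.Length)
        buffer = new byte[needed];       // grow only when a larger string shows up

    int written = enc.GetBytes(text, 0, text.Length, buffer, 0);
    // only buffer[0 .. written-1] is meaningful; hand (buffer, written) on to the consumer
}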

Have you actually proven (with profiling etc) that a normal
Encoding.GetBytes call is causing you a bottleneck?
 

David Browne

Jon Skeet said:
nano2k said:
I need an efficient method to convert a string object to its byte[]
equivalent.

*Which* byte[] equivalent? It depends on the encoding.
I know there are LOTS of methods, but they lack efficiency. All
methods allocate new memory to create the byte[] array.

No they don't. Use Encoding.GetBytes(string, int, int, byte[], int) to
copy the bytes into an existing byte array. Of course, you'll have to
allocate the array at some point first... I'm currently working on a
BufferManager class which allows buffers to be reused etc, but I'm not
sure it's really worth it here.

Have you seen System.ServiceModel.Channels.BufferManager in .NET 3.0?

David
 
Göran Andersson

nano2k said:
Hi
I need an efficient method to convert a string object to its byte[]
equivalent.
I know there are LOTS of methods, but they lack efficiency. All
methods allocate new memory to create the byte[] array. Of course,
when memory allocation occurs, extra processing power is naturally
needed.
To be more explicit, MFC introduced a super-efficient method of dealing
with this situation. As far as I remember (I switched from MFC to .NET
a few years ago), MFC's CString class has a method with the following
signature:

byte[] GetBuffer()

This method "blocks" the CString instance until the ReleaseBuffer() method
is called. Again, maybe the method names are not quite as I remember,
but the important thing is the principle.
The marvelous result is that you may freely iterate through the byte[]
array returned by the GetBuffer() method and even modify it (within
some limits, of course), and all this without allocating new
memory.
My question is: will the MemoryStream class do the job for me? I
mean, there is a method called GetBuffer(), but will it allocate new
memory or not? It is not stated in the MS documentation.

Thanks

Do you really need a byte array? A string can be indexed by its
characters, and you can cast each char to an int, so effectively you
have an int array already. If you need it as bytes, just split each int
into two bytes.
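
A rough sketch of that splitting idea (the method name is mine; it just copies out the raw UTF-16 code units, low byte first):

static void CopyCharsToBytes(string s, byte[] dest)
{
    // dest must hold at least 2 * s.Length bytes
    for (int i = 0; i < s.Length; i++)
    {
        int c = s[i];                              // one 16-bit code unit
        dest[2 * i] = (byte)(c & 0xFF);            // low byte
        dest[2 * i + 1] = (byte)((c >> 8) & 0xFF); // high byte
    }
}

For what it's worth, that produces the same bytes as Encoding.Unicode.GetBytes(s) would (little-endian UTF-16).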

If you want to access the string as bytes to modify it, that is a really
bad idea. Strings are immutable, and every method that uses strings relies
on that.
 

Barry Kelly

nano2k said:
I need an efficient method to convert a string object to its byte[]
equivalent.

There are many byte[] equivalents for a string, one for each encoding
and its options.
I know there are LOTS of methods, but they lack efficiency. All
methods allocate new memory to create the byte[] array.

I disagree on this point! From my perspective, most of them *don't*
allocate a byte array - you've got to do it yourself.
Of course,
when memory allocation occurs, extra processing power is naturally
needed.
To be more explicit, MFC introduced a super-efficient method of dealing
with this situation. As far as I remember (I switched from MFC to .NET
a few years ago), MFC's CString class has a method with the following
signature:

Create an Encoding descendant instance and pre-allocate the byte[] you
pass to it. Most of the Encoding.GetBytes() overloads don't allocate byte
arrays; they require the caller to allocate the array, and that way you
control the allocation strategy.
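
One possible allocation strategy, sketched below (the sizes are arbitrary assumptions): allocate a single buffer up front using GetMaxByteCount for the worst case, then encode every string into it.

// assumes: using System.Text;
Encoding enc = Encoding.UTF8;
int maxChars = 64 * 1024;                                 // largest string we expect (an assumption)
byte[] shared = new byte[enc.GetMaxByteCount(maxChars)];  // worst case, allocated once up front

string someString = "whatever needs converting";
int len = enc.GetBytes(someString, 0, someString.Length, shared, 0);
// the encoded bytes sit in shared[0 .. len-1]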

-- Barry
 

Jon Skeet [C# MVP]

No they don't. Use Encoding.GetBytes(string, int, int, byte[], int) to
copy the bytes into an existing byte array. Of course, you'll have to
allocate the array at some point first... I'm currently working on a
BufferManager class which allows buffers to be reused etc, but I'm not
sure it's really worth it here.

Have you seen System.ServiceModel.Channels.BufferManager in .NET 3.0?

I hadn't before, to be honest. Can't say I like the idea of having to
explicitly call ReturnBuffer - my buffers allow you access to the byte
array, but implement IDisposable so you can just do:

using (IBuffer buffer = manager.GetBuffer(...))
{
    byte[] bytes = buffer.Bytes;
    ...
}

I'm not entirely surprised that others have thought it would be useful
though :)
 

David Browne

Jon Skeet said:
No they don't. Use Encoding.GetBytes(string, int, int, byte[], int) to
copy the bytes into an existing byte array. Of course, you'll have to
allocate the array at some point first... I'm currently working on a
BufferManager class which allows buffers to be reused etc, but I'm not
sure it's really worth it here.

Have you seen System.ServiceModel.Channels.BufferManager in .NET 3.0?

I hadn't before, to be honest. Can't say I like the idea of having to
explicitly call ReturnBuffer - my buffers allow you access to the byte
array, but implement IDisposable so you can just do:

using (IBuffer buffer = manager.GetBuffer(...))
{
    byte[] bytes = buffer.Bytes;
    ...
}

That's handy. Though if it's a public library I would worry that it could
lead to inadvertent sharing of buffers.

David
 

Jon Skeet [C# MVP]

using (IBuffer buffer = manager.GetBuffer(...))
{
    byte[] bytes = buffer.Bytes;
    ...
}

That's handy. Though if it's a public library I would worry that it could
lead to inadvertent sharing of buffers.

It would depend on the scope of the manager. The BufferManager
*classes* are public, but how you share instances of them is up to you
:)
 

Lebesgue

You have not stated what your performance requirements are, or how you
measured that all other methods are not efficient enough. But as for the
second part of your question: no, MemoryStream.GetBuffer does not allocate
new memory.

From Reflector:

public virtual byte[] GetBuffer()
{
    if (!this._exposable)
    {
        throw new UnauthorizedAccessException(
            Environment.GetResourceString("UnauthorizedAccess_MemStreamBuffer"));
    }
    return this._buffer;
}
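
For reference, whether GetBuffer() is usable at all depends on how the MemoryStream was constructed; a small illustration (the variable names are mine):

// assumes: using System.IO;
// A MemoryStream that owns its own expandable buffer exposes it:
MemoryStream ms = new MemoryStream();
byte[] raw = ms.GetBuffer();              // no copy, no new allocation, but the array
                                          // may be longer than ms.Length

// A MemoryStream wrapped around an existing array only exposes it on request:
byte[] data = new byte[100];
MemoryStream wrapped = new MemoryStream(data, 0, data.Length, true, true);
byte[] raw2 = wrapped.GetBuffer();        // fine, publiclyVisible was passed as true

// new MemoryStream(data) alone leaves _exposable false, so GetBuffer()
// would throw UnauthorizedAccessException, as the code above shows.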
 

nano2k

Hi, thanks all for your replies.
I will answer several of the ideas in this one place.
Indeed, most of the methods don't allocate, but rely on you to allocate.
So, from my perspective, it's the same.
I am using .NET Framework v1.1 and I need to compress my string.
Unfortunately, the compression library (no sources available to me)
takes a byte[] as input parameter, and I have a string to compress.
It is frustrating that I have to allocate new memory to perform this
operation. This sometimes leads to a webservice crash, as many requests
simultaneously require this operation => not enough memory.
I do not intend to use the buffer for anything but strictly read-only
operations. I am aware that any "unmanaged" changes in such an intimate
buffer could cause unexpected behavior later.


 

Jon Skeet [C# MVP]

nano2k said:
Hi, thanks all for your replies.
I will answer several of the ideas in this one place.
Indeed, most of the methods don't allocate, but rely on you to allocate.
So, from my perspective, it's the same.

No, they're completely different - if you get to do the allocation, you
can reuse the same buffer several times.
I am using .NET Framework v1.1 and I need to compress my string.
Unfortunately, the compression library (no sources available to me)
takes a byte[] as input parameter, and I have a string to compress.
It is frustrating that I have to allocate new memory to perform this
operation. This sometimes leads to a webservice crash, as many requests
simultaneously require this operation => not enough memory.

Just how large are the strings you're compressing? Can you not
serialize (or pseudo-serialize) the operation so you're only
compressing a few strings at a time?
I do not intend to use the buffer for anything but strictly read-only
operations. I am aware that any "unmanaged" changes in such an intimate
buffer could cause unexpected behavior later.

You *may* be able to get at the UTF-16 encoded (internal) version with
unsafe code, but I'd strongly recommend against it.
 
Göran Andersson

nano2k said:
Hi, thanks all for your replies.
I will answer several of the ideas in this one place.
Indeed, most of the methods don't allocate, but rely on you to allocate.
So, from my perspective, it's the same.
I am using .NET Framework v1.1 and I need to compress my string.
Unfortunately, the compression library (no sources available to me)
takes a byte[] as input parameter, and I have a string to compress.
It is frustrating that I have to allocate new memory to perform this
operation. This sometimes leads to a webservice crash, as many requests
simultaneously require this operation => not enough memory.
I do not intend to use the buffer for anything but strictly read-only
operations. I am aware that any "unmanaged" changes in such an intimate
buffer could cause unexpected behavior later.

If you get the string data directly from memory, you will get it as
UTF-16, so almost every other byte will be zero. I think that you get
better compression if you encode the string first.

If you have huge strings to compress, you should not keep them in
memory. Save the string to a file, and read a chunk at a time from the
file and compress. This of course adds a little overhead, but it scales
far better, and from what you said, it's the scalability that is the
problem.
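
A bare-bones sketch of that chunking approach, assuming .NET 1.1 and using a hypothetical CompressChunk method as a stand-in for the real compression library (whose API I don't know):

// assumes: using System.IO; using System.Text;
static void CompressViaFile(string bigText, string tempPath)
{
    // 1. Spill the text to disk as UTF-8 (also more compact than raw UTF-16).
    using (StreamWriter w = new StreamWriter(tempPath, false, Encoding.UTF8))
    {
        w.Write(bigText);
    }

    // 2. Read it back a chunk at a time and hand each chunk to the compressor.
    byte[] chunk = new byte[64 * 1024];   // arbitrary chunk size
    using (FileStream fs = File.OpenRead(tempPath))
    {
        int read;
        while ((read = fs.Read(chunk, 0, chunk.Length)) > 0)
        {
            CompressChunk(chunk, read);   // hypothetical: the real library call goes here
        }
    }
}

static void CompressChunk(byte[] buffer, int count)
{
    // placeholder for the actual compression library
}

In practice you would want to write the data to the file as you produce it, rather than building the huge string first and then spilling it.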
 

nano2k

Thanks, Jon

No, they're completely different - if you get to do the allocation, you
can reuse the same buffer several times.
That's right. But in my particular case, I don't need to use the
buffer several times. I'm just forced to create an extra buffer just
to pass data to the compression method. That's my only problem.
Unfortunately, I cannot change the compress method's signature (e.g.
to accept a MemoryStream object as input data, etc.), so I thought I
would be able to adapt my own code, because this is where I'm 100% in
control.

Just how large are the strings you're compressing? Can you not
serialize (or pseudo-serialize) the operation so you're only
compressing a few strings at a time?
I cannot control the size of the request, because the request
is based on an SQL statement that could return tens of megs
of data (imagine, for example, a report on the annual activity of a
company). And because everything runs inside a webservice, there can be
multiple simultaneous requests of this type.


You *may* be able to get at the UTF-16 encoded (internal) version with
unsafe code, but I'd strongly recommend against it.
I don't know which is worse: making sure I handle such a buffer very
carefully, or risking the reliability of my webservice...
 

Jon Skeet [C# MVP]

That's right. But in my particular case, I don't need to use the
buffer several times. I'm just forced to create an extra buffer just
to pass data to the compression method. That's my only problem.
Unfortunately, I cannot change the compress method's signature (e.g.
to accept a MemoryStream object as input data, etc.), so I thought I
would be able to adapt my own code, because this is where I'm 100% in
control.

You wanted to reduce the overall memory consumption, right? So create a
buffer and reuse it, encoding different strings into the same byte
array.
I cannot control the size of the request, because the request
is based on an SQL statement that could return tens of megs
of data (imagine, for example, a report on the annual activity of a
company). And because everything runs inside a webservice, there can be
multiple simultaneous requests of this type.

If there are no bounds to the amount of data you need to compress, you
really, really should be streaming it. Grabbing into one lump is never
going to be a good option.
I don't know which is worse: making sure I handle such a buffer very
carefully, or risking the reliability of my webservice...

It sounds to me like you should look very carefully at the overall
architecture and your compression API.
 

nano2k

You wanted to reduce the overall memory consumption, right? So create a
buffer and reuse it, encoding different strings into the same byte
array.
In theory that's OK, but should I maintain a static buffer of several tens
of megabytes, locked between requests?
This means I would have to allocate the buffer according to the largest
request. Not only that, but I would need to protect the buffer against other
incoming requests that may need compression at the same time.
If there are no bounds to the amount of data you need to compress, you
really, really should be streaming it. Grabbing into one lump is never
going to be a good option.
Yes, I also think this is the way to go.
Thanks.
 

Marc Gravell

Maybe I missed something in an earlier post... but why do you want a
buffer that big? There are far better (more efficient) ways of reading
data using small buffers and chunking. I wasn't sure if your example
was dealing with database BLOBs - but if so you can do this using
small buffers too (even for big BLOBs).

Small (ish) good. Big bad.

Marc
 

Jon Skeet [C# MVP]

nano2k said:
In theory that's OK, but should I maintain a static buffer of several tens
of megabytes, locked between requests?

Tens of megabytes isn't too big if you've only got one of them. Better
than tens of megabytes being allocated and deallocated on a regular
basis.
This means I would have to allocate the buffer according to the largest
request. Not only that, but I would need to protect the buffer against other
incoming requests that may need compression at the same time.

Yes, you need to serialize the compression. That will help keep the
memory usage down a lot, even if it slows down the app overall.
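
A bare-bones sketch of that serialization around one shared buffer (the sizes, names and the Compress stand-in are all assumptions on my part):

// assumes: using System.Text;
class CompressionGate
{
    static readonly object gate = new object();
    static byte[] shared = new byte[10 * 1024 * 1024];   // sized for the largest expected request

    public static byte[] CompressString(string s)
    {
        lock (gate)   // only one request at a time touches the shared buffer
        {
            int needed = Encoding.UTF8.GetByteCount(s);
            if (needed > shared.Length)
                shared = new byte[needed];                // grow only for a bigger request

            int len = Encoding.UTF8.GetBytes(s, 0, s.Length, shared, 0);
            return Compress(shared, len);                 // hypothetical stand-in for the real library
        }
    }

    static byte[] Compress(byte[] buffer, int count)
    {
        // placeholder for the actual compression call
        return new byte[0];
    }
}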
Yes, I also think this is the way to go.

Goodo.
 

Ben Voigt

nano2k said:
Hi, thanks all for your replies.
I will answer several of the ideas in this one place.
Indeed, most of the methods don't allocate, but rely on you to allocate.
So, from my perspective, it's the same.
I am using .NET Framework v1.1 and I need to compress my string.
Unfortunately, the compression library (no sources available to me)
takes a byte[] as input parameter, and I have a string to compress.
It is frustrating that I have to allocate new memory to perform this
operation. This sometimes leads to a webservice crash, as many requests
simultaneously require this operation => not enough memory.
I do not intend to use the buffer for anything but strictly read-only
operations. I am aware that any "unmanaged" changes in such an intimate
buffer could cause unexpected behavior later.

Here's your loaded gun, plenty of rope to hang yourself (I looked at the
C++/CLI function PtrToStringChars):

// requires an unsafe context (/unsafe) to compile
// assumes: using System.Runtime.InteropServices;
string s = "This is a test string";

// Non-pinning handle: dereference the handle value to get the object's address,
// then skip ahead by OffsetToStringData to reach the character data.
GCHandle ptr = GCHandle.Alloc(s);
byte* pString = *(byte**)GCHandle.ToIntPtr(ptr).ToPointer() +
    System.Runtime.CompilerServices.RuntimeHelpers.OffsetToStringData;
char* c = (char*)pString;
ptr.Free();

// Pinning handle: AddrOfPinnedObject() already points at the character data.
GCHandle pinptr = GCHandle.Alloc(s, GCHandleType.Pinned);
pString = (byte*)pinptr.AddrOfPinnedObject().ToPointer();
c = (char*)pString;
pinptr.Free();



Note that the Large Object Heap is a heap, and not subject to garbage
collection, in which case you ought not need to pin the object.

BTW, that OffsetToStringData is 12 (at least in .NET 2.0) but it's a real
property, not a constant, so the value comes from your actual runtime
library rather than from compile time. Of course the JIT will inline that property
access to nothing anyway.

The PtrToStringChars code is the same in both .NET 1.1 and 2.0, so this
might be almost stable.

I don't know why the offset isn't needed when you use a pinning pointer. I
did notice that the pinning action moves the string though... let me try
with a larger string....

With a one million character string, there is no change in the pointer, and
the AddrOfPinnedObject call still includes the correct offset. Probably
small objects get moved to the Large Object Heap in order to pin them (and
do they come back, maybe once pinned, always pinned until all references
disappear?).

So that's how to get a zero-copy pointer to the internal data of a large
string.

Note that everything here is based on my quick tests and reading vcclr.h and
I may just have gotten lucky; pressure on the GC could move things around
and mess things up, or other bad things could happen.
 

Jon Skeet [C# MVP]

Note that the Large Object Heap is a heap, and not subject to garbage
collection, in which case you ought not need to pin the object.

Just to check what you mean - as far as I'm aware, the LOH *is* garbage
collected, but is *not* compacted.
 

nano2k

Thank you Ben, your work helped me.
Anyway, I decided to redesign my method and somehow stream the result
directly into an archived buffer.
It's a hell of a lot of work, as the response method is very delicate, but
it's worth the effort.

Thanks to everyone who answered. Anyway, I'm still puzzled why Microsoft
has not (yet) implemented such a mechanism, which was very useful in
MFC.

Thanks.


 
