Clean up binary data for UTF-8?

SDS

Greets,

I am reading some binary data and putting it into a System.String object
via Encoding.Default.GetString(). I then take this string and put it
into a dataset field of type System.String.

I then create an XmlWriter with UTF-8 encoding in its settings and
attempt to save the dataset to disk via .WriteXml.

The binary data is unreliable in the sense that I can expect to receive
illegal characters according to the XmlWriter. For example, I receive
this exception:

System.ArgumentException: '', hexadecimal value 0x0E, is an invalid
character.

I am not interested in preserving these illegal characters (but I am
interested in anything and everything legal according to UTF-8), so
ideally I'm looking for a quick and easy way to strip out anything that
is going to cause me problems.

What is the best approach here?

TIA!
 

Jon Skeet [C# MVP]

SDS said:
> I am reading some binary data and putting it into a System.String object
> via Encoding.Default.GetString().

Don't do that. It's a bad, bad idea. You *will* lose data sooner or
later.

Instead, use Convert.ToBase64String.

> I then take this string and put it
> into a dataset field of type System.String.
>
> I then create an XmlWriter with UTF-8 encoding in its settings and
> attempt to save the dataset to disk via .WriteXml.
>
> The binary data is unreliable in the sense that I can expect to receive
> illegal characters according to the XmlWriter. For example, I receive
> this exception:
>
> System.ArgumentException: '', hexadecimal value 0x0E, is an invalid
> character.
>
> I am not interested in preserving these illegal characters (but I am
> interested in anything and everything legal according to UTF-8), so
> ideally I'm looking for a quick and easy way to strip out anything that
> is going to cause me problems.
>
> What is the best approach here?

Aside from potentially losing data due to treating binary data as text,
you're not actually running into a UTF-8 limitation - you're running
into an XML limitation. There's no way of including most of the
characters below 32 (there are exceptions like line feed, carriage
return and tab) into an XML document, even though UTF-8 knows how to
handle them.

If you use Convert.ToBase64String, you won't run into this problem.
 

SDS

Jon,

Thanks for your quick reply. I think I am missing something here...

Given this byte array:

new byte[] {0xf3,0x1c,0xf8,0x18,0x42,0x65}

I know that the last 2 characters are "Be". Using
Convert.ToBase64String, I am getting "8xz4GEJl".

Maybe I should have mentioned that this data is readable text and not a
blob? Also, I need full support for international characters in this
situation so I need to be very careful not to drop characters important
to other languages.

Thanks!
 

Jon Skeet [C# MVP]

SDS said:
> Thanks for your quick reply. I think I am missing something here...
>
> Given this byte array:
>
> new byte[] {0xf3,0x1c,0xf8,0x18,0x42,0x65}
>
> I know that the last 2 characters are "Be".

There are no characters in the above. There are only bytes.

> Using Convert.ToBase64String, I am getting "8xz4GEJl".
>
> Maybe I should have mentioned that this data is readable text and not a
> blob?

As it's written above, it *is* a blob. If you mean it to be text, you
need to specify which encoding it's using. (For instance, bytes 0xf3
and 0xf8 mean very different things in different encodings.)
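
To make that concrete, here is a minimal sketch that decodes the six bytes from the post under two different encodings. The class and method names (EncodingDemo, AsLatin1, AsUtf8) are made up for illustration, and ISO-8859-1 is only an assumed example of "some other encoding", not something the thread says the data uses.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // The six bytes quoted in the thread.
    public static readonly byte[] Sample = { 0xf3, 0x1c, 0xf8, 0x18, 0x42, 0x65 };

    // ISO-8859-1 maps every byte straight to the code point with the
    // same value, so 0xF3 becomes U+00F3 and 0xF8 becomes U+00F8.
    public static string AsLatin1(byte[] data)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(data);
    }

    // In UTF-8, 0xF3 and 0xF8 are not valid on their own here, so the
    // default decoder substitutes U+FFFD (the replacement character).
    public static string AsUtf8(byte[] data)
    {
        return Encoding.UTF8.GetString(data);
    }
}
```

Calling EncodingDemo.AsLatin1(EncodingDemo.Sample) and EncodingDemo.AsUtf8(EncodingDemo.Sample) gives two different strings from the same bytes; only the trailing "Be" (plain ASCII) comes out the same in both.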

> Also, I need full support for international characters in this
> situation so I need to be very careful not to drop characters important
> to other languages.

Well, almost all characters above 31 are representable in XML, I believe
(the exceptions are unpaired surrogates, U+FFFE and U+FFFF) - so
you need to consider what you want to do with characters like you were
getting before. If you need those control characters, I'd recommend
Base64-encoding the binary data, then Base64-decoding it and applying
whatever encoding the data is actually in so that you get back the
original string. Alternatively, you could still use text but have some
kind of escaping system.
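
A minimal sketch of the Base64 round trip described above. The names Base64RoundTrip, Encode and DecodeToText are hypothetical; the encoding passed to DecodeToText has to be whatever the data is really in, which the caller must know.

```csharp
using System;
using System.Text;

public static class Base64RoundTrip
{
    // Turn raw bytes into a string made only of A-Z, a-z, 0-9, '+', '/'
    // and '=', all of which are safe to write into an XML document.
    public static string Encode(byte[] raw)
    {
        return Convert.ToBase64String(raw);
    }

    // Recover the original bytes, then decode them as text using the
    // encoding the data was actually written in.
    public static string DecodeToText(string base64, Encoding sourceEncoding)
    {
        byte[] raw = Convert.FromBase64String(base64);
        return sourceEncoding.GetString(raw);
    }
}
```

For the byte array in the thread, Encode returns "8xz4GEJl", matching what SDS reported. Any control characters in the original text survive the trip because they never appear in the XML as characters.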

Jon
 

SDS

The binary format I am reading from specifies that strings should be
encoded in UTF-8 format.

I don't need the characters that are causing problems, but like I said
earlier I need to be very careful to not drop any characters that would
be important in international languages.

If we are talking about characters 0-31, I guess my question really is
-- what is the most efficient way for me to strip out characters 0-31
from a string so it can be put into an XML document via XmlWriter
without error?

Thanks again.
 

Jon Skeet [C# MVP]

SDS said:
> The binary format I am reading from specifies that strings should be
> encoded in UTF-8 format.

In that case, you should be using Encoding.UTF8 to decode it to start
with, rather than Encoding.Default. You may well find that by using the
right encoding to start with, you don't get any problematic characters.
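
As a rough sketch of decoding with the right encoding, the helper below uses a UTF8Encoding constructed to throw on invalid bytes, so malformed input is reported rather than silently replaced. Utf8Decoding and TryDecode are hypothetical names, and whether you want strict decoding or the default replacement behaviour of Encoding.UTF8 is a judgment call for your data.

```csharp
using System;
using System.Text;

public static class Utf8Decoding
{
    // First argument: do not emit a byte order mark when encoding.
    // Second argument: throw on invalid byte sequences when decoding.
    private static readonly UTF8Encoding Strict = new UTF8Encoding(false, true);

    // Returns true and the decoded text if the bytes are valid UTF-8;
    // returns false (and null) if they are not.
    public static bool TryDecode(byte[] data, out string text)
    {
        try
        {
            text = Strict.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            text = null;
            return false;
        }
    }
}
```

Valid UTF-8, including non-ASCII text, decodes without loss. The six-byte array posted earlier in the thread is rejected, which suggests it is not UTF-8 text as it stands.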

> I don't need the characters that are causing problems, but like I said
> earlier I need to be very careful to not drop any characters that would
> be important in international languages.

Sure.

> If we are talking about characters 0-31, I guess my question really is
> -- what is the most efficient way for me to strip out characters 0-31
> from a string so it can be put into an XML document via XmlWriter
> without error?

The best thing is probably to have a "first phase" which checks to see
if there *are* any characters to strip out (i.e. check each character
to see whether it's less than 32 *and* not '\r', '\n' or '\t') and if
there are, go through a second time using a StringBuilder to create the
string just from the characters you want. I suspect you'll find there
are very few strings which need to have characters removed, so that
removal process can be readable rather than ultra-efficient :)
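
The two-pass approach described above could be sketched like this. XmlText, IsForbiddenControl and StripControlChars are hypothetical names, and the check only covers the control characters below 32 discussed in this thread, not every code point XML disallows.

```csharp
using System;
using System.Text;

public static class XmlText
{
    // True for characters below U+0020 other than tab, line feed and
    // carriage return, which are the ones XML 1.0 permits.
    private static bool IsForbiddenControl(char c)
    {
        return c < 0x20 && c != '\t' && c != '\n' && c != '\r';
    }

    public static string StripControlChars(string s)
    {
        // First pass: look for the first character that must go.
        int first = -1;
        for (int i = 0; i < s.Length; i++)
        {
            if (IsForbiddenControl(s[i]))
            {
                first = i;
                break;
            }
        }

        // Common case: nothing to strip, so return the same instance.
        if (first < 0)
        {
            return s;
        }

        // Second pass: copy the clean prefix, then filter the rest.
        StringBuilder sb = new StringBuilder(s.Length);
        sb.Append(s, 0, first);
        for (int i = first + 1; i < s.Length; i++)
        {
            if (!IsForbiddenControl(s[i]))
            {
                sb.Append(s[i]);
            }
        }
        return sb.ToString();
    }
}
```

Since most strings are expected to be clean, the first pass returns the original string without allocating. On .NET 4.0 and later, System.Xml.XmlConvert.IsXmlChar may be a simpler way to test individual characters, though it was not mentioned in the thread.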

Jon
 
