UTF-8 to UTF-16?

PL

I'm somewhat confused about Unicode, but until recently I haven't really had
many issues with it. We recently started using an SMS gateway that requires a
Unicode message to be sent as a hexadecimal string in which each byte has been
replaced with its hexadecimal value, for example: 043104AF0442044D044...

According to their documentation, this string must be in UTF-16 before
conversion to the hexadecimal form. We, however, use UTF-8 on our website,
and all the texts are entered as UTF-8.

When I try to send a Unicode message using content from our website, some
characters show correctly but not all of them. I cannot see any reason for
this other than the fact that we are using UTF-8 and they require UTF-16.

Now to the questions:

1. How do I convert between UTF-8 and UTF-16? I was looking at the Decoder
and Encoder classes, but as far as I could see they don't provide a direct
way to convert between encodings.

2. Since all strings in .NET are actually UTF-16, does this mean that the
conversion has already been made, or does it mean it is actually storing
UTF-8 encoded bytes in a UTF-16 string?

Thank you
PL.
 
Morten Wennevik

Hi PL,

You can use the System.Text.Encoding class to convert one string to a byte
array and then back to string in another encoding.

byte[] data = System.Text.Encoding.UTF8.GetBytes(utf8string);
string unicodestring = System.Text.Encoding.Unicode.GetString(data);

Beware that UTF-16 can be big endian, in which case use BigEndianUnicode to
get the string.

As for the second question: yes, all strings are Unicode, but the content
of the string does not have to be Unicode encoded. I believe a string can
hold UTF-8 encoded data without loss, but if you plan on doing string
manipulation I would convert it to Unicode first.
 
PL

Thank you, I was looking at the Encoding class without seeing that simple
solution :-/

PL.
 
Nick Hounsome

Morten Wennevik said:
> Hi PL,
>
> You can use the System.Text.Encoding class to convert one string to a byte
> array and then back to string in another encoding.
>
> byte[] data = System.Text.Encoding.UTF8.GetBytes(utf8string);
> string unicodestring = System.Text.Encoding.Unicode.GetString(data);

This is just wrong.

Strings are strings of characters; they are not strings of encodings of
characters. Hence it is meaningless to have a variable of type System.String
called utf8string.

Consider the simpler situation with Int32:
The integer 10 is not the sequence of characters "10" in decimal, nor is it
"1010" in binary, nor is it the bytes 0x00, 0x00, 0x00, 0x0a. These are all
encodings. The two quoted lines of code are the equivalent of writing
something like:

int hexInt = 0x42;                    // the value 66, written in hexadecimal
string data = hexInt.ToString("X");   // the text "42"
int decimalInt = int.Parse(data);     // parsed as decimal: 42, not 66
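
The same effect can be seen with the string version. The following is only a minimal sketch; the class name and the sample text are made up for illustration. It encodes a string as UTF-8 and then decodes those bytes as if they were UTF-16, which does not give the original text back.

```csharp
using System;
using System.Text;

static class RoundTripDemo
{
    static void Main()
    {
        // Sample text chosen for illustration: four Cyrillic characters.
        string original = "\u0431\u04AF\u0442\u044D";

        // Mismatched: UTF-8 bytes reinterpreted as UTF-16 code units.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);
        string mangled = Encoding.Unicode.GetString(utf8Bytes);
        Console.WriteLine(mangled == original);    // False

        // Matched: decode with the same encoding that produced the bytes.
        string restored = Encoding.UTF8.GetString(utf8Bytes);
        Console.WriteLine(restored == original);   // True
    }
}
```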

> Beware that UTF-16 can be big endian, in which case use BigEndianUnicode to
> get the string.

This brings up the issue of byte order marks (BOM).
If you use a BOM, the encoding can be inferred from the first few bytes.
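
As a small illustration, the BOM bytes for each encoding can be printed with Encoding.GetPreamble. This is only a sketch with a made-up class name. Note that GetBytes does not add a BOM by itself, and whether the gateway expects one is not stated in this thread.

```csharp
using System;
using System.Text;

static class BomDemo
{
    static void Main()
    {
        // GetPreamble returns the byte order mark the encoding would write
        // at the start of a stream or file.
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetPreamble()));          // FF-FE
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetPreamble())); // FE-FF
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetPreamble()));             // EF-BB-BF
    }
}
```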

> As for the second question: yes, all strings are Unicode, but the content
> of the string does not have to be Unicode encoded. I believe a string can
> hold UTF-8 encoded data without loss,

A string is not encoded; therefore it is meaningless to say that it holds
UTF-8 encoded data.

> but if you plan on doing string manipulation I would convert it to
> Unicode first.

There is no other type of string in .NET; therefore all string manipulation
is inherently Unicode.

To understand what you need to do, you need to specify how your data comes
into and out of your app. If it arrives as byte arrays, then what you have is
this:

byte[] utf8Input = .....;
string inputString = System.Text.Encoding.UTF8.GetString(utf8Input);
byte[] utf16Output = System.Text.Encoding.Unicode.GetBytes(inputString);
OutputHex(utf16Output);
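
OutputHex is not defined in this thread. A minimal sketch of what such a helper might look like follows; the names SmsHex and ToUtf16Hex are hypothetical. One assumption is worth checking against the gateway's documentation: the sample in the question (0431 04AF ...) looks like big-endian UTF-16 code units, so the sketch defaults to Encoding.BigEndianUnicode rather than Encoding.Unicode, which is little endian.

```csharp
using System;
using System.Text;

static class SmsHex
{
    // Hypothetical helper: encodes text as UTF-16 and returns the bytes
    // as uppercase hexadecimal digits, two per byte.
    public static string ToUtf16Hex(string text, bool bigEndian = true)
    {
        // Assumption: the gateway wants big-endian UTF-16 with no BOM.
        Encoding utf16 = bigEndian ? Encoding.BigEndianUnicode : Encoding.Unicode;
        byte[] bytes = utf16.GetBytes(text);

        var hex = new StringBuilder(bytes.Length * 2);
        foreach (byte b in bytes)
        {
            hex.Append(b.ToString("X2"));
        }
        return hex.ToString();
    }

    static void Main()
    {
        // U+0431 U+04AF U+0442 U+044D, the code units visible in the question's sample.
        string text = "\u0431\u04AF\u0442\u044D";
        Console.WriteLine(ToUtf16Hex(text));          // 043104AF0442044D
        Console.WriteLine(ToUtf16Hex(text, false));   // 3104AF0442044D04
    }
}
```

If the text already exists as a .NET string (for example, from a web form), no UTF-8 decoding step is needed before calling a helper like this; only raw bytes need Encoding.UTF8.GetString first.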
 
