Basic Conversion Query


C# Learner

Hi,

I have a string (System.String) which holds some data. This data is
encoded in UTF-8 (i.e. anywhere in the string where there should be a
single 'é' character, there are instead two characters holding the
equivalent of that character in UTF-8).

How can I decode this UTF8-encoded string?

In Delphi I could simply say:

myString := UTF8ToAnsi(myString);

How can I do this using .NET?

I tried making a general-purpose static method to do this:

private static string Utf8ToAscii(string value)
{
    byte[] utf8Bytes = Encoding.UTF8.GetBytes(value);
    byte[] asciiBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);

    return Encoding.Unicode.GetString(asciiBytes);
}

...but it doesn't work as desired. This just causes the pair of
encoded characters to be replaced with '?' characters.

TIA
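A minimal sketch of the situation being described, assuming (as later posts in the thread confirm) that the string was built by widening each raw byte to a char, so a single 'é' ends up as the two characters "Ã©":

using System;
using System.Text;

class MojibakeDemo
{
    static void Main()
    {
        // A single 'é' (U+00E9) becomes two bytes in UTF-8: 0xC3 0xA9.
        byte[] utf8 = Encoding.UTF8.GetBytes("é");
        Console.WriteLine(BitConverter.ToString(utf8));  // C3-A9

        // Widening each byte straight to a char turns the single 'é'
        // into the two characters "Ã©" (U+00C3 and U+00A9).
        char[] widened = new char[utf8.Length];
        for (int i = 0; i < utf8.Length; i++)
        {
            widened[i] = (char)utf8[i];
        }
        Console.WriteLine(new string(widened));          // Ã©
    }
}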
 

Jon Skeet [C# MVP]

C# Learner said:
I have a string (System.String) which holds some data. This data is
encoded in UTF-8 (i.e. anywhere in the string where there should be a
single 'é' character, there are instead two characters holding the
equivalent of that character in UTF-8).

How can I decode this UTF8-encoded string?

Please see my responses in microsoft.public.dotnet.framework. Your
basic problem is mixing up binary data and character data.

See http://www.pobox.com/~skeet/csharp/unicode.html
 

C# Learner

Jon Skeet said:
Please see my responses in microsoft.public.dotnet.framework. Your
basic problem is mixing up binary data and character data.

See http://www.pobox.com/~skeet/csharp/unicode.html

I really can't see where I'm mixing up anything with anything. I
simply have a string in UTF8 format. I just want to decode it.

Is it not possible in .NET? I've done this in Delphi without problem.

Thanks for your patient replies.
 

C# Learner

Jan Tielens said:
Check out the System.Text.UTF8Encoding class:
http://tinyurl.com/kh15

Hi,

Thanks for the reply, but this doesn't seem to be working:

private static string Utf8ToAscii(string value)
{
    byte[] utf8Bytes = Encoding.UTF8.GetBytes(value);

    System.Text.UTF8Encoding u = new UTF8Encoding();
    return u.GetString(utf8Bytes);
}
 

Jon Skeet [C# MVP]

C# Learner said:
I really can't see where I'm mixing up anything with anything. I
simply have a string in UTF8 format.

That's mixing things up to start with. "<x> in UTF8 format/encoding" only really makes sense when <x> is binary data; a string is a sequence of characters and doesn't have an encoding.
I just want to decode it.

You're looking at the wrong thing though - you need to decode the bytes
you read from the socket, rather than generating character data from
those bytes in some way which you haven't defined and then talking
about that character data as if it were "in UTF8 format".
Is it not possible in .NET? I've done this in Delphi without problem.

I suspect a string in Delphi isn't the same as it is in .NET.
 

Jon Skeet [C# MVP]

C# Learner said:
Thanks for the reply, but this doesn't seem to be working:

private static string Utf8ToAscii(string value)
{
    byte[] utf8Bytes = Encoding.UTF8.GetBytes(value);

    System.Text.UTF8Encoding u = new UTF8Encoding();
    return u.GetString(utf8Bytes);
}

That code works fine, but doesn't do what you want it to do. Look very
closely at the documentation for Encoding.GetBytes and
Encoding.GetString. Between those and the web page I pointed you at
before, you should end up with an understanding of why "a UTF-8 encoded
string" is like saying "a hex-formatted number": the number itself, as
a number, has no encoding; only a string representation of the number
has a format.
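To make the direction of those two calls concrete, here is a small sketch (not from the thread): GetBytes *encodes* characters into bytes, GetString *decodes* bytes back into characters, and the encoding only ever applies to the byte arrays.

using System;
using System.Text;

class EncodeDecodeDirection
{
    static void Main()
    {
        string text = "café";   // four characters; the string itself has no encoding

        // GetBytes goes from characters to bytes: it *encodes*.
        byte[] asUtf8  = Encoding.UTF8.GetBytes(text);     // 5 bytes ('é' takes two)
        byte[] asUtf16 = Encoding.Unicode.GetBytes(text);  // 8 bytes (two per character)

        // GetString goes from bytes to characters: it *decodes*.
        // Both round trips give back the same four-character string.
        Console.WriteLine(Encoding.UTF8.GetString(asUtf8));      // café
        Console.WriteLine(Encoding.Unicode.GetString(asUtf16));  // café
        Console.WriteLine(asUtf8.Length + " " + asUtf16.Length); // 5 8
    }
}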
 

C# Learner

Jon Skeet said:
That's mixing things up to start with. "<x> in UTF8 format/encoding" only really makes sense when <x> is binary data; a string is a sequence of characters and doesn't have an encoding.


You're looking at the wrong thing though - you need to decode the bytes
you read from the socket, rather than generating character data from
those bytes in some way which you haven't defined and then talking
about that character data as if it were "in UTF8 format".


I suspect a string in Delphi isn't the same as it is in .NET.

Okay, I've got it working.

I had to use the following:

private static string Utf8ToAnsi(string value)
{
    byte[] utf8Bytes = RawEncoding.GetBytes(value);
    byte[] ansiBytes = Encoding.Convert(Encoding.UTF8, Encoding.Default, utf8Bytes);

    return Encoding.Default.GetString(ansiBytes);
}

public class RawEncoding
{
    public static byte[] GetBytes(string text)
    {
        byte[] result = new byte[text.Length];

        for (int i = 0; i < text.Length; ++i) {
            result[i] = (byte)text[i];
        }

        return result;
    }
}

Thanks
 

Jon Skeet [C# MVP]

C# Learner said:
Okay, I've got it working.

I had to use the following:

private static string Utf8ToAnsi(string value)
{
    byte[] utf8Bytes = RawEncoding.GetBytes(value);
    byte[] ansiBytes = Encoding.Convert(Encoding.UTF8, Encoding.Default, utf8Bytes);

    return Encoding.Default.GetString(ansiBytes);
}

public class RawEncoding
{
    public static byte[] GetBytes(string text)
    {
        byte[] result = new byte[text.Length];

        for (int i = 0; i < text.Length; ++i) {
            result[i] = (byte)text[i];
        }

        return result;
    }
}


That's still ignoring the original problem though, and you may well
find you get corrupted data. (It's also doing one conversion more than
you really need to.)

The key thing is how you're getting this "UTF-8 encoded string" in the
first place. Something must be converting bytes into a string - and
*that's* the place to fix. Either it shouldn't be doing a conversion at
all (in which case it should pass the byte array along) or it should be
doing the conversion using the UTF-8 encoding (in which case your
string will then be correct).
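For illustration, a rough sketch of what that fix might look like at the reading end. The stream handling and the end-of-message check below are placeholders, since the real framing depends on the protocol; the point is that the single binary-to-character conversion happens where the bytes arrive, using the UTF-8 encoding.

using System.IO;
using System.Net.Sockets;
using System.Text;

class ReadingSketch
{
    // Hypothetical reader: accumulate the raw bytes, then do the one and
    // only conversion from binary to character data with UTF-8.
    public static string ReadMessage(NetworkStream stream)
    {
        using (MemoryStream buffer = new MemoryStream())
        {
            byte[] chunk = new byte[4096];
            int read;
            while ((read = stream.Read(chunk, 0, chunk.Length)) > 0)
            {
                buffer.Write(chunk, 0, read);
                if (!stream.DataAvailable)
                {
                    break;   // naive: assume a pause means the message is complete
                }
            }
            return Encoding.UTF8.GetString(buffer.ToArray());
        }
    }
}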
 

C# Learner

Jon Skeet said:
That's still ignoring the original problem though, and you may well
find you get corrupted data. (It's also doing one conversion more than
you really need to.)

The key thing is how you're getting this "UTF-8 encoded string" in the
first place. Something must be converting bytes into a string - and
*that's* the place to fix. Either it shouldn't be doing a conversion at
all (in which case it should pass the byte array along) or it should be
doing the conversion using the UTF-8 encoding (in which case your
string will then be correct).

Just for reference, here's the method I use to "convert" the bytes to
a string after reading them from the socket:

public static string GetString(byte[] data)
{
    StringBuilder sb = new StringBuilder();

    for (int i = 0; i < data.Length; ++i) {
        sb.Append((char)data[i]);
    }

    return sb.ToString();
}

Regards
 

Jon Skeet [C# MVP]

C# Learner said:
Just for reference, here's the method I use to "convert" the bytes to
a string after reading them from the socket:

public static string GetString(byte[] data)
{
    StringBuilder sb = new StringBuilder();

    for (int i = 0; i < data.Length; ++i) {
        sb.Append((char)data[i]);
    }

    return sb.ToString();
}


Right. Don't do that. *That's* where you're mixing binary and character
data (essentially treating binary data as character data).

Either keep it as bytes, or use Encoding.UTF8.GetString(data) instead
of the above when you're reading the text.
 

C# Learner

Jon Skeet said:
Right. Don't do that. *That's* where you're mixing binary and character
data (essentially treating binary data as character data).

Either keep it as bytes, or use Encoding.UTF8.GetString(data) instead
of the above when you're reading the text.

The problem with using UTF8.GetString is that it just seems to remove
important bytes.

For example, the raw packet read from the socket might be something
like (note, these are raw bytes, and I'm displaying them as a string
literal for convenience):

"FOOPROTOCOL\xC0\x801\xC0\x80Field1\xC0\x80"

So that's:
"FOOPROTOCOOL"
0xC0
0x80
'1'
0xC0
0x80
"Field1"
0xC0
0x80

Now, using UTF8.GetString on the above will do something like the
following:

"FOOPROTOCOL1Field1"

i.e. all the "\xC0\x80" delimiters were removed.
 

Jon Skeet [C# MVP]

C# Learner said:
The problem with using UTF8.GetString is that it just seems to remove
important bytes.

Not if those bytes are part of the text...
For example, the raw packet read from the socket might be something
like (note, these are raw bytes, and I'm displaying them as a string
literal for convenience):

"FOOPROTOCOL\xC0\x801\xC0\x80Field1\xC0\x80"

So that's:
"FOOPROTOCOOL"
0xC0
0x80
'1'
0xC0
0x80
"Field1"
0xC0
0x80

Now, using UTF8.GetString on the above will do something like the
following:

"FOOPROTOCOL1Field1"

i.e. all the "\xC0\x80" delimiters were removed.

Yes, because you're *again* mixing binary data and character data. The
0xc0 and 0x80 bytes aren't part of the text, they're delimiters. You
should only convert the text data into a string, however you do the
conversion.

It sounds like what you actually should be doing is finding the
delimiters within the byte array and converting each text section of
the binary data into a string separately. Unless you do that, you will
definitely be mixing binary and character data.

I don't know if you've got access to the protocol itself by the way,
but if you have I'd suggest changing it so that rather than using
delimiters, you prefix each string with the length in bytes.
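One possible shape for that, as a sketch: split the raw packet on the two-byte 0xC0 0x80 delimiter and decode each text section separately, so only genuine text ever goes through the UTF-8 decoder. This assumes, as the protocol's choice of delimiter implies, that the pair never occurs inside field text.

using System;
using System.Collections.Generic;
using System.Text;

class FieldSplitSketch
{
    // Split the raw packet on the 0xC0 0x80 delimiter and decode each
    // field on its own.
    public static List<string> SplitFields(byte[] packet)
    {
        List<string> fields = new List<string>();
        int start = 0;

        for (int i = 0; i + 1 < packet.Length; i++)
        {
            if (packet[i] == 0xC0 && packet[i + 1] == 0x80)
            {
                fields.Add(Encoding.UTF8.GetString(packet, start, i - start));
                start = i + 2;
                i++;   // skip the second delimiter byte
            }
        }

        if (start < packet.Length)
        {
            fields.Add(Encoding.UTF8.GetString(packet, start, packet.Length - start));
        }

        return fields;
    }
}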
 

C# Learner

Jon Skeet said:
Yes, because you're *again* mixing binary data and character data. The
0xc0 and 0x80 bytes aren't part of the text, they're delimiters. You
should only convert the text data into a string, however you do the
conversion.

It sounds like what you actually should be doing is finding the
delimiters within the byte array and converting each text section of
the binary data into a string separately. Unless you do that, you will
definitely be mixing binary and character data.

I don't know if you've got access to the protocol itself by the way,
but if you have I'd suggest changing it so that rather than using
delimiters, you prefix each string with the length in bytes.

Hi Jon,

I guess all the problems I'm running into are due to the fact that I
basically think of a string as an array of bytes.

In this case, I don't have access to the protocol, and can't change
it. Also, separating the packets into an array of fields would be
less efficient and more work than is desired in this particular
scenario. The reason for this is that a packet may have a large
number of fields, say 100.

I think the only way of doing this correctly would be to keep the C#
code I have currently. It works as expected.

Thanks for your patience in this matter. It's much appreciated.
 

Jon Skeet [C# MVP]

C# Learner said:
I guess all the problems I'm running into are due to the fact that I
basically think of a string as an array of bytes.

Yes indeed - it's not, it's a sequence of *characters*.
In this case, I don't have access to the protocol, and can't change
it. Also, separating the packets into an array of fields would be
less efficient and more work than is desired in this particular
scenario. The reason for this is that a packet may have a large
number of fields, say 100.

It's really not going to take long to sort them out though...
I think the only way of doing this correctly would be to keep the C#
code I have currently. It works as expected.

Well, in that case I'd at least recommend changing your code to:

public static string GetString(byte[] data)
{
    char[] chars = new char[data.Length];
    for (int i = 0; i < data.Length; i++)
    {
        chars[i] = (char)data[i];
    }
    return new string(chars);
}

private static string Utf8ToAnsi(string value)
{
    byte[] utf8Bytes = new byte[value.Length];
    for (int i = 0; i < utf8Bytes.Length; i++)
    {
        utf8Bytes[i] = (byte)value[i];
    }

    return Encoding.UTF8.GetString(utf8Bytes);
}
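As a quick check that the two methods above fit together, here is a self-contained round trip (the method bodies are copied from the post above; the byte values 0xC3 0xA9 are just an example, being 'é' in UTF-8):

using System;
using System.Text;

class RoundTripSketch
{
    public static string GetString(byte[] data)
    {
        char[] chars = new char[data.Length];
        for (int i = 0; i < data.Length; i++)
        {
            chars[i] = (char)data[i];
        }
        return new string(chars);
    }

    private static string Utf8ToAnsi(string value)
    {
        byte[] utf8Bytes = new byte[value.Length];
        for (int i = 0; i < utf8Bytes.Length; i++)
        {
            utf8Bytes[i] = (byte)value[i];
        }
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    static void Main()
    {
        // 'é' arrives on the wire as the two UTF-8 bytes 0xC3 0xA9.
        byte[] fromSocket = { 0xC3, 0xA9 };

        // GetString widens each byte to a char, so the intermediate string
        // holds the two characters "Ã©" - the situation from the first post.
        string raw = GetString(fromSocket);

        // Utf8ToAnsi narrows the chars back to the original bytes and then
        // decodes them as UTF-8, giving the single character "é".
        Console.WriteLine(Utf8ToAnsi(raw));
    }
}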
 

C# Learner

Jon Skeet said:
Yes indeed - it's not, it's a sequence of *characters*.

This is something I'd better look into!
It's really not going to take long to sort them out though...


Well, in that case I'd at least recommend changing your code to:

<snipped for brevity>

Will do, thanks again.
 
