convert to UTF-8

Albert Jan

Hi,

I have text from MIME email messages in different encodings that I want to
convert to UTF-8, but I'm relatively new to encoding problems.

I use the following code, but it doesn't seem to work (the input and output
remain the same):

string x = toUTF8("=?GB2312?B?s8q5q8u+vq3A7aGissbO8bK/w8W1xNK7t+LQxQ==?=", "GB2312");

public static string toUTF8(string messageString, string charset)
{
    Encoding dstEnc = Encoding.UTF8;
    if (charset.Length == 0)
    {
        charset = "us-ascii";
    }
    Encoding srcEnc = Encoding.GetEncoding(charset);
    byte[] srcData = srcEnc.GetBytes(messageString);
    byte[] dstData;
    // see if we need to convert data
    if (dstEnc != srcEnc)
    {
        dstData = Encoding.Convert(srcEnc, dstEnc, srcData);
    }
    else
    {
        dstData = srcData;
    }
    char[] utf8Chars = new char[Encoding.UTF8.GetCharCount(dstData, 0, dstData.Length)];
    Encoding.UTF8.GetChars(dstData, 0, dstData.Length, utf8Chars, 0);
    string utf8String = new string(utf8Chars);
    return utf8String;
}



Any help would be appreciated,



Albert Jan
 
Morten Wennevik

Hi Albert,

I'm not too familiar with GB2312 (Simplified Chinese), but I believe all
characters in your string have the same value in both GB2312 and UTF-8, so
there would be no change when you recode it (they look the same in both
encodings). For reference, your string looks to me like

"=?GB2312?B?s8q5q8u+vq3A7aGissbO8bK/w8W1xNK7t+LQxQ==?=", which isn't
Chinese in any way.

Although your encoding conversion code is overly elaborate, I see nothing
wrong with it. You could simplify it greatly with something like this:

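public static string toUTF8(string messageString, string charset)
{
    Encoding dstEnc = Encoding.UTF8;
    // Debugging aid: shows the system default encoding.
    MessageBox.Show(Encoding.Default.ToString());

    // Fall back to ASCII when no charset is given.
    if (charset.Length == 0)
    {
        charset = "us-ascii";
    }

    Encoding srcEnc = Encoding.GetEncoding(charset);
    byte[] srcData = srcEnc.GetBytes(messageString);

    // Note: the bytes are in srcEnc but are decoded as UTF-8 here, so the
    // result matches the input only for characters that both encodings
    // represent with the same bytes (plain ASCII in the GB2312 case).
    string utf8String = dstEnc.GetString(srcData);
    return utf8String;
}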
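
To see why the input and output come out identical, it may help to run the round trip on its own. The following is only a sketch: RoundTripDemo and RoundTrip are hypothetical names, and "us-ascii" stands in for GB2312 so the example runs without extra code pages. A .NET string is already Unicode, so encoding it to bytes and decoding those bytes again gives back the same text as long as every character exists in the source charset.

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Encodes the text with the source charset, converts the bytes to UTF-8,
    // then decodes them again: the same steps as the original toUTF8.
    public static string RoundTrip(string text, string charset)
    {
        Encoding srcEnc = Encoding.GetEncoding(charset);
        byte[] srcData = srcEnc.GetBytes(text);
        byte[] utf8Data = Encoding.Convert(srcEnc, Encoding.UTF8, srcData);
        return Encoding.UTF8.GetString(utf8Data);
    }
}
```

Calling RoundTripDemo.RoundTrip with the sample header string and "us-ascii" returns the string unchanged, because the header consists of ASCII characters only. The Chinese text is still hidden inside the encoded payload.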
 
Albert Jan

Hi Morten,

Thanks for your answer.

The string I use ("=?GB2312?B?s8q5q8u+vq3A7aGissbO8bK/w8W1xNK7t+LQxQ==?=")
comes from a Chinese mail message and was probably converted to
'quoted-printable' by the sender.

So my problem is how to display the original Chinese characters again. I suppose
I have to cut out the "GB2312?", but what next?


Regards,

Albert Jan
 
Stefan Simek

Hi,

The string looks like base64-encoded data to me.

I suggest decoding the s8q5q8u+vq3A7aGissbO8bK/w8W1xNK7t+LQxQ== part and
then converting the resulting bytes to Unicode using the GB2312 encoding:

string DoConversion(string base64)
{
    // decode the base64 string
    byte[] bytes = Convert.FromBase64String(base64);

    // get the GB2312 encoding
    Encoding encoding = Encoding.GetEncoding(20936);

    // decode the GB2312 encoded byte stream
    string result = encoding.GetString(bytes);

    return result;
}

The result certainly looks like Chinese to me... ;)

HTH,
Stefan
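
The same idea can be taken one step further by reading the charset from the header value instead of hard-coding a code page. This is a minimal sketch, assuming the input is a single encoded-word of the form =?charset?B?text?= as described in RFC 2047. The names EncodedWord and DecodeBase64Word are hypothetical. On .NET Core and later, GB2312 is only available after Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) has been called once; on the .NET Framework that step should not be needed.

```csharp
using System;
using System.Text;

public static class EncodedWord
{
    // Decodes one encoded-word of the form =?charset?B?base64-text?=
    // Returns the input unchanged if it does not have that shape.
    public static string DecodeBase64Word(string word)
    {
        if (word.Length < 4
            || !word.StartsWith("=?", StringComparison.Ordinal)
            || !word.EndsWith("?=", StringComparison.Ordinal))
        {
            return word;
        }

        // Strip the leading "=?" and trailing "?=", then split into
        // charset, encoding marker and payload.
        string[] parts = word.Substring(2, word.Length - 4).Split('?');
        if (parts.Length != 3
            || !parts[1].Equals("B", StringComparison.OrdinalIgnoreCase))
        {
            return word;
        }

        // Base64 gives back the raw bytes; the charset turns them into text.
        // GetEncoding throws if the charset name is unknown to the runtime.
        byte[] bytes = Convert.FromBase64String(parts[2]);
        Encoding charset = Encoding.GetEncoding(parts[0]);
        return charset.GetString(bytes);
    }
}
```

With this, EncodedWord.DecodeBase64Word("=?GB2312?B?s8q5q8u+vq3A7aGissbO8bK/w8W1xNK7t+LQxQ==?=") takes the place of cutting the string apart by hand. The result is an ordinary .NET string, so no further conversion is needed before displaying it.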


 
Albert Jan

Great, this works!

I have probably overlooked a part of the MIME specification: nowhere in the
mail message that I used as example input is the use of base64
declared.

Thanks,

Albert Jan
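
On where base64 is declared: as far as RFC 2047 describes it, a header value of the form =?charset?encoding?text?= carries the declaration in its middle field, where B stands for base64 and Q for an encoding similar to quoted-printable. So the ?B? in the sample string is the marker. Below is a minimal sketch for reading that marker. The enum WordEncoding and the class EncodedWordMarker are names made up for illustration.

```csharp
using System;

public enum WordEncoding { None, Base64, QEncoding }

public static class EncodedWordMarker
{
    // Reports which transfer encoding an encoded-word declares,
    // or None if the text is not of the form =?charset?X?text?=
    public static WordEncoding Detect(string word)
    {
        if (word.Length < 4
            || !word.StartsWith("=?", StringComparison.Ordinal)
            || !word.EndsWith("?=", StringComparison.Ordinal))
        {
            return WordEncoding.None;
        }

        // Fields between the outer delimiters: charset, marker, payload.
        string[] parts = word.Substring(2, word.Length - 4).Split('?');
        if (parts.Length != 3)
        {
            return WordEncoding.None;
        }

        switch (parts[1].ToUpperInvariant())
        {
            case "B": return WordEncoding.Base64;
            case "Q": return WordEncoding.QEncoding;
            default: return WordEncoding.None;
        }
    }
}
```

For the sample string, Detect returns WordEncoding.Base64, which is why Convert.FromBase64String is the right first step for the payload.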



