Convert Encoding from Shift-JIS to UTF-8

DbNetLink

I am trying to convert some Japanese text encoded as Shift-JIS/ISO-2022-JP
to UTF-8 so I can store all data in my database with a common encoding.

My problem is that the encoding conversion code works for Japanese characters
encoded as "iso-2022-jp" but not for those encoded as "shift-jis".

What looked straightforward is proving less so. My test code looks like
this:

<%@ Page Language="C#"%>

<script language="C#" runat="server">
////////////////////////////////////////////////////////////////////////////
public void Page_Load()
////////////////////////////////////////////////////////////////////////////
{
    string S = Request.Form["text"];

    Encoding SourceEncoding = Encoding.GetEncoding( "shift-jis" );
    Encoding TargetEncoding = Encoding.UTF8;

    Response.Write( SourceEncoding.GetString( TargetEncoding.GetBytes( S ) ) );
}
</script>

Thanks in advance
 
Jon Skeet [C# MVP]

DbNetLink said:
I am trying to convert some Japanese text encoded as Shift-JIS/ISO-2022-JP
to UTF-8 so I can store all data in my database with a common encoding.

There's something wrong here. The request value is a Unicode string;
all strings are Unicode in .NET. Any encoding has already been taken
into account by the time you read it. You should be able to just write
the string to the database without any change.

See http://www.pobox.com/~skeet/csharp/unicode.html
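
The point above can be illustrated with a small console sketch. It is not from the original posts: the class and method names are hypothetical, and it assumes .NET Core / .NET 5 or later, where legacy code pages have to be registered through CodePagesEncodingProvider (on .NET Framework that call should not be needed). It simulates the same text arriving in two different wire encodings and shows that, once decoded, both are the same .NET string.

```csharp
using System;
using System.Text;

public static class DecodeOnceDemo
{
    // Simulates a client sending `text` in `wireEncoding` and the server
    // decoding it again. The result is an ordinary .NET (UTF-16) string.
    public static string SendAndDecode(string text, Encoding wireEncoding)
    {
        byte[] onTheWire = wireEncoding.GetBytes(text);
        return wireEncoding.GetString(onTheWire);
    }

    public static void Main()
    {
        // Assumption: running on .NET Core / .NET 5+, where code pages
        // 932 and 50220 are only available after this registration.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding shiftJis = Encoding.GetEncoding(932);     // Shift-JIS
        Encoding iso2022Jp = Encoding.GetEncoding(50220);  // ISO-2022-JP

        string original = "\u65E5\u672C\u8A9E"; // three kanji, written as escapes

        string fromShiftJis = SendAndDecode(original, shiftJis);
        string fromIso2022Jp = SendAndDecode(original, iso2022Jp);

        // Both comparisons are expected to succeed: the wire encoding
        // no longer matters once the bytes have been decoded.
        Console.WriteLine(fromShiftJis == fromIso2022Jp);
        Console.WriteLine(fromShiftJis == original);
    }
}
```

If both comparisons hold, there is nothing left to "convert" at the string level; the choice of UTF-8 only matters when the string is turned back into bytes for storage or output.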
 
DbNetLink

Is that true even if the web page transmitting the form has had its
encoding set to "shift-jis"?

When you say Unicode, I am assuming this means UTF-16?

Assuming that is true, I would expect to be able to convert
the text like this:

////////////////////////////////////////////////////////////////////////////
public void Page_Load()
////////////////////////////////////////////////////////////////////////////
{
    string S = Request.Form["text"];

    Encoding SourceEncoding = Encoding.Unicode;
    Encoding TargetEncoding = Encoding.UTF8;

    Response.Write( SourceEncoding.GetString( TargetEncoding.GetBytes( S ) ) );
}

But this does not appear to work as I would expect either.
 
Jon Skeet [C# MVP]

DbNetLink said:
Is that true even if the web page transmitting the form has had its
encoding set to "shift-jis"?

When you say Unicode, I am assuming this means UTF-16?

Yes.

DbNetLink said:
Assuming that is true, I would expect to be able to convert
the text like this:

////////////////////////////////////////////////////////////////////////////
public void Page_Load()
////////////////////////////////////////////////////////////////////////////
{
    string S = Request.Form["text"];

    Encoding SourceEncoding = Encoding.Unicode;
    Encoding TargetEncoding = Encoding.UTF8;

    Response.Write( SourceEncoding.GetString( TargetEncoding.GetBytes( S ) ) );
}

But this does not appear to work as I would expect either.

No, that shouldn't work. That's trying to use the UTF-8 encoding of a
string as if it were a UTF-16 ("Unicode") encoding of that string.

If you want the UTF-8 encoded bytes, just use Encoding.UTF8.GetBytes(S)
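
To make the mismatch concrete, here is a minimal console sketch (not part of the original posts; the class and method names are hypothetical). It uses only the built-in Encoding.UTF8 and Encoding.Unicode, and compares the pattern from the question with a matched encode/decode pair.

```csharp
using System;
using System.Text;

public static class MismatchDemo
{
    // Mirrors the line in question: encode as UTF-8, then decode those
    // bytes as if they were UTF-16 ("Unicode").
    public static string Mismatched(string s)
    {
        return Encoding.Unicode.GetString(Encoding.UTF8.GetBytes(s));
    }

    // Encode and decode with the same encoding.
    public static string Matched(string s)
    {
        return Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(s));
    }

    public static void Main()
    {
        string original = "\u65E5\u672C"; // two kanji, written as escapes

        byte[] utf8 = Encoding.UTF8.GetBytes(original);
        Console.WriteLine(BitConverter.ToString(utf8));        // E6-97-A5-E6-9C-AC

        Console.WriteLine(Mismatched(original) == original);   // False
        Console.WriteLine(Matched(original) == original);      // True
    }
}
```

The six UTF-8 bytes are reinterpreted as three unrelated UTF-16 code units in the mismatched case, which is why the output looks like garbage rather than the original text.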

Did you read the page I linked to?
 
DbNetLink

Jon Skeet said:
If you want the UTF-8 encoded bytes, just use Encoding.UTF8.GetBytes(S)

Is that not what I am doing in the line:

Response.Write( SourceEncoding.GetString( TargetEncoding.GetBytes( S ) ) );

Given the earlier line:

Encoding TargetEncoding = Encoding.UTF8;

I did read the link but was unable to relate it directly to my problem of
converting one encoding to another using .NET.

If it is simply down to an error in my code, perhaps you could point it out,
as I have already spent two days trying to understand what I am doing wrong
and would love to be put out of my misery :(


Thanks for your help BTW

 
Jon Skeet [C# MVP]

DbNetLink said:
Is that not what I am doing in the line:

Response.Write( SourceEncoding.GetString( TargetEncoding.GetBytes( S ) ) );

No. You're converting the string into UTF-8, but then using the result
as if it were a valid shift-jis-encoded byte array.

DbNetLink said:
Given the earlier line:

Encoding TargetEncoding = Encoding.UTF8;

I did read the link but was unable to relate it directly to my problem of
converting one encoding to another using .NET.

It gives the fundamentals, which should explain why the line of code at
the top is a really bad idea.

DbNetLink said:
If it is simply down to an error in my code, perhaps you could point it out,
as I have already spent two days trying to understand what I am doing wrong
and would love to be put out of my misery :(

You should just be able to use the string, without venturing into
encodings at all.
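
For the separate case where the data really is a raw Shift-JIS byte array (for example, read from a file rather than from Request.Form), a bytes-to-bytes conversion might look like the sketch below. This is an illustration, not code from the thread: the class and method names are hypothetical, and the RegisterProvider call assumes .NET Core / .NET 5 or later. Encoding.Convert decodes with the source encoding and re-encodes with the target one, so the two encodings are never mixed up.

```csharp
using System;
using System.Text;

public static class ShiftJisToUtf8
{
    // Converts bytes that are known to be Shift-JIS into UTF-8 bytes.
    public static byte[] ToUtf8(byte[] shiftJisBytes)
    {
        // Assumption: .NET Core / .NET 5+, where code page 932 must be
        // registered first. Repeated registration is harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding shiftJis = Encoding.GetEncoding(932);

        return Encoding.Convert(shiftJis, Encoding.UTF8, shiftJisBytes);
    }

    public static void Main()
    {
        // U+65E5 U+672C encoded as Shift-JIS.
        byte[] shiftJisBytes = { 0x93, 0xFA, 0x96, 0x7B };

        byte[] utf8Bytes = ToUtf8(shiftJisBytes);
        Console.WriteLine(BitConverter.ToString(utf8Bytes)); // E6-97-A5-E6-9C-AC
    }
}
```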

If that's not working, you need to work through it step by step - see
http://www.pobox.com/~skeet/csharp/debuggingunicode.html
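
One way to start working through it step by step is to look at what the string actually contains, independent of any font or output encoding. The helper below is a hypothetical sketch (its name and output format are assumptions, not taken from the linked page): it prints each UTF-16 code unit of a string in hex so the values can be compared against a Unicode chart.

```csharp
using System;
using System.Text;

public static class UnicodeDump
{
    // Returns the UTF-16 code units of `s` as "U+XXXX" values separated
    // by spaces. Characters outside the BMP appear as two surrogate units.
    public static string Dump(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0)
            {
                sb.Append(' ');
            }
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Dump("\u65E5\u672C")); // U+65E5 U+672C
    }
}
```

If the dump of Request.Form["text"] already shows the expected code points, the string is fine and any remaining problem is in how it is written out or stored.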
 
