Unicode problem (displaying japanese text correctly)

R

Rob

Hello all

I have a SQL Server database and I'm experiening terrible problems in
displaying japanese language text stored in the db in my
ado.net/asp.net application. The relevant columns in the db are all
nvarchar (thus unicode).After pulling the data and using the
ToString() method alone always results in garbage on the screen,
regardless of what encloding is selected in the browser
(view-encoding) or what culture threads I set up - for example:
Thread.CurrentThread.CurrentCulture = new CultureInfo("ja-JP");

The only way I have been able to get the correct display is to convert
the encoding to shift_jis like this:

// rdr is an instance of SqlDataReader
string uniString = rdr["blahblah"].ToString();
Response.Write(uniString); // outputs garbage

// 1252 is the default codepage, and 932 is the shift_jis codepage
byte[] bytes = Encoding.GetEncoding(1252).GetBytes(uniString);
Response.Write(Encoding.GetEncoding(932).GetString(bytes)); // no
problem

In addition to culture, I have also tried this in web.config:
<globalization requestEncoding="shift_jis"
responseEncoding="shift_jis" fileEncoding="shift_jis" />

The result is also the same if I use englishXP/englishIE or
japaneseXP/japaneseIE as the client OS/browser.

Is there some other way to get the desired output ? Although the above
solution works for simple output using a reader, it doesn't seem very
elegant and when I want to use a dataset and bind it to a control I am
back at square one.

Any help / advice is much appreciated
Thanks
Rob
 
D

Dmitriy Lapshin [C# / .NET MVP]

Hello Rob,
regardless of what encloding is selected in the browser
(view-encoding) or what culture threads I set up - for example:
Thread.CurrentThread.CurrentCulture = new CultureInfo("ja-JP");

This affects formatting of dates and currencies but not the output encoding.
You might try to set the ContentEncoding and/or Charset properties of the
Response object in code-behind.
You can also try to add the "META charset" tag to the output.

--
Dmitriy Lapshin [C# / .NET MVP]
X-Unity Test Studio
http://www.x-unity.net/teststudio.aspx
Bring the power of unit testing to VS .NET IDE

Rob said:
Hello all

I have a SQL Server database and I'm experiening terrible problems in
displaying japanese language text stored in the db in my
ado.net/asp.net application. The relevant columns in the db are all
nvarchar (thus unicode).After pulling the data and using the
ToString() method alone always results in garbage on the screen,
regardless of what encloding is selected in the browser
(view-encoding) or what culture threads I set up - for example:
Thread.CurrentThread.CurrentCulture = new CultureInfo("ja-JP");

The only way I have been able to get the correct display is to convert
the encoding to shift_jis like this:

// rdr is an instance of SqlDataReader
string uniString = rdr["blahblah"].ToString();
Response.Write(uniString); // outputs garbage

// 1252 is the default codepage, and 932 is the shift_jis codepage
byte[] bytes = Encoding.GetEncoding(1252).GetBytes(uniString);
Response.Write(Encoding.GetEncoding(932).GetString(bytes)); // no
problem

In addition to culture, I have also tried this in web.config:
<globalization requestEncoding="shift_jis"
responseEncoding="shift_jis" fileEncoding="shift_jis" />

The result is also the same if I use englishXP/englishIE or
japaneseXP/japaneseIE as the client OS/browser.

Is there some other way to get the desired output ? Although the above
solution works for simple output using a reader, it doesn't seem very
elegant and when I want to use a dataset and bind it to a control I am
back at square one.

Any help / advice is much appreciated
Thanks
Rob
 
J

Jon Skeet [C# MVP]

Rob said:
// 1252 is the default codepage, and 932 is the shift_jis codepage
byte[] bytes = Encoding.GetEncoding(1252).GetBytes(uniString);
Response.Write(Encoding.GetEncoding(932).GetString(bytes)); // no
problem

That's a *really* bad idea. It's likely to fail in some places.

First, check whether or not your original string actually has the right
data. The way to do that is (just for test purposes) to write out in
text the Unicode values of each character. Cast each character of the
string to an int, and print that out. Look at www.unicode.org to find
out what those characters are.

Now, if that fails, there's something going wrong in the way you're
fetching the string. If it looks okay, however, then it's an output
problem to the browser.
 
R

Rob

Thanks. Yeah, I realised that culture doesn't affect output encoding
in the way I want. As for setting ContentEncoding and/or Charset
properties of the Reponse object in code behind, isn't this equivalent
to setting responseEncoding="shift_jis" in web.config in the
<globalization> tag, which I have already tried to no avail (and is
not reset anywhere else subsequently). Anyway, I tried both methods
explicity and neither work. Same goes for using the <META tag in the
page header. And, as previously mentioned, it also doesn't work to set
the encoding in the client browser.

I also perhaps should mention that the data gets input to the db from
a classic ASP application, and those client configs are always
japanese OS/browser.
 
R

Rob

Jon Skeet said:
Rob said:
// 1252 is the default codepage, and 932 is the shift_jis codepage
byte[] bytes = Encoding.GetEncoding(1252).GetBytes(uniString);
Response.Write(Encoding.GetEncoding(932).GetString(bytes)); // no
problem

That's a *really* bad idea. It's likely to fail in some places.

First, check whether or not your original string actually has the right
data. The way to do that is (just for test purposes) to write out in
text the Unicode values of each character. Cast each character of the
string to an int, and print that out. Look at www.unicode.org to find
out what those characters are.

Now, if that fails, there's something going wrong in the way you're
fetching the string. If it looks okay, however, then it's an output
problem to the browser.

OK, so there is something wrong with the way I'm fetching the string.
Here is what I did:
Response.Write("These are the values of each character in the original
string: <br>");
foreach (char c in testString)
Response.Write((int) c + "<br>");

In my test the desired output is 4 japanese charcters. The output for
this loop had 8 integer values, corressponding to the garbage output
problem.

Then i did this:

byte[] bytes = Encoding.GetEncoding(1252).GetBytes(uniString);
newString = Encoding.GetEncoding(932).GetString(bytes);
Response.Write("char values for converted string: <br>");
foreach (char c in newString)
Response.Write((int) c + "<br>");

This output had 4 characters corresponding to the correct unicode
value for the 4 correct japanese characters verified here:
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=5BAE
[note: 5BAE is the hex value of the first character in the string
(23470 decimal). FYI, the first 2 charcters in the garbage string are
8249 and 123 (dec)]

So, what is wrong with the way I am fetching the string ? In my simple
test, I just do the following:

con.Open();
SqlCommand cmd1 = new SqlCommand(selectString, con);
SqlDataReader rdr = cmd1.ExecuteReader();

rdr.Read() ;
string testString = rdr[columnName].ToString() ;

As previously mentioned, the column on the db (SQL Server 2000) is of
type nvarchar and the data is input by a classic ASP application (not
mine). This app has no problem displaying the data - it just sets the
Response.charset or META tag to shift_jis and all is well. As far as I
know nvarchar is stored on the db as unicode, however, the developer
of the ASP app swears blind that it is stored as shift_jis.

Thanks for any help ! !
 
J

Jon Skeet [C# MVP]

Rob said:
OK, so there is something wrong with the way I'm fetching the string.
Here is what I did:
Response.Write("These are the values of each character in the original
string: <br>");
foreach (char c in testString)
Response.Write((int) c + "<br>");

In my test the desired output is 4 japanese charcters. The output for
this loop had 8 integer values, corressponding to the garbage output
problem.

Right. At least you now know where the real problem is :)
Then i did this:

byte[] bytes = Encoding.GetEncoding(1252).GetBytes(uniString);
newString = Encoding.GetEncoding(932).GetString(bytes);
Response.Write("char values for converted string: <br>");
foreach (char c in newString)
Response.Write((int) c + "<br>");

That suggests that the driver has been misconfigured. I suspect that
using 1252 you *may* actually get some dodgy data for some characters,
and that that could be fixed by using 28591 - but I'm not suggesting
that you should do that. (It depends on exactly how the database is
screwed up though.)
This output had 4 characters corresponding to the correct unicode
value for the 4 correct japanese characters verified here:
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=5BAE
[note: 5BAE is the hex value of the first character in the string
(23470 decimal). FYI, the first 2 charcters in the garbage string are
8249 and 123 (dec)]

So, what is wrong with the way I am fetching the string ? In my simple
test, I just do the following:

con.Open();
SqlCommand cmd1 = new SqlCommand(selectString, con);
SqlDataReader rdr = cmd1.ExecuteReader();

rdr.Read() ;
string testString = rdr[columnName].ToString() ;

As previously mentioned, the column on the db (SQL Server 2000) is of
type nvarchar and the data is input by a classic ASP application (not
mine). This app has no problem displaying the data - it just sets the
Response.charset or META tag to shift_jis and all is well. As far as I
know nvarchar is stored on the db as unicode, however, the developer
of the ASP app swears blind that it is stored as shift_jis.

Thanks for any help ! !

My guess is that it's actually being stored in the database incorrectly
to start with. One thing you might like to do is construct a "proper"
string (with these characters) in a .NET test application, store it in
the database, and then retrieve it.

My guess is that the ASP developer is mucking things up by manually
encoding the original string into shift_jis bytes, and then uploading
those bytes as though they were unicode characters. SQL Server does
indeed store character data as Unicode. You might like to quote this to
the developer, from the SQL Enterprise Manager help:

<quote>
nvarchar(n)

Variable-length Unicode character data of n characters. n must be a
value from 1 through 4,000. Storage size, in bytes, is two times the
number of characters entered. The data entered can be 0 characters in
length.
</quote.
 
R

Rob

Hi John ! Thanks for your replies so far :)

There doesn't seem to be anything strange in the way the ASP app
writes to the DB. I have looked at a simple example of some ASP in the
app which is very simple and just does this to update a row:
if request.servervariables("REQUEST_METHOD") = "POST" then
con.open
con.execute("UPDATE tbBlah SET blah = '" & request.form("text") & "'
WHERE ID = 1")
con.close
end if

When japanese characters are typed in the text box "text" and the form
submitted then the ASP app can read them back with no problem like
this:

set rs = con.execute("SELECT blah FROM tbBlah WHERE ID = 1")
JapString = rs("blah")
...
<html>
...
<%=JapString %>

and they display perfectly, but in .net I have the problem previously
described.

I cannot today test an ASP.NET test to do the same thing coz I don't
have a machine with japanese OS but I will tomorrow. But there can
only be 2 outcomes: 1) It works fine in ASP.NET, or I have the same
problem retrieving the chars. Either way it points to something which
I just cannot understand.

Any further thoughts / advice before I test the ASP.NET test (or about
anything specific I should do in the ASP.NET test) would be very much
appreciated.

Thanks
Rob

Jon Skeet said:
Rob said:
OK, so there is something wrong with the way I'm fetching the string.
Here is what I did:
Response.Write("These are the values of each character in the original
string: <br>");
foreach (char c in testString)
Response.Write((int) c + "<br>");

In my test the desired output is 4 japanese charcters. The output for
this loop had 8 integer values, corressponding to the garbage output
problem.

Right. At least you now know where the real problem is :)
Then i did this:

byte[] bytes = Encoding.GetEncoding(1252).GetBytes(uniString);
newString = Encoding.GetEncoding(932).GetString(bytes);
Response.Write("char values for converted string: <br>");
foreach (char c in newString)
Response.Write((int) c + "<br>");

That suggests that the driver has been misconfigured. I suspect that
using 1252 you *may* actually get some dodgy data for some characters,
and that that could be fixed by using 28591 - but I'm not suggesting
that you should do that. (It depends on exactly how the database is
screwed up though.)
This output had 4 characters corresponding to the correct unicode
value for the 4 correct japanese characters verified here:
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=5BAE
[note: 5BAE is the hex value of the first character in the string
(23470 decimal). FYI, the first 2 charcters in the garbage string are
8249 and 123 (dec)]

So, what is wrong with the way I am fetching the string ? In my simple
test, I just do the following:

con.Open();
SqlCommand cmd1 = new SqlCommand(selectString, con);
SqlDataReader rdr = cmd1.ExecuteReader();

rdr.Read() ;
string testString = rdr[columnName].ToString() ;

As previously mentioned, the column on the db (SQL Server 2000) is of
type nvarchar and the data is input by a classic ASP application (not
mine). This app has no problem displaying the data - it just sets the
Response.charset or META tag to shift_jis and all is well. As far as I
know nvarchar is stored on the db as unicode, however, the developer
of the ASP app swears blind that it is stored as shift_jis.

Thanks for any help ! !

My guess is that it's actually being stored in the database incorrectly
to start with. One thing you might like to do is construct a "proper"
string (with these characters) in a .NET test application, store it in
the database, and then retrieve it.

My guess is that the ASP developer is mucking things up by manually
encoding the original string into shift_jis bytes, and then uploading
those bytes as though they were unicode characters. SQL Server does
indeed store character data as Unicode. You might like to quote this to
the developer, from the SQL Enterprise Manager help:

<quote>
nvarchar(n)

Variable-length Unicode character data of n characters. n must be a
value from 1 through 4,000. Storage size, in bytes, is two times the
number of characters entered. The data entered can be 0 characters in
length.
</quote.
 
J

Jon Skeet [C# MVP]

Rob said:
Hi John ! Thanks for your replies so far :)

There doesn't seem to be anything strange in the way the ASP app
writes to the DB. I have looked at a simple example of some ASP in the
app which is very simple and just does this to update a row:
if request.servervariables("REQUEST_METHOD") = "POST" then
con.open
con.execute("UPDATE tbBlah SET blah = '" & request.form("text") & "'
WHERE ID = 1")
con.close
end if

That in itself is dodgy, IMO - it's including the text directly in the
SQL. I've no idea whether that's likely to work properly or not. What
character encoding does normal ASP use for strings? Unless it's
unicode, you've *definitely* got a problem. If it doesn't, I don't know
how it's meant to cope with Unicode input, to be honest - it all sounds
pretty messy.
When japanese characters are typed in the text box "text" and the form
submitted then the ASP app can read them back with no problem like
this:

Just because the ASP app can read them back doesn't mean it's not
broken - it just means that it's probably broken in the same way for
both sending and receiving, as it were. (Imagine an app which always
stored numbers as -x instead of x, and retrieved them as -x instead of
x - the app would be broken, but it would be consistent with itself.)
I cannot today test an ASP.NET test to do the same thing coz I don't
have a machine with japanese OS but I will tomorrow. But there can
only be 2 outcomes: 1) It works fine in ASP.NET, or I have the same
problem retrieving the chars. Either way it points to something which
I just cannot understand.

You shouldn't need to test with ASP.NET at all - just use a small
console app to write to the database and to read from it. Nor do you
need a Japanese OS to test it - just include specific Unicode
characters that you're interested in.
 
R

Rob

Hi John

Thanks again for your advice / comments.

It appears that you are correct about the way the text is being
inserted into the db by ASP (I think the underlying problem in the
class ASP app is that the data is not prefixed with 'N' to specify
unicode). I have verified that my ASP.NET app can insert/update,
extract and display japanese text with no problem. So this leaves me
at the point of wondering how best to deal with the data which is
stored by the ASP app. You said already that my kludge was not a good
idea:
byte[] bytes = Encoding.GetEncoding(1252).GetBytes(originalString);
string newString = Encoding.GetEncoding(936).GetString(bytes);

Do you have any better solution for the short term ? Does it depend on
what code page the SQL Server was installed with ? [I currenly don't
know that, other than it's an english os and db]. I still don't really
understand exactly *what* is happening to the data as it arrives on
the the db from classic ASP app.

In the longer term, I think we will fix the way that the japanese text
is stored by the legacy ASP app. I'm fairly sure that if it just
prefixed a "N" to the data being inserted/updated that would do the
trick. I'm still curious though about how the classic ASP is working -
since it is sending chars to the db which are being converted into
whatever code page the db was installed with and when it receives data
from the db it is implicity recieving unicode, since all strings in
vbscript are unicode - so I wonder if the OLEDB driver also does some
conversion along the way ? But these are question about classic
ASP/ADO and vbscript though so I guess they should be asked elsewhere.

Any further advice about dealing with the problem will be very welcome
:)

Thanks !
Rob
 
J

Jon Skeet [C# MVP]

So this leaves me
at the point of wondering how best to deal with the data which is
stored by the ASP app. You said already that my kludge was not a good
idea:
byte[] bytes = Encoding.GetEncoding(1252).GetBytes(originalString);
string newString = Encoding.GetEncoding(936).GetString(bytes);

Do you have any better solution for the short term ?

For the short term, that's probably okay. Depending on *exactly* what's
going on, you might be better to use 28591 (ISO Latin-1) but working
out a way to test that would be tricky, unfortunately.
Does it depend on
what code page the SQL Server was installed with ? [I currenly don't
know that, other than it's an english os and db]. I still don't really
understand exactly *what* is happening to the data as it arrives on
the the db from classic ASP app.

See later.
In the longer term, I think we will fix the way that the japanese text
is stored by the legacy ASP app.

That's definitely the way to go.
I'm fairly sure that if it just
prefixed a "N" to the data being inserted/updated that would do the
trick.

Could well be... I don't know how ASP itself handles characters, I'm
afraid, so I'm reluctant to say for sure.
I'm still curious though about how the classic ASP is working -
since it is sending chars to the db which are being converted into
whatever code page the db was installed with and when it receives data
from the db it is implicity recieving unicode, since all strings in
vbscript are unicode - so I wonder if the OLEDB driver also does some
conversion along the way ? But these are question about classic
ASP/ADO and vbscript though so I guess they should be asked elsewhere.

I think (given what you've shown works to some extent at least) the ASP
app is converting the characters into a shift_jis encoded byte array,
then inserting those characters into the database as if the byte array
were actually a string encoded with Cp1252. Whether or not the code
page of the database is involved I don't know, but it wouldn't entirely
surprise me!
 
R

Rob

Thank you very much :)
So this leaves me
at the point of wondering how best to deal with the data which is
stored by the ASP app. You said already that my kludge was not a good
idea:
byte[] bytes = Encoding.GetEncoding(1252).GetBytes(originalString);
string newString = Encoding.GetEncoding(936).GetString(bytes);

Do you have any better solution for the short term ?

For the short term, that's probably okay. Depending on *exactly* what's
going on, you might be better to use 28591 (ISO Latin-1) but working
out a way to test that would be tricky, unfortunately.
Does it depend on
what code page the SQL Server was installed with ? [I currenly don't
know that, other than it's an english os and db]. I still don't really
understand exactly *what* is happening to the data as it arrives on
the the db from classic ASP app.

See later.
In the longer term, I think we will fix the way that the japanese text
is stored by the legacy ASP app.

That's definitely the way to go.
I'm fairly sure that if it just
prefixed a "N" to the data being inserted/updated that would do the
trick.

Could well be... I don't know how ASP itself handles characters, I'm
afraid, so I'm reluctant to say for sure.
I'm still curious though about how the classic ASP is working -
since it is sending chars to the db which are being converted into
whatever code page the db was installed with and when it receives data
from the db it is implicity recieving unicode, since all strings in
vbscript are unicode - so I wonder if the OLEDB driver also does some
conversion along the way ? But these are question about classic
ASP/ADO and vbscript though so I guess they should be asked elsewhere.

I think (given what you've shown works to some extent at least) the ASP
app is converting the characters into a shift_jis encoded byte array,
then inserting those characters into the database as if the byte array
were actually a string encoded with Cp1252. Whether or not the code
page of the database is involved I don't know, but it wouldn't entirely
surprise me!
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top