Difference between \u and &#

Bart

Hi,

I receive a UTF-8 character from a database in a form like &#30000 (a Japanese
character, style: &#XXXXX).
How can I display the Japanese character in my application? I have found
the class System.Text.Encoding, but its input looks like \uXXXX. I don't
know how to proceed.

Thank you,

Bart
 
I'm not entirely sure what you mean by "the input looks like \uXXXX".
Do you mean it's stored in the database as a string with "\uXXXX" in?
Are you *sure* about that, or is that just what the debugger is
showing? (Try writing it out to the console.)
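
To make the distinction concrete, here is a minimal sketch (the class and method names are hypothetical, chosen only for illustration). It assumes nothing about the database: it only shows that \uXXXX is a way of typing a character in C# source code, while the string itself holds the character, and that UTF-8 is a separate matter of how that character is turned into bytes.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class EscapeDemo
{
    // "\u7530" is source-code notation for U+7530; the resulting
    // string holds one character, not six.
    public static string Sample()
    {
        return "\u7530";
    }

    // Hex dump of the UTF-8 bytes that represent the given text.
    public static string Utf8Hex(string text)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(text));
    }
}
```

EscapeDemo.Sample().Length is 1, the character's numeric value is 30000 in decimal (0x7530 in hex), and EscapeDemo.Utf8Hex(EscapeDemo.Sample()) gives "E7-94-B0". Neither the \u form nor the &# form appears in those bytes.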
 
I have looked at this example in the MSDN Library:

UTF8Encoding utf8 = new UTF8Encoding();
UTF8Encoding utf8ThrowException = new UTF8Encoding(false, true);

// This array contains two high surrogates in a row (\uD801, \uD802).
// A high surrogate should be followed by a low surrogate.
Char[] chars = new Char[] {'a', 'b', 'c', '\uD801', '\uD802', 'd'};

It means that I have to write the strings as \uXXXX, but in my database the
data is stored (UTF-8) as &#XXXXX. I don't understand why a UTF-8 character
has that format in the example and a different one in my database, even
though both are UTF-8 encoded.
 
I'm not sure what you mean by "stored (utf8) as &#XXXXX". Do you mean
that the actual characters '&' '#' etc are in the database - or is it
just that that's how you see non-ASCII characters when displaying them
in (say) a SQL query execution environment?

The reason Unicode characters above 0xffff have to be stored in
surrogate form in .NET is that .NET uses UTF-16 internally, effectively
- each character is 16 bits, which isn't enough to cover the whole of
Unicode.

When you write a string in a C# program, however, you *can* use
\UXXXXXXXX instead (note the capital U). Only values up to 0x10ffff are
supported, so the first two Xs will always be 0.
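
As a small sketch of the point about surrogates (class and member names are hypothetical), the following uses U+10400 purely as an example of a character above 0xffff. It relies only on the standard char helpers IsHighSurrogate, IsLowSurrogate and ConvertToUtf32.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class SurrogateDemo
{
    // U+10400 lies above 0xFFFF, so it needs the eight-digit \U escape
    // and is stored as two 16-bit code units.
    public const string Supplementary = "\U00010400";

    // Number of UTF-16 code units in the string.
    public static int CodeUnitCount()
    {
        return Supplementary.Length;
    }

    // True when the two code units form a high/low surrogate pair.
    public static bool IsSurrogatePair()
    {
        return char.IsHighSurrogate(Supplementary[0])
            && char.IsLowSurrogate(Supplementary[1]);
    }

    // Recombines the pair into the single Unicode code point.
    public static int CodePoint()
    {
        return char.ConvertToUtf32(Supplementary, 0);
    }
}
```

The string has a Length of 2, its first code unit is \uD801 (a high surrogate like the ones in the MSDN example), and CodePoint() returns 0x10400.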
 
First of all, I would like to thank you for your answers.


I mean that the '&' and '#' characters are in the database, and when I run a
query I receive, in C#, a character as &#XXXXX.
Here is an example:
I have a form that runs a query on a database written in Japanese with UTF-8
encoding. The result of the query on my form is:

&#24046;&#12424;

and I don't know which method I have to call to do the conversion.
 
Right - I think.

When you say "on your form" - do you mean as output on a web page? If
so, that's introducing another level of encoding - XML encoding. What
happens if you write a simple console application to search the
database and write the results to a console? Do you still get &#24046;?
 
It is a simple Windows form. The result of the query appears on a label.
I have not tried a console application yet.
 
Right, if it's a windows form, that's fine. I'm intrigued as to why the
database returns it that way. How did the data get in there in the
first place? It should be returning the data properly in UTF-8 rather
than using an XML-type encoding.

How sure are you that the problem isn't just that the program which
*submitted* the data to the database was XML-encoding it?
 
Actually I don't know how the data was stored in the database. Perhaps they
used the phpMyAdmin web interface (judging by their style).
 
Hmm.

What happens if you try to insert data into the database yourself? What
database is it, anyway (SQL Server, Oracle etc)?

My guess is that you'll find that whatever you put into the database,
you get the same thing out - if you put the string "&#30000" in, you'll
get that out, but if you put in a string actually containing U+7530
(the Unicode character whose value is 30000 in decimal) you'll get that out, in which case the
data in the database is effectively corrupt to some extent. Depending
on your project, you may want to write a tool to "clean" the database,
and then write your main project without worrying about it.

Out of interest, what happens to '&' signs in the database? Do they
come out as &amp; ?
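
If the stored text really does contain literal &#XXXXX sequences, a "clean" tool could decode them along these lines. This is only a sketch under assumptions: the class name is hypothetical, it handles decimal references only, and it treats the trailing semicolon as optional because the examples in this thread are written without one. For properly terminated references, the framework's System.Net.WebUtility.HtmlDecode may be an alternative.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class NcrDecoder
{
    // Matches decimal references such as "&#30000;" or "&#30000".
    private static readonly Regex DecimalReference =
        new Regex(@"&#([0-9]{1,7});?");

    // Replaces each decimal reference with the character it denotes.
    public static string Decode(string input)
    {
        return DecimalReference.Replace(input, match =>
        {
            int codePoint = int.Parse(match.Groups[1].Value);
            bool isSurrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
            if (codePoint > 0x10FFFF || isSurrogate)
            {
                // Not a valid Unicode code point: leave the text untouched.
                return match.Value;
            }
            return char.ConvertFromUtf32(codePoint);
        });
    }
}
```

With this sketch, NcrDecoder.Decode("&#30000") returns the single character U+7530, and text without any references, including a bare '&', passes through unchanged.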
 
I tried to access the database from the web using phpMyAdmin. The situation is
quite strange: if I use the Japanese interface (SJIS) I can see
some names written in Japanese and some as combinations like
&#12424;&#12375;&#12384;
I'm getting quite confused! Now I don't know how I should make the query,
UTF-8 or SJIS. I have to check it more carefully.
 
Ignore the web interface for the moment - the first thing you need to
understand is what the database itself is doing. Now, what happens if
you try to insert a new record with Japanese characters in (not using
&#xxxxx at all)?
 
I tried to insert a Japanese name (not with the web interface) into a record and
then to display it both on the web and in the Windows form.
For example, I tried to insert ????? as the name from a Windows form that does
an "insert" into the database.

The Windows form that displays the select query shows me "????".

The web interface shows me the same, "????".

Now I think there is some problem with the exchange of the data. I got hold of
the code of the web service in PHP:

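function INSERT ($mystring, $name, $password)
{
    // make a connection to the DBMS
    $connection = mysql_connect("localhost", $name, $password);

    // select a database
    mysql_select_db('DATABASE', $connection);

    // escape the value so that quotes in the name cannot break the query
    $safestring = mysql_real_escape_string($mystring, $connection);

    // query to insert; the first column's value is not given,
    // so NULL is assumed here (e.g. for an auto-increment key)
    $doquery = mysql_query("INSERT INTO Name VALUES (NULL, '$safestring')", $connection)
        or die(mysql_error());

    // close connection
    mysql_close($connection);
}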

In C# I simply add a web reference to the database, and on a Windows form I have
a textbox for typing the name.
 
Were those all meant to be question marks, or were some of them meant
to be characters?

Ah, it's MySQL. That could well make a difference to things. For one
thing, according to
http://dev.mysql.com/doc/mysql/en/Charset-Unicode.html
MySQL doesn't cope with UTF-8 values which take more than three bytes -
in other words, Unicode values > 0xffff. However, it also implies that
you don't *need* Unicode values > 0xffff. Unless you know otherwise, I
suggest we assume that surrogates aren't part of the problem for the
moment.

Now, how is the database set up? Are you connecting to it with an
appropriate connection string? Either of those could be the problem -
or it could be a bug in the MySQL .NET provider you're using.
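
One way literal question marks can appear, offered only as a possibility and not a diagnosis of this setup, is that some layer between the program and the database converts the text through a character set that cannot represent it. The sketch below (hypothetical class name) reproduces that effect in isolation with the standard Encoding classes, without touching MySQL at all.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class LossyRoundTrip
{
    // Encodes the text with the given encoding and decodes it again,
    // mimicking a layer that pushes the text through that character set.
    public static string RoundTrip(string text, Encoding encoding)
    {
        return encoding.GetString(encoding.GetBytes(text));
    }
}
```

LossyRoundTrip.RoundTrip("\u5DEE\u3088", Encoding.ASCII) comes back as "??", because ASCII has no representation for either character, while the same call with Encoding.UTF8 returns the original text intact.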

I suggest you try to isolate the problem: come up with a simple clean
database (with a single table with a single column), then a short but
complete program which demonstrates the problem.

See http://www.pobox.com/~skeet/csharp/complete.html for details of
what I mean by that.

Once you've got that, we should be able to work out either how to fix
things or who to complain to :)
 
Bart said:
I think it is the best way :-))))))


Thank you very much, I will continue and let you know asap.

Righto. If you let me know which version of MySQL you've got, and which
provider you're using, I'll try to get them installed here so I can run
your code too.
 