UTF-8 encoding in AJAX web application.

Allan Ebdrup · Mar 16, 2007

I have an Ajax web application where I have problems with UTF-8 encoding of
Chinese characters.

My Ajax web application runs in an HTML page that is UTF-8 encoded.
I copy and paste some Chinese characters from another HTML page viewed in
IE7 that is also UTF-8 encoded (search for "china" on google.com). I paste
the Chinese characters into a content-editable div.
My Ajax web application compiles an XML document in which the data from the
content-editable div is placed in a CDATA section, and sends it to a web
service on the server. I read the content-editable div using .innerHTML. I
call the web service using XMLHttpRequest in the following way:
-----
req.open("POST", strUrl, true);
req.setRequestHeader("Content-Type",
    "application/x-www-form-urlencoded; charset=UTF-8");
var strSend = "";
// aParameters alternates names and values: name, value, name, value, ...
for (var i = 0; i < aParameters.length; i += 2)
{
    if (strSend.length != 0) strSend += "&";
    strSend += aParameters[i] + "=" + encodeURIComponent(aParameters[i + 1]);
}
req.send(strSend);
-----
where req is the XMLHttpRequest object and aParameters is an array that
contains: parameterName, parameterValue, parameterName, parameterValue,...
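
As a side note on the request body, here is a minimal sketch of how a flat name/value array can be turned into a form-encoded string. The helper name `buildFormBody` is hypothetical and not part of the original code; the sketch assumes the array alternates names and values as described above, and it also encodes the names, which the original loop does not.

```javascript
// Hypothetical helper: builds an application/x-www-form-urlencoded body
// from a flat array of [name, value, name, value, ...].
function buildFormBody(aParameters) {
  var parts = [];
  for (var i = 0; i + 1 < aParameters.length; i += 2) {
    // encodeURIComponent percent-encodes the UTF-8 bytes of each character.
    parts.push(encodeURIComponent(aParameters[i]) + "=" +
               encodeURIComponent(aParameters[i + 1]));
  }
  return parts.join("&");
}

console.log(buildFormBody(["id", "7", "text", "中"])); // id=7&text=%E4%B8%AD
```

Because encodeURIComponent always emits UTF-8 percent-escapes, the body itself is plain ASCII by the time it leaves the browser.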

Before I send the XML I write it to the screen, and there the Chinese
characters are displayed correctly.

On the server I use .NET 2.0. The XML is transformed to SQL inside a CDATA
section, which is read and executed against an MSSQL 2000 database, where
the string with the Chinese characters is stored in a text column.

When I load the data again in my Ajax web application, a web service is
called that returns the string in an XML document in a CDATA section. The
data is read from the database using a DataReader.

When the loaded text is displayed, the Chinese characters have turned into
question marks.

I've tried changing the column in the database to an image column and
using a byte array to fetch the data from the database. That didn't work,
so I changed it back.
I've added the following to my web.config:
-----
<globalization
requestEncoding="utf-8"
responseEncoding="utf-8"
fileEncoding="utf-8"
/>
-----
I've changed my ToXml method on my object to the following:
-----
// Define the desired encoding of the output
System.Text.Encoding encodingOfXmlOutput = System.Text.Encoding.UTF8;
// Create MemoryStream to receive our bytes
using (System.IO.MemoryStream memoryStream = new System.IO.MemoryStream())
{
    // Create XmlTextWriter using our memoryStream and encodingOfXmlOutput
    using (System.Xml.XmlTextWriter xmlWriter =
        new System.Xml.XmlTextWriter(memoryStream, encodingOfXmlOutput))
    {
        // Output should not be indented
        xmlWriter.Formatting = System.Xml.Formatting.None;
        // Write XML
        xmlWriter.WriteStartElement("Question");
        xmlWriter.WriteStartElement("QuestionText");
        xmlWriter.WriteCData(this.Text);
        xmlWriter.WriteEndElement(); // QuestionText
        xmlWriter.WriteEndElement(); // Question
        // Force all bytes into memoryStream
        xmlWriter.Flush();
        // Some encodings, like UTF-8, have a preamble (bytes that identify
        // the encoding). Having this preamble in our output would invalidate
        // it, so we skip it.
        int preambleLength = encodingOfXmlOutput.GetPreamble().Length;
        // Create buffer to receive the bytes after the preamble
        byte[] buffer = new byte[memoryStream.Length - preambleLength];
        // Position the cursor in memoryStream just after the preamble
        memoryStream.Position = preambleLength;
        // Fill data from the current position of memoryStream into buffer
        memoryStream.Read(buffer, 0, buffer.Length);
        // Return string of the created XML
        return encodingOfXmlOutput.GetString(buffer);
    }
}
-----
Still the same problem.

When I transform the xml to sql I use the following function:
-----
public static string Transform(XslCompiledTransform compiledTransform,
    IXPathNavigable document)
{
    if (compiledTransform == null)
        throw new ArgumentNullException("compiledTransform");
    using (StringWriter writer = new StringWriter())
    {
        string strResult = string.Empty;
        compiledTransform.Transform(document, null, writer);
        strResult = writer.ToString();
        return strResult;
    }
}
-----

The XSLT has the following encoding
-----
<?xml version="1.0" encoding="UTF-8"?>
-----

So my question is: where does my encoding go wrong? Why can't I save and
load Chinese characters correctly?

Any pointers would be greatly appreciated.

I don't know which other characters fail, but the Danish characters I
initially had problems with (æøå) work correctly. I would like my solution
to work with any character that UTF-8 can encode.
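
One possible explanation for this pattern, offered as an assumption rather than a diagnosis: if the text passes through a single-byte Western code page somewhere, characters such as æøå still have a slot while Chinese characters do not, and unmappable characters commonly come back as "?". The sketch below only simulates that effect with a simplified Latin-1 rule; `storeInLatin1Column` is a hypothetical name, and real database code pages differ in detail.

```javascript
// Simulation only: mimics text passing through a single-byte Latin-1
// code page. Real SQL Server code pages differ in detail.
function storeInLatin1Column(text) {
  var out = "";
  for (var ch of text) {
    // Code points above 0xFF have no Latin-1 byte, so they are replaced.
    out += ch.codePointAt(0) <= 0xFF ? ch : "?";
  }
  return out;
}

console.log(storeInLatin1Column("æøå"));  // æøå
console.log(storeInLatin1Column("中文")); // ??
```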

Kind Regards,
Allan Ebdrup

SimeonArgus · Mar 16, 2007

It sounds like the problem isn't with your application, but with your
database definition. Your web page is UTF-8, but is your database
table?

Assuming that your database *IS* set up to store UTF-8, is the query
tool you are using also Unicode-aware? It may be translating the extra
characters into ? between the database and your application.

It may be that your code is fine, and you should redirect your bug
search to the database level.

I know that's not a definitive answer, but I hope that helps.

--Sim

Jon Skeet [C# MVP] · Mar 16, 2007

See http://pobox.com/~skeet/csharp/debuggingunicode.html

Steven Cheng[MSFT] · Mar 19, 2007

Hi Allan,

Regarding this Unicode transfer issue, I think it is likely due to the
text conversion in the SQL Server database. I performed the following test
on my local test machine:

** use an ASP.NET aspx page to render a <textarea> and use client script
(with the xmlhttp component) to send the input in the <textarea> to the
server; the charset is utf-8, as in your case

** at the server side, I save the data posted via xmlhttp into a file (also
utf-8 encoded).

Based on my test, the Chinese characters are saved correctly. Therefore,
you can try checking the posted data at the server side: use the debugger
to break into the code and inspect the variable, or write it to a file for
checking. If the problem is caused by SQL Server database storage, we need
to do some further research on the database table.

Please feel free to post here if you have any other findings or questions.

Sincerely,

Steven Cheng

Microsoft MSDN Online Support Lead

This posting is provided "AS IS" with no warranties, and confers no rights.

Allan Ebdrup · Mar 19, 2007

Thanks for your feedback.
You might be right, but as I wrote I tried to save the data as a byte array
(image) in the database, and that didn't eliminate the problem.

When I execute my SQL command text, what encoding does SQL server expect in
the command text? Is there any way to specify what encoding to use in the
SQL command text?

Kind Regards,
Allan Ebdrup.

Jon Skeet [C# MVP] · Mar 19, 2007

Allan Ebdrup said:
When I execute my SQL command text, what encoding does SQL server expect in
the command text? Is there any way to specify what encoding to use in the
SQL command text?

That should all be handled for you in the driver. Just use strings,
which are already unicode.

Jon

Allan Ebdrup · Mar 19, 2007

OK, now I've tried to log the text in different places and found the
following.
The SQL executed against the database has the characters encoded correctly.
When I extract the data from the database, it has been converted to
question marks.

I've tried to change the column that holds the data from text to image,
but the bytes retrieved from the SQL Server are not the same as what I
passed in.

Is this because I write the value inline in the SQL command text? Is the
SQL command parsed in some way using some kind of encoding, even though the
column is specified to be an image column?
Would passing the value to the SQL Server as a parameter perhaps solve the
problem?

Kind Regards,
Allan Ebdrup.

Steven Cheng[MSFT] · Mar 20, 2007

Hello Allan,

Thanks for your reply.

Of course, for a .NET application, using the ADO.NET SqlCommand and
SqlParameter objects to supply any dynamic parameters should be the
preferred approach. For the column of your SQL Server 2000 database table
that will store the posted Chinese characters, is it defined as a Unicode
character type (such as nchar, nvarchar or ntext ...)? If the column
datatype is Unicode, it should be able to store the Chinese characters
correctly. Otherwise, you need to make sure the collation of the
table/column or database is set to a Chinese collation so that Chinese
characters can be stored in a non-Unicode encoded format.

In addition, SQL Server (7.0 or 2000) stores Unicode datatypes in the UCS-2
charset (no matter whether the data was originally encoded in UTF-8, UTF-16
or ...). Here is a good MSDN reference introducing the international
features in SQL Server 2000:

http://msdn2.microsoft.com/en-us/library/aa902644(SQL.80).aspx

For your current code, which directly uses inline SQL command text to
execute the insert query, I think you can try adding an 'N' prefix to each
string value. This prefix explicitly marks the value as Unicode characters.

#Why do some SQL strings have an 'N' prefix?
http://databases.aspfaq.com/general/why-do-some-sql-strings-have-an-n-prefix.html
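
To make the 'N' prefix concrete, here is a small sketch of how a value could be formatted as a T-SQL Unicode literal when the command text is built as a string. The helper `toUnicodeLiteral` and the table and column names are hypothetical, chosen only for illustration; parameters remain the preferred approach, since they avoid both the encoding question and SQL injection.

```javascript
// Hypothetical helper: formats a value as a T-SQL Unicode string literal.
function toUnicodeLiteral(value) {
  // Single quotes are doubled inside T-SQL string literals.
  return "N'" + value.replace(/'/g, "''") + "'";
}

// Table and column names below are made up for the example.
var sql = "INSERT INTO Question (QuestionText) VALUES (" +
          toUnicodeLiteral("it's 中文") + ")";
console.log(sql);
// INSERT INTO Question (QuestionText) VALUES (N'it''s 中文')
```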

If you have any further specific questions, please feel free to let me know.

Sincerely,

Steven Cheng

Microsoft MSDN Online Support Lead

Allan Ebdrup · Mar 20, 2007

I changed my code to use parameters, but I still had the same problem.
Then I also changed the database to use an image column (byte array), and
when I store the bytes to the database I use:

System.Text.Encoding.UTF8.GetBytes(value)

When I retrieve the data from the database I use:

System.Text.Encoding.UTF8.GetString(value)

It works! Now I can store UTF-8 strings in the database. Thank you for your
help.
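
For readers following along on the client side, the same round trip (text to UTF-8 bytes and back) can be sketched in JavaScript. This assumes Node.js, whose global `Buffer` is used here; the function names are hypothetical and this is not the server code from the thread.

```javascript
// Node.js sketch of the same round trip: text -> UTF-8 bytes -> text.
function toUtf8Bytes(text) {
  return Buffer.from(text, "utf8");
}

function fromUtf8Bytes(bytes) {
  return bytes.toString("utf8");
}

var stored = toUtf8Bytes("中文");
console.log(stored.length);                    // 6 (three bytes per character)
console.log(fromUtf8Bytes(stored) === "中文"); // true
```

The round trip is lossless because the bytes are never interpreted as text by anything in between, which is also why the stored value cannot be searched as text.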

Kind Regards,
Allan Ebdrup

Allan Ebdrup · Mar 20, 2007

Steven Cheng said:
In addition, SQL Server (7.0 or 2000) stores Unicode datatypes in the UCS-2
charset (no matter whether the data was originally encoded in UTF-8, UTF-16
or ...). Here is a good MSDN reference introducing the international
features in SQL Server 2000:

http://msdn2.microsoft.com/en-us/library/aa902644(SQL.80).aspx

The only problem with saving the text as binary data (image) is that I
can't search the text.
The data was originally saved in a "text" column; there is no "ntext", so
all you can do is set the collation.
I tried to set the collation to UCS-2 but couldn't find it in the list of
possible collations. What collation should I use to be able to use UCS-2?

I was thinking I could convert my UTF-8 encoded string to UCS-2 and save it
in the database.
Then I could convert from UCS-2 to UTF-8 when fetching the string from the
database.
How do I convert a string from UTF-8 to UCS-2 and back?

Kind Regards,
Allan Ebdrup

Allan Ebdrup · Mar 20, 2007

Allan Ebdrup said:
The data was originally saved in a "text" column; there is no "ntext", so
all you can do is set the collation.

Sorry, my mistake: there is an "ntext" type in MSSQL, and from what I've
read the collation is only used for defining a sort order when you use an
ntext column. Is this correct?

Kind Regards,
Allan Ebdrup

Allan Ebdrup · Mar 20, 2007

Allan Ebdrup said:
Sorry, my mistake: there is an "ntext" type in MSSQL, and from what I've
read the collation is only used for defining a sort order when you use an
ntext column. Is this correct?

I tried with an "ntext" column and that works for saving UTF-8 characters
as far as I've tested. But I should really convert from UTF-8 to UCS-2
before saving, and back when loading, right?
How do I convert between UTF-8 and UCS-2? I've searched the net for a
solution, but so far I've come up empty.

Kind Regards,
Allan Ebdrup

Jon Skeet [C# MVP] · Mar 20, 2007

Allan Ebdrup said:
I tried with an "ntext" column and that works for saving UTF-8 characters
as far as I've tested. But I should really convert from UTF-8 to UCS-2
before saving, and back when loading, right?

No - the driver will do that for you.

How do I convert between UTF-8 and UCS-2? I've searched the net for a
solution, but so far I've come up empty.

You don't have to. The storage format is irrelevant - it's the driver's
job to take the unicode text you give it (and when you're passing it as
a string, it doesn't matter what generated that string; it's just
unicode data) and store it accurately.

Steven Cheng[MSFT] · Mar 21, 2007

Thanks for your reply Allan,

Yes, if you manually do the text-to-binary encoding/decoding in .NET code
and store the raw binary stream in SQL Server, it will certainly work. I'm
still interested in why your original text-type column did not work. What
is the datatype of that column, nchar or nvarchar?

Anyway, if you run into further problems later, or are still interested in
making the text column work, you are welcome to continue the discussion
here.

Sincerely,

Steven Cheng

Microsoft MSDN Online Support Lead

Allan Ebdrup · Mar 21, 2007

Jon Skeet said:
No - the driver will do that for you.

Is this done by detecting the UTF-8 preamble? And then the driver converts
to UCS-2? And if so, how come the result is still in UTF-8 when I retrieve
the data again?
Or is there no conversion?
Why is it important that MSSQL only supports UCS-2 Unicode if everything
works fine with UTF-8?
I can see that everything works fine when storing a UTF-8 string in an
ntext column, and when I query the data in Query Analyzer the string is
displayed correctly in the result set. How can this be if MSSQL only
supports UCS-2 encoding?
How is the string stored? In UTF-8 or UCS-2?

Sorry for all the questions, I'm just trying to understand what's going on.

Kind Regards,
Allan Ebdrup

Steven Cheng[MSFT] · Mar 21, 2007

Thanks for your reply Allan, and thanks also for Jon's input.

The knowledge needed to explain the problem here is specific to
string/text charsets and encodings.

Let me try to answer the questions:

#First, UTF-8, UTF-16 and UCS-2 are all encoding schemes for the Unicode
charset. In other words, a Unicode character string can be encoded into a
binary stream using any of them.

Is this done by detecting the UTF-8 preamble? And then the driver converts
to UCS-2? And if so, how come the result is still in UTF-8 when I retrieve
the data again?
Or is there no conversion?
======================
First, UTF-8 is the encoding that your web page and the client browser use
to transfer the Unicode characters. UTF-8 is a variable-length encoding
scheme (one to four bytes per character), so it gives a smaller size and
better performance when the transferred data consists mostly of ASCII
characters (UCS-2 always uses two bytes per character, and UTF-16 uses two
bytes for every character in the Basic Multilingual Plane). By the time
your .NET code has received the string, it has already been converted to
UTF-16, because .NET always uses UTF-16 to represent characters in memory.
When you then use ADO.NET to submit string data to the SQL Server database,
it is sent as Unicode text.

At the SQL Server side, it simply receives the Unicode characters from the
client and stores them in the target table column. Here a problem may occur
depending on the column's type. If it is a Unicode type (e.g. nvarchar,
nchar, ntext ...), SQL Server can store them correctly (persisted in UCS-2
encoding). If the column is not a Unicode character type, SQL Server uses
the column/table/database's current collation (code page) to encode the
Unicode characters into a binary stream, and characters that the code page
cannot represent are replaced with question marks.

Why is it important that MSSQL only supports UCS-2 Unicode if everything
works fine with UTF-8?
I can see that everything works fine when storing a UTF-8 string in an
ntext column, and when I query the data in Query Analyzer the string is
displayed correctly in the result set. How can this be if MSSQL only
supports UCS-2 encoding?
How is the string stored? In UTF-8 or UCS-2?
===========================
UTF-8 is compact if the data mainly contains ASCII characters, but it is
less convenient to process than UCS-2 because it uses a different number of
bytes for different characters, while UCS-2 always uses two bytes per
character (as does UTF-16 for everything in the Basic Multilingual Plane).
SQL Server 2000 is an old product, and UCS-2 was preferred at that time.
Anyway, there is no real right or wrong about which encoding SQL Server
chose here.

Therefore, any Unicode text column is persisted in UCS-2 encoding (in
memory or in the data file). When a client application queries the data, it
can be correctly retrieved and processed as long as the client application
supports Unicode. For example, the .NET Framework handles Unicode
characters correctly and stores them in memory as UTF-16.
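
The size trade-off described above can be checked with a short sketch. It assumes Node.js (for the global `Buffer`), and `byteSizes` is a hypothetical helper name; it simply reports how many bytes the same text occupies in UTF-8 and in UTF-16.

```javascript
// Reports the encoded size of a string in UTF-8 and in UTF-16 (little endian).
function byteSizes(text) {
  return {
    utf8: Buffer.byteLength(text, "utf8"),
    utf16: Buffer.byteLength(text, "utf16le")
  };
}

console.log(byteSizes("hello")); // { utf8: 5, utf16: 10 }
console.log(byteSizes("æ"));     // { utf8: 2, utf16: 2 }
console.log(byteSizes("中"));    // { utf8: 3, utf16: 2 }
```

ASCII-heavy text is smaller in UTF-8, while Chinese text is smaller in a two-byte encoding.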

In addition, here is a good reference on the MS globaldev site introducing
charsets and encodings:

#Globalization Step-by-Step
http://www.microsoft.com/globaldev/getWR/steps/wrg_codepage.mspx

I'm not sure whether I've missed anything from your previous reply. If you
have any further specific questions on this, please feel free to post here.

Sincerely,

Steven Cheng

Microsoft MSDN Online Support Lead

Jon Skeet [C# MVP] · Mar 21, 2007

Is this done by detecting the UTF-8 preamble? And then the driver converts
to UCS-2? And if so, how come the result is still in UTF-8 when I retrieve
the data again?

It's done by passing in strings as the parameters. At that stage
there's no encoding involved (well, sort of - all .NET strings are
actually UTF-16, which is very similar to UCS-2, but you can
effectively ignore that). In particular, it is meaningless to say that
a string (as a System.String) is "UTF-8 encoded".
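
The same point can be illustrated outside .NET. In this sketch (Node.js assumed, variable names hypothetical), two different byte sequences decode to one and the same string; the encoding belongs to the bytes at the boundary, not to the string that results.

```javascript
// The character "中" (U+4E2D) as bytes in two different encodings.
var utf8Bytes = Buffer.from([0xE4, 0xB8, 0xAD]); // UTF-8: three bytes
var utf16Bytes = Buffer.from([0x2D, 0x4E]);      // UTF-16LE: two bytes

// Decoding happens once, when bytes become a string.
var fromUtf8 = utf8Bytes.toString("utf8");
var fromUtf16 = utf16Bytes.toString("utf16le");

console.log(fromUtf8 === fromUtf16); // true
```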

Why is it important that MSSQL only supports UCS-2 unicode if everything
works fine with UTF-8?

That's the internal storage format, that's all. (It may mean you can't
support characters not in the Basic Multilingual Plane, but it's not
worth worrying about that at this stage.)
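
For the curious, the Basic Multilingual Plane caveat can be seen in any UTF-16 based string type. The sketch below uses JavaScript strings, which are also sequences of UTF-16 code units; the helper names are hypothetical. A character outside the BMP needs two code units (a surrogate pair), which a strictly UCS-2 system would treat as two separate units.

```javascript
// Counts UTF-16 code units (what .length reports).
function utf16Units(text) {
  return text.length;
}

// Counts Unicode code points (string iteration keeps surrogate pairs together).
function codePoints(text) {
  return Array.from(text).length;
}

var bmpChar = "中";                  // U+4E2D, inside the BMP
var supplementaryChar = "\u{20000}"; // U+20000, outside the BMP

console.log(utf16Units(bmpChar), codePoints(bmpChar));                     // 1 1
console.log(utf16Units(supplementaryChar), codePoints(supplementaryChar)); // 2 1
```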

I can see that everything works fine when storing a UTF-8 string in an
ntext column, and when I query the data in Query Analyzer the string is
displayed correctly in the result set. How can this be if MSSQL only
supports UCS-2 encoding?
How is the string stored? In UTF-8 or UCS-2?

UCS-2 in the database. When you fetch it from the database, the driver
will convert it into a .NET string for you.

The main point is that so long as you use parameters/queries which
just use strings, you shouldn't need to worry about it at all in your
code: so long as you start with the right string (wherever it came
from) it should be stored correctly.

Sorry for all the questions, I'm just trying to understand what's going on.

Have you read http://pobox.com/~skeet/csharp/unicode.html ? It may
help.

Jon

Allan Ebdrup · Mar 21, 2007

Jon Skeet said:
It's done by passing in strings as the parameters. At that stage
there's no encoding involved (well, sort of - all .NET strings are
actually UTF-16, which is very similar to UCS-2, but you can
effectively ignore that). In particular, it is meaningless to say that
a string (as a System.String) is "UTF-8 encoded".

I don't get why you can't say a System.String is UTF-8 encoded, if the
bytes in the string have to be read with a UTF-8 encoding to make sense.
Granted, you would like the string to be UTF-16, but the bytes in the
string have to be read with a UTF-8 encoding to get the right meaning.

Where does my UTF-8 encoded XML get translated to UTF-16? I've got the
contents in a CDATA section and I fetch it using:
---
string text = xmlDocument.SelectSingleNode("QuestionText/text()").Value;
---
So does fetching the CDATA section's value like this actually translate
from UTF-8 encoding to UTF-16, because all strings in .NET are UTF-16? (I
thought it just byte-copied the contents of the CDATA section, so that I
would get a UTF-8 encoded string in the "text" variable.)
And because the string is translated to UTF-16, it gets stored correctly in
the database when I pass it as a parameter to the database?
OR
The string that is the XML gets passed to my web service as a parameter
that I load into an XML document. As mentioned earlier, I set my website to
use UTF-8 when sending/receiving data. Is the string actually transformed
from UTF-8 to UTF-16 (because all strings are UTF-16 in .NET) before it is
passed to my web method?

I mean, I have a UTF-8 encoded string when I send it on the wire, and
somewhere it gets translated to a UTF-16 string that is stored in the
database as UCS-2 (UTF-16 to UCS-2 and back is handled by the driver, as I
understand it).
The bytes of my original UTF-8 string are not the same as the bytes stored
in the UCS-2 string in the database, so they have to get translated
somewhere. I'm just trying to understand where this translation occurs.

I know I could just drop it because it works, but I'm too curious about
what actually happens.

Kind Regards,
Allan Ebdrup.

Allan Ebdrup · Mar 21, 2007

Allan Ebdrup said:
The string that is the XML gets passed to my web service as a parameter
that I load into an XML document. As mentioned earlier, I set my website to
use UTF-8 when sending/receiving data. Is the string actually transformed
from UTF-8 to UTF-16 (because all strings are UTF-16 in .NET) before it is
passed to my web method?

This would make sense, because the XML I load has no definition of its
encoding, so I guess it defaults to treating the string as UTF-16 when I
load the XML into an XmlDocument?

But then I don't get why sending my XML back to the client works. I use the
code described in an earlier post, where I UTF-8 encode the XML string and
send it. Why doesn't my web service assume the string is UTF-16 and
translate it to UTF-8 before sending it?
The bytes in the string of XML I return from my web service are UTF-8
encoded.

Still puzzled...

Kind Regards,
Allan Ebdrup
