Surrogate Character handling with text input field.

Itsuo

Hi,

IE counts the length of input in UTF-16 code units,
regardless of the page encoding. As a result, I am seeing
unexpected results with surrogate characters (a.k.a.
supplementary characters).

The issue is that IE counts one surrogate pair as 2 code
units, so users cannot enter as many characters as the
maxlength attribute specifies.

For example, if an input field is defined as below, users
are supposed to be able to enter 10 characters (code
points) on a page whose encoding is UTF-8.

<input type="text" name="T1" size="20" maxlength="10">

However, when users enter surrogate characters, they can
only enter 5 characters, since IE counts UTF-16 code units
(two per surrogate pair) instead of characters in UTF-8.

I think this is a bug in IE.
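
The counting difference described in this post can be reproduced outside the browser. The following is a minimal Python sketch, not IE's actual implementation; the helper name `length_report` and the sample character U+20000 are assumptions chosen only for illustration.

```python
def length_report(text: str) -> dict:
    """Return the length of `text` measured three different ways."""
    return {
        # Python strings are sequences of code points.
        "code_points": len(text),
        # Each UTF-16 code unit is 2 bytes; a surrogate pair is 2 units.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # UTF-8 code units are single bytes.
        "utf8_bytes": len(text.encode("utf-8")),
    }


# U+20000 is a supplementary CJK ideograph (outside the BMP).
sample = "\U00020000" * 5
print(length_report(sample))
```

Five such characters already occupy 10 UTF-16 code units, which matches the point at which a field with maxlength="10" stops accepting input in the behavior reported here.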
 
Seaver

Dear Itsuo,

Thank you for your post.

As I understand it, you are concerned about how IE handles surrogate
characters.

If I have misunderstood your concern, please don't hesitate to let me know.

Surrogate values are Unicode code positions that are used in pairs (a high
surrogate followed by a low surrogate) to specify additional characters.
IE6 on Windows already has some support for those characters. The Unicode
Consortium has not yet standardized content for these areas; it has,
however, accepted in principle new repertoires for the surrogate area.

It is very important to distinguish surrogate code points (in the range
U+D800..U+DFFF) from supplementary code points (in the completely different
range, U+10000..U+10FFFF). Surrogate code points are reserved for use, *in
pairs*, in representing supplementary code points in UTF-16. There are
supplementary characters (i.e. encoded characters represented with a single
supplementary code point), but there are not and will never be *surrogate
characters* (i.e. encoded characters represented with a single surrogate
code point).
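
The relationship between the two ranges can be made concrete with the standard UTF-16 arithmetic. This is a small Python sketch; the function names `to_surrogate_pair` and `is_surrogate` are hypothetical helpers introduced for illustration.

```python
from typing import Tuple


def is_surrogate(code_point: int) -> bool:
    """True for code points reserved for UTF-16 surrogates."""
    return 0xD800 <= code_point <= 0xDFFF


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x20000)
print(hex(high), hex(low))
```

Both halves of the pair fall inside U+D800..U+DFFF, while the character they represent lies in U+10000..U+10FFFF.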

Surrogate pairs allow you to get beyond the 65,536-character limit of a
single 16-bit code unit. This mechanism extends the number of characters
that can be defined, but each such character occupies two UTF-16 code
units, and that's why only 5 of them can be entered in a field with
maxlength="10".

Hope this helps!

Sincerely,

Seaver Ren

Product Support Services
Microsoft Corporation

This posting is provided "AS IS" with no warranties, and confers no rights
Get Secure! - www.microsoft.com/security
 
Itsuo

Sorry for the confusion in my original post. As you
guessed, I am concerned about surrogate characters, not
supplementary code points.

But my point is that IE should count characters based on
the page encoding, which is UTF-8, rather than UTF-16,
which would be used internally in IE. As you probably
know, a surrogate character is encoded in UTF-8 as a
single four-byte sequence for one code point, whereas
UTF-16 needs a pair of code units for it.

So, our expectation is that IE should allow 10 surrogate
characters in my example as long as the page encoding is
UTF-8.

Regards,

Itsuo Okamoto
 
Seaver

Dear Itsuo,

UTF-16 is the primary Unicode encoding used by Microsoft Windows 2000. Even
before Unicode 2.0 was released, it became clear that the goal of Unicode
(to support a single code point for every character in every language)
could not be achieved using only 65,536 characters. Some languages, such as
Chinese, require that many characters to encode just the rarely used
characters. Thus, support was added for a surrogate range to handle an
additional 1,048,576 characters. UTF-16 is the encoding that fully supports
this extension to the original standard.

UTF-16 keeps the 2-byte code unit of UCS-2; however, in UTF-16 certain
code units (high surrogates) are followed by a second code unit (a low
surrogate), and the pair together defines one character. Like UCS-2,
UTF-16 is stored in a Little Endian manner, as is everything on Windows,
by default.

As for UTF-8, many ASCII and other byte-oriented systems that require 8-bit
encodings (such as mail servers) must span a vast array of computers that
use different encodings, different byte orders, and different languages.
UTF-8 is an encoding scheme that is designed to treat Unicode data in a way
that is independent of the byte ordering on the computer.
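
The byte-order point can be illustrated by encoding one supplementary character several ways. This is a short Python sketch using the standard codecs; the helper name `encodings_hex` is an assumption made for this example.

```python
def encodings_hex(ch: str) -> dict:
    """Return the hex byte sequence of `ch` in three encodings."""
    return {
        # Little-endian: low byte of each 16-bit code unit comes first.
        "utf-16-le": ch.encode("utf-16-le").hex(),
        # Big-endian: high byte of each 16-bit code unit comes first.
        "utf-16-be": ch.encode("utf-16-be").hex(),
        # UTF-8 is a byte sequence, so there is no byte order to choose.
        "utf-8": ch.encode("utf-8").hex(),
    }


print(encodings_hex("\U00020000"))
```

The two UTF-16 forms carry the same surrogate pair (D840 DC00) with the bytes of each code unit swapped, while the UTF-8 form is identical on every platform.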

That's why UTF-16 is rarely supported in IE's HTML.

Regards,

Seaver
 
Itsuo

Thanks for your explanation.

I think I understand why IE supports UTF-16 internally.
But the point is that we want to check text length in
UTF-8 when we set the page encoding to UTF-8.

In our three-tier application, we let customers choose
the page encoding of the HTML interface and guide them to
set it to the same character set as the database. That
way they do not lose any data between the tiers, and the
length check prevents them from entering data that cannot
be stored in the database.

But since IE counts surrogate characters in UTF-16 code
units, it does not allow users to enter data that could
be stored in the database.
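
A length check that follows the database's unit rather than the browser's could look roughly like the following. This is only a Python sketch of the idea; the function `fits_column`, its `unit` parameter, and the limit of 10 are hypothetical and do not describe any particular database or IE behavior.

```python
def fits_column(text: str, limit: int, unit: str = "code_points") -> bool:
    """Check whether `text` fits a column limit expressed in `unit`."""
    if unit == "code_points":
        length = len(text)
    elif unit == "utf16_code_units":
        length = len(text.encode("utf-16-le")) // 2
    elif unit == "utf8_bytes":
        length = len(text.encode("utf-8"))
    else:
        raise ValueError("unknown unit: " + unit)
    return length <= limit


ten_chars = "\U00020000" * 10
print(fits_column(ten_chars, 10))                      # counted as characters
print(fits_column(ten_chars, 10, "utf16_code_units"))  # counted as UTF-16 units
```

The same ten characters pass when the limit is counted in code points and fail when it is counted in UTF-16 code units, which is the mismatch between tiers described in this post.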

So, I am wondering whether MS can fix this problem.
Also, if there is a way for us to escalate this issue, I
would like to know it.

Regards,

Itsuo
 
Seaver

Dear Itsuo,

Per your request, you may report this reproducible symptom to our IE
Developer support team by calling (800) 936-5800.

In the meantime, please rest assured that I've routed this issue to the
appropriate Microsoft channels. We have stringent quality goals and
metrics that a product must meet before its release, and we strive to
capture any and all product feedback so as to ensure that we are
continuously developing Microsoft products to meet customer needs. This is
exactly why feedback such as yours is always taken very seriously.

Regards,

Seaver
 
Itsuo

Thanks for routing our request. I appreciate your
response on this issue.

In case there is any update from the appropriate
channel, please kindly let us know.

Regards,

Itsuo
 
