PC Review



Byte size of characters when encoding

 
 
Vladimir
 
      9th Jul 2004

Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.

But why is that?

Look:

/*
Each Unicode character in a string is defined by a Unicode scalar value,
also called ...

An index is the position of a Char, not a Unicode character, in a String. An
index is a zero-based, nonnegative number starting from the first position
in the string, which is index position zero. Consecutive index values might
not correspond to consecutive Unicode characters because a Unicode character
might be encoded as more than one Char. To work with each Unicode character
instead of each Char, use the System.Globalization.StringInfo class.
*/
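A minimal sketch of the distinction that documentation draws, assuming a string that contains one character outside the 16-bit range. The class and method names are made up for illustration; `StringInfo.LengthInTextElements` counts text elements, which is close to, but not exactly the same as, counting Unicode characters (a base character plus combining marks is also one text element).

```csharp
using System;
using System.Globalization;

static class TextElementDemo
{
    // Number of UTF-16 code units (struct Char instances) in the string.
    public static int CharCount(string s)
    {
        return s.Length;
    }

    // Number of text elements; a surrogate pair counts as one element.
    public static int TextElementCount(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }

    static void Main()
    {
        // 'a', U+1D11E (needs a surrogate pair, so two Chars), 'b'
        string s = "a\U0001D11Eb";
        Console.WriteLine(CharCount(s));        // 4
        Console.WriteLine(TextElementCount(s)); // 3
    }
}
```

GetMaxByteCount takes the first of these two counts, which is why the index and Length of a String are the relevant numbers for the question here.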

With UTF-8 encoding, can one instance of struct Char occupy only 1/2, 1,
1 1/2, or 2 bytes?
Is that right?
If so, UTF8Encoding.GetMaxByteCount(charCount) should return charCount * 2,
because charCount means the number of instances of struct Char.
Or not? Maybe it means the number of Unicode characters?
In that case, UnicodeEncoding.GetMaxByteCount(charCount) should return
charCount * 4.

These methods are not consistent with each other.


 
mikeb
      9th Jul 2004
Vladimir wrote:

> Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
> Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.
>
> But why that?


Strings in .NET are already Unicode encoded. So if you encode the
string to an array of bytes, you get 2 bytes per character.

However, for UTF8 encoding a single Unicode character can be encoded
using up to 4 bytes in the worst case. charCount*4 is just a worst case
scenario if the string happened to contain only characters that required
4 byte encoding.



--
mikeb
 
Vladimir
      9th Jul 2004
> Strings in .NET are already Unicode encoded. So if you encode the
> string to an array of bytes, you get 2 bytes per character.
>
> However, for UTF8 encoding a single Unicode character can be encoded
> using up to 4 bytes in the worst case. charCount*4 is just a worst case
> scenario if the string happened to contain only characters that required
> 4 byte encoding.


Do you want to say that two instances of struct Char in UTF-8 can occupy 8
bytes?


 
mikeb
      10th Jul 2004
Vladimir wrote:
> Do you want to say that two instances of struct Char in UTF-8 can occupy 8
> bytes?
>


It turns out that while a UTF8 character can take up to 4 bytes to be
encoded, for the Framework, a struct Char can always be encoded in at
most 3 bytes. That's because the struct char holds a 16-bit Unicode
value, and that can always be encoded in 3 or fewer bytes.

A 4-byte UTF8 encoding is only needed for Unicode code points that
require 'surrogates' - that is, a pair of 16-bit values to represent the
character. A surrogate pair cannot be held in a single struct Char -
but I believe pairs are supported in strings.

Anyway, here's what can happen using struct Char:

char c1 = '\uFFFF';
char c2 = '\u1000';

// GetBytes is an instance method, so call it on Encoding.UTF8 (System.Text).
byte[] utf8bytes = Encoding.UTF8.GetBytes(new char[] { c1, c2 });

If you dump the byte array, you'll see that each Char was encoded into 3
UTF8 bytes.
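One way to dump the bytes, as a small self-contained sketch. The `DumpUtf8` class and its `Hex` helper are illustrative names, not part of the Framework; `BitConverter.ToString` is used only to print the array as hex.

```csharp
using System;
using System.Text;

static class DumpUtf8
{
    // Encodes the chars as UTF-8 and returns the bytes as hex, e.g. "EF-BF-BF".
    public static string Hex(params char[] chars)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(chars));
    }

    static void Main()
    {
        Console.WriteLine(Hex('\uFFFF'));           // EF-BF-BF
        Console.WriteLine(Hex('\u1000'));           // E1-80-80
        Console.WriteLine(Hex('\uFFFF', '\u1000')); // EF-BF-BF-E1-80-80
    }
}
```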

Jon Skeet has written an excellent article on this type of issue:

http://www.yoda.arachsys.com/csharp/unicode.html

--
mikeb
 
Vladimir
      10th Jul 2004

It makes me crazy.
I don't understand.

Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.

If charCount counts 32-bit Unicode characters:
UnicodeEncoding.GetMaxByteCount(charCount) should return charCount * 4.
UTF8Encoding.GetMaxByteCount(charCount) should return charCount * 4.

If charCount counts 16-bit Unicode characters (Char structures):
UnicodeEncoding.GetMaxByteCount(charCount) should return charCount * 2.
UTF8Encoding.GetMaxByteCount(charCount) should return charCount * 3.

Suppose we have a string of length 5 (the length of a string means the number
of instances of struct Char).
UTF8Encoding.GetMaxByteCount(stringInstance.Length) should then return 15.
But it doesn't.

Also, maybe each surrogate pair (two 16-bit characters) in a string occupies
only 4 bytes in UTF-8?
Yes or no?

Look:

/*
UTF-16 encodes each 16-bit character as 2 bytes. It doesn't affect the
characters at all, and no compression occurs - its performance is excellent.
UTF-16 encoding is also referred to as Unicode encoding.

UTF-8 encodes some characters as 1 byte, some characters as 2 bytes, some
characters as 3 bytes, and some characters as 4 bytes. Characters with a value
below 0x0080 are compressed to 1 byte, which works very well for characters
used in the United States. Characters between 0x0080 and 0x07FF are converted
to 2 bytes, which works well for European and Middle Eastern languages.
Characters of 0x0800 and above are converted to 3 bytes, which works well for
East Asian languages. Finally, surrogate character pairs are written out as
4 bytes. UTF-8 is an extremely popular encoding, but it's less useful than
UTF-16 if you encode many characters with values of 0x0800 or above.
*/

Does it mean that each pair of characters in UTF-16 can't occupy more
than 4 bytes in UTF-8?

Wait a minute.
It seems that I understand something.

UTF-16 characters below 0x0800 occupy 2 bytes or fewer in UTF-8
(in UTF-16 they always occupy 2 bytes).
UTF-16 characters of 0x0800 and above occupy 3 bytes in UTF-8
(in UTF-16 they always occupy 2 bytes).
A UTF-16 surrogate pair occupies 4 bytes in UTF-8
(in UTF-16 it always occupies 4 bytes).

Right?

But then I think UTF8Encoding.GetMaxByteCount(charCount) should
return charCount * 3.
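The ranges above can be checked with a short sketch. The sample characters and the `ByteSizes` helper names are picked for illustration only; `Encoding.Unicode` is the Framework's UTF-16 encoding.

```csharp
using System;
using System.Text;

static class ByteSizes
{
    public static int Utf16Bytes(string s) { return Encoding.Unicode.GetByteCount(s); }
    public static int Utf8Bytes(string s)  { return Encoding.UTF8.GetByteCount(s); }

    static void Main()
    {
        string[] samples =
        {
            "A",          // U+0041, below 0x0080
            "\u00E9",     // U+00E9, between 0x0080 and 0x07FF
            "\u20AC",     // U+20AC, 0x0800 or above
            "\U0001D11E"  // U+1D11E, stored as a surrogate pair (two Chars)
        };

        foreach (string s in samples)
        {
            Console.WriteLine("chars={0} utf16={1} utf8={2}",
                s.Length, Utf16Bytes(s), Utf8Bytes(s));
        }
        // chars=1 utf16=2 utf8=1
        // chars=1 utf16=2 utf8=2
        // chars=1 utf16=2 utf8=3
        // chars=2 utf16=4 utf8=4
    }
}
```

In these samples the worst ratio is 3 UTF-8 bytes per Char; the surrogate pair costs 4 bytes but uses up two Chars.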


 
Jon Skeet [C# MVP]
      11th Jul 2004
Vladimir wrote:
> It makes me crazy.
> I don't understand.


I think it's just a bug. UnicodeEncoding is doing the right thing, but
UTF8Encoding should return charCount*3, not charCount*4.

--
Jon Skeet
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
 
Vladimir
      11th Jul 2004
> I think it's just a bug. UnicodeEncoding is doing the right thing, but
> UTF8Encoding should return charCount*3, not charCount*4.
>


How can we submit a bug report?
And...

I've found that new BitArray(int length) throws an overflow exception when
length is in the range from int.MaxValue - 30 to int.MaxValue.
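That range is what you would expect if the size of the backing array were computed by rounding up with "+ 31". This is only a guess at the cause, not a statement about how BitArray is actually implemented; both helpers below are hypothetical and just illustrate the arithmetic.

```csharp
using System;

static class BitArraySizing
{
    // Hypothetical rounding: "+ 31" overflows once length > int.MaxValue - 31.
    public static int NaiveWords(int length)
    {
        return checked((length + 31) / 32);
    }

    // Rounding up without the intermediate overflow (for length > 0).
    public static int SafeWords(int length)
    {
        return length > 0 ? (length - 1) / 32 + 1 : 0;
    }

    static void Main()
    {
        Console.WriteLine(NaiveWords(int.MaxValue - 31)); // 67108863
        Console.WriteLine(SafeWords(int.MaxValue));       // 67108864
        try
        {
            NaiveWords(int.MaxValue - 30);
        }
        catch (OverflowException)
        {
            Console.WriteLine("overflow");                // overflow
        }
    }
}
```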


 
Jerry Pisk
      11th Jul 2004
You're right, it is a bug, but the correct answer is not what you think it
is. In UTF-8 a character can be up to 6 bytes, see
http://www.ietf.org/rfc/rfc2279.txt, chapter 2. As for the framework's
internal representation - it uses UCS-2, where each character is expressed
as 2 bytes with the exception of characters larger than 0xFFFF which are
expressed as a sequence of two characters, called a surrogate pair. So each
character in UCS-2 takes up two bytes but some Unicode characters have to be
expressed in pairs.
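A sketch of the surrogate arithmetic behind that description, using the standard UTF-16 rules (the scheme with surrogate pairs is usually called UTF-16 rather than UCS-2). The `SurrogateMath` class is a made-up name for illustration. Note also that RFC 3629, which replaced RFC 2279, limits UTF-8 to code points up to U+10FFFF and therefore to at most 4 bytes per character.

```csharp
using System;

static class SurrogateMath
{
    // Combines a high and a low surrogate into the code point they encode.
    public static int ToCodePoint(char high, char low)
    {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    static void Main()
    {
        string s = "\U0001D11E"; // one code point above 0xFFFF
        Console.WriteLine(s.Length);                              // 2
        Console.WriteLine("{0:X4} {1:X4}", (int)s[0], (int)s[1]); // D834 DD1E
        Console.WriteLine("{0:X}", ToCodePoint(s[0], s[1]));      // 1D11E
    }
}
```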

Jerry



 
Jon Skeet [C# MVP]
      11th Jul 2004
Jerry Pisk wrote:
> You're right, it is a bug, but the correct answer is not what you think it
> is.


I think that depends on how you read the documentation.

> In UTF-8 a character can be up to 6 bytes, see
> http://www.ietf.org/rfc/rfc2279.txt, chapter 2. As for the framework's
> internal representation - it uses UCS-2, where each character is expressed
> as 2 bytes with the exception of characters larger than 0xFFFF which are
> expressed as a sequence of two characters, called a surrogate pair. So each
> character in UCS-2 takes up two bytes but some Unicode characters have to be
> expressed in pairs.


That's exactly what I thought. I believe GetMaxByteCount is meant to
return the maximum number of bytes for a sequence of 16-bit characters
though, where 2 characters forming a surrogate pair counts as 2
characters in the input. That way the maximum number of bytes required
to encode a string, for instance, is GetMaxByteCount(theString.Length).
Given that pretty much the whole of the framework works on the
assumption that a character is 16 bits and that surrogate pairs *are*
two characters, this seems more useful. It would be better if it were
more explicitly documented either way, however.
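As a rough sketch of that usage: size a buffer with GetMaxByteCount(theString.Length), encode into it, and compare with the exact count from GetByteCount. The `BufferSizing` class and `EncodeWithMaxBuffer` method are illustrative names; the value GetMaxByteCount returns differs between Framework versions, so the example only checks that it is an upper bound.

```csharp
using System;
using System.Text;

static class BufferSizing
{
    // Encodes s into a buffer sized by GetMaxByteCount and returns the
    // number of bytes actually written; capacity receives the buffer size.
    public static int EncodeWithMaxBuffer(string s, out int capacity)
    {
        Encoding utf8 = Encoding.UTF8;
        capacity = utf8.GetMaxByteCount(s.Length);
        byte[] buffer = new byte[capacity];
        return utf8.GetBytes(s, 0, s.Length, buffer, 0);
    }

    static void Main()
    {
        string s = "A\u20AC\U0001D11E"; // 4 Chars: 1 + 1 + a surrogate pair
        int capacity;
        int written = EncodeWithMaxBuffer(s, out capacity);
        Console.WriteLine(written);             // 8 (1 + 3 + 4 bytes)
        Console.WriteLine(written <= capacity); // True
        Console.WriteLine(Encoding.UTF8.GetByteCount(s) == written); // True
    }
}
```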

Jon Skeet [C# MVP]
      11th Jul 2004
Vladimir wrote:
> > I think it's just a bug. UnicodeEncoding is doing the right thing, but
> > UTF8Encoding should return charCount*3, not charCount*4.

>
> How can we submit a bug report?


I don't know the best way of submitting bugs for 1.1. I'll try to
remember to submit it as a Whidbey bug if I get the time to test it.
(Unfortunately time is something I'm short of at the moment.)

> And...
>
> I've found that new BitArray(int length) throws an overflow exception when
> length is in the range from int.MaxValue - 30 to int.MaxValue.


I'm not entirely surprised, but it should at least be documented I
guess.
