Why use an encoding other than UTF-8 when it supports almost every language?


Tony Johansson

Hi!

This Unicode UTF-8 can use up to 24 bits for encoding. UTF-8 supports almost
all languages, so what is the reason to use a Unicode encoding other than
UTF-8?

//Tony
 

Tony Johansson

Tony Johansson said:
Hi!

This Unicode UTF-8 can use up to 24 bits for encoding. UTF-8 supports almost
all languages, so what is the reason to use a Unicode encoding other than
UTF-8?

//Tony

I must correct myself: UTF-8 can use up to 48 bits.

//Tony
 

Maate

I must correct myself: UTF-8 can use up to 48 bits.

//Tony

Hey, I'm not sure, but I would guess that UTF-8 is slightly more
expensive to parse than other Unicode encodings. For example, when
reading UTF-16 encoded text the parser would know that it has to read
exactly two bytes per character. With UTF-8, on the other hand, the
number of bytes to read per character depends on information stored in
individual bits. Consider a simple example: this C# code,
"my test string".Substring(5, 1), is cheap to evaluate in UTF-16, but
with UTF-8 the parser would have to walk the characters from the
beginning to determine which bytes actually represent character number
5, perhaps making it at least five times as expensive. This probably
also explains why, for example, the .NET CLR stores text as UTF-16
internally: it presumably makes manipulating and searching text easier
and faster.
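That indexing cost is easy to demonstrate. Here is a small sketch (in Python rather than C#, purely to make the byte handling explicit; the helper name is made up) that finds character N in UTF-8 bytes by scanning from the start:

```python
def utf8_char_at(data: bytes, index: int) -> str:
    """Return character `index` by walking UTF-8 sequences from the start."""
    i = 0  # current byte offset
    for _ in range(index + 1):
        start = i
        b = data[i]
        # The leading byte encodes the length of the whole sequence.
        if b < 0x80:
            i += 1  # 1-byte sequence (ASCII)
        elif b < 0xE0:
            i += 2  # 2-byte sequence
        elif b < 0xF0:
            i += 3  # 3-byte sequence
        else:
            i += 4  # 4-byte sequence
    return data[start:i].decode("utf-8")

# The equivalent of "my test string".Substring(5, 1) in C#:
print(utf8_char_at("my test string".encode("utf-8"), 5))  # "s"
```

A UTF-16 reader could instead jump straight to byte offset 2 * 5, as long as no surrogate pairs occur earlier in the string.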

Anyway, just some thoughts :)

Br. Morten
 

Konrad Neitzel

Hi all!

Maate said:
Hey, I'm not sure, but I would guess that UTF-8 is slightly more
expensive to parse than other unicode encodings.
Why is that? UTF-16 is also not fixed to 2 bytes per character. It can use
more bytes per character if required (one reason why there is also a UTF-32).
For example, when
reading UTF-16 encoded text the parser would know that it has to read
exactly two bytes per character. On the other hand, if UTF-8 encoded,
the number of bytes to read per character will depend on the
information stored in individual bits.

And yes, that can be the important point. Whenever you want random
access to characters without parsing every character up to the one you
want to read, you must be careful that you really know how many bytes
each character takes.

UTF-16 is not fixed to 2 bytes! That is a common mistake you find often. If
you want a fixed 2-byte encoding, UCS-2 could be chosen, but then you do not
support all the characters that are supported by UTF-16!
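This is easy to verify; a quick check (Python here, just for illustration) shows a character outside the Basic Multilingual Plane taking two 16-bit code units in UTF-16:

```python
# BMP characters fit in a single 16-bit code unit.
assert len("A".encode("utf-16-le")) == 2
assert len("€".encode("utf-16-le")) == 2   # U+20AC, still in the BMP

# Characters above U+FFFF need a surrogate pair: two 16-bit code units.
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF
assert len(clef.encode("utf-16-le")) == 4

# UCS-2 has no representation for such characters at all.
```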

More details can be found on
http://en.wikipedia.org/wiki/UTF
http://en.wikipedia.org/wiki/UTF-16
http://en.wikipedia.org/wiki/UTF-32

Konrad
 

Harlan Messinger

Tony said:
Hi!

This Unicode UTF-8 can use up to 24 bits for encoding. UTF-8 supports almost
all languages, so what is the reason to use a Unicode encoding other than
UTF-8?

Most applications don't need to support almost all languages.

If you're creating an application for use in Japan by Japanese people,
for example, then you might prefer to use Shift-JIS, which uses two
bytes per character, instead of UTF-8, which uses three bytes per
Japanese character.
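The byte counts are easy to confirm, for instance in Python (used here only because it makes the encoded lengths easy to inspect):

```python
text = "日本語"  # three Japanese characters

# Shift-JIS: 2 bytes per character for this text.
assert len(text.encode("shift_jis")) == 6

# UTF-8: 3 bytes per character for CJK characters in the BMP.
assert len(text.encode("utf-8")) == 9
```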

If you're building an app for use in the United States by English
speakers, and much of the input is likely to come from ASCII sources, or
much of your output is intended for use with software that understands
only ISO-8859-1, then you may have no need for UTF-8.
 

Arne Vajhøj

I must correct myself: UTF-8 can use up to 48 bits.

I think I will try and answer 3 questions:

Q1) Why use encodings that are not Unicode based?
A1) Legacy support. There are tons of apps out
there reading and writing, among others, ISO-8859-1/CP-1252.

Q2) Why use another encoding than UTF-8 for disk files?
A2) I don't think there is any good reason.

Q3) Why use another encoding than UTF-8 for in memory
strings?
A3) Originally every Unicode code point fit in a 16-bit entity,
and even today the most commonly used characters do.
Therefore using 16-bit entities is a lot faster than UTF-8
for many typical string operations.

Arne
 

Arne Vajhøj

UTF-8 was invented later, and we need to be able to process files that
were written with software developed before UTF-8 came into wide use.

There are other kinds of Unicode (such as UTF-16 big- and little-endian)
and there are even text files that are not Unicode at all (such as
Windows ANSI, from Windows 3 days).

Even most files written on 9x/NT 4/2000 are in ANSI.

UTF-8 first became common in the MS world with .NET.

Arne
 

Mihai N.

Most applications don't need to support almost all languages.
If you're creating an application for use in Japan by Japanese people,
for example, then you might prefer to use Shift-JIS, which uses two
bytes per character, instead of UTF-8, which uses three bytes per
Japanese character.

If you're building an app for use in the United States by English
speakers, and much of the input is likely to come from ASCII sources, or
much of your output is intended for use with software that understands
only ISO-8859-1, then you may have no need for UTF-8.

Sorry, but this is pretty bad advice.
It was probably OK 10 years ago, but not now.

First, the extra bytes here and there don't amount to much.
To one team bringing up this argument I showed that the .jpg they used
for the splash screen was bigger than all the strings together.

Second, all system APIs are Unicode UTF-16.
So if you use Shift-JIS or ASCII, you will waste time on conversions
back and forth (happening in the belly of the OS).
Same for keystrokes: the system gets Unicode input and has to convert it
to a code page for legacy applications.
Of course, if your application is running on Windows 95/98, then
you are better off without Unicode.
This is also true for Mac OS X and Qt.

Third, this is a C# newsgroup, so I would assume the question
refers to that. So all strings are Unicode (UTF-16). Use any other
code page, and you will have to convert.
 

Harlan Messinger

Mihai said:
Sorry, but this is pretty bad advice.
It was probably OK 10 years ago, but not now.

First, the extra bytes here and there don't amount to much.

You think it doesn't cost anything to buy 50% more drives to store and
back up your data if you have a system that stores terabytes of text?
Not to mention backups taking 50% longer.
To one team bringing up this argument I showed that the .jpg they used
for the splash screen was bigger than all the strings together.

Note the difference between "it is sometimes not worth using something
other than UTF-8" and "it is always not worth using something other than
UTF-8".
Second, all system APIs are Unicode UTF-16.
So if you use Shift-JIS or ASCII, you will waste time for conversions
back and forth (happening in the belly of the OS).

Whereas conversion to UTF-8 happens by magic?
Same for keystrokes: the system gets unicode input and has to convert it
to a code page for legacy application.

So you should spite yourself by writing apps that won't communicate with
the legacy apps you're stuck supporting?
Of course, if your application is running on Windows 95/98, then
you are better without Unicode.
This is also true for Mac OS X and Qt.

Third, this is a C# newsgroup, so I would assume the question
refers to that. So all strings are Unicode (UTF-16). Use any other
code page, and you will have to convert.

Believe it or not, many, many .NET applications receive data generated
by non-.NET applications and vice versa. Writing an app in C# doesn't
mean you get to pretend that every app is written in C#.
 

JeffWhitledge

Hi!

This Unicode UTF-8 can use up to 24 bits for encoding. UTF-8 supports almost
all languages, so what is the reason to use a Unicode encoding other than
UTF-8?

//Tony

UTF-8 supports the complete Unicode character set so it is a fine
choice for many applications. It can be used for nearly all of the
world's written languages, and it is a compact representation for
Latin-based texts (like English), which are very common.

Except for interfacing with legacy applications, there is no good
reason to use a non-Unicode character set.

However, there are good reasons for using a Unicode character encoding
other than UTF-8.

Many platforms use UTF-16 internally (Windows NT, XP, Vista, and 7; the
.NET Framework; C#), so by sticking with that you can avoid conversions.
Many languages (especially Asian languages) have a more compact
representation in UTF-16 than in UTF-8. UTF-16 will be simpler to
process for many texts, since the characters in the Basic Multilingual
Plane (plane 0, which encodes the vast majority of the characters used
by living languages) are always represented by exactly 2 bytes in
UTF-16. (Characters in the higher planes are represented in 4 bytes in
UTF-16, but these characters are far less common.)
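Those trade-offs can be measured directly; a short comparison (in Python, only to illustrate; the sample strings are arbitrary):

```python
# Japanese (BMP): 2 bytes/char in UTF-16 versus 3 bytes/char in UTF-8.
jp = "日本語のテキスト"
assert len(jp.encode("utf-16-le")) == 2 * len(jp)   # 16 bytes
assert len(jp.encode("utf-8")) == 3 * len(jp)       # 24 bytes

# English is the other way around: 1 byte/char in UTF-8, 2 in UTF-16.
en = "plain English text"
assert len(en.encode("utf-8")) == len(en)
assert len(en.encode("utf-16-le")) == 2 * len(en)

# Above the BMP, both encodings need 4 bytes per character.
gothic = "\U00010330"  # GOTHIC LETTER AHSA
assert len(gothic.encode("utf-16-le")) == 4
assert len(gothic.encode("utf-8")) == 4
```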

For these reasons, UTF-16 can also be an excellent choice of encoding
scheme.

There are few applications where UTF-32 is the best choice, and
probably all of them are for internal processing only. I can't imagine
a scenario in which UTF-32 would be the best choice for storing or
transmitting text.
 

Arne Vajhøj

Sorry, but this is pretty bad advice.
It was probably OK 10 years ago, but not now.

First, the extra bytes here and there don't amount to much.
To one team bringing up this argument I showed that the .jpg they used
for the splash screen was bigger than all the strings together.

I agree.
Second, all system APIs are Unicode UTF-16.
So if you use Shift-JIS or ASCII, you will waste time for conversions
back and forth (happening in the belly of the OS).
Same for keystrokes: the system gets unicode input and has to convert it
to a code page for legacy application.

Are you saying that all *A Win32 calls convert to Unicode?
Of course, if your application is running on Windows 95/98, then
you are better without Unicode.
This is also true for Mac OS X and Qt.

Mac OS X supports Unicode.

Unicode is the native character set in Qt.
Third, this is a C# newsgroup, so I would assume the question
refers to that. So all strings are Unicode (UTF-16). Use any other
code page, and you will have to convert.

It also converts for UTF-16.

It may be faster than UTF-8, but I would not expect a big difference.

Arne
 

Arne Vajhøj

You think it doesn't cost anything to buy 50% more drives to store and
back up your data if you have a system that stores terabytes of text?
Not to mention backups taking 50% longer.

He is talking about western text.

The overhead is not 50% for that.

More like 2-5%.
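That figure is straightforward to check: compared with a one-byte-per-character legacy encoding like ISO-8859-1, UTF-8 only grows on the accented letters, which are a small fraction of typical western text. A quick measurement in Python (the sample sentence is made up):

```python
text = "Le café était fermé, alors nous sommes allés à la gare."

latin1 = text.encode("iso-8859-1")  # exactly 1 byte per character
utf8 = text.encode("utf-8")         # 2 bytes for each accented letter

overhead = (len(utf8) - len(latin1)) / len(latin1)
assert len(latin1) == len(text)
assert overhead < 0.15  # a few percent for this text, nowhere near 50%
```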

Arne
 

Harlan Messinger

Arne said:
He is talking about western text.

The overhead is not 50% for that.

More like 2-5%.

Ah, but that's why I chose Japanese as an example that would show how
large the impact could be!
 

Maate

Why is that? UTF-16 is also not fixed to 2 bytes per character. It can use
more bytes per character if required (one reason why there is also a UTF-32).

Thanks for pointing this out, I really was not aware of that!

However, I still think my point about performance is quite valid, so
allow me to be more specific: you can write better-performing
algorithms on text encoded as UTF-16, unless you write in Klingon or
Egyptian hieroglyphs (or one of the few other scripts whose characters
fall outside the U+0000 to U+FFFF range) ;-)

Br. Morten
 

Mihai N.

You think it doesn't cost anything to buy 50% more drives to store and
back up your data if you have a system that stores terabytes of text?
Not to mention backups taking 50% longer.

1. Not all your data is text.
In fact, I bet very little of it is text.
2. The rule-of-thumb recommendation is:
- legacy code pages to "talk" with ancient software
- UTF-8 for storage/serialization/communication
- UTF-16 for processing
- convert at the edge
There are exceptions, nothing is carved in stone, but you should know
why you decide differently.
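The "convert at the edge" rule amounts to: decode once on input, keep native strings internally, encode once on output. A minimal sketch (Python; the file name and the processing step are invented for the example):

```python
import os
import tempfile

def process(text: str) -> str:
    # Internal processing works on the platform's native string type
    # (UTF-16 code units in .NET; code points in Python 3).
    return text.upper()

path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Encode at the output edge...
with open(path, "w", encoding="utf-8") as f:
    f.write(process("héllo wörld"))

# ...and decode at the input edge; everything in between is plain strings.
with open(path, encoding="utf-8") as f:
    assert f.read() == "HÉLLO WÖRLD"
```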

"it is always not worth using something other than UTF-8".


Whereas conversion to UTF-8 happens by magic?

Did I recommend UTF-8? Read again.
And even if UTF-8 makes sense sometimes, you convert at the edge.
Your answer recommended Shift-JIS and Latin-1: the same conversion
overhead, with no benefit (no international text support).

So you should spite yourself by writing apps that won't communicate with
the legacy apps you're stuck supporting?

Do you communicate with other applications through keyboard messages?
Anyway, this is trying to twist my answer into
"always, absolutely always use UTF-16; this is a religious tenet and
you have to obey it blindly without using your brain".
That's not the case.
 
