ANSI or Unicode


Jeff B.

Hello,

Can anyone tell me an easy way to determine when I should use the
System.Text.Encoding.Default class versus the System.Text.Encoding.Unicode
class? I want to use Unicode whenever it's available on the user's system,
otherwise I want to use standard ANSI encoding.

I know that Windows NT, XP, 2000, and 2003 all have native Unicode support
but Windows 9x is hit-and-miss. Is there a quick and easy way to determine
if Unicode is available on a user's system?

--- Thanks

--

Jeff Bramwell
Digerati Technologies, LLC
www.digeratitech.com

Manage Multiple Network Configurations with Select-a-Net
www.select-a-net.com
 

Jon Skeet [C# MVP]

Jeff B. said:
Can anyone tell me an easy way to determine when I should use the
System.Text.Encoding.Default class versus the System.Text.Encoding.Unicode
class? I want to use Unicode whenever it's available on the user's system,
otherwise I want to use standard ANSI encoding.

I know that Windows NT, XP, 2000, and 2003 all have native Unicode support
but Windows 9x is hit-and-miss. Is there a quick and easy way to determine
if Unicode is available on a user's system?

The Unicode encoding itself will always be available, as it's part of
the framework. Do you mean whether Unicode support is available in
things like notepad?
 

Jeff B.

The Unicode encoding itself will always be available, as it's part of
the framework. Do you mean whether Unicode support is available in
things like notepad?

What I'm trying to say is that if I call a method from the Unicode class on
a Windows 9x system that doesn't have the Microsoft Layer for Unicode
installed, will it still work? For example, if I'm encoding/decoding a
string via the Unicode class, and MLU isn't installed, will the string still
encode correctly? Another example:

byte[] buffer = System.Text.Encoding.Unicode.GetBytes("Some String");

On a system with Unicode support, I would expect the above byte array to
contain two bytes for every character. If I call the above method on a
non-Unicode machine, will I then receive just one byte per character (which
is what I would expect)?

--- Thanks

--

Jeff Bramwell
Digerati Technologies, LLC
www.digeratitech.com

Manage Multiple Network Configurations with Select-a-Net
www.select-a-net.com
 

Jon Skeet [C# MVP]

Jeff B. said:
What I'm trying to say is that if I call a method from the Unicode class on
a Windows 9x system that doesn't have the Microsoft Layer for Unicode
installed, will it still work?

I would certainly hope so.
For example, if I'm encoding/decoding a
string via the Unicode class, and MLU isn't installed, will the string still
encode correctly? Another example:

byte[] buffer = System.Text.Encoding.Unicode.GetBytes("Some String");

On a system with Unicode support, I would expect the above byte array to
contain two bytes for every character. If I call the above method on a
non-Unicode machine, will I then receive just one byte per character (which
is what I would expect)?

It's certainly not what I'd expect - you asked for Unicode, you'd
better get Unicode or an exception! However, I'd certainly expect it to
work. Converting from a string or char array to Unicode bytes is
absolutely trivial - it just involves pushing out the two bytes for
each char. I'd be amazed if the method involved calling into something
like the MLU.
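[Editor's note: Jon's claim is easy to check empirically. A minimal sketch (the class name and the byte-value comments are mine, not from the thread):

```csharp
using System;
using System.Text;

class UnicodeBytesDemo
{
    static void Main()
    {
        // Encoding.Unicode is UTF-16 little-endian: every char in the
        // Basic Multilingual Plane becomes exactly two bytes, with no
        // OS-level Unicode support involved.
        byte[] bytes = Encoding.Unicode.GetBytes("Some String");

        Console.WriteLine(bytes.Length); // 22: 11 chars * 2 bytes
        Console.WriteLine(bytes[0]);     // 83: low byte of 'S' (U+0053)
        Console.WriteLine(bytes[1]);     // 0:  high byte of 'S'
    }
}
```
]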
 

Jeff B.

It's certainly not what I'd expect - you asked for Unicode, you'd
better get Unicode or an exception! However, I'd certainly expect it to
work. Converting from a string or char array to Unicode bytes is
absolutely trivial - it just involves pushing out the two bytes for
each char. I'd be amazed if the method involved calling into something
like the MLU.

OK, that makes sense. I guess my final question would be that if I'm
passing a string into an API call (for whatever reason) and I'm calling the
"A"nsi version of the API method (because I have CharSet=CharSet.Auto and
I'm running on a Windows 9x system without MLU), I probably wouldn't want to
use Encoding.Unicode methods but rather the Encoding.Default methods. Is
this correct, or could I still use the Unicode methods?

I'm just trying to come up with some "predictable" method for determining
when I should use the Unicode methods versus the ANSI methods.
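[Editor's note: the contrast Jeff is asking about can be seen directly. A hedged sketch (Encoding.Default depends on the machine's ANSI code page, so the ANSI byte count in the comment is what you would see on a Western cp1252 system; the sample string is mine):

```csharp
using System;
using System.Text;

class AnsiVsUnicodeDemo
{
    static void Main()
    {
        string s = "café"; // 'é' is U+00E9

        // Encoding.Default uses the system's ANSI code page, matching
        // what an "A" entry point expects for byte-oriented strings.
        byte[] ansi = Encoding.Default.GetBytes(s);

        // Encoding.Unicode (UTF-16 LE) matches the "W" entry points.
        byte[] utf16 = Encoding.Unicode.GetBytes(s);

        Console.WriteLine(utf16.Length); // 8: two bytes per char, always
        Console.WriteLine(ansi.Length);  // 4 on a cp1252 system
    }
}
```
]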

--- Thanks again.

--

Jeff Bramwell
Digerati Technologies, LLC
www.digeratitech.com

Manage Multiple Network Configurations with Select-a-Net
www.select-a-net.com

Jon Skeet said:
Jeff B. said:
What I'm trying to say is that if I call a method from the Unicode class on
a Windows 9x system that doesn't have the Microsoft Layer for Unicode
installed, will it still work?

I would certainly hope so.
For example, if I'm encoding/decoding a
string via the Unicode class, and MLU isn't installed, will the string still
encode correctly? Another example:

byte[] buffer = System.Text.Encoding.Unicode.GetBytes("Some String");

On a system with Unicode support, I would expect the above byte array to
contain two bytes for every character. If I call the above method on a
non-Unicode machine, will I then receive just one byte per character (which
is what I would expect)?
 

Mihai N.

Converting from a string or char array to Unicode bytes is
absolutely trivial - it just involves pushing out the two bytes for
each char. I'd be amazed if the method involved calling into something
like the MLU.
Absolutely false for most characters other than ASCII (below 128).
 

Jon Skeet [C# MVP]

Absolutely false for most characters other than ASCII (below 128).

Really? How exactly do you believe the UnicodeEncoding should work
then?

Note that I'm not talking about UTF-8 - I'm talking about the
Encoding.Unicode encoding, UTF-16. Note further that by "char array",
I'm talking about a .NET char array, not a C char*.

Other than endianness, what issues are there to look at?
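[Editor's note: the endianness point is observable with the two built-in UTF-16 encodings; a small sketch (the class name is mine):

```csharp
using System;
using System.Text;

class EndiannessDemo
{
    static void Main()
    {
        // 'A' is U+0041. Little-endian UTF-16 writes the low byte first;
        // big-endian writes the high byte first. Byte order is the only
        // thing that varies when encoding BMP chars to UTF-16.
        byte[] le = Encoding.Unicode.GetBytes("A");
        byte[] be = Encoding.BigEndianUnicode.GetBytes("A");

        Console.WriteLine(le[0] + " " + le[1]); // 65 0
        Console.WriteLine(be[0] + " " + be[1]); // 0 65
    }
}
```
]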
 

Mihai N.

Really? How exactly do you believe the UnicodeEncoding should work
then?

It is table-based.
For instance: the trademark sign, present in the ANSI US codepage (1252),
is Unicode 2122.
Same for other chars.
So, "pushing out the two bytes for each char" is wrong.
This is even worse for code pages other than 1252.

For full mapping tables go to
ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS
 

Jon Skeet [C# MVP]

Mihai N. said:
It is table-based.

Seems like a pretty silly implementation then. Here's some code which
suggests that it's *not* table based, in that random strings when
encoded *without* using a table come out the same as using
Encoding.Unicode:

using System;
using System.Text;

public class Test
{
    const int Iterations = 100000;
    const int StringLength = 1000;

    static Random random = new Random();

    static void Main()
    {
        for (int i=0; i < Iterations; i++)
        {
            TestEncoding();
        }
    }

    static void TestEncoding()
    {
        // First build a random string
        StringBuilder builder = new StringBuilder(StringLength);
        for (int i=0; i < StringLength; i++)
        {
            builder.Append ((char)random.Next(65536));
        }

        string testString = builder.ToString();

        // Now encode using UnicodeEncoding
        byte[] official = Encoding.Unicode.GetBytes(testString);

        // Now encode using my supposedly incorrect algorithm
        byte[] jon = new byte[testString.Length*2];
        for (int i=0; i < testString.Length; i++)
        {
            jon[i*2] = (byte)(testString[i] & 0xff);
            jon[i*2+1] = (byte)(testString[i] >> 8);
        }

        // Now compare them
        if (!CompareArrays(jon, official))
        {
            Console.WriteLine ("Byte arrays don't match");
        }
    }

    static bool CompareArrays(byte[] a1, byte[] a2)
    {
        if (a1.Length != a2.Length)
        {
            return false;
        }
        for (int i=0; i < a1.Length; i++)
        {
            if (a1[i] != a2[i])
            {
                return false;
            }
        }
        return true;
    }
}

If we'd been talking about Encoding.Default, or something similar, that
would have been a different matter, but Encoding.Unicode, Encoding.UTF8
etc don't require tables.
For instance: the trademark sign, present in the ANSI US codepage (1252)
is Unicode 2212.

Where does the ANSI US codepage come in? The source we're talking about
is Unicode characters (a char array or string), and the destination is
a byte array with the Unicode encoding (new UnicodeEncoding or
Encoding.Unicode). Nowhere in the process is the ANSI encoding
relevant.
Same for other chars
So, "pushing out the two bytes for each char" is wrong.
This is even worse for code pages other than 1252.

But we're not *talking* about 1252. We're talking about encoding an
array of Unicode chars, *in UTF-16*.
For full mapping tables go to
ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS

That has no relevance here, because no Windows-specific encodings are
involved.
 

Mihai N.

Where does the ANSI US codepage come in? The source we're talking about
is Unicode characters (a char array or string), and the destination is
a byte array with the Unicode encoding (new UnicodeEncoding or
Encoding.Unicode). Nowhere in the process is the ANSI encoding
relevant. ...
But we're not *talking* about 1252. We're talking about encoding an
array of Unicode chars, *in UTF-16*.
Probably this is what you were talking about.
I have seen no such specification.

This is the original question: [...] So, it was about ANSI.
And, depending on what he wants, he should use encodings other than Unicode,
even when Unicode is available.
For instance:
- sending / receiving data over network connections (SMTP, POP3, Web, etc.)
or serial connections
- reading / writing text files from / for older applications
- limitations imposed by the file system (not being able to create folders
and files containing characters unsupported by the system's default code
page, for Windows 9x & Me)
- dealing with legacy devices
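[Editor's note: in those cases the usual approach is to name the legacy encoding explicitly rather than rely on the machine-dependent Encoding.Default. A sketch using Latin-1 (the choice of encoding and the sample string are mine; on Windows you would often ask for a code page number such as 1252 instead):

```csharp
using System;
using System.Text;

class LegacyEncodingDemo
{
    static void Main()
    {
        // Name the legacy encoding explicitly instead of trusting the
        // machine-dependent Encoding.Default. (On Windows you would
        // typically request a code page number such as 1252.)
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        byte[] bytes = latin1.GetBytes("café");
        Console.WriteLine(bytes.Length); // 4: one byte per character
        Console.WriteLine(bytes[3]);     // 233: 'é' is 0xE9 in Latin-1
    }
}
```
]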

To answer the original question: there is no easy way to determine what you
want.
 

Jon Skeet [C# MVP]

Probably this is what you where talking about.
I have seen no such specification.

Then you haven't read the whole thread properly.
This is the original question:
So, it was about ANSI.

Sort of - but by the time you entered the thread, it was about whether
or not Encoding.Unicode was available.
And, depending on what he wants, he should use encodings other than Unicode,
even when Unicode is available.

Potentially, yes. I have no argument with that.

However, the point at which you came into the thread was where I'd
specified:

<quote>
Converting from a string or char array to Unicode bytes is
absolutely trivial - it just involves pushing out the two bytes for
each char. I'd be amazed if the method involved calling into something
like the MLU.
</quote>

which was in response to Jeff asking whether using Encoding.Unicode
would work or not. My point is that Encoding.Unicode always *should*
work, and *doesn't* require a look-up table.

Note the "Unicode bytes" bit - not ANSI bytes, but Unicode, as in
Encoding.Unicode. So no, my statement was not "absolutely false" as you
claimed. It was entirely correct - but you needed to read the rest of
the thread to see the context. Nor does UnicodeEncoding need to be
table-based, as you later claimed.

To answer the original question: there is no easy way to determine what you
want.

Indeed, that answers part of the original question. It doesn't answer
Jeff's question about whether Encoding.Unicode should work or not,
however, which my other posts did.
 

Mihai N.

Indeed, that answers part of the original question. It doesn't answer
Jeff's question about whether Encoding.Unicode should work or not,
however, which my other posts did.
This is why I did not try to answer it again.
I did read the whole thread, but let's just put it on my "not so good
English" and close it here.
 
