Mysterious functions of text encoding....

Viorel

How the encoding and decoding functions work is a bit of a mystery to me.
What happens underneath when they are called?



Encoding1.GetBytes(string1); in particular Encoding.ASCII.GetBytes(string1)

Encoding1.GetChars(string1);

Encoding1.GetChars(arrayofbytes1);

string1=Encoding1.GetString(arrayofbytes1);



As far as I know, a char is 2 bytes (16 bits),

and every string in C# (.NET) is a sequence of chars.



P.S. Please explain at the level of bytes (I come from the C world).

I would appreciate it.
 
Michael Giagnocavo [MVP]

They read each Unicode char, and then determine the actual byte sequence
for it in whatever encoding you choose. For instance, if you use ASCII or
UTF-8 and pass "hi", it'll make a byte[] {0x68, 0x69}. If using
Encoding.Unicode (UTF-16, little-endian), it'd be 0x68, 0x00, 0x69, 0x00.
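
To make the byte sequences above concrete, here is a minimal sketch that encodes "hi" with three of the built-in encodings and prints the bytes in hex. The Program class and the variable name s are only illustrative.

```csharp
using System;
using System.Text;

public static class Program
{
    public static void Main()
    {
        string s = "hi";

        // ASCII and UTF-8 use one byte for each of these two characters.
        Console.WriteLine(BitConverter.ToString(Encoding.ASCII.GetBytes(s)));   // 68-69
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(s)));    // 68-69

        // Encoding.Unicode is UTF-16 little-endian: two bytes per char here,
        // low byte first.
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(s))); // 68-00-69-00
    }
}
```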

The same applies when you get chars or a string from bytes. It reads
the bytes and then determines what actual Unicode characters they are.
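
As a sketch of the decoding direction, the example below turns byte arrays back into strings. It also decodes the same two bytes with two different encodings to show that the bytes alone do not say which characters they stand for. The byte values and variable names are chosen purely for illustration.

```csharp
using System;
using System.Text;

public static class Program
{
    public static void Main()
    {
        // UTF-16 little-endian bytes for "hi".
        byte[] utf16 = { 0x68, 0x00, 0x69, 0x00 };
        Console.WriteLine(Encoding.Unicode.GetString(utf16)); // hi

        // The same two bytes read under two different encodings.
        byte[] data = { 0xC3, 0xA9 };
        string asUtf8 = Encoding.UTF8.GetString(data);                        // one char, U+00E9
        string asLatin1 = Encoding.GetEncoding("iso-8859-1").GetString(data); // two chars, U+00C3 U+00A9

        Console.WriteLine(asUtf8.Length);   // 1
        Console.WriteLine(asLatin1.Length); // 2
    }
}
```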

Also, there is no Encoding.GetChars(string) method (since it'd only
return chars of a string, which would always be the .NET internal
representation (Unicode)).
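
For comparison, a short sketch of the two calls that do exist: string.ToCharArray() copies the chars out of a string with no encoding involved, while Encoding.GetChars(byte[]) decodes bytes into chars. The variable names are illustrative only.

```csharp
using System;
using System.Text;

public static class Program
{
    public static void Main()
    {
        string s = "hi";

        // No encoding needed: the string already holds Unicode chars.
        char[] fromString = s.ToCharArray();
        Console.WriteLine(fromString.Length); // 2

        // An encoding is needed only when the starting point is bytes.
        char[] fromBytes = Encoding.ASCII.GetChars(new byte[] { 0x68, 0x69 });
        Console.WriteLine(new string(fromBytes)); // hi
    }
}
```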

-mike
MVP
 
Jon Skeet

Viorel said:
> So, if I understand correctly, Encoding1.GetBytes(string1); takes the
> content of string1 represented in Unicode, converts (internally) the
> content to Encoding1, and then takes the bytes and returns them to me as
> an array of bytes. It means that internally there is always a conversion
> from Unicode to Encoding1.
>
> And string1=Encoding1.GetString(arrayofbytes1); creates (internally) a
> string in Encoding1 and then converts it to Unicode to be assigned to
> string1.
>
> Thus the rule (of the language) of keeping all strings in Unicode is
> never broken.

No.

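Strings are *always* in Unicode. Encoding.GetBytes takes the sequence
of Unicode characters and converts them into a sequence of bytes which
represents (in the specified encoding) that sequence of characters.
Encoding.GetString does the reverse: it takes a sequence of bytes and
converts it directly into the Unicode characters those bytes represent.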
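
A minimal sketch of that round trip, assuming UTF-8 as the encoding: chars go to bytes with GetBytes and come back with GetString, and the string on either side is ordinary Unicode text. The sample text is arbitrary.

```csharp
using System;
using System.Text;

public static class Program
{
    public static void Main()
    {
        // \u00E9 is e with an acute accent, written as an escape so the
        // source file's own encoding does not matter.
        string original = "h\u00E9llo";

        // Unicode chars -> bytes in the chosen encoding.
        byte[] utf8 = Encoding.UTF8.GetBytes(original);
        Console.WriteLine(utf8.Length); // 6 (the accented char takes two bytes)

        // Bytes in the chosen encoding -> Unicode chars.
        string back = Encoding.UTF8.GetString(utf8);
        Console.WriteLine(back == original); // True
    }
}
```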

For an example of how this might be done, have a look at my EBCDIC
encoding:
http://www.pobox.com/~skeet/csharp/ebcdic/

You might also find this article useful:
http://www.pobox.com/~skeet/csharp/unicode.html

> My note: at first I thought that all strings are kept in their own
> encoding.

Nope. Strings don't have any encoding associated with them.

> I thought string1 is in Encoding1, and that if
> string2=Encoding2.GetString(arrayofbytes1); and Encoding1 != Encoding2,
> then trying to assign string2 to string1 (string1=string2) would raise an
> exception. Then there would be no need for an internal (hidden from me)
> conversion from Encoding1 to Unicode, and it would be more explicit. Am I
> right?

Nope.
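
A small sketch of why no exception can arise: two strings decoded with different encodings are both plain Unicode strings afterwards, so they can be compared and assigned freely. The names string1 and string2 follow the question; the byte arrays are made up for the example.

```csharp
using System;
using System.Text;

public static class Program
{
    public static void Main()
    {
        byte[] asciiBytes = { 0x68, 0x69 };             // "hi" in ASCII
        byte[] utf16Bytes = { 0x68, 0x00, 0x69, 0x00 }; // "hi" in UTF-16 LE

        string string1 = Encoding.ASCII.GetString(asciiBytes);
        string string2 = Encoding.Unicode.GetString(utf16Bytes);

        // Neither string remembers which encoding produced it.
        Console.WriteLine(string1 == string2); // True

        // Plain assignment; nothing is converted and nothing throws.
        string1 = string2;
        Console.WriteLine(string1); // hi
    }
}
```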
 
Viorel

Strings are *always* in Unicode. Encoding.GetBytes takes (&from where?&)
the sequence of Unicode characters and converts them into a sequence of
bytes, which represents (in the specified encoding) that sequence of
characters (&if it can else it truncates&).



If I am not wrong, GetString:

1) takes a sequence of bytes

2) creates from that sequence a string ('lying' on those bytes) based
on the encoder object (on which GetString was called)

3) converts that string to Unicode format (so that I can finally use it)

If these steps are not what C# implements, could they achieve the same
result?

If not, please explain in more detail the path from bytes to string, in
the same manner as above.



Thank you very much.
 
Jon Skeet

[If you could make it clearer which bit you're quoting, it would make
your posts easier to read... I've reformatted it here.]

> from where?

From the parameter you pass it.

> if it can else it truncates

What do you mean by "truncates" here? It doesn't just truncate the
string or byte array. If the encoding is passed a sequence of bytes it
doesn't fully understand (e.g. including some bytes with the top bit set
where the encoding is ASCII), I don't believe there's any particular
specified behaviour. I prefer to end up with '?' in the returned
string, myself.
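
As a hedged illustration of the '?' approach: in .NET 2.0 and later, what happens to undecodable bytes can be chosen explicitly by passing fallback objects to Encoding.GetEncoding. The sketch below asks for '?' replacement in one case and for an exception in the other. The input bytes are made up, and the result has the same length as the input, so nothing is truncated.

```csharp
using System;
using System.Text;

public static class Program
{
    public static void Main()
    {
        // 0xE9 has the top bit set, so it is not a valid ASCII byte.
        byte[] input = { 0x68, 0xE9, 0x69 };

        // Replace anything that cannot be encoded or decoded with '?'.
        Encoding lenient = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback("?"),
            new DecoderReplacementFallback("?"));
        Console.WriteLine(lenient.GetString(input)); // h?i

        // Alternatively, refuse invalid input by throwing.
        Encoding strict = Encoding.GetEncoding(
            "us-ascii",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strict.GetString(input);
        }
        catch (DecoderFallbackException)
        {
            Console.WriteLine("invalid byte rejected");
        }
    }
}
```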

> If I am not wrong, GetString:
>
> 1) takes a sequence of bytes

Yes.

> 2) creates from that sequence a string ('lying' on those bytes) based
> on the encoder object (on which GetString was called)

No. I don't know where you get this idea from.

> 3) converts that string to Unicode format (so that I can finally use it)

You're wrong. There's no need for some strange middle string.

> If these steps are not what C# implements, could they achieve the same
> result?

No, because a string is *always* in Unicode.

> If not, please explain in more detail the path from bytes to string, in
> the same manner as above.

It depends on how the encoding implementation wants to do it - as I
said before, if you want an example implementation, look at my EBCDIC
encoding.
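
To give a rough idea of what such an implementation can look like, here is a minimal sketch of a custom single-byte encoding. It is not the EBCDIC code referred to above: ToySingleByteEncoding is a hypothetical class invented for this example, in which byte b decodes to the char with code b and any char above U+00FF encodes to '?'. Argument validation is left out to keep it short. Note that bytes are mapped straight to chars, with no intermediate string.

```csharp
using System;
using System.Text;

// Hypothetical toy encoding, for illustration only.
public sealed class ToySingleByteEncoding : Encoding
{
    public override int GetByteCount(char[] chars, int index, int count)
    {
        return count; // always one byte per char
    }

    public override int GetBytes(char[] chars, int charIndex, int charCount,
                                 byte[] bytes, int byteIndex)
    {
        for (int i = 0; i < charCount; i++)
        {
            char c = chars[charIndex + i];
            // Chars this encoding cannot represent become '?'.
            bytes[byteIndex + i] = c <= 0xFF ? (byte)c : (byte)'?';
        }
        return charCount;
    }

    public override int GetCharCount(byte[] bytes, int index, int count)
    {
        return count; // always one char per byte
    }

    public override int GetChars(byte[] bytes, int byteIndex, int byteCount,
                                 char[] chars, int charIndex)
    {
        for (int i = 0; i < byteCount; i++)
        {
            // Each byte maps directly to one Unicode char.
            chars[charIndex + i] = (char)bytes[byteIndex + i];
        }
        return byteCount;
    }

    public override int GetMaxByteCount(int charCount) { return charCount; }

    public override int GetMaxCharCount(int byteCount) { return byteCount; }
}

public static class Program
{
    public static void Main()
    {
        Encoding toy = new ToySingleByteEncoding();
        byte[] bytes = toy.GetBytes("hi");
        Console.WriteLine(BitConverter.ToString(bytes)); // 68-69
        Console.WriteLine(toy.GetString(bytes));         // hi
    }
}
```

A real encoding such as EBCDIC would typically replace the direct casts with lookup tables in each direction, but the shape of the class stays the same.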
 
