Unicode to ASCII string conversion


Ger

I have not been able to find a simple, straightforward Unicode to ASCII
string conversion function in VB.NET.
Is that because such a function does not exist, or did I overlook it?

I found Encoding.Convert, but that needs byte arrays.
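
(For reference, Encoding.Convert does indeed work only on byte arrays; a minimal sketch of how it would be used, with illustrative variable names:)

' Sketch: Encoding.Convert converts between byte arrays, not Strings.
Dim unicodeBytes As Byte() = System.Text.Encoding.Unicode.GetBytes("abcdëfg")
Dim asciiBytes As Byte() = System.Text.Encoding.Convert( _
    System.Text.Encoding.Unicode, System.Text.Encoding.ASCII, unicodeBytes)
' You still have to decode the bytes yourself to get a String back:
Dim result As String = System.Text.Encoding.ASCII.GetString(asciiBytes)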

Thanks,
/Ger
 

Herfried K. Wagner [MVP]

* "Ger said:
I have not been able to find a simple, straight forward Unicode to ASCII
string conversion function in VB.Net.
Is that because such a function does not exists or do I overlook it?

'System.Text.Encoding.ASCII.GetBytes',
'System.Text.Encoding.Unicode.GetBytes'.

BTW: Notice that there is no 1:1 mapping defined between Unicode and
ASCII; ASCII is 7-bit only and can thus represent only 128 characters.
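
A minimal sketch of that suggestion (the string literal is only an example): encode to ASCII bytes and decode back, and note that characters outside the 7-bit range do not survive.

' Encode a Unicode String to ASCII bytes and decode it back to a String.
Dim original As String = "abcdëfg"
Dim asciiBytes As Byte() = System.Text.Encoding.ASCII.GetBytes(original)
Dim roundTripped As String = System.Text.Encoding.ASCII.GetString(asciiBytes)
' roundTripped is "abcd?fg": the "ë" was replaced with "?" because ASCII cannot represent it.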
 

Ger

Herfried K. Wagner said:
'System.Text.Encoding.ASCII.GetBytes',
'System.Text.Encoding.Unicode.GetBytes'.

Thanks for your reply, but this returns a byte array. I meant a straightforward
string-to-string conversion. It is of course possible to write a simple function
to do this using the Encoding class, but I was just wondering why the framework
does not support the "direct string-to-string".
/Ger
 

Cor Ligthert

Ger,
Thanks for your reply, but this returns a byte array. I meant a straightforward
string-to-string conversion. It is of course possible to write a simple function
to do this using the Encoding class, but I was just wondering why the framework
does not support the "direct string-to-string".

In .NET, a "String" is always a string of Unicode Chars. What you
call an "ASCII string" is always a byte array.

Therefore, as an answer, there is nothing more than what Herfried suggested.
You could create an array of objects that contains bytes, but in my opinion
that is no solution.
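
If you only need the effect of a string-to-string conversion, the "simple function" Ger mentions could look like the sketch below; note that the return value is still an ordinary (Unicode) .NET String, just with the non-ASCII characters replaced.

Function ToAsciiString(ByVal input As String) As String
    ' Sketch: round-trip through the ASCII encoder.
    ' Characters above 127 come back as "?"; the result is still a Unicode String.
    Dim bytes As Byte() = System.Text.Encoding.ASCII.GetBytes(input)
    Return System.Text.Encoding.ASCII.GetString(bytes)
End Function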

I hope this helps to get the idea?

Cor
 

Ger

Ah, now I think I get the idea. So when I convert a (Unicode) string into an
ASCII byte array, and then the byte array back into a string, I still have
Unicode, right? So that is of no use when you want to write ASCII to a
FileStream.

Is the code below then writing ASCII output to my filestream?

Dim UnicodeString As String = "abcdëfg"
Dim fsOutput As New FileStream(..)
Dim wOutput As New StreamWriter(fsOutput, System.Text.Encoding.ASCII)
wOutput.WriteLine(UnicodeString)
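
(For completeness, a runnable variant of that snippet with a placeholder file name; with Encoding.ASCII the "ë" ends up in the file as "?":)

' Assumes Imports System.IO; "output.txt" is just a placeholder name.
Dim UnicodeText As String = "abcdëfg"
Dim fsOut As New FileStream("output.txt", FileMode.Create)
Dim wOut As New StreamWriter(fsOut, System.Text.Encoding.ASCII)
wOut.WriteLine(UnicodeText)
wOut.Close()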

Thank you for your reply.

/Ger
 

Jay B. Harlow [MVP - Outlook]

Ger,

Ah, now I think I get the idea. So when I convert a (Unicode) string into an
ASCII byte array, and then the byte array back into a string, I still have
Unicode, right?

Correct, just remember that you will lose some characters going to and from
ASCII.

So that is of no use when you want to write ASCII to a filestream.

If you need an ASCII file, then use an ASCII encoding. It really depends on
what is going to read the file again.

I would recommend an ANSI encoding (see below) or a UTF8 encoding. With
ASCII you will lose all extended characters (ASCII is a 7-bit encoding); with
ANSI you will lose characters that are outside of your regional ANSI code
page. UTF8 preserves all Unicode characters. I would recommend the ANSI
encoding if the file is going to be opened by casual users in Notepad, and
UTF8 if full Unicode support is required. ANSI & UTF8 are both 8-bit
encodings.

Is the code below then writing ASCII output to my filestream?

Yes, that code is writing ASCII, as you included the ASCII encoding in the
StreamWriter constructor.

The text file itself will contain ASCII characters; when you subsequently
open that text stream and read it (with a StreamReader), it will be converted
back to Unicode strings. When reading the file back, try to use the same
encoding it was written with: if you wrote ANSI, then use ANSI to read; if
you wrote UTF8, then use UTF8 to read, because ANSI & UTF8 encode characters
127 to 255 differently. Remember that Encoding.UTF8 is used on the
StreamWriter if you do not give it an encoding; if you are reading text files
created by Notepad, then you want Encoding.Default.

I would recommend:

Dim wOutput As New StreamWriter(fsOutput, System.Text.Encoding.Default)

This will write the file in your current ANSI code page, as defined by the
regional settings in the Windows Control Panel, which will preserve extended
characters.

Remember that ANSI is an 8-bit encoding that is dependent on region (code
page), while ASCII is a 7-bit encoding. ASCII does not support extended
characters such as ë; it will be converted into either a normal e or a ?.
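
A sketch of the round trip described above, writing and then reading with the same encoding so that extended characters such as ë survive ("test.txt" is just a placeholder name; assumes Imports System.IO):

' Write with the current ANSI code page (Encoding.Default)...
Dim writer As New StreamWriter("test.txt", False, System.Text.Encoding.Default)
writer.WriteLine("abcdëfg")
writer.Close()

' ...and read it back with the same encoding; the result is a Unicode String again.
Dim reader As New StreamReader("test.txt", System.Text.Encoding.Default)
Dim readBack As String = reader.ReadLine()   ' "abcdëfg"
reader.Close()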

Hope this helps
Jay
 

Ger

Thank you very much guys for your help and clearing up the issue for me.
I will go for Jay's solution and use ANSI 8-bit.
/Ger
 

Cor Ligthert

Jay,

Because of Ger's answer I have now become curious. I did not read it in your
message, but what is the solution? Ger said he wanted a straight string-to-string
conversion and explicitly no byte array, yet now I understand that he can
convert Unicode to an 8-bit ANSI String in VB.NET? (And I am not talking about
writing a file with 8-bit chars by encoding the chars.)

I showed in this thread, with a link to an MSDN page, that a String always
contains 16-bit Chars.

Is that documentation wrong, do I not understand it, or is it maybe even
something completely different?

Cor
 

Jay B. Harlow [MVP - Outlook]

Ger,
I should add that I would normally use Encoding.Default (ANSI) for plain
text files, and UTF8 for XML files.
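
In code that is just the encoding argument on the StreamWriter, e.g. (a sketch; the file names are placeholders):

Dim textOut As New StreamWriter("notes.txt", False, System.Text.Encoding.Default)  ' plain text: ANSI
Dim xmlOut As New StreamWriter("data.xml", False, System.Text.Encoding.UTF8)       ' XML: UTF-8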

Hope this helps
Jay

 

Jay B. Harlow [MVP - Outlook]

Cor,
Read my post. ;-) I only discussed reading & writing strings to ASCII, ANSI,
and UTF8 files (7- and 8-bit encodings).

You are correct: System.String & System.Char are UTF-16 (16-bit Unicode);
files can be ANSI, ASCII, UTF7, UTF8, EBCDIC, UTF16, and many other
encodings.
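
You can see the difference between the in-memory UTF-16 representation and the various file encodings by comparing byte counts for the same string (a sketch; the exact ANSI count depends on your code page):

Dim s As String = "abcdëfg"
Debug.WriteLine(System.Text.Encoding.Unicode.GetByteCount(s))  ' 14 - UTF-16, two bytes per Char
Debug.WriteLine(System.Text.Encoding.Default.GetByteCount(s))  ' 7 on a windows-1252 system
Debug.WriteLine(System.Text.Encoding.UTF8.GetByteCount(s))     ' 8 - the "ë" takes two bytes
Debug.WriteLine(System.Text.Encoding.ASCII.GetByteCount(s))    ' 7 - but the "ë" would become "?"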


FWIW: VS.NET 2005 (.NET 2.0, aka Whidbey, due out in 2005) appears to
support UTF-32 encoding for files.

http://msdn2.microsoft.com/library/ts575t62.aspx

Hope this helps
Jay


 

Ger

Jay,
Actually, the reason I bothered this group with my question was that my
app generates and writes HTML data to a file, believing it would be
interpreted by the browser as ASCII by setting charset=windows-1252 in the
meta tag of the HTML head.
Then I found that certain diacritical characters were not interpreted well
by the browser, and that this was because the characters in the generated
stream were delivered as Unicode. After changing the meta tag in the HTML
code to charset=UTF-8, all was fine, but it left me with the question of how
these things work, and hence why there is no simple string-to-string
conversion in the framework.
Your and Cor's replies made me understand better how these things work in
.NET, and I thank you both for your help in this.
/Ger



 

Jay B. Harlow [MVP - Outlook]

Ger,

Regarding the file being interpreted by the browser as ASCII by setting
charset=windows-1252: charset=windows-1252 is ANSI, not ASCII!

If you use Encoding.Default you will (normally) wind up with windows-1252 in
the US and some of Europe.

Try the following:

Debug.WriteLine(System.Text.Encoding.Default.EncodingName, "encoding name")
Debug.WriteLine(System.Text.Encoding.Default.BodyName, "body name")
Debug.WriteLine(System.Text.Encoding.Default.HeaderName, "header name")
Debug.WriteLine(System.Text.Encoding.Default.WebName, "web name")
Debug.WriteLine(System.Text.Encoding.Default.CodePage, "code page")
Debug.WriteLine(System.Text.Encoding.Default.WindowsCodePage, "windows code page")

I would expect that writing with Encoding.Default and setting charset to
Encoding.Default.WebName should give you the effect you desire, with any
regional settings.

Of course, UTF8 will preserve ALL Unicode characters in your output (good for
XML & HTML).
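
A sketch of that idea: write the HTML with Encoding.Default and put the matching name (Encoding.Default.WebName) into the meta tag, so the declared charset always matches the bytes actually written (the file name and markup are placeholders; assumes Imports System.IO and System.Text):

Dim enc As Encoding = Encoding.Default
Dim wHtml As New StreamWriter("page.html", False, enc)
wHtml.WriteLine("<html><head>")
wHtml.WriteLine("<meta http-equiv=""Content-Type"" content=""text/html; charset=" & enc.WebName & """>")
wHtml.WriteLine("</head><body>abcdëfg</body></html>")
wHtml.Close()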

Hope this helps
Jay

 

Cor Ligthert

Jay,

Jay said:
If you use Encoding.Default you will (normally) wind up with windows-1252 in
the US and some of Europe.

I think that I have confused you, so I checked it; as far as I can see, it is
only not used in the European countries that use Cyrillic script.

See the link below.

I even miss some languages on that page, such as Croatian, Slovenian (Miha),
and Slovak, and I am in doubt about Turkish.

So the European countries that probably do not use 1252 are those where the
following are spoken:
Greek
Romanian (Rhaeto-Romanic is one of the four languages spoken in Switzerland)
Serbian
Bosnian/Herzegovinian
Russian

Look at the bottom of the page; the page is in German, but I think it is easy
to understand:

http://de.wikipedia.org/wiki/Windows-1252#Windows-1252

So next time, in my opinion, you can write "most of Europe".

I hope this gives again some ideas.

Cor
 

Jay B. Harlow [MVP - Outlook]

Cor,
Aren't some of the European countries on one of two encodings, such as the
Netherlands? Or is that just the DOS code page? (Trying to remember something
I thought you had stated before.)

Jay
 

Cor Ligthert

Jay,

That is why I checked it. I confused you when I was talking about the DOS
437 and 850 code pages, where you never knew which one was used; however, we
now probably all use 1252.

Cor
 

Cor Ligthert

Cor Ligthert said:
That is why I checked it. I confused you when I was talking about the DOS
437 and 850 code pages, where you never knew which one was used; however, we
now probably all use 1252.

The "you" who never knew is not Jay, just to be clear.
 
