Convert string to "best possible" ascii representation

Achim Domma · Aug 30, 2006

Hi,

I have to convert a string to its "best possible" ASCII representation.
I realize this is not possible or sensible for all Unicode
characters, but for most European characters it should be.

For example:

"Müller" should become "Muller" and "é" should become "e".

Does some functionality like this already exist?

Achim

Guest · Aug 30, 2006

"Best possible"? Who, pray tell, is the arbiter of that? You are the one that
chooses the encoding, and there are many to choose from. If you use strict
ASCII encoding, characters outside its range will come out as question marks (?).

Peter
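
To illustrate the point about strict ASCII, here is a minimal sketch. The class and method names are made up for illustration, and it assumes the default behavior of Encoding.ASCII, which substitutes a question mark for any character it cannot represent.

```csharp
using System;
using System.Text;

static class StrictAsciiDemo
{
    // Round-trips a string through 7-bit ASCII. With the default
    // replacement fallback, each non-ASCII character becomes '?'.
    public static string ToStrictAscii(string input)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(input);
        return Encoding.ASCII.GetString(bytes);
    }
}
```

Calling StrictAsciiDemo.ToStrictAscii("M\u00FCller") yields "M?ller", which is why a plain ASCII encoding on its own does not answer the original question.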

Morten Wennevik · Aug 30, 2006

Hi Achim,

There is nothing out of the box that will do this for you.
You are probably best served using a lookup table to convert the
characters, but there is a method that will approximate most of the
characters. This is not guaranteed to work!

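// Requires: using System.Text;
string s = "éëæñúüøå";
// Encode to a code page that lacks these letters; this relies on the
// encoder's best-fit fallback choosing a similar unaccented letter.
byte[] data = Encoding.GetEncoding("ISO-8859-6").GetBytes(s);
s = Encoding.GetEncoding("ISO-8859-1").GetString(data);

// s == "eeanuuoa"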
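
The lookup table suggested above could be sketched roughly as follows. The class name, the table contents, and the choice of '?' for unmapped characters are all assumptions made for illustration; in particular, mapping æ, ø and å to "ae", "oe" and "aa" is just one possible convention, and a real table would need many more entries.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class AsciiLookup
{
    // Hypothetical, deliberately tiny table. Multi-character
    // replacements are allowed, so values are strings.
    static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        { '\u00E9', "e" },  // é
        { '\u00EB', "e" },  // ë
        { '\u00F1', "n" },  // ñ
        { '\u00FA', "u" },  // ú
        { '\u00FC', "u" },  // ü
        { '\u00E6', "ae" }, // æ (assumed convention)
        { '\u00F8', "oe" }, // ø (assumed convention)
        { '\u00E5', "aa" }  // å (assumed convention)
    };

    public static string ToAscii(string input)
    {
        StringBuilder sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            string replacement;
            if (c < 128)
                sb.Append(c);            // already ASCII
            else if (Map.TryGetValue(c, out replacement))
                sb.Append(replacement);  // known substitution
            else
                sb.Append('?');          // no entry in the table
        }
        return sb.ToString();
    }
}
```

With this table, AsciiLookup.ToAscii("M\u00FCller") gives "Muller". The advantage over the encoding round trip is that every substitution is explicit and under your control.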

Larry Lard · Aug 30, 2006

Achim said:
Does some functionality like this already exist?

Would you say this is something that's commonly done? Because that's
the kind of thing that gets into the Framework.

By the way, what are you going to do with the Scandinavian å and ø?
Replacing them with a and o would be wrong at best.
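
One way to see this concern in practice is the common technique of Unicode decomposition followed by removal of combining marks. This is a different approach from the ones posted in this thread, shown only as a sketch; the class and method names are hypothetical, and it assumes string.Normalize is available (.NET 2.0 or later).

```csharp
using System;
using System.Globalization;
using System.Text;

static class Diacritics
{
    // Decomposes each character into base letter + combining marks
    // (form D), then drops the combining marks.
    public static string StripMarks(string input)
    {
        string decomposed = input.Normalize(NormalizationForm.FormD);
        StringBuilder sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Here "M\u00FCller" becomes "Muller" and å (U+00E5) becomes a plain "a", because both decompose into a base letter plus a combining mark. The letter ø (U+00F8) has no such decomposition and passes through unchanged, so it would still need an explicit rule, and whether "a" is an acceptable stand-in for å remains a linguistic decision rather than a technical one.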

Cor Ligthert [MVP] · Aug 30, 2006

Achim,

Maybe the two links on this page can help you, in addition to the other
information you have already received.

http://www.vb-tips.com/dbPages.aspx?ID=cca7e08a-9580-42b3-beff-76c81839e6c9

I hope this helps,

Cor

joachim · Aug 30, 2006

Morten said:
there is a method that will approximate most of the
characters. This is not guaranteed to work!

I once needed a converter from any code page to any code page (as a
matter of fact, from all Windows code pages to all Macintosh code pages).
At this link you can get all the mappings you'll need for ASCII to Unicode:

http://www.unicode.org/Public/MAPPINGS/VENDORS/

I wrote a parser that built a substitution matrix from two files, so as to
switch only the characters that had different character codes for the same
Unicode value. In your case, I'd suggest you build your matrix from one
single file (don't hard-code it, to keep your solution flexible).

To make the substitutions I implemented an Aho-Corasick engine with
callbacks (you'll definitely want to use this if you want your replacement
to be efficient when processing large files, let's say 1 GB):

http://en.wikipedia.org/wiki/Aho-Corasick_algorithm

With this method you are in complete control of what you want to
change. It is also flexible, because you only need to change the file
which holds your substitutions.

Drop me a line and I'll send you some code,

Best Regards,
Joachim
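
The idea of keeping the substitutions in a file rather than in code could be sketched as below. This is not Joachim's implementation: it uses a plain per-character dictionary lookup instead of an Aho-Corasick engine, and the file format (hex code point, a tab, then the replacement, with '#' for comment lines) is a made-up assumption, not the layout of the unicode.org mapping files. It only handles characters in the Basic Multilingual Plane.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Text;

static class SubstitutionTable
{
    // Reads lines of the hypothetical form "<hex code point>\t<replacement>".
    // Empty lines, '#' comment lines and malformed lines are skipped.
    public static Dictionary<char, string> Load(TextReader reader)
    {
        Dictionary<char, string> table = new Dictionary<char, string>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Length == 0 || line[0] == '#') continue;
            string[] parts = line.Split('\t');
            if (parts.Length != 2) continue;
            int codePoint = int.Parse(parts[0], NumberStyles.HexNumber,
                                      CultureInfo.InvariantCulture);
            table[(char)codePoint] = parts[1];
        }
        return table;
    }

    // Replaces every character that has an entry; others are kept as-is.
    public static string Apply(string input, Dictionary<char, string> table)
    {
        StringBuilder sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            string replacement;
            if (table.TryGetValue(c, out replacement))
                sb.Append(replacement);
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Load accepts any TextReader, so the table can come from a StreamReader over a file or from a StringReader in a test. For single-character substitutions a dictionary lookup per character is already linear in the input size; a multi-pattern matcher such as Aho-Corasick becomes relevant once the keys are longer strings.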
