Convert string to "best possible" ascii representation

Achim Domma

Hi,

I have to convert a string to its "best possible" ascii representation.
It's clear to me that this is not possible or sensible for all Unicode
characters. But for most European characters it should be possible.

For example:

"Müller" should become "Muller" and "é" should become "e".

Does some functionality like this already exist?

Achim
 
Guest

"Best possible"? Who, pray tell, is the arbiter of that? You are the one that
chooses the encoding, and there are many to choose from. If you use strict
ASCII encoding, you may have characters that render as question marks (?).

Peter
 
Morten Wennevik

Hi Achim,

There is nothing out of the box that will do this for you.
You are probably best served using a lookup table to convert the
characters, but there is a method that will approximate most of the
characters. This is not guaranteed to work!

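using System.Text;

// Encode into a code page that lacks these letters; the encoder may
// substitute a "best fit" plain letter rather than '?'.
string s = "éëæñúüøå";
byte[] data = Encoding.GetEncoding("ISO-8859-6").GetBytes(s);
// Decode the single-byte result back into a string.
s = Encoding.GetEncoding("ISO-8859-1").GetString(data);

// s == "eeanuuoa"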
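
For comparison, here is a different technique from the code-page round trip above: Unicode normalization. Decomposing the string (FormD) separates base letters from their combining accents, which can then be dropped. Letters that have no decomposition (æ, ø, ß) still need the lookup table mentioned earlier. This is only a sketch: the class name `AsciiFolding`, the `Overrides` table, and the replacements chosen in it are illustrative assumptions, not something the Framework ships.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

static class AsciiFolding
{
    // Hypothetical overrides for letters that have no Unicode decomposition.
    // The replacements are an assumption; adjust them per language.
    static readonly Dictionary<char, string> Overrides = new Dictionary<char, string>
    {
        { 'æ', "ae" }, { 'Æ', "AE" }, { 'ø', "o" }, { 'Ø', "O" }, { 'ß', "ss" }
    };

    public static string Fold(string input)
    {
        var sb = new StringBuilder(input.Length);
        // FormD splits "ü" into "u" followed by a combining diaeresis.
        foreach (char c in input.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue; // drop the combining accent

            string replacement;
            if (Overrides.TryGetValue(c, out replacement))
                sb.Append(replacement);
            else if (c < 128)
                sb.Append(c);   // already plain ASCII
            else
                sb.Append('?'); // no sensible replacement known
        }
        return sb.ToString();
    }
}
```

With this sketch, `AsciiFolding.Fold("Müller")` gives "Muller" and `AsciiFolding.Fold("é")` gives "e". Whether "o" is an acceptable stand-in for ø is a judgment call that stays in the table.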
 
Larry Lard

Achim said:
Does some functionality like this already exist?

Would you say this is something that's commonly done? Because that's
what gets into the Framework.

By the way, what are you going to do with the Scandinavian å and ø?
Replacing them with a and o would be wrong at best.
 
joachim

Morten Wennevik said:
there is a method that will approximate most of the
characters. This is not guaranteed to work!

I once needed a converter from any codepage to any codepage (as a
matter of fact, all Windows codepages to all Macintosh codepages). At
this link you can get all the mappings you'll need from codepage byte
values to Unicode:

http://www.unicode.org/Public/MAPPINGS/VENDORS/

I wrote a parser that built a substitution matrix from two files, so
that it only switched the characters that had different byte codes for
the same Unicode value. In your case, I'd suggest you build your matrix
from one single file (don't hard-code it, to keep your solution
flexible).
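
The two-file comparison described above could look roughly like the sketch below. It assumes the common layout of the unicode.org mapping files (a byte value in hex, a tab, the Unicode value in hex, then a `#` comment) and skips anything that does not fit it. The names `CodePageTable`, `Parse`, and `BuildSubstitutions` are made up for illustration; this is not the original parser.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class CodePageTable
{
    // Parses lines such as "0x80<TAB>0x20AC<TAB>#EURO SIGN" into byte -> code point.
    public static Dictionary<byte, int> Parse(IEnumerable<string> lines)
    {
        var map = new Dictionary<byte, int>();
        foreach (string raw in lines)
        {
            string line = raw;
            int hash = line.IndexOf('#');
            if (hash >= 0) line = line.Substring(0, hash); // strip the comment

            string[] cols = line.Split(new[] { '\t', ' ' }, StringSplitOptions.RemoveEmptyEntries);
            int code = 0, unicode = 0;
            if (cols.Length < 2 || !TryHex(cols[0], out code) || !TryHex(cols[1], out unicode)
                || code < 0 || code > 0xFF)
                continue; // blank, comment-only, undefined, or multi-byte entry

            map[(byte)code] = unicode;
        }
        return map;
    }

    static bool TryHex(string s, out int value)
    {
        value = 0;
        return s.StartsWith("0x", StringComparison.OrdinalIgnoreCase)
            && int.TryParse(s.Substring(2), NumberStyles.AllowHexSpecifier,
                            CultureInfo.InvariantCulture, out value);
    }

    // Records a swap for every source byte whose character lives at a
    // different byte in the target codepage; identical positions are left out.
    public static Dictionary<byte, byte> BuildSubstitutions(
        Dictionary<byte, int> source, Dictionary<byte, int> target)
    {
        var byCodePoint = new Dictionary<int, byte>();
        foreach (var pair in target) byCodePoint[pair.Value] = pair.Key;

        var swaps = new Dictionary<byte, byte>();
        foreach (var pair in source)
        {
            byte targetByte;
            if (byCodePoint.TryGetValue(pair.Value, out targetByte) && targetByte != pair.Key)
                swaps[pair.Key] = targetByte;
        }
        return swaps;
    }
}
```

Feeding it the lines of a Windows table and a Macintosh table (for example via `File.ReadLines`) yields only the bytes that actually need to change, such as é moving from 0xE9 to 0x8E. For the one-file variant, the same parsing idea applies to a file of your own that lists each character and its ASCII replacement.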

To make the substitutions I implemented an Aho-Corasick engine with
callbacks (you'll definitely want to use this if you want your
replacement to be efficient when processing large files, let's say
1 GB):

http://en.wikipedia.org/wiki/Aho-Corasick_algorithm
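
To give an idea of what such an engine involves, here is a minimal Aho-Corasick sketch with a match callback. It is not the code offered in this post: the class name `AhoCorasick`, the `Search` method, and the callback signature (start index, pattern) are assumptions made for illustration. It only reports matches; what to substitute, and how to handle overlapping matches, is left to the callback.

```csharp
using System;
using System.Collections.Generic;

sealed class AhoCorasick
{
    sealed class Node
    {
        public readonly Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;      // longest proper suffix that is also a trie path
        public Node DictLink;  // nearest node on the failure chain that ends a pattern
        public string Pattern; // non-null when a pattern ends at this node
    }

    readonly Node root = new Node();

    public AhoCorasick(IEnumerable<string> patterns)
    {
        // 1. Build the trie.
        foreach (string p in patterns)
        {
            if (string.IsNullOrEmpty(p)) continue;
            Node node = root;
            foreach (char c in p)
            {
                Node next;
                if (!node.Next.TryGetValue(c, out next))
                    node.Next[c] = next = new Node();
                node = next;
            }
            node.Pattern = p;
        }

        // 2. Breadth-first pass to set failure and dictionary links.
        var queue = new Queue<Node>();
        root.Fail = root;
        foreach (Node child in root.Next.Values) { child.Fail = root; queue.Enqueue(child); }
        while (queue.Count > 0)
        {
            Node node = queue.Dequeue();
            foreach (var edge in node.Next)
            {
                Node child = edge.Value;
                Node f = node.Fail;
                while (f != root && !f.Next.ContainsKey(edge.Key)) f = f.Fail;
                Node target;
                child.Fail = f.Next.TryGetValue(edge.Key, out target) ? target : root;
                child.DictLink = child.Fail.Pattern != null ? child.Fail : child.Fail.DictLink;
                queue.Enqueue(child);
            }
        }
    }

    // Scans the text once and calls onMatch(startIndex, pattern) for every
    // occurrence, including overlapping ones.
    public void Search(string text, Action<int, string> onMatch)
    {
        Node node = root;
        for (int i = 0; i < text.Length; i++)
        {
            Node next;
            while (!node.Next.TryGetValue(text[i], out next) && node != root)
                node = node.Fail;
            node = next ?? root;

            for (Node hit = node.Pattern != null ? node : node.DictLink; hit != null; hit = hit.DictLink)
                onMatch(i - hit.Pattern.Length + 1, hit.Pattern);
        }
    }
}
```

For plain one-character substitutions a dictionary lookup per character does the same job; the automaton pays off once the patterns are longer strings that must be found in a single pass over a large input.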

With this method you are in complete control of what you want to
change. It is also flexible, because you only need to change the file
which holds your substitutions.

Drop me a line and I'll send you some code,

Best Regards,
Joachim
 
