How to remove accents (A-Umlaut to A)

cody

Is there a method to replace special characters like Ä (A-Umlaut) with
A, Ö (O-Umlaut) with O, and so on?
Sure, I could look for each character separately and replace it with its
ASCII counterpart, but there are also such special characters in French
and Swedish and many other languages which I also want to catch. Is
there a generic way to do it?
 
Morten Wennevik [C# MVP]

Hi Cody,

There is no generic way to do this. There is a hack that works in most cases, which involves encoding the string with one encoding and reading it back with a different one, but this is by no means guaranteed to work for you. Your best bet is to create a lookup table and manually translate each character. If you anticipate a wide variety of characters, maybe Unicode or UTF-8 support is best.
 
Jon Skeet [C# MVP]

Actually, as of .NET 2.0 there *is* a way of doing this using
string.Normalize and System.Text.NormalizationForm.

Look at
http://groups.google.com/group/microsoft.public.dotnet.general/tree/browse_frm/thread/78a09bd184351bc5/99f090af662c126c?rnum=11
(the last response, from Chris Mullins).

Here's the code posted, which does some upper-casing which isn't needed
in this case - but it should be okay aside from that.

Original code:

// Requires: using System.Text; (MessageBox lives in System.Windows.Forms)
string s = "áäåãòä:usdBDlGXHHA";
// FormKD splits each accented letter into a base letter plus combining marks
string normalized = s.Normalize(NormalizationForm.FormKD);

// ASCII encoding that drops anything it cannot encode instead of emitting '?'
Encoding ascii = Encoding.GetEncoding(
    "us-ascii",
    new EncoderReplacementFallback(string.Empty),
    new DecoderReplacementFallback(string.Empty));

byte[] encodedBytes = new byte[ascii.GetByteCount(normalized)];
int numberOfEncodedBytes = ascii.GetBytes(normalized, 0,
    normalized.Length,
    encodedBytes, 0);

string newString = ascii.GetString(encodedBytes).ToUpper();
MessageBox.Show(newString);

End of original code.


Here's a slightly simpler (IMO) version:

// Requires: using System.Text;
static string RemoveAccents (string input)
{
    string normalized = input.Normalize(NormalizationForm.FormKD);
    // ASCII encoding that silently drops anything it cannot represent
    Encoding removal = Encoding.GetEncoding
        (Encoding.ASCII.CodePage,
         new EncoderReplacementFallback(""),
         new DecoderReplacementFallback(""));

    byte[] bytes = removal.GetBytes(normalized);
    return Encoding.ASCII.GetString(bytes);
}

Or an alternative:

// Requires: using System.Globalization; using System.Text;
static string RemoveAccents (string input)
{
    string normalized = input.Normalize(NormalizationForm.FormKD);
    StringBuilder builder = new StringBuilder();
    foreach (char c in normalized)
    {
        // Keep everything except the combining marks left by normalization
        if (char.GetUnicodeCategory(c) !=
            UnicodeCategory.NonSpacingMark)
        {
            builder.Append(c);
        }
    }
    return builder.ToString();
}
 
Morten Wennevik [C# MVP]

Interesting.

Well, it would remove what Unicode defines as accents, which is what the OP asked for, but it does not map other characters to ASCII. Take the Norwegian æøå: only å is defined as having an accent, though æ and ø could be translated to a and o. The first method would eat æ and ø and return only "a", and the second would return "æøa".
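
One possible way to cover both cases is to consult a small lookup table for letters that have no Unicode decomposition while stripping the combining marks. This is only a sketch: the names AccentFolder and Fold are made up for illustration, and the replacements chosen here (æ to "ae", ø to "o") are assumptions that the framework does not define and that may be wrong for a given language.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

static class AccentFolder
{
    // Hypothetical table for letters that Normalize() leaves untouched.
    // The replacements are a judgment call and differ between languages.
    static readonly Dictionary<char, string> Extra = new Dictionary<char, string>
    {
        { '\u00E6', "ae" }, // æ
        { '\u00C6', "AE" }, // Æ
        { '\u00F8', "o" },  // ø
        { '\u00D8', "O" }   // Ø
    };

    public static string Fold(string input)
    {
        // FormKD splits 'å' into 'a' followed by a combining ring.
        string normalized = input.Normalize(NormalizationForm.FormKD);
        StringBuilder builder = new StringBuilder(normalized.Length);
        foreach (char c in normalized)
        {
            string replacement;
            if (Extra.TryGetValue(c, out replacement))
            {
                // Letter without a decomposition: use the table entry.
                builder.Append(replacement);
            }
            else if (char.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                // Ordinary character: keep it. Combining marks are skipped.
                builder.Append(c);
            }
        }
        return builder.ToString();
    }
}
```

With this table, AccentFolder.Fold("æøå") returns "aeoa". Characters that are neither in the table nor decomposable are passed through unchanged, so the result is not guaranteed to be pure ASCII.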
 
Jon Skeet [C# MVP]

Right. It's a shame there's not better support in the framework for
this, but as it's improved from 1.1 to 2.0 there's a chance it'll get
better in the future :)
 
UL-Tomten

æ and ø could be translated to a and o.

I don't think that makes sense for all languages. As far as I
understand Unicode normalization, æ is already normalized as far as
Unicode is concerned, according to the Latin normalization chart.
Decomposing it further risks emulating the dreaded "silent ASCII
treatment" that .NET gives strings unless you're careful, and should
probably take culture into account. In some regards, I think Unicode
normalization may even defeat the purpose of the ASCII-fication we're
discussing here, since the more information you have about a
character, the better you can ASCII-fy it. In German, ä is a fancy a,
but not in Swedish, and "normalization" would have to acknowledge
this. But we digress...
 
UL-Tomten

Is there a method to replace special characters like Ä [...]

Maybe knowing the reason why you're doing this can help us find you a
better solution?

A common example: turning strings into filenames on non-Unicode file
systems. In this case, using Encoding.ASCII with "" fallback (to avoid
question marks) is in my opinion not problematic, since the whole idea
is to truncate the input strings, and the resemblance between filename
and string is just a bonus. If you don't need that resemblance,
hashing strings makes things easier. If the purpose is something else,
maybe you need a different solution.

Either way, you should be prepared for the contingency that the string
has _only_ characters without ASCII counterparts, for example.
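
As a rough sketch of that contingency for the filename case: strip with an ASCII encoding that uses "" as its fallback, and if nothing survives, fall back to a hash of the original string. The names FileNameHelper and ToAsciiName, and the choice of a truncated SHA-256 digest as the fallback, are assumptions made for illustration only.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class FileNameHelper
{
    public static string ToAsciiName(string input)
    {
        // ASCII encoding that drops unencodable characters instead of
        // replacing them with question marks.
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(""),
            new DecoderReplacementFallback(""));

        string stripped = ascii.GetString(ascii.GetBytes(input)).Trim();
        if (stripped.Length > 0)
        {
            return stripped;
        }

        // Nothing had an ASCII counterpart: derive a name from a hash of
        // the UTF-8 bytes so that the result is never empty.
        using (SHA256 sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(input));
            // First 8 bytes as 16 hex digits.
            return BitConverter.ToString(hash, 0, 8).Replace("-", "");
        }
    }
}
```

This does not remove characters that are valid ASCII but invalid in filenames (such as ':' or '/'), and two different inputs can still strip down to the same name, so a real implementation would need to handle both.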
 
cody

Thank you very much, this will do it!
 
