How to remove accents (A-Umlaut to A)

cody · Aug 7, 2007

Is there a method to replace special characters like Ä (A-Umlaut) with
A, Ö (O-Umlaut) with O, and so on?
Sure, I could look for each character separately and replace it with its
ASCII counterpart, but there are also such special characters in French
and Swedish and many other languages which I also want to catch. Is
there a generic way to do it?

Morten Wennevik [C# MVP] · Aug 7, 2007

Hi Cody,

There is no generic way to do this. There is a hack that works in most cases, which involves encoding the string and then reading it back in a different encoding, but it is by no means guaranteed to work for you. Your best bet is to create a lookup table and manually translate each character. If you anticipate a wide variety of characters, maybe Unicode or UTF-8 support is best.

Jon Skeet [C# MVP] · Aug 7, 2007

Actually, as of .NET 2.0 there *is* a way of doing this using
System.Text.NormalizationForm.

Look at
http://groups.google.com/group/microsoft.public.dotnet.general/tree/browse_frm/thread/78a09bd184351bc5/99f090af662c126c?rnum=11
(the last response, from Chris Mullins).

Here's the code posted, which does some upper-casing which isn't needed
in this case - but it should be okay aside from that.

Original code:

    string s = "áäåãòä:usdBDlGXHHA";
    // Decompose accented letters into base letter + combining mark.
    string normalized = s.Normalize(NormalizationForm.FormKD);

    // ASCII encoding that drops whatever it cannot encode instead of emitting '?'.
    Encoding ascii = Encoding.GetEncoding(
        "us-ascii",
        new EncoderReplacementFallback(string.Empty),
        new DecoderReplacementFallback(string.Empty));

    byte[] encodedBytes = new byte[ascii.GetByteCount(normalized)];
    int numberOfEncodedBytes = ascii.GetBytes(normalized, 0,
        normalized.Length,
        encodedBytes, 0);

    string newString = ascii.GetString(encodedBytes).ToUpper();
    MessageBox.Show(newString);

End of original code.

Here's a slightly simpler (IMO) version:

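    // Needs: using System.Text;
    static string RemoveAccents (string input)
    {
        // Split each accented letter into its base letter plus combining marks.
        string normalized = input.Normalize(NormalizationForm.FormKD);
        // ASCII encoding with empty fallbacks: unencodable chars are dropped.
        Encoding removal = Encoding.GetEncoding
            (Encoding.ASCII.CodePage,
             new EncoderReplacementFallback(""),
             new DecoderReplacementFallback(""));

        byte[] bytes = removal.GetBytes(normalized);
        return Encoding.ASCII.GetString(bytes);
    }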

Or an alternative:

    // Needs: using System.Globalization; using System.Text;
    static string RemoveAccents (string input)
    {
        string normalized = input.Normalize(NormalizationForm.FormKD);
        StringBuilder builder = new StringBuilder();
        foreach (char c in normalized)
        {
            // Skip combining marks (the accents); keep everything else.
            if (char.GetUnicodeCategory(c) !=
                UnicodeCategory.NonSpacingMark)
            {
                builder.Append(c);
            }
        }
        return builder.ToString();
    }

Morten Wennevik [C# MVP] · Aug 7, 2007

Interesting.

Well, it would remove what Unicode defines as accents, which is what the OP asked for, but it does not normalize other characters into ASCII. Take the Norwegian æøå: only å is defined as having an accent, though æ and ø could be translated to a and o. The first method would eat æ and ø and return only "a", and the second would return "æøa".
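
To make the difference concrete, here is a small sketch that puts both of Jon's approaches side by side so they can be run on the same input. The class name AccentDemo and the method names ViaAsciiEncoding and ViaCategoryFilter are made up for this illustration; the bodies follow the code posted earlier in the thread.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical wrapper class, only here so both variants can coexist.
static class AccentDemo
{
    // Variant 1: decompose, then encode to ASCII and drop whatever does not fit.
    public static string ViaAsciiEncoding(string input)
    {
        string normalized = input.Normalize(NormalizationForm.FormKD);
        Encoding removal = Encoding.GetEncoding(
            Encoding.ASCII.CodePage,
            new EncoderReplacementFallback(""),
            new DecoderReplacementFallback(""));
        return Encoding.ASCII.GetString(removal.GetBytes(normalized));
    }

    // Variant 2: decompose, then drop only the combining marks.
    public static string ViaCategoryFilter(string input)
    {
        string normalized = input.Normalize(NormalizationForm.FormKD);
        StringBuilder builder = new StringBuilder();
        foreach (char c in normalized)
        {
            if (char.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                builder.Append(c);
            }
        }
        return builder.ToString();
    }
}
```

For "Ä Ö é" both variants give "A O e". For "æøå", ViaAsciiEncoding gives "a" (æ and ø have no decomposition, so they are simply dropped) while ViaCategoryFilter gives "æøa" (they pass through untouched).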

Jon Skeet [C# MVP] · Aug 7, 2007

Right. It's a shame there's not better support in the framework for
this, but as it's improved from 1.1 to 2.0 there's a chance it'll get
better in the future.

UL-Tomten · Aug 8, 2007

æ and ø could be translated to a and o.

I don't think that makes sense for all languages. As far as I
understand Unicode normalization, æ is normalized as far as Unicode is
concerned, according to the Latin normalization chart. Further
decomposition risks emulating the dreaded "silent ASCII treatment"
strings are given by .NET unless you're careful, and should likely
take culture into account. In some regards, I think Unicode
normalization may even defeat the purpose of the ASCII-fication we're
discussing here, since the more information you have about a
character, the better you can ASCII-fy it. In German, ä is a fancy a,
but not in Swedish, and "normalization" would have to acknowledge
this. But we digress...
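
One way to take culture into account is to run a per-language lookup table before the generic accent stripping. The following is only a sketch under assumptions: the class CultureAwareAscii, the method Asciify and the two tiny tables are invented for illustration, and the mappings shown (ä to "ae" for German, æ to "ae" for Norwegian, and so on) are common transliteration conventions rather than anything the framework provides. Languages without a table fall back to plain mark removal, and only lower-case letters are listed to keep it short.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper; tables are illustrative and far from complete.
static class CultureAwareAscii
{
    static readonly Dictionary<string, Dictionary<char, string>> Tables =
        new Dictionary<string, Dictionary<char, string>>
        {
            { "de", new Dictionary<char, string>
                { { 'ä', "ae" }, { 'ö', "oe" }, { 'ü', "ue" }, { 'ß', "ss" } } },
            { "nb", new Dictionary<char, string>
                { { 'æ', "ae" }, { 'ø', "oe" }, { 'å', "aa" } } },
        };

    public static string Asciify(string input, CultureInfo culture)
    {
        Dictionary<char, string> table;
        Tables.TryGetValue(culture.TwoLetterISOLanguageName, out table);

        // Compose first, so "a" + combining diaeresis is seen as the single char 'ä'.
        StringBuilder mapped = new StringBuilder();
        foreach (char c in input.Normalize(NormalizationForm.FormC))
        {
            string replacement;
            if (table != null && table.TryGetValue(c, out replacement))
            {
                mapped.Append(replacement);
            }
            else
            {
                mapped.Append(c);
            }
        }

        // Generic fallback for everything the table did not cover:
        // decompose and drop the combining marks.
        StringBuilder result = new StringBuilder();
        foreach (char c in mapped.ToString().Normalize(NormalizationForm.FormKD))
        {
            if (char.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                result.Append(c);
            }
        }
        return result.ToString();
    }
}
```

With these tables, Asciify("Müller", new CultureInfo("de-DE")) gives "Mueller", while the same call with new CultureInfo("sv-SE") finds no table and gives "Muller". Whether that fallback is acceptable for a given language is exactly the judgment call discussed above.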

UL-Tomten · Aug 8, 2007

Is there a method to replace special characters like Ä [...]

Maybe knowing the reason why you're doing this can help us find you a
better solution?

A common example: turning strings into filenames on non-Unicode file
systems. In this case, using Encoding.ASCII with "" fallback (to avoid
question marks) is in my opinion not problematic, since the whole idea
is to truncate the input strings, and the resemblance between filename
and string is just a bonus. If you don't need that resemblance,
hashing strings makes things easier. If the purpose is something else,
maybe you need a different solution.

Either way, you should be prepared for the contingency that the string
has _only_ characters without ASCII counterparts, for example.
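
A minimal sketch of the filename case, including the "nothing left" contingency, might look like this. FileNameHelper and ToAsciiFileName are hypothetical names, and note two choices that go beyond what is described above: the filter keeps only ASCII letters and digits (stricter than Encoding.ASCII with a "" fallback, which would let characters like ':' through), and SHA-1 is just one possible hash for the fallback.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for turning arbitrary titles into ASCII-only file names.
static class FileNameHelper
{
    public static string ToAsciiFileName(string title)
    {
        string normalized = title.Normalize(NormalizationForm.FormKD);
        StringBuilder builder = new StringBuilder();
        foreach (char c in normalized)
        {
            // Keep ASCII letters and digits only; accents, spaces,
            // punctuation and non-ASCII letters are all dropped.
            if (c < 128 && char.IsLetterOrDigit(c))
            {
                builder.Append(c);
            }
        }

        if (builder.Length > 0)
        {
            return builder.ToString();
        }

        // Nothing survived (e.g. the title had no ASCII counterparts at all):
        // fall back to a hex-encoded hash of the original string.
        using (SHA1 sha = SHA1.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(title));
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

For example, "Ärger über Öl" becomes "ArgeruberOl", whereas a title such as "日本語" has nothing left after filtering and comes back as a 40-character hex string instead of an empty name. Distinct titles can still collapse to the same ASCII name, so a real implementation would also need to handle collisions.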

cody · Aug 9, 2007

Thank you very much, this will do it!