Removing non-ascii characters from a string

Eps

Hi there,

I believe all strings in .NET are Unicode by default. I am looking for a
way to remove all non-ASCII characters from a string (or optionally
replace them).

There is an article on Code Project which looks like it does roughly
what I want, but I can't help thinking it makes it more complex than it
needs to be.

I have looked at the MSDN pages on Encodings, but I am not very
familiar with this topic.

If I can get a list of ASCII characters, then it should be easy to write
a method that checks each char against the list and performs the replace
or remove operation if necessary. Yet I can't find anything exactly
like this with trusty old Google. Is there something I am missing?

If it helps, the reason I need this is that I am writing a front end
for the lame command-line MP3 encoder. It doesn't like being passed, or
asked to output to, file paths containing non-ASCII characters.
 
Anthony Jones

Perhaps I'm missing something, but this code (it needs using System and
using System.Text):

byte[] asciiChars = Encoding.ASCII.GetBytes("AB £ CD"); // '£' is encoded as '?'
string result = Encoding.ASCII.GetString(asciiChars);
Console.WriteLine(result);

creates the string:

AB ? CD
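
If the '?' is not what you want, a minimal sketch along the following lines may help. It assumes the overload of Encoding.GetEncoding that accepts fallback objects; the class and method names (AsciiFallbackDemo, ToAscii) are made up for illustration. Passing an empty replacement string should drop the unsupported characters instead of replacing them.

```csharp
using System.Text;

public static class AsciiFallbackDemo
{
    // Round-trips text through ASCII, substituting `replacement` for every
    // character that ASCII cannot represent. Pass "" to drop such characters.
    public static string ToAscii(string text, string replacement)
    {
        // Ask for an ASCII encoding whose encoder uses our replacement
        // instead of the default '?'.
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(replacement),
            new DecoderReplacementFallback("?"));
        return ascii.GetString(ascii.GetBytes(text));
    }
}
```

With this, ToAscii("AB \u00A3 CD", "") gives "AB  CD" (two spaces) and ToAscii("AB \u00A3 CD", "_") gives "AB _ CD".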
 
Eps

I have seen this code before. Can anyone explain why the
Encoding.ASCII.GetString() method does not accept a string as a parameter?
 
Pavel Minaev

Eps said:
I have seen this code before. Can anyone explain why the
Encoding.ASCII.GetString() method does not accept a string as a parameter?

Because Encoding classes encode and decode CLR strings (which are
_always_ Unicode) to/from byte arrays in a specified encoding, typically
for serialization or interop purposes. There's no such thing as a
non-Unicode System.String (well, you could treat a string as a plain array
of char, but any .NET function will still treat the string as UTF-16).

What you ask is still possible, because ASCII is a strict subset of
Unicode. With LINQ (using System.Linq), you could use this one-liner:

string ascii = new string(s.Where(c => (int)c >= 0 && (int)c <= 127).ToArray());

Note, however, that "ascii" would still be a Unicode string; it just
wouldn't contain any non-ASCII characters.
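
Since the original question also asked about replacing rather than removing, here is a small sketch of both variants in the same LINQ style. The names AsciiFilter, Remove and Replace are only illustrative. The lower-bound check is left out because char is an unsigned 16-bit value and can never be below 0.

```csharp
using System.Linq;

public static class AsciiFilter
{
    // Drops every char above U+007F.
    public static string Remove(string s)
    {
        return new string(s.Where(c => c <= '\u007F').ToArray());
    }

    // Keeps the string length, swapping each non-ASCII char for `replacement`.
    // A character outside the BMP is stored as two chars (a surrogate pair),
    // so it yields two replacement chars.
    public static string Replace(string s, char replacement)
    {
        return new string(s.Select(c => c <= '\u007F' ? c : replacement).ToArray());
    }
}
```

For example, AsciiFilter.Remove("AB \u00A3 CD") returns "AB  CD" and AsciiFilter.Replace("AB \u00A3 CD", '_') returns "AB _ CD".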
 
Anthony Jones

Eps said:
I have seen this code before. Can anyone explain why the
Encoding.ASCII.GetString() method does not accept a string as a parameter?

As you have already identified, all strings in .NET are Unicode. Hence you'd
be asking GetString to take a Unicode string and return a Unicode string.

I understand what you are saying: you would like it to take a Unicode string
that contains any characters and convert it to a Unicode string that
contains only characters found in the ASCII range.

It would reduce this:

string sOut = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s));

to this:

string sOut = Encoding.ASCII.GetString(s);

However, if the .NET Framework added a shortcut for every possible scenario
of that sort, the framework would become huge and unwieldy.

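If you have C# 3, you can do this for yourself with an extension method:

using System.Text;

public static class Exts
{
    // Encodes s with enc and decodes it again. With the default fallback,
    // characters enc cannot represent come back as '?' (e.g. for ASCII).
    public static string GetString(this Encoding enc, string s)
    {
        return enc.GetString(enc.GetBytes(s));
    }
}

When a code file has a using directive for the namespace to which the
above class belongs, instances of Encoding types will have the overload
you desire.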
 
Anthony Jones

Pavel said:
string ascii = new string(s.Where(c => (int)c >= 0 && (int)c <= 127).ToArray());

The problem with a shiny new tool is that we start to search for excuses to
use it, even when it's not the appropriate tool for the job. ;)
 
Alberto Poblacion

Eps said:
I have seen this code before. Can anyone explain why the
Encoding.ASCII.GetString() method does not accept a string as a parameter?

The Encoding classes have a pair of functions that convert between
arrays of bytes and strings.
GetBytes takes a Unicode String as input and returns an array of bytes
that represents that String converted to the chosen encoding.
GetString performs the opposite conversion (from the byte array into a
Unicode String), which is why it takes a byte array instead of a string.
Note that Unicode Strings are the only kind of strings that .NET
supports; text in any other encoding is handled as a byte array. So it
wouldn't make sense to write a GetString taking a String and returning a
String, because it would have nothing to convert. That is, if by String we
mean System.String, which is meant to contain Unicode. Nothing stops you
from writing MyNamespace.MyString and using that class to encapsulate
anything that you want.
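
To make the GetBytes/GetString pairing a bit more concrete, here is a small sketch. The helper names (EncodingRoundTrip, ByteCount, RoundTrip) are invented for this example; only Encoding.GetBytes and Encoding.GetString come from the framework.

```csharp
using System.Text;

public static class EncodingRoundTrip
{
    // Number of bytes `s` occupies once encoded with `enc`.
    public static int ByteCount(string s, Encoding enc)
    {
        return enc.GetBytes(s).Length;
    }

    // Encodes then decodes. The result equals `s` only if `enc` can
    // represent every character in `s`.
    public static string RoundTrip(string s, Encoding enc)
    {
        return enc.GetString(enc.GetBytes(s));
    }
}
```

For the 7-character string "AB \u00A3 CD", ByteCount gives 7 with Encoding.ASCII, 8 with Encoding.UTF8 and 14 with Encoding.Unicode (UTF-16). RoundTrip returns the original string for UTF8 and Unicode, but "AB ? CD" for ASCII.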
 
Eps

Pavel said:
With LINQ, you could use this one-liner:

string ascii = new string(s.Where(c => (int)c >= 0 && (int)c <= 127).ToArray());

Note however that "ascii" would still be a Unicode string - it just
wouldn't contain any non-ASCII characters.

Thanks for your replies, I think I have a better understanding of it
now. The code above is pretty much exactly what I am looking for. All I
need to do is make sure the strings that I pass to lame contain only
ASCII characters. I can't test it now, but it should work.
 
Arne Vajhøj

Eps said:
I believe all strings in .NET are Unicode by default. I am looking for a
way to remove all non-ASCII characters from a string (or optionally
replace them).

I would use:

s = Regex.Replace(s, @"[^\u0000-\u007F]", "");

Arne
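
A possible variation on the same pattern, offered only as a sketch: adding a + quantifier makes the regex match whole runs of non-ASCII characters, so each run can be replaced by a single placeholder. AsciiRegex and ReplaceRuns are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class AsciiRegex
{
    // Matches one or more consecutive chars outside U+0000..U+007F.
    private static readonly Regex NonAsciiRun = new Regex(@"[^\u0000-\u007F]+");

    // Replaces each run of non-ASCII chars with `replacement`
    // (use "" to remove them entirely).
    public static string ReplaceRuns(string s, string replacement)
    {
        return NonAsciiRun.Replace(s, replacement);
    }
}
```

For example, ReplaceRuns("a\u00E9\u00E8b", "_") returns "a_b", and ReplaceRuns("AB \u00A3 CD", "") returns "AB  CD".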
 
Mihai N.

Eps said:
I believe all strings in .NET are Unicode by default. I am looking for a
way to remove all non-ASCII characters from a string (or optionally
replace them).

You have some solutions, but be careful what you do with them.
If you have to validate some input, then check whether there are non-ASCII
characters and show an error.
Just removing what you don't like is guaranteed to result in junk.
(What is the meaning of a French word with missing characters?)

A bit like this:
- you ask for a number
- I give you 0x1A3C7
- you want a decimal number

Option 1: validate and complain that this is not a decimal number
Option 2: remove x, A and C, then interpret 0137 as a decimal number
Option 2 is the equivalent of what you do now with non-ASCII stuff :)
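
A minimal sketch of the validate-and-complain route (Option 1), assuming you want to report where the first offending character is. The names AsciiValidator, IndexOfNonAscii and RequireAscii are invented for this example.

```csharp
using System;

public static class AsciiValidator
{
    // Returns the index of the first non-ASCII char, or -1 if there is none.
    public static int IndexOfNonAscii(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            if (s[i] > '\u007F') return i;
        }
        return -1;
    }

    // Throws instead of silently altering the input.
    public static void RequireAscii(string s, string paramName)
    {
        int index = IndexOfNonAscii(s);
        if (index >= 0)
        {
            throw new ArgumentException(
                "Non-ASCII character U+" + ((int)s[index]).ToString("X4") +
                " at index " + index + ".", paramName);
        }
    }
}
```

IndexOfNonAscii("AB \u00A3 CD") returns 3, and RequireAscii on the same string throws an ArgumentException that names U+00A3.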
 
Eps

I do agree. If this were a business app, I would definitely be
complaining about bad input data.

It's a tool for transcoding MP3s for my own personal use, and I doubt
anyone else will ever use it. But you are correct, certain Unicode strings
like for example...

Góðan daginn by Sigur Rós from Með suð í eyrum við spilum endalaust

come out badly mangled after the Unicode to ASCII conversion. Unfortunately
I don't see how I can avoid this, and it's something I am personally
willing to put up with.
 
Eps

Actually, I could have lame output to a temporary file with an ASCII-only
path and then use .NET, which handles Unicode paths, to copy the file to
the correct output file path.

I will look into that.
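
A rough sketch of that idea, under some loud assumptions: the encoder call is represented by a hypothetical delegate (encodeTo) rather than a real lame invocation, and the temporary directory itself is assumed to have an ASCII-only path, which may not hold if the Windows user name contains non-ASCII characters. TempOutput and EncodeViaTempFile are made-up names.

```csharp
using System;
using System.IO;

public static class TempOutput
{
    // Runs `encodeTo` against a temporary path, then moves the result to
    // `finalPath`, which may contain any Unicode characters.
    // `encodeTo` is a stand-in for launching the external encoder.
    public static void EncodeViaTempFile(Action<string> encodeTo, string finalPath)
    {
        // Path.GetTempFileName creates an empty file and returns its path.
        string tempPath = Path.GetTempFileName();
        try
        {
            encodeTo(tempPath);
            if (File.Exists(finalPath)) File.Delete(finalPath);
            File.Move(tempPath, finalPath);
        }
        finally
        {
            // Clean up the temp file if the encoder or the move failed.
            if (File.Exists(tempPath)) File.Delete(tempPath);
        }
    }
}
```

In the real tool, the delegate would start the lame process with the temporary path as its output argument and wait for it to exit.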
 
Mihai N.

Eps said:
Actually I could output to a temporary file and then use Unicode .NET to
copy the file to the correct output file path.

I am not sure I understand, but putting together the MP3 part with the path
part, I think that you are trying to automatically organize/rename MP3 files
based on the ID tags inside the MP3 (or something similar).

Although you can use Unicode file names from .NET, you might have
some trouble with many of the MP3 players out there that are still
in the 1990s and have never heard of Unicode :)

If this is what you need, you might be able to do better:
1. Use full Unicode file names and choose a good MP3 player
   (Windows Media Player can do that, no problem, and I am sure
   there must be others).
2. Don't convert to ASCII, but to 1252 (the Western European code page).
   If you are on a system that uses 1252 as its ANSI code page,
   this will preserve a lot of the characters, and non-Unicode players
   will still work.
3. Don't remove the characters with accents, but try to remove the
   accents and keep the "base" character. Here is how:
   http://blogs.msdn.com/michkap/archive/2005/02/19/376617.aspx
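
For the third option, the commonly used approach is to normalize to decomposed form and drop the combining marks. The following is a sketch of that approach, not a copy of the linked article's code; AccentStripper and RemoveAccents are illustrative names.

```csharp
using System.Globalization;
using System.Text;

public static class AccentStripper
{
    // Decomposes each char into base char + combining marks (form D),
    // drops the marks, then recomposes what is left (form C).
    public static string RemoveAccents(string s)
    {
        string decomposed = s.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

RemoveAccents("Sigur R\u00F3s") returns "Sigur Ros". Letters that have no decomposition, such as ð (U+00F0), pass through unchanged, so a result is not guaranteed to be pure ASCII and may still need one of the earlier filters.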
 
Eps

Good points.

I don't write any tags using lame. I use TagLibSharp to read the tags
from the original file and copy them to the output file. I am not aware
of any encoding issues, but I will look into it.
 
