Getting rid of non-English characters

TeekUS

Hey people, I am currently developing a parsing application. My input is a
10 MB English text file. The parsing works fine, but every now and then
a non-English character appears that messes everything up. I need to get
rid of all these characters before I parse. Help!

Teekus
(P.S. I used Regex.Replace, but that did not take out the non-English
characters!)
 
Jon Skeet [C# MVP]

If it's an English text file but you're reading some non-English
characters, that suggests that you're losing data - I'd worry about
that to start with. Have you looked at the file to see what's actually
there where you're getting incorrect data?
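
One way to look at what is actually there, as a minimal sketch: list every character above 127 with its position and code point. The names NonAsciiReport and Find are made up for illustration, and the file path is assumed to be the first command-line argument. A StreamReader opened like this decodes as UTF-8 unless it finds a byte order mark, so a report full of U+FFFD would hint at bytes that are not valid UTF-8, which may mean the file is in a different encoding.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class NonAsciiReport
{
    // Returns one "line N, column M: U+XXXX" entry per character above 127.
    public static List<string> Find(TextReader reader)
    {
        var hits = new List<string>();
        string line;
        int lineNumber = 0;

        while ((line = reader.ReadLine()) != null)
        {
            lineNumber++;
            for (int i = 0; i < line.Length; i++)
            {
                if (line[i] > 127)
                {
                    hits.Add(string.Format("line {0}, column {1}: U+{2:X4}",
                                           lineNumber, i + 1, (int)line[i]));
                }
            }
        }
        return hits;
    }

    static void Main(string[] args)
    {
        // Assumption: args[0] is the path of the file to inspect.
        using (var reader = new StreamReader(args[0]))
        {
            foreach (string hit in Find(reader))
            {
                Console.WriteLine(hit);
            }
        }
    }
}
```

Find takes a TextReader rather than a path, so it can be tried on a StringReader before pointing it at the real 10 MB file.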
 
TeekUS

Jon: no, I don't think I am losing data. The BSCs sometimes just send
down some rubbish in their files, for a reason we do not yet know. Right
now all I need is to take these characters out so that my parser can run
smoothly. (If I replace the character with any other English character,
it is handled correctly.) Suggestions?

Teekus
 
Jon Skeet [C# MVP]

Well, the easiest way would be to do something like:

1) Read the file line by line
2) For each line, check whether or not there are any non-ASCII
characters
3) If there are, use ToCharArray to get a character array for the
string, then run through that array and convert any non-ASCII character
into '?', then convert the char array to a string
4) Do whatever you want to do with the line.

Something like:

using (StreamReader reader = ...)
{
    string line;

    while ((line = reader.ReadLine()) != null)
    {
        // First pass: check whether the line has any non-ASCII characters
        bool hasNonAscii = false;
        foreach (char c in line)
        {
            if (c > 127)
            {
                hasNonAscii = true;
                break;
            }
        }

        // Only copy the line if something actually needs replacing
        if (hasNonAscii)
        {
            char[] chars = line.ToCharArray();
            for (int i = 0; i < chars.Length; i++)
            {
                if (chars[i] > 127)
                {
                    chars[i] = '?';
                }
            }
            line = new string(chars);
        }

        // Do whatever with line
    }
}
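
If the characters should be dropped entirely instead of turned into '?', a variation along these lines might do. This is only a sketch: AsciiFilter, Strip and StripWithRegex are hypothetical names, and the regex pattern is one possible way to write "anything outside the ASCII range", not necessarily the pattern that was tried earlier in the thread.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class AsciiFilter
{
    // Copies only the characters in the ASCII range (0-127) to the result.
    public static string Strip(string line)
    {
        var sb = new StringBuilder(line.Length);
        foreach (char c in line)
        {
            if (c <= 127)
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    // Matches any single character outside U+0000..U+007F.
    private static readonly Regex NonAscii = new Regex(@"[^\u0000-\u007F]");

    // Same result as Strip, expressed with Regex.Replace.
    public static string StripWithRegex(string line)
    {
        return NonAscii.Replace(line, "");
    }

    static void Main()
    {
        Console.WriteLine(Strip("caf\u00e9 ok"));          // prints "caf ok"
        Console.WriteLine(StripWithRegex("caf\u00e9 ok")); // prints "caf ok"
    }
}
```

Either method could be called on each line inside the ReadLine loop above. Dropping characters shifts everything after them to the left, so if the parser relies on fixed column positions, replacing with '?' is probably the safer choice.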