Hello,
I am coming back to a project and I don't remember what the following
regex does.
I do know it removes all \r\n from the string, but I don't see how.
Can someone explain this one?
Regex re = new Regex(@"([\x00-\x1F\x7E-\xFF]+)",
RegexOptions.Compiled);
string op = re.Replace(FileToParse, "");
How does it work? The outer parentheses are redundant IMHO. The regex
boils down to a positive character group with two ranges, whose start
and end points are expressed as hexadecimal escapes: \x00-\x1F (0 to 31
in decimal) and \x7E-\xFF (126 to 255 in decimal). With the appended
"+", it basically means "one or more consecutive characters in the
range 0-31 or 126-255".
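To make the two ranges concrete, here is a minimal sketch that probes the character class one code point at a time. The class name CharClassDemo and its two helper methods are made up for illustration; only the pattern itself comes from the question (minus the redundant parentheses).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original project.
public static class CharClassDemo
{
    // Same character class as in the question, without the outer group.
    private static readonly Regex Re = new Regex(@"[\x00-\x1F\x7E-\xFF]+");

    // True if the single character c falls into one of the two ranges.
    public static bool IsMatched(char c) => Re.IsMatch(c.ToString());

    // Counts how many code points in 0..maxCodePoint the class matches.
    public static int CountMatched(int maxCodePoint) =>
        Enumerable.Range(0, maxCodePoint + 1).Count(i => IsMatched((char)i));
}
```

With this, IsMatched('\r') and IsMatched('~') should both be true, while IsMatched('}') (125) should be false. Counting over 0..511 should give 162 matches: 32 code points from the first range plus 130 from the second.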
Replacing all these occurrences with nothing (empty string) does far
more than just remove \r and \n - it removes all characters in the
range 0-31 and 126-255. The intention is probably to kill anything
that is not in the printable "ASCII" range. Unfortunately, it also
kills the tilde "~" (126).
It will also remove e.g. accented and umlaut characters in the range
128-255. What it will NOT remove are Unicode characters from 256
upwards.
Try e.g.
string originalString = "Testing <\u00e7> <\u0107> ";
Regex re = new Regex(@"([\x00-\x1F\x7E-\xFF]+)",
RegexOptions.Compiled);
string replacedString = re.Replace(originalString, "");
MessageBox.Show(originalString);
MessageBox.Show(replacedString);
The first "special" character, a lowercase c with cedilla (U+00E7, 231
in decimal), will be removed. The second one, a lowercase c with acute
accent (U+0107, 263 in decimal), will not be affected.
(My suggestion, if your intention is to remove anything not in the
range 32-126, would be to use this:
Regex re = new Regex(@"[^\x20-\x7E]+", RegexOptions.Compiled);
instead.)
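For a side-by-side comparison, here is a small sketch that applies both patterns to the same input. The class name AsciiFilter and its method names are hypothetical; the two patterns are the ones from this thread, with the redundant parentheses dropped from the original.

```csharp
using System.Text.RegularExpressions;

// Hypothetical wrapper, only meant to compare the two patterns.
public static class AsciiFilter
{
    // Pattern from the question: removes 0-31 and 126-255.
    private static readonly Regex Original =
        new Regex(@"[\x00-\x1F\x7E-\xFF]+", RegexOptions.Compiled);

    // Suggested pattern: removes everything outside 32-126.
    private static readonly Regex Suggested =
        new Regex(@"[^\x20-\x7E]+", RegexOptions.Compiled);

    public static string StripOriginal(string s) => Original.Replace(s, "");

    public static string StripSuggested(string s) => Suggested.Replace(s, "");
}
```

For the input "a~b\r\n\u00e7\u0107c", StripOriginal should drop the tilde, the line break and U+00E7 but keep U+0107, while StripSuggested should keep the tilde and drop both accented characters, leaving "a~bc". Note that both patterns also strip tabs, since \t is 9.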
Regards,
Gilles.