Ignoring spaces in regular expression matching

Mark Rae

Hi,

I'm trying to construct a RegEx pattern which will validate a string so that
it can contain:

- only the numerical characters from 0 to 9 i.e. no decimal points, negative
  signs, exponentials etc
- only the 26 letters of the standard Western alphabet in either upper or
  lower case
- spaces i.e. ASCII character 32

I seem to be doing OK with the first two criteria, but am having trouble
with the space character.

E.g. the following works perfectly:

Regex.IsMatch("ThisIsThe2ndString", @"[^0-9][^a-z][^A-Z]")

However, this doesn't work:

Regex.IsMatch("This Is The 2nd String", @"[^0-9][^a-z][^A-Z]")

I've tried various combinations of [\s] and [^\s] but with little success.

However, the following works, though I don't really understand why:

Regex.IsMatch("This is the 2nd string", @"[^0-9][^a-z][^A-Z]",
RegexOptions.IgnoreCase)

Any assistance gratefully received.

Mark
 
Paul E Collins

Mark said:
I'm trying to construct a RegEx pattern which will
validate a string so that it can contain [only digits,
letters and spaces]

I think you want something like this:
^[a-zA-Z0-9 ]*$
i.e. every character between ^ start and $ end must be in the [group],
and there can be * zero or more of them (you'd use + if you want at
least one character in there). Be aware that "\s" would match some
things that aren't spaces (like tabs and newlines).
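As a rough sketch of the points above (the class and field names here are made up for illustration, not from anyone's actual code), this compares * with +, and a literal space with \s:

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchoredClassDemo
{
    // Zero or more letters, digits or literal spaces; the empty string passes.
    public static readonly Regex ZeroOrMore = new Regex("^[a-zA-Z0-9 ]*$");

    // Same character set, but at least one character is required.
    public static readonly Regex OneOrMore = new Regex("^[a-zA-Z0-9 ]+$");

    // \s instead of the literal space also admits tabs and newlines.
    public static readonly Regex AnyWhitespace = new Regex(@"^[a-zA-Z0-9\s]*$");

    public static void Main()
    {
        Console.WriteLine(ZeroOrMore.IsMatch(""));         // True
        Console.WriteLine(OneOrMore.IsMatch(""));          // False
        Console.WriteLine(ZeroOrMore.IsMatch("a\tb"));     // False (tab is not a space)
        Console.WriteLine(AnyWhitespace.IsMatch("a\tb"));  // True
    }
}
```

One thing to watch: in .NET, $ also matches just before a trailing newline at the very end of the string, so \z may be the safer end anchor if that matters to you.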

Of course, if you're having special trouble with spaces, you could do
s.Replace(" ", "") first to get rid of them in your validator.

Finally, I'm not convinced that regexes are ideal in .NET for this
kind of trivial check (as opposed to something complicated like nested
expressions and optional segments), because they're a special library
call and not a native operator as in Perl, which I suspect you might
have come from. I expect a loop like this would be more efficient:

// s is the string being validated
bool valid = true;
for (int i = 0; i < s.Length; i++)
{
    if (!((s[i] >= 'A' && s[i] <= 'Z') || (s[i] >= 'a' && s[i] <= 'z')
        || (s[i] >= '0' && s[i] <= '9') || s[i] == ' '))
    {
        valid = false;
        break;
    }
}

Eq.
 
Tasos Vogiatzoglou

string[] strs = new string[] { "ABC123", "ABC1.1", "ABC 123", "ABC 123 .." };

string srx = @"[^\.]+|[\w\s\d]+";
Regex rx = new Regex(srx, RegexOptions.ECMAScript);

foreach (string str in strs)
{
    // The string passes only if the first match covers all of it
    Console.WriteLine("{0} {1}", str,
        rx.Match(str).Length == str.Length);
}

This works (if I understood your problem correctly). IsMatch returns
true for any match anywhere in the string, so I don't think it is the
one you want.
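To illustrate the IsMatch point, here is a small sketch (the class and method names are hypothetical) that runs the same character class with and without the ^ and $ anchors:

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // True if ANY run of allowed characters exists somewhere in s.
    public static bool Unanchored(string s)
    {
        return Regex.IsMatch(s, "[a-zA-Z0-9 ]+");
    }

    // True only if the WHOLE string consists of allowed characters.
    public static bool Anchored(string s)
    {
        return Regex.IsMatch(s, "^[a-zA-Z0-9 ]+$");
    }

    public static void Main()
    {
        foreach (string s in new[] { "ABC 123", "ABC1.1" })
        {
            Console.WriteLine("{0}: unanchored={1}, anchored={2}",
                s, Unanchored(s), Anchored(s));
        }
        // ABC 123: unanchored=True, anchored=True
        // ABC1.1: unanchored=True, anchored=False
    }
}
```

Without anchors, "ABC1.1" still passes because "ABC1" on its own is a match.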

Regards,
Tasos
 
Kevin Spencer

You can use a literal space in your character set:

(?i)[^a-z 0-9]

The "(?i)" indicates case-insensitivity. Note the literal space between
"a-z" and "0-9". This excludes the space character as well.

The "\s" indicates *any* white-space character, including such things as
tabs. If that is what you want, use:

(?i)[^a-z\s0-9]
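A minimal sketch of how such a negated set might be used as a validator (IsValid is a made-up helper name). The pattern finds a character that is NOT allowed, so the string is valid when there is no match:

```csharp
using System;
using System.Text.RegularExpressions;

public static class NegatedClassDemo
{
    // The pattern matches any single character that is not a letter,
    // a digit or a literal space, so a match means the string is invalid.
    public static bool IsValid(string s)
    {
        return !Regex.IsMatch(s, "(?i)[^a-z 0-9]");
    }

    public static void Main()
    {
        Console.WriteLine(IsValid("This Is The 2nd String")); // True
        Console.WriteLine(IsValid("ThisIsThe2ndString"));     // True
        Console.WriteLine(IsValid("ABC1.1"));                 // False (the '.')
        Console.WriteLine(IsValid("a\tb"));                   // False (tab is not a space)
    }
}
```

No anchors are needed with this approach, since a single offending character anywhere in the string is enough to produce a match.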

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Numbskull

The man who questions opinions is wise.
The man who quarrels with facts is a fool.
 
Mark Rae

You can use a literal space in your character set:

(?i)[^a-z 0-9]

The "(?i)" indicates case-insensitivity. Note the literal space between
"a-z" and "0-9". This excludes the space character as well.

The "\s" indicates *any* white-space character, including such things as
tabs. If that is what you want, use:

(?i)[^a-z\s0-9]

Excellent! Thanks very much.
 
Mark Rae

I think you want something like this:
^[a-zA-Z0-9 ]*$
i.e. every character between ^ start and $ end must be in the [group], and
there can be * zero or more of them (you'd use + if you want at least one
character in there).

Doesn't work...

Of course, if you're having special trouble with spaces, you could do
s.Replace(" ", "") first to get rid of them in your validator.

I could do that, or even not do any validation at all...

Finally, I'm not convinced that regexes are ideal in .NET for this kind of
trivial check (as opposed to something complicated like nested expressions
and optional segments), because they're a special library call and not a
native operator as in Perl, which I suspect you might have come from.

I've never written a line of Perl in my life...

I expect a loop like this would be more efficient:

I wouldn't know...
 
Jon Skeet [C# MVP]

Mark Rae said:
It doesn't.

When a proposed solution doesn't work, could you explain in what way?
It makes life a lot easier for people who want to make further
suggestions.
 
Mark Rae

When a proposed solution doesn't work, could you explain in what way?

I'm afraid I can't in this case, other than to say it always seems to find a
match no matter what string I pass into it...

I simply don't know enough about regular expressions to make a valuable
response - I don't mind confessing that it remains one area of coding which
I find very difficult to get my head around, to the extent where I still
find it difficult to look at even the simplest of patterns and understand
instinctively what it's trying to do...

It makes life a lot easier for people who want to make further
suggestions.

I couldn't agree more! However, in this case, Kevin Spencer has solved my
problem completely.
 
Jon Skeet [C# MVP]

Mark Rae said:
I'm afraid I can't in this case, other than to say it always seems to find a
match no matter what string I pass into it...

That's enough - just an example of something which should fail but
passes would be good.

I simply don't know enough about regular expressions to make a valuable
response

A sample which doesn't do what you want it to is the most valuable
response you can make in this case :)

I couldn't agree more! However, in this case, Kevin Spencer has solved my
problem completely.

Right. I'd still be interested in an example which should fail but
passes, so I can try to beef up my own regex experience.
 
Mark Rae

That's enough - just an example of something which should fail but
passes would be good.

A sample which doesn't do what you want it to is the most valuable
response you can make in this case :)

See the reply I'm referring to:
IsMatch returns true for any match in the string so I don't think this is
the one you want.

That's correct - no matter what string I pass into it, it always returns
true...
 
Kevin Spencer

Hi Mark,

I may be able to help you there. It helps to understand how the Regular
Expressions Engine works. First, it evaluates a character at a time, and it
is procedural in nature. A regular expression is like a series of
instructions, rather than a real single pattern. In your case:
Regex.IsMatch("This is the 2nd string", @"[^0-9][^a-z][^A-Z]",
RegexOptions.IgnoreCase)

Basically, this is using character classes. A character class is a series of
tokens inside square brackets, and it can be translated as "this type of
character or this type of character or this type of character..." In other
words, multiple character types or literals are joined with an implicit "or"
operator:

[\dA!] literally means "any single digit or an 'A' or an '!' character".
Note that it also implies a singular value, that is, one character.
Quantifiers are used to indicate that anything in the character class is
repeated 0, 1 or more times, as in:

[\dA!] (any of these characters 1 time)
[\dA!]* (any of these characters 0 or more times)
[\dA!]+ (any of these characters 1 or more times)
etc.
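To make the quantifiers concrete, here is a small sketch (FirstMatch is a made-up helper and "x12A!y" is just a sample input) that prints the first match each pattern finds:

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Returns the text of the first match, or "" if the match is empty.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        string input = "x12A!y";
        Console.WriteLine("[{0}]", FirstMatch(input, @"[\dA!]"));  // [1]
        Console.WriteLine("[{0}]", FirstMatch(input, @"[\dA!]+")); // [12A!]
        // 'x' is not in the class, but zero repetitions is allowed,
        // so * succeeds immediately with an empty match at position 0.
        Console.WriteLine("[{0}]", FirstMatch(input, @"[\dA!]*")); // []
    }
}
```

The last line is the usual surprise with *: an empty match still counts as a match.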

The '^' at the start of a character class is the logical "Not" operator,
which means "Not any of these characters."

So, you had at first "[^0-9]" (Not a digit between 0 and 9)
followed by "[^a-z]" (Not a character between a and z)
and followed by "[^A-Z]" (Not a character between A and Z)

Now, remember that it's looking for a match. A match satisfies *all* of the
criteria you specify, in order, so you can think of this as joining all of
these character classes with "followed by" as in:

"A character that is not a digit between 0 and 9, followed by a character
that is not between a and z, followed by a character that is not between A
and Z."

Note that the space character is not in any of those sets, so it can fill
any of the three positions. Using negation is tricky. In fact, *any* three
consecutive characters that pass those three tests, anywhere in the string,
would be a match.
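As a rough illustration, this sketch (FirstMatch is a made-up helper) shows which three characters the original pattern actually picks out of the sample strings from this thread. Each character class consumes one character, in order:

```csharp
using System;
using System.Text.RegularExpressions;

public static class SequenceDemo
{
    // The pattern from the original question: three classes, three characters.
    const string Pattern = @"[^0-9][^a-z][^A-Z]";

    // Returns the first match found, or "(no match)".
    public static string FirstMatch(string input, RegexOptions options)
    {
        Match m = Regex.Match(input, Pattern, options);
        return m.Success ? m.Value : "(no match)";
    }

    public static void Main()
    {
        // non-digit 's', non-lowercase 'I', non-uppercase 's'
        Console.WriteLine("[{0}]", FirstMatch("ThisIsThe2ndString", RegexOptions.None));           // [sIs]
        // non-digit ' ', non-lowercase 'I', non-uppercase 's'
        Console.WriteLine("[{0}]", FirstMatch("This Is The 2nd String", RegexOptions.None));       // [ Is]
        // With IgnoreCase the last two classes reject every letter
        Console.WriteLine("[{0}]", FirstMatch("This is the 2nd string", RegexOptions.IgnoreCase)); // [e 2]
        Console.WriteLine("[{0}]", FirstMatch("ThisIsThe2ndString", RegexOptions.IgnoreCase));     // [(no match)]
    }
}
```

So IsMatch returning true here says very little about whether the whole string is made of letters, digits and spaces.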

The character class is used to apply the same rules to a set of characters.
The only time you need to separate them into groups is when the rules
(specifically logical Not or quantifiers) do not apply the same to all of
the characters.

Also, as a regular expression is basically procedural (although it does
employ backtracking), you should be careful about the order of the matches.
The following 2 sets are NOT the same:

[\dA!][0X]
[0X][\dA!]

In the first case, "X3" would *not* match. In the second case it would.
This is because the string and the pattern are evaluated in sequence. One
term for this is "consumption" - a regular expression "consumes" a string as
it evaluates it.
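A short sketch of the ordering point, using two sample inputs of my own choosing ("X3" and "3X") and a made-up helper name:

```csharp
using System;
using System.Text.RegularExpressions;

public static class OrderDemo
{
    public const string DigitFirst = @"[\dA!][0X]"; // digit, 'A' or '!' then '0' or 'X'
    public const string DigitLast = @"[0X][\dA!]";  // '0' or 'X' then digit, 'A' or '!'

    public static bool Matches(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }

    public static void Main()
    {
        Console.WriteLine(Matches("X3", DigitFirst)); // False
        Console.WriteLine(Matches("X3", DigitLast));  // True
        Console.WriteLine(Matches("3X", DigitFirst)); // True
        Console.WriteLine(Matches("3X", DigitLast));  // False
    }
}
```

Both patterns contain the same two classes. Only the order differs, and each two-character input satisfies just one of them.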

--
HTH,

Kevin Spencer
 
Jon Skeet [C# MVP]

Mark Rae said:
That's correct - no matter what string I pass into it, it always returns
true...

Well, I've only tried the version that Paul Collins gave (which you
replied to with the same "doesn't work" answer), and that seems to
work:

using System;
using System.Text.RegularExpressions;

class Test
{
    static void Main()
    {
        Regex r = new Regex("^[a-zA-Z0-9 ]*$");
        Console.WriteLine(r.IsMatch("Hello"));
        Console.WriteLine(r.IsMatch("Hello there"));
        Console.WriteLine(r.IsMatch("Hell#o"));
    }
}

Produces:
True
True
False


This is why it's important to give a specific example of something that
fails - preferably with a short but complete program which
demonstrates what you've been trying it with.
 
