Only allowing alphanumeric characters and '_' and '-'

DotNetNewbie · Feb 26, 2008

Hi,

I want to parse a string, ONLY allowing alphanumeric characters and
also the underscore '_' and dash '-' characters.

Anything else in the string should be removed.

I think my regex is looking like:

^([\w\d_-])*$

Now if I have this code:

string username = "mrcsharpis_so_cool!!!";

How can I strip all the characters that I don't want?

KH · Feb 26, 2008

Regex is a bit overkill for that; you could...

string str = "AB&*^#Cabc(#&123--__";

StringBuilder sb = new StringBuilder(str.Length);

foreach (char ch in str)
{
    // keep letters, digits, '-' and '_'; drop everything else
    if (Char.IsLetterOrDigit(ch)
        || ch == '-' || ch == '_')
    {
        sb.Append(ch);
    }
}

str = sb.ToString();

Arne Vajhøj · Feb 26, 2008

KH said:
Regex is a bit overkill for that; you could...

string str = "AB&*^#Cabc(#&123--__";

StringBuilder sb = new StringBuilder(str.Length);

foreach (char ch in str)
{
    if (Char.IsLetterOrDigit(ch)
        || ch == '-' || ch == '_')
    {
        sb.Append(ch);
    }
}

str = sb.ToString();

I think that code is overkill compared to a simple Regex.Replace!

Arne

Jesse Houwing · Feb 26, 2008

Hello KH,

Regex is a bit overkill for that; you could...

string str = "AB&*^#Cabc(#&123--__";

StringBuilder sb = new StringBuilder(str.Length);

foreach (char ch in str)
{
    if (Char.IsLetterOrDigit(ch)
        || ch == '-' || ch == '_')
    {
        sb.Append(ch);
    }
}
str = sb.ToString();

Though, should your requirements become more complex, a regex solution like
the following can be used:

string cleaned = Regex.Replace("string to clean", @"[^\w\d_-]", "", RegexOptions.None);

Just put all the characters you want to keep into the character class above.
Everything else will be removed.

Jesse
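
To make the two uses of the pattern concrete, here is a minimal sketch: the anchored form from the question validates a whole string, while Regex.Replace with a negated class strips the unwanted characters. The class and method names are made up for illustration. Two assumptions worth flagging: \w in .NET already covers digits and '_' and also matches non-ASCII letters, so an ASCII-only variant is included in case "alphanumeric" is meant strictly; and \z is used in place of $ so that a trailing newline is not accepted.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class UsernameFilter
{
    // Any single character that is NOT a word character (\w) or '-'.
    // The verbatim string (@"...") keeps the C# compiler from rejecting \w.
    private static readonly Regex Disallowed = new Regex(@"[^\w-]");

    // Assumption: stricter ASCII-only variant (a-z, A-Z, 0-9, '_', '-').
    private static readonly Regex DisallowedAscii = new Regex(@"[^A-Za-z0-9_-]");

    // Anchored form: matches only if every character is allowed.
    private static readonly Regex AllAllowed = new Regex(@"^[\w-]*\z");

    // Removes every disallowed character.
    public static string Strip(string input)
    {
        return Disallowed.Replace(input, string.Empty);
    }

    // Same, but keeps only ASCII letters and digits plus '_' and '-'.
    public static string StripAscii(string input)
    {
        return DisallowedAscii.Replace(input, string.Empty);
    }

    // True if the string contains nothing but allowed characters.
    public static bool IsValid(string input)
    {
        return AllAllowed.IsMatch(input);
    }
}
```

With this sketch, UsernameFilter.Strip("mrcsharpis_so_cool!!!") gives "mrcsharpis_so_cool", and UsernameFilter.IsValid reports false for the original string and true for the stripped one.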


KH · Feb 27, 2008

I usually avoid regexes because of performance. In this case I haven't tested,
but I would imagine the difference is approximately "who cares"... Nonetheless,
I just think of regexes as overkill in many situations where people try to
use them.

A great way to use them, though, is to put the pattern in a config file so it
can be easily changed when requirements change, or for different customers,
without recompiling the app.
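
A rough sketch of that idea, assuming the pattern string has already been read from whatever configuration source the application uses (that part is left open on purpose). The class name is hypothetical. Since a pattern from a config file is less trusted than a hard-coded one, the sketch passes a match timeout to the Regex constructor; an invalid pattern makes the constructor throw an ArgumentException.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sketch: the "characters to remove" pattern is supplied from
// configuration instead of being compiled into the application.
public sealed class ConfigurableCleaner
{
    private readonly Regex _disallowed;

    // 'pattern' is whatever was read from the config file, e.g. @"[^\w-]".
    // Throws ArgumentException if the pattern is not a valid regex.
    public ConfigurableCleaner(string pattern)
    {
        _disallowed = new Regex(
            pattern,
            RegexOptions.CultureInvariant,
            TimeSpan.FromSeconds(1)); // guard against runaway patterns
    }

    // Removes everything the configured pattern matches.
    public string Clean(string input)
    {
        return _disallowed.Replace(input, string.Empty);
    }
}
```

For example, new ConfigurableCleaner(@"[^\w-]").Clean("AB&*^#Cabc(#&123--__") gives "ABCabc123--__", and a different customer could ship a different pattern without a rebuild.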

Peter Duniho · Feb 27, 2008

KH said:
I usually avoid regex's because of performance. In this case I haven't tested
but would imagine the difference is approximately "who cares" ... nonetheless
I just think of regex's as overkill in many situations where people try to
use them.

It's funny. I agree with both statements, sort of. (Do you smell an
essay coming on? You should...)

Fundamentally, I think that Regex is a good thing. It's a concise,
reliable way to represent various string interpretations and
manipulations. As far as performance goes, I don't think there's a
reliable way to say that Regex is always better- or worse-performing than
an equivalent explicit algorithm.

However, I do think that it's likely that Regex performs better for at
least a broad variety of possible applications, if not the majority. As a
framework class, it's got the potential to be well-optimized and there's
good justification for it to be. On the other hand, explicit algorithms
may or may not be well-optimized, depending on who wrote the code and how
often it's likely to be used.

In addition, every time you write an explicit algorithm, you risk writing
it wrong. With Regex, yes there's the possibility of writing an incorrect
expression, but it's more likely in that case that it just won't work.
It's much harder to get those subtle "happens once in a while with only
this very specific input" bugs. Not impossible, but IMHO more difficult.

So those are all things in favor of Regex. I think that in general,
anything that allows you to specify an operation in a concise, error-free
way and then perform that operation with reasonable, or even optimal
speed, that's a good thing.

But with Regex, the conciseness is IMHO a bit overboard. I recognize that
there are folks out there who have used regular expressions so much that
it's just like writing regular programming code to them. They know it
inside and out.

But for the rest of us, using Regex is an exercise in frustration as we
skip back and forth in the MSDN documentation trying to find just the
right syntax for representing some goal. There's an incredible amount of
capability there, and with that comes a fairly extensive grammar that
needs to be learned to use it effectively. But the syntax of that grammar
is pretty arcane IMHO, and has been very hard to learn, at least for me.

I wish we had something like Regex, but with a more natural-language-like
way to program it. Maybe something like a RegexBuilder class or something
that you can use to construct an appropriate regular expression. Or maybe
just a syntax that looks more like C# than like APL. Or maybe something
that takes actual C# code expressions and converts it into a suitable
regular expression. Or some alternative I've yet to consider.
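
Just to illustrate what such a builder might look like, here is a very small sketch. PatternBuilder is a made-up class, not a framework API, and it only covers the handful of constructs needed for the pattern in this thread; a real design would need far more.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical fluent builder, purely illustrative.
public sealed class PatternBuilder
{
    private readonly StringBuilder _pattern = new StringBuilder();

    // Anchors the match at the start of the input.
    public PatternBuilder StartOfInput()
    {
        _pattern.Append('^');
        return this;
    }

    // Anchors the match at the end of the input.
    public PatternBuilder EndOfInput()
    {
        _pattern.Append('$');
        return this;
    }

    // Zero or more characters that are word characters (\w) or one of 'extras'.
    public PatternBuilder ZeroOrMoreWordCharsOr(params char[] extras)
    {
        _pattern.Append(@"[\w");
        foreach (char c in extras)
        {
            // Escape the few characters that are special inside a character class.
            if (c == '\\' || c == ']' || c == '[' || c == '^' || c == '-')
            {
                _pattern.Append('\\');
            }
            _pattern.Append(c);
        }
        _pattern.Append("]*");
        return this;
    }

    // Produces the Regex for the pattern built so far.
    public Regex Build()
    {
        return new Regex(_pattern.ToString());
    }

    // The raw pattern text, handy for checking what was generated.
    public override string ToString()
    {
        return _pattern.ToString();
    }
}
```

Calling new PatternBuilder().StartOfInput().ZeroOrMoreWordCharsOr('_', '-').EndOfInput() builds the pattern ^[\w_\-]*$, which is equivalent to the expression from the original question.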

I don't know what the actual solution is. All I know is that Regex itself
can be very trying to use if you're inexperienced with it, to a _much_
greater extent than, say, VB or C# might be. So in the end, for simple
operations I find myself thinking "well, some explicit C# code will be
clearer, and it should be easy to make it bug-free", and so I wind up not
using Regex there. And then for more complex operations, where the
conciseness and precision of Regex would be a benefit, I find myself
thinking "I just don't get how to do this in Regex and the docs aren't
helping me figure it out", and so I wind up not using Regex.

Which means that either way, I don't use Regex. I've posted questions
here asking how to write Regex expressions to do what I want, and to the
credit of the newsgroup experts who do know Regex, they've always come
through. For me, and for others who ask similar questions. Jesse Houwing
in particular deserves major kudos for his Regex "kung fu" and his
willingness to share it with others. But in the end, if I can't be
self-reliant on a technology, I tend not to use it.

Maybe if I had a greater need to do string pattern matching, I'd take the
time and really learn regular expressions and then it'd be useful. But I
don't, and for the occasional moments when it'd be useful to me, it's just
not worth the time and effort to figure out that specific case.

I'd love to see someone fix that problem.

Pete

KWienhold · Feb 27, 2008


While I do use Regex from time to time (input field validation,
parsing SQL connection strings, etc.), I totally agree with Peter.
Whenever I do use regular expressions, it would have been quite trivial
to achieve the same thing in code. And when the pattern matching becomes
complex enough to really make me want the power the Regex engine
offers, I often find I just can't get the expression to work right in
all circumstances.
A library that offered a more natural way of constructing regular
expressions would be great, but given the complexity of the syntax
(let alone the fact that there are several different implementations),
I don't quite see how that could be done...

Kevin Wienhold

Stefan Nobis · Feb 27, 2008

Peter Duniho said:
Fundamentally, I think that Regex is a good thing.

Fundamentally a RegEx is a type 3 grammar, equivalent to a finite
automaton.

So a RegEx is more like an upper bound to a class of pattern matching
problems. Sometimes a RegEx is not enough; then you need to go up in
the hierarchy to type 2 grammars and write parsers. But in many cases
you don't need all of the expressiveness of a RegEx, so you can use
much simpler constructs.

BTW: In the class of parsing problems where regular expressions
suffice, using a RegEx parser is the most costly (sane) way to do the
job. Simple comparisons like Char.IsLetterOrDigit (traversing the input
string only once, without the overhead of parser generation) are
always (much) faster and need (much) less memory.

Some problems need full regular expression expressiveness, so in these
cases the cost and overhead of a RegEx is unavoidable.
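
For anyone who wants to measure this rather than argue about it, here is a rough timing sketch. The class and method names are made up, both variants are restricted to ASCII so they keep exactly the same characters, and the numbers will vary with runtime, hardware and input, so treat it as a starting point and not as a proper benchmark.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical micro-benchmark helper; names are illustrative only.
public static class StripBenchmark
{
    private static readonly Regex Disallowed = new Regex(@"[^A-Za-z0-9_-]");

    // Regex variant: remove everything outside the allowed ASCII set.
    public static string WithRegex(string s)
    {
        return Disallowed.Replace(s, string.Empty);
    }

    // Explicit variant: one pass over the input, copying allowed characters.
    public static string WithLoop(string s)
    {
        StringBuilder sb = new StringBuilder(s.Length);
        foreach (char ch in s)
        {
            bool allowed = (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z')
                        || (ch >= '0' && ch <= '9') || ch == '_' || ch == '-';
            if (allowed)
            {
                sb.Append(ch);
            }
        }
        return sb.ToString();
    }

    // Returns the elapsed milliseconds for 'iterations' calls of 'strip'.
    public static long Time(Func<string, string> strip, string input, int iterations)
    {
        strip(input); // warm-up call so one-time setup cost is not measured
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            strip(input);
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }
}
```

A call such as StripBenchmark.Time(StripBenchmark.WithLoop, "AB&*^#Cabc(#&123--__", 1000000), repeated with StripBenchmark.WithRegex, gives two figures to compare on your own machine.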

As far as performance goes, I don't think there's a reliable way to
say that Regex is always better- or worse-performing than an
equivalent explicit algorithm.

This class of problems is really well studied and understood. There
are quite reliable ways to say when a RegEx is needed, what performance
and memory characteristics follow, and when other ways are needed or more
efficient.

These and much more are the basics of computer science. There's more
to programming than just trial and error.

other hand, explicit algorithms may or may not be well-optimized,

But a regular expression may also be badly written and as such induce
much more overhead and worse performance than a better written RegEx
on the same regular expression engine. A regular
expression is a simple language but still complex enough to say the
same thing in different ways.

If you do a basic comparison of algorithms you always have to assume
that the implementations are written as well as possible (for example a
routine to copy a 10 character long string should not need 50MB RAM
and several minutes of runtime to do its job; it's always possible
to do worse, we are only interested in whether it's possible to do better).

In addition, every time you write an explicit algorithm, you risk
writing it wrong. With Regex, yes there's the possibility of
writing an incorrect expression, but it's more likely in that case
that it just won't work. It's much harder to get those subtle
"happens once in awhile with only this very specific input". Not
impossible, but IMHO more difficult.

You haven't written many complex regular expressions, have you? It is
quite easy to get those subtle problems with a RegEx. But you are not
wrong. A regular expression is a type 3 grammar, C# has (more or less)
a type 2 grammar (it's even Turing complete), so it's much more
expressive and so there is much more potential for errors.

But for the rest of us, using Regex is an exercise in frustration as
we skip back and forth in the MSDN documentation trying to find just
the right syntax for representing some goal. There's an incredible
amount of capability there, and with that comes a fairly extensive
grammar that needs to be learned to use it effectively. But the
syntax of that grammar is pretty arcane IMHO, and has been very hard
to learn, at least for me.

The concept of regular expressions is not that difficult. The most
common representation in today's languages is purely artificial. Other
representations and syntaxes are possible and do exist; for the
language Common Lisp there is a library called cl-ppcre implementing a
quite efficient regular expression engine (for some examples even
faster than the C engine) -- this engine understands the common
representation but also allows another syntax:

CL-USER> (ppcre:parse-string "^([\w\d_-])*$")
(:SEQUENCE :START-ANCHOR
 (:GREEDY-REPETITION 0 NIL
  (:REGISTER
   (:CHAR-CLASS #\w #\d #\_ #\-))) :END-ANCHOR)

It's quite a long representation and maybe to some eyes even worse, but
it shows that other ways to notate a RegEx are quite possible.

questions here asking how to write Regex expressions to do what I
want

Maybe have a look at

http://weitz.de/regex-coach/

an IMHO quite useful tool to learn regular expressions and to
experiment with them.

Stefan Nobis · Feb 27, 2008

Stefan Nobis said:
CL-USER> (ppcre:parse-string "^([\w\d_-])*$")
(:SEQUENCE :START-ANCHOR (:GREEDY-REPETITION 0 NIL (:REGISTER (:CHAR-CLASS #\w #\d #\_ #\-))) :END-ANCHOR)

Oops, bad example. The simple translator doesn't convert \w and
\d. Sorry. It should read more like this (to put everything except \w,
- and _ in the register):

(:SEQUENCE :START-ANCHOR
 (:GREEDY-REPETITION 0 NIL
  (:REGISTER
   (:INVERTED-CHAR-CLASS :WORD-CHAR-CLASS
    #\_
    #\-)))
 :END-ANCHOR)

The first two parameters to :GREEDY-REPETITION are the min and max
allowed number of repetitions (the above 0 NIL corresponds to the *;
something like (:GREEDY-REPETITION 3 5 ...) corresponds to
...{3,5}). The syntax #\_ is Common Lisp syntax for the single
character _.

Here is a handwritten example using the verbose syntax (I
don't have the perl-like version at hand, sorry):

(:sequence :start-anchor
 (:alternation #\# ";;;")
 (:positive-lookahead :word-char-class)
 (:register
  (:greedy-repetition 0 nil :word-char-class))
 (:positive-lookahead
  (:alternation :end-anchor
   (:sequence
    (:greedy-repetition 1 nil
     :whitespace-char-class)
    :non-whitespace-char-class)))
 (:greedy-repetition 0 1
  (:sequence
   (:greedy-repetition 1 nil :whitespace-char-class)
   (:register
    (:greedy-repetition 0 nil :everything)))))

Arne Vajhøj · Feb 28, 2008

KH said:
I usually avoid regex's because of performance. In this case I haven't tested
but would imagine the difference is approximately "who cares" ... nonetheless
I just think of regex's as overkill in many situations where people try to
use them.

Usually fewer lines of code is what is most cost effective overall.

Regex is simple code (and if the reader knows regex as a general concept
it is even easy to read) and code that is easy to modify to different
requirements.

It does come with a certain overhead. It may not be suited for
being called billions or trillions of times. But I doubt that was
the case here (the variable was named 'username').

Arne
