Only allowing alphanumeric characters and '_' and '-'

DotNetNewbie

Hi,

I want to parse a string, ONLY allowing alphanumeric characters and
also the underscore '_' and dash '-' characters.

Anything else in the string should be removed.

I think my regex is looking like:

^([\w\d_-])*$


Now if I have this code:

string username = "mrcsharpis_so_cool!!!";

How can I strip all the characters that I don't want?
 
Regex is a bit overkill for that; you could...

// needs: using System.Text;
string str = "AB&*^#Cabc(#&123--__";

StringBuilder sb = new StringBuilder(str.Length);

// keep letters, digits, '-' and '_'; drop everything else
foreach (char ch in str)
{
    if (Char.IsLetterOrDigit(ch)
        || ch == '-' || ch == '_')
    {
        sb.Append(ch);
    }
}

str = sb.ToString();
 
KH said:
Regex is a bit overkill for that; you could...

I think that code is overkill compared to a simple Regex.Replace!

Arne
 
Hello KH,
Regex is a bit overkill for that; you could...

Though, should your requirements become more complex, a regex solution like
the following can be used:

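string cleaned = Regex.Replace("string to clean", @"[^\w\d_-]", "", RegexOptions.None);

Just put all the characters you want to keep into the character class above.
Everything else will be removed. Note the @ prefix: in a regular C# string
literal, \w and \d are unrecognized escape sequences and will not compile.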

Jesse
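
A minimal sketch of the Regex.Replace approach, applied to the username from the original question. The class and method names are made up for illustration. Since \w already covers letters, digits and the underscore, the class is shortened to [^\w-] here. Keep in mind that \w is Unicode-aware in .NET by default, so it accepts more than ASCII letters. The anchored pattern from the question is a different tool: it only tests whether a string is already clean and removes nothing.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
static class UsernameCleaner
{
    // Remove every character that is neither a word character (\w) nor a dash.
    public static string Strip(string input)
    {
        return Regex.Replace(input, @"[^\w-]", "");
    }

    // True only if the whole string consists of allowed characters.
    // \A and \z are used instead of ^ and $ because $ also matches
    // just before a trailing newline.
    public static bool IsClean(string input)
    {
        return Regex.IsMatch(input, @"\A[\w-]*\z");
    }

    static void Main()
    {
        string username = "mrcsharpis_so_cool!!!";
        Console.WriteLine(IsClean(username));   // False
        Console.WriteLine(Strip(username));     // mrcsharpis_so_cool
    }
}
```

So validation (IsMatch with anchors) and stripping (Replace with a negated class) are two separate steps, and which one you want depends on whether bad input should be rejected or silently cleaned.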
 
I usually avoid regexes because of performance. In this case I haven't tested
but would imagine the difference is approximately "who cares" ... nonetheless
I just think of regexes as overkill in many situations where people try to
use them.

A great way to use them though is to put the pattern in a config file so it
can be easily changed when requirements change or for different customers w/o
recompiling the app.
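
A rough sketch of the config-file idea. How the pattern is actually loaded (app settings, JSON, a database) is left open; the configuredPattern parameter simply stands in for whatever value was read, and the class name and default pattern are assumptions made for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
static class ConfigurableCleaner
{
    // Fallback used when nothing (or something unusable) is configured.
    const string DefaultPattern = @"[^\w-]";

    // 'configuredPattern' stands in for a value read from a config file.
    public static Regex BuildStripper(string configuredPattern)
    {
        string pattern = string.IsNullOrEmpty(configuredPattern)
            ? DefaultPattern
            : configuredPattern;
        try
        {
            return new Regex(pattern);
        }
        catch (ArgumentException)
        {
            // A malformed pattern in the config falls back to the default
            // instead of taking the application down.
            return new Regex(DefaultPattern);
        }
    }

    static void Main()
    {
        // e.g. a customer that only allows lowercase ASCII, digits, '_' and '-'
        Regex stripper = BuildStripper(@"[^a-z0-9_-]");
        Console.WriteLine(stripper.Replace("Mr_CSharp-99!", ""));  // r_harp-99
    }
}
```

Whether silently falling back is the right reaction to a broken pattern is a design choice; logging or failing fast at startup may be preferable in a real application.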
 
It's funny. I agree with both statements, sort of. (Do you smell an
essay coming on? You should... :) )

Fundamentally, I think that Regex is a good thing. It's a concise,
reliable way to represent various string interpretations and
manipulations. As far as performance goes, I don't think there's a
reliable way to say that Regex is always better- or worse-performing than
an equivalent explicit algorithm.

However, I do think that it's likely that Regex performs better for at
least a broad variety of possible applications, if not the majority. As a
framework class, it's got the potential to be well-optimized and there's
good justification for it to be. On the other hand, explicit algorithms
may or may not be well-optimized, depending on who wrote the code and how
often it's likely to be used.

In addition, every time you write an explicit algorithm, you risk writing
it wrong. With Regex, yes there's the possibility of writing an incorrect
expression, but it's more likely in that case that it just won't work.
It's much harder to get those subtle "happens once in a while with only
this very specific input" bugs. Not impossible, but IMHO more difficult.

So those are all things in favor of Regex. I think that in general,
anything that allows you to specify an operation in a concise, error-free
way and then perform that operation with reasonable, or even optimal
speed, that's a good thing.

But with Regex, the conciseness is IMHO a bit overboard. I recognize that
there are folks out there who have used regular expressions so much that
it's just like writing regular programming code to them. They know it
inside and out.

But for the rest of us, using Regex is an exercise in frustration as we
skip back and forth in the MSDN documentation trying to find just the
right syntax for representing some goal. There's an incredible amount of
capability there, and with that comes a fairly extensive grammar that
needs to be learned to use it effectively. But the syntax of that grammar
is pretty arcane IMHO, and has been very hard to learn, at least for me.

I wish we had something like Regex, but with a more natural-language-like
way to program it. Maybe something like a RegexBuilder class or something
that you can use to construct an appropriate regular expression. Or maybe
just a syntax that looks more like C# than like APL. Or maybe something
that takes actual C# code expressions and converts them into a suitable
regular expression. Or some alternative I've yet to consider.

I don't know what the actual solution is. All I know is that Regex itself
can be very trying to use if you're inexperienced with it, to a _much_
greater extent than, say, VB or C# might be. So in the end, for simple
operations I find myself thinking "well, some explicit C# code will be
clearer, and it should be easy to make it bug-free", and so I wind up not
using Regex there. And then for more complex operations, where the
conciseness and precision of Regex would be a benefit, I find myself
thinking "I just don't get how to do this in Regex and the docs aren't
helping me figure it out", and so I wind up not using Regex.

Which means that either way, I don't use Regex. I've posted questions
here asking how to write Regex expressions to do what I want, and to the
credit of the newsgroup experts who do know Regex, they've always come
through. For me, and for others who ask similar questions. Jesse Houwing
in particular deserves major kudos for his Regex "kung fu" and his
willingness to share it with others. But in the end, if I can't be
self-reliant on a technology, I tend not to use it.

Maybe if I had a greater need to do string pattern matching, I'd take the
time and really learn regular expressions and then it'd be useful. But I
don't, and for the occasional moments when it'd be useful to me, it's just
not worth the time and effort to figure out that specific case.

I'd love to see someone fix that problem. :)

Pete
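
To make the "RegexBuilder" wish a bit more concrete, here is a deliberately tiny sketch of what such a helper could look like for the character-set case discussed in this thread. Everything here (CharSetBuilder and its methods) is hypothetical and not part of the .NET framework; it only assembles a pattern string and hands it to the ordinary Regex class.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical builder, not a framework class.
sealed class CharSetBuilder
{
    private readonly StringBuilder _members = new StringBuilder();

    // Unicode letters and decimal digits (the categories Char.IsLetterOrDigit accepts).
    public CharSetBuilder LettersAndDigits()
    {
        _members.Append(@"\p{L}\p{Nd}");
        return this;
    }

    // Add one literal character, written as \uXXXX so that characters
    // such as '-' or ']' cannot be misread as class syntax.
    public CharSetBuilder Literal(char c)
    {
        _members.AppendFormat(@"\u{0:X4}", (int)c);
        return this;
    }

    // Regex matching any single character that is NOT in the set.
    // Assumes at least one member has been added.
    public Regex BuildAnythingElse()
    {
        return new Regex("[^" + _members + "]");
    }
}

static class CharSetBuilderDemo
{
    static void Main()
    {
        Regex unwanted = new CharSetBuilder()
            .LettersAndDigits()
            .Literal('_')
            .Literal('-')
            .BuildAnythingElse();

        Console.WriteLine(unwanted.Replace("mrcsharpis_so_cool!!!", ""));  // mrcsharpis_so_cool
    }
}
```

This only covers one small corner of the regex grammar; covering groups, quantifiers and lookarounds in the same style is where the real design work would be.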
 
While I do use Regex from time to time (input field validation,
parsing SQL connection strings etc.), I totally agree with Peter.
Whenever I do use regular expressions, it would have been quite trivial
to achieve the same thing in code. When the pattern matching becomes
complex enough to really make you want the power the Regex engine
offers, I often find I just can't get the expression to work right in
all circumstances.
A library that would offer a more natural way of constructing regular
expressions would be great, but given the complexity of the syntax
(let alone the fact that there are several different implementations),
I don't quite see how that could be done...

Kevin Wienhold
 
Peter Duniho said:
Fundamentally, I think that Regex is a good thing.

Fundamentally a RegEx is a type 3 grammar, equivalent to a finite
automaton. :)

So a RegEx is more like an upper bound to a class of pattern matching
problems. Sometimes a RegEx is not enough; then you need to go up in
the hierarchy to type 2 grammars and write parsers. But in many cases
you don't need all of the expressiveness of a RegEx, so you can use
much simpler constructs.

BTW: In the class of parsing problems where regular expressions
suffice, using a RegEx parser is the most costly (sane) way to do the
job. Simple comparisons like Char.IsLetterOrDigit (traversing the input
string only once, without the overhead of parser generation) are
always (much) faster and need (much) less memory.

Some problems need full regular expression expressiveness, so in these
cases the cost and overhead of a RegEx is mandatory.

Peter Duniho said:
As far as performance goes, I don't think there's a reliable way to
say that Regex is always better- or worse-performing than an
equivalent explicit algorithm.

This class of problems is really well studied and understood. There
are quite reliable ways to say when a RegEx is needed, what performance
and memory characteristics follow, and when other approaches are needed
or more efficient.

These and much more are the basics of computer science. There's more
to programming than just trial and error.

Peter Duniho said:
On the other hand, explicit algorithms may or may not be well-optimized,

But a regular expression may also be badly written and as such induce
much more overhead and worse performance than a better written RegEx
would on the same regular expression engine. A regular
expression is a simple language but still complex enough to say the
same thing in different ways.

If you do a basic comparison of algorithms, you always have to assume
that the implementations are written as well as possible (for example a
routine to copy a 10 character long string should not need 50MB RAM
and quite some minutes of runtime to do its job; it's always possible
to do worse, we are only interested in whether it's possible to do better).

Peter Duniho said:
In addition, every time you write an explicit algorithm, you risk
writing it wrong. With Regex, yes there's the possibility of
writing an incorrect expression, but it's more likely in that case
that it just won't work. It's much harder to get those subtle
"happens once in awhile with only this very specific input". Not
impossible, but IMHO more difficult.

You haven't written many complex regular expressions, have you? It is
quite easy for a RegEx to have those subtle problems. But you are not
wrong. A regular expression is a type 3 grammar, C# has (more or less)
a type 2 grammar (and the language is even Turing complete), so it's much
more expressive and so there is much more potential for errors.

Peter Duniho said:
But for the rest of us, using Regex is an exercise in frustration as
we skip back and forth in the MSDN documentation trying to find just
the right syntax for representing some goal. There's an incredible
amount of capability there, and with that comes a fairly extensive
grammar that needs to be learned to use it effectively. But the
syntax of that grammar is pretty arcane IMHO, and has been very hard
to learn, at least for me.

The concept of regular expressions is not that difficult. The most
common representation in today's languages is purely artificial. Other
representations and syntaxes are possible and do exist; for the
language Common Lisp there is a library called cl-ppcre implementing a
quite efficient regular expression engine (for some examples even
faster than the C engine) -- this engine understands the common
representations but also allows another syntax:

CL-USER> (ppcre::parse-string "^([\w\d_-])*$")
(:SEQUENCE :START-ANCHOR (:GREEDY-REPETITION 0 NIL (:REGISTER (:CHAR-CLASS #\w #\d #\_ #\-))) :END-ANCHOR)

It's quite a long representation, and maybe to some eyes even worse, but
it shows that other ways to notate a RegEx are quite possible.

Peter Duniho said:
I've posted questions here asking how to write Regex expressions to do
what I want

Maybe have a look at

http://weitz.de/regex-coach/

an IMHO quite useful tool to learn regular expressions and to
experiment with them.
 
Stefan Nobis said:
CL-USER> (ppcre::parse-string "^([\w\d_-])*$")
(:SEQUENCE :START-ANCHOR (:GREEDY-REPETITION 0 NIL (:REGISTER (:CHAR-CLASS #\w #\d #\_ #\-))) :END-ANCHOR)

Oops, bad example. The simple translator doesn't convert \w and
\d. Sorry. It should read more like this (to put everything except \w,
- and _ in the register):

(:SEQUENCE :START-ANCHOR
  (:GREEDY-REPETITION 0 NIL
    (:REGISTER
      (:INVERTED-CHAR-CLASS :WORD-CHAR-CLASS
                            #\_
                            #\-)))
  :END-ANCHOR)

The first two parameters to :GREEDY-REPETITION mean the min and max
allowed number of repetitions (the above 0 NIL corresponds to the *,
something like (:GREEDY-REPETITION 3 5 ...) corresponds to
....{3,5}). The syntax #\_ is Common Lisp syntax for the single
character _.

Here is a handwritten example using the verbose syntax (I
don't have the perl-like version at hand, sorry):

(:sequence :start-anchor (:alternation #\# ";;;")
  (:positive-lookahead :word-char-class)
  (:register (:greedy-repetition 0 nil :word-char-class))
  (:positive-lookahead
    (:alternation :end-anchor
      (:sequence
        (:greedy-repetition 1 nil
          :whitespace-char-class)
        :non-whitespace-char-class)))
  (:greedy-repetition 0 1
    (:sequence
      (:greedy-repetition 1 nil :whitespace-char-class)
      (:register (:greedy-repetition 0 nil :everything)))))
 
KH said:
I usually avoid regexes because of performance. In this case I haven't tested
but would imagine the difference is approximately "who cares" ... nonetheless
I just think of regexes as overkill in many situations where people try to
use them.

Usually fewer lines of code is what is most cost effective overall.

Regex is simple code (and if the reader knows regex as a general concept
it is even easy to read) and code that is easy to modify to different
requirements.

It does come with a certain overhead. It may not be suited for
being called billions or trillions of times. But I doubt that was
the case here (the variable was named 'username').

Arne
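
For anyone who wants to measure the overhead rather than guess, here is a rough timing sketch. It assumes the same sample username as above, builds the Regex once and reuses it, and compares it with the explicit loop from earlier in the thread. The class name and iteration count are arbitrary, and no particular result is implied; numbers will vary by runtime and machine. Also note that \w and Char.IsLetterOrDigit agree on plain ASCII input like this but are not identical across all of Unicode.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical harness, named for illustration only.
static class StripTiming
{
    // Constructed once and reused. RegexOptions.Compiled trades a higher
    // construction cost for faster matching afterwards.
    static readonly Regex Unwanted = new Regex(@"[^\w-]", RegexOptions.Compiled);

    public static string WithRegex(string s)
    {
        return Unwanted.Replace(s, "");
    }

    public static string WithLoop(string s)
    {
        StringBuilder sb = new StringBuilder(s.Length);
        foreach (char ch in s)
        {
            if (char.IsLetterOrDigit(ch) || ch == '-' || ch == '_')
            {
                sb.Append(ch);
            }
        }
        return sb.ToString();
    }

    // Runs f once as warm-up, then times 'iterations' further calls.
    static long Time(Func<string, string> f, string input, int iterations)
    {
        f(input);
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            f(input);
        }
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        const string input = "mrcsharpis_so_cool!!!";
        Console.WriteLine("regex: {0} ms", Time(WithRegex, input, 1000000));
        Console.WriteLine("loop:  {0} ms", Time(WithLoop, input, 1000000));
    }
}
```

For a single username per request either variant is likely fine; a measurement like this only starts to matter when the call sits in a hot path.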
 
