Regex Validator - detect all but certain HTML tags

Barry L. Camp · Jan 22, 2007

Hi all... hope someone can help out.

Not a unique situation, but my search for a solution has not yielded
what I need yet.

I'm trying to come up with a regular expression for a
RegularExpressionValidator that will allow certain HTML tags:

<a>, , <blockquote>, , , <img>, <li>, <ol>, , <quote>,
<ul>

but block others. So basically I'd like to detect "<" then look for
certain sequences (the tags above). But I also of course have to
account for any number of possible attributes. And then some of the
tags above have closing tags, others do not.

I don't fully understand regular expressions, yet would like to learn;
however, I also want to find a way to do this soon.

I don't want to reinvent this particular "wheel" if it has already been
done before, if you know what I mean.

Any help from anyone out there would be greatly appreciated. Thanks
much!

Barry L. Camp

Cor Ligthert [MVP] · Jan 22, 2007

Barry,

If you do not want to reinvent the wheel, then why not use the wheel
that has already been invented for this?

MSHTML.

http://www.vb-tips.com/dbpages.aspx?ID=541adf13-d9c0-435c-893f-56dbb63fdf1c

I hope this helps,

Cor

Barry L. Camp · Jan 22, 2007

I need to find a regex for Regular Expression Validator to perform what
I am describing.

I'm not interested in parsing a web page. I'm trying to parse the
contents of a textbox - I have a DetailsView control, for which one
textbox is for content that may be displayed in an .aspx page. I want
to allow certain tags, but block all others. The DetailsView is bound
to an ObjectDataSource, so naturally it would be nice to have the
Validator catch everything for me if at all possible.

Any ideas (anyone)?

Thanks,

Barry

Guest · Jan 22, 2007

Barry said:
I'm not interested in parsing a web page. I'm trying to parse the
contents of a textbox - I have a DetailsView control, for which one
textbox is for content that may be displayed in an .aspx page. I want
to allow certain tags, but block all others. The DetailsView is bound
to an ObjectDataSource, so naturally it would be nice to have the
Validator catch everything for me if at all possible.

You can load the textbox contents into MSHTML.

Finding a regular expression to handle all the cases of HTML will be
challenging to say the least - I suggest you take a look at MSHTML again
and see if it'll work.

Barry L. Camp · Jan 22, 2007

That example is not what I am looking for. I'm not trying to grab an
entire web page, or rebuild the content that may be in the textbox. All
I am trying to do is detect whether "forbidden" HTML tags are in the
text, and prevent further processing (or in my case, prevent the user
from saving a record in the DetailsView) until the user has edited the
contents of the textbox such that they are acceptable (i.e. don't have
"forbidden" HTML tags).

I'll grant that finding a suitable regex is not easy. Ideally that
would be the best solution, though, as I have my DetailsView hooked
into an ObjectDataSource, and would like to have the validator catch
everything, in-stream. I don't want to have to instantiate MSHTML just
to parse one single textbox.
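
For what it's worth, here is a minimal sketch of the kind of whitelist expression being asked for. It assumes only the tags that are still visible in the list in the first post (<a>, <blockquote>, <img>, <li>, <ol>, <quote>, <ul>), and it is written in Python purely to exercise the pattern; ALLOWED_TAGS, build_pattern and is_acceptable are made-up names for illustration, not anything from the validator. The idea: the whole text must consist of characters other than "<", or of complete opening/closing tags whose name is on the list.

```python
import re

# Hypothetical whitelist: only the tag names still visible in the first post.
ALLOWED_TAGS = ("a", "blockquote", "img", "li", "ol", "quote", "ul")


def build_pattern(tags):
    """Build a whole-string pattern: plain text or whitelisted tags only."""
    names = "|".join(re.escape(t) for t in tags)
    # [^<]          any character that does not start a tag
    # </?(names)    an opening or closing tag with a whitelisted name
    # (?=[\s/>])    the name must end here, so <abbr> does not pass as <a>
    # [^<>]*>       any attributes, then the closing bracket
    return r"(?:[^<]|</?(?:" + names + r")(?=[\s/>])[^<>]*>)*"


# Case-insensitivity is applied with a flag here; that is an assumption
# of this sketch, not something the validator markup provides.
PATTERN = re.compile(build_pattern(ALLOWED_TAGS), re.IGNORECASE)


def is_acceptable(text):
    """Return True when the text contains no tag outside the whitelist."""
    return PATTERN.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ("Hello <a href='x'>link</a>", "<script>alert(1)</script>"):
        print(sample, "->", is_acceptable(sample))
```

Two caveats with this sketch: a stray "<" that is not part of a whitelisted tag (as in "a < b") makes the whole text fail, and attributes are not inspected at all, so something like an onclick attribute on an allowed tag would still pass.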

Guest · Jan 22, 2007

Barry said:
I'll grant that finding a suitable regex is not easy. Ideally that
would be the best solution, though, as I have my DetailsView hooked
into an ObjectDataSource, and would like to have the validator catch
everything, in-stream. I don't want to have to instantiate MSHTML just
to parse one single textbox.

Regular Expressions are not well suited to parse complex XML type documents
due to the nested nature of such documents. There are better tools for the
job.

Perhaps loading the HTML into an XML Doc and using XPath to search for
unwanted tags?

If you're set on using Regular Expressions, take a look at Community
Server's source code. I recall it had a set of regular expressions to parse
out unwanted tags.

http://communityserver.org/
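
The parser idea above can be sketched roughly as follows. This is neither the XML Doc/XPath route itself nor MSHTML; as a stand-in it uses Python's standard-library html.parser, which tolerates unclosed tags such as <img>. ALLOWED_TAGS, TagCollector and forbidden_tags are hypothetical names, and the whitelist contains only the tags still visible in the first post.

```python
from html.parser import HTMLParser

# Hypothetical whitelist: only the tag names still visible in the first post.
ALLOWED_TAGS = {"a", "blockquote", "img", "li", "ol", "quote", "ul"}


class TagCollector(HTMLParser):
    """Record every tag name that is not on the whitelist."""

    def __init__(self):
        super().__init__()
        self.forbidden = []

    def handle_starttag(self, tag, attrs):
        # The parser passes tag names in lower case.
        if tag not in ALLOWED_TAGS:
            self.forbidden.append(tag)

    def handle_endtag(self, tag):
        if tag not in ALLOWED_TAGS:
            self.forbidden.append(tag)


def forbidden_tags(text):
    """Return the sorted, de-duplicated names of non-whitelisted tags."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return sorted(set(collector.forbidden))


if __name__ == "__main__":
    print(forbidden_tags("<div><font color='red'>x</font></div>"))
```

An empty result means the text could be saved; a non-empty one gives the names to show in the error message, which a single yes/no regex match cannot do.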

Cor Ligthert [MVP] · Jan 23, 2007

Barry,

Then why did you write this in your starting message?

I don't want to reinvent this particular "wheel" if it has already been
done before, if you know what I mean.

You definitely show that you want to reinvent a wheel that already exists.

Cor

Barry L. Camp · Jan 23, 2007

Cor said:
Barry,

Then why did you write this in your starting message?

You definitely show that you want to reinvent a wheel that already exists.

Cor

You still don't understand what I am trying to do.

I'm not trying to read HTML documents, XML, XHTML or any *TML.

The best way to explain what I am doing would be to go on any
discussion forum or web-based e-mail. There's a big, huge textbox
(textarea, or whatever). I've got one of these in a .NET 2.0
DetailsView control. I've got it bound to an ObjectDataSource. They're
all in TemplateFields, and I've got Validator controls hooked to all of
the textbox inputs. The smaller ones are tied to regex validators, with
simple expressions to allow text only, because that's all I need.

But this big textbox... I want to accept a small subset of HTML tags,
because this data... bound into a database... I want to mash together
with a MasterPage and some content, and render it back. It's not an
HTML document already. It's just text. I want to take the data and make
it PART of a web page.

Barry L. Camp · Jan 23, 2007

Spam said:
Regular Expressions are not well suited to parse complex XML type documents
due to the nested nature of such documents. There are better tools for the
job.

Perhaps loading the HTML into an XML Doc and using XPath to search for
unwanted tags?

If you're set on using Regular Expressions, take a look at Community
Server's source code. I recall it had a set of regular expressions to parse
out unwanted tags.

http://communityserver.org/

Well, I'm not exactly working with entire HTML pages. But I think
you've made a great suggestion.

I am building essentially a home-grown CMS, not on the order of a
CommunityServer or DNN, but just something that I can easily add/edit
content later on. It's just to suit what I need. It's also for me to
tinker with, and help educate myself on .NET 2.0, to help prepare for
the new Cert exams. (I've already got 70-431 done).

Looks like I have a lot more reading to do, but that's fine. I
appreciate the idea. Thanks much!

Barry

Cor Ligthert [MVP] · Jan 23, 2007

Barry,

You still did not look at the sample I gave you or at what is written here.
The first part of the sample is to get an HTML page. It is mostly impossible
to give a sample without having underlying data. The second part
shows how you can tear a page apart into different parts according to its
tags.

However, what does Regex have to do with this?

Barry said:
But this big textbox... I want to accept a small subset of HTML tags,
because this data... bound into a database... I want to mash together
with a MasterPage and some content, and render it back. It's not an
HTML document already. It's just text. I want to take the data and make
it PART of a web page.

For that there is already a wheel as well. Why do you so strongly want to
invent your own wheels?

Cor

Barry L. Camp · Jan 23, 2007

Cor said:
Barry,

You still did not look at the sample I gave you or what is written here.

Yes, I did. And it has nothing to do with what I am doing.

The first part of the sample is to get an HTML page.

I don't care about that. I've said several times that I am not
interested in parsing an HTML page, so why would I want to even *get*
one?

It is mostly impossible
to give a sample without having underlying data. The second part
shows how you can tear a page apart into different parts according to its
tags.

I don't want to get a page, or tear it apart, or look at a page.

As I have said repeatedly, I'm taking input from a textbox, which is
*not* an HTML page, but may contain HTML tags. Parsing an entire HTML
page (which is what I AM NOT DOING) is a totally different concept than
the mere *detection* of a small number of tags in simple text (which is
what I AM DOING).

However, what does Regex have to do with this?

Because as I have said repeatedly, I am trying to use the
RegularExpressionValidator to enforce validation rules on my textbox.
Was I not clear enough?

How about this example - perhaps this illustrates it better.

<asp:TemplateField HeaderText="Author">
    <EditItemTemplate>
        <asp:TextBox ID="AuthorTextBox" runat="server"
            Text='<%# Bind("Author") %>' />
        <asp:RegularExpressionValidator ID="RegularExpressionValidator1"
            runat="server" ControlToValidate="AuthorTextBox"
            ValidationExpression="^[\w\s\.\-']{1,128}$"
            Text=" An Author's name may only contain letters, numbers, spaces, apostrophes, hyphens or periods."
            Display="Dynamic" SetFocusOnError="true" />
    </EditItemTemplate>
</asp:TemplateField>

For that there is already a wheel as well. Why do you so strongly want to
invent your own wheels?

You know what... forget it. Someone else understood what I was trying
to do, and gave me a helpful suggestion. You seem to be stuck in the
belief that I am working on something completely different from what I
am really doing, and that's not helpful in the slightest.

Thanks for your time anyway.
