Regex Validator - detect all but certain HTML tags

Barry L. Camp

Hi all... hope someone can help out.

Not a unique situation, but my search for a solution has not yielded
what I need yet.

I'm trying to come up with a regular expression for a
RegularExpressionValidator that will allow certain HTML tags:

<a>, <b>, <blockquote>, <br>, <i>, <img>, <li>, <ol>, <p>, <quote>,
<ul>

but block others. So basically I'd like to detect "<" then look for
certain sequences (the tags above). But I also of course have to
account for any number of possible attributes. And then some of the
tags above have closing tags, others do not.
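A minimal sketch of the kind of whitelist pattern described above, written in Python only so it can be run and tested outside the page. The names `ALLOWED_TAGS`, `ALLOWED_HTML` and `only_allowed_tags` are made up for illustration. The idea is to treat the input as a run of either non-"<" characters or complete tags whose name is on the list, and to require the whole input to match, which is how a RegularExpressionValidator applies its ValidationExpression. Attribute contents are not inspected at all, so this is a tag-name check, not a safety filter.

```python
import re

# Tag names taken from the list in the post above.
ALLOWED_TAGS = ("a", "b", "blockquote", "br", "i", "img",
                "li", "ol", "p", "quote", "ul")

# The whole input must be a sequence of:
#   [^<]                    any character that is not "<", or
#   </?name ...attrs... />  an opening, closing or self-closed allowed tag.
# After the name, only whitespace-led attributes, "/" or ">" may follow,
# so "<body>" is not mistaken for "<b>".
ALLOWED_HTML = re.compile(
    r"(?:[^<]|</?(?:" + "|".join(ALLOWED_TAGS) + r")(?:\s[^<>]*)?/?>)*",
    re.IGNORECASE,  # assumption: tag names should match in any letter case
)


def only_allowed_tags(text):
    """Return True when every "<" in text starts an allowed tag."""
    return ALLOWED_HTML.fullmatch(text) is not None


if __name__ == "__main__":
    print(only_allowed_tags('Hello <b>world</b><br />'))
    print(only_allowed_tags('<script>alert(1)</script>'))
```

The pattern uses only grouping, alternation and character classes, so it should port to a ValidationExpression largely unchanged, but case-insensitive matching is set here through a Python flag and would have to be handled differently there. Note that a stray "<" (as in "a < b") is rejected as well.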

I don't fully understand regular expressions, yet would like to learn;
however, I also want to find a way to do this soon.

I don't want to reinvent this particular "wheel" if it has already been
done before, if you know what I mean. :)

Any help from anyone out there would be greatly appreciated. Thanks
Much!

Barry L. Camp
 
Barry L. Camp

I need to find a regex for Regular Expression Validator to perform what
I am describing.

I'm not interested in parsing a web page. I'm trying to parse the
contents of a textbox - I have a DetailsView control, for which one
textbox is for content that may be displayed in an .aspx page. I want
to allow certain tags, but block all others. The DetailsView is bound
to an ObjectDataSource, so naturally it would be nice to have the
Validator catch everything for me if at all possible.

Any ideas (anyone)?

Thanks,

Barry
 
Guest

I'm not interested in parsing a web page. I'm trying to parse the
contents of a textbox - I have a DetailsView control, for which one
textbox is for content that may be displayed in an .aspx page. I want
to allow certain tags, but block all others. The DetailsView is bound
to an ObjectDataSource, so naturally it would be nice to have the
Validator catch everything for me if at all possible.

You can load the textbox contents into MSHTML.

Finding a regular expression to handle all the cases of HTML will be
challenging to say the least - I suggest you take a look at MSHTML again
and see if it'll work.
 
Barry L. Camp

That example is not what I am looking for. I'm not trying to grab an
entire web page, or rebuild the content that may be in the textbox. All
I am trying to do is detect whether "forbidden" HTML tags are in the
text, and prevent further processing (or in my case, prevent the user
from saving a record in the DetailsView) until the user has edited the
contents of the textbox such that they are acceptable (i.e. don't have
"forbidden" HTML tags).

I'll grant that finding a suitable regex is not easy. Ideally that
would be the best solution, though, as I have my DetailsView hooked
into an ObjectDataSource, and would like to have the validator catch
everything, in-stream. I don't want to have to instantiate MSHTML just
to parse one single textbox.
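The detection step described in this post could also be done as a plain scan that reports which tag names are not on the allowed list, so the user can be told what to remove. This is only a sketch in Python under that assumption; `ALLOWED_TAGS`, `TAG_NAME` and `find_forbidden_tags` are hypothetical names, and the scan looks only at tag names, not at attributes.

```python
import re

# Tag names from the allowed list in the first post.
ALLOWED_TAGS = {"a", "b", "blockquote", "br", "i", "img",
                "li", "ol", "p", "quote", "ul"}

# "<" or "</" followed immediately by a tag name.
TAG_NAME = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)")


def find_forbidden_tags(text):
    """Return the forbidden tag names found in text, lower-cased,
    in order of first appearance and without duplicates."""
    found = []
    for match in TAG_NAME.finditer(text):
        name = match.group(1).lower()
        if name not in ALLOWED_TAGS and name not in found:
            found.append(name)
    return found


if __name__ == "__main__":
    print(find_forbidden_tags('Hi <b>x</b> <script>y</script>'))
```

An empty result would mean the text can be saved; a non-empty one could feed the error message shown to the user.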
 
Guest

I'll grant that finding a suitable regex is not easy. Ideally that
would be the best solution, though, as I have my DetailsView hooked
into an ObjectDataSource, and would like to have the validator catch
everything, in-stream. I don't want to have to instantiate MSHTML just
to parse one single textbox.

Regular Expressions are not well suited to parse complex XML type documents
due to the nested nature of such documents. There are better tools for the
job.

Perhaps loading the HTML into an XML Doc and using XPath to search for
unwanted tags?
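As a rough illustration of that idea (in Python rather than .NET, purely so it can be run standalone), the textbox content can be wrapped in a dummy root element, parsed as XML, and searched with the XPath expression `.//*` for any element whose name is not allowed. The function name `forbidden_tags` and the `root` wrapper are assumptions of this sketch. The catch is that the input must be well-formed XML, so an unclosed `<br>` makes the parse fail.

```python
import xml.etree.ElementTree as ET

# Tag names from the allowed list in the first post.
ALLOWED_TAGS = {"a", "b", "blockquote", "br", "i", "img",
                "li", "ol", "p", "quote", "ul"}


def forbidden_tags(fragment):
    """Return the set of element names in fragment that are not allowed.

    Raises xml.etree.ElementTree.ParseError if the fragment is not
    well-formed XML once wrapped in a single root element.
    """
    # Wrap the fragment so several top-level nodes parse as one document.
    root = ET.fromstring("<root>" + fragment + "</root>")
    # ".//*" selects every descendant element of the wrapper.
    return {el.tag.lower() for el in root.findall(".//*")
            if el.tag.lower() not in ALLOWED_TAGS}


if __name__ == "__main__":
    print(forbidden_tags('<p>Hi <b>there</b></p><script>x</script>'))
```

Whether the strictness about well-formed input is acceptable depends on how forgiving the textbox is meant to be.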

If you're set on using Regular Expressions, take a look at Community
Server's source code. I recall it had a set of regular expressions to parse
out unwanted tags.

http://communityserver.org/
 
Cor Ligthert [MVP]

Barry,

Why then did you write this in your starting message?
I don't want to reinvent this particular "wheel" if it has already been
done before, if you know what I mean. :)

You definitely show that you want to reinvent a wheel that already exists.

Cor
 
Barry L. Camp

Cor said:
Barry,

Why then did you write this in your starting message?


You definitely show that you want to reinvent a wheel that already exists.

Cor

You still don't understand what I am trying to do.

I'm not trying to read HTML documents, XML, XHTML or any *TML.

The best way to explain what I am doing would be to go on any
discussion forum or web-based e-mail. There's a big, huge textbox
(textarea, or whatever). I've got one of these in a .NET 2.0
DetailsView control. I've got it bound to an ObjectDataSource. They're
all in TemplateFields, and I've got Validator controls hooked to all of
the textbox inputs. The smaller ones are tied to regex validators, with
simple expressions to allow text only, because that's all I need.

But this big textbox... I want to accept a small subset of HTML tags,
because this data... bound into a database... I want to mash together
with a MasterPage and some content, and render it back. It's not an
HTML document already. It's just text. I want to take the data and make
it PART of a web page.
 
Barry L. Camp

Spam said:
Regular Expressions are not well suited to parse complex XML type documents
due to the nested nature of such documents. There are better tools for the
job.

Perhaps loading the HTML into an XML Doc and using XPath to search for
unwanted tags?

If you're set on using Regular Expressions, take a look at Community
Server's source code. I recall it had a set of regular expressions to parse
out unwanted tags.

http://communityserver.org/


Well, I'm not exactly working with entire HTML pages. But I think
you've made a great suggestion.

I am building essentially a home-grown CMS, not on the order of a
CommunityServer or DNN, but just something that I can easily add/edit
content later on. It's just to suit what I need. It's also for me to
tinker with, and help educate myself on .NET 2.0, to help prepare for
the new Cert exams. (I've already got 70-431 done).

Looks like I have a lot more reading to do, but that's fine. I
appreciate the idea. Thanks much!

Barry
 
Cor Ligthert [MVP]

Barry,

You still did not look at the sample I gave you or at what is written here.
The first part of the sample gets an HTML page. It is mostly impossible
to give a sample without underlying data. The second part shows how you
can tear a page apart into different parts according to its tags.

However, what does Regex have to do with this?
But this big textbox... I want to accept a small subset of HTML tags,
because this data... bound into a database... I want to mash together
with a MasterPage and some content, and render it back. It's not an
HTML document already. It's just text. I want to take the data and make
it PART of a web page.

There is already a wheel for that as well. Why do you want so strongly to invent
your own wheels?

Cor
 
Barry L. Camp

Cor said:
Barry,

You still did not look at the sample I gave you or what is written here.

Yes, I did. And it has nothing to do with what I am doing.

The first part of the sample is to get an HTML page.

I don't care about that. I've said several times that I am not
interested in parsing an HTML page, so why would I want to even *get*
one?

It is mostly impossible
to give a sample without underlying data. The second part
shows how you can tear a page apart into different parts according to its tags.

I don't want to get a page, or tear it apart, or look at a page.

As I have said repeatedly, I'm taking input from a textbox, which is
*not* an HTML page, but may contain HTML tags. Parsing an entire HTML
page (which is what I AM NOT DOING) is a totally different concept than
the mere *detection* of a small number of tags in simple text (which is
what I AM DOING).

However, what does Regex have to do with this?

Because as I have said repeatedly, I am trying to use the
RegularExpressionValidator to enforce validation rules on my textbox.
Was I not clear enough?

How about this example - perhaps this illustrates it better.

<asp:TemplateField HeaderText="Author">
    <!-- Other ItemTemplate tags here. -->
    <EditItemTemplate>
        <asp:TextBox ID="AuthorTextBox" runat="server"
            Text='<%# Bind("Author") %>' />
        <asp:RegularExpressionValidator
            ID="RegularExpressionValidator1" runat="server"
            ControlToValidate="AuthorTextBox"
            ValidationExpression="^[\w\s\.\-']{1,128}$"
            Text="<br />An Author's name may only contain letters, numbers, spaces, apostrophes, hyphens or periods."
            Display="Dynamic" SetFocusOnError="true" />
    </EditItemTemplate>
</asp:TemplateField>

There is already a wheel for that as well. Why do you want so strongly to invent
your own wheels?

You know what... forget it. Someone else understood what I was trying
to do, and gave me a helpful suggestion. You seem to be stuck in the
belief that I am working on something completely different from what I
am really doing, and that's not helpful in the slightest.

Thanks for your time anyway.
 
