Secure Users from dangerous html tags with RegEx

  • Thread starter: Hero41Day

Hero41Day

Hi,

I'm looking for an easy way, using a regular expression, to remove
any dangerous HTML from user posts that might put other users at risk.

Right now I'm using this regular expression:
return
Regex.Replace(text,@"</?(((?!a|b|i|u|img|table|tr|td|/)[^>]*)|((a|b|i|u|table|tr|td|img)[^\s>]{1,}))*>","$1");

The problem is that it still removes the image tag, even though I don't
mind if it stays.

Does anyone have a better idea?

And yes, I'm aware of the "ValidateRequest" flag, but I don't want to
use it in these cases.

Thanks.
 
Define "dangerous HTML."

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Development Numbskull

Nyuck nyuck nyuck
 
You are better off defining safe HTML.

Be sure to strip CSS positioning styles, and you may want to validate
heights and widths. Also strip IE behaviors.

-- bruce (sqlwork.com)
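
As a rough illustration of the height/width validation suggested above, here is a minimal sketch. The class name ImgAttributeCheck, the method IsSafeDimension, and the 800-pixel limit are all hypothetical choices for this example, not something from the thread. It accepts only plain ASCII digits, so values such as "100%" or a CSS expression are rejected.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImgAttributeCheck
{
    // Hypothetical upper bound; pick whatever suits the page layout.
    private const int MaxDimension = 800;

    // One to four ASCII digits and nothing else (\z rejects a trailing newline).
    private static readonly Regex Digits = new Regex(@"^[0-9]{1,4}\z");

    // Returns true only for a whole number between 1 and MaxDimension.
    public static bool IsSafeDimension(string value)
    {
        if (value == null || !Digits.IsMatch(value))
        {
            return false;
        }
        int pixels = int.Parse(value);
        return pixels >= 1 && pixels <= MaxDimension;
    }
}
```

A width or height attribute that fails this check would simply be dropped from the tag rather than repaired.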

 
Do it the other way around instead. Use HtmlEncode to make the entire
string totally harmless, then use regular expressions to make a few
selected tags work again. That way you can decide which attributes are
allowed in the tags, so that they can't contain JavaScript. You can also
make sure that each starting tag has an ending tag, so they won't mess
up the layout of the rest of the page.
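
The encode-first approach described above could look roughly like the sketch below. Everything here is an assumption made for illustration: the class name PostSanitizer, the choice of allowed tags (b, i, u, and a with a single href), and the use of WebUtility.HtmlEncode from System.Net so the code runs outside a web request (in a page you would more likely call Server.HtmlEncode). Tags are only restored in matched pairs, so an unclosed tag stays visible as encoded text.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class PostSanitizer
{
    // An encoded <b>, <i> or <u> pair with no attributes at all.
    private static readonly Regex SimplePair = new Regex(
        @"&lt;(b|i|u)&gt;(.*?)&lt;/\1&gt;",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // An encoded <a href="..."> pair whose only attribute is an http(s) URL.
    // The URL may contain &amp; but no other entity, so an encoded quote
    // or angle bracket inside it prevents the match.
    private static readonly Regex Anchor = new Regex(
        @"&lt;a href=&quot;(https?://[^&\s]*(?:&amp;[^&\s]*)*)&quot;&gt;(.*?)&lt;/a&gt;",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string Sanitize(string text)
    {
        // Step 1: encode everything, so no markup survives.
        string result = WebUtility.HtmlEncode(text);

        // Step 2: re-enable allowed pairs. Repeat until nothing changes,
        // because a pair nested inside another is skipped on the first pass.
        string previous;
        do
        {
            previous = result;
            result = SimplePair.Replace(result, "<$1>$2</$1>");
            result = Anchor.Replace(result, "<a href=\"$1\">$2</a>");
        } while (result != previous);

        return result;
    }
}
```

With this sketch, a script tag or a javascript: link stays encoded and is shown as text, while <b>hi</b> comes back as real markup. It does not detect mis-nested tags such as <b><i>x</b></i>, so treat it as a starting point rather than a complete filter.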
 