Regular Expression for all attributes in HTML tag

  • Thread starter: Gert Conradie

Gert Conradie

I need to list all the key/value pairs of an HTML tag. I already have
the complete tag as a text string.

For example: (Worst case scenario where standards were not followed in
the past)
<myTag key1="aaa" key2 = "bbb" key3='ccc' key4=444 key5= 555
key5="Please click here" >

I ended up with two versions, each with its own flaw, and I can't seem to
merge them:
A) Allows values without " or ' around them, but fails when there is a
space in the attribute value:
\b(?<Keyword>[^>\s][\w]+)[\s]*=[\s]*[",']?(?<Value>[\w]*)[",']?

B) Allows spaces in the attribute value, but misses values without " or '
around them:
\b(?<Keyword>[^>\s][\w]+)[\s]*=[\s]*[",']?(?<Value>[\w\s]*)[",']

This is my merge attempt, which finds all the keys and integer values,
but not the text values:
\b(?<Keyword>[^>\s][\w]+)[\s]*=[\s]*(?<Value>((?<!["'])[\d]+(?!["']))|((?<=["']?)[\w\s]*(?=["']?)))

Thanks in advance - help here would be much appreciated.

Gert
 
Hi Gert,
I need to list all the key/value pairs of an HTML tag. I already have
the complete tag as a text string.
(Worst case scenario where standards were not followed in the past)

Since your parser needs to be aware of all kinds of ways to write
attributes, I think trying to write an all-around regular expression quickly
becomes a steep uphill climb.

I would probably forget about regular expressions altogether, and instead
write a simple text parser of my own. I think that would be simpler.

Just a thought, I'm not saying you can't do it with regex.
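
A minimal sketch of such a hand-written parser, in Python for brevity. The
function name parse_attributes is made up for illustration, and the sketch
assumes the input is one complete tag with no ">" inside attribute values:

```python
def parse_attributes(tag):
    """Return a list of (key, value) pairs found in one HTML tag string."""
    # Drop the angle brackets; assumes no '>' appears inside a value.
    body = tag.strip().lstrip("<").rstrip(">").strip()
    pairs = []
    i, n = 0, len(body)

    # Skip the tag name itself.
    while i < n and not body[i].isspace():
        i += 1

    while i < n:
        # Skip whitespace before the key.
        while i < n and body[i].isspace():
            i += 1
        if i >= n:
            break
        # Read the key up to whitespace or '='.
        start = i
        while i < n and not body[i].isspace() and body[i] != "=":
            i += 1
        key = body[start:i]
        # Allow whitespace between the key and '='.
        while i < n and body[i].isspace():
            i += 1
        if i >= n or body[i] != "=":
            pairs.append((key, None))  # attribute without a value
            continue
        i += 1  # consume '='
        # Allow whitespace between '=' and the value.
        while i < n and body[i].isspace():
            i += 1
        if i < n and body[i] in "\"'":
            # Quoted value: read up to the matching quote.
            quote = body[i]
            i += 1
            start = i
            while i < n and body[i] != quote:
                i += 1
            value = body[start:i]
            i += 1  # consume the closing quote
        else:
            # Unquoted value: read up to the next whitespace.
            start = i
            while i < n and not body[i].isspace():
                i += 1
            value = body[start:i]
        pairs.append((key, value))
    return pairs


sample = ("<myTag key1=\"aaa\" key2 = \"bbb\" key3='ccc' key4=444 key5= 555\n"
          "key5=\"Please click here\" >")
for key, value in parse_attributes(sample):
    print(key, "->", value)
```

Pairs are returned as a list rather than a dictionary because the sample tag
repeats key5, and a dictionary would silently keep only one of the two values.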

--
Regards,

Mr. Jani Järvinen
C# MVP
Helsinki, Finland
(e-mail address removed)
http://www.saunalahti.fi/janij/
 
This ought to do it for you:

(\w+)=(?:["']?([^"'>=]*)["']?)

Translation: a sequence of one or more word characters (letters, digits
and/or underscores), followed by an equals sign, followed by 0 or 1 single
quote or double quote, followed by any number of characters that are not a
single quote, a double quote, a right angle bracket or an equals sign,
followed by 0 or 1 single or double quotes.
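
To see what the pattern captures, it can be run over the sample tag from the
first post. This is only a quick check in Python (the thread itself targets
.NET); the pattern is unchanged and the variable names are illustrative:

```python
import re

# The pattern from this post, unchanged.
pattern = re.compile(r"""(\w+)=(?:["']?([^"'>=]*)["']?)""")

# The sample tag from the original question, line break included.
tag = ("<myTag key1=\"aaa\" key2 = \"bbb\" key3='ccc' key4=444 key5= 555\n"
       "key5=\"Please click here\" >")

# findall returns one (key, value) tuple per match.
matches = pattern.findall(tag)
for key, value in matches:
    print(key, "->", repr(value))
```

On this input the pattern skips key2, because it allows no whitespace before
the equals sign, and the unquoted value of key4 runs on into the following key
name, because spaces are permitted inside the value class.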

--
HTH,

Kevin Spencer
Microsoft MVP
Chicken Salad Surgery

It takes a tough man to make a tender chicken salad.
 
Hi Kevin & others
(\w+)=(?:["']?([^"'>=]*)["']?)

This one misses the "key4=444" in my example, but it surely makes my attempt
look like a goods train by comparison. :) I will use it as a starting point
to try again.
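
One possible way to build on that starting point is to give quoted and
unquoted values separate alternatives, so that spaces are accepted only inside
quotes. The sketch below is in Python, which writes named groups as
(?P<name>...) rather than the (?<name>...) form used earlier in the thread.
The group names dq, sq and bare and the function name attributes are
illustrative assumptions, not something from the posts above:

```python
import re

ATTR = re.compile(r"""
    (?P<Keyword>[\w-]+)        # attribute name
    \s*=\s*                    # equals sign, optional whitespace either side
    (?:
        "(?P<dq>[^"]*)"        # double-quoted value, may contain spaces
      | '(?P<sq>[^']*)'        # single-quoted value, may contain spaces
      | (?P<bare>[^\s"'>]+)    # unquoted value, ends at whitespace or '>'
    )
""", re.VERBOSE)


def attributes(tag):
    """Return (key, value) pairs for every key=value attribute in the tag."""
    pairs = []
    for m in ATTR.finditer(tag):
        # Exactly one of the three value groups is set for each match.
        value = next(v for v in m.group("dq", "sq", "bare") if v is not None)
        pairs.append((m.group("Keyword"), value))
    return pairs


tag = ("<myTag key1=\"aaa\" key2 = \"bbb\" key3='ccc' key4=444 key5= 555\n"
       "key5=\"Please click here\" >")
for key, value in attributes(tag):
    print(key, "->", value)
```

Because each quoted value is consumed as a whole, an equals sign inside a
quoted value does not start a new match. Attributes with no value at all are
not reported by this pattern.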

Jani & Winista, I will try the parser and let you know the results...

Thanks, gert




 