non-backtracking subexpression

P

Passiday

Hi,

I am trying here to make sense of two regex constructs:

1. Non-backtracking subexpression: (?> subexpression)

The official description (The subexpression is fully matched once, and
then does not participate piecemeal in backtracking.) does not make
sense for me. It's hard to understand without an example, which shows
an expression that does match with generic group, but does not match
with non-backtracking group. For example, I hoped to see regex "^.*?(?
x+)g.*$" fail to match with string "abcxxxdefxxxxghi" (ie, the "(?>x
+)" would eat up the first "xxx" and when it could no match the
following "g", the regex engine would not give up the already consumed
"xxx" for backtracking purposes. I would be happy to understand where
I am mistaken.

2. Balancing group: (?<name1-name2> subexpression)

I am looking for a stable way to extract outer html from text that is
retrieved from a web page. I was planning to load the text in XML
parser and then select the needed nodes using XPath, but I hate to
thing about the need to tidy up the text so that it really is valid
for XML parsing, and also although I would have a matching XML
element, reading the raw original html that's under that matching XML
note would be impossible, because XML property is built up from the
parsed object. So I hoped that it is possible to use the regexes, and
the balancing groups could help in order to select the full outer html
of the matching element (it indeed has a lot of inner elements). So
I'd be grateful for an example how to do it.

The need: extract the full outer html of elements like "<elementName1|
elementName2 class="className1|className2" .. some other
attributes ..>". The matching elements would not contain other
matching elements in their body, but they do contain other html markup
that makes the regex nontrivial. I speculate that the correct regex
would contain the elementName in group that is backreferenced in the
end of the regex (in order to match with correct </elementName>), but
in between there would be some construction of balancing groups that
makes sure that between the outer tags there is a valid html,
including the loose use of <br>.


Thanks,

Passiday
 
P

Passiday

Ok,

While I am still wondering about the non-backtracking subexpression,
here is my own answer to the outer-HTML extraction. It might be of
some help for someone doing HTML parsing.

<(?<SearchElement>div|span) class="[^"]*\b(className1|className2)\b[^"]
*">(<(?<Element>\w+)[^>]*>|</\k<Element>>(?<-Element>)|[^<]*)*(?
(Element)(?!))</\k<SearchElement>>

(The regex could be split in several lines once this message is
submitted)

Here's a brief explanation how this works:

For purposes of easier referencing to the different parts of the
regex, here is it's split in several sections:
1. <(?<SearchElement>div|span) class="[^"]*\b(className1|className2)\b
[^"]*">
2. (
2.1. <(?<Element>\w+)[^>]*>|
2.2. </\k<Element>>(?<-Element>)|
2.3. [^<]*
)*
3. (?(Element)(?!))
4. </\k<SearchElement>>

And what each section does:

1. Here the html element of interest is located. In this case, a div
or span element is looked for, with class attribute className1 or
className2.
2. This section with it's quantifier (*) will match the full inner
html. It is necessary to make sure that it's contents are valid html,
whatever the number of nested elements there is. The regex is a
collection of three types of animals that the inner html text can be
split in: opening tag, closing tag, and the text in between the tags.
2.1. This expression matches the opening tag. The element name is
stored in the Element stack.
2.2. The closing tag. Notice how the "\k<Element>" is used to refer to
the corresponding opening element. The regex will not validate if the
topmost value in the Element stack does not match with this closing
element. Such situation could appear only if the inner html is not
proper, that is, there is mismatched closing tag.
2.3. This matches whatever text in between the tags.
3. While the 2.2 took care for checking if there are not too many
closing tags, it is necessary to make sure also that there are not
left unclosed opening tags. This expression does just that. It says
"If the Element stack is not empty, fail the match".
4. The closing tag of outer html. See how the "\k<SearchElement>" is
used to match the opening element name.

I guess that adding "?>" (non-backtracking subexpression) to the start
of the 2nd section could improve this expressions performance, but I
am not quite sure how it works. I might imagine that it could come to
use when the regex fails at section 3, and then starts backtracking
the section 2 in pointless search for possible match.

I am well aware that the expression above is far from perfect html
validation. The html element names have their specific rules, there
can be a lot of whitespace here and there, etc. But it should provide
a good base for understanding how stuff works.

My thanks go to Steven Levithan for his elaborate blog post on the
balancing groups: (http://blog.stevenlevithan.com/archives/balancing-
groups). I wish the balancing groups had more quality documentation
online. For example, I wonder is the section 2.2 could be rewritten as
"(?<-Element></\k<Element>>)". At least it seems so, that the stack-
popping happens only if the expression in parens is matched. Also it
is unclear, what exactly the (<?<Name2-Name1>) does. Well, the doc
says the Name1 value is popped out, but something weird is pushed in
the Name2 stack.

I hope this post makes some contribution to the use of this valuable
but obscure feature of .NET regular expressions.


Regards,

Passiday
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top