A Question About Regular Expressions and Capture

eBob.com

I am using regular expressions and a particular feature called "capture" (I
think) to pull some information out of some HTML. I could never have come
up with this myself, but Balena has an example which is very similar to this.
The guts of the program are ...

Dim i As Integer
Dim rgx As Regex

Dim Pattern As String = "<td class=td1 width=""35%"">(<b>){0,1}(?<variable>(\w| )+)</td>" + _
    "\s*.*\s*<td class=td2 width=""65%"">(<b>){0,1}(?<value>.+)(</b>){0,1}</td>"

Dim Pattern2 As String = "<td class=td1 width=""35%"">(<b>){0,1}(?<variable>(\w| )+)</td>" + _
    "\s*.*\s*<td class=td2 width=""65%"">(<b>){0,1}((?<value>.+))(</b>){0,1}</td>" ' extra parentheses don't help

rgx = New Regex(Pattern)
tbxPattern.Text = Pattern

Dim m As Match, g As Group

For Each m In rgx.Matches(tbxInput.Text)
    g = m.Groups("variable")
    lstbxKeys.Items.Add(g.Value)
    g = m.Groups("value")
    lstbxValues.Items.Add(g.Value)
Next

The data looks like this (below). It works fine for all cases except the
first (the "Celular" data) where the value is picked up as
"123-abc-5678</b>". I want, and I think it should be, "123-abc-5678". I
can't understand why the "</b>" is included in the value. Doesn't my
pattern clearly show that the value is a string of one or more characters,
terminated by, optionally, "</b>" followed by "</td>"? Is there a
straightforward way to tell it to not include the "</b>" in the value? Note
that the "</b>" is not always present so the pattern has to say that it is
optional.

Thanks, Bob


<tr height=24>
<td class=td1 width="35%"><b>Celular</td>
<td width=1><img src="../img/p.gif" width=1 height=1></td>
<td class=td2 width="65%"><b>123-abc-5678</b></td>
</tr>



<tr height=24>
<td class=td1 width="35%">Edad</td>
<td width=1><img src="../img/p.gif" width=1 height=1></td>
<td class=td2 width="65%">24 Años</td>
</tr>

<tr height=24>
<td class=td1 width="35%">Altura</td>
<td width=1><img src="../img/p.gif" width=1 height=1></td>
<td class=td2 width="65%">1.70 mts.</td>
 
Larry Lard

eBob.com said:
The data looks like this (below). It works fine for all cases except the
first (the "Celular" data) where the value is picked up as
"123-abc-5678</b>". I want, and I think it should be, "123-abc-5678". I
can't understand why the "</b>" is included in the value. Doesn't my
pattern clearly show that the value is a string of one or more characters,
terminated by, optionally, "</b>" followed by "</td>"?

Yes, but remember that regexes are 'greedy' by default - they always
capture as many characters as they can. Thus when given a choice
between:

value: 123-abc-5678</b>
optional </b>: no

and

value: 123-abc-5678
optional </b>: yes

since the 'value' match happens first, and it can legitimately capture
the "</b>" as well, the first option is the one you get.

eBob.com said:
Is there a
straightforward way to tell it to not include the "</b>" in the value?

How about, instead of value capturing one or more of any character with


.+

you instead capture one or more characters that aren't < with

[^<]+
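
To see the difference on the "Celular" cell from the sample data, here is a minimal sketch rather than a drop-in fix. CaptureValue is a hypothetical helper made up for this example, and the pattern is cut down to just the td2 cell; it is not the full two-cell pattern from the question.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GreedyDemo
    ' Hypothetical helper for this sketch: plugs valuePattern into the
    ' "value" group and returns what that group captured (Nothing if no match).
    Function CaptureValue(ByVal valuePattern As String, ByVal html As String) As String
        Dim pattern As String = "<td class=td2 width=""65%"">(<b>)?(?<value>" & _
            valuePattern & ")(</b>)?</td>"
        Dim m As Match = Regex.Match(html, pattern)
        If m.Success Then Return m.Groups("value").Value
        Return Nothing
    End Function

    Sub Main()
        Dim cell As String = "<td class=td2 width=""65%""><b>123-abc-5678</b></td>"

        ' Greedy .+ runs to the end, then backs off only as far as it must,
        ' so the optional </b> ends up matching nothing.
        Console.WriteLine(CaptureValue(".+", cell))     ' 123-abc-5678</b>

        ' [^<]+ has to stop at the first <, which leaves </b> for the optional group.
        Console.WriteLine(CaptureValue("[^<]+", cell))  ' 123-abc-5678
    End Sub
End Module
```

One thing to keep in mind: [^<]+ assumes the value itself never contains a tag, which holds for the sample rows shown but may not hold for every page.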

Also, you can make a quantifier non-greedy by putting a ? after it (.+?
instead of .+). As far as I can tell that works in this situation too, but
[^<]+ says more directly what you mean.
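
For what it's worth, the non-greedy option can be tried directly on the sample cells. This is a minimal sketch, assuming the lazy quantifier .+? (a ? placed after the +) and a pattern cut down to just the td2 cell; LazyValue is a hypothetical helper made up for this example.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LazyDemo
    ' Hypothetical helper for this sketch: captures the "value" group
    ' using a lazy .+? instead of the greedy .+ from the question.
    Function LazyValue(ByVal html As String) As String
        Dim pattern As String = "<td class=td2 width=""65%"">(<b>)?(?<value>.+?)(</b>)?</td>"
        Dim m As Match = Regex.Match(html, pattern)
        If m.Success Then Return m.Groups("value").Value
        Return Nothing
    End Function

    Sub Main()
        ' Lazy .+? grows one character at a time, so the optional </b>
        ' gets its chance to match before the value swallows it.
        Console.WriteLine(LazyValue("<td class=td2 width=""65%""><b>123-abc-5678</b></td>")) ' 123-abc-5678
        Console.WriteLine(LazyValue("<td class=td2 width=""65%"">1.70 mts.</td>"))           ' 1.70 mts.
    End Sub
End Module
```

On these two cells the lazy form drops the trailing </b>. The caveat is that .+? is still free to run past a tag when the rest of the pattern cannot match any earlier, whereas [^<]+ never can.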

BUT

I would *urge* you to stop trying to parse HTML with regex, and
instead run (don't walk) to
<http://smourier.blogspot.com/2005/05/net-html-agility-pack-how-to-use.html>,
and from there download HtmlAgilityPack, which is an absolutely
invaluable library that converts (even malformed) HTML into a nice XML
document tree. It makes HTML parsing a hundred times easier
than trying to use regex.
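
To give a rough idea of what that looks like, here is a minimal sketch, assuming the HtmlAgilityPack assembly is referenced in the project. The XPath expressions are guesses based on the sample rows in the question (the td1 and td2 class names), and the input is one of the well-formed rows wrapped in a table tag for the example. I have not checked how the library treats the row with the unclosed <b>.

```vbnet
Imports System
Imports HtmlAgilityPack

Module AgilityDemo
    Sub Main()
        ' One row from the sample data, wrapped in <table> for this sketch.
        Dim html As String = "<table><tr height=24>" & _
            "<td class=td1 width=""35%"">Edad</td>" & _
            "<td width=1><img src=""../img/p.gif"" width=1 height=1></td>" & _
            "<td class=td2 width=""65%"">24 Años</td>" & _
            "</tr></table>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' Rows that contain both a td1 (key) cell and a td2 (value) cell.
        ' SelectNodes returns Nothing when the XPath matches no nodes.
        Dim rows As HtmlNodeCollection = _
            doc.DocumentNode.SelectNodes("//tr[td[@class='td1'] and td[@class='td2']]")
        If rows Is Nothing Then Return

        For Each row As HtmlNode In rows
            ' InnerText is the cell's text with the markup (such as <b>) left out.
            Dim key As String = row.SelectSingleNode("td[@class='td1']").InnerText.Trim()
            Dim value As String = row.SelectSingleNode("td[@class='td2']").InnerText.Trim()
            Console.WriteLine(key & " = " & value)
        Next
    End Sub
End Module
```

In the original program the two strings would go into lstbxKeys and lstbxValues instead of the console.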
 
eBob.com

Thank you very much Larry. It finally occurred to me that there had to be
some way to take advantage of the fact that the string I am after does not
contain "<", but the only solution I could think of was very ugly. Your
suggestion is much, much better. And thank you for making me aware of the
HtmlAgilityPack; I will be looking into it.

Thanks, Bob

