A Question About Regular Expressions and Capture

eBob.com · Jun 13, 2006

I am using regular expressions and a particular feature called "capture" (I
think) to suck some information out of some html. I could never have come
up with this myself but Balena has an example which is very similar to this.
The guts of the program are ...

Dim i As Integer
Dim rgx As Regex

Dim Pattern As String = "<td class=td1 width=""35%"">(){0,1}(?<variable>(\w| )+)</td>" + _
    "\s*.*\s*<td class=td2 width=""65%"">(){0,1}(?<value>.+)(){0,1}</td>"

' extra parentheses don't help
Dim Pattern2 As String = "<td class=td1 width=""35%"">(){0,1}(?<variable>(\w| )+)</td>" + _
    "\s*.*\s*<td class=td2 width=""65%"">(){0,1}((?<value>.+))(){0,1}</td>"

rgx = New Regex(Pattern)
tbxPattern.Text = Pattern

Dim m As Match, g As Group

For Each m In rgx.Matches(tbxInput.Text)
    g = m.Groups("variable")
    lstbxKeys.Items.Add(g.Value)
    g = m.Groups("value")
    lstbxValues.Items.Add(g.Value)
Next

The data looks like this (below). It works fine for all cases except the
first (the "Celular" data) where the value is picked up as
"123-abc-5678". I want, and I think it should be, "123-abc-5678". I
can't understand why the "" is included in the value. Doesn't my
pattern clearly show that the value is a string of one or more characters,
terminated by, optionally, "" followed by "</td>"? Is there a
straightforward way to tell it to not include the "" in the value? Note
that the "" is not always present so the pattern has to say that it is
optional.

Thanks, Bob

<tr height=24>
<td class=td1 width="35%">Celular</td>
<td width=1><img src="../img/p.gif" width=1 height=1></td>
<td class=td2 width="65%">123-abc-5678</td>
</tr>

<tr height=24>
<td class=td1 width="35%">Edad</td>
<td width=1><img src="../img/p.gif" width=1 height=1></td>
<td class=td2 width="65%">24 Años</td>
</tr>

<tr height=24>
<td class=td1 width="35%">Altura</td>
<td width=1><img src="../img/p.gif" width=1 height=1></td>
<td class=td2 width="65%">1.70 mts.</td>

Larry Lard · Jun 13, 2006

eBob.com said:
The data looks like this (below). It works fine for all cases except the
first (the "Celular" data) where the value is picked up as
"123-abc-5678". I want, and I think it should be, "123-abc-5678". I
can't understand why the "" is included in the value. Doesn't my
pattern clearly show that the value is a string of one or more characters,
terminated by, optionally, "" followed by "</td>".

Yes, but remember that regex quantifiers are 'greedy' by default - they
capture as many characters as they can while still letting the whole
pattern match. Thus when given a choice between:

value: 123-abc-5678
optional : no

and

value: 123-abc-5678
optional : yes

since the 'value' match happens first, and it can legitimately capture
the optional trailing text as well, it does, and the optional group is
left matching nothing.

eBob.com said:
Is there a straightforward way to tell it to not include the "" in the value?

How about, instead of value capturing one or more of any character with

.+

you instead capture one or more characters that aren't < with

[^<]+
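
For illustration, here is a minimal VB.NET sketch comparing the two captures. The optional tag did not survive in the post above, so <b> and </b> are stand-ins chosen purely as an assumption. GreedyDemo, CaptureValue, Prefix and the sample cells are hypothetical names, not part of the original program.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GreedyDemo
    ' Assumption: the optional tag is <b>...</b>; the real tag was lost from the post.
    Public Const Prefix As String = "<td class=td2 width=""65%"">"
    Public Const GreedyPattern As String = Prefix & "(<b>){0,1}(?<value>.+)(</b>){0,1}</td>"
    Public Const NegatedPattern As String = Prefix & "(<b>){0,1}(?<value>[^<]+)(</b>){0,1}</td>"

    ' Hypothetical sample cells: one with the optional tag, one without.
    Public Const BoldCell As String = Prefix & "<b>123-abc-5678</b></td>"
    Public Const PlainCell As String = Prefix & "24 Años</td>"

    ' Returns the text of the "value" group, or Nothing when the pattern does not match.
    Function CaptureValue(pattern As String, input As String) As String
        Dim m As Match = Regex.Match(input, pattern)
        If m.Success Then Return m.Groups("value").Value
        Return Nothing
    End Function

    Sub Main()
        ' Greedy .+ swallows the closing tag; the optional group then matches nothing.
        Console.WriteLine(CaptureValue(GreedyPattern, BoldCell))   ' 123-abc-5678</b>
        ' [^<]+ has to stop at the first "<", so the optional group consumes the tag.
        Console.WriteLine(CaptureValue(NegatedPattern, BoldCell))  ' 123-abc-5678
        ' Cells without the tag still match.
        Console.WriteLine(CaptureValue(NegatedPattern, PlainCell)) ' 24 Años
    End Sub
End Module
```

The negated class only works because the wanted text never contains "<"; a value with nested markup would need a different approach.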

Also, there are flags you can put in to make expressions non-greedy,
but I don't think that will work in this situation.

BUT

I would *urge* you to stop trying to parse HTML with regex, and
instead run (don't walk) to
<http://smourier.blogspot.com/2005/05/net-html-agility-pack-how-to-use.html>,
and from there download HtmlAgilityPack, which is an absolutely
invaluable library that converts (even malformed) HTML into a nice XML
document tree. It makes HTML parsing a hundred times easier
than trying to use regex.
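
As a rough sketch of what that could look like for the table rows in this thread, assuming the HtmlAgilityPack assembly is referenced by the project. AgilityPackSketch and ExtractPairs are hypothetical names, and the XPath expressions assume each row carries one td1 cell and one td2 cell as in the sample data.

```vbnet
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module AgilityPackSketch
    ' Returns (key, value) pairs from every row that has both a td1 and a td2 cell.
    Function ExtractPairs(html As String) As List(Of KeyValuePair(Of String, String))
        Dim pairs As New List(Of KeyValuePair(Of String, String))()
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
        Dim rows As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//tr")
        If rows Is Nothing Then Return pairs

        For Each row As HtmlNode In rows
            Dim keyCell As HtmlNode = row.SelectSingleNode(".//td[@class='td1']")
            Dim valueCell As HtmlNode = row.SelectSingleNode(".//td[@class='td2']")
            If keyCell IsNot Nothing AndAlso valueCell IsNot Nothing Then
                ' InnerText drops nested tags, so optional markup inside a cell does not matter.
                pairs.Add(New KeyValuePair(Of String, String)( _
                    keyCell.InnerText.Trim(), valueCell.InnerText.Trim()))
            End If
        Next
        Return pairs
    End Function

    Sub Main()
        ' One row copied from the sample data, wrapped in a table for the sketch.
        Dim html As String = "<table><tr height=24>" & _
            "<td class=td1 width=""35%"">Edad</td>" & _
            "<td width=1><img src=""../img/p.gif"" width=1 height=1></td>" & _
            "<td class=td2 width=""65%"">24 Años</td>" & _
            "</tr></table>"

        For Each pair As KeyValuePair(Of String, String) In ExtractPairs(html)
            Console.WriteLine(pair.Key & " = " & pair.Value)
        Next
    End Sub
End Module
```

Selecting cells by class instead of matching text means the optional tag never has to appear in a pattern at all.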

eBob.com · Jun 13, 2006

Thank you very much Larry. It finally occurred to me that there had to be
some way to take advantage of the fact that the string I am after does not
contain "<", but the only solution I could think of was very ugly. Your
suggestion is much, much better. And thank you for making me aware of the
HtmlAgilityPack, I will be looking into it.

Thanks, Bob
