html and regex

Sean

I have been trying to parse a webpage in my own free time and I have
come to yet another regex I can't quite seem to get. I wanted to get
the data inside the <table> tags, however the html and other data
inside it span multiple lines and nothing I use seems to work.

My first attempt was: <table .*>(?<Info>.*?)</table>

This worked on <table><td>this is some random sentence</td></table>

but not on:

<table width="100%" border="0" cellspacing="0" cellpadding="0"
class="niceTableBorder">
test
</table>

I tried playing with the whitespace \s escape but nothing seemed to
work.

Is there a highly recommended book on C# regexes that I can pick up
to learn this instead of relying on the usenet group here?

Any help appreciated.

-Sean
 
Sean

I was trying to stay away from 3rd party items. It's as much of a
learning experience as a fun hobby. I figured for something as trivial
as this I wouldn't need all the extra functionality. I understand that
if I were doing something more in-depth I wouldn't want to reinvent
the wheel, but I don't think this is that deep.
 
Jarlaxle

I would use the XmlDocument class or the XmlTextReader class. They are
simple to use and built into .NET.
 
Ben Voigt [C++ MVP]

To solve your immediate problem, use string.Replace("\r", " ").Replace("\n",
" ")

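A minimal sketch of that suggestion, assuming the page is already in a
string. The class and method names (FlattenDemo, Flatten, FirstTableBody)
are made up for illustration; the pattern is the one from the original
post, unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

static class FlattenDemo
{
    // Replace line breaks with spaces so '.' can run across what used
    // to be separate lines.
    public static string Flatten(string html)
    {
        return html.Replace("\r", " ").Replace("\n", " ");
    }

    // Returns the Info group of the first match, or an empty string
    // when the pattern does not match at all.
    public static string FirstTableBody(string html)
    {
        Match m = Regex.Match(Flatten(html), "<table .*>(?<Info>.*?)</table>");
        return m.Groups["Info"].Value;
    }

    static void Main()
    {
        string html =
            "<table width=\"100%\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\"\n" +
            "class=\"niceTableBorder\">\ntest\n</table>";
        Console.WriteLine("[" + FirstTableBody(html) + "]");
    }
}
```

The captured text keeps the spaces that replaced the line breaks, so a
Trim() on the result may be wanted.
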
However, you'll still be in trouble with multiple tables: the closing tag
may match the wrong opening tag. Whether nested tables or sibling tables
give you trouble depends on whether you use greedy or lazy matching;
neither minimal nor maximal matching will be correct in every case.
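
To make the greedy versus lazy point concrete, here is a small sketch. The
two patterns and the sample inputs are invented for illustration (they are
simplified to a bare <table> tag); the class name GreedyVsLazy is
hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyVsLazy
{
    public const string Greedy = "<table>(?<Info>.*)</table>";
    public const string Lazy = "<table>(?<Info>.*?)</table>";

    // Returns the Info group of the first match (empty if no match).
    public static string Capture(string pattern, string html)
    {
        return Regex.Match(html, pattern).Groups["Info"].Value;
    }

    static void Main()
    {
        string siblings = "<table>A</table><p>x</p><table>B</table>";
        string nested = "<table>outer<table>inner</table>tail</table>";

        // Greedy runs to the LAST </table>, swallowing both sibling tables.
        Console.WriteLine(Capture(Greedy, siblings));
        // Lazy stops at the FIRST </table>, which is right for siblings...
        Console.WriteLine(Capture(Lazy, siblings));
        // ...but cuts a nested table off at the inner closing tag.
        Console.WriteLine(Capture(Lazy, nested));
        // Greedy happens to be right for this nested input.
        Console.WriteLine(Capture(Greedy, nested));
    }
}
```
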

I think it's proven that you can't match nested delimiters in the general
case using pure regexes.

If you want something more lightweight than a full parser, you could try to
regex match individual tags and use a stack. Note that angle brackets
inside quoted strings could still cause you some grief.
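
One way the tag-plus-stack idea could look, as a rough sketch rather than a
finished parser. It assumes no '>' appears inside a quoted attribute value
(the caveat above), and the names TableScanner and TableBodies are made up
for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TableScanner
{
    // Matches only opening "<table ...>" and closing "</table>" tags.
    // The optional "close" group records whether a '/' was present.
    static readonly Regex TagPattern =
        new Regex(@"<(?<close>/)?table\b[^>]*>", RegexOptions.IgnoreCase);

    // Returns the inner text of every table. A nested table is reported
    // before the table that contains it, because it closes first.
    public static List<string> TableBodies(string html)
    {
        var bodies = new List<string>();
        var openEnds = new Stack<int>(); // index just past each unclosed opening tag

        foreach (Match tag in TagPattern.Matches(html))
        {
            if (!tag.Groups["close"].Success)
            {
                openEnds.Push(tag.Index + tag.Length);
            }
            else if (openEnds.Count > 0)
            {
                int start = openEnds.Pop();
                bodies.Add(html.Substring(start, tag.Index - start));
            }
            // A closing tag with nothing on the stack is ignored.
        }
        return bodies;
    }

    static void Main()
    {
        string html = "<table>outer<table>inner</table>tail</table><table>B</table>";
        foreach (string body in TableBodies(html))
            Console.WriteLine(body);
    }
}
```

Because each closing tag is paired with the most recent unclosed opening
tag, both nested and sibling tables come out correctly paired, which a
single greedy or lazy pattern cannot guarantee.
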
 
Sean

Thanks Ben.

Yeah, I think if I take out all of the line breaks as you suggest it'll
create fewer problems.

I also see what you mean: grab the tags and implement a stack object
to push and pop all of the opening and closing tag elements. I think
that'll be the best approach I can find.
 
Roger Frost

I too am interested in a really good book about Regular Expressions
(specific to C# or .NET would be excellent).

In the meantime, Sean, I have been referencing
http://www.regular-expressions.info/ but since this site has the top two
Google matches, you are probably already aware of it. :)

In your example below, have you tried passing RegexOptions.Singleline? It
causes "." to match all characters including "\n". I think it is required
in order to span multiple lines unless you match the newline character
explicitly in your Regex statement.
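
For what it's worth, a small sketch of what that option changes, using the
pattern and the multi-line sample from the original post. The class and
method names (SinglelineDemo, Capture) are only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class SinglelineDemo
{
    // Pattern from the original post, unchanged.
    public const string Pattern = "<table .*>(?<Info>.*?)</table>";

    public const string Sample =
        "<table width=\"100%\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\"\n" +
        "class=\"niceTableBorder\">\ntest\n</table>";

    public static bool IsMatch(string html, RegexOptions options)
    {
        return Regex.IsMatch(html, Pattern, options);
    }

    // Returns the Info group of the first match (empty if no match).
    public static string Capture(string html, RegexOptions options)
    {
        return Regex.Match(html, Pattern, options).Groups["Info"].Value;
    }

    static void Main()
    {
        // Without the option '.' stops at "\n", so the opening tag never completes.
        Console.WriteLine(IsMatch(Sample, RegexOptions.None));
        // With Singleline '.' also matches "\n" and the pattern succeeds.
        Console.WriteLine(IsMatch(Sample, RegexOptions.Singleline));
        Console.WriteLine("[" + Capture(Sample, RegexOptions.Singleline).Trim() + "]");
    }
}
```
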

Hope this helps. Regular expressions are the bane of my existence.
 
Jesse Houwing

Hello Sean,

As someone else already pointed out, '.' matches every character except
the newline. There is a special option to change this behaviour, but it is
rarely needed. In this case it would probably cause more trouble than it's
worth.

As someone else already suggested, an SGML or HTML reader is probably your
best option. I'd try out the HtmlAgilityPack on Codeplex.com, but this
can also be solved with regex:

<table[^>]*>(?<info>((?!</table).)*)

will work as long as you activate RegexOptions.Singleline

<table[^>]*>(?<info>((?!</table)[\s\S])*)

will work even without specifying RegexOptions.Singleline
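
A quick sketch for trying both patterns side by side. The sample input and
the names TemperedDot and Bodies are invented for this example; the two
patterns are exactly the ones above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TemperedDot
{
    // '.' variant: needs RegexOptions.Singleline to cross line breaks.
    public const string WithDot = @"<table[^>]*>(?<info>((?!</table).)*)";
    // [\s\S] variant: matches any character regardless of options.
    public const string WithClass = @"<table[^>]*>(?<info>((?!</table)[\s\S])*)";

    // Collects the info group of every match in the input.
    public static List<string> Bodies(string html, string pattern, RegexOptions options)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern, options))
            result.Add(m.Groups["info"].Value);
        return result;
    }

    static void Main()
    {
        string html = "<table class=\"a\">\nA\n</table><p>x</p><table>B</table>";
        foreach (string body in Bodies(html, WithDot, RegexOptions.Singleline))
            Console.WriteLine("[" + body.Trim() + "]");
        foreach (string body in Bodies(html, WithClass, RegexOptions.None))
            Console.WriteLine("[" + body.Trim() + "]");
    }
}
```

The negative lookahead stops the capture at the first "</table", so sibling
tables come out separately, but a nested table would still be cut off at
the inner closing tag.
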

The biggest problem with Singleline on is that there is a good chance
that a '.*' somewhere will consume the whole contents of the file and start
backtracking from there, just like the .* in your table pattern. These
are real performance killers.
 
Jesse Houwing

Hello Roger,

I can really recommend Regular Expressions with .NET by Dan Appleman (http://www.amazon.com/Regular-Expressions-NET-Dan-Appleman/dp/B0000632ZU).
Or Mastering Regular Expressions by Jeffrey Friedl (http://www.amazon.com/Mastering-Reg...bs_sr_1?ie=UTF8&s=books&qid=1205803889&sr=1-1).

The first one is a real .NET reference with C# code examples, and it explains
the specifics of the .NET regular expression syntax. The second is the
overall Bible on regular expressions. It covers the different ways to
implement a regex engine and builds on from there. It's a great read if you
want a more scientific background on the inner workings of a regex engine
and if you want to learn about and spot performance issues and other
harder parts of the regex language.

Jesse
 
Roger Frost

I will look into Dan Appleman's book for sure.

Once I understand how to use regular expressions correctly maybe I can get
ambitious. :)


Thanks a bunch Jesse!
 