Regular expression for nested HTML tags

Sudheer

I am looking for a regular expression to find certain content
present in an HTML page.

The HTML page looks something like this:

<div class="info">
<h5>Genre:</h5>
<a href="http://www.imdb.com/Sections/Genres/Action/">Action</a> / <a
href="http://www.imdb.com/Sections/Genres/Adventure/">Adventure</a> /
<a href="http://www.imdb.com/Sections/Genres/Crime/">Crime</a> / <a
href="http://www.imdb.com/Sections/Genres/Thriller/">Thriller</a> <a
class="tn15more inline" href="http://www.imdb.com/title/tt0337978/
keywords" onclick="(new Image()).src='/rg/title-tease/keywords/images/
b.gif?link=/title/tt0337978/keywords';">more</a>
</div>

<div class="info">
<h5>Tagline:</h5>
Yippee Ki Yay Mo - John 6:27
</div>

<div class="info">
<h5>Plot Outline:</h5>
John McClane takes on an Internet-based terrorist organization who is
systematically shutting down the United States. <a class="tn15more
inline" href="http://www.imdb.com/title/tt0337978/plotsummary"
onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
link=/title/tt0337978/plotsummary';">more</a>
</div>



Now I need a regular expression that scans the entire HTML and
helps me extract:
1. the tagline of the movie
2. the plot outline, and so on.


These are guaranteed to be present in a div with class="info".

Any help in this regard would be appreciated!
 
Jesse Houwing

Hello Sudheer,

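That would be pretty easy to do (written here as C# verbatim strings, so
\s needs no extra escaping and "" stands for a literal quote):

@"<div class=""info"">\s*<h5>Tagline:</h5>(?<Tagline>((?!</div).)+)"
@"<div class=""info"">\s*<h5>Plot Outline:</h5>(?<Plot>((?!</div).)+)"

Or, more generically:
@"<div class=""info"">\s*<h5>(?<Key>[^:]+):</h5>(?<Value>((?!</div).)+)"

Pass RegexOptions.Singleline with these patterns: the content spans several
lines, and . does not match a line break by default. The captured value
stops at the first </div, so it still contains any nested tags such as <a>.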
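To see the generic pattern at work, here is a minimal sketch in Python rather than .NET, so a few things are assumptions on my part: the group syntax becomes (?P<Name>...), re.DOTALL plays the role of "dot matches newline", I added a \s* after the heading to skip leading whitespace, and the sample markup is a shortened copy of the page in the question. The helper name extract_info is made up for illustration.

```python
import re

# Sample markup modelled on the page in the question (shortened).
HTML = """
<div class="info">
<h5>Tagline:</h5>
Yippee Ki Yay Mo - John 6:27
</div>

<div class="info">
<h5>Plot Outline:</h5>
John McClane takes on an Internet-based terrorist organization who is
systematically shutting down the United States. <a class="tn15more
inline" href="http://www.imdb.com/title/tt0337978/plotsummary">more</a>
</div>
"""

# Generic pattern, adapted to Python's (?P<name>...) group syntax.
# re.DOTALL lets "." match newlines, which the multi-line values need.
INFO_RE = re.compile(
    r'<div class="info">\s*<h5>(?P<Key>[^:]+):</h5>\s*(?P<Value>(?:(?!</div).)+)',
    re.DOTALL,
)


def extract_info(html):
    """Return {heading: text} for every info block, with inner tags removed."""
    result = {}
    for m in INFO_RE.finditer(html):
        # Drop nested tags such as <a ...> and </a>, keeping their text.
        text = re.sub(r"<[^>]+>", "", m.group("Value"))
        # Collapse line breaks and runs of spaces into single spaces.
        result[m.group("Key")] = " ".join(text.split())
    return result


info = extract_info(HTML)
print(info["Tagline"])
print(info["Plot Outline"])
```

Note that the text of the trailing "more" link stays in the plot outline, and the capture ends at the first </div, so a div nested inside an info block would cut the value short.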

Another option, which would be a little more robust because it parses the
markup instead of pattern-matching it, is to use the HTML Agility Pack
(can be found on www.codeplex.com).
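For comparison, the parser-based idea can be sketched as follows. This does not use the HTML Agility Pack; it swaps in Python's standard html.parser module to show the same approach, and the class name InfoBlockParser, the helper parse_info and the sample markup (including the nested "extra" div) are hypothetical. It only recognises divs whose class attribute is exactly "info".

```python
from html.parser import HTMLParser


class InfoBlockParser(HTMLParser):
    """Collect the text of each <div class="info"> block under its <h5> heading."""

    def __init__(self):
        super().__init__()
        self.blocks = {}     # heading -> list of text fragments
        self._depth = 0      # div nesting depth inside the current info block
        self._in_h5 = False
        self._key = None

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self._depth:
                self._depth += 1      # a div nested inside an info block
            elif dict(attrs).get("class") == "info":
                self._depth = 1       # a new info block starts here
                self._key = None
        elif tag == "h5" and self._depth:
            self._in_h5 = True

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1          # reaching 0 closes the info block
        elif tag == "h5":
            self._in_h5 = False

    def handle_data(self, data):
        if not self._depth:
            return                    # text outside any info block
        if self._in_h5:
            self._key = data.strip().rstrip(":")
            self.blocks[self._key] = []
        elif self._key is not None:
            self.blocks[self._key].append(data)


def parse_info(html):
    parser = InfoBlockParser()
    parser.feed(html)
    parser.close()
    # Join the fragments and collapse whitespace for each heading.
    return {k: " ".join("".join(v).split()) for k, v in parser.blocks.items()}


SAMPLE = """
<div class="info">
<h5>Tagline:</h5>
Yippee Ki Yay Mo - John 6:27
</div>
<div class="info">
<h5>Plot Outline:</h5>
John McClane takes on an Internet-based terrorist organization.
<div class="extra">A nested div does not end the block.</div>
</div>
"""

result = parse_info(SAMPLE)
print(result)
```

Because the parser counts div depth, a nested div no longer ends the block early, which is the case a single regular expression handles poorly.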
 
