Screen scraping and regular expressions

Chris Wertman

Hello, I posted a slightly different version of this to dotnet.general,
where it was suggested I post it here.

Well, I have to say I'm getting excited about my app; it's almost there.
I have added a button to IE, and I am calling the current instance of IE
and grabbing the URL out just fine. I'm using the WebClient to grab the
HTML. So far so good, and I'm only half bald.

Now I am at the point where I need to extract a couple of fields from
the HTML itself. I have read about using regex to do this but am a
little confused; maybe I've just been staring at the screen too long.

I get this HTML returned.

<b>Binding:</b> Paperback<br> <b>Publisher:</b>

What I need to extract is the word Paperback from the above string.

Here is what I have so far:

Dim regex As New Regex("<b>Binding:</b>((.|\n)*?)<br> <b>Publisher:",
RegexOptions.IgnoreCase)

MsgBox(regex.Match(html).ToString)

But that returns <b>Binding:</b>Paperback<br> <b>Publisher:

I am sure this is something wrong with my regular expression. But can I
extract multiple items using this method, say by naming them regex1,
regex2, etc.?
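
For what it's worth, Match.ToString returns the whole matched text, tags included; the text between the tags sits in a capture group. Here is a minimal sketch of reading just that group, assuming the HTML looks like the sample above. The group name "binding", the \s* trimming, and the module name are illustrative choices, not something from the original code.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module BindingExtractor
    Sub Main()
        ' Sample input copied from the post; the real app would use the downloaded page.
        Dim html As String = "<b>Binding:</b> Paperback<br> <b>Publisher:</b>"

        ' (?<binding>...) is a named capture group (the name is an assumption).
        ' The \s* on either side keeps surrounding whitespace out of the capture.
        Dim pattern As String = "<b>Binding:</b>\s*(?<binding>.*?)\s*<br>\s*<b>Publisher:"

        ' Singleline lets "." match newlines, replacing the (.|\n) workaround.
        Dim bindingRegex As New Regex(pattern, RegexOptions.IgnoreCase Or RegexOptions.Singleline)

        Dim m As Match = bindingRegex.Match(html)
        If m.Success Then
            ' Groups("binding") holds only the captured text, not the whole match.
            Console.WriteLine(m.Groups("binding").Value)
        End If
    End Sub
End Module
```

With the sample string this prints Paperback. Other fields could be handled the same way, either with one Regex object per field or with additional named groups in a single pattern.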

Someone suggested I use the DOM via microsoft.mshtml.
Is this the most efficient way?

Do I need to somehow put it into a StreamReader, or... well, what do I do
with it then?
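
On the StreamReader question, a minimal sketch, assuming .NET 2.0 or later, where WebClient.DownloadString is available: the page comes back as a String that can be passed straight to Regex.Match, so no StreamReader is needed. The URL and module name below are placeholders, not values from the app.

```vbnet
Imports System
Imports System.Net

Module PageFetcher
    Sub Main()
        ' Placeholder address; in the app this would be the URL read from IE.
        Dim url As String = "http://example.com/"

        Using client As New WebClient()
            ' DownloadString returns the response body as a String directly.
            Dim html As String = client.DownloadString(url)
            Console.WriteLine(html.Length)
        End Using
    End Sub
End Module
```

On older framework versions, WebClient.OpenRead returns a Stream, and wrapping that in a StreamReader and calling ReadToEnd gives the same string.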

Chris
 
Paul Bromley

Hi Chris,

Take a look at the thread in this group - 'Getting Web Page info' - on 11th
April.

Best wishes

Paul Bromley
 