Screen scraping and regular expressions

Chris Wertman · Apr 13, 2004

Hello, I posted a slightly different version of this to dotnet.general,
where it was suggested I post it here.

Well, I have to say I'm getting excited about my app; it's almost there.
I have added a button to IE, and I am calling the current instance of IE
and grabbing the URL out just fine. I'm using the WebClient to grab the
HTML. So far so good, and I'm only half bald.

Now I am at the point where I need to extract a couple of fields from
the HTML itself. I have read about using regex to do this but am a
little confused; maybe I've just been staring at the screen too long.

I get this HTML returned.

Binding: Paperback Publisher:

What I need to extract is the word Paperback from the above string.

Here is what I have so far

Dim regex As New Regex("Binding:((.|\n)*?) Publisher:", RegexOptions.IgnoreCase)

MsgBox(regex.Match(html).ToString)

But that returns Binding:Paperback Publisher:

I am sure there is something wrong with my regular expression. But can I
extract multiple items using this method, say by naming them regex1,
regex2, etc.?

Someone suggested I use the DOM, using microsoft.mshtml.
Is this the most efficient way?

Do I need to somehow put it into a StreamReader or ... well, what do I do
with it then?

Chris

Paul Bromley · Apr 13, 2004

Hi Chris,

Take a look at the thread in this group - 'Getting Web Page info' - on 11th
April.

Best wishes

Paul Bromley
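
On the regex question itself: Match.ToString returns the entire matched text, which is why the "Binding:" and "Publisher:" labels come back along with the value. The part inside parentheses is available separately through Match.Groups. Below is a minimal sketch of that idea, assuming the text between the two labels contains no markup, as in the sample string from the question. The module name BindingExample, the function name ExtractBinding, and the group name "binding" are made up for illustration. RegexOptions.Singleline is used here so that "." also matches newlines, in place of the (.|\n) construct.

```vb
Imports System
Imports System.Text.RegularExpressions

Module BindingExample

    ' Hypothetical helper: returns the text found between "Binding:" and
    ' "Publisher:", or Nothing when the pattern does not match.
    Function ExtractBinding(ByVal html As String) As String
        ' (?<binding>...) is a named capturing group; the \s* on each side
        ' keeps surrounding whitespace out of the captured value.
        Dim pattern As String = "Binding:\s*(?<binding>.*?)\s*Publisher:"
        Dim options As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline

        Dim m As Match = Regex.Match(html, pattern, options)
        If m.Success Then
            ' Groups("binding") holds only the captured part, not the labels.
            Return m.Groups("binding").Value
        End If
        Return Nothing
    End Function

    Sub Main()
        ' Sample string taken from the question.
        Console.WriteLine(ExtractBinding("Binding: Paperback Publisher:"))  ' prints "Paperback"
    End Sub

End Module
```

The same approach should extend to several fields, either by using one Regex per field or by putting several named groups in a single pattern and reading each one from Match.Groups. If the real page has tags between the labels and the values, the pattern would need to account for them.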
