Regular Expression to Parse HTML

Charles Law · Apr 8, 2005

Does anyone have a regex pattern to parse HTML from a stream?

I have a well structured file, where each line is of the form

<sometag someattribute='attr'>text</sometag>

for example

<SPAN CLASS='myclass'>A bit of text</SPAN>, or
Just some text, without tags

What I would like to be able to do is parse each line so that I get an array
like this

SPAN
CLASS
myclass
A bit of text

or

Just some text, without tags

The array bit should follow, but I don't profess to be a regex expert (or
any kind of expert for that matter). Can anyone help with a suitable
pattern?

TIA

Charles

Galin Iliev · Apr 8, 2005

Is this useful for you?

http://regexplib.com/REDetails.aspx?regexp_id=520

Galin Iliev
MCSD, MCAD.NET

Herfried K. Wagner [MVP] · Apr 8, 2005

Charles Law said:
Does anyone have a regex pattern to parse HTML from a stream?

I have a well structured file, where each line is of the form

<sometag someattribute='attr'>text</sometag>

for example

<SPAN CLASS='myclass'>A bit of text</SPAN>, or
Just some text, without tags

What I would like to be able to do is parse each line so that I get an
array like this

SPAN
CLASS
myclass
A bit of text

Maybe it's easier to use the HTML Agility Pack:

.NET Html Agility Pack: How to use malformed HTML just like it was
well-formed XML...
<URL:http://blogs.msdn.com/smourier/archive/2003/06/04/8265.aspx>

Download:

<URL:http://www.codefluent.com/smourier/download/htmlagilitypack.zip>

Charles Law · Apr 8, 2005

Hi Galin

Thanks for the link. It looks like it ought to work, but when I test it
against even a simple tag it returns no matches. I tried verifying the
expression with Expresso and it gives the following error.

Reference to undefined group number 5.

Even when I test it using the facility on the web site it fails. Any idea
how to correct it?

Charles

Charles Law · Apr 8, 2005

Hi Herfried

It's not my lucky day today for getting things to work. When I try to open
the AgilityPack solution I get two errors:

Unable to open project HtmlDomView
Unable to open project GetBinaryRemainder

When I try to run it, it comes up with 12 compile errors, one of which is a
cryptographic failure!! It seems that HtmlAgilityPack.snk is missing too.

Charles

Scott Swigart [MVP] · Apr 8, 2005

There's an example of just that in my article on the new VBRUN site here:

http://msdn.microsoft.com/vbrun/vbfusion/5000classes/

The expression I used is:

("(?<=href\s*=\s*[""']).*?(?=[""'])")
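A minimal VB.NET sketch of how this expression might be applied, assuming the HTML is already held in a string. The sample markup and the module name are made up for illustration; only the pattern comes from the post above (the outer parentheses and doubled quotes there are VB call and string-literal syntax).

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module HrefDemo
    Sub Main()
        ' Hypothetical sample input, not taken from the article.
        Dim html As String = _
            "<a href=""http://example.com/one"">One</a> <a href='two.html'>Two</a>"

        ' The lookbehind requires href= plus an opening quote just before the match;
        ' the lazy .*? stops at the first quote of either kind that follows.
        Dim pattern As String = "(?<=href\s*=\s*[""']).*?(?=[""'])"

        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            Console.WriteLine(m.Value)
        Next
    End Sub
End Module
```

For the sample above this should print the two href values, one per line. Because the pattern stops at the first quote of either kind, a value that itself contains the other quote character would be cut short.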

--
Scott Swigart - MVP
http://blog.swigartconsulting.com

Charles Law · Apr 8, 2005

Hi Scott

It looks like this would specifically decode hrefs, and so if I wanted to
decode another tag I would need to change the expression. To decode many
different tags I would need to generate multiple expressions and test
against each; please correct me if I have misunderstood. What I am hoping
for is a generic expression that will decode all tags that conform to the
general html format. I realise that this would also decode tags that are not
valid html, but this would not matter as I have control over the file and
what is in it.

Charles

Cor Ligthert · Apr 8, 2005

Charles,

Maybe I can point you to a library called MSHTML. It is not the nicest
library to work with, but it is very good for filtering tags out of a
document, either with loops or tag by tag while walking the document.
Something like this, where pDocuments is a document collection:

\\\
For Each iDocument As mshtml.IHTMLDocument2 In pDocuments
    For i As Integer = 0 To iDocument.all.length - 1
        Dim hrefname As String
        Dim hElm As mshtml.IHTMLElement = _
            DirectCast(iDocument.all.item(i), mshtml.IHTMLElement)
        Dim tagname As String = hElm.tagName.ToLower
        If (tagname = "a") Or (tagname = "chk") Then
            If Not DirectCast(hElm, mshtml.IHTMLAnchorElement).href Is Nothing Then
                hrefname = DirectCast(hElm, mshtml.IHTMLAnchorElement).href.ToString
            End If
        End If
        ' etc. etc.
///
In this newsgroup I mostly leave the answers on this subject to somebody who,
by coincidence, has the same name as you; he has been working with it much
longer and more actively than I have.

Maybe you can search for his answers.

:-)

Cor

Charles Law · Apr 8, 2005

Now why didn't I think of that ;-) I shall look this fellow up, of whom you
speak, and see what he has to say on the matter.

I have now got the Agility Pack working. It is somewhat smaller than mshtml
and, I suspect, quicker.

It's actually quite good, and may well be better than the regex idea;
especially since I don't currently have a regex that works! I had thought
that, for a large file, regex would be quicker than mshtml, but I have no
actual evidence of that. Conversely, though, I think that the Agility Pack
will be every bit as quick as a regex, if not quicker. Anyway, it works,
which is the main thing.

Charles

Herfried K. Wagner [MVP] · Apr 8, 2005

Charles,

Charles Law said:
I have now got the Agility Pack working. It is somewhat smaller than
mshtml and, I suspect, quicker.

I am glad to hear that you finally got the Agility Pack to work :-)


Jon Shemitz · Apr 8, 2005

Charles said:
I have a well structured file, where each line is of the form

<sometag someattribute='attr'>text</sometag>

for example

<SPAN CLASS='myclass'>A bit of text</SPAN>, or
Just some text, without tags

What I would like to be able to do is parse each line so that I get an array
like this

SPAN
CLASS
myclass
A bit of text

or

Just some text, without tags

Assuming it's always attrib='value', and never attrib="value",

// ExplicitCapture | Multiline | IgnorePatternWhitespace

^
(
< (?<tag>\w+) \s+
(?<attribute>\w+) \s* = \s* ' (?<value>[^']*) ' \s* >
(?<text>.*) </ \k<tag> >
) .*
|
(?<bare_text> .+)
$
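A hedged VB.NET sketch of how the pattern above could be run, assuming the whole file has been read into one string with lines separated by a line feed. The module name and the two sample lines are illustrative only; the pattern text and the three options are the ones given in the post.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module LineParserDemo
    ' The pattern from the post; spaces in it are ignored
    ' because of RegexOptions.IgnorePatternWhitespace.
    Const Pattern As String = _
        "^ ( < (?<tag>\w+) \s+ " & _
        "(?<attribute>\w+) \s* = \s* ' (?<value>[^']*) ' \s* > " & _
        "(?<text>.*) </ \k<tag> > ) .* " & _
        "| (?<bare_text> .+) $"

    Sub Main()
        Dim options As RegexOptions = RegexOptions.ExplicitCapture Or _
            RegexOptions.Multiline Or RegexOptions.IgnorePatternWhitespace
        Dim rx As New Regex(Pattern, options)

        ' Hypothetical two-line input in the format described in the question.
        Dim input As String = "<SPAN CLASS='myclass'>A bit of text</SPAN>" & vbLf & _
                              "Just some text, without tags"

        For Each m As Match In rx.Matches(input)
            If m.Groups("tag").Success Then
                ' Tagged line: report the four named captures.
                Dim parts() As String = {m.Groups("tag").Value, _
                                         m.Groups("attribute").Value, _
                                         m.Groups("value").Value, _
                                         m.Groups("text").Value}
                Console.WriteLine(String.Join(" | ", parts))
            Else
                ' Anything else falls through to the bare_text alternative.
                Console.WriteLine(m.Groups("bare_text").Value)
            End If
        Next
    End Sub
End Module
```

As written, the first alternative accepts exactly one single-quoted attribute, so a line with no attribute, several attributes, or a double-quoted value ends up in bare_text instead.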

Jay B. Harlow [MVP - Outlook] · Apr 8, 2005

Charles Law · Apr 8, 2005

Hi Jon

As with my reply to an earlier response, it looks like the expression you
have given is specific to a given tag and attribute (unless I have
misunderstood the syntax), whereas I am looking for something to parse _any_
tag and attribute. Although the tags I am parsing are limited in number, it
would still be too onerous to create multiple expressions to compare with.

Thanks for the suggestion.

Charles

Charles Law · Apr 8, 2005

Hi Jay

I have just had a look at the link, and it is similar, I think, to the
Agility Pack. Now that I have the Agility Pack working I am going to try and
make that work for me, unless a regex comes up. I think the code to use a
regex would be shorter/simpler, but of course that does not necessarily
equate with speed, and that is my overriding concern (well, that and
reliability, of course).

Charles

Jon Shemitz · Apr 9, 2005

Charles said:
As with my reply to an earlier response, it looks like the expression you
have given is specific to a given tag and attribute (unless I have
misunderstood the syntax), whereas I am looking for something to parse _any_
tag and attribute. Although the tags I am parsing are limited in number, it
would still be too onerous to create multiple expressions to compare with.

You misread. (?<attribute> ...) etc. captures to the named group "attribute";
it doesn't match the literal text "attribute".

You should try it. I spent five minutes writing it for you for free.

Charles Law · Apr 9, 2005

Jon

I apologise if I appeared dismissive of your efforts. I have tried it with

Hello world

and it collects elements perfectly. I tried it with

Hello world

and it collects everything in bare_text. Is there a way to make it still
collect in the designated fields?

Thanks again.

Charles

Jon Shemitz · Apr 9, 2005

Charles said:
Jon

I apologise if I appeared dismissive of your efforts. I have tried it with

Hello world

and it collects elements perfectly. I tried it with

Hello world

and it collects everything in bare_text. Is there a way to make it still
collect in the designated fields?

Of course. But you said everything would look like

<sometag someattribute='attr'>text</sometag>

or bare text. Try

#[ExplicitCapture|Multiline|IgnorePatternWhitespace]

^
(
<
(?<text>.*) </ \k<tag> >
) .*
|
(?<bare_text> .+)
$

Charles Law · Apr 10, 2005

Jon

As we say in these parts, you know stuff.

Thanks muchly.

Charles

Dave · Apr 10, 2005

I have a well structured file

If you can guarantee that the file will always be well-formed, you can use the System.Xml namespace classes to do the parsing for you,
e.g. XmlReader, XmlWriter, XmlDocument, or any of the XPath readers, writers, and documents.
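A small VB.NET sketch of this idea, assuming each tagged line is a complete, well-formed XML element. The sample line and module name are made up for illustration; XmlDocument is one of the System.Xml classes named above.

```vbnet
Imports System
Imports System.Xml

Module XmlLineDemo
    Sub Main()
        ' Hypothetical line in the format from the original question.
        Dim line As String = "<SPAN CLASS='myclass'>A bit of text</SPAN>"

        Dim doc As New XmlDocument()
        ' LoadXml throws XmlException if the line is not well-formed XML.
        doc.LoadXml(line)

        Dim element As XmlElement = doc.DocumentElement
        Console.WriteLine(element.Name)            ' tag name, case preserved
        For Each attr As XmlAttribute In element.Attributes
            Console.WriteLine(attr.Name)           ' attribute name
            Console.WriteLine(attr.Value)          ' attribute value
        Next
        Console.WriteLine(element.InnerText)       ' text between the tags
    End Sub
End Module
```

A line of bare text with no tags is not an XML document on its own, so it would need to be detected separately (for example by catching XmlException) or wrapped in an element when the file is written.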

Charles Law · Apr 10, 2005

Hi Dave

Actually, you have hit on something there. I write the file in the first
place as HTML, but I could write it as XML, but use HTML tags. I would then
have the right class structure to read it back in. Marvellous. It pays to
look outside the box.

Thanks.

Charles
