Regular Expression to Parse HTML

  • Thread starter: Charles Law
Charles,
| but I could write it as XML, but use HTML tags.

That would be XHTML ;-)

If you are writing the files, then this may be the way to go.

Hope this helps
Jay

| Hi Dave
|
| Actually, you have hit on something there. I write the file in the first
| place as HTML, but I could write it as XML, but use HTML tags. I would then
| have the right class structure to read it back in. Marvellous. It pays to
| look outside the box.
|
| Thanks.
|
| Charles
|
|
| | >> I have a well structured file
| >
| > If you can guarantee that the file will always be well-formed, you can use
| > System.Xml namespace classes to do the parsing for you, e.g. XmlReader /
| > XmlWriter / XmlDocument or any of the XPath readers/writers/document.
| >
| > --
| > Dave Sexton
| > -----------------------------------------------------------------------
| > | >> Does anyone have a regex pattern to parse HTML from a stream?
| >>
| >> I have a well structured file, where each line is of the form
| >>
| >> <sometag someattribute='attr'>text</sometag>
| >>
| >> for example
| >>
| >> <SPAN CLASS='myclass'>A bit of text</SPAN>, or
| >> Just some text, without tags
| >>
| >> What I would like to be able to do is parse each line so that I get an
| >> array like this
| >>
| >> SPAN
| >> CLASS
| >> myclass
| >> A bit of text
| >>
| >> or
| >>
| >> Just some text, without tags
| >>
| >> The array bit should follow, but I don't profess to be a regex expert (or
| >> any kind of expert for that matter). Can anyone help with a suitable
| >> pattern?
| >>
| >> TIA
| >>
| >> Charles
| >>
| >>
| >
| >
|
|
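For reference, the pattern Charles originally asked for could look roughly like the sketch below. It is written in Python rather than .NET, and it assumes exactly what the question describes: one element per line, exactly one quoted attribute, no nested tags. The names LINE_PATTERN and parse_line are made up for illustration. Anything outside those assumptions falls through as plain text, which is part of why the replies steer towards a real parser.

```python
import re

# Matches one line of the form <tag attr='value'>text</tag>.
# Assumes exactly one attribute, quoted with ' or ", and no nested tags.
LINE_PATTERN = re.compile(
    r"""^\s*<(?P<tag>\w+)\s+(?P<attr>\w+)\s*=\s*(?P<quote>['"])(?P<value>.*?)(?P=quote)\s*>"""
    r"""(?P<text>.*?)</(?P=tag)\s*>\s*$""",
    re.IGNORECASE,
)


def parse_line(line):
    """Return [tag, attribute, value, text] for a tagged line, else [text]."""
    match = LINE_PATTERN.match(line)
    if match is None:
        # No tags (or a shape the pattern does not cover): keep the raw text.
        return [line.strip()]
    return [match["tag"], match["attr"], match["value"], match["text"]]
```

Called on the two example lines from the question, parse_line returns the four-item list for the SPAN line and a one-item list for the untagged line.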
 
Charles,
NOTE: The SgmlReader I mentioned in my earlier post allows you to treat
any HTML as XML.

Hope this helps
Jay
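As a rough illustration of the "treat any HTML as XML" idea, here is a sketch using Python's standard html.parser module. This is an analogue, not SgmlReader itself, and the names LineCollector and parse_lenient are invented for the example. The point is that a tolerant HTML parser copes with input a strict XML parser would reject, such as unquoted attribute values or a missing closing tag. Note that this parser lower-cases tag and attribute names.

```python
from html.parser import HTMLParser


class LineCollector(HTMLParser):
    """Collects tag names, attribute names/values and text in document order."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lower-cased from HTMLParser.
        self.parts.append(tag)
        for name, value in attrs:
            self.parts.extend([name, value])

    def handle_data(self, data):
        # Skip whitespace-only runs between tags.
        if data.strip():
            self.parts.append(data.strip())


def parse_lenient(line):
    """Parse one line of possibly sloppy HTML into a flat list of parts."""
    collector = LineCollector()
    collector.feed(line)
    collector.close()  # flush any text still buffered by the parser
    return collector.parts
```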

 
Hi Jay

You won't be surprised to hear that this is a continuing theme.

Once upon a time, there was RTF, but it was slow, and the people wept, for
it was very, very slow, and they got very, very bored waiting.

So, the developer chappie considered the many possible alternatives, and
decided to simplify the whole thing by invoking the minor devil known as the
listview. But the users came back and said, "but we liked the rich text box,
because it had colours and stuff".

And the developer said, "you have colours, what are you complaining about;
the listview is every bit as colourful, and quicker to boot, it just doesn't
retain the colours when you save and reload".

And then he added, "you are lucky to have anything at all, so just be
grateful", but he went away thinking that he had somehow done the users a
disservice.

So, anyway, he came up with the idea of saving the output as html, so that
it could be opened by the great God Microsoft Word; oh, and some browser
thingy called IE.

But then there was the dilemma: how to load it back into the application
with colour, as the users had become used to. And it was then that Regular
Expression came to the developer one night in a dream. But he knew little of
the Regular Expression, so he sought help from the great developers in the
sky. And they said, try this ... no, try this ... and he tried it, and it
worked; sort of.

But by this time, the developer had grown weary, and also his calculating
machine had become defective because he had done some re-installing and it
had mucked up his debugger, and it took him a day-and-a-half to put it
right. So, by Sunday evening he was really very weary indeed, and then some.

Finally, a door opened, and a bright light shone in. The developer tried
some stuff, and it worked. He wrote a set of classes to serialise and
de-serialise an html class, which looked remarkably like real html, which is
apparently something called xhtml.


So, now we are back in the present. The story is nearly at its end. The
developer just needs some sleep (and the love of a good women), and all will
be right with the world.

And so, to sleep, perchance to dream; ay, there's the rub.

Charles
 
Charles,
| So, now we are back in the present. The story is nearly at its end. The
| developer just needs some sleep (and the love of a good women), and all will
| be right with the world.
Can't really help you with either of those, other than wishing you luck in
those areas...


This question and the "Easiest way to generate XML in VB.NET" thread remind
me of Item #29, "Always Use a Parser", from Elliotte Rusty Harold's book
"Effective XML: 50 Specific Ways to Improve Your XML" (Addison-Wesley), which
lists a number of other reasons to use a parser. Although Item #29 is largely
about reading, I find the topic apropos to writing as well. Hence my
suggestion, made without realizing the connection, of using either the
SgmlReader or XHTML...

Hope this helps
Jay
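To make the "always use a parser" advice concrete for the reading side, here is a minimal sketch in Python, with xml.etree.ElementTree standing in for the System.Xml classes mentioned earlier in the thread. It assumes each tagged line is a single well-formed element. The function name parse_xhtml_line is hypothetical. The parser handles quoting and entity references that a hand-rolled pattern would have to special-case.

```python
import xml.etree.ElementTree as ET


def parse_xhtml_line(line):
    """Return [tag, attr, value, ..., text] for a well-formed element, else [line]."""
    stripped = line.strip()
    if not stripped.startswith("<"):
        # Plain text without tags: nothing to parse.
        return [stripped]
    element = ET.fromstring(stripped)  # raises ET.ParseError if not well-formed
    parts = [element.tag]
    for name, value in element.attrib.items():
        parts.extend([name, value])
    parts.append(element.text or "")
    return parts
```

A line that is not well-formed (an unquoted attribute value, say) raises ET.ParseError instead of being silently mis-parsed, which is the trade-off against a tolerant HTML reader.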




 
I have just spotted a Freudian slip
| So, now we are back in the present. The story is nearly at its end. The
| developer just needs some sleep (and the love of a good wom*e*n), and all

Maybe there is something going on in my head that I don't know about ...
wouldn't be the first time.

I don't see any specific support for XHTML in .NET, unless it goes by
another name. I have my solution, using the XmlSerializer to serialise and
de-serialise a class hierarchy that resembles the html document I want to
manipulate. It requires that I name the classes quite carefully, and there
are some things that I cannot readily do, such as put comments
(<!-- -->) into a STYLE tag, but it works.

Have I missed a trick with this XHTML?

Charles
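The approach described above (a class hierarchy that mirrors the markup, written and read through XML tooling) can be sketched as follows. This is Python with xml.etree.ElementTree, not the XmlSerializer code from the post, and the Span class with its css_class field is a hypothetical stand-in for whatever classes were actually used. It only shows the round trip: objects to XHTML-like markup and back.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical class mirroring <span class="...">text</span>."""
    css_class: Optional[str]
    text: str


def serialise(spans):
    """Write spans as a small XHTML-like document; the writer does the escaping."""
    body = ET.Element("body")
    for span in spans:
        node = ET.SubElement(body, "span")
        if span.css_class is not None:
            node.set("class", span.css_class)
        node.text = span.text
    return ET.tostring(body, encoding="unicode")


def deserialise(markup):
    """Read the document back into Span objects."""
    body = ET.fromstring(markup)
    return [Span(node.get("class"), node.text or "") for node in body.iter("span")]
```

Because the markup is produced by an XML writer rather than by string concatenation, characters such as < and & in the text are escaped on the way out and restored on the way in.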
 
Charles,
| I don't see any specific support for XHTML in .NET
There is no specific support per se.

XHTML is HTML tags in an XML document.

Ergo, the XHTML support in .NET is the classes in the System.Xml namespace,
such as the XmlSerializer. XmlSerializer directly or indirectly uses a
System.Xml.XmlWriter to write XML output. In other words, it follows Item #29
and uses a "parser".

Hope this helps
Jay

 
Thanks for clearing that up. I think I have probably done the best I can with
it, then.

Cheers

Charles
 