Markup to Text

Trebek

Hello grp:

I have a situation that I was hoping someone might be able to suggest a
solution for. I am retrieving HTML from a URL and storing it in SQL Server.
Our web service supplies this data to our clients through another service
that is itself a client of the web service, and to integration clients as
XML data (the HTML is wrapped in CDATA). We have an integration client who
cannot accept HTML embedded in the XML, for whatever reason. Because of this
client's large volume, we are tasked with coming up with a robust solution
to convert the HTML markup to an equivalent text representation. In the
short term, we are stripping the HTML formatting with regex replacements,
but this is neither robust nor particularly effective, because table
structures do not translate well. I have been looking for an 'HTML Stripper'
tool, but searching Google hasn't yielded much. Most of these tools are
either GUI-based or require files for input. Neither option will work for
us, since this needs to run in a 'service' context without user interaction.

Does anyone know of an effective way to handle the markup, or of a COM
object library that provides support for HTML tables?

Things we have tried to date include the HTMLDocument class (formatting is
not preserved), XHTML conversion followed by XSLT processing (effective but
not very efficient) and, as already mentioned, regex.

Any help is much appreciated,

Alex
 
Nick

Have you considered using the Internet Explorer component framework to load
the HTML and give you an object model to program against? Internet Explorer
should provide most of the functionality you need to extract the data. There
may also be HTML parsers available as ready-made components; I am sure there
are some ActiveX parsers. If there aren't, have you considered building the
parser yourself?
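
To give a rough idea of what building the parser could look like, here is a
minimal sketch. It is written in Python with the standard library's
html.parser module purely for illustration, since the thread does not say
which language the service uses. The names TableAwareTextExtractor and
html_to_text are made up for this example. It assumes simple, non-nested
tables and renders each one as padded columns separated by " | ".

```python
from html.parser import HTMLParser


class TableAwareTextExtractor(HTMLParser):
    """Hypothetical helper: collect text, rendering tables as aligned columns."""

    BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []    # finished output lines
        self.buffer = []   # text fragments of the current line or cell
        self.rows = None   # rows of the table being read, or None
        self.row = None    # cells of the row being read, or None

    def _collapse(self):
        # Join the buffered fragments, normalise whitespace, reset the buffer.
        text = " ".join("".join(self.buffer).split())
        self.buffer = []
        return text

    def _flush_line(self):
        text = self._collapse()
        if text:
            self.lines.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._flush_line()
            self.rows = []
        elif tag == "tr" and self.rows is not None:
            self.row = []
        elif tag in ("td", "th") and self.row is not None:
            self.buffer = []              # drop whitespace between cells
        elif tag == "br" and self.rows is not None:
            self.buffer.append(" ")       # keep words apart inside a cell
        elif tag in self.BLOCK_TAGS and self.rows is None:
            self._flush_line()

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.row is not None:
            self.row.append(self._collapse())
        elif tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None
        elif tag == "table" and self.rows is not None:
            self._emit_table()
            self.rows = None
        elif tag in self.BLOCK_TAGS and self.rows is None:
            self._flush_line()

    def handle_data(self, data):
        self.buffer.append(data)

    def _emit_table(self):
        # Pad short rows, then left-justify every cell to its column width.
        width = max((len(r) for r in self.rows), default=0)
        padded = [r + [""] * (width - len(r)) for r in self.rows]
        col_widths = [max(len(r[i]) for r in padded) for i in range(width)]
        for r in padded:
            cells = [c.ljust(w) for c, w in zip(r, col_widths)]
            self.lines.append(" | ".join(cells).rstrip())


def html_to_text(html):
    """Convert an HTML string to plain text without touching the file system."""
    parser = TableAwareTextExtractor()
    parser.feed(html)
    parser.close()
    parser._flush_line()   # emit any trailing text after the last tag
    return "\n".join(parser.lines)


if __name__ == "__main__":
    sample = (
        "<p>Prices:</p>"
        "<table><tr><th>Item</th><th>Cost</th></tr>"
        "<tr><td>Tea</td><td>2.50</td></tr></table>"
    )
    print(html_to_text(sample))
```

Because it works on an in-memory string, something along these lines could
run inside a service with no GUI and no temporary files. Nested tables,
colspan/rowspan and script/style content are not handled here and would need
extra work.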

Nick.
 
