Performance of Substring and IndexOf methods

John M

Hi,

I have a program that does a lot of analysis on HTML files. The HTML files
range from 300 KB to 1 MB in size. My program processes the HTML using a
SAX-style approach. The program runs very slowly, taking several minutes to
process a file. I profiled the program and discovered that approximately half
the processing time is spent in the IndexOf method and the other half in the
Substring method of the String class.

What I would like to know is:

1) Are there any SAX-style HTML parsers for .NET?
2) What are my alternatives to the String class and its IndexOf and
Substring methods?

TIA,
John
 
John M

Thanks for your response. However, the HTML is not well-formed XML. It
contains <BR> tags and mismatched <TD>/<TR> elements, and I have no control
over the HTML output. That is why I need an HTML parser: it has to be relaxed
enough to handle badly constructed HTML.

I have already written my own SAX-style HTML parser, but it is very slow
because of the methods previously mentioned. Time to put on the optimization
hat, methinks (assuming there are no decent, fast HTML parsers for .NET out
there).
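
A minimal sketch of the kind of relaxed, event-style scan described above, written in Python because the thread does not show the original .NET code. The function name `scan` and the event tuples are hypothetical. The sketch assumes the cost comes from repeatedly cutting the unprocessed remainder off the document; it instead keeps a cursor into the original string and copies only the token being reported. It makes no attempt to validate nesting, so <BR> and mismatched <TD>/<TR> tags pass through as ordinary events.

```python
def scan(html):
    """Yield ("tag", text) and ("text", text) events in document order.

    Hypothetical helper for illustration; not the poster's parser.
    """
    pos = 0
    n = len(html)
    while pos < n:
        lt = html.find("<", pos)          # search from the cursor, not from 0
        if lt == -1:
            yield ("text", html[pos:])    # no more tags: the rest is text
            return
        if lt > pos:
            yield ("text", html[pos:lt])  # text between the cursor and the tag
        gt = html.find(">", lt + 1)
        if gt == -1:
            # Unterminated tag: report the rest as text rather than failing.
            yield ("text", html[lt:])
            return
        yield ("tag", html[lt + 1:gt])    # tag contents without the brackets
        pos = gt + 1                      # advance the cursor past the tag
```

In .NET the same shape would use String.IndexOf with a start index and take a Substring only for the token itself.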
 
Herfried K. Wagner [MVP]

I don't think there is a faster function than 'InStr'. Note that you can
specify the position where the search begins, so you don't need to search
from the beginning every time.
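
The point about the start position can be sketched as follows. This is an illustrative Python example, not code from the thread; the function name `find_tag_starts` is hypothetical. Python's `str.find(sub, start)` plays the role that `InStr(start, ...)` or `String.IndexOf(value, startIndex)` plays in .NET.

```python
def find_tag_starts(html, marker="<"):
    """Return the offset of every occurrence of `marker` in `html`."""
    positions = []
    pos = html.find(marker)               # first search starts at offset 0
    while pos != -1:
        positions.append(pos)
        # Resume just past the previous hit instead of slicing the string
        # and searching the remainder from its beginning.
        pos = html.find(marker, pos + 1)
    return positions
```

Each search picks up where the previous one stopped, so the document is scanned once overall and no intermediate substrings are created.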
 
Cor

Hi John,

Have a look at MSHTML (when you search for it in the VS.NET help, set the
help filter to show everything).

You also have to watch carefully which interface you use; some are slow and
some are fast (mshtml.IHTMLDocument2 is a fast one).

You have to set a reference to it, but do not add an Imports statement for
it, because it has so many references that it freezes the IDE completely.
Instead, spell out the namespace when you declare a variable.

I hope this helps.

Cor
 
Jay B. Harlow [MVP - Outlook]

John,
Have you looked at the SgmlReader on www.gotdotnet.com?

http://www.gotdotnet.com/Community/...mpleGuid=B90FDDCE-E60D-43F8-A5C4-C3BD760564BC

It allows you to read an HTML file with about the same ease as an XML file.


The following articles provide information on writing .NET code that
performs well.

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dndotnet/html/fastmanagedcode.asp

http://msdn.microsoft.com/library/d...y/en-us/dndotnet/html/highperfmanagedapps.asp

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dndotnet/html/vbnstrcatn.asp

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dv_vstechart/html/vbtchperfopt.asp

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dndotnet/html/dotnetperftechs.asp

Hope this helps
Jay


Herfried K. Wagner said:
<http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpguide/ht
 
