Performance of SubString and IndexOf methods

John M · Feb 21, 2004

Hi,

I have a program that does a lot of analysis on HTML files. The HTML files
range from 300 KB to 1 MB in size. My program processes the HTML using a
SAX-style approach. The program runs very slowly, taking several minutes to
process a file. I profiled the program and discovered that approximately half
the processing time is spent in the IndexOf method and the other half in
the Substring method of the String class.

What I would like to know is:

1) Are there any SAX-style HTML parsers for .NET?
2) What are my alternatives to the String class and its IndexOf and
Substring methods?

TIA,
John

Herfried K. Wagner [MVP] · Feb 21, 2004

John,

John M said:
1) Are there any SAX-style HTML parsers for .NET?

<http://msdn.microsoft.com/library/d...e/html/cpconcomparingxmlreadertosaxreader.asp>
<http://www.xmlforasp.net/codeSection.aspx?csID=36>

John M · Feb 22, 2004

Thanks for your response. However, the HTML is not well-formed XML. It
contains <BR> tags and mismatched <TD>/<TR> elements, and I have no control
over the HTML output. That is why I need an HTML parser: it has to be
lenient enough to handle badly constructed HTML.

I have already written my own SAX-style HTML parser, but it is very slow
because of the methods mentioned previously. Time to put on the optimization
hat, methinks (assuming there are no decent, fast HTML parsers for .NET out
there).

Herfried K. Wagner [MVP] · Feb 22, 2004

John M said:
Thanks for your response. However, the HTML is not well-formed XML. It
contains <BR> tags and mismatched <TD>/<TR> elements, and I have no control
over the HTML output. That is why I need an HTML parser: it has to be
lenient enough to handle badly constructed HTML.

I have already written my own SAX-style HTML parser, but it is very slow
because of the methods mentioned previously. Time to put on the optimization
hat, methinks (assuming there are no decent, fast HTML parsers for .NET out
there).

I don't think that there is a faster function than 'InStr'. Notice that
you can specify the point where the search begins, so you don't need to
search from the beginning every time.
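
To make the start-position idea concrete, here is a minimal sketch. It is written in Python purely as a language-neutral illustration; the function names are hypothetical, and it assumes (the thread does not say) that the slow parser repeatedly takes a substring of the unread remainder before searching again. In .NET, String.IndexOf has overloads that accept a start index, and InStr accepts a start position, so the same pattern should carry over.

```python
def find_tags_by_offset(html):
    """Collect the text between each '<' and the next '>' in one forward pass."""
    tags = []
    pos = 0
    while True:
        start = html.find("<", pos)        # resume where the last match ended
        if start == -1:
            break
        end = html.find(">", start + 1)
        if end == -1:
            break                          # unterminated tag: stop scanning
        tags.append(html[start + 1:end])   # copy only the small piece needed
        pos = end + 1
    return tags


def find_tags_by_reslicing(html):
    """Same result, but copies the unread remainder after every tag."""
    tags = []
    rest = html
    while True:
        start = rest.find("<")
        if start == -1:
            break
        end = rest.find(">", start + 1)
        if end == -1:
            break
        tags.append(rest[start + 1:end])
        rest = rest[end + 1:]              # copies everything not yet read
    return tags
```

Both functions return the same tags, but the second copies the rest of the document once per tag, so its cost can grow roughly with the square of the file size. On a 1 MB file with many tags, that difference alone may account for much of the time a profiler attributes to Substring.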

Cor · Feb 22, 2004

Hi John,

Have a look at MSHTML (when you search the VS.NET help, set the filter to
show everything).

You also have to watch carefully which interface you use; some are slow and
some are fast (mshtml.IHTMLDocument2 is a fast one).

You have to set a reference to it, but do not import the namespace, because
it has so many references that it freezes your IDE completely. Instead,
qualify the type names with the namespace when you declare variables.

I hope this helps.

Cor

Jay B. Harlow [MVP - Outlook] · Feb 24, 2004

John,
Have you looked at the SgmlReader on www.gotdotnet.com?

http://www.gotdotnet.com/Community/...mpleGuid=B90FDDCE-E60D-43F8-A5C4-C3BD760564BC

It allows you to read an HTML file with about the same ease as an XML file.
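
For readers unfamiliar with the idea, the following sketch shows what a lenient, event-driven (SAX-style) pass over malformed HTML looks like. It is not SgmlReader and not .NET: it uses the HTMLParser class from Python's standard library purely as an illustration, and the EventLogger class and parse_events function are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Records start-tag, end-tag and text events in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Tag names are reported in lowercase by HTMLParser.
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        text = data.strip()
        if text:                           # skip whitespace-only runs
            self.events.append(("text", text))


def parse_events(html):
    """Feed the whole document and return the list of recorded events."""
    parser = EventLogger()
    parser.feed(html)
    parser.close()
    return parser.events
```

Given input such as `<TABLE><TR><TD>one<BR>two</TR></TABLE>`, where the <TD> is never closed and <BR> has no end tag, the parser reports the tags it sees without raising an error and leaves it to the handler to decide how to treat the missing </TD>.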

The following articles provide information on writing .NET code that
performs well.

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dndotnet/html/fastmanagedcode.asp

http://msdn.microsoft.com/library/d...y/en-us/dndotnet/html/highperfmanagedapps.asp

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dndotnet/html/vbnstrcatn.asp

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dv_vstechart/html/vbtchperfopt.asp

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dndotnet/html/dotnetperftechs.asp

Hope this helps
Jay

