Algorithm for comparing two HTML files

Guest

I am trying to build a diff tool that allows me to compare two HTML files,
and I am looking for resources on how to achieve this. The main problem is
that I do not want to simply highlight the line of code where the change
happened, but rather the word/text that changed.

For example, say the HTML file contains a table with one row of three cells,
and all that changes between the two HTML files I want to compare is the
value in the second cell. I need to be able to distinguish that this is what
changed, even if the actual HTML code is one single line (basically comparing
what is displayed by the HTML renderer).

Any ideas or suggestions on where I can start looking?

Thanks
 
Hi ddd,

This is Mahesh.
My suggestion is to use XML, XSL/XSLT, and XML Schema.
If you simply want to check whether two HTML files differ, you can use
regular expressions or a byte-by-byte comparison. But I think you want to
point out the differences between the contents of the two HTML files, so I
would suggest keeping the content as XML and using an XSLT transformation to
turn that XML into HTML.
It will then be easier to point out the differences between the contents of
the two XML files (and so, indirectly, the HTML files), since XML imposes a
tree structure on documents, which you can describe with an XML Schema (XSD).
By comparing nodes you can find the differences between the contents of the
documents.
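
To make the node-comparison idea concrete, here is a minimal sketch in
Python. It assumes both documents are well-formed XML with the same tree
shape; the function name diff_nodes and the sample table are made up for
illustration, and the index in each path is simply the child position.

```python
import xml.etree.ElementTree as ET

def diff_nodes(a, b, path):
    """Yield (path, old_text, new_text) for every node whose text differs."""
    old, new = (a.text or "").strip(), (b.text or "").strip()
    if old != new:
        yield path, old, new
    # Assumption: both trees have the same shape, so children pair up by
    # position. zip() silently stops at the shorter child list.
    for i, (child_a, child_b) in enumerate(zip(a, b), start=1):
        yield from diff_nodes(child_a, child_b, f"{path}/{child_a.tag}[{i}]")

old_doc = ET.fromstring("<table><tr><td>a</td><td>b</td><td>c</td></tr></table>")
new_doc = ET.fromstring("<table><tr><td>a</td><td>X</td><td>c</td></tr></table>")

changes = list(diff_nodes(old_doc, new_doc, "/" + old_doc.tag))
for path, old, new in changes:
    print(f"{path}: {old!r} -> {new!r}")
```

A real tool would also have to handle added or removed nodes, which this
sketch ignores.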

I am also a newbie in the programming field, so my suggestion may be stupid.
If so, please excuse me.


Bye, have a good day,


(e-mail address removed)
akstech solutions pvt ltd
 
Not all HTML files are well-formed XML. The majority of the web pages on
the internet aren't.
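
A quick way to see this, sketched in Python: the snippet below is ordinary
HTML with an unclosed <br>, which a strict XML parser rejects while a lenient
HTML parser accepts. The helper is_well_formed_xml and the class TagCollector
are names made up for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# An unclosed <br> is fine in HTML but is not well-formed XML.
snippet = "<p>first line<br>second line</p>"

def is_well_formed_xml(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

class TagCollector(HTMLParser):
    """Lenient HTML parsing: just record every start tag that is seen."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

collector = TagCollector()
collector.feed(snippet)
collector.close()

print(is_well_formed_xml(snippet))  # the XML parser refuses the snippet
print(collector.tags)               # the HTML parser still sees both tags
```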

Regards
Senthil
 
You can try using the MSHTML parser to parse the HTML files and then compare
the parsed contents.
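
As a rough illustration of "parse, then compare the contents", here is a
sketch that uses Python's built-in html.parser as a stand-in for MSHTML (that
substitution is my assumption, not an MSHTML example). The names VisibleText
and visible_words are hypothetical. It collects only the text a browser would
display, so the markup itself no longer matters.

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collect the words of displayed text, ignoring tags, scripts and styles."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.words = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words.extend(data.split())

def visible_words(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.words

page = "<table><tr><td>one</td><td>two words</td></tr></table><script>var x = 1;</script>"
print(visible_words(page))
```

The two word lists (old file, new file) can then be handed to any sequence
diff.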

Regards
Senthil
 
The classic diff algorithm was published in E. W. Myers, "An O(ND) difference
algorithm and its variations," Algorithmica, vol. 1, pp. 251-266, 1986.
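
For orientation, here is a minimal Python sketch of the greedy forward search
from that paper. It only returns D, the length of the shortest edit script
(insertions plus deletions), not the script itself; the function name is my
own. It works on any two sequences, so lists of tokens are fine.

```python
def myers_edit_distance(a, b):
    """Minimum number of insertions plus deletions that turn a into b."""
    n, m = len(a), len(b)
    max_d = n + m
    # v[k] is the furthest x reached so far on diagonal k, where k = x - y.
    v = {1: 0}
    for d in range(max_d + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]       # move down: an insertion from b
            else:
                x = v[k - 1] + 1   # move right: a deletion from a
            y = x - k
            # Follow the "snake": matching elements cost nothing.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return max_d

print(myers_edit_distance("ABCABBA", "CBABAC"))
print(myers_edit_distance(["a", "b", "c"], ["a", "X", "c"]))
```

Recovering the actual edit script needs the history of v for each d, which
is left out here to keep the sketch short.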

A good commercial product is Araxis Merge. It compares files, which is not
exactly what you asked for, but you could tokenize your files and then
compare the tokenized data.
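
A small Python sketch of the tokenize-then-compare idea, under a few
assumptions: the tokenizer is a simple regular expression (each tag is one
token, text is split on whitespace), and the comparison uses difflib's
SequenceMatcher from the standard library, which is not the Myers algorithm
but gives the same kind of result. The function names are made up.

```python
import re
from difflib import SequenceMatcher

def tokenize(html):
    """Each tag becomes one token; text between tags is split on whitespace."""
    return re.findall(r"<[^>]+>|[^<\s]+", html)

def token_changes(old_html, new_html):
    """Return (operation, old_tokens, new_tokens) for every changed run."""
    old, new = tokenize(old_html), tokenize(new_html)
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    return [(op, old[i1:i2], new[j1:j2])
            for op, i1, i2, j1, j2 in matcher.get_opcodes()
            if op != "equal"]

# Both files are a single line; only the second cell differs.
old_html = "<table><tr><td>red</td><td>10 units</td><td>ok</td></tr></table>"
new_html = "<table><tr><td>red</td><td>12 units</td><td>ok</td></tr></table>"

changes = token_changes(old_html, new_html)
print(changes)
```

Because the comparison runs on tokens rather than lines, the change is
reported as the single word that differs even though each file is one line.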

I once wrote a DOS shareware program (JDIF -
http://www.qsm.co.il/Software/jdif.htm) which could be modified to compare
tokens rather than source lines. But it is limited by the DOS memory size.

For a reference implementation you could look at the GNU diff source code
(remember the GPL).

JR
 