smerf
I am trying to write a personal spider to crawl through websites and create
a highly specialized personal list of sites and pages that I may like to see
based on preferences that I have supplied. I have found some interesting
pages - interesting in that they use JavaScript to encrypt the
pages to block people from "stealing their content".
There are JavaScript tricks that you can use on the downloaded encrypted
page to get around these irritations. You run a line of JavaScript in
the browser's address bar and you get another window with the unencrypted
HTML in it. But I want to see the unencrypted HTML without downloading
every image, wav, ActiveX object and Flash thingy on the page into an actual
WebBrowser control. Using a WebBrowser control for this, and having to
download all the images and such, would dramatically decrease the speed the
spider can crawl at.
An example of an encrypted page can be found at
http://www.aw-soft.com/htmlguard-sample.html. A simple JavaScript way to
defeat it is by pasting
"javascript:window.open('about:blank').document.write('<pre>' +
document.documentElement.outerHTML.replace(/</g, '&lt;') + '</pre>')" in the
IE address bar and clicking GO.
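For readability, here is roughly what that address-bar one-liner does when unrolled. This is only a sketch: the function names escapeForPre and showDecodedSource are mine (not part of any library), and I have added escaping of '&' on the assumption that it keeps entities in the source from being re-interpreted. It relies on the page's own script having already rewritten the DOM, so it only works inside a browser.

```javascript
// Escape markup so the browser displays it as text instead of rendering it.
// '&' is escaped first so existing entities are shown literally.
function escapeForPre(html) {
  return html.replace(/&/g, '&amp;').replace(/</g, '&lt;');
}

// Browser-only: dump the live (already decoded) DOM into a new window.
// Hypothetical helper; must be run in the context of the protected page.
function showDecodedSource() {
  var source = document.documentElement.outerHTML;
  var win = window.open('about:blank');
  win.document.write('<pre>' + escapeForPre(source) + '</pre>');
  win.document.close();
}
```

The point is that the decoding itself is done by the page's script; the trick just reads the result back out of the DOM, which is why I think I need something that can execute JavaScript.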
I have no interest in (or drive space for) mass web page content theft.
But is there anything in the .NET Framework that will help with viewing an
encrypted web page's source for my spider? It seems I need to be able to
run JavaScript to decode the page into readable HTML. But, as I said,
I am only after the HTML. I really don't want to download the pictures and
all that other stuff - it kills my speed.
Any ideas?