HTML Cleaner

  • Thread starter: No I'm Spartacus

No I'm Spartacus

Hi everyone,

I'm not even sure there is such a utility, but I'm looking for an HTML
cleaner. Specifically, when you save a webpage off the Internet (just
with File-Save As in your browser), you sometimes get a page that
contains things such as ad links. When you disconnect from the 'net and
try to view the saved page in your browser, you get a message saying
'{insert name of ad server here} cannot be reached' or similar. What I
am looking for is a utility that will go in and remove such stuff (ads,
iframe tags, etc.).
 
Spartacus, I've never heard of any tool that can do this, for free
anyway.
Is there a specific reason why you don't want the ads there? You can
still view the page, can't you?

tf76
http://www.topfreeware.net
Member of the Freeware Revolution
 
Do you have the same problem if, in Internet Explorer,
you choose to save the webpage as a
"Web Archive, single file (*.mht)"?

Regards
Thorkild Dalsgaard
 
Spartacus,
If you are using Firefox, just install the Scrapbook extension.
Just save your desired webpage, then edit it with the built-in editing
tools. I save webpages all the time with Scrapbook, then take out all
the ads & junk. Works great!
 
A solution: disable images, JavaScript, Java, etc. while viewing
offline (saved pages).

J
 
Notetab Lite (www.notetab.com) is a text editor with a comprehensive
built-in macro / programming language. I've got clips that will rip through
an HTML page in a second or two, stripping out JavaScript and other
undesirable elements. A lot of these clips can be found for free and tweaked
for your purposes.

Alternatively, the old Frontpage Express had a neat feature where scripts --
which are often used to call up ads -- appeared in WYSIWYG mode as a small
rectangle. Delete the rectangle and you delete the whole script.

M
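
For anyone who would rather script this than use editor clips, here is a minimal Python sketch of the same idea: copy the page through unchanged, but drop script and iframe elements along with everything inside them. It uses only the standard library's html.parser. The names Cleaner, clean_html and DROP are made up for this illustration, and the choice of which elements to drop is an assumption, not something taken from the Notetab clips or Frontpage Express.

```python
from html.parser import HTMLParser

# Elements to drop together with their contents (an assumption for this
# sketch; extend the set to taste).
DROP = {"script", "iframe", "noscript"}


class Cleaner(HTMLParser):
    """Re-emit a page unchanged except for the elements listed in DROP."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP:
            self.depth += 1
        elif self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no separate end tag.
        if tag not in DROP and self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in DROP:
            self.depth = max(0, self.depth - 1)
        elif self.depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth == 0:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        if self.depth == 0:
            self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        if self.depth == 0:
            self.out.append(f"<!{decl}>")


def clean_html(text):
    """Return text with the DROP elements removed."""
    parser = Cleaner()
    parser.feed(text)
    parser.close()
    return "".join(parser.out)
```

To use it, read a saved .html file into a string, pass it to clean_html, and write the result back out. One caveat: a dropped element that is never closed (for example an iframe with no end tag) will swallow the rest of the page, so keep the original file until you have checked the output.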
 
I've found the main culprit is the link to an external stylesheet. You can
take care of that beforehand using a filtering program like Proxomitron.
But if you want to take care of it *afterwards*, you'd have to have some
sort of automated program to get rid of the external CSS link.
 
Proxomitron can do that, too. As an example, I kill
<link rel="stylesheet"*>
on certain pages. YMMV.

J
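
For files already on disk, roughly the same thing can be done outside Proxomitron. The sketch below is a Python approximation of the filter above, not a translation of Proxomitron's matching rules: it deletes any link tag whose rel attribute contains "stylesheet". The names STYLESHEET_LINK and drop_stylesheet_links are invented for the example, and editing HTML with a regular expression is fragile, so treat it as a quick cleanup aid only.

```python
import re

# A <link ...> tag whose rel attribute value contains the word "stylesheet",
# wherever that attribute sits inside the tag.
STYLESHEET_LINK = re.compile(
    r"""<link\b[^>]*\brel\s*=\s*["']?[^"'>]*\bstylesheet\b[^>]*>""",
    re.IGNORECASE,
)


def drop_stylesheet_links(html):
    """Return html with every stylesheet <link> tag removed."""
    return STYLESHEET_LINK.sub("", html)
```

Other link tags (icons, feeds and so on) are left alone, because the word "stylesheet" has to appear inside the rel value itself.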
 
Hello,
I've found the main culprit is the link to an external stylesheet.

When saving a page as "complete", these stylesheets should get saved as
well. And I think this is good, because more and more styling gets done
with CSS these days, and many webpages read *much* more comfortably
styled than as the pure, unstyled page. Pictures can also carry
relevant information, and IE is able to save them with the webpage,
so I don't see a reason for disabling them.

What IE cannot save is dynamic content: image and iframe URLs that
are computed at view time, as ads usually are.

I don't see a satisfying solution to this request.

Regards,
Thorsten
 
Hi tf76,

There is a reason. Once I save the webpages, I usually don't open
them in my browser (I use the Lister function of Total Commander), and
I usually do it with my modem off. When you open the pages that way,
you usually (depending on where you saved the page from) get a
popup or two saying the page couldn't reach some webserver or ad
link. I thought that if there was a utility I could run over
some of my .html files (only some of them have links to ads
in them, and some also have an iframe tag that does not render well), I
could 'clean them up' and make my offline viewing experience a little
more pleasurable. Yeah, I can still view the pages; it was just a case
of getting rid of the 'non-content' tags.
 
Hi Thorkild,

I haven't actually tried that - I use Opera as my browser. If I 'save
with images' in Opera, I don't usually have a problem, but a lot of
the time I only save the .html (I only save images when some of them
relate to the content of the webpage - I don't count ads as part of
the content, so if a page is just text + banner ads, I only save the
.html file itself).
 
Hi J,

That's an interesting idea. Opera (which I use as my browser) has
both a built-in 'user mode' function (with which you can selectively
change the page content, layout, etc.) and a function that lets you run
user-created JavaScript. I'm no JS expert, but there are a number of
User JS sites around with scripts that other users have created, so I
could take a look there (if user mode doesn't work) and see whether
any of them are suitable.
 
Actually, I did also eventually find (by luck, admittedly) a freeware
utility called HTML Remover (http://www.e-systems.ro/html_remover.htm)
which sounds like it also removes scripts, as Frontpage Express does.
From the website:

"Emsa HTML Tag Remover is a software utility that allows removing html
tags from a html file with some extra degree of control on how the
html is removed and whitespace removal as well. It provides several
options to remove different types of data from the html page. It
allows whitespace removal, making the resulting text output condensed
as necessary. Finally, it works both in interactive mode, as well as
in command line mode, which can be useful for users wanting to use
this functionality from other programs or batch files."

When you run it, there are options to remove comment tags, script
tags, style tags, functions, pipe and tab characters, foreign and
special characters, and all tags (as well as the whitespace options).
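
To give a feel for what a tool with those options does, here is a rough Python sketch: it throws away all tags, comments, and script and style bodies, and can optionally condense the whitespace. This is my own illustration built on the standard html.parser module, not how Emsa HTML Tag Remover works internally; TextOnly, html_to_text and the condense parameter are hypothetical names.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect the text of a page, skipping script and style bodies."""

    def __init__(self):
        super().__init__()  # default convert_charrefs=True decodes &amp; etc.
        self.parts = []
        self.skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)

    # Comments are dropped simply because handle_comment is not overridden.


def html_to_text(html, condense=True):
    """Strip all markup; optionally collapse whitespace runs to one space."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    return " ".join(text.split()) if condense else text
```

Note that text from neighbouring elements is joined with nothing in between, so two paragraphs with no whitespace between their tags will run together. Good enough for a quick read in a file viewer, not for anything that needs the layout.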

Not sure how it would go on some of the pages that I had in mind when
I posted my original request for an HTML cleaner (the BoFH pages on The
Register in particular - when you open them in Notepad, the majority
of the file is scripts, ads, styles, etc. - only the little bit right in
the middle is the actual content of the page, sandwiched between all
the crap)...
 
I'll have a look at that. Another solution for you: somewhere out there I
got hold of an IE add-on that lets you highlight stuff of interest on a
webpage and right-click, and then an option appears to view just the code
of the highlighted stuff. For pages thick with extraneous stuff, this
little utility has proven to be a godsend. You just copy the highlighted
code into a text editor, save that to your hard drive and you're done.

You might have to save the original page just to retrieve any images that go
along with your text, or save them individually if there aren't many.

Sorry, I don't have the URL for this add-on, nor do I know the name of it.
It just installs itself into IE without a fuss.

M
 
(e-mail address removed) wrote in news.ops.worldnet.att.net:
Proxomitron can do that, too. As an example, I kill
<link rel="stylesheet"*>
on certain pages. YMMV.

Yeah, that's what I mean: you can kill it *beforehand* with a filter, but
if it's something already on your disk, you're out of luck. Unless
there's some way to filter local files. I think I heard that there is, but
I've never tried it.
 
Examples of the "URL Match" for local files:
[^/]++XXX-file
www.google.com/|*x:*
The 'XXX' is whatever you choose as your prefix.
The 2nd example applies the filter to http://www.google.com
or XXX-file///x:/a_dir/a_file on drive X:.

J
 
Okay, I'm afraid I don't understand. All my local files start with:

file:///c:/

How do I filter a file like that?

And the prefix you're referring to, is that the URL prefix that you can
configure in Proxo? And do you need the hyphen after it?

Thanks!
 
First configure Proxomitron:

Open Proxomitron
Config | Access

Enter into "Prefix all URL command with:"
something meaningful to you, say "XYZ"

OK, save.

Now you can filter local files, e.g.,
file:///c:/blah.htm
by specifying
http://XYZ-file///c:/blah.htm

To restrict a filter only to your local files, enter in the
"URL Match" field:
[^/]++XYZ-file

J
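
If you have a lot of local files to push through the proxy, the URL rewrite in the steps above is easy to script. This small Python helper only mirrors the one example given here (file:///c:/blah.htm becomes http://XYZ-file///c:/blah.htm); the function name is made up, "XYZ" is whatever prefix you entered under Config | Access, and it has not been checked against any other Proxomitron URL forms.

```python
def proxomitron_local_url(file_url, prefix="XYZ"):
    """Rewrite file:///c:/blah.htm as http://XYZ-file///c:/blah.htm."""
    scheme = "file:"
    if not file_url.lower().startswith(scheme):
        raise ValueError(f"not a file: URL: {file_url!r}")
    # Keep everything after "file:" (including the three slashes) as-is.
    return f"http://{prefix}-file{file_url[len(scheme):]}"
```

Paste the resulting URL into a browser that is configured to use Proxomitron as its proxy.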
 
OK, thanks for the info, but I still can't get it to work. Nothing is
getting filtered. But this has stimulated me to look further into this
matter. So thanks for getting me off my butt and interested in pursuing
this. :)
 
Good luck - it's (sometimes frustrating) fun.
If you have any questions, feel free to ask.

J
 
