Web get command (wget) to download all icons/pics on a web page (skipping files that are too large or too small)

How do I get the Windows/Linux web-get (wget) command to ignore files that are too small or too large?

Like everyone, I often use the Windows/Linux "Free Software Foundation"
web-get wget command to download all the PDFs, GIFs, or JPEGs on a web site
onto my hard disk.

The basic command we all use is:

EXAMPLE FOR WINDOWS:
c:\> wget -prA.gif http://machine/path

EXAMPLE FOR LINUX:
% wget -prA.jpg http://machine/path
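
(For reference, if I read the manpage right, those bunched-up short options are
just shorthand for the long forms, so the Linux example spelled out would be:)

% wget --page-requisites --recursive --accept=.jpg http://machine/path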

This famous wget command works great, except it downloads ALL the JPG & GIF
icons and photos at the targeted web site - large or small.

How do we tell wget to skip files of a certain size?

For example, assume we wish to skip anything smaller than, say, 10KB and
anything larger than, say, 100KB.

Can we get wget to skip files that are too small or too large?

barb
 
I don't know how to do that, but it would be easy to erase
all the small files. When the images have been downloaded
to a directory, sort the directory by file size and erase
those below the minimum size you want to keep.
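
If you would rather script that cleanup than do it by hand each time, a single
find command should handle it; this is only a rough sketch, assuming GNU find
and that you run it from the directory wget created:

# delete anything smaller than about 10KB or larger than about 100KB
find . -type f \( -size -10k -o -size +100k \) -delete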

Hi Marvin,

Thank you for your help. I thought of this but I was kind of hoping that
wget would have a "size" range option that handled this.

Something like:

wget -prA.pdf http://www.consumerreports.com --size<min:max>
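
As far as I can tell from the manpage, the closest built-in option is the
total-download quota, which caps the whole recursive run rather than
individual files, so it does not really solve this:

wget -prA.pdf --quota=50m http://www.consumerreports.com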

What I do today is sort by file size and then delete the too-large files
and the too-small files but that is obviously not optimal.

barb
 
If you really want this, I think you're going to have to hack
wget so that it takes another option, --size-range or something.
Then wget would have to check the Content-Length in the server's 200 response
and either halt the download if the file wasn't within --size-range,
or unlink() the file after it finished. The exact approach you'd
take depends on the wget code itself, and your level of C skill.
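
Short of patching the C source, a wrapper script could approximate the check
by asking the server for the size before downloading; a rough sketch, assuming
the server sends a Content-Length header (the URL is just a placeholder):

wget --spider --server-response http://machine/path/some.gif 2>&1 \
  | awk '/Content-Length/ {print $2}'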

Hi Dances with Crows,

Thank you for your kind help. As you surmised, I do not have the skill set
to "hack" the venerable wget command so that it downloads only files within
a certain size range.

I had also read the manpage and searched beforehand, but I did not see that
anyone had done this yet. I am kind of surprised, since it seems like such a
basic thing to want.

For example, let's say we went to a free icon site that is updated
periodically with tiny web-page bitmaps, medium-sized icons usable for
PowerPoint slides, and oversized images suitable for photo sessions.

Let's say you had a scheduled wget job go to that site daily and download all
the icons automatically from that HTTP web page, but not the large ones or the
really, really small ones. Let's say there were thousands of these. Of course,
FTP would be a pain, and you likely wouldn't have FTP access anyway. And
downloading them manually isn't in the cards.

What I'd want to schedule is:
wget -prA.gif,jpg,bmp http://that/freeware/icon/web/page --size:<low:high>
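
Since no such --size option exists, I suppose the closest thing today would be
a small nightly script that downloads first and then prunes by size; a rough
sketch (the download directory and thresholds are made up):

#!/bin/sh
# run daily from cron; keep only files between roughly 10KB and 100KB
wget -q -prA.gif,jpg,bmp -P /home/barb/icons http://that/freeware/icon/web/page
find /home/barb/icons -type f \( -size -10k -o -size +100k \) -delete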

barb
 
I'm just wondering why you need to do this... You might be getting
into copyright issues here....

Hi poddys,

Thank you very much for asking the right questions. Let's say I went to
http://www.freeimages.co.uk or http://www.bigfoto.com or
http://www.freefoto.com/index.jsp or any of a zillion sites which supply
royalty free images or GIFs or bitmaps or PDFs or HTML files etc.

Why wouldn't I want to use wget to obtain all the images, PDFs, Word
documents, PowerPoint templates, whatever ... that such a site offers?

Even for sites I PAY for, such as Consumer Reports and technical data sites ...
why wouldn't I want to just use wget to download every single PDF,
Microsoft Office document, or graphic at that web site?

There's no copyright infringement in that, is there?

I can do all that today with wget.
The only problem I have is that the really large (too large) files get
downloaded too and that the really small (too small) files seem to be
useless clutter.

barb
 
barb never stated what barb was doing with the images. It's a legit and
semi-interesting question, though, regardless of what the final purpose
is. Too bad there's nothing in wget that does what barb wants. barb
will have to either hack wget or write a small script to remove all
files outside the size range X to Y after wget's finished.

Hi Dances with crows,

I don't know what I want to do with the images or PDFs or PowerPoint
templates. For example, recently I found a page of royalty-free PowerPoint
calendar templates. The web page had scores and scores of them.

Nobody in their right mind is going to click on a link-by-link basis when
they can run a simple wget command and get them all in one fell swoop (are
they?)

wget -prA.ppt http://that/web/page

My older brother pointed me to one of his Yahoo web pages, which contained
hundreds of photos. I picked them all up in seconds using:
wget -prA.jpg http://that/web/page

I wouldn't THINK of downloading a hundred photos manually (would you?).

Do people REALLY download documents MANUALLY nowadays? Oh my. They're crazy
in my opinion (although I did write and file this letter manually myself
:P)

barb
 
You could probably write a script or batchfile to process the results of
the wget download based on filesize.

Hi Ben Dover,
Thank you very much for your kind advice.

I am not a programmer, but I guess it could look like this (in DOS)?

REM wget.bat
wget -prA.ppt,jpg,doc,pdf,gif http://some/web/page
REM walk every downloaded file and delete anything outside 10KB-100KB
for /r %%f in (*) do (
    if %%~zf LSS 10240 del "%%f"
    if %%~zf GTR 102400 del "%%f"
)

And, in Linux, maybe something like this (adapted from a csh example I found
on the web):

# wget.csh
wget -prA.ppt,jpg,doc,pdf,gif http://some/web/page
# loop over every downloaded file and delete anything outside 10KB-100KB
foreach file (`find . -type f`)
    set size = `ls -l $file | awk '{print $5}'`
    if ($size < 10000) rm $file
    if ($size > 100000) rm $file
end

Is this a good start? (Which newsgroup could we ask?)
barb
 