Collecting URLs from large files?

Simon

Sting4Life said:
I was wondering if anybody could point me towards a fast, simple and free
method to sort through several large text files and gather all the URLs in
them. They are older IRC logs, and I'd really like to be able to check
out some of them; however, manually sorting through 1000 individual logs that
are over 200 MB combined is a bit too much to do by hand.

So I was wondering if anybody could point me towards a more automated
way of doing this?


Thank you

you could try HTML Extractor <http://boozet.xepher.net>...

links are extracted to a text file...

S
 
John Fitzsimons

Roger Johansson wrote:
Agreed, but you have totally ignored (and snipped without indication)
the actual point of my message - you have to do something with a word in
a file to check whether or not it _is_ a valid URL, because just about
_any_ sequence of characters may be a valid URL, and there is _no_
requirement for there to be a dot in it.

Any sequence of characters ? Without a dot ? I can only see one on
the above list. The first one. Can you give us some other examples
please ?

Regards, John.
 
John Fitzsimons

On 17 May 2005 04:38:23 GMT, "Roger Johansson" <[email protected]>
wrote:

A text string containing a period is a URL, except when the dot is the
last character.

Interesting thought. So if one grepped for lines with a dot followed
by something then one would have all lines in a text file that are
URLs ?

Does anyone here know what the regexp would be, in something like
NoteTab, to find such lines please ?

Regards, John.
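
For what it's worth, the "dot followed by something" rule can be sketched as a regular expression. The snippet below uses Python rather than NoteTab (an assumption, since nobody in the thread names a language), and the pattern `\S\.\S` and the function name `lines_with_inner_dot` are illustrative only. Whether NoteTab's regex dialect accepts the same pattern unchanged has not been checked here.

```python
import re

# A dot with a non-space character on both sides, i.e. a dot that is not
# the last character of a word. This is the rule of thumb from the thread,
# not a real URL validator.
DOT_INSIDE = re.compile(r"\S\.\S")

def lines_with_inner_dot(lines):
    """Return the lines that contain at least one dot followed by text."""
    return [line for line in lines if DOT_INSIDE.search(line)]

sample = [
    "check out www.example.com tonight",
    "that was fun.",
    "version 2.0 is out",
    "brb",
]

for line in lines_with_inner_dot(sample):
    print(line)
```

Note that the "version 2.0" line is matched as well, so this rule on its own pulls in a fair amount of text that is not a web address.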
 
Gary R. Schmidt

John said:
Any sequence of characters ? Without a dot ? I can only see one on
the above list. The first one. Can you give us some other examples
please ?

Sure, on my LAN, if I type "sethra" into my browser I end up at that
machine.

Name-type addresses (as opposed to numeric 123.123.123.123) are
predicated upon domains, when a domain is not entered, your (default)
domain is added.

Also "Roger Johansson" snipped this line (among other important ones):
"google" is a URL (admittedly iff your domain is .com)
(If you aren't a maths type, "iff" means "if and only if")

Cheers,
Gary B-)
 
Roger Johansson

Gary said:
Name-type addresses (as opposed to numeric 123.123.123.123) are
predicated upon domains, when a domain is not entered, your (default)
domain is added.

You are now talking about the autocomplete function in web browsers or
DNS servers.
But we were talking about URL's, which still need at least one dot
inside the name.

When I write an incomplete url in the address bar my browser Opera
tries to add .com .org .net etc..
That doesn't mean that "google" is a valid URL.

The full name is the name users outside the local LAN have to use.

Your sethra is a short form of a longer URL.
 
Gary R. Schmidt

Roger said:
You are now talking about the autocomplete function in web browsers or
DNS servers.

Sigh. No, I am not. The auto completion function prepends "www." and
appends ".com", so it would look up "www.sethra.com", and therefore fail.

But we were talking about URL's, which still need at least one dot
inside the name.

No they don't. Look at the spec., it says (paraphrasing):
[protocol selector://]address[:port and so on...]

Further on, "address" is simply defined as "any valid address" (or words
to that effect), and, inside my domain, "sethra" is a valid address.

When I write an incomplete url in the address bar my browser Opera
tries to add .com .org .net etc..

Yes, _your_ browser may, but mine _doesn't_.

That doesn't mean that "google" is a valid URL.

Yes it is. Try this simple example.
Edit your hosts file to include the lines:
64.233.187.99 oog1
64.233.187.104 oog2
(These addresses are from "nslookup www.google.com", you may wish to do
it yourself and get whatever your DNS resolves it to).

Then feed "oog1" and "oog2" to your browser, prevent it from doing the
domainisation stuff, and see where you end up.

The full name is the name users outside the local LAN have to use.

Ah, yes, the full name - rather like the prefix needed when you dial
from "outside". So is my "correct" telephone number 1234 5678, or 03
1234 5678, or +61 3 1234 5678? Do I need a 9 in there to get an outside
line from my exchange? Do I need to talk to an operator?

Your sethra is a short form of a longer URL.

No, it is a URL, just as "http://sethra.subdomain.domain:80" is a URL.

Cheers,
Gary B-)
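
One way to poke at the question of whether a host name needs a dot is to ask a standard URL parser. The following is a minimal sketch using Python's `urllib.parse` (chosen here purely for illustration; `host_and_port` is a made-up helper name, and "sethra" is the LAN machine name from the post above). It only shows how one parser splits the strings; it does not settle what any particular spec requires.

```python
from urllib.parse import urlsplit

def host_and_port(url):
    """Split a URL and return its (hostname, port) pair."""
    parts = urlsplit(url)
    return parts.hostname, parts.port

print(host_and_port("http://sethra/"))                     # ('sethra', None)
print(host_and_port("http://sethra.subdomain.domain:80"))  # ('sethra.subdomain.domain', 80)
print(host_and_port("sethra"))                             # (None, None)
```

The parser accepts the dotless host once a scheme is present. A bare `sethra` with no `http://` in front is treated as a path with no host at all. Whether the name then resolves to a machine is a separate matter for DNS or the hosts file.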
 
Sting4Life

I would just like to thank everyone for their replies.


To clarify the URL issue mentioned, I am not sure I phrased that
correctly. I mean any form of web address appearing in them, including
practically anything starting with http:// or www., IP addresses and
such.

I still do not understand the grep thing; sort, grep and the rest all
appear to be separate programs, and I do not have them available.

The one someone said to email them to myself is also out of the question,
due to the size and quantity.

As for the other programs suggested, I will give them all a go. See if
any get the job done.



But thanks to everyone for the assistance
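
As a rough sketch of what an automated pass over the logs could look like for the kinds of addresses described above (http://, www., and IP addresses), here is a small Python script. Python itself, the `*.log` file pattern, the function names and the rule for trimming trailing punctuation are all assumptions, not something anyone in the thread used. The regular expression is deliberately loose and will miss dotless host names.

```python
import re
from pathlib import Path

# Three loose alternatives: http(s):// links, bare www. addresses, and
# dotted IPv4 addresses with an optional :port and /path.
URL_PATTERN = re.compile(
    r"https?://[^\s<>\"']+"
    r"|www\.[^\s<>\"']+"
    r"|\b(?:\d{1,3}\.){3}\d{1,3}\b(?::\d+)?(?:/[^\s<>\"']*)?",
    re.IGNORECASE,
)

def find_urls(text):
    """Return URL-like strings in text, with trailing punctuation trimmed."""
    return [m.group(0).rstrip(".,;:!?)") for m in URL_PATTERN.finditer(text)]

def collect_urls(log_dir, pattern="*.log"):
    """Scan every matching file line by line; return unique URLs, sorted."""
    seen = set()
    for path in Path(log_dir).glob(pattern):
        # errors="replace" keeps odd bytes in old logs from aborting the scan
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line in handle:
                seen.update(find_urls(line))
    return sorted(seen)

line = "<bob> look at http://example.com/page.html, and www.example.org! also 192.168.0.1:8080/status"
for url in find_urls(line):
    print(url)
```

Calling `collect_urls` with the folder that holds the logs (for example a hypothetical `C:\irclogs`) returns one sorted list, which can then be written to a text file. Reading line by line keeps memory use low even for a couple of hundred megabytes of logs, but it also means a URL that was split across two lines by word wrapping will be cut short.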
 
Roger Johansson

Sting4Life said:
The one someone said to email them to myself is also out of the
question, due to the size and quantity.

As for the other programs suggested, I will give them all a go. See
if any get the job done.

If I had this problem I would look through the collections of perl
programs on the web, and ask nicely for help in a perl newsgroup. I
know they have solved this problem a million times before.

You have to install perl to use a perl solution, (and probably suffer
lots of flames in the perl newsgroup for being a moron, read the f..
manual, etc..) but if it looks like it would be worth it...

Other alternatives, find the newsgroups where the words awk, sed and
grep are used most, ask the people there for help.

Get Windows versions of these utilities if you are in Windows.

When asking for help, specify exactly what kind of URLs you want to
find, with several examples. Also say whether or not you want a solution
which restores URLs that have been cut by word wrapping (linefeeds).
 
