Collecting URLs from large files?


Sting4Life

I was wondering if anybody could point me towards a fast, simple and free
method to sort through several large text files and gather all the URLs in
them. They are older IRC logs, and I'd really like to be able to check
out some of them, however manually sorting through 1000 individual logs that
are over 200 MB combined is a bit too much to do by hand.

So I was wondering if anybody could point me towards a more automated
way of doing this?


Thank you
 

Gary R. Schmidt

Sting4Life said:
I was wondering if anybody could point me towards a fast, simple and free
method to sort through several large text files and gather all the URLs in
them. They are older IRC logs, and I'd really like to be able to check
out some of them, however manually sorting through 1000 individual logs that
are over 200 MB combined is a bit too much to do by hand.

So I was wondering if anybody could point me towards a more automated
way of doing this?
grep "http:" *.log | sort -u | less -S

Cheers,
Gary B-)
 

Sting4Life

grep "http:" *.log | sort -u | less -S

Cheers,
Gary B-)
I do not understand what you are talking about.

I am not a Linux user, but I believe I've heard something about a grep
command from it. If that's what you're referring to, it won't help me,
being a Win9x user.

But thanks for the reply
 

Gerhard Hofmann

Sting4Life said:
I do not understand what you are talking about.

I am not a Linux user, but I believe I've heard something about a grep
command from it. If that's what you're referring to, it won't help me,
being a Win9x user.

There are free UNIX-like environments such as Cygwin or Microsoft's
Services for UNIX that will let you run all this powerful stuff on
Windows machines.

Here is some grep for Windows as a single app:
http://dotnet.jku.at/applications/Grep/

Regards
Gerhard
 

Gary R. Schmidt

Sting4Life said:
I do not understand what you are talking about.

I am not a Linux user, but I believe I've heard something about a grep
command from it. If that's what you're referring to, it won't help me,
being a Win9x user.

But thanks for the reply
Nothing to do with Linux, they are _just_ commands.

I've had them on machines ranging from VAXen to 16-bit MS-DOS machines
and all points in between, and not all of them ran UNIX (or Linux).

grep is a file-searching tool; the name stands for Global Regular
Expression Print (after the ed command g/re/p). Google for a version.

sort is a "usable" sort program; the "-u" option means "only keep a
single copy of multiple identical lines". Again, google for a version,
or just learn how to use the one on your machine.

less is a display program, like "more".
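
For example, assuming Windows ports of grep and sort are on your PATH
(the UnxUtils and GnuWin32 collections both include them, though I
haven't tried either on Win9x), a variation that also picks up ftp
links and writes the unique lines to a file instead of paging them
would be something like:

grep -E "(http|ftp)://" *.log | sort -u > urls.txt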

And as you did not specify which OS you were using, the most portable
answer I knew of was appropriate.

Cheers,
Gary B-)
 

Roger Johansson

And as you did not specify which OS you were using, the most portable
answer I knew of was appropriate.

The expression above looks like it will find all occurrences of
something following "http:".

That will not catch URLs of the form www.google.com, for example.

How would you write the expression to really catch all URLs?
 

Gary R. Schmidt

Roger said:
The expression above looks like it will find all occurrences of
something following "http:".
No, it finds all (unique) lines _containing_ "http:"
That will not catch URLs of the form www.google.com, for example.

How would you write the expression to really catch all URLs?
The OP said URLs, a domain name is _not_ a URL, as a URL requires the
leading protocol selector. (So, I left out "nntp:", "ftp:", etcetera,
etcetera, etcetera.)

Of course, without knowing the format of the files being searched, it is
all rather noncupatory.

A general solution is to take each whitespace-delimited word in the
file and do a lookup on it. This has the advantage of catching
"255.255.255.255" as well as "abc.some.where". (Of course, you have to
prune leading protocol selectors first.)
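
As a rough sketch of that splitting-and-pruning step (assuming GNU tr
and sed, and using channel.log as a stand-in name for one of the logs),
each surviving word then becomes a candidate for the lookup:

tr -s '[:space:]' '\n' < channel.log | sed 's|^[a-z]*://||' | sort -u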

Cheers,
Gary B-)
 

Roger Johansson

No, it finds all (unique) lines containing "http:"

If a URL is written across more than one line, your expression will cut
off the part which is not on the same line as "http:", won't it?
That will result in a lot of cut-off partial URLs.
The OP said URLs, a domain name is not a URL, as a URL requires the
leading protocol selector.

No, it doesn't. Your search expression will miss all URLs which are
not prefixed by "http:", and that is a lot of URLs.

www.google.com, for example.
(So, I left out "nntp:", "ftp:",
etcetera, etcetera, etcetera.)
Of course, without knowing the format of the files being searched, it
is all rather noncupatory.
A general solution is to take each whitespace-delimited word in the
file and do a lookup on it. This has the advantage of catching
"255.255.255.255" as well as "abc.some.where". (Of course, you have
to prune leading protocol selectors first.)

A general solution is to take each whitespace-delimited word in the
file and check if it conforms to the rules for URLs.

You cannot assume that the user of the routine is connected to the
internet, or wants you to check the URLs on the net.
The task was to find all the URLs in a file.
 

Steven Burn

Sting4Life said:
I was wondering if anybody could point me towards a fast, simple and free
method to sort through several large text files and gather all the URLs in
them. They are older IRC logs, and I'd really like to be able to check
out some of them, however manually sorting through 1000 individual logs that
are over 200 MB combined is a bit too much to do by hand.

So I was wondering if anybody could point me towards a more automated
way of doing this?

I don't use IRC and don't plan to, so I have no way of testing whether it
will work for you, but you could try Index.dat QV (primarily intended for
index.dat files, but it should work on your IRC logs too).

Index.dat QV
http://support.it-mate.co.uk/?mode=Products&p=index.datqv

--
Regards

Steven Burn
Ur I.T. Mate Group
www.it-mate.co.uk

Keeping it FREE!
 

spoon2001

Sting4Life said:
I was wondering if anybody could point me towards a fast, simple and
free method to sort through several large text files and gather all the
URLs in them. They are older IRC logs, and I'd really like to be able
to check out some of them, however manually sorting through 1000
individual logs that are over 200 MB combined is a bit too much to do
by hand.

So I was wondering if anybody could point me towards a more automated
way of doing this?


Thank you

VisitURL
http://www.tranglos.com/free/visiturl.html

Features: Quick overview

Features are legion. The program performs a relatively simple task,
but it aims to do it well, which includes flexibility. Here are some of the
highlights.

- URLs may be added one or several at a time, from the clipboard or
  manually.
- Automatic, configurable, semi-intelligent scanning for URL
  descriptions when adding URLs from the clipboard or importing from files.
- Configurable HTML output: you may choose how the list of collected
  URLs is presented in your browser.
- Clipboard monitoring for automatic collection of URLs (just copy the
  URL to the clipboard and Visit will add it to its list).
- User-configurable list of URL schemes ('ftp://', 'http://', etc.) to
  watch for. Ability to extend the default list of schemes.
- Configurable import of URLs from plain text files, HTML documents and
  Internet Explorer "Favorites" (i.e. Internet Shortcuts). For instance,
  you may choose to only import mailto links, only links to '.zip' files,
  or only URLs beginning with 'http://' - or any combination thereof.
  The 'Import' feature itself is the most powerful I've seen in
  applications of this type.
- URLs may be copied to the clipboard in several ways: for instance, you
  may simply copy the Internet address, or you may copy the URL and its
  associated description as HTML code, ready to be pasted into a Web
  document you are creating.
- 'Find' feature (search for text in URLs or URL descriptions).
- Option to live in the system tray.
- Export URLs to a plain text file, as Explorer's 'Favorites', to a
  delimited file (*.CSV) or to Netscape Navigator's bookmarks.
- All functions available from the keyboard.
- Configurable fonts and colors; configurable default actions (for
  instance, you may choose what action the program will perform when you
  click the tray icon). One-click or single-keypress operation for all
  frequently used features.
- URLs may be sorted by net address, by description or by date added.
- A user-defined template file may be used to format the HTML output.
- Configurable activation hotkey to bring the program window to the
  front after it was minimized to the system tray.
- Automatic installer and uninstaller.
 

Mike Bourke

Just copy the content and paste it as text in an email to yourself. Outlook
Express etc. will recognise the hyperlinks for what they are and turn them
into clickable links.

Mike B
 

Gary R. Schmidt

Roger Johansson wrote:
[SNIP]
A general solution is to take each whitespace-delimited word in the
file and check if it conforms to the rules for URLs.

Do you know what a URL looks like?

It is [protocol_selector://]address[:port][lots of other stuff...].

This means that "a" _is_ a valid URL (as is "1"). (I don't have any
machines named "a" on my LAN, the shortest is 5 letters, but if I type
its name into my browser, it goes there.)
You cannot assume that the user of the routine is connected to the
internet, or wants you to check the URLs on the net.
The task was to find all the URLs in a file.
To find all URLs in a file, just spit out _every_ space-delimited text
sequence, IOW just list the file.

You have to look them up to check for validity.
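
A crude sketch of that lookup pass, assuming nslookup is available and
again using channel.log as a stand-in file name; it keeps only the
words whose host part actually resolves (it will be slow on 200 MB of
logs, and it needs a network connection, which was your objection):

tr -s '[:space:]' '\n' < channel.log | sort -u | while read -r w; do
    h=${w#*://}        # drop a leading protocol selector, if present
    h=${h%%[/:]*}      # keep only the host part
    nslookup "$h" >/dev/null 2>&1 && echo "$w"
done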

Cheers,
Gary B-)
 

Roger Johansson

Do you know what a URL looks like?

Yes, it is an address of a resource on the internet.

www.google.com is such a URL.

I use metapad instead of notepad, among other things because it knows
what a URL looks like, so it makes it into a clickable link.

If I wanted a way to extract URLs from big text files, I would want it
to work at least as well as it does in metapad.

A text string containing a period is a URL, except when the dot is the
last character.
 

»Q«

Yes, it is an address of a resource on the internet.

www.google.com is such a URL.

It's not a URL; it's a domain name. There really is a difference.
I use metapad instead of notepad, among other things because it
knows what a URL looks like, so it makes it into a clickable
link.

If I wanted a way to extract URLs from big text files I would
want it to work at least as well as it does in metapad.

A text string containing a period is a URL, except when the dot
is the last character.

URLs are much more complicated than that, though it may look simple when
metapad or some other app turns domain names into clickable links. If
you are really interested, here's a page on regexes to extract URLs
from a mail archive:

<http://www.truerwords.net/articles/ut/urlactivation.html>
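
For a rough first cut with GNU grep (the -o option needs version 2.5 or
newer; I haven't tried anything older), something along these lines
pulls out both scheme-prefixed URLs and bare www. hosts, though it is
nowhere near full RFC 2396 compliance:

grep -Eoh "[a-zA-Z][a-zA-Z0-9+.-]*://[^[:space:]\"<>]+|www\.[^[:space:]\"<>]+" *.log | sort -u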
 

Gary R. Schmidt

Roger Johansson wrote:
[SNIP]
A text string containing a period is a URL, except when the dot is the
last character.
No it isn't. A sequence of characters that (potentially) resolves to a
machine address is a minimal URL.

Simple examples:
"localhost" is a URL
"127.0.0.1" is a URL
"www.google.com" is a URL
"google.com" is a URL
"google" is a URL (admittedly iff your domain is .com)

"http://www.camerafarm.com.au/camera/customer/product.php?productid=835&cat=304&page=1"
is a URL
"goinker.com" is potentially a URL (at the time of writing, it wasn't)

Cheers,
Gary B-)
 

Roger Johansson

URLs are much more complicated than that, though it may look simple when
metapad or some other app turns domain names into clickable links. If
you are really interested, here's a page on regexes to extract URLs
from a mail archive:

Look at example expression 2, with this text under it:

"Stricter compliance to the URL specification, but.."

Note that this "stricter compliance" obviously means that the prefixes
in the first example are gone.
Prefixes: http, ftp, mailto, news, etc.

So, what does this say about the specification of URLs?

Why do you refer to a web document which supports my view?
 

»Q«

Look at example expression 2, with this text under it:

"Stricter compliance to the URL specification, but.."

Note that this "stricter compliance" obviously means that the
prefixes in the first example are gone.
Prefixes: http, ftp, mailto, news, etc.

I had already read the page.
So, what does this say about the specification of URLs?

Nothing. It just has some attempts to capture strings you might want
to turn into clickable links. If you want the spec, see RFC 2396.
Why do you refer to a web document which supports my view?

I thought you were asking questions about how to capture such
things as domain names and URLs using regular expressions, so I
thought it might help you; I didn't realize you were just arguing a
claim for its own sake.
 

Gary R. Schmidt

Roger said:
Yes, there is no need for a prefix like http.
Agreed, but you have totally ignored (and snipped without indication)
the actual point of my message - you have to do something with a word in
a file to check whether or not it _is_ a valid URL, because just about
_any_ sequence of characters may be a valid URL, and there is _no_
requirement for there to be a dot in it.

Cheers,
Gary B-)
 
