Collecting URLs from large files?

 
 
Sting4Life
Guest
Posts: n/a
 
      12th May 2005
I was wondering if anybody could point me towards a fast, simple and free
method to sort through several large text files and gather all the URLs in
them. They are older IRC logs, and I'd really like to be able to check
out some of them, but manually sorting through 1000 individual logs that
are over 200 MB combined is a bit too much to do by hand.

So I was wondering if anybody could point me towards a more automated
way of doing this?


Thank you
 
 
 
 
 
Gary R. Schmidt
Guest
Posts: n/a
 
      12th May 2005
Sting4Life wrote:

> I was wondering if anybody could point me towards a fast, simple and free
> method to sort through several large text files and gather all the URLs in
> them. They are older IRC logs, and I'd really like to be able to check
> out some of them, but manually sorting through 1000 individual logs that
> are over 200 MB combined is a bit too much to do by hand.
>
> So I was wondering if anybody could point me towards a more automated
> way of doing this?
>

grep "http:" *.log | sort -u | less -S

Cheers,
Gary B-)

--
______________________________________________________________________________
Armful of chairs: Something some people would not know
whether you were up them with or not
- Barry Humphries
 
 
 
 
 
Sting4Life
Guest
Posts: n/a
 
      12th May 2005
Gary R. Schmidt wrote:
> Sting4Life wrote:
>
> > I was wondering if anybody could point me towards a fast, simple and free
> > method to sort through several large text files and gather all the URLs in
> > them. They are older IRC logs, and I'd really like to be able to check
> > out some of them, but manually sorting through 1000 individual logs that
> > are over 200 MB combined is a bit too much to do by hand.
> >
> > So I was wondering if anybody could point me towards a more automated
> > way of doing this?
> >

> grep "http:" *.log | sort -u | less -S
>
> Cheers,
> Gary B-)
>
>

I do not understand what you are talking about.

I am not a Linux user, but I believe I've heard something about a grep
command from it. If that's what you're referring to, it won't help me,
being a Win9x user.

But thanks for the reply
 
 
Gerhard Hofmann
Guest
Posts: n/a
 
      12th May 2005
Sting4Life wrote:
>>grep "http:" *.log | sort -u | less -S
>>
>> Cheers,
>> Gary B-)
>>
>>

>
> I do not understand what you are talking about.
>
> I am not a Linux user, but I believe I've heard something about a grep
> command from it. If that's what you're referring to, it won't help me,
> being a Win9x user.
>


There are some free UNIX-like environments, such as Cygwin or Microsoft's
Services for Unix, that will enable you to run all this powerful stuff on
Windows machines.

Here is some grep for Windows as a single app:
http://dotnet.jku.at/applications/Grep/

Regards
Gerhard

 
 
Gary R. Schmidt
Guest
Posts: n/a
 
      12th May 2005
Sting4Life wrote:
> Gary R. Schmidt wrote:
>
>>Sting4Life wrote:
>>
>>
>>>I was wondering if anybody could point me towards a fast, simple and free
>>>method to sort through several large text files and gather all the URLs in
>>>them. They are older IRC logs, and I'd really like to be able to check
>>>out some of them, but manually sorting through 1000 individual logs that
>>>are over 200 MB combined is a bit too much to do by hand.
>>>
>>>So I was wondering if anybody could point me towards a more automated
>>>way of doing this?
>>>

>>
>>grep "http:" *.log | sort -u | less -S
>>
>> Cheers,
>> Gary B-)
>>
>>

>
> I do not understand what you are talking about.
>
> I am not a Linux user, but I believe I've heard something about a grep
> command from it. If that's what you're referring to, it won't help me,
> being a Win9x user.
>
> But thanks for the reply

Nothing to do with Linux; they are _just_ commands.

I've had them on machines ranging from VAXen to 16-bit MS-DOS machines
and all points in between, and not all of them ran UNIX (or Linux).

grep is a file-searching tool; the name stands for Global Regular Expression
Print. Google for a version.

sort is a "usable" sort program; the "-u" option means "keep only a
single copy of multiple identical lines". Again, google for a version,
or just learn how to use the one on your machine.

less is a display program, like "more".

And as you did not specify which OS you were using, the most portable
answer I knew of was appropriate.
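
If you only want the URLs themselves rather than the whole log lines, a rough
variant (assuming a grep that supports the -E, -o and -h options, as GNU grep
does) would be:

# print only the matching URL text, one per line, duplicates removed
grep -Eoh 'https?://[^[:space:]]+' *.log | sort -u | less -S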

Cheers,
Gary B-)

--
______________________________________________________________________________
Armful of chairs: Something some people would not know
whether you were up them with or not
- Barry Humphries
 
 
Roger Johansson
Guest
Posts: n/a
 
      13th May 2005
Gary R. Schmidt wrote:

> > > grep "http:" *.log | sort -u | less -S


> And as you did not specify which OS you were using, the most portable
> answer I knew of was appropriate.


The expression above looks like it will find all occurrences of
something following "http:".

That will not catch URLs of the form www.google.com, for example.

How would you write the expression to really catch all URLs?


--
Roger J.
 
 
Gary R. Schmidt
Guest
Posts: n/a
 
      13th May 2005
Roger Johansson wrote:

> Gary R. Schmidt wrote:
>
>
>>>>grep "http:" *.log | sort -u | less -S

>
>
>>And as you did not specify which OS you were using, the most portable
>>answer I knew of was appropriate.

>
>
> The expression above looks like it will find all occurrences of
> something following "http:".

No, it finds all (unique) lines _containing_ "http:".

> That will not catch URLs of the form www.google.com, for example.
>
> How would you write the expression to really catch all URLs?
>

The OP said URLs; a domain name is _not_ a URL, as a URL requires the
leading protocol selector. (So, I left out "nntp:", "ftp:", etcetera,
etcetera, etcetera.)

Of course, without knowing the format of the files being searched, it is
all rather noncupatory.

A general solution is to take each whitespace-delimited word in the
file and do a lookup on it. This has the advantage of catching
"255.255.255.255" as well as "abc.some.where". (Of course, you have to
prune leading protocol selectors.)
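
A rough sketch of that approach with standard Unix tools (tr, sed, sort and
the host DNS lookup utility; "big.log" just stands in for one of the files,
and the details are only illustrative):

# one word per line, strip protocol selectors, then keep only the
# candidates that actually resolve in DNS
tr -s '[:space:]' '\n' < big.log |
  sed 's|^[a-z]*://||' |
  sort -u |
  while read word; do
      host "$word" > /dev/null 2>&1 && echo "$word"
  done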

Cheers,
Gary B-)

--
______________________________________________________________________________
Armful of chairs: Something some people would not know
whether you were up them with or not
- Barry Humphries
 
 
Sparky
Guest
Posts: n/a
 
      13th May 2005
Gary R. Schmidt wrote:
>...noncupatory.


Great write-up for grep, et al. But, stud, what on *earth* does
"noncupatory" mean?

nonplussed,

-Sparky
 
 
Roger Johansson
Guest
Posts: n/a
 
      13th May 2005
Gary R. Schmidt wrote:

> > The expression above looks like it will find all occurrences of
> > something following "http:".


> No, it finds all (unique) lines containing "http:".


If one URL is written on more than one line, your expression will cut
off the part which is not on the same line as "http:", won't it?
That will result in a lot of cut-off partial URLs.

> > That will not catch URLs of the form www.google.com, for example.
> > How would you write the expression to really catch all URLs?


> The OP said URLs; a domain name is not a URL, as a URL requires the
> leading protocol selector.


No, it doesn't. Your search expression will miss all URLs which are
not prefixed by http:, and that is a lot of URLs.

www.google.com, for example.

> (So, I left out "nntp:", "ftp:",
> etcetera, etcetera, etcetera.)


> Of course, without knowing the format of the files being searched, it
> is all rather noncupatory.


> A general solution is to take each whitespace-delimited word in the
> file and do a lookup on it. This has the advantage of catching
> "255.255.255.255" as well as "abc.some.where". (Of course, you have
> to prune leading protocol selectors.)


A general solution is to take each whitespace-delimited word in the
file and check if it conforms to the rules for URLs.

You cannot assume that the user of the routine is connected to the
internet, or wants you to look up URLs on the net.
The task was to find all URLs in a file.
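
A sketch of that offline approach with the same standard tools (the pattern
below is only a rough approximation of the real URL rules, and "big.log"
again stands in for one of the files):

# split into words, then keep those that look like a URL
# (scheme:// prefix) or a bare www. host name
tr -s '[:space:]' '\n' < big.log |
  grep -E '^([A-Za-z][A-Za-z0-9+.-]*://|www\.)[^[:space:]]+$' |
  sort -u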


--
Roger J.
 
 
Steven Burn
Guest
Posts: n/a
 
      13th May 2005
"Sting4Life" <(E-Mail Removed)> wrote in message news:(E-Mail Removed)...
> I was wondering if anybody could point me towards a fast, simple and free
> method to sort through several large text files and gather all the URLs in
> them. They are older IRC logs, and I'd really like to be able to check
> out some of them, but manually sorting through 1000 individual logs that
> are over 200 MB combined is a bit too much to do by hand.
>
> So I was wondering if anybody could point me towards a more automated
> way of doing this?


I've not had IRC and don't plan on getting it, so I have no way of testing
whether or not it will work for you, but you could try Index.dat QV
(primarily intended for use with index.dat files, but it should work for
your IRC logs).

Index.dat QV
http://support.it-mate.co.uk/?mode=P...&p=index.datqv

--
Regards

Steven Burn
Ur I.T. Mate Group
www.it-mate.co.uk

Keeping it FREE!


 
 
 
 