Collecting URLs from large files?

Discussion in 'Freeware' started by Sting4Life, May 12, 2005.

  1. Sting4Life

    Sting4Life Guest

    I was wondering if anybody could point me towards a fast, simple and free
    method to sort through several large text files and gather all the URLs in
    them. They are older IRC logs, and I'd really like to be able to check
    out some of them; however, manually sorting through 1000 individual logs that
    are over 200 MB combined is a bit too much to do by hand.

    So I was wondering if anybody could point me towards a more automated
    way of doing this?


    Thank you
     
    Sting4Life, May 12, 2005
    #1

  2. Sting4Life wrote:

    > I was wondering if anybody could point me towards a fast, simple and free
    > method to sort through several large text files and gather all the URLs in
    > them. They are older IRC logs, and I'd really like to be able to check
    > out some of them; however, manually sorting through 1000 individual logs that
    > are over 200 MB combined is a bit too much to do by hand.
    >
    > So I was wondering if anybody could point me towards a more automated
    > way of doing this?
    >

    grep "http:" *.log | sort -u | less -S

    Cheers,
    Gary B-)

    --
    ______________________________________________________________________________
    Armful of chairs: Something some people would not know
    whether you were up them with or not
    - Barry Humphries
     
    Gary R. Schmidt, May 12, 2005
    #2

  3. Sting4Life

    Sting4Life Guest

    In article <fVxge.11790$>,
    says...
    > Sting4Life wrote:
    >
    > > I was wondering if anybody could point me towards a fast, simple and free
    > > method to sort through several large text files and gather all the URLs in
    > > them. They are older IRC logs, and I'd really like to be able to check
    > > out some of them; however, manually sorting through 1000 individual logs that
    > > are over 200 MB combined is a bit too much to do by hand.
    > >
    > > So I was wondering if anybody could point me towards a more automated
    > > way of doing this?
    > >

    > grep "http:" *.log | sort -u | less -S
    >
    > Cheers,
    > Gary B-)
    >
    >

    I do not understand what you are talking about.

    I am not a Linux user, but I believe I've heard something about a grep
    command from it. If that's what you're referring to, it won't help me,
    being a Win9x user.

    But thanks for the reply
     
    Sting4Life, May 12, 2005
    #3
  4. Sting4Life wrote:
    >>grep "http:" *.log | sort -u | less -S
    >>
    >> Cheers,
    >> Gary B-)
    >>
    >>

    >
    > I do not understand what you are talking about.
    >
    > I am not a Linux user, but I believe I've heard something about a grep
    > command from it. If that's what you're referring to, it won't help me,
    > being a Win9x user.
    >


    There are some free UNIX-like environments, such as Cygwin or Microsoft's
    Services for Unix, that will enable you to run all this powerful stuff on
    Windows machines.

    Here is some grep for Windows as a single app:
    http://dotnet.jku.at/applications/Grep/

    Regards
    Gerhard
     
    Gerhard Hofmann, May 12, 2005
    #4
  5. Sting4Life wrote:
    > In article <fVxge.11790$>,
    > says...
    >
    >>Sting4Life wrote:
    >>
    >>
    >>>I was wondering if anybody could point me towards a fast, simple and free
    >>>method to sort through several large text files and gather all the URLs in
    >>>them. They are older IRC logs, and I'd really like to be able to check
    >>>out some of them; however, manually sorting through 1000 individual logs that
    >>>are over 200 MB combined is a bit too much to do by hand.
    >>>
    >>>So I was wondering if anybody could point me towards a more automated
    >>>way of doing this?
    >>>

    >>
    >>grep "http:" *.log | sort -u | less -S
    >>
    >> Cheers,
    >> Gary B-)
    >>
    >>

    >
    > I do not understand what you are talking about.
    >
    > I am not a Linux user, but I believe I've heard something about a grep
    > command from it. If that's what you're referring to, it won't help me,
    > being a Win9x user.
    >
    > But thanks for the reply

    Nothing to do with Linux; they are _just_ commands.

    I've had them on machines ranging from VAXen to 16-bit MS-DOS machines
    and all points in between, and not all of them ran UNIX (or Linux).

    grep is a file-searching tool; the name stands for Global Regular
    Expression Print. Google for a version.

    sort is a "usable" sort program, the "-u" option means "only keep a
    single copy of multiple identical lines". Again, google for a version,
    or just learn how to use the one on your machine.

    less is a display program, like "more".

    And as you did not specify which OS you were using, the most portable
    answer I knew of was appropriate.
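
    The same idea works on Windows too, assuming something like Cygwin or a
    native Win32 port of grep and sort is installed; from such a shell you
    could, for example, write the result to a file instead of paging it:

    # every line that mentions "http:", without the filename prefixes,
    # one copy of each distinct line, saved to a file for later browsing
    grep -h "http:" *.log | sort -u > urls.txt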

    Cheers,
    Gary B-)

    --
    ______________________________________________________________________________
    Armful of chairs: Something some people would not know
    whether you were up them with or not
    - Barry Humphries
     
    Gary R. Schmidt, May 12, 2005
    #5
  6. Gary R. Schmidt wrote:

    > > > grep "http:" *.log | sort -u | less -S


    > And as you did not specify which OS you were using, the most portable
    > answer I knew of was appropriate.


    The expression above looks like it will find all occurrences of
    something following "http:".

    That will not catch URLs of the form www.google.com, for example.

    How would you write the expression to really catch all URLs?


    --
    Roger J.
     
    Roger Johansson, May 13, 2005
    #6
  7. Roger Johansson wrote:

    > Gary R. Schmidt wrote:
    >
    >
    >>>>grep "http:" *.log | sort -u | less -S

    >
    >
    >>And as you did not specify which OS you were using, the most portable
    >>answer I knew of was appropriate.

    >
    >
    > The expression above looks like it will find all occurrences of
    > something following "http:".

    No, it finds all (unique) lines _containing_ "http:"

    > That will not catch URLs of the form www.google.com, for example.
    >
    > How would you write the expression to really catch all URLs?
    >

    The OP said URLs, a domain name is _not_ a URL, as a URL requires the
    leading protocol selector. (So, I left out "nntp:", "ftp:", etcetera,
    etcetera, etcetera.)

    Of course, without knowing the format of the files being searched, it is
    all rather noncupatory.

    A general solution is to take each white-space delineated word in the
    file and do a lookup on it. This has the advantage of catching
    "255.255.255.255" as well as "abc.some.where". (of course, you have to
    prune leading protocol selectors)
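
    As a rough sketch of a wider net, assuming GNU grep (for the -E and -o
    options) and making no attempt at full URL-spec compliance, something
    like this pulls out scheme-prefixed URLs and bare "www." hosts as whole
    strings rather than whole lines:

    # print only the matching text: a known scheme followed by "://" and
    # non-blank characters, or a bare "www." host, then de-duplicate
    grep -Eoh "(https?|ftp|nntp)://[^ \"<>]+|www\.[^ \"<>]+" *.log | sort -u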

    Cheers,
    Gary B-)

    --
    ______________________________________________________________________________
    Armful of chairs: Something some people would not know
    whether you were up them with or not
    - Barry Humphries
     
    Gary R. Schmidt, May 13, 2005
    #7
  8. Sparky

    Sparky Guest

    Gary R. Schmidt wrote:
    >...noncupatory.


    Great write-up for grep, et al. But, stud, what on *earth* does
    noncupatory mean?

    nonplussed,

    -Sparky
     
    Sparky, May 13, 2005
    #8
  9. Gary R. Schmidt wrote:

    > > The expression above looks like it will find all occurrences of
    > > something following "http:".


    > No, it finds all (unique) lines containing "http:"


    If one URL is written on more than one line, your expression will cut
    off the part which is not on the same line as "http:", won't it?
    That will result in a lot of cut-off partial URLs.

    > > That will not catch URLs of the form www.google.com, for example.
    > > How would you write the expression to really catch all URLs?


    > The OP said URLs, a domain name is not a URL, as a URL requires the
    > leading protocol selector.


    No, it doesn't. Your search expression will miss all URLs which are
    not prefixed by http:, and that is a lot of URLs.

    www.google.com, for example.

    > (So, I left out "nntp:", "ftp:",
    > etcetera, etcetera, etcetera.)


    > Of course, without knowing the format of the files being searched, it
    > is all rather noncupatory.


    > A general solution is to take each white-space delineated word in the
    > file and do a lookup on it. This has the advantage of catching
    > "255.255.255.255" as well as "abc.some.where". (of course, you have
    > to prune leading protocol selectors)


    A general solution is to take each white-space delineated word in the
    file and check if it conforms to the rules for URLs.

    You cannot assume that the user of the routine is connected to the
    internet, or wants you to check up URLs on the net.
    The task was to find all URLs in a file.
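
    A crude sketch of that, using the usual tr/grep/sort tools and only a
    very loose "contains a dot before its last character" test in place of
    the real URL grammar, might look like this:

    # split all the logs into one word per line, keep words with a dot that
    # is followed by at least one more character, and throw away duplicates
    cat *.log | tr -s " \t" "\n" | grep "\.." | sort -u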


    --
    Roger J.
     
    Roger Johansson, May 13, 2005
    #9
  10. Steven Burn

    Steven Burn Guest

    "Sting4Life" <> wrote in message news:...
    > I was wondering if anybody could point me towards a fast, simple and free
    > method to sort through several large text files and gather all the URLs in
    > them. They are older IRC logs, and I'd really like to be able to check
    > out some of them; however, manually sorting through 1000 individual logs that
    > are over 200 MB combined is a bit too much to do by hand.
    >
    > So I was wondering if anybody could point me towards a more automated
    > way of doing this?


    I've not had IRC and don't plan on getting it, so I have no way of testing
    whether or not it will work for you, but you could try Index.dat QV
    (primarily intended for use with index.dat files, but it should work for
    your IRC logs).

    Index.dat QV
    http://support.it-mate.co.uk/?mode=Products&p=index.datqv

    --
    Regards

    Steven Burn
    Ur I.T. Mate Group
    www.it-mate.co.uk

    Keeping it FREE!
     
    Steven Burn, May 13, 2005
    #10
  11. spoon2001

    spoon2001 Guest

    Sting4Life wrote:
    > I was wondering if anybody could point me towards a fast, simple and
    > free method to sort through several large text files and gather all the
    > URLs in them. They are older IRC logs, and I'd really like to be able
    > to check out some of them; however, manually sorting through 1000
    > individual logs that are over 200 MB combined is a bit too much to do
    > by hand.
    >
    > So I was wondering if anybody could point me towards a more automated
    > way of doing this?
    >
    >
    > Thank you


    VisitURL
    http://www.tranglos.com/free/visiturl.html

    Features: Quick overview

    Features are legion. The program performs a relatively simple task,
    but it aims to do it well, which includes flexibility. Here are some of the
    highlights.

    a. URLs may be added one or several at a time, from clipboard or manually.
    b. Automatic, configurable, semi-intelligent scanning for URL descriptions
       when adding URLs from clipboard or importing from files.
    c. Configurable HTML output: you may choose how the list of collected URLs
       is presented in your browser.
    d. Clipboard monitoring for automatic collection of URLs (just copy the
       URL to clipboard and Visit will add it to its list).
    e. User-configurable list of URL schemes ('ftp://', 'http://', etc.) to
       watch for. Ability to extend the default list of schemes.
    f. Configurable import of URLs from plain text files, HTML documents and
       Internet Explorer "Favorites" (i.e. Internet Shortcuts). For instance,
       you may choose to only import mailto links, only links to '.zip' files,
       or only URLs beginning with 'http://' - or any combination thereof.
       The 'Import' feature itself is the most powerful I've seen in
       applications of this type.
    g. URLs may be copied to clipboard in several ways: for instance, you may
       simply copy the Internet address, or you may copy the URL and its
       associated description as HTML code, ready to be pasted into a Web
       document you are creating.
    h. 'Find' feature (search for text in URLs or URL descriptions).
    i. Option to live in system tray.
    j. Export URLs to a plain text file, as Explorer's 'favorites', to a
       delimited file (*.CSV) or to Netscape Navigator's bookmarks.
    k. All functions available from keyboard.
    l. Configurable fonts and colors; configurable default actions (for
       instance, you may choose what action the program will perform when you
       click the tray icon). One-click or single-keypress operation for all
       frequently used features.
    m. URLs may be sorted by net address, by description or by date added.
    n. A user-defined template file may be used to format the HTML output.
    o. Configurable activation hotkey to bring the program window to front
       after it was minimized to the system tray.
    p. Automatic installer and uninstaller.
     
    spoon2001, May 15, 2005
    #11
  12. Mike Bourke

    Mike Bourke Guest

    Just copy the content and paste it as text in an email to yourself. Outlook
    Express etc. will recognise the hyperlinks for what they are and turn them
    into clickable links.

    Mike B
     
    Mike Bourke, May 15, 2005
    #12
  13. Roger Johansson wrote:
    [SNIP]
    > A general solution is to take each white-space delineated word in the
    > file and check if it conforms to the rules for URLs.


    Do you know what a URL looks like?

    It is [protocol_selector://]address[:port][lots of other stuff...].

    This means that "a" _is_ a valid URL (as is "1"). (I don't have any
    machines named "a" on my LAN; the shortest name is 5 letters, but if I
    type it into my browser, it goes there.)
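
    Spelled out on a made-up example (the host name below is invented, not
    taken from anyone's logs), the pieces line up like this:

    # http://irc.example.net:6667/logs/today.txt?raw=1
    # scheme : http
    # host   : irc.example.net
    # port   : 6667
    # path   : /logs/today.txt
    # query  : raw=1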

    > You cannot assume that the user of the routine is connected to the
    > internet, or wants you to check up URLs on the net.
    > The task was to find all URLs in a file.

    To find all URLs in a file, just spit out _every_ space-delimited text
    sequence; IOW, just list the file.

    You have to look them up to check for validity.

    Cheers,
    Gary B-)

    --
    ______________________________________________________________________________
    Armful of chairs: Something some people would not know
    whether you were up them with or not
    - Barry Humphries
     
    Gary R. Schmidt, May 17, 2005
    #13
  14. Gary R. Schmidt wrote:

    > > A general solution is to take each white-space delineated word in
    > > the file and check if it conforms to the rules for URLs.


    > Do you know what a URL looks like?


    Yes, it is an address of a resource on the internet.

    www.google.com is such an URL.

    I use metapad instead of notepad, among other things because it knows
    what an URL looks like, so it makes it into a clickable link.

    If I wanted a way to extract URLs from big text files I would want it
    to work at least as well as it does in metapad.

    A text string containing a period is an URL, except when the dot is the
    last character.



    --
    Roger J.
     
    Roger Johansson, May 17, 2005
    #14
  15. "Roger Johansson" <> wrote in
    <news:>:

    >> Do you know what a URL looks like?

    >
    > Yes, it is an address of a resource on the internet.
    >
    > www.google.com is such an URL.


    It's not a URL; it's a domain name. There really is a difference.

    > I use metapad instead of notepad, among other things because it
    > knows what an URL looks like, so it makes it into a clickable
    > link.
    >
    > If I wanted a way to extract URL's from big text files I would
    > want it to work at least as good as in metapad.
    >
    > A text string containing a period is an URL, except when the dot
    > is the last character.


    URLs are much more complicated than that, though it may look simple when
    metapad or some other app turns domain names into clickable links. If
    you are really interested, here's a page on regexes to extract URLs
    from a mail archive:

    <http://www.truerwords.net/articles/ut/urlactivation.html>

    --
    »Q«
     
    »Q«, May 17, 2005
    #15
  16. Roger Johansson wrote:
    [SNIP]
    >
    > A text string containing a period is an URL, except when the dot is the
    > last character.

    No it isn't. A sequence of characters that (potentially) resolves to a
    machine address is a minimal URL.

    Simple examples:
    "localhost" is a URL
    "127.0.0.1" is a URL
    "www.google.com" is a URL
    "google.com" is a URL
    "google" is a URL (admittedly iff your domain is .com)

    "http://www.camerafarm.com.au/camera/customer/product.php?productid=835&cat=304&page=1"
    is a URL
    "goinker.com" is potentially a URL (at the time of writing, it wasn't)

    Cheers,
    Gary B-)

    --
    ______________________________________________________________________________
    Armful of chairs: Something some people would not know
    whether you were up them with or not
    - Barry Humphries
     
    Gary R. Schmidt, May 17, 2005
    #16
  17. »Q« wrote:

    > > A text string containing a period is an URL, except when the dot
    > > is the last character.


    > URLs are much more complicated to that, though it may look simple when
    > metapad or some other app turns domain names into clickable links. If
    > you are really interested, here's a page on regexes to extract URLs
    > from a mail archive:


    > <http://www.truerwords.net/articles/ut/urlactivation.html>


    Look at example expression 2, with this text under it:

    "Stricter compliance to the URL specification, but.."

    Note that this "Stricter compliance" obviously means that the prefixes
    in the first example are gone.
    Prefixes: http, ftp, mailto, news, etc.

    So, what does this say about the specification of URLs?

    Why do you refer to a web document which supports my view?


    --
    Roger J.
     
    Roger Johansson, May 17, 2005
    #17
  18. Gary R. Schmidt wrote:

    > Simple examples:
    > "localhost" is a URL
    > "127.0.0.1" is a URL
    > "www.google.com" is a URL
    > "google.com" is a URL


    Yes, there is no need for a prefix like http



    --
    Roger J.
     
    Roger Johansson, May 17, 2005
    #18
  19. "Roger Johansson" <> wrote in
    <news:>:

    > »Q« wrote:
    >
    >> <http://www.truerwords.net/articles/ut/urlactivation.html>

    >
    > Look at example expression 2, with this text under it:
    >
    > "Stricter compliance to the URL specification, but.."
    >
    > Note that this "Stricter compliance" obviously means that the
    > prefixes in the first example are gone.
    > Prefixes: http, ftp, mailto, news, etc.


    I had already read the page.

    > So, what does this say about the specification of URLs?


    Nothing. It just has some attempts to capture strings you might want
    to turn into clickable links. If you want the spec, see RFC 2396.

    > Why do you refer to a web document which supports my view?


    I thought you were asking questions about how to capture such
    things as domain names and URLs using regular expressions, so I
    thought it might help you; I didn't realize you were just arguing a
    claim for its own sake.

    --
    »Q«
     
    »Q«, May 17, 2005
    #19
  20. Roger Johansson wrote:

    > Gary R. Schmidt wrote:
    >
    >
    >>Simple examples:
    >> "localhost" is a URL
    >> "127.0.0.1" is a URL
    >> "www.google.com" is a URL
    >> "google.com" is a URL

    >
    >
    > Yes, there is no need for a prefix like http
    >

    Agreed, but you have totally ignored (and snipped without indication)
    the actual point of my message - you have to do something with a word in
    a file to check whether or not it _is_ a valid URL, because just about
    _any_ sequence of characters may be a valid URL, and there is _no_
    requirement for there to be a dot in it.

    Cheers,
    Gary B-)

    --
    ______________________________________________________________________________
    Armful of chairs: Something some people would not know
    whether you were up them with or not
    - Barry Humphries
     
    Gary R. Schmidt, May 17, 2005
    #20