Looking for: Program to remove duplicate lines

Ridefort

Good day,

I am looking for a program that deletes duplicate lines but does not
sort the file.

Does anyone have any pointers for me, please?

TIA
Andre Linoge
 
Ridefort

Do the duplicate lines appear together, or are they scattered all
over the file?

It's like this, see:

I have a rather large text file with a bunch of messages.

First there is a message, then beneath this message there is an exact
copy of the message above, then the next message begins with a copy
beneath it, etc. etc.

instead of :

message 1
=====
message 2
=====
message 3
=====
etc, etc

I have:

message 1
=====
copy message 1
=====
message 2
=====
copy message 2
=====
message 3
=====
copy message 3
=====

What I need is to remove the "copy message" duplicates.

That's it. :))
 
Lou

Ridefort said:
It's like this, see:

I have a rather large text file with a bunch of messages.

First there is a message, then beneath this message there is an exact
copy of the message above, then the next message begins with a copy
beneath it, etc. etc.

instead of :

message 1
=====
message 2
=====
message 3
=====
etc, etc

I have:

message 1
=====
copy message 1
=====
message 2
=====
copy message 2
=====
message 3
=====
copy message 3
=====

What I need is to remove the "copy message" duplicates.

That's it. :))

I don't know of any program that can do that, but I can tell you how to do
it in BASIC (or maybe even with a couple of freebies or utilities):

Read the file in and number all lines using a fixed number of digits (e.g.
0001 - 9999).
Sort the file on column five and up (i.e. on the original text).
Read the file in and remove duplicates, comparing column five to the end of
each line.
Re-sort the result on the line numbers.
Strip the numbers if it matters.

Yes, there are several steps, but it's a computer, so it's easy once set up
(a rough sketch of the idea follows).
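
Here is a minimal C++ sketch of that number / sort / de-dupe / re-sort idea
(not the BASIC itself; it keeps the line numbers in memory instead of writing
digits into the file, and assumes the whole file fits in RAM):

#include <algorithm>
#include <iostream>
#include <string>
#include <utility>
#include <vector>
using namespace std;

int main()
{
    // Step 1: read stdin and tag each line with its original position.
    vector<pair<string, size_t>> lines;   // (text, original line number)
    string line;
    while (getline(cin, line))
        lines.push_back({line, lines.size()});

    // Step 2: sort on the text so identical lines end up adjacent
    // (the position tag breaks ties, so the earliest copy sorts first).
    sort(lines.begin(), lines.end());

    // Step 3: drop the 2nd, 3rd, ... copy of every line.
    lines.erase(unique(lines.begin(), lines.end(),
                       [](const pair<string, size_t>& a,
                          const pair<string, size_t>& b)
                       { return a.first == b.first; }),
                lines.end());

    // Step 4: re-sort on the position tags to restore the original order.
    sort(lines.begin(), lines.end(),
         [](const pair<string, size_t>& a, const pair<string, size_t>& b)
         { return a.second < b.second; });

    // Step 5: nothing to strip, since the numbers were never written out.
    for (const auto& p : lines)
        cout << p.first << '\n';
}

Because the final sort restores the original positions, the surviving lines
come out in their original order, which is what was asked for.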

Lou
 
lisztfr

From what I just saw, Minitrue could probably do it:

http://www.idiotsdelight.net/minitrue/

========================================

Example 12 - Finding consecutive duplicate (non-empty) lines

mtr -x file.txt (^.+\r\n)\1+

The ^ matches the start of a line, .+ matches the body of lines which
contain at least one character, and \r\n matches the trailing
carriage return/newline. (Remove the \r for UNIX text files.)
This is enclosed in parentheses so the backreference \1+ will match
the parentheses' contents at least once and possibly more.

========================================


But I never tried it. Then there's gawk... ;)

Your copy of the message would really have to be exact, as it is only
a dup *lines* remover.
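
For comparison, a minimal C++ sketch of the same job -- dropping consecutive
exact duplicate lines -- without a regex (note that, unlike the .+ in the
pattern above, it also collapses runs of blank lines):

#include <iostream>
#include <string>
using namespace std;

int main()
{
    string line, previous;
    bool first = true;
    while (getline(cin, line)) {
        // Print a line only if it differs from the line directly above it.
        if (first || line != previous)
            cout << line << '\n';
        previous = line;
        first = false;
    }
}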

laurent
 
omega

I am looking for a program that deletes duplicate lines but does not
sort the file.

Does anyone have any pointers for me, please?

A console (DOS box) utility that will do it:

DeDupe (or Extract Unique Lines)
by John Augustine

DeDupe v1.2 - Removes duplicate ASCII text
lines anywhere; extracts unique lines from
two similar files. No pre-sorting req'd. File
viewer included. Optional duplicate record
file. Freeware.

Search for dedupe12.zip for the freeware version. There is also a
dedupe13.zip, but the author has not defined that release as freeware:
"This is not free software. A small donation is required -"
(Speaking of later releases of this, Google turned up a 2001
release of Augustine's DeDupe, at Simtel -- listed as shareware.)

The few times I tested this one, it did the job. It was just a PITA
for me having to interact manually with the console prompts
(version 1.2, at least, did not support running unattended from a
batch file). If this is just a quick job for you, not a routine
task, then perhaps that factor won't bother you. Another note:
being a 1997 program, it expectedly has line-length limits and
file-size limits (and, less important, you probably have to use
short 8.3 file names). So you'd have to check whether the limits
described in the DeDupe docs matter for your task.

or

There are others about, console and command based, but I'm not spotting
them easily at a casual glance. One that I have locally is RemDupli by
Bruce Desmond. Google doesn't respond to the name RemDupli, though.
Anyway, it's an old 1993 model, so it's not worth the hunt when there
are other choices.

or

Power route: SED will handle nonconsecutive duplicate lines.

Left over in my notes, from when I needed to do just that (including
trying to get some exclusions into the operation), I have pertinent
SED command strings. But I'm not sure I ought to post them blindly.
Most ideal, especially if your task has special needs, would be
for someone adequately proficient in SED's powers to key you in
on the best command for your project. (Actually, your later
followup, where you described your files, was, to tell the truth,
somewhat vague to me. I'm not sure whether that means you'd best
re-explain the files in more depth, or that I should re-read.)

or

If for some reason your project turns out to have some sort of
particularly tricky demands, then it is (g)awk that would be able
to step into the ring with the truly serious weight. There especially,
though, it'd be best to ask a pro for a custom command.

or

Swinging back in the other direction for a final candidate: a simple,
easy GUI program, "Dupli Find."

The author's site is http://www.rlvision.com . A quick Google at the
moment tells me it has now turned shareware. The one
I have on hand is Dupli Find 1.4, freeware. Perhaps someone in
ACF will know where the last freeware version (LFW) can be downloaded?
(If it is v1.4 that is the LFW, and if it's also not easily
found, then I can do an upload. The program is less than 100k, btw.)
 
lisztfr

You mean this one ?

http://stlinux.ouhk.edu.hk/mirror/nonags/textu32.html

"Dupli Find 1.4 for Win9x/NT4/Mill/Win2k

Updated: Aug 26, 2001
Homepage
Author: Dan Saeden
Description: Dupli Find is a utility that searches text files for
duplicate lines. Found lines will be reported in a separate view. It
can then remove the duplicate lines and save the file. Typical uses are
to remove duplicate email entries in online competitions or to check
serial number generators, where each serial must be unique."

but it's 700 KB

laurent h
 
omega

Ridefort said:
I have a rather large text file with a bunch of messages.

First there is a message, then beneath this message there is an exact
copy of the message above, then the next message begins with a copy
beneath it, etc. etc.

I have:

message 1
=====
copy message 1
=====
message 2
=====
copy message 2
=====
message 3
=====
copy message 3
=====

What I need is to remove the "copy message" duplicates.

It seems to me that a program that simply removed all duplicate lines would
not fit perfectly here. For instance, think of all the different messages
that happen to share the same line, such as "Hello."

What would be good is if this "text file of messages" were in some formal
format -- as far as the headers, mainly. For instance, an mbox file, or
one of the other defined types that hold a number of mail/news messages
in a combined file. Then you could look for a program that is geared
towards that type of file format.

For instance, I expect there ought to be a selection, due to mass
demand, for removing dupes from the formats that have been used by different
incarnations of Outlook/OE; or by Eudora; or -- the one I have experience
with -- mbox (someone here had provided me with a gawk script at one time
for cleaning dupes from an mbox text file of messages).

Some email progs will also remove dupes, after importing from formats
that they accept. (Don't ask me to name which have this function, and
are also freeware. If it's applicable, then we both would do better
by asking ACF....)

So, the question: is this text file in a certain predefined message format?
Are the headers laid out in a consistent format?

If, for some reason, you don't have full headers, just brief headings,
then it's harder. You would need to look for a pattern in whatever is the
common separator between messages. I remember seeing that there are
approaches with GAWK (and maybe other scripting tools) which can look at
text in blocks. That area is complex to me, but there are folks out there
who might well welcome the challenge if they were appreciated for it.

Or -- again, if the circumstance is that this is an informal text file
without the messages separated by a useful pattern of headers -- then,
well, OK, it is still a viable option to proceed with your Plan A, especially
if you want to limit the investment of time. I mean, to just decide that
you didn't need the lines that said "Hello" (etc.) duplicated in different
messages all over the file anyway.
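
For what it's worth, if the "=====" separator from the example really is
reliable, de-duplicating whole messages (blocks) rather than single lines is
not much code either. A minimal C++ sketch, assuming "=====" sits alone on a
line between messages and that the file fits in memory:

#include <iostream>
#include <set>
#include <string>
#include <vector>
using namespace std;

int main()
{
    const string separator = "=====";    // assumed message delimiter
    vector<string> block;                // lines of the message being read
    set<string> seen;                    // whole messages already printed

    auto flush = [&]() {
        // Join the current message into one string; print it only if an
        // identical message has not been printed before.
        string whole;
        for (const string& l : block)
            whole += l + '\n';
        if (!block.empty() && seen.insert(whole).second)
            cout << whole << separator << '\n';
        block.clear();
    };

    string line;
    while (getline(cin, line)) {
        if (line == separator)
            flush();                     // separator marks the end of a message
        else
            block.push_back(line);
    }
    flush();    // in case the last message has no trailing separator
}

That way, two different messages that both happen to contain a line like
"Hello" are left alone; only messages that are identical from top to bottom
get dropped.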
 
omega

"lisztfr" <[email protected]>:

[LFW of Dupli Find]
You mean this one ?

http://stlinux.ouhk.edu.hk/mirror/nonags/textu32.html

"Dupli Find 1.4 for Win9x/NT4/Mill/Win2k

Updated: Aug 26, 2001
Homepage
Author: Dan Saeden
Description: Dupli Find is a utility that searches text files for
duplicate lines. Found lines will be reported in a separate view. It
can then remove the duplicate lines and save the file. Typical uses are
to remove duplicate email entries in online competitions or to check
serial number generators, where each serial must be unique."

Thanks, yes: that's the one, and perhaps the version that is the LFW.
From that nonags page you found, one can get a URL to retrieve the
old v1.4 description page, via the archive.org cache:
http://web.archive.org/web/20031210150449/http://www.rlvision.com/~ds/dupli.html
but it's 700 KB

Yes, I see. I downloaded it, and it turns out that the link
<http://www.rlvision.com/~ds/prog/dupli.zip> is to version 3.0
of the program, shareware at that point. (Btw, part of the fat file size
for that zip is that it has a VB OCX tucked in there, Mscomctl.ocx.)

Well, in the meantime, for anyone who might want to snag a copy, here's
an upload of Dupli Find, freeware, v1.4:

http://www.redshift.com/~omega/files/duplifind/
http://www.redshift.com/~omega/files/duplifind/duplifind14.zip (24k)
 
Al Klein

I don't know of any program that can do that, but I can tell you how to do
it in BASIC (or maybe even with a couple of freebies or utilities):
Read the file in and number all lines using a fixed number of digits (e.g.
0001 - 9999).

:)

I had the same idea, but using a spreadsheet program.
 
lisztfr

Works great; I tried the 3.0 version. Minitrue is good for other
purposes; I guess one could use it as a code page changer, with its
wide-character replacement features. Awk & Co. are too hard to learn
if one doesn't need them :)

laurent h
 
David Harmon

=====
message 3
=====
copy message 3
=====

What I need is to remove the "copy message" duplicates.

Are the "messages" a single line each? Are the five = signs a
reliable separator? Is there always exactly one duplicate?
This would be a "quickie" for anybody you know who programs, but you
would have to figure out exactly what you need.

C++ version based on lines only:

#include <iostream>
#include <string>
#include <set>
using namespace std;

int main()
{
    set<string> seen;   // every distinct line printed so far
    string line;
    while (getline(cin, line))
        if (seen.find(line) == seen.end()) {
            // First occurrence: print the line and remember it.
            cout << line << '\n';
            seen.insert(line);
        }
    // Later occurrences of any line already in 'seen' are silently dropped.
}
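
(To try it: assuming the source is saved as, say, dedupe.cpp -- the name is
just an example -- something like "g++ -o dedupe dedupe.cpp" and then
"dedupe < infile.txt > outfile.txt" should do it. Note that it removes
duplicates anywhere in the file and keeps the first occurrence, so with the
layout shown above every "=====" separator after the first one will
disappear as well.)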
 
Lou

Al said:
:)

I had the same idea, but using a spreadsheet program.

Sounds like that would work too, if the file could be read in.

Two great minds at work!

Lou
The principal mark of genius is not perfection but originality.
..
 
Lou

lisztfr said:
Works great; I tried the 3.0 version. Minitrue is good for other
purposes; I guess one could use it as a code page changer, with its
wide-character replacement features. Awk & Co. are too hard to learn
if one doesn't need them :)

laurent h

Who are you talking to? What did they say?

This is a usenet discussion group, not a message board on a website.

Please follow the generally-accepted usenet convention of quoting at
least part of whatever message you're replying to.

Newsgroup propagation being what it is, some folks will see your response
before they see the post on which you're commenting.
Others, in fact, may NEVER see the original post, or may not have bothered
reading it.

Another problem is that given the way threads drift, a subject line
may have little or nothing to do with what the current discussion
involves.

When you fail to cite a bit of the original message, your own
comments just hang out in the air, connected to nothing and making little
or no sense.

If you are posting through google-groups, you can quote properly by using
the "options" selection and NOT the "reply" button, as shown here:
http://www.safalra.com/special/googlegroupsreply/

For a comprehensive, net-wide FAQ, check this site:
http://www.netmeister.org/news/learn2quote.html
 
Adrian Carter

Good day,

I am looking for a program that deletes duplicate lines but does not
sort the file.

Does anyone have any pointers for me, please?

TIA
Andre Linoge

I have a program that places an icon in the Windows system tray, and removes
duplicate lines from text in the clipboard whenever you click on the icon.
Today I added to it the capability to preserve the original order of the
lines (apart from removing every 2nd, 3rd, 4th occurrence etc. of any
duplicate line). I have not made this available before; I just uploaded it
today. It is called "Unduplicate", and you can download it from here:
http://www.homestead.com/adriancarter/Index.html

There seems to be some problem with my website's page authoring tool, so the
text on the page does not do justice to the program. Here is most of the
ReadMe.Txt file from the package:
==============================================
This is a small tray utility for removing duplicate lines of text from
data held in the clipboard. When run, it first displays a small window
that allows you to control a few options affecting program operation.
When the options are set to your liking, you can hide the small window
leaving the icon in the system tray, by selecting the "Continue" button.
The "Close" button shuts the program down.

Your data may or may not contain a heading in its first line. The top
three options are radio buttons that let you either protect the heading
from being sorted (and possibly removed), force the heading to be removed,
or say there is no heading.

It may be important that the original order of the data is preserved,
apart from removal of duplicates (the first occurrence of a set of
duplicates is always kept, 2nd, 3rd occurrences etc are deleted). The
checkbox labelled "Maintain original order" is used for this purpose
(defaulted to YES).

When you click on the tray icon with the LEFT mouse button, duplicate
lines are removed from the clipboard in accordance with the option chosen.
If you click with the RIGHT mouse button, the small window is shown again,
in case you want to modify the options or close down.

This tool was built using Borland C++ Builder (Version 6). I have included
all source code and project files, in case anyone wants to customise or
extend it. The file Unit1_DFM.TXT is a text version of the main (only)
form. It will be useful for anyone who wants to rebuild the project with
a version of C++ Builder (or Delphi) earlier than version 6.
==============================================

Adrian Carter

Email: adrian_carterau (AT) yahoo (DOT) com (DOT) au
(remove the underscore & change the other obvious things)
 
omega

lisztfr said:
Minitrue is good for other purposes;
I guess one could use it as a code page changer, with its wide-character
replacement features.

In fact, the most recent time I called Minitrue into battle, it was for
a need basically related to that. It was to clean out some garbage from a
saved .mbx file. I used a string of different char codes representing the
garbage to be removed. It did the job neatly and quickly.
Awk & Co. are too hard to learn if one doesn't need them :)

I'd like to hope that one day I will feel developed enough to take on
the learning of a little Awk scripting. But that's for sure only on my
"mañana" horizon. (And I'd start Chapter One of Awk only after first fueling
up with a mega-dose of ginkgo and fish and whatever else it is that's
supposed to tune the neurons up into higher gear. Hmm, well, maybe it's
a good espresso machine that would make for the key helper.)
 
