Creating a CSV - how?

  • Thread starter John Fitzsimons

John Fitzsimons

Hi,

I know some people here are very clever with text programs and regular
expressions etc. So here is my problem. The first two items of my
input text are....

0743202236

A MIND AT A TIME - TPB

LIVINE M

$26.95

0895948575

A WOMAN'S I CHING - TPB

STEIN Diane

$29.95

NYP

I want the end result to be something like........


0743202236, A MIND AT A TIME - TPB, LIVINE M, $26.95
0895948575, A WOMAN'S I CHING - TPB, STEIN Diane, $29.95,
NYP

Some of the problems to overcome are ;

(1) Each column can be a different width.

(2) The ISBN needs to be the first item on each line.

(3) The number of columns of info can change. The first record has
four, for example, and the second five.

My original may have thousands of items so I want to be able to
automate this.

My first thought was to change all carriage returns to something like
"," or ";" BUT with InfoRapid Search & Replace I tried that and only
one replacement was done. :-( Results were ...

Searched for
$
Replaced with
;
In Files
commas.txt
In Directories
C:\
Search Options
Pattern matching
Multiline search
Use internal converters
Use external converters
Matches found
1
Files found / total
1 / 1

Don't know what I did wrong. :-(

Any suggestions regarding the above task, anyone ? Method ? Software ?

Regards, John.
 

NoHeadRequired

|
| Hi,
|
| I know some people here are very clever with text programs and regular
| expressions etc. So here is my problem. The first two items of my
| input text are....
|
<snippy>

| Any suggestions regarding the above task, anyone ? Method ? Software ?
|
| Regards, John.

As the number of columns varies, the marker would be that a new record
starts with a variable 10-digit number. Therefore I must admit I'd
probably write a quick program in Pascal or a PHP script to do this.
It then becomes relatively easy. I can provide more if you need it.

D.
 

mail

You need to decide the number of fields that you need.

You need a program to:

define the fields
read the data into the fields
then write it out to the type of file you want

If you send me your datafile I will convert it for you.

Superdee
 

REM

John Fitzsimons <[email protected]> wrote:
0743202236, A MIND AT A TIME - TPB, LIVINE M, $26.95
0895948575, A WOMAN'S I CHING - TPB, STEIN Diane, $29.95,
NYP
Some of the problems to overcome are ;
(1) Each column can be a different width.

This pretty much rules out a simple search and replace. You'll need
double commas in 1:

0743202236, A MIND AT A TIME - TPB, LIVINE M, $26.95,,

to denote the empty field. Assuming that 5 fields is the max, that is.
(2) The ISBN needs to be the first item on each line.
(3) The number of columns of info can change. The first record has
four, for example, and the second five.
My original may have thousands of items so I want to be able to
automate this.

Your original determines your success. If each line has at least the
initial 4 items and possibly a 5th it will be easy to make a program
or script to automate.

On the other hand, if some lines lack field 3, but have the 5th field
it will be difficult for a program to identify which fields are
missing and which are present. The ISBN and price fields are the only
two that can be easily tested and identified.
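Sketched as tests (hypothetical helper names; the patterns simply mirror the shapes of the sample data, including the possible 'X' at the end of an ISBN):

```python
import re

def looks_like_isbn(s):
    # 10 characters: nine digits plus a final digit or X check character
    return re.fullmatch(r'\d{9}[\dXx]', s) is not None

def looks_like_price(s):
    # a dollar sign, some digits, a dot, two digits -- e.g. $26.95
    return re.fullmatch(r'\$\d+\.\d{2}', s) is not None
```

The other fields (title, author, status codes like NYP) have no reliable shape, which is why a missing middle field is so hard to detect.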

Are you still trying to load this data into MySQL?

I'm doing a textbook bookstore project and learned the value of a good
original. I found a site giving book details in 5 line blocks of
information, so I started converting each page of 10 books to text to
use. After grabbing 100 pages I began to look over the text and place
each subject together, BIO, ACC, etc.

I found some pages inserted a 6th line for no apparent reason in some
book infos. I had to manually view and remove these; a chore.

After that it was a snap. I wrote a C++ program to read each file and
write my SQL insert statements for various tables. There were dupes I
checked for and caught, as each book can only be listed once. I ended up
easily loading 4 different book tables using the same data. This was a
foolproof method of making certain that each table field agrees with
all other table fields, as the 4 SQL scripts were written together from
the same data read.


==================================
CREATE TABLE P_BOOK_PRICE
( ISBN VARCHAR2 (13) NOT NULL,
NEW FLOAT,
USED FLOAT,
BUYBACK FLOAT,
WHOLESALE FLOAT,
PRIMARY KEY (ISBN));

INSERT INTO P_BOOK_PRICE
VALUES ('0471375942', 127.95, 101.08, 74.21, 79.33);

INSERT INTO P_BOOK_PRICE
VALUES ('0262024829', 75.00, 59.25, 43.50, 46.50);

==================================
CREATE TABLE P_AUTHORS
( ISBN VARCHAR2 (13) NOT NULL,
AUTHORNAME VARCHAR2 (100),
PRIMARY KEY (ISBN));

INSERT INTO P_AUTHORS
VALUES ('0471375942', 'Dov Fried');

INSERT INTO P_AUTHORS
VALUES ('0262024829', 'Simon Benninga');

==================================
CREATE TABLE P_TEXTBOOK
( ISBN VARCHAR2 (13) NOT NULL,
STOREID VARCHAR2 (9) NOT NULL,
VENDORID INTEGER NOT NULL,
AGENTID INTEGER NOT NULL,
PAGES INTEGER,
DATEORDERED DATE,
DATERECEIVED DATE,
TITLE VARCHAR2 (120),
MINIMUMOH INTEGER,
QUANTITYNEW INTEGER,
QUANTITYUSED INTEGER,
COURSE CHAR(3),
PRIMARY KEY (ISBN));

INSERT INTO P_TEXTBOOK
VALUES ('0471375942', 'NAC-TEX-1', 0, 0, 452, '', '', 'The Analysis and Use of Financial Statements', 107, 159, 52, 'ACC');

INSERT INTO P_TEXTBOOK
VALUES ('0262024829', 'NAC-TEX-1', 1, 1, 1419, '', '', 'Financial Modeling - 2nd Edition', 114, 103, 3, 'ACC');

==================================
CREATE TABLE P_VENDORS
( VENDORID INTEGER NOT NULL,
VENDORNAME VARCHAR2 (70),
ADDRESS VARCHAR2 (50),
FAX CHAR (12),
PHONE CHAR (12),
PRIMARY KEY (VENDORID));

INSERT INTO P_VENDORS
VALUES (0, 'Wiley', '', '983-523-3100', '983-523-3101');

INSERT INTO P_VENDORS
VALUES (1, 'The MIT Press', '', '688-608-2670', '688-608-2671');

==================================

Note: The US is moving towards joining the international community
with a 13-digit ISBN and barcode beginning at year's end. Australia
might not be far behind. Either way, you might well be faced with
13-digit ISBNs soon, so allow for this in your DB.

Attach your original to a mail. I might be able to easily alter the
program if loading a DB is still your goal and your original is pretty
solid.
 

John Fitzsimons

You need to decide the number of fields that you need.

I know that. A maximum of five.
You need a program to:
define the fields
read the data into the fields
then write it out to the type of file you want
If you send me your datafile I will convert it for you.

Thank you. I will see if there is another way first though, as I don't
want to have to bother you if I do this task again.

Regards, John.
 

John Fitzsimons

This pretty much rules out a simple search and replace. You'll need
double commas in 1:
0743202236, A MIND AT A TIME - TPB, LIVINE M, $26.95,,
to denote the empty field. Assuming that 5 fields is the max, that is.

Yes, I can see that. An empty last field shouldn't be a problem
however. What I might be doing is simply deleting column five from my
results to avoid any potential complications with empty fields.

At the moment I can get something like

;0743202236

What I need is a "search and replace" that "recognises" numbers. So
that anything

xxxxxxxxxx

is changed to a carriage return THEN xxxxxxxxxx

That would put all number strings as the first item on each line. If
something like NoteTab has a wildcard item for "numbers" that might
fix things.
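As a Python sketch of that very replacement (the semicolon-joined string below just mirrors the state described above; `\d{10}` is the "wildcard for numbers"):

```python
import re

# The ";"-joined state described above, as one long sample line.
flat = (";0743202236;A MIND AT A TIME - TPB;LIVINE M;$26.95"
        ";0895948575;A WOMAN'S I CHING - TPB;STEIN Diane;$29.95;NYP")

# Replace ";<10-digit-number>" with a newline followed by that number,
# so every number string becomes the first item on its own line.
fixed = re.sub(r';(\d{10})', r'\n\1', flat).lstrip('\n')
```

Any editor with regex support and a digit class can do the same thing; the pattern is the important part, not the tool.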
Your original determines your success. If each line has at least the
initial 4 items and possibly a 5th it will be easy to make a program
or script to automate.

That was my thinking. That's what I am after. I have worked out that
NoteTab does a nice job of putting ";" instead of a carriage return so
that is the bulk of the job done.

If I turn off line wrapping however I will end up with one line that
is thousands of words long. I don't know whether NoteTab and/or other
text editors would have problems with that.
On the other hand, if some lines lack field 3, but have the 5th field
it will be difficult for a program to identify which fields are
missing and which are present. The ISBN and price fields are the only
two that can be easily tested and identified.
Are you still trying to load this data into MySQL?

No, I have managed that already. With a different database file. My
next objective is to have multiple MySQL tables and/or databases.

At the moment I need to sort out the above to help with the second
database AND I have to work out what differences, if any, there are in
having multiple tables OR multiple databases.

Which is better ? For multiple book catalogues ? Or would both be the
same ? I don't know whether multiple connections to multiple tables is
any different (difficult ?) than multiple concurrent connections to
multiple databases.

Note: The US is moving towards joining the international community
with a 13-digit ISBN and barcode beginning at year's end. Australia
might not be far behind. Either way, you might well be faced with
13-digit ISBNs soon, so allow for this in your DB.

IIRC I am using varchar (?) so that shouldn't be a problem. I expect
that limiting the field width would be better but I haven't got to
that level of sophistication yet. Remember, I am still a MySQL
"newbie". :)

In due course I want to find out whether a field limited in length to
100 characters is much faster (if at all) to search than a field with
200 characters in. Where the data populating the field is <100
characters long.
Attach your original to a mail. I might be able to easily alter the
program if loading a DB is still your goal and your original is pretty
solid.

Thanks, but my preference is to resolve this prior to uploading it to
my server. If I can work the long line problem, and how to make the
ISBN the first item, then I will pretty much have got things on the
way to being sorted.

Regards, John.
 

John Fitzsimons

| I know some people here are very clever with text programs and regular
| expressions etc. So here is my problem. The first two items of my
| input text are....

| Any suggestions regarding the above task, anyone ? Method ? Software ?
As the number of columns varies, the marker would be that a new record
starts with a variable 10-digit number. Therefore I must admit I'd
probably write a quick program in Pascal or a PHP script to do this.
It then becomes relatively easy. I can provide more if you need it.

Thanks. I will keep that in mind. I haven't got PHP installed on my
'98 system yet so a PHP script wouldn't be much help unless I did it
from a web page. If/when I can work out the best PHP install things
might be different.

I have read a number of conflicting opinions of how best to install
PHP on a windows system and that has done nothing but confuse me.
Apparently a number of windows "configuration" files need altering to
do things correctly. Which sounded rather tedious/tiresome. Though I
did do something similar with Perl and that works fine.

Regards, John.
 

NoHeadRequired

| >As the number of columns varies, the marker would be that a new record
| >starts with a variable 10-digit number. Therefore I must admit I'd
| >probably write a quick program in Pascal or a PHP script to do this.
| >It then becomes relatively easy. I can provide more if you need it.
|
| Thanks. I will keep that in mind. I haven't got PHP installed on my
| '98 system yet so a PHP script wouldn't be much help unless I did it
| from a web page. If/when I can work out the best PHP install things
| might be different.
|
| I have read a number of conflicting opinions of how best to install
| PHP on a windows system and that has done nothing but confuse me.
| Apparently a number of windows "configuration" files need altering to
| do things correctly. Which sounded rather tedious/tiresome. Though I
| did do something similar with Perl and that works fine.
|

Get PHPTriad. It's still around; it's Apache, PHP, MySQL and other stuff
in one lump.

D.
ps. you don't need to do web stuff with PHP - it's just an interpreted language,
so you can run it from a DOS prompt ;-)
PHP.EXE myprog.php
 

Cousin Stanley

| ....
| 0743202236, A MIND AT A TIME - TPB, LIVINE M, $26.95
| 0895948575, A WOMAN'S I CHING - TPB, STEIN Diane, $29.95,
| ....

Cousin John ....

The following Python program is set up
assuming 4 field entries per book record ....

You will have to pre-edit the input file
to remove the 5th entry for each book ....

As an alternative to keep the 5th or i-th field entry,
you could add another line with a unique string
such as $$$$$ or ----- to delimit individual books ....

If you decide to go that way, let me know
and I can change the program to cope ....

Copy the code below, paste into a text editor,
and save as books_commas.py ....

Then .... python books_commas.py some_file_in.txt

The CSV output is written to the file books_out.txt
in the current working directory ....


'''
Module ....... books_commas.py

Purpose ...... Generate Comma Separated Variable Output File
from an input file with individual data items
on separate lines

Assumes an individual book record
has 4 field entries

Code_By ...... Stanley C. Kitching
Code_Date .... 2004-12-04
'''

import sys

path_in = sys.argv[ 1 ]

file_in = open( path_in , 'r' )

file_out = open( 'books_out.txt' , 'w' )

list_books = [ ]

this_list = [ ]

for this_line in file_in :

    if this_line != '\t\n' :    # skip every other blank line

        this_list.append( this_line.strip() )

        if len( this_list ) == 4 :

            list_books.append( this_list )

            this_list = [ ]

for this_book in list_books :

    file_out.write( '%s, %s, %s, %s \n' % ( tuple( this_book ) ) )

file_in.close()

file_out.close()
 

Cousin Stanley

| ....
| What I need is a "search and replace"
| that "recognises" numbers.
| ....

John ....

For search and replace in cases such as this,
instead of focusing specifically on numbers
or a specific search string, sometimes it helps
to consider beginning-of-line and end-of-line
as possible search points ....

A good text editor for this sort of stuff
is vim which has been around in unix/linux
for a long time and is also available for Windows ....

Vim uses ....

^ ....... to indicate the beginning of a line
$ ....... to indicate the end of a line

: 1,$s/^/Yo Ho Ho/ .... Adds Yo Ho Ho to beginning of all lines

: 1,$s/$/Yo Ho Ho/ .... Adds Yo Ho Ho to end of all lines
 

REM

John Fitzsimons <[email protected]> wrote:
At the moment I can get something like

What I need is a "search and replace" that "recognises" numbers. So
that anything

is changed to a carriage return THEN xxxxxxxxxx

Try PsPad. It has a similar ability and syntax to the vim substitutions
Cousin Stanley mentions.

In search and replace [0-9], beginning of line, and end of line are
all available. It has match {n} times, so you could start with the
shortest number of digits per ISBN and move on towards the max number
of digits, replacing with newline ISBN.

On the search and replace screen click ? for a quick list of
expressions.
That was my thinking. That's what I am after. I have worked out that
NoteTab does a nice job of putting ";" instead of a carriage return so
that is the bulk of the job done.
If I turn off line wrapping however I will end up with one line that
is thousands of words long. I don't know whether NoteTab and/or other
text editors would have problems with that.

Some do have problems with long lines. PsPad will handle the chore
though.
At the moment I need to sort out the above to help with the second
database AND I have to work out what differences, if any, there are in
having multiple tables OR multiple databases.

Multiple tables are used in relational databases. Each table is linked
to other tables via primary and foreign keys, such as my textbook,
book price and author tables. They are linked by the unique ISBN
value.

I'm just learning and hope I don't give any bad advice, but we are
using a single Oracle database that has different table spaces for
different projects and the same DB is used for different courses. For
instance, grad students have their own table space in the same DB I'm
working in.
Which is better ? For multiple book catalogues ? Or would both be the
same ? I don't know whether multiple connections to multiple tables is
any different (difficult ?) than multiple concurrent connections to
multiple databases.

I think a single DB and possibly a single table might work if you're
doing a book list.

It is _far_ easier to query, update, add and delete rows in a single
table. It gets dicey doing the above in a relational DB with many
tables.
IIRC I am using varchar (?) so that shouldn't be a problem. I expect
that limiting the field width would be better but I haven't got to
that level of sophistication yet. Remember, I am still a MySQL
"newbie". :)

Exactly! I used varchar (13) for both ISBNs and barcodes for retail
items. The changeover is scheduled 2005-2007 in the US and will be a
mess for years afterwards, as no one wants to get new software and the
store we interviewed is using a 1989 program and cannot locate the
author to get it tweaked. 13 characters will take you far into the
future.

Note: 'x' or 'X' is a valid ISBN end character. The final digit is
computed in order to create a self-checking value; 'x' represents a
check digit of 10.
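That self-check can be sketched in a few lines (the weights 10 down to 2 are the standard ISBN-10 rule; the sample ISBNs are the ones from this thread):

```python
def isbn10_check_digit(first9):
    """Compute the self-checking last character of a 10-digit ISBN.

    Weighted sum of the first nine digits (weights 10 down to 2),
    taken modulo 11; a result of 10 is written as 'X'.
    """
    total = sum(int(d) * w for d, w in zip(first9, range(10, 1, -1)))
    check = (11 - total % 11) % 11
    return 'X' if check == 10 else str(check)
```

Anything loading ISBNs could use a check like this to reject mis-scanned or mis-typed records before they reach the DB.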

In due course I want to find out whether a field limited in length to
100 characters is much faster (if at all) to search than a field with
200 characters in. Where the data populating the field is <100
characters long.

It is good to make fields to fit, when possible.
Thanks, but my preference is to resolve this prior to uploading it to
my server. If I can work the long line problem, and how to make the
ISBN the first item, then I will pretty much have got things on the
way to being sorted.

http://www.pspad.com/en/

PsPad is very useful. You can turn line wrapping on or off, as well as
selecting the number of characters to wrap at.
 

B. R. 'BeAr' Ederson

On Sat, 04 Dec 2004 10:15:04 +1100, John Fitzsimons wrote:

[RegEx Replace with InfoRapid S&R]
My first thought was to change all carriage returns to something like
"," or ";" BUT with InfoRapid Search & Replace I tried that and only
one replacement was done. :-( Results were ...

You didn't notice the whitespace between the carriage returns. Therefore
you only got 1 replacement. Try to replace this string:

\r\n[[:space:]]*\r\n

by a colon. You have to switch ON: pattern matching, multiline, and(!)
shortest match length.

After that you have to re-arrange the lines. Search for (same options):

(,)[[:digit:]]{10,10}

replace with:

\r\n

-> replace 1. subexpression.

That sometimes there are 4 and sometimes 5 entries shouldn't matter.
Most programs which load CSV will simply add empty last fields if no
entry can be found. The last colon of the file shouldn't be a problem,
either. Else you can delete it manually. (Or create a third Replace
by yourself... ;-) )
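The same two steps in Python's re flavour, for anyone without InfoRapid (a sketch using a comma separator, and a lookahead in place of the replace-1st-subexpression trick; `\s` covers the whitespace-filled blank lines):

```python
import re

# Raw input as described upthread: one item per line, blank lines between.
raw = ('0743202236\r\n\r\nA MIND AT A TIME - TPB\r\n\r\nLIVINE M\r\n'
       '\r\n$26.95\r\n\r\n0895948575\r\n\r\n'
       "A WOMAN'S I CHING - TPB\r\n\r\nSTEIN Diane\r\n\r\n$29.95\r\n"
       '\r\nNYP')

# Step 1: collapse each (possibly whitespace-filled) blank separator
# line into a single comma.
step1 = re.sub(r'\r\n\s*\r\n', ',', raw)

# Step 2: break the long line before each 10-digit number; the
# lookahead keeps the digits, only the comma becomes a line break.
step2 = re.sub(r',(?=\d{10})', '\r\n', step1)
```

Run on the sample above, step2 comes out as the two comma-separated records, one per ISBN.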

HTH.
BeAr
 

B. R. 'BeAr' Ederson

(,)[[:digit:]]{10,10}

As REM pointed out: the last 'digit' of the ISBN can be an 'X'. Therefore
you should replace this with:

(,)[[:digit:]]{9,9}

The replace operation should do well even when the longer ISBNs
come. The only problem could be a book title starting with a 9-digit
string... ;-)

BeAr
 

Susan Bugher

REM said:
Note: The US is moving towards joining the international community
with a 13-digit ISBN and barcode beginning at year's end. Australia
might not be far behind. Either way, you might well be faced with
13-digit ISBNs soon, so allow for this in your DB.

On Jan. 1, *2007* the *International* standard is going to change. See:

http://www.niso.org/standards/resources/ISBN.html

<q> The new 13-digit ISBN has been approved and plans are underway to
transition to the new number industry-wide, world-wide by January 1,
2007. </q>

or http://www.isbn-international.org/en/revision.html

<q>
International ISBN Agency
ISBN Standard Revision

We hope that you are aware of the current revision of the ISBN standard
(ISO 2108) and the important changes that will be made in the number
structure, scope and governance of the ISBN system.

<SNIP>

Major changes to be aware of, and addressed by the [revised] guidelines,
include the following:

# The ISBN will change from 10 to 13 digits on 1 January 2007
</q>
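Converting an existing 10-digit ISBN to the 13-digit form is mechanical, per the published scheme (a sketch: prefix 978, keep the nine data digits, and recompute the check digit with the EAN-13 alternating-weight rule):

```python
def isbn10_to_isbn13(isbn10):
    """Prefix 978, drop the old check character, recompute the new one.

    EAN-13 rule: weight the 12 digits alternately 1 and 3, sum them,
    and the check digit makes the total a multiple of 10.
    """
    core = '978' + isbn10[:9]
    total = sum(int(d) * (1 if i % 2 == 0 else 3)
                for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)
```

So a DB column sized for 13 characters covers both forms, and old 10-digit data can be migrated in one pass.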


Susan
 

John Fitzsimons

| 0743202236, A MIND AT A TIME - TPB, LIVINE M, $26.95
| 0895948575, A WOMAN'S I CHING - TPB, STEIN Diane, $29.95,
Cousin John ....

Hi Cousin Stanley,

I have at last sorted out my "input" file. It is approaching 200k
lines long.
The following Python program is set up
assuming 4 field entries per book record ....

Well, some records have 4 field entries and some have 5.
You will have to pre-edit the input file
to remove the 5th entry for each book ....

Not practical. That would mean my going through thousands of book
entries one by one and deleting the 5th entry where appropriate.
As an alternative to keep the 5th or i-th field entry,
you could add another line with a unique string
such as $$$$$ or ----- to delimit individual books ....

That wouldn't work either. For the same reason as above. AFTER I have
put each line of book info as one line per ISBN it would however be
easy. But no longer needed !
If you decide to go that way, let me know
and I can change the program to cope ....
Copy the code below, paste into a text editor,
and save as books_commas.py ....
Then .... python books_commas.py some_file_in.txt
The CSV output is written to the file books_out.txt
in the current working directory ....

Okay, going by your previous programming I would expect that to
work, and work quickly. :)
'''
Module ....... books_commas.py
Purpose ...... Generate Comma Separated Variable Output File
from an input file with individual data items
on separate lines
Assumes an individual book record
has 4 field entries
Code_By ...... Stanley C. Kitching
Code_Date .... 2004-12-04
'''
import sys
path_in = sys.argv[ 1 ]
file_in = open( path_in , 'r' )
file_out = open( 'books_out.txt' , 'w' )
list_books = [ ]
this_list = [ ]
for this_line in file_in :
    if this_line != '\t\n' :    # skip every other blank line
        this_list.append( this_line.strip() )
        if len( this_list ) == 4 :
            list_books.append( this_list )
            this_list = [ ]
for this_book in list_books :
    file_out.write( '%s, %s, %s, %s \n' % ( tuple( this_book ) ) )

file_out.close()

Currently my input data is something like ;

0876044143

EDGAR CAYCE'S DIET & RECIPE GUIDE

A.R.E. PRESS

$28.95

0876044380

EDGAR CAYCE'S EGYPT

A.R.E. PRESS

$75.00

IND

As you can see the first book is four items and the second one five
items. I HAVE been able to do the job with a NoteTab clip someone
made for me but it takes quite a while to run. At least 15 minutes.

What the clip appears to do is to remove blank lines, make the ISBN
the first item in each line, and replace the line returns with commas.

The above comes out as ;

0876044143,EDGAR CAYCE'S DIET & RECIPE GUIDE,A.R.E. PRESS,$28.95
0876044380,EDGAR CAYCE'S EGYPT,A.R.E. PRESS,$75.00,IND

Thanks for your help on this. IF you have the time perhaps you could
make your script suitable for 4 and 5 part info ? If you don't have
the time then that is fine. My existing method seems to work. Though
not overly quickly.
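One way a 4-and-5-field variant might look (a sketch, not Cousin Stanley's actual revision; it assumes every record begins with a 10-character ISBN line, the last character possibly an 'X'):

```python
import re

# 10 characters: nine digits plus a digit or X check character.
ISBN = re.compile(r'^\d{9}[\dXx]$')

def to_records(lines):
    """Group bare input lines into books; a new book begins at each
    line that looks like an ISBN, so 4- and 5-field records both work."""
    books, current = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                      # blank separator lines
        if ISBN.match(line) and current:  # next book starts here
            books.append(','.join(current))
            current = []
        current.append(line)
    if current:
        books.append(','.join(current))
    return books
```

Fed the sample above, the two records come out as the four-field and five-field lines shown; writing them to books_out.txt is then the same short loop as in the original script, and on a file of a few hundred thousand lines it should run in seconds rather than minutes.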

Thanks also to the many other people who offered help with this
exercise. The problem/suggested solutions certainly exercised my
grey cells ! :)


Regards, John.
 

B. R. 'BeAr' Ederson

You didn't notice the whitespace between the carriage returns. Therefore
you only got 1 replacement. Try to replace this string:

\r\n[[:space:]]*\r\n

by a colon. You have to switch ON: pattern matching, multiline, and(!)
^^^^^

Grmpfl. Has to be a comma of course. Sorry about this mental confusion
of mine. :-(

BeAr
 

Cousin Stanley

| I have at last sorted out my "input" file.
|
| It is approaching 200k lines long.
| ....
| ....
| Well, some records have 4 field entries and some have 5.
| ....

Cousin John ....

How do you obtain/build the input file ?

Maybe if we can back up one step
and examine the original source of the data,
there might be an easier way to delineate
the data from the original source ....

Processing a 200 KB file with this type
of Python program should be very quick,
approaching almost instant ....

I'll take another look at trying to manage
the difference in entries with either 4 or 5 fields,
where those fields are coded on separate lines
without any delimiters ....

It would be a lot easier if the data
was consistent throughout the entire input file ....
 

John Fitzsimons

On Fri, 31 Dec 2004 06:28:27 -0700, Cousin Stanley wrote:

Maybe if we can back up one step
and examine the original source of the data,
there might be an easier way to delineate
the data from the original source ....

The source is a web page. I do not at the moment have access to
the original database. The people who have the site are thinking of
releasing a CD with the database on BUT that could well be some
months away.

Regards, John.
 

Cousin Stanley

| The source is a web page.

Got a web address handy ?

I've done a bit of web scraping obtaining data from web pages
with Python and maybe if I looked over the original page,
it might be easier for me to deal with directly ....
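A crude sketch of that sort of extraction (the `<td>` cell pattern is an assumption; the real search page would dictate it, and a proper HTML parser would be safer than a regex):

```python
import re

def extract_cells(html):
    """Pull plain-text table-cell contents out of a result page.

    Crude: grabs whatever sits between <td ...> and </td>; good enough
    for a quick look at what the page actually returns.
    """
    return [m.strip() for m in re.findall(r'<td[^>]*>([^<]+)</td>', html)]
```

With the cells in hand, the earlier record-grouping logic would apply directly, skipping the copy-to-text step entirely.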
 

John Fitzsimons

| The source is a web page.
Got a web address handy ?
I've done a bit of web scraping obtaining data from web pages
with Python and maybe if I looked over the original page,
it might be easier for me to deal with directly ....

The source isn't actually on a web page. I think it is SQL. I don't
have direct access to it. A search web page gets it. Title by title or
author by author.
 
