Parsing out lines of text

Fred Boer · Aug 17, 2006

Hello:

Suppose you have a string of text like this:

650 0 $a Authors, American
650 0 $a Frontier and pioneer life
650 0 $a Children's stories
650 1 $a Authors, American.
650 1 $a Frontier and pioneer life.

There is a crlf at the end of all lines but the last. I want to extract each
line separately. I'd love some help as I've been trying different things
with no success...

Thanks!
Fred Boer

Fred Boer · Aug 17, 2006

My apologies: the string of text should have been:

650 0 $a Authors, American $y 20th century $v Biography $v Juvenile literature.
650 0 $a Frontier and pioneer life $z United States $v Juvenile literature.
650 0 $a Children's stories $x Authorship $v Juvenile literature.
650 1 $a Authors, American.
650 1 $a Frontier and pioneer life.

I might also add that each line will consistently begin with "650" and have
"$a" in the same position in each line...

Thanks!
Fred

Allen Browne · Aug 17, 2006

Fred, could you use the Split() function to parse the string into an array
at the CrLf?

This kind of thing:
Dim varArray As Variant
Dim lngI As Long
'One array element per line of strInput
varArray = Split(strInput, vbCrLf)
For lngI = LBound(varArray) To UBound(varArray)
    Debug.Print varArray(lngI)
Next

Assumes Access 2000 or later.
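
For anyone who wants to try this from the Immediate pane, here is a minimal sketch wrapping the same Split() call in a function. The names SplitLines and DemoSplitLines are made up for illustration, and the sample data is taken from the first post; it assumes the lines really are separated by CrLf and not by a lone line feed.

```vba
' Hypothetical helper (name invented for this example):
' returns the lines of strInput as a zero-based array of strings.
Public Function SplitLines(ByVal strInput As String) As Variant
    SplitLines = Split(strInput, vbCrLf)
End Function

Public Sub DemoSplitLines()
    Dim strInput As String
    Dim varLines As Variant
    Dim lngI As Long

    ' Sample data from the thread: CrLf after every line except the last.
    strInput = "650 1 $a Authors, American." & vbCrLf & _
               "650 1 $a Frontier and pioneer life."

    varLines = SplitLines(strInput)
    For lngI = LBound(varLines) To UBound(varLines)
        Debug.Print lngI & ": " & varLines(lngI)
    Next lngI
End Sub
```

Because the last line has no trailing CrLf, the array holds exactly one element per line. If the string did end with a CrLf, Split() would return one extra empty element at the end, which the loop would need to skip.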

John Nurick · Aug 17, 2006

Hi Fred,

I'd use the Split() function to parse the lines into an array:

Dim Lines As Variant, j As Long

Lines = Split(TheString, vbCrLf)
For j = 0 To UBound(Lines)
    'do something with Lines(j)
Next

Then I'd probably use a regular expression to parse the fields out of
each line.
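
As a rough sketch of how the two steps might fit together (Split() for the lines, then a regular expression per line), the code below pulls the three-digit tag and the $a text out of each line. It is only an illustration: ParseTagAndA and DemoParseLines are invented names, the pattern is a guess based on the sample lines in this thread, and it assumes Windows with the VBScript.RegExp object available through late binding.

```vba
' Hypothetical helper (not from the thread): returns Array(tag, $a text)
' for one rendered line, or Null when the line does not match.
Public Function ParseTagAndA(ByVal strLine As String) As Variant
    Dim rgx As Object
    Dim colMatches As Object

    Set rgx = CreateObject("VBScript.RegExp")
    ' Three digits at the start of the line, anything up to "$a",
    ' then the text up to the next $ : / or . character.
    rgx.Pattern = "^(\d{3})[^$]*\$a +([^$:/.]+)"

    Set colMatches = rgx.Execute(strLine)
    If colMatches.Count = 0 Then
        ParseTagAndA = Null
    Else
        ' Trim$ removes the trailing space the capture can pick up.
        ParseTagAndA = Array(colMatches(0).SubMatches(0), _
                             Trim$(colMatches(0).SubMatches(1)))
    End If
End Function

Public Sub DemoParseLines()
    Dim strInput As String
    Dim varLines As Variant
    Dim varParts As Variant
    Dim lngI As Long

    strInput = "650 0 $a Children's stories $x Authorship $v Juvenile literature." & vbCrLf & _
               "650 1 $a Authors, American."

    varLines = Split(strInput, vbCrLf)
    For lngI = LBound(varLines) To UBound(varLines)
        varParts = ParseTagAndA(varLines(lngI))
        If Not IsNull(varParts) Then
            Debug.Print varParts(0) & " -> " & varParts(1)
        End If
    Next lngI
End Sub
```

Returning Null for a non-matching line lets the caller decide whether to skip it or report it.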

Fred Boer · Aug 17, 2006

Dear Allen and John:

I firmly believe in trying to solve things myself before asking for help,
but in this case maybe I shouldn't have; I spent hours trying to solve this,
and using the Split() function solved my problem in less than 60 seconds!
Perfect!

Obviously, I didn't know about Split(), and, John, I know next to nothing
about regular expressions. I will be doing some research, but if I might
try your patience...

For the last few months I've been working hard on creating a process to
download library cataloguing records for use within my library application
from places such as the Library of Congress. Examples of a "Raw" MARC
cataloguing record and a "Rendered" one (a more human-readable format) are
included below. Each line in the "Rendered" example is a separate "Tag"
field. Looking at "Tag Field" 245, as an example, would a regular expression
be a good way to extract the three separate "chunks" of text from that
single line? That is to say, to take "245 10 $a Laura Ingalls Wilder : $b
a biography / $c by William Anderson." and return "Laura Ingalls Wilder",
"a biography" and "William Anderson", removing unnecessary text and
punctuation?

Thanks so much to both of you!
Fred

Raw MARC Record

01527cam 2200373 a
4500001000800000005001700008008004100025035002100066906004500087955013200132010001700264020003900281020002700320040001800347042000900365043001200374050002700386082001900413100003000432245006400462250001200526260004200538300002100580500002000601520012000621600006000741650006900801650006700870650005700937600003900994650002301033650003101056991006601087107141420041029092602.0910828s1992
nyu j 001 0beng 9(DLC) 91033805
a7bcbccorignewd1eocipf19gy-gencatlg apc18 to bg00 08-28-91; bg15
to SCD 08-28-91; fd11 08-29-91; fa00 08-29-91; fa03 08-30-91; fq28 09-05-91;
CIP ver. lb02 01-28-93 a 91033805 a0060201134 :c$16.00 ($21.50
Can.) a0060201142 (lib. bdg.) aDLCcDLCdDLC alcac
an-us---00aPS3545.I342bZ555 199200a813/.52aB2201 aAnderson,
William,d1952-10aLaura Ingalls Wilder :ba biography /cby William
Anderson. a1st ed. aNew York, NY :bHarperCollins,c1992. a240 p.
;b22 cm. aIncludes index. aA biography of the writer whose pioneer
life on the American prairie became the basis for her "Little House"
books.10aWilder, Laura Ingalls,d1867-1957vJuvenile literature.
0aAuthors, Americany20th centuryvBiographyvJuvenile literature.
0aFrontier and pioneer lifezUnited StatesvJuvenile literature.
0aChildren's storiesxAuthorshipvJuvenile literature.11aWilder, Laura
Ingalls,d1867-1957. 1aAuthors, American. 1aFrontier and pioneer life.
bc-GenCollhPS3545.I342iZ555 1992p00001519748tCopy 1wBOOKS

"Rendered" MARC Record

001 1071414
005 20041029092602.0
008 910828s1992 nyu j 001 0beng
035 $9 (DLC) 91033805
906 $a 7 $b cbc $c orignew $d 1 $e ocip $f 19 $g y-gencatlg
955 $a pc18 to bg00 08-28-91; bg15 to SCD 08-28-91; fd11 08-29-91; fa00 08-29-91; fa03 08-30-91; fq28 09-05-91; CIP ver. lb02 01-28-93
010 $a 91033805
020 $a 0060201134 : $c $16.00 ($21.50 Can.)
020 $a 0060201142 (lib. bdg.)
040 $a DLC $c DLC $d DLC
042 $a lcac
043 $a n-us---
050 00 $a PS3545.I342 $b Z555 1992
082 00 $a 813/.52 $a B $2 20
100 1 $a Anderson, William, $d 1952-
245 10 $a Laura Ingalls Wilder : $b a biography / $c by William Anderson.
250 $a 1st ed.
260 $a New York, NY : $b HarperCollins, $c 1992.
300 $a 240 p. ; $b 22 cm.
500 $a Includes index.
520 $a A biography of the writer whose pioneer life on the American prairie became the basis for her "Little House" books.
600 10 $a Wilder, Laura Ingalls, $d 1867-1957 $v Juvenile literature.
650 0 $a Authors, American $y 20th century $v Biography $v Juvenile literature.
650 0 $a Frontier and pioneer life $z United States $v Juvenile literature.
650 0 $a Children's stories $x Authorship $v Juvenile literature.
600 11 $a Wilder, Laura Ingalls, $d 1867-1957.
650 1 $a Authors, American.
650 1 $a Frontier and pioneer life.
991 $b c-GenColl $h PS3545.I342 $i Z555 1992 $p 00001519748 $t Copy 1 $w BOOKS

John Nurick · Aug 17, 2006

Hi Fred,

Looking at "Tag Field" 245, as an example, would a regular expression
be a good way to extract the three separate "chunks" of text from that
single line? That is to say, to take "245 10 $a Laura Ingalls Wilder : $b
a biography / $c by William Anderson." and return "Laura Ingalls Wilder",
"a biography" and "William Anderson", removing unnecessary text and
punctuation?

Absolutely. A simple way to try it for yourself is to grab the
rgxExtract() function from
http://www.j.nurick.dial.pipex.com/Code/vbRegex/rgxExtract.htm
and paste it into a module.

Get a test string into a variable S by typing this into the Immediate
pane (or pasting it from your sample file) and hitting Enter:

S = "245 10 $a Laura Ingalls Wilder : $b a biography / $c by William Anderson."

Then type this and hit Enter:
?rgxExtract(S, "\$a +([^$:/.]+) *[$:/.]")

You should get
Laura Ingalls Wilder

Change the \$a to \$b and you should get
a biography
Change the \$b to \$c and you should get
by William Anderson

I don't know the rules for rendered MARC records and have probably
missed something. Here's how the regular expression

\$a +([^$:/.]+) *[$:/.]

works:

\$          matches the dollar sign ($ without the backslash has a special
            meaning).

\$a         matches "$a".

 +          (<space>+) matches one or more spaces.

(           is a signal to start capturing the next characters matched;
            this is the substring that rgxExtract will return.

[^$:/.]     defines a character class consisting of all characters *except*
            $:/. (these appear to be the characters that mark the end of
            the substrings we're interested in, but it may be necessary to
            exclude more characters).

[^$:/.]+    matches one or more characters in the defined class (e.g.
            "Laura Ingalls Wilder").

)           terminates the capturing process.

 *          (<space>*) matches zero or more spaces.

[$:/.]      matches one of our terminating characters.

This will almost certainly need fine-tuning, but IMHO it's a lot easier
than writing hundreds of lines of code to parse strings.
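
If the download page is unavailable, something along these lines could serve as a stand-in for experimenting. To be clear, this is not the rgxExtract() function itself: FirstGroup is a hypothetical name, the behaviour (return the first captured group, trimmed, or Null when nothing matches) is an assumption about what is convenient here, and it relies on the VBScript.RegExp object being available on Windows.

```vba
' Hypothetical stand-in for experimenting; NOT the rgxExtract() function
' referred to above. Returns the first captured group of the first match,
' or Null if the pattern does not match.
' Assumes strPattern contains at least one pair of capturing parentheses.
Public Function FirstGroup(ByVal strText As String, _
                           ByVal strPattern As String) As Variant
    Dim rgx As Object
    Dim colMatches As Object

    Set rgx = CreateObject("VBScript.RegExp")
    rgx.Pattern = strPattern
    rgx.Global = False      ' only the first match is needed
    rgx.IgnoreCase = False  ' "$a" and "$A" are treated as different

    Set colMatches = rgx.Execute(strText)
    If colMatches.Count = 0 Then
        FirstGroup = Null
    Else
        ' The capture can include a trailing space, so trim it.
        FirstGroup = Trim$(colMatches(0).SubMatches(0))
    End If
End Function
```

With the test string S from above, typing ?FirstGroup(S, "\$b +([^$:/.]+) *[$:/.]") in the Immediate pane should print a biography.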

A regular expression can have multiple sets of capturing parentheses, so
it's possible in principle (and often in practice) to use a single more
complicated regular expression to grab multiple substrings in one go.
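
A minimal sketch of that idea for the 245 line, with one pair of capturing parentheses per subfield. Parse245 is an invented name and the pattern is only a guess from this single sample: it assumes $a, $b and $c are all present and appear in that order, which real records may not guarantee. It uses the VBScript.RegExp object through late binding.

```vba
' Hypothetical helper: returns Array($a text, $b text, $c text) for a
' rendered 245 line, or Null if the line does not have all three subfields
' in that order.
Public Function Parse245(ByVal strLine As String) As Variant
    Dim rgx As Object
    Dim colMatches As Object

    Set rgx = CreateObject("VBScript.RegExp")
    ' Three capturing groups, separated by the " : " and " / " punctuation.
    rgx.Pattern = "\$a +([^$:/.]+)[ :]*\$b +([^$:/.]+)[ /]*\$c +([^$:/.]+)"

    Set colMatches = rgx.Execute(strLine)
    If colMatches.Count = 0 Then
        Parse245 = Null
    Else
        With colMatches(0)
            Parse245 = Array(Trim$(.SubMatches(0)), _
                             Trim$(.SubMatches(1)), _
                             Trim$(.SubMatches(2)))
        End With
    End If
End Function

Public Sub DemoParse245()
    Dim varParts As Variant
    Dim lngI As Long

    varParts = Parse245("245 10 $a Laura Ingalls Wilder : $b a biography / $c by William Anderson.")
    If Not IsNull(varParts) Then
        For lngI = LBound(varParts) To UBound(varParts)
            Debug.Print varParts(lngI)
        Next lngI
    End If
End Sub
```

The third group still includes the leading "by", as in the single-subfield test above; stripping it would need a further rule about how statements of responsibility are worded.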

Have fun!

Fred Boer · Aug 17, 2006

Dear John:

Thanks very much! I tried the sample test you provided, and it works
perfectly. I have also read one or two web pages describing regular
expressions. Right now it looks rather complex, but every reference I read
tells me that with a little practice it becomes easier, so I'll give it a
try! Be forewarned though, you might hear from me about this again!

Cheers!
Fred

