Parsing out lines of text


Fred Boer

Hello:

Suppose you have a string of text like this:

650 0 $a Authors, American
650 0 $a Frontier and pioneer life
650 0 $a Children's stories
650 1 $a Authors, American.
650 1 $a Frontier and pioneer life.

There is a CrLf at the end of every line but the last. I want to extract each
line separately. I'd love some help as I've been trying different things
with no success...

Thanks!
Fred Boer
 
My apologies: the string of text should have been:

650 0 $a Authors, American $y 20th century $v Biography $v Juvenile literature.
650 0 $a Frontier and pioneer life $z United States $v Juvenile literature.
650 0 $a Children's stories $x Authorship $v Juvenile literature.
650 1 $a Authors, American.
650 1 $a Frontier and pioneer life.

I might also add that each line will consistently begin with "650" and have
"$a" in the same position in each line...

Thanks!
Fred
 
Fred, could you use the Split() function to parse the string into an array
at the CrLf?

This kind of thing:
Dim varArray As Variant
Dim lngI As Long

' strInput holds the whole block of text; Split() returns a zero-based array.
varArray = Split(strInput, vbCrLf)
For lngI = LBound(varArray) To UBound(varArray)
    Debug.Print varArray(lngI)
Next lngI

Assumes Access 2000 or later.
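
If the text might arrive with bare line feeds instead of CrLf (an assumption; the sample in this thread uses CrLf), a small wrapper can normalise the line endings before splitting. This is only a sketch, and SplitLines and DemoSplitLines are made-up names:

```vba
' Hypothetical helper (not from the thread): split a block of text into
' lines, whether the lines end in CrLf or in a bare Lf.
Public Function SplitLines(ByVal strInput As String) As Variant
    Dim strNorm As String

    ' Turn every CrLf into Lf, then every Lf back into CrLf,
    ' so both kinds of line ending split the same way.
    strNorm = Replace(strInput, vbCrLf, vbLf)
    strNorm = Replace(strNorm, vbLf, vbCrLf)

    SplitLines = Split(strNorm, vbCrLf)
End Function

Public Sub DemoSplitLines()
    Dim varLines As Variant
    Dim lngI As Long

    varLines = SplitLines("650 1 $a Authors, American." & vbCrLf & _
                          "650 1 $a Frontier and pioneer life.")
    For lngI = LBound(varLines) To UBound(varLines)
        Debug.Print varLines(lngI)
    Next lngI
End Sub
```

Note that a CrLf after the last line would produce one extra, empty element at the end of the array.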
 
Hi Fred,

I'd use the Split() function to parse the lines into an array:

Dim Lines As Variant, j As Long

Lines = Split(TheString, vbCrLf)
For j = 0 To UBound(Lines)
    'do something with Lines(j)
Next j

Then I'd probably use a regular expression to parse the fields out of
each line.
 
Dear Allen and John:

I firmly believe in trying to solve things myself before asking for help,
but in this case maybe I shouldn't have; I spent hours trying to solve this,
and using the Split() function solved my problem in less than 60 seconds!
Perfect!

Obviously, I didn't know about Split(), and, John, I know next to nothing
about regular expressions. I will be doing some research, but if I might
try your patience...

For the last few months I've been working hard on creating a process to
download library cataloguing records for use within my library application
from places such as the Library of Congress. Examples of a "Raw" MARC
cataloguing record and a "Rendered" (a more human readable format) are
included below. Each line in the "Rendered" example is a separate "Tag"
field. Looking at "Tag Field" 245, as an example, would a regular expression
be a good way to extract the three separate "chunks" of text from that
single line? That is to say, to take "245 10 $a Laura Ingalls Wilder : $b
a biography / $c by William Anderson." and return "Laura Ingalls Wilder",
"a biography" and "William Anderson", removing unnecessary text and
punctuation?

Thanks so much to both of you!
Fred



Raw MARC Record

01527cam 2200373 a
4500001000800000005001700008008004100025035002100066906004500087955013200132010001700264020003900281020002700320040001800347042000900365043001200374050002700386082001900413100003000432245006400462250001200526260004200538300002100580500002000601520012000621600006000741650006900801650006700870650005700937600003900994650002301033650003101056991006601087107141420041029092602.0910828s1992
nyu j 001 0beng  9(DLC) 91033805
a7bcbccorignewd1eocipf19gy-gencatlg apc18 to bg00 08-28-91; bg15
to SCD 08-28-91; fd11 08-29-91; fa00 08-29-91; fa03 08-30-91; fq28 09-05-91;
CIP ver. lb02 01-28-93 a 91033805  a0060201134 :c$16.00 ($21.50
Can.) a0060201142 (lib. bdg.) aDLCcDLCdDLC alcac
an-us---00aPS3545.I342bZ555 199200a813/.52aB2201 aAnderson,
William,d1952-10aLaura Ingalls Wilder :ba biography /cby William
Anderson. a1st ed. aNew York, NY :bHarperCollins,c1992. a240 p.
;b22 cm. aIncludes index. aA biography of the writer whose pioneer
life on the American prairie became the basis for her "Little House"
books.10aWilder, Laura Ingalls,d1867-1957vJuvenile literature.
0aAuthors, Americany20th centuryvBiographyvJuvenile literature.
0aFrontier and pioneer lifezUnited StatesvJuvenile literature.
0aChildren's storiesxAuthorshipvJuvenile literature.11aWilder, Laura
Ingalls,d1867-1957. 1aAuthors, American. 1aFrontier and pioneer life.
bc-GenCollhPS3545.I342iZ555 1992p00001519748tCopy 1wBOOKS

"Rendered" MARC Record

001 1071414
005 20041029092602.0
008 910828s1992 nyu j 001 0beng
035 $9 (DLC) 91033805
906 $a 7 $b cbc $c orignew $d 1 $e ocip $f 19 $g y-gencatlg
955 $a pc18 to bg00 08-28-91; bg15 to SCD 08-28-91; fd11 08-29-91; fa00 08-29-91; fa03 08-30-91; fq28 09-05-91; CIP ver. lb02 01-28-93
010 $a 91033805
020 $a 0060201134 : $c $16.00 ($21.50 Can.)
020 $a 0060201142 (lib. bdg.)
040 $a DLC $c DLC $d DLC
042 $a lcac
043 $a n-us---
050 00 $a PS3545.I342 $b Z555 1992
082 00 $a 813/.52 $a B $2 20
100 1 $a Anderson, William, $d 1952-
245 10 $a Laura Ingalls Wilder : $b a biography / $c by William Anderson.
250 $a 1st ed.
260 $a New York, NY : $b HarperCollins, $c 1992.
300 $a 240 p. ; $b 22 cm.
500 $a Includes index.
520 $a A biography of the writer whose pioneer life on the American prairie became the basis for her "Little House" books.
600 10 $a Wilder, Laura Ingalls, $d 1867-1957 $v Juvenile literature.
650 0 $a Authors, American $y 20th century $v Biography $v Juvenile literature.
650 0 $a Frontier and pioneer life $z United States $v Juvenile literature.
650 0 $a Children's stories $x Authorship $v Juvenile literature.
600 11 $a Wilder, Laura Ingalls, $d 1867-1957.
650 1 $a Authors, American.
650 1 $a Frontier and pioneer life.
991 $b c-GenColl $h PS3545.I342 $i Z555 1992 $p 00001519748 $t Copy 1 $w BOOKS
 
Hi Fred,

Looking at "Tag Field" 245, as an example, would a regular expression
be a good way to extract the three separate "chunks" of text from that
single line? That is to say, to take "245 10 $a Laura Ingalls Wilder : $b
a biography / $c by William Anderson." and return "Laura Ingalls Wilder",
"a biography" and "William Anderson", removing unnecessary text and
punctuation?

Absolutely. A simple way to try it for yourself is to grab the
rgxExtract() function from
http://www.j.nurick.dial.pipex.com/Code/vbRegex/rgxExtract.htm
and paste it into a module.

Get a test string into a variable S by typing this into the Immediate
pane (or pasting it from your sample file) and hitting Enter:

S = "245 10 $a Laura Ingalls Wilder : $b a biography / $c by William Anderson."

Then type this and hit Enter:
?rgxextract(S, "\$a +([^$:/.]+) *\W")

You should get
Laura Ingalls Wilder

Change the \$a to \$b and you should get
a biography
Change the \$b to \$c and you should get
by William Anderson

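I don't know the rules for rendered MARC records and have probably
missed something. Here's how the regular expression

\$a +([^$:/.]+) *[$:/.]

works (the final [$:/.] is written as \W in the example above; on this
sample line both match the same terminating characters):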

\$          matches the dollar sign ($ without the backslash has a
            special meaning).

\$a         matches "$a".

 +          (<space>+) matches one or more spaces.

(           is a signal to start capturing the next characters matched;
            this is the substring that rgxExtract will return.

[^$:/.]     defines a character class consisting of all characters
            *except* $:/. (these appear to be the characters that mark
            the end of the substrings we're interested in, but it may be
            necessary to exclude more characters).

[^$:/.]+    matches one or more characters in the defined class (e.g.
            "Laura Ingalls Wilder").

)           terminates the capturing process.

 *          (<space>*) matches zero or more spaces.

[$:/.]      matches one of our terminating characters.

This will almost certainly need fine-tuning, but IMHO it's a lot easier
than writing hundreds of lines of code to parse strings.
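
For anyone who would rather see the moving parts, here is a minimal sketch of the same pattern driven directly through the VBScript.RegExp object. It is not the rgxExtract() function from the page linked above; ExtractSubfield is a made-up name, and the sketch assumes the subfield code is a single letter or digit and that the regular expression library that ships with Windows scripting is available. Because spaces belong to the captured character class, the captured text can end in a space, so the result is trimmed:

```vba
' Hypothetical helper, not the rgxExtract() from the linked page.
' Returns the text of one subfield (e.g. "a"), or "" if it is not found.
' Assumes strCode is a single letter or digit.
Public Function ExtractSubfield(ByVal strLine As String, _
                                ByVal strCode As String) As String
    Dim objRegex As Object
    Dim objMatches As Object

    ' Late binding, so no reference has to be set under Tools > References.
    Set objRegex = CreateObject("VBScript.RegExp")
    objRegex.Pattern = "\$" & strCode & " +([^$:/.]+) *[$:/.]"

    Set objMatches = objRegex.Execute(strLine)
    If objMatches.Count > 0 Then
        ' SubMatches(0) is the text captured by the first pair of
        ' parentheses; Trim$ drops the space left before the terminator.
        ExtractSubfield = Trim$(objMatches(0).SubMatches(0))
    End If
End Function
```

With the test string S from above, typing ?ExtractSubfield(S, "b") in the Immediate pane should print "a biography".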

A regular expression can have multiple sets of capturing parentheses, so
it's possible in principle (and often in practice) to use a single more
complicated regular expression to grab multiple substrings in one go.
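
As a rough illustration of the "one go" approach, the sketch below repeats the same building block for $a, $b and $c in a single pattern and reads the three captures from SubMatches. TitleParts is a made-up name, and the sketch assumes all three subfields are present, in that order, which will not hold for every 245 line:

```vba
' Hypothetical helper: pull $a, $b and $c out of a 245 line in one match.
' Returns a three-element array, or an empty array if the line does not
' contain all three subfields in that order.
Public Function TitleParts(ByVal strLine As String) As Variant
    Dim objRegex As Object
    Dim objMatches As Object
    Dim strSub As String

    ' Same building block as the single-subfield pattern: spaces, the
    ' captured text, optional spaces, one terminator, optional spaces.
    strSub = " +([^$:/.]+) *[$:/.] *"

    Set objRegex = CreateObject("VBScript.RegExp")
    objRegex.Pattern = "\$a" & strSub & "\$b" & strSub & "\$c" & strSub

    Set objMatches = objRegex.Execute(strLine)
    If objMatches.Count > 0 Then
        With objMatches(0)
            ' One SubMatches entry per pair of capturing parentheses.
            TitleParts = Array(Trim$(.SubMatches(0)), _
                               Trim$(.SubMatches(1)), _
                               Trim$(.SubMatches(2)))
        End With
    Else
        TitleParts = Array()
    End If
End Function
```

For the sample 245 line this should give "Laura Ingalls Wilder", "a biography" and "by William Anderson" as elements 0 to 2.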

Have fun!
 
Dear John:

Thanks very much! I tried the sample test you provided, and it works
perfectly. I have also read one or two web pages describing regular
expressions. Right now it looks rather complex, but every reference I read
tells me that with a little practice it becomes easier, so I'll give it a
try! Be forewarned though, you might hear from me about this again! :)

Cheers!
Fred

 