Hi Fred,
Looking at "Tag Field" 245, as an example, would "Regular Expression"
be a good way to extract the three separate "chunks" of text from that
single line? That is to say, to take - "245 10 $a Laura Ingalls Wilder : $b
a biography / $c by William Anderson." - and return "Laura Ingalls Wilder"
and "a biography" and "William Anderson", removing unnecessary text and
punctuation?
Absolutely. A simple way to try it for yourself is to grab the
rgxExtract() function from
http://www.j.nurick.dial.pipex.com/Code/vbRegex/rgxExtract.htm
and paste it into a module.
Get a test string into a variable S by typing this into the Immediate
pane (or pasting it from your sample file) and hitting Enter:
S = "245 10 $a Laura Ingalls Wilder : $b a biography / $c by William Anderson."
Then type this and hit Enter:
?rgxExtract(S, "\$a +([^$:/.]+) *[$:/.]")
You should get
Laura Ingalls Wilder
Change the \$a to \$b and you should get
a biography
Change the \$b to \$c and you should get
by William Anderson
I don't know the rules for rendered MARC records and have probably
missed something. Here's how the regular expression
\$a +([^$:/.]+) *[$:/.]
works:
\$         matches the dollar sign ($ without the backslash has a special
           meaning: it anchors the match at the end of the string).
\$a        matches "$a".
 +         (<space>+) matches one or more spaces.
(          is a signal to start capturing the next characters matched;
           this is the substring that rgxExtract will return.
[^$:/.]    defines a character class consisting of all characters *except*
           $:/. (these appear to be the characters that mark the end of
           the substrings we're interested in, but it may be necessary to
           exclude more characters).
[^$:/.]+   matches one or more characters in the defined class (e.g.
           "Laura Ingalls Wilder").
)          terminates the capturing process.
 *         (<space>*) matches zero or more spaces.
[$:/.]     matches one of our terminating characters.
This will almost certainly need fine-tuning, but IMHO it's a lot easier
than writing hundreds of lines of code to parse strings.
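
If you'd like to see the moving parts, here is a minimal sketch of how
the same pattern could be run directly through the VBScript RegExp
object. ExtractSubfield is a hypothetical name of my own and is not the
rgxExtract() code from the page above; it also assumes the subfield code
is a single letter and that the terminators are the ones listed in the
breakdown.

```vba
' Hypothetical helper for illustration only; not the rgxExtract() function.
' Returns the text of one subfield ("a", "b", "c", ...) or "" if not found.
Public Function ExtractSubfield(ByVal Source As String, _
                                ByVal Code As String) As String
    Dim re As Object
    Dim matches As Object

    ' Late binding, so no library reference has to be set.
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "\$" & Code & " +([^$:/.]+) *[$:/.]"
    re.IgnoreCase = False
    re.Global = False              ' the first match is all we need

    Set matches = re.Execute(Source)
    If matches.Count > 0 Then
        ' SubMatches(0) holds whatever the first pair of parentheses captured.
        ' The character class also accepts spaces, so the capture can carry
        ' a trailing space before ":" or "/"; Trim$ removes it.
        ExtractSubfield = Trim$(matches(0).SubMatches(0))
    Else
        ExtractSubfield = ""
    End If
End Function
```

With S set as above, typing ?ExtractSubfield(S, "a") in the Immediate
pane should print Laura Ingalls Wilder, and "b" and "c" should behave
the same way as the rgxExtract calls.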
A regular expression can have multiple sets of capturing parentheses, so
it's possible in principle (and often in practice) to use a single more
complicated regular expression to grab multiple substrings in one go.
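
As a rough sketch of that idea, the pattern below uses three pairs of
parentheses to pull $a, $b and $c out of the sample line in one pass.
The Sub name and the exact pattern are my own assumptions, and it only
matches when all three subfields are present in that order, so real
records would need something more forgiving.

```vba
' Illustrative sketch: one pattern, three capturing groups.
Public Sub ShowTitleParts()
    Const S As String = _
        "245 10 $a Laura Ingalls Wilder : $b a biography / $c by William Anderson."
    Dim re As Object
    Dim matches As Object
    Dim i As Long

    Set re = CreateObject("VBScript.RegExp")
    ' [ :/.]* skips the spaces and punctuation between one subfield's
    ' text and the next subfield marker.
    re.Pattern = "\$a +([^$:/.]+)[ :/.]*\$b +([^$:/.]+)[ :/.]*\$c +([^$:/.]+)"

    Set matches = re.Execute(S)
    If matches.Count = 0 Then
        Debug.Print "No match"
        Exit Sub
    End If

    ' SubMatches is zero-based: 0 = $a, 1 = $b, 2 = $c.
    For i = 0 To matches(0).SubMatches.Count - 1
        Debug.Print Trim$(matches(0).SubMatches(i))
    Next i
End Sub
```

Running ShowTitleParts from the Immediate pane should list the three
chunks on separate lines. The "by" in front of the author's name is
still there; stripping it would take a further tweak to the $c part of
the pattern.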
Have fun!