Provided you don't mind getting errors and incorrect results in the
records that don't conform to the pattern, you can populate the separate
Epoch, etc. fields by using an update query with expressions like these.
You need to replace X with the name of your "full name" field; it can
help if you enclose it in square brackets. I
If you havne't already done so, I strongly recommend that you set up a
separate table with one record for each of the 700-odd species, with a
1:M relationship with your table of specimens (or whatever they are).
Epoch:
Trim(Left(X, Instr(1, X, "(") -1))
Age:
Mid(X, InStr(1, X, "(") +1, InStr(1, X, ")") - InStr(1, X, "(") -1)
Genus:
Trim(Mid(X, Instr(1, X, ")") + 1, InStrRev(X, " ") - InStr(1, X,
")")))
Species:
Mid(X, InstrRev(X, " "))
The expressions above use the standard VBA string-manipulation
functions. For interest I worked out the corresponding regular
expressions, which to me seem simpler and easier to maintain, though
others disagree.
Epoch:
rgxExtract(X, "^([^(]+)\s\(")
Age:
rgxExtract(X, "\((.*)\)")
Genus:
rgxExtract(X, "(\w+)\s\w+$")
Species:
rgxExtract(X, "\w+$")
The rgxExtract() function is at
www.j.nurick.dial.pipex.com/Code/
John,
In fact, not all the records have the same format. However, I'm after the
90% solution. Most are exactly in the form:
Epoch (Age) Genus Species
with the Epoch consisting of two words.
If I can take care of the records that are in this form, it will work for
about 550 out of the 700 unique names, and probably a larger percentage of
the more than 100,000 entries in the table.
Thanks again.
John Nurick said:
Parsing text can be tricky. There's no point trying unless one
understands the data. So: does *every single* value have the same
structure
Epoch<open parenthesis>Age<close parenthesis>Genus<space>Species
where Epoch is one more words and Genus and Species are one word each?
If so, the task is fairly simple. On the other hand if Age and its
parentheses are sometimes not included and the number of words varies,
it can be a lot more complicated.
Can you post some more sample records that illustrate the range of
structures found in the data, with an unambiguous indication of how each
should be parsed?
John,
It would be very worthwhile for me to maintain all the information that's in
the name such that I could filter out elements that could be grouped by Epoch
and Age, too, so I agree with your comment about the long view that I should
separate the full name into the 4 elements of Epoch, Age, Genus and Species.
(Nice work recognizing those for what they are, too. I imagine the words
looked a lot like gibberish ot most people.)
Can you help me with a query to parse that field into 4 elements? I'm a
beginner at Access, and I have tried to find what I need in both the help
files and an Access book I have to no avail.
Thanks
:
The examples you gave Jeff make it a good deal clearer; I'd not
appreciated how comparatively insigificant the first words are.
'Lower Miocene (Aquitanian) Siphonina davisi' -----> 'Siph. davisii'
'Upper Miocene (Tortonian) Robulus 15' ------> 'Rob. 15'
I assume you've set up a table Abbreviations including at least these
fields:
FullName
ShortName
and with 700 records for the 700 distinct names. Both the FullName and
ShortName fields need unique indexes for obvious reasons.
If the structure of the FullNames is consistent, you can get rid of the
geological information by dropping everything up to and including the
space after the last ")" in the value. In an update query, you'd update
ShortName to
Mid([LongName], InstrRev([LongName], ") ")+2)
with a criterion (on LongName) of
Like "*)*"
or in SQL
UPDATE Abbreviations
SET ShortName = Mid([LongName], InstrRev([LongName], ") ")+2)
WHERE LongName LIKE "*)*"
;
You can then use a succession of update queries to abbreviate the genera
(is that the right word). E.g. up
UPDATE Abbreviations
SET ShortName = Replace([ShortName], "Siphonina ", "Siph. ")
WHERE ShortName LIKE "Siphonina *"
;
But taking a longer view I think it would be worth normalising the
FullName into separate fields, e.g.
Epoch
Age
Genus
Species
which would be concatenated when you need the full name.
Then the abbreviated name could be constructed by looking up an
abbreviated form of the genus and concatenating the species.
Thanks, John.
I think you're right about doing this by hand. Maybe I can find a
quick-fingered and hungry student whose hands I can afford to rent for a few
hours.
I appreciate the other ideas, though. I'm just learning how to use MS
Access, and they will probably come in handy another time. I need to learn
how to set up those algorithms. though.
:
Hi BDB,
You may end up doing this largely by hand (on the 700, not the 100,000)
because any simple algorithm for abbreviating text is likely to produce
results that aren't particularly easy for humans to interpret.
(A simple algorithm might be:
1) remove vowels, starting with the last vowel of the first word, then
the last vowel of the second word, and so on. Do not remove a vowel if
it is the first letter of a word, or if it is the second letter of an
initial diphthong.
2) if the term is still too long, remove consonants, starting at the end
of the first word.
3) do not remove a letter if doing so would make this abbreviated term
the same as some other abbreviated term.
)
So: I'd start by searching for a list of paleontological terms and
accepted abbreviations. If that didn't produce anything useful I'd try
the comp.text.* hierarchy of newsgroups, first of all searching the
archive at groups.google.com and then if necessary posting a question.
Here is what I'm trying to do:
I've got a table with about 100,000 entries in a field, but only 700 of them
are unique. I'd like to create another field which is an abbreviated form of
the information in that field. Can somebody point me to a place where I can
figure out how to do this?
(If anybody in interested, the table is a list of of paleontological names
for microfossils found in wells in the Gulf of Mexico. The names can be
quite long, and I'd like to pare them down to something I can post on maps.)