Parsing text files

Gustaf

For practice, I'm having a go at implementing the GEDCOM 5.5 standard for
genealogy. Here's a sample of the file format:

http://homepages.rootsweb.ancestry.com/~pmcbride/gedcom/55gcch2.htm#S6

I'm planning to use StreamReader to read files, but I'm not sure exactly
how. GEDCOM files are divided into logical components called records,
each of which begins with a line starting with '0'. There are different
kinds of records, each with unique tag names and substructures.

The best approach I can think of is to loop through the file, check for
the sequence \n0, check for the next such sequence, and handle each such
block individually, depending on its tag name.

This problem must arise in a large number of applications (reading
mailbox files, for instance), so I figured there must be a conventional
way of solving it. Any ideas? I checked out another GEDCOM library that
described itself as a "callback-based parser". I'm not sure what that
means, but I figure it could be a lead on how to approach the problem.

Gustaf
 
Family Tree Mike

Each record consists of multiple lines, the first of which has a 0 as its
first non-whitespace character. Read each line, trimming whitespace from
both ends, and store each set of lines together: keep all lines from one
0 line up to (but not including) the next 0 line in a List<string>.
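
A minimal sketch of that grouping step, written in Python for brevity; in C# the same loop would sit around StreamReader.ReadLine and fill a List<string> per record. The function name split_records and the sample lines are made up for illustration, and lines that appear before the first 0 line are simply ignored here.

```python
import io


def split_records(lines):
    """Group GEDCOM lines into records; each record starts at a level-0 line."""
    records = []
    current = None  # the record being collected, or None before the first 0 line
    for raw in lines:
        line = raw.strip()  # trim both ends, as described above
        if not line:
            continue  # skip blank lines
        level = line.split(maxsplit=1)[0]
        if level == "0":
            current = [line]  # a new record begins here
            records.append(current)
        elif current is not None:
            current.append(line)  # continuation of the current record
    return records


# Invented sample data, only to show the grouping.
sample = io.StringIO(
    "0 @1@ INDI\n"
    "  1 NAME Robert Eugene /Williams/\n"
    "1 SEX M\n"
    "\n"
    "0 @4@ FAM\n"
    "1 HUSB @1@\n"
)
records = split_records(sample)
for record in records:
    print(record[0], "->", len(record), "lines")
```

Because the function only needs something it can iterate over line by line, it works on an open file as well as on the in-memory sample.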

You then define several classes, each of which takes one set of lines.
Records 1 and 2, for example, would each be passed to an Individual class
constructor, which reads the subsequent lines.

The numbers are actually very helpful in organizing the classes. The
records that start with a number 4, belong to a 3rd level entity.
Record 1 (Robert Eugene Williams) says that his role in a Birth (BIRT)
event was that of a Child (CHIL). The event belongs to a source record
(SOUR) with an ID of @6@ (a common ID form in GEDCOM). The other level 3
record under the source (PAGE) says where other genealogists can find the
entry in the source. The first level 2 record gives the date, which
belongs to the BIRT (level 1) record. All of these belong to the main
level 0 object, which is the individual.
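
The nesting described above can be rebuilt with a small stack keyed on the level number. This is only a sketch in Python, assuming well-formed input where every line has at least a level and a tag, and where a level never jumps by more than one. Node, parse_line and build_tree are hypothetical names, and the DATE and PAGE values in the sample are placeholders rather than data from the spec.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    level: int
    tag: str
    xref: Optional[str] = None  # e.g. "@1@" on a level-0 line
    value: str = ""
    children: List["Node"] = field(default_factory=list)


def parse_line(line):
    """Split 'level [@XREF@] TAG [value]' into a Node (assumes well-formed input)."""
    parts = line.strip().split(" ", 2)
    level = int(parts[0])
    if len(parts) > 1 and parts[1].startswith("@"):
        # Form: level @XREF@ TAG [value]
        xref = parts[1]
        rest = parts[2].split(" ", 1) if len(parts) > 2 else [""]
        tag = rest[0]
        value = rest[1] if len(rest) > 1 else ""
    else:
        # Form: level TAG [value]
        xref = None
        tag = parts[1]
        value = parts[2] if len(parts) > 2 else ""
    return Node(level, tag, xref, value)


def build_tree(record_lines):
    """Turn the lines of one record into a tree, using the level as nesting depth."""
    root = None
    stack = []  # stack[i] is the most recent node at level i
    for line in record_lines:
        node = parse_line(line)
        del stack[node.level:]  # forget nodes at this level or deeper
        if stack:
            stack[-1].children.append(node)  # attach to the nearest shallower node
        else:
            root = node
        stack.append(node)
    return root


# Shape follows the description above; DATE and PAGE values are placeholders.
record = [
    "0 @1@ INDI",
    "1 NAME Robert Eugene /Williams/",
    "1 BIRT",
    "2 DATE 1 JAN 1900",
    "2 SOUR @6@",
    "3 PAGE p. 1",
    "3 EVEN BIRT",
    "4 ROLE CHIL",
]
root = build_tree(record)
birt = root.children[1]
sour = birt.children[1]
role = sour.children[1].children[0]
print(root.tag, birt.tag, sour.value, role.tag, role.value)
```

An Individual constructor could then walk root.children instead of re-reading raw lines.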
 
Gregory A. Beamer

Having done a lot of genealogical work, I would recommend, as Mark has,
moving to GEDCOM XML, as it is much easier to consume. You can always
have an option to output standard GEDCOM, but I would work either in a
relational store or GEDCOM XML.

The GEDCOM standard is found here:
http://homepages.rootsweb.ancestry.com/~pmcbride/gedcom/55gctoc.htm

Assuming GEDCOM 5.5, that is.

NOTE that there are numerous XML standards out there for genealogy.
GEDCOM ML is GEDCOM's standard. There is also GedML (Michael Kay, a
pioneer in these formats), gdxml (Hans Fugal) and GenXML (I forget the
name of the group that developed it).

In addition to what Mark has suggested (the MSDN article), there is a
neat codeplex project with a GEDCOM reading class:
http://gedcom.codeplex.com/

You can use it to help read in families and possibly even convert to
another format.


--
Gregory A. Beamer
MVP; MCP: +I, SE, SD, DBA

Twitter: @gbworld
Blog: http://gregorybeamer.spaces.live.com

*******************************************
| Think outside the box! |
*******************************************
 
Gustaf

Gregory said:
Having done a lot of genealogical work, I would recommend, as Mark has,
moving to GEDCOM XML, as it is much easier to consume.

Indeed. I did a lot of XML work for several years, so when reading the
GEDCOM 5.5 spec, I get itchy all over and want to make an XML remake
(and the only thing stopping me is knowing that Michael Kay and others
have done it before).

However, my motivation now is to get *insight* into how to parse text
files in general, with GEDCOM as an example. Many of the problems I
encounter with GEDCOM must apply to other text formats too. I can figure
out ways of solving most things, but I always look for conventional
recipes.

Glad to find GEDCOM expertise here anyway. :)

Gustaf
 
Gustaf

Family said:
Each record consists of multiple lines which start with a 0 in the first
non-white character. Read each line, trimming the front and back end.
Store sets of lines together. Keep all lines from the first 0 up to the
next 0 in a List<string> collection.

Nice outline. Thank you.

Family said:
The numbers are actually very helpful in organizing the classes. The
records that start with a number 4, belong to a 3rd level entity. Record

I think you mean lines, not records. :)

In my model so far, I have a Gedcom class, which contains one Dictionary
each for Family and Individual objects (with the record ID as key for
quick and easy reference). I'll add more record classes later (there are
9), but if I can get Family and Individual working I'll be quite happy.

The idea is that when the GEDCOM file is loaded, I'll be able to
traverse a tree of Individual objects, jumping from relative to relative.
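
Roughly, the model could look like the sketch below. It is in Python to keep it short; the dicts stand in for Dictionary<string, Individual> and Dictionary<string, Family>. The field names and the parents_of and children_of helpers are my own invention; only the GEDCOM tags in the comments (FAMC, FAMS, HUSB, WIFE, CHIL) come from the format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Individual:
    id: str
    name: str = ""
    child_of: Optional[str] = None  # FAMC: ID of the family this person is a child in
    spouse_in: List[str] = field(default_factory=list)  # FAMS: families as a spouse


@dataclass
class Family:
    id: str
    husband: Optional[str] = None  # HUSB
    wife: Optional[str] = None  # WIFE
    children: List[str] = field(default_factory=list)  # CHIL


@dataclass
class Gedcom:
    individuals: Dict[str, Individual] = field(default_factory=dict)
    families: Dict[str, Family] = field(default_factory=dict)

    def parents_of(self, person_id):
        """Follow the person's FAMC link to the family, then to HUSB and WIFE."""
        person = self.individuals[person_id]
        if person.child_of is None:
            return []
        fam = self.families[person.child_of]
        return [self.individuals[i] for i in (fam.husband, fam.wife)
                if i in self.individuals]

    def children_of(self, person_id):
        """Follow the person's FAMS links and collect the children of each family."""
        result = []
        for fam_id in self.individuals[person_id].spouse_in:
            for child_id in self.families[fam_id].children:
                result.append(self.individuals[child_id])
        return result


# Hand-built example data; the IDs and names are invented.
g = Gedcom()
g.individuals["@1@"] = Individual("@1@", "Child", child_of="@F1@")
g.individuals["@2@"] = Individual("@2@", "Father", spouse_in=["@F1@"])
g.individuals["@3@"] = Individual("@3@", "Mother", spouse_in=["@F1@"])
g.families["@F1@"] = Family("@F1@", husband="@2@", wife="@3@", children=["@1@"])

parent_names = [p.name for p in g.parents_of("@1@")]
child_names = [c.name for c in g.children_of("@2@")]
print(parent_names, child_names)
```

Storing IDs rather than object references keeps loading simple, since a record may refer to a family or individual that appears later in the file.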

Then I'll make a 3D app that can visualize a tree of 500+ people. ;-)

Gustaf
 
Family Tree Mike

Sounds like a good project. Good luck with it. I'm a fan of Family
Tree Maker, but I write my own tools on the side. Your approach seems
on track.
 
Gregory A. Beamer

In general, most files that you deal with, apart from those that are
column-separated in some way, are parsed line by line. With GEDCOM, the
most difficult part is that a record spans multiple lines. But it is
fairly simple in that there are indicators that, at minimum, mark the
beginning of a record.

From a programming standpoint, you end up looking for begin markers
with GEDCOM. But it is a good exercise in a rather complex type of
parse, so I would definitely head in that direction first.
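
Looking for start-of-record lines also fits the "callback-based parser" idea raised at the top of the thread. My reading of that term, which is an assumption and not a description of that particular library, is that the parser owns the read loop and calls a function you register for each kind of record. A minimal Python sketch, with parse and the handler names made up for illustration and well-formed level-0 lines assumed:

```python
def parse(lines, handlers, default=None):
    """Call handlers[tag](record_lines) once for every level-0 record."""
    record = []

    def flush():
        if not record:
            return
        parts = record[0].split()
        # The tag is the third token when the second is an @XREF@, else the second.
        tag = parts[2] if len(parts) > 2 and parts[1].startswith("@") else parts[1]
        handler = handlers.get(tag, default)
        if handler is not None:
            handler(list(record))  # hand the callback its own copy of the lines
        record.clear()

    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.split(maxsplit=1)[0] == "0":
            flush()  # a new record begins, so hand over the previous one
            record.append(line)
        elif record:
            record.append(line)
    flush()  # the last record has no following 0 line to trigger it


seen = []
unhandled = []
text = "0 HEAD\n1 GEDC\n0 @1@ INDI\n1 SEX M\n0 @4@ FAM\n1 HUSB @1@\n0 TRLR\n"
parse(
    text.splitlines(),
    {
        "INDI": lambda rec: seen.append(("individual", rec[0], len(rec))),
        "FAM": lambda rec: seen.append(("family", rec[0], len(rec))),
    },
    default=lambda rec: unhandled.append(rec[0]),
)
print(seen, unhandled)
```

The caller never sees the loop; adding support for another record type means registering one more handler.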

Most of what is parsed in business is either fixed-width or delimited.
Once you know the size of the columns (or the delimiter), you can pull
out the fields and create the record. Quite a bit easier than GEDCOM.
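
For comparison, here is a small Python sketch of both cases. The column widths and sample rows are invented, and parse_fixed_width and parse_delimited are hypothetical helper names; the delimited case leans on Python's standard csv module so that quoted fields containing the delimiter are handled.

```python
import csv
import io


def parse_fixed_width(line, widths):
    """Cut a line into fields of the given column widths, trimming padding."""
    fields, pos = [], 0
    for width in widths:
        fields.append(line[pos:pos + width].strip())
        pos += width
    return fields


def parse_delimited(text, delimiter=","):
    """Split delimited text into rows of fields, honouring quoted fields."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# Invented sample: surname (10 columns), given name (10 columns), year (4 columns).
fixed_line = "Smith".ljust(10) + "John".ljust(10) + "1950"
fixed_fields = parse_fixed_width(fixed_line, [10, 10, 4])

delimited_rows = parse_delimited('Smith,John,1950\n"Doe, Jr.",Jane,1960\n')
print(fixed_fields, delimited_rows)
```

Each line is a complete record here, which is what makes these formats simpler than GEDCOM's multi-line records.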

 
