Parser Theory - Using grammars to build parsers


Guest

Greetings,

I'm doing my own bit of research into implementing and using a parser to
understand the intent of someone's code or other specialized text file. So
far I've written a grammar that lets you define your own grammar, and the
program that reads the grammar understands the relationships it declares.
The next step is to use the grammar's data structures (the in-memory tree
form of the grammar) to build a parser that constructs, for you, a series
of objects corresponding to the actual data described by the grammar you
specify.

So far I've had a few ideas about how to do this, based on the work I did
in writing the grammar parser. Based upon the kinds of data requested
(literals, character ranges, or references to other declarations in the
grammar), it can make a few assumptions. First: the grammar segments each
declaration into a series of 'leaves', i.e. areas of the declaration
delimited by '|'. The first step for the parser builder would be to compute
the initial-state character ranges, and use those to minimize the work of
determining which path to take.

For example, if you have the following grammar:
S = D | L | C;
D = [0-9];
L = [A-Z];
C = "Test";
Each declaration will be assigned a series of 'states' and sub-states. Based
upon the character ranges specified, you can see that L and C overlap on
the character 'T' in their initial state. When the parser is called to
obtain the first entity, S starts in its main state '0', since it doesn't
yet know anything about the input. From this state the parser can look at
the first character and determine whether to eliminate D, L, or C from the
running; if we are given 'T' as the first character, D is thrown out. After
that, since the two remaining alternatives unify on the same character, the
only way to know which one applies is to try both of them and accept the
one that matches the longest input.
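
To make that concrete, here is a minimal C# sketch of the first-character
elimination plus longest-match idea for this exact grammar (the class and
member names are made up purely for illustration):

static class FirstCharSketch
{
    // Parse one S = D | L | C at the given position: filter the alternatives
    // by their initial character, then keep the one that consumes the most
    // input (longest match). Returns the matched rule name and length,
    // or null if nothing matches.
    public static (string Rule, int Length)? ParseS(string input, int pos)
    {
        if (pos >= input.Length) return null;
        char first = input[pos];

        (string Rule, int Length)? best = null;

        // D = [0-9]: only viable if the first character is a digit.
        if (first >= '0' && first <= '9')
            best = Longest(best, ("D", 1));

        // L = [A-Z]: only viable if the first character is an uppercase letter.
        if (first >= 'A' && first <= 'Z')
            best = Longest(best, ("L", 1));

        // C = "Test": only viable if the first character is 'T'.
        if (first == 'T' && pos + 4 <= input.Length && input.Substring(pos, 4) == "Test")
            best = Longest(best, ("C", 4));

        return best;
    }

    static (string Rule, int Length)? Longest(
        (string Rule, int Length)? current, (string Rule, int Length) candidate) =>
        current == null || candidate.Length > current.Value.Length ? candidate : current;
}

For "Test123", D is eliminated immediately because 'T' is not a digit, both
L and C survive the first character, and C wins with four characters against
L's one.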

For cases where an entire declaration is repeated, the best thing I can
think of is to cache the attempt streams for the declarations and unwind the
stack until there's a point of commonality. Where there are x entities that
share Y in common, the unwound data would be stored and segmented, based
upon the x-1 entities, by how far into their 'unwound' streams they actually
unify.
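
One way I can imagine realizing that caching, purely as a sketch with
invented names, is to memoize each attempt by rule and input position, so
that when the stack unwinds and a shared prefix is re-tried, the earlier
result is reused:

using System;
using System.Collections.Generic;

// Hypothetical cache of parse attempts, keyed by rule name and input position.
// A stored value of -1 means "tried here already and failed"; otherwise it is
// the number of characters the rule matched the first time around.
class AttemptCache
{
    private readonly Dictionary<(string Rule, int Pos), int> cache =
        new Dictionary<(string Rule, int Pos), int>();

    public int GetOrParse(string rule, int pos, Func<int> parse)
    {
        if (cache.TryGetValue((rule, pos), out int cached))
            return cached;              // reuse the earlier attempt
        int result = parse();           // parse() returns -1 on failure
        cache[(rule, pos)] = result;
        return result;
    }
}

This is essentially the memoization that packrat parsers rely on, so that
literature may be worth a look for the unwinding question.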

The real question I have today is: am I on the right track, logically, with
this? The project is rather abstract and I'm no language-theory expert. As
far as the actual code exporting goes, I've got that covered in what I've
dubbed 'OIL', an objectified intermediate language that replaces CodeDOM in
all of my projects.

Any assistance or pointers of any kind are appreciated.
 

Fred Mellender

I strongly suggest you look at the parser literature. You can find the
classic book, "Compilers", by Aho, Sethi and Ullman in your library. There
are tutorial papers on the web too.

The first thing you should do is separate the logic into a lexical analyzer
and the parser proper. The analyzer (sometimes called a scanner) can be
implemented via RegEx. It processes the input and "tokenizes" it (turning it
into objects for the parser to match against the parse rules).
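
For instance, a bare-bones RegEx scanner along those lines might look like
this in C# (the token names and patterns are only placeholders, not anything
from your grammar):

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RegexScanner
{
    // Each token type is a name plus a pattern anchored with \G so it can
    // only match exactly at the current position.
    static readonly (string Name, Regex Pattern)[] TokenTypes =
    {
        ("NUMBER",     new Regex(@"\G\d+")),
        ("IDENTIFIER", new Regex(@"\G[A-Za-z_]\w*")),
        ("STRING",     new Regex("\\G\"[^\"]*\"")),
        ("WHITESPACE", new Regex(@"\G\s+")),
    };

    public static IEnumerable<(string Name, string Text)> Tokenize(string input)
    {
        int pos = 0;
        while (pos < input.Length)
        {
            Match best = null;
            string bestName = null;
            foreach (var (name, pattern) in TokenTypes)
            {
                Match m = pattern.Match(input, pos);
                if (m.Success && (best == null || m.Length > best.Length))
                {
                    best = m;
                    bestName = name;
                }
            }
            if (best == null)
                throw new Exception("Unexpected character at position " + pos);
            if (bestName != "WHITESPACE")       // drop whitespace tokens
                yield return (bestName, best.Value);
            pos += best.Length;
        }
    }
}

The real token set would of course come from your grammar; the point is only
that the scanner loop itself is a dozen lines around RegEx.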

There are two basic types of parser: top down (or recursive descent), and
bottom up (shift/reduce, sometimes called a "chart" parser). From what I
understand of your logic, you are headed toward a shift/reduce parser. This
is much more complicated than a recursive descent one, but has certain
advantages.
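
To give a flavor of the top-down kind: each rule of the grammar becomes a
method that calls the methods of the rules it references. A small
illustrative sketch (not your grammar) for E -> T ('+' T)*, T -> [0-9]+:

using System;

static class Descent
{
    // Recursive descent: one method per grammar rule, consuming input
    // left to right.
    //   E -> T ('+' T)*
    //   T -> [0-9]+
    public static int ParseE(string input, ref int pos)
    {
        int value = ParseT(input, ref pos);
        while (pos < input.Length && input[pos] == '+')
        {
            pos++;                              // consume '+'
            value += ParseT(input, ref pos);
        }
        return value;
    }

    static int ParseT(string input, ref int pos)
    {
        int start = pos;
        while (pos < input.Length && char.IsDigit(input[pos]))
            pos++;
        if (pos == start)
            throw new Exception("Digit expected at position " + start);
        return int.Parse(input.Substring(start, pos - start));
    }
}

Starting with pos = 0, Descent.ParseE("1+2+3", ref pos) consumes the whole
string and returns 6.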

I have written both kinds of parsers in C# and discuss some of the
deficiencies of the recursive descent parser. The C# code for each is
available (for free!) at

http://www.frontiernet.net/~fredm/dps/Contents.htm (Chapter 3), and
http://www.frontiernet.net/~fredm/parser/LinguistWebPage.htm

Perhaps you will find that the second meets your needs and you will not have
to write your own.

Hope this helps.

 

james

Hey Alex,

Firstly, get the Compilers book that Fred recommended -- it's great!

If you want real-world examples, Microsoft open-sources most of C#.
Follow the link to
http://www.microsoft.com/downloads/...61-3F26-4555-AE17-3121B4F51D4D&displaylang=en
(if it doesn't work, search for Shared Source Common Language
Infrastructure 2.0 Release). That should contain the code for the
CSharp lexer and parser.

You don't want to write these by hand. Look at the tool "lex" for
writing the lexer. It's mature, dating from the 70s, and it's been
ported to modern languages like Java and C#.

Try the tool "yacc" to write the parser. It takes in a grammar and
generates the rest. I've only seen it in C++, but there might be a
similar tool for C#. Keep in mind that certain grammars only work
with certain parsing strategies; for instance, the grammar S -> S a |
b would fall into an infinite loop in a recursive descent parser.
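
The usual fix is to rewrite the left recursion away: S -> S a | b derives
one 'b' followed by zero or more 'a's, so the recursive descent routine is
written against the equivalent form S -> b a*. A rough sketch over raw
characters:

static class LeftRecursionDemo
{
    // S -> S a | b, rewritten without left recursion as S -> b a*.
    // A literal transcription of the original form would call itself before
    // consuming any input and recurse forever; the rewritten form consumes
    // 'b' first, then loops over the trailing 'a's.
    public static bool ParseS(string input, ref int pos)
    {
        if (pos >= input.Length || input[pos] != 'b')
            return false;                       // S must start with b
        pos++;
        while (pos < input.Length && input[pos] == 'a')
            pos++;                              // zero or more trailing a's
        return true;
    }
}

With pos = 0, ParseS("baaa", ref pos) succeeds and consumes the whole
string, whereas a direct transcription of S -> S a | b would never return.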

Good luck!
James
 

Frans Bouma [C# MVP]


You can solve shift/reduce conflicts with a lookahead ;) (or take a
decision).

As for yacc: if the topic starter has an LR(n) grammar, I'd go for ANTLR;
it has decent tooling as well these days.

But I would also recommend getting the Compilers book and implementing a
small parser using the LR(n) algorithm in there yourself. It's perhaps
complicated at first, but it gives so much insight into the different
kinds of things that are going on. For the lex side, you could use a
simple regex tokenizer; it's not that hard to do.

FB

--
------------------------------------------------------------------------
Lead developer of LLBLGen Pro, the productive O/R mapper for .NET
LLBLGen Pro website: http://www.llblgen.com
My .NET blog: http://weblogs.asp.net/fbouma
Microsoft MVP (C#)
------------------------------------------------------------------------
 

Tigger


I've used Gold Parser to generate parser engines and found it very useful:

http://www.devincook.com/goldparser/

Tigger
 

Robert Fuchs

Which way would you go when trying to parse EDIFACT messages?
See below for an example of an EANCOM ORDERS message.
(Note that I inserted a CR/LF after each segment to make it more
human-readable.)

regards, Robert



UNB+UNOA:2+3027019680006:14+3017600144604:14+020315:1726+00000000000005++ORDERS'
UNH+00000000000001+ORDERS:D:96A:UN'
BGM+220:::COMMANDE+0040750+9'
DTM+137:20020301:102'
DTM+324:200203:610'
FTX+AAI+++PREVENIR A L?'AVANCE'
RFF+CT:30041864'
DTM+171:20020204:102'
NAD+OF+3027019770103::9'
NAD+OB+3027019680006::9'
NAD+IV+3017600144604::9'
NAD+SE+3017600144604::9'
NAD+DP+3027019770141::9++AIRAINES+1 RUE PAQUET GREUX+AIRAINES++80270'
CTA++:AIRAINES'
COM+0322294076:TE'
COM+0322293787:FX'
CUX+2:EUR:4'
PAT+1'
TDT+20++30++++K:N'
TOD++NC'
LIN+1++3598890000707:EN'
IMD+F++:::NITROR 24.15 VRAC'
QTY+21:25:TNE'
PRI+INF:139::INV'
UNS+S'
UNT+25+00000000000001'
UNZ+1+00000000000005'
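
For reference, a minimal sketch of the first splitting step, assuming the
default service characters (segment terminator ', element separator +,
component separator :, release character ?) since the sample carries no UNA
header; the class name is illustrative only:

using System.Collections.Generic;
using System.Text;

static class EdifactSplitter
{
    // Split an interchange into segments and data elements, honoring the
    // release character '?', which makes the following character literal
    // (as in  PREVENIR A L?'AVANCE  above).
    public static List<List<string>> Split(string interchange)
    {
        var segments = new List<List<string>>();
        var elements = new List<string>();
        var current = new StringBuilder();

        for (int i = 0; i < interchange.Length; i++)
        {
            char c = interchange[i];
            if (c == '?' && i + 1 < interchange.Length)
            {
                current.Append(interchange[++i]);   // released character, kept literally
            }
            else if (c == '+')
            {
                elements.Add(current.ToString());   // end of data element
                current.Clear();
            }
            else if (c == '\'')
            {
                elements.Add(current.ToString());   // end of segment
                current.Clear();
                segments.Add(elements);
                elements = new List<string>();
            }
            else if (c != '\r' && c != '\n')        // skip the inserted CR/LFs
            {
                current.Append(c);
            }
        }
        return segments;
    }
}

Each element can then be split on ':' into components in the same way, and
the first element of every segment (UNB, UNH, BGM, ...) tells you which rule
of the message grammar applies, so any of the parsing approaches discussed
above can sit on top of this.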
 
