Guest
Greetings,
I'm doing my own bit of research into implementing and using a parser to
understand the intent of someone's code or another specialized text file. So
far I've written a grammar that lets you define your own grammar, and the
program that reads it understands the relationships the grammar describes.
The next step is to take the grammar's data structures (the in-memory tree
form of the grammar) and build a parser that constructs, for you, a series of
objects corresponding to the actual data described by the grammar you
specify.
So far I've had a few ideas on how to do this, based on the work I did
writing the grammar parser. Based on the kinds of data requested (literals,
character ranges, or references to other declarations in the grammar), it can
make a few assumptions. First: the grammar segments each declaration into a
series of 'leaves', i.e. alternatives of the declaration delimited by '|'.
The first step for the parser builder would be to compute the character
ranges each declaration can start with (its initial state), and use those to
minimize the work of deciding which path to take.
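The 'initial state character ranges' idea corresponds to what compiler texts call FIRST sets. A minimal sketch in Python of computing them; the grammar representation here is my own assumption for illustration, not your actual in-memory tree:

```python
# Compute, for each rule, the set of characters that can begin a match.
# Assumed representation: each rule maps to a list of alternatives; an
# alternative is a list of terms, where a term is ("lit", "Test"),
# ("range", "0", "9"), or ("ref", "RuleName").

def first_chars(grammar, rule, seen=None):
    """Return the set of characters a match of `rule` can start with."""
    seen = seen or set()
    if rule in seen:              # guard against left-recursive rules
        return set()
    seen = seen | {rule}
    chars = set()
    for alternative in grammar[rule]:
        term = alternative[0]     # only the first term determines FIRST here
        kind = term[0]
        if kind == "lit":
            chars.add(term[1][0])
        elif kind == "range":
            chars.update(chr(c) for c in range(ord(term[1]), ord(term[2]) + 1))
        elif kind == "ref":
            chars.update(first_chars(grammar, term[1], seen))
    return chars

# The example grammar from below, in this assumed form:
grammar = {
    "S": [[("ref", "D")], [("ref", "L")], [("ref", "C")]],
    "D": [[("range", "0", "9")]],
    "L": [[("range", "A", "Z")]],
    "C": [[("lit", "Test")]],
}
```

With this, `first_chars(grammar, "S")` is the union of the digits and uppercase letters, and 'T' shows up in both L's and C's sets, which is exactly the overlap you describe.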
For example, if you have the following grammar:
S = D | L | C;
D = [0-9];
L = [A-Z];
C = "Test";
Each declaration will be assigned a series of 'states' and sub-states. Based
on the character ranges specified, you can see that L and C overlap on the
character 'T' in their initial state. When the parser is called to obtain the
first entity, 'S' starts in its main state '0', since it doesn't yet know
anything about the input. From this state the parser can look at the first
character and decide which of D, L, or C to eliminate from the running; if we
are given 'T' as the first character, D is thrown out. After that, since L
and C both accept the same first character, the only way to know which one
matches is to try both and accept whichever consumes the longest input.
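The 'try both and keep the longest' step is essentially longest-match alternation. A sketch of it for the grammar above; the match functions are hypothetical stand-ins, not your generated code:

```python
def match_L(text, pos):
    """L = [A-Z]: match one uppercase letter; return end position or None."""
    if pos < len(text) and "A" <= text[pos] <= "Z":
        return pos + 1
    return None

def match_C(text, pos):
    """C = "Test": match the literal; return end position or None."""
    return pos + 4 if text.startswith("Test", pos) else None

def match_S(text, pos):
    """Try every surviving alternative and commit to the longest match."""
    results = [end for end in (match_L(text, pos), match_C(text, pos))
               if end is not None]
    return max(results) if results else None
```

Given the input "Test", match_L succeeds consuming 1 character and match_C succeeds consuming 4, so longest-match resolves the 'T' overlap in favor of C.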
For cases where an entire declaration is repeated, the best approach I can
think of is to cache the attempt streams for the declarations and unwind the
stack until there's a point of commonality. Where there are x entities that
share Y in common, the unwound data would be stored and segmented according
to how far into their 'unwound' streams the remaining x-1 entities actually
unify.
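Caching attempts so a repeated declaration is never re-parsed at the same input position is the core idea of packrat parsing (memoized recursive descent). A minimal sketch of such a memo table, with illustrative rule functions rather than your real ones:

```python
# Memoize (rule, position) -> result so a declaration tried once at a given
# input position is never re-tried there. This is what makes PEG-style
# backtracking run in linear time.

memo = {}

def parse(rule, text, pos, rules):
    """Look up a cached attempt, or run the rule and cache its result."""
    key = (rule, pos)
    if key not in memo:
        memo[key] = rules[rule](text, pos)  # end position, or None on failure
    return memo[key]

# Illustrative rule functions for L = [A-Z] and C = "Test":
rules = {
    "L": lambda t, p: p + 1 if p < len(t) and "A" <= t[p] <= "Z" else None,
    "C": lambda t, p: p + 4 if t.startswith("Test", p) else None,
}
```

The second time any declaration is attempted at the same position, the cached outcome is returned directly, which is one concrete way to get the sharing effect you're after without explicit stack unwinding.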
The real question I have today is: am I on the right track at all,
logically? The project is rather abstract and I'm no language-theory expert.
As far as actual code exporting goes, I've got that covered in what I've
dubbed 'OIL', an objectified intermediate language that replaces CodeDOM in
all of my projects.
Any assistance or pointers of any kind are appreciated.