Parser Theory - Using grammars to build parsers

Parser Theory - Using grammars to build parsers

Post by QWxleGFuZG » Thu, 18 Oct 2007 06:20:01


Greetings,

I'm doing my own bit of research in implementing and using a parser to
understand the intent of someone's code or other specialized text file. So
far I've written a grammar that allows you to define your own grammar, the
program that reads the grammar understands the relationships made by the
grammar; however, the next step is using the grammar's data structures (the
in memory tree form of the grammar) and building a parser that can build, for
you, a series of objects that relate to the actual data contained within the
grammar you specify.

So far I've had a few ideas as to how to do this, based off the work I did
in writing the grammar parser. Based upon the kinds of data requested
(literals, character ranges, or references to other declarations in the
grammar) it can make a few assumptions. First: The grammar segments each
declaration into a series of 'leaves' or areas of the declaration delimited
by a '|'. The first step for the parser builder would be to define the
initial state character ranges, and use those to minimize complexity when
determining which path to take.

For example, if you have the following grammar:
S = D | L | C;
D = [0-9];
L = [A-Z];
C = "Test";
Each declaration will be assigned a series of 'states' and sub-states. Based
upon the character ranges specified, you can assume that L and C overlap on
the character 'T' for their initial state. When the parser is called to
obtain the first entity, the 'S' would be in its main state of '0' since it
doesn't have a clue what it's doing. Based upon this state the parser can
look at the first character and determine whether to eliminate D, L, or C
from the current running, if we are given 'T' as the first character, D is
thrown out. After that, since the two unify on the same character, the only
way to know which it is is to try both of them and accept the one which is
the longest.

On cases where an entire declaration is repeated, the thing I can think of
is to cache the attempt streams for the declarations and unwind the stack
until there's a point of commonality. As there are 'x' entities that share
'Y' in common, the unwound data would be stored and segmented based upon x-1
entities on how far in their 'unwound' stream they actually unify by.

The real question I have today is, am I on the right track at all,
logically, on this? The project is rather abstract and I'm no language
theory expert. As far as the actual code exporting goes, I've got that
covered in what I've dubbed 'OIL' for objectified intermediate language that
replaces CodeDOM in all of my projects.

Any assistance or pointers of any kind are appreciated.
 
 
 

Parser Theory - Using grammars to build parsers

Post by Fred Melle » Thu, 18 Oct 2007 06:58:37

strongly suggest you look at the parser literature. You can find the
classic book, "Compilers", by Aho, Sethi and Ullman in your library. There
are tutorial papers on the web too.

The first thing you should do is separate the logic between a lexical
analyzer and the parser proper. The analyzer (sometimes called a scanner)
can be implemented via RegEx. It processes the input and "tokenizes" it
(turning it into objects for the parser to match against the parse rules).

There are two basic types of parser: top down (or recursive descent), and
bottom up (shift/reduce, sometimes called a "chart" parser). From what I
understand of your logic, you are headed toward a shift/reduce parser. This
is much more complicated than a recursive descent one, but has certain
advantages.

I have written both kinds of parsers in C# and discuss some of the
deficiencies of the recursive descent parser. The C# code for each is
available (for free!) at

http://www.frontiernet.net/~fredm/dps/Contents.htm (Chapter 3), and
http://www.frontiernet.net/~fredm/parser/LinguistWebPage.htm

Perhaps you will find that the second meets your needs and you will not have
to write your own.

Hope this helps.

"Alexander Morou" <Alexander XXXX@XXXXX.COM > wrote in
message news: XXXX@XXXXX.COM ...



 
 
 

Parser Theory - Using grammars to build parsers

Post by jame » Thu, 18 Oct 2007 15:36:59

Hey Alex,

Firstly, get the Compilers book that Fred recommended -- Its great!

If you want real world examples, Microsoft open sources most of C#...
Follow the link to
http://www.yqcomputer.com/
(if it doesnt work search for Shared Source Common Language
Infrastructure 2.0 Release). That should contain the code for the
CSharp lexer and parser.

You don't want to write these by hand. Look at the tool "lex" for
writing the lexer. It's mature and from the 70's... and its been
ported to modern languages like Java and C#.

Try the tool "yaac" to write the parser. It takes in a grammar and
generates the rest... I've only seen it in C++, but there might be a
similiar tool for C#. Keep in mind that certain grammars only work
with certain parsing strategies, for instance the Grammar S - > S a |
b would fall into an infinite loop in a recursive decent parser.

Good luck!
James
 
 
 

Parser Theory - Using grammars to build parsers

Post by Frans Boum » Thu, 18 Oct 2007 17:57:27


You can solve shift/reduce conflicts with a lookahead ;) (or take a
decision)

For yacc, if the topic starter has an LR(n) grammer, I'd go for ANTLR,
it has decent tooling as well these days.

But I also would recommend getting the Compilers book and implement a
small parser using the LR(n) algo in there yourself. It's complicated
at first perhaps but it gives so much insight in what kind of different
things are going on. For Lex, you could use a simple regex tokenizer,
it's not that hard to do.

FB

--
------------------------------------------------------------------------
Lead developer of LLBLGen Pro, the productive O/R mapper for .NET
LLBLGen Pro website: http://www.yqcomputer.com/
My .NET blog: http://www.yqcomputer.com/
Microsoft MVP (C#)
------------------------------------------------------------------------
 
 
 

Parser Theory - Using grammars to build parsers

Post by Tigge » Thu, 18 Oct 2007 19:38:57


I've used Gold Parser to generate parser engines and found it very useful:

http://www.yqcomputer.com/

Tigger
 
 
 

Parser Theory - Using grammars to build parsers

Post by Robert Fuc » Thu, 13 Dec 2007 05:55:50

Which way would you go when trying to parse EDIFACT messages?
Example of an EANCOM ORDERS message see below.
(Note that I inserted a CR/LF after each segment to make it more human
readable)

regards, Robert



UNB+UNOA:2+3027019680006:14+3017600144604:14+020315:1726+00000000000005++ORDERS'
UNH+00000000000001+ORDERS:D:96A:UN'
BGM+220:::COMMANDE+0040750+9'
DTM+137:20020301:102'
DTM+324:200203:610'
FTX+AAI+++PREVENIR A L?'AVANCE'
RFF+CT:30041864'
DTM+171:20020204:102'
NAD+OF+3027019770103::9'
NAD+OB+3027019680006::9'
NAD+IV+3017600144604::9'
NAD+SE+3017600144604::9'
NAD+DP+3027019770141::9++AIRAINES+1 RUE PAQUET GREUX+AIRAINES++80270'
CTA++:AIRAINES'
COM+0322294076:TE'
COM+0322293787:FX'
CUX+2:EUR:4'
PAT+1'
TDT+20++30++++K:N'
TOD++NC'
LIN+1++3598890000707:EN'
IMD+F++:::NITROR 24.15 VRAC'
QTY+21:25:TNE'
PRI+INF:139::INV'
UNS+S'
UNT+25+00000000000001'
UNZ+1+00000000000005'