How to construct a grammar?

This is a discussion on How to construct a grammar? within the Compilers forums in Theory and Concepts category; I've read a bit about parsing and compilers once in a while (red dragon, ANTLR reference), but have no experience building one. To my embarrassment I realized that I have no clear idea how to construct a grammar for an existing language. So, I'm wondering, is there a systematic way to get from a suitable list of language samples, i.e, without a formal definition to start with, to a grammar? Michael -- Michael Schuerig mailto:michael @ schuerig.de http://www.schuerig.de/michael/...

Go Back   Application Development Forum > Theory and Concepts > Compilers

Object Mix

Register FAQ Calendar Search Today's Posts Mark Forums Read
  #1  
Old 07-05-2008, 02:25 AM
Michael Schuerig
Guest
 
Default How to construct a grammar?

I've read a bit about parsing and compilers once in a while (red
dragon, ANTLR reference), but have no experience building one. To my
embarrassment I realized that I have no clear idea how to construct a
grammar for an existing language.

So, I'm wondering, is there a systematic way to get from a suitable
list of language samples, i.e, without a formal definition to start
with, to a grammar?

Michael

--
Michael Schuerig
mailto:michael@schuerig.de
http://www.schuerig.de/michael/
Reply With Quote
  #2  
Old 07-05-2008, 10:49 AM
Detlef Meyer-Eltz
Guest
 
Default Re: How to construct a grammar?

> So, I'm wondering, is there a systematic way to get from a suitable
> list of language samples, i.e, without a formal definition to start
> with, to a grammar?


Caution: I will make some advertising of my own parser generator
product. But it was made just with such tasks in mind.

I often was engaged in similar tasks and I have tried to make a short
guide from my experiences.

http://www.texttransformer.com/tthelp/howtobegin.htm

The essential statement is: first write a rough parser, which can
process the whole sample and then refine the grammar peu ` peu. This
is possible with TextTransformer, because there is a special SKIP
token, by which unrecognized parts of text can be omitted in the
beginning.

TextTransformer is particularly suitable for this task, because you
can test your language samples easily in a visual way. Finally all
available language samples can be tested at once with the integrated
transformation-manager. If the parser fails at a certain file, it can
be opened by a double click in the TextTransformer IDE for debugging.
After you have corrected the grammar, you can quickly test all samples
again.

I have repeatedly used this method successfully. Among others for the
translation of a not specified BASIC dialect into Java and on several
unspecified table formats. This way a grammar for Borlands Delphi also
almost is completed.

Regards

Detlef
Reply With Quote
  #3  
Old 07-05-2008, 11:38 AM
Chris F Clark
Guest
 
Default Re: How to construct a grammar?

Michael Schuerig <michael@schuerig.de> writes:

> So, I'm wondering, is there a systematic way to get from a suitable
> list of language samples, i.e, without a formal definition to start
> with, to a grammar?


There are two answers to that question. A theoretical automated
answer and a practical hand-done answer. I'm going to start with the
answer I suspect you care about less, the automated theoretical one
and dispense quickly with it. Taking a set of examples and generating
a grammar from itusing an algorithm is a field of research, which I
think is called roughly "machine learning". There are papers in that
field that give results positive and negative, saying what you can and
cannot make a machine learn from a set of examples. The results are
mostly the kinds of things one would expect. As I recall: You can
learn a finite regular expression grammars with no closure operators
(a* or a+) from positive examples only (i.e. this is legal). Added
negative examples (i.e. these are illegal), you can learn regular
expressions including closures. I think its been shown that there are
some CFGs you cannot learn with only positive and negative examples
alone, you can only learn a regular expression that matches a superset
of the CFG. I cannot recall where the deterministic CFGs fit in the
learnability specturm. But, again, I don't this this is your real
question.

In a practical sense, it is possible and usually actually even easy to
construct a grammar (by hand) from examples, as long as one has some
concept of the language. The easiest (and actualy most common case)
is that you have a grammar for a similar language accessible. If so,
you are only trying to cover the differences. However, since that
case is covered in the general case, I will skip it, and assume you
are starting from "first principles".

The first thing one needs is a "vocabulary", a list of things one is
talking about. This natuarally breaks down into two parts. You can
do either part first, so I'm picking ar order that is convenient for
me to talk about.

The first part of the vocabulary is the tokens or terminals. These
are the atomic parts of your grammar. You problably have identifiers
and keywords and numbers and operators and whitespace and similar
kinds of things in your grammar. So, make a list of all these things
are find examples of each of them. Don't worry about exactly how to
define them yet, just assume that you will be able to write a lexer
that can find them. When you get done with the whole process of
defining your language grammar you can roughly repeat the process for
the tokens and get a lexical grammar for them. More importantly, it
is even more likely that you will be able to find an existing lexer
grammar for your set of tokens, than it is to find a grammar that
describes your whole language. In fact, as I am defining new grammars
for new small tools that I design from time to time, I essentially use
one lexer grammar with some small permutations for all the tools. It
is so trivial, that I never even think about the minor variations that
I introduce. Thay are just "obvious", e.g. the list of keywords
changes.

(Shameless plug here, I believe that is partially true because the
lexer notation we defined in Yacc++ is particuarly natural for doing
that. We actually spent time thinking about that. However, we aren't
the only ones that have noticed that and there are some tools that
essentially have pre-canned lexers built into them, that you just
customize to get the desired result. A lot of the scannerless tools
go this way, because you can then sweep the ugly parts into the
predefined sections and paper over them so users don't notice what one
has done. We don't do that, but we do hide some of the ugliness in
default behaviors one can override, e.g. certain error checks are
turned off by default at the lexical level (for example, shift-reduce
conflicts between tokens that end in a star, because the longest match
rule implies always shift), but one can turn them back on.)

In any case, you have your list of atomic tokens. You then make a
list of the non-terminals, the structures your language is composed
of. You obviously have a "grammar", "program", of "file" non-
terminal, which represents the largest chunk you are parsing at a
time. That is your start symbol. Your language probably has
sub-groupings within that, like blocks, statements, and expressions.
Again, make a list of what they are and group the examples
accordingly.

Once, you have this done, the process is simply a case of taking each
non-terminal and looking at the examples and writing down the rule(s)
that turn those examples into non-terminals. In the process, you may
find you need to introdeuce additional non-terminals are tokens for
things you missed, but that isn't a major issue. Let's look at an
example of doing that.

Non-terminal: statement
Examples:
a := b + c;
if x = y then a := 1;

First guess at rules:
statement: variable ":=" variable "+" variable ";"
| if variable "=" variable then variable ":=" constant ";"
;

Now, we probably quickly realize that we want to use an expression
non-terminal to capture the various forms that part can take, so we
change the grammar to look like.

statement: variable ":=" expression ";"
| if boolean_expression then variable ":=" expression ";"
// see the section after deubggin on my comment
// on using boolean_expression here
;

We may also decide that we want to distinguish our types of statements
(i.e. new non-terminals), resulting in the following grammar.

statement: assignment_statement
| if_statement
;
assignment_statement: variable ":=" expression ";"
;
if_statement: if boolean_expression then variable ":=" expression ";"
;

Next, we may realizes from the examples that if_statements take a
statement after the then clause, resulting in:

if_statement: if boolean_expression then statement
;

When, you have finished that exercise you have a grammar for your
language. Worth noting, is that that are often certain things one
wants in a language (likes lists separated by semicolons, certain
phrases which are optional). Those things often have an idiomatic
expression in the grammar notion of choice. In Yacc++, we like
regular expressions for mnay of the idioms and have a tutorial chapter
in our documentation that is simply a cookbook of some of the more
common idioms and the corresponding regular expression. In normal
yacc grammars, you will find the same idioms described by sets of
(often recursive) productions where some helper nonterminals are added
with names like expression_list. The _list says it is part of an
idiom describing lists of expressions. In either case, one learns the
idiomatic expressions and uses them, because they often have "good
properties". (BTW, I've never seen the corresponding yacc cookbook
(and have no interest in writing it), but if you read yacc grammars,
you will slowly notice the conventions.)

This gets us to the next point. Above I said you have a grammar, but
the grammar may or may not be suitable to your tool. If you've
written lots of grammars and have developed an intuition for how to
avoid problems, it may be acceptable as is. However, you need to test
it by running the grammar through your tool and seeing if the tool
generates errors. If so, your language isn't acceptible to your tool,
and may even have flaws where it doesn't even describe what you want.
Even if your grammar is acceptible to your tool, you need to test it
to make certain that it is actually what you want, by building a small
copy of you application that just lexes and parses input and running
relevant examples through to see if the examples are turned into the
right kind of parse tree (AST).

If your grammar isn't acceptible (or parses something incorrectly),
one has come to the hard part, debugging.

If the grammar parses something incorrectly, one needs to trace the
grammar and see where the parser has categorized something
incorrectly, and fix that rule. Sometimes the problem is that the
non-terminal you captured is context dependent and in one context some
phrase is legal, but in another context the phrase is not. In that
case, you may need to split the non-terminal into two distinct
non-terminals (one for each context). Alternately, you can have the
reverse problem, where you have two different non-terminals and they
both cover some case and if the parser picks the wrong one to apply it
misses the possibility of the other non-terminal and either
mis-recognizes the result or refues to recognize a valid case. Note
that if there are two non-terminals that are "similar" (the rules for
similarity differe for the different grammar classes, e.g. LL versus
LR) and can both apply at a particular point in a parse, the tool will
(should) catch this problem and give you a "conflict" message. This
message is to warn you that the resulting parser may not do what you
expect for some inputs. (This is one of my arguments for using a tool
or hand-written recursive descent, even if you want a recursive
descent output. Tools can perform correctness checks to help you not
make mistakes. By hand, you have to find those problems "some other
way".)

Of course, there are more problems and fixes than just splitting and
comibning rules--that's where the idiomatic expressions help. They
are often ways of writing a particular set of rules to describe a
particular language problem, such that the tool will likely find it
acceptible. For example, the rules on left-recursion (not if you want
LL), and/or right-recursion (can be "expensive" at run-time in LR) are
rules to avoid common problems.

Debugging and changing your grammar to accept exactly the language you
want is hard. Grammars can do some things well and easily. Other
things can take bizarre non-obvious sets of rules to describe. And,
somethings, cannot be desribed in practical grammars at all.

In fact, the use of boolean_expression above was exactly such a case.
Grammars are not good at enforcing type systems. Sometimes one can do
it, but it takes lots of extra rules to do so. Other times it cannot
be done at all. Moreover, even when one can do it and is will to have
extra rules, the resulting grammar is fragile and hard to modify.
When one encounters such problem things in a grammar, the rule of
thumb is to make the grammar accept invalid programs (generalize what
you accept) and then add semantic checks to your compiler that catches
the invalid cases. Thus, in the above case a better grammar would
have been:

if_statement: if expression then statement
; // type check that expression returns a boolean

Second shamelss plug, a few years back I wrote for Sigplan NOTICES, a
serious of columns on resolving some of the grammar problems and
tricky cases. The columns try to capture some of the intuition and
rules of thumb that aren't always described in textbooks. (The column
died because I ran out of time to write it, but someday I may get back
to it, because I didn't run out of material.) If you are seriously
interested in writing grammars, I recommend reading those columns,
there are about 10 of them, and they are only about 5 pages each.
(They all read something like this post, because that's my writing
style, for better or worse.)

In any case,

Hope this helps,
-Chris

************************************************** ****************************
Chris Clark Internet: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. or: compres@world.std.com
23 Bailey Rd Web Site: http://world.std.com/~compres
Berlin, MA 01503 voice: (508) 435-5016
USA fax: (978) 838-0263 (24 hours)
Reply With Quote
  #4  
Old 07-05-2008, 02:56 PM
Michael Schuerig
Guest
 
Default Re: How to construct a grammar?

Chris,

thanks for your extensive reply. I only have very few comments below.

Chris F Clark wrote:

> Michael Schuerig <michael@schuerig.de> writes:
>
>> So, I'm wondering, is there a systematic way to get from a suitable
>> list of language samples, i.e, without a formal definition to start
>> with, to a grammar?

>
> There are two answers to that question. A theoretical automated
> answer and a practical hand-done answer. I'm going to start with the
> answer I suspect you care about less, the automated theoretical one
> and dispense quickly with it. Taking a set of examples and generating
> a grammar from itusing an algorithm is a field of research, which I
> think is called roughly "machine learning".


Indeed, that's not what I have in mind, but I see how my wording can
lead to that interpretation. Substitute my "suitable list of language
samples" with "informal understanding of the language to be
recognized".

> This gets us to the next point. Above I said you have a grammar, but
> the grammar may or may not be suitable to your tool. If you've
> written lots of grammars and have developed an intuition for how to
> avoid problems, it may be acceptable as is. However, you need to test
> it by running the grammar through your tool and seeing if the tool
> generates errors. If so, your language isn't acceptible to your tool,
> and may even have flaws where it doesn't even describe what you want.
> Even if your grammar is acceptible to your tool, you need to test it
> to make certain that it is actually what you want, by building a small
> copy of you application that just lexes and parses input and running
> relevant examples through to see if the examples are turned into the
> right kind of parse tree (AST).


That intuition part is what I'm really after. I don't expect to write
that many grammars in my life and therefore my intuition may always be
lacking. That's why I'm interested in a systematic approach that
codifies some of the intuition and experience of others.

If I understand you correctly, your advice (snipped in this reply) would
be to first identify the words of the language (lexer), then work
top-down from sentence level.

> Hope this helps,
> -Chris


Yes, thanks a lot for your effort!

Michael

--
Michael Schuerig
mailto:michael@schuerig.de
http://www.schuerig.de/michael/
Reply With Quote
  #5  
Old 07-06-2008, 03:12 PM
Chris F Clark
Guest
 
Default Re: How to construct a grammar?

Michael Schuerig <michael@schuerig.de> writes:

> If I understand you correctly, your advice (snipped in this reply) would
> be to first identify the words of the language (lexer), then work
> top-down from sentence level.


Actually, one other piece of advice that you missed, probably because
I didn't state it clearly. Look at (and for) other grammars that
might have features similar to your grammar. There are lots of solved
problems in the language world, complete with grammars.

I just want to list some idioms that are interesting and places to
look.

include files -- find a C preprocessor grammar
this idiom is so useful we put some special features in Yacc++
to make it easy

macros -- again look at a C preprocessor grammar

reserved words (also called keywords)
usually tool specific, but knowing how to integrate a symbol
table into the grammar is a big plus here, also can be done
in the lexer with a completely different technique

keywords that can also be identifiers -- find a PL/I grammar
there is a simple solution for this that most people don't get
on the other hand, it can be a dubious language feature

typed names (aka name mangling) -- look into C++
not every feature is done in the grammar

scoped names -- Algol pioneered this
another case where the work doesn't necessarily involve the grammar

block structure via indentation -- look at Haskell (and Python, I think)

Reading an SQL grammar, both for the "base" language and for some
particular implementation is good.

I would also read a BASIC grammar, the original language, the ANSI
standard, and for Microsoft's latest flavor if I could get them.

Oh, and before all of the above, I would read a Pascal grammar. The
language isn't perfect, but the grammar is simple and clear, designed
to be learned from.

Also, most enlightening, comparing the ANSI C grammar to one done with
precedence. It shows why precedence is much more terse (and at the
same time more clear), but you also want to know the way to do nested
operators by expressions, especially if you are going to use a tool
like ANTLR or JavaCC which doesn't implement precedence directives.

Even with all those suggestions, I think one needs to read only about
3-5 grammars to get the basic drift, especially if one has 3-5
languages which are close to the one wants to implement. In reality
there are probably half-a-dozen to a dozen idioms one needs to know.
Moreover, many of them are so obvious, that they fall-out naturally.

A good example of this is the if-then-else idiom. In theory, this
isn't LL. In fact, one solution to that problem is the use of
if-then-else-fi, which does have an LL solution. However, it turns
out that there is a natural way to hack if-then-else into a recursive
descent parser that most generators accept and people use without even
thinking about. And, there is a corresponding short-cut idiom for LR
parsers that has the same effect, although there is a precise way to
write the code to be correct for LR.

In any case, the devil is in the details, as they say. Most parts of
writing grammars isn't hard, and starting top-down is a good way to
work. It's when you get to the tricky part of the language, that you
need to find a clever solution, and those solutions often exist in a
grammar somewhere, but they haven't yet gotten cast into patterns.
It's just that there isn't a pattern community for parser writing,
partially because the community is fragmented by the different tools,
because the idioms for ANTLR are often different than those for yacc.

Well, I dragged this point out enough, still I

Hope this helps,
-Chris

************************************************** ****************************
Chris Clark Internet: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. or: compres@world.std.com
23 Bailey Rd Web Site: http://world.std.com/~compres
Berlin, MA 01503 voice: (508) 435-5016
USA fax: (978) 838-0263 (24 hours)
------------------------------------------------------------------------------
Reply With Quote
  #6  
Old 07-06-2008, 05:45 PM
ArarghMail806@Arargh.com
Guest
 
Default Re: How to construct a grammar?

On Sun, 06 Jul 2008 15:12:08 -0400, Chris F Clark
<cfc@shell01.TheWorld.com> wrote:

<snip>
>I would also read a BASIC grammar, the original language, the ANSI
>standard, and for Microsoft's latest flavor if I could get them.


The ANSI standard is almost painful to read, and I have never seen a
published grammer for any of the MS basics.

<snip>
--
ArarghMail806 at [drop the 'http://www.' from ->] http://www.arargh.com
BCET Basic Compiler Page: http://www.arargh.com/basic/index.html
[The original Dartmouth Basic was designed to be trivially parsable with
simple techniques, e.g., each statement type started with a different
letter. I doubt there's a grammar for MS Basic, which looks like the
result of way too much adhoc code in the parser. -John]
Reply With Quote
  #7  
Old 07-07-2008, 11:04 AM
Quinn Tyler Jackson
Guest
 
Default Re: How to construct a grammar?

"Michael Schuerig" <michael@schuerig.de> replied to Chris F Clark:

> That intuition part is what I'm really after. I don't expect to write
> that many grammars in my life and therefore my intuition may always be
> lacking. That's why I'm interested in a systematic approach that
> codifies some of the intuition and experience of others.


It may or may not be what you're looking for, but some years ago I
wrote a small book on parsing with Visual Parse++ for Sand Stone, and
the second chapter of the book goes into some detail on the craft of
extracting a grammar from sometimes informal sources. You can grab
the Word and PDF for that here:

http://www.sand-stone.com/pwvp.ZIP

As it says in the introduction to the second chapter: "... parsing is a
predictable mathematical process, and the generation of a parser from a
grammar specification is carried out in a provably reproducible fashion;
however, the process of converting a loose language specification into a
suitable grammar can be considered as much of a craft as a science. As with
all crafts, the experience and insight of the craftspeople involved in the
process have considerable bearing on the outcome."

Having written full grammars for C++, Perl, Lua, C#, and SQL92, (and various
proprietary langauges) in some cases in more than one formalism, and in some
cases from only the loosest of what could be called "specifications" or
"standards" -- I'll say that it's not so much about "intuition" as it is
about learning the idioms. Where "intuition" comes in is really a matter of
having an ability to generalize and recognize patterns in various languages.
So, Chris' followup advice:

"Look at (and for) other grammars that might have features similar to
your grammar. There are lots of solved problems in the language
world, complete with grammars."

is very probably a fountain of a lot of white-midnights-avoided.

Were I to write that section of the book again, I would add the statement:
"Probably the most important ability when turning language examples into a
correct grammar is the ability to examine positive examples and make the
cognitive leap from the specific to the general." Which to say, compact and
correct generative grammars derived from informal sources are largely about
the grammar author's ability to abstract and generalize. All else has a
strong potential to lead to ad hockery and warts, IMO.

Just one person's opinion,
Quinn
Reply With Quote
Reply


Thread Tools
Display Modes


All times are GMT -5. The time now is 06:04 AM.


Powered by vBulletin® Version 3.7.2
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
vB Ad Management by =RedTyger=

In an effort to better serve ads to our visitors, cookies are used on objectmix.com. For more information, check out our Privacy Policy.