| Register | FAQ | Calendar | Search | Today's Posts | Mark Forums Read |
|
#1
| |||
| |||
| I've read a bit about parsing and compilers once in a while (red dragon, ANTLR reference), but have no experience building one. To my embarrassment I realized that I have no clear idea how to construct a grammar for an existing language. So, I'm wondering, is there a systematic way to get from a suitable list of language samples, i.e, without a formal definition to start with, to a grammar? Michael -- Michael Schuerig mailto:michael@schuerig.de http://www.schuerig.de/michael/ |
|
#2
| |||
| |||
| > So, I'm wondering, is there a systematic way to get from a suitable > list of language samples, i.e, without a formal definition to start > with, to a grammar? Caution: I will make some advertising of my own parser generator product. But it was made just with such tasks in mind. I often was engaged in similar tasks and I have tried to make a short guide from my experiences. http://www.texttransformer.com/tthelp/howtobegin.htm The essential statement is: first write a rough parser, which can process the whole sample and then refine the grammar peu ` peu. This is possible with TextTransformer, because there is a special SKIP token, by which unrecognized parts of text can be omitted in the beginning. TextTransformer is particularly suitable for this task, because you can test your language samples easily in a visual way. Finally all available language samples can be tested at once with the integrated transformation-manager. If the parser fails at a certain file, it can be opened by a double click in the TextTransformer IDE for debugging. After you have corrected the grammar, you can quickly test all samples again. I have repeatedly used this method successfully. Among others for the translation of a not specified BASIC dialect into Java and on several unspecified table formats. This way a grammar for Borlands Delphi also almost is completed. Regards Detlef |
|
#3
| |||
| |||
| Michael Schuerig <michael@schuerig.de> writes: > So, I'm wondering, is there a systematic way to get from a suitable > list of language samples, i.e, without a formal definition to start > with, to a grammar? There are two answers to that question. A theoretical automated answer and a practical hand-done answer. I'm going to start with the answer I suspect you care about less, the automated theoretical one and dispense quickly with it. Taking a set of examples and generating a grammar from itusing an algorithm is a field of research, which I think is called roughly "machine learning". There are papers in that field that give results positive and negative, saying what you can and cannot make a machine learn from a set of examples. The results are mostly the kinds of things one would expect. As I recall: You can learn a finite regular expression grammars with no closure operators (a* or a+) from positive examples only (i.e. this is legal). Added negative examples (i.e. these are illegal), you can learn regular expressions including closures. I think its been shown that there are some CFGs you cannot learn with only positive and negative examples alone, you can only learn a regular expression that matches a superset of the CFG. I cannot recall where the deterministic CFGs fit in the learnability specturm. But, again, I don't this this is your real question. In a practical sense, it is possible and usually actually even easy to construct a grammar (by hand) from examples, as long as one has some concept of the language. The easiest (and actualy most common case) is that you have a grammar for a similar language accessible. If so, you are only trying to cover the differences. However, since that case is covered in the general case, I will skip it, and assume you are starting from "first principles". The first thing one needs is a "vocabulary", a list of things one is talking about. This natuarally breaks down into two parts. You can do either part first, so I'm picking ar order that is convenient for me to talk about. The first part of the vocabulary is the tokens or terminals. These are the atomic parts of your grammar. You problably have identifiers and keywords and numbers and operators and whitespace and similar kinds of things in your grammar. So, make a list of all these things are find examples of each of them. Don't worry about exactly how to define them yet, just assume that you will be able to write a lexer that can find them. When you get done with the whole process of defining your language grammar you can roughly repeat the process for the tokens and get a lexical grammar for them. More importantly, it is even more likely that you will be able to find an existing lexer grammar for your set of tokens, than it is to find a grammar that describes your whole language. In fact, as I am defining new grammars for new small tools that I design from time to time, I essentially use one lexer grammar with some small permutations for all the tools. It is so trivial, that I never even think about the minor variations that I introduce. Thay are just "obvious", e.g. the list of keywords changes. (Shameless plug here, I believe that is partially true because the lexer notation we defined in Yacc++ is particuarly natural for doing that. We actually spent time thinking about that. However, we aren't the only ones that have noticed that and there are some tools that essentially have pre-canned lexers built into them, that you just customize to get the desired result. A lot of the scannerless tools go this way, because you can then sweep the ugly parts into the predefined sections and paper over them so users don't notice what one has done. We don't do that, but we do hide some of the ugliness in default behaviors one can override, e.g. certain error checks are turned off by default at the lexical level (for example, shift-reduce conflicts between tokens that end in a star, because the longest match rule implies always shift), but one can turn them back on.) In any case, you have your list of atomic tokens. You then make a list of the non-terminals, the structures your language is composed of. You obviously have a "grammar", "program", of "file" non- terminal, which represents the largest chunk you are parsing at a time. That is your start symbol. Your language probably has sub-groupings within that, like blocks, statements, and expressions. Again, make a list of what they are and group the examples accordingly. Once, you have this done, the process is simply a case of taking each non-terminal and looking at the examples and writing down the rule(s) that turn those examples into non-terminals. In the process, you may find you need to introdeuce additional non-terminals are tokens for things you missed, but that isn't a major issue. Let's look at an example of doing that. Non-terminal: statement Examples: a := b + c; if x = y then a := 1; First guess at rules: statement: variable ":=" variable "+" variable ";" | if variable "=" variable then variable ":=" constant ";" ; Now, we probably quickly realize that we want to use an expression non-terminal to capture the various forms that part can take, so we change the grammar to look like. statement: variable ":=" expression ";" | if boolean_expression then variable ":=" expression ";" // see the section after deubggin on my comment // on using boolean_expression here ; We may also decide that we want to distinguish our types of statements (i.e. new non-terminals), resulting in the following grammar. statement: assignment_statement | if_statement ; assignment_statement: variable ":=" expression ";" ; if_statement: if boolean_expression then variable ":=" expression ";" ; Next, we may realizes from the examples that if_statements take a statement after the then clause, resulting in: if_statement: if boolean_expression then statement ; When, you have finished that exercise you have a grammar for your language. Worth noting, is that that are often certain things one wants in a language (likes lists separated by semicolons, certain phrases which are optional). Those things often have an idiomatic expression in the grammar notion of choice. In Yacc++, we like regular expressions for mnay of the idioms and have a tutorial chapter in our documentation that is simply a cookbook of some of the more common idioms and the corresponding regular expression. In normal yacc grammars, you will find the same idioms described by sets of (often recursive) productions where some helper nonterminals are added with names like expression_list. The _list says it is part of an idiom describing lists of expressions. In either case, one learns the idiomatic expressions and uses them, because they often have "good properties". (BTW, I've never seen the corresponding yacc cookbook (and have no interest in writing it), but if you read yacc grammars, you will slowly notice the conventions.) This gets us to the next point. Above I said you have a grammar, but the grammar may or may not be suitable to your tool. If you've written lots of grammars and have developed an intuition for how to avoid problems, it may be acceptable as is. However, you need to test it by running the grammar through your tool and seeing if the tool generates errors. If so, your language isn't acceptible to your tool, and may even have flaws where it doesn't even describe what you want. Even if your grammar is acceptible to your tool, you need to test it to make certain that it is actually what you want, by building a small copy of you application that just lexes and parses input and running relevant examples through to see if the examples are turned into the right kind of parse tree (AST). If your grammar isn't acceptible (or parses something incorrectly), one has come to the hard part, debugging. If the grammar parses something incorrectly, one needs to trace the grammar and see where the parser has categorized something incorrectly, and fix that rule. Sometimes the problem is that the non-terminal you captured is context dependent and in one context some phrase is legal, but in another context the phrase is not. In that case, you may need to split the non-terminal into two distinct non-terminals (one for each context). Alternately, you can have the reverse problem, where you have two different non-terminals and they both cover some case and if the parser picks the wrong one to apply it misses the possibility of the other non-terminal and either mis-recognizes the result or refues to recognize a valid case. Note that if there are two non-terminals that are "similar" (the rules for similarity differe for the different grammar classes, e.g. LL versus LR) and can both apply at a particular point in a parse, the tool will (should) catch this problem and give you a "conflict" message. This message is to warn you that the resulting parser may not do what you expect for some inputs. (This is one of my arguments for using a tool or hand-written recursive descent, even if you want a recursive descent output. Tools can perform correctness checks to help you not make mistakes. By hand, you have to find those problems "some other way".) Of course, there are more problems and fixes than just splitting and comibning rules--that's where the idiomatic expressions help. They are often ways of writing a particular set of rules to describe a particular language problem, such that the tool will likely find it acceptible. For example, the rules on left-recursion (not if you want LL), and/or right-recursion (can be "expensive" at run-time in LR) are rules to avoid common problems. Debugging and changing your grammar to accept exactly the language you want is hard. Grammars can do some things well and easily. Other things can take bizarre non-obvious sets of rules to describe. And, somethings, cannot be desribed in practical grammars at all. In fact, the use of boolean_expression above was exactly such a case. Grammars are not good at enforcing type systems. Sometimes one can do it, but it takes lots of extra rules to do so. Other times it cannot be done at all. Moreover, even when one can do it and is will to have extra rules, the resulting grammar is fragile and hard to modify. When one encounters such problem things in a grammar, the rule of thumb is to make the grammar accept invalid programs (generalize what you accept) and then add semantic checks to your compiler that catches the invalid cases. Thus, in the above case a better grammar would have been: if_statement: if expression then statement ; // type check that expression returns a boolean Second shamelss plug, a few years back I wrote for Sigplan NOTICES, a serious of columns on resolving some of the grammar problems and tricky cases. The columns try to capture some of the intuition and rules of thumb that aren't always described in textbooks. (The column died because I ran out of time to write it, but someday I may get back to it, because I didn't run out of material.) If you are seriously interested in writing grammars, I recommend reading those columns, there are about 10 of them, and they are only about 5 pages each. (They all read something like this post, because that's my writing style, for better or worse.) In any case, Hope this helps, -Chris ************************************************** **************************** Chris Clark Internet: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. or: compres@world.std.com 23 Bailey Rd Web Site: http://world.std.com/~compres Berlin, MA 01503 voice: (508) 435-5016 USA fax: (978) 838-0263 (24 hours) |
|
#4
| |||
| |||
| Chris, thanks for your extensive reply. I only have very few comments below. Chris F Clark wrote: > Michael Schuerig <michael@schuerig.de> writes: > >> So, I'm wondering, is there a systematic way to get from a suitable >> list of language samples, i.e, without a formal definition to start >> with, to a grammar? > > There are two answers to that question. A theoretical automated > answer and a practical hand-done answer. I'm going to start with the > answer I suspect you care about less, the automated theoretical one > and dispense quickly with it. Taking a set of examples and generating > a grammar from itusing an algorithm is a field of research, which I > think is called roughly "machine learning". Indeed, that's not what I have in mind, but I see how my wording can lead to that interpretation. Substitute my "suitable list of language samples" with "informal understanding of the language to be recognized". > This gets us to the next point. Above I said you have a grammar, but > the grammar may or may not be suitable to your tool. If you've > written lots of grammars and have developed an intuition for how to > avoid problems, it may be acceptable as is. However, you need to test > it by running the grammar through your tool and seeing if the tool > generates errors. If so, your language isn't acceptible to your tool, > and may even have flaws where it doesn't even describe what you want. > Even if your grammar is acceptible to your tool, you need to test it > to make certain that it is actually what you want, by building a small > copy of you application that just lexes and parses input and running > relevant examples through to see if the examples are turned into the > right kind of parse tree (AST). That intuition part is what I'm really after. I don't expect to write that many grammars in my life and therefore my intuition may always be lacking. That's why I'm interested in a systematic approach that codifies some of the intuition and experience of others. If I understand you correctly, your advice (snipped in this reply) would be to first identify the words of the language (lexer), then work top-down from sentence level. > Hope this helps, > -Chris Yes, thanks a lot for your effort! Michael -- Michael Schuerig mailto:michael@schuerig.de http://www.schuerig.de/michael/ |
|
#5
| |||
| |||
| Michael Schuerig <michael@schuerig.de> writes: > If I understand you correctly, your advice (snipped in this reply) would > be to first identify the words of the language (lexer), then work > top-down from sentence level. Actually, one other piece of advice that you missed, probably because I didn't state it clearly. Look at (and for) other grammars that might have features similar to your grammar. There are lots of solved problems in the language world, complete with grammars. I just want to list some idioms that are interesting and places to look. include files -- find a C preprocessor grammar this idiom is so useful we put some special features in Yacc++ to make it easy macros -- again look at a C preprocessor grammar reserved words (also called keywords) usually tool specific, but knowing how to integrate a symbol table into the grammar is a big plus here, also can be done in the lexer with a completely different technique keywords that can also be identifiers -- find a PL/I grammar there is a simple solution for this that most people don't get on the other hand, it can be a dubious language feature typed names (aka name mangling) -- look into C++ not every feature is done in the grammar scoped names -- Algol pioneered this another case where the work doesn't necessarily involve the grammar block structure via indentation -- look at Haskell (and Python, I think) Reading an SQL grammar, both for the "base" language and for some particular implementation is good. I would also read a BASIC grammar, the original language, the ANSI standard, and for Microsoft's latest flavor if I could get them. Oh, and before all of the above, I would read a Pascal grammar. The language isn't perfect, but the grammar is simple and clear, designed to be learned from. Also, most enlightening, comparing the ANSI C grammar to one done with precedence. It shows why precedence is much more terse (and at the same time more clear), but you also want to know the way to do nested operators by expressions, especially if you are going to use a tool like ANTLR or JavaCC which doesn't implement precedence directives. Even with all those suggestions, I think one needs to read only about 3-5 grammars to get the basic drift, especially if one has 3-5 languages which are close to the one wants to implement. In reality there are probably half-a-dozen to a dozen idioms one needs to know. Moreover, many of them are so obvious, that they fall-out naturally. A good example of this is the if-then-else idiom. In theory, this isn't LL. In fact, one solution to that problem is the use of if-then-else-fi, which does have an LL solution. However, it turns out that there is a natural way to hack if-then-else into a recursive descent parser that most generators accept and people use without even thinking about. And, there is a corresponding short-cut idiom for LR parsers that has the same effect, although there is a precise way to write the code to be correct for LR. In any case, the devil is in the details, as they say. Most parts of writing grammars isn't hard, and starting top-down is a good way to work. It's when you get to the tricky part of the language, that you need to find a clever solution, and those solutions often exist in a grammar somewhere, but they haven't yet gotten cast into patterns. It's just that there isn't a pattern community for parser writing, partially because the community is fragmented by the different tools, because the idioms for ANTLR are often different than those for yacc. Well, I dragged this point out enough, still I Hope this helps, -Chris ************************************************** **************************** Chris Clark Internet: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. or: compres@world.std.com 23 Bailey Rd Web Site: http://world.std.com/~compres Berlin, MA 01503 voice: (508) 435-5016 USA fax: (978) 838-0263 (24 hours) ------------------------------------------------------------------------------ |
|
#6
| |||
| |||
| On Sun, 06 Jul 2008 15:12:08 -0400, Chris F Clark <cfc@shell01.TheWorld.com> wrote: <snip> >I would also read a BASIC grammar, the original language, the ANSI >standard, and for Microsoft's latest flavor if I could get them. The ANSI standard is almost painful to read, and I have never seen a published grammer for any of the MS basics. <snip> -- ArarghMail806 at [drop the 'http://www.' from ->] http://www.arargh.com BCET Basic Compiler Page: http://www.arargh.com/basic/index.html [The original Dartmouth Basic was designed to be trivially parsable with simple techniques, e.g., each statement type started with a different letter. I doubt there's a grammar for MS Basic, which looks like the result of way too much adhoc code in the parser. -John] |
|
#7
| |||
| |||
| "Michael Schuerig" <michael@schuerig.de> replied to Chris F Clark: > That intuition part is what I'm really after. I don't expect to write > that many grammars in my life and therefore my intuition may always be > lacking. That's why I'm interested in a systematic approach that > codifies some of the intuition and experience of others. It may or may not be what you're looking for, but some years ago I wrote a small book on parsing with Visual Parse++ for Sand Stone, and the second chapter of the book goes into some detail on the craft of extracting a grammar from sometimes informal sources. You can grab the Word and PDF for that here: http://www.sand-stone.com/pwvp.ZIP As it says in the introduction to the second chapter: "... parsing is a predictable mathematical process, and the generation of a parser from a grammar specification is carried out in a provably reproducible fashion; however, the process of converting a loose language specification into a suitable grammar can be considered as much of a craft as a science. As with all crafts, the experience and insight of the craftspeople involved in the process have considerable bearing on the outcome." Having written full grammars for C++, Perl, Lua, C#, and SQL92, (and various proprietary langauges) in some cases in more than one formalism, and in some cases from only the loosest of what could be called "specifications" or "standards" -- I'll say that it's not so much about "intuition" as it is about learning the idioms. Where "intuition" comes in is really a matter of having an ability to generalize and recognize patterns in various languages. So, Chris' followup advice: "Look at (and for) other grammars that might have features similar to your grammar. There are lots of solved problems in the language world, complete with grammars." is very probably a fountain of a lot of white-midnights-avoided. Were I to write that section of the book again, I would add the statement: "Probably the most important ability when turning language examples into a correct grammar is the ability to examine positive examples and make the cognitive leap from the specific to the general." Which to say, compact and correct generative grammars derived from informal sources are largely about the grammar author's ability to abstract and generalize. All else has a strong potential to lead to ad hockery and warts, IMO. Just one person's opinion, Quinn |
![]() |
| Thread Tools | |
| Display Modes | |
In an effort to better serve ads to our visitors, cookies are used on objectmix.com. For more information, check out our Privacy Policy.