| Register | FAQ | Calendar | Search | Today's Posts | Mark Forums Read |
|
#1
| |||
| |||
| I'm trying to write a basic parser for S-expressions, basic list structures used in Lisp. This is mainly to help me understand Lex and Yacc. Since Yacc is meant to be good for recursive balanced structures, I thought it would be an easy project. I did find it easy to write a grammar that recognizes the correct form for an S-expression, however, I'm having a massive mental block when it comes to putting the data into a form in which it can be manipulated. Basically I want to convert the list syntax into an in-memory equivalent, using pairs (basically a linked list). I've managed to parse a flat list into a standard linked list, but I'm having problems seeing how I can extend this to do arbitrarily nested lists. The only way I can think is to maintain my own stack, which seems to defeat the point of using yacc. There must be a way to leverage yacc's own recursion to build the data structures, but I just can't think of one. Here's my lexer (change yylval.v to yylval.s to work with the second parser): /* list lexer */ %{ #include "list.tab.h" %} %% \( return OPEN; \) return CLOSE; [a-z]+ { yylval.v = yytext; /* for identifier */ return SYMBOL; } [[:space:]] /* silently eat */ %% I have two attempts at the parser; one has the proper types for how I'd want the end program to look, but I can't figure out how to get it into a working state; the other works, but only does half the job (only builds a flat non- delimited list). Here's the first. At the moment the grammar recognizes just single pairs, but it doesn't work because it consumes both identifiers for each match to the 'pair' rule; I would want it to consume the first, then match again with the second identifier now becoming the first. In this way it might be possible to build the linked list as we walked down the list. /* list parser */ %{ #include <stdio.h> #include <stdlib.h> #include <string.h> enum type { TYPE_SYMBOL, TYPE_PAIR }; struct object { enum type t; union { char *symbol; struct pair *pair; }; }; struct pair { struct object *car; struct object *cdr; }; struct object *symbol(char *value); struct object *pair(struct object *car, struct object *cdr); void write(struct object *obj); void *xmalloc(size_t size); void fatal(); %} %union { char *v; struct object *o; } %token OPEN CLOSE INVALID %token <v> SYMBOL %type <o> identifier pair %% input: /* empty */ | input pair { printf("writing pair:\n"); write($2); printf("\n"); } pair: identifier identifier { printf("pair read\n"); printf("$1 = '%s'\n", $1->symbol); printf("$2 = '%s'\n", $2->symbol); $$ = pair($1, $2); } identifier: SYMBOL { printf("creating symbol: %s\n", $1); $$ = symbol($1); } %% struct object *symbol(char *value) { struct object *ret; ret = xmalloc(sizeof(struct object)); ret->t = TYPE_SYMBOL; ret->symbol = strdup(value); return ret; } struct object *pair(struct object *car, struct object *cdr) { struct object *ret; ret = xmalloc(sizeof(struct object)); ret->t = TYPE_PAIR; ret->pair = xmalloc(sizeof(struct pair)); ret->pair->car = car; ret->pair->cdr = cdr; return ret; } void write(struct object *obj) { switch (obj->t) { case TYPE_SYMBOL: printf("%s", obj->symbol); break; case TYPE_PAIR: printf("("); write(obj->pair->car); printf(" . "); write(obj->pair->cdr); printf(")"); break; default: fatal(); } } void *xmalloc(size_t size) { void *ret = malloc(size); if (ret == NULL) fatal(); return ret; } void fatal() { perror("parser"); exit(1); } Here's the second. As you can see, the 'current' node is global, so it can only ever support a flat list. /* list parser */ %{ #include <stdio.h> #include <stdlib.h> #include <string.h> struct node { char *item; struct node *next; }; struct node *first; struct node *current; void walk(struct node *root); void *xmalloc(size_t size); void fatal(); %} %union { char *s; } %token <s> SYMBOL %% input: /* empty */ | input SYMBOL { printf("encountered symbol: %s\n", $2); if (first == NULL) { puts("mallocing first node"); first = xmalloc(sizeof(struct node)); first->item = strdup($2); } if (current == NULL) { puts("setting current = first"); current = first; } else { struct node *n = xmalloc(sizeof(struct node)); n->item = strdup($2); n->next = NULL; // for the moment current->next = n; current = n; } } %% int main() { yyparse(); walk(first); } void walk(struct node *root) { puts("walking linked list"); do printf("item: %s\n", root->item); while (root = root->next); } void *xmalloc(size_t size) { void *ret = malloc(size); if (ret == NULL) fatal(); return ret; } void fatal() { perror("parser"); exit(1); } And this is the action-free grammar that seems to reliably match the expressions I want to parse, but I can't seem to retrofit the structure building onto it: input: /* empty */ | input list list: OPEN expressions CLOSE expressions: /* empty */ | expressions expression; expression: identifier | list identifier: SYMBOL (I'm using the default main() that gets used when compiling with -ly - ll.) I apologize for the length of this post. I'd be very grateful for a hint. Thanks, David |
|
#2
| |||
| |||
| David <amoebae@gmail.com> writes: > I'm trying to write a basic parser for S-expressions, basic list > structures used in Lisp. This is mainly to help me understand Lex and > Yacc. Since Yacc is meant to be good for recursive balanced > structures, I thought it would be an easy project. I did find it easy > to write a grammar that recognizes the correct form for an > S-expression, however, I'm having a massive mental block when it comes > to putting the data into a form in which it can be manipulated. > Basically I want to convert the list syntax into an in-memory > equivalent, using pairs (basically a linked list). When manipulating a list, one often needs both access to the head and the tail, because you insert the head into a list above it, but you append things to the tail of the list. So, imagine that your parser at each level returned a struct that had head and tail of the underlying list (or if it is a single elt, both head and tail point to the same thing). See, if that model doesn't allow you to add code to your third grammar. For, the first version of the code don't worry about where you allocate and free such structs, just that you have as many as you need. After to you get that version working, the management of these return value structs should be relatively obvious. Hope this helps, -Chris ************************************************** **************************** Chris Clark Internet: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. or: compres@world.std.com 23 Bailey Rd Web Site: http://world.std.com/~compres Berlin, MA 01503 voice: (508) 435-5016 USA fax: (978) 838-0263 (24 hours) |
|
#3
| |||
| |||
| Chris F Clark <cfc@shell01.TheWorld.com> writes: > David <amoebae@gmail.com> writes: > >> I'm trying to write a basic parser for S-expressions, basic list >> structures used in Lisp. ... > > When manipulating a list, one often needs both access to the head and > the tail, because you insert the head into a list above it, but you > append things to the tail of the list. So, imagine that your parser > at each level returned a struct that had head and tail of the > underlying list (or if it is a single elt, both head and tail point to > the same thing). See, if that model doesn't allow you to add code to > your third grammar. ... Although you are quite correct that this techinque of tracking list tails for mutation can be useful in general, I don't think David needs to go to that length. He can stick with straightforward functional construction of the lists, using actions like $$ = pair($1,$2), where "pair" is his constructor analogous to Lisp's "cons". All he needs to do is get the underlying grammar right; his problem was less with the datastructure and actions than it was with the grammar. (Although his code does suggest he might have misunderstood the datastructure by not recognizing that the cdr of a nonempty list is itself always a list, one element shorter.) His grammar (which is simpilified relative to any full Lisp) would say (1) An s-expression is either a symbol or a list. (2) A list always starts with an open paren; we can call what follows that the tail. (3) A tail is either (a) just a close paren (forming an empty list), or (b) an s-expression followed by a further tail (forming a nonempty list). In 3b, the car and cdr of the nonempty list are already available by the time the grammar production is reduced, so there is no reason not to construct the nonempty list (i.e., the pair) in a functional fashion, rather than by mutation. As David suggested, this is making good use of yacc's stack. The list is being built up starting with an empty list and tacking elements onto the front, starting with the last element and continuing forward until the first element is tacked on -- the opposite of the order in which the elements are read. That reversal of order is coming from yacc's stack, and explains why there is no need to be mutating the tail end of a list. Yacc's stack also addresses David's other concern, the ability to nest lists within lists as the constituent elements. That is achieved just by having 3b use an s-expresion as the car of the nonempty list, rather than only a symbol. Of course, this requires that the lists are delimited by the open and close parentheses, whereas David's grammar productions omitted those (though they were declared as terminals and recognized by the lexical analyzer). -Max Hailperin Professor of Computer Science Gustavus Adolphus College |
|
#4
| |||
| |||
| > As David suggested, this is making good use of yacc's stack. The list > is being built up starting with an empty list and tacking elements > onto the front, starting with the last element and continuing forward > until the first element is tacked on -- the opposite of the order in > which the elements are read. That reversal of order is coming from > yacc's stack, and explains why there is no need to be mutating the > tail end of a list. > > Yacc's stack also addresses David's other concern, the ability to nest > lists within lists as the constituent elements. That is achieved just One practical consideration here:- The yacc geenrated code has a #define YYMAXDEPTH (=150, but modifiable) which could be a limiting factor for functional languages. I mean, unlike procedural languages -they have recursive functions which are evaluated lazily. regards -kamal [I doubt it'll be a problem. Remember that the yacc stack is used for parsing the input, not for running the resulting program, and the stack just tracks the unconsumed tokens and symbols. It'd be a pretty dense program that needed more than that. -John] |
|
#5
| |||
| |||
| Thanks for the responses, everyone. I was a little worried about an artificial limitation on the nesting-level of sexps being imposed by the use of right recursion. Having read the docs for bison (which I'm actually using), it looks like this won't come up in practice - and for this project I'd rather keep it simple. However, I'd be interested to know - can Max's solution be converted to a left-recursive one that can manually allocate arbitrary hunks of memory for new tokens? How much more complicated would that solution be, and what would be the general strategy? |
|
#6
| |||
| |||
| amoe <amoebae@gmail.com> writes: ..... > However, I'd be interested to know - can Max's solution be converted > to a left-recursive one that can manually allocate arbitrary hunks of > memory for new tokens? How much more complicated would that solution > be, and what would be the general strategy? Let me start by paraphrasing your question, to make sure I'm actually answering what you are asking. If I understand you correctly, you are asking about leaving unchanged the way nesting is handled -- the ability of a list-element to itself be a list, as in (((a))) -- and just changing how the linear decomposition of each level of list is handled, in which the list is broken down into one element and the rest -- the part that is relevant in a situation like (a b c). You want to have the decomposition within the grammar be the reverse of that in the data structure. The grammar would take the viewpoint (a b c) = (a b) + c but the data strcuture would be based on (a b c) = a + (b c) as it is in Lisp. There are two ways you could do that. One was already suggested by Chris Clark earlier in this thread: build the data structure from left to right, at each step modifying the last pair to reference a new pair. That is, you would start out with the list (a), then by mutating its last pair turn it into (a b), then by mutating the new last pair turn it into (a b c). The other approach may be simpler for you to fit into your existing infrastructure; build the list in reverse order and then at the end reverse it. That is, you would use ordinary constructor operations to build up from (a) to (b a) to (c b a), and then once the list is complete, you would flip it around to (a b c). In my earlier post, I had said > His grammar (which is simpilified relative to any full Lisp) would say > (1) An s-expression is either a symbol or a list. > (2) A list always starts with an open paren; we can call what follows > that the tail. > (3) A tail is either > (a) just a close paren (forming an empty list), or > (b) an s-expression followed by a further tail (forming a > nonempty list). For ease of switching this around to the left-recursive form you are now asking about, first recognize that it could be rewritten as follows, while remaining right-recursive: (1) An s-expression is either a symbol or a list. (2) A list always starts with an open paren and ends with a close paren; we can call what comes in between them the body. (3) A body is either (a) just the empty string (forming an empty list), or (b) an s-expression followed by a further body (forming a nonempty list). Now, to make it left-recursive, you would just modify part (3): (3) A body is either (a) just the empty string (forming an empty list), or (b) a body followed by an s-expression (forming a nonempty list). However, the "forming a nonempty list" part is now where life gets interesting, because the pair constructor tacks an element onto the front of a list, not the back. As I said, Chris Clark's approach of modifying the tail end of the list is indeed reasonable. But the reason I showed the modified grammar was to point out that my alternative suggestion of building the list in reverse order and then reversing it once it is complete fits quite naturally into the grammar, as well as not requiring any fanciness at the data-structure level. Namely, notice that we've already got two separate nonterminals for "a list" versus "a body". This positions you perfectly for having the synthesized attribute value for a list be in the left-to-right order, (a b c) in the example, while the synthesized attribute value for the body is in the right-to-left order, (c b a). There are natural places to hang the two different kinds of actions you need, one to build the list using your pair constructor and one to do the list reversal. The same technique can be used even in grammars for non-parenthesized lists, which don't naturally provide two separate nonterminals -- you just have to recognize the need for introducing an additional nonterminal with the trivial production A -> B solely for the sake of providing a setting for the reversal action. But here the setting is already in place, just waiting to be used. |
![]() |
| Thread Tools | |
| Display Modes | |
In an effort to better serve ads to our visitors, cookies are used on objectmix.com. For more information, check out our Privacy Policy.