Parsing nested structures with yacc

This is a discussion on Parsing nested structures with yacc within the Compilers forums in Theory and Concepts category; I'm trying to write a basic parser for S-expressions, basic list structures used in Lisp. This is mainly to help me understand Lex and Yacc. Since Yacc is meant to be good for recursive balanced structures, I thought it would be an easy project. I did find it easy to write a grammar that recognizes the correct form for an S-expression, however, I'm having a massive mental block when it comes to putting the data into a form in which it can be manipulated. Basically I want to convert the list syntax into an in-memory equivalent, using pairs (basically a ...

Go Back   Application Development Forum > Theory and Concepts > Compilers

Object Mix

Register FAQ Calendar Search Today's Posts Mark Forums Read
  #1  
Old 08-07-2008, 12:59 PM
amoe
Guest
 
Default Parsing nested structures with yacc

I'm trying to write a basic parser for S-expressions, basic list
structures used in Lisp. This is mainly to help me understand Lex and
Yacc. Since Yacc is meant to be good for recursive balanced
structures, I thought it would be an easy project. I did find it easy
to write a grammar that recognizes the correct form for an
S-expression, however, I'm having a massive mental block when it comes
to putting the data into a form in which it can be manipulated.
Basically I want to convert the list syntax into an in-memory
equivalent, using pairs (basically a linked list).

I've managed to parse a flat list into a standard linked list, but I'm
having problems seeing how I can extend this to do arbitrarily nested
lists. The only way I can think is to maintain my own stack, which
seems to defeat the point of using yacc. There must be a way to
leverage yacc's own recursion to build the data structures, but I just
can't think of one.

Here's my lexer (change yylval.v to yylval.s to work with the second
parser):

/* list lexer */
%{
#include "list.tab.h"
%}

%%
\( return OPEN;
\) return CLOSE;
[a-z]+ {
yylval.v = yytext; /* for identifier */
return SYMBOL;
}
[[:space:]] /* silently eat */
%%


I have two attempts at the parser; one has the proper types for how
I'd want the
end program to look, but I can't figure out how to get it into a
working state;
the other works, but only does half the job (only builds a flat non-
delimited
list).

Here's the first. At the moment the grammar recognizes just single
pairs, but it
doesn't work because it consumes both identifiers for each match to
the 'pair'
rule; I would want it to consume the first, then match again with the
second
identifier now becoming the first. In this way it might be possible
to build
the linked list as we walked down the list.


/* list parser */
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum type {
TYPE_SYMBOL, TYPE_PAIR
};

struct object {
enum type t;
union {
char *symbol;
struct pair *pair;
};
};

struct pair {
struct object *car;
struct object *cdr;
};


struct object *symbol(char *value);
struct object *pair(struct object *car, struct object *cdr);
void write(struct object *obj);
void *xmalloc(size_t size);
void fatal();
%}

%union {
char *v;
struct object *o;
}

%token OPEN CLOSE INVALID
%token <v> SYMBOL
%type <o> identifier pair

%%

input: /* empty */
| input pair {
printf("writing pair:\n");
write($2);
printf("\n");
}

pair: identifier identifier {
printf("pair read\n");
printf("$1 = '%s'\n", $1->symbol);
printf("$2 = '%s'\n", $2->symbol);
$$ = pair($1, $2);
}

identifier: SYMBOL {
printf("creating symbol: %s\n", $1);
$$ = symbol($1);
}


%%

struct object *symbol(char *value) {
struct object *ret;
ret = xmalloc(sizeof(struct object));

ret->t = TYPE_SYMBOL;
ret->symbol = strdup(value);

return ret;
}

struct object *pair(struct object *car, struct object *cdr) {
struct object *ret;

ret = xmalloc(sizeof(struct object));

ret->t = TYPE_PAIR;
ret->pair = xmalloc(sizeof(struct pair));

ret->pair->car = car;
ret->pair->cdr = cdr;

return ret;
}

void write(struct object *obj) {
switch (obj->t) {
case TYPE_SYMBOL:
printf("%s", obj->symbol);
break;
case TYPE_PAIR:
printf("(");
write(obj->pair->car);
printf(" . ");
write(obj->pair->cdr);
printf(")");
break;
default:
fatal();
}
}

void *xmalloc(size_t size) {
void *ret = malloc(size);
if (ret == NULL) fatal();
return ret;
}

void fatal() {
perror("parser");
exit(1);
}


Here's the second. As you can see, the 'current' node is global, so
it can only
ever support a flat list.


/* list parser */
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct node {
char *item;
struct node *next;
};

struct node *first;
struct node *current;

void walk(struct node *root);
void *xmalloc(size_t size);
void fatal();
%}

%union {
char *s;
}

%token <s> SYMBOL

%%

input: /* empty */
| input SYMBOL {
printf("encountered symbol: %s\n", $2);

if (first == NULL) {
puts("mallocing first node");
first = xmalloc(sizeof(struct node));
first->item = strdup($2);
}

if (current == NULL) {
puts("setting current = first");
current = first;
} else {
struct node *n = xmalloc(sizeof(struct node));
n->item = strdup($2);
n->next = NULL; // for the moment

current->next = n;
current = n;
}
}

%%

int main() {
yyparse();
walk(first);
}

void walk(struct node *root) {
puts("walking linked list");

do printf("item: %s\n", root->item);
while (root = root->next);
}

void *xmalloc(size_t size) {
void *ret = malloc(size);
if (ret == NULL) fatal();
return ret;
}

void fatal() {
perror("parser");
exit(1);
}


And this is the action-free grammar that seems to reliably match the
expressions I want to parse, but I can't seem to retrofit the
structure building onto it:


input: /* empty */
| input list

list: OPEN expressions CLOSE

expressions: /* empty */
| expressions expression;

expression: identifier | list

identifier: SYMBOL


(I'm using the default main() that gets used when compiling with -ly -
ll.)

I apologize for the length of this post. I'd be very grateful for a
hint.

Thanks, David
Reply With Quote
  #2  
Old 08-08-2008, 10:01 AM
Chris F Clark
Guest
 
Default Re: Parsing nested structures with yacc

David <amoebae@gmail.com> writes:

> I'm trying to write a basic parser for S-expressions, basic list
> structures used in Lisp. This is mainly to help me understand Lex and
> Yacc. Since Yacc is meant to be good for recursive balanced
> structures, I thought it would be an easy project. I did find it easy
> to write a grammar that recognizes the correct form for an
> S-expression, however, I'm having a massive mental block when it comes
> to putting the data into a form in which it can be manipulated.
> Basically I want to convert the list syntax into an in-memory
> equivalent, using pairs (basically a linked list).


When manipulating a list, one often needs both access to the head and
the tail, because you insert the head into a list above it, but you
append things to the tail of the list. So, imagine that your parser
at each level returned a struct that had head and tail of the
underlying list (or if it is a single elt, both head and tail point to
the same thing). See, if that model doesn't allow you to add code to
your third grammar. For, the first version of the code don't worry
about where you allocate and free such structs, just that you have as
many as you need. After to you get that version working, the
management of these return value structs should be relatively obvious.

Hope this helps,
-Chris

************************************************** ****************************
Chris Clark Internet: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. or: compres@world.std.com
23 Bailey Rd Web Site: http://world.std.com/~compres
Berlin, MA 01503 voice: (508) 435-5016
USA fax: (978) 838-0263 (24 hours)
Reply With Quote
  #3  
Old 08-09-2008, 08:49 AM
Max Hailperin
Guest
 
Default Re: Parsing nested structures with yacc

Chris F Clark <cfc@shell01.TheWorld.com> writes:

> David <amoebae@gmail.com> writes:
>
>> I'm trying to write a basic parser for S-expressions, basic list
>> structures used in Lisp. ...

>
> When manipulating a list, one often needs both access to the head and
> the tail, because you insert the head into a list above it, but you
> append things to the tail of the list. So, imagine that your parser
> at each level returned a struct that had head and tail of the
> underlying list (or if it is a single elt, both head and tail point to
> the same thing). See, if that model doesn't allow you to add code to
> your third grammar. ...


Although you are quite correct that this techinque of tracking list
tails for mutation can be useful in general, I don't think David needs
to go to that length. He can stick with straightforward functional
construction of the lists, using actions like $$ = pair($1,$2), where
"pair" is his constructor analogous to Lisp's "cons". All he needs to
do is get the underlying grammar right; his problem was less with the
datastructure and actions than it was with the grammar. (Although his
code does suggest he might have misunderstood the datastructure by not
recognizing that the cdr of a nonempty list is itself always a list,
one element shorter.)

His grammar (which is simpilified relative to any full Lisp) would say

(1) An s-expression is either a symbol or a list.

(2) A list always starts with an open paren; we can call what follows
that the tail.

(3) A tail is either
(a) just a close paren (forming an empty list), or
(b) an s-expression followed by a further tail (forming a nonempty
list).

In 3b, the car and cdr of the nonempty list are already available by
the time the grammar production is reduced, so there is no reason not
to construct the nonempty list (i.e., the pair) in a functional
fashion, rather than by mutation.

As David suggested, this is making good use of yacc's stack. The list
is being built up starting with an empty list and tacking elements
onto the front, starting with the last element and continuing forward
until the first element is tacked on -- the opposite of the order in
which the elements are read. That reversal of order is coming from
yacc's stack, and explains why there is no need to be mutating the
tail end of a list.

Yacc's stack also addresses David's other concern, the ability to nest
lists within lists as the constituent elements. That is achieved just
by having 3b use an s-expresion as the car of the nonempty list,
rather than only a symbol. Of course, this requires that the lists
are delimited by the open and close parentheses, whereas David's
grammar productions omitted those (though they were declared as
terminals and recognized by the lexical analyzer).

-Max Hailperin
Professor of Computer Science
Gustavus Adolphus College

Reply With Quote
  #4  
Old 08-11-2008, 05:49 AM
kamal
Guest
 
Default Re: Parsing nested structures with yacc

> As David suggested, this is making good use of yacc's stack. The list
> is being built up starting with an empty list and tacking elements
> onto the front, starting with the last element and continuing forward
> until the first element is tacked on -- the opposite of the order in
> which the elements are read. That reversal of order is coming from
> yacc's stack, and explains why there is no need to be mutating the
> tail end of a list.
>
> Yacc's stack also addresses David's other concern, the ability to nest
> lists within lists as the constituent elements. That is achieved just


One practical consideration here:-
The yacc geenrated code has a #define YYMAXDEPTH (=150, but
modifiable) which could be a limiting factor for functional languages.
I mean, unlike procedural languages -they have recursive functions
which are evaluated lazily.

regards
-kamal
[I doubt it'll be a problem. Remember that the yacc stack is used for
parsing the input, not for running the resulting program, and the
stack just tracks the unconsumed tokens and symbols. It'd be a pretty
dense program that needed more than that. -John]

Reply With Quote
  #5  
Old 08-19-2008, 11:20 AM
amoe
Guest
 
Default Re: Parsing nested structures with yacc

Thanks for the responses, everyone. I was a little worried about an
artificial limitation on the nesting-level of sexps being imposed by
the use of right recursion. Having read the docs for bison (which I'm
actually using), it looks like this won't come up in practice - and
for this project I'd rather keep it simple.

However, I'd be interested to know - can Max's solution be converted
to a left-recursive one that can manually allocate arbitrary hunks of
memory for new tokens? How much more complicated would that solution
be, and what would be the general strategy?
Reply With Quote
  #6  
Old 08-20-2008, 08:02 AM
Max Hailperin
Guest
 
Default Re: Parsing nested structures with yacc

amoe <amoebae@gmail.com> writes:
.....
> However, I'd be interested to know - can Max's solution be converted
> to a left-recursive one that can manually allocate arbitrary hunks of
> memory for new tokens? How much more complicated would that solution
> be, and what would be the general strategy?


Let me start by paraphrasing your question, to make sure I'm actually
answering what you are asking.

If I understand you correctly, you are asking about leaving unchanged
the way nesting is handled -- the ability of a list-element to itself
be a list, as in (((a))) -- and just changing how the linear
decomposition of each level of list is handled, in which the list is
broken down into one element and the rest -- the part that is relevant
in a situation like (a b c).

You want to have the decomposition within the grammar be the reverse
of that in the data structure. The grammar would take the viewpoint
(a b c) = (a b) + c
but the data strcuture would be based on
(a b c) = a + (b c)
as it is in Lisp.

There are two ways you could do that. One was already suggested by
Chris Clark earlier in this thread: build the data structure from left
to right, at each step modifying the last pair to reference a new
pair. That is, you would start out with the list (a), then by
mutating its last pair turn it into (a b), then by mutating the new
last pair turn it into (a b c).

The other approach may be simpler for you to fit into your existing
infrastructure; build the list in reverse order and then at the end
reverse it. That is, you would use ordinary constructor operations to
build up from (a) to (b a) to (c b a), and then once the list is
complete, you would flip it around to (a b c).

In my earlier post, I had said

> His grammar (which is simpilified relative to any full Lisp) would say
> (1) An s-expression is either a symbol or a list.
> (2) A list always starts with an open paren; we can call what follows
> that the tail.
> (3) A tail is either
> (a) just a close paren (forming an empty list), or
> (b) an s-expression followed by a further tail (forming a
> nonempty list).


For ease of switching this around to the left-recursive form you are
now asking about, first recognize that it could be rewritten as
follows, while remaining right-recursive:

(1) An s-expression is either a symbol or a list.
(2) A list always starts with an open paren and ends with a close
paren; we can call what comes in between them the body.
(3) A body is either
(a) just the empty string (forming an empty list), or
(b) an s-expression followed by a further body (forming a
nonempty list).

Now, to make it left-recursive, you would just modify part (3):

(3) A body is either
(a) just the empty string (forming an empty list), or
(b) a body followed by an s-expression (forming a
nonempty list).

However, the "forming a nonempty list" part is now where life gets
interesting, because the pair constructor tacks an element onto the
front of a list, not the back.

As I said, Chris Clark's approach of modifying the tail end of the
list is indeed reasonable. But the reason I showed the modified
grammar was to point out that my alternative suggestion of building
the list in reverse order and then reversing it once it is complete
fits quite naturally into the grammar, as well as not requiring any
fanciness at the data-structure level.

Namely, notice that we've already got two separate nonterminals for "a
list" versus "a body". This positions you perfectly for having the
synthesized attribute value for a list be in the left-to-right order,
(a b c) in the example, while the synthesized attribute value for the
body is in the right-to-left order, (c b a). There are natural places
to hang the two different kinds of actions you need, one to build the
list using your pair constructor and one to do the list reversal.

The same technique can be used even in grammars for non-parenthesized
lists, which don't naturally provide two separate nonterminals -- you
just have to recognize the need for introducing an additional
nonterminal with the trivial production

A -> B

solely for the sake of providing a setting for the reversal action.
But here the setting is already in place, just waiting to be used.
Reply With Quote
Reply


Thread Tools
Display Modes


All times are GMT -5. The time now is 04:07 PM.


Powered by vBulletin® Version 3.7.2
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
vB Ad Management by =RedTyger=

In an effort to better serve ads to our visitors, cookies are used on objectmix.com. For more information, check out our Privacy Policy.