Number of compiler passes

This is a discussion on Number of compiler passes within the Compilers forums in Theory and Concepts category; I was wondering, with the computers of today, with as much memory as they have, is there still a reason to limit the amount of passes a compiler makes? Sure, with each additional pass there will be some overhead, but other than that, is there a disadvantage I'm missing? In particular, I'm working on a source-to-source compiler at the moment. >From my own language to C. It now has the following stages: * Flex + Bison to create AST * Over AST to find declarations (and re-declarations) * Over AST to detect which variables are used in which functions * ...

Go Back   Application Development Forum > Theory and Concepts > Compilers

Object Mix

Register FAQ Calendar Search Today's Posts Mark Forums Read
  #1  
Old 07-21-2008, 10:29 AM
Michiel
Guest
 
Default Number of compiler passes

I was wondering, with the computers of today, with as much memory as they
have, is there still a reason to limit the amount of passes a compiler
makes? Sure, with each additional pass there will be some overhead, but
other than that, is there a disadvantage I'm missing?

In particular, I'm working on a source-to-source compiler at the moment.
>From my own language to C. It now has the following stages:


* Flex + Bison to create AST
* Over AST to find declarations (and re-declarations)
* Over AST to detect which variables are used in which functions
* Over AST to find undeclared references
* Over AST to find the type of each expression and note type mismatches
* Over AST to find the access-type (read/write/both) of each expression
* Over AST to perform optimizations
* Over AST to translate to target language

Pass 2 to 4 may seem excessive, but I believe they are necessary because of
the following features:
* Var/function declarations anywhere in the code (incl. nested functions)
* Forward referencing of constant functions and constant variables

Some of the stages could be merged, I suppose. But it would not help code
readability or provability to do so.

I would like to hear some opinions about this.

Thanks!

--
Michiel Helvensteijn
[Back in the era of coal fired mainframes, compilers were divided into multiple
passes so the code for each pass could be overlaid. These days I agree that
there isn't a strong argument either way except perhaps better cache data locality
if you combine passes. -John]

Reply With Quote
  #2  
Old 07-21-2008, 05:07 PM
glen herrmannsfeldt
Guest
 
Default Re: Number of compiler passes

Michiel wrote:

> I was wondering, with the computers of today, with as much memory as they
> have, is there still a reason to limit the amount of passes a compiler
> makes? Sure, with each additional pass there will be some overhead, but
> other than that, is there a disadvantage I'm missing?

(snip)

John wrote:
> [Back in the era of coal fired mainframes, compilers were divided into multiple
> passes so the code for each pass could be overlaid. These days I agree that
> there isn't a strong argument either way except perhaps better cache data locality
> if you combine passes. -John]


Also, in those days the intermediate data usually went to disk.

I know discussion about changes in the OS/360 assembler, I believe
from F to XF, went from four passes to three, one less write and read
back from disk, with significant speed-up.

Not that I think we should go back to those days, but I am still
surprised at the amount of memory needed by some modern compilers.
There are people trying to run gcc on MVS with 24 bit addressing, with
the ability to compile itself, but it doesn't seem to fit.

There was also a discussion on extended addressing for the PDP-10,
such that one might be able to run gcc. Again, even with the full
extended addressing of TOPS-20, it seems that it won't fit.

-- glen
Reply With Quote
  #3  
Old 07-21-2008, 05:33 PM
George Neuner
Guest
 
Default Re: Number of compiler passes

On Mon, 21 Jul 2008 16:29:15 +0200, Michiel <m.helvensteijn@gmail.com>
wrote:

>... is there still a reason to limit the amount of passes a compiler
>makes? Sure, with each additional pass there will be some overhead, but
>other than that, is there a disadvantage I'm missing?
>
>In particular, I'm working on a source-to-source compiler at the moment.
>>From my own language to C. It now has the following stages:

>
>* Flex + Bison to create AST
>* 1 Over AST to find declarations (and re-declarations)
>* 2 Over AST to detect which variables are used in which functions
>* 3 Over AST to find undeclared references
>* 4 Over AST to find the type of each expression and note type mismatches
>* 5 Over AST to find the access-type (read/write/both) of each expression
>* 6 Over AST to perform optimizations
>* 7 Over AST to translate to target language
>
>Pass 2 to 4 may seem excessive, but I believe they are necessary because of
>the following features:
>* Var/function declarations anywhere in the code (incl. nested functions)
>* Forward referencing of constant functions and constant variables
>
>Some of the stages could be merged, I suppose. But it would not help code
>readability or provability to do so.


The only disadvantage to many passes is having to store symbol and
analysis data that future passes may require.

You don't give much detail of your language, but I don't see how
"readability or provability" would be harmed by combining passes 1 & 2
and passes 3 & 4. [I numbered them above for convenience]

Assuming variables must be declared, processing declarations gives you
all you need - the symbol table(s) you construct will provide scoping
for non-local variables and functions. I don't see a need for two
passes.

Similarly, undeclared references will not be represented in your
symbol table. They will be caught when you look them up to determine
their type. You can assign them a special nil type so going forward
any expressions using them fail to type properly.

I don't know what you mean by "access-type" of an expression - unless
you're talking about I/O in which case I still don't know what you
mean. We could be having a terminology disconnect - to my mind an
"expression" always computes or side effects, so it always reads
operands and sometimes writes a result.

I would think that your optimization phase itself would consist of
multiple passes as a number of optimizations are possible even on an
AST form.

And since you are targeting C from a language that has nested
functions, you might actually need more than one pass to affect the
translation. Simple function nesting as in Pascal can be handled
easily, but full closures as in Lisp requires constructing runtime
data structures and a lot more analysis.

George
Reply With Quote
  #4  
Old 07-22-2008, 03:31 PM
Michiel
Guest
 
Default Re: Number of compiler passes

George Neuner wrote:

>>* Flex + Bison to create AST
>>* 1 Over AST to find declarations (and re-declarations)
>>* 2 Over AST to detect which variables are used in which functions
>>* 3 Over AST to find undeclared references
>>* 4 Over AST to find the type of each expression and note type mismatches
>>* 5 Over AST to find the access-type (read/write/both) of each expression
>>* 6 Over AST to perform optimizations
>>* 7 Over AST to translate to target language

>
> You don't give much detail of your language, but I don't see how
> "readability or provability" would be harmed by combining passes 1 & 2
> ...
> Assuming variables must be declared, processing declarations gives you
> all you need - the symbol table(s) you construct will provide scoping
> for non-local variables and functions. I don't see a need for two
> passes.


In the source language, a function can use variables from outside its
scope. More importantly, these variables could be declared after the
function is. This is okay as long as this function is not called
before the declaration of these variables.

> and passes 3 & 4. [I numbered them above for convenience]
> ...
> Similarly, undeclared references will not be represented in your
> symbol table. They will be caught when you look them up to determine
> their type. You can assign them a special nil type so going forward
> any expressions using them fail to type properly.


You're right about these, though. They could be merged and I will
certainly consider it. The only reason for me to keep them separate is
because they are conceptually different operations.

> I don't know what you mean by "access-type" of an expression - unless
> you're talking about I/O in which case I still don't know what you
> mean. We could be having a terminology disconnect - to my mind an
> "expression" always computes or side effects, so it always reads
> operands and sometimes writes a result.


Yes, I suppose I am stretching the definition a bit. An expression has a
type (bool, int, etc.). But to me it also has an access type. This is
either readonly, writeonly or readwrite. It basically specifies whether it
is an r-value, an l-value or both.

> I would think that your optimization phase itself would consist of
> multiple passes as a number of optimizations are possible even on an
> AST form.


Yep, probably. I haven't actually started on this stage yet.

> And since you are targeting C from a language that has nested
> functions, you might actually need more than one pass to affect the
> translation. Simple function nesting as in Pascal can be handled
> easily, but full closures as in Lisp requires constructing runtime
> data structures and a lot more analysis.


For now, functions in the source language cannot be passed, assigned or
returned yet. But in the future I hope to include functional aspects into
the language. I'm already planning for it in the design.

Thanks for your reply,

--
Michiel Helvensteijn
Reply With Quote
  #5  
Old 07-25-2008, 11:39 AM
Denis Washington
Guest
 
Default Re: Number of compiler passes

On Tue, 2008-07-22 at 21:31 +0200, Michiel wrote:
> For now, functions in the source language cannot be passed, assigned or
> returned yet. But in the future I hope to include functional aspects into
> the language. I'm already planning for it in the design.
>
> Thanks for your reply,


Out of pure curiosity: what kind of language are you trying to develop?
What are you planning to do with it? That would interest me.

Regards,
Denis

Reply With Quote
  #6  
Old 07-25-2008, 02:26 PM
Michiel
Guest
 
Default Re: Number of compiler passes

> Out of pure curiosity: what kind of language are you trying to develop?
> What are you planning to do with it? That would interest me.


I have already answered you by mail, but I'll repeat it for the group
(minus the question about your own language):

It's a language I'm designing and implementing with a colleague for
our Masters degree in Computer Science. On the surface it's just an
imperative language with some nice tricks (like the nested
functions). But we're using it primarily for research on automatic
correctness proving.

The language has a special syntax for function preconditions,
postconditions and invariants. The hope is that most of these
assertions can be proved at compile time. This framework would also
clear the way for a lot of new compiler optimizations. It also has
other benefits.

That's its most important feature. As a part of my research I hope to
add new paradigms (OOP, functional) to the language one by one as long
as they can still be supported by this framework.

By the way, I'm still not sure what to name the language. Any ideas?

--
Michiel Helvensteijn
Reply With Quote
  #7  
Old 07-25-2008, 07:29 PM
George Neuner
Guest
 
Default Re: Number of compiler passes

On Tue, 22 Jul 2008 21:31:57 +0200, Michiel <m.helvensteijn@gmail.com>
wrote:

>George Neuner wrote:
>
>>>* Flex + Bison to create AST
>>>* 1 Over AST to find declarations (and re-declarations)
>>>* 2 Over AST to detect which variables are used in which functions
>>>* 3 Over AST to find undeclared references
>>>* 4 Over AST to find the type of each expression and note type mismatches
>>>* 5 Over AST to find the access-type (read/write/both) of each expression
>>>* 6 Over AST to perform optimizations
>>>* 7 Over AST to translate to target language

>>
>> You don't give much detail of your language, but I don't see how
>> "readability or provability" would be harmed by combining passes 1 & 2
>> ...
>> Assuming variables must be declared, processing declarations gives you
>> all you need - the symbol table(s) you construct will provide scoping
>> for non-local variables and functions. I don't see a need for two
>> passes.

>
>In the source language, a function can use variables from outside its
>scope. More importantly, these variables could be declared after the
>function is. This is okay as long as this function is not called
>before the declaration of these variables.


As long as non-local variables are transitively in the lexical scope
chain of the function that uses them you can still gather all the
declarations in one pass. If they're not, you have a language that
will be error prone and confusing to use.

The simplest way to do it is to organize your symbol "table" as a
stack of lists[*]. You attach a local symbol list to each function's
definition node in the AST. As you recurse into a function, you push
its local symbol list onto the symbol stack and pop it off again when
you leave the function. The list of globals you keep at the bottom of
the stack. To look up a symbol you just examine the lists starting
from the top of the stack - the first correctly named symbol you find
will be the one nearest in scope to the usage point. This allows for
lexical shadowing of same-named variables in enclosing scopes.
[*] or stack of whatever data structure floats your boat.


>> I don't know what you mean by "access-type" of an expression - unless
>> you're talking about I/O in which case I still don't know what you
>> mean. We could be having a terminology disconnect - to my mind an
>> "expression" always computes or side effects, so it always reads
>> operands and sometimes writes a result.

>
>Yes, I suppose I am stretching the definition a bit. An expression has a
>type (bool, int, etc.). But to me it also has an access type. This is
>either readonly, writeonly or readwrite. It basically specifies whether it
>is an r-value, an l-value or both.


I believe you are overthinking the problem.

L-value or R-value is a matter of usage, not type. Obviously it
matters whether a particular variable is in an operand position or in
a result position, but that depends on the class of the expression -
unary, binary, ternary, etc. - and not on the operand or result types
or the operators involved.

IMO, value handling should be isolated in the handler function for the
particular expression type and is not anything I would ever care about
at the symbol information level.

If you really want to, you can annotate the variable's AST node as to
whether it's used for value or address, but I don't think that's
necessary. Since you are targeting C which already has a fairly
sophisticated understanding of complex data types, your compiler
really doesn't have to bother. If your source language says "A[X] :=
A[X] + 1", you can just say that directly in C after appropriately
defining A and X. You don't need to transform it all the way down
into address computations and dereferences.

George
--
for email reply remove "/" from address

Reply With Quote
  #8  
Old 07-26-2008, 07:34 AM
Michiel
Guest
 
Default Re: Number of compiler passes

George Neuner wrote:

>>In the source language, a function can use variables from outside its
>>scope. More importantly, these variables could be declared after the
>>function is. This is okay as long as this function is not called
>>before the declaration of these variables.

>
> As long as non-local variables are transitively in the lexical scope
> chain of the function that uses them you can still gather all the
> declarations in one pass. If they're not, you have a language that
> will be error prone and confusing to use.


They are.

I think I understand. You are proposing a breadth first traversal? We are
using the visitor pattern right now, which lends itself best to a depth
first traversal.

> The simplest way to do it is to organize your symbol "table" as a
> stack of lists[*].
> ...
> This allows for
> lexical shadowing of same-named variables in enclosing scopes.


This is pretty much what I'm doing already. :-)

>>Yes, I suppose I am stretching the definition a bit. An expression has a
>>type (bool, int, etc.). But to me it also has an access type. This is
>>either readonly, writeonly or readwrite. It basically specifies whether it
>>is an r-value, an l-value or both.

>
> I believe you are overthinking the problem.
>
> L-value or R-value is a matter of usage, not type.


I've seen the terms used in both contexts. As in: A variable is an l-value.
A constant is an r-value. I believe the Dragon book does this.

> Obviously it
> matters whether a particular variable is in an operand position or in
> a result position, but that depends on the class of the expression -
> unary, binary, ternary, etc. - and not on the operand or result types
> or the operators involved.


Maybe I should clarify. This information decides whether an expression CAN
be used as an l/r-value or not.

If it cannot be an l-value, it is a readonly expression. You will see this
often with a simple arithmetic operation or with a constant or literal.

If it cannot be a an r-value, it is a writeonly expression. This is seen
less often, but can occur for formal parameters of the 'out' direction. (As
opposed to 'in' or 'inout'.) Or for properties that have a setter method
but no getter method.

If it can be both, it is a readwrite expression. The access-type of your
basic variable. But also of some other expressions, like properties and
array-subscriptings.

This information is stored in the symbol table and is used to check for
referencing errors. It may also help for some optimizations later.

> Since you are targeting C which already has a fairly
> sophisticated understanding of complex data types, your compiler
> really doesn't have to bother. If your source language says "A[X] :=
> A[X] + 1", you can just say that directly in C after appropriately
> defining A and X. You don't need to transform it all the way down
> into address computations and dereferences.


But I'm sure our source language handles a lot of constructs a little
differently than C does. Also, we want to write our own error messages.

Anyway, the translation to C is only a temporary solution. We're focussing
on the design and proof of correctness concept right now, and do not want
to have to bother with assembly or learning to use the gcc backend until
later. Translating to C seemed like the easiest solution for now. But all
processing of the AST is completely separate from the final translation.

--
Michiel Helvensteijn

Reply With Quote
  #9  
Old 07-27-2008, 11:06 PM
glen herrmannsfeldt
Guest
 
Default Re: Number of compiler passes

George Neuner wrote:
(snip)

> As long as non-local variables are transitively in the lexical scope
> chain of the function that uses them you can still gather all the
> declarations in one pass. If they're not, you have a language that
> will be error prone and confusing to use.


> The simplest way to do it is to organize your symbol "table" as a
> stack of lists[*]. You attach a local symbol list to each function's
> definition node in the AST.


I think that is fine, but some languages and language features
complicate the problem.

One interesting feature (though some might disagree on
whether it really is a feature) is partial qualification
for structure references in PL/I.

Referencing a variable in a structure only needs enough
qualification to be unambiguous. It gets more interesting
with nested scope when a name could agree with partial
qualification at more than one level. I believe it works
such that, given the stack of lists as you indicate, you can
go up the list and take the first one that agrees, within
partial qualification, with the given symbol.

That makes it relatively easy for compilers, and means
that a change at a higher level of nesting won't change
the meaning of a name at a lower level.

-- glen

Reply With Quote
  #10  
Old 07-28-2008, 03:12 PM
George Neuner
Guest
 
Default Re: Number of compiler passes

On Sat, 26 Jul 2008 13:34:31 +0200, Michiel <m.helvensteijn@gmail.com>
wrote:

>George Neuner wrote:
>
>>>In the source language, a function can use variables from outside its
>>>scope. More importantly, these variables could be declared after the
>>>function is. This is okay as long as this function is not called
>>>before the declaration of these variables.

>>
>> As long as non-local variables are transitively in the lexical scope
>> chain of the function that uses them you can still gather all the
>> declarations in one pass. If they're not, you have a language that
>> will be error prone and confusing to use.

>
>They are.
>
>I think I understand. You are proposing a breadth first traversal? We are
>using the visitor pattern right now, which lends itself best to a depth
>first traversal.


No, depth first is fine (and I think, preferred). As long as the
compiler can tell a declaration from other expressions in the AST, the
actual traversal order doesn't matter.


>>>... An expression has a
>>>type (bool, int, etc.). But to me it also has an access type. This is
>>>either readonly, writeonly or readwrite. It basically specifies whether it
>>>is an r-value, an l-value or both.

>>
>> I believe you are overthinking the problem.
>>
>> L-value or R-value is a matter of usage, not type.

>
>I've seen the terms used in both contexts. As in: A variable is an l-value.
>A constant is an r-value. I believe the Dragon book does this.


I've also seen this usage, but IMO it can be confusing and I don't
like it. The problem is the abstraction level (see below) and the
fact that the same variable can be an L-value in one expression and an
R-value in another, or both in the same expression.

The practical usage: L-value = result/target, R-value = operand, has
some meaning.


>> Obviously it
>> matters whether a particular variable is in an operand position or in
>> a result position, but that depends on the class of the expression -
>> unary, binary, ternary, etc. - and not on the operand or result types
>> or the operators involved.

>
>Maybe I should clarify. This information decides whether an expression CAN
>be used as an l/r-value or not.
>
>If it cannot be an l-value, it is a readonly expression. You will see this
>often with a simple arithmetic operation or with a constant or literal.
>
>If it cannot be a an r-value, it is a writeonly expression. This is seen
>less often, but can occur for formal parameters of the 'out' direction. (As
>opposed to 'in' or 'inout'.) Or for properties that have a setter method
>but no getter method.
>
>If it can be both, it is a readwrite expression. The access-type of your
>basic variable. But also of some other expressions, like properties and
>array-subscriptings.


Ok, we have different definitions of expression. I consider an
expression to compute something or cause a side effect like altering
control flow. To me, a variable reference like "A[X]" is a
subexpression that can't stand on its own.

If you think about how to express the high level code in a 3-address
form you'll see what I'm saying. The high level expression C
expression "i++" seems to have a read/write variable reference, but in
fact expanded into 3-address code the read/modify/write nature of the
operation becomes clear. Written as "i := i + 1" or "i := inc i", you
clearly see that each reference to the variable can only be read or
write, but never both.


>This information is stored in the symbol table and is used to check for
>referencing errors. It may also help for some optimizations later.


I see what now what you're getting at - it's your terminology that's
throwing me. You want to know whether the variable is ever written to
or only read, but that's different than _being_ an L-value or R-value.

L-value and R-value at the expression level are not synonymous with
"write" and "read" at the symbol level. Take, for example, a
write-only function parameter passed by reference. How do you assign
to that value?

pseudo Ada:

procedure f( A: out var array of integer ) is
var i: integer;
begin
for i := 0 to 9 do
A[i] := i;
end for;
end f;

At the symbol level, "A" is write-only because it is an "out"
parameter - an attempt to read from it will result in an error. But
it is also a "var" (ie. reference) parameter, so accessing it or
indexing from it requires some computation. Leaving out some
gyrations that Ada for-loops require, this code translates into
(simple) 3-address as something like

t1 := A
i := 0
:loop t2 := i > 9
if t2 goto end
*t1 := i
t1 := t1 + sizeof(integer)
i := i + 1
goto loop
:end

As you can clearly see, the "write-only" variable "A" is used only in
an operand position, ie. as an R-value. It is never used (directly)
as an L-value.


George
Reply With Quote
Reply


Thread Tools
Display Modes


All times are GMT -5. The time now is 05:41 PM.


Powered by vBulletin® Version 3.7.2
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
vB Ad Management by =RedTyger=

In an effort to better serve ads to our visitors, cookies are used on objectmix.com. For more information, check out our Privacy Policy.