| Register | FAQ | Calendar | Search | Today's Posts | Mark Forums Read |
|
#1
| |||
| |||
| I was wondering, with the computers of today, with as much memory as they have, is there still a reason to limit the amount of passes a compiler makes? Sure, with each additional pass there will be some overhead, but other than that, is there a disadvantage I'm missing? In particular, I'm working on a source-to-source compiler at the moment. >From my own language to C. It now has the following stages: * Flex + Bison to create AST * Over AST to find declarations (and re-declarations) * Over AST to detect which variables are used in which functions * Over AST to find undeclared references * Over AST to find the type of each expression and note type mismatches * Over AST to find the access-type (read/write/both) of each expression * Over AST to perform optimizations * Over AST to translate to target language Pass 2 to 4 may seem excessive, but I believe they are necessary because of the following features: * Var/function declarations anywhere in the code (incl. nested functions) * Forward referencing of constant functions and constant variables Some of the stages could be merged, I suppose. But it would not help code readability or provability to do so. I would like to hear some opinions about this. Thanks! -- Michiel Helvensteijn [Back in the era of coal fired mainframes, compilers were divided into multiple passes so the code for each pass could be overlaid. These days I agree that there isn't a strong argument either way except perhaps better cache data locality if you combine passes. -John] |
|
#2
| |||
| |||
| Michiel wrote: > I was wondering, with the computers of today, with as much memory as they > have, is there still a reason to limit the amount of passes a compiler > makes? Sure, with each additional pass there will be some overhead, but > other than that, is there a disadvantage I'm missing? (snip) John wrote: > [Back in the era of coal fired mainframes, compilers were divided into multiple > passes so the code for each pass could be overlaid. These days I agree that > there isn't a strong argument either way except perhaps better cache data locality > if you combine passes. -John] Also, in those days the intermediate data usually went to disk. I know discussion about changes in the OS/360 assembler, I believe from F to XF, went from four passes to three, one less write and read back from disk, with significant speed-up. Not that I think we should go back to those days, but I am still surprised at the amount of memory needed by some modern compilers. There are people trying to run gcc on MVS with 24 bit addressing, with the ability to compile itself, but it doesn't seem to fit. There was also a discussion on extended addressing for the PDP-10, such that one might be able to run gcc. Again, even with the full extended addressing of TOPS-20, it seems that it won't fit. -- glen |
|
#3
| |||
| |||
| On Mon, 21 Jul 2008 16:29:15 +0200, Michiel <m.helvensteijn@gmail.com> wrote: >... is there still a reason to limit the amount of passes a compiler >makes? Sure, with each additional pass there will be some overhead, but >other than that, is there a disadvantage I'm missing? > >In particular, I'm working on a source-to-source compiler at the moment. >>From my own language to C. It now has the following stages: > >* Flex + Bison to create AST >* 1 Over AST to find declarations (and re-declarations) >* 2 Over AST to detect which variables are used in which functions >* 3 Over AST to find undeclared references >* 4 Over AST to find the type of each expression and note type mismatches >* 5 Over AST to find the access-type (read/write/both) of each expression >* 6 Over AST to perform optimizations >* 7 Over AST to translate to target language > >Pass 2 to 4 may seem excessive, but I believe they are necessary because of >the following features: >* Var/function declarations anywhere in the code (incl. nested functions) >* Forward referencing of constant functions and constant variables > >Some of the stages could be merged, I suppose. But it would not help code >readability or provability to do so. The only disadvantage to many passes is having to store symbol and analysis data that future passes may require. You don't give much detail of your language, but I don't see how "readability or provability" would be harmed by combining passes 1 & 2 and passes 3 & 4. [I numbered them above for convenience] Assuming variables must be declared, processing declarations gives you all you need - the symbol table(s) you construct will provide scoping for non-local variables and functions. I don't see a need for two passes. Similarly, undeclared references will not be represented in your symbol table. They will be caught when you look them up to determine their type. You can assign them a special nil type so going forward any expressions using them fail to type properly. I don't know what you mean by "access-type" of an expression - unless you're talking about I/O in which case I still don't know what you mean. We could be having a terminology disconnect - to my mind an "expression" always computes or side effects, so it always reads operands and sometimes writes a result. I would think that your optimization phase itself would consist of multiple passes as a number of optimizations are possible even on an AST form. And since you are targeting C from a language that has nested functions, you might actually need more than one pass to affect the translation. Simple function nesting as in Pascal can be handled easily, but full closures as in Lisp requires constructing runtime data structures and a lot more analysis. George |
|
#4
| |||
| |||
| George Neuner wrote: >>* Flex + Bison to create AST >>* 1 Over AST to find declarations (and re-declarations) >>* 2 Over AST to detect which variables are used in which functions >>* 3 Over AST to find undeclared references >>* 4 Over AST to find the type of each expression and note type mismatches >>* 5 Over AST to find the access-type (read/write/both) of each expression >>* 6 Over AST to perform optimizations >>* 7 Over AST to translate to target language > > You don't give much detail of your language, but I don't see how > "readability or provability" would be harmed by combining passes 1 & 2 > ... > Assuming variables must be declared, processing declarations gives you > all you need - the symbol table(s) you construct will provide scoping > for non-local variables and functions. I don't see a need for two > passes. In the source language, a function can use variables from outside its scope. More importantly, these variables could be declared after the function is. This is okay as long as this function is not called before the declaration of these variables. > and passes 3 & 4. [I numbered them above for convenience] > ... > Similarly, undeclared references will not be represented in your > symbol table. They will be caught when you look them up to determine > their type. You can assign them a special nil type so going forward > any expressions using them fail to type properly. You're right about these, though. They could be merged and I will certainly consider it. The only reason for me to keep them separate is because they are conceptually different operations. > I don't know what you mean by "access-type" of an expression - unless > you're talking about I/O in which case I still don't know what you > mean. We could be having a terminology disconnect - to my mind an > "expression" always computes or side effects, so it always reads > operands and sometimes writes a result. Yes, I suppose I am stretching the definition a bit. An expression has a type (bool, int, etc.). But to me it also has an access type. This is either readonly, writeonly or readwrite. It basically specifies whether it is an r-value, an l-value or both. > I would think that your optimization phase itself would consist of > multiple passes as a number of optimizations are possible even on an > AST form. Yep, probably. I haven't actually started on this stage yet. > And since you are targeting C from a language that has nested > functions, you might actually need more than one pass to affect the > translation. Simple function nesting as in Pascal can be handled > easily, but full closures as in Lisp requires constructing runtime > data structures and a lot more analysis. For now, functions in the source language cannot be passed, assigned or returned yet. But in the future I hope to include functional aspects into the language. I'm already planning for it in the design. Thanks for your reply, -- Michiel Helvensteijn |
|
#5
| |||
| |||
| On Tue, 2008-07-22 at 21:31 +0200, Michiel wrote: > For now, functions in the source language cannot be passed, assigned or > returned yet. But in the future I hope to include functional aspects into > the language. I'm already planning for it in the design. > > Thanks for your reply, Out of pure curiosity: what kind of language are you trying to develop? What are you planning to do with it? That would interest me. Regards, Denis |
|
#6
| |||
| |||
| > Out of pure curiosity: what kind of language are you trying to develop? > What are you planning to do with it? That would interest me. I have already answered you by mail, but I'll repeat it for the group (minus the question about your own language): It's a language I'm designing and implementing with a colleague for our Masters degree in Computer Science. On the surface it's just an imperative language with some nice tricks (like the nested functions). But we're using it primarily for research on automatic correctness proving. The language has a special syntax for function preconditions, postconditions and invariants. The hope is that most of these assertions can be proved at compile time. This framework would also clear the way for a lot of new compiler optimizations. It also has other benefits. That's its most important feature. As a part of my research I hope to add new paradigms (OOP, functional) to the language one by one as long as they can still be supported by this framework. By the way, I'm still not sure what to name the language. Any ideas? -- Michiel Helvensteijn |
|
#7
| |||
| |||
| On Tue, 22 Jul 2008 21:31:57 +0200, Michiel <m.helvensteijn@gmail.com> wrote: >George Neuner wrote: > >>>* Flex + Bison to create AST >>>* 1 Over AST to find declarations (and re-declarations) >>>* 2 Over AST to detect which variables are used in which functions >>>* 3 Over AST to find undeclared references >>>* 4 Over AST to find the type of each expression and note type mismatches >>>* 5 Over AST to find the access-type (read/write/both) of each expression >>>* 6 Over AST to perform optimizations >>>* 7 Over AST to translate to target language >> >> You don't give much detail of your language, but I don't see how >> "readability or provability" would be harmed by combining passes 1 & 2 >> ... >> Assuming variables must be declared, processing declarations gives you >> all you need - the symbol table(s) you construct will provide scoping >> for non-local variables and functions. I don't see a need for two >> passes. > >In the source language, a function can use variables from outside its >scope. More importantly, these variables could be declared after the >function is. This is okay as long as this function is not called >before the declaration of these variables. As long as non-local variables are transitively in the lexical scope chain of the function that uses them you can still gather all the declarations in one pass. If they're not, you have a language that will be error prone and confusing to use. The simplest way to do it is to organize your symbol "table" as a stack of lists[*]. You attach a local symbol list to each function's definition node in the AST. As you recurse into a function, you push its local symbol list onto the symbol stack and pop it off again when you leave the function. The list of globals you keep at the bottom of the stack. To look up a symbol you just examine the lists starting from the top of the stack - the first correctly named symbol you find will be the one nearest in scope to the usage point. This allows for lexical shadowing of same-named variables in enclosing scopes. [*] or stack of whatever data structure floats your boat. >> I don't know what you mean by "access-type" of an expression - unless >> you're talking about I/O in which case I still don't know what you >> mean. We could be having a terminology disconnect - to my mind an >> "expression" always computes or side effects, so it always reads >> operands and sometimes writes a result. > >Yes, I suppose I am stretching the definition a bit. An expression has a >type (bool, int, etc.). But to me it also has an access type. This is >either readonly, writeonly or readwrite. It basically specifies whether it >is an r-value, an l-value or both. I believe you are overthinking the problem. L-value or R-value is a matter of usage, not type. Obviously it matters whether a particular variable is in an operand position or in a result position, but that depends on the class of the expression - unary, binary, ternary, etc. - and not on the operand or result types or the operators involved. IMO, value handling should be isolated in the handler function for the particular expression type and is not anything I would ever care about at the symbol information level. If you really want to, you can annotate the variable's AST node as to whether it's used for value or address, but I don't think that's necessary. Since you are targeting C which already has a fairly sophisticated understanding of complex data types, your compiler really doesn't have to bother. If your source language says "A[X] := A[X] + 1", you can just say that directly in C after appropriately defining A and X. You don't need to transform it all the way down into address computations and dereferences. George -- for email reply remove "/" from address |
|
#8
| |||
| |||
| George Neuner wrote: >>In the source language, a function can use variables from outside its >>scope. More importantly, these variables could be declared after the >>function is. This is okay as long as this function is not called >>before the declaration of these variables. > > As long as non-local variables are transitively in the lexical scope > chain of the function that uses them you can still gather all the > declarations in one pass. If they're not, you have a language that > will be error prone and confusing to use. They are. I think I understand. You are proposing a breadth first traversal? We are using the visitor pattern right now, which lends itself best to a depth first traversal. > The simplest way to do it is to organize your symbol "table" as a > stack of lists[*]. > ... > This allows for > lexical shadowing of same-named variables in enclosing scopes. This is pretty much what I'm doing already. :-) >>Yes, I suppose I am stretching the definition a bit. An expression has a >>type (bool, int, etc.). But to me it also has an access type. This is >>either readonly, writeonly or readwrite. It basically specifies whether it >>is an r-value, an l-value or both. > > I believe you are overthinking the problem. > > L-value or R-value is a matter of usage, not type. I've seen the terms used in both contexts. As in: A variable is an l-value. A constant is an r-value. I believe the Dragon book does this. > Obviously it > matters whether a particular variable is in an operand position or in > a result position, but that depends on the class of the expression - > unary, binary, ternary, etc. - and not on the operand or result types > or the operators involved. Maybe I should clarify. This information decides whether an expression CAN be used as an l/r-value or not. If it cannot be an l-value, it is a readonly expression. You will see this often with a simple arithmetic operation or with a constant or literal. If it cannot be a an r-value, it is a writeonly expression. This is seen less often, but can occur for formal parameters of the 'out' direction. (As opposed to 'in' or 'inout'.) Or for properties that have a setter method but no getter method. If it can be both, it is a readwrite expression. The access-type of your basic variable. But also of some other expressions, like properties and array-subscriptings. This information is stored in the symbol table and is used to check for referencing errors. It may also help for some optimizations later. > Since you are targeting C which already has a fairly > sophisticated understanding of complex data types, your compiler > really doesn't have to bother. If your source language says "A[X] := > A[X] + 1", you can just say that directly in C after appropriately > defining A and X. You don't need to transform it all the way down > into address computations and dereferences. But I'm sure our source language handles a lot of constructs a little differently than C does. Also, we want to write our own error messages. Anyway, the translation to C is only a temporary solution. We're focussing on the design and proof of correctness concept right now, and do not want to have to bother with assembly or learning to use the gcc backend until later. Translating to C seemed like the easiest solution for now. But all processing of the AST is completely separate from the final translation. -- Michiel Helvensteijn |
|
#9
| |||
| |||
| George Neuner wrote: (snip) > As long as non-local variables are transitively in the lexical scope > chain of the function that uses them you can still gather all the > declarations in one pass. If they're not, you have a language that > will be error prone and confusing to use. > The simplest way to do it is to organize your symbol "table" as a > stack of lists[*]. You attach a local symbol list to each function's > definition node in the AST. I think that is fine, but some languages and language features complicate the problem. One interesting feature (though some might disagree on whether it really is a feature) is partial qualification for structure references in PL/I. Referencing a variable in a structure only needs enough qualification to be unambiguous. It gets more interesting with nested scope when a name could agree with partial qualification at more than one level. I believe it works such that, given the stack of lists as you indicate, you can go up the list and take the first one that agrees, within partial qualification, with the given symbol. That makes it relatively easy for compilers, and means that a change at a higher level of nesting won't change the meaning of a name at a lower level. -- glen |
|
#10
| |||
| |||
| On Sat, 26 Jul 2008 13:34:31 +0200, Michiel <m.helvensteijn@gmail.com> wrote: >George Neuner wrote: > >>>In the source language, a function can use variables from outside its >>>scope. More importantly, these variables could be declared after the >>>function is. This is okay as long as this function is not called >>>before the declaration of these variables. >> >> As long as non-local variables are transitively in the lexical scope >> chain of the function that uses them you can still gather all the >> declarations in one pass. If they're not, you have a language that >> will be error prone and confusing to use. > >They are. > >I think I understand. You are proposing a breadth first traversal? We are >using the visitor pattern right now, which lends itself best to a depth >first traversal. No, depth first is fine (and I think, preferred). As long as the compiler can tell a declaration from other expressions in the AST, the actual traversal order doesn't matter. >>>... An expression has a >>>type (bool, int, etc.). But to me it also has an access type. This is >>>either readonly, writeonly or readwrite. It basically specifies whether it >>>is an r-value, an l-value or both. >> >> I believe you are overthinking the problem. >> >> L-value or R-value is a matter of usage, not type. > >I've seen the terms used in both contexts. As in: A variable is an l-value. >A constant is an r-value. I believe the Dragon book does this. I've also seen this usage, but IMO it can be confusing and I don't like it. The problem is the abstraction level (see below) and the fact that the same variable can be an L-value in one expression and an R-value in another, or both in the same expression. The practical usage: L-value = result/target, R-value = operand, has some meaning. >> Obviously it >> matters whether a particular variable is in an operand position or in >> a result position, but that depends on the class of the expression - >> unary, binary, ternary, etc. - and not on the operand or result types >> or the operators involved. > >Maybe I should clarify. This information decides whether an expression CAN >be used as an l/r-value or not. > >If it cannot be an l-value, it is a readonly expression. You will see this >often with a simple arithmetic operation or with a constant or literal. > >If it cannot be a an r-value, it is a writeonly expression. This is seen >less often, but can occur for formal parameters of the 'out' direction. (As >opposed to 'in' or 'inout'.) Or for properties that have a setter method >but no getter method. > >If it can be both, it is a readwrite expression. The access-type of your >basic variable. But also of some other expressions, like properties and >array-subscriptings. Ok, we have different definitions of expression. I consider an expression to compute something or cause a side effect like altering control flow. To me, a variable reference like "A[X]" is a subexpression that can't stand on its own. If you think about how to express the high level code in a 3-address form you'll see what I'm saying. The high level expression C expression "i++" seems to have a read/write variable reference, but in fact expanded into 3-address code the read/modify/write nature of the operation becomes clear. Written as "i := i + 1" or "i := inc i", you clearly see that each reference to the variable can only be read or write, but never both. >This information is stored in the symbol table and is used to check for >referencing errors. It may also help for some optimizations later. I see what now what you're getting at - it's your terminology that's throwing me. You want to know whether the variable is ever written to or only read, but that's different than _being_ an L-value or R-value. L-value and R-value at the expression level are not synonymous with "write" and "read" at the symbol level. Take, for example, a write-only function parameter passed by reference. How do you assign to that value? pseudo Ada: procedure f( A: out var array of integer ) is var i: integer; begin for i := 0 to 9 do A[i] := i; end for; end f; At the symbol level, "A" is write-only because it is an "out" parameter - an attempt to read from it will result in an error. But it is also a "var" (ie. reference) parameter, so accessing it or indexing from it requires some computation. Leaving out some gyrations that Ada for-loops require, this code translates into (simple) 3-address as something like t1 := A i := 0 :loop t2 := i > 9 if t2 goto end *t1 := i t1 := t1 + sizeof(integer) i := i + 1 goto loop :end As you can clearly see, the "write-only" variable "A" is used only in an operand position, ie. as an R-value. It is never used (directly) as an L-value. George |
![]() |
| Thread Tools | |
| Display Modes | |
In an effort to better serve ads to our visitors, cookies are used on objectmix.com. For more information, check out our Privacy Policy.