| Register | FAQ | Calendar | Search | Today's Posts | Mark Forums Read |
|
#1
| |||
| |||
| Hi! I have this idea, but I'm not sure if it's already being used anywhere. And if not, I would like to know why. My idea is to leave behind a record of the Abstract Syntax Tree after compilation. This may simply be a copy of the original source code, an XML file or even a binary representation (for better performance, see below). You see, all automatic building utilities I know (make, ant, jam) use only the modification date of a source file to decide whether or not to recompile it and its reverse dependencies. For this reason, many programming languages encourage the programmer to separate their interfaces and implementations either textually or by using different files. Programmers will even use the PIMPL idiom, just to get those private data-members out of that header file. The advantage is that you may change the internal representation of a class without having to recompile the modules that use the header file. But to accomplish this you sacrifice runtime-speed (the extra redirections) and code clarity. I propose that in addition to comparing modification dates, we selectively compare AST's (if the source file is newer). At the very least this allows the programmer to modify style, layout and comments without requiring any recompilation at all. If the comparison utility is smart enough, you may even perform trivial refactoring without losing semantic equivalence. The big bonus is that you may also, quite easily, separately compare the interface and the implementation. If only the implementation has changed, there should be no reason to recompile those modules that only use the interface. Also, multiple classes could be put into the same source file if it seems conceptually right, and you'd get the same advantage of separate comparison. You can take this even further and let the computer automatically find the optimal separation of object files, which need not have any relation to the separation of source files. Now, this means: * You need to save a representation of the AST when compiling. This is, I believe, a low price to pay. Perhaps it could even be attached to the result file as a sort of meta-data. * You need to process a changed source-file up to the AST level before you know the semantic difference. But this is a relatively cheap operation compared to full compilation and optimization. And it could be avoided for the old file if you save the AST in binary form. This seems, to me, such a simple trick that I am surprised I have not encountered it before. But that may simply be my inexperience. What do you think? -- Michiel Helvensteijn [If you're only going to compare the ASTs to see if they're the same, all you need to save is a hash. The question of why nobody seems to care about compilation speed any more. -John] |
|
#2
| |||
| |||
| John wrote: > [If you're only going to compare the ASTs to see if they're the same, > all you need to save is a hash. Ah, but then you lose the ability to selectively compare, which is the biggest advantage of the method I propose. The programmer could write a whole class (or various global functions), including implementation, in the same file. Then it can still be automatically determined if only the implementation has changed, or if the interface has changed as well. (All private members of the class belong to the implementation. All public members belong to the interface.) However, I suppose you could save a separate hash for the interface and for the implementation. Also a separate pair for each class and/or each function in the file, maybe. It's a good idea. > The question of why nobody seems to > care about compilation speed any more. -John] Are you saying nobody cares about compilation speed anymore? (Your sentence seems to be gramatically ambiguous.) It's one of the big annoyances during development, I'd say. Especially for the really big projects. The compiler I'm working on is relatively small. Still, I have to wait a minute or two if I change an implementation detail or even a comment in an often used header file. If people didn't care about compilation speed, why do they use the PIMPL idiom? I see plenty of people using it for exactly that reason. -- Michiel Helvensteijn |
|
#3
| |||
| |||
| Michiel wrote: > You see, all automatic building utilities I know (make, ant, jam) > use only the modification date of a source file to decide whether or > not to recompile it and its reverse dependencies. For this reason, > many programming languages encourage the programmer to separate > their interfaces and implementations either textually or by using > different files. Delphi does not use separate interface and implementation files, and incorporates its own make logic. However, the make logic still uses timestamps to determine when to rebuild, because line information and other implementation details may have changed, and thus debugging would break. A larger part of the problem with C, C++, and, for very large assemblies, C#, is that they rely on parsing large numbers of files for a single compilation. In the case of C & C++, header files; in the case of C#, all source files making up the assembly / module. Delphi avoids these problems by having a intermediate-file compilation strategy like C & C++, but storing the interface information in the intermediate file itself. The problem isn't particularly acute with C#, or Java, for that matter, because they have working module systems - albeit primitive (though I don't claim that Delphi is better in this respect). > I propose that in addition to comparing modification dates, we selectively > compare AST's (if the source file is newer). At the very least this allows > the programmer to modify style, layout and comments without requiring any > recompilation at all. If the comparison utility is smart enough, you may > even perform trivial refactoring without losing semantic equivalence. Delphi solves this problem by computing symbol versions for all exported symbols, and storing symbol versions alongside exports. Symbol versions are just hashes whose algorithms are specific to the kind of symbol. However, these symbol versions are only used for validating link-time integrity, not for dependency-checking. In other words, we try to make sure that it's impossible to get a silent link-time mismatch and consequent crash at runtime. > * You need to save a representation of the AST when compiling. This is, I > believe, a low price to pay. Perhaps it could even be attached to the > result file as a sort of meta-data. Oddly enough, Delphi also stores representations of selected ASTs in object files in order to support generics (very roughly equivalent to C++ template export). These ASTs are stored as binary blobs and the blobs themselves versioned in a binary fashion, after making all line number info in the AST relative. (I work for CodeGear, a division of Embarcadero Tech., on the Delphi compiler.) -- Barry -- http://barrkel.blogspot.com/ |
|
#4
| |||
| |||
| Michiel schrieb: > I have this idea, but I'm not sure if it's already being used anywhere. And > if not, I would like to know why. > > My idea is to leave behind a record of the Abstract Syntax Tree after > compilation. This may simply be a copy of the original source code, an XML > file or even a binary representation (for better performance, see below). This is what many compilers do, in binary form. RTTI requires such information at runtime, so that a description of the interface must be stored in the object files. Then it's only a small step to store the whole description of the interfaces of all modules in the object files. > [If you're only going to compare the ASTs to see if they're the same, > all you need to save is a hash. The question of why nobody seems to > care about compilation speed any more. -John] Some languages IMO still compile very slowly. Where I can do a "make" in Delphi, in order to only find the next error in my code, C++ coders still complain about hours of compilation for huge projects. This difference may result from twiddling with the source code, in the expansion of macros or templates, where a compiler cannot know the impact of changing a minor detail in such a macro, on all other source files in the project. At least in a RAD environment some languages turn out to be not very RAD-friendly, by design. DoDi |
|
#5
| |||
| |||
| (snip, John wrote ![]() > The question of why nobody seems to care about compilation speed any > more. -John] I wonder more about why nobody cares about compilation memory requirements. For time, you can always wait, but if you don't have enough memory you are stuck. (Or virtual address space in the case of virtual memory.) The art of compiling large programs on small machines seems to be completely lost. -- glen [Good point. I hear that GCC can no longer compile itself on mainframe IBM 360 MVS because it doesn't fit. -John] |
|
#6
| |||
| |||
| John wrote: > [Good point. I hear that GCC can no longer compile itself > on mainframe IBM 360 MVS because it doesn't fit. -John] The Hercules group now has an emulator for S/380, a system that IBM never created, just to get around that. It allows one task to have 31 bit addresses and to allocate memory past 16M, while running a 24 bit address space OS. There was also discussion about extended addressing for the PDP-10, and the possibility of porting gcc. The IBM PL/I (F) compiler is supposed to fit into 44K (64K machine, 20K for OS/360 PCP). It will put the symbol table on disk if necessary. The compiler itself is a dynamic overlay, parts brought into memory only when needed. There are 254 separately assembled and link-edited files with 374846 lines of assembly code. -- glen [Fortran H fit into 256K and did some serious, even by modern standards, optimization. -John] |
|
#7
| |||
| |||
| On 2008-08-18, Barry Kelly <barry.j.kelly@gmail.com> wrote: > Delphi does not use separate interface and implementation files, and > incorporates its own make logic. Also the files have a clear interface and implementation separation, which effectively is not that different from two files. > However, the make logic still uses timestamps to determine when to > rebuild, because line information and other implementation details may > have changed, and thus debugging would break. I'm not entirely sure Delphi/TP does too, but FPC also checks a CRC over the entire interface (in the compiled header,.ppu) to detect changes in interface definition (mostly because there are more compiler versions in circulation, and the risk of only slightly stale older versions of precompiled files is larger). > Delphi avoids these problems by having a intermediate-file compilation > strategy like C & C++, but storing the interface information in the > intermediate file itself. And not quiting the compiler for every file. >> * You need to save a representation of the AST when compiling. This is, I >> believe, a low price to pay. Perhaps it could even be attached to the >> result file as a sort of meta-data. > > Oddly enough, Delphi also stores representations of selected ASTs in > object files in order to support generics (very roughly equivalent to > C++ template export). These ASTs are stored as binary blobs and the > blobs themselves versioned in a binary fashion, after making all line > number info in the AST relative. (And cross unit inlining also I assume) |
|
#8
| |||
| |||
| Hans-Peter Diettrich wrote: (snip) > Some languages IMO still compile very slowly. Where I can do a "make" > in Delphi, in order to only find the next error in my code, C++ coders > still complain about hours of compilation for huge projects. One complication with C and C++ is that some variables have file scope. Even if none do, the compiler has to be able to compile them, and so tends to keep data in memory (or backing store) through the whole file. Many functions in one file compile slower than they would as separate files. Fortran compilers traditionally reinitialize just about everything on an END statement, before starting the next program unit. I believe the standard has been changed, but in the past compilers did that so well that a comment after the last END would start a new program unit. The result was many error messages, such as missing END statement, and duplicate MAIN program at link time. -- glen |
|
#9
| |||
| |||
| glen herrmannsfeldt schrieb: >> Some languages IMO still compile very slowly. Where I can do a "make" >> in Delphi, in order to only find the next error in my code, C++ coders >> still complain about hours of compilation for huge projects. > > One complication with C and C++ is that some variables have file > scope. Even if none do, the compiler has to be able to compile them, > and so tends to keep data in memory (or backing store) through the > whole file. Many functions in one file compile slower than they would > as separate files. But this is exactly what a Delphi IDE or FPC does. It keeps all public information about every involved module in memory, so that the compilation of a whole project can be done much faster than a separate parse and compilation of every unit. A project-wide compilation has another benefit, with regards to global compiler or project settings. Once the public part of a module has been parsed, with the current settings applied, it doesn't deserve another parse throughout the remaining compilation. The weak relationship between C header files and source modules, and IMO in detail the introduction of namespaces in C++, prevents the determination of common public information, that *should* be kept in memory. While namespaces can help to reduce the amount of required information about "external" modules, during compilation, the specification only added another dimension to the already existing unrelated hierarchies of header files and source modules. The construction of the symbol tables for the used namespaces requires parsing of *all* possible sources (at least of all header files), and throwing away unused information requires the same exhaustive parse for every single source module. Please correct me if my analysis is wrong. > Fortran compilers traditionally reinitialize just about everything on > an END statement, before starting the next program unit. Here the arbitrary layout of COMMON sections prevents the reuse of information from a previous parse. DoDi |
|
#10
| |||
| |||
| On 2008-08-23, Hans-Peter Diettrich <DrDiettrich1@aol.com> wrote: > A project-wide compilation has another benefit, with regards to global > compiler or project settings. Once the public part of a module has > been parsed, with the current settings applied, it doesn't deserve > another parse throughout the remaining compilation. The weak > relationship between C header files and source modules, and IMO in > detail the introduction of namespaces in C++, prevents the > determination of common public information, that *should* be kept in > memory. It's alos the fact that the preprocessor is used less heavily, and doesn't flow over module borders. IOW if module A uses module B and module C uses B, the compilation is generally independant of the preprocessor state (or any state state actually) of A and C at the moment of including/using. Most Pascal compilers of course implement some checks to emit warnings if preprocessor state deviates (to detect incomplete builds, and hint to the programmer that a full build is in order), but this is magnitudes less complicated. |
![]() |
| Thread Tools | |
| Display Modes | |
In an effort to better serve ads to our visitors, cookies are used on objectmix.com. For more information, check out our Privacy Policy.