Semantic difference of source files

This is a discussion on Semantic difference of source files within the Compilers forums in Theory and Concepts category; Hi! I have this idea, but I'm not sure if it's already being used anywhere. And if not, I would like to know why. My idea is to leave behind a record of the Abstract Syntax Tree after compilation. This may simply be a copy of the original source code, an XML file or even a binary representation (for better performance, see below). You see, all automatic building utilities I know (make, ant, jam) use only the modification date of a source file to decide whether or not to recompile it and its reverse dependencies. For this reason, many programming ...

Go Back   Application Development Forum > Theory and Concepts > Compilers

Object Mix

Register FAQ Calendar Search Today's Posts Mark Forums Read
  #1  
Old 08-17-2008, 02:56 PM
Michiel
Guest
 
Default Semantic difference of source files

Hi!

I have this idea, but I'm not sure if it's already being used anywhere. And
if not, I would like to know why.

My idea is to leave behind a record of the Abstract Syntax Tree after
compilation. This may simply be a copy of the original source code, an XML
file or even a binary representation (for better performance, see below).

You see, all automatic building utilities I know (make, ant, jam) use only
the modification date of a source file to decide whether or not to
recompile it and its reverse dependencies. For this reason, many
programming languages encourage the programmer to separate their interfaces
and implementations either textually or by using different files.

Programmers will even use the PIMPL idiom, just to get those private
data-members out of that header file. The advantage is that you may change
the internal representation of a class without having to recompile the
modules that use the header file. But to accomplish this you sacrifice
runtime-speed (the extra redirections) and code clarity.

I propose that in addition to comparing modification dates, we selectively
compare AST's (if the source file is newer). At the very least this allows
the programmer to modify style, layout and comments without requiring any
recompilation at all. If the comparison utility is smart enough, you may
even perform trivial refactoring without losing semantic equivalence.

The big bonus is that you may also, quite easily, separately compare the
interface and the implementation. If only the implementation has changed,
there should be no reason to recompile those modules that only use the
interface. Also, multiple classes could be put into the same source file if
it seems conceptually right, and you'd get the same advantage of separate
comparison.

You can take this even further and let the computer automatically find the
optimal separation of object files, which need not have any relation to the
separation of source files.

Now, this means:

* You need to save a representation of the AST when compiling. This is, I
believe, a low price to pay. Perhaps it could even be attached to the
result file as a sort of meta-data.

* You need to process a changed source-file up to the AST level before you
know the semantic difference. But this is a relatively cheap operation
compared to full compilation and optimization. And it could be avoided for
the old file if you save the AST in binary form.

This seems, to me, such a simple trick that I am surprised I have not
encountered it before. But that may simply be my inexperience.

What do you think?

--
Michiel Helvensteijn
[If you're only going to compare the ASTs to see if they're the same,
all you need to save is a hash. The question of why nobody seems to
care about compilation speed any more. -John]
Reply With Quote
  #2  
Old 08-17-2008, 05:16 PM
Michiel
Guest
 
Default Re: Semantic difference of source files

John wrote:

> [If you're only going to compare the ASTs to see if they're the same,
> all you need to save is a hash.


Ah, but then you lose the ability to selectively compare, which is the
biggest advantage of the method I propose. The programmer could write
a whole class (or various global functions), including implementation,
in the same file. Then it can still be automatically determined if
only the implementation has changed, or if the interface has changed
as well. (All private members of the class belong to the
implementation. All public members belong to the interface.)

However, I suppose you could save a separate hash for the interface
and for the implementation. Also a separate pair for each class and/or
each function in the file, maybe. It's a good idea.

> The question of why nobody seems to
> care about compilation speed any more. -John]


Are you saying nobody cares about compilation speed anymore? (Your
sentence seems to be gramatically ambiguous.)

It's one of the big annoyances during development, I'd say. Especially
for the really big projects. The compiler I'm working on is relatively
small. Still, I have to wait a minute or two if I change an
implementation detail or even a comment in an often used header file.

If people didn't care about compilation speed, why do they use the PIMPL
idiom? I see plenty of people using it for exactly that reason.

--
Michiel Helvensteijn
Reply With Quote
  #3  
Old 08-17-2008, 10:51 PM
Barry Kelly
Guest
 
Default Re: Semantic difference of source files

Michiel wrote:

> You see, all automatic building utilities I know (make, ant, jam)
> use only the modification date of a source file to decide whether or
> not to recompile it and its reverse dependencies. For this reason,
> many programming languages encourage the programmer to separate
> their interfaces and implementations either textually or by using
> different files.


Delphi does not use separate interface and implementation files, and
incorporates its own make logic. However, the make logic still uses
timestamps to determine when to rebuild, because line information and
other implementation details may have changed, and thus debugging would
break.

A larger part of the problem with C, C++, and, for very large
assemblies, C#, is that they rely on parsing large numbers of files for
a single compilation. In the case of C & C++, header files; in the case
of C#, all source files making up the assembly / module.

Delphi avoids these problems by having a intermediate-file compilation
strategy like C & C++, but storing the interface information in the
intermediate file itself.

The problem isn't particularly acute with C#, or Java, for that matter,
because they have working module systems - albeit primitive (though I
don't claim that Delphi is better in this respect).

> I propose that in addition to comparing modification dates, we selectively
> compare AST's (if the source file is newer). At the very least this allows
> the programmer to modify style, layout and comments without requiring any
> recompilation at all. If the comparison utility is smart enough, you may
> even perform trivial refactoring without losing semantic equivalence.


Delphi solves this problem by computing symbol versions for all exported
symbols, and storing symbol versions alongside exports.

Symbol versions are just hashes whose algorithms are specific to the
kind of symbol.

However, these symbol versions are only used for validating link-time
integrity, not for dependency-checking. In other words, we try to make
sure that it's impossible to get a silent link-time mismatch and
consequent crash at runtime.

> * You need to save a representation of the AST when compiling. This is, I
> believe, a low price to pay. Perhaps it could even be attached to the
> result file as a sort of meta-data.


Oddly enough, Delphi also stores representations of selected ASTs in
object files in order to support generics (very roughly equivalent to
C++ template export). These ASTs are stored as binary blobs and the
blobs themselves versioned in a binary fashion, after making all line
number info in the AST relative.

(I work for CodeGear, a division of Embarcadero Tech., on the Delphi
compiler.)

-- Barry

--
http://barrkel.blogspot.com/
Reply With Quote
  #4  
Old 08-18-2008, 10:31 AM
Hans-Peter Diettrich
Guest
 
Default Re: Semantic difference of source files

Michiel schrieb:

> I have this idea, but I'm not sure if it's already being used anywhere. And
> if not, I would like to know why.
>
> My idea is to leave behind a record of the Abstract Syntax Tree after
> compilation. This may simply be a copy of the original source code, an XML
> file or even a binary representation (for better performance, see below).


This is what many compilers do, in binary form. RTTI requires such
information at runtime, so that a description of the interface must be
stored in the object files. Then it's only a small step to store the
whole description of the interfaces of all modules in the object
files.


> [If you're only going to compare the ASTs to see if they're the same,
> all you need to save is a hash. The question of why nobody seems to
> care about compilation speed any more. -John]


Some languages IMO still compile very slowly. Where I can do a "make"
in Delphi, in order to only find the next error in my code, C++ coders
still complain about hours of compilation for huge projects. This
difference may result from twiddling with the source code, in the
expansion of macros or templates, where a compiler cannot know the
impact of changing a minor detail in such a macro, on all other source
files in the project.

At least in a RAD environment some languages turn out to be not very
RAD-friendly, by design.

DoDi
Reply With Quote
  #5  
Old 08-18-2008, 06:12 PM
glen herrmannsfeldt
Guest
 
Default Re: Semantic difference of source files

(snip, John wrote

> The question of why nobody seems to care about compilation speed any
> more. -John]


I wonder more about why nobody cares about compilation memory
requirements. For time, you can always wait, but if you don't have
enough memory you are stuck. (Or virtual address space in the case of
virtual memory.)

The art of compiling large programs on small machines seems to be
completely lost.

-- glen
[Good point. I hear that GCC can no longer compile itself
on mainframe IBM 360 MVS because it doesn't fit. -John]

Reply With Quote
  #6  
Old 08-20-2008, 12:50 AM
glen herrmannsfeldt
Guest
 
Default Re: Semantic difference of source files

John wrote:

> [Good point. I hear that GCC can no longer compile itself
> on mainframe IBM 360 MVS because it doesn't fit. -John]


The Hercules group now has an emulator for S/380, a system that IBM
never created, just to get around that.

It allows one task to have 31 bit addresses and to allocate memory
past 16M, while running a 24 bit address space OS.

There was also discussion about extended addressing for the PDP-10,
and the possibility of porting gcc.

The IBM PL/I (F) compiler is supposed to fit into 44K (64K machine,
20K for OS/360 PCP). It will put the symbol table on disk if
necessary. The compiler itself is a dynamic overlay, parts brought
into memory only when needed. There are 254 separately assembled and
link-edited files with 374846 lines of assembly code.

-- glen
[Fortran H fit into 256K and did some serious, even by modern standards,
optimization. -John]
Reply With Quote
  #7  
Old 08-20-2008, 03:46 AM
Marco van de Voort
Guest
 
Default Re: Semantic difference of source files

On 2008-08-18, Barry Kelly <barry.j.kelly@gmail.com> wrote:
> Delphi does not use separate interface and implementation files, and
> incorporates its own make logic.


Also the files have a clear interface and implementation separation,
which effectively is not that different from two files.

> However, the make logic still uses timestamps to determine when to
> rebuild, because line information and other implementation details may
> have changed, and thus debugging would break.


I'm not entirely sure Delphi/TP does too, but FPC also checks a CRC
over the entire interface (in the compiled header,.ppu) to detect
changes in interface definition (mostly because there are more
compiler versions in circulation, and the risk of only slightly stale
older versions of precompiled files is larger).

> Delphi avoids these problems by having a intermediate-file compilation
> strategy like C & C++, but storing the interface information in the
> intermediate file itself.


And not quiting the compiler for every file.

>> * You need to save a representation of the AST when compiling. This is, I
>> believe, a low price to pay. Perhaps it could even be attached to the
>> result file as a sort of meta-data.

>
> Oddly enough, Delphi also stores representations of selected ASTs in
> object files in order to support generics (very roughly equivalent to
> C++ template export). These ASTs are stored as binary blobs and the
> blobs themselves versioned in a binary fashion, after making all line
> number info in the AST relative.


(And cross unit inlining also I assume)
Reply With Quote
  #8  
Old 08-20-2008, 04:50 PM
glen herrmannsfeldt
Guest
 
Default Re: Semantic difference of source files

Hans-Peter Diettrich wrote:
(snip)

> Some languages IMO still compile very slowly. Where I can do a "make"
> in Delphi, in order to only find the next error in my code, C++ coders
> still complain about hours of compilation for huge projects.


One complication with C and C++ is that some variables have file
scope. Even if none do, the compiler has to be able to compile them,
and so tends to keep data in memory (or backing store) through the
whole file. Many functions in one file compile slower than they would
as separate files.

Fortran compilers traditionally reinitialize just about everything on
an END statement, before starting the next program unit. I believe
the standard has been changed, but in the past compilers did that so
well that a comment after the last END would start a new program unit.
The result was many error messages, such as missing END statement, and
duplicate MAIN program at link time.

-- glen

Reply With Quote
  #9  
Old 08-23-2008, 05:49 PM
Hans-Peter Diettrich
Guest
 
Default Re: Semantic difference of source files

glen herrmannsfeldt schrieb:

>> Some languages IMO still compile very slowly. Where I can do a "make"
>> in Delphi, in order to only find the next error in my code, C++ coders
>> still complain about hours of compilation for huge projects.

>
> One complication with C and C++ is that some variables have file
> scope. Even if none do, the compiler has to be able to compile them,
> and so tends to keep data in memory (or backing store) through the
> whole file. Many functions in one file compile slower than they would
> as separate files.


But this is exactly what a Delphi IDE or FPC does. It keeps all public
information about every involved module in memory, so that the
compilation of a whole project can be done much faster than a separate
parse and compilation of every unit.

A project-wide compilation has another benefit, with regards to global
compiler or project settings. Once the public part of a module has
been parsed, with the current settings applied, it doesn't deserve
another parse throughout the remaining compilation. The weak
relationship between C header files and source modules, and IMO in
detail the introduction of namespaces in C++, prevents the
determination of common public information, that *should* be kept in
memory.

While namespaces can help to reduce the amount of required information
about "external" modules, during compilation, the specification only
added another dimension to the already existing unrelated hierarchies
of header files and source modules. The construction of the symbol
tables for the used namespaces requires parsing of *all* possible
sources (at least of all header files), and throwing away unused
information requires the same exhaustive parse for every single source
module. Please correct me if my analysis is wrong.

> Fortran compilers traditionally reinitialize just about everything on
> an END statement, before starting the next program unit.


Here the arbitrary layout of COMMON sections prevents the reuse of
information from a previous parse.

DoDi
Reply With Quote
  #10  
Old 08-24-2008, 04:20 PM
Marco van de Voort
Guest
 
Default Re: Semantic difference of source files

On 2008-08-23, Hans-Peter Diettrich <DrDiettrich1@aol.com> wrote:
> A project-wide compilation has another benefit, with regards to global
> compiler or project settings. Once the public part of a module has
> been parsed, with the current settings applied, it doesn't deserve
> another parse throughout the remaining compilation. The weak
> relationship between C header files and source modules, and IMO in
> detail the introduction of namespaces in C++, prevents the
> determination of common public information, that *should* be kept in
> memory.


It's alos the fact that the preprocessor is used less heavily, and doesn't
flow over module borders.

IOW if module A uses module B and module C uses B, the compilation is
generally independant of the preprocessor state (or any state state
actually) of A and C at the moment of including/using.

Most Pascal compilers of course implement some checks to emit warnings if
preprocessor state deviates (to detect incomplete builds, and hint to the
programmer that a full build is in order), but this is magnitudes less
complicated.
Reply With Quote
Reply


Thread Tools
Display Modes


All times are GMT -5. The time now is 06:06 AM.


Powered by vBulletin® Version 3.7.2
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
vB Ad Management by =RedTyger=

In an effort to better serve ads to our visitors, cookies are used on objectmix.com. For more information, check out our Privacy Policy.