#1
While I know that a language such as Java can be compiled either natively or to bytecode that runs on a VM, I have read that, over time, a natively compiled Java program (produced by an AOT compiler) will be less efficient than the same program compiled by a JIT (http://www-128.ibm.com/developerworks/java/library/j-rtj2/index.html).

While AOT-compiled code has a faster start-up time and a smaller memory footprint than JIT-compiled code, once the program has been running for some time the JIT-compiled code will perform better. The article argues that this is because the compiler can use run-time knowledge to optimize routines beyond what static compilation can achieve.

The gist of the article is that JITted code will outperform native code, even C/C++, but it gives no figures to indicate by how much. Does anyone have solid figures on the percentage performance increase that JIT-compiled code can achieve? I want to know whether the benefits are significant. I have yet to be convinced that it is worthwhile, because you have to consider the 'warm-up' performance degradation, which is not always acceptable.

Regards,
B.
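One rough way to frame the warm-up question above is as a break-even calculation. The sketch below is a toy model, not a measurement: it assumes the AOT build runs at a constant rate, while the JIT build runs slowly for a fixed warm-up period and then at a higher steady rate. The function name and every number in the usage note are hypothetical.

```python
def break_even_time(warmup, r_aot, r_warm, r_jit):
    """Time at which a JIT-compiled run has done as much work as an AOT run.

    Toy model (an assumption, not taken from the article):
      - the AOT build does r_aot units of work per second from the start;
      - the JIT build does r_warm units per second for `warmup` seconds,
        then r_jit units per second afterwards;
      - r_warm < r_aot, so the JIT build starts out behind.

    Solving r_aot * t = r_warm * warmup + r_jit * (t - warmup) for t gives
    t = warmup * (r_jit - r_warm) / (r_jit - r_aot).
    Returns None when the JIT build never catches up (r_jit <= r_aot).
    """
    if r_jit <= r_aot:
        return None
    return warmup * (r_jit - r_warm) / (r_jit - r_aot)
```

With made-up figures of a 10-second warm-up at 20 units/s, an AOT rate of 100 units/s and a steady JIT rate of 110 units/s, `break_even_time(10, 100, 20, 110)` gives 90 seconds. Under this model a small steady-state advantage takes a long time to pay back the warm-up, which is why the answer depends so heavily on how long the program runs.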
#2
On 2008-08-26, borophyll@gmail.com <borophyll@gmail.com> wrote:
> The gist of the article is that JITted code will have better
> performance than native code, even C/C++, but gives no figures to
> indicate how much exactly. Does anyone have any solid figures/stats
> on what percentage performance increase can be achieved by JIT code.

Have a look at the Great Language Shootout and the numbers in its FAQ: http://shootout.alioth.debian.org/gp4/faq.php#dynamic
#3
On Aug 26, 11:40 am, boroph...@gmail.com wrote:
> While I know that it is possible for a language (such as Java) to be
> compiled both natively and to bytecode to be run on a VM, I have read
> that over time, a natively compiled Java program (AOT compiler) will
> be less efficient than the same JIT-compiled Java program
> (http://www-128.ibm.com/developerworks/java/library/j-rtj2/index.html).
>
> While AOT-compiled code will have a faster start-up time and smaller
> memory footprint than JIT compiled code, once the program has been
> running for some time the JIT code will have better performance. The
> article argues that this is because the compiler can optimize routines
> to a level beyond that can be achieved using static compilation, by
> using run-time knowledge.
>
> The gist of the article is that JITted code will have better
> performance than native code, even C/C++, but gives no figures to
> indicate how much exactly. Does anyone have any solid figures/stats
> on what percentage performance increase can be achieved by JIT code.
> I am wanting to know if the benefits are significant. I am still to
> be convinced that it is worthwhile, because you have to consider the
> 'warm-up' performance degradation, which is not always acceptable.

My take on this is that a JIT compiler capable of _fully_ utilizing the information available only on the end user's system (execution profile, cache behavior, hardware, etc.) is a very sophisticated piece of software. Its development and maintenance therefore require far more engineering resources than the creation and support of a highly optimizing AOT compiler _of the same robustness_.

Also, C/C++/Fortran compilers that can take an application's execution profile as input have been around for decades. With such information, an AOT compiler could generate versions of hot methods optimized for specific hardware features, such as SSE2.

So I would say that the gist of the article is perfectly correct in the ultimate case, where:

1. The super-JIT-compiler engineering team has access to an unlimited pool of talented compiler/runtime engineers who also happen to be great team players.
2. The performance of that super-JIT-compiler versus AOT is measured on systems with excess CPU and memory resources.
3. The startup time and initial response of the applications used for performance testing are ignored.

Now, (2) and (3) are perfectly valid for server-side (enterprise) apps. Just look at the system configurations for the reported SPECjbb2005 results (http://www.spec.org/jbb2005/results/). But for, say, EEMBC Grinderbench (http://www.grinderbench.com/), which tests Java ME CDC/CLDC performance on cellphones, PDAs, and such, the reverse is true.

LDV
#4
Kevin Stoodley, also of IBM, presented this basic thesis at CGO 2006 in New York, though he generalized it to all languages. His key points were:

1. Dynamic compilers can generate code for precisely the chipset being used. Most compilers generate code aimed at the common subset of slightly different models, with scheduling that is hopefully good on all, but not necessarily optimal on any.

2. Profile-directed feedback (PDF) is a very powerful optimisation. Dynamic compilation does it automatically, and does it particularly well because the data set used for the profiling is the live run. The standard "compile; run collecting data; recompile with PDF" cycle can suffer from artefacts in the data set used to "train" the PDF.

A dynamic compiler certainly can produce better code than an AOT compiler. There is nothing to stop a dynamic compiler from doing all the static analysis a static compiler would do and saving this information for use by the dynamic compilation system. The question is whether the dynamic compiler will in practice produce better code, and whether this offsets the start-up and recompilation overhead.

Jeremy
#5
Jeremy Wright wrote:
> 1. dynamic compilers can generate code for the precisely the chipset
> being used. Most compilers generate code aimed at the common subset of
> slightly different models, with scheduling that is hopefully good on
> all, but not necessarily optimal on any.

I have thought about this one for a while. The great invention of IBM and S/360 was an architecture that would be consistent over a wide range of speeds and memory sizes. Continuing through z/Architecture, it has done amazingly well. But RISC, and even more so VLIW, relies on compilers generating code tuned to the specific implementation, negating the advantage of a common architecture. Distributing source and expecting each user to compile it is not reasonable.

It seems that it should be possible to use an intermediate code, specific to each overall architecture but not specialized for the individual implementation. It could then be used to generate the optimal code for the specific processor at install time, or possibly at run time. The intermediate form would not be quite as universal as, for example, JVM bytecode, but it would still allow for good code as features are added later to an individual sub-architecture.

> 2. profile directed feedback is a very powerful optimisation. Dynamic
> compilation does automatically and does it particularly well because
> the data set used for the profiling is the live run. The standard
> "compile ; run collecting data; recompile with PDF" cycle can suffer
> from artefacts in the data set used to "train" the PDF.

How about instead having the ability to save profile information after running a program, or cumulatively after multiple runs, and then using that for a static recompilation?

-- glen

[The intermediate code plan sounds a lot like the S/38 and AS/400. -John]
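The cumulative-profile idea in the last paragraph can be sketched in a few lines. This is only an illustration under stated assumptions: the JSON file layout, the function names `merge_profile` and `hot_functions`, and the 80% coverage threshold are all hypothetical and do not describe the profile format of any real compiler.

```python
import json
from collections import Counter
from pathlib import Path


def merge_profile(path, run_counts):
    """Add one run's call counts to the profile saved at `path`.

    The file is a plain JSON object mapping function name to call count
    (a made-up format for this sketch). Returns the cumulative Counter.
    """
    path = Path(path)
    total = Counter()
    if path.exists():
        total.update(json.loads(path.read_text()))
    total.update(run_counts)  # Counter.update adds counts, it does not overwrite
    path.write_text(json.dumps(total))
    return total


def hot_functions(profile, fraction=0.8):
    """Smallest prefix of the most-called functions covering `fraction` of all calls."""
    budget = fraction * sum(profile.values())
    hot, covered = [], 0
    for name, count in profile.most_common():
        if covered >= budget:
            break
        hot.append(name)
        covered += count
    return hot
```

A static recompilation step could then read the accumulated file and spend its optimization effort on whatever `hot_functions` returns. Accumulating over many runs dilutes the influence of any single unrepresentative run, though it does not remove it.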
#6
Jeremy Wright <jeremy.wright@microfocus.com> writes:
> 2. profile directed feedback is a very powerful optimisation. Dynamic
> compilation does automatically

I can't see this being "automatic" in any normal sense of the word. It requires an effort to collect and analyse profile data and a good deal of insight to exploit it wisely.

> and does it particularly well because
> the data set used for the profiling is the live run. The standard
> "compile ; run collecting data; recompile with PDF" cycle can suffer
> from artefacts in the data set used to "train" the PDF.

So can dynamic run-time profiling: a program does not have a constant load during its runtime, so information you collect during the first half of the execution may be completely wrong in the second half. The problem is that predicting future behaviour from past behaviour is not always easy (and certainly not perfect). In a sense, information collected during the same run is late: it only talks about the past, and you want to compile for the future. Profile information from complete executions can show how usage changes over the execution time of the program, and (in theory) this can be used to make several variants of the code for different phases of execution and to know when to switch to new versions.

But all this is about how well you can do in the limit, and that isn't really interesting. What you want to know is how to get the most optimisation for a given effort. And I doubt dynamic profile gathering is the best approach for this.

Torben
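The point about a profile going stale between phases can be shown with a toy simulation. Everything here is made up for illustration: the "observations" stand for, say, the receiver type seen at one call site, and "specializing" just means choosing the most frequent value as the fast path.

```python
from collections import Counter


def pick_specialization(observations):
    """Most frequent value seen so far: the case a compiler would specialize for."""
    return Counter(observations).most_common(1)[0][0]


def hit_rate(specialized_for, observations):
    """Fraction of observations that would take the specialized fast path."""
    hits = sum(1 for value in observations if value == specialized_for)
    return hits / len(observations)


# Hypothetical workload whose behaviour changes half way through the run.
phase1 = ["int"] * 90 + ["str"] * 10
phase2 = ["int"] * 20 + ["str"] * 80

# A decision frozen after phase 1 looks good on the past, poor on the future.
early = pick_specialization(phase1)
frozen_rates = (hit_rate(early, phase1), hit_rate(early, phase2))

# With a profile of the complete execution, one variant per phase is possible.
per_phase_rates = (
    hit_rate(pick_specialization(phase1), phase1),
    hit_rate(pick_specialization(phase2), phase2),
)
```

In this contrived case the frozen decision hits the fast path 90% of the time in the first phase and 20% in the second, while per-phase variants reach 90% and 80%. Real programs are far messier, and knowing when to switch variants is the hard part that this sketch skips entirely.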
#7
Glen Herrmannsfeldt wrote:
> How about instead the ability to save profile information after
> running a program, or cumulatively after multiple runs, and then use
> that for a static recompilation.

That is exactly what static compilers that use PDF do today, e.g. http://docs.hp.com/en/B3901-90023/ch02s16.html#bgbdgifg

The problem is that at some point one decides to use that information to build the "production" binary and stops collecting flow data. If the characteristics of the application change, the flow information may no longer be representative. It is worse still if the data initially used were unrepresentative in some fashion. Kevin Stoodley gave an example of such an issue. If memory serves, in that example the path for alloc(10000) was particularly heavily optimized because of an artefact of the test data.

Which reminds me of another point Kevin made. In Unix at least, many parts of the OS are in user space; for instance, most of alloc is in user space, calling the system sbrk() if required. Kevin advocated that these parts of the OS should also be dynamically recompilable so that they adapt to the application.

Because Kevin's presentation was an invited keynote session, there is no paper in the conference proceedings, but his slides are available at http://www.cgo.org/cgo2006/html/StoodleyKeynote.ppt

It is a pain that scheduling strategies, and even the ISA, change over time within chipsets. Unfortunately, unlike compiler writers, hardware engineers occasionally make the wrong choice ;-) and have to make a different choice in the next chipset.

Jeremy
#8
"Torben Ægidius Mogensen" <torbenm@pc-003.diku.dk> wrote in message
> Jeremy Wright <jeremy.wright@microfocus.com> writes:
>
>> 2. profile directed feedback is a very powerful optimisation. Dynamic
>> compilation does automatically
>
> I can't see this being "automatic" in any normal sense of the word.
> It requires an effort to collect and analyse profile data and a good
> deal of insight to exploit it wisely.
>
>> and does it particularly well because
>> the data set used for the profiling is the live run. The standard
>> "compile ; run collecting data; recompile with PDF" cycle can suffer
>> from artefacts in the data set used to "train" the PDF.
>
> So can dynamic run-time profiling: A program does not have a constant
> load during its runtime, so information you collect during the first
> half of the execution may be completely wrong in the second half. The
> problem is that predicting future behaviour from past behavior is not
> always easy (and certainly not perfect). In a sense, information
> collected during the same run is late: It only talks about the past,
> and you want to compile for the future. Getting profile information
> from complete executions can analyse how usage changes over the
> execution time of the program and (in theory) use this to make several
> variants of the code for different phases of execution and know when
> to switch to new versions.
>
> But all this is about how well you can do in the limit, and that isn't
> really interesting. What you want is to know how you get the most
> optimisation with a given effort. And I doubt dynamic profile
> gathering is the best approach for this.

Just me wondering here: how much would all this offer over much simpler optimizations, such as just detecting general architecture features, running comparative micro-benchmarks (mostly between multiple possible ways of compiling things, ...), and then dynamically adjusting the low-level code generation as a result?

For example, the compiler could have several possible code sequences for operations like dot product, cross product, ... It can first see which ones will work on a given processor ("processor has dot-product operator?", ...) and then run a few benchmarks ("is it faster to use x87 or SSE for this?", "is instruction sequence A or B faster?", ...). Potentially it could experiment with a few other things, such as register allocation algorithms, register-vs-memory performance, ...

This, I think, would work fairly well for "most" cases, and it is my personal suspicion that full-on profiler-based tweaking is unlikely to deliver much better performance (5-10%?...), so it is unlikely to make enough difference to be a major practical concern.

However, all this is still much more likely to deliver better performance than pure static optimization (in particular, when dealing with the gain or loss of performance-impacting features or issues, ...).

However, as noted, this would likely imply either full run-time compilation or distributing programs as some kind of bytecode or other IL.
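The "try a few candidate sequences and keep the fastest" idea can be sketched at the source level. This is a stand-in, not a code generator: the three dot-product variants, the `pick_fastest` helper and the sample sizes are all hypothetical, and a real implementation would be choosing between machine-code sequences rather than Python functions.

```python
import operator
import timeit


def dot_loop(a, b):
    """Dot product with an explicit index loop."""
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total


def dot_zip(a, b):
    """Dot product with a generator expression over zipped pairs."""
    return sum(x * y for x, y in zip(a, b))


def dot_map(a, b):
    """Dot product using map with operator.mul."""
    return sum(map(operator.mul, a, b))


CANDIDATES = [dot_loop, dot_zip, dot_map]


def pick_fastest(candidates, a, b, repeats=5, number=200):
    """Micro-benchmark each candidate on sample data and return the quickest.

    Takes the minimum over several repeats to reduce noise from other
    activity on the machine.
    """
    def best_time(fn):
        return min(timeit.repeat(lambda: fn(a, b), repeat=repeats, number=number))

    return min(candidates, key=best_time)
```

At start-up one might call `dot = pick_fastest(CANDIDATES, sample_a, sample_b)` once and use `dot` from then on. Which candidate wins will vary by machine and interpreter version, which is the whole point; the sketch also leaves out the earlier step of filtering out candidates the processor cannot run at all.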
#9
(snip, I wrote)
> It seems that it should be possible to use an intermediate code,
> specific for each overall architecture, but not specialized for the
> individual implementation. It could then be used to generate the
> optimal code for the specific processor at install time, or possibly
> at run time.

(snip)
> [The intermediate code plan sounds a lot like the S/38 and AS/400.
> -John]

I suppose so, though I don't know the details of those very well. As I understand it, IBM doesn't say much.

One thing I didn't think about before that post is how to do it independently of the OS. If the processor manufacturer supplies the conversion program, it has to be written in an OS-independent fashion. At the least, the details of the intermediate code should be open. Hopefully it wouldn't make decompilation any easier, so that vendors would not have to worry about secrets getting out.

-- glen

[The AS/400 architecture tightly integrates the operating system and the architecture. There's a virtual machine language that hasn't changed in decades, even though the physical implementation has changed from LSI bit slice to PowerPC chips. I doubt you can do this in an OS-independent way; that way lies the swamp of UNCOL. -John]