#1
Hello,

We are profiling some code on a linux cluster using Intel 10.0 and the first couple of lines we are seeing are:

Each sample counts as 0.01 seconds.
  %   cumulative   self               self     total
 time   seconds   seconds     calls  s/call   s/call  name
76.81    103.64    103.64                             __intel_VEC_memcpy
13.24    121.50     17.86    209720    0.00     0.00  crtm_atmabsorption_mp_crtm_compute_atmabsorption_
 2.20    124.47      2.97                             exp.J
 1.64    126.68      2.21    209720    0.00     0.00  crtm_atmoptics_mp_crtm_combine_atmoptics_
 1.14    128.22      1.54                             log.J
 1.03    129.61      1.39    209720    0.00     0.00  crtm_rtsolution_mp_crtm_compute_rtsolution_
 0.73    130.59      0.99  20114380    0.00     0.00  crtm_planck_functions_mp_crtm_planck_radiance_
....etc...

Can anyone knowledgeable (SteveL?) provide a bit of info about what this procedure does and how we can avoid its heavy use? I realise that last request is unrealistic, but I'm just looking for rules of thumb, nothing too specific.

The code in question uses structures heavily with all of their components being pointers (as they have to be allocatable).

My current working theory is that we are allocating all our structures in such a way as to cause memory fragmentation, so the final compiled executable has to hunt all over the (memory) map to find the data it needs to actually do calculations.

All suggestions (code changes and compiler switches) welcome.

cheers,

paulv
#2
Paul van Delst wrote:
> We are profiling some code on a linux cluster using Intel 10.0 and the
> first couple of lines we are seeing are:
>
> 76.81 103.64 103.64 __intel_VEC_memcpy
> [snip]
>
> Can anyone knowledgeable (SteveL?) provide a bit of info about what this
> procedure does and how we can avoid its heavy use.
> [snip]
>
> All suggestions (code changes and compiler switches) welcome.

As its name implies, the function would be a replacement for C memcpy(), and would simply copy data. A possible reason for heavy usage would be excessive use of temporary arrays, particularly if these are large enough to incur cache misses. Syntax such as array assignment and matmul() is highly productive of temporaries, some of which could be avoided by better optimization in the compiler.

How do other compilers do?

If you are willing to work with a current version of ifort, and to submit a case to Intel support, there is likely to be scope for improvement.
#3
Tim Prince wrote:
> Paul van Delst wrote:
>> [snip]
>
> As its name implies, the function would be a replacement for C memcpy(),
> and would simply copy data. A possible reason for heavy usage would be
> excessive use of temporary arrays, particularly if these are large
> enough to incur cache misses. Syntax such as array assignment and
> matmul() is highly productive of temporaries, some of which could be
> avoided by better optimization in the compiler.

We have highlighted some of these areas (particularly matmul usage) in the code. And we recently introduced a feature in our code that does do routine array assignment.

> How do other compilers do?
> If you are willing to work with a current version of ifort, and to
> submit a case to Intel support, there is likely to be scope for
> improvement.

Oh, I'm sure the problem is in our code, or in the switches we're using to compile, not the intel compiler. If I gave that impression, I apologise. A much earlier version of the code that was purely array based was ~7x faster (same compiler and platform, run in the same test suite). Basically, once you subtract the time for the memcpy in the newer version, the times were comparable.

A just-off-the-press run with g95 (don't know which version, but assume 0.9) ran twice as fast as the intel executable, so I think we need to look a bit closer at the intel compiler switches we're using. Currently we have a very simple set:

  FC_FLAGS= -c \
            -O2 \
            -convert big_endian \
            -warn errors \
            -free \
            -assume byterecl

and

  FL_FLAGS= -static-libcxa \
            -o

For g95 our compile switches are

  FC_FLAGS= -c \
            -O2 \
            -fendian=big \
            -ffast-math \
            -ffree-form \
            -fno-second-underscore \
            -funroll-loops \
            -malign-double \
            -std=f95

which are a bit more aggressive, so I don't think the intel/g95 comparison I mentioned above is fair (to the intel result).

cheers,

paulv

p.s. btw, I no longer have access to an intel compiler so can't play around; it's one of our users that compiles with intel.
#4
Paul van Delst wrote:
> [snip]
> A just-off-the-press run with g95 (don't know which version, but assume
> 0.9) ran twice as fast as the intel executable, so I think we need to
> look a bit closer at the intel compiler switches we're using. Currently
> we have a very simple set:
>
> FC_FLAGS= -c \
>           -O2 \
>           -convert big_endian \
>           -warn errors \
>           -free \
>           -assume byterecl
>
> and
>
> FL_FLAGS= -static-libcxa \
>           -o
>
> For g95 our compile switches are
>
> FC_FLAGS= -c \
>           -O2 \
>           -fendian=big \
>           -ffast-math \
>           -ffree-form \
>           -fno-second-underscore \
>           -funroll-loops \
>           -malign-double \
>           -std=f95
>
> which are a bit more aggressive so I don't think the intel/g95
> comparison I mentioned above is fair (to the intel result).

Those options are reasonably comparable.

As you have identified matmul() usage as a possible problem, I will comment on that:

If the matmul result is assigned directly to an array, e.g.

  result = matmul(arg1,arg2)

and arg1 and arg2 don't involve sparsity (explicit strides, etc.), an optimizing compiler ought not to make a hidden temporary array, in my opinion. If it does so, I would suggest a problem report.

In the case where matmul is used in an expression, e.g.

  result = result + matmul(arg1,arg2)*scalar

a compiler can't avoid the allocation of a temporary array for the intermediate result. In this case, if the matrix is at all large (20x20 or more), BLAS ?GEMM (called directly, not via a matmul wrapper such as the one in gfortran) is a better choice. The matmul() temporary array can slow it down significantly.

I don't know whether blas95 would be efficient; it may be, particularly with interprocedural optimization.
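The accumulation case above can be sketched with explicit loops that add each product term straight into the result, so no intermediate array is needed for the matmul product. This is only an illustration under stated assumptions: the names a, b, c and alpha are hypothetical stand-ins for arg1, arg2, result and scalar, the matrices are assumed square and double precision, and the size n is arbitrary. Whether it actually beats the compiler's matmul would have to be measured.

```fortran
program accumulate_product
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 5            ! hypothetical matrix size
  real(dp) :: a(n,n), b(n,n), c(n,n), reference(n,n)
  real(dp) :: alpha
  integer  :: i, j, k

  call random_number(a)
  call random_number(b)
  call random_number(c)
  alpha = 0.5_dp

  ! Expression form from the post: the matmul product is an intermediate
  ! result that typically lands in a compiler-generated temporary.
  reference = c + matmul(a, b)*alpha

  ! Loop form: accumulate directly into c. The innermost loop runs down
  ! a column, which is contiguous in Fortran's column-major layout.
  do j = 1, n
    do k = 1, n
      do i = 1, n
        c(i,j) = c(i,j) + alpha*a(i,k)*b(k,j)
      end do
    end do
  end do

  ! With a BLAS library linked in, the same update would be:
  !   call dgemm('N', 'N', n, n, n, alpha, a, n, b, n, 1.0_dp, c, n)

  if (maxval(abs(c - reference)) > 1.0e-12_dp) error stop "loop and matmul disagree"
  print *, "max abs difference:", maxval(abs(c - reference))
end program accumulate_product
```

The DGEMM call is left as a comment so the sketch builds without an external library; its ALPHA and BETA arguments fold the scaling and the accumulation into one call.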
#5
Tim Prince wrote:
> Paul van Delst wrote:
>> Tim Prince wrote:
>>> Paul van Delst wrote:
>>>> Hello,
>>>> [snip]
>> A just-off-the-press run with g95 (don't know which version, but assume
>> 0.9) ran twice as fast as the intel executable, so I think we need to
>> look a bit closer at the intel compiler switches we're using.
>> [snip]
>> which are a bit more aggressive so I don't think the intel/g95
>> comparison I mentioned above is fair (to the intel result).
>
> Those options are reasonably comparable.

O.k., good to know. The guy that ran the tests just informed me that they were done on different machines (i.e. a fast one and a slower one). No prizes for guessing which machine the intel exe was running on. So, I think the intel/g95 comparison I mentioned was a rather large, smelly, red herring. To paraphrase his reply when I expressed extreme surprise at the g95/intel timing comparison, he said it was like comparing "apples to coconuts".

Anyway....

> As you have identified matmul() usage as a possible problem, I will
> comment on that:
> If the matmul result is assigned directly to an array, e.g.
> result = matmul(arg1,arg2)
> and arg1 and arg2 don't involve sparsity (explicit strides, etc.),
> an optimizing compiler ought not to make a hidden temporary array, in my
> opinion. If it does so, I would suggest a problem report.
> In the case where matmul is used in an expression, e.g.
> result = result + matmul(arg1,arg2)*scalar
> a compiler can't avoid the allocation of a temporary array for the
> intermediate result. In this case, if the matrix is at all large (20x20
> or more), BLAS ?GEMM (called directly, not via a matmul wrapper such as
> the one in gfortran) is a better choice. The matmul() temporary array
> can slow it down significantly.

Hmm. That will be a good test for us to run. The matrices are organised to be quite small, ~(5x5). In full-blown scattering radiative transfer, if you write the mathematical relationships down, the phase matrices can be quite a bit larger, but a lot of elements are zero. So I don't think the matrix size is the issue.

But, our use of matmul in expressions will be. The IBM compiler has problems with this also (but much worse). There is also striding done in these calls too (I've posted previously about that wrt the IBM compiler in clf).

Thanks for the tips. It's all good stuff!

cheers,

paulv

> I don't know whether blas95 would be efficient; it may be, particularly
> with interprocedural optimization.
#6
Paul van Delst <Paul.vanDelst@noaa.gov> wrote:

> But, our use of matmul in expressions will be. The IBM compiler has
> problems with this also (but much worse). There is also striding done in
> these calls too (I've posted previously about that wrt the IBM compiler in
> clf)

Yep. If you are using matmul in expressions and with strided arguments, that sure sounds likely to me as the source of the performance problem. That kind of code would have made me suspicious of possible problems like this even before seeing any data.

--
Richard Maine                    | Good judgement comes from experience;
email: last name at domain . net | experience comes from bad judgement.
domain: summertriangle           |  -- Mark Twain
#7
Richard Maine wrote:
> Paul van Delst <Paul.vanDelst@noaa.gov> wrote:
>
>> But, our use of matmul in expressions will be. The IBM compiler has
>> problems with this also (but much worse). There is also striding done in
>> these calls too (I've posted previously about that wrt the IBM compiler in
>> clf)
>
> Yep. If you are using matmul in expressions and with strided arguments,
> that sure sounds likely to me as the source of the performance problem.
> That kind of code would have made me suspicious of possible problems
> like this even before seeing any data.

I should clarify. The matmul-filled expressions use slices of their arguments but with a stride of 1. However, I'm still dealing with expressions like, e.g.:

  s_rad_up_TL(1:RTV%n_Angles)=s_source_up_TL(1:RTV%n_Angles)+ &
  matmul(Inv_GammaT_TL,refl_down(:,k)+RTV%s_Level_Rad_UP(1:RTV%n_Angles,k)) &
  +matmul(RTV%Inv_GammaT(1:RTV%n_Angles,1:RTV%n_Angles,k),refl_down_TL(1:RTV%n_Angles)+s_rad_up_TL(1:RTV%n_Angles))

and

  Inv_GammaT_AD = matmul(s_refl_up_AD,transpose(RTV%Refl_Trans(1:RTV%n_Angles,1:RTV%n_Angles,k)))
  Refl_Trans_AD = matmul(transpose(RTV%Inv_GammaT(1:RTV%n_Angles,1:RTV%n_Angles,k)),s_refl_up_AD)
  s_refl_up_AD=matmul(Refl_Trans_AD,transpose(RTV%s_Layer_Trans(1:RTV%n_Angles,1:RTV%n_Angles,k)))
  s_trans_AD=matmul(transpose(RTV%s_Level_Refl_UP(1:RTV%n_Angles,1:RTV%n_Angles,k)),Refl_Trans_AD)

So, not only "sliced" matmuls in expressions, but some with arguments of transposed sliced matrices. Uff da!

My next "little" project is to address the above sort of stuff.

cheers,

paulv
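One way the "source plus matmul of a sliced component" pattern might be rewritten is sketched below. It is a guess at the shape of the problem, not the actual CRTM code: rtv_type, its components, the array sizes and the vector names are all hypothetical, the component is declared allocatable rather than pointer for brevity, and the input vector v is assumed to be distinct from the output (unlike the first expression above, where s_rad_up_TL appears on both sides).

```fortran
program sliced_matvec
  implicit none
  integer, parameter :: dp = kind(1.0d0)

  ! Hypothetical stand-in for the RTV structure; only what is needed here.
  type :: rtv_type
    integer :: n_angles
    real(dp), allocatable :: inv_gammat(:,:,:)
  end type rtv_type

  integer, parameter :: max_angles = 6, n_layers = 3   ! hypothetical sizes
  type(rtv_type) :: rtv
  real(dp) :: source(max_angles), v(max_angles)
  real(dp) :: s_loop(max_angles), s_ref(max_angles)
  integer  :: i, j, k, n

  rtv%n_angles = 5
  allocate(rtv%inv_gammat(max_angles, max_angles, n_layers))
  call random_number(rtv%inv_gammat)
  call random_number(source)
  call random_number(v)
  k = 2
  n = rtv%n_angles

  ! Expression form: sliced matmul inside a larger expression.
  s_ref = 0.0_dp
  s_ref(1:n) = source(1:n) + matmul(rtv%inv_gammat(1:n,1:n,k), v(1:n))

  ! Loop form: start from the source term and accumulate the
  ! matrix-vector product column by column, indexing the component
  ! directly instead of passing a slice to matmul.
  s_loop = 0.0_dp
  s_loop(1:n) = source(1:n)
  do j = 1, n
    do i = 1, n
      s_loop(i) = s_loop(i) + rtv%inv_gammat(i,j,k)*v(j)
    end do
  end do

  if (maxval(abs(s_loop - s_ref)) > 1.0e-12_dp) error stop "loop and matmul disagree"
  print *, "max abs difference:", maxval(abs(s_loop - s_ref))
end program sliced_matvec
```

If the output vector also feeds the right-hand side, as in the original expression, the input would need to be copied first or the update reordered; the sketch deliberately sidesteps that case.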
#8
Paul van Delst <Paul.vanDelst@noaa.gov> wrote:

> I should clarify. The matmul-filled expressions use slices of their
> arguments but with a stride of 1....

Though when the arguments are slices of derived-type components such as

> ... RTV%s_Level_Rad_UP(1:RTV%n_Angles,k)

that effectively has a non-unit stride (if you look at the memory layout).

> So, not only "sliced" matmuls in expressions, but some with arguments of
> transposed sliced matrices. Uff da!

Ouch! Matrix multiplication is normally a bit cache-unfriendly because of the way that you go down one column, but across another row. There are lots of ways to address that, at least one of which involves working with the transpose of one of the arrays. If one starts out wanting an X=transpose times Y operation, that is much nicer to do fairly directly. But then to go to the work of doing the transpose, which tends to be a relatively expensive operation anyway, and likely introduce an array temporary, all in order to get it in a less efficient form than the original. Ouch, ouch, ouch. Oh, the humanity! Oh, the performance!

Of course, maybe in an ideal world the compiler might recognize a construct like matmul(transpose(x),y) as a special case. For all I know, some even do. But I sure wouldn't bet on it. I think I might have seen proposals somewhere to define that as an intrinsic on its own so the compiler wouldn't have to recognize it as an optimization. Or maybe I'm thinking of some library procedure I've seen elsewhere. I know that I used to have my own home-grown equivalents of that long ago. (No, they didn't do anything other than the trivial naive implementation, so they aren't worth digging up).

Yes, I understand how things like that come about from just doing the "Formula Translation" thing of transcribing the mathematical formula pretty much directly into Fortran. That may be fine if one doesn't care at all about performance (which is sometimes the case in apps where it is the difference between essentially zero time and ten times zero).

> My next "little" project is to address the above sort of stuff.

Yes, if performance is at all an issue, I'd say so.

--
Richard Maine                    | Good judgement comes from experience;
email: last name at domain . net | experience comes from bad judgement.
domain: summertriangle           |  -- Mark Twain
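The "do the transpose-times-Y operation fairly directly" idea can be sketched as below, in the trivial naive form mentioned above. Element (i,j) of matmul(transpose(x),y) is the dot product of column i of x with column j of y, and both columns are contiguous in memory, so no transposed copy has to be formed. The array names and sizes are hypothetical, and whether a given compiler already does this on its own would need to be checked.

```fortran
program transpose_product
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: m = 5, n = 4     ! hypothetical sizes
  real(dp) :: x(m,n), y(m,n), c(n,n), reference(n,n)
  integer  :: i, j

  call random_number(x)
  call random_number(y)

  ! Expression form: may build transpose(x) as an array temporary.
  reference = matmul(transpose(x), y)

  ! Direct form: c(i,j) = sum over k of x(k,i)*y(k,j), i.e. the dot
  ! product of two columns, each contiguous in column-major storage.
  do j = 1, n
    do i = 1, n
      c(i,j) = dot_product(x(:,i), y(:,j))
    end do
  end do

  ! With a BLAS library linked in, the same product would be:
  !   call dgemm('T', 'N', n, n, m, 1.0_dp, x, m, y, m, 0.0_dp, c, n)

  if (maxval(abs(c - reference)) > 1.0e-12_dp) error stop "direct form and matmul disagree"
  print *, "max abs difference:", maxval(abs(c - reference))
end program transpose_product
```

The commented DGEMM call shows the library route: passing 'T' for the first operand asks BLAS to use x transposed without the caller ever forming the transpose.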
#9
Richard Maine wrote:

> Of course, maybe in an ideal world the compiler might recognize a
> construct like matmul(transpose(x),y) as a special case. For all I know,
> some even do. But I sure wouldn't bet on it.

Yes, ifort and gfortran do optimize some cases of this (at least those which show up in SPECfp). I agree, I have no confidence of that happening in a particular case, unless it's verified (at least to the extent that there is no transpose function call, or new copy). The profile quoted previously makes me wonder if this shows up in excessive copying.