Question about intel_VEC_memcpy

#1
08-19-2008, 09:57 AM
Paul van Delst

Hello,

We are profiling some code on a linux cluster using Intel 10.0 and the first
couple of lines we are seeing are:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 76.81    103.64   103.64                             __intel_VEC_memcpy
 13.24    121.50    17.86   209720     0.00     0.00  crtm_atmabsorption_mp_crtm_compute_atmabsorption_
  2.20    124.47     2.97                             exp.J
  1.64    126.68     2.21   209720     0.00     0.00  crtm_atmoptics_mp_crtm_combine_atmoptics_
  1.14    128.22     1.54                             log.J
  1.03    129.61     1.39   209720     0.00     0.00  crtm_rtsolution_mp_crtm_compute_rtsolution_
  0.73    130.59     0.99 20114380     0.00     0.00  crtm_planck_functions_mp_crtm_planck_radiance_
...etc...

Can anyone knowledgeable (SteveL?) provide a bit of info about what this
procedure does and how we can avoid its heavy use? I realise that last request
is unrealistic - but I'm just looking for rules of thumb, nothing too specific.

The code in question uses structures heavily with all of their components being
pointers (as they have to be allocatable).

My current working theory is that we are allocating all our structures in such a
way as to cause memory fragmentation so the final compiled executable has to hunt
all over the (memory) map to find the data it needs to actually do calculations.

All suggestions (code changes and compiler switches) welcome.

cheers,

paulv
#2
08-19-2008, 10:20 AM
Tim Prince
Re: Question about intel_VEC_memcpy

Paul van Delst wrote:
> [snip]

As its name implies, the function would be a replacement for C memcpy(),
and would simply copy data. A possible reason for heavy usage would be
excessive use of temporary arrays, particularly if these are large enough
to incur cache misses. Syntax such as array assignment and matmul() is
highly productive of temporaries, some of which could be avoided by better
optimization in the compiler. How do other compilers do?
If you are willing to work with a current version of ifort, and to submit
a case to Intel support, there is likely to be scope for improvement.
#3
08-19-2008, 10:50 AM
Paul van Delst
Re: Question about intel_VEC_memcpy

Tim Prince wrote:
> [snip]

> As its name implies, the function would be a replacement for C memcpy(),
> and would simply copy data. A possible reason for heavy usage would be
> excessive use of temporary arrays, particularly if these are large
> enough to incur cache misses. Syntax such as array assignment and
> matmul() is highly productive of temporaries, some of which could be
> avoided by better optimization in the compiler.


We have highlighted some of these areas (particularly matmul usage) in the code. And we
recently introduced a feature in our code that does do routine array assignment.

> How do other compilers do?
> If you are willing to work with a current version of ifort, and to
> submit a case to Intel support, there is likely to be scope for
> improvement.


Oh, I'm sure the problem is in our code, or in the switches we're using to compile, not
the intel compiler. If I gave that impression, I apologise. A much earlier version of the
code that was purely array based was ~7x faster (same compiler and platform, run in the
same test suite). Basically, once you subtract the time for the memcpy in the newer
version, the times were comparable.

A just-off-the-press run with g95 (don't know which version, but assume 0.9) ran twice as
fast as the intel executable, so I think we need to look a bit closer at the intel
compiler switches we're using. Currently we have a very simple set:

FC_FLAGS= -c \
-O2 \
-convert big_endian \
-warn errors \
-free \
-assume byterecl

and

FL_FLAGS= -static-libcxa \
-o

For g95 our compile switches are

FC_FLAGS= -c \
-O2 \
-fendian=big \
-ffast-math \
-ffree-form \
-fno-second-underscore \
-funroll-loops \
-malign-double \
-std=f95

which are a bit more aggressive so I don't think the intel/g95 comparison I mentioned
above is fair (to the intel result).

cheers,

paulv


p.s. btw, I no longer have access to an intel compiler so can't play around - it's one of
our users that compiles with intel.
#4
08-19-2008, 12:29 PM
Tim Prince
Re: Question about intel_VEC_memcpy

Paul van Delst wrote:
> [snip]
>
> [compiler switches for ifort and g95 snipped]
>
> which are a bit more aggressive so I don't think the intel/g95
> comparison I mentioned above is fair (to the intel result).


Those options are reasonably comparable.
As you have identified matmul() usage as a possible problem, I will
comment on that:
If the matmul result is assigned directly to an array, e.g.
result = matmul(arg1,arg2)
and arg1 and arg2 don't involve sparsity (explicit strides, etc.),
an optimizing compiler ought not to make a hidden temporary array, in my
opinion. If it does so, I would suggest a problem report.
In the case where matmul is used in an expression, e.g.
result = result + matmul(arg1,arg2)*scalar
a compiler can't avoid the allocation of a temporary array for the
intermediate result. In this case, if the matrix is at all large (20x20
or more), BLAS ?GEMM (called directly, not via a matmul wrapper such as
the one in gfortran) is a better choice. The matmul() temporary array
can slow it down significantly.
I don't know whether blas95 would be efficient; it may be, particularly
with interprocedural optimization.
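
To make the expression case concrete, here is a minimal sketch (not from the original code) of replacing result = result + matmul(arg1,arg2)*scalar with explicit loops that accumulate straight into the result, so no intermediate array is needed. The names a, b, c, scalar and the size n = 5 are illustrative assumptions, and whether a given compiler really creates a temporary for the matmul form would have to be checked for that compiler.

```fortran
program matmul_accumulate
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 5            ! assumed size, illustration only
  real(dp) :: a(n,n), b(n,n), c(n,n), c_ref(n,n)
  real(dp) :: scalar
  integer :: i, j, k

  call random_number(a)
  call random_number(b)
  call random_number(c)
  scalar = 0.5_dp

  ! Reference: matmul inside an expression. The product has to be held
  ! somewhere before it is scaled and added to c.
  c_ref = c + matmul(a, b)*scalar

  ! Explicit accumulation into c. The innermost loop runs down columns of
  ! c and a, which is the contiguous direction in Fortran's column-major
  ! storage.
  do j = 1, n
    do k = 1, n
      do i = 1, n
        c(i,j) = c(i,j) + scalar*a(i,k)*b(k,j)
      end do
    end do
  end do

  ! With a BLAS library linked in, the same update is a single call:
  !   call dgemm('N', 'N', n, n, n, scalar, a, n, b, n, 1.0_dp, c, n)

  print *, 'max abs difference:', maxval(abs(c - c_ref))
end program matmul_accumulate
```

The printed difference should be at rounding level, since both forms compute the same sums in a possibly different order.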
#5
08-19-2008, 12:55 PM
Paul van Delst
Re: Question about intel_VEC_memcpy

Tim Prince wrote:
> Paul van Delst wrote:
>> Tim Prince wrote:
>>> Paul van Delst wrote:
>>>> Hello,
>>>>


[snip]

>> A just-off-the-press run with g95 (don't know which version, but assume
>> 0.9) ran twice as fast as the intel executable, so I think we need to
>> look a bit closer at the intel compiler switches we're using.
>> Currently we have a very simple set:
>>
>> FC_FLAGS= -c \
>> -O2 \
>> -convert big_endian \
>> -warn errors \
>> -free \
>> -assume byterecl
>>
>> and
>>
>> FL_FLAGS= -static-libcxa \
>> -o
>>
>> For g95 our compile switches are
>>
>> FC_FLAGS= -c \
>> -O2 \
>> -fendian=big \
>> -ffast-math \
>> -ffree-form \
>> -fno-second-underscore \
>> -funroll-loops \
>> -malign-double \
>> -std=f95
>>
>> which are a bit more aggressive so I don't think the intel/g95
>> comparison I mentioned above is fair (to the intel result).

>
> Those options are reasonably comparable.


O.k., good to know.

The guy that ran the tests just informed me that they were done on different machines
(i.e. a fast one and a slower one). No prizes for guessing which machine the intel exe was
running on. So, I think the intel/g95 comparison I mentioned was a rather large, smelly,
red herring. To paraphrase his reply when I expressed extreme surprise at the g95/intel
timing comparison, he said it was like comparing "apples to coconuts". Anyway....

> As you have identified matmul() usage as a possible problem, I will
> comment on that:
> If the matmul result is assigned directly to an array, e.g.
> result = matmul(arg1,arg2)
> and arg1 and arg2 don't involve sparsity (explicit strides, etc.),
> an optimizing compiler ought not to make a hidden temporary array, in my
> opinion. If it does so, I would suggest a problem report.
> In the case where matmul is used in an expression, e.g.
> result = result + matmul(arg1,arg2)*scalar
> a compiler can't avoid the allocation of a temporary array for the
> intermediate result. In this case, if the matrix is at all large (20x20
> or more), BLAS ?GEMM (called directly, not via a matmul wrapper such as
> the one in gfortran) is a better choice. The matmul() temporary array
> can slow it down significantly.


Hmm. That will be a good test for us to run. The matrices are organised to be quite small
~(5x5). In full blown scattering radiative transfer if you write the mathematical
relationships down the phase matrices can be quite a bit larger but a lot of elements are
zero. So I don't think the matrix size is the issue. But, our use of matmul in expressions
will be. The IBM compiler has problems with this also (but much worse). There is also
striding done in these calls too (I've posted previously about that wrt the IBM compiler
in clf)

Thanks for the tips. It's all good stuff!

cheers,

paulv


> I don't know whether blas95 would be efficient; it may be, particularly
> with interprocedural optimization.

#6
08-19-2008, 01:14 PM
Richard Maine
Re: Question about intel_VEC_memcpy

Paul van Delst <Paul.vanDelst@noaa.gov> wrote:

> But, our use of matmul in expressions will be. The IBM compiler has
> problems with this also (but much worse). There is also striding done in
> these calls too (I've posted previously about that wrt the IBM compiler in
> clf)


Yep. If you are using matmul in expressions and with strided arguments,
that sure sounds likely to me as the source of the performance problem.
That kind of code would have made me suspicious of possible problems
like this even before seeing any data.

--
Richard Maine | Good judgement comes from experience;
email: last name at domain . net | experience comes from bad judgement.
domain: summertriangle | -- Mark Twain
#7
08-19-2008, 02:52 PM
Paul van Delst
Re: Question about intel_VEC_memcpy

Richard Maine wrote:
> Paul van Delst <Paul.vanDelst@noaa.gov> wrote:
>
>> But, our use of matmul in expressions will be. The IBM compiler has
>> problems with this also (but much worse). There is also striding done in
>> these calls too (I've posted previously about that wrt the IBM compiler in
>> clf)

>
> Yep. If you are using matmul in expressions and with strided arguments,
> that sure sounds likely to me as the source of the performance problem.
> That kind of code would have made me suspicious of possible problems
> like this even before seeing any data.


I should clarify. The matmul-filled expressions use slices of their arguments but with a
stride of 1. However, I'm still dealing with expressions like, e.g. :

s_rad_up_TL(1:RTV%n_Angles)=s_source_up_TL(1:RTV%n_Angles)+ &
matmul(Inv_GammaT_TL,refl_down(:,k)+RTV%s_Level_Rad_UP(1:RTV%n_Angles,k)) &
+matmul(RTV%Inv_GammaT(1:RTV%n_Angles,1:RTV%n_Angles,k),refl_down_TL(1:RTV%n_Angles)+s_rad_up_TL(1:RTV%n_Angles))

and

Inv_GammaT_AD = matmul(s_refl_up_AD,transpose(RTV%Refl_Trans(1:RTV%n_Angles,1:RTV%n_Angles,k)))
Refl_Trans_AD = matmul(transpose(RTV%Inv_GammaT(1:RTV%n_Angles,1:RTV%n_Angles,k)),s_refl_up_AD)
s_refl_up_AD=matmul(Refl_Trans_AD,transpose(RTV%s_Layer_Trans(1:RTV%n_Angles,1:RTV%n_Angles,k)))
s_trans_AD=matmul(transpose(RTV%s_Level_Refl_UP(1:RTV%n_Angles,1:RTV%n_Angles,k)),Refl_Trans_AD)


So, not only "sliced" matmuls in expressions, but some with arguments of transposed sliced
matrices. Uff da!
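
As a rough illustration of what addressing the first kind of expression could look like, the sketch below reduces it to the pattern x = s + matmul(a, u + x), where the left-hand side also appears on the right. Fortran requires the right-hand side to be fully evaluated before the assignment, so a compiler has to hold the intermediate results somewhere. The names a, s, u, x, w and the size n = 5 are placeholders, not the CRTM variables, and the loop form is only one possible rewrite.

```fortran
program matvec_accumulate
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 5            ! assumed size, illustration only
  real(dp) :: a(n,n), s(n), u(n), x(n), x_ref(n), w(n)
  integer :: i, j

  call random_number(a)
  call random_number(s)
  call random_number(u)
  call random_number(x)

  ! Reference: x appears on both sides of the array assignment.
  x_ref = s + matmul(a, u + x)

  ! Explicit form: save the vector sum (which uses the old x) in a named
  ! work vector, then accumulate the matrix-vector product into x column
  ! by column so the inner loop walks contiguous memory.
  w = u + x
  x = s
  do j = 1, n
    do i = 1, n
      x(i) = x(i) + a(i,j)*w(j)
    end do
  end do

  print *, 'max abs difference:', maxval(abs(x - x_ref))
end program matvec_accumulate
```

The work vector w is explicit here, so its cost is visible and it can be allocated once and reused rather than being created on every call.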

My next "little" project is to address the above sort of stuff.

cheers,

paulv
#8
08-19-2008, 03:23 PM
Richard Maine
Re: Question about intel_VEC_memcpy

Paul van Delst <Paul.vanDelst@noaa.gov> wrote:

> I should clarify. The matmul-filled expressions use slices of their
> arguments but with a stride of 1....


Though when the arguments are slices of derived-type components such as

> ... RTV%s_Level_Rad_UP(1:RTV%n_Angles,k)


that effectively has a non-unit stride (if you look at the memory
layout).

> So, not only "sliced" matmuls in expressions, but some with arguments of
> transposed sliced matrices. Uff da!


Ouch! Matrix multiplication is normally a bit cache-unfriendly because
of the way that you go down one column, but across another row. There
are lots of ways to address that, at least one of which involves
working with the transpose of one of the arrays.

If one starts out wanting an X=transpose times Y operation, that is much
nicer to do fairly directly. But then to go to the work of doing the
transpose, which tends to be a relatively expensive operation anyway,
and likely introduce an array temporary, all in order to get it in a
less efficient form than the original. Ouch, ouch, ouch. Oh, the
humanity! Oh, the performance!

Of course, maybe in an ideal world the compiler might recognize a
construct like matmul(transpose(x),y) as a special case. For all I know,
some even do. But I sure wouldn't bet on it. I think I might have seen
proposals somewhere to define that as an intrinsic on its own so the
compiler wouldn't have to recognize it as an optimization. Or maybe I'm
thinking of some library procedure I've seen elsewhere. I know that I
used to have my own home-grown equivalents of that long ago. (No, they
didn't do anything other than the trivial naive implementation, so they
aren't worth digging up).
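
For illustration only, a trivial naive routine of the kind described above might look like the following sketch. It computes z = transpose(x) times y directly, without forming the transpose: each element of z is a dot product of one column of x with one column of y, and both columns are contiguous in memory. The names x, y, z and the size n = 5 are assumptions made for the example.

```fortran
program transpose_times
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 5            ! assumed size, illustration only
  real(dp) :: x(n,n), y(n,n), z(n,n), z_ref(n,n)
  integer :: i, j, k

  call random_number(x)
  call random_number(y)

  ! Reference: explicit transpose followed by matmul.
  z_ref = matmul(transpose(x), y)

  ! Direct form: z(i,j) = sum over k of x(k,i)*y(k,j).
  ! The inner loop runs down column i of x and column j of y.
  do j = 1, n
    do i = 1, n
      z(i,j) = 0.0_dp
      do k = 1, n
        z(i,j) = z(i,j) + x(k,i)*y(k,j)
      end do
    end do
  end do

  print *, 'max abs difference:', maxval(abs(z - z_ref))
end program transpose_times
```

With a BLAS library available, passing 'T' as the first argument of ?GEMM requests the same transposed product without an explicit transpose.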

Yes, I understand how things like that come about from just doing the
"Formula Translation" thing of transcribing the mathematical formula
pretty much directly into Fortran. That may be fine if one doesn't care
at all about performance (which is sometimes the case in apps where it is
the difference between essentially zero time and ten times zero).

> My next "little" project is to address the above sort of stuff.


Yes, if performance is at all an issue, I'd say so.

--
Richard Maine | Good judgement comes from experience;
email: last name at domain . net | experience comes from bad judgement.
domain: summertriangle | -- Mark Twain
#9
08-19-2008, 07:15 PM
Tim Prince
Re: Question about intel_VEC_memcpy

Richard Maine wrote:

> Of course, maybe in an ideal world the compiler might recognize a
> construct like matmul(transpose(x),y) as a special case. For all I know,
> some even do. But I sure wouldn't bet on it.

Yes, ifort and gfortran do optimize some cases of this (at least those
which show up in SPECfp). I agree, I have no confidence of that
happening in a particular case, unless it's verified (at least to the
extent that there is no transpose function call, or new copy). The
profile quoted previously makes me wonder if this shows up in excessive
copying.