Array optimizing problem in C++? - Java
Re: Array optimizing problem in C++?
On Mar 25, 12:20 pm, Lionel B <m...@privacy.net> wrote:
>
> Personally I've never managed to code up a scenario (using GCC with
> various optimisations) where __restrict__ appears to have made any
> difference whatsoever.
Try the program I gave earlier on this topic.
Compile it with :
g++ -O3
Then uncomment the line :
//#define NO_ALIASING_OPTIMIZATION
and compile it again with g++ -O3.
There should be a difference.
Let me know if you don't find any.
Here is the program :
#include <iostream>
#include <ctime>
//#define NO_ALIASING_OPTIMIZATION
const int len = 50000;
__attribute__((noinline))
#ifndef NO_ALIASING_OPTIMIZATION
void smooth (int* dest, int * src )
#else
void smooth (int* __restrict dest, int * __restrict src )
#endif
{
    for ( int i = 0 ; i < 17 ; ++i )
        dest[ i ] = src[ i ] + src[ i + 1 ] + src[ i + 2 ];
}
void fill (int* src)
{
    for (int i = 0 ; i < len ; ++ i )
        src[i] = i;
}
int main()
{
    int src_array [len] = {0} ;
    int dest_array [len] = {0};
    fill(src_array);
    smooth (dest_array, dest_array); // dummy call
    clock_t start=clock();
    for (int i = 0; i < 100000000; i++)
        smooth (dest_array, src_array);
    clock_t endt=clock();
    std::cout <<"Time smooth(): " <<
        double(endt-start)/CLOCKS_PER_SEC * 1000 << " ms\n";
    // doesn't work without the following cout on vc++
    std::cout << dest_array [0] ;
    return 0;
}
Alexandre Courpron.
-
Re: Array optimizing problem in C++?
On Tue, 25 Mar 2008 04:48:26 -0700, courpron wrote:
> On Mar 25, 12:20 pm, Lionel B <m...@privacy.net> wrote:
>>
>> Personally I've never managed to code up a scenario (using GCC with
>> various optimisations) where __restrict__ appears to have made any
>> difference whatsoever.
>
> Try the program I gave earlier on this topic. Compile it with :
> g++ -O3
>
> Then uncomment the line :
> //#define NO_ALIASING_OPTIMIZATION
>
> and compile it again with g++ -O3.
>
> There should be a difference.
> Let me know if you don't find any.
With g++ 4.1.2 (same results with 4.2.0)
With aliasing optimisation:
Time smooth(): 2120 ms
Without aliasing optimisation:
Time smooth(): 2120 ms
With ICC (Intel compiler) using -restrict -O3
With aliasing optimisation:
Time smooth(): 3410 ms
Without aliasing optimisation:
Time smooth(): 2990 ms
So there appears to be a small improvement with aliasing optimisation for
ICC (although there is a fair amount of variance in the results, so not
very significant) and no discernible difference for GCC. In fact I've
checked that there is absolutely no difference in the assembler generated
by g++. This is on linux x86_64 - I tried compiling 32-bit binaries
(using -m32 flag) and again, no difference.
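Given the run-to-run variance mentioned above, one common way to get steadier numbers is to repeat the measurement and keep the fastest run. The following is only a minimal sketch of that idea, not something used in this thread; the helper name best_of_ms and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Hypothetical helper (not from the thread): calls `work` `repeats` times
// and returns the fastest wall-clock time observed, in milliseconds.
template <typename F>
double best_of_ms(F work, int repeats)
{
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();
        const std::chrono::duration<double, std::milli> dt = t1 - t0;
        best = std::min(best, dt.count());
    }
    return best;
}
```

The timed loop from main() could be passed in as a lambda, for example best_of_ms([&] { for (int i = 0; i < 100000000; i++) smooth(dest_array, src_array); }, 5). Taking the minimum rather than the mean is a design choice that tends to filter out interference from other processes.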
--
Lionel B
-
Re: Array optimizing problem in C++?
On Tue, 25 Mar 2008 12:54:47 +0000, Lionel B wrote:
> On Tue, 25 Mar 2008 04:48:26 -0700, courpron wrote:
>
>> On Mar 25, 12:20 pm, Lionel B <m...@privacy.net> wrote:
> With ICC (Intel compiler) using -restrict -O3
>
> With aliasing optimisation:
>
> Time smooth(): 3410 ms
>
> Without aliasing optimisation:
>
> Time smooth(): 2990 ms
Sorry, those results were swapped round
> So there appears to be a small improvement with aliasing optimisation
> for ICC (although there is a fair amount of variance in the results, so
> not very significant)
--
Lionel B
-
Re: Array optimizing problem in C++?
On Mar 25, 1:54 pm, Lionel B <m...@privacy.net> wrote:
> On Tue, 25 Mar 2008 04:48:26 -0700, courpron wrote:
> > On Mar 25, 12:20 pm, Lionel B <m...@privacy.net> wrote:
>
> >> Personally I've never managed to code up a scenario (using GCC with
> >> various optimisations) where __restrict__ appears to have made any
> >> difference whatsoever.
>
> > Try the program I gave earlier on this topic. Compile it with :
> > g++ -O3
>
> > Then uncomment the line :
> > //#define NO_ALIASING_OPTIMIZATION
>
> > and compile it again with g++ -O3.
>
> > There should be a difference.
> > Let me know if you don't find any.
>
> With g++ 4.1.2 (same results with 4.2.0)
>
> With aliasing optimisation:
>
> Time smooth(): 2120 ms
>
> Without aliasing optimisation:
>
> Time smooth(): 2120 ms
>
I used exactly the same compiler (g++ 4.1.2) and there was a
difference in the generated assembly listing (my architecture is 32
bits, intel pentium 4)
Could you post the generated assembly when NO_ALIASING_OPTIMIZATION is
defined ?
Alexandre Courpron.
-
Re: Array optimizing problem in C++?
On Tue, 25 Mar 2008 06:13:30 -0700, courpron wrote:
> On Mar 25, 1:54 pm, Lionel B <m...@privacy.net> wrote:
>> On Tue, 25 Mar 2008 04:48:26 -0700, courpron wrote:
>> > On Mar 25, 12:20 pm, Lionel B <m...@privacy.net> wrote:
>>
>> >> Personally I've never managed to code up a scenario (using GCC with
>> >> various optimisations) where __restrict__ appears to have made any
>> >> difference whatsoever.
>>
>> > Try the program I gave earlier on this topic. Compile it with : g++
>> > -O3
>>
>> > Then uncomment the line :
>> > //#define NO_ALIASING_OPTIMIZATION
>>
>> > and compile it again with g++ -O3.
>>
>> > There should be a difference.
>> > Let me know if you don't find any.
>>
>> With g++ 4.1.2 (same results with 4.2.0)
>>
>> With aliasing optimisation:
>>
>> Time smooth(): 2120 ms
>>
>> Without aliasing optimisation:
>>
>> Time smooth(): 2120 ms
>>
> I used exactly the same compiler (g++ 4.1.2) and there was a difference
> in the generated assembly listing (my architecture is 32 bits, intel
> pentium 4)
>
> Could you post the generated assembly when NO_ALIASING_OPTIMIZATION is
> defined ?
Sure (here's the 32-bit version, as you might be able to compare it better):
$ g++ -m32 -O3 -S scratch.cpp
$ cat scratch.s
.file "scratch.cpp"
.section .ctors,"aw",@progbits
.align 4
.long _GLOBAL__I__Z6smoothPiS_
.text
.align 2
.p2align 4,,15
.globl _Z6smoothPiS_
.type _Z6smoothPiS_, @function
_Z6smoothPiS_:
.LFB1435:
pushl %ebp
.LCFI0:
movl %esp, %ebp
.LCFI1:
movl 12(%ebp), %edx
pushl %esi
.LCFI2:
movl 8(%ebp), %esi
pushl %ebx
.LCFI3:
movl 8(%edx), %eax
leal 8(%edx), %ebx
addl 4(%edx), %eax
addl (%edx), %eax
leal 4(%edx), %ecx
movl %eax, (%esi)
movl 4(%ebx), %eax
addl 4(%ecx), %eax
addl 4(%edx), %eax
movl %eax, 4(%esi)
movl 8(%edx), %eax
addl 8(%ecx), %eax
addl 8(%ebx), %eax
movl %eax, 8(%esi)
movl 12(%edx), %eax
addl 12(%ecx), %eax
addl 12(%ebx), %eax
movl %eax, 12(%esi)
movl 16(%edx), %eax
addl 16(%ecx), %eax
addl 16(%ebx), %eax
movl %eax, 16(%esi)
movl 20(%edx), %eax
addl 20(%ecx), %eax
addl 20(%ebx), %eax
movl %eax, 20(%esi)
movl 24(%edx), %eax
addl 24(%ecx), %eax
addl 24(%ebx), %eax
movl %eax, 24(%esi)
movl 28(%edx), %eax
addl 28(%ecx), %eax
addl 28(%ebx), %eax
movl %eax, 28(%esi)
movl 32(%edx), %eax
addl 32(%ecx), %eax
addl 32(%ebx), %eax
movl %eax, 32(%esi)
movl 36(%edx), %eax
addl 36(%ecx), %eax
addl 36(%ebx), %eax
movl %eax, 36(%esi)
movl 40(%edx), %eax
addl 40(%ecx), %eax
addl 40(%ebx), %eax
movl %eax, 40(%esi)
movl 44(%edx), %eax
addl 44(%ecx), %eax
addl 44(%ebx), %eax
movl %eax, 44(%esi)
movl 48(%edx), %eax
addl 48(%ecx), %eax
addl 48(%ebx), %eax
movl %eax, 48(%esi)
movl 52(%edx), %eax
addl 52(%ecx), %eax
addl 52(%ebx), %eax
movl %eax, 52(%esi)
movl 56(%edx), %eax
addl 56(%ecx), %eax
addl 56(%ebx), %eax
movl %eax, 56(%esi)
movl 60(%edx), %eax
addl 60(%ecx), %eax
addl 60(%ebx), %eax
movl %eax, 60(%esi)
movl 64(%edx), %eax
addl 64(%ecx), %eax
addl 64(%ebx), %eax
movl %eax, 64(%esi)
popl %ebx
popl %esi
popl %ebp
ret
.LFE1435:
.size _Z6smoothPiS_, .-_Z6smoothPiS_
.globl __gxx_personality_v0
.align 2
.p2align 4,,15
..globl _Z4fillPi
.type _Z4fillPi, @function
_Z4fillPi:
..LFB1436:
pushl %ebp
..LCFI4:
xorl %eax, %eax
movl %esp, %ebp
..LCFI5:
movl 8(%ebp), %edx
.p2align 4,,7
..L4:
movl %eax, (%edx,%eax,4)
addl $1, %eax
cmpl $50000, %eax
jne .L4
popl %ebp
ret
..LFE1436:
.size _Z4fillPi, .-_Z4fillPi
.align 2
.p2align 4,,15
.type _Z41__static_initialization_and_destruction_0ii, @function
_Z41__static_initialization_and_destruction_0ii:
..LFB1591:
pushl %ebp
..LCFI6:
movl %esp, %ebp
..LCFI7:
subl $24, %esp
..LCFI8:
subl $1, %eax
je .L15
..L14:
leave
ret
.p2align 4,,7
..L15:
cmpl $65535, %edx
jne .L14
movl $_ZSt8__ioinit, (%esp)
call _ZNSt8ios_base4InitC1Ev
movl $__dso_handle, 8(%esp)
movl $0, 4(%esp)
movl $__tcf_0, (%esp)
call __cxa_atexit
leave
ret
..LFE1591:
.size _Z41__static_initialization_and_destruction_0ii, .-_Z41__static_initialization_and_destruction_0ii
.align 2
.p2align 4,,15
.type _GLOBAL__I__Z6smoothPiS_, @function
_GLOBAL__I__Z6smoothPiS_:
..LFB1593:
pushl %ebp
..LCFI9:
movl $65535, %edx
movl %esp, %ebp
..LCFI10:
movl $1, %eax
popl %ebp
jmp _Z41__static_initialization_and_destruction_0ii
..LFE1593:
.size _GLOBAL__I__Z6smoothPiS_, .-_GLOBAL__I__Z6smoothPiS_
.align 2
.p2align 4,,15
.type __tcf_0, @function
__tcf_0:
..LFB1592:
pushl %ebp
..LCFI11:
movl %esp, %ebp
..LCFI12:
movl $_ZSt8__ioinit, 8(%ebp)
popl %ebp
jmp _ZNSt8ios_base4InitD1Ev
..LFE1592:
.size __tcf_0, .-__tcf_0
.section .rodata.str1.1,"aMS",@progbits,1
..LC0:
.string "Time smooth(): "
..LC3:
.string " ms\n"
.section .rodata.cst4,"aM",@progbits,4
.align 4
..LC1:
.long 1232348160
.align 4
..LC2:
.long 1148846080
.text
.align 2
.p2align 4,,15
..globl main
.type main, @function
main:
..LFB1437:
leal 4(%esp), %ecx
..LCFI13:
andl $-16, %esp
pushl -4(%ecx)
..LCFI14:
pushl %ebp
..LCFI15:
movl %esp, %ebp
..LCFI16:
pushl %edi
..LCFI17:
pushl %esi
..LCFI18:
pushl %ebx
..LCFI19:
pushl %ecx
..LCFI20:
subl $400024, %esp
..LCFI21:
leal -200016(%ebp), %esi
movl $200000, 8(%esp)
leal -400016(%ebp), %edi
movl $0, 4(%esp)
movl %esi, (%esp)
call memset
movl $200000, 8(%esp)
movl $0, 4(%esp)
movl %edi, (%esp)
call memset
xorl %eax, %eax
.p2align 4,,7
..L21:
movl %eax, (%esi,%eax,4)
addl $1, %eax
cmpl $50000, %eax
jne .L21
movl %edi, 4(%esp)
xorl %ebx, %ebx
movl %edi, (%esp)
call _Z6smoothPiS_
call clock
movl %eax, -400020(%ebp)
.p2align 4,,7
..L23:
movl %esi, 4(%esp)
addl $1, %ebx
movl %edi, (%esp)
call _Z6smoothPiS_
cmpl $100000000, %ebx
jne .L23
call clock
movl $.LC0, 4(%esp)
movl $_ZSt4cout, (%esp)
movl %eax, %ebx
call _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
subl -400020(%ebp), %ebx
pushl %ebx
fildl (%esp)
addl $4, %esp
fdivs .LC1
movl %eax, (%esp)
fmuls .LC2
fstpl 4(%esp)
call _ZNSolsEd
movl $.LC3, 4(%esp)
movl %eax, (%esp)
call _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
movl -400016(%ebp), %eax
movl $_ZSt4cout, (%esp)
movl %eax, 4(%esp)
call _ZNSolsEi
addl $400024, %esp
xorl %eax, %eax
popl %ecx
popl %ebx
popl %esi
popl %edi
popl %ebp
leal -4(%ecx), %esp
ret
..LFE1437:
.size main, .-main
.local _ZSt8__ioinit
.comm _ZSt8__ioinit,1,1
.weakref _Z20__gthrw_pthread_oncePiPFvvE,pthread_once
.weakref _Z27__gthrw_pthread_getspecificj,pthread_getspecific
.weakref _Z27__gthrw_pthread_setspecificjPKv,pthread_setspecific
.weakref _Z22__gthrw_pthread_createPmPK14pthread_attr_tPFPvS3_ES3_,pthread_create
.weakref _Z22__gthrw_pthread_cancelm,pthread_cancel
.weakref _Z26__gthrw_pthread_mutex_lockP15pthread_mutex_t,pthread_mutex_lock
.weakref _Z29__gthrw_pthread_mutex_trylockP15pthread_mutex_t,pthread_mutex_trylock
.weakref _Z28__gthrw_pthread_mutex_unlockP15pthread_mutex_t,pthread_mutex_unlock
.weakref _Z26__gthrw_pthread_mutex_initP15pthread_mutex_tPK19pthread_mutexattr_t,pthread_mutex_init
.weakref _Z26__gthrw_pthread_key_createPjPFvPvE,pthread_key_create
.weakref _Z26__gthrw_pthread_key_deletej,pthread_key_delete
.weakref _Z30__gthrw_pthread_mutexattr_initP19pthread_mutexattr_t,pthread_mutexattr_init
.weakref _Z33__gthrw_pthread_mutexattr_settypeP19pthread_mutexattr_ti,pthread_mutexattr_settype
.weakref _Z33__gthrw_pthread_mutexattr_destroyP19pthread_mutexattr_t,pthread_mutexattr_destroy
.section .eh_frame,"a",@progbits
..Lframe1:
.long .LECIE1-.LSCIE1
..LSCIE1:
.long 0x0
.byte 0x1
.string "zP"
.uleb128 0x1
.sleb128 -4
.byte 0x8
.uleb128 0x5
.byte 0x0
.long __gxx_personality_v0
.byte 0xc
.uleb128 0x4
.uleb128 0x4
.byte 0x88
.uleb128 0x1
.align 4
..LECIE1:
..LSFDE5:
.long .LEFDE5-.LASFDE5
..LASFDE5:
.long .LASFDE5-.Lframe1
.long .LFB1591
.long .LFE1591-.LFB1591
.uleb128 0x0
.byte 0x4
.long .LCFI6-.LFB1591
.byte 0xe
.uleb128 0x8
.byte 0x85
.uleb128 0x2
.byte 0x4
.long .LCFI7-.LCFI6
.byte 0xd
.uleb128 0x5
.align 4
..LEFDE5:
..LSFDE11:
.long .LEFDE11-.LASFDE11
..LASFDE11:
.long .LASFDE11-.Lframe1
.long .LFB1437
.long .LFE1437-.LFB1437
.uleb128 0x0
.byte 0x4
.long .LCFI13-.LFB1437
.byte 0xc
.uleb128 0x1
.uleb128 0x0
.byte 0x9
.uleb128 0x4
.uleb128 0x1
.byte 0x4
.long .LCFI14-.LCFI13
.byte 0xc
.uleb128 0x4
.uleb128 0x4
.byte 0x4
.long .LCFI15-.LCFI14
.byte 0xe
.uleb128 0x8
.byte 0x85
.uleb128 0x2
.byte 0x4
.long .LCFI16-.LCFI15
.byte 0xd
.uleb128 0x5
.byte 0x4
.long .LCFI20-.LCFI16
.byte 0x84
.uleb128 0x6
.byte 0x83
.uleb128 0x5
.byte 0x86
.uleb128 0x4
.byte 0x87
.uleb128 0x3
.align 4
..LEFDE11:
.ident "GCC: (GNU) 4.1.2 20070626 (Red Hat 4.1.2-14)"
.section .note.GNU-stack,"",@progbits
--
Lionel B
-
Re: Array optimizing problem in C++?
On Mar 25, 2:23 pm, Lionel B <m...@privacy.net> wrote:
> [snip asm]
It doesn't generate the same thing on my machine. Your assembly
listing doesn't show any no-aliasing optimization: there are still four
memory accesses per iteration (three reads and one write). I also see
"weakref" directives in your assembly listing, which is not the
case on my machine. This is probably a libstdc++ issue. So, if you
don't mind, could you try the following program? It has a simpler smooth
function and no #include; we can always resort to the command
"time ./a.out" to measure the performance of each version:
//#define NO_ALIASING_OPTIMIZATION
const int len = 16;
__attribute__((noinline))
#ifndef NO_ALIASING_OPTIMIZATION
void smooth (int* dest, int * src )
#else
void smooth (int* __restrict dest, int * __restrict src )
#endif
{
    for ( int i = 0 ; i < len ; ++i )
        dest[ i ] = *src;
}
int main()
{
    int dest_array [len];
    int *src = new int();
    for (int i = 0; i < 100000000; i++)
        smooth (dest_array, src);
    return 0;
}
Alexandre Courpron.
-
Re: Array optimizing problem in C++?
On Tue, 25 Mar 2008 07:22:36 -0700, courpron wrote:
> On Mar 25, 2:23 pm, Lionel B <m...@privacy.net> wrote:
>> [snip asm]
>
> It doesn't generate the same thing on my machine. Your assembly listing
> doesn't show any no-aliasing optimization: there are still four memory
> accesses per iteration (three reads and one write). I also see "weakref"
> directives in your assembly listing, which is not the case on my
> machine. This is probably a libstdc++ issue. So, if you don't mind, could
> you try the following program? It has a simpler smooth function and no
> #include; we can always resort to the command "time ./a.out" to measure
> the performance of each version:
That seems to have made a difference - to the assembly at least - I
still don't see any significant difference in run time, however.
Compiled with:
g++ -m32 -S -O3 scratch.cpp
**************************** with no-alias optimisation:
.file "scratch.cpp"
.text
.align 2
.p2align 4,,15
.globl _Z6smoothPiS_
.type _Z6smoothPiS_, @function
_Z6smoothPiS_:
.LFB2:
pushl %ebp
.LCFI0:
movl %esp, %ebp
.LCFI1:
movl 12(%ebp), %edx
movl 8(%ebp), %eax
movl (%edx), %edx
movl %edx, (%eax)
movl %edx, 4(%eax)
movl %edx, 8(%eax)
movl %edx, 12(%eax)
movl %edx, 16(%eax)
movl %edx, 20(%eax)
movl %edx, 24(%eax)
movl %edx, 28(%eax)
movl %edx, 32(%eax)
movl %edx, 36(%eax)
movl %edx, 40(%eax)
movl %edx, 44(%eax)
movl %edx, 48(%eax)
movl %edx, 52(%eax)
movl %edx, 56(%eax)
movl %edx, 60(%eax)
popl %ebp
ret
.LFE2:
.size _Z6smoothPiS_, .-_Z6smoothPiS_
.globl __gxx_personality_v0
.align 2
.p2align 4,,15
..globl main
.type main, @function
main:
..LFB3:
leal 4(%esp), %ecx
..LCFI2:
andl $-16, %esp
pushl -4(%ecx)
..LCFI3:
pushl %ebp
..LCFI4:
movl %esp, %ebp
..LCFI5:
pushl %edi
..LCFI6:
pushl %esi
..LCFI7:
pushl %ebx
..LCFI8:
xorl %ebx, %ebx
pushl %ecx
..LCFI9:
subl $72, %esp
..LCFI10:
movl $4, (%esp)
leal -80(%ebp), %edi
call _Znwj
movl %eax, %esi
movl $0, (%eax)
.p2align 4,,7
..L4:
movl %esi, 4(%esp)
addl $1, %ebx
movl %edi, (%esp)
call _Z6smoothPiS_
cmpl $100000000, %ebx
jne .L4
addl $72, %esp
xorl %eax, %eax
popl %ecx
popl %ebx
popl %esi
popl %edi
popl %ebp
leal -4(%ecx), %esp
ret
..LFE3:
.size main, .-main
.section .eh_frame,"a",@progbits
..Lframe1:
.long .LECIE1-.LSCIE1
..LSCIE1:
.long 0x0
.byte 0x1
.string "zP"
.uleb128 0x1
.sleb128 -4
.byte 0x8
.uleb128 0x5
.byte 0x0
.long __gxx_personality_v0
.byte 0xc
.uleb128 0x4
.uleb128 0x4
.byte 0x88
.uleb128 0x1
.align 4
..LECIE1:
..LSFDE3:
.long .LEFDE3-.LASFDE3
..LASFDE3:
.long .LASFDE3-.Lframe1
.long .LFB3
.long .LFE3-.LFB3
.uleb128 0x0
.byte 0x4
.long .LCFI2-.LFB3
.byte 0xc
.uleb128 0x1
.uleb128 0x0
.byte 0x9
.uleb128 0x4
.uleb128 0x1
.byte 0x4
.long .LCFI3-.LCFI2
.byte 0xc
.uleb128 0x4
.uleb128 0x4
.byte 0x4
.long .LCFI4-.LCFI3
.byte 0xe
.uleb128 0x8
.byte 0x85
.uleb128 0x2
.byte 0x4
.long .LCFI5-.LCFI4
.byte 0xd
.uleb128 0x5
.byte 0x4
.long .LCFI8-.LCFI5
.byte 0x83
.uleb128 0x5
.byte 0x86
.uleb128 0x4
.byte 0x87
.uleb128 0x3
.byte 0x4
.long .LCFI9-.LCFI8
.byte 0x84
.uleb128 0x6
.align 4
..LEFDE3:
.ident "GCC: (GNU) 4.1.2 20070626 (Red Hat 4.1.2-14)"
.section .note.GNU-stack,"",@progbits
**************************** without no-alias optimisation:
.file "scratch.cpp"
.text
.align 2
.p2align 4,,15
.globl _Z6smoothPiS_
.type _Z6smoothPiS_, @function
_Z6smoothPiS_:
.LFB2:
pushl %ebp
.LCFI0:
movl %esp, %ebp
.LCFI1:
movl 12(%ebp), %eax
movl 8(%ebp), %edx
movl (%eax), %ecx
movl %ecx, (%edx)
movl (%eax), %ecx
movl %ecx, 4(%edx)
movl (%eax), %ecx
movl %ecx, 8(%edx)
movl (%eax), %ecx
movl %ecx, 12(%edx)
movl (%eax), %ecx
movl %ecx, 16(%edx)
movl (%eax), %ecx
movl %ecx, 20(%edx)
movl (%eax), %ecx
movl %ecx, 24(%edx)
movl (%eax), %ecx
movl %ecx, 28(%edx)
movl (%eax), %ecx
movl %ecx, 32(%edx)
movl (%eax), %ecx
movl %ecx, 36(%edx)
movl (%eax), %ecx
movl %ecx, 40(%edx)
movl (%eax), %ecx
movl %ecx, 44(%edx)
movl (%eax), %ecx
movl %ecx, 48(%edx)
movl (%eax), %ecx
movl %ecx, 52(%edx)
movl (%eax), %ecx
movl %ecx, 56(%edx)
movl (%eax), %eax
movl %eax, 60(%edx)
popl %ebp
ret
.LFE2:
.size _Z6smoothPiS_, .-_Z6smoothPiS_
.globl __gxx_personality_v0
.align 2
.p2align 4,,15
..globl main
.type main, @function
main:
..LFB3:
leal 4(%esp), %ecx
..LCFI2:
andl $-16, %esp
pushl -4(%ecx)
..LCFI3:
pushl %ebp
..LCFI4:
movl %esp, %ebp
..LCFI5:
pushl %edi
..LCFI6:
pushl %esi
..LCFI7:
pushl %ebx
..LCFI8:
xorl %ebx, %ebx
pushl %ecx
..LCFI9:
subl $72, %esp
..LCFI10:
movl $4, (%esp)
leal -80(%ebp), %edi
call _Znwj
movl %eax, %esi
movl $0, (%eax)
.p2align 4,,7
..L4:
movl %esi, 4(%esp)
addl $1, %ebx
movl %edi, (%esp)
call _Z6smoothPiS_
cmpl $100000000, %ebx
jne .L4
addl $72, %esp
xorl %eax, %eax
popl %ecx
popl %ebx
popl %esi
popl %edi
popl %ebp
leal -4(%ecx), %esp
ret
..LFE3:
.size main, .-main
.section .eh_frame,"a",@progbits
..Lframe1:
.long .LECIE1-.LSCIE1
..LSCIE1:
.long 0x0
.byte 0x1
.string "zP"
.uleb128 0x1
.sleb128 -4
.byte 0x8
.uleb128 0x5
.byte 0x0
.long __gxx_personality_v0
.byte 0xc
.uleb128 0x4
.uleb128 0x4
.byte 0x88
.uleb128 0x1
.align 4
..LECIE1:
..LSFDE3:
.long .LEFDE3-.LASFDE3
..LASFDE3:
.long .LASFDE3-.Lframe1
.long .LFB3
.long .LFE3-.LFB3
.uleb128 0x0
.byte 0x4
.long .LCFI2-.LFB3
.byte 0xc
.uleb128 0x1
.uleb128 0x0
.byte 0x9
.uleb128 0x4
.uleb128 0x1
.byte 0x4
.long .LCFI3-.LCFI2
.byte 0xc
.uleb128 0x4
.uleb128 0x4
.byte 0x4
.long .LCFI4-.LCFI3
.byte 0xe
.uleb128 0x8
.byte 0x85
.uleb128 0x2
.byte 0x4
.long .LCFI5-.LCFI4
.byte 0xd
.uleb128 0x5
.byte 0x4
.long .LCFI8-.LCFI5
.byte 0x83
.uleb128 0x5
.byte 0x86
.uleb128 0x4
.byte 0x87
.uleb128 0x3
.byte 0x4
.long .LCFI9-.LCFI8
.byte 0x84
.uleb128 0x6
.align 4
..LEFDE3:
.ident "GCC: (GNU) 4.1.2 20070626 (Red Hat 4.1.2-14)"
.section .note.GNU-stack,"",@progbits
--
Lionel B
-
Re: Array optimizing problem in C++?
On Mar 25, 3:58 pm, Lionel B <m...@privacy.net> wrote:
> On Tue, 25 Mar 2008 07:22:36 -0700, courpron wrote:
> > On Mar 25, 2:23 pm, Lionel B <m...@privacy.net> wrote:
> >> [snip asm]
>
> That seems to have made a difference - to the assembly at least - I
> still don't see any significant difference in run time, however.
Yes, the performance increase on my machine was around 4%.
In this example, a single int is read to fill the dest_array.
Therefore, it stays easily in the cache and the performance
degradation is minimal.
Operations on large arrays can, however, lead to cache thrashing, and
the performance degradation would be much more visible.
The assembly listing now shows the no-aliasing optimization.
From this code :
> for ( int i = 0 ; i < len ; ++i )
> dest[ i ] = *src;
generated asm with __restrict :
> movl (%edx), %edx
> movl %edx, (%eax)
> movl %edx, 4(%eax)
> movl %edx, 8(%eax)
...
With __restrict, the value of *src is loaded into a register (edx)
once and reused for all the remaining iterations.
generated asm without __restrict :
> movl (%eax), %ecx
> movl %ecx, (%edx)
> movl (%eax), %ecx
> movl %ecx, 4(%edx)
> movl (%eax), %ecx
> movl %ecx, 8(%edx)
...
Without __restrict, the value is reloaded into a register (ecx) on
every iteration because the compiler doesn't know whether &dest[i] and
src point to the same object. The compiler must take into account
the situation where changing the value of dest[i] changes the value of
*src. It deals with this by reloading the *src value each time.
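To make the aliasing hazard concrete, here is a small sketch that deliberately differs from the loop above: it stores *src + 1 rather than *src, so that an overlapping src visibly changes the result. The function names fill_reload and fill_hoisted are hypothetical; the second one simply writes by hand the single-load form that the compiler may produce when it is told the pointers do not alias.

```cpp
#include <cstddef>

// Reads *src on every iteration, which is what the compiler must do
// when dest and src are allowed to overlap.
void fill_reload(int* dest, const int* src, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        dest[i] = *src + 1;
}

// Reads *src once before the loop: the hand-written equivalent of the
// transformation that __restrict permits.
void fill_hoisted(int* dest, const int* src, std::size_t n)
{
    const int value = *src + 1;
    for (std::size_t i = 0; i < n; ++i)
        dest[i] = value;
}
```

With a separate source object both functions give the same result. If src points at dest[0] and the array starts as {5, 0, 0, 0}, fill_reload produces {6, 7, 7, 7} while fill_hoisted produces {6, 6, 6, 6}, which is why the single load is only valid under a no-aliasing guarantee.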
Alexandre Courpron.
-
Re: Array optimizing problem in C++?
On Tue, 25 Mar 2008 08:49:42 -0700, courpron wrote:
> On Mar 25, 3:58 pm, Lionel B <m...@privacy.net> wrote:
>> On Tue, 25 Mar 2008 07:22:36 -0700, courpron wrote:
>> > On Mar 25, 2:23 pm, Lionel B <m...@privacy.net> wrote:
>> >> [snip asm]
>>
>> That seems to have made a difference - to the assembly at least - I
>> still don't see any significant difference in run time, however.
>
> Yes, the performance increase on my machine was around 4%.
I've just checked your original code under g++ 4.3.0 (installed with its
own libstdc++) and now I do indeed get the optimisation benefit:
without no-alias optimisation: 3270 ms
with no-alias optimisation: 1310 ms
I suspect you were right about it being an issue with libstdc++
Cheers,
--
Lionel B
-
Re: Array optimizing problem in C++?
Lionel B wrote:
> On Sun, 23 Mar 2008 16:59:00 -0600, Jerry Coffin wrote:
>
>> In article <MIednRYljpBPQ3vanZ2dnUVZ_rLinZ2d@comcast.com>,
>> andreytarasevich@hotmail.com says...
>>
>> [ ... ]
>>
>>> I don't really see much sense in comparing Java performance with
>>> C++ performance. What might be more interesting is the effect of
>>> the 'restrict' specifier supported by C99 compilers (and many
>>> C89/90 and C++ compilers as an extension), which is intended to
>>> assist the compiler in performing exactly this kind of optimization.
>>
>> Or, if you're really interested in C++, you could throw in a
>> comparison using a valarray, which was designed to give the same
>> sort of assurance. Unfortunately, I don't know of anybody who
>> seems to have gone to much (if any) trouble to optimize valarray
>> at all -- rather the contrary, it was an idea that even its own
>> creator admits was in the wrong place at the wrong time, so it's
>> ignored almost to death, so to speak.
>
> FWIW on my (gcc 4.1.1 supplied) implementation valarrays appear to
> be simply pointers to new'ed memory with a (GCC-specific)
> __restrict__ qualifier.
>
> Personally I've never managed to code up a scenario (using GCC with
> various optimisations) where __restrict__ appears to have made any
> difference whatsoever. I seem to recall that this has come up now
> and again - inconclusively - on the g++ lists.
You have to work with functions taking multiple pointers to arrays of
the same type. This is not idiomatic for C++ code.
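As an illustration of the kind of function meant here, the sketch below takes several pointers to arrays of the same element type. The macro name RESTRICT and the function add_arrays are hypothetical; the macro only wraps the compiler-specific spellings (__restrict__ for GCC and Clang, __restrict for MSVC) and expands to nothing elsewhere.

```cpp
#include <cstddef>

// Hypothetical portability macro: restrict is not standard C++, so each
// compiler's extension keyword is selected here.
#if defined(__GNUC__) || defined(__clang__)
#  define RESTRICT __restrict__
#elif defined(_MSC_VER)
#  define RESTRICT __restrict
#else
#  define RESTRICT
#endif

// Three pointers to int: without the qualifier the compiler has to assume
// that a store to out[i] might modify a[] or b[].
void add_arrays(int* RESTRICT out, const int* RESTRICT a,
                const int* RESTRICT b, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

The caller is responsible for keeping the promise: out must not overlap a or b, otherwise the behaviour is undefined. Whether the qualifier changes the generated code depends on the compiler version, as the results earlier in this thread show.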
Bo Persson