C/C++ speed optimization bible/resources/pointers needed!

  #171  
Old 08-14-2007, 03:04 AM
James Van Buskirk
Guest
 
Default Re: C/C++ speed optimization bible/resources/pointers needed!

<meta.x.gdb@gmail.com> wrote in message
news:1187063898.683782.256370@e9g2000prf.googlegroups.com...

> __asm__ __volatile__("rdtsc" : "=a"(ret0[0]), "=d"(ret0[1]));


Oh yes, it's much more fashionable to post obfuscated monstrosities
such as this than to post clear assembly language code or to look
in documentation such as instruction_tables.pdf or 248966.pdf.

Take this kind of nonsense to comp.lang.c where it belongs.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end


  #172  
Old 08-14-2007, 03:06 AM
glen herrmannsfeldt
Guest
 
Default Re: C/C++ speed optimization bible/resources/pointers needed!

meta.x.gdb@gmail.com wrote:
(snip)

> This compiles and runs fine with both Intel and GNU compilers 3.3,
> 4.0, etc.


> when I compile this and execute under Cygwin (running on Windows XP)
> and an AMD 4200+ I get
> ./a.exe
> ticks per rdtsc 6


> which isn't 1 or 2, but I can live with 6 clock ticks to process a
> seldom called op.


In all the times I have used RDTSC I don't know that I ever tried
to time RDTSC itself.

> if I compile and run this under Mac OS X (new Apple MacBookPro) Intel
> Core 2 I get 65 ?!?!


For IA32 I use it as a function returning (long long), which
conveniently has the result already in the right registers.

For x86-64, the result is still in EDX:EAX, and so isn't in the
appropriate 64 bit register. That won't matter the way you use
it, though.

> if I compile and run this on Suse Linux on a Xeon processor, I get
> 85 ?!?! (Intel or GNU compiler)


> I'm not even putting in serializing. does that look right to
> anyone ? Can anyone verify they get the same results on their x86
> machines ? This seems to me to be a HUGE amount of overhead for just
> a simple instruction.


I believe I have heard that it does serialization, or at least can do
serialization. I never worried about that, though.

-- glen

  #173  
Old 08-14-2007, 03:47 AM
Richard Heathfield
Guest
 
Default Re: C/C++ speed optimization bible/resources/pointers needed!

James Van Buskirk said:

> <meta.x.gdb@gmail.com> wrote in message
> news:1187063898.683782.256370@e9g2000prf.googlegroups.com...
>
>> __asm__ __volatile__("rdtsc" : "=a"(ret0[0]), "=d"(ret0[1]));

>
> Oh yes, it's much more fashionable to post obfuscated monstrosities
> such as this than to post clear assembly language code or to look
> in documentation such as instruction_tables.pdf or 248966.pdf.
>
> Take this kind of nonsense to comp.lang.c where it belongs.


Please don't. This kind of nonsense has no place in comp.lang.c either.
I'm not quite sure where it does belong, but comp.lang.c will chew it
up and spit out the bits.

--
Richard Heathfield <http://www.cpax.org.uk>
Email: -www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999
  #174  
Old 08-14-2007, 02:20 PM
meta.x.gdb@gmail.com
Guest
 
Default Re: C/C++ speed optimization bible/resources/pointers needed!



That's not C, that's assembly. People here brought up RDTSC as a
technique for performance measurement.

I'm pointing out there is a serious problem with using RDTSC for
measurement, in that the op can be incredibly slow.


Using "=A" versus separate "=a" and "=d" outputs doesn't change the answer.

I have had independent confirmation that Intel P4 chips take about 80
clock cycles and P3 takes 30 clock cycles. Athlon 64 takes about 6
cycles. The mftbr instruction on Power architectures takes about 2
clock cycles.
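
A minimal sketch of how such a per-read figure could be measured, assuming GCC or Clang on x86. It is not the loop from the original post, which is not shown here. The names read_tsc and min_tsc_gap are made up for illustration, and on non-x86 targets the sketch falls back to clock(), which is far coarser.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* Read the time-stamp counter: EDX holds the high half, EAX the low half. */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
#else
/* Assumption: on other targets substitute clock(), which is much coarser. */
static inline uint64_t read_tsc(void)
{
    return (uint64_t)clock();
}
#endif

/* Smallest gap seen between two back-to-back reads over n trials.
   Taking the minimum filters out trials disturbed by interrupts. */
uint64_t min_tsc_gap(unsigned n)
{
    uint64_t best = UINT64_MAX;
    assert(n > 0);
    for (unsigned i = 0; i < n; i++) {
        uint64_t t0 = read_tsc();
        uint64_t t1 = read_tsc();
        uint64_t gap = t1 - t0;
        if (gap < best)
            best = gap;
    }
    return best;
}
```

Printing min_tsc_gap(1000) gives a number comparable to the "ticks per rdtsc" figures quoted in this thread. The result counts TSC ticks, which need not equal core clock cycles on every processor.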


RDTSC is not serializing, but the form of the loop I created handles
this issue.


I talked last night with someone at Intel who acknowledged that this
could all very well be correct. The RDTSC instruction doesn't have a
very high priority at Intel.

Another possibility is that the RDTSC instruction on Intel *is*
serializing, which would explain all the waiting, since it would have to
completely drain the pipelines for each call, and that can take a lot
of cycles on a P4.
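
For comparison, serialization can be forced explicitly rather than guessed at. Intel's manuals describe RDTSC itself as not serializing and CPUID as serializing, so a common pattern is to issue CPUID immediately before RDTSC. The sketch below assumes GCC or Clang on x86 (on 32-bit PIC builds, a compiler recent enough to allow an EBX clobber). The name read_tsc_serialized is hypothetical, and the clock() fallback exists only so the code builds elsewhere.

```c
#include <stdint.h>
#include <time.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* CPUID is a serializing instruction: everything issued before it must
   complete before RDTSC runs. CPUID overwrites EAX, EBX, ECX and EDX,
   so EBX and ECX are listed as clobbers; EAX and EDX are outputs. */
uint64_t read_tsc_serialized(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("cpuid\n\t"
                         "rdtsc"
                         : "=a"(lo), "=d"(hi)
                         : "a"(0)          /* CPUID leaf 0 */
                         : "ebx", "ecx", "memory");
    return ((uint64_t)hi << 32) | lo;
}
#else
/* Assumption: on other targets substitute clock(), which is much coarser. */
uint64_t read_tsc_serialized(void)
{
    return (uint64_t)clock();
}
#endif
```

Comparing back-to-back gaps from this version with gaps from a bare RDTSC would show how much of the cost comes from draining the pipeline. CPUID has a substantial cost of its own, so the serialized figure is an upper bound, not a measurement of RDTSC alone.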

Brian Van Straalen



On Aug 14, 12:04 am, "James Van Buskirk" <not_va...@comcast.net>
wrote:
> <meta.x....@gmail.com> wrote in message
>
> news:1187063898.683782.256370@e9g2000prf.googlegroups.com...
>
> > __asm__ __volatile__("rdtsc" : "=a"(ret0[0]), "=d"(ret0[1]));

>
> Oh yes, it's much more fashionable to post obfuscated monstrosities
> such as this than to post clear assembly language code or to look
> in documentation such as instruction_tables.pdf or 248966.pdf.
>
> Take this kind of nonsense to comp.lang.c where it belongs.
>
> --
> write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
> 6.0134700243160014d-154/),(/'x'/)); end



  #175  
Old 08-14-2007, 03:04 PM
glen herrmannsfeldt
Guest
 
Default Re: C/C++ speed optimization bible/resources/pointers needed!

Steve Underwood wrote:

(snip)

> It might be a little tidier to use:


> static __inline__ unsigned long long int rdtsc(void)
> {
>     unsigned long long int x;
>
>     __asm__ volatile ("rdtsc \n\t" : "=A" (x));
>
>     return x;
> }


I use:

        .file   "rdtsc.c"
        .text
        .globl  rdtsc
        .type   rdtsc, @function
rdtsc:
        rdtsc
        ret
        .size   rdtsc, .-rdtsc

which I get by compiling with -S something like:

long long rdtsc() { return 1;}

editing the generated .s file, removing things that aren't
needed such that the only executable instructions are the
rdtsc and ret. This is for IA32, I have a different one
for x86-64.

For processors with CALL/RET branch prediction logic it
should be pretty fast.
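
The x86-64 variant is mentioned above but not shown. The following is not that code, only a guess at what such a stand-alone routine could look like, assuming the System V AMD64 calling convention (64-bit result in RAX) and an ELF target. It is written as file-scope assembly inside a C file so that it builds as a single unit. The name tsc_read64 is hypothetical, and the clock() fallback exists only so the file compiles on other targets.

```c
#include <stdint.h>
#include <time.h>

uint64_t tsc_read64(void);   /* hypothetical name, not from the thread */

#if defined(__GNUC__) && defined(__x86_64__) && defined(__ELF__)
/* RDTSC leaves the halves in EDX:EAX (upper 32 bits of RAX and RDX are
   cleared), so shift the high half up and OR it into RAX before RET. */
__asm__(
    "        .text\n"
    "        .globl  tsc_read64\n"
    "        .type   tsc_read64, @function\n"
    "tsc_read64:\n"
    "        rdtsc\n"
    "        shlq    $32, %rdx\n"
    "        orq     %rdx, %rax\n"
    "        ret\n"
    "        .size   tsc_read64, .-tsc_read64\n"
);
#else
/* Assumption: elsewhere, substitute the much coarser clock(). */
uint64_t tsc_read64(void)
{
    return (uint64_t)clock();
}
#endif
```

Because the routine is an ordinary external function, any language that can call a C function returning a 64-bit integer can use it, at the price of the call and return.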

-- glen

  #176  
Old 08-14-2007, 10:01 PM
Steve Underwood
Guest
 
Default Re: C/C++ speed optimization bible/resources/pointers needed!

glen herrmannsfeldt wrote:
> Steve Underwood wrote:
>
> (snip)
>
>> It might be a little tidier to use:

>
>> static __inline__ unsigned long long int rdtsc(void)
>> {
>> unsigned long long int x;
>>
>> __asm__ volatile ("rdtsc \n\t" : "=A" (x));
>>
>> return x;
>> }

>
> I use:
>
> .file "rdtsc.c"
> .text
> .globl rdtsc
> .type rdtsc, @function
> rdtsc:
> rdtsc
> ret
> .size rdtsc, .-rdtsc
>
> which I get by compiling with -S something like:
>
> long long rdtsc() { return 1;}
>
> editing the generated .s file, removing things that aren't
> needed such that the only executable instructions are the
> rdtsc and ret. This is for IA32, I have a different one
> for x86-64.
>
> For processors with CALL/RET branch prediction logic it
> should be pretty fast.
>
> -- glen
>


That's nasty. What I showed has no call/return overhead at all. It
compiles to a single inline instruction.

Steve
  #177  
Old 08-15-2007, 04:37 PM
glen herrmannsfeldt
Guest
 
Default Re: C/C++ speed optimization bible/resources/pointers needed!

Steve Underwood wrote:

(snip)

>>> static __inline__ unsigned long long int rdtsc(void)
>>> {
>>> unsigned long long int x;
>>> __asm__ volatile ("rdtsc \n\t" : "=A" (x));
>>> return x;
>>> }


>> I use:


>> .file "rdtsc.c"
>> .text
>> .globl rdtsc
>> .type rdtsc, @function
>> rdtsc:
>> rdtsc
>> ret
>> .size rdtsc, .-rdtsc



(snip)

> That's nasty. What I showed has no call/return overhead at all. It
> compiles to a single in line instruction.


Well, that assumes that you want the result in EDX:EAX, otherwise
it will take some instructions to get them in the right place.
I would not be surprised to see some push/pop while the compiler
gets some other values out of the appropriate registers.

I have used mine for calls from Java (through JNI) and Fortran,
which may not accept inline assembly.

-- glen

  #178  
Old 08-15-2007, 10:33 PM
Steve Underwood
Guest
 
Default Re: C/C++ speed optimization bible/resources/pointers needed!

glen herrmannsfeldt wrote:
> Steve Underwood wrote:
>
> (snip)
>
>>>> static __inline__ unsigned long long int rdtsc(void)
>>>> {
>>>> unsigned long long int x;
>>>> __asm__ volatile ("rdtsc \n\t" : "=A" (x));
>>>> return x;
>>>> }

>
>>> I use:

>
>>> .file "rdtsc.c"
>>> .text
>>> .globl rdtsc
>>> .type rdtsc, @function
>>> rdtsc:
>>> rdtsc
>>> ret
>>> .size rdtsc, .-rdtsc

>
>
> (snip)
>
>> That's nasty. What I showed has no call/return overhead at all. It
>> compiles to a single in line instruction.

>
> Well, that assumes that you want the result in EDX:EAX, otherwise
> it will take some instructions to get them in the right place.
> I would not be surprised to see some push/pop while the compiler
> gets some other values out of the appropriate registers.


The same is true with your code. You just have the call and return
overheads in addition to that. For 32 bit code, things happen to be
exactly where you want them, most of the time, to be treated as a
uint64_t value. With x86_64 code, you have the annoyance of needing to
cook up the 64 bit value from 2 32 bit pieces. It's odd they did not make
rdtsc work in a natural way, feeding the 64 bit A register, in the
x86_64 instruction set.
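
A minimal sketch of that "cook up" step, assuming GCC or Clang. According to GCC's documentation the "A" constraint describes the EDX:EAX pair only for double-word values, so on x86_64, where a 64 bit value is a single word, "=A" cannot be relied on to deliver both halves. Asking for the halves separately and combining them works for both 32 bit and 64 bit code. The names combine_tsc and read_tsc are made up for illustration.

```c
#include <stdint.h>

/* Combine the two 32-bit halves that RDTSC leaves in EDX (high) and EAX (low). */
uint64_t combine_tsc(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* Same form for IA32 and x86_64: request EAX and EDX as separate outputs. */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return combine_tsc(hi, lo);
}
#endif
```

On x86_64 the compiler typically turns the combine into a shift and an OR, so the cost over the "=A" form on IA32 is small.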
>
> I have used mine for calls from Java (through JNI) and Fortran,
> which may not accept inline assembly.


Fair enough.

Steve
  #179  
Old 08-26-2007, 03:18 AM
David Thompson
Guest
 
Default Re: C/C++ speed optimization bible/resources/pointers needed!

On Sun, 29 Jul 2007 00:45:26 -0400, Jerry Avins <jya@ieee.org> wrote:

> Wade Ward wrote:
> > "Jerry Avins" <jya@ieee.org> wrote in message

<snip>
> >> Sure, but why? When a compiler provides for including the programmer's
> >> assembly code in the object code, how can an assembler be faster?

> > I'm only partially familiar with assembly code. I have never laid eyes on
> > anything like a standard for assembly. Is there one?

>
> C has standard constructs for including assembly instructions in the
> generated object code. You need to observe register conventions when you
> use it. It is also easy to call an assembly routine as a subroutine or
> function. The interface is in the standard.
>

Not de jure standard (that is, in ISO 9899), nor even de facto standard
(everyone does the same thing). Fairly common (most C compilers do it), yes.

C++ does reserve the keyword 'asm' for this purpose (Standard C
doesn't even do that) and specifies that the asm instruction(s)/info
are represented as a string, but their contents are up to the
implementation, probably the assembler used.

Neither standard directly specifies calling to other languages except
C++ calling C, but both usually can be and typically are implemented
with simple, obvious calling conventions which can easily be handled
in assembly, assuming you can find the documentation.

FWIW, Ada (re)uses the named-aggregate syntax it has for several other
things, in an implementation-dependent way in a standard place in its
(rather extensive) naming hierarchy. And does have a standard syntax
to specify (partially implementation-dependent) calling conventions.

- formerly david.thompson1 || achar(64) || worldnet.att.net
  #180  
Old 09-14-2007, 06:44 AM
Alex Vinokur
Guest
 
Default Re: C/C++ speed optimization bible/resources/pointers needed!

On Jul 27, 12:19 am, lunamoonm...@gmail.com wrote:
> C/C++ speed optimization bible/resources/pointers needed!
>

[snip]

Look at C/C++ Program Perfometer
http://sourceforge.net/projects/cpp-perfometer/

The C/C++ Program Perfometer is an open source tool which enables the
programmer to measure the comparative performance of a C/C++ program
or of separated pieces of code by one of several desired metrics:
e.g., time, memory, or metrics defined by the programmer.

--
Alex Vinokur
email: alex DOT vinokur AT gmail DOT com
http://mathforum.org/library/view/10978.html
http://sourceforge.net/users/alexvn


