What's faster? - ASM x86 ASM 370
-
What's faster?
octMove:
    movdqa xmm1, [edi]
    Blah, blah, blah
    movdqa [edi], xmm1
    add edi, 16
    loop octMove
or
octMove:
    movdqa xmm1, [edi]
    add edi, 16
    Blah, blah, blah
    movdqa [edi-16], xmm1
    loop octMove
I know there are latency issues with using an address register just
after setting it. Does the loop soak that up?
Thanks!
-- Rich Fife --
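Editor's note: the two loops differ only in where the `add` sits, so they touch the same addresses and produce the same memory contents. The question is purely one of timing. A minimal C sketch of that equivalence follows. The `work` function is a hypothetical stand-in for "Blah, blah, blah", and the sketch assumes a count of at least 1 (unlike `loop`, it does nothing for a count of 0).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { BLOCK = 16 };  /* one XMM register holds 16 bytes */

/* Hypothetical stand-in for "Blah, blah, blah": invert every byte. */
static void work(uint8_t reg[BLOCK]) {
    for (size_t i = 0; i < BLOCK; ++i) {
        reg[i] = (uint8_t)~reg[i];
    }
}

/* Variant 1: load, work, store at [edi], then add edi, 16. */
static void variant_store_then_add(uint8_t *edi, size_t count) {
    uint8_t xmm1[BLOCK];
    while (count-- > 0) {
        memcpy(xmm1, edi, BLOCK);          /* movdqa xmm1, [edi]    */
        work(xmm1);
        memcpy(edi, xmm1, BLOCK);          /* movdqa [edi], xmm1    */
        edi += BLOCK;                      /* add edi, 16           */
    }
}

/* Variant 2: load, add edi, 16, work, then store at [edi-16]. */
static void variant_add_then_store(uint8_t *edi, size_t count) {
    uint8_t xmm1[BLOCK];
    while (count-- > 0) {
        memcpy(xmm1, edi, BLOCK);          /* movdqa xmm1, [edi]    */
        edi += BLOCK;                      /* add edi, 16           */
        work(xmm1);
        memcpy(edi - BLOCK, xmm1, BLOCK);  /* movdqa [edi-16], xmm1 */
    }
}
```

Running both variants over identical buffers should leave the buffers byte-for-byte equal. The model says nothing about which ordering is faster on a given CPU.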
-
Re: What's faster?
Rich Fife <spamtrap@crayne.org> wrote in part:
> octMove:
> movdqa xmm1, [edi]
>
> Blah, blah, blah
>
> movdqa [edi], xmm1
> add edi, 16
>
> loop octMove
Not using `loop` is faster: It has been deliberately
slowed down to prevent MS-Win95 from crashing.
Otherwise, the position of the `add` should make no difference with register renaming.
-- Robert
-
Re: What's faster?
Robert Redelmeier wrote:
> Not using `loop` is faster: It has been deliberately
> slowed down to prevent MS-Win95 from crashing.
Kidding, or serious?
-
Re: What's faster?
Jim Leonard wrote:
> Robert Redelmeier wrote:
>> Not using `loop` is faster: It has been deliberately
>> slowed down to prevent MS-Win95 from crashing.
>
> Kidding, or serious?
>
That was my reaction, too. Until I benchmarked.
-
Re: What's faster?
Jeffrey Schwab wrote:
> Jim Leonard wrote:
> > Robert Redelmeier wrote:
> >> Not using `loop` is faster: It has been deliberately
> >> slowed down to prevent MS-Win95 from crashing.
> >
> > Kidding, or serious?
>
> That was my reaction, too. Until I benchmarked.
That's not what I meant. I knew it was slower; what I was questioning
is that AMD specifically degraded an instruction to get around a bug in
someone else's software. I call BS. I understand that CPU design was
influenced by the market (ie. the Pentium Pro was better than Pentium
for 32-bit operations but worse for 16-bit operations, etc.) but this
is the first time I've heard of intentionally degrading performance to
get around something that would be a trivial patch in someone else's
software...
-
Re: What's faster?
Jim Leonard <spamtrap@crayne.org> wrote in part:
> That's not what I meant. I knew it was slower; what I was
> questioning is that AMD specifically degraded an instruction
> to get around a bug in someone else's software.
How else do you explain going from 2 clocks (K6) to 8 (K7)
when everything else speeds up?
> I call BS.
Google through AMD.
> I understand that CPU design was influenced by the market (ie. the
> Pentium Pro was better than Pentium for 32-bit operations but
> worse for 16-bit operations, etc.) but this is the first time
> I've heard of intentionally degrading performance to get around
> something that would be a trivial patch in someone else's software
Nothing is trivial in MS-Windows. IIRC, MS-win95* couldn't
be installed and wouldn't boot on the machines. Had it booted,
then it could have been patched.
AFAIK, Intel also slowed their `loop` instruction 'cuz they
didn't want to get bit. I think from 4 clocks to 6 on the
Pentium!!!. The Pentium4 may be too recent to need the slowdown.
This was a "bug" that cost AMD sales.
-- Robert
-
Re: What's faster?
Jim Leonard wrote:
> but this
> is the first time I've heard of intentionally degrading performance to
> get around something that would be a trivial patch in someone else's
> software...
As I pointed out earlier, this certainly wouldn't have been the first
time. When the 486 or Pentium came along and instructions started
operating at 1 CPI, Intel forced the NOP instruction to take 3 cycles
because there was a considerable body of (DOS) code that computed CPU
clock frequency by counting the number of NOPs executed between timer
ticks (55ms). Such software (especially games) started behaving really
weirdly when the NOP was changed to one clock cycle. Fortunately, the NOP
was eventually fixed once people wised up and realized that they could
no longer depend on software timing loops (not to mention that having
an OS API in Windows to provide this information was a big help, too).
Cheers,
Randy Hyde
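Editor's note: the calibration trick described above can be sketched as plain arithmetic. The program counts loop iterations in one 55 ms timer tick and multiplies by the cycle cost it assumes per iteration. If the real cost drops, the estimate is off by the same ratio. The function names are illustrative, and the 33 MHz clock and cycle counts used below are hypothetical values, not measurements.

```c
#include <stdint.h>

/* Timer tick period quoted in the thread: 55 ms, in microseconds. */
#define TICK_US 55000ULL

/* Iterations of a delay loop that fit in one tick on a CPU running at
   clock_hz, when each iteration really costs actual_cycles. */
static uint64_t iterations_per_tick(uint64_t clock_hz, uint64_t actual_cycles) {
    return clock_hz * TICK_US / 1000000ULL / actual_cycles;
}

/* The clock frequency the program concludes, given the per-iteration
   cycle cost it was written to assume. */
static uint64_t estimated_hz(uint64_t iterations, uint64_t assumed_cycles) {
    return iterations * assumed_cycles * 1000000ULL / TICK_US;
}
```

With a hypothetical 33 MHz clock, `estimated_hz(iterations_per_tick(33000000, 3), 3)` recovers 33000000. If an iteration really costs 1 cycle while the program still assumes 3, the estimate becomes 99000000.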
-
Re: What's faster?
Robert Redelmeier wrote:
> AFAIK, Intel also slowed their `loop` instruction 'cuz they
> didn't want to get bit. I think from 4 clocks to 6 on the
> Pentium!!!. The Pentium4 may be too recent to need the slowdown.
Wasn't the Pentium the first processor where we started hearing about
the 'RISC core'? Bottom line is that if LOOP was going to break
software that depended on timing loops, so would a lot of other
instructions. I suspect you're misinterpreting the reason why this
instruction was not faster than the corresponding discrete
instructions. Intel concentrated on speeding up
a core set of instructions, and because you could easily synthesize
LOOP using instructions already in the core, and because so few
compilers made effective use of the LOOP instruction, Intel chose to
take the easy route and leave LOOP in microcode.
BTW, it's also apparent that INC and DEC are quickly fading from the
scene. Intel is recommending that programmers use ADD and SUB
instead. Of course, AMD led the charge with their AMD64 chips by
repurposing the one-byte INC and DEC opcodes as prefixes for 64-bit
access. So I wouldn't be
surprised to find that INC and DEC get slowed down in future chips.
I.e.,
    sub( 1, ecx );
    jnz loopLbl;
becomes the replacement for
    loop loopLbl;
rather than
    dec( ecx );
    jnz loopLbl;
Cheers,
Randy Hyde
-
Re: [Clax86list] Re: What's faster?
On 2 May 2006 15:44:21 -0700
"randyhyde@earthlink.net" <spamtrap@crayne.org> wrote:
:sub( 1, ecx );
:jnz loopLbl;
:
:becomes the replacement for
:
:loop loopLbl;
:
:rather than
:
:dec( ecx );
:jnz loopLbl;
Unfortunately, such a replacement would break a lot of existing code.
-- Chuck
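Editor's note: one plausible reading of Chuck's objection is that the three sequences treat the flags differently. `loop` modifies no flags, `dec` updates ZF but leaves CF alone, and `sub` rewrites CF as well. Code that carries CF across iterations (a multi-precision `adc` loop, say) would therefore break under a `sub`-based replacement. The sketch below is a simplified C model that tracks only ECX, CF and ZF. The `Cpu` struct and function names are illustrative, not taken from the thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified machine state: only the parts relevant here. */
typedef struct {
    uint32_t ecx;
    bool cf;  /* carry flag */
    bool zf;  /* zero flag  */
} Cpu;

/* loop lbl: decrement ECX, branch if ECX != 0. No flags are modified. */
static bool model_loop(Cpu *c) {
    c->ecx -= 1;
    return c->ecx != 0;
}

/* dec ecx / jnz lbl: DEC updates ZF but leaves CF untouched. */
static bool model_dec_jnz(Cpu *c) {
    c->ecx -= 1;
    c->zf = (c->ecx == 0);
    return !c->zf;
}

/* sub ecx, 1 / jnz lbl: SUB updates ZF and sets CF on borrow. */
static bool model_sub_jnz(Cpu *c) {
    c->cf = (c->ecx < 1);
    c->ecx -= 1;
    c->zf = (c->ecx == 0);
    return !c->zf;
}
```

Each function returns whether the branch is taken. Starting with CF set, only `model_sub_jnz` clears it.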
-
Re: What's faster?
randyhyde@earthlink.net wrote:
> As I pointed out earlier, this certainly wouldn't have been the first
> time. When the 486 or Pentium came along and instructions started
> operating at 1 CPI, Intel forced the NOP instruction to take 3 cycles
> because there was a considerable body of (DOS) code that computed CPU
> clock frequency by counting the number of NOPs executed between timer
> ticks (55ms). Such software (especially games) started behaving really
This was happening LONG before the 486. I remember games that broke on my
7.16MHz 8086 because the prefetch queue was 6 bytes instead of 4, and
that was enough. So, again, I can't believe this is the reason for
making NOP take longer.
Wasn't NOP just an alias for xchg ax,ax? Did that get slower too?
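Editor's note: on the encoding question, the one-byte form of XCHG between the accumulator and a register is opcode 90h plus the register number, so exchanging AX with itself comes out as 90h, the same byte as NOP. A small C sketch of that encoding rule follows. The enum and function names are illustrative, and the sketch says nothing about how many cycles either form takes.

```c
#include <stdint.h>

/* Register numbers as used in x86 instruction encodings. */
enum Reg { AX = 0, CX = 1, DX = 2, BX = 3, SP = 4, BP = 5, SI = 6, DI = 7 };

/* One-byte form of XCHG AX, r16 (or EAX, r32): opcode 90h + register number. */
static uint8_t xchg_ax_opcode(enum Reg r) {
    return (uint8_t)(0x90 + r);
}
```

`xchg_ax_opcode(AX)` yields 0x90, which is the NOP byte; `xchg_ax_opcode(CX)` yields 0x91.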