nibble array/BCD notation/MMX

This is a discussion on nibble array/BCD notation/MMX within the ASM x86 ASM 370 forums in Programming Languages category; Hello there, I want to access an array of nibbles. I have developed in C the function that everyone will do without too much effort, but it is terribly slow. I am trying to approach the problem from another point of view in order to see if I get something faster. What I am looking for is a magic instruction (ala btsl) that given a pointer and an index, it returns the nth nibble. This is a similar problem to obtain the nth digit of a BCD (packed) format. I am wondering if there is a MMX (or some other ...

Go Back   Application Development Forum > Programming Languages > ASM x86 ASM 370

Object Mix

Register FAQ Calendar Search Today's Posts Mark Forums Read
  #1  
Old 08-13-2008, 03:12 PM
ditiem
Guest
 
Default nibble array/BCD notation/MMX

Hello there,

I want to access an array of nibbles. I have developed in C the
function
that everyone will do without too much effort, but it is terribly
slow. I am
trying to approach the problem from another point of view in order to
see
if I get something faster. What I am looking for is a magic
instruction
(ala btsl) that given a pointer and an index, it returns the nth
nibble.

This is a similar problem to obtain the nth digit of a BCD (packed)
format.

I am wondering if there is a MMX (or some other extension) that could
do this job (I heard they treat a word in sets of 4/8/32/64/128 bits,
but
I am not into Intel assembly extensions).

Thanks in advance, any light is highly appreciated,

David

Reply With Quote
  #2  
Old 08-14-2008, 02:41 AM
Hendrik van der Heijden
Guest
 
Default Re: nibble array/BCD notation/MMX

ditiem schrieb:
> Hello there,
>
> I want to access an array of nibbles. I have developed in C the
> function
> that everyone will do without too much effort, but it is terribly
> slow.


Either you're using division and modulus per access or have some
uncommon definition of "terribly slow". Let's see the code.

> I am wondering if there is a MMX (or some other extension) that could
> do this job (I heard they treat a word in sets of 4/8/32/64/128 bits,
> but I am not into Intel assembly extensions).


There's nothing which will help you for extracting a single specific nibble.
There may be something for you if you want many nibbles in fixed patterns
and do some operations on them.


Hendrik vdH

Reply With Quote
  #3  
Old 08-14-2008, 08:29 AM
Wolfgang Kern
Guest
 
Default Re: nibble array/BCD notation/MMX


David "ditiem" wrote:

> Hello there,

Hi,
> I want to access an array of nibbles. I have developed in C the
> function that everyone will do without too much effort, but it is
> terribly slow.


Sure. C isn't prone to work at low level

> I am trying to approach the problem from another point of view in
> order to see if I get something faster.
> What I am looking for is a magic instruction (ala btsl)
> that given a pointer and an index, it returns the nth nibble.


BTS isn't fast anyway.

My solution is quite simple,
even not optimised except it avoids branches:
(but don't ask me how to do this in C, perhaps Inline.Asm ?)
______________________
;push ... ;save registers as required
mov ebx,nibble_index
mov esi,source_ptr
;mov edi,result_buffer ;if desired
;push ebx ;make it a LOCAL if for a loop
;L0:
;__________________
shr ebx,1 ;DIV by 2, bit0 -> Carry, indicates ODD/EVEN
mov al,[esi+ebx] ;get byte from [esi + ebx/2]
;mov ebx,[esp] ;restore nibble index
setc cl ;CL= 0 or 1 (= bit0 of index)
shl cl,2 ;*4= 0 or 4
shr al,cl ;no change if the index were EVEN
and al,0f ;
;__________________
DONE: ;AL contains the nibble yet
;mov [edi],al ;store it
;inc/dec edi ;advance destination ptr
;inc/dec ebx ;advance nibble index
;cmp ebx,[last_index] ;better use a register ie: edx
;jc/jns/jnbe L0 ;iterate as desired/required
;add esp,4 ;kill the LOCAL after the loop
;pop ... ;resore what you pushed on top
_______________________
You see this are just 6 lines (16 bytes) for a single nibble access.
Feel free to uncomment/delete/modify it for your own LOOP-solution.

Would be interesting to hear if this register- and dependency-stalling
piece of code performs any faster than your C-solution.

> This is a similar problem to obtain the nth digit of a BCD (packed)
> format.


I treat BCD and HEX-nibbles the same way (except for input limitation).

> I am wondering if there is a MMX (or some other extension) that could
> do this job (I heard they treat a word in sets of 4/8/32/64/128 bits,
> but I am not into Intel assembly extensions).


SSE instruction may be of help for conversion of packed bytes,
but I haven't seen any nibble oriented functionality in there.

__
wolfgang


Reply With Quote
  #4  
Old 08-14-2008, 08:33 AM
Wolfgang Kern
Guest
 
Default Re: nibble array/BCD notation/MMX


"David "ditiem" wrote:

> Hello there,

Hi,
> I want to access an array of nibbles. I have developed in C the
> function that everyone will do without too much effort, but it is
> terribly slow.


Sure. C isn't prone to work at low level

> I am trying to approach the problem from another point of view in
> order to see if I get something faster.
> What I am looking for is a magic instruction (ala btsl)
> that given a pointer and an index, it returns the nth nibble.


BTS isn't fast anyway.

My solution is quite simple,
even not optimised except it avoids branches:
(but don't ask me how to do this in C, perhaps Inline.Asm ?)
______________________
;push ... ;save registers as required
mov ebx,nibble_index
mov esi,source_ptr
;mov edi,result_buffer ;if desired
;push ebx ;make it a LOCAL if for a loop
;L0:
;__________________
shr ebx,1 ;DIV by 2, bit0 -> Carry, indicates ODD/EVEN
mov al,[esi+ebx] ;get byte from [esi + ebx/2]
;mov ebx,[esp] ;restore nibble index
setc cl ;CL= 0 or 1 (= bit0 of index)
shl cl,2 ;*4= 0 or 4
shr al,cl ;no change if the index were EVEN
and al,0f ;
;__________________
DONE: ;AL contains the nibble yet
;mov [edi],al ;store it
;inc/dec edi ;advance destination ptr
;inc/dec ebx ;advance nibble index
;cmp ebx,[last_index] ;better use a register ie: edx
;jc/jns/jnbe L0 ;iterate as desired/required
;add esp,4 ;kill the LOCAL after the loop
;pop ... ;resore what you pushed on top
_______________________
You see this are just 6 lines (16 bytes) for a single nibble access.
Feel free to uncomment/delete/modify it for your own LOOP-solution.

Would be interesting to hear if this register- and dependency-stalling
piece of code performs any faster than your C-solution.

> This is a similar problem to obtain the nth digit of a BCD (packed)
> format.


I treat BCD and HEX-nibbles the same way (except for input limitation).

> I am wondering if there is a MMX (or some other extension) that could
> do this job (I heard they treat a word in sets of 4/8/32/64/128 bits,
> but I am not into Intel assembly extensions).


SSE instruction may be of help for conversion of packed bytes,
but I haven't seen any nibble oriented functionality in there.

__
wolfgang


Reply With Quote
  #5  
Old 08-14-2008, 09:08 AM
Wolfgang Kern
Guest
 
Default Re: nibble array/BCD notation/MMX


sorry for the double post, I may have hit the send button twice ...

__
wolfgang


Reply With Quote
  #6  
Old 08-14-2008, 03:13 PM
ditiem
Guest
 
Default Re: nibble array/BCD notation/MMX

On Aug 14, 8:41 am, Hendrik van der Heijden <spamt...@crayne.org>
wrote:

> Either you're using division and modulus per access or have some
> uncommon definition of "terribly slow". Let's see the code.


Nah! The problem is a bit more complicated than a simple array and an
index. I will try to explain in more detail. I am implementing a
virtual machine in which I need to store a variable and its type
(because it is dynamic). Let's say that the common technique for these
cases is to use either the 4 highest or lowest bits of a word to store
the type. In my design I dont want this to happen, so I decided that I
can store the types of several the variables all together in one
word. According to this, the memory addresses which has the last 7
bits to 0x0, 0x4,0x8 and 0xC, stores the types of the next (28) memory
addresses. For example, for a given variable (read pointer) C, the
type must be found in C&~0x7F plus the offset of C respect that
address: C & 0x7F (plus a shift of course)

The function is:

#define INDEX2_MASK 0x3F

void set_var_type ( unsigned int *C, unsigned int T )
{
unsigned int Cindex = ((unsigned int)(C) & INDEX_MASK) ;
unsigned char *CBase = (unsigned char*)((unsigned int)(C) &
~INDEX_MASK) ;

CBase += Cindex >> 3 ;
*CBase = (Cindex & 0x4)? (*CBase & 0x8F) | (T << 4) : (*CBase &
0xF8) | T ;
}

scared about the if-then-else? well, surprise! that function goes
faster (in my machine, a Pentium III 850MHz) than this:

void set_var_type_B ( unsigned int *C, unsigned int T )
{
unsigned int Cindex = ((unsigned int)(C) & INDEX2_MASK) ;
unsigned int *CBase = (unsigned int*)((unsigned int)(C) &
~INDEX2_MASK) ;

CBase += Cindex >> 5 ;
Cindex &= 0x7C ;

*CBase &= ~(0xF << Cindex) ;
*CBase |= T << Cindex ;
}

Answering your question about "terrible slow". The original function
was:

void set_var_type_original ( unsigned int *C, unsigned char T )
{
*C &= 0xF0000000 ;
*C |= T << 28 ;
}

and according to the benchmarks the functions above have and slow-down
of 2.4 and 4.6 respectively (do not ask me why, I have lost whole day
trying to find the reason).

Hope this answer your question .

Best regards,

David

Reply With Quote
  #7  
Old 08-14-2008, 03:29 PM
ditiem
Guest
 
Default Re: nibble array/BCD notation/MMX

On Aug 14, 2:33 pm, "Wolfgang Kern" <spamt...@crayne.org> wrote:

> BTS isn't fast anyway.


Oops! may I ask how slow it is?


> My solution is quite simple,
> even not optimised except it avoids branches:
> (but don't ask me how to do this in C, perhaps Inline.Asm ?)


Leave that on my side... I am really fighting with gcc to put some
inline asm (, really a headache!

> You see this are just 6 lines (16 bytes) for a single nibble access.
> Feel free to uncomment/delete/modify it for your own LOOP-solution.


> Would be interesting to hear if this register- and dependency-stalling
> piece of code performs any faster than your C-solution.


Your code is good for reading (I love the trick of shr and setc ),
but now I am concentrating in writing. When I do the testing for
reading I will let you know how fast it goes (looks very good to me).

As I tried to explain to Hendrik, one of the good problems here is to
obtain the base address and the "index" of the nibble according to
that base address. Another point is how you map the index of the
nibble in the nibble-array, that is, which bits from the address you
take as the byte index and which bits you take as the nibble index.
The
solution that you propose is exactly the first one that came to my
mind, but today I found another way of mapping the things that can
save 1 or instructions.

I have a nice solution for writing the nibble with no branches in
only 10 instructions (plus the ones generated by the compiler to copy
the arguments in 2 registers). My problem now is that gcc says that I
use too many registers and that it cannot find a way to compile it...
(this should be a temporal problem).

> SSE instruction may be of help for conversion of packed bytes,
> but I haven't seen any nibble oriented functionality in there.


I abandoned that line, I read that they can work with nibbles, but
they just add, subtract and few things more. Really complicated to
obtain the nibble from there.


Best regards and thank a lot for the solution!

David

Reply With Quote
  #8  
Old 08-14-2008, 06:58 PM
Rod Pemberton
Guest
 
Default Re: nibble array/BCD notation/MMX

"ditiem" <spamtrap@crayne.org> wrote in message
news:68383c9b-13cd-4bf2-8996-6fb556aa750b@w24g2000prd.googlegroups.com...
> What I am looking for is a magic
> instruction
> (ala btsl) that given a pointer and an index, it returns the nth
> nibble.


As a general rule-of-thumb, "magic instructions" are slow on today's x86 or
later processors...

> I am wondering if there is a MMX (or some other extension) that could
> do this job


No single instruction in the general instruction set that I recognize.
Maybe something in MMX, SSE2, etc...

> I want to access an array of nibbles.


Well, since I'm interested in this issue also, I was waiting to see if
someone posted some SSE2 etc. solution.

Without such a solution, your goal is to reduce the work. So, I'd first try
a few precomputed 256 byte lookup arrays or tables. Alternately, I've had
good optimization results with methods similar to your
'set_var_type_original()' function, using optimization, only 'int' and
'unsigned char' types, and inlining. The goal is to get C types that the C
compiler will match with registers or move into registers so fewer
instructions are emitted. This is a bit of trial and error since many
compilers no longer respect the 'register' keyword. You might even find it
is better to copy file scope variables into procedure auto scope
variables... sometimes. The inlining, by inline keyword or GCC attribute,
should eliminate the procedure's overhead. This is preferred to inline
assembly since many compilers won't optimize inline assembly, but will
optimize inlined C.

To break apart an 8-bit register value or unsigned char in C (assuming
8-bits...), the first array would be precomputed to return the upper
nybble, and the second array would return the lower nybble. This can be
done in assembly or C easily. This avoids all shifts, etc. after the
initilization of the tables. You could use the 'xlatb' instruction in
assembly, but it's a "magic instruction"...

E.g.,

upper_nybble=upper_nybble_table[byte]; /* e.g., 0xF1 into table returns
0x0F */
lower_nybble=lower_nybble_table[byte]; /* e.g., 0xF1 into table returns
0x01 */

Hopefully, with optimization, your compiler will reduce that to two 'mov'
instructions. Or, you could code it that way in assembly.

Depending on the compiler's optimization ability, a simple 'and' may suffice
for the lower nybble, being reduced to a single 'and' instruction.

lower_nybble=byte&0x0F;

Or, you could code an 'and' in assembly...

To construct an 8-bit register value or unsigned char in C (assuming
8-bits...) from nybbles, an array would be precomputed to return the upper
nybble left shifted four-bits, and you already have the lower nybble. You
then bitwise-or them together.

E.g.,

byte=shifted_upper_nybble_table[upper_nybble]|lower_nybble;

HTH,


Rod Pemberton

Reply With Quote
  #9  
Old 08-15-2008, 03:08 AM
Terje Mathisen
Guest
 
Default Re: nibble array/BCD notation/MMX

ditiem wrote:
> Answering your question about "terrible slow". The original function
> was:
>
> void set_var_type_original ( unsigned int *C, unsigned char T )
> {
> *C &= 0xF0000000 ;
> *C |= T << 28 ;
> }


So basically what you're doing is to take the top byte of a 32-bit
variable, ORing the top 4 bits together with the bottom 4 bits of T, and
clearing out the bottom 28 bits?

Many compilers should generate code very similar to the following:

unsigned c = *C & 0xF0000000;
c |= T << 28;
*C = c;

which will compile into (assume ecx, edx contains the input parameters):

mov eax,[ecx]
shl edx,28

and eax,0f0000000h

or eax,edx

mov [ecx],eax

I.e. taking 5 cycles, and that's the best you can do without changing
the problem.

Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Reply With Quote
  #10  
Old 08-15-2008, 03:38 AM
ditiem
Guest
 
Default Re: nibble array/BCD notation/MMX

On Aug 15, 9:08*am, Terje Mathisen <spamt...@crayne.org> wrote:

> So basically what you're doing is to take the top byte of a 32-bit
> variable, ORing the top 4 bits together with the bottom 4 bits of T, and
> clearing out the bottom 28 bits?


Exactly

> which will compile into (assume ecx, edx contains the input parameters):
>
> * * * * mov eax,[ecx]
> * * * * shl edx,28
>
> * * * * and eax,0f0000000h
>
> * * * * or eax,edx
>
> * * * * mov [ecx],eax
>
> I.e. taking 5 cycles, and that's the best you can do without changing
> the problem.


Is that solution better than this one?

shl edx,28
and 0f0000000h, [ecx]
or edx , [ecx]

I am asking because it has been more than 10 years without touching
assembly and I dont remember neither the restrictions nor the times of
the
instructions.

regards,

David

Reply With Quote
Reply


Thread Tools
Display Modes


All times are GMT -5. The time now is 09:02 AM.


Powered by vBulletin® Version 3.7.2
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
vB Ad Management by =RedTyger=

In an effort to better serve ads to our visitors, cookies are used on objectmix.com. For more information, check out our Privacy Policy.