| Register | FAQ | Calendar | Search | Today's Posts | Mark Forums Read |
|
#1
| |||
| |||
| Hello there, I want to access an array of nibbles. I have developed in C the function that everyone will do without too much effort, but it is terribly slow. I am trying to approach the problem from another point of view in order to see if I get something faster. What I am looking for is a magic instruction (ala btsl) that given a pointer and an index, it returns the nth nibble. This is a similar problem to obtain the nth digit of a BCD (packed) format. I am wondering if there is a MMX (or some other extension) that could do this job (I heard they treat a word in sets of 4/8/32/64/128 bits, but I am not into Intel assembly extensions). Thanks in advance, any light is highly appreciated, David |
|
#2
| |||
| |||
| ditiem schrieb: > Hello there, > > I want to access an array of nibbles. I have developed in C the > function > that everyone will do without too much effort, but it is terribly > slow. Either you're using division and modulus per access or have some uncommon definition of "terribly slow". Let's see the code. > I am wondering if there is a MMX (or some other extension) that could > do this job (I heard they treat a word in sets of 4/8/32/64/128 bits, > but I am not into Intel assembly extensions). There's nothing which will help you for extracting a single specific nibble. There may be something for you if you want many nibbles in fixed patterns and do some operations on them. Hendrik vdH |
|
#3
| |||
| |||
| David "ditiem" wrote: > Hello there, Hi, > I want to access an array of nibbles. I have developed in C the > function that everyone will do without too much effort, but it is > terribly slow. Sure. C isn't prone to work at low level ![]() > I am trying to approach the problem from another point of view in > order to see if I get something faster. > What I am looking for is a magic instruction (ala btsl) > that given a pointer and an index, it returns the nth nibble. BTS isn't fast anyway. My solution is quite simple, even not optimised except it avoids branches: (but don't ask me how to do this in C, perhaps Inline.Asm ?) ______________________ ;push ... ;save registers as required mov ebx,nibble_index mov esi,source_ptr ;mov edi,result_buffer ;if desired ;push ebx ;make it a LOCAL if for a loop ;L0: ;__________________ shr ebx,1 ;DIV by 2, bit0 -> Carry, indicates ODD/EVEN mov al,[esi+ebx] ;get byte from [esi + ebx/2] ;mov ebx,[esp] ;restore nibble index setc cl ;CL= 0 or 1 (= bit0 of index) shl cl,2 ;*4= 0 or 4 shr al,cl ;no change if the index were EVEN and al,0f ; ;__________________ DONE: ;AL contains the nibble yet ;mov [edi],al ;store it ;inc/dec edi ;advance destination ptr ;inc/dec ebx ;advance nibble index ;cmp ebx,[last_index] ;better use a register ie: edx ;jc/jns/jnbe L0 ;iterate as desired/required ;add esp,4 ;kill the LOCAL after the loop ;pop ... ;resore what you pushed on top _______________________ You see this are just 6 lines (16 bytes) for a single nibble access. Feel free to uncomment/delete/modify it for your own LOOP-solution. Would be interesting to hear if this register- and dependency-stalling piece of code performs any faster than your C-solution. > This is a similar problem to obtain the nth digit of a BCD (packed) > format. I treat BCD and HEX-nibbles the same way (except for input limitation). > I am wondering if there is a MMX (or some other extension) that could > do this job (I heard they treat a word in sets of 4/8/32/64/128 bits, > but I am not into Intel assembly extensions). SSE instruction may be of help for conversion of packed bytes, but I haven't seen any nibble oriented functionality in there. __ wolfgang |
|
#4
| |||
| |||
| "David "ditiem" wrote: > Hello there, Hi, > I want to access an array of nibbles. I have developed in C the > function that everyone will do without too much effort, but it is > terribly slow. Sure. C isn't prone to work at low level ![]() > I am trying to approach the problem from another point of view in > order to see if I get something faster. > What I am looking for is a magic instruction (ala btsl) > that given a pointer and an index, it returns the nth nibble. BTS isn't fast anyway. My solution is quite simple, even not optimised except it avoids branches: (but don't ask me how to do this in C, perhaps Inline.Asm ?) ______________________ ;push ... ;save registers as required mov ebx,nibble_index mov esi,source_ptr ;mov edi,result_buffer ;if desired ;push ebx ;make it a LOCAL if for a loop ;L0: ;__________________ shr ebx,1 ;DIV by 2, bit0 -> Carry, indicates ODD/EVEN mov al,[esi+ebx] ;get byte from [esi + ebx/2] ;mov ebx,[esp] ;restore nibble index setc cl ;CL= 0 or 1 (= bit0 of index) shl cl,2 ;*4= 0 or 4 shr al,cl ;no change if the index were EVEN and al,0f ; ;__________________ DONE: ;AL contains the nibble yet ;mov [edi],al ;store it ;inc/dec edi ;advance destination ptr ;inc/dec ebx ;advance nibble index ;cmp ebx,[last_index] ;better use a register ie: edx ;jc/jns/jnbe L0 ;iterate as desired/required ;add esp,4 ;kill the LOCAL after the loop ;pop ... ;resore what you pushed on top _______________________ You see this are just 6 lines (16 bytes) for a single nibble access. Feel free to uncomment/delete/modify it for your own LOOP-solution. Would be interesting to hear if this register- and dependency-stalling piece of code performs any faster than your C-solution. > This is a similar problem to obtain the nth digit of a BCD (packed) > format. I treat BCD and HEX-nibbles the same way (except for input limitation). > I am wondering if there is a MMX (or some other extension) that could > do this job (I heard they treat a word in sets of 4/8/32/64/128 bits, > but I am not into Intel assembly extensions). SSE instruction may be of help for conversion of packed bytes, but I haven't seen any nibble oriented functionality in there. __ wolfgang |
|
#5
| |||
| |||
| sorry for the double post, I may have hit the send button twice ... __ wolfgang |
|
#6
| |||
| |||
| On Aug 14, 8:41 am, Hendrik van der Heijden <spamt...@crayne.org> wrote: > Either you're using division and modulus per access or have some > uncommon definition of "terribly slow". Let's see the code. Nah! The problem is a bit more complicated than a simple array and an index. I will try to explain in more detail. I am implementing a virtual machine in which I need to store a variable and its type (because it is dynamic). Let's say that the common technique for these cases is to use either the 4 highest or lowest bits of a word to store the type. In my design I dont want this to happen, so I decided that I can store the types of several the variables all together in one word. According to this, the memory addresses which has the last 7 bits to 0x0, 0x4,0x8 and 0xC, stores the types of the next (28) memory addresses. For example, for a given variable (read pointer) C, the type must be found in C&~0x7F plus the offset of C respect that address: C & 0x7F (plus a shift of course) The function is: #define INDEX2_MASK 0x3F void set_var_type ( unsigned int *C, unsigned int T ) { unsigned int Cindex = ((unsigned int)(C) & INDEX_MASK) ; unsigned char *CBase = (unsigned char*)((unsigned int)(C) & ~INDEX_MASK) ; CBase += Cindex >> 3 ; *CBase = (Cindex & 0x4)? (*CBase & 0x8F) | (T << 4) : (*CBase & 0xF8) | T ; } scared about the if-then-else? well, surprise! that function goes faster (in my machine, a Pentium III 850MHz) than this: void set_var_type_B ( unsigned int *C, unsigned int T ) { unsigned int Cindex = ((unsigned int)(C) & INDEX2_MASK) ; unsigned int *CBase = (unsigned int*)((unsigned int)(C) & ~INDEX2_MASK) ; CBase += Cindex >> 5 ; Cindex &= 0x7C ; *CBase &= ~(0xF << Cindex) ; *CBase |= T << Cindex ; } Answering your question about "terrible slow". The original function was: void set_var_type_original ( unsigned int *C, unsigned char T ) { *C &= 0xF0000000 ; *C |= T << 28 ; } and according to the benchmarks the functions above have and slow-down of 2.4 and 4.6 respectively (do not ask me why, I have lost whole day trying to find the reason). Hope this answer your question .Best regards, David |
|
#7
| |||
| |||
| On Aug 14, 2:33 pm, "Wolfgang Kern" <spamt...@crayne.org> wrote: > BTS isn't fast anyway. Oops! may I ask how slow it is? > My solution is quite simple, > even not optimised except it avoids branches: > (but don't ask me how to do this in C, perhaps Inline.Asm ?) Leave that on my side... I am really fighting with gcc to put some inline asm (, really a headache!> You see this are just 6 lines (16 bytes) for a single nibble access. > Feel free to uncomment/delete/modify it for your own LOOP-solution. > Would be interesting to hear if this register- and dependency-stalling > piece of code performs any faster than your C-solution. Your code is good for reading (I love the trick of shr and setc ),but now I am concentrating in writing. When I do the testing for reading I will let you know how fast it goes (looks very good to me). As I tried to explain to Hendrik, one of the good problems here is to obtain the base address and the "index" of the nibble according to that base address. Another point is how you map the index of the nibble in the nibble-array, that is, which bits from the address you take as the byte index and which bits you take as the nibble index. The solution that you propose is exactly the first one that came to my mind, but today I found another way of mapping the things that can save 1 or instructions. I have a nice solution for writing the nibble with no branches in only 10 instructions (plus the ones generated by the compiler to copy the arguments in 2 registers). My problem now is that gcc says that I use too many registers and that it cannot find a way to compile it... (this should be a temporal problem). > SSE instruction may be of help for conversion of packed bytes, > but I haven't seen any nibble oriented functionality in there. I abandoned that line, I read that they can work with nibbles, but they just add, subtract and few things more. Really complicated to obtain the nibble from there. Best regards and thank a lot for the solution! David |
|
#8
| |||
| |||
| "ditiem" <spamtrap@crayne.org> wrote in message news:68383c9b-13cd-4bf2-8996-6fb556aa750b@w24g2000prd.googlegroups.com... > What I am looking for is a magic > instruction > (ala btsl) that given a pointer and an index, it returns the nth > nibble. As a general rule-of-thumb, "magic instructions" are slow on today's x86 or later processors... > I am wondering if there is a MMX (or some other extension) that could > do this job No single instruction in the general instruction set that I recognize. Maybe something in MMX, SSE2, etc... > I want to access an array of nibbles. Well, since I'm interested in this issue also, I was waiting to see if someone posted some SSE2 etc. solution. Without such a solution, your goal is to reduce the work. So, I'd first try a few precomputed 256 byte lookup arrays or tables. Alternately, I've had good optimization results with methods similar to your 'set_var_type_original()' function, using optimization, only 'int' and 'unsigned char' types, and inlining. The goal is to get C types that the C compiler will match with registers or move into registers so fewer instructions are emitted. This is a bit of trial and error since many compilers no longer respect the 'register' keyword. You might even find it is better to copy file scope variables into procedure auto scope variables... sometimes. The inlining, by inline keyword or GCC attribute, should eliminate the procedure's overhead. This is preferred to inline assembly since many compilers won't optimize inline assembly, but will optimize inlined C. To break apart an 8-bit register value or unsigned char in C (assuming 8-bits...), the first array would be precomputed to return the upper nybble, and the second array would return the lower nybble. This can be done in assembly or C easily. This avoids all shifts, etc. after the initilization of the tables. You could use the 'xlatb' instruction in assembly, but it's a "magic instruction"... E.g., upper_nybble=upper_nybble_table[byte]; /* e.g., 0xF1 into table returns 0x0F */ lower_nybble=lower_nybble_table[byte]; /* e.g., 0xF1 into table returns 0x01 */ Hopefully, with optimization, your compiler will reduce that to two 'mov' instructions. Or, you could code it that way in assembly. Depending on the compiler's optimization ability, a simple 'and' may suffice for the lower nybble, being reduced to a single 'and' instruction. lower_nybble=byte&0x0F; Or, you could code an 'and' in assembly... To construct an 8-bit register value or unsigned char in C (assuming 8-bits...) from nybbles, an array would be precomputed to return the upper nybble left shifted four-bits, and you already have the lower nybble. You then bitwise-or them together. E.g., byte=shifted_upper_nybble_table[upper_nybble]|lower_nybble; HTH, Rod Pemberton |
|
#9
| |||
| |||
| ditiem wrote: > Answering your question about "terrible slow". The original function > was: > > void set_var_type_original ( unsigned int *C, unsigned char T ) > { > *C &= 0xF0000000 ; > *C |= T << 28 ; > } So basically what you're doing is to take the top byte of a 32-bit variable, ORing the top 4 bits together with the bottom 4 bits of T, and clearing out the bottom 28 bits? Many compilers should generate code very similar to the following: unsigned c = *C & 0xF0000000; c |= T << 28; *C = c; which will compile into (assume ecx, edx contains the input parameters): mov eax,[ecx] shl edx,28 and eax,0f0000000h or eax,edx mov [ecx],eax I.e. taking 5 cycles, and that's the best you can do without changing the problem. Terje -- - <Terje.Mathisen@hda.hydro.com> "almost all programming can be viewed as an exercise in caching" |
|
#10
| |||
| |||
| On Aug 15, 9:08*am, Terje Mathisen <spamt...@crayne.org> wrote: > So basically what you're doing is to take the top byte of a 32-bit > variable, ORing the top 4 bits together with the bottom 4 bits of T, and > clearing out the bottom 28 bits? Exactly > which will compile into (assume ecx, edx contains the input parameters): > > * * * * mov eax,[ecx] > * * * * shl edx,28 > > * * * * and eax,0f0000000h > > * * * * or eax,edx > > * * * * mov [ecx],eax > > I.e. taking 5 cycles, and that's the best you can do without changing > the problem. Is that solution better than this one? shl edx,28 and 0f0000000h, [ecx] or edx , [ecx] I am asking because it has been more than 10 years without touching assembly and I dont remember neither the restrictions nor the times of the instructions. regards, David |
![]() |
| Thread Tools | |
| Display Modes | |
In an effort to better serve ads to our visitors, cookies are used on objectmix.com. For more information, check out our Privacy Policy.