| Register | FAQ | Calendar | Search | Today's Posts | Mark Forums Read |
|
#1
| |||
| |||
| Hi, Software pipelining is a technique to reorder loops to achieve more parallelism at the instruction level. I was wondering if this optimization is mainly beneficial for processors that have multiple functional units like VLIWs, or can significant performance improvements be also achieved for RISC processors with a restricted number of functional unis like just two 4-stage pipelines? I was also wondering if software pipelining can benefit from profiling information, i.e. can the knowledge about frequently executed paths be exploited for a better pipelining? Do you know any successful applications? Regards, Tim |
|
#2
| |||
| |||
| It's a combination of loop unrolling and instruction scheduling. e.g forloop (...) { a[i] = b[i]; // stmt 1 c[i] = a[i]; // stmt 2 d[i] = c[i]; // stmt 3 } there is a clear true dependence b/w stmt 1 , 2 and 3 which constrains their execution order. If each statement takes one unit time then each iteration will take 3 unit time, so if there are 2 iterations then loop execution will take 6 unit time. But if we unroll the loop and the reorder the instruction execution order in such a way that all the non-dependent statements can execute simultaneously then we can save on execution time. i.e. 1. Unrolling step . a[0] = b[0] c[0] = a[0] d[0] = c[0] a[1] = b[1] c[1] = a[1] d[1] = c[1] 2. After Instruction scheduling step. a[0] = b[0] a[1] = b[1] c[0] = a[0] c[1] = a[1] d[0] = c[0] d[1] = c[1] Thus instructions in each of above pair of instruction can run parallely. Thus if the architecture has multiple functional units then we can run the above loop in 3 units times. Also we can club the instruction present in each group in a long VLIW format if the processor is VLIW, this will reduce the initial loop into few VLIW instructions. Optimization is called pipelining because we are able to overlap the execution of multiple instruction after optimizations. Thanks - Jatin Bhateja |
|
#3
| |||
| |||
| On Thu, 28 Aug 2008 11:49:49 +0530, Jatin Bhateja wrote: > It's a combination of loop unrolling and instruction scheduling. e.g Thank you, but I know how software pipelining is working. :-) My question was if also a significant performance increase for RISC architectures with a restricted number of functional units can be expected when software pipelining is applied. And if profiling might be exploited here. |
|
#4
| |||
| |||
| Tim Frink wrote: > My question was if also a significant performance increase for RISC > architectures with a restricted number of functional units can be > expected when software pipelining is applied. I'm not sure about significant increase, but it seems that even with a small number of FUs pipelining could be used to hide latency (e.g. memory access). But if FUs are already busy without software pipelining, then pipelining is not going to make them any busier. -- Pertti |
|
#5
| |||
| |||
| Also, the overheads of modulo scheduling can make it worse. Overheads include, more registers, prolog and epilog instructions. As Tim said, only advantage is in hiding latencies such as memory access and floating point. =Neeraj On Aug 29, 5:42 pm, Pertti Kellomaki <pertti.kellom...@tut.fi> wrote: > Tim Frink wrote: > > My question was if also a significant performance increase for RISC > > architectures with a restricted number of functional units can be > > expected when software pipelining is applied. > > I'm not sure about significant increase, but it seems that even with > a small number of FUs pipelining could be used to hide latency (e.g. > memory access). But if FUs are already busy without software > pipelining, then pipelining is not going to make them any busier. |
|
#6
| |||
| |||
| Neeraj Goel a icrit : > Also, the overheads of modulo scheduling can make it worse. > Overheads include, more registers, prolog and epilog instructions. > As Tim said, only advantage is in hiding latencies such as memory > access and floating point. Software pipelining is usually beneficial for loops with a high number of iterations. S ps: In order to hide memory latencies, non blocking caches should be present in the underlying cpu. This is not common in vliw processors. |
|
#7
| |||
| |||
| On Aug 29, 1:51 am, Tim Frink <plfr...@yahoo.de> wrote: > On Thu, 28 Aug 2008 11:49:49 +0530, Jatin Bhateja wrote: > > It's a combination of loop unrolling and instruction scheduling. e.g > > Thank you, but I know how software pipelining is working. :-) > > My question was if also a significant performance increase for RISC > architectures with a restricted number of functional units can be > expected when software pipelining is applied. You need more than 1 functional unit to take advantage of pipelining. Be it RISC or VLIW instruction format, that has no bearing on how many functional units a processor can have, The itanium processor has a RISC instruction format, and it has a notion of an EPIC (explicitly parallel instruction) bundle. The compiler will have to generate a bundle wherein multiple instructions will be handed to the processor and scheduled for parallel execution and it will yield a good performance if each of those instructions can utilize different functional units at the same time. > And if profiling might be exploited here. I didn't get you there i.e. profiling doesn't seem related to pipelining as described above. regards -kamal |
|
#8
| |||
| |||
| On Aug 28, 1:51 pm, Tim Frink <plfr...@yahoo.de> wrote: > On Thu, 28 Aug 2008 11:49:49 +0530, Jatin Bhateja wrote: > > It's a combination of loop unrolling and instruction scheduling. e.g > > Thank you, but I know how software pipelining is working. :-) > > My question was if also a significant performance increase for RISC > architectures with a restricted number of functional units can be > expected when software pipelining is applied. > > And if profiling might be exploited here. I believe software pipelining can improve performance even when there are only a few functional units. For example, suppose the following code is run on a machine with a single FU. All the It is just incrementing an array of 10 elements. ldc r0, 0 L1: 0. load r1, 0(r0) # latency 2 1. add r1, r1, 1 # latency 1 2. store 0(r0), r1 # latency 1 3. add r0, r0, 1 # latency 1 4. cmp r0, 10 # latency 1 5. blt L1 # latency 1 For a non-SW pipelined code, a single iteration would take 7 cycles. Now, for the SW pipelined code, a single iteration would take 6 cycles on average. ## prologue op0_0 noop op1_0 op2_0 op3_0 op4_0 ######## 1st iteration of steady state op0_1 op5_0 op1_1 op2_1 op3_1 op4_1 ######## 2nd iteration of steady state op0_2 op5_1 op1_2 op2_2 op3_2 op4_2 ######## .............. The degree of performance increase of a SW pipelined loop over a non- SW pipelined loop depends on several factors, including the number of resources (FUs), whether there are recurrences in the loop, the number of iterations and also register pressure. |
|
#9
| |||
| |||
| >> And if profiling might be exploited here. > > I didn't get you there i.e. profiling doesn't seem related to > pipelining as described above. This was just a question if it would make sense to somehow combine software pipelining with the most frequently executed path (determined by profiling). I can't imagine any useful combination of these two techniques, but maybe there are some? Regards, Tim |
|
#10
| |||
| |||
| > I believe software pipelining can improve performance even when there > are only a few functional units. > For example, suppose the following code is run on a machine with a > single FU. > All the > It is just incrementing an array of 10 elements. > > ldc r0, 0 > L1: > 0. load r1, 0(r0) # latency 2 > 1. add r1, r1, 1 # latency 1 > 2. store 0(r0), r1 # latency 1 > 3. add r0, r0, 1 # latency 1 > 4. cmp r0, 10 # latency 1 > 5. blt L1 # latency 1 > > For a non-SW pipelined code, a single iteration would take 7 cycles. > Now, for the SW pipelined code, a single iteration would take 6 cycles > on average. Even though I understand that this is an example, I still have the feeling that with just a single functional unit, the benefits of pipelining are marginal, in your example it's just about 1 cycle. But I also agree that this might have some positive effect when the loop has a high iteration count. |
![]() |
| Thread Tools | |
| Display Modes | |
In an effort to better serve ads to our visitors, cookies are used on objectmix.com. For more information, check out our Privacy Policy.