Monday, December 17, 2007

=== Ch4 Exploiting Instruction Level Parallelism with S/W Approach ===

=== Ch4.1 Basic compiler techniques for exposing ILP ===
IA-64 : Intel Architecture-64, Intel's first 64-bit instruction set architecture (implemented by the Itanium processors), is based on EPIC.

EPIC : Explicitly Parallel Instruction Computing


FIGURE 4.1 Latencies of FP operations used in this chapter.
This figure is the key to the whole of Chapter 4: it gives the latency between different types of instructions.
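
The figure itself is not reproduced in these notes. As a rough sketch, the latencies it lists can be written as a small lookup in C. The values are recalled from the textbook figure and should be treated as an assumption; two of them (1 stall after a load, 2 stalls before a store) agree with the stalls in the unscheduled listing further down. The names InstrClass and stall_cycles are made up for this example.

```c
#include <assert.h>

/* Instruction classes that appear in the FP latency figure. */
typedef enum { FP_ALU_OP, LOAD_DOUBLE, STORE_DOUBLE } InstrClass;

/*
 * Stall cycles needed between an instruction producing a result and a
 * dependent instruction using it. Values are assumed from the textbook
 * figure; they are not listed in these notes.
 * Returns -1 for a pair the figure does not cover.
 */
int stall_cycles(InstrClass producer, InstrClass consumer)
{
    /* A store writes memory, not a register, so it is never a producer here. */
    assert(producer != STORE_DOUBLE);

    if (producer == FP_ALU_OP   && consumer == FP_ALU_OP)    return 3;
    if (producer == FP_ALU_OP   && consumer == STORE_DOUBLE) return 2;
    if (producer == LOAD_DOUBLE && consumer == FP_ALU_OP)    return 1;
    if (producer == LOAD_DOUBLE && consumer == STORE_DOUBLE) return 0;
    return -1;
}
```

For example, stall_cycles(LOAD_DOUBLE, FP_ALU_OP) gives the single stall between L.D and ADD.D, and stall_cycles(FP_ALU_OP, STORE_DOUBLE) gives the two stalls between ADD.D and S.D.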

First, an introduction to pipeline scheduling and loop unrolling:

For example:
for (i=1000; i>0; i=i-1) {
 X[i] = X[i] + s;
}

1. MIPS code =>
Loop: L.D     F0,0(R1)
   ADD.D   F4,F0,F2
   S.D     F4,0(R1)
   DADDUI  R1,R1,#-8
   BNE    R1,R2,Loop

2. Without any scheduling (10 cycles) =>
Loop: L.D F0,0(R1)
    stall
   ADD.D   F4,F0,F2
    stall
    stall
   S.D     F4,0(R1)
   DADDUI  R1,R1,#-8
    stall
   BNE    R1,R2,Loop
    stall
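
To see where the 10 cycles come from, the listing can be replayed through a very simple timing model. This is only a sketch under stated assumptions: a single-issue in-order pipeline, one branch delay slot, an extra 1-cycle stall between an integer ALU op and a dependent branch, and only read-after-write dependences inside one iteration. The types and names (Instr, count_cycles, unscheduled_loop) are invented for illustration. The model reports the total cycle count only; where each stall falls may differ from the hand-written listing.

```c
#include <assert.h>
#include <stddef.h>

/* Register ids: F0..F31 map to 0..31, R0..R31 map to 32..63. */
#define F(n) (n)
#define R(n) (32 + (n))
#define NUM_REGS 64
#define NONE (-1)

typedef enum { K_LOAD, K_FP_ALU, K_STORE, K_INT_ALU, K_BRANCH } Kind;

typedef struct {
    Kind kind;
    int  dest;    /* register written, or NONE */
    int  src[2];  /* registers read, NONE if unused */
} Instr;

/* Stall cycles between a producer and a dependent consumer (assumed values). */
static int latency(Kind producer, Kind consumer)
{
    if (producer == K_LOAD    && consumer == K_FP_ALU) return 1;
    if (producer == K_FP_ALU  && consumer == K_FP_ALU) return 3;
    if (producer == K_FP_ALU  && consumer == K_STORE)  return 2;
    if (producer == K_INT_ALU && consumer == K_BRANCH) return 1;
    return 0;
}

/* The unscheduled loop body, one entry per instruction. */
static const Instr unscheduled_loop[] = {
    { K_LOAD,    F(0), { R(1), NONE } },  /* L.D    F0,0(R1)   */
    { K_FP_ALU,  F(4), { F(0), F(2) } },  /* ADD.D  F4,F0,F2   */
    { K_STORE,   NONE, { F(4), R(1) } },  /* S.D    F4,0(R1)   */
    { K_INT_ALU, R(1), { R(1), NONE } },  /* DADDUI R1,R1,#-8  */
    { K_BRANCH,  NONE, { R(1), R(2) } },  /* BNE    R1,R2,Loop */
};

/* Cycles for one pass through `code`, counting issue slots plus stalls. */
int count_cycles(const Instr *code, size_t n)
{
    int  writer_cycle[NUM_REGS];  /* issue cycle of the last writer, or NONE */
    Kind writer_kind[NUM_REGS];
    int  cycle = 0;

    assert(code != NULL && n > 0);
    for (int r = 0; r < NUM_REGS; r++) {
        writer_cycle[r] = NONE;
        writer_kind[r]  = K_INT_ALU;
    }

    for (size_t i = 0; i < n; i++) {
        int issue = cycle + 1;  /* in-order: at best one cycle after the previous */
        for (int k = 0; k < 2; k++) {
            int s = code[i].src[k];
            if (s == NONE || writer_cycle[s] == NONE) continue;
            int earliest = writer_cycle[s] + 1 + latency(writer_kind[s], code[i].kind);
            if (earliest > issue) issue = earliest;
        }
        cycle = issue;
        if (code[i].dest != NONE) {
            writer_cycle[code[i].dest] = issue;
            writer_kind[code[i].dest]  = code[i].kind;
        }
    }

    /* If the branch is last, its delay slot is empty and costs one more cycle. */
    if (code[n - 1].kind == K_BRANCH) cycle += 1;
    return cycle;
}
```

Under these assumptions, count_cycles(unscheduled_loop, 5) reproduces the 10 cycles of the listing: 5 instructions, 1 load stall, 2 stalls before the store, 1 stall before the branch, and 1 empty delay slot.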

3. After scheduling (6 cycles) =>
Loop: L.D     F0,0(R1)
   DADDUI  R1,R1,#-8
   ADD.D   F4,F0,F2
    stall
   BNE    R1,R2,Loop
   S.D     F4,8(R1)
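
Note that the store now reads S.D F4,8(R1) instead of 0(R1): DADDUI has moved above it, so R1 has already been decremented by 8 when the store executes. A minimal sketch of that offset fix-up follows; adjusted_offset is a hypothetical helper name, not something from the textbook.

```c
#include <assert.h>

/*
 * Offset to use for a load or store once it has been moved below the
 * pointer update. pointer_step is the immediate added to the base
 * register (negative in this loop, which walks the array downwards).
 */
int adjusted_offset(int original_offset, int pointer_step)
{
    /* This loop walks 8-byte doubles, so both values are multiples of 8. */
    assert(original_offset % 8 == 0 && pointer_step % 8 == 0);

    /* new_base + new_offset must equal old_base + original_offset,
       and new_base = old_base + pointer_step. */
    return original_offset - pointer_step;
}
```

Here adjusted_offset(0, -8) gives 8, matching S.D F4,8(R1). The same rule applies whenever a memory access is moved past an update of its base register.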

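4. Loop unrolled four times, before scheduling =>
(28 clock cycles, or 28/4 = 7 per iteration: 14 instructions plus 1 stall per L.D, 2 per ADD.D, 1 for DADDUI and 1 for the branch; the 14 clock cycles, or 14/4 = 3.5 per iteration, are reached only once this unrolled loop is scheduled)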
Loop: L.D   F0,0(R1)
   ADD.D  F4,F0,F2
   S.D   F4,0(R1)
   L.D    F6,-8(R1)
   ADD.D   F8,F6,F2
   S.D    F8,-8(R1)
   L.D    F10,-16(R1)
   ADD.D  F12,F10,F2
   S.D    F12,-16(R1)
   L.D     F14,-24(R1)
   ADD.D   F16,F14,F2
   S.D     F16,-24(R1)
   DADDUI R1,R1,#-32
   BNE    R1,R2,Loop
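
At the C level the same transformation looks roughly like the sketch below. It assumes the trip count is a multiple of 4 (true for the 1000 iterations of the example) and keeps the 1-based indexing of the original loop, so the array needs n + 1 elements. The function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: add s to x[1..n], counting down. */
void add_scalar(double *x, size_t n, double s)
{
    for (size_t i = n; i > 0; i--)
        x[i] = x[i] + s;
}

/* Unrolled four times: one counter update and one branch per 4 elements. */
void add_scalar_unrolled4(double *x, size_t n, double s)
{
    assert(n % 4 == 0);  /* no clean-up loop in this sketch */
    for (size_t i = n; i > 0; i -= 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}

/* Self-check: run both versions on the same 8-element input and compare. */
int unrolled_matches_original(void)
{
    double a[9], b[9];
    for (int k = 0; k < 9; k++)
        a[k] = b[k] = (double)k;

    add_scalar(a, 8, 2.5);
    add_scalar_unrolled4(b, 8, 2.5);

    for (int k = 0; k < 9; k++)
        if (a[k] != b[k]) return 0;
    return 1;
}
```

The four copies of the body correspond to the four L.D / ADD.D / S.D groups in the MIPS listing, which use different registers (F0/F4, F6/F8, F10/F12, F14/F16) so that the copies do not depend on each other.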

5. Unrolled loop, then scheduled =>

Loop: L.D  F0,0(R1)
   L.D  F6,-8(R1)
   L.D  F10,-16(R1)
   L.D  F14,-24(R1)
   ADD.D  F4,F0,F2
   ADD.D  F8,F6,F2
   ADD.D  F12,F10,F2
   ADD.D  F16,F14,F2
   S.D  F4,0(R1)
   S.D  F8,-8(R1)
   DADDUI  R1,R1,#-32
   S.D    F12,16(R1)
   BNE    R1,R2,Loop
   S.D    F16,8(R1)

