Monday, December 17, 2007
=== Ch4 Exploiting Instruction Level Parallelism with S/W Approach ===
=== Ch4.1 Basic compiler techniques for exposing ILP ===
IA-64 : Intel Architecture-64, Intel's first 64-bit instruction set architecture, based on EPIC.
EPIC : Explicitly Parallel Instruction Computing
FIGURE 4.1 Latencies of FP operations used in this chapter.
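This figure underlies all of Chapter 4: it gives the latency between a producing instruction and a dependent consuming instruction, for each pair of instruction types.
First, what pipeline scheduling and loop unrolling are:
For example: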
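The figure itself is not reproduced in these notes, so here is a small sketch of it as a lookup table. The four values are the ones I recall from the textbook figure; the Python names (`FP_LATENCY`, `stalls_needed`) are made up for illustration.

```python
# Latency = clock cycles that must separate a producing instruction from a
# dependent consuming instruction to avoid a stall.
# Values assumed from the textbook's FP latency figure; names are illustrative.
FP_LATENCY = {
    ("fp_alu", "fp_alu"): 3,  # e.g. ADD.D result used by another FP ALU op
    ("fp_alu", "store"): 2,   # e.g. ADD.D result stored by S.D
    ("load", "fp_alu"): 1,    # e.g. L.D result used by ADD.D
    ("load", "store"): 0,     # e.g. L.D result stored by S.D
}

def stalls_needed(producer, consumer, gap):
    """Stall cycles when `gap` independent instructions sit between the pair."""
    return max(0, FP_LATENCY[(producer, consumer)] - gap)

print(stalls_needed("load", "fp_alu", 0))   # 1: L.D directly followed by ADD.D
print(stalls_needed("fp_alu", "store", 0))  # 2: ADD.D directly followed by S.D
print(stalls_needed("fp_alu", "store", 1))  # 1: one instruction in between
```

Scheduling amounts to filling that `gap` with useful independent instructions instead of stalls.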
for (i=1000; i>0; i=i-1) {
X[i] = X[i] + s;
}
1. MIPS code =>
Loop: L.D F0,0(R1)
ADD.D F4,F0,F2
S.D F4,0(R1)
DADDUI R1,R1,#-8
BNE R1,R2,Loop
2. Without any scheduling (10 cycles) =>
Loop: L.D F0,0(R1)
stall
ADD.D F4,F0,F2
stall
stall
S.D F4,0(R1)
DADDUI R1,R1,#-8
stall
BNE R1,R2,Loop
stall
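3. After scheduling (6 cycles; DADDUI now runs before S.D, so the store offset becomes 8(R1), and S.D fills the branch delay slot) =>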
Loop: L.D F0,0(R1)
DADDUI R1,R1,#-8
ADD.D F4,F0,F2
stall
BNE R1,R2,Loop
S.D F4,8(R1)
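The 10-cycle and 6-cycle counts can be checked with a rough timing model. This is only a sketch under my own assumptions: one instruction issues per cycle, in order; an integer op feeding a branch costs 1 stall; an unfilled branch delay slot costs 1 cycle; any pair not listed costs nothing. The function and variable names are hypothetical.

```python
# Minimal in-order, single-issue timing model (illustrative, not a MIPS simulator).
# LATENCY[(producer_kind, consumer_kind)] = stalls if the consumer immediately
# follows the producer. Pairs not listed are assumed to cost 0.
LATENCY = {
    ("load", "fp"): 1,     # L.D    -> ADD.D
    ("fp", "store"): 2,    # ADD.D  -> S.D
    ("fp", "fp"): 3,       # ADD.D  -> ADD.D
    ("int", "branch"): 1,  # DADDUI -> BNE (assumption)
}

def cycles(program):
    """program: list of (kind, dest_register_or_None, source_registers)."""
    ready = {}  # register -> (kind of producer, cycle in which it issued)
    cycle = 0
    for kind, dest, srcs in program:
        earliest = cycle + 1
        for reg in srcs:
            if reg in ready:
                prod_kind, prod_cycle = ready[reg]
                lat = LATENCY.get((prod_kind, kind), 0)
                earliest = max(earliest, prod_cycle + lat + 1)
        cycle = earliest
        if dest is not None:
            ready[dest] = (kind, cycle)
    if program[-1][0] == "branch":
        cycle += 1  # nothing placed in the branch delay slot
    return cycle

unscheduled = [
    ("load", "F0", ["R1"]),          # L.D    F0,0(R1)
    ("fp", "F4", ["F0", "F2"]),      # ADD.D  F4,F0,F2
    ("store", None, ["F4", "R1"]),   # S.D    F4,0(R1)
    ("int", "R1", ["R1"]),           # DADDUI R1,R1,#-8
    ("branch", None, ["R1", "R2"]),  # BNE    R1,R2,Loop
]
scheduled = [
    ("load", "F0", ["R1"]),          # L.D    F0,0(R1)
    ("int", "R1", ["R1"]),           # DADDUI R1,R1,#-8
    ("fp", "F4", ["F0", "F2"]),      # ADD.D  F4,F0,F2
    ("branch", None, ["R1", "R2"]),  # BNE    R1,R2,Loop
    ("store", None, ["F4", "R1"]),   # S.D    F4,8(R1), in the delay slot
]

print(cycles(unscheduled))  # 10
print(cycles(scheduled))    # 6
```

The model only counts cycles. It does not decide where the remaining stall is drawn, so it may place that stall in a different slot than the listing does while reaching the same total.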
4. Loop unrolled four times =>
(28 clock cycles or 28/4=7 per iteration; the 14-cycle figure applies only once the unrolled loop is scheduled)
Loop: L.D F0,0(R1)
ADD.D F4,F0,F2
S.D F4,0(R1)
L.D F6,-8(R1)
ADD.D F8,F6,F2
S.D F8,-8(R1)
L.D F10,-16(R1)
ADD.D F12,F10,F2
S.D F12,-16(R1)
L.D F14,-24(R1)
ADD.D F16,F14,F2
S.D F16,-24(R1)
DADDUI R1,R1,#-32
BNE R1,R2,Loop
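At the source level, the same transformation looks roughly like this. It is a Python sketch of the C loop, not compiler output; the function names are made up, and it assumes the trip count is a multiple of 4 (as 1000 is), so no cleanup loop is needed.

```python
def add_scalar_rolled(x, s):
    # Mirrors: for (i=1000; i>0; i=i-1) X[i] = X[i] + s;  (index 0 is unused)
    i = len(x) - 1
    while i > 0:
        x[i] = x[i] + s
        i -= 1
    return x

def add_scalar_unrolled4(x, s):
    # Four copies of the body per trip: one index update and one loop test
    # now serve four elements. Assumes (len(x) - 1) is a multiple of 4.
    i = len(x) - 1
    assert i % 4 == 0, "trip count must be a multiple of 4 in this sketch"
    while i > 0:
        x[i] = x[i] + s
        x[i - 1] = x[i - 1] + s
        x[i - 2] = x[i - 2] + s
        x[i - 3] = x[i - 3] + s
        i -= 4
    return x

data = [float(k) for k in range(1001)]  # elements 1..1000 are used
assert add_scalar_rolled(data[:], 2.5) == add_scalar_unrolled4(data[:], 2.5)
```

The four body copies correspond to the four L.D / ADD.D / S.D groups above, and the single `i -= 4` corresponds to DADDUI R1,R1,#-32.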
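5. Unrolled loop after scheduling (14 clock cycles or 14/4=3.5 per iteration; the two stores after DADDUI use offsets 16 and 8 because R1 has already dropped by 32) =>
Loop: L.D F0,0(R1)
L.D F6,-8(R1)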
L.D F10,-16(R1)
L.D F14,-24(R1)
ADD.D F4,F0,F2
ADD.D F8,F6,F2
ADD.D F12,F10,F2
ADD.D F16,F14,F2
S.D F4,0(R1)
S.D F8,-8(R1)
DADDUI R1,R1,#-32
S.D F12,16(R1)
BNE R1,R2,Loop
S.D F16,8(R1)
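To compare the four versions side by side, the cycle counts can be split into issue cycles and stall cycles. This is just bookkeeping, with the stall counts taken under the same latency assumptions as the listings (1 stall after each L.D, 2 after each ADD.D, 1 after DADDUI, 1 after the branch when nothing is scheduled); the names are illustrative.

```python
# (issue cycles, stall cycles, array elements handled per trip)
versions = {
    "unscheduled":          (5, 5, 1),
    "scheduled":            (5, 1, 1),
    "unrolled":             (14, 4 * 1 + 4 * 2 + 1 + 1, 4),
    "unrolled + scheduled": (14, 0, 4),
}

def cycles_per_element(issue, stalls, elements):
    return (issue + stalls) / elements

for name, (issue, stalls, elements) in versions.items():
    print(f"{name:22s} {issue + stalls:2d} cycles, "
          f"{cycles_per_element(issue, stalls, elements):.1f} per element")
```

Unrolling alone removes loop overhead (three DADDUI/BNE pairs) but keeps every stall, which is why it lands at 7 cycles per element, worse than the scheduled single loop. The gain to 3.5 comes from scheduling the larger block, where independent instructions from the four copies fill each other's latency slots.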