Wednesday, December 19, 2007

Q1. Where can a block be placed in a cache?
1. If each block has only one place it can appear in the cache, the cache is said to be direct mapped.
This mapping is usually : (Block address) MOD (Number of blocks in cache)
2. If a block can be placed anywhere in the cache, the cache is said to be fully associative.
3. If a block can be placed in a restricted set of places in the cache, the cache is set associative.
A set is a group of blocks in the cache. A block is first mapped onto a set, and then the block can be placed anywhere within that set. The set is usually chosen by bit selection;
that is, (Block address) MOD (Number of sets in cache)

If there are n blocks in a set, the cache placement is called n-way set associative.
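
To make the bit-selection mapping concrete, here is a minimal C sketch (illustrative only; the sizes and names are made up for the example). With NUM_SETS equal to the total number of blocks this is direct mapped; with NUM_SETS equal to 1 it degenerates to fully associative.

#include <stdint.h>

#define BLOCK_SIZE 64   /* bytes per block (a power of two)   */
#define NUM_SETS   512  /* sets in the cache (a power of two) */

/* Bit selection: (Block address) MOD (Number of sets in cache). */
static uint64_t set_index(uint64_t byte_address)
{
    uint64_t block_address = byte_address / BLOCK_SIZE;
    return block_address % NUM_SETS;
}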




※ Q2: How is a block found if it is in the cache?
1. The block offset selects the desired data from the block,
2. the index field selects the set, and
3. the tag field is compared against the address to check for a hit.


If the cache size is held constant, increasing the associativity increases the number of blocks per set => the index shrinks and the tag grows.
When the index shrinks to 0 bits, the cache is fully associative.
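
A small C sketch of the field split (illustrative; the widths below are the ones derived in the Alpha 21264 example that follows: 6 offset bits and 9 index bits):

#include <stdint.h>

#define OFFSET_BITS 6   /* log2(block size in bytes) */
#define INDEX_BITS  9   /* log2(number of sets)      */

static uint64_t block_offset(uint64_t addr) { return addr & ((1ull << OFFSET_BITS) - 1); }
static uint64_t index_field(uint64_t addr)  { return (addr >> OFFSET_BITS) & ((1ull << INDEX_BITS) - 1); }
static uint64_t tag_field(uint64_t addr)    { return addr >> (OFFSET_BITS + INDEX_BITS); }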

※ Q3: Which block should be replaced on a cache miss?
1. Random
2. LRU (least-recently used)
3. FIFO

※ Q4: What happens on a write?
1. Write through : The information is written to both the block in the cache and to the block in the lower-level memory.
2. Write back : The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.

To reduce the frequency of writing back blocks on replacement, a feature called the dirty bit is commonly used. This status bit indicates whether the block is dirty (modified while in the cache) or clean (not modified). If it is clean, the block is not written back on a miss, since identical information to the cache is found in lower levels.


※ When the CPU must wait for writes to complete during write through, the CPU is said to write stall. A common optimization to reduce write stalls is a write buffer, which allows the processor to continue as soon as the data is written to the buffer, thereby overlapping processor execution with memory updating. As we shall see shortly, write stalls can occur even with write buffers.

Since the data are not needed on a write, there are two options on a write miss:
1. Write allocate : The block is allocated on a write miss, followed by the write hit actions above. In this natural option, write misses act like read misses.
2. No-write allocate : In this apparently unusual alternative, write misses do not affect the cache; the block is modified only in the lower-level memory.
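
A toy C sketch contrasting the write policies (illustrative only: a one-block "cache" with the lookup machinery elided; all names are invented for the example):

#include <stdbool.h>
#include <stdint.h>

static uint32_t memory_word;   /* the copy in lower-level memory */
static uint32_t cached_word;   /* the copy in the cache          */
static bool valid, dirty;

static void cpu_write(uint32_t v, bool hit, bool write_back, bool write_allocate)
{
    if (!hit && !write_allocate) {   /* no-write allocate: bypass the cache  */
        memory_word = v;
        return;
    }
    if (!hit)
        valid = true;                /* write allocate: bring the block in   */
    cached_word = v;                 /* the write hit actions                */
    if (write_back)
        dirty = true;                /* memory is updated later, on eviction */
    else
        memory_word = v;             /* write through: update memory now     */
}

static void evict(void)
{
    if (valid && dirty)
        memory_word = cached_word;   /* the dirty bit avoids needless write backs */
    valid = dirty = false;
}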


※ Alpha 21264 Data Cache :
Cache Size = 64K bytes = 2^16 bytes.
Block Size = 64 bytes = 2^6 bytes. (block offset = 6)
2-way associativity. Write-back and Write-allocate.

2^index = Cache Size / (Block Size × Set associativity)

2^index = 65536 / (64 × 2) = 512 => index = 9 => index field = 9 bits

Step 1 : The 21264 CPU sends a 48-bit virtual address to the cache for the tag check and, at the same time, translates the virtual address into a 44-bit physical address. => Tag field = 44 - 9 - 6 = 29 bits

Step 2 : The index selects the set; the two tags in the set are read out and compared simultaneously, and the matching one, if any, is selected.

Step 3 : After the two tags are read from the cache, they are compared with the tag portion of the block address sent by the CPU. To be sure a tag holds valid information, its valid bit must be 1; otherwise the result of the comparison is ignored.

Step 4 : Assuming one of the tags matches, the CPU is signaled to load the proper data from the 2:1 multiplexer, selected by the matching input.
The 21264 uses write back, keeping one dirty bit per block to record whether the block has ever been written. If the block being replaced (the victim) has been modified, its data and address are sent to the victim buffer.
The CPU knows whether it is issuing an instruction address or a data address, so each address type can use its own port, doubling the bandwidth between the memory hierarchy and the CPU. Separate caches also let the designer optimize each cache individually (capacity, block size, and associativity can each be tuned). By contrast, the terms unified cache or mixed cache are applied to caches that can contain either instructions or data.


Wednesday, December 19, 2007
====== Ch5 Memory Hierarchy Design ======
※ The principle of locality : most programs do not access all code or data uniformly


====== Ch5.2 Review of the ABCs of Caches ======

Cache
Instruction Cache / Data Cache
Unified Cache
Memory Stall Cycles / Misses per instruction / Miss Rate / Miss Penalty
Set / Direct mapping / N-way set Associative / Fully Associative
Block / Block Address / Tag field / Index field / Block offset
Valid bit
Random replacement
LRU
Write through / Write back
Dirty bit
Virtual Memory
Write stall / Write buffer / Write allocate / No-write allocate
Page
Page fault
Average Memory Access Time (AMAT)
Cache Hit / Cache Miss / Hit time
Locality (temporal/spatial)
Access trace

Formula 1 =>
CPU execution time = ( CPU clock cycles + Memory stall cycles ) × Clock cycle time

Memory stall cycles : the cycles the CPU spends stalled while waiting for memory accesses.

Formula 2 =>
Memory stall cycles = Number of misses × Miss Penalty
= IC × ( Misses / Instruction ) × Miss Penalty

= IC × ( Memory Accesses / Instruction ) × Miss Rate × Miss Penalty

EXAMPLE :
 CPI = 1.0 when all memory accesses hit in the cache; 50% of instructions are data accesses (loads and stores);
 Miss Penalty = 25 cycles, Miss Rate = 2%.
 How much faster would the machine be if all accesses hit in the cache?
ANSWER :
With guaranteed hits, CPU execution time
 = ( CPU clock cycles + Memory stall cycles ) × Clock cycle time
 = ( IC × CPI + 0 ) × Clock cycle time = IC × 1.0 × Clock cycle time

In reality, Memory stall cycles
 = IC × ( Memory Accesses / Instruction ) × Miss Rate × Miss Penalty
 = IC × ( 1 + 0.5 ) × 0.02 × 25 = IC × 0.75
=> CPU execution time
  = (IC × CPI + IC × 0.75) × Clock cycle time
  = IC × 1.75 × Clock cycle time
  So the machine with no cache misses is 1.75 times faster.

Formula 3 =>
Misses / Instruction
= ( Miss Rate × Memory Access ) / Instruction
= Miss Rate × ( Memory Access / Instruction )

Applying this to the previous example:
Misses / Instruction = 0.02 × ( 1.5 / Instruction ) = 0.03
Memory stall cycles = Number of misses × Miss Penalty
 = IC × ( Misses / Instruction ) × Miss Penalty
 = IC × 0.03 × 25
 = IC × 0.75
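
The same arithmetic as a runnable C check (the inputs are taken from the example above):

#include <stdio.h>

int main(void)
{
    double cpi_ideal = 1.0;                 /* CPI with no misses          */
    double accesses_per_inst = 1.0 + 0.5;   /* 1 fetch + 0.5 data accesses */
    double miss_rate = 0.02;
    double miss_penalty = 25.0;

    double stalls = accesses_per_inst * miss_rate * miss_penalty;  /* 0.75 */
    printf("stall cycles per instruction = %.2f\n", stalls);
    printf("speedup with a perfect cache = %.2f\n",
           (cpi_ideal + stalls) / cpi_ideal);                      /* 1.75 */
    return 0;
}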


Tuesday, December 18, 2007

=== Ch4.5 Hardware Support for Exposing more parallelism at compile time ===


※ Techniques such as loop unrolling, software pipelining, and trace scheduling can be used to increase the amount of parallelism available when the behavior of branches is fairly predictable at compile time. When the behavior of branches is not well known, compiler techniques alone may not be able to uncover much ILP.


※ Extending the instruction set :
The first is an extension of the instruction set to include conditional or predicated instructions.


※ 條件指令(Conditional instructions) :
1. An instruction refers to a condition, which is evaluated as part of the instruction execution.
2. If the condition is true, the instruction is executed normally.
3. If the condition is false, the execution continues as if the instruction were a no-op.
Example: if (A==0) {S=T;}
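
In C terms, the transformation a predicated instruction enables looks like this (an illustrative sketch, not the book's code):

int with_branch(int a, int s, int t)
{
    if (a == 0)           /* compiled to a conditional branch */
        s = t;
    return s;
}

int with_cmov(int a, int s, int t)
{
    /* A conditional move (e.g., MIPS MOVZ) always executes but commits its
     * result only when the condition holds; the branch-free C form: */
    return (a == 0) ? t : s;
}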


※ Compiler Speculation with Hardware Support :
To speculate ambitiously requires 3 capabilities:
1. the ability of the compiler to find instructions that, with the possible use of register renaming, can be speculatively moved and not affect the program data flow,
2. the ability to ignore exceptions in speculated instructions, until we know that such exceptions should really occur, and
3. the ability to speculatively interchange loads and stores, or stores and stores, which may have address conflicts.


※ Hardware Support for Preserving Exception Behavior :
There are 4 methods that have been investigated for supporting more ambitious speculation without introducing erroneous exception behavior:
1. The H/W and OS cooperatively ignore exceptions for speculative instructions.
 - this approach preserves exception behavior for correct programs, but not for incorrect ones.  This approach may be viewed as unacceptable for some programs, but it has been used, under program control, as a “fast mode” in several processors.

2. Speculative instructions that never raise exceptions are used, and checks are introduced to determine when an exception should occur.

3. A set of status bits, called poison bits, are attached to the result registers written by speculated instructions when the instructions cause exceptions. The poison bits cause a fault when a normal instruction attempts to use the register.

4. A mechanism is provided to indicate that an instruction is speculative and the H/W buffers the instruction result until it is certain that the instruction is no longer speculative.


----------------------------------------------------------
Example1 : Here is an unusual loop. First, list the dependences and then rewrite the loop so that it is parallel.

  for (i=1;i<100;i=i+1) {
   a[i] = b[i] + c[i];  /* S1 */
   b[i] = a[i] + d[i];  /* S2 */
   a[i+1] = a[i] + e[i]; /* S3 */
  }

Solution :
1. True dependences of S2 on S1 and of S3 on S1, through a[].
2. Anti-dependence from S1 to S2, through b[i].
3. Loop-carried output dependence from S3 to S1, through a[].
4. Loop-carried true dependence from S3 to S2, through a[].
5. Loop-carried true dependence from S3 to S3, through a[].

Rewritten as:

  for (i = 1; i < 100; i = i + 1) {
    a[i] = b[i] + c[i]; //S1
    b[i] = a[i] + d[i]; //S2
  }
  a[100] = a[99] + e[99];

----------------------------------------------------------
EXAMPLE2: Here is a simple code fragment:

 for (i=2;i<=100;i+=2)
   a[i] = a[50*i+1];

To use the GCD test, this loop must first be “normalized”—written so that the index starts at 1 and increments by 1 on every iteration. Write a normalized version of the loop (change the indices as needed), then use the GCD test to see if there is a dependence.

Solution :
Normalized =>

 for (i=1; i<=50; i++) {
   a[2*i] = a[ (100*i) + 1 ];
 }

Substituting a=2, b=0, c=100, d=1 into the GCD test
=> gcd(2,100) = 2 and d - b = 1. Since 2 does not divide 1, no dependence is possible.
(Indeed, the loop loads from a[101], a[201], ..., a[5001] and stores to a[2], a[4], ..., a[100], so the accesses never overlap.)


Monday, December 17, 2007

=== Ch4.4 Advanced Compiler Support for Exposing and Exploiting ILP ===

※ The analysis of loop-level parallelism focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations; such a dependence is called a loop-carried dependence.

Example:

 for (i=1; i<=100; i=i+1) {
  A[i+1] = A[i] + C[i];  /* S1 */
  B[i+1] = B[i] + A[i+1]; /* S2 */
 }

--
A[2] = A[1] + C[1]
B[2] = B[1] + A[2]
--
A[3] = A[2] + C[2]
B[3] = B[2] + A[3]
--
... ... ...
--
A[101] = A[100] + C[100]
B[101] = B[100] + A[101]
--

1. S1 uses the value of A[] computed by S1 in the previous iteration, and S2 likewise uses the result of its own previous iteration => loop-carried dependence.
2. Within one iteration, S2 depends on S1 (not loop-carried); executing the statements in order is sufficient.


※ A loop-carried dependence does not necessarily prevent parallelism:
Example:

 for (i=1; i<=100; i=i+1) {
  A[i] = A[i] + B[i];   /* S1 */
  B[i+1] = C[i] + D[i];  /* S2 */  }

--
A[1] = A[1] + B[1]
B[2] = C[1] + D[1]
--
A[2] = A[2] + B[2]
B[3] = C[2] + D[2]
--
... ... ...
--
A[100] = A[100] + B[100]
B[101] = C[100] + D[100]
--
S1 depends on S2 across iterations: a loop-carried dependence exists between them.

Keys to the transformation:
1. There is no dependence from S1 to S2 within an iteration, so interchanging the two statements does not affect the execution of S2.
2. On the first iteration of the loop, S1 depends on the value of B[1] computed before the loop begins.

Rewritten as:
 A[1] = A[1] + B[1];
 for (i=1; i<=99; i=i+1) {
  B[i+1] = C[i] + D[i];    /* S2 */
  A[i+1] = A[i+1] + B[i+1]; /* S1 */
 }
 B[101] = C[100] + D[100];


※ A recurrence occurs when a variable is defined based on the value of that variable in an earlier iteration, often the one immediately preceding, as in the above fragment.

Detecting a recurrence can be important for two reasons:
Some architectures (especially vector computers) have special support for executing recurrences, and some recurrences can be the source of a reasonable amount of parallelism.

Dependence distance :
 for (i=6;i<=100;i=i+1) {
  Y[i] = Y[i-5] + Y[i];
 }

On iteration i the loop reads array element i-5, so the dependence distance is 5.
The larger the dependence distance, the more potential parallelism can be obtained by unrolling the loop.

※ Finding the dependences is important for 3 tasks :
1. Good scheduling of code.
2. Determining which loops might contain parallelism.
3. Eliminating name dependences.

※ How does the compiler detect dependences?
Nearly all dependence analysis algorithms work on the assumption that array indices are affine: a one-dimensional array index is affine if it can be written in the form a × i + b, where a and b are constants and i is the loop index variable. An access such as x[y[i]] is nonaffine.


※ A dependence exists if two conditions hold: (the setting for the GCD test)
1. There are two iteration indices, j and k, both within the limits of the for loop.
That is, m ≤ j ≤ n and m ≤ k ≤ n.
2. The loop stores into an array element indexed by a × j + b and later fetches from that same array element when it is indexed by c × k + d. That is, a × j + b = c × k + d.

Example: Use the GCD test to determine whether dependences exist in the following loop:
  for (i=1; i<=100; i=i+1) {
   X[2*i+3] = X[2*i] * 5.0;
  }

Solution: Given the values a = 2, b = 3, c = 2, and d = 0,
  then GCD(a,c) = 2, and d - b = -3.
  Since 2 does not divide -3, no dependence is possible.

=> The GCD test can guarantee that no dependence exists; however, the test may succeed even though no dependence actually exists
 (because it does not take the loop bounds into account).
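
A minimal C sketch of the test (illustrative; it returns whether a dependence is possible, conservatively ignoring the loop bounds, exactly the caveat noted above):

#include <stdio.h>

static long gcd(long x, long y)
{
    while (y != 0) { long t = x % y; x = y; y = t; }
    return x < 0 ? -x : x;
}

/* Store index a*j + b, load index c*k + d. Returns 1 when a dependence is
 * POSSIBLE, i.e., when GCD(a,c) divides (d - b). */
static int gcd_test(long a, long b, long c, long d)
{
    return (d - b) % gcd(a, c) == 0;
}

int main(void)
{
    printf("%d\n", gcd_test(2, 3, 2, 0));  /* X[2i+3]=X[2i]*5 example: 0 (none)     */
    printf("%d\n", gcd_test(1, 1, 1, 0));  /* A[i+1]=A[i] recurrence: 1 (possible)  */
    return 0;
}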


※ Situations in which array-oriented dependence analysis cannot tell us what we want to know:
1. When objects are referenced via pointers.
2. When array indexing is indirect through another array.
3. When a dependence may exist for some values of the inputs, but does not exist in actuality because the inputs never take on those values.
4. When an optimization depends on knowing more than just the possibility of a dependence, and needs to know on which write of a variable a given read of that variable depends.


※ The basic approach used in points-to analysis relies on information from :
1. Type information, which restricts what a pointer can point to.
2. Information derived when an object is allocated or when the address of an object is taken, which can be used to restrict what a pointer can point to. (For example, if p points to X and q can never point to X, then p and q cannot point to the same object.)
3. Information derived from pointer assignments. (For example, if the value of q is assigned to p, then p points to whatever q points to.)


※ Eliminating Dependent Computations :
1. Copy propagation : eliminates operations that copy values, e.g.
 DADDUI R1, R2, #4
 DADDUI R1, R1, #4
 becomes => DADDUI R1, R2, #8

2. Tree height reduction :

   ADD R1,R2,R3
   ADD R4,R1,R6
   ADD R8,R4,R7
    transformed: 3 cycles => 2 cycles
   ADD R1,R2,R3
   ADD R4,R6,R7
   ADD R8,R1,R4

3. Recurrences : after unrolling five copies of sum = sum + x :
  sum = sum + x1 + x2 + x3 + x4 + x5 ;  5 cycles (a chain of dependent adds)
  => sum = ( (sum + x1) + (x2 + x3) ) + (x4 + x5) ;  3 cycles

※ Software Pipelining : Symbolic Loop Unrolling :
Software pipelining reorganizes loops so that each iteration in the software-pipelined code is made from instructions chosen from different iterations of the original loop.
See Fig 4.6.


※ Global Code Scheduling :
1. Effective scheduling of a loop body with internal control flow requires moving instructions across branches.
2. Aims to compact a code fragment with internal control structure into the shortest possible sequence (the critical path) that preserves the data and control dependences.
3. It can reduce the effect of control dependences arising from conditional nonloop branches by moving code.
4. Effectively using global code motion requires estimates of the relative frequency of different paths.


Trace Scheduling : focusing on the Critical Path
1. Useful for processors with a large number of issues per clock.
2. A way to organize the global code motion process, so as to simplify the code scheduling by incurring the costs of possible code motion on the less frequent paths. (Best applied where the execution frequencies of the different paths differ markedly.)


※ Two steps of Trace Scheduling :
1. Trace selection : tries to find a likely sequence of basic blocks whose operations will be put together into a smaller number of instructions.
2. Trace compaction : tries to squeeze the trace into a small number of wide instructions. (This is essentially code scheduling.)

The advantage of the Trace Scheduling approach is that it simplifies the decisions concerning global code motion.


Superblocks : a remedy for the considerable complexity that trace scheduling incurs when a trace is entered or exited in the middle
1. They are formed by a process similar to that used for traces,
2. but they are a form of extended basic block: restricted to a single entry point but allowing multiple exits.


How can a superblock with only one entrance be constructed? The answer is to use tail duplication to create a separate block that corresponds to the portion of the trace after the entry.


Compared with ordinary trace generation, using superblocks reduces the bookkeeping and scheduling complexity, but the resulting code can be larger than with the trace-based approach. So, like trace scheduling, superblocks are most appropriate when other techniques fail.


Monday, December 17, 2007

=== Ch4.2 Static Branch Prediction ===

A delayed branch can reduce control hazards.

Example:
   LD    R1,0(R2)
   DSUBU R1,R1,R3
   BEQZ   R1,L
   OR    R4,R5,R6
   DADDU  R10,R4,R3
L:   DADDU  R7,R8,R9

=> DSUBU and BEQZ depend on LD
=> a stall will be needed after LD.

1. If the branch is almost always taken :
 => R7 is not needed on the fall-through path
 => speed can be increased by moving the DADDU at L to the position after LD.
2. If the branch is rarely taken :
 => R4 is not needed on the taken path
 => speed can be increased by moving the OR to the position after LD.
3. Profile-based predictor : predicts each branch from execution profiles collected on earlier runs.


=== Ch4.3 Static Multiple Issue : VLIW ===
※ Static : a statically scheduled superscalar requires compiler assistance.

※ Dynamic : a dynamically scheduled superscalar requires less compiler assistance, but has hardware costs.

VLIW : Very Long Instruction Word
Packs many operations into one instruction (64~128 bits or more).
VLIWs use multiple, independent functional units. Rather than attempting to issue multiple, independent instructions to the units, a VLIW packages the multiple operations into one very long instruction, or requires that the instructions in the issue packet satisfy the same constraints.

Basic VLIW approach ---
1. Local scheduling tech :
 a.) the loop unrolling generates straight-line code.
 b.) Operate on a single basic block.

2. Global scheduling tech : (trace scheduling is a global scheduling technique developed specifically for VLIWs)
 a.) scheduling code across branches.
 b.) More complex in structure.
 c.) Must deal with significantly more complicated trade-offs in optimization.

※ For the original VLIW model, there are both technical and logistical problems.
=> The technical problems are the increase in code size and the limitations of lock-step operation.


Two different elements combine to increase code size substantially for a VLIW:
1. Generating enough operations in a straight-line code fragment requires ambitiously unrolling loops (as in the earlier examples), thereby increasing code size.

2. Whenever instructions are not full, the unused functional units translate to wasted bits in the instruction encoding.

=> Ways to reduce the increase in code size =>
1. Clever encodings (e.g., letting several function units share one large immediate field)
2. Compress the instructions in main memory


※ Early VLIWs operated in lock step; there was no hazard detection H/W at all.
Because all the function units had to stay synchronized, a stall in any pipeline stalled the entire processor.

Logistical problem : binary code compatibility
1. In a strict VLIW approach, the code sequence makes use of both the instruction set definition and the detailed pipeline structure.
2. Different numbers of functional units and unit latencies require different versions of code.
3. One possible solution is object-code translation or emulation.
4. Another approach is to temper the strictness of the approach, so that binary compatibility is still feasible.

A multiple-issue processor has two potential advantages that a vector processor lacks:
1. has the potential to extract some amount of parallelism from less regularly structured code.
2. to use a more conventional, and typically less expensive, cache-based mem system.


Monday, December 17, 2007

=== Ch4 Exploiting Instruction Level Parallelism with S/W Approach ===

=== Ch4.1 Basic compiler techniques for exposing ILP ===
IA-64 : Intel Architecture-64, Intel's first 64-bit CPU microarchitecture, is based on EPIC.

EPIC : Explicitly Parallel Instruction Computing


FIGURE 4.1 Latencies of FP operations used in this chapter.
This figure underlies the whole of Chapter 4: it gives the latencies between the different types of instructions.

First, an introduction to pipeline scheduling and loop unrolling:

For example:
for (i=1000; i>0; i=i-1) {
 X[i] = X[i] + s;
}

1. MIPS code =>
Loop: L.D     F0,0(R1)
   ADD.D   F4,F0,F2
   S.D     F4,0(R1)
   DADDUI  R1,R1,#-8
   BNE    R1,R2,Loop

2. Without any scheduling (10 cycles) =>
Loop: L.D F0,0(R1)
    stall
   ADD.D   F4,F0,F2
    stall
    stall
   S.D     F4,0(R1)
   DADDUI  R1,R1,#-8
    stall
   BNE    R1,R2,Loop
    stall

3. After scheduling (6 cycles) =>
Loop: L.D     F0,0(R1)
   DADDUI  R1,R1,#-8
   ADD.D   F4,F0,F2
    stall
   BNE    R1,R2,Loop
   S.D     F4,8(R1)

4. Loop unrolled (four iterations) =>
Loop: L.D   F0,0(R1)
   ADD.D  F4,F0,F2
   S.D   F4,0(R1)
   L.D    F6,-8(R1)
   ADD.D   F8,F6,F2
   S.D    F8,-8(R1)
   L.D    F10,-16(R1)
   ADD.D  F12,F10,F2
   S.D    F12,-16(R1)
   L.D     F14,-24(R1)
   ADD.D   F16,F14,F2
   S.D     F16,-24(R1)
   DADDUI R1,R1,#-32
   BNE    R1,R2,Loop

5. Unrolled loop after scheduling (14 clock cycles, or 14/4 = 3.5 per iteration) =>

Loop: L.D  F0,0(R1)
   L.D  F6,-8(R1)
   L.D  F10,-16(R1)
   L.D  F14,-24(R1)
   ADD.D  F4,F0,F2
   ADD.D  F8,F6,F2
   ADD.D  F12,F10,F2
   ADD.D  F16,F14,F2
   S.D  F4,0(R1)
   S.D  F8,-8(R1)
   DADDUI  R1,R1,#-32
   S.D    F12,16(R1)
   BNE    R1,R2,Loop
   S.D    F16,8(R1)
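
For reference, a C-level view of the same four-way unrolling (a sketch; like the assembly above, it assumes the trip count of 1000 is a multiple of the unroll factor):

void add_scalar_unrolled(double x[], double s)
{
    /* Four-way unrolling of: for (i = 1000; i > 0; i = i - 1) x[i] += s;
     * x must have at least 1001 elements, matching the 1-based indexing. */
    for (int i = 1000; i > 0; i -= 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}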



Monday, December 3, 2007

====== Ch3 ======

ILP (Instruction-Level Parallelism) : the ability to execute more than one instruction at the same time.


--- 3.1 Fig 3.1 : the major techniques and the stalls they attack

A.2 Forwarding, bypassing : data hazard stalls
A.2 Delayed branches, branch scheduling : control hazard stalls
A.8 Scoreboarding (dynamic scheduling) : data hazard stalls from true dependences
3.2 Renaming (dynamic scheduling) : data hazards from WAR (anti-dependence) and WAW (output dependence)
3.4 Branch prediction : control stalls
3.6 Issuing multiple instructions per cycle : ideal CPI
3.7 Speculation : data hazard and control hazard stalls
3.2/3.7 Dynamic memory disambiguation : data hazard stalls with memory

--- 3.1 Three different types of dependences :
1. Data dependence : (RAW: true dependence)
2. Name dependence : (WAR: Anti-dependence, WAW: output-dependence) <-- renaming
3. Control dependence : <-- speculation


-- Q4 : RAW (true dependence), WAR (anti-dependence), WAW (output dependence)
RAW : instruction j is data dependent on instruction i <-- overcome by stalling, or eliminated by transforming the code
WAR : instruction i's read precedes instruction j's write; the order must be preserved <-- register renaming
WAW : both i and j write; the order must be preserved

Loop :
(1) DIV.D F0, F2, F4
(2) ADD.D F6, F0, F8
(3) S.D   F6,0(R1)
(4) SUB.D F8,F10,F14
(5) MUL.D F6,F10,F8

RAW : (1)->(2), (2)->(3), (4)->(5)
WAR : (2)->(4)
WAW : (2)->(5)

After register renaming (S and T are fresh registers):
(1) DIV.D F0, F2, F4
(2) ADD.D S, F0, F8
(3) S.D   S,0(R1)
(4) SUB.D T,F10,F14
(5) MUL.D F6,F10,T


--3.2 Q5 Tomasulo's algorithm : an enhancement of scoreboarding that allows instructions to execute out of order when there are sufficient resources and no data dependences.
(register renaming resolves WAR and WAW)

--3.4 Tournament Predictor : adaptively combining local and global predictors

-The primary motivation for correlating branch predictors came from the observation that the standard 2-bit predictor using only local information failed on some important branches and that by adding global information, the performance could be improved.

-Tournament predictors take this insight to the next level, by using multiple predictors, usually one based on global information and one based on local information, and combining them with a selector.

-Tournament predictors are the most popular form of multilevel branch predictors. A multilevel branch predictor uses several levels of branch prediction tables together with an algorithm for choosing among the multiple predictors. Existing tournament predictors use a 2-bit saturating counter per branch to choose among two different predictors; the four states of the counter dictate whether to use predictor 1 or predictor 2. The state transition diagram is shown in Figure 3.16.
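
A minimal C sketch of the 2-bit saturating counter at the heart of these schemes (illustrative; trained here on "taken", but the same structure serves as the tournament selector if "taken" is read as "predictor 2 was the correct one"):

#include <stdbool.h>

typedef unsigned char counter2;   /* saturating counter holding 0..3 */

static bool predict_taken(counter2 c) { return c >= 2; }  /* states 2,3 predict taken */

static void train(counter2 *c, bool taken)
{
    if (taken) { if (*c < 3) (*c)++; }   /* saturate at 3 */
    else       { if (*c > 0) (*c)--; }   /* saturate at 0 */
}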


--3.6 Taking advantage of more ILP with Multiple issue :
-Superscalar processors (dynamic issue capability) issue varying numbers of instructions per clock and are either statically scheduled or dynamically scheduled using techniques based on Tomasulo's algorithm. Statically scheduled processors use in-order execution, while dynamically scheduled processors use out-of-order execution.

-VLIW processors (static issue capability), in contrast, issue a fixed number of instructions formatted either as one large instruction or as a fixed instruction packet with the parallelism among instructions explicitly indicated by the instruction (hence, they are also known as EPIC--Explicitly Parallel Instruction Computers). VLIW and EPIC processors are inherently statically scheduled by the compiler.

-- Q3.2
DADDI R1,R1,#4 / LD R2,7(R1) : true dependence through R1; out-of-order execution is not allowed.
DADD R3,R1,R2 / S.D R2,7(R1) : no dependence; reordering is allowed.
S.D R2,7(R1) / S.D R2,200(R7) : possibly an output dependence; the two S.D may be exchanged if the hardware computes the effective addresses early enough to show they differ.
BEZ R1,place / S.D R1,7(R1) : no data dependence, but reordering is not allowed; until the branch resolves, any instruction moved above it is executed speculatively.


-- Q3.9 (a) Consider two branches B1 and B2 that execute alternately. The P row lists the value of the single one-bit predictor shared by B1 and B2; the outcome row lists the branch behavior. T is taken, NT is not taken; the predictor starts at NT.

Branch:     B1  B2  B1  B2  B1  B2  B1  B2
P (before): NT  T   NT  NT  T   T   NT  NT
Outcome:    T   NT  NT  T   T   NT  NT  T
Correct?    N   N   Y   N   Y   N   Y   N

Here B1 and B2 each alternate T/NT. If each had its own one-bit predictor, every prediction would be wrong; because they share one predictor, the prediction accuracy improves.

(b) B1 is always T and B2 is always NT. With a private one-bit predictor each branch would be predicted correctly every time; because they share one predictor, every prediction is wrong.

Branch:     B1  B2  B1  B2  B1  B2  B1  B2
P (before): NT  T   NT  T   NT  T   NT  T
Outcome:    T   NT  T   NT  T   NT  T   NT
Correct?    N   N   N   N   N   N   N   N

(c) If a predictor is shared by a set of branches, the membership of that set may change as the program runs. When a new branch joins the set or an old one leaves it, the branch history currently held by the predictor is unlikely to predict the new set's behavior as well as it predicted the old set's, and the predictor state is disturbed. Frequent changes in set membership can therefore reduce long-term prediction accuracy.


--Q3.20 When an instruction is speculated incorrectly, what is the effect on the three factors that make up the CPU time equation: dynamic instruction count, average clock cycles per instruction, and clock cycle time? When speculation is wrong, CPU time can increase; which factor of the CPU time equation best models this increase, and why?
=> When speculation is correct, instructions that must execute anyway can run earlier because stalls are reduced or eliminated; delay becomes unavoidable only when execution is held back until the instruction is no longer speculative. Executing required instructions earlier affects neither the instruction count nor the clock cycle time; what the reduced stall cycles improve is the CPI.
When speculation is incorrect, instructions on the path that should not execute are run and their results discarded. This has no effect on the clock cycle time, but it increases the dynamic instruction count. For a typical instruction mix the resulting change in CPI is small, yet the cycles consumed by incorrectly speculated instructions can add substantially to CPU time; that increase is modeled most clearly by the instruction count (IC).


--Q3.21
ADD.D F0,F8,F8
MUL.D F2,F8,F8
SUB.D F4,F0,F2
DADDI R10,R12,R12

ROB fields:
entry  instruction  destination  value       committed?
-----  -----------  -----------  ----------  ----------
0      ADD.D        F0           F8 + F8     Y
1      MUL.D        F2           -           N
2      SUB.D        F4           -           N
0      DADDI        R10          R12 + R12   N

There is no value for MUL.D because its 10-cycle latency means it has not finished executing, and none for SUB.D because it depends on MUL.D.
Entry 0 of the ROB originally recorded ADD.D, but ADD.D has committed and the entry has been reused by DADDI; entries 1 and 2 keep their original contents.


.End.


Monday, December 3, 2007
====== Ch2 ======
2.2 Classifying instruction set architecture :
For C = A + B
1. Stack : Push A / Push B / Add / Pop C
2. Accumulator : Load A / Add B / Store C
3. Reg-Mem : Load R1,A / Add R3,R1,B / Store R3,C
4. Reg-Reg(Load-Store) : Load R1,A / Load R2,B / Add R3,R1,R2 / Store R3,C

 
--- 2.3 Memory addressing :
Little Endian : byte order puts the byte whose address is "x...x000" at the least-significant position in the double word (the little end).
Such as : 7 6 5 4 3 2 1 0

Big Endian : byte order puts the byte whose address is "x...x000" at the most-significant position in the double word (the big end).
Such as : 0 1 2 3 4 5 6 7


-- Addresses mode :
Register : Add R4,R3 / Regs[R4]←Regs[R4]+ Regs[R3]
Immediate : Add R4,#3 / Regs[R4]←Regs[R4]+3
Displacement : Add R4,100(R1) / Regs[R4]←Regs[R4] + Mem[100+Regs[R1]]
Register indirect : Add R4,(R1) / Regs[R4]←Regs[R4] + Mem[Regs[R1]]
Indexed : Add R3,(R1 + R2) / Regs[R3]←Regs[R3] +Mem[Regs[R1]+Regs[R2]]
Direct or absolute : Add R1,(1001) / Regs[R1]←Regs[R1] + Mem[1001]
Memory indirect : Add R1,@(R3) / Regs[R1]←Regs[R1] + Mem[Mem[Regs[R3]]]
Autoincrement : Add R1,(R2)+ / Regs[R1]←Regs[R1] + Mem[Regs[R2]] / Regs[R2]←Regs[R2]+d
Autodecrement : Add R1,–(R2) / Regs[R2]←Regs[R2]–d / Regs[R1]←Regs[R1] + Mem[Regs[R2]]
Scaled : Add R1,100(R2)[R3] / Regs[R1]← Regs[R1]+ Mem[100+Regs[R2] + Regs[R3]*d]


-- 2.8 SIMD :
Single instruction, multiple data; also called vector instructions.
SISD executes one instruction on one data item to produce one result,
which is inefficient when the same operation must be applied to a whole series of data, e.g. C[8] = A[8] + B[8];
SIMD applies a single instruction to multiple data streams, processing several items per operation and saving processing time.
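
A C-level illustration of the contrast (plain C standing in for what a SIMD unit does in a single instruction; assumes n is a multiple of 4):

void add_sisd(float c[], const float a[], const float b[], int n)
{
    for (int i = 0; i < n; i++)       /* one element per instruction */
        c[i] = a[i] + b[i];
}

void add_simd_style(float c[], const float a[], const float b[], int n)
{
    for (int i = 0; i < n; i += 4) {  /* a 4-wide SIMD add does these at once */
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
}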


-- paired-single operations :
Most graphics multimedia applications use 32-bit floating-point operations. Some computers double the peak performance of single-precision floating-point operations by allowing a single instruction to launch two 32-bit operations on operands found side by side in a double-precision register. Just as in the prior case, the two partitions must be insulated to prevent operations on one half from affecting the other. Such floating-point operations are called paired-single operations. For example, such an operation might be used for graphics transformations of vertices. This doubling in performance is typically accomplished by doubling the number of floating-point units, making it more expensive than just suppressing carries in integer adders.


-- DSP architectures use Saturating Arithmetic :
if the result is too large to be represented, it is set to the largest representable number, depending on the sign of the result. In contrast, two's complement arithmetic can add a small positive number to a large positive number and end up with a negative result. DSP algorithms rely on saturating arithmetic, and would be incorrect if run on a computer without it.
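
A sketch of 16-bit signed saturating addition in C (illustrative, not a particular DSP's instruction):

#include <stdint.h>

static int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t r = (int32_t)a + (int32_t)b;   /* widen so the true sum fits */
    if (r > INT16_MAX) return INT16_MAX;   /* clamp instead of wrapping  */
    if (r < INT16_MIN) return INT16_MIN;
    return (int16_t)r;
}
/* Two's complement would wrap: 32767 + 1 -> -32768; saturation returns 32767. */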


-- 2.9 Instruction for Control Flow (4 Types) :
1. Conditional branches
2. Jumps
3. Procedure call
4. Procedure return


-- Register indirect jumps permit any addressing mode to supply the target address, and are useful for :
1. Case and switch statements
2. Virtual functions or methods in OO languages like C++ or Java
3. Higher-order functions or function pointers in languages like C or C++
4. Dynamically shared libraries, which allow a library to be loaded and linked at run time


-- Procedure Invocation Options :
There are two basic conventions in use to save registers: either at the call site or inside the procedure being called.
Caller saving means that the calling procedure must save the registers that it wants preserved for access after the call, and
thus the called procedure need not worry about registers. Callee saving is the opposite: the called procedure must save the registers it wants to use, leaving the caller unrestrained.


-- 2.10 Encoding an instruction set :
To balance several competing forces :
1. As many registers and addressing modes as possible
2. Average instruction size and hence average program size
3. Instructions encoded into lengths that will be easy to handle in a pipelined implementation

-- 2.11 Graph coloring :
1. Register allocation algorithms
2. Construct a graph representing the possible candidates for allocation
3. How to use a limited set of colors so that no two adjacent nodes in a dependency graph have the same color
4. Heuristic algorithms that work well in practice


-- Traditional vector computers added strided addressing and gather/scatter addressing to increase the number of programs that can be vectorized. Strided addressing skips a fixed number of words between each access, so sequential addressing is often called unit stride addressing. Gather and scatter find their addresses in another vector register: think of it as register indirect addressing for vector computers. From a vector perspective, in contrast, these short-vector SIMD computers support only unit strided accesses: memory accesses load or store all elements at once from a single wide memory location. Since the data for multimedia applications are often streams that start and end in memory, strided and gather/scatter addressing modes are essential to successful vectorization.


-- 2.12 MIPS emphasizes :
1. Simple load-store instruction set
2. Design for pipeline efficiency
3. Efficiency as a compiler target


-- Q2.18 Copy propagation for X = Y + Z; Y = X + Z; W = Y - X;
Examine one statement at a time:
1. X = Y + Z ; the operands here are given, not computed by the program, so copy propagation does not transform this statement.
2. Y = X + Z ; here X is a computed value, so substitute X = Y + Z:
  Y = Y + Z + Z ; now no operand is a computed value.
3. W = Y - X ; both operands are computed, so substitute both:
  W = (Y + Z + Z) - (Y + Z) = Z
Copy propagation turns statement 2 from one addition into two, and reduces statement 3 from a subtraction to a simple copy of Z, which saves work there.
The lesson for optimizing compilers: to get the best compiled result, the compiler must have sophisticated trade-off analysis to decide when such optimization steps pay off.


-- Q3. Assume that values A, B, and C reside in memory.
Instruction opcodes are 8 bits / memory addresses are 64 bits / register addresses are 6 bits.
For each of the architectures of Figure 2.2, how many addresses (or names) are in each instruction for the code to compute C = A + B, and what is the total code size? (The count after each instruction is its number of memory addresses.)

Stack          Accumulator     Reg-Mem              Load/Store
Push A   1     Load A    1     Load R1,A     1      Load R1,A      1
Push B   1     Add B     1     Add R3,R1,B   1      Load R2,B      1
Add      0     Store C   1     Store R3,C    1      Add R3,R1,R2   0
Pop C    1                                          Store R3,C     1

Opcodes:           4 × 8 = 32      3 × 8 = 24     3 × 8 = 24       4 × 8 = 32
Register operands: 0               0              4 × 6 = 24       6 × 6 = 36
Memory addresses:  3 × 64 = 192    3 × 64 = 192   3 × 64 = 192     3 × 64 = 192
Code size:         224 bits        216 bits       240 bits         260 bits



.End.


Monday, December 3, 2007

1.2 Quiz1 : Discuss the critical system design issues of desktop, server, and embedded systems.

Ans1 :
1.Desktop : price-performance, graphics performance.
2.Server : throughput, availability, scalability.
3.Embedded : price, power consumption, application specific performance.


--- 1.4 Cost of an integrated circuit :
Cost of IC = [ ( Cost of die ) + ( Cost of testing ) + ( Cost of packaging ) ] / ( Final test yield )

Cost of die = ( Cost of Wafer ) / [ ( Dies per wafer ) * ( Die yield ) ]

Dies per wafer = π × ( Wafer diameter / 2 )^2 / ( Die area ) - π × ( Wafer diameter ) / sqrt( 2 × Die area )
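
The two formulas in C, with made-up input values purely for illustration:

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double pi = 3.141592653589793;
    double wafer_diameter = 30.0;    /* cm    (example value) */
    double die_area = 1.0;           /* cm^2  (example value) */
    double wafer_cost = 5000.0;      /* $     (example value) */
    double die_yield = 0.5;          /*       (example value) */

    double dies_per_wafer = pi * pow(wafer_diameter / 2.0, 2) / die_area
                          - pi * wafer_diameter / sqrt(2.0 * die_area);
    double die_cost = wafer_cost / (dies_per_wafer * die_yield);

    printf("dies per wafer = %.0f, cost per die = $%.2f\n",
           dies_per_wafer, die_cost);
    return 0;
}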


--- 1.6 Amdahl's Law :
the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.


--- Formulas:
Speedup = ( Execution time before improvement ) / ( Execution time after improvement )
=> Make the common case fast.

ExTime new = ExTime old * [ ( 1 - Fraction enhanced ) + ( Fraction enhanced / Speedup enhanced ) ]

Speedup overall = ExTime old / ExTime new
= 1 / [ (1 - Fraction enhanced ) + ( Fraction enhanced / Speedup enhanced )]
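
The overall-speedup formula as a small C helper (the 90.91% / 10x inputs echo exercise Q1.2 at the end of this post; fractions are measured against the original execution time):

#include <stdio.h>

static double overall_speedup(double fraction_enhanced, double speedup_enhanced)
{
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced);
}

int main(void)
{
    printf("%.2f\n", overall_speedup(0.9091, 10.0));  /* ~5.5 */
    return 0;
}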


--- Amdahl's Law expresses the law of diminishing returns (報酬遞減法則):
The incremental improvement in speedup gained by an additional improvement in the performance of just a portion of the computation diminishes as improvements are added. An important corollary of Amdahl's Law is that if an enhancement is only usable for a fraction of a task, we can't speed up the task by more than the reciprocal of 1 minus that fraction.

CPU Time = Seconds / Program = ( Instructions / Program) * ( Cycles / Instruction ) * ( Seconds / Cycle )


---
Q1.2 Assume that we make an enhancement to a computer that improves some mode of execution by a factor of 10. Enhanced mode is used 50% of the time, measured as a percentage of the execution time when the enhanced mode is in use (rather than as defined in the notes, where the percentage of the running time without the enhancement is used).
1. What is the speedup we have obtained from fast mode?
2. What percentage of the original execution time has been converted to fast mode?
Ans:
1. What is the speedup we have obtained from fast mode?
Assume that the time taken to execute some program P is 100t seconds when the enhancement is in use.
The amount of time during which enhanced mode is in use is 50% of this time, i.e. 50t seconds.
Now the enhanced mode gives us a speedup factor of 10. This means the original time (in unenhanced mode) to execute these 50t seconds would be 500t seconds. The total original time would therefore be 500t + 50t = 550t seconds.

Thus Speedup = Original time / Enhanced time = 550t / 100t = 5.5

2. What percentage of the original execution time has been converted to fast mode?
The amount of original time converted to fast mode is 500t seconds.
The total original time is 550t seconds.
Thus the percentage of original time converted to fast mode is 500t / 550t = 90.91%


.End.
