Friday, December 8, 2006

How Do You Say "Ant" in English?

This is where the "generation gap" begins…………

During English class in a bilingual kindergarten, a little boy who always behaved well but was rather reserved went to the bathroom.

As soon as he got back to the classroom he told the teacher: "Teacher, there are so many ants in the bathroom!"

The teacher nodded, and it occurred to her that the word "ant" had been taught at the very start of the semester,

so she wanted to test whether the boy still remembered it.

So she asked him: "Then how do you say 'ant'?"

The boy's face went completely 囧rz, and only after a moment did he reply:

…………

…………

…………

"The ant…… it…… it didn't say anything……."


Friday, November 17, 2006

[KUSO] A Very Serious Video!! You Absolutely Must Watch It!!!

Load Movie ... give it a moment ~~~
 

Original URL: http://blog.xuite.net/fuck.cow/800218/8527331


Friday, June 16, 2006
A bear was walking down the road. Walking along and walking along,
suddenly,
a cream pie hit him on the head!
Then another! And another!!





Wow, the bear turned into a lion! And that is the origin of the 雄獅 (mighty lion)~~~ (the pun: 雄獅 xióngshī sounds just like 熊 xióng "bear" plus 獅 shī "lion")


Friday, September 16, 2005
 
豬之歌 (The Pig Song) (just for a laugh)

Mainland China's latest hit song, following 老鼠愛大米 (Mice Love Rice).
***************************************

http://img94.exs.cx/img94/405/songofpig7gr.swf


Thursday, December 20, 2007

====== Ch5.8 Main Memory and Organizations for Improving Performance (three techniques) ======
※ 1. Wider Main Memory: if the width is doubled, fetching a block takes only half as many memory accesses.

※ 2. Simple Interleaved Memory: consecutive words are interleaved across banks so that their accesses overlap in time.

※ 3. Independent Memory Banks: each bank has its own address lines, allowing multiple independent accesses to proceed at once.
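
A minimal C sketch of how the three organizations change the miss penalty for fetching one block; the timings (4 cycles to send the address, 56 cycles per access, 4 cycles to transfer a word) and the 4-word block are illustrative assumptions, not figures from this post:

#include <stdio.h>

/* Assumed illustrative timings: 4 cycles to send the address,
   56 cycles for each memory access, 4 cycles to transfer one word. */
#define ADDR_CYCLES   4
#define ACCESS_CYCLES 56
#define XFER_CYCLES   4
#define BLOCK_WORDS   4

int main(void) {
    /* One-word-wide memory: every word pays the full cost. */
    int narrow = BLOCK_WORDS * (ADDR_CYCLES + ACCESS_CYCLES + XFER_CYCLES);

    /* 1. Wider main memory (two words wide): half as many accesses. */
    int wide = (BLOCK_WORDS / 2) * (ADDR_CYCLES + ACCESS_CYCLES + XFER_CYCLES);

    /* 2. Simple interleaved memory (four banks): the accesses overlap,
       but the words still return one at a time over the bus. */
    int interleaved = ADDR_CYCLES + ACCESS_CYCLES + BLOCK_WORDS * XFER_CYCLES;

    printf("one word wide : %d cycles\n", narrow);      /* 256 */
    printf("two words wide: %d cycles\n", wide);        /* 128 */
    printf("interleaved   : %d cycles\n", interleaved); /*  76 */
    return 0;
}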


====== Ch5.10 Virtual Memory ======
Virtual Memory:
1. A means of sharing a smaller amount of physical memory among many processes.
2. It divides physical memory into blocks and allocates them to different processes.

※ There are further differences between caches and virtual memory beyond the quantitative ones:
1. Replacement on cache misses is primarily controlled by hardware, while virtual memory replacement is primarily controlled by the operating system. The longer miss penalty means it is more important to make a good decision, so the operating system can be involved and take the time to decide what to replace.
2. The size of the processor address determines the size of virtual memory, but the cache size is independent of the processor address size.
3. In addition to acting as the lower-level backing store for main memory in the hierarchy, secondary storage is also used for the file system. In fact, the file system occupies most of secondary storage. It is not normally in the address space.


※ Translation Lookaside Buffer (TLB):
- also called a translation buffer (TB)
- It is a special cache for address translations.
- A TLB entry is like a cache entry where the tag holds portions of the virtual address and the data portion holds a physical page frame number, a protection field, a valid bit, and usually a use bit and a dirty bit.
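
A rough sketch of such an entry as a C struct; the field names and widths are assumptions for illustration, not any particular machine's TLB format:

#include <stdint.h>

/* Hypothetical TLB entry layout (widths chosen for the sketch). */
struct tlb_entry {
    uint64_t tag;        /* the virtual page number bits being matched  */
    uint64_t frame;      /* physical page frame number                  */
    uint8_t  protection; /* read/write/execute permission bits          */
    uint8_t  valid;      /* entry holds a live translation              */
    uint8_t  use;        /* set on access; approximates LRU replacement */
    uint8_t  dirty;      /* the page has been written                   */
};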


Selecting a Page Size:
=> The following arguments favor a larger page size:
1. The size of the page table is inversely proportional to the page size; memory (or other resources used for the memory map) can therefore be saved by making the pages bigger.
2. As mentioned on page 433 in Section 5.7, a larger page size can allow larger caches with fast cache hit times.
3. Transferring larger pages to or from secondary storage, possibly over a network, is more efficient than transferring smaller pages.
4. The number of TLB entries is restricted, so a larger page size means that more memory can be mapped efficiently, thereby reducing the number of TLB misses.
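
A quick worked example of point 1, with illustrative numbers: for a 32-bit virtual address space with 4 KB pages and 4 bytes per page table entry, a flat page table has 2^32 / 2^12 = 2^20 entries, i.e. 4 MB per process; raising the page size to 16 KB cuts that to 2^18 entries, i.e. 1 MB.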



====== Ch5.11 Protection and Examples of Virtual Memory ======
※ Protecting Processes
The simplest protection mechanism is a pair of registers that checks every address to be sure that it falls between the two limits, traditionally called base and bound. An address is valid if
  Base <= Address <= Bound
In some systems, the address is considered an unsigned number that is always added to the base, so the limit test is just
  (Base + Address) <= Bound
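
A minimal sketch of this second form of the check, assuming 64-bit registers; the function name is hypothetical:

#include <stdbool.h>
#include <stdint.h>

/* The user address is an unsigned offset that is always added to the
   base register; one comparison then enforces protection
   (overflow is ignored in this sketch). */
bool translate(uint64_t base, uint64_t bound, uint64_t addr,
               uint64_t *physical) {
    if (base + addr > bound)
        return false;        /* out of range: trap to the OS */
    *physical = base + addr;
    return true;
}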

※ The computer designer has three more responsibilities in helping the OS designer protect processes from each other:
1. Provide at least two modes, indicating whether the running process is a user process or an operating system process. This latter process is sometimes called a kernel process, a supervisor process, or an executive process.
2. Provide a portion of the CPU state that a user process can use but not write. This state includes the base/bound registers, a user/supervisor mode bit(s), and the exception enable/disable bit. Users are prevented from writing this state because the operating system cannot control user processes if users can change the address range checks, give themselves supervisor privileges, or disable exceptions.
3. Provide mechanisms whereby the CPU can go from user mode to supervisor mode and vice versa. The first direction is typically accomplished by a system call, implemented as a special instruction that transfers control to a dedicated location in supervisor code space. The PC is saved from the point of the system call, and the CPU is placed in supervisor mode. The return to user mode is like a subroutine return that restores the previous user/supervisor mode.

End.


Thursday, December 20, 2007

====== Ch5.7 Reducing Hit Time (four techniques) ======


※ (Technique 1) Small and Simple Caches:
1. A time-consuming portion of a cache hit is using the index portion of the address to read the tag memory and then compare it to the address. Smaller hardware is faster, so keep the cache small and simple.
2. The main benefit of direct-mapped caches: the designer can overlap the tag check with the transmission of the data.
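
A sketch of the address split that feeds the tag check, assuming an illustrative 16 KB direct-mapped cache with 64-byte blocks:

#include <stdint.h>
#include <stdio.h>

#define BLOCK_BITS 6   /* 64-byte blocks -> 6 offset bits            */
#define INDEX_BITS 8   /* 16 KB / 64 B = 256 blocks -> 8 index bits  */

int main(void) {
    uint32_t addr   = 0x12345678;  /* arbitrary example address */
    uint32_t offset = addr & ((1u << BLOCK_BITS) - 1);
    uint32_t index  = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (BLOCK_BITS + INDEX_BITS);
    /* The index selects one tag to read from the tag memory; that tag
       is compared with 'tag' to decide hit or miss, while the data can
       be read (and forwarded) in parallel with the comparison. */
    printf("tag=%#x index=%u offset=%u\n", tag, index, offset);
    return 0;
}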

※ (Technique 2) Avoiding Address Translation During Indexing of the Cache:
1. Use virtual addresses for the cache, since hits are much more common than misses.
2. Why doesn't everyone build virtually addressed caches?
- One reason is protection.
- Another reason is that every time a process is switched, the virtual addresses refer to different physical addresses, requiring the cache to be flushed.

※ (Technique 3) Pipelined Cache Access:

※ (Technique 4) Trace Caches:
- Instead of limiting the instructions in a static cache block to spatial locality, a trace cache finds a dynamic sequence of instructions, including taken branches, to load into a cache block.
- The name comes from the cache blocks containing dynamic traces of the executed instructions as determined by the CPU, rather than static sequences of instructions as determined by memory layout.

Overall comparison: + indicates an improvement, – indicates a negative effect.


Figure 5.26, from Computer Architecture: A Quantitative Approach.


Thursday, December 20, 2007

====== Ch5.5 Reducing Miss Rate (again, five techniques) ======
※ The three miss types:
1. Compulsory—The very first access to a block cannot be in the cache, so the block must be brought into the cache. These are also called cold start misses or first reference misses.
2. Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses (in addition to compulsory misses) will occur because of blocks being discarded and later retrieved.
3. Conflict—If the block placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block may be discarded and later retrieved if too many blocks map to its set. These misses are also called collision misses or interference misses. The idea is that hits in a fully associative cache which become misses in an N-way set associative cache are due to more than N requests on some popular sets.

※ <Technique 1.> Larger Block Size:
Larger blocks exploit spatial locality to reduce the miss rate (fewer compulsory misses), but they increase the miss penalty and can increase conflict misses.

※ <Technique 2.> Larger Caches:
Reduces capacity misses, but at the cost of a longer hit time and higher cost.

※ <Technique 3.> Higher Associativity:
The 2:1 cache rule of thumb: a direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N/2. This holds for cache sizes less than 128 KB.

※ <Technique 4.> Way Prediction & Pseudo-Associative Caches:
Check only one part of the cache to see whether the access hits; only on a miss are the other parts checked (reduces conflict misses).
In way prediction, extra bits are kept in the cache to predict the way (the block within the set) of the next cache access. The prediction means the multiplexor is set early to select the desired block, and only a single tag comparison is performed that clock cycle. A miss results in checking the other blocks for matches in subsequent clock cycles.
Pseudo-associative or column-associative caches: accesses proceed just as in a direct-mapped cache on a hit. On a miss, however, before going to the next lower level of the memory hierarchy, a second cache entry is checked to see if it matches. A simple way is to invert the most significant bit of the index field to find the other block in the “pseudo set.”
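
A one-line sketch of that index trick, reusing the assumed 8-bit index field from the earlier sketch:

#include <stdint.h>

#define INDEX_BITS 8

/* On a miss, probe the one extra candidate found by inverting the
   most significant bit of the index field. */
uint32_t pseudo_set(uint32_t index) {
    return index ^ (1u << (INDEX_BITS - 1));
}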

※ <Technique 5.> Compiler Optimizations:
1. Loop Interchange :

/* Before: the inner loop strides through x with a stride of 100 words,
   touching a new cache block on almost every access. */
 for (j = 0; j < 100; j = j+1)
  for (i = 0; i < 5000; i = i+1)
   x[i][j] = 2 * x[i][j];

/* After: interchanging the loops makes the accesses sequential in
   memory, so every word of a cache block is used before eviction. */
 for (i = 0; i < 5000; i = i+1)
  for (j = 0; j < 100; j = j+1)
   x[i][j] = 2 * x[i][j];

2. Blocking :

/* Before: a straightforward N x N matrix multiply; for large N the
   rows of y and the columns of z stop fitting in the cache at the
   same time, causing capacity misses. */
 for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
   { r = 0;
     for (k = 0; k < N; k = k + 1)
      r = r + y[i][k]*z[k][j];
     x[i][j] = r;
   }

/* After: computing on B x B submatrices (blocks) keeps the data being
   reused resident in the cache. Note that x must be initialized to
   zero, since each (jj,kk) block adds a partial sum into x[i][j]. */
 for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
   for (i = 0; i < N; i = i+1)
    for (j = jj; j < min(jj+B,N); j = j+1)
     { r = 0;
       for (k = kk; k < min(kk+B,N); k = k + 1)
        r = r + y[i][k]*z[k][j];
       x[i][j] = x[i][j] + r;
     }

====== Ch5.6 Reducing Cache Miss Penalty or Miss Rate via Parallelism (three techniques) ======
※ Technique 1 ☆ Nonblocking Caches to Reduce Stalls on Cache Misses:
Also called lockup-free caches.
1. Hit under miss optimization: reduces the effective miss penalty by continuing to serve hits during a miss instead of ignoring the requests of the CPU.
2. Hit under multiple miss (or miss under miss): beneficial only if the memory system can service multiple misses.

※ Technique 2 ☆ Hardware Prefetching of Instructions and Data:

※ Technique 3 ☆ Compiler-Controlled Prefetching:
1. An alternative to hardware prefetching is for the compiler to insert prefetch instructions.
2. The most effective prefetch is “semantically invisible” to a program:
- It doesn’t change the contents of registers and memory, and
- It cannot cause virtual memory faults.
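
As a concrete sketch, GCC and Clang expose the __builtin_prefetch hint, which has exactly these properties; the loop, the array, and the prefetch distance of 16 elements below are illustrative assumptions:

/* Prefetch elements 16 iterations ahead of their use. The hint changes
   no registers or memory and cannot cause a virtual memory fault. */
void scale(double *x, int n) {
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)                           /* stay inside the array */
            __builtin_prefetch(&x[i + 16], 1, 1); /* rw=1 (for a write), low temporal locality */
        x[i] = 2.0 * x[i];
    }
}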


Wednesday, December 19, 2007

====== Ch5.4 Reducing Cache Miss Penalty (five techniques) ======
※ [Technique 1] Multilevel Caches:

Average memory access time = Hit time(L1) + Miss rate(L1) × Miss penalty(L1)
where
Miss penalty(L1) = Hit time(L2) + Miss rate(L2) × Miss penalty(L2)

Therefore:

Average memory access time
= Hit time(L1) + Miss rate(L1) × (Hit time(L2) + Miss rate(L2) × Miss penalty(L2))

1. Local miss rate: the number of misses in a cache divided by the total number of memory accesses to this cache. As you would expect, for the first-level cache it equals Miss rate(L1), and for the second-level cache it is Miss rate(L2).
2. Global miss rate: the number of misses in the cache divided by the total number of memory accesses generated by the CPU. Using the terms above, the global miss rate of the first-level cache is still just Miss rate(L1), but for the second-level cache it is Miss rate(L1) × Miss rate(L2).

Average memory stalls per instruction
= Misses per instruction(L1) × Hit time(L2) + Misses per instruction(L2) × Miss penalty(L2)

EXAMPLE: Suppose that in 1000 memory references there are 40 misses in the first-level cache and 20 misses in the second-level cache. What are the various miss rates? Assume the miss penalty from the L2 cache to memory is 100 clock cycles, the hit time of the L2 cache is 10 clock cycles, the hit time of L1 is 1 clock cycle, and there are 1.5 memory references per instruction. What is the average memory access time and the average stall cycles per instruction? Ignore the impact of writes.
ANSWER :
1st-level local and global miss rate = 40 / 1000 = 4%
2nd-level local miss rate = 20 / 40 = 50%
2nd-level global miss rate = 20 / 1000 = 2%
=>
Average memory access time
 = 1 + 4% × (10 + 50% × 100) = 3.4 clock cycles.
(Without an L2 cache: Average memory access time = 1 + 4% × 100 = 5 clock cycles.)
With 1.5 memory references per instruction, 1000 memory references correspond to 667 instructions.
So, per 1000 instructions: Misses per 1000 instructions(L1) = 40 × 1.5 = 60 and Misses per 1000 instructions(L2) = 20 × 1.5 = 30.

Average memory stalls per instruction
= Misses per instruction(L1) × Hit time(L2) + Misses per instruction(L2) × Miss penalty(L2)
= (60/1000) × 10 + (30/1000) × 100
= 0.060 × 10 + 0.030 × 100 = 3.6 clock cycles

Alternatively: (Average memory access time − L1 hit time) × average memory references per instruction
= (3.4 − 1.0) × 1.5 = 3.6 clock cycles.
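
The same arithmetic as a small C sketch, using only the numbers from the example above:

#include <stdio.h>

int main(void) {
    double hit_l1 = 1.0, hit_l2 = 10.0, pen_l2 = 100.0;
    double mr_l1 = 40.0 / 1000;       /* local = global for L1           */
    double mr_l2 = 20.0 / 40;         /* L2 local miss rate              */

    double amat = hit_l1 + mr_l1 * (hit_l2 + mr_l2 * pen_l2);

    double refs = 1.5;                /* memory references / instruction */
    double mpi_l1 = 40 * refs / 1000; /* L1 misses per instruction       */
    double mpi_l2 = 20 * refs / 1000; /* L2 misses per instruction       */
    double stalls = mpi_l1 * hit_l2 + mpi_l2 * pen_l2;

    printf("AMAT         = %.1f clock cycles\n", amat);   /* 3.4 */
    printf("stalls/instr = %.1f clock cycles\n", stalls); /* 3.6 */
    return 0;
}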


※ [Technique 2] Critical Word First & Early Restart: (effective only for caches with large blocks)
1. Critical word first : Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Critical-word-first fetch is also called wrapped fetch and requested word first.
2. Early restart : Fetch the words in normal order, but as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
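
A tiny sketch of the critical-word-first (wrapped) fetch order, assuming an 8-word block in which word 5 missed:

#include <stdio.h>

#define BLOCK_WORDS 8

int main(void) {
    int requested = 5;  /* the word the CPU is stalled on */
    /* The missed word is fetched first, then the rest wrap around,
       so the CPU can restart as soon as word 5 arrives. */
    for (int i = 0; i < BLOCK_WORDS; i++)
        printf("%d ", (requested + i) % BLOCK_WORDS); /* 5 6 7 0 1 2 3 4 */
    printf("\n");
    return 0;
}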


※ [Technique 3] Giving Priority to Read Misses over Writes:
This optimization serves reads before writes have been completed.


※ [Technique 4] Merging Write Buffer: (multiple writes to consecutive words are merged into a single buffer entry)
If the buffer contains other modified blocks, the addresses can be checked to see whether the address of the new data matches the address of a valid write buffer entry. If so, the new data are combined with that entry; this is called write merging.
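
A minimal sketch of the merge check; the entry layout, 4-word entries, and 8-byte words are assumptions for illustration:

#include <stdbool.h>
#include <stdint.h>

#define ENTRY_WORDS 4            /* words per buffer entry (assumed) */
#define WORD_BYTES  8            /* 8-byte words (assumed)           */
#define ENTRY_BYTES (ENTRY_WORDS * WORD_BYTES)

struct wb_entry {
    bool     valid;
    uint64_t block_addr;         /* address of the aligned block     */
    uint64_t data[ENTRY_WORDS];
    bool     word_valid[ENTRY_WORDS];
};

/* Try to merge a one-word write into an existing valid entry;
   returns false if the write needs a fresh entry instead. */
bool try_merge(struct wb_entry *e, uint64_t addr, uint64_t value) {
    uint64_t block = addr - (addr % ENTRY_BYTES);
    if (!e->valid || e->block_addr != block)
        return false;
    unsigned word = (addr % ENTRY_BYTES) / WORD_BYTES;
    e->data[word] = value;       /* combine with the existing entry  */
    e->word_valid[word] = true;
    return true;
}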

※ [Technique 5] Victim Caches:
One approach to lowering the miss penalty is to remember what was discarded in case it is needed again. Since the discarded data have already been fetched, they can be used again at small cost.


Wednesday, December 19, 2007
====== Ch5.3 Cache Performance ======

Average Memory Access Time (AMAT) = Hit time + Miss Rate × Miss Penalty


EXAMPLE: Which of the following has the lower miss rate?
(Assume a write-through cache with a write buffer.)
1. A 16 KB instruction cache plus a 16 KB data cache, or
2. A 32 KB unified cache.
Assume 36% of instructions are data transfers, a hit takes 1 clock cycle (hit time = 1),
and the miss penalty = 100 clock cycles.
In the unified cache, a load or store takes 1 extra clock cycle, because the single port cannot handle two requests at once.

ANSWER :
First, convert misses per 1000 instructions into miss rates.

Misses / Instruction = Miss rate × (Memory accesses / Instruction)

Miss rate = [ (Misses / 1000 instructions) / 1000 ] / (Memory accesses / Instruction)

Since every instruction requires exactly one memory access to fetch the instruction:
=> Miss rate(16 KB instruction) = (3.82 / 1000) / 1.00 = 0.004 (instruction accesses are 74% of all accesses)
=> Miss rate(16 KB data) = (40.9 / 1000) / 0.36 = 0.114 (data accesses are 26%)
=> Overall miss rate for the split caches = 74% × 0.004 + 26% × 0.114 = 0.0324


The unified cache must account for both instruction and data accesses:
=> Miss rate(32 KB unified)

  = (43.3 / 1000) / (1.00 + 0.36) = 0.0318 (slightly lower, by comparison)

Average memory access time (AMAT)
= % instructions × (Hit time + Instruction miss rate × Miss penalty) + % data × (Hit time + Data miss rate × Miss penalty)



=> AMAT(split) = 74% × (1 + 0.004 × 100) + 26% × (1 + 0.114 × 100) = 4.24

=> AMAT(unified) = 74% × (1 + 0.0318 × 100) + 26% × (1 + 1 + 0.0318 × 100) = 4.44

Thus the split caches provide two memory ports per clock cycle and so avoid the structural hazard;
although their overall miss rate is higher,
their AMAT is still lower than that of the unified cache, which has only a single access port.
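
The comparison again as a C sketch; the miss counts per 1000 instructions are the ones used above:

#include <stdio.h>

int main(void) {
    double pen = 100.0;
    double i_frac = 0.74, d_frac = 0.26; /* shares of the 1.36 accesses/instr */

    /* Miss rates from misses per 1000 instructions: */
    double mr_i = 3.82 / 1000 / 1.00;    /* 16 KB instruction cache */
    double mr_d = 40.9 / 1000 / 0.36;    /* 16 KB data cache        */
    double mr_u = 43.3 / 1000 / 1.36;    /* 32 KB unified cache     */

    double amat_split = i_frac * (1 + mr_i * pen) + d_frac * (1 + mr_d * pen);
    /* The unified cache's single port costs loads and stores one
       extra cycle (the structural hazard). */
    double amat_unified = i_frac * (1 + mr_u * pen) + d_frac * (2 + mr_u * pen);

    printf("split   AMAT = %.2f clock cycles\n", amat_split);   /* 4.24 */
    printf("unified AMAT = %.2f clock cycles\n", amat_unified); /* 4.44 */
    return 0;
}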


※ Comparing performance with and without a cache
EXAMPLE:
An in-order execution computer
(such as the UltraSPARC III), miss penalty = 100 clock cycles,
all instructions normally take 1.0 clock cycles,
Assume the Average Miss Rate is 2%, there is an average of 1.5 memory references per instruction, and that the average number of Cache Misses per 1000 instructions is 30.
What is the impact on performance when behavior of the cache is included?
Calculate the impact using both misses per instruction and miss rate.
ANSWER :
CPU time
= IC × [ CPI(execution) + (Memory stall clock cycles) / Instruction ] × Clock cycle time

1. Performance including cache misses, using misses per instruction:
CPU time = IC × (1.0 + 30/1000 × 100) × Clock cycle time
     = IC × 4.0 × Clock cycle time

2. The same calculation using the miss rate:
CPU time = IC × [ CPI(execution) + Miss rate × (Memory accesses / Instruction) × Miss penalty ] × Clock cycle time

=> CPU time
 = IC×[1.0 + 2%×1.5×100]×Clock cycle time
  = IC × 4.0 × Clock cycle time

The clock cycle time and (IC) instruction count are the same, with or without a cache.
Thus, CPU time increases fourfold, with the CPI rising from 1.00 for a “perfect cache” to 4.00 for a cache that can miss.
Without any memory hierarchy at all, the CPI would increase to 1.0 + 100 × 1.5 = 151, almost 40 times longer than a system with a cache!


※ Comparing the performance impact of different cache organizations (direct-mapped vs. two-way set-associative):
EXAMPLE
Assume that CPI = 2.0 with a perfect cache, the clock cycle time is 1.0 ns, there are 1.5 memory references per instruction, and both caches are 64 KB with 64-byte blocks. One cache is direct-mapped and the other is two-way set-associative. For set-associative caches we must add a multiplexor to select between the blocks in the set, depending on the tag match. Since the speed of the CPU is tied directly to the speed of a cache hit, assume the CPU clock cycle time must be stretched 1.25 times to accommodate the selection multiplexor of the set-associative cache. The cache miss penalty is 75 ns for either cache organization. First calculate the average memory access time, and then CPU performance. Assume the hit time is 1 clock cycle, the miss rate of the direct-mapped 64 KB cache is 1.4%, and the miss rate of the two-way set-associative cache is 1.0%.
ANSWER :
Average memory access time = Hit time + Miss rate × Miss penalty

AMAT(1-way) = 1.0 + (0.014×75) = 2.05 ns
AMAT(2-way) = 1.0×1.25 + (0.01×75) = 2.00 ns (better AMAT)

CPU time = IC × [ CPI(execution) + Miss rate × (Memory accesses / Instruction) × Miss penalty ] × Clock cycle time

CPU time(1-way) = IC × [2 + 0.014×1.5×75] × 1.0 = 3.58 × IC (better CPU time)
CPU time(2-way) = IC × [2×1.25 + 0.01×1.5×75] × 1.0 = 3.63 × IC
=> Relative performance = CPU time(2-way) / CPU time(1-way) = 3.63 / 3.58 = 1.01

※ Out-of-Order Execution Processors:
(Memory stall cycles / Instruction)
= (Misses / Instruction) × (Total miss latency − Overlapped miss latency)

amzshar 發表在 痞客邦 留言(1) 人氣()