2-2) CA2 정리

▶

2022 midterm

1. (16pt) Fill the blanks.

2. (15pt) Compute the sizes of the tag bits and the index bits.

3. (10pt) Compare split caches vs. combined cache in terms of hit rate and bandwidth.

4. (10pt) Consider a 1 Mb DRAM device. ( $2^{20}$ )

5. (15pt) CPI with memory stalls

6. (10pt)

7. (15pt) Hit-time calculation

8. (10pt) Access control bits (R/W/X)

9. (10pt) Explain multi-level page table in virtual memory.

10. TLB / Paging exercise

▶

2023 midterm

1. (10pt) Fill the blanks.

2. (15pt) Compute the sizes of the tag bits and the index bits.

3. (10pt) Compare split caches vs. combined cache in terms of hit rate and bandwidth.

4. (10pt) Consider a 1 Mb DRAM device.

5. (15pt)

6. (10pt)

7. (30pt) Consider the following C code.

8. (30pt)

▶

2024 midterm

1.(16pt) Fill the blanks.

(a) Temporal locality : Tendency to re-use recently accessed data items

(b) Spatial locality : Tendency to reference data items that are close to other recently accessed items

(d) hit time : the time to access the upper level of the memory hierarchy

(e) miss penalty : the time to replace a block in the upper level with the corresponding block from the lower level + the time to deliver this block to the processor

(g) page fault : Occurs when the valid bit for a virtual page is off

(h) TLB : Special address translation cache that keeps track of recently used translation

2. (15pt) Compute the sizes of the tag bits and the index bits.

(a)

• direct mapped

• 32 KB of data with 4-word blocks

• 32-bit address

number of blocks = 2^15 / 16 = 2^11

byte offset = 2 bits

block offset = 2 bits

index = 11bits

tag = 32 - 15 = 17bits

(b)

• 4 way set associative

• 32 KB of data with 8-word blocks

• 32-bit address

number of blocks = 2^15 / 32 = 2^10

sets = 2^10 / 4 = 2^8

byte offset = 2bits

block offset = 3bits

index = 8bits

(c)

• fully associative

• 64 KB of data with 16-word blocks

• 32-bit address

3. (10pt) Compare split caches vs. combined cache in terms of hit rate and bandwidth.

4. (15pt) Assume that instruction cache miss rate is 1%, data cache miss rate is 4%, CPI (clock cycle per instruction) is 2 without any memory stall, and miss penalty is 200 cycles. In addition, assume that the frequency of loads/stores is 30%. Suppose that the CPU clock frequency is 1GHz. Assume that a given program A requires 10 seconds to complete on this machine without considering memory stall.

(a) Compute CPI with memory stall and compute the execution time of program A.

(b) When CPI without any memory stall becomes 1, compute CPI with memory stall and compute the execution time of program A.

(c) If the CPU clock rate is doubled with the same memory (so the clock frequency becomes 2GHz) when CPI without memory stall is 2, compute CPI with memory stall and compute the execution time of program A.

5. (10pt) Cache misses can be classified into compulsory, capacity, and conflict misses. We have a fully associative cache of which size is 16KB. When an application is running on the cache, cache miss rate is 8%. When the cache size increases to infinite, the cache miss rate reduces to 5% for the same application. When the 16KB fully associative cache changes to 16KB 4-way set associative cache, the miss rate becomes 9%. For the 4-way set associative cache of which size is 16KB, what are compulsory, capacity, and conflict cache miss rates?

6. (30pt) Media applications that play audio or video files are part of a class of workloads called “streaming” workloads; i.e., they bring in large amounts of data but do not reuse much of it. Consider a video streaming workload that accesses a 512 KB sequentially with the following address stream:

0, 2, 4, 6, 8, 10, 12, 14, 16, …

(a) Assume a 64KB direct-mapped cache with a 32-byte block. What is the miss rate for the address stream above? How would you categorize the misses this workload is experiencing, based on the 3C model?

(b) Re-compute the miss rate when a cache block size is 16 bytes, 64 bytes, and 128 bytes. What kind of locality is this workload exploiting?

(c) “Prefetching” is a technique that leverages predictable address patterns to speculatively bring in additional cache blocks when a particular cache block is accessed. One example of prefetching is a stream buffer that prefetches sequentially adjacent cache blocks into a separate buffer when a particular cache block is brought in. If the data is found in the prefetch buffer, it is considered as a hit and moved into the cache and the next cache block is prefetched. Assume a two-entry stream buffer and assume that the cache latency is such that a cache block can be loaded before the computation on the previous cache block is completed. What is the miss rate for the address stream above?

7. (10pt) Suppose that a cache has 64 blocks( $2^6$ ) that are 32(4*8, 즉 8 words = 1 block) bytes each. Show how to break the following address into the tag, set index, block offset, and byte offset if the cache is direct-mapped, 2-way set associative, 4-way set associative, 8-way set associative and fully associative?

0000 1000 0101 1100 0001 0001 0111 1001

8. (10pt) Consider a multilevel caches (primary and secondary caches). Assume that CPU clock rate is 5GHz, CPI = 1 with a primary cache of 100% hit rate, main memory access time is 100ns including miss handling, and miss rate per instruction at the primary cache is 2%. With a secondary cache of 0.5% miss rate and 5 ns access time, how will it be faster?

clock rate =→ 1cycle = 0.2ns

primary cash only :

secondary :

9. (15pt) Assume that TLB hit time is 2 cycles, cache index time (time to load a cache line) is 1 cycles, and cache tag comparison time is 1 cycles.

Compute hit time (TLB hit+cache hit) for

(a) physically indexed and physically tagged cache

(b) virtually indexed and physically tagged cache, and

10. (10pt) The virtual cache has an aliasing (or synonyms) problem. Explain what the aliasing problem is, and how to solve it.

11. (10pt) Explain the difference between interrupt and exception(or trap). Explain how OS handles the page fault exception.

12. (10) Draw a DRAM structure. The structure should include RAS, CAS, row buffer, and refresh buffer. For a (word size) load command at address 0x0…04, explain how the DRAM operates to serve using your DRAM structure diagram.

▶

2019 final exam

1. (10pt)

Compare “memory mapped I/O” vs. “I/O instruction” approaches.

Memory mapped I/O maps I/O devices into the same address space as main memory.

I/O instruction approach uses a separate address space for I/O devices with dedicated instructions.

매모리 매핑 I/O : 메인 메모리와 같은 공간에 I/O 기기 매핑

I/O instruction : I/O 기기를 위한 다른 주소공간 제공

2. (15pt)

Explain polling, interrupt, and DMA.

Polling : CPU periodically check I/O devices status.

interrupt : I/O device sends a signal to CPU when operation complete.

DMA : DMA controller transfers data between memory and I/O devices without CPU.

Polling : CPU가 기기를 주기적으로 확인

Interrupt : I/O 기기가 작동이 완료되면 신호를 보냄

DMA : CPU 없이 DMA 컨트롤러에서 메인 메모리로 직접 통신

3. (10pt)

If DMA writes to a memory block that is cached then the cached copy becomes stale.

Explain why it is happening and how to solve the stale problem.

DMA writes data directly to main memory. The CPU cache is unaware of this update. As a result, cache holds stale data while main memory has fresh data from DMA.

the solution is flush block from cache if they will be used for DMA

DMA는 CPU를 거치지 않고 메인 메모리와 직접 주고받음. 그래서 CPU 캐시는 유호하지 않은 데이터를 들고있게 됨.

해결책 : DMA로 사용할 블럭을 flush 하면 됨.

4. (10pt)

In case that OS uses virtual addresses for memory, DMA blocks may not be contiguous in physical memory.

How can we solve the scattered physical memory in DMA?

break transfers to page size chunks.

페이지 사이즈로 쪼개서 보냄

5. (10pt)

In RAID 5 system which uses distributed block-interleaved even parity.

Note that the even parity means that all xored valued should be even.

(a)

Assume E disk is out of order. Reconstruct E blocks.

Note that p denotes a parity bit.

	A	B	C	D	E
0	0	1	1	0	0 (p)
1	0	0	0	0	p (?)
2	1	1	1 (p)	0	?
3	1	0 (p)	1	1	?

(b)

For requests to update blocks A0, B2, and C3, compute how many cycles are required.

Assume that it takes a single cycle to update a single block in each HDD.

6.

What is Amdahl’s law?

the overall speedup of a program is limited by the fraction of the program that cannot be parallelized.

7.

Explain the fine-grain, coarse-grain, and simultaneous multithreading.

fine-grain switches threads every cycle.

coarse-grain switch threads only on long latency

simultaneous executes multiple threads in the same cycle.

8. (10pt)

Explain “Warp" in NVIDIA architecture.

Warp is the basic execution in NVIDIA GPUs, consisting of 32 threads that execute the same instruction simultaneously in SIMT.

9. (10pt)

For MESI snooping protocol, specify cache states in processors P1, P2, and P3 for each step.

Assume initial cache state is invalid.

MESI Table

Step	P1 cache state	P2 cache state	P3 cache state
P1: read A1	E	I	I
P1: write 1 to A1	M	I	I
P3: read A1	S	I	S
P2: write 2 to A1	I	M	I
P1: read A1	S	S	I
P2: write 3 to A1	I	M	I
P2: write 0 to A1	I	M	I
P3: read A1	I	S	S
P1: write 1 to A1	M	I	I

10. (10pt)

The following code shows the implementation of lock by using ll (load linked) and sc (store conditional) instructions.

(a) Fill the blanks.

try:   mov  R5, #1
       ll   R2,0(R4)   ; load linked
       sc   R5,0(R4)    ; store conditional
       beqz R5, try    ; branch store fails (R5 = 0)
       bnez R2, try     ; already locked?

(b)

The above code invalidates all other copies in caches, which generates considerable bus traffic.

Fill the blanks to revise the above code to solve the problem.

try:     mov  R5, #1
lockit:  lw      R2, 0(R4)
         bnez    R2, lockit
         ll      R2,0(R4)     ; load linked
         sc      R5,0(R4)     ;     store conditional
         beqz    R5, try        ;     branch store fails
         bnez    R2, try       ; already locked?

11. (10pt)

Explain the directory cache coherent protocol.

Directory protocol is a cache coherence mechanism that uses a centralized directory to track which caches have copies of each memory block.

▶

2020 final exam

1.

In HDD, the sector is the minimum unit to read and write. When a request from CPU arrives at HDD, what operations are required to read a sector in HDD? Explain them in terms of seek time, rotational latency, and data transfer.

Seek time : when a read request arrives, the disk arm must first move the read/write head to the correct track where the target sector is located.

Rotational Latency : rotate under the head for target sector.

Data transfer : the actual data is read from the disk surface. transfer time depends on the rotational speed and data density.

2.

Assume that the following HDD is given:

512B sector
10,000 rpm
5 ms average seek time
50 MB/s transfer rate
0.5 ms controller overhead
idle disk

Compute the average read time.

If the average seek time reduces to 1 ms, then what is the new average read time?

rotation time = 60s / 10000 = 6ms

avg rot time = 6/2 = 3ms

Transfer time = 512B / 50MB/s = 512 / 50 * 10^6 = 0.01ms

overhead = 0.5ms

→ ~~ 8.5ms

3.

In bus systems, there are two types of serial bus and parallel bus.

Which shows higher performance (bandwidth) in general?

Explain the reason.

serial buses avoid problems of parallel buses such as timing skew, signal interference, and crosstalk between many wires.

4.

In Flash memory, there are two types of NOR flash and NAND flash.

Explain the difference between NOR and NAND flashes.

5.

Compare ROM, PROM, EPROM, EEPROM, and flash memory in terms of the erase operation.

6.

Compare “memory mapped I/O” vs. “I/O instruction” approaches.

7.

Explain polling, interrupt, and DMA.

8.

If DMA writes to a memory block that is cached then the cached copy becomes stale.

Explain why it is happening and how to solve the stale problem.

9.

In case that OS uses virtual addresses for memory, DMA blocks may not be contiguous in physical memory.

How can we solve the scattered physical memory in DMA?

10.

In RAID 5 system which uses distributed block-interleaved even parity.

Note that even parity means that all XORed values should be even.

Assume E disk is out of order.

Reconstruct E blocks.

Note that p denotes a parity bit.

	A	B	C	E
0	1	1	1	0 (p)
1	0	1	0	p (?)
2	1	0	1 (p)	?
3	1	0 (p)	0	?

11.

What is Amdahl’s law?

12.

Explain why a two-core processor can consume half the energy compared with a single-core processor

using voltage, current, power, time, and energy equations.

13.

Write a code to sum 100,000 numbers on 10-processor UMA,

where each processor has an ID,

we partition 100 numbers per processor,

and initial summation occurs on each processor.

Assume:

execution time for all operations except synch() is negligible
synch() requires 1 ms

What is the total execution time to sum the 100,000 numbers on 10-processor UMA?

log 10 = 3.32

→ 4 ms

14.

Explain SPMD (Single Program Multiple Data) regarding the conditional branch on multiple cores.

SPMD is parallel programming model where all processors execute the same program but operate on different data.

15. (10pt)

Explain “Warp” in NVIDIA architecture.

16.

Explain the ll (load linked) and sc (store conditional) instructions.

LL reads a value from memory and sets a link flag associated with that address.

SC attempts to write a value from memory. if no other process has written, store succeeds, if another has modified, store fails.

▶

2021 final exam

1. (10pt)

Fill the blanks.

____I/O instruction____: It separates instructions to access I/O registers. It can only be executed in kernel mode.
____DMA_____: OS provides starting address in memory. I/O controller transfers to/from memory autonomously.

2. (15pt)

Explain polling, interrupt, and DMA.

3. (10pt)

Mark true/false for each sentence.

(a) (T/F) Flash memory is slower than DRAM but faster than HDD.

(b) (T/F) NOR flash memory provides higher density than NAND flash memory.

(d) (T/F) SSD is cheaper but more robust than HDD.

(e) (T/F) In flash memory, there are three operations of read, write, and erase.

4. (10pt)

If DMA writes to a memory block that is cached then the cached copy becomes stale.

Explain why it is happening and how to solve the stale problem.

DMA controller transfers data directly between I/O devices and main memory without CPU.

solutions : flush blocks from cache if they will be used for DMA

5. (10pt)

In case that OS uses virtual addresses for memory, DMA blocks may not be contiguous in physical memory.

How can we solve the scattered physical memory in DMA?

break transfers into page sized chunk.

6. (10pt)

In RAID 5 system which uses distributed block-interleaved even parity.

Note: even parity means all XORed values should be even.

(a)

Assume E disk is out of order. Reconstruct E blocks.

Note: p denotes a parity bit.

	A	B	C	D	E
0	1	1	1	0	0 (p)
1	0	1	0	0	p (?)
2	0	0	1 (p)	0	?
3	1	0 (p)	1	1	?

(b)

For requests to update blocks A0, B1, and C3, compute how many cycles are required.

Assume 1 cycle to update a single block in each HDD.

7. (10pt)

Assume the following NVIDIA Tesla code:

if(a<0)      b = c+d;   // path1
else if(a==0) b = c*d;  // path2
else          b = c/d;  // path3

WARP (32 threads):

10 → path1 (5 cycles)
10 → path2 (10 cycles)
12 → path3 (20 cycles)

Compute total cycles to run the 32 threads.

Note: Tesla SM has 8 SPs.

35 cycles.

if문 쓰면 차례대로 해야함.

8. (10pt)

For MESI snooping protocol, specify cache states in P1, P2, P3 for each step.

Initial state = invalid.

Step	P1 Cache	P2 Cache	P3 Cache
P1: read A1	E	I	I
P1: write 1 to A1	M	I	I
P1: read A1	M	I	I
P2: write 2 to A1	S	S	I
P3: read A1	I	I	M
P2: write 3 to A1	I	M	I
P2: write 0 to A1	I	M	I
P1: read A1	S	S	I
P1: write 1 to A1	M	I	I

9. (15pt)

ll/sc 기반 lock 코드.

(a) Fill the blanks.

try: mov  R5, #1
     ll   R2,0(R4)    ; load linked
     sc   ________    ; store conditional
     beqz ________    ; branch store fails (R5 = 0)
     bnez R2, try     ; already locked?

(b) Revised version to reduce bus traffic.

try:    mov  R5, #1
lockit: lw   R2, 0(R4)
        bnez ________
        ll   R2,0(R4)       ; load linked
        sc   ________       ; store conditional
        beqz ________       ; branch store fails
        bnez R2, try        ; already locked?

(c) Fetch-and-increment using ll/sc — fill the blanks.

try: ll  R2,0(R1)           ; load linked
     ____________________________
     sc  ________            ; store conditional
     beqz R2,try             ; branch store fails (R2 = 0)

▶

2022 final exam

1. (10pt)

Compare “memory mapped I/O” vs. “I/O instruction” approaches.

Memory Mapped I/O maps I/O devices into the same address space as main memory.

I/O instruction approach uses a separate address space for I/O devices with dedicated instructions such as IN and OUT.

2. (15pt)

Explain polling, interrupt, and DMA.

Polling : CPU periodically check I/O device status

interrupt driven I/O : I/O device sends a signal to CPU when operation complete.

DMA : Direct Memory Access : DMA controller transfers data between memory and I/O devices without CPU.

3. (10pt)

If DMA writes to a memory block that is cached then the cached copy becomes stale.

Explain why it is happening and how to solve the stale problem.

DMA writes data directly to main memory. The CPU cache is unaware of this update. As a result, the cache holds stale data while main memory contains fresh data from DMA. solution : flush block from caches if they will be used for DMA.

4. (10pt)

In case that OS uses virtual addresses for memory, DMA blocks may not be contiguous in physical memory.

How can we solve the scattered physical memory in DMA?

break transfers into page sized chunks.

5. (10pt)

Compare NOR flash and NAND flash in terms of price, speed, and endurance (or lifetime).

NAND is cheaper than NOR

NOR has faster read speed, NAND has faster write, erase speed.

NOR has higher endurance than NAND

6. (10pt)

In RAID 5 system which uses distributed block-interleaved even parity.

Note that even parity means that all XORed values should be even.

(a)

Assume E disk is out of order. Reconstruct E blocks.

Note that p denotes a parity bit.

	A	B	C	D	E
0	1	0	1	0	0 (p)
1	0	1	0	0	p (?)
2	0	0	1 (p)	0	?
3	1	0 (p)	1	1	?

(b)

For requests to update blocks A0, B1, and E3,

compute how many cycles are required.

Assume it takes a single cycle to update a single block in each HDD.

7. (10pt)

Assume that the following code is running on NVIDIA Tesla architecture:

if(a < 0)      b = c + d;   // path1
else if(a==0)  b = c * d;   // path2
else           b = c / d;   // path3

Among a group of 32 threads (WARP):

10 threads → path1
10 threads → path2
12 threads → path3

Assume:

path1 needs 5 cycles
path2 needs 10 cycles
path3 needs 20 cycles

Compute the total cycles to run the 32 threads.

Note that in Tesla, an SM has 8 SPs.

Warp : GPU에서 동시에 같은 명령어를 실행하는 스레드 묶음

35 cycles

8. (10pt)

For the snooping protocol which has three states: invalid, exclusive, shared,

specify cache states in processors P1, P2, and P3 for each step.

Assume initial cache state is invalid.

Step	P1 cache state	P2 cache state	P3 cache state
P1: read A1	S	I	I
P1: write 1 to A1	E	I	I
P1: read A1	E	I	I
P2: write 2 to A1	I	E	I
P3: read A1	I	S	S
P2: write 3 to A1	I	I	E
P2: write 0 to A1	I	I	E
P1: read A1	S	I	S
P1: write 1 to A1	E	I	I

9. (10pt)

The following code shows the implementation of lock by using ll (load linked) and sc (store conditional) instructions.

(a) Fill the blanks.

try:  mov  $5, #1
      ll   $2,0($4)        ; load linked
      sc   ________        ; store conditional
      beq  ________        ; branch store fails ($5 = 0)
      bne  $2, $0, try     ; already locked?

(b)

The following code shows the implementation of fetch-and-increment instruction using ll/sc.

Fill the blanks.

try:   ll  $2,0($1)         ; load linked
       ____________________________
       sc  ________          ; store conditional
       beq $2,$0,try         ; branch store fails ($2 = 0)

10. (10pt)

DMA causes a cache coherence problem.

Describe the situation where DMA causes the cache coherence problem and propose solutions.

11. (10pt)

Explain fine-grain, coarse-grain, and simultaneous multithreading.

▶

2023 final exam

1. (15pt)

Assume that TLB hit time is 2 cycles, cache index time (time to load a cache line) is 0.5 cycles,

and cache tag comparison time is 1 cycle.

Compute hit time (TLB hit + cache hit) for:

(a) physically indexed and physically tagged cache

(b) virtually indexed and physically tagged cache

2. (10pt)

A system uses a 32-bit address space and a 4 KB page frame.

The process virtual address space is 4 GB,

and the page table is single-level.

(a) Calculate the number of entries in the page table.

(b) If each page table entry is 4 bytes, compute the total page table size.

3. (15pt)

Given the virtual memory references:

0x1234, 0x2345, 0x3456, 0x4567, 0x1234, 0x5678, 0x2345, 0x6789, 0x3456, 0x789A

Assumptions:

Fully associative TLB with 4 entries
Page size = 1 KB
Initially TLB empty, all pages not in physical memory
TLB replacement: LRU
On TLB miss, page table lookup may be a hit or miss

Tasks:

(a) For each reference, determine:

TLB hit/miss
If miss → page hit or page miss

(b) Compute TLB hit rate.

4. (10pt)

The CPU communicates with I/O devices.

System supports polling, interrupt-driven I/O, DMA.

Application requires frequent, large data transfers between storage and memory.

(a) Describe operational differences among polling, interrupt-driven I/O, DMA.

Polling : CPU periodically checks the I/O device status

Interrupt-driven I/O : I/O device sends a signal to CPU when operation completes

DMA : (Direct Memory Access) : DMA controller transfers data between memory and I/O device without CPU

(b) Analyze CPU utilization for each.

Polling : CPU wastes cycles in busy-waiting loop.

Interrupt-driven I/O : CPU can execute other tasks while waiting for I/O.

DMA : CPU only handles initial setup and final completion interrupt.

Best Method : DMA

For frequent, large data transfers between storage and memory, DMA provides optimal perfermance with minimal CPU overhead. The CPU only configures the transfer parameters and handles a single completion interrupt, allowing maximum CPU availability for tasks.

5. (10pt)

In RAID 5 system (distributed block-interleaved even parity):

(a) Assume E disk is out of order. Reconstruct E blocks.

	A	B	C	D	E
0	1	0	0	0	1 (p)
1	0	1	1	0	0 (p)
2	0	0	1 (p)	0	1
3	1	0 (p)	0	1	0

(b) For update requests to A0, B2, C3, compute cycles.

Assume 1 cycle per block update.

A0 = use A, E → 1 cycle

B2 = use B, C

C3 = use B, C

B2 and C3 run sequentially → 2 cycles

max ( 1, 2 ) = 2 cycles

6. (10pt) — True/False

Mark T/F:

(a) Flash memory is slower than DRAM but faster than HDD.

(b) NOR flash memory provides higher density than NAND flash memory.

(d) SSD is cheaper but more robust than HDD.

(e) In flash memory, there are three operations of read, write, and erase.

7. (15pt)

Given ARM code:

ldr r0, [r1]
add r2, r0, r0

5-stage pipeline: IF, ID, EX, MEM, WB

Load instruction requires 4 cycles at MEM stage.

Others take 1 cycle.

Total = 10 cycles for 2 instructions.

Compute cycles for:

(a) 2 threads, fine-grain multithreading

fine-grain : 매 사이클 마다 스레드 전환

T1 → T2

java

         1 2 3 4 5 6 7 8 9 1011
T1 ldr : IFIDEXM1M2M3M4WB
T1 add :   IFID        EXMMWB
T2 ldr :   IFIDEXM1M2M3M4WB
T2 add :     IFID        EXMMWB

→ 11 cycles

(b) 4 threads, fine-grain multithreading

java

         1 2 3 4 5 6 7 8 9 10111213
T1 ldr : IFIDEXM1M2M3M4WB
T1 add :   IFID        EXMMWB
T2 ldr :   IFIDEXM1M2M3M4WB
T2 add :     IFID        EXMMWB
T3 ldr :     IFIDEXM1M2M3M4WB
T3 add :       IFID        EXMMWB
T4 ldr :       IFIDEXM1M2M3M4WB
T4 add :         IFID        EXMMWB

→ 13 cycles

coarse-grain : 긴 지연 (cache miss) 에서만 스레드 전환

java

        1 2 3 4 5 6 7 8 9 10
T1-LDR: IFIDEXM1M2M3M4WB
T1-ADD:   IF          IDEXWB
* 4

→ 40 cycles

Assume no L2-cache miss.

8. (10pt)

NVIDIA Tesla architecture code:

if(a < 0)      b = c + d;   // path1
else if(a==0)  b = c * d;   // path2
else           b = c / d;   // path3

WARP: 32 threads

10 threads: path1 (3 cycles)
10 threads: path2 (5 cycles)
12 threads: path3 (20 cycles)

Compute total cycles.

SM has 8 SPs.

26 cycles

9. (20pt)

Snooping protocol with 3 states: invalid, exclusive, shared.

Specify cache states for P1, P2, P3 for each step.

Also specify directory state for directory-based protocol.

Step	P1 cache state	P2 cache state	P3 cache state	Directory state for A1
P1: read A1	Exc	Inv	Inv	P1
P1: write 1 to A1	Exc	Inv	Inv	P1
P1: read A1	Exc	Inv	Inv	P1
P2: write 2 to A1	Inv	Exc	Inv	P2
P3: read A1	Inv	Shd	Shd	P2, P3
P2: write 3 to A1	Inv	Exc	Inv	P2
P2: write 0 to A1	Inv	Exc	Inv	P2
P1: read A1	Shd	Shd	Inv	P1, P2
P1: write 1 to A1	Exc	Inv	Inv	P1

10. (10pt)

ll/sc 기반 lock implementation:

(a) Fill the blanks:

try:  mov  $5, #1
      ll   $2,0($4)     ; load linked
      sc   $5,0($4)     ; store conditional
      beq  $5, $0, try     ; branch store fails ($5 = 0)
      bne  $2, $0, try  ; already locked?

(b) Fetch-and-increment with ll/sc — fill blanks:

try:  ll  $2,0($1)        ; load linked
      addi $2, $2, 1
      sc  $2,0($1)        ; store conditional
      beq $2,$0,try       ; branch store fails ($2 = 0)

▶

2024 final exam

1. (10pt)

In RAID 5 system which uses distributed block-interleaved even parity,

(a)

Assume E disk is out of order. Reconstruct E blocks.

	A	B	C	D	E
0	1	0	0	0	1 (p)
1	0	1	1	0 (p)	0
2	0	0	1 (p)	0	1
3	1	0 (p)	0	1	0

(b)

For requests to update blocks A0, B2, and C3, compute how many cycles are required.

Assume it takes 1 cycle to update a single block in each HDD.

각 업데이트 : 1 cycle

A0 : A, E 사용

B2, C3 : B, C 사용

A0과 B2 접근은 병렬 가능 , B2와 C3은 병렬 불가

1 + 1 = 2cycles

2. (10pt)

Mark true/false for each sentence.

(a) (T/F) Flash memory is slower than DRAM but faster than HDD.

(b) (T/F) NOR flash memory provides higher density than NAND flash memory.

(d) (T/F) SSD is cheaper but more robust than HDD.

(e) (T/F) In flash memory, there are three operations of read, write, and erase.

3. (10pt)

Assume HDD:

sector size = 0.5 KB
average seek time = 4 ms
15,000 rpm
transfer time = 100 MB/s
controller overhead = 0.2 ms

Compute access time.

60/15000 = 0.004 = 4ms

avg rotate time = 2ms

4 + 2 + 0.2 = 6.2ms

4. (10pt)

The USB has 4 lines.

Draw the USB connector lines and explain why the lengths of USB connector lines are different.

sql

    ┌─────────────────┐
    │  VCC  (Power)   │ ←── Longest pin
    ├─────────────────┤
    │  D-   (Data-)   │ ←── Medium length
    ├─────────────────┤
    │  D+   (Data+)   │ ←── Medium length
    ├─────────────────┤
    │  GND  (Ground)  │ ←── Longest pin
    └─────────────────┘

Outer 2 lines are power lines (VCC and GND), and inner 2 lines are data lines (D+ and D-). The different lengths ensure that power lines connect first and disconnect last, allowing time for power to stabilize before data communication begins. This prevents damage from voltage spikes and enables safe hot-plugging.

5. (10pt)

Draw NOR gate using two switches,

and explain the behavior at circuit level.

sql

        VDD (+5V)
         |
    +----|----+
    |         |
   SW_A     SW_B  (parallel connection)
    |         |
    +----+----+
         |
       Output
         |
        GND

When input = 0, Switch is closed

When input = 1, Switch is open

if both switch A and B are open, output is 1. else output is 0.

6. (10pt)

Explain ROM, PROM, EPROM, EEPROM, and Flash memory,

and the differences between them.

ROM (Read-Only memory) is programmed during manufacturing. Its data is permanent and cannot be modified.

PROM (Programmable ROM) can be programmed once by user using a PROM programmer.

EPROM (Erasable Programmable ROM) can be erased by exposing to UV light and reprogrammable multiple times.

EEPROM (Electrically Erasable Programmable ROM) can be erased and reprogrammed electrically, and provides byte-level erase and write operations.

Flash memory is Evolution of EEPROM technology, which provides Block-level erase.

7. (10pt)

Assume:

Processor 1: single core, 2GHz
Processor 2: two cores, 1GHz
Power consumption = 1W at 2GHz, 1V
Clock frequency is proportional to voltage
Two processes, each needs 10s

Which processor is more power-efficient?

Explain with reasoning involving voltage, current, power, time, energy.

P1 :

P2 :

Energy = Power * Time

P2 is 4 times more power-efficient.

8. (10pt)

Compare UMA architecture and NUMA architecture.

What is the difference?

UMA provides equal memory access time for all processors using shared memory and a common bus, while NUMA provides variable access time using distributed memory, offering better scalability at the cost of programming complexity.

9. (10pt)

Compare process and thread.

What is the difference?

Processes are independent execution units with separate memory spaces, While Threads are lightweight execution units that share memory within a process.

10. (10pt)

Explain the SIMD instruction.

SMID is a parallel processing technique where a single instruction simultaneously operates on multiple data elements.

11. (10pt)

Assume that the following code is running on NVIDIA Tesla architecture:

if(a<0)      b = c+d;   // path1
else if(a==0) b = c*d;  // path2
else          b = c/d;  // path3

WARP (32 threads):

10 threads → path1 (3 cycles)
10 threads → path2 (5 cycles)
12 threads → path3 (20 cycles)

Compute total cycles to run all 32 threads.

Note: Tesla SM has 8 SPs.

28 cycles

12. (20pt)

For the snooping protocol (states: invalid, exclusive, shared),

specify cache states in P1, P2, P3 for each step.

Initial state: all invalid.

Also specify the directory state for directory-based protocol.

Step	P1 Cache	P2 Cache	P3 Cache	Directory (A1)
P1: read A1	S	I	I	P1
P1: write 1 to A1	E	I	I	P1
P1: read A1	Exclusive	inv	inv	P1
P2: write 2 to A1	inv	Exclusive	inv	P2
P3: read A1	inv	Shared	Shared	P2, P3
P2: write 3 to A1	inv	Exclusive	inv	P2
P2: write 0 to A1	inv	Exclusive	inv	P2
P1: read A1	Shared	Shared	inv	P1, P2
P1: write 1 to A1	Exclusive	inv	inv	P1

Read :

Write :

13. (10pt)

The following code shows lock implementation using ll (load linked) and sc (store conditional).

(a) Fill the blanks.

try: mov  $5, #1
     ll   $2,0($4)      ; load linked
     sc   $5,0($4)      ; store conditional
     beq  $5, $0, try   ; branch store fails ($5 = 0)
     bne  $2, $0, try   ; already locked?

(b) Fetch-and-increment with ll/sc — fill blanks.

try: ll  $2,0($1)       ; load linked
	   addi $2, $2, #1
     sc  $2, 0($1)       ; store conditional
     beq $2,$0,try       ; branch store fails ($2 = 0)

▶

5.

▶

5.1. Intro

Memory hierarchy

the principle of locality
- Upper level : 더 빠름 → 비쌈
- Lower level : 더 느림 → 덜 비쌈

Locality

Temporal locality
- Tendency to reuse recently(time)
  - 데이터가 재사용 됨
- ex) loop structure
Spatial locality
- Tendency to reference data items that are close to other recently accessed items(space)
  - 근처의 데이터가 자주 사용됨
- ex) array, struct

Management
- Register : compiler
- Cache : Hardware
- Main memory : OS
- Disk memory : OS

Memory Technology

SRAM : Static Random Access Memory (cache)
DRAM : Dynamic Random Access Memory (main memory)
Magnetic Disk : Flash Memory (storage) : SSD(solid state drive)

Block

Minimum unit of information transfer between the two levels

Hit and Miss

hit : data requested by processor appears in some block in the upper level
miss : data is not found in the upper level
hit rate : fraction of memory accesses found in the upper level
miss rate : 1 - hit rate

Hit time
- time to access the upper level of the memory hierarchy
- including determine time whether hit or miss
Miss penalty
- additional time to handle the miss
- + time to deliver this block to the processor
Average memory access time (AMAT)
- hit time * hit rate + (hit time + miss penalty) * miss rate
- = hit time + miss rate * miss penalty

ex)

hit time = 10ns

miss penalty = 100ns

miss rate = 10%

AMAT = 10 + 100 * 10% = 20ns

Big Picture

Multi Level inclusion property
- Level( i ) ⊂ Level( i+1 )

▶

5.2. The Basics of Caches

Simple Cache Scenario

2 Questions
1. How do we know if a data item is in a cache?
2. If it is, how do we find it?

Direct Mapped Cache

Cache address ( Index )
- = ( Block Addr ) mod ( Number of blocks in the cache )
Each memory location is mapped to exactle one cache location

Fields on a Cache line

Data
Tag
- information to identify original location
- 메모리 주소의 인덱스를 제외한 상위 비트
- “신원 확인”
Valid bit
- 1 : Valid
- 2 : Invalid
- Valid bit = 0 이면 Tag가 일치해도 Miss 처리
- “

ex)

Tag : 상위 2비트

Index : 하위 3비트

22 접근
10110
index = 110 = 6
→ Miss (비어있음)
→ 6번 블록에 M[22] 저장
26 접근
11010
→ Miss
→ 2번 블록에 M[26] 저장
22 접근
10110
→ Hit(6번 블록)
→Temporal locality 확인(재접근)
26 접근 (재접근)

16 접근
10000
→ Miss
→ 0번 블록에 M[16] 저장

7. …

18 접근
10010
index = 2
→ Miss(2번 블록에는 M[26]이 있었음)
→ 블록 2의 M[26]을 M[18]로 교체

포인트

Cache with 4-Byte Block

블록이 한 단어인 경우

캐시 사양:

32-bit 주소
2ⁿ개의 블록 (각 블록은 1 word = 4 bytes)
Block size: 1 word = 4 bytes = 32 bits

주소 필드 분해 (32 bits):

1. Byte Offset (2 bits)

최하위 2 bits
1 word = 4 bytes → 4개 바이트 중 어느 것인지 선택
캐시에서는 사용 안 함 (word 단위로 접근)
2² = 4 bytes 구분 가능

2. Index (n bits)

Byte offset 다음 n bits
어느 캐시 블록에 저장할지 결정
2ⁿ개 블록 중 하나 선택

3. Tag (32 - n - 2 bits)

상위 나머지 bits
캐시 블록이 어느 메모리 블록인지 식별
Hit/Miss 판단에 사용

32-bit 메모리 주소 ┌─────────────┬─────────────┬──────────────┐ │ Tag │ Index │ Byte Offset │ │ (상위 bits) │ (중간 bits) │ (하위 2bits) │ └─────────────┴─────────────┴──────────────┘ ↓ ↓ ↓ "누구인지" "어디에 저장" "몇 번째 바이트" 식별용 위치 선택용 (word 내부)

총 비트 수:

= 블록 수 × 블록당 비트 = 2ⁿ × (1 + (30-n) + 32) = 2ⁿ × (63 - n)

1. 전체 캐시 구조 (n=3, 8개 블록 예시)

캐시 메모리 (SRAM 칩 내부) ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Index 0 │ V:1 │ Tag:10 │ Data: [M[16]] ┃ ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Index 1 │ V:0 │ Tag:-- │ Data: [------] ┃ ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Index 2 │ V:1 │ Tag:11 │ Data: [M[26]] ┃ ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Index 3 │ V:1 │ Tag:00 │ Data: [M[3]] ┃ ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Index 4 │ V:0 │ Tag:-- │ Data: [------] ┃ ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Index 5 │ V:0 │ Tag:-- │ Data: [------] ┃ ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Index 6 │ V:1 │ Tag:10 │ Data: [M[22]] ┃ ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Index 7 │ V:0 │ Tag:-- │ Data: [------] ┃ ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

Cache with 2^m - Word Block

블록이 여러 word로 구성된 경우

캐시 사양:

32-bit 주소
2^n개의 블록
각 블록 = 2^m words (1 word = 4 bytes = 32 bits)

주소 필드 분해 (32 bits):

┌──────────┬────────┬──────────────┬─────────────┐ │ Tag │ Index │ Block Offset │Byte Offset │ │(상위bits) │(n bits)│ (m bits) │ (2 bits) │ └──────────┴────────┴──────────────┴─────────────┘ 32-n-m-2 n bits m bits 2 bits

1. Byte Offset (2 bits)

1 word = 4 bytes 중 어느 바이트인지
캐시 접근 시 사용 안 함

2. Block Offset (m bits)

블록 내에서 어느 word인지 선택
2^m개 word 중 하나 선택

3. Index (n bits)

어느 캐시 블록인지 선택 ( 2^n = 블록 개수)
2^n개 블록 중 하나

4. Tag (32 - n - m - 2 bits)

메모리 블록 식별

Block Offset의 역할 예시:

4-word 블록 (m=2)인 경우:

캐시 블록 하나: ┌────┬─────┬────────────────────────────────────────┐ │ V │ Tag │ Data Block │ ├────┼─────┼──────────┬──────────┬──────────┬───────┤ │ 1 │ 10 │ Word 0 │ Word 1 │ Word 2 │Word 3 │ │ │ │ (32bits) │ (32bits) │ (32bits) │(32b) │ └────┴─────┴──────────┴──────────┴──────────┴───────┘ ↑ ↑ ↑ ↑ Offset=00 Offset=01 Offset=10 Offset=11 Block Offset로 선택!

총 비트 수 계산:

각 캐시 블록당:

Valid bit: 1 bit
Tag: (32 - n - m - 2) bits
Data: 2^m words × 32 bits = 2^m × 32 bits

전체 캐시 총 비트:

= 블록 수 × 블록당 비트 = 2^n × (1 + (32-n-m-2) + 2^m×32) = 2^n × (31 - n - m + 2^m×32)

ex) Calculate total number of bits for the cache.

ex2) What block number does byte address 1200 map to in a cache with 64 blocks and 16 bytes/block?

ans :

block number = 1200 / 16 = 75

Cache block number = 75 mod 64 = 11

Miss rate vs. Block size

블록이 클수록 (일반적으로)

블록이 너무 클 경우

캐시 크기는 고정되어 있으므로:

캐시 크기 = 블록 수 × 블록 크기 → 블록 크기 ↑ → 블록 수 ↓

miss Rate 증가
블록 수가 적으면: - 더 많은 메모리 블록들이 같은 캐시 블록에 매핑 - Conflict miss 증가 - 자주 사용하는 데이터도 계속 교체됨 예: 64개 블록만 있으면 메모리 블록 0, 64, 128, 192... → 모두 캐시 블록 0에 매핑 → 서로 계속 쫓아내기! (Thrashing)

miss penalty 증가
Miss Penalty = Block fetch time + Cache load time = (First word latency + Transfer time) + Cache load time
Transfer time이 Block size에 비례, 따라서 블록이 커지면 miss penalty 증가
miss penalty 줄이는 법

Handling Writes

1. Write-through policy

항상 캐시와 메인 메모리 둘 다에 동시에 쓰기

Write Buffer (쓰기 버퍼) = 캐시와 메모리 사이의 임시 큐 Write-Through의 성능 문제를 해결하는 방법

2. Write-Back policy

캐시에만 쓰고, 메모리는 나중에 업데이트

Split Caches (분리형 캐시):

캐시 사양:

성능:

각각의 장점:

Combined Cache → Better Hit Rate

왜?

더 유연한 공간 활용
명령어/데이터 비율에 따라 자동 조절

예시: 프로그램 A: 명령어 많음 → 명령어에 더 많은 공간 할당 프로그램 B: 데이터 많음 → 데이터에 더 많은 공간 할당 → 적응적 공간 활용 → Hit rate 향상

Split Cache → Larger Bandwidth

왜?

동시 접근 가능!

파이프라인에서: ┌────────────┐ │ IF 단계 │ → I-Cache 접근 └────────────┘ ┌────────────┐ │ MEM 단계 │ → D-Cache 접근 (동시!) └────────────┘ → 대역폭 2배!

→ L1 캐시에 사용됨

Designing the Memory System to Support Caches

메모리와 캐시 사이의 대역폭을 증가시켜 Miss Penalty를 줄이는 방법들입니다.

목표:

메모리→캐시 대역폭 증가
Miss penalty 감소
큰 블록을 낮은 penalty로 전송

문제 상황:

프로세서와 메모리를 연결하는 버스
버스가 프로세서보다 최대 10배 느림

가정:

1 memory bus clock cycle
15 clock cycles for each DRAM access initiated
1 clock cycle to send a word of data
Cache block of 4 words

One-word-wide Memory

가장 단순한 메모리 구조

구조:

┌──────────┐ │Processor │ └─────┬────┘ │ 1-word bus (32 bits) ▼ ┌──────────┐ │ DRAM │ 1-word wide └──────────┘

특징: 한 번에 1 word씩만 전송

동작 방식:

4-word 블록 전송 (순차적):

Miss Penalty 계산:

Miss Penalty = 주소 전송 + 4개 word 접근 + 4개 word 전송 = 1 cycle (주소) + 4 × 15 cycles (DRAM 접근 4번) + 4 × 1 cycle (데이터 전송 4번) = 1 + 60 + 4 = 65 cycles

대역폭 계산:

전송된 데이터 = 4 words × 4 bytes = 16 bytes 소요 시간 = 65 cycles Bandwidth = 16 bytes / 65 cycles = 0.246 bytes/cycle ≈ 0.25 bytes/cycle

Wide Memory

메모리와 버스를 넓혀서 대역폭을 증가시키는 방법입니다.

Wide Memory (4-word wide):

┌──────────┐ │Processor │ └─────┬────┘ │ 4-word bus (128 bits) ← 버스 4배! ▼ ┌────┬────┬────┬────┐ │DRAM│DRAM│DRAM│DRAM│ 4-word ← 메모리 4배! └────┴────┴────┴────┘

동작 방식:

블록의 모든 word를 병렬로 동시 접근!

장점 :

병렬 전송 → Miss penalty 감소
대역폭 증가

단점(trade-offs) :

1. Wider Bus (넓은 버스)

2. Multiplexor & Control Logic

Miss Penalty 계산 :

대역폭 계산 :

요약 :

메모리 폭 ↑ → Miss Penalty ↓, Bandwidth ↑

Interleaved Memory

메모리만 넓히고 버스와 캐시는 그대로 유지하는 방법입니다.

핵심 아이디어:

메모리를 여러 뱅크(bank)로 나누어 병렬 접근┌──────────┐ │Processor │ └─────┬────┘ │ 1-word bus (그대로!) ▼ ┌────┬────┬────┬────┐ │Bank│Bank│Bank│Bank│ ← 여러 뱅크 │ 0 │ 1 │ 2 │ 3 │ └────┴────┴────┴────┘

4-word 블록 전송 과정:

1. 주소 전송: 1 cycle 2. 4개 뱅크 동시 DRAM 접근 시작 → 15 cycles (한 번만!) 3. 데이터 순차 전송 (버스가 1-word라서) Word 0: 1 cycle Word 1: 1 cycle Word 2: 1 cycle Word 3: 1 cycle → 총 4 cycles Miss Penalty = 1 + 1×15 + 4×1 = 1 + 15 + 4 = 20 cycles

▶

5.3. Measuring and Improving Cache Performance

캐시 성능 개선 기법 (2가지):

1. Using Associativity (연관성 증가)

Direct-mapped → Set-associative → Fully associative
Miss rate 감소

2. Multilevel Caching (다단계 캐시)

L1, L2, L3 캐시
Miss penalty 감소

CPU Time 공식 (캐시 포함):

CPU Time = (CPU execution cycles + Memory-stall cycles) × Clock cycle time

CPU execution cycles:

실제 명령어 실행 시간
Cache hit time 포함
Hit은 stall이 아니라 정상 실행의 일부

Memory-stall cycles:

메모리 대기 시간
Cache miss가 주요 원인

Memory-Stall 구성:

Memory-stall cycles = Read-stall + Write-stall

1. Read-Stall Cycles

공식:

Read-stall cycles = (Reads/Program) × Read miss rate × Read miss penalty

의미:

Reads/Program: 프로그램당 load 명령어 수
Read miss rate: Load miss 비율
Read miss penalty: Load miss 시 대기 시간

예시:

프로그램: 1000개 명령어 Load 명령어: 300개 (30%) Read miss rate: 5% Read miss penalty: 100 cycles Read-stall = 300 × 0.05 × 100 = 1,500 cycles

2. Write-Stall Cycles

공식:

Write-stall cycles = (Writes/Program) × Write miss rate × Write miss penalty + Write buffer stalls

의미:

Writes/Program: 프로그램당 store 명령어 수
Write miss rate: Store miss 비율
Write miss penalty: Store miss 시 대기 시간
Write buffer stalls: Write buffer 가득 참

예시:

프로그램: 1000개 명령어 Store 명령어: 100개 (10%) Write miss rate: 3% Write miss penalty: 100 cycles Write buffer stalls: 50 cycles Write-stall = 100 × 0.03 × 100 + 50 = 300 + 50 = 350 cycles

ex) Calculating Cache Performance

Instruction cache miss rate = 2%
Data cache miss rate = 4%
CPI = 2 (메모리 스톨 없을 때)
Miss penalty = 100 cycles
SPECint2000 기준 instruction frequencies 사용 = 36%

ans :

instruction count = i

instruction miss cycles = i * 2%* 100 = 2i

data miss cycles = i * 36% * 4% * 100 = 1.44 * i

total stall cycles = 3.44 * i

CPI with memory stalls = 2 + 3.44 = 5.44

Increased Importance of Cache Performance

프로세서 성능이 향상됨에 따라 캐시 성능의 중요성이 증가하는 이유를 설명합니다.

1. 캐시 패널티 증가 (Increased Cache Penalties)

더 빠른 프로세서의 영향:

낮은 CPI의 영향:

Base CPI가 낮아지면 → 메모리 스톨 사이클의 상대적 영향이 커짐
예: CPI 5→2로 감소 시, 3 cycle stall의 영향이 60%→150%로 증가

높은 클럭 속도의 영향:

클럭이 빨라져도 메인 메모리 접근 시간은 비례해서 빠르지 않음
Miss penalty는 프로세서 클럭 사이클 단위로 측정
예: 메모리 접근 100ns 고정
- 1GHz CPU: 100 cycles
- 2GHz CPU: 200 cycles ← 패널티 2배!

2. 히트 타임 문제 (Hit Time Issues)

느린 히트 타임의 영향:

캐시 히트에도 시간이 오래 걸리면 → 전체 프로세서 사이클 타임 증가
모든 명령어 실행이 느려짐

따라서 캐시를 최적화 해야 성능이 향상됨

3가지 배치 방식 (Placement Schemes)

1. Direct Mapping

장점 : 간단함, 빠름, 저렴함
단점 : 충돌 miss가 많음

2. Set-associative Mapping

Direct와 Fully 사이 어딘가
세트 내 비교가 필요함

3. Fully associative Mapping

장점 : 가장 유연하고 충돌 miss가 최소화됨
단점 : 매우 비싸고 복잡함. 또한 CAM이 필요함

Associative Memory ( = CAM)

일반 RAM 과의 핵심 차이점

일반 메모리 (RAM)

주소로 접근: 주소 입력 → 데이터 출력
특정 위치를 알아야 데이터를 가져올 수 있음

Associative Memory (CAM)

내용으로 접근: 데이터 입력 → 위치(주소) 출력
찾고 싶은 내용을 주면 그게 어디 있는지 알려줌

기본 동작

1. 병렬 비교

A (Argument Register): 찾고자 하는 값 저장
CAM의 모든 워드를 A와 동시에 비교
일치하면 M(i) = 1 설정

2. 순차 읽기

M(i) = 1인 항목들을 순차적으로 접근
일치하는 모든 데이터 읽기

K (Key Register) - 마스킹

역할

특정 필드나 비트만 선택적으로 비교

동작 원리

K = 1인 비트: 비교 대상
K = 0인 비트: 무시 (don't care)

효과

전체 워드가 아닌 일부 필드만 매칭 가능
태그, 인덱스 등 특정 영역만 비교
더 유연한 검색 가능

n-Way Set Associative Cache

기본 구조

배치 규칙

각 블록은 n개 위치 중 하나에 배치 가능
Set size = n blocks (각 세트는 n개 블록 보유)
Number of sets = cache size / n

2단계 매핑 방식

1단계: Set 선택 (Direct Mapping)

메모리 블록은 index 필드로 특정 세트에 매핑
각 블록은 고유한 하나의 세트로만 매핑됨
Direct mapping과 동일한 방식

2단계: Set 내 배치 (Associative)

선택된 세트 내에서 n개 way 중 아무 곳에나 배치 가능
검색 시 세트 내 모든 블록을 병렬로 비교
세트 내에서는 fully associative

▶

5.4. Virtual Memory

가상 메모리란?

메인 메모리를 보조 저장장치(디스크)의 "캐시"로 사용하는 기술

계층 구조:

Disk (매우 큼, 느림)
Main Memory (중간, DRAM) ← 캐시 역할
Cache (작음, 빠름, SRAM)
CPU

2가지 주요 동기

1. 프로그래밍 부담 제거

문제: 메인 메모리 크기 제한, 프로그래머가 직접 관리

해결: 가상 메모리로 큰 주소 공간 제공, 자동 관리

2. 여러 프로그램 간 효율적 공유

문제: 동시 실행 시 메모리 충돌, 보안 문제

해결: 각 프로그램에 독립적 가상 주소 공간

Address Space

Virtual Address Space

Virtual Address = Logical Address
프로그램이 생성하는 주소
프로세서(프로그램, 사용자)가 사용

Physical Address Space

Physical Address
실제 메인 메모리 주소
메모리 접근 시 사용

Mapping

Virtual Address → Physical Address 변환

Page와 Page Frame

Page (페이지)

Virtual memory의 블록
가상 메모리를 고정 크기로 나눈 단위

Page Frame

Physical memory의 블록
물리 메모리를 고정 크기로 나눈 단위
Page와 같은 크기

관계

Page (가상) ↔ Page Frame (물리)
매핑을 통해 연결
하나의 Page는 하나의 Page Frame에 매핑

Page Faults

Page Fault란?

발생 조건:

가상 페이지의 Valid bit = 0 (off)

요청한 페이지가 메모리에 없음

결과:

Exception 발생

OS가 제어권 획득

Page Fault 처리 과정 (6단계)

0. Reference

1. Page Fault 발생

2. Trap to OS

3. Page is on Backing Store

4. Bring in Missing Page

5. Reset Page Table

6. Restart Instruction

중요한 하드웨어 요구사항

프로세서는 Page fault 후 명령어 재시작 가능해야 함

이유: 명령어 실행 중 fault 발생 가능

처리 후 명령어를 처음부터 다시 실행

ex)

조건 :

Memory access time: 100 ns

Disk access time: 25 ms = 25,000,000 ns

p = page fault 발생 확률 결과:

Effective access time = 100(1-p) + 25000000p

성능 저하 목표가 10%라면

110 > 100 + 25000000p

p < 0.0000004

결론 : Page fault는 재앙적으로 비쌈

OS가 페이지 교체를 매우 잘 해야함

Replacement Algorithms

LRU (Least Recently Used)

가장 오래 사용되지 않은 페이지 교체

이론적으로 가장 좋은 성능

하지만 정확한 구현은 매우 비쌈

LRU Approximation (LRU 근사)

Use Bit (Reference Bit) 사용

각 페이지마다 Use Bit 1비트 추가

동작:

페이지 접근 시: Use bit = 1로 설정

주기적으로: 모든 Use bit = 0으로 리셋

의미:

Use bit = 1: 최근에 사용됨

Use bit = 0: 최근에 사용 안 됨

Translation-Lookaside Buffer(TLB)

가상 메모리의 문제

Double Memory Accesses (2번의 메모리 접근):

문제:

TLB란?

Translation-Lookaside Buffer

목적:

동작 원리 : Locality of Reference 활용:

TLB Miss Penalty

약 13 cycles

조건: 코드와 page table entry가 캐시에 있을 때

구성:

Exception 처리
Page table 접근
TLB 업데이트
복귀

비교적 저렴함

True Page Fault

Page table에도 없는 경우:

Valid bit = 0

Page fault 발생!

디스크에서 페이지 로드 필요

처리 시간:

수백만 cycles

Possible combinations

핵심 원칙

Inclusion property (포함 관계):

Disk ⊃ Memory ⊃ Cache

▶

5.5. A Common Framework for Memory Hierarchies

n-Way Set Associative Cache 정리

기본 개념

각 블록은 n개 위치 중 하나에 배치 가능

2단계 매핑: Set 선택 → Set 내 배치

3Cs

Compulsory Misses
- 블록에 처음으로 접근
- 피할 수 없음
해결 방법 :
Prefetching(미리 가져오기)
Capacity Misses
- 캐시 크기 < working set크기
- 블록이 교체됨
해결 방법 :
캐시 크기 증가
교체 알고리즘을 효율적으로 사용(LRU)
Conflict Misses
- 너무 많은 블록이 같은 세트에 매핑됨
- 세트에 빈 공간이 없음
해결 방법 :
associativity 증가

▶

6.

▶

6. Storage

I/O System Characteristics

Dependability (신뢰성)

특히 저장 장치에서 중요

데이터 손실 방지
시스템 가용성 유지
고장 허용 능력

성능 측정 지표

Latency (Response Time, 지연시간)

요청부터 완료까지 걸리는 시간
사용자 체감 속도

Throughput (Bandwidth, 처리량)

단위 시간당 처리할 수 있는 데이터 양
전체 시스템 효율성

Fault vs Failure

Failure (고장)

Deviation from specified service
사양에서 벗어난 동작
서비스 중단 발생

Fault (결함)

중요한 점

Fault가 항상 시스템 Failure로 이어지는 것은 아님

Fault는 불가피함 (컴포넌트는 고장남)
목표: Fault가 System Failure로 이어지지 않게
방법: 중복성, 복구 메커니즘, Fault tolerance

Dependability Measures

MTTF (Mean Time To Failure)

평균 고장 시간

시스템이 정상 작동하는 평균 시간
고장까지 걸리는 시간

MTTR (Mean Time To Repair)

평균 수리 시간

고장 발생 후 복구까지 걸리는 시간
서비스 중단 시간

MTBF (Mean Time Between Failures)

평균 고장 간격

MTBF = MTTF + MTTR

Availability (가용성)

정의

Availability = MTTF / (MTTF + MTTR)

가용성 향상 방법

MTTF 증가: Fault avoidance, tolerance, forecasting
MTTR 감소: 진단 및 수리 도구/프로세스 개선

Disk Storage

섹터(Sector) 구조 섹터: 디스크의 기본 읽기/쓰기 단위

ECC (Error Correcting Code)

계산 예시 문제

조건

512B 섹터
15,000 rpm
평균 seek time 4ms
전송률 100MB/s
컨트롤러 0.2ms
유휴 디스크 (대기 없음)

Queuing delay : 0ms

Seek time : 4ms (평균)

Rotational latency : (1/2) / (15000 / 60) = 2ms

Transfer time : 512B / 100MB/s = 0.005ms (생략)

Controller overhead : 0.2ms

→ Total = 6.2ms

Flash Storage

속도 : HDD에 비해 100에서 1000배 빠름

크기 : 더 작음

전력 : 저전력

Flash Types

1. NOR Flash

구조

agda

Bit cell이 NOR gate처럼 구성
→ 각 비트에 개별 접근 가능

특징 : Random Read/Write Access

agda

메모리처럼 동작:
  주소 0x1000 읽기 → 즉시 접근
  주소 0x2000 읽기 → 즉시 접근
  
바이트 단위 읽기/쓰기 가능

2. NAND Flash

구조

agda

Bit cell이 NAND gate처럼 구성
→ 블록 단위로만 접근 가능

특징 : Block-at-a time Access

agda

블록 단위 동작:
  읽기: 최소 페이지 단위 (4KB, 8KB)
  쓰기: 페이지 단위
  삭제: 블록 단위 (128KB, 256KB)
  
개별 바이트 접근 불가 ✗

NAND 가 NOR에 비해 가격이 저렴하여 저장장치에는 NAND 플래시를 사용한다.

Interconnecting Components

Bus

정의: Shared Communication Channel (공유 통신 채널)

타입 :

계층 구조

agda

┌──────────────────────────────────┐
│           CPU                    │
└───────────┬──────────────────────┘
            │ 고속
            │ Processor-Memory Bus
            │
      ┌─────┴─────┐
      │  Memory   │
      └───────────┘
            │
            │
      ┌─────┴─────┐
      │  Bridge   │  ← 속도/프로토콜 변환
      └─────┬─────┘
            │
            │ I/O Bus (PCI, PCIe)
    ════════┼════════════════
      ║     ║      ║     ║
    [HDD] [GPU] [USB] [NIC]

1. Data Lines (데이터 선)

역할

주소와 데이터 전송

두 가지 방식

Multiplexed (공유)

같은 선으로 주소와 데이터를 시간 나눠 전송
장점: 선 개수 절약, 비용 감소
단점: 느림, 복잡

Separate (분리)

주소선과 데이터선을 별도로 구성
장점: 빠름, 동시 전송 가능
단점: 선 많음, 비용 증가

2. Control Lines (제어 선)

역할

Data Type 표시

Read/Write, Memory/IO, 바이트 크기 등

Transaction 동기화

송신자와 수신자의 타이밍 조율

동기화 방식

Synchronous (동기식)

Bus Clock 사용

모든 장치가 같은 클럭 신호 공유
클럭 엣지에 맞춰 데이터 샘플링

장단점

간단, 빠름, 예측 가능
모든 장치가 같은 속도, 느린 장치가 전체 제한

사용

DDR 메모리 버스, 고속 버스

Asynchronous (비동기식)

Request/Acknowledge Handshaking

클럭 없이 REQ/ACK 신호로 동기화
송신자가 REQ 올림 → 수신자가 ACK 응답

장단점

각 장치가 자기 속도, 유연함
복잡한 로직, 핸드셰이킹 오버헤드

사용

SCSI, 긴 거리 버스, 다양한 속도의 장치

I/O Management

OS의 역할

I/O is mediated by the OS - OS가 I/O를 중재

이유

여러 프로그램이 I/O 자원 공유
보호(protection)와 스케줄링 필요

예시

프로그램 A와 B가 동시에 디스크 접근 요청
OS가 순서 조정 및 충돌 방지

I/O 특성

Asynchronous Interrupts (비동기 인터럽트)

I/O는 비동기적

I/O 완료 시점을 정확히 예측 불가
완료되면 인터럽트로 CPU에 알림

I/O programming is fiddly - 직접 I/O 프로그래밍은 복잡하고 까다로움

문제점

장치마다 다른 인터페이스
타이밍 제어 복잡
오류 처리 어려움

OS provides abstractions - OS가 추상화 제공

I/O Commands

I/O Controller Hardware

I/O 장치는 I/O 컨트롤러 하드웨어가 관리

역할

장치와 데이터 전송
소프트웨어와 동작 동기화

컨트롤러의 레지스터

Command Registers (명령 레지스터)

장치에게 작업 지시
예: 읽기 시작, 쓰기 시작, 리셋

Status Registers (상태 레지스터)

장치가 무엇을 하고 있는지 표시
오류 발생 여부 표시
예: busy, ready, error

Data Registers (데이터 레지스터)

Write: 장치로 데이터 전송
Read: 장치에서 데이터 수신

I/O Register Mapping

두 가지 매핑 방식

Memory Mapped I/O (메모리 맵 I/O)

동일한 주소 공간 사용

I/O 레지스터가 메모리 주소 공간에 위치
일반 메모리 접근 명령어 사용

Address Decoder로 구분

하드웨어가 메모리와 I/O 구분
특정 주소 범위는 I/O 장치로 라우팅

별도 명령어 사용

I/O 레지스터 접근을 위한 전용 명령어
메모리 주소 공간과 분리된 I/O 주소 공간

커널 모드에서만 실행

특권 명령어(privileged instruction)
사용자 모드에서 실행 시 예외 발생

Polling

Periodically check I/O status register - 주기적으로 I/O 상태 레지스터 확인

프로세스

Device ready → 작업 수행
Error 발생 → 에러 처리

적합한 경우

소형 임베디드 시스템
저성능 실시간 시스템

이유

Predictable timing (예측 가능한 타이밍)
Low hardware cost (낮은 하드웨어 비용)

단점 : CPU 시간 낭비

Interrupts

기본 동작

When a device is ready or error occurs - 장치가 준비되거나 오류 발생 시

Controller interrupts CPU

컨트롤러가 CPU에 인터럽트 신호
CPU가 현재 작업 중단하고 처리

동작

CPU는 각 명령어 완료 후 인터럽트 확인
인터럽트 있으면 핸들러 실행

우선순위 할당

긴급한 장치 → 높은 우선순위
덜 긴급한 장치 → 낮은 우선순위

I/O Data Transfer

Polling and Interrupt-Driven I/O

동작 방식

CPU가 직접 I/O 레지스터와 메모리 간 데이터 이동
각 바이트/워드마다 CPU 개입 필요

문제점

고속 장치에서는 시간 소모 큼
CPU가 데이터 전송에 계속 묶임

Direct Memory Access (DMA)

설정

OS가 메모리 시작 주소, 크기, 방향(읽기/쓰기) 설정
DMA 컨트롤러에 전달

자율적 전송

CPU 개입 없이 컨트롤러가 직접 메모리 접근
데이터 전송 자동 수행

동작 순서

OS가 DMA 컨트롤러 설정 (주소, 크기)
CPU는 다른 작업 수행
DMA가 메모리 ↔ 장치 전송
완료되면 인터럽트

DMA/Cache Interaction

문제 상황

If DMA writes to a memory block that is cached

문제:

해결 방법

DMA/VM Interaction

문제 상황

OS uses virtual addresses for memory

OS는 가상 주소 사용

프로그램은 연속된 가상 주소 공간 보유
실제 물리 메모리는 흩어져 있을 수 있음

해결 방법

Should DMA use virtual addresses?

가상 주소 사용 시

Would require controller to do translation

DMA 컨트롤러가 주소 변환 필요
IOMMU (I/O Memory Management Unit) 필요
복잡하고 비용 증가

물리 주소 사용 시 - 3가지 방법

설계 트레이드오프

I/O system design can trade-off between response time and throughput

Response Time vs Throughput

Response time 최적화 → throughput 감소 가능
Throughput 최적화 → response time 증가 가능

Amdahl’s Law

Amdahl's Law 적용

Don't neglect I/O performance as parallelism increases compute performance

병렬화로 CPU 성능 향상 시 I/O 성능 간과하지 말 것

▶

6. RAID

RAID

Redundant Array of Inexpensive (Independent) Disks

여러 개의 작은 디스크들을 사용하여 성능과 안정성을 향상시키는 스토리지 기술

기본 원리

RAID 0

특징

RAID 1 & 2

RAID 1 : Mirroring

구조

동작 방식

RAID 2 : Error Correcting Code

구조

특징

RAID 3

RAID 3 : Bit-Interleaved Parity

구조

동작 방식

특징

RAID 4

RAID 4 : Block-Interleavad Parity

구조

N + 1 디스크 구성
데이터 스트라이핑 (RAID 3과 같음)
전용 패리티 디스크에 블록 그룹별 패리티 저장

RAID 3과의 차이점

RAID 3 (바이트 레벨): 파일 "HELLO" → H는 디스크1, E는 디스크2, L은 디스크3... → 한 글자 읽으려면 모든 디스크 접근 필요
RAID 4 (블록 레벨): 파일 "HELLO" → 전체가 디스크1의 한 블록에 저장 → 해당 디스크만 접근하면 됨

특징

읽기 성능이 RAID 3에 비해 개선됨
쓰기는 여전히 패리티 디스크가 병목임

RAID 5

RAID 5 : Distributed Parity

구조

N + 1 디스크
RAID 4와 동일하게 블록 레벨 스트라이핑

RAID 4 와의 차이점

패리티를 모든 디스크에 분산 저장

특징

장점
- 패리티 병목 해결
- 좋은 읽기/쓰기 성능 (병렬 처리)
- 저장 효율 : N / N+1
단점
- 복구 중 성능 저하
- 여러 디스크 고장 시 복구 불가능
가장 널리 사용되는 RAID 구조

RAID 6

RAID 6 : P + Q redundancy

구조

N + 2 디스크
RAID 5처럼 패리티 분산 저장
RAID 5와의 차이 : 두 종류의 패리티 (P와 Q)
- P : 일반 xor 패리티
- Q : 다른 알고리즘 사용

특징

장점
- 2개 디스크 고장까지 복구 가능
단점
- 쓰기 성능 저하
- 저장 효율 감소 (N / N + 2)

Multiple RAID

RAID를 조합해서 사용하는 구조

RAID 정리

RAID의 장점

성능(Performance)과 가용성(Availability) 향상

성능 향상: 스트라이핑으로 병렬 I/O 가용성 향상: 중복 저장으로 고장 시에도 서비스 유지

고가용성(High Availability)의 조건

Hot Swapping 필요

▶

6. I/O System Design…

I/O 시스템 설계 시 두 가지 핵심 목표가 있다.

Latency 요구사항 충족
- 시간이 중요한 연산을 위해
Throughput 최대화
- 병목 찾기

시스템에 부하가 있을 경우 : 단순 분석으로는 부족함

→ 필요한 분석 방법 :

Queuing Models (대기열 모델)
Simulation (시뮬레이션)

Server Computers

애플리케이션이 점점 더 서버에서 실행됨

과거 : 개인 PC에서 실행

현재 : 서버에서 실행

→ 대규모 데이터센터 서버가 요구됨

하드웨어 : Multiple Processors, Network Connections

제약 : Space, Power

19” racks

랙 구조 :

c++

19" (48.26cm) 폭 표준 랙

┌────────────────────┐
│                    │ ← 1U (1.75")
├────────────────────┤
│                    │ ← 1U
├────────────────────┤
│                    │
│                    │ ← 2U (3.5")
├────────────────────┤
│                    │ ← 1U
├────────────────────┤
│                    │
│                    │
│                    │ ← 3U (5.25")
└────────────────────┘

랙 높이 = 42U (약 2m)

I/O System Design Example

Sun Fire x4150 시스템의 I/O 성능을 분석

시스템 사양

CPU & 메모리

c++

CPU: 10⁹ instructions/sec
FSB (Front Side Bus): 10.6 GB/sec
DRAM (DDR2 667MHz): 5.336 GB/sec

버스 & 디스크

c++

PCI-E 8× bus: 8 × 250MB/sec = 2GB/sec
디스크: 15,000 rpm, 평균 탐색 시간 2.9ms, 전송률 112MB/sec

워크로드

c++

작업: 64KB 디스크 읽기
명령어: 200,000 (사용자) + 100,000 (OS) = 300,000 instructions

Random Reads 분석

CPU I/O Rate

c++

명령어 수: 100,000 (OS) + 200,000 (user) = 300,000

코어당:
  10⁹ instructions/sec / 300,000 = 3,333 ops/sec

8개 코어:
  3,333 × 8 = 26,667 ops/sec

Disk I/O Rate

Random Reads

c++

시간/op 계산:
  Seek:               2.9ms / 4 = 0.725ms
  Rotational latency: 4ms / 2 = 2ms
  Transfer:           64KB / 112MB/sec = 0.571ms
  ──────────────────────────────────────
  총:                 3.3ms
  
  디스크 1개:
  1 / 3.3ms = 303 ops/sec

디스크 8개:
  303 × 8 = 2,424 ops/sec

Sequential Reads

c++

전송만 고려:
  112MB/sec / 64KB = 1,750 ops/sec

디스크 8개:
  1,750 × 8 = 14,000 ops/sec

PCI-E I/O Rate

c++

대역폭: 2GB/sec
전송 크기: 64KB

I/O rate = 2GB/sec / 64KB = 31,250 ops/sec

DRAM I/O Rate

c++

대역폭: 5.336 GB/sec
전송 크기: 64KB

I/O rate = 5.336 GB/sec / 64KB = 83,375 ops/sec

FSB I/O Rate

가정 : 최대 속도의 절반 유지 가능

c++

피크: 10.6 GB/sec → 지속 가능: 5.3 GB/sec

FSB 1개:
  5.3 GB/sec / 64KB = 81,540 ops/sec

FSB 2개:
  81,540 × 2 = 163,080 ops/sec

디스크에서 병목현상 발생 (다른 것들에 비해 처리 속도가 느림)

Fallacy : Disk Dependability

MTTF (Mean Time To Failure) = 평균 고장 시간

개별 디스크의 관점

c++

MTTF = 1,200,000시간이란:
  - 많은 디스크들의 평균 수명
  - 어떤 디스크는 1년 만에 고장
  - 어떤 디스크는 200년 버팀
  - 평균이 140년일 뿐

대규모 시스템 관점

c++

Annual Failure Rate (AFR) = 
  (총 디스크 × 연간 시간) / MTTF

= (1000 disks × 8760 hrs/year) / 1,200,000 hrs/failure

= 8,760,000 / 1,200,000

= 7.3 failures/year

AFR = 0.73% (디스크 1,000개 중 연간 약 7개 고장)

다른 Fallacies

제조사 사양에 비해 고장률이 3~5배 올라감
- 제조사 테스트 환경과 실제 데이터센터 환경이 다르기 때문
GB의 함정
- 네트워크의 1GB = 10⁹ bytes
- 저장 공간 1GB = 2³⁰ bytes = 1,073,071,824bytes
- 1GB/sec로 1초 전송하면?
  - 0.93GB 전송됨(저장 용량 기준)
  - 오차 약 7%

Pitfall: Offloading to I/O Processors

I/O Processor로 작업 위임하기

I/O Processor에 작업을 넘기는 것이 항상 좋지는 않음

1. 관리 오버헤드가 지배적

I/O 프로세서에 요청을 보내고 관리하는 오버헤드가 실제 작업보다 클 수 있음
작은 작업은 CPU에서 직접 처리하는 것이 더 빠름

2. I/O 아키텍처의 제약

작은 작업은 CPU에서 직접 하는게 빠르지만
I/O 아키텍처 설계상 그렇게 할 수 없는 경우가 있음

3. I/O 프로세서가 더 느림

I/O 프로세서는 단순한 작업만 하도록 설계되어 더 단순함
따라서 메인 CPU보다 느린 경우가 많음

4. 성능 향상 시 복잡도 증가

I/O 프로세서를 빠르게 만들면 주요 시스템 컴포넌트가 됨
결국 자체 보조 프로세서(coprocessor)가 필요할 수도 있음

Pitfall: Backing Up to Tape

과거 테이프의 장점

탈착 가능
대용량

디스크 기술 발전으로 장점 상실

더 나은 방법 : 데이터 복제

RAID, 원격 미러링 등등..

Fallacy: Disk Scheduling

OS는 더 이상 디스크의 물리적 특성을 알 수 없음

논리 블록 주소만 봄
실제 물리 위치 모름
캐시 상태 모름

결과

OS 스케줄링이 항상 최선은 아님
드라이브 자체 스케줄링을 신뢰하는 것이 나을 수 있음
상황에 따라 성능이 오히려 저하될 수 있음

▶

MP.

▶

Snooping

Snooping Protocol

멀티프로세서의 캐시 일관성 문제

문제 상황

agda

CPU 1 캐시: X = 5
CPU 2 캐시: X = 5
메모리:     X = 5

CPU 1이 X = 10으로 쓰기
→ CPU 1 캐시: X = 10
→ CPU 2 캐시: X = 5 (오래된 데이터)

Snooping 과 Directory based protocol

Snooping 방식

기본 아이디어

모든 프로세서가 버스를 "엿듣기(snoop)"
누군가 데이터 쓰면 다른 캐시들이 알아차림
버스가 자연스러운 broadcast 매체

동작

agda

CPU 1이 X 쓰기 → 버스에 요청
                ↓
모든 CPU가 snoop → X를 가진 캐시는 반응

장점: 소규모 시스템에 효과적 (버스 사용) 단점: 확장성 제한 (버스 병목)

Snoopy 두 가지 기본 프로토콜

Write Invalidate (무효화)

agda

CPU 1이 X 쓰기
→ 다른 캐시의 X를 무효화(invalid)
→ 다른 CPU가 X 읽으려면 다시 가져와야 함

장점: 한 번만 무효화하면 여러 번 쓰기 가능

Write Broadcast (방송)

agda

CPU 1이 X 쓰기
→ 새 값을 모든 캐시에 broadcast
→ 모든 캐시 즉시 업데이트

장점: 쓰기 후 읽기 latency 낮음
단점: 매번 broadcast (트래픽 많음)

실제로는 Write Invalidate가 더 효율적

버스의 역할

Write Serialization (쓰기 직렬화)

버스가 single arbitration point
모든 쓰기 요청이 버스를 거침
자연스럽게 순서 보장

Snoopy Protocol 예시

프로토콜 설정

기본 구성

Invalidation protocol 사용
Write-back cache
메모리 상태 (각 블록은 다음 중 하나)
- Clean in all caches and up-to-date in memory (Shared)
- Dirty in exactly one cache (Exclusive)
- Or not in any caches
캐시 블록 상태 (각 블록은 다음 중 하나)
- Shared (공유, 읽기 전용)
- Exclusive (독점, 읽기/쓰기)
- Invalid (유효 데이터 없음)
주요 동작
- Read misses
- Writes to clean line

Snoopy-Cache State Machine-I (CPU 요청)

CPU 요청에 따른 상태 전이

State Machine I, II, III의 차이

State Machine I

CPU 요청만 처리

자기 CPU가 read/write할 때의 동작
버스에 요청 보내는 것만 표현
다른 CPU의 요청은 고려 안함

예시: 내가 데이터 읽을 때만

State Machine II

버스 요청만 처리

다른 CPU의 버스 요청을 snoop할 때의 동작
다른 CPU가 read/write하는 것에 반응
내 CPU 요청은 고려 안함

예시: 다른 CPU가 데이터 쓸 때 내 캐시 어떻게 할지

State Machine III

CPU + 버스 요청 통합

State Machine I + State Machine II
완전한 상태 전이도
실제 구현에서 사용하는 완전한 버전

▶

Directory based

Directorty Protocol

Snoopy와의 차이점

Snoopy : 버스 broadcast (모두에게 알림)
Directory : Point-to-point 통신 (필요한 곳만)

3 States

Shared

≥ 1개 프로세서가 데이터 보유
메모리가 최신 상태 (up-to-date)
읽기 전용

Uncached

어떤 프로세서도 보유 안함
모든 캐시에서 유효하지 않음
메모리에만 존재

Exclusive

1개 프로세서(owner)만 보유
메모리는 오래됨 (out-of-date)
읽기/쓰기 가능

W.M. : Write Memory

D.V.R : Data Value Reply

I : Invalid

▶

기말 개념 정리

UMA vs NUMA

process vs thread

explain SMID

explain and draw USB

explain and draw NOR gate, compare NAND gate

explain ROM, PROM, EPROM, EEPROM, flash memory

polling, intrerrupt driven I/O, DMA

memory mapped I/O vs I/O instruction

explain fine-grain, coarse-grain, simultaneous multithreading