# Advanced Computer Architectures - Notes

260236

May 2024

## Preface

Every theory section in these notes has been taken from two sources:

- Computer Architecture: A Quantitative Approach. [\[1\]](#page-55-0)
- Course slides. [\[2\]](#page-55-1)

About:

§ [GitHub repository](https://github.com/PoliMI-HPC-E-notes-projects-AndreVale69/HPC-E-PoliMI-university-notes)

## Contents



#### [Index](#page-56-0) 57

## <span id="page-3-5"></span><span id="page-3-0"></span>1 Basic Concepts

This section is designed to review old concepts that are fundamental to this course.

## <span id="page-3-1"></span>1.1 Pipelining

#### <span id="page-3-2"></span>1.1.1 MIPS Architecture

MIPS (Microprocessor without Interlocked Pipelined Stages) is a family of Reduced Instruction Set Computer (RISC). It is based on the concept of executing only simple instruction in a reduced basic cycle to optimize the performance of  $CISC<sup>1</sup>$  $CISC<sup>1</sup>$  $CISC<sup>1</sup>$  CPUs.

MIPS is a load-store architecture (or a register–register architecture), which means it is an Instruction Set Architecture  $(ISA<sup>2</sup>)$  $(ISA<sup>2</sup>)$  $(ISA<sup>2</sup>)$  that divides **instructions** into two categories:

• Memory access (load and store between memory and registers; load data from memory to registers; store data from registers to memory):



• ALU operations (which only occur between registers).

Finally, MIPS is also a Pipeline Architecture. It means that it can execute a performance optimization technique based on overlapping the execution of multiple instructions derived from a sequential execution flow.

<span id="page-3-3"></span><sup>1</sup>CISC processors use simple and complex instructions to complete any given task. Instead, the RISC processor uses the approach of increasing internal parallelism by executing a simple set of instructions in a single clock cycle (see more [here\)](https://electronicsdesk.com/cisc-processor.html).

<span id="page-3-4"></span><sup>&</sup>lt;sup>2</sup>Instruction Set Architecture (ISA) is a part of the abstract model of a computer, which generally defines how software controls the CPU.

## Reduced Instruction Set of MIPS Processor

The instruction set of the MIPS processor is the following:

- ALU instructions:
	- Sum between two registers:
	- 1 add  $$s1, $s2, $s3$  #  $$s1 < $s2 + $s3$

Take the values from s2 and s3, make the sum and save the result on s1.

- Sum between register and constant:
- 1 addi  $$s1, $s1, 4$  #  $$s1 < $s1 + 4$

Take the value from s1, make the sum between s1 and 4, and save the result on s1.

• Load/Store instructions:

– Load

```
1 lw s1, offset (s2) # s1 <- Memory [s2 + offset]
```
From the s2 register, calculate the index on the memory with the offset, take the value and store it in the s1 register.

– Store

 $1 \text{ sw } $s1, \text{ offset } ($s2$)$  # Memory  $$s2 + \text{offset} ]$  <-  $$s1$ 

Take the value from the s1 register, take the value from the s2 register, calculate the index on the memory with the offset, and store the value taken from s1 in the memory.

• Branch instructions to control the instruction flow:

#### – Conditional branches

Only if the condition is true (branch on equal):

1 beq  $$s1$ ,  $$s2$ ,  $L1$  # if  $$s1 == $s1$  then goto  $L1$ 

Only if the condition is false (branch on not equal):

- 1 bne  $$s1, $s2, L1$  # if  $$s1 != $s2$  then goto L1
- Unconditional jumps. The branch is always taken. Jump:

<sup>1</sup> j L1 # jump to L1

Jump register:

1 jr \$s1 # jump to address contained in \$s1

## Formats of MIPS 32-bit Instructions

The previous instructions are divided into three types:

- Type R (Register): ALU instructions.
- Type I (Immediate): Load/Store instructions and Conditional branches.
- Type J (Jump): Unconditional jumps instructions.

Every instruction starts with a 6-bit opcode. In addition to the opcode:

- R-type instructions specify:
	- Three registers: rs, rt, rd
	- A shift amount field: shamt
	- A function field: funct
- I-type instructions specify:
	- Two registers: rs, rt
	- 16-bit immediate value: offset/immediate
- J-type instructions specify:
	- 26-bit jump target: address

<span id="page-5-0"></span>

Figure 1: MIPS 32-bit architecture.

Scan (or click) the QR code below to view the table in high quality:



## <span id="page-6-0"></span>Phases of execution of MIPS Instructions

Every instruction in the MIPS subset can be implemented in at most 5 clock cycles (phases) as follows:

- 1. Instruction Fetch (IF)
	- (a) Send the content of Program Counter (PC) register to the Instruction Memory (IM);
	- (b) Fetch the current instruction from Instruction Memory;
	- (c) Update the Program Counter to the next sequential address by adding the value 4 to the Program Counter (4 because each instruction is 4 bytes!).
- 2. Instruction Decode and Register Read (ID)
	- (a) Make the fixed-filed recording (decode the current instruction);
	- (b) Read from the Register File (RF) of one or two registers corresponding to the registers specified in the instruction fields;
	- (c) Sign-extension of the offset field of the instruction in case it is needed.
- 3. Execution (EX). The ALU operates on the operands prepared in the previous cycle depending on the instruction type (see more details after this list):
	- Register-Register ALU instructions: ALU executes the specified operation on the operands read from the Register File.
	- Register-Immediate (Register-Constant) ALU instructions: ALU executes the specified operation on the first operand read from Register File and the sign-extended immediate operand.
	- Memory Reference: ALU adds the base register and the offset to calculate the effective address.
	- Conditional Branches: ALU compares the two registers read from Register File and computes the possible branch target address by adding the sign-extended offset to the incremented Program Counter.
- 4. Memory Access (ME). It depends on the operation performed:
	- Load. Instructions require a read access to the Data Memory using the effective address.
	- Store. Instruction require a write access to the Data Memory using the effective address to write the data from the source register read from the Register File.
	- Conditional branches can update the content of the Program Counter with the branch target address, if the conditional test yielded true.
- <span id="page-7-1"></span>5. Write-Back (WB). It depends on the operation performed:
	- (a) Load instructions write the data read from memory in the destination register of the Register File.
	- (b) ALU instructions write the ALU results into the destination register of the Register File.

#### <span id="page-7-0"></span>Execution (EX) details

- Register-Register ALU instructions. Given the following pattern (where  $op$  can be the operators  $add/addi$  (+) or  $sub/subi$  (-), but not mult  $(x)$  or div  $(\div)$  because they required some special registers and therefore more phases):
- 1 op  $x, \, xy, \, x$  + e.g. op=add =>  $x \, \leq \, xy + \, x$

#### Cost: 4 clock cycles

- 1. Instruction Fetch (IF) and update the Program Counter (next sequential address);
- 2. Fixed-Field Decoding and read from Register File the registers: y and z;
- 3. Execution (EX), ALU performs the operation op  $(\mathsf{F} \times \mathsf{F})$  op  $\mathsf{F} \times \mathsf{F}$ );
- 4. Write-Back (WB), ALU writes the result into the destination register x.

#### • Memory Reference

- Load. Given the following pattern:
- $1$  lw  $x$ , offset  $(y)$  #  $x \leftarrow M[y + offset]$

Cost: 5 clock cycles

- 1. Instruction Fetch (IF) and update the Program Counter (next sequential address);
- 2. Fixed-Field Decoding and read of Base and register y from Register File (RF);
- 3. Execution (EX), ALU adds the base register and the offset to calculate the effective address: y + offset;
- 4. Memory Access (ME), read access to the Data Memory (DM) using the effective  $(y + offset)$  address;
- 5. Write-Back (WB), write the data read from memory in the destination register of the Register File (RF) x.

– Store. Given the following pattern:

#### 1 sw  $x$ , offset  $(y)$  #  $M[y + 0$ ffset] <-  $x$

#### Cost: 4 clock cycles

- 1. Instruction Fetch (IF) and update the Program Counter (next sequential address);
- 2. Fixed-Field Decoding and read of Base register y and source register x from Register File (RF);
- 3. Execution (EX), ALU adds the base register and the offset to calculate the effective address:  $y + \text{offset};$
- 4. Memory Access (WB), write the data read from memory in the destination register of the Register File (RF) M(y + offset).
- Conditional Branch. Given the following pattern:
- 1 beq \$x, \$y, offset

#### Cost: 4 clock cycles

- 1. Instruction Fetch (IF) and update the Program Counter (next sequential address);
- 2. Fixed-Field Decoding and read of source registers x and y from Register File (RF);
- 3. Execution (EX), ALU compares two registers x and y and compute the possible branch target address by adding the sign-extended offset to the incremented Program Counter: PC + 4 + offset;
- 4. Memory Access (ME), update the content of the Program Counter with the branch target address (we assume that the conditional test is true).

### <span id="page-9-0"></span>1.1.2 Implementation of MIPS processor - Data Path

Implementing a MIPS processor isn't difficult. On the following page we show three different diagrams: the first is a very high level data path to allow the reader to understand how it works; the second is more detailed, but without the CU (Control Unit); the third is the complete data path and it also includes the CU (in red).

<span id="page-9-1"></span>

Figure 2: A basic implementation of MIPS data path. [\[2\]](#page-55-1)

Scan (or click) the QR code below to view the figure [2](#page-9-1) in high quality:



Two notes:

- The Instruction Memory (read-only memory) is separated from Data Memory.
- The 32 general-purpose register are organized in a Register File (RF) with 2 read ports and 1 write port.

<span id="page-9-2"></span>

Figure 3: An implementation of MIPS data path (no Control Unit). [\[2\]](#page-55-1)

Scan (or click) the QR code below to view the figure [3](#page-9-2) in high quality:



<span id="page-10-0"></span>

Figure 4: A complete implementation of MIPS data path. [\[2\]](#page-55-1)

Scan (or click) the QR code below to view the figure [4](#page-10-0) in high quality:



#### <span id="page-11-1"></span><span id="page-11-0"></span>1.1.3 MIPS Pipelining

In simple words, the Instruction Pipelining (or Pipelining) is a technique for implementing instruction-level parallelism within a single processor. Pipelining attempts to keep every part of the processor busy with some instruction by dividing incoming instructions into a series of sequential steps (the eponymous "pipeline") performed by different processor units with different parts of instructions processed in parallel.

Definition 1: Pipelining

Pipelining is a performance optimization technique based on the overlap of the execution of multiple instructions deriving from a sequential execution flow.

Pipelining exploits the parallelism among instructions in a sequential instruction stream.

#### ⋆ Basic idea

The execution of an instruction is divided into different phases (called pipelines stages), requiring a fraction of the time necessary to complete the instruction. These stages are connected one to the next to form the pipeline:

- 1. Instructions enter the pipeline at one end;
- 2. Progress through the stages;
- 3. And exit from the other end.

As in the assembly line.

#### ✓ Advantage

The Pipelining is transparent for the programmer. To understand what it means, let's make an example.

## Example 1

Image a car assembly line (e.g. Ferrari). A new car exits from the Ferrari assembly line in the time necessary to complete one of the phases. The pipelining technique doesn't reduce the time required to complete a car (the latency), BUT increases the number of vehicles produced per time unit (the throughput) and the frequency to complete cars.



As introduced in the previous example, sequential execution is slower than pipelining. The following figure shows the difference (in terms of clock cycles) between sequential and pipelining.

<span id="page-12-0"></span>

Figure 5: Sequential vs Pipelining. [\[2\]](#page-55-1)

<span id="page-13-0"></span>The time to advance the instruction of one stage in the pipeline corresponds to a clock cycle. So the total cost is: 9 clock cycles.

It's obvious that the pipeline stages must be synchronized: the duration of a clock cycle is defined by the time required by the slower stage in the pipeline (i.e. 2 ns). The main goal is to balance the length of each pipeline stage. If the stages are perfectly balanced, the ideal speedup due to pipelining is equal to the number of pipeline stages.

```
Definition 2: ideal speedup
```
The ideal speedup must be the same value of the pipeline stages.

Look again at Figure [5.](#page-12-0) The sequential and pipelining cases consist of 5 instructions, each of which is divided into 5 low-level instructions of 2 ns each.

• The latency (total execution time) of each instruction is not varied, it's always 10 ns.

Definition 3: latency

The latency is the execution time of a single instruction.

- The throughput (number of low-level instructions completed in the time unit) is improved:
	- Sequential: 5 instructions in 50 ns (1 instruction per 10 ns,  $50 \div 5 =$ 10)
	- Pipelining: 5 instruction in 18 ns (1 instruction per 3.6 ns,  $18 \div 5 =$ 3.6)

Definition 4: throughput

The **throughput** is the number of instructions completed per unit of time.

## Pipeline Execution of MIPS Instructions

On page [8](#page-7-0) we discussed some MIPS instructions to understand how the MIPS architecture works. The aim of the following pages is to understand how MIPS works in a pipelined execution.

We want to perform the following assembly lines:

```
1 op x, sy, z # assume x \leq -sy + sz_2 \bf{lw} \bf{\$x} , \bf{offset} \bf{(\$y)} # \bf{\$x} <- \texttt{M[}\bf{\$y} + \bf{offset} \bf{]}_3 sw x, offset (y) # M[y + offset] <- x4 beq $x, $y, offset
```






Load Instructions:  $\texttt{lw} \, \, \texttt{\$x}$  ,  $\texttt{offset}(\texttt{\$y}) \, \, \, \texttt{\#} \, \, \texttt{\$x} \, \leftarrow \, \texttt{M}[\texttt{\$y} \, + \, \texttt{offset}]$ 



**Store Instructions: sw \$x,offset(\$y) # M[\$y + offset]**¬ **\$x** 



**Conditional Branches: beq \$x,\$y,offset**



Figure 6: Pipeline Execution of MIPS Instructions. [\[2\]](#page-55-1)



Resources used during the pipeline execution

Figure 7: Resources used during the pipeline execution (IM is Instruction Memory, REG is Register File and DM is Data Memory). [\[2\]](#page-55-1)

## Implementation of MIPS pipeline

<span id="page-16-0"></span>The division of the execution of each instruction in  $n$  stages implies that in each clock cycle, there are  $n$  instructions for execution. That means the CPU must have  $n$  modules corresponding to  $n$  execution stages. Therefore, to do pipelining, we need pipeline registers to separate the different stages.

In the following figure, we can see how the pipeline registers are implemented. Between each phase of execution of MIPS instructions (details on page [7\)](#page-5-0), there is a pipeline register holding the result of the instruction.



Figure 8: MIPS pipeline implementation. [\[2\]](#page-55-1)

Note: the data stored in the interstage registers correspond (obviously) to different instructions.

Finally, in the following figure we can see the timeline implementation of the pipeline registers. But there are two basic assumptions to make:

- 1. There are no data dependencies between instructions. If there were, an instruction could read a register with an unknown value (Pipeline Hazard, page [19\)](#page-18-0).
- 2. There are no branch/jump instructions.



Figure 9: Timeline of MIPS pipeline implementation. [\[2\]](#page-55-1)

#### <span id="page-18-1"></span><span id="page-18-0"></span>1.1.4 The problem of Pipeline Hazards

#### Definition 5: Hazard

A hazard (conflict) is created whenever there a dependence between two instructions, and instructions are close enough that the overlap caused by pipelining would change the order of access to the operands involved in the dependence.

#### **A** Problem Consequences

The Hazards:

- Force the next instruction in the pipeline to be executed later than its intended clock cycle.
- Reduced the performance from the ideal speedup achieved by pipelining (direct previous consequence).

There are three classes of Hazards:

• Structural Hazards. Attempt to use the same resource from different instructions simultaneously.

Example: single memory for instruction and data.

• Data Hazards. Attempt to use a result before it is ready.

Example: instruction depending on a result of a previous instruction still in the pipeline.

There are also two specific forms of data hazard, called **Load-Use** Data Hazard and Load-Store Data Hazard. Both occur when the data loaded by a load instruction is not yet available when it is needed by another instruction. In the case of Load-Use, the "another instruction" is an operator such as add; in the case of Load-Store, the "another instruction" is the store (sw) instruction.

The following example shows the conflict (Load-Use Data Hazard) between two instructions. In particular, the value lw writes to s2 is not available until lw has completed the MEM phase, but and needs this value when it enters the EX phase, i.e. when lw enters the MEM phase.

```
1 \text{ W} $s2, 20($s1)
2 and $s4 , $s2 , $s5
```
• Control Hazards. Attempt to make a decision on the next instruction to execute before the condition is evaluated (more detailed analysis on page [21\)](#page-19-0).

Example: conditional branch execution.

Structural Hazards? No problem for MIPS Architecture!

There aren't any structural hazards in MIPS architecture because the Instruction Memory (IM) is separated from the Data Memory (DM). Also, the Register File (RF) is used in the same clock cycle (read access by an instruction and write access by another instruction).

### <span id="page-19-1"></span>® How to detect Data Hazards? Dependency Analysis

To detect Data Hazards, it is suggested to analyze the dependencies. If the instructions executed in the pipeline depend on each other, data hazards can arise when instructions are too close. For example:

2 and \$2, \$1, \$3 # reg. \$2 written by sub<br>
2 and \$12, \$2, \$5 # 1 operand (\$2) depends<br>
3 or \$13, \$6, \$2 # 2 operand (\$2) depends<br>
4 add \$14, \$2, \$2 # 1 (\$2) and 2 (\$2) op.s # 1 operand (\$2) depends on sub # 2 operand (\$2) depends on sub  $# 1$  (\$2) and 2 (\$2) op.s depend on sub 5 sw \$15, 100(\$2) # base reg. (\$2) depends on sub

Data Hazards can occur in a variety of situations, but a true dependency situation is created by a RAW (Read After Write) Hazard.

Definition 6: Read After Write Hazard

A RAW (Read After Write) Hazard occurs when an instruction  $n + 1$  tries to read a source operand before the previous instruction n has written its value in the Register File (RF).

For example:

```
1 sub $2, $1, $3 # reg. $2 is written by sub
2 and $12 , $2 , $5 # 1 op. ($2) depends on sub
```
#### <span id="page-20-0"></span>® How to detect Control Hazards? Check conditional branches

First of all, some examples of conditional branches for MIPS processor are: beq (branch on equal) and bne (branch on not equal):

1 beq \$s1, \$s2, L1 # if \$s1 == \$s1 then goto L1<br>2 bne \$s1, \$s2, L1 # if \$s1 != \$s2 then goto L1 # if  $$s1$  !=  $$s2$  then goto L1

The address to which you want to branch is called the Branch Target Address. If the branch condition:

- Is satisfied  $\Rightarrow$  the **branch is taken** and the Branch Target Address is stored in the Program Counter (PC).
- Is not satisfied  $\Rightarrow$  the **branch is not taken** (untaken) and the instruction stream is executed sequentially with the next instruction address  $(PC +4)$ .

In detail, the stages are the following:

- 1. [IF] Instruction fetch and PC increment.
- 2. [ID] Instruction Decode and Registers Read (e.g. x and y)
- 3. [EX] Compare registers (e.g. x and y) in the ALU to derive the Branch Outcome: taken or not taken. Also, computation of the Branch Target Address, so  $PC + 4 +$  offset
- 4. [ME] The Branch Outcome is used to decide the next PC:
	- Is satisfied  $\Rightarrow$  PC take PC + 4 + offset
	- Is not satisfied  $\Rightarrow$  PC take PC + 4

Let us now move on to a more interesting analysis. To understand when the Control Hazards occur, think about the Branch Outcome and the Branch Target Address. Both are ready at the end of the EX (execution) phase (so between pass number 3 and 4). Finally, branches are resolved when the Program Counter is updated at the end of the Memory Access stage (after pass number 4).

To feed the condition branch into the pipeline, we need to create a way where the condition branch is decided before the EX stage of the next instruction. It's obvious, because if the Branch Outcome is positive, we need to skip the next instruction and do the conditional jump instead.

This is a more detailed explanation of a control hazard. Control Hazards arise from the pipelining of conditional branches and other jump instructions that change the PC. They also reduce the performance from the ideal speedup gained by pipelining, because it is necessary to hold the pipeline until the branch is resolved.

## MIPS Optimized Pipeline

<span id="page-21-0"></span>Consider the following situation:



Figure 10: Why MIPS Optimized Pipeline was born. [\[2\]](#page-55-1)

The Register File is used in 2 stages: read access during ID (and operation) and write access during Write Back (WB) (sub operation). What happens if read and write refer to the same register in the same clock cycle? Or we insert a stall, or we use an optimized pipeline (smart choice).

#### Definition 7: Optimized Pipeline

By selecting Optimized Pipeline, we assume the Register File (RF) read occurs in the second half of clock cycle and the Register File write in the first half of clock cycle.

This way we don't need the stall. The following Figure [11](#page-22-0) shows an optimized pipeline.

<span id="page-22-0"></span>

Figure 11: Optimized Pipeline (IM is Instruction Memory, REG is Register File, and DM is Data Memory). [\[2\]](#page-55-1)

And the problem mentioned at the beginning of this paragraph is partially solved, as we can see in the following figure.



Figure 12: Optimized Pipeline to solve the example stall. [\[2\]](#page-55-1)

#### <span id="page-23-1"></span><span id="page-23-0"></span>1.1.5 The solution of Data Hazards

The following techniques don't solve the problem completely, but they do solve it partially. So they find a perfect balance between the ideal speedup and a situation where the hazard is total.

The solution can be applied on runtime (hardware techniques) or on compilation (static-time techniques):

- Compilation Techniques (static-time techniques):
	- The **insertion of nop** is a simple (logical) solution where we **insert a** nop operator between dependent statements to ensure correct operation.

See the example on page [27.](#page-26-0)

– The instructions scheduling is a technique used by the compiler to prevent correlating instructions from being too close together. It tries to reorder instructions by inserting independent instructions between correlating instructions. If the compiler can't do this, it inserts nop operations.

See the example on page [28.](#page-27-0)

- Hardware Techniques (runtime techniques):
	- $-$  The insertion of stalls (called also bubbling the pipeline, pipeline break, or pipeline stall) is a sort of a delay before the processor can resume execution of the instruction. As we can see in the example on page [28,](#page-27-1) the stalls delay the stages of the correlating instructions.
	- The data forwarding uses temporary results stored in the pipeline registers instead of waiting for the results to be written back to the Register File (RF). To do this, it's necessary to add new paths and multiplexers at the inputs of the ALU to fetch inputs from the pipeline to avoid inserting stalls in the pipeline. See the example on page [28.](#page-27-2)

We have the mandatory to give more words to the data forwarding technique. First of all, its implementation needs new paths and new multiplexers. So, to adapt the MIPS architecture, the new implementation will be show in the figure [13](#page-24-0) on page [25.](#page-24-0)

<span id="page-24-1"></span><span id="page-24-0"></span>

Figure 13: Implementation of MIPS with Forwarding Unit. [\[2\]](#page-55-1)

Scan (or click) the QR code below to view the figure [13](#page-24-0) in high quality:



The forwarding paths created inside the MIPS architecture are three: EX to EX path, MEM to EX path, and MEM to MEM path.



Figure 14: Forwarding paths on MIPS architecture. [\[2\]](#page-55-1)

Furthermore, the forwarding technique can solve the Load-Use and Load-Store Data Hazard. It's a very interesting feature because the MEM to EX and MEM to MEM paths can solve two different situations:

• Load-Use Hazard. It's solved by MEM to EX path because the value loaded in the MEM stage, is forwarded directly to the EX stage of the next conflict instruction (but unfortunately we need one stall to delay the run).



• Load-Store Hazard. It's solved by MEM to MEM path because the value loaded in the MEM stage, is forwarded directly to the MEM stage of the next conflict instruction.





27

<span id="page-26-0"></span>Insertion of nop. [\[2\]](#page-55-1)

<span id="page-27-2"></span><span id="page-27-1"></span><span id="page-27-0"></span>

#### <span id="page-28-0"></span>1.1.6 The solution of Control Hazards

There are multiple techniques to resolve a Control Hazard.

#### ✓ Conservative Solution - The Branch Stalls

The following solution is the most conservative. Solve the problem? Yes, but it's called conservative because adopt a banal technique: stalling until resolution at the end of the Memory Access (ME) stage of the branch.

The main problem is the loss of performance. Each branch costs a penalty of 3 stalls to decide and fetch the correct instruction flow in the pipeline:

| beq \$1, \$3, L1   IF   ID   EX   ME,   WB |  |                            |  |   |    |                      |       |     |           |
|--------------------------------------------|--|----------------------------|--|---|----|----------------------|-------|-----|-----------|
| and $$12, $2, $5$                          |  | IF stall IF stall IF stall |  | D |    | $EX \mid ME \mid WB$ |       |     |           |
| or \$13, \$6, \$2                          |  |                            |  |   | ID | EX                   | ME WB |     |           |
| add \$14, \$2, \$2                         |  |                            |  |   | Æ  | ID                   | EX.   | ME. | <b>WB</b> |

Figure 15: Example of a conservative solution to solve a Control Hazard.

#### ✓ Start to think to the branch prediction - Flush solution

The branch stalls are not good because there is a reduction in throughput. So we can make a kind of prediction on the branch and assume that the branch will not be taken. So we start fetching and executing the next 3 instructions in the pipeline. Ok, but wait, what if the branch is taken? No problem, we flush the next 3 instructions before they write their result and then fetch the instruction at the branch target address.



Figure 16: Example of a flush solution to solve a Control Hazard.

#### $\checkmark$  Early Evaluation of the Program Counter (PC) in ID stage

It's clear that to improve performance in the event of branch hazards, we need to add more hardware features, such as:

- Compare registers to derive the Branch Outcome (BO).
- Compute the Branch Target Address (BTA).
- Update the PC register.

Fortunately, the MIPS-optimized pipeline already has these features and does so during the ID stage. As a result, the Branch Outcome (BO) and the Branch Target Address (BTA) are known at the end of the ID stage.



Figure 17: Early Evaluation of the Program Counter (PC) in ID stage.

Now, using the conservative solution or the flush solution, we get two different results:

• Combo with Conservative Solution: stalling until resolution at the end of the ID stage (when the Branch Outcome and the Branch Target Address are known) to decide which instruction to fetch.

Performance consideration: each branch costs one stall of penalty to decide and fetch the correct instruction flow along the pipeline.

One-cycle-delay for every branch still yields a performance loss of 10% to 30% depending on the branch frequency (Stall Cycles per Instruction due to Branches equal to Branch Frequency times to Branch Penalty):

Stall Cycles = Branch Frequency  $\times$  Branch Penalty

| beg \$1, \$3, L1 $\ \ $ |                         | <b>ME</b> | <b>WB</b> |           |           |           |           |
|-------------------------|-------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
| and $$12, $2, $5$       | IF stall $\parallel$ IF | ID        | EX        | <b>ME</b> | <b>WB</b> |           |           |
| or \$13, \$6, \$2       |                         | IF        | ID        | EX.       | ME.       | <b>WB</b> |           |
| add $$14, $2, $2$       |                         |           | -IF       | ID        | EX        | <b>ME</b> | <b>WB</b> |

Figure 18: Example of conservative solution in MIPS architecture.

• Combo with Fetch Solution: we assume the branch is not taken. Performance consideration: if the Branch Outcome (BO) will be taken, it will be necessary to flush only one instructions before writing its results and fetch the right instruction at the Branch Target Address.



Figure 19: Example of fetch solution in MIPS architecture.

The unique solution is to use branch prediction techniques to deal with this loss of performance.

#### <span id="page-31-4"></span><span id="page-31-0"></span>1.1.7 Performance evaluation in pipelining

As we have seen in the previous sections, the pipelining increases the CPU instruction throughput (number of instructions completed per unit of time) but doesn't reduce the latency (the execution time of a single instruction).

The increase in latency is a direct consequence of two problems:

- The imbalance among the pipeline stages
- The overhead in the pipeline control

This imbalance between the pipeline stages and the overhead are bad aspects:

- The imbalance reduces performance because the clock can run no faster than the time needed for the slowest pipeline stage;
- The overhead arises from the delay introduced by interstage registers and clock skew.

Finally, all instructions should be the same number of pipeline stages. Each assumption and optimization shown previously works well in this case.

Definition 8: number of Clock Cycles, Clocks Per Instructions and MIPS formula

Given:

- The Instruction Count per iteration as  $IC<sub>per</sub>$  iter
- The number of Stall Cycles per iteration as # Stall Cycles
- The length of the pipeline is  $x$

We can calculate the number of Clock Cycles as the sum between the Instruction Count (how many stages there are in one instruction), the number of Stall Cycles inserted by the hardware technique (called insertion of stalls), plus the length of the pipeline  $x$ :

<span id="page-31-1"></span># Clock  $\text{Cycles}_{\text{per}\_\texttt{iter}} = \text{IC}_{\text{per}\_\texttt{iter}} + \text{\#}$  Stall  $\text{Cycles}_{\text{per}\_\texttt{iter}} + x$  (1)

The Clocks Per Instruction per iteration, CPIper\_iter, is calculated with the rapport between the number of Clock Cycles per iteration (previous equation) divided by the Instruction Count per iteration:

<span id="page-31-2"></span>
$$
CPI_{per\_iter} = \frac{\# \text{Clock Cycles}_{per\_iter}}{IC_{per\_iter}}
$$

$$
= \frac{\left( IC_{per\_iter} + \# \text{ Stall Cycles}_{per\_iter} + x \right)}{IC_{per\_iter}} \tag{2}
$$

Finally, the MIPS formula per iteration is calculated with the rapport between the frequency of the clock  $(f_{\text{clock}})$  divided by the multiply between the Instructions Per Clock (as the ratio  $1 \div \text{CPI}$ ) and  $10^6$  (1 million instructions):

<span id="page-31-3"></span>
$$
MIPS_{per\_iter} = \frac{f_{clock}}{(CPI_{per\_iter} \times 10^6)}
$$
 (3)

<span id="page-32-0"></span>We can asymptotically  $(AS)$  rewrite equations [1,](#page-31-1) [2](#page-31-2) and [3](#page-31-3) as follows:

$$
\text{\# Clock Cycles}_{AS} = IC_{AS} + \text{\# Stall Cycles}_{AS} + x \tag{4}
$$

$$
CPI_{AS} = \lim_{n \to \infty} \frac{\text{\# Clock Cycles}_{AS}}{IC_{AS}}
$$
\n
$$
= \lim_{n \to \infty} \frac{(IC_{AS} + \text{\# Stall Cycles}_{AS} + x)}{IC_{AS}}
$$
\n
$$
(5)
$$

$$
MIPS_{AS} = \frac{f_{\text{clock}}}{(\text{CPI}_{AS} \times 10^6)}
$$
(6)

Note: the ideal speedup, then Clock Per Instruction, should be equal to 1. But stalls cause the pipeline performance to degrade from the ideal performance, so we have the Average Clock Per Instruction (CPI):

$$
AVG(CPI) = Ideal CPI + # Stall Cycles_{per_instruction}
$$
\n
$$
= 1 + # Stall Cycles_{per_instruction}
$$
\n(7)

And obviously, the Pipeline Stall Cycles per Instruction is:

 $PSCI = Structural$  Haz. +Data Haz. +Control Haz. +Memory Stalls  $(8)$ 

## <span id="page-33-2"></span><span id="page-33-0"></span>1.2 Cache

#### <span id="page-33-1"></span>1.2.1 Introduction

The cache is introduced to increase the performance of a computer through the memory system in order to:

- Provide the user the illusion to use a memory that is simultaneously large and fast.
- Provide the data to the processor at high frequency.

It takes advantage from the Locality of Reference.

```
Definition 9: Locality of Reference
Locality of reference refers to a phenomenon in which a computer pro-
gram tends to access same set of memory locations for a par-
ticular time period.
In other words, Locality of Reference refers to the tendency of the
computer program to access instructions whose addresses are near one
another. The property of locality of reference is mainly shown by loops
and subroutine calls in a program.
```
There are two types of Locality of Reference:

• Temporal Locality

Definition 10: Temporal Locality

Temporal Locality means that a instruction which is recently executed have high chances of execution again. So the instruction is kept in cache memory such that it can be fetched easily and takes no time in searching for the same instruction.

• Spatial Locality

Definition 11: Spatial Locality

Spatial Locality means that all those instructions which are stored nearby to the recently executed instruction have high chances of execution. It refers to the use of data elements(instructions) which are relatively close in storage locations.



Table 1: [Difference between Spatial Locality and Temporal Locality.](https://www.geeksforgeeks.org/difference-between-spatial-locality-and-temporal-locality/)

## ® Where can we find the cache?

In general, the memory hierarchy is composed of several level. Let us consider 2 levels: cache and main memory. The cache (upper level) is smaller, faster and more expensive than the main memory (lower level).

The minimum chunk of data that can be copied in the cache is the block or cache line. To exploit the spatial locality, the block size must be a multiple of the word size in memory. So, for example a 128-bit block size is equal to 4 words of 32-bit.

The **number of blocks in cache** is given by:

Number of cache blocks  $=$   $\frac{\text{Cache Size}}{\text{Black}}$ Block Size

For example, if the cache size is 64K-Byte and the block size is 128-bit (16-Byte), then the number of cache blocks is 4K blocks.

#### <span id="page-35-2"></span><span id="page-35-0"></span>1.2.2 Cache Hit and Cache Miss

Unfortunately, the Cache can't be maintain every data inside the computer. In order to be faster, it contains only some data with.

When the processor makes a request of a certain type:

- Ideal case. If the requested data is found in one of the cache blocks (upper level), then there is a hit in the cache address and it's called Cache Hit.
- Problematic case. If the requested data is not found in one of the cache blocks (upper level), then there is a miss in the cache address and it's called Cache Miss.

But beware, in this case we need to access the lower level of the memory hierarchy to find the requested block. This causes:

- To stall the CPU;
- To require to block from the main memory;
- To copy (write) the block in cache;
- To repeat the cache access (hit).

Definition 12: Cache Hit

A Cache Hit is when a requested data is found in one of the cache block of the upper level of the memory.

Furthermore we define the **Hit Rate** as the **number of memory accesses** that find the data in the upper level with respect to the total number of memory accesses:

$$
\text{Hit Rate} = \frac{\text{\# hits}}{\text{\# memory accesses}}
$$

Finally we define the Hit Time as the time to access the data in the upper level of the hierarchy, including the time needed to decide if the attempt of access will result in a hit or miss.

<span id="page-35-1"></span>Definition 13: Cache Miss

A Cache Miss is when a requested data is not found in one of the cache blocks and must be taken from the lower level of the memory.

Furthermore we define the Miss Rate as the number of memory accesses not finding the data in the upper level with respect to the total number of memory accesses:

$$
\text{Miss Rate} = \frac{\text{# misses}}{\text{# memory accesses}} \tag{9}
$$

Finally we define the Miss Time as

$$
\mathtt{Miss Time} = \mathtt{Hit Time} + \mathtt{Miss Penalty} \qquad \qquad (10)
$$

<span id="page-36-0"></span>Where the Miss Penalty is the time needed to access the lower level and to replace the block in the upper level.

Two observations:

1. Should be obviously the definition:

Hit Rate + Miss Rate  $= 1$ 

2. Typically, we have the following relation:

Hit Time ≪ Miss Penalty

Finally, the Average Memory Access Time (AMAT) can be calculated as:

 $AMAT = Hit Time + Miss Rate * Miss Penalty$  (11)

#### <span id="page-37-0"></span>1.2.3 Cache Structure

Each entry (cache line) in the cache includes: **Each entry (cache line)** in the cache includes:

- (V) Valid bit to indicate if this position contains valid data or not. At **1. Valid bit** to indicate if this position contains valid data or not. the bootstrap, all the entries in the cache are marked as  $INVALID$ .
- (TAG) Cache Tag(s) contains the value that univocally identifies the memory address corresponding to the stored data.
- <span id="page-37-1"></span>• (DATA) Cache Data contains a copy of data (block or cache line).  $\mathbf{m}$ **1.** (**Similar**) cache Data contains a copy of data (block of cache line).



After a general presentation of the cache structure, we answer four questions about the memory hierarchy to understand different topics:

- Block placement (page [39\)](#page-38-0). Where can a block be placed in the upper level?
	- Direct Mapped Cache (page [39\)](#page-38-1)
	- Fully Associative Cache (page [41\)](#page-40-0)
	- $-$  *n*-way Set Associative Cache (page [43\)](#page-42-0)
- Block identification (page [45\)](#page-44-0). How is a block found if it is in the upper level?
- Block replacement (page [45\)](#page-44-1). Which block should be replaced on a miss?
- Write strategy (page [46\)](#page-45-0). What happens on a write?

### Block placement

<span id="page-38-4"></span><span id="page-38-0"></span>The main question is: where can a block be placed in the upper level? In other words, the problem is: given the address of the block in the main memory, where can the block be placed in the cache?

So, we need to find the correspondence between the memory address and the cache address of the block. This correspondence depends on the cache structure and can be of three types:

- Direct Mapped Cache
- Fully Associative Cache
- n-way Set-Associative Cache

#### <span id="page-38-1"></span>Direct Mapped Cache

With the Direct Mapped Cache structure, each memory location corresponds to one cache location and only one cache location. The following formula gives the cache address of the block:

<span id="page-38-3"></span> $(Block Address)_{cache} = (Block Address)_{mem} \mod (\# Cache Block)$  (12)

The block address of the cache corresponds to the modulo operation between the block address of the memory and the number  $(\#)$  of cache blocks. The [modulo](https://en.wikipedia.org/wiki/Modulo) [operation](https://en.wikipedia.org/wiki/Modulo) returns a division's remainder or signed remainder after dividing one number by another.

<span id="page-38-2"></span>

Figure 20: This figure shows the memory address composed of the block address (tag and index used to identify the block) and the block offset.

From Figure [21,](#page-39-0) we can see the complete structure of the cache if we choose the direct mapped cache technique.

The rectangle on the top is the memory address (Figure [20\)](#page-38-2). First, we check the Tag value; if it's equal to the value in the cache, we check the Valid bit (V) to see if the position contains valid data: if the value is 1, we have a cache hit; otherwise, the data is invalid. The  $Taq$  contains the value that univocally identifies the memory address corresponding to the stored data. To take the data word, we use the block offset as the selector in the [multiplexer](https://en.wikipedia.org/wiki/Multiplexer) to choose which data block to take. The index field indicates the cache row to check.

<span id="page-39-0"></span>

Figure 21: The cache structure of the Direct Mapped Cache technique.

For example, we assume a block-frame address composed of 32 bits. Our cache structure is direct mapped, and the number of cache blocks is 8. A possible exercise could be determining where block 12 can be placed in the 8-block cache.

To solve this problem, we can use the formula no [12](#page-38-3) on page [39:](#page-38-3)

 $(Block$  Address)<sub>cache</sub> = 12 mod  $8=5$ 

<span id="page-39-1"></span>The result is 5, so the answer is: with the direct mapped technique, the block number is 4 (because the first index of the cache blocks is zero and not 1).



#### <span id="page-40-2"></span><span id="page-40-0"></span>Fully Associative Cache

In a **Fully Associative Cache**, the memory block can be placed in any position of the cache. So, all the cache blocks must be checked during the search of the block.

Note the index does not exist in the memory address; there are the Tag bits only:

Number of blocks 
$$
=
$$
  $\frac{\text{Cache Size}}{\text{Block Size}}$  (13)

The Memory Address comprises the Block Address (Tag) and the Block Offset.



Figure 22: The Memory Address comprises the Block Address (Tag) and the Block Offset.

<span id="page-40-1"></span>The structure of the cache using this technique is as follows:



Figure 23: The cache structure of the Fully Associative Cache technique.

As shown in Figure [23,](#page-40-1) the cache structure is more accessible because there are no Index fields. We check only the Tag field from the memory address. Finally, the Block Offset chooses the Data Block from the cache. We have a cache hit if the Tag is equal to the Tag of the cache and the value in and with the valid bit is true.

For example, we assume a block-frame address composed of 32 bits. Our cache structure is fully associative, and the number of cache blocks is 8. A possible exercise could be determining where block 12 can be placed in the 8 block cache.

<span id="page-41-0"></span>Unlike before, the position can be anywhere.



Direct Mapped on the left and Fully Associative on the right.

#### <span id="page-42-2"></span><span id="page-42-0"></span>n-way Set Associative Cache

In a  $n$ -way Set Associative Cache, the cache is composed of sets. Each set is composed of  $n$  blocks:

Number of blocks = 
$$
\frac{\text{Cache Size}}{\text{Block Size}}
$$
  
Number of sets = 
$$
\frac{\text{Cache Size}}{(\text{Block Size} \times n)}
$$
 (14)

The memory block can be placed in any block of the set, so the search must be done on all the blocks.

Each memory block corresponds to a single set of the cache, and the block can be placed in whatever block of the  $n$  blocks of the set:

<span id="page-42-1"></span>
$$
(\text{Set})_{\text{cache}} = (\text{Block address})_{\text{mem}} \mod (\text{# sets in cache}) \tag{15}
$$



Figure 24: The memory address comprises the block address (Tag and index used to identify the set) and the block offset.



Figure 25: This structure is a 2-way Set Associative Cache.

Taking the examples of previous pages, with the 2-way Set Associative, the raking the **examples** of previous pages, with the z-way set Associative, the answer is anywhere in set 0. The reason for this is that using the formula [15:](#page-42-1) formata 10.

$$
\left(\texttt{Set}\right)_{\texttt{cache}} = 12 \mod 4 = 0
$$



Direct Mapped on the left, 2-way Set Associative on the center and Fully Associative on the right.

## Block identification

<span id="page-44-0"></span>The main question is: how is a block found if it is in the upper level? The problem with identifying a block is that we must compare Tag bits. The comparison depends on the structure of the cache:

- With **Direct Mapping** (page [39\)](#page-37-1), we need to:
	- Identify block positions by index (with the formula [12\)](#page-38-3);
	- Compare block tags;
	- Verify the valid bit.
- With the **Set Associative Mapping** (page [43\)](#page-41-0), we need to:
	- Identify the set by index (with the formula [15\)](#page-42-1);
	- Compare tags of the set;
	- Verify the Valid bit.
- With the Fully Associative Mapping (page [41\)](#page-39-1), we must:
	- Compare tags in every block;
	- Verify the Valid bit.

Comparing the Tag bits, we do not need to check index or block offset bits.

#### Block replacement

<span id="page-44-1"></span>The main question is: which block should be replaced on a miss? In case of a miss (definition on page [36\)](#page-35-1), the replacement strategy depends on the structure of the cache:

- In a Fully Associative Cache (page [41\)](#page-39-1), we need to decide which block to replace: any block is a potential candidate for the replacement.
- In a **Set Associative Cache** (page [43\)](#page-41-0), we must select among the blocks in the given set.
- In a Direct Mapped Cache (page [39\)](#page-37-1), only one candidate must be replaced (no need for any block replacement strategy).

So in the Fully Associative Cache and Set Associative Cache, the main strategies used to choose the block to be replaced are:

- [Random](https://en.wikipedia.org/wiki/Cache_replacement_policies#Random_replacement_(RR)) (or pseudo-random)
- [LRU \(Least Recently Used\)](https://en.wikipedia.org/wiki/Cache_replacement_policies#Least_recently_used_(LRU))
- [FIFO \(First In First Out, or oldest block replaced\)](https://en.wikipedia.org/wiki/Cache_replacement_policies#First_in_first_out_(FIFO))

## Write strategy

<span id="page-45-1"></span><span id="page-45-0"></span>The main question is: what happens on a write? It depends on the written policy adopted in the cache. We remember that there are two possible options:

- Write-Through: data is simultaneously updated (written) to cache and memory. This process is more straightforward and more reliable. This is used when there are no frequent writes to the cache.
	- $\checkmark$  The main advantages
		- 1. It is simpler to implement but to be effective, it requires a write buffer to not wait for the memory hierarchy (to avoid write stalls).
		- 2. The read misses are cheaper because they do not require any write to the memory hierarchy.
		- 3. Memory is always up to date.
- Write-Back: the data is updated only in the cache and then added to the memory later. The modified cache block is written to the memory only when it is replaced due to a cache miss. So, how can we understand if a block is clean or dirty? We need to add a Dirty Bit. Each Block in the cache requires a bit to indicate if the data present in the cache was modified  $(Dirtu)$  or not modified  $(Clean)$ . If it is clean, writing it into the memory is unnecessary. It is designed to reduce write operation to a memory.

 $\checkmark$  The main advantages

- 1. The processor can write the Block at the frequency the cache, not the main memory, can accept.
- 2. Multiple writes to the same Block require only a single write to the main memory.

#### ® What is a Write Buffer?

A Write Buffer is a FIFO buffer that does not wait for main memory access (the typical number of entries is 4 to 8). So, the processor writes data to the cache and the write buffer; the memory controller writes the contents of the buffer to memory.



Figure 26: The cache structure with a Write Buffer.

The main problem with this idea is saturation. However, as we have cited before, the write buffer is usually combined with the Write-Through policy.

## <span id="page-46-0"></span>® Ok, the cache can adopt one of the two write policies, but what happens on a Write Miss?

If write occurs to a location that is not present in the Cache (Write Miss), we use two options: Write Allocate (or fetch on write) and No Write Allocate (Write-Around).

In the first one, the data is loaded from the memory into the cache and then updated. Write Allocate works with both Write Back and Write-Through. However, it is generally used with Write Back because bringing data from the memory to the cache is unnecessary and then updating it in both the cache and main memory.

In the No Write Allocate option, the data is directly written/updated to the main memory without disturbing the cache. It is better to use this when the data is not immediately used again. Generally, the Write-Through cache uses the No Write Allocate option (hoping the next writes to the Block will be done again in memory).

## **B** Summary

| ${\bf Event}$    | Summary                                                                                             |  |  |  |  |  |  |
|------------------|-----------------------------------------------------------------------------------------------------|--|--|--|--|--|--|
| Read Hit         | Read data in cache.                                                                                 |  |  |  |  |  |  |
| <b>Read Miss</b> | Events that manifest themselves:                                                                    |  |  |  |  |  |  |
|                  | 1. CPU stalls;                                                                                      |  |  |  |  |  |  |
|                  | 2. Data request to memory;                                                                          |  |  |  |  |  |  |
|                  | 3. Copy in cache (write in cache);                                                                  |  |  |  |  |  |  |
|                  | 4. Repeat of cache read operation.                                                                  |  |  |  |  |  |  |
| Write Hit        | It depends on the policy chosen:                                                                    |  |  |  |  |  |  |
|                  | Write-Through: write data both in cache<br>and in memory.                                           |  |  |  |  |  |  |
|                  | Write-Back: write data to cache only,<br>and copy memory only when replaced<br>due to a cache miss. |  |  |  |  |  |  |
| Write Miss       | There will certainly be CPU stalls. It also de-<br>pends on the option we choose:                   |  |  |  |  |  |  |
|                  | <i>Write Allocate:</i>                                                                              |  |  |  |  |  |  |
|                  | 1. Data request to memory;                                                                          |  |  |  |  |  |  |
|                  | 2. Copy in cache (write in cache);                                                                  |  |  |  |  |  |  |
|                  | 3. Repeat of cache write operation.                                                                 |  |  |  |  |  |  |
|                  | $\bullet$ No Write Allocate:                                                                        |  |  |  |  |  |  |
|                  | 1. Simply send write data to main<br>memory.                                                        |  |  |  |  |  |  |

Table 2: Summary: Hit and Miss & Read and Write.

## <span id="page-48-2"></span><span id="page-48-0"></span>1.3 Exceptions handling

#### <span id="page-48-1"></span>1.3.1 Introduction

We use the term exception to cover not only exceptions but also interrupts and faults. More in general, we consider the following type of events:

- I/O device request
- Invoking OS system call from a user program
- Tracing instruction execution
- Integer arithmetic overflow/underflow
- Floating point arithmetic anomaly
- Page fault
- Misaligned memory access
- Memory protection violation
- Hardware/Power failure

#### . Causes of Interrupts/Exceptions

An interrupt is an event that requires the processor's attention. The causes of an interrupt or exception can be of two types:

- Asynchronous Exceptions, when a request comes from an external event, such as:
	- I/O device service-request
	- Timer expiration
	- Power disruptions, hardware failure

These events are caused by devices external to the CPU and memory and can be handled after the completion of the current instruction (easier to handle)

- Synchronous Exceptions, when a request comes from an internal event (a.k.a. exceptions), such as:
	- Undefined opcode, privileged instruction
	- Integer arithmetic overflow, FPU exception
	- Misaligned memory access
	- Virtual memory exceptions: page faults, TLB misses, protection violations
	- Traps: system calls, e.g., jumps into kernel

#### <span id="page-49-0"></span>**E** Classes of Exceptions

Some exceptions are predictable and easier to handle, such as user-requested exceptions. But, some exceptions are unpredictable, such as "coerced" exceptions. Other classes of exceptions are:

- User Requested "Exceptions", such as  $I/O$  events, are predictable. They are treated as exceptions because they use the same mechanisms that are used to save and restore the state; handled after the instruction has completed (easier to handle).
- Coerced "exceptions" are caused by some hardware event not under control of the user program; hard to implement because they are unpredictable.
- Masking. To mask an interrupt is to disable it, so it is deferred or ignored by the processor, while to unmask an interrupt is to enable it.

User maskable interrupts are signals affected by the mask. So, when the interrupt is disabled, the associated interrupt signal may be ignored by the processor, or it may remain pending.

User nonmaskable interrupts are signals that cannot be disabled and they are not affected by the interrupt mask.

• Within instructions. These classes of exceptions are synchronous, since the instruction triggers the exception.

In this case, the instruction must be stopped and restarted.

• Between instructions. These classes of exceptions are asynchronous and they arise from catastrophic situations such as HW malfunctions and always cause program termination.

Finally, there are two explanations of the terms we will use:

- The term terminating event means that the program execution always stops after the interrupt/exception.
- The term resume event means that the program execution continues after the interrupt/exception.

#### <span id="page-50-2"></span><span id="page-50-0"></span>1.3.2 Interrupts and Interrupt Handler

The interrupts change the normal flow of control. As you can see in Figure [27,](#page-50-1) on the left we see the instructions of the program; on the right we see the interrupt handler.

<span id="page-50-1"></span>

Figure 27: Program and Interrupt Handler.

An Interrupt Handler, also known as an interrupt service routine or ISR, is a special block of code associated with a specific interrupt condition. Interrupts can be of two types:

• Synchronous Interrupts (exception) are caused by a particular instruction stage. In general, the instruction  $I_i$  (see the Figure [27\)](#page-50-1) cannot be completed and needs to be restarted after the exception has been handled.

If we think about the pipeline, this condition requires undoing the effect of one or more partially executed instructions.

• Asynchronous Interrupts are caused by an I/O device requesting attention by asserting one of the prioritized interrupt request lines.

When the processor decides to process the interrupt:

- 1. It stops the current program at instruction  $I_i$ , completing all the instructions up to  $I_{i-1}$  (called **precise interrupt**, see below to understand what it is).
- 2. It saves the Program Counter (PC) of the instruction  $I_i$  in a special register called Exception Program Counter (EPC):  $PC \rightarrow EPC$ .
- 3. It disables interrupts and transfer control to a designated interrupt handler running in the kernel mode. So it loads the Interrupt Vector Address (IVA) into the Program Counter (PC): IVA  $\rightarrow$  PC.
- <span id="page-51-0"></span>4. When the Interrupt Handler has finished, it must restore the stable situation. It uses a special indirect jump instruction called Return-From-Exception (RFE), which restores the PC and:
	- Re-enables the interrupts (again, because they were disabled in the previous step);
	- Returns the processor to user mode;
	- Restores the hardware and control state.

#### **O** Ok, but what exactly is a Precise Interrupts in Asynchronous<br> **Intermate**  $\frac{1}{2}$ Interrupts?  $\frac{1}{2}$  state and  $\frac{1}{2}$  state and  $\frac{1}{2}$  state interruptions (including the interruption)

An interrupt or exception is precise if there is a single instruction (or interrupt point) for which all previous instructions have committed their state rupt point) for which an **previous instructions** have committed their state and **no** subsequent instructions (including the interrupting instruction  $I_i$ ) and no subsequent mistractions (in "

• An interrupt or exception is **precise** if there is a single instruction (or

In other words, we can restart execution at the interrupt point and "get the right answer".



Figure 28: Example of precise interrupt/exception (interrupt point is at red lw instruction).

#### <span id="page-52-0"></span>1.3.3 Exceptions in the 5-stage pipeline

Asynchronous interrupts can occur at any stage of the pipeline. In the following figure we can see some examples of exceptions that can occur at any stage.



Figure 29: Example of exceptions in the 5-stage pipeline.

The aim of this section is to understand how to handle multiple simultaneous exceptions at different stages of the pipeline, and how and where to handle external asynchronous interrupts such as an I/O service request.

```
Example 9: Data Page Fault and Arithmetic Exception
```
In this first example of an exception in a 5-stage pipeline, we can see that pipelining can cause exceptions to be thrown out of order. In fact, we remember that exceptions can occur at different stages in the pipeline for different instructions.

For example, a Data Page Fault occurs at the memory stage of the first instruction. The Arithmetic Exception occurs in the execution phase of the second instruction.

Because of the pipelined execution, the page fault occurs at the same time as the overflow. Note, however, that the data page fault MUST be handled first.





From the previous examples, we know that out-of-order is a serious pipelining problem.

#### ✓ How can the pipeline be modified to solve these problems?

By using the exception flag and the PC.

- 1. First, hold the exception flag and pass it through the pipeline.
- 2. Hold the Program Counter (PC) and pass it through the pipeline (to be saved and restored after the Exception Handler Routine).
- 3. Finally, do not react to the exception until it reaches the commit point. In other words, wait until the end of MEM stage to raise the exception.

Please note that when the instruction reaches the Commit Point, before entering the Write Back (WB) phase, the following steps are performed:

- Store the Program Counter in the Exception Program Counter (PC  $\rightarrow$ EPC) and store the Interrupt Handler Address in the Program Counter  $(IHA \rightarrow PC).$
- Make all next instructions in previous stages NOP operations (see Figure [30](#page-54-0) on page [55\)](#page-54-0).
- Handle interrupts by "faulting nop" in the Instruction Fetch (IF) stage.

At the end of the Exception Handler Routine, the instruction is re-executed.

By adopting this solution, all exceptions are deferred to be handled in order at the MEM stage, as in any non-pipelined processor.

<span id="page-54-0"></span>

Figure 30: Make all next instructions in previous stage NOP operations.

In the following figure, we can see how the execution flow changes when an exception occurs in a pipeline.



Figure 31: This is how we solve the out-of-order problem in a pipeline architecture.

## References

- <span id="page-55-0"></span>[1] J.L. Hennessy and D.A. Patterson. Computer Architecture: A Quantitative Approach. ISSN. Elsevier Science, 2017.
- <span id="page-55-1"></span>[2] Cristina Silvano. Lesson 1, pipelining. Slides from the HPC-E master's degree course on Politecnico di Milano, 2024.

## <span id="page-56-0"></span>Index







Index