Advanced Computer Architectures

Notes for the Advanced Computer Architectures course.

L00 Intro

L01 Pipelining: Basic Concepts

MIPS

  • ALU Instructions: add $s1, $s2, $s3
  • Load/Store Instructions: lw/sw $s1 offset($s2)
  • Branch Instructions: beq/bne $s1, $s2, L1

R-Format for Instructions

Phases of execution of a MIPS instruction:

  1. Instruction Fetch
  2. Instruction Decode and Register Read
  3. Execution
  4. Memory Access
  5. Write Back Cycle

Different types of instructions (R/I/J) may require different execution phases. For example, an ALU R-type instruction has no need for the memory access phase.

Pipelining

Basic idea: The execution of an instruction is divided into different phases (pipeline stages), each requiring a fraction of the time necessary to complete the instruction.

The pipeline stages must be synchronized: the duration of a clock cycle is defined by the time required by the slowest stage of the pipeline. The goal is to balance the length of each pipeline stage.

A single MIPS instruction might take 8 ns, but to achieve pipelining we split it into the 5 stages and make the clock cycle equal to the duration of the longest stage. Although each individual instruction now takes longer, the overall throughput increases thanks to pipelining.
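A quick back-of-the-envelope sketch of this trade-off (the per-stage latencies below are assumed for illustration):

```python
# Assumed per-stage latencies (ns) for IF, ID, EX, MEM, WB; total = 8 ns.
stage_latencies = [2.0, 1.0, 2.0, 2.0, 1.0]

single_instr_time = sum(stage_latencies)   # unpipelined: 8 ns per instruction
clock_cycle = max(stage_latencies)         # pipelined clock = slowest stage = 2 ns

n = 1000                                   # instructions to execute
unpipelined = n * single_instr_time
# Pipelined: the first instruction needs 5 cycles, then one completes per cycle.
pipelined = (len(stage_latencies) + n - 1) * clock_cycle

print(unpipelined, pipelined)              # 8000.0 ns vs 2008.0 ns: ~4x throughput
```

Each instruction now takes 10 ns of latency (5 stages of 2 ns) instead of 8 ns, but one instruction completes every 2 ns once the pipeline is full.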

Pipeline Hazards

  1. Structural Hazards: Use the same resource from different instructions simultaneously
  2. Data Hazards: Attempt to use a result before it is ready
  3. Control Hazards: Attempt to make a decision on the next instruction to execute before the condition is evaluated.

There are no structural hazards in the basic MIPS pipeline, while data hazards (RAW) are very possible.
WAW/WAR hazards occur more easily when instructions are executed out of order.

Data Hazards Solution:

  1. Insertion of nop
  2. Rescheduling of the code to avoid placing dependent (correlated) instructions too close together
  3. Insertion of stalls
  4. data forward or bypassing (uses temporary results stored in the pipeline registers instead of waiting for the write back of results in the RF)

The difference between a nop and a stall is that a nop is an actual instruction inserted in the code, while a stall simply freezes the pipeline for one clock cycle.

Forwarding Paths
EX/EX path
MEM/EX path
MEM/ID path
MEM/MEM path (for load/stores)
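A minimal sketch of the EX/EX and MEM/EX forwarding decisions, assuming the pipeline registers are modeled as dictionaries (field names such as reg_write and rd are illustrative):

```python
def forward_source(rs, ex_mem, mem_wb):
    """Decide where operand register rs of the instruction currently in EX comes from.

    ex_mem and mem_wb model the pipeline registers with assumed fields:
      'reg_write' - True if the older instruction writes a register
      'rd'        - its destination register number
    """
    # EX/EX path: result of the immediately preceding instruction, still in EX/MEM.
    if ex_mem['reg_write'] and ex_mem['rd'] != 0 and ex_mem['rd'] == rs:
        return 'EX/MEM'
    # MEM/EX path: result of the instruction two slots earlier, now in MEM/WB.
    if mem_wb['reg_write'] and mem_wb['rd'] != 0 and mem_wb['rd'] == rs:
        return 'MEM/WB'
    return 'RF'   # no forwarding needed: read the operand from the register file
```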

To avoid a conflict between a read and a write of the RF in the same clock cycle, we assume the RF read occurs in the second half of the clock cycle and the RF write in the first half.

Performance Metrics in Pipelining

IC = Instruction Count

Clk Cycles = IC + #Stalls + 4 (the 4 extra cycles are needed to fill the 5-stage pipeline)

CPI = Clk Cycles/IC

MIPS = f_clock / (CPI * 10^6)

The ideal CPI on a pipelined processor would be 1, but stalls cause the pipeline performance to degrade from the ideal performance.
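A quick sketch of these formulas with made-up numbers:

```python
def pipeline_metrics(ic, stalls, f_clock_hz):
    clk_cycles = ic + stalls + 4           # 4 extra cycles to fill the 5-stage pipeline
    cpi = clk_cycles / ic
    mips = f_clock_hz / (cpi * 1e6)
    return clk_cycles, cpi, mips

# Assumed example: 100 instructions, 20 stall cycles, 2 GHz clock.
print(pipeline_metrics(100, 20, 2e9))      # (124, 1.24, ~1612.9 MIPS)
```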

L02 Branch Prediction Techniques

Control hazards: Attempt to make a decision on the next instruction to fetch before the branch condition is evaluated.

Branch Hazards Solution

Conservative Assumption: stall the pipeline until the branch decision is taken, and then fetch the correct instruction.

Conservative Assumption with Forwarding

Early Evaluation of the PC: the branch outcome and target address are computed at the end of the ID stage.

Branch Prediction Techniques

Static Branch Prediction Techniques

The actions for a branch are fixed for each branch during the entire execution. The actions are fixed at compile time.

  1. Branch Always Not Taken: if the branch is actually taken, make the fetched instruction a nop.
  2. Branch Always Taken
  3. Backward Taken Forward Not Taken: backward-going branches are predicted as taken, while forward-going branches are predicted as not taken.
  4. Profile-Driven Prediction
  5. Delayed Branch: The MIPS compiler always schedules a branch-independent instruction after the branch (in the branch delay slot). (There are 4 ways to fill the delay slot: From Before / From Target / From Fall-through / From After)

Dynamic Branch Prediction Techniques

The decision that drives the branch prediction can change dynamically during program execution.

Schemes:

Branch Outcome Predictor
Branch Target Predictor/Branch Target Buffer

  1. Branch History Table
  2. Correlating Branch Predictor: Branch predictors that use the behavior of other branches to make a prediction are called Correlating Predictors or 2-level Predictors.
  3. Two-level Adaptive Branch Predictor: The first level history is recorded in one (or more) k-bit shift register called Branch History Register (BHR), which records the outcomes of the k most recent branches. The second level history is recorded in one (or more) tables called Pattern History Table (PHT) of two-bit saturating counters.
  4. Branch Target Buffer
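As an illustration of the first scheme, a minimal sketch of a Branch History Table of 2-bit saturating counters (table size and PC indexing are assumptions):

```python
class TwoBitBHT:
    """Branch History Table of 2-bit saturating counters (values 0..3).
    A counter value >= 2 means 'predict taken'."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.counters = [1] * entries        # start in 'weakly not taken'

    def _index(self, pc):
        return (pc >> 2) % self.entries      # drop the byte offset, use low PC bits

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

A two-level adaptive predictor would index the PHT with (a hash of) the BHR contents instead of the PC alone.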

L03 Instruction Level Parallelism Part I - Introduction

Dependencies

  1. Data Dependencies (True Data Dependences): RAW (data dependence) / WAR (anti-dependence) / WAW (output dependence)
  2. Name Dependencies: 2 instructions use the same register or memory location, but there is no flow of data between the instructions associated with that name.
  3. Control Dependencies

Multi-cycles: Basic Assumptions

  • We consider single-issue processors.
  • Instructions are then issued in-order.
  • Execution stage might require multiple cycles, depending on the operation type.
  • Memory stage might require multiple cycles of access time due to data cache misses.

ILP = exploit the potential overlap of execution among unrelated instructions.
Overlapping is possible only if there are NO structural/RAW/WAR/WAW/control hazards.

In a multiple-issue pipelined processor, the ideal CPI would be CPI_ideal < 1. (Dual issue means there are two lanes to execute two pipelines simultaneously.)

The issue width is the number of instructions that can be issued in a single cycle by a multiple issue processor

Two Strategies to support ILP: Dynamic Scheduling/Static Scheduling

Hazards due to data dependences that cannot be solved by forwarding cause stalls of the pipeline: no new instructions are fetched nor issued even if they are not data dependent.

Solution: Allow data independent instructions behind a stall to proceed. (Enables out-of-order execution and completion/commit)

Dynamic Scheduling: Superscalar Processor [Expensive]

Dynamic Scheduling: The hardware dynamically reorders the instruction execution to reduce pipeline stalls while maintaining data flow and exception behavior.

  • Basically: Instructions are fetched and issued in program order (in-order-issue)
  • Execution begins as soon as operands are available – possibly, out of order execution – note: possible even with pipelined scalar architectures.
  • Out-of-order execution introduces the possibility of WAR and WAW data hazards.
  • Out-of-order execution implies out-of-order completion, unless there is a re-order buffer to get in-order completion

Static Scheduling: VLIW

VLIW (Very Long Instruction Word) processors expect dependency-free code generated by the compiler

L04 Instruction Level Parallelism Part II - Scoreboard

Basic Assumptions
  • Single-issue processors
  • IF might fetch either into an IR or into a queue of pending instructions
  • Instructions are then issued from the IR or from the queue
  • EX stage may require multiple cycles
  • MEM stage may require multiple cycles

Scoreboard Pipeline: in-order issue but out-of-order execution/completion

Due to out-of-order completion WAR and WAW hazards can occur.

Scoreboard has 4 stages

  • Issue (decode and check no WAW) [INORDER]
  • Read Operands (wait until no RAW) [OUT OF ORDER]
  • Execution [OUT OF ORDER]
  • Write result (check no WAR) [OUT OF ORDER]

Optimisations:

  • Check for WAW postponed in the WRITE stage instead of in the ISSUE stage
  • Forwarding
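A minimal sketch of the hazard checks performed at the scoreboard stages (illustrative bookkeeping only; the real scoreboard keeps the full Fi/Fj/Fk/Qj/Qk/Rj/Rk status tables):

```python
# register -> functional unit that has a pending write to it (set at issue time)
pending_write = {}

def can_issue(fu_busy, dest):
    # Structural hazard: the FU must be free; WAW hazard: no pending write to dest.
    return (not fu_busy) and (dest not in pending_write)

def can_read_operands(srcs):
    # RAW hazard: every source register must have no pending write.
    return all(r not in pending_write for r in srcs)

def can_write_result(dest, pending_reads):
    # WAR hazard: no earlier instruction still has to read the old value of dest
    # (pending_reads is the set of registers still to be read by earlier instructions).
    return dest not in pending_reads
```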

L05 Instruction Level Parallelism Part III - Tomasulo

Scheme

  • ISSUE: [IN ORDER] check for structural hazards in RESERVATION STATIONS (not in FU)
  • EXECUTION: [OUT OF ORDER] When operands ready (Check for RAW hazards solved)/When FU available (Check for structural hazards in FU)
  • WRITE RESULT: [OUT OF ORDER] Execution completion depends on latency of FUs

  • REGISTER RENAMING based on Reservation Stations to avoid WAR and WAW hazards

  • Results dispatched to RESERVATION STATIONS and to RF through the Common Data Bus
  • Control is distributed on Reservation Stations
  • Reservation Stations offer a sort of data forwarding!
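A minimal sketch of how the issue stage performs the implicit renaming through the register status table (field and variable names are illustrative):

```python
RF = [0.0] * 32               # architectural register file (values)
reg_status = [None] * 32      # register number -> tag of the producing reservation station

def issue(entry, op, rd, src1, src2):
    """Issue an instruction into a free reservation-station entry (a dict with a 'tag')."""
    for src, v, q in ((src1, 'Vj', 'Qj'), (src2, 'Vk', 'Qk')):
        if reg_status[src] is not None:      # operand still being produced
            entry[q] = reg_status[src]       # wait for this tag on the Common Data Bus
            entry[v] = None
        else:                                # operand ready: copy its value now
            entry[q] = None
            entry[v] = RF[src]
    entry['op'] = op
    # Implicit renaming: rd now maps to this station's tag, so later readers wait on the
    # tag (RAW handled) and a later writer simply overwrites the mapping (no WAR/WAW on rd).
    reg_status[rd] = entry['tag']

station = {'tag': 'ADD1'}
issue(station, 'ADDD', rd=6, src1=8, src2=2)   # F6 <- F8 + F2; F6 is now renamed to 'ADD1'
```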

L06 Instruction Level Parallelism Part IV - Register Renaming

Tomasulo: Implicit Register Renaming
Explicit Register Renaming

Implicit Register Renaming

Register renaming provided by Reservation Stations (which buffer the operands of instructions) to eliminate WAR and WAW hazards

Out-of-order commit really messes up our chance to get precise exceptions!

Explicit Register Renaming

Use a physical register file that is larger than the number of registers specified by the ISA.

Since each new result may be involved in potential WAW/WAR hazards, we can allocate a fresh physical register to it. For example, consider two consecutive instructions:

DIVD F10 F0 F6
ADDD F6 F8 F2
Since the second instruction is likely to complete before the first one, to avoid the WAR hazard we store the result of ADDD in P42 (assuming the old F6 was mapped to P32); this solves the WAR problem (see the sketch below).
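A minimal rename-table sketch that reproduces this example (the initial mappings and the free list are assumed):

```python
# Current architectural -> physical mappings before the two instructions; F6 lives in P32.
rename_table = {'F0': 'P10', 'F2': 'P12', 'F6': 'P32', 'F8': 'P14'}
free_list = ['P40', 'P42', 'P43']              # free physical registers (assumed)

def rename(dest, srcs):
    """Rename the sources with the current mapping, then map dest to a fresh physical reg."""
    renamed_srcs = [rename_table[s] for s in srcs]   # sources keep the old mapping
    new_phys = free_list.pop(0)                      # allocate a free physical register
    rename_table[dest] = new_phys                    # later readers of dest see the new one
    return dest, new_phys, renamed_srcs

print(rename('F10', ['F0', 'F6']))   # DIVD F10,F0,F6: reads F6 as P32, writes P40
print(rename('F6',  ['F8', 'F2']))   # ADDD F6,F8,F2:  writes F6 into P42, so no WAR with DIVD
```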

L07 Instruction Level Parallelism Part V - VLIW

Single-Issue Processors: scalar processors that fetch and issue at most one operation in each clock cycle.

Multiple-Issue Processors require: Fetching more instructions in a cycle (higher bandwidth from the instruction cache)

Multiple-Issue Processors can be:

  • Dynamically scheduled (issue a varying number of instructions at each clock cycle).
  • Statically scheduled (issue a fixed number of instructions at each clock cycle).

VLIW

VLIW approach advantages:

  • Simpler hardware because the complexity of the control unit is moved to the compiler
  • Low power consumption
  • Good performance through extensive compiler optimization

VLIW approach disadvantages:

  • Early VLIWs were quite rigid in the instruction format and they required recompilation of programs for different versions of the hardware

Dependencies

  • RAW Hazards: Scalars/superscalars generate NOPs/stalls or execute successive instructions (dynamic scheduling or instruction reordering)

  • WAR and WAW Hazards: they are statically solved by the compiler by correctly selecting temporal slots for the operations or by register renaming.

  • Structural hazards: they are also solved by the compiler.

  • Control hazards are solved dynamically by the hardware by flushing the execution of instructions due to a mispredicted branch.

L08 VLIW Code Scheduling

Dependence Graph

Ready List

L09 Reorder Buffer

ReOrder Buffer (ROB)

  • Buffer to hold the results of instructions that have finished execution but are not yet committed
  • Buffer to pass results among instructions that can be speculated
  • Support out-of-order execution but in-order commit

Four Steps of Speculative Tomasulo Algorithm

  1. Issue — get instruction from FP Op Queue: if a reservation station and a ROB slot are free, issue the instruction and send the operands and the ROB entry number for the destination (this stage is sometimes called "dispatch")
  2. Execution — operate on operands (EX): when both operands are ready, execute; if not ready, watch the CDB for the result; this step checks RAW hazards (sometimes called "issue")
  3. Write result — finish execution (WB): write the result on the Common Data Bus to all awaiting FUs and to the ROB; mark the reservation station available
  4. Commit — update the register with the ROB result: when the instruction is at the head of the ROB and its result is present, update the register with the result (or store to memory) and remove the instruction from the ROB. A mispredicted branch flushes the ROB (this stage is sometimes called "graduation")

3 different possible sequences in COMMIT

  1. Normal Commit
  2. Store Commit
  3. Instruction is a branch with incorrect prediction
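A minimal sketch of the in-order commit loop at the head of the ROB, covering the three cases above (entry fields are illustrative):

```python
from collections import deque

# Each ROB entry is a dict with fields: 'ready', 'is_branch', 'mispredicted',
# 'is_store', 'dest', 'addr', 'value'.
rob = deque()

def commit(register_file, memory):
    """Commit completed instructions from the head of the ROB, in program order."""
    while rob and rob[0]['ready']:
        head = rob.popleft()
        if head['is_branch']:
            if head['mispredicted']:
                rob.clear()                  # flush all speculated instructions
                return                       # fetch restarts from the correct target
        elif head['is_store']:
            memory[head['addr']] = head['value']          # store commit: write memory
        else:
            register_file[head['dest']] = head['value']   # normal commit: write the RF
```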

L10 Beyond ILP Multithreading

L11 Memory Hierarchy

Main Goal

To increase the performance of a computer through the memory system in order to:

  • Provide the user with the illusion of a memory that is simultaneously large and fast
  • Provide the data to the processor at high frequency

Locality

  • Temporal Locality: when there is a reference to one memory element, the trend is to refer again to the same memory element soon (e.g., instructions and data reused in loop bodies)
  • Spatial Locality: when there is a reference to one memory element, the trend is to refer soon to other memory elements whose addresses are close by (e.g., sequences of instructions, or accesses to data organized as arrays or matrices)

  • Exploit temporal locality by keeping the contents of recently accessed locations.
  • Exploit spatial locality by fetching blocks of data around recently accessed locations.

Levels of memory hierarchy: Registers->L1 Cache->L2 Cache-> Main Memory

Basic Concepts

The memory hierarchy is composed of several levels, but data are copied between two adjacent levels. Let us consider two levels: cache and main memory

The cache level is smaller, faster and more expensive

The minimum chunk of data that can be copied in the cache is the block or cache line.

Cache hit/miss: whether the requested data is found in one of the cache blocks

Hit Rate: number of memory accesses that find the data in the upper level with respect to the total number of memory accesses [Hit Rate = #hits / #memory accesses]
Hit Time: time to access the data in the upper level of the hierarchy, including the time needed to decide if the attempt of access will result in a hit or miss
Miss Rate: number of memory accesses not finding the data in the upper level with respect to the total number of memory accesses [Miss Rate = #misses / #memory accesses]
Miss Penalty: time needed to access the lower level and to replace the block in the upper level
Miss Time: Miss Time = Hit Time + Miss Penalty [typically Hit Time << Miss Penalty]
Average Memory Access Time (AMAT): AMAT = Hit Rate * Hit Time + Miss Rate * Miss Time
Average Memory Access Time (AMAT): AMAT = Hit Time + Miss Rate * Miss Penalty
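A quick numeric sketch of the two AMAT forms with assumed values, showing that they agree:

```python
hit_time = 1.0          # cycles (assumed)
miss_penalty = 50.0     # cycles (assumed)
miss_rate = 0.05
hit_rate = 1.0 - miss_rate
miss_time = hit_time + miss_penalty

amat_weighted = hit_rate * hit_time + miss_rate * miss_time   # 0.95*1 + 0.05*51 = 3.5
amat_simple   = hit_time + miss_rate * miss_penalty           # 1 + 0.05*50     = 3.5
print(amat_weighted, amat_simple)
```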

Cache Structure

  1. Valid bit: indicates whether this position contains valid data or not. At bootstrap, all the entries in the cache are marked as INVALID
  2. Cache Tag: contains the value that uniquely identifies the memory address corresponding to the stored data.
  3. Cache Data: contains a copy of data (block or cache line)

Cache Architecture

  1. Direct Mapped Cache
  2. Fully Associative Cache
  3. n-way Set-Associative Cache
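A minimal sketch of how a direct-mapped cache splits an address into tag, index and block offset and performs a lookup (cache and block sizes are assumptions):

```python
BLOCK_SIZE = 16                               # bytes per block (assumed)
NUM_BLOCKS = 256                              # cache lines (assumed): a 4 KiB direct-mapped cache

OFFSET_BITS = BLOCK_SIZE.bit_length() - 1     # 4 bits of block offset
INDEX_BITS = NUM_BLOCKS.bit_length() - 1      # 8 bits of index

# One entry per line: (valid bit, tag, data block)
cache = [(False, None, None)] * NUM_BLOCKS

def lookup(addr):
    offset = addr & (BLOCK_SIZE - 1)
    index = (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    valid, stored_tag, _block = cache[index]
    hit = valid and stored_tag == tag
    return hit, hex(tag), hex(index), hex(offset)

print(lookup(0x1234))    # cold cache -> (False, '0x1', '0x23', '0x4')
```

In a fully associative cache there is no index field (the tag is compared against every line), while an n-way set-associative cache uses the index to select a set of n lines.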