Academic Integrity: tutoring, explanations, and feedback — we don’t complete graded work or submit on a student’s behalf.

CSE 401 Midterm May 15, 2017 1 pipeline (40pt) A particular (fictional) CPU has

ID: 3903575 • Letter: C

Question

CSE 401 Midterm May 15, 2017 1 pipeline (40pt) A particular (fictional) CPU has the following internal units and timings (WR and RR are write/read registers, ALU does all logic and integer nerations and there is a separate floating point unit. ID - instruction fetch and decode 150ps RR - read registers 40ps WR - write registers 60ps ALU - arithmetic and logic 180ps FPU - floating point 280ps MEM-memory access 200ps There are 6 basic instruction types: 1. LOAD: ID+RR+ALU+MEM+WR: 660ps 2. STORE: ID+RR+ALU+MEM: 600ps 3. MEMOP: ID+WR+RR+ALU+MEM: 660ps 4. LOGIC/INTEGER: ID+RR+ALU+WR: 460ps 5. FLOATING POINT: ID+RR+FPU+WR: 560ps 6. BRANCH: ID+RR+ALU: 400ps 1 cycle is 660ps for this machine, on the assumption that all instructions take 1 cycle (ignore memory delays). Outline what a pipeline would look like: Give • number of stages 5 Stages • operations in each stage • new clock rate • speedup compared to original 660ps instruction rate with no pipeline. Assume you double the number of registers, but increased address complexity would add 10ps to RR and 15ps to WR. What (if any) changes would you malee to your pipelined machine (e.g. clock or stages or both)?

Explanation / Answer

Each instruction has five phases.

Operations in each phase of instruction

Instruction fetch (IF)

Instruction decode (ID)

Execution (EX)

Memory access (MEM)

Writeback(WB)

Thus, each instruction requires five cycles to execute (CPI )= 5

Since there are five separate stages, a pipeline with one instruction for each stage can be achieved.

Consider the following classical example of pipelininig:

Instruction

Cycle 1

Cycle2

Cycle 3

Cycle 4

Cycle 5

Cycle 6

Cycle 7

Cycle 8

Cycle 9

I1

IF

ID

EX

MEM

WB

I2

IF

ID

EX

MEM

WB

I3

IF

ID

EX

MEM

WB

I 4

IF

ID

EX

MEM

WB

I 5

IF

ID

EX

MEM

WB

Instruction1

completes

Instruction2

completes

Instruction3

completes

Instruction4

completes

Instruction5

Completes

Without pipelining each instruction will be executed in a sequential order in which all the instruction phases will take 5 cycles each making a total of 25 cycles to complete 5 instructions(i1,i1,….i5)

By applying pipeling the execution of instruction will be done in a parallel therefore completing one instruction per cycle

Pipeline speedup:

In the given problem,

Mnemonic

Instruction stage

Time in pico seconds

ID

INSTRUCTION FETCH AND DECODE

180

RR

READ REGISTERS

40

WR

WRITE REGISTERS

60

ALU

ARITHMETIC AND LOGIC

180

FPU

FLOATING POINT

280

MEM

MEMORY ACCESS

200

No. of pipeline stages: 5

Each pipeline stage:660ps

(1)

LOAD : ID+RR+ALU+MEM÷WR (660ps)

Speedup=660/5=132ps

(2)

STORE : ID+RR+ALU+MEM (660ps)

Speedup=660/5=132ps

(3)

MEMOP: ID+WR+RR+ALU+MEM (660ps)

Speedup=660/5=132ps

(4)

LOGIC/INTEGER. ID+RR+ALU+WR (460ps)

Speedup=460/5=92ps

(5)

FLOATING POINT: ID+RR+FPU+WR (560ps)

Speedup=560/5=112ps

(6)

BRANCH: ID+RR+ALU (400PS)

Speedup=400/5=80ps

INSTRUCTION NAME

/CYCLE NO.;

1

2

3

4

5

6

7

8

LOAD

ID

RR

ALU

MEM

WR

STORE

ID

RR

ALU

MEM

MEMOP

ID

WR

RR

ALU

MEM

LOGIC/INT

ID

RR

ALU

WR

FLOATING PT

ID

RR

FPU

WR

BRANCH

ID

RR

ALU

Theoretical speedup of a pipeline

S=nT/(k+n?1)t

where:

n is the no of tasks=(6)

T is the time to process a task=(660ps)

k is the number of stages =(5)

t is the time per stage =(280 considering maximum value of FPUoperation)

S=6(660)/(5+6-1)(280) =1.414286

So the speed up is 1.4 times faster in pipelined execution

Average no of cycles taken by an instruction in the given instruction set=

((660)*5 + (660)*4 + (660)*5+ (460)*4 + (560)*4+ (400)*3 = 14520)/(6*(660)) =3.67 clock cycles

No. Of cycles taken per instruction on average = 3.67

Assuming each cycle takes 660ps and ignoring memory delays

Assuming no. Of registers are doubled,

Address complexity of

RR=40+10=50ps

WR=60+15=75ps

By adding more number of stages we might reduce or maintain the same throughput of the machine

as the latency is inversely proportional to number of stages in pipeline ; increase in no of stages will reduce the average clock cycles needed for an instruction

Instruction

Cycle 1

Cycle2

Cycle 3

Cycle 4

Cycle 5

Cycle 6

Cycle 7

Cycle 8

Cycle 9

I1

IF

ID

EX

MEM

WB

I2

IF

ID

EX

MEM

WB

I3

IF

ID

EX

MEM

WB

I 4

IF

ID

EX

MEM

WB

I 5

IF

ID

EX

MEM

WB

Instruction1

completes

Instruction2

completes

Instruction3

completes

Instruction4

completes

Instruction5

Completes