Academic Integrity: tutoring, explanations, and feedback — we don’t complete graded work or submit on a student’s behalf.


Question

Consider the standard MIPS pipeline with Instruction Fetch, Instruction Decode (and register fetch), Execute (including determining whether a branch is taken and computing the branch target address), Memory, and Write Back. A branch is thus recognized in the second stage, and the branch decision is known in the third stage. Assume that 15% of all instructions are branches, that 60% of all branches are taken, and that 5% of all instructions are jumps (unconditional branches).

(a) Without any branch prediction, compute the average number of cycles lost due to branches.

(b) Assume that we use a static prediction that branches are always not taken. Compute the number of cycles lost due to branches.

(c) Assume that we use a static prediction that branches are always taken. We will assume that we still lose one cycle on a branch even if the prediction is correct (taken). Compute the number of cycles lost due to branches.

Explanation / Answer
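Here is one consistent way to set up the computation, sketched in Python. The accounting conventions are assumptions consistent with the problem statement: the branch decision and target are known at the end of the third stage, so each wrongly fetched path costs two cycles; jumps behave like always-taken branches; and under predict-taken the target is already available after decode, which is why a correctly predicted taken branch still costs one cycle.

    # Sketch of the per-instruction penalty accounting (assumptions noted above).
    f_branch, f_taken, f_jump = 0.15, 0.60, 0.05

    # (a) No prediction: stall every branch and jump until EX resolves it,
    #     losing 2 cycles each time.
    no_prediction = (f_branch + f_jump) * 2                       # 0.40

    # (b) Predict not-taken: only taken branches and jumps are mispredicted.
    predict_not_taken = f_branch * f_taken * 2 + f_jump * 2       # 0.28

    # (c) Predict taken: a correct taken prediction still costs 1 cycle;
    #     an actually not-taken branch is caught in EX and costs 2 cycles.
    predict_taken = f_branch * (f_taken * 1 + (1 - f_taken) * 2) + f_jump * 1  # 0.26

    print(no_prediction, predict_not_taken, predict_taken)

Under these assumptions the averages for (a), (b), and (c) come out to 0.40, 0.28, and 0.26 cycles lost per instruction, respectively.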

For background, here is how each stage of the classic five-stage RISC (MIPS) pipeline works. Note one difference: in this question's pipeline the branch decision and target address are produced in the Execute stage, whereas the classic design described below resolved branches in Decode.

Instruction Fetch

The instruction cache on these machines had a latency of one cycle. During the Instruction Fetch stage, a 32-bit instruction was fetched from the cache. The PC predictor sent the Program Counter (PC) to the instruction cache to read the current instruction, and at the same time predicted the address of the next instruction by incrementing the PC by 4 (all instructions were 4 bytes long). This prediction was always wrong in the case of a taken branch, jump, or exception. Later machines would use more complicated and accurate algorithms (branch prediction and branch target prediction) to guess the next instruction address.

Decode

Unlike earlier microcoded machines, the first RISC machines had no microcode. Once fetched from the instruction cache, the instruction bits were shifted down the pipeline so that simple combinational logic in each pipeline stage could produce the control signals for the datapath directly from the instruction bits. As a result, very little decoding is done in the stage traditionally called the decode stage.

All MIPS, SPARC, and DLX instructions have at most two register inputs. During the decode stage, these two register names are identified within the instruction, and the two named registers are read from the register file (the MIPS register file had 32 entries). At the same time, instruction issue logic in this stage determined whether the pipeline was ready to execute the instruction; if not, the issue logic caused both the Instruction Fetch and Decode stages to stall. On a stall cycle, these stages prevented their initial flops from accepting new bits.

If the instruction decoded was a branch or jump, its target address was computed in parallel with reading the register file. The branch condition was computed after the register file was read, and if the branch was taken, or if the instruction was a jump, the PC predictor in the first stage was assigned the branch target rather than the incremented PC. The decode stage ended up with quite a lot of hardware: the MIPS instruction set allowed branching when two registers were equal, so a 32-bit-wide AND tree ran in series after the register file read, creating a very long critical path through this stage. The branch target computation also generally required a 16-bit adder and a 14-bit incrementer. Resolving the branch in the decode stage made a single-cycle branch mispredict penalty possible, and since branches were very often taken (and thus mispredicted), it was important to keep this penalty low.
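To make the fetch/decode interplay concrete, here is a minimal Python sketch of the next-PC selection just described. The function name and the decoded tuple are illustrative assumptions, not part of any real simulator: fetch speculates PC + 4 every cycle, and Decode overrides that guess when it resolves a taken branch or a jump, squashing the one wrongly fetched instruction.

    # Minimal sketch (illustrative names) of classic-RISC next-PC selection.
    # Fetch speculates pc + 4; Decode redirects on a taken branch or jump,
    # which costs one bubble because the pc + 4 fetch is squashed.
    def next_pc(pc, decoded=None):
        # decoded: (kind, target, taken) for the instruction now in Decode,
        # or None if nothing in Decode redirects fetch this cycle.
        if decoded is not None:
            kind, target, taken = decoded
            if kind == "jump" or (kind == "branch" and taken):
                return target      # redirect fetch to the computed target
        return pc + 4              # default prediction: fall through

    # A taken branch at 0x100 with target 0x200 redirects fetch; the
    # instruction already fetched from 0x104 is discarded (one lost cycle).
    assert next_pc(0x100, ("branch", 0x200, True)) == 0x200
    assert next_pc(0x100, ("branch", 0x200, False)) == 0x104
    assert next_pc(0x100) == 0x104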
Execute

Instructions on these simple RISC machines can be divided into three latency classes according to the type of operation:

Register-register operations (single-cycle latency): add, subtract, compare, and logical operations. During the execute stage, the two arguments were fed to a simple ALU, which generated the result by the end of the stage.

Memory references (two-cycle latency): all loads from memory. During the execute stage, the ALU added the two arguments (a register and a constant offset) to produce a virtual address by the end of the cycle.

Multi-cycle instructions (many-cycle latency): integer multiply and divide and all floating-point operations. During the execute stage, the operands were fed to the multi-cycle multiply/divide unit. The rest of the pipeline was free to continue execution while the multiply/divide unit did its work, and to avoid complicating the writeback stage and issue logic, multi-cycle instructions wrote their results to a separate set of registers.

Memory Access

During this stage, single-cycle-latency instructions simply have their results forwarded to the next stage. This forwarding ensures that one- and two-cycle instructions always write their results in the same stage of the pipeline, so that just one write port to the register file is needed, and it is always available.

For direct-mapped, virtually tagged data caching, by far the simplest of the numerous data cache organizations, two SRAMs are used, one storing data and the other storing tags. If the instruction is a load, both SRAMs are read in parallel during the access stage. The single tag read from the tag SRAM is compared with the virtual address specified in the load instruction; if the two are equal, the datum just retrieved from the data SRAM is the desired element of data. This is a cache hit, and it allows the load instruction to complete the writeback stage normally. If the tag and the virtual address are not equal, the data is not in the cache, the datum retrieved from the data SRAM is useless, and the access is a cache miss. The CPU pipeline must then suspend operation while a state machine updates the cache from memory, reading the required data into the cache and optionally writing any dirty data evicted from the cache back to memory.

During a store, the tag SRAM is read to determine whether the store is a hit or a miss; on a miss, the same cache-update state machine brings the correct line into the cache. Note that this means store data cannot be written to the cache data SRAM during the access stage, because the processor does not yet know whether the correct line is resident. Instead, the store data is held in a Store Data Queue until it can be written to the cache data SRAM during the next store instruction. In a classic RISC pipeline, the Store Data Queue is just a single 32-bit register, and the virtual address being written is likewise held in the Store Address Queue, also a single 32-bit register. On more complicated pipelines, these queues can have multiple hardware registers and variable length.

To complicate things further, a load instruction immediately after a store instruction could reference the same memory address. In that case the data must come from the Store Data Queue rather than from the cache's data SRAM, so during a load the virtual address is checked against the Store Address Queue in addition to the cache tag SRAM. If the load's address matches an entry in the Store Address Queue, the associated data from the Store Data Queue is forwarded during the writeback stage instead of data from the data cache SRAM, without changing the flow of the load through the pipeline.
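The single-entry store queue and store-to-load forwarding described above can be illustrated with a toy Python model. The class name, the dict standing in for the data SRAM, and the elided tag check are all illustrative assumptions, not a description of real hardware.

    # Toy model of the single-entry Store Data/Address Queues and
    # store-to-load forwarding; tag checking and misses are elided.
    class MemStage:
        def __init__(self):
            self.cache = {}            # address -> data (stand-in for data SRAM)
            self.store_addr = None     # Store Address Queue (one entry)
            self.store_data = None     # Store Data Queue (one entry)

        def store(self, addr, data):
            # Drain the previously queued store into the cache, then hold
            # the new store in the queue: it cannot be written this cycle
            # because hit/miss is not yet known during the access stage.
            if self.store_addr is not None:
                self.cache[self.store_addr] = self.store_data
            self.store_addr, self.store_data = addr, data

        def load(self, addr):
            # Probe the Store Address Queue in parallel with the cache;
            # a queue match forwards data from the Store Data Queue.
            if addr == self.store_addr:
                return self.store_data
            return self.cache.get(addr)    # None here would mean a miss

    m = MemStage()
    m.store(0x40, 7)           # held in the queue, not yet in the cache
    assert m.load(0x40) == 7   # forwarded from the Store Data Queue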