


Question

Consider the following tournament branch predictor that employs a selector with 16K entries (3-bit saturating counters). The selector picks a prediction out of either a global predictor (a 16-bit global history is XOR-ed with 16 bits of the branch PC to index into a table of 2-bit saturating counters) or a local predictor (16 bits of the branch PC index into level-1; 10 bits of local history are XOR-ed with 10 bits of the branch PC to generate the index into level-2, which has 3-bit saturating counters). What is the total capacity of the entire branch prediction system?

Explanation / Answer

Answer:
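
One way to tally the storage is to multiply each table's entry count by its entry width and sum the results. The sketch below does that under some assumptions that the question does not state outright: "16K" means 2^14 entries, the level-1 local table has 2^16 entries that each hold a 10-bit local history, the level-2 local table has 2^10 entries, and no tags or other overhead are counted. The function name and dictionary keys are made up for illustration.

```python
def predictor_capacity_bits():
    """Return per-table and total storage (in bits) for the tournament predictor."""
    tables = {
        # Selector: 16K entries, 3-bit saturating counter each.
        "selector": (16 * 1024) * 3,
        # Global predictor: 16-bit index -> 2^16 entries, 2-bit counter each.
        "global": (2 ** 16) * 2,
        # Local level-1: 16 PC bits -> 2^16 entries, each a 10-bit local history
        # (assumption: the entry width equals the local history length).
        "local_level1": (2 ** 16) * 10,
        # Local level-2: 10-bit index -> 2^10 entries, 3-bit counter each.
        "local_level2": (2 ** 10) * 3,
    }
    tables["total"] = sum(tables.values())
    return tables


if __name__ == "__main__":
    for name, bits in predictor_capacity_bits().items():
        print(f"{name:13s} {bits:7d} bits = {bits / 1024:g} Kb")
```

Under these assumptions the tables come to 48 Kb, 128 Kb, 640 Kb, and 3 Kb, for a total of 819 Kb (838,656 bits, or roughly 102.4 KB).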

======

Consider an out-of-order processor similar to the one described in class. The architecture has 32 logical registers (also known as architected registers or program-defined registers, and indicated as LR*) and 38 physical registers (indicated as PR*). On power-up, the following program starts executing. To simplify the problem, some of the initialization code is not shown, and you can ignore that code. The loop in the program is executed for at least three iterations.

line1: L.D LR1, 0(LR2)
DADD LR1, LR1, LR3
ST.D LR1, 0(LR2)
DADD LR2, LR2, 8
BNE LR2, LR4, line1

The processor has a width of 5, i.e., every pipeline stage can move up to 5 instructions through in every cycle. Show the renamed code for the first 15 instructions of this program. In what cycle will the 15th instruction get committed?

Assumptions:
Assume that branch prediction is perfect for a simple program like this. With the help of a trace cache, even fetch is perfect. Assume that caches are perfect as well. Assume that the dependent of a DADD instruction can leave the issue queue in the cycle right after the DADD. Assume that the dependent of an L.D cannot leave in the next cycle, but can leave in the cycle after that. Assume a ROB, an issue queue, and an LSQ with 20 entries each. When the thread starts executing, its logical register LR1 is mapped to physical register PR1, LR2 is mapped to PR2, and so on. An instruction goes through 5 pipeline stages before it gets placed in the issue queue and an additional 5 pipeline stages (6 for a load or store) after it leaves the issue queue (in other words, an instruction will take a minimum of 11 cycles to go through the pipeline). When determining whether an L.D can issue, you need not check whether previous store addresses have been resolved (just to make the problem simpler). As a further simplification, assume that stores leave the issue queue when their register dependences have been fulfilled (recall that a real processor will issue a store only when the store is the oldest instruction in the ROB).
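
The renaming part of the question can be sketched with a small free-list model. This is only an illustration, not the course's reference solution: it assumes the free list initially holds PR33 through PR38 in order, that a physical register is released when the instruction that overwrote its logical register commits, and that when the free list runs dry, rename simply waits for the oldest such commit. It produces register names only and does not model cycle timing, so it says nothing about when the 15th instruction commits. The names `LOOP`, `rename`, and `stale` are hypothetical.

```python
from collections import deque

NUM_LOGICAL = 32
NUM_PHYSICAL = 38

# One loop iteration: (output template, destination logical reg or None, source logical regs).
LOOP = [
    ("L.D {d}, 0({s0})", 1, [2]),
    ("DADD {d}, {s0}, {s1}", 1, [1, 3]),
    ("ST.D {s0}, 0({s1})", None, [1, 2]),
    ("DADD {d}, {s0}, 8", 2, [2]),
    ("BNE {s0}, {s1}, line1", None, [2, 4]),
]


def rename(iterations):
    """Rename `iterations` copies of the loop body and return the renamed lines."""
    mapping = {lr: lr for lr in range(1, NUM_LOGICAL + 1)}   # LRn starts in PRn
    free = deque(range(NUM_LOGICAL + 1, NUM_PHYSICAL + 1))   # PR33..PR38
    stale = deque()  # previous mappings, released in program order at commit
    renamed = []
    for _ in range(iterations):
        for template, dest, srcs in LOOP:
            # Read source mappings before the destination mapping is updated.
            fields = {f"s{i}": f"PR{mapping[lr]}" for i, lr in enumerate(srcs)}
            if dest is not None:
                if not free:
                    # Assumption: rename waits until the oldest writer commits,
                    # which releases the register it made stale.
                    free.append(stale.popleft())
                new_pr = free.popleft()
                stale.append(mapping[dest])
                mapping[dest] = new_pr
                fields["d"] = f"PR{new_pr}"
            renamed.append(template.format(**fields))
    return renamed


if __name__ == "__main__":
    for line in rename(3):
        print(line)
```

Under these assumptions the first two iterations use up PR33 through PR38, and the third iteration reuses PR1, PR33, and PR2 as they are released, which is why the third iteration cannot be renamed until the earliest instructions have committed.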