


Question

Relative to each other, CPUs have few threads and enormous caches, while GPUs have many threads and tiny caches. Generally speaking, we need to think carefully about any hardware architecture in which many hardware threads share a cache (usually, but not always, the last-level cache). GPUs are the extreme example of such an architecture. Suppose threads are cache friendly in the sense that each portion of a thread accesses a small subset of memory, and that this subset changes relatively slowly. Yet threads are normally interleaved with some frequency, rather than being allowed to run to completion. The technical term for this slowly changing subset of interesting data is the thread's _working set_. Now suppose that the working set of a thread is roughly the same size as the shared cache. Further suppose that the hardware schedules threads much more rapidly than the speed with which their working sets evolve. As a thought experiment, in which case is a thread more likely to find its data in cache:

Case i: We schedule many threads rapidly for short intervals, as described above.

Case ii: We schedule one thread for a relatively long time so that it owns the cache for an interval roughly equal to one phase of the evolution of its working set.

Explain.
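
Before turning to the answer, here is a minimal simulation sketch of the thought experiment. Every parameter (cache size, thread count, drift period, quantum lengths) is invented for illustration: it models a shared LRU cache whose capacity equals each thread's working set, then compares hit rates when many threads are interleaved rapidly (Case i) against one thread running long enough to own the cache (Case ii).

```python
import random
from collections import OrderedDict

# All parameters below are invented for illustration.
CACHE_LINES = 1024        # shared last-level cache capacity, in lines
N_THREADS = 8             # hardware threads sharing the cache
WS_SIZE = CACHE_LINES     # each working set ~ the cache size, per the question
DRIFT_PERIOD = 50_000     # accesses between working-set shifts ("slow evolution")
TOTAL_ACCESSES = 500_000

class LRUCache:
    """A set of cache lines with least-recently-used replacement."""
    def __init__(self, capacity):
        self.capacity, self.lines = capacity, OrderedDict()
        self.hits = self.accesses = 0

    def access(self, addr):
        self.accesses += 1
        if addr in self.lines:
            self.lines.move_to_end(addr)          # mark most-recently-used
            self.hits += 1
        else:
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)    # evict the LRU line
            self.lines[addr] = True

def run(quantum):
    """Round-robin the threads with the given quantum; return the hit rate."""
    cache = LRUCache(CACHE_LINES)
    phase = [0] * N_THREADS      # each thread's working-set "position"
    done, t = 0, 0
    while done < TOTAL_ACCESSES:
        for _ in range(quantum):
            # Thread t touches a random line in its current working set;
            # disjoint address ranges keep threads' sets from overlapping.
            cache.access(t * 10_000_000 + phase[t] + random.randrange(WS_SIZE))
            done += 1
            if done % DRIFT_PERIOD == 0:
                phase[t] += WS_SIZE // 10         # the working set drifts slowly
        t = (t + 1) % N_THREADS
    return cache.hits / cache.accesses

random.seed(0)
print(f"Case i  (quantum = 100):    hit rate = {run(100):.1%}")
print(f"Case ii (quantum = 50_000): hit rate = {run(50_000):.1%}")
```

With these assumed numbers, the long quantum of Case ii should show a far higher hit rate: a rapidly interleaved thread returns to find that its neighbors, each with a cache-sized working set of their own, have evicted nearly everything it loaded.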

Explanation / Answer


After nearly 40 years of wandering in the silicon wilderness in search of the promised land of CPU performance and power, Berkeley's Dr. David Patterson, a deity of computer architecture, handed down his famous "Three Walls." They were not etched in stone, but they may as well have been. These three immovable impediments defined the end times of ever-increasing computing performance: they would keep computer users from ever reaching the land of milk and honey and 10 GHz Pentiums. There may yet be a hole in the Walls, but for now we know them as:

"Power Wall + Memory Wall + ILP Wall = Brick Wall"

- The Power Wall means faster computers get really hot: power dissipation climbs sharply with clock frequency (see the sketch after this list).

- The Memory Wall means off-chip memory can't keep up with the CPU; even 1,000 pins on a CPU package can't deliver data fast enough.

- The ILP Wall means a deeper instruction pipeline really just digs a deeper power hole. (ILP stands for instruction-level parallelism.)
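
The Power Wall, in particular, follows from the standard dynamic-power relation for CMOS, P = alpha * C * V^2 * f. Higher clock rates generally demand higher supply voltage, so power grows much faster than frequency. A back-of-the-envelope sketch; the constants are invented, chosen only so the 2.8 GHz case lands near the 150 W figure cited below, and are not Tejas's actual design parameters:

```python
# Dynamic CMOS power: P = alpha * C * V^2 * f
# (activity factor, switched capacitance, supply voltage, clock frequency).
def dynamic_power(alpha, C, V, f):
    return alpha * C * V**2 * f

alpha, C = 0.2, 160e-9    # assumed activity factor and switched capacitance (F)

p_28 = dynamic_power(alpha, C, 1.3, 2.8e9)   # 2.8 GHz at an assumed 1.3 V
p_70 = dynamic_power(alpha, C, 1.8, 7.0e9)   # 7 GHz, assuming V must rise to 1.8 V

print(f"P @ 2.8 GHz: {p_28:5.0f} W")
print(f"P @ 7.0 GHz: {p_70:5.0f} W  ({p_70 / p_28:.1f}x)")
```

Under these assumptions, a 2.5x frequency increase costs nearly 5x the power, which is why the walls compound rather than trade off cleanly.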

Taken together, they mean that computers will stop getting faster. Worse, an engineer who optimizes against one wall aggravates the other two. That is exactly what Intel did.

Intel's Tejas hits the walls - hard

Intel engineers went pedal to the metal straight into the Power Wall, backed up, gunned the gas, and went hard into the Memory Wall.

The industry was stunned when Intel cancelled not one but two premier processor designs in May of 2004. Intel's Tejas CPU (Sanskrit for "fire") dissipated a stupendous 150 watts at 2.8 GHz, more than Hasbro's Easy-Bake Oven.

The Tejas had been projected to run at 7 GHz. It never did. When microprocessors get too hot, they quit working and sometimes blow up.

The Memory Hierarchy And The Memory Wall

The disparity between CPU clock rates and off-chip memory and disk-drive I/O rates has been growing since the 1980s; the term "memory wall" was coined in the mid-1990s to describe it. An example from the GPU world illustrates the memory wall clearly.
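
The disparity compounds year over year. A standard illustration uses the oft-cited Hennessy and Patterson growth figures, roughly 50% per year for processor performance against roughly 7% per year for DRAM; treat the rates as illustrative, since the exact numbers vary by era:

```python
# Compound the classic growth-rate gap between CPU and DRAM performance.
cpu_rate, dram_rate = 1.50, 1.07   # illustrative annual improvement factors
for years in (1, 5, 10, 20):
    gap = (cpu_rate / dram_rate) ** years
    print(f"after {years:2d} years, CPU has pulled ahead of DRAM by {gap:6.1f}x")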

In 2005, a leading-edge GPU had 192 floating-point cores; today's leading-edge GPU contains 512. In the intervening six years the primary GPU I/O pipe has not changed: 16 lanes of PCI Express Gen2 then, and 16 lanes now. As a result, per-core I/O bandwidth for GPUs has dropped by a factor of roughly 2.7 since 2005.
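
The per-core arithmetic, taking the Gen2 x16 link at face value (500 MB/s per lane per direction after 8b/10b encoding, so 8 GB/s per direction for 16 lanes):

```python
PCIE_GEN2_X16_GBS = 8.0   # GB/s per direction: 16 lanes x 500 MB/s
for year, cores in ((2005, 192), (2011, 512)):
    per_core = PCIE_GEN2_X16_GBS / cores * 1000   # MB/s per core
    print(f"{year}: {cores} cores -> {per_core:.1f} MB/s of host bandwidth per core")
print(f"drop factor: {512 / 192:.1f}x")
```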

On-chip cache memory, which is 10 to 100 times faster than off-chip DRAM, was supposed to knock down the memory wall. But cache memories have their own set of problems. The L1 and L2 caches found on ARM-based application processors occupy more than half of the chip's silicon area. Consequently, a significant fraction of processor power is consumed by cache memory, not by computation.
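
A cache only knocks down the wall when it actually hits. The standard average-memory-access-time relation, AMAT = hit_time + miss_rate * miss_penalty, makes this concrete; the latencies below are assumed round numbers, not figures from any particular chip:

```python
def amat(hit_ns, miss_rate, penalty_ns):
    """Average memory access time: hit time plus miss rate times miss penalty."""
    return hit_ns + miss_rate * penalty_ns

HIT_NS, DRAM_NS = 1.0, 100.0   # assumed: 1 ns cache hit, 100 ns DRAM access
for miss_rate in (0.01, 0.10, 0.50):
    print(f"miss rate {miss_rate:4.0%}: AMAT = {amat(HIT_NS, miss_rate, DRAM_NS):5.1f} ns")
```

At a 1% miss rate the cache largely hides DRAM latency; at 50%, the thrashing regime that rapid interleaving in Case i of the question produces, it buys almost nothing.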