Academic Integrity: tutoring, explanations, and feedback — we don’t complete graded work or submit on a student’s behalf.

Assignment 2 Benchmarking the memory subsystem Deadline: Thursday February 18, 2

ID: 3668551 • Letter: A

Question

Assignment 2 Benchmarking the memory subsystem Deadline: Thursday February 18, 2016 1 Analysis How is the memory subsystem structured on your machine. On Linux system likwid-topology (dev by Google) is a good tool to know how the memory system is architectured • How much DRAM memory? Memory speed? Memory Technology • How much bandwidth can your processor draw from the bus? You will need to discover width of the memory bus and clock speed of the memory bus. Full duplex/half duplex? (On intel processor, ARK often indicates something.) • What is the highest level of cache? How big is it? How is it shared across core? • Same question for all levels of cache. 2 Bandwidth The purpose of this section is to measure the maximum memory bandwidth of the different components of the system. The easiest way to ensure that is to do the minimum amount of arithmetic operation per byte of data. For each level of the memory hierarchy, measure read, write and read/write bandwidth. The easiest way of doing this is to have each core do a measurable number of memory transfer on a piece of data of a particular size. Plot each bandwidth as a function of the size of the data each core work on. To measure read bandwidth, the easiest test is often to simply compute the sum of an array. To measure write bandwidth, the easiest test is often to set a memory region to zero. To measure read/write bandwidth, the easiest test is often to copy an array in an other one. The measurement itself can be an issue because of the fill-in (what happens at the beginning) and flushout (what happens at the end). A reasonable way to measure is to loop over your main operation multiple time and time only the middle loop iterations to make sure the measurement are carried out while all the cores are busy. 3 Latency The best way to measure memory latency is to perform memory operations that are not easily predicted. Linked list are certainly king in that context. Write an element-less singly linked list. (To be clear it is simply an array of integer next that refers to itself so that you traverse it by doing current = next[current];.) For different size of the list, measure the time it takes to follow a large number of links and report the time per link followed. You will need to set the list to see what you want to see. 1 • the instruction cost of the traversal by setting the list to be core bound. • the latency by shuffling the order of the list, but be careful not to build a short cycle. • the TLB latency cost by ensuring that you are jumping memory pages every time in a hard to predict fashion. 4 Help 4.1 Make sure bandwidth is memory bound Look at the assembly to ensure the it is mostly IO. You can estimate a sufficient arithmetic to IO instruction by looking at flops/bandwidth ratio. 4.2 alignment Be wary of of alignment. Look at the difference between mm256 load ps and mm256 loadu ps. 4.3 Other interesting things mm256 stream load si256 makes a non-temporal read (won’t be cached)

Explanation / Answer

1. This is the sample output on running likwid-topology tool:

likwid-topology is part of likwid (Like I Knew What I am Doing) toolsuite. likwid is a collection of command line applications for accessing system related information. Use following steps, if you are using GNU/Linux system:

1. Download likwid3.0.0 tarball from any third party website.

2. Extract the tarball at a proper location.

3. Enter the directory into which tarball is extracted.

4. Go thorugh files README and INSTALL to know about the toolsuite and installation procedure.

.5 Follow the installation procedure described in INSTALL (typical is make, make install).

6. Once likwid is installed, check which tools are available under likwid by typing likwid- followed by a tab. This will display all tools in likwid.

7. Run likwid-topology to get information about thread and cache topology.

--------------------------------------

2. You can do this as following:

a) To compute sum of array elements:

int sum=0;

for(int i=0;i<n;i++) //n = number of elements in array

   {

     sum+=a[i]; //let a[] be an array of n integers

   }

b) Allot a memory block and assign it to zero

    int* a = new int[1024] { };

c) Copy one array to another:

   //assume both array are of same size - n.

   for (int i = 0; i < n; i++)

        {

            b[i] = a[i]; //b[ ] is the destination array, a[ ] is the source array.

       }