cilk p4

Parallel_for loops are an important programming construct as many applications, like stencils, mainly consist of a series of parallel_for loops. Consider heat, one such application which computes a cell’s temperature over time based on it’s neighbor- hood’s previous temperatures. A natural way to implement this application is in a set of loops which iterate over the grid of cells one timestep at a time.

Intriguingly, in many cases when an application utilizes a series of parallel_for loops, if the same cores execute the same iterations they can get good locality benefits. When performing the same set of iterations multiple times, a core is likely to be reusing data between iterations. This reuse allows the necessary data to remain in that core’s private cache, reducing cache misses. Figure 5.1a demonstrates this, by assigning loop iterations statically which results in the data used to by each iteration to remain in the core’s cache.

This unfortunately is not what happens in the Cilk Plus parallel_for loops, preventing them from taking advantage of these potential locality benefits. In Cilk Plus, parallel_for loops are syntactic sugar for recursively split loop iterations which creates a task-tree that is then scheduled dynamically at runtime. This means there is no guarantee that a core will be assigned the same iterations in successive parallel_for loops and that the data it requires will be present in it’s cache. Figure 5.1b demonstrates one iteration of a Cilk Plus parallel_for . The iterations acquired by P2 depend on it acquiring half of the total iterations from P1 in the beginning. If another core was able to acquire this instead in in a successive execution, P2 would eventually find a different set of iterations. Although this parallel_for implemen- tation does not readily benefit from locality, it leverages the nice load balancing benefits of the underlying randomized work-stealing scheduler which can be important when iteration workloads differ.

In this work our aim is to develop locality-guided parallel_for loops in the Cilk Plus runtime which can both intelligently execute parallel_for loops to allow for better locality while continuing to perform dynamic load balancing. These loops should not require any additional programming effort, allow for the exploitation of locality when possible introduce minimal overhead and maintain good theoretical bounds.