# Generating Performance Portable Code using Rewrite Rules

# From High-Level Functional Expressions to High-Performance OpenCL Code

**Michel Steuwer** 

Christian Fensch

Sam Lindley

Christophe Dubach





### The Problem(s)

- Parallel processors everywhere
- Many different types: CPUs, GPUs, ...
- Parallel programming is hard
- Optimising is even harder
- Problem: No portability of performance!









**Accelerator** 

**FPGA** 





### Case Study: Parallel Reduction in OpenCL

- Summing up all values of an array
- Comparison of 7 implementations by Nvidia
- Investigating complexity and efficiency of optimisations





#### **Unoptimised Implementation Parallel Reduction**

```
kernel void reduceO(global float* g_idata, global float* g_odata,
                    unsigned int n, local float* l data) {
 unsigned int tid = get_local_id(0);
 unsigned int i = get_global_id(0);
  l_data[tid] = (i < n) ? g_idata[i] : 0;</pre>
 barrier(CLK_LOCAL_MEM_FENCE);
 // do reduction in local memory
  for (unsigned int s=1; s < get_local_size(0); s*= 2) {</pre>
    if ((tid % (2*s)) == 0) {
     l data[tid] += l data[tid + s];
    barrier(CLK_LOCAL_MEM_FENCE);
 // write result for this work-group to global memory
  if (tid == 0) g_odata[get_group_id(0)] = l_data[0];
```



### **OpenCL Programming Model**

```
kernel void reduce0(global float* g_idata, global float* g_odata,
                    unsigned int n, local float* l data) {
 unsigned int tid = get_local_id(0);
  unsigned int i = get_global_id(0);
  l_data[tid] = (i < n) ? g_idata[i] : 0;</pre>
 barrier(CLK_LOCAL_MEM_FENCE);
 // do reduction in local memory
  for (unsigned int s=1; s < get_local_size(0); s*= 2) {</pre>
    if ((tid % (2*s)) == 0) {
      l data[tid] += l data[tid + s];
    barrier(CLK_LOCAL_MEM_FENCE);
 // write result for this work-group to global memory
  if (tid == 0) g_odata[get_group_id(0)] = l_data[0];
```

- Multiple work-items (threads) execute the same kernel function
- Work-items are organised for execution in work-groups



### **Avoid Divergent Branching**

```
kernel void reduce1(global float* g_idata, global float* g_odata,
                    unsigned int n, local float* l_data) {
  unsigned int tid = get_local_id(0);
  unsigned int i = get_global_id(0);
  l_data[tid] = (i < n) ? g_idata[i] : 0;</pre>
  barrier(CLK_LOCAL_MEM_FENCE);
  for (unsigned int s=1; s < get local size(0); s*= 2) {</pre>
    // continuous work-items remain active
    int index = 2 * s * tid;
    if (index < get_local_size(0)) {</pre>
      l data[index] += l data[index + s];
    barrier(CLK LOCAL MEM FENCE);
  if (tid == 0) g_odata[get_group_id(0)] = l_data[0];
```



#### **Avoid Interleaved Addressing**

```
kernel void reduce2(global float* g_idata, global float* g_odata,
                    unsigned int n, local float* l data) {
  unsigned int tid = get local id(0);
  unsigned int i = get_global_id(0);
  l data[tid] = (i < n) ? g idata[i] : 0;</pre>
  barrier(CLK_LOCAL_MEM_FENCE);
 // process elements in different order
 // requires commutativity
  for (unsigned int s=get local size(0)/2; s>0; s>>=1) {
    if (tid < s) {
     l_data[tid] += l_data[tid + s];
    barrier(CLK LOCAL MEM FENCE);
  if (tid == 0) g_odata[get_group_id(0)] = l_data[0];
```



#### Increase Computational Intensity per Work-Item

```
kernel void reduce3(global float* g_idata, global float* g_odata,
                    unsigned int n, local float* l data) {
  unsigned int tid = get_local_id(0);
  unsigned int i = get_group_id(0) * (get_local_size(0)*2)
                                    + get local id(0);
  l_data[tid] = (i < n) ? g_idata[i] : 0;</pre>
  // performs first addition during loading
  if (i + get local size(0) < n)</pre>
    l_data[tid] += g_idata[i+get_local_size(0)];
  barrier(CLK_LOCAL_MEM_FENCE);
  for (unsigned int s=get_local_size(0)/2; s>0; s>>=1) {
    if (tid < s) {
      l data[tid] += l data[tid + s];
    barrier(CLK LOCAL MEM FENCE);
  if (tid == 0) g_odata[get_group_id(0)] = l_data[0];
```



### **Avoid Synchronisation inside a Warp**

```
kernel void reduce4(global float* g_idata, global float* g_odata,
                    unsigned int n, local volatile float* l data) {
  unsigned int tid = get_local_id(0);
  unsigned int i = get_group_id(0) * (get_local_size(0)*2)
                                    + get local id(0);
  l data[tid] = (i < n) ? g_idata[i] : 0;</pre>
  if (i + get_local_size(0) < n)</pre>
    l_data[tid] += g_idata[i+get_local_size(0)];
  barrier(CLK_LOCAL_MEM_FENCE);
  # pragma unroll 1
  for (unsigned int s=get_local_size(0)/2; s>32; s>>=1) {
    if (tid < s) { l_data[tid] += l_data[tid + s]; }</pre>
    barrier(CLK_LOCAL_MEM_FENCE); }
  // this is not portable OpenCL code!
  if (tid < 32) {
    if (WG_SIZE >= 64) { l_data[tid] += l_data[tid+32]; }
    if (WG SIZE >= 32) { l data[tid] += l data[tid+16]; }
    if (WG SIZE >= 16) { l data[tid] += l data[tid+ 8]; }
    if (WG_SIZE >= 8) { l_data[tid] += l_data[tid+ 4]; }
    if (WG_SIZE >= 4) { l_data[tid] += l_data[tid+ 2]; }
    if (WG_SIZE >= 2) { l_data[tid] += l_data[tid+ 1]; } }
  if (tid == 0) g_odata[get_group_id(0)] = l_data[0]; }
```

#### **Complete Loop Unrolling**

```
kernel void reduce5(global float* g_idata, global float* g_odata,
                    unsigned int n, local volatile float* l_data) {
  unsigned int tid = get_local_id(0);
  unsigned int i = get_group_id(0) * (get_local_size(0)*2)
                                    + get local id(0);
 l data[tid] = (i < n) ? g idata[i] : 0;</pre>
  if (i + get_local_size(0) < n)</pre>
    l_data[tid] += g_idata[i+get_local_size(0)];
  barrier(CLK_LOCAL_MEM_FENCE);
  if (WG SIZE >= 256) {
    if (tid < 128) { l_data[tid] += l_data[tid+128]; }</pre>
    barrier(CLK_LOCAL_MEM_FENCE); }
  if (WG SIZE >= 128) {
    if (tid < 64) { l_data[tid] += l_data[tid+ 64]; }</pre>
    barrier(CLK_LOCAL_MEM_FENCE); }
  if (tid < 32) {
    if (WG SIZE >= 64) { l data[tid] += l data[tid+32]; }
    if (WG SIZE >= 32) { l_data[tid] += l_data[tid+16]; }
    if (WG_SIZE >= 16) { l_data[tid] += l_data[tid+ 8]; }
    if (WG_SIZE >= 8) { l_data[tid] += l_data[tid+ 4]; }
    if (WG SIZE >= 4) { l data[tid] += l data[tid+ 2]; }
    if (WG SIZE >= 2) { l data[tid] += l data[tid+ 1]; } }
  if (tid == 0) g odata[get_group_id(0)] = l data[0]; }
```

#### **Fully Optimised Implementation**

```
kernel void reduce6(global float* g_idata, global float* g_odata,
                    unsigned int n, local volatile float* l data) {
  unsigned int tid = get local id(0);
  unsigned int i = get group id(0) * (get local size(0)*2)
                                   + get_local_id(0);
  unsigned int gridSize = WG_SIZE * get_num_groups(0);
 l data[tid] = 0:
  while (i < n) { l_data[tid] += g_idata[i];
                  if (i + WG SIZE < n)</pre>
                   l_data[tid] += g_idata[i+WG_SIZE];
                  i += gridSize; }
  barrier(CLK LOCAL MEM FENCE);
  if (WG SIZE >= 256) {
    if (tid < 128) { l data[tid] += l data[tid+128]; }
    barrier(CLK LOCAL MEM FENCE); }
  if (WG SIZE >= 128) {
    if (tid < 64) { l data[tid] += l_data[tid+ 64]; }</pre>
    barrier(CLK LOCAL MEM FENCE); }
  if (tid < 32) {
    if (WG SIZE >= 64) { l data[tid] += l data[tid+32]; }
    if (WG SIZE >= 32) { l_data[tid] += l_data[tid+16]; }
    if (WG SIZE >= 16) { l_data[tid] += l_data[tid+ 8]; }
    if (WG_SIZE >= 8) { l_data[tid] += l_data[tid+ 4]; }
    if (WG SIZE >= 4) { l data[tid] += l data[tid+ 2]; }
    if (WG SIZE >= 2) { l data[tid] += l data[tid+ 1]; } }
  if (tid == 0) g odata[get group id(0)] = l data[0]; }
```

### **Case Study Conclusions**

- Optimising OpenCL is complex
  - Understanding of target hardware required
- Program changes not obvious
- Is it worth it? ...

```
kernel
void reduce0(global float* g_idata,
             global float* g odata,
             unsigned int n,
             local float* l data) {
  unsigned int tid = get_local_id(0);
  unsigned int i = get_global_id(0);
 l_data[tid] = (i < n) ? g_idata[i] : 0;</pre>
  barrier(CLK LOCAL MEM FENCE);
 for (unsigned int s=1;
       s < get_local_size(0); s*= 2) {</pre>
    if ((tid % (2*s)) == 0) {
      l_data[tid] += l_data[tid + s];
    barrier(CLK_LOCAL_MEM_FENCE);
 if (tid == 0)
    g odata[get_group_id(0)] = l_data[0];
```

**Unoptimized Implementation** 

```
kernel
void reduce6(global float* g_idata,
             global float* g odata.
             unsigned int n,
             local volatile float* l_data) {
 unsigned int tid = get local id(0);
 unsigned int i =
   get_group_id(0) * (get_local_size(0)*2)
                    + get local id(0);
  unsigned int gridSize =
   WG_SIZE * get_num_groups(0);
 l data[tid] = 0;
 while (i < n) {
   l_data[tid] += g_idata[i];
   if (i + WG SIZE < n)</pre>
     l_data[tid] += g_idata[i+WG SIZE];
   i += gridSize; }
 barrier(CLK_LOCAL_MEM FENCE);
 if (WG_SIZE >= 256) {
   if (tid < 128) {
     l data[tid] += l data[tid+128]: }
    barrier(CLK LOCAL MEM FENCE); }
 if (WG SIZE >= 128) {
    if (tid < 64) {
     l_data[tid] += l_data[tid+ 64]; }
   barrier(CLK LOCAL MEM FENCE); }
 if (tid < 32) {
   if (WG SIZE >= 64) {
     l data[tid] += l data[tid+32]; }
    if (WG SIZE >= 32) {
     l_data[tid] += l_data[tid+16]; }
    if (WG SIZE >= 16) {
     l_data[tid] += l_data[tid+ 8]; }
    if (WG SIZE >= 8) {
     l data[tid] += l data[tid+ 4]; }
    if (WG SIZE >= 4) {
     l_data[tid] += l_data[tid+ 2]; }
    if (WG SIZE >= 2) {
     l data[tid] += l data[tid+ 1]; } }
 if (tid == 0)
    g odata[get group id(0)] = l data[0];
```

**Fully Optimized Implementation** 

#### **Performance Results Nvidia**



- ... Yes! Optimising improves performance by a factor of 10!
- Optimising is important, but ...



#### **Performance Results AMD and Intel**





- ... unfortunately, optimisations in OpenCL are not portable!
- Challenge: how to achieving portable performance?



# Generating Performance Portable Code using Rewrite Rules



Goal: automatic generation of Performance Portable code



## **Example Parallel Reduction** ③



# rewrite rules





```
kernel
void reduce6(global float* g_idata,
             global float* g odata.
             unsigned int n,
             local volatile float* l_data) {
 unsigned int tid = get local id(0);
 unsigned int i =
    get group id(0) * (get local size(0)*2)
                    + get local id(0);
 unsigned int gridSize =
    WG SIZE * get_num_groups(0);
 l data[tid] = 0;
  while (i < n) {
    l_data[tid] += g_idata[i];
    if (i + WG SIZE < n)</pre>
      l_data[tid] += g_idata[i+WG_SIZE];
    i += gridSize; }
  barrier(CLK_LOCAL_MEM_FENCE);
 if (WG_SIZE >= 256) {
    if (tid < 128) {
      l data[tid] += l data[tid+128]; }
    barrier(CLK LOCAL MEM FENCE); }
 if (WG SIZE >= 128) {
    if (tid < 64) {
      l_data[tid] += l_data[tid+ 64]; }
    barrier(CLK_LOCAL_MEM_FENCE); }
 if (tid < 32) {
    if (WG SIZE >= 64) {
      l data[tid] += l data[tid+32]; }
    if (WG SIZE >= 32) {
      l_data[tid] += l_data[tid+16]; }
    if (WG SIZE >= 16) {
     l_data[tid] += l_data[tid+ 8]; }
    if (WG SIZE >= 8) {
      l data[tid] += l data[tid+ 4]; }
    if (WG SIZE >= 4) {
     l_data[tid] += l_data[tid+ 2]; }
    if (WG SIZE >= 2) {
      l data[tid] += l data[tid+ 1]; } }
 if (tid == 0)
    g_odata[get_group_id(0)] = l_data[0];
```

### **Example Parallel Reduction** ③



```
rewrite rules code generation
```

```
vecSum = reduce o join o map-workgroup (
    join o toGlobal (map-local (map-seq id)) o split 1 o
    join o map-warp (
        join o map-lane (reduce-seq (+) 0) o split 2 o reorder-stride 1 o
        join o map-lane (reduce-seq (+) 0) o split 2 o reorder-stride 2 o
        join o map-lane (reduce-seq (+) 0) o split 2 o reorder-stride 4 o
        join o map-lane (reduce-seq (+) 0) o split 2 o reorder-stride 8 o
        join o map-lane (reduce-seq (+) 0) o split 2 o reorder-stride 16 o
        join o map-lane (reduce-seq (+) 0) o split 2 o reorder-stride 32
        ) o split 64 o
        join o map-local (reduce-seq (+) 0) o split 2 o reorder-stride 64 o
        join o toLocal (map-local (reduce-seq (+) 0)) o
        split (blockSize/128) o reorder-stride 128
        ) o split blockSize
```

```
THE UNIVERSITY of EDINBURGH informatics
```

```
kernel
void reduce6(global float* g_idata,
             global float* g odata,
             unsigned int n,
             local volatile float* l_data) {
 unsigned int tid = get local id(0);
 unsigned int i =
   get_group_id(0) * (get_local_size(0)*2)
                    + get local id(0);
 unsigned int gridSize =
   WG_SIZE * get_num_groups(0);
 l data[tid] = 0;
 while (i < n) {
   l_data[tid] += g_idata[i];
    if (i + WG SIZE < n)</pre>
      l_data[tid] += g_idata[i+WG_SIZE];
   i += gridSize; }
 barrier(CLK LOCAL MEM FENCE);
 if (WG_SIZE >= 256) {
   if (tid < 128) {
     l data[tid] += l data[tid+128]; }
   barrier(CLK LOCAL MEM FENCE); }
 if (WG SIZE >= 128) {
    if (tid < 64) {
      l data[tid] += l data[tid+ 64]; }
    barrier(CLK LOCAL MEM FENCE); }
 if (tid < 32) {
    if (WG SIZE >= 64) {
      l_data[tid] += l_data[tid+32]; }
    if (WG SIZE >= 32) {
      l_data[tid] += l_data[tid+16]; }
    if (WG SIZE >= 16) {
      l_data[tid] += l_data[tid+ 8]; }
    if (WG SIZE >= 8) {
      l data[tid] += l data[tid+ 4]; }
    if (WG SIZE >= 4) {
      l_data[tid] += l_data[tid+ 2]; }
    if (WG SIZE >= 2) {
      l data[tid] += l data[tid+ 1]; } }
 if (tid == 0)
    g odata[get group id(0)] = l data[0];
```

# Algorithmic Primitives

$$\begin{split} map_{A,B,I}: (A \to B) &\to [A]_I \to [B]_I \\ zip_{A,B,I}: [A]_I \to [B]_I \to [A \times B]_I \\ reduce_{A,I}: ((A \times A) \to A) \to A \to [A]_I \to [A]_1 \\ split_{A,I}: (n: \text{size}) \to [A]_{n \times I} \to [[A]_n]_I \\ join_{A,I,J}: [[A]_I]_J \to [A]_{I \times J} \\ iterate_{A,I,J}: (n: \text{size}) \to ((m: \text{size}) \to [A]_{I \times m} \to [A]_m) \to \\ [A]_{I^n \times J} \to [A]_J \\ reorder_{A,I}: [A]_I \to [A]_I \end{split}$$



## **1** High-Level Programs

$$scal = \lambda \ a.map \ (*a)$$
 $asum = reduce \ (+) \ 0 \circ map \ abs$ 
 $dot = \lambda \ xs \ ys.(reduce \ (+) \ 0 \circ map \ (*)) \ (zip \ xs \ ys)$ 
 $gemv = \lambda \ mat \ xs \ ys \ \alpha \ \beta.map \ (+) \ ($ 
 $zip \ (map \ (scal \ \alpha \circ dot \ xs) \ mat) \ (scal \ \beta \ ys) \ )$ 



## **Example Parallel Reduction** ③



# rewrite rules



kernel

unsigned int i =

void reduce6(global float\* g\_idata,

global **float**\* g odata.

get group id(0) \* (get local size(0)\*2)

local volatile float\* l\_data) {

unsigned int n,

unsigned int tid = get local id(0);



```
+ get local id(0);
unsigned int gridSize =
  WG SIZE * get_num_groups(0);
l data[tid] = 0;
while (i < n) {
  l_data[tid] += g_idata[i];
  if (i + WG SIZE < n)</pre>
    l_data[tid] += g_idata[i+WG_SIZE];
  i += gridSize; }
barrier(CLK_LOCAL_MEM_FENCE);
if (WG_SIZE >= 256) {
  if (tid < 128) {
    l data[tid] += l data[tid+128]; }
  barrier(CLK LOCAL MEM FENCE); }
if (WG SIZE >= 128) {
  if (tid < 64) {
    l_data[tid] += l_data[tid+ 64]; }
  barrier(CLK_LOCAL_MEM_FENCE); }
if (tid < 32) {
  if (WG SIZE >= 64) {
    l data[tid] += l data[tid+32]; }
  if (WG SIZE >= 32) {
    l_data[tid] += l_data[tid+16]; }
  if (WG SIZE >= 16) {
    l_data[tid] += l_data[tid+ 8]; }
  if (WG SIZE >= 8) {
    l data[tid] += l data[tid+ 4]; }
  if (WG SIZE >= 4) {
    l_data[tid] += l_data[tid+ 2]; }
  if (WG SIZE >= 2) {
    l data[tid] += l data[tid+ 1]; } }
if (tid == 0)
  g_odata[get_group_id(0)] = l_data[0];
```



### **Example Parallel Reduction** 3



# rewrite rules

code generation

```
vecSum = reduce ∘ join ∘ map-workgroup (
    join ∘ toGlobal (map-local (map-seq id)) ∘ split 1 ∘
    join ∘ map-warp (
        join ∘ map-lane (reduce-seq (+) 0) ∘ split 2 ∘ reorder-stride 1 ∘
        join ∘ map-lane (reduce-seq (+) 0) ∘ split 2 ∘ reorder-stride 2 ∘
        join ∘ map-lane (reduce-seq (+) 0) ∘ split 2 ∘ reorder-stride 4 ∘
        join ∘ map-lane (reduce-seq (+) 0) ∘ split 2 ∘ reorder-stride 8 ∘
        join ∘ map-lane (reduce-seq (+) 0) ∘ split 2 ∘ reorder-stride 16 ∘
        join ∘ map-lane (reduce-seq (+) 0) ∘ split 2 ∘ reorder-stride 32
        ) ∘ split 64 ∘
        join ∘ map-local (reduce-seq (+) 0) ∘ split 2 ∘ reorder-stride 64 ∘
        join ∘ toLocal (map-local (reduce-seq (+) 0)) ∘
```

```
THE UNIVERSITY of EDINBURGH informatics
```

) ∘ split blockSize

*split* (blockSize/128) • reorder-stride 128

```
kernel
void reduce6(global float* g_idata,
             global float* g odata,
             unsigned int n,
             local volatile float* l_data) {
 unsigned int tid = get local id(0);
 unsigned int i =
   get_group_id(0) * (get_local_size(0)*2)
                    + get local id(0);
 unsigned int gridSize =
   WG_SIZE * get_num_groups(0);
 l data[tid] = 0;
 while (i < n) {
    l_data[tid] += g_idata[i];
    if (i + WG SIZE < n)</pre>
      l_data[tid] += g_idata[i+WG_SIZE];
    i += gridSize; }
 barrier(CLK LOCAL MEM FENCE);
 if (WG_SIZE >= 256) {
   if (tid < 128) {
     l data[tid] += l data[tid+128]; }
    barrier(CLK LOCAL MEM FENCE); }
 if (WG SIZE >= 128) {
    if (tid < 64) {
      l_data[tid] += l_data[tid+ 64]; }
    barrier(CLK LOCAL MEM FENCE); }
 if (tid < 32) {
    if (WG SIZE >= 64) {
     l_data[tid] += l_data[tid+32]; }
    if (WG SIZE >= 32) {
      l_data[tid] += l_data[tid+16]; }
   if (WG SIZE >= 16) {
     l_data[tid] += l_data[tid+ 8]; }
    if (WG SIZE >= 8) {
      l data[tid] += l data[tid+ 4]; }
    if (WG SIZE >= 4) {
      l_data[tid] += l_data[tid+ 2]; }
    if (WG SIZE >= 2) {
      l data[tid] += l data[tid+ 1]; } }
 if (tid == 0)
    g odata[get group id(0)] = l data[0];
```

## 2 Algorithmic Rewrite Rules

- Provably correct rewrite rules
- Express algorithmic implementation choices

#### Split-join rule:

```
map \ f \rightarrow join \circ map \ (map \ f) \circ split \ n
```

#### Map fusion rule:

```
map \ f \circ map \ g \to map \ (f \circ g)
```

#### Reduce rules:

```
reduce f \ z \to reduce \ f \ z \circ reduce Part \ f \ z

reduce Part \ f \ z \to reduce Part \ f \ z \circ reorder

reduce Part \ f \ z \to join \ \circ map \ (reduce Part \ f \ z) \circ split \ n

reduce Part \ f \ z \to iterate \ n \ (reduce Part \ f \ z)
```



## **② OpenCL Primitives**

#### **Primitive**

**OpenCL concept** 

mapGlobal

Work-items

workgroups global threads local threads

map Workgroup

mapLocal

Work-groups

mapSeq

reduceSeq

Sequential implementations

toLocal , toGlobal

Memory areas

map Vec, split Vec, join Vec

**Vectorization** 



## 2 OpenCL Rewrite Rules

Express low-level implementation and optimisation choices

#### Map rules:

```
map \ f \rightarrow map \ Workgroup \ f \mid map Local \ f \mid map Global \ f \mid map Seq \ f
```

#### Local/ global memory rules:

```
mapLocal\ f \rightarrow toLocal\ (mapLocal\ f) mapLocal\ f \rightarrow toGlobal\ (mapLocal\ f)
```

#### **Vectorisation rule:**

```
map\ f \rightarrow join\ Vec \circ map\ (map\ Vec\ f) \circ split\ Vec\ n
```

#### Fusion rule:

 $reduceSeq\ f\ z\circ mapSeq\ g \rightarrow reduceSeq\ (\lambda\ (acc,x).\ f\ (acc,g\ x))\ z$ 



## **Example Parallel Reduction** ③



# rewrite rules





```
kernel
void reduce6(global float* g_idata,
             global float* g odata.
             unsigned int n,
             local volatile float* l_data) {
 unsigned int tid = get local id(0);
 unsigned int i =
    get group id(0) * (get local size(0)*2)
                    + get local id(0);
 unsigned int gridSize =
    WG SIZE * get_num_groups(0);
 l data[tid] = 0;
  while (i < n) {
    l_data[tid] += g_idata[i];
    if (i + WG SIZE < n)</pre>
      l_data[tid] += g_idata[i+WG_SIZE];
    i += gridSize; }
  barrier(CLK_LOCAL_MEM_FENCE);
 if (WG_SIZE >= 256) {
    if (tid < 128) {
      l data[tid] += l data[tid+128]; }
    barrier(CLK LOCAL MEM FENCE); }
 if (WG SIZE >= 128) {
    if (tid < 64) {
      l_data[tid] += l_data[tid+ 64]; }
    barrier(CLK_LOCAL_MEM_FENCE); }
 if (tid < 32) {
    if (WG SIZE >= 64) {
      l data[tid] += l data[tid+32]; }
    if (WG SIZE >= 32) {
      l_data[tid] += l_data[tid+16]; }
    if (WG SIZE >= 16) {
     l_data[tid] += l_data[tid+ 8]; }
    if (WG SIZE >= 8) {
      l data[tid] += l data[tid+ 4]; }
    if (WG SIZE >= 4) {
     l_data[tid] += l_data[tid+ 2]; }
    if (WG SIZE >= 2) {
      l data[tid] += l data[tid+ 1]; } }
 if (tid == 0)
    g_odata[get_group_id(0)] = l_data[0];
```

## **Example Parallel Reduction** ③



rewrite rules

rewrite rules code generation





```
kernel
void reduce6(global float* g_idata,
             global float* g odata.
             unsigned int n,
             local volatile float* l_data) {
 unsigned int tid = get local id(0);
 unsigned int i =
    get group id(0) * (get local size(0)*2)
                    + get local id(0);
  unsigned int gridSize =
    WG_SIZE * get_num_groups(0);
 l_{data}[tid] = 0:
  while (i < n) {
    l_data[tid] += g_idata[i];
    if (i + WG SIZE < n)</pre>
      l_data[tid] += g_idata[i+WG_SIZE];
    i += gridSize; }
  barrier(CLK_LOCAL_MEM FENCE);
 if (WG_SIZE >= 256) {
    if (tid < 128) {
      l data[tid] += l data[tid+128]; }
    barrier(CLK LOCAL MEM FENCE); }
 if (WG SIZE >= 128) {
    if (tid < 64) {
      l_data[tid] += l_data[tid+ 64]; }
    barrier(CLK_LOCAL_MEM_FENCE); }
 if (tid < 32) {
    if (WG SIZE >= 64) {
      l data[tid] += l data[tid+32]; }
    if (WG SIZE >= 32) {
      l_data[tid] += l_data[tid+16]; }
    if (WG SIZE >= 16) {
     l_data[tid] += l_data[tid+ 8]; }
    if (WG SIZE >= 8) {
      l data[tid] += l data[tid+ 4]; }
    if (WG SIZE >= 4) {
      l_data[tid] += l_data[tid+ 2]; }
    if (WG SIZE >= 2) {
      l data[tid] += l data[tid+ 1]; } }
 if (tid == 0)
    g odata[get group id(0)] = l data[0];
```

## **③ Pattern based OpenCL Code Generation**

Generate OpenCL code for each OpenCL primitive















 $reduce (+) 0 \circ join \circ map (reducePart (+) 0) \circ split n$ 





Fully automated search for good implementations possible

- For each node in the tree:
  - Apply one rule and randomly sample subtree
  - Repeat for node with best performing subtree





- For each node in the tree:
  - Apply one rule and randomly sample subtree
  - Repeat for node with best performing subtree





- For each node in the tree:
  - Apply one rule and randomly sample subtree
  - Repeat for node with best performing subtree





- For each node in the tree:
  - Apply one rule and randomly sample subtree
  - Repeat for node with best performing subtree





- For each node in the tree:
  - Apply one rule and randomly sample subtree
  - Repeat for node with best performing subtree

 $reduce (+) 0 \circ join \circ map (reducePart (+) 0) \circ split n$ 



③ repeat process

# Search Results Automatically Found Expressions

 $asum = reduce (+) 0 \circ map \ abs$ 



```
Nvidia GPU \lambda x. (reduceSeq \circ join \circ join \circ mapWorkgroup \ ( toGlobal \ (mapLocal \ (reduceSeq \ (\lambda(a,b).\ a + (abs\ b))\ 0)) \circ reorderStride\ 2048 \\ ) \circ split\ 128 \circ split\ 2048)\ x \\ \hline \lambda x. (reduceSeq \circ join \circ joinVec \circ join \circ mapWorkgroup \ ( mapLocal \ (reduceSeq \ (mapVec\ 2\ (\lambda(a,b).\ a + (abs\ b)))\ 0 \circ reorderStride\ 2048 \\ ) \circ split\ 128 \circ splitVec\ 2 \circ split\ 4096)\ x \\ \hline \lambda x. (reduceSeq \circ join \circ mapWorkgroup \ (join \circ joinVec \circ mapLocal\ ( reduceSeq \ (mapVec\ 4\ (\lambda(a,b).\ a + (abs\ b)))\ 0 \\ \hline \lambda x. (reduceSeq \ (mapVec\ 4\ (\lambda(a,b).\ a + (abs\ b)))\ 0 \\ ) \circ splitVec\ 4 \circ split\ 32768) \circ split\ 32768)\ x
```

Search on: Nvidia GTX 480 GPU, AMD Radeon HD 7970 GPU, Intel Xeon E5530 CPU



#### Search Results Search Efficiency







- Overall search on each platform took < 1 hour</li>
- Average execution time per tested expression < 1/2 second</li>



#### **Performance Results**

#### vs. Portable Implementation



Up to 20x speedup on fairly simple benchmarks vs. portable clBLAS implementation



#### **Performance Results**

#### vs. Hardware-Specific Implementations







- Automatically generated code vs. expert written code
- · Competitive performance vs. highly optimised implementations
- Up to 4.5x speedup for gemv on AMD



#### **Summary**

- OpenCL code is not performance portable
- Our approach uses
  - functional high-level primitives,
  - OpenCL-specific low-level primitives, and
  - rewrite-rules to generate performance portable code.
- Rewrite-rules define a space of possible implementations
- Performance on par with specialised, highly-tuned code

Michel Steuwer — <u>michel.steuwer@ed.ac.uk</u>

