Occa: New kernel qualifier macro

Created on 2 Apr 2020  ·  22Comments  ·  Source: libocca/occa

It is often helpful to give the kernel compiler some extra information about the number of threads (work-items) in a thread-block (work-group). For instance we can give the HIP compiler an upper bound on the number of threads in a thread-block (say 1024) as follows:

__launch_bounds__(1024) __global__ void fooKernel(...) { ... }

In fact for the current HIP release it is unfortunately the case that this must be specified when the thread-block size exceeds 256 (see https://github.com/ROCm-Developer-Tools/HIP/issues/1310 )

CUDA also has the same attribute. There is also an extra argument to the launch bounds qualifier for minimum number of thread-blocks (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#launch-bounds).

The kernel qualifiers in OpenCL are slightly different (see 6.7.2 of https://www.khronos.org/registry/OpenCL/specs/2.2/pdf/OpenCL_C.pdf )

Proposal v1 - in an ideal world when we know the thread-block size at compile time OCCA will add kernel qualifiers for appropriate launch bounds (CUDA, HIP) or work-group size hint (OpenCL).

Proposal v2 - If it is too complicated to do Proposal v1, then it would neat to add an okl attribute for launch bounds @qualifier("inner sizes", B) where B could be a compiler define. This would be expanded to __launch_bounds__(value of B) for CUDA/HIP or __attribute__((work_group_size_hint(value of B))) for OpenCL. Multi-dim variant would also be helpful.

feature

All 22 comments

Since this seems specific to HIP/CUDA/OpenCL, what about passing it in as a build property?

  addVectors = device.buildKernel("addVectors.okl",
                                  "addVectors",
                                  "launch_bounds: 1024");

That would be good I think.

I just ran into this. I'm happy to contribute this but I have a question. Does OCCA know the inner loop dimensions at JIT-time?

I believe the loop dimensions can be passed in as kernel arguments.

That's true, but I don't understand why that's useful. I'm asking how occa can emit a __launch_bounds__ attribute at JIT-time. It needs to know the loop dimensions at that point. This is orthogonal to the kernel arguments, isn't it?

He's saying the dims of the inner loops can be passed as arguments, i.e. we don't necessarily know the threadblock dimensions at JIT-compile time.

Oh right. I see what you're saying now @tcew. Good point.

This is what I had in mind:

@kernel void runtimeArgs(const int B,
                         const int T,
                         const int N,
                         const float *x,
                         const float *y,
                         float *xPy) {
  for (int b=0;b<B;++b;@outer(0)){
    for (int t=0;t<T;++t;@inner(0)){

      if(b==0 && t==0) printf("B=%d, T=%d\n", B, T);

      int  n = t + T*b;
      if(n<N){
        xPy[n] = x[n] + y[n];
      }
    }
  }
}

OCCA obviously cannot know the numerical loop bounds at JIT-time.

It does however, create a launcher that sets the dimensions of the thread grid:

extern "C" void runtimeArgs(occa::modeKernel_t * *deviceKernel,
                            const int & B,
                            const int & T,
                            const int & N,
                            occa::modeMemory_t * x,
                            occa::modeMemory_t * y,
                            occa::modeMemory_t * xPy) {
  {
    occa::dim outer, inner;
    outer.dims = 1;
    inner.dims = 1;
    int b = 0;
    outer[0] = B - 0;
    int t = 0;
    inner[0] = T - 0;
    occa::kernel kernel(deviceKernel[0]);
    kernel.setRunDims(outer, inner);
    kernel(B, T, N, x, y, xPy);
  }
}

Thus the user can specify any size for the loop bounds at runtime.

Crossed posts.

SYCL uses similar syntax:

sycl::range<2> global_range(Bx*Tx,By*Ty);
sycl::range<2> local_range(Tx,Ty);
sycl::nd_range<2> kernel_range(global_range, local_range);

device_queue.submit([&](sycl::handler &cgh) {
    ...
  cgh.parallel_for(kernel_range, kernel);
});

(Buffers have a range associated with them, which avoids the need to pass in N)

The SYCL sycl::range syntax is adapted from OpenCL, which itself is adapted from CUDA.

In your example the specification of the threading dimensions is separate from the body of the parallel for loops.

The OCCA OKL syntax is specifically designed to bring the loop dimensions and body code into a more familiar parallel for loop syntax.

for (int b=0;b<B;++b;@outer(0)){ /*  grid dimension defined here */
    for (int t=0;t<T;++t;@inner(0)){ /* thread block dimension defined here */

      if(b==0 && t==0) printf("B=%d, T=%d\n", B, T);

      int  n = t + T*b;
      if(n<N){
        xPy[n] = x[n] + y[n];
      }
    }
  }

The code inside the kernel is supposed to keep the parallel for loop bounds in close proximity to the body of the parallel for loops. Also the loop bound is not specified by any input array, since a more general kernel may require a very different thread grid configuration to the data arrays.

The OKL kernel constructions was an intentional choice born of having to repeatedly explain CUDA/OpenCL kernel syntax, kernel launch parameters, and kernel threading philosophy when training people.

Does SYCL have an analogous thread-block size hint like CUDA/HIP's __launch_bounds__ we're discussing here?

Good question. I double-checked the SYCL standard (v1.2.1) to find out. Any attributes available in OpenCL C are supported and can be given with the C++ 11 attribute specifier using the cl namespace. For example, __attribute__(((reqd_work_group_size(n))) in OpenCL C is equivalent to [[cl::reqd_work_group_size(n)]] in SYCL.

There are two flavours available to specify the thread-group size: work_group_size_hint(n) is the soft version—suggesting the thread-group size will be n—whereas req_work_group_size(n) is a strict requirement.

Some options for OCCA:

  1. add a "innerDimHint" member function to the occa::kernel class, that forces a recompile (if not already in the hash) with thread dim hint for CUDA/OpenCL/HIP.

  2. add some logic inside the launcher to trigger a recompile done when a new thread array size is specified. This featuremight be turned on/off by some OCCA_* define.

Both of these can be done in a backwards compatible way.

I would actually trust the separate runtimes to manage this, and opt to not recompile anything. In essence, use @dmed256's original proposal of making it a build prop, then adding the respective __launch_bound__ hint to the kernel at translation if the backend supports it.

For CUDA and HIP, __launch_bound__ is really just a hint, so maybe more akin to OpenCL's work_group_size_hint. It's only used to tell the compiler how many registers it can assume will be available to each thread in the block. If the user violates the launch bound, it's not necessarily an error, since the kernel may not heavily be using register. In the case where the user violates the launch bound and there is indeed not enough register, the runtime will throw a error that OCCA should catch.

When running an auto tuner I have noticed segmentation faults for OCCA:HIP kernels that required too much LDS or REG. I was skeptical that HIP will throw errors that will be caught. Hopefully that has been fixed now.

I am ok with a user supplied hint, or a launcher that spots new thread configurations at runtime.

FYI I have been happy as a lark just specifying explicit runtime loop bounds from host code via macro constant substitution in OKL kernel code (e.g., K_blockDim for outer, K_threadDim for inner) that I pass down via the kernel props. Of course, I have my own API code that queries the characteristics and status of the GPU. I use that information to calculate loop bounds to pass down. Yes that kind of dynamic thing causes JIT re-compile sometimes during runtime but in my case it is rare because I step-threshold changes, so the actual loop bounds usually remain the same or fall into some common set for a given kernel. I also utilize pre-compile for some of my kernels so that also reduces JIT re-compile.

Having these loop bounds explicitly defined at runtime also facilitates runtime sizing GPU local memory arrays with them in the OKL code.

Looks like this feature is going to be important:
https://rocmdocs.amd.com/en/latest/Current_Release_Notes/Current-Release-Notes.html#performance-impact-for-kernel-launch-bound-attribute

With apologies for the clumsy implementation, this is a workaround I use for a kernel in libparanumal that requires Np (>256) threads:

occa::properties kernelInfo; 
...
 if(platform.device.mode()=="HIP"){
      char newflag[BUFSIZ];
      sprintf(newflag, " --gpu-max-threads-per-block=%d", mesh.Np);
      kernelInfo["compiler_flags"] += newflag;
    }

Using the hipcc compiler flag to specify the launch bounds was suggested by Noel Chalmers. The gross implementation is mine.

It is important to be careful when doing this since it is unclear what happens if a kernel violates the max bound.

To avoid using inappropriate bounds, I create separate copies of the occa::properties object for kernels that use different max thread counts.

It's likely that in the future, violating the launch bounds will become a runtime error.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

dmed256 picture dmed256  ·  4Comments

awehrfritz picture awehrfritz  ·  7Comments

amikstcyr picture amikstcyr  ·  11Comments

tcew picture tcew  ·  10Comments

jeremylt picture jeremylt  ·  12Comments