Occa: New kernel qualifier macro

Created on 2 Apr 2020 · 22Comments · Source: libocca/occa

It is often helpful to give the kernel compiler some extra information about the number of threads (work-items) in a thread-block (work-group). For instance we can give the HIP compiler an upper bound on the number of threads in a thread-block (say 1024) as follows:

__launch_bounds__(1024) __global__ void fooKernel(...) { ... }

In fact for the current HIP release it is unfortunately the case that this must be specified when the thread-block size exceeds 256 (see https://github.com/ROCm-Developer-Tools/HIP/issues/1310 )

CUDA also has the same attribute. There is also an extra argument to the launch bounds qualifier for minimum number of thread-blocks (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#launch-bounds).

The kernel qualifiers in OpenCL are slightly different (see 6.7.2 of https://www.khronos.org/registry/OpenCL/specs/2.2/pdf/OpenCL_C.pdf )

Proposal v1 - in an ideal world when we know the thread-block size at compile time OCCA will add kernel qualifiers for appropriate launch bounds (CUDA, HIP) or work-group size hint (OpenCL).

Proposal v2 - If it is too complicated to do Proposal v1, then it would neat to add an okl attribute for launch bounds @qualifier("inner sizes", B) where B could be a compiler define. This would be expanded to __launch_bounds__(value of B) for CUDA/HIP or __attribute__((work_group_size_hint(value of B))) for OpenCL. Multi-dim variant would also be helpful.

feature

Source

tcew

All 22 comments

Since this seems specific to HIP/CUDA/OpenCL, what about passing it in as a build property?

  addVectors = device.buildKernel("addVectors.okl",
                                  "addVectors",
                                  "launch_bounds: 1024");

dmed256 on 6 Apr 2020

👍1

That would be good I think.

tcew on 6 Apr 2020

I just ran into this. I'm happy to contribute this but I have a question. Does OCCA know the inner loop dimensions at JIT-time?

dmcdougall on 29 Jun 2020

I believe the loop dimensions can be passed in as kernel arguments.

tcew on 1 Jul 2020

That's true, but I don't understand why that's useful. I'm asking how occa can emit a __launch_bounds__ attribute at JIT-time. It needs to know the loop dimensions at that point. This is orthogonal to the kernel arguments, isn't it?

dmcdougall on 6 Jul 2020

He's saying the dims of the inner loops can be passed as arguments, i.e. we don't necessarily know the threadblock dimensions at JIT-compile time.

noelchalmers on 6 Jul 2020

Oh right. I see what you're saying now @tcew. Good point.

dmcdougall on 6 Jul 2020

This is what I had in mind:

@kernel void runtimeArgs(const int B,
                         const int T,
                         const int N,
                         const float *x,
                         const float *y,
                         float *xPy) {
  for (int b=0;b<B;++b;@outer(0)){
    for (int t=0;t<T;++t;@inner(0)){

      if(b==0 && t==0) printf("B=%d, T=%d\n", B, T);

      int  n = t + T*b;
      if(n<N){
        xPy[n] = x[n] + y[n];
      }
    }
  }
}

OCCA obviously cannot know the numerical loop bounds at JIT-time.

It does however, create a launcher that sets the dimensions of the thread grid:

extern "C" void runtimeArgs(occa::modeKernel_t * *deviceKernel,
                            const int & B,
                            const int & T,
                            const int & N,
                            occa::modeMemory_t * x,
                            occa::modeMemory_t * y,
                            occa::modeMemory_t * xPy) {
  {
    occa::dim outer, inner;
    outer.dims = 1;
    inner.dims = 1;
    int b = 0;
    outer[0] = B - 0;
    int t = 0;
    inner[0] = T - 0;
    occa::kernel kernel(deviceKernel[0]);
    kernel.setRunDims(outer, inner);
    kernel(B, T, N, x, y, xPy);
  }
}

Thus the user can specify any size for the loop bounds at runtime.

tcew on 6 Jul 2020

Crossed posts.

tcew on 6 Jul 2020

SYCL uses similar syntax:

sycl::range<2> global_range(Bx*Tx,By*Ty);
sycl::range<2> local_range(Tx,Ty);
sycl::nd_range<2> kernel_range(global_range, local_range);

device_queue.submit([&](sycl::handler &cgh) {
    ...
  cgh.parallel_for(kernel_range, kernel);
});

(Buffers have a range associated with them, which avoids the need to pass in N)

kris-rowe on 7 Jul 2020

The SYCL sycl::range syntax is adapted from OpenCL, which itself is adapted from CUDA.

In your example the specification of the threading dimensions is separate from the body of the parallel for loops.

The OCCA OKL syntax is specifically designed to bring the loop dimensions and body code into a more familiar parallel for loop syntax.

for (int b=0;b<B;++b;@outer(0)){ /*  grid dimension defined here */
    for (int t=0;t<T;++t;@inner(0)){ /* thread block dimension defined here */

      if(b==0 && t==0) printf("B=%d, T=%d\n", B, T);

      int  n = t + T*b;
      if(n<N){
        xPy[n] = x[n] + y[n];
      }
    }
  }

The code inside the kernel is supposed to keep the parallel for loop bounds in close proximity to the body of the parallel for loops. Also the loop bound is not specified by any input array, since a more general kernel may require a very different thread grid configuration to the data arrays.

The OKL kernel constructions was an intentional choice born of having to repeatedly explain CUDA/OpenCL kernel syntax, kernel launch parameters, and kernel threading philosophy when training people.

tcew on 7 Jul 2020

👍1

Does SYCL have an analogous thread-block size hint like CUDA/HIP's __launch_bounds__ we're discussing here?

noelchalmers on 7 Jul 2020

Good question. I double-checked the SYCL standard (v1.2.1) to find out. Any attributes available in OpenCL C are supported and can be given with the C++ 11 attribute specifier using the cl namespace. For example, __attribute__(((reqd_work_group_size(n))) in OpenCL C is equivalent to [[cl::reqd_work_group_size(n)]] in SYCL.

There are two flavours available to specify the thread-group size: work_group_size_hint(n) is the soft version—suggesting the thread-group size will be n—whereas req_work_group_size(n) is a strict requirement.

kris-rowe on 7 Jul 2020

Some options for OCCA:

add a "innerDimHint" member function to the occa::kernel class, that forces a recompile (if not already in the hash) with thread dim hint for CUDA/OpenCL/HIP.
add some logic inside the launcher to trigger a recompile done when a new thread array size is specified. This featuremight be turned on/off by some OCCA_* define.

Both of these can be done in a backwards compatible way.

tcew on 7 Jul 2020

I would actually trust the separate runtimes to manage this, and opt to not recompile anything. In essence, use @dmed256's original proposal of making it a build prop, then adding the respective __launch_bound__ hint to the kernel at translation if the backend supports it.

For CUDA and HIP, __launch_bound__ is really just a hint, so maybe more akin to OpenCL's work_group_size_hint. It's only used to tell the compiler how many registers it can assume will be available to each thread in the block. If the user violates the launch bound, it's not necessarily an error, since the kernel may not heavily be using register. In the case where the user violates the launch bound and there is indeed not enough register, the runtime will throw a error that OCCA should catch.

noelchalmers on 7 Jul 2020

👍1

When running an auto tuner I have noticed segmentation faults for OCCA:HIP kernels that required too much LDS or REG. I was skeptical that HIP will throw errors that will be caught. Hopefully that has been fixed now.

tcew on 8 Jul 2020

👀1

I am ok with a user supplied hint, or a launcher that spots new thread configurations at runtime.

tcew on 8 Jul 2020

FYI I have been happy as a lark just specifying explicit runtime loop bounds from host code via macro constant substitution in OKL kernel code (e.g., K_blockDim for outer, K_threadDim for inner) that I pass down via the kernel props. Of course, I have my own API code that queries the characteristics and status of the GPU. I use that information to calculate loop bounds to pass down. Yes that kind of dynamic thing causes JIT re-compile sometimes during runtime but in my case it is rare because I step-threshold changes, so the actual loop bounds usually remain the same or fall into some common set for a given kernel. I also utilize pre-compile for some of my kernels so that also reduces JIT re-compile.

Having these loop bounds explicitly defined at runtime also facilitates runtime sizing GPU local memory arrays with them in the OKL code.

pdhahn on 4 Jan 2021

Looks like this feature is going to be important:
https://rocmdocs.amd.com/en/latest/Current_Release_Notes/Current-Release-Notes.html#performance-impact-for-kernel-launch-bound-attribute

stgeke on 6 May 2021

With apologies for the clumsy implementation, this is a workaround I use for a kernel in libparanumal that requires Np (>256) threads:

occa::properties kernelInfo; 
...
 if(platform.device.mode()=="HIP"){
      char newflag[BUFSIZ];
      sprintf(newflag, " --gpu-max-threads-per-block=%d", mesh.Np);
      kernelInfo["compiler_flags"] += newflag;
    }

Using the hipcc compiler flag to specify the launch bounds was suggested by Noel Chalmers. The gross implementation is mine.

tcew on 6 May 2021

It is important to be careful when doing this since it is unclear what happens if a kernel violates the max bound.

To avoid using inappropriate bounds, I create separate copies of the occa::properties object for kernels that use different max thread counts.

tcew on 6 May 2021

It's likely that in the future, violating the launch bounds will become a runtime error.

noelchalmers on 6 May 2021

Was this page helpful?

0 / 5 - 0 ratings

Related issues

🌟 Who's using OCCA?

dmed256 · 4Comments

Compilation fails with PGI compiler

awehrfritz · 7Comments

cmake support

amikstcyr · 11Comments

Feature request: add option to stop parser for replacing defined variables with their values

tcew · 10Comments

libCEED Travis CI With OCCA Now Failing

jeremylt · 12Comments