LDC CUDA and SPIRV

From D Wiki
Revision as of 06:38, 22 May 2016 by Nicholas Wilson (talk | contribs) (formatting)

About

This page covers the requirements and considerations for getting LDC to target the NVPTX and SPIR backends of LLVM, i.e. the triples:

  • spir-unknown-unknown
  • spir64-unknown-unknown
  • nvptx-unknown-unknown
  • nvptx64-unknown-unknown

Address Spaces

CUDA and OpenCL both divide device memory into regions:

0. Private. Memory used by a single thread of execution; it contains that thread's stack and registers.

1. Global. Memory that is global to the device.

2. Constant. Memory (re)writable only by the host, between executions of a batch of kernels.

3. Local. Memory that is local to a work group, i.e. a group of threads (a thread block in CUDA, where this region is called shared memory).

These regions are mapped to the LLVM concept of address spaces. The numbers above are the SPIR address space numbers; NVPTX numbers the same regions differently (0 generic, 1 global, 3 shared, 4 constant, 5 local). In addition to the above, SPIR has a fifth address space (address space 4) that pointers may point to and that is generic.

Note that pointers have two associated address spaces: the space of residence and the pointer space, e.g. one can have a local pointer to global memory, i.e. the pointer resides in local memory but points to somewhere in global memory.

TODO: How useful is this (other than for private, which is required)? How should this interact with D's default TLS?

The concept of address spaces does not exist in D and will need to be translated somehow.
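
One conceivable translation, sketched here only as an illustration, is to carry the pointee's address space in the type so that mismatches are caught at compile time. The names `AddrSpace`, `Pointer`, `GlobalPtr`, `LocalPtr` and `scale` are assumptions made for this sketch and do not exist in LDC or druntime. The enum members are plain tags, not backend address space numbers, and the space the pointer itself resides in is not modelled.

```d
// Hypothetical sketch: none of these names exist in LDC or druntime.
// The enum values are arbitrary tags, not LLVM address space numbers.
enum AddrSpace { Private, Global, Constant, Local, Generic }

// A pointer whose pointee address space is part of its type.
struct Pointer(AddrSpace space, T)
{
    T* ptr; // on the host this is just an ordinary pointer

    enum pointeeSpace = space;

    ref T opIndex(size_t i) { return ptr[i]; }
}

alias GlobalPtr(T) = Pointer!(AddrSpace.Global, T);
alias LocalPtr(T)  = Pointer!(AddrSpace.Local, T);

// Accepts only pointers into global memory.
void scale(GlobalPtr!float data, size_t n, float k)
{
    foreach (i; 0 .. n)
        data[i] *= k;
}

void main()
{
    float[4] buf = [1, 2, 3, 4];
    scale(GlobalPtr!float(buf.ptr), buf.length, 2);
    assert(buf == [2f, 4, 6, 8]);

    // A pointer tagged as local is a different type and is rejected.
    static assert(!__traits(compiles,
        scale(LocalPtr!float(buf.ptr), buf.length, 2f)));
}
```

A real implementation would need compiler support to emit the corresponding LLVM address space; the wrapper only shows how the distinction could surface in D's type system.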

Restrictions

Execution under CUDA and OpenCL is more restricted than on the CPU. In short, there are no exceptions (what to do about assert?), no function pointers (all template delegate parameters MUST be inlined), no recursion (direct or indirect), no I/O, no C or D runtime and no OS. However, synchronisation primitives such as fences are still available, as are atomics. The expected way to achieve this is to have a transitive attribute (@kernel) that enforces these restrictions, similar to @nogc, nothrow and pure. If we were to disallow non-builtin globals and make the builtin ones immutable, we may be able to get away with @kernel being equivalent to @nogc pure nothrow.
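
As a rough approximation of that idea using only what D offers today, a kernel can be marked with a user-defined attribute and the existing attributes @nogc, nothrow and pure. In the sketch below, `kernel` is a hypothetical marker and `saxpy` is an invented example function. The existing attributes do not reject recursion or function pointers, so they cover only part of the restrictions.

```d
import std.traits : hasUDA;

// Hypothetical marker attribute invented for this sketch;
// it has no special meaning to the compiler.
enum kernel;

// The existing attributes reject GC allocation, throwing and
// access to mutable globals, transitively through all callees.
@kernel @nogc nothrow pure
void saxpy(float a, const(float)[] x, float[] y)
{
    foreach (i; 0 .. y.length)
        y[i] = a * x[i] + y[i];
}

static assert(hasUDA!(saxpy, kernel));

// Each of these bodies violates one attribute and fails to compile.
static assert(!__traits(compiles, () @nogc { float[] y; y ~= 1.0f; }));
static assert(!__traits(compiles, () nothrow { throw new Exception("no"); }));

int counter; // mutable (thread-local) global
static assert(!__traits(compiles, () pure { counter++; }));

void main()
{
    float[3] x = [1, 2, 3];
    float[3] y = [10, 20, 30];
    saxpy(2, x[], y[]);
    assert(y == [12f, 24, 36]);
}
```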

Ranges

Much of the programming power of D comes from ranges. The paradigm does not transfer perfectly to CUDA and OpenCL but should still be usable.

First it is useful to briefly cover the different types of ranges.

Generative ranges. These do not take a range as input but produce one. The produced range is not necessarily random access.

Transformative ranges. These take ranges as input and return them as output. The output(s) are not necessarily random access (but generally will be if the input is). In the context of GPGPU it is useful to further categorise these by how the number of input elements relates to the number of outputs. Of particular interest are ranges that perform an n:n mapping, as these can be chained from within the same kernel. The obvious example here is `map`. Some ranges, e.g. `filter`, do not preserve this and will have to be dealt with differently, e.g. by changing elements that do not pass the predicate to a sentinel value (e.g. NaN).

Consuming ranges. These take range(s) as input and return either void or a scalar.

In order to be chained within the same kernel, the input range(s) need to have the same number of elements as the output and be random access, as the quintessential ranges on GPUs are arrays. Also, the notion of `.save`ing a range doesn't translate.
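
The n:n property and the sentinel approach can be illustrated on the host with Phobos; this is ordinary CPU code, not something that runs on a device. Chained `map` calls over an array stay random access with an unchanged length, while a length-preserving stand-in for `filter` replaces rejected elements with NaN and leaves it to the consumer to skip them. The predicate (keep positive values) is an arbitrary choice made for this example.

```d
import std.algorithm : equal, filter, map, sum;
import std.math : isNaN;
import std.range : hasLength, isRandomAccessRange;

void main()
{
    float[] input = [1, -4, 9, -16];

    // n:n transformations: the result is still random access and has
    // the same length, so element i depends only on input element i.
    auto chained = input.map!(x => x * 2).map!(x => x + 1);
    static assert(isRandomAccessRange!(typeof(chained)));
    static assert(hasLength!(typeof(chained)));
    assert(chained.length == input.length);
    assert(chained.equal([3f, -7f, 19f, -31f]));

    // Length-preserving stand-in for filter: elements that fail the
    // predicate become a NaN sentinel instead of being removed.
    auto masked = input.map!(x => x > 0 ? x : float.nan);
    assert(masked.length == input.length);
    assert(masked[1].isNaN);

    // A consuming step has to skip the sentinel values.
    auto total = masked.filter!(x => !x.isNaN).sum;
    assert(total == 10);
}
```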

Vector types

Currently LDC vector types use __vector, which rejects types that are invalid for the target. This doesn't work for things like OpenCL's float16 that are too big.
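
One possible workaround, shown only as a sketch, is a library type backed by a static array that provides element-wise operators without going through __vector. `Vec` and the `float16` alias below are hypothetical names for this example; whether such a type could be lowered to an LLVM vector by the compiler is not addressed here.

```d
// Hypothetical library vector type; not part of LDC or druntime.
struct Vec(T, size_t N)
{
    T[N] data;

    // Element-wise arithmetic on two vectors of the same shape.
    Vec opBinary(string op)(Vec rhs) const
        if (op == "+" || op == "-" || op == "*" || op == "/")
    {
        Vec result;
        foreach (i; 0 .. N)
            result.data[i] = mixin("data[i] " ~ op ~ " rhs.data[i]");
        return result;
    }
}

// Assumed to mirror OpenCL's 16-element float vector.
alias float16 = Vec!(float, 16);

void main()
{
    float16 a, b;
    a.data[] = 1.5f;
    b.data[] = 2;

    auto c = a + b;
    foreach (v; c.data)
        assert(v == 3.5f);
}
```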

Builtins & Intrinsics

CUDA and OpenCL expose a lot of builtin variables (e.g. work size, GlobalLinearId) and intrinsic functions (e.g. various FMA operations), as well as other types such as images, pipes and events.

Metadata & Special Function attributes

The LLVM IR forms of SPIR-V and PTX hold a lot of magic metadata.

The form of the metadata can be found in the test modules for the codegen of clang.

All SPIR kernels use the spir_kernel calling convention and all CUDA device functions use ptx_device.

Standard Library

As part of providing SPIR-V and CUDA as backends we will need to provide a standard library of functions that meet the restriction criteria imposed by the environment.

A non-exhaustive list:

  • vector operations and functions, both for fixed-length and (run-time) variable-length vectors.
  • work group functions, e.g. reduce, search, sort.
  • the builtin variables.
  • functions to deal with the special objects: images, pipes and events.

Misc

Currently the Khronos branch of LLVM that supports SPIR-V only supports OpenCL. It is worth considering how to make supporting GLSL easy after the fact.

Ideally we should make the interfaces for CUDA and OpenCL as similar and consistent as possible, so that higher-level library code can be agnostic of the backend. Since the functions can be introspected, this should be easy.

SPIR-V has the notion of capabilities. These include things like half- and double-precision floating-point support, 64-bit integers, atomics, 64-bit atomics, images and pipes.
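
Capabilities could be surfaced in D as flags attached to functions and checked against what a device reports. The following is a sketch under that assumption: `Capability`, `requires`, `requiredCaps`, `supports` and `accumulate` are all hypothetical names, and the flag values simply mirror the list above rather than the enumerants defined by the SPIR-V specification.

```d
import std.traits : getUDAs;

// Illustrative flags only; these are not the SPIR-V enumerant values.
enum Capability : uint
{
    halfFloat   = 1 << 0,
    doubleFloat = 1 << 1,
    int64       = 1 << 2,
    atomics     = 1 << 3,
    atomics64   = 1 << 4,
    images      = 1 << 5,
    pipes       = 1 << 6,
}

// Hypothetical UDA recording which capabilities a function needs.
struct requires { uint caps; }

@requires(Capability.doubleFloat | Capability.atomics64)
void accumulate() {}

// Collects the capability flags of all `requires` UDAs on fn.
uint requiredCaps(alias fn)()
{
    uint caps = 0;
    foreach (uda; getUDAs!(fn, requires))
        caps |= uda.caps;
    return caps;
}

// True if the device provides every needed capability.
bool supports(uint deviceCaps, uint needed)
{
    return (deviceCaps & needed) == needed;
}

void main()
{
    enum needed = requiredCaps!accumulate(); // evaluated at compile time

    uint deviceA = Capability.doubleFloat | Capability.atomics
                 | Capability.atomics64;
    uint deviceB = Capability.halfFloat | Capability.images;

    assert(supports(deviceA, needed));
    assert(!supports(deviceB, needed));
}
```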