LDC CUDA and SPIRV

== About ==

This page is about the requirements and considerations for getting LDC to target the NVPTX and SPIR backends of LLVM, i.e. the triples:
* spir-unknown-unknown
* spir64-unknown-unknown
* nvptx-nvidia-cuda
* nvptx64-nvidia-cuda

== Needs Work ==

* Merge the CodeGenVisitors and StatementsVisitors; these were separated for ease of development but share the vast majority of their code.
* Refactor the isFromDCompute functions and the similar functions in uda.cpp.
* Move the ABI code to sit with the rest of the ABI code (gen/abi*).
* Remove the IF_LOGs left over from me attempting to figure out what is going on; basically anything with a %p format specifier.
* Do the TODOs.
* Adapt the attribute code from clang/LLVM.


== Issues ==

* Need to create a CMake option for linking with a SPIR-V compatible LLVM, so that building with an unmodified LLVM remains possible.
* Currently segfaults if -singleobj is passed and the first file on the command line is marked @compute.

== Desired Features not yet implemented ==

* Support for OpenCL images and pipes (type resolution and metadata).

== Compilation process ==

A D module declaration is tagged @dcompute.attributes.compute

@compute gets picked up prior to the start of module code generation, and codegen of that module is deferred until after the rest of the host code generation is done.

Codegen is then done for each of the DCompute targets specified on the command line, e.g. -mdcompute-targets=ocl-120,cuda-350 for OpenCL 1.2 and CUDA SM 3.5.
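
For example, a hypothetical invocation (exact output file names are an implementation detail, not specified here):

 ldc2 -mdcompute-targets=ocl-120,cuda-350 kernels.d

This emits a .spv file for the OpenCL 1.2 target and a .ptx file for the CUDA SM 3.5 target, alongside the normal host objects.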

During statement codegen, calls of the form if(__dcompute_reflect(target t, uint version)) are selectively codegen'd for the appropriate target. During type translation, all magic types in dcompute.types are altered to the appropriate target type, e.g. GlobalPointer!(float) becomes {float addrspace(1)*}.

Relevant metadata is attached to the generated module. This includes things like specifying which functions are kernels (entry points).

In the case of CUDA, the module is passed off to the backend to produce a .ptx file; in the case of OpenCL targeting SPIR-V, the SPIR-V writer pass is scheduled to generate the .spv file.

The following code

 @compute module foo;
 import dcompute.attributes;
 import dcompute.std.transcendental;
 
 extern(C) // for easier mangling
 @kernel void bar(GlobalPointer!(Vector!(float,4)) fp, float b)
 {
     if(__dcompute_reflect(target.OpenCL))
         *fp += float4(b);
     else if(__dcompute_reflect(target.CUDA))
         *fp = float4(cos(b));
 }

when targeting OpenCL should produce IR semantically equivalent to the following:

 define void @bar(<4 x float> addrspace(1)* %fp, float %b)
 {
    %splat0 = insertelement <4 x float> undef, float %b, i32 0
    %splat = shufflevector <4 x float> %splat0, <4 x float> undef, <4 x i32> zeroinitializer
    %1 = load <4 x float>, <4 x float> addrspace(1)* %fp
    %2 = fadd <4 x float> %1, %splat
    store <4 x float> %2, <4 x float> addrspace(1)* %fp
    ret void
 }
 !opencl.kernels = !{!1}
 !1 = !{void (<4 x float> addrspace(1)*, float)* @bar, ... }

and when targeting CUDA should produce IR semantically equivalent to the following:

 define void @bar(<4 x float> addrspace(1)* %fp, float %b)
 {
    %1 = call float @cos(float %b)
    %splat0 = insertelement <4 x float> undef, float %1, i32 0
    %splat = shufflevector <4 x float> %splat0, <4 x float> undef, <4 x i32> zeroinitializer
    store <4 x float> %splat, <4 x float> addrspace(1)* %fp
    ret void
 }
 !nvvm.annotations = !{!1}
 !1 = !{void (<4 x float> addrspace(1)*, float)* @bar, !"kernel", i32 1}

== Address Spaces ==

CUDA and OpenCL both have notions of regions of memory:

* Private: memory used by a given thread of execution; contains its stack and registers.
* Global: memory that is global to the device.
* Local: memory that is local to a work group (aka warp/wave), a group of threads.
* Constant: memory (re)writable only by the host, between executions of the batch of kernels.

This is mapped to the LLVM concept of address spaces. In SPIR and CUDA these map to the above regions; in addition there is a fifth, generic address space that pointers may point to. The numberings differ between SPIR and CUDA, so we handle them with magic structs that get translated (see above).

Note that pointers have two associated address spaces: the space of residence and the space pointed into. E.g. one can have a local pointer to global memory, i.e. the pointer resides in local memory but points to somewhere in global memory.
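
In LLVM IR terms, a minimal sketch (using SPIR's numbering, where addrspace(1) is global and addrspace(3) is local):

 ; a pointer residing in local memory that points into global memory
 @p = addrspace(3) global float addrspace(1)* null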

TODO: how useful is this (other than for private, which is required)? How should this interact with D's default TLS? Don't allow global variables at all in modules marked @compute; figure out how to do a shared_alloc for allocating shared memory.


== Restrictions ==

In the environment of CUDA and OpenCL the nature of execution is more restricted than on a CPU. In short: there are no exceptions (what to do about assert?), no function pointers (all template delegate parameters MUST be inlined), no recursion, either direct or indirect, no I/O, and no C or D runtime or OS. However, synchronisation primitives such as fences are still available, as are atomics. The expected way to achieve this is to have a transitive attribute (@kernel) that enforces these restrictions, similar to @nogc, nothrow and pure. If we were to disallow non-builtin globals and make the builtin ones immutable, we may be able to get away with @kernel being equivalent to @nogc pure nothrow.
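
For illustration, a minimal sketch of kernel code under these restrictions (the function body and the commented-out violations are illustrative only, not the actual enforcement mechanism):

 @kernel void square(GlobalPointer!(float) p, float x)
 {
     *p = x * x;                  // fine: plain computation
     // throw new Exception(""); // disallowed: no exceptions
     // auto a = new float[4];   // disallowed: no GC, no D runtime
     // square(p, x);            // disallowed: no recursion, direct or indirect
     // writeln(x);              // disallowed: no I/O
 }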

== Ranges ==

Much of the programming power of D comes from ranges. The paradigm does not transfer perfectly to CUDA and OpenCL but should still be usable.

First it is useful to briefly cover different types of ranges.

Generative ranges: these do not take a range as input but produce one. The produced range is not random access. Examples include iota and recurrence relations such as the Fibonacci sequence.

Transformative ranges: these take ranges as input and return ranges as output. The output(s) are not necessarily random access (but will generally be if the input is). In the context of GPGPU it is useful to further categorise these by how the number of input elements relates to the number of output elements. Of particular interest are ranges that perform an n:n mapping, as these can be chained from within the same kernel; the obvious example here is `map`. Some ranges, e.g. `filter`, do not preserve this and will have to be dealt with differently, e.g. by changing elements that do not pass the predicate to a sentinel value (e.g. NaN), or by stopping the range chain and requiring new buffers to be allocated and different kernels to continue the chain.
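
As a sketch of the difference (assuming buf is a random-access range of floats backed by a GPU buffer; the chaining itself is ordinary range code):

 import std.algorithm.iteration : map;
 
 // n:n chains: one output element per input element, so the whole
 // chain can be evaluated in a single pass inside one kernel
 auto chained = buf.map!(x => x * 2.0f).map!(x => x + 1.0f);
 
 // "filter" via a sentinel: keep the n:n property by emitting NaN for
 // elements that fail the predicate, instead of compacting the buffer
 auto notFiltered = buf.map!(x => x > 0.0f ? x : float.nan);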

Consuming ranges: these take range(s) as input and either return void or a scalar. Examples include min, max and reduce.

In order to be chained within the same kernel, the input range(s) need to have the same number of elements as the output and be random access, as the quintessential range on GPUs is the array. Also, the notion of `.save`ing a range doesn't translate.

== Vector types ==

Currently LDC's vector types adapt __vector, which rejects invalid types. This doesn't work for types like float16 that are too big for the target.
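
For instance (a minimal illustration; float16 here is OpenCL's 16-element float vector):

 alias float4  = __vector(float[4]);  // accepted: matches a native vector width
 alias float16 = __vector(float[16]); // rejected when it is too big for the target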

== Builtins & Intrinsics ==

CUDA and OpenCL expose a lot of builtin variables (work size, GlobalLinearId, etc.) and intrinsic functions (various FMAs and the like), as well as other types such as images, pipes and events.
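
A hedged sketch of how such a builtin might be surfaced per target (the intrinsic binding, the indexing on GlobalPointer, and the names are assumptions for illustration, not the actual dcompute API):

 // hypothetical: bind the NVPTX thread-index intrinsic via LDC's intrinsic pragma
 pragma(LDC_intrinsic, "llvm.nvvm.read.ptx.sreg.tid.x")
 uint nvvm_tid_x();
 
 extern(C) size_t get_global_id(uint dim); // provided by the OpenCL environment
 
 @kernel void fill(GlobalPointer!(float) p, float v)
 {
     size_t i;
     if (__dcompute_reflect(target.CUDA))
         i = nvvm_tid_x(); // plus the block offset in a real implementation
     else if (__dcompute_reflect(target.OpenCL))
         i = get_global_id(0);
     p[i] = v; // assumes indexing is supported on GlobalPointer
 }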

== Metadata & Special Function attributes ==

The LLVM IR forms of SPIR-V and PTX hold a lot of magic metadata.

The form of the metadata can be found in the test modules for clang's codegen.

All SPIR-V kernels have the attribute spirv_kernel and all CUDA device functions have the attribute ptx_device.
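
As an illustration, at the LLVM IR level these show up as calling conventions, spelled spir_kernel and ptx_device in LLVM itself (a sketch):

 define spir_kernel void @k(float addrspace(1)* %p) { ; SPIR-V kernel entry point
   ret void
 }
 define ptx_device float @helper(float %x) { ; CUDA device-side function
   ret float %x
 }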

== Standard Library ==

As part of providing SPIR-V and CUDA as backends we will need to provide a standard library of functions that meet the restriction criteria imposed by the environment.

A non-exhaustive list:

* vector operations and functions, both for fixed length and (run time) variable length
* work group functions, e.g. reduce, search, sort
* the builtin variables
* functions to deal with the special objects: images, pipes and events

== Misc ==

Currently the KHRONOS branch of LLVM that supports SPIR-V only supports OpenCL. It is worth considering making it easy to also support GLSL after the fact.

Ideally we should make the interfaces for CUDA and OpenCL as similar and consistent as possible, and the higher-level library code should be agnostic to the target. Since the functions can be introspected, this should be straightforward.

SPIR-V has the notion of capabilities. These include things like half- and double-precision floating point support, 64-bit integers, atomics, 64-bit atomics, images and pipes.
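
For example, a SPIR-V module declares the capabilities it relies on up front (capability names as in the SPIR-V specification):

 OpCapability Kernel       ; OpenCL-style compute
 OpCapability Addresses
 OpCapability Float16      ; half precision
 OpCapability Float64      ; double precision
 OpCapability Int64        ; 64-bit integers
 OpCapability Int64Atomics ; 64-bit atomics
 OpCapability Pipes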


== Resources ==

KHRONOS's SPIR-V capable LLVM

OpenCL C++ standard library

[https://github.com/KhronosGroup/SPIR/tree/spirv-1.1 KHRONOS's SPIR-V clang]

[http://llvm.org/docs/CompileCudaWithLLVM.html Compiling CUDA C/C++ with LLVM]

[http://wujingyue.com/docs/gpucc-tutorial.pdf CGO 2016 gpucc tutorial]

[http://llvm.org/devmtg/2016-03/Presentations/AnastasiaStulova_OpenCL20_EuroLLVM2016.pdf A Journey of OpenCL 2.0 Development in Clang]

== The Diff ==

This section details the [https://github.com/ldc-developers/ldc/compare/master...thewilsonator:dcompute differences made to LDC].

{|
! File !! Comment
|-
| driver/cl_options.h || added the mDcomputeTargets command line option
|-
| driver/linker.cpp || in linkObjToBinaryGcc: skip files ending with .spv and .ptx
|-
| driver/main.cpp || add includes; in hideLLVMOptions: hide SPIR-V options; in registerPredefinedVersions: version = D_DCompute; in cppmain: emit for dcompute
|-
| driver/toobj.cpp || add includes; in codegenModule: if the module is SPIR-V pass it to createSPIRVWriterPass, if it is PTX use Target->addPassesToEmitFile, but always as assembly
|-
| gen/functions.cpp || disable debuginfo generation for DCompute, no uwtable, alloca point => allocaPoint (I think this fixes a crash)
|-
| gen/llvmhelpers.cpp || disable debuginfo generation for DCompute
|-
| gen/optimizer.cpp || don't optimise SPIR-V modules, as turning GEPs into extracts causes crashes
|-
| gen/statements.cpp || comments
|-
| gen/tollvm.cpp || DtoBitCast
|-
| gen/uda.cpp || _kernel and _compute, isDComputeAttibutes, isFromDComputeAttibutes, getDComputeAttributesStruct, hasKernelAttr, hasComputeAttr
|-
| ir/irtypestruct.cpp || all
|-
| gen/irstate.{h,cpp} || all
|}

== Pull Requests ==

This section details the pull requests associated with this project.

{|
! Name !! Number !! Status !! Depends On !! Description
|-
| Dcompute || 1786 || Superseded || None || The original PR, now obsolete; superseded by the following PRs.
|-
| Cmake || 1795 || Redundant || None || Adds the #define LDC_WITH_DCOMPUTE_SPIRV (works). Link with libLLVMSPIRVLib.a (don't think this works). Equivalent to the changes to CMakeLists.txt in 1786.
|-
| Attributes || 1800 || Merged || None || Adds the magic attributes @compute(int) and @kernel. REQUIRES associated druntime changes, see the PR. Equivalent to the changes in gen/uda.* in 1786. CONFLICTS 1877: severity trivial.
|-
| Addrspace || 1801 || Merged || None || Correctly handle pointers with the LLVM addrspace qualification, namely don't allow bitcasting between pointers of different addrspaces. Necessary due to the way the addrspace qualification is attached. FILES gen/dvalue.cpp and gen/tollvm.*. DEFERRED: optimise StripAddrSpaces when gDComputeTarget is added and add an early return if it is false.
|-
| Toobj and optimiser || 1879 || Merged || None || Don't optimise SPIR-V, as turning GEPs into extracts causes the backend to assert. Always use llvm::TargetMachine::CGFT_AssemblyFile for PTX; if we don't, creating the target machine will return null. llvm::createSPIRVWriterPass(out)->runOnModule(m) is the way to make output happen for SPIR-V, as there is no target machine for SPIR-V. FILES driver/toobj.cpp and gen/optimizer.cpp.
|-
| ABI || 1878 || Merged || 1800 || Sets the calling conventions for @kernel and non-@kernel functions correctly. FILES dcompute/abi* -> gen/abi*. FUTURE: put gen/abi* into gen/abi/*.
|-
| target + codegenmanager || 1922 || Merged || ABI || Targets hold the address space mapping and other information about the backend, set gTargetMachine and add required metadata to the module being codegen'd. CONTENTION: location of the magic types (currently 'dcompute.types'; 'ldc.dcomputetypes'?). FILES dcompute/codegen* -> driver/dcomputecodegen*, dcompute/target* -> gen/dcomputetarget*. DEFER: update the call to Declaration_codegen when its signature changes. FUTURE: handle restrict if/when it becomes a thing.
|-
| irtypestruct || 1952 || Merged || 1922 + 1801 + 87 || Attach the addrspace qualification to the magic types above. FILES ir/irtypestruct.cpp.
|-
| code generation || 1953 || Merged || 1922 || Statement and Declaration visitors.
|-
| main || 1954 || Merged || 1922 || main.cpp
|-
| semantic || 1993 || Merged || None || Semantic analysis: enforce nothrow (no Errors either) and @nogc; no globals, no classes, interfaces or associative arrays, no runtime functions, and other restrictions.
|-
| druntime || 87 || Merged || None || ldc.attributes (@compute + @kernel) + ldc.dcomputetypes (Pointer).
|-
| lit tests || None || Ongoing || 1801 + 1922 + druntime || Lit-based IR tests.
|}