Kernel embeddings #98
using "stringImportPaths" for better readability
ensureInit in buffers is not strictly required, but we don't know what a user will do, so a check was added
| version(DComputeTestCUDA) | ||
| { | ||
| Platform.initialise(); |
please add an additional test instead of altering this one.
| this(size_t elems) | ||
| { | ||
| ensureInit(); |
why are these calls needed here?
| this(T[] arr) | ||
| { | ||
| ensureInit(); |
| * saxpy.launch([N,1,1], [1,1,1], b_res, alpha, b_x, b_y, N); | ||
| * launch!saxpy([N,1,1], [1,1,1], b_res, alpha, b_x, b_y, N); |
is this actually true? saxpy is a template argument here.
| // SM level to PTX filename fragment, resolved inside client compilation | ||
| version (DComputeCUDA_1200) enum _arch = "cuda1200"; | ||
| else version (DComputeCUDA_900) enum _arch = "cuda900"; | ||
| else version (DComputeCUDA_800) enum _arch = "cuda800"; | ||
| else version (DComputeCUDA_750) enum _arch = "cuda750"; | ||
| else version (DComputeCUDA_700) enum _arch = "cuda700"; | ||
| else version (DComputeCUDA_600) enum _arch = "cuda600"; | ||
| else version (DComputeCUDA_500) enum _arch = "cuda500"; | ||
| else version (DComputeCUDA_300) enum _arch = "cuda300"; | ||
| else version (DComputeCUDA_210) enum _arch = "cuda210"; |
Move this logic into fromEmbedded, possibly as an overload.
| * The PTX file is read and embedded at compile time via the D compiler's | ||
| * string import mechanism (-J / stringImportPaths in dub.json). No file | ||
| * I/O occurs at runtime. | ||
| * | ||
| * Example: | ||
| * Program p = Program.fromEmbedded!"kernel.ptx"(); | ||
| */ | ||
| static Program fromEmbedded(string filename)() |
Currently the compiler emits the PTX file after compilation, so unless you double compile this then I don't think it will work as expected. You would need to essentially reference a symbol and then in the link phase have the compiler generate an object file for it and link that in.
| * block = Block dimensions [x, y, z]. | ||
| * args = Kernel arguments (host types, Buffer/UnifiedBuffer ). | ||
| */ | ||
| auto launch(alias k)(uint[3] grid, uint[3] block, |
I think it would be best to move this function into runtime.d, so that all the easy-to-use stuff is in one file (and people who want to use dcompute in a more standalone fashion still can).
Summary
With this PR, all DCompute runtime infrastructure is managed lazily and transparently behind the scenes. Developers only need to write their host code, allocate memory (Buffer), and launch their compute kernels directly using launch!k.
Major Changes
1. Lazy Static Init Runtime (source/dcompute/driver/cuda/runtime.d)
- A process-wide static constructor (shared static this()) that initializes CUDA, discovers active GPUs, allocates the default Context (Device 0), and pushes it onto the context stack.
- A thread-local static constructor (static this()) that ensures every thread gets a lock-free, dedicated Queue (CUstream) with zero resource contention.
- An ensureInit() guard as a defensive safety fallback for edge cases.
2. Context-Sensitive Compile-Time PTX Embedding (source/dcompute/driver/cuda/package.d)
- An import() statement inside the launch!k template definition.
- Because launch! is a template, it is instantiated inside the parent project's compilation context.
- This allows the dcompute library to compile as a standard static library without requiring any local PTX files or string import flags, while seamlessly embedding the consumer project's custom PTX at compile time.
3. Defensive Safety Triggers (source/dcompute/driver/cuda/buffer.d)
- ensureInit() triggers inside both Buffer!T constructors.
4. dub.json update
- "stringImportPaths": ["."] or the -J flag should be used with the path where the PTX is generated.
Developer Workflow & Flow of State
1. Compilation Flow:
- @compute modules (e.g. tests/kernel.d) are compiled directly into PTX intermediate assembly (kernels_cuda800_64.ptx).
- The -J. flag (the current directory) is passed to the host compilation.
- The host code instantiates launch!matmul. The compiler processes import("kernels_cuda800_64.ptx"), embedding the GPU bytecode directly into your executable's text segment.
2. Execution Flow:
- When a Buffer is instantiated, the underlying static constructors initialize CUDA, assign the default device, push the GPU context, and initialize the active thread's CUDA stream.
- When launch! is executed, it checks if Program.globalProgram is initialized. Seeing that it is null, it passes the embedded PTX string to cuModuleLoadData, registering your custom kernels in the GPU context.
Current State & Validation
All internal unittests and client applications compile, link, and validate successfully in one command:
- dub test --compiler=ldc2 completes and passes successfully.
- dub run --force --compiler=ldc2 builds cleanly from scratch, embeds custom matmul kernels, executes them on the GPU, and validates output against host CPU matrices.
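The initialization order described under "Lazy Static Init Runtime" can be illustrated with a minimal sketch. It is not the PR's runtime.d: the flags gDriverReady and tQueueReady are hypothetical stand-ins for the CUDA context and the per-thread stream, and no CUDA calls are made. It only shows how a shared static constructor, a thread-local static constructor, and an ensureInit() fallback relate to each other in D.

```d
import core.atomic : atomicLoad, atomicStore;

// Hypothetical stand-ins for driver state (assumptions for this sketch only).
shared bool gDriverReady; // process-wide, shared across threads
bool tQueueReady;         // module-level variables are thread-local by default in D

// Runs once per process, before main().
shared static this()
{
    // A real runtime would initialise the driver and create a context here.
    atomicStore(gDriverReady, true);
}

// Runs once in every thread, after the shared static constructors.
static this()
{
    // A real runtime would create this thread's stream here.
    tQueueReady = true;
}

// Defensive fallback: does nothing if the constructors have already run.
void ensureInit()
{
    if (!atomicLoad(gDriverReady))
        atomicStore(gDriverReady, true);
    if (!tQueueReady)
        tQueueReady = true;
}

void main()
{
    ensureInit(); // safe to call repeatedly
    assert(atomicLoad(gDriverReady) && tQueueReady);
}
```

In this arrangement ensureInit() is redundant whenever the module constructors have run, which matches the review question about whether the calls in the Buffer constructors are needed.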
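The "Context-Sensitive Compile-Time PTX Embedding" item relies on two compile-time steps: choosing an architecture fragment through version identifiers and resolving a file name inside a template. The sketch below shows only those steps under stated assumptions: ptxName, embeddedPtxName and dummyKernel are hypothetical names, the fallback to cuda800 is assumed for the sketch, and the actual import() call is left as a comment because it needs -J pointing at a generated PTX file.

```d
// Version identifiers follow the DComputeCUDA_* pattern from the diff;
// the final fallback is an assumption made so the sketch always compiles.
version (DComputeCUDA_900)      enum arch = "cuda900";
else version (DComputeCUDA_800) enum arch = "cuda800";
else                            enum arch = "cuda800";

// Builds the PTX file name at compile time. The "kernels_<arch>_64.ptx"
// pattern is inferred from the example file name in the PR description.
enum ptxName(string a) = "kernels_" ~ a ~ "_64.ptx";

// Hypothetical helper: the template is instantiated during the caller's
// compilation, which is why the PR places the string import in launch!k.
string embeddedPtxName(alias kernel)()
{
    // With -J set to the PTX output directory, a real build could do:
    //   enum ptx = import(ptxName!arch);
    return ptxName!arch;
}

void dummyKernel() {} // placeholder for an @compute kernel

void main()
{
    static assert(ptxName!"cuda800" == "kernels_cuda800_64.ptx");
    assert(embeddedPtxName!dummyKernel() == ptxName!arch);
}
```

Note that the review comment about PTX being emitted after compilation still applies: the file named by ptxName!arch has to exist before the host code that imports it is compiled.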