Skip to content

Kernel embeddings#98

Draft
badnikhil wants to merge 6 commits into
libmir:masterfrom
badnikhil:kernel_embeddings
Draft

Kernel embeddings#98
badnikhil wants to merge 6 commits into
libmir:masterfrom
badnikhil:kernel_embeddings

Conversation

@badnikhil

@badnikhil badnikhil commented May 25, 2026

Copy link
Copy Markdown
Contributor

Summary

With this PR, all DCompute runtime infrastructure is managed lazily and transparently behind the scenes. Developers only need to write their host code, allocate memory (Buffer), and launch their compute kernels directly using launch!k.


Major Changes

1. Lazy Static Init Runtime (source/dcompute/driver/cuda/runtime.d)

  • Implements a global platform initializer (shared static this()) that initializes CUDA, discovers active GPUs, allocates the default Context (Device 0), and pushes it onto the context stack.
  • Implements a thread-local queue initializer (static this()) that ensures every thread gets a lock-free, dedicated Queue (CUstream) with zero resource contention.
  • Implements a lightweight, double-checked ensureInit() guard as a defensive safety fallback for edge cases.

2. Context-Sensitive Compile-Time PTX Embedding (source/dcompute/driver/cuda/package.d)

  • Moves the compile-time PTX import() statement inside the launch!k template definition.
  • Why this is critical: Because launch! is a template, it is instantiated inside the parent project's compilation context.
  • This allows the dcompute library to compile as a standard static library without requiring any local PTX files or string import flags, while seamlessly embedding the consumer project's custom PTX at compile time.

3. Defensive Safety Triggers (source/dcompute/driver/cuda/buffer.d)

  • Inserts ensureInit() triggers inside both Buffer!T constructors.
  • just for Safety.

4. dub.json update

  • "stringImportPaths": ["."] or -J flag should be used with the path where ptx is generated .

Developer Workflow & Flow of State

1. Compilation Flow:

  1. LDC compiles your @compute modules (e.g. tests/kernel.d) directly into PTX intermediate assembly (kernels_cuda800_64.ptx).
  2. Your parent project's DUB configuration passes -J. (the current directory) to the host compilation.
  3. During host compilation, the D compiler instantiates launch!matmul. The compiler processes import("kernels_cuda800_64.ptx"), embedding the GPU bytecode directly into your executable's text segment.

2. Execution Flow:

  1. Lazy Host Setup: The moment a Buffer is instantiated, the underlying static constructors initialize CUDA, assign the default device, push the GPU context, and initialize the active thread's CUDA stream.
  2. Lazy Program Load (First Launch): The first time launch! is executed, it checks if Program.globalProgram is initialized. Seeing it is null, it passes the embedded PTX string to cuModuleLoadData, registering your custom kernels in the GPU context.
  3. High-Performance Dispatch (Subsequent Launches): On all future kernel launches, the initialization check is skipped, directly enqueuing your kernels on the active CUDA stream with zero driver overhead.

Current State & Validation

All internal unittests and client applications compile, link, and validate successfully in one command:

  • Internal DCompute Unittests: dub test --compiler=ldc2 completes and passes successfully.
  • Client Benchmark Applications: dub run --force --compiler=ldc2 builds cleanly from scratch, embeds custom matmul kernels, executes them on the GPU, and validates output against host CPU matrices.

@badnikhil

Copy link
Copy Markdown
Contributor Author

while ensureInit in buffers is not required but we don't know what a user will do . so , added a check

Comment thread source/dcompute/driver/cuda/program.d Outdated

version(DComputeTestCUDA)
{
Platform.initialise();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add an additional test instead of altering this one.


this(size_t elems)
{
ensureInit();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are these calls needed here?


this(T[] arr)
{
ensureInit();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Comment on lines +54 to +55
* saxpy.launch([N,1,1], [1,1,1], b_res, alpha, b_x, b_y, N);
* launch!saxpy([N,1,1], [1,1,1], b_res, alpha, b_x, b_y, N);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this actually true? saxpy is a template argument here.

Comment on lines +69 to +78
// SM level to PTX filename fragment, resolved inside client compilation
version (DComputeCUDA_1200) enum _arch = "cuda1200";
else version (DComputeCUDA_900) enum _arch = "cuda900";
else version (DComputeCUDA_800) enum _arch = "cuda800";
else version (DComputeCUDA_750) enum _arch = "cuda750";
else version (DComputeCUDA_700) enum _arch = "cuda700";
else version (DComputeCUDA_600) enum _arch = "cuda600";
else version (DComputeCUDA_500) enum _arch = "cuda500";
else version (DComputeCUDA_300) enum _arch = "cuda300";
else version (DComputeCUDA_210) enum _arch = "cuda210";

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this logic into fromEmbedded possibly as an overload.

Comment on lines +45 to +52
* The PTX file is read and embedded at compile time via the D compiler's
* string import mechanism (-J / stringImportPaths in dub.json). No file
* I/O occurs at runtime.
*
* Example:
* Program p = Program.fromEmbedded!"kernel.ptx"();
*/
static Program fromEmbedded(string filename)()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the compiler emits the PTX file after compilation, so unless you double compile this then I don't think it will work as expected. You would need to essentially reference a symbol and then in the link phase have the compiler generate an object file for it and link that in.

* block = Block dimensions [x, y, z].
* args = Kernel arguments (host types, Buffer/UnifiedBuffer ).
*/
auto launch(alias k)(uint[3] grid, uint[3] block,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be best to move this function into runtime.d, so that all the easy -to use stuff is in sone file (and for people that want to use dcompute in a more standalone fashion, they can.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants