Kernel embeddings by badnikhil · Pull Request #98 · libmir/dcompute

badnikhil · 2026-05-25T10:25:43Z

Summary

With this PR, all DCompute runtime infrastructure is managed lazily and transparently behind the scenes. Developers only need to write their host code, allocate memory (Buffer), and launch their compute kernels directly using launch!k.

Major Changes

1. Lazy Static Init Runtime (`source/dcompute/driver/cuda/runtime.d`)

Implements a global platform initializer (shared static this()) that initializes CUDA, discovers active GPUs, allocates the default Context (Device 0), and pushes it onto the context stack.
Implements a thread-local queue initializer (static this()) that ensures every thread gets a lock-free, dedicated Queue (CUstream) with zero resource contention.
Implements a lightweight, double-checked ensureInit() guard as a defensive safety fallback for edge cases.

2. Context-Sensitive Compile-Time PTX Embedding (`source/dcompute/driver/cuda/package.d`)

Moves the compile-time PTX import() statement inside the launch!k template definition.
Why this is critical: Because launch! is a template, it is instantiated inside the parent project's compilation context.
This allows the dcompute library to compile as a standard static library without requiring any local PTX files or string import flags, while seamlessly embedding the consumer project's custom PTX at compile time.

3. Defensive Safety Triggers (`source/dcompute/driver/cuda/buffer.d`)

Inserts ensureInit() triggers inside both Buffer!T constructors.
just for Safety.

4. dub.json update

"stringImportPaths": ["."] or -J flag should be used with the path where ptx is generated .

Developer Workflow & Flow of State

1. Compilation Flow:

LDC compiles your @compute modules (e.g. tests/kernel.d) directly into PTX intermediate assembly (kernels_cuda800_64.ptx).
Your parent project's DUB configuration passes -J. (the current directory) to the host compilation.
During host compilation, the D compiler instantiates launch!matmul. The compiler processes import("kernels_cuda800_64.ptx"), embedding the GPU bytecode directly into your executable's text segment.

2. Execution Flow:

Lazy Host Setup: The moment a Buffer is instantiated, the underlying static constructors initialize CUDA, assign the default device, push the GPU context, and initialize the active thread's CUDA stream.
Lazy Program Load (First Launch): The first time launch! is executed, it checks if Program.globalProgram is initialized. Seeing it is null, it passes the embedded PTX string to cuModuleLoadData, registering your custom kernels in the GPU context.
High-Performance Dispatch (Subsequent Launches): On all future kernel launches, the initialization check is skipped, directly enqueuing your kernels on the active CUDA stream with zero driver overhead.

Current State & Validation

All internal unittests and client applications compile, link, and validate successfully in one command:

Internal DCompute Unittests: dub test --compiler=ldc2 completes and passes successfully.
Client Benchmark Applications: dub run --force --compiler=ldc2 builds cleanly from scratch, embeds custom matmul kernels, executes them on the GPU, and validates output against host CPU matrices.

using "stringImportPaths" for better readability

badnikhil · 2026-05-25T10:27:52Z

while ensureInit in buffers is not required but we don't know what a user will do . so , added a check

thewilsonator · 2026-06-08T08:32:59Z


    version(DComputeTestCUDA)
    {
-        Platform.initialise();


please add an additional test instead of altering this one.

thewilsonator · 2026-06-08T08:36:03Z


    this(size_t elems)
    {
+        ensureInit();


why are these calls needed here?

thewilsonator · 2026-06-08T08:36:09Z


    this(T[] arr)
    {
+        ensureInit();


thewilsonator · 2026-06-08T08:37:19Z

+ *   saxpy.launch([N,1,1], [1,1,1], b_res, alpha, b_x, b_y, N);
+ *   launch!saxpy([N,1,1], [1,1,1], b_res, alpha, b_x, b_y, N);


is this actually true? saxpy is a template argument here.

thewilsonator · 2026-06-08T08:40:30Z

+        // SM level to PTX filename fragment, resolved inside client compilation
+        version      (DComputeCUDA_1200) enum _arch = "cuda1200";
+        else version (DComputeCUDA_900)  enum _arch = "cuda900";
+        else version (DComputeCUDA_800)  enum _arch = "cuda800";
+        else version (DComputeCUDA_750)  enum _arch = "cuda750";
+        else version (DComputeCUDA_700)  enum _arch = "cuda700";
+        else version (DComputeCUDA_600)  enum _arch = "cuda600";
+        else version (DComputeCUDA_500)  enum _arch = "cuda500";
+        else version (DComputeCUDA_300)  enum _arch = "cuda300";
+        else version (DComputeCUDA_210)  enum _arch = "cuda210";


move this logic into fromEmbedded possibly as an overload.

thewilsonator · 2026-06-08T08:42:34Z

+     * The PTX file is read and embedded at compile time via the D compiler's
+     * string import mechanism (-J / stringImportPaths in dub.json).  No file
+     * I/O occurs at runtime.
+     *
+     * Example:
+     *   Program p = Program.fromEmbedded!"kernel.ptx"();
+     */
+    static Program fromEmbedded(string filename)()


Currently the compiler emits the PTX file after compilation, so unless you double compile this then I don't think it will work as expected. You would need to essentially reference a symbol and then in the link phase have the compiler generate an object file for it and link that in.

thewilsonator · 2026-06-08T08:44:22Z

+ *   block     = Block dimensions [x, y, z].
+ *   args      = Kernel arguments (host types, Buffer/UnifiedBuffer ).
+ */
+auto launch(alias k)(uint[3] grid, uint[3] block,


I think it would be best to move this function into runtime.d, so that all the easy -to use stuff is in sone file (and for people that want to use dcompute in a more standalone fashion, they can.

badnikhil added 5 commits May 25, 2026 15:44

cuda : created runtime.d for lazy init

bdfe811

cuda: embed PTX dynamically inside launch! template

ad38a67

updated flags

5f39f6c

using "stringImportPaths" for better readability

added ensureinit check in buffer for safety

3e012aa

update the test

c5a45ed

thewilsonator reviewed May 25, 2026

View reviewed changes

Comment thread source/dcompute/driver/cuda/program.d Outdated

simplify fromEmbedded syntax to take filename directly

e52a7b7

thewilsonator reviewed Jun 8, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Kernel embeddings#98

Kernel embeddings#98
badnikhil wants to merge 6 commits into
libmir:masterfrom
badnikhil:kernel_embeddings

badnikhil commented May 25, 2026 •

edited

Loading

Uh oh!

badnikhil commented May 25, 2026

Uh oh!

Uh oh!

thewilsonator Jun 8, 2026

Uh oh!

thewilsonator Jun 8, 2026

Uh oh!

thewilsonator Jun 8, 2026

Uh oh!

thewilsonator Jun 8, 2026

Uh oh!

thewilsonator Jun 8, 2026

Uh oh!

thewilsonator Jun 8, 2026

Uh oh!

thewilsonator Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		* saxpy.launch([N,1,1], [1,1,1], b_res, alpha, b_x, b_y, N);
		* launch!saxpy([N,1,1], [1,1,1], b_res, alpha, b_x, b_y, N);

Conversation

badnikhil commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Major Changes

1. Lazy Static Init Runtime (source/dcompute/driver/cuda/runtime.d)

2. Context-Sensitive Compile-Time PTX Embedding (source/dcompute/driver/cuda/package.d)

3. Defensive Safety Triggers (source/dcompute/driver/cuda/buffer.d)

4. dub.json update

Developer Workflow & Flow of State

1. Compilation Flow:

2. Execution Flow:

Current State & Validation

Uh oh!

badnikhil commented May 25, 2026

Uh oh!

Uh oh!

thewilsonator Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

thewilsonator Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

thewilsonator Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

thewilsonator Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

thewilsonator Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

thewilsonator Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

thewilsonator Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

badnikhil commented May 25, 2026 •

edited

Loading

1. Lazy Static Init Runtime (`source/dcompute/driver/cuda/runtime.d`)

2. Context-Sensitive Compile-Time PTX Embedding (`source/dcompute/driver/cuda/package.d`)

3. Defensive Safety Triggers (`source/dcompute/driver/cuda/buffer.d`)