Bug Report: DFlash Segfault with Qwen3.6-27B on RTX PRO 4500 Blackwell (SM120)
Environment
- GPU: NVIDIA RTX PRO 4500 Blackwell (SM 12.0, 32 GB VRAM)
- OS: Unraid 7.3, kernel 6.18.23
- CUDA: 12.8.1 (nvidia/cuda:12.8.1-devel-ubuntu24.04)
- BeeLlama build: main branch, cloned May 23, 2026
- Docker: Yes, nvidia runtime
Models Used
- Target: unsloth/Qwen3.6-27B-GGUF Q5_K_M (19 GB)
- Draft: Anbeeld/Qwen3.6-27B-DFlash-GGUF Q4_K_M (986 MB)
- mmproj: mmproj-BF16.gguf (889 MB, optional)
Command
llama-server \
-m /models/Qwen3.6-27B-Q5_K_M.gguf \
--spec-type dflash \
--spec-draft-model /models/Qwen3.6-27B-DFlash-Q4_K_M.gguf \
--spec-draft-ngl all \
--spec-dflash-cross-ctx 1024 \
--mmproj /models/mmproj-BF16.gguf \
--reasoning on --jinja \
-ngl all -c 524288 -np 1 -b 2048 -ub 512 \
--kv-unified \
--cache-type-k turbo3_tcq --cache-type-v turbo3_tcq \
--flash-attn on \
--host 0.0.0.0 --port 11437
Behavior
- Server starts successfully and passes health check
- DFlash draft model loads correctly (58 tensors, 5 blocks, vocab match confirmed)
- DFlash KV cache allocated (40 MB)
- GPU hidden capture ring buffer allocated (5 layers x 1024 slots x 5120 embd, ~200 MB)
- Crash occurs on first inference request with segfault (exit 139)
Stack Trace
/tmp/beellama.cpp/build/bin/libllama-common.so.0(common_speculative_state_dflash::draft(...)+0x361)
/tmp/beellama.cpp/build/bin/libllama-common.so.0(common_speculative_draft(...)+0xbd)
Key Log Output Before Crash
dflash: target/drafter info: target_ctx_train=262144 target_vocab=248320 drafter_vocab=248320 vocab_match=1 capture_min=1 capture_max=61
dflash: GPU hidden capture policy: allowed=1 forced_cpu=0 requested=1 target_devices=1 drafter_devices=1
dflash gpu ring: allocated 5 layers x 1024 slots x 5120 embd + staging (~200 MB)
dflash: GPU cross ring enabled (5 layers x 1024 slots x 5120 embd)
dflash_kv_cache_init: allocated DFlash drafter K/V cache: 40.0 MB (5 layers, 1024 tokens, 1024 elems/token)
dflash: drafter K/V projection cache enabled (1024-token window)
slot launch_slot_: id 0 | spec dm controller: adaptive=1 controller=profit
Draft Model Metadata
general.architecture = dflash-draft
dflash-draft.block_count = 5
dflash-draft.context_length = 262144
dflash-draft.embedding_length = 5120
dflash-draft.dflash.block_size = 16
dflash-draft.dflash.target_layer_ids = [1, 16, 31, 46, 61]
Attempts Made (All Crash Identically)
-np 1 (single slot) — same crash
-np 2 (two slots) — same crash
- Without
--mmproj — same crash
- Different cache types (
q5_0/q4_1, turbo3_tcq) — same crash
- Different batch sizes (
-b 2048 -ub 512 vs -b 4096 -ub 1024) — same crash
- Rebuilt with explicit CUDA arch
-DGGML_CUDA_ARCH=120 — same crash (note: cmake ignored the flag, all 237 cubins compiled as sm_52)
- Both with and without
--spec-dflash-cross-ctx 1024 — same crash
Additional Notes
- 100% reproducible — every inference request triggers the crash
- Standard llama.cpp operations (model loading, KV cache, attention, vision) all work correctly on this GPU with the same binary
- The
atomic-tq-mtp-cuda Docker image runs the same model at 36.4 tok/s without DFlash
- MTP speculative decoding works (57.8 tok/s at 128K context) on this hardware
- Build shows
CUDA : ARCHS = 520 even with -DGGML_CUDA_ARCH=120 — cmake flag appears ignored
- However, CUDA forward compatibility means sm_52 code runs fine on Blackwell via PTX JIT — standard operations confirm this
Hypothesis
Qwen3.6-27B uses a hybrid architecture (16 KV-attention layers + 48 Gated DeltaNet/SSM layers). The DFlash draft model references target_layer_ids = [1, 16, 31, 46, 61] which span both layer types. The crash in common_speculative_state_dflash::draft() may be related to how DFlash handles hidden states from the non-standard SSM/DeltaNet layers during the draft function.
The z-lab DFlash model page notes: "The model is still under training, and inference engine support may not be fully available yet due to architectural changes, including causal SWA layers."
Thank you for this excellent fork — looking forward to getting DFlash working on Qwen3.6!
Bug Report: DFlash Segfault with Qwen3.6-27B on RTX PRO 4500 Blackwell (SM120)
Environment
Models Used
Command
Behavior
Stack Trace
Key Log Output Before Crash
Draft Model Metadata
Attempts Made (All Crash Identically)
-np 1(single slot) — same crash-np 2(two slots) — same crash--mmproj— same crashq5_0/q4_1,turbo3_tcq) — same crash-b 2048 -ub 512vs-b 4096 -ub 1024) — same crash-DGGML_CUDA_ARCH=120— same crash (note: cmake ignored the flag, all 237 cubins compiled as sm_52)--spec-dflash-cross-ctx 1024— same crashAdditional Notes
atomic-tq-mtp-cudaDocker image runs the same model at 36.4 tok/s without DFlashCUDA : ARCHS = 520even with-DGGML_CUDA_ARCH=120— cmake flag appears ignoredHypothesis
Qwen3.6-27B uses a hybrid architecture (16 KV-attention layers + 48 Gated DeltaNet/SSM layers). The DFlash draft model references
target_layer_ids = [1, 16, 31, 46, 61]which span both layer types. The crash incommon_speculative_state_dflash::draft()may be related to how DFlash handles hidden states from the non-standard SSM/DeltaNet layers during the draft function.The z-lab DFlash model page notes: "The model is still under training, and inference engine support may not be fully available yet due to architectural changes, including causal SWA layers."
Thank you for this excellent fork — looking forward to getting DFlash working on Qwen3.6!