Agent-native inference engine with O(1) fork latency for tree-structured reasoning.
Dendrite is a specialized LLM inference engine designed for agentic workloads that require exploring multiple reasoning paths simultaneously. Unlike traditional inference engines that optimize for single-sequence throughput, Dendrite provides constant-time forking of inference state, enabling efficient tree-of-thought, MCTS, and beam search algorithms.
TL;DR: If your agent explores multiple reasoning paths, Dendrite is 1000-10000x faster at branching.
| Scenario | vLLM | SGLang | Dendrite |
|---|---|---|---|
| Fork 4K context | 50-100ms | 5-10ms | 3μs |
| 6-branch exploration | 300-600ms | 30-60ms | 18μs |
| MCTS (500 forks) | 25-50s | 2.5-5s | 1.5ms |
| Memory (6 branches, 4K prefix) | 6GB | ~2GB | 1.1GB |
- Use Dendrite for: Tree-of-Thought, MCTS, Beam Search, Speculative Decoding, Multi-Agent
- Use vLLM for: Single-sequence generation, simple chat (higher throughput)
See BENCHMARKS.md and PRACTICAL_IMPACT.md for details.
- O(1) Fork Latency: Create reasoning branches without copying the KV cache using copy-on-write semantics
- Tree-Structured KV Cache: Share memory across branches with reference counting
- PagedAttention: Memory-efficient KV cache management with 16-token blocks
- MCTS & Beam Search: Built-in tree search algorithms with UCT scoring
- FlashInfer Integration: High-performance attention kernels with cascade support
- Grammar Constraints: Structured output via llguidance integration
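The UCT scoring mentioned above balances exploiting high-value branches against exploring under-visited ones. A minimal standalone sketch of the rule (a hypothetical helper for illustration, not Dendrite's actual `Search` API):

```rust
/// UCT (Upper Confidence bounds applied to Trees) score for one child node.
/// `total_value` is the sum of rollout rewards, `visits` the child's visit
/// count, `parent_visits` the parent's, and `c` the exploration constant.
fn uct_score(total_value: f64, visits: u64, parent_visits: u64, c: f64) -> f64 {
    if visits == 0 {
        return f64::INFINITY; // unvisited children are always explored first
    }
    let exploitation = total_value / visits as f64;
    let exploration = c * ((parent_visits as f64).ln() / visits as f64).sqrt();
    exploitation + exploration
}

fn main() {
    // Pick the child with the highest UCT score: (total_value, visits) pairs.
    let children = [(3.0_f64, 4_u64), (1.0, 1), (0.0, 0)];
    let parent_visits = 5;
    let best = children
        .iter()
        .enumerate()
        .max_by(|(_, a), (_, b)| {
            uct_score(a.0, a.1, parent_visits, 1.414)
                .partial_cmp(&uct_score(b.0, b.1, parent_visits, 1.414))
                .unwrap()
        })
        .map(|(i, _)| i)
        .unwrap();
    // The unvisited child scores infinity, so it is selected.
    println!("selected child: {}", best);
}
```

Note how the exploration term shrinks as a child accumulates visits, so selection gradually shifts toward branches with the best average reward.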
```
Traditional (vLLM/SGLang):           Dendrite (Copy-on-Write):
Fork = Copy entire KV cache          Fork = Copy block table pointers
O(context_length)                    O(num_blocks) ≈ O(1)

┌─────────────────────┐              ┌─────────────────────┐
│ Parent: 4K tokens   │              │ Block Table (Parent)│
│ [================]  │              │    [0][1][2][3]     │
└─────────────────────┘              └────│──│──│──│───────┘
         │ fork                           │  │  │  │
         ▼                                ▼  ▼  ▼  ▼
┌─────────────────────┐              ┌──────────────────────┐
│ Child: Copy 4K      │              │ Physical Blocks      │
│ [================]  │ ← 50-100ms   │ [B0][B1][B2][B3]     │
└─────────────────────┘              │ ref=2 (shared!)      │
                                     └──────────────────────┘
                                                ▲
                                     ┌────│──│──│──│───────┐
                                     │    [0][1][2][3]     │
                                     │ Block Table (Child) │ ← 500ns
                                     └─────────────────────┘
```
Memory is duplicated only when branches diverge and only at block granularity (16 tokens).
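The mechanism can be sketched as a toy block pool: a fork copies only the table of block IDs and bumps reference counts, and a block is physically duplicated only when a shared block is written. The types below are hypothetical illustrations, not Dendrite's actual cache API:

```rust
use std::collections::HashMap;

/// Toy pool of physical KV blocks, tracked only by reference count.
struct BlockPool {
    ref_counts: HashMap<u32, u32>, // physical block id -> reference count
    next_id: u32,
}

impl BlockPool {
    fn new() -> Self {
        Self { ref_counts: HashMap::new(), next_id: 0 }
    }

    fn alloc(&mut self) -> u32 {
        let id = self.next_id;
        self.next_id += 1;
        self.ref_counts.insert(id, 1);
        id
    }

    /// Fork: share every block by incrementing its refcount.
    /// Cost is O(num_blocks) pointer copies, no KV data is touched.
    fn fork(&mut self, table: &[u32]) -> Vec<u32> {
        for id in table {
            *self.ref_counts.get_mut(id).unwrap() += 1;
        }
        table.to_vec()
    }

    /// Write: if the block is shared, remap to a fresh copy first (CoW).
    fn write(&mut self, table: &mut Vec<u32>, idx: usize) {
        let id = table[idx];
        if self.ref_counts[&id] > 1 {
            *self.ref_counts.get_mut(&id).unwrap() -= 1;
            table[idx] = self.alloc(); // only this one block is duplicated
        }
    }
}

fn main() {
    let mut pool = BlockPool::new();
    let parent: Vec<u32> = (0..4).map(|_| pool.alloc()).collect();
    let mut child = pool.fork(&parent); // copies 4 ids, no KV data
    assert_eq!(parent, child);          // all blocks shared, ref = 2
    pool.write(&mut child, 3);          // child diverges on its last block
    assert_ne!(parent[3], child[3]);    // only that block was duplicated
    assert_eq!(parent[..3], child[..3]);
    println!("shared: {:?}, diverged: {} vs {}", &parent[..3], parent[3], child[3]);
}
```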
| Component | Status | Notes |
|---|---|---|
| KV Cache (CoW) | ✅ Complete | O(1) fork with reference counting |
| Paged KV Cache | ✅ Complete | Memory-efficient page pool |
| Tree State | ✅ Complete | Radix tree for prefix sharing |
| Scheduler | ✅ Complete | Fair batch scheduling |
| Attention Backend | ✅ Complete | Reference + Flash Attention |
| Transformer | ✅ Complete | Full LLaMA architecture |
| Tokenizer | ✅ Complete | HuggingFace tokenizers |
| MCTS Search | ✅ Complete | UCT scoring |
| Beam Search | ✅ Complete | Top-k beams |
| Grammar Constraints | ✅ Complete | llguidance integration |
| GPU Inference | ✅ Complete | candle-flash-attn |
| FP8 Quantization | 🔄 In Progress | E4M3/E5M2/MXFP8 |
| FlashInfer FFI | 🔄 In Progress | Paged attention kernels |
272 tests passing | 40.8 tok/s on NVIDIA GB10 (DGX Spark) with TinyLlama-1.1B
```
┌─────────────────────────────────────────────────────────────────┐
│                            dendrite                             │
│                        (High-level API)                         │
├─────────────────────────────────────────────────────────────────┤
│                          dendrite-core                          │
│  ┌─────────┐ ┌──────────┐ ┌───────────┐ ┌─────────────────┐     │
│  │  Cache  │ │   Tree   │ │ Scheduler │ │      Model      │     │
│  │  (CoW)  │ │  State   │ │  (Batch)  │ │  (Transformer)  │     │
│  └─────────┘ └──────────┘ └───────────┘ └─────────────────┘     │
│  ┌─────────────────────────┐ ┌─────────────────────────────┐    │
│  │         Search          │ │           Grammar           │    │
│  │    (MCTS, Beam, UCT)    │ │  (llguidance constraints)   │    │
│  └─────────────────────────┘ └─────────────────────────────┘    │
├─────────────────────────────────────────────────────────────────┤
│                          dendrite-ffi                           │
│                  (FlashInfer + CUDA bindings)                   │
└─────────────────────────────────────────────────────────────────┘
```
```rust
use dendrite_core::model::{ModelConfig, Tokenizer, Transformer};
use dendrite_core::attention::ReferenceBackend;
use candle_core::Device;
use std::sync::Arc;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Load model
    let model_path = std::path::Path::new("/path/to/tinyllama");
    let config = ModelConfig::from_file(&model_path.join("config.json"))?;
    let tokenizer = Tokenizer::from_dir(model_path)?;
    let backend = Arc::new(ReferenceBackend::new());
    let mut transformer = Transformer::new(config, backend, Device::Cpu)?;
    transformer.load_weights(model_path)?;

    // Generate text
    let text = transformer
        .generate_text(&tokenizer, "Hello, my name is", 50, 0.0)
        .await?;
    println!("{}", text);
    Ok(())
}
```

```rust
use dendrite_core::model::{ModelConfig, Transformer, KvCache};
use dendrite_core::attention::FlashAttnBackend; // requires "cuda" feature
use candle_core::{Device, Tensor};
use std::sync::Arc;

// Load model on GPU
let device = Device::new_cuda(0)?;
let backend = Arc::new(FlashAttnBackend::new(0)?);
let mut transformer = Transformer::new(config, backend, device.clone())?;
transformer.load_weights(model_path)?;

// Create KV cache for efficient autoregressive generation
let mut cache = transformer.create_cache();

// Prefill prompt
let prompt_tokens = vec![1u32, 15043, 29892]; // "<s>Hello,"
let input = Tensor::from_slice(&prompt_tokens, (1, 3), &device)?;
let logits = transformer.forward_with_cache(&input, &mut cache).await?;

// Decode with cached KV (O(1) per token)
let next_token = transformer.sample(&logits, 0.0)?;
println!("Generated token: {}", next_token);
```

```rust
use dendrite_core::cache::{PagedKvCache, DEFAULT_PAGE_SIZE};
use candle_core::DType;

// Create paged cache with O(1) fork support
let cache = PagedKvCache::new(
    32,                // num_layers
    1000,              // initial pages
    DEFAULT_PAGE_SIZE, // 16 tokens/page
    8,                 // num_kv_heads
    128,               // head_dim
    DType::F16,
    device,
)?;

// Allocate parent sequence
let parent = cache.allocate_sequence();

// O(1) fork - shares pages via copy-on-write
let child1 = cache.fork_sequence(parent)?;
let child2 = cache.fork_sequence(parent)?;

// Each child can diverge independently;
// pages are copied only when modified.
```

| Crate | Description |
|---|---|
| `dendrite` | High-level API and engine |
| `dendrite-core` | Core engine: KV cache, scheduler, tree management |
| `dendrite-ffi` | FFI bindings for FlashInfer and CUDA |
- Rust 1.83+ (edition 2024)
- CUDA 12.x (optional, for GPU acceleration)
- FlashInfer 0.2.x (optional, for optimized kernels)
```bash
# CPU-only build
cargo build --release

# With CUDA support
cargo build --release --features cuda

# Full build with all features
cargo build --release --features full
```

```bash
# GPU inference with TinyLlama (requires model weights)
cargo run -p dendrite-core --features cuda --example gpu_inference -- /path/to/tinyllama

# Text generation with tokenizer integration
cargo run -p dendrite-core --example text_inference -- /path/to/tinyllama

# Golden token validation (verify vs HuggingFace)
python scripts/generate_golden.py --model /path/to/tinyllama --output golden_cases.json
cargo run -p dendrite-core --example golden_validation -- /path/to/tinyllama golden_cases.json

# MCTS search with simulated environment
cargo run -p dendrite-core --example mcts_search

# Beam search with mock language model
cargo run -p dendrite-core --example beam_search
```

```bash
# Fork latency benchmark
cargo bench --bench fork

# Tree search benchmark
cargo bench --bench tree_search

# Memory usage benchmark
cargo bench --bench memory
```

Dendrite is optimized for:
- NVIDIA GB10/DGX Spark: 128GB unified memory, NVLink-C2C
- Grace Hopper: Unified memory architecture
- Other CUDA-capable GPUs: fully supported, with correspondingly reduced memory capacity
| Metric | Target | Actual |
|---|---|---|
| Fork latency | < 50 μs | ~500ns (100x better) |
| Grammar mask | < 50 μs | ~1.6μs (30x better) |
| Decode latency | < 10 ms | 10ms on GB10 |
| Memory overhead | < 5% per fork | ~0.1% (CoW) |
See docs/architecture.md for detailed design.
MIT License - see LICENSE for details.
Contributions welcome! Please read CONTRIBUTING.md for guidelines.
- FlashInfer for attention kernels
- vLLM for PagedAttention inspiration
- llguidance for grammar constraints
- Candle for Rust ML framework