[None][feat] Add simulation mode for serving benchmarking#12700
Draft
venkywonka wants to merge 27 commits into NVIDIA:main from
Conversation
Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Introduce the predictor abstraction for simulation mode batch timing:
- SimBatch dataclass: lightweight batch description decoupled from TRT-LLM internals for easy isolated testing
- InferTimePredictor ABC: abstract base for batch execution time prediction
- ConstantPredictor: returns fixed prefill/decode time per batch

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
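The commit above can be sketched as follows. `SimBatch`, `InferTimePredictor`, and `ConstantPredictor` are named in the commit, and `input_length`/`past_kv_length` appear in a later commit; all field defaults, the exact method signature, and the prefill-vs-decode dispatch rule are assumptions for illustration:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimRequestDesc:
    # hypothetical per-request view; the PR later populates these from ScheduledRequests
    input_length: int = 0
    past_kv_length: int = 0


@dataclass
class SimBatch:
    # lightweight batch description, decoupled from TRT-LLM internals
    num_prefill: int = 0
    num_decode: int = 0
    requests: List[SimRequestDesc] = field(default_factory=list)


class InferTimePredictor(ABC):
    @abstractmethod
    def predict(self, batch: SimBatch) -> float:
        """Return the predicted execution time for one batch, in seconds."""


class ConstantPredictor(InferTimePredictor):
    def __init__(self, prefill_time: float = 0.010, decode_time: float = 0.005):
        self.prefill_time = prefill_time
        self.decode_time = decode_time

    def predict(self, batch: SimBatch) -> float:
        # assumed rule: any batch containing prefill work is charged the prefill cost
        return self.prefill_time if batch.num_prefill > 0 else self.decode_time
```

Because `SimBatch` is a plain dataclass, predictors can be unit-tested in isolation with hand-built batches, with no engine or GPU involved.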
Introduce SimConfig and PredictorConfig StrictBaseModel classes in tensorrt_llm/llmapi/sim_config.py. Replace the simulation_mode: bool field on TorchLlmArgs with sim_config: Optional[SimConfig] (presence means activation). Update the activation check in py_executor_creator.py accordingly. Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
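A minimal sketch of what such strict Pydantic models could look like. `SimConfig`, `PredictorConfig`, and the extra-field/negative-value rejection are from the commits; the concrete field names, defaults, and the `StrictBaseModel` definition are assumptions (the real `StrictBaseModel` lives elsewhere in TRT-LLM):

```python
from typing import Literal
from pydantic import BaseModel, ConfigDict, NonNegativeFloat


class StrictBaseModel(BaseModel):
    # reject unknown fields, so typos in user configs fail loudly
    model_config = ConfigDict(extra="forbid")


class PredictorConfig(StrictBaseModel):
    name: Literal["constant", "aiconfigurator"] = "constant"
    prefill_time: NonNegativeFloat = 0.010  # hypothetical defaults, seconds
    decode_time: NonNegativeFloat = 0.005


class SimConfig(StrictBaseModel):
    predictor: PredictorConfig = PredictorConfig()
```

With `sim_config: Optional[SimConfig]` on `TorchLlmArgs`, the activation check reduces to `llm_args.sim_config is not None`, and `model_dump()` gives a lossless round trip for serialization.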
SimModelEngine now accepts an optional time_predictor and calls time.sleep(predicted_time) in forward() to simulate realistic batch latency. The executor creator reads SimConfig from llm_args and instantiates a ConstantPredictor accordingly. The smoke test is updated to use SimConfig and assert wall-clock timing. Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
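The wall-clock phase described above can be sketched like this; only `SimModelEngine`, the optional `time_predictor`, and the `time.sleep(predicted_time)` call in `forward()` come from the commit, while the return value and everything else is an assumption (the real engine produces logits and batch outputs):

```python
import time


class SimModelEngine:
    """Phase-2 behavior sketch: forward() sleeps for the predicted batch time.

    A later commit in this PR replaces the sleep with SimClock accumulation
    so simulation runs at CPU speed.
    """

    def __init__(self, time_predictor=None):
        self.time_predictor = time_predictor

    def forward(self, batch):
        if self.time_predictor is not None:
            predicted = self.time_predictor.predict(batch)
            time.sleep(predicted)  # simulate realistic batch latency on the wall clock
        return {}  # mocked outputs; placeholder for illustration only
```

This is why the smoke test in this commit can assert on wall-clock timing: elapsed time should be at least the sum of predicted batch times.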
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Validates Pydantic models including defaults, custom values, negative-value rejection, invalid predictor names, extra-field rejection, and model_dump roundtrip serialization. Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Add AIConfigurator predictor type to PredictorConfig with device_name, database_path, backend_version, and scale factor fields. A model_validator ensures device_name and backend_version are provided when name='aiconfigurator'. Existing constant predictor behavior is unchanged. Add validation tests for the new fields. Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
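The conditional requirement can be expressed with a Pydantic `model_validator`, roughly as below. The field names `name`, `device_name`, `database_path`, and `backend_version` come from the commit; the scale-factor field names, defaults, and example values are assumptions:

```python
from typing import Literal, Optional
from pydantic import BaseModel, ConfigDict, model_validator


class PredictorConfig(BaseModel):
    model_config = ConfigDict(extra="forbid")

    name: Literal["constant", "aiconfigurator"] = "constant"
    # aiconfigurator-only fields
    device_name: Optional[str] = None        # e.g. "h100_sxm" (hypothetical value)
    database_path: Optional[str] = None
    backend_version: Optional[str] = None
    prefill_scale: float = 1.0               # hypothetical scale-factor names
    decode_scale: float = 1.0

    @model_validator(mode="after")
    def _check_aic_fields(self):
        # device_name and backend_version are only meaningful (and required)
        # when the AIConfigurator predictor is selected
        if self.name == "aiconfigurator":
            if self.device_name is None or self.backend_version is None:
                raise ValueError(
                    "device_name and backend_version are required "
                    "when name='aiconfigurator'")
        return self
```

An `after`-mode validator sees the fully parsed model, so cross-field checks like this stay in one place while the constant predictor path is untouched.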
Introduce AIConfiguratorPredictor that uses the AIConfigurator SDK's analytical model to predict per-batch prefill and decode latency based on model architecture, hardware silicon tables, batch size, and sequence lengths. Tests use the bundled H100 SXM database with TinyLlama and are auto-skipped when the AIC systems directory is not available. Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
…actory Populate SimBatch.requests with per-request input_length and past_kv_length from ScheduledRequests, enabling the AIConfigurator predictor to make per-request timing predictions. Add a predictor factory in _create_sim_py_executor that dispatches on pc.name to instantiate either ConstantPredictor or AIConfiguratorPredictor. Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Replace time.sleep(predicted_time) with SimClock.step() accumulation. Simulation now runs at CPU speed with no sleeping. SimClock tracks total predicted time and iteration count.

Known limitations:
- SimClock is on MPI worker side; caller gets None via sim_config._clock
- TRTLLM_SKIP_KV_CACHE_ESTIMATION=1 needed due to executor restart bug after configure_kv_cache_capacity shutdown cycle
- Cross-process clock access deferred to Phase 4

New: sim_clock.py, test_sim_clock.py (5 tests)
Modified: sim_model_engine.py, py_executor_creator.py, sim_config.py

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
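The accumulation idea is simple enough to sketch in a few lines. `SimClock`, `step()`, and the tracked totals come from the commit; attribute names and the `now` property are assumptions:

```python
class SimClock:
    """Accumulates predicted batch times instead of sleeping (sketch;
    the real fields and methods in sim_clock.py may differ)."""

    def __init__(self):
        self.total_time = 0.0    # simulated seconds elapsed
        self.num_iterations = 0  # engine iterations stepped so far

    def step(self, predicted_time: float) -> float:
        # advance simulated time by one engine iteration
        self.total_time += predicted_time
        self.num_iterations += 1
        return self.total_time

    @property
    def now(self) -> float:
        return self.total_time
```

The engine loop then runs as fast as the CPU allows: each `forward()` calls `clock.step(predicted)` rather than sleeping, and the accumulated total is the simulated wall time.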
Deep-Sim vision: embed per-op timing collection (COLLECT mode) and simulation (SIMULATE mode) directly into TRT-LLM's op layer. Replaces external AIC collection pipeline with self-reporting ops. Update v1 roadmap: Phase 3 done, add Phase 3.5 (single-process mode), note reusability constraints for Deep-Sim. Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
- Add SimDistributed mock that no-ops all communication (barrier, broadcast, allgather, allreduce) for sim mode
- Force single-process GenerationExecutorWorker when sim_config is set, bypassing MPI spawn regardless of model_world_size
- Auto-skip KV cache estimation in sim mode (removes env var hack)
- SimClock now visible to caller (same address space)
- TP>1 simulation works: TP=2 constructs, generates, records clock
- AIC predictions differ for TP=1 vs TP=2 (ratio 1.16x for TinyLlama)

Verified: 57 unit tests + 3 e2e tests (TP1 constant, TP2 constant, AIC TP1 vs TP2) all passing. No env var hacks needed.

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
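A no-op distributed mock along these lines makes TP>1 schedules runnable in one process. The method names (barrier, broadcast, allgather, allreduce) are from the commit; the signatures and return conventions are assumptions:

```python
class SimDistributed:
    """No-op stand-in for the distributed layer in sim mode (sketch;
    real method signatures in sim_distributed.py may differ)."""

    def __init__(self, rank: int = 0, world_size: int = 1):
        self.rank = rank
        self.world_size = world_size

    def barrier(self):
        pass  # single process: nothing to synchronize

    def broadcast(self, obj, root: int = 0):
        return obj  # the object is already "everywhere"

    def allgather(self, obj):
        # every simulated rank contributes the same local object
        return [obj] * self.world_size

    def allreduce(self, value):
        return value  # identity: one contribution, nothing to reduce
```

Because all ranks live in the caller's address space, the SimClock becomes directly visible to the caller, which is what removed the MPI-side clock limitation from the earlier phase.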
Document Phase 3 (SimClock) and Phase 3.5 (single-process sim mode, SimDistributed, TP>1 support). Key discoveries: prefill produces first token (8 iters not 9), MPI boundary caused clock invisibility (fixed with SimDistributed + single-process executor routing). Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Per-request timing (TTFT, TPOT, ITL, e2e), per-iteration breakdown, calc_sim_metrics() pure function. Python API on SimClock + optional file output. Verification with constant predictor (exact values), AIC predictor (structural checks), and AIC TP=2 (TP flows through). Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
New sim_metrics.py with SimRequestStats, SimIterationRecord, and a calc_sim_metrics() pure function. Computes TTFT, TPOT, ITL, e2e latency, and throughput from simulated clock timestamps. Extended SimClock with record_iteration(), register_request(), record_token(), a metrics property, and write_metrics() for file output. Hooked SimModelEngine.forward() to record iteration data and SimSampler.update_requests() to record per-request token timestamps.

Key finding: TTFT includes prefill + first decode iteration because PyExecutor skips the sampler during prefill context processing.

Verified with: constant predictor (exact values), AIC predictor (structural checks + cross-consistency), AIC TP=2 (TP flows through). 82 unit tests + 3 e2e tests passing.

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
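The per-request metric derivation can be sketched as a pure function over recorded token timestamps. `SimRequestStats`, `calc_sim_metrics()`, and the metric names are from the commit; the field names, the exact TPOT/ITL definitions, and the output shape are assumptions:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SimRequestStats:
    # simulated-clock timestamps in seconds (field names are assumptions)
    arrival_time: float = 0.0
    token_times: List[float] = field(default_factory=list)


def calc_sim_metrics(stats: Dict[int, SimRequestStats]) -> Dict[int, dict]:
    """Pure function: derive TTFT, TPOT, ITL, and e2e latency per request."""
    out = {}
    for req_id, s in stats.items():
        ttft = s.token_times[0] - s.arrival_time   # time to first token
        e2e = s.token_times[-1] - s.arrival_time   # end-to-end latency
        # inter-token latency: gaps between consecutive output tokens
        itl = [b - a for a, b in zip(s.token_times, s.token_times[1:])]
        # time-per-output-token over the decode phase (assumed definition)
        n_decode = len(s.token_times) - 1
        tpot = (e2e - ttft) / n_decode if n_decode > 0 else 0.0
        out[req_id] = {"ttft": ttft, "tpot": tpot, "itl": itl, "e2e": e2e}
    return out
```

Keeping the computation pure (timestamps in, metrics out) is what makes the "exact values" verification with the constant predictor possible: expected numbers can be computed by hand.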
Key discovery: TTFT includes prefill + first decode (15ms not 10ms) because PyExecutor skips sampler during prefill context processing. Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
trtllm-bench submits all requests at once (batch mode) with no rate limiting. Our sim already matches this pattern. Arrival modeling is only needed for future online serving simulation. Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
trtllm-bench throughput --sim with 3-tier verification: basic sim output, AIC TP=2, and calibrated constant predictor vs real silicon. Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
New sim_benchmark.py with load_sim_config, run_sim_benchmark, print_sim_report for CLI sim mode. Add --sim and --sim-config flags to throughput_command, bypassing async_benchmark. Helper scripts: calibrate_sim.py extracts prefill/decode times from a real iteration log; compare_reports.py shows a side-by-side comparison.

3-tier verification passed:
- Tier 1: Constant predictor sim (10 requests, 1523.8 tok/s)
- Tier 2: AIC TP=2 sim (10 requests, 6662.4 tok/s)
- Tier 3: Calibrated sim vs real RTX 3090 Ti (real: prefill=131.58ms, decode=7.63ms; sim matches structure)

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
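The core idea of calibrate_sim.py can be sketched as below: average real iteration latencies, split by whether the iteration did prefill (context) work. The commit that follows notes the iteration log uses Python repr format, hence `ast.literal_eval` rather than JSON; the record field names here are assumptions, not the real log schema:

```python
import ast
from statistics import mean
from typing import Optional, Tuple


def calibrate_from_iteration_log(path: str) -> Tuple[Optional[float], Optional[float]]:
    """Hypothetical reimplementation of calibrate_sim.py's core idea.

    Returns (mean prefill latency, mean decode latency) in the log's units.
    Field names ("num_ctx_requests", "iter_latency_ms") are assumptions.
    """
    prefill, decode = [], []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rec = ast.literal_eval(line)  # log lines are Python repr, not JSON
            bucket = prefill if rec.get("num_ctx_requests", 0) > 0 else decode
            bucket.append(rec["iter_latency_ms"])
    return (mean(prefill) if prefill else None,
            mean(decode) if decode else None)
```

Feeding these averages back into a constant predictor is what makes the Tier-3 "calibrated sim vs real silicon" comparison possible.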
3-tier CLI verification: constant sim, AIC TP=2, calibrated vs real RTX 3090 Ti. Key discovery: iteration log uses Python repr format. Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Mark Phase 5 as Done in roadmap. Add consolidated fidelity and limitations doc covering: what's faithfully modeled (chunking, capacity, reuse), what's not (piggybacking, overlap, disagg, PP), and practical limitations (GPU memory, process model, AIC accuracy). Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Split Phase 7 into 7a (PP support, hard) and 7b (disagg KV transfer, medium). Add Phase 8 (GPU-free mock KV, backlog). Document 5 implementation findings: PP blocked by SimDistributed, calibration usefulness, GPU requirement, output format mismatch, Deep-Sim architecture validation. Update gantt chart to show v1 complete. Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
slop/ contains local paths, model tokenizers, and dev-specific plans/specs that should not be on GitHub. Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Summary
Adds a simulation mode to TensorRT-LLM that runs the real Python scheduler with mocked GPU execution, enabling fast GPU-free benchmarking of serving configurations. Complete v1 implementation (Phases 0-5).
Usage
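The commits add `--sim` and `--sim-config` flags to `trtllm-bench throughput` and a Pydantic SimConfig with a predictor. A hypothetical config file is shown below; the file format, field names, and values are assumptions modeled on SimConfig/PredictorConfig, not the PR's actual schema:

```yaml
# sim_config.yaml -- hypothetical schema (assumed field names)
predictor:
  name: constant        # or "aiconfigurator" (then device_name etc. required)
  prefill_time: 0.010   # seconds charged per prefill iteration
  decode_time: 0.005    # seconds charged per decode iteration
```

Passed on the command line as, e.g., `trtllm-bench throughput --sim --sim-config sim_config.yaml` (other required trtllm-bench arguments omitted).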
What's included (Phases 0-5)
- trtllm-bench throughput --sim CLI with 3-tier verification

Key design decisions
- --sim produces the same table layout as the real benchmark, with a [SIM] banner

Verified configurations
New files
- tensorrt_llm/llmapi/sim_config.py
- tensorrt_llm/_torch/pyexecutor/sim_clock.py
- tensorrt_llm/_torch/pyexecutor/sim_metrics.py
- tensorrt_llm/_torch/pyexecutor/sim_model_engine.py
- tensorrt_llm/_torch/pyexecutor/sim_sampler.py
- tensorrt_llm/_torch/pyexecutor/sim_predictor.py
- tensorrt_llm/_torch/pyexecutor/sim_predictor_aic.py
- tensorrt_llm/_torch/pyexecutor/sim_distributed.py
- tensorrt_llm/bench/benchmark/sim_benchmark.py

Modified files
- tensorrt_llm/llmapi/llm_args.py: sim_config field on TorchLlmArgs
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py: _create_sim_py_executor()
- tensorrt_llm/executor/executor.py
- tensorrt_llm/bench/benchmark/throughput.py: --sim and --sim-config flags

Test plan
- tests/unittest/sim/ (SimConfig, SimClock, SimMetrics, SimDistributed, SimPredictor, AIC predictor)
- slop/test_sim.py (constant metrics, AIC metrics, AIC TP=2)
- slop/test_bench_sim.py (constant sim, AIC TP=2, calibrated vs real)

🤖 Generated with Claude Code