
[None][feat] Add simulation mode for serving benchmarking #12700

Draft
venkywonka wants to merge 27 commits into NVIDIA:main from venkywonka:venky/hisim-port

Conversation


@venkywonka venkywonka commented Apr 2, 2026

Summary

Adds a simulation mode to TensorRT-LLM that runs the real Python scheduler with mocked GPU execution, enabling fast GPU-free benchmarking of serving configurations. Complete v1 implementation (Phases 0-5).

Usage

# CLI: quick sim with constant predictor
trtllm-bench -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 throughput \
  --dataset data.jsonl --sim

# CLI: AIC predictor with TP=2
trtllm-bench -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 throughput \
  --dataset data.jsonl --sim --sim-config sim.yaml --tp 2

# Python API
from tensorrt_llm.llmapi import LLM, SamplingParams
from tensorrt_llm.llmapi.sim_config import SimConfig, PredictorConfig

sim_config = SimConfig(predictor=PredictorConfig(
    name="aiconfigurator", device_name="h100_sxm", backend_version="1.2.0rc5"))
llm = LLM(model, sim_config=sim_config, tensor_parallel_size=2)
output = llm.generate(["Hello world"], sampling_params=SamplingParams(max_tokens=8))
metrics = sim_config._clock.metrics  # TTFT, TPOT, ITL, throughput
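
A hypothetical `sim.yaml` for `--sim-config` might look like the following. The keys mirror the `SimConfig`/`PredictorConfig` fields shown in the Python API above; the exact file schema accepted by the CLI is an assumption, not confirmed by this PR.

```yaml
# Hypothetical sim.yaml — keys inferred from SimConfig/PredictorConfig above
predictor:
  name: aiconfigurator
  device_name: h100_sxm
  backend_version: 1.2.0rc5
```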

What's included (Phases 0-5)

| Phase | Deliverable | Status |
| --- | --- | --- |
| 0 | Minimal POC — SimModelEngine + SimSampler duck-typed into PyExecutor | Done |
| 1 | SimConfig Pydantic model, InferTimePredictor ABC, constant predictor | Done |
| 2 | AIConfigurator integration — real per-batch time predictions | Done |
| 3 | SimClock replaces time.sleep() with predicted time accumulation | Done |
| 3.5 | SimDistributed mock + single-process executor — TP>1 works | Done |
| 4 | Per-request metrics (TTFT/TPOT/ITL/e2e), per-iteration breakdown | Done |
| 5 | trtllm-bench throughput --sim CLI with 3-tier verification | Done |

Key design decisions

  • Real scheduler, mock execution: PyExecutor is completely unmodified; only ModelEngine and Sampler are replaced
  • Single-process for sim: Forces GenerationExecutorWorker (bypasses MPI) so SimClock lives in the caller's address space
  • SimDistributed: No-op communication mock enables TP>1 simulation without NCCL
  • Same output format: --sim produces the same table layout as the real benchmark, with a [SIM] banner
  • No numpy: Metrics use stdlib-only percentile computation
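
To illustrate the "no numpy" decision, here is a minimal sketch of a stdlib-only percentile helper. The function name and interpolation scheme used in the actual sim_metrics.py are assumptions; this shows only that linear-interpolation percentiles need nothing beyond `sorted()`.

```python
# Hypothetical stdlib-only percentile (linear interpolation between ranks).
# Not the actual sim_metrics.py implementation — an illustrative sketch.
from typing import List

def percentile(values: List[float], p: float) -> float:
    """Return the p-th percentile (0-100) of values, stdlib only."""
    if not values:
        raise ValueError("values must be non-empty")
    xs = sorted(values)
    k = (len(xs) - 1) * (p / 100.0)   # fractional rank into sorted list
    lo, hi = int(k), min(int(k) + 1, len(xs) - 1)
    frac = k - lo
    return xs[lo] * (1.0 - frac) + xs[hi] * frac

latencies_ms = [4.3, 4.4, 4.2, 9.8, 4.5]
p50 = percentile(latencies_ms, 50)   # median of the sample
p99 = percentile(latencies_ms, 99)   # tail latency
```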

Verified configurations

| Config | TTFT | TPOT | Throughput |
| --- | --- | --- | --- |
| Constant (10/5ms) TP=1 | 15.0ms | 4.3ms | 177.8 tok/s |
| AIC H100 TP=1 | 3.19ms | 1.21ms | 686.2 tok/s |
| AIC H100 TP=2 | 2.91ms | 1.03ms | 792.8 tok/s |
| CLI: 10 requests constant | 33.0ms | 4.77ms | 1523.8 tok/s |
| CLI: 10 requests AIC TP=2 | 5.54ms | 1.22ms | 6662.4 tok/s |
| CLI: Calibrated vs real RTX 3090 Ti | Structural match verified | | |

New files

| File | Purpose |
| --- | --- |
| tensorrt_llm/llmapi/sim_config.py | SimConfig + PredictorConfig (Pydantic) |
| tensorrt_llm/_torch/pyexecutor/sim_clock.py | SimClock — time accumulator + metrics |
| tensorrt_llm/_torch/pyexecutor/sim_metrics.py | SimRequestStats, calc_sim_metrics() |
| tensorrt_llm/_torch/pyexecutor/sim_model_engine.py | Dummy model engine |
| tensorrt_llm/_torch/pyexecutor/sim_sampler.py | Dummy sampler + token recording |
| tensorrt_llm/_torch/pyexecutor/sim_predictor.py | InferTimePredictor ABC + ConstantPredictor |
| tensorrt_llm/_torch/pyexecutor/sim_predictor_aic.py | AIConfiguratorPredictor |
| tensorrt_llm/_torch/pyexecutor/sim_distributed.py | No-op communication mock |
| tensorrt_llm/bench/benchmark/sim_benchmark.py | CLI sim benchmark runner |

Modified files

| File | Change |
| --- | --- |
| tensorrt_llm/llmapi/llm_args.py | Added sim_config field to TorchLlmArgs |
| tensorrt_llm/_torch/pyexecutor/py_executor_creator.py | _create_sim_py_executor() |
| tensorrt_llm/executor/executor.py | Force single-process when sim_config set |
| tensorrt_llm/bench/benchmark/throughput.py | --sim and --sim-config flags |

Test plan

  • 82 unit tests in tests/unittest/sim/ (SimConfig, SimClock, SimMetrics, SimDistributed, SimPredictor, AIC predictor)
  • 3 e2e Python API tests in slop/test_sim.py (constant metrics, AIC metrics, AIC TP=2)
  • 3-tier CLI tests in slop/test_bench_sim.py (constant sim, AIC TP=2, calibrated vs real)
  • Verify no regression in existing unit tests

🤖 Generated with Claude Code

venkywonka and others added 20 commits March 30, 2026 18:31
Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Introduce the predictor abstraction for simulation mode batch timing:
- SimBatch dataclass: lightweight batch description decoupled from
  TRT-LLM internals for easy isolated testing
- InferTimePredictor ABC: abstract base for batch execution time prediction
- ConstantPredictor: returns fixed prefill/decode time per batch

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
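
The predictor abstraction described in this commit can be sketched as follows. Class names (SimBatch, InferTimePredictor, ConstantPredictor) come from the commit message; field and method names here are assumptions, not the actual API.

```python
# Illustrative sketch of the predictor abstraction — names beyond those in
# the commit message (predict, input_length, past_kv_length usage) are assumed.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List

@dataclass
class SimRequest:
    input_length: int = 0
    past_kv_length: int = 0

@dataclass
class SimBatch:
    # Lightweight batch description decoupled from TRT-LLM internals.
    is_prefill: bool = True
    requests: List[SimRequest] = field(default_factory=list)

class InferTimePredictor(ABC):
    @abstractmethod
    def predict(self, batch: SimBatch) -> float:
        """Return predicted execution time for one batch, in seconds."""

class ConstantPredictor(InferTimePredictor):
    """Fixed prefill/decode time per batch, regardless of contents."""
    def __init__(self, prefill_time_s: float, decode_time_s: float):
        self.prefill_time_s = prefill_time_s
        self.decode_time_s = decode_time_s

    def predict(self, batch: SimBatch) -> float:
        return self.prefill_time_s if batch.is_prefill else self.decode_time_s

pred = ConstantPredictor(prefill_time_s=0.010, decode_time_s=0.005)
t = pred.predict(SimBatch(is_prefill=False))  # decode-time prediction
```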
Introduce SimConfig and PredictorConfig StrictBaseModel classes in
tensorrt_llm/llmapi/sim_config.py. Replace the simulation_mode: bool
field on TorchLlmArgs with sim_config: Optional[SimConfig] (presence
means activation). Update the activation check in
py_executor_creator.py accordingly.

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
SimModelEngine now accepts an optional time_predictor and calls
time.sleep(predicted_time) in forward() to simulate realistic
batch latency. The executor creator reads SimConfig from llm_args
and instantiates a ConstantPredictor accordingly. The smoke test
is updated to use SimConfig and assert wall-clock timing.

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Validates Pydantic models including defaults, custom values,
negative-value rejection, invalid predictor names, extra-field
rejection, and model_dump roundtrip serialization.

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Add AIConfigurator predictor type to PredictorConfig with device_name,
database_path, backend_version, and scale factor fields. A model_validator
ensures device_name and backend_version are provided when
name='aiconfigurator'. Existing constant predictor behavior is unchanged.
Add validation tests for the new fields.

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Introduce AIConfiguratorPredictor that uses the AIConfigurator SDK's
analytical model to predict per-batch prefill and decode latency based
on model architecture, hardware silicon tables, batch size, and
sequence lengths. Tests use the bundled H100 SXM database with
TinyLlama and are auto-skipped when the AIC systems directory is
not available.

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
…actory

Populate SimBatch.requests with per-request input_length and
past_kv_length from ScheduledRequests, enabling the AIConfigurator
predictor to make per-request timing predictions. Add a predictor
factory in _create_sim_py_executor that dispatches on pc.name to
instantiate either ConstantPredictor or AIConfiguratorPredictor.

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Replace time.sleep(predicted_time) with SimClock.step() accumulation.
Simulation now runs at CPU speed with no sleeping. SimClock tracks
total predicted time and iteration count.

Known limitations:
- SimClock is on MPI worker side; caller gets None via sim_config._clock
- TRTLLM_SKIP_KV_CACHE_ESTIMATION=1 needed due to executor restart bug
  after configure_kv_cache_capacity shutdown cycle
- Cross-process clock access deferred to Phase 4

New: sim_clock.py, test_sim_clock.py (5 tests)
Modified: sim_model_engine.py, py_executor_creator.py, sim_config.py
Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
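
The accumulation idea in this commit can be sketched minimally: instead of time.sleep(predicted_time), each predicted duration is added to a virtual clock, so the simulation runs at CPU speed. Attribute names below are assumptions about the real SimClock.

```python
# Hedged sketch of SimClock-style time accumulation — attribute names assumed.
class SimClock:
    def __init__(self):
        self.now = 0.0            # virtual time, in seconds
        self.iteration_count = 0  # total executor iterations simulated

    def step(self, predicted_time: float) -> float:
        """Advance virtual time by one iteration's predicted duration."""
        self.now += predicted_time
        self.iteration_count += 1
        return self.now

clock = SimClock()
clock.step(0.010)    # one prefill iteration
for _ in range(7):   # seven decode iterations
    clock.step(0.005)
```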
Deep-Sim vision: embed per-op timing collection (COLLECT mode) and
simulation (SIMULATE mode) directly into TRT-LLM's op layer. Replaces
external AIC collection pipeline with self-reporting ops.

Update v1 roadmap: Phase 3 done, add Phase 3.5 (single-process mode),
note reusability constraints for Deep-Sim.

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
- Add SimDistributed mock that no-ops all communication (barrier,
  broadcast, allgather, allreduce) for sim mode
- Force single-process GenerationExecutorWorker when sim_config is set,
  bypassing MPI spawn regardless of model_world_size
- Auto-skip KV cache estimation in sim mode (removes env var hack)
- SimClock now visible to caller (same address space)
- TP>1 simulation works: TP=2 constructs, generates, records clock
- AIC predictions differ for TP=1 vs TP=2 (ratio 1.16x for TinyLlama)

Verified: 57 unit tests + 3 e2e tests (TP1 constant, TP2 constant,
AIC TP1 vs TP2) all passing. No env var hacks needed.

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
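
A no-op communication mock in the spirit of SimDistributed might look like this: every collective returns immediately with locally-plausible results, so TP>1 scheduling can run in one process without NCCL. The method names mirror the operations listed in the commit message; the signatures and return conventions are assumptions.

```python
# Hedged sketch of a no-op distributed mock — not the actual SimDistributed API.
from typing import Any, List

class SimDistributed:
    def __init__(self, rank: int = 0, world_size: int = 1):
        self.rank = rank
        self.world_size = world_size

    def barrier(self) -> None:
        pass  # nothing to synchronize within a single process

    def broadcast(self, obj: Any, root: int = 0) -> Any:
        return obj  # the "broadcast" value is already local

    def allgather(self, obj: Any) -> List[Any]:
        return [obj] * self.world_size  # pretend every rank sent the same thing

    def allreduce(self, value: float) -> float:
        return value * self.world_size  # sum-reduce of identical per-rank values

dist = SimDistributed(rank=0, world_size=2)
```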
Document Phase 3 (SimClock) and Phase 3.5 (single-process sim mode,
SimDistributed, TP>1 support). Key discoveries: prefill produces first
token (8 iters not 9), MPI boundary caused clock invisibility (fixed
with SimDistributed + single-process executor routing).

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Per-request timing (TTFT, TPOT, ITL, e2e), per-iteration breakdown,
calc_sim_metrics() pure function. Python API on SimClock + optional
file output. Verification with constant predictor (exact values),
AIC predictor (structural checks), and AIC TP=2 (TP flows through).

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
New sim_metrics.py with SimRequestStats, SimIterationRecord, and
calc_sim_metrics() pure function. Computes TTFT, TPOT, ITL, e2e
latency, and throughput from simulated clock timestamps.

Extended SimClock with record_iteration(), register_request(),
record_token(), metrics property, and write_metrics() for file output.

Hooked SimModelEngine.forward() to record iteration data and
SimSampler.update_requests() to record per-request token timestamps.

Key finding: TTFT includes prefill + first decode iteration because
PyExecutor skips sampler during prefill context processing.

Verified with: constant predictor (exact values), AIC predictor
(structural checks + cross-consistency), AIC TP=2 (TP flows through).

82 unit tests + 3 e2e tests passing.

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
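
The kind of pure computation calc_sim_metrics() performs can be sketched from simulated-clock token timestamps. The formulas follow the usual serving definitions (TTFT, mean inter-token latency, end-to-end); the real function's inputs and field names may differ. Note how the first token landing at 15 ms reproduces the "TTFT includes prefill + first decode" finding.

```python
# Illustrative per-request metric computation — a sketch, not calc_sim_metrics().
from typing import Dict, List

def request_metrics(arrival: float, token_times: List[float]) -> Dict[str, float]:
    ttft = token_times[0] - arrival                 # time to first token
    itls = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = sum(itls) / len(itls) if itls else 0.0   # mean inter-token latency
    e2e = token_times[-1] - arrival                 # end-to-end latency
    return {"ttft": ttft, "tpot": tpot, "e2e": e2e}

# One request: prefill + first decode at t=0.015s, then a token every 5 ms.
m = request_metrics(0.0, [0.015, 0.020, 0.025, 0.030])
```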
Key discovery: TTFT includes prefill + first decode (15ms not 10ms)
because PyExecutor skips sampler during prefill context processing.

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
trtllm-bench submits all requests at once (batch mode) with no rate
limiting. Our sim already matches this pattern. Arrival modeling is
only needed for future online serving simulation.

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
trtllm-bench throughput --sim with 3-tier verification: basic sim
output, AIC TP=2, and calibrated constant predictor vs real silicon.

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
New sim_benchmark.py with load_sim_config, run_sim_benchmark,
print_sim_report for CLI sim mode. Add --sim and --sim-config flags
to throughput_command bypassing async_benchmark.

Helper scripts: calibrate_sim.py extracts prefill/decode times from
real iteration log; compare_reports.py shows side-by-side comparison.

3-tier verification passed:
- Tier 1: Constant predictor sim (10 requests, 1523.8 tok/s)
- Tier 2: AIC TP=2 sim (10 requests, 6662.4 tok/s)
- Tier 3: Calibrated sim vs real RTX 3090 Ti
  (real: prefill=131.58ms decode=7.63ms, sim matches structure)

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
3-tier CLI verification: constant sim, AIC TP=2, calibrated vs real
RTX 3090 Ti. Key discovery: iteration log uses Python repr format.

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Mark Phase 5 as Done in roadmap. Add consolidated fidelity and
limitations doc covering: what's faithfully modeled (chunking,
capacity, reuse), what's not (piggybacking, overlap, disagg, PP),
and practical limitations (GPU memory, process model, AIC accuracy).

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Split Phase 7 into 7a (PP support, hard) and 7b (disagg KV transfer,
medium). Add Phase 8 (GPU-free mock KV, backlog). Document 5
implementation findings: PP blocked by SimDistributed, calibration
usefulness, GPU requirement, output format mismatch, Deep-Sim
architecture validation. Update gantt chart to show v1 complete.

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
@venkywonka venkywonka changed the title from "[None][feat] Add simulation mode for GPU-free serving benchmarking" to "[None][feat] Add simulation mode for serving benchmarking" on Apr 2, 2026
slop/ contains local paths, model tokenizers, and dev-specific
plans/specs that should not be on GitHub.

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>