[None][feat] Add simulation mode for serving benchmarking#12700
Draft
venkywonka wants to merge 27 commits into NVIDIA:main from
Conversation
Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Introduce the predictor abstraction for simulation mode batch timing:
- SimBatch dataclass: lightweight batch description decoupled from TRT-LLM internals for easy isolated testing
- InferTimePredictor ABC: abstract base for batch execution time prediction
- ConstantPredictor: returns fixed prefill/decode time per batch

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
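The commit above can be sketched as follows. `SimBatch`, `InferTimePredictor`, and `ConstantPredictor` are named in the commit, and `input_length`/`past_kv_length` appear in a later commit; all field defaults, the exact method signature, and the prefill-vs-decode dispatch rule are assumptions for illustration:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimRequestDesc:
    # hypothetical per-request view; the PR later populates these from ScheduledRequests
    input_length: int = 0
    past_kv_length: int = 0


@dataclass
class SimBatch:
    # lightweight batch description, decoupled from TRT-LLM internals
    num_prefill: int = 0
    num_decode: int = 0
    requests: List[SimRequestDesc] = field(default_factory=list)


class InferTimePredictor(ABC):
    @abstractmethod
    def predict(self, batch: SimBatch) -> float:
        """Return the predicted execution time for one batch, in seconds."""


class ConstantPredictor(InferTimePredictor):
    def __init__(self, prefill_time: float = 0.010, decode_time: float = 0.005):
        self.prefill_time = prefill_time
        self.decode_time = decode_time

    def predict(self, batch: SimBatch) -> float:
        # assumed rule: any batch containing prefill work is charged the prefill cost
        return self.prefill_time if batch.num_prefill > 0 else self.decode_time
```

Because `SimBatch` is a plain dataclass, predictors can be unit-tested in isolation with hand-built batches, with no engine or GPU involved.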
Introduce SimConfig and PredictorConfig StrictBaseModel classes in tensorrt_llm/llmapi/sim_config.py. Replace the simulation_mode: bool field on TorchLlmArgs with sim_config: Optional[SimConfig] (presence means activation). Update the activation check in py_executor_creator.py accordingly. Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
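A minimal sketch of what such strict Pydantic models could look like. `SimConfig`, `PredictorConfig`, and the extra-field/negative-value rejection are from the commits; the concrete field names, defaults, and the `StrictBaseModel` definition are assumptions (the real `StrictBaseModel` lives elsewhere in TRT-LLM):

```python
from typing import Literal
from pydantic import BaseModel, ConfigDict, NonNegativeFloat


class StrictBaseModel(BaseModel):
    # reject unknown fields, so typos in user configs fail loudly
    model_config = ConfigDict(extra="forbid")


class PredictorConfig(StrictBaseModel):
    name: Literal["constant", "aiconfigurator"] = "constant"
    prefill_time: NonNegativeFloat = 0.010  # hypothetical defaults, seconds
    decode_time: NonNegativeFloat = 0.005


class SimConfig(StrictBaseModel):
    predictor: PredictorConfig = PredictorConfig()
```

With `sim_config: Optional[SimConfig]` on `TorchLlmArgs`, the activation check reduces to `llm_args.sim_config is not None`, and `model_dump()` gives a lossless round trip for serialization.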
SimModelEngine now accepts an optional time_predictor and calls time.sleep(predicted_time) in forward() to simulate realistic batch latency. The executor creator reads SimConfig from llm_args and instantiates a ConstantPredictor accordingly. The smoke test is updated to use SimConfig and assert wall-clock timing. Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
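The wall-clock phase described above can be sketched like this; only `SimModelEngine`, the optional `time_predictor`, and the `time.sleep(predicted_time)` call in `forward()` come from the commit, while the return value and everything else is an assumption (the real engine produces logits and batch outputs):

```python
import time


class SimModelEngine:
    """Phase-2 behavior sketch: forward() sleeps for the predicted batch time.

    A later commit in this PR replaces the sleep with SimClock accumulation
    so simulation runs at CPU speed.
    """

    def __init__(self, time_predictor=None):
        self.time_predictor = time_predictor

    def forward(self, batch):
        if self.time_predictor is not None:
            predicted = self.time_predictor.predict(batch)
            time.sleep(predicted)  # simulate realistic batch latency on the wall clock
        return {}  # mocked outputs; placeholder for illustration only
```

This is why the smoke test in this commit can assert on wall-clock timing: elapsed time should be at least the sum of predicted batch times.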
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Validates Pydantic models including defaults, custom values, negative-value rejection, invalid predictor names, extra-field rejection, and model_dump roundtrip serialization. Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Add AIConfigurator predictor type to PredictorConfig with device_name, database_path, backend_version, and scale factor fields. A model_validator ensures device_name and backend_version are provided when name='aiconfigurator'. Existing constant predictor behavior is unchanged. Add validation tests for the new fields. Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
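The conditional requirement can be expressed with a Pydantic `model_validator`, roughly as below. The field names `name`, `device_name`, `database_path`, and `backend_version` come from the commit; the scale-factor field names, defaults, and example values are assumptions:

```python
from typing import Literal, Optional
from pydantic import BaseModel, ConfigDict, model_validator


class PredictorConfig(BaseModel):
    model_config = ConfigDict(extra="forbid")

    name: Literal["constant", "aiconfigurator"] = "constant"
    # aiconfigurator-only fields
    device_name: Optional[str] = None        # e.g. "h100_sxm" (hypothetical value)
    database_path: Optional[str] = None
    backend_version: Optional[str] = None
    prefill_scale: float = 1.0               # hypothetical scale-factor names
    decode_scale: float = 1.0

    @model_validator(mode="after")
    def _check_aic_fields(self):
        # device_name and backend_version are only meaningful (and required)
        # when the AIConfigurator predictor is selected
        if self.name == "aiconfigurator":
            if self.device_name is None or self.backend_version is None:
                raise ValueError(
                    "device_name and backend_version are required "
                    "when name='aiconfigurator'")
        return self
```

An `after`-mode validator sees the fully parsed model, so cross-field checks like this stay in one place while the constant predictor path is untouched.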
Introduce AIConfiguratorPredictor that uses the AIConfigurator SDK's analytical model to predict per-batch prefill and decode latency based on model architecture, hardware silicon tables, batch size, and sequence lengths. Tests use the bundled H100 SXM database with TinyLlama and are auto-skipped when the AIC systems directory is not available. Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
…actory Populate SimBatch.requests with per-request input_length and past_kv_length from ScheduledRequests, enabling the AIConfigurator predictor to make per-request timing predictions. Add a predictor factory in _create_sim_py_executor that dispatches on pc.name to instantiate either ConstantPredictor or AIConfiguratorPredictor. Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Replace time.sleep(predicted_time) with SimClock.step() accumulation. Simulation now runs at CPU speed with no sleeping. SimClock tracks total predicted time and iteration count.

Known limitations:
- SimClock is on MPI worker side; caller gets None via sim_config._clock
- TRTLLM_SKIP_KV_CACHE_ESTIMATION=1 needed due to executor restart bug after configure_kv_cache_capacity shutdown cycle
- Cross-process clock access deferred to Phase 4

New: sim_clock.py, test_sim_clock.py (5 tests)
Modified: sim_model_engine.py, py_executor_creator.py, sim_config.py

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
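The accumulation idea is simple enough to sketch in a few lines. `SimClock`, `step()`, and the tracked totals come from the commit; attribute names and the `now` property are assumptions:

```python
class SimClock:
    """Accumulates predicted batch times instead of sleeping (sketch;
    the real fields and methods in sim_clock.py may differ)."""

    def __init__(self):
        self.total_time = 0.0    # simulated seconds elapsed
        self.num_iterations = 0  # engine iterations stepped so far

    def step(self, predicted_time: float) -> float:
        # advance simulated time by one engine iteration
        self.total_time += predicted_time
        self.num_iterations += 1
        return self.total_time

    @property
    def now(self) -> float:
        return self.total_time
```

The engine loop then runs as fast as the CPU allows: each `forward()` calls `clock.step(predicted)` rather than sleeping, and the accumulated total is the simulated wall time.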
Deep-Sim vision: embed per-op timing collection (COLLECT mode) and simulation (SIMULATE mode) directly into TRT-LLM's op layer. Replaces external AIC collection pipeline with self-reporting ops. Update v1 roadmap: Phase 3 done, add Phase 3.5 (single-process mode), note reusability constraints for Deep-Sim. Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
- Add SimDistributed mock that no-ops all communication (barrier, broadcast, allgather, allreduce) for sim mode
- Force single-process GenerationExecutorWorker when sim_config is set, bypassing MPI spawn regardless of model_world_size
- Auto-skip KV cache estimation in sim mode (removes env var hack)
- SimClock now visible to caller (same address space)
- TP>1 simulation works: TP=2 constructs, generates, records clock
- AIC predictions differ for TP=1 vs TP=2 (ratio 1.16x for TinyLlama)

Verified: 57 unit tests + 3 e2e tests (TP1 constant, TP2 constant, AIC TP1 vs TP2) all passing. No env var hacks needed.

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
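A no-op distributed mock along these lines makes TP>1 schedules runnable in one process. The method names (barrier, broadcast, allgather, allreduce) are from the commit; the signatures and return conventions are assumptions:

```python
class SimDistributed:
    """No-op stand-in for the distributed layer in sim mode (sketch;
    real method signatures in sim_distributed.py may differ)."""

    def __init__(self, rank: int = 0, world_size: int = 1):
        self.rank = rank
        self.world_size = world_size

    def barrier(self):
        pass  # single process: nothing to synchronize

    def broadcast(self, obj, root: int = 0):
        return obj  # the object is already "everywhere"

    def allgather(self, obj):
        # every simulated rank contributes the same local object
        return [obj] * self.world_size

    def allreduce(self, value):
        return value  # identity: one contribution, nothing to reduce
```

Because all ranks live in the caller's address space, the SimClock becomes directly visible to the caller, which is what removed the MPI-side clock limitation from the earlier phase.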
Document Phase 3 (SimClock) and Phase 3.5 (single-process sim mode, SimDistributed, TP>1 support). Key discoveries: prefill produces first token (8 iters not 9), MPI boundary caused clock invisibility (fixed with SimDistributed + single-process executor routing). Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Per-request timing (TTFT, TPOT, ITL, e2e), per-iteration breakdown, calc_sim_metrics() pure function. Python API on SimClock + optional file output. Verification with constant predictor (exact values), AIC predictor (structural checks), and AIC TP=2 (TP flows through). Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
New sim_metrics.py with SimRequestStats, SimIterationRecord, and a calc_sim_metrics() pure function. Computes TTFT, TPOT, ITL, e2e latency, and throughput from simulated clock timestamps. Extended SimClock with record_iteration(), register_request(), record_token(), a metrics property, and write_metrics() for file output. Hooked SimModelEngine.forward() to record iteration data and SimSampler.update_requests() to record per-request token timestamps.

Key finding: TTFT includes prefill + first decode iteration because PyExecutor skips the sampler during prefill context processing.

Verified with: constant predictor (exact values), AIC predictor (structural checks + cross-consistency), AIC TP=2 (TP flows through). 82 unit tests + 3 e2e tests passing.

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
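The per-request metric derivation can be sketched as a pure function over recorded token timestamps. `SimRequestStats`, `calc_sim_metrics()`, and the metric names are from the commit; the field names, the exact TPOT/ITL definitions, and the output shape are assumptions:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SimRequestStats:
    # simulated-clock timestamps in seconds (field names are assumptions)
    arrival_time: float = 0.0
    token_times: List[float] = field(default_factory=list)


def calc_sim_metrics(stats: Dict[int, SimRequestStats]) -> Dict[int, dict]:
    """Pure function: derive TTFT, TPOT, ITL, and e2e latency per request."""
    out = {}
    for req_id, s in stats.items():
        ttft = s.token_times[0] - s.arrival_time   # time to first token
        e2e = s.token_times[-1] - s.arrival_time   # end-to-end latency
        # inter-token latency: gaps between consecutive output tokens
        itl = [b - a for a, b in zip(s.token_times, s.token_times[1:])]
        # time-per-output-token over the decode phase (assumed definition)
        n_decode = len(s.token_times) - 1
        tpot = (e2e - ttft) / n_decode if n_decode > 0 else 0.0
        out[req_id] = {"ttft": ttft, "tpot": tpot, "itl": itl, "e2e": e2e}
    return out
```

Keeping the computation pure (timestamps in, metrics out) is what makes the "exact values" verification with the constant predictor possible: expected numbers can be computed by hand.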
Key discovery: TTFT includes prefill + first decode (15ms not 10ms) because PyExecutor skips sampler during prefill context processing. Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
trtllm-bench submits all requests at once (batch mode) with no rate limiting. Our sim already matches this pattern. Arrival modeling is only needed for future online serving simulation. Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
trtllm-bench throughput --sim with 3-tier verification: basic sim output, AIC TP=2, and calibrated constant predictor vs real silicon. Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
New sim_benchmark.py with load_sim_config, run_sim_benchmark, print_sim_report for CLI sim mode. Add --sim and --sim-config flags to throughput_command, bypassing async_benchmark. Helper scripts: calibrate_sim.py extracts prefill/decode times from a real iteration log; compare_reports.py shows a side-by-side comparison.

3-tier verification passed:
- Tier 1: Constant predictor sim (10 requests, 1523.8 tok/s)
- Tier 2: AIC TP=2 sim (10 requests, 6662.4 tok/s)
- Tier 3: Calibrated sim vs real RTX 3090 Ti (real: prefill=131.58ms, decode=7.63ms; sim matches structure)

Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
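The core idea of calibrate_sim.py can be sketched as below: average real iteration latencies, split by whether the iteration did prefill (context) work. The commit that follows notes the iteration log uses Python repr format, hence `ast.literal_eval` rather than JSON; the record field names here are assumptions, not the real log schema:

```python
import ast
from statistics import mean
from typing import Optional, Tuple


def calibrate_from_iteration_log(path: str) -> Tuple[Optional[float], Optional[float]]:
    """Hypothetical reimplementation of calibrate_sim.py's core idea.

    Returns (mean prefill latency, mean decode latency) in the log's units.
    Field names ("num_ctx_requests", "iter_latency_ms") are assumptions.
    """
    prefill, decode = [], []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rec = ast.literal_eval(line)  # log lines are Python repr, not JSON
            bucket = prefill if rec.get("num_ctx_requests", 0) > 0 else decode
            bucket.append(rec["iter_latency_ms"])
    return (mean(prefill) if prefill else None,
            mean(decode) if decode else None)
```

Feeding these averages back into a constant predictor is what makes the Tier-3 "calibrated sim vs real silicon" comparison possible.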
3-tier CLI verification: constant sim, AIC TP=2, calibrated vs real RTX 3090 Ti. Key discovery: iteration log uses Python repr format. Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Mark Phase 5 as Done in roadmap. Add consolidated fidelity and limitations doc covering: what's faithfully modeled (chunking, capacity, reuse), what's not (piggybacking, overlap, disagg, PP), and practical limitations (GPU memory, process model, AIC accuracy). Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Split Phase 7 into 7a (PP support, hard) and 7b (disagg KV transfer, medium). Add Phase 8 (GPU-free mock KV, backlog). Document 5 implementation findings: PP blocked by SimDistributed, calibration usefulness, GPU requirement, output format mismatch, Deep-Sim architecture validation. Update gantt chart to show v1 complete. Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
slop/ contains local paths, model tokenizers, and dev-specific plans/specs that should not be on GitHub. Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Summary
Adds a simulation mode to TensorRT-LLM that runs the real Python scheduler with mocked GPU execution, enabling fast GPU-free benchmarking of serving configurations. Complete v1 implementation (Phases 0-5).
Usage
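The commits add `--sim` and `--sim-config` flags to `trtllm-bench throughput` and a Pydantic SimConfig with a predictor. A hypothetical config file is shown below; the file format, field names, and values are assumptions modeled on SimConfig/PredictorConfig, not the PR's actual schema:

```yaml
# sim_config.yaml -- hypothetical schema (assumed field names)
predictor:
  name: constant        # or "aiconfigurator" (then device_name etc. required)
  prefill_time: 0.010   # seconds charged per prefill iteration
  decode_time: 0.005    # seconds charged per decode iteration
```

Passed on the command line as, e.g., `trtllm-bench throughput --sim --sim-config sim_config.yaml` (other required trtllm-bench arguments omitted).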
What's included (Phases 0-5)
- trtllm-bench throughput --sim CLI with 3-tier verification

Key design decisions
- --sim produces the same table layout as the real benchmark, with a [SIM] banner

Verified configurations
New files
- tensorrt_llm/llmapi/sim_config.py
- tensorrt_llm/_torch/pyexecutor/sim_clock.py
- tensorrt_llm/_torch/pyexecutor/sim_metrics.py
- tensorrt_llm/_torch/pyexecutor/sim_model_engine.py
- tensorrt_llm/_torch/pyexecutor/sim_sampler.py
- tensorrt_llm/_torch/pyexecutor/sim_predictor.py
- tensorrt_llm/_torch/pyexecutor/sim_predictor_aic.py
- tensorrt_llm/_torch/pyexecutor/sim_distributed.py
- tensorrt_llm/bench/benchmark/sim_benchmark.py

Modified files
- tensorrt_llm/llmapi/llm_args.py: sim_config field on TorchLlmArgs
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py: _create_sim_py_executor()
- tensorrt_llm/executor/executor.py
- tensorrt_llm/bench/benchmark/throughput.py: --sim and --sim-config flags

Test plan
- tests/unittest/sim/ (SimConfig, SimClock, SimMetrics, SimDistributed, SimPredictor, AIC predictor)
- slop/test_sim.py (constant metrics, AIC metrics, AIC TP=2)
- slop/test_bench_sim.py (constant sim, AIC TP=2, calibrated vs real)

🤖 Generated with Claude Code