
[#12712][feat] AutoDeploy Model Onboarding Sprint 03/19 - Part 1 (infra only) #12708

Open
govind-ramnarayan wants to merge 5 commits into NVIDIA:main from nv-auto-deploy:feat/paperclip_maximizer_merge1_infra
Conversation

@govind-ramnarayan
Collaborator

@govind-ramnarayan govind-ramnarayan commented Apr 2, 2026

Fixes: #12712

This includes infrastructure-related changes made during the one-week AutoDeploy model onboarding sprint. Specific custom model files and tests will follow in another PR.

Consists of changes from #12209 related to existing infrastructure. Does not introduce any new onboarded models.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added multi-modal/vision support verification checklist
    • Added GitHub CLI authentication configuration guidance
    • Added model registry YAML extra resolution support
  • Improvements

    • Updated default inference backends (attention, compilation)
    • Increased default inference capacity (sequence length, batch size)
    • Simplified prompt input handling
  • Bug Fixes

    • Improved multi-modal message normalization
    • Enhanced hidden states layout support
  • Removals

    • Removed benchmarking utility module

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@govind-ramnarayan govind-ramnarayan changed the title Feat/paperclip maximizer merge1 infra [#12712][feat] AutoDeploy Model Onboarding Sprint 03/19 - Part 1 (infra only) Apr 2, 2026
@govind-ramnarayan govind-ramnarayan marked this pull request as ready for review April 2, 2026 19:42
@govind-ramnarayan govind-ramnarayan requested review from a team as code owners April 2, 2026 19:42
@govind-ramnarayan
Collaborator Author

/bot help

@github-actions

github-actions bot commented Apr 2, 2026

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental) --high-priority]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

--high-priority (OPTIONAL) : Run the pipeline with high priority. This option is restricted to authorized users only and will route the job to a high-priority queue.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@govind-ramnarayan
Collaborator Author

/bot run --disable-fail-fast

@coderabbitai
Contributor

coderabbitai bot commented Apr 2, 2026

📝 Walkthrough

The pull request introduces infrastructure changes to support new models: it shifts the default optimization backends from flashinfer/torch-compile to trtllm/torch-cudagraph, refactors batch/sequence management to use tensor-based operations with explicit max_num_tokens parameters, adds multimodal message normalization, simplifies prompt input handling, and removes the benchmarking utility module.

Changes

- **Backend & Configuration Defaults** (`tensorrt_llm/_torch/auto_deploy/llm_args.py`, `tensorrt_llm/_torch/auto_deploy/config/default.yaml`): Updated attention backend from flashinfer to trtllm and compile backend from torch-compile to torch-cudagraph; increased default max_seq_len (512→2048) and max_batch_size (8→64); adjusted the CUDA-graph batch-size heuristic to start at 16 instead of 1.
- **Batch & Sequence Management Refactoring** (`tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py`, `tensorrt_llm/_torch/auto_deploy/shim/interface.py`): Converted BatchInfo from numpy-backed arrays to tensor-based operations; made max_num_tokens a required (non-optional) parameter in SequenceInfo and CachedSequenceInterface; updated reshape logic in SequenceInfo.maybe_gather_and_squeeze().
- **Batch Management Test Updates** (`tests/unittest/auto_deploy/.../test_*.py`, 20+ test files): Added max_num_tokens arguments to CachedSequenceInterface and SequenceInfo construction calls across resource handler, shim, and KV-cache tests; introduced a default_max_num_tokens() helper function in test utilities.
- **Prompt & Input Handling** (`examples/auto_deploy/build_and_run_ad.py`): Added registry YAML path resolution via get_registry_yaml_extra() and _inject_registry_yaml_extra() helpers; simplified the PromptInput type (removed Dict support); introduced prepare_queries() to normalize string queries to HF-style chat messages or plain prompts based on the tokenizer; removed the entire benchmarking flow (config fields, execution, and result storage).
- **Multimodal Message Support** (`tensorrt_llm/_torch/auto_deploy/llm.py`): Added normalization of plain-string messages to the multimodal list-of-dicts form ([{"type": "text", "text": ...}]) when the processor has an image_processor attribute; removed token_type_ids from forwarded arguments to prevent a kwarg mismatch.
- **Custom Operations** (`tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py`, `tensorrt_llm/_torch/auto_deploy/custom_ops/utils/torch_gather_logits.py`): Updated the q_scaling calculation to use scale * math.sqrt(head_dim) instead of the constant 1.0; updated gather_tokens() to handle additional hidden-state layouts and adjusted reshaping logic for generate-only batches.
- **ONNX Schema Registration** (`tensorrt_llm/_torch/auto_deploy/transform/library/_onnx_schemas.py`): Modified schema registration to register schemas only if not already present, avoiding duplicate-registration errors.
- **Benchmarking Module Removal** (`tensorrt_llm/_torch/auto_deploy/utils/benchmark.py`): Deleted the entire benchmarking utility module, including GenerationProfiler, the benchmark() function, and the store_benchmark_results() helper.
- **Test Input Structure Updates** (`tests/unittest/auto_deploy/.../smoke/test_ad_*.py`): Updated prompt configuration structure from dictionary form ({"prompt": "..."}) to direct string assignment in test query setup.
- **Test Case Simplification** (`tests/unittest/auto_deploy/multigpu/smoke/test_ad_build_small_multi.py`): Removed one parametrized test case using transformers_replace_cached_attn with the flashinfer backend.
- **Infrastructure & Documentation** (`.claude/agents/ad-onboard-reviewer.md`, `AGENTS.md`): Added a new "BB. Vision / Multi-Modal Support" checklist section (with duplicate entries); added GitHub CLI authentication configuration guidance via GH_CONFIG_DIR.
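The q_scaling change listed under Custom Operations above can be illustrated with a small standalone sketch (hypothetical function name; the actual code lives in `trtllm_attention.py`). With the default softmax scale of 1/sqrt(head_dim) the result is still 1.0, so the change only matters when a model supplies a custom scale:

```python
import math


def compute_q_scaling(scale: float, head_dim: int) -> float:
    """Derive q_scaling from the model's softmax scale.

    Previously hardcoded to 1.0; for scale = 1/sqrt(head_dim) the
    product still evaluates to 1.0, but custom scales are now honored.
    """
    return scale * math.sqrt(head_dim)
```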

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~55 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

- **Docstring Coverage** ⚠️ Warning: Docstring coverage is 75.47%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
- **Description check** ⚠️ Warning: The PR description is largely incomplete and uses the template as boilerplate without substantive content in critical sections; the PR Checklist is checked, but the Description and Test Coverage sections are empty placeholders. Resolution: add detailed explanations in the Description and Test Coverage sections, providing a clear summary of what infrastructure changes are being made and why they are needed.
- **Out of Scope Changes check** ❓ Inconclusive: While most changes align with infrastructure goals, the duplicate "BB. Vision / Multi-Modal Support" section in ad-onboard-reviewer.md and some prompt format changes in example configs appear potentially unrelated to the stated infra-only scope and require clarification.
✅ Passed checks (2 passed)

- **Title check** ✅ Passed: The PR title clearly describes the primary change: AutoDeploy Model Onboarding infrastructure changes for sprint 03/19, with an explicit "(infra only)" qualifier.
- **Linked Issues check** ✅ Passed: The PR's infrastructure changes comprehensively address the linked issue #12712 objectives: default backend switching to torch-cudagraph (see llm_args.py, default.yaml), test utility additions, and repository structure preparation for new models.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/auto_deploy/transform/library/_onnx_schemas.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Extend the copyright year range to 2026.

The file has been modified, and the copyright year should reflect the current year.

-# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2025-2026, NVIDIA CORPORATION. All rights reserved.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/transform/library/_onnx_schemas.py` at line
1, extend the copyright header at the top of the file: locate the top-of-file
comment string that currently reads "Copyright (c) 2025, NVIDIA CORPORATION.
All rights reserved." and edit it to "Copyright (c) 2025-2026, NVIDIA
CORPORATION. All rights reserved." so the file header reflects the current year.
🧹 Nitpick comments (1)
tests/unittest/auto_deploy/singlegpu/transformations/library/test_fuse_trtllm_attention_quant_fp8.py (1)

247-253: Consider using default_max_num_tokens helper for consistency.

Other test files in this PR (e.g., test_gated_delta_rule_cache.py, test_torch_gated_delta_rule_cache.py) use the default_max_num_tokens(max_seq_len, max_batch_size) helper which computes (max_seq_len + 1) * max_batch_size. Here, the hardcoded value 256 differs from what the formula would produce: (64 + 1) * 4 = 260.

If the specific value 256 is intentional for this test, consider adding a brief comment explaining why. Otherwise, consider importing and using default_max_num_tokens for consistency.

Also applies to: 296-302

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/unittest/auto_deploy/singlegpu/transformations/library/test_fuse_trtllm_attention_quant_fp8.py`
around lines 247 - 253, The test instantiates CachedSequenceInterface with a
hardcoded max_num_tokens=256 which is inconsistent with the project's helper;
replace the literal with default_max_num_tokens(max_seq_len, max_batch_size)
(importing default_max_num_tokens) so max_num_tokens is computed as (max_seq_len
+ 1) * max_batch_size for consistency with other tests, or if 256 is intentional
add a brief inline comment next to the CachedSequenceInterface invocation
explaining why it differs; target the CachedSequenceInterface construction and
the variables max_seq_len and max_batch_size referenced in the test (also update
the similar occurrence around the 296-302 block).
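For context, the `default_max_num_tokens` helper this nitpick refers to presumably reduces to a one-liner over the formula quoted above (a sketch; the actual definition lives in `_model_test_utils.py`):

```python
def default_max_num_tokens(max_seq_len: int, max_batch_size: int) -> int:
    """Default token capacity: one extra token slot per sequence.

    Implements the (max_seq_len + 1) * max_batch_size formula described
    in the review, a workaround for flashinfer issue #4504.
    """
    return (max_seq_len + 1) * max_batch_size
```

With max_seq_len=64 and max_batch_size=4 this yields 260, which is why the hardcoded 256 in the test differs from what the formula would produce.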
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/agents/ad-onboard-reviewer.md:
- Around line 47-55: Remove the duplicated "BB. Vision / Multi-Modal Support"
checklist block by deleting the second occurrence of the header and its table
(the repeated "BB. Vision / Multi-Modal Support" section that duplicates the
earlier BB1–BB2 table), leaving a single canonical BB section; ensure any
references to BB1 and BB2 remain intact and that the file contains only one
instance of that header and its three-column table so the checklist is not
duplicated.

In `@AGENTS.md`:
- Around line 126-133: The GitHub CLI guidance under the "GitHub CLI
authentication (`GH_CONFIG_DIR`)" section in AGENTS.md appears unrelated to the
current paperclip PR; either add a short clarifying note explaining when/why
agents should set GH_CONFIG_DIR (e.g., for workflows that interact with forks or
run external gh commands) or remove/move this whole section to a separate
documentation PR dedicated to GitHub CLI workflows; update the AGENTS.md header
"GitHub CLI authentication (`GH_CONFIG_DIR`)" and include a one-line rationale
if keeping it so future reviewers know its intended scope.

In `@examples/auto_deploy/build_and_run_ad.py`:
- Around line 100-104: The VS Code launch configuration still passes the removed
BenchmarkConfig.enabled flag which causes Pydantic validation to fail
(ExperimentConfig uses extra="forbid"); open the launch.json used for
running/debugging and remove the "--benchmark.enabled=false" argument so no
unknown "--benchmark.enabled" CLI flag is passed, ensuring runs use the updated
BenchmarkConfig (results_path/store_results) and ExperimentConfig without
validation errors.

In `@tensorrt_llm/_torch/auto_deploy/llm.py`:
- Around line 49-57: The current normalization mutates inputs["messages"]
in-place (variable messages) when self.processor has image_processor by
assigning to msg["content"], which can produce caller-visible side effects; fix
by working on a copy: create a shallow/deep copy of inputs["messages"] (e.g.,
clone messages into a new list of dicts) before the loop and modify that copy
(use that copy for downstream processing) so inputs["messages"] remains
unchanged; update any places that use the original variable to use the copied
variable instead (refer to variables/messages and the hasattr(self.processor,
"image_processor") block).

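The copy-before-modify fix this comment asks for could look like the following sketch (hypothetical message shape and function name; the actual structure of `inputs["messages"]` in `llm.py` may differ):

```python
from typing import Any, Dict, List


def normalize_messages(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return a normalized copy; the caller's list and dicts stay untouched."""
    normalized: List[Dict[str, Any]] = []
    for msg in messages:
        msg = dict(msg)  # shallow copy of each message dict before editing
        content = msg.get("content")
        if isinstance(content, str):
            # plain string -> multimodal list-of-dicts form
            msg["content"] = [{"type": "text", "text": content}]
        normalized.append(msg)
    return normalized
```

Downstream code would then consume the returned list while `inputs["messages"]` remains exactly as the caller passed it.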
---

Outside diff comments:
In `@tensorrt_llm/_torch/auto_deploy/transform/library/_onnx_schemas.py`:
- Line 1: Extend the copyright header at the top of the file: locate the
top-of-file comment string that currently reads "Copyright (c) 2025, NVIDIA
CORPORATION. All rights reserved." and edit it to "Copyright (c) 2025-2026,
NVIDIA CORPORATION. All rights reserved." so the file header reflects the
current year.

---

Nitpick comments:
In
`@tests/unittest/auto_deploy/singlegpu/transformations/library/test_fuse_trtllm_attention_quant_fp8.py`:
- Around line 247-253: The test instantiates CachedSequenceInterface with a
hardcoded max_num_tokens=256 which is inconsistent with the project's helper;
replace the literal with default_max_num_tokens(max_seq_len, max_batch_size)
(importing default_max_num_tokens) so max_num_tokens is computed as (max_seq_len
+ 1) * max_batch_size for consistency with other tests, or if 256 is intentional
add a brief inline comment next to the CachedSequenceInterface invocation
explaining why it differs; target the CachedSequenceInterface construction and
the variables max_seq_len and max_batch_size referenced in the test (also update
the similar occurrence around the 296-302 block).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 270f293f-90b8-4aba-b26e-a8d98a56b76c

📥 Commits

Reviewing files that changed from the base of the PR and between 11c40bb and 5a2fb55.

📒 Files selected for processing (25)
  • .claude/agents/ad-onboard-reviewer.md
  • AGENTS.md
  • examples/auto_deploy/build_and_run_ad.py
  • tensorrt_llm/_torch/auto_deploy/config/default.yaml
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/utils/torch_gather_logits.py
  • tensorrt_llm/_torch/auto_deploy/llm.py
  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tensorrt_llm/_torch/auto_deploy/shim/interface.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/_onnx_schemas.py
  • tensorrt_llm/_torch/auto_deploy/utils/benchmark.py
  • tests/unittest/auto_deploy/_utils_test/_model_test_utils.py
  • tests/unittest/auto_deploy/multigpu/smoke/test_ad_build_small_multi.py
  • tests/unittest/auto_deploy/singlegpu/custom_ops/test_resource_handlers.py
  • tests/unittest/auto_deploy/singlegpu/shim/test_cached_sequence_interface.py
  • tests/unittest/auto_deploy/singlegpu/shim/test_engine.py
  • tests/unittest/auto_deploy/singlegpu/shim/test_llm_config.py
  • tests/unittest/auto_deploy/singlegpu/smoke/test_ad_guided_decoding_regex.py
  • tests/unittest/auto_deploy/singlegpu/smoke/test_ad_trtllm_sampler.py
  • tests/unittest/auto_deploy/singlegpu/transformations/library/test_fuse_trtllm_attention_quant_fp8.py
  • tests/unittest/auto_deploy/singlegpu/transformations/library/test_gated_delta_rule_cache.py
  • tests/unittest/auto_deploy/singlegpu/transformations/library/test_kv_cache.py
  • tests/unittest/auto_deploy/singlegpu/transformations/library/test_mrope_delta_cache.py
  • tests/unittest/auto_deploy/singlegpu/transformations/library/test_torch_gated_delta_rule_cache.py
💤 Files with no reviewable changes (2)
  • tests/unittest/auto_deploy/multigpu/smoke/test_ad_build_small_multi.py
  • tensorrt_llm/_torch/auto_deploy/utils/benchmark.py

@tensorrt-cicd
Collaborator

PR_Github #41497 [ run ] triggered by Bot. Commit: 5a2fb55 Link to invocation

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
Nothing imports this module — it is dead code cleaned up as part of
the paperclip infra consolidation.

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
The comment was historical context about switching to graph mode.
Transformers mode deprecation will be tracked separately.

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
Centralizes the (max_seq_len + 1) * max_batch_size formula (a WAR for
flashinfer issue NVIDIA#4504) into a single helper in _model_test_utils.py.
Replaces hardcoded magic numbers (129 * 4) and inline formulas across
7 test files.

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
- Remove duplicate BB Vision/Multi-Modal section in ad-onboard-reviewer.md
- Remove stale --benchmark.enabled flag from .vscode/launch.json
- Update copyright year to 2025-2026 in _onnx_schemas.py

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
@govind-ramnarayan govind-ramnarayan force-pushed the feat/paperclip_maximizer_merge1_infra branch from 45ffa73 to c44dda7 Compare April 2, 2026 20:16
@govind-ramnarayan
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #41502 [ run ] triggered by Bot. Commit: c44dda7 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41502 [ run ] completed with state SUCCESS. Commit: c44dda7
/LLM/main/L0_MergeRequest_PR pipeline #32420 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation



Development

Successfully merging this pull request may close these issues.

[Feature]: Infrastructure Changes for New Models

2 participants