
feature: dynamo integration with nat profiler and prometheus/grafana dashboard#1486

Merged
rapids-bot[bot] merged 51 commits into NVIDIA:develop from bbednarski9:bb/dynamo-router-v1.5
Feb 24, 2026

Conversation

@bbednarski9
Contributor

@bbednarski9 bbednarski9 commented Jan 25, 2026

Description

This PR brings the ai-dynamo Prometheus metrics API integration, together with Grafana dashboarding, into NAT. This is a core feature of the Dynamo optimization pipeline for the NAT/Dynamo integration. We also integrate the Prometheus API with the NAT profiler, so that we can begin end-to-end agentic header-injection optimization.
Closes

By submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

  • New Features

    • Built-in Prometheus/Grafana monitoring, dashboards and a metrics collector; enhanced Dynamo metrics surfaced to the profiler.
    • Optimized Thompson Sampling routing and processor with observability and multi-worker support.
    • Multi-worker orchestration, MultiLRU tooling, KV-event observer, demo and image-build scripts, and new deployment start/stop helpers.
  • Documentation

    • Extensive monitoring, architecture, parameters, and end-to-end deployment guides and diagrams.
  • Chores

    • Default HTTP port changed from 8099 to 8000; configs and examples updated.

@bbednarski9 bbednarski9 requested a review from a team as a code owner January 25, 2026 20:37
@bbednarski9 bbednarski9 added improvement Improvement to existing functionality non-breaking Non-breaking change labels Jan 25, 2026
@coderabbitai

coderabbitai Bot commented Jan 25, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds an optimized Thompson Sampling router and processors for Dynamo with dynamic discovery, Prometheus/Grafana monitoring and dashboards, multi-worker startup/stop/orchestration and image-build tooling, NAT profiler integration for Dynamo metrics, transport-level LLM hint injection, many monitoring/observability scripts, and extensive docs and demos.

Changes

Cohort / File(s) Summary
Documentation & Architecture
docs/source/components/integrations/frameworks.md, external/dynamo/optimized/ARCHITECTURE.md, external/dynamo/optimized/README.md, external/dynamo/optimized/PARAMETERS.md, external/dynamo/monitoring/README.md, external/dynamo/E2E_SEQUENCE.md
Updated framework link; added architecture, parameters, README, monitoring guide, and end-to-end sequence docs describing the optimized Thompson Sampling router, processor flows, and observability.
Router Implementations
external/dynamo/optimized/router.py, external/dynamo/optimized/router_multilru.py
Added WorkloadAwareRouter implementations (contextual LinTS/Thompson Sampling, per-prefix reuse, contextual updates, Prometheus metrics, CLI/config overrides, endpoints for generate/find_worker and feedback).
Processor Implementations
external/dynamo/optimized/processor.py, external/dynamo/optimized/processor_multilru.py, external/dynamo/generalized/processor.py
Added optimized processors that register as dynamo.backend.generate, extract nvext annotations, select workers via router, stream responses, emit KV-efficiency metrics; removed explicit service creation call in generalized processor.
Monitoring Stack & Dashboards
external/dynamo/monitoring/docker-compose.yml, external/dynamo/monitoring/prometheus.yml, external/dynamo/monitoring/grafana/provisioning/..., external/dynamo/monitoring/grafana/provisioning/dashboards/json/dynamo-overview.json, external/dynamo/monitoring/rules/vllm-aliases.yml
Added Prometheus+Grafana compose, Prometheus scrape jobs, Grafana provisioning (datasource, dashboards provider, JSON dashboard), and recording rules to harmonize vLLM metrics for dashboards.
Metrics Collection & Observability Tools
external/dynamo/collect_metrics.sh, external/dynamo/monitoring/scripts/kv_event_observer.py, external/dynamo/monitoring/scripts/*, external/dynamo/monitoring/rules/*
New Bash metrics collector, KV event observer (ZMQ) script, and related monitoring scripts and rules for collecting and aggregating Dynamo/vLLM metrics.
Profiler Integration
src/nat/profiler/inference_optimization/dynamo_metrics.py, src/nat/data_models/profiler.py, src/nat/profiler/profile_runner.py
Added Dynamo metrics collector and data models (DynamoCoreMetrics, DynamoMetricsResult, DynamoMetricsCollector), ProfilerConfig.dynamo_metrics field, and profile runner integration to optionally collect/attach Prometheus metrics to profiling results.
LLM Client Transport
src/nat/llm/dynamo_llm.py
Replaced httpx event-hook header injection with a transport wrapper that injects Dynamo routing hints into HTTP headers and nvext.annotations in POST bodies.
Examples & Configs
examples/dynamo_integration/.../profile_rethinking_full_test.yml, examples/.../eval_config_no_rethinking_minimal_test.yml
Enabled Dynamo Prometheus metrics in profiler config, switched dynamo LLM base_url from :8099 to :8000, and reduced eval concurrency in one config.
Environment & Build
external/dynamo/.env.example, external/dynamo/build_multi_lru_image.sh
Expanded .env example with HuggingFace credentials, active Dynamo tunables, default ports (8000), and added script to build a MultiLRU-enabled vLLM image.
Startup / Orchestration / Lifecycle Scripts
external/dynamo/start_dynamo_unified.sh, external/dynamo/start_dynamo_disagg.sh, external/dynamo/start_dynamo_optimized_thompson_hints_sglang.sh, external/dynamo/start_dynamo_optimized_thompson_hints_vllm.sh, external/dynamo/start_dynamo_unified_thompson_hints.sh (removed), external/dynamo/stop_dynamo.sh, external/dynamo/monitor_dynamo.sh
Added/updated start scripts for unified/disaggregated/optimized deployments with multi-worker GPU allocation, per-worker ports, ETCD/NATS orchestration, optional monitoring; changed default HTTP port 8099→8000; added enhanced stop options including metrics stack and data clearing; removed an older unified Thompson script.
Monitoring Compose & Provisioning
external/dynamo/monitoring/docker-compose.yml, external/dynamo/monitoring/grafana/provisioning/dashboards/dashboards.yml, external/dynamo/monitoring/grafana/provisioning/datasources/datasources.yml
Added docker-compose for Prometheus/Grafana and Grafana provisioning files to auto-load dashboards and datasource.
Prometheus Rules & Recording
external/dynamo/monitoring/rules/vllm-aliases.yml
Added recording rules to expose vLLM metric aliases to match SGLang dashboard expectations.
Demo & Utility Scripts
external/dynamo/demo_priority_eviction.sh, external/dynamo/build_multi_lru_image.sh
Added demo script for MultiLRU eviction experiments and build helper for MultiLRU vLLM image.
Misc: Pre-commit
.pre-commit-config.yaml
Changed a local hook to run as Python and added nbconvert dependency for clearing notebook outputs.
Docs & Large JSON artifacts
external/dynamo/monitoring/grafana/provisioning/dashboards/json/dynamo-overview.json
Added a comprehensive Grafana dashboard JSON with multiple panels covering KV, KVE, TTFT/ITL, throughput, per-worker metrics, and backend templating.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Frontend
    participant Processor
    participant Router
    participant Worker
    participant Prometheus

    Client->>Frontend: HTTP generate request (headers / nvext.annotations)
    Frontend->>Processor: route to backend.generate
    Processor->>Processor: extract hints (prefix_id, reuse_budget, osl, iat)
    Processor->>Router: find_worker(tokens, prefix_id, reuse_budget)
    Router->>Prometheus: record decision metric
    Router-->>Processor: worker_id, decision_id
    Processor->>Worker: forward/stream generate request
    Worker-->>Processor: stream response chunks
    Processor->>Prometheus: update latency/token metrics
    Processor->>Prometheus: async update KVE metrics
    Processor-->>Frontend: stream response back
    Frontend-->>Client: deliver response
    Processor->>Router: feedback(decision_id, latency, success)
    Router->>Prometheus: record feedback / update baselines

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 54.55%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title clearly summarizes the main change: integrating Dynamo with the NAT profiler and adding Prometheus/Grafana dashboard support.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 11

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
external/dynamo/start_dynamo_unified.sh (1)

17-25: Update the header comment to match the new multi‑worker defaults.

The header still mentions GPUs 0‑3 with TP=4, but defaults are now 0‑7 and TP=2 with multiple workers. Please refresh the description. As per coding guidelines.

📝 Example update
-#   - Unified Worker (GPUs 0,1,2,3, TP=4, no disaggregation)
+#   - Unified Workers (default GPUs 0-7, TP=2 per worker, no disaggregation)
🤖 Fix all issues with AI agents
In `@external/dynamo/collect_metrics.sh`:
- Around line 27-41: The curl calls that write to "$frontend_file" and
"$worker_file" should use --fail and --max-time (e.g., --max-time 10) and check
curl's exit status so failures don't create empty/partial files; fetch into temp
files, verify each curl succeeded, log an error or warning to stderr with
context ("frontend metrics" / "worker metrics") and either exit 1 or skip
inclusion into "$combined_file" based on desired behavior, then atomically move
the temp files to "$frontend_file" and "$worker_file" before building
"$combined_file".
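The fetch-validate-move pattern the comment describes can be sketched as follows. This is a minimal, self-contained illustration, not the actual collect_metrics.sh: the URLs, file names, and the demo at the bottom (which uses file:// URLs so it runs without a live endpoint) are all placeholders.

```shell
#!/usr/bin/env bash
set -euo pipefail

fetch_metrics() {
  # $1 = label for logging, $2 = URL, $3 = destination file
  local label="$1" url="$2" dest="$3"
  local tmp
  tmp="$(mktemp ./metrics.XXXXXX)"
  # --fail turns server errors into non-zero exits; --max-time bounds hangs
  if curl --silent --show-error --fail --max-time 10 "$url" -o "$tmp"; then
    mv "$tmp" "$dest"   # same-directory mv: readers never see a partial file
  else
    echo "WARNING: failed to fetch ${label} metrics from ${url}" >&2
    rm -f "$tmp"
    return 1
  fi
}

# Demo against file:// URLs so the sketch runs without a live endpoint
echo 'dynamo_up 1' > demo_metrics.txt
fetch_metrics "frontend" "file://${PWD}/demo_metrics.txt" "frontend.prom"
# A dead endpoint is logged and skipped instead of leaving a partial file
fetch_metrics "worker" "file://${PWD}/missing.txt" "worker.prom" || true
```

Whether a failed fetch should abort the whole collection (exit 1) or merely skip that file from the combined output is the behavioral choice the comment leaves open; the sketch takes the skip route.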

In
`@external/dynamo/monitoring/grafana/provisioning/dashboards/json/dynamo-overview.json`:
- Around line 1-4: Add the SPDX Apache-2.0 header object according to the repo’s
JSON convention immediately before the existing dashboard content (i.e., before
the top-level "annotations" key); update the JSON to include the standard SPDX
fields (e.g., "spdx": { "license": "Apache-2.0", "copyright": "...", ... })
following the repository’s header schema so the file begins with the SPDX header
object then the existing dashboard object with "annotations".

In `@external/dynamo/monitoring/README.md`:
- Around line 1-3: Add the SPDX Apache-2.0 license header as the very first line
of the README before the "# Dynamo Monitoring Stack" title; update the README.md
for the "Dynamo Monitoring Stack" document to include the exact SPDX identifier
"SPDX-License-Identifier: Apache-2.0" on its own line at the top of the file.

In `@external/dynamo/optimized/ARCHITECTURE.md`:
- Around line 1-2: The ARCHITECTURE.md file titled "Optimized Thompson Sampling
Router Architecture" is missing the required SPDX Apache-2.0 header; add the
standard SPDX header comment (e.g., "SPDX-License-Identifier: Apache-2.0") at
the very top of the file before any content so the document complies with
licensing guidelines and linter checks.

In `@external/dynamo/optimized/PARAMETERS.md`:
- Around line 1-3: Add the standard SPDX Apache-2.0 header to the top of
PARAMETERS.md to comply with licensing guidelines; insert the SPDX identifier
line "SPDX-License-Identifier: Apache-2.0" (and optional year/owner comment
above or beside it per project convention) as the first lines of the file so the
new Markdown doc describing WorkloadAwareRouter in router.py includes the
required license header.

In `@external/dynamo/optimized/processor.py`:
- Around line 718-745: Add a return type annotation to parse_args() (->
argparse.Namespace) and add a new CLI argument "--kv-cache-block-size"
(type=int) that defaults from the DYNAMO_KV_BLOCK_SIZE environment variable or
64 when unset; ensure parse_args returns the Namespace with this value and then
thread that value into the call to register_llm(...) (or wherever the model/card
is registered) so register_llm receives kv_cache_block_size instead of the
hard-coded 1, keeping the argument name kv_cache_block_size consistent with
existing worker/frontend usage.
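The pattern asked for here, an int flag whose default is read from an environment variable, can be sketched like this. The flag and variable names come from the comment above; the surrounding register_llm call is assumed and not shown.

```python
import argparse
import os


def parse_args() -> argparse.Namespace:
    """Parse the processor CLI arguments."""
    parser = argparse.ArgumentParser(description="Optimized Dynamo processor")
    parser.add_argument(
        "--kv-cache-block-size",
        type=int,
        # Default comes from the environment, falling back to 64 when unset
        default=int(os.environ.get("DYNAMO_KV_BLOCK_SIZE", "64")),
        help="KV cache block size to pass to register_llm",
    )
    return parser.parse_args()
```

The returned namespace exposes `args.kv_cache_block_size`, which would then replace the hard-coded 1 at the registration site.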

In `@external/dynamo/optimized/README.md`:
- Around line 1-3: Add the SPDX license header to the top of the README so the
file begins with the standard SPDX identifier before the existing title line "#
Optimized Thompson Sampling Router Architecture"; insert a single-line header
"SPDX-License-Identifier: Apache-2.0" (with no other characters) as the first
line of the file so the README follows the repository's Markdown licensing
guideline.

In `@external/dynamo/start_dynamo_optimized_thompson_hints.sh`:
- Around line 284-301: The script currently uses read -p to prompt for HF_TOKEN
which can echo the secret to the terminal; change the interactive prompt to use
a silent read (e.g., read -s -r HF_TOKEN) so the token is not displayed, add a
newline/confirmation echo after the silent read to keep UX clean, and ensure the
later checks that set HF_TOKEN="dummy" or echo "✓ HuggingFace token received"
remain unchanged; update the prompt logic around the HF_TOKEN and
LOCAL_MODEL_DIR checks to use the silent read and not print the token.
- Around line 77-83: Validate GPU/TP sizing after computing NUM_GPUS and before
using NUM_WORKERS: check that TP_SIZE is a positive integer and that NUM_GPUS >=
TP_SIZE and that NUM_GPUS is divisible by TP_SIZE so NUM_WORKERS computed from
NUM_GPUS/TP_SIZE is > 0; if the check fails, print a clear error using
WORKER_GPUS, TP_SIZE and NUM_GPUS context and exit non‑zero to avoid later
indexing in wait_for_worker or using CONTAINER_GPU_INDICES when NUM_WORKERS ==
0. Ensure the validation runs right after NUM_WORKERS is assigned and prevents
further execution on bad configs.
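The silent-read prompt described for HF_TOKEN can be sketched as below. The demo feeds the token through a here-string so the sketch stays non-interactive; the token value and the prompt wording are placeholders, not the script's actual text.

```shell
#!/usr/bin/env bash
set -euo pipefail

prompt_for_token() {
  # -s suppresses echo so the secret never appears on screen; -r keeps backslashes
  read -r -s -p "Enter HuggingFace token (or press Enter to skip): " HF_TOKEN
  echo >&2   # the silent read swallows the newline, so print one for clean UX
  if [[ -z "${HF_TOKEN}" ]]; then
    HF_TOKEN="dummy"
    echo "No token provided; using dummy token" >&2
  else
    echo "✓ HuggingFace token received" >&2
  fi
}

# Demo: feed a token on stdin instead of typing it interactively
prompt_for_token <<< "hf_example_token"
echo "token length: ${#HF_TOKEN}"
```

Only the length is echoed afterward, never the token itself, which is the point of the change.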

In `@external/dynamo/start_dynamo_unified.sh`:
- Around line 53-55: NUM_WORKERS calculation can yield 0 or incorrect values
when TP_SIZE > NUM_GPUS or when NUM_GPUS is not divisible by TP_SIZE; add
validation after computing NUM_WORKERS (and before accessing WORKER_PIDS) to
check that NUM_GPUS and TP_SIZE are positive integers, that TP_SIZE <= NUM_GPUS,
and that NUM_GPUS % TP_SIZE == 0, and if any check fails print a clear error and
exit (use the same logging/exit pattern as the script). Also consider computing
NUM_WORKERS as an integer division only after validation to avoid truncation
surprises and reference NUM_WORKERS, NUM_GPUS, TP_SIZE and WORKER_PIDS in the
error message so the failure is easy to debug.
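Both start scripts need the same guard, which can be sketched with the variable names used in these comments (the surrounding GPU-index logic is assumed and not shown):

```shell
#!/usr/bin/env bash
set -euo pipefail

validate_worker_layout() {
  # NUM_GPUS and TP_SIZE must be positive integers, with NUM_GPUS a multiple of TP_SIZE
  if ! [[ "${NUM_GPUS}" =~ ^[1-9][0-9]*$ && "${TP_SIZE}" =~ ^[1-9][0-9]*$ ]]; then
    echo "ERROR: NUM_GPUS (${NUM_GPUS}) and TP_SIZE (${TP_SIZE}) must be positive integers" >&2
    return 1
  fi
  if (( TP_SIZE > NUM_GPUS )); then
    echo "ERROR: TP_SIZE (${TP_SIZE}) exceeds NUM_GPUS (${NUM_GPUS})" >&2
    return 1
  fi
  if (( NUM_GPUS % TP_SIZE != 0 )); then
    echo "ERROR: NUM_GPUS (${NUM_GPUS}) is not divisible by TP_SIZE (${TP_SIZE})" >&2
    return 1
  fi
  # Compute NUM_WORKERS only after validation, so it is always > 0
  NUM_WORKERS=$(( NUM_GPUS / TP_SIZE ))
}

NUM_GPUS=8 TP_SIZE=2
validate_worker_layout
echo "NUM_WORKERS=${NUM_WORKERS}"
```

Running the guard before any WORKER_PIDS or CONTAINER_GPU_INDICES access ensures bad configs fail fast with a debuggable message rather than indexing with NUM_WORKERS == 0.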

In `@src/nat/llm/dynamo_llm.py`:
- Around line 393-410: The code always generates a prefix ID even when
prefix_template is intended to disable hints; change the logic in the block that
sets unique_id/prefix_id so that if prefix_template is None you set prefix_id =
None (do not generate unique_id or call uuid.uuid4()), otherwise generate
unique_id and compute prefix_id = prefix_template.format(uuid=unique_id) (or
fallback f"nat-dynamo-{unique_id}" when prefix_template is truthy but empty),
update the logger to only log the created prefix when prefix_id is not None, and
pass that prefix_id (possibly None) into _DynamoTransport so the transport can
skip injecting hints when prefix_id is None.
🟡 Minor comments (19)
examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/profile_rethinking_full_test.yml-243-246 (1)

243-246: Update Prometheus scrape interval reference to match actual config.

The comment states Prometheus scrapes every 5s, but the Prometheus configuration in external/dynamo/monitoring/prometheus.yml defines scrape_interval: 2s. Update the minimum range calculation accordingly.

Proposed fix
-        # Minimum: 15s (Prometheus scrapes every 5s, need ≥3 points for reliable rates)
+        # Minimum: 6s (Prometheus scrapes every 2s, need ≥3 points for reliable rates)
external/dynamo/optimized/ARCHITECTURE.md-13-14 (1)

13-14: Replace inanimate possessive ’s in Markdown.

Phrases like “Frontend’s built-in router” and “Dynamo’s --model-name/--model-path” violate the doc style rule. Please rephrase (e.g., “the built-in router of the frontend”, “the Dynamo --model-name/--model-path”). As per coding guidelines, ...

✍️ Example rephrases
-3. **Frontend's built-in router becomes irrelevant**
+3. **The built-in router of the frontend becomes irrelevant**

-| Model Mapping | Custom `FRONTEND_MODEL_MAPPING` | Dynamo's `--model-name`/`--model-path` |
+| Model Mapping | Custom `FRONTEND_MODEL_MAPPING` | the Dynamo `--model-name`/`--model-path` |

-> default frontend, but this is **irrelevant** in our architecture. The frontend's built-in
+> default frontend, but this is **irrelevant** in our architecture. The built-in router of the frontend

Also applies to: 216-224, 360-363

external/dynamo/optimized/ARCHITECTURE.md-15-21 (1)

15-21: Add language identifiers to fenced code blocks (MD040).

Markdownlint flags these fenced blocks for missing language tags. Please add text, bash, or json as appropriate.

🧩 Example fix
-```
+```text
 Frontend (built-in router: round-robin)
     → routes to dynamo.backend.generate
     → OUR PROCESSOR (intercepts!)
         → queries Thompson Sampling router
         → forwards to dynamo.worker.generate (actual SGLang workers)
-```
+```

Also applies to: 23-24, 297-304, 307-313

src/nat/llm/dynamo_llm.py-22-23 (1)

22-23: Docstring still references event hooks; update to transport-based injection.

The module now uses a custom transport, but the docstring still describes event hooks.

📝 Suggested update
-The implementation uses httpx event hooks to inject hints at the HTTP transport level,
+The implementation uses a custom httpx transport to inject hints at the HTTP level,
external/dynamo/monitoring/README.md-383-392 (1)

383-392: Wrap bare URLs in inline code or links (MD034).

Bare URLs are flagged by markdownlint; please wrap them in inline code or convert to links (e.g., http://localhost:3000).

🔗 Example fix
-Then open http://localhost:3000 in your browser.
+Then open `http://localhost:3000` in your browser.
...
-Example queries:
+Example queries:
 
-```promql
+```promql
 # Request rate (requests/second)
 rate(dynamo_frontend_requests_total[1m])

Also applies to: 414-488

external/dynamo/optimized/README.md-7-30 (1)

7-30: Specify languages for the untyped fenced blocks (MD040).

Add a language such as text to the architecture diagram and expected‑output snippet to satisfy markdownlint.

🧹 Example fix
-```
+```text
 ┌─────────────────────────────────────────────────────────────────────────┐
 │  Client Request (with nvext.annotations)                                │
 ...
 └─────────────────────────────────────────────────────────────────────────┘
-```
+```
...
-```
+```text
 Step 3: Starting Custom Processor (Registers as backend.generate)...
 ...
 INFO processor.generate: Routing decision: worker=... decision=...
-```
+```

Also applies to: 157-174

external/dynamo/optimized/README.md-13-48 (1)

13-48: Avoid possessive 's for inanimate objects in docs.

Rephrase instances like “Processor's model card”, “Frontend's ModelWatcher”, and “frontend's expected model” to non‑possessive forms for doc style compliance. As per coding guidelines.

✍️ Example wording updates
-│       ↓ finds Processor's model card!                                   │
+│       ↓ finds the model card for the processor                           │
...
-3. **Frontend's ModelWatcher** discovers the processor's model card
+3. **Frontend ModelWatcher** discovers the model card for the processor
...
-- `--model-name`: Served model name (must match frontend's expected model)
+- `--model-name`: Served model name (must match the expected model for the frontend)

Also applies to: 86-88

external/dynamo/monitoring/README.md-24-55 (1)

24-55: Add language identifiers to fenced blocks flagged by markdownlint.

Several code fences (diagram, KVES formula, file tree, metric lists) are missing a language tag. Add text (or a more specific language) to satisfy MD040.

🧹 Example fixes
-```
+```text
 ┌──────────────────────────────────────────────────────────────────────────────┐
 │                              Dynamo Stack                                    │
 ...
 └──────────────────────────────────────────────────────────────────────────────┘
-```
+```
...
-```
+```text
 KVES = (TotalWork - ActualWork) / TotalWork ∈ [0,1]
 ...
 w_hit = (w_gpu_hit, w_cpu_hit, w_disk_hit)  # weights per hit source
-```
+```

Also applies to: 165-173, 295-308, 531-545
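The KVES formula quoted in that suggested fix can be illustrated with a small calculation. The weight values and token counts below are made up for the demo; only the equation KVES = (TotalWork - ActualWork) / TotalWork, clamped to [0, 1], comes from the document.

```python
def kv_efficiency_score(total_work: float, actual_work: float) -> float:
    """KVES = (TotalWork - ActualWork) / TotalWork, clamped to [0, 1]."""
    if total_work <= 0:
        return 0.0
    score = (total_work - actual_work) / total_work
    return max(0.0, min(1.0, score))


# Illustrative only: hits on faster tiers discount the work actually done.
# These weights are placeholders, not the ones the monitoring stack uses.
w_gpu_hit, w_cpu_hit, w_disk_hit = 0.0, 0.3, 0.7
prompt_tokens = 1000
gpu_hits, cpu_hits, disk_hits = 400, 100, 50
misses = prompt_tokens - gpu_hits - cpu_hits - disk_hits
actual = misses + w_gpu_hit * gpu_hits + w_cpu_hit * cpu_hits + w_disk_hit * disk_hits
print(f"KVES = {kv_efficiency_score(prompt_tokens, actual):.3f}")  # prints "KVES = 0.485"
```

Clamping guards against negative scores when recomputed work exceeds total work, keeping the metric in the documented [0, 1] range.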

src/nat/profiler/profile_runner.py-230-251 (1)

230-251: Preserve stack traces and satisfy BLE001 when Dynamo metrics collection fails.

The broad catch logs with logger.warning, which drops the stack trace and still triggers BLE001. If the broad catch is intentional, switch to logger.exception and annotate with # noqa: BLE001 (or narrow the exception list). As per coding guidelines.

🛠️ Suggested change
-            except Exception as e:
-                logger.warning("Failed to collect Dynamo metrics: %s", e)
+            except Exception:  # noqa: BLE001
+                logger.exception("Failed to collect Dynamo metrics")
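The practical difference behind this suggestion is that logger.exception records the full traceback while logger.warning with %s keeps only the message. A minimal demonstration, unrelated to the actual profile_runner code:

```python
import io
import logging

logger = logging.getLogger("demo")
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.DEBUG)

try:
    raise ValueError("prometheus endpoint unreachable")
except Exception:
    # Inside an except block, logger.exception appends the active traceback
    logger.exception("Failed to collect Dynamo metrics")

output = stream.getvalue()
print("Traceback" in output)
```

The captured log contains both the message and the "Traceback (most recent call last)" block, so the failure site is recoverable from the logs alone.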
external/dynamo/start_dynamo_unified.sh-480-482 (1)

480-482: Fix the separator typo in the output banner.

The separator line prints ...ca...; this looks like an accidental keystroke.

🧹 Proposed fix
-    echo "=====================================ca===================="
+    echo "========================================================="
external/dynamo/monitoring/README.md-185-195 (1)

185-195: Doc style tweaks: avoid inanimate possessives and hyphenate compounds.

Rephrase “SGLang's/Dynamo's …” and change “KV cache related” → “KV cache‑related”; also remove the blank line inside the blockquote to satisfy MD028. As per coding guidelines.

✍️ Example wording updates
-> **Why use SGLang's native metric?** SGLang computes cache hit rate internally but doesn't include
+> **Why use the native metric from SGLang?** SGLang computes cache hit rate internally but doesn't include
 > `cached_tokens` in its API responses. The processor's `thompson_kve_*` counters will show 0
 > unless the underlying engine provides `usage.prompt_tokens_details.cached_tokens`.
-
-> **Note on Full KVES**: To implement the full KVES equation with CPU/disk hit weights, you would need
+> **Note on the full KVES**: To implement the full KVES equation with CPU/disk hit weights, you would need
 ...
-This section documents the working status of all KV cache related metrics across the Dynamo stack.
+This section documents the working status of all KV cache‑related metrics across the Dynamo stack.
external/dynamo/monitoring/grafana/provisioning/dashboards/json/dynamo-overview.json-96-104 (1)

96-104: Align the Requests panel query with the "req/min" unit.

The panel is labeled "Requests (1m)" with unit "reqpm", but the query uses a 5s window. Use a 1m increase window to match the declared per-minute metric.

Suggested query change
-          "expr": "sum(increase(dynamo_frontend_requests_total[5s]))",
+          "expr": "sum(increase(dynamo_frontend_requests_total[1m]))",
external/dynamo/optimized/router.py-1054-1062 (1)

1054-1062: Remove TODO in production code

The TODO for metrics querying violates the no-TODOs guideline. Either implement a stubbed query with an explicit config gate or track it in an issue. As per coding guidelines, avoid TODOs in committed code/comments.

Do you want me to draft a minimal metrics client stub (or open an issue template) for this block?

src/nat/profiler/inference_optimization/dynamo_metrics.py-1059-1062 (1)

1059-1062: Docstring typo: kv_cache_efficiency doesn’t exist

The field is kv_efficiency. This example will mislead users. As per coding guidelines, keep documentation accurate.

✅ Suggested fix
-            print(f"KV Efficiency: {core.kv_cache_efficiency:.2%}")
+            print(f"KV Efficiency: {core.kv_efficiency:.2%}")
external/dynamo/.env.example-108-115 (1)

108-115: Export KV tuning vars so child processes see them

Without export, these values won’t reach the startup scripts and defaults will be used.

✅ Suggested fix
-DYNAMO_KV_BLOCK_SIZE=64
+export DYNAMO_KV_BLOCK_SIZE=64
@@
-DYNAMO_MEM_FRACTION_STATIC=0.9
+export DYNAMO_MEM_FRACTION_STATIC=0.9
external/dynamo/optimized/processor.py-232-234 (1)

232-234: Remove hard-coded Dynamo version in docstring

Version numbers shouldn’t be embedded in .py docs. As per coding guidelines, keep docs version-agnostic.

✅ Suggested fix
-        # doesn't expose custom bucket configuration in Dynamo 0.7.1)
+        # doesn't expose custom bucket configuration in the current Dynamo release)
external/dynamo/optimized/router.py-1259-1328 (1)

1259-1328: Add return type hint for the CLI parser

Public functions require return type hints. Add -> argparse.Namespace to parse_args().

✅ Suggested fix
-def parse_args():
+def parse_args() -> argparse.Namespace:
external/dynamo/optimized/processor.py-488-491 (1)

488-491: Use logger.exception() to capture stack trace information

This exception handler catches and logs without re-raising, so it should use logger.exception() to preserve the full stack trace. Also, when using logger.exception(), omit the exception object from the format arguments—it's automatically captured:

Suggested fix
        except Exception as e:
-            logger.error("Failed to pick worker: %s", e)
+            logger.exception("Failed to pick worker")
             self._metrics.router_errors_total.inc()
             return None, None

Consider narrowing the exception type if possible based on what the router client can raise.

src/nat/profiler/inference_optimization/dynamo_metrics.py-717-720 (1)

717-720: Use logger.exception when swallowing errors without re-raising

These exception handlers catch exceptions without re-raising and don't capture stack traces. According to coding guidelines, use logger.exception() (or logger.error(exc_info=True)) to preserve full diagnostic information for exceptions caught without re-raising.

✅ Example adjustment for lines 717-720
                except Exception as e:
                    error_msg = f"Failed to collect {metric_name}: {e}"
-                   logger.warning(error_msg)
+                   logger.exception(error_msg)
                    result.errors.append(error_msg)

Also applies to:

  • Lines 972-974: Replace logger.debug() with logger.exception()
  • Lines 1001-1002 and 1008-1009: Add logger.exception() calls to capture stack traces before appending to the errors list
🧹 Nitpick comments (4)
external/dynamo/monitoring/docker-compose.yml (1)

33-42: Follow environment variable pattern for Grafana credentials and use safer defaults.

This file is documented for local development only, but using hardcoded admin credentials and enabling anonymous Admin access should still follow environment variable defaults for consistency with best practices. Additionally, fix the file ending: the file currently ends with 2 blank lines instead of 1.

Recommended changes
-      - GF_AUTH_ANONYMOUS_ENABLED=true
-      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
-      - GF_AUTH_DISABLE_LOGIN_FORM=true
-      - GF_SECURITY_ADMIN_USER=admin
-      - GF_SECURITY_ADMIN_PASSWORD=admin
+      - GF_AUTH_ANONYMOUS_ENABLED=${GRAFANA_ANON_ENABLED:-false}
+      - GF_AUTH_ANONYMOUS_ORG_ROLE=${GRAFANA_ANON_ROLE:-Viewer}
+      - GF_AUTH_DISABLE_LOGIN_FORM=${GRAFANA_DISABLE_LOGIN_FORM:-false}
+      - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:-admin}
+      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-admin}

Remove trailing blank lines at end of file to have only a single newline.

src/nat/data_models/profiler.py (1)

44-105: Trim the DynamoMetricsConfig docstring to Google-style format.

Guidelines require Google-style docstrings for public classes; consider a concise summary + Attributes: and move the long how‑to content to external docs. As per coding guidelines.

external/dynamo/monitoring/grafana/provisioning/dashboards/json/dynamo-overview.json (1)

572-584: Replace hard-coded worker instance queries with dynamic per-instance aggregation.

The current queries for refId D and E pin specific instances (localhost:18081 and localhost:18082), which breaks the panel if the worker count changes. Use sum by (instance) to automatically show all worker instances:

♻️ Suggested query change
-        {
-          "expr": "rate(dynamo_component_requests_total{dynamo_namespace=\"workers\",dynamo_component=\"worker\",instance=\"localhost:18081\"}[5s])",
-          "legendFormat": "4. Worker 0 (18081) [rate(dynamo_component_requests_total{worker:18081}[5s])]",
-          "refId": "D"
-        },
-        {
-          "expr": "rate(dynamo_component_requests_total{dynamo_namespace=\"workers\",dynamo_component=\"worker\",instance=\"localhost:18082\"}[5s])",
-          "legendFormat": "4. Worker 1 (18082) [rate(dynamo_component_requests_total{worker:18082}[5s])]",
-          "refId": "E"
-        },
+        {
+          "expr": "sum by (instance) (rate(dynamo_component_requests_total{dynamo_namespace=\"workers\",dynamo_component=\"worker\"}[5s]))",
+          "legendFormat": "4. Worker ({{instance}}) [rate(dynamo_component_requests_total{worker}[5s])]",
+          "refId": "D"
+        },
external/dynamo/optimized/router.py (1)

472-557: Harden debug trace storage location

Defaulting to /tmp is vulnerable to symlink races on shared systems. Consider a user-scoped cache dir and restrictive permissions.

♻️ Suggested hardening
-        debug_trace_dir: str = "/tmp/dynamo_router_traces",
+        debug_trace_dir: str = "~/.cache/dynamo_router_traces",
@@
-        self.debug_trace_dir = str(debug_trace_dir)
+        self.debug_trace_dir = os.path.expanduser(str(debug_trace_dir))
@@
-            os.makedirs(self.debug_trace_dir, exist_ok=True)
+            os.makedirs(self.debug_trace_dir, mode=0o700, exist_ok=True)
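The same hardening can be sketched in Python. The directory name comes from the suggested diff; note that the mode argument to os.makedirs only applies to newly created directories and is masked by the umask, so an explicit chmod is added here.

```python
import os
import stat


def ensure_trace_dir(debug_trace_dir: str = "~/.cache/dynamo_router_traces") -> str:
    """Create a user-scoped trace directory with owner-only permissions."""
    path = os.path.expanduser(debug_trace_dir)
    # mode= covers the creation case; chmod tightens a pre-existing directory too
    os.makedirs(path, mode=0o700, exist_ok=True)
    os.chmod(path, 0o700)
    return path


trace_dir = ensure_trace_dir()
mode = stat.S_IMODE(os.stat(trace_dir).st_mode)
print(oct(mode))
```

A user-owned 0700 directory under the home cache avoids the world-writable /tmp namespace where symlink races are possible.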


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
.pre-commit-config.yaml (1)

37-42: Add jupyter to additional_dependencies—the hook will fail without it.

The shell script is properly executable with correct shebang, but it calls jupyter nbconvert on line 25. Since language: python creates an isolated venv, only nbconvert is installed from additional_dependencies: ["nbconvert"]. The jupyter command won't be available, causing the hook to fail at runtime. Add "jupyter" to additional_dependencies. (Note: nbconvert is already pinned to 7.16.6 in uv.lock, so the version pinning concern is addressed.)

external/dynamo/start_dynamo_disagg.sh (1)

27-27: Stale comment: port reference is outdated.

The comment states "Frontend: Port 8099" but the default HTTP port has been changed to 8000 on line 39. Update the comment to reflect the new default.

-# Frontend: Port 8099 (HTTP API)
+# Frontend: Port 8000 (HTTP API, configurable via DYNAMO_HTTP_PORT)
external/dynamo/start_dynamo_unified_thompson_hints.sh (1)

28-28: Stale comment: port reference is outdated.

The comment states "Frontend: Port 8099" but the default HTTP port has been changed to 8000 on line 44. Update the comment for consistency.

-# Frontend: Port 8099 (HTTP API with prefix hint headers)
+# Frontend: Port 8000 (HTTP API with prefix hint headers)
🤖 Fix all issues with AI agents
In `@external/dynamo/monitoring/docker-compose.yml`:
- Around line 30-43: The Grafana service is exposed insecurely via network_mode:
host with anonymous admin access and login disabled; update the grafana service
to bind only to localhost (remove network_mode: host and use a ports mapping
like 127.0.0.1:3000:3000) and harden the environment defaults by setting
GF_AUTH_ANONYMOUS_ENABLED=false and GF_AUTH_DISABLE_LOGIN_FORM=false while
keeping GF_SECURITY_ADMIN_USER and GF_SECURITY_ADMIN_PASSWORD present; make
these defaults overrideable via environment or an .env file so developers can
re-enable anonymous/local-only access when explicitly needed (refer to the
grafana service, network_mode, GF_AUTH_ANONYMOUS_ENABLED,
GF_AUTH_DISABLE_LOGIN_FORM, GF_AUTH_ANONYMOUS_ORG_ROLE).

In `@external/dynamo/monitoring/README.md`:
- Around line 222-224: Replace inanimate possessive phrasing like "SGLang's
native metric" with non-possessive alternatives such as "the native SGLang
metric" or "the native metric from SGLang" in the README text (specifically the
sentence starting "SGLang computes cache hit rate internally..."). Update any
other occurrences of similar inanimate possessives in this section (e.g.,
"SGLang's", "the processor's `thompson_kve_*` counters") to use equivalent
non-possessive constructions to match the documentation style guidelines.
- Around line 229-233: Update the header paragraph "KV Cache Metrics Status" to
hyphenate the compound adjective by replacing the phrase "KV cache related
metrics" with "KV cache-related metrics" (preferably using a non-breaking hyphen
U+2011 if you want to prevent line-wrapping); locate the sentence in the same
paragraph that mentions the `${backend}` template variable and make the
single-word change there to ensure grammatical correctness.
- Around line 571-573: The README example uses a user-specific absolute path;
update the example invocation to use a generic placeholder instead (e.g.,
/path/to/repo or $REPO_ROOT) so it’s portable; modify the example line that runs
./collect_metrics.sh ./metrics_output 30 to show changing directory to a generic
repo root (instead of
/localhome/local-bbednarski/NeMo-Agent-Toolkit/external/dynamo) and keep the
collect_metrics.sh command unchanged to preserve the usage example.

In `@external/dynamo/monitoring/scripts/cache_experiment.sh`:
- Around line 1-14: The script cache_experiment.sh currently has an abbreviated
header; replace it with the full Apache-2.0 license header block (the standard
boilerplate used in the repo) including the SPDX-License-Identifier: Apache-2.0
line, the full copyright notice (Copyright (c) 2025-2026, NVIDIA CORPORATION &
AFFILIATES. All rights reserved.), and the complete Apache License 2.0 notice
text (permissions, conditions, and disclaimer) so the file contains the
canonical license template at the top.

In `@external/dynamo/monitoring/scripts/kv_event_observer.py`:
- Line 85: The parameter parent_hash in function record_stored is unused and
triggers a lint warning; either rename it to _parent_hash to indicate
intentional unusedness or actually use it (e.g., log or store ancestry info).
Edit the record_stored signature (record_stored) to rename parent_hash to
_parent_hash if you don't need it, or add a single-line usage inside
record_stored (such as including parent_hash in the emitted metric/log or
appending it to an ancestry list) so the linter no longer flags it.
- Around line 1-35: Add the required SPDX Apache-2.0 license header immediately
after the shebang line in kv_event_observer.py; place the standard multi-line
SPDX/Apache-2.0 header (including SPDX-License-Identifier: Apache-2.0 and the
copyright/notice block per repo policy) right after "#!/usr/bin/env python3" so
the file header is compliant before the module docstring and any definitions (no
changes to functions or logic like the KVEventBatch parsing or metrics polling).

In `@external/dynamo/monitoring/scripts/README.md`:
- Around line 1-5: Add the SPDX license header "SPDX-License-Identifier:
Apache-2.0" to the very top of the README.md for the "KV Cache Event Observer"
(place the header before the first "# KV Cache Event Observer" line) so the file
includes the required Apache-2.0 license identifier per project guidelines.

In `@external/dynamo/optimized/ARCHITECTURE.md`:
- Around line 9-14: Update the inanimate possessive phrasing in ARCHITECTURE.md:
replace occurrences like "Frontend's built-in router" with "the built-in router
of the frontend" (and similar rephrasings) throughout the document, including
the other instances noted around the section discussing DYN_ROUTER_MODE,
dynamo.backend.generate, and dynamo.worker.generate (lines referenced in the
review); preserve meaning and references to DYN_ROUTER_MODE,
dynamo.backend.generate, and dynamo.worker.generate while only adjusting
possessive wording to match the style guide.
- Around line 15-24: The fenced code blocks containing the "Frontend (built-in
router: round-robin) → routes to dynamo.backend.generate → OUR PROCESSOR ..."
snippet (and the other similar blocks around the ASCII box starting with
"┌─────────────────────────────────────────────────────────────────────────────────┐")
are missing language identifiers and there is a blank line inside a blockquote;
add an appropriate language tag (e.g., ```text, ```bash or ```json) to each
fenced code block and remove the empty line inside the blockquote so the block
is contiguous, ensuring all occurrences (including the other blocks referenced
in the comment) follow the same fixes.

In `@external/dynamo/start_dynamo_optimized_thompson_hints_vllm.sh`:
- Around line 309-330: The prompt currently uses read -p which echoes the
HF_TOKEN to the terminal; change the interactive prompt in the HF_TOKEN check
(the block referencing HF_TOKEN and LOCAL_MODEL_DIR and the read call) to use
silent input by replacing read -p with read -s -p and then emit an echo newline
after the read so the terminal prompt formatting remains correct, preserving the
existing logic that checks for an empty value and sets HF_TOKEN="dummy" or
prints confirmation.
- Around line 86-92: After computing NUM_GPUS, CONTAINER_GPU_INDICES and
NUM_WORKERS, add a guard that validates TP_SIZE against NUM_GPUS: if TP_SIZE is
zero, greater than NUM_GPUS, or NUM_GPUS is not evenly divisible by TP_SIZE
(NUM_WORKERS equals 0 or NUM_GPUS % TP_SIZE != 0), immediately print a clear
error and exit with non‑zero status. Reference the variables NUM_GPUS, TP_SIZE,
NUM_WORKERS and CONTAINER_GPU_INDICES when implementing the check so the script
fails fast before any worker loops or indexing occur.

In `@external/dynamo/start_dynamo_unified.sh`:
- Line 481: The separator banner echo contains a stray "ca" that should be
removed; locate the echo call that prints
"=====================================ca====================" (in
start_dynamo_unified.sh) and update the string to a clean separator without the
"ca" characters so the output shows a continuous line of "=" characters.
🧹 Nitpick comments (4)
external/dynamo/monitoring/scripts/cache_experiment.sh (2)

188-198: Dependency on bc may cause failures on minimal systems.

The bc utility is used for arithmetic calculations but may not be installed by default. Consider adding a dependency check or using bash arithmetic where possible.

Alternative using awk (commonly available)
-echo -e "  Cache hits:    ${final_hits} (delta: +$(echo "$final_hits - $initial_hits" | bc))"
-echo -e "  Cache queries: ${final_queries} (delta: +$(echo "$final_queries - $initial_queries" | bc))"
+delta_hits_calc=$(awk "BEGIN {print $final_hits - $initial_hits}")
+delta_queries_calc=$(awk "BEGIN {print $final_queries - $initial_queries}")
+echo -e "  Cache hits:    ${final_hits} (delta: +${delta_hits_calc})"
+echo -e "  Cache queries: ${final_queries} (delta: +${delta_queries_calc})"
 echo ""
 
 # Calculate hit rate for this experiment
-delta_hits=$(echo "$final_hits - $initial_hits" | bc)
-delta_queries=$(echo "$final_queries - $initial_queries" | bc)
+delta_hits=$delta_hits_calc
+delta_queries=$delta_queries_calc
 if [ "$delta_queries" != "0" ]; then
-    hit_rate=$(echo "scale=1; $delta_hits * 100 / $delta_queries" | bc)
+    hit_rate=$(awk "BEGIN {printf \"%.1f\", $delta_hits * 100 / $delta_queries}")
     echo -e "${GREEN}Experiment hit rate: ${hit_rate}%${NC}"
 fi

173-176: Process cleanup could leave orphaned process on signal.

If the script receives a signal (e.g., Ctrl+C) before reaching line 174, the observer process will be orphaned. Consider adding a trap for cleanup.

Suggested fix

Add after line 144:

# Cleanup trap
trap 'kill $OBSERVER_PID 2>/dev/null; rm -f "$OBSERVER_LOG"' EXIT INT TERM

Then the explicit cleanup at lines 174-176 and 201 can be simplified or removed since the trap handles it.

external/dynamo/stop_dynamo.sh (1)

142-147: Missing error handling for volume removal.

If the Prometheus data volume doesn't exist, docker volume rm will output an error message. Consider suppressing the error or checking existence first.

Suggested fix
     if [ "$CLEAR_METRICS" = true ]; then
         echo ""
         echo "Clearing Prometheus data volume..."
-        docker volume rm monitoring_prometheus_data && echo "✓ Prometheus data volume removed (old metrics cleared)"
+        if docker volume rm monitoring_prometheus_data 2>/dev/null; then
+            echo "✓ Prometheus data volume removed (old metrics cleared)"
+        else
+            echo "  (Prometheus data volume not found or already removed)"
+        fi
     fi
external/dynamo/.env.example (1)

59-62: Align export usage in the .env example.

This file mixes export with bare assignments. If it’s intended to be sourced in a shell, DYNAMO_HTTP_PORT should be exported like the others; if it’s for Docker Compose .env, remove export keywords or add a note to avoid confusion.

🛠️ Consistency option (shell-source)
- DYNAMO_HTTP_PORT=8000
+ export DYNAMO_HTTP_PORT=8000


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
external/dynamo/generalized/processor.py (1)

428-433: ⚠️ Potential issue | 🔴 Critical

Update code to match deployed Dynamo 0.7.1 or add version compatibility layer.

Lines 428-433 assume Dynamo ≥0.8.x (where create_service() was removed), but all deployment scripts in external/dynamo/ are pinned to Dynamo 0.7.1, where create_service() is still required. This code will fail at runtime with the current deployment version. Either restore the create_service() call for 0.7.1 compatibility, update deployment scripts to Dynamo 0.8.x+, or add conditional logic to detect the runtime version and handle both APIs.

This same issue affects multiple files: generalized/router.py, optimized/router_multilru.py, and optimized/processor_multilru.py.
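A version-compatibility guard is one way to satisfy both runtimes. The sketch below is illustrative, not taken from the PR: the `ensure_service` helper and the two stand-in component classes are hypothetical; only the `create_service()` method name comes from the review. The idea is to call `create_service()` when the component exposes it (Dynamo 0.7.x) and no-op otherwise (Dynamo ≥0.8, where endpoints register automatically):

```python
import asyncio


async def ensure_service(component) -> None:
    """Call create_service() when the runtime still requires it (Dynamo 0.7.x),
    and no-op on newer runtimes where endpoints register automatically."""
    create = getattr(component, "create_service", None)
    if create is not None:
        await create()


# Hypothetical stand-ins for the two runtime generations.
class OldComponent:  # Dynamo 0.7.x style: create_service() must be awaited
    def __init__(self) -> None:
        self.service_created = False

    async def create_service(self) -> None:
        self.service_created = True


class NewComponent:  # Dynamo >=0.8 style: no create_service attribute at all
    pass


old, new = OldComponent(), NewComponent()
asyncio.run(ensure_service(old))
asyncio.run(ensure_service(new))  # must not raise on the newer API
print(old.service_created)  # True
```

The same guard would drop into `generalized/router.py`, `optimized/router_multilru.py`, and `optimized/processor_multilru.py` wherever the component is obtained from the runtime namespace.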

🤖 Fix all issues with AI agents
In `@external/dynamo/.env.example`:
- Around line 16-18: The HF_TOKEN placeholder in the .env example is active and
may be accidentally sourced; change the HF_TOKEN export to be commented out
(e.g., prefix the HF_TOKEN line with a comment marker) so users must explicitly
set or export HF_TOKEN themselves; ensure HF_HOME remains exported and leave a
commented example or instruction for users to add their own HF_TOKEN instead of
exporting a dummy value.
- Around line 59-62: Update the .env.example so all variables are consistently
exported and remove unnecessary quotes: add export to variables missing it
(e.g., DYNAMO_HTTP_PORT and the other non-exported entries referenced in the
comment) so they propagate to child processes, and remove the surrounding quotes
from the image URL on the quoted line so the value follows dotenv conventions;
ensure other exported lines (those already using export) remain unchanged and
keep naming identical to existing keys.

In `@external/dynamo/build_multi_lru_image.sh`:
- Around line 193-207: The script currently calls ./container/build.sh while
shell option set -e may cause the script to exit before BUILD_EXIT_CODE is
captured; modify the invocation around the build step (the block that calls
./container/build.sh and assigns BUILD_EXIT_CODE) to temporarily disable set -e
(e.g., set +e), run ./container/build.sh with the existing flags (framework
VLLM, --target "$BUILD_TARGET", --tag "$IMAGE_TAG", --enable-kvbm, build-arg and
vllm-max-jobs, $NO_CACHE, $DRY_RUN), capture the exit status into
BUILD_EXIT_CODE, then re-enable set -e (set -e) so subsequent error handling
works reliably; alternatively implement the &&/|| pattern to set BUILD_EXIT_CODE
explicitly after the command.
- Around line 122-128: When updating an existing clone (KVBM_NEXT_DIR) the
current git sequence uses git checkout "$BRANCH" which can fail if a local
branch named BRANCH doesn't exist; replace that step with logic that ensures a
local branch is created from the remote if missing: after git fetch origin,
check if the local branch exists (e.g., git show-ref --verify --quiet
refs/heads/"$BRANCH"), and if it does git checkout "$BRANCH", otherwise create
it from origin with git checkout -b "$BRANCH" "origin/$BRANCH" (or git branch
--track "$BRANCH" "origin/$BRANCH" then checkout) before running git pull and
git submodule update --init --recursive.

In `@external/dynamo/generalized/router.py`:
- Around line 1069-1071: The comment about create_service() being removed is
inconsistent with other router implementations that still call await
component.create_service(); inspect which Dynamo version the project targets
(pyproject.toml) and either (A) make removal intentional by removing
create_service() everywhere and update the comment near
runtime.namespace("dynamo").component("router") / WorkloadAwareRouter to
explicitly state “create_service removed in Dynamo >=0.8; router endpoints
register automatically” or (B) preserve compatibility by calling create_service
conditionally—e.g., check for hasattr(component, "create_service") or a
runtime/version flag and call await component.create_service() when present—so
that the behavior matches the optimized router implementation (which currently
calls create_service), keeping the comment aligned with the chosen approach.

In `@external/dynamo/optimized/processor_multilru.py`:
- Around line 507-510: Replace the logger.error call in the except block in
external/dynamo/optimized/processor_multilru.py with logger.exception so the
full stack trace is recorded when an exception is caught (keep the existing
message string and the subsequent self._metrics.router_errors_total.inc() and
return None, None intact); locate the except block that currently does
logger.error("Failed to pick worker: %s", e) and change it to
logger.exception(...) to capture the traceback for debugging.
- Around line 389-394: The _extract_hints function's signature and docstring are
out of sync with its actual return values: update the return type annotation
from tuple[str, int, str, str] to include the fifth value (likely a bool for
use_frequency_backend), e.g. tuple[str, int, str, str, bool], and revise the
docstring "Returns:" line to list all five returned items in the correct order
with brief types/names (prefix_id, total_requests, osl, iat,
use_frequency_backend) so documentation matches the implementation.
- Around line 661-664: The fire-and-forget call using
asyncio.create_task(self._update_kve_metrics_async(kve_data)) can let the Task
be garbage-collected; add a class-level set (e.g., self._background_tasks) in
the class __init__ and store the created Task in it, e.g., task =
asyncio.create_task(self._update_kve_metrics_async(kve_data));
self._background_tasks.add(task), and ensure completed tasks are removed (attach
a done callback or await/cleanup in shutdown) so tasks are tracked and not
prematurely collected.
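The task-tracking pattern described above can be sketched as follows. This is a minimal illustration, assuming a class shaped roughly like the processor; `KveMetricsUpdater` and its fields are hypothetical, and `_update_kve_metrics_async` stands in for the real metrics coroutine:

```python
import asyncio


class KveMetricsUpdater:
    """Sketch: keep strong references to fire-and-forget tasks so the event
    loop cannot garbage-collect them mid-flight."""

    def __init__(self) -> None:
        self._background_tasks: set[asyncio.Task] = set()
        self.updates = 0

    async def _update_kve_metrics_async(self, kve_data: dict) -> None:
        await asyncio.sleep(0)  # stand-in for the real async metrics update
        self.updates += 1

    def schedule_update(self, kve_data: dict) -> None:
        # Store the task, then drop the reference automatically on completion.
        task = asyncio.create_task(self._update_kve_metrics_async(kve_data))
        self._background_tasks.add(task)
        task.add_done_callback(self._background_tasks.discard)


async def main() -> int:
    updater = KveMetricsUpdater()
    for i in range(3):
        updater.schedule_update({"batch": i})
    # On shutdown, drain whatever is still pending.
    await asyncio.gather(*updater._background_tasks)
    return updater.updates


print(asyncio.run(main()))  # 3
```

The `add_done_callback(set.discard)` pairing is the standard way to avoid both premature collection and an ever-growing task set.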
🧹 Nitpick comments (9)
external/dynamo/demo_priority_eviction.sh (1)

53-56: Consider adding defensive handling for the metrics fetch.

With set -euo pipefail, if the metrics endpoint doesn't contain prefix_cache_hits_total{, grep returns exit code 1 and terminates the script. Additionally, if the container isn't running or curl fails, downstream arithmetic on empty values will error.

🛠️ Proposed fix to add fallback for robustness
 get_hits() {
-  docker exec dynamo-vllm curl -s http://localhost:18081/metrics 2>/dev/null | \
-    grep "prefix_cache_hits_total{" | grep -v external | awk '{print $NF}'
+  local hits
+  hits=$(docker exec dynamo-vllm curl -s http://localhost:18081/metrics 2>/dev/null | \
+    grep "prefix_cache_hits_total{" | grep -v external | awk '{print $NF}' || true)
+  echo "${hits:-0}"
 }
external/dynamo/optimized/router_multilru.py (7)

79-81: Add noqa comment for intentionally unused arguments.

The fallback implementation intentionally ignores tokens and min_overlap to maintain interface compatibility. Add noqa to suppress the static analysis warning and document the intent.

Suggested fix
-        async def find_matches_for_request(self, tokens: list[int], min_overlap: int) -> OverlapScores:
+        async def find_matches_for_request(self, tokens: list[int], min_overlap: int) -> OverlapScores:  # noqa: ARG002
             """Find overlap scores for each worker. Returns empty scores (round-robin fallback)."""
             return OverlapScores({})

171-171: Consider using tempfile.gettempdir() for secure temporary directory handling.

Hardcoding /tmp can be insecure on multi-user systems. The path is configurable via config, but the default should use the system's secure temp directory.

Suggested fix
+import tempfile
+
 def get_builtin_defaults() -> dict[str, Any]:
     """Return built-in default configuration (matches config.yaml)."""
     return {
         # ... other config ...
         "debug": {
             "traces_enabled": False,
-            "trace_dir": "/tmp/dynamo_router_traces",
+            "trace_dir": os.path.join(tempfile.gettempdir(), "dynamo_router_traces"),
             "buffer_size": 2000,
         },
     }

378-408: Consider extracting shared Pydantic models to avoid duplication.

These models (RouterRequest, RouterResponse, FeedbackRequest, FeedbackAck) are duplicated in external/dynamo/generalized/router.py (lines 80-108) and processor_multilru.py. Consider extracting to a shared models.py module to follow the DRY principle.


502-503: Log a warning when lin_forget is clamped to valid range.

The clamping silently modifies user-provided values. Consider logging when the value is adjusted to help with debugging.

Suggested fix
         self.lin_forget = float(lints_forget)
+        original_forget = self.lin_forget
         self.lin_forget = max(1e-6, min(self.lin_forget, 0.999999))
+        if self.lin_forget != original_forget:
+            logger.warning("lin_forget clamped from %s to %s", original_forget, self.lin_forget)

568-574: Consider logging exception details instead of just the message.

The blind Exception catch is acceptable here since trace writing should never crash the router, but logging only %s loses the exception type. Use logger.debug(..., exc_info=True) or at least include the exception type.

Suggested fix
         try:
             path = os.path.join(self.debug_trace_dir, "router_traces.jsonl")
             with open(path, "a", encoding="utf-8") as f:
                 f.write(json.dumps(item, separators=(",", ":")) + "\n")
-        except Exception as e:
-            logger.debug("Trace write failed: %s", e)
+        except Exception:
+            logger.debug("Trace write failed", exc_info=True)

1075-1075: Prefix unused variable with underscore.

chosen_ctx is unpacked but never used. Prefix with underscore to indicate intentional discard and suppress the linter warning.

Suggested fix
-        chosen, chosen_ctx, all_ctx, raw_scores, probs = self._select_worker(worker_ids, req, metrics, scores)
+        chosen, _chosen_ctx, all_ctx, raw_scores, probs = self._select_worker(worker_ids, req, metrics, scores)

1257-1264: Add return type hint to _get_underloaded.

Per coding guidelines, public/internal APIs should have type hints. This method returns a tuple that should be annotated.

Suggested fix
-    def _get_underloaded(self, metrics: dict[str, Any] | None):
+    def _get_underloaded(self, metrics: dict[str, Any] | None) -> tuple[int, float]:
         if not metrics or not metrics.get("endpoints"):
             wid = int(random.choice(list(self.engine_client.instance_ids())))
             return wid, 0.0
external/dynamo/optimized/processor_multilru.py (1)

157-157: Sort __slots__ alphabetically for consistency.

Static analysis recommends natural sorting of __slots__. This is a minor style convention.

Suggested fix
-    __slots__ = ("prompt_tokens", "cached_tokens", "device_blocks", "host_blocks", "disk_blocks")
+    __slots__ = ("cached_tokens", "device_blocks", "disk_blocks", "host_blocks", "prompt_tokens")


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/eval_config_no_rethinking_minimal_test.yml (1)

103-111: ⚠️ Potential issue | 🟠 Major

Update eval_llm to use port 8000 for consistency.

The eval_llm uses port 8099 while dynamo_llm uses port 8000 in the same file. This is inconsistent. The profile_rethinking_full_test.yml config shows that eval_llm works correctly with port 8000, and most other eval configs maintain the same port for both LLMs. Update line 106 to base_url: http://localhost:8000/v1.

🤖 Fix all issues with AI agents
In `@external/dynamo/optimized/processor_multilru.py`:
- Around line 742-769: Update parse_args() to declare and return an
argparse.Namespace return type and add a new CLI flag for kv_cache_block_size
(e.g., parser.add_argument("--kv-cache-block-size", type=int, default=1,
help="...")), then propagate that parsed value into register_llm where
kv_cache_block_size is currently hardcoded to 1 so the function uses
args.kv_cache_block_size; ensure register_llm signature or call sites accept the
new parameter if needed and update any references in the file (including the
other occurrence around lines 809-816) to use the parsed value.

In `@external/dynamo/optimized/processor.py`:
- Around line 497-500: Replace the plain error log with an exception log so the
full stack trace is captured: in the exception handler that currently calls
logger.error("Failed to pick worker: %s", e) (the block that then increments
self._metrics.router_errors_total and returns None, None), use
logger.exception(...) instead (preserving the message text) so the stack trace
is recorded when the method fails to pick a worker.
- Line 654: The create_task call that fires-and-forgets
self._update_kve_metrics_async(kve_data) can be garbage-collected; add a
persistent container (e.g., self._background_tasks: set[asyncio.Task]) in the
class __init__ and replace the bare asyncio.create_task(...) with creating the
task, adding it to self._background_tasks, and attaching a done callback to
remove the task from the set when finished; reference the existing call to
self._update_kve_metrics_async and the class initializer to locate where to add
the set and where to store/remove the task.

In `@external/dynamo/optimized/router_multilru.py`:
- Around line 1343-1345: The multilru router/processor files are inconsistent
with router.py/processor.py regarding component.create_service(); update
router_multilru.py and processor_multilru.py to match the non-multilru files by
conditionally invoking create_service when present: after obtaining component
via runtime.namespace("dynamo").component("router") (and similarly for
processor), check for the existence of create_service on the component and await
it if available (e.g., use a guard like hasattr or getattr to avoid failing on
Dynamo 0.8.x), and update/remove the outdated comment so both variants behave
consistently across Dynamo versions.
🧹 Nitpick comments (13)
external/dynamo/monitoring/scripts/kv_event_observer.py (2)

171-206: Consider validating URL scheme for defensive coding.

The urlopen call constructs URLs from host and metrics_port parameters. While this is an operator-run monitoring script, validating that the constructed URL uses http:// scheme would be more defensive against misconfiguration.

🛡️ Optional: Add scheme validation
     def _poll_metrics(self):
         """Background thread to poll Prometheus metrics for cache hits."""
         metrics_url = f"http://{self.host}:{self.metrics_port}/metrics"
+        if not metrics_url.startswith(("http://", "https://")):
+            raise ValueError(f"Invalid metrics URL scheme: {metrics_url}")
 
         while self.running:

507-508: Consider prefixing unused signal handler arguments.

The signal handler signature requires (signum, frame) arguments but they're unused. Prefixing with _ improves clarity and silences linter warnings.

♻️ Optional: Prefix unused args
-    signal.signal(signal.SIGINT, lambda s, f: setattr(observer, 'running', False))
-    signal.signal(signal.SIGTERM, lambda s, f: setattr(observer, 'running', False))
+    signal.signal(signal.SIGINT, lambda _s, _f: setattr(observer, 'running', False))
+    signal.signal(signal.SIGTERM, lambda _s, _f: setattr(observer, 'running', False))
external/dynamo/optimized/router.py (5)

79-81: Unused parameters in fallback KvIndexer.find_matches_for_request.

The tokens and min_overlap parameters are intentionally unused in this fallback stub implementation. Consider prefixing them with underscores to indicate they are deliberately ignored.

Suggested fix
-        async def find_matches_for_request(self, tokens: list[int], min_overlap: int) -> OverlapScores:
+        async def find_matches_for_request(self, _tokens: list[int], _min_overlap: int) -> OverlapScores:
             """Find overlap scores for each worker. Returns empty scores (round-robin fallback)."""
             return OverlapScores({})

171-171: Hardcoded /tmp path for debug traces.

Using /tmp as the default trace directory can be a security concern in shared environments. Consider using a user-specific temp directory or requiring explicit configuration.

Suggested fix using tempfile
+import tempfile
+
 def get_builtin_defaults() -> dict[str, Any]:
     """Return built-in default configuration (matches config.yaml)."""
     return {
         ...
         "debug": {
             "traces_enabled": False,
-            "trace_dir": "/tmp/dynamo_router_traces",
+            "trace_dir": os.path.join(tempfile.gettempdir(), "dynamo_router_traces"),
             "buffer_size": 2000,
         },
     }

Apply similar change to the constructor default at line 473.

Also applies to: 473-473


568-573: Use logger.debug with exception info for trace write failures.

The current implementation catches a blind Exception and logs with %s. Consider using logger.debug(..., exc_info=True) to capture the full context for debugging trace write issues.

Suggested fix
         try:
             path = os.path.join(self.debug_trace_dir, "router_traces.jsonl")
             with open(path, "a", encoding="utf-8") as f:
                 f.write(json.dumps(item, separators=(",", ":")) + "\n")
-        except Exception as e:
-            logger.debug("Trace write failed: %s", e)
+        except OSError:
+            logger.debug("Trace write failed", exc_info=True)

1073-1073: Unused variable chosen_ctx should be prefixed with underscore.

The unpacked chosen_ctx variable is not used. Prefix it with an underscore to indicate intentional discard.

Suggested fix
-        chosen, chosen_ctx, all_ctx, raw_scores, probs = self._select_worker(worker_ids, req, metrics, scores)
+        chosen, _chosen_ctx, all_ctx, raw_scores, probs = self._select_worker(worker_ids, req, metrics, scores)

1266-1333: Add return type annotation to parse_args().

Per coding guidelines, public functions require type hints on return values.

Suggested fix
-def parse_args():
+def parse_args() -> argparse.Namespace:
     """Parse minimal CLI arguments.
external/dynamo/optimized/processor.py (1)

157-157: Sort __slots__ tuple for consistency.

Ruff suggests sorting __slots__ alphabetically for maintainability.

Suggested fix
-    __slots__ = ("prompt_tokens", "cached_tokens", "device_blocks", "host_blocks", "disk_blocks")
+    __slots__ = ("cached_tokens", "device_blocks", "disk_blocks", "host_blocks", "prompt_tokens")
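The convention in one runnable sketch — `KvCacheStats` is a hypothetical stand-in for the processor's slotted class. Slot order has no runtime effect, so sorting is purely for readable diffs:

```python
class KvCacheStats:
    """Per-request KV-cache stats; __slots__ kept alphabetical per the review."""

    __slots__ = ("cached_tokens", "device_blocks", "disk_blocks", "host_blocks", "prompt_tokens")

    def __init__(self, prompt_tokens: int = 0, cached_tokens: int = 0,
                 device_blocks: int = 0, host_blocks: int = 0, disk_blocks: int = 0):
        self.prompt_tokens = prompt_tokens
        self.cached_tokens = cached_tokens
        self.device_blocks = device_blocks
        self.host_blocks = host_blocks
        self.disk_blocks = disk_blocks
```

With `__slots__` defined, instances carry no per-object `__dict__`, which is the memory win that motivates slotting these hot-path objects in the first place.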
external/dynamo/optimized/processor_multilru.py (1)

157-157: Sort __slots__ tuple for consistency.

Same issue as in processor.py - consider sorting alphabetically.

Suggested fix
-    __slots__ = ("prompt_tokens", "cached_tokens", "device_blocks", "host_blocks", "disk_blocks")
+    __slots__ = ("cached_tokens", "device_blocks", "disk_blocks", "host_blocks", "prompt_tokens")
external/dynamo/optimized/router_multilru.py (4)

79-81: Unused parameters in fallback KvIndexer.find_matches_for_request.

Same issue as router.py - prefix unused parameters with underscores.

Suggested fix
-        async def find_matches_for_request(self, tokens: list[int], min_overlap: int) -> OverlapScores:
+        async def find_matches_for_request(self, _tokens: list[int], _min_overlap: int) -> OverlapScores:
             """Find overlap scores for each worker. Returns empty scores (round-robin fallback)."""
             return OverlapScores({})
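A hedged sketch of the fallback path as a null object: when the native KV indexer is unavailable, a stub reports zero overlap for every worker so selection degrades to round-robin. `OverlapScores` below is a stand-in dataclass, not the actual Rust binding:

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class OverlapScores:
    # worker_id -> overlap score; empty means "no prefix-cache information".
    scores: dict[int, float] = field(default_factory=dict)

class FallbackKvIndexer:
    """Null-object indexer: every request scores zero overlap on every worker."""

    async def find_matches_for_request(self, _tokens: list[int], _min_overlap: int) -> OverlapScores:
        # Parameters are underscore-prefixed because this stub ignores them,
        # which is exactly the lint fix the review suggests.
        return OverlapScores({})
```

Because the stub returns an empty but well-typed result, the router's scoring code needs no special casing for the degraded mode.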

171-171: Hardcoded /tmp path for debug traces.

Same security concern as router.py - consider using tempfile.gettempdir() or requiring explicit configuration.

Also applies to: 473-473, 1388-1388


1073-1073: Unused variable chosen_ctx should be prefixed with underscore.

Same issue as router.py.

Suggested fix
-        chosen, chosen_ctx, all_ctx, raw_scores, probs = self._select_worker(worker_ids, req, metrics, scores)
+        chosen, _chosen_ctx, all_ctx, raw_scores, probs = self._select_worker(worker_ids, req, metrics, scores)

1266-1333: Add return type annotation to parse_args().

Same issue as router.py - public functions require return type hints.

Suggested fix
-def parse_args():
+def parse_args() -> argparse.Namespace:
     """Parse minimal CLI arguments.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@external/dynamo/monitoring/README.md`:
- Line 3: Replace the inanimate possessive phrase "ai-dynamo's Prometheus API"
in the README with a non-possessive form such as "the ai-dynamo Prometheus API"
or "the Prometheus API from ai-dynamo" so the sentence reads e.g. "Metrics are
collected at 2-second resolution directly from the ai-dynamo Prometheus API for
per-request granularity."; locate the exact phrase "ai-dynamo's Prometheus API"
in the README and update it to one of the suggested alternatives.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@external/dynamo/E2E_SEQUENCE.md`:
- Around line 329-371: The stub and usage comment disagree on
OverlapScores.scores: update the stub and comment so they match the Rust binding
— decide whether OverlapScores.scores returns raw counts (Dict[int, int]) or
normalized ratios (Dict[int, float]); then change the type annotation on
OverlapScores.scores and the Router usage comment to the chosen representation,
and add a short note pointing to KvIndexer.find_matches_for_request (or
find_matches) as the place where normalization/float() casting occurs so callers
know whether they must cast or if values are already normalized.
- Line 165: The sequence diagram's RouterResponse field name is inconsistent
with the JSON example and router.py; pick one canonical name and make them
match: either change the sequence diagram label RouterResponse to use
"prefix_hit_rate" to match router.py and the JSON example, or change the JSON
example and router.py usage to use "overlap" instead — update all occurrences
(sequence diagram arrow label, the JSON example block, and the RouterResponse
handling in router.py) so RouterResponse consistently uses the same field name
("prefix_hit_rate" or "overlap") across the diagram, JSON example, and code.
- Around line 7-51: The markdown has two fenced code blocks missing language
identifiers (the ASCII-art architecture diagram and the raw HTTP headers block),
which triggers MD040; update the first fenced block containing the NeMo Agent
Toolkit ASCII diagram to use ```text and update the second block showing example
headers (the x-prefix-id/x-prefix-total-requests snippet) to use ```http (or
```text) so both code fences include a language specifier; locate the blocks by
the ASCII art labeled "NeMo Agent Toolkit / DynamoModelConfig" and the raw
headers section that begins with "x-prefix-id" and add the language tokens to
the opening backticks.
- Around line 1-3: Add the standard Apache License 2.0 header comment to the
very top of E2E_SEQUENCE.md (before any content); include the SPDX short
identifier "SPDX-License-Identifier: Apache-2.0" and the usual Apache 2.0
boilerplate header (copyright notice placeholder and permission/limitation
lines) wrapped as a top-of-file comment appropriate for Markdown (e.g., an HTML
comment block <!-- ... -->) so the file is explicitly licensed under Apache-2.0.
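The first inline comment above asks whether `OverlapScores.scores` carries raw matched-block counts (`Dict[int, int]`) or normalized ratios (`Dict[int, float]`). A hedged sketch of the conversion, assuming raw counts of matched KV blocks per worker; `normalize_overlap` is an illustrative helper, not part of the dynamo binding:

```python
def normalize_overlap(raw_scores: dict[int, int], request_blocks: int) -> dict[int, float]:
    """Convert raw matched-block counts into prefix hit ratios in [0.0, 1.0].

    raw_scores maps worker_id -> number of KV blocks matched for the
    request; request_blocks is the request's total block count.
    """
    if request_blocks <= 0:
        # No blocks to match against: every worker gets a zero ratio.
        return {wid: 0.0 for wid in raw_scores}
    # Clamp at 1.0 in case counts ever exceed the request size.
    return {wid: min(count / request_blocks, 1.0) for wid, count in raw_scores.items()}
```

Whichever representation the Rust binding actually returns, documenting it at `find_matches_for_request` (and performing any `float()` cast in one place) is the fix the comment asks for.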

Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com>
@bbednarski9 bbednarski9 force-pushed the bb/dynamo-router-v1.5 branch from 6421ddf to 556c7bb Compare February 19, 2026 20:28
bbednarski9 and others added 4 commits February 20, 2026 09:34
@bbednarski9 bbednarski9 requested a review from a team as a code owner February 20, 2026 22:09
…or agent_hints validation

@bbednarski9
Contributor Author

/ok to test 2fab656

@copy-pr-bot

copy-pr-bot Bot commented Feb 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@bbednarski9 bbednarski9 force-pushed the bb/dynamo-router-v1.5 branch from d95f130 to 8b8b3c6 Compare February 23, 2026 04:41

@Salonijain27 Salonijain27 left a comment


Approved from a dependency point of view

@bbednarski9
Contributor Author

/ok to test 2ee1229

@bbednarski9
Contributor Author

/merge

@rapids-bot rapids-bot Bot merged commit fe0d4f5 into NVIDIA:develop Feb 24, 2026
15 checks passed
Charlie-Yi-2002 pushed a commit to Charlie-Yi-2002/NeMo-Agent-Toolkit that referenced this pull request Mar 5, 2026
…dashboard (NVIDIA#1486)

Authors:
  - Bryan Bednarski (https://github.com/bbednarski9)
  - Dhruv Nandakumar (https://github.com/dnandakumar-nv)

Approvers:
  - Dhruv Nandakumar (https://github.com/dnandakumar-nv)
  - https://github.com/Salonijain27

URL: NVIDIA#1486

Labels

improvement (Improvement to existing functionality), non-breaking (Non-breaking change)
