
(lib) Add support for vLLM-backed text embedder#1494

Open
charlesbluca wants to merge 61 commits into NVIDIA:main from charlesbluca:retriever-vllm-for-embeddings-1

Conversation

@charlesbluca
Collaborator

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, ensure those are mirrored in the Helm values.yaml file.

@charlesbluca charlesbluca requested a review from a team as a code owner March 5, 2026 20:04
@charlesbluca charlesbluca requested a review from ChrisJar March 5, 2026 20:04
…embeddings-1

Made-with: Cursor

# Conflicts:
#	nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py
- Add BeirConfig.use_vllm (default False), forward as embed_use_vllm
- batch_pipeline: pass embed_use_vllm into BeirConfig for BEIR eval
- Extend test_evaluate_lancedb_beir to assert Retriever kwargs

Made-with: Cursor
@charlesbluca charlesbluca changed the title from "(lib) Add support for vLLM-backed embedder" to "(lib) Add support for vLLM-backed text embedder" on Mar 23, 2026
@charlesbluca charlesbluca requested a review from a team as a code owner March 24, 2026 19:40
charlesbluca and others added 3 commits April 7, 2026 18:21
The upstream refactor replaced ingest_modes/ with text_embed/operators.py.
Add vLLM branching to _BatchEmbedGPUActor.__init__() so that
embed_use_vllm=True in EmbedParams routes through create_local_embedder()
with use_vllm=True, matching the interface used for recall queries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@greptile-apps
Contributor

greptile-apps Bot commented Apr 7, 2026

Greptile Summary

This PR adds vLLM-backed text embedding support to the NeMo Retriever library, introducing LlamaNemotronEmbed1BV2Embedder (now vLLM-backed, replacing the former HF class), LlamaNemotronEmbedVL1BV2VLLMEmbedder, LlamaNemotronEmbed1BV2HFEmbedder (new explicit HF class for query embedding), and the embed_with_vllm_llm / embed_multimodal_with_vllm_llm helpers. The factory create_local_embedder is restructured around a backend parameter, Retriever now uses create_local_query_embedder with embed_queries(), and the [local] extra gains vllm==0.17.0, flashinfer-cubin==0.6.4, and flashinfer-python==0.6.4 dependencies.

Several previously-flagged P1 concerns from earlier review rounds (missing _llm dataclass field in LlamaNemotronEmbedVL1BV2VLLMEmbedder, bool("False") silently enabling eager mode in gpu_operator.py and processor.py, CUDA_VISIBLE_DEVICES global mutation) remain unresolved in the current diff and should be addressed before merging.

Confidence Score: 4/5

Safe to merge with caution — core embedding logic is correct, but several previously-flagged P1 issues remain unresolved.

The double-prefix bug is fixed, aarch64 wheel sources are present, version pins are consistent (0.17.0), and PoolerConfig exception handling is now specific to ImportError. However, three previously-flagged P1 concerns are still unaddressed: _llm is not declared as a dataclass field in LlamaNemotronEmbedVL1BV2VLLMEmbedder (can hide construction errors), bool('False') silently enables enforce_eager in gpu_operator.py and processor.py, and CUDA_VISIBLE_DEVICES is still mutated globally in LlamaNemotronEmbedVL1BV2VLLMEmbedder.post_init.

nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_vl_1b_v2_embedder.py (missing _llm field + CUDA_VISIBLE_DEVICES), nemo_retriever/src/nemo_retriever/text_embed/gpu_operator.py and processor.py (bool('False') enforce_eager)

Important Files Changed

| Filename | Overview |
|---|---|
| `nemo_retriever/src/nemo_retriever/text_embed/vllm.py` | New helpers create_vllm_llm, embed_with_vllm_llm, embed_multimodal_with_vllm_llm; PoolerConfig outer catch now correctly limits to ImportError; well-structured with proper batching logic. |
| `nemo_retriever/src/nemo_retriever/model/__init__.py` | Factory now routes via backend=auto/hf/vllm; non-VL create_local_embedder always returns vLLM embedder with no auto-fallback on non-Linux platforms. |
| `nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_1b_v2_embedder.py` | Class repurposed as vLLM-backed embedder; eager GPU load in `__post_init__` and missing unload() violate model-lazy-loading/gpu-memory-lifecycle rules; device deprecation warning is correctly implemented. |
| `nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_vl_1b_v2_embedder.py` | New LlamaNemotronEmbedVL1BV2VLLMEmbedder added; _llm not declared as dataclass field (could cause AttributeError on failed construction); CUDA_VISIBLE_DEVICES global mutation still present. |
| `nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_1b_v2_hf_embedder.py` | New HF-only embedder for query-side use; loads model eagerly in `__post_init__`; embed() signature lacks prefix kwarg (inconsistent with vLLM embedder's embed()). |
| `nemo_retriever/src/nemo_retriever/text_embed/processor.py` | Double-prefix bug fixed by passing prefix='' to embed(); VL/non-VL routing correct; bool('False') for enforce_eager still silently enables eager mode when YAML injects string values. |
| `nemo_retriever/src/nemo_retriever/text_embed/gpu_operator.py` | Correctly routes HF vs vLLM ingest backends; bool('False') for enforce_eager still present (string from YAML evaluates truthy), disabling CUDA graphs silently. |
| `nemo_retriever/src/nemo_retriever/text_embed/runtime.py` | local_batch_size bug fixed (now uses effective_batch_size); GPU cache clear on exception added; skip_prefix heuristic works correctly for all new model types. |
| `nemo_retriever/src/nemo_retriever/retriever.py` | _get_local_embedder now uses create_local_query_embedder with backend selection; embed_queries() replaces manual 'query: ' prefixing; cache keyed by (resolved_name, backend) tuple. |
| `nemo_retriever/pyproject.toml` | vllm bumped to 0.17.0 with both x86_64 and aarch64 CUDA 13 wheels in uv.sources; flashinfer-cubin and flashinfer-python added without uv.sources wheel pinning. |
| `nemo_retriever/tests/test_vllm_embed.py` | New test file covering embed_with_vllm_llm, embed_multimodal_with_vllm_llm, create_vllm_llm, and VLLMEmbedder image/text-image paths; good coverage of edge cases. |
| `nemo_retriever/tests/test_create_local_embedder.py` | New deprecation test for device kwarg; gpu_memory_utilization kwarg forwarding assertion added; tests look correct. |

Class Diagram

```mermaid
%%{init: {'theme': 'neutral'}}%%
classDiagram
    class LlamaNemotronEmbed1BV2Embedder {
        +model_id: Optional[str]
        +device: Optional[str] deprecated
        +gpu_memory_utilization: float
        +enforce_eager: bool
        -_llm: Any
        +embed(texts, batch_size, prefix) Tensor
        +embed_queries(texts, batch_size) Tensor
    }
    class LlamaNemotronEmbed1BV2HFEmbedder {
        +device: Optional[str]
        +model_id: Optional[str]
        -_tokenizer: Any
        -_model: Any
        +embed(texts, batch_size) Tensor
        +embed_queries(texts, batch_size) Tensor
    }
    class LlamaNemotronEmbedVL1BV2Embedder {
        +device: Optional[str]
        +model_id: Optional[str]
        -_model: Any
        +embed(texts, batch_size) Tensor
        +embed_queries(texts, batch_size) Tensor
        +embed_images(images_b64, batch_size) Tensor
        +embed_text_image(texts, images_b64, batch_size) Tensor
    }
    class LlamaNemotronEmbedVL1BV2VLLMEmbedder {
        +model_id: Optional[str]
        +gpu_memory_utilization: float
        +enforce_eager: bool
        +embed(texts, batch_size) Tensor
        +embed_queries(texts, batch_size) Tensor
        +embed_images(images_b64, batch_size) Tensor
        +embed_text_image(texts, images_b64, batch_size) Tensor
    }
    class create_local_embedder {
        <<factory>>
        VL+auto/vllm --> LlamaNemotronEmbedVL1BV2VLLMEmbedder
        VL+hf --> LlamaNemotronEmbedVL1BV2Embedder
        nonVL --> LlamaNemotronEmbed1BV2Embedder
    }
    class create_local_query_embedder {
        <<factory>>
        nonVL+hf --> LlamaNemotronEmbed1BV2HFEmbedder
        otherwise --> create_local_embedder
    }
    create_local_embedder --> LlamaNemotronEmbed1BV2Embedder : non-VL
    create_local_embedder --> LlamaNemotronEmbedVL1BV2VLLMEmbedder : VL + auto/vllm
    create_local_embedder --> LlamaNemotronEmbedVL1BV2Embedder : VL + hf
    create_local_query_embedder --> LlamaNemotronEmbed1BV2HFEmbedder : non-VL + hf
    create_local_query_embedder --> create_local_embedder : otherwise
```
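For readers who prefer code to the diagram, a plain-Python sketch of the same routing (class names are taken from the diagram; the real factories accept many more parameters and return embedder instances rather than names):

```python
# Sketch only: mirrors the create_local_embedder / create_local_query_embedder routing above.
def select_embedder_class(is_vl_model: bool, backend: str = "auto", query_side: bool = False) -> str:
    if backend not in ("auto", "hf", "vllm"):
        raise ValueError(f"unsupported backend: {backend!r}")
    if not is_vl_model and query_side and backend == "hf":
        return "LlamaNemotronEmbed1BV2HFEmbedder"  # HF mean-pooled query path
    if is_vl_model:
        if backend in ("auto", "vllm"):
            return "LlamaNemotronEmbedVL1BV2VLLMEmbedder"  # requires vLLM >= 0.17.0
        return "LlamaNemotronEmbedVL1BV2Embedder"  # HF VL path
    return "LlamaNemotronEmbed1BV2Embedder"  # canonical vLLM-backed text embedder


assert select_embedder_class(is_vl_model=False, backend="hf", query_side=True) == "LlamaNemotronEmbed1BV2HFEmbedder"
assert select_embedder_class(is_vl_model=True) == "LlamaNemotronEmbedVL1BV2VLLMEmbedder"
```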
Prompt To Fix All With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_1b_v2_hf_embedder.py
Line: 116-128

Comment:
**Inconsistent `embed()` interface with vLLM counterpart**

`LlamaNemotronEmbed1BV2HFEmbedder.embed()` does not accept a `prefix` kwarg, while `LlamaNemotronEmbed1BV2Embedder.embed()` does (`prefix: str = "passage: "`). Code that switches backends via `create_local_query_embedder(..., backend="hf")` and then calls `embedder.embed(..., prefix="")` would get a `TypeError`. The `public-api-contract` rule requires interchangeable backends to share the same call signature. Consider adding `prefix: str = "passage: "` here (even if it's effectively ignored since the HF path self-prefixes), so callers don't need to know which backend is active.

**Rule Used:** Every public class and function in nemo_retriever ... ([source](https://app.greptile.com/review/custom-context?memory=public-api-contract))

How can I resolve this? If you propose a fix, please make it concise.
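A minimal sketch of the signature alignment the reviewer suggests (illustrative only: `_embed_local` is the internal helper the review mentions, the default values are placeholders, and the class body here is a stub, not the PR's code):

```python
from typing import Any, Sequence


class LlamaNemotronEmbed1BV2HFEmbedder:
    """Sketch only; the real class lives in nemo_retriever.model.local and has many more members."""

    def _embed_local(self, texts: Sequence[str], batch_size: int) -> Any:
        raise NotImplementedError  # stands in for the actual tokenize / forward / mean-pool path

    def embed(self, texts: Sequence[str], batch_size: int = 32, prefix: str = "passage: ") -> Any:
        # `prefix` is accepted purely for call-signature parity with the vLLM embedder;
        # the HF path self-prefixes passages, so the value is deliberately ignored here.
        del prefix
        return self._embed_local(texts, batch_size=batch_size)
```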

Reviews (27): Last reviewed commit: "fix(embed): add missing --embed-local-in..."

Comment thread nemo_retriever/src/nemo_retriever/text_embed/processor.py Outdated
Comment thread nemo_retriever/pyproject.toml Outdated
Comment thread nemo_retriever/src/nemo_retriever/text_embed/vllm.py
Comment thread nemo_retriever/src/nemo_retriever/text_embed/vllm.py Outdated
Comment thread nemo_retriever/README.md Outdated
charlesbluca and others added 3 commits April 7, 2026 18:29
The torch 2.10 / vLLM 0.17 bump requires additional work not yet ready.
Revert to the upstream baseline (vLLM 0.16.0, torch~=2.9.1, torchvision
>=0.24,<0.25, flashinfer 0.6.3).

The VLM embedder vLLM code path is preserved but now raises a clear
RuntimeError at init time if the installed vLLM is < 0.17.0, since that
version is required for VLM pooling support.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Plumbs embed_use_vllm through to EmbedParams (ingest) and RecallConfig
(query), matching the same flag added to the old inprocess/fused examples.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Text embedding via vLLM works on 0.16.0; only the VLM embedder requires
0.17.0 for pooling support. Move the version check accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both nemotron_vlm_captioner and nemotron_parse_v1_2 translate device to
CUDA_VISIBLE_DEVICES and call configure_global_hf_cache_base() before
constructing LLM(). The vLLM path in both local embedders was silently
ignoring these fields. Apply the same pattern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
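A hedged sketch of the pattern this commit describes (the function name and the exact "cuda:N" handling are assumptions; the captioner and parse implementations may differ in detail):

```python
import os


def apply_device_placement(device: str | None) -> None:
    # Translate a requested torch-style device (e.g. "cuda:1") into CUDA_VISIBLE_DEVICES
    # before vLLM's LLM() is constructed, since vLLM follows process-level GPU placement.
    # Note: this mutates process-global state, the concern flagged elsewhere in this review.
    if device and device.startswith("cuda:"):
        os.environ["CUDA_VISIBLE_DEVICES"] = device.split(":", 1)[1]
```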
…pattern

Separate HF and vLLM embedder classes rather than toggling `use_vllm`
on a single class. `create_local_embedder()` dispatches to the appropriate
class. `_BatchEmbedGPUActor` simplified to a single `create_local_embedder()`
call. VLM vLLM embedder retains the vLLM >= 0.17.0 runtime guard.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d processor

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment on lines 71 to 83
```diff
 from nemo_retriever.model.local.llama_nemotron_embed_1b_v2_embedder import (
     LlamaNemotronEmbed1BV2Embedder,
     LlamaNemotronEmbed1BV2VLLMEmbedder,
 )

-    return LlamaNemotronEmbed1BV2Embedder(
+    return LlamaNemotronEmbed1BV2VLLMEmbedder(
-        model_id=model_id,
         device=device,
         hf_cache_dir=hf_cache_dir,
         normalize=normalize,
         max_length=max_length,
+        model_id=model_id,
+        gpu_memory_utilization=gpu_memory_utilization,
+        enforce_eager=enforce_eager,
+        compile_cache_dir=compile_cache_dir,
+        dimensions=dimensions,
     )
```
Contributor


P1 Non-Linux environments silently broken by factory-forcing vLLM

create_local_embedder now always returns LlamaNemotronEmbed1BV2VLLMEmbedder for non-VL models. Because vllm is gated to sys_platform == 'linux' in pyproject.toml, calling this factory on macOS or Windows immediately raises RuntimeError: vLLM is not installed — with no fallback to the perfectly functional HF path (LlamaNemotronEmbed1BV2Embedder). Users doing development, CI, or CPU inference on non-Linux hosts now have no path through the public factory.

Consider keeping a use_vllm: bool parameter (defaulting to True on Linux, False elsewhere) and falling back to the HF class when vLLM is unavailable:

```python
def create_local_embedder(
    model_name: str | None = None,
    *,
    use_vllm: bool | None = None,  # None → auto-detect
    ...
) -> Any:
    if use_vllm is None:
        try:
            import vllm  # noqa: F401
            use_vllm = True
        except ImportError:
            use_vllm = False

    if not is_vl_embed_model(model_name):
        if use_vllm:
            from ...llama_nemotron_embed_1b_v2_embedder import LlamaNemotronEmbed1BV2VLLMEmbedder
            return LlamaNemotronEmbed1BV2VLLMEmbedder(...)
        else:
            from ...llama_nemotron_embed_1b_v2_embedder import LlamaNemotronEmbed1BV2Embedder
            return LlamaNemotronEmbed1BV2Embedder(device=device, hf_cache_dir=hf_cache_dir, model_id=model_id)
```
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/model/__init__.py
Line: 71-83

Comment:
**Non-Linux environments silently broken by factory-forcing vLLM**

`create_local_embedder` now _always_ returns `LlamaNemotronEmbed1BV2VLLMEmbedder` for non-VL models. Because `vllm` is gated to `sys_platform == 'linux'` in `pyproject.toml`, calling this factory on macOS or Windows immediately raises `RuntimeError: vLLM is not installed` — with no fallback to the perfectly functional HF path (`LlamaNemotronEmbed1BV2Embedder`). Users doing development, CI, or CPU inference on non-Linux hosts now have no path through the public factory.

Consider keeping a `use_vllm: bool` parameter (defaulting to `True` on Linux, `False` elsewhere) and falling back to the HF class when vLLM is unavailable:

```python
def create_local_embedder(
    model_name: str | None = None,
    *,
    use_vllm: bool | None = None,  # None → auto-detect
    ...
) -> Any:
    if use_vllm is None:
        try:
            import vllm  # noqa: F401
            use_vllm = True
        except ImportError:
            use_vllm = False

    if not is_vl_embed_model(model_name):
        if use_vllm:
            from ...llama_nemotron_embed_1b_v2_embedder import LlamaNemotronEmbed1BV2VLLMEmbedder
            return LlamaNemotronEmbed1BV2VLLMEmbedder(...)
        else:
            from ...llama_nemotron_embed_1b_v2_embedder import LlamaNemotronEmbed1BV2Embedder
            return LlamaNemotronEmbed1BV2Embedder(device=device, hf_cache_dir=hf_cache_dir, model_id=model_id)
```

How can I resolve this? If you propose a fix, please make it concise.

```python
            device=task_config.get("local_hf_device"),
            hf_cache_dir=task_config.get("local_hf_cache_dir"),
            gpu_memory_utilization=float(task_config.get("gpu_memory_utilization", 0.45)),
            enforce_eager=bool(task_config.get("enforce_eager", False)),
```
Contributor


P1 bool("False") evaluates to True, silently enabling eager mode

task_config values can arrive as strings when the dict is populated from YAML or environment variables. bool("False") returns True in Python because any non-empty string is truthy, so a user setting enforce_eager: false in YAML (parsed as the string "False") would silently get enforce_eager=True, disabling CUDA graphs and degrading throughput with no error.

Suggested change

```diff
-            enforce_eager=bool(task_config.get("enforce_eager", False)),
+            enforce_eager=task_config.get("enforce_eager", False) is True or str(task_config.get("enforce_eager", "false")).lower() in ("1", "true", "yes"),
```

Or use a small helper:

```python
def _to_bool(v, default=False):
    if isinstance(v, bool): return v
    if v is None: return default
    return str(v).strip().lower() in ("1", "true", "yes")
```
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/text_embed/processor.py
Line: 108

Comment:
**`bool("False")` evaluates to `True`, silently enabling eager mode**

`task_config` values can arrive as strings when the dict is populated from YAML or environment variables. `bool("False")` returns `True` in Python because any non-empty string is truthy, so a user setting `enforce_eager: false` in YAML (parsed as the string `"False"`) would silently get `enforce_eager=True`, disabling CUDA graphs and degrading throughput with no error.

```suggestion
            enforce_eager=task_config.get("enforce_eager", False) is True or str(task_config.get("enforce_eager", "false")).lower() in ("1", "true", "yes"),
```

Or use a small helper:

```python
def _to_bool(v, default=False):
    if isinstance(v, bool): return v
    if v is None: return default
    return str(v).strip().lower() in ("1", "true", "yes")
```

How can I resolve this? If you propose a fix, please make it concise.

Comment thread nemo_retriever/src/nemo_retriever/model/__init__.py
Comment thread nemo_retriever/src/nemo_retriever/text_embed/vllm.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/text_embed/vllm.py Outdated
… compatibility

Drop compile_cache_dir plumbing and rely on vLLM default compile cache behavior.

Keep PoolerConfig construction compatible across vLLM versions by avoiding unsupported normalize argument and adding fallbacks.
Prune unused embed_via_vllm and stop exporting it from text_embed.vllm.
Remove the separate LlamaNemotronEmbed1BV2VLLMEmbedder class and move the
vLLM pooling path into the canonical embedder type. create_local_embedder
now instantiates that class and forwards normalize and max_length for Ray
actors. Update factory tests and docs that referred to HuggingFace for the
default text embedding path.

Made-with: Cursor
LlamaNemotronEmbed1BV2Embedder never consumed device; vLLM follows process
placement. create_local_embedder still accepts device for the VL HF path only.
Strip device in TextEmbedGPUActor, stop forwarding it from graph embed actor
and Retriever local embed construction, and adjust factory tests.

Made-with: Cursor
Expose local_query_embed_backend on Retriever and RecallConfig so recall
can use HuggingFace mean-pooled query vectors while ingestion keeps the
existing vLLM text embed path. Add LlamaNemotronEmbed1BV2HFEmbedder with
query:/passage: prefixes and per-batch tokenization aligned with vLLM
micro-batching. Wire through graph_pipeline, BEIR, retrieve_and_score,
and vdb_recall Typer commands. Extend unit tests for Retriever routing and
BEIR kwargs.

Made-with: Cursor
Restore optional device= for API compatibility; emit DeprecationWarning when
set and document vLLM placement via CUDA_VISIBLE_DEVICES / tensor_parallel_size.
Forward device from create_local_embedder for non-VL models. Extend factory
tests and add a deprecation test that loads the real embedder module.

Made-with: Cursor
@jperez999
Collaborator

Might need to coordinate with @edknv and the VL reranker PR

Resolve conflicts between local query embed backend (PR) and multimodal
VL reranker (NVIDIA#1836): keep VL_EMBED_MODEL/VL_RERANK_MODEL exports and
TYPE_CHECKING imports with Any; pass both local_query_embed_backend and
embed_modality into RecallConfig; wire Retriever with local_query_embed_backend
plus reranker bool and reranker_model_name.

Made-with: Cursor
charlesbluca added a commit to charlesbluca/NeMo-Retriever that referenced this pull request Apr 19, 2026
Wire reranker configuration through the full harness → graph_pipeline →
BeirConfig → Retriever chain, which was previously missing for BEIR
evaluation (rerank_modality was always "text" in BEIR mode).

Changes:
- recall/beir.py: add rerank_modality to BeirConfig, forward to Retriever
- examples/graph_pipeline.py: add --rerank-modality CLI flag
- harness/config.py: add reranker, reranker_model_name, rerank_modality
  fields to HarnessConfig with validation and HARNESS_* env var mappings
- harness/run.py: emit --reranker/--no-reranker/--rerank-modality in
  _build_command; capture reranker fields in results.json test_config
- harness/test_configs.yaml: add bo767_baseline, bo767_vl_text,
  bo767_vl_text_image, bo767_vl_text_image_reranked dataset configs
- harness/vllm_bo767_sweep.yaml: new sweep for PR NVIDIA#1494 validation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>
charlesbluca added a commit to charlesbluca/NeMo-Retriever that referenced this pull request Apr 19, 2026
… for vLLM/HF matrix

Adds `embed_local_ingest_backend` field (vllm|hf) so HF and vLLM ingest backends
can be selected per dataset config or sweep override, enabling the HF/HF vs vLLM/HF
vs vLLM/vLLM experiment matrix for PR NVIDIA#1494 validation.

- EmbedParams.local_ingest_backend wires through build_embed_kwargs →
  _BatchEmbedActor to select HF or vLLM at ingest time
- graph_pipeline: adds --embed-local-ingest-backend CLI flag with validation
- HarnessConfig: new field + env var + validation
- Sweep YAML: supports top-level overrides: dict merged with per-run overrides
  (fixes load_nightly_config dropping the overrides key)
- test_configs.yaml: bo767_text_hf (HF/HF) + bo767_text_vllm_hf (vLLM ingest/HF query)
- vllm_bo767_sweep.yaml: embed_workers:1 override (prevents multi-GPU spill with vLLM),
  bo767_text_vllm_hf run, embed_batch_size:16 for HF path (avoids OOM at seq_len=8192)
- harness-run.md: Step 3a.5 config sanity check, updated OOM cause table

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>
…p configs

- graph_pipeline._resolve_file_patterns: use **/<ext> pattern with
  recursive=True so datasets with a corpus/ subdir are found
- test_configs.yaml: add jp20_baseline, jp20_text_hf, jp20_text_vllm_hf,
  jp20_vl_text, jp20_vl_text_image, jp20_vl_text_image_reranked entries
- harness/jp20_vl_sweep.yaml: add JP20 vLLM validation sweep (6 runs)
- pyproject.toml: document Python.h requirement for vLLM torch inductor
- README.md: add uv python install 3.12 step to ensure headers for vLLM

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>
Comment on lines +39 to +66
```python
    def __post_init__(self) -> None:
        self._tokenizer = None
        self._model = None
        self._device = None

        from nemo_retriever.model import _DEFAULT_EMBED_MODEL
        from transformers import AutoModel, AutoTokenizer

        model_id = self.model_id or _DEFAULT_EMBED_MODEL
        dev = torch.device(self.device or ("cuda" if torch.cuda.is_available() else "cpu"))
        hf_cache_dir = configure_global_hf_cache_base(self.hf_cache_dir)
        _revision = get_hf_revision(model_id)
        self._tokenizer = AutoTokenizer.from_pretrained(
            model_id,
            revision=_revision,
            cache_dir=hf_cache_dir,
            trust_remote_code=True,
        )
        self._model = AutoModel.from_pretrained(
            model_id,
            revision=_revision,
            trust_remote_code=True,
            cache_dir=hf_cache_dir,
            torch_dtype=torch.bfloat16,
        )
        self._model = self._model.to(dev)
        self._model.eval()
        self._device = dev
```
Contributor


P1 Eager model load in __post_init__ violates model-lazy-loading rule

LlamaNemotronEmbed1BV2HFEmbedder.__post_init__ calls AutoTokenizer.from_pretrained and AutoModel.from_pretrained unconditionally at construction time, including a .to(dev) call that places weights on GPU immediately. The model-lazy-loading rule requires models to load on first use, not at construction. Any code that constructs this class (e.g. create_local_query_embedder(..., backend="hf")) will trigger a full HF download/load and GPU transfer before any embed call is made.

Defer loading to the first _embed_local call and guard with an _ensure_loaded() helper, or document (in the class docstring) that GPU allocation occurs at construction time as an acknowledged design trade-off.

Rule Used: Models must be loaded lazily (on first use or expl... (source)

Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_1b_v2_hf_embedder.py
Line: 39-66

Comment:
**Eager model load in `__post_init__` violates `model-lazy-loading` rule**

`LlamaNemotronEmbed1BV2HFEmbedder.__post_init__` calls `AutoTokenizer.from_pretrained` and `AutoModel.from_pretrained` unconditionally at construction time, including a `.to(dev)` call that places weights on GPU immediately. The `model-lazy-loading` rule requires models to load on first use, not at construction. Any code that constructs this class (e.g. `create_local_query_embedder(..., backend="hf")`) will trigger a full HF download/load and GPU transfer before any embed call is made.

Defer loading to the first `_embed_local` call and guard with an `_ensure_loaded()` helper, or document (in the class docstring) that GPU allocation occurs at construction time as an acknowledged design trade-off.

**Rule Used:** Models must be loaded lazily (on first use or expl... ([source](https://app.greptile.com/review/custom-context?memory=model-lazy-loading))

How can I resolve this? If you propose a fix, please make it concise.
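A minimal lazy-loading sketch along the lines the rule and reviewer suggest (the class name, `_ensure_loaded`, and the trimmed field set are illustrative, not the PR's actual code):

```python
from dataclasses import dataclass, field
from typing import Any, Optional, Sequence


@dataclass
class LazyHFEmbedder:
    model_id: Optional[str] = None
    device: Optional[str] = None
    _tokenizer: Any = field(default=None, init=False, repr=False)
    _model: Any = field(default=None, init=False, repr=False)

    def _ensure_loaded(self) -> None:
        if self._model is not None:
            return
        import torch
        from transformers import AutoModel, AutoTokenizer

        dev = torch.device(self.device or ("cuda" if torch.cuda.is_available() else "cpu"))
        self._tokenizer = AutoTokenizer.from_pretrained(self.model_id, trust_remote_code=True)
        self._model = (
            AutoModel.from_pretrained(self.model_id, trust_remote_code=True, torch_dtype=torch.bfloat16)
            .to(dev)
            .eval()
        )

    def embed(self, texts: Sequence[str], batch_size: int = 32) -> Any:
        self._ensure_loaded()  # weights load on first use; construction stays cheap
        ...  # tokenize, forward, mean-pool as in the eager snippet above (elided)
```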

charlesbluca and others added 4 commits April 21, 2026 18:26
Brings in vLLM 0.17 (torch 2.10, torchvision 0.25), Portal MCP server,
QA eval pipeline, BEIR eval updates, nemotron-ocr-v2, and other upstream
changes. Updates blocker comments now that vLLM 0.17 is available.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…edder

- Add limit_mm_per_prompt param to create_vllm_llm (off by default, no
  impact on text-only callers)
- Add embed_multimodal_with_vllm_llm helper for multimodal prompt dicts
- Implement embed_images and embed_text_image on the vLLM VL embedder
  using confirmed prompt format from the model card ("passage: <image> ")
- Bump flashinfer to 0.6.4 (required by vLLM 0.17)
- Add 19 unit tests covering the new helper, parameter forwarding, and
  both new embedder methods

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Brings in HF_HUB_OFFLINE=1 Ray worker fix and --reranker-invoke-url
addition to graph_pipeline.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Brings in pre-commit hook for uv lockfile regeneration and verification.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread nemo_retriever/src/nemo_retriever/text_embed/vllm.py Outdated
charlesbluca and others added 2 commits April 21, 2026 19:09
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment on lines +176 to +180
```python
    model_id: Optional[str] = None
    device: Optional[str] = None
    hf_cache_dir: Optional[str] = None
    gpu_memory_utilization: float = 0.45
    enforce_eager: bool = False
```
Contributor


P1 _llm not declared as a dataclass field

LlamaNemotronEmbedVL1BV2VLLMEmbedder assigns self._llm in __post_init__ but never declares it as a typed dataclass field. Compare with LlamaNemotronEmbed1BV2Embedder which has _llm: Any = field(default=None, init=False, repr=False). Without the field declaration there is no None sentinel before __post_init__ runs, so if anything in lines 188–215 raises before the assignment on line 210, a subsequent call to embed() / embed_queries() will raise an opaque AttributeError: 'LlamaNemotronEmbedVL1BV2VLLMEmbedder' object has no attribute '_llm' rather than propagating the original construction error.

Suggested change

```diff
     model_id: Optional[str] = None
     device: Optional[str] = None
     hf_cache_dir: Optional[str] = None
     gpu_memory_utilization: float = 0.45
     enforce_eager: bool = False
+
+    _llm: Any = field(default=None, init=False, repr=False)
```

Also requires adding `field` to the `from dataclasses import dataclass, field` import.

Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_vl_1b_v2_embedder.py
Line: 176-180

Comment:
**`_llm` not declared as a dataclass field**

`LlamaNemotronEmbedVL1BV2VLLMEmbedder` assigns `self._llm` in `__post_init__` but never declares it as a typed dataclass field. Compare with `LlamaNemotronEmbed1BV2Embedder` which has `_llm: Any = field(default=None, init=False, repr=False)`. Without the field declaration there is no `None` sentinel before `__post_init__` runs, so if anything in lines 188–215 raises before the assignment on line 210, a subsequent call to `embed()` / `embed_queries()` will raise an opaque `AttributeError: 'LlamaNemotronEmbedVL1BV2VLLMEmbedder' object has no attribute '_llm'` rather than propagating the original construction error.

```suggestion
    model_id: Optional[str] = None
    device: Optional[str] = None
    hf_cache_dir: Optional[str] = None
    gpu_memory_utilization: float = 0.45
    enforce_eager: bool = False

    _llm: Any = field(default=None, init=False, repr=False)
```

Also requires adding `field` to the `from dataclasses import dataclass, field` import.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 44 to 46
```python
            gpu_memory_utilization=float(self._kwargs.get("gpu_memory_utilization", 0.45)),
            enforce_eager=bool(self._kwargs.get("enforce_eager", False)),
        )
```
Contributor


P1 bool("False") silently enables eager mode

self._kwargs.get("enforce_eager", False) can return the string "False" when kwargs are populated from YAML or environment variables. bool("False") evaluates to True in Python because any non-empty string is truthy, so a user setting enforce_eager: false in config silently gets enforce_eager=True, disabling CUDA graphs and degrading throughput. The same pattern appears in processor.py line 108.

Or introduce a small shared helper — e.g. _to_bool(v, default=False) — and call it in both places.

Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/text_embed/gpu_operator.py
Line: 44-46

Comment:
**`bool("False")` silently enables eager mode**

`self._kwargs.get("enforce_eager", False)` can return the string `"False"` when kwargs are populated from YAML or environment variables. `bool("False")` evaluates to `True` in Python because any non-empty string is truthy, so a user setting `enforce_eager: false` in config silently gets `enforce_eager=True`, disabling CUDA graphs and degrading throughput. The same pattern appears in `processor.py` line 108.

Or introduce a small shared helper — e.g. `_to_bool(v, default=False)` — and call it in both places.

How can I resolve this? If you propose a fix, please make it concise.
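For illustration, the helper proposed in the earlier processor.py comment applied at both call sites the reviewer names (the surrounding constructor calls are elided; this is the reviewer's proposal, not code that exists in the PR):

```python
def _to_bool(v, default=False):
    # Coerce YAML/env string values ("false", "0", "yes", ...) to a real bool.
    if isinstance(v, bool):
        return v
    if v is None:
        return default
    return str(v).strip().lower() in ("1", "true", "yes")

# processor.py call site:
#     enforce_eager=_to_bool(task_config.get("enforce_eager"), default=False),
# gpu_operator.py call site:
#     enforce_eager=_to_bool(self._kwargs.get("enforce_eager"), default=False),
```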

charlesbluca and others added 3 commits April 21, 2026 19:32
- EmbedParams.local_ingest_backend controls vllm vs hf at ingest time for non-VL models
- _BatchEmbedActor dispatches to create_local_query_embedder(backend="hf") or
  create_local_embedder (vLLM) based on that field
- _embed_group: pass effective_batch_size (image-modality-capped) as local_batch_size
  instead of the raw inference_batch_size so the local vLLM path respects the cap
- embed_text_main_text_embed: clear CUDA cache on exception to release GPU memory
  before Ray retries the batch

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
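A hedged sketch of the cache-clearing behaviour this commit describes (the wrapper name is illustrative; in the PR the logic sits inside the embed actor rather than a free function):

```python
import torch


def embed_with_cleanup(embedder, texts, batch_size: int):
    try:
        return embedder.embed(texts, batch_size=batch_size)
    except Exception:
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # free cached allocator blocks before Ray re-runs the batch
        raise
```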
create_local_embedder now accepts backend="auto"|"hf"|"vllm" for VL models:
- auto/vllm → LlamaNemotronEmbedVL1BV2VLLMEmbedder (requires vLLM >= 0.17)
- hf         → LlamaNemotronEmbedVL1BV2Embedder

create_local_query_embedder delegates to create_local_embedder with the
requested backend for VL, matching non-VL behaviour.

Retriever._get_local_embedder is simplified to always call
create_local_query_embedder; backend_raw is now respected for VL models
(previously hardcoded to HF regardless of local_query_embed_backend).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ard HF_HUB_OFFLINE to Ray workers

graph_pipeline.py was missing the --embed-local-ingest-backend option that
run.py was already generating; wires the flag through to EmbedParams.

graph_ingestor.py initialized Ray before executor.py could run, bypassing
the HF_HUB_OFFLINE forwarding in executor.py; adds the env var directly
to graph_ingestor's ray_env_vars dict.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Charles Blackmon-Luca <charlesb@nvidia.com>
Signed-off-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>