(lib) Add support for vLLM-backed text embedder #1494
charlesbluca wants to merge 61 commits into NVIDIA:main from
Conversation
…embeddings-1 Made-with: Cursor # Conflicts: # nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py
- Add BeirConfig.use_vllm (default False), forward as embed_use_vllm - batch_pipeline: pass embed_use_vllm into BeirConfig for BEIR eval - Extend test_evaluate_lancedb_beir to assert Retriever kwargs Made-with: Cursor
The upstream refactor replaced ingest_modes/ with text_embed/operators.py. Add vLLM branching to _BatchEmbedGPUActor.__init__() so that embed_use_vllm=True in EmbedParams routes through create_local_embedder() with use_vllm=True, matching the interface used for recall queries. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
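The routing described above can be sketched as follows. This is a minimal illustration, not the real actor: `create_local_embedder` here is a stand-in stub, and the actual class takes many more parameters.

```python
def create_local_embedder(use_vllm: bool = False) -> str:
    """Stand-in for the real factory; returns a label for demonstration."""
    return "vllm-embedder" if use_vllm else "hf-embedder"


class BatchEmbedGPUActorSketch:
    """Sketch of the __init__ branching: embed_use_vllm selects the backend."""

    def __init__(self, embed_use_vllm: bool = False) -> None:
        # embed_use_vllm=True routes through the vLLM-backed local embedder,
        # matching the interface used for recall queries
        self.embedder = create_local_embedder(use_vllm=embed_use_vllm)
```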
Greptile Summary

This PR adds vLLM-backed text embedding support to the NeMo Retriever library, introducing … Several previously-flagged P1 concerns from earlier review rounds (missing …
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/text_embed/vllm.py | New helpers create_vllm_llm, embed_with_vllm_llm, embed_multimodal_with_vllm_llm; PoolerConfig outer catch now correctly limits to ImportError; well-structured with proper batching logic. |
| nemo_retriever/src/nemo_retriever/model/__init__.py | Factory now routes via backend=auto/hf/vllm; non-VL create_local_embedder always returns vLLM embedder with no auto-fallback on non-Linux platforms. |
| nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_1b_v2_embedder.py | Class repurposed as vLLM-backed embedder; eager GPU load in `__post_init__` and missing unload() violate model-lazy-loading/gpu-memory-lifecycle rules; device deprecation warning is correctly implemented. |
| nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_vl_1b_v2_embedder.py | New LlamaNemotronEmbedVL1BV2VLLMEmbedder added; _llm not declared as dataclass field (could cause AttributeError on failed construction); CUDA_VISIBLE_DEVICES global mutation still present. |
| nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_1b_v2_hf_embedder.py | New HF-only embedder for query-side use; loads model eagerly in `__post_init__`; embed() signature lacks prefix kwarg (inconsistent with vLLM embedder's embed()). |
| nemo_retriever/src/nemo_retriever/text_embed/processor.py | Double-prefix bug fixed by passing prefix='' to embed(); VL/non-VL routing correct; bool('False') for enforce_eager still silently enables eager mode when YAML injects string values. |
| nemo_retriever/src/nemo_retriever/text_embed/gpu_operator.py | Correctly routes HF vs vLLM ingest backends; bool('False') for enforce_eager still present (string from YAML evaluates truthy), disabling CUDA graphs silently. |
| nemo_retriever/src/nemo_retriever/text_embed/runtime.py | local_batch_size bug fixed (now uses effective_batch_size); GPU cache clear on exception added; skip_prefix heuristic works correctly for all new model types. |
| nemo_retriever/src/nemo_retriever/retriever.py | _get_local_embedder now uses create_local_query_embedder with backend selection; embed_queries() replaces manual 'query: ' prefixing; cache keyed by (resolved_name, backend) tuple. |
| nemo_retriever/pyproject.toml | vllm bumped to 0.17.0 with both x86_64 and aarch64 CUDA 13 wheels in uv.sources; flashinfer-cubin and flashinfer-python added without uv.sources wheel pinning. |
| nemo_retriever/tests/test_vllm_embed.py | New test file covering embed_with_vllm_llm, embed_multimodal_with_vllm_llm, create_vllm_llm, and VLLMEmbedder image/text-image paths; good coverage of edge cases. |
| nemo_retriever/tests/test_create_local_embedder.py | New deprecation test for device kwarg; gpu_memory_utilization kwarg forwarding assertion added; tests look correct. |
Class Diagram

```mermaid
%%{init: {'theme': 'neutral'}}%%
classDiagram
    class LlamaNemotronEmbed1BV2Embedder {
        +model_id: Optional[str]
        +device: Optional[str] deprecated
        +gpu_memory_utilization: float
        +enforce_eager: bool
        -_llm: Any
        +embed(texts, batch_size, prefix) Tensor
        +embed_queries(texts, batch_size) Tensor
    }
    class LlamaNemotronEmbed1BV2HFEmbedder {
        +device: Optional[str]
        +model_id: Optional[str]
        -_tokenizer: Any
        -_model: Any
        +embed(texts, batch_size) Tensor
        +embed_queries(texts, batch_size) Tensor
    }
    class LlamaNemotronEmbedVL1BV2Embedder {
        +device: Optional[str]
        +model_id: Optional[str]
        -_model: Any
        +embed(texts, batch_size) Tensor
        +embed_queries(texts, batch_size) Tensor
        +embed_images(images_b64, batch_size) Tensor
        +embed_text_image(texts, images_b64, batch_size) Tensor
    }
    class LlamaNemotronEmbedVL1BV2VLLMEmbedder {
        +model_id: Optional[str]
        +gpu_memory_utilization: float
        +enforce_eager: bool
        +embed(texts, batch_size) Tensor
        +embed_queries(texts, batch_size) Tensor
        +embed_images(images_b64, batch_size) Tensor
        +embed_text_image(texts, images_b64, batch_size) Tensor
    }
    class create_local_embedder {
        <<factory>>
    }
    class create_local_query_embedder {
        <<factory>>
    }
    create_local_embedder --> LlamaNemotronEmbed1BV2Embedder : non-VL
    create_local_embedder --> LlamaNemotronEmbedVL1BV2VLLMEmbedder : VL + auto/vllm
    create_local_embedder --> LlamaNemotronEmbedVL1BV2Embedder : VL + hf
    create_local_query_embedder --> LlamaNemotronEmbed1BV2HFEmbedder : non-VL + hf
    create_local_query_embedder --> create_local_embedder : otherwise
```
Path: nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_1b_v2_hf_embedder.py
Line: 116-128
Comment:
**Inconsistent `embed()` interface with vLLM counterpart**
`LlamaNemotronEmbed1BV2HFEmbedder.embed()` does not accept a `prefix` kwarg, while `LlamaNemotronEmbed1BV2Embedder.embed()` does (`prefix: str = "passage: "`). Code that switches backends via `create_local_query_embedder(..., backend="hf")` and then calls `embedder.embed(..., prefix="")` would get a `TypeError`. The `public-api-contract` rule requires interchangeable backends to share the same call signature. Consider adding `prefix: str = "passage: "` here (even if it's effectively ignored since the HF path self-prefixes), so callers don't need to know which backend is active.
**Rule Used:** Every public class and function in nemo_retriever ... ([source](https://app.greptile.com/review/custom-context?memory=public-api-contract))
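The signature-parity fix the reviewer suggests can be sketched as below. The embedding body is a stand-in (the real method tokenizes and pools); the point is only that the HF embedder accepts `prefix` so backend-agnostic callers never hit a `TypeError`.

```python
from typing import List


class HFEmbedderSketch:
    """Sketch: HF embedder accepting `prefix` for parity with the vLLM embedder."""

    def embed(self, texts: List[str], batch_size: int = 32,
              prefix: str = "passage: ") -> List[str]:
        # Stand-in "embedding": apply the prefix so the signature can be exercised.
        # In the real class the argument could be accepted and ignored, since the
        # HF path self-prefixes.
        return [prefix + t for t in texts]
```

With this, `embedder.embed(docs, prefix="")` works regardless of which backend `create_local_query_embedder` returned.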
Reviews (27). Last reviewed commit: "fix(embed): add missing --embed-local-in..."
The torch 2.10 / vLLM 0.17 bump requires additional work not yet ready. Revert to the upstream baseline (vLLM 0.16.0, torch~=2.9.1, torchvision >=0.24,<0.25, flashinfer 0.6.3). The VLM embedder vLLM code path is preserved but now raises a clear RuntimeError at init time if the installed vLLM is < 0.17.0, since that version is required for VLM pooling support. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Plumbs embed_use_vllm through to EmbedParams (ingest) and RecallConfig (query), matching the same flag added to the old inprocess/fused examples. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Text embedding via vLLM works on 0.16.0; only the VLM embedder requires 0.17.0 for pooling support. Move the version check accordingly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
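The init-time guard described above might look like the sketch below. Function and helper names here are assumptions (the commit doesn't show the code), and the naive version parsing would not handle pre-release suffixes like `0.17.0rc1`.

```python
def _parse_version(v: str) -> tuple:
    # Naive parse: "0.17.0" -> (0, 17, 0); pre-release tags are not handled
    return tuple(int(p) for p in v.split(".")[:3])


def require_vllm_version(installed: str, minimum: str = "0.17.0") -> None:
    """Raise a clear RuntimeError when the installed vLLM is too old for VLM pooling."""
    if _parse_version(installed) < _parse_version(minimum):
        raise RuntimeError(
            f"vLLM >= {minimum} is required for VLM pooling support (found {installed})"
        )
```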
Both nemotron_vlm_captioner and nemotron_parse_v1_2 translate device to CUDA_VISIBLE_DEVICES and call configure_global_hf_cache_base() before constructing LLM(). The vLLM path in both local embedders was silently ignoring these fields. Apply the same pattern. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
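The device-to-environment translation pattern referenced above can be sketched like this; the helper name is hypothetical, and the real code also calls configure_global_hf_cache_base() before constructing LLM().

```python
import os
from typing import Optional


def apply_device_placement(device: Optional[str]) -> None:
    """Translate a torch-style device string (e.g. "cuda:1") into
    CUDA_VISIBLE_DEVICES so vLLM's LLM() lands on the intended GPU."""
    if device and device.startswith("cuda:"):
        os.environ["CUDA_VISIBLE_DEVICES"] = device.split(":", 1)[1]
```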
…pattern Separate HF and vLLM embedder classes rather than toggling `use_vllm` on a single class. `create_local_embedder()` dispatches to the appropriate class. `_BatchEmbedGPUActor` simplified to a single `create_local_embedder()` call. VLM vLLM embedder retains the vLLM >= 0.17.0 runtime guard. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d processor Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
```diff
 from nemo_retriever.model.local.llama_nemotron_embed_1b_v2_embedder import (
     LlamaNemotronEmbed1BV2Embedder,
     LlamaNemotronEmbed1BV2VLLMEmbedder,
 )

-    return LlamaNemotronEmbed1BV2Embedder(
-        model_id=model_id,
-        device=device,
-        hf_cache_dir=hf_cache_dir,
-        normalize=normalize,
-        max_length=max_length,
+    return LlamaNemotronEmbed1BV2VLLMEmbedder(
+        model_id=model_id,
+        gpu_memory_utilization=gpu_memory_utilization,
+        enforce_eager=enforce_eager,
+        compile_cache_dir=compile_cache_dir,
+        dimensions=dimensions,
     )
```
Non-Linux environments silently broken by factory-forcing vLLM
create_local_embedder now always returns LlamaNemotronEmbed1BV2VLLMEmbedder for non-VL models. Because vllm is gated to sys_platform == 'linux' in pyproject.toml, calling this factory on macOS or Windows immediately raises RuntimeError: vLLM is not installed — with no fallback to the perfectly functional HF path (LlamaNemotronEmbed1BV2Embedder). Users doing development, CI, or CPU inference on non-Linux hosts now have no path through the public factory.
Consider keeping a use_vllm: bool parameter (defaulting to True on Linux, False elsewhere) and falling back to the HF class when vLLM is unavailable:
```python
def create_local_embedder(
    model_name: str | None = None,
    *,
    use_vllm: bool | None = None,  # None → auto-detect
    ...
) -> Any:
    if use_vllm is None:
        try:
            import vllm  # noqa: F401
            use_vllm = True
        except ImportError:
            use_vllm = False
    if not is_vl_embed_model(model_name):
        if use_vllm:
            from ...llama_nemotron_embed_1b_v2_embedder import LlamaNemotronEmbed1BV2VLLMEmbedder
            return LlamaNemotronEmbed1BV2VLLMEmbedder(...)
        else:
            from ...llama_nemotron_embed_1b_v2_embedder import LlamaNemotronEmbed1BV2Embedder
            return LlamaNemotronEmbed1BV2Embedder(device=device, hf_cache_dir=hf_cache_dir, model_id=model_id)
```
```python
device=task_config.get("local_hf_device"),
hf_cache_dir=task_config.get("local_hf_cache_dir"),
gpu_memory_utilization=float(task_config.get("gpu_memory_utilization", 0.45)),
enforce_eager=bool(task_config.get("enforce_eager", False)),
```
bool("False") evaluates to True, silently enabling eager mode
task_config values can arrive as strings when the dict is populated from YAML or environment variables. bool("False") returns True in Python because any non-empty string is truthy, so a user setting enforce_eager: false in YAML (parsed as the string "False") would silently get enforce_eager=True, disabling CUDA graphs and degrading throughput with no error.
Suggested change:

```diff
-enforce_eager=bool(task_config.get("enforce_eager", False)),
+enforce_eager=task_config.get("enforce_eager", False) is True or str(task_config.get("enforce_eager", "false")).lower() in ("1", "true", "yes"),
```
Or use a small helper:

```python
def _to_bool(v, default=False):
    if isinstance(v, bool):
        return v
    if v is None:
        return default
    return str(v).strip().lower() in ("1", "true", "yes")
```
… compatibility: drop compile_cache_dir plumbing and rely on vLLM's default compile cache behavior. Keep PoolerConfig construction compatible across vLLM versions by avoiding the unsupported normalize argument and adding fallbacks.
Prune unused embed_via_vllm and stop exporting it from text_embed.vllm.
Remove the separate LlamaNemotronEmbed1BV2VLLMEmbedder class and move the vLLM pooling path into the canonical embedder type. create_local_embedder now instantiates that class and forwards normalize and max_length for Ray actors. Update factory tests and docs that referred to HuggingFace for the default text embedding path. Made-with: Cursor
LlamaNemotronEmbed1BV2Embedder never consumed device; vLLM follows process placement. create_local_embedder still accepts device for the VL HF path only. Strip device in TextEmbedGPUActor, stop forwarding it from graph embed actor and Retriever local embed construction, and adjust factory tests. Made-with: Cursor
Expose local_query_embed_backend on Retriever and RecallConfig so recall can use HuggingFace mean-pooled query vectors while ingestion keeps the existing vLLM text embed path. Add LlamaNemotronEmbed1BV2HFEmbedder with query:/passage: prefixes and per-batch tokenization aligned with vLLM micro-batching. Wire through graph_pipeline, BEIR, retrieve_and_score, and vdb_recall Typer commands. Extend unit tests for Retriever routing and BEIR kwargs. Made-with: Cursor
Restore optional device= for API compatibility; emit DeprecationWarning when set and document vLLM placement via CUDA_VISIBLE_DEVICES / tensor_parallel_size. Forward device from create_local_embedder for non-VL models. Extend factory tests and add a deprecation test that loads the real embedder module. Made-with: Cursor
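The deprecation behavior described in that commit can be sketched as follows. This is an illustrative stand-in, not the real embedder class (the actual one carries many more fields).

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class EmbedderSketch:
    # Deprecated: vLLM follows process placement; use CUDA_VISIBLE_DEVICES
    # or tensor_parallel_size instead.
    device: Optional[str] = None

    def __post_init__(self) -> None:
        if self.device is not None:
            warnings.warn(
                "device= is deprecated; control placement with "
                "CUDA_VISIBLE_DEVICES or tensor_parallel_size",
                DeprecationWarning,
                stacklevel=2,
            )
```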
Might need to coordinate with @edknv and the VL reranker PR.
Resolve conflicts between local query embed backend (PR) and multimodal VL reranker (NVIDIA#1836): keep VL_EMBED_MODEL/VL_RERANK_MODEL exports and TYPE_CHECKING imports with Any; pass both local_query_embed_backend and embed_modality into RecallConfig; wire Retriever with local_query_embed_backend plus reranker bool and reranker_model_name. Made-with: Cursor
Wire reranker configuration through the full harness → graph_pipeline → BeirConfig → Retriever chain, which was previously missing for BEIR evaluation (rerank_modality was always "text" in BEIR mode). Changes:
- recall/beir.py: add rerank_modality to BeirConfig, forward to Retriever
- examples/graph_pipeline.py: add --rerank-modality CLI flag
- harness/config.py: add reranker, reranker_model_name, rerank_modality fields to HarnessConfig with validation and HARNESS_* env var mappings
- harness/run.py: emit --reranker/--no-reranker/--rerank-modality in _build_command; capture reranker fields in results.json test_config
- harness/test_configs.yaml: add bo767_baseline, bo767_vl_text, bo767_vl_text_image, bo767_vl_text_image_reranked dataset configs
- harness/vllm_bo767_sweep.yaml: new sweep for PR NVIDIA#1494 validation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>

… for vLLM/HF matrix: adds `embed_local_ingest_backend` field (vllm|hf) so HF and vLLM ingest backends can be selected per dataset config or sweep override, enabling the HF/HF vs vLLM/HF vs vLLM/vLLM experiment matrix for PR NVIDIA#1494 validation.
- EmbedParams.local_ingest_backend wires through build_embed_kwargs → _BatchEmbedActor to select HF or vLLM at ingest time
- graph_pipeline: adds --embed-local-ingest-backend CLI flag with validation
- HarnessConfig: new field + env var + validation
- Sweep YAML: supports top-level overrides: dict merged with per-run overrides (fixes load_nightly_config dropping the overrides key)
- test_configs.yaml: bo767_text_hf (HF/HF) + bo767_text_vllm_hf (vLLM ingest/HF query)
- vllm_bo767_sweep.yaml: embed_workers:1 override (prevents multi-GPU spill with vLLM), bo767_text_vllm_hf run, embed_batch_size:16 for HF path (avoids OOM at seq_len=8192)
- harness-run.md: Step 3a.5 config sanity check, updated OOM cause table

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>

…p configs
- graph_pipeline._resolve_file_patterns: use **/<ext> pattern with recursive=True so datasets with a corpus/ subdir are found
- test_configs.yaml: add jp20_baseline, jp20_text_hf, jp20_text_vllm_hf, jp20_vl_text, jp20_vl_text_image, jp20_vl_text_image_reranked entries
- harness/jp20_vl_sweep.yaml: add JP20 vLLM validation sweep (6 runs)
- pyproject.toml: document Python.h requirement for vLLM torch inductor
- README.md: add uv python install 3.12 step to ensure headers for vLLM

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>
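The recursive-glob change described in the first bullet can be illustrated with a minimal sketch; `resolve_files` is a hypothetical stand-in for `_resolve_file_patterns`.

```python
import glob
import os
import tempfile


def resolve_files(root: str, ext: str = "*.jsonl"):
    # A "**" component with recursive=True matches zero or more directories,
    # so files both directly under root and inside corpus/ subdirs are found.
    return sorted(glob.glob(os.path.join(root, "**", ext), recursive=True))


with tempfile.TemporaryDirectory() as d:
    os.makedirs(os.path.join(d, "corpus"))
    open(os.path.join(d, "corpus", "docs.jsonl"), "w").close()
    found = resolve_files(d)
```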
```python
def __post_init__(self) -> None:
    self._tokenizer = None
    self._model = None
    self._device = None

    from nemo_retriever.model import _DEFAULT_EMBED_MODEL
    from transformers import AutoModel, AutoTokenizer

    model_id = self.model_id or _DEFAULT_EMBED_MODEL
    dev = torch.device(self.device or ("cuda" if torch.cuda.is_available() else "cpu"))
    hf_cache_dir = configure_global_hf_cache_base(self.hf_cache_dir)
    _revision = get_hf_revision(model_id)
    self._tokenizer = AutoTokenizer.from_pretrained(
        model_id,
        revision=_revision,
        cache_dir=hf_cache_dir,
        trust_remote_code=True,
    )
    self._model = AutoModel.from_pretrained(
        model_id,
        revision=_revision,
        trust_remote_code=True,
        cache_dir=hf_cache_dir,
        torch_dtype=torch.bfloat16,
    )
    self._model = self._model.to(dev)
    self._model.eval()
    self._device = dev
```
**Eager model load in `__post_init__` violates `model-lazy-loading` rule**
LlamaNemotronEmbed1BV2HFEmbedder.__post_init__ calls AutoTokenizer.from_pretrained and AutoModel.from_pretrained unconditionally at construction time, including a .to(dev) call that places weights on GPU immediately. The model-lazy-loading rule requires models to load on first use, not at construction. Any code that constructs this class (e.g. create_local_query_embedder(..., backend="hf")) will trigger a full HF download/load and GPU transfer before any embed call is made.
Defer loading to the first _embed_local call and guard with an _ensure_loaded() helper, or document (in the class docstring) that GPU allocation occurs at construction time as an acknowledged design trade-off.
Rule Used: Models must be loaded lazily (on first use or expl... (source)
Brings in vLLM 0.17 (torch 2.10, torchvision 0.25), Portal MCP server, QA eval pipeline, BEIR eval updates, nemotron-ocr-v2, and other upstream changes. Updates blocker comments now that vLLM 0.17 is available. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…edder
- Add limit_mm_per_prompt param to create_vllm_llm (off by default, no
impact on text-only callers)
- Add embed_multimodal_with_vllm_llm helper for multimodal prompt dicts
- Implement embed_images and embed_text_image on the vLLM VL embedder
using confirmed prompt format from the model card ("passage: <image> ")
- Bump flashinfer to 0.6.4 (required by vLLM 0.17)
- Add 19 unit tests covering the new helper, parameter forwarding, and
both new embedder methods
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
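A multimodal prompt dict using the model-card format quoted above ("passage: <image> ") might be built like this; the dict keys follow common vLLM multimodal conventions but are assumptions here, not code from the PR.

```python
def build_image_prompt(image_b64: str) -> dict:
    """Hypothetical sketch: wrap a base64 image in an embedding prompt dict."""
    return {
        "prompt": "passage: <image> ",          # format from the model card
        "multi_modal_data": {"image": image_b64},
    }
```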
Brings in HF_HUB_OFFLINE=1 Ray worker fix and --reranker-invoke-url addition to graph_pipeline.py. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Brings in pre-commit hook for uv lockfile regeneration and verification. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
```python
model_id: Optional[str] = None
device: Optional[str] = None
hf_cache_dir: Optional[str] = None
gpu_memory_utilization: float = 0.45
enforce_eager: bool = False
```
**`_llm` not declared as a dataclass field**
LlamaNemotronEmbedVL1BV2VLLMEmbedder assigns self._llm in __post_init__ but never declares it as a typed dataclass field. Compare with LlamaNemotronEmbed1BV2Embedder which has _llm: Any = field(default=None, init=False, repr=False). Without the field declaration there is no None sentinel before __post_init__ runs, so if anything in lines 188–215 raises before the assignment on line 210, a subsequent call to embed() / embed_queries() will raise an opaque AttributeError: 'LlamaNemotronEmbedVL1BV2VLLMEmbedder' object has no attribute '_llm' rather than propagating the original construction error.
Suggested change:

```python
model_id: Optional[str] = None
device: Optional[str] = None
hf_cache_dir: Optional[str] = None
gpu_memory_utilization: float = 0.45
enforce_eager: bool = False
_llm: Any = field(default=None, init=False, repr=False)
```
This also requires adding `field` to the dataclasses import (`from dataclasses import dataclass, field`).
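Why the `field` declaration matters: with `field(default=None, init=False)`, the dataclass machinery stores `None` as a class attribute, so the attribute resolves even when `__post_init__` never ran to completion. A minimal demonstration (class names here are stand-ins):

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class VLLMEmbedderSketch:
    model_id: Optional[str] = None
    _llm: Any = field(default=None, init=False, repr=False)

    def embed(self) -> None:
        if self._llm is None:
            # Clear failure instead of an opaque AttributeError
            raise RuntimeError("embedder not initialized (vLLM construction failed?)")


# Simulate a construction that failed before self._llm was assigned:
obj = object.__new__(VLLMEmbedderSketch)
# The class-level default acts as a None sentinel, so obj._llm still resolves.
```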
```python
gpu_memory_utilization=float(self._kwargs.get("gpu_memory_utilization", 0.45)),
enforce_eager=bool(self._kwargs.get("enforce_eager", False)),
)
```
bool("False") silently enables eager mode
self._kwargs.get("enforce_eager", False) can return the string "False" when kwargs are populated from YAML or environment variables. bool("False") evaluates to True in Python because any non-empty string is truthy, so a user setting enforce_eager: false in config silently gets enforce_eager=True, disabling CUDA graphs and degrading throughput. The same pattern appears in processor.py line 108.
Or introduce a small shared helper — e.g. _to_bool(v, default=False) — and call it in both places.
- EmbedParams.local_ingest_backend controls vllm vs hf at ingest time for non-VL models
- _BatchEmbedActor dispatches to create_local_query_embedder(backend="hf") or create_local_embedder (vLLM) based on that field
- _embed_group: pass effective_batch_size (image-modality-capped) as local_batch_size instead of the raw inference_batch_size so the local vLLM path respects the cap
- embed_text_main_text_embed: clear CUDA cache on exception to release GPU memory before Ray retries the batch

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
create_local_embedder now accepts backend="auto"|"hf"|"vllm" for VL models:
- auto/vllm → LlamaNemotronEmbedVL1BV2VLLMEmbedder (requires vLLM >= 0.17)
- hf → LlamaNemotronEmbedVL1BV2Embedder

create_local_query_embedder delegates to create_local_embedder with the requested backend for VL, matching non-VL behaviour. Retriever._get_local_embedder is simplified to always call create_local_query_embedder; backend_raw is now respected for VL models (previously hardcoded to HF regardless of local_query_embed_backend). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
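The dispatch table in that commit can be sketched as below. The `"-vl-"` substring check is a stand-in for the real `is_vl_embed_model()`, and the returned class-name strings are labels rather than actual instances.

```python
def create_local_embedder_sketch(model_name: str, backend: str = "auto") -> str:
    """Sketch of the VL/non-VL backend dispatch described above."""
    is_vl = "-vl-" in model_name  # stand-in for is_vl_embed_model()
    if is_vl:
        if backend in ("auto", "vllm"):
            # Requires vLLM >= 0.17 for VLM pooling support
            return "LlamaNemotronEmbedVL1BV2VLLMEmbedder"
        return "LlamaNemotronEmbedVL1BV2Embedder"
    return "LlamaNemotronEmbed1BV2Embedder"
```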
…ard HF_HUB_OFFLINE to Ray workers graph_pipeline.py was missing the --embed-local-ingest-backend option that run.py was already generating; wires the flag through to EmbedParams. graph_ingestor.py initialized Ray before executor.py could run, bypassing the HF_HUB_OFFLINE forwarding in executor.py; adds the env var directly to graph_ingestor's ray_env_vars dict. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Charles Blackmon-Luca <charlesb@nvidia.com> Signed-off-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>