
(lib) Add support for vLLM-backed text embedder#1494

Open
charlesbluca wants to merge 61 commits into NVIDIA:main from charlesbluca:retriever-vllm-for-embeddings-1

Conversation

@charlesbluca
Collaborator

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, ensure those are mirrored in the Helm values.yaml file.

@charlesbluca charlesbluca requested a review from a team as a code owner March 5, 2026 20:04
@charlesbluca charlesbluca requested a review from ChrisJar March 5, 2026 20:04
…embeddings-1

Made-with: Cursor

# Conflicts:
#	nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py
- Add BeirConfig.use_vllm (default False), forward as embed_use_vllm
- batch_pipeline: pass embed_use_vllm into BeirConfig for BEIR eval
- Extend test_evaluate_lancedb_beir to assert Retriever kwargs

Made-with: Cursor
@charlesbluca charlesbluca changed the title from "(lib) Add support for vLLM-backed embedder" to "(lib) Add support for vLLM-backed text embedder" on Mar 23, 2026
@charlesbluca charlesbluca requested a review from a team as a code owner March 24, 2026 19:40
charlesbluca and others added 3 commits April 7, 2026 18:21
The upstream refactor replaced ingest_modes/ with text_embed/operators.py.
Add vLLM branching to _BatchEmbedGPUActor.__init__() so that
embed_use_vllm=True in EmbedParams routes through create_local_embedder()
with use_vllm=True, matching the interface used for recall queries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@greptile-apps
Contributor

greptile-apps Bot commented Apr 7, 2026

Greptile Summary

This PR adds vLLM-backed text embedding support to the NeMo Retriever library, introducing LlamaNemotronEmbed1BV2Embedder (now vLLM-backed, replacing the former HF class), LlamaNemotronEmbedVL1BV2VLLMEmbedder, LlamaNemotronEmbed1BV2HFEmbedder (new explicit HF class for query embedding), and the embed_with_vllm_llm / embed_multimodal_with_vllm_llm helpers. The factory create_local_embedder is restructured around a backend parameter, Retriever now uses create_local_query_embedder with embed_queries(), and the [local] extra gains vllm==0.17.0, flashinfer-cubin==0.6.4, and flashinfer-python==0.6.4 dependencies.

Several previously-flagged P1 concerns from earlier review rounds (missing _llm dataclass field in LlamaNemotronEmbedVL1BV2VLLMEmbedder, bool("False") silently enabling eager mode in gpu_operator.py and processor.py, CUDA_VISIBLE_DEVICES global mutation) remain unresolved in the current diff and should be addressed before merging.

Confidence Score: 4/5

Safe to merge with caution — core embedding logic is correct, but several previously-flagged P1 issues remain unresolved.

The double-prefix bug is fixed, aarch64 wheel sources are present, version pins are consistent (0.17.0), and PoolerConfig exception handling is now specific to ImportError. However, three previously-flagged P1 concerns are still unaddressed: _llm is not declared as a dataclass field in LlamaNemotronEmbedVL1BV2VLLMEmbedder (can hide construction errors), bool('False') silently enables enforce_eager in gpu_operator.py and processor.py, and CUDA_VISIBLE_DEVICES is still mutated globally in LlamaNemotronEmbedVL1BV2VLLMEmbedder.post_init.

nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_vl_1b_v2_embedder.py (missing _llm field + CUDA_VISIBLE_DEVICES), nemo_retriever/src/nemo_retriever/text_embed/gpu_operator.py and processor.py (bool('False') enforce_eager)

Important Files Changed

| Filename | Overview |
|---|---|
| `nemo_retriever/src/nemo_retriever/text_embed/vllm.py` | New helpers create_vllm_llm, embed_with_vllm_llm, embed_multimodal_with_vllm_llm; PoolerConfig outer catch now correctly limits to ImportError; well-structured with proper batching logic. |
| `nemo_retriever/src/nemo_retriever/model/__init__.py` | Factory now routes via backend=auto/hf/vllm; non-VL create_local_embedder always returns vLLM embedder with no auto-fallback on non-Linux platforms. |
| `nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_1b_v2_embedder.py` | Class repurposed as vLLM-backed embedder; eager GPU load in `__post_init__` and missing unload() violate model-lazy-loading/gpu-memory-lifecycle rules; device deprecation warning is correctly implemented. |
| `nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_vl_1b_v2_embedder.py` | New LlamaNemotronEmbedVL1BV2VLLMEmbedder added; _llm not declared as dataclass field (could cause AttributeError on failed construction); CUDA_VISIBLE_DEVICES global mutation still present. |
| `nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_1b_v2_hf_embedder.py` | New HF-only embedder for query-side use; loads model eagerly in `__post_init__`; embed() signature lacks prefix kwarg (inconsistent with vLLM embedder's embed()). |
| `nemo_retriever/src/nemo_retriever/text_embed/processor.py` | Double-prefix bug fixed by passing prefix='' to embed(); VL/non-VL routing correct; bool('False') for enforce_eager still silently enables eager mode when YAML injects string values. |
| `nemo_retriever/src/nemo_retriever/text_embed/gpu_operator.py` | Correctly routes HF vs vLLM ingest backends; bool('False') for enforce_eager still present (string from YAML evaluates truthy), disabling CUDA graphs silently. |
| `nemo_retriever/src/nemo_retriever/text_embed/runtime.py` | local_batch_size bug fixed (now uses effective_batch_size); GPU cache clear on exception added; skip_prefix heuristic works correctly for all new model types. |
| `nemo_retriever/src/nemo_retriever/retriever.py` | _get_local_embedder now uses create_local_query_embedder with backend selection; embed_queries() replaces manual 'query: ' prefixing; cache keyed by (resolved_name, backend) tuple. |
| `nemo_retriever/pyproject.toml` | vllm bumped to 0.17.0 with both x86_64 and aarch64 CUDA 13 wheels in uv.sources; flashinfer-cubin and flashinfer-python added without uv.sources wheel pinning. |
| `nemo_retriever/tests/test_vllm_embed.py` | New test file covering embed_with_vllm_llm, embed_multimodal_with_vllm_llm, create_vllm_llm, and VLLMEmbedder image/text-image paths; good coverage of edge cases. |
| `nemo_retriever/tests/test_create_local_embedder.py` | New deprecation test for device kwarg; gpu_memory_utilization kwarg forwarding assertion added; tests look correct. |

Class Diagram

```mermaid
%%{init: {'theme': 'neutral'}}%%
classDiagram
    class LlamaNemotronEmbed1BV2Embedder {
        +model_id: Optional[str]
        +device: Optional[str] deprecated
        +gpu_memory_utilization: float
        +enforce_eager: bool
        -_llm: Any
        +embed(texts, batch_size, prefix) Tensor
        +embed_queries(texts, batch_size) Tensor
    }
    class LlamaNemotronEmbed1BV2HFEmbedder {
        +device: Optional[str]
        +model_id: Optional[str]
        -_tokenizer: Any
        -_model: Any
        +embed(texts, batch_size) Tensor
        +embed_queries(texts, batch_size) Tensor
    }
    class LlamaNemotronEmbedVL1BV2Embedder {
        +device: Optional[str]
        +model_id: Optional[str]
        -_model: Any
        +embed(texts, batch_size) Tensor
        +embed_queries(texts, batch_size) Tensor
        +embed_images(images_b64, batch_size) Tensor
        +embed_text_image(texts, images_b64, batch_size) Tensor
    }
    class LlamaNemotronEmbedVL1BV2VLLMEmbedder {
        +model_id: Optional[str]
        +gpu_memory_utilization: float
        +enforce_eager: bool
        +embed(texts, batch_size) Tensor
        +embed_queries(texts, batch_size) Tensor
        +embed_images(images_b64, batch_size) Tensor
        +embed_text_image(texts, images_b64, batch_size) Tensor
    }
    class create_local_embedder {
        <<factory>>
        VL+auto/vllm --> LlamaNemotronEmbedVL1BV2VLLMEmbedder
        VL+hf --> LlamaNemotronEmbedVL1BV2Embedder
        nonVL --> LlamaNemotronEmbed1BV2Embedder
    }
    class create_local_query_embedder {
        <<factory>>
        nonVL+hf --> LlamaNemotronEmbed1BV2HFEmbedder
        otherwise --> create_local_embedder
    }
    create_local_embedder --> LlamaNemotronEmbed1BV2Embedder : non-VL
    create_local_embedder --> LlamaNemotronEmbedVL1BV2VLLMEmbedder : VL + auto/vllm
    create_local_embedder --> LlamaNemotronEmbedVL1BV2Embedder : VL + hf
    create_local_query_embedder --> LlamaNemotronEmbed1BV2HFEmbedder : non-VL + hf
    create_local_query_embedder --> create_local_embedder : otherwise
```
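For readers who prefer code to the diagram, a plain-Python sketch of the same routing (class names are taken from the diagram; the real factories accept many more parameters and return embedder instances rather than names):

```python
# Sketch only: mirrors the create_local_embedder / create_local_query_embedder routing above.
def select_embedder_class(is_vl_model: bool, backend: str = "auto", query_side: bool = False) -> str:
    if backend not in ("auto", "hf", "vllm"):
        raise ValueError(f"unsupported backend: {backend!r}")
    if not is_vl_model and query_side and backend == "hf":
        return "LlamaNemotronEmbed1BV2HFEmbedder"  # HF mean-pooled query path
    if is_vl_model:
        if backend in ("auto", "vllm"):
            return "LlamaNemotronEmbedVL1BV2VLLMEmbedder"  # requires vLLM >= 0.17.0
        return "LlamaNemotronEmbedVL1BV2Embedder"  # HF VL path
    return "LlamaNemotronEmbed1BV2Embedder"  # canonical vLLM-backed text embedder


assert select_embedder_class(is_vl_model=False, backend="hf", query_side=True) == "LlamaNemotronEmbed1BV2HFEmbedder"
assert select_embedder_class(is_vl_model=True) == "LlamaNemotronEmbedVL1BV2VLLMEmbedder"
```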
Prompt To Fix All With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_1b_v2_hf_embedder.py
Line: 116-128

Comment:
**Inconsistent `embed()` interface with vLLM counterpart**

`LlamaNemotronEmbed1BV2HFEmbedder.embed()` does not accept a `prefix` kwarg, while `LlamaNemotronEmbed1BV2Embedder.embed()` does (`prefix: str = "passage: "`). Code that switches backends via `create_local_query_embedder(..., backend="hf")` and then calls `embedder.embed(..., prefix="")` would get a `TypeError`. The `public-api-contract` rule requires interchangeable backends to share the same call signature. Consider adding `prefix: str = "passage: "` here (even if it's effectively ignored since the HF path self-prefixes), so callers don't need to know which backend is active.

**Rule Used:** Every public class and function in nemo_retriever ... ([source](https://app.greptile.com/review/custom-context?memory=public-api-contract))

How can I resolve this? If you propose a fix, please make it concise.
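A minimal sketch of the signature alignment the reviewer suggests (illustrative only: `_embed_local` is the internal helper the review mentions, the default values are placeholders, and the class body here is a stub, not the PR's code):

```python
from typing import Any, Sequence


class LlamaNemotronEmbed1BV2HFEmbedder:
    """Sketch only; the real class lives in nemo_retriever.model.local and has many more members."""

    def _embed_local(self, texts: Sequence[str], batch_size: int) -> Any:
        raise NotImplementedError  # stands in for the actual tokenize / forward / mean-pool path

    def embed(self, texts: Sequence[str], batch_size: int = 32, prefix: str = "passage: ") -> Any:
        # `prefix` is accepted purely for call-signature parity with the vLLM embedder;
        # the HF path self-prefixes passages, so the value is deliberately ignored here.
        del prefix
        return self._embed_local(texts, batch_size=batch_size)
```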

Reviews (27): Last reviewed commit: "fix(embed): add missing --embed-local-in..."

Comment thread nemo_retriever/src/nemo_retriever/text_embed/processor.py Outdated
Comment thread nemo_retriever/pyproject.toml Outdated
Comment thread nemo_retriever/src/nemo_retriever/text_embed/vllm.py
Comment thread nemo_retriever/src/nemo_retriever/text_embed/vllm.py Outdated
Comment thread nemo_retriever/README.md Outdated
charlesbluca and others added 3 commits April 7, 2026 18:29
The torch 2.10 / vLLM 0.17 bump requires additional work not yet ready.
Revert to the upstream baseline (vLLM 0.16.0, torch~=2.9.1, torchvision
>=0.24,<0.25, flashinfer 0.6.3).

The VLM embedder vLLM code path is preserved but now raises a clear
RuntimeError at init time if the installed vLLM is < 0.17.0, since that
version is required for VLM pooling support.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Plumbs embed_use_vllm through to EmbedParams (ingest) and RecallConfig
(query), matching the same flag added to the old inprocess/fused examples.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Text embedding via vLLM works on 0.16.0; only the VLM embedder requires
0.17.0 for pooling support. Move the version check accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both nemotron_vlm_captioner and nemotron_parse_v1_2 translate device to
CUDA_VISIBLE_DEVICES and call configure_global_hf_cache_base() before
constructing LLM(). The vLLM path in both local embedders was silently
ignoring these fields. Apply the same pattern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
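A hedged sketch of the pattern this commit describes (the function name and the exact "cuda:N" handling are assumptions; the captioner and parse implementations may differ in detail):

```python
import os


def apply_device_placement(device: str | None) -> None:
    # Translate a requested torch-style device (e.g. "cuda:1") into CUDA_VISIBLE_DEVICES
    # before vLLM's LLM() is constructed, since vLLM follows process-level GPU placement.
    # Note: this mutates process-global state, the concern flagged elsewhere in this review.
    if device and device.startswith("cuda:"):
        os.environ["CUDA_VISIBLE_DEVICES"] = device.split(":", 1)[1]
```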
…pattern

Separate HF and vLLM embedder classes rather than toggling `use_vllm`
on a single class. `create_local_embedder()` dispatches to the appropriate
class. `_BatchEmbedGPUActor` simplified to a single `create_local_embedder()`
call. VLM vLLM embedder retains the vLLM >= 0.17.0 runtime guard.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d processor

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment on lines 71 to 83
```diff
 from nemo_retriever.model.local.llama_nemotron_embed_1b_v2_embedder import (
     LlamaNemotronEmbed1BV2Embedder,
     LlamaNemotronEmbed1BV2VLLMEmbedder,
 )

-    return LlamaNemotronEmbed1BV2Embedder(
+    return LlamaNemotronEmbed1BV2VLLMEmbedder(
-        model_id=model_id,
         device=device,
         hf_cache_dir=hf_cache_dir,
         normalize=normalize,
         max_length=max_length,
+        model_id=model_id,
+        gpu_memory_utilization=gpu_memory_utilization,
+        enforce_eager=enforce_eager,
+        compile_cache_dir=compile_cache_dir,
+        dimensions=dimensions,
     )
```
Contributor


P1 Non-Linux environments silently broken by factory-forcing vLLM

create_local_embedder now always returns LlamaNemotronEmbed1BV2VLLMEmbedder for non-VL models. Because vllm is gated to sys_platform == 'linux' in pyproject.toml, calling this factory on macOS or Windows immediately raises RuntimeError: vLLM is not installed — with no fallback to the perfectly functional HF path (LlamaNemotronEmbed1BV2Embedder). Users doing development, CI, or CPU inference on non-Linux hosts now have no path through the public factory.

Consider keeping a use_vllm: bool parameter (defaulting to True on Linux, False elsewhere) and falling back to the HF class when vLLM is unavailable:

```python
def create_local_embedder(
    model_name: str | None = None,
    *,
    use_vllm: bool | None = None,  # None → auto-detect
    ...
) -> Any:
    if use_vllm is None:
        try:
            import vllm  # noqa: F401
            use_vllm = True
        except ImportError:
            use_vllm = False

    if not is_vl_embed_model(model_name):
        if use_vllm:
            from ...llama_nemotron_embed_1b_v2_embedder import LlamaNemotronEmbed1BV2VLLMEmbedder
            return LlamaNemotronEmbed1BV2VLLMEmbedder(...)
        else:
            from ...llama_nemotron_embed_1b_v2_embedder import LlamaNemotronEmbed1BV2Embedder
            return LlamaNemotronEmbed1BV2Embedder(device=device, hf_cache_dir=hf_cache_dir, model_id=model_id)
```
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/model/__init__.py
Line: 71-83

Comment:
**Non-Linux environments silently broken by factory-forcing vLLM**

`create_local_embedder` now _always_ returns `LlamaNemotronEmbed1BV2VLLMEmbedder` for non-VL models. Because `vllm` is gated to `sys_platform == 'linux'` in `pyproject.toml`, calling this factory on macOS or Windows immediately raises `RuntimeError: vLLM is not installed` — with no fallback to the perfectly functional HF path (`LlamaNemotronEmbed1BV2Embedder`). Users doing development, CI, or CPU inference on non-Linux hosts now have no path through the public factory.

Consider keeping a `use_vllm: bool` parameter (defaulting to `True` on Linux, `False` elsewhere) and falling back to the HF class when vLLM is unavailable:

```python
def create_local_embedder(
    model_name: str | None = None,
    *,
    use_vllm: bool | None = None,  # None → auto-detect
    ...
) -> Any:
    if use_vllm is None:
        try:
            import vllm  # noqa: F401
            use_vllm = True
        except ImportError:
            use_vllm = False

    if not is_vl_embed_model(model_name):
        if use_vllm:
            from ...llama_nemotron_embed_1b_v2_embedder import LlamaNemotronEmbed1BV2VLLMEmbedder
            return LlamaNemotronEmbed1BV2VLLMEmbedder(...)
        else:
            from ...llama_nemotron_embed_1b_v2_embedder import LlamaNemotronEmbed1BV2Embedder
            return LlamaNemotronEmbed1BV2Embedder(device=device, hf_cache_dir=hf_cache_dir, model_id=model_id)
```

How can I resolve this? If you propose a fix, please make it concise.

```python
            device=task_config.get("local_hf_device"),
            hf_cache_dir=task_config.get("local_hf_cache_dir"),
            gpu_memory_utilization=float(task_config.get("gpu_memory_utilization", 0.45)),
            enforce_eager=bool(task_config.get("enforce_eager", False)),
```
Contributor


P1 bool("False") evaluates to True, silently enabling eager mode

task_config values can arrive as strings when the dict is populated from YAML or environment variables. bool("False") returns True in Python because any non-empty string is truthy, so a user setting enforce_eager: false in YAML (parsed as the string "False") would silently get enforce_eager=True, disabling CUDA graphs and degrading throughput with no error.

Suggested change

```diff
-            enforce_eager=bool(task_config.get("enforce_eager", False)),
+            enforce_eager=task_config.get("enforce_eager", False) is True or str(task_config.get("enforce_eager", "false")).lower() in ("1", "true", "yes"),
```

Or use a small helper:

```python
def _to_bool(v, default=False):
    if isinstance(v, bool): return v
    if v is None: return default
    return str(v).strip().lower() in ("1", "true", "yes")
```
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/text_embed/processor.py
Line: 108

Comment:
**`bool("False")` evaluates to `True`, silently enabling eager mode**

`task_config` values can arrive as strings when the dict is populated from YAML or environment variables. `bool("False")` returns `True` in Python because any non-empty string is truthy, so a user setting `enforce_eager: false` in YAML (parsed as the string `"False"`) would silently get `enforce_eager=True`, disabling CUDA graphs and degrading throughput with no error.

```suggestion
            enforce_eager=task_config.get("enforce_eager", False) is True or str(task_config.get("enforce_eager", "false")).lower() in ("1", "true", "yes"),
```

Or use a small helper:

```python
def _to_bool(v, default=False):
    if isinstance(v, bool): return v
    if v is None: return default
    return str(v).strip().lower() in ("1", "true", "yes")
```

How can I resolve this? If you propose a fix, please make it concise.

Comment thread nemo_retriever/src/nemo_retriever/model/__init__.py
Comment thread nemo_retriever/src/nemo_retriever/text_embed/vllm.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/text_embed/vllm.py Outdated
… compatibility

Drop compile_cache_dir plumbing and rely on vLLM default compile cache behavior.

Keep PoolerConfig construction compatible across vLLM versions by avoiding unsupported normalize argument and adding fallbacks.
Prune unused embed_via_vllm and stop exporting it from text_embed.vllm.
Remove the separate LlamaNemotronEmbed1BV2VLLMEmbedder class and move the
vLLM pooling path into the canonical embedder type. create_local_embedder
now instantiates that class and forwards normalize and max_length for Ray
actors. Update factory tests and docs that referred to HuggingFace for the
default text embedding path.

Made-with: Cursor
LlamaNemotronEmbed1BV2Embedder never consumed device; vLLM follows process
placement. create_local_embedder still accepts device for the VL HF path only.
Strip device in TextEmbedGPUActor, stop forwarding it from graph embed actor
and Retriever local embed construction, and adjust factory tests.

Made-with: Cursor
Expose local_query_embed_backend on Retriever and RecallConfig so recall
can use HuggingFace mean-pooled query vectors while ingestion keeps the
existing vLLM text embed path. Add LlamaNemotronEmbed1BV2HFEmbedder with
query:/passage: prefixes and per-batch tokenization aligned with vLLM
micro-batching. Wire through graph_pipeline, BEIR, retrieve_and_score,
and vdb_recall Typer commands. Extend unit tests for Retriever routing and
BEIR kwargs.

Made-with: Cursor
Restore optional device= for API compatibility; emit DeprecationWarning when
set and document vLLM placement via CUDA_VISIBLE_DEVICES / tensor_parallel_size.
Forward device from create_local_embedder for non-VL models. Extend factory
tests and add a deprecation test that loads the real embedder module.

Made-with: Cursor
@jperez999
Collaborator

Might need to coordinate with @edknv and the VL reranker PR

Resolve conflicts between local query embed backend (PR) and multimodal
VL reranker (NVIDIA#1836): keep VL_EMBED_MODEL/VL_RERANK_MODEL exports and
TYPE_CHECKING imports with Any; pass both local_query_embed_backend and
embed_modality into RecallConfig; wire Retriever with local_query_embed_backend
plus reranker bool and reranker_model_name.

Made-with: Cursor
charlesbluca added a commit to charlesbluca/NeMo-Retriever that referenced this pull request Apr 19, 2026
Wire reranker configuration through the full harness → graph_pipeline →
BeirConfig → Retriever chain, which was previously missing for BEIR
evaluation (rerank_modality was always "text" in BEIR mode).

Changes:
- recall/beir.py: add rerank_modality to BeirConfig, forward to Retriever
- examples/graph_pipeline.py: add --rerank-modality CLI flag
- harness/config.py: add reranker, reranker_model_name, rerank_modality
  fields to HarnessConfig with validation and HARNESS_* env var mappings
- harness/run.py: emit --reranker/--no-reranker/--rerank-modality in
  _build_command; capture reranker fields in results.json test_config
- harness/test_configs.yaml: add bo767_baseline, bo767_vl_text,
  bo767_vl_text_image, bo767_vl_text_image_reranked dataset configs
- harness/vllm_bo767_sweep.yaml: new sweep for PR NVIDIA#1494 validation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>
charlesbluca added a commit to charlesbluca/NeMo-Retriever that referenced this pull request Apr 19, 2026
… for vLLM/HF matrix

Adds `embed_local_ingest_backend` field (vllm|hf) so HF and vLLM ingest backends
can be selected per dataset config or sweep override, enabling the HF/HF vs vLLM/HF
vs vLLM/vLLM experiment matrix for PR NVIDIA#1494 validation.

- EmbedParams.local_ingest_backend wires through build_embed_kwargs →
  _BatchEmbedActor to select HF or vLLM at ingest time
- graph_pipeline: adds --embed-local-ingest-backend CLI flag with validation
- HarnessConfig: new field + env var + validation
- Sweep YAML: supports top-level overrides: dict merged with per-run overrides
  (fixes load_nightly_config dropping the overrides key)
- test_configs.yaml: bo767_text_hf (HF/HF) + bo767_text_vllm_hf (vLLM ingest/HF query)
- vllm_bo767_sweep.yaml: embed_workers:1 override (prevents multi-GPU spill with vLLM),
  bo767_text_vllm_hf run, embed_batch_size:16 for HF path (avoids OOM at seq_len=8192)
- harness-run.md: Step 3a.5 config sanity check, updated OOM cause table

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>
…p configs

- graph_pipeline._resolve_file_patterns: use **/<ext> pattern with
  recursive=True so datasets with a corpus/ subdir are found
- test_configs.yaml: add jp20_baseline, jp20_text_hf, jp20_text_vllm_hf,
  jp20_vl_text, jp20_vl_text_image, jp20_vl_text_image_reranked entries
- harness/jp20_vl_sweep.yaml: add JP20 vLLM validation sweep (6 runs)
- pyproject.toml: document Python.h requirement for vLLM torch inductor
- README.md: add uv python install 3.12 step to ensure headers for vLLM

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>
Comment on lines +39 to +66
```python
    def __post_init__(self) -> None:
        self._tokenizer = None
        self._model = None
        self._device = None

        from nemo_retriever.model import _DEFAULT_EMBED_MODEL
        from transformers import AutoModel, AutoTokenizer

        model_id = self.model_id or _DEFAULT_EMBED_MODEL
        dev = torch.device(self.device or ("cuda" if torch.cuda.is_available() else "cpu"))
        hf_cache_dir = configure_global_hf_cache_base(self.hf_cache_dir)
        _revision = get_hf_revision(model_id)
        self._tokenizer = AutoTokenizer.from_pretrained(
            model_id,
            revision=_revision,
            cache_dir=hf_cache_dir,
            trust_remote_code=True,
        )
        self._model = AutoModel.from_pretrained(
            model_id,
            revision=_revision,
            trust_remote_code=True,
            cache_dir=hf_cache_dir,
            torch_dtype=torch.bfloat16,
        )
        self._model = self._model.to(dev)
        self._model.eval()
        self._device = dev
```
Contributor


P1 Eager model load in __post_init__ violates model-lazy-loading rule

LlamaNemotronEmbed1BV2HFEmbedder.__post_init__ calls AutoTokenizer.from_pretrained and AutoModel.from_pretrained unconditionally at construction time, including a .to(dev) call that places weights on GPU immediately. The model-lazy-loading rule requires models to load on first use, not at construction. Any code that constructs this class (e.g. create_local_query_embedder(..., backend="hf")) will trigger a full HF download/load and GPU transfer before any embed call is made.

Defer loading to the first _embed_local call and guard with an _ensure_loaded() helper, or document (in the class docstring) that GPU allocation occurs at construction time as an acknowledged design trade-off.

Rule Used: Models must be loaded lazily (on first use or expl... (source)

Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_1b_v2_hf_embedder.py
Line: 39-66

Comment:
**Eager model load in `__post_init__` violates `model-lazy-loading` rule**

`LlamaNemotronEmbed1BV2HFEmbedder.__post_init__` calls `AutoTokenizer.from_pretrained` and `AutoModel.from_pretrained` unconditionally at construction time, including a `.to(dev)` call that places weights on GPU immediately. The `model-lazy-loading` rule requires models to load on first use, not at construction. Any code that constructs this class (e.g. `create_local_query_embedder(..., backend="hf")`) will trigger a full HF download/load and GPU transfer before any embed call is made.

Defer loading to the first `_embed_local` call and guard with an `_ensure_loaded()` helper, or document (in the class docstring) that GPU allocation occurs at construction time as an acknowledged design trade-off.

**Rule Used:** Models must be loaded lazily (on first use or expl... ([source](https://app.greptile.com/review/custom-context?memory=model-lazy-loading))

How can I resolve this? If you propose a fix, please make it concise.
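A minimal lazy-loading sketch along the lines the rule and reviewer suggest (the class name, `_ensure_loaded`, and the trimmed field set are illustrative, not the PR's actual code):

```python
from dataclasses import dataclass, field
from typing import Any, Optional, Sequence


@dataclass
class LazyHFEmbedder:
    model_id: Optional[str] = None
    device: Optional[str] = None
    _tokenizer: Any = field(default=None, init=False, repr=False)
    _model: Any = field(default=None, init=False, repr=False)

    def _ensure_loaded(self) -> None:
        if self._model is not None:
            return
        import torch
        from transformers import AutoModel, AutoTokenizer

        dev = torch.device(self.device or ("cuda" if torch.cuda.is_available() else "cpu"))
        self._tokenizer = AutoTokenizer.from_pretrained(self.model_id, trust_remote_code=True)
        self._model = (
            AutoModel.from_pretrained(self.model_id, trust_remote_code=True, torch_dtype=torch.bfloat16)
            .to(dev)
            .eval()
        )

    def embed(self, texts: Sequence[str], batch_size: int = 32) -> Any:
        self._ensure_loaded()  # weights load on first use; construction stays cheap
        ...  # tokenize, forward, mean-pool as in the eager snippet above (elided)
```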

charlesbluca and others added 4 commits April 21, 2026 18:26
Brings in vLLM 0.17 (torch 2.10, torchvision 0.25), Portal MCP server,
QA eval pipeline, BEIR eval updates, nemotron-ocr-v2, and other upstream
changes. Updates blocker comments now that vLLM 0.17 is available.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…edder

- Add limit_mm_per_prompt param to create_vllm_llm (off by default, no
  impact on text-only callers)
- Add embed_multimodal_with_vllm_llm helper for multimodal prompt dicts
- Implement embed_images and embed_text_image on the vLLM VL embedder
  using confirmed prompt format from the model card ("passage: <image> ")
- Bump flashinfer to 0.6.4 (required by vLLM 0.17)
- Add 19 unit tests covering the new helper, parameter forwarding, and
  both new embedder methods

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Brings in HF_HUB_OFFLINE=1 Ray worker fix and --reranker-invoke-url
addition to graph_pipeline.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Brings in pre-commit hook for uv lockfile regeneration and verification.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread nemo_retriever/src/nemo_retriever/text_embed/vllm.py Outdated
charlesbluca and others added 2 commits April 21, 2026 19:09
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment on lines +176 to +180
```python
    model_id: Optional[str] = None
    device: Optional[str] = None
    hf_cache_dir: Optional[str] = None
    gpu_memory_utilization: float = 0.45
    enforce_eager: bool = False
```
Contributor


P1 _llm not declared as a dataclass field

LlamaNemotronEmbedVL1BV2VLLMEmbedder assigns self._llm in __post_init__ but never declares it as a typed dataclass field. Compare with LlamaNemotronEmbed1BV2Embedder which has _llm: Any = field(default=None, init=False, repr=False). Without the field declaration there is no None sentinel before __post_init__ runs, so if anything in lines 188–215 raises before the assignment on line 210, a subsequent call to embed() / embed_queries() will raise an opaque AttributeError: 'LlamaNemotronEmbedVL1BV2VLLMEmbedder' object has no attribute '_llm' rather than propagating the original construction error.

Suggested change

```diff
     model_id: Optional[str] = None
     device: Optional[str] = None
     hf_cache_dir: Optional[str] = None
     gpu_memory_utilization: float = 0.45
     enforce_eager: bool = False
+
+    _llm: Any = field(default=None, init=False, repr=False)
```

Also requires adding `field` to the `from dataclasses import dataclass, field` import.

Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_vl_1b_v2_embedder.py
Line: 176-180

Comment:
**`_llm` not declared as a dataclass field**

`LlamaNemotronEmbedVL1BV2VLLMEmbedder` assigns `self._llm` in `__post_init__` but never declares it as a typed dataclass field. Compare with `LlamaNemotronEmbed1BV2Embedder` which has `_llm: Any = field(default=None, init=False, repr=False)`. Without the field declaration there is no `None` sentinel before `__post_init__` runs, so if anything in lines 188–215 raises before the assignment on line 210, a subsequent call to `embed()` / `embed_queries()` will raise an opaque `AttributeError: 'LlamaNemotronEmbedVL1BV2VLLMEmbedder' object has no attribute '_llm'` rather than propagating the original construction error.

```suggestion
    model_id: Optional[str] = None
    device: Optional[str] = None
    hf_cache_dir: Optional[str] = None
    gpu_memory_utilization: float = 0.45
    enforce_eager: bool = False

    _llm: Any = field(default=None, init=False, repr=False)
```

Also requires adding `field` to the `from dataclasses import dataclass, field` import.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 44 to 46
```python
            gpu_memory_utilization=float(self._kwargs.get("gpu_memory_utilization", 0.45)),
            enforce_eager=bool(self._kwargs.get("enforce_eager", False)),
        )
```
Contributor


P1 bool("False") silently enables eager mode

self._kwargs.get("enforce_eager", False) can return the string "False" when kwargs are populated from YAML or environment variables. bool("False") evaluates to True in Python because any non-empty string is truthy, so a user setting enforce_eager: false in config silently gets enforce_eager=True, disabling CUDA graphs and degrading throughput. The same pattern appears in processor.py line 108.

Or introduce a small shared helper — e.g. _to_bool(v, default=False) — and call it in both places.

Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/text_embed/gpu_operator.py
Line: 44-46

Comment:
**`bool("False")` silently enables eager mode**

`self._kwargs.get("enforce_eager", False)` can return the string `"False"` when kwargs are populated from YAML or environment variables. `bool("False")` evaluates to `True` in Python because any non-empty string is truthy, so a user setting `enforce_eager: false` in config silently gets `enforce_eager=True`, disabling CUDA graphs and degrading throughput. The same pattern appears in `processor.py` line 108.

Or introduce a small shared helper — e.g. `_to_bool(v, default=False)` — and call it in both places.

How can I resolve this? If you propose a fix, please make it concise.
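For illustration, the helper proposed in the earlier processor.py comment applied at both call sites the reviewer names (the surrounding constructor calls are elided; this is the reviewer's proposal, not code that exists in the PR):

```python
def _to_bool(v, default=False):
    # Coerce YAML/env string values ("false", "0", "yes", ...) to a real bool.
    if isinstance(v, bool):
        return v
    if v is None:
        return default
    return str(v).strip().lower() in ("1", "true", "yes")

# processor.py call site:
#     enforce_eager=_to_bool(task_config.get("enforce_eager"), default=False),
# gpu_operator.py call site:
#     enforce_eager=_to_bool(self._kwargs.get("enforce_eager"), default=False),
```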

charlesbluca and others added 3 commits April 21, 2026 19:32
- EmbedParams.local_ingest_backend controls vllm vs hf at ingest time for non-VL models
- _BatchEmbedActor dispatches to create_local_query_embedder(backend="hf") or
  create_local_embedder (vLLM) based on that field
- _embed_group: pass effective_batch_size (image-modality-capped) as local_batch_size
  instead of the raw inference_batch_size so the local vLLM path respects the cap
- embed_text_main_text_embed: clear CUDA cache on exception to release GPU memory
  before Ray retries the batch

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
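A hedged sketch of the cache-clearing behaviour this commit describes (the wrapper name is illustrative; in the PR the logic sits inside the embed actor rather than a free function):

```python
import torch


def embed_with_cleanup(embedder, texts, batch_size: int):
    try:
        return embedder.embed(texts, batch_size=batch_size)
    except Exception:
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # free cached allocator blocks before Ray re-runs the batch
        raise
```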
create_local_embedder now accepts backend="auto"|"hf"|"vllm" for VL models:
- auto/vllm → LlamaNemotronEmbedVL1BV2VLLMEmbedder (requires vLLM >= 0.17)
- hf         → LlamaNemotronEmbedVL1BV2Embedder

create_local_query_embedder delegates to create_local_embedder with the
requested backend for VL, matching non-VL behaviour.

Retriever._get_local_embedder is simplified to always call
create_local_query_embedder; backend_raw is now respected for VL models
(previously hardcoded to HF regardless of local_query_embed_backend).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ard HF_HUB_OFFLINE to Ray workers

graph_pipeline.py was missing the --embed-local-ingest-backend option that
run.py was already generating; wires the flag through to EmbedParams.

graph_ingestor.py initialized Ray before executor.py could run, bypassing
the HF_HUB_OFFLINE forwarding in executor.py; adds the env var directly
to graph_ingestor's ray_env_vars dict.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Charles Blackmon-Luca <charlesb@nvidia.com>
Signed-off-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>