
Support MIG UUID and respect CUDA_VISIBLE_DEVICES in nnUNetV2Runner #8716

Open
ytl0623 wants to merge 12 commits into Project-MONAI:dev from ytl0623:fix-issue-7497

Conversation

Contributor

@ytl0623 ytl0623 commented Jan 22, 2026

Fixes #7497

Description

This PR fixes two critical issues that arise when running nnUNetV2Runner in NVIDIA MIG (Multi-Instance GPU) environments or with a specific CUDA_VISIBLE_DEVICES configuration.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

Signed-off-by: ytl0623 <david89062388@gmail.com>
Contributor

coderabbitai bot commented Jan 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Updates nnUNetV2Runner to support MIG UUIDs and richer GPU specs by accepting int | str | tuple | list for gpu_id. train_single_model_command now returns (cmd: list[str], env: dict[str, str]) and validates empty GPU lists; it preserves an existing CUDA_VISIBLE_DEVICES when gpu_id resolves to 0, otherwise sets CUDA_VISIBLE_DEVICES for single/multi-GPU selections. Calling sites (train_single_model, train_parallel, train, predict) updated to unpack (cmd, env) and invoke subprocesses with the env; logging switched to shlex-joined command representations. Type hints and docstrings updated accordingly.
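The (cmd, env) contract described in the walkthrough can be sketched as follows. This is a minimal illustration only: the function name build_train_command, the "Dataset001" argument, and the flag spellings are assumptions for the sketch, not MONAI's actual implementation.

```python
import os


def build_train_command(config, fold, gpu_id, extra_args=()):
    """Sketch of a command builder returning (cmd: list[str], env: dict[str, str])."""
    env = dict(os.environ)
    if isinstance(gpu_id, (tuple, list)):
        if not gpu_id:
            raise ValueError(f"gpu_id must not be empty, got {gpu_id!r}")
        # multi-GPU selection: join indices/UUIDs into one env value
        env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_id)
        num_gpus = len(gpu_id)
    elif str(gpu_id) == "0" and "CUDA_VISIBLE_DEVICES" in os.environ:
        # gpu_id resolves to 0 and devices are already pinned: keep the caller's value
        num_gpus = 1
    else:
        env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # int index or MIG UUID string
        num_gpus = 1
    cmd = ["nnUNetv2_train", "Dataset001", str(config), str(fold), "-num_gpus", str(num_gpus)]
    cmd += list(extra_args)
    return cmd, env
```

A caller would then unpack the pair and launch the subprocess with the returned env, logging the command via shlex.join(cmd).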

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check: Passed. Title directly and clearly describes the main changes: MIG UUID support and CUDA_VISIBLE_DEVICES respect in nnUNetV2Runner.
  • Description check: Passed. Description identifies the issue (#7497), outlines the problems fixed, and completes the template with correct boxes checked; however, no docstring or test updates are marked despite code complexity.
  • Linked Issues check: Passed. Changes address both core requirements from #7497: accept MIG UUIDs via gpu_id (int|str|tuple|list support) and preserve existing CUDA_VISIBLE_DEVICES when gpu_id=0.
  • Out of Scope Changes check: Passed. All changes are focused on nnUNetV2Runner device handling, command construction, and environment variable management, directly aligned with #7497 requirements.
  • Docstring Coverage: Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@monai/apps/nnunet/nnunetv2_runner.py`:
- Around line 827-830: The current check uses gpu_id == 0 (int) so a string "0"
won’t match and will overwrite CUDA_VISIBLE_DEVICES; update the logic around
gpu_id, os.environ and logger.info in nnunetv2_runner.py to compare string
forms: check if "CUDA_VISIBLE_DEVICES" in os.environ and str(gpu_id) ==
os.environ.get("CUDA_VISIBLE_DEVICES") then log and do not overwrite, otherwise
set os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id); refer to the gpu_id
variable, logger.info call, and the CUDA_VISIBLE_DEVICES env handling
(consistent with train_single_model_command behavior).
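The suggested string-form comparison can be sketched like this. The helper name apply_cuda_visible_devices and the explicit environ parameter (used so the sketch is testable) are illustrative, not the MONAI API:

```python
def apply_cuda_visible_devices(gpu_id, environ):
    """Compare string forms so that both the int 0 and the string "0" are
    recognized; preserve a matching pre-set CUDA_VISIBLE_DEVICES."""
    existing = environ.get("CUDA_VISIBLE_DEVICES")
    if existing is not None and str(gpu_id) == existing:
        # already matches the caller's value: log and leave it untouched
        return environ
    environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return environ
```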
🧹 Nitpick comments (1)
monai/apps/nnunet/nnunetv2_runner.py (1)

791-791: Docstring should reflect new behavior.

The gpu_id parameter now accepts MIG UUIDs (strings) and respects existing CUDA_VISIBLE_DEVICES. Update the docstring at line 825 to document this.

Suggested docstring
-            gpu_id: which GPU to use for prediction.
+            gpu_id: GPU device index (int) or MIG UUID (str) for prediction.
+                If CUDA_VISIBLE_DEVICES is already set and gpu_id is 0, the existing
+                environment variable is preserved.

Also applies to: 825-825

ytl0623 and others added 2 commits January 22, 2026 14:22
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
monai/apps/nnunet/nnunetv2_runner.py (1)

531-565: Missing docstring and type hints.

Per coding guidelines, all definitions require Google-style docstrings documenting parameters and return values. This method also lacks type annotations.

Proposed signature and docstring
-    def train_single_model_command(self, config, fold, gpu_id, kwargs):
+    def train_single_model_command(
+        self, config: str, fold: int, gpu_id: int | str | tuple | list, kwargs: dict[str, Any]
+    ) -> str:
+        """
+        Build the shell command string for training a single nnU-Net model.
+
+        Args:
+            config: Configuration name (e.g., "3d_fullres").
+            fold: Cross-validation fold index (0-4).
+            gpu_id: Device selector—int, str (MIG UUID), or tuple/list for multi-GPU.
+            kwargs: Additional CLI arguments forwarded to nnUNetv2_train.
+
+        Returns:
+            Shell command string.
+        """
🤖 Fix all issues with AI agents
In `@monai/apps/nnunet/nnunetv2_runner.py`:
- Around line 537-544: The code path for gpu_id being a tuple/list doesn't
handle empty sequences, leaving device_setting and num_gpus in unintended
states; update the branch that checks isinstance(gpu_id, (tuple, list)) to
explicitly handle len(gpu_id) == 0 (e.g., raise a ValueError or TypeError with a
clear message including the received gpu_id), ensuring you reference and set
device_setting and num_gpus deterministically (or abort) instead of falling
through silently; keep the existing branches for len>1 and len==1 unchanged.
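The empty-sequence guard described above might look like this; resolve_device_setting is an illustrative name, not the method in nnunetv2_runner.py:

```python
def resolve_device_setting(gpu_id):
    """Map gpu_id to (device_setting, num_gpus), rejecting empty sequences."""
    if isinstance(gpu_id, (tuple, list)):
        if len(gpu_id) == 0:
            raise ValueError(f"gpu_id must contain at least one device, got {gpu_id!r}")
        if len(gpu_id) > 1:
            # multi-GPU: comma-joined device list
            return ",".join(str(g) for g in gpu_id), len(gpu_id)
        return str(gpu_id[0]), 1
    return str(gpu_id), 1  # single int index or MIG UUID string
```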
🧹 Nitpick comments (1)
monai/apps/nnunet/nnunetv2_runner.py (1)

553-557: Leading whitespace when device_setting is empty.

When the existing env var is reused, device_setting = "" produces " nnUNetv2_train ...".

Trim the prefix
-        cmd = (
-            f"{device_setting} nnUNetv2_train "
+        prefix = f"{device_setting} " if device_setting else ""
+        cmd = (
+            f"{prefix}nnUNetv2_train "

1. added docstring and type hints
2. trimmed leading whitespace when device_setting is empty
3. raised an error when gpu_id is an empty tuple/list instead of falling through silently

Signed-off-by: ytl0623 <david89062388@gmail.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@monai/apps/nnunet/nnunetv2_runner.py`:
- Around line 531-533: Add direct unit tests for the new
train_single_model_command method and the updated predict signature: write tests
that call train_single_model_command with representative gpu_id types (int, str,
tuple/list) and sample kwargs and assert the returned command string contains
the expected flags/arguments and correctly interpolates values; add unit tests
for predict that call the method with the new signature parameters (including
optional/edge cases) and assert it accepts the new parameters, invokes the
expected internal behavior (mocking subprocess/IO as needed) and returns/raises
the expected results; focus on exercising argument parsing, kwarg handling, and
error branches, and use mocking for external side effects so tests remain pure
unit tests.
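The requested tests could take the following shape. The stub below stands in for the real method so the sketch runs on its own; actual tests would exercise nnUNetV2Runner.train_single_model_command and predict instead, mocking subprocess calls:

```python
def train_single_model_command_stub(config, fold, gpu_id):
    """Stand-in for the real method, returning (cmd, env) like the PR does."""
    if isinstance(gpu_id, (tuple, list)):
        if not gpu_id:
            raise ValueError("gpu_id must not be empty")
        devices = ",".join(str(g) for g in gpu_id)
    else:
        devices = str(gpu_id)
    return ["nnUNetv2_train", str(config), str(fold)], {"CUDA_VISIBLE_DEVICES": devices}


def test_mig_uuid_string():
    cmd, env = train_single_model_command_stub("3d_fullres", 0, "MIG-GPU-1234")
    assert env["CUDA_VISIBLE_DEVICES"] == "MIG-GPU-1234"
    assert cmd[0] == "nnUNetv2_train"


def test_multi_gpu_tuple():
    _, env = train_single_model_command_stub("2d", 1, (0, 1))
    assert env["CUDA_VISIBLE_DEVICES"] == "0,1"


def test_empty_list_rejected():
    try:
        train_single_model_command_stub("2d", 0, [])
    except ValueError:
        return
    raise AssertionError("empty gpu_id should raise ValueError")
```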
🧹 Nitpick comments (1)
monai/apps/nnunet/nnunetv2_runner.py (1)

534-545: Docstring missing Raises section.

Per coding guidelines, document the ValueError raised when gpu_id is an empty tuple/list.

Suggested docstring addition
         Returns:
             Shell command string.
+
+        Raises:
+            ValueError: If gpu_id is an empty tuple or list.
         """

Signed-off-by: ytl0623 <david89062388@gmail.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@monai/apps/nnunet/nnunetv2_runner.py`:
- Around line 528-529: train_single_model_command currently builds and returns a
shell string that is passed to run_cmd (which expects list[str]) and then
executed with shell=True risking injection; change train_single_model_command to
build and return a list[str] (argv-style) instead of an f-string, move any
dynamic settings (e.g. kwargs, fold, trainer_class_name, config) into either
explicit CLI args in that list or into an env dict, then call run_cmd(cmd_list,
env=env_dict) without shell=True; update the caller in nnunetv2_runner.py to
pass the returned list and env to run_cmd and remove shell=True, and add a unit
test for train_single_model_command that asserts the returned value is a list of
strings with correct argument ordering and that environment variables are
returned/handled as expected (validate env contains expected keys like trainer
class / kwargs).
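The injection risk motivating this change can be shown with a hostile value: an interpolated shell string run with shell=True would execute both commands, while an argv list passes the argument through verbatim. The flag names below are assumptions for illustration:

```python
import shlex

trainer = "nnUNetTrainer; rm -rf /tmp/x"  # injected shell metacharacters
# shell-string form: with shell=True this would run a second command
unsafe = f"nnUNetv2_train 001 3d_fullres 0 -tr {trainer}"
# argv form: the hostile value stays a single argument
safe_cmd = ["nnUNetv2_train", "001", "3d_fullres", "0", "-tr", trainer]

# shlex.join quotes the argument, giving a safe, copy-pasteable log line
print(shlex.join(safe_cmd))
```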
♻️ Duplicate comments (1)
monai/apps/nnunet/nnunetv2_runner.py (1)

531-533: Add unit coverage for the new command builder and GPU handling.
Please add direct tests for train_single_model_command (int/str/tuple/list, MIG UUIDs, env preservation) and predict (gpu_id str/int paths). As per coding guidelines, ...

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
monai/apps/nnunet/nnunetv2_runner.py (1)

489-529: ⚠️ Potential issue | 🟡 Minor

Add str type hint to train_single_model and fix CUDA_VISIBLE_DEVICES docstring.

Line 489 signature omits str for MIG UUID support (present in train_single_model_command). Line 492 docstring incorrectly states the environment variable is always overridden—code preserves it when gpu_id=0 and already set. Also, line 544 Returns docstring misidentifies the return type as a string rather than a tuple.

📝 Proposed updates
-    def train_single_model(self, config: Any, fold: int, gpu_id: tuple | list | int = 0, **kwargs: Any) -> None:
+    def train_single_model(self, config: Any, fold: int, gpu_id: tuple | list | int | str = 0, **kwargs: Any) -> None:
         """
         Run the training on a single GPU with one specified configuration provided.
-        Note: this will override the environment variable `CUDA_VISIBLE_DEVICES`.
+        Note: if CUDA_VISIBLE_DEVICES is already set and gpu_id resolves to 0, the existing value is preserved;
+        otherwise it is set to gpu_id.
         
         Args:
             config: configuration that should be trained. Examples: "2d", "3d_fullres", "3d_lowres".
             fold: fold of the 5-fold cross-validation. Should be an int between 0 and 4.
-            gpu_id: an integer to select the device to use, or a tuple/list of GPU device indices used for multi-GPU
-                training (e.g., (0,1)). Default: 0.
+            gpu_id: an int, MIG UUID (str), or tuple/list of GPU indices for multi-GPU training (e.g., (0,1)). Default: 0.
         Returns:
-            Shell command string.
+            Tuple of (command list, environment dict) for the training process.
🤖 Fix all issues with AI agents
In `@monai/apps/nnunet/nnunetv2_runner.py`:
- Around line 534-545: The Returns section of the train_single_model_command
docstring is incorrect: the function signature and callsite expect a
tuple[list[str], dict[str, str]] but the docstring still says "Shell command
string." Update the Returns block in train_single_model_command to clearly state
it returns a tuple where the first element is a list of command string parts
(list[str]) and the second element is an environment dictionary (dict[str, str])
that should be used when running the command, matching the function signature
and how callers unpack the result.
- Around line 549-591: The call to train_single_model_command returns (cmd, env)
but the caller stored the tuple instead of unpacking it, causing later code that
expects a list of command strings (e.g., stage[device_id] used with ";
".join(stage[device_id])) to fail; fix by unpacking the return into cmd, env
when invoking train_single_model_command and append/extend only the cmd (not the
tuple) into stage[device_id] (or whatever collection holds commands), and ensure
the environment is actually passed to subprocess invocations (e.g.,
subprocess.run/launch) by supplying env=env so child processes inherit the
CUDA_VISIBLE_DEVICES and other settings.
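A caller-side sketch of this fix: each stage entry is an unpacked (cmd, env) pair rather than the raw tuple, so the "; ".join logging works on command strings. format_stage and stage_items are illustrative names, not MONAI internals:

```python
import shlex


def format_stage(stage_items):
    """Render a stage of (cmd, env) pairs as one copy-pasteable log line."""
    return "; ".join(shlex.join(cmd) for cmd, _env in stage_items)
```

Each (cmd, env) pair would then be launched with subprocess.run(cmd, env=env) so child processes inherit the CUDA_VISIBLE_DEVICES setting.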

…ES docstring.

2. Fix train_single_model_command return docstring.
3. Unpack tuple return from train_single_model_command and propagate environment to subprocesses.

Signed-off-by: ytl0623 <david89062388@gmail.com>
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
monai/apps/nnunet/nnunetv2_runner.py (1)

554-568: Consider validating gpu_id type in the else branch.

The else branch (line 566-568) is the fallback for non-str, non-tuple/list types — presumably int, but it would also silently accept float, bool, etc. A type guard could prevent surprises.

Optional guard
         else:
+            if not isinstance(gpu_id, int):
+                raise TypeError(f"gpu_id must be int, str, tuple, or list, got {type(gpu_id).__name__}")
             device_setting = str(gpu_id)
             num_gpus = 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@monai/apps/nnunet/nnunetv2_runner.py` around lines 554 - 568, The fallback
else currently accepts any non-str/non-tuple/list value for gpu_id which can
silently allow floats or bools; update the branch handling gpu_id to explicitly
accept integers only (and reject bools) by checking isinstance(gpu_id, int) and
not isinstance(gpu_id, bool), set device_setting = str(gpu_id) and num_gpus = 1
for valid ints, and raise a TypeError for any other types; refer to gpu_id,
device_setting and num_gpus in nnunetv2_runner.py when making this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@monai/apps/nnunet/nnunetv2_runner.py`:
- Around line 532-534: Add unit tests to cover train_single_model_command and
the updated predict behavior: specifically write tests for
train_single_model_command to validate handling of gpu_id when passed as a
single string (including MIG UUID string), as a tuple/list of ints/strings, and
to assert it rejects an empty list; add tests for predict to ensure it preserves
an existing CUDA_VISIBLE_DEVICES env var when gpu_id is None and correctly
sets/overrides it when gpu_id is provided; use the function names
train_single_model_command and predict as targets and include assertions for MIG
UUID parsing, empty-list rejection, and environment preservation/restoration.
- Line 536: The docstring line "Build the shell command string for training a
single nnU-Net model." is stale — update the method's docstring to accurately
describe the return value (a tuple[list[str], dict[str, str]]), e.g., "Build the
command list and environment dict for training a single nnU-Net model" and
mention the exact return type and structure; locate the docstring in
nnunetv2_runner.py for the method that assembles the training command (the
function that currently returns tuple[list[str], dict[str, str]]) and replace
the wording so it reflects it returns a command list and environment dict rather
than a shell string. Ensure the docstring return description matches the
function signature and types.

---

Nitpick comments:
In `@monai/apps/nnunet/nnunetv2_runner.py`:
- Around line 554-568: The fallback else currently accepts any
non-str/non-tuple/list value for gpu_id which can silently allow floats or
bools; update the branch handling gpu_id to explicitly accept integers only (and
reject bools) by checking isinstance(gpu_id, int) and not isinstance(gpu_id,
bool), set device_setting = str(gpu_id) and num_gpus = 1 for valid ints, and
raise a TypeError for any other types; refer to gpu_id, device_setting and
num_gpus in nnunetv2_runner.py when making this change.


Development

Successfully merging this pull request may close these issues.

nnUNetV2Runner cannot be run with NVIDIA MIG configuration
