Add exception propagation for deferred and async execution #6210

Merged
rostan-t merged 5 commits into NVIDIA:main from rostan-t:ndd-better-exceptions
Feb 16, 2026

Conversation

@rostan-t
Collaborator

Category:

New feature (non-breaking change which adds functionality)

Description:

When using asynchronous execution, exceptions are raised when results are evaluated rather than when the functions are called. This makes it harder to trace exceptions back to their actual source.

This PR modifies the raised exception to fix this issue. The stack trace is captured when an operator is called and, when necessary, is used to create a new exception with a proper traceback.
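
To make the problem concrete, here is a minimal sketch in plain Python, not DALI code. `Deferred`, `build_pipeline`, and `consume` are hypothetical names used only for illustration. The failing operation is requested in `build_pipeline`, but the traceback of the resulting exception only contains the frames that were active during evaluation.

```python
import traceback


class Deferred:
    """Hypothetical stand-in for a lazily evaluated operator call."""

    def __init__(self, fn):
        self._fn = fn

    def evaluate(self):
        return self._fn()


def build_pipeline():
    # The faulty operation is requested here...
    return Deferred(lambda: 1 / 0)


def consume(op):
    # ...but the exception only surfaces here, during evaluation.
    return op.evaluate()


op = build_pipeline()
try:
    consume(op)
except ZeroDivisionError as exc:
    # Function names of the frames recorded in the natural traceback.
    names = [entry.name for entry in traceback.extract_tb(exc.__traceback__)]
```

The function that requested the operation, `build_pipeline`, does not appear in `names`. That missing call site is the information the captured stack is meant to restore.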

Additional information:

Affected modules and functionalities:

Dynamic mode

Key points relevant for the review:

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: DALI-4599

Comment thread dali/test/python/experimental_mode/test_exceptions.py Fixed
@greptile-apps
Contributor

greptile-apps Bot commented Feb 13, 2026

Greptile Summary

This PR adds exception propagation for deferred and async execution in DALI's dynamic mode. When operators fail during asynchronous or deferred evaluation, the error was previously raised at the point of evaluation (e.g., .evaluate()), making it hard to trace back to the original operator call. This PR captures the call stack at operator invocation time and uses it to construct a synthetic traceback that points back to the original call site.

  • Adds _exceptions.py with capture_stack() to capture frames at invocation time and rethrow_exception() to re-raise exceptions with a synthetic traceback pointing to the original call site, chained via raise ... from ...
  • Modifies Invocation.__init__ to capture the call stack when in deferred or eager mode (skipped for sync modes where the natural traceback is sufficient)
  • Threads the captured call stack through _AsyncExecutor.submit() and _Future so async exceptions also get the origin traceback
  • Adds caller_depth parameter to Invocation to accommodate different call depths from _op_builder.py (default 4) vs _batch2tensor.py (explicit 3)
  • Adds comprehensive tests in test_exceptions.py that verify exception chaining works correctly for both deferred and eager modes
  • Removes glob message matching from some existing assert_raises calls since the exception messages are now wrapped by the rethrow mechanism
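
The capture-and-rethrow mechanism summarized in the list above can be sketched as follows. This is an illustration under stated assumptions, not the contents of `_exceptions.py`: the name `capture_stack` comes from the summary, but its signature and body here are guesses, and `_Marker`, `_stub_entry`, `make_traceback`, and `rethrow_with_origin` are hypothetical helpers. Frames are synthesized by running a stub that raises at the recorded line, which is one known technique; the PR may build them differently.

```python
import sys
import types


class _Marker(Exception):
    """Hypothetical helper exception, raised only to obtain a frame."""


def capture_stack(depth=1, limit=None):
    """Record (code object, line number) pairs, innermost caller first."""
    frame = sys._getframe(depth)  # depth=1 is the caller of capture_stack
    stack = []
    while frame is not None:
        if limit is not None and len(stack) >= limit:
            break
        stack.append((frame.f_code, frame.f_lineno))
        frame = frame.f_back
    return stack


def _stub_entry(code, lineno):
    """Run a stub that raises at `lineno` and return its traceback entry."""
    source = "\n" * max((lineno or 1) - 1, 0) + "raise _Marker()"
    stub = compile(source, code.co_filename, "exec").replace(co_name=code.co_name)
    try:
        exec(stub, {"_Marker": _Marker})
    except _Marker as exc:
        # Entry 0 belongs to this function; entry 1 is the stub frame we want.
        return exc.__traceback__.tb_next


def make_traceback(stack):
    """Chain synthetic entries so that the outermost call comes first."""
    tb = None
    for code, lineno in stack:  # innermost first, so each step wraps outward
        entry = _stub_entry(code, lineno)
        tb = types.TracebackType(tb, entry.tb_frame, entry.tb_lasti, entry.tb_lineno)
    return tb


def rethrow_with_origin(exc, stack):
    """Raise a same-typed wrapper whose cause points at the captured call site."""
    wrapper = type(exc)("An error happened during asynchronous execution")
    raise wrapper from exc.with_traceback(make_traceback(stack))
```

Storing only code objects and line numbers keeps the capture cheap; the more expensive stub compilation happens only when an exception actually has to be re-raised. Note that `type(exc)(message)` assumes the exception type accepts a single message argument.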

Confidence Score: 4/5

  • This PR is well-structured and safe to merge with minor considerations around test assertion specificity.
  • The implementation is sound and well-guarded: stack capture is only done for modes that need it (deferred/eager), exception rethrow is gated by both _call_stack and _eval_mode being non-None, and the synthetic frame construction uses documented Python APIs. The exception recreation via type(old_exception)(message) is safe because DALI's C++ backend only surfaces standard Python exception types. New tests cover the main scenarios. Minor concern: some existing tests had their error message assertions weakened.
  • _exceptions.py and _invocation.py are the core logic files — pay attention to the interaction between _eval_mode assignment timing in apply_eval_policy and the rethrow guard in run().

Important Files Changed

Filename | Overview
dali/python/nvidia/dali/experimental/dynamic/_exceptions.py | New module providing call stack capture and synthetic traceback construction for re-raising exceptions with origin context. Implementation is sound — uses documented Python APIs for frame/traceback synthesis.
dali/python/nvidia/dali/experimental/dynamic/_invocation.py | Integrates call stack capture and exception rethrow into operator invocations. Stack is captured only for deferred/eager modes. Exception rethrow is guarded by both _call_stack and _eval_mode being non-None.
dali/python/nvidia/dali/experimental/dynamic/_async.py | Threads call stack through _Future and _AsyncExecutor.submit(). On future wait, exceptions are rethrown with the captured origin traceback. Clean integration with existing async machinery.
dali/python/nvidia/dali/experimental/dynamic/_batch2tensor.py | Adds explicit caller_depth=3 to Invocation constructor call, reflecting the shallower call stack for this pseudo-operator compared to the default depth of 4.
dali/test/python/experimental_mode/test_exceptions.py | New test file with an exception_tester decorator that verifies exceptions are properly propagated with correct cause chaining for both deferred and eager modes. Tests multiple error scenarios.
dali/test/python/experimental_mode/test_async.py | Removes test_exception_propagation which is now covered by the more comprehensive test_exceptions.py. Removes unused raises import.
dali/test/python/experimental_mode/test_interop.py | Adds missing license header, removes glob parameter from assert_raises since the error message may now be wrapped by the rethrow mechanism.
dali/test/python/experimental_mode/test_tensor.py | Removes glob parameter from assert_raises in test_batch_to_tensor_no_pad_error to accommodate the new exception wrapping behavior.

Sequence Diagram

sequenceDiagram
    participant User as User Code
    participant Op as Operator.__call__
    participant Inv as Invocation
    participant Exc as _exceptions
    participant Async as _AsyncExecutor
    participant Worker as Worker Thread

    User->>Op: call operator (e.g., ndd.resize)
    Op->>Inv: Invocation.__init__(caller_depth=4)
    Inv->>Exc: capture_stack(depth) [deferred/eager only]
    Exc-->>Inv: CallStack (code objects + line numbers)
    
    alt Eager Mode
        Inv->>Async: submit(callable, call_stack)
        Async->>Worker: queue task with call_stack
        Worker->>Worker: _run_impl() → exception!
        Worker-->>Async: store exception in _Future
        User->>Inv: evaluate() / .cpu() / etc.
        Inv->>Async: future.wait()
        Async->>Exc: rethrow_exception(exc, call_stack, eager)
        Exc->>Exc: _make_traceback(call_stack)
        Exc-->>User: raise new_exc from original_with_synthetic_tb
    else Deferred Mode
        Note over Inv: No immediate execution
        User->>Inv: evaluate() / .cpu() / etc.
        Inv->>Inv: _run_impl() → exception!
        Inv->>Exc: rethrow_exception(exc, call_stack, deferred)
        Exc->>Exc: _make_traceback(call_stack)
        Exc-->>User: raise new_exc from original_with_synthetic_tb
    else Sync Mode
        Inv->>Inv: _run_impl() [no stack capture]
        Note over Inv: Natural traceback suffices
    end

Last reviewed commit: baf619a
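
The eager-mode branch of the sequence diagram above can be approximated with the standard library. This sketch is not DALI's `_AsyncExecutor` or `_Future`: it is built on `concurrent.futures`, the names `OriginFuture`, `submit`, `failing_task`, and `user_code` are hypothetical, and for brevity it embeds the formatted submission stack in the message instead of building a synthetic traceback.

```python
import traceback
from concurrent.futures import ThreadPoolExecutor


class OriginFuture:
    """Hypothetical future that remembers where its task was submitted."""

    def __init__(self, future, origin):
        self._future = future
        self._origin = origin  # list of traceback.FrameSummary objects

    def wait(self):
        try:
            return self._future.result()
        except Exception as exc:
            where = "".join(traceback.format_list(self._origin))
            message = f"An error happened during asynchronous execution; submitted at:\n{where}"
            # Same exception type as the original, chained to it as the cause.
            raise type(exc)(message) from exc


def submit(executor, fn, *args):
    # Capture the stack on the submitting thread and drop this helper's own frame.
    origin = traceback.extract_stack()[:-1]
    return OriginFuture(executor.submit(fn, *args), origin)


def failing_task():
    raise ValueError("non-uniform shapes")  # illustrative message


def user_code(pool):
    # This is the call site that should be reported to the user.
    return submit(pool, failing_task)
```

Calling `wait()` on the future returned by `user_code` raises a `ValueError` whose message names `user_code` as the submission site, while the worker-side exception remains available as `__cause__`.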

Contributor

@greptile-apps greptile-apps Bot left a comment

5 files reviewed, 3 comments

Comment thread dali/python/nvidia/dali/experimental/dynamic/_invocation.py Outdated
Comment thread dali/python/nvidia/dali/experimental/dynamic/_exceptions.py Outdated
Comment thread dali/python/nvidia/dali/experimental/dynamic/_exceptions.py Outdated
Signed-off-by: Rostan Tabet <rtabet@nvidia.com>
@rostan-t rostan-t force-pushed the ndd-better-exceptions branch from 87bbd68 to 530be68 Compare February 13, 2026 15:03
Contributor

@greptile-apps greptile-apps Bot left a comment

5 files reviewed, 1 comment

Comment thread dali/python/nvidia/dali/experimental/dynamic/_invocation.py
@rostan-t
Collaborator Author

!build

@dali-automaton
Collaborator

CI MESSAGE: [43977528]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [43977528]: BUILD FAILED

Signed-off-by: Rostan Tabet <rtabet@nvidia.com>
while frame is not None:
    if limit is not None and n >= limit:
        break
    stack.append((frame.f_code, frame.f_lineno))
Contributor

Since we are keeping a reference to the frame's code object, could this keep the frame alive?
I think that is the whole reason for the FrameSummary objects: to drop any references and not prolong lifetimes in an unintended way.

Collaborator Author

We are only keeping a reference to the code object, which shouldn't be referencing the frame at all.

Collaborator Author

@rostan-t rostan-t Feb 16, 2026

>>> def foo():
...     return 1 + 2
...
>>> import gc
>>> gc.get_referents(foo.__code__)
[]
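
The difference between keeping a frame and keeping only its code object can also be observed through local-variable lifetimes. The following sketch assumes CPython; `Payload`, `keep_code_and_line`, and `keep_frame` are hypothetical names, and a weak reference is used to check whether a local object outlives the function call.

```python
import gc
import sys
import weakref


class Payload:
    """Hypothetical stand-in for a large local object, e.g. a tensor."""


def keep_code_and_line():
    local = Payload()
    probe = weakref.ref(local)
    # Only the code object and an int are stored, as in the snippet under review.
    entry = (sys._getframe().f_code, sys._getframe().f_lineno)
    return entry, probe


def keep_frame():
    local = Payload()
    probe = weakref.ref(local)
    # Storing the frame itself also stores its local variables.
    return sys._getframe(), probe


entry, code_probe = keep_code_and_line()
frame, frame_probe = keep_frame()
gc.collect()

local_survives_with_code = code_probe() is not None
local_survives_with_frame = frame_probe() is not None
```

On CPython, `local_survives_with_code` is expected to be `False` and `local_survives_with_frame` to be `True`: the stored `(code, lineno)` pair does not extend the lifetime of locals, whereas a stored frame does.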

Signed-off-by: Rostan Tabet <rtabet@nvidia.com>
@rostan-t
Collaborator Author

!build

@dali-automaton
Collaborator

CI MESSAGE: [44144024]: BUILD STARTED

Contributor

@greptile-apps greptile-apps Bot left a comment

8 files reviewed, 1 comment

Comment thread dali/python/nvidia/dali/experimental/dynamic/_exceptions.py
  def test_ragged_batch_to_torch_error(device: str):
      batch = ndd.batch([[1, 2, 3], [4, 5], [6]])
-     with assert_raises(ValueError, glob="non-uniform"):
+     with assert_raises(ValueError, glob="non-uniform"), ndd.EvalMode.sync_cpu:
Contributor

@mzient mzient Feb 16, 2026

Why? Don't we have to synchronize anyway to get a torch tensor?

Collaborator Author

@rostan-t rostan-t Feb 16, 2026

Yes, we do. However, the exception happens before the synchronization: as_tensor calls batch_to_tensor and the exception is raised here, resulting in an AsynchronousExecutionError.

Collaborator Author

baf619a makes the re-raised exception type the same as the initial one. We just need to remove the message check.

Signed-off-by: Rostan Tabet <rtabet@nvidia.com>
Signed-off-by: Rostan Tabet <rtabet@nvidia.com>
Contributor

@greptile-apps greptile-apps Bot left a comment

8 files reviewed, no comments

@dali-automaton
Collaborator

CI MESSAGE: [44144024]: BUILD PASSED

@rostan-t
Collaborator Author

!build

@dali-automaton
Collaborator

CI MESSAGE: [44150629]: BUILD STARTED

  def test_ragged_batch_to_torch_error(device: str):
      batch = ndd.batch([[1, 2, 3], [4, 5], [6]])
-     with assert_raises(ValueError, glob="non-uniform"):
+     with assert_raises(ValueError):
Contributor

Doesn't the original message show up when the error is rethrown?

Collaborator Author

assert_raises performs the check only on the error message of the rethrown exception.

The original message shows up but in the cause. It looks something like this:

Traceback (most recent call last):
<synthetic traceback>
ValueError: <initial message>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
<initial traceback>
ValueError: An error happened during asynchronous execution
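
The effect on message checks can be reproduced with plain exception chaining. This sketch does not use DALI or `assert_raises`; `rethrow` is a hypothetical helper, the original message is made up, and `fnmatchcase` stands in for a glob-style message check.

```python
from fnmatch import fnmatchcase


def rethrow(original):
    """Raise a same-typed wrapper with a generic message, chained to the original."""
    wrapper = type(original)("An error happened during asynchronous execution")
    raise wrapper from original


try:
    rethrow(ValueError("non-uniform shapes"))  # illustrative message
except ValueError as outer:
    # A glob applied to the re-raised exception sees only the wrapper message.
    outer_matches = fnmatchcase(str(outer), "*non-uniform*")
    # The original message is still reachable through the cause.
    cause_matches = fnmatchcase(str(outer.__cause__), "*non-uniform*")
```

Here `outer_matches` is `False` and `cause_matches` is `True`, which is why a check on the exception type alone still passes while a glob on the outer message does not.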

Contributor

👍

@dali-automaton
Collaborator

CI MESSAGE: [44150629]: BUILD PASSED

@rostan-t rostan-t merged commit c443bf9 into NVIDIA:main Feb 16, 2026
7 checks passed
@rostan-t rostan-t deleted the ndd-better-exceptions branch February 16, 2026 17:09
stiepan pushed a commit that referenced this pull request Feb 19, 2026
* Add exception propagation for deferred and async execution
* Remove test_async.test_exception_propagation
* Move the stack capture to the invocation. Fix it for the batch2tensor op
* Add missing header in test_interop.py
* Re-raise exceptions with the same type
---------

Signed-off-by: Rostan Tabet <rtabet@nvidia.com>

6 participants