Add exception propagation for deferred and async execution#6210
Add exception propagation for deferred and async execution#6210rostan-t merged 5 commits intoNVIDIA:mainfrom
Conversation
Greptile SummaryThis PR adds exception propagation for deferred and async execution in DALI's dynamic mode. When operators fail during asynchronous or deferred evaluation, the error was previously raised at the point of evaluation (e.g.,
Confidence Score: 4/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant User as User Code
participant Op as Operator.__call__
participant Inv as Invocation
participant Exc as _exceptions
participant Async as _AsyncExecutor
participant Worker as Worker Thread
User->>Op: call operator (e.g., ndd.resize)
Op->>Inv: Invocation.__init__(caller_depth=4)
Inv->>Exc: capture_stack(depth) [deferred/eager only]
Exc-->>Inv: CallStack (code objects + line numbers)
alt Eager Mode
Inv->>Async: submit(callable, call_stack)
Async->>Worker: queue task with call_stack
Worker->>Worker: _run_impl() → exception!
Worker-->>Async: store exception in _Future
User->>Inv: evaluate() / .cpu() / etc.
Inv->>Async: future.wait()
Async->>Exc: rethrow_exception(exc, call_stack, eager)
Exc->>Exc: _make_traceback(call_stack)
Exc-->>User: raise new_exc from original_with_synthetic_tb
else Deferred Mode
Note over Inv: No immediate execution
User->>Inv: evaluate() / .cpu() / etc.
Inv->>Inv: _run_impl() → exception!
Inv->>Exc: rethrow_exception(exc, call_stack, deferred)
Exc->>Exc: _make_traceback(call_stack)
Exc-->>User: raise new_exc from original_with_synthetic_tb
else Sync Mode
Inv->>Inv: _run_impl() [no stack capture]
Note over Inv: Natural traceback suffices
end
Last reviewed commit: baf619a |
Signed-off-by: Rostan Tabet <rtabet@nvidia.com>
87bbd68 to
530be68
Compare
|
!build |
|
CI MESSAGE: [43977528]: BUILD STARTED |
|
CI MESSAGE: [43977528]: BUILD FAILED |
Signed-off-by: Rostan Tabet <rtabet@nvidia.com>
| while frame is not None: | ||
| if limit is not None and n >= limit: | ||
| break | ||
| stack.append((frame.f_code, frame.f_lineno)) |
There was a problem hiding this comment.
As we are keeping the reference? to the frame's code, could this keep the frame alive?
I think that that's the whole reason for the FrameSummary objects, to remove any references and not prolong the lifetime in unintended way.
There was a problem hiding this comment.
We are only keeping a reference to the code object, which shouldn't be referencing the frame at all.
There was a problem hiding this comment.
>>> def foo():
... return 1 + 2
...
>>> import gc
>>> gc.get_referents(foo.__code__)
[]Signed-off-by: Rostan Tabet <rtabet@nvidia.com>
|
!build |
|
CI MESSAGE: [44144024]: BUILD STARTED |
| def test_ragged_batch_to_torch_error(device: str): | ||
| batch = ndd.batch([[1, 2, 3], [4, 5], [6]]) | ||
| with assert_raises(ValueError, glob="non-uniform"): | ||
| with assert_raises(ValueError, glob="non-uniform"), ndd.EvalMode.sync_cpu: |
There was a problem hiding this comment.
Why? Don't we have to synchronize anyway to get a torch tensor?
There was a problem hiding this comment.
Yes, we do. However, the exception happens before the synchronization: as_tensor calls batch_to_tensor and the exception is raised here, resulting in a AsynchronousExecutionError.
There was a problem hiding this comment.
baf619a makes the re-raised exception type the same as the initial one. We just need to remove the message check.
Signed-off-by: Rostan Tabet <rtabet@nvidia.com>
Signed-off-by: Rostan Tabet <rtabet@nvidia.com>
|
CI MESSAGE: [44144024]: BUILD PASSED |
|
!build |
|
CI MESSAGE: [44150629]: BUILD STARTED |
| def test_ragged_batch_to_torch_error(device: str): | ||
| batch = ndd.batch([[1, 2, 3], [4, 5], [6]]) | ||
| with assert_raises(ValueError, glob="non-uniform"): | ||
| with assert_raises(ValueError): |
There was a problem hiding this comment.
Doesn't the original message show up when the error is rethrown?
There was a problem hiding this comment.
assert_raises performs the check only on the error message of the rethrown exception.
The original message shows up but in the cause. It looks something like this:
Traceback (most recent call last):
<synthetic traceback>
ValueError: <intial message>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
<initial traceback>
ValueError: An error happened during asynchronous execution
|
CI MESSAGE: [44150629]: BUILD PASSED |
* Add exception propagation for deferred and async execution * Remove test_async.test_exception_propagation * Move the stack capture to the invocation. Fix it for the batch2tensor op * Add missing header in test_interop.py * Re-raise exceptions with the same type --------- Signed-off-by: Rostan Tabet <rtabet@nvidia.com>
Category:
New feature (non-breaking change which adds functionality)
Description:
When using asynchronous execution, exceptions are raised when evaluating rather than directly when calling functions. This makes it harder to trace back exception to their actual source.
This PR modified the raised exception to fix this issue. The stack trace is captured when calling an operator and used to create a new exception with a proper traceback when necessary.
Additional information:
Affected modules and functionalities:
Dynamic mode
Key points relevant for the review:
Tests:
Checklist
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: DALI-4599