Skip to content

Add fail_open parameter to span context managers#287

Closed
x wants to merge 2 commits intomainfrom
x/span-fail-open
Closed

Add fail_open parameter to span context managers#287
x wants to merge 2 commits intomainfrom
x/span-fail-open

Conversation

@x
Copy link
Contributor

@x x commented Mar 18, 2026

Summary

  • Adds fail_open: bool = False parameter to Trace.span(), AsyncTrace.span(), and TracingModule.span()
  • When fail_open=True, tracing setup/teardown errors are logged and suppressed — the span yields None and the caller continues uninterrupted
  • Body exceptions from the caller's code are never swallowed, regardless of fail_open
  • Default behavior (fail_open=False) is unchanged

Motivation

Wrapping async with trace.span(...) in a try/except to make tracing non-fatal is silently broken: @asynccontextmanager re-raises the caller's exception at the yield, so the outer except catches both tracing failures and business logic errors. The only correct workaround is manually decomposing __aenter__/__aexit__, which is ~20 lines of fragile boilerplate per call site. The fix has to live inside the generator, split across the yield boundary — it cannot be done from the outside.

Test plan

  • 16 new unit tests covering sync and async variants
    • Setup failure: suppressed when fail_open=True, propagated when False
    • Teardown failure: suppressed when fail_open=True, propagated when False
    • Body exception: always propagated, even when fail_open=True and end_span also fails
    • Happy path: unchanged behavior with and without fail_open
    • No trace ID: still yields None as before

Greptile Summary

This PR adds a fail_open: bool = False parameter to the Trace.span(), AsyncTrace.span(), and TracingModule.span() context managers. When enabled, tracing infrastructure failures (during span setup or teardown) are logged as warnings and suppressed rather than propagated, ensuring business logic is never interrupted by observability failures. The default behavior (fail_open=False) is completely unchanged.

The implementation correctly addresses the PR's stated motivation: wrapping an async with trace.span() in an outer try/except is unreliable because @asynccontextmanager re-raises body exceptions at the yield point, causing the outer handler to catch business-logic errors too. The fix is correctly placed inside the generator, split across the yield boundary.

Key implementation details:

  • Setup failure: Caught before yield, suppresses when fail_open=True, yields None to the caller.
  • Teardown failure: Caught in finally, suppresses when fail_open=True, warning-logged via exc_info=True.
  • Body exceptions: Never swallowed — body exceptions propagate in all cases, even when teardown also fails under fail_open=True.
  • 16 unit tests comprehensively cover the sync and async paths, but TracingModule.span() (which routes through Temporal activities) is not covered by the new tests.
  • A subtle Python behavior: when fail_open=True causes setup failure suppression and the caller's body also raises, the body exception will have the tracing SetupError as its implicit __context__, making tracebacks slightly misleading. This is cosmetic but worth noting.

Confidence Score: 4/5

  • Safe to merge; logic is correct and well-tested for the primary classes, with minor gaps in TracingModule test coverage and a cosmetic exception-chaining behaviour.
  • The core fail_open logic in all three context managers is correct and handles all documented edge cases (setup failure, teardown failure, body exception, combined body + teardown failure). The 16 unit tests validate the most important behavioural properties. Two minor issues prevent a 5: (1) TracingModule.span() is untested for fail_open despite having a different internal execution path (Temporal activities vs direct processor calls), and (2) when fail_open=True suppresses a setup failure and the caller's body also raises, Python implicitly chains the tracing SetupError as __context__ on the body exception, making stack traces slightly confusing.
  • tests/lib/core/tracing/test_span_fail_open.py — add coverage for TracingModule.span(fail_open=True).

Important Files Changed

Filename Overview
src/agentex/lib/core/tracing/trace.py Adds fail_open to Trace.span() and AsyncTrace.span(). Logic is correct for all documented scenarios; minor cosmetic concern with exception chaining when setup and body both fail under fail_open=True.
src/agentex/lib/adk/_modules/tracing.py Adds fail_open to TracingModule.span() with correct setup and teardown error handling. Implementation mirrors AsyncTrace.span() and correctly guards end_span with if span:.
tests/lib/core/tracing/test_span_fail_open.py 16 well-structured tests covering the critical properties (setup failure, teardown failure, body exception propagation, happy path, no-trace-id) for Trace and AsyncTrace. Coverage gap: TracingModule.span() is not tested.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Enter span context manager] --> B{trace_id present?}
    B -- No --> C[yield None\nno-op]
    B -- Yes --> D[call start_span]
    D -- success --> E[yield span to caller body]
    D -- raises --> F{fail_open?}
    F -- True --> G[log warning\nyield None\nreturn]
    F -- False --> H[re-raise setup exception]
    E -- body raises --> I[finally: call end_span]
    E -- body OK --> I
    I -- end_span raises --> J{fail_open?}
    J -- True --> K[log warning\nsuppress teardown error]
    J -- False --> L[re-raise teardown exception]
    K --> M{body raised?}
    M -- Yes --> N[body exception propagates]
    M -- No --> O[normal exit]
    I -- end_span OK --> M
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: tests/lib/core/tracing/test_span_fail_open.py
Line: 1-8

Comment:
**Missing test coverage for `TracingModule.span()`**

The PR adds `fail_open` to all three context managers — `Trace.span()`, `AsyncTrace.span()`, and `TracingModule.span()` — but the 16 new tests only exercise `Trace` and `AsyncTrace`. `TracingModule.span()` has meaningfully different internal logic: it routes through Temporal activities (`ActivityHelpers.execute_activity`) when running inside a workflow, versus `_tracing_service` otherwise. A setup failure in `start_span` (e.g., a Temporal activity timeout) and a teardown failure in `end_span` should both be covered to verify the `fail_open` path works correctly in that class too.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/agentex/lib/core/tracing/trace.py
Line: 162-169

Comment:
**Body exception gains unexpected `__context__` when setup also fails**

When `fail_open=True` and `start_span` raises (say, `SetupError`), the code enters the `except Exception:` handler and reaches `yield None`. If the caller's `with` body then raises (say, `ValueError`), Python's generator machinery throws `ValueError` into the generator at that `yield`. Because the `yield` is executing inside an active `except SetupError:` block, Python automatically sets `ValueError.__context__ = SetupError`.

The caller will see:

```
During handling of the above exception, another exception occurred:
...
ValueError: business logic error
```

This chains the tracing infrastructure failure onto the caller's unrelated business-logic exception, which could be confusing during debugging. The same pattern affects `AsyncTrace.span()` (line 331) and `TracingModule.span()` (line 141).

A simple mitigation is to suppress the implicit exception context at the `yield` site:

```python
except Exception:
    if fail_open:
        logger.warning(...)
        try:
            yield None
        except BaseException:
            raise  # body exception is clean; no __context__ from setup failure
        return
    raise
```

This isn't a correctness bug (the `ValueError` does propagate), but it is a UX issue for anyone reading tracebacks.

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: "Add tests for span f..."

x added 2 commits March 18, 2026 14:22
When fail_open=True, tracing errors during span setup or teardown are
logged and suppressed instead of propagated. This ensures that tracing
infrastructure failures never interrupt the caller's business logic.

Without this, callers who want fault-tolerant tracing are forced to
manually decompose the context manager protocol (__aenter__/__aexit__),
which is error-prone and can silently swallow unrelated exceptions from
the body of the with-block.
Tests cover both sync and async variants: setup failures, teardown
failures, body exception propagation, and the critical property that
fail_open never swallows exceptions from the caller's code.
@x x closed this Mar 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant