Closed
Conversation
When fail_open=True, tracing errors during span setup or teardown are logged and suppressed instead of propagated. This ensures that tracing infrastructure failures never interrupt the caller's business logic. Without this, callers who want fault-tolerant tracing are forced to manually decompose the context manager protocol (__aenter__/__aexit__), which is error-prone and can silently swallow unrelated exceptions from the body of the with-block.
Tests cover both sync and async variants: setup failures, teardown failures, body exception propagation, and the critical property that fail_open never swallows exceptions from the caller's code.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
fail_open: bool = Falseparameter toTrace.span(),AsyncTrace.span(), andTracingModule.span()fail_open=True, tracing setup/teardown errors are logged and suppressed — the span yieldsNoneand the caller continues uninterruptedfail_openfail_open=False) is unchangedMotivation
Wrapping
async with trace.span(...)in atry/exceptto make tracing non-fatal is silently broken:@asynccontextmanagerre-raises the caller's exception at theyield, so the outerexceptcatches both tracing failures and business logic errors. The only correct workaround is manually decomposing__aenter__/__aexit__, which is ~20 lines of fragile boilerplate per call site. The fix has to live inside the generator, split across theyieldboundary — it cannot be done from the outside.Test plan
fail_open=True, propagated whenFalsefail_open=True, propagated whenFalsefail_open=Trueandend_spanalso failsfail_openNoneas beforeGreptile Summary
This PR adds a
fail_open: bool = Falseparameter to theTrace.span(),AsyncTrace.span(), andTracingModule.span()context managers. When enabled, tracing infrastructure failures (during span setup or teardown) are logged as warnings and suppressed rather than propagated, ensuring business logic is never interrupted by observability failures. The default behavior (fail_open=False) is completely unchanged.The implementation correctly addresses the PR's stated motivation: wrapping an
async with trace.span()in an outertry/exceptis unreliable because@asynccontextmanagerre-raises body exceptions at theyieldpoint, causing the outer handler to catch business-logic errors too. The fix is correctly placed inside the generator, split across theyieldboundary.Key implementation details:
yield, suppresses whenfail_open=True, yieldsNoneto the caller.finally, suppresses whenfail_open=True, warning-logged viaexc_info=True.fail_open=True.TracingModule.span()(which routes through Temporal activities) is not covered by the new tests.fail_open=Truecauses setup failure suppression and the caller's body also raises, the body exception will have the tracingSetupErroras its implicit__context__, making tracebacks slightly misleading. This is cosmetic but worth noting.Confidence Score: 4/5
TracingModuletest coverage and a cosmetic exception-chaining behaviour.fail_openlogic in all three context managers is correct and handles all documented edge cases (setup failure, teardown failure, body exception, combined body + teardown failure). The 16 unit tests validate the most important behavioural properties. Two minor issues prevent a 5: (1)TracingModule.span()is untested forfail_opendespite having a different internal execution path (Temporal activities vs direct processor calls), and (2) whenfail_open=Truesuppresses a setup failure and the caller's body also raises, Python implicitly chains the tracingSetupErroras__context__on the body exception, making stack traces slightly confusing.tests/lib/core/tracing/test_span_fail_open.py— add coverage forTracingModule.span(fail_open=True).Important Files Changed
fail_opentoTrace.span()andAsyncTrace.span(). Logic is correct for all documented scenarios; minor cosmetic concern with exception chaining when setup and body both fail underfail_open=True.fail_opentoTracingModule.span()with correct setup and teardown error handling. Implementation mirrorsAsyncTrace.span()and correctly guardsend_spanwithif span:.TraceandAsyncTrace. Coverage gap:TracingModule.span()is not tested.Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A[Enter span context manager] --> B{trace_id present?} B -- No --> C[yield None\nno-op] B -- Yes --> D[call start_span] D -- success --> E[yield span to caller body] D -- raises --> F{fail_open?} F -- True --> G[log warning\nyield None\nreturn] F -- False --> H[re-raise setup exception] E -- body raises --> I[finally: call end_span] E -- body OK --> I I -- end_span raises --> J{fail_open?} J -- True --> K[log warning\nsuppress teardown error] J -- False --> L[re-raise teardown exception] K --> M{body raised?} M -- Yes --> N[body exception propagates] M -- No --> O[normal exit] I -- end_span OK --> MPrompt To Fix All With AI
Last reviewed commit: "Add tests for span f..."