fix(eval): surface invocation error stacktrace in evaluation table #4332
64johnlee wants to merge 5 commits into
When an LLM invocation fails during evaluation, the Python SDK returns both a message and a stacktrace in the error status. The stacktrace was extracted by executionRunner but silently dropped, so InvocationCell always showed a generic error message with no detail.
- Add `stacktrace?: string | string[]` to ExecutionResult, StageExecutionResult, RunResult, and ExecuteWorkflowRevisionResult error shapes
- Extract stacktrace from both `status` and `detail` branches in executeViaFetch
- Thread it through to upsertStepResultWithInvocation in runInvocationAction, joining array frames with newlines before storage
Closes Agenta-AI#3324
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…with unit tests
Refactors the inline error parsing and stacktrace normalization logic from
executionRunner and runInvocationAction into testable pure functions in a
new parseHttpError utility module.
- Add parseHttpErrorBody: handles Python SDK {status} envelope and FastAPI
{detail} response shapes; falls back to raw text or a default message
- Add normalizeStacktrace: coerces string | string[] | undefined to a plain
string by joining array frames with "\n"; returns undefined when empty
- Export both from @agenta/playground/utils
- Add vitest config and 15 unit tests covering all branches and edge cases
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
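A minimal sketch of what the two helpers described in this commit could look like, assuming only the behaviour stated above: the `{status}` and `{detail}` shapes, a raw-text or default fallback, and array frames joined with "\n". The `isRecord`, `pickStacktrace`, and `fromEnvelope` helpers, the `ParsedHttpError` type, and the "Request failed" default are illustrative assumptions, not taken from the PR's actual `parseHttpError` module.

```typescript
// Hypothetical sketch; the real module in the PR may differ.
type Stacktrace = string | string[]

interface ParsedHttpError {
    message: string
    stacktrace?: Stacktrace
}

function isRecord(value: unknown): value is Record<string, unknown> {
    return typeof value === "object" && value !== null && !Array.isArray(value)
}

// Accept only a string or an array of strings; anything else (null, numbers) is dropped
function pickStacktrace(value: unknown): Stacktrace | undefined {
    if (typeof value === "string") return value
    if (Array.isArray(value) && value.every((frame) => typeof frame === "string")) {
        return value as string[]
    }
    return undefined
}

// Read message/stacktrace from the contents of a {status} or {detail} field
function fromEnvelope(envelope: unknown, fallback: string): ParsedHttpError | undefined {
    if (typeof envelope === "string" && envelope.length > 0) return {message: envelope}
    if (!isRecord(envelope)) return undefined
    const message = typeof envelope.message === "string" ? envelope.message : fallback
    const stacktrace = pickStacktrace(envelope.stacktrace)
    return stacktrace === undefined ? {message} : {message, stacktrace}
}

function parseHttpErrorBody(text: string, fallback = "Request failed"): ParsedHttpError {
    let data: unknown = undefined
    try {
        data = JSON.parse(text)
    } catch {
        // Non-JSON body: handled by the raw-text fallback below
    }
    if (isRecord(data)) {
        // status is checked first, so it takes priority over detail
        const parsed = fromEnvelope(data.status, fallback) ?? fromEnvelope(data.detail, fallback)
        if (parsed) return parsed
    }
    return {message: text.trim() || fallback}
}

function normalizeStacktrace(stacktrace: Stacktrace | undefined): string | undefined {
    if (stacktrace === undefined) return undefined
    const joined = Array.isArray(stacktrace) ? stacktrace.join("\n") : stacktrace
    return joined.length > 0 ? joined : undefined
}
```

With this sketch, `parseHttpErrorBody('{"detail":"bad key"}')` yields `{message: "bad key"}`, and `normalizeStacktrace(["f1", "f2"])` yields `"f1\nf2"`. Keeping both functions pure is what makes the branch-by-branch unit tests mentioned in the commit straightforward to write.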
@64johnlee is attempting to deploy a commit to the agenta projects Team on Vercel. A member of the Team first needs to authorize it.
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
⛔ Files ignored due to path filters (1)
📒 Files selected for processing (5)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 Walkthrough
This PR enhances error reporting by parsing and normalizing stacktraces from HTTP error responses, widening error types to carry stacktraces, integrating parsing into the execution runner, and persisting normalized stacktraces with failed invocation results.
Changes: Stacktrace Extraction and Error Handling
sequenceDiagram
participant HTTPResponse
participant executeViaFetch
participant parseHttpErrorBody
participant executionRunner_onFail
participant runInvocationAction
participant upsertStepResultWithInvocation
HTTPResponse->>executeViaFetch: non-OK response
executeViaFetch->>parseHttpErrorBody: response.text()
parseHttpErrorBody-->>executeViaFetch: { message, stacktrace? }
executeViaFetch->>executionRunner_onFail: error { message, code?, stacktrace? }
executionRunner_onFail->>runInvocationAction: failure callback with error
runInvocationAction->>upsertStepResultWithInvocation: include normalized stacktrace
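The hand-off in the diagram can be sketched roughly as follows. This is an illustration only: `reportHttpFailure`, `recordFailure`, `upsertFailure`, the in-memory `store`, and the `HTTP_<status>` code format are hypothetical stand-ins for `executeViaFetch`, `runInvocationAction`, and `upsertStepResultWithInvocation`, whose real signatures are not shown on this page.

```typescript
// Hypothetical stand-ins for the participants in the diagram above.
interface ExecutionError {
    message: string
    code?: string
    stacktrace?: string | string[]
}

interface StoredFailure {
    stepKey: string
    message: string
    stacktrace?: string
}

// In-memory substitute for the persistence layer
const store = new Map<string, StoredFailure>()

// Stand-in for upsertStepResultWithInvocation
function upsertFailure(result: StoredFailure): void {
    store.set(result.stepKey, result)
}

// Stand-in for runInvocationAction's failure path: normalize, then persist
function recordFailure(stepKey: string, error: ExecutionError): void {
    const joined = Array.isArray(error.stacktrace)
        ? error.stacktrace.join("\n")
        : error.stacktrace
    const stored: StoredFailure = {stepKey, message: error.message}
    if (joined) stored.stacktrace = joined // skip undefined and empty strings
    upsertFailure(stored)
}

// Stand-in for executeViaFetch's non-OK branch: forward the parsed error to onFail
function reportHttpFailure(
    statusCode: number,
    parsed: {message: string; stacktrace?: string | string[]},
    onFail: (error: ExecutionError) => void,
): void {
    onFail({...parsed, code: `HTTP_${statusCode}`}) // code format is an assumption
}

reportHttpFailure(500, {message: "boom", stacktrace: ["frame 1", "frame 2"]}, (error) =>
    recordFailure("step-1", error),
)
```

The point of the sketch is that the stacktrace stays in its original `string | string[]` form until the last step, where it is joined once before storage.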
🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
🧹 Nitpick comments (2)
web/packages/agenta-playground/src/state/execution/executionRunner.ts (1)
642-652: 💤 Low value. Consider refactoring to avoid parsing JSON twice.
The error body text is parsed twice: once inside `parseHttpErrorBody` (line 643) and again for trace ID extraction (line 649). While this is safe because `parseHttpErrorBody` handles non-JSON gracefully, it's inefficient.
Consider having `parseHttpErrorBody` optionally return the parsed JSON object, or parse once and pass the result to both functions.
```typescript
// Option 1: Parse once and pass to both
let parsedData: unknown = null
try {
    parsedData = JSON.parse(errorText)
} catch {
    // Non-JSON, will be handled by parseHttpErrorBody
}

const {message: errorMessage, stacktrace: errorStacktrace} = parsedData
    ? parseHttpErrorBody(parsedData, fallbackMessage) // Pass parsed object
    : parseHttpErrorBody(errorText, fallbackMessage) // Pass text for error message

let traceId: string | null = null
if (parsedData) {
    traceId = extractTraceIdFromPayload(parsedData)
}
```
This would require updating `parseHttpErrorBody` to accept `string | unknown`, but would eliminate redundant parsing.
web/packages/agenta-playground/src/executeWorkflowRevision.ts (1)
59-66: ⚡ Quick win. Extract the error shape into a dedicated interface.
Line 65 defines an inline object shape for `error`; use a named `interface` for consistency and reuse.
♻️ Proposed refactor
```diff
+export interface ExecuteWorkflowRevisionError {
+    message: string
+    code?: string
+    stacktrace?: string | string[]
+}
+
 export interface ExecuteWorkflowRevisionResult {
     status: "success" | "error" | "cancelled"
     output?: unknown
     structuredOutput?: unknown
     traceId?: string | null
     spanId?: string | null
-    error?: {message: string; code?: string; stacktrace?: string | string[]}
+    error?: ExecuteWorkflowRevisionError
 }
```
As per coding guidelines, "Prefer `interface` for defining object shapes in TypeScript".
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 37f9eda8-1be6-4d48-8e2f-2e81f533fe14
⛔ Files ignored due to path filters (1)
`web/pnpm-lock.yaml` is excluded by `!**/pnpm-lock.yaml`
📒 Files selected for processing (10)
- web/oss/src/components/EvalRunDetails/atoms/runInvocationAction.ts
- web/packages/agenta-entities/src/runnable/types.ts
- web/packages/agenta-playground/package.json
- web/packages/agenta-playground/src/executeWorkflowRevision.ts
- web/packages/agenta-playground/src/state/execution/executionRunner.ts
- web/packages/agenta-playground/src/state/execution/types.ts
- web/packages/agenta-playground/src/utils/index.ts
- web/packages/agenta-playground/src/utils/parseHttpError.ts
- web/packages/agenta-playground/tests/execution/stacktrace.test.ts
- web/packages/agenta-playground/vitest.config.ts
- Reorder imports: move parseHttpError before type imports, add blank line between groups
- Split onFail callback type annotation across multiple lines
- Add trailing comma in error object spread
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Hi @64johnlee, please add a demo video, with before and after. Thank you!
Hi @jp-agenta,
Recording a demo video isn't feasible here: the bug only surfaces during a live LLM evaluation with an intentionally invalid API key pointed at a running agenta instance. Setting that up in a screen-recording context would require a full agenta deployment with a configured evaluator, which I don't have available.
Instead, here's the precise before/after behaviour:
Before this fix:
After this fix:
To verify it yourself, the steps are:
Happy to write the steps out in more detail, or to describe which evaluator config to use. Alternatively, the 15 unit tests in
Thank you!
Hi @64johnlee,
Thank you for your response. However, we do not accept, at the moment, community contributions, especially those that include changes to
We will be closing this PR soon, unless you can provide a demo of before/after.
Best,
…ed error interface
- parseHttpErrorBody now accepts string | unknown so callers can pass a pre-parsed object and avoid re-parsing; backward-compatible (all string callers are unchanged)
- executeViaFetch error branch parses errorText once and passes the result to both parseHttpErrorBody and extractTraceIdFromPayload, eliminating the redundant JSON.parse
- Extract ExecuteWorkflowRevisionError interface from the inline error shape in ExecuteWorkflowRevisionResult; re-export it from the package root so consumers can name the type
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
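The parse-once pattern this commit describes might look roughly like the sketch below. It is not the PR's code: `parseBody`, `extractTraceId`, and `handleErrorText` are hypothetical names, the sketch handles only the `{status}` envelope for brevity, and the top-level `trace_id` field is an assumption made purely for illustration (the real `extractTraceIdFromPayload` is not shown on this page).

```typescript
interface ParsedError {
    message: string
    stacktrace?: string | string[]
}

function isRecord(value: unknown): value is Record<string, unknown> {
    return typeof value === "object" && value !== null && !Array.isArray(value)
}

// Accepts raw text or an already-parsed value. `string | unknown` reduces to
// `unknown` in TypeScript, so the parameter is simply typed `unknown`.
function parseBody(body: unknown, fallback: string): ParsedError {
    let data: unknown = body
    if (typeof body === "string") {
        try {
            data = JSON.parse(body)
        } catch {
            return {message: body || fallback} // non-JSON text or empty body
        }
    }
    const status = isRecord(data) ? data.status : undefined
    if (isRecord(status) && typeof status.message === "string") {
        const st = status.stacktrace
        const valid =
            typeof st === "string" ||
            (Array.isArray(st) && st.every((frame) => typeof frame === "string"))
        return valid
            ? {message: status.message, stacktrace: st as string | string[]}
            : {message: status.message}
    }
    return {message: fallback}
}

// Hypothetical stand-in: assumes the trace ID sits in a top-level `trace_id` field
function extractTraceId(payload: unknown): string | null {
    return isRecord(payload) && typeof payload.trace_id === "string" ? payload.trace_id : null
}

// Call site: JSON.parse runs once, and both consumers receive the same object
function handleErrorText(errorText: string, fallback = "Request failed") {
    let parsed: unknown = undefined
    try {
        parsed = JSON.parse(errorText)
    } catch {
        // Non-JSON body: parseBody receives the raw text instead
    }
    const error = parseBody(parsed === undefined ? errorText : parsed, fallback)
    const traceId = parsed === undefined ? null : extractTraceId(parsed)
    return {error, traceId}
}
```

Existing callers that pass a string keep working because the string branch still parses internally, which matches the backward-compatibility claim in the commit message.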
Hey @jp-agenta @junaway, wanted to give you a heads-up on demo/verification status for this PR.
We attempted to spin up a local self-hosted environment to record a before/after demo of the stacktrace surfacing in the evaluation table, but hit persistent authentication issues in the self-hosted stack. Specifically, the OSS SuperTokens setup enforces an invite-only flow after the first user signs up: any subsequent signup attempt blocks with a 401, and the first-user session wasn't persisting correctly across container restarts. Not a blocker for the PR itself, just made local demo recording impractical.
Two alternatives we can offer:
a) Self-test using the QA steps in the PR description
b) Code walkthrough / screenshots
Let us know which works better, or if you have a cloud sandbox environment we could point at to demonstrate the end-to-end flow directly. Thanks!
Hi @junaway, following up with the code walkthrough as offered in option (b). Since we can't spin up a live demo, here's an annotated trace of exactly what changes and where, so you can verify the behaviour without running it locally.
Code walkthrough: how the stacktrace surfaces
1. New utility:
Summary
- Add `stacktrace?: string | string[]` to `ExecutionResult`, `StageExecutionResult`, `RunResult`, and `ExecuteWorkflowRevisionResult` error shapes
- Add `parseHttpErrorBody` utility to parse the Python SDK's `{status: {message, stacktrace}}` envelope and FastAPI's `{detail: ...}` shape
- Add `normalizeStacktrace` utility to coerce `string | string[] | undefined` to a plain string (joining array frames with `\n`)
- Thread the stacktrace through `executeViaFetch` → `onFail` → `executeWorkflowRevision` → `runInvocationAction` → `upsertStepResultWithInvocation`

Closes #3324
Testing
Unit tests added in `web/packages/agenta-playground/tests/execution/stacktrace.test.ts`:

`parseHttpErrorBody` (10 tests)
- `{status: {message, stacktrace}}` envelope: string stacktrace, array stacktrace, missing stacktrace, null stacktrace
- `{detail: {message, stacktrace}}` and plain string detail
- `status` takes priority over `detail`

`normalizeStacktrace` (5 tests)
- array frames joined with `\n`
- `undefined`, empty array, empty string → `undefined`

All 15 tests pass (`pnpm --filter @agenta/playground test`).

Demo
No video needed — the fix is observable by triggering an evaluation with an invalid API key. Before: InvocationCell shows a generic "Request failed" message. After: the full Python stacktrace is stored and available for display.