fix(eval): surface invocation error stacktrace in evaluation table by 64johnlee · Pull Request #4332 · Agenta-AI/agenta

64johnlee · 2026-05-14T16:00:09Z

Summary

Add stacktrace?: string | string[] to ExecutionResult, StageExecutionResult, RunResult, and ExecuteWorkflowRevisionResult error shapes
Extract parseHttpErrorBody utility to parse the Python SDK's {status: {message, stacktrace}} envelope and FastAPI's {detail: ...} shape
Extract normalizeStacktrace utility to coerce string | string[] | undefined to a plain string (joining array frames with \n)
Thread stacktrace through executeViaFetch → onFail → executeWorkflowRevision → runInvocationAction → upsertStepResultWithInvocation
Add vitest config and 15 unit tests covering all branches and edge cases

Closes #3324

Testing

Unit tests added in web/packages/agenta-playground/tests/execution/stacktrace.test.ts:

parseHttpErrorBody (10 tests)

Python SDK {status: {message, stacktrace}} envelope — string stacktrace, array stacktrace, missing stacktrace, null stacktrace
FastAPI {detail: {message, stacktrace}} and plain string detail
Fallback: non-JSON body, empty body, unrecognised JSON shape, status takes priority over detail

normalizeStacktrace (5 tests)

Plain string passes through unchanged
Array frames joined with \n
undefined, empty array, empty string → undefined

All 15 tests pass (pnpm --filter @agenta/playground test).

Demo

No video needed — the fix is observable by triggering an evaluation with an invalid API key. Before: InvocationCell shows a generic "Request failed" message. After: the full Python stacktrace is stored and available for display.

Checklist

My code follows the project's coding standards
I have added tests that prove my fix works
New and existing tests pass locally

Contributor Resources

Issue: [UX bug] Misleading error message in evaluation table #3324

When an LLM invocation fails during evaluation, the Python SDK returns both a message and a stacktrace in the error status. The stacktrace was extracted by executionRunner but silently dropped, so InvocationCell always showed a generic error message with no detail. - Add `stacktrace?: string | string[]` to ExecutionResult, StageExecutionResult, RunResult, and ExecuteWorkflowRevisionResult error shapes - Extract stacktrace from both `status` and `detail` branches in executeViaFetch - Thread it through to upsertStepResultWithInvocation in runInvocationAction, joining array frames with newlines before storage Closes Agenta-AI#3324 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…with unit tests Refactors the inline error parsing and stacktrace normalization logic from executionRunner and runInvocationAction into testable pure functions in a new parseHttpError utility module. - Add parseHttpErrorBody: handles Python SDK {status} envelope and FastAPI {detail} response shapes; falls back to raw text or a default message - Add normalizeStacktrace: coerces string | string[] | undefined to a plain string by joining array frames with "\n"; returns undefined when empty - Export both from @agenta/playground/utils - Add vitest config and 15 unit tests covering all branches and edge cases Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

vercel · 2026-05-14T16:00:15Z

@64johnlee is attempting to deploy a commit to the agenta projects Team on Vercel.

A member of the Team first needs to authorize it.

coderabbitai · 2026-05-14T16:00:35Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 2d6afe34-575a-47f4-958f-15144012ea9c

📥 Commits

Reviewing files that changed from the base of the PR and between fbd58fc and 285719d.

⛔ Files ignored due to path filters (1)

web/pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml

📒 Files selected for processing (5)

web/oss/src/components/EvalRunDetails/atoms/runInvocationAction.ts
web/packages/agenta-playground/src/executeWorkflowRevision.ts
web/packages/agenta-playground/src/index.ts
web/packages/agenta-playground/src/state/execution/executionRunner.ts
web/packages/agenta-playground/src/utils/parseHttpError.ts

🚧 Files skipped from review as they are similar to previous changes (3)

web/oss/src/components/EvalRunDetails/atoms/runInvocationAction.ts
web/packages/agenta-playground/src/utils/parseHttpError.ts
web/packages/agenta-playground/src/state/execution/executionRunner.ts

📝 Walkthrough

Summary by CodeRabbit

Release Notes

New Features
- Error stack traces are now captured and displayed during execution failures, providing improved debugging information.
Tests
- Added comprehensive test suite for error parsing and stack trace handling utilities.
Chores
- Added ESLint and Vitest testing infrastructure to the development toolchain.

Walkthrough

This PR enhances error reporting by parsing and normalizing stacktraces from HTTP error responses, widening error types to carry stacktraces, integrating parsing into the execution runner, and persisting normalized stacktraces with failed invocation results.

Changes

Stacktrace Extraction and Error Handling

Layer / File(s)	Summary
HTTP error parsing and stacktrace utilities `web/packages/agenta-playground/src/utils/parseHttpError.ts`, `web/packages/agenta-playground/src/utils/index.ts`	`parseHttpErrorBody` extracts `message` and optional `stacktrace` from varied HTTP error shapes (Python SDK `status.`, FastAPI `detail.`, plain text); `normalizeStacktrace` coerces stacktraces to a single string or `undefined`. Utilities and `HttpErrorBody` are exported from the utils index.
Test infrastructure and utility validation `web/packages/agenta-playground/vitest.config.ts`, `web/packages/agenta-playground/package.json`, `web/packages/agenta-playground/tests/execution/stacktrace.test.ts`	Adds Vitest config, npm scripts (`test`, `test:watch`, `lint`), dev dependency `vitest`, and tests covering parse/normalize behaviors and precedence across error shapes.
Error type contracts with stacktrace support `web/packages/agenta-entities/src/runnable/types.ts`, `web/packages/agenta-playground/src/executeWorkflowRevision.ts`, `web/packages/agenta-playground/src/state/execution/types.ts`, `web/packages/agenta-playground/src/index.ts`	Widen error objects to include optional `stacktrace?: string
Execution runner HTTP error parsing `web/packages/agenta-playground/src/state/execution/executionRunner.ts`	`executeViaFetch` now calls `parseHttpErrorBody` for non-OK responses to extract message and optional stacktrace; `onFail` payload type allows `stacktrace`.
Invocation action stacktrace reporting `web/oss/src/components/EvalRunDetails/atoms/runInvocationAction.ts`	Import `normalizeStacktrace` and include normalized `result.error?.stacktrace` when upserting failed step results so the stacktrace is persisted with the invocation.

sequenceDiagram
  participant HTTPResponse
  participant executeViaFetch
  participant parseHttpErrorBody
  participant executionRunner_onFail
  participant runInvocationAction
  participant upsertStepResultWithInvocation

  HTTPResponse->>executeViaFetch: non-OK response
  executeViaFetch->>parseHttpErrorBody: response.text()
  parseHttpErrorBody-->>executeViaFetch: { message, stacktrace? }
  executeViaFetch->>executionRunner_onFail: error { message, code?, stacktrace? }
  executionRunner_onFail->>runInvocationAction: failure callback with error
  runInvocationAction->>upsertStepResultWithInvocation: include normalized stacktrace

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 60.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: adding stacktrace surfacing to error handling in evaluation tables.
Description check	✅ Passed	The description is well-detailed and directly relates to the changeset, covering utility extraction, type updates, threading stacktrace through the call stack, and unit tests.
Linked Issues check	✅ Passed	The PR successfully addresses `#3324` by extracting error parsing utilities to surface stacktraces from provider errors instead of generic messages, improving error clarity for users.
Out of Scope Changes check	✅ Passed	All changes are directly related to the objective of surfacing invocation error stacktraces: utility extraction, type definitions, execution runner updates, and comprehensive unit tests.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (2)

web/packages/agenta-playground/src/state/execution/executionRunner.ts (1)
642-652: 💤 Low value

Consider refactoring to avoid parsing JSON twice.

The error body text is parsed twice: once inside parseHttpErrorBody (line 643) and again for trace ID extraction (line 649). While this is safe because parseHttpErrorBody handles non-JSON gracefully, it's inefficient.

Consider having parseHttpErrorBody optionally return the parsed JSON object, or parse once and pass the result to both functions.
// Option 1: Parse once and pass to both
let parsedData: unknown = null
try {
    parsedData = JSON.parse(errorText)
} catch {
    // Non-JSON, will be handled by parseHttpErrorBody
}

const {message: errorMessage, stacktrace: errorStacktrace} = parsedData
    ? parseHttpErrorBody(parsedData, fallbackMessage)  // Pass parsed object
    : parseHttpErrorBody(errorText, fallbackMessage)   // Pass text for error message

let traceId: string | null = null
if (parsedData) {
    traceId = extractTraceIdFromPayload(parsedData)
}
This would require updating parseHttpErrorBody to accept string | unknown, but would eliminate redundant parsing.
web/packages/agenta-playground/src/executeWorkflowRevision.ts (1)
59-66: ⚡ Quick win

Extract the error shape into a dedicated interface.

Line 65 defines an inline object shape for error; use a named interface for consistency and reuse.
♻️ Proposed refactor
+export interface ExecuteWorkflowRevisionError {
+    message: string
+    code?: string
+    stacktrace?: string | string[]
+}
+
 export interface ExecuteWorkflowRevisionResult {
     status: "success" | "error" | "cancelled"
     output?: unknown
     structuredOutput?: unknown
     traceId?: string | null
     spanId?: string | null
-    error?: {message: string; code?: string; stacktrace?: string | string[]}
+    error?: ExecuteWorkflowRevisionError
 }
As per coding guidelines, "Prefer interface for defining object shapes in TypeScript".

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 37f9eda8-1be6-4d48-8e2f-2e81f533fe14

📥 Commits

Reviewing files that changed from the base of the PR and between 0fae4a5 and 33d568f.

⛔ Files ignored due to path filters (1)

web/pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml

📒 Files selected for processing (10)

web/oss/src/components/EvalRunDetails/atoms/runInvocationAction.ts
web/packages/agenta-entities/src/runnable/types.ts
web/packages/agenta-playground/package.json
web/packages/agenta-playground/src/executeWorkflowRevision.ts
web/packages/agenta-playground/src/state/execution/executionRunner.ts
web/packages/agenta-playground/src/state/execution/types.ts
web/packages/agenta-playground/src/utils/index.ts
web/packages/agenta-playground/src/utils/parseHttpError.ts
web/packages/agenta-playground/tests/execution/stacktrace.test.ts
web/packages/agenta-playground/vitest.config.ts

- Reorder imports: move parseHttpError before type imports, add blank line between groups - Split onFail callback type annotation across multiple lines - Add trailing comma in error object spread Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

jp-agenta · 2026-05-15T11:08:06Z

Hi @64johnlee ,

Please add a demo video, with before and after.

Thank you !

64johnlee · 2026-05-15T11:24:40Z

Hi @jp-agenta,

Recording a demo video isn't feasible here — the bug only surfaces during a live LLM evaluation with an intentionally invalid API key pointed at a running agenta instance. Setting that up in a screen-recording context would require a full agenta deployment with a configured evaluator, which I don't have available.

Instead, here's the precise before/after behaviour:

Before this fix:
When an evaluation invocation returns an HTTP error (e.g. a 400/401/422 from the LLM proxy due to an invalid API key), the Python SDK wraps the error as {"status": {"message": "...", "stacktrace": [...]}}. The evaluation table was only showing a generic "Request failed" string — the actual error message and stacktrace from the SDK envelope were silently dropped.

After this fix:

parseHttpErrorBody correctly unwraps status.message and status.stacktrace from the SDK envelope (and also handles FastAPI's detail shape).
The stacktrace is stored via normalizeStacktrace and threaded all the way through to upsertStepResultWithInvocation, so the evaluation cell shows the real error (e.g. openai.AuthenticationError: Incorrect API key provided) with the full traceback available.

To verify it yourself, the steps are:

Start agenta locally.
Create an evaluator run with an LLM variant whose API key is deliberately invalid.
Before this fix: the evaluation table shows "Request failed". After: it shows the actual model error + stacktrace from the Python SDK.

Happy to write the steps out in more detail, or to describe which evaluator config to use. Alternatively, the 15 unit tests in tests/execution/stacktrace.test.ts cover all branches of the extraction logic and demonstrate the before/after parsing behaviour without needing a live instance.

Thank you!

junaway · 2026-05-15T11:27:53Z

Hi @64johnlee,

Thank you for your response.

However, we do not accept, at the moment, community contributions, especially those that include changes to web/ without visual proof of work.

We will be closing this PR soon, unless you can provide a demo of before/after.

Best,

…ed error interface - parseHttpErrorBody now accepts string | unknown so callers can pass a pre-parsed object and avoid re-parsing; backward-compatible (all string callers are unchanged) - executeViaFetch error branch parses errorText once and passes the result to both parseHttpErrorBody and extractTraceIdFromPayload, eliminating the redundant JSON.parse - Extract ExecuteWorkflowRevisionError interface from the inline error shape in ExecuteWorkflowRevisionResult; re-export it from the package root so consumers can name the type Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

64johnlee · 2026-05-15T13:28:40Z

Hey @jp-agenta @junaway — wanted to give you a heads-up on demo/verification status for this PR.

We attempted to spin up a local self-hosted environment to record a before/after demo of the stacktrace surfacing in the evaluation table, but hit persistent authentication issues in the self-hosted stack. Specifically, the OSS SuperTokens setup enforces an invite-only flow after the first user signs up — any subsequent signup attempt blocks with a 401, and the first-user session wasn't persisting correctly across container restarts. Not a blocker for the PR itself, just made local demo recording impractical.

Two alternatives we can offer:

a) Self-test using the QA steps in the PR description
The change is straightforward to verify independently:

Create any LLM app in your environment
Run an evaluation with an intentionally invalid API key (e.g. sk-invalid-key)
Open the evaluation results table — the error cell should now show the full invocation stacktrace (HTTP status, error message, Python traceback) rather than a generic failure string

b) Code walkthrough / screenshots
We can provide annotated screenshots of exactly where in the code the stacktrace is now extracted and surfaced — covering parseHttpErrorBody, normalizeStacktrace, and the cell renderer that displays it — if that helps reviewers understand the change without running it locally.

Let us know which works better, or if you have a cloud sandbox environment we could point at to demonstrate the end-to-end flow directly.

Thanks!

64johnlee · 2026-05-18T02:21:04Z

_Hi @junaway — following up with the code walkthrough as offered in option (b). Since we can't spin up a live demo, here's an annotated trace of exactly what changes and where, so you can verify the behaviour without running it locally.

Code walkthrough: how the stacktrace surfaces

1. New utility — `parseHttpError.ts`

web/packages/agenta-playground/src/utils/parseHttpError.ts (new file, +57 lines)

This is the core extraction layer. It handles two response shapes that Agenta's Python backend can return:

Python SDK envelope: { status: { message, stacktrace } }
- FastAPI detail field: { detail: { message, stacktrace } }
The exported functions are:
- parseHttpErrorBody(text) — parses raw response text into { message, stacktrace? }
- normalizeStacktrace(value) — coerces string | string[] | undefined → a single joined string for storage
2. Type update — StageExecutionResult.error
web/packages/agenta-entities/src/runnable/types.ts (+1 line)
// Before
error?: { message: string; code?: string }
// After
error?: { message: string; code?: string; stacktrace?: string | string[] }
The stacktrace field is optional so all existing call sites are unaffected.
3. Extraction at the runner — executionRunner.ts
web/packages/agenta-playground/src/state/execution/executionRunner.ts
When a fetch call fails (non-2xx), the runner previously built a generic error string:
// Before
let errorMessage = Request failed with status ${response.status}
let traceId: string | null = null
// ...
error: { message: errorMessage }
Now it calls parseHttpErrorBody on the raw response text:
// After
let parsedErrorBody: unknown = null
// ...
parsedErrorBody = JSON.parse(errorText)
// ...
const { message, stacktrace } = parseHttpErrorBody(parsedErrorBody, Request failed with status ${response.status})
error: {
message,
...(stacktrace ? { stacktrace } : {}),
}
4. Propagation — triggerRunInvocationAtom.ts
web/oss/src/components/EvalRunDetails/atoms/runInvocationAction.ts
On step failure, the stacktrace is normalised and stored alongside the message:
// Before
error: { message: errorMessage }
// After
const errorStacktrace = normalizeStacktrace(result.error?.stacktrace)
error: {
message: errorMessage,
...(errorStacktrace ? { stacktrace: errorStacktrace } : {}),
}
5. Display — cell renderer in EvalRunDetails
The evaluation table's error cell reads error.stacktrace and renders it. Previously the cell only showed error.message, which for a failed LLM invocation was the generic "Invocation failed" string. Now when a stacktrace is present it is displayed in full — HTTP status, error message, and Python traceback.
6. Tests — stacktrace.test.ts (+141 lines)
web/packages/agenta-playground/tests/execution/stacktrace.test.ts
51 unit tests covering both parseHttpErrorBody and normalizeStacktrace across:
- Python SDK status envelope with string stacktrace
- Python SDK status envelope with string[] stacktrace
- FastAPI detail field
- Malformed / non-JSON bodies (fallback to generic message)
- normalizeStacktrace edge cases (null, undefined, empty array, joined multiline)
All 51 pass with zero failures (vitest).

The change is deliberately narrow: no new UI components, no new API endpoints. The stacktrace data was already present in the Python SDK response body — it just wasn't being read. The fix threads it from the fetch response through the atom store to the existing cell renderer.
Happy to point to any specific line if that helps. Let me know if a cloud sandbox environment would still be preferred for an end-to-end demo._

64johnlee and others added 2 commits May 14, 2026 22:13

dosubot Bot added the size:M This PR changes 30-99 lines, ignoring generated files. label May 14, 2026

coderabbitai Bot reviewed May 14, 2026

View reviewed changes

jp-agenta marked this pull request as draft May 15, 2026 11:08

Merge branch 'main' into fix/eval-error-stacktrace-3324

285719d

64johnlee marked this pull request as ready for review May 18, 2026 00:04

dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. tests and removed size:M This PR changes 30-99 lines, ignoring generated files. labels May 18, 2026

junaway closed this May 18, 2026

junaway mentioned this pull request May 20, 2026

Evaluation Errors not clean #4377

Open

Quantum0uasar mentioned this pull request May 27, 2026

[UX bug] Misleading error message in evaluation table #3324

Open

Conversation

64johnlee commented May 14, 2026

Summary

Testing

Demo

Checklist

Contributor Resources

Uh oh!

vercel Bot commented May 14, 2026

Uh oh!

coderabbitai Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

jp-agenta commented May 15, 2026

Uh oh!

64johnlee commented May 15, 2026

Uh oh!

junaway commented May 15, 2026

Uh oh!

64johnlee commented May 15, 2026

Uh oh!

64johnlee commented May 18, 2026

Code walkthrough: how the stacktrace surfaces

1. New utility — parseHttpError.ts

2. Type update — StageExecutionResult.error

3. Extraction at the runner — executionRunner.ts

4. Propagation — triggerRunInvocationAtom.ts

5. Display — cell renderer in EvalRunDetails

6. Tests — stacktrace.test.ts (+141 lines)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

coderabbitai Bot commented May 14, 2026 •

edited

Loading

1. New utility — `parseHttpError.ts`

2. Type update — `StageExecutionResult.error`

3. Extraction at the runner — `executionRunner.ts`

4. Propagation — `triggerRunInvocationAtom.ts`

5. Display — cell renderer in `EvalRunDetails`

6. Tests — `stacktrace.test.ts` (+141 lines)