[feat] Unify eval loops by junaway · Pull Request #4341 · Agenta-AI/agenta

junaway · 2026-05-15T14:41:35Z

No description provided.

coderabbitai · 2026-05-15T14:41:42Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 1b935efe-a51c-4a4b-9364-85e5be778038

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/unified-eval-loops

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

vercel · 2026-05-15T14:41:42Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
agenta-documentation	Ready	Preview, Comment	May 21, 2026 7:02pm

Copilot

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

Copilot

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

Copilot

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

CLAassistant · 2026-05-20T13:22:23Z

All committers have signed the CLA.

Copilot

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

…l-loops Salvage the two docs with unique, unreplicated analysis (legacy loop-family nesting; pure-function extraction signatures) into the in-scope dir with frozen/superseded banners pointing to proposal.md/plan.md as canonical. Remove the four docs superseded by the current unified design. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

Copilot

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

…(UEL-028) Add agenta:custom:mock:v0 — a deterministic, LLM-free, sandbox-free workflow that serves as both an application and an evaluator, selected by parameters={"key", "kwargs"} (echo/static/pass/fail/score/error/delay). This lets evaluation runs execute end-to-end through the real worker in acceptance tests without an LLM or the code sandbox. Registered as a managed service (handler/interface/role-table/mount). Also extend the LLM MOCKS registry with echo/error selectors. Fix UEL-028: batch (non-queue) source runs never finalized run status — they processed every scenario to success but stayed running/is_active forever. The severity floor in process_evaluation_source_slice ranked RUNNING above SUCCESS, pinning runs at running. Reorder so terminal statuses outrank the transient ones (FAILURE>ERRORS>SUCCESS>RUNNING>PENDING), reset status=RUNNING on every (re)dispatch (so extended finished runs re-finalize correctly), clear is_active on terminal, and persist status reliably in dao.edit_run (status.value + flag_modified). Only batch testset/invocation finalize (update_run_status=True); live/batch-query are untouched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… UEL-029 Unit: - test_run_flag_matrix.py — representative matrix for has_*/origin flag derivation and the simple-queue ">=1 human evaluator" rule. - test_mock_v0.py (sdk) — mock_v0 selector behaviors + validation. Acceptance flow tests (real worker, LLM-free via mock_v0): - test_evaluation_flows_run.py — run to completion across dispatch topologies: batch_testset, batch_invocation (success), live_query (stays running, guards the UEL-028 fix against live evals), batch_query (xfail UEL-029). - test_evaluation_flows_modify.py — restart a finished batch run (re-finalizes; validates UEL-028 extension case), archive/unarchive a default queue on a CLOSED run (UEL-012, asserts no 409). Docs/findings: - run-status-finalization.md — full analysis + rejected options for UEL-028. - findings.md — close UEL-012 (allow archive on closed run), UEL-016 (typed default-queue exceptions), UEL-028 (finalization); file UEL-029 (batch_query never finalizes, adjacent gap). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

batch_query runs (query -> evaluator, no app, not live) dispatched through process_query_source_run with update_run_status=False, so they never reached a terminal status — the same gap as UEL-028 on the query path. live_query shares the function and must keep running. Distinguish via the existing use_windowing flag (True=batch, False=live): set update_run_status=use_windowing. For the zero-traces batch case the slice processor rejects empty input, so finalize the run directly to SUCCESS with is_active cleared (no matching traces = complete, not failed). Flip the batch_query flow test from xfail to asserting success + is_active=false. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Closing a run is meant to lock it, but every DAO mutation that raised EvaluationClosedConflict was swallowed by @suppress_exceptions before the router's @handle_evaluation_closed_exception could convert it — so closed-run edits silently returned 200/empty instead of 409. The lock was a no-op. Harden user-facing, keep the worker soft: - add EvaluationClosedConflict to the exclude list of the 20 user-facing DAO mutations that raise it (edit_run(s), create/edit/delete scenario(s)/result(s), edit/delete metrics, create/edit queue(s)); create_metrics already propagates. - add @handle_evaluation_closed_exception() to the create_runs/edit_runs/edit_run routes (previously undecorated -> 500 instead of 409). - make process_evaluation_source_slice tolerate the conflict on its finalization edit_run and per-item edit_scenario (try/except, log+skip) so a run closed mid-flight is treated as a lock, not a failure. Distinct from UEL-012, which deliberately leaves queue archive/unarchive unguarded on closed runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…UEL-030 test_default_queue_policy.py — data validation (default queue rejects user_ids/scenario_ids/step_keys/batch_size/batch_offset -> 422), demotion forbidden (-> 409), hard-delete forbidden (-> 409), promotion allowed. test_default_queue_lifecycle.py — the _reconcile_default_queue state machine via run create + step edits: required+missing -> create, not-required+active -> archive (is_queue flips False), required+archived -> unarchive, required+active -> no-op, non-human run -> no default queue. File UEL-030: the ux_evaluation_queues_default_per_run unique index is not enforced (absent from the DB, no code-level guard), so a second default queue can be created for one run. Uniqueness test xfailed against it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Root cause was a duplicate alembic revision id: add_default_evaluation_queues shared id a1b2c3d4e5f6 with drop_corrupted_metrics_for_some_runs, so alembic silently skipped the index migration and ux_evaluation_queues_default_per_run never existed in the DB. - Rename the index migration to revision a1d2e3f4a5b6 and chain both new branch migrations linearly after each environment's head (OSS after e6f7a8b9c0d1, EE after b2c3d4e5f7a8); mirror them in api/ee. Both graphs resolve to a single head a2b3c4d5e6f8. - Correct the partial-index predicate to "(flags ->> 'is_default')::boolean = true AND deleted_at IS NULL" in dbes.py and the migration, so it enforces one ACTIVE default per run while allowing archive->recreate. - Fix two json/jsonb type bugs in the backfill (data is a json column): cast data::jsonb before jsonb_array_elements and insert '{}'::json. - Rely on the index for enforcement: create_queue/create_queues already turn the unique-violation IntegrityError into EntityCreationConflict (409). Remove the prototyped pre-emptive SELECT guard (never worked, not race-safe). - Flip test_second_default_queue_for_same_run_is_rejected from xfail to passing; add recreate-after-archive and across-runs cases. Full suite: 16 passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add acceptance coverage for POST /evaluations/metrics/refresh. The endpoint's refresh_metrics dispatcher fans out per run/scenario/timestamp; these tests pin the routing shape and the early-return ("nothing to refresh") paths that do not require traces, evaluator schemas, or the worker: - empty body (no run_id/run_ids) -> count 0 - unknown run_id / unknown run_ids loop -> count 0 - run with no metrics-bearing steps -> count 0 - scenario_ids and timestamps loop branches -> count 0 - run_ids takes precedence over run_id 7 passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

UEL-030 was marked CLOSED but still sat at the top of "Open Findings". Relocate the full entry to the Closed Findings section so the section headers match each finding's status. No content changes to the entry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

jp-agenta added 9 commits April 24, 2026 11:47

initial design

a114ab3

initial implementation

747502d

initial review

2220e65

fixed basic evals

0c559d8

Add evaluation parallelization

3862dfe

Parallelization checks

b958574

quick engine fix

5b647ee

ongoing debug

c4d8368

intermediate design extensions

e049ef5

Copilot AI review requested due to automatic review settings May 15, 2026 14:41

junaway changed the title ~~[feat] Unified eval loops~~ [feat] Unify eval loops May 15, 2026

Copilot AI reviewed May 15, 2026

View reviewed changes

junaway changed the base branch from main to release/v0.99.9 May 15, 2026 14:41

Merge release/v0.99.9 and evals<>queues work

50d1d8e

vercel Bot deployed to Preview May 15, 2026 15:52 View deployment

evals<>queues implementation

603820f

Copilot AI review requested due to automatic review settings May 15, 2026 16:32

Copilot AI reviewed May 15, 2026

View reviewed changes

vercel Bot deployed to Preview May 15, 2026 16:33 View deployment

latest findings

3ebe721

vercel Bot deployed to Preview May 20, 2026 13:07 View deployment

Merge release/v0.100.1

f68e42a

Copilot AI review requested due to automatic review settings May 20, 2026 13:20

Copilot AI reviewed May 20, 2026

View reviewed changes

vercel Bot deployed to Preview May 20, 2026 13:22 View deployment

junaway changed the base branch from release/v0.99.9 to release/v0.100.1 May 20, 2026 13:22

Add missing sdk files

9b388bd

vercel Bot deployed to Preview May 21, 2026 11:56 View deployment

Merge branch 'release/v0.100.1' into feat/unified-eval-loops

d73fe65

Copilot AI review requested due to automatic review settings May 21, 2026 11:59

Copilot AI reviewed May 21, 2026

View reviewed changes

vercel Bot deployed to Preview May 21, 2026 12:01 View deployment

vercel Bot deployed to Preview May 21, 2026 13:29 View deployment

reconciliation added in edit + some fixing

a9d28b3

Copilot AI review requested due to automatic review settings May 21, 2026 14:25

Copilot AI reviewed May 21, 2026

View reviewed changes

vercel Bot deployed to Preview May 21, 2026 14:26 View deployment

fix simple queue creation and eval processing

0f91ea3

vercel Bot deployed to Preview May 21, 2026 15:35 View deployment

fix services tests

f64d299

Copilot AI review requested due to automatic review settings May 21, 2026 15:37

Copilot AI reviewed May 21, 2026

View reviewed changes

vercel Bot deployed to Preview May 21, 2026 15:38 View deployment

jp-agenta and others added 7 commits May 21, 2026 18:33

Fix domain exceptions and run (un)archival

3b3ac15

vercel Bot deployed to Preview May 21, 2026 18:55 View deployment

jp-agenta and others added 2 commits May 21, 2026 20:55

Copilot AI review requested due to automatic review settings May 21, 2026 19:00

Copilot AI reviewed May 21, 2026

View reviewed changes

vercel Bot deployed to Preview May 21, 2026 19:02 View deployment

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[feat] Unify eval loops#4341

[feat] Unify eval loops#4341
junaway wants to merge 36 commits into
release/v0.100.1from
feat/unified-eval-loops

junaway commented May 15, 2026

Uh oh!

coderabbitai Bot commented May 15, 2026 •

edited

Loading

Review skipped

Uh oh!

vercel Bot commented May 15, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

CLAassistant commented May 20, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

junaway commented May 15, 2026

Uh oh!

coderabbitai Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

vercel Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

CLAassistant commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

coderabbitai Bot commented May 15, 2026 •

edited

Loading

vercel Bot commented May 15, 2026 •

edited

Loading

CLAassistant commented May 20, 2026 •

edited

Loading