Skip to content

[feat] Unify eval loops#4341

Draft
junaway wants to merge 36 commits into
release/v0.100.1from
feat/unified-eval-loops
Draft

[feat] Unify eval loops#4341
junaway wants to merge 36 commits into
release/v0.100.1from
feat/unified-eval-loops

Conversation

@junaway
Copy link
Copy Markdown
Contributor

@junaway junaway commented May 15, 2026

No description provided.

Copilot AI review requested due to automatic review settings May 15, 2026 14:41
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 15, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 1b935efe-a51c-4a4b-9364-85e5be778038

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/unified-eval-loops

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@vercel
Copy link
Copy Markdown

vercel Bot commented May 15, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
agenta-documentation Ready Ready Preview, Comment May 21, 2026 7:02pm

Request Review

@junaway junaway changed the title [feat] Unified eval loops [feat] Unify eval loops May 15, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

@junaway junaway changed the base branch from main to release/v0.99.9 May 15, 2026 14:41
Copilot AI review requested due to automatic review settings May 15, 2026 16:32
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

Copilot AI review requested due to automatic review settings May 20, 2026 13:20
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented May 20, 2026

CLA assistant check
All committers have signed the CLA.

@junaway junaway changed the base branch from release/v0.99.9 to release/v0.100.1 May 20, 2026 13:22
Copilot AI review requested due to automatic review settings May 21, 2026 11:59
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

…l-loops

Salvage the two docs with unique, unreplicated analysis (legacy loop-family
nesting; pure-function extraction signatures) into the in-scope dir with
frozen/superseded banners pointing to proposal.md/plan.md as canonical.
Remove the four docs superseded by the current unified design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 21, 2026 14:25
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

Copilot AI review requested due to automatic review settings May 21, 2026 15:37
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

jp-agenta and others added 7 commits May 21, 2026 18:33
…(UEL-028)

Add agenta:custom:mock:v0 — a deterministic, LLM-free, sandbox-free workflow
that serves as both an application and an evaluator, selected by
parameters={"key", "kwargs"} (echo/static/pass/fail/score/error/delay). This
lets evaluation runs execute end-to-end through the real worker in acceptance
tests without an LLM or the code sandbox. Registered as a managed service
(handler/interface/role-table/mount). Also extend the LLM MOCKS registry with
echo/error selectors.

Fix UEL-028: batch (non-queue) source runs never finalized run status — they
processed every scenario to success but stayed running/is_active forever. The
severity floor in process_evaluation_source_slice ranked RUNNING above SUCCESS,
pinning runs at running. Reorder so terminal statuses outrank the transient ones
(FAILURE>ERRORS>SUCCESS>RUNNING>PENDING), reset status=RUNNING on every
(re)dispatch (so extended finished runs re-finalize correctly), clear is_active
on terminal, and persist status reliably in dao.edit_run (status.value +
flag_modified). Only batch testset/invocation finalize (update_run_status=True);
live/batch-query are untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… UEL-029

Unit:
- test_run_flag_matrix.py — representative matrix for has_*/origin flag
  derivation and the simple-queue ">=1 human evaluator" rule.
- test_mock_v0.py (sdk) — mock_v0 selector behaviors + validation.

Acceptance flow tests (real worker, LLM-free via mock_v0):
- test_evaluation_flows_run.py — run to completion across dispatch topologies:
  batch_testset, batch_invocation (success), live_query (stays running, guards
  the UEL-028 fix against live evals), batch_query (xfail UEL-029).
- test_evaluation_flows_modify.py — restart a finished batch run (re-finalizes;
  validates UEL-028 extension case), archive/unarchive a default queue on a
  CLOSED run (UEL-012, asserts no 409).

Docs/findings:
- run-status-finalization.md — full analysis + rejected options for UEL-028.
- findings.md — close UEL-012 (allow archive on closed run), UEL-016 (typed
  default-queue exceptions), UEL-028 (finalization); file UEL-029 (batch_query
  never finalizes, adjacent gap).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
batch_query runs (query -> evaluator, no app, not live) dispatched through
process_query_source_run with update_run_status=False, so they never reached a
terminal status — the same gap as UEL-028 on the query path. live_query shares
the function and must keep running.

Distinguish via the existing use_windowing flag (True=batch, False=live): set
update_run_status=use_windowing. For the zero-traces batch case the slice
processor rejects empty input, so finalize the run directly to SUCCESS with
is_active cleared (no matching traces = complete, not failed).

Flip the batch_query flow test from xfail to asserting success + is_active=false.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closing a run is meant to lock it, but every DAO mutation that raised
EvaluationClosedConflict was swallowed by @suppress_exceptions before the
router's @handle_evaluation_closed_exception could convert it — so closed-run
edits silently returned 200/empty instead of 409. The lock was a no-op.

Harden user-facing, keep the worker soft:
- add EvaluationClosedConflict to the exclude list of the 20 user-facing DAO
  mutations that raise it (edit_run(s), create/edit/delete scenario(s)/result(s),
  edit/delete metrics, create/edit queue(s)); create_metrics already propagates.
- add @handle_evaluation_closed_exception() to the create_runs/edit_runs/edit_run
  routes (previously undecorated -> 500 instead of 409).
- make process_evaluation_source_slice tolerate the conflict on its finalization
  edit_run and per-item edit_scenario (try/except, log+skip) so a run closed
  mid-flight is treated as a lock, not a failure.

Distinct from UEL-012, which deliberately leaves queue archive/unarchive
unguarded on closed runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…UEL-030

test_default_queue_policy.py — data validation (default queue rejects
user_ids/scenario_ids/step_keys/batch_size/batch_offset -> 422), demotion
forbidden (-> 409), hard-delete forbidden (-> 409), promotion allowed.

test_default_queue_lifecycle.py — the _reconcile_default_queue state machine
via run create + step edits: required+missing -> create, not-required+active ->
archive (is_queue flips False), required+archived -> unarchive, required+active
-> no-op, non-human run -> no default queue.

File UEL-030: the ux_evaluation_queues_default_per_run unique index is not
enforced (absent from the DB, no code-level guard), so a second default queue
can be created for one run. Uniqueness test xfailed against it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause was a duplicate alembic revision id: add_default_evaluation_queues
shared id a1b2c3d4e5f6 with drop_corrupted_metrics_for_some_runs, so alembic
silently skipped the index migration and ux_evaluation_queues_default_per_run
never existed in the DB.

- Rename the index migration to revision a1d2e3f4a5b6 and chain both new branch
  migrations linearly after each environment's head (OSS after e6f7a8b9c0d1, EE
  after b2c3d4e5f7a8); mirror them in api/ee. Both graphs resolve to a single
  head a2b3c4d5e6f8.
- Correct the partial-index predicate to "(flags ->> 'is_default')::boolean =
  true AND deleted_at IS NULL" in dbes.py and the migration, so it enforces one
  ACTIVE default per run while allowing archive->recreate.
- Fix two json/jsonb type bugs in the backfill (data is a json column): cast
  data::jsonb before jsonb_array_elements and insert '{}'::json.
- Rely on the index for enforcement: create_queue/create_queues already turn the
  unique-violation IntegrityError into EntityCreationConflict (409). Remove the
  prototyped pre-emptive SELECT guard (never worked, not race-safe).
- Flip test_second_default_queue_for_same_run_is_rejected from xfail to passing;
  add recreate-after-archive and across-runs cases. Full suite: 16 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jp-agenta and others added 2 commits May 21, 2026 20:55
Add acceptance coverage for POST /evaluations/metrics/refresh. The endpoint's
refresh_metrics dispatcher fans out per run/scenario/timestamp; these tests pin
the routing shape and the early-return ("nothing to refresh") paths that do not
require traces, evaluator schemas, or the worker:

- empty body (no run_id/run_ids) -> count 0
- unknown run_id / unknown run_ids loop -> count 0
- run with no metrics-bearing steps -> count 0
- scenario_ids and timestamps loop branches -> count 0
- run_ids takes precedence over run_id

7 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
UEL-030 was marked CLOSED but still sat at the top of "Open Findings".
Relocate the full entry to the Closed Findings section so the section
headers match each finding's status. No content changes to the entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 21, 2026 19:00
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants