[feat] Unify eval loops#4341
Draft
junaway wants to merge 36 commits into
Draft
Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
…l-loops Salvage the two docs with unique, unreplicated analysis (legacy loop-family nesting; pure-function extraction signatures) into the in-scope dir with frozen/superseded banners pointing to proposal.md/plan.md as canonical. Remove the four docs superseded by the current unified design. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(UEL-028)
Add agenta:custom:mock:v0 — a deterministic, LLM-free, sandbox-free workflow
that serves as both an application and an evaluator, selected by
parameters={"key", "kwargs"} (echo/static/pass/fail/score/error/delay). This
lets evaluation runs execute end-to-end through the real worker in acceptance
tests without an LLM or the code sandbox. Registered as a managed service
(handler/interface/role-table/mount). Also extend the LLM MOCKS registry with
echo/error selectors.
Fix UEL-028: batch (non-queue) source runs never finalized run status — they
processed every scenario to success but stayed running/is_active forever. The
severity floor in process_evaluation_source_slice ranked RUNNING above SUCCESS,
pinning runs at running. Reorder so terminal statuses outrank the transient ones
(FAILURE>ERRORS>SUCCESS>RUNNING>PENDING), reset status=RUNNING on every
(re)dispatch (so extended finished runs re-finalize correctly), clear is_active
on terminal, and persist status reliably in dao.edit_run (status.value +
flag_modified). Only batch testset/invocation finalize (update_run_status=True);
live/batch-query are untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… UEL-029 Unit: - test_run_flag_matrix.py — representative matrix for has_*/origin flag derivation and the simple-queue ">=1 human evaluator" rule. - test_mock_v0.py (sdk) — mock_v0 selector behaviors + validation. Acceptance flow tests (real worker, LLM-free via mock_v0): - test_evaluation_flows_run.py — run to completion across dispatch topologies: batch_testset, batch_invocation (success), live_query (stays running, guards the UEL-028 fix against live evals), batch_query (xfail UEL-029). - test_evaluation_flows_modify.py — restart a finished batch run (re-finalizes; validates UEL-028 extension case), archive/unarchive a default queue on a CLOSED run (UEL-012, asserts no 409). Docs/findings: - run-status-finalization.md — full analysis + rejected options for UEL-028. - findings.md — close UEL-012 (allow archive on closed run), UEL-016 (typed default-queue exceptions), UEL-028 (finalization); file UEL-029 (batch_query never finalizes, adjacent gap). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
batch_query runs (query -> evaluator, no app, not live) dispatched through process_query_source_run with update_run_status=False, so they never reached a terminal status — the same gap as UEL-028 on the query path. live_query shares the function and must keep running. Distinguish via the existing use_windowing flag (True=batch, False=live): set update_run_status=use_windowing. For the zero-traces batch case the slice processor rejects empty input, so finalize the run directly to SUCCESS with is_active cleared (no matching traces = complete, not failed). Flip the batch_query flow test from xfail to asserting success + is_active=false. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closing a run is meant to lock it, but every DAO mutation that raised EvaluationClosedConflict was swallowed by @suppress_exceptions before the router's @handle_evaluation_closed_exception could convert it — so closed-run edits silently returned 200/empty instead of 409. The lock was a no-op. Harden user-facing, keep the worker soft: - add EvaluationClosedConflict to the exclude list of the 20 user-facing DAO mutations that raise it (edit_run(s), create/edit/delete scenario(s)/result(s), edit/delete metrics, create/edit queue(s)); create_metrics already propagates. - add @handle_evaluation_closed_exception() to the create_runs/edit_runs/edit_run routes (previously undecorated -> 500 instead of 409). - make process_evaluation_source_slice tolerate the conflict on its finalization edit_run and per-item edit_scenario (try/except, log+skip) so a run closed mid-flight is treated as a lock, not a failure. Distinct from UEL-012, which deliberately leaves queue archive/unarchive unguarded on closed runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…UEL-030 test_default_queue_policy.py — data validation (default queue rejects user_ids/scenario_ids/step_keys/batch_size/batch_offset -> 422), demotion forbidden (-> 409), hard-delete forbidden (-> 409), promotion allowed. test_default_queue_lifecycle.py — the _reconcile_default_queue state machine via run create + step edits: required+missing -> create, not-required+active -> archive (is_queue flips False), required+archived -> unarchive, required+active -> no-op, non-human run -> no default queue. File UEL-030: the ux_evaluation_queues_default_per_run unique index is not enforced (absent from the DB, no code-level guard), so a second default queue can be created for one run. Uniqueness test xfailed against it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause was a duplicate alembic revision id: add_default_evaluation_queues
shared id a1b2c3d4e5f6 with drop_corrupted_metrics_for_some_runs, so alembic
silently skipped the index migration and ux_evaluation_queues_default_per_run
never existed in the DB.
- Rename the index migration to revision a1d2e3f4a5b6 and chain both new branch
migrations linearly after each environment's head (OSS after e6f7a8b9c0d1, EE
after b2c3d4e5f7a8); mirror them in api/ee. Both graphs resolve to a single
head a2b3c4d5e6f8.
- Correct the partial-index predicate to "(flags ->> 'is_default')::boolean =
true AND deleted_at IS NULL" in dbes.py and the migration, so it enforces one
ACTIVE default per run while allowing archive->recreate.
- Fix two json/jsonb type bugs in the backfill (data is a json column): cast
data::jsonb before jsonb_array_elements and insert '{}'::json.
- Rely on the index for enforcement: create_queue/create_queues already turn the
unique-violation IntegrityError into EntityCreationConflict (409). Remove the
prototyped pre-emptive SELECT guard (never worked, not race-safe).
- Flip test_second_default_queue_for_same_run_is_rejected from xfail to passing;
add recreate-after-archive and across-runs cases. Full suite: 16 passed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add acceptance coverage for POST /evaluations/metrics/refresh. The endpoint's
refresh_metrics dispatcher fans out per run/scenario/timestamp; these tests pin
the routing shape and the early-return ("nothing to refresh") paths that do not
require traces, evaluator schemas, or the worker:
- empty body (no run_id/run_ids) -> count 0
- unknown run_id / unknown run_ids loop -> count 0
- run with no metrics-bearing steps -> count 0
- scenario_ids and timestamps loop branches -> count 0
- run_ids takes precedence over run_id
7 passed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
UEL-030 was marked CLOSED but still sat at the top of "Open Findings". Relocate the full entry to the Closed Findings section so the section headers match each finding's status. No content changes to the entry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.