feat: adds reporting for cost and latency optimization failures #180
Cursor Bugbot has reviewed your changes using default mode and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 365fa94. Configure here.
    """
    if not self._history or not self._options.judges:
        return False
    recent = self._history[-1]
GT optimizer incorrectly reports quality as passing
Medium Severity
_all_judges_passing only inspects self._history[-1] (the last entry), but in the ground-truth optimizer all N sample results from a failed attempt are extended into history before _generate_new_variation is called. If the last sample's judges happened to pass while an earlier sample's judges failed, this method incorrectly returns True. The variation prompt then tells the LLM to "preserve existing behavior and only reduce cost," preventing it from addressing the quality failures in other samples.
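One possible shape for a fix, as a minimal sketch rather than the package's actual code: check every sample from the most recent attempt instead of only the last history entry. `SampleResult`, `judge_passed`, and `attempt_size` are hypothetical names introduced for illustration; the real history entries and judge options may look different.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SampleResult:
    # Hypothetical stand-in for one history entry: judge name -> did it pass?
    judge_passed: Dict[str, bool] = field(default_factory=dict)


def all_judges_passing(
    history: List[SampleResult],
    judges: List[str],
    attempt_size: int = 1,
) -> bool:
    """Return True only if every judge passed on every sample of the latest attempt."""
    if not history or not judges:
        return False
    # Assumption: attempt_size is 1 in the standard flow and N for a
    # ground-truth batch whose N sample results were extended into history.
    recent = history[-max(attempt_size, 1):]
    return all(
        entry.judge_passed.get(name, False)  # a missing judge result counts as failing
        for entry in recent
        for name in judges
    )
```

With this shape, a ground-truth attempt whose first sample failed and whose last sample passed is reported as not passing, so the variation prompt would not switch to the cost-only instructions.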


Requirements
Describe the solution you've provided
This is intended to demystify some of the results we're receiving from the optimization package - namely:
score result so that it can be shown on the UI
Describe alternatives you've considered
This is in some ways a bug fix: it was not clear to the user what was causing the failure. Technically it adds functionality, but that functionality is likely required to make the failure information actionable for the user.
Additional context
Cost and latency are only optimized, and only produce scores, when the keywords that trigger their optimization are present. "Base" implementations that do not use these features are unaffected.
Note
Medium Risk
Changes optimization pass/fail logic and persisted result payloads (new gate scores, baseline handling, token-budget semantics), which could affect when runs succeed/fail and what the UI/API receives.
Overview
Improves optimization run reporting by tracking and persisting a single accumulated_token_usage total across agent, judge, and variation calls, and including it in result PATCH payloads (extending generationTokens to allow accumulated_total).
Refactors latency/cost optimization to use explicit baseline values (not history[0]), caps history growth (_trim_history) for both standard and ground-truth flows, and adds synthetic _latency_gate/_cost_gate score entries so gate failures are visible in results.
Adjusts run control flow so pass/fail is evaluated before token-limit checks (including GT batches and validation), and updates variation prompting to focus purely on cost reduction when quality is already passing; also relaxes the cost gate tolerance from 20% to 10% improvement and expands tests accordingly.
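To illustrate the gate-score idea, here is a rough sketch of a cost gate that compares against an explicit baseline and emits a synthetic score entry. Only the `_cost_gate` name and the use of an explicit baseline come from the overview above. The function name, the entry fields, and the reading of "10% improvement" as "cost must be at least 10% below baseline" are assumptions made for illustration.

```python
from typing import Dict, Optional


def cost_gate_score(
    candidate_cost: float,
    baseline_cost: Optional[float],
    min_improvement: float = 0.10,
) -> Optional[Dict[str, object]]:
    """Build a synthetic score entry for the cost gate, or None without a usable baseline."""
    if baseline_cost is None or baseline_cost <= 0:
        # No explicit baseline recorded, so the gate cannot be evaluated.
        return None
    # Assumed semantics: the candidate must cost at least `min_improvement`
    # less than the baseline in order to pass.
    threshold = baseline_cost * (1.0 - min_improvement)
    return {
        "name": "_cost_gate",  # synthetic entry, listed alongside judge scores
        "passed": candidate_cost <= threshold,
        "baseline": baseline_cost,
        "value": candidate_cost,
        "threshold": threshold,
    }
```

Recording the baseline and threshold in the entry is one way to let a UI explain why the gate failed without recomputing anything from history.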