
feat: adds reporting for cost and latency optimization failures#180

Open
andrewklatzke wants to merge 1 commit into aklatzke/AIC-2465/cost-optimization from aklatzke/AIC-2474/report-cost-latency-failures

Conversation


@andrewklatzke andrewklatzke commented May 7, 2026

Requirements

  • I have added test coverage for new or changed functionality
  • I have followed the repository's pull request submission guidelines
  • I have validated my changes against all supported platform versions

Describe the solution you've provided

This is intended to demystify some of the results we're receiving from the optimization package. Specifically:

  • Total token counts are now accrued and reported with each result, so we can see whether a user crosses the total allowed tokens threshold
  • If cost or latency is being optimized against, its score is now reported as an item in the score result so that it can be shown in the UI
  • Finally, if quality has already met the required threshold, the prompt now contains instructions to optimize only against cost (when cost is being optimized against)
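The third point could be sketched as follows. This is purely illustrative: the function, constant names, and prompt text are hypothetical, not the package's actual API.

```python
# Hypothetical sketch of the prompt switch described above; the real
# package's names and prompt wording will differ.
QUALITY_AND_COST_PROMPT = (
    "Improve the prompt to raise quality scores while reducing cost."
)
COST_ONLY_PROMPT = (
    "Quality already meets the required threshold. Preserve existing "
    "behavior and optimize only to reduce cost."
)

def build_variation_prompt(quality_passing: bool, optimizing_cost: bool) -> str:
    """Pick the instruction used to generate the next prompt variation."""
    if quality_passing and optimizing_cost:
        return COST_ONLY_PROMPT
    return QUALITY_AND_COST_PROMPT
```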

Describe alternatives you've considered

This is in some ways a bug fix, since it previously wasn't clear to the user what was causing a failure. It is technically additional feature work, but likely required to surface the information needed to make failures actionable for the user.

Additional context

Cost and latency are only optimized for (and only contribute scores) when the configuration triggers the keywords that enable them. "Base" implementations that don't use these features are unaffected.


Note

Medium Risk
Changes optimization pass/fail logic and persisted result payloads (new gate scores, baseline handling, token-budget semantics), which could affect when runs succeed/fail and what the UI/API receives.

Overview
Improves optimization run reporting by tracking and persisting a single accumulated_token_usage total across agent, judge, and variation calls, and including it in result PATCH payloads (extending generationTokens to allow accumulated_total).
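The token-accrual described above could look roughly like this. The class, method, and payload field names are assumptions for illustration, not the package's real implementation; only `accumulated_token_usage`, `generationTokens`, and `accumulated_total` come from the PR summary.

```python
# Hypothetical sketch of the accumulated-token tracking described above.
class TokenAccumulator:
    def __init__(self) -> None:
        self.accumulated_token_usage = 0

    def record(self, tokens_used: int) -> None:
        """Accrue tokens from an agent, judge, or variation call."""
        self.accumulated_token_usage += tokens_used

    def patch_payload(self) -> dict:
        """Token fields included in the result PATCH body (shape assumed)."""
        return {
            "generationTokens": {
                "accumulated_total": self.accumulated_token_usage,
            }
        }

usage = TokenAccumulator()
usage.record(1200)  # agent call
usage.record(300)   # judge call
usage.record(450)   # variation call
```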

Refactors latency/cost optimization to use explicit baseline values (not history[0]), caps history growth (_trim_history) for both standard and ground-truth flows, and adds synthetic _latency_gate/_cost_gate score entries so gate failures are visible in results.

Adjusts run control flow so pass/fail is evaluated before token-limit checks (including GT batches and validation), and updates variation prompting to focus purely on cost reduction when quality is already passing; also relaxes the cost gate tolerance from 20% to 10% improvement and expands tests accordingly.
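The relaxed cost gate might be checked along these lines. The 10% (previously 20%) figure is from the summary; the baseline comparison logic is an assumption for illustration.

```python
# Hypothetical cost-gate check against an explicit baseline (not history[0]).
REQUIRED_IMPROVEMENT = 0.10  # relaxed from 0.20 in this PR

def cost_gate_passes(baseline_cost: float, candidate_cost: float) -> bool:
    """Pass when the candidate improves on the baseline cost by at least
    the required fraction (sketch; the real comparison may differ)."""
    if baseline_cost <= 0:
        return False
    improvement = (baseline_cost - candidate_cost) / baseline_cost
    return improvement >= REQUIRED_IMPROVEMENT
```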

Reviewed by Cursor Bugbot for commit 365fa94. Bugbot is set up for automated code reviews on this repo. Configure here.

@andrewklatzke andrewklatzke requested a review from a team as a code owner May 7, 2026 22:07

@cursor Bot left a comment


Cursor Bugbot has reviewed your changes using default mode and found 1 potential issue.



```python
    """
    if not self._history or not self._options.judges:
        return False
    recent = self._history[-1]
```

GT optimizer incorrectly reports quality as passing

Medium Severity

_all_judges_passing only inspects self._history[-1] (the last entry), but in the ground-truth optimizer all N sample results from a failed attempt are extended into history before _generate_new_variation is called. If the last sample's judges happened to pass while an earlier sample's judges failed, this method incorrectly returns True. The variation prompt then tells the LLM to "preserve existing behavior and only reduce cost," preventing it from addressing the quality failures in other samples.


