feat(model-eval-ingest): sync promoted bench evals #3258
Code Review Summary
Status: No Issues Found | Recommendation: Merge
Executive Summary
The latest commit (
All Previous Issues Resolved
Files Reviewed (25 files)
Reviewed by claude-sonnet-4.6 · 315,716 tokens
Review guidance: REVIEW.md from base branch
Note that if/when this is merged, we will then need to:

  scale: 6,
  mode: 'number',
}),
avg_execution_ms: decimal('avg_execution_ms', { precision: 16, scale: 6, mode: 'number' }),
Do we need floating point precision for a value measured in ms?
No! Ugh. Fixing.
avg_cost_usd: decimal('avg_cost_usd', { precision: 14, scale: 8, mode: 'number' }),
avg_input_tokens: decimal('avg_input_tokens', { precision: 16, scale: 6, mode: 'number' }),
avg_output_tokens: decimal('avg_output_tokens', { precision: 16, scale: 6, mode: 'number' }),
avg_cache_read_tokens: decimal('avg_cache_read_tokens', {
  precision: 16,
  scale: 6,
  mode: 'number',
}),
I'd argue all of these should be integers
Yea, they are averages but we should just round. We don't really need this level of precision.
All integers now, with the exception of the microdollar cost, which is bigint: as a 32-bit integer, the limit would be an average cost of ~2,000 USD. Since that is only a factor of 100 off from results we have seen, I bumped it to be safe.
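For reference, the ~2,000 USD ceiling follows from the range of a 32-bit signed integer column. A minimal sketch of the arithmetic, assuming costs are stored as whole microdollars (1 USD = 1,000,000 microdollars); the helper names are hypothetical and are not taken from this PR:

```typescript
// Largest value a 32-bit signed Postgres `integer` column can hold.
const INT32_MAX = 2_147_483_647;
// Assumption: costs are stored as whole microdollars.
const MICRODOLLARS_PER_USD = 1_000_000;

// Ceiling on an average cost if it were stored in a 32-bit integer column.
const maxUsdInInt32 = INT32_MAX / MICRODOLLARS_PER_USD; // about 2147.48 USD

// Hypothetical helper: round a USD average to whole microdollars.
function usdToMicrodollars(usd: number): number {
  return Math.round(usd * MICRODOLLARS_PER_USD);
}

// Hypothetical helper: true when the value would not fit in a 32-bit integer column.
function overflowsInt32(microdollars: number): boolean {
  return microdollars > INT32_MAX;
}
```

An average of 2,500 USD would overflow the 32-bit column, while a bigint column holds it comfortably. If the bigint is read back as a JavaScript number, values stay exact up to Number.MAX_SAFE_INTEGER, which is far above any plausible average cost.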
@jrf0110 - ready for re-review, I think. Thanks for the review so far.
method: 'POST',
headers: {
  'Content-Type': 'application/json',
  'x-internal-api-key': INTERNAL_API_SECRET,
I thought you had moved to using cloudflare access service credentials? If so, you'd need to plumb through your client_id/secret and add the headers:
curl -H "CF-Access-Client-Id: <CLIENT_ID>" -H "CF-Access-Client-Secret: <CLIENT_SECRET>" https://app.example.com
@jrf0110 - I did for the rest of it. This is authentication between the Vercel app and the Worker, used to manually trigger a re-scan for promoted models. (Otherwise it runs every 15 min.)
This was an existing secret/pattern used by other app->worker requests, so I just reused it.
I can swap it out and implement a new Cloudflare Access based approach if we want.
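To make the two options in this thread concrete, here is a rough sketch of how the app->worker request could carry either the existing shared secret or Cloudflare Access service token headers. The header names come from the diff and the curl example above; the WorkerAuth type, both function names, and the example base URL are hypothetical and not part of this PR:

```typescript
// Two ways to authenticate the app -> worker call (illustrative only).
type WorkerAuth =
  | { kind: 'internal-secret'; secret: string }
  | { kind: 'cf-access'; clientId: string; clientSecret: string };

// Build request headers for the chosen authentication scheme.
function buildAuthHeaders(auth: WorkerAuth): Record<string, string> {
  const headers: Record<string, string> = { 'Content-Type': 'application/json' };
  if (auth.kind === 'internal-secret') {
    // Existing pattern: shared internal API secret.
    headers['x-internal-api-key'] = auth.secret;
  } else {
    // Alternative: Cloudflare Access service token credentials.
    headers['CF-Access-Client-Id'] = auth.clientId;
    headers['CF-Access-Client-Secret'] = auth.clientSecret;
  }
  return headers;
}

// Assemble the manual sync trigger; the caller passes the result to fetch(url, init).
function buildSyncRequest(
  baseUrl: string,
  auth: WorkerAuth,
): { url: string; init: { method: string; headers: Record<string, string> } } {
  // Strip trailing slashes so the path is joined exactly once.
  const url = `${baseUrl.replace(/\/+$/, '')}/internal/sync`;
  return { url, init: { method: 'POST', headers: buildAuthHeaders(auth) } };
}
```

With this shape, swapping schemes would only change the auth value passed in, for example `buildSyncRequest('https://worker.example.com', { kind: 'internal-secret', secret })`, and the call site would stay the same.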



Summary
model-eval-ingest puller that reads promoted evals from kilo-bench over a Service Binding, stores append-only audit rows, and recomputes public modelStats.benchmarks.kiloBench caches.
Verification
swift-falcon-local-e2e bench eval, triggered the scheduled sync endpoint, and verified Postgres stored the ingest row and public kiloBench cache.lastPromotedAt.
Visual Changes
N/A
Reviewer Notes
BENCH_DASHBOARD Service Binding plus the shared internal API secret; the web admin trigger uses MODEL_EVAL_INGEST_URL to call /internal/sync.
model_eval_ingest.promoted_by_email is retained for audit but anonymized by the account soft-delete flow.