Skip to content

feat(model-eval-ingest): sync promoted bench evals#3258

Merged
lambertjosh merged 14 commits into
mainfrom
rogue-timbale
May 19, 2026
Merged

feat(model-eval-ingest): sync promoted bench evals#3258
lambertjosh merged 14 commits into
mainfrom
rogue-timbale

Conversation

@lambertjosh
Copy link
Copy Markdown
Contributor

Summary

  • Add a dedicated Cloudflare model-eval-ingest puller that reads promoted evals from kilo-bench over a Service Binding, stores append-only audit rows, and recomputes public modelStats.benchmarks.kiloBench caches.
  • Extend cloud persistence, migrations, GDPR anonymization, and admin-only operations for sync-now / ingest history while keeping bench URLs and promoter email out of public model stats JSON.
  • Preserve fractional benchmark scores and weighted aggregate semantics, including latest-promotion-wins recompute behavior per model/task tuple.

Verification

  • Ran the bench dashboard Worker and cloud model-eval-ingest Worker locally, seeded a promoted swift-falcon-local-e2e bench eval, triggered the scheduled sync endpoint, and verified Postgres stored the ingest row and public kiloBench cache.
  • Re-triggered the local scheduled sync to verify the ingest remained idempotent, then inspected the cache payload to confirm it excluded bench URLs and promoter email while retaining ISO lastPromotedAt.

Visual Changes

N/A

Reviewer Notes

  • The new worker expects a BENCH_DASHBOARD Service Binding plus the shared internal API secret; the web admin trigger uses MODEL_EVAL_INGEST_URL to call /internal/sync.
  • model_eval_ingest.promoted_by_email is retained for audit but anonymized by the account soft-delete flow.
  • The generated Drizzle migration introduces the append-only ingest table and indexes needed for idempotency and cache recompute queries.

Comment thread services/model-eval-ingest/src/sync.ts
Comment thread services/model-eval-ingest/src/sync.ts Outdated
Comment thread apps/web/src/lib/model-eval-ingest-client.ts Outdated
Comment thread packages/db/src/schema.ts
Comment thread apps/web/src/routers/admin-model-eval-ingest-router.ts
@kilo-code-bot
Copy link
Copy Markdown
Contributor

kilo-code-bot Bot commented May 15, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Executive Summary

The latest commit (6469ebbf) renames the admin UI surface label from "Model Eval Ingest" to "Model Benchmarks" in the sidebar, page heading, and breadcrumb — a cosmetic-only change with no logic, security, or data impact. All prior issues remain resolved.

All Previous Issues Resolved (click to expand)
File Issue Status
services/model-eval-ingest/src/sync.ts N+1 DB query pattern fixed — findModelStatsTargets and insertPromotions are now batched ✅ Resolved
apps/web/src/lib/model-eval-ingest-client.ts as cast replaced with Zod schema validation ✅ Resolved
services/model-eval-ingest/src/sync.ts Cache recompute intent now documented with explanatory comment ✅ Resolved
packages/db/src/schema.ts Expression index on LOWER(promoted_by_email) added for GDPR soft-delete ✅ Resolved
apps/web/src/routers/admin-model-eval-ingest-router.ts repullPromotion is now wired up in the admin UI ✅ Resolved
apps/web/src/app/admin/model-eval-ingest/ModelEvalIngestContent.tsx Toast message now uses correct pluralization via formatCount helper ✅ Resolved
packages/db/src/migrations/ Two migrations squashed into 0129_friendly_iron_patriot.sql; CREATE INDEX now runs on the newly-created (empty) table in the same migration — no lock concern on a populated table ✅ Resolved
Files Reviewed (25 files)
  • .env.local.example - 0 issues
  • apps/web/.env.development.local.example - 0 issues
  • apps/web/src/app/admin/components/AppSidebar.tsx - 0 issues
  • apps/web/src/app/admin/model-eval-ingest/ModelEvalIngestContent.tsx - 0 issues
  • apps/web/src/app/admin/model-eval-ingest/page.tsx - 0 issues
  • apps/web/src/lib/config.server.ts - 0 issues
  • apps/web/src/lib/model-eval-ingest-client.ts - 0 issues
  • apps/web/src/lib/user.test.ts - 0 issues
  • apps/web/src/lib/user.ts - 0 issues
  • apps/web/src/routers/admin-model-eval-ingest-router.ts - 0 issues
  • apps/web/src/routers/admin-router.ts - 0 issues
  • packages/db/src/migrations/0129_friendly_iron_patriot.sql - 0 issues
  • packages/db/src/schema.ts - 0 issues
  • services/model-eval-ingest/.dev.vars.example - 0 issues
  • services/model-eval-ingest/README.md - 0 issues
  • services/model-eval-ingest/src/index.ts - 0 issues
  • services/model-eval-ingest/src/sync.ts - 0 issues
  • services/model-eval-ingest/src/types.ts - 0 issues
  • services/model-eval-ingest/worker-configuration.d.ts - 0 issues
  • services/model-eval-ingest/wrangler.jsonc - 0 issues
  • dev/local/services.ts - 0 issues
  • DEVELOPMENT.md - 0 issues

Reviewed by claude-sonnet-4.6 · 315,716 tokens

Review guidance: REVIEW.md from base branch main

Comment thread packages/db/src/migrations/0130_numerous_marvel_boy.sql Outdated
Comment thread apps/web/src/app/admin/model-eval-ingest/ModelEvalIngestContent.tsx Outdated
@lambertjosh
Copy link
Copy Markdown
Contributor Author

image image image

@lambertjosh
Copy link
Copy Markdown
Contributor Author

Note that if/when this is merged, we will then need to:

  1. Wait for CI to deploy model-eval-ingest
  2. Capture its deployed URL - i.e. workers.dev
  3. Set MODEL_EVAL_INGEST_URL in Vercel production to value.
  4. Redeploy Vercel / let the next web deploy pick it up

@lambertjosh lambertjosh requested a review from jrf0110 May 15, 2026 14:47
Comment thread packages/db/src/schema.ts Outdated
Comment thread packages/db/src/schema.ts Outdated
Comment thread packages/db/src/schema.ts Outdated
scale: 6,
mode: 'number',
}),
avg_execution_ms: decimal('avg_execution_ms', { precision: 16, scale: 6, mode: 'number' }),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need floating point precision for a value measured in ms?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No! Ugh. Fixing.

Comment thread packages/db/src/schema.ts Outdated
Comment on lines +3015 to +3022
avg_cost_usd: decimal('avg_cost_usd', { precision: 14, scale: 8, mode: 'number' }),
avg_input_tokens: decimal('avg_input_tokens', { precision: 16, scale: 6, mode: 'number' }),
avg_output_tokens: decimal('avg_output_tokens', { precision: 16, scale: 6, mode: 'number' }),
avg_cache_read_tokens: decimal('avg_cache_read_tokens', {
precision: 16,
scale: 6,
mode: 'number',
}),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue all of these should be integers

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, they are averages but we should just round. We don't really need this level of precision.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All integers now with the exception of microdollar which is bigint, as otherwise the limit would be an avg cost of ~2000usd. Since that is only a factor of 100 off from results we have seen, I bumped it to be safe.

Comment thread packages/db/src/schema.ts Outdated
@lambertjosh
Copy link
Copy Markdown
Contributor Author

@jrf0110 - ready for re-review I think. Thanks for the review so far.

method: 'POST',
headers: {
'Content-Type': 'application/json',
'x-internal-api-key': INTERNAL_API_SECRET,
Copy link
Copy Markdown
Contributor

@jrf0110 jrf0110 May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought you had moved to using cloudflare access service credentials? If so, you'd need to plumb through your client_id/secret and add the headers:

curl -H "CF-Access-Client-Id: <CLIENT_ID>" -H "CF-Access-Client-Secret: <CLIENT_SECRET>" https://app.example.com

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jrf0110 - I did for the rest of it. This is authentication between the Vercel app and the Worker, to manually trigger a re-scan for promoted models. (Otherwise it's 15min)

This was an existing secret/pattern used by other app->worker requests, so just re-used.

I can swap it out and implement a new Client Access based approach if we want.

@lambertjosh lambertjosh merged commit c864a87 into main May 19, 2026
49 checks passed
@lambertjosh lambertjosh deleted the rogue-timbale branch May 19, 2026 16:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants