chore: import scorecards next to deps.dev data from openssf dataset (CM-1227) #4196
Signed-off-by: Uroš Marolt <uros@marolt.me>
Pull request overview
This PR adds a new OpenSSF Scorecard ingestion path to the packages_worker BigQuery→GCS→Postgres bootstrap pipeline, persisting both aggregate repo-level Scorecard results and per-check details into packages-db. It also renames the existing deps.dev ingest worker/task-queue to a more general bq-dataset-ingest to reflect ingestion from multiple BigQuery datasets.
Changes:
- Added a new `scorecard/` module with BigQuery export SQL and a Temporal workflow that loads/merges Scorecard repo aggregates and per-check rows into Postgres via chunked parquet staging.
- Wired Scorecard ingestion into `bootstrapOsspckgs` (runs last) and extended CLI/export tooling to support `scorecard`/`scorecard_*` kinds/parts.
- Renamed the deps.dev ingest worker/task queue from `deps-dev-ingest` → `bq-dataset-ingest`, adjusted schedules/scripts, and added a DB migration to allow new ingest job kinds.
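
As a rough illustration of the BigQuery export SQL mentioned above, the two queries might look something like the sketch below. This is not the contents of `scorecardSql.ts`: the table name comes from the PR description, but the column names (`repo.name`, `score`, `date`, `checks`) and the constant names are assumptions and should be checked against the actual dataset schema.

```typescript
// Hypothetical sketch; not the actual contents of scorecardSql.ts.
// Table name is from the PR description; column names are assumptions.
const SCORECARD_TABLE = "`openssf.scorecardcron.scorecard-v2_latest`";

// Aggregate query: one row per repository with its overall score and run date.
const scorecardReposSql = `
  SELECT repo.name AS repo_name, score, date AS last_run_at
  FROM ${SCORECARD_TABLE}
`;

// Per-check query: UNNEST flattens the repeated checks field,
// producing one row per (repository, check) pair.
const scorecardChecksSql = `
  SELECT repo.name AS repo_name, c.name AS check_name, c.score AS check_score, c.reason
  FROM ${SCORECARD_TABLE}, UNNEST(checks) AS c
`;
```

Splitting the export into two queries keeps the aggregate rows (~1.3M) separate from the much larger per-check rows (~18M), so each can be staged and merged independently.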
Reviewed changes
Copilot reviewed 14 out of 15 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| services/libs/data-access-layer/src/osspckgs/ingestJobs.ts | Extends OsspckgsJobKind union to include Scorecard job kinds used by the ingest pipeline. |
| services/apps/packages_worker/src/workflows/index.ts | Exposes ingestScorecard from the workflows barrel for worker registration/usage. |
| services/apps/packages_worker/src/scripts/triggerBootstrap.ts | Allows --kinds scorecard and routes bootstrap to the renamed bq-dataset-ingest task queue. |
| services/apps/packages_worker/src/scripts/exportToBucket.ts | Adds Scorecard export “parts” and SQL wiring for pre-exporting Scorecard data to GCS. |
| services/apps/packages_worker/src/scorecard/workflows/ingestScorecard.ts | Implements the chunked Scorecard ingest workflow (export → parquet staging → merge/update). |
| services/apps/packages_worker/src/scorecard/workflows/index.ts | Barrel export for the new Scorecard workflow module. |
| services/apps/packages_worker/src/scorecard/queries/scorecardSql.ts | Defines BigQuery SQL for Scorecard aggregate rows and per-check UNNEST rows. |
| services/apps/packages_worker/src/schedules/cleanup.ts | Updates cleanup schedule to use the renamed bq-dataset-ingest task queue. |
| services/apps/packages_worker/src/deps-dev/workflows/bootstrapOsspckgs.ts | Runs Scorecard ingest as the final child workflow when kinds is unset or includes scorecard. |
| services/apps/packages_worker/src/deps-dev/schedules/bootstrap.ts | Moves weekly bootstrap cron to Monday 02:00 UTC and updates task queue name. |
| services/apps/packages_worker/src/deps-dev/config.ts | Adds the SCORECARD_DATASET constant for Scorecard BigQuery sourcing. |
| services/apps/packages_worker/src/bin/bq-dataset-ingest.ts | New ingest worker entrypoint that registers bootstrap/cleanup schedules and starts the service worker. |
| services/apps/packages_worker/package.json | Renames scripts from deps-dev-ingest to bq-dataset-ingest and updates export tooling SERVICE wiring. |
| scripts/services/bq-dataset-ingest.yaml | Renames docker-compose service/hostname/command/env to match bq-dataset-ingest. |
| backend/src/osspckgs/migrations/V1781074345__add-scorecard-job-kinds.sql | Extends the osspckgs_ingest_jobs.job_kind CHECK constraint to include new Scorecard kinds. |
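
The gating rule described for `bootstrapOsspckgs.ts` (run when `kinds` is unset or includes `scorecard`) can be sketched as a small predicate. The helper name and the placeholder kind `"other"` are hypothetical, and treating an empty array as "run nothing" is an assumption; the real check lives inside the workflow.

```typescript
// Hypothetical helper; the actual condition is inline in bootstrapOsspckgs.
// A kind runs when no filter was supplied, or when the filter names it.
function shouldRunKind(kinds: string[] | undefined, kind: string): boolean {
  return kinds === undefined || kinds.includes(kind);
}

// No filter: everything runs, including Scorecard.
const runsByDefault = shouldRunKind(undefined, "scorecard");

// Explicit filter that names Scorecard: it runs.
const runsWhenSelected = shouldRunKind(["scorecard"], "scorecard");

// Explicit filter for a different (placeholder) kind: Scorecard is skipped.
const skippedWhenFiltered = shouldRunKind(["other"], "scorecard");
```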
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 75bedbc.

Summary
Ingests OpenSSF Scorecard data from BigQuery into packages-db, populating `repos.scorecard_score`, `repos.scorecard_last_run_at`, and `repo_scorecard_checks` (per-check detail). Sourced from `openssf.scorecardcron.scorecard-v2_latest` (~1.3M repos, ~18M check rows, full rescan weekly). Runs as the final child workflow in `bootstrapOsspckgs`.

Also renames the `deps-dev-ingest` task queue / binary / scripts to `bq-dataset-ingest` since the worker now handles data from multiple BQ sources, not just deps.dev.

Changes
- `src/scorecard/` module: `ingestScorecard` Temporal workflow. Exports two BQ queries (aggregate scores + per-check detail via UNNEST), loads via GCS parquet chunks into staging tables, merges into `repos` (UPDATE) and `repo_scorecard_checks` (INSERT ON CONFLICT DO UPDATE). Follows the same chunked pattern as `ingestAdvisories`.
- `bootstrapOsspckgs`: `ingestScorecard` wired as the final `executeChild` call; runs when `kinds` is unset or includes `'scorecard'`; intentionally last since it UPDATEs repos that must already exist from the deps-dev ingest.
- `deps-dev-ingest` → `bq-dataset-ingest`: binary, yaml, task queue strings, package.json scripts; mechanical rename only, no logic changes.
- `V1781074345`: extends `osspckgs_ingest_jobs.job_kind` CHECK constraint to include `scorecard_repos` and `scorecard_checks` (without this, every ingest job creation throws a constraint violation).
- `0 2 * * 0` → `0 2 * * 1`: deps.dev and scorecard both publish Sunday ~21:00 UTC; Monday 02:00 UTC ensures we pick up fresh data same week.
- `triggerBootstrap.ts`: `'scorecard'` added to `VALID_KINDS` so `--kinds scorecard` works from CLI.
- `exportToBucket.ts`: `scorecard_repos` and `scorecard_checks` parts added; supports `--dry-run`, resume, and `parts=all`.
- `updated_at = NOW()` added to both merge SQLs: Tinybird sync watermark advances on re-ingest for `repos` and `repo_scorecard_checks`.

Type of change
JIRA ticket
https://linuxfoundation.atlassian.net/browse/CM-1227
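
To make the reasoning behind the schedule change (`0 2 * * 0` → `0 2 * * 1`) concrete, here is a minimal sketch of a cron matcher evaluated in UTC. It is illustrative only: it handles just the numeric minute, hour, and day-of-week fields used in this PR, the function name is hypothetical, and the dates are arbitrary examples (7 January 2024 was a Sunday).

```typescript
// Minimal matcher for "minute hour * * day-of-week" cron strings, in UTC.
// Illustrative only: day-of-month and month fields are ignored.
function matchesWeeklyCron(cron: string, at: Date): boolean {
  const [minute, hour, , , dayOfWeek] = cron.trim().split(/\s+/);
  return (
    at.getUTCMinutes() === Number(minute) &&
    at.getUTCHours() === Number(hour) &&
    at.getUTCDay() === Number(dayOfWeek) // 0 = Sunday, 1 = Monday
  );
}

// Example week: data is published Sunday ~21:00 UTC.
const publishedAt = new Date("2024-01-07T21:00:00Z"); // Sunday
const oldRun = new Date("2024-01-07T02:00:00Z"); // Sunday 02:00, before the publish
const newRun = new Date("2024-01-08T02:00:00Z"); // Monday 02:00, after the publish

// Hours between the publish and the first run that can see it under the new cron.
const hoursAfterPublish = (newRun.getTime() - publishedAt.getTime()) / 3_600_000;
```

Under the old schedule the run fires 19 hours before the Sunday publish, so fresh data waits almost a week; under the new one it is picked up about 5 hours after publishing.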
Note
Medium Risk
Touches production ingest orchestration (Temporal schedules, task queue rename, DB CHECK migration) and bulk updates to `repos` and `repo_scorecard_checks`; mis-deployed queue names or skipped migration would break ingest job creation or leave scorecard data stale.

Overview
Adds OpenSSF Scorecard ingestion from BigQuery (`openssf.scorecardcron`) into packages-db: a new `ingestScorecard` Temporal workflow exports aggregate repo scores and per-check rows (UNNEST), loads them through the existing BQ→GCS→staging→merge pipeline, and updates `repos.scorecard_score` / `scorecard_last_run_at` plus upserts `repo_scorecard_checks`. It runs as the last child of `bootstrapOsspckgs` when `kinds` is unset or includes `scorecard`, after deps.dev has populated `repos`.

Renames the multi-source BQ worker from `deps-dev-ingest` to `bq-dataset-ingest` (compose, task queues, scripts, `export-to-bucket` service). Drops the monolithic `packages-worker` entrypoint and its compose file; npm/osv/maven schedules are no longer registered from that stub. Weekly bootstrap moves from Sunday 02:00 to Monday 02:00 UTC to align with Sunday BQ publishes.

A Flyway migration extends `osspckgs_ingest_jobs.job_kind` with `scorecard_repos` and `scorecard_checks`; TypeScript job kinds and CLI/export tooling (`exportToBucket`, `trigger-bootstrap --kinds scorecard`) are updated to match.

Reviewed by Cursor Bugbot for commit 42f4871. Bugbot is set up for automated code reviews on this repo.
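
The merge step (UPDATE for `repos`, upsert for `repo_scorecard_checks`, both setting `updated_at = NOW()`) could look roughly like the sketch below. Only the target tables and the `scorecard_score`, `scorecard_last_run_at`, and `updated_at` columns come from the PR; the staging table names, the `url`/`repo_url` join key, the `repo_id`/`check_name`/`score`/`reason` columns, and the conflict target are all assumptions made for illustration.

```typescript
// Hypothetical sketch of the two merge statements (PostgreSQL).
// Staging table names, join key, and conflict target are assumptions.

// UPDATE ... FROM only touches repos rows that already exist, which is
// why the Scorecard ingest has to run after the deps.dev ingest.
const mergeReposSql = `
  UPDATE repos r
  SET scorecard_score = s.score,
      scorecard_last_run_at = s.last_run_at,
      updated_at = NOW()
  FROM staging_scorecard_repos s
  WHERE r.url = s.repo_url
`;

// Upsert per-check rows; re-ingesting refreshes updated_at so a
// downstream sync watermark advances.
const mergeChecksSql = `
  INSERT INTO repo_scorecard_checks (repo_id, check_name, score, reason, updated_at)
  SELECT r.id, s.check_name, s.score, s.reason, NOW()
  FROM staging_scorecard_checks s
  JOIN repos r ON r.url = s.repo_url
  ON CONFLICT (repo_id, check_name) DO UPDATE
  SET score = EXCLUDED.score,
      reason = EXCLUDED.reason,
      updated_at = NOW()
`;
```

Because the first statement is a plain UPDATE and not an upsert, Scorecard rows for repositories missing from `repos` are silently ignored in this sketch.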