
Conversation

@jhelwig (Contributor) commented Jan 13, 2026

This adds a new "migration mode" that copies items from the PG-backed layer cache storage to the S3-backed storage, "backfilling" items that were inserted into the layer cache before services were configured to run in any of the persister modes that write to S3 (possibly in addition to PG). The backfill can be run for a subset of caches using --backfill-cache-types (a comma-separated list), and only considers items inserted into the PG-backed storage before the timestamp provided by --backfill-cutoff-timestamp. Items are processed newest-to-oldest, and every 30 seconds the backfill process logs the "current" timestamp for each cache (how far into the past the copy has progressed). This allows the backfill to be resumed safely with minimal re-processing: provide a --backfill-cutoff-timestamp that slightly overlaps the most recently logged timestamp(s) for the selected cache(s).

The backfill also records a set of metrics visualized by a new Grafana dashboard to aid in monitoring the progress of the backfill per cache.

From a local test run:

SI_SDF__LAYER_DB_CONFIG__PERSISTER_MODE=S3Primary buck2 run @//mode/release //bin/sdf:sdf -- \
  -vv --migration-mode backfillLayerCache \
  --backfill-cache-types "cas,workspace_snapshot,encrypted_secret,rebase_batch,change_batch,split_snapshot_subgraph,split_snapshot_supergraph,split_snapshot_rebase_batch" \
  --backfill-cutoff-timestamp "2026-01-13 17:30:26.445817 UTC"

github-actions bot commented Jan 13, 2026

Dependency Review

✅ No vulnerabilities or OpenSSF Scorecard issues found.

Scanned Files

None

@jhelwig force-pushed the jhelwig/eng-3292-backfill-s3-with-layer-cache-data-from-rds branch 2 times, most recently from b165542 to b062109, on January 13, 2026 18:03
@jhelwig force-pushed the jhelwig/eng-3292-backfill-s3-with-layer-cache-data-from-rds branch from b062109 to 497f036 on January 13, 2026 22:13
@jhelwig marked this pull request as ready for review January 13, 2026 23:41
@jhelwig force-pushed the jhelwig/eng-3292-backfill-s3-with-layer-cache-data-from-rds branch from 497f036 to 77d3e85 on January 14, 2026 21:58
Enables triggering layer cache backfill operations via CLI as a
one-time migration task. This mode will populate S3 with historical
layer cache data from PostgreSQL, supporting the transition to
S3-backed caching.
These arguments configure the backfill operation:
- cutoff_timestamp: Defines the time boundary for data migration
- cache_types: Specifies which caches to backfill (no "all" default
  to prevent accidental migration of deprecated caches)
- key_batch_size: Controls memory usage during RDS queries
- checkpoint_interval_secs: Determines progress logging frequency
- max_concurrent_uploads: Limits S3 upload parallelism

The cutoff timestamp makes the backfill resumable - operators can
restart from the last logged checkpoint if interrupted.
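A minimal sketch of what this CLI surface could look like with clap's derive API. The struct and field names are hypothetical, the batch-size and checkpoint-interval flag names and the batch-size default are guesses; the remaining flag names and defaults come from this PR.

    use clap::Parser;

    #[derive(Parser, Debug)]
    struct BackfillArgs {
        /// Only entries inserted before this timestamp are backfilled.
        #[arg(long = "backfill-cutoff-timestamp")]
        backfill_cutoff_timestamp: Option<String>,

        /// Comma-separated list of caches to backfill; deliberately no
        /// "all" default.
        #[arg(long = "backfill-cache-types", value_delimiter = ',')]
        backfill_cache_types: Option<Vec<String>>,

        /// Keys fetched per RDS query, bounding memory usage.
        /// (Flag name and default are guesses.)
        #[arg(long = "backfill-key-batch-size", default_value_t = 1000)]
        backfill_key_batch_size: usize,

        /// Seconds between progress (checkpoint) log lines.
        /// (Flag name is a guess; the 30s default matches the PR text.)
        #[arg(long = "backfill-checkpoint-interval-secs", default_value_t = 30)]
        backfill_checkpoint_interval_secs: u64,

        /// Maximum concurrent S3 uploads per cache type (default 5, per
        /// the later concurrency commit).
        #[arg(long = "backfill-max-concurrent-uploads", default_value_t = 5)]
        backfill_max_concurrent_uploads: usize,
    }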
@jhelwig force-pushed the jhelwig/eng-3292-backfill-s3-with-layer-cache-data-from-rds branch from 77d3e85 to 8e1d6ed on January 14, 2026 22:21
Implement BackfillConfig::from_args() that parses and validates CLI arguments:
- Parse RFC 3339 timestamp formats (with/without timezone)
- Require cache types list (no default to all)
- Validate cache type names against BACKFILL_CACHE_TYPES
- Convert checkpoint interval to Duration

BACKFILL_CACHE_TYPES contains only the 8 caches requiring backfill.
Validates cache types against this list to prevent backfilling
deprecated caches (func_run, func_run_log) that are being migrated
out of the layer cache architecture.
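A rough sketch of the parsing and validation described above, assuming chrono for timestamp handling. BACKFILL_CACHE_TYPES is reconstructed from the cache list in the test run; the accepted timestamp formats and the string error type are simplifications, not the real definitions.

    use chrono::{DateTime, NaiveDateTime, Utc};

    const BACKFILL_CACHE_TYPES: [&str; 8] = [
        "cas",
        "workspace_snapshot",
        "encrypted_secret",
        "rebase_batch",
        "change_batch",
        "split_snapshot_subgraph",
        "split_snapshot_supergraph",
        "split_snapshot_rebase_batch",
    ];

    fn parse_cutoff(raw: &str) -> Result<DateTime<Utc>, String> {
        // Accept a timestamp with an explicit timezone, falling back to a
        // naive timestamp interpreted as UTC.
        DateTime::parse_from_rfc3339(raw)
            .map(|dt| dt.with_timezone(&Utc))
            .or_else(|_| {
                NaiveDateTime::parse_from_str(raw, "%Y-%m-%d %H:%M:%S%.f")
                    .map(|naive| naive.and_utc())
                    .map_err(|e| format!("unparseable cutoff timestamp: {e}"))
            })
    }

    fn validate_cache_types(requested: &[String]) -> Result<(), String> {
        for cache_type in requested {
            // Reject anything not in the allowlist, including the
            // deprecated func_run / func_run_log caches.
            if !BACKFILL_CACHE_TYPES.contains(&cache_type.as_str()) {
                return Err(format!("unsupported cache type: {cache_type}"));
            }
        }
        Ok(())
    }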
Extends LayerCache and S3Layer with methods to support backfilling:
- list_keys_before_timestamp(): Queries RDS for cache keys within the
  cutoff window, using batched queries to limit memory usage
- backfill_to_s3(): Uploads PostgreSQL-backed cache entries to S3,
  preserving the existing cas_address as the S3 key

These low-level primitives enable migrating historical data without
modifying the existing write path or requiring dual-writes.
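A hedged sketch of what the batched key listing could look like using tokio-postgres (with its chrono feature enabled). The table and column names (key, created_at) are assumptions about the layer cache schema, not the real definitions.

    use chrono::{DateTime, Utc};
    use tokio_postgres::Client;

    /// Fetch up to `batch_size` keys inserted before `cutoff`, newest
    /// first. Callers page backwards by passing the oldest timestamp
    /// seen so far as the next `cutoff`, keeping memory bounded
    /// regardless of table size.
    async fn list_keys_before_timestamp(
        client: &Client,
        table: &str, // from a trusted allowlist, never user input
        cutoff: DateTime<Utc>,
        batch_size: i64,
    ) -> Result<Vec<(String, DateTime<Utc>)>, tokio_postgres::Error> {
        let query = format!(
            "SELECT key, created_at FROM {table} \
             WHERE created_at < $1 \
             ORDER BY created_at DESC \
             LIMIT $2"
        );
        let rows = client.query(query.as_str(), &[&cutoff, &batch_size]).await?;
        Ok(rows
            .iter()
            .map(|row| (row.get::<_, String>(0), row.get::<_, DateTime<Utc>>(1)))
            .collect())
    }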
Extracts cache-specific operations into type-safe helpers:
- get_layer_cache_for_type(): Maps cache type strings to LayerCache
  instances, encapsulating the ServicesContext access pattern
- should_skip_entry(): Checks if an entry already exists in S3,
  preventing redundant uploads during resumed backfills
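should_skip_entry might reduce to a HeadObject existence check; a sketch assuming the aws-sdk-s3 crate rather than the real S3Layer wrapper:

    use aws_sdk_s3::Client;

    /// Returns true when `key` already exists in the bucket, so a
    /// resumed backfill can skip re-uploading it.
    async fn should_skip_entry(client: &Client, bucket: &str, key: &str) -> bool {
        // HeadObject is a cheap existence probe: success means the entry
        // is already in S3; any error (including 404) means we fall
        // through to the upload path, which can surface real failures.
        client
            .head_object()
            .bucket(bucket)
            .key(key)
            .send()
            .await
            .is_ok()
    }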
Processes a single cache type's backfill in batches:
1. Query RDS for keys before cutoff timestamp (paginated)
2. For each key, check S3 existence and upload if missing
3. Log checkpoints with the newest processed timestamp
4. Respect graceful shutdown signals between batches

The checkpoint timestamp represents the maximum (newest) timestamp
processed, allowing operators to resume from that point if the
backfill is interrupted. Skipping existing S3 entries makes the
operation idempotent and resumable.
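A condensed, self-contained sketch of this control flow (pagination, skip, checkpoint). The stubbed helpers stand in for the RDS and S3 calls from the earlier commits, and the shutdown and metrics plumbing is elided.

    use chrono::{DateTime, Utc};
    use std::time::{Duration, Instant};

    type Key = String;

    // Stubs standing in for the RDS/S3 helpers sketched earlier.
    async fn fetch_key_batch(_cutoff: DateTime<Utc>, _n: usize) -> Vec<(Key, DateTime<Utc>)> {
        Vec::new()
    }
    async fn already_in_s3(_key: &Key) -> bool {
        false
    }
    async fn upload_to_s3(_key: &Key) {}

    async fn backfill_one_cache(mut cutoff: DateTime<Utc>, batch_size: usize) {
        let mut last_checkpoint = Instant::now();
        loop {
            // 1. Page backwards through RDS: keys strictly older than
            //    `cutoff`, newest first, bounded by `batch_size`.
            let batch = fetch_key_batch(cutoff, batch_size).await;
            let Some((_, oldest)) = batch.last().cloned() else {
                break; // nothing left before the cutoff; backfill complete
            };
            for (key, _inserted_at) in &batch {
                // 2. Idempotency: entries copied by a previous run are
                //    skipped, so an overlapping resume is safe.
                if !already_in_s3(key).await {
                    upload_to_s3(key).await;
                }
            }
            // Advance the window past the oldest key in this batch.
            cutoff = oldest;
            // 3. Checkpoint: everything newer than `cutoff` has been
            //    processed, so an operator can restart with this value
            //    as the new cutoff if the backfill is interrupted.
            if last_checkpoint.elapsed() >= Duration::from_secs(30) {
                println!("checkpoint: progressed to {cutoff}");
                last_checkpoint = Instant::now();
            }
        }
    }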
Orchestrates parallel backfill of multiple cache types by spawning
one async task per cache type. Each task runs independently, allowing
the backfill to make progress across cache types simultaneously.

Uses try_join_all() to fail fast if any cache type encounters an
error, ensuring data consistency. The coordinator initializes the
ServicesContext once and shares it across all tasks via Arc cloning.
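The coordinator shape could look roughly like this, using futures' try_join_all. ServicesContext and run_backfill_for_cache are stand-ins for the real types, and errors are simplified to strings.

    use std::sync::Arc;

    use futures::future::try_join_all;

    struct ServicesContext; // stand-in for the real shared context

    async fn run_backfill_for_cache(
        _ctx: Arc<ServicesContext>,
        cache_type: String,
    ) -> Result<(), String> {
        println!("backfilling {cache_type}");
        Ok(())
    }

    async fn run_all(ctx: ServicesContext, cache_types: Vec<String>) -> Result<(), String> {
        // Initialize the shared context once; Arc clones are cheap per task.
        let ctx = Arc::new(ctx);
        let tasks = cache_types.into_iter().map(|cache_type| {
            let ctx = ctx.clone();
            async move {
                // Spawn so each cache type makes progress independently.
                tokio::spawn(run_backfill_for_cache(ctx, cache_type))
                    .await
                    .map_err(|join_err| join_err.to_string())?
            }
        });
        // Fail fast: the first error from any cache type aborts the run.
        try_join_all(tasks).await?;
        Ok(())
    }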
Integrates the backfill mode into sdf startup:
- Parse CLI arguments into BackfillConfig
- When migration_mode is BackfillLayerCache, initialize the
  LayerCacheBackfiller coordinator and run it to completion
- Exit after backfill completes (no server startup)

This makes the backfill a one-shot operation invoked via:
  sdf --migration-mode BackfillLayerCache --backfill-cutoff-timestamp <ts>
Creates a comprehensive monitoring dashboard for the layer cache backfill process with metrics for items processed, uploaded, and skipped, along with throughput tracking and per-cache-type breakdowns.
Implements bounded concurrency (default: 5) for S3 uploads using JoinSet
pattern from the S3 queue processor. Each cache type spawns up to
max_concurrent_uploads tasks in parallel, significantly improving throughput
for I/O-bound backfill operations.

Key features:
- Configurable via --backfill-max-concurrent-uploads CLI arg and env var
- HashMap-based timestamp tracking ensures checkpoint reliability
- Tracks maximum (newest) in-flight timestamp to prevent holes
- Immediate shutdown without waiting for in-flight tasks (resumable)
- Updated checkpoint logs with active_uploads count
- Fail-fast error handling maintains data integrity

Performance impact: ~5x throughput improvement for I/O-bound workloads.
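A sketch of the bounded-concurrency pattern described above using tokio's JoinSet, with the upload body stubbed out and the checkpoint/metrics tracking elided:

    use tokio::task::JoinSet;

    async fn upload(key: String) -> Result<(), String> {
        println!("uploading {key}");
        Ok(())
    }

    async fn upload_bounded(
        keys: Vec<String>,
        max_concurrent_uploads: usize, // default 5 per this commit
    ) -> Result<(), String> {
        let mut in_flight = JoinSet::new();
        for key in keys {
            // Once the set is full, wait for one upload to finish before
            // spawning the next, capping parallelism at the limit.
            if in_flight.len() >= max_concurrent_uploads {
                in_flight
                    .join_next()
                    .await
                    .expect("set is non-empty")
                    .map_err(|join_err| join_err.to_string())??;
            }
            in_flight.spawn(upload(key));
        }
        // Drain remaining in-flight uploads, failing fast on any error.
        while let Some(result) = in_flight.join_next().await {
            result.map_err(|join_err| join_err.to_string())??;
        }
        Ok(())
    }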
@jhelwig force-pushed the jhelwig/eng-3292-backfill-s3-with-layer-cache-data-from-rds branch from 8e1d6ed to d6f80e2 on January 14, 2026 22:32
@zacharyhamm (Contributor) left a comment:
I did not test this, but it doesn't look dangerous to merge, and I read the backfill.rs code closely. Looks correct! One question here, but not a blocker.

.await?;

// Reverse so pop() gives us newest-first (query returns DESC order)
key_batch.reverse();
@zacharyhamm (Contributor):
Why not have fetch_key_batch return with ORDER BY ASC?

@jhelwig (Contributor, Author) replied Jan 14, 2026:
No good reason. That would be the better way to do it. I'm sure there will be revisions of this as we see exactly how it performs in the real world. I'll make a note to fix that in one of the revisions.
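For illustration, the suggested change might keep the DESC + LIMIT pagination (so each batch is still the newest N keys before the cutoff) while returning the batch oldest-first via a subquery, instead of reversing the Vec in Rust. The table and column names here are assumptions.

    // Hypothetical: each batch is still the newest N keys before the
    // cutoff (inner DESC + LIMIT), but rows come back oldest-first so
    // no reverse() is needed before pop().
    const FETCH_KEY_BATCH_ASC: &str = "\
        SELECT key, created_at FROM ( \
            SELECT key, created_at FROM cache_table \
             WHERE created_at < $1 \
             ORDER BY created_at DESC \
             LIMIT $2 \
        ) newest_batch ORDER BY created_at ASC";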

@jhelwig added this pull request to the merge queue Jan 14, 2026
Merged via the queue into main with commit 5480040 Jan 15, 2026
10 checks passed
@jhelwig deleted the jhelwig/eng-3292-backfill-s3-with-layer-cache-data-from-rds branch January 15, 2026 00:03