
Conversation

@jhelwig (Contributor) commented Jan 13, 2026

This adds a new "migration mode" that copies items from the PG-backed layer cache storage to the S3-backed storage, "backfilling" items that were inserted into the layer cache before services were configured to run in any of the persister modes that write to S3 (possibly in addition to PG). The backfill can be run for a subset of caches using --backfill-cache-types (a comma-separated list), and only considers items inserted into the PG-backed storage before the timestamp provided by --backfill-cutoff-timestamp. Items are processed newest-to-oldest, and every 30 seconds the backfill process logs the "current" timestamp for each cache (how far into the past the copy has progressed). This allows the backfill to be resumed safely with minimal re-processing: provide a --backfill-cutoff-timestamp that slightly overlaps the most recently logged timestamp(s) for the selected cache(s).

The backfill also records a set of metrics visualized by a new Grafana dashboard to aid in monitoring the progress of the backfill per cache.

From a local test run:

SI_SDF__LAYER_DB_CONFIG__PERSISTER_MODE=S3Primary buck2 run @//mode/release //bin/sdf:sdf -- \
  -vv --migration-mode backfillLayerCache \
  --backfill-cache-types "cas,workspace_snapshot,encrypted_secret,rebase_batch,change_batch,split_snapshot_subgraph,split_snapshot_supergraph,split_snapshot_rebase_batch" \
  --backfill-cutoff-timestamp "2026-01-13 17:30:26.445817 UTC"

github-actions bot commented Jan 13, 2026

Dependency Review

✅ No vulnerabilities or OpenSSF Scorecard issues found.

Scanned Files

None

@jhelwig force-pushed the jhelwig/eng-3292-backfill-s3-with-layer-cache-data-from-rds branch 2 times, most recently from b165542 to b062109, on January 13, 2026 18:03
@jhelwig force-pushed the jhelwig/eng-3292-backfill-s3-with-layer-cache-data-from-rds branch from b062109 to 497f036 on January 13, 2026 22:13
@jhelwig marked this pull request as ready for review January 13, 2026 23:41
@jhelwig force-pushed the jhelwig/eng-3292-backfill-s3-with-layer-cache-data-from-rds branch from 497f036 to 77d3e85 on January 14, 2026 21:58
Enables triggering layer cache backfill operations via CLI as a
one-time migration task. This mode will populate S3 with historical
layer cache data from PostgreSQL, supporting the transition to
S3-backed caching.
These arguments configure the backfill operation:
- cutoff_timestamp: Defines the time boundary for data migration
- cache_types: Specifies which caches to backfill (no "all" default
  to prevent accidental migration of deprecated caches)
- key_batch_size: Controls memory usage during RDS queries
- checkpoint_interval_secs: Determines progress logging frequency
- max_concurrent_uploads: Limits S3 upload parallelism

The cutoff timestamp makes the backfill resumable - operators can
restart from the last logged checkpoint if interrupted.
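A minimal sketch of what this CLI surface could look like with clap's derive API. The struct and field names are hypothetical, the batch-size and checkpoint-interval flag names and the batch-size default are guesses; the remaining flag names and defaults come from this PR.

    use clap::Parser;

    #[derive(Parser, Debug)]
    struct BackfillArgs {
        /// Only entries inserted before this timestamp are backfilled.
        #[arg(long = "backfill-cutoff-timestamp")]
        backfill_cutoff_timestamp: Option<String>,

        /// Comma-separated list of caches to backfill; deliberately no
        /// "all" default.
        #[arg(long = "backfill-cache-types", value_delimiter = ',')]
        backfill_cache_types: Option<Vec<String>>,

        /// Keys fetched per RDS query, bounding memory usage.
        /// (Flag name and default are guesses.)
        #[arg(long = "backfill-key-batch-size", default_value_t = 1000)]
        backfill_key_batch_size: usize,

        /// Seconds between progress (checkpoint) log lines.
        /// (Flag name is a guess; the 30s default matches the PR text.)
        #[arg(long = "backfill-checkpoint-interval-secs", default_value_t = 30)]
        backfill_checkpoint_interval_secs: u64,

        /// Maximum concurrent S3 uploads per cache type (default 5, per
        /// the later concurrency commit).
        #[arg(long = "backfill-max-concurrent-uploads", default_value_t = 5)]
        backfill_max_concurrent_uploads: usize,
    }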
@jhelwig force-pushed the jhelwig/eng-3292-backfill-s3-with-layer-cache-data-from-rds branch from 77d3e85 to 8e1d6ed on January 14, 2026 22:21
Implement BackfillConfig::from_args() that parses and validates CLI arguments:
- Parse RFC 3339 timestamp formats (with/without timezone)
- Require cache types list (no default to all)
- Validate cache type names against BACKFILL_CACHE_TYPES
- Convert checkpoint interval to Duration

BACKFILL_CACHE_TYPES contains only the 8 caches requiring backfill.
Validates cache types against this list to prevent backfilling
deprecated caches (func_run, func_run_log) that are being migrated
out of the layer cache architecture.
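A rough sketch of the parsing and validation described above, assuming chrono for timestamp handling. BACKFILL_CACHE_TYPES is reconstructed from the cache list in the test run; the accepted timestamp formats and the string error type are simplifications, not the real definitions.

    use chrono::{DateTime, NaiveDateTime, Utc};

    const BACKFILL_CACHE_TYPES: [&str; 8] = [
        "cas",
        "workspace_snapshot",
        "encrypted_secret",
        "rebase_batch",
        "change_batch",
        "split_snapshot_subgraph",
        "split_snapshot_supergraph",
        "split_snapshot_rebase_batch",
    ];

    fn parse_cutoff(raw: &str) -> Result<DateTime<Utc>, String> {
        // Accept a timestamp with an explicit timezone, falling back to a
        // naive timestamp interpreted as UTC.
        DateTime::parse_from_rfc3339(raw)
            .map(|dt| dt.with_timezone(&Utc))
            .or_else(|_| {
                NaiveDateTime::parse_from_str(raw, "%Y-%m-%d %H:%M:%S%.f")
                    .map(|naive| naive.and_utc())
                    .map_err(|e| format!("unparseable cutoff timestamp: {e}"))
            })
    }

    fn validate_cache_types(requested: &[String]) -> Result<(), String> {
        for cache_type in requested {
            // Reject anything not in the allowlist, including the
            // deprecated func_run / func_run_log caches.
            if !BACKFILL_CACHE_TYPES.contains(&cache_type.as_str()) {
                return Err(format!("unsupported cache type: {cache_type}"));
            }
        }
        Ok(())
    }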
Extends LayerCache and S3Layer with methods to support backfilling:
- list_keys_before_timestamp(): Queries RDS for cache keys within the
  cutoff window, using batched queries to limit memory usage
- backfill_to_s3(): Uploads PostgreSQL-backed cache entries to S3,
  preserving the existing cas_address as the S3 key

These low-level primitives enable migrating historical data without
modifying the existing write path or requiring dual-writes.
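A hedged sketch of what the batched key listing could look like using tokio-postgres (with its chrono feature enabled). The table and column names (key, created_at) are assumptions about the layer cache schema, not the real definitions.

    use chrono::{DateTime, Utc};
    use tokio_postgres::Client;

    /// Fetch up to `batch_size` keys inserted before `cutoff`, newest
    /// first. Callers page backwards by passing the oldest timestamp
    /// seen so far as the next `cutoff`, keeping memory bounded
    /// regardless of table size.
    async fn list_keys_before_timestamp(
        client: &Client,
        table: &str, // from a trusted allowlist, never user input
        cutoff: DateTime<Utc>,
        batch_size: i64,
    ) -> Result<Vec<(String, DateTime<Utc>)>, tokio_postgres::Error> {
        let query = format!(
            "SELECT key, created_at FROM {table} \
             WHERE created_at < $1 \
             ORDER BY created_at DESC \
             LIMIT $2"
        );
        let rows = client.query(query.as_str(), &[&cutoff, &batch_size]).await?;
        Ok(rows
            .iter()
            .map(|row| (row.get::<_, String>(0), row.get::<_, DateTime<Utc>>(1)))
            .collect())
    }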
Extracts cache-specific operations into type-safe helpers:
- get_layer_cache_for_type(): Maps cache type strings to LayerCache
  instances, encapsulating the ServicesContext access pattern
- should_skip_entry(): Checks if an entry already exists in S3,
  preventing redundant uploads during resumed backfills
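should_skip_entry might reduce to a HeadObject existence check; a sketch assuming the aws-sdk-s3 crate rather than the real S3Layer wrapper:

    use aws_sdk_s3::Client;

    /// Returns true when `key` already exists in the bucket, so a
    /// resumed backfill can skip re-uploading it.
    async fn should_skip_entry(client: &Client, bucket: &str, key: &str) -> bool {
        // HeadObject is a cheap existence probe: success means the entry
        // is already in S3; any error (including 404) means we fall
        // through to the upload path, which can surface real failures.
        client
            .head_object()
            .bucket(bucket)
            .key(key)
            .send()
            .await
            .is_ok()
    }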
Processes a single cache type's backfill in batches:
1. Query RDS for keys before cutoff timestamp (paginated)
2. For each key, check S3 existence and upload if missing
3. Log checkpoints with the newest processed timestamp
4. Respect graceful shutdown signals between batches

The checkpoint timestamp represents the maximum (newest) timestamp
processed, allowing operators to resume from that point if the
backfill is interrupted. Skipping existing S3 entries makes the
operation idempotent and resumable.
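A condensed, self-contained sketch of this control flow (pagination, skip, checkpoint). The stubbed helpers stand in for the RDS and S3 calls from the earlier commits, and the shutdown and metrics plumbing is elided.

    use chrono::{DateTime, Utc};
    use std::time::{Duration, Instant};

    type Key = String;

    // Stubs standing in for the RDS/S3 helpers sketched earlier.
    async fn fetch_key_batch(_cutoff: DateTime<Utc>, _n: usize) -> Vec<(Key, DateTime<Utc>)> {
        Vec::new()
    }
    async fn already_in_s3(_key: &Key) -> bool {
        false
    }
    async fn upload_to_s3(_key: &Key) {}

    async fn backfill_one_cache(mut cutoff: DateTime<Utc>, batch_size: usize) {
        let mut last_checkpoint = Instant::now();
        loop {
            // 1. Page backwards through RDS: keys strictly older than
            //    `cutoff`, newest first, bounded by `batch_size`.
            let batch = fetch_key_batch(cutoff, batch_size).await;
            let Some((_, oldest)) = batch.last().cloned() else {
                break; // nothing left before the cutoff; backfill complete
            };
            for (key, _inserted_at) in &batch {
                // 2. Idempotency: entries copied by a previous run are
                //    skipped, so an overlapping resume is safe.
                if !already_in_s3(key).await {
                    upload_to_s3(key).await;
                }
            }
            // Advance the window past the oldest key in this batch.
            cutoff = oldest;
            // 3. Checkpoint: everything newer than `cutoff` has been
            //    processed, so an operator can restart with this value
            //    as the new cutoff if the backfill is interrupted.
            if last_checkpoint.elapsed() >= Duration::from_secs(30) {
                println!("checkpoint: progressed to {cutoff}");
                last_checkpoint = Instant::now();
            }
        }
    }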
Orchestrates parallel backfill of multiple cache types by spawning
one async task per cache type. Each task runs independently, allowing
the backfill to make progress across cache types simultaneously.

Uses try_join_all() to fail fast if any cache type encounters an
error, ensuring data consistency. The coordinator initializes the
ServicesContext once and shares it across all tasks via Arc cloning.
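The coordinator shape could look roughly like this, using futures' try_join_all. ServicesContext and run_backfill_for_cache are stand-ins for the real types, and errors are simplified to strings.

    use std::sync::Arc;

    use futures::future::try_join_all;

    struct ServicesContext; // stand-in for the real shared context

    async fn run_backfill_for_cache(
        _ctx: Arc<ServicesContext>,
        cache_type: String,
    ) -> Result<(), String> {
        println!("backfilling {cache_type}");
        Ok(())
    }

    async fn run_all(ctx: ServicesContext, cache_types: Vec<String>) -> Result<(), String> {
        // Initialize the shared context once; Arc clones are cheap per task.
        let ctx = Arc::new(ctx);
        let tasks = cache_types.into_iter().map(|cache_type| {
            let ctx = ctx.clone();
            async move {
                // Spawn so each cache type makes progress independently.
                tokio::spawn(run_backfill_for_cache(ctx, cache_type))
                    .await
                    .map_err(|join_err| join_err.to_string())?
            }
        });
        // Fail fast: the first error from any cache type aborts the run.
        try_join_all(tasks).await?;
        Ok(())
    }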
Integrates the backfill mode into sdf startup:
- Parse CLI arguments into BackfillConfig
- When migration_mode is BackfillLayerCache, initialize the
  LayerCacheBackfiller coordinator and run it to completion
- Exit after backfill completes (no server startup)

This makes the backfill a one-shot operation invoked via:
  sdf --migration-mode BackfillLayerCache --backfill-cutoff-timestamp <ts>
Creates a comprehensive monitoring dashboard for the layer cache backfill process with metrics for items processed, uploaded, and skipped, along with throughput tracking and per-cache-type breakdowns.
Implements bounded concurrency (default: 5) for S3 uploads using JoinSet
pattern from the S3 queue processor. Each cache type spawns up to
max_concurrent_uploads tasks in parallel, significantly improving throughput
for I/O-bound backfill operations.

Key features:
- Configurable via --backfill-max-concurrent-uploads CLI arg and env var
- HashMap-based timestamp tracking ensures checkpoint reliability
- Tracks maximum (newest) in-flight timestamp to prevent holes
- Immediate shutdown without waiting for in-flight tasks (resumable)
- Updated checkpoint logs with active_uploads count
- Fail-fast error handling maintains data integrity

Performance impact: ~5x throughput improvement for I/O-bound workloads.
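A sketch of the bounded-concurrency pattern described above using tokio's JoinSet, with the upload body stubbed out and the checkpoint/metrics tracking elided:

    use tokio::task::JoinSet;

    async fn upload(key: String) -> Result<(), String> {
        println!("uploading {key}");
        Ok(())
    }

    async fn upload_bounded(
        keys: Vec<String>,
        max_concurrent_uploads: usize, // default 5 per this commit
    ) -> Result<(), String> {
        let mut in_flight = JoinSet::new();
        for key in keys {
            // Once the set is full, wait for one upload to finish before
            // spawning the next, capping parallelism at the limit.
            if in_flight.len() >= max_concurrent_uploads {
                in_flight
                    .join_next()
                    .await
                    .expect("set is non-empty")
                    .map_err(|join_err| join_err.to_string())??;
            }
            in_flight.spawn(upload(key));
        }
        // Drain remaining in-flight uploads, failing fast on any error.
        while let Some(result) = in_flight.join_next().await {
            result.map_err(|join_err| join_err.to_string())??;
        }
        Ok(())
    }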
@jhelwig force-pushed the jhelwig/eng-3292-backfill-s3-with-layer-cache-data-from-rds branch from 8e1d6ed to d6f80e2 on January 14, 2026 22:32
@zacharyhamm (Contributor) left a comment:
I did not test this, but it doesn't look dangerous to merge, and I read the backfill.rs code closely. Looks correct! One question here, but not a blocker.

.await?;

// Reverse so pop() gives us newest-first (query returns DESC order)
key_batch.reverse();
@zacharyhamm (Contributor):
Why not have fetch_key_batch return with ORDER BY ASC?

@jhelwig (Contributor, Author) replied Jan 14, 2026:
No good reason. That would be the better way to do it. I'm sure there will be revisions of this as we see exactly how it performs in the real world. I'll make a note to fix that in one of the revisions.
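For illustration, the suggested change might keep the DESC + LIMIT pagination (so each batch is still the newest N keys before the cutoff) while returning the batch oldest-first via a subquery, instead of reversing the Vec in Rust. The table and column names here are assumptions.

    // Hypothetical: each batch is still the newest N keys before the
    // cutoff (inner DESC + LIMIT), but rows come back oldest-first so
    // no reverse() is needed before pop().
    const FETCH_KEY_BATCH_ASC: &str = "\
        SELECT key, created_at FROM ( \
            SELECT key, created_at FROM cache_table \
             WHERE created_at < $1 \
             ORDER BY created_at DESC \
             LIMIT $2 \
        ) newest_batch ORDER BY created_at ASC";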

@jhelwig added this pull request to the merge queue Jan 14, 2026
Merged via the queue into main with commit 5480040 Jan 15, 2026
10 checks passed
@jhelwig deleted the jhelwig/eng-3292-backfill-s3-with-layer-cache-data-from-rds branch January 15, 2026 00:03