Jhelwig/eng 3292 backfill s3 with layer cache data from rds #8259
Conversation
Dependency Review: ✅ No vulnerabilities or OpenSSF Scorecard issues found. Scanned files: none.
Force-pushed from b165542 to b062109
Force-pushed from b062109 to 497f036
Force-pushed from 497f036 to 77d3e85
Enables triggering layer cache backfill operations via CLI as a one-time migration task. This mode will populate S3 with historical layer cache data from PostgreSQL, supporting the transition to S3-backed caching.
These arguments configure the backfill operation:
- cutoff_timestamp: Defines the time boundary for data migration
- cache_types: Specifies which caches to backfill (no "all" default, to prevent accidental migration of deprecated caches)
- key_batch_size: Controls memory usage during RDS queries
- checkpoint_interval_secs: Determines progress logging frequency
- max_concurrent_uploads: Limits S3 upload parallelism

The cutoff timestamp makes the backfill resumable: operators can restart from the last logged checkpoint if interrupted.
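As a reference for how these options might hang together, here is a rough sketch of a configuration struct; the field names and types are illustrative, based only on the argument descriptions above, not on the PR's actual definition.

```rust
use std::time::Duration;

use chrono::{DateTime, Utc};

/// Illustrative sketch of the backfill configuration; actual field names
/// and types in the PR may differ.
#[derive(Debug, Clone)]
pub struct BackfillConfig {
    /// Only entries inserted before this timestamp are migrated.
    pub cutoff_timestamp: DateTime<Utc>,
    /// Explicit list of caches to backfill; deliberately no "all" default.
    pub cache_types: Vec<String>,
    /// Number of keys fetched per RDS query, to bound memory usage.
    pub key_batch_size: usize,
    /// How often progress checkpoints are logged.
    pub checkpoint_interval: Duration,
    /// Upper bound on concurrent S3 uploads per cache type.
    pub max_concurrent_uploads: usize,
}
```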
Force-pushed from 77d3e85 to 8e1d6ed
Implement BackfillConfig::from_args() that parses and validates CLI arguments:
- Parse RFC 3339 timestamp formats (with/without timezone)
- Require a cache types list (no default to all)
- Validate cache type names against BACKFILL_CACHE_TYPES
- Convert the checkpoint interval to a Duration

BACKFILL_CACHE_TYPES contains only the 8 caches requiring backfill. Validating cache types against this list prevents backfilling deprecated caches (func_run, func_run_log) that are being migrated out of the layer cache architecture.
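A minimal sketch of the parsing and validation described here, assuming chrono for timestamp handling; the allow-list contents, the fallback timestamp format, and the error type are placeholders rather than the PR's real code.

```rust
use chrono::{DateTime, NaiveDateTime, Utc};

// Placeholder allow-list; the real BACKFILL_CACHE_TYPES names the 8 caches
// that actually require backfill.
const BACKFILL_CACHE_TYPES: &[&str] = &["cas" /* plus the other caches needing backfill */];

/// Parse an RFC 3339 timestamp, falling back to a timezone-less form that is
/// interpreted as UTC. The fallback format and String error are assumptions.
fn parse_cutoff(raw: &str) -> Result<DateTime<Utc>, String> {
    if let Ok(ts) = DateTime::parse_from_rfc3339(raw) {
        return Ok(ts.with_timezone(&Utc));
    }
    NaiveDateTime::parse_from_str(raw, "%Y-%m-%dT%H:%M:%S")
        .map(|naive| naive.and_utc())
        .map_err(|e| format!("invalid cutoff timestamp {raw}: {e}"))
}

/// Reject anything not on the allow-list; there is deliberately no "all"
/// default, so deprecated caches cannot be backfilled by accident.
fn validate_cache_types(requested: &[String]) -> Result<(), String> {
    for cache in requested {
        if !BACKFILL_CACHE_TYPES.contains(&cache.as_str()) {
            return Err(format!("unknown or deprecated cache type: {cache}"));
        }
    }
    Ok(())
}
```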
Extends LayerCache and S3Layer with methods to support backfilling:
- list_keys_before_timestamp(): Queries RDS for cache keys within the cutoff window, using batched queries to limit memory usage
- backfill_to_s3(): Uploads PostgreSQL-backed cache entries to S3, preserving the existing cas_address as the S3 key

These low-level primitives enable migrating historical data without modifying the existing write path or requiring dual-writes.
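The shape of these two primitives might look roughly like the following; the trait, the key and error placeholder types, and the exact signatures are assumptions, with only the method names taken from the commit message.

```rust
use chrono::{DateTime, Utc};

// Placeholder aliases standing in for the crate's real key and error types.
type CacheKey = String;
type LayerDbError = Box<dyn std::error::Error + Send + Sync>;

/// Sketch of the backfill primitives added to LayerCache / S3Layer.
trait BackfillPrimitives {
    /// Fetch up to `batch_size` keys (with their insertion timestamps) that
    /// were written before `cutoff`, newest first.
    async fn list_keys_before_timestamp(
        &self,
        cutoff: DateTime<Utc>,
        batch_size: usize,
    ) -> Result<Vec<(CacheKey, DateTime<Utc>)>, LayerDbError>;

    /// Copy a single PG-backed entry to S3, keeping the existing cas_address
    /// as the S3 object key.
    async fn backfill_to_s3(&self, key: &CacheKey) -> Result<(), LayerDbError>;
}
```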
Extracts cache-specific operations into type-safe helpers:
- get_layer_cache_for_type(): Maps cache type strings to LayerCache instances, encapsulating the ServicesContext access pattern
- should_skip_entry(): Checks if an entry already exists in S3, preventing redundant uploads during resumed backfills
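A small sketch of how such helpers could look, using placeholder context and S3 handle types since the real ServicesContext, LayerCache, and S3 client APIs are not shown here.

```rust
use std::collections::HashMap;

// Placeholder stand-ins for the crate's real ServicesContext, LayerCache,
// and S3 layer types; only the helper shapes are the point of this sketch.
struct LayerCacheHandle;
struct S3LayerHandle;
impl S3LayerHandle {
    async fn object_exists(&self, _key: &str) -> bool {
        false
    }
}
struct ServicesContextHandle {
    layer_caches: HashMap<String, LayerCacheHandle>,
}

/// Map a CLI-provided cache type string to the matching LayerCache handle.
fn get_layer_cache_for_type<'a>(
    ctx: &'a ServicesContextHandle,
    cache_type: &str,
) -> Option<&'a LayerCacheHandle> {
    ctx.layer_caches.get(cache_type)
}

/// Entries already present in S3 are skipped, which keeps resumed backfills
/// from re-uploading work that has already completed.
async fn should_skip_entry(s3: &S3LayerHandle, key: &str) -> bool {
    s3.object_exists(key).await
}
```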
Processes a single cache type's backfill in batches:
1. Query RDS for keys before the cutoff timestamp (paginated)
2. For each key, check S3 existence and upload if missing
3. Log checkpoints with the newest processed timestamp
4. Respect graceful shutdown signals between batches

The checkpoint timestamp represents the maximum (newest) timestamp processed, allowing operators to resume from that point if the backfill is interrupted. Skipping existing S3 entries makes the operation idempotent and resumable.
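Reusing the placeholder trait and helpers from the sketches above, the per-cache-type loop might look roughly like this; checkpoint logging and shutdown handling are only noted in comments, and the way the real code pages through batches may differ.

```rust
use chrono::{DateTime, Utc};

/// Rough sketch of the per-cache-type batch loop.
async fn backfill_one_cache(
    cache: &impl BackfillPrimitives,
    s3: &S3LayerHandle,
    mut cutoff: DateTime<Utc>,
    batch_size: usize,
) -> Result<(), LayerDbError> {
    loop {
        // 1. Page through keys older than the current cutoff (newest first).
        let batch = cache.list_keys_before_timestamp(cutoff, batch_size).await?;
        let Some((_, oldest_in_batch)) = batch.last().cloned() else {
            break; // nothing older than the cutoff remains
        };

        for (key, _inserted_at) in &batch {
            // 2. Upload only entries that are not already in S3, which keeps
            //    the operation idempotent and cheap to resume.
            if !should_skip_entry(s3, key).await {
                cache.backfill_to_s3(key).await?;
            }
        }

        // 3. and 4. The real implementation logs a checkpoint with the newest
        // processed timestamp and checks for graceful-shutdown signals between
        // batches; this sketch simply moves the cutoff past the current batch
        // to page further into the past.
        cutoff = oldest_in_batch;
    }
    Ok(())
}
```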
Orchestrates parallel backfill of multiple cache types by spawning one async task per cache type. Each task runs independently, allowing the backfill to make progress across cache types simultaneously. Uses try_join_all() to fail fast if any cache type encounters an error, ensuring data consistency. The coordinator initializes the ServicesContext once and shares it across all tasks via Arc cloning.
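As a sketch of that fan-out, the coordinator could spawn one task per cache type and short-circuit on the first failure roughly as follows, assuming the futures crate's try_join_all; `backfill_one_cache_type`, the services placeholder, and the error flattening are illustrative, not the PR's exact code.

```rust
use std::sync::Arc;

use futures::future::try_join_all;

/// Spawn one task per cache type, sharing the services context via Arc, and
/// fail fast if any cache type's backfill reports an error.
async fn run_backfill(
    services: Arc<ServicesContextHandle>,
    cache_types: Vec<String>,
) -> Result<(), LayerDbError> {
    let tasks = cache_types.into_iter().map(|cache_type| {
        let services = Arc::clone(&services);
        let handle = tokio::spawn(async move {
            backfill_one_cache_type(services, cache_type).await
        });
        // Flatten the JoinHandle so try_join_all can short-circuit on either
        // a task panic or a backfill error from any cache type.
        async move {
            match handle.await {
                Ok(result) => result,
                Err(join_err) => Err(Box::new(join_err) as LayerDbError),
            }
        }
    });
    try_join_all(tasks).await?;
    Ok(())
}

// Placeholder for the real per-cache-type backfill entry point.
async fn backfill_one_cache_type(
    _services: Arc<ServicesContextHandle>,
    _cache_type: String,
) -> Result<(), LayerDbError> {
    Ok(())
}
```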
Integrates the backfill mode into sdf startup:
- Parse CLI arguments into BackfillConfig
- When migration_mode is BackfillLayerCache, initialize the LayerCacheBackfiller coordinator and run it to completion
- Exit after backfill completes (no server startup)

This makes the backfill a one-shot operation invoked via:
sdf --migration-mode BackfillLayerCache --backfill-cutoff-timestamp <ts>
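Tying the sketches together, the startup branch might look something like this; the argument struct, the MigrationMode enum, and the coordinator wiring are placeholders inferred from the names above, not the PR's actual code.

```rust
use std::sync::Arc;

// Minimal placeholders so the fragment stands alone; the real sdf argument
// struct and migration-mode enum live elsewhere in the codebase.
enum MigrationMode {
    BackfillLayerCache,
    // other modes elided
}
struct SdfArgs {
    migration_mode: Option<MigrationMode>,
}

/// If the backfill migration mode was requested, run it to completion and
/// tell the caller to exit instead of continuing with normal server startup.
async fn maybe_run_backfill(
    args: &SdfArgs,
    services: Arc<ServicesContextHandle>,
    config: BackfillConfig,
) -> Result<bool, LayerDbError> {
    if matches!(args.migration_mode, Some(MigrationMode::BackfillLayerCache)) {
        // One-shot operation: run the coordinator from the previous sketch
        // to completion, then exit without starting the sdf server.
        run_backfill(services, config.cache_types).await?;
        return Ok(true);
    }
    Ok(false)
}
```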
Creates a comprehensive monitoring dashboard for the layer cache backfill process with metrics for items processed, uploaded, and skipped, along with throughput tracking and per-cache-type breakdowns.
Implements bounded concurrency (default: 5) for S3 uploads using the JoinSet pattern from the S3 queue processor. Each cache type spawns up to max_concurrent_uploads tasks in parallel, significantly improving throughput for I/O-bound backfill operations.

Key features:
- Configurable via --backfill-max-concurrent-uploads CLI arg and env var
- HashMap-based timestamp tracking ensures checkpoint reliability
- Tracks the maximum (newest) in-flight timestamp to prevent holes
- Immediate shutdown without waiting for in-flight tasks (resumable)
- Updated checkpoint logs with an active_uploads count
- Fail-fast error handling maintains data integrity

Performance impact: ~5x throughput improvement for I/O-bound workloads.
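The JoinSet-based bounded concurrency described here can be sketched as follows; the upload function and key type are placeholders, and the real code additionally tracks per-key timestamps for checkpointing and handles shutdown, which this sketch omits.

```rust
use tokio::task::JoinSet;

type UploadError = Box<dyn std::error::Error + Send + Sync>;

/// Upload a batch of keys with at most `max_concurrent_uploads` S3 uploads
/// in flight at once, failing fast on the first error.
async fn upload_batch(
    keys: Vec<String>,
    max_concurrent_uploads: usize,
) -> Result<(), UploadError> {
    let mut in_flight = JoinSet::new();
    for key in keys {
        // If the set is full, wait for one upload to finish before spawning
        // the next, keeping concurrency bounded.
        while in_flight.len() >= max_concurrent_uploads {
            if let Some(finished) = in_flight.join_next().await {
                finished??; // surface join errors and upload errors
            }
        }
        in_flight.spawn(async move {
            // Placeholder for the real S3 upload of this cache entry.
            upload_to_s3(&key).await
        });
    }
    // Drain any remaining in-flight uploads before reporting completion.
    while let Some(finished) = in_flight.join_next().await {
        finished??;
    }
    Ok(())
}

// Placeholder upload; the real implementation writes the PG-backed entry
// to S3 under its existing cas_address.
async fn upload_to_s3(_key: &str) -> Result<(), UploadError> {
    Ok(())
}
```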
Force-pushed from 8e1d6ed to d6f80e2
zacharyhamm left a comment:
I did not test this, but it doesn't look dangerous to merge, and I read the backfill.rs code closely. Looks correct! One question on here, but not a blocker
```rust
    .await?;

// Reverse so pop() gives us newest-first (query returns DESC order)
key_batch.reverse();
```
why not have fetch_key_batch return with order by ASC?
No good reason. That would be the better way to do it. I'm sure there will be revisions of this as we see exactly how it performs in the real world. I'll make a note to fix that in one of the revisions.
This adds a new "migration mode" that will copy items from the PG-backed layer cache storage to the S3-backed storage, "backfilling" items that were inserted into the layer cache before services were configured to run in any of the persister modes that would write to S3 (possibly in addition to PG).

The backfill can be run for a subset of caches using --backfill-cache-types (comma-separated list), and will check for items inserted into the PG-backed storage before the timestamp provided by --backfill-cutoff-timestamp. Items are processed newest-to-oldest, and the backfill process outputs what the "current" timestamp is (how far into the past it has progressed in the copy process) for each cache every 30 seconds. This allows safe resuming of the backfill process with minimal re-processing, by providing a --backfill-cutoff-timestamp with minimal overlap with the most recently logged one(s) for the selected cache(s).

The backfill also records a set of metrics, visualized by a new Grafana dashboard, to aid in monitoring the progress of the backfill per cache.
From a local test run: