perf: parallel tar extraction with 64 worker goroutines #218
joshfriend wants to merge 3 commits into jfriend/in-process-zstd
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9e4724e69f
Force-pushed b156d0f to e70795e
Force-pushed 6a22063 to 65df4f6
The single-threaded bottleneck on restore was writing files to disk. Even though tar entries must be read sequentially (the format has no index), the actual file writes are independent. The extractor now dispatches each entry (buffered in memory, ≤4 MiB) to one of 64 worker goroutines that write concurrently. This hides the per-file syscall latency (~20µs × N files) behind parallelism, turning a 7+ second sequential write phase into one that completes within the download window.

Large files (>4 MiB) are written inline by the main goroutine to keep peak memory bounded. Benchmarked on r8id.metal-48xlarge (NVMe, 96 cores) with a 334K-file bundle: 64 workers completes in ~6.3s vs ~13s sequential.
Force-pushed 334913c to 4bc9e83
…tar extraction: add tar.TypeLink support for hardlinks (git local clones create hardlinked pack objects) and use the tar header mode for directories instead of a hardcoded 0o750.
Force-pushed 4bc9e83 to 15f1526
// Benchmarked on r8id.metal-48xlarge (NVMe, 96 cores) with a 334K-file
// bundle: 64 workers = 6.27s, 128 = 6.84s (extra GC pressure outweighs
// any I/O concurrency gain).
extractWorkers = 64
This can't be static, for a couple of reasons:
- not all machines have this many cores
- quite a few other processes run in parallel, so consuming every core will trigger k8s throttling

This needs to be threaded through as a config value.
// stalled waiting for individual file writes to complete.
// Benchmarked on r8id.metal-48xlarge (NVMe, 96 cores) with a 334K-file
// bundle: 64 workers = 6.27s, 128 = 6.84s (extra GC pressure outweighs
// any I/O concurrency gain).
This doesn't include the pre-parallel benchmark results; what were the original numbers?
jobs := make(chan writeJob, extractWorkers*2)

var (
	wg sync.WaitGroup
Change this to an errgroup.WithContext(ctx) so that (a) cancellation kills the group and (b) any worker error cancels the rest.
var readErr error
loop:
for {
	if err := ctx.Err(); err != nil {
Pull this loop body out into a separate function. That lets the function return normal errors and simplifies the code significantly.
Replace the single GetObject stream in S3.Open with parallel range-GET requests for objects larger than 32 MiB. 8 workers download chunks concurrently and reassemble them in order via io.Pipe, multiplying S3 throughput for cold snapshot downloads (observed 32 MB/s single-stream vs expected ~250+ MB/s with parallel connections).
Closing this: the parallel tar extraction and in-process zstd changes are superseded. During playpen testing we found that Go klauspost zstd decompression through …

The parallel S3 range-GET commit is still valuable and has been cherry-picked into a standalone PR.
Depends on #217.