
perf: parallel tar extraction with 64 worker goroutines#218

Closed
joshfriend wants to merge 3 commits into jfriend/in-process-zstd from
jfriend/parallel-tar-extraction

Conversation

@joshfriend
Contributor

The single-threaded bottleneck on restore was writing files to disk. Even though tar entries must be read sequentially (the format has no index), the actual file writes are independent. The extractor now dispatches each entry (buffered in memory, ≤4 MiB) to one of 64 worker goroutines that write concurrently. This hides the per-file syscall latency (~20µs × N files) behind parallelism.

Large files (>4 MiB) are written inline by the main goroutine to keep peak memory bounded. Benchmarked on r8id.metal-48xlarge (NVMe, 96 cores) with a 334K-file bundle: the 64-worker version completes in ~6.3s vs ~13s sequential.

Depends on #217.

@joshfriend joshfriend requested a review from a team as a code owner March 23, 2026 18:23
@joshfriend joshfriend requested review from worstell and removed request for a team March 23, 2026 18:23

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9e4724e69f


@joshfriend joshfriend force-pushed the jfriend/parallel-tar-extraction branch 2 times, most recently from b156d0f to e70795e Compare March 23, 2026 18:57
@joshfriend joshfriend force-pushed the jfriend/in-process-zstd branch 2 times, most recently from 6a22063 to 65df4f6 Compare March 23, 2026 19:01
The single-threaded bottleneck on restore was writing files to disk.
Even though tar entries must be read sequentially (the format has no
index), the actual file writes are independent. The extractor now
dispatches each entry (buffered in memory, ≤4 MiB) to one of 64 worker
goroutines that write concurrently. This hides the per-file syscall
latency (~20µs × N files) behind parallelism, turning a 7+ second
sequential write phase into one that completes within the download
window.

Large files (>4 MiB) are written inline by the main goroutine to keep
peak memory bounded.

Benchmarked on r8id.metal-48xlarge (NVMe, 96 cores) with a 334K-file
bundle: 64 workers complete in ~6.3s vs ~13s sequential.
@joshfriend joshfriend force-pushed the jfriend/parallel-tar-extraction branch 3 times, most recently from 334913c to 4bc9e83 Compare March 23, 2026 19:18
…tar extraction

Add tar.TypeLink support for hardlinks (git local clones create hardlinked
pack objects) and use tar header mode for directories instead of hardcoded 0o750.
@joshfriend joshfriend force-pushed the jfriend/parallel-tar-extraction branch from 4bc9e83 to 15f1526 Compare March 23, 2026 19:23
@joshfriend joshfriend marked this pull request as draft March 23, 2026 21:06
// Benchmarked on r8id.metal-48xlarge (NVMe, 96 cores) with a 334K-file
// bundle: 64 workers = 6.27s, 128 = 6.84s (extra GC pressure outweighs
// any I/O concurrency gain).
extractWorkers = 64
Collaborator


This can't be static for several reasons:

  • not all machines have this many cores
  • there are quite a few other processes running in parallel, so consuming all cores will result in k8s throttling

This will need to be threaded through a config value.

// stalled waiting for individual file writes to complete.
// Benchmarked on r8id.metal-48xlarge (NVMe, 96 cores) with a 334K-file
// bundle: 64 workers = 6.27s, 128 = 6.84s (extra GC pressure outweighs
// any I/O concurrency gain).
Collaborator


This doesn't include the pre-parallel benchmark results; what were the original numbers?

jobs := make(chan writeJob, extractWorkers*2)

var (
	wg sync.WaitGroup
Collaborator


Change this to an errgroup.WithContext(ctx) so that a) cancellation kills the errgroup and b) any error kills the errgroup.

var readErr error
loop:
for {
	if err := ctx.Err(); err != nil {
Collaborator


Pull this loop body out into a separate function. This will let the function just return normal errors and simplify the code significantly.

Replace the single GetObject stream in S3.Open with parallel range-GET
requests for objects larger than 32 MiB. 8 workers download chunks
concurrently and reassemble them in order via io.Pipe, multiplying S3
throughput for cold snapshot downloads (observed 32 MB/s single-stream
vs expected ~250+ MB/s with parallel connections).
@joshfriend
Contributor Author

joshfriend commented Mar 24, 2026

Closing this — the parallel tar extraction and in-process zstd changes are superseded. During playpen testing we found that zstd decompression in Go (klauspost) through archive/tar is ~30x slower than native zstd | tar for git mirror snapshots containing multi-GB packfiles (5+ min vs 8.5s for android-register's 6.8 GB pack). See #217 for the full explanation.

The parallel S3 range-GET commit is still valuable and has been cherry-picked into a standalone PR.
