perf: parallel tar extraction with 64 worker goroutines #218
joshfriend wants to merge 3 commits into jfriend/in-process-zstd
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9e4724e69f
Force-pushed b156d0f to e70795e
Force-pushed 6a22063 to 65df4f6
The single-threaded bottleneck on restore was writing files to disk. Even though tar entries must be read sequentially (the format has no index), the actual file writes are independent. The extractor now dispatches each entry (buffered in memory, ≤4 MiB) to one of 64 worker goroutines that write concurrently. This hides the per-file syscall latency (~20µs × N files) behind parallelism, turning a 7+ second sequential write phase into one that completes within the download window.

Large files (>4 MiB) are written inline by the main goroutine to keep peak memory bounded. Benchmarked on r8id.metal-48xlarge (NVMe, 96 cores) with a 334K-file bundle: 64 workers completes in ~6.3s vs ~13s sequential.
Force-pushed 334913c to 4bc9e83
…tar extraction: add tar.TypeLink support for hardlinks (git local clones create hardlinked pack objects) and use the tar header mode for directories instead of a hardcoded 0o750.
Force-pushed 4bc9e83 to 15f1526
// Benchmarked on r8id.metal-48xlarge (NVMe, 96 cores) with a 334K-file
// bundle: 64 workers = 6.27s, 128 = 6.84s (extra GC pressure outweighs
// any I/O concurrency gain).
extractWorkers = 64
This can't be static, for a couple of reasons:
- not all machines have this many cores
- quite a few other processes run in parallel, so consuming every core will trigger k8s throttling

This needs to be threaded through as a config value.
// stalled waiting for individual file writes to complete.
// Benchmarked on r8id.metal-48xlarge (NVMe, 96 cores) with a 334K-file
// bundle: 64 workers = 6.27s, 128 = 6.84s (extra GC pressure outweighs
// any I/O concurrency gain).
This doesn't include the pre-parallel benchmark results; what were the original numbers?
jobs := make(chan writeJob, extractWorkers*2)

var (
	wg sync.WaitGroup
Change this to an errgroup.WithContext(ctx) so that (a) cancellation kills the group and (b) any worker error cancels the rest.
var readErr error
loop:
for {
	if err := ctx.Err(); err != nil {
Pull this loop body out into a separate function. That lets the function return normal errors and simplifies the code significantly.
Replace the single GetObject stream in S3.Open with parallel range-GET requests for objects larger than 32 MiB. 8 workers download chunks concurrently and reassemble them in order via io.Pipe, multiplying S3 throughput for cold snapshot downloads (observed 32 MB/s single-stream vs expected ~250+ MB/s with parallel connections).
Closing this: the parallel tar extraction and in-process zstd changes are superseded. During playpen testing we found that Go klauspost zstd decompression through …

The parallel S3 range-GET commit is still valuable and has been cherry-picked into a standalone PR.
Depends on #217.