jasnell commented Jan 6, 2026

The next experimental step in improving the performance of the streams pump-to path, specifically the ReadableSourceKjAdapter pump-to: we implement an experimental "draining read" that consumes as much data as possible synchronously on each read. The results are promising.

We will optimize the other cases (e.g. KJ-readable-to-JS-writable) in a separate PR. This one is specifically looking to improve the JS-readable-to-KJ-writable case.
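To make the "draining read" idea concrete, here is a minimal sketch over an in-memory queue. It is not the workerd implementation: the `Chunk` queue, `readOne`, `drainingRead`, `countReads`, and the `maxBytes` budget are all hypothetical names chosen for illustration, and the one-chunk-per-read function is only a simple point of comparison.

```typescript
// Hypothetical in-memory model of a stream's internal queue; not workerd code.
type Chunk = Uint8Array;

interface DrainResult {
  chunks: Chunk[]; // everything taken by this read
  bytes: number;   // total bytes taken by this read
  done: boolean;   // true when the queue is empty afterwards
}

// One chunk per read, used here as a simple point of comparison.
function readOne(queue: Chunk[]): DrainResult {
  const chunk = queue.shift();
  const chunks: Chunk[] = chunk === undefined ? [] : [chunk];
  const bytes = chunk === undefined ? 0 : chunk.byteLength;
  return { chunks, bytes, done: queue.length === 0 };
}

// Draining read: keep taking already-buffered chunks synchronously until the
// queue is empty or an (assumed) byte budget has been reached.
function drainingRead(queue: Chunk[], maxBytes: number = Infinity): DrainResult {
  const chunks: Chunk[] = [];
  let bytes = 0;
  while (queue.length > 0 && bytes < maxBytes) {
    const chunk = queue.shift() as Chunk;
    chunks.push(chunk);
    bytes += chunk.byteLength;
  }
  return { chunks, bytes, done: queue.length === 0 };
}

// Count how many reads it takes to empty a freshly built queue.
function countReads(
  makeQueue: () => Chunk[],
  read: (queue: Chunk[]) => DrainResult,
): number {
  const queue = makeQueue();
  let reads = 0;
  while (queue.length > 0) {
    read(queue);
    reads++;
  }
  return reads;
}

// 100 chunks of 256 bytes, the same shape as the "Small" benchmark input.
const makeQueue = (): Chunk[] =>
  Array.from({ length: 100 }, () => new Uint8Array(256));

const perChunkReads = countReads(makeQueue, readOne);               // 100
const drainingReads = countReads(makeQueue, (q) => drainingRead(q)); // 1
```

In this toy model the draining variant empties the queue in a single read, which is the effect the benchmarks are probing; how much data is actually available synchronously in a real stream depends on the source and its highWaterMark.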

Claude did most of the work here under supervision.

  Value Streams (Default highWaterMark=0)

  | Benchmark        | New (µs) | Existing (µs) | Speedup | Throughput New | Throughput Existing |
  |------------------|----------|---------------|---------|----------------|---------------------|
  | Tiny (64B×256)   | 1,555    | 2,796         | 1.8x    | 10.1 MB/s      | 5.6 MB/s            |
  | Small (256B×100) | 612      | 1,047         | 1.7x    | 40.0 MB/s      | 23.4 MB/s           |
  | Medium (4KB×100) | 667      | 1,110         | 1.7x    | 589 MB/s       | 354 MB/s            |
  | Large (64KB×16)  | 198      | 262           | 1.3x    | 5.0 GB/s       | 3.8 GB/s            |

  Value Streams (highWaterMark=16KB)

  | Benchmark        | New (µs) | Existing (µs) | Speedup | Throughput New | Throughput Existing |
  |------------------|----------|---------------|---------|----------------|---------------------|
  | Tiny (64B×256)   | 480      | 2,621         | 5.5x    | 32.7 MB/s      | 6.0 MB/s            |
  | Small (256B×100) | 195      | 1,024         | 5.2x    | 125 MB/s       | 23.9 MB/s           |
  | Medium (4KB×100) | 246      | 1,089         | 4.4x    | 1.56 GB/s      | 361 MB/s            |
  | Large (64KB×16)  | 168      | 248           | 1.5x    | 6.1 GB/s       | 4.0 GB/s            |

  Byte Streams (Default highWaterMark=0)

  | Benchmark        | New (µs) | Existing (µs) | Speedup | Throughput New | Throughput Existing |
  |------------------|----------|---------------|---------|----------------|---------------------|
  | Tiny (64B×256)   | 1,533    | 3,047         | 2.0x    | 10.2 MB/s      | 5.2 MB/s            |
  | Small (256B×100) | 617      | 1,231         | 2.0x    | 39.6 MB/s      | 19.9 MB/s           |
  | Medium (4KB×100) | 658      | 1,274         | 1.9x    | 597 MB/s       | 309 MB/s            |
  | Large (64KB×16)  | 192      | 2,781         | 14.5x   | 5.1 GB/s       | 363 MB/s            |

  Byte Streams (highWaterMark=16KB)

  | Benchmark        | New (µs) | Existing (µs) | Speedup | Throughput New | Throughput Existing |
  |------------------|----------|---------------|---------|----------------|---------------------|
  | Tiny (64B×256)   | 530      | 583           | 1.1x    | 29.6 MB/s      | 26.9 MB/s           |
  | Small (256B×100) | 223      | 306           | 1.4x    | 110 MB/s       | 80.1 MB/s           |
  | Medium (4KB×100) | 340      | 1,317         | 3.9x    | 1.13 GB/s      | 299 MB/s            |
  | Large (64KB×16)  | 186      | 2,706         | 14.5x   | 5.3 GB/s       | 372 MB/s            |

  Byte Streams with autoAllocateChunkSize=64KB

  | Benchmark        | New (µs) | Existing (µs) | Speedup | Throughput New | Throughput Existing |
  |------------------|----------|---------------|---------|----------------|---------------------|
  | Tiny (64B×256)   | 1,582    | 3,967         | 2.5x    | 9.9 MB/s       | 4.0 MB/s            |
  | Small (256B×100) | 627      | 1,556         | 2.5x    | 39.0 MB/s      | 15.8 MB/s           |
  | Medium (4KB×100) | 677      | 1,590         | 2.3x    | 580 MB/s       | 248 MB/s            |
  | Large (64KB×16)  | 189      | 354           | 1.9x    | 5.2 GB/s       | 2.8 GB/s            |

  Async Streams (Microtask delays - SlowValue)

  | Benchmark        | New (µs) | Existing (µs) | Speedup |
  |------------------|----------|---------------|---------|
  | Small (256B×100) | 858      | 1,164         | 1.4x    |
  | Medium (4KB×100) | 930      | 1,182         | 1.3x    |

  I/O Latency Streams (KJ event loop yields)

  | Benchmark    | New (µs) | Existing (µs) | Speedup |
  |--------------|----------|---------------|---------|
  | Small Value  | 1,042    | 1,361         | 1.3x    |
  | Medium Value | 1,111    | 1,397         | 1.3x    |
  | Large Value  | 269      | 308           | 1.1x    |
  | Small Byte   | 1,346    | 1,557         | 1.2x    |
  | Medium Byte  | 1,362    | 1,606         | 1.2x    |
  | Large Byte   | 247      | 2,796         | 11.3x   |

  Timed Streams (Real timer delays)

  | Benchmark          | New (µs) | Existing (µs) | Speedup |
  |--------------------|----------|---------------|---------|
  | Small 10µs delay   | 3,474    | 1,411         | 0.4x ⚠️ |
  | Small 100µs delay  | 108,170  | 105,873       | 1.0x    |
  | Small 1ms delay    | 110,401  | 111,430       | 1.0x    |
  | Medium 100µs delay | 105,322  | 104,661       | 1.0x    |

  ---
  Summary

  | Category            | Scenarios                               | Result              |
  |---------------------|-----------------------------------------|---------------------|
  | Big Wins (>2x)      | Tiny/Small with HWM, Large Byte streams | 2x - 14.5x faster   |
  | Solid Wins (1.3-2x) | Most sync streams, I/O latency          | 1.3x - 2x faster    |
  | Neutral (~1x)       | Timed 100µs+, some HWM byte streams     | Similar performance |
  | Regression          | Timed 10µs only                         | 0.4x (2.5x slower)  |

The one perf regression (Timed 10µs) is in an artificial scenario.
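For anyone re-deriving the table columns, the arithmetic can be sketched as follows. The assumptions are mine: speedup is taken as existing time divided by new time, the "MB/s" figures appear to use binary units (1 MB = 1024 × 1024 bytes), and the bucket thresholds in `classify` are read off the Summary table, with the 0.9 lower bound for "Neutral" being a guess. Published figures come from averaged runs, so recomputed values only approximately match.

```typescript
// Speedup as the tables appear to define it: existing time / new time.
function speedup(existingMicros: number, newMicros: number): number {
  return existingMicros / newMicros;
}

// Throughput in MiB/s, assuming the tables' "MB/s" means 1024 * 1024 bytes.
function throughputMiBps(
  chunkBytes: number,
  chunkCount: number,
  micros: number,
): number {
  const totalBytes = chunkBytes * chunkCount;
  const seconds = micros / 1e6;
  return totalBytes / seconds / (1024 * 1024);
}

// Buckets modeled on the Summary table; the 0.9 cutoff is an assumption.
function classify(s: number): string {
  if (s > 2) return "Big Win";
  if (s >= 1.3) return "Solid Win";
  if (s >= 0.9) return "Neutral";
  return "Regression";
}

// Tiny (64B×256), value stream, default highWaterMark.
const tinySpeedup = speedup(2796, 1555);               // ≈ 1.8
const tinyThroughput = throughputMiBps(64, 256, 1555); // ≈ 10.05 MiB/s

// Small 10µs delay, the one regression.
const timedSpeedup = speedup(1411, 3474);              // ≈ 0.4
const timedBucket = classify(timedSpeedup);            // "Regression"
```

The recomputed Tiny throughput lands close to the 10.1 MB/s listed in the first table, which is why binary units are assumed here.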

codspeed-hq bot commented Jan 6, 2026

CodSpeed Performance Report

Merging this PR will degrade performance by 37.84%

Comparing jasnell/streams-draining-read (0ac80c4) with main (9825f50)

Summary

⚡ 16 improved benchmarks
❌ 17 regressed benchmarks
✅ 96 untouched benchmarks
🆕 19 new benchmarks
⏩ 49 skipped benchmarks [1]

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Benchmark                            | BASE    | HEAD     | Efficiency |
|--------------------------------------|---------|----------|------------|
| New_Small_SlowValue                  | 2.5 ms  | 4.1 ms   | -37.84%    |
| New_Small_Timed100us                 | 3.6 ms  | 5.1 ms   | -29.5%     |
| New_Small_Timed10us                  | 3.6 ms  | 5.1 ms   | -29.41%    |
| New_Tiny_Value_HWM16K                | 4.7 ms  | 2.5 ms   | +83.07%    |
| New_Tiny_Value                       | 4.7 ms  | 7.3 ms   | -35.77%    |
| New_Small_Timed1ms                   | 3.6 ms  | 5.1 ms   | -29.49%    |
| 🆕 New_LargeStream_Value             | N/A     | 293.1 ms | N/A        |
| New_Small_Value_HWM16K               | 1.9 ms  | 1.1 ms   | +69.71%    |
| New_Tiny_Byte_Auto64K                | 23.3 ms | 7.5 ms   | ×3.1       |
| New_Large_Byte                       | 6.9 ms  | 3.3 ms   | ×2.1       |
| New_Large_Byte_HWM16K                | 6.9 ms  | 3.3 ms   | ×2.1       |
| Encode_ASCII_32[TextEncoder][0/0/32] | 3.3 ms  | 3 ms     | +12.37%    |
| New_Large_IoLatencyByte              | 7.2 ms  | 3.6 ms   | ×2         |
| New_Large_Byte_Auto64K_HWM16K        | 4 ms    | 3.3 ms   | +21.15%    |
| New_Large_Byte_Auto64K               | 3.9 ms  | 3.3 ms   | +20.02%    |
| New_Large_IoLatencyValue             | 2.9 ms  | 3.6 ms   | -20.24%    |
| New_Large_Value                      | 2.6 ms  | 3.3 ms   | -19.83%    |
| New_Large_Value_HWM16K               | 2.7 ms  | 3.4 ms   | -22.01%    |
| New_Medium_IoLatencyByte             | 5.4 ms  | 6.4 ms   | -16.6%     |
| New_Medium_Byte_Auto64K              | 10.1 ms | 3.7 ms   | ×2.7       |
| ...                                  | ...     | ...      | ...        |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Footnotes

  1. 49 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

jasnell force-pushed the jasnell/streams-draining-read branch from 3e3b2fc to a580f71 on January 7, 2026 22:49
jasnell marked this pull request as ready for review January 7, 2026 22:56
jasnell requested review from a team as code owners January 7, 2026 22:56

jasnell commented Jan 8, 2026

This now also includes a needed improvement to the handling of the autoAllocateChunkSize option on byte-oriented streams. Previously, we defaulted to a 4 KB buffer when autoAllocateChunkSize was not specified at all, which added overhead in a number of ways. A new compat flag removes that default and turns the read into a "default" read with a 16 KB buffer. The perf improvements are measurable in the benchmarks.
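The decision described above can be sketched roughly as follows. This is only a model of the comment's wording, not the actual code: `ReadPlan`, `resolveReadPlan`, and the `flagEnabled` parameter (standing in for the new compat flag, whose real name is not given here) are hypothetical, and treating an explicit autoAllocateChunkSize as a BYOB-style read of that size is an assumption.

```typescript
// Hypothetical description of how a single read would be issued.
type ReadPlan =
  | { mode: "byob"; bufferSize: number }
  | { mode: "default"; bufferSize: number };

const LEGACY_DEFAULT_BYTES = 4 * 1024; // 4 KB default described as the old behavior
const DEFAULT_READ_BYTES = 16 * 1024;  // 16 KB buffer described for the new behavior

function resolveReadPlan(
  autoAllocateChunkSize: number | undefined,
  flagEnabled: boolean, // stand-in for the new compat flag (real name not given)
): ReadPlan {
  if (autoAllocateChunkSize !== undefined) {
    // Assumption: an explicitly requested size is honored with or without the flag.
    return { mode: "byob", bufferSize: autoAllocateChunkSize };
  }
  return flagEnabled
    ? { mode: "default", bufferSize: DEFAULT_READ_BYTES }
    : { mode: "byob", bufferSize: LEGACY_DEFAULT_BYTES };
}

const legacyPlan = resolveReadPlan(undefined, false);    // byob, 4096
const flaggedPlan = resolveReadPlan(undefined, true);    // default, 16384
const explicitPlan = resolveReadPlan(64 * 1024, true);   // byob, 65536
```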

  Value Streams (unaffected by autoAllocateChunkSize change)

  | Benchmark           | New                 | Existing            | Speedup | New WriteOps | Existing WriteOps |
  |---------------------|---------------------|---------------------|---------|--------------|-------------------|
  | Tiny_Value          | 1428 μs (10.9 MB/s) | 2689 μs (5.8 MB/s)  | 1.9×    | 0.56         | 0.99              |
  | Tiny_Value_HWM16K   | 444 μs (35.2 MB/s)  | 2535 μs (6.2 MB/s)  | 5.7×    | 0.0007       | 0.95              |
  | Small_Value         | 569 μs (43.0 MB/s)  | 1028 μs (23.8 MB/s) | 1.8×    | 0.08         | 0.16              |
  | Small_Value_HWM16K  | 180 μs (135.7 MB/s) | 1010 μs (24.2 MB/s) | 5.6×    | 0.0003       | 0.15              |
  | Medium_Value        | 599 μs (655 MB/s)   | 1061 μs (370 MB/s)  | 1.8×    | 0.09         | 0.16              |
  | Medium_Value_HWM16K | 246 μs (1.57 GB/s)  | 1025 μs (383 MB/s)  | 4.2×    | 0.0003       | 0.15              |
  | Large_Value         | 180 μs (5.4 GB/s)   | 252 μs (3.9 GB/s)   | 1.4×    | 0.004        | 0.006             |
  | LargeStream_Value   | 59.9 ms (40.9 MB/s) | 120 ms (20.6 MB/s)  | 2.0×    | 1000         | 2000              |

  Byte Streams (affected by spec-compliant autoAllocateChunkSize)

  | Benchmark          | New                 | Existing            | Speedup | New WriteOps | Existing WriteOps |
  |--------------------|---------------------|---------------------|---------|--------------|-------------------|
  | Tiny_Byte          | 1497 μs (10.5 MB/s) | 2980 μs (5.3 MB/s)  | 2.0×    | 0.57         | 1.11              |
  | Tiny_Byte_HWM16K   | 489 μs (32.1 MB/s)  | 536 μs (29.2 MB/s)  | 1.1×    | 0.0007       | 0.004             |
  | Small_Byte         | 590 μs (41.4 MB/s)  | 1171 μs (20.9 MB/s) | 2.0×    | 0.09         | 0.18              |
  | Small_Byte_HWM16K  | 205 μs (119 MB/s)   | 290 μs (84.5 MB/s)  | 1.4×    | 0.0006       | 0.004             |
  | Medium_Byte        | 655 μs (599 MB/s)   | 1265 μs (311 MB/s)  | 1.9×    | 0.09         | 0.18              |
  | Medium_Byte_HWM16K | 350 μs (1.1 GB/s)   | 1220 μs (322 MB/s)  | 3.5×    | 0.01         | 0.18              |
  | Large_Byte         | 189 μs (5.2 GB/s)   | 2697 μs (374 MB/s)  | 14.3×   | 0.004        | 0.99              |
  | Large_Byte_HWM16K  | 182 μs (5.4 GB/s)   | 2671 μs (377 MB/s)  | 14.7×   | 0.004        | 0.99              |

  Byte Streams with Explicit autoAllocateChunkSize=64KB

  | Benchmark           | New                 | Existing            | Speedup | New WriteOps | Existing WriteOps |
  |---------------------|---------------------|---------------------|---------|--------------|-------------------|
  | Tiny_Byte_Auto64K   | 1457 μs (10.7 MB/s) | 3796 μs (4.1 MB/s)  | 2.6×    | 0.56         | 1.39              |
  | Small_Byte_Auto64K  | 577 μs (42.4 MB/s)  | 1484 μs (16.6 MB/s) | 2.6×    | 0.09         | 0.21              |
  | Medium_Byte_Auto64K | 631 μs (622 MB/s)   | 1530 μs (257 MB/s)  | 2.4×    | 0.09         | 0.21              |
  | Large_Byte_Auto64K  | 189 μs (5.2 GB/s)   | 345 μs (2.8 GB/s)   | 1.8×    | 0.004        | 0.008             |

  I/O Latency Streams

  | Benchmark             | New                 | Existing            | Speedup | New WriteOps | Existing WriteOps |
  |-----------------------|---------------------|---------------------|---------|--------------|-------------------|
  | Small_IoLatencyValue  | 1025 μs (23.8 MB/s) | 1358 μs (18.0 MB/s) | 1.3×    | 0.15         | 0.20              |
  | Medium_IoLatencyValue | 1060 μs (370 MB/s)  | 1416 μs (277 MB/s)  | 1.3×    | 0.16         | 0.21              |
  | Large_IoLatencyValue  | 266 μs (3.7 GB/s)   | 299 μs (3.3 GB/s)   | 1.1×    | 0.006        | 0.007             |
  | Small_IoLatencyByte   | 1234 μs (19.9 MB/s) | 1569 μs (15.6 MB/s) | 1.3×    | 0.18         | 0.22              |
  | Medium_IoLatencyByte  | 1279 μs (308 MB/s)  | 1603 μs (245 MB/s)  | 1.3×    | 0.19         | 0.25              |
  | Large_IoLatencyByte   | 247 μs (4.0 GB/s)   | 2787 μs (362 MB/s)  | 11.3×   | 0.006        | 0.97              |

  Key Takeaways

  1. Massive improvement for large byte streams: 14× faster for Large_Byte due to spec-compliant 16KB DEFAULT reads vs legacy 4KB BYOB reads
  2. Roughly 2× improvement for small/medium byte streams at the default highWaterMark (1.9×–2.6×); the HWM16K and I/O latency variants range from 1.3× to 3.5×
  3. WriteOps dramatically reduced: New approach coalesces writes much better (e.g., Large_Byte: 0.004 vs 0.99 writes per iteration)
  4. HWM16K configurations: New approach benefits significantly from highWaterMark buffering
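The write-coalescing effect in takeaway 3 can be illustrated with a small sketch. Everything here is hypothetical (`CountingSink`, `writeEach`, `writeCoalesced`, the `maxBatchBytes` limit), and copying chunks into one buffer is just the simplest way to show the idea; the real code may batch writes differently, for example with vectored writes.

```typescript
// Concatenate several chunks into one contiguous buffer.
function concat(chunks: Uint8Array[]): Uint8Array {
  const total = chunks.reduce((n, c) => n + c.byteLength, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) {
    out.set(c, offset);
    offset += c.byteLength;
  }
  return out;
}

// Hypothetical sink that only records how many writes and bytes it received.
class CountingSink {
  writes = 0;
  bytes = 0;
  write(data: Uint8Array): void {
    this.writes++;
    this.bytes += data.byteLength;
  }
}

// One write per chunk.
function writeEach(chunks: Uint8Array[], sink: CountingSink): void {
  for (const c of chunks) sink.write(c);
}

// Group chunks into batches of at most maxBatchBytes, one write per batch.
// A single chunk larger than the limit is still written on its own.
function writeCoalesced(
  chunks: Uint8Array[],
  sink: CountingSink,
  maxBatchBytes: number,
): void {
  let batch: Uint8Array[] = [];
  let batchBytes = 0;
  for (const c of chunks) {
    if (batch.length > 0 && batchBytes + c.byteLength > maxBatchBytes) {
      sink.write(concat(batch));
      batch = [];
      batchBytes = 0;
    }
    batch.push(c);
    batchBytes += c.byteLength;
  }
  if (batch.length > 0) sink.write(concat(batch));
}

// 256 chunks of 64 bytes, the same shape as the "Tiny" benchmark input.
const tinyChunks = Array.from({ length: 256 }, () => new Uint8Array(64));

const eachSink = new CountingSink();
writeEach(tinyChunks, eachSink);                      // 256 writes

const coalescedSink = new CountingSink();
writeCoalesced(tinyChunks, coalescedSink, 16 * 1024); // 1 write, same bytes
```

Both sinks receive the same 16,384 bytes; only the number of write calls differs, which is the quantity the WriteOps columns track.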

jasnell requested review from guybedford and mikea January 8, 2026 00:32