
adapter: Optimize COPY FROM STDIN, parallelize it, use constant memory #35036

Open: def- wants to merge 10 commits into MaterializeInc:main from def-:pr-optimize-copy-from-stdin

Conversation

@def- def- (Contributor) commented Feb 16, 2026

Tried with 100 million rows in https://github.com/def-/ClickBench/tree/pr-materialize/materialize. Before this PR the ingestion took 101 min, went OOM if the files were not split, and produced query errors because the results were too large. With this PR the 100 million row ingestion runs in 4:38 min on my dev server (8 cores) and should scale roughly linearly with the number of cores. For reference, COPY WITH FREEZE in PostgreSQL takes 20 min.

Example test run of the new benchmark in CI with 10 million rows: https://buildkite.com/materialize/release-qualification/builds/1089#019c68b4-4b3f-424f-b3c4-05518e95f1a0

NAME                                | TYPE            |      THIS       |      OTHER      |  UNIT  | THRESHOLD  |  Regression?  | 'THIS' is
--------------------------------------------------------------------------------------------------------------------------------------------------------
CopyFromStdin                       | wallclock       |           3.599 |         101.877 |   s    |    10%     |      no       | better: 28.3 times faster
CopyFromStdin                       | memory_mz       |        1310.349 |        1197.815 |   MB   |    20%     |      no       | worse:   9.4% more
CopyFromStdin                       | memory_clusterd |          28.458 |          28.534 |   MB   |    50%     |      no       | better:  0.3% less

Run in environmentd. The spec sheet below shows it doesn't scale well in Cloud (should be investigated!), but it scales well locally:

cores    local    cloud
    1  178.20s  378.87s
    2   89.73s  170.18s
    4   44.79s  144.72s
    8   25.05s  158.65s
   16   20.72s  119.59s
   32      N/A  117.71s

Fixes: https://github.com/MaterializeInc/database-issues/issues/7674
Fixes: https://github.com/MaterializeInc/database-issues/issues/9978

Motivation:
COPY FROM STDIN has been slow for workload replay and for testing with large amounts of data in general. It has also been a pain point in https://github.com/MaterializeInc/database-issues/issues/7674 and recently in https://materializeinc.slack.com/archives/C08A62E0751/p1770835109967349.

@github-actions

Pre-merge checklist

  • The PR title is descriptive and will make sense in the git log.
  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).

@def- def- force-pushed the pr-optimize-copy-from-stdin branch 8 times, most recently from 4e1af8e to 76a31a3, on February 17, 2026 11:55
@def- def- changed the title from "adapter: Optimize COPY FROM STDIN to use constant memory & parallelize it" to "adapter: Optimize COPY FROM STDIN, parallelize it, use constant memory" on Feb 17, 2026
@def- def- force-pushed the pr-optimize-copy-from-stdin branch 3 times, most recently from d5b109c to 0dd54e9, on February 17, 2026 14:29
@def- def- marked this pull request as ready for review on February 17, 2026 14:34
@def- def- requested review from a team as code owners on February 17, 2026 14:34
@def- def- requested a review from aljoscha on February 17, 2026 14:34
@def- def- force-pushed the pr-optimize-copy-from-stdin branch from 0dd54e9 to 862818c on February 17, 2026 14:47
@def- def- requested reviews from SangJunBak and ggevay and removed the request for aljoscha on February 22, 2026 13:50
@ggevay ggevay (Contributor) left a comment
Thanks, impressive performance gains! Looks good overall, wrote some comments.

    // and send the complete rows chunk to the next worker.
    let mut send_failed = false;
    while data.len() >= BATCH_SIZE {
        let split_pos = match data.iter().rposition(|&b| b == b'\n') {
@ggevay ggevay (Contributor) commented Feb 24, 2026

CSVs can have newlines inside fields, e.g.

1,"hello
world",3

So, I think this needs some more complex logic to find only those newlines that are not inside a quoted field.
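The quote-aware split could look roughly like this (a hypothetical sketch, not the PR's actual implementation; it assumes RFC 4180-style quoting, where the `""` escape inside a quoted field toggles the state twice and so needs no special case):

```rust
// Hypothetical sketch: find the last newline in `data` that is NOT inside a
// double-quoted CSV field, by tracking quote state while scanning forward.
fn last_unquoted_newline(data: &[u8]) -> Option<usize> {
    let mut in_quotes = false;
    let mut last = None;
    for (i, &b) in data.iter().enumerate() {
        match b {
            // A doubled quote ("") toggles the state twice, leaving it unchanged.
            b'"' => in_quotes = !in_quotes,
            b'\n' if !in_quotes => last = Some(i),
            _ => {}
        }
    }
    last
}
```

For `1,"hello\nworld",3\n` this skips the embedded newline and returns the position of the trailing one.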

@ggevay ggevay (Contributor) commented

Also, I'm wondering if we need a more complex dance here to handle very large single rows:

One thing is that this can potentially get quadratic if one very big row is given to us in many CopyData messages, because rposition would keep scanning the whole row repeatedly, right?

The other thing is that this can still end up allocating an arbitrarily large amount of memory if one row is very big (e.g., if there are no line breaks in the file at all due to the user giving a wrong input file). The old code had max_copy_from_size; maybe we could re-introduce that as a safety guardrail here to avoid OOMing envd on bad input files.

(But maybe not super urgent, so if you don't want to fiddle with this now, then just adding a TODO comment in the code and creating an issue is ok.)
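One way to avoid the quadratic rescans described above (a hypothetical sketch, not code from this PR; `LineSplitter` is an illustrative name) is to remember how far the buffer has already been scanned, so each byte is inspected at most once across CopyData messages:

```rust
// Hypothetical sketch: an incremental line splitter that never rescans bytes.
// `scanned` marks the prefix of `buf` already known to contain no '\n', so a
// forward scan can resume from there instead of re-reading the whole buffer.
struct LineSplitter {
    buf: Vec<u8>,
    scanned: usize, // bytes [0..scanned) contain no '\n'
}

impl LineSplitter {
    fn new() -> Self {
        Self { buf: Vec::new(), scanned: 0 }
    }

    // Append incoming data and drain any complete lines (including '\n').
    fn push(&mut self, data: &[u8]) -> Vec<Vec<u8>> {
        self.buf.extend_from_slice(data);
        let mut lines = Vec::new();
        while let Some(pos) = self.buf[self.scanned..].iter().position(|&b| b == b'\n') {
            let end = self.scanned + pos + 1;
            lines.push(self.buf.drain(..end).collect());
            self.scanned = 0;
        }
        // The remaining tail has no newline; skip it on the next call.
        self.scanned = self.buf.len();
        lines
    }
}
```

Whatever the remaining tail's length, the next `push` only scans the newly appended bytes, so total work stays linear in the input size.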

@def- def- (Contributor, Author) commented

Thanks, fixed the CSV handling; it was a bit messier than I hoped.
Instead, I have opted to only limit the row size, which should be enough to prevent OOMing as far as I can tell.
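A row-size guardrail along the lines the author describes might look like this sketch (the constant and function name are illustrative, not the PR's actual values):

```rust
// Hypothetical sketch: if the unterminated tail of the buffer grows past a
// fixed limit without a newline, error out instead of buffering an
// arbitrarily large "row" (e.g. a binary file with no line breaks at all).
const MAX_ROW_SIZE: usize = 64 * 1024 * 1024; // illustrative limit

fn check_row_size(pending_tail_len: usize) -> Result<(), String> {
    if pending_tail_len > MAX_ROW_SIZE {
        Err(format!(
            "row exceeds maximum size of {} bytes; is the input really line-delimited?",
            MAX_ROW_SIZE
        ))
    } else {
        Ok(())
    }
}
```

Checking only the pending tail (rather than total bytes received) keeps memory bounded by the limit plus one batch, without capping overall COPY size as the old `max_copy_from_size` did.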

@def- def- force-pushed the pr-optimize-copy-from-stdin branch from 862818c to 864ae06 on February 24, 2026 16:23
@def- def- force-pushed the pr-optimize-copy-from-stdin branch from 864ae06 to de043d4 on February 24, 2026 17:06
@def- def- force-pushed the pr-optimize-copy-from-stdin branch from de043d4 to 803a1d5 on February 24, 2026 17:12
@def- def- requested reviews from DAlperin and ggevay on February 24, 2026 22:29

3 participants