adapter: Optimize COPY FROM STDIN, parallelize it, use constant memory #35036
def- wants to merge 10 commits into MaterializeInc:main
Conversation
ggevay left a comment:
Thanks, impressive performance gains! Looks good overall, wrote some comments.
src/pgwire/src/protocol.rs (outdated):

```rust
// and send the complete rows chunk to the next worker.
let mut send_failed = false;
while data.len() >= BATCH_SIZE {
    let split_pos = match data.iter().rposition(|&b| b == b'\n') {
```
CSVs can have newlines inside fields, e.g.

```
1,"hello
world",3
```

So I think this would need some more complex logic to find only those newlines that are not inside a quoted field.
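A quote-aware scan along those lines could look like the following. This is a minimal sketch, not the PR's actual implementation; `last_unquoted_newline` is a hypothetical helper, and it assumes standard CSV quoting where `"` toggles the quoted state and an escaped `""` toggles it twice, leaving the state unchanged across the pair.

```rust
/// Hypothetical helper: position of the last newline that lies outside any
/// quoted CSV field. A `"` toggles the quoted state; an escaped `""` toggles
/// it twice, so a single linear pass tracks the state correctly.
fn last_unquoted_newline(data: &[u8]) -> Option<usize> {
    let mut in_quotes = false;
    let mut last = None;
    for (i, &b) in data.iter().enumerate() {
        match b {
            b'"' => in_quotes = !in_quotes,
            b'\n' if !in_quotes => last = Some(i),
            _ => {}
        }
    }
    last
}

fn main() {
    let data = b"1,\"hello\nworld\",3\n4,5,6\n";
    // The newline inside the quoted field is ignored; only real row
    // boundaries are candidates for splitting a batch.
    assert_eq!(last_unquoted_newline(data), Some(data.len() - 1));
}
```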
Also, I'm wondering if we need a more complex dance here to handle very large single rows:
One thing is that this can potentially get quadratic if one very big row arrives across many CopyData messages, because `rposition` would keep rescanning the whole accumulated row repeatedly, right?
The other thing is that this can still allocate an arbitrarily large amount of memory if one row is very big (e.g., if the file has no line breaks at all because the user supplied a wrong input file). The old code had max_copy_from_size; maybe we could re-introduce that as a safety guardrail here to avoid OOMing envd on bad input files.
(But maybe not super urgent, so if you don't want to fiddle with this now, then just adding a TODO comment in the code and creating an issue is ok.)
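One way to avoid the quadratic rescans described above is to remember how far the buffer has already been searched. The sketch below is hypothetical (the `RowBuffer` type is not from the PR); it omits quote handling and the row-size limit, and a real implementation would also adjust `scanned` when rows are drained from the buffer.

```rust
/// Hypothetical sketch: a buffer that remembers how far it has already been
/// searched, so each incoming CopyData chunk is scanned for a row boundary
/// at most once, instead of rescanning the whole accumulated row on every
/// message. Quote handling and the row-size limit are omitted for brevity.
struct RowBuffer {
    data: Vec<u8>,
    scanned: usize, // prefix already searched for newlines
}

impl RowBuffer {
    fn new() -> Self {
        RowBuffer { data: Vec::new(), scanned: 0 }
    }

    /// Append a chunk and return the position of the last newline among the
    /// newly appended bytes, if any.
    fn push(&mut self, chunk: &[u8]) -> Option<usize> {
        self.data.extend_from_slice(chunk);
        let pos = self.data[self.scanned..]
            .iter()
            .rposition(|&b| b == b'\n')
            .map(|p| self.scanned + p);
        // Everything appended so far has now been searched exactly once.
        self.scanned = self.data.len();
        pos
    }
}

fn main() {
    let mut buf = RowBuffer::new();
    // A huge row arriving in many chunks is scanned once per chunk,
    // not once per chunk over the entire accumulated row.
    assert_eq!(buf.push(b"abc"), None);
    assert_eq!(buf.push(b"def\ngh"), Some(6));
}
```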
Thanks, fixed the CSV handling; it was a bit messier than I hoped.
I have instead opted to only limit the row size, which should be enough to prevent OOMing from what I can tell.
Previously this failed:

```
materialize=> alter system set max_copy_from_size=15000000000;
ERROR:  parameter "max_copy_from_size" requires a "unsigned integer" value
```
Tried with 100 million rows in https://github.com/def-/ClickBench/tree/pr-materialize/materialize. Before this PR it took 101 min, would go OOM if the files were not split, and hit query errors because results were too large. With this PR the 100 million row ingestion runs in 4:38 min on my dev server (8 cores) and should scale roughly linearly with the number of cores. For reference, COPY WITH FREEZE in PostgreSQL takes 20 min.

Example test run of the new benchmark in CI with 10 million rows: https://buildkite.com/materialize/release-qualification/builds/1089#019c68b4-4b3f-424f-b3c4-05518e95f1a0
Runs on the environmentd spec sheet don't scale well in Cloud (should be investigated!), but scale well locally.
Fixes: https://github.com/MaterializeInc/database-issues/issues/7674
Fixes: https://github.com/MaterializeInc/database-issues/issues/9978
Motivation:
COPY FROM STDIN has been slow for workload replay and for testing with large amounts of data in general; it has also been a pain point in https://github.com/MaterializeInc/database-issues/issues/7674 and recently in https://materializeinc.slack.com/archives/C08A62E0751/p1770835109967349