adapter: Optimize COPY FROM STDIN, parallelize it, use constant memory #35036
def- wants to merge 10 commits into MaterializeInc:main
Conversation
ggevay left a comment:
Thanks, impressive performance gains! Looks good overall, wrote some comments.
src/pgwire/src/protocol.rs (outdated):

```rust
// and send the complete rows chunk to the next worker.
let mut send_failed = false;
while data.len() >= BATCH_SIZE {
    let split_pos = match data.iter().rposition(|&b| b == b'\n') {
```
CSVs can have newlines inside fields, e.g.

```
1,"hello
world",3
```

So I think this would need some more complex logic to find only those newlines that are not inside a quoted field.
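A quote-aware scan along those lines could look like the following. This is a minimal sketch, not the PR's actual implementation; `last_unquoted_newline` is a hypothetical helper, and it assumes standard CSV quoting where `"` toggles the quoted state and an escaped `""` toggles it twice, leaving the state unchanged across the pair.

```rust
/// Hypothetical helper: position of the last newline that lies outside any
/// quoted CSV field. A `"` toggles the quoted state; an escaped `""` toggles
/// it twice, so a single linear pass tracks the state correctly.
fn last_unquoted_newline(data: &[u8]) -> Option<usize> {
    let mut in_quotes = false;
    let mut last = None;
    for (i, &b) in data.iter().enumerate() {
        match b {
            b'"' => in_quotes = !in_quotes,
            b'\n' if !in_quotes => last = Some(i),
            _ => {}
        }
    }
    last
}

fn main() {
    let data = b"1,\"hello\nworld\",3\n4,5,6\n";
    // The newline inside the quoted field is ignored; only real row
    // boundaries are candidates for splitting a batch.
    assert_eq!(last_unquoted_newline(data), Some(data.len() - 1));
}
```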
Also, I'm wondering if we need a more complex dance here to handle very large single rows:
One thing is that this can potentially get quadratic if one very big row arrives across many CopyData messages, because `rposition` would keep rescanning the whole accumulated row repeatedly, right?
The other thing is that this can still allocate an arbitrarily large amount of memory if one row is very big (e.g., if the file has no line breaks at all because the user supplied a wrong input file). The old code had max_copy_from_size; maybe we could re-introduce that as a safety guardrail here to avoid OOMing envd on bad input files.
(But maybe not super urgent, so if you don't want to fiddle with this now, then just adding a TODO comment in the code and creating an issue is ok.)
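One way to avoid the quadratic rescans described above is to remember how far the buffer has already been searched. The sketch below is hypothetical (the `RowBuffer` type is not from the PR); it omits quote handling and the row-size limit, and a real implementation would also adjust `scanned` when rows are drained from the buffer.

```rust
/// Hypothetical sketch: a buffer that remembers how far it has already been
/// searched, so each incoming CopyData chunk is scanned for a row boundary
/// at most once, instead of rescanning the whole accumulated row on every
/// message. Quote handling and the row-size limit are omitted for brevity.
struct RowBuffer {
    data: Vec<u8>,
    scanned: usize, // prefix already searched for newlines
}

impl RowBuffer {
    fn new() -> Self {
        RowBuffer { data: Vec::new(), scanned: 0 }
    }

    /// Append a chunk and return the position of the last newline among the
    /// newly appended bytes, if any.
    fn push(&mut self, chunk: &[u8]) -> Option<usize> {
        self.data.extend_from_slice(chunk);
        let pos = self.data[self.scanned..]
            .iter()
            .rposition(|&b| b == b'\n')
            .map(|p| self.scanned + p);
        // Everything appended so far has now been searched exactly once.
        self.scanned = self.data.len();
        pos
    }
}

fn main() {
    let mut buf = RowBuffer::new();
    // A huge row arriving in many chunks is scanned once per chunk,
    // not once per chunk over the entire accumulated row.
    assert_eq!(buf.push(b"abc"), None);
    assert_eq!(buf.push(b"def\ngh"), Some(6));
}
```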
Thanks, fixed the CSV handling; it was a bit messier than I hoped.
I have instead opted to only limit the row size, which should be enough to prevent OOMing from what I can tell.
Previously this failed:

```
materialize=> alter system set max_copy_from_size=15000000000;
ERROR:  parameter "max_copy_from_size" requires a "unsigned integer" value
```
Tried with 100 million rows in https://github.com/def-/ClickBench/tree/pr-materialize/materialize. Before this PR it took 101 min, would go OOM if the files were not split, and hit query errors because results were too large. With this PR the 100 million row ingestion runs in 4:38 min on my dev server (8 cores) and should scale roughly linearly with the number of cores. For reference, COPY WITH FREEZE in PostgreSQL takes 20 min.

Example test run of the new benchmark in CI with 10 million rows: https://buildkite.com/materialize/release-qualification/builds/1089#019c68b4-4b3f-424f-b3c4-05518e95f1a0
Runs on the environmentd spec sheet don't scale well in Cloud (should be investigated!), but scale well locally.
Fixes: https://github.com/MaterializeInc/database-issues/issues/7674
Fixes: https://github.com/MaterializeInc/database-issues/issues/9978
Motivation:
COPY FROM STDIN has been slow for workload replay and for testing with large amounts of data in general; it has also been a pain point in https://github.com/MaterializeInc/database-issues/issues/7674 and recently in https://materializeinc.slack.com/archives/C08A62E0751/p1770835109967349