Adding overwrite option to ParticleFile API#2655
Conversation
| if_exists : {"error", "overwrite", "append"}, optional | ||
| Behavior when the output file already exists. | ||
| - "error" (default): raise a ValueError. | ||
| - "overwrite": remove the existing file before writing. | ||
| - "append": preserve existing rows and append new rows. |
There was a problem hiding this comment.
I think we should stick closer to convention here
| if_exists : {"error", "overwrite", "append"}, optional | |
| Behavior when the output file already exists. | |
| - "error" (default): raise a ValueError. | |
| - "overwrite": remove the existing file before writing. | |
| - "append": preserve existing rows and append new rows. | |
| mode : {"w", "a", None}, optional | |
| Writing behaviour. | |
| - None (default): Write dataset, and raise an error if it already exists. | |
| - "w": Write dataset, overwriting it. | |
| - "a": Append to dataset. |
also rename ._if_exists to ._mode
| self._tmp_path = self.path.with_name(f"{self.path.stem}.append_tmp{self.path.suffix}") | ||
| if self._tmp_path.exists(): | ||
| self._tmp_path.unlink() | ||
|
|
||
| self._writer = pq.ParquetWriter(self._tmp_path, existing_schema, compression=self._compression) | ||
|
|
||
| # Parquet can't directly append, so we need to rewrite the existing data along with the new data. | ||
| for batch in existing_file.iter_batches(): | ||
| self._writer.write_table(pa.Table.from_batches([batch], schema=existing_schema)) | ||
| else: | ||
| assert not self.path.exists(), "If the file exists, the writer should already be set" | ||
| self._writer = pq.ParquetWriter( | ||
| self.path, | ||
| schema, | ||
| compression=self._compression, | ||
| ) |
There was a problem hiding this comment.
Just taking a step back here - why do we need an append mode for the ParticleFile? Could users easily just create multiple particlefiles and join them into one after the fact?
There was a problem hiding this comment.
I think especially since the file format of Parquet doesn't support this, and calling "append" would require rewriting the current data that we have, is maybe an indication that we either shouldn't have append or should consider something else.
There was a problem hiding this comment.
You're right; it doesn't make sense to add an "append" option if we are copying the data internally anyways. I now removed it in aafde6a
The only thing users will need to be aware of, is that the last record of the first file is the same as the first second of the second file when concatenating, so they will probably need to filter these duplicate records out. Is that something we should add to the tutorial?
| raise ValueError(f"outputdt must be positive/non-zero. Got {outputdt=!r}") | ||
|
|
||
| self._outputdt = outputdt | ||
| self._mode = mode |
There was a problem hiding this comment.
Now that we don't use mode outside of this function anymore, do we need to make it an object attribute? What is best practice in such cases?
Description
This PR adds an option to
ParticleFile.__init__to control when the file already exists: either raise an error (default), or overwrite the exisiting file.Checklist
mainfor normal development,v3-supportfor v3 support)AI Disclosure
I used Claude code to help with the implementation of the append option in particlefile.write()