Add CTable: compressed columnar table backed by Blosc2 #621
Open
FrancescAlted wants to merge 56 commits into main from
Conversation
Introduce CTable, a new columnar table class for efficient in-memory data storage using Blosc2 as the underlying compression engine. Each column is represented as a Column object wrapping a blosc2.NDArray with typed, compressed storage. Building on top of blosc2's existing infrastructure, CTable supports append, iteration, and column-based queries. This is an early-stage (beta) implementation; the table is always fully loaded in memory.
New files:
- src/blosc2/ctable.py: CTable and Column class definitions
- tests/ctable/: unit tests covering construction, slicing, deletion, compaction, and row logic
- bench/ctable/: benchmarks comparing CTable against pandas
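The columnar layout described above can be illustrated with a small toy class. This is not the CTable API — the class, method names, and typecodes below are invented for the sketch, with a plain array.array standing in for a compressed blosc2.NDArray per column:

```python
# Toy illustration of the columnar idea behind CTable: each column is an
# independent typed buffer. All names here are invented for illustration;
# they do not mirror the real CTable API.
from array import array


class ToyColumnTable:
    def __init__(self, schema):
        # schema maps column name -> array typecode, e.g. {"id": "q", "price": "d"}
        self.cols = {name: array(code) for name, code in schema.items()}

    def append(self, row):
        # A row is stored by scattering each value into its column buffer.
        for name, value in row.items():
            self.cols[name].append(value)

    def __len__(self):
        return len(next(iter(self.cols.values())))

    def __iter__(self):
        # Row iteration gathers one value from every column.
        names = list(self.cols)
        for i in range(len(self)):
            yield {name: self.cols[name][i] for name in names}

    def where(self, name, predicate):
        # Column-based query: scan a single column, return matching row indices.
        return [i for i, v in enumerate(self.cols[name]) if predicate(v)]


t = ToyColumnTable({"id": "q", "price": "d"})
t.append({"id": 1, "price": 9.5})
t.append({"id": 2, "price": 3.0})
print(t.where("price", lambda v: v > 5.0))  # -> [0]
```

The payoff of this layout is that a query like `where()` touches only one column's buffer, which is also what makes per-column compression effective.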
Add CTable, a columnar in-memory table built on top of blosc2
- Add schema.py with spec primitives: int8/16/32/64, uint8/16/32/64,
float32/64, bool, complex64/128, string, bytes — sharing a _NumericSpec
mixin to avoid boilerplate
- Add schema_compiler.py: compile_schema(), CompiledSchema/Column/Config,
schema_to_dict() / schema_from_dict() for persistence groundwork
- Export all spec types and field() from blosc2 namespace
Validation:
- Add schema_validation.py: Pydantic-backed row validation for append(),
cached per schema, re-raised as plain ValueError
- Add schema_vectorized.py: vectorized NumPy constraint checks for extend(),
using np.char.str_len() for string/bytes columns
- validate= per-call override on extend() (None inherits table default)
CTable refactor:
- Constructor accepts dataclass schemas; legacy Pydantic adapter kept
- Schema introspection: table.schema, column_schema(), schema_dict()
- _last_pos cache eliminates backward chunk scan on every append/extend
- _grow() shared resize helper; delete() writes back in-place without
creating a new array; _n_rows updated by subtraction not count_nonzero
- head() and tail() unified through _find_physical_index()
Tests and docs:
- 135 tests across 10 test files, all passing
- plans/ctable-implementation-log.md and ctable-user-guide.md added
- Benchmarks: bench_validation.py and bench_append_regression.py
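The dataclass-schema compilation step mentioned above can be sketched in miniature. The mapping table, the CompiledColumn shape, and this compile_schema body are invented for illustration; the real schema_compiler.py handles spec types, configs, and constraints:

```python
# Hypothetical sketch of what a schema compiler like compile_schema() might do:
# walk a dataclass's fields and map each annotated Python type to a storage
# dtype name. Names and the mapping are illustrative, not blosc2's internals.
from dataclasses import dataclass, fields
from typing import NamedTuple

# Invented mapping from Python annotations to dtype names.
_DTYPE_FOR = {int: "int64", float: "float64", bool: "bool", str: "str", bytes: "bytes"}


class CompiledColumn(NamedTuple):
    name: str
    dtype: str


def compile_schema(cls):
    # Produce one CompiledColumn per dataclass field, in declaration order.
    return [CompiledColumn(f.name, _DTYPE_FOR[f.type]) for f in fields(cls)]


@dataclass
class Trade:
    ticker: str
    price: float
    volume: int


print(compile_schema(Trade))
```

Compiling the schema once up front lets every later append/extend validate against precomputed column metadata instead of re-inspecting annotations per call.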
…QoL)
Persistence:
- FileTableStorage backend: disk layout _meta.b2frame / _valid_rows.b2nd / _cols/<name>.b2nd
- CTable(Row, urlpath=..., mode="w"/"a"/"r"), CTable.open(), CTable.save(), CTable.load()
- Read-only mode blocks all writes; save() always writes compacted rows
Column aggregates: sum, min, max, mean, std, any, all (chunk-aware via iter_chunks)
Column utilities: unique(), value_counts(), assign(), boolean mask __getitem__/__setitem__
Schema mutations: add_column (fills default for existing rows), drop_column, rename_column
- All three update schema, handle disk files, and block on views
View mutability model fix:
- Views allow value writes (assign, __setitem__) — only structural mutations are blocked
- _read_only=True reserved for mode="r" disk tables; base is not None guards structural ops
QoL: __str__ pandas-style, __repr__, cbytes/nbytes, sample(n), Column.iter_chunks(size)
Tests: 258 tests, ~5s — new test_persistency.py (33), test_schema_mutations.py (41),
expanded test_column.py; optimized helpers to use to_numpy() instead of row[i]
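The chunk-aware aggregates listed above can be sketched as a streaming computation: partial results are accumulated chunk by chunk instead of materializing the whole column. Both functions below are stand-ins, not the real Column.iter_chunks() API:

```python
# Sketch of a chunk-aware aggregate in the spirit of the iter_chunks-based
# sum/mean described above. iter_chunks and chunked_sum_mean are invented
# names for illustration.
def iter_chunks(values, size):
    # Yield fixed-size slices of the column, mimicking decompressed chunks.
    for start in range(0, len(values), size):
        yield values[start:start + size]


def chunked_sum_mean(values, chunk_size=4):
    total, count = 0.0, 0
    for chunk in iter_chunks(values, chunk_size):
        total += sum(chunk)   # per-chunk partial sum
        count += len(chunk)
    return total, total / count


print(chunked_sum_mean(list(range(1, 11))))  # -> (55.0, 5.5)
```

With a compressed column, the advantage is that only one chunk is ever decompressed at a time, keeping peak memory proportional to the chunk size rather than the column size.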
Arrow compatibility
Examples
Tutorial
Start integration with main Python-Blosc2
- New CTableIndex handle with col_name, kind, name, stale properties
- create_index(), drop_index(), rebuild_index(), compact_index() methods
- index() lookup and indexes property on CTable
- _CTableIndexProxy duck-type shim routes sidecar files to <table.b2d>/_indexes/<col_name>/ for persistent tables
- Index catalog stored in /_meta vlmeta; survives table close/reopen
- where() automatically uses a fresh index; falls back to scan when stale
- Epoch tracking: mutations (append, extend, setitem, assign, sort_by, compact) mark all indexes stale; delete() bumps visibility_epoch only
- Views raise ValueError for all index management methods
- Add _indexes to reserved column names in schema_compiler
- 32 new tests in tests/ctable/test_ctable_indexing.py
- New example examples/ctable/indexing.py
- New tutorial doc/getting_started/tutorials/15.indexing-ctables.ipynb
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
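The epoch-tracking scheme in that commit can be sketched with two toy classes: the table bumps an epoch counter on mutation, and an index built at an older epoch reports itself stale so lookups fall back to a scan. All names below are invented for illustration:

```python
# Minimal sketch of epoch-based index staleness, assuming invented ToyTable /
# ToyIndex classes. The real CTableIndex also tracks a separate
# visibility_epoch for delete(); this sketch keeps a single epoch.
class ToyTable:
    def __init__(self, data):
        self.data = data   # column name -> list of values
        self.epoch = 0

    def append(self, row):
        for name, value in row.items():
            self.data[name].append(value)
        self.epoch += 1    # any mutation marks all indexes stale

    def lookup(self, index, value):
        if index.stale:
            # Fallback: linear scan of the column.
            return self.data[index.column].index(value)
        return index.positions[value]


class ToyIndex:
    def __init__(self, table, column):
        self.table = table
        self.column = column
        self.built_at = table.epoch  # snapshot of the epoch at build time
        self.positions = {v: i for i, v in enumerate(table.data[column])}

    @property
    def stale(self):
        return self.table.epoch != self.built_at


t = ToyTable({"id": [10, 20, 30]})
idx = ToyIndex(t, "id")
assert not idx.stale and t.lookup(idx, 20) == 1   # served by the index
t.append({"id": 40})
assert idx.stale and t.lookup(idx, 40) == 3       # served by a scan
```

Keeping staleness as a cheap epoch comparison means mutations never pay to update indexes eagerly; the cost is deferred to an explicit rebuild_index()-style call.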
Fix CTable index lifecycle for schema mutations by removing index catalog entries and sidecars when indexed columns are dropped, and rebuilding indexes under the new name after column renames. Improve indexed CTable filtering so where() can expose multiple usable column indexes to the planner for conjunctive predicates, and raise a clear error for malformed table-owned index metadata instead of silently falling back to scans. Add regression coverage for indexed column rename/drop behavior, multi-column indexed conjunctions, and malformed catalog entries. Wire the CTable indexing tutorial into the docs toctree.
Implement the first phase of plans/changing-default-open-mode.md by tracking omitted mode= with a sentinel and emitting a FutureWarning when blosc2.open() relies on the current implicit "a" behavior. Update mmap-related tests, examples, and docstrings to pass explicit mode="r" so they keep exercising their intended paths without tripping the migration warning.
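The sentinel technique described there can be shown in a few lines: a private sentinel distinguishes "mode omitted" from an explicit mode="a", so the FutureWarning fires only for callers relying on the implicit default. _UNSET and toy_open are illustrative names, not blosc2's actual internals:

```python
# Sketch of sentinel-based default-change migration, assuming invented names.
import warnings

_UNSET = object()  # sentinel: caller did not pass mode=


def toy_open(path, mode=_UNSET):
    if mode is _UNSET:
        warnings.warn(
            "the default mode will change from 'a' to 'r'; "
            "pass mode explicitly to silence this warning",
            FutureWarning,
            stacklevel=2,
        )
        mode = "a"  # current implicit behavior, kept during the migration
    return path, mode


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    toy_open("data.b2nd")             # implicit default: warns
    toy_open("data.b2nd", mode="r")   # explicit mode: silent
print([w.category.__name__ for w in caught])  # -> ['FutureWarning']
```

Because `mode=_UNSET` and `mode="a"` are distinguishable, existing code that already passes an explicit mode never sees the warning, which is exactly what the updated tests and examples rely on.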
Provide indexing for the CTable object
Opening a .b2z store in append mode a second time (in a new process) was
failing with "blosc2_schunk_open_offset returned NULL" because the first
open had silently overwritten the archive with a near-empty ZIP file.
Three root causes were identified and fixed:
1. Probe store in _open_treestore_root_object() called close() on a
temporary TreeStore used only to read the manifest. This triggered
to_b2z() and repacked the archive before the real open started.
Fix: added DictStore.discard() (cleanup without repack) and switched
the probe store to call discard() instead of close().
2. CTable.__del__ (GC path) called storage.close() which chained into
TreeStore.close() → to_b2z(). With nothing modified, the temp dir
could be partially torn down, producing a corrupt archive.
Fix: DictStore gains _closed and _modified flags; __del__ now calls
discard() when _modified is False (no writes via __setitem__/
__delitem__), and close() otherwise. CTable.__del__ is changed to
call storage.discard() directly. FileTableStorage.discard() and
TableStorage.discard() added to complete the delegation chain.
3. TreeStore subtree views (created via __dict__.update from the parent)
shared the parent's _temp_dir_obj. GC of a subtree could destroy the
parent's temp dir. Fix: subtrees now set _closed=True immediately
after copying the parent's __dict__.
The contract is:
- Explicit close() / with block → always repacks (user intent)
- GC __del__ with no store-level writes → discard() (no repack, safe)
- GC __del__ with __setitem__/__delitem__ writes → close() (repack)
Add two regression tests: one at the DictStore layer and one for the
full indexed-CTable-in-.b2z scenario that triggered the original report.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
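The close()/discard() contract above can be condensed into a toy store. The flag names follow the commit message, but the class and its bodies are stand-ins, not the real DictStore:

```python
# Sketch of the repack contract described above: repack only on explicit
# close() (user intent) or when GC finds pending store-level writes.
class ToyDictStore:
    def __init__(self):
        self._closed = False
        self._modified = False
        self.repacked = False  # stands in for "to_b2z() ran"

    def __setitem__(self, key, value):
        self._modified = True  # store-level write: repack needed on close

    def close(self):
        # Explicit close (or `with` block): always repack.
        if not self._closed:
            self.repacked = True
            self._closed = True

    def discard(self):
        # Cleanup without repacking; safe to call from GC when nothing changed.
        self._closed = True

    def __del__(self):
        # GC path: repack only if there were writes, otherwise just discard.
        if self._closed:
            return
        if self._modified:
            self.close()
        else:
            self.discard()


s = ToyDictStore()
s.__del__()        # no writes: discard, archive untouched
assert not s.repacked
s2 = ToyDictStore()
s2["k"] = b"v"
s2.__del__()       # pending write: close() repacks
assert s2.repacked
```

The `_closed` guard also captures the subtree fix: a view that sets `_closed=True` at construction can never repack (or tear down) its parent's temp dir from `__del__`.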
Nullable attribute in schema.
Python 3.14 sets the gen-2 GC threshold to 0, so long-lived objects
(like SChunk instances in pytest fixtures) never get auto-collected.
Each SChunk carries 2×nthreads OS pthreads via its cctx/dctx. During
pytest teardown, thousands of accumulated SChunks hit macOS's 6144
thread limit, causing an indefinite hang on gc.collect().
Three fixes applied:
- Release the GIL during blosc2_schunk_free() in SChunk.__dealloc__
to prevent GIL deadlock when mass finalization triggers pthread_join
- Add periodic gc.collect() every 50 tests in conftest.py to prevent
thread accumulation past the macOS ceiling
- Cap ThreadPoolExecutor in lazyexpr.py to os.cpu_count() workers
(fixes #556)
Closes #556
- FieldsAccessor: use weakref to parent NDArray and create NDField
instances on access instead of eagerly. Breaks NDArray↔NDField
reference cycles so gen-0/gen-1 GC can collect them, reducing
thread accumulation on Python 3.14.
- sync_read_chunks: wrap reader thread in try/finally with
thread.join() to prevent thread leaks. Propagate exceptions
from the async reader instead of silently swallowing them.
Detect dead reader thread to avoid infinite wait on empty queue.
- ThreadPoolExecutor: cap max_workers to min(len(arrs), nthreads)
instead of os.cpu_count(), avoiding excess threads when there
are few operands.
- _RowIndexer: use weakref to parent CTable to break reference
cycles, allowing CTable instances to be collected without gen-2 GC.
- indexing.py: stop caching writable sidecar handles in the
process-wide _SIDECAR_HANDLE_CACHE after construction — they kept
NDArray objects alive indefinitely across tests on macOS/Python 3.14.
- indexing.py: add _purge_stale_persistent_caches() to evict index
cache entries whose backing paths no longer exist. Call it in
_load_store, _open_query_cache_store, _open_sidecar_handle, and
_gather_mmap_source.
- ref.py: open urlpath operands in mode='r' instead of the default
mode='a'; persistent recipe operands only need read access.
- test_ctable_indexing.py: add tests verifying that sidecar handles
are not cached after index creation, that CTables release without
explicit GC, and that stale persistent caches are purged correctly.
This PR introduces CTable, a new columnar compressed table built on top of blosc2.NDArray and TreeStore. It is the main deliverable of this branch.
What's new
- CTable core (src/blosc2/ctable.py, ctable_storage.py)
- Schema layer (schema.py, schema_compiler.py, schema_validation.py, schema_vectorized.py)
- Infrastructure improvements
- Docs, examples, benchmarks
- Tests: unit tests covering construction, slicing, deletion, compaction, and row logic