Add CTable: compressed columnar table backed by Blosc2#621

Open
FrancescAlted wants to merge 56 commits into main from ctable4

Conversation

@FrancescAlted
Member

This PR introduces CTable, a new columnar compressed table built on top of blosc2.NDArray and TreeStore. It is the main deliverable of this branch.

What's new

CTable core (src/blosc2/ctable.py, ctable_storage.py)

  • Dataclass-driven schema: define columns with blosc2.field() and typed specs (int64, float64, bool, string, …), including optional constraints (ge, le,
    nullable)
  • Row-wise and batch mutations: append, extend, delete, compact
  • Column access (table["col"]), row indexing (table.row[i]), slicing, iteration
  • where() filtering returning lazy views; head() / tail()
  • Per-column aggregates: sum, min, max, mean, std, any, all, unique, value_counts
  • Nullable columns with null_value / is_null / notnull / null_count
  • Persistent storage via TreeStore (mode 'w'/'a'/'r'); save() / load() / CTable.open()
  • Persistent accelerated indexes on columns via CTableIndex (rebuild, compact, drop)
  • Arrow and CSV interoperability
  • Rich info / repr / Column.repr
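
The dataclass-driven schema with constraints can be illustrated with a plain-Python sketch. Note this is only an analogy for the shape of the API described above: the names `field`, `ge`, `le`, and `nullable` come from this PR, but `Measurement` and `validate_row` are hypothetical, and the real `blosc2.field()` implementation surely differs.

```python
import dataclasses

def field(dtype, *, ge=None, le=None, nullable=False):
    """Hypothetical stand-in for blosc2.field(): a typed column spec."""
    return dataclasses.field(
        default=None,
        metadata={"dtype": dtype, "ge": ge, "le": le, "nullable": nullable},
    )

@dataclasses.dataclass
class Measurement:                       # illustrative schema
    sensor_id: int = field("int64", ge=0)
    value: float = field("float64", nullable=True)

def validate_row(row):
    """Check each value against its column's ge/le/nullable constraints."""
    for f in dataclasses.fields(type(row)):
        v = getattr(row, f.name)
        meta = f.metadata
        if v is None:
            if not meta["nullable"]:
                raise ValueError(f"{f.name} is not nullable")
            continue
        if meta["ge"] is not None and v < meta["ge"]:
            raise ValueError(f"{f.name} must be >= {meta['ge']}")
        if meta["le"] is not None and v > meta["le"]:
            raise ValueError(f"{f.name} must be <= {meta['le']}")

validate_row(Measurement(sensor_id=3, value=1.5))   # passes
```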

Schema layer (schema.py, schema_compiler.py, schema_validation.py, schema_vectorized.py)

  • Typed spec primitives for all NumPy-compatible dtypes plus string and bytes
  • Row-wise (Pydantic) and vectorized batch validation
  • schema_to_dict / schema_from_dict for on-disk schema persistence
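
The dict round-trip for on-disk persistence can be sketched with plain dataclasses. The function names `schema_to_dict` / `schema_from_dict` are from this PR, but the bodies below are a minimal assumption of how such a round-trip works, not the actual `schema_compiler.py` code (which certainly stores more, e.g. constraints).

```python
import dataclasses

@dataclasses.dataclass
class Point:                             # illustrative schema
    x: float = 0.0
    y: float = 0.0

def schema_to_dict(cls):
    """Serialize a dataclass schema to a plain dict (sketch)."""
    return {
        "name": cls.__name__,
        "columns": {
            f.name: getattr(f.type, "__name__", f.type)
            for f in dataclasses.fields(cls)
        },
    }

def schema_from_dict(d):
    """Rebuild a dataclass type from the dict produced above."""
    return dataclasses.make_dataclass(
        d["name"],
        [(name, tname, dataclasses.field(default=None))
         for name, tname in d["columns"].items()],
    )

d = schema_to_dict(Point)
Restored = schema_from_dict(d)
```

The round-trip is stable: serializing `Restored` yields the same dict again.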

Infrastructure improvements

  • DictStore: new discard() method; relaxed path-suffix requirements; fix .b2z double-open GC corruption; fix temp-dir unpacking path
  • TreeStore: minor follow-ups from CTable integration
  • SChunk: extended constructor docstring; internal helpers for blosc2.open() TreeStore materialisation
  • blosc2.open(): now warns when defaulting to mode='a' (the future default will be 'r'); faster opens by trying a standard open first; fix a Proxy regression in mode='r'
  • ndarray.py: _normalize_expr_operand helper for mixed numpy/blosc2 operands

Docs, examples, benchmarks

  • Full Sphinx reference page (doc/reference/ctable.rst) added to the class index
  • Jupyter tutorial notebook (examples/ctable/ctable_tutorial.ipynb) and a new indexing tutorial (15.indexing-ctables.ipynb)
  • 20+ benchmark scripts under bench/ctable/ and example scripts under examples/ctable/

Tests

  • ~3,500 lines of new tests in tests/ctable/ covering construction, mutations, persistence, nullable columns, indexing, CSV/Arrow interop, schema validation,
    compaction, and row logic
  • Existing ndarray/schunk/open tests updated for the new blosc2.open() mode warning

Jacc4224 and others added 30 commits March 26, 2026 11:05
Introduce CTable, a new columnar table class for efficient in-memory
data storage using Blosc2 as the underlying compression engine.

Each column is represented as a Column object wrapping a blosc2.NDArray
with typed, compressed storage. Building on top of blosc2's existing
infrastructure, CTable supports append, iteration and
column-based queries.

This is an early-stage (beta) implementation; the table is always fully
loaded in memory.

New files:
- src/blosc2/ctable.py: CTable and Column class definitions
- tests/ctable/: unit tests covering construction, slicing, deletion,
  compaction and row logic
- bench/ctable/: benchmarks comparing CTable against pandas
Add CTable, a columnar in-memory table built on top of blosc2
  - Add schema.py with spec primitives: int8/16/32/64, uint8/16/32/64,
    float32/64, bool, complex64/128, string, bytes — sharing a _NumericSpec
    mixin to avoid boilerplate
  - Add schema_compiler.py: compile_schema(), CompiledSchema/Column/Config,
    schema_to_dict() / schema_from_dict() for persistence groundwork
  - Export all spec types and field() from blosc2 namespace

  Validation:
  - Add schema_validation.py: Pydantic-backed row validation for append(),
    cached per schema, re-raised as plain ValueError
  - Add schema_vectorized.py: vectorized NumPy constraint checks for extend(),
    using np.char.str_len() for string/bytes columns
  - validate= per-call override on extend() (None inherits table default)

  CTable refactor:
  - Constructor accepts dataclass schemas; legacy Pydantic adapter kept
  - Schema introspection: table.schema, column_schema(), schema_dict()
  - _last_pos cache eliminates backward chunk scan on every append/extend
  - _grow() shared resize helper; delete() writes back in-place without
    creating a new array; _n_rows updated by subtraction not count_nonzero
  - head() and tail() unified through _find_physical_index()
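  The delete() bookkeeping described above can be sketched in a few lines: rows are tombstoned in a validity mask, the live-row count is updated by subtraction rather than recounting, and a helper maps logical to physical positions. `TinyTable` and its members are illustrative names, not the real CTable internals.

```python
class TinyTable:
    """Minimal sketch of tombstone deletion with a validity mask."""

    def __init__(self, values):
        self._values = list(values)
        self._valid = [True] * len(values)
        self._n_rows = len(values)

    def delete(self, i):
        phys = self._find_physical_index(i)
        self._valid[phys] = False
        self._n_rows -= 1            # subtraction, no count_nonzero scan

    def _find_physical_index(self, logical):
        """Map a logical row number to a physical slot, skipping tombstones."""
        seen = -1
        for phys, ok in enumerate(self._valid):
            if ok:
                seen += 1
                if seen == logical:
                    return phys
        raise IndexError(logical)

    def compact(self):
        """Rewrite storage keeping only live rows."""
        self._values = [v for v, ok in zip(self._values, self._valid) if ok]
        self._valid = [True] * len(self._values)

t = TinyTable([10, 20, 30])
t.delete(1)
t.compact()
```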

  Tests and docs:
  - 135 tests across 10 test files, all passing
  - plans/ctable-implementation-log.md and ctable-user-guide.md added
  - Benchmarks: bench_validation.py and bench_append_regression.py
…QoL)

  Persistency:
    - FileTableStorage backend: disk layout _meta.b2frame / _valid_rows.b2nd / _cols/<name>.b2nd
    - CTable(Row, urlpath=..., mode="w"/"a"/"r"), CTable.open(), CTable.save(), CTable.load()
    - Read-only mode blocks all writes; save() always writes compacted rows

  Column aggregates: sum, min, max, mean, std, any, all (chunk-aware via iter_chunks)
  Column utilities: unique(), value_counts(), assign(), boolean mask __getitem__/__setitem__

  Schema mutations: add_column (fills default for existing rows), drop_column, rename_column
    - All three update schema, handle disk files, and block on views

  View mutability model fix:
    - Views allow value writes (assign, __setitem__) — only structural mutations are blocked
    - _read_only=True reserved for mode="r" disk tables; base is not None guards structural ops
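  The view mutability model can be sketched as follows: value writes go through unconditionally, while structural mutations are guarded by a `base is not None` check. The class and method names here are illustrative, not the actual CTable code.

```python
class ColumnView:
    """Sketch: views allow value writes, block structural mutations."""

    def __init__(self, data, base=None):
        self.data = data
        self.base = base             # None for a root table, set for views

    def assign(self, i, value):
        self.data[i] = value         # value write: always allowed

    def drop_all(self):
        if self.base is not None:    # structural op: blocked on views
            raise ValueError("structural mutation not allowed on a view")
        self.data.clear()

root = ColumnView([1, 2, 3])
view = ColumnView(root.data, base=root)
view.assign(0, 99)                   # allowed, visible through the base
```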

  QoL: __str__ pandas-style, __repr__, cbytes/nbytes, sample(n), Column.iter_chunks(size)

  Tests: 258 tests, ~5s — new test_persistency.py (33), test_schema_mutations.py (41),
    expanded test_column.py; optimized helpers to use to_numpy() instead of row[i]
Arrow compatibility
Examples
Tutorial
Start integration with main Python-Blosc2
FrancescAlted and others added 26 commits April 15, 2026 11:26
 - New CTableIndex handle with col_name, kind, name, stale properties
 - create_index(), drop_index(), rebuild_index(), compact_index() methods
 - index() lookup and indexes property on CTable
 - _CTableIndexProxy duck-type shim routes sidecar files to
   <table.b2d>/_indexes/<col_name>/ for persistent tables
 - Index catalog stored in /_meta vlmeta; survives table close/reopen
 - where() automatically uses a fresh index; falls back to scan when stale
 - Epoch tracking: mutations (append, extend, setitem, assign, sort_by,
   compact) mark all indexes stale; delete() bumps visibility_epoch only
 - Views raise ValueError for all index management methods
 - Add _indexes to reserved column names in schema_compiler
 - 32 new tests in tests/ctable/test_ctable_indexing.py
 - New example examples/ctable/indexing.py
 - New tutorial doc/getting_started/tutorials/15.indexing-ctables.ipynb

 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
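
 The epoch-tracking idea above can be sketched like this: every mutation bumps a table-level counter, and an index built at an older epoch reports itself stale (so where() can fall back to a scan). `Table` and `IndexHandle` are illustrative names, not the real CTable/CTableIndex classes.

```python
class IndexHandle:
    """Records the table epoch at build time; stale when epochs diverge."""

    def __init__(self, table):
        self._built_epoch = table._epoch

    def is_stale(self, table):
        return self._built_epoch != table._epoch

class Table:
    """Sketch: every mutation bumps an epoch, invalidating all indexes."""

    def __init__(self):
        self._epoch = 0
        self.rows = []

    def append(self, row):
        self.rows.append(row)
        self._epoch += 1             # marks every existing index stale

t = Table()
idx = IndexHandle(t)                 # fresh index
t.append({"a": 1})                   # mutation: idx is now stale
```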
  Fix CTable index lifecycle for schema mutations by removing index catalog
  entries and sidecars when indexed columns are dropped, and rebuilding
  indexes under the new name after column renames.

  Improve indexed CTable filtering so where() can expose multiple usable
  column indexes to the planner for conjunctive predicates, and raise a
  clear error for malformed table-owned index metadata instead of silently
  falling back to scans.

  Add regression coverage for indexed column rename/drop behavior,
  multi-column indexed conjunctions, and malformed catalog entries.
  Wire the CTable indexing tutorial into the docs toctree.
  Implement the first phase of plans/changing-default-open-mode.md by
  tracking omitted mode= with a sentinel and emitting a FutureWarning when
  blosc2.open() relies on the current implicit "a" behavior.

  Update mmap-related tests, examples, and docstrings to pass explicit
  mode="r" so they keep exercising their intended paths without tripping
  the migration warning.
Provide indexing for the CTable object
   Opening a .b2z store in append mode a second time (in a new process) was
   failing with "blosc2_schunk_open_offset returned NULL" because the first
   open had silently overwritten the archive with a near-empty ZIP file.

   Three root causes were identified and fixed:

   1. Probe store in _open_treestore_root_object() called close() on a
      temporary TreeStore used only to read the manifest. This triggered
      to_b2z() and repacked the archive before the real open started.
      Fix: added DictStore.discard() (cleanup without repack) and switched
      the probe store to call discard() instead of close().

   2. CTable.__del__ (GC path) called storage.close() which chained into
      TreeStore.close() → to_b2z(). With nothing modified, the temp dir
      could be partially torn down, producing a corrupt archive.
      Fix: DictStore gains _closed and _modified flags; __del__ now calls
      discard() when _modified is False (no writes via __setitem__/
      __delitem__), and close() otherwise. CTable.__del__ is changed to
      call storage.discard() directly. FileTableStorage.discard() and
      TableStorage.discard() added to complete the delegation chain.

   3. TreeStore subtree views (created via __dict__.update from the parent)
      shared the parent's _temp_dir_obj. GC of a subtree could destroy the
      parent's temp dir. Fix: subtrees now set _closed=True immediately
      after copying the parent's __dict__.

   The contract is:
     - Explicit close() / with block → always repacks (user intent)
     - GC __del__ with no store-level writes → discard() (no repack, safe)
     - GC __del__ with __setitem__/__delitem__ writes → close() (repack)

   Add two regression tests: one at the DictStore layer and one for the
   full indexed-CTable-in-.b2z scenario that triggered the original report.

   Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
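
 The close/discard contract can be sketched with the `_modified` / `_closed` flags described above: an explicit close() always repacks, while the GC path repacks only if the store saw writes. `Store` is a toy stand-in for DictStore, not the real implementation.

```python
class Store:
    """Sketch of the __del__ contract: repack only on actual writes."""

    def __init__(self):
        self._modified = False
        self._closed = False
        self.repacked = False        # stands in for a to_b2z() repack

    def __setitem__(self, key, value):
        self._modified = True        # any write flips the flag

    def close(self):
        if not self._closed:
            self.repacked = True     # explicit close: always repack
            self._closed = True

    def discard(self):
        self._closed = True          # cleanup without repacking

    def __del__(self):
        if self._modified:
            self.close()             # writes happened: repack
        else:
            self.discard()           # read-only lifetime: safe discard
```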
Nullable attribute in schema.
   Python 3.14 sets the gen-2 GC threshold to 0, so long-lived objects
   (like SChunk instances in pytest fixtures) never get auto-collected.
   Each SChunk carries 2×nthreads OS pthreads via its cctx/dctx. During
   pytest teardown, thousands of accumulated SChunks hit macOS's 6144
   thread limit, causing an indefinite hang on gc.collect().

   Three fixes applied:

   - Release the GIL during blosc2_schunk_free() in SChunk.__dealloc__
     to prevent GIL deadlock when mass finalization triggers pthread_join

   - Add periodic gc.collect() every 50 tests in conftest.py to prevent
     thread accumulation past the macOS ceiling

   - Cap ThreadPoolExecutor in lazyexpr.py to os.cpu_count() workers
     (fixes #556)

   Closes #556
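
   The worker-cap fix follows a standard pattern: bound the pool by both the number of tasks and the CPU count so idle OS threads are never spawned. This is a generic sketch of that pattern, not the actual lazyexpr.py code.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def run_capped(tasks):
    """Run callables with a pool capped at min(len(tasks), cpu_count)."""
    max_workers = min(len(tasks), os.cpu_count() or 1) or 1
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves input order, so results line up with tasks
        return list(pool.map(lambda f: f(), tasks))

results = run_capped([lambda: 1, lambda: 2])
```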
   - FieldsAccessor: use weakref to parent NDArray and create NDField
     instances on access instead of eagerly. Breaks NDArray↔NDField
     reference cycles so gen-0/gen-1 GC can collect them, reducing
     thread accumulation on Python 3.14.

   - sync_read_chunks: wrap reader thread in try/finally with
     thread.join() to prevent thread leaks. Propagate exceptions
     from the async reader instead of silently swallowing them.
     Detect dead reader thread to avoid infinite wait on empty queue.

   - ThreadPoolExecutor: cap max_workers to min(len(arrs), nthreads)
     instead of os.cpu_count(), avoiding excess threads when there
     are few operands.
   - _RowIndexer: use weakref to parent CTable to break reference
     cycles, allowing CTable instances to be collected without gen-2 GC.

   - indexing.py: stop caching writable sidecar handles in the
     process-wide _SIDECAR_HANDLE_CACHE after construction — they kept
     NDArray objects alive indefinitely across tests on macOS/Python 3.14.

   - indexing.py: add _purge_stale_persistent_caches() to evict index
     cache entries whose backing paths no longer exist. Call it in
     _load_store, _open_query_cache_store, _open_sidecar_handle, and
     _gather_mmap_source.

   - ref.py: open urlpath operands in mode='r' instead of the default
     mode='a'; persistent recipe operands only need read access.

   - test_ctable_indexing.py: add tests verifying that sidecar handles
     are not cached after index creation, that CTables release without
     explicit GC, and that stale persistent caches are purged correctly.
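
   The weakref pattern used for _RowIndexer (and FieldsAccessor) can be sketched as follows: the child holds only a weak reference to its parent, so there is no cycle and refcounting alone frees the table. `Table` and `RowIndexer` are illustrative names.

```python
import weakref

class RowIndexer:
    """Holds a weakref to its parent so it never keeps the table alive."""

    def __init__(self, table):
        self._table_ref = weakref.ref(table)

    def __getitem__(self, i):
        table = self._table_ref()
        if table is None:
            raise ReferenceError("parent table has been collected")
        return table.rows[i]

class Table:
    def __init__(self, rows):
        self.rows = rows
        self.row = RowIndexer(self)  # strong down, weak up: no cycle

t = Table([["a", 1], ["b", 2]])
first = t.row[0]
```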