Add CTable: compressed columnar table backed by Blosc2#621

Open
FrancescAlted wants to merge 56 commits into main from ctable4

Conversation

@FrancescAlted
Member

This PR introduces CTable, a new columnar compressed table built on top of blosc2.NDArray and TreeStore. It is the main deliverable of this branch.

What's new

CTable core (src/blosc2/ctable.py, ctable_storage.py)

  • Dataclass-driven schema: define columns with blosc2.field() and typed specs (int64, float64, bool, string, …), including optional constraints (ge, le,
    nullable)
  • Row-wise and batch mutations: append, extend, delete, compact
  • Column access (table["col"]), row indexing (table.row[i]), slicing, iteration
  • where() filtering returning lazy views; head() / tail()
  • Per-column aggregates: sum, min, max, mean, std, any, all, unique, value_counts
  • Nullable columns with null_value / is_null / notnull / null_count
  • Persistent storage via TreeStore (mode 'w'/'a'/'r'); save() / load() / CTable.open()
  • Persistent accelerated indexes on columns via CTableIndex (rebuild, compact, drop)
  • Arrow and CSV interoperability
  • Rich info / repr / Column.repr
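
The dataclass-driven schema with constraints can be illustrated with a plain-Python sketch. Note this is only an analogy for the shape of the API described above: the names `field`, `ge`, `le`, and `nullable` come from this PR, but `Measurement` and `validate_row` are hypothetical, and the real `blosc2.field()` implementation surely differs.

```python
import dataclasses

def field(dtype, *, ge=None, le=None, nullable=False):
    """Hypothetical stand-in for blosc2.field(): a typed column spec."""
    return dataclasses.field(
        default=None,
        metadata={"dtype": dtype, "ge": ge, "le": le, "nullable": nullable},
    )

@dataclasses.dataclass
class Measurement:                       # illustrative schema
    sensor_id: int = field("int64", ge=0)
    value: float = field("float64", nullable=True)

def validate_row(row):
    """Check each value against its column's ge/le/nullable constraints."""
    for f in dataclasses.fields(type(row)):
        v = getattr(row, f.name)
        meta = f.metadata
        if v is None:
            if not meta["nullable"]:
                raise ValueError(f"{f.name} is not nullable")
            continue
        if meta["ge"] is not None and v < meta["ge"]:
            raise ValueError(f"{f.name} must be >= {meta['ge']}")
        if meta["le"] is not None and v > meta["le"]:
            raise ValueError(f"{f.name} must be <= {meta['le']}")

validate_row(Measurement(sensor_id=3, value=1.5))   # passes
```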

Schema layer (schema.py, schema_compiler.py, schema_validation.py, schema_vectorized.py)

  • Typed spec primitives for all NumPy-compatible dtypes plus string and bytes
  • Row-wise (Pydantic) and vectorized batch validation
  • schema_to_dict / schema_from_dict for on-disk schema persistence
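
The dict round-trip for on-disk persistence can be sketched with plain dataclasses. The function names `schema_to_dict` / `schema_from_dict` are from this PR, but the bodies below are a minimal assumption of how such a round-trip works, not the actual `schema_compiler.py` code (which certainly stores more, e.g. constraints).

```python
import dataclasses

@dataclasses.dataclass
class Point:                             # illustrative schema
    x: float = 0.0
    y: float = 0.0

def schema_to_dict(cls):
    """Serialize a dataclass schema to a plain dict (sketch)."""
    return {
        "name": cls.__name__,
        "columns": {
            f.name: getattr(f.type, "__name__", f.type)
            for f in dataclasses.fields(cls)
        },
    }

def schema_from_dict(d):
    """Rebuild a dataclass type from the dict produced above."""
    return dataclasses.make_dataclass(
        d["name"],
        [(name, tname, dataclasses.field(default=None))
         for name, tname in d["columns"].items()],
    )

d = schema_to_dict(Point)
Restored = schema_from_dict(d)
```

The round-trip is stable: serializing `Restored` yields the same dict again.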

Infrastructure improvements

  • DictStore: new discard() method; relaxed path-suffix requirements; fix .b2z double-open GC corruption; fix temp-dir unpacking path
  • TreeStore: minor follow-ups from CTable integration
  • SChunk: extended constructor docstring; internal helpers for blosc2.open() TreeStore materialisation
  • blosc2.open(): now warns when defaulting to mode='a' (the future default will be 'r'); faster opens by trying a standard open first; fix a Proxy regression in mode='r'
  • ndarray.py: _normalize_expr_operand helper for mixed numpy/blosc2 operands

Docs, examples, benchmarks

  • Full Sphinx reference page (doc/reference/ctable.rst) added to the class index
  • Jupyter tutorial notebook (examples/ctable/ctable_tutorial.ipynb) and a new indexing tutorial (15.indexing-ctables.ipynb)
  • 20+ benchmark scripts under bench/ctable/ and example scripts under examples/ctable/

Tests

  • ~3,500 lines of new tests in tests/ctable/ covering construction, mutations, persistence, nullable columns, indexing, CSV/Arrow interop, schema validation,
    compaction, and row logic
  • Existing ndarray/schunk/open tests updated for the new blosc2.open() mode warning

Jacc4224 and others added 30 commits March 26, 2026 11:05
Introduce CTable, a new columnar table class for efficient in-memory
data storage using Blosc2 as the underlying compression engine.

Each column is represented as a Column object wrapping a blosc2.NDArray
with typed, compressed storage. Building on top of blosc2's existing
infrastructure, CTable supports append, iteration and
column-based queries.

This is an early-stage (beta) implementation; the table is always fully
loaded in memory.

New files:
- src/blosc2/ctable.py: CTable and Column class definitions
- tests/ctable/: unit tests covering construction, slicing, deletion,
  compaction and row logic
- bench/ctable/: benchmarks comparing CTable against pandas
Add CTable, a columnar in-memory table built on top of blosc2
  - Add schema.py with spec primitives: int8/16/32/64, uint8/16/32/64,
    float32/64, bool, complex64/128, string, bytes — sharing a _NumericSpec
    mixin to avoid boilerplate
  - Add schema_compiler.py: compile_schema(), CompiledSchema/Column/Config,
    schema_to_dict() / schema_from_dict() for persistence groundwork
  - Export all spec types and field() from blosc2 namespace

  Validation:
  - Add schema_validation.py: Pydantic-backed row validation for append(),
    cached per schema, re-raised as plain ValueError
  - Add schema_vectorized.py: vectorized NumPy constraint checks for extend(),
    using np.char.str_len() for string/bytes columns
  - validate= per-call override on extend() (None inherits table default)

  CTable refactor:
  - Constructor accepts dataclass schemas; legacy Pydantic adapter kept
  - Schema introspection: table.schema, column_schema(), schema_dict()
  - _last_pos cache eliminates backward chunk scan on every append/extend
  - _grow() shared resize helper; delete() writes back in-place without
    creating a new array; _n_rows updated by subtraction not count_nonzero
  - head() and tail() unified through _find_physical_index()
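  The delete() bookkeeping described above can be sketched in a few lines: rows are tombstoned in a validity mask, the live-row count is updated by subtraction rather than recounting, and a helper maps logical to physical positions. `TinyTable` and its members are illustrative names, not the real CTable internals.

```python
class TinyTable:
    """Minimal sketch of tombstone deletion with a validity mask."""

    def __init__(self, values):
        self._values = list(values)
        self._valid = [True] * len(values)
        self._n_rows = len(values)

    def delete(self, i):
        phys = self._find_physical_index(i)
        self._valid[phys] = False
        self._n_rows -= 1            # subtraction, no count_nonzero scan

    def _find_physical_index(self, logical):
        """Map a logical row number to a physical slot, skipping tombstones."""
        seen = -1
        for phys, ok in enumerate(self._valid):
            if ok:
                seen += 1
                if seen == logical:
                    return phys
        raise IndexError(logical)

    def compact(self):
        """Rewrite storage keeping only live rows."""
        self._values = [v for v, ok in zip(self._values, self._valid) if ok]
        self._valid = [True] * len(self._values)

t = TinyTable([10, 20, 30])
t.delete(1)
t.compact()
```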

  Tests and docs:
  - 135 tests across 10 test files, all passing
  - plans/ctable-implementation-log.md and ctable-user-guide.md added
  - Benchmarks: bench_validation.py and bench_append_regression.py
…QoL)

  Persistency:
    - FileTableStorage backend: disk layout _meta.b2frame / _valid_rows.b2nd / _cols/<name>.b2nd
    - CTable(Row, urlpath=..., mode="w"/"a"/"r"), CTable.open(), CTable.save(), CTable.load()
    - Read-only mode blocks all writes; save() always writes compacted rows

  Column aggregates: sum, min, max, mean, std, any, all (chunk-aware via iter_chunks)
  Column utilities: unique(), value_counts(), assign(), boolean mask __getitem__/__setitem__

  Schema mutations: add_column (fills default for existing rows), drop_column, rename_column
    - All three update schema, handle disk files, and block on views

  View mutability model fix:
    - Views allow value writes (assign, __setitem__) — only structural mutations are blocked
    - _read_only=True reserved for mode="r" disk tables; base is not None guards structural ops
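  The view mutability model can be sketched as follows: value writes go through unconditionally, while structural mutations are guarded by a `base is not None` check. The class and method names here are illustrative, not the actual CTable code.

```python
class ColumnView:
    """Sketch: views allow value writes, block structural mutations."""

    def __init__(self, data, base=None):
        self.data = data
        self.base = base             # None for a root table, set for views

    def assign(self, i, value):
        self.data[i] = value         # value write: always allowed

    def drop_all(self):
        if self.base is not None:    # structural op: blocked on views
            raise ValueError("structural mutation not allowed on a view")
        self.data.clear()

root = ColumnView([1, 2, 3])
view = ColumnView(root.data, base=root)
view.assign(0, 99)                   # allowed, visible through the base
```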

  QoL: __str__ pandas-style, __repr__, cbytes/nbytes, sample(n), Column.iter_chunks(size)

  Tests: 258 tests, ~5s — new test_persistency.py (33), test_schema_mutations.py (41),
    expanded test_column.py; optimized helpers to use to_numpy() instead of row[i]
Arrow compatibility
Examples
Tutorial
Start integration with main Python-Blosc2
FrancescAlted and others added 26 commits April 15, 2026 11:26
 - New CTableIndex handle with col_name, kind, name, stale properties
 - create_index(), drop_index(), rebuild_index(), compact_index() methods
 - index() lookup and indexes property on CTable
 - _CTableIndexProxy duck-type shim routes sidecar files to
   <table.b2d>/_indexes/<col_name>/ for persistent tables
 - Index catalog stored in /_meta vlmeta; survives table close/reopen
 - where() automatically uses a fresh index; falls back to scan when stale
 - Epoch tracking: mutations (append, extend, setitem, assign, sort_by,
   compact) mark all indexes stale; delete() bumps visibility_epoch only
 - Views raise ValueError for all index management methods
 - Add _indexes to reserved column names in schema_compiler
 - 32 new tests in tests/ctable/test_ctable_indexing.py
 - New example examples/ctable/indexing.py
 - New tutorial doc/getting_started/tutorials/15.indexing-ctables.ipynb

 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
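
 The epoch-tracking idea above can be sketched like this: every mutation bumps a table-level counter, and an index built at an older epoch reports itself stale (so where() can fall back to a scan). `Table` and `IndexHandle` are illustrative names, not the real CTable/CTableIndex classes.

```python
class IndexHandle:
    """Records the table epoch at build time; stale when epochs diverge."""

    def __init__(self, table):
        self._built_epoch = table._epoch

    def is_stale(self, table):
        return self._built_epoch != table._epoch

class Table:
    """Sketch: every mutation bumps an epoch, invalidating all indexes."""

    def __init__(self):
        self._epoch = 0
        self.rows = []

    def append(self, row):
        self.rows.append(row)
        self._epoch += 1             # marks every existing index stale

t = Table()
idx = IndexHandle(t)                 # fresh index
t.append({"a": 1})                   # mutation: idx is now stale
```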
  Fix CTable index lifecycle for schema mutations by removing index catalog
  entries and sidecars when indexed columns are dropped, and rebuilding
  indexes under the new name after column renames.

  Improve indexed CTable filtering so where() can expose multiple usable
  column indexes to the planner for conjunctive predicates, and raise a
  clear error for malformed table-owned index metadata instead of silently
  falling back to scans.

  Add regression coverage for indexed column rename/drop behavior,
  multi-column indexed conjunctions, and malformed catalog entries.
  Wire the CTable indexing tutorial into the docs toctree.
  Implement the first phase of plans/changing-default-open-mode.md by
  tracking omitted mode= with a sentinel and emitting a FutureWarning when
  blosc2.open() relies on the current implicit "a" behavior.

  Update mmap-related tests, examples, and docstrings to pass explicit
  mode="r" so they keep exercising their intended paths without tripping
  the migration warning.
Provide indexing for the CTable object
   Opening a .b2z store in append mode a second time (in a new process) was
   failing with "blosc2_schunk_open_offset returned NULL" because the first
   open had silently overwritten the archive with a near-empty ZIP file.

   Three root causes were identified and fixed:

   1. Probe store in _open_treestore_root_object() called close() on a
      temporary TreeStore used only to read the manifest. This triggered
      to_b2z() and repacked the archive before the real open started.
      Fix: added DictStore.discard() (cleanup without repack) and switched
      the probe store to call discard() instead of close().

   2. CTable.__del__ (GC path) called storage.close() which chained into
      TreeStore.close() → to_b2z(). With nothing modified, the temp dir
      could be partially torn down, producing a corrupt archive.
      Fix: DictStore gains _closed and _modified flags; __del__ now calls
      discard() when _modified is False (no writes via __setitem__/
      __delitem__), and close() otherwise. CTable.__del__ is changed to
      call storage.discard() directly. FileTableStorage.discard() and
      TableStorage.discard() added to complete the delegation chain.

   3. TreeStore subtree views (created via __dict__.update from the parent)
      shared the parent's _temp_dir_obj. GC of a subtree could destroy the
      parent's temp dir. Fix: subtrees now set _closed=True immediately
      after copying the parent's __dict__.

   The contract is:
     - Explicit close() / with block → always repacks (user intent)
     - GC __del__ with no store-level writes → discard() (no repack, safe)
     - GC __del__ with __setitem__/__delitem__ writes → close() (repack)

   Add two regression tests: one at the DictStore layer and one for the
   full indexed-CTable-in-.b2z scenario that triggered the original report.

   Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
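
 The close/discard contract can be sketched with the `_modified` / `_closed` flags described above: an explicit close() always repacks, while the GC path repacks only if the store saw writes. `Store` is a toy stand-in for DictStore, not the real implementation.

```python
class Store:
    """Sketch of the __del__ contract: repack only on actual writes."""

    def __init__(self):
        self._modified = False
        self._closed = False
        self.repacked = False        # stands in for a to_b2z() repack

    def __setitem__(self, key, value):
        self._modified = True        # any write flips the flag

    def close(self):
        if not self._closed:
            self.repacked = True     # explicit close: always repack
            self._closed = True

    def discard(self):
        self._closed = True          # cleanup without repacking

    def __del__(self):
        if self._modified:
            self.close()             # writes happened: repack
        else:
            self.discard()           # read-only lifetime: safe discard
```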
Nullable attribute in schema.
   Python 3.14 sets the gen-2 GC threshold to 0, so long-lived objects
   (like SChunk instances in pytest fixtures) never get auto-collected.
   Each SChunk carries 2×nthreads OS pthreads via its cctx/dctx. During
   pytest teardown, thousands of accumulated SChunks hit macOS's 6144
   thread limit, causing an indefinite hang on gc.collect().

   Three fixes applied:

   - Release the GIL during blosc2_schunk_free() in SChunk.__dealloc__
     to prevent GIL deadlock when mass finalization triggers pthread_join

   - Add periodic gc.collect() every 50 tests in conftest.py to prevent
     thread accumulation past the macOS ceiling

   - Cap ThreadPoolExecutor in lazyexpr.py to os.cpu_count() workers
     (fixes #556)

   Closes #556
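
   The worker-cap fix follows a standard pattern: bound the pool by both the number of tasks and the CPU count so idle OS threads are never spawned. This is a generic sketch of that pattern, not the actual lazyexpr.py code.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def run_capped(tasks):
    """Run callables with a pool capped at min(len(tasks), cpu_count)."""
    max_workers = min(len(tasks), os.cpu_count() or 1) or 1
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves input order, so results line up with tasks
        return list(pool.map(lambda f: f(), tasks))

results = run_capped([lambda: 1, lambda: 2])
```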
   - FieldsAccessor: use weakref to parent NDArray and create NDField
     instances on access instead of eagerly. Breaks NDArray↔NDField
     reference cycles so gen-0/gen-1 GC can collect them, reducing
     thread accumulation on Python 3.14.

   - sync_read_chunks: wrap reader thread in try/finally with
     thread.join() to prevent thread leaks. Propagate exceptions
     from the async reader instead of silently swallowing them.
     Detect dead reader thread to avoid infinite wait on empty queue.

   - ThreadPoolExecutor: cap max_workers to min(len(arrs), nthreads)
     instead of os.cpu_count(), avoiding excess threads when there
     are few operands.
   - _RowIndexer: use weakref to parent CTable to break reference
     cycles, allowing CTable instances to be collected without gen-2 GC.

   - indexing.py: stop caching writable sidecar handles in the
     process-wide _SIDECAR_HANDLE_CACHE after construction — they kept
     NDArray objects alive indefinitely across tests on macOS/Python 3.14.

   - indexing.py: add _purge_stale_persistent_caches() to evict index
     cache entries whose backing paths no longer exist. Call it in
     _load_store, _open_query_cache_store, _open_sidecar_handle, and
     _gather_mmap_source.

   - ref.py: open urlpath operands in mode='r' instead of the default
     mode='a'; persistent recipe operands only need read access.

   - test_ctable_indexing.py: add tests verifying that sidecar handles
     are not cached after index creation, that CTables release without
     explicit GC, and that stale persistent caches are purged correctly.
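
   The weakref pattern used for _RowIndexer (and FieldsAccessor) can be sketched as follows: the child holds only a weak reference to its parent, so there is no cycle and refcounting alone frees the table. `Table` and `RowIndexer` are illustrative names.

```python
import weakref

class RowIndexer:
    """Holds a weakref to its parent so it never keeps the table alive."""

    def __init__(self, table):
        self._table_ref = weakref.ref(table)

    def __getitem__(self, i):
        table = self._table_ref()
        if table is None:
            raise ReferenceError("parent table has been collected")
        return table.rows[i]

class Table:
    def __init__(self, rows):
        self.rows = rows
        self.row = RowIndexer(self)  # strong down, weak up: no cycle

t = Table([["a", 1], ["b", 2]])
first = t.row[0]
```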