Ctable 3 changes #606

Closed
Jacc4224 wants to merge 22 commits into Blosc:main from Jacc4224:my_ctable3

@Jacc4224

Pull request for local changes

Jacc4224 and others added 22 commits March 26, 2026 11:05
Introduce CTable, a new columnar table class for efficient in-memory
data storage using Blosc2 as the underlying compression engine.

Each column is represented as a Column object wrapping a blosc2.NDArray
with typed, compressed storage. Building on top of blosc2's existing
infrastructure, CTable supports append, iteration and
column-based queries.

This is an early-stage (beta) implementation; the table is always fully
loaded in memory.

New files:
- src/blosc2/ctable.py: CTable and Column class definitions
- tests/ctable/: unit tests covering construction, slicing, deletion,
  compaction and row logic
- bench/ctable/: benchmarks comparing CTable against pandas
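
To make the columnar idea concrete, here is a minimal sketch of a table that keeps one container per column and supports append, iteration and a column-based query. It is not the PR's code: `MiniColumnTable`, `Row` and `where()` are hypothetical names, and plain Python lists stand in for compressed `blosc2.NDArray` columns.

```python
from dataclasses import dataclass, fields


@dataclass
class Row:
    # Hypothetical schema used only for this illustration.
    id: int
    score: float


class MiniColumnTable:
    """Toy columnar table: one Python list per column (no compression)."""

    def __init__(self, row_type):
        self.row_type = row_type
        self.columns = {f.name: [] for f in fields(row_type)}

    def append(self, row):
        # Split the row into its fields and push each value to its column.
        for name, column in self.columns.items():
            column.append(getattr(row, name))

    def __len__(self):
        return len(next(iter(self.columns.values())))

    def __iter__(self):
        # Reassemble rows by zipping the columns back together.
        for values in zip(*self.columns.values()):
            yield self.row_type(*values)

    def where(self, name, predicate):
        # Column-based query: scan one column, return matching row indices.
        return [i for i, v in enumerate(self.columns[name]) if predicate(v)]


table = MiniColumnTable(Row)
for i in range(5):
    table.append(Row(id=i, score=i * 1.5))
hits = table.where("score", lambda s: s > 3.0)
```

The query touches only the `score` column, which is the main reason a columnar layout suits filter-heavy workloads.
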
Add CTable, a columnar in-memory table built on top of blosc2
  - Add schema.py with spec primitives: int8/16/32/64, uint8/16/32/64,
    float32/64, bool, complex64/128, string, bytes — sharing a _NumericSpec
    mixin to avoid boilerplate
  - Add schema_compiler.py: compile_schema(), CompiledSchema/Column/Config,
    schema_to_dict() / schema_from_dict() for persistence groundwork
  - Export all spec types and field() from blosc2 namespace

  Validation:
  - Add schema_validation.py: Pydantic-backed row validation for append(),
    cached per schema, re-raised as plain ValueError
  - Add schema_vectorized.py: vectorized NumPy constraint checks for extend(),
    using np.char.str_len() for string/bytes columns
  - validate= per-call override on extend() (None inherits table default)
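
The per-call `validate=` override described above can be read as a three-state flag. The sketch below illustrates that reading with a hypothetical `ToyValidatedTable` and a made-up `max_len` constraint; it does not use Pydantic or NumPy and is not the PR's implementation.

```python
class ToyValidatedTable:
    """Hypothetical stand-in for a table with a table-level validation default."""

    def __init__(self, validate=True, max_len=4):
        self.validate = validate  # table-level default
        self.max_len = max_len    # example constraint on a string column
        self.names = []

    def _check(self, values):
        bad = [v for v in values if len(v) > self.max_len]
        if bad:
            # Surface constraint failures as a plain ValueError.
            raise ValueError(f"{len(bad)} value(s) exceed max_len={self.max_len}")

    def extend(self, values, validate=None):
        # None means "inherit the table default"; True or False overrides it.
        effective = self.validate if validate is None else validate
        if effective:
            self._check(values)
        self.names.extend(values)


checked = ToyValidatedTable()
checked.extend(["ab", "cd"])                  # validated via the table default
checked.extend(["toolong"], validate=False)   # validation skipped for this call
```

Validating the whole batch before storing anything keeps a failed `extend()` from leaving a partially written table, at least in this toy version.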

  CTable refactor:
  - Constructor accepts dataclass schemas; legacy Pydantic adapter kept
  - Schema introspection: table.schema, column_schema(), schema_dict()
  - _last_pos cache eliminates backward chunk scan on every append/extend
  - _grow() shared resize helper; delete() writes back in-place without
    creating a new array; _n_rows updated by subtraction not count_nonzero
  - head() and tail() unified through _find_physical_index()
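
The bookkeeping changes listed above (cached write position, shared grow helper, in-place delete, row count updated by subtraction) can be sketched on a toy validity mask. Everything here is an assumption made for illustration: the class name, the doubling growth policy and the linear scan in `find_physical_index()` are not taken from the PR.

```python
class ValidityMask:
    """Toy row-validity mask with cached bookkeeping (illustrative only)."""

    def __init__(self, capacity=4):
        self.valid = [False] * capacity
        self.last_pos = 0  # next free physical slot, cached instead of rescanned
        self.n_rows = 0    # live-row count, maintained incrementally

    def _grow(self):
        # Shared resize helper: double the capacity (assumed policy).
        self.valid.extend([False] * len(self.valid))

    def append(self):
        if self.last_pos == len(self.valid):
            self._grow()
        self.valid[self.last_pos] = True
        self.last_pos += 1
        self.n_rows += 1

    def find_physical_index(self, logical):
        # Map the logical-th live row to its physical slot.
        seen = -1
        for pos in range(self.last_pos):
            if self.valid[pos]:
                seen += 1
                if seen == logical:
                    return pos
        raise IndexError(logical)

    def delete(self, logical):
        pos = self.find_physical_index(logical)
        self.valid[pos] = False  # flip the flag in place
        self.n_rows -= 1         # subtract instead of recounting the mask


mask = ValidityMask(capacity=2)
for _ in range(5):
    mask.append()
mask.delete(1)
```

After the delete, logical row 1 maps to physical slot 2, because slot 1 is now marked invalid but still occupies space until compaction.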

  Tests and docs:
  - 135 tests across 10 test files, all passing
  - plans/ctable-implementation-log.md and ctable-user-guide.md added
  - Benchmarks: bench_validation.py and bench_append_regression.py
…QoL)

  Persistency:
    - FileTableStorage backend: disk layout _meta.b2frame / _valid_rows.b2nd / _cols/<name>.b2nd
    - CTable(Row, urlpath=..., mode="w"/"a"/"r"), CTable.open(), CTable.save(), CTable.load()
    - Read-only mode blocks all writes; save() always writes compacted rows
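
As a rough picture of the disk layout named above, the following sketch only builds the expected paths with `pathlib`; it does not call blosc2 or touch the filesystem. The helper names, the `people.b2d` directory name and the exception type raised for read-only tables are all assumptions.

```python
from pathlib import Path


def table_layout(urlpath, column_names):
    """Return the file paths a table directory would contain (sketch)."""
    root = Path(urlpath)
    return {
        "meta": root / "_meta.b2frame",
        "valid_rows": root / "_valid_rows.b2nd",
        "columns": {name: root / "_cols" / f"{name}.b2nd" for name in column_names},
    }


def check_writable(mode):
    # Hypothetical guard: mode="r" rejects every write.
    if mode not in ("w", "a", "r"):
        raise ValueError(f"unknown mode {mode!r}")
    if mode == "r":
        raise PermissionError("table opened read-only")


layout = table_layout("people.b2d", ["id", "score"])
```

Keeping one file per column under `_cols/` means adding, dropping or renaming a column maps to a single file operation rather than a rewrite of the whole table.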

  Column aggregates: sum, min, max, mean, std, any, all (chunk-aware via iter_chunks)
  Column utilities: unique(), value_counts(), assign(), boolean mask __getitem__/__setitem__
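
A chunk-aware aggregate never needs the whole column in memory at once; it only carries a few running totals from chunk to chunk. The sketch below shows one way to do that in plain Python. The `iter_chunks()` function here is a stand-in over a list, not the PR's `Column.iter_chunks(size)`, and the choice of population standard deviation is an assumption.

```python
import math


def iter_chunks(values, size):
    # Stand-in for a chunked column reader: yield consecutive slices.
    for start in range(0, len(values), size):
        yield values[start:start + size]


def chunked_stats(values, size):
    """Compute sum, min, max, mean and population std one chunk at a time.

    Assumes a non-empty input.
    """
    n = 0
    total = 0.0
    total_sq = 0.0
    lo, hi = math.inf, -math.inf
    for chunk in iter_chunks(values, size):
        n += len(chunk)
        total += sum(chunk)
        total_sq += sum(v * v for v in chunk)
        lo = min(lo, min(chunk))
        hi = max(hi, max(chunk))
    mean = total / n
    # Population variance via E[x^2] - E[x]^2, clamped against rounding error.
    var = max(total_sq / n - mean * mean, 0.0)
    return {"sum": total, "min": lo, "max": hi, "mean": mean, "std": math.sqrt(var)}


stats = chunked_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0], size=3)
```

The sum-of-squares formula is simple but can lose precision on large, tightly clustered data; merging per-chunk means and variances (Welford or Chan style) is the usual more stable alternative.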

  Schema mutations: add_column (fills default for existing rows), drop_column, rename_column
    - All three update schema, handle disk files, and block on views

  View mutability model fix:
    - Views allow value writes (assign, __setitem__) — only structural mutations are blocked
    - _read_only=True reserved for mode="r" disk tables; base is not None guards structural ops
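
The mutability rules above separate two guards: a read-only flag that blocks everything, and a "has a base" check that blocks only structural changes. A minimal sketch of that split follows. `ToyViewTable` and its method names are hypothetical, and the `ValueError` exception type is an assumption.

```python
class ToyViewTable:
    """Illustrative table/view pair; not the PR's CTable."""

    def __init__(self, columns, base=None, read_only=False):
        self.columns = columns        # dict: name -> list of values
        self.base = base              # parent table if this object is a view
        self._read_only = read_only   # True only for tables opened with mode="r"

    def view(self):
        # A view shares the parent's column lists, so value writes show up in both.
        return ToyViewTable(self.columns, base=self)

    def set_value(self, name, index, value):
        # Value write: allowed on views, blocked only by read-only mode.
        if self._read_only:
            raise ValueError("table is read-only")
        self.columns[name][index] = value

    def add_column(self, name, default):
        # Structural mutation: blocked by read-only mode and on views.
        if self._read_only:
            raise ValueError("table is read-only")
        if self.base is not None:
            raise ValueError("structural changes are not allowed on a view")
        n_rows = len(next(iter(self.columns.values()), []))
        self.columns[name] = [default] * n_rows  # fill default for existing rows


parent = ToyViewTable({"id": [1, 2, 3]})
child = parent.view()
child.set_value("id", 0, 10)       # value write through a view
parent.add_column("flag", False)   # structural change on the base table
```

Keeping the two checks separate is what lets a view stay writable for values while a `mode="r"` table rejects every write.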

  QoL: __str__ pandas-style, __repr__, cbytes/nbytes, sample(n), Column.iter_chunks(size)

  Tests: 258 tests, ~5s — new test_persistency.py (33), test_schema_mutations.py (41),
    expanded test_column.py; optimized helpers to use to_numpy() instead of row[i]
Arrow compatibility
Examples
Tutorial
@FrancescAlted
Member

Overridden by PR #614
