Implement Codec Infrastructure#1300
Add fixed-point decimal as a core DataJoint type, allowing it to be recorded in field comments using :type: syntax for reconstruction. This provides scientists with a standardized type for exact numeric precision use cases (financial data, coordinates, etc.). Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Change the core binary type from 'blob' to 'bytes' to:
- Enable cross-database portability (LONGBLOB in MySQL, BYTEA in PostgreSQL)
- Free up native blob types (tinyblob, blob, mediumblob, longblob)
- Use Pythonic naming that matches the stored/returned type
Update all documentation to include PostgreSQL type mappings alongside MySQL mappings, making the cross-database support explicit.
Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Correct the dtype documentation to clarify:
- longblob is a native MySQL type for raw binary data (not serialized)
- <djblob> should be used as dtype for serialized Python objects
Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
PostgreSQL supports native ENUM via CREATE TYPE ... AS ENUM, which provides similar semantics to MySQL ENUM (efficient storage, value enforcement, definition-order ordering). DataJoint will handle the separate type creation automatically. Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
- Rewrite attributes.md to prioritize core types over native types
- Add timezone policy: all datetime values stored as UTC
- Timezone conversion is a presentation concern, not database concern
- Update storage-types-spec.md with UTC policy and CURRENT_TIMESTAMP example
Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
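A minimal sketch of what the UTC policy implies for application code, assuming timestamps are normalized before insert. `to_utc_naive` is a hypothetical helper for illustration, not part of DataJoint.

```python
from datetime import datetime, timedelta, timezone


def to_utc_naive(value: datetime) -> datetime:
    """Convert a timezone-aware datetime to UTC and drop tzinfo for storage."""
    if value.tzinfo is None:
        # A naive datetime carries no offset, so it cannot be converted safely.
        raise ValueError("expected a timezone-aware datetime")
    return value.astimezone(timezone.utc).replace(tzinfo=None)


# 09:30 at UTC-05:00 corresponds to 14:30 UTC.
local = datetime(2025, 1, 15, 9, 30, tzinfo=timezone(timedelta(hours=-5)))
stored = to_utc_naive(local)
```

Converting back to a local zone would then happen only at display time, in line with the "presentation concern" wording above.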
Core types:
- Add `text` as a core type for unlimited-length text (TEXT in both MySQL and PostgreSQL)
Type modifiers policy:
- Document that SQL modifiers (NOT NULL, DEFAULT, PRIMARY KEY, UNIQUE, COMMENT) are not allowed - DataJoint has its own syntax
- Document that AUTO_INCREMENT is discouraged but allowed with native types
- UNSIGNED is allowed as part of type semantics
Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
- UTF-8 required: utf8mb4 (MySQL) / UTF8 (PostgreSQL)
- Case-sensitive by default: utf8mb4_bin / C collation
- Database-level configuration via dj.config, not per-column
- CHARACTER SET and COLLATE modifiers not allowed in type definitions
- Like timezone, encoding is infrastructure configuration
Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
- Reorganize "Special DataJoint-only datatypes" as "AttributeTypes"
- Add naming convention explanation (dj prefix, x prefix, @store suffix)
- List all built-in AttributeTypes with categories:
  - Serialization types: <djblob>, <xblob>
  - File storage types: <object>, <content>
  - File attachment types: <attach>, <xattach>
  - File reference types: <filepath>
- Fix inconsistent angle bracket notation throughout docs
- Update example to use int32 core type and include <djblob>
- Expand naming conventions in Key Design Decisions section
Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
The @ character now indicates external storage (object store vs database):
- No @ = internal (database): <blob>, <attach>
- @ present = external (object store): <blob@>, <attach@store>
- @ alone = default store: <blob@>
- @name = named store: <blob@cold>
Key changes:
- Rename <djblob> to <blob> (internal) and <xblob> to <blob@> (external)
- Rename <xattach> to <attach@> (external variant of <attach>)
- Mark <object@>, <content@>, <filepath@> as external-only types
- Replace dtype property with get_dtype(is_external) method
- Use core type 'bytes' instead of 'longblob' for portability
- Add type resolution and chaining documentation
- Update Storage Comparison and Built-in AttributeType Comparison tables
- Simplify from 7 built-in types to 5: blob, attach, object, content, filepath
Type chaining at declaration time:
  <blob> → get_dtype(False) → "bytes" → LONGBLOB/BYTEA
  <blob@> → get_dtype(True) → "<content>" → json → JSON/JSONB
  <object@> → get_dtype(True) → "json" → JSON/JSONB
Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
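The `@` rules and the chaining described in this commit message can be sketched as a small resolver. This is an illustration only: `CODEC_DTYPES`, `parse_spec`, and `resolve` are hypothetical names, and the lookup table is a stand-in for each codec's `get_dtype(is_external)` built from the values listed in the message.

```python
import re

# Hypothetical stand-in for get_dtype(is_external), keyed by codec name.
# A missing key means that storage mode is not supported by the codec.
CODEC_DTYPES = {
    "blob": {False: "bytes", True: "<content>"},
    "attach": {False: "bytes", True: "<content>"},
    "content": {True: "json"},
    "object": {True: "json"},
    "filepath": {True: "json"},
}

SPEC = re.compile(r"^<(?P<name>\w+)(?P<at>@(?P<store>\w*))?>$")


def parse_spec(spec):
    """Split '<name>', '<name@>' or '<name@store>' into its parts."""
    match = SPEC.match(spec)
    if match is None:
        raise ValueError(f"not a codec spec: {spec!r}")
    is_external = match.group("at") is not None
    store = match.group("store") or None  # None stands for the default store
    return match.group("name"), is_external, store


def resolve(spec):
    """Follow the dtype chain until a core type is reached."""
    _, is_external, store = parse_spec(spec)
    chain, dtype = [spec], spec
    while dtype.startswith("<"):
        codec = dtype.strip("<>").split("@")[0]
        options = CODEC_DTYPES[codec]
        if is_external not in options:
            raise ValueError(f"<{codec}> requires external storage (@)")
        dtype = options[is_external]
        chain.append(dtype)
    return chain, store
```

Under these assumptions, `resolve("<blob@cold>")` walks `<blob@cold>` to `<content>` to `json` and reports the store `cold`, while `resolve("<object>")` is rejected because the codec is external-only.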
Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Rename <content@> to <hash@> throughout documentation:
- More descriptive: indicates hash-based addressing mechanism
- Familiar concept: works like a hash data structure
- Storage folder: _content/ → _hash/
- Registry: ContentRegistry → HashRegistry
The <hash@> type provides:
- SHA256 hash-based addressing
- Automatic deduplication
- External-only storage (requires @)
- Used as dtype by <blob@> and <attach@>
Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
- Use '= CURRENT_TIMESTAMP : datetime' syntax (not SQL DEFAULT) - Use uint64 core type instead of 'bigint unsigned' native type Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
DataJoint handles nullability through the default value syntax:
- Attribute is nullable iff default is NULL
- No separate NOT NULL / NULL modifier needed
- Examples: required, nullable, and default value cases
Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
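The three cases (required, nullable, default value) can be illustrated with a toy parser for `name = default : type` lines. `parse_attribute` is a hypothetical, simplified helper and not DataJoint's declaration parser; it ignores comments and defaults that themselves contain `:` or `=`.

```python
def parse_attribute(line):
    """Parse 'name [= default] : type' and report whether it is nullable."""
    head, _, dtype = line.partition(":")
    name, eq, default = head.partition("=")
    default = default.strip() if eq else None
    return {
        "name": name.strip(),
        "type": dtype.strip(),
        "default": default,
        # Nullable if and only if the declared default is NULL.
        "nullable": default is not None and default.upper() == "NULL",
    }


required = parse_attribute("subject_id : int32")
nullable = parse_attribute("weight = null : decimal(5,2)")
defaulted = parse_attribute("created = CURRENT_TIMESTAMP : datetime")
```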
Hash metadata (hash, store, size) is stored directly in each table's JSON column - no separate registry table is needed. Garbage collection now scans all tables to find referenced hashes in JSON fields directly. Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
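A rough sketch of registry-free garbage collection as described above, assuming each JSON field holds a dict with `hash` and `store` keys. The function names and the in-memory row format are hypothetical; the real implementation would query the database and the object store.

```python
import json


def referenced_hashes(rows, json_columns):
    """Collect (store, hash) pairs found in the JSON columns of one table."""
    refs = set()
    for row in rows:
        for column in json_columns:
            raw = row.get(column)
            if raw is None:
                continue
            meta = json.loads(raw) if isinstance(raw, str) else raw
            if isinstance(meta, dict) and "hash" in meta:
                refs.add((meta.get("store"), meta["hash"]))
    return refs


def orphaned(stored, tables):
    """Return stored (store, hash) pairs that no table references."""
    live = set()
    for rows, json_columns in tables:
        live |= referenced_hashes(rows, json_columns)
    return stored - live
```

Anything returned by `orphaned` would be a candidate for deletion from the store.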
MD5 (128-bit, 32-char hex) is sufficient for content-addressed deduplication:
- Birthday bound ~2^64 provides adequate collision resistance for scientific data
- 32-char vs 64-char hashes reduces storage overhead in JSON metadata
- MD5 is ~2-3x faster than SHA256 for large files
- Consistent with existing dj.hash module (key_hash, uuid_from_buffer)
- Simplifies migration since only storage format changes, not the algorithm
Added Hash Algorithm Choice section documenting the rationale.
Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
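The digest-length and deduplication points can be checked with the standard library. This is a sketch with hypothetical names (`content_key`, `put`, an in-memory `store` dict), not the DataJoint implementation; the `usedforsecurity` flag needs Python 3.9+.

```python
import hashlib


def content_key(data: bytes) -> str:
    """Return the 32-character MD5 hex digest used as the content address."""
    # usedforsecurity=False marks this as a non-cryptographic use.
    return hashlib.md5(data, usedforsecurity=False).hexdigest()


store = {}  # stand-in for an object store keyed by content hash


def put(data: bytes) -> str:
    """Store data under its content hash; identical content is stored once."""
    key = content_key(data)
    store.setdefault(key, data)
    return key


first = put(b"spike train")
second = put(b"spike train")  # same bytes, same key, no second copy
md5_len = len(content_key(b"spike train"))
sha256_len = len(hashlib.sha256(b"spike train").hexdigest())
```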
- uuid_from_file was never called anywhere in the codebase
- uuid_from_stream only existed to support uuid_from_file
- Inlined the logic directly into uuid_from_buffer
- Removed unused io and pathlib imports
Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
The implementation plan was heavily outdated with:
- Old type names (<content>, <xblob>, <xattach> vs <hash@>, <blob@>, <attach@>)
- Wrong hash algorithm (SHA256 vs MD5)
- Wrong paths (_content/ vs _hash/)
- References to removed HashRegistry table
All relevant design information is now in storage-types-spec.md. Implementation details (ObjectRef API, staged_insert) will be documented in user-facing API docs when implemented.
Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
- Rename DECIMAL to NUMERIC in native types (decimal is in core types)
- Rename TEXT to NATIVE_TEXT (text is in core types)
- Change BLOB references to BYTES in heading.py (bytes is the core type name)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Terminology changes in spec and user docs:
- "AttributeTypes" → "Codec Types" (category name)
- "AttributeType" → "Codec" (base class)
- "@register_type" → "@dj.codec" (decorator)
- "type_name" → "name" (class attribute)
The term "Codec" better conveys the encode/decode semantics of these types, drawing on the familiar audio/video codec analogy. Code changes (class renaming, backward-compat aliases) to follow.
Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Design improvements for Python 3.10+:
- Codecs auto-register when subclassed via __init_subclass__
- No decorator needed - just inherit from dj.Codec and set name
- Use register=False for abstract base classes
- Removed @dj.codec decorator from all examples
New API:
    class GraphCodec(dj.Codec):
        name = "graph"
        def encode(...): ...
        def decode(...): ...
Abstract bases:
    class ExternalOnlyCodec(dj.Codec, register=False):
        ...
Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
- Codec.get_dtype(is_external) now determines storage type based on whether @ modifier is present in the declaration
- BlobCodec returns "bytes" for internal, "<hash>" for external
- AttachCodec returns "bytes" for internal, "<hash>" for external
- HashCodec, ObjectCodec, FilepathCodec enforce external-only usage
- Consolidates <blob>/<xblob> and <attach>/<xattach> into unified codecs
- Adds backward compatibility aliases for old type names
- Updates __init__.py with new codec exports (Codec, list_codecs, get_codec)
- Remove legacy codecs (djblob, xblob, xattach, content)
- Use unified codecs: <blob>, <attach>, <hash>, <object>, <filepath>
- All codecs support both internal and external modes via @store modifier
- Fix dtype chain resolution to propagate store to inner codecs
- Fix fetch.py to resolve correct chain for external storage
- Update tests to use new codec API (name, get_dtype method)
- Fix imports: use content_registry for get_store_backend
- Add 'local' store to mock_object_storage fixture
All 471 tests pass.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Rename attribute_type.py → codecs.py
- Rename builtin_types.py → builtin_codecs.py
- Rename test_attribute_type.py → test_codecs.py
- Rename get_adapter() → lookup_codec()
- Rename attr.adapter → attr.codec in Attribute namedtuple
- Update all imports and references throughout codebase
- Update comments and docstrings to use codec terminology
All 471 tests pass.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove AttributeType alias (use Codec directly)
- Remove register_type function (codecs auto-register)
- Remove deprecated type_name property (use name)
- Remove list_types, get_type, is_type_registered, unregister_type aliases
- Update all internal usages from type_name to name
- Update tests to use new API
The previous implementation was experimental; no backward compatibility is needed for the v2.0 release.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add codec-spec.md: detailed API specification for creating codecs
- Add codecs.md: user guide with examples (replaces customtype.md)
- Remove customtype.md (replaced by codecs.md)
Documentation covers:
- Codec base class and required methods
- Auto-registration via __init_subclass__
- Codec composition/chaining
- Plugin system via entry points
- Built-in codecs (blob, hash, object, attach, filepath)
- Complete examples for neuroscience workflows
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The detailed implementation specification has served its purpose. User documentation is now in object.md, codec API in codec-spec.md, and type architecture in storage-types-spec.md. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Code cleanup:
- Remove backward compatibility aliases (ObjectType, AttachType, etc.)
- Remove misleading comments about non-existent DJBlobType/ContentType
- Remove unused build_foreign_key_parser_old function
- Remove unused feature switches (ADAPTED_TYPE_SWITCH, FILEPATH_FEATURE_SWITCH)
- Remove unused os import from errors.py
- Rename ADAPTED type category to CODEC
Documentation fixes:
- Update mkdocs.yaml nav: customtype.md → codecs.md
- Fix dead links in attributes.md pointing to customtype.md
Terminology updates:
- Replace "AttributeType" with "Codec" in all comments
- Replace "Adapter" with "Codec" in docstrings
- Fix SHA256 → MD5 in content_registry.py docstring
Version bump to 2.0.0a6
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Filepath feature is now always enabled; no feature flag needed. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
File renames:
- schema_adapted.py → schema_codecs.py
- test_adapted_attributes.py → test_codecs.py
- test_type_composition.py → test_codec_chaining.py
Content updates:
- LOCALS_ADAPTED → LOCALS_CODECS
- GraphType → GraphCodec, LayoutToFilepathType → LayoutCodec
- Test class names: TestTypeChain* → TestCodecChain*
- Test function names: test_adapted_* → test_codec_*
- Updated docstrings and comments
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tests now automatically start MySQL and MinIO containers via testcontainers. No manual `docker-compose up` required - just run `pytest tests/`.
Changes:
- conftest.py: Add mysql_container and minio_container fixtures that auto-start containers when tests run and stop them afterward
- pyproject.toml: Add testcontainers[mysql,minio] dependency, update pixi tasks, remove pytest-env (no longer needed)
- docker-compose.yaml: Update docs to clarify it's optional for tests
- README.md: Comprehensive developer guide with clear instructions for running tests, pre-commit hooks, and PR submission checklist
Usage:
- Default: `pytest tests/` - testcontainers manages containers
- External: `DJ_USE_EXTERNAL_CONTAINERS=1 pytest` - use docker-compose
Benefits:
- Zero setup for developers - just `pip install -e ".[test]" && pytest`
- Dynamic ports (no conflicts with other services)
- Automatic cleanup after tests
- Simpler CI configuration
Version bump to 2.0.0a7
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update settings tests to accept dynamic ports (testcontainers uses random ports instead of default 3306)
- Fix test_top_restriction_with_keywords to use set comparison since dj.Top only guarantees which elements are selected, not their order
- Bump version to 2.0.0a8
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1ae815b to 15412ea
- Register requires_mysql and requires_minio marks in pyproject.toml
- Add pytest_collection_modifyitems hook to auto-mark tests based on fixture usage
- Remove autouse=True from configure_datajoint fixture so containers only start when needed
- Fix test_drop_unauthorized to use connection_test fixture
Tests can now run without Docker:
  pytest -m "not requires_mysql"  # Run 192 unit tests
Full test suite still works:
  DJ_USE_EXTERNAL_CONTAINERS=1 pytest tests/  # 471 tests
Bump version to 2.0.0a9
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
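A minimal sketch of the auto-marking hook described in this commit, assuming the fixture names `mysql_container` and `minio_container` from the earlier testcontainers commit. The mapping `MARK_BY_FIXTURE` is a hypothetical name, and the actual conftest.py may differ.

```python
# conftest.py (sketch)

# Hypothetical mapping from a container fixture to the mark it implies.
MARK_BY_FIXTURE = {
    "mysql_container": "requires_mysql",
    "minio_container": "requires_minio",
}


def pytest_collection_modifyitems(config, items):
    """Mark each collected test according to the fixtures it depends on."""
    for item in items:
        # fixturenames covers fixtures requested directly or through others.
        fixtures = set(getattr(item, "fixturenames", ()))
        for fixture, mark in MARK_BY_FIXTURE.items():
            if fixture in fixtures:
                item.add_marker(mark)  # add_marker accepts a mark name string
```

With marks applied this way, `pytest -m "not requires_mysql"` deselects every test that would need the MySQL container.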
15412ea to fa47f47
- `tinyblob`, `blob`, `mediumblob`, `longblob`
- `tinytext`, `mediumtext`, `longtext` (size variants)
- `time`, `timestamp`, `year`
- `mediumint`, `serial`, `int auto_increment`
Does it make sense to have any of the core datatypes support auto_increment? I would expect that, or at least to not generate a warning if used.
auto_increment is not part of DataJoint's model for entity integrity. There are several reasons that are captured in the new docs but here is the main one (number 9).
9. Avoiding Identification System Design
Novice users reach for auto_increment as a shortcut to avoid the harder work of designing a proper identification system. They treat it as "row number" rather than "entity identifier"—conflating storage order with identity.
This reveals a conceptual gap: they're not thinking about what makes this entity unique in the real world, just how to get a number in the table. The result:
- Subjects identified by insertion order rather than lab naming conventions
- Sessions numbered globally rather than per-subject
- No ability to re-insert corrected data (new row = new ID)
- Foreign keys become meaningless integers rather than traceable lineage
DataJoint forces users to confront the question: "What uniquely identifies this entity in your domain?" This upfront design effort pays dividends in schema clarity, data integrity, and long-term maintainability.
The discomfort users feel when denied auto_increment is actually the discomfort of having to think carefully about their data model—which is precisely what they should be doing.
This is very well described in the emerging docs in https://github.com/datajoint/datajoint-docs/
See the [storage types spec](storage-types-spec.md) for complete mappings.

## Codec types (special datatypes)
This document doesn't contain the words Object Augmented Schema, or the acronym OAS. Is that deliberate? I'd expect that to be grep-able.
yep, this is moved into datajoint-docs in #1305. datajoint-python will no longer contain the specs.
    if _entry_points_loaded:
        return

    _entry_points_loaded = True
should this raise on successive calls if it fails on the initial call?
It will raise an error even on the first time in the calling context if the plugin they are looking for is not registered.
- **No `@`**: Internal storage (database) - e.g., `<blob>`, `<attach>`
- **`@` present**: External storage (object store) - e.g., `<blob@>`, `<attach@store>`
- **`@` alone**: Use default store - e.g., `<blob@>`
What is the default store? That is not really described anywhere.
The settings structure provides a separate section for the default store and named stores. The default store has no name. I will be sure that the how-tos cover it well in the docs.
    def get_dtype(self, is_external: bool) -> str:
        """
        Return the storage dtype for this codec.

        Args:
            is_external: True if @ modifier present (external storage)

        Returns:
            A core type (e.g., "bytes", "json") or another codec (e.g., "<hash>")

        Raises:
            NotImplementedError: If not overridden by subclass.
            DataJointError: If external storage not supported but requested.
        """
        raise NotImplementedError(f"Codec <{self.name}> must implement get_dtype()")

    @abstractmethod
    def encode(self, value: Any, *, key: dict | None = None, store_name: str | None = None) -> Any:
        """
        Encode Python value for storage.

        Args:
            value: The Python object to store.
            key: Primary key values as a dict. May be needed for path construction.
            store_name: Target store name for external storage.

        Returns:
            Value in the format expected by the dtype.
        """
        ...
Why do some raise NotImplementedError while others are marked as @abstractmethod?
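For context on the question, the two styles fail at different times in plain Python. The sketch below uses hypothetical class names and is not the PR's `Codec` class: an `@abstractmethod` blocks instantiation of an incomplete subclass, while a `NotImplementedError` body only fails when the method is actually called.

```python
from abc import ABC, abstractmethod


class Base(ABC):
    def get_dtype(self, is_external):
        # Plain method: a missing override is detected only at call time.
        raise NotImplementedError("must implement get_dtype()")

    @abstractmethod
    def encode(self, value):
        """Abstract: a missing override is detected at instantiation."""


class Incomplete(Base):
    pass  # overrides nothing


class OnlyEncode(Base):
    def encode(self, value):
        return value


def failure_of(action):
    """Return the exception type raised by action, or None if it succeeds."""
    try:
        action()
    except Exception as exc:
        return type(exc)
    return None


at_instantiation = failure_of(Incomplete)
at_call = failure_of(lambda: OnlyEncode().get_dtype(False))
```

Here `at_instantiation` is `TypeError` and `at_call` is `NotImplementedError`, so mixing the two styles changes when an incomplete subclass is caught.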
Summary
This PR implements a comprehensive redesign of the custom type system, renaming "AttributeType/adapter" terminology to "Codec" and providing a cleaner, more intuitive API. It also simplifies the testing infrastructure with testcontainers.
Codec API Redesign
Key Changes
- Renamed `AttributeType` base class to `Codec` - The new name better reflects the purpose: encoding Python objects for database storage and decoding them on retrieval
- Auto-registration via `__init_subclass__` - Codecs automatically register when their class is defined; no decorator needed
- `get_dtype(is_external)` method - Codecs dynamically return their underlying storage type based on whether external storage is used
- `<name>` or `<name@store>` syntax used consistently
- Renamed `ADAPTED` to `CODEC` in internal code
Built-in Codecs
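| Codec | Internal | External |
|---|---|---|
| `<blob>` | `bytes` | `<hash@>` |
| `<hash@>` | - | `json` |
| `<object@>` | - | `json` |
| `<attach>` | `bytes` | `<hash@>` |
| `<filepath@store>` | - | `json` |

New Codec API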
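As an illustration of the codec interface, here is a self-contained sketch of a graph codec that round-trips an adjacency mapping through an edge list. In DataJoint the class would inherit from `dj.Codec`; the `decode` signature, the choice of `<blob>` as dtype, and the edge-list format are assumptions made for this example.

```python
class GraphCodec:  # would subclass dj.Codec in a real schema
    name = "graph"

    def get_dtype(self, is_external: bool) -> str:
        # Assumption: delegate serialization to the built-in blob codec.
        return "<blob>"

    def encode(self, value, *, key=None, store_name=None):
        """Turn {node: neighbors} into a sorted list of (node, neighbor) edges."""
        return sorted((a, b) for a, neighbors in value.items() for b in neighbors)

    def decode(self, stored, *, key=None):
        """Rebuild the adjacency mapping; nodes without edges are not kept."""
        graph = {}
        for a, b in stored:
            graph.setdefault(a, set()).add(b)
        return graph


codec = GraphCodec()
edges = codec.encode({1: {2, 3}, 2: {3}})
restored = codec.decode(edges)
```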
Testing Infrastructure: Testcontainers
Tests now use testcontainers to automatically manage MySQL and MinIO containers. No manual `docker-compose up` required.
New Developer Workflow
Benefits
- Zero setup for developers: just `pip install` and `pytest`
Fallback: External Containers
For development/debugging with persistent containers:
Dead Code & Terminology Cleanup
Removed
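- Legacy aliases and functions (`AttributeType`, `register_type`, `list_types`, `get_type`, etc.)
- Backward compatibility aliases (`ObjectType`, `AttachType`, `XAttachType`, `FilepathType`)
- Unused `build_foreign_key_parser_old()` function
- Unused feature switches (`ADAPTED_TYPE_SWITCH`, `FILEPATH_FEATURE_SWITCH`)
- `enable_filepath_feature` test fixture
- Misleading comments about non-existent `DJBlobType`/`ContentType`
- `object-type-spec.md` (implementation complete, info now in `object.md`)
- `pytest-env` dependency (testcontainers handles configuration)
Renamed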
- `ADAPTED` → `CODEC` in declare.py and heading.py
- `schema_adapted.py` → `schema_codecs.py`
- `test_adapted_attributes.py` → `test_codecs.py`
- `test_type_composition.py` → `test_codec_chaining.py`
- Test class and function names updated to `Codec` terminology
Updated Terminology
- `content_registry.py` docstring: SHA256 → MD5
Documentation Updates
New Documentation
- `codec-spec.md` - Detailed API specification for creating custom codecs
- `codecs.md` - User guide with examples (replaces `customtype.md`)
- `README.md` - Comprehensive developer guide with test/pre-commit instructions
Updated Documentation
- `mkdocs.yaml` - Navigation updated: `customtype.md` → `codecs.md`
- `attributes.md` - Fixed dead links, updated terminology
- `docker-compose.yaml` - Clarified it's optional for tests
Removed Documentation
- `object-type-spec.md` (redundant with `object.md`)
- `customtype.md` (replaced by `codecs.md`)
Other Changes in This Branch
Settings System
- `save_*` methods and `set_password` function
Type System
- Core types (`int8`, `int16`, `int32`, `int64`, `uint8`, etc.)
- Added `decimal(n,f)` to core types
- Added `text` type and documented type modifier policy
External Storage
- `<object@>` type for managed file/folder storage
Infrastructure
- `unit/` and `integration/` directories
- Version `2.0.0a7`
Test Plan
- Codec chaining (`<blob@>` → `<hash@>` → storage) works
- Full test suite with external containers (`DJ_USE_EXTERNAL_CONTAINERS=1`)
🤖 Generated with Claude Code