Implement ObjectType for path-addressed storage

claude · dimitri-yatsenko · claude · commit ad09877dbf14 · 2025-12-25T22:25:09.000Z
Add <object> type for files and folders (Zarr, HDF5, etc.):
- Path derived from primary key: {schema}/{table}/objects/{pk}/{field}_{token}.{ext}
- Supports bytes, files, and directories
- Returns ObjectRef for lazy fsspec-based access
- No deduplication (unlike <content>)
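
To make the contrast with <content> concrete, here is a minimal sketch of the two addressing schemes. It is not the code from `storage.py`: `build_object_path` is not shown in this diff, so `sketch_object_path`, `sketch_content_path`, the `k=v` rendering of the primary key, and the 8-character hex token are all assumptions made for illustration.

```python
import hashlib
import secrets


def sketch_content_path(data: bytes) -> str:
    """Content-addressed: the path depends only on the bytes."""
    digest = hashlib.sha256(data).hexdigest()
    return f"_content/{digest[:2]}/{digest[2:4]}/{digest}"


def sketch_object_path(schema, table, field, primary_key, ext=None):
    """Path-addressed: the path depends on the row, not on the bytes.

    Hypothetical stand-in for build_object_path; the pk rendering and
    the token format are assumptions for this example only.
    """
    pk = "/".join(f"{k}={v}" for k, v in sorted(primary_key.items()))
    token = secrets.token_hex(4)  # random suffix (assumed format)
    path = f"{schema}/{table}/objects/{pk}/{field}_{token}{ext or ''}"
    return path, token


# Identical bytes map to one content path (deduplication).
a = sketch_content_path(b"same bytes")
b = sketch_content_path(b"same bytes")

# Different rows always get different object paths, whatever the content.
p1, t1 = sketch_object_path("lab", "analysis", "results", {"rec_id": 1}, ".zarr")
p2, t2 = sketch_object_path("lab", "analysis", "results", {"rec_id": 2}, ".zarr")
```

Because the object path embeds the primary key, removing a row identifies exactly one storage location to delete, whereas a content path may be shared by many rows and has to wait for garbage collection.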

Update implementation plan with Phase 2b documenting ObjectType.
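
The "bytes, files, and directories" handling listed above can be sketched as a standalone classifier. This mirrors the branching in `ObjectType.encode` from this commit but is not part of DataJoint: `describe_source` is a hypothetical helper, and it uses `stat().st_size` for files where the commit reads the file into memory and takes its length.

```python
import tempfile
from pathlib import Path


def describe_source(value):
    """Return (is_dir, ext, size) for an <object> input value.

    Hypothetical helper for illustration; not a DataJoint API.
    """
    if isinstance(value, bytes):
        return False, None, len(value)
    if isinstance(value, (str, Path)):
        source = Path(value)
        if not source.exists():
            raise FileNotFoundError(f"Source path does not exist: {source}")
        if source.is_dir():
            # Directory size is the sum of all files beneath it.
            size = sum(f.stat().st_size for f in source.rglob("*") if f.is_file())
            return True, None, size
        return False, source.suffix, source.stat().st_size
    raise TypeError(f"expected bytes or path, got {type(value).__name__}")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.bin").write_bytes(b"12345")
    (root / "sub").mkdir()
    (root / "sub" / "b.bin").write_bytes(b"678")

    from_bytes = describe_source(b"abc")
    from_file = describe_source(root / "a.bin")
    from_dir = describe_source(root)
```

Note that directories carry no extension in this scheme, so a folder named `results.zarr` would be recorded with `ext` unset.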

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
diff --git a/docs/src/design/tables/storage-types-implementation-plan.md b/docs/src/design/tables/storage-types-implementation-plan.md
@@ -18,7 +18,8 @@ This plan describes the implementation of a three-layer type architecture for Da
 |-------|--------|-------|
 | Phase 1: Core Type System | ✅ Complete | CORE_TYPES dict, type chain resolution |
 | Phase 2: Content-Addressed Storage | ✅ Complete | Function-based, no registry table |
-| Phase 3: User-Defined AttributeTypes | 🔲 Pending | XBlobType done, AttachType/FilepathType pending |
+| Phase 2b: Path-Addressed Storage | ✅ Complete | ObjectType for files/folders |
+| Phase 3: User-Defined AttributeTypes | 🔲 Pending | AttachType/FilepathType pending |
 | Phase 4: Insert and Fetch Integration | ✅ Complete | Type chain encoding/decoding |
 | Phase 5: Garbage Collection | 🔲 Pending | |
 | Phase 6: Migration Utilities | 🔲 Pending | |
@@ -143,6 +144,58 @@ class XBlobType(AttributeType):
 
 ---
 
+## Phase 2b: Path-Addressed Storage (ObjectType) ✅
+
+**Status**: Complete
+
+### Design: Path vs Content Addressing
+
+| Aspect | `<content>` | `<object>` |
+|--------|-------------|------------|
+| Addressing | Content-hash (SHA256) | Path (from primary key) |
+| Path Format | `_content/{hash[:2]}/{hash[2:4]}/{hash}` | `{schema}/{table}/objects/{pk}/{field}_{token}.{ext}` |
+| Deduplication | Yes (same content = same hash) | No (each row has unique path) |
+| Deletion | GC when unreferenced | Deleted with row |
+| Use case | Serialized blobs, attachments | Zarr, HDF5, folders |
+
+### Implemented in `src/datajoint/builtin_types.py`:
+
+```python
+@register_type
+class ObjectType(AttributeType):
+    """Path-addressed storage for files and folders."""
+    type_name = "object"
+    dtype = "json"
+
+    def encode(self, value, *, key=None, store_name=None) -> dict:
+        # value can be bytes, str path, or Path
+        # key contains _schema, _table, _field for path construction
+        path, token = build_object_path(schema, table, field, primary_key, ext)
+        backend.put_buffer(content, path)  # or put_folder for directories
+        return {
+            "path": path,
+            "store": store_name,
+            "size": size,
+            "ext": ext,
+            "is_dir": is_dir,
+            "timestamp": timestamp.isoformat(),
+        }
+
+    def decode(self, stored: dict, *, key=None) -> ObjectRef:
+        # Returns lazy handle for fsspec-based access
+        return ObjectRef.from_json(stored, backend=backend)
+```
+
+### ObjectRef Features:
+- `ref.path` - Storage path
+- `ref.read()` - Read file content
+- `ref.open()` - Open as file handle
+- `ref.fsmap` - For `zarr.open(ref.fsmap)`
+- `ref.download(dest)` - Download to local path
+- `ref.listdir()` / `ref.walk()` - For directories
+
+---
+
 ## Phase 3: User-Defined AttributeTypes
 
 **Status**: Partially complete
@@ -319,8 +372,11 @@ def garbage_collect(schemas: list, store_name: str, dry_run=True) -> dict:
 |------|--------|---------|
 | `src/datajoint/declare.py` | ✅ | CORE_TYPES, type parsing, SQL generation |
 | `src/datajoint/heading.py` | ✅ | Simplified attribute properties |
-| `src/datajoint/attribute_type.py` | ✅ | ContentType, XBlobType, type chain resolution |
+| `src/datajoint/attribute_type.py` | ✅ | Base class, registry, type chain resolution |
+| `src/datajoint/builtin_types.py` | ✅ | DJBlobType, ContentType, XBlobType, ObjectType |
 | `src/datajoint/content_registry.py` | ✅ | Content storage functions (put, get, delete) |
+| `src/datajoint/objectref.py` | ✅ | ObjectRef handle for lazy access |
+| `src/datajoint/storage.py` | ✅ | StorageBackend, build_object_path |
 | `src/datajoint/table.py` | ✅ | Type chain encoding on insert |
 | `src/datajoint/fetch.py` | ✅ | Type chain decoding on fetch |
 | `src/datajoint/blob.py` | ✅ | Removed bypass_serialization |
@@ -343,7 +399,7 @@ def garbage_collect(schemas: list, store_name: str, dry_run=True) -> dict:
 
 ```
 Layer 3: AttributeTypes (user-facing)
-         <djblob>, <xblob>, <attach>, <xattach>, <filepath@store>
+         <djblob>, <object>, <content>, <xblob>, <attach>, <xattach>, <filepath@store>
          ↓ encode() / ↑ decode()
 
 Layer 2: Core DataJoint Types
@@ -354,6 +410,14 @@ Layer 1: Native Database Types
          FLOAT, BIGINT, BINARY(16), JSON, LONGBLOB, VARCHAR(n), etc.
 ```
 
+**Built-in AttributeTypes:**
+```
+<djblob>   → longblob (internal serialized storage)
+<object>   → json     (path-addressed, for Zarr/HDF5/folders)
+<content>  → json     (content-addressed with deduplication)
+<xblob>    → <content> → json (external serialized with dedup)
+```
+
 **Type Composition Example:**
 ```
 <xblob> → <content> → json (in DB)
diff --git a/src/datajoint/builtin_types.py b/src/datajoint/builtin_types.py
@@ -9,6 +9,7 @@
     - ``<djblob>``: Serialize Python objects to DataJoint's blob format (internal storage)
     - ``<content>``: Content-addressed storage with SHA256 deduplication
     - ``<xblob>``: External serialized blobs using content-addressed storage
+    - ``<object>``: Path-addressed storage for files/folders (Zarr, HDF5)
 
 Example - Creating a Custom Type:
     Here's how to define your own AttributeType, modeled after the built-in types::
@@ -237,3 +238,192 @@ def decode(self, stored: bytes, *, key: dict | None = None) -> Any:
         from . import blob
 
         return blob.unpack(stored, squeeze=False)
+
+
+# =============================================================================
+# Path-Addressed Storage Types (OAS - Object-Augmented Schema)
+# =============================================================================
+
+
+@register_type
+class ObjectType(AttributeType):
+    """
+    Path-addressed storage for files and folders.
+
+    The ``<object>`` type provides managed file/folder storage where the path
+    is derived from the primary key: ``{schema}/{table}/objects/{pk}/{field}_{token}.{ext}``
+
+    Unlike ``<content>`` (content-addressed), each row has its own storage path,
+    and content is deleted when the row is deleted. This is ideal for:
+
+    - Zarr arrays (hierarchical chunked data)
+    - HDF5 files
+    - Complex multi-file outputs
+    - Any content that shouldn't be deduplicated
+
+    Example::
+
+        @schema
+        class Analysis(dj.Computed):
+            definition = '''
+            -> Recording
+            ---
+            results : <object@mystore>
+            '''
+
+        def make(self, key):
+            # Store a local file or folder by path (a Zarr store is a folder)
+            self.insert1({**key, 'results': '/path/to/results.zarr'})
+
+        # Fetch returns ObjectRef for lazy access
+        ref = (Analysis & key).fetch1('results')
+        ref.path       # Storage path
+        ref.read()     # Read file content
+        ref.fsmap      # For zarr.open(ref.fsmap)
+
+    Storage Structure:
+        Objects are stored at::
+
+            {store_root}/{schema}/{table}/objects/{pk}/{field}_{token}.{ext}
+
+        The token ensures uniqueness even if content is replaced.
+
+    Comparison with ``<content>``::
+
+        | Aspect         | <object>          | <content>           |
+        |----------------|-------------------|---------------------|
+        | Addressing     | Path (by PK)      | Hash (by content)   |
+        | Deduplication  | No                | Yes                 |
+        | Deletion       | With row          | GC when unreferenced|
+        | Use case       | Zarr, HDF5        | Blobs, attachments  |
+
+    Note:
+        A store must be specified (``<object@store>``) unless a default store
+        is configured. Returns ``ObjectRef`` on fetch for lazy access.
+    """
+
+    type_name = "object"
+    dtype = "json"
+
+    def encode(
+        self,
+        value: Any,
+        *,
+        key: dict | None = None,
+        store_name: str | None = None,
+    ) -> dict:
+        """
+        Store content and return metadata.
+
+        Args:
+            value: Content to store. Can be:
+                - bytes: Raw bytes to store as file
+                - str/Path: Path to local file or folder to upload
+            key: Dict containing context for path construction:
+                - _schema: Schema name
+                - _table: Table name
+                - _field: Field/attribute name
+                - Other entries are primary key values
+            store_name: Store to use. If None, uses default store.
+
+        Returns:
+            Metadata dict suitable for ObjectRef.from_json()
+        """
+        from datetime import datetime, timezone
+        from pathlib import Path
+
+        from .content_registry import get_store_backend
+        from .storage import build_object_path
+
+        # Extract context from key
+        key = dict(key or {})  # copy: the pops below must not mutate the caller's dict
+        schema = key.pop("_schema", "unknown")
+        table = key.pop("_table", "unknown")
+        field = key.pop("_field", "data")
+        primary_key = {k: v for k, v in key.items() if not k.startswith("_")}
+
+        # Determine content type and extension
+        is_dir = False
+        ext = None
+        size = None
+
+        if isinstance(value, bytes):
+            content = value
+            size = len(content)
+        elif isinstance(value, (str, Path)):
+            source_path = Path(value)
+            if not source_path.exists():
+                raise FileNotFoundError(f"Source path does not exist: {source_path}")
+            is_dir = source_path.is_dir()
+            ext = source_path.suffix if not is_dir else None
+            if is_dir:
+                # For directories, we'll upload later
+                content = None
+            else:
+                content = source_path.read_bytes()
+                size = len(content)
+        else:
+            raise TypeError(f"<object> expects bytes or path, got {type(value).__name__}")
+
+        # Build storage path
+        path, token = build_object_path(
+            schema=schema,
+            table=table,
+            field=field,
+            primary_key=primary_key,
+            ext=ext,
+        )
+
+        # Get storage backend
+        backend = get_store_backend(store_name)
+
+        # Upload content
+        if is_dir:
+            # Upload directory recursively
+            source_path = Path(value)
+            backend.put_folder(str(source_path), path)
+            # Compute size by summing all files
+            size = sum(f.stat().st_size for f in source_path.rglob("*") if f.is_file())
+        else:
+            backend.put_buffer(content, path)
+
+        # Build metadata
+        timestamp = datetime.now(timezone.utc)
+        metadata = {
+            "path": path,
+            "store": store_name,
+            "size": size,
+            "ext": ext,
+            "is_dir": is_dir,
+            "timestamp": timestamp.isoformat(),
+        }
+
+        return metadata
+
+    def decode(self, stored: dict, *, key: dict | None = None) -> Any:
+        """
+        Create ObjectRef handle for lazy access.
+
+        Args:
+            stored: Metadata dict from database.
+            key: Primary key values (unused).
+
+        Returns:
+            ObjectRef for accessing the stored content.
+        """
+        from .content_registry import get_store_backend
+        from .objectref import ObjectRef
+
+        store_name = stored.get("store")
+        backend = get_store_backend(store_name)
+        return ObjectRef.from_json(stored, backend=backend)
+
+    def validate(self, value: Any) -> None:
+        """Check that value is bytes, str, or Path; path existence is checked in encode()."""
+        from pathlib import Path
+
+        if isinstance(value, bytes):
+            return
+        if isinstance(value, (str, Path)):
+            return
+        raise TypeError(f"<object> expects bytes or path, got {type(value).__name__}")