Commit ad09877

Implement ObjectType for path-addressed storage
Add <object> type for files and folders (Zarr, HDF5, etc.):

- Path derived from primary key: {schema}/{table}/objects/{pk}/{field}_{token}
- Supports bytes, files, and directories
- Returns ObjectRef for lazy fsspec-based access
- No deduplication (unlike <content>)

Update implementation plan with Phase 2b documenting ObjectType.

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
1 parent 70fb567 commit ad09877
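
The two addressing schemes named in the commit message can be contrasted with a small sketch. Nothing below is taken from `storage.py` or `content_registry.py`: `sketch_content_path` and `sketch_object_path` are hypothetical names, and the primary-key serialization (`name=value` segments) and the random token are assumptions made only for illustration. The content path layout follows the `_content/{hash[:2]}/{hash[2:4]}/{hash}` format documented in this commit.

```python
from __future__ import annotations

import hashlib
import secrets


def sketch_content_path(data: bytes) -> str:
    """Content-addressed: the path depends only on the bytes."""
    digest = hashlib.sha256(data).hexdigest()
    return f"_content/{digest[:2]}/{digest[2:4]}/{digest}"


def sketch_object_path(
    schema: str,
    table: str,
    field: str,
    primary_key: dict,
    ext: str | None = None,
) -> tuple[str, str]:
    """Path-addressed: the path depends on the row, not on the bytes."""
    # Assumption: primary key values become name=value segments in sorted order.
    pk_part = "/".join(f"{name}={primary_key[name]}" for name in sorted(primary_key))
    # Assumption: a short random token; the real token scheme is not shown here.
    token = secrets.token_hex(4)
    # Assumption: ext already carries its leading dot, as Path.suffix does.
    return f"{schema}/{table}/objects/{pk_part}/{field}_{token}{ext or ''}", token


# Identical bytes map to one content path (deduplication).
same_a = sketch_content_path(b"same bytes")
same_b = sketch_content_path(b"same bytes")

# Different rows get different object paths, whatever the bytes are.
path_1, token_1 = sketch_object_path("lab", "analysis", "results", {"subject_id": 1}, ".zarr")
path_2, token_2 = sketch_object_path("lab", "analysis", "results", {"subject_id": 2}, ".zarr")
```

Because the object path embeds the primary key, deleting the row identifies exactly which storage prefix to remove, which is why no garbage-collection pass is needed for `<object>`.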

2 files changed, 257 additions and 3 deletions

docs/src/design/tables/storage-types-implementation-plan.md

Lines changed: 67 additions & 3 deletions
@@ -18,7 +18,8 @@ This plan describes the implementation of a three-layer type architecture for Da
 |-------|--------|-------|
 | Phase 1: Core Type System | ✅ Complete | CORE_TYPES dict, type chain resolution |
 | Phase 2: Content-Addressed Storage | ✅ Complete | Function-based, no registry table |
-| Phase 3: User-Defined AttributeTypes | 🔲 Pending | XBlobType done, AttachType/FilepathType pending |
+| Phase 2b: Path-Addressed Storage | ✅ Complete | ObjectType for files/folders |
+| Phase 3: User-Defined AttributeTypes | 🔲 Pending | AttachType/FilepathType pending |
 | Phase 4: Insert and Fetch Integration | ✅ Complete | Type chain encoding/decoding |
 | Phase 5: Garbage Collection | 🔲 Pending | |
 | Phase 6: Migration Utilities | 🔲 Pending | |
@@ -143,6 +144,58 @@ class XBlobType(AttributeType):
 
 ---
 
+## Phase 2b: Path-Addressed Storage (ObjectType) ✅
+
+**Status**: Complete
+
+### Design: Path vs Content Addressing
+
+| Aspect | `<content>` | `<object>` |
+|--------|-------------|------------|
+| Addressing | Content-hash (SHA256) | Path (from primary key) |
+| Path Format | `_content/{hash[:2]}/{hash[2:4]}/{hash}` | `{schema}/{table}/objects/{pk}/{field}_{token}.ext` |
+| Deduplication | Yes (same content = same hash) | No (each row has unique path) |
+| Deletion | GC when unreferenced | Deleted with row |
+| Use case | Serialized blobs, attachments | Zarr, HDF5, folders |
+
+### Implemented in `src/datajoint/builtin_types.py`:
+
+```python
+@register_type
+class ObjectType(AttributeType):
+    """Path-addressed storage for files and folders."""
+    type_name = "object"
+    dtype = "json"
+
+    def encode(self, value, *, key=None, store_name=None) -> dict:
+        # value can be bytes, str path, or Path
+        # key contains _schema, _table, _field for path construction
+        path, token = build_object_path(schema, table, field, primary_key, ext)
+        backend.put_buffer(content, path)  # or put_folder for directories
+        return {
+            "path": path,
+            "store": store_name,
+            "size": size,
+            "ext": ext,
+            "is_dir": is_dir,
+            "timestamp": timestamp.isoformat(),
+        }
+
+    def decode(self, stored: dict, *, key=None) -> ObjectRef:
+        # Returns lazy handle for fsspec-based access
+        return ObjectRef.from_json(stored, backend=backend)
+```
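
The three accepted input kinds (bytes, a file path, a folder path) can be told apart before anything is uploaded. The following is a minimal standalone sketch of that classification step. `classify_source` is a hypothetical helper and not part of DataJoint. It takes a file's size from `stat().st_size` instead of reading the file into memory, which is a deliberate substitution relative to the committed `encode`, where the file is loaded with `read_bytes()`.

```python
from __future__ import annotations

from pathlib import Path


def classify_source(value) -> dict:
    """Return is_dir, ext, and size for bytes, a file path, or a folder path."""
    if isinstance(value, bytes):
        return {"is_dir": False, "ext": None, "size": len(value)}
    if isinstance(value, (str, Path)):
        source = Path(value)
        if not source.exists():
            raise FileNotFoundError(f"Source path does not exist: {source}")
        if source.is_dir():
            # Folder size is the sum of all regular files beneath it.
            size = sum(f.stat().st_size for f in source.rglob("*") if f.is_file())
            return {"is_dir": True, "ext": None, "size": size}
        # Path.suffix keeps the leading dot, e.g. ".h5".
        return {"is_dir": False, "ext": source.suffix, "size": source.stat().st_size}
    raise TypeError(f"<object> expects bytes or path, got {type(value).__name__}")
```

A folder never gets an extension in this sketch, matching the committed logic (`ext = source_path.suffix if not is_dir else None`), so a directory named `results.zarr` would be stored without the `.zarr` suffix in its metadata.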
+
+### ObjectRef Features:
+- `ref.path` - Storage path
+- `ref.read()` - Read file content
+- `ref.open()` - Open as file handle
+- `ref.fsmap` - For `zarr.open(ref.fsmap)`
+- `ref.download(dest)` - Download to local path
+- `ref.listdir()` / `ref.walk()` - For directories
+
+---
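
To make "lazy handle" concrete, here is a small stand-in that works on a local directory only. It is not the real `ObjectRef`: `LocalRefSketch`, its `root` argument, and its field selection are hypothetical, and the real class is described as fsspec-based and receives a backend rather than a root path. The sketch only illustrates that building the handle from the stored metadata performs no I/O, and that content is touched when `read()` or `listdir()` is called.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class LocalRefSketch:
    """Hypothetical local-filesystem stand-in for a lazy object handle."""

    root: Path  # assumed store root; the real handle is given a backend instead
    path: str
    size: int | None
    ext: str | None
    is_dir: bool

    @classmethod
    def from_json(cls, stored: dict, root) -> "LocalRefSketch":
        # Only metadata is copied here; no file is opened or checked.
        return cls(
            root=Path(root),
            path=stored["path"],
            size=stored.get("size"),
            ext=stored.get("ext"),
            is_dir=stored.get("is_dir", False),
        )

    def read(self) -> bytes:
        """Read the whole object; only valid for single files."""
        if self.is_dir:
            raise IsADirectoryError(self.path)
        return (self.root / self.path).read_bytes()

    def listdir(self) -> list[str]:
        """List entry names; only meaningful for folders."""
        return sorted(p.name for p in (self.root / self.path).iterdir())
```

Keeping the handle cheap to construct matters when a fetch returns many rows: the caller decides which objects to open, and large Zarr or HDF5 content is never pulled implicitly.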
+
 ## Phase 3: User-Defined AttributeTypes
 
 **Status**: Partially complete
@@ -319,8 +372,11 @@ def garbage_collect(schemas: list, store_name: str, dry_run=True) -> dict:
 |------|--------|---------|
 | `src/datajoint/declare.py` || CORE_TYPES, type parsing, SQL generation |
 | `src/datajoint/heading.py` || Simplified attribute properties |
-| `src/datajoint/attribute_type.py` || ContentType, XBlobType, type chain resolution |
+| `src/datajoint/attribute_type.py` || Base class, registry, type chain resolution |
+| `src/datajoint/builtin_types.py` || DJBlobType, ContentType, XBlobType, ObjectType |
 | `src/datajoint/content_registry.py` || Content storage functions (put, get, delete) |
+| `src/datajoint/objectref.py` || ObjectRef handle for lazy access |
+| `src/datajoint/storage.py` || StorageBackend, build_object_path |
 | `src/datajoint/table.py` || Type chain encoding on insert |
 | `src/datajoint/fetch.py` || Type chain decoding on fetch |
 | `src/datajoint/blob.py` || Removed bypass_serialization |
@@ -343,7 +399,7 @@ def garbage_collect(schemas: list, store_name: str, dry_run=True) -> dict:
 
 ```
 Layer 3: AttributeTypes (user-facing)
-<djblob>, <xblob>, <attach>, <xattach>, <filepath@store>
+<djblob>, <object>, <content>, <xblob>, <attach>, <xattach>, <filepath@store>
 ↓ encode() / ↑ decode()
 
 Layer 2: Core DataJoint Types
@@ -354,6 +410,14 @@ Layer 1: Native Database Types
 FLOAT, BIGINT, BINARY(16), JSON, LONGBLOB, VARCHAR(n), etc.
 ```
 
+**Built-in AttributeTypes:**
+```
+<djblob> → longblob (internal serialized storage)
+<object> → json (path-addressed, for Zarr/HDF5/folders)
+<content> → json (content-addressed with deduplication)
+<xblob> → <content> → json (external serialized with dedup)
+```
+
 **Type Composition Example:**
 ```
 <xblob> → <content> → json (in DB)

src/datajoint/builtin_types.py

Lines changed: 190 additions & 0 deletions
@@ -9,6 +9,7 @@
 - ``<djblob>``: Serialize Python objects to DataJoint's blob format (internal storage)
 - ``<content>``: Content-addressed storage with SHA256 deduplication
 - ``<xblob>``: External serialized blobs using content-addressed storage
+- ``<object>``: Path-addressed storage for files/folders (Zarr, HDF5)
 
 Example - Creating a Custom Type:
 Here's how to define your own AttributeType, modeled after the built-in types::
@@ -237,3 +238,192 @@ def decode(self, stored: bytes, *, key: dict | None = None) -> Any:
         from . import blob
 
         return blob.unpack(stored, squeeze=False)
+
+
+# =============================================================================
+# Path-Addressed Storage Types (OAS - Object-Augmented Schema)
+# =============================================================================
+
+
+@register_type
+class ObjectType(AttributeType):
+    """
+    Path-addressed storage for files and folders.
+
+    The ``<object>`` type provides managed file/folder storage where the path
+    is derived from the primary key: ``{schema}/{table}/objects/{pk}/{field}_{token}.{ext}``
+
+    Unlike ``<content>`` (content-addressed), each row has its own storage path,
+    and content is deleted when the row is deleted. This is ideal for:
+
+    - Zarr arrays (hierarchical chunked data)
+    - HDF5 files
+    - Complex multi-file outputs
+    - Any content that shouldn't be deduplicated
+
+    Example::
+
+        @schema
+        class Analysis(dj.Computed):
+            definition = '''
+            -> Recording
+            ---
+            results : <object@mystore>
+            '''
+
+            def make(self, key):
+                # Store a file
+                self.insert1({**key, 'results': '/path/to/results.zarr'})
+
+        # Fetch returns ObjectRef for lazy access
+        ref = (Analysis & key).fetch1('results')
+        ref.path  # Storage path
+        ref.read()  # Read file content
+        ref.fsmap  # For zarr.open(ref.fsmap)
+
+    Storage Structure:
+        Objects are stored at::
+
+            {store_root}/{schema}/{table}/objects/{pk}/{field}_{token}.ext
+
+        The token ensures uniqueness even if content is replaced.
+
+    Comparison with ``<content>``::
+
+        | Aspect         | <object>          | <content>           |
+        |----------------|-------------------|---------------------|
+        | Addressing     | Path (by PK)      | Hash (by content)   |
+        | Deduplication  | No                | Yes                 |
+        | Deletion       | With row          | GC when unreferenced|
+        | Use case       | Zarr, HDF5        | Blobs, attachments  |
+
+    Note:
+        A store must be specified (``<object@store>``) unless a default store
+        is configured. Returns ``ObjectRef`` on fetch for lazy access.
+    """
+
+    type_name = "object"
+    dtype = "json"
+
+    def encode(
+        self,
+        value: Any,
+        *,
+        key: dict | None = None,
+        store_name: str | None = None,
+    ) -> dict:
+        """
+        Store content and return metadata.
+
+        Args:
+            value: Content to store. Can be:
+                - bytes: Raw bytes to store as file
+                - str/Path: Path to local file or folder to upload
+            key: Dict containing context for path construction:
+                - _schema: Schema name
+                - _table: Table name
+                - _field: Field/attribute name
+                - Other entries are primary key values
+            store_name: Store to use. If None, uses default store.
+
+        Returns:
+            Metadata dict suitable for ObjectRef.from_json()
+        """
+        from datetime import datetime, timezone
+        from pathlib import Path
+
+        from .content_registry import get_store_backend
+        from .storage import build_object_path
+
+        # Extract context from key (copy first so the caller's dict is not mutated by pop)
+        key = dict(key or {})
+        schema = key.pop("_schema", "unknown")
+        table = key.pop("_table", "unknown")
+        field = key.pop("_field", "data")
+        primary_key = {k: v for k, v in key.items() if not k.startswith("_")}
+
+        # Determine content type and extension
+        is_dir = False
+        ext = None
+        size = None
+
+        if isinstance(value, bytes):
+            content = value
+            size = len(content)
+        elif isinstance(value, (str, Path)):
+            source_path = Path(value)
+            if not source_path.exists():
+                raise FileNotFoundError(f"Source path does not exist: {source_path}")
+            is_dir = source_path.is_dir()
+            ext = source_path.suffix if not is_dir else None
+            if is_dir:
+                # For directories, we'll upload later
+                content = None
+            else:
+                content = source_path.read_bytes()
+                size = len(content)
+        else:
+            raise TypeError(f"<object> expects bytes or path, got {type(value).__name__}")
+
+        # Build storage path
+        path, token = build_object_path(
+            schema=schema,
+            table=table,
+            field=field,
+            primary_key=primary_key,
+            ext=ext,
+        )
+
+        # Get storage backend
+        backend = get_store_backend(store_name)
+
+        # Upload content
+        if is_dir:
+            # Upload directory recursively
+            source_path = Path(value)
+            backend.put_folder(str(source_path), path)
+            # Compute size by summing all files
+            size = sum(f.stat().st_size for f in source_path.rglob("*") if f.is_file())
+        else:
+            backend.put_buffer(content, path)
+
+        # Build metadata
+        timestamp = datetime.now(timezone.utc)
+        metadata = {
+            "path": path,
+            "store": store_name,
+            "size": size,
+            "ext": ext,
+            "is_dir": is_dir,
+            "timestamp": timestamp.isoformat(),
+        }
+
+        return metadata
+
+    def decode(self, stored: dict, *, key: dict | None = None) -> Any:
+        """
+        Create ObjectRef handle for lazy access.
+
+        Args:
+            stored: Metadata dict from database.
+            key: Primary key values (unused).
+
+        Returns:
+            ObjectRef for accessing the stored content.
+        """
+        from .content_registry import get_store_backend
+        from .objectref import ObjectRef
+
+        store_name = stored.get("store")
+        backend = get_store_backend(store_name)
+        return ObjectRef.from_json(stored, backend=backend)
+
+    def validate(self, value: Any) -> None:
+        """Validate that value is bytes, str, or Path; path existence is checked in encode()."""
+        from pathlib import Path
+
+        if isinstance(value, bytes):
+            return
+        if isinstance(value, (str, Path)):
+            return
+        raise TypeError(f"<object> expects bytes or path, got {type(value).__name__}")
