Commit 40c1dbb

Add filepath as third OAS region with ObjectRef interface
Three OAS storage regions:
1. object: {schema}/{table}/{pk}/ - PK-addressed, DataJoint controls
2. content: _content/{hash} - content-addressed, deduplicated
3. filepath: _files/{user-path} - user-addressed, user controls

Upgraded filepath@store:
- Returns ObjectRef (lazy) instead of copying files
- Supports streaming via ref.open()
- Supports folders (like object)
- Stores checksum in JSON column for verification
- No more automatic copy to local stage

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
1 parent b87342b commit 40c1dbb

1 file changed: +145 -43 lines changed

docs/src/design/tables/storage-types-spec.md

Lines changed: 145 additions & 43 deletions
@@ -5,9 +5,17 @@
55
This document defines a layered storage architecture:
66

77
1. **MySQL types**: `longblob`, `varchar`, `int`, etc.
8-
2. **Core DataJoint types**: `object`, `content` (and their `@store` variants)
8+
2. **Core DataJoint types**: `object`, `content`, `filepath` (and their `@store` variants)
99
3. **AttributeTypes**: `<djblob>`, `<xblob>`, `<attach>`, etc. (built on top of core types)
1010

11+
### Three OAS Storage Regions
12+
13+
| Region | Path Pattern | Addressing | Use Case |
14+
|--------|--------------|------------|----------|
15+
| Object | `{schema}/{table}/{pk}/` | Primary key | Large objects, Zarr, HDF5 |
16+
| Content | `_content/{hash}` | Content hash | Deduplicated blobs/files |
17+
| Filepath | `_files/{user-path}` | User-defined | User-organized files |
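
The table above implies that a store path's region can be recognized from its prefix alone. A minimal sketch of that classification follows; the function name `region_of` is hypothetical and not part of the spec, and it assumes no schema is ever named `_content` or `_files`.

```python
def region_of(store_path: str) -> str:
    """Classify a store-relative path into one of the three OAS regions.

    Hypothetical helper for illustration only: it relies on the reserved
    `_content/` and `_files/` prefixes listed in the table above.
    """
    if store_path.startswith("_content/"):
        return "content"   # content-addressed, deduplicated
    if store_path.startswith("_files/"):
        return "filepath"  # user-addressed
    return "object"        # everything else is {schema}/{table}/{pk}/


print(region_of("_files/experiment_001/session_001/data.nwb"))
```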
18+
1119
## Core Types
1220

1321
### `object` / `object@store` - Path-Addressed Storage
@@ -44,11 +52,14 @@ class Analysis(dj.Computed):
4452

4553
```
4654
store_root/
47-
├── {schema}/{table}/{pk}/ # object storage (path-addressed)
55+
├── {schema}/{table}/{pk}/ # object storage (path-addressed by PK)
4856
│ └── {attribute}/
4957
50-
└── _content/ # content storage (content-addressed)
51-
└── {hash[:2]}/{hash[2:4]}/{hash}/
58+
├── _content/ # content storage (content-addressed)
59+
│ └── {hash[:2]}/{hash[2:4]}/{hash}
60+
61+
└── _files/ # filepath storage (user-addressed)
62+
└── {user-defined-path}
5263
```
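
The `_content/` layout shards entries by the first four hex characters of the hash. As a rough sketch, assuming the hash is the SHA256 hex digest that this spec stores in the `CHAR(64)` column (the helper name `content_path` is hypothetical):

```python
import hashlib


def content_path(data: bytes) -> str:
    """Build the store-relative path for a content-addressed blob.

    Assumes a SHA256 hex digest and the layout
    _content/{hash[:2]}/{hash[2:4]}/{hash} shown in the tree above.
    """
    digest = hashlib.sha256(data).hexdigest()
    return f"_content/{digest[:2]}/{digest[2:4]}/{digest}"


# Identical bytes always map to the same path, which is what enables dedup.
print(content_path(b"example payload") == content_path(b"example payload"))
```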
5364

5465
#### Content Type Behavior
@@ -95,6 +106,92 @@ The `content` type stores a `char(64)` hash in the database:
95106
features CHAR(64) NOT NULL -- SHA256 hex hash
96107
```
97108

109+
### `filepath` / `filepath@store` - User-Addressed Storage
110+
111+
**Upgraded from legacy.** User-defined path organization with ObjectRef access:
112+
113+
- **User controls paths**: relative path specified by user (not derived from PK or hash)
114+
- Stored in `_files/{user-path}` within the store
115+
- Returns `ObjectRef` for lazy access (no automatic copying)
116+
- Stores checksum in database for verification
117+
- Supports files and folders (like `object`)
118+
119+
```python
120+
class RawData(dj.Manual):
121+
definition = """
122+
session_id : int
123+
---
124+
recording : filepath@raw # user specifies path
125+
"""
126+
127+
# Insert - user provides relative path
128+
table.insert1({
129+
'session_id': 1,
130+
'recording': 'experiment_001/session_001/data.nwb'
131+
})
132+
133+
# Fetch - returns ObjectRef (lazy, no copy)
134+
row = (table & 'session_id=1').fetch1()
135+
ref = row['recording'] # ObjectRef
136+
ref.download('/local/path') # explicit download
137+
ref.open() # fsspec streaming access
138+
```
139+
140+
#### Filepath Type Behavior
141+
142+
```python
143+
# Core type behavior
144+
class FilepathType:
145+
"""Core user-addressed storage type."""
146+
147+
def store(self, user_path: str, store_backend) -> dict:
148+
"""
149+
Register filepath, return metadata.
150+
File must already exist at _files/{user_path} in store.
151+
"""
152+
full_path = f"_files/{user_path}"
153+
if not store_backend.exists(full_path):
154+
raise FileNotFoundError(f"File not found: {full_path}")
155+
156+
# Compute checksum for verification
157+
checksum = store_backend.checksum(full_path)
158+
size = store_backend.size(full_path)
159+
160+
return {
161+
'path': user_path,
162+
'checksum': checksum,
163+
'size': size
164+
}
165+
166+
def retrieve(self, metadata: dict, store_backend) -> ObjectRef:
167+
"""Return ObjectRef for lazy access."""
168+
return ObjectRef(
169+
path=f"_files/{metadata['path']}",
170+
store=store_backend,
171+
checksum=metadata.get('checksum') # for verification
172+
)
173+
```
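
To see the register/retrieve flow end to end without a real store, here is a self-contained sketch. Everything in it is a stand-in: `InMemoryStore`, `LazyRef`, `register`, and `retrieve` are hypothetical names, not the DataJoint API, and SHA256 is assumed as the checksum algorithm because the spec does not name one.

```python
import hashlib
import io
from dataclasses import dataclass
from typing import Optional


class InMemoryStore:
    """Hypothetical store backend holding blobs in a dict."""

    def __init__(self):
        self._blobs = {}

    def put(self, path: str, data: bytes) -> None:
        self._blobs[path] = data

    def exists(self, path: str) -> bool:
        return path in self._blobs

    def size(self, path: str) -> int:
        return len(self._blobs[path])

    def checksum(self, path: str) -> str:
        # Assumption: SHA256; the spec leaves the algorithm open.
        return hashlib.sha256(self._blobs[path]).hexdigest()

    def open(self, path: str) -> io.BytesIO:
        return io.BytesIO(self._blobs[path])


@dataclass
class LazyRef:
    """Stand-in for ObjectRef: holds a path, reads nothing until asked."""

    path: str
    store: InMemoryStore
    checksum: Optional[str] = None

    def open(self) -> io.BytesIO:
        return self.store.open(self.path)

    def verify(self) -> bool:
        """Compare the stored checksum against the current file contents."""
        return self.checksum is None or self.store.checksum(self.path) == self.checksum


def register(user_path: str, store: InMemoryStore) -> dict:
    """Mirror of FilepathType.store: the file must already exist."""
    full_path = f"_files/{user_path}"
    if not store.exists(full_path):
        raise FileNotFoundError(f"File not found: {full_path}")
    return {
        "path": user_path,
        "checksum": store.checksum(full_path),
        "size": store.size(full_path),
    }


def retrieve(metadata: dict, store: InMemoryStore) -> LazyRef:
    """Mirror of FilepathType.retrieve: no bytes are copied here."""
    return LazyRef(f"_files/{metadata['path']}", store, metadata.get("checksum"))


store = InMemoryStore()
store.put("_files/experiment_001/data.bin", b"abc")
ref = retrieve(register("experiment_001/data.bin", store), store)
print(ref.verify())
```

If the file is later overwritten in the store, `verify()` in this sketch returns `False`, which is the kind of drift the stored checksum is meant to detect.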
174+
175+
#### Database Column
176+
177+
The `filepath` type stores JSON metadata:
178+
179+
```sql
180+
-- filepath column
181+
recording JSON NOT NULL
182+
-- Contains: {"path": "...", "checksum": "...", "size": ...}
183+
```
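
As an illustration of what lands in that JSON column, the sketch below serializes and validates the three documented keys. The helper names and the validation rule are assumptions for illustration; the spec only fixes the key names `path`, `checksum`, and `size`.

```python
import json

REQUIRED_KEYS = {"path", "checksum", "size"}


def encode_filepath_metadata(path: str, checksum: str, size: int) -> str:
    """Serialize filepath metadata to the JSON text stored in the column."""
    return json.dumps({"path": path, "checksum": checksum, "size": size})


def decode_filepath_metadata(raw: str) -> dict:
    """Parse the JSON column value and reject rows missing a documented key."""
    meta = json.loads(raw)
    missing = REQUIRED_KEYS - meta.keys()
    if missing:
        raise ValueError(f"filepath metadata missing keys: {sorted(missing)}")
    return meta


raw = encode_filepath_metadata("experiment_001/session_001/data.nwb", "0" * 64, 1024)
print(decode_filepath_metadata(raw)["path"])
```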
184+
185+
#### Key Differences from Legacy `filepath@store`
186+
187+
| Feature | Legacy | New |
188+
|---------|--------|-----|
189+
| Access | Copy to local stage | ObjectRef (lazy) |
190+
| Copying | Automatic | Explicit via `ref.download()` |
191+
| Streaming | No | Yes via `ref.open()` |
192+
| Folders | No | Yes |
193+
| Interface | Returns local path | Returns ObjectRef |
194+
98195
## Parameterized AttributeTypes
99196

100197
AttributeTypes can be parameterized with `<type@param>` syntax. The parameter is passed
@@ -235,31 +332,32 @@ class Attachments(dj.Manual):
235332
## Type Layering Summary
236333

237334
```
238-
┌─────────────────────────────────────────────────────────────┐
239-
│                       AttributeTypes                        │
240-
│   <djblob>   <xblob>   <attach>   <xattach>   <custom>      │
241-
├─────────────────────────────────────────────────────────────┤
242-
│                    Core DataJoint Types                     │
243-
│   longblob     content      object                          │
244-
│                content@store  object@store                  │
245-
├─────────────────────────────────────────────────────────────┤
246-
│                         MySQL Types                         │
247-
│   LONGBLOB     CHAR(64)     JSON      VARCHAR   INT   etc.  │
248-
└─────────────────────────────────────────────────────────────┘
335+
┌───────────────────────────────────────────────────────────────────┐
336+
│                          AttributeTypes                           │
337+
│   <djblob>   <xblob>   <attach>   <xattach>   <custom>            │
338+
├───────────────────────────────────────────────────────────────────┤
339+
│                       Core DataJoint Types                        │
340+
│   longblob     content      object       filepath                 │
341+
│                content@s    object@s     filepath@s               │
342+
├───────────────────────────────────────────────────────────────────┤
343+
│                            MySQL Types                            │
344+
│   LONGBLOB     CHAR(64)     JSON         JSON       VARCHAR  etc. │
345+
└───────────────────────────────────────────────────────────────────┘
249346
```
250347

251348
## Storage Comparison
252349

253-
| AttributeType | Core Type | Storage Location | Dedup | Returns |
254-
|---------------|-----------|------------------|-------|---------|
350+
| Type | Core Type | Storage Location | Dedup | Returns |
351+
|------|-----------|------------------|-------|---------|
255352
| `<djblob>` | `longblob` | Database | No | Python object |
256-
| `<xblob>` | `content` | `_content/{hash}/` | Yes | Python object |
257-
| `<xblob@store>` | `content@store` | `_content/{hash}/` | Yes | Python object |
353+
| `<xblob>` | `content` | `_content/{hash}` | Yes | Python object |
354+
| `<xblob@s>` | `content@s` | `_content/{hash}` | Yes | Python object |
258355
| `<attach>` | `longblob` | Database | No | Local file path |
259-
| `<xattach>` | `content` | `_content/{hash}/` | Yes | Local file path |
260-
| `<xattach@store>` | `content@store` | `_content/{hash}/` | Yes | Local file path |
261-
| | `object` | `{schema}/{table}/{pk}/` | No | ObjectRef |
262-
| | `object@store` | `{schema}/{table}/{pk}/` | No | ObjectRef |
356+
| `<xattach>` | `content` | `_content/{hash}` | Yes | Local file path |
357+
| `<xattach@s>` | `content@s` | `_content/{hash}` | Yes | Local file path |
358+
| `object` | | `{schema}/{table}/{pk}/` | No | ObjectRef |
359+
| `object@s` | | `{schema}/{table}/{pk}/` | No | ObjectRef |
360+
| `filepath@s` | | `_files/{user-path}` | No | ObjectRef |
263361

264362
## Reference Counting for Content Type
265363

@@ -306,33 +404,37 @@ def garbage_collect(project):
306404
(ContentRegistry() & {'content_hash': content_hash}).delete()
307405
```
308406

309-
## Content vs Object: When to Use Each
407+
## Core Type Comparison
310408

311-
| Feature | `content` | `object` |
312-
|---------|-----------|----------|
313-
| Addressing | Content hash (SHA256) | Path (from primary key) |
314-
| Deduplication | Yes | No |
315-
| Structure | Single blob only | Files, folders, Zarr, HDF5 |
316-
| Access | Transparent (returns bytes) | Lazy (returns ObjectRef) |
317-
| GC | Reference counted | Deleted with row |
318-
| Use case | Serialized data, file attachments | Large/complex objects, streaming |
409+
| Feature | `object` | `content` | `filepath` |
410+
|---------|----------|-----------|------------|
411+
| Addressing | Primary key | Content hash | User-defined path |
412+
| Path control | DataJoint | DataJoint | User |
413+
| Deduplication | No | Yes | No |
414+
| Structure | Files, folders, Zarr | Single blob only | Files, folders |
415+
| Access | ObjectRef (lazy) | Transparent (bytes) | ObjectRef (lazy) |
416+
| GC | Deleted with row | Reference counted | Deleted with row |
417+
| Checksum | Optional | Implicit (is the hash) | Stored in DB |
319418

320-
**Rule of thumb:**
321-
- Need deduplication or storing serialized Python objects? → `content` via `<xblob>`
322-
- Need folders, Zarr, HDF5, or streaming access? → `object`
419+
**When to use each:**
420+
- **`object`**: Large/complex objects where DataJoint controls organization (Zarr, HDF5)
421+
- **`content`**: Deduplicated serialized data or file attachments via `<xblob>`, `<xattach>`
422+
- **`filepath`**: User-managed file organization, external data sources
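
The guidance above can be read as a two-question decision. The sketch below encodes one possible reading; the function name and the order in which the questions are asked are assumptions, not part of the spec.

```python
def choose_core_type(user_controls_path: bool, needs_dedup: bool) -> str:
    """Pick a core type from the two properties that differ in the table.

    Hypothetical helper: user-managed layout takes priority, then
    deduplication; otherwise DataJoint organizes the data by primary key.
    """
    if user_controls_path:
        return "filepath"
    if needs_dedup:
        return "content"
    return "object"


print(choose_core_type(user_controls_path=False, needs_dedup=True))
```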
323423

324424
## Key Design Decisions
325425

326-
1. **Layered architecture**: Core types (`content`, `object`) separate from AttributeTypes
327-
2. **Content type**: Single-blob, content-addressed, deduplicated storage
328-
3. **Parameterized types**: `<type@param>` passes parameter to underlying dtype
329-
4. **Naming convention**:
426+
1. **Layered architecture**: Core types (`object`, `content`, `filepath`) separate from AttributeTypes
427+
2. **Three OAS regions**: object (PK-addressed), content (hash-addressed), filepath (user-addressed)
428+
3. **Content type**: Single-blob, content-addressed, deduplicated storage
429+
4. **Filepath upgrade**: Returns ObjectRef (lazy) instead of copying files
430+
5. **Parameterized types**: `<type@param>` passes parameter to underlying dtype
431+
6. **Naming convention**:
330432
- `<djblob>` = internal serialized (database)
331433
- `<xblob>` = external serialized (content-addressed)
332434
- `<attach>` = internal file (single file)
333435
- `<xattach>` = external file (single file)
334-
5. **Transparent access**: AttributeTypes return Python objects or file paths, not references
335-
6. **Lazy access for objects**: Only `object`/`object@store` returns ObjectRef
436+
7. **Transparent access**: AttributeTypes return Python objects or file paths
437+
8. **Lazy access**: `object`, `object@store`, and `filepath@store` return ObjectRef
336438

337439
## Migration from Legacy Types
338440

@@ -342,7 +444,7 @@ def garbage_collect(project):
342444
| `blob@store` | `<xblob@store>` |
343445
| `attach` | `<attach>` |
344446
| `attach@store` | `<xattach@store>` |
345-
| `filepath@store` | Deprecated (use `object@store` or `<xattach@store>`) |
447+
| `filepath@store` (copy-based) | `filepath@store` (ObjectRef-based, upgraded) |
346448

347449
### Migration from Legacy `~external_*` Stores
348450

@@ -404,5 +506,5 @@ def migrate_external_store(schema, store_name):
404506

405507
1. Should `content` without `@store` use a default store, or require explicit store?
406508
2. Should we support `<xblob>` without `@store` syntax (implying default store)?
407-
3. Should `filepath@store` be kept for backward compat or fully deprecated?
509+
3. Should `filepath` without `@store` be supported (using default store)?
408510
4. How long should the backward compatibility layer support legacy `~external_*` format?
