Commit dbf092d

Redesign filepath as URI reference tracker and add json core type
filepath changes:
- No longer an OAS region - tracks external URIs anywhere
- Supports any fsspec-compatible URI (s3://, https://, gs://, etc.)
- Returns ObjectRef for lazy access via fsspec
- No integrity guarantees (external resources may change)
- Uses json core type for storage

json core type:
- Cross-database compatible (MySQL JSON, PostgreSQL JSONB)
- Used by filepath and object types

Two OAS regions remain:
- object: PK-addressed, DataJoint controlled
- content: hash-addressed, deduplicated

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
1 parent 40c1dbb commit dbf092d

1 file changed, +106 -62 lines changed
docs/src/design/tables/storage-types-spec.md

Lines changed: 106 additions & 62 deletions
@@ -4,17 +4,24 @@

 This document defines a layered storage architecture:

-1. **MySQL types**: `longblob`, `varchar`, `int`, etc.
-2. **Core DataJoint types**: `object`, `content`, `filepath` (and their `@store` variants)
+1. **Database types**: `longblob`, `varchar`, `int`, `json`, etc.
+2. **Core DataJoint types**: `object`, `content`, `filepath`, `json` (and `@store` variants where applicable)
 3. **AttributeTypes**: `<djblob>`, `<xblob>`, `<attach>`, etc. (built on top of core types)

-### Three OAS Storage Regions
+### OAS Storage Regions

 | Region | Path Pattern | Addressing | Use Case |
 |--------|--------------|------------|----------|
 | Object | `{schema}/{table}/{pk}/` | Primary key | Large objects, Zarr, HDF5 |
 | Content | `_content/{hash}` | Content hash | Deduplicated blobs/files |
-| Filepath | `_files/{user-path}` | User-defined | User-organized files |
+
+### External References
+
+`filepath` is **not** an OAS region - it's a general reference tracker for external resources:
+- OAS store paths: `store://main/experiment/data.h5`
+- URLs: `https://example.com/dataset.zip`
+- S3: `s3://bucket/key/file.nwb`
+- Any fsspec-compatible URI

 ## Core Types

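The reference forms listed in the hunk above differ only in their URI scheme, so they can be told apart before any backend is touched. The following is a minimal standard-library sketch of that idea. The function name `classify_reference` and the `BACKENDS` mapping are illustrative assumptions, not part of DataJoint; the labels mirror the "Supported URI Schemes" table that appears later in this diff, and the spec itself delegates actual access to fsspec.

```python
from urllib.parse import urlsplit

# Hypothetical scheme-to-backend labels (illustrative, not DataJoint API).
BACKENDS = {
    "store": "OAS store",
    "s3": "S3 via fsspec",
    "gs": "Google Cloud Storage",
    "http": "HTTP(S)",
    "https": "HTTP(S)",
    "file": "Local filesystem",
}


def classify_reference(uri: str) -> tuple[str, str]:
    """Return (scheme, backend label) for a filepath-style URI."""
    scheme = urlsplit(uri).scheme.lower()
    if not scheme:
        # A bare relative path (the legacy form) carries no scheme.
        raise ValueError(f"not a URI (no scheme): {uri!r}")
    return scheme, BACKENDS.get(scheme, "other fsspec-compatible backend")


print(classify_reference("store://main/experiment/data.h5"))
print(classify_reference("s3://bucket/key/file.nwb"))
```

Rejecting scheme-less strings is a design choice of this sketch: it makes a legacy relative path such as `experiment_001/session_001/data.nwb` fail loudly rather than be silently treated as a local file.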
@@ -55,11 +62,8 @@ store_root/
 ├── {schema}/{table}/{pk}/ # object storage (path-addressed by PK)
 │   └── {attribute}/

-├── _content/ # content storage (content-addressed)
-│   └── {hash[:2]}/{hash[2:4]}/{hash}
-
-└── _files/ # filepath storage (user-addressed)
-    └── {user-defined-path}
+└── _content/ # content storage (content-addressed)
+    └── {hash[:2]}/{hash[2:4]}/{hash}
 ```

 #### Content Type Behavior
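The `_content/{hash[:2]}/{hash[2:4]}/{hash}` layout in the hunk above can be sketched in a few lines. This assumes the hash is the SHA256 hex digest of the stored bytes, which matches the `CHAR(64)` "SHA256 hex hash" column shown in the next hunk; the helper name `content_path` is hypothetical.

```python
import hashlib


def content_path(data: bytes) -> str:
    """Build the sharded store path for a blob from its SHA256 hex digest."""
    digest = hashlib.sha256(data).hexdigest()  # 64 hex characters
    # The first two byte-pairs of the digest become directory levels.
    return f"_content/{digest[:2]}/{digest[2:4]}/{digest}"


print(content_path(b"example blob"))
```

Because the path depends only on the bytes, two inserts of identical data resolve to the same path, which is what makes deduplication possible in this region.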
@@ -106,31 +110,41 @@ The `content` type stores a `char(64)` hash in the database:
 features CHAR(64) NOT NULL -- SHA256 hex hash
 ```

-### `filepath` / `filepath@store` - User-Addressed Storage
+### `filepath` - External Reference Tracker

-**Upgraded from legacy.** User-defined path organization with ObjectRef access:
+**Upgraded from legacy.** General-purpose reference tracker for external resources:

-- **User controls paths**: relative path specified by user (not derived from PK or hash)
-- Stored in `_files/{user-path}` within the store
-- Returns `ObjectRef` for lazy access (no automatic copying)
-- Stores checksum in database for verification
-- Supports files and folders (like `object`)
+- **Not an OAS region**: references can point anywhere (URLs, S3, OAS stores, etc.)
+- **User controls URIs**: any fsspec-compatible URI
+- Returns `ObjectRef` for lazy access via fsspec
+- Stores optional checksum for verification
+- No integrity guarantees (external resources may change/disappear)

 ```python
 class RawData(dj.Manual):
     definition = """
     session_id : int
     ---
-    recording : filepath@raw # user specifies path
+    recording : filepath # external reference
     """

-# Insert - user provides relative path
+# Insert - user provides URI (various protocols)
 table.insert1({
     'session_id': 1,
-    'recording': 'experiment_001/session_001/data.nwb'
+    'recording': 's3://my-bucket/experiment_001/data.nwb'
+})
+# Or URL
+table.insert1({
+    'session_id': 2,
+    'recording': 'https://example.com/public/dataset.h5'
+})
+# Or OAS store reference
+table.insert1({
+    'session_id': 3,
+    'recording': 'store://main/custom/path/file.zarr'
 })

-# Fetch - returns ObjectRef (lazy, no copy)
+# Fetch - returns ObjectRef (lazy)
 row = (table & 'session_id=1').fetch1()
 ref = row['recording'] # ObjectRef
 ref.download('/local/path') # explicit download
@@ -142,55 +156,82 @@ ref.open() # fsspec streaming access
 ```python
 # Core type behavior
 class FilepathType:
-    """Core user-addressed storage type."""
+    """Core external reference type."""

-    def store(self, user_path: str, store_backend) -> dict:
+    def store(self, uri: str, compute_checksum: bool = False) -> dict:
         """
-        Register filepath, return metadata.
-        File must already exist at _files/{user_path} in store.
+        Register external reference, return metadata.
+        Optionally compute checksum for verification.
         """
-        full_path = f"_files/{user_path}"
-        if not store_backend.exists(full_path):
-            raise FileNotFoundError(f"File not found: {full_path}")
+        metadata = {'uri': uri}

-        # Compute checksum for verification
-        checksum = store_backend.checksum(full_path)
-        size = store_backend.size(full_path)
+        if compute_checksum:
+            # Use fsspec to access and compute checksum
+            fs, path = fsspec.core.url_to_fs(uri)
+            if fs.exists(path):
+                metadata['checksum'] = compute_file_checksum(fs, path)
+                metadata['size'] = fs.size(path)

-        return {
-            'path': user_path,
-            'checksum': checksum,
-            'size': size
-        }
+        return metadata

-    def retrieve(self, metadata: dict, store_backend) -> ObjectRef:
+    def retrieve(self, metadata: dict) -> ObjectRef:
         """Return ObjectRef for lazy access."""
         return ObjectRef(
-            path=f"_files/{metadata['path']}",
-            store=store_backend,
-            checksum=metadata.get('checksum') # for verification
+            uri=metadata['uri'],
+            checksum=metadata.get('checksum') # optional verification
         )
 ```

 #### Database Column

-The `filepath` type stores JSON metadata:
+The `filepath` type uses the `json` core type:

 ```sql
--- filepath column
+-- filepath column (MySQL)
 recording JSON NOT NULL
--- Contains: {"path": "...", "checksum": "...", "size": ...}
+-- Contains: {"uri": "s3://...", "checksum": "...", "size": ...}
+
+-- filepath column (PostgreSQL)
+recording JSONB NOT NULL
 ```

+#### Supported URI Schemes
+
+| Scheme | Example | Backend |
+|--------|---------|---------|
+| `s3://` | `s3://bucket/key/file.nwb` | S3 via fsspec |
+| `gs://` | `gs://bucket/object` | Google Cloud Storage |
+| `https://` | `https://example.com/data.h5` | HTTP(S) |
+| `file://` | `file:///local/path/data.csv` | Local filesystem |
+| `store://` | `store://main/path/file.zarr` | OAS store |
+
 #### Key Differences from Legacy `filepath@store`

 | Feature | Legacy | New |
 |---------|--------|-----|
+| Location | OAS store only | Any URI (S3, HTTP, etc.) |
 | Access | Copy to local stage | ObjectRef (lazy) |
 | Copying | Automatic | Explicit via `ref.download()` |
 | Streaming | No | Yes via `ref.open()` |
-| Folders | No | Yes |
-| Interface | Returns local path | Returns ObjectRef |
+| Integrity | Managed by DataJoint | External (may change) |
+| Store param | Required (`@store`) | Optional (embedded in URI) |
+
+### `json` - Cross-Database JSON Type
+
+**New core type.** JSON storage compatible across MySQL and PostgreSQL:
+
+```sql
+-- MySQL
+column_name JSON NOT NULL
+
+-- PostgreSQL
+column_name JSONB NOT NULL
+```
+
+The `json` core type:
+- Stores arbitrary JSON-serializable data
+- Automatically uses appropriate type for database backend
+- Supports JSON path queries where available

 ## Parameterized AttributeTypes

@@ -337,11 +378,12 @@ class Attachments(dj.Manual):
 │ <djblob> <xblob> <attach> <xattach> <custom> │
 ├───────────────────────────────────────────────────────────────────┤
 │ Core DataJoint Types │
-│ longblob content object filepath
-│ content@s object@s filepath@s
+│ longblob content object filepath json
+│ content@s object@s
 ├───────────────────────────────────────────────────────────────────┤
-│ MySQL Types │
-│ LONGBLOB CHAR(64) JSON JSON VARCHAR etc. │
+│ Database Types │
+│ LONGBLOB CHAR(64) JSON JSON/JSONB VARCHAR etc. │
+│ (MySQL) (PostgreSQL) │
 └───────────────────────────────────────────────────────────────────┘
 ```

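The bottom layer of the diagram above maps the `json` core type to `JSON` on MySQL and `JSONB` on PostgreSQL. One way that selection might look is sketched below; the function name `json_column_sql`, the backend keys, and the `nullable` parameter are assumptions made for illustration and do not come from the DataJoint codebase.

```python
# Column types as named in the diff: MySQL uses JSON, PostgreSQL uses JSONB.
# The dictionary keys are hypothetical backend identifiers.
JSON_COLUMN_TYPES = {"mysql": "JSON", "postgresql": "JSONB"}


def json_column_sql(name: str, backend: str, nullable: bool = False) -> str:
    """Render a column definition for the `json` core type (illustrative)."""
    try:
        sql_type = JSON_COLUMN_TYPES[backend.lower()]
    except KeyError:
        raise ValueError(f"unsupported backend: {backend!r}") from None
    suffix = "" if nullable else " NOT NULL"
    return f"{name} {sql_type}{suffix}"


print(json_column_sql("recording", "mysql"))       # recording JSON NOT NULL
print(json_column_sql("recording", "postgresql"))  # recording JSONB NOT NULL
```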
@@ -357,7 +399,7 @@ class Attachments(dj.Manual):
 | `<xattach@s>` | `content@s` | `_content/{hash}` | Yes | Local file path |
 | `object` || `{schema}/{table}/{pk}/` | No | ObjectRef |
 | `object@s` || `{schema}/{table}/{pk}/` | No | ObjectRef |
-| `filepath@s` | | `_files/{user-path}` | No | ObjectRef |
+| `filepath` | `json` | External (any URI) | No | ObjectRef |

 ## Reference Counting for Content Type

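The new `filepath` row above stores its metadata through the `json` core type, with `uri` always present and `checksum` and `size` only when computed. A rough sketch of that record's round trip through a JSON column follows. The helper names are hypothetical, and treating `uri` as the only required key is an inference from the `store()` method shown earlier in this diff.

```python
import json
from typing import Optional


def make_filepath_record(uri: str, checksum: Optional[str] = None,
                         size: Optional[int] = None) -> str:
    """Serialize reference metadata; checksum and size are optional."""
    metadata = {"uri": uri}
    if checksum is not None:
        metadata["checksum"] = checksum
    if size is not None:
        metadata["size"] = size
    return json.dumps(metadata, sort_keys=True)


def read_filepath_record(column_value: str) -> dict:
    """Parse the stored JSON and check that the required key is present."""
    metadata = json.loads(column_value)
    if "uri" not in metadata:
        raise ValueError("filepath record is missing 'uri'")
    return metadata


record = make_filepath_record("s3://bucket/key/file.nwb", size=1024)
print(read_filepath_record(record)["uri"])
```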
@@ -408,33 +450,35 @@ def garbage_collect(project):

 | Feature | `object` | `content` | `filepath` |
 |---------|----------|-----------|------------|
-| Addressing | Primary key | Content hash | User-defined path |
+| Location | OAS store | OAS store | Anywhere (URI) |
+| Addressing | Primary key | Content hash | User URI |
 | Path control | DataJoint | DataJoint | User |
 | Deduplication | No | Yes | No |
-| Structure | Files, folders, Zarr | Single blob only | Files, folders |
+| Structure | Files, folders, Zarr | Single blob only | Any (via fsspec) |
 | Access | ObjectRef (lazy) | Transparent (bytes) | ObjectRef (lazy) |
-| GC | Deleted with row | Reference counted | Deleted with row |
-| Checksum | Optional | Implicit (is the hash) | Stored in DB |
+| GC | Deleted with row | Reference counted | N/A (external) |
+| Integrity | DataJoint managed | DataJoint managed | External (no guarantees) |

 **When to use each:**
 - **`object`**: Large/complex objects where DataJoint controls organization (Zarr, HDF5)
 - **`content`**: Deduplicated serialized data or file attachments via `<xblob>`, `<xattach>`
-- **`filepath`**: User-managed file organization, external data sources
+- **`filepath`**: External references (S3, URLs, etc.) not managed by DataJoint

 ## Key Design Decisions

-1. **Layered architecture**: Core types (`object`, `content`, `filepath`) separate from AttributeTypes
-2. **Three OAS regions**: object (PK-addressed), content (hash-addressed), filepath (user-addressed)
-3. **Content type**: Single-blob, content-addressed, deduplicated storage
-4. **Filepath upgrade**: Returns ObjectRef (lazy) instead of copying files
-5. **Parameterized types**: `<type@param>` passes parameter to underlying dtype
-6. **Naming convention**:
+1. **Layered architecture**: Core types (`object`, `content`, `filepath`, `json`) separate from AttributeTypes
+2. **Two OAS regions**: object (PK-addressed) and content (hash-addressed) within managed stores
+3. **Filepath as reference tracker**: Not an OAS region - tracks external URIs (S3, HTTP, etc.)
+4. **Content type**: Single-blob, content-addressed, deduplicated storage
+5. **JSON core type**: Cross-database compatible (MySQL JSON, PostgreSQL JSONB)
+6. **Parameterized types**: `<type@param>` passes parameter to underlying dtype
+7. **Naming convention**:
    - `<djblob>` = internal serialized (database)
    - `<xblob>` = external serialized (content-addressed)
    - `<attach>` = internal file (single file)
    - `<xattach>` = external file (single file)
-7. **Transparent access**: AttributeTypes return Python objects or file paths
-8. **Lazy access**: `object`, `object@store`, and `filepath@store` return ObjectRef
+8. **Transparent access**: AttributeTypes return Python objects or file paths
+9. **Lazy access**: `object`, `object@store`, and `filepath` return ObjectRef

 ## Migration from Legacy Types

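Since `filepath` references now carry no integrity guarantee, the optional stored checksum is the only way a caller can detect that an external resource has changed. The diff calls `compute_file_checksum` without defining it, so the sketch below shows one plausible shape for such a check over any binary stream. Using SHA256 here is an assumption (the spec names SHA256 only for `content`), and the function names `stream_sha256` and `matches_checksum` are hypothetical.

```python
import hashlib
import io
from typing import BinaryIO


def stream_sha256(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream chunk by chunk so large files stay out of memory."""
    digest = hashlib.sha256()
    # iter() with a sentinel keeps reading until read() returns b"" (EOF).
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(stream: BinaryIO, expected: str) -> bool:
    """Compare the stream's digest with the checksum recorded at insert time."""
    return stream_sha256(stream) == expected


# Stand-in for a remote file; a real caller would pass an opened binary stream.
recorded = stream_sha256(io.BytesIO(b"original bytes"))
print(matches_checksum(io.BytesIO(b"original bytes"), recorded))  # True
print(matches_checksum(io.BytesIO(b"changed bytes"), recorded))   # False
```

In practice the stream would presumably come from something like `ref.open()`, provided that call yields a binary file-like object; that detail is not specified in the diff.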