 This document defines a layered storage architecture:

-1. **MySQL types**: `longblob`, `varchar`, `int`, etc.
-2. **Core DataJoint types**: `object`, `content`, `filepath` (and their `@store` variants)
+1. **Database types**: `longblob`, `varchar`, `int`, `json`, etc.
+2. **Core DataJoint types**: `object`, `content`, `filepath`, `json` (and `@store` variants where applicable)
 3. **AttributeTypes**: `<djblob>`, `<xblob>`, `<attach>`, etc. (built on top of core types)

-### Three OAS Storage Regions
+### OAS Storage Regions

 | Region | Path Pattern | Addressing | Use Case |
 |--------|--------------|------------|----------|
 | Object | `{schema}/{table}/{pk}/` | Primary key | Large objects, Zarr, HDF5 |
 | Content | `_content/{hash}` | Content hash | Deduplicated blobs/files |
-| Filepath | `_files/{user-path}` | User-defined | User-organized files |
+
+### External References
+
+`filepath` is **not** an OAS region - it's a general reference tracker for external resources:
+- OAS store paths: `store://main/experiment/data.h5`
+- URLs: `https://example.com/dataset.zip`
+- S3: `s3://bucket/key/file.nwb`
+- Any fsspec-compatible URI

 ## Core Types

@@ -55,11 +62,8 @@ store_root/
 ├── {schema}/{table}/{pk}/     # object storage (path-addressed by PK)
 │   └── {attribute}/
 │
-├── _content/                  # content storage (content-addressed)
-│   └── {hash[:2]}/{hash[2:4]}/{hash}
-│
-└── _files/                    # filepath storage (user-addressed)
-    └── {user-defined-path}
+└── _content/                  # content storage (content-addressed)
+    └── {hash[:2]}/{hash[2:4]}/{hash}
 ```
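
The content-addressed layout in the tree above can be illustrated with a short sketch. It assumes the hash is the SHA-256 hex digest of the stored bytes (the spec describes the database column as a SHA256 hex hash). `content_path` is a hypothetical helper name used only for illustration, not a DataJoint API.

```python
import hashlib


def content_path(data: bytes) -> str:
    """Return the store-relative path for a content-addressed blob.

    Hypothetical helper: follows the `_content/{hash[:2]}/{hash[2:4]}/{hash}`
    layout, assuming the hash is a SHA-256 hex digest of the raw bytes.
    """
    digest = hashlib.sha256(data).hexdigest()  # 64 hex characters
    return f"_content/{digest[:2]}/{digest[2:4]}/{digest}"


# Identical bytes always map to the same path, which is what makes
# deduplication possible; different bytes map to different paths.
same = content_path(b"abc") == content_path(b"abc")
different = content_path(b"abc") != content_path(b"abd")
```

The two-level prefix directories (`hash[:2]`, `hash[2:4]`) keep any single directory from accumulating every blob in the store.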

 #### Content Type Behavior
@@ -106,31 +110,41 @@ The `content` type stores a `char(64)` hash in the database:
 features CHAR(64) NOT NULL  -- SHA256 hex hash
 ```

-### `filepath` / `filepath@store` - User-Addressed Storage
+### `filepath` - External Reference Tracker

-**Upgraded from legacy.** User-defined path organization with ObjectRef access:
+**Upgraded from legacy.** General-purpose reference tracker for external resources:

-- **User controls paths**: relative path specified by user (not derived from PK or hash)
-- Stored in `_files/{user-path}` within the store
-- Returns `ObjectRef` for lazy access (no automatic copying)
-- Stores checksum in database for verification
-- Supports files and folders (like `object`)
+- **Not an OAS region**: references can point anywhere (URLs, S3, OAS stores, etc.)
+- **User controls URIs**: any fsspec-compatible URI
+- Returns `ObjectRef` for lazy access via fsspec
+- Stores optional checksum for verification
+- No integrity guarantees (external resources may change/disappear)

 ```python
 class RawData(dj.Manual):
     definition = """
     session_id : int
     ---
-    recording : filepath@raw      # user specifies path
+    recording : filepath          # external reference
     """

-# Insert - user provides relative path
+# Insert - user provides URI (various protocols)
 table.insert1({
     'session_id': 1,
-    'recording': 'experiment_001/session_001/data.nwb'
+    'recording': 's3://my-bucket/experiment_001/data.nwb'
+})
+# Or URL
+table.insert1({
+    'session_id': 2,
+    'recording': 'https://example.com/public/dataset.h5'
+})
+# Or OAS store reference
+table.insert1({
+    'session_id': 3,
+    'recording': 'store://main/custom/path/file.zarr'
 })

-# Fetch - returns ObjectRef (lazy, no copy)
+# Fetch - returns ObjectRef (lazy)
 row = (table & 'session_id=1').fetch1()
 ref = row['recording']  # ObjectRef
 ref.download('/local/path')  # explicit download
@@ -142,55 +156,82 @@ ref.open() # fsspec streaming access
 ```python
 # Core type behavior
 class FilepathType:
-    """Core user-addressed storage type."""
+    """Core external reference type."""

-    def store(self, user_path: str, store_backend) -> dict:
+    def store(self, uri: str, compute_checksum: bool = False) -> dict:
         """
-        Register filepath, return metadata.
-        File must already exist at _files/{user_path} in store.
+        Register external reference, return metadata.
+        Optionally compute checksum for verification.
         """
-        full_path = f"_files/{user_path}"
-        if not store_backend.exists(full_path):
-            raise FileNotFoundError(f"File not found: {full_path}")
+        metadata = {'uri': uri}

-        # Compute checksum for verification
-        checksum = store_backend.checksum(full_path)
-        size = store_backend.size(full_path)
+        if compute_checksum:
+            # Use fsspec to access and compute checksum
+            fs, path = fsspec.core.url_to_fs(uri)
+            if fs.exists(path):
+                metadata['checksum'] = compute_file_checksum(fs, path)
+                metadata['size'] = fs.size(path)

-        return {
-            'path': user_path,
-            'checksum': checksum,
-            'size': size
-        }
+        return metadata

-    def retrieve(self, metadata: dict, store_backend) -> ObjectRef:
+    def retrieve(self, metadata: dict) -> ObjectRef:
         """Return ObjectRef for lazy access."""
         return ObjectRef(
-            path=f"_files/{metadata['path']}",
-            store=store_backend,
-            checksum=metadata.get('checksum')  # for verification
+            uri=metadata['uri'],
+            checksum=metadata.get('checksum')  # optional verification
         )
 ```
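
The new `store` method calls `compute_file_checksum(fs, path)`, which this diff does not define. The following is a minimal sketch of what such a helper might look like. The choice of SHA-256, the `chunk_size` parameter, and the `verify_checksum` and `LocalFS` names are all assumptions made for illustration. The only thing it relies on from `fs` is an `open(path, "rb")` method, which fsspec filesystems provide.

```python
import hashlib
import os
import tempfile


def compute_file_checksum(fs, path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest.

    Assumption: `fs` exposes `open(path, "rb")` returning a binary file
    object, as fsspec filesystems do. SHA-256 is an illustrative choice.
    """
    hasher = hashlib.sha256()
    with fs.open(path, "rb") as f:
        # Read fixed-size chunks so large files are never fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify_checksum(fs, path: str, expected: str) -> bool:
    """Return True when the resource still matches the recorded checksum."""
    return compute_file_checksum(fs, path) == expected


class LocalFS:
    """Stand-in for an fsspec filesystem, only to keep the demo self-contained."""

    def open(self, path, mode="rb"):
        return open(path, mode)


# Demo: record a checksum, then detect that the external file has changed.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data.bin")
    with open(target, "wb") as f:
        f.write(b"hello")
    recorded = compute_file_checksum(LocalFS(), target)
    unchanged = verify_checksum(LocalFS(), target, recorded)
    with open(target, "wb") as f:
        f.write(b"changed")
    change_detected = not verify_checksum(LocalFS(), target, recorded)
```

Because external resources carry no integrity guarantees, a stored checksum can only report that a resource has drifted; it cannot prevent the drift.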

 #### Database Column

-The `filepath` type stores JSON metadata:
+The `filepath` type uses the `json` core type:

 ```sql
--- filepath column
+-- filepath column (MySQL)
 recording JSON NOT NULL
--- Contains: {"path": "...", "checksum": "...", "size": ...}
+-- Contains: {"uri": "s3://...", "checksum": "...", "size": ...}
+
+-- filepath column (PostgreSQL)
+recording JSONB NOT NULL
 ```

+#### Supported URI Schemes
+
+| Scheme | Example | Backend |
+|--------|---------|---------|
+| `s3://` | `s3://bucket/key/file.nwb` | S3 via fsspec |
+| `gs://` | `gs://bucket/object` | Google Cloud Storage |
+| `https://` | `https://example.com/data.h5` | HTTP(S) |
+| `file://` | `file:///local/path/data.csv` | Local filesystem |
+| `store://` | `store://main/path/file.zarr` | OAS store |
+
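
As a rough illustration of how a URI's scheme could be mapped to the backends in the table above, here is a sketch that uses only the standard library. `SCHEME_BACKENDS` and `classify_uri` are hypothetical names, and rejecting unlisted schemes is an assumption of this sketch. The spec itself allows any fsspec-compatible URI.

```python
from urllib.parse import urlsplit

# Backend labels copied from the table; the mapping itself is illustrative.
SCHEME_BACKENDS = {
    "s3": "S3 via fsspec",
    "gs": "Google Cloud Storage",
    "https": "HTTP(S)",
    "file": "Local filesystem",
    "store": "OAS store",
}


def classify_uri(uri: str) -> str:
    """Return the backend label for a URI, or raise for unlisted schemes."""
    scheme = urlsplit(uri).scheme  # empty string when the URI has no scheme
    if scheme not in SCHEME_BACKENDS:
        raise ValueError(f"Unsupported or missing URI scheme: {uri!r}")
    return SCHEME_BACKENDS[scheme]
```

A legacy-style relative path such as `experiment_001/session_001/data.nwb` has no scheme, so this sketch rejects it instead of guessing which store it belongs to.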
 #### Key Differences from Legacy `filepath@store`

 | Feature | Legacy | New |
 |---------|--------|-----|
+| Location | OAS store only | Any URI (S3, HTTP, etc.) |
 | Access | Copy to local stage | ObjectRef (lazy) |
 | Copying | Automatic | Explicit via `ref.download()` |
 | Streaming | No | Yes via `ref.open()` |
-| Folders | No | Yes |
-| Interface | Returns local path | Returns ObjectRef |
+| Integrity | Managed by DataJoint | External (may change) |
+| Store param | Required (`@store`) | Optional (embedded in URI) |
+
+### `json` - Cross-Database JSON Type
+
+**New core type.** JSON storage compatible across MySQL and PostgreSQL:
+
+```sql
+-- MySQL
+column_name JSON NOT NULL
+
+-- PostgreSQL
+column_name JSONB NOT NULL
+```
+
+The `json` core type:
+- Stores arbitrary JSON-serializable data
+- Automatically uses appropriate type for database backend
+- Supports JSON path queries where available
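
The backend-dependent column type described above might be resolved along these lines. This is a sketch only: `JSON_COLUMN_TYPES`, `json_column_sql`, and the backend labels `"mysql"` and `"postgresql"` are hypothetical names, not DataJoint's implementation.

```python
# Native column type per backend, as listed in the SQL snippet above.
JSON_COLUMN_TYPES = {"mysql": "JSON", "postgresql": "JSONB"}


def json_column_sql(name: str, backend: str, nullable: bool = False) -> str:
    """Render the column definition for a `json` attribute on a backend."""
    try:
        sql_type = JSON_COLUMN_TYPES[backend]
    except KeyError:
        raise ValueError(f"Unknown database backend: {backend!r}") from None
    return f"{name} {sql_type}" + ("" if nullable else " NOT NULL")
```

Keeping the mapping in one table means that a `filepath` attribute, which is stored through `json`, picks up the right native type without any backend-specific logic of its own.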

 ## Parameterized AttributeTypes

@@ -337,11 +378,12 @@ class Attachments(dj.Manual):
 │ <djblob> <xblob> <attach> <xattach> <custom> │
 ├───────────────────────────────────────────────────────────────────┤
 │ Core DataJoint Types │
-│ longblob content object filepath │
-│ content@s object@s filepath@s │
+│ longblob content object filepath json │
+│ content@s object@s │
 ├───────────────────────────────────────────────────────────────────┤
-│ MySQL Types │
-│ LONGBLOB CHAR(64) JSON JSON VARCHAR etc. │
+│ Database Types │
+│ LONGBLOB CHAR(64) JSON JSON/JSONB VARCHAR etc. │
+│ (MySQL) (PostgreSQL) │
 └───────────────────────────────────────────────────────────────────┘
 ```

@@ -357,7 +399,7 @@ class Attachments(dj.Manual):
 | `<xattach@s>` | `content@s` | `_content/{hash}` | Yes | Local file path |
 | `object` | — | `{schema}/{table}/{pk}/` | No | ObjectRef |
 | `object@s` | — | `{schema}/{table}/{pk}/` | No | ObjectRef |
-| `filepath@s` | — | `_files/{user-path}` | No | ObjectRef |
+| `filepath` | `json` | External (any URI) | No | ObjectRef |
 ## Reference Counting for Content Type

@@ -408,33 +450,35 @@ def garbage_collect(project):

 | Feature | `object` | `content` | `filepath` |
 |---------|----------|-----------|------------|
-| Addressing | Primary key | Content hash | User-defined path |
+| Location | OAS store | OAS store | Anywhere (URI) |
+| Addressing | Primary key | Content hash | User URI |
 | Path control | DataJoint | DataJoint | User |
 | Deduplication | No | Yes | No |
-| Structure | Files, folders, Zarr | Single blob only | Files, folders |
+| Structure | Files, folders, Zarr | Single blob only | Any (via fsspec) |
 | Access | ObjectRef (lazy) | Transparent (bytes) | ObjectRef (lazy) |
-| GC | Deleted with row | Reference counted | Deleted with row |
-| Checksum | Optional | Implicit (is the hash) | Stored in DB |
+| GC | Deleted with row | Reference counted | N/A (external) |
+| Integrity | DataJoint managed | DataJoint managed | External (no guarantees) |

 **When to use each:**
 - **`object`**: Large/complex objects where DataJoint controls organization (Zarr, HDF5)
 - **`content`**: Deduplicated serialized data or file attachments via `<xblob>`, `<xattach>`
-- **`filepath`**: User-managed file organization, external data sources
+- **`filepath`**: External references (S3, URLs, etc.) not managed by DataJoint

 ## Key Design Decisions

-1. **Layered architecture**: Core types (`object`, `content`, `filepath`) separate from AttributeTypes
-2. **Three OAS regions**: object (PK-addressed), content (hash-addressed), filepath (user-addressed)
-3. **Content type**: Single-blob, content-addressed, deduplicated storage
-4. **Filepath upgrade**: Returns ObjectRef (lazy) instead of copying files
-5. **Parameterized types**: `<type@param>` passes parameter to underlying dtype
-6. **Naming convention**:
+1. **Layered architecture**: Core types (`object`, `content`, `filepath`, `json`) separate from AttributeTypes
+2. **Two OAS regions**: object (PK-addressed) and content (hash-addressed) within managed stores
+3. **Filepath as reference tracker**: Not an OAS region - tracks external URIs (S3, HTTP, etc.)
+4. **Content type**: Single-blob, content-addressed, deduplicated storage
+5. **JSON core type**: Cross-database compatible (MySQL JSON, PostgreSQL JSONB)
+6. **Parameterized types**: `<type@param>` passes parameter to underlying dtype
+7. **Naming convention**:
    - `<djblob>` = internal serialized (database)
    - `<xblob>` = external serialized (content-addressed)
    - `<attach>` = internal file (single file)
    - `<xattach>` = external file (single file)
-7. **Transparent access**: AttributeTypes return Python objects or file paths
-8. **Lazy access**: `object`, `object@store`, and `filepath@store` return ObjectRef
+8. **Transparent access**: AttributeTypes return Python objects or file paths
+9. **Lazy access**: `object`, `object@store`, and `filepath` return ObjectRef

 ## Migration from Legacy Types
