@@ -8,23 +8,22 @@ The `file` type introduces a new paradigm for managed file storage in DataJoint.
 
 ### Single Storage Backend Per Pipeline
 
-Each DataJoint pipeline has **one** associated storage backend configured in `datajoint.toml`. DataJoint fully controls the path structure within this backend.
+Each DataJoint pipeline has **one** associated storage backend configured in `datajoint.json`. DataJoint fully controls the path structure within this backend.
 
 ### Supported Backends
 
 DataJoint uses **[`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/)** to ensure compatibility across multiple storage backends:
 
 - **Local storage** – POSIX-compliant file systems (e.g., NFS, SMB)
 - **Cloud-based object storage** – Amazon S3, Google Cloud Storage, Azure Blob, MinIO
-- **Hybrid storage** – Combining local and cloud storage for flexibility
 
 ## Project Structure
 
 A DataJoint project creates a structured hierarchical storage pattern:
 
 ```
 📁 project_name/
-├── datajoint.toml
+├── datajoint.json
 ├── 📁 schema_name1/
 ├── 📁 schema_name2/
 ├── 📁 schema_name3/
@@ -50,42 +49,84 @@ s3://bucket/project_name/schema_name3/objects/table1-field1/key3-value3.zarr
 
 ## Configuration
 
-### `datajoint.toml` Structure
+### Settings Structure
 
-```toml
-[project]
-name = "my_project"
+Object storage is configured in `datajoint.json` using the existing settings system:
 
-[storage]
-backend = "s3"  # or "file", "gcs", "azure"
-bucket = "my-bucket"
-# For local: path = "/data/my_project"
+```json
+{
+  "database.host": "localhost",
+  "database.user": "datajoint",
+
+  "object_storage.protocol": "s3",
+  "object_storage.endpoint": "s3.amazonaws.com",
+  "object_storage.bucket": "my-bucket",
+  "object_storage.location": "my_project",
+  "object_storage.partition_pattern": "subject{subject_id}/session{session_id}"
+}
+```
 
-[storage.credentials]
-# Backend-specific credentials (or reference to secrets manager)
+For local filesystem storage:
 
-[object_storage]
-partition_pattern = "subject{subject_id}/session{session_id}"
+```json
+{
+  "object_storage.protocol": "file",
+  "object_storage.location": "/data/my_project",
+  "object_storage.partition_pattern": "subject{subject_id}/session{session_id}"
+}
 ```
 
-### Partition Pattern
+### Settings Schema
 
-The organizational structure of stored objects is configurable, allowing partitioning based on **primary key attributes**.
+| Setting | Type | Required | Description |
+|---------|------|----------|-------------|
+| `object_storage.protocol` | string | Yes | Storage backend: `file`, `s3`, `gcs`, `azure` |
+| `object_storage.location` | string | Yes | Base path or bucket prefix |
+| `object_storage.bucket` | string | For cloud | Bucket name (S3, GCS, Azure) |
+| `object_storage.endpoint` | string | For S3 | S3 endpoint URL |
+| `object_storage.partition_pattern` | string | No | Path pattern with `{attribute}` placeholders |
+| `object_storage.access_key` | string | For cloud | Access key (can use secrets file) |
+| `object_storage.secret_key` | string | For cloud | Secret key (can use secrets file) |
 
75- ``` toml
76- [object_storage ]
77- partition_pattern = " subject{subject_id}/session{session_id}"
91+ ### Environment Variables
92+
93+ Settings can be overridden via environment variables:
94+
95+ ``` bash
96+ DJ_OBJECT_STORAGE_PROTOCOL=s3
97+ DJ_OBJECT_STORAGE_BUCKET=my-bucket
98+ DJ_OBJECT_STORAGE_LOCATION=my_project
99+ DJ_OBJECT_STORAGE_PARTITION_PATTERN=" subject{subject_id}/session{session_id}"
78100```
79101
-Placeholders `{subject_id}` and `{session_id}` are dynamically replaced with actual primary key values.
+### Secrets
+
+Credentials can be stored in the `.secrets/` directory:
+
+```
+.secrets/
+├── object_storage.access_key
+└── object_storage.secret_key
+```
+
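+As a minimal sketch (the helper below is hypothetical, not part of the proposed API), these files could be read with:
+
+```python
+from pathlib import Path
+
+
+def load_secret(name: str, secrets_dir: Path = Path(".secrets")) -> str | None:
+    """Read a credential file such as .secrets/object_storage.secret_key, if present."""
+    f = secrets_dir / name
+    return f.read_text().strip() if f.exists() else None
+```
+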
+### Partition Pattern
+
+The partition pattern is configured **per pipeline** (one per settings file). Placeholders use `{attribute_name}` syntax and are replaced with primary key values.
+
+```json
+{
+  "object_storage.partition_pattern": "subject{subject_id}/session{session_id}"
+}
+```
 
 **Example with partitioning:**
 
 ```
-s3://my-bucket/project_name/subject123/session45/schema_name3/objects/table1/key1-value1/image1.tiff
-s3://my-bucket/project_name/subject123/session45/schema_name3/objects/table2/key2-value2/movie2.zarr
+s3://my-bucket/my_project/subject123/session45/schema_name/objects/Recording-raw_data/recording.dat
 ```
 
+If no partition pattern is specified, files are organized directly under `{location}/{schema}/objects/`.
+
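+Substitution itself can be as simple as Python's `str.format_map`; this helper is an illustrative sketch, not the proposed implementation:
+
+```python
+def render_partition(pattern: str, primary_key: dict) -> str:
+    """Replace {attribute} placeholders with primary key values."""
+    return pattern.format_map(primary_key)
+
+
+# render_partition("subject{subject_id}/session{session_id}",
+#                  {"subject_id": 123, "session_id": 45})
+# -> "subject123/session45"
+```
+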
 ## Syntax
 
 ```python
@@ -108,7 +149,7 @@ The `file` type is stored as a `JSON` column in MySQL containing:
 
 ```json
 {
-  "path": "subject123/session45/schema_name/objects/Recording-raw_data/...",
+  "path": "subject123/session45/schema_name/objects/Recording-raw_data/recording.dat",
   "size": 12345,
   "hash": "sha256:abcdef1234...",
   "original_name": "recording.dat",
@@ -132,20 +173,27 @@ The `file` type is stored as a `JSON` column in MySQL containing:
 
 DataJoint generates storage paths using:
 
-1. **Project name** - from configuration
-2. **Partition values** - from primary key (if configured)
+1. **Location** - from configuration (`object_storage.location`)
+2. **Partition values** - from primary key (if `partition_pattern` configured)
 3. **Schema name** - from the table's schema
 4. **Object directory** - `objects/`
-5. **Table-field identifier** - `{table_name}-{field_name}/`
-6. **Key identifier** - derived from primary key values
+5. **Table-field identifier** - `{TableName}-{field_name}/`
+6. **Primary key hash** - unique identifier for the record
 7. **Original filename** - preserved from insert
 
 Example path construction:
 
 ```
-{project}/{partition}/{schema}/objects/{table}-{field}/{key_hash}/{original_name}
+{location}/{partition}/{schema}/objects/{Table}-{field}/{pk_hash}/{original_name}
 ```
 
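+Putting the components together, a hypothetical path builder might look like this (the exact primary-key hashing scheme is an assumption, not specified above):
+
+```python
+import hashlib
+
+
+def build_object_path(location, partition, schema, table, field, pk, original_name):
+    """Assemble a storage path from its components (illustrative only)."""
+    # assumed scheme: short SHA-256 digest over the sorted primary key items
+    pk_hash = hashlib.sha256(repr(sorted(pk.items())).encode()).hexdigest()[:16]
+    return f"{location}/{partition}/{schema}/objects/{table}-{field}/{pk_hash}/{original_name}"
+```
+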
+### No Deduplication
+
+Each insert stores a separate copy of the file, even if identical content was previously stored. This ensures:
+- Clear 1:1 relationship between records and files
+- Simplified delete behavior
+- No reference-counting complexity
+
 ## Insert Behavior
 
 At insert time, the `file` attribute accepts:
@@ -173,7 +221,7 @@ with open("/local/path/data.bin", "rb") as f:
 
 ### Insert Processing Steps
 
-1. Resolve storage backend from schema's pipeline configuration
+1. Resolve storage backend from pipeline configuration
 2. Read file content (from path or stream)
 3. Compute content hash (SHA-256)
 4. Generate storage path using partition pattern and primary key
@@ -208,39 +256,68 @@ with file_ref.open() as f:
 
 ## Implementation Components
 
-### 1. Storage Backend (`storage.py` - new module)
+### 1. Settings Extension (`settings.py`)
+
+New `ObjectStorageSettings` class:
+
+```python
+from typing import Literal
+
+from pydantic import SecretStr
+from pydantic_settings import BaseSettings, SettingsConfigDict
+
+
+class ObjectStorageSettings(BaseSettings):
+    """Object storage configuration for file columns."""
+
+    model_config = SettingsConfigDict(
+        env_prefix="DJ_OBJECT_STORAGE_",
+        extra="forbid",
+        validate_assignment=True,
+    )
+
+    protocol: Literal["file", "s3", "gcs", "azure"] | None = None
+    location: str | None = None
+    bucket: str | None = None
+    endpoint: str | None = None
+    partition_pattern: str | None = None
+    access_key: str | None = None
+    secret_key: SecretStr | None = None
+```
+
+Add to the main `Config` class:
+
+```python
+object_storage: ObjectStorageSettings = Field(default_factory=ObjectStorageSettings)
+```
+
+### 2. Storage Backend (`storage.py` - new module)
 
 - `StorageBackend` class wrapping `fsspec`
 - Methods: `upload()`, `download()`, `open()`, `exists()`, `delete()`
 - Path generation with partition support
-- Configuration loading from `datajoint.toml`
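+
+A minimal sketch of this wrapper, assuming `fsspec`'s standard filesystem API (class shape and method signatures are illustrative, not final):
+
+```python
+import fsspec
+
+
+class StorageBackend:
+    """Thin wrapper around an fsspec filesystem."""
+
+    def __init__(self, protocol: str, location: str, **storage_options):
+        self.fs = fsspec.filesystem(protocol, **storage_options)
+        self.location = location.rstrip("/")
+
+    def upload(self, local_path: str, remote_path: str) -> None:
+        self.fs.put(local_path, f"{self.location}/{remote_path}")
+
+    def download(self, remote_path: str, local_path: str) -> None:
+        self.fs.get(f"{self.location}/{remote_path}", local_path)
+
+    def open(self, remote_path: str, mode: str = "rb"):
+        return self.fs.open(f"{self.location}/{remote_path}", mode)
+
+    def exists(self, remote_path: str) -> bool:
+        return self.fs.exists(f"{self.location}/{remote_path}")
+
+    def delete(self, remote_path: str) -> None:
+        self.fs.rm(f"{self.location}/{remote_path}")
+```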
 
-### 2. Type Declaration (`declare.py`)
+### 3. Type Declaration (`declare.py`)
 
 - Add `FILE` pattern: `file$`
 - Add to `SPECIAL_TYPES`
 - Substitute to `JSON` type in database
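+
+In sketch form, assuming `declare.py` keeps its existing `TYPE_PATTERN` dict of compiled regexes and its `SPECIAL_TYPES` collection (both assumptions about current internals):
+
+```python
+import re
+
+# match the declared type `file` (case-insensitive), analogous to existing patterns
+TYPE_PATTERN["FILE"] = re.compile(r"^file$", re.I)
+SPECIAL_TYPES.add("FILE")  # treat as a special type rather than a native MySQL type
+# at declaration time, a matched attribute is substituted with a JSON column
+```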
 
-### 3. Schema Integration (`schemas.py`)
+### 4. Schema Integration (`schemas.py`)
 
 - Associate storage backend with schema
-- Load configuration on schema creation
+- Validate storage configuration on schema creation
 
-### 4. Insert Processing (`table.py`)
+### 5. Insert Processing (`table.py`)
 
 - New `__process_file_attribute()` method
 - Path generation using primary key and partition pattern
 - Upload via storage backend
 
-### 5. Fetch Processing (`fetch.py`)
+### 6. Fetch Processing (`fetch.py`)
 
 - New `FileRef` class
 - Lazy loading from storage backend
 - Metadata access interface
 
-### 6. FileRef Class (`fileref.py` - new module)
+### 7. FileRef Class (`fileref.py` - new module)
 
 ```python
+@dataclass
 class FileRef:
     """Reference to a file stored in the pipeline's storage backend."""
 
@@ -250,10 +327,11 @@ class FileRef:
     original_name: str
     timestamp: datetime
     mime_type: str | None
+    _backend: StorageBackend  # internal reference
 
     def read(self) -> bytes: ...
-    def open(self, mode="rb") -> IO: ...
-    def download(self, destination: Path) -> Path: ...
+    def open(self, mode: str = "rb") -> IO: ...
+    def download(self, destination: Path | str) -> Path: ...
     def exists(self) -> bool: ...
 ```
 
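+Hypothetical usage of a fetched value (table and field names follow the earlier examples):
+
+```python
+ref = (Recording & key).fetch1("raw_data")   # returns a FileRef, not raw bytes
+data = ref.read()                            # loads content lazily from the backend
+local = ref.download("/tmp/recording.dat")   # materializes a local copy
+```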
@@ -278,9 +356,16 @@ azure = ["adlfs"]
 | Store config | Per-attribute | Per-attribute | Per-pipeline |
 | Path control | DataJoint | User-managed | DataJoint |
 | DB column | binary(16) UUID | binary(16) UUID | JSON |
-| Backend | File/S3 | File/S3 | fsspec (any) |
+| Backend | File/S3 only | File/S3 only | fsspec (any) |
 | Partitioning | Hash-based | User path | Configurable |
 | Metadata | External table | External table | Inline JSON |
+| Deduplication | By content | By path | None |
+
+## Delete Behavior
+
+When a record with a `file` attribute is deleted:
+- The corresponding file in storage is also deleted
+- No reference counting (each record owns its file)
 
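+A sketch of the corresponding delete hook (function and variable names are illustrative):
+
+```python
+def _delete_stored_files(rows, field, backend):
+    """After deleting records, remove each record's owned file."""
+    for row in rows:
+        meta = row[field]             # the JSON metadata stored in the column
+        backend.delete(meta["path"])  # no reference counting: one record, one file
+```
+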
 ## Migration Path
 