
Commit 965a30f

Update file type spec to use existing datajoint.json settings

- Use datajoint.json instead of datajoint.toml
- Add ObjectStorageSettings class spec for settings.py
- Support DJ_OBJECT_STORAGE_* environment variables
- Support .secrets/ directory for credentials
- Partition pattern is per-pipeline (one per settings file)
- No deduplication - each record owns its file

1 parent ba3c66b commit 965a30f

File tree

1 file changed (+126 / -41 lines)

docs/src/design/tables/file-type-spec.md

Lines changed: 126 additions & 41 deletions

@@ -8,23 +8,22 @@ The `file` type introduces a new paradigm for managed file storage in DataJoint.

### Single Storage Backend Per Pipeline

Each DataJoint pipeline has **one** associated storage backend configured in `datajoint.json`. DataJoint fully controls the path structure within this backend.

### Supported Backends

DataJoint uses **[`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/)** to ensure compatibility across multiple storage backends:

- **Local storage** – POSIX-compliant file systems (e.g., NFS, SMB)
- **Cloud-based object storage** – Amazon S3, Google Cloud Storage, Azure Blob, MinIO

## Project Structure

A DataJoint project creates a structured hierarchical storage pattern:

```
📁 project_name/
├── datajoint.json
├── 📁 schema_name1/
├── 📁 schema_name2/
├── 📁 schema_name3/
```

@@ -50,42 +49,84 @@ s3://bucket/project_name/schema_name3/objects/table1-field1/key3-value3.zarr

## Configuration

### Settings Structure

Object storage is configured in `datajoint.json` using the existing settings system:

```json
{
  "database.host": "localhost",
  "database.user": "datajoint",

  "object_storage.protocol": "s3",
  "object_storage.endpoint": "s3.amazonaws.com",
  "object_storage.bucket": "my-bucket",
  "object_storage.location": "my_project",
  "object_storage.partition_pattern": "subject{subject_id}/session{session_id}"
}
```

For local filesystem storage:

```json
{
  "object_storage.protocol": "file",
  "object_storage.location": "/data/my_project",
  "object_storage.partition_pattern": "subject{subject_id}/session{session_id}"
}
```

### Settings Schema

| Setting | Type | Required | Description |
|---------|------|----------|-------------|
| `object_storage.protocol` | string | Yes | Storage backend: `file`, `s3`, `gcs`, `azure` |
| `object_storage.location` | string | Yes | Base path or bucket prefix |
| `object_storage.bucket` | string | For cloud | Bucket name (S3, GCS, Azure) |
| `object_storage.endpoint` | string | For S3 | S3 endpoint URL |
| `object_storage.partition_pattern` | string | No | Path pattern with `{attribute}` placeholders |
| `object_storage.access_key` | string | For cloud | Access key (can use secrets file) |
| `object_storage.secret_key` | string | For cloud | Secret key (can use secrets file) |

### Environment Variables

Settings can be overridden via environment variables:

```bash
DJ_OBJECT_STORAGE_PROTOCOL=s3
DJ_OBJECT_STORAGE_BUCKET=my-bucket
DJ_OBJECT_STORAGE_LOCATION=my_project
DJ_OBJECT_STORAGE_PARTITION_PATTERN="subject{subject_id}/session{session_id}"
```
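
For illustration, a minimal sketch of the override behavior, assuming the `ObjectStorageSettings` class specified under Implementation Components below (the import path is hypothetical):

```python
import os

from datajoint.settings import ObjectStorageSettings  # hypothetical import path

# Environment variables map onto fields via the DJ_OBJECT_STORAGE_ prefix.
os.environ["DJ_OBJECT_STORAGE_PROTOCOL"] = "s3"
os.environ["DJ_OBJECT_STORAGE_BUCKET"] = "my-bucket"

settings = ObjectStorageSettings()  # reads the environment at construction
assert settings.protocol == "s3"
assert settings.bucket == "my-bucket"
```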

### Secrets

Credentials can be stored in the `.secrets/` directory:

```
.secrets/
├── object_storage.access_key
└── object_storage.secret_key
```
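
A minimal sketch of how these files could be consumed (the `read_secret` helper is illustrative, not part of this spec):

```python
from pathlib import Path

def read_secret(name: str, secrets_dir: str = ".secrets") -> str | None:
    """Return the contents of .secrets/<name>, or None if the file is absent."""
    path = Path(secrets_dir) / name
    return path.read_text().strip() if path.exists() else None

access_key = read_secret("object_storage.access_key")
secret_key = read_secret("object_storage.secret_key")
```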

### Partition Pattern

The partition pattern is configured **per pipeline** (one per settings file). Placeholders use `{attribute_name}` syntax and are replaced with primary key values.

```json
{
  "object_storage.partition_pattern": "subject{subject_id}/session{session_id}"
}
```

**Example with partitioning:**

```
s3://my-bucket/my_project/subject123/session45/schema_name/objects/Recording-raw_data/recording.dat
```

If no partition pattern is specified, files are organized directly under `{location}/{schema}/objects/`.
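
The substitution itself amounts to plain string formatting; a minimal sketch, with key names taken from the example above:

```python
pattern = "subject{subject_id}/session{session_id}"
primary_key = {"subject_id": 123, "session_id": 45}

# Each {attribute_name} placeholder is filled from the primary key.
partition = pattern.format(**primary_key)
assert partition == "subject123/session45"
```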

## Syntax

```python

@@ -108,7 +149,7 @@ The `file` type is stored as a `JSON` column in MySQL containing:

```json
{
  "path": "subject123/session45/schema_name/objects/Recording-raw_data/recording.dat",
  "size": 12345,
  "hash": "sha256:abcdef1234...",
  "original_name": "recording.dat",

@@ -132,20 +173,27 @@ The `file` type is stored as a `JSON` column in MySQL containing:

DataJoint generates storage paths using:

1. **Location** - from configuration (`object_storage.location`)
2. **Partition values** - from primary key (if `partition_pattern` configured)
3. **Schema name** - from the table's schema
4. **Object directory** - `objects/`
5. **Table-field identifier** - `{TableName}-{field_name}/`
6. **Primary key hash** - unique identifier for the record
7. **Original filename** - preserved from insert

Example path construction:

```
{location}/{partition}/{schema}/objects/{Table}-{field}/{pk_hash}/{original_name}
```
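
A sketch of the full assembly under this template; the helper name and the exact hashing of the primary key are illustrative, not the implementation's API:

```python
import hashlib

def build_object_path(location, partition, schema, table, field,
                      primary_key, original_name):
    """Illustrative assembly of
    {location}/{partition}/{schema}/objects/{Table}-{field}/{pk_hash}/{original_name}."""
    # Hash the primary key deterministically; the truncation length is illustrative.
    pk_hash = hashlib.sha256(
        repr(sorted(primary_key.items())).encode()
    ).hexdigest()[:16]
    parts = [location, partition, schema, "objects",
             f"{table}-{field}", pk_hash, original_name]
    return "/".join(p for p in parts if p)  # skip partition when not configured

build_object_path("my_project", "subject123/session45", "schema_name",
                  "Recording", "raw_data",
                  {"subject_id": 123, "session_id": 45}, "recording.dat")
```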

### No Deduplication

Each insert stores a separate copy of the file, even if identical content was previously stored. This ensures:

- Clear 1:1 relationship between records and files
- Simplified delete behavior
- No reference counting complexity

## Insert Behavior

At insert time, the `file` attribute accepts:

@@ -173,7 +221,7 @@ with open("/local/path/data.bin", "rb") as f:

### Insert Processing Steps

1. Resolve storage backend from pipeline configuration
2. Read file content (from path or stream)
3. Compute content hash (SHA-256; see the sketch below)
4. Generate storage path using partition pattern and primary key
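
Steps 2–3 in a minimal sketch; the digest is stored with the `sha256:` prefix shown in the metadata example above:

```python
import hashlib

# Read the file content (step 2) and compute its SHA-256 (step 3).
with open("/local/path/data.bin", "rb") as f:
    content = f.read()

content_hash = "sha256:" + hashlib.sha256(content).hexdigest()
```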

@@ -208,39 +256,68 @@ with file_ref.open() as f:

## Implementation Components

### 1. Settings Extension (`settings.py`)

New `ObjectStorageSettings` class:

```python
class ObjectStorageSettings(BaseSettings):
    """Object storage configuration for file columns."""

    model_config = SettingsConfigDict(
        env_prefix="DJ_OBJECT_STORAGE_",
        extra="forbid",
        validate_assignment=True,
    )

    protocol: Literal["file", "s3", "gcs", "azure"] | None = None
    location: str | None = None
    bucket: str | None = None
    endpoint: str | None = None
    partition_pattern: str | None = None
    access_key: str | None = None
    secret_key: SecretStr | None = None
```

Add to main `Config` class:

```python
object_storage: ObjectStorageSettings = Field(default_factory=ObjectStorageSettings)
```

### 2. Storage Backend (`storage.py` - new module)

- `StorageBackend` class wrapping `fsspec` (see the sketch below)
- Methods: `upload()`, `download()`, `open()`, `exists()`, `delete()`
- Path generation with partition support
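
A minimal sketch of such a wrapper, assuming its configuration values come from `ObjectStorageSettings`; the constructor arguments and method bodies are illustrative, not the final interface:

```python
import fsspec

class StorageBackend:
    """Illustrative fsspec wrapper; not the final interface."""

    def __init__(self, protocol: str, location: str, **storage_options):
        self.fs = fsspec.filesystem(protocol, **storage_options)
        self.location = location

    def _full(self, path: str) -> str:
        return f"{self.location}/{path}"

    def upload(self, local_path: str, remote_path: str) -> None:
        self.fs.put(local_path, self._full(remote_path))

    def download(self, remote_path: str, local_path: str) -> None:
        self.fs.get(self._full(remote_path), local_path)

    def open(self, remote_path: str, mode: str = "rb"):
        return self.fs.open(self._full(remote_path), mode)

    def exists(self, remote_path: str) -> bool:
        return self.fs.exists(self._full(remote_path))

    def delete(self, remote_path: str) -> None:
        self.fs.rm(self._full(remote_path))
```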

### 3. Type Declaration (`declare.py`)

- Add `FILE` pattern: `file$` (see the sketch below)
- Add to `SPECIAL_TYPES`
- Substitute to `JSON` type in database
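
A sketch of the pattern and the substitution step; the registration mechanics are illustrative:

```python
import re

# Recognize the bare `file` type in a declaration and substitute the
# stored MySQL type.
FILE = re.compile(r"file$", re.IGNORECASE)

def substitute_type(dtype: str) -> str:
    return "json" if FILE.match(dtype.strip()) else dtype

assert substitute_type("file") == "json"
```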

### 4. Schema Integration (`schemas.py`)

- Associate storage backend with schema
- Validate storage configuration on schema creation

### 5. Insert Processing (`table.py`)

- New `__process_file_attribute()` method
- Path generation using primary key and partition pattern
- Upload via storage backend

### 6. Fetch Processing (`fetch.py`)

- New `FileRef` class
- Lazy loading from storage backend
- Metadata access interface

### 7. FileRef Class (`fileref.py` - new module)

```python
@dataclass
class FileRef:
    """Reference to a file stored in the pipeline's storage backend."""
```

@@ -250,10 +327,11 @@ class FileRef:

```python
    original_name: str
    timestamp: datetime
    mime_type: str | None
    _backend: StorageBackend  # internal reference

    def read(self) -> bytes: ...
    def open(self, mode: str = "rb") -> IO: ...
    def download(self, destination: Path | str) -> Path: ...
    def exists(self) -> bool: ...
```
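
Typical use after a fetch might look like the following; the table and attribute names are illustrative:

```python
# Fetch returns a FileRef rather than the file content itself.
file_ref = (Recording & key).fetch1("raw_data")

data = file_ref.read()                     # load the bytes into memory
local = file_ref.download("/tmp/out.dat")  # or copy to a local path
```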

@@ -278,9 +356,16 @@ azure = ["adlfs"]

| Store config | Per-attribute | Per-attribute | Per-pipeline |
| Path control | DataJoint | User-managed | DataJoint |
| DB column | binary(16) UUID | binary(16) UUID | JSON |
| Backend | File/S3 only | File/S3 only | fsspec (any) |
| Partitioning | Hash-based | User path | Configurable |
| Metadata | External table | External table | Inline JSON |
| Deduplication | By content | By path | None |

## Delete Behavior

When a record with a `file` attribute is deleted:

- The corresponding file in storage is also deleted (see the sketch below)
- No reference counting (each record owns its file)
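
A sketch of the cleanup step; the function and attribute names are illustrative:

```python
def _delete_stored_files(rows, backend):
    """Illustrative cleanup: remove each record's file before the row delete."""
    for row in rows:
        metadata = row["raw_data"]        # the JSON metadata column shown earlier
        backend.delete(metadata["path"])  # no reference counting needed
```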

## Migration Path