Each DataJoint pipeline has **one** associated storage backend configured in `datajoint.json`. DataJoint fully controls the path structure within this backend.
**Why single backend?** The object store is a logical extension of the schema—its integrity must be verifiable as a unit. With a single backend:
- Schema completeness can be verified with one listing operation
- Orphan detection is straightforward
- Migration requires only config changes, not mass URL updates in the database
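For concreteness, a minimal `datajoint.json` sketch for an S3-backed store. The nested layout and all values here are illustrative placeholders; the key names come from the settings reference later in this document, and credentials can instead come from a secrets file:

```json
{
  "object_storage": {
    "project_name": "my_project",
    "protocol": "s3",
    "bucket": "my-bucket",
    "location": "",
    "partition_pattern": "...",
    "access_key": "...",
    "secret_key": "..."
  }
}
```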
### Access Control Patterns
The deterministic path structure (`project/schema/Table/objects/pk=val/...`) enables **prefix-based access control policies** on the storage backend.
**Supported access control levels:**
| Level | Implementation | Example Policy Prefix |
|-------|----------------|------------------------|
| Project-level | Bucket or prefix policy | `my_project/` |
| Schema-level | Prefix policy | `my_project/internal_schema/` |
| Table-level | Prefix policy | `my_project/internal_schema/ProcessingResults/` |
| Row-level | Per-object ACL or signed URLs | Future enhancement |
**Example: Private and public data in one bucket**
Rather than using separate buckets, use prefix-based policies:
```
s3://my-bucket/my_project/
├── internal_schema/              ← restricted IAM policy
│   └── ProcessingResults/
│       └── objects/...
└── publications/                 ← public bucket policy
    └── PublishedDatasets/
        └── objects/...
```
This achieves the same access separation as multiple buckets while maintaining schema integrity in a single backend.
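As a sketch, the public side of this layout could be granted with a standard S3 bucket policy scoped to the `publications/` prefix (bucket and prefix names follow the example above):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicReadPublications",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-bucket/my_project/publications/*"
    }
  ]
}
```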
**Row-level access control** (access to objects for specific primary key values) is not directly supported by object store policies. Future versions may address this via DataJoint-generated signed URLs that project database permissions onto object access.
### Supported Backends
DataJoint uses **[`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/)** to ensure compatibility across multiple storage backends:
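As an illustration of what this abstraction provides, the same filesystem interface works against any supported backend; only the protocol and options change. Bucket names and credentials below are placeholders:

```python
import fsspec

# Cloud backend: list a prefix on S3
s3 = fsspec.filesystem("s3", key="ACCESS_KEY", secret="SECRET_KEY")
s3.ls("my-bucket/my_project/")

# Local backend: identical call against a directory on disk
local = fsspec.filesystem("file")
local.ls("/data/my_project/")
```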
Credentials for cloud backends are configured alongside the other `object_storage` settings:

| Setting | Type | Required | Description |
|---------|------|----------|-------------|
| `object_storage.access_key` | string | For cloud | Access key (can use secrets file) |
| `object_storage.secret_key` | string | For cloud | Secret key (can use secrets file) |
### Configuration Immutability
**CRITICAL**: Once a project has been instantiated (i.e., `datajoint_store.json` has been created and the first object stored), the following settings MUST NOT be changed:
- `object_storage.project_name`
- `object_storage.protocol`
- `object_storage.bucket`
- `object_storage.location`
- `object_storage.partition_pattern`
Changing these settings after objects have been stored will result in **broken references**—existing paths stored in the database will no longer resolve to valid storage locations.
DataJoint validates `project_name` against `datajoint_store.json` on connect, but administrators must ensure other settings remain consistent across all clients for the lifetime of the project.
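A minimal sketch of what that connect-time check implies, assuming `datajoint_store.json` lives at the store root and records the `project_name`; the helper name is illustrative, not the actual implementation:

```python
import json

import fsspec


def validate_project_name(fs: fsspec.AbstractFileSystem,
                          store_root: str, configured_name: str) -> None:
    """Fail fast if the client config disagrees with the store metadata."""
    with fs.open(f"{store_root}/datajoint_store.json") as f:
        metadata = json.load(f)
    if metadata["project_name"] != configured_name:
        raise ValueError(
            f"Configured project_name {configured_name!r} does not match "
            f"store metadata {metadata['project_name']!r}"
        )
```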
### Environment Variables
Settings can be overridden via environment variables:
`datajoint_store.json` records the following fields:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `format_version` | string | Yes | Store format version for compatibility |
| `datajoint_version` | string | Yes | DataJoint version that created the store |
| `database_host` | string | No | Database server hostname (for bidirectional mapping) |
| `database_name` | string | No | Database name on the server (for bidirectional mapping) |
The `database_name` field exists for DBMS platforms that support multiple databases on a single server (e.g., PostgreSQL, MySQL). The object storage configuration is **shared across all schemas comprising the pipeline**—it's a pipeline-level setting, not a per-schema setting.
The optional `database_host` and `database_name` fields enable bidirectional mapping between object stores and databases:
- **Forward**: Client settings → object store location
- **Reverse**: Object store metadata → originating database
This is informational only—not enforced at runtime. Administrators can alternatively ensure unique `project_name` values across their namespace, and managed platforms may handle this mapping externally.
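For concreteness, a hypothetical `datajoint_store.json` combining the fields above with the `project_name` entry implied by the Configuration Immutability section (all values are placeholders):

```json
{
  "format_version": "1.0",
  "datajoint_version": "0.15.0",
  "project_name": "my_project",
  "database_host": "db.example.org",
  "database_name": "my_project_db"
}
```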
### Store Initialization
For large hierarchical data like Zarr stores, computing certain metadata can be expensive.
By default, **no content hash is computed** to avoid performance overhead for large objects. Storage backend integrity is trusted.
If an insert fails after the file has been copied to storage, the orphaned file simply remains (acceptable). Orphaned files (files in storage without corresponding database records) may therefore accumulate over time.
### Orphan Cleanup Procedure
Orphan cleanup is a **separate maintenance operation** provided via the `schema.object_storage` utility object.
```python
# Maintenance utility methods
schema.object_storage.find_orphaned()  # List files not referenced in DB
```
**Note**: `schema.object_storage` is a utility object, not a hidden table. Unlike `attach@store` which uses `~external_*` tables, the `object` type stores all metadata inline in JSON columns and has no hidden tables.
**Grace period for in-flight inserts:**
While random tokens prevent filename collisions, there's a race condition with in-flight inserts:
1. Insert starts: file copied to storage with token `Ax7bQ2kM`
2. Orphan cleanup runs: lists storage, queries DB for references
3. File `Ax7bQ2kM` not yet in DB (INSERT not committed)
4. Cleanup identifies it as orphan and deletes it
5. Insert commits: DB now references deleted file!
**Solution**: The `grace_period_minutes` parameter (default: 30) excludes files created within that window, assuming they are in-flight inserts.
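A sketch of that grace-period filter over an fsspec listing. The function and variable names are illustrative, not the utility's actual internals, and `referenced` is assumed to be the set of paths extracted from the JSON metadata columns:

```python
from datetime import datetime, timedelta, timezone

import fsspec


def _as_datetime(value):
    """Normalize backend mtimes: local fs returns floats, s3fs returns datetimes."""
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc)
    return value


def scan_for_orphans(fs: fsspec.AbstractFileSystem, root: str,
                     referenced: set, grace_period_minutes: int = 30) -> list:
    """List unreferenced files, skipping anything newer than the grace period."""
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=grace_period_minutes)
    orphans = []
    for path, info in fs.find(root, detail=True).items():
        if path in referenced:
            continue
        mtime = info.get("mtime") or info.get("LastModified")  # key varies by backend
        if mtime is not None and _as_datetime(mtime) >= cutoff:
            continue  # recent file: assume an in-flight insert and keep it
        orphans.append(path)
    return orphans
```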
**Important considerations:**
- Grace period handles race conditions; cleanup is safe to run anytime
- Running during low-activity periods reduces in-flight operations to reason about
- `dry_run=True` previews deletions before execution
- Compares storage contents against JSON metadata in table columns
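Putting it together as a hypothetical maintenance session: only `find_orphaned()` appears in this spec, and `delete_orphaned()` is an assumed name for the companion deletion method that takes `dry_run`:

```python
# Hypothetical session; delete_orphaned() is an assumed method name
orphans = schema.object_storage.find_orphaned(grace_period_minutes=30)
print(f"{len(orphans)} candidate orphans")

schema.object_storage.delete_orphaned(dry_run=True)   # preview only
schema.object_storage.delete_orphaned(dry_run=False)  # actually delete
```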