Commit d5439cf

Address reviewer feedback on object type spec
- Add Configuration Immutability section warning about changing settings
- Clarify database_name is for multi-database DBMS platforms
- Implement =OBJ[.ext]= display format in preview.py for query results
- Add objects property to Heading class
- Add ObjectRef.to_dict() method for raw metadata access
- Fix conflicting text about staged insert hashing
- Document explicit hash kwarg with design principles
- Rename file_storage to object_storage utility
- Document grace period for orphan cleanup race condition
1 parent 36f3bb7 commit d5439cf

4 files changed: +180 -21 lines

docs/src/design/tables/object-type-spec.md

Lines changed: 63 additions & 17 deletions
@@ -134,6 +134,20 @@ For local filesystem storage:
 | `object_storage.access_key` | string | For cloud | Access key (can use secrets file) |
 | `object_storage.secret_key` | string | For cloud | Secret key (can use secrets file) |
 
+### Configuration Immutability
+
+**CRITICAL**: Once a project has been instantiated (i.e., `datajoint_store.json` has been created and the first object stored), the following settings MUST NOT be changed:
+
+- `object_storage.project_name`
+- `object_storage.protocol`
+- `object_storage.bucket`
+- `object_storage.location`
+- `object_storage.partition_pattern`
+
+Changing these settings after objects have been stored will result in **broken references**—existing paths stored in the database will no longer resolve to valid storage locations.
+
+DataJoint validates `project_name` against `datajoint_store.json` on connect, but administrators must ensure other settings remain consistent across all clients for the lifetime of the project.
+
 ### Environment Variables
 
 Settings can be overridden via environment variables:
@@ -210,9 +224,16 @@ s3://bucket/my_project/datajoint_store.json
 | `format_version` | string | Yes | Store format version for compatibility |
 | `datajoint_version` | string | Yes | DataJoint version that created the store |
 | `database_host` | string | No | Database server hostname (for bidirectional mapping) |
-| `database_name` | string | No | Database name (for bidirectional mapping) |
+| `database_name` | string | No | Database name on the server (for bidirectional mapping) |
 
-The optional `database_host` and `database_name` fields enable bidirectional mapping between object stores and databases. This is informational only - not enforced at runtime. Administrators can alternatively ensure unique `project_name` values across their namespace, and managed platforms may handle this mapping externally.
+The `database_name` field exists for DBMS platforms that support multiple databases on a single server (e.g., PostgreSQL, MySQL). The object storage configuration is **shared across all schemas comprising the pipeline**—it's a pipeline-level setting, not a per-schema setting.
+
+The optional `database_host` and `database_name` fields enable bidirectional mapping between object stores and databases:
+
+- **Forward**: Client settings → object store location
+- **Reverse**: Object store metadata → originating database
+
+This is informational only—not enforced at runtime. Administrators can alternatively ensure unique `project_name` values across their namespace, and managed platforms may handle this mapping externally.
 
 ### Store Initialization
 
@@ -362,19 +383,28 @@ For large hierarchical data like Zarr stores, computing certain metadata can be
 
 By default, **no content hash is computed** to avoid performance overhead for large objects. Storage backend integrity is trusted.
 
-**Optional hashing** can be requested per-insert:
+**Explicit hash control** via insert kwarg:
 
 ```python
 # Default - no hash (fast)
 Recording.insert1({..., "raw_data": "/path/to/large.dat"})
 
-# Request hash computation
+# Explicit hash request - user specifies algorithm
 Recording.insert1({..., "raw_data": "/path/to/important.dat"}, hash="sha256")
+
+# Other supported algorithms
+Recording.insert1({..., "raw_data": "/path/to/data.bin"}, hash="md5")
+Recording.insert1({..., "raw_data": "/path/to/large.bin"}, hash="xxhash")  # xxh3, faster for large files
 ```
 
-Supported hash algorithms: `sha256`, `md5`, `xxhash` (xxh3, faster for large files)
+**Design principles:**
+
+- **Explicit over implicit**: No automatic hashing based on file size or other heuristics
+- **User controls the tradeoff**: User decides when integrity verification is worth the performance cost
+- **Files only**: Hash applies to files, not folders (folders use manifests for integrity)
+- **Staged inserts**: Hash is always `null` regardless of kwarg—data flows directly to storage without a local copy to hash
 
-**Staged inserts never compute hashes** - data is written directly to storage without a local copy to hash.
+Supported hash algorithms: `sha256`, `md5`, `xxhash` (xxh3, faster for large files)
 
 ### Folder Manifests
 
@@ -654,7 +684,7 @@ Remote URLs are detected by protocol prefix and handled via fsspec:
 2. Generate deterministic storage path with random token
 3. **Copy content to storage backend** via `fsspec`
 4. **If copy fails: abort insert** (no database operation attempted)
-5. Compute content hash (SHA-256)
+5. Compute content hash if requested (optional, default: no hash)
 6. Build JSON metadata structure
 7. Execute database INSERT
 
@@ -758,7 +788,7 @@ class StagedInsert:
 │ 4. User assigns object references to staged.rec │
 ├─────────────────────────────────────────────────────────┤
 │ 5. On context exit (success): │
-│ - Compute metadata (size, hash, item_count)
+│ - Build metadata (size/item_count optional, no hash)
 │ - Execute database INSERT │
 ├─────────────────────────────────────────────────────────┤
 │ 6. On context exit (exception): │
@@ -839,7 +869,7 @@ Since storage backends don't support distributed transactions with MySQL, DataJo
 │ 2. Copy file/folder to storage backend │
 │ └─ On failure: raise error, INSERT not attempted │
 ├─────────────────────────────────────────────────────────┤
-│ 3. Compute hash and build JSON metadata
+│ 3. Compute hash (if requested) and build JSON metadata │
 ├─────────────────────────────────────────────────────────┤
 │ 4. Execute database INSERT │
 │ └─ On failure: orphaned file remains (acceptable) │
@@ -871,19 +901,35 @@ Orphaned files (files in storage without corresponding database records) may acc
 
 ### Orphan Cleanup Procedure
 
-Orphan cleanup is a **separate maintenance operation** that must be performed during maintenance windows to avoid race conditions with concurrent inserts.
+Orphan cleanup is a **separate maintenance operation** provided via the `schema.object_storage` utility object.
 
 ```python
-# Maintenance utility methods
-schema.file_storage.find_orphaned()  # List files not referenced in DB
-schema.file_storage.cleanup_orphaned()  # Delete orphaned files
+# Maintenance utility methods (not a hidden table)
+schema.object_storage.find_orphaned(grace_period_minutes=30)  # List orphaned files
+schema.object_storage.cleanup_orphaned(dry_run=True)  # Delete orphaned files
+schema.object_storage.verify_integrity()  # Check all objects exist
+schema.object_storage.stats()  # Storage usage statistics
 ```
 
+**Note**: `schema.object_storage` is a utility object, not a hidden table. Unlike `attach@store` which uses `~external_*` tables, the `object` type stores all metadata inline in JSON columns and has no hidden tables.
+
+**Grace period for in-flight inserts:**
+
+While random tokens prevent filename collisions, there's a race condition with in-flight inserts:
+
+1. Insert starts: file copied to storage with token `Ax7bQ2kM`
+2. Orphan cleanup runs: lists storage, queries DB for references
+3. File `Ax7bQ2kM` not yet in DB (INSERT not committed)
+4. Cleanup identifies it as orphan and deletes it
+5. Insert commits: DB now references deleted file!
+
+**Solution**: The `grace_period_minutes` parameter (default: 30) excludes files created within that window, assuming they are in-flight inserts.
+
 **Important considerations:**
-- Should be run during low-activity periods
-- Uses transactions or locking to avoid race conditions with concurrent inserts
-- Files recently uploaded (within a grace period) are excluded to handle in-flight inserts
-- Provides dry-run mode to preview deletions before execution
+- Grace period handles race conditions—cleanup is safe to run anytime
+- Running during low-activity periods reduces in-flight operations to reason about
+- `dry_run=True` previews deletions before execution
+- Compares storage contents against JSON metadata in table columns
 
 ## Fetch Behavior
 
src/datajoint/heading.py

Lines changed: 4 additions & 0 deletions
@@ -135,6 +135,10 @@ def secondary_attributes(self):
     def blobs(self):
         return [k for k, v in self.attributes.items() if v.is_blob]
 
+    @property
+    def objects(self):
+        return [k for k, v in self.attributes.items() if v.is_object]
+
     @property
     def non_blobs(self):
         return [

src/datajoint/objectref.py

Lines changed: 21 additions & 0 deletions
@@ -111,6 +111,27 @@ def to_json(self) -> dict:
             data["item_count"] = self.item_count
         return data
 
+    def to_dict(self) -> dict:
+        """
+        Return the raw JSON metadata as a dictionary.
+
+        This is useful for inspecting the stored metadata without triggering
+        any storage backend operations. The returned dict matches the JSON
+        structure stored in the database.
+
+        Returns:
+            Dict containing the object metadata:
+            - path: Storage path
+            - size: File/folder size in bytes (or None)
+            - hash: Content hash (or None)
+            - ext: File extension (or None)
+            - is_dir: True if folder
+            - timestamp: Upload timestamp
+            - mime_type: MIME type (files only, optional)
+            - item_count: Number of files (folders only, optional)
+        """
+        return self.to_json()
+
     def _ensure_backend(self):
         """Ensure storage backend is available for I/O operations."""
         if self._backend is None:
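
A brief usage sketch (not part of this commit), assuming a row `recording` fetched from a table with an object-typed `raw_data` attribute:

```python
# No storage backend I/O happens here - to_dict() only echoes the inline JSON metadata.
meta = recording["raw_data"].to_dict()
print(meta["path"], meta["size"], meta["hash"])  # hash is None unless requested at insert
if meta["is_dir"]:
    print("folder containing", meta.get("item_count"), "files")
```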

src/datajoint/preview.py

Lines changed: 92 additions & 4 deletions
@@ -1,33 +1,98 @@
 """methods for generating previews of query expression results in python command line and Jupyter"""
 
+import json
+
 from .settings import config
 
 
+def _format_object_display(json_data):
+    """Format object metadata for display in query results."""
+    if json_data is None:
+        return "=OBJ[null]="
+    if isinstance(json_data, str):
+        try:
+            json_data = json.loads(json_data)
+        except (json.JSONDecodeError, TypeError):
+            return "=OBJ=?"
+    ext = json_data.get("ext")
+    is_dir = json_data.get("is_dir", False)
+    if ext:
+        return f"=OBJ[{ext}]="
+    elif is_dir:
+        return "=OBJ[folder]="
+    else:
+        return "=OBJ[file]="
+
+
+def _get_display_value(tup, field, object_fields, object_data):
+    """Get display value for a field, handling objects specially."""
+    if field in tup.dtype.names:
+        return tup[field]
+    elif field in object_fields and object_data is not None:
+        # Find the matching tuple in object_data by index
+        idx = list(tup.dtype.names).index(list(tup.dtype.names)[0])  # placeholder
+        return _format_object_display(object_data.get(field))
+    else:
+        return "=BLOB="
+
+
 def preview(query_expression, limit, width):
     heading = query_expression.heading
     rel = query_expression.proj(*heading.non_blobs)
+    object_fields = heading.objects
     if limit is None:
         limit = config["display.limit"]
     if width is None:
         width = config["display.width"]
     tuples = rel.fetch(limit=limit + 1, format="array")
     has_more = len(tuples) > limit
     tuples = tuples[:limit]
+
+    # Fetch object field JSON data for display (raw JSON, not ObjectRef)
+    object_data_list = []
+    if object_fields:
+        # Fetch primary key and object fields as dicts
+        obj_rel = query_expression.proj(*object_fields)
+        obj_tuples = obj_rel.fetch(limit=limit, format="array")
+        for obj_tup in obj_tuples:
+            obj_dict = {}
+            for field in object_fields:
+                if field in obj_tup.dtype.names:
+                    obj_dict[field] = obj_tup[field]
+            object_data_list.append(obj_dict)
+
     columns = heading.names
+
+    def get_placeholder(f):
+        if f in object_fields:
+            return "=OBJ[.xxx]="
+        return "=BLOB="
+
     widths = {
         f: min(
-            max([len(f)] + [len(str(e)) for e in tuples[f]] if f in tuples.dtype.names else [len("=BLOB=")]) + 4,
+            max([len(f)] + [len(str(e)) for e in tuples[f]] if f in tuples.dtype.names else [len(get_placeholder(f))]) + 4,
             width,
         )
         for f in columns
     }
     templates = {f: "%%-%d.%ds" % (widths[f], widths[f]) for f in columns}
+
+    def get_display_value(tup, f, idx):
+        if f in tup.dtype.names:
+            return tup[f]
+        elif f in object_fields and idx < len(object_data_list):
+            return _format_object_display(object_data_list[idx].get(f))
+        else:
+            return "=BLOB="
+
     return (
         " ".join([templates[f] % ("*" + f if f in rel.primary_key else f) for f in columns])
         + "\n"
         + " ".join(["+" + "-" * (widths[column] - 2) + "+" for column in columns])
         + "\n"
-        + "\n".join(" ".join(templates[f] % (tup[f] if f in tup.dtype.names else "=BLOB=") for f in columns) for tup in tuples)
+        + "\n".join(
+            " ".join(templates[f] % get_display_value(tup, f, idx) for f in columns) for idx, tup in enumerate(tuples)
+        )
         + ("\n ...\n" if has_more else "\n")
         + (" (Total: %d)\n" % len(rel) if config["display.show_tuple_count"] else "")
     )
@@ -36,11 +101,32 @@ def preview(query_expression, limit, width):
 def repr_html(query_expression):
     heading = query_expression.heading
     rel = query_expression.proj(*heading.non_blobs)
+    object_fields = heading.objects
     info = heading.table_status
     tuples = rel.fetch(limit=config["display.limit"] + 1, format="array")
     has_more = len(tuples) > config["display.limit"]
     tuples = tuples[0 : config["display.limit"]]
 
+    # Fetch object field JSON data for display (raw JSON, not ObjectRef)
+    object_data_list = []
+    if object_fields:
+        obj_rel = query_expression.proj(*object_fields)
+        obj_tuples = obj_rel.fetch(limit=config["display.limit"], format="array")
+        for obj_tup in obj_tuples:
+            obj_dict = {}
+            for field in object_fields:
+                if field in obj_tup.dtype.names:
+                    obj_dict[field] = obj_tup[field]
+            object_data_list.append(obj_dict)
+
+    def get_html_display_value(tup, name, idx):
+        if name in tup.dtype.names:
+            return tup[name]
+        elif name in object_fields and idx < len(object_data_list):
+            return _format_object_display(object_data_list[idx].get(name))
+        else:
+            return "=BLOB="
+
     css = """
     <style type="text/css">
         .Table{
@@ -120,8 +206,10 @@ def repr_html(query_expression):
         ellipsis="<p>...</p>" if has_more else "",
         body="</tr><tr>".join(
             [
-                "\n".join(["<td>%s</td>" % (tup[name] if name in tup.dtype.names else "=BLOB=") for name in heading.names])
-                for tup in tuples
+                "\n".join(
+                    ["<td>%s</td>" % get_html_display_value(tup, name, idx) for name in heading.names]
+                )
+                for idx, tup in enumerate(tuples)
             ]
         ),
         count=(("<p>Total: %d</p>" % len(rel)) if config["display.show_tuple_count"] else ""),
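
For reference, the placeholder strings that the new `_format_object_display` helper produces for a few metadata shapes (values are illustrative):

```python
_format_object_display({"ext": ".zarr", "is_dir": True})   # '=OBJ[.zarr]='
_format_object_display({"ext": None, "is_dir": True})      # '=OBJ[folder]='
_format_object_display({"ext": None, "is_dir": False})     # '=OBJ[file]='
_format_object_display(None)                                # '=OBJ[null]='
```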

0 commit comments