Commit d5439cf

Address reviewer feedback on object type spec
- Add Configuration Immutability section warning about changing settings
- Clarify database_name is for multi-database DBMS platforms
- Implement =OBJ[.ext]= display format in preview.py for query results
- Add objects property to Heading class
- Add ObjectRef.to_dict() method for raw metadata access
- Fix conflicting text about staged insert hashing
- Document explicit hash kwarg with design principles
- Rename file_storage to object_storage utility
- Document grace period for orphan cleanup race condition
1 parent 36f3bb7 commit d5439cf

4 files changed: +180 -21 lines

docs/src/design/tables/object-type-spec.md

Lines changed: 63 additions & 17 deletions
@@ -134,6 +134,20 @@ For local filesystem storage:
 | `object_storage.access_key` | string | For cloud | Access key (can use secrets file) |
 | `object_storage.secret_key` | string | For cloud | Secret key (can use secrets file) |
 
+### Configuration Immutability
+
+**CRITICAL**: Once a project has been instantiated (i.e., `datajoint_store.json` has been created and the first object stored), the following settings MUST NOT be changed:
+
+- `object_storage.project_name`
+- `object_storage.protocol`
+- `object_storage.bucket`
+- `object_storage.location`
+- `object_storage.partition_pattern`
+
+Changing these settings after objects have been stored will result in **broken references**—existing paths stored in the database will no longer resolve to valid storage locations.
+
+DataJoint validates `project_name` against `datajoint_store.json` on connect, but administrators must ensure other settings remain consistent across all clients for the lifetime of the project.
+
 ### Environment Variables
 
 Settings can be overridden via environment variables:
@@ -210,9 +224,16 @@ s3://bucket/my_project/datajoint_store.json
 | `format_version` | string | Yes | Store format version for compatibility |
 | `datajoint_version` | string | Yes | DataJoint version that created the store |
 | `database_host` | string | No | Database server hostname (for bidirectional mapping) |
-| `database_name` | string | No | Database name (for bidirectional mapping) |
+| `database_name` | string | No | Database name on the server (for bidirectional mapping) |
 
-The optional `database_host` and `database_name` fields enable bidirectional mapping between object stores and databases. This is informational only - not enforced at runtime. Administrators can alternatively ensure unique `project_name` values across their namespace, and managed platforms may handle this mapping externally.
+The `database_name` field exists for DBMS platforms that support multiple databases on a single server (e.g., PostgreSQL, MySQL). The object storage configuration is **shared across all schemas comprising the pipeline**—it's a pipeline-level setting, not a per-schema setting.
+
+The optional `database_host` and `database_name` fields enable bidirectional mapping between object stores and databases:
+
+- **Forward**: Client settings → object store location
+- **Reverse**: Object store metadata → originating database
+
+This is informational only—not enforced at runtime. Administrators can alternatively ensure unique `project_name` values across their namespace, and managed platforms may handle this mapping externally.
 
 ### Store Initialization
 
@@ -362,19 +383,28 @@ For large hierarchical data like Zarr stores, computing certain metadata can be
 
 By default, **no content hash is computed** to avoid performance overhead for large objects. Storage backend integrity is trusted.
 
-**Optional hashing** can be requested per-insert:
+**Explicit hash control** via insert kwarg:
 
 ```python
 # Default - no hash (fast)
 Recording.insert1({..., "raw_data": "/path/to/large.dat"})
 
-# Request hash computation
+# Explicit hash request - user specifies algorithm
 Recording.insert1({..., "raw_data": "/path/to/important.dat"}, hash="sha256")
+
+# Other supported algorithms
+Recording.insert1({..., "raw_data": "/path/to/data.bin"}, hash="md5")
+Recording.insert1({..., "raw_data": "/path/to/large.bin"}, hash="xxhash")  # xxh3, faster for large files
 ```
 
-Supported hash algorithms: `sha256`, `md5`, `xxhash` (xxh3, faster for large files)
+**Design principles:**
+
+- **Explicit over implicit**: No automatic hashing based on file size or other heuristics
+- **User controls the tradeoff**: User decides when integrity verification is worth the performance cost
+- **Files only**: Hash applies to files, not folders (folders use manifests for integrity)
+- **Staged inserts**: Hash is always `null` regardless of kwarg—data flows directly to storage without a local copy to hash
 
-**Staged inserts never compute hashes** - data is written directly to storage without a local copy to hash.
+Supported hash algorithms: `sha256`, `md5`, `xxhash` (xxh3, faster for large files)
 
 ### Folder Manifests
 
@@ -654,7 +684,7 @@ Remote URLs are detected by protocol prefix and handled via fsspec:
 2. Generate deterministic storage path with random token
 3. **Copy content to storage backend** via `fsspec`
 4. **If copy fails: abort insert** (no database operation attempted)
-5. Compute content hash (SHA-256)
+5. Compute content hash if requested (optional, default: no hash)
 6. Build JSON metadata structure
 7. Execute database INSERT
 
@@ -758,7 +788,7 @@ class StagedInsert:
 │ 4. User assigns object references to staged.rec │
 ├─────────────────────────────────────────────────────────┤
 │ 5. On context exit (success): │
-│ - Compute metadata (size, hash, item_count)
+│ - Build metadata (size/item_count optional, no hash)
 │ - Execute database INSERT │
 ├─────────────────────────────────────────────────────────┤
 │ 6. On context exit (exception): │
@@ -839,7 +869,7 @@ Since storage backends don't support distributed transactions with MySQL, DataJo
 │ 2. Copy file/folder to storage backend │
 │ └─ On failure: raise error, INSERT not attempted │
 ├─────────────────────────────────────────────────────────┤
-│ 3. Compute hash and build JSON metadata
+│ 3. Compute hash (if requested) and build JSON metadata │
 ├─────────────────────────────────────────────────────────┤
 │ 4. Execute database INSERT │
 │ └─ On failure: orphaned file remains (acceptable) │
@@ -871,19 +901,35 @@ Orphaned files (files in storage without corresponding database records) may acc
 
 ### Orphan Cleanup Procedure
 
-Orphan cleanup is a **separate maintenance operation** that must be performed during maintenance windows to avoid race conditions with concurrent inserts.
+Orphan cleanup is a **separate maintenance operation** provided via the `schema.object_storage` utility object.
 
 ```python
-# Maintenance utility methods
-schema.file_storage.find_orphaned()  # List files not referenced in DB
-schema.file_storage.cleanup_orphaned()  # Delete orphaned files
+# Maintenance utility methods (not a hidden table)
+schema.object_storage.find_orphaned(grace_period_minutes=30)  # List orphaned files
+schema.object_storage.cleanup_orphaned(dry_run=True)  # Delete orphaned files
+schema.object_storage.verify_integrity()  # Check all objects exist
+schema.object_storage.stats()  # Storage usage statistics
 ```
 
+**Note**: `schema.object_storage` is a utility object, not a hidden table. Unlike `attach@store` which uses `~external_*` tables, the `object` type stores all metadata inline in JSON columns and has no hidden tables.
+
+**Grace period for in-flight inserts:**
+
+While random tokens prevent filename collisions, there's a race condition with in-flight inserts:
+
+1. Insert starts: file copied to storage with token `Ax7bQ2kM`
+2. Orphan cleanup runs: lists storage, queries DB for references
+3. File `Ax7bQ2kM` not yet in DB (INSERT not committed)
+4. Cleanup identifies it as orphan and deletes it
+5. Insert commits: DB now references deleted file!
+
+**Solution**: The `grace_period_minutes` parameter (default: 30) excludes files created within that window, assuming they are in-flight inserts.
+
 **Important considerations:**
-- Should be run during low-activity periods
-- Uses transactions or locking to avoid race conditions with concurrent inserts
-- Files recently uploaded (within a grace period) are excluded to handle in-flight inserts
-- Provides dry-run mode to preview deletions before execution
+- Grace period handles race conditions—cleanup is safe to run anytime
+- Running during low-activity periods reduces in-flight operations to reason about
+- `dry_run=True` previews deletions before execution
+- Compares storage contents against JSON metadata in table columns
 
 ## Fetch Behavior
 
src/datajoint/heading.py

Lines changed: 4 additions & 0 deletions
@@ -135,6 +135,10 @@ def secondary_attributes(self):
     def blobs(self):
         return [k for k, v in self.attributes.items() if v.is_blob]
 
+    @property
+    def objects(self):
+        return [k for k, v in self.attributes.items() if v.is_object]
+
     @property
     def non_blobs(self):
         return [

src/datajoint/objectref.py

Lines changed: 21 additions & 0 deletions
@@ -111,6 +111,27 @@ def to_json(self) -> dict:
             data["item_count"] = self.item_count
         return data
 
+    def to_dict(self) -> dict:
+        """
+        Return the raw JSON metadata as a dictionary.
+
+        This is useful for inspecting the stored metadata without triggering
+        any storage backend operations. The returned dict matches the JSON
+        structure stored in the database.
+
+        Returns:
+            Dict containing the object metadata:
+            - path: Storage path
+            - size: File/folder size in bytes (or None)
+            - hash: Content hash (or None)
+            - ext: File extension (or None)
+            - is_dir: True if folder
+            - timestamp: Upload timestamp
+            - mime_type: MIME type (files only, optional)
+            - item_count: Number of files (folders only, optional)
+        """
+        return self.to_json()
+
     def _ensure_backend(self):
         """Ensure storage backend is available for I/O operations."""
         if self._backend is None:
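
A brief usage sketch (not part of this commit), assuming a row `recording` fetched from a table with an object-typed `raw_data` attribute:

```python
# No storage backend I/O happens here - to_dict() only echoes the inline JSON metadata.
meta = recording["raw_data"].to_dict()
print(meta["path"], meta["size"], meta["hash"])  # hash is None unless requested at insert
if meta["is_dir"]:
    print("folder containing", meta.get("item_count"), "files")
```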

src/datajoint/preview.py

Lines changed: 92 additions & 4 deletions
@@ -1,33 +1,98 @@
 """methods for generating previews of query expression results in python command line and Jupyter"""
 
+import json
+
 from .settings import config
 
 
+def _format_object_display(json_data):
+    """Format object metadata for display in query results."""
+    if json_data is None:
+        return "=OBJ[null]="
+    if isinstance(json_data, str):
+        try:
+            json_data = json.loads(json_data)
+        except (json.JSONDecodeError, TypeError):
+            return "=OBJ=?"
+    ext = json_data.get("ext")
+    is_dir = json_data.get("is_dir", False)
+    if ext:
+        return f"=OBJ[{ext}]="
+    elif is_dir:
+        return "=OBJ[folder]="
+    else:
+        return "=OBJ[file]="
+
+
+def _get_display_value(tup, field, object_fields, object_data):
+    """Get display value for a field, handling objects specially."""
+    if field in tup.dtype.names:
+        return tup[field]
+    elif field in object_fields and object_data is not None:
+        # Find the matching tuple in object_data by index
+        idx = list(tup.dtype.names).index(list(tup.dtype.names)[0])  # placeholder
+        return _format_object_display(object_data.get(field))
+    else:
+        return "=BLOB="
+
+
 def preview(query_expression, limit, width):
     heading = query_expression.heading
     rel = query_expression.proj(*heading.non_blobs)
+    object_fields = heading.objects
     if limit is None:
         limit = config["display.limit"]
     if width is None:
         width = config["display.width"]
     tuples = rel.fetch(limit=limit + 1, format="array")
     has_more = len(tuples) > limit
     tuples = tuples[:limit]
+
+    # Fetch object field JSON data for display (raw JSON, not ObjectRef)
+    object_data_list = []
+    if object_fields:
+        # Fetch primary key and object fields as dicts
+        obj_rel = query_expression.proj(*object_fields)
+        obj_tuples = obj_rel.fetch(limit=limit, format="array")
+        for obj_tup in obj_tuples:
+            obj_dict = {}
+            for field in object_fields:
+                if field in obj_tup.dtype.names:
+                    obj_dict[field] = obj_tup[field]
+            object_data_list.append(obj_dict)
+
     columns = heading.names
+
+    def get_placeholder(f):
+        if f in object_fields:
+            return "=OBJ[.xxx]="
+        return "=BLOB="
+
     widths = {
         f: min(
-            max([len(f)] + [len(str(e)) for e in tuples[f]] if f in tuples.dtype.names else [len("=BLOB=")]) + 4,
+            max([len(f)] + [len(str(e)) for e in tuples[f]] if f in tuples.dtype.names else [len(get_placeholder(f))]) + 4,
             width,
         )
         for f in columns
     }
     templates = {f: "%%-%d.%ds" % (widths[f], widths[f]) for f in columns}
+
+    def get_display_value(tup, f, idx):
+        if f in tup.dtype.names:
+            return tup[f]
+        elif f in object_fields and idx < len(object_data_list):
+            return _format_object_display(object_data_list[idx].get(f))
+        else:
+            return "=BLOB="
+
     return (
         " ".join([templates[f] % ("*" + f if f in rel.primary_key else f) for f in columns])
         + "\n"
         + " ".join(["+" + "-" * (widths[column] - 2) + "+" for column in columns])
         + "\n"
-        + "\n".join(" ".join(templates[f] % (tup[f] if f in tup.dtype.names else "=BLOB=") for f in columns) for tup in tuples)
+        + "\n".join(
+            " ".join(templates[f] % get_display_value(tup, f, idx) for f in columns) for idx, tup in enumerate(tuples)
+        )
         + ("\n ...\n" if has_more else "\n")
         + (" (Total: %d)\n" % len(rel) if config["display.show_tuple_count"] else "")
     )
@@ -36,11 +101,32 @@ def preview(query_expression, limit, width):
 def repr_html(query_expression):
     heading = query_expression.heading
     rel = query_expression.proj(*heading.non_blobs)
+    object_fields = heading.objects
     info = heading.table_status
     tuples = rel.fetch(limit=config["display.limit"] + 1, format="array")
     has_more = len(tuples) > config["display.limit"]
     tuples = tuples[0 : config["display.limit"]]
 
+    # Fetch object field JSON data for display (raw JSON, not ObjectRef)
+    object_data_list = []
+    if object_fields:
+        obj_rel = query_expression.proj(*object_fields)
+        obj_tuples = obj_rel.fetch(limit=config["display.limit"], format="array")
+        for obj_tup in obj_tuples:
+            obj_dict = {}
+            for field in object_fields:
+                if field in obj_tup.dtype.names:
+                    obj_dict[field] = obj_tup[field]
+            object_data_list.append(obj_dict)
+
+    def get_html_display_value(tup, name, idx):
+        if name in tup.dtype.names:
+            return tup[name]
+        elif name in object_fields and idx < len(object_data_list):
+            return _format_object_display(object_data_list[idx].get(name))
+        else:
+            return "=BLOB="
+
     css = """
     <style type="text/css">
         .Table{
@@ -120,8 +206,10 @@ def repr_html(query_expression):
         ellipsis="<p>...</p>" if has_more else "",
         body="</tr><tr>".join(
             [
-                "\n".join(["<td>%s</td>" % (tup[name] if name in tup.dtype.names else "=BLOB=") for name in heading.names])
-                for tup in tuples
+                "\n".join(
+                    ["<td>%s</td>" % get_html_display_value(tup, name, idx) for name in heading.names]
+                )
+                for idx, tup in enumerate(tuples)
             ]
         ),
         count=(("<p>Total: %d</p>" % len(rel)) if config["display.show_tuple_count"] else ""),
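
For reference, the placeholder strings that the new `_format_object_display` helper produces for a few metadata shapes (values are illustrative):

```python
_format_object_display({"ext": ".zarr", "is_dir": True})   # '=OBJ[.zarr]='
_format_object_display({"ext": None, "is_dir": True})      # '=OBJ[folder]='
_format_object_display({"ext": None, "is_dir": False})     # '=OBJ[file]='
_format_object_display(None)                                # '=OBJ[null]='
```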

0 commit comments