Commit b87342b

Make content storage per-project and add migration utility

- Content-addressed storage is now per-project (not per-schema)
- Deduplication works across all schemas in a project
- ContentRegistry is project-level (e.g., {project}_content database)
- GC scans all schemas in project for references
- Add migration utility for legacy ~external_* per-schema stores
- Document migration from binary(16) UUID to char(64) SHA256 hash

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
1 parent 6fcc4d3 commit b87342b
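
The per-project deduplication and garbage collection described in the commit message can be illustrated with a minimal in-memory sketch. It is not DataJoint code: `ProjectContentStore`, `content_path`, `release`, and the schema names are hypothetical, and a plain dictionary stands in for the real store backend and `ContentRegistry`. Only the path layout `_content/{hash[:2]}/{hash[2:4]}/{hash}` and the use of SHA256 hex digests come from the spec.

```python
import hashlib


def content_path(data: bytes) -> str:
    """Derive the storage path from the SHA256 hex digest of the content."""
    h = hashlib.sha256(data).hexdigest()
    return f"_content/{h[:2]}/{h[2:4]}/{h}"


class ProjectContentStore:
    """Hypothetical in-memory stand-in for one project-wide content store."""

    def __init__(self):
        self.blobs = {}       # path -> bytes, one copy per distinct content
        self.references = {}  # schema name -> set of content hashes in use

    def put(self, schema: str, data: bytes) -> str:
        """Store data on behalf of a schema and return its content hash."""
        h = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same path, so it is stored only once.
        self.blobs.setdefault(content_path(data), data)
        self.references.setdefault(schema, set()).add(h)
        return h

    def release(self, schema: str, content_hash: str) -> None:
        """Drop one schema's reference to a content hash."""
        self.references.get(schema, set()).discard(content_hash)

    def garbage_collect(self) -> list:
        """Delete blobs that no schema in the project references."""
        referenced = set().union(*self.references.values())
        orphaned = [
            path for path in self.blobs
            if path.rsplit("/", 1)[-1] not in referenced
        ]
        for path in orphaned:
            del self.blobs[path]
        return orphaned


# Two schemas store the same bytes: one blob, two references.
store = ProjectContentStore()
h1 = store.put("schema_a", b"same bytes")
h2 = store.put("schema_b", b"same bytes")
assert h1 == h2 and len(store.blobs) == 1

# The blob survives until the last schema releases it.
store.release("schema_a", h1)
assert store.garbage_collect() == []
store.release("schema_b", h1)
assert store.garbage_collect() == [content_path(b"same bytes")]
```

The point of the sketch is that the reference check has to span every schema: a per-schema collector would have deleted the blob after the first `release`, while `schema_b` still pointed at it.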

1 file changed: +77 -13 lines changed

docs/src/design/tables/storage-types-spec.md

Lines changed: 77 additions & 13 deletions
@@ -35,10 +35,11 @@ class Analysis(dj.Computed):
 **New core type.** Content-addressed storage with deduplication:
 
 - **Single blob only**: stores a single file or serialized object (not folders)
+- **Per-project scope**: content is shared across all schemas in a project (not per-schema)
 - Path derived from content hash: `_content/{hash[:2]}/{hash[2:4]}/{hash}`
-- Many-to-one: multiple rows can reference same content
+- Many-to-one: multiple rows (even across schemas) can reference same content
 - Reference counted for garbage collection
-- Deduplication: identical content stored once
+- Deduplication: identical content stored once across the entire project
 - For folders/complex objects, use `object` type instead
 
 ```
@@ -262,12 +263,17 @@ class Attachments(dj.Manual):
 
 ## Reference Counting for Content Type
 
-The `ContentRegistry` table tracks content-addressed objects:
+The `ContentRegistry` is a **project-level** table that tracks content-addressed objects
+across all schemas. This differs from the legacy `~external_*` tables which were per-schema.
 
 ```python
 class ContentRegistry:
+    """
+    Project-level content registry.
+    Stored in a designated database (e.g., `{project}_content`).
+    """
     definition = """
-    # Content-addressed object registry
+    # Content-addressed object registry (project-wide)
     content_hash : char(64) # SHA256 hex
     ---
     store : varchar(64) # Store name
@@ -276,21 +282,22 @@ class ContentRegistry:
     """
 ```
 
-Garbage collection finds orphaned content:
+Garbage collection scans **all schemas** in the project:
 
 ```python
-def garbage_collect(schema):
-    """Remove content not referenced by any table."""
+def garbage_collect(project):
+    """Remove content not referenced by any table in any schema."""
     # Get all registered hashes
     registered = set(ContentRegistry().fetch('content_hash', 'store'))
 
-    # Get all referenced hashes from tables with content-type columns
+    # Get all referenced hashes from ALL schemas in the project
     referenced = set()
-    for table in schema.tables:
-        for attr in table.heading.attributes:
-            if attr.type in ('content', 'content@...'):
-                hashes = table.fetch(attr.name)
-                referenced.update((h, attr.store) for h in hashes)
+    for schema in project.schemas:
+        for table in schema.tables:
+            for attr in table.heading.attributes:
+                if attr.type in ('content', 'content@...'):
+                    hashes = table.fetch(attr.name)
+                    referenced.update((h, attr.store) for h in hashes)
 
     # Delete orphaned content
     for content_hash, store in (registered - referenced):
@@ -337,8 +344,65 @@ def garbage_collect(schema):
 | `attach@store` | `<xattach@store>` |
 | `filepath@store` | Deprecated (use `object@store` or `<xattach@store>`) |
 
+### Migration from Legacy `~external_*` Stores
+
+Legacy external storage used per-schema `~external_{store}` tables. Migration to the new
+per-project `ContentRegistry` requires:
+
+```python
+def migrate_external_store(schema, store_name):
+    """
+    Migrate legacy ~external_{store} to new ContentRegistry.
+
+    1. Read all entries from ~external_{store}
+    2. For each entry:
+       - Fetch content from legacy location
+       - Compute SHA256 hash
+       - Copy to _content/{hash}/ if not exists
+       - Update table column from UUID to hash
+       - Register in ContentRegistry
+    3. After all schemas migrated, drop ~external_{store} tables
+    """
+    external_table = schema.external[store_name]
+
+    for entry in external_table.fetch(as_dict=True):
+        legacy_uuid = entry['hash']
+
+        # Fetch content from legacy location
+        content = external_table.get(legacy_uuid)
+
+        # Compute new content hash
+        content_hash = hashlib.sha256(content).hexdigest()
+
+        # Store in new location if not exists
+        new_path = f"_content/{content_hash[:2]}/{content_hash[2:4]}/{content_hash}"
+        store = get_store(store_name)
+        if not store.exists(new_path):
+            store.put(new_path, content)
+
+        # Register in project-wide ContentRegistry
+        ContentRegistry().insert1({
+            'content_hash': content_hash,
+            'store': store_name,
+            'size': len(content)
+        }, skip_duplicates=True)
+
+        # Update referencing tables (UUID -> hash)
+        # ... update all tables that reference this UUID ...
+
+    # After migration complete for all schemas:
+    # DROP TABLE `{schema}`.`~external_{store}`
+```
+
+**Migration considerations:**
+- Legacy UUIDs were based on content hash but stored as `binary(16)`
+- New system uses `char(64)` SHA256 hex strings
+- Migration can be done incrementally per schema
+- Backward compatibility layer can read both formats during transition
+
 ## Open Questions
 
 1. Should `content` without `@store` use a default store, or require explicit store?
 2. Should we support `<xblob>` without `@store` syntax (implying default store)?
 3. Should `filepath@store` be kept for backward compat or fully deprecated?
+4. How long should the backward compatibility layer support legacy `~external_*` format?
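
The migration considerations in this hunk mention a backward compatibility layer that reads both reference formats during the transition. One possible shape for such a layer is sketched below. The helper names `classify_reference` and `resolve`, and the `uuid_to_hash` mapping, are hypothetical and do not appear in the spec. The sketch also assumes that legacy values arrive as 16 raw bytes or as `uuid.UUID` objects, and that new values arrive as 64-character lowercase hex strings.

```python
import hashlib
import uuid

HEX_DIGITS = set("0123456789abcdef")


def classify_reference(value) -> str:
    """Return 'legacy' for a binary(16) UUID, 'content' for a SHA256 hex string."""
    if isinstance(value, uuid.UUID):
        return "legacy"
    if isinstance(value, (bytes, bytearray)) and len(value) == 16:
        return "legacy"
    if isinstance(value, str) and len(value) == 64 and set(value) <= HEX_DIGITS:
        return "content"
    raise ValueError(f"unrecognized content reference: {value!r}")


def resolve(value, uuid_to_hash: dict) -> str:
    """Translate either reference format into a SHA256 hex digest.

    uuid_to_hash is an assumed mapping filled in during migration;
    a KeyError means the legacy entry has not been migrated yet.
    """
    if classify_reference(value) == "content":
        return value
    key = value if isinstance(value, uuid.UUID) else uuid.UUID(bytes=bytes(value))
    return uuid_to_hash[key]


# During migration, each legacy UUID is recorded next to its new hash.
legacy_uuid = uuid.UUID(int=1)  # placeholder for a value read from ~external_{store}
content = b"example blob"
uuid_to_hash = {legacy_uuid: hashlib.sha256(content).hexdigest()}

# Rows that still hold the UUID and rows already rewritten resolve identically.
assert resolve(legacy_uuid.bytes, uuid_to_hash) == uuid_to_hash[legacy_uuid]
assert resolve(uuid_to_hash[legacy_uuid], uuid_to_hash) == uuid_to_hash[legacy_uuid]
```

Because the two formats differ in both type and length, a reader can tell them apart without a version flag. That is what would let schemas be migrated one at a time.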