@@ -35,10 +35,11 @@ class Analysis(dj.Computed):
 **New core type.** Content-addressed storage with deduplication:
 
 - **Single blob only**: stores a single file or serialized object (not folders)
+- **Per-project scope**: content is shared across all schemas in a project (not per-schema)
 - Path derived from content hash: `_content/{hash[:2]}/{hash[2:4]}/{hash}`
-- Many-to-one: multiple rows can reference same content
+- Many-to-one: multiple rows (even across schemas) can reference the same content
 - Reference counted for garbage collection
-- Deduplication: identical content stored once
+- Deduplication: identical content stored once across the entire project
 - For folders/complex objects, use `object` type instead
 
 ```
@@ -262,12 +263,17 @@ class Attachments(dj.Manual):
 
 ## Reference Counting for Content Type
 
-The `ContentRegistry` table tracks content-addressed objects:
+The `ContentRegistry` is a **project-level** table that tracks content-addressed objects
+across all schemas. This differs from the legacy `~external_*` tables, which were per-schema.
 
 ```python
 class ContentRegistry:
+    """
+    Project-level content registry.
+    Stored in a designated database (e.g., `{project}_content`).
+    """
     definition = """
-    # Content-addressed object registry
+    # Content-addressed object registry (project-wide)
     content_hash : char(64)  # SHA256 hex
     ---
     store : varchar(64)  # Store name
@@ -276,21 +282,22 @@ class ContentRegistry:
     """
 ```
 
-Garbage collection finds orphaned content:
+Garbage collection scans **all schemas** in the project:
 
 ```python
-def garbage_collect(schema):
-    """Remove content not referenced by any table."""
+def garbage_collect(project):
+    """Remove content not referenced by any table in any schema."""
     # Get all registered (hash, store) pairs
     registered = set(zip(*ContentRegistry().fetch('content_hash', 'store')))
 
-    # Get all referenced hashes from tables with content-type columns
+    # Get all referenced hashes from ALL schemas in the project
     referenced = set()
-    for table in schema.tables:
-        for attr in table.heading.attributes:
-            if attr.type in ('content', 'content@...'):
-                hashes = table.fetch(attr.name)
-                referenced.update((h, attr.store) for h in hashes)
+    for schema in project.schemas:
+        for table in schema.tables:
+            for attr in table.heading.attributes:
+                if attr.type in ('content', 'content@...'):
+                    hashes = table.fetch(attr.name)
+                    referenced.update((h, attr.store) for h in hashes)
 
     # Delete orphaned content
     for content_hash, store in (registered - referenced):
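The sweep above reduces to a set difference over `(hash, store)` pairs. A toy, in-memory illustration of that step; all names and hashes here are invented:

```python
# What the registry knows about: every content object ever stored.
registered = {
    ("aaa111", "main"),
    ("bbb222", "main"),
    ("ccc333", "cold"),
}

# What a scan of every content-typed column in every schema found.
# Two schemas pointing at the same hash is the normal, deduplicated
# case -- the set collapses them into a single reference.
referenced = {
    ("aaa111", "main"),   # e.g. schema_a.Recording
    ("aaa111", "main"),   # e.g. schema_b.Analysis, same content
    ("ccc333", "cold"),
}

# Anything registered but never referenced is safe to delete.
orphans = registered - referenced
assert orphans == {("bbb222", "main")}
```

Because the scan must see *all* schemas before anything is deleted, a partial scan (e.g. one schema unreachable) has to abort rather than treat unseen references as orphans.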
@@ -337,8 +344,65 @@ def garbage_collect(schema):
 | `attach@store` | `<xattach@store>` |
 | `filepath@store` | Deprecated (use `object@store` or `<xattach@store>`) |
 
+### Migration from Legacy `~external_*` Stores
+
+Legacy external storage used per-schema `~external_{store}` tables. Migration to the new
+per-project `ContentRegistry` requires:
+
+```python
+import hashlib
+
+def migrate_external_store(schema, store_name):
+    """
+    Migrate legacy ~external_{store} to the new ContentRegistry.
+
+    1. Read all entries from ~external_{store}
+    2. For each entry:
+       - Fetch content from the legacy location
+       - Compute the SHA256 hash
+       - Copy to _content/{hash[:2]}/{hash[2:4]}/{hash} if not already present
+       - Update the table column from UUID to hash
+       - Register in ContentRegistry
+    3. After all schemas are migrated, drop the ~external_{store} tables
+    """
+    external_table = schema.external[store_name]
+
+    for entry in external_table.fetch(as_dict=True):
+        legacy_uuid = entry['hash']
+
+        # Fetch content from the legacy location
+        content = external_table.get(legacy_uuid)
+
+        # Compute the new content hash
+        content_hash = hashlib.sha256(content).hexdigest()
+
+        # Store in the new location if not already present
+        new_path = f"_content/{content_hash[:2]}/{content_hash[2:4]}/{content_hash}"
+        store = get_store(store_name)
+        if not store.exists(new_path):
+            store.put(new_path, content)
+
+        # Register in the project-wide ContentRegistry
+        ContentRegistry().insert1({
+            'content_hash': content_hash,
+            'store': store_name,
+            'size': len(content)
+        }, skip_duplicates=True)
+
+        # Update referencing tables (UUID -> hash)
+        # ... update all tables that reference this UUID ...
+
+    # After migration is complete for all schemas:
+    # DROP TABLE `{schema}`.`~external_{store}`
+```
+
+**Migration considerations:**
+- Legacy UUIDs were based on the content hash but stored as `binary(16)`
+- The new system uses `char(64)` SHA256 hex strings
+- Migration can be done incrementally, per schema
+- A backward-compatibility layer can read both formats during the transition
+
 ## Open Questions
 
 1. Should `content` without `@store` use a default store, or require an explicit store?
 2. Should we support `<xblob>` without `@store` syntax (implying a default store)?
 3. Should `filepath@store` be kept for backward compatibility or fully deprecated?
+4. How long should the backward-compatibility layer support the legacy `~external_*` format?
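The backward-compatibility layer in the considerations and in question 4 can branch on the width of the stored key, since the two formats cannot collide. A sketch under that assumption; `normalize_content_key` is a hypothetical helper, not part of the spec:

```python
import uuid


def normalize_content_key(value) -> str:
    """Accept either key format during the transition (illustrative only).

    - Legacy ~external_* rows store a 16-byte UUID (binary(16)).
    - Migrated rows store a 64-char SHA256 hex string (char(64)).
    """
    if isinstance(value, (bytes, bytearray)) and len(value) == 16:
        # Legacy key: render the binary(16) UUID as its 32-char hex form
        return uuid.UUID(bytes=bytes(value)).hex
    if isinstance(value, str) and len(value) == 64:
        return value  # already a SHA256 hex digest
    raise ValueError("unrecognized content key format")
```

A 32-char legacy hex key can never be mistaken for a 64-char SHA256 digest, so readers can route lookups to the legacy table or the `ContentRegistry` without extra metadata; once the layer is retired, only the 64-char branch remains.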