@@ -190,9 +190,25 @@ class SchemaGraph:
 
 ### Phase 1: Add Lineage Infrastructure
 
-1. **Add `lineage` field to `Attribute`** (`heading.py`)
-   - Add `lineage` to `default_attribute_properties` with default `None`
-   - Lineage is a tuple: `(origin_schema, origin_table, origin_attribute)` or `None`
+1. **Add `lineage` and `lineage_hash` fields to `Attribute`** (`heading.py`)
+   - `lineage`: tuple `(origin_schema, origin_table, origin_attribute)` or `None`
+   - `lineage_hash`: short hash (e.g., 8 bytes) for fast comparison
+   - Add both to `default_attribute_properties` with default `None`
+
+   ```python
+   def compute_lineage_hash(lineage):
+       """Return the first 8 bytes of SHA-256 over the dotted lineage, or None."""
+       if lineage is None:
+           return None
+       import hashlib  # local import keeps this snippet self-contained
+       canonical = f"{lineage[0]}.{lineage[1]}.{lineage[2]}"
+       return hashlib.sha256(canonical.encode()).digest()[:8]
+   ```
+
+   **Comparison strategy**:
+   - Fast path: compare `lineage_hash` (8-byte comparison)
+   - On hash match: verify full `lineage` tuple (collision protection)
+   - `None` lineage never matches anything (computed attributes)
 
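The comparison strategy in this hunk could be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `lineages_match` is a hypothetical helper name, and the hash function is repeated here only so the example runs on its own.

```python
import hashlib


def compute_lineage_hash(lineage):
    """First 8 bytes of SHA-256 over the dotted lineage tuple, or None."""
    if lineage is None:
        return None
    canonical = f"{lineage[0]}.{lineage[1]}.{lineage[2]}"
    return hashlib.sha256(canonical.encode()).digest()[:8]


def lineages_match(lineage_a, hash_a, lineage_b, hash_b):
    """Hypothetical helper: two-phase lineage comparison.

    A None lineage (computed attribute) never matches anything,
    including another None.
    """
    if lineage_a is None or lineage_b is None:
        return False
    # Fast path: differing 8-byte hashes mean the lineages differ.
    if hash_a != hash_b:
        return False
    # Verification: equal hashes could still be a collision.
    return tuple(lineage_a) == tuple(lineage_b)
```

Because the full tuple is checked whenever the hashes agree, a hash collision can cost at most one extra tuple comparison; it cannot produce a false match.
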
 2. **Create `~lineage` table management** (new file: `datajoint/lineage.py`)
    - `LineageTable` class (similar to `ExternalTable`)
@@ -344,7 +360,9 @@ CREATE TABLE `~lineage` (
     origin_schema VARCHAR(64) NOT NULL,
     origin_table VARCHAR(64) NOT NULL,
     origin_attribute VARCHAR(64) NOT NULL,
-    PRIMARY KEY (table_name, attribute_name)
+    lineage_hash BINARY(8) NOT NULL,  -- fast comparison hash
+    PRIMARY KEY (table_name, attribute_name),
+    INDEX idx_lineage_hash (lineage_hash)  -- enables hash-based lookups
 ) ENGINE=InnoDB;
 ```
 
@@ -356,8 +374,10 @@ CREATE TABLE "~lineage" (
     origin_schema VARCHAR(64) NOT NULL,
     origin_table VARCHAR(64) NOT NULL,
     origin_attribute VARCHAR(64) NOT NULL,
+    lineage_hash BYTEA NOT NULL,  -- 8 bytes
     PRIMARY KEY (table_name, attribute_name)
 );
+CREATE INDEX idx_lineage_hash ON "~lineage" (lineage_hash);
 ```
 
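To show how the `lineage_hash` column and its secondary index might be used, here is a small sketch that runs against an in-memory SQLite database as a stand-in for MySQL or PostgreSQL. The column names follow the DDL above; the `TEXT` and `BLOB` types, and the helper names `open_lineage_store`, `record_lineage`, and `attributes_sharing_lineage`, are assumptions made for illustration only.

```python
import hashlib
import sqlite3


def compute_lineage_hash(lineage):
    """First 8 bytes of SHA-256 over the dotted lineage tuple, or None."""
    if lineage is None:
        return None
    canonical = f"{lineage[0]}.{lineage[1]}.{lineage[2]}"
    return hashlib.sha256(canonical.encode()).digest()[:8]


def open_lineage_store():
    """Create an in-memory SQLite stand-in for the ~lineage table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        '''CREATE TABLE "~lineage" (
               table_name TEXT NOT NULL,
               attribute_name TEXT NOT NULL,
               origin_schema TEXT NOT NULL,
               origin_table TEXT NOT NULL,
               origin_attribute TEXT NOT NULL,
               lineage_hash BLOB NOT NULL,
               PRIMARY KEY (table_name, attribute_name)
           )'''
    )
    conn.execute('CREATE INDEX idx_lineage_hash ON "~lineage" (lineage_hash)')
    return conn


def record_lineage(conn, table_name, attribute_name, lineage):
    """Store one attribute's lineage together with its 8-byte hash."""
    conn.execute(
        'INSERT INTO "~lineage" VALUES (?, ?, ?, ?, ?, ?)',
        (table_name, attribute_name, *lineage, compute_lineage_hash(lineage)),
    )


def attributes_sharing_lineage(conn, lineage):
    """Find attributes with the same origin: hash lookup, then full-tuple check."""
    cursor = conn.execute(
        'SELECT table_name, attribute_name FROM "~lineage" '
        "WHERE lineage_hash = ? AND origin_schema = ? "
        "AND origin_table = ? AND origin_attribute = ? "
        "ORDER BY table_name, attribute_name",
        (compute_lineage_hash(lineage), *lineage),
    )
    return cursor.fetchall()
```

The `WHERE` clause mirrors the two-phase idea: the indexed hash narrows the candidates, and the three origin columns guard against collisions.
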
 #### Lineage Lookup
@@ -527,9 +547,24 @@ WHERE c.contype = 'f'
 
 ## Performance Considerations
 
-1. **Memory**: Additional field per attribute (minimal impact)
-2. **Comparison**: Lineage comparison is O(1) tuple equality
-3. **Storage**: If stored in database, small overhead per attribute
+1. **Memory**: Two additional fields per attribute
+   - `lineage`: tuple of 3 strings (~100-200 bytes typical)
+   - `lineage_hash`: 8 bytes (fixed)
+
+2. **Comparison**: Two-phase strategy
+   - **Fast path**: Compare 8-byte `lineage_hash` values (single comparison)
+   - **Verification**: On hash match, verify the full tuple (collision protection)
+   - A collision between two distinct lineages has probability about 1 in 2^64, but the full tuple is verified anyway
+
+3. **Storage**: Small overhead in the `~lineage` table
+   - ~200 bytes per attribute (table_name + attribute_name + origin tuple + hash)
+   - Indexed by (table_name, attribute_name) for fast lookup
+   - Secondary index on `lineage_hash` for potential future optimizations
+
+4. **Dependency loading**: Required before joins
+   - Already cached per connection (`connection.dependencies`)
+   - Reused across multiple join operations
+   - Fallback lineage computation adds ~1 query per table (when the `~lineage` table is missing)
 
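As a rough check on the collision figure above, the sketch below estimates collision odds for a 64-bit hash using the standard birthday-bound approximation. It assumes the truncated SHA-256 output behaves like a uniformly random 64-bit value; the function names are hypothetical.

```python
import math

HASH_BITS = 64  # 8-byte lineage_hash


def pair_collision_probability(bits=HASH_BITS):
    """Chance that two specific distinct lineages share a hash."""
    return 2.0 ** -bits


def any_collision_probability(n, bits=HASH_BITS):
    """Birthday-bound approximation for at least one collision among n lineages.

    Uses p ~= 1 - exp(-n * (n - 1) / 2^(bits + 1)); expm1 keeps precision
    when the probability is tiny.
    """
    return -math.expm1(-n * (n - 1) / 2.0 ** (bits + 1))
```

Under that assumption, even one million distinct lineages give a chance of any collision of roughly 2.7e-8, so the verification step should almost never change the result and mainly serves as a correctness guarantee.
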
 ## Summary
 