Commit 17ccb5a

Add lineage_hash for fast comparison
- Add lineage_hash (8-byte SHA-256 prefix) alongside full lineage tuple
- Two-phase comparison: fast hash check, then full tuple verification
- Update ~lineage table schema to include lineage_hash column with index
- Expand Performance Considerations with detailed analysis
1 parent 50e18ea commit 17ccb5a

1 file changed: +42 -7 lines changed

docs/SPEC-semantic-matching.md

Lines changed: 42 additions & 7 deletions
@@ -190,9 +190,25 @@ class SchemaGraph:
 
 ### Phase 1: Add Lineage Infrastructure
 
-1. **Add `lineage` field to `Attribute`** (`heading.py`)
-   - Add `lineage` to `default_attribute_properties` with default `None`
-   - Lineage is a tuple: `(origin_schema, origin_table, origin_attribute)` or `None`
+1. **Add `lineage` and `lineage_hash` fields to `Attribute`** (`heading.py`)
+   - `lineage`: tuple `(origin_schema, origin_table, origin_attribute)` or `None`
+   - `lineage_hash`: short hash (e.g., 8 bytes) for fast comparison
+   - Add both to `default_attribute_properties` with default `None`
+
+   ```python
+   def compute_lineage_hash(lineage):
+       """Compute a short hash for fast lineage comparison."""
+       if lineage is None:
+           return None
+       # Use first 8 bytes of SHA-256 for compact representation
+       canonical = f"{lineage[0]}.{lineage[1]}.{lineage[2]}"
+       return hashlib.sha256(canonical.encode()).digest()[:8]
+   ```
+
+   **Comparison strategy**:
+   - Fast path: compare `lineage_hash` (8-byte comparison)
+   - On hash match: verify full `lineage` tuple (collision protection)
+   - `None` lineage never matches anything (computed attributes)
 
 2. **Create `~lineage` table management** (new file: `datajoint/lineage.py`)
    - `LineageTable` class (similar to `ExternalTable`)
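
The two-phase comparison described in this hunk could look roughly like the sketch below. `compute_lineage_hash` follows the function added in the diff, with the `hashlib` import it relies on; `lineages_match` and its signature are hypothetical names introduced only for illustration and are not part of the commit.

```python
import hashlib
from typing import Optional, Tuple

Lineage = Tuple[str, str, str]  # (origin_schema, origin_table, origin_attribute)


def compute_lineage_hash(lineage: Optional[Lineage]) -> Optional[bytes]:
    """Return the first 8 bytes of SHA-256 over the dotted lineage, or None."""
    if lineage is None:
        return None
    canonical = f"{lineage[0]}.{lineage[1]}.{lineage[2]}"
    return hashlib.sha256(canonical.encode()).digest()[:8]


def lineages_match(
    a: Optional[Lineage],
    a_hash: Optional[bytes],
    b: Optional[Lineage],
    b_hash: Optional[bytes],
) -> bool:
    """Hypothetical helper: hash check first, then full tuple verification."""
    # A None lineage (computed attribute) never matches anything, itself included.
    if a is None or b is None:
        return False
    # Fast path: different hashes mean different lineages.
    if a_hash != b_hash:
        return False
    # Equal hashes may still be a collision, so compare the full tuples.
    return a == b
```

One consequence of the dotted canonical form: `("a.b", "c", "d")` and `("a", "b.c", "d")` both serialize to `a.b.c.d` and therefore share a hash. If identifiers could ever contain a dot, the full tuple check is what tells such pairs apart.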
@@ -344,7 +360,9 @@ CREATE TABLE `~lineage` (
     origin_schema VARCHAR(64) NOT NULL,
     origin_table VARCHAR(64) NOT NULL,
     origin_attribute VARCHAR(64) NOT NULL,
-    PRIMARY KEY (table_name, attribute_name)
+    lineage_hash BINARY(8) NOT NULL, -- fast comparison hash
+    PRIMARY KEY (table_name, attribute_name),
+    INDEX idx_lineage_hash (lineage_hash) -- enables hash-based lookups
 ) ENGINE=InnoDB;
 ```
 
@@ -356,8 +374,10 @@ CREATE TABLE "~lineage" (
     origin_schema VARCHAR(64) NOT NULL,
     origin_table VARCHAR(64) NOT NULL,
     origin_attribute VARCHAR(64) NOT NULL,
+    lineage_hash BYTEA NOT NULL, -- 8 bytes
     PRIMARY KEY (table_name, attribute_name)
 );
+CREATE INDEX idx_lineage_hash ON "~lineage" (lineage_hash);
 ```
 
 #### Lineage Lookup
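
As a rough illustration of what the `lineage_hash` index makes possible, the sketch below runs a hash-based lookup against an in-memory SQLite database. SQLite is only a stand-in for the MySQL and PostgreSQL backends in the spec. The column types for `table_name` and `attribute_name`, the sample rows, and the helper `attributes_sharing_lineage` are all assumptions made for the example, not part of the commit.

```python
import hashlib
import sqlite3


def compute_lineage_hash(lineage):
    """8-byte SHA-256 prefix of the dotted lineage, as in the spec."""
    canonical = f"{lineage[0]}.{lineage[1]}.{lineage[2]}"
    return hashlib.sha256(canonical.encode()).digest()[:8]


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE "~lineage" (
        table_name VARCHAR(64) NOT NULL,       -- column type assumed
        attribute_name VARCHAR(64) NOT NULL,   -- column type assumed
        origin_schema VARCHAR(64) NOT NULL,
        origin_table VARCHAR(64) NOT NULL,
        origin_attribute VARCHAR(64) NOT NULL,
        lineage_hash BLOB NOT NULL,            -- stand-in for BINARY(8) / BYTEA
        PRIMARY KEY (table_name, attribute_name)
    );
    CREATE INDEX idx_lineage_hash ON "~lineage" (lineage_hash);
    """
)

# Made-up sample rows: (table_name, attribute_name, lineage tuple).
rows = [
    ("session", "subject_id", ("lab", "subject", "subject_id")),
    ("scan", "subject_id", ("lab", "subject", "subject_id")),
    ("scan", "scan_idx", ("imaging", "scan", "scan_idx")),
]
with conn:
    conn.executemany(
        'INSERT INTO "~lineage" VALUES (?, ?, ?, ?, ?, ?)',
        [(t, a, *lin, compute_lineage_hash(lin)) for t, a, lin in rows],
    )


def attributes_sharing_lineage(conn, lineage):
    """Hypothetical helper: find attributes whose lineage equals `lineage`."""
    cur = conn.execute(
        'SELECT table_name, attribute_name FROM "~lineage" '
        "WHERE lineage_hash = ? "  # indexed fast path
        "AND origin_schema = ? AND origin_table = ? AND origin_attribute = ? "
        "ORDER BY table_name, attribute_name",
        (compute_lineage_hash(lineage), *lineage),
    )
    return cur.fetchall()
```

The query keeps the three `origin_*` predicates next to the hash predicate, so a colliding hash cannot produce a false match.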
@@ -527,9 +547,24 @@ WHERE c.contype = 'f'
 
 ## Performance Considerations
 
-1. **Memory**: Additional field per attribute (minimal impact)
-2. **Comparison**: Lineage comparison is O(1) tuple equality
-3. **Storage**: If stored in database, small overhead per attribute
+1. **Memory**: Two additional fields per attribute
+   - `lineage`: tuple of 3 strings (~100-200 bytes typical)
+   - `lineage_hash`: 8 bytes (fixed)
+
+2. **Comparison**: Two-phase strategy for optimal performance
+   - **Fast path**: Compare 8-byte `lineage_hash` values (single comparison)
+   - **Verification**: On hash match, verify full tuple (collision protection)
+   - Hash collisions are astronomically rare (1 in 2^64) but we verify anyway
+
+3. **Storage**: Small overhead in `~lineage` table
+   - ~200 bytes per attribute (table_name + attribute_name + origin tuple + hash)
+   - Indexed by (table_name, attribute_name) for fast lookup
+   - Secondary index on `lineage_hash` for potential future optimizations
+
+4. **Dependency loading**: Required before joins
+   - Already cached per connection (`connection.dependencies`)
+   - Reused across multiple join operations
+   - Fallback lineage computation adds ~1 query per table (when `~lineage` missing)
 
 ## Summary
 
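
The "1 in 2^64" figure in this hunk reads most naturally as the chance that one given pair of distinct lineages collides. Across a whole pipeline, the chance that any pair collides grows with the number of distinct lineages (the birthday bound). The sketch below estimates it, assuming the truncated SHA-256 prefix behaves like a uniformly random 64-bit value. The function name and the sample size are illustrative only.

```python
import math


def collision_probability(n: int, bits: int = 64) -> float:
    """Approximate chance that any two of n distinct lineages share a hash.

    Uses the birthday approximation p ~= 1 - exp(-n(n-1) / 2^(bits+1)),
    which assumes hash values are uniformly distributed.
    """
    pairs_over_space = n * (n - 1) / 2 / 2**bits
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-pairs_over_space)


# One pair of lineages: the per-pair figure of 2^-64.
single_pair = collision_probability(2)

# A deliberately large, made-up schema with a million distinct lineages.
million = collision_probability(1_000_000)
```

Under that assumption, even a million distinct lineages leave the collision chance on the order of 10^-8, which supports treating the tuple verification as a cheap safeguard rather than a hot path.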
