Commit 5a177bb

Remove lineage hash - use direct string comparison
Simplification: compare lineage strings directly instead of hashes.

Rationale:
- Lineage strings are short (~50-100 chars)
- Comparisons short-circuit on first difference
- Only compared for namesake attributes
- Eliminates hash computation and storage complexity

Changes:
- Remove lineage_hash from Attribute
- Remove lineage_hash column from ~lineage table
- Simplify comparison to direct string equality
1 parent c1643ef commit 5a177bb
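
The comparison rule described in the commit message can be sketched as follows. This is a minimal illustration, not the code from the commit: the function name `lineages_match` is hypothetical, and the only behavior taken from the spec is that lineages are compared with plain string equality and that `None` never matches anything, including another `None`.

```python
from typing import Optional


def lineages_match(a: Optional[str], b: Optional[str]) -> bool:
    """Return True only when both lineages are known and identical.

    Hypothetical helper; the name is an assumption for illustration.
    """
    # A missing lineage never matches anything, including another None.
    if a is None or b is None:
        return False
    # Direct string equality replaces the former 8-byte hash comparison.
    return a == b


print(lineages_match("lab.session.session_id", "lab.session.session_id"))  # True
print(lineages_match(None, None))  # False
```

Treating `None` as unequal to itself mirrors how SQL treats `NULL` in equality tests, which keeps attributes of unknown origin from being matched by accident.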

1 file changed: +13 -29 lines changed

docs/SPEC-semantic-matching.md

Lines changed: 13 additions & 29 deletions
@@ -213,24 +213,14 @@ class SchemaGraph:
 
 ### Phase 1: Add Lineage Infrastructure
 
-1. **Add `lineage` and `lineage_hash` fields to `Attribute`** (`heading.py`)
+1. **Add `lineage` field to `Attribute`** (`heading.py`)
    - `lineage`: string `"schema.table.attribute"` or `None`
-   - `lineage_hash`: 8-byte hash for fast comparison
-   - Add both to `default_attribute_properties` with default `None`
-
-   ```python
-   def compute_lineage_hash(lineage):
-       """Compute a short hash for fast lineage comparison."""
-       if lineage is None:
-           return None
-       # Use first 8 bytes of SHA-256 for compact representation
-       return hashlib.sha256(lineage.encode()).digest()[:8]
-   ```
+   - Add to `default_attribute_properties` with default `None`
 
    **Comparison strategy**:
-   - Compare `lineage_hash` only (8-byte comparison)
-   - Hash collisions (1 in 2^64) are acceptable given the low probability and cost
-   - `None` lineage never matches anything
+   - Direct string comparison (simple equality check)
+   - Lineage strings are short (~50-100 chars) and comparisons short-circuit on first difference
+   - `None` lineage never matches anything (including other `None`)
 
 2. **Create `~lineage` table management** (new file: `datajoint/lineage.py`)
    - `LineageTable` class (similar to `ExternalTable`)
@@ -384,9 +374,7 @@ CREATE TABLE `~lineage` (
     table_name VARCHAR(64) NOT NULL,
     attribute_name VARCHAR(64) NOT NULL,
     lineage VARCHAR(200) NOT NULL,  -- "schema.table.attribute"
-    lineage_hash BINARY(8) NOT NULL,  -- fast comparison hash
-    PRIMARY KEY (table_name, attribute_name),
-    INDEX idx_lineage_hash (lineage_hash)
+    PRIMARY KEY (table_name, attribute_name)
 ) ENGINE=InnoDB;
 ```
 
@@ -396,10 +384,8 @@ CREATE TABLE "~lineage" (
     table_name VARCHAR(64) NOT NULL,
     attribute_name VARCHAR(64) NOT NULL,
     lineage VARCHAR(200) NOT NULL,  -- "schema.table.attribute"
-    lineage_hash BYTEA NOT NULL,  -- 8 bytes
     PRIMARY KEY (table_name, attribute_name)
 );
-CREATE INDEX idx_lineage_hash ON "~lineage" (lineage_hash);
 ```
 
 #### Lineage Lookup
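
To see how the slimmed-down `~lineage` table above might be queried, here is a small self-contained sketch. It is an assumption-laden stand-in, not the spec's implementation: it uses the standard-library `sqlite3` module in place of MySQL or PostgreSQL, the function name `load_lineage` and the sample rows are hypothetical, and only the three remaining columns and the primary key are taken from the diff.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real backend (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE "~lineage" (
        table_name VARCHAR(64) NOT NULL,
        attribute_name VARCHAR(64) NOT NULL,
        lineage VARCHAR(200) NOT NULL,  -- "schema.table.attribute"
        PRIMARY KEY (table_name, attribute_name)
    )
    """
)

# Hypothetical sample rows, for illustration only.
conn.executemany(
    'INSERT INTO "~lineage" VALUES (?, ?, ?)',
    [
        ("session", "subject_id", "lab.subject.subject_id"),
        ("session", "session_id", "lab.session.session_id"),
        ("subject", "subject_id", "lab.subject.subject_id"),
    ],
)


def load_lineage(conn, table_name):
    """Return {attribute_name: lineage} for one table (hypothetical helper)."""
    rows = conn.execute(
        'SELECT attribute_name, lineage FROM "~lineage" WHERE table_name = ?',
        (table_name,),
    )
    # Each row is an (attribute_name, lineage) pair, so dict() builds the map.
    return dict(rows)


print(load_lineage(conn, "session"))
```

With the hash column gone, each row maps directly to one dictionary entry, and the primary key on `(table_name, attribute_name)` is the only index the lookup needs.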
@@ -410,7 +396,7 @@ When a `Heading` is initialized from a table, query the `~lineage` table:
 def _load_lineage(self, connection, database, table_name):
     """Load lineage information from the ~lineage metadata table."""
     query = """
-    SELECT attribute_name, lineage, lineage_hash
+    SELECT attribute_name, lineage
     FROM `{database}`.`~lineage`
     WHERE table_name = %s
     """.format(database=database)
@@ -591,19 +577,17 @@ WHERE c.contype = 'f'
 
 ## Performance Considerations
 
-1. **Memory**: Two additional fields per attribute
+1. **Memory**: One additional field per attribute
    - `lineage`: string `"schema.table.attribute"` (~50-100 bytes typical) or `None`
-   - `lineage_hash`: 8 bytes (fixed)
 
-2. **Comparison**: Hash-only comparison
-   - Compare 8-byte `lineage_hash` values (single integer comparison)
-   - No fallback verification needed - collision probability (1 in 2^64) is negligible
-   - `None` hashes never match
+2. **Comparison**: Direct string comparison
+   - Short strings (~50-100 chars) with early short-circuit on difference
+   - Only compared for namesake attributes (same name in both tables)
+   - `None` lineage never matches anything
 
 3. **Storage**: Small overhead in `~lineage` table
-   - ~150 bytes per attribute (table_name + attribute_name + lineage string + hash)
+   - ~130 bytes per attribute (table_name + attribute_name + lineage string)
    - Indexed by (table_name, attribute_name) for fast lookup
-   - Secondary index on `lineage_hash` for potential future optimizations
 
 4. **Dependency loading**: Required before joins
    - Already cached per connection (`connection.dependencies`)
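
The point that lineage is only compared for namesake attributes can be illustrated with a short sketch. Everything named here is hypothetical: `namesake_conflicts` is not part of the spec, and each heading is modeled as a plain dict from attribute name to lineage string (or `None`). What the caller does with a conflict is left open, since the diff does not show it.

```python
from typing import Dict, List, Optional


def namesake_conflicts(
    left: Dict[str, Optional[str]], right: Dict[str, Optional[str]]
) -> List[str]:
    """Return namesake attributes whose lineages do not match (hypothetical)."""
    conflicts = []
    # Only attributes with the same name in both headings are compared.
    for name in sorted(left.keys() & right.keys()):
        a, b = left[name], right[name]
        # None never matches, so unknown lineage on either side is a mismatch.
        if a is None or b is None or a != b:
            conflicts.append(name)
    return conflicts


left = {"session_id": "lab.session.session_id", "notes": None, "x": "a.t.x"}
right = {"session_id": "lab.session.session_id", "notes": None, "y": "b.t.y"}
print(namesake_conflicts(left, right))  # ['notes']
```

Because the loop touches only the shared names, the number of string comparisons per join is bounded by the number of namesakes, which is typically a handful of attributes.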