Commit 8eaa892

Clarify lineage applies only to primary key attributes
Key clarification:
- Lineage starts with primary key attributes only
- Foreign keys can only reference primary keys
- Secondary attributes have lineage = None (not inherited)

Implications for join rules:
- PK namesakes must have matching lineage (homologous)
- Secondary namesakes always collide (both have None lineage)
- Old heuristic (no secondary joins) replaced with principled rule

This means the effective behavior for secondary attributes is the same, but now based on the correct principle rather than a heuristic.
1 parent 17ccb5a commit 8eaa892

1 file changed: +39 -14 lines changed

docs/SPEC-semantic-matching.md

Lines changed: 39 additions & 14 deletions
@@ -47,18 +47,35 @@ Homologous attributes are also called **semantically matched** attributes.
 
 ### Attribute Lineage
 
-Every attribute has a **lineage** - a reference to its original definition. Lineage is propagated through:
-- Foreign key references: when table B references table A, the inherited primary key attributes in B have the same lineage as in A
-- Query expressions: projections, joins, and other operations preserve lineage
+Lineage applies **only to primary key attributes**:
+
+1. **Primary key attributes** have lineage:
+   - If native to the table: `lineage = (this_schema, this_table, attr_name)`
+   - If inherited via foreign key: `lineage = (origin_schema, origin_table, origin_attr)`
+
+2. **Secondary attributes** do NOT have lineage:
+   - `lineage = None` for all secondary (non-primary-key) attributes
+   - Secondary attributes are table-specific data, not entity identifiers
+   - Foreign keys can only reference primary keys, so secondary attributes cannot be inherited
+
+Lineage propagates through:
+- **Foreign key references**: when table B references table A, the inherited primary key attributes in B have the same lineage as their counterparts in A
+- **Query expressions**: projections preserve lineage for renamed PK attributes; computed attributes have no lineage
 
 ### Join Compatibility Rules
 
 For a join `A * B` to be valid:
-1. All namesake attributes (same name in both) must be homologous (same lineage)
+1. **Primary key namesakes** must be homologous (same lineage)
+2. **Secondary attribute namesakes** always collide (both have `lineage = None`)
 
 If namesake attributes exist that are **not** homologous, an error should be raised (collision of non-homologous namesakes).
 
-**Note**: The current restriction that joins cannot be done on secondary attributes is **deprecated**. As long as attributes are homologous, they can participate in joins regardless of primary/secondary status. A warning may be raised for joins on unindexed attributes (performance consideration).
+**Implications**:
+- Two tables with the same secondary attribute name (e.g., both have `value`) cannot be joined directly - one must be renamed via `.proj()`
+- Primary key attributes can only match if they share lineage through the FK graph
+- This replaces the old heuristic (secondary attributes can't be join keys) with a principled rule (lineage must match)
+
+**Note**: A warning may be raised for joins on unindexed attributes (performance consideration).
 
 ## Current Implementation Analysis

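The lineage and join compatibility rules in this hunk can be sketched in a few lines of Python. This is an illustrative model only, not code from the commit or from DataJoint: `Lineage`, `Attr`, and `homologous_namesakes` are hypothetical names, and treating a namesake pair where only one side lacks lineage as a collision is an assumption the spec text does not spell out.

```python
from typing import NamedTuple, Optional


class Lineage(NamedTuple):
    """Origin of a primary key attribute: (schema, table, attribute)."""
    schema: str
    table: str
    attr: str


class Attr(NamedTuple):
    """Hypothetical attribute record; not a DataJoint class."""
    name: str
    lineage: Optional[Lineage] = None  # None for secondary or computed attributes


def homologous_namesakes(a, b):
    """Return the namesake names that may be joined on; raise on any collision."""
    b_by_name = {attr.name: attr for attr in b}
    matched = []
    for attr in a:
        other = b_by_name.get(attr.name)
        if other is None:
            continue  # not a namesake, so it plays no part in the check
        # Assumption: a missing lineage on either side counts as a collision.
        if attr.lineage is None or other.lineage is None:
            raise ValueError(f"attribute {attr.name!r} lacks lineage in an operand")
        if attr.lineage != other.lineage:
            raise ValueError(f"attribute {attr.name!r} has different lineages")
        matched.append(attr.name)
    return matched


# subject_id is native to lab.Subject and inherited by the session table via FK,
# so both copies carry the same lineage.
subject_id = Lineage("lab", "Subject", "subject_id")
subject = [Attr("subject_id", subject_id), Attr("value")]
session = [Attr("subject_id", subject_id), Attr("session_date")]
print(homologous_namesakes(subject, session))  # ['subject_id']
```

If `session` also had a secondary attribute named `value`, the same call would raise, which matches the first implication listed in the hunk: one of the two would have to be renamed before joining.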
@@ -276,8 +293,8 @@ Update these methods to preserve lineage:
 ### Phase 5: Error Handling
 
 1. **Clear error messages** for:
-   - Namesake collision: `"Cannot join: attribute 'name' exists in both operands with different lineages (Student.name vs Course.name). Use .proj() to rename one."`
-   - Non-PK homologous: `"Cannot join on secondary attribute 'value' - must be in primary key of at least one operand."`
+   - PK lineage mismatch: `"Cannot join: attribute 'subject_id' exists in both operands with different lineages (lab.Subject.subject_id vs other.Experiment.subject_id). Use .proj() to rename one."`
+   - Secondary attr collision: `"Cannot join: attribute 'value' has no lineage in both operands (secondary attributes). Use .proj() to rename one."`
 
 2. **Resolution guidance** in error messages:
    - Suggest specific projection syntax to resolve
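A minimal sketch of how the two messages added in this hunk might be assembled from lineage tuples. The function `join_error_message` is a hypothetical helper, not part of the commit. The mixed case, where only one operand lacks lineage, is not covered by the spec; rendering it as "no lineage" is an assumption.

```python
def join_error_message(attr, lineage_a, lineage_b):
    """Build a collision message; lineages are (schema, table, attr) tuples or None."""
    hint = "Use .proj() to rename one."
    if lineage_a is None and lineage_b is None:
        # Secondary attribute collision: neither operand carries lineage.
        return (f"Cannot join: attribute '{attr}' has no lineage in both operands "
                f"(secondary attributes). {hint}")

    def dotted(lineage):
        # Assumption: a single missing lineage is shown as "no lineage".
        return ".".join(lineage) if lineage is not None else "no lineage"

    # Primary key lineage mismatch: same name, different origins.
    return (f"Cannot join: attribute '{attr}' exists in both operands with different "
            f"lineages ({dotted(lineage_a)} vs {dotted(lineage_b)}). {hint}")


print(join_error_message(
    "subject_id",
    ("lab", "Subject", "subject_id"),
    ("other", "Experiment", "subject_id"),
))
print(join_error_message("value", None, None))
```

Both branches end with the same `.proj()` hint, in line with the resolution guidance item above.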
@@ -434,9 +451,9 @@ This is intentional: a computed value is a new entity, not inherited from any so
 
 `dj.U` promotes attributes to the primary key for grouping/aggregation purposes, but the semantic identity of the attributes remains unchanged.
 
-### D5: Deprecate Secondary Attribute Join Restriction
+### D5: Replace Secondary Attribute Heuristic with Lineage Rule
 
-**Decision**: Remove the current restriction that prevents joining on secondary attributes.
+**Decision**: Replace the current heuristic with a principled lineage-based rule.
 
 **Current behavior** (`condition.py:assert_join_compatibility`):
 ```python
@@ -446,13 +463,21 @@ raise DataJointError(
 )
 ```
 
-**New behavior**: Any homologous attributes can participate in joins, regardless of primary/secondary status. The only requirement is matching lineage.
+**New behavior**: The restriction is now a consequence of lineage rules:
+- Secondary attributes have `lineage = None`
+- Two `None` lineages do not match (collision)
+- Therefore, secondary attribute namesakes still cause errors, but for the right reason
 
-**Rationale**: The original restriction was a heuristic to prevent accidental joins on coincidentally-named attributes. With proper lineage tracking, this heuristic is no longer needed - lineage provides the authoritative answer.
+**Key insight**: Since foreign keys can only reference primary keys, secondary attributes cannot be inherited. They are always native to their table and have no lineage. The old heuristic was correct in effect, but the new rule is principled.
+
+**Error message change**:
+```python
+# Old: "Cannot join query expressions on dependent attribute `value`"
+# New: "Cannot join: attribute 'value' has no lineage in both operands. Use .proj() to rename one."
+```
 
-**Performance warning**: Consider warning when joining on attributes that lack indexes in one or both tables:
+**Performance warning**: Consider warning when joining on attributes that lack indexes:
 ```python
-# Optional warning for unindexed join attributes
 if not has_index(table1, attr) or not has_index(table2, attr):
     warnings.warn(
         f"Join on '{attr}' may be slow: attribute is not indexed in both tables",
@@ -578,7 +603,7 @@ Semantic matching is a significant change to DataJoint's join semantics that imp
 | **D2**: Renamed attributes | Preserve original lineage |
 | **D3**: Computed attributes | Lineage = `None` (breaks matching) |
 | **D4**: `dj.U` interaction | Does not affect lineage |
-| **D5**: Secondary attr restriction | **Deprecated** - homologous attrs can join regardless of PK status |
+| **D5**: Secondary attr restriction | Replaced by lineage rule - secondary attrs have no lineage, so namesakes collide |
 | **D6**: Migration | Utility function + automatic fallback computation |
 
 ### Compatibility
