Add fallback lineage computation and deprecate secondary attr restriction
Key updates to the semantic matching specification:
1. D1 updated: ~lineage table with in-memory fallback
- Computes lineage from FK graph when ~lineage table doesn't exist
- Uses existing Dependencies class (already loads FK info)
- Works with databases not created by DataJoint
2. D5 added: Deprecate secondary attribute join restriction
- Remove assert_join_compatibility's secondary attr check
- Homologous attributes can join regardless of PK status
- Optional warning for unindexed join attributes
3. Dependencies must be loaded before joins
- Pattern already used for delete, drop, populate
- Enables in-memory lineage computation for any database
docs/SPEC-semantic-matching.md
97 additions & 13 deletions
@@ -54,11 +54,12 @@ Every attribute has a **lineage** - a reference to its original definition. Line
 ### Join Compatibility Rules

 For a join `A * B` to be valid:
-1. All namesake attributes (same name in both) must be homologous
-2. Homologous attributes must be in the primary key of at least one operand
+1. All namesake attributes (same name in both) must be homologous (same lineage)

 If namesake attributes exist that are **not** homologous, an error should be raised (collision of non-homologous namesakes).

+**Note**: The current restriction that joins cannot be done on secondary attributes is **deprecated**. As long as attributes are homologous, they can participate in joins regardless of primary/secondary status. A warning may be raised for joins on unindexed attributes (performance consideration).
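
The revised rule can be illustrated with a minimal sketch. It is not the DataJoint implementation: `JoinCompatibilityError`, `join_attributes`, and the `schema.table.attribute` lineage strings are hypothetical names chosen for illustration, and each operand is reduced to a plain mapping from attribute name to lineage.

```python
class JoinCompatibilityError(Exception):
    """Hypothetical stand-in for the error raised on a namesake collision."""


def join_attributes(lineage_a, lineage_b):
    """Return the namesake attributes that two operands can be joined on.

    Each argument maps attribute name -> lineage identifier. The string
    format of the identifiers used in the example is an assumption.
    """
    # Namesakes: attributes with the same name in both operands.
    namesakes = [name for name in lineage_a if name in lineage_b]
    # A namesake pair with different lineage is a collision.
    collisions = [n for n in namesakes if lineage_a[n] != lineage_b[n]]
    if collisions:
        raise JoinCompatibilityError(
            "Non-homologous namesake attributes: %s" % ", ".join(collisions)
        )
    # Primary/secondary status plays no role: only lineage is compared.
    return namesakes


# Example operands (illustrative attribute names and lineages).
session = {
    "subject_id": "lab.subject.subject_id",
    "session_date": "lab.session.session_date",
}
scan = {
    "subject_id": "lab.subject.subject_id",
    "session_date": "lab.session.session_date",
    "depth": "lab.scan.depth",
}
shared = join_attributes(session, scan)  # ["subject_id", "session_date"]
```

Because the check compares lineage only, `session_date` participates in the join whether or not it is part of a primary key.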
+
 ## Current Implementation Analysis

 ### Attribute Representation (`heading.py:48`)
@@ -228,7 +229,15 @@ Update these methods to preserve lineage:

 ### Phase 4: Implement Semantic Join Matching

-1. Modify `join()` in `expression.py`:
+1. **Ensure dependencies are loaded** before join operations:
@@ -243,9 +252,10 @@ Update these methods to preserve lineage:
# ...
```

-2. Modify `assert_join_compatibility()` in `condition.py`:
+3. **Modify `assert_join_compatibility()` in `condition.py`**:
+- **Remove** the secondary attribute restriction (deprecated)
 - Check for namesake collisions (same name, different lineage)
-- Check that homologous attributes are in primary key
+- Optionally warn about unindexed join attributes

 ### Phase 5: Error Handling

@@ -275,15 +285,55 @@ Update these methods to preserve lineage:

 ## Design Decisions

-### D1: Lineage Storage - Hidden Metadata Table
+### D1: Lineage Storage - Hidden Metadata Table with Fallback

-**Decision**: Use a hidden metadata table (`~lineage`) per schema.
+**Decision**: Use a hidden metadata table (`~lineage`) per schema, with **in-memory fallback** when the table doesn't exist.

 This approach:
 - Works consistently for both **MySQL** and **PostgreSQL**
 - Provides explicit, queryable lineage data
 - Follows the existing pattern for hidden tables (e.g., `~external_*`, `~log`)
 - Makes existing schemas easier to migrate
+- **Works with databases not created by DataJoint** via fallback computation
+
+#### Fallback: Compute Lineage from Dependencies
+
+When the `~lineage` table does not exist (e.g., external databases, legacy schemas), lineage is computed **in-memory** from the FK graph using the existing `Dependencies` class:
+**Dependencies must be loaded before joins.** This is already the pattern for operations like `delete`, `drop`, and `populate`. The join operation will:
+
+1. Ensure `connection.dependencies.load(force=False)` is called
+2. Check if the `~lineage` table exists for the involved schemas
+3. If it exists: read lineage from the table (fast)
+4. If it does not exist: compute lineage from the FK graph (slower, but works for any database)
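
Steps 2 to 4 above can be sketched as follows. This is an illustration under stated assumptions, not the spec's implementation: it does not use the `Dependencies` class, the FK graph is reduced to a plain dictionary, each attribute is assumed to take part in at most one foreign key, and the function names and the `table.attribute` lineage format are hypothetical.

```python
def compute_lineage_from_fk(fk_edges, table, attr):
    """Follow foreign keys from (table, attr) up to the attribute's origin.

    fk_edges maps (child_table, child_attr) -> (parent_table, parent_attr).
    An attribute that is not inherited through any FK is its own origin.
    """
    node = (table, attr)
    seen = set()
    while node in fk_edges:
        if node in seen:
            raise ValueError("Cyclic foreign key chain at %s.%s" % node)
        seen.add(node)
        node = fk_edges[node]
    return "%s.%s" % node


def get_lineage(stored, fk_edges, table, attr):
    """Read lineage from stored data if available, else compute it.

    stored stands in for the contents of a `~lineage` table as a dict keyed
    by (table, attr); None means the table does not exist for the schema.
    """
    if stored is not None:
        return stored[(table, attr)]  # fast path: stored lineage
    return compute_lineage_from_fk(fk_edges, table, attr)  # fallback


# Illustrative FK chain: scan -> session -> subject.
fk_edges = {
    ("lab.session", "subject_id"): ("lab.subject", "subject_id"),
    ("lab.scan", "subject_id"): ("lab.session", "subject_id"),
}
origin = get_lineage(None, fk_edges, "lab.scan", "subject_id")
```

With no stored lineage, `origin` resolves through both foreign keys to `lab.subject.subject_id`, which is why the fallback needs the full FK graph loaded first.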

 #### Table Structure

@@ -364,7 +414,33 @@ This is intentional: a computed value is a new entity, not inherited from any so

 `dj.U` promotes attributes to the primary key for grouping/aggregation purposes, but the semantic identity of the attributes remains unchanged.
# Raises error if both expressions have the same secondary attribute
raise DataJointError(
    "Cannot join query expressions on dependent attribute `%s`" % attr
)
```
+
+**New behavior**: Any homologous attributes can participate in joins, regardless of primary/secondary status. The only requirement is matching lineage.
+
+**Rationale**: The original restriction was a heuristic to prevent accidental joins on coincidentally-named attributes. With proper lineage tracking, this heuristic is no longer needed - lineage provides the authoritative answer.
+
+**Performance warning**: Consider warning when joining on attributes that lack indexes in one or both tables:
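
The spec's own example for this warning is not reproduced on this page. As a rough illustration only, such a check might look like the sketch below, which assumes each operand exposes the set of attribute names covered by an index. `warn_unindexed` and its parameters are hypothetical, not part of DataJoint.

```python
import warnings


def warn_unindexed(join_attrs, indexed_a, indexed_b):
    """Warn about join attributes that lack an index in either operand.

    indexed_a and indexed_b are sets of indexed attribute names (a
    simplification: real indexes are ordered, multi-column structures).
    Returns the list of unindexed join attributes.
    """
    unindexed = [
        name for name in join_attrs
        if name not in indexed_a or name not in indexed_b
    ]
    if unindexed:
        warnings.warn(
            "Joining on unindexed attributes may be slow: %s"
            % ", ".join(unindexed),
            stacklevel=2,
        )
    return unindexed
```

A warning rather than an error keeps the join valid, in line with the new behavior: lineage decides correctness, and indexing only affects performance.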