
Commit 50e18ea

Add fallback lineage computation and deprecate secondary attr restriction
Key updates to the semantic matching specification:

1. D1 updated: ~lineage table with in-memory fallback
   - Computes lineage from FK graph when ~lineage table doesn't exist
   - Uses existing Dependencies class (already loads FK info)
   - Works with databases not created by DataJoint
2. D5 added: Deprecate secondary attribute join restriction
   - Remove assert_join_compatibility's secondary attr check
   - Homologous attributes can join regardless of PK status
   - Optional warning for unindexed join attributes
3. Dependencies must be loaded before joins
   - Pattern already used for delete, drop, populate
   - Enables in-memory lineage computation for any database
1 parent 1b8464a commit 50e18ea


1 file changed: +97 -13 lines changed

docs/SPEC-semantic-matching.md

Lines changed: 97 additions & 13 deletions
@@ -54,11 +54,12 @@ Every attribute has a **lineage** - a reference to its original definition. Line
5454
### Join Compatibility Rules
5555

5656
For a join `A * B` to be valid:
57-
1. All namesake attributes (same name in both) must be homologous
58-
2. Homologous attributes must be in the primary key of at least one operand
57+
1. All namesake attributes (same name in both) must be homologous (same lineage)
5958

6059
If namesake attributes exist that are **not** homologous, an error should be raised (collision of non-homologous namesakes).
6160

61+
**Note**: The current restriction that joins cannot be done on secondary attributes is **deprecated**. As long as attributes are homologous, they can participate in joins regardless of primary/secondary status. A warning may be raised for joins on unindexed attributes (performance consideration).
62+
6263
## Current Implementation Analysis
6364

6465
### Attribute Representation (`heading.py:48`)
@@ -228,7 +229,15 @@ Update these methods to preserve lineage:
228229

229230
### Phase 4: Implement Semantic Join Matching
230231

231-
1. Modify `join()` in `expression.py`:
232+
1. **Ensure dependencies are loaded** before join operations:
233+
```python
def join(self, other, semantic_check=True, left=False):
    # Dependencies must be loaded for lineage computation
    self.connection.dependencies.load(force=False)
    # ...
```
239+
240+
2. **Modify `join()` in `expression.py`**:
232241
```python
233242
def join(self, other, semantic_check=True, left=False):
234243
if semantic_check:
@@ -243,9 +252,10 @@ Update these methods to preserve lineage:
243252
# ...
244253
```
245254

246-
2. Modify `assert_join_compatibility()` in `condition.py`:
255+
3. **Modify `assert_join_compatibility()` in `condition.py`**:
256+
- **Remove** the secondary attribute restriction (deprecated)
247257
- Check for namesake collisions (same name, different lineage)
248-
- Check that homologous attributes are in primary key
258+
- Optionally warn about unindexed join attributes
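
The revised check might look roughly like the sketch below. It is not DataJoint's actual code: it assumes each operand can be reduced to a plain dict mapping attribute name to lineage (a `(schema, table, attribute)` tuple, or `None` for computed attributes), and `JoinCompatibilityError` is a hypothetical stand-in for `DataJointError`. The optional index warning is left out here.

```python
class JoinCompatibilityError(Exception):
    """Hypothetical stand-in for DataJointError in this sketch."""


def assert_join_compatibility(lineage_a, lineage_b):
    """Raise on non-homologous namesakes; return the attributes to join on.

    Both arguments are assumed to be dicts: attribute name -> lineage tuple
    (schema, table, attribute), or None for computed attributes.
    """
    namesakes = sorted(set(lineage_a) & set(lineage_b))
    for name in namesakes:
        a, b = lineage_a[name], lineage_b[name]
        # A lineage of None never matches anything, including another None.
        if a is None or b is None or a != b:
            raise JoinCompatibilityError(
                f"Cannot join on attribute `{name}`: lineages differ ({a} vs {b})"
            )
    # No primary/secondary check: homologous namesakes may always join.
    return namesakes


subject = ("lab", "subject", "subject_id")
session = {"subject_id": subject, "session_date": ("lab", "session", "session_date")}
scan = {"subject_id": subject, "scan_idx": ("lab", "scan", "scan_idx")}
join_attrs = assert_join_compatibility(session, scan)
```

Here the two operands share only `subject_id`, and both copies trace back to the same origin, so the join is accepted whether or not `subject_id` is part of either primary key.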
249259

250260
### Phase 5: Error Handling
251261

@@ -275,15 +285,55 @@ Update these methods to preserve lineage:
275285

276286
## Design Decisions
277287

278-
### D1: Lineage Storage - Hidden Metadata Table
288+
### D1: Lineage Storage - Hidden Metadata Table with Fallback
279289

280-
**Decision**: Use a hidden metadata table (`~lineage`) per schema.
290+
**Decision**: Use a hidden metadata table (`~lineage`) per schema, with an **in-memory fallback** when the table doesn't exist.
281291

282292
This approach:
283293
- Works consistently for both **MySQL** and **PostgreSQL**
284294
- Provides explicit, queryable lineage data
285295
- Follows the existing pattern for hidden tables (e.g., `~external_*`, `~log`)
286296
- Easier to migrate existing schemas
297+
- **Works with databases not created by DataJoint** via fallback computation
298+
299+
#### Fallback: Compute Lineage from Dependencies
300+
301+
When the `~lineage` table does not exist (e.g., external databases, legacy schemas), lineage is computed **in-memory** from the FK graph using the existing `Dependencies` class:
302+
303+
```python
def compute_lineage_from_dependencies(connection, table_name, attribute_name, schema):
    """
    Compute lineage by traversing the FK graph.
    Uses connection.dependencies which already loads FK info from INFORMATION_SCHEMA.
    `schema` is the name of the schema that contains `table_name`.
    """
    connection.dependencies.load(force=False)  # ensure dependencies are loaded

    full_table_name = f"`{schema}`.`{table_name}`"

    # Check incoming edges (foreign keys from this table to its parents)
    for parent, props in connection.dependencies.parents(full_table_name).items():
        attr_map = props.get('attr_map', {})
        if attribute_name in attr_map:
            # This attribute is inherited from parent
            parent_attr = attr_map[attribute_name]
            # parse_full_table_name: helper (not shown) that splits "`schema`.`table`"
            parent_schema, parent_table = parse_full_table_name(parent)
            # Recurse to find ultimate origin
            return compute_lineage_from_dependencies(
                connection, parent_table, parent_attr, parent_schema
            )

    # Not inherited - this table is the origin
    return (schema, table_name, attribute_name)
```
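
The traversal can be tried without a database connection. The sketch below is an illustration only: it replaces `connection.dependencies` with a plain dict (`fk_graph`, a hypothetical structure mapping each child table to its parents and their attribute maps) and assumes the FK graph is acyclic, so the recursion terminates. Table and attribute names are made up.

```python
def compute_lineage(fk_graph, table, attribute):
    """Follow foreign keys upward until the attribute's origin is reached.

    fk_graph: dict mapping child table -> {parent table: attr_map}, where
    attr_map maps the child's attribute name to the parent's attribute name.
    """
    for parent, attr_map in fk_graph.get(table, {}).items():
        if attribute in attr_map:
            # Inherited through this foreign key: continue from the parent.
            return compute_lineage(fk_graph, parent, attr_map[attribute])
    # No foreign key carries this attribute, so this table defines it.
    return (table, attribute)


fk_graph = {
    "`lab`.`session`": {"`lab`.`subject`": {"subject_id": "subject_id"}},
    "`lab`.`scan`": {
        # `animal` is a renamed reference to session.subject_id
        "`lab`.`session`": {"animal": "subject_id", "session_date": "session_date"}
    },
}

renamed = compute_lineage(fk_graph, "`lab`.`scan`", "animal")
inherited = compute_lineage(fk_graph, "`lab`.`scan`", "session_date")
local = compute_lineage(fk_graph, "`lab`.`scan`", "scan_idx")
```

The renamed attribute `animal` resolves through two foreign keys to `subject_id` in the subject table, which matches the rule that renaming preserves the original lineage.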
328+
329+
#### Integration with Dependencies Loading
330+
331+
**Dependencies must be loaded before joins.** This is already the pattern for operations like `delete`, `drop`, and `populate`. The join operation will:
332+
333+
1. Ensure `connection.dependencies.load(force=False)` is called
334+
2. Check if `~lineage` table exists for involved schemas
335+
3. If it exists: read lineage from the table (fast)
335+
4. If it does not exist: compute lineage from the FK graph (slower, but works for any database)
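
The lookup order in the steps above can be sketched as follows. All names here are hypothetical: `stored_lineage` stands in for rows already read from each schema's `~lineage` table, and `compute_fallback` stands in for the FK-graph traversal. Falling back when a table exists but lacks a row for the attribute is an assumption of this sketch, not something the spec states.

```python
def resolve_lineage(stored_lineage, compute_fallback, schema, table, attribute):
    """Prefer stored lineage; otherwise compute it from the FK graph.

    stored_lineage: dict keyed by schema name, each value a dict of
        {(table, attribute): lineage}. A schema without a ~lineage table
        is simply absent from the dict.
    compute_fallback: callable(schema, table, attribute) -> lineage.
    """
    rows = stored_lineage.get(schema)
    if rows is not None and (table, attribute) in rows:
        return rows[(table, attribute)]  # fast path: stored lineage
    # Slow path: no ~lineage table (or no row), so derive lineage in memory.
    return compute_fallback(schema, table, attribute)


fallback_calls = []


def fk_fallback(schema, table, attribute):
    """Toy fallback that records its calls and treats the table as the origin."""
    fallback_calls.append((schema, table, attribute))
    return (schema, table, attribute)


stored = {"lab": {("session", "subject_id"): ("lab", "subject", "subject_id")}}
from_table = resolve_lineage(stored, fk_fallback, "lab", "session", "subject_id")
from_graph = resolve_lineage(stored, fk_fallback, "ext", "orders", "order_id")
```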
287337

288338
#### Table Structure
289339

@@ -364,7 +414,33 @@ This is intentional: a computed value is a new entity, not inherited from any so
364414

365415
`dj.U` promotes attributes to the primary key for grouping/aggregation purposes, but the semantic identity of the attributes remains unchanged.
366416

367-
### D5: Migration via Utility Function
417+
### D5: Deprecate Secondary Attribute Join Restriction
418+
419+
**Decision**: Remove the current restriction that prevents joining on secondary attributes.
420+
421+
**Current behavior** (`condition.py:assert_join_compatibility`):
422+
```python
# Raises error if both expressions have the same secondary attribute
raise DataJointError(
    "Cannot join query expressions on dependent attribute `%s`" % attr
)
```
428+
429+
**New behavior**: Any homologous attributes can participate in joins, regardless of primary/secondary status. The only requirement is matching lineage.
430+
431+
**Rationale**: The original restriction was a heuristic to prevent accidental joins on coincidentally named attributes. With proper lineage tracking, this heuristic is no longer needed: lineage provides the authoritative answer.
432+
433+
**Performance warning**: Consider warning when joining on attributes that lack indexes in one or both tables:
434+
```python
# Optional warning for unindexed join attributes
# (has_index and PerformanceWarning are not defined in this spec yet)
if not has_index(table1, attr) or not has_index(table2, attr):
    warnings.warn(
        f"Join on '{attr}' may be slow: attribute is not indexed in one or both tables",
        PerformanceWarning
    )
```
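
Since neither `PerformanceWarning` nor `has_index` is defined in this diff, a runnable version needs stand-ins for both. The sketch below assumes a custom warning category and represents each operand's indexed attributes as a plain set; both are illustrative choices, not existing DataJoint APIs.

```python
import warnings


class PerformanceWarning(UserWarning):
    """Hypothetical warning category; Python has no built-in of this name."""


def warn_if_unindexed(attr, indexed_left, indexed_right):
    """Warn when a join attribute lacks an index in either operand.

    indexed_left / indexed_right: sets of attribute names covered by an index.
    Returns True when the attribute is indexed on both sides.
    """
    sides = (("left", indexed_left), ("right", indexed_right))
    missing = [side for side, indexed in sides if attr not in indexed]
    if missing:
        warnings.warn(
            f"Join on '{attr}' may be slow: no index on the "
            f"{' and '.join(missing)} operand(s)",
            PerformanceWarning,
            stacklevel=2,
        )
    return not missing
```

Using a dedicated category lets callers silence or escalate these messages with the standard `warnings` filters without affecting other warnings.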
442+
443+
### D6: Migration via Utility Function
368444

369445
**Decision**: Provide a migration utility that computes the `~lineage` table from existing schema.
370446

@@ -463,11 +539,19 @@ Semantic matching is a significant change to DataJoint's join semantics that imp
463539

464540
| Decision | Choice |
465541
|----------|--------|
466-
| Lineage storage | Hidden `~lineage` metadata table (MySQL + PostgreSQL compatible) |
467-
| Renamed attributes | Preserve original lineage |
468-
| Computed attributes | Lineage = `None` (breaks matching) |
469-
| `dj.U` interaction | Does not affect lineage |
470-
| Migration | Utility function that computes lineage from FK graph |
542+
| **D1**: Lineage storage | Hidden `~lineage` table + in-memory fallback from FK graph |
543+
| **D2**: Renamed attributes | Preserve original lineage |
544+
| **D3**: Computed attributes | Lineage = `None` (breaks matching) |
545+
| **D4**: `dj.U` interaction | Does not affect lineage |
546+
| **D5**: Secondary attr restriction | **Deprecated** - homologous attrs can join regardless of PK status |
547+
| **D6**: Migration | Utility function + automatic fallback computation |
548+
549+
### Compatibility
550+
551+
- **MySQL**: Fully supported (INFORMATION_SCHEMA for FK analysis)
552+
- **PostgreSQL**: Fully supported (pg_constraint/pg_attribute for FK analysis)
553+
- **External databases**: Works via in-memory lineage computation from FK graph
554+
- **Legacy DataJoint schemas**: Works via migration utility or automatic fallback
471555

472556
### Files to Create
473557
