Commit ccb611b

Add specification for semantic matching joins
Analyzes approaches to implementing DataJoint 2.0 semantic matching rules for joins, where attributes are matched only when they share both the same name AND the same lineage through foreign key chains. Recommends adding a lineage tuple to the Attribute namedtuple, with detailed implementation phases and open questions for discussion.
1 parent 63ebc38 commit ccb611b

1 file changed: +321 -0 lines changed

docs/SPEC-semantic-matching.md

Lines changed: 321 additions & 0 deletions
@@ -0,0 +1,321 @@
1+
# Semantic Matching for Joins - Specification
2+
3+
## Overview
4+
5+
This document analyzes approaches to implementing "semantic matching" for joins in DataJoint 2.0, replacing the current name-based matching rules.
6+
7+
## Problem Statement
8+
9+
### Current Behavior
10+
11+
The current join implementation (`expression.py:318`) matches attributes purely by name:
12+
13+
```python
join_attributes = set(n for n in self.heading.names if n in other.heading.names)
```
16+
17+
This is essentially a SQL `NATURAL JOIN` - any attributes with the same name in both tables are used for matching. The only constraint is the "join compatibility" check (`condition.py:104-136`) which prevents joining on secondary attributes that appear in both tables.
18+
19+
### Target Behavior: Semantic Matching
20+
21+
DataJoint 2.0 introduces **semantic matching** where two attributes are matched only when they satisfy **both** conditions:
22+
23+
1. **Same name** in both tables
24+
2. **Same lineage** - traced to the same original definition through an uninterrupted chain of foreign keys
25+
26+
This prevents accidental joins on attributes that happen to share the same name but represent different entities.
27+
28+
#### Example: Name Collision
29+
30+
Consider two tables:
31+
- `Student(student_id, name)` - where `name` is the student's name
32+
- `Course(course_id, name)` - where `name` is the course title
33+
34+
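With current behavior: `Student * Course` would attempt to join on `name`. Because `name` is a secondary attribute in both tables, the join compatibility check described above rejects the join with an error; without that check, the result would pair students with courses that happen to share a name, which is meaningless.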
35+
36+
With semantic matching: The `name` attributes have different lineages (one originates in `Student`, the other in `Course`), so they would **not** be matched. Silently ignoring the mismatch would turn the join into a Cartesian product; more likely, an error would be raised about the incompatible namesake attributes.
37+
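The difference between the two matching rules can be sketched in a few lines. This is only an illustration: the `Attr` namedtuple, the `"school"` schema name, and the dictionaries standing in for headings are assumptions made for the example, not part of the existing code base.

```python
from collections import namedtuple

# Hypothetical stand-in for an attribute: a name plus the
# (schema, table, attribute) triple where it was first defined.
Attr = namedtuple("Attr", ["name", "lineage"])

student = {
    "student_id": Attr("student_id", ("school", "Student", "student_id")),
    "name": Attr("name", ("school", "Student", "name")),
}
course = {
    "course_id": Attr("course_id", ("school", "Course", "course_id")),
    "name": Attr("name", ("school", "Course", "name")),
}

# Current rule: match on name alone.
by_name = {n for n in student if n in course}

# Semantic rule: match on name AND lineage.
by_lineage = {n for n in by_name if student[n].lineage == course[n].lineage}

print(by_name)     # {'name'}
print(by_lineage)  # set()
```

The name-only rule selects `name` as a join attribute, while the lineage-aware rule selects nothing, which is what exposes the collision.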
38+
## Key Concepts
39+
40+
### Homologous Attributes
41+
42+
Two attributes are **homologous** if they:
43+
1. Have the same name
44+
2. Trace back to the same original attribute definition through foreign key chains
45+
46+
Homologous attributes are also called **semantically matched** attributes.
47+
48+
### Attribute Lineage
49+
50+
Every attribute has a **lineage** - a reference to its original definition. Lineage is propagated through:
51+
- Foreign key references: when table B references table A, the inherited primary key attributes in B have the same lineage as in A
52+
- Query expressions: projections, joins, and other operations preserve lineage
53+
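As a rough sketch of the foreign-key rule, the helper below builds a heading for a new table and copies the key attributes of each referenced table with their lineage unchanged. The `declare` function and the `Attr` namedtuple are hypothetical, and the sketch assumes that inherited attributes always join the primary key of the referencing table, which is a simplification.

```python
from collections import namedtuple

# Hypothetical attribute record for this sketch.
Attr = namedtuple("Attr", ["name", "in_key", "lineage"])

def declare(schema, table, own_attrs, references=()):
    """Build a heading dict for a new table.

    Key attributes inherited from referenced tables keep their lineage;
    attributes defined here originate in this table.
    """
    heading = {}
    for parent in references:
        for attr in parent.values():
            if attr.in_key:
                heading[attr.name] = attr  # lineage copied unchanged
    for name, in_key in own_attrs:
        heading[name] = Attr(name, in_key, (schema, table, name))
    return heading

subject = declare("lab", "Subject", [("subject_id", True), ("species", False)])
session = declare("lab", "Session", [("session_idx", True)], references=[subject])

print(session["subject_id"].lineage)   # ('lab', 'Subject', 'subject_id')
print(session["session_idx"].lineage)  # ('lab', 'Session', 'session_idx')
```

Because `Session.subject_id` carries the same lineage as `Subject.subject_id`, the two attributes would be homologous under the definition above.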
54+
### Join Compatibility Rules
55+
56+
For a join `A * B` to be valid:
57+
1. All namesake attributes (same name in both) must be homologous
58+
2. Homologous attributes must be in the primary key of at least one operand
59+
60+
If namesake attributes exist that are **not** homologous, an error should be raised (collision of non-homologous namesakes).
61+
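The two rules and the collision error could be checked along these lines. This is a minimal sketch under stated assumptions: `Attr`, `JoinCompatibilityError`, and `check_join` are names invented for the example, and headings are modeled as plain dictionaries.

```python
from collections import namedtuple

# Hypothetical attribute record for this sketch.
Attr = namedtuple("Attr", ["name", "in_key", "lineage"])

class JoinCompatibilityError(Exception):
    """Hypothetical error type used only in this sketch."""

def check_join(a, b):
    """Return the homologous attribute names of headings a and b, or raise."""
    namesakes = [n for n in a if n in b]
    for n in namesakes:
        # Rule 1: every namesake must be homologous.
        if a[n].lineage != b[n].lineage:
            raise JoinCompatibilityError(f"non-homologous namesake: {n!r}")
        # Rule 2: it must be in the primary key of at least one operand.
        if not (a[n].in_key or b[n].in_key):
            raise JoinCompatibilityError(f"{n!r} is not in either primary key")
    return set(namesakes)

sid = ("lab", "Subject", "subject_id")
subject = {"subject_id": Attr("subject_id", True, sid)}
session = {
    "subject_id": Attr("subject_id", True, sid),
    "session_idx": Attr("session_idx", True, ("lab", "Session", "session_idx")),
}
print(check_join(subject, session))  # {'subject_id'}
```

Two headings whose `name` attributes originate in different tables would fail the first rule and raise instead of returning a set.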
62+
## Current Implementation Analysis
63+
64+
### Attribute Representation (`heading.py:48`)
65+
66+
```python
class Attribute(namedtuple("_Attribute", default_attribute_properties)):
    ...  # class body omitted in this excerpt
```
69+
70+
Current properties:
71+
- `name`, `type`, `in_key`, `nullable`, `default`, `comment`
72+
- `database`, `attribute_expression`
73+
- Various type-specific flags
74+
75+
**Missing**: No lineage/origin tracking.
76+
77+
### Join Logic (`expression.py:302-350`)
78+
79+
```python
def join(self, other, semantic_check=True, left=False):
    # ...
    join_attributes = set(n for n in self.heading.names if n in other.heading.names)
```
84+
85+
The current logic:
86+
1. Finds common attribute names
87+
2. Checks join compatibility (no common secondary attributes)
88+
3. Creates subqueries if needed for derived attributes
89+
4. Combines headings and restrictions
90+
91+
### Heading Join (`heading.py:482-504`)
92+
93+
Combines two headings by unioning primary keys and merging secondary attributes.
94+
95+
### Foreign Key Processing (`declare.py:154-225`)
96+
97+
When processing `-> TableRef`:
98+
- Copies primary key attributes from referenced table
99+
- Creates SQL FOREIGN KEY constraint
100+
- No lineage information is preserved
101+
102+
## Implementation Approaches
103+
104+
### Approach 1: Add Lineage to Attribute Namedtuple
105+
106+
**Add a `lineage` field to `Attribute`** that identifies the origin of each attribute.
107+
108+
#### Lineage Representation Options
109+
110+
**Option 1A: Tuple-based lineage**
111+
```python
# (schema_name, table_name, attribute_name)
lineage = ("lab", "Subject", "subject_id")
```
115+
116+
**Option 1B: String-based lineage**
117+
```python
lineage = "lab.Subject.subject_id"
```
120+
121+
**Option 1C: Hash-based lineage**
122+
```python
# SHA256 hash of canonical identifier
lineage = "a3f2c1..."
```
126+
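To compare the three representations side by side, the sketch below derives each one from the same origin. It assumes that schema, table, and attribute names contain no dots, so the dotted string can be split back into the tuple; the variable names are chosen for the example only.

```python
import hashlib

# Option 1A: tuple of (schema_name, table_name, attribute_name)
lineage_tuple = ("lab", "Subject", "subject_id")

# Option 1B: the same identifier as a dotted string
lineage_str = ".".join(lineage_tuple)

# Option 1C: SHA-256 hex digest of the dotted string
lineage_hash = hashlib.sha256(lineage_str.encode("utf-8")).hexdigest()

print(lineage_str)        # lab.Subject.subject_id
print(len(lineage_hash))  # 64
```

The tuple and the string remain human-readable and can be converted into each other, whereas the hash has a fixed length but cannot be inspected without a lookup table.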
127+
#### Pros
128+
- Clean, self-contained representation
129+
- Easy to compare (simple equality check)
130+
- Serializable for debugging
131+
132+
#### Cons
133+
- Requires modifying the core `Attribute` type
134+
- All code that creates Attributes must be updated
135+
- Migration complexity for existing code
136+
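A minimal sketch of Approach 1, assuming a trimmed-down `default_attribute_properties` dictionary (the real one has more entries) with a new `lineage` key that defaults to `None`:

```python
from collections import namedtuple

# Hypothetical, trimmed-down property table for illustration only.
default_attribute_properties = dict(
    name=None, type="expression", in_key=False, nullable=False,
    default=None, comment="", lineage=None,
)

class Attribute(namedtuple("_Attribute", default_attribute_properties)):
    """Attribute carrying an Option 1A lineage tuple."""

subject_id = Attribute(**dict(
    default_attribute_properties,
    name="subject_id", type="int", in_key=True,
    lineage=("lab", "Subject", "subject_id"),
))

# namedtuples are immutable; _replace returns a copy with lineage intact.
renamed = subject_id._replace(name="animal_id")
print(renamed.lineage == subject_id.lineage)  # True
```

Because the field has a default in the property table, creation sites that build attributes from `default_attribute_properties` would pick it up without further changes; sites that list fields explicitly would still need updating.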
137+
### Approach 2: Separate Lineage Registry
138+
139+
**Maintain a separate mapping from attribute names to lineage** in the Heading class.
140+
141+
```python
class Heading:
    def __init__(self):
        self.attributes = {}  # name -> Attribute
        self.lineage = {}     # name -> origin_identifier
```
147+
148+
#### Pros
149+
- Less invasive change to Attribute namedtuple
150+
- Can be added incrementally
151+
152+
#### Cons
153+
- Two data structures to keep in sync
154+
- Potential for inconsistency
155+
156+
### Approach 3: Graph-Based Lineage Tracking
157+
158+
**Build a schema graph** that tracks foreign key relationships, then compute lineage dynamically.
159+
160+
```python
class SchemaGraph:
    def __init__(self):
        self.edges = []  # [(from_table, to_table, attrs)]

    def get_lineage(self, table, attribute):
        # Traverse graph to find original definition
        pass
```
169+
170+
#### Pros
171+
- Single source of truth (the actual schema)
172+
- Dynamic computation avoids stale data
173+
174+
#### Cons
175+
- Higher runtime cost for lineage queries
176+
- More complex implementation
177+
- Requires access to full schema during queries
178+
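One way the traversal in Approach 3 might look is sketched below. It assumes that foreign keys do not rename attributes, that the graph is acyclic, and that an attribute is inherited from at most one parent; `add_foreign_key` is a hypothetical helper added for the example.

```python
class SchemaGraph:
    def __init__(self):
        self.edges = []  # [(from_table, to_table, attrs)]

    def add_foreign_key(self, from_table, to_table, attrs):
        self.edges.append((from_table, to_table, tuple(attrs)))

    def get_lineage(self, table, attribute):
        """Follow foreign keys until the attribute is no longer inherited."""
        while True:
            parent = next(
                (to for frm, to, attrs in self.edges
                 if frm == table and attribute in attrs),
                None,
            )
            if parent is None:
                return (table, attribute)
            table = parent

graph = SchemaGraph()
graph.add_foreign_key("Session", "Subject", ["subject_id"])
graph.add_foreign_key("Scan", "Session", ["subject_id", "session_idx"])

print(graph.get_lineage("Scan", "subject_id"))   # ('Subject', 'subject_id')
print(graph.get_lineage("Scan", "session_idx"))  # ('Session', 'session_idx')
```

Each call scans the edge list once per hop, which illustrates the runtime cost noted under Cons; an index keyed by table would reduce it.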
179+
## Recommended Approach
180+
181+
**Approach 1A (Tuple-based lineage in Attribute)** is recommended because:
182+
183+
1. **Simplicity**: Direct storage of lineage avoids graph traversal at query time
184+
2. **Immutability**: Once an attribute's lineage is set, it doesn't change
185+
3. **Explicit**: Makes lineage a first-class concept in the data model
186+
4. **Debuggability**: Easy to inspect and understand
187+
188+
## Implementation Plan
189+
190+
### Phase 1: Add Lineage to Attribute
191+
192+
1. Add `lineage` field to `default_attribute_properties` in `heading.py`
193+
2. Update `Attribute` namedtuple (automatically from `default_attribute_properties`)
194+
3. Add `lineage` parameter to all Attribute creation sites
195+
196+
### Phase 2: Populate Lineage During Table Declaration
197+
198+
1. Modify `compile_foreign_key` in `declare.py` to preserve lineage when copying attributes from referenced tables
199+
2. For non-FK attributes, set lineage to `(current_schema, current_table, attr_name)`
200+
3. Store lineage in heading metadata (potentially in attribute comments or a separate metadata table)
201+
202+
### Phase 3: Propagate Lineage Through Query Operations
203+
204+
Update these methods to preserve lineage:
205+
206+
1. **`Heading.join()`**: Lineage already determined; just verify homology
207+
2. **`Heading.project()`**: Preserve lineage for copied attributes; set new lineage for computed attributes
208+
3. **`Heading.set_primary_key()`**: Preserve existing lineage
209+
4. **`Heading.make_subquery_heading()`**: Preserve existing lineage
210+
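For projection, the intended behavior might look like the following sketch. The `project` function and the `Attr` record are hypothetical; the rule that a named value matching an existing attribute is a rename and anything else is a computed expression is an assumption of the example.

```python
from collections import namedtuple

# Hypothetical attribute record for this sketch.
Attr = namedtuple("Attr", ["name", "lineage", "expression"])

def project(heading, *keep, **named):
    """Copy kept attributes, rename some, and add computed ones."""
    result = {n: heading[n] for n in keep}
    for new_name, source in named.items():
        if source in heading:
            # Rename: same lineage, new name.
            result[new_name] = heading[source]._replace(name=new_name)
        else:
            # Computed attribute: no lineage.
            result[new_name] = Attr(new_name, None, source)
    return result

heading = {
    "price": Attr("price", ("shop", "Item", "price"), None),
    "quantity": Attr("quantity", ("shop", "Order", "quantity"), None),
}
out = project(heading, "quantity", cost="price", total="price * quantity")

print(out["cost"].lineage)   # ('shop', 'Item', 'price')
print(out["total"].lineage)  # None
```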
211+
### Phase 4: Implement Semantic Join Matching
212+
213+
1. Modify `join()` in `expression.py`:
214+
```python
def join(self, other, semantic_check=True, left=False):
    if semantic_check:
        self._check_semantic_compatibility(other)

    # Find homologous attributes (same name AND same lineage)
    join_attributes = set(
        n for n in self.heading.names
        if n in other.heading.names
        and self.heading.get_lineage(n) == other.heading.get_lineage(n)
    )
    # ...
```
227+
228+
2. Modify `assert_join_compatibility()` in `condition.py`:
229+
- Check for namesake collisions (same name, different lineage)
230+
- Check that homologous attributes are in primary key
231+
232+
### Phase 5: Error Handling and Migration
233+
234+
1. Raise clear errors for:
235+
- Namesake collision (same name, different lineage)
236+
- Joining on non-primary-key homologous attributes
237+
238+
2. Provide resolution guidance:
239+
- Use projection to rename colliding attributes
240+
- Use the permissive join operator `@` to bypass checks
241+
242+
3. Migration path for existing code:
243+
- Backward compatibility mode?
244+
- Deprecation warnings?
245+
246+
## Open Questions
247+
248+
### Q1: How to store lineage in the database?
249+
250+
**Options:**
251+
- A. Encode in attribute comments (JSON suffix)
252+
- B. Separate metadata table per schema
253+
- C. Compute from foreign key constraints at runtime
254+
255+
**Trade-offs**: Option A is the simplest but takes space away from the comment text. Option B is cleaner but adds a table to every schema. Option C needs no stored metadata but is slower, because lineage must be recomputed at runtime.
256+
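Option A could be sketched as a pair of helpers that append a JSON payload to the comment and split it off again. The `MARKER` delimiter is invented for this example and is not an existing DataJoint convention.

```python
import json

MARKER = " :lineage:"  # hypothetical delimiter chosen for this sketch

def encode_comment(comment, lineage):
    """Append the lineage tuple to the comment as a JSON suffix."""
    return comment + MARKER + json.dumps(list(lineage))

def decode_comment(stored):
    """Split a stored comment into (comment, lineage or None)."""
    comment, sep, payload = stored.rpartition(MARKER)
    if not sep:
        return stored, None
    return comment, tuple(json.loads(payload))

stored = encode_comment("unique subject id", ("lab", "Subject", "subject_id"))
print(decode_comment(stored))
# ('unique subject id', ('lab', 'Subject', 'subject_id'))
```

Comments written before the change decode with a lineage of `None`, which gives the reader of the heading a way to detect tables that have not been migrated.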
257+
### Q2: What happens with renamed attributes?
258+
259+
When an attribute is renamed via projection:
260+
```python
table.proj(new_name='old_name')
```
263+
264+
Should the lineage remain the same (pointing to `old_name`'s origin) or become new?
265+
266+
**Recommendation**: Renamed attributes should keep their original lineage. The rename is cosmetic; the semantic identity remains.
267+
268+
### Q3: What about computed attributes?
269+
270+
For computed attributes like:
271+
```python
table.proj(total='price * quantity')
```
274+
275+
**Recommendation**: Computed attributes have no lineage (or a special "computed" lineage). They cannot participate in semantic matching.
276+
277+
### Q4: How does this interact with `dj.U` (universal set)?
278+
279+
The `U` class modifies which attributes are treated as primary key.
280+
281+
**Recommendation**: `U` should not affect lineage - it only affects the primary key membership check, not semantic matching.
282+
283+
### Q5: Backward compatibility?
284+
285+
Should there be a migration path for existing pipelines?
286+
287+
**Options:**
288+
- A. Breaking change - require updates to all pipelines
289+
- B. Deprecation period with warnings
290+
- C. Configuration flag to switch between old/new behavior
291+
- D. Default to permissive join (`@`) semantics when lineage is unknown
292+
293+
**Recommendation**: Option C or D for transition period.
294+
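Options C and D could be combined roughly as follows. In this sketch headings are dictionaries mapping attribute names to lineage tuples (or `None` when lineage is unknown); the `semantic` parameter stands in for a configuration flag, and `matching_attributes` is a hypothetical name.

```python
def matching_attributes(a, b, semantic=True):
    """Return the names to join on for headings a and b."""
    namesakes = [n for n in a if n in b]
    if not semantic:
        # Option C: flag switched off, legacy name-only matching.
        return set(namesakes)
    matched = set()
    for n in namesakes:
        if a[n] is None or b[n] is None:
            # Option D: unknown lineage falls back to name matching.
            matched.add(n)
        elif a[n] == b[n]:
            matched.add(n)
        else:
            raise ValueError(f"namesake collision on {n!r}")
    return matched

student = {"name": ("school", "Student", "name")}
course = {"name": ("school", "Course", "name")}
legacy = {"name": None}

print(matching_attributes(student, course, semantic=False))  # {'name'}
print(matching_attributes(student, legacy))                  # {'name'}
```

With the flag on and both lineages known, the same `student` and `course` headings raise instead of matching.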
295+
## Testing Strategy
296+
297+
1. **Unit tests** for lineage propagation through all query operations
298+
2. **Integration tests** for join behavior with:
299+
- Tables with foreign key relationships (should join)
300+
- Tables with coincidentally same-named attributes (should error)
301+
- Renamed attributes (should preserve lineage)
302+
- Computed attributes (should have no lineage)
303+
3. **Backward compatibility tests** for existing pipelines
304+
305+
## Performance Considerations
306+
307+
1. **Memory**: Additional field per attribute (minimal impact)
308+
2. **Comparison**: Lineage comparison is an equality check on a fixed-size, three-element tuple, so its cost does not grow with the size of the schema
309+
3. **Storage**: If stored in database, small overhead per attribute
310+
311+
## Summary
312+
313+
Semantic matching is a significant change to DataJoint's join semantics that improves correctness by preventing accidental joins on coincidentally-named attributes. The recommended implementation adds a `lineage` tuple to each `Attribute`, populated during table declaration and preserved through query operations.
314+
315+
Key files to modify:
316+
- `datajoint/heading.py` - Add lineage to Attribute
317+
- `datajoint/declare.py` - Populate lineage during FK processing
318+
- `datajoint/expression.py` - Use lineage in join logic
319+
- `datajoint/condition.py` - Update compatibility checks
320+
321+
This is a breaking change that will require a migration strategy for existing pipelines.
