| 1 | +# Semantic Matching for Joins - Specification |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +This document analyzes approaches to implementing "semantic matching" for joins in DataJoint 2.0, replacing the current name-based matching rules. |
| 6 | + |
| 7 | +## Problem Statement |
| 8 | + |
| 9 | +### Current Behavior |
| 10 | + |
| 11 | +The current join implementation (`expression.py:318`) matches attributes purely by name: |
| 12 | + |
| 13 | +```python |
| 14 | +join_attributes = set(n for n in self.heading.names if n in other.heading.names) |
| 15 | +``` |
| 16 | + |
| 17 | +This is essentially a SQL `NATURAL JOIN`: any attributes with the same name in both tables are used for matching. The only constraint is the "join compatibility" check (`condition.py:104-136`), which prevents joining on secondary attributes that appear in both tables. |
| 18 | + |
| 19 | +### Target Behavior: Semantic Matching |
| 20 | + |
| 21 | +DataJoint 2.0 introduces **semantic matching** where two attributes are matched only when they satisfy **both** conditions: |
| 22 | + |
| 23 | +1. **Same name** in both tables |
| 24 | +2. **Same lineage** - traced to the same original definition through an uninterrupted chain of foreign keys |
| 25 | + |
| 26 | +This prevents accidental joins on attributes that happen to share the same name but represent different entities. |
| 27 | + |
| 28 | +#### Example: Name Collision |
| 29 | + |
| 30 | +Consider two tables: |
| 31 | +- `Student(student_id, name)` - where `name` is the student's name |
| 32 | +- `Course(course_id, name)` - where `name` is the course title |
| 33 | + |
| 34 | +With current behavior: `Student * Course` would attempt to match on `name`. Assuming `name` is a secondary attribute in both tables, the join compatibility check rejects the join with an error; without that check, the result would pair students with courses that happen to share a name, which is meaningless. |
| 35 | + |
| 36 | +With semantic matching: The `name` attributes have different lineages (one originates in `Student`, the other in `Course`), so they are **not** matched. Under the join compatibility rules, these non-homologous namesakes cause an error to be raised rather than a silent fallback to a Cartesian product. |
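The difference between the two rules can be sketched in plain Python. This is an illustration only, not DataJoint code: the `Attr` type, the `namesakes` and `homologous` helpers, and the `university` schema name are all hypothetical.

```python
from typing import NamedTuple, Optional, Tuple

# Hypothetical lineage: (schema, table, attribute) of the original definition.
Lineage = Tuple[str, str, str]


class Attr(NamedTuple):
    name: str
    lineage: Optional[Lineage]


# The "university" schema name is made up for this example.
student = [
    Attr("student_id", ("university", "Student", "student_id")),
    Attr("name", ("university", "Student", "name")),
]
course = [
    Attr("course_id", ("university", "Course", "course_id")),
    Attr("name", ("university", "Course", "name")),
]


def namesakes(a, b):
    """Names present in both headings (the current name-only rule)."""
    b_names = {attr.name for attr in b}
    return {attr.name for attr in a if attr.name in b_names}


def homologous(a, b):
    """Names present in both headings whose lineage also agrees."""
    b_lineage = {attr.name: attr.lineage for attr in b}
    return {
        attr.name
        for attr in a
        if attr.name in b_lineage
        and attr.lineage is not None
        and attr.lineage == b_lineage[attr.name]
    }


name_matches = namesakes(student, course)       # name-only rule picks up "name"
semantic_matches = homologous(student, course)  # lineage differs, so nothing matches
```

The name-only rule selects `name` as a join attribute, while the lineage-aware rule selects nothing, which is what lets the join report a namesake collision instead of matching unrelated values.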
| 37 | + |
| 38 | +## Key Concepts |
| 39 | + |
| 40 | +### Homologous Attributes |
| 41 | + |
| 42 | +Two attributes are **homologous** if they: |
| 43 | +1. Have the same name |
| 44 | +2. Trace back to the same original attribute definition through foreign key chains |
| 45 | + |
| 46 | +Homologous attributes are also called **semantically matched** attributes. |
| 47 | + |
| 48 | +### Attribute Lineage |
| 49 | + |
| 50 | +Every attribute has a **lineage** - a reference to its original definition. Lineage is propagated through: |
| 51 | +- Foreign key references: when table B references table A, the inherited primary key attributes in B have the same lineage as in A |
| 52 | +- Query expressions: projections, joins, and other operations preserve lineage |
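As a rough sketch of the foreign-key rule, the hypothetical `declare` helper below builds a name-to-lineage mapping for a new table: inherited attributes keep the lineage recorded in the parent, and natively defined attributes point at the declaring table. The `Session` and `Scan` tables and the `imaging` schema are invented for illustration, and every attribute shown is assumed to be part of the primary key.

```python
def declare(schema, table, own_attrs, references=()):
    """Return {attribute_name: lineage} for a newly declared table.

    references: name -> lineage mappings of the primary keys of the
    referenced (parent) tables.
    """
    lineage = {}
    for parent_pk in references:
        # Inherited through a foreign key: keep the parent's lineage as-is.
        lineage.update(parent_pk)
    for name in own_attrs:
        # Defined here for the first time: lineage points at this table.
        lineage[name] = (schema, table, name)
    return lineage


subject = declare("lab", "Subject", ["subject_id"])
session = declare("lab", "Session", ["session_idx"], references=[subject])
scan = declare("imaging", "Scan", ["scan_idx"], references=[session])
```

Even two foreign keys away, and in a different schema, `scan["subject_id"]` still resolves to `("lab", "Subject", "subject_id")`.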
| 53 | + |
| 54 | +### Join Compatibility Rules |
| 55 | + |
| 56 | +For a join `A * B` to be valid: |
| 57 | +1. All namesake attributes (same name in both) must be homologous |
| 58 | +2. Homologous attributes must be in the primary key of at least one operand |
| 59 | + |
| 60 | +If namesake attributes exist that are **not** homologous, an error should be raised (collision of non-homologous namesakes). |
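The two rules and the collision error could be checked along these lines. This is a sketch under assumptions, not the existing `assert_join_compatibility()`: the `JoinError` class, the `check_join` function, and the dictionary layout (name mapped to a `(lineage, in_primary_key)` pair) are hypothetical.

```python
class JoinError(Exception):
    """Hypothetical error type for this sketch."""


def check_join(a, b):
    """Return the sorted join attributes of a * b, or raise JoinError.

    a, b: dicts mapping attribute name -> (lineage, in_primary_key).
    """
    join_on = []
    for name in a.keys() & b.keys():  # namesakes
        (lineage_a, in_key_a), (lineage_b, in_key_b) = a[name], b[name]
        # Rule 1: every namesake must be homologous.
        if lineage_a is None or lineage_a != lineage_b:
            raise JoinError(f"non-homologous namesake: {name!r}")
        # Rule 2: it must be in the primary key of at least one operand.
        if not (in_key_a or in_key_b):
            raise JoinError(f"{name!r} is not in the primary key of either operand")
        join_on.append(name)
    return sorted(join_on)


subject_id = ("lab", "Subject", "subject_id")
subject = {"subject_id": (subject_id, True)}
session = {
    "subject_id": (subject_id, True),
    "session_idx": (("lab", "Session", "session_idx"), True),
}
student = {"name": (("university", "Student", "name"), False)}
course = {"name": (("university", "Course", "name"), False)}

joined = check_join(subject, session)

try:
    check_join(student, course)
    collision = None
except JoinError as err:
    collision = str(err)
```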
| 61 | + |
| 62 | +## Current Implementation Analysis |
| 63 | + |
| 64 | +### Attribute Representation (`heading.py:48`) |
| 65 | + |
| 66 | +```python |
| 67 | +class Attribute(namedtuple("_Attribute", default_attribute_properties)): |
| 68 | +``` |
| 69 | + |
| 70 | +Current properties: |
| 71 | +- `name`, `type`, `in_key`, `nullable`, `default`, `comment` |
| 72 | +- `database`, `attribute_expression` |
| 73 | +- Various type-specific flags |
| 74 | + |
| 75 | +**Missing**: No lineage/origin tracking. |
| 76 | + |
| 77 | +### Join Logic (`expression.py:302-350`) |
| 78 | + |
| 79 | +```python |
| 80 | +def join(self, other, semantic_check=True, left=False): |
| 81 | +    # ... |
| 82 | +    join_attributes = set(n for n in self.heading.names if n in other.heading.names) |
| 83 | +``` |
| 84 | + |
| 85 | +The current logic: |
| 86 | +1. Finds common attribute names |
| 87 | +2. Checks join compatibility (no common secondary attributes) |
| 88 | +3. Creates subqueries if needed for derived attributes |
| 89 | +4. Combines headings and restrictions |
| 90 | + |
| 91 | +### Heading Join (`heading.py:482-504`) |
| 92 | + |
| 93 | +Combines two headings by unioning primary keys and merging secondary attributes. |
| 94 | + |
| 95 | +### Foreign Key Processing (`declare.py:154-225`) |
| 96 | + |
| 97 | +When processing `-> TableRef`: |
| 98 | +- Copies primary key attributes from referenced table |
| 99 | +- Creates SQL FOREIGN KEY constraint |
| 100 | +- No lineage information is preserved |
| 101 | + |
| 102 | +## Implementation Approaches |
| 103 | + |
| 104 | +### Approach 1: Add Lineage to Attribute Namedtuple |
| 105 | + |
| 106 | +**Add a `lineage` field to `Attribute`** that identifies the origin of each attribute. |
| 107 | + |
| 108 | +#### Lineage Representation Options |
| 109 | + |
| 110 | +**Option 1A: Tuple-based lineage** |
| 111 | +```python |
| 112 | +# (schema_name, table_name, attribute_name) |
| 113 | +lineage = ("lab", "Subject", "subject_id") |
| 114 | +``` |
| 115 | + |
| 116 | +**Option 1B: String-based lineage** |
| 117 | +```python |
| 118 | +lineage = "lab.Subject.subject_id" |
| 119 | +``` |
| 120 | + |
| 121 | +**Option 1C: Hash-based lineage** |
| 122 | +```python |
| 123 | +# SHA256 hash of canonical identifier |
| 124 | +lineage = "a3f2c1..." |
| 125 | +``` |
| 126 | + |
| 127 | +#### Pros |
| 128 | +- Clean, self-contained representation |
| 129 | +- Easy to compare (simple equality check) |
| 130 | +- Serializable for debugging |
| 131 | + |
| 132 | +#### Cons |
| 133 | +- Requires modifying the core `Attribute` type |
| 134 | +- All code that creates Attributes must be updated |
| 135 | +- Migration complexity for existing code |
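One way to soften the update burden listed above is to give the new field a default, so creation sites that do not yet know the lineage still work. The sketch below uses a trimmed-down stand-in for `default_attribute_properties` (the real one has more fields), and the `make_attribute` helper is hypothetical.

```python
from collections import namedtuple

# Trimmed-down stand-in for default_attribute_properties in heading.py.
# Only `lineage` is new; its default of None means "lineage unknown".
default_attribute_properties = dict(
    name=None,
    type="expression",
    in_key=False,
    nullable=False,
    default=None,
    comment="",
    lineage=None,  # Option 1A: (schema_name, table_name, attribute_name)
)


class Attribute(namedtuple("_Attribute", default_attribute_properties)):
    """Stand-in for the Attribute class; field names come from the dict keys."""


def make_attribute(**overrides):
    """Hypothetical helper: fill unspecified fields from the defaults."""
    return Attribute(**{**default_attribute_properties, **overrides})


parent = make_attribute(
    name="subject_id", in_key=True, lineage=("lab", "Subject", "subject_id")
)
child = make_attribute(name="subject_id", lineage=parent.lineage)
legacy = make_attribute(name="notes")  # creation site not yet updated
```

Because the lineage is a plain tuple, homology reduces to `parent.lineage == child.lineage`, and an attribute created without lineage is distinguishable by `lineage is None`.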
| 136 | + |
| 137 | +### Approach 2: Separate Lineage Registry |
| 138 | + |
| 139 | +**Maintain a separate mapping from attribute names to lineage** in the Heading class. |
| 140 | + |
| 141 | +```python |
| 142 | +class Heading: |
| 143 | +    def __init__(self): |
| 144 | +        self.attributes = {}  # name -> Attribute |
| 145 | +        self.lineage = {}  # name -> origin_identifier |
| 146 | +``` |
| 147 | + |
| 148 | +#### Pros |
| 149 | +- Less invasive change to Attribute namedtuple |
| 150 | +- Can be added incrementally |
| 151 | + |
| 152 | +#### Cons |
| 153 | +- Two data structures to keep in sync |
| 154 | +- Potential for inconsistency |
| 155 | + |
| 156 | +### Approach 3: Graph-Based Lineage Tracking |
| 157 | + |
| 158 | +**Build a schema graph** that tracks foreign key relationships, then compute lineage dynamically. |
| 159 | + |
| 160 | +```python |
| 161 | +class SchemaGraph: |
| 162 | +    def __init__(self): |
| 163 | +        self.edges = []  # [(from_table, to_table, attrs)] |
| 164 | + |
| 165 | +    def get_lineage(self, table, attribute): |
| 166 | +        # Traverse graph to find original definition |
| 167 | +        pass |
| 168 | +``` |
| 169 | + |
| 170 | +#### Pros |
| 171 | +- Single source of truth (the actual schema) |
| 172 | +- Dynamic computation avoids stale data |
| 173 | + |
| 174 | +#### Cons |
| 175 | +- Higher runtime cost for lineage queries |
| 176 | +- More complex implementation |
| 177 | +- Requires access to full schema during queries |
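To make the traversal cost concrete, here is one possible shape for `get_lineage` in Approach 3. It is a simplified sketch: edges are stored as a lookup rather than a list, foreign keys are assumed not to rename attributes, each attribute is assumed to be inherited from at most one parent, and the `add_foreign_key` method is hypothetical.

```python
class SchemaGraph:
    def __init__(self):
        # (child_table, attribute) -> parent table the attribute was inherited from
        self.inherited_from = {}

    def add_foreign_key(self, child, parent, attrs):
        """Record that `child` inherits `attrs` from `parent`."""
        for attr in attrs:
            self.inherited_from[(child, attr)] = parent

    def get_lineage(self, table, attribute):
        """Follow foreign keys upward until the attribute is no longer inherited."""
        seen = set()
        while (table, attribute) in self.inherited_from:
            if (table, attribute) in seen:
                raise ValueError("cycle in foreign key graph")
            seen.add((table, attribute))
            table = self.inherited_from[(table, attribute)]
        return (table, attribute)


graph = SchemaGraph()
graph.add_foreign_key("Session", "Subject", ["subject_id"])
graph.add_foreign_key("Scan", "Session", ["subject_id", "session_idx"])

origin = graph.get_lineage("Scan", "subject_id")
```

Each lookup walks one step per foreign key in the chain, which is the runtime cost noted in the cons; Approach 1 pays that cost once at declaration time instead.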
| 178 | + |
| 179 | +## Recommended Approach |
| 180 | + |
| 181 | +**Approach 1A (Tuple-based lineage in Attribute)** is recommended because: |
| 182 | + |
| 183 | +1. **Simplicity**: Direct storage of lineage avoids graph traversal at query time |
| 184 | +2. **Immutability**: Once an attribute's lineage is set, it doesn't change |
| 185 | +3. **Explicit**: Makes lineage a first-class concept in the data model |
| 186 | +4. **Debuggability**: Easy to inspect and understand |
| 187 | + |
| 188 | +## Implementation Plan |
| 189 | + |
| 190 | +### Phase 1: Add Lineage to Attribute |
| 191 | + |
| 192 | +1. Add `lineage` field to `default_attribute_properties` in `heading.py` |
| 193 | +2. Update `Attribute` namedtuple (automatically from `default_attribute_properties`) |
| 194 | +3. Add `lineage` parameter to all Attribute creation sites |
| 195 | + |
| 196 | +### Phase 2: Populate Lineage During Table Declaration |
| 197 | + |
| 198 | +1. Modify `compile_foreign_key` in `declare.py` to preserve lineage when copying attributes from referenced tables |
| 199 | +2. For non-FK attributes, set lineage to `(current_schema, current_table, attr_name)` |
| 200 | +3. Store lineage in heading metadata (potentially in attribute comments or a separate metadata table) |
| 201 | + |
| 202 | +### Phase 3: Propagate Lineage Through Query Operations |
| 203 | + |
| 204 | +Update these methods to preserve lineage: |
| 205 | + |
| 206 | +1. **`Heading.join()`**: Lineage already determined; just verify homology |
| 207 | +2. **`Heading.project()`**: Preserve lineage for copied attributes; set new lineage for computed attributes |
| 208 | +3. **`Heading.set_primary_key()`**: Preserve existing lineage |
| 209 | +4. **`Heading.make_subquery_heading()`**: Preserve existing lineage |
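For projection, the intended behavior can be sketched as follows, assuming the recommendations in Q2 and Q3 are adopted (renames keep lineage, computed attributes get none). The `project` function and the `shop.Order` table are hypothetical, and `None` is an assumed representation for "no lineage".

```python
def project(lineage, *keep, **named):
    """Return the name -> lineage mapping after a projection.

    lineage: name -> lineage tuple (or None) of the input heading.
    keep:    attribute names carried over unchanged.
    named:   new_name -> expression; an expression that is exactly an
             existing attribute name is a rename, anything else is computed.
    """
    result = {name: lineage[name] for name in keep}
    for new_name, expression in named.items():
        # Rename: reuse the source attribute's lineage.
        # Computed: the lookup fails, so the lineage is None.
        result[new_name] = lineage.get(expression.strip())
    return result


order = {
    "order_id": ("shop", "Order", "order_id"),
    "price": ("shop", "Order", "price"),
    "quantity": ("shop", "Order", "quantity"),
}

projected = project(order, "order_id", cost="price", total="price * quantity")
```

Here `cost` still points at `("shop", "Order", "price")`, while `total` has no lineage and therefore cannot take part in semantic matching.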
| 210 | + |
| 211 | +### Phase 4: Implement Semantic Join Matching |
| 212 | + |
| 213 | +1. Modify `join()` in `expression.py`: |
| 214 | +   ```python |
| 215 | +   def join(self, other, semantic_check=True, left=False): |
| 216 | +       if semantic_check: |
| 217 | +           self._check_semantic_compatibility(other) |
| 218 | + |
| 219 | +       # Homologous: same name AND same lineage; None (assumed for computed attributes, see Q3) never matches |
| 220 | +       join_attributes = set( |
| 221 | +           n for n in self.heading.names |
| 222 | +           if n in other.heading.names and self.heading.get_lineage(n) is not None |
| 223 | +           and self.heading.get_lineage(n) == other.heading.get_lineage(n) |
| 224 | +       ) |
| 225 | +       # ... |
| 226 | +   ``` |
| 227 | + |
| 228 | +2. Modify `assert_join_compatibility()` in `condition.py`: |
| 229 | + - Check for namesake collisions (same name, different lineage) |
| 230 | + - Check that homologous attributes are in primary key |
| 231 | + |
| 232 | +### Phase 5: Error Handling and Migration |
| 233 | + |
| 234 | +1. Raise clear errors for: |
| 235 | + - Namesake collision (same name, different lineage) |
| 236 | + - Joining on non-primary-key homologous attributes |
| 237 | + |
| 238 | +2. Provide resolution guidance: |
| 239 | + - Use projection to rename colliding attributes |
| 240 | + - Use the permissive join operator `@` to bypass checks |
| 241 | + |
| 242 | +3. Migration path for existing code: |
| 243 | + - Backward compatibility mode? |
| 244 | + - Deprecation warnings? |
| 245 | + |
| 246 | +## Open Questions |
| 247 | + |
| 248 | +### Q1: How to store lineage in the database? |
| 249 | + |
| 250 | +**Options:** |
| 251 | +- A. Encode in attribute comments (JSON suffix) |
| 252 | +- B. Separate metadata table per schema |
| 253 | +- C. Compute from foreign key constraints at runtime |
| 254 | + |
| 255 | +**Trade-offs**: Option A is simplest but consumes space in the attribute comment. Option B is cleaner but adds a table to every schema. Option C is always current but slower, since lineage must be recomputed at runtime. |
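To give a feel for Option A, the sketch below appends the lineage to the comment as a JSON suffix and recovers it on read. The `:lineage:` marker and both function names are invented for this example; no such encoding exists in DataJoint today.

```python
import json

# Hypothetical marker separating the user's comment from the lineage suffix.
MARKER = " :lineage:"


def encode_comment(comment, lineage):
    """Append the lineage tuple to the comment as a JSON array."""
    return f"{comment}{MARKER}{json.dumps(list(lineage))}"


def decode_comment(stored):
    """Split a stored comment into (user_comment, lineage or None)."""
    comment, separator, payload = stored.rpartition(MARKER)
    if not separator:
        # No suffix present: a comment written before lineage tracking.
        return stored, None
    return comment, tuple(json.loads(payload))


stored = encode_comment("unique subject identifier", ("lab", "Subject", "subject_id"))
comment, lineage = decode_comment(stored)
```

The suffix counts against whatever length limit the database imposes on column comments, which is the space cost this option carries.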
| 256 | + |
| 257 | +### Q2: What happens with renamed attributes? |
| 258 | + |
| 259 | +When an attribute is renamed via projection: |
| 260 | +```python |
| 261 | +table.proj(new_name='old_name') |
| 262 | +``` |
| 263 | + |
| 264 | +Should the lineage remain the same (pointing to `old_name`'s origin) or become new? |
| 265 | + |
| 266 | +**Recommendation**: Renamed attributes should keep their original lineage. The rename is cosmetic; the semantic identity remains. |
| 267 | + |
| 268 | +### Q3: What about computed attributes? |
| 269 | + |
| 270 | +For computed attributes like: |
| 271 | +```python |
| 272 | +table.proj(total='price * quantity') |
| 273 | +``` |
| 274 | + |
| 275 | +**Recommendation**: Computed attributes have no lineage (or a special "computed" lineage). They cannot participate in semantic matching. |
| 276 | + |
| 277 | +### Q4: How does this interact with `dj.U` (universal set)? |
| 278 | + |
| 279 | +The `U` class modifies which attributes are treated as primary key. |
| 280 | + |
| 281 | +**Recommendation**: `U` should not affect lineage - it only affects the primary key membership check, not semantic matching. |
| 282 | + |
| 283 | +### Q5: Backward compatibility? |
| 284 | + |
| 285 | +Should there be a migration path for existing pipelines? |
| 286 | + |
| 287 | +**Options:** |
| 288 | +- A. Breaking change - require updates to all pipelines |
| 289 | +- B. Deprecation period with warnings |
| 290 | +- C. Configuration flag to switch between old/new behavior |
| 291 | +- D. Default to permissive join (`@`) semantics when lineage is unknown |
| 292 | + |
| 293 | +**Recommendation**: Option C or D for the transition period. |
| 294 | + |
| 295 | +## Testing Strategy |
| 296 | + |
| 297 | +1. **Unit tests** for lineage propagation through all query operations |
| 298 | +2. **Integration tests** for join behavior with: |
| 299 | + - Tables with foreign key relationships (should join) |
| 300 | + - Tables with coincidentally same-named attributes (should error) |
| 301 | + - Renamed attributes (should preserve lineage) |
| 302 | + - Computed attributes (should have no lineage) |
| 303 | +3. **Backward compatibility tests** for existing pipelines |
| 304 | + |
| 305 | +## Performance Considerations |
| 306 | + |
| 307 | +1. **Memory**: Additional field per attribute (minimal impact) |
| 308 | +2. **Comparison**: Lineage comparison is equality of two fixed-size, three-element tuples, so its cost is effectively constant |
| 309 | +3. **Storage**: If stored in database, small overhead per attribute |
| 310 | + |
| 311 | +## Summary |
| 312 | + |
| 313 | +Semantic matching is a significant change to DataJoint's join semantics that improves correctness by preventing accidental joins on coincidentally named attributes. The recommended implementation adds a `lineage` tuple to each `Attribute`, populated during table declaration and preserved through query operations. |
| 314 | + |
| 315 | +Key files to modify: |
| 316 | +- `datajoint/heading.py` - Add lineage to Attribute |
| 317 | +- `datajoint/declare.py` - Populate lineage during FK processing |
| 318 | +- `datajoint/expression.py` - Use lineage in join logic |
| 319 | +- `datajoint/condition.py` - Update compatibility checks |
| 320 | + |
| 321 | +This is a breaking change that will require a migration strategy for existing pipelines. |