Update spec with design decisions for semantic matching
Incorporates feedback on key design decisions:
- D1: Use hidden ~lineage metadata table (MySQL + PostgreSQL compatible)
- D2: Renamed attributes preserve original lineage
- D3: Computed attributes have no lineage (breaks matching)
- D4: dj.U does not affect lineage
- D5: Migration utility computes lineage from FK graph
Adds detailed implementation phases, SQL queries for both databases,
and clear summary of files to create/modify.
    - Database-agnostic SQL generation for MySQL and PostgreSQL
+
+3. **Integrate with Schema** (`schemas.py`)
+   - Create `~lineage` table when schema is activated
+   - Provide `schema.migrate_lineage()` method for existing schemas

 ### Phase 2: Populate Lineage During Table Declaration

-1. Modify `compile_foreign_key` in `declare.py` to preserve lineage when copying attributes from referenced tables
-2. For non-FK attributes, set lineage to `(current_schema, current_table, attr_name)`
-3. Store lineage in heading metadata (potentially in attribute comments or a separate metadata table)
+1. **Modify `compile_foreign_key`** (`declare.py`)
+   - When copying attributes from referenced table, record their lineage
+   - Return lineage info along with attribute SQL
+
+2. **Modify `declare`** (`declare.py`)
+   - For native attributes: lineage = `(current_schema, current_table, attr_name)`
+   - For FK attributes: lineage = inherited from referenced table
+   - Insert lineage records into `~lineage` table after table creation
+
+3. **Load lineage into Heading** (`heading.py`)
+   - When `Heading` is initialized from `table_info`, query `~lineage`
+   - Store lineage in each `Attribute` instance

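The Phase 2 rule (native attributes originate in the table being declared, FK attributes inherit the lineage recorded in the referenced table) can be sketched as follows. This is a minimal illustration, not the code in `declare.py`: the `Lineage` type, the `declare_lineage` helper, and the `lab` schema are hypothetical names introduced only for this example.

```python
from typing import NamedTuple


class Lineage(NamedTuple):
    """Origin of an attribute as (schema, table, attribute). Hypothetical type."""
    schema: str
    table: str
    attribute: str


def declare_lineage(schema, table, native_attrs, fk_attrs):
    """Return {attr_name: Lineage} for a newly declared table.

    native_attrs: names of attributes defined directly in this table.
    fk_attrs: {attr_name: Lineage} copied from the referenced tables.
    """
    # Native attributes originate in the table being declared.
    lineage = {name: Lineage(schema, table, name) for name in native_attrs}
    # FK attributes keep the lineage already recorded in the referenced table.
    lineage.update(fk_attrs)
    return lineage


# Made-up tables: session references subject through subject_id.
subject = declare_lineage("lab", "subject", ["subject_id", "species"], {})
session = declare_lineage(
    "lab",
    "session",
    native_attrs=["session_idx"],
    fk_attrs={"subject_id": subject["subject_id"]},
)
```

Here `session["subject_id"]` still points at `lab.subject`, while `session["session_idx"]` points at `lab.session`. These are the records that would be written to `~lineage` after table creation.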
 ### Phase 3: Propagate Lineage Through Query Operations

@@ -229,68 +247,197 @@ Update these methods to preserve lineage:
 - Check for namesake collisions (same name, different lineage)
 - Check that homologous attributes are in primary key

-### Phase 5: Error Handling and Migration
+### Phase 5: Error Handling
+
+1. **Clear error messages** for:
+   - Namesake collision: `"Cannot join: attribute 'name' exists in both operands with different lineages (Student.name vs Course.name). Use .proj() to rename one."`
+   - Non-PK homologous: `"Cannot join on secondary attribute 'value' - must be in primary key of at least one operand."`

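A minimal sketch of how the two checks and their messages might fit together is shown below. The `Attr` dataclass, the `JoinError` exception, and `check_join` are assumptions made for illustration; the spec does not name an exception type or say where the check lives.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Attr:
    """Hypothetical stand-in for an attribute that carries lineage."""
    name: str
    lineage: Optional[Tuple[str, str, str]]  # None for computed attributes
    in_primary_key: bool = False


class JoinError(Exception):
    """Hypothetical exception type; the spec does not name one."""


def check_join(left, right):
    """Return the names of homologous attributes, or raise JoinError."""
    right_by_name = {a.name: a for a in right}
    matched = []
    for a in left:
        b = right_by_name.get(a.name)
        if b is None:
            continue  # name appears in one operand only
        # Same name but different (or missing) lineage: namesake collision.
        if a.lineage is None or a.lineage != b.lineage:
            raise JoinError(
                f"Cannot join: attribute '{a.name}' exists in both operands "
                "with different lineages. Use .proj() to rename one."
            )
        # Homologous attributes must be in a primary key on at least one side.
        if not (a.in_primary_key or b.in_primary_key):
            raise JoinError(
                f"Cannot join on secondary attribute '{a.name}' - must be in "
                "primary key of at least one operand."
            )
        matched.append(a.name)
    return matched


student = [
    Attr("student_id", ("uni", "student", "student_id"), True),
    Attr("name", ("uni", "student", "name")),
]
course = [
    Attr("course_id", ("uni", "course", "course_id"), True),
    Attr("name", ("uni", "course", "name")),
]
enroll = [
    Attr("student_id", ("uni", "student", "student_id"), True),
    Attr("course_id", ("uni", "course", "course_id"), True),
]
```

With these made-up headings, `check_join(student, enroll)` matches on `student_id`, while `check_join(student, course)` raises the namesake-collision error for `name`.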
-1. Raise clear errors for:
-   - Namesake collision (same name, different lineage)
-   - Joining on non-primary-key homologous attributes
+2. **Resolution guidance** in error messages:
+   - Suggest specific projection syntax to resolve
+   - Mention permissive join `@` as escape hatch

-2. Provide resolution guidance:
-   - Use projection to rename colliding attributes
-   - Use the permissive join operator `@` to bypass checks
+### Phase 6: Migration Utility

-3. Migration path for existing code:
-   - Backward compatibility mode?
-   - Deprecation warnings?
+1. **`dj.migrate_lineage(schema)`** function
+   - Analyzes existing FK constraints via `INFORMATION_SCHEMA` (MySQL) or `pg_catalog` (PostgreSQL)
+   - Computes lineage for each attribute using recursive FK traversal
+   - Populates `~lineage` table

-## Open Questions
+2. **Automatic migration on schema activation** (optional)
+   - If `~lineage` table is empty but tables exist, offer to run migration
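For the MySQL side of the first step, the foreign key columns could be read from `INFORMATION_SCHEMA.KEY_COLUMN_USAGE`, roughly as sketched below. The `%s` placeholder style (as used by common MySQL drivers), the `FK_QUERY_MYSQL` constant, and the `fk_edges` helper are assumptions for illustration. The PostgreSQL query against `pg_catalog` is not covered here.

```python
# MySQL only: KEY_COLUMN_USAGE has one row per column of each key constraint;
# rows with a non-NULL REFERENCED_TABLE_NAME belong to foreign keys.
FK_QUERY_MYSQL = """
SELECT TABLE_NAME, COLUMN_NAME,
       REFERENCED_TABLE_SCHEMA, REFERENCED_TABLE_NAME, REFERENCED_COLUMN_NAME
FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE
WHERE TABLE_SCHEMA = %s AND REFERENCED_TABLE_NAME IS NOT NULL
"""


def fk_edges(rows, schema):
    """Map (schema, table, column) -> referenced (schema, table, column).

    rows: result tuples in the column order of FK_QUERY_MYSQL.
    Simplification: a column that takes part in several foreign keys keeps
    only the last reference seen.
    """
    return {
        (schema, table, column): (ref_schema, ref_table, ref_column)
        for table, column, ref_schema, ref_table, ref_column in rows
    }


# Rows as a driver might return them for a made-up schema `lab`.
sample_rows = [
    ("session", "subject_id", "lab", "subject", "subject_id"),
    ("scan", "subject_id", "lab", "session", "subject_id"),
]
edges = fk_edges(sample_rows, "lab")
```

Each entry in `edges` points one step up the FK graph. Tracing an attribute to its origin means following these steps until no further reference exists.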
**Decision**: Yes, renamed attributes preserve their original lineage.

 When an attribute is renamed via projection:
 ```python
 table.proj(new_name='old_name')
 ```

-Should the lineage remain the same (pointing to `old_name`'s origin) or become new?
+The `new_name` attribute retains the lineage of `old_name`. The rename is cosmetic; the semantic identity (what entity the attribute represents) remains unchanged.

-**Recommendation**: Renamed attributes should keep their original lineage. The rename is cosmetic; the semantic identity remains.
+This enables:
+```python
+# These two expressions can still join on the underlying subject_id
+A.proj(subj='subject_id') * B.proj(subj='subject_id')  # OK - same lineage
+```

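To make the rename rule concrete, here is a toy model in which a heading is a plain `{name: lineage}` dict. The `proj_rename` function is a hypothetical stand-in for a renaming projection, not DataJoint's `proj`, and it ignores which attributes a real projection would keep or drop.

```python
def proj_rename(heading, **renames):
    """Rename attributes in a {name: lineage} heading.

    renames maps new_name -> old_name, mirroring table.proj(new_name='old_name').
    Only the key changes; the lineage value is carried over untouched.
    """
    old_to_new = {old: new for new, old in renames.items()}
    return {old_to_new.get(name, name): lineage for name, lineage in heading.items()}


# Made-up headings: both tables inherit subject_id from lab.subject.
a = {"subject_id": ("lab", "subject", "subject_id"), "weight": ("lab", "a", "weight")}
b = {"subject_id": ("lab", "subject", "subject_id"), "dose": ("lab", "b", "dose")}

a_renamed = proj_rename(a, subj="subject_id")
b_renamed = proj_rename(b, subj="subject_id")
```

Both `a_renamed["subj"]` and `b_renamed["subj"]` still carry the lineage `("lab", "subject", "subject_id")`, which is why the two renamed expressions remain joinable.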
-### Q3: What about computed attributes?
+### D3: Computed Attributes Have No Lineage
+
+**Decision**: Lineage breaks for computed attributes.

 For computed attributes like:
 ```python
 table.proj(total='price * quantity')
 ```

-**Recommendation**: Computed attributes have no lineage (or a special "computed" lineage). They cannot participate in semantic matching.
+The `total` attribute has `lineage = None`. Computed attributes:
+- Cannot participate in semantic matching
+- Will cause a namesake collision error if another table has an attribute named `total`
+- Must be renamed via projection to avoid collisions
+
+This is intentional: a computed value is a new entity, not inherited from any source table.
+
+### D4: `dj.U` Does Not Affect Lineage
+
+**Decision**: The universal set `U` only affects primary key membership, not lineage.
+
+`dj.U` promotes attributes to the primary key for grouping/aggregation purposes, but the semantic identity of the attributes remains unchanged.

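The separation between primary key membership and lineage can be shown with a toy model. `promote_to_primary_key` below is a hypothetical stand-in for the effect described above, not DataJoint's implementation of `dj.U`, and the `Attr` dataclass is likewise an assumption.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Attr:
    """Hypothetical attribute record with lineage and a primary key flag."""
    name: str
    lineage: Optional[Tuple[str, str, str]]
    in_primary_key: bool = False


def promote_to_primary_key(heading, *names):
    """Flip the primary key flag for the named attributes, leaving lineage as is."""
    return [replace(a, in_primary_key=True) if a.name in names else a for a in heading]


heading = [
    Attr("session_id", ("lab", "session", "session_id"), True),
    Attr("species", ("lab", "subject", "species")),
]
promoted = promote_to_primary_key(heading, "species")
```

After promotion, `species` counts as a primary key attribute for the join check, but its lineage is still `("lab", "subject", "species")`.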
-### Q4: How does this interact with `dj.U` (universal set)?
+### D5: Migration via Utility Function

-The `U` class modifies which attributes are treated as primary key.
+**Decision**: Provide a migration utility that computes the `~lineage` table from existing schema.

-**Recommendation**: `U` should not affect lineage - it only affects the primary key membership check, not semantic matching.
+For existing schemas without lineage metadata, a utility will:
+1. Analyze the foreign key graph using `INFORMATION_SCHEMA`
+2. Trace each attribute to its origin table
+3. Populate the `~lineage` table

-### Q5: Backward compatibility?
+```python
+def migrate_schema_lineage(schema):
+    """
+    Compute and populate the ~lineage table for an existing schema.
+
+    Analyzes foreign key relationships to determine attribute origins.
+    """
+    # 1. Create ~lineage table if not exists
+    # 2. For each table in schema:
+    #    a. For each attribute:
+    #       - If attribute is inherited via FK, trace to origin
+    #       - If attribute is native, origin is this table
+    #    b. Insert into ~lineage
+```
+
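The tracing step in the outline above could look roughly like this once the FK graph is available as a mapping from each referencing attribute to the attribute it references. This is a sketch under that assumption: `trace_origin` and `compute_lineage` are hypothetical names, and the traversal is written iteratively rather than recursively.

```python
def trace_origin(attr, fk_edges):
    """Follow FK references from attr until an attribute with no parent is reached.

    attr and the keys/values of fk_edges are (schema, table, column) tuples.
    """
    seen = set()
    while attr in fk_edges:
        if attr in seen:
            raise ValueError(f"cyclic foreign key reference at {attr}")
        seen.add(attr)
        attr = fk_edges[attr]
    return attr


def compute_lineage(attributes, fk_edges):
    """Return {attribute: origin}; native attributes map to themselves."""
    return {attr: trace_origin(attr, fk_edges) for attr in attributes}


# Made-up FK graph: scan -> session -> subject on subject_id.
fk_edges = {
    ("lab", "session", "subject_id"): ("lab", "subject", "subject_id"),
    ("lab", "scan", "subject_id"): ("lab", "session", "subject_id"),
}
attributes = [
    ("lab", "subject", "subject_id"),
    ("lab", "session", "subject_id"),
    ("lab", "scan", "subject_id"),
    ("lab", "scan", "scan_idx"),
]
lineage = compute_lineage(attributes, fk_edges)
```

All three `subject_id` attributes resolve to `lab.subject`, and the native `scan_idx` resolves to itself. These pairs are the rows the utility would insert into `~lineage`.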
+#### Algorithm for Computing Lineage

-Should there be a migration path for existing pipelines?
@@ -310,12 +457,47 @@ Should there be a migration path for existing pipelines?

 ## Summary

-Semantic matching is a significant change to DataJoint's join semantics that improves correctness by preventing accidental joins on coincidentally-named attributes. The recommended implementation adds a `lineage` tuple to each `Attribute`, populated during table declaration and preserved through query operations.
+Semantic matching is a significant change to DataJoint's join semantics that improves correctness by preventing accidental joins on coincidentally-named attributes.