Commit 1b8464a

Update spec with design decisions for semantic matching
Incorporates feedback on key design decisions:
- D1: Use hidden ~lineage metadata table (MySQL + PostgreSQL compatible)
- D2: Renamed attributes preserve original lineage
- D3: Computed attributes have no lineage (breaks matching)
- D4: dj.U does not affect lineage
- D5: Migration utility computes lineage from FK graph

Adds detailed implementation phases, SQL queries for both databases, and a clear summary of files to create/modify.
1 parent ccb611b commit 1b8464a

1 file changed

+229
-47
lines changed

docs/SPEC-semantic-matching.md

Lines changed: 229 additions & 47 deletions
Original file line number | Diff line number | Diff line change
@@ -187,17 +187,35 @@ class SchemaGraph:
187187

188188
## Implementation Plan
189189

190-
### Phase 1: Add Lineage to Attribute
190+
### Phase 1: Add Lineage Infrastructure
191191

192-
1. Add `lineage` field to `default_attribute_properties` in `heading.py`
193-
2. Update `Attribute` namedtuple (automatically from `default_attribute_properties`)
194-
3. Add `lineage` parameter to all Attribute creation sites
192+
1. **Add `lineage` field to `Attribute`** (`heading.py`)
193+
- Add `lineage` to `default_attribute_properties` with default `None`
194+
- Lineage is a tuple: `(origin_schema, origin_table, origin_attribute)` or `None`
195+
196+
2. **Create `~lineage` table management** (new file: `datajoint/lineage.py`)
197+
- `LineageTable` class (similar to `ExternalTable`)
198+
- Methods: `declare()`, `insert()`, `lookup()`, `compute_from_fk()`
199+
- Database-agnostic SQL generation for MySQL and PostgreSQL
200+
201+
3. **Integrate with Schema** (`schemas.py`)
202+
- Create `~lineage` table when schema is activated
203+
- Provide `schema.migrate_lineage()` method for existing schemas
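
A minimal sketch of step 1, assuming `Attribute` continues to be derived from a defaults dict. The dict below is a trimmed-down, hypothetical stand-in for `default_attribute_properties` (the real one in `heading.py` has many more fields), and the `lab.subject` names are illustrative only.

```python
from collections import namedtuple

# Hypothetical, trimmed-down stand-in for the dict in heading.py.
default_attribute_properties = dict(
    name=None,
    type="expression",
    in_key=False,
    lineage=None,  # (origin_schema, origin_table, origin_attribute) or None
)

# Field names come from the dict keys, defaults from its values.
Attribute = namedtuple(
    "Attribute",
    default_attribute_properties,
    defaults=tuple(default_attribute_properties.values()),
)

# A native attribute points at its own table; a computed one keeps lineage=None.
native = Attribute(
    name="subject_id",
    type="int",
    in_key=True,
    lineage=("lab", "subject", "subject_id"),
)
computed = Attribute(name="total")
```

Because `lineage` defaults to `None`, existing `Attribute` creation sites that do not pass it would keep working under this arrangement.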
195204

196205
### Phase 2: Populate Lineage During Table Declaration
197206

198-
1. Modify `compile_foreign_key` in `declare.py` to preserve lineage when copying attributes from referenced tables
199-
2. For non-FK attributes, set lineage to `(current_schema, current_table, attr_name)`
200-
3. Store lineage in heading metadata (potentially in attribute comments or a separate metadata table)
207+
1. **Modify `compile_foreign_key`** (`declare.py`)
208+
- When copying attributes from referenced table, record their lineage
209+
- Return lineage info along with attribute SQL
210+
211+
2. **Modify `declare`** (`declare.py`)
212+
- For native attributes: lineage = `(current_schema, current_table, attr_name)`
213+
- For FK attributes: lineage = inherited from referenced table
214+
- Insert lineage records into `~lineage` table after table creation
215+
216+
3. **Load lineage into Heading** (`heading.py`)
217+
- When `Heading` is initialized from `table_info`, query `~lineage`
218+
- Store lineage in each `Attribute` instance
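
The declaration-time rule in step 2 can be sketched as a function that builds the rows to insert into `~lineage`. This is an illustration only: `lineage_rows` and its arguments are hypothetical names, not part of `declare.py`, and the sketch assumes the inherited lineage has already been looked up from the referenced table.

```python
def lineage_rows(schema, table, native_attrs, inherited):
    """Build rows for ~lineage in column order:
    (table_name, attribute_name, origin_schema, origin_table, origin_attribute).

    native_attrs: names of attributes declared directly in this table.
    inherited: dict mapping attribute name -> lineage tuple copied from
               the referenced table through a foreign key.
    """
    # Native attributes originate in the table being declared.
    rows = [(table, name, schema, table, name) for name in native_attrs]
    # FK attributes keep the lineage recorded for the referenced attribute.
    rows += [(table, name, *lineage) for name, lineage in inherited.items()]
    return rows


# Example: `session` references `subject` and adds its own `session_idx`.
rows = lineage_rows(
    "lab",
    "session",
    native_attrs=["session_idx"],
    inherited={"subject_id": ("lab", "subject", "subject_id")},
)
```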
201219

202220
### Phase 3: Propagate Lineage Through Query Operations
203221

@@ -229,68 +247,197 @@ Update these methods to preserve lineage:
229247
- Check for namesake collisions (same name, different lineage)
230248
- Check that homologous attributes are in primary key
231249

232-
### Phase 5: Error Handling and Migration
250+
### Phase 5: Error Handling
251+
252+
1. **Clear error messages** for:
253+
- Namesake collision: `"Cannot join: attribute 'name' exists in both operands with different lineages (Student.name vs Course.name). Use .proj() to rename one."`
254+
- Non-PK homologous: `"Cannot join on secondary attribute 'value' - must be in primary key of at least one operand."`
233255

234-
1. Raise clear errors for:
235-
- Namesake collision (same name, different lineage)
236-
- Joining on non-primary-key homologous attributes
256+
2. **Resolution guidance** in error messages:
257+
- Suggest specific projection syntax to resolve
258+
- Mention permissive join `@` as escape hatch
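
The two checks and their messages could be combined roughly as follows. This is a sketch under stated assumptions: each operand is reduced to a plain dict of `name -> (lineage, in_primary_key)`, and `JoinError`, `describe`, and `join_attributes` are hypothetical names rather than existing DataJoint API.

```python
class JoinError(Exception):
    """Stand-in for whatever error type DataJoint would raise."""


def describe(lineage):
    # Render a lineage tuple as Table.attribute; computed attributes have none.
    return "<computed>" if lineage is None else "{}.{}".format(*lineage[1:])


def join_attributes(left, right):
    """Return the attribute names two operands may be joined on.

    left, right: dict mapping attribute name -> (lineage, in_primary_key).
    """
    matched = []
    for name in sorted(left.keys() & right.keys()):
        (l_lin, l_pk), (r_lin, r_pk) = left[name], right[name]
        # Same name but different (or missing) lineage: namesake collision.
        if l_lin is None or r_lin is None or l_lin != r_lin:
            raise JoinError(
                "Cannot join: attribute '{0}' exists in both operands with "
                "different lineages ({1} vs {2}). Use .proj() to rename one."
                .format(name, describe(l_lin), describe(r_lin))
            )
        # Homologous attribute must be in the primary key of an operand.
        if not (l_pk or r_pk):
            raise JoinError(
                "Cannot join on secondary attribute '{0}' - must be in "
                "primary key of at least one operand.".format(name)
            )
        matched.append(name)
    return matched
```

For example, two operands that both carry a secondary attribute `name` originating in `Student` and `Course` respectively would trigger the first message, while operands sharing `student_id` with the same lineage and primary-key membership on either side would join on it.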
237259

238-
2. Provide resolution guidance:
239-
- Use projection to rename colliding attributes
240-
- Use the permissive join operator `@` to bypass checks
260+
### Phase 6: Migration Utility
241261

242-
3. Migration path for existing code:
243-
- Backward compatibility mode?
244-
- Deprecation warnings?
262+
1. **`dj.migrate_lineage(schema)`** function
263+
- Analyzes existing FK constraints via `INFORMATION_SCHEMA` (MySQL) or `pg_catalog` (PostgreSQL)
264+
- Computes lineage for each attribute using recursive FK traversal
265+
- Populates `~lineage` table
245266

246-
## Open Questions
267+
2. **Automatic migration on schema activation** (optional)
268+
- If `~lineage` table is empty but tables exist, offer to run migration
269+
- Configuration flag: `datajoint.config['auto_migrate_lineage'] = True/False`
247270

248-
### Q1: How to store lineage in the database?
271+
3. **CLI command**
272+
```bash
python -m datajoint migrate-lineage myschema
```
249275

250-
**Options:**
251-
- A. Encode in attribute comments (JSON suffix)
252-
- B. Separate metadata table per schema
253-
- C. Compute from foreign key constraints at runtime
276+
## Design Decisions
254277

255-
**Recommendation**: Option A is simplest but limits comment space. Option B is cleaner but adds tables. Option C is dynamic but slower.
278+
### D1: Lineage Storage - Hidden Metadata Table
256279

257-
### Q2: What happens with renamed attributes?
280+
**Decision**: Use a hidden metadata table (`~lineage`) per schema.
281+
282+
This approach:
283+
- Works consistently for both **MySQL** and **PostgreSQL**
284+
- Provides explicit, queryable lineage data
285+
- Follows the existing pattern for hidden tables (e.g., `~external_*`, `~log`)
286+
- Makes existing schemas easier to migrate
287+
288+
#### Table Structure
289+
290+
```sql
CREATE TABLE `~lineage` (
    table_name VARCHAR(64) NOT NULL,
    attribute_name VARCHAR(64) NOT NULL,
    origin_schema VARCHAR(64) NOT NULL,
    origin_table VARCHAR(64) NOT NULL,
    origin_attribute VARCHAR(64) NOT NULL,
    PRIMARY KEY (table_name, attribute_name)
) ENGINE=InnoDB;
```
300+
301+
For PostgreSQL:
302+
```sql
CREATE TABLE "~lineage" (
    table_name VARCHAR(64) NOT NULL,
    attribute_name VARCHAR(64) NOT NULL,
    origin_schema VARCHAR(64) NOT NULL,
    origin_table VARCHAR(64) NOT NULL,
    origin_attribute VARCHAR(64) NOT NULL,
    PRIMARY KEY (table_name, attribute_name)
);
```
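
Since the two statements differ only in identifier quoting and the engine clause, the "database-agnostic SQL generation" mentioned in Phase 1 might look like the sketch below. `lineage_table_ddl` and the `backend` strings are assumptions made for illustration; the spec does not define how `LineageTable.declare()` selects a dialect.

```python
LINEAGE_COLUMNS = (
    "table_name",
    "attribute_name",
    "origin_schema",
    "origin_table",
    "origin_attribute",
)


def lineage_table_ddl(backend):
    """Return the CREATE TABLE statement for ~lineage.

    backend: 'mysql' or 'postgresql' (hypothetical selector values).
    """
    if backend == "mysql":
        quote, suffix = "`", " ENGINE=InnoDB"
    elif backend == "postgresql":
        quote, suffix = '"', ""
    else:
        raise ValueError("unsupported backend: {!r}".format(backend))
    columns = ",\n".join(
        "    {} VARCHAR(64) NOT NULL".format(col) for col in LINEAGE_COLUMNS
    )
    return (
        "CREATE TABLE {q}~lineage{q} (\n{columns},\n"
        "    PRIMARY KEY (table_name, attribute_name)\n){suffix};"
    ).format(q=quote, columns=columns, suffix=suffix)
```

Calling `lineage_table_ddl("mysql")` and `lineage_table_ddl("postgresql")` reproduces the two statements above.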
312+
313+
#### Lineage Lookup
314+
315+
When a `Heading` is initialized from a table, query the `~lineage` table:
316+
317+
```python
def _load_lineage(self, connection, database, table_name):
    """Load lineage information from the ~lineage metadata table."""
    # Backtick quoting is MySQL syntax; PostgreSQL needs double quotes instead.
    query = """
        SELECT attribute_name, origin_schema, origin_table, origin_attribute
        FROM `{database}`.`~lineage`
        WHERE table_name = %s
        """.format(database=database)
    # Map each attribute name to its (origin_schema, origin_table, origin_attribute).
    self.lineage = {
        name: tuple(origin)
        for name, *origin in connection.query(query, args=(table_name,))
    }
```
327+
328+
### D2: Renamed Attributes Preserve Lineage
329+
330+
**Decision**: Yes, renamed attributes preserve their original lineage.
258331

259332
When an attribute is renamed via projection:
260333
```python
table.proj(new_name='old_name')
```
263336

264-
Should the lineage remain the same (pointing to `old_name`'s origin) or become new?
337+
The `new_name` attribute retains the lineage of `old_name`. The rename is cosmetic; the semantic identity (what entity the attribute represents) remains unchanged.
265338

266-
**Recommendation**: Renamed attributes should keep their original lineage. The rename is cosmetic; the semantic identity remains.
339+
This enables:
340+
```python
# These two expressions can still join on the underlying subject_id
A.proj(subj='subject_id') * B.proj(subj='subject_id')  # OK - same lineage
```
267344

268-
### Q3: What about computed attributes?
345+
### D3: Computed Attributes Have No Lineage
346+
347+
**Decision**: Lineage breaks for computed attributes.
269348

270349
For computed attributes like:
271350
```python
table.proj(total='price * quantity')
```
274353

275-
**Recommendation**: Computed attributes have no lineage (or a special "computed" lineage). They cannot participate in semantic matching.
354+
The `total` attribute has `lineage = None`. Computed attributes:
355+
- Cannot participate in semantic matching
356+
- Will cause a namesake collision error if the other operand of a join also has an attribute named `total`
357+
- Must be renamed via projection to avoid collisions
358+
359+
This is intentional: a computed value is a new entity, not inherited from any source table.
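
The projection rules from D2 and D3 can be illustrated together. The sketch below tracks only lineage, as a plain dict of `name -> lineage`; `project_lineage` is a hypothetical helper, and unlike a real `.proj()` call it keeps every attribute that is not renamed.

```python
def project_lineage(lineage, renames=None, computed=()):
    """Return the lineage mapping of a projection result.

    lineage:  dict mapping attribute name -> lineage tuple or None.
    renames:  dict mapping new_name -> old_name (D2: lineage is preserved).
    computed: names of new computed attributes (D3: lineage is None).
    """
    renames = renames or {}
    renamed_away = set(renames.values())
    # Attributes that are not renamed pass through unchanged.
    result = {n: lin for n, lin in lineage.items() if n not in renamed_away}
    # A renamed attribute carries the lineage of the attribute it replaces.
    result.update({new: lineage[old] for new, old in renames.items()})
    # A computed attribute is a new entity with no origin.
    result.update({name: None for name in computed})
    return result


order = {
    "order_id": ("shop", "order", "order_id"),
    "price": ("shop", "order", "price"),
    "quantity": ("shop", "order", "quantity"),
}
projected = project_lineage(order, renames={"oid": "order_id"}, computed=["total"])
```

Here `projected["oid"]` still points at `shop.order.order_id`, while `projected["total"]` is `None` and therefore cannot take part in semantic matching.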
360+
361+
### D4: `dj.U` Does Not Affect Lineage
362+
363+
**Decision**: The universal set `U` only affects primary key membership, not lineage.
364+
365+
`dj.U` promotes attributes to the primary key for grouping/aggregation purposes, but the semantic identity of the attributes remains unchanged.
276366

277-
### Q4: How does this interact with `dj.U` (universal set)?
367+
### D5: Migration via Utility Function
278368

279-
The `U` class modifies which attributes are treated as primary key.
369+
**Decision**: Provide a migration utility that computes the `~lineage` table from an existing schema.
280370

281-
**Recommendation**: `U` should not affect lineage - it only affects the primary key membership check, not semantic matching.
371+
For existing schemas without lineage metadata, a utility will:
372+
1. Analyze the foreign key graph using `INFORMATION_SCHEMA` (MySQL) or `pg_catalog` (PostgreSQL)
373+
2. Trace each attribute to its origin table
374+
3. Populate the `~lineage` table
282375

283-
### Q5: Backward compatibility?
376+
```python
def migrate_schema_lineage(schema):
    """
    Compute and populate the ~lineage table for an existing schema.

    Analyzes foreign key relationships to determine attribute origins.
    """
    # 1. Create ~lineage table if not exists
    # 2. For each table in schema:
    #    a. For each attribute:
    #       - If attribute is inherited via FK, trace to origin
    #       - If attribute is native, origin is this table
    #    b. Insert into ~lineage
```
390+
391+
#### Algorithm for Computing Lineage
284392

285-
Should there be a migration path for existing pipelines?
393+
```python
def compute_attribute_lineage(schema, table, attribute):
    """
    Trace an attribute to its original definition.

    Returns (origin_schema, origin_table, origin_attribute)
    """
    # Check if this attribute is part of a foreign key
    fk_info = get_foreign_key_for_attribute(schema, table, attribute)

    if fk_info is None:
        # Native attribute - origin is this table
        return (schema, table, attribute)

    # Inherited via FK - recurse to referenced table
    ref_schema, ref_table, ref_attribute = fk_info
    return compute_attribute_lineage(ref_schema, ref_table, ref_attribute)
```
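
To make the traversal concrete, here is a self-contained variant that replaces the database lookup with an in-memory map. It is a sketch, not the spec's implementation: `trace_lineage` and `fk_map` are hypothetical, the map assumes each attribute takes part in at most one foreign key, and the loop adds a cycle guard that the recursive version above does not have.

```python
def trace_lineage(fk_map, schema, table, attribute):
    """Follow foreign keys until a native attribute is reached.

    fk_map: dict mapping (schema, table, attribute) ->
            (ref_schema, ref_table, ref_attribute) for FK attributes only.
    """
    current = (schema, table, attribute)
    seen = {current}
    # An attribute absent from fk_map is native, so it is its own origin.
    while current in fk_map:
        current = fk_map[current]
        if current in seen:
            raise ValueError("foreign key cycle at {}".format(current))
        seen.add(current)
    return current


# scan -> session -> subject: subject_id originates in lab.subject.
fk_map = {
    ("lab", "scan", "subject_id"): ("lab", "session", "subject_id"),
    ("lab", "session", "subject_id"): ("lab", "subject", "subject_id"),
}
origin = trace_lineage(fk_map, "lab", "scan", "subject_id")
```

The rows returned by the FK analysis queries below could be used to build such a map once per schema, avoiding one query per attribute.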
286411

287-
**Options:**
288-
- A. Breaking change - require updates to all pipelines
289-
- B. Deprecation period with warnings
290-
- C. Configuration flag to switch between old/new behavior
291-
- D. Default to permissive join (`@`) semantics when lineage is unknown
412+
#### MySQL Query for FK Analysis
413+
414+
```sql
SELECT
    kcu.COLUMN_NAME as attribute_name,
    kcu.REFERENCED_TABLE_SCHEMA as ref_schema,
    kcu.REFERENCED_TABLE_NAME as ref_table,
    kcu.REFERENCED_COLUMN_NAME as ref_attribute
FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcu
WHERE kcu.TABLE_SCHEMA = %s
  AND kcu.TABLE_NAME = %s
  AND kcu.REFERENCED_TABLE_NAME IS NOT NULL
```
292425

293-
**Recommendation**: Option C or D for transition period.
426+
#### PostgreSQL Query for FK Analysis
427+
428+
```sql
SELECT
    a.attname as attribute_name,
    cl2.relnamespace::regnamespace::text as ref_schema,
    cl2.relname as ref_table,
    a2.attname as ref_attribute
FROM pg_constraint c
-- Pair local and referenced columns by position; matching both arrays
-- with ANY() would cross-pair the columns of a composite foreign key.
CROSS JOIN LATERAL unnest(c.conkey, c.confkey) AS k(attnum, ref_attnum)
JOIN pg_attribute a ON a.attrelid = c.conrelid AND a.attnum = k.attnum
JOIN pg_class cl2 ON cl2.oid = c.confrelid
JOIN pg_attribute a2 ON a2.attrelid = c.confrelid AND a2.attnum = k.ref_attnum
WHERE c.contype = 'f'
  AND c.conrelid = %s::regclass
```
294441

295442
## Testing Strategy
296443

@@ -310,12 +457,47 @@ Should there be a migration path for existing pipelines?
310457

311458
## Summary
312459

313-
Semantic matching is a significant change to DataJoint's join semantics that improves correctness by preventing accidental joins on coincidentally-named attributes. The recommended implementation adds a `lineage` tuple to each `Attribute`, populated during table declaration and preserved through query operations.
460+
Semantic matching is a significant change to DataJoint's join semantics that improves correctness by preventing accidental joins on coincidentally-named attributes.
461+
462+
### Key Design Decisions
463+
464+
| Decision | Choice |
465+
|----------|--------|
466+
| Lineage storage | Hidden `~lineage` metadata table (MySQL + PostgreSQL compatible) |
467+
| Renamed attributes | Preserve original lineage |
468+
| Computed attributes | Lineage = `None` (breaks matching) |
469+
| `dj.U` interaction | Does not affect lineage |
470+
| Migration | Utility function that computes lineage from FK graph |
471+
472+
### Files to Create
473+
474+
| File | Purpose |
475+
|------|---------|
476+
| `datajoint/lineage.py` | `LineageTable` class, migration utilities |
477+
478+
### Files to Modify
479+
480+
| File | Changes |
481+
|------|---------|
482+
| `datajoint/heading.py` | Add `lineage` field to `Attribute`, load from `~lineage` |
483+
| `datajoint/declare.py` | Record lineage during table declaration |
484+
| `datajoint/expression.py` | Use lineage equality in join matching |
485+
| `datajoint/condition.py` | Update compatibility checks for lineage collisions |
486+
| `datajoint/schemas.py` | Create `~lineage` table on schema activation |
487+
488+
### Breaking Changes
489+
490+
This is a **semantically breaking change**:
491+
- Joins that previously matched on coincidental name matches will now fail
492+
- Users must explicitly rename colliding attributes with `.proj()`
493+
- Migration utility provides a path for existing schemas
314494

315-
Key files to modify:
316-
- `datajoint/heading.py` - Add lineage to Attribute
317-
- `datajoint/declare.py` - Populate lineage during FK processing
318-
- `datajoint/expression.py` - Use lineage in join logic
319-
- `datajoint/condition.py` - Update compatibility checks
495+
### Next Steps
320496

321-
This is a breaking change that will require a migration strategy for existing pipelines.
497+
1. Review and approve this specification
498+
2. Implement Phase 1 (infrastructure) with tests
499+
3. Implement Phase 2 (population) with tests
500+
4. Implement Phases 3-4 (query propagation and join logic)
501+
5. Implement Phases 5-6 (error handling and migration)
502+
6. Update documentation
503+
7. Release with deprecation warnings, then enforce in a subsequent release
