Commit f5f25ac

Refactor: use semantic_check=False for left join bypass
Instead of a special _aggregation parameter, co-opt semantic_check=False to bypass the left join A → B constraint. When bypassed, PK = PK(A) ∪ PK(B).

This is cleaner because:
- Consistent with existing semantic_check semantics (bypass strict validation)
- User-facing parameter, not an internal hack
- Responsibility is on the caller for any invalid PK from such operations

Aggregation now uses semantic_check=False for its internal left join, then resets PK via GROUP BY.

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
1 parent c69b446 commit f5f25ac

3 files changed: +52 -35 lines changed

docs/src/design/semantic-matching-spec.md

Lines changed: 19 additions & 7 deletions
@@ -338,7 +338,19 @@ Session.join(Trial, left=True) # OK: Session → Trial
 ```
 DataJointError: Left join requires the left operand to determine the right operand (A → B).
 The following attributes from the right operand's primary key are not determined by
-the left operand: ['z']. Use an inner join or restructure the query.
+the left operand: ['z']. Use an inner join, restructure the query, or use semantic_check=False.
+```
+
+### Bypassing with `semantic_check=False`
+
+When `semantic_check=False` is used for a left join where A → B doesn't hold, the constraint is bypassed and **PK = PK(A) ∪ PK(B)** is used. This is useful when the caller will reset the primary key afterward (e.g., aggregation with GROUP BY).
+
+```python
+# Direct left join - normally blocked
+A.join(B, left=True) # Error: A doesn't determine B
+
+# Bypass with semantic_check=False - produces PK(A) ∪ PK(B)
+A.join(B, left=True, semantic_check=False) # Allowed, but PK may have NULLs
 ```

 ### Aggregation Exception
@@ -348,9 +360,10 @@ the left operand: ['z']. Use an inner join or restructure the query.
 This apparent contradiction is resolved by the `GROUP BY` clause:

 1. Aggregation requires B → A so that B can be grouped by A's primary key
-2. The intermediate left join `A LEFT JOIN B` would have an invalid PK under the normal left join rules (B → A case gives PK(B))
-3. However, aggregation's `GROUP BY PK(A)` clause **resets** the primary key to PK(A)
-4. The final result has PK(A), which consists entirely of non-NULL values from A
+2. The intermediate left join `A LEFT JOIN B` would have an invalid PK under the normal left join rules
+3. Aggregation uses `semantic_check=False` for its internal join, producing PK(A) ∪ PK(B)
+4. The `GROUP BY PK(A)` clause then **resets** the primary key to PK(A)
+5. The final result has PK(A), which consists entirely of non-NULL values from A

 **Example:**
 ```
@@ -360,13 +373,12 @@ Trial: session_id*, trial_num*, response_time (references Session)
 # Aggregation with keep_all_rows=True
 Session.aggr(Trial, keep_all_rows=True, avg_rt='avg(response_time)')

-# Internally: Session LEFT JOIN Trial (B → A, would normally be invalid)
+# Internally: Session LEFT JOIN Trial with semantic_check=False
+# Intermediate PK would be {session_id} ∪ {session_id, trial_num} = {session_id, trial_num}
 # But GROUP BY session_id resets PK to {session_id}
 # Result: All sessions, with avg_rt=NULL for sessions without trials
 ```

-The left join constraint validation is bypassed internally for aggregation because the `GROUP BY` clause guarantees a valid primary key in the final result.
-
 ## Universal Set `dj.U`

 `dj.U()` or `dj.U('attr1', 'attr2', ...)` represents the universal set of all possible values and lineages.
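
The GROUP BY reset described in the spec change above can be checked against a plain SQL engine. The following is a minimal sketch that uses Python's built-in `sqlite3` module rather than DataJoint itself; the table layout mirrors the Session/Trial example, while the sample rows and the names `intermediate` and `aggregated` are illustrative assumptions.

```python
import sqlite3

# In-memory database with the Session/Trial layout from the spec example.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE session (session_id INTEGER PRIMARY KEY);
    CREATE TABLE trial (
        session_id INTEGER REFERENCES session (session_id),
        trial_num INTEGER,
        response_time REAL,
        PRIMARY KEY (session_id, trial_num)
    );
    INSERT INTO session VALUES (1), (2);              -- session 2 has no trials
    INSERT INTO trial VALUES (1, 1, 0.25), (1, 2, 0.75);
    """
)

# Intermediate left join: trial_num is NULL for the unmatched session,
# so {session_id, trial_num} cannot serve as a primary key here.
intermediate = conn.execute(
    """
    SELECT s.session_id, t.trial_num
    FROM session AS s LEFT JOIN trial AS t ON s.session_id = t.session_id
    ORDER BY s.session_id, t.trial_num
    """
).fetchall()

# GROUP BY session_id collapses the rows back to one per session,
# so session_id alone identifies every result row.
aggregated = conn.execute(
    """
    SELECT s.session_id, AVG(t.response_time) AS avg_rt
    FROM session AS s LEFT JOIN trial AS t ON s.session_id = t.session_id
    GROUP BY s.session_id
    ORDER BY s.session_id
    """
).fetchall()

print(intermediate)
print(aggregated)
```

The unmatched session appears with a NULL `trial_num` in the intermediate result and with a NULL `avg_rt` after grouping, which is the behavior the spec attributes to `keep_all_rows=True`.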

src/datajoint/expression.py

Lines changed: 7 additions & 8 deletions
@@ -282,18 +282,17 @@ def __matmul__(self, other):
             "The @ operator has been removed in DataJoint 2.0. " "Use .join(other, semantic_check=False) for permissive joins."
         )

-    def join(self, other, semantic_check=True, left=False, _aggregation=False):
+    def join(self, other, semantic_check=True, left=False):
         """
         Create the joined QueryExpression.

         Uses semantic matching: only attributes with the same name AND the same
         lineage (homologous namesakes) are used for joining.

         :param other: QueryExpression to join with
-        :param semantic_check: If True (default), raise error on non-homologous namesakes.
-            If False, bypass semantic check (use for legacy compatibility).
+        :param semantic_check: If True (default), raise error on non-homologous namesakes
+            and enforce left join A → B constraint. If False, bypass these checks.
         :param left: If True, perform a left join retaining all rows from self.
-        :param _aggregation: Internal flag to bypass left join validation for aggregation.

         Examples:
         a * b is short for a.join(b)
@@ -337,10 +336,10 @@ def join(self, other, semantic_check=True, left=False, _aggregation=False):
         result._connection = self.connection
         result._support = self.support + other.support
         result._left = self._left + [left] + other._left
-        result._heading = self.heading.join(other.heading, left=left, _aggregation=_aggregation)
+        result._heading = self.heading.join(other.heading, left=left, semantic_check=semantic_check)
         result._restriction = AndList(self.restriction)
         result._restriction.append(other.restriction)
-        result._original_heading = self.original_heading.join(other.original_heading, left=left, _aggregation=_aggregation)
+        result._original_heading = self.original_heading.join(other.original_heading, left=left, semantic_check=semantic_check)
         assert len(result.support) == len(result._left) + 1
         return result

@@ -684,8 +683,8 @@ def create(cls, arg, group, keep_all_rows=False):

         if keep_all_rows and len(group.support) > 1 or group.heading.new_attributes:
             group = group.make_subquery()  # subquery if left joining a join
-        # Pass _aggregation=True to bypass left join validation (aggregation resets PK via GROUP BY)
-        join = arg.join(group, left=keep_all_rows, _aggregation=True)
+        # Use semantic_check=False to bypass left join A → B validation (aggregation resets PK via GROUP BY)
+        join = arg.join(group, semantic_check=False, left=keep_all_rows)
         result = cls()
         result._connection = join.connection
         result._heading = join.heading.set_primary_key(arg.primary_key)  # use left operand's primary key

src/datajoint/heading.py

Lines changed: 26 additions & 20 deletions
@@ -468,7 +468,7 @@ def select(self, select_list, rename_map=None, compute_map=None):
         )
         return Heading(chain(copy_attrs, compute_attrs))

-    def join(self, other, left=False, _aggregation=False):
+    def join(self, other, left=False, semantic_check=True):
         """
         Join two headings into a new one.

@@ -480,22 +480,22 @@ def join(self, other, left=False, _aggregation=False):
         A → B holds iff every attribute in PK(B) is either in PK(A) or secondary in A.
         B → A holds iff every attribute in PK(A) is either in PK(B) or secondary in B.

-        For left joins (left=True), A → B is required. Otherwise, the result would not
-        have a valid primary key because:
+        For left joins (left=True), A → B is required by default. Otherwise, the result
+        would not have a valid primary key because:
         - Unmatched rows from A have NULL values for B's attributes
         - If B → A or Neither, the PK would include B's attributes, which could be NULL
         - Only when A → B does PK(A) uniquely identify all result rows

-        Exception: Aggregation (A.aggr(B, keep_all_rows=True)) uses a left join internally
-        but requires B → A instead. This is valid because the GROUP BY clause resets the
-        primary key to PK(A), which consists of non-NULL values from the left operand.
+        When semantic_check=False for left joins where A → B doesn't hold, the constraint
+        is bypassed and PK = PK(A) ∪ PK(B) is used. This is useful for aggregation, where
+        the GROUP BY clause resets the primary key afterward.

         It assumes that self and other are headings that share no common dependent attributes.

         :param other: The other heading to join with
-        :param left: If True, this is a left join (requires A → B unless _aggregation=True)
-        :param _aggregation: If True, skip left join validation (used by Aggregation.create)
-        :raises DataJointError: If left=True and A does not determine B (unless _aggregation)
+        :param left: If True, this is a left join (requires A → B unless semantic_check=False)
+        :param semantic_check: If False, bypass left join A → B validation (PK becomes union)
+        :raises DataJointError: If left=True, semantic_check=True, and A does not determine B
         """
         from .errors import DataJointError

@@ -507,17 +507,23 @@ def join(self, other, left=False, _aggregation=False):
             name in other.primary_key or name in other.secondary_attributes for name in self.primary_key
         )

-        # For left joins, require A → B (unless this is an aggregation context)
-        if left and not _aggregation and not self_determines_other:
-            missing = [
-                name for name in other.primary_key if name not in self.primary_key and name not in self.secondary_attributes
-            ]
-            raise DataJointError(
-                f"Left join requires the left operand to determine the right operand (A → B). "
-                f"The following attributes from the right operand's primary key are not "
-                f"determined by the left operand: {missing}. "
-                f"Use an inner join or restructure the query."
-            )
+        # For left joins, require A → B unless semantic_check=False
+        if left and not self_determines_other:
+            if semantic_check:
+                missing = [
+                    name
+                    for name in other.primary_key
+                    if name not in self.primary_key and name not in self.secondary_attributes
+                ]
+                raise DataJointError(
+                    f"Left join requires the left operand to determine the right operand (A → B). "
+                    f"The following attributes from the right operand's primary key are not "
+                    f"determined by the left operand: {missing}. "
+                    f"Use an inner join, restructure the query, or use semantic_check=False."
+                )
+            else:
+                # Bypass: use union of PKs (will be reset by caller, e.g., aggregation)
+                other_determines_self = False  # Force the "Neither" case

         seen = set()
         result_attrs = []
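
The primary-key rule that this hunk changes can be sketched outside DataJoint with plain lists of attribute names. This is only an illustration under stated assumptions: `determines` and `join_primary_key` are hypothetical helpers, `ValueError` stands in for `DataJointError`, and the ordering of the three cases (A → B gives PK(A), B → A gives PK(B), otherwise the union) is inferred from the docstring and comments in the diff rather than copied from the full method.

```python
def determines(source_pk, source_secondary, target_pk):
    """A → B: every attribute of PK(B) is in PK(A) or is secondary in A."""
    return all(name in source_pk or name in source_secondary for name in target_pk)


def join_primary_key(a_pk, a_secondary, b_pk, b_secondary, left=False, semantic_check=True):
    """Hypothetical stand-in for the PK selection inside Heading.join."""
    a_determines_b = determines(a_pk, a_secondary, b_pk)
    b_determines_a = determines(b_pk, b_secondary, a_pk)

    if left and not a_determines_b:
        if semantic_check:
            missing = [n for n in b_pk if n not in a_pk and n not in a_secondary]
            # ValueError is a placeholder for DataJointError in this sketch.
            raise ValueError(f"Left join requires A → B; undetermined attributes: {missing}")
        # Bypass: force the "Neither" case so the union of both PKs is used.
        b_determines_a = False

    if a_determines_b:
        return list(a_pk)
    if b_determines_a:
        return list(b_pk)
    return list(a_pk) + [n for n in b_pk if n not in a_pk]


# (primary key, secondary attributes) for the Session/Trial example
session = (["session_id"], [])
trial = (["session_id", "trial_num"], ["response_time"])

inner_pk = join_primary_key(*session, *trial)
bypass_pk = join_primary_key(*session, *trial, left=True, semantic_check=False)
reverse_left_pk = join_primary_key(*trial, *session, left=True)

try:
    join_primary_key(*session, *trial, left=True)
    strict_error = None
except ValueError as err:
    strict_error = str(err)
```

With the default `semantic_check=True`, the Session-to-Trial left join is rejected because `trial_num` is not determined by Session. With `semantic_check=False` the sketch returns the union of both keys, which a caller such as aggregation is then expected to reset.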
