Commit 496e014

Add left join constraint requiring A → B for valid primary key
Left joins (A.join(B, left=True)) can produce NULL values for attributes from B when rows in A have no matching rows in B. This would result in NULL primary key values if B's primary key attributes are included in the result's primary key.

To prevent this, left joins now require A → B (the left operand must functionally determine the right operand). This ensures PK = PK(A), which consists entirely of non-NULL values from the left operand.

Changes:
- heading.join() now accepts 'left' parameter and validates A → B
- expression.py passes 'left' parameter to heading.join()
- Added tests for left join constraint validation
- Updated spec with left join rules and rationale

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
1 parent 5810073 commit 496e014

4 files changed: +151 -4 lines changed
docs/src/design/semantic-matching-spec.md

Lines changed: 39 additions & 0 deletions
@@ -302,6 +302,45 @@ With these rules, join is **not commutative** in terms of:

 The **result set** (the actual rows returned) remains the same regardless of order, but the **schema** (primary key and attribute order) may differ.

+### Left Join Constraint
+
+For left joins (`A.join(B, left=True)`), the functional dependency **A → B is required**.
+
+**Why this constraint exists:**
+
+In a left join, all rows from A are retained even if there's no matching row in B. For unmatched rows, B's attributes are NULL. This creates a problem for primary key validity:
+
+| Scenario | PK by inner join rule | Left join problem |
+|----------|----------------------|-------------------|
+| A → B | PK(A) | ✅ Safe — A's attrs always present |
+| B → A | PK(B) | ❌ B's PK attrs could be NULL |
+| Neither | PK(A) ∪ PK(B) | ❌ B's PK attrs could be NULL |
+
+**Example of invalid left join:**
+```
+A: x*, y*        PK(A) = {x, y}
+B: x*, z*, y     PK(B) = {x, z}, y is secondary
+
+Inner join: PK = {x, z} (B → A rule)
+Left join attempt: FAILS because z could be NULL for unmatched A rows
+```
+
+**Valid left join example:**
+```
+Session: session_id*, date
+Trial: session_id*, trial_num*, stimulus (references Session)
+
+Trial.join(Session, left=True)  # OK: Trial → Session (session_id is in PK(Trial))
+# PK = {session_id, trial_num}, all trials retained
+```
+
+**Error message:**
+```
+DataJointError: Left join requires the left operand to determine the right operand (A → B).
+The following attributes from the right operand's primary key are not determined by
+the left operand: ['z']. Use an inner join or restructure the query.
+```
+
 ## Universal Set `dj.U`

 `dj.U()` or `dj.U('attr1', 'attr2', ...)` represents the universal set of all possible values and lineages.
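
The NULL problem described in the spec can be reproduced outside DataJoint. The following is a minimal sketch using Python's standard `sqlite3` module rather than DataJoint itself. The row values are made up for illustration, and joining on the shared attributes `x` and `y` is an assumption about how the two tables from the invalid example would be matched.

```python
import sqlite3

# Tables mirror the "invalid left join" example: PK(A) = {x, y}, PK(B) = {x, z}.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE A (x INTEGER, y INTEGER, PRIMARY KEY (x, y));
    CREATE TABLE B (x INTEGER, z INTEGER, y INTEGER, PRIMARY KEY (x, z));
    INSERT INTO A VALUES (1, 10), (2, 20);
    INSERT INTO B VALUES (1, 100, 10);  -- matches only the first row of A
    """
)

# Left join on the shared attributes; every row of A is retained.
rows = conn.execute(
    """
    SELECT A.x, A.y, B.z
    FROM A LEFT JOIN B ON A.x = B.x AND A.y = B.y
    ORDER BY A.x
    """
).fetchall()
conn.close()

# Rows of A without a partner in B carry NULL (None) in B's key attribute z.
unmatched = [row for row in rows if row[2] is None]
print(rows)       # [(1, 10, 100), (2, 20, None)]
print(unmatched)  # [(2, 20, None)]
```

Because `z` is `None` for the unmatched row, a result primary key of `{x, z}` would contain a NULL, which is the situation the new constraint rejects.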

src/datajoint/expression.py

Lines changed: 2 additions & 2 deletions
@@ -336,10 +336,10 @@ def join(self, other, semantic_check=True, left=False):
         result._connection = self.connection
         result._support = self.support + other.support
         result._left = self._left + [left] + other._left
-        result._heading = self.heading.join(other.heading)
+        result._heading = self.heading.join(other.heading, left=left)
         result._restriction = AndList(self.restriction)
         result._restriction.append(other.restriction)
-        result._original_heading = self.original_heading.join(other.original_heading)
+        result._original_heading = self.original_heading.join(other.original_heading, left=left)
         assert len(result.support) == len(result._left) + 1
         return result

src/datajoint/heading.py

Lines changed: 26 additions & 2 deletions
@@ -468,7 +468,7 @@ def select(self, select_list, rename_map=None, compute_map=None):
         )
         return Heading(chain(copy_attrs, compute_attrs))

-    def join(self, other):
+    def join(self, other, left=False):
         """
         Join two headings into a new one.

@@ -480,8 +480,20 @@ def join(self, other):
         A → B holds iff every attribute in PK(B) is either in PK(A) or secondary in A.
         B → A holds iff every attribute in PK(A) is either in PK(B) or secondary in B.

+        For left joins (left=True), A → B is required. Otherwise, the result would not
+        have a valid primary key because:
+        - Unmatched rows from A have NULL values for B's attributes
+        - If B → A or Neither, the PK would include B's attributes, which could be NULL
+        - Only when A → B does PK(A) uniquely identify all result rows
+
         It assumes that self and other are headings that share no common dependent attributes.
+
+        :param other: The other heading to join with
+        :param left: If True, this is a left join (requires A → B)
+        :raises DataJointError: If left=True and A does not determine B
         """
+        from .errors import DataJointError
+
         # Check functional dependencies
         self_determines_other = all(
             name in self.primary_key or name in self.secondary_attributes for name in other.primary_key
@@ -490,10 +502,22 @@ def join(self, other):
             name in other.primary_key or name in other.secondary_attributes for name in self.primary_key
         )

+        # For left joins, require A → B
+        if left and not self_determines_other:
+            missing = [
+                name for name in other.primary_key if name not in self.primary_key and name not in self.secondary_attributes
+            ]
+            raise DataJointError(
+                f"Left join requires the left operand to determine the right operand (A → B). "
+                f"The following attributes from the right operand's primary key are not "
+                f"determined by the left operand: {missing}. "
+                f"Use an inner join or restructure the query."
+            )
+
         seen = set()
         result_attrs = []

-        if self_determines_other:
+        if left or self_determines_other:
             # A → B: use PK(A), A's attributes first
             # 1. All of A's PK attrs (as PK)
             for name in self.primary_key:
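
The primary key rule that these hunks extend can be sketched in plain Python. This is a simplified stand-in, not the actual `Heading.join()` implementation: `determines` and `join_primary_key` are hypothetical helper names, headings are reduced to lists of attribute names, `ValueError` stands in for `DataJointError`, and the attribute order of PK(A) ∪ PK(B) in the "neither" case is an assumption.

```python
def determines(a_pk, a_secondary, b_pk):
    """A → B: every attribute of PK(B) is in PK(A) or is secondary in A."""
    return all(name in a_pk or name in a_secondary for name in b_pk)


def join_primary_key(a_pk, a_secondary, b_pk, b_secondary, left=False):
    """Return the primary key of A joined with B under the rules in the docstring."""
    a_to_b = determines(a_pk, a_secondary, b_pk)
    b_to_a = determines(b_pk, b_secondary, a_pk)

    # Left joins are rejected unless A → B, mirroring the new check.
    if left and not a_to_b:
        missing = [n for n in b_pk if n not in a_pk and n not in a_secondary]
        raise ValueError(f"left join requires A → B; undetermined: {missing}")

    if a_to_b:
        return list(a_pk)  # PK(A)
    if b_to_a:
        return list(b_pk)  # PK(B)
    # Neither direction holds: PK(A) ∪ PK(B) (ordering assumed here).
    return list(a_pk) + [n for n in b_pk if n not in a_pk]


# A: x*, y*   B: x*, z*, y   (the spec's invalid left join example)
print(join_primary_key(["x", "y"], [], ["x", "z"], ["y"]))  # ['x', 'z']

try:
    join_primary_key(["x", "y"], [], ["x", "z"], ["y"], left=True)
except ValueError as err:
    print(err)  # left join requires A → B; undetermined: ['z']
```

With `left=True` the function either raises or returns PK(A), which is why the `left or self_determines_other` branch in the diff can take the A → B path unconditionally once the check has passed.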

tests/test_semantic_matching.py

Lines changed: 84 additions & 0 deletions
@@ -670,3 +670,87 @@ def test_pk_attributes_come_first(self, schema_pk_rules):

         if secondary_indices:  # If there are secondary attributes
             assert max(pk_indices) < min(secondary_indices)
+
+
+class TestLeftJoinConstraint:
+    """
+    Test that left joins require A → B (left operand determines right operand).
+
+    For left joins, B's attributes could be NULL for unmatched rows, so the PK
+    must be PK(A) only. This is only valid when A → B.
+    """
+
+    def test_left_join_valid_when_a_determines_b(self, schema_pk_rules):
+        """
+        Left join should work when A → B.
+
+        A: x*, y*, z    PK(A) = {x, y}, z is secondary
+        B: y*, z*, x    PK(B) = {y, z}, x is secondary
+
+        A → B? z secondary in A → Yes
+        Left join is valid, PK = {x, y}
+        """
+        TableXYwithZ = schema_pk_rules["TableXYwithZ"]
+        TableYZwithX = schema_pk_rules["TableYZwithX"]
+
+        # This should work - A → B holds
+        result = TableXYwithZ().join(TableYZwithX(), left=True)
+
+        # PK should be PK(A) = {x, y}
+        assert set(result.primary_key) == {"x", "y"}
+
+    def test_left_join_fails_when_b_determines_a_only(self, schema_pk_rules):
+        """
+        Left join should fail when only B → A (not A → B).
+
+        A: x*, y*       PK(A) = {x, y}
+        B: x*, z*, y    PK(B) = {x, z}, y is secondary
+
+        A → B? z not in PK(A) and z not secondary in A → No
+        B → A? y secondary in B → Yes
+
+        Left join is invalid because z would need to be in PK but could be NULL.
+        """
+        TableXY = schema_pk_rules["TableXY"]
+        TableXZwithY = schema_pk_rules["TableXZwithY"]
+
+        # This should fail - A → B does not hold
+        with pytest.raises(DataJointError) as exc_info:
+            TableXY().join(TableXZwithY(), left=True)
+
+        assert "Left join requires" in str(exc_info.value)
+        assert "A → B" in str(exc_info.value) or "determine" in str(exc_info.value)
+
+    def test_left_join_fails_when_neither_direction(self, schema_pk_rules):
+        """
+        Left join should fail when neither A → B nor B → A.
+
+        A: x*, y*    PK(A) = {x, y}
+        B: z*, x     PK(B) = {z}, x is secondary
+
+        A → B? z not in A → No
+        B → A? y not in B → No
+
+        Left join is invalid.
+        """
+        TableXY = schema_pk_rules["TableXY"]
+        TableZ = schema_pk_rules["TableZ"]
+
+        # This should fail - A → B does not hold
+        with pytest.raises(DataJointError) as exc_info:
+            TableXY().join(TableZ(), left=True)
+
+        assert "Left join requires" in str(exc_info.value)
+
+    def test_inner_join_still_works_when_b_determines_a(self, schema_pk_rules):
+        """
+        Inner join should still work normally when B → A (even though left join fails).
+        """
+        TableXY = schema_pk_rules["TableXY"]
+        TableXZwithY = schema_pk_rules["TableXZwithY"]
+
+        # Inner join should work - B → A applies
+        result = TableXY * TableXZwithY
+
+        # PK should be {x, z} (B's PK)
+        assert set(result.primary_key) == {"x", "z"}
