Add functional dependency check for aggregation operator

claude · dimitri-yatsenko · claude · commit d68de161ad14 · 2025-12-25T01:10:51.000Z
In A.aggr(B, ...), ensure that every entry in B matches exactly one entry in A:
- B must have all of A's primary key attributes
- Primary key attributes must be homologous (same lineage)
- Clear error messages for missing attributes or non-homologous lineage
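
As a rough illustration of the two rules above, here is a minimal standalone sketch that models headings as plain dicts mapping attribute names to lineage strings. The function name `check_functional_dependency`, its parameters, the use of `ValueError`, and the `lab.*` lineage strings are all hypothetical and are not part of DataJoint; the actual check in this commit lives in `Aggregation.create()` and raises `DataJointError`.

```python
def check_functional_dependency(a_pk, a_lineage, b_lineage, a_is_universal=False):
    """Hypothetical sketch: raise ValueError unless B can be aggregated over A.

    a_pk: list of A's primary key attribute names
    a_lineage, b_lineage: dicts mapping attribute name -> lineage string
    a_is_universal: True when A plays the role of dj.U (lineage check skipped)
    """
    # Rule 1: B must contain every primary key attribute of A.
    missing = set(a_pk) - set(b_lineage)
    if missing:
        raise ValueError(f"missing primary key attributes: {sorted(missing)}")

    # A universal set is treated as homologous to any namesake attribute.
    if a_is_universal:
        return

    # Rule 2: namesake primary key attributes must share the same lineage.
    for name in a_pk:
        if a_lineage[name] != b_lineage[name]:
            raise ValueError(
                f"attribute {name!r} has different lineages: "
                f"{a_lineage[name]} vs {b_lineage[name]}"
            )


# Illustrative headings (assumed lineage strings, not taken from a real schema).
session = {"session_id": "lab.session.session_id"}
trial = {"session_id": "lab.session.session_id", "trial_id": "lab.trial.trial_id"}

check_functional_dependency(["session_id"], session, trial)  # passes silently
```

Passing a heading without `session_id`, or one whose `session_id` traces to a different table, would raise in this sketch, mirroring the two error cases described above.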

Updated docstrings for:
- Aggregation.create()
- QueryExpression.aggr()
- U.aggr()

Updated spec document with:
- Functional dependency requirements
- Error message examples
- Additional test cases

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
diff --git a/docs/src/design/semantic-matching-spec.md b/docs/src/design/semantic-matching-spec.md
@@ -136,7 +136,7 @@ Semantic matching applies to all binary operations that match attributes:
 | `A * B` | Join | Matches on homologous namesakes |
 | `A & B` | Restriction | Matches on homologous namesakes |
 | `A - B` | Anti-restriction | Matches on homologous namesakes |
-| `A.aggr(B, ...)` | Aggregation | Matches on homologous namesakes |
+| `A.aggr(B, ...)` | Aggregation | Requires functional dependency (see below) |
 
 ### The `.join()` Method
 
@@ -438,8 +438,27 @@ def proj(self, *attributes, **named_attributes):
 
 ### Aggregation (`aggr`)
 
-Aggregation creates a new expression with:
-- Group attributes retain their lineage from the group operand
+In `A.aggr(B, ...)`, entries from B are grouped by A's primary key and aggregate functions are computed.
+
+**Functional Dependency Requirement**: Every entry in B must match exactly one entry in A. This requires:
+
+1. **B must have all of A's primary key attributes**: If A's primary key is `(a, b)`, then B must contain attributes named `a` and `b`.
+
+2. **Primary key attributes must be homologous**: The namesake attributes in B must have the same lineage as in A. This ensures they represent the same entity.
+
+```python
+# Valid: Session.aggr(Trial, ...) where Trial has session_id from Session
+Session.aggr(Trial, n='count(*)')  # OK - Trial.session_id traces to Session.session_id
+
+# Invalid: Missing primary key attribute
+Session.aggr(Stimulus, n='count(*)')  # Error if Stimulus lacks session_id
+
+# Invalid: Non-homologous primary key
+TableA.aggr(TableB, n='count(*)')  # Error if TableB.id has different lineage than TableA.id
+```
+
+**Result lineage**:
+- Group attributes retain their lineage from the grouping expression (A)
 - Aggregated attributes have `lineage=None` (they are computations)
 
 ### Union (`+`)
@@ -470,6 +489,22 @@ DataJointError: dj.U(...) * table is deprecated in DataJoint 2.0.
 Use dj.U(...) & table instead.
 ```
 
+### Aggregation Missing Primary Key
+
+```
+DataJointError: Aggregation requires functional dependency: `group` must have all primary key
+attributes of the grouping expression. Missing: {'session_id'}.
+Use .proj() to add the missing attributes or verify the schema design.
+```
+
+### Aggregation Non-Homologous Primary Key
+
+```
+DataJointError: Aggregation requires homologous primary key attributes.
+Attribute `id` has different lineages: university.student.id (grouping) vs university.course.id (group).
+Use .proj() to rename one of the attributes or .join(semantic_check=False) in a manual aggregation.
+```
+
 ## Testing Strategy
 
 ### Unit Tests
@@ -496,6 +531,12 @@ Use dj.U(...) & table instead.
    - `dj.U - table` raises error
    - `dj.U * table` raises deprecation error
 
+5. **Aggregation functional dependency**:
+   - `A.aggr(B)` works when B has all of A's PK attributes with same lineage
+   - `A.aggr(B)` raises error when B is missing PK attributes
+   - `A.aggr(B)` raises error when PK attributes have different lineage
+   - `dj.U('a', 'b').aggr(B)` works when B has `a` and `b` attributes
+
 ### Integration Tests
 
 1. **Schema migration**: Existing schema gets `~lineage` table populated correctly
diff --git a/src/datajoint/expression.py b/src/datajoint/expression.py
@@ -464,13 +464,20 @@ def proj(self, *attributes, **named_attributes):
 
     def aggr(self, group, *attributes, keep_all_rows=False, **named_attributes):
         """
-        Aggregation of the type U('attr1','attr2').aggr(group, computation="QueryExpression")
-        has the primary key ('attr1','attr2') and performs aggregation computations for all matching elements of `group`.
+        Aggregate `group` over the primary key of `self`.
 
-        :param group:  The query expression to be aggregated.
-        :param keep_all_rows: True=keep all the rows from self. False=keep only rows that match entries in group.
-        :param named_attributes: computations of the form new_attribute="sql expression on attributes of group"
-        :return: The derived query expression
+        In A.aggr(B, ...), groups entries from B by the primary key of A and computes
+        aggregate functions. Requires functional dependency: every entry in B must match
+        exactly one entry in A. This means B must have all of A's primary key attributes
+        as homologous namesakes (same name AND same lineage).
+
+        :param group: the query expression to aggregate (B in A.aggr(B))
+        :param attributes: attributes from self to include in the result
+        :param keep_all_rows: True=keep all rows from self (left join). False=keep only matching rows.
+        :param named_attributes: aggregation computations, e.g., count='count(*)', avg_val='avg(value)'
+        :return: query expression with self's primary key and the computed aggregations
+        :raises DataJointError: if group is missing primary key attributes from self,
+            or if namesake primary key attributes have different lineages
         """
         if Ellipsis in attributes:
             # expand ellipsis to include only attributes from the left table
@@ -631,9 +638,47 @@ class Aggregation(QueryExpression):
 
     @classmethod
     def create(cls, arg, group, keep_all_rows=False):
+        """
+        Create an aggregation expression.
+
+        For A.aggr(B, ...), ensures functional dependency: every entry in B must match
+        exactly one entry in A. This requires B to have all of A's primary key attributes
+        as homologous namesakes (same name AND same lineage).
+
+        :param arg: the grouping expression (A in A.aggr(B))
+        :param group: the expression to aggregate (B in A.aggr(B))
+        :param keep_all_rows: if True, keep all rows from arg (left join behavior)
+        :raises DataJointError: if group is missing any primary key attributes from arg,
+            or if namesake attributes have different lineages
+        """
         if inspect.isclass(group) and issubclass(group, QueryExpression):
             group = group()  # instantiate if a class
         assert isinstance(group, QueryExpression)
+
+        # Check functional dependency: group must have all of arg's primary key attributes
+        missing_pk = set(arg.primary_key) - set(group.heading.names)
+        if missing_pk:
+            raise DataJointError(
+                f"Aggregation requires functional dependency: `group` must have all primary key "
+                f"attributes of the grouping expression. Missing: {missing_pk}. "
+                f"Use .proj() to add the missing attributes or verify the schema design."
+            )
+
+        # Check that primary key attributes are homologous (same lineage)
+        # This is done for QueryExpression args; U is always compatible
+        if not isinstance(arg, U):
+            for attr_name in arg.primary_key:
+                arg_lineage = arg.heading[attr_name].lineage
+                group_lineage = group.heading[attr_name].lineage
+                if arg_lineage != group_lineage:
+                    raise DataJointError(
+                        f"Aggregation requires homologous primary key attributes. "
+                        f"Attribute `{attr_name}` has different lineages: "
+                        f"{arg_lineage} (grouping) vs {group_lineage} (group). "
+                        f"Use .proj() to rename one of the attributes or "
+                        f".join(semantic_check=False) in a manual aggregation."
+                    )
+
         if keep_all_rows and len(group.support) > 1 or group.heading.new_attributes:
             group = group.make_subquery()  # subquery if left joining a join
         join = arg.join(group, left=keep_all_rows)  # reuse the join logic
@@ -853,12 +898,17 @@ def __sub__(self, other):
 
     def aggr(self, group, **named_attributes):
         """
-        Aggregation of the type U('attr1','attr2').aggr(group, computation="QueryExpression")
-        has the primary key ('attr1','attr2') and performs aggregation computations for all matching elements of `group`.
+        Aggregate `group` over the attributes of this universal set.
+
+        In dj.U('attr1', 'attr2').aggr(B, ...), groups entries from B by attr1 and attr2
+        and computes aggregate functions. Requires B to have all specified attributes.
+        Since dj.U is homologous to any namesake attribute, lineage compatibility is
+        always satisfied.
 
-        :param group:  The query expression to be aggregated.
-        :param named_attributes: computations of the form new_attribute="sql expression on attributes of group"
-        :return: The derived query expression
+        :param group: the query expression to aggregate
+        :param named_attributes: aggregation computations, e.g., count='count(*)', avg_val='avg(value)'
+        :return: query expression with U's attributes as primary key and the computed aggregations
+        :raises DataJointError: if group is missing any of U's primary key attributes
         """
         if named_attributes.get("keep_all_rows", False):
             raise DataJointError("Cannot set keep_all_rows=True when aggregating on a universal set.")