Add D9: Aggregation constraint and rules

claude · claude · commit 5432141b8571 · 2025-12-24T00:11:29.000Z
For A.aggr(B, ...):
- Every attribute in PK(A) must have a homologous namesake in B
- This ensures non-overlapping aggregation groups
- Result PK = PK(A) always
- Same constraint applies for left join aggregation
diff --git a/docs/SPEC-semantic-matching.md b/docs/SPEC-semantic-matching.md
@@ -73,6 +73,7 @@ Semantic matching applies to all binary operations that match attributes between
 | `A * B` | Join | Matches on homologous namesakes |
 | `A & B` | Restriction | Matches on homologous namesakes |
 | `A - B` | Anti-restriction | Matches on homologous namesakes |
+| `A.aggr(B, ...)` | Aggregation | Matches on homologous namesakes |
 
 Note: `A - B` is the negated form of restriction (equivalent to `A & ~B`), not a true set difference.
 
@@ -139,6 +140,36 @@ PK(A * B) =
    - Both PK(A) ⊆ J and PK(B) ⊆ J
    - Result PK = PK(A) = {a} (left operand wins)
 
+### Aggregation Rules
+
+For `A.aggr(B, ...)`, semantic matching applies with an additional constraint.
+
+**Primary key**: PK(result) = PK(A) — always the left operand's primary key.
+
+**Constraint**: Every attribute in PK(A) must have a homologous namesake in B.
+
+This ensures that each B tuple belongs to exactly one A entity, so aggregation groups are non-overlapping. If B is missing any part of A's primary key, a B tuple could match multiple A tuples and be counted in multiple aggregates.
+
+**Left join aggregation**: The same constraint applies when using `A.aggr(B, ..., left=True)`. The left join allows A tuples with no matching B tuples to appear in the result (with NULL aggregates), but the grouping constraint remains: B must contain A's complete primary key.
+
+**Example**:
+```python
+# Valid: Session.aggr(Spike, count="count(*)")
+# Session PK = {subject_id, session_id}
+# Spike has {subject_id, session_id, spike_id, ...}
+# Spike contains Session's entire PK ✓
+
+# Invalid: Subject.aggr(Session, count="count(*)")
+# Subject PK = {subject_id}
+# Session has {subject_id, session_id, ...}
+# Session contains Subject's PK ✓ — this is actually valid!
+
+# Invalid: Session.aggr(Subject, ...)
+# Session PK = {subject_id, session_id}
+# Subject has {subject_id, ...}
+# Subject is missing session_id from Session's PK ✗
+```
+
 ## Current Implementation Analysis
 
 ### Attribute Representation (`heading.py:48`)
@@ -655,6 +686,16 @@ WHERE c.contype = 'f'
 
 **Rationale**: Based on Armstrong's axioms. If PK(B) ⊆ J, then PK(A) → J → PK(B) by transitivity, so PK(A) alone determines all result attributes. The union rule is only needed when neither PK is fully covered by the join.
 
+### D9: Aggregation Constraint
+
+**Decision**: For `A.aggr(B, ...)`, require that every attribute in PK(A) has a homologous namesake in B.
+
+**Primary key**: PK(result) = PK(A) — always.
+
+**Rationale**: This ensures non-overlapping aggregation groups. Each B tuple belongs to exactly one A entity, preventing double-counting.
+
+**Left join**: The same constraint applies for `A.aggr(B, ..., left=True)`. A tuples with no matching B tuples appear with NULL aggregates, but the grouping constraint remains.
+
 ## Testing Strategy
 
 1. **Unit tests** for lineage propagation through all query operations
@@ -700,6 +741,7 @@ Semantic matching is a significant change to DataJoint's join semantics that imp
 | **D6**: `@` operator | Deprecated - use `.join(semantic_check=False)` |
 | **D7**: Migration | Utility function + automatic fallback computation |
 | **D8**: PK formation | Functional dependency analysis; left operand wins ties; non-commutative |
+| **D9**: Aggregation | B must contain A's entire PK; result PK = PK(A); applies to left join too |
 
 ### Compatibility