This document specifies semantic matching for joins in DataJoint 2.0, replacing the current name-based matching rules. Semantic matching ensures that attributes are only matched when they share both the same name and the same lineage (origin), preventing accidental joins on unrelated attributes that happen to share names.
- Prevent incorrect joins on attributes that share names but represent different entities
- Enable valid joins that are currently blocked due to overly restrictive rules
- Maintain backward compatibility for well-designed schemas
- Provide clear error messages when semantic conflicts are detected
The current join implementation matches attributes purely by name:
join_attributes = set(n for n in self.heading.names if n in other.heading.names)This is essentially a SQL NATURAL JOIN - any attributes with the same name in both tables are used for matching. The only constraint is the "join compatibility" check which prevents joining on secondary attributes that appear in both tables.
Location: src/datajoint/expression.py:301 and src/datajoint/condition.py:100-120
Consider two tables:
Student(id, name)- whereidis the student's primary keyCourse(id, instructor)- whereidis the course's primary key
Current behavior: Student * Course joins on id, producing meaningless results where student IDs are matched with course IDs.
Desired behavior: An error should be raised because Student.id and Course.id have different origins (lineages).
Consider two tables:
FavoriteCourse(student_id*, course_id)- student's favorite course (course_id is secondary, with FK to Course)DependentCourse(dep_course_id*, course_id)- course dependencies (course_id is secondary, with FK to Course)
Current behavior: FavoriteCourse * DependentCourse is rejected because course_id is a secondary attribute in both tables.
Desired behavior: The join should proceed because both course_id attributes share the same lineage (tracing to Course.course_id).
| Term | Definition |
|---|---|
| Homologous attributes | Attributes with the same lineage (whether or not they have the same name) |
| Namesake attributes | Attributes with the same name (whether or not they have the same lineage) |
| Homologous namesakes | Attributes with the same name AND the same lineage — used for join matching |
| Non-homologous namesakes | Attributes with the same name BUT different lineage — cause join errors |
Lineage identifies the origin of an attribute - where it was first defined. It is represented as a string in the format:
schema_name.table_name.attribute_name
Note: table_name refers to the actual database table name, not the Python class name. DataJoint converts class names (CamelCase) to table names (snake_case) with tier prefixes:
Session→session(manual table)#SessionType→#session_type(lookup table)_ProcessingTask→_processing_task(imported table)__ProcessedData→__processed_data(computed table)
-
Native primary key attributes have lineage:
lineage = "this_schema.this_table.attr_name"The table where they are originally defined.
-
Attributes inherited via foreign key retain their origin lineage:
lineage = "parent_schema.parent_table.attr_name"Traced to the original definition through the FK chain.
-
Native secondary attributes do NOT have lineage:
lineage = NoneSecondary attributes defined directly (not via FK) cannot be used for join matching.
Lineage propagates through:
-
Foreign key references: Inherited attributes retain their origin lineage regardless of whether they end up as primary or secondary in the referencing table.
-
Query expressions:
- Projections preserve lineage for included attributes
- Renamed attributes (
new_name='old_name') preserve lineage - Computed attributes (
result='expr') have no lineage
| Scenario | Same Name | Same Lineage | Action |
|---|---|---|---|
| Homologous namesakes | Yes | Yes | Match - use for join |
| Non-homologous namesakes (both have lineage) | Yes | No (different) | Error |
| Non-homologous namesakes (both null lineage) | Yes | No (both null) | Error |
| Non-homologous namesakes (one null) | Yes | No (one null) | Error |
| Different names | No | - | No match |
When non-homologous namesakes are detected, users must resolve the conflict using .proj() to rename one of the colliding attributes:
# Error: Student.id and Course.id have different lineages
Student * Course # DataJointError!
# Resolution: rename one attribute
Student * Course.proj(course_id='id') # OKSemantic matching applies to all binary operations that match attributes:
| Operator | Operation | Semantic Matching |
|---|---|---|
A * B |
Join | Matches on homologous namesakes |
A & B |
Restriction | Matches on homologous namesakes |
A - B |
Anti-restriction | Matches on homologous namesakes |
A.aggr(B, ...) |
Aggregation | Requires functional dependency (see below) |
The .join() method provides additional control:
# Default: semantic checking enabled (same as *)
result = A.join(B)
# Bypass semantic check for legacy compatibility
result = A.join(B, semantic_check=False)The @ operator (permissive join) is removed in DataJoint 2.0:
# Old (deprecated):
A @ B # Raises DataJointError with migration guidance
# New:
A.join(B, semantic_check=False) # Explicit bypassThe error message directs users to the explicit .join() method.
In DataJoint, the result of each query operator produces a valid entity set with a well-defined entity type and primary key. This section specifies how the primary key is determined for each relational operator.
The primary key of a query result identifies unique entities in that result. For most operators, the primary key is preserved from the left operand. For joins, the primary key depends on the functional dependencies between the operands.
In the examples below, * marks primary key attributes:
A(x*, y*, z)means A has primary key{x, y}and secondary attributezA → Bmeans "A determines B" (defined below)
| Operator | Primary Key Rule |
|---|---|
A & B (restriction) |
PK(A) — preserved from left operand |
A - B (anti-restriction) |
PK(A) — preserved from left operand |
A.proj(...) (projection) |
PK(A) — preserved from left operand |
A.aggr(B, ...) (aggregation) |
PK(A) — preserved from left operand |
A * B (join) |
Depends on functional dependencies (see below) |
The join operator requires special handling because it combines two entity sets. The primary key of A * B depends on the functional dependency relationship between the operands.
A determines B (written A → B): Every attribute in PK(B) is either already in PK(A) or is a secondary attribute in A.
A → B iff ∀b ∈ PK(B): b ∈ PK(A) OR b ∈ secondary(A)
Intuitively, A → B means that knowing A's primary key is sufficient to determine B's primary key through functional dependencies.
B determines A (written B → A): Every attribute in PK(A) is either already in PK(B) or is a secondary attribute in B.
B → A iff ∀a ∈ PK(A): a ∈ PK(B) OR a ∈ secondary(B)
For A * B:
| Condition | PK(A * B) | Attribute Order |
|---|---|---|
| A → B | PK(A) | A's attributes first |
| B → A (and not A → B) | PK(B) | B's attributes first |
| Neither | PK(A) ∪ PK(B) | PK(A) first, then PK(B) − PK(A) |
When both A → B and B → A hold, the left operand takes precedence (use PK(A)).
Example 1: B → A
A: x*, y*
B: x*, z*, y (y is secondary in B, so z → y)
- A → B? PK(B) = {x, z}. Is z in PK(A) or secondary in A? No (z not in A). No.
- B → A? PK(A) = {x, y}. Is y in PK(B) or secondary in B? Yes (secondary). Yes.
- Result: PK(A * B) = {x, z} with B's attributes first.
Example 2: Both directions (bijection-like)
A: x*, y*, z (z is secondary in A)
B: y*, z*, x (x is secondary in B)
- A → B? PK(B) = {y, z}. Is z in PK(A) or secondary in A? Yes (secondary). Yes.
- B → A? PK(A) = {x, y}. Is x in PK(B) or secondary in B? Yes (secondary). Yes.
- Both hold, prefer left operand: PK(A * B) = {x, y} with A's attributes first.
Example 3: Neither direction
A: x*, y*
B: z*, x (x is secondary in B)
- A → B? PK(B) = {z}. Is z in PK(A) or secondary in A? No. No.
- B → A? PK(A) = {x, y}. Is y in PK(B) or secondary in B? No (y not in B). No.
- Result: PK(A * B) = {x, y, z} (union) with A's attributes first.
Example 4: A → B (subordinate relationship)
Session: session_id*
Trial: session_id*, trial_num* (references Session)
- A → B? PK(Trial) = {session_id, trial_num}. Is trial_num in PK(Session) or secondary? No. No.
- B → A? PK(Session) = {session_id}. Is session_id in PK(Trial)? Yes. Yes.
- Result: PK(Session * Trial) = {session_id, trial_num} with Trial's attributes first.
The join primary key rule prioritizes predictability over minimality. In some cases, the resulting primary key may not be minimal (i.e., it may contain functionally redundant attributes).
Example of non-minimal result:
A: x*, y*
B: z*, x (x is secondary in B, so z → x)
The mathematically minimal primary key for A * B would be {y, z} because:
z → x(from B's structure){y, z} → {x, y, z}(z gives us x, and we have y)
However, {y, z} is problematic:
- It is not the primary key of either operand (A has
{x, y}, B has{z}) - It is not the union of the primary keys
- It represents a novel entity type that doesn't correspond to A, B, or their natural pairing
This creates confusion: what kind of entity does {y, z} identify?
The simplified rule produces {x, y, z} (the union), which:
- Is immediately recognizable as "one A entity paired with one B entity"
- Contains A's full primary key and B's full primary key
- May have redundancy (
xis determined byz) but is semantically clear
Rationale: Users can always project away redundant attributes if they need the minimal key. But starting with a predictable, interpretable primary key reduces confusion and errors.
The primary key attributes always appear first in the result's attribute list, followed by secondary attributes. When B → A (and not A → B), the join is conceptually reordered as B * A to maintain this invariant:
- If PK = PK(A): A's attributes appear first
- If PK = PK(B): B's attributes appear first
- If PK = PK(A) ∪ PK(B): PK(A) attributes first, then PK(B) − PK(A), then secondaries
With these rules, join is not commutative in terms of:
- Primary key selection:
A * Bmay have a different PK thanB * Awhen one direction determines but not the other - Attribute ordering: The left operand's attributes appear first (unless B → A)
The result set (the actual rows returned) remains the same regardless of order, but the schema (primary key and attribute order) may differ.
For left joins (A.join(B, left=True)), the functional dependency A → B is required.
Why this constraint exists:
In a left join, all rows from A are retained even if there's no matching row in B. For unmatched rows, B's attributes are NULL. This creates a problem for primary key validity:
| Scenario | PK by inner join rule | Left join problem |
|---|---|---|
| A → B | PK(A) | ✅ Safe — A's attrs always present |
| B → A | PK(B) | ❌ B's PK attrs could be NULL |
| Neither | PK(A) ∪ PK(B) | ❌ B's PK attrs could be NULL |
Example of invalid left join:
A: x*, y* PK(A) = {x, y}
B: x*, z*, y PK(B) = {x, z}, y is secondary
Inner join: PK = {x, z} (B → A rule)
Left join attempt: FAILS because z could be NULL for unmatched A rows
Valid left join example:
Session: session_id*, date
Trial: session_id*, trial_num*, stimulus (references Session)
Session.join(Trial, left=True) # OK: Session → Trial
# PK = {session_id}, all sessions retained even without trials
Error message:
DataJointError: Left join requires the left operand to determine the right operand (A → B).
The following attributes from the right operand's primary key are not determined by
the left operand: ['z']. Use an inner join or restructure the query.
For special cases where the user takes responsibility for handling the potentially invalid primary key, the constraint can be bypassed using allow_invalid_primary_key=True:
# Normally blocked - B does not determine A
A.join(B, left=True) # Error: A → B not satisfied
# Bypass the constraint - user takes responsibility
A.join(B, left=True, allow_invalid_primary_key=True) # Allowed, PK = PK(A) ∪ PK(B)When bypassed, the resulting primary key is the union of both operands' primary keys (PK(A) ∪ PK(B)). The user must ensure that subsequent operations (such as GROUP BY or projection) establish a valid primary key.
This mechanism is used internally by aggregation (aggr) with keep_all_rows=True, which resets the primary key via the GROUP BY clause.
A.aggr(B, keep_all_rows=True) uses a left join internally but has the opposite requirement: B → A (the group expression B must have all of A's primary key attributes).
This apparent contradiction is resolved by the GROUP BY clause:
- Aggregation requires B → A so that B can be grouped by A's primary key
- The intermediate left join
A LEFT JOIN Bwould have an invalid PK under the normal left join rules - Aggregation internally allows the invalid PK, producing PK(A) ∪ PK(B)
- The
GROUP BY PK(A)clause then resets the primary key to PK(A) - The final result has PK(A), which consists entirely of non-NULL values from A
Note: The semantic check (homologous namesake validation) is still performed for aggregation's internal join. Only the primary key validity constraint is bypassed.
Example:
Session: session_id*, date
Trial: session_id*, trial_num*, response_time (references Session)
# Aggregation with keep_all_rows=True
Session.aggr(Trial, keep_all_rows=True, avg_rt='avg(response_time)')
# Internally: Session LEFT JOIN Trial (with invalid PK allowed)
# Intermediate PK would be {session_id} ∪ {session_id, trial_num} = {session_id, trial_num}
# But GROUP BY session_id resets PK to {session_id}
# Result: All sessions, with avg_rt=NULL for sessions without trials
dj.U() or dj.U('attr1', 'attr2', ...) represents the universal set of all possible values and lineages.
Since dj.U conceptually contains all possible lineages, its attributes are homologous to any namesake attribute in other expressions.
# Restriction: promotes a, b to PK; lineage transferred from A
dj.U('a', 'b') & A
# Aggregation: groups by a, b
dj.U('a', 'b').aggr(A, count='count(*)')
# Empty U for distinct primary keys
dj.U() & A# Anti-restriction: produces infinite set
dj.U('a', 'b') - A # DataJointError
# Join: deprecated, use & instead
dj.U('a', 'b') * A # DataJointError with migration guidanceThe implementation provides two separate methods for determining attribute lineage:
For DataJoint-managed schemas:
- Lineage is stored explicitly in a hidden table (
~lineage) per schema - Populated at table declaration time by copying from parent tables
- Only attributes WITH lineage are stored (native secondary attributes have no entry)
- Fast O(1) lookup at query time
- Authoritative source when present
Schema:
CREATE TABLE `schema_name`.`~lineage` (
table_name VARCHAR(64) NOT NULL,
attribute_name VARCHAR(64) NOT NULL,
lineage VARCHAR(255) NOT NULL,
PRIMARY KEY (table_name, attribute_name)
);Lifecycle:
- On table creation: delete any existing entries for that table, then insert new entries
- On table drop: delete all entries for that table
Fallback for non-DataJoint schemas or when ~lineage doesn't exist:
- Lineage computed by traversing FK relationships
- Uses
connection.dependencieswhich loads fromINFORMATION_SCHEMA - Works with any database schema
- May be incomplete if upstream schemas aren't loaded
Algorithm:
def compute_lineage(table, attribute):
"""Compute lineage by FK traversal."""
# Check if attribute is inherited via FK
for parent, props in dependencies.parents(table).items():
attr_map = props['attr_map']
if attribute in attr_map:
parent_attr = attr_map[attribute]
# Recursively trace to origin
return compute_lineage(parent, parent_attr)
# Not inherited - check if primary key
if attribute in table.primary_key:
return f"{schema}.{table}.{attribute}"
# Native secondary - no lineage
return NoneThese methods are mutually exclusive:
def get_lineage(schema, table, attribute):
try:
# Returns lineage string if entry exists, None otherwise
return query_lineage_table(schema, table, attribute)
except TableDoesNotExist:
return compute_from_dependencies(schema, table, attribute)Add lineage field to the Attribute namedtuple:
default_attribute_properties = dict(
# ... existing fields ...
lineage=None, # NEW: Origin of attribute, e.g. "schema.table.attr"
)-
Load lineage when fetching heading from database:
- Query
~lineagetable if it exists - Fall back to dependency graph computation
- Query
-
Preserve lineage in
select()method:- Included attributes keep their lineage
- Renamed attributes keep their lineage
- Computed attributes have
lineage=None
-
Merge lineage in
join()method:- Verify homologous namesakes have matching lineage
- Combined heading includes lineage from both sides
Replace current implementation with semantic matching:
def assert_join_compatibility(expr1, expr2):
"""
Check semantic compatibility of two expressions for joining.
Raises DataJointError if non-homologous namesakes are detected.
"""
if isinstance(expr1, U):
return # U is always compatible
# Find namesake attributes (same name in both)
namesakes = set(expr1.heading.names) & set(expr2.heading.names)
for name in namesakes:
lineage1 = expr1.heading[name].lineage
lineage2 = expr2.heading[name].lineage
if lineage1 != lineage2:
raise DataJointError(
f"Cannot join on attribute `{name}`: "
f"different lineages ({lineage1} vs {lineage2}). "
f"Use .proj() to rename one of the attributes or "
f".join(semantic_check=False) to force a natural join."
)Update to use semantic matching:
def join(self, other, semantic_check=True, left=False):
# ... existing setup ...
if semantic_check:
assert_join_compatibility(self, other)
# Find homologous namesakes for join
join_attributes = set()
for name in self.heading.names:
if name in other.heading.names:
# Only join on attributes with matching lineage
if self.heading[name].lineage == other.heading[name].lineage:
join_attributes.add(name)
# ... rest of join logic ...def __matmul__(self, other):
"""Removed: Use .join(other, semantic_check=False) instead."""
raise DataJointError(
"The @ operator has been removed in DataJoint 2.0. "
"Use .join(other, semantic_check=False) for permissive joins."
)When a table is declared, populate the ~lineage table:
def declare_table(table_class, context):
# ... parse definition ...
# Remove any leftover entries from previous declaration
delete_lineage_entries(schema, table_name)
lineage_entries = []
for attr in definition.attributes:
if attr.from_foreign_key:
# Inherited: copy parent's lineage
parent_lineage = get_lineage(
attr.fk_schema, attr.fk_table, attr.fk_attribute
)
if parent_lineage: # Only store if parent has lineage
lineage_entries.append((table_name, attr.name, parent_lineage))
elif attr.in_key:
# Native primary key: this table is the origin
lineage_entries.append((
table_name, attr.name,
f"{schema}.{table_name}.{attr.name}"
))
# Native secondary attributes: no entry (no lineage)
# Insert into ~lineage table
insert_lineage_entries(schema, lineage_entries)When a table is dropped, remove its lineage entries:
def drop_table(table_class):
# ... drop the table ...
# Clean up lineage entries
delete_lineage_entries(schema, table_name)For existing schemas without ~lineage tables:
- Automatic creation: When DataJoint accesses a schema, check if
~lineageexists - Lazy population: Populate entries as tables are accessed
- Bulk migration tool: Provide utility to migrate entire schema
def migrate_schema_lineage(schema):
"""Populate ~lineage table for all tables in schema."""
create_lineage_table(schema)
for table in schema.list_tables():
populate_lineage_from_dependencies(schema, table)def proj(self, *attributes, **named_attributes):
# ... existing logic ...
# Lineage handling in select():
# - Included attributes: preserve lineage
# - Renamed (new_name='old_name'): preserve old_name's lineage
# - Computed (new_name='expr'): lineage = NoneIn A.aggr(B, ...), entries from B are grouped by A's primary key and aggregate functions are computed.
Functional Dependency Requirement: Every entry in B must match exactly one entry in A. This requires:
-
B must have all of A's primary key attributes: If A's primary key is
(a, b), then B must contain attributes namedaandb. These attributes can be either primary or secondary in B. -
These attributes must be homologous: The namesake attributes in B must have the same lineage as in A. This ensures they represent the same entity.
# Valid: Session.aggr(Trial, ...) where Trial has session_id from Session
Session.aggr(Trial, n='count(*)') # OK - Trial.session_id traces to Session.session_id
# Invalid: Missing primary key attribute
Session.aggr(Stimulus, n='count(*)') # Error if Stimulus lacks session_id
# Invalid: Non-homologous primary key
TableA.aggr(TableB, n='count(*)') # Error if TableB.id has different lineage than TableA.idResult lineage:
- Group attributes retain their lineage from the grouping expression (A)
- Aggregated attributes have
lineage=None(they are computations)
Union requires all namesake attributes to have matching lineage (enforced via assert_join_compatibility).
DataJointError: Cannot join on attribute `id`: different lineages
(university.student.id vs university.course.id).
Use .proj() to rename one of the attributes or .join(semantic_check=False) to force a natural join.
DataJointError: The @ operator has been removed in DataJoint 2.0.
Use .join(other, semantic_check=False) for permissive joins.
DataJointError: dj.U(...) * table is deprecated in DataJoint 2.0.
Use dj.U(...) & table instead.
DataJointError: Aggregation requires functional dependency: `group` must have all primary key
attributes of the grouping expression. Missing: {'session_id'}.
Use .proj() to add the missing attributes or verify the schema design.
DataJointError: Aggregation requires homologous primary key attributes.
Attribute `id` has different lineages: university.student.id (grouping) vs university.course.id (group).
Use .proj() to rename one of the attributes or .join(semantic_check=False) in a manual aggregation.
-
Lineage computation tests:
- Native PK attribute has correct lineage
- FK-inherited attribute traces to origin
- Native secondary attribute has null lineage
- Multi-hop FK inheritance traces correctly
-
Join matching tests:
- Homologous namesakes are matched
- Non-homologous namesakes raise error
semantic_check=Falsebypasses check
-
Projection lineage preservation:
- Included attributes keep lineage
- Renamed attributes keep lineage
- Computed attributes have null lineage
-
dj.Ucompatibility:dj.U & tableworksdj.U.aggr(table, ...)worksdj.U - tableraises errordj.U * tableraises deprecation error
-
Aggregation functional dependency:
A.aggr(B)works when B has all of A's PK attributes with same lineageA.aggr(B)raises error when B is missing PK attributesA.aggr(B)raises error when PK attributes have different lineagedj.U('a', 'b').aggr(B)works when B hasaandbattributes
-
Join primary key determination:
A * BwhereA → B: result has PK(A)A * BwhereB → A(notA → B): result has PK(B), B's attributes firstA * Bwhere bothA → BandB → A: result has PK(A) (left preference)A * Bwhere neither direction: result has PK(A) ∪ PK(B)- Verify attribute ordering matches primary key source
- Verify non-commutativity:
A * BvsB * Amay differ in PK and order
- Schema migration: Existing schema gets
~lineagetable populated correctly - Cross-schema joins: Lineage traced across schema boundaries
- Complex queries: Multi-join expressions with various lineage scenarios
Ensure existing well-designed schemas continue to work without modification.
-
Review joins on generic attribute names: Attributes like
id,name,value,typemay trigger non-homologous namesake errors. -
Replace
@operator:# Old table1 @ table2 # New table1.join(table2, semantic_check=False)
-
Replace
dj.U * table:# Old dj.U('attr') * table # New dj.U('attr') & table
-
Resolve namesake conflicts:
# If error on Student * Course (both have 'id') Student * Course.proj(course_id='id')
-
Use descriptive attribute names: Prefer
student_idoveridto avoid collisions. -
Leverage foreign keys: Inherited attributes maintain lineage, enabling semantic joins.
-
Run migration tool: Use
dj.migrate_lineage(schema)to populate lineage tables for existing schemas.
- O(1) lookup per attribute
- Cached in
Headingobject - No additional queries during normal operations
- First access loads full dependency graph for schema
- Lineage computation is O(depth) per attribute
- Results cached to avoid recomputation
- Lineage comparison adds negligible overhead
- Same attribute matching loop, just with additional comparison
Query cache keys should include lineage information to prevent cache collisions between semantically different queries.
Extend DataJoint diagrams to show lineage relationships, helping users understand attribute origins.
These examples show Python class names with their corresponding database table names (in lineage strings).
Session(session_id*, date) # table: session
↓ FK
Trial(session_id*, trial_num*, stimulus) # table: trial
↓ FK
Response(session_id*, trial_num*, response_time) # table: __response (computed)
Lineages (using database table names):
Session.session_id→university.session.session_idTrial.session_id→university.session.session_id(inherited)Trial.trial_num→university.trial.trial_num(native PK)Response.session_id→university.session.session_id(inherited)Response.trial_num→university.trial.trial_num(inherited)
Course(course_id*, title) # table: course
↓ FK (secondary)
Enrollment(student_id*, course_id) # table: enrollment
Lineages:
Course.course_id→university.course.course_idEnrollment.student_id→university.enrollment.student_id(native PK)Enrollment.course_id→university.course.course_id(inherited via FK)
Person(person_id*, name) # table: person
↓ FK (aliased)
Marriage(husband*, wife*, date) # table: __marriage (computed)
where husband -> Person, wife -> Person
Lineages:
Person.person_id→family.person.person_idMarriage.husband→family.person.person_id(aliased FK)Marriage.wife→family.person.person_id(aliased FK)
Note: husband and wife have the same lineage even though different names.
Student(id*, name) -- id is native PK, table: student
Course(id*, title) -- id is native PK, table: course
Lineages:
Student.id→university.student.idCourse.id→university.course.id
Student * Course → Error: non-homologous namesakes (id has different lineages)