Skip to content

Latest commit

 

History

History
907 lines (645 loc) · 31 KB

File metadata and controls

907 lines (645 loc) · 31 KB

Semantic Matching for Joins - Specification

Overview

This document specifies semantic matching for joins in DataJoint 2.0, replacing the current name-based matching rules. Semantic matching ensures that attributes are only matched when they share both the same name and the same lineage (origin), preventing accidental joins on unrelated attributes that happen to share names.

Goals

  1. Prevent incorrect joins on attributes that share names but represent different entities
  2. Enable valid joins that are currently blocked due to overly restrictive rules
  3. Maintain backward compatibility for well-designed schemas
  4. Provide clear error messages when semantic conflicts are detected

Problem Statement

Current Behavior

The current join implementation matches attributes purely by name:

join_attributes = set(n for n in self.heading.names if n in other.heading.names)

This is essentially a SQL NATURAL JOIN - any attributes with the same name in both tables are used for matching. The only constraint is the "join compatibility" check which prevents joining on secondary attributes that appear in both tables.

Location: src/datajoint/expression.py:301 and src/datajoint/condition.py:100-120

Problems with Current Approach

Problem 1: Primary Key Name Collision

Consider two tables:

  • Student(id, name) - where id is the student's primary key
  • Course(id, instructor) - where id is the course's primary key

Current behavior: Student * Course joins on id, producing meaningless results where student IDs are matched with course IDs.

Desired behavior: An error should be raised because Student.id and Course.id have different origins (lineages).

Problem 2: Valid Joins Currently Blocked

Consider two tables:

  • FavoriteCourse(student_id*, course_id) - student's favorite course (course_id is secondary, with FK to Course)
  • DependentCourse(dep_course_id*, course_id) - course dependencies (course_id is secondary, with FK to Course)

Current behavior: FavoriteCourse * DependentCourse is rejected because course_id is a secondary attribute in both tables.

Desired behavior: The join should proceed because both course_id attributes share the same lineage (tracing to Course.course_id).

Key Concepts

Terminology

Term Definition
Homologous attributes Attributes with the same lineage (whether or not they have the same name)
Namesake attributes Attributes with the same name (whether or not they have the same lineage)
Homologous namesakes Attributes with the same name AND the same lineage — used for join matching
Non-homologous namesakes Attributes with the same name BUT different lineage — cause join errors

Attribute Lineage

Lineage identifies the origin of an attribute - where it was first defined. It is represented as a string in the format:

schema_name.table_name.attribute_name

Note: table_name refers to the actual database table name, not the Python class name. DataJoint converts class names (CamelCase) to table names (snake_case) with tier prefixes:

  • Sessionsession (manual table)
  • #SessionType#session_type (lookup table)
  • _ProcessingTask_processing_task (imported table)
  • __ProcessedData__processed_data (computed table)

Lineage Assignment Rules

  1. Native primary key attributes have lineage:

    lineage = "this_schema.this_table.attr_name"
    

    The table where they are originally defined.

  2. Attributes inherited via foreign key retain their origin lineage:

    lineage = "parent_schema.parent_table.attr_name"
    

    Traced to the original definition through the FK chain.

  3. Native secondary attributes do NOT have lineage:

    lineage = None
    

    Secondary attributes defined directly (not via FK) cannot be used for join matching.

Lineage Propagation

Lineage propagates through:

  1. Foreign key references: Inherited attributes retain their origin lineage regardless of whether they end up as primary or secondary in the referencing table.

  2. Query expressions:

    • Projections preserve lineage for included attributes
    • Renamed attributes (new_name='old_name') preserve lineage
    • Computed attributes (result='expr') have no lineage

Semantic Matching Rules

Match Decision Matrix

Scenario Same Name Same Lineage Action
Homologous namesakes Yes Yes Match - use for join
Non-homologous namesakes (both have lineage) Yes No (different) Error
Non-homologous namesakes (both null lineage) Yes No (both null) Error
Non-homologous namesakes (one null) Yes No (one null) Error
Different names No - No match

Error Resolution

When non-homologous namesakes are detected, users must resolve the conflict using .proj() to rename one of the colliding attributes:

# Error: Student.id and Course.id have different lineages
Student * Course  # DataJointError!

# Resolution: rename one attribute
Student * Course.proj(course_id='id')  # OK

Affected Operations

Semantic matching applies to all binary operations that match attributes:

Operator Operation Semantic Matching
A * B Join Matches on homologous namesakes
A & B Restriction Matches on homologous namesakes
A - B Anti-restriction Matches on homologous namesakes
A.aggr(B, ...) Aggregation Requires functional dependency (see below)

The .join() Method

The .join() method provides additional control:

# Default: semantic checking enabled (same as *)
result = A.join(B)

# Bypass semantic check for legacy compatibility
result = A.join(B, semantic_check=False)

Removal of @ Operator

The @ operator (permissive join) is removed in DataJoint 2.0:

# Old (deprecated):
A @ B  # Raises DataJointError with migration guidance

# New:
A.join(B, semantic_check=False)  # Explicit bypass

The error message directs users to the explicit .join() method.

Primary Key Rules in Relational Operators

In DataJoint, the result of each query operator produces a valid entity set with a well-defined entity type and primary key. This section specifies how the primary key is determined for each relational operator.

General Principle

The primary key of a query result identifies unique entities in that result. For most operators, the primary key is preserved from the left operand. For joins, the primary key depends on the functional dependencies between the operands.

Notation

In the examples below, * marks primary key attributes:

  • A(x*, y*, z) means A has primary key {x, y} and secondary attribute z
  • A → B means "A determines B" (defined below)

Rules by Operator

Operator Primary Key Rule
A & B (restriction) PK(A) — preserved from left operand
A - B (anti-restriction) PK(A) — preserved from left operand
A.proj(...) (projection) PK(A) — preserved from left operand
A.aggr(B, ...) (aggregation) PK(A) — preserved from left operand
A * B (join) Depends on functional dependencies (see below)

Join Primary Key Rule

The join operator requires special handling because it combines two entity sets. The primary key of A * B depends on the functional dependency relationship between the operands.

Definitions

A determines B (written A → B): Every attribute in PK(B) is either already in PK(A) or is a secondary attribute in A.

A → B  iff  ∀b ∈ PK(B): b ∈ PK(A) OR b ∈ secondary(A)

Intuitively, A → B means that knowing A's primary key is sufficient to determine B's primary key through functional dependencies.

B determines A (written B → A): Every attribute in PK(A) is either already in PK(B) or is a secondary attribute in B.

B → A  iff  ∀a ∈ PK(A): a ∈ PK(B) OR a ∈ secondary(B)

Join Primary Key Algorithm

For A * B:

Condition PK(A * B) Attribute Order
A → B PK(A) A's attributes first
B → A (and not A → B) PK(B) B's attributes first
Neither PK(A) ∪ PK(B) PK(A) first, then PK(B) − PK(A)

When both A → B and B → A hold, the left operand takes precedence (use PK(A)).

Examples

Example 1: B → A

A: x*, y*
B: x*, z*, y    (y is secondary in B, so z → y)
  • A → B? PK(B) = {x, z}. Is z in PK(A) or secondary in A? No (z not in A). No.
  • B → A? PK(A) = {x, y}. Is y in PK(B) or secondary in B? Yes (secondary). Yes.
  • Result: PK(A * B) = {x, z} with B's attributes first.

Example 2: Both directions (bijection-like)

A: x*, y*, z    (z is secondary in A)
B: y*, z*, x    (x is secondary in B)
  • A → B? PK(B) = {y, z}. Is z in PK(A) or secondary in A? Yes (secondary). Yes.
  • B → A? PK(A) = {x, y}. Is x in PK(B) or secondary in B? Yes (secondary). Yes.
  • Both hold, prefer left operand: PK(A * B) = {x, y} with A's attributes first.

Example 3: Neither direction

A: x*, y*
B: z*, x    (x is secondary in B)
  • A → B? PK(B) = {z}. Is z in PK(A) or secondary in A? No. No.
  • B → A? PK(A) = {x, y}. Is y in PK(B) or secondary in B? No (y not in B). No.
  • Result: PK(A * B) = {x, y, z} (union) with A's attributes first.

Example 4: A → B (subordinate relationship)

Session: session_id*
Trial: session_id*, trial_num*    (references Session)
  • A → B? PK(Trial) = {session_id, trial_num}. Is trial_num in PK(Session) or secondary? No. No.
  • B → A? PK(Session) = {session_id}. Is session_id in PK(Trial)? Yes. Yes.
  • Result: PK(Session * Trial) = {session_id, trial_num} with Trial's attributes first.

Design Tradeoff: Predictability vs. Minimality

The join primary key rule prioritizes predictability over minimality. In some cases, the resulting primary key may not be minimal (i.e., it may contain functionally redundant attributes).

Example of non-minimal result:

A: x*, y*
B: z*, x    (x is secondary in B, so z → x)

The mathematically minimal primary key for A * B would be {y, z} because:

  • z → x (from B's structure)
  • {y, z} → {x, y, z} (z gives us x, and we have y)

However, {y, z} is problematic:

  • It is not the primary key of either operand (A has {x, y}, B has {z})
  • It is not the union of the primary keys
  • It represents a novel entity type that doesn't correspond to A, B, or their natural pairing

This creates confusion: what kind of entity does {y, z} identify?

The simplified rule produces {x, y, z} (the union), which:

  • Is immediately recognizable as "one A entity paired with one B entity"
  • Contains A's full primary key and B's full primary key
  • May have redundancy (x is determined by z) but is semantically clear

Rationale: Users can always project away redundant attributes if they need the minimal key. But starting with a predictable, interpretable primary key reduces confusion and errors.

Attribute Ordering

The primary key attributes always appear first in the result's attribute list, followed by secondary attributes. When B → A (and not A → B), the join is conceptually reordered as B * A to maintain this invariant:

  • If PK = PK(A): A's attributes appear first
  • If PK = PK(B): B's attributes appear first
  • If PK = PK(A) ∪ PK(B): PK(A) attributes first, then PK(B) − PK(A), then secondaries

Non-Commutativity

With these rules, join is not commutative in terms of:

  1. Primary key selection: A * B may have a different PK than B * A when one direction determines but not the other
  2. Attribute ordering: The left operand's attributes appear first (unless B → A)

The result set (the actual rows returned) remains the same regardless of order, but the schema (primary key and attribute order) may differ.

Left Join Constraint

For left joins (A.join(B, left=True)), the functional dependency A → B is required.

Why this constraint exists:

In a left join, all rows from A are retained even if there's no matching row in B. For unmatched rows, B's attributes are NULL. This creates a problem for primary key validity:

Scenario PK by inner join rule Left join problem
A → B PK(A) ✅ Safe — A's attrs always present
B → A PK(B) ❌ B's PK attrs could be NULL
Neither PK(A) ∪ PK(B) ❌ B's PK attrs could be NULL

Example of invalid left join:

A: x*, y*           PK(A) = {x, y}
B: x*, z*, y        PK(B) = {x, z}, y is secondary

Inner join: PK = {x, z} (B → A rule)
Left join attempt: FAILS because z could be NULL for unmatched A rows

Valid left join example:

Session: session_id*, date
Trial: session_id*, trial_num*, stimulus    (references Session)

Session.join(Trial, left=True)  # OK: Session → Trial
# PK = {session_id}, all sessions retained even without trials

Error message:

DataJointError: Left join requires the left operand to determine the right operand (A → B).
The following attributes from the right operand's primary key are not determined by
the left operand: ['z']. Use an inner join or restructure the query.

Bypassing the Left Join Constraint

For special cases where the user takes responsibility for handling the potentially invalid primary key, the constraint can be bypassed using allow_invalid_primary_key=True:

# Normally blocked - B does not determine A
A.join(B, left=True)  # Error: A → B not satisfied

# Bypass the constraint - user takes responsibility
A.join(B, left=True, allow_invalid_primary_key=True)  # Allowed, PK = PK(A) ∪ PK(B)

When bypassed, the resulting primary key is the union of both operands' primary keys (PK(A) ∪ PK(B)). The user must ensure that subsequent operations (such as GROUP BY or projection) establish a valid primary key.

This mechanism is used internally by aggregation (aggr) with keep_all_rows=True, which resets the primary key via the GROUP BY clause.

Aggregation Exception

A.aggr(B, keep_all_rows=True) uses a left join internally but has the opposite requirement: B → A (the group expression B must have all of A's primary key attributes).

This apparent contradiction is resolved by the GROUP BY clause:

  1. Aggregation requires B → A so that B can be grouped by A's primary key
  2. The intermediate left join A LEFT JOIN B would have an invalid PK under the normal left join rules
  3. Aggregation internally allows the invalid PK, producing PK(A) ∪ PK(B)
  4. The GROUP BY PK(A) clause then resets the primary key to PK(A)
  5. The final result has PK(A), which consists entirely of non-NULL values from A

Note: The semantic check (homologous namesake validation) is still performed for aggregation's internal join. Only the primary key validity constraint is bypassed.

Example:

Session: session_id*, date
Trial: session_id*, trial_num*, response_time    (references Session)

# Aggregation with keep_all_rows=True
Session.aggr(Trial, keep_all_rows=True, avg_rt='avg(response_time)')

# Internally: Session LEFT JOIN Trial (with invalid PK allowed)
# Intermediate PK would be {session_id} ∪ {session_id, trial_num} = {session_id, trial_num}
# But GROUP BY session_id resets PK to {session_id}
# Result: All sessions, with avg_rt=NULL for sessions without trials

Universal Set dj.U

dj.U() or dj.U('attr1', 'attr2', ...) represents the universal set of all possible values and lineages.

Homology with dj.U

Since dj.U conceptually contains all possible lineages, its attributes are homologous to any namesake attribute in other expressions.

Valid Operations

# Restriction: promotes a, b to PK; lineage transferred from A
dj.U('a', 'b') & A

# Aggregation: groups by a, b
dj.U('a', 'b').aggr(A, count='count(*)')

# Empty U for distinct primary keys
dj.U() & A

Invalid Operations

# Anti-restriction: produces infinite set
dj.U('a', 'b') - A  # DataJointError

# Join: deprecated, use & instead
dj.U('a', 'b') * A  # DataJointError with migration guidance

Implementation Architecture

Two Methods for Lineage Determination

The implementation provides two separate methods for determining attribute lineage:

Method 1: Lineage Tables (~lineage)

For DataJoint-managed schemas:

  • Lineage is stored explicitly in a hidden table (~lineage) per schema
  • Populated at table declaration time by copying from parent tables
  • Only attributes WITH lineage are stored (native secondary attributes have no entry)
  • Fast O(1) lookup at query time
  • Authoritative source when present

Schema:

CREATE TABLE `schema_name`.`~lineage` (
    table_name VARCHAR(64) NOT NULL,
    attribute_name VARCHAR(64) NOT NULL,
    lineage VARCHAR(255) NOT NULL,
    PRIMARY KEY (table_name, attribute_name)
);

Lifecycle:

  • On table creation: delete any existing entries for that table, then insert new entries
  • On table drop: delete all entries for that table

Method 2: Dependency Graph Traversal

Fallback for non-DataJoint schemas or when ~lineage doesn't exist:

  • Lineage computed by traversing FK relationships
  • Uses connection.dependencies which loads from INFORMATION_SCHEMA
  • Works with any database schema
  • May be incomplete if upstream schemas aren't loaded

Algorithm:

def compute_lineage(table, attribute):
    """Compute lineage by FK traversal."""
    # Check if attribute is inherited via FK
    for parent, props in dependencies.parents(table).items():
        attr_map = props['attr_map']
        if attribute in attr_map:
            parent_attr = attr_map[attribute]
            # Recursively trace to origin
            return compute_lineage(parent, parent_attr)

    # Not inherited - check if primary key
    if attribute in table.primary_key:
        return f"{schema}.{table}.{attribute}"

    # Native secondary - no lineage
    return None

Selection Logic

These methods are mutually exclusive:

def get_lineage(schema, table, attribute):
    try:
        # Returns lineage string if entry exists, None otherwise
        return query_lineage_table(schema, table, attribute)
    except TableDoesNotExist:
        return compute_from_dependencies(schema, table, attribute)

Changes to Existing Code

Attribute Class (heading.py)

Add lineage field to the Attribute namedtuple:

default_attribute_properties = dict(
    # ... existing fields ...
    lineage=None,  # NEW: Origin of attribute, e.g. "schema.table.attr"
)

Heading Class (heading.py)

  1. Load lineage when fetching heading from database:

    • Query ~lineage table if it exists
    • Fall back to dependency graph computation
  2. Preserve lineage in select() method:

    • Included attributes keep their lineage
    • Renamed attributes keep their lineage
    • Computed attributes have lineage=None
  3. Merge lineage in join() method:

    • Verify homologous namesakes have matching lineage
    • Combined heading includes lineage from both sides

assert_join_compatibility() (condition.py)

Replace current implementation with semantic matching:

def assert_join_compatibility(expr1, expr2):
    """
    Check semantic compatibility of two expressions for joining.

    Raises DataJointError if non-homologous namesakes are detected.
    """
    if isinstance(expr1, U):
        return  # U is always compatible

    # Find namesake attributes (same name in both)
    namesakes = set(expr1.heading.names) & set(expr2.heading.names)

    for name in namesakes:
        lineage1 = expr1.heading[name].lineage
        lineage2 = expr2.heading[name].lineage

        if lineage1 != lineage2:
            raise DataJointError(
                f"Cannot join on attribute `{name}`: "
                f"different lineages ({lineage1} vs {lineage2}). "
                f"Use .proj() to rename one of the attributes or "
                f".join(semantic_check=False) to force a natural join."
            )

join() Method (expression.py)

Update to use semantic matching:

def join(self, other, semantic_check=True, left=False):
    # ... existing setup ...

    if semantic_check:
        assert_join_compatibility(self, other)

    # Find homologous namesakes for join
    join_attributes = set()
    for name in self.heading.names:
        if name in other.heading.names:
            # Only join on attributes with matching lineage
            if self.heading[name].lineage == other.heading[name].lineage:
                join_attributes.add(name)

    # ... rest of join logic ...

@ Operator Removal (expression.py)

def __matmul__(self, other):
    """Removed: Use .join(other, semantic_check=False) instead."""
    raise DataJointError(
        "The @ operator has been removed in DataJoint 2.0. "
        "Use .join(other, semantic_check=False) for permissive joins."
    )

Lineage Table Population

At Table Declaration Time

When a table is declared, populate the ~lineage table:

def declare_table(table_class, context):
    # ... parse definition ...

    # Remove any leftover entries from previous declaration
    delete_lineage_entries(schema, table_name)

    lineage_entries = []

    for attr in definition.attributes:
        if attr.from_foreign_key:
            # Inherited: copy parent's lineage
            parent_lineage = get_lineage(
                attr.fk_schema, attr.fk_table, attr.fk_attribute
            )
            if parent_lineage:  # Only store if parent has lineage
                lineage_entries.append((table_name, attr.name, parent_lineage))
        elif attr.in_key:
            # Native primary key: this table is the origin
            lineage_entries.append((
                table_name, attr.name,
                f"{schema}.{table_name}.{attr.name}"
            ))
        # Native secondary attributes: no entry (no lineage)

    # Insert into ~lineage table
    insert_lineage_entries(schema, lineage_entries)

At Table Drop Time

When a table is dropped, remove its lineage entries:

def drop_table(table_class):
    # ... drop the table ...

    # Clean up lineage entries
    delete_lineage_entries(schema, table_name)

Migration for Existing Tables

For existing schemas without ~lineage tables:

  1. Automatic creation: When DataJoint accesses a schema, check if ~lineage exists
  2. Lazy population: Populate entries as tables are accessed
  3. Bulk migration tool: Provide utility to migrate entire schema
def migrate_schema_lineage(schema):
    """Populate ~lineage table for all tables in schema."""
    create_lineage_table(schema)

    for table in schema.list_tables():
        populate_lineage_from_dependencies(schema, table)

Query Expression Lineage Propagation

Projection (proj)

def proj(self, *attributes, **named_attributes):
    # ... existing logic ...

    # Lineage handling in select():
    # - Included attributes: preserve lineage
    # - Renamed (new_name='old_name'): preserve old_name's lineage
    # - Computed (new_name='expr'): lineage = None

Aggregation (aggr)

In A.aggr(B, ...), entries from B are grouped by A's primary key and aggregate functions are computed.

Functional Dependency Requirement: Every entry in B must match exactly one entry in A. This requires:

  1. B must have all of A's primary key attributes: If A's primary key is (a, b), then B must contain attributes named a and b. These attributes can be either primary or secondary in B.

  2. These attributes must be homologous: The namesake attributes in B must have the same lineage as in A. This ensures they represent the same entity.

# Valid: Session.aggr(Trial, ...) where Trial has session_id from Session
Session.aggr(Trial, n='count(*)')  # OK - Trial.session_id traces to Session.session_id

# Invalid: Missing primary key attribute
Session.aggr(Stimulus, n='count(*)')  # Error if Stimulus lacks session_id

# Invalid: Non-homologous primary key
TableA.aggr(TableB, n='count(*)')  # Error if TableB.id has different lineage than TableA.id

Result lineage:

  • Group attributes retain their lineage from the grouping expression (A)
  • Aggregated attributes have lineage=None (they are computations)

Union (+)

Union requires all namesake attributes to have matching lineage (enforced via assert_join_compatibility).

Error Messages

Non-Homologous Namesakes

DataJointError: Cannot join on attribute `id`: different lineages
(university.student.id vs university.course.id).
Use .proj() to rename one of the attributes or .join(semantic_check=False) to force a natural join.

Deprecated @ Operator

DataJointError: The @ operator has been removed in DataJoint 2.0.
Use .join(other, semantic_check=False) for permissive joins.

Deprecated dj.U * table

DataJointError: dj.U(...) * table is deprecated in DataJoint 2.0.
Use dj.U(...) & table instead.

Aggregation Missing Primary Key

DataJointError: Aggregation requires functional dependency: `group` must have all primary key
attributes of the grouping expression. Missing: {'session_id'}.
Use .proj() to add the missing attributes or verify the schema design.

Aggregation Non-Homologous Primary Key

DataJointError: Aggregation requires homologous primary key attributes.
Attribute `id` has different lineages: university.student.id (grouping) vs university.course.id (group).
Use .proj() to rename one of the attributes or .join(semantic_check=False) in a manual aggregation.

Testing Strategy

Unit Tests

  1. Lineage computation tests:

    • Native PK attribute has correct lineage
    • FK-inherited attribute traces to origin
    • Native secondary attribute has null lineage
    • Multi-hop FK inheritance traces correctly
  2. Join matching tests:

    • Homologous namesakes are matched
    • Non-homologous namesakes raise error
    • semantic_check=False bypasses check
  3. Projection lineage preservation:

    • Included attributes keep lineage
    • Renamed attributes keep lineage
    • Computed attributes have null lineage
  4. dj.U compatibility:

    • dj.U & table works
    • dj.U.aggr(table, ...) works
    • dj.U - table raises error
    • dj.U * table raises deprecation error
  5. Aggregation functional dependency:

    • A.aggr(B) works when B has all of A's PK attributes with same lineage
    • A.aggr(B) raises error when B is missing PK attributes
    • A.aggr(B) raises error when PK attributes have different lineage
    • dj.U('a', 'b').aggr(B) works when B has a and b attributes
  6. Join primary key determination:

    • A * B where A → B: result has PK(A)
    • A * B where B → A (not A → B): result has PK(B), B's attributes first
    • A * B where both A → B and B → A: result has PK(A) (left preference)
    • A * B where neither direction: result has PK(A) ∪ PK(B)
    • Verify attribute ordering matches primary key source
    • Verify non-commutativity: A * B vs B * A may differ in PK and order

Integration Tests

  1. Schema migration: Existing schema gets ~lineage table populated correctly
  2. Cross-schema joins: Lineage traced across schema boundaries
  3. Complex queries: Multi-join expressions with various lineage scenarios

Regression Tests

Ensure existing well-designed schemas continue to work without modification.

Migration Guide

For Users

  1. Review joins on generic attribute names: Attributes like id, name, value, type may trigger non-homologous namesake errors.

  2. Replace @ operator:

    # Old
    table1 @ table2
    
    # New
    table1.join(table2, semantic_check=False)
  3. Replace dj.U * table:

    # Old
    dj.U('attr') * table
    
    # New
    dj.U('attr') & table
  4. Resolve namesake conflicts:

    # If error on Student * Course (both have 'id')
    Student * Course.proj(course_id='id')

For Schema Designers

  1. Use descriptive attribute names: Prefer student_id over id to avoid collisions.

  2. Leverage foreign keys: Inherited attributes maintain lineage, enabling semantic joins.

  3. Run migration tool: Use dj.migrate_lineage(schema) to populate lineage tables for existing schemas.

Performance Considerations

Lineage Table Lookup

  • O(1) lookup per attribute
  • Cached in Heading object
  • No additional queries during normal operations

Dependency Graph Fallback

  • First access loads full dependency graph for schema
  • Lineage computation is O(depth) per attribute
  • Results cached to avoid recomputation

Join Operations

  • Lineage comparison adds negligible overhead
  • Same attribute matching loop, just with additional comparison

Future Considerations

Lineage in Query Cache

Query cache keys should include lineage information to prevent cache collisions between semantically different queries.

Lineage Visualization

Extend DataJoint diagrams to show lineage relationships, helping users understand attribute origins.

Appendix: Lineage Examples

These examples show Python class names with their corresponding database table names (in lineage strings).

Example 1: Simple FK Chain

Session(session_id*, date)           # table: session
    ↓ FK
Trial(session_id*, trial_num*, stimulus)    # table: trial
    ↓ FK
Response(session_id*, trial_num*, response_time)  # table: __response (computed)

Lineages (using database table names):

  • Session.session_iduniversity.session.session_id
  • Trial.session_iduniversity.session.session_id (inherited)
  • Trial.trial_numuniversity.trial.trial_num (native PK)
  • Response.session_iduniversity.session.session_id (inherited)
  • Response.trial_numuniversity.trial.trial_num (inherited)

Example 2: Secondary FK

Course(course_id*, title)            # table: course
    ↓ FK (secondary)
Enrollment(student_id*, course_id)   # table: enrollment

Lineages:

  • Course.course_iduniversity.course.course_id
  • Enrollment.student_iduniversity.enrollment.student_id (native PK)
  • Enrollment.course_iduniversity.course.course_id (inherited via FK)

Example 3: Aliased FK

Person(person_id*, name)             # table: person
    ↓ FK (aliased)
Marriage(husband*, wife*, date)      # table: __marriage (computed)
    where husband -> Person, wife -> Person

Lineages:

  • Person.person_idfamily.person.person_id
  • Marriage.husbandfamily.person.person_id (aliased FK)
  • Marriage.wifefamily.person.person_id (aliased FK)

Note: husband and wife have the same lineage even though different names.

Example 4: Non-Homologous Namesakes

Student(id*, name)  -- id is native PK, table: student
Course(id*, title)  -- id is native PK, table: course

Lineages:

  • Student.iduniversity.student.id
  • Course.iduniversity.course.id

Student * CourseError: non-homologous namesakes (id has different lineages)