| 1 | +# Semantic Matching |
| 2 | + |
| 3 | +Semantic matching ensures that attributes are only matched in joins when they share both the same name and the same **lineage** (origin). This prevents accidental joins on unrelated attributes that happen to share names. |
| 4 | + |
| 5 | +## Relationship to Natural Joins |
| 6 | + |
| 7 | +DataJoint's join operator (`*`) performs a **natural join**—the standard relational operation that matches tuples on all common attribute names. If you're familiar with SQL's `NATURAL JOIN` or relational algebra, DataJoint's join works the same way. |
| 8 | + |
| 9 | +**Semantic matching adds a safety check** on top of the natural join. Before performing the join, DataJoint verifies that all common attributes (namesakes) actually represent the same thing by checking their lineage. If two attributes share a name but have different origins, the join is rejected rather than silently producing incorrect results. |
| 10 | + |
| 11 | +In other words: **semantic matching is a natural join that raises an error instead of matching unrelated attributes that merely share a name.** |
| 12 | + |
| 13 | +Two attributes are **homologous** if they share the same lineage—that is, they trace back to the same original definition. Attributes with the same name are called **namesakes**. Semantic matching requires that all namesakes be homologous. |
| 14 | + |
| 15 | +### All Binary Operators Use Semantic Matching |
| 16 | + |
| 17 | +Semantic matching applies to **all binary operators** that combine two query expressions, not just join: |
| 18 | + |
| 19 | +| Operator | Syntax | Semantic Check | |
| 20 | +|----------|--------|----------------| |
| 21 | +| Join | `A * B` | Namesakes must be homologous | |
| 22 | +| Restriction | `A & B` | Namesakes must be homologous | |
| 23 | +| Anti-restriction | `A - B` | Namesakes must be homologous | |
| 24 | +| Aggregation | `A.aggr(B, ...)` | Namesakes must be homologous | |
| 25 | +| Extension | `A.extend(B)` | Namesakes must be homologous | |
| 26 | +| Union | `A + B` | Namesakes must be homologous | |
| 27 | + |
| 28 | +In each case, DataJoint verifies that every pair of namesake attributes in the two operands shares the same lineage before proceeding with the operation. |
| 29 | + |
| 30 | +These concepts were first introduced in Yatsenko et al., 2018[^1]. |
| 31 | + |
| 32 | +[^1]: Yatsenko D, Walker EY, Tolias AS (2018). DataJoint: A Simpler Relational Data Model. [arXiv:1807.11104](https://doi.org/10.48550/arXiv.1807.11104) |
| 33 | + |
| 34 | +## Why Semantic Matching? |
| 35 | + |
| 36 | +The natural join is elegant and powerful, but it has a well-known weakness: it relies entirely on naming conventions. If two tables happen to have columns with the same name but different meanings, a natural join silently produces a Cartesian product filtered on unrelated values—a subtle bug that can go undetected. |
| 37 | + |
| 38 | +```python |
| 39 | +# Classic natural join pitfall |
| 40 | +class Student(dj.Manual): |
| 41 | + definition = """ |
| 42 | + id : int64 # student ID |
| 43 | + --- |
| 44 | + name : varchar(100) |
| 45 | + """ |
| 46 | + |
| 47 | +class Course(dj.Manual): |
| 48 | + definition = """ |
| 49 | + id : int64 # course ID |
| 50 | + --- |
| 51 | + title : varchar(100) |
| 52 | + """ |
| 53 | + |
| 54 | +# Natural join would match on 'id', but these are unrelated! |
| 55 | +# Student #5 paired with Course #5 is meaningless. |
| 56 | +``` |
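The pitfall can be reproduced without a database. The sketch below is a plain-Python illustration, not DataJoint code: `natural_join` is a hypothetical helper that joins lists of dict rows on every shared attribute name, and the sample rows are invented for the example.

```python
def natural_join(left, right):
    """Join two lists of dict rows on every attribute name they share."""
    if not left or not right:
        return []
    # Attribute names present on both sides (assumes uniform rows per table).
    common = set(left[0]) & set(right[0])
    return [
        {**l, **r}
        for l in left
        for r in right
        if all(l[name] == r[name] for name in common)
    ]


# Invented sample data: the two 'id' columns mean different things.
students = [{"id": 5, "name": "Alice"}, {"id": 6, "name": "Bob"}]
courses = [{"id": 5, "title": "Biology"}, {"id": 7, "title": "Physics"}]

result = natural_join(students, courses)
print(result)
```

The join raises no error and returns a single row pairing student 5 with course 5, even though the two `id` values were never meant to be compared.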
| 57 | + |
| 58 | +DataJoint's semantic matching solves this by tracking the **lineage** of each attribute—where it was originally defined. Attributes only match if they have the same lineage, ensuring that joins always combine semantically related data. |
| 59 | + |
| 60 | +## Attribute Lineage |
| 61 | + |
| 62 | +Lineage identifies the **origin** of an attribute—the **dimension** where it was first defined. A dimension is an independent axis of variation introduced by a table that defines new primary key attributes. See [Schema Dimensions](entity-integrity.md#schema-dimensions) for details. |
| 63 | + |
| 64 | +Lineage is represented as a string: |
| 65 | + |
| 66 | +``` |
| 67 | +schema_name.table_name.attribute_name |
| 68 | +``` |
| 69 | + |
| 70 | +### Lineage Assignment Rules |
| 71 | + |
| 72 | +| Attribute Type | Lineage Value | |
| 73 | +|----------------|---------------| |
| 74 | +| Native primary key | `this_schema.this_table.attr_name` | |
| 75 | +| FK-inherited (primary or secondary) | Traced to original definition | |
| 76 | +| Native secondary | `None` | |
| 77 | +| Computed (in projection) | `None` | |
| 78 | + |
| 79 | +### Example |
| 80 | + |
| 81 | +```python |
| 82 | +class Session(dj.Manual): # table: session |
| 83 | + definition = """ |
| 84 | + session_id : int64 |
| 85 | + --- |
| 86 | + session_date : date |
| 87 | + """ |
| 88 | + |
| 89 | +class Trial(dj.Manual): # table: trial |
| 90 | + definition = """ |
| 91 | + -> Session |
| 92 | + trial_num : int32 |
| 93 | + --- |
| 94 | + stimulus : varchar(100) |
| 95 | + """ |
| 96 | +``` |
| 97 | + |
| 98 | +Lineages: |
| 99 | + |
| 100 | +- `Session.session_id` → `myschema.session.session_id` (native PK) |
| 101 | +- `Session.session_date` → `None` (native secondary) |
| 102 | +- `Trial.session_id` → `myschema.session.session_id` (inherited via FK) |
| 103 | +- `Trial.trial_num` → `myschema.trial.trial_num` (native PK) |
| 104 | +- `Trial.stimulus` → `None` (native secondary) |
| 105 | + |
| 106 | +Notice that `Trial.session_id` has the same lineage as `Session.session_id` because it was inherited through the foreign key reference. This allows `Session * Trial` to work correctly—both `session_id` attributes are **homologous**. |
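The assignment rules can be sketched as a small function. This is an illustrative model only, not how DataJoint stores or computes lineage internally; `assign_lineage` and its parameters are hypothetical names, and foreign-key inheritance is modeled by passing in the lineage each inherited attribute already carries in the parent table.

```python
def assign_lineage(schema, table, primary, secondary=(), inherited=None):
    """Return {attribute_name: lineage} for one table.

    `inherited` maps attribute names brought in by a foreign key to the
    lineage they carry in the parent table.
    """
    inherited = inherited or {}
    lineage = {}
    for name in primary:
        # Native primary key attributes start a new lineage here;
        # FK-inherited ones keep the lineage of their original definition.
        lineage[name] = inherited.get(name, f"{schema}.{table}.{name}")
    for name in secondary:
        # Native secondary attributes have no lineage (None).
        lineage[name] = inherited.get(name)
    return lineage


session = assign_lineage("myschema", "session", ["session_id"], ["session_date"])
trial = assign_lineage(
    "myschema",
    "trial",
    primary=["session_id", "trial_num"],
    secondary=["stimulus"],
    inherited={"session_id": session["session_id"]},
)

for name, origin in trial.items():
    print(f"Trial.{name} -> {origin}")
```

Under this model, `trial["session_id"]` and `session["session_id"]` hold the same string, which is what makes the two attributes homologous.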
| 107 | + |
| 108 | +## Terminology |
| 109 | + |
| 110 | +| Term | Definition | |
| 111 | +|------|------------| |
| 112 | +| **Lineage** | The origin of an attribute: `schema.table.attribute` | |
| 113 | +| **Homologous attributes** | Attributes with the same lineage | |
| 114 | +| **Namesake attributes** | Attributes with the same name | |
| 115 | +| **Homologous namesakes** | Same name AND same lineage — used for join matching | |
| 116 | +| **Non-homologous namesakes** | Same name BUT different lineage — cause join errors | |
| 117 | + |
| 118 | +## Semantic Matching Rules |
| 119 | + |
| 120 | +When two expressions are combined by a join or any other binary operator, DataJoint checks all namesake attributes (attributes with the same name): |
| 121 | + |
| 122 | +| Scenario | Action | |
| 123 | +|----------|--------| |
| 124 | +| Same name, same lineage (both non-null) | **Match** — attributes are joined | |
| 125 | +| Same name, different lineage | **Error** — non-homologous namesakes | |
| 126 | +| Same name, either lineage is null | **Error** — cannot verify homology | |
| 127 | +| Different names | **No match** — attributes kept separate | |
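The rules in this table can be expressed as a short check over two mappings from attribute name to lineage. The following is a minimal sketch under stated assumptions: `LineageError` and `namesakes_to_match` are hypothetical names, and DataJoint's actual exception type and message differ.

```python
class LineageError(Exception):
    """Raised when namesake attributes cannot be shown to be homologous."""


def namesakes_to_match(left, right):
    """Return the attribute names to match on, or raise LineageError.

    `left` and `right` map attribute names to a lineage string or None.
    """
    matched = []
    for name in sorted(set(left) & set(right)):  # namesakes only
        a, b = left[name], right[name]
        if a is None or b is None:
            raise LineageError(f"cannot verify homology of `{name}`: lineage is missing")
        if a != b:
            raise LineageError(f"`{name}` has different lineages ({a} vs {b})")
        matched.append(name)  # same name, same non-null lineage
    return matched


session = {"session_id": "myschema.session.session_id", "session_date": None}
trial = {
    "session_id": "myschema.session.session_id",
    "trial_num": "myschema.trial.trial_num",
    "stimulus": None,
}
student = {"id": "university.student.id", "name": None}
course = {"id": "university.course.id", "title": None}

print(namesakes_to_match(session, trial))
try:
    namesakes_to_match(student, course)
except LineageError as err:
    print("rejected:", err)
```

Attributes with different names never enter the loop, so they are simply kept separate, as in the last row of the table.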
| 128 | + |
| 129 | +## When Semantic Matching Fails |
| 130 | + |
| 131 | +A failed semantic check raises an error like: |
| 132 | + |
| 133 | +``` |
| 134 | +DataJointError: Cannot join on attribute `id`: different lineages |
| 135 | +(university.student.id vs university.course.id). |
| 136 | +Use .proj() to rename one of the attributes. |
| 137 | +``` |
| 138 | + |
| 139 | +This means you're trying to join tables that have a namesake attribute (`id`) with different lineages. The solutions are: |
| 140 | + |
| 141 | +1. **Rename one attribute** using projection: |
| 142 | + ```python |
| 143 | + Student() * Course().proj(course_id='id') |
| 144 | + ``` |
| 145 | + |
| 146 | +2. **Bypass semantic checking** (use with caution): |
| 147 | + ```python |
| 148 | + Student().join(Course(), semantic_check=False) |
| 149 | + ``` |
| 150 | + |
| 151 | +3. **Use descriptive names** in your schema design (best practice): |
| 152 | + ```python |
| 153 | + class Student(dj.Manual): |
| 154 | + definition = """ |
| 155 | + student_id : int64 # not just 'id' |
| 156 | + --- |
| 157 | + name : varchar(100) |
| 158 | + """ |
| 159 | + ``` |
| 160 | + |
| 161 | +## Examples |
| 162 | + |
| 163 | +### Valid Join (Shared Lineage) |
| 164 | + |
| 165 | +```python |
| 166 | +class Student(dj.Manual): |
| 167 | + definition = """ |
| 168 | + student_id : int64 |
| 169 | + --- |
| 170 | + name : varchar(100) |
| 171 | + """ |
| 172 | + |
| 173 | +class Enrollment(dj.Manual): |
| 174 | + definition = """ |
| 175 | + -> Student |
| 176 | + -> Course |
| 177 | + --- |
| 178 | + grade : varchar(2) |
| 179 | + """ |
| 180 | + |
| 181 | +# Works: student_id has same lineage in both |
| 182 | +Student() * Enrollment() |
| 183 | +``` |
| 184 | + |
| 185 | +### Multi-hop FK Inheritance |
| 186 | + |
| 187 | +Lineage is preserved through multiple levels of foreign key inheritance: |
| 188 | + |
| 189 | +```python |
| 190 | +class Session(dj.Manual): |
| 191 | + definition = """ |
| 192 | + session_id : int64 |
| 193 | + --- |
| 194 | + session_date : date |
| 195 | + """ |
| 196 | + |
| 197 | +class Trial(dj.Manual): |
| 198 | + definition = """ |
| 199 | + -> Session |
| 200 | + trial_num : int32 |
| 201 | + """ |
| 202 | + |
| 203 | +class Response(dj.Computed): |
| 204 | + definition = """ |
| 205 | + -> Trial |
| 206 | + --- |
| 207 | + response_time : float64 |
| 208 | + """ |
| 209 | + |
| 210 | +# All work: session_id traces back to Session in all tables |
| 211 | +Session() * Trial() |
| 212 | +Session() * Response() |
| 213 | +Trial() * Response() |
| 214 | +``` |
| 215 | + |
| 216 | +### Secondary FK Attribute |
| 217 | + |
| 218 | +Lineage works for secondary (non-primary-key) foreign key attributes too: |
| 219 | + |
| 220 | +```python |
| 221 | +class Course(dj.Manual): |
| 222 | + definition = """ |
| 223 | + course_id : int unsigned |
| 224 | + --- |
| 225 | + title : varchar(100) |
| 226 | + """ |
| 227 | + |
| 228 | +class FavoriteCourse(dj.Manual): |
| 229 | + definition = """ |
| 230 | + student_id : int unsigned |
| 231 | + --- |
| 232 | + -> Course |
| 233 | + """ |
| 234 | + |
| 235 | +class RequiredCourse(dj.Manual): |
| 236 | + definition = """ |
| 237 | + major_id : int unsigned |
| 238 | + --- |
| 239 | + -> Course |
| 240 | + """ |
| 241 | + |
| 242 | +# Works: course_id is secondary in both, but has same lineage |
| 243 | +FavoriteCourse() * RequiredCourse() |
| 244 | +``` |
| 245 | + |
| 246 | +### Aliased Foreign Key |
| 247 | + |
| 248 | +When you alias a foreign key, the new name gets the same lineage as the original: |
| 249 | + |
| 250 | +```python |
| 251 | +class Person(dj.Manual): |
| 252 | + definition = """ |
| 253 | + person_id : int unsigned |
| 254 | + --- |
| 255 | + full_name : varchar(100) |
| 256 | + """ |
| 257 | + |
| 258 | +class Marriage(dj.Manual): |
| 259 | + definition = """ |
| 260 | + -> Person.proj(husband='person_id') |
| 261 | + -> Person.proj(wife='person_id') |
| 262 | + --- |
| 263 | + marriage_date : date |
| 264 | + """ |
| 265 | + |
| 266 | +# husband and wife both have lineage: schema.person.person_id |
| 267 | +# They are homologous (same lineage) but have different names |
| 268 | +``` |
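The effect of aliasing can be modeled in plain Python: renaming changes the attribute name but leaves the lineage untouched. This is a sketch only; `rename` is a hypothetical helper that mimics a renaming projection on a name-to-lineage mapping, not DataJoint's `proj` implementation.

```python
def rename(lineage, **new_to_old):
    """Return a copy of `lineage` with attributes renamed as new_name='old_name'.

    Each renamed attribute keeps the lineage of the attribute it came from.
    """
    old_to_new = {old: new for new, old in new_to_old.items()}
    return {old_to_new.get(name, name): origin for name, origin in lineage.items()}


# Primary key of Person, as inherited through a foreign key.
person_key = {"person_id": "schema.person.person_id"}

husband = rename(person_key, husband="person_id")
wife = rename(person_key, wife="person_id")

print(husband)
print(wife)
# Same lineage (homologous), different names (not namesakes).
print(husband["husband"] == wife["wife"], set(husband) & set(wife))
```

Because the two attributes no longer share a name, they are never matched against each other, yet each can still be traced back to `Person`.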
| 269 | + |
| 270 | +## Best Practices |
| 271 | + |
| 272 | +1. **Use descriptive attribute names**: Prefer `student_id` over generic `id` |
| 273 | + |
| 274 | +2. **Leverage foreign keys**: Inherited attributes maintain lineage automatically |
| 275 | + |
| 276 | +3. **Rebuild lineage for legacy schemas**: Run `schema.rebuild_lineage()` once |
| 277 | + |
| 278 | +4. **Rebuild upstream schemas first**: For cross-schema FKs, rebuild parent schemas before child schemas |
| 279 | + |
| 280 | +5. **Restart after rebuilding**: Restart the Python kernel to pick up new lineage information |
| 281 | + |
| 282 | +6. **Use `semantic_check=False` sparingly**: Only when you're certain the natural join is correct |
| 283 | + |
| 284 | +## Design Rationale |
| 285 | + |
| 286 | +Semantic matching reflects a core DataJoint principle: **schema design should encode meaning**. When you create a foreign key reference, you're declaring that two attributes represent the same concept. DataJoint tracks this through lineage, allowing safe joins without relying on naming conventions alone. |
| 287 | + |
| 288 | +This is especially valuable in large, collaborative projects where different teams might independently choose similar attribute names for unrelated concepts. |