Commit 8212125

Merge pull request #129 from datajoint/docs/terminology-and-fk-modifiers
docs: Unify storage terminology and document FK modifiers
2 parents 879d9f5 + 4d3fa00 commit 8212125

34 files changed

+556
-508
lines changed

.gitignore

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -8,4 +8,5 @@ temp*
88

99
# Generated documentation files
1010
src/llms-full.txt
11-
site/llms-full.txt
11+
site/llms-full.txt
12+
dj_local_conf.json

mkdocs.yaml

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@ nav:
1919
- Computation Model: explanation/computation-model.md
2020
- Queries:
2121
- Query Algebra: explanation/query-algebra.md
22+
- Semantic Matching: explanation/semantic-matching.md
2223
- Storage:
2324
- Type System: explanation/type-system.md
2425
- Custom Codecs: explanation/custom-codecs.md

src/explanation/custom-codecs.md

Lines changed: 2 additions & 2 deletions
@@ -138,7 +138,7 @@ class BamCodec(dj.Codec):
138138

139139
def get_dtype(self, is_store: bool) -> str:
140140
if not is_store:
141-
raise dj.DataJointError("<bam> requires external storage: use <bam@>")
141+
raise dj.DataJointError("<bam> requires in-store storage: use <bam@>")
142142
return "<object@>" # Path-addressed storage for file structure
143143

144144
def encode(self, alignments, *, key=None, store_name=None):
@@ -307,7 +307,7 @@ class WellDocumentedCodec(dj.Codec):
307307
"""
308308
Store XYZ data structures.
309309
310-
Supports both internal (<xyz>) and external (<xyz@>) storage.
310+
Supports both in-table (<xyz>) and in-store (<xyz@>) storage.
311311
312312
Examples
313313
--------

src/explanation/data-pipelines.md

Lines changed: 4 additions & 4 deletions
@@ -90,7 +90,7 @@ Scientific data often includes large objects—images, recordings, time series,
9090

9191
All metadata, relationships, and query logic live in the relational database. The schema defines what data exists, how entities relate, and what computations produce them. Queries operate on the relational structure; results are consistent and reproducible.
9292

93-
**2. Large objects live in external stores.**
93+
**2. Large objects live in object stores.**
9494

9595
Object storage (filesystems, S3, GCS, Azure Blob, MinIO) holds the actual bytes—arrays, images, files. The database stores only lightweight references (paths, checksums, metadata). This separation lets the database stay fast while data scales to terabytes.
9696

@@ -101,16 +101,16 @@ DataJoint's [type system](type-system.md) provides codec types that bridge Pytho
101101
| Codec | Purpose |
102102
|-------|---------|
103103
| `<blob>` | Serialize Python objects (NumPy arrays, dicts) |
104-
| `<blob@store>` | Same, but stored externally |
104+
| `<blob@store>` | Same, but stored in an object store |
105105
| `<attach>` | Store files with preserved filenames |
106106
| `<object@store>` | Path-addressed storage for complex structures (Zarr, HDF5) |
107-
| `<filepath@store>` | References to externally-managed files |
107+
| `<filepath@store>` | References to user-managed files |
108108

109109
Users work with native Python objects; serialization and storage routing are invisible.
110110

111111
**4. Referential integrity extends to objects.**
112112

113-
When a database row is deleted, its associated external objects are garbage-collected. Foreign key cascades work correctly—delete upstream data and downstream results (including their objects) disappear. The database and object store remain synchronized without manual cleanup.
113+
When a database row is deleted, its associated stored objects are garbage-collected. Foreign key cascades work correctly—delete upstream data and downstream results (including their objects) disappear. The database and object store remain synchronized without manual cleanup.
114114

115115
**5. Multiple storage tiers support diverse access patterns.**
116116

src/explanation/faq.md

Lines changed: 1 addition & 0 deletions
@@ -75,6 +75,7 @@ The definition string **is** the specification — a declarative language that d
7575
### Why Custom Query Algebra?
7676

7777
DataJoint's operators implement **[semantic matching](../reference/specs/semantic-matching.md)** — joins and restrictions match only on attributes connected through the foreign key graph, not arbitrary columns that happen to share a name. This prevents:
78+
7879
- Accidental Cartesian products
7980
- Joins on unrelated columns
8081
- Silent incorrect results

src/explanation/index.md

Lines changed: 2 additions & 2 deletions
@@ -33,8 +33,8 @@ and scalable.
3333

3434
- :material-code-tags: **[Type System](type-system.md)**
3535

36-
Three-layer architecture: native, core, and codec types. Internal and
37-
external storage modes.
36+
Three-layer architecture: native, core, and codec types. In-table and
37+
in-store storage modes.
3838

3939
- :material-cog-play: **[Computation Model](computation-model.md)**
4040

src/explanation/query-algebra.md

Lines changed: 8 additions & 6 deletions
@@ -229,15 +229,17 @@ result = (
229229
)
230230
```
231231

232-
## Workflow-Aware Joins
232+
## Semantic Matching
233233

234-
Unlike SQL's natural joins that match on **any** shared column name, DataJoint
235-
joins match on **semantic lineage**. Two attributes match only if they:
234+
All binary operators in DataJoint rely on **semantic matching** of attributes. Unlike SQL's natural joins that match on any shared column name, DataJoint verifies that namesake attributes (those with the same name) are **homologous**—they trace back to the same original definition.
236235

237-
1. Have the same name
238-
2. Trace back to the same source definition
236+
This prevents accidental matches on coincidentally-named columns, a pitfall that has been understood since the Entity-Relationship Model was introduced in the 1970s but was never addressed in SQL or traditional RDBMS implementations.
239237

240-
This prevents accidental joins on coincidentally-named columns.
238+
See [Semantic Matching](semantic-matching.md) for details.
239+
240+
These concepts were first introduced in Yatsenko et al., 2018[^1].
241+
242+
[^1]: Yatsenko D, Walker EY, Tolias AS (2018). DataJoint: A Simpler Relational Data Model. [arXiv:1807.11104](https://doi.org/10.48550/arXiv.1807.11104)
241243

242244
## Fetching Results
243245

Lines changed: 288 additions & 0 deletions
@@ -0,0 +1,288 @@
1+
# Semantic Matching
2+
3+
Semantic matching ensures that attributes are only matched in joins when they share both the same name and the same **lineage** (origin). This prevents accidental joins on unrelated attributes that happen to share names.
4+
5+
## Relationship to Natural Joins
6+
7+
DataJoint's join operator (`*`) performs a **natural join**—the standard relational operation that matches tuples on all common attribute names. If you're familiar with SQL's `NATURAL JOIN` or relational algebra, DataJoint's join works the same way.
8+
9+
**Semantic matching adds a safety check** on top of the natural join. Before performing the join, DataJoint verifies that all common attributes (namesakes) actually represent the same thing by checking their lineage. If two attributes share a name but have different origins, the join is rejected rather than silently producing incorrect results.
10+
11+
In other words: **semantic matching is a natural join that refuses to match unrelated attributes that merely share a name.**
12+
13+
Two attributes are **homologous** if they share the same lineage—that is, they trace back to the same original definition. Attributes with the same name are called **namesakes**. Semantic matching requires that all namesakes be homologous.
14+
15+
### All Binary Operators Use Semantic Matching
16+
17+
Semantic matching applies to **all binary operators** that combine two query expressions, not just join:
18+
19+
| Operator | Syntax | Semantic Check |
|----------|--------|----------------|
| Join | `A * B` | Namesakes must be homologous |
| Restriction | `A & B` | Namesakes must be homologous |
| Anti-restriction | `A - B` | Namesakes must be homologous |
| Aggregation | `A.aggr(B, ...)` | Namesakes must be homologous |
| Extension | `A.extend(B)` | Namesakes must be homologous |
| Union | `A + B` | Namesakes must be homologous |
27+
28+
In each case, DataJoint verifies that any common attribute names between the two operands share the same lineage before proceeding with the operation.
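The idea that every operator shares one check can be sketched with a toy model. This is not DataJoint's implementation: the `Expr` class, the `assert_homologous` helper, and the `LineageError` exception are hypothetical names, each expression is reduced to a mapping from attribute name to lineage string, and `aggr` and `extend` are left out for brevity.

```python
class LineageError(Exception):
    """Toy stand-in for the error raised on non-homologous namesakes."""


def assert_homologous(left, right):
    """Verify that every attribute name shared by two headings is homologous.

    A heading here is a dict mapping attribute name -> lineage string (or None).
    """
    for name in sorted(left.keys() & right.keys()):
        a, b = left[name], right[name]
        if a is None or b is None or a != b:
            raise LineageError(f"namesake `{name}` is not homologous: {a} vs {b}")


class Expr:
    """Toy query expression that tracks only its heading, not its rows."""

    def __init__(self, heading):
        self.heading = dict(heading)

    def _combine(self, other, widen):
        assert_homologous(self.heading, other.heading)  # same check for every operator
        # Simplification: a join widens the heading; other operators keep the left one.
        return Expr({**other.heading, **self.heading} if widen else self.heading)

    def __mul__(self, other):  # join: A * B
        return self._combine(other, widen=True)

    def __and__(self, other):  # restriction: A & B
        return self._combine(other, widen=False)

    def __sub__(self, other):  # anti-restriction: A - B
        return self._combine(other, widen=False)

    def __add__(self, other):  # union: A + B
        return self._combine(other, widen=False)


session = Expr({"session_id": "lab.session.session_id", "session_date": None})
trial = Expr({"session_id": "lab.session.session_id", "trial_num": "lab.trial.trial_num"})
student = Expr({"id": "university.student.id"})
course = Expr({"id": "university.course.id"})

joined = session * trial      # passes: session_id is homologous
restricted = session & trial  # passes for the same reason
```

In this sketch, `student * course`, `student & course`, `student - course`, and `student + course` all raise `LineageError`, because the shared name `id` has a different lineage on each side.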
29+
30+
These concepts were first introduced in Yatsenko et al., 2018[^1].
31+
32+
[^1]: Yatsenko D, Walker EY, Tolias AS (2018). DataJoint: A Simpler Relational Data Model. [arXiv:1807.11104](https://doi.org/10.48550/arXiv.1807.11104)
33+
34+
## Why Semantic Matching?
35+
36+
The natural join is elegant and powerful, but it has a well-known weakness: it relies entirely on naming conventions. If two tables happen to have columns with the same name but different meanings, a natural join silently pairs rows whose unrelated values happen to coincide, a subtle bug that can go undetected.
37+
38+
```python
# Classic natural join pitfall
class Student(dj.Manual):
    definition = """
    id : int64 # student ID
    ---
    name : varchar(100)
    """

class Course(dj.Manual):
    definition = """
    id : int64 # course ID
    ---
    title : varchar(100)
    """

# Natural join would match on 'id', but these are unrelated!
# Student #5 paired with Course #5 is meaningless.
```
57+
58+
DataJoint's semantic matching solves this by tracking the **lineage** of each attribute—where it was originally defined. Attributes only match if they have the same lineage, ensuring that joins always combine semantically related data.
59+
60+
## Attribute Lineage
61+
62+
Lineage identifies the **origin** of an attribute—the **dimension** where it was first defined. A dimension is an independent axis of variation introduced by a table that defines new primary key attributes. See [Schema Dimensions](entity-integrity.md#schema-dimensions) for details.
63+
64+
Lineage is represented as a string:
65+
66+
```
67+
schema_name.table_name.attribute_name
68+
```
69+
70+
### Lineage Assignment Rules
71+
72+
| Attribute Type | Lineage Value |
|----------------|---------------|
| Native primary key | `this_schema.this_table.attr_name` |
| FK-inherited (primary or secondary) | Traced to original definition |
| Native secondary | `None` |
| Computed (in projection) | `None` |
78+
79+
### Example
80+
81+
```python
class Session(dj.Manual):  # table: session
    definition = """
    session_id : int64
    ---
    session_date : date
    """

class Trial(dj.Manual):  # table: trial
    definition = """
    -> Session
    trial_num : int32
    ---
    stimulus : varchar(100)
    """
```
97+
98+
Lineages:
99+
100+
- `Session.session_id` → `myschema.session.session_id` (native PK)
- `Session.session_date` → `None` (native secondary)
- `Trial.session_id` → `myschema.session.session_id` (inherited via FK)
- `Trial.trial_num` → `myschema.trial.trial_num` (native PK)
- `Trial.stimulus` → `None` (native secondary)
105+
106+
Notice that `Trial.session_id` has the same lineage as `Session.session_id` because it was inherited through the foreign key reference. This allows `Session * Trial` to work correctly—both `session_id` attributes are **homologous**.
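The assignment rules can be mimicked in a few lines of plain Python. This is a minimal sketch under stated assumptions, not how DataJoint stores or computes lineage: the `TABLES` dictionary layout and the `primary_key` and `lineage` helpers are hypothetical, a single schema is assumed, and foreign-key renaming is not modeled.

```python
def primary_key(tables, table):
    """Full primary key of a table: inherited attributes first, then native ones."""
    spec = tables[table]
    attrs = []
    for parent in spec.get("fk_primary", []):
        attrs += [a for a in primary_key(tables, parent) if a not in attrs]
    return attrs + spec.get("primary", [])


def lineage(tables, schema, table, attr):
    """Return `schema.table.attribute` for the attribute's origin, or None."""
    spec = tables[table]
    # FK-inherited attributes (primary or secondary) trace back to the parent.
    for parent in spec.get("fk_primary", []) + spec.get("fk_secondary", []):
        if attr in primary_key(tables, parent):
            return lineage(tables, schema, parent, attr)
    if attr in spec.get("primary", []):
        return f"{schema}.{table}.{attr}"  # native primary key defines a new origin
    if attr in spec.get("secondary", []):
        return None  # native secondary attributes carry no lineage
    raise KeyError(f"{table} has no attribute {attr!r}")


# Hypothetical description of the Session and Trial tables from the example.
TABLES = {
    "session": {"primary": ["session_id"], "secondary": ["session_date"]},
    "trial": {
        "fk_primary": ["session"],
        "primary": ["trial_num"],
        "secondary": ["stimulus"],
    },
}

for table, attr in [("session", "session_id"), ("trial", "session_id"),
                    ("trial", "trial_num"), ("trial", "stimulus")]:
    print(f"{table}.{attr}: {lineage(TABLES, 'myschema', table, attr)}")
```

Because the helper recurses into the parent table, the same function would also cover multi-hop inheritance in this toy model.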
107+
108+
## Terminology
109+
110+
| Term | Definition |
|------|------------|
| **Lineage** | The origin of an attribute: `schema.table.attribute` |
| **Homologous attributes** | Attributes with the same lineage |
| **Namesake attributes** | Attributes with the same name |
| **Homologous namesakes** | Same name AND same lineage; used for join matching |
| **Non-homologous namesakes** | Same name BUT different lineage; cause join errors |
117+
118+
## Semantic Matching Rules
119+
120+
When two expressions are joined, DataJoint checks all namesake attributes (attributes with the same name):
121+
122+
| Scenario | Action |
|----------|--------|
| Same name, same lineage (both non-null) | **Match**: attributes are joined |
| Same name, different lineage | **Error**: non-homologous namesakes |
| Same name, either lineage is null | **Error**: cannot verify homology |
| Different names | **No match**: attributes kept separate |
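
The four scenarios in the table can be expressed as one small function. This is an illustrative sketch only: `match_namesakes` is a hypothetical helper, headings are plain dicts from attribute name to lineage string, and the error messages are made up rather than copied from DataJoint.

```python
def match_namesakes(left, right):
    """Return the attribute names to match on, or raise ValueError.

    `left` and `right` map attribute name -> lineage string (None if unknown).
    Attributes with different names are never compared, so they stay separate.
    """
    matched = []
    for name in sorted(left.keys() & right.keys()):  # namesakes only
        a, b = left[name], right[name]
        if a is None or b is None:
            raise ValueError(f"cannot verify homology of `{name}`: lineage is null")
        if a != b:
            raise ValueError(f"non-homologous namesakes `{name}`: {a} vs {b}")
        matched.append(name)  # same name, same non-null lineage
    return matched


student = {"student_id": "university.student.student_id", "name": None}
enrollment = {
    "student_id": "university.student.student_id",
    "course_id": "university.course.course_id",
    "grade": None,
}

print(match_namesakes(student, enrollment))  # ['student_id']
```

Passing two headings that both contain a native secondary attribute such as `name` (lineage `None`) would raise in this sketch, mirroring the "cannot verify homology" row.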
128+
129+
## When Semantic Matching Fails
130+
131+
If you see an error like:
132+
133+
```
134+
DataJointError: Cannot join on attribute `id`: different lineages
135+
(university.student.id vs university.course.id).
136+
Use .proj() to rename one of the attributes.
137+
```
138+
139+
This means you're trying to join tables that have a namesake attribute (`id`) with different lineages. The solutions are:
140+
141+
1. **Rename one attribute** using projection:
    ```python
    Student() * Course().proj(course_id='id')
    ```

2. **Bypass semantic checking** (use with caution):
    ```python
    Student().join(Course(), semantic_check=False)
    ```

3. **Use descriptive names** in your schema design (best practice):
    ```python
    class Student(dj.Manual):
        definition = """
        student_id : int64 # not just 'id'
        ---
        name : varchar(100)
        """
    ```
160+
161+
## Examples
162+
163+
### Valid Join (Shared Lineage)
164+
165+
```python
class Student(dj.Manual):
    definition = """
    student_id : int64
    ---
    name : varchar(100)
    """

class Enrollment(dj.Manual):
    definition = """
    -> Student
    -> Course
    ---
    grade : varchar(2)
    """

# Works: student_id has same lineage in both
Student() * Enrollment()
```
184+
185+
### Multi-hop FK Inheritance
186+
187+
Lineage is preserved through multiple levels of foreign key inheritance:
188+
189+
```python
class Session(dj.Manual):
    definition = """
    session_id : int64
    ---
    session_date : date
    """

class Trial(dj.Manual):
    definition = """
    -> Session
    trial_num : int32
    """

class Response(dj.Computed):
    definition = """
    -> Trial
    ---
    response_time : float64
    """

# All work: session_id traces back to Session in all tables
Session() * Trial()
Session() * Response()
Trial() * Response()
```
215+
216+
### Secondary FK Attribute
217+
218+
Lineage works for secondary (non-primary-key) foreign key attributes too:
219+
220+
```python
class Course(dj.Manual):
    definition = """
    course_id : int unsigned
    ---
    title : varchar(100)
    """

class FavoriteCourse(dj.Manual):
    definition = """
    student_id : int unsigned
    ---
    -> Course
    """

class RequiredCourse(dj.Manual):
    definition = """
    major_id : int unsigned
    ---
    -> Course
    """

# Works: course_id is secondary in both, but has same lineage
FavoriteCourse() * RequiredCourse()
```
245+
246+
### Aliased Foreign Key
247+
248+
When you alias a foreign key, the new name gets the same lineage as the original:
249+
250+
```python
class Person(dj.Manual):
    definition = """
    person_id : int unsigned
    ---
    full_name : varchar(100)
    """

class Marriage(dj.Manual):
    definition = """
    -> Person.proj(husband='person_id')
    -> Person.proj(wife='person_id')
    ---
    marriage_date : date
    """

# husband and wife both have lineage: schema.person.person_id
# They are homologous (same lineage) but have different names
```
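The distinction between homologous attributes and namesakes in this example can be made concrete with a toy model. The dictionaries and the `are_homologous` and `namesakes` helpers below are hypothetical and only mirror the lineages stated in the comments above; they do not call DataJoint.

```python
PERSON_ID = "schema.person.person_id"

# Headings as attribute name -> lineage, following the Person/Marriage example.
person = {"person_id": PERSON_ID, "full_name": None}
marriage = {"husband": PERSON_ID, "wife": PERSON_ID, "marriage_date": None}


def are_homologous(lineage_a, lineage_b):
    """Two attributes are homologous when they share a non-null lineage."""
    return lineage_a is not None and lineage_a == lineage_b


def namesakes(left, right):
    """Attribute names that appear in both headings."""
    return sorted(left.keys() & right.keys())


print(are_homologous(marriage["husband"], marriage["wife"]))     # True
print(are_homologous(marriage["husband"], person["person_id"]))  # True
print(namesakes(person, marriage))                                # []
```

In this toy model `person` and `marriage` share no namesakes, so name-based matching has nothing to match on even though three attributes are homologous.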
269+
270+
## Best Practices
271+
272+
1. **Use descriptive attribute names**: Prefer `student_id` over generic `id`
273+
274+
2. **Leverage foreign keys**: Inherited attributes maintain lineage automatically
275+
276+
3. **Rebuild lineage for legacy schemas**: Run `schema.rebuild_lineage()` once
277+
278+
4. **Rebuild upstream schemas first**: For cross-schema FKs, rebuild parent schemas before child schemas
279+
280+
5. **Restart after rebuilding**: Restart the Python kernel to pick up new lineage information
281+
282+
6. **Use `semantic_check=False` sparingly**: Only when you're certain the natural join is correct
283+
284+
## Design Rationale
285+
286+
Semantic matching reflects a core DataJoint principle: **schema design should encode meaning**. When you create a foreign key reference, you're declaring that two attributes represent the same concept. DataJoint tracks this through lineage, allowing safe joins without relying on naming conventions alone.
287+
288+
This is especially valuable in large, collaborative projects where different teams might independently choose similar attribute names for unrelated concepts.
