Commit df94bb7

Merge pull request #119 from datajoint/preview/all-pending-prs
Documentation Cohesion Review: Comprehensive Improvements for DataJoint 2.0
2 parents ee88c2d + 70f3172 commit df94bb7

18 files changed, with 22743 additions and 9824 deletions.

src/explanation/entity-integrity.md

Lines changed: 32 additions & 37 deletions
@@ -172,15 +172,13 @@ referential integrity and workflow dependency.
 
 ## Schema Dimensions
 
-A **dimension** is an independent axis of variation in your data, introduced by
-a table that defines new primary key attributes. Dimensions are the fundamental
-building blocks of schema design.
+A **dimension** is an independent axis of variation in your data. The fundamental principle:
 
-### Dimension-Introducing Tables
+> **Any table that introduces a new primary key attribute introduces a new dimension.**
 
-A table **introduces a dimension** when it defines primary key attributes that
-don't come from a foreign key. In schema diagrams, these tables have
-**underlined names**.
+This is true whether the table has only new attributes or also inherits attributes from foreign keys. The key is simply: new primary key attribute = new dimension.
+
+### Tables That Introduce Dimensions
 
 ```python
 @schema
@@ -192,52 +190,49 @@ class Subject(dj.Manual):
     """
 
 @schema
-class Modality(dj.Lookup):
+class Session(dj.Manual):
     definition = """
-    modality : varchar(32) # NEW dimension: modality
+    -> Subject # Inherits subject_id
+    session_idx : uint16 # NEW dimension: session_idx
     ---
-    description : varchar(255)
+    session_date : date
     """
-```
-
-Both `Subject` and `Modality` are dimension-introducing tables—they create new
-axes along which data varies.
 
-### Dimension-Inheriting Tables
-
-A table **inherits dimensions** when its entire primary key comes from foreign
-keys. In schema diagrams, these tables have **non-underlined names**.
-
-```python
 @schema
-class SubjectProfile(dj.Manual):
+class Trial(dj.Manual):
     definition = """
-    -> Subject # Inherits subject_id dimension
+    -> Session # Inherits subject_id, session_idx
+    trial_idx : uint16 # NEW dimension: trial_idx
     ---
-    weight : float32
+    outcome : enum('success', 'fail')
     """
 ```
 
-`SubjectProfile` doesn't introduce a new dimension—it extends the `Subject`
-dimension with additional attributes. There's exactly one profile per subject.
+**All three tables introduce dimensions:**
+
+- `Subject` introduces `subject_id` dimension
+- `Session` introduces `session_idx` dimension (even though it also inherits `subject_id`)
+- `Trial` introduces `trial_idx` dimension (even though it also inherits `subject_id`, `session_idx`)
 
-### Mixed Tables
+In schema diagrams, tables that introduce at least one new dimension have **underlined names**.
 
-Most tables both inherit and introduce dimensions:
+### Tables That Don't Introduce Dimensions
+
+A table introduces **no dimensions** when its entire primary key comes from foreign keys:
 
 ```python
 @schema
-class Session(dj.Manual):
+class SubjectProfile(dj.Manual):
     definition = """
-    -> Subject # Inherits subject_id dimension
-    session_idx : uint16 # NEW dimension within subject
+    -> Subject # Inherits subject_id only
     ---
-    session_date : date
+    weight : float32
     """
 ```
 
-`Session` inherits the subject dimension but introduces a new dimension
-(`session_idx`) within each subject. This creates a hierarchical structure.
+`SubjectProfile` doesn't introduce any new primary key attribute—it extends the `Subject` dimension with additional attributes. There's exactly one profile per subject.
+
+In schema diagrams, these tables have **non-underlined names**.
 
 ### Computed Tables and Dimensions
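Reviewer sketch for the hunk above: the revised rule (a new primary key attribute means a new dimension) can be checked mechanically. This is a minimal illustration in plain Python, not DataJoint code. `TableSpec`, its fields, and `SPECS` are hypothetical names introduced here, and the key attributes are copied by hand from the table definitions in the diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableSpec:
    """Hypothetical stand-in for a table's primary key (not a DataJoint class)."""

    name: str
    inherited_pk: tuple = ()  # primary key attributes that arrive through foreign keys
    new_pk: tuple = ()        # primary key attributes the table declares itself

    def introduces_dimension(self) -> bool:
        # Rule from the diff: any new primary key attribute is a new dimension,
        # whether or not the table also inherits attributes.
        return len(self.new_pk) > 0


# Primary keys transcribed from the definitions shown in the hunk.
SPECS = [
    TableSpec("Subject", new_pk=("subject_id",)),
    TableSpec("Session", inherited_pk=("subject_id",), new_pk=("session_idx",)),
    TableSpec("Trial", inherited_pk=("subject_id", "session_idx"), new_pk=("trial_idx",)),
    TableSpec("SubjectProfile", inherited_pk=("subject_id",)),
]

# Tables the revised text says are drawn with underlined names.
underlined = [spec.name for spec in SPECS if spec.introduces_dimension()]
# Tables whose whole primary key comes from foreign keys.
not_underlined = [spec.name for spec in SPECS if not spec.introduces_dimension()]
```

Under these assumptions `Subject`, `Session`, and `Trial` land in `underlined` and only `SubjectProfile` lands in `not_underlined`, which matches the classification in the added lines.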

@@ -288,15 +283,15 @@ detection.
 
 ### Dimensions and Attribute Lineage
 
-Every primary key attribute traces back to the dimension where it was first
+Every foreign key attribute traces back to the dimension where it was first
 defined. This is called **attribute lineage**:
 
 ```
 Subject.subject_id → myschema.subject.subject_id (origin)
-Session.subject_id → myschema.subject.subject_id (inherited)
+Session.subject_id → myschema.subject.subject_id (inherited via foreign key)
 Session.session_idx → myschema.session.session_idx (origin)
-Trial.subject_id → myschema.subject.subject_id (inherited)
-Trial.session_idx → myschema.session.session_idx (inherited)
+Trial.subject_id → myschema.subject.subject_id (inherited via foreign key)
+Trial.session_idx → myschema.session.session_idx (inherited via foreign key)
 Trial.trial_idx → myschema.trial.trial_idx (origin)
 ```
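Reviewer sketch for the lineage listing above: the mapping can be reproduced by walking foreign keys back to the table that first declares an attribute. This is a plain-Python illustration, not how DataJoint computes lineage. The `TABLES` dictionary and the `lineage` helper are hypothetical, and the sketch assumes each table has at most one parent, which holds for the `Subject`, `Session`, `Trial` chain in the diff.

```python
# Hypothetical description of the example schema: each table names its parent
# (the foreign key target) and the primary key attributes it declares itself.
TABLES = {
    "subject": {"parent": None, "new_pk": ["subject_id"]},
    "session": {"parent": "subject", "new_pk": ["session_idx"]},
    "trial": {"parent": "session", "new_pk": ["trial_idx"]},
}


def lineage(table: str, attribute: str, schema: str = "myschema"):
    """Return (origin path, kind) for a primary key attribute of `table`."""
    current = table
    while current is not None:
        if attribute in TABLES[current]["new_pk"]:
            # The attribute is an origin only in the table that declares it.
            kind = "origin" if current == table else "inherited via foreign key"
            return f"{schema}.{current}.{attribute}", kind
        current = TABLES[current]["parent"]  # follow the foreign key upward
    raise KeyError(f"{attribute!r} is not in the primary key of {table!r}")


trial_subject = lineage("trial", "subject_id")
session_idx = lineage("session", "session_idx")
```

With these inputs, `trial_subject` resolves to `myschema.subject.subject_id` as an inherited attribute and `session_idx` resolves to `myschema.session.session_idx` as an origin, in line with the listing in the hunk.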

src/explanation/normalization.md

Lines changed: 116 additions & 16 deletions
@@ -12,6 +12,16 @@ makes normalization intuitive.
 
 This principle naturally leads to well-normalized schemas.
 
+## The Intrinsic Attributes Principle
+
+> **"Each entity should contain only its intrinsic attributes—properties that are inherent to the entity itself. Relationships, assignments, and events that happen over time belong in separate tables."**
+
+**Full workflow entity normalization** is achieved when:
+
+1. Each row represents a single, well-defined entity
+2. Each entity is entered once when first tracked
+3. Events that happen at later stages belong in separate tables
+
 ## Why Normalization Matters
 
 Without normalization, databases suffer from:
@@ -29,6 +39,8 @@ table structure. DataJoint takes a different approach: design tables around
 
 ### Example: Mouse Housing
 
+**Problem: Cage is not intrinsic to a mouse.** A mouse's cage can change over time. The cage assignment is an **event** that happens after the mouse is first tracked.
+
 **Denormalized (problematic):**
 
 ```python
@@ -44,7 +56,7 @@ class Mouse(dj.Manual):
     """
 ```
 
-**Normalized (correct):**
+**Partially normalized (better, but not complete):**
 
 ```python
 @schema
@@ -53,15 +65,47 @@ class Cage(dj.Manual):
     cage_id : int32
     ---
     location : varchar(50)
-    temperature : float32
     """
 
 @schema
 class Mouse(dj.Manual):
     definition = """
     mouse_id : int32
     ---
+    -> Cage # Still treats cage as static attribute
+    """
+```
+
+**Fully normalized (correct):**
+
+```python
+@schema
+class Cage(dj.Manual):
+    definition = """
+    cage_id : int32
+    ---
+    location : varchar(50)
+    """
+
+@schema
+class Mouse(dj.Manual):
+    definition = """
+    mouse_id : int32
+    ---
+    date_of_birth : date
+    sex : enum('M', 'F')
+    # Note: NO cage reference here!
+    # Cage is not intrinsic to the mouse
+    """
+
+@schema
+class CageAssignment(dj.Manual):
+    definition = """
+    -> Mouse
+    assignment_date : date
+    ---
     -> Cage
+    removal_date=null : date
     """
 
 @schema
@@ -74,20 +118,37 @@ class MouseWeight(dj.Manual):
     """
 ```
 
-This normalized design:
+This fully normalized design:
 
-- Stores cage info once (no redundancy)
-- Tracks weight history (temporal dimension)
-- Allows cage changes without data loss
+- **Intrinsic attributes only** — `Mouse` contains only attributes determined at creation (birth date, sex)
+- **Cage assignment as event** — `CageAssignment` tracks the temporal relationship between mice and cages
+- **Single entity per row** — Each mouse is entered once when first tracked
+- **Later events separate** — Cage assignments, weight measurements happen after initial tracking
+- **History preserved** — Can track cage moves over time without data loss
 
 ## The Workflow Test
 
-Ask: "At which workflow step is this attribute determined?"
+Ask these questions to determine table structure:
 
-- If an attribute is determined at a **different step**, it belongs in a
-**different table**
-- If an attribute **changes over time**, it needs its own table with a
-**temporal key**
+### 1. "Is this an intrinsic attribute of the entity?"
+
+An intrinsic attribute is inherent to the entity itself and determined when the entity is first created.
+
+- **Intrinsic:** Mouse's date of birth, sex, genetic strain
+- **Not intrinsic:** Mouse's cage (assignment that changes), weight (temporal measurement)
+
+If not intrinsic → separate table for the relationship or event
+
+### 2. "At which workflow step is this attribute determined?"
+
+- If an attribute is determined at a **different step**, it belongs in a **different table**
+- If an attribute **changes over time**, it needs its own table with a **temporal key**
+
+### 3. "Is this a relationship or event?"
+
+- **Relationships** (cage assignment, group membership) → association table with temporal keys
+- **Events** (measurements, observations) → separate table with event date/time
+- **States** (approval status, processing stage) → state transition table
 
 ## Common Patterns
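Reviewer sketch for the hunk above: keeping cage assignments as dated events is what makes "which cage was this mouse in on a given day?" answerable. The snippet below models `CageAssignment` rows as plain Python dictionaries rather than querying DataJoint. The sample rows and the `cage_on` helper are invented for illustration, and treating `removal_date` as exclusive (the mouse has left the cage on that day) is an assumption, since the diff does not say.

```python
from datetime import date

# Illustrative rows shaped like the CageAssignment definition in the diff.
# The values are made up; removal_date is None while the assignment is open.
assignments = [
    {"mouse_id": 1, "assignment_date": date(2024, 1, 5), "cage_id": 10,
     "removal_date": date(2024, 2, 1)},
    {"mouse_id": 1, "assignment_date": date(2024, 2, 1), "cage_id": 12,
     "removal_date": None},
]


def cage_on(mouse_id: int, day: date):
    """Return the cage a mouse occupied on `day`, or None if unassigned."""
    for row in assignments:
        if row["mouse_id"] != mouse_id:
            continue
        started = row["assignment_date"] <= day
        # Assumption: removal_date is exclusive, so a same-day move counts
        # toward the new cage rather than the old one.
        still_there = row["removal_date"] is None or day < row["removal_date"]
        if started and still_there:
            return row["cage_id"]
    return None


before_move = cage_on(1, date(2024, 1, 20))
on_move_day = cage_on(1, date(2024, 2, 1))
```

The partially normalized design, with `-> Cage` stored on `Mouse`, could only answer this for the present, because each move would overwrite the previous value.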

@@ -126,7 +187,7 @@ class AnalysisParams(dj.Lookup):
 
 ### Temporal Tracking
 
-Track attributes that change over time:
+Track measurements or observations over time:
 
 ```python
 @schema
@@ -139,6 +200,34 @@ class SubjectWeight(dj.Manual):
     """
 ```
 
+### Temporal Associations
+
+Track relationships or assignments that change over time:
+
+```python
+@schema
+class GroupAssignment(dj.Manual):
+    definition = """
+    -> Subject
+    assignment_date : date
+    ---
+    -> ExperimentalGroup
+    removal_date=null : date
+    """
+
+@schema
+class HousingAssignment(dj.Manual):
+    definition = """
+    -> Animal
+    move_date : date
+    ---
+    -> Cage
+    move_reason : varchar(200)
+    """
+```
+
+**Key pattern:** The relationship itself (subject-to-group, animal-to-cage) is **not intrinsic** to either entity. It's a temporal event that happens during the workflow.
+
 ## Benefits in DataJoint
 
 1. **Natural from workflow thinking** — Designing around workflow steps
@@ -155,7 +244,18 @@ class SubjectWeight(dj.Manual):
 
 ## Summary
 
-- Normalize by designing around **workflow steps**
-- Each table = one entity type at one workflow step
-- Attributes belong with the step that **determines** them
-- Temporal data needs **temporal keys**
+**Core principles:**
+
+1. **Intrinsic attributes only** — Each entity contains only properties inherent to itself
+2. **One entity, one entry** — Each entity entered once when first tracked
+3. **Events separate** — Relationships, assignments, measurements that happen later belong in separate tables
+4. **Workflow steps** — Design tables around the workflow step that creates each entity
+5. **Temporal keys** — Relationships and observations that change over time need temporal keys (dates, timestamps)
+
+**Ask yourself:**
+
+- Is this attribute intrinsic to the entity? (No → separate table)
+- Does this attribute change over time? (Yes → temporal table)
+- Is this a relationship or event? (Yes → association/event table)
+
+Following these principles achieves **full workflow entity normalization** where each table represents a single, well-defined entity type entered at a specific workflow step.

src/explanation/whats-new-2.md

Lines changed: 1 addition & 1 deletion
@@ -64,7 +64,7 @@ zarr_array : <object@store> # Path-addressed for Zarr/HDF5
 ### What Changed
 
 Legacy DataJoint overloaded MySQL types with implicit conversions:
-- `longblob` could be blob serialization OR inline attachment
+- `longblob` could be blob serialization OR in-table attachment
 - `attach` was implicitly converted to longblob
 - `uuid` was used internally for external storage
