src/explanation/entity-integrity.md
Lines changed: 32 additions & 37 deletions
@@ -172,15 +172,13 @@ referential integrity and workflow dependency.
 
 ## Schema Dimensions
 
-A **dimension** is an independent axis of variation in your data, introduced by
-a table that defines new primary key attributes. Dimensions are the fundamental
-building blocks of schema design.
+A **dimension** is an independent axis of variation in your data. The fundamental principle:
 
-### Dimension-Introducing Tables
+> **Any table that introduces a new primary key attribute introduces a new dimension.**
 
-A table **introduces a dimension** when it defines primary key attributes that
-don't come from a foreign key. In schema diagrams, these tables have
-**underlined names**.
+This is true whether the table has only new attributes or also inherits attributes from foreign keys. The key is simply: new primary key attribute = new dimension.
+
+### Tables That Introduce Dimensions
 
 ```python
 @schema
@@ -192,52 +190,49 @@ class Subject(dj.Manual):
     """
 
 @schema
-class Modality(dj.Lookup):
+class Session(dj.Manual):
     definition = """
-    modality : varchar(32) # NEW dimension: modality
+    -> Subject # Inherits subject_id
+    session_idx : uint16 # NEW dimension: session_idx
     ---
-    description : varchar(255)
+    session_date : date
     """
-```
-
-Both `Subject` and `Modality` are dimension-introducing tables: they create new
-axes along which data varies.
 
-### Dimension-Inheriting Tables
-
-A table **inherits dimensions** when its entire primary key comes from foreign
-keys. In schema diagrams, these tables have **non-underlined names**.
-
-```python
 @schema
-class SubjectProfile(dj.Manual):
+class Trial(dj.Manual):
     definition = """
-    -> Subject # Inherits subject_id dimension
+    -> Session # Inherits subject_id, session_idx
+    trial_idx : uint16 # NEW dimension: trial_idx
     ---
-    weight : float32
+    outcome : enum('success', 'fail')
     """
 ```
 
-`SubjectProfile` doesn't introduce a new dimension; it extends the `Subject`
-dimension with additional attributes. There's exactly one profile per subject.
+**All three tables introduce dimensions:**
+
+- `Subject` introduces `subject_id` dimension
+- `Session` introduces `session_idx` dimension (even though it also inherits `subject_id`)
+- `Trial` introduces `trial_idx` dimension (even though it also inherits `subject_id`, `session_idx`)
 
-### Mixed Tables
+In schema diagrams, tables that introduce at least one new dimension have **underlined names**.
 
-Most tables both inherit and introduce dimensions:
+### Tables That Don't Introduce Dimensions
+
+A table introduces **no dimensions** when its entire primary key comes from foreign keys:
 
 ```python
 @schema
-class Session(dj.Manual):
+class SubjectProfile(dj.Manual):
     definition = """
-    -> Subject # Inherits subject_id dimension
-    session_idx : uint16 # NEW dimension within subject
+    -> Subject # Inherits subject_id only
     ---
-    session_date : date
+    weight : float32
     """
 ```
 
-`Session` inherits the subject dimension but introduces a new dimension
-(`session_idx`) within each subject. This creates a hierarchical structure.
+`SubjectProfile` doesn't introduce any new primary key attribute; it extends the `Subject` dimension with additional attributes. There's exactly one profile per subject.
+
+In schema diagrams, these tables have **non-underlined names**.
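
The rule in this hunk (a new primary key attribute means a new dimension) can be checked mechanically. The following is a minimal sketch in plain Python, not DataJoint's API: `TableSpec`, `new_dimensions`, and `introduces_dimension` are hypothetical names used only for illustration, and each table is assumed to be described by its full primary key plus the subset of those attributes that arrive through foreign keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableSpec:
    """Hypothetical description of a table's primary key (illustration only)."""
    name: str
    primary_key: tuple[str, ...]       # every primary key attribute
    inherited: tuple[str, ...] = ()    # the ones that come from foreign keys


def new_dimensions(table: TableSpec) -> list[str]:
    """Return the primary key attributes that are not inherited via a foreign key."""
    return [attr for attr in table.primary_key if attr not in table.inherited]


def introduces_dimension(table: TableSpec) -> bool:
    """A table introduces a dimension if it adds at least one new primary key attribute."""
    return bool(new_dimensions(table))


subject = TableSpec("Subject", ("subject_id",))
session = TableSpec("Session", ("subject_id", "session_idx"), inherited=("subject_id",))
trial = TableSpec(
    "Trial",
    ("subject_id", "session_idx", "trial_idx"),
    inherited=("subject_id", "session_idx"),
)
profile = TableSpec("SubjectProfile", ("subject_id",), inherited=("subject_id",))

for table in (subject, session, trial, profile):
    print(table.name, new_dimensions(table), introduces_dimension(table))
```

Under these assumptions, `Subject`, `Session`, and `Trial` each report one new attribute, while `SubjectProfile` reports none, matching the underlined versus non-underlined distinction described in the text.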
 
 ### Computed Tables and Dimensions
 
@@ -288,15 +283,15 @@ detection.
 
 ### Dimensions and Attribute Lineage
 
-Every primary key attribute traces back to the dimension where it was first
+Every foreign key attribute traces back to the dimension where it was first
 This principle naturally leads to well-normalized schemas.
 
+## The Intrinsic Attributes Principle
+
+> **"Each entity should contain only its intrinsic attributes: properties that are inherent to the entity itself. Relationships, assignments, and events that happen over time belong in separate tables."**
+
+**Full workflow entity normalization** is achieved when:
+
+1. Each row represents a single, well-defined entity
+2. Each entity is entered once when first tracked
+3. Events that happen at later stages belong in separate tables
+
 ## Why Normalization Matters
 
 Without normalization, databases suffer from:
@@ -29,6 +39,8 @@ table structure. DataJoint takes a different approach: design tables around
 
 ### Example: Mouse Housing
 
+**Problem: Cage is not intrinsic to a mouse.** A mouse's cage can change over time. The cage assignment is an **event** that happens after the mouse is first tracked.
+
 **Denormalized (problematic):**
 
 ```python
@@ -44,7 +56,7 @@ class Mouse(dj.Manual):
     """
 ```
 
-**Normalized (correct):**
+**Partially normalized (better, but not complete):**
 
 ```python
 @schema
@@ -53,15 +65,47 @@ class Cage(dj.Manual):
     cage_id : int32
     ---
     location : varchar(50)
-    temperature : float32
     """
 
 @schema
 class Mouse(dj.Manual):
     definition = """
     mouse_id : int32
     ---
+    -> Cage # Still treats cage as static attribute
+    """
+```
+
+**Fully normalized (correct):**
+
+```python
+@schema
+class Cage(dj.Manual):
+    definition = """
+    cage_id : int32
+    ---
+    location : varchar(50)
+    """
+
+@schema
+class Mouse(dj.Manual):
+    definition = """
+    mouse_id : int32
+    ---
+    date_of_birth : date
+    sex : enum('M', 'F')
+    # Note: NO cage reference here!
+    # Cage is not intrinsic to the mouse
+    """
+
+@schema
+class CageAssignment(dj.Manual):
+    definition = """
+    -> Mouse
+    assignment_date : date
+    ---
     -> Cage
+    removal_date=null : date
     """
 
 @schema
@@ -74,20 +118,37 @@ class MouseWeight(dj.Manual):
     """
 ```
 
-This normalized design:
+This fully normalized design:
 
-- Stores cage info once (no redundancy)
-- Tracks weight history (temporal dimension)
-- Allows cage changes without data loss
+- **Intrinsic attributes only**: `Mouse` contains only attributes determined at creation (birth date, sex)
+- **Cage assignment as event**: `CageAssignment` tracks the temporal relationship between mice and cages
+- **Single entity per row**: Each mouse is entered once when first tracked
+- **History preserved**: Can track cage moves over time without data loss
 
 ## The Workflow Test
 
-Ask: "At which workflow step is this attribute determined?"
+Ask these questions to determine table structure:
 
-- If an attribute is determined at a **different step**, it belongs in a
-  **different table**
-- If an attribute **changes over time**, it needs its own table with a
-  **temporal key**
+### 1. "Is this an intrinsic attribute of the entity?"
+
+An intrinsic attribute is inherent to the entity itself and determined when the entity is first created.
+
+- **Intrinsic:** Mouse's date of birth, sex, genetic strain
+- **Not intrinsic:** Mouse's cage (assignment that changes), weight (temporal measurement)
+
+If not intrinsic → separate table for the relationship or event
+
+### 2. "At which workflow step is this attribute determined?"
+
+- If an attribute is determined at a **different step**, it belongs in a **different table**
+- If an attribute **changes over time**, it needs its own table with a **temporal key**
+
+### 3. "Is this a relationship or event?"
+
+- **Relationships** (cage assignment, group membership) → association table with temporal keys
+- **Events** (measurements, observations) → separate table with event date/time
+- **States** (approval status, processing stage) → state transition table
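
The three questions added in this hunk can be read as a small decision procedure. The sketch below is one hypothetical encoding of it in plain Python: the function name `suggest_placement`, its parameters, and the exact return strings are assumptions made for illustration and are not part of DataJoint.

```python
def suggest_placement(intrinsic: bool, changes_over_time: bool, kind: str = "attribute") -> str:
    """Map answers to the three workflow-test questions onto a table choice.

    `kind` is one of "attribute", "relationship", "event", or "state".
    """
    # Question 3: relationships, events, and states always get their own table.
    if kind == "relationship":
        return "association table with temporal keys"
    if kind == "event":
        return "separate table with event date/time"
    if kind == "state":
        return "state transition table"
    if kind != "attribute":
        raise ValueError(f"unknown kind: {kind!r}")
    # Questions 1 and 2: only intrinsic, time-invariant attributes stay with the entity.
    if intrinsic and not changes_over_time:
        return "column in the entity's own table"
    return "separate table with a temporal key"


print(suggest_placement(True, False))                       # e.g. date of birth
print(suggest_placement(False, True, kind="relationship"))  # e.g. cage assignment
print(suggest_placement(False, True, kind="event"))         # e.g. weight measurement
```

The ordering of the checks is a design choice of this sketch: it asks about relationships, events, and states first because those always leave the entity's table, whatever the other two answers are.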
 
 ## Common Patterns
 
@@ -126,7 +187,7 @@ class AnalysisParams(dj.Lookup):
 
 ### Temporal Tracking
 
-Track attributes that change over time:
+Track measurements or observations over time:
 
 ```python
 @schema
@@ -139,6 +200,34 @@ class SubjectWeight(dj.Manual):
     """
 ```
 
+### Temporal Associations
+
+Track relationships or assignments that change over time:
+
+```python
+@schema
+class GroupAssignment(dj.Manual):
+    definition = """
+    -> Subject
+    assignment_date : date
+    ---
+    -> ExperimentalGroup
+    removal_date=null : date
+    """
+
+@schema
+class HousingAssignment(dj.Manual):
+    definition = """
+    -> Animal
+    move_date : date
+    ---
+    -> Cage
+    move_reason : varchar(200)
+    """
+```
+
+**Key pattern:** The relationship itself (subject-to-group, animal-to-cage) is **not intrinsic** to either entity. It's a temporal event that happens during the workflow.
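
Storing assignments as dated rows means that "which cage was this mouse in on a given day?" becomes a lookup over history rather than a column read. The sketch below shows that lookup in plain Python over in-memory rows shaped like the `assignment_date` / `removal_date` pattern in this hunk. The `Assignment` class and `cage_on` helper are hypothetical, the sample rows are invented for illustration, and the sketch assumes the removal date is exclusive (the mouse is no longer in the cage on that day).

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Assignment:
    """One row of a hypothetical CageAssignment-style table."""
    mouse_id: int
    assignment_date: date
    cage_id: int
    removal_date: Optional[date] = None  # None means the assignment is still open


def cage_on(history: list[Assignment], mouse_id: int, day: date) -> Optional[int]:
    """Return the cage a mouse occupied on `day`, or None if it had no assignment."""
    candidates = [
        a for a in history
        if a.mouse_id == mouse_id
        and a.assignment_date <= day
        and (a.removal_date is None or day < a.removal_date)
    ]
    if not candidates:
        return None
    # If assignments overlap, prefer the most recently started one.
    return max(candidates, key=lambda a: a.assignment_date).cage_id


# Invented sample data: mouse 1 moves from cage 10 to cage 12 on 2024-03-01.
history = [
    Assignment(1, date(2024, 1, 5), cage_id=10, removal_date=date(2024, 3, 1)),
    Assignment(1, date(2024, 3, 1), cage_id=12),
]

print(cage_on(history, 1, date(2024, 2, 1)))
print(cage_on(history, 1, date(2024, 3, 1)))
print(cage_on(history, 1, date(2023, 12, 1)))
```

Because earlier rows are never overwritten, the same function answers both current and historical questions, which is the point of keeping the assignment out of the entity table.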
+
 ## Benefits in DataJoint
 
 1. **Natural from workflow thinking**: Designing around workflow steps
@@ -155,7 +244,18 @@ class SubjectWeight(dj.Manual):
 
 ## Summary
 
-- Normalize by designing around **workflow steps**
-- Each table = one entity type at one workflow step
-- Attributes belong with the step that **determines** them
-- Temporal data needs **temporal keys**
+**Core principles:**
+
+1. **Intrinsic attributes only**: Each entity contains only properties inherent to itself
+2. **One entity, one entry**: Each entity entered once when first tracked
+3. **Events separate**: Relationships, assignments, measurements that happen later belong in separate tables
+4. **Workflow steps**: Design tables around the workflow step that creates each entity
+5. **Temporal keys**: Relationships and observations that change over time need temporal keys (dates, timestamps)
+
+**Ask yourself:**
+
+- Is this attribute intrinsic to the entity? (No → separate table)
+- Does this attribute change over time? (Yes → temporal table)
+- Is this a relationship or event? (Yes → association/event table)
+
+Following these principles achieves **full workflow entity normalization** where each table represents a single, well-defined entity type entered at a specific workflow step.